Automated Prompt Engineering | by Ian Ho | Mar 2024



A mixture of reflections, literature reviews and an experiment on Automated Prompt Engineering for Large Language Models

Image generated by Author with the help of DALL-E

I spent the past few months trying to build all sorts of LLM-powered apps, and truthfully, a significant portion of that time was dedicated just to improving prompts to get my desired output from the LLM.

There have been many moments when I've run into a sort of existential void, asking myself if I might just be a glorified prompt engineer. Given the current state of interacting with LLMs, I'm still inclined to conclude 'Not Yet', and on most nights I overcome my imposter syndrome. But I won't get into that today.

But I still often wonder if, one day, the process of writing prompts could be mostly automated away. And I think the answer to this futuristic scenario hinges on uncovering the nature of prompt engineering.

Despite the countless prompt engineering playbooks out there on the internet, I still cannot decide whether prompt engineering is an art or a science.

On one hand, it feels like an art when I have to iteratively learn and edit my prompts based on what I’m observing in the outputs. Over time, you learn that some of the tiny details matter — using ‘must’ instead of ‘should’, or placing the guidelines towards the end instead of the middle of the prompt. Depending on the task, there are simply too many ways that one can express a set of instructions and guidelines, and sometimes it feels like trial and error.

On the other hand, one could argue that prompts are just hyper-parameters. In the end, the LLM really just sees your prompts as embeddings, and like all hyper-parameters, you can tune them and objectively measure their performance if you have an established set of training and testing data. I recently came across this post by Moritz Laurer, an ML Engineer at HuggingFace:

Every time you test a different prompt on your data, you become less sure if the LLM actually generalizes to unseen data… Using a separate validation split to tune the main hyperparameter of LLMs (the prompt) is just as important as train-val-test splitting for fine-tuning. The only difference is that you don’t have a training…
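The prompts-as-hyperparameters idea can be made concrete with a small sketch: pick the best prompt on a tuning split, then report its score on a held-out validation split so you know it generalizes. Everything below is illustrative — the candidate prompts, the toy data, and `call_llm` (a naive keyword matcher standing in for a real LLM call) are all assumptions, not part of any real API.

```python
import random

# The "hyperparameter" search space: candidate prompt templates.
# These templates are illustrative placeholders.
CANDIDATE_PROMPTS = [
    "Classify the sentiment of this review as positive or negative: {text}",
    "You must answer with exactly one word, positive or negative: {text}",
]

# Toy labeled examples standing in for a real evaluation dataset.
DATA = [
    ("great product", "positive"),
    ("terrible service", "negative"),
    ("loved it", "positive"),
    ("awful experience", "negative"),
    ("works well", "positive"),
    ("broke immediately", "negative"),
]

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: naive keyword matching."""
    positive_words = {"great", "loved", "well", "good"}
    return "positive" if any(w in prompt for w in positive_words) else "negative"

def accuracy(prompt_template: str, examples) -> float:
    """Fraction of examples the (stubbed) model labels correctly."""
    correct = sum(
        call_llm(prompt_template.format(text=text)) == label
        for text, label in examples
    )
    return correct / len(examples)

def select_prompt(prompts, examples, seed=0):
    """Tune the prompt on one split, report accuracy on a held-out split."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    split = len(shuffled) // 2
    tune_set, val_set = shuffled[:split], shuffled[split:]
    # Pick the prompt that scores best on the tuning split only.
    best = max(prompts, key=lambda p: accuracy(p, tune_set))
    # Measure it on data it was never tuned against.
    return best, accuracy(best, val_set)

best_prompt, val_acc = select_prompt(CANDIDATE_PROMPTS, DATA)
print(f"Selected prompt: {best_prompt!r}")
print(f"Validation accuracy: {val_acc:.2f}")
```

The key design point, echoing the quote above, is that the prompt chosen on the tuning split is scored on a separate validation split — otherwise every prompt you try quietly overfits to the data you tested it on.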
