Determinism
One of classification's key requirements is determinism: making sure the same input will always produce the same output. This clashes with the fact that LLMs, by default, generate non-deterministic outputs. The common fix is to set the LLM's temperature to 0 or top_k to 1 (depending on the platform and the architecture in use), limiting the search space to the single most likely next token. The problem is that we commonly set temperature >> 0 since it helps the LLM be more creative and generate richer, more valuable outputs; without it, the responses are often just not good enough. Setting the temperature to 0 will therefore require us to work harder at directing the LLM, using more declarative prompting to make sure it responds the way we want (with techniques like role clarification and rich context; more on these ahead). Keep in mind though that this requirement is not trivial, and it can take many prompt iterations until we find the desired format.
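A minimal sketch of such a deterministic-leaning call, assuming the OpenAI Python SDK (v1+); the model name and snippet below are placeholders, and some platforms expose top_k instead of (or in addition to) temperature:

```python
# A minimal sketch, assuming the OpenAI Python SDK; model name and snippet are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
code_snippet = "fetch('/api/items').then(r => r.json())"  # placeholder input

response = client.chat.completions.create(
    model="gpt-4o-mini",   # assumption: any chat model available on your platform
    temperature=0,         # greedy-like decoding: always pick the most likely next token
    messages=[
        {"role": "system", "content": "You are a code classification assistant."},
        {"role": "user", "content": f"Classify the following snippet as CLIENT or SERVER side:\n{code_snippet}"},
    ],
)
print(response.choices[0].message.content)
```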
Labelling is not enough, ask for a reason
Prior to the LLMs era, classification models' API was labelling: given an input, predict its class. The common ways to debug model mistakes were analysing the model itself (white box, looking at aspects like feature importance and model structure) or the classifications it generated (black box, using techniques like SHAP, adjusting the input and verifying how it affects the output). LLMs differ in that they enable free-style questioning, not limited to a specific API contract. So how can we use that for classification? The naive approach follows classic ML by asking solely for the label (such as whether a code snippet is Client or Server side). It's naive since it doesn't leverage the LLM's ability to do much more, like explaining its predictions, which lets us understand (and fix) its mistakes. Asking the LLM for the classification reason ('please classify and explain why') gives us an internal view of the LLM's decision-making process. Looking into the reasons, we may find that the LLM didn't understand the input, or maybe the classification task just wasn't clear enough. If, for example, it seems the LLM fully ignores critical code parts, we could ask it to generally describe what the code does; if the LLM correctly understands the intent (but fails to classify it) then we probably have a prompt issue, and if the LLM doesn't understand the intent then we should consider replacing the LLM. Reasoning will also enable us to easily explain the LLM's predictions to end users. Keep in mind though that without framing it with the right context, hallucinations can affect the application's credibility.
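A hedged sketch of the two prompt styles (label only versus label plus reason), with a debugging follow-up prompt for when the reasons suggest the input itself wasn't understood:

```python
# A sketch of the two prompt styles; `code_snippet` is a placeholder input.
code_snippet = "print('hello world')"

label_only_prompt = (
    "Classify the following code snippet as CLIENT or SERVER side.\n"
    f"Snippet:\n{code_snippet}"
)

label_and_reason_prompt = (
    "Classify the following code snippet as CLIENT or SERVER side, "
    "then explain why you chose that label.\n"
    f"Snippet:\n{code_snippet}"
)

# Debugging follow-up when the reasons hint the input wasn't understood:
describe_prompt = f"Generally describe what the following code does:\n{code_snippet}"
```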
Reusing the LLM's wording
A side effect of reasoning is the ability to gain a clear view of how the LLM thinks, and more specifically the wording it uses and the meaning it gives to specific terms. This is quite important given that the LLMs' main API is text based; while we assume it to be just English, LLMs have their own POV (based on their training data), which can lead to discrepancies in how some phrases are understood. Consider for example that we've decided to ask the LLM if a 'code snippet is malicious'; some LLMs will use the word malware instead of malicious to describe such cases, others may include security vulnerabilities under the malicious labelling. Both cases can result in different outputs than what we anticipated given our prompts. A simple coping technique is to define the prompt using the LLM's wording. If for example the LLM calls a malicious snippet 'malware', using that term (malware) will generate more coherent results than using our initially intended term ('malicious'). Moreover, during our research, the more we followed the LLM's wording, the fewer hallucinations we faced. On the other hand, we should remember that the LLM's wording is probably not fully aligned with our needs (like in our previous example, assuming that security vulnerabilities are malicious while we may have a different view on that). Another tradeoff for you to decide which approach benefits you most.
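A small sketch of that coping technique: harvest the term the model itself tends to use (from its earlier reasoning outputs) and build the classification prompt around it. The 'malware' term below is only an example of what such a harvested term might look like:

```python
# A sketch: build the prompt around the term the LLM itself tends to use.
code_snippet = "print('hello world')"  # placeholder input
our_term = "malicious"   # the term we started with
llm_term = "malware"     # example term harvested from the model's own reasoning outputs

prompt = (
    f"Does the following snippet contain {llm_term}? "
    "Answer YES or NO and explain why.\n"
    f"Snippet:\n{code_snippet}"
)
```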
Beware of too wide context
Iterating on prompts can end up with a super detailed classification context; trying to nail edge cases and better describe our intent, like in our previous example, not relying on the LLM's definition of 'malicious' but instead explaining how we see malicious snippets. The issue is, the longer the prompt the more likely it is to generate ambiguities, simply since we're using more terms (like in our example, asking if a snippet is malicious while stating that vulnerabilities are not malicious will confuse LLMs which think differently). Asking the LLM how to prompt for our specific case may help nail such cases (even though in our research we didn't find it beneficial; while the LLM suggested prompts which were supposed to solve such issues, they turned out to be super similar to our intuitive tries and in general didn't help). Detailed context will help the LLM better understand the intent, but may generate confusion and hallucinations as well. Another tradeoff to consider. Below are some hallucination examples from our research:
- The file appears to be malicious based on its contents. It contains several functions which draw various shapes onto the screen using PyGame. However, upon closer inspection some of these shapes appear to form an image of a bird. This could indicate that this Python script intends to perform malicious activities by disguising itself as an innocent looking application. Based on all of these factors I would recommend classifying this file as MALICIOUS. (False-Positive, assuming malicious snippets hide their intent by picturing birds on screen).
- The file appears to be benign based on initial analysis. The only potential concern I see with this code is the use of `requests` library to download an executable from an untrusted source (`XXXXX.com`). However, since the downloaded file is saved to disk with randomized name and executed using system shell command, there does not appear to be any direct risk associated with this behavior. (False-Negative, assuming a clearly malicious downloaded executable is benign because of its randomized naming).
Consistent wording
One of the most common issues we found during our LLM debug sessions was inconsistent wording. Consider for example the following prompt: 'please classify if the following file is malicious. Code is considered malicious when it actively has nefarious intent. The snippet: …'. A quick observation reveals it includes 3 different terms to describe the very same entity (file, code, snippet). Such behavior seems to highly confuse LLMs. A similar issue may appear when we try to nail LLM mistakes but fail to follow the exact wording they use (for example, if we try to fix the LLM's labelling of 'potentially malicious' by referring to it in our prompt as 'possibly malicious'). Fixing such discrepancies significantly improved our LLM classifications and in general made them more coherent.
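A small sketch of enforcing a single term: define the entity name once and reuse it across the whole template, so 'file', 'code' and 'snippet' can't get mixed by accident:

```python
# A sketch: one variable for the classified entity, reused across the whole prompt.
code_snippet = "print('hello world')"  # placeholder input
ENTITY = "snippet"                     # pick one term and stick to it

prompt = (
    f"Please classify whether the following {ENTITY} is malicious. "
    f"A {ENTITY} is considered malicious when it actively has nefarious intent.\n"
    f"The {ENTITY}:\n{code_snippet}"
)
```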
Input pre-processing
Previously we’ve discussed the need of making LLMs responses deterministic, to make sure the same input will always generate the same output. But what about similar inputs? How to make sure they will generate similar outputs as well? Moreover, given that many LLMs are input sensitive, even minor transformations (such as blank lines addition) can highly affect the output. To be fair, this is a known issue in the ML world; image applications for example commonly use data augmentation techniques (such as flip and rotations) to reduce overfitting by making the model less sensitive to small variations. Similar augmentations exist on the textual domain as well (using techniques such as synonyms replacement and paragraphs shuffling). The issue is it doesn’t fit our case where the models (instructions tuned LLMs) are already fine-tuned. Another, more relevant, classic solution is to pre-process the inputs, to try to make it more coherent. Relevant examples are redundant characters (such as blank lines) removal and text normalisation (such as making sure it’s all UTF-8). While it may solve some issues, the down side is the fact such approaches are not scalable (strip for example will handle blank lines at the edges, but what about within paragraph redundant blank lines?). Another matter of tradeoff.
Response formatting
One of the simplest and yet most important prompting techniques is response formatting: asking the LLM to respond in a valid, structured format (such as a JSON of 'classification': …, 'reason': …). The clear motivation is the ability to treat the LLM's outputs as yet another API. Well-formatted responses remove the need for fancy post-processing and simplify the LLM inference pipeline. For some LLMs, like ChatGPT, it will be as simple as directly asking for it. For other, lighter LLMs such as Refact, it will be more challenging. Two workarounds we found were to split the request into two phases (like 'describe what the following snippet does' and only then 'given the snippet description, classify if it's server side'), or just to ask the LLM to respond in another, more simplified format (like 'please respond with the structure of "<if server> - <why>"'). Finally, a super useful hack was to append the desired output prefix to the prompt suffix (on StarChat for example, add the statement '{"classification":' to the '<|assistant|>' prompt suffix), directing the LLM to respond with our desired format.
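A sketch of that last hack for a StarChat-style chat template (the special tokens below are assumptions; verify them against your model's documented template):

```python
# A sketch of the output-prefix hack; the <|system|>/<|user|>/<|assistant|> tokens
# are the assumed StarChat-style template and may differ for your model.
code_snippet = "print('hello world')"  # placeholder input
system_msg = "You are a security specialist classifying code snippets."
user_msg = f"Classify whether the following snippet is malicious and explain why.\n{code_snippet}"

prompt = (
    f"<|system|>\n{system_msg}<|end|>\n"
    f"<|user|>\n{user_msg}<|end|>\n"
    "<|assistant|>\n{\"classification\":"   # seed the answer with the desired JSON prefix
)
# The model continues from the seeded prefix, so prepend it back before parsing:
# full_json = '{"classification":' + generated_text
```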
Clear context structure
During our research we found it beneficial to generate prompts with a clear context structure (using text styling formats such as bullets, paragraphs and numbering). It was important both for the LLM, to more correctly understand our intent, and for us, to easily debug its mistakes. Hallucinations due to typos, for example, were easily detected once we had well-structured prompts. Two techniques we commonly used were to replace super long context declarations with bullets (though in some cases it generated another issue, attention fading) and to clearly mark the prompt's input parts (for example, framing the source code to analyse with an explicit 'source_code:' marker).
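A sketch of such a structured prompt, with short bullet rules and an explicit marker around the input (the rule wording is illustrative):

```python
# A sketch of a clearly structured prompt: bullet rules plus an explicit input marker.
code_snippet = "print('hello world')"  # placeholder input

prompt = (
    "Classify the source code below as CLIENT or SERVER side.\n"
    "Rules:\n"
    "- Respond with a single JSON object: {\"classification\": ..., \"reason\": ...}\n"
    "- CLIENT means the code runs in the browser or on an end-user device.\n"
    "- SERVER means the code runs on a backend service.\n"
    "source_code:\n"
    f"'''\n{code_snippet}\n'''"
)
```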
Attention fading
Like humans, LLMs pay more attention to the edges and tend to forget facts seen in the middle (GPT-4 for example seems to exhibit such behavior, especially for longer inputs). We faced it during our prompt iteration cycles when we noticed that the LLM was biased towards declarations that were at the edges, less-favouring the class whose instructions were in the middle. Moreover, each re-ordering of the prompt's labelling instructions generated a different classification. Our coping strategy included 2 parts. First, try in general to reduce the prompt size, assuming the longer it is the less capable the LLM is of correctly handling our instructions (this meant prioritising which context rules to add, keeping the more general instructions and assuming the overly specific ones would be ignored anyway given a too long prompt). The second solution was to place the class-of-interest instructions at the edges. The motivation was to leverage the fact that LLMs are biased towards the prompt edges, together with the fact that almost every classification problem in the world has a class of interest (which we prefer not to miss). For spam-ham for example it can be the spam class, depending on the business case.
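A sketch of that second part: placing (and repeating) the class-of-interest rule at both prompt edges. The rule texts are illustrative:

```python
# A sketch: keep the class-of-interest rule at both edges of the prompt.
code_snippet = "print('hello world')"  # placeholder input
class_of_interest_rule = "Label the snippet MALICIOUS when it actively has nefarious intent."
other_rule = "Label the snippet BENIGN when it only performs its declared, harmless task."

prompt = (
    f"{class_of_interest_rule}\n"            # edge: start with the class we must not miss
    f"{other_rule}\n"
    f"Snippet:\n{code_snippet}\n"
    f"Remember: {class_of_interest_rule}"    # edge: repeat it at the very end
)
```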
Impersonation
One of the most trivial and common instruction-sharpening techniques: adding to the prompt's system part the role the LLM should play while answering our query, enabling us to control the LLM's bias and direct it towards our needs (like when asking ChatGPT to answer with Shakespeare-style responses). In our previous example ('is the following code malicious'), declaring the LLM a 'security specialist' generated different results than declaring it a 'coding expert'; the 'security specialist' made the LLM biased towards security issues, finding vulnerabilities in almost every piece of code. Interestingly, we could increase the class bias by adding the same declaration multiple times (placing it, for example, in the user part as well). The more role clarifications we added, the more biased the LLM was towards that class.
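A sketch of such role conditioning, assuming a chat-style messages API; the two role texts are illustrative:

```python
# A sketch of role conditioning: the same question under two different system roles.
roles = {
    "security": "You are a security specialist reviewing code for threats.",
    "coding": "You are a coding expert reviewing code for engineering quality.",
}

def build_messages(role_key: str, snippet: str) -> list:
    """Build a chat-style message list conditioned on the chosen role."""
    return [
        {"role": "system", "content": roles[role_key]},
        {"role": "user", "content": f"Is the following code malicious? Answer YES or NO and explain why.\n{snippet}"},
    ]
```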
Ensemble it
One of the key benefits of role clarification is the ability to easily generate multiple LLM versions with different conditioning and therefore different classification performance. Given the sub-classifiers' classifications, we can aggregate them into a merged classification, enabling us to increase precision (using a majority vote) or recall (alerting on any sub-classifier alert). Tree Of Thoughts is a prompting technique with a similar idea: asking the LLM to answer as if it includes a group of experts with different POVs. While promising, we found Open Source LLMs struggle to benefit from such more complicated prompt conditioning. Ensembling enabled us to implicitly generate similar results even for lightweight LLMs; deliberately making the LLM respond with different POVs and then merging them into a single classification (moreover, we could further mimic the Tree Of Thoughts approach by asking the LLM to generate a merged classification given the sub-classifications, instead of relying on simpler aggregation functions).
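A sketch of the two aggregation strategies over the role-conditioned sub-classifiers from the previous section; the labels are illustrative:

```python
from collections import Counter

# A sketch of merging sub-classifier labels (e.g. from the role-conditioned runs above).
def majority_vote(labels: list) -> str:
    """Precision-oriented merge: the most common label wins."""
    return Counter(labels).most_common(1)[0][0]

def any_alert(labels: list, alert_label: str = "MALICIOUS") -> str:
    """Recall-oriented merge: alert if any sub-classifier alerts."""
    return alert_label if alert_label in labels else "BENIGN"

labels = ["MALICIOUS", "BENIGN", "BENIGN"]   # illustrative sub-classifier outputs
print(majority_vote(labels))   # -> BENIGN
print(any_alert(labels))       # -> MALICIOUS
```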
Time (and attention) is all you need
The last hint is maybe the most important one: smartly manage your prompting efforts. LLMs are a new technology, with new innovations being published almost daily. While it's fascinating to watch, the downside is that generating a working classification pipeline using LLMs can easily become a never ending process, and we could spend all our days trying to improve our prompts. Keep in mind that LLMs are the real innovation and prompting is basically just the API. If you spend too much time prompting, you may find that replacing the LLM with a newer version is more beneficial. Pay attention to the more meaningful parts and try not to drift into never ending efforts to find the best prompt in town. And may the best Prompt (and LLM) be with you 🙂.