We began with a zero-shot baseline run and then repeated the experiment several times, building up the complexity of the prompt with strategies such as few-shot in-context learning. In every case we prompted the LLM to identify vulnerable code with no mention of which CWE it might be looking for (i.e., without labels).
In a zero-shot prompt, you ask the model to make a prediction with no examples or information other than the instructions. Our zero-shot template was inspired by this paper⁴ and includes a role, a code delimiter, and a request to respond in JSON format only. It also includes an instruction to “think step-by-step”. The code snippet under test is inserted in place of the `code` placeholder.
Prompt
You are a brilliant software security expert.
You will be provided with a python code delimited by triple backticks.
If it contains any CWE security vulnerabilities, write Vulnerable.
If the code does not contain any vulnerabilities, write Not Vulnerable.
Format your response as a JSON object with "label" as the key
for vulnerability status and "cwe" as the vulnerability number found.
Think about the answer step by step, and only answer with JSON.
Python code: ```code```
Answer:
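As a rough sketch of how the template can be filled and the model's answer parsed (the client setup, model name, and `classify_snippet` helper below are illustrative assumptions, not the exact code behind our results):

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def classify_snippet(prompt_template: str, code: str) -> dict:
    """Fill the zero-shot template with one code snippet, query the model,
    and parse the JSON answer it returns."""
    prompt = prompt_template.format(code=code)  # template contains a {code} placeholder
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model name; use whichever model you are evaluating
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)
```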
Results
Accuracy: 0.67
Precision: 0.60
Recall: 0.86
F1 Score: 0.71
In the next experiment, we add in-context or “few-shot” learning and include a few successful code-answer examples before asking the LLM to perform the same operation on the unseen code. These examples were constructed from the remainder of the dataset (the sampling is sketched after the list below), and care was taken to:
- Draw from different scenarios than the code snippet under test, to avoid contamination
- Draw exactly two vulnerable examples and one non-vulnerable example
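A minimal sketch of that sampling logic, assuming each dataset record carries `scenario`, `label`, and `code` fields (those names are ours, not necessarily the dataset's):

```python
import random

def build_shots(candidates: list[dict], test_scenario: str, seed: int = 0) -> list[dict]:
    """Pick two vulnerable and one non-vulnerable example, all drawn from
    scenarios other than the one under test, to avoid contamination."""
    rng = random.Random(seed)
    pool = [c for c in candidates if c["scenario"] != test_scenario]
    vulnerable = [c for c in pool if c["label"] == "Vulnerable"]
    clean = [c for c in pool if c["label"] == "Not Vulnerable"]
    return rng.sample(vulnerable, 2) + rng.sample(clean, 1)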
Prompt
You are a brilliant software security expert.
You will be provided with a python code delimited by triple backticks.
If it contains any CWE security vulnerabilities, write Vulnerable.
If the code does not contain any vulnerabilities, write Not Vulnerable.
Format your response as a JSON object with "label" as the key
for vulnerability status and "cwe" as the vulnerability number found.
Think about the answer step by step, and only answer with JSON.
Python code: ```example_0```
Answer: answer_0
Python code: ```example_1```
Answer: answer_1
Python code: ```example_2```
Answer: answer_2
Python code: ```code```
Answer:
Results
Accuracy: 0.76
Precision: 0.71
Recall: 0.81
F1 Score: 0.76
This Microsoft blog post describes an interesting technique called KNN-based few-shot example selection that can boost LLM response quality when using in-context examples. For this next experiment, instead of sampling shots at random, we calculate a similarity score between the input code and each candidate example and construct shots from the most similar candidates (still keeping the scenarios distinct). We use the ROUGE-L metric, but other metrics could be used too. The prompt template did not change from the second experiment.
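The sketch below uses the `rouge-score` package to compute ROUGE-L between the snippet under test and each candidate example; the package choice and field names are illustrative assumptions rather than the exact code we ran:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def knn_shots(code: str, candidates: list[dict], test_scenario: str, k: int = 3) -> list[dict]:
    """Rank candidate examples by ROUGE-L similarity to the input code and
    keep the top k, still excluding examples from the snippet's own scenario."""
    pool = [c for c in candidates if c["scenario"] != test_scenario]
    ranked = sorted(
        pool,
        key=lambda c: scorer.score(code, c["code"])["rougeL"].fmeasure,
        reverse=True,
    )
    return ranked[:k]
```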
Results
Accuracy: 0.73
Precision: 0.70
Recall: 0.76
F1 Score: 0.73
In this variation of the prompt, we include a request for a fixed version of the code if a CWE is found. This approach was inspired by Noever, who proposed that prompting for CWE detection and a fix together might bring about a “virtuous cycle” and force the LLM to “self-audit” or think more deeply about the steps needed to accurately identify vulnerabilities, similar to chain-of-thought prompting. We did this by constructing vulnerable in-context examples whose suggested code fixes were drawn from the non-vulnerable code samples for the same scenarios.
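As a sketch of how such an example can be assembled, pairing a vulnerable snippet with the secure code from the same scenario to serve as the suggested fix (field names are assumptions about how the dataset is organized):

```python
import json

def build_fix_example(vulnerable_sample: dict, secure_sample: dict) -> tuple[str, str]:
    """Turn a vulnerable snippet and its secure counterpart from the same
    scenario into a code/answer pair for the in-context examples."""
    assert vulnerable_sample["scenario"] == secure_sample["scenario"]
    answer = {
        "label": "Vulnerable",
        "cwe": vulnerable_sample["cwe"],   # e.g. "CWE-89"
        "fix": secure_sample["code"],      # the repaired version shown to the model
    }
    return vulnerable_sample["code"], json.dumps(answer)
```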
Prompt
You are a brilliant software security expert.
You will be provided with a python code delimited by triple backticks.
If it contains any CWE security vulnerabilities, write Vulnerable.
If the code does not contain any vulnerabilities, write Not Vulnerable.
If the code has the vulnerability, write a repaired secure version of the
code that preserves its exact functionality.
Format your response as a JSON object with "label" as the key
for vulnerability status, "cwe" as the vulnerability found,
and "fix" for the fixed code snippet.
Think about the answer step by step, and only answer with JSON.
Python code: ```example_0```
Answer: answer_0
Python code: ```example_1```
Answer: answer_1
Python code: ```example_2```
Answer: answer_2
Python code: ```code```
Answer:
Results
Accuracy: 0.80
Precision: 0.73
Recall: 0.90
F1 Score: 0.81
In addition to CWE detection, this experiment has the benefit of producing suggested fixes. We have not evaluated them for quality yet, so that is an area for future work.
On our small data sample, GPT-4’s accuracy was 67% and its F1 score was 71% without any complex prompt adaptations. Some of the prompting techniques we tested offered modest improvements, with few-shot examples and requesting a code fix standing out. Combining these techniques raised accuracy by 13 percentage points and the F1 score by 10 from baseline, bringing both metrics to 80% or above.
Results can be quite different between models, datasets, and prompts, so more investigation is needed. For example, it would be interesting to:
- Test smaller models
- Test a prompt template that includes the CWE label, to investigate the potential for combining LLMs with static analysis
- Test larger and more diverse datasets
- Evaluate the security and functionality of LLM-proposed code fixes
- Study more advanced prompting techniques, such as chains-of-thought in the in-context examples, Self-Consistency, and Self-Discover
If you would like to see the code that produced these results, run it on your own code, or adapt it for your own needs, check out the pull request in OpenAI Cookbook (currently under review).