Before going into the analysis of the results, note that you can reproduce them yourself in this Google Colab notebook. Ordinarily you wouldn't be able to recreate the numbers in these tables exactly because of the non-deterministic nature of LLMs, but for this notebook we have added a seed to the sampling so the results are the same every time. Stratified sampling has also been added so the binary categories are exactly 50/50. Be aware that there is a computational cost associated with running this notebook with your OpenAI API keys. The default number of samples is set to 2, but you can raise it to 100 if you wish to replicate the results from this blog post.
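As a concrete illustration of that setup, here is a minimal sketch of seeded, stratified sampling with pandas; the file name, column name, sample count, and seed value are hypothetical stand-ins rather than the notebook's exact code.

```python
import pandas as pd

# Hypothetical file of evaluation examples with a binary label column
# (e.g., relevant vs. irrelevant); names here are illustrative.
df = pd.read_csv("eval_examples.csv")

N_SAMPLES = 100  # the notebook defaults to 2; raise it to replicate the post
SEED = 42        # fixing the seed makes the drawn rows identical on every run

# Stratified sampling: draw an equal number of rows from each label so the
# binary categories end up exactly 50/50.
sampled = (
    df.groupby("label", group_keys=False)
      .apply(lambda g: g.sample(n=N_SAMPLES // 2, random_state=SEED))
      .reset_index(drop=True)
)
```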
Median Processing Time
For clarity, these comparisons (using 100 samples) were run on Google Colab with a standard OpenAI API account and key. While the exact latency values are unlikely to be reproduced on a different setup, the ranking of slowest and fastest models should hold.
Additionally, using explanations in your evaluations is likely to take anywhere from 3–20x longer to complete (this is independent of function calling).
For model predictive ability on relevance overall:
- Latency (fastest to slowest): GPT-3.5-instruct > GPT-3.5-turbo > GPT-4-turbo > GPT-4
For model predictive ability on hallucinations:
- Latency (fastest to slowest): GPT-3.5-instruct > GPT-3.5-turbo ~ GPT-4-turbo > GPT-4
GPT models with function calling tend to have slightly higher latency than LLMs without function calling, but take this with a grain of salt because there are a few caveats. First, the latency is extracted from HTTP headers returned by OpenAI, so depending on your account and how you make these requests, the latency values can shift, since they are calculated internally by OpenAI. Function calling trade-offs also depend on your use case. For example, without function calling you would need to specify exactly how your output should be structured by providing examples and a detailed description; if your use case is structured data extraction, however, it is simplest to work directly with the OpenAI function calling API.
Overall, LLMs with function calling perform on par with LLMs that do not leverage function calling and instead use ordinary prompt completion. Whether to use the OpenAI function calling API over prompt engineering should depend on your use case and the complexity of your outputs.
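To make the trade-off above concrete, here is a hedged sketch of a binary relevance check issued through the function calling API, using the `openai` Python client's `with_raw_response` helper to also read OpenAI's processing-time header mentioned earlier; the function name, schema, prompt text, and model name are illustrative and not the templates used in these benchmarks.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A hypothetical binary eval expressed as a function call, so the model must
# return structured arguments instead of free-form text.
tools = [
    {
        "type": "function",
        "function": {
            "name": "record_relevance",
            "description": "Record whether the retrieved document is relevant to the question.",
            "parameters": {
                "type": "object",
                "properties": {"relevant": {"type": "boolean"}},
                "required": ["relevant"],
            },
        },
    }
]

# with_raw_response exposes the HTTP response so the latency header can be
# read alongside the parsed completion.
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4-turbo",  # model name illustrative
    messages=[{"role": "user", "content": "Question: ...\nDocument: ...\nIs the document relevant?"}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "record_relevance"}},
)

completion = raw.parse()
arguments = completion.choices[0].message.tool_calls[0].function.arguments  # JSON string, e.g. '{"relevant": true}'
server_latency_ms = raw.headers.get("openai-processing-ms")  # OpenAI-reported processing time, if present
```

The prompt-engineering alternative would skip `tools` entirely, describe the required output format (with examples) in the prompt itself, and parse the free-form completion.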
GPT Model Performance Comparisons
For model predictive ability on relevance overall:
- Performance: GPT-4 ~ GPT-4-turbo ~ GPT-3.5-turbo >>> GPT-3.5-instruct
For model predictive ability on hallucinations:
- Performance: GPT-4 ~ GPT-4-turbo > GPT-3.5-turbo > GPT-3.5-instruct
Interestingly, in both use cases, using explanations does not always improve performance. More on this below.
Evaluation Metrics
If you are deciding which LLM to use for predicting relevance, you want to use either GPT-4, GPT-4-turbo, or GPT-3.5-turbo.
GPT-4-turbo identifies relevant outputs with high precision, but sacrifices recall across the 50 relevant examples; in fact, its recall is no better than a coin flip, even when using explanations.
GPT-3.5-turbo suffers from the same trade-off, with lower latency but also lower accuracy. From these results, GPT-4 has the highest F1 score (the harmonic mean of precision and recall) and the best overall performance, while running in comparable time to GPT-4-turbo.
GPT-3.5-instruct predicts everything to be relevant and is therefore not a viable LLM for predicting relevance. Interestingly, when using explanations its predictive performance improves drastically, although it still underperforms the other LLMs. Also, GPT-3.5-instruct cannot use the OpenAI function calling API and is likely to be deprecated in early 2024.
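For reference, the precision, recall, accuracy, and F1 numbers discussed here can be computed with scikit-learn once you have the ground-truth labels and the LLM's binary predictions; the label lists below are placeholders, not the benchmark data.

```python
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

# Placeholder ground-truth labels and LLM predictions (1 = relevant, 0 = irrelevant).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))
# F1 is the harmonic mean of precision and recall: 2 * P * R / (P + R)
print("f1:       ", f1_score(y_true, y_pred))
```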
If you are deciding which LLM to use for predicting hallucinations, you want to use either GPT-4, GPT-4-turbo, or GPT-3.5-turbo.
The results show GPT-4 correctly identifying hallucinated and factual outputs more often than GPT-4-turbo, by roughly 3% across precision, accuracy, recall, and F1.
While both GPT-4 and GPT-4-turbo perform slightly better than GPT-3.5-turbo (note that a higher number of samples should be used before concluding that the small margin isn't noise), it might be worth working with GPT-3.5-turbo if you are planning to use explanations.
Explanations for hallucination predictions returned more than three times faster with GPT-3.5-turbo than with either GPT-4 or GPT-4-turbo; however, recall suffered for both GPT-3.5 models compared to the GPT-4 models when predicting hallucinations.
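If you want to try the hallucination use case yourself, the sketch below shows one way a binary hallucination eval with an optional explanation can be phrased; the prompt wording and helper function are illustrative, not the exact templates behind these results.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative eval template: the model must label the answer against a reference.
EVAL_TEMPLATE = """You are checking an answer against a reference text.
Reference: {reference}
Answer: {answer}
Respond with exactly one word: "factual" if the answer is supported by the
reference, or "hallucinated" if it is not."""

# Asking for reasoning before the label is what "using explanations" refers to;
# it can improve predictions for some models at the cost of longer responses.
EXPLANATION_SUFFIX = """
First write a short explanation of your reasoning, then on a new line write
LABEL: followed by "factual" or "hallucinated"."""

def evaluate(reference: str, answer: str, with_explanation: bool = False, model: str = "gpt-4") -> str:
    prompt = EVAL_TEMPLATE.format(reference=reference, answer=answer)
    if with_explanation:
        prompt += EXPLANATION_SUFFIX
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the judge as deterministic as possible
    )
    return response.choices[0].message.content
```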
Deciding which LLM to use for your application requires a series of experiments and iterations. Similarly, benchmarking and experimentation are required when deciding whether an LLM should be used as an evaluator. These are essentially the two main methods of benchmarking LLMs: LLM model evaluation (evaluating foundation models) and LLM system evaluation through observability.