that “groups are remarkably intelligent, and are often smarter than the smartest people in them.” He was writing about decision-making, but the same principle applies to classification: get enough people to describe the same phenomenon and a taxonomy starts to emerge, even if no two people phrase it the same way. The challenge is extracting that signal from the noise.
I had several thousand rows of free-text data and needed to do exactly that. Each row was a short natural-language annotation explaining why an automated security finding was irrelevant, which functions to use for a fix, or what coding practices to follow. One person wrote “this is test code, not deployed anywhere.” Another wrote “non-production environment, safe to ignore.” A third wrote “only runs in CI/CD pipeline during integration tests.” All three meant the same thing, but no two shared more than a word or two.
The taxonomy was in there. I just needed the right tool to extract it. Traditional clustering and keyword matching couldn’t handle the paraphrase variation, so I tried something I hadn’t seen discussed much: using a locally hosted LLM as a zero-shot classifier. This blog post explores how the approach works, how it performed, and some practical tips for running these systems yourself.
Why traditional clustering struggles with short free-text
Standard unsupervised clustering works by finding mathematical proximity in some feature space. For long documents, this is usually fine. Enough signal exists in word frequencies or embedding vectors to form coherent groups. But short, semantically dense text breaks these assumptions in a few specific ways.
Embedding similarity conflates different meanings. “This key is only used in development” and “This API key is hardcoded for convenience” produce similar embeddings because the vocabulary overlaps. But one is about a non-production environment and the other is about an intentional security tradeoff. K-means or DBSCAN can’t distinguish them because the vectors are too close.
Topic models surface words, not concepts. Latent Dirichlet Allocation (LDA) and its variants find word co-occurrence patterns. When your corpus consists of one-sentence annotations, the word co-occurrence signal is too sparse to form meaningful topics. You get clusters defined by “test” or “code” or “security” rather than coherent themes.
Regex and keyword matching can’t handle paraphrase variation. You could write rules to catch “test code” and “non-production,” but you’d miss “only used during CI,” “never deployed,” “development-only fixture,” and dozens of other phrasings that all express the same underlying idea.
The common thread: these methods operate on surface features (tokens, vectors, patterns) rather than semantic meaning. For classification tasks where meaning matters more than vocabulary, you need something that understands language.
LLMs as zero-shot classifiers
The key insight is simple: instead of asking an algorithm to discover clusters, define your candidate categories based on domain knowledge and ask a language model to classify each entry.
This works because LLMs process semantic meaning, not just token patterns. “This key is only used in development” and “Non-production environment, safe to ignore” contain almost no overlapping words, but a language model understands they express the same idea. This isn’t just intuition. Chae and Davidson (2025) compared 10 models across zero-shot, few-shot, and fine-tuned training regimes and found that large LLMs in zero-shot mode performed competitively with fine-tuned BERT on stance detection tasks. Wang et al. (2023) found LLMs outperformed state-of-the-art classification methods on three of four benchmark datasets using zero-shot prompting alone, no labeled training data required.
The setup has three components:
- Candidate categories. A list of mutually exclusive categories defined from domain knowledge. In my case, I started with about 10 expected themes (test code, input validation, framework protections, non-production environments, etc.) and expanded to 20 candidates after reviewing a sample.
- A classification prompt. Structured to return a category label and a brief reason. Low temperature (0.1) for consistency. Short max output (100 tokens) since we only need a label, not an essay.
- A local LLM. I used Ollama to run models locally. No API costs, no data leaving my machine, and fast enough for thousands of classifications.
Here’s the core of the classification prompt:
CLASSIFICATION_PROMPT = """
Classify this text into one of these themes:
{themes}
Text:
"{content}"
Respond with ONLY the theme number and name, and a brief reason.
Format: THEME_NUMBER. THEME_NAME | Reason
Classification:
"""
And the Ollama call:
import ollama

response = ollama.generate(
    model="gemma2",
    prompt=prompt,
    options={
        "temperature": 0.1,  # Low temp for consistent classification
        "num_predict": 100,  # Short response, we just need a label
    },
)
Two things to note. First, the temperature setting matters. At 0.7 or higher, the same input can produce different classifications across runs. At 0.1, the model is nearly deterministic, which keeps labels stable from run to run. Second, limiting num_predict keeps the model from generating explanations you don’t need, which speeds up throughput significantly.
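The `Format:` line in the prompt makes the response easy to parse mechanically. Here’s a minimal sketch of that parsing step; the function name and the `unclassified` fallback are my own additions, not part of the original pipeline:

```python
import re

def parse_classification(raw: str) -> tuple[str, str]:
    """Parse 'THEME_NUMBER. THEME_NAME | Reason' into (theme, reason).

    Falls back to ('unclassified', raw text) when the model strays
    from the requested format, so malformed rows can be reviewed later.
    """
    match = re.match(r"\s*(\d+)\.\s*([^|]+?)\s*\|\s*(.*)", raw.strip())
    if not match:
        return "unclassified", raw.strip()
    number, name, reason = match.groups()
    return f"{number}. {name}", reason
```

Keeping a fallback bucket instead of raising an error matters at this scale: with thousands of entries, a handful of off-format responses is almost guaranteed.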
Building the pipeline
The full pipeline has three steps: preprocess, classify, analyze.
Preprocessing strips content that adds tokens without adding classification signal. URLs, boilerplate phrases (“For more information, see…”), and formatting artifacts all get removed. Common terms get normalized (“false positive” becomes “FP,” “production” becomes “prod”) to reduce token variation. Deduplication by content hash removes exact repeats. This step reduced my token budget by roughly 30% and made classification more consistent.
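A minimal sketch of that preprocessing step. The replacement table and boilerplate patterns here are illustrative stand-ins for the fuller lists a real pipeline would carry:

```python
import hashlib
import re

# Illustrative normalization table and boilerplate patterns;
# the real lists would be longer and domain-specific.
REPLACEMENTS = {"false positive": "FP", "production": "prod"}
BOILERPLATE = [r"For more information, see.*"]

def preprocess(entries: list[str]) -> list[str]:
    seen, cleaned = set(), []
    for text in entries:
        text = re.sub(r"https?://\S+", "", text)       # strip URLs
        for pattern in BOILERPLATE:                    # strip boilerplate phrases
            text = re.sub(pattern, "", text, flags=re.IGNORECASE)
        for old, new in REPLACEMENTS.items():          # normalize common terms
            text = text.replace(old, new)
        text = " ".join(text.split())                  # collapse whitespace
        digest = hashlib.sha256(text.lower().encode()).hexdigest()
        if text and digest not in seen:                # dedupe exact repeats
            seen.add(digest)
            cleaned.append(text)
    return cleaned
```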
Classification runs each entry through the LLM with the candidate categories. For ~7,000 entries, this took about 45 minutes on a MacBook Pro using Gemma 2 (9B parameters). I also tested Llama 3.2 (3B), which was faster but slightly less precise on edge cases where two categories were close. Gemma 2 handled ambiguous entries with noticeably better judgment.
One practical concern: long runs can fail partway through. The pipeline saves checkpoints every 100 classifications, so you can resume from where you left off.
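One simple way to get that resumability is an append-only JSONL log. This sketch checkpoints after every entry rather than every 100, but the resume logic is the same; the function names and file layout are hypothetical, not the pipeline’s actual code:

```python
import json
from pathlib import Path

CHECKPOINT = Path("classifications.jsonl")

def load_done() -> set[str]:
    """IDs already classified in a previous run."""
    if not CHECKPOINT.exists():
        return set()
    with CHECKPOINT.open() as f:
        return {json.loads(line)["id"] for line in f}

def classify_all(entries: dict[str, str], classify) -> None:
    """Append one JSON line per result; a resumed run skips finished IDs."""
    done = load_done()
    with CHECKPOINT.open("a") as f:
        for entry_id, text in entries.items():
            if entry_id in done:
                continue
            label = classify(text)
            f.write(json.dumps({"id": entry_id, "label": label}) + "\n")
            f.flush()  # persist even if the run dies mid-batch
```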
Analysis aggregates the results and generates a distribution chart. Here’s what the output looked like:
The chart tells a clear story. Over a quarter of all entries described code that only runs in non-production environments. Another 21.9% described cases where a security framework already handles the risk. These two categories alone account for half the dataset, which is the kind of insight that’s hard to extract from unstructured text any other way.
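The aggregation behind a distribution like that is a few lines with collections.Counter. A sketch (the function name is my own):

```python
from collections import Counter

def distribution(labels: list[str]) -> list[tuple[str, float]]:
    """Per-theme share of all classified entries, largest first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(theme, round(100 * n / total, 1))
            for theme, n in counts.most_common()]
```

Feeding the parsed theme labels through this gives the percentages directly, ready for whatever charting library you prefer.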
When this approach is not the right fit
This technique works best in a specific niche: medium-scale datasets (hundreds to tens of thousands of entries), semantically complex text, and situations where you have enough domain knowledge to define candidate categories but no labeled training data.
It’s not the right tool when:
- your categories are keyword-defined (just use regex),
- you have labeled training data (train a supervised classifier; it’ll be faster and cheaper),
- you need sub-second latency at scale (use embeddings and a nearest-neighbor lookup),
- or you genuinely don’t know what categories exist. In that case, run exploratory topic modeling first to develop intuition, then switch to LLM classification once you can define categories.
The other constraint is throughput. Even on a fast machine, at a few hundred milliseconds per entry, 7,000 entries takes the better part of an hour. For datasets above 100,000 entries, you’ll want an API-hosted model or a batching strategy.
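If you go the hosted route, the simplest batching strategy is a thread pool, since each classification is an independent request. A sketch, where `classify_one` stands in for whatever single-entry call you already have (an API request, or an Ollama call if your server is configured for parallel requests):

```python
from concurrent.futures import ThreadPoolExecutor

def classify_batch(entries: list[str], classify_one, workers: int = 8) -> list[str]:
    """Fan independent classification calls out across a thread pool.

    pool.map preserves input order, so results line up with entries.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(classify_one, entries))
```

Threads (rather than processes) are the right fit here because each call spends its time waiting on network or model I/O, not on Python-side computation.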
Other applications worth trying
The pipeline generalizes to any problem where you have unstructured text and need structured categories.
Customer feedback. NPS responses, support tickets, and survey open-ends all suffer from the same problem: varied phrasing for a finite set of underlying themes. “Your app crashes every time I open settings” and “Settings page is broken on iOS” are the same category, but keyword matching won’t catch that.
Bug report triage. Free-text bug descriptions can be auto-categorized by component, root cause, or severity. This is especially useful when the person filing the bug doesn’t know which component is responsible.
Code intent classification. This is one I haven’t tried yet but find compelling: classifying code snippets, Semgrep rules, or configuration rules by purpose (authentication, data access, error handling, logging). The same technique applies. Define the categories, write a classification prompt, run the corpus through a local model.
Getting started
The pipeline is straightforward: define your categories, write a classification prompt, and run your data through a local model.
The hardest part isn’t the code. It’s defining categories that are mutually exclusive and collectively exhaustive. My advice: start with a sample of 100 entries, classify them manually, notice which categories you keep reaching for, and use those as your candidate list. Then let the LLM scale the pattern.
I used this technique as part of a larger analysis on how security teams remediate vulnerabilities. The classification results helped surface which types of security context are most common across organizations, and the chart above is one of the outputs from that work. If you’re interested in the security angle, the full report is available at that link.