Your taxonomy was accurate when you built it. It was probably still accurate a year later. But your content doesn't stand still: new concepts emerge, terminology shifts, and subject matter experts start using language that your controlled vocabulary was never updated to reflect. This phenomenon, known as taxonomy drift, doesn't announce itself. It's one of the most common reasons enterprise content classification breaks down quietly, long before anyone notices.
There's a way to find it. It involves an LLM, a well-structured prompt, and about an hour of your time to prove the concept. The production version is more involved, but this DIY approach is an instructive place to start.
At a recent Squirro webinar on enterprise content classification, Panos Mitsias, Squirro's Semantic Graph Solution Specialist and one of the architects behind the company's taxonomy-LLM integrations, walked through exactly this workflow. What follows is the practical version of what he covered, including where the DIY approach hits its limits and what a production-grade automated content classification system looks like.
The Content Classification Gap: What You're Solving For
Your taxonomy describes your domain. Your content is produced by people working in that domain. Over time, those two things drift apart, quietly, continuously, and – in organizations managing vast amounts of content – faster than any manual review process can catch.
Candidate concept extraction answers one specific question: which terms are living in your content right now that aren't yet in your taxonomy? That gap is where the content classification process breaks down, where search begins to return the wrong results, and where the vocabulary your organization actually uses stops matching the vocabulary your systems are built around.
Traditionally, a taxonomist finds that gap by reading. Document by document, highlighter in hand, flagging terms that don't exist in the controlled vocabulary. It works. It's also, frankly, not a great use of anyone's expertise, and at enterprise scale, it simply doesn't keep up.
The LLM-based content classification workflow does the reading for you, and in doing so changes the taxonomy management role for the better. You give it your taxonomy, a set of documents, and clear instructions. It surfaces the gaps. You decide what to do with them. The rest of this article walks through how to do it in four steps, and where each one starts to break down at scale.
Step 1: Scope Your Taxonomy Input
Your first decision is which part of the taxonomy to give the LLM. If you feed in a large enterprise taxonomy whole, it will consume most of your available context window before a single document has been processed. That's wasteful at small scale and completely unworkable at large scale.
Pick the branch of your taxonomy that's relevant to the content you're analyzing. If you're working with a corpus of financial regulatory documents, the finance branch is what matters. Leave the product catalog out of it.
For each concept you do include, give the model as much semantic context as you reasonably can: preferred labels, synonyms, definitions. The model needs to understand what a concept actually means – not just what it's called – to make a reliable judgment about whether something in the content is genuinely new or just a variant of something that already exists under a different label. Thin semantic data produces thin results.
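To make the scoping step concrete, here is a minimal Python sketch. It assumes the taxonomy is already loaded into a simple in-memory tree; the `Concept` class, the `find_branch` helper, and the toy data are all hypothetical names invented for illustration, not part of any particular taxonomy tool.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concept:
    """One taxonomy concept plus the semantic context the model needs."""
    label: str
    synonyms: list[str] = field(default_factory=list)
    definition: str = ""
    children: list["Concept"] = field(default_factory=list)


def find_branch(root: Concept, label: str) -> Optional[Concept]:
    """Depth-first search for the concept whose preferred label matches."""
    if root.label.lower() == label.lower():
        return root
    for child in root.children:
        found = find_branch(child, label)
        if found is not None:
            return found
    return None


# Hypothetical toy taxonomy; real data would come from your taxonomy tool.
taxonomy = Concept("Root", children=[
    Concept("Finance", definition="Money management and regulation.", children=[
        Concept("Capital Requirements", synonyms=["capital adequacy"]),
    ]),
    Concept("Product Catalog", children=[Concept("Accessories")]),
])

# Only the finance branch goes to the model; the product catalog stays out.
finance = find_branch(taxonomy, "finance")
```

Everything under `finance` travels with its synonyms and definition, which is the semantic context the model needs to tell a new concept from a relabeled one.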
Step 2: Serialize Your Taxonomy for the LLM
Once you've decided what to include, you need to represent it in a format the model can work with, like JSON, Turtle, or Markdown. In our experience, Markdown works best.
Markdown makes it straightforward to strip out the URIs and structural overhead that other formats tend to carry, expressing the hierarchy cleanly through indentation.
- Agriculture and Food Security [synonyms: "...", definitions: "..."]
- Education
- Economic Growth [synonyms: "...", definitions: "..."]
Clean, readable, token-efficient. At small scale, the format choice barely matters. At scale – with thousands of documents and tens of thousands of concepts – it matters more than most people expect.
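A serializer for this kind of Markdown list can be quite short. The sketch below assumes each concept is a plain dictionary with hypothetical keys `label`, `synonyms`, `definition`, and `children`; the bracket layout and the semicolon separator are one possible convention, not a fixed standard.

```python
def to_markdown(concept: dict, depth: int = 0) -> str:
    """Render a concept and its descendants as an indented Markdown list."""
    details = []
    if concept.get("synonyms"):
        quoted = ", ".join(f'"{s}"' for s in concept["synonyms"])
        details.append(f"synonyms: {quoted}")
    if concept.get("definition"):
        definition = concept["definition"]
        details.append(f'definition: "{definition}"')
    suffix = (" [" + "; ".join(details) + "]") if details else ""

    # Two spaces of indentation per level express the hierarchy.
    lines = ["  " * depth + f"- {concept['label']}{suffix}"]
    for child in concept.get("children", []):
        lines.append(to_markdown(child, depth + 1))
    return "\n".join(lines)


# Hypothetical example branch.
branch = {
    "label": "Economic Growth",
    "synonyms": ["economic development"],
    "definition": "Increase in output over time.",
    "children": [{"label": "Trade Policy"}],
}
print(to_markdown(branch))
# - Economic Growth [synonyms: "economic development"; definition: "Increase in output over time."]
#   - Trade Policy
```

Note that no URIs or namespace prefixes survive the serialization, which is where most of the token savings come from.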
Step 3: Structure Your Content Input
You've decided what taxonomy input to use. Now decide how to handle the content.
For a small test with a handful of documents and a focused domain, feed documents whole and see what comes back. But at scale, a long document fed in full can push the taxonomy section out of the model's effective attention range, or exceed the context window entirely. Chunking by page or by meaningful section keeps each call focused. Batching – processing groups of documents together rather than one at a time – reduces API overhead and brings the cost per document down to something sustainable.
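As a rough illustration of chunking and batching, the sketch below splits text on blank lines, packs paragraphs into chunks under a character budget, and groups chunks into batches. The function names and the 4,000-character default are assumptions chosen for the example; a production system would more likely budget in tokens and split on real document structure.

```python
import re
from typing import Iterator


def chunk_by_section(text: str, max_chars: int = 4000) -> list[str]:
    """Split on blank lines, then pack paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk before it overflows
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each batch would then share one copy of the taxonomy and instructions in a single call, which is what brings the per-document overhead down.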
These decisions sit at the heart of any enterprise taxonomy and ontology management strategy. At small scale they're optional. At scale, they determine whether the content classification workflow is viable at all.
Step 4: Build an LLM Prompt That Works
This is where most DIY attempts fall apart. The model has your taxonomy and your content. It needs to know precisely what to do with them, and a vague instruction won't get you there.
A prompt that works does four things:
- Gives the model a persona. "You are an expert taxonomist and content annotator" sounds a bit odd, but it consistently improves output quality.
- Explains the taxonomy structure. Tell the model how the taxonomy is organized and ask it to study it before doing anything else.
- Defines the task exactly. Find concepts present in the content but absent from the taxonomy, suggest a parent concept from the existing hierarchy for each one, and generate a definition grounded in how the term is actually being used in the document – not a generic dictionary entry.
- Specifies the output format. JSON works well, because it's machine-readable and can feed directly into a metadata management system or taxonomy governance workflow downstream.
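The four elements above could be assembled into a template along these lines. This is a sketch, not the prompt used in the webinar: the exact wording, the section headings, and the JSON keys `candidate`, `suggested_parent`, and `definition` are assumptions made for illustration.

```python
import json

PROMPT_TEMPLATE = """\
You are an expert taxonomist and content annotator.

The taxonomy below is a Markdown list. Indentation expresses parent-child
relationships; square brackets hold synonyms and definitions. Study it before
doing anything else.

## Taxonomy
{taxonomy}

## Task
1. Find concepts present in the content but absent from the taxonomy.
2. For each one, suggest a parent concept from the existing hierarchy.
3. Write a definition grounded in how the term is used in the content, not a
   generic dictionary entry.

## Output format
Return only a JSON array of objects with the keys
"candidate", "suggested_parent", and "definition".

## Content
{content}
"""

REQUIRED_KEYS = {"candidate", "suggested_parent", "definition"}


def build_prompt(taxonomy_md: str, content: str) -> str:
    """Fill the template with the serialized taxonomy and one content chunk."""
    return PROMPT_TEMPLATE.format(taxonomy=taxonomy_md, content=content)


def parse_candidates(raw: str) -> list[dict]:
    """Parse the model's JSON reply, keeping only well-formed candidate objects."""
    data = json.loads(raw)
    return [
        item for item in data
        if isinstance(item, dict) and REQUIRED_KEYS.issubset(item)
    ]
```

Checking the keys on the way in is a small guard: anything the model returns in the wrong shape is dropped before it reaches a downstream governance workflow.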
Why Separate Generation from Validation
Don't ask the model to generate candidates and validate them in the same step. This is the single most common mistake in DIY content classification, and it's worth taking seriously. LLMs are not reliable self-editors. When you ask them to be creative and discerning at the same time, the quality of both drops noticeably. Generate first. Validate in a separate pass. The output quality difference is significant, and the extra step is always worth it.
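In code, the separation amounts to two calls with two different prompts. The sketch below assumes a hypothetical `llm` callable that takes a prompt string and returns the model's text; swap in whichever client you actually use. The prompt wording and the `keep` flag are also illustrative assumptions.

```python
import json
from typing import Callable

# Hypothetical interface: takes a prompt, returns the model's raw text reply.
LLM = Callable[[str], str]


def generate_candidates(llm: LLM, taxonomy_md: str, content: str) -> list[dict]:
    """Pass 1: ask only for candidates, with no filtering instructions."""
    prompt = (
        "You are an expert taxonomist. List concepts in the content that are "
        "missing from the taxonomy. Return a JSON array of objects with the "
        'keys "candidate" and "definition".\n\n'
        f"Taxonomy:\n{taxonomy_md}\n\nContent:\n{content}"
    )
    return json.loads(llm(prompt))


def validate_candidates(llm: LLM, taxonomy_md: str, candidates: list[dict]) -> list[dict]:
    """Pass 2: a separate call that only judges the candidates from pass 1."""
    prompt = (
        "You are a strict taxonomy reviewer. For each candidate, decide whether "
        "it is genuinely new relative to the taxonomy. Return a JSON array of "
        'objects with the keys "candidate" and "keep" (true or false).\n\n'
        f"Taxonomy:\n{taxonomy_md}\n\nCandidates:\n{json.dumps(candidates)}"
    )
    verdicts = json.loads(llm(prompt))
    accepted = {v["candidate"] for v in verdicts if v.get("keep")}
    return [c for c in candidates if c["candidate"] in accepted]
```

Because the validator never sees the source document, only the taxonomy and the candidate list, it has no incentive to defend what the generator produced.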
Where DIY Content Classification Hits Its Limits
Can you try this at home? Yes, but with caveats. You can run this content classification workflow in ChatGPT on a section of your taxonomy and a few documents. The results will tell you something real. For a focused proof of concept, or an audit of specific taxonomy gaps in a content domain, it's genuinely worth doing.
But as you scale, two things tend to break.
Token cost. Every concept, every page, every prompt instruction consumes tokens, and tokens cost money. At enterprise scale – a large taxonomy, a corpus in the millions – a naive implementation becomes expensive fast. Efficient batching and careful scope management are not optional refinements at this scale. They determine whether the system runs at all.
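A back-of-the-envelope calculation shows why batching matters here. The sketch below counts input tokens only, and every number in it is a made-up placeholder: plug in your own taxonomy size, document length, and your provider's actual pricing.

```python
import math


def estimate_input_tokens(
    taxonomy_tokens: int,
    instruction_tokens: int,
    doc_tokens: int,
    n_docs: int,
    docs_per_call: int,
) -> int:
    """Total input tokens when taxonomy and instructions are re-sent on every call."""
    calls = math.ceil(n_docs / docs_per_call)
    overhead = calls * (taxonomy_tokens + instruction_tokens)
    return overhead + n_docs * doc_tokens


# Placeholder figures: a 5,000-token taxonomy branch, 500 tokens of
# instructions, 2,000 tokens per document, one million documents.
one_per_call = estimate_input_tokens(5_000, 500, 2_000, 1_000_000, docs_per_call=1)
ten_per_call = estimate_input_tokens(5_000, 500, 2_000, 1_000_000, docs_per_call=10)

print(one_per_call)  # 7500000000
print(ten_per_call)  # 2550000000
```

With these placeholder figures, sending ten documents per call cuts input volume by roughly two thirds, because the taxonomy and instructions are paid for a tenth as often. Output tokens, retries, and the validation pass would add to both totals.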
Over-generation. Given room to run, LLMs extract more candidates than you want – including terms that are irrelevant, too granular, or already covered by existing concepts with different labels. A structured validation step controls this, but building and maintaining that pipeline is real engineering work, not a prompt tweak.
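Some over-generation can be caught cheaply before any LLM validation pass, by checking candidates against the labels and synonyms you already have. The sketch below is a deliberately crude illustration: the `normalize` and `split_known` helpers are hypothetical, and the plural handling is a naive heuristic that will not catch true semantic duplicates.

```python
import re


def normalize(term: str) -> str:
    """Lowercase, collapse punctuation to spaces, and strip a simple plural 's'."""
    term = re.sub(r"[^a-z0-9]+", " ", term.lower()).strip()
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        term = term[:-1]  # naive plural stripping, applied to both sides
    return term


def split_known(candidates: list[str], known_labels: list[str]) -> tuple[list[str], list[str]]:
    """Separate candidates into (possibly new, already covered by a label or synonym)."""
    known = {normalize(label) for label in known_labels}
    new: list[str] = []
    covered: list[str] = []
    for cand in candidates:
        (covered if normalize(cand) in known else new).append(cand)
    return new, covered


# Hypothetical example: existing labels and synonyms vs. model output.
new, covered = split_known(
    ["Green Bonds", "capital-adequacy", "Stablecoins"],
    ["green bond", "Capital Adequacy"],
)
print(new)      # ['Stablecoins']
print(covered)  # ['Green Bonds', 'capital-adequacy']
```

A filter like this only removes surface variants. Terms that are irrelevant, too granular, or covered by a concept with an entirely different label still need the separate validation pass.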
This is the gap the Squirro classifier is built to close. It enables the same fundamental content classification workflow – taxonomy in, content in, candidates out, taxonomist reviews – but with the batching, token optimization, validation layer, and integration with enterprise metadata management systems like Graphite already built in. The taxonomist gets the list; the infrastructure that produced it isn't their problem.
See Automated Content Classification in Action
In the webinar, Panos covers the full workflow: prompt structures, validation architecture, and the engineering decisions that separate a proof of concept from something that runs in production. If you're considering building an automated content classification system, or evaluating whether it's worth building at all, the session is a good use of an hour. Register here to attend the on-demand webinar.