hazylab

RAG is a Preprocessing Problem

2026-03-25 12:44:11

The wrong question

"Does data cleaning take up half the effort in production RAG?"

This is the question that keeps coming up. The answer is yes, but the framing is wrong. "Data cleaning" undersells what's actually happening. The real work isn't cleaning — it's building an inference parser.

What an inference parser does

An inference parser doesn't just clean a document. It prepares it specifically for inference. That means:

  • Enriching metadata at parse time: document type, date, source authority, structural role of each section
  • Resolving named entities against open registries rather than inferring them
  • Formatting the output so that each chunk is already shaped for the model that will consume it
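A minimal sketch of what this looks like as a data structure. The field names and the header format are assumptions for illustration, not a prescribed schema — the point is that the chunk carries its metadata with it and knows how to render itself as a prompt-ready unit:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """A parse-time enriched chunk, shaped for the model that will consume it."""
    text: str
    doc_type: str            # e.g. "contract", "email", "report"
    date: str                # ISO-8601 document date
    source_authority: str    # who produced the document
    section_role: str        # structural role: "title", "clause", "footnote", ...
    entities: dict = field(default_factory=dict)  # mention -> canonical id

    def render(self) -> str:
        """Serialize metadata plus text into the form the model actually sees."""
        header = f"[{self.doc_type} | {self.date} | {self.source_authority} | {self.section_role}]"
        return f"{header}\n{self.text}"

chunk = Chunk(
    text="The supplier shall deliver within 30 days.",
    doc_type="contract", date="2024-06-01",
    source_authority="legal-dept", section_role="clause",
)
print(chunk.render())
```

The same `render` output is what gets embedded — so the typed context is present both at retrieval time and at generation time.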

This is prompt engineering applied upstream. A well-structured, well-annotated chunk isn't just easier to retrieve — it's a better prompt. The model gets typed, contextualized input instead of raw text, and that difference shows up directly in output quality.

It also makes vectorization cheaper. A chunk that carries precise semantic signal embeds with less noise — similar facts cluster tightly, distinct facts separate cleanly, and your index stays leaner as you scale.

Named entity resolution: registries, not inference

One specific decision that changes everything: resolve named entities against open registries at parse time, don't leave them for inference.

Most RAG pipelines feed raw text to an embedding model and hope the model figures out that "ACME Corp" in paragraph 3 is the same entity as "Acme Corporation" in paragraph 47. This is NER by inference — probabilistic, expensive, and wrong often enough to poison your vector index.

The alternative: at parse time, look up entities in structured registries (company registries, geographic databases, standardized nomenclatures). Replace ambiguous mentions with canonical identifiers. The chunk that enters the vector index already has resolved entities — no inference needed downstream, no ambiguity in the embeddings.

This costs almost nothing at parse time and eliminates an entire class of retrieval errors.
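A toy version of the lookup, assuming a tiny in-memory registry (in production this would be a company register, a geographic database, or a standardized nomenclature). Aliases are matched longest-first so the more specific mention wins:

```python
import re

# Hypothetical mini-registry: alias -> canonical identifier.
REGISTRY = {
    "acme corp": "Q-ACME-001",
    "acme corporation": "Q-ACME-001",
    "globex": "Q-GLOBEX-002",
}

def resolve_entities(text: str) -> str:
    """Replace ambiguous mentions with canonical identifiers at parse time."""
    out = text
    # Longest aliases first so "Acme Corporation" wins over "Acme Corp".
    for alias in sorted(REGISTRY, key=len, reverse=True):
        out = re.sub(re.escape(alias), f"[{REGISTRY[alias]}]", out, flags=re.IGNORECASE)
    return out

print(resolve_entities("ACME Corp signed with Acme Corporation."))
# Both surface forms collapse to the same identifier before embedding.
```

Once both mentions read `[Q-ACME-001]`, the embedding model never has to guess that they co-refer — the ambiguity is gone before the vector is computed.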

Atomic claims: engineering the vectorization unit

In some contexts you can go further and engineer the vectorization unit itself.

Run a first inference pass on the meta-parsed document to produce numbered atomic claims — each one a self-contained logical statement, entities marked, format normalized. Then vectorize those claims rather than the source chunks.

Pair that index with SQL queries on the structured metadata and the retrieval dynamic changes entirely: you're no longer matching text against text. You're querying a semantic knowledge base where every unit is already a discrete, addressable fact.
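A sketch of the SQL half of that pairing, with an assumed schema (the column names are illustrative). The metadata filter narrows the candidate set before any vector similarity runs, so similarity only has to rank claims that already satisfy the structured constraints:

```python
import sqlite3

# Assumed schema: one row per atomic claim, structured metadata alongside.
# A vector index would sit next to this table, keyed by claim id.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE claims (
    id INTEGER PRIMARY KEY,
    text TEXT,          -- self-contained atomic statement
    entity TEXT,        -- canonical entity id, resolved at parse time
    claim_date TEXT,    -- temporal marker
    source TEXT         -- attribution back to the document
)""")
db.executemany("INSERT INTO claims VALUES (?,?,?,?,?)", [
    (1, "[Q-ACME-001] acquired [Q-GLOBEX-002].", "Q-ACME-001", "2024-03-01", "email:1842"),
    (2, "[Q-ACME-001] opened a Lyon office.",    "Q-ACME-001", "2023-11-15", "email:0307"),
])

# Structured pre-filter: entity and date constraints run as SQL,
# then vector similarity ranks only what survives.
rows = db.execute(
    "SELECT id, text FROM claims WHERE entity = ? AND claim_date >= ?",
    ("Q-ACME-001", "2024-01-01"),
).fetchall()
print(rows)
```

This is what "querying a semantic knowledge base" means concretely: the retrieval question is expressed partly as SQL over facts, not only as nearest-neighbor search over prose.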

On corpora with any structural regularity, the RAG quality improvement is not incremental — it is a step change.

How this works in practice

HOROS runs this pipeline in production. SiftRAG — the structured claim extraction component — treats document ingestion as a preprocessing problem with 5W1H decomposition and atomic claim indexing.
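A sketch of what a 5W1H-decomposed claim record might look like. The field names here are an assumption for illustration, not SiftRAG's actual schema — the point is that each indexed unit is a discrete fact with its dimensions made explicit:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AtomicClaim:
    """One indexed claim after 5W1H decomposition (illustrative schema)."""
    who: str                  # resolved canonical entity
    what: str                 # the self-contained fact
    when: Optional[str]       # temporal marker, ISO-8601 if known
    where: Optional[str]      # resolved location, if any
    why: Optional[str]        # stated motivation, if any
    how: Optional[str]        # stated mechanism, if any
    source: str               # attribution back to the original document

claim = AtomicClaim(
    who="Q-ACME-001",
    what="announced a recall of batch 7",
    when="2024-05-02", where=None, why="safety defect", how=None,
    source="email:0042",
)
```

Each of these fields is also a filterable column in the metadata index, which is what makes a claim "addressable" rather than merely retrievable.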

On a real corpus: 15,000 emails produced 80,000 indexed claims. Each claim is a self-contained fact with resolved entities, temporal markers, and source attribution. The retrieval quality on this index compared to a naive chunk-and-embed pipeline is not a percentage improvement. It's a category change.

The difference: when you search the naive index, you get "paragraphs that are vaguely similar to the query." When you search the claims index, you get "specific facts that answer the question, with source and date."

The investment calculation

If you're retrofitting an existing RAG pipeline, restructuring the ingestion stage takes real effort — roughly half the total project budget, which is where the "half the effort" meme comes from.

If you design the inference parser as a first-class component from the start, it's the highest-leverage investment in the stack. Retrieval quality, generation quality, and vectorization cost all improve together. You pay once upstream, and everything downstream gets better.

The pattern

The lesson generalizes beyond RAG. Across most of the questions that come up in production AI systems — retrieval quality, hallucination rates, embedding costs, latency — the answer is the same: do the research before writing the code. Understand the shape of the problem before reaching for the obvious tool.

Specifically: preprocessing is where the leverage is. The model at inference time is a fixed quantity. The input you give it is the variable you control. Engineer the input.


hazyhaar — open research, sovereign infrastructure github.com/hazyhaar · hazyhaar.fr

rag · preprocessing · pipeline