What is Semantic ETL? The Architecture for AI-Ready Data Pipelines

By Philippe Dallaire • May 15, 2026 • 12 min read

TL;DR: Traditional ETL was built to move rows. Semantic ETL is a new category of data transformation built to move meaning, designed for a stack where the consumer is an LLM, a vector index, or an agent, not a BI dashboard. Five properties define it: AI as the analyst with deterministic execution, ISO-standard normalization, meaning-aware validation, classified errors with one-to-one reconciliation, and portable declarative mappings. ContentAtlas is the product we built around this architecture. This article explains what the category is and why it matters before you replatform anything.

Why a New Term?

Three or four times a quarter, somebody asks us why we don’t just call ContentAtlas an “AI-powered ETL tool” and move on. The answer matters, so we want to say it clearly.

Most products marketed as “AI ETL” use a large language model to perform the transformation itself: feed the row in, ask the model to clean it, take whatever comes out. That architecture is fast to demo and structurally indefensible in production. The output is probabilistic. It varies run to run. It can’t be reconciled. A CFO cannot sign off on a quarterly close that was produced by a non-deterministic process, and a regulator cannot audit one.

What we built is different enough — and the gap between it and the legacy ETL category is wide enough — that the honest thing is to give it a name. We call it Semantic ETL, and the rest of this article explains what that means in terms a CFO can read and a data engineer can verify.

The Core Architectural Idea: AI as the Analyst, SQL as the Executor

The single most important property of Semantic ETL is the one most “AI data” products get backwards.

In a Semantic ETL pipeline, the AI understands the data. The system transforms it.

Concretely, this means a model (your choice of OpenAI, Anthropic, Gemini, Bedrock, DeepSeek, or an open-weight model) is given carefully sampled rows from the source, the existing target schema, the existing mappings already in place, and any system or user instructions about how the data should be interpreted. The model’s job is to decide which source column corresponds to which target field, what type a value should be, whether a record is real or a test, which records are duplicates of which others, and what transformation should apply.

Once those decisions are made, the actual transformation runs as deterministic system operations and SQL. Not as model output. The model said “this column is a currency that should be normalized to ISO 4217 precision”; SQL does the normalization. The model said “these two records refer to the same customer”; a deterministic merge operation merges them. The model said “this row matches the pattern of a test record and should be excluded from production output”; a system rule, now logged and inspectable, excludes it.

The output is the same every time you run the pipeline against the same input. You can diff it. You can reconcile it. You can hand it to an auditor and explain, row by row, why each decision was made.

This is the line that separates Semantic ETL from “LLM as ETL” products: probabilistic understanding, deterministic execution. The model is in the loop at design time and review time. It is not in the loop at runtime.
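The split between design time and runtime can be sketched in a few lines. This is a minimal illustration, not ContentAtlas's actual format: the mapping layout, the table and column names, and the helper functions are all hypothetical. The mapping stands in for what a model might propose and a human might approve; the executor is plain SQL run through SQLite from the Python standard library.

```python
import json
import sqlite3

# Hypothetical mapping, standing in for what an AI analyst proposes at design
# time. The JSON layout and every name in it are illustrative assumptions.
MAPPING_JSON = """
{
  "source_table": "src_customers",
  "columns": {
    "customer_name": "TRIM(cust_nm)",
    "country_code": "UPPER(ctry)"
  },
  "exclude_where": "UPPER(TRIM(cust_nm)) = 'TEST TEST'"
}
"""

def build_sql(mapping: dict) -> str:
    """Compile an approved mapping into one SELECT statement.

    The expressions are trusted because a human reviewed the mapping;
    this sketch does not sanitize them.
    """
    select_list = ", ".join(
        f"{expr} AS {target}"
        for target, expr in sorted(mapping["columns"].items())
    )
    return (
        f"SELECT {select_list} FROM {mapping['source_table']} "
        f"WHERE NOT ({mapping['exclude_where']}) ORDER BY rowid"
    )

def run_pipeline(conn: sqlite3.Connection, mapping: dict) -> list:
    """Runtime path: SQL only, no model call."""
    return conn.execute(build_sql(mapping)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_customers (cust_nm TEXT, ctry TEXT)")
conn.executemany(
    "INSERT INTO src_customers VALUES (?, ?)",
    [("  Acme GmbH ", "de"), ("TEST TEST", "us"), ("Globex", "ca")],
)

mapping = json.loads(MAPPING_JSON)
first_run = run_pipeline(conn, mapping)
second_run = run_pipeline(conn, mapping)
print(first_run)                # [('DE', 'Acme GmbH'), ('CA', 'Globex')]
print(first_run == second_run)  # True
```

The model's contribution ends at the mapping. Everything after `json.loads` is ordinary, repeatable SQL, which is why two runs over the same input can be compared row for row.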

The Five Properties of Semantic ETL

If a pipeline tool ships all five of these as defaults — not as things a customer can build on top — it qualifies as Semantic ETL.

1. AI-as-analyst, system-as-executor. Covered above. The LLM decides what should happen; deterministic operations make it happen. The first question to ask any “AI ETL” vendor is: “Does the model produce the output value, or does the model produce the rule that produces the output value?” Only the second answer is auditable.

2. ISO-standard normalization, by default. A Semantic ETL pipeline ships named, version-stable operations for the international standards every real enterprise data set depends on. ISO 4217 for currency, with correct decimal precision per currency code: JPY has zero decimals, EUR has two, BHD has three, and a tool that doesn’t know the difference is shipping bugs by default. ISO 8601 for dates. ISO 13616 for IBAN, with checksum verification. ISO 3166 for country codes. E.164 (an ITU-T recommendation rather than an ISO standard) for phone numbers. The Luhn check for payment card numbers. These are not premium features. They are the table stakes for any system that will be queried by a model that cannot tell the difference between "$1,000.00", "1000", and "1.000,00".

3. Meaning-aware validation, not just structural validation. The pipeline distinguishes a real record from a non-real one — test data, placeholders, sentinels, duplicates, deprecated entities — using semantic signals, not just regex and NULL checks. A row with the customer name "TEST TEST", an address of "do not use", or a 1900-01-01 birthdate is flagged automatically. The customer does not write the rule; the pipeline already knows the shape of the problem because the AI analyst has seen the source.

4. Classified errors with one-to-one reconciliation. This is the property that makes Semantic ETL defensible to a CFO, an internal auditor, and a regulator. Every input record is accounted for in the output, in one of four states:

Mapped clean — the record passed all validation and landed in the target unchanged.
Transformed with flag — the record was normalized (date format converted, currency precision corrected) and landed with provenance and a soft-correction flag.
Soft error — the record had a recoverable problem (a date in an unexpected format, a phone number that needs reformatting) and was routed to a queue for review or auto-correction, not silently dropped.
Hard error — the record failed a non-negotiable check (an IBAN that fails its ISO 13616 checksum, a required field missing) and was blocked from production output, with the reason logged.

The total of these four counts must equal the input count exactly, like a bank reconciliation. If a hundred thousand records came in, a hundred thousand are accounted for in the logs. Nothing is silently dropped, and nothing is rounded away. This is a hard property of the system, not an audit feature you can add later, and it is the property that competitors who run their transformations through an LLM at runtime cannot structurally claim.

5. Declarative, portable mappings. The transformation logic lives in inspectable, version-controlled configuration — typically JSON. It diffs cleanly in git. It runs identically across environments. It survives the engineer who wrote it. The artifact is the contract between systems, not project ephemera that gets archived when the migration closes. This matters because data pipelines have a tendency to outlive the people who build them by ten years, and a 600-line Python script written under deadline pressure is a liability the moment the engineer who wrote it changes jobs.
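To make the per-currency precision in property 2 concrete, here is a minimal sketch covering only the three codes named above. The function name `normalize_amount`, the three-entry lookup table, and the choice of banker's rounding are assumptions made for illustration; a real implementation would carry the full ISO 4217 table and a rounding policy agreed with finance.

```python
from decimal import Decimal, ROUND_HALF_EVEN

# Minor-unit digits for the three currency codes named in the text.
# A production table would cover every active ISO 4217 code.
MINOR_UNITS = {"JPY": 0, "EUR": 2, "BHD": 3}

def normalize_amount(amount: str, code: str) -> Decimal:
    """Quantize an amount to the decimal precision of its currency code."""
    code = code.strip().upper()
    if code not in MINOR_UNITS:
        raise ValueError(f"unknown currency code: {code!r}")
    # Decimal(1).scaleb(-2) is Decimal('0.01'); scaleb(0) is Decimal('1').
    quantum = Decimal(1).scaleb(-MINOR_UNITS[code])
    # Rounding mode is an assumption of this sketch, not part of ISO 4217.
    return Decimal(amount).quantize(quantum, rounding=ROUND_HALF_EVEN)

print(normalize_amount("1000", "EUR"))    # 1000.00
print(normalize_amount("1000.7", "JPY"))  # 1001
print(normalize_amount("12.5", "bhd"))    # 12.500
```

Using `Decimal` rather than `float` keeps the arithmetic exact, so the same input string always yields the same stored value.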

What Semantic ETL Is Not

Three clarifications, because the category gets conflated with adjacent things.

Semantic ETL is not RAG. RAG (retrieval-augmented generation) is what happens when an application queries a vector index at inference time. Semantic ETL is what happens upstream — the transformation layer that prepares data so that whatever consumes it (a warehouse, a vector index, an agent tool) can be trusted. ContentAtlas, specifically, does not provide RAG or indexing. It produces clean, validated, AI-ready records in the formats a downstream consumer needs — JSON, XML, SQL, flat files — and the indexing layer is somebody else’s job. The two responsibilities are deliberately separate. A pipeline that tries to do both ends up doing neither well.

Semantic ETL is not a vector database. Vector databases (Pinecone, Weaviate, pgvector, Qdrant) store and retrieve embeddings. They are downstream consumers of clean data. A pipeline that emits garbage into a vector database produces a vector database full of garbage, and the model that queries it will confidently retrieve garbage. Fixing this at the vector layer is structurally impossible. It has to be fixed at the transformation layer.

Semantic ETL is not “LLM as ETL.” Worth repeating: a pipeline where the LLM produces the output value is not Semantic ETL. It is a fragile, non-auditable, expensive demo. Semantic ETL puts the LLM in the design loop — sampling source data, proposing mappings, classifying records — and runs the actual transformation through deterministic operations whose output is identical every time.

Why This Matters Now

Two converging facts make this the year the category needs a name.

First, the legacy ETL category is in actual upheaval. Informatica PowerCenter 10.5 reached end of standard support on March 31, 2026, and Informatica itself was acquired by Salesforce in November 2025. Talend is now a Qlik product. Every enterprise running a 1990s-architecture ETL stack is being forced into a replatforming decision within the next eighteen months. The question is not whether to change. It is what to change to.

Second, the new consumer of enterprise data is not a BI dashboard. It is a model. And the difference matters more than people initially think: a BI dashboard tolerates a “TEST TEST” customer in the data — a human reading the chart will visually skip it. A RAG-powered assistant retrieves it with the same confidence as any other customer and answers questions as if it were real. Industry analysis of production RAG systems in 2026 puts the retrieval failure rate at roughly 73% — and the failure is almost never the model. It is the data that was fed in. It came out of an ETL pipeline that was never asked to understand what it was moving.

The data layer has to be rebuilt for the new consumer. The question is whether you rebuild it with the same architecture that produced the problem, or with one designed from the first line of code to feed a stack that cares about meaning.

How ContentAtlas Implements the Five Properties

This is the architecture we built ContentAtlas around. Here is what each property looks like in practice, with deterministic execution and output formats broken out as separate items:

AI-as-analyst. ContentAtlas samples the source data, reads the existing target schema and any existing mappings, ingests system and user instructions, and uses the customer’s choice of LLM (OpenAI, Anthropic, Gemini, Bedrock, DeepSeek, Qwen — using the customer’s own API keys, so no data is exposed to a shared service) to propose the mapping and transformation logic. See how AI Mapping works.
Deterministic execution. Once the mapping is approved, the actual transformation runs as SQL and system operations. Same input, same output, every time. No model in the runtime path.
ISO normalization. First-class named operations: clean_currency (ISO 4217 with per-code precision), date_standardization (ISO 8601), standardize_phone (E.164), IBAN validation (ISO 13616 checksum), country normalization (ISO 3166).
Meaning-aware validation. The AI analyst flags candidate non-real records (test data, placeholders, sentinels) during the design phase. The customer reviews and approves; the resulting filter then runs as a deterministic rule, logged for audit.
One-to-one reconciliation. Every input record is accounted for in one of the four output states above. The counts tie. The logs are exportable. The reconciliation is part of the standard output, not an audit add-on.
Portable mappings. The full mapping and transformation logic exports as JSON. It diffs in git. It survives the project.
Flexible output. The same pipeline emits to JSON, XML, SQL inserts/upserts, CSV, or any other format the downstream consumer needs. A warehouse team and an AI team can both consume from the same upstream pipeline.
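As an example of what a deterministic check in the list above involves, the IBAN checksum can be sketched as follows. This is a generic illustration of the standard mod-97 check, not ContentAtlas code; the function name is hypothetical, and the sketch verifies only the checksum, not the country-specific length or account format.

```python
import string

ALLOWED = set(string.ascii_uppercase + string.digits)

def iban_checksum_ok(iban: str) -> bool:
    """Return True if the IBAN passes the mod-97 checksum.

    Steps: strip spaces, move the first four characters to the end,
    replace letters with numbers (A=10 ... Z=35), and check that the
    resulting integer leaves remainder 1 when divided by 97.
    """
    compact = iban.replace(" ", "").upper()
    if not 5 <= len(compact) <= 34:
        return False
    if any(ch not in ALLOWED for ch in compact):
        return False
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1

print(iban_checksum_ok("GB82 WEST 1234 5698 7654 32"))  # True
print(iban_checksum_ok("GB82 WEST 1234 5698 7654 33"))  # False
```

A record that fails this check is the kind of case the article classifies as a hard error: the rule is fixed, the result is reproducible, and the reason can be logged verbatim.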

This is what we mean when we say ContentAtlas is built on Semantic ETL principles, not retrofitted to them. The architecture was the starting point. The product is the execution of it.

The Question Worth Asking Before You Buy

If you are evaluating a data transformation tool for an AI initiative — whether that’s an internal copilot, a RAG-powered assistant, an agent workflow, or just feeding a model that needs to be trusted — ask the vendor exactly five questions.

1. Does an LLM produce the output value, or does an LLM produce the rule that produces the output value? (If the first, walk away.)
2. Show me the ISO 4217 currency precision logic. Does JPY have zero decimals by default, or did somebody have to write that?
3. Show me a record that was filtered as a test or non-real record. Was the rule discoverable automatically, or did somebody write a regex?
4. Reconcile the input and output counts of your last production run. Do the totals tie one-to-one across mapped, transformed, soft-errored, and hard-errored?
5. Export the mapping as a file. Can I diff it in git?

If the vendor passes all five, you’re looking at Semantic ETL, by whatever name the vendor markets it. If the answer to any of them is “well, you can build that on top,” you are looking at the same architecture that produced the data quality problem in the first place.
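The reconciliation question lends itself to a mechanical check. The sketch below assumes each input record has a unique ID and that the pipeline reports exactly one of the four states per record; the `State` enum and the `reconcile` function are hypothetical names used for illustration.

```python
from collections import Counter
from enum import Enum

class State(Enum):
    MAPPED_CLEAN = "mapped_clean"
    TRANSFORMED_WITH_FLAG = "transformed_with_flag"
    SOFT_ERROR = "soft_error"
    HARD_ERROR = "hard_error"

def reconcile(input_ids, outcomes):
    """Verify every input record has exactly one outcome, then count states.

    input_ids: iterable of record IDs read from the source.
    outcomes:  dict mapping record ID -> State, as reported by the pipeline.
    Raises ValueError if any record is missing, unexpected, or duplicated.
    """
    ids = list(input_ids)
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate record IDs in input")
    missing = set(ids) - set(outcomes)
    unexpected = set(outcomes) - set(ids)
    if missing or unexpected:
        raise ValueError(
            f"counts do not tie: missing={sorted(missing)}, "
            f"unexpected={sorted(unexpected)}"
        )
    counts = Counter(outcomes[record_id] for record_id in ids)
    return {state.value: counts[state] for state in State}

report = reconcile(
    ["r1", "r2", "r3", "r4"],
    {
        "r1": State.MAPPED_CLEAN,
        "r2": State.TRANSFORMED_WITH_FLAG,
        "r3": State.SOFT_ERROR,
        "r4": State.MAPPED_CLEAN,
    },
)
print(report)
print(sum(report.values()) == 4)  # True
```

Because the check raises rather than warns, a run whose counts do not tie cannot quietly pass as complete.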

The Bottom Line

The reason the term matters is the reason every category name eventually matters: it gives buyers a way to ask the right questions and gives builders a stake in the ground. ETL as a category was built for a 1990s problem and is being retired by the calendar. The next category — the one that will define enterprise data tooling for the rest of this decade — has to do more than move rows. It has to understand them, normalize them, validate them against international standards, reconcile them one-to-one, and emit them in a form a model can be trusted with.

That category is Semantic ETL. ContentAtlas is the product we built around it. If you are scoping a 2026 replatforming decision, this is the architecture the question is actually about.
