Summary
Building an OMOP ETL is often perceived as a multi-month effort. The process can be broken down incrementally and integrated into a team’s regular workflow. Most of the work is semantic mapping — producing the SOURCE_TO_CONCEPT_MAP.csv file. Once that mapping is in place, the rest is mostly mechanical: each source row is routed to the right OMOP table based on the standard concept’s domain_id. This article walks through that approach, presents Linkr’s mapping module, and then has you build a mini MIMIC-IV → OMOP ETL step by step by executing the SQL in your browser.
OMOP ETL: an incremental approach
When discovering OMOP, it’s tempting to picture the ETL as a multi-year effort aiming to standardise the entire local dictionary before any data can be used. That view is legitimate — it reflects what several pioneering teams have actually gone through — but it isn’t the only one.
A more incremental approach lets you deliver a first useful OMOP ETL in a few weeks, then expand coverage progressively over successive projects. It fits inside a data-warehouse team’s normal workflow, without upending habits. It rests on two principles:
- Project by project. For a given study you need a few dozen variables. You only map those; the others wait for another project that will reuse what has already been mapped.
- Semantic mapping first, the rest follows mechanically. Once the
SOURCE_TO_CONCEPT_MAPtable is produced, the ETL becomes mostly routing: you look at each standard concept’sdomain_idand it directly tells you which OMOP table the data lands in.
A direct consequence of this approach: the OMOP database you produce won’t necessarily contain every table in the model. You populate the tables that match the concepts actually mapped for the project — typically PERSON, VISIT_OCCURRENCE, plus a few event tables (CONDITION_OCCURRENCE, MEASUREMENT, DRUG_EXPOSURE…). The others (DEVICE_EXPOSURE, SPECIMEN, NOTE…) stay empty until a project needs them. That reduced scope is enough for the current project: the analyses will only rely on the tables actually populated, and that’s what matters to deliver.
A complete ETL remains the goal
If your team’s resources allow it, a complete ETL remains the goal: map the entire local dictionary, populate every OMOP table, aim for high coverage from the first release. What we present here is a pragmatic alternative for teams that don’t have those resources upfront and still want to start producing — extending project after project.
This article follows that incremental path. We first present the semantic part (the most demanding, involving domain experts), then routing by domain_id, and finally walk through an immersive MIMIC-IV example: you execute the SQL yourself in your browser, step by step, to build a mini OMOP database.
Prerequisites
This article assumes you have read the previous OMOP series articles: the OMOP CDM model, OMOP vocabularies and the OHDSI tools ecosystem. We use concepts like concept_id, domain_id, standard_concept, Maps to and concept_id = 0 without redefining them.
Semantic mapping, the heart of the work
Semantic mapping consists of linking each local code (a MIMIC itemid, an internal lab identifier, a facility-specific label) to a standard OMOP concept: a LOINC identifier for a lab measurement, a SNOMED CT for a diagnosis or observation, an RxNorm for a drug. The output is a CSV file: the famous SOURCE_TO_CONCEPT_MAP.
This is the most time-consuming part of the whole ETL. It relies on domain experts: biologists for lab parameters, pharmacists for drugs, clinicians (intensivists, internists, emergency physicians…) for diagnoses, observations and procedures. The quality of the mapping determines the quality of everything that follows — this is where you take the necessary time.
Linkr’s mapping module
To support this work, Linkr offers a dedicated module — Concept Mapping. Creating a mapping project takes a few clicks, by importing a source file (CSV, Parquet…) listing the concepts to map. During that import, Linkr offers a parameter step where you pick the columns matching the local code, the label, the category, the frequency and the JSON metadata:
Import settings
Configure file parsing and map columns to concept fields.
Map file columns to concept fields. At least the concept name or code is required.
| # | itemid | label | fluid | category |
|---|---|---|---|---|
| 1 | 50808 | Free Calcium | Blood | Blood Gas |
| 2 | 50826 | Tidal Volume | Blood | Blood Gas |
| 3 | 50813 | Lactate | Blood | Blood Gas |
| 4 | 52029 | % Ionized Calcium | Blood | Blood Gas |
| 5 | 50801 | Alveolar-arterial Gradient | Blood | Blood Gas |
| 6 | 50810 | Hematocrit, Calculated | Blood | Blood Gas |
| 7 | 50820 | pH | Blood | Blood Gas |
| 8 | 50912 | Creatinine | Blood | Chemistry |
| 9 | 51221 | Hematocrit | Blood | Hematology |
| 10 | 51301 | White Blood Cells | Blood | Hematology |
Once the project is created, the mapping editor opens in two panels: source concepts on the left, candidate standards on the right:
| Vocab | Concept name | Concept code | Patients | Records | ||
|---|---|---|---|---|---|---|
| ICCA | creatinine serum | L0021 | 1,842 | 9,614 | ||
| ICCA | lactate blood | L0102 | 1,203 | 4,871 | ||
| ICCA | hemoglobin | L0301 | 1,688 | 7,203 | ||
| ICCA | platelet count | L0411 | 974 | 3,120 | ||
| ICCA | WBC count | L0512 | 881 | 2,954 |
| Name | Code | Vocab. | Domain |
|---|---|---|---|
| Creatinine [Mass/volume] in Serum or Plasma | 2160-0 | LOINC | Measurement |
| Creatinine [Mass/volume] in Blood | 38483-4 | LOINC | Measurement |
| Creatinine [Moles/volume] in Serum or Plasma | 14682-9 | LOINC | Measurement |
Agentic-AI-assisted mapping
Hand-mapping remains slow. To speed up the first pass, Linkr integrates with Claude Code via a dedicated concept-mapping skill. An agent processes source concepts in batches, proposes the best standard candidates for each, and produces a combined score:
| Score | Methods | Vocab | Concept name | Code | Std | |
|---|---|---|---|---|---|---|
92% | 95%88%91% | LOINC | Creatinine [Mass/volume] in Serum or Plasma | 2160-0 | S | |
81% | 83%79%82% | LOINC | Creatinine [Mass/volume] in Blood | 38483-4 | S | |
76% | 80%71%77% | LOINC | Creatinine [Moles/volume] in Serum or Plasma | 14682-9 | S | |
68% | 72%64% | SNOMED | Serum creatinine measurement | 113075003 | S | |
54% | 58%51% | LOINC | Creatinine renal clearance | 2164-2 | S |
Four method families contribute to the score:
- Syntactic — string similarity between the local label and standard labels (and their synonyms).
- Semantic — distance between biomedical embeddings (BioLORD-2023-M), computed offline over the entire OMOP vocabulary.
- Statistical (coming) — comparison of value distributions for numeric concepts.
- Agentic — an LLM agent reviews the best candidates, applies clinical reasoning (consistent units? right granularity? correct domain?), and proposes an equivalence (
exactMatch,closeMatch,broadMatch,narrowMatch) with a written justification.
Each suggestion still has to be validated or corrected explicitly in the editor — the AI prepares the ground, the final call always rests with the reviewer. The Suggestions page details the full mechanism.
Metadata and traceability
A concept mapping is more than a correspondence table. It is also a traceability dossier: who mapped what, when, with what reasoning, with what confidence, and what comments were exchanged during review. These metadata matter for the team that will later analyse the data. Linkr stores them with each mapping and can surface them on the source-concept side — distributions, statistics, properties, temporal charts:
Heart_rate
ICCA · M0042 · 12,450 records · 1,842 patients
Statistics
| Min | 0 |
| P25 | 65 |
| Median | 78 |
| P75 | 92 |
| Max | 245 |
| Mean | 82.3 |
| Std | 18.7 |
Boxplot
Distribution
Distribution by sex
Trend by year
Hospital units
Properties
| Unit | bpm |
| Frequency | Hourly |
| Completeness | 98% |
| Source | Monitoring |
The expected JSON structure is documented on the Mapping editor page. It accepts statistics (min/median/max/mean…), distributions (bar, pie or line charts), properties (free key/value pairs) and metadata (scalar metadata). All of it helps the reviewer validate a mapping with full context.
Multi-centre networks
In a research network where several hospitals collaborate on multi-centre projects, each institution can take on the mapping of a subset of concepts and contribute its work to a shared registry. That’s the principle of the Linkr portal: a static portal (deployable to GitLab Pages, GitHub Pages or any static host) that aggregates several mapping workspaces and projects, shared across teams. The Global view assigns custom source_concept_ids (range ≥ 2 billion) per hospital, in disjoint ranges, to avoid collisions when centralising data.
OHDSI tools — the classic alternative
The OHDSI community offers its own tool for this step: Usagi, a Java desktop application. You load your list of local codes, Usagi suggests standard concepts based on textual similarity, and you validate by hand. Its output is compatible with SOURCE_TO_CONCEPT_MAP. Usagi is stable, proven, free, offline — the historical route, and still a perfectly valid one.
Routing by domain_id
Once SOURCE_TO_CONCEPT_MAP is produced, the rest of the ETL becomes mostly mechanical. For each source row, you ask:
- What OMOP standard concept is associated (via the
SOURCE_TO_CONCEPT_MAPjoin)? - What is its
domain_id?
And the domain_id directly determines the target OMOP table:
domain_id | Target OMOP table | Examples |
|---|---|---|
Condition | CONDITION_OCCURRENCE | Diabetes, sepsis, hypertension |
Drug | DRUG_EXPOSURE | Aspirin, heparin, furosemide |
Procedure | PROCEDURE_OCCURRENCE | Colonoscopy, intubation, catheterisation |
Measurement | MEASUREMENT | Creatinine, heart rate, hemoglobin |
Device | DEVICE_EXPOSURE | Catheter, probe, ventilator |
Observation | OBSERVATION | Status, insurance type, language |
Visit | VISIT_OCCURRENCE / VISIT_DETAIL | Hospital stay, ED visit, ICU stay |
Anything that doesn’t fit any of these tables lands by convention in OBSERVATION — OMOP’s general-purpose landing table for such cases. The implementation logic becomes the same for every event table: extract source rows, join on SOURCE_TO_CONCEPT_MAP, filter by the target’s domain_id, insert into the target table with standard _concept_ids, custom _source_concept_ids and _source_values for traceability.
Building a MIMIC-IV → OMOP ETL, step by step
Time to practice. We’ll build a mini OMOP database from MIMIC-IV together, executing the SQL in your browser (via DuckDB-Wasm). The source data is the public MIMIC-IV demo — 100 patients, their stays, measurements and prescriptions.
Before diving into the SQL, here are the five steps of the approach:
Extract the source concepts
Inventory what has to be mapped: one row per concept, with its frequencies and an info_json of metadata (distributions, units, histograms…).
Retrieve the mapping table
Mapping (manual or AI-assisted in Linkr) produces the source_to_concept_map.csv file that links each source code to its standard concept.
Load the reference vocabularies
Import CONCEPT, CONCEPT_RELATIONSHIP and friends from ATHENA, plus the SOURCE_TO_CONCEPT_MAP produced in step B.
Populate the OMOP tables
First the anchor tables (PERSON, VISIT_OCCURRENCE, VISIT_DETAIL), then a routing CTE on domain_id that fills MEASUREMENT, CONDITION_OCCURRENCE and OBSERVATION. DRUG_EXPOSURE is handled separately.
Check quality
Run ACHILLES and the Data Quality Dashboard, spot anomalies, fix, repeat.
Step A — Extract source concepts
The very first step of any OMOP ETL is not an INSERT: it’s the inventory of what needs to be mapped. For our scope (demographics, visit types, transfers, diagnoses, measurements and drugs), we extract distinct source concepts from MIMIC tables along with their frequencies (rows count, distinct patients) and some metadata (units, categories…) that will help the reviewer later. The extraction logic fits in one SQL query per category:
Here’s a representative excerpt of the extraction SQL for labevents lab parameters. The query computes in one pass the descriptive statistics, the histogram, the sex breakdown, the typical measurement frequency and the temporal distribution — all assembled into a single JSON object that follows the format Linkr expects (the other categories — chartevents, diagnoses_icd, transfers, admissions, prescriptions — follow a very similar pattern, adjusted depending on whether the concept is numeric or categorical):
-- Lab parameters (labevents) with a rich info_json
WITH lab_stats AS (
SELECT
le.itemid,
COUNT(*) AS record_count,
COUNT(DISTINCT le.subject_id) AS patient_count,
MIN(le.valuenum) AS v_min,
MAX(le.valuenum) AS v_max,
AVG(le.valuenum) AS v_mean,
STDDEV(le.valuenum) AS v_sd,
QUANTILE_CONT(le.valuenum, 0.05) AS p5,
QUANTILE_CONT(le.valuenum, 0.25) AS p25,
QUANTILE_CONT(le.valuenum, 0.50) AS p50,
QUANTILE_CONT(le.valuenum, 0.75) AS p75,
QUANTILE_CONT(le.valuenum, 0.95) AS p95
FROM labevents le
WHERE le.itemid IN (50971, 50983, 50912, 50882, 51265, 51222, 51301, 50931)
GROUP BY le.itemid
)
SELECT
'MIMIC_LABITEMS' AS vocabulary,
'Lab test' AS category,
CAST(le.itemid AS VARCHAR) AS concept_code,
ANY_VALUE(dli.label) AS concept_name,
ANY_VALUE(s.record_count) AS record_count,
ANY_VALUE(s.patient_count) AS patient_count,
-- Compose the JSON metadata expected by Linkr's mapping editor
json_object(
'full_name', 'Lab test / ' || ANY_VALUE(dli.label),
'data_source', 'MIMIC-IV',
'data_types', 'numeric',
'unit', ANY_VALUE(le.valueuom),
'numeric_data', json_object(
'min', ANY_VALUE(s.v_min),
'p5', ANY_VALUE(s.p5),
'p25', ANY_VALUE(s.p25),
'median', ANY_VALUE(s.p50),
'mean', ANY_VALUE(s.v_mean),
'p75', ANY_VALUE(s.p75),
'p95', ANY_VALUE(s.p95),
'max', ANY_VALUE(s.v_max),
'sd', ANY_VALUE(s.v_sd)
)
-- + histogram, categorical_data (sex), temporal_distribution
-- (assembled via similar sub-queries)
) AS info_json
FROM labevents le
JOIN d_labitems dli ON dli.itemid = le.itemid
JOIN lab_stats s ON s.itemid = le.itemid
WHERE le.itemid IN (50971, 50983, 50912, 50882, 51265, 51222, 51301, 50931)
GROUP BY le.itemid; The resulting info_json column follows the structure documented on the Mapping editor page, which Linkr reads to render the source-concept detail card. For a vital sign like heart rate, the result looks like:
{
"full_name": "Vital sign / Routine Vital Signs / Heart Rate",
"data_source": "MIMIC-IV",
"data_types": "numeric",
"unit": "bpm",
"missing_rate": 0.0,
"numeric_data": {
"min": 0.0, "p5": 62.0, "p25": 78.0, "median": 90.0, "mean": 91.12,
"p75": 104.0, "p95": 122.0, "max": 200.0, "sd": 18.69
},
"histogram": [
{ "x": 10.0, "count": 3 }, { "x": 30.0, "count": 12 },
{ "x": 50.0, "count": 454 }, { "x": 70.0, "count": 3313 },
{ "x": 90.0, "count": 5555 }, { "x": 110.0, "count": 3685 },
{ "x": 130.0, "count": 764 }, { "x": 150.0, "count": 117 }
],
"categorical_data": [
{ "category": "Female", "count": 6284, "percentage": 45.2 },
{ "category": "Male", "count": 7629, "percentage": 54.8 }
],
"temporal_distribution": {
"start_date": "2110-04-11", "end_date": "2201-12-13",
"by_year": [ { "year": 2110, "percentage": 6.5 }, … ]
},
"measurement_frequency": "every 1 hours"
} The final result is a CSV with one row per source concept — for our scope, 46 concepts in total (2 sexes + 7 admission types + 8 care units + 10 diagnoses + 8 vitals + 8 lab parameters + 3 drugs).
To go further, open the Linkr demo and:
- Create a new mapping project (see the Mapping projects docs).
- Import the
source-concepts.csvfile above, mapping the columns according to the import settings shown earlier. - Map concept by concept in the editor — by hand, or with AI assistance via the Suggestions page.
For the next steps, we skip this and fetch the already-produced mapping table directly.
Step B — Retrieve the mapping table
We assume the mapping is done: here’s the final file, ready to be loaded into the OMOP database.
46 rows, 9 columns total (full reference on the source_to_concept_map table page). Three representative rows, in the schema’s column order:
| source_code | source_concept_id | source_vocabulary_id | source_code_description | target_concept_id | target_vocabulary_id | valid_start_date | valid_end_date | invalid_reason |
|---|---|---|---|---|---|---|---|---|
| F | 2000000001 | MIMIC_GENDER | Female | 8532 | Gender | 1970-01-01 | 2099-12-31 | NULL |
| 220045 | 2000000028 | MIMIC_CHARTEVENTS | Heart Rate | 3027018 | LOINC | 1970-01-01 | 2099-12-31 | NULL |
| Aspirin | 2000000044 | MIMIC_DRUG | Aspirin | 1112807 | RxNorm | 1970-01-01 | 2099-12-31 | NULL |
Note the source_concept_ids in the custom OMOP range (≥ 2 billion) and the target_concept_ids pointing at standard concepts.
Step C — Load the reference vocabularies into the database
Before writing the business INSERTs, the target OMOP database has to hold the dictionary of standard concepts — typically the tables CONCEPT, CONCEPT_RELATIONSHIP, CONCEPT_SYNONYM, CONCEPT_ANCESTOR, VOCABULARY, DOMAIN. We fetch them from ATHENA (free account required), ticking the vocabularies useful to our scope (SNOMED CT, LOINC, RxNorm, ICD10CM, UCUM…). The download returns a ZIP of CSV files to load via a COPY or an INSERT INTO ... SELECT FROM read_csv(...).
Once those vocabularies are in place, we also load our source_to_concept_map.csv into the table of the same name, and we add the custom concepts (the source_concept_ids ≥ 2 billion from our STCM) into the CONCEPT table — that’s what allows our _source_concept_id columns on event rows to point to valid entries in the database.
Tutorial prerequisites
To keep the demo lightweight, the interactive component below does not load the full CONCEPT table: it only prepares the source_to_concept_map table with our 46 mappings and creates the empty OMOP target tables. On a real OMOP database, Step C is mandatory — without it, neither ACHILLES nor the DQD can run properly, and queries that use the vocabulary to explore hierarchies (e.g. CONCEPT_ANCESTOR) will return nothing.
Step D — Populate the OMOP tables
Now the ETL proper. The MIMIC raw data and the mapping table are already loaded into a DuckDB database running in your browser. The OMOP target tables (PERSON, VISIT_OCCURRENCE, VISIT_DETAIL, MEASUREMENT, CONDITION_OCCURRENCE, OBSERVATION, DRUG_EXPOSURE) are created empty with the v5.4 schema.
Anchor tables
Three tables are populated first — each with its own INSERT:
PERSONfrompatients(gender, year of birth derived fromanchor_year - anchor_age).VISIT_OCCURRENCEfromadmissions(one stay = onehadm_id).VISIT_DETAILfromtransfers(one care-unit transfer = onetransfer_id).
Routing by domain_id
On MIMIC, source vocabularies are partitioned by domain (MIMIC_CHARTEVENTS → Measurement, MIMIC_DRUG → Drug, MIMIC_ICD10 → Condition…) — you could in theory write one INSERT per vocabulary, relying on that partition.
However, to stay general and match the approach described earlier, we prepare a materialised CTE mapped_events that unifies the event sources (diagnoses, vital signs, labs) and resolves each row’s target domain_id via CONCEPT. We then filter this CTE by target_domain_id to dispatch into the right OMOP table:
target_domain_id = 'Measurement'→MEASUREMENTtarget_domain_id = 'Condition'→CONDITION_OCCURRENCE- Everything else →
OBSERVATION(the default fallback table)
PROCEDURE_OCCURRENCE, DEVICE_EXPOSURE, SPECIMEN, NOTE… follow exactly the same logic — just add a WHERE target_domain_id = '…' branch; we leave them out here because no concept in our scope maps to them.
DRUG_EXPOSURE is handled separately: its sources and specifics (dose, route, unit) don’t fit the same generic router.
Heterogeneous source vocabularies
On messier databases — typically an in-house hospital vocabulary that lumps lab tests, clinical observations, scores, even drugs into the same code table — per-vocabulary partitioning no longer works. The mapped_events CTE described here, which resolves the domain_id via CONCEPT, then becomes essential: it’s the only way to route each row correctly. That’s why we adopt this approach from the start — at the cost of one extra join — rather than an INSERT per vocabulary.
Each OMOP table has its own section below, with a ready-to-run INSERT INTO query followed by a SELECT that previews the result. To run a query, click Run (or use the ⌘ / Ctrl + Enter shortcut). The Reset query button above each editor restores the original version in one click. Run the queries in order: mapped_events must exist before the routing INSERTs, and PERSON / VISIT_OCCURRENCE must exist before DRUG_EXPOSURE.
The pattern is always the same: JOIN on source_to_concept_map, INSERT into the target table, COALESCE(..., 0) for unmapped codes so as not to drop rows, and the _source_value and _source_concept_id columns preserve traceability back to the original data.
DRUG_EXPOSURE — simplified teaching version
The INSERT into DRUG_EXPOSURE above is intentionally minimal: we extract the quantity from form_val_disp via regex, fill in route_source_value, dose_unit_source_value and drug_source_value, and that’s it. A real implementation computes the actual dose received from amount_value/amount_unit (unit quantity), numerator_value/denominator_value (concentration for solutions), the administration duration and the frequency — and leans on DRUG_STRENGTH (vocabulary side) to resolve product presentations. The formulas are detailed in the OMOP “drug dose” documentation. Worth doing in a real project, kept out here to avoid bloating the tutorial.
Step E — Validate the resulting database’s quality
After the INSERTs are done, the OMOP database exists. But it can contain silent inconsistencies: nonsensical dates, non-standard concepts where standards are expected, null required fields. Before any research use, you have to measure quality.
Two OHDSI tools are designed for this:
- ACHILLES — an R package that produces a synthetic characterisation report: number of patients, age distributions, condition prevalence, drug frequencies, lab-measurement distributions. Excellent for taking stock of the database — and quickly spotting obvious anomalies.
- Data Quality Dashboard (DQD) — around 4,000 automated checks across three axes: schema conformance, completeness of required fields, plausibility of values. The interactive HTML report lets you prioritise fixes.
In practice DQD almost always flags hundreds of anomalies on a freshly built database. Most are minor (non-standard concepts, missing dates on optional events…), a handful point at real ETL bugs. It’s an iterative cycle: run, fix, re-run, until the quality level is good enough for release.
Empty tables and DQD
DQD explicitly handles unpopulated tables — exactly the scenario of an incremental, project-by-project ETL. If a table exists but is empty, all field-level and concept-level checks for it are flagged as notApplicable (details in the official status documentation). These checks don’t count as violations. You can also proactively exclude tables via the tablesToExclude parameter of executeDqChecks() — by default, DQD already excludes vocabulary tables (CONCEPT, VOCABULARY, etc.).
What’s next?
Once your first OMOP database is in place, the next project will reuse existing mappings and only add a few new concepts. Project after project, coverage of the local dictionary grows steadily — without ever having to align everything at once.
- An OMOP ETL doesn't have to be done in one shot: you proceed project by project, mapping a few dozen concepts at a time — just what the project actually needs.
- Semantic mapping is the most time-consuming part. It involves domain experts (biologists, pharmacists, clinicians) and produces the
SOURCE_TO_CONCEPT_MAPtable. - Agentic AI (Claude Code embedded in Linkr, syntactic + semantic + LLM scoring with justified SKOS equivalences) considerably speeds up the first pass — a human then validates each mapping.
- Once mapping is in place, routing follows the standard concept's
domain_id: Condition →CONDITION_OCCURRENCE, Drug →DRUG_EXPOSURE, Measurement →MEASUREMENT, etc. Anything that doesn't fit elsewhere lands inOBSERVATION. - Every event table follows the same template: extract + JOIN on STCM + INSERT with the columns
_concept_id(standard),_source_concept_id(custom, range ≥ 2 billion) and_source_value(original label, for traceability). - Multi-centre networks centralise mappings via linkr-portal: a static portal shared across institutions, with disjoint custom-ID ranges to avoid collisions.
- Final validation via ACHILLES and the Data Quality Dashboard is an iterative cycle: run, fix, re-run — until quality is releasable.