Clinical data warehouses: leveraging data already collected

TL;DR

Hospitals use many different software systems for care — prescriptions, lab results, imaging, vital signs. Each stores its data in a separate database, often incompatible with the others. A clinical data warehouse (CDW) gathers, transforms, and structures all this data into a single space designed for research.

The hospital information system

Hospitals use clinical software to manage prescriptions, lab results, imaging, vital signs, clinical notes, and more. Together, these systems form the hospital information system (HIS).

A single hospital may use dozens of different software systems: one for lab results, another for the pharmacy, another for the ICU, yet another for imaging…

Each system stores its data in its own database. Any data that appears on a clinical software screen is, in theory, retrievable. But in practice, these databases are not designed to be queried for research purposes.

Scattered data

A single patient may have data in the ICU system, the lab system, the pharmacy software, the imaging system… Gathering all this information manually is a massive undertaking.

What is a clinical data warehouse?

A clinical data warehouse (CDW) is a storage system where care data is copied and made available for research.

The CDW does not replace clinical software — it complements it. Data continues to be generated during care, but a copy is regularly transferred to the warehouse.

A simple analogy

Imagine each clinical software is a filing cabinet in a different office. The CDW is the central library that gathers a copy of every cabinet in one place, with a unified classification system.

Most university hospitals in France have a data warehouse, though their maturity levels vary. Some integrate over a dozen data sources, while others are still under construction. The French data protection authority (CNIL) provides a map of clinical data warehouses across France.

What about collection time?

Remember the chart on manual collection time? Time increased proportionally with both the number of patients and the number of variables. What happens with a data warehouse?

With a data warehouse, the logic is reversed. The main effort is writing and validating queries for each variable — checking codes, units, and time windows. But once a query is written, it applies to all patients at once: whether you have 100 or 10,000 patients, the time is the same. Use the tabs below to compare both approaches:

Estimated collection time

Time depends on the number of variables — not the number of patients.

Indicative estimate based on ~1 hour per variable. In practice, this varies greatly depending on the warehouse's maturity level and the complexity of the variables.

The difference is striking: with manual collection, going from 100 to 500 patients already makes time skyrocket — and thousands of patients is simply unrealistic. With a data warehouse, the lines stay flat — only the number of variables matters.

An investment that compounds

The time spent validating a variable in the warehouse is not wasted: a well-written, verified query is directly reusable for future studies. The initial cost is higher, but it doesn’t repeat.

ETL: transforming to unify

The core challenge is that each clinical software stores data in a different format. Column names, units, and identifiers vary from one system to another. These systems are said to be non-interoperable.

Hospital information system

Lab system

patient_id, result_bio, val_num, date_prelvt

Pharmacy

pat_id, med_prescrit, posologie, voie_adm

ICU

id_patient, param_vital, horodatage, val

Imaging

nip, examen, cr_texte, date_exam

ETL

Data warehouse

Unified format

patient_id, visit_id, concept_id,

datetime, numeric_value, unit,

…

Same columns, same codes, same units — regardless of the source.

To make the data usable, it must be transformed. This process is called ETL (Extract, Transform, Load):

Extract

Data is copied from the clinical software databases to intermediate servers. At this stage, the data is still in each vendor's proprietary format — non-interoperable.

Transform

This is the crucial step. Data is converted into a unified format: same column names, same units, same identifiers. This is what makes it possible to 'merge' data from different sources.

Load

The transformed data feeds into the data warehouse, ready to be queried for research.

This process runs at regular intervals — typically once a week, usually at night or on weekends, when clinical servers are under less load.

An ongoing effort

ETL is not a one-time project. Every software update can break a data source integration. The warehouse team must continuously monitor data quality and fix issues as soon as they arise.

Data feeds

A data source refers to a specific type of data integrated into the warehouse. Each data source corresponds to a category of information extracted from a particular clinical software.

Demographics

Patients, age, sex, hospital stays…

Lab results

Laboratory data: CBC, metabolic panel, CRP…

Vital signs

Heart rate, blood pressure, SpO2…

Prescriptions

Medications, dosages, routes of administration…

Coded diagnoses

ICD-10 (International Classification of Diseases), procedure codes…

Clinical notes

Consultations, letters, operative reports…

The maturity of a data warehouse depends on how many data sources are integrated. Integrating a reliable data source requires significant engineering effort. But once in place, the data is updated regularly and automatically.

A common data model

For data to be truly usable — and comparable across hospitals — it must follow a common data model.

The most widely used model is OMOP (Observational Medical Outcomes Partnership), developed by the OHDSI network. It is a database schema designed for health data, used in over 50 countries, representing nearly one billion patient records worldwide.

But why is a common model necessary? Without one, each hospital organizes its data differently. The problem exists at two levels:

Structure: different schemas

From one warehouse to another, tables and columns have different names for the same thing:

CDW Hospital A

patient_id, birth_date

concept_code, numeric_value,

measurement_dt

CDW Hospital B

person_id, ddn

code_concept, valeur,

date_mesure

Semantics: different codes

Even when the structure looks similar, the codes used to identify the same concept differ:

Hospital A

Heart rate → code 13902

Label: “Adult HR”

Hospital B

Heart rate → code C90230

Label: “Heart rate”

A model like OMOP solves both problems: it enforces identical column names (structural interoperability) and standardized codes from international vocabularies such as SNOMED, LOINC, and ATC (semantic interoperability).

Why OMOP?

OMOP has established itself, alongside FHIR, as the standard for research on clinical data warehouses. It enables data comparison across hospitals and countries, and makes it possible to reproduce studies on different databases. A script written for one hospital can work on another, without modification. It is the most widely used model today — and the one Linkr uses by default.

Accessing data in practice

If your hospital has a data warehouse, you can reach out to the team in charge to request data access. This typically involves the following steps:

Write a research protocol

Define your objectives, study population, and required variables — this is what the previous articles covered.

Obtain regulatory approvals

GDPR (General Data Protection Regulation) compliance, ethics committee review… The warehouse team will guide you through these steps.

Access the data

Once approvals are granted, data is made available in a secure environment.

But accessing the data is not enough: raw warehouse data contains errors, duplicates, and missing values. Before any analysis can begin, significant data quality work is required.

Another common barrier is the need for programming skills (R, Python, SQL) to work with the data. This is precisely the barrier Linkr was designed to remove: enabling clinicians to work with warehouse data without programming knowledge, in an interface built for them.

This is the topic of the next article

In the next article, we’ll see why data quality is a massive undertaking — and why it’s worth the investment.

Key takeaways

The hospital information system (HIS) comprises the software used for care, each with its own database.
A clinical data warehouse (CDW) gathers and transforms this data to make it usable for research.
The ETL process (Extract, Transform, Load) is at the heart of this transformation — and requires ongoing maintenance.
The OMOP data model is the most widely used standard, enabling comparison and reproducibility across centers.