TL;DR
Hospitals use many different software systems for care — prescriptions, lab results, imaging, vital signs. Each stores its data in a separate database, often incompatible with the others. A health data warehouse (HDW) gathers, transforms, and structures all this data into a single space designed for research.
The hospital information system
Hospitals use care software to manage prescriptions, lab results, imaging, vital signs, clinical notes, and more. Together, these systems form the hospital information system (HIS).
A single hospital may use dozens of different software systems: one for lab results, another for the pharmacy, another for the ICU, yet another for imaging…
Each system stores its data in its own database. Any data that appears on a care software screen is, in theory, retrievable. But in practice, these databases are not designed to be queried for research purposes.
Scattered data
A single patient may have data in the ICU system, the lab system, the pharmacy software, the imaging system… Gathering all this information manually is a massive undertaking.
What is a health data warehouse?
A health data warehouse (HDW) is a storage system where care data is copied and made available for research.
The HDW does not replace care software — it complements it. Data continues to be generated during care, but a copy is regularly transferred to the warehouse.
A simple analogy
Imagine each care software is a filing cabinet in a different office. The HDW is the central library that gathers a copy of every cabinet in one place, with a unified classification system.
Most university hospitals in France have a data warehouse, though their maturity levels vary. Some integrate over a dozen data feeds, while others are still under construction. The French data protection authority (CNIL) provides a map of health data warehouses across France.
ETL: transforming to unify
The core challenge is that each software system stores data in a different format. Column names, units, and identifiers vary from one system to another. These systems are said to be non-interoperable.
Hospital information system:

- Lab system: patient_id, result_bio, val_num, date_prelvt
- Pharmacy: pat_id, med_prescrit, posologie, voie_adm
- ICU: id_patient, param_vital, horodatage, val
- Imaging: nip, examen, cr_texte, date_exam

Data warehouse (unified format):

- patient_id, visit_id, concept_id, datetime, numeric_value, unit, …

Same columns, same codes, same units, regardless of the source.
To make the data usable, it must be transformed. This process is called ETL (Extract, Transform, Load):
- Extract: data is copied from the care software databases to intermediate servers. At this stage, the data is still in each vendor's proprietary format, i.e. non-interoperable.
- Transform: this is the crucial step. Data is converted into a unified format: same column names, same units, same identifiers. This is what makes it possible to "merge" data from different sources.
- Load: the transformed data feeds into the data warehouse, ready to be queried for research.
This process runs at regular intervals — typically once a week, usually at night or on weekends, when care servers are under less load.
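To make the Transform step concrete, here is a minimal sketch in Python. The source column names are taken from the lab-system example above; the concept codes, unit assignments, and mapping table are hypothetical, since the real mappings depend on each hospital's systems.

```python
# Minimal sketch of an ETL "Transform" step: rows in a lab system's
# proprietary format are mapped to the warehouse's unified format.
# CONCEPT_MAP is a hypothetical local-name -> (concept, unit) table.
CONCEPT_MAP = {
    "creatinine": ("CONCEPT_CREAT", "mg/dL"),
    "crp": ("CONCEPT_CRP", "mg/L"),
}

def transform_lab_row(row: dict) -> dict:
    """Convert one lab-system row into the unified warehouse format."""
    concept_id, unit = CONCEPT_MAP[row["result_bio"]]
    return {
        "patient_id": row["patient_id"],
        "concept_id": concept_id,
        "datetime": row["date_prelvt"],
        "numeric_value": float(row["val_num"]),
        "unit": unit,
    }

source_rows = [
    {"patient_id": "P001", "result_bio": "creatinine",
     "val_num": "1.1", "date_prelvt": "2024-03-02T08:15"},
    {"patient_id": "P001", "result_bio": "crp",
     "val_num": "12", "date_prelvt": "2024-03-02T08:15"},
]

# Every row now uses the same columns, codes, and units.
unified = [transform_lab_row(r) for r in source_rows]
```

In a real warehouse, this logic is one of thousands of mapping rules, which is why the Transform step concentrates most of the engineering effort.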
An ongoing effort
ETL is not a one-time project. Every software update can break a data feed. The warehouse team must continuously monitor data quality and fix issues as soon as they arise.
Data feeds
A data feed refers to the integration of a specific data source into the warehouse. Each feed corresponds to a type of data extracted from a particular software system.
- Demographics: patients, age, sex, hospital stays…
- Lab results: laboratory data such as CBC, metabolic panel, CRP…
- Vital signs: heart rate, blood pressure, SpO2…
- Prescriptions: medications, dosages, routes of administration…
- Coded diagnoses: ICD-10 (International Classification of Diseases), procedure codes…
- Clinical notes: consultations, letters, operative reports…
The maturity of a data warehouse depends on how many feeds are integrated. Building a reliable feed requires significant engineering effort. But once in place, the data is updated regularly and automatically.
A common data model
For data to be truly usable — and comparable across hospitals — it must follow a common data model.
The most widely used model is OMOP (Observational Medical Outcomes Partnership), developed by the OHDSI network. It is a database schema designed for health data, used in over 50 countries, representing nearly one billion patient records worldwide.
But why is a common model necessary? Without one, each hospital organizes its data differently. The problem exists at two levels:
Structure: different schemas
From one warehouse to another, tables and columns have different names for the same thing:
- HDW Hospital A: patient_id, birth_date, concept_code, numeric_value, measurement_dt
- HDW Hospital B: person_id, ddn, code_concept, valeur, date_mesure
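Aligning structures amounts to renaming each hospital's columns to a shared schema. A toy sketch, using the column names from the example above (the shared names themselves are illustrative):

```python
# Sketch: rename each hospital's columns to one shared schema.
# Source column names come from the Hospital A / Hospital B example;
# the target names are illustrative.
RENAME_HOSPITAL_A = {"patient_id": "person_id", "birth_date": "birth_date"}
RENAME_HOSPITAL_B = {"person_id": "person_id", "ddn": "birth_date"}

def harmonize(row: dict, rename: dict) -> dict:
    """Return the row with its columns renamed to the shared schema."""
    return {rename[col]: value for col, value in row.items()}

row_a = {"patient_id": "A-42", "birth_date": "1961-07-01"}
row_b = {"person_id": "B-17", "ddn": "1984-11-23"}

# After harmonization, both rows use the same keys:
# person_id and birth_date.
harmonized_a = harmonize(row_a, RENAME_HOSPITAL_A)
harmonized_b = harmonize(row_b, RENAME_HOSPITAL_B)
```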
Semantics: different codes
Even when the structure looks similar, the codes used to identify the same concept differ:
- Hospital A: heart rate → code 13902, label “Adult HR”
- Hospital B: heart rate → code C90230, label “Heart rate”
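Semantic alignment means mapping each hospital's local code to a single standard concept. A sketch, reusing the local codes from the example above; the standard identifier is a placeholder (in practice it would come from a vocabulary such as SNOMED or LOINC):

```python
# Sketch: map each hospital's local code for "heart rate" to one
# standard concept. Local codes are from the example above; the
# standard identifier is a placeholder, not a real vocabulary code.
LOCAL_TO_STANDARD = {
    ("hospital_a", "13902"): "STANDARD_HEART_RATE",
    ("hospital_b", "C90230"): "STANDARD_HEART_RATE",
}

def standardize(hospital: str, local_code: str) -> str:
    """Translate a hospital-specific code into the standard concept."""
    return LOCAL_TO_STANDARD[(hospital, local_code)]

# Two different local codes, one shared meaning.
assert standardize("hospital_a", "13902") == standardize("hospital_b", "C90230")
```

Building and maintaining these mapping tables is a large part of the warehouse team's work.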
A model like OMOP solves both problems: it enforces identical column names (structural interoperability) and standardized codes from international vocabularies such as SNOMED, LOINC, and ATC (semantic interoperability).
Why OMOP?
OMOP has established itself, alongside FHIR, as the standard for research on health data warehouses. It enables data comparison across hospitals and countries, and makes it possible to reproduce studies on different databases. A script written for one hospital can work on another, without modification. It is the most widely used model today — and the one Linkr uses by default.
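The portability claim can be illustrated with a toy query. The table and column names below (measurement, person_id, measurement_concept_id, value_as_number) follow the OMOP common data model, which is why the same SQL runs at any OMOP site; the concept id and the data are invented for this in-memory example.

```python
import sqlite3

# Toy in-memory OMOP-style database. Table and column names follow the
# OMOP "measurement" table; the concept id below is a placeholder, not
# a real OMOP concept id.
HEART_RATE_CONCEPT_ID = 99999

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE measurement (
        person_id INTEGER,
        measurement_concept_id INTEGER,
        measurement_datetime TEXT,
        value_as_number REAL
    )
""")
conn.executemany(
    "INSERT INTO measurement VALUES (?, ?, ?, ?)",
    [(1, HEART_RATE_CONCEPT_ID, "2024-01-05T09:00", 72.0),
     (1, HEART_RATE_CONCEPT_ID, "2024-01-05T15:00", 88.0),
     (2, HEART_RATE_CONCEPT_ID, "2024-01-06T10:30", 64.0)],
)

# Mean heart rate per patient: this SQL is identical at every OMOP
# site, because the model fixes the table and column names.
rows = conn.execute("""
    SELECT person_id, AVG(value_as_number)
    FROM measurement
    WHERE measurement_concept_id = ?
    GROUP BY person_id
""", (HEART_RATE_CONCEPT_ID,)).fetchall()
```

Only the connection details change from one hospital to the next; the analysis script itself is reusable.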
Accessing data in practice
If your hospital has a data warehouse, you can reach out to the team in charge to request data access. This typically involves the following steps:
1. Write a research protocol: define your objectives, study population, and required variables (this is what the previous articles covered).
2. Obtain regulatory approvals: GDPR (General Data Protection Regulation) compliance, ethics committee review… The warehouse team will guide you through these steps.
3. Access the data: once approvals are granted, data is made available in a secure environment.
But accessing the data is not enough: raw warehouse data contains errors, duplicates, and missing values. Before any analysis can begin, significant data quality work is required.
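The kind of checks involved can be sketched in a few lines. The column names, plausibility bounds, and sample rows are all illustrative:

```python
# Sketch of basic data-quality screening: exact duplicates, missing
# values, and physiologically implausible values. Column names and
# thresholds are illustrative.
rows = [
    {"patient_id": "P1", "heart_rate": 72},
    {"patient_id": "P1", "heart_rate": 72},    # exact duplicate
    {"patient_id": "P2", "heart_rate": None},  # missing value
    {"patient_id": "P3", "heart_rate": 400},   # implausible for an adult
    {"patient_id": "P4", "heart_rate": 65},
]

seen, clean, issues = set(), [], []
for row in rows:
    key = (row["patient_id"], row["heart_rate"])
    if key in seen:
        issues.append(("duplicate", row))
    elif row["heart_rate"] is None:
        issues.append(("missing", row))
    elif not 20 <= row["heart_rate"] <= 300:
        issues.append(("implausible", row))
    else:
        seen.add(key)
        clean.append(row)
```

Real pipelines apply hundreds of such rules, variable by variable, which is why data quality deserves an article of its own.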
Another common barrier is the need for programming skills (R, Python, SQL) to work with the data. This is precisely the barrier Linkr was designed to remove: enabling clinicians to work with warehouse data without programming knowledge, in an interface built for them.
This is the topic of the next article
In the next article, we’ll see why data quality is a massive undertaking — and why it’s worth the investment.
Key takeaways
- The hospital information system (HIS) comprises the software used for care, each with its own database.
- A health data warehouse (HDW) gathers and transforms this data to make it usable for research.
- The ETL process (Extract, Transform, Load) is at the heart of this transformation — and requires ongoing maintenance.
- The OMOP data model is the most widely used standard, enabling comparison and reproducibility across centers.