TL;DR
Hospitals use many different software systems for care — prescriptions, lab results, imaging, vital signs. Each stores its data in a separate database, often incompatible with the others. A health data warehouse (HDW) gathers, transforms, and structures all this data into a single space designed for research.
The hospital information system
Hospitals use care software to manage prescriptions, lab results, imaging, vital signs, clinical notes, and more. Together, these systems form the hospital information system (HIS).
A single hospital may use dozens of different software systems: one for lab results, another for the pharmacy, another for the ICU, yet another for imaging…
Each system stores its data in its own database. Any data that appears on a care software screen is, in theory, retrievable. But in practice, these databases are not designed to be queried for research purposes.
Scattered data
A single patient may have data in the ICU system, the lab system, the pharmacy software, the imaging system… Gathering all this information manually is a massive undertaking.
What is a health data warehouse?
A health data warehouse (HDW) is a storage system where care data is copied and made available for research.
The HDW does not replace care software — it complements it. Data continues to be generated during care, but a copy is regularly transferred to the warehouse.
A simple analogy
Imagine each care software is a filing cabinet in a different office. The HDW is the central library that gathers a copy of every cabinet in one place, with a unified classification system.
Most university hospitals in France have a data warehouse, though their maturity levels vary. Some integrate over a dozen data feeds, while others are still under construction. The French data protection authority (CNIL) provides a map of health data warehouses across France.
ETL: transforming to unify
The core challenge is that each software system stores data in a different format. Column names, units, and identifiers vary from one system to another. These systems are said to be non-interoperable.
Hospital information system:

- Lab system: patient_id, result_bio, val_num, date_prelvt
- Pharmacy: pat_id, med_prescrit, posologie, voie_adm
- ICU: id_patient, param_vital, horodatage, val
- Imaging: nip, examen, cr_texte, date_exam

Data warehouse (unified format):

- patient_id, visit_id, concept_id, datetime, numeric_value, unit, …

Same columns, same codes, same units, regardless of the source.
To make the data usable, it must be transformed. This process is called ETL (Extract, Transform, Load):
- Extract: data is copied from the care software databases to intermediate servers. At this stage, the data is still in each vendor's proprietary format, i.e. non-interoperable.
- Transform: this is the crucial step. Data is converted into a unified format: same column names, same units, same identifiers. This is what makes it possible to "merge" data from different sources.
- Load: the transformed data feeds into the data warehouse, ready to be queried for research.
This process runs at regular intervals — typically once a week, usually at night or on weekends, when care servers are under less load.
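To make the Transform step concrete, here is a minimal sketch in Python. The source column names are taken from the lab-system example above; the concept codes, unit assignments, and mapping table are hypothetical, since the real mappings depend on each hospital's systems.

```python
# Minimal sketch of an ETL "Transform" step: rows in a lab system's
# proprietary format are mapped to the warehouse's unified format.
# CONCEPT_MAP is a hypothetical local-name -> (concept, unit) table.
CONCEPT_MAP = {
    "creatinine": ("CONCEPT_CREAT", "mg/dL"),
    "crp": ("CONCEPT_CRP", "mg/L"),
}

def transform_lab_row(row: dict) -> dict:
    """Convert one lab-system row into the unified warehouse format."""
    concept_id, unit = CONCEPT_MAP[row["result_bio"]]
    return {
        "patient_id": row["patient_id"],
        "concept_id": concept_id,
        "datetime": row["date_prelvt"],
        "numeric_value": float(row["val_num"]),
        "unit": unit,
    }

source_rows = [
    {"patient_id": "P001", "result_bio": "creatinine",
     "val_num": "1.1", "date_prelvt": "2024-03-02T08:15"},
    {"patient_id": "P001", "result_bio": "crp",
     "val_num": "12", "date_prelvt": "2024-03-02T08:15"},
]

# Every row now uses the same columns, codes, and units.
unified = [transform_lab_row(r) for r in source_rows]
```

In a real warehouse, this logic is one of thousands of mapping rules, which is why the Transform step concentrates most of the engineering effort.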
An ongoing effort
ETL is not a one-time project. Every software update can break a data feed. The warehouse team must continuously monitor data quality and fix issues as soon as they arise.
Data feeds
A data feed refers to the integration of a specific data source into the warehouse. Each feed corresponds to a type of data extracted from a particular software system.
- Demographics: patients, age, sex, hospital stays…
- Lab results: laboratory data such as CBC, metabolic panel, CRP…
- Vital signs: heart rate, blood pressure, SpO2…
- Prescriptions: medications, dosages, routes of administration…
- Coded diagnoses: ICD-10 (International Classification of Diseases), procedure codes…
- Clinical notes: consultations, letters, operative reports…
The maturity of a data warehouse depends on how many feeds are integrated. Building a reliable feed requires significant engineering effort. But once in place, the data is updated regularly and automatically.
A common data model
For data to be truly usable — and comparable across hospitals — it must follow a common data model.
The most widely used model is OMOP (Observational Medical Outcomes Partnership), developed by the OHDSI network. It is a database schema designed for health data, used in over 50 countries, representing nearly one billion patient records worldwide.
But why is a common model necessary? Without one, each hospital organizes its data differently. The problem exists at two levels:
Structure: different schemas
From one warehouse to another, tables and columns have different names for the same thing:
- HDW Hospital A: patient_id, birth_date, concept_code, numeric_value, measurement_dt
- HDW Hospital B: person_id, ddn, code_concept, valeur, date_mesure
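Aligning structures amounts to renaming each hospital's columns to a shared schema. A toy sketch, using the column names from the example above (the shared names themselves are illustrative):

```python
# Sketch: rename each hospital's columns to one shared schema.
# Source column names come from the Hospital A / Hospital B example;
# the target names are illustrative.
RENAME_HOSPITAL_A = {"patient_id": "person_id", "birth_date": "birth_date"}
RENAME_HOSPITAL_B = {"person_id": "person_id", "ddn": "birth_date"}

def harmonize(row: dict, rename: dict) -> dict:
    """Return the row with its columns renamed to the shared schema."""
    return {rename[col]: value for col, value in row.items()}

row_a = {"patient_id": "A-42", "birth_date": "1961-07-01"}
row_b = {"person_id": "B-17", "ddn": "1984-11-23"}

# After harmonization, both rows use the same keys:
# person_id and birth_date.
harmonized_a = harmonize(row_a, RENAME_HOSPITAL_A)
harmonized_b = harmonize(row_b, RENAME_HOSPITAL_B)
```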
Semantics: different codes
Even when the structure looks similar, the codes used to identify the same concept differ:
- Hospital A: heart rate → code 13902, label “Adult HR”
- Hospital B: heart rate → code C90230, label “Heart rate”
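Semantic alignment means mapping each hospital's local code to a single standard concept. A sketch, reusing the local codes from the example above; the standard identifier is a placeholder (in practice it would come from a vocabulary such as SNOMED or LOINC):

```python
# Sketch: map each hospital's local code for "heart rate" to one
# standard concept. Local codes are from the example above; the
# standard identifier is a placeholder, not a real vocabulary code.
LOCAL_TO_STANDARD = {
    ("hospital_a", "13902"): "STANDARD_HEART_RATE",
    ("hospital_b", "C90230"): "STANDARD_HEART_RATE",
}

def standardize(hospital: str, local_code: str) -> str:
    """Translate a hospital-specific code into the standard concept."""
    return LOCAL_TO_STANDARD[(hospital, local_code)]

# Two different local codes, one shared meaning.
assert standardize("hospital_a", "13902") == standardize("hospital_b", "C90230")
```

Building and maintaining these mapping tables is a large part of the warehouse team's work.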
A model like OMOP solves both problems: it enforces identical column names (structural interoperability) and standardized codes from international vocabularies such as SNOMED, LOINC, and ATC (semantic interoperability).
Why OMOP?
OMOP has established itself, alongside FHIR, as the standard for research on health data warehouses. It enables data comparison across hospitals and countries, and makes it possible to reproduce studies on different databases. A script written for one hospital can work on another, without modification. It is the most widely used model today — and the one Linkr uses by default.
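The portability claim can be illustrated with a toy query. The table and column names below (measurement, person_id, measurement_concept_id, value_as_number) follow the OMOP common data model, which is why the same SQL runs at any OMOP site; the concept id and the data are invented for this in-memory example.

```python
import sqlite3

# Toy in-memory OMOP-style database. Table and column names follow the
# OMOP "measurement" table; the concept id below is a placeholder, not
# a real OMOP concept id.
HEART_RATE_CONCEPT_ID = 99999

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE measurement (
        person_id INTEGER,
        measurement_concept_id INTEGER,
        measurement_datetime TEXT,
        value_as_number REAL
    )
""")
conn.executemany(
    "INSERT INTO measurement VALUES (?, ?, ?, ?)",
    [(1, HEART_RATE_CONCEPT_ID, "2024-01-05T09:00", 72.0),
     (1, HEART_RATE_CONCEPT_ID, "2024-01-05T15:00", 88.0),
     (2, HEART_RATE_CONCEPT_ID, "2024-01-06T10:30", 64.0)],
)

# Mean heart rate per patient: this SQL is identical at every OMOP
# site, because the model fixes the table and column names.
rows = conn.execute("""
    SELECT person_id, AVG(value_as_number)
    FROM measurement
    WHERE measurement_concept_id = ?
    GROUP BY person_id
""", (HEART_RATE_CONCEPT_ID,)).fetchall()
```

Only the connection details change from one hospital to the next; the analysis script itself is reusable.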
Accessing data in practice
If your hospital has a data warehouse, you can reach out to the team in charge to request data access. This typically involves the following steps:
1. Write a research protocol: define your objectives, study population, and required variables (this is what the previous articles covered).
2. Obtain regulatory approvals: GDPR (General Data Protection Regulation) compliance, ethics committee review… The warehouse team will guide you through these steps.
3. Access the data: once approvals are granted, data is made available in a secure environment.
But accessing the data is not enough: raw warehouse data contains errors, duplicates, and missing values. Before any analysis can begin, significant data quality work is required.
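The kind of checks involved can be sketched in a few lines. The column names, plausibility bounds, and sample rows are all illustrative:

```python
# Sketch of basic data-quality screening: exact duplicates, missing
# values, and physiologically implausible values. Column names and
# thresholds are illustrative.
rows = [
    {"patient_id": "P1", "heart_rate": 72},
    {"patient_id": "P1", "heart_rate": 72},    # exact duplicate
    {"patient_id": "P2", "heart_rate": None},  # missing value
    {"patient_id": "P3", "heart_rate": 400},   # implausible for an adult
    {"patient_id": "P4", "heart_rate": 65},
]

seen, clean, issues = set(), [], []
for row in rows:
    key = (row["patient_id"], row["heart_rate"])
    if key in seen:
        issues.append(("duplicate", row))
    elif row["heart_rate"] is None:
        issues.append(("missing", row))
    elif not 20 <= row["heart_rate"] <= 300:
        issues.append(("implausible", row))
    else:
        seen.add(key)
        clean.append(row)
```

Real pipelines apply hundreds of such rules, variable by variable, which is why data quality deserves an article of its own.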
Another common barrier is the need for programming skills (R, Python, SQL) to work with the data. This is precisely the barrier Linkr was designed to remove: enabling clinicians to work with warehouse data without programming knowledge, in an interface built for them.
This is the topic of the next article
In the next article, we’ll see why data quality is a massive undertaking — and why it’s worth the investment.
Key takeaways
- The hospital information system (HIS) comprises the software used for care, each with its own database.
- A health data warehouse (HDW) gathers and transforms this data to make it usable for research.
- The ETL process (Extract, Transform, Load) is at the heart of this transformation — and requires ongoing maintenance.
- The OMOP data model is the most widely used standard, enabling comparison and reproducibility across centers.