TL;DR
Data in a health data warehouse (HDW) is not stored like an Excel spreadsheet. It is spread across multiple linked tables, in long format: one row per measurement, not one row per patient. Running a research project means transforming this data into a wide-format table — the same format as manual data collection in a spreadsheet.
From Excel to a database
When a clinician thinks about data, they often picture an Excel spreadsheet: one row per patient, one column per variable. This is called wide format.
Let’s take 5 ICU patients with some demographic and laboratory data:
The clinician’s spreadsheet
| patient_id | birth_date | sex | creatinine (µmol/L) | hemoglobin (g/dL) | leukocytes (G/L) |
|---|---|---|---|---|---|
| 1 | 1952-06-14 | F | 142 | 9.2 | 14.2 |
| 2 | 1979-03-22 | M | 88 | / | 7.8 |
| 3 | 1961-11-08 | M | 210 | 8.5 | / |
| 4 | 1943-01-30 | F | 95 | / | 11.3 |
| 5 | 1968-09-05 | F | 73 | 12.4 | 6.9 |
This table is readable and familiar. But it poses three problems when data becomes large-scale:
Empty cells
If a patient didn't have a leukocyte count, the cell stays empty. With hundreds of lab parameters, the table fills up with blanks.
Redundancy
If a patient has two hospital stays, their demographic data (date of birth, sex) is repeated on every row. That's wasted space.
Rigidity
If you want to add a new lab parameter (e.g. CRP), you need a new column. With thousands of possible parameters, this doesn't scale.
The solution? Separate data into multiple tables, each storing one type of information. This is exactly the principle behind a database.
The Excel analogy
Imagine that an Excel sheet is a table and the entire workbook is a database. A health data warehouse works the same way — but with dozens of interconnected tables instead of a single sheet.
Long format vs wide format
The distinction between long format and wide format is essential to understanding how data is organized in a HDW.
Wide format: the clinician’s spreadsheet
This is the format you already know: one row per patient, one column per variable. It’s the format of a data collection spreadsheet in Excel, like the table above.
Long format: warehouse storage
In a HDW, data is not stored this way. It’s in long format: one row per measurement, with columns indicating which parameter was measured, its value, unit, and timestamp.
Here are the same laboratory data in both formats:
Wide format (spreadsheet)
| id | creat. | hb | wbc |
|---|---|---|---|
| 1 | 142 | 9.2 | 14.2 |
| 2 | 88 | / | 7.8 |
| 3 | 210 | 8.5 | / |
| 4 | 95 | / | 11.3 |
| 5 | 73 | 12.4 | 6.9 |
5 rows · empty cells
Long format (warehouse)
| id | parameter | value | unit |
|---|---|---|---|
| 1 | Creatinine | 142 | µmol/L |
| 1 | Hemoglobin | 9.2 | g/dL |
| 1 | Leukocytes | 14.2 | G/L |
| 2 | Creatinine | 88 | µmol/L |
| 2 | Leukocytes | 7.8 | G/L |
| … | … | … | … |
12 rows · no empty cells
On the left, the wide format: 5 rows, one per patient. Easy to read, but rigid — adding a new parameter (e.g. CRP) would require a new column.
On the right, the long format: 12 rows, one per measurement. More rows, but no empty cells, and you can add as many parameters as you want without changing the table structure.
Why long format?
Long format solves the three problems mentioned above:
- No empty cells: if a patient has no leukocyte count, the row simply doesn’t exist
- Unlimited parameters: adding a new lab test (CRP, troponin…) doesn’t require a new column, just new rows
- Temporal data: each row can have its own timestamp, allowing multiple measurements of the same parameter over time
The star schema
In a HDW, data is not stored in one giant table. It’s split into specialized tables, each storing one type of information: a table for patients, one for visits, one for lab results, one for prescriptions, etc.
These tables are linked through shared identifiers: patient_id identifies a unique patient, and visit_id identifies a specific hospital stay.
This is called a star schema: the patient table sits at the center, with other tables orbiting around it.
PK = primary key (unique identifier) · FK = foreign key (link to another table)
Each satellite table is in long format: it only contains information from its domain, with one row per measurement or event. It’s the shared identifiers (patient_id, visit_id) that make it possible to link tables together.
Here’s what three of these tables actually look like for our 5 patients. Click the tabs to navigate:
| patient_id | birth_date | sex |
|---|---|---|
| 1 | 1952-06-14 | F |
| 2 | 1979-03-22 | M |
| 3 | 1961-11-08 | M |
| 4 | 1943-01-30 | F |
| 5 | 1968-09-05 | F |
| visit_id | patient_id | admission | discharge |
|---|---|---|---|
| 101 | 1 | 03/10 | 03/18 |
| 102 | 2 | 03/12 | 03/15 |
| 103 | 3 | 03/14 | 03/22 |
| 104 | 4 | 03/15 | 03/28 |
| 105 | 5 | 03/16 | 03/20 |
| patient_id | visit_id | datetime | parameter | value |
|---|---|---|---|---|
| 1 | 101 | 2024-03-10 08:15 | Creatinine | 142 |
| 1 | 101 | 2024-03-10 08:15 | Hemoglobin | 9.2 |
| 1 | 101 | 2024-03-10 08:15 | Leukocytes | 14.2 |
| 2 | 102 | 2024-03-12 06:30 | Creatinine | 88 |
| … | … | … | … | … |
Notice that patient_id appears in all three tables: it’s what allows you to link information for the same patient. visit_id connects each measurement to its corresponding hospital stay.
patient_id and visit_id
These identifiers are the thread that connects all warehouse tables. patient_id identifies a unique patient. visit_id identifies a specific hospital stay. When you merge tables, you always merge on these shared identifiers.
Joins: merging tables together
Data is now spread across multiple tables. How do you bring it back together into a usable table?
This is what’s called a join: merging two tables by matching rows that share the same identifier.
Let’s take a concrete example. We want to get each patient’s date of birth (from the Patients table) and their lab results (from the Lab results table). We merge both tables on patient_id:
Patients
| patient_id | birth_date |
|---|---|
| 1 | 1952-06-14 |
| 2 | 1979-03-22 |
| 3 | 1961-11-08 |
Lab results
| patient_id | datetime | parameter | value |
|---|---|---|---|
| 1 | 2024-03-10 08:15 | Creatinine | 142 |
| 1 | 2024-03-10 08:15 | Hemoglobin | 9.2 |
| 2 | 2024-03-12 06:30 | Creatinine | 88 |
Result
| patient_id | birth_date | datetime | parameter | value |
|---|---|---|---|---|
| 1 | 1952-06-14 | 2024-03-10 08:15 | Creatinine | 142 |
| 1 | 1952-06-14 | 2024-03-10 08:15 | Hemoglobin | 9.2 |
| 2 | 1979-03-22 | 2024-03-12 06:30 | Creatinine | 88 |
The join combined information from both tables into one, matching rows through patient_id. In practice, these operations are written in a language called SQL (Structured Query Language).
From long to wide format: the research pipeline
We’ve seen that the HDW stores data in long format, spread across linked tables. But to analyze this data, the clinician needs a wide-format table: one row per patient, one column per variable.
This transformation from long to wide format is the core of a research project on a HDW. And this is precisely where the four dimensions defined in article 2 come in: the concept, the temporal anchor, the time window, and the aggregate function.
A concrete example
From the long-format lab results table, we can build the variable “maximum creatinine in the first 24 hours” by applying the 4 dimensions:
- Concept: creatinine (µmol/L)
- Anchor: ICU admission
- Window: H0 to H24
- Aggregate: maximum
The result is a single value per patient — one column in the analysis table.
By repeating this operation for each variable (hemoglobin at admission, max leukocytes at H24, date of birth, sex…), you progressively build the wide-format analysis table.
And here’s the key insight: this analysis table has exactly the same format as the one you’d get from manual spreadsheet collection. Both paths — manual collection and warehouse extraction — converge to the same result.
Manual collection
from the patient chart
4 dimensions
Warehouse (long format)
linked tables, 1 row / measurement
4 dimensions
Analysis table (wide format)
1 row per patient · 1 column per variable
= same format
Here’s the analysis table you get — whether from manual collection or from the HDW:
| patient_id | birth_date | sex | creat_max_H24 | hb_admission | wbc_max_H24 |
|---|---|---|---|---|---|
| 1 | 1952-06-14 | F | 142 | 9.2 | 14.2 |
| 2 | 1979-03-22 | M | 88 | 13.1 | 7.8 |
| 3 | 1961-11-08 | M | 210 | 8.5 | 18.9 |
| 4 | 1943-01-30 | F | 95 | 10.8 | 11.3 |
| 5 | 1968-09-05 | F | 73 | 12.4 | 6.9 |
Column names reflect the 4 dimensions: concept + aggregate + window (e.g. creat_max_H24 = maximum creatinine, H0-H24).
This is what Linkr's Lab does
Linkr’s Lab module works with wide-format datasets: one row per patient, one column per variable. The transformation from long to wide format is handled upstream by the data scientist’s scripts. The clinician works directly on the analysis table, in the same format as a spreadsheet.
Key takeaways
- Warehouse data is organized in linked tables, connected by shared identifiers (patient_id, visit_id).
- The warehouse stores data in long format (one row per measurement), while clinicians work in wide format (one row per patient).
- A research project consists of transforming long format into wide format, by applying the four dimensions (concept, anchor, window, aggregate).
- The result is identical to manual spreadsheet collection — both approaches converge to the same analysis format.