How is this data organized

TL;DR

Data in a health data warehouse (HDW) is not stored like an Excel spreadsheet. It is spread across multiple linked tables, in long format: one row per measurement, not one row per patient. Running a research project means transforming this data into a wide-format table — the same format as manual data collection in a spreadsheet.

From Excel to a database

When a clinician thinks about data, they often picture an Excel spreadsheet: one row per patient, one column per variable. This is called wide format.

Let’s take 5 ICU patients with some demographic and laboratory data:

The clinician’s spreadsheet

patient_id	birth_date	sex	creatinine (µmol/L)	hemoglobin (g/dL)	leukocytes (G/L)
1	1952-06-14	F	142	9.2	14.2
2	1979-03-22	M	88	/	7.8
3	1961-11-08	M	210	8.5	/
4	1943-01-30	F	95	/	11.3
5	1968-09-05	F	73	12.4	6.9

This table is readable and familiar. But it poses three problems when data becomes large-scale:

Empty cells

If a patient didn't have a leukocyte count, the cell stays empty. With hundreds of lab parameters, the table fills up with blanks.

Redundancy

If a patient has two hospital stays, their demographic data (date of birth, sex) is repeated on every row. That's wasted space.

Rigidity

If you want to add a new lab parameter (e.g. CRP), you need a new column. With thousands of possible parameters, this doesn't scale.

The solution? Separate data into multiple tables, each storing one type of information. This is exactly the principle behind a database.

The Excel analogy

Imagine that an Excel sheet is a table and the entire workbook is a database. A health data warehouse works the same way — but with dozens of interconnected tables instead of a single sheet.

Long format vs wide format

The distinction between long format and wide format is essential to understanding how data is organized in a HDW.

Wide format: the clinician’s spreadsheet

This is the format you already know: one row per patient, one column per variable. It’s the format of a data collection spreadsheet in Excel, like the table above.

Long format: warehouse storage

In a HDW, data is not stored this way. It’s in long format: one row per measurement, with columns indicating which parameter was measured, its value, unit, and timestamp.

Here are the same laboratory data in both formats:

Wide format (spreadsheet)

id	creat.	hb	wbc
1	142	9.2	14.2
2	88	/	7.8
3	210	8.5	/
4	95	/	11.3
5	73	12.4	6.9

5 rows · empty cells

Long format (warehouse)

id	parameter	value	unit
1	Creatinine	142	µmol/L
1	Hemoglobin	9.2	g/dL
1	Leukocytes	14.2	G/L
2	Creatinine	88	µmol/L
2	Leukocytes	7.8	G/L
…	…	…	…

12 rows · no empty cells

On the left, the wide format: 5 rows, one per patient. Easy to read, but rigid — adding a new parameter (e.g. CRP) would require a new column.

On the right, the long format: 12 rows, one per measurement. More rows, but no empty cells, and you can add as many parameters as you want without changing the table structure.

Why long format?

Long format solves the three problems mentioned above:

No empty cells: if a patient has no leukocyte count, the row simply doesn’t exist
Unlimited parameters: adding a new lab test (CRP, troponin…) doesn’t require a new column, just new rows
Temporal data: each row can have its own timestamp, allowing multiple measurements of the same parameter over time

The star schema

In a HDW, data is not stored in one giant table. It’s split into specialized tables, each storing one type of information: a table for patients, one for visits, one for lab results, one for prescriptions, etc.

These tables are linked through shared identifiers: patient_id identifies a unique patient, and visit_id identifies a specific hospital stay.

This is called a star schema: the patient table sits at the center, with other tables orbiting around it.

Patients

PKpatient_id

birth_date

sex

Visits

PKvisit_id

FKpatient_id

admission

discharge

unit

Lab results

FKpatient_id

FKvisit_id

parameter

value

unit

datetime

Prescriptions

FKpatient_id

FKvisit_id

medication

dosage

datetime

Diagnoses

FKpatient_id

FKvisit_id

icd10_code

label

Vital signs

FKpatient_id

FKvisit_id

parameter

value

datetime

PK = primary key (unique identifier) · FK = foreign key (link to another table)

Each satellite table is in long format: it only contains information from its domain, with one row per measurement or event. It’s the shared identifiers (patient_id, visit_id) that make it possible to link tables together.

Here’s what three of these tables actually look like for our 5 patients. Click the tabs to navigate:

patient_id	birth_date	sex
1	1952-06-14	F
2	1979-03-22	M
3	1961-11-08	M
4	1943-01-30	F
5	1968-09-05	F

visit_id	patient_id	admission	discharge
101	1	03/10	03/18
102	2	03/12	03/15
103	3	03/14	03/22
104	4	03/15	03/28
105	5	03/16	03/20

patient_id	visit_id	datetime	parameter	value
1	101	2024-03-10 08:15	Creatinine	142
1	101	2024-03-10 08:15	Hemoglobin	9.2
1	101	2024-03-10 08:15	Leukocytes	14.2
2	102	2024-03-12 06:30	Creatinine	88
…	…	…	…	…

Notice that patient_id appears in all three tables: it’s what allows you to link information for the same patient. visit_id connects each measurement to its corresponding hospital stay.

patient_id and visit_id

These identifiers are the thread that connects all warehouse tables. patient_id identifies a unique patient. visit_id identifies a specific hospital stay. When you merge tables, you always merge on these shared identifiers.

Joins: merging tables together

Data is now spread across multiple tables. How do you bring it back together into a usable table?

This is what’s called a join: merging two tables by matching rows that share the same identifier.

Let’s take a concrete example. We want to get each patient’s date of birth (from the Patients table) and their lab results (from the Lab results table). We merge both tables on patient_id:

Patients

patient_id	birth_date
1	1952-06-14
2	1979-03-22
3	1961-11-08

patient_id

Lab results

patient_id	datetime	parameter	value
1	2024-03-10 08:15	Creatinine	142
1	2024-03-10 08:15	Hemoglobin	9.2
2	2024-03-12 06:30	Creatinine	88

Result

patient_id	birth_date	datetime	parameter	value
1	1952-06-14	2024-03-10 08:15	Creatinine	142
1	1952-06-14	2024-03-10 08:15	Hemoglobin	9.2
2	1979-03-22	2024-03-12 06:30	Creatinine	88

The join combined information from both tables into one, matching rows through patient_id. In practice, these operations are written in a language called SQL (Structured Query Language).

From long to wide format: the research pipeline

We’ve seen that the HDW stores data in long format, spread across linked tables. But to analyze this data, the clinician needs a wide-format table: one row per patient, one column per variable.

This transformation from long to wide format is the core of a research project on a HDW. And this is precisely where the four dimensions defined in article 2 come in: the concept, the temporal anchor, the time window, and the aggregate function.

A concrete example

From the long-format lab results table, we can build the variable “maximum creatinine in the first 24 hours” by applying the 4 dimensions:

Concept: creatinine (µmol/L)
Anchor: ICU admission
Window: H0 to H24
Aggregate: maximum

The result is a single value per patient — one column in the analysis table.

By repeating this operation for each variable (hemoglobin at admission, max leukocytes at H24, date of birth, sex…), you progressively build the wide-format analysis table.

And here’s the key insight: this analysis table has exactly the same format as the one you’d get from manual spreadsheet collection. Both paths — manual collection and warehouse extraction — converge to the same result.

Manual collection

from the patient chart

4 dimensions

Warehouse (long format)

linked tables, 1 row / measurement

4 dimensions

Analysis table (wide format)

1 row per patient · 1 column per variable

= same format

Here’s the analysis table you get — whether from manual collection or from the HDW:

patient_id	birth_date	sex	creat_max_H24	hb_admission	wbc_max_H24
1	1952-06-14	F	142	9.2	14.2
2	1979-03-22	M	88	13.1	7.8
3	1961-11-08	M	210	8.5	18.9
4	1943-01-30	F	95	10.8	11.3
5	1968-09-05	F	73	12.4	6.9

Column names reflect the 4 dimensions: concept + aggregate + window (e.g. creat_max_H24 = maximum creatinine, H0-H24).

This is what Linkr's Lab does

Linkr’s Lab module works with wide-format datasets: one row per patient, one column per variable. The transformation from long to wide format is handled upstream by the data scientist’s scripts. The clinician works directly on the analysis table, in the same format as a spreadsheet.

Key takeaways

Warehouse data is organized in linked tables, connected by shared identifiers (patient_id, visit_id).
The warehouse stores data in long format (one row per measurement), while clinicians work in wide format (one row per patient).
A research project consists of transforming long format into wide format, by applying the four dimensions (concept, anchor, window, aggregate).
The result is identical to manual spreadsheet collection — both approaches converge to the same analysis format.

Next article : Speaking the same language: data models and terminologies