TL;DR
If two people extract the same data from the same chart and get different results, it’s often because the variable wasn’t defined precisely enough. Defining a variable isn’t just about choosing a concept — it’s also about specifying when, over what period, and which value to retain. This four-dimension framework reduces errors and ambiguity, whether the collection is done manually or through a data warehouse.
The problem: one variable, multiple interpretations
Let’s take a simple example: you want to collect the serum creatinine for each patient.
One concept, three different results
A patient is admitted on January 5 at 2:00 PM. They have three creatinine measurements:
- January 5, 4:00 PM: 92 µmol/L
- January 6, 6:00 AM: 118 µmol/L
- January 6, 6:00 PM: 104 µmol/L
Which value do you enter in your spreadsheet? The first one (92)? The highest (118)? The most recent (104)? Or only a value from the first 24 hours, which rules out the third measurement?
Without explicit instructions, each person doing the collection will make their own choice — and that choice will vary from one patient to another, even for the same person. It’s this type of micro-decision that produces heterogeneous data.
The four dimensions of a variable
For a variable to be defined without ambiguity, four elements must be specified.
The concept
What is being measured: heart rate, serum creatinine, primary diagnosis, SOFA score (Sequential Organ Failure Assessment)… This is the most intuitive dimension, the one we usually write down first. The unit of measurement, when relevant, is part of the concept.
The temporal anchor
The reference point in the patient's journey from which the data is sought. For example: ICU admission, start of mechanical ventilation, diagnosis of sepsis…
The time window
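The period, relative to the anchor, within which the data is sought. For example: H0 to H24 after admission (the first 24 hours), or D-365 to H0 (the year before admission, for medical history). Here H and D count hours and days from the anchor, which is H0.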
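As a minimal sketch of how a relative window becomes absolute timestamps, the Python snippet below resolves a window against an anchor. The function names `resolve_window` and `in_window`, the year 2024, and the choice of a half-open interval (start included, end excluded) are assumptions made for illustration. The text above does not say how boundaries are handled, so a real protocol should state it explicitly.

```python
from datetime import datetime, timedelta

def resolve_window(anchor, start_hours, end_hours):
    """Turn a window expressed in hours relative to the anchor into absolute bounds."""
    return anchor + timedelta(hours=start_hours), anchor + timedelta(hours=end_hours)

def in_window(timestamp, anchor, start_hours, end_hours):
    """True if timestamp is in [start, end): start included, end excluded (an assumption)."""
    start, end = resolve_window(anchor, start_hours, end_hours)
    return start <= timestamp < end

# Admission on January 5 at 2:00 PM; the year is hypothetical
admission = datetime(2024, 1, 5, 14, 0)

# H0 to H24 after admission
h0_h24 = resolve_window(admission, 0, 24)

# D-365 to H0 (medical history): 365 days before the anchor, up to the anchor
history = resolve_window(admission, -365 * 24, 0)
```

With these bounds, a measurement taken on January 6 at 6:00 AM falls inside H0 to H24, while one taken at 6:00 PM the same day does not.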
The aggregate function
When several values fall within the window, which one is retained? The first, the last, the maximum, the minimum, the mean, or simply presence/absence…
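The rules listed above can be sketched as a small lookup table of Python functions. The names `AGGREGATES` and `aggregate` are hypothetical, and returning `None` for an empty window is one possible convention among others. The example applies every rule to the three creatinine values from the earlier patient, ignoring the window for now.

```python
from statistics import mean

# Each function receives the values found in the window, in chronological order.
AGGREGATES = {
    "first": lambda values: values[0],
    "last": lambda values: values[-1],
    "max": max,
    "min": min,
    "mean": mean,
    "presence": lambda values: len(values) > 0,
}

def aggregate(name, values):
    """Apply the named rule; an empty window gives None, except for presence (False)."""
    if not values and name != "presence":
        return None
    return AGGREGATES[name](values)

creatinine = [92, 118, 104]  # µmol/L, in chronological order
results = {name: aggregate(name, creatinine) for name in AGGREGATES}
```

Here `results["first"]` is 92, `results["max"]` is 118 and `results["last"]` is 104: the same data yields three different answers depending on the rule.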
A complete example
Let’s go back to serum creatinine. Here’s how to define it unambiguously:
| Dimension | Value |
|---|---|
| Concept | Serum creatinine (µmol/L) |
| Temporal anchor | First ICU admission |
| Time window | H0 to H24 |
| Aggregate function | Maximum |
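With this definition, two people extracting the data from the same chart will get the same result, whether it’s a manual collection or a database query. For the patient above, assuming the January 5 admission is their first ICU admission, the window runs until January 6 at 2:00 PM, so the 104 µmol/L measurement is excluded and the value retained is 118 µmol/L.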
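The definition in the table can be written down as data and applied mechanically. The sketch below is one possible encoding, not a prescribed format: the `VariableDefinition` class, the `extract` function, the concept and event identifiers, and the year 2024 are all assumptions for illustration, as is treating the January 5 admission as the first ICU admission.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Sequence

@dataclass(frozen=True)
class VariableDefinition:
    concept: str            # what is measured (unit included in the identifier)
    anchor_event: str       # temporal anchor
    window_start_h: float   # window start, in hours relative to the anchor
    window_end_h: float     # window end, in hours relative to the anchor
    aggregate: Callable[[Sequence[float]], float]

def extract(definition, events, measurements):
    """Return the aggregated value for one patient, or None if the window is empty.

    events: maps an event name to its timestamp.
    measurements: list of (concept, timestamp, value) tuples.
    """
    anchor = events[definition.anchor_event]
    start = anchor + timedelta(hours=definition.window_start_h)
    end = anchor + timedelta(hours=definition.window_end_h)
    values = [
        value
        for concept, timestamp, value in sorted(measurements, key=lambda m: m[1])
        if concept == definition.concept and start <= timestamp < end
    ]
    return definition.aggregate(values) if values else None

creatinine_d1_max = VariableDefinition(
    "serum_creatinine_umol_l", "first_icu_admission", 0, 24, max
)

events = {"first_icu_admission": datetime(2024, 1, 5, 14, 0)}
measurements = [
    ("serum_creatinine_umol_l", datetime(2024, 1, 5, 16, 0), 92),
    ("serum_creatinine_umol_l", datetime(2024, 1, 6, 6, 0), 118),
    ("serum_creatinine_umol_l", datetime(2024, 1, 6, 18, 0), 104),
]

value = extract(creatinine_d1_max, events, measurements)
print(value)  # 118: the January 6, 6:00 PM value falls at H28, outside the window
```

Swapping `max` for another function, or changing the window bounds, changes the result without touching the extraction logic, which is the point of keeping the four dimensions separate.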
More examples to illustrate
| Variable | Concept | Anchor | Window | Aggregate |
|---|---|---|---|---|
| HR at admission | Heart rate | ICU admission | H0 to H1 | First |
| History of diabetes | Diabetes diagnosis | ICU admission | No lower limit to H0 | Presence (yes/no) |
| Max lactate on D1 | Serum lactate | ICU admission | H0 to H24 | Maximum |
| Norepinephrine during sepsis | Norepinephrine (administration) | Sepsis diagnosis | H0 to H72 | Presence (yes/no) |
| Length of stay | ICU stay | ICU admission | Full duration | Duration (in days) |
The anchor isn't always admission
The temporal anchor depends on the research question. If you’re studying post-intubation complications, the anchor is the start of mechanical ventilation. Medical history is a different case: the anchor usually stays at admission, and it is the window that extends backward, with no time limit.
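To make the effect of the anchor concrete, here is a small sketch in which the same rule (maximum over H0 to H24) is evaluated against two different anchors. The patient timeline and lactate values are invented for illustration only, and `max_in_window` is a hypothetical helper.

```python
from datetime import datetime, timedelta

def max_in_window(anchor, measurements, start_h=0, end_h=24):
    """Maximum value in [anchor + start_h, anchor + end_h), or None if empty."""
    start = anchor + timedelta(hours=start_h)
    end = anchor + timedelta(hours=end_h)
    in_window = [value for timestamp, value in measurements if start <= timestamp < end]
    return max(in_window) if in_window else None

# Hypothetical patient timeline
events = {
    "icu_admission": datetime(2024, 3, 1, 8, 0),
    "mechanical_ventilation_start": datetime(2024, 3, 3, 10, 0),
}

# Hypothetical serum lactate values (mmol/L)
lactate = [
    (datetime(2024, 3, 1, 9, 0), 2.1),
    (datetime(2024, 3, 1, 20, 0), 3.4),
    (datetime(2024, 3, 3, 12, 0), 1.8),
    (datetime(2024, 3, 4, 6, 0), 2.6),
]

by_anchor = {name: max_in_window(when, lactate) for name, when in events.items()}
```

In this invented timeline the result is 3.4 mmol/L when anchored on ICU admission and 2.6 mmol/L when anchored on the start of mechanical ventilation, although the concept, window, and aggregate are identical.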
Why it matters — even for manual collection
One might think this framework is mainly useful for queries against a data warehouse. In reality, it’s just as essential for manual collection.
Without a framework, collection drifts
When collection spans several weeks, implicit choices evolve. The person doing the collection ends up applying different decision rules at the beginning and end of the process — without even realizing it. An explicit framework protects against this drift.
Well-defined variables:
- Reduce errors: everyone knows exactly what to look for
- Improve reproducibility: another researcher can redo the same collection and get the same data
- Facilitate collaboration: between the clinician designing the study and the data scientist writing the query, there is no more ambiguity
- Prepare for automation: a variable defined along these four dimensions can be translated directly into a data warehouse query
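As a sketch of that last point, the snippet below turns a four-dimension definition into a parameterized SQL query and runs it against an in-memory SQLite database. The two-table schema (`event`, `measurement`), the column names, the identifiers, and the year 2024 are all hypothetical. A real data warehouse will have its own data model and SQL dialect.

```python
import sqlite3

# Whitelist: the aggregate name is interpolated into the SQL text,
# so it must never come from free-text input.
SQL_AGGREGATES = {"max": "MAX", "min": "MIN", "mean": "AVG"}

def build_query(aggregate):
    """Build a parameterized query for one variable (hypothetical schema)."""
    sql_function = SQL_AGGREGATES[aggregate]
    return f"""
        SELECT e.patient_id, {sql_function}(m.value) AS value
        FROM event AS e
        JOIN measurement AS m ON m.patient_id = e.patient_id
        WHERE e.event_name = ?
          AND m.concept = ?
          AND m.measured_at >= datetime(e.occurred_at, ?)
          AND m.measured_at <  datetime(e.occurred_at, ?)
        GROUP BY e.patient_id
    """

def extract(connection, concept, anchor_event, start_h, end_h, aggregate):
    """Run the query; the window is passed as SQLite modifiers such as '+24 hours'."""
    params = (anchor_event, concept, f"{start_h:+d} hours", f"{end_h:+d} hours")
    return connection.execute(build_query(aggregate), params).fetchall()

connection = sqlite3.connect(":memory:")
connection.executescript("""
    CREATE TABLE event (patient_id INTEGER, event_name TEXT, occurred_at TEXT);
    CREATE TABLE measurement (patient_id INTEGER, concept TEXT, measured_at TEXT, value REAL);
    INSERT INTO event VALUES (1, 'first_icu_admission', '2024-01-05 14:00:00');
    INSERT INTO measurement VALUES
        (1, 'serum_creatinine', '2024-01-05 16:00:00', 92),
        (1, 'serum_creatinine', '2024-01-06 06:00:00', 118),
        (1, 'serum_creatinine', '2024-01-06 18:00:00', 104);
""")

# Serum creatinine, first ICU admission, H0 to H24, maximum
rows = extract(connection, "serum_creatinine", "first_icu_admission", 0, 24, "max")
```

For the example patient, `rows` contains a single row with the value 118. Two limits of this sketch are worth noting: patients with no measurement in the window are absent from the result because of the inner join, and rules such as "first" or "last" need a different query shape (for instance a window function) than the simple aggregates whitelisted here.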
Linkr’s Study Designer
Linkr offers a dedicated tool for this step: the Study Designer. It guides clinicians through defining each variable along the four dimensions — concept, temporal anchor, time window, aggregate — and automatically generates a structured protocol, exportable in Word, Excel, or JSON.
A tool to structure your protocol
The Study Designer is freely accessible on its dedicated page. It lets you define your variables, inclusion criteria, and analysis plan — all in an interface designed for clinicians.
Key takeaways
- Defining a variable means specifying four dimensions: the concept, the temporal anchor, the time window, and the aggregate function.
- This framework reduces errors and ambiguity, whether the collection is manual or automated.
- Well-defined variables can be directly translated into data warehouse queries.
- Linkr's Study Designer helps structure this definition into an exportable protocol.