Defining your variables well: the key to reliable data collection

TL;DR

If two people extract the same data from the same chart and get different results, it’s often because the variable wasn’t defined precisely enough. Defining a variable isn’t just about choosing a concept — it’s also about specifying when, over what period, and which value to retain. This four-dimension framework reduces errors and ambiguity, whether the collection is done manually or through a data warehouse.

The problem: one variable, multiple interpretations

Let’s take a simple example: you want to collect the serum creatinine for each patient.

One concept, three different results

A patient is admitted on January 5 at 2:00 PM. They have three creatinine measurements:

January 5, 4:00 PM: 92 µmol/L
January 6, 6:00 AM: 118 µmol/L
January 6, 6:00 PM: 104 µmol/L

Which value do you enter in your spreadsheet? The first one? The highest? The one from the first 24 hours?

Without explicit instructions, each person doing the collection will make their own choice — and that choice will vary from one patient to another, even for the same person. It’s this type of micro-decision that produces heterogeneous data.
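
To make the ambiguity concrete, here is a minimal Python sketch applying the three readings above to the same three measurements. The variable names are illustrative, and the year (2024) is an arbitrary assumption, since the example only gives the day and time.

```python
from datetime import datetime, timedelta

# Admission time and measurements from the example above.
# The year is an arbitrary assumption; the example gives only day and time.
admission = datetime(2024, 1, 5, 14, 0)
measurements = [
    (datetime(2024, 1, 5, 16, 0), 92),   # µmol/L
    (datetime(2024, 1, 6, 6, 0), 118),
    (datetime(2024, 1, 6, 18, 0), 104),
]

# Reading 1: the first value recorded (tuples sort by timestamp first).
first_value = min(measurements)[1]

# Reading 2: the highest value, whenever it was measured.
highest_value = max(value for _, value in measurements)

# Reading 3: "the one from the first 24 hours". Two values qualify,
# so this reading is itself ambiguous.
in_first_24h = [
    value
    for taken_at, value in measurements
    if admission <= taken_at < admission + timedelta(hours=24)
]

print(first_value, highest_value, in_first_24h)  # 92 118 [92, 118]
```

Three reasonable readings give three different answers, and the third does not even settle on a single value.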

The four dimensions of a variable

For a variable to be defined without ambiguity, four elements must be specified.

The concept

What is being measured: heart rate, serum creatinine, primary diagnosis, SOFA score (Sequential Organ Failure Assessment)… This is the most intuitive dimension, the one we usually write down first. The unit of measurement, when relevant, is part of the concept.

The temporal anchor

The reference point in the patient's journey from which the data is sought. For example: ICU admission, start of mechanical ventilation, diagnosis of sepsis…

The time window

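The period, relative to the anchor, within which the data is searched for. In this notation, H0 is the anchor itself, H24 is 24 hours after it, and D-365 is 365 days before it. For example: H0 to H24 after admission, or D-365 to H0 (for medical history).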

The aggregate function

When multiple values exist within the window, which one should be retained? The first, the last, the maximum, the minimum, the mean, or simply presence/absence.

A complete example

Let’s go back to serum creatinine. Here’s how to define it unambiguously:

Dimension	Value
Concept	Serum creatinine (µmol/L)
Temporal anchor	First ICU admission
Time window	H0 to H24
Aggregate function	Maximum

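With this definition, two people extracting the data from the same chart will get the same result, whether the collection is manual or done through a database query. For the patient above, assuming the January 5 admission is their first ICU admission, the window runs from January 5, 2:00 PM to January 6, 2:00 PM, so the value retained is 118 µmol/L; the measurement taken on January 6 at 6:00 PM falls outside the window.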
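
The same definition can be written down as a small data structure and applied mechanically. The sketch below is only one possible encoding: the names VariableDefinition and extract are hypothetical, the year is arbitrary, the admission in the example is assumed to be the first ICU admission, and the window is treated as half-open (H0 included, H24 excluded), which the definition above does not specify.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Sequence


@dataclass(frozen=True)
class VariableDefinition:
    """Hypothetical container for the four dimensions of a variable."""
    concept: str
    anchor: str
    window_start_h: float  # hours relative to the anchor (H0 = 0)
    window_end_h: float
    aggregate: Callable[[Sequence[float]], float]


def extract(definition, anchor_time, measurements):
    """Aggregate the values inside the window; None if the window is empty."""
    start = anchor_time + timedelta(hours=definition.window_start_h)
    end = anchor_time + timedelta(hours=definition.window_end_h)
    # Assumption: half-open window [start, end).
    values = [value for taken_at, value in measurements if start <= taken_at < end]
    return definition.aggregate(values) if values else None


creatinine_max_d1 = VariableDefinition(
    concept="Serum creatinine (µmol/L)",
    anchor="First ICU admission",
    window_start_h=0,
    window_end_h=24,
    aggregate=max,
)

icu_admission = datetime(2024, 1, 5, 14, 0)  # year is an arbitrary assumption
measurements = [
    (datetime(2024, 1, 5, 16, 0), 92),
    (datetime(2024, 1, 6, 6, 0), 118),
    (datetime(2024, 1, 6, 18, 0), 104),  # H28: outside the window
]

result = extract(creatinine_max_d1, icu_admission, measurements)
print(result)  # 118
```

Whether the boundaries of the window are inclusive is exactly the kind of detail worth writing into the protocol, since a measurement taken at H24 sharp would otherwise be another micro-decision.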

More examples to illustrate

Variable	Concept	Anchor	Window	Aggregate
HR at admission	Heart rate	ICU admission	H0 to H1	First
History of diabetes	Diabetes diagnosis	ICU admission	No lower limit to H0	Presence (yes/no)
Max lactate on D1	Serum lactate	ICU admission	H0 to H24	Maximum
Norepinephrine during sepsis	Norepinephrine (administration)	Sepsis diagnosis	H0 to H72	Presence (yes/no)
Length of stay	ICU stay	ICU admission	Full duration	Duration (in days)
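
Two rows of this table can be sketched in Python to show how the window and the aggregate combine. Everything here is illustrative: the helper in_window, the AGGREGATES mapping, and the heart rate and diagnosis data are hypothetical, and an unbounded side of the window is represented by None.

```python
from datetime import datetime


def in_window(events, anchor, start_h, end_h):
    """Keep events with start_h <= offset < end_h (hours from anchor).

    A bound set to None means "no limit" on that side.
    """
    kept = []
    for taken_at, value in events:
        offset_h = (taken_at - anchor).total_seconds() / 3600
        if start_h is not None and offset_h < start_h:
            continue
        if end_h is not None and offset_h >= end_h:
            continue
        kept.append((taken_at, value))
    return sorted(kept, key=lambda row: row[0])  # chronological order


# Each aggregate receives the chronologically sorted rows of the window.
AGGREGATES = {
    "first": lambda rows: rows[0][1] if rows else None,
    "last": lambda rows: rows[-1][1] if rows else None,
    "maximum": lambda rows: max(v for _, v in rows) if rows else None,
    "presence": lambda rows: len(rows) > 0,
}

admission = datetime(2024, 1, 5, 14, 0)

# Hypothetical data, for illustration only.
heart_rates = [
    (datetime(2024, 1, 5, 14, 10), 112),
    (datetime(2024, 1, 5, 14, 40), 104),
    (datetime(2024, 1, 5, 15, 30), 98),  # after H1: outside the window
]
diabetes_diagnoses = [(datetime(2019, 3, 12, 9, 0), "type 2 diabetes")]

# HR at admission: H0 to H1, first value.
hr_at_admission = AGGREGATES["first"](in_window(heart_rates, admission, 0, 1))

# History of diabetes: any time before H0, presence.
history_of_diabetes = AGGREGATES["presence"](
    in_window(diabetes_diagnoses, admission, None, 0)
)

print(hr_at_admission, history_of_diabetes)  # 112 True
```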

The anchor isn't always admission

The temporal anchor depends on the research question. If you’re studying post-intubation complications, the anchor would be the start of mechanical ventilation. For medical history, the anchor can remain admission: it is the window that changes, extending backward from H0 with no time limit.
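
As a rough illustration of how much the anchor matters, the sketch below applies the same concept, window (H0 to H24) and aggregate (maximum) to the creatinine measurements from the earlier example, changing only the anchor. The ventilation start time is hypothetical, as is the year.

```python
from datetime import datetime, timedelta

# Creatinine measurements from the earlier example (µmol/L).
measurements = [
    (datetime(2024, 1, 5, 16, 0), 92),
    (datetime(2024, 1, 6, 6, 0), 118),
    (datetime(2024, 1, 6, 18, 0), 104),
]

anchors = {
    "ICU admission": datetime(2024, 1, 5, 14, 0),
    # Hypothetical time, chosen for illustration only.
    "Start of mechanical ventilation": datetime(2024, 1, 6, 10, 0),
}


def max_in_first_24h(anchor_time):
    """Maximum value in the half-open window [H0, H24) after the anchor."""
    window_end = anchor_time + timedelta(hours=24)
    values = [v for t, v in measurements if anchor_time <= t < window_end]
    return max(values) if values else None


results = {name: max_in_first_24h(t) for name, t in anchors.items()}
print(results)
# {'ICU admission': 118, 'Start of mechanical ventilation': 104}
```

Same patient, same measurements, same window and aggregate: only the anchor moved, and the retained value changed.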

Why it matters — even for manual collection

One might think this framework is mainly useful for queries against a data warehouse. In reality, it’s just as essential for manual collection.

Without a framework, collection drifts

When collection spans several weeks, implicit choices evolve. The person doing the collection ends up applying different decision rules at the beginning and end of the process — without even realizing it. An explicit framework protects against this drift.

Well-defined variables:

Reduce errors: everyone knows exactly what to look for
Improve reproducibility: another researcher can redo the same collection and get the same data
Facilitate collaboration: between the clinician designing the study and the data scientist writing the query, there is no more ambiguity
Prepare for automation: a variable defined along these four dimensions can be translated directly into a data warehouse query
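
To give an idea of what that translation might look like, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The schema (tables stays and labs, and their columns), the concept code serum_creatinine, and the second patient are all hypothetical; a real data warehouse will have its own data model and SQL dialect.

```python
import sqlite3

# Whitelist mapping aggregate names to SQL functions (never interpolate free text).
SQL_AGGREGATES = {"maximum": "MAX", "minimum": "MIN", "mean": "AVG"}


def build_query(aggregate):
    """Build a query for one variable; concept and window are bound parameters."""
    sql_function = SQL_AGGREGATES[aggregate]
    return f"""
        SELECT s.patient_id, {sql_function}(l.value)
        FROM stays AS s
        LEFT JOIN labs AS l
          ON l.patient_id = s.patient_id
         AND l.concept = ?
         AND l.taken_at >= datetime(s.icu_admission, ?)
         AND l.taken_at <  datetime(s.icu_admission, ?)
        GROUP BY s.patient_id
        ORDER BY s.patient_id
    """


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stays (patient_id INTEGER, icu_admission TEXT);
    CREATE TABLE labs (patient_id INTEGER, concept TEXT, taken_at TEXT, value REAL);
""")
# Timestamps are stored as 'YYYY-MM-DD HH:MM:SS' text so they compare correctly.
conn.executemany("INSERT INTO stays VALUES (?, ?)", [
    (1, "2024-01-05 14:00:00"),
    (2, "2024-01-07 09:00:00"),  # hypothetical patient with no measurement
])
conn.executemany("INSERT INTO labs VALUES (?, ?, ?, ?)", [
    (1, "serum_creatinine", "2024-01-05 16:00:00", 92),
    (1, "serum_creatinine", "2024-01-06 06:00:00", 118),
    (1, "serum_creatinine", "2024-01-06 18:00:00", 104),
])

# Concept, window H0 to H24, aggregate: the dimensions map onto the query.
rows = conn.execute(
    build_query("maximum"),
    ("serum_creatinine", "+0 hours", "+24 hours"),
).fetchall()
print(rows)  # [(1, 118.0), (2, None)]
```

The anchor appears as the column the window is computed from (icu_admission), and the LEFT JOIN keeps patients with no value in the window, so missing data shows up explicitly instead of silently dropping rows.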

Linkr’s Study Designer

Linkr offers a dedicated tool for this step: the Study Designer. It guides clinicians through defining each variable along the four dimensions — concept, temporal anchor, time window, aggregate — and automatically generates a structured protocol, exportable in Word, Excel, or JSON.

A tool to structure your protocol

The Study Designer is freely accessible on its dedicated page. It lets you define your variables, inclusion criteria, and analysis plan — all in an interface designed for clinicians.

Key takeaways

Defining a variable means specifying four dimensions: the concept, the temporal anchor, the time window, and the aggregate function.
This framework reduces errors and ambiguity, whether the collection is manual or automated.
Well-defined variables can be directly translated into data warehouse queries.
Linkr's Study Designer helps structure this definition into an exportable protocol.

Next article: Health data warehouses: leveraging data already collected