TL;DR
Data in a health data warehouse (HDW) is not directly usable. It contains errors, inconsistencies, and outliers that must be detected and corrected. This quality work is massive, but it is a cumulative investment: every cleaned variable is reusable for all future projects.
Raw data is not ready
One might think that once data is gathered in a warehouse, it’s ready to be analyzed. In reality, raw data from care software is often of poor quality: missing values, data entry errors, inconsistent units, duplicates…
These issues are not exceptions — they are the norm. Care software is designed to support clinicians in their practice, not to produce research-grade data. Data is entered under time pressure, sometimes into ill-suited fields, and consistency checks are rare.
A monumental task
Every variable has its own quality issues. Data cleaning is a massive task that must be performed variable by variable, combining both technical and clinical expertise.
A classic example: swapped weight and height
In care software, the “weight” and “height” fields are often close together on screen. It frequently happens that a clinician enters weight in the height field, and vice versa. Result: a patient who weighs 1.72 kg and is 68 cm tall.
This type of error is easy to spot on a chart:
[Chart: Weight vs. Height, outlier detection]
Swapped weight/height entries are detectable automatically. When the swap is obvious, the data can be corrected; otherwise, it is excluded from analysis.
This type of correction script must be written for every variable that needs it. Heart rate, blood pressure, temperature, creatinine, blood glucose… all have their own types of errors.
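As a sketch of what such a script can look like, the weight/height swap check can be expressed as a simple plausibility rule: if the values are implausible as entered but plausible once exchanged, swap them back; if neither orientation is plausible, exclude the record. The ranges below are illustrative assumptions for adult patients, not clinical standards, and would need tuning per population:

```python
def fix_swapped_weight_height(weight_kg, height_cm):
    """Detect and correct a likely weight/height swap.

    The plausibility ranges are illustrative assumptions, not
    clinical reference values.
    """
    def weight_ok(w):
        return 30 <= w <= 300    # kg, assumed adult range

    def height_ok(h):
        return 120 <= h <= 230   # cm, assumed adult range

    if weight_ok(weight_kg) and height_ok(height_cm):
        return weight_kg, height_cm, "ok"        # plausible as entered
    if weight_ok(height_cm) and height_ok(weight_kg):
        return height_cm, weight_kg, "swapped"   # plausible once swapped
    return None, None, "excluded"                # neither orientation works
```

For example, a record entered as weight 175 and height 70 would be corrected to (70, 175), while a record that is implausible in both orientations would be flagged for exclusion.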
Shared expertise is essential
Data quality work cannot be done by a single person. It requires a partnership: a data scientist who masters the technical tools, and a clinician who understands the reality on the ground.
Why? Because behind every anomaly in the data, there is often a practical explanation that only the healthcare professional can provide:
ICU nursing assistant
“The weight and height fields are right next to each other. When you’re in a rush, you swap them. It happens often, especially on night shifts.”
Lab technician
“We changed the troponin assay method in March 2022. The values are not comparable before and after that date — you need to apply a conversion factor.”
Ward physician
“That ‘reason for consultation’ field is never filled in correctly — everyone puts ‘other’. The real reason is in the free-text clinical note.”
This knowledge is irreplaceable. It doesn’t appear in any technical documentation. This is why data quality work must be done locally, by teams who know the practices and software of their institution.
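An assay-method change like the one the lab technician describes is typically handled with a date-conditioned harmonization step. This is a minimal sketch: the March 2022 cutover date comes from the quote above, but the 0.8 conversion factor is hypothetical and stands in for whatever equivalence the laboratory actually validates:

```python
from datetime import date

# Assumptions for illustration: the cutover date comes from the lab
# technician's account; the 0.8 factor is hypothetical, not a real
# assay equivalence.
ASSAY_CHANGE = date(2022, 3, 1)
CONVERSION_FACTOR = 0.8  # old-assay value -> new-assay scale

def harmonize_troponin(value, measured_on):
    """Put pre-change troponin values on the post-change assay scale."""
    if measured_on < ASSAY_CHANGE:
        return value * CONVERSION_FACTOR
    return value
```

The point is less the arithmetic than the provenance: without the technician's knowledge, nothing in the database signals that a conversion is needed at all.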
Why not centralize?
If data from multiple hospitals were centralized without involving local teams, this ground-level knowledge would be lost. Only the clinician who uses the software daily knows how data is actually produced — and where the pitfalls lie.
Why isn’t all retrospective research done on warehouses yet?
If HDWs exist, why do most retrospective clinical studies still rely on manual data collection? Several reasons:
A massive setup effort
Building a warehouse, integrating data feeds, harmonizing HIS (Hospital Information System) databases, and cleaning each variable takes years of work and significant resources.
Programming skills required
Working with warehouse data requires skills in R, Python, or SQL. As a result, data scientists carry much of the research workload, which limits the number of projects that can run in parallel.
A still-maturing ecosystem
Many warehouses are recent or still under construction. Data feeds, analysis tools, and quality processes are gradually being put in place.
A cumulative investment
But there is a fundamental difference between manual collection and a data warehouse: with manual collection, you start from scratch with every project. With a warehouse, each piece of work benefits the next.
This is the principle of a cumulative investment:
[Chart: Effort per project, manual collection vs. data warehouse]
For the first project, everything must be done from scratch. But from the second project on, some variables are already cleaned, structured, and documented. The more the warehouse matures, the faster new projects can be launched.
With manual collection, this is not the case: every new project starts from zero, with a constant cost.
Invest today to accelerate tomorrow
This is why investing in health data warehouses is essential. The upfront cost is high, but every variable that is cleaned becomes reusable — and the marginal cost of each new project decreases over time.
Removing the programming barrier
The other major obstacle is the need for programming skills. Today, a clinician who wants to work with warehouse data must go through a data scientist to write queries, run analyses, and generate results.
This is precisely what Linkr aims to solve: enabling clinicians to work directly with warehouse data, without programming knowledge, using integrated low-code tools. By sharing the workload between clinicians and data scientists, more projects can move forward: clinicians handle routine explorations, while data scientists focus their expertise on complex analyses.
Key takeaways
- Raw warehouse data contains errors and inconsistencies that must be corrected variable by variable.
- Data quality work requires a clinician / data scientist partnership: ground-level expertise is irreplaceable.
- It is a cumulative investment: every cleaned variable is reusable for all future projects.
- Low-code tools like Linkr enable clinicians to contribute directly, freeing data scientists for complex analyses and allowing more projects to move forward.