TL;DR
Manual collection — opening each patient chart, searching for information, copying it into a spreadsheet — remains the standard method for conducting a retrospective clinical study. It is time-consuming, can introduce errors, and its results are difficult to reproduce. Yet hospitals already have most of this data in digital form, within their care software. Health data warehouses make it possible to use this data directly — a potential that is still largely untapped.
Data collection: how it works
To conduct a retrospective clinical study, you need structured data: a table with columns (variables) and rows (patients). Data collection is the process of building that table.
A concrete example
You’re studying the parameters that predict ICU length of stay. For each patient, you need:
- Age, sex
- Admission and discharge dates
- Primary diagnoses
- Severity score (APACHE: Acute Physiology and Chronic Health Evaluation; SOFA: Sequential Organ Failure Assessment…)
- Key lab results
In practice, the clinician opens each patient’s chart in the hospital software, searches for the information, and enters it into a spreadsheet. Patient by patient, variable by variable.
This is how the vast majority of studies are conducted today.
The limits of manual collection
Time
The time needed for collection grows with the number of patients multiplied by the number of variables: doubling either one roughly doubles the work.
Figure: Estimated manual collection time. Time is proportional to the number of patients, and the slope increases with the number of variables (estimate based on ~1 minute per variable per patient).
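The estimate above (~1 minute per variable per patient) can be turned into a quick calculation. This is a minimal sketch: the function name and the variable counts are illustrative assumptions, and the real time per value will vary with the chart and the variable.

```python
def manual_collection_hours(n_patients: int, n_variables: int,
                            minutes_per_value: float = 1.0) -> float:
    """Estimated hours needed to collect n_patients x n_variables values by hand."""
    return n_patients * n_variables * minutes_per_value / 60


# Illustrative variable counts for a 500-patient study.
for n_variables in (10, 20, 50):
    hours = manual_collection_hours(500, n_variables)
    print(f"500 patients x {n_variables} variables: about {hours:.0f} hours")
```

With these assumptions, 500 patients and 10 variables already represent about 83 hours of data entry, and 50 variables more than 400 hours.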
That’s time not spent on data analysis, literature review, or scientific thinking. And if a reviewer asks for an additional variable after manuscript submission, you often have to redo part of the collection.
Data quality
Manual collection is also a source of heterogeneity. When two people extract the same information from the same chart, they won’t necessarily make the same choices:
- Which lab result to use if there are multiple in the same day?
- How to interpret an ambiguous diagnosis?
- What value to enter when information is partially missing?
These micro-decisions, repeated hundreds of times, can introduce a bias that is hard to detect. One way to reduce this risk is to define each variable precisely before starting the collection.
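One way to define a variable precisely is to write the selection rule down as code, so that every collector (human or automated) applies the same choice. The sketch below answers the first question in the list above with one possible rule: keep the earliest result measured within 24 hours after admission. The rule, the function name, and the sample values are assumptions chosen for illustration, not a recommended clinical definition.

```python
from datetime import datetime, timedelta


def first_value_in_window(results, admission, window_hours=24):
    """Return the earliest result value measured in [admission, admission + window).

    `results` is a list of (timestamp, value) pairs. Returns None when no
    result falls inside the window, so missing data stays explicit.
    """
    end = admission + timedelta(hours=window_hours)
    in_window = [(t, v) for t, v in results if admission <= t < end]
    if not in_window:
        return None
    return min(in_window, key=lambda pair: pair[0])[1]


admission = datetime(2024, 3, 1, 8, 0)
creatinine = [  # hypothetical values, in micromol/L
    (datetime(2024, 3, 1, 14, 30), 110.0),
    (datetime(2024, 3, 1, 9, 15), 95.0),
    (datetime(2024, 3, 2, 10, 0), 130.0),  # outside the 24-hour window
]
selected = first_value_in_window(creatinine, admission)
print(selected)
```

Here two people applying the rule to the same chart necessarily retain the same value (the 09:15 result), and the rule itself documents the time window that was used.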
A silent problem
Data entry errors are invisible in the final spreadsheet. They don’t generate error messages or trigger alerts — they simply skew the results, without anyone knowing.
Reproducibility
If another researcher wants to reproduce your study, they’ll need to redo the same collection. If the collection choices aren’t thoroughly documented — which value to retain among several, which time window to consider — the results may differ from one collection to another.
Data already available
Hospitals have gradually adopted electronic health records (EHRs). These systems record a vast amount of information as part of routine care:
Administrative data
Age, sex, admission dates…
Lab results
CBC, metabolic panel, CRP…
Vital parameters
HR, BP, SpO2, temperature…
Prescriptions
Medications, doses, routes…
Coded diagnoses
ICD-10 (International Classification of Diseases), procedure codes…
Clinical notes
Consultations, letters, operative reports…
This data is not collected for research: it is generated as part of patient care. However, it can be reused for research purposes. This is known as secondary use of data, or data reuse.
Two approaches, compared
Let’s take a concrete example: a retrospective study on 500 ICU patients.
| | Manual collection | Health data warehouse |
|---|---|---|
| Collection time | Weeks to months | Several days * |
| Number of patients | Limited by available time | All patients in the unit |
| Available variables | Only those planned upfront | Everything recorded in the EHR |
| Adding a variable | Restart the collection | Add a column to the query |
| Human errors | Inherent to the process | Limited (values are read directly from the source data) |
| Reproducibility | Low | High |
* The initial effort can be significant when working with new variables. But the work compounds: a variable properly integrated and validated for one study is directly reusable for the next ones.
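To illustrate the "Adding a variable" row, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for a warehouse. The table name `icu_stays`, its columns, and the two sample rows are hypothetical; a real warehouse will have its own schema and access layer.

```python
import sqlite3

# In-memory database standing in for a warehouse table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE icu_stays (
        patient_id INTEGER PRIMARY KEY,
        age INTEGER,
        sex TEXT,
        admission_date TEXT,
        discharge_date TEXT,
        sofa_score INTEGER
    )
""")
conn.executemany(
    "INSERT INTO icu_stays VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 67, "F", "2024-03-01", "2024-03-09", 8),
        (2, 54, "M", "2024-03-02", "2024-03-05", 4),
    ],
)

base_columns = ["patient_id", "age", "sex", "admission_date", "discharge_date"]


def extract(columns):
    # Column names come from a fixed list in this sketch, never from user input.
    query = f"SELECT {', '.join(columns)} FROM icu_stays ORDER BY patient_id"
    return conn.execute(query).fetchall()


first_pass = extract(base_columns)
# A reviewer asks for the severity score: one more column, same query.
second_pass = extract(base_columns + ["sofa_score"])
print(second_pass)
```

The second extraction covers every patient again without reopening a single chart, and the query itself is a record of exactly what was collected.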
Manual collection remains necessary
Some data doesn’t flow automatically into the EHR, or requires the clinician’s interpretation and expertise, particularly free-text data (discharge summaries, clinical notes…). For these cases, manual collection remains essential. However, data warehouses can significantly reduce the human workload for a large share of the variables to be extracted.
Why not use EHR data directly?
If the data already exists, why do clinicians keep copying it by hand? Because accessing EHR data for research is far from straightforward:
Technical access is complex
EHRs are not designed for bulk data export. Extracting lab results for 500 patients often requires technical skills (SQL, programming).
Data is scattered
A single patient may have data across the EHR, the lab system, the pharmacy software, the billing system… Bringing it all together requires integration work.
Regulatory requirements are strict
Access to health data is governed by strict rules, such as the GDPR (General Data Protection Regulation), HIPAA (Health Insurance Portability and Accountability Act), and other local regulations. Authorizations and a secure infrastructure are required.
Suitable tools are lacking
Even when data is accessible, available tools are often designed for technical profiles, not for clinicians.
Health data warehouses were created to address exactly these four problems.
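To make the "scattered data" problem concrete, the sketch below merges per-patient records from three separate sources into one row per patient. The source names, fields, and values are hypothetical, and the code covers only the gathering step: real integration also involves identity matching, unit harmonization, and access control.

```python
# Hypothetical extracts from three hospital systems, keyed by patient ID.
ehr = {101: {"age": 67, "sex": "F"}, 102: {"age": 54, "sex": "M"}}
lab = {101: {"crp_mg_l": 42.0}}
pharmacy = {101: {"norepinephrine": True}, 102: {"norepinephrine": False}}


def build_research_table(*sources):
    """Merge per-patient records from several sources into one row per patient."""
    patient_ids = sorted(set().union(*(source.keys() for source in sources)))
    columns = sorted(
        {col for source in sources for record in source.values() for col in record}
    )
    rows = []
    for pid in patient_ids:
        # Start with every column set to None so missing data stays visible.
        row = {"patient_id": pid, **{col: None for col in columns}}
        for source in sources:
            row.update(source.get(pid, {}))
        rows.append(row)
    return rows


rows = build_research_table(ehr, lab, pharmacy)
for row in rows:
    print(row)
```

Patient 102 has no lab record in this example, so the `crp_mg_l` column is left as `None` rather than silently dropped or guessed.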
Toward health data warehouses
A health data warehouse (HDW) gathers, structures, and secures data from various hospital systems into a single space designed for research. With a data warehouse, a clinician can — within an appropriate regulatory framework — query data from thousands of patients.
Data warehouses are the topic of an upcoming article
But before exploring data warehouses, it’s essential to know how to define your variables well. That’s the topic of the next article.
Key takeaways
- Manual collection works, but it is slow, error-prone, and hard to reproduce.
- EHRs already contain most of the data needed for clinical research.
- Using this data requires appropriate tools — that's the role of health data warehouses.