Manual data collection in the age of the electronic health record

TL;DR

Manual collection — opening each patient chart, searching for information, copying it into a spreadsheet — remains the standard method for conducting a retrospective clinical study. It is time-consuming, can introduce errors, and its results are difficult to reproduce. Yet hospitals already have most of this data in digital form, within their care software. Health data warehouses make it possible to use this data directly — a potential that is still largely untapped.

Data collection: how it works

To conduct a retrospective clinical study, you need structured data: a table with columns (variables) and rows (patients). Data collection is the process of building that table.

A concrete example

You’re studying the parameters that predict ICU length of stay. For each patient, you need:

Age, sex
Admission and discharge dates
Primary diagnoses
Severity score (APACHE — Acute Physiology and Chronic Health Evaluation, SOFA — Sequential Organ Failure Assessment…)
Key lab results

In practice, the clinician opens each patient’s chart in the hospital software, searches for the information, and enters it into a spreadsheet. Patient by patient, variable by variable.

This is how the vast majority of retrospective studies are still conducted today.

The limits of manual collection

Time

Collection time depends on both the number of patients and the number of variables. Every patient-variable pair is one value to look up and copy, so the total time grows with the product of the two.

Figure: Estimated manual collection time. Time is proportional to the number of patients, and the slope increases with the number of variables. Estimate based on ~1 minute per variable per patient.
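The estimate above comes down to a single multiplication. Here is a minimal sketch of that arithmetic, using the article's ~1 minute per variable per patient. The function name and the variable counts (10, 20, 50) are illustrative assumptions, and the 500 patients are borrowed from the ICU example used later in this article.

```python
def estimated_collection_hours(n_patients: int, n_variables: int,
                               minutes_per_value: float = 1.0) -> float:
    """Estimate manual collection time in hours.

    Assumes every value costs the same time to find and copy
    (the article's rough figure: about 1 minute per variable per patient).
    """
    total_minutes = n_patients * n_variables * minutes_per_value
    return total_minutes / 60


# Illustrative variable counts for a 500-patient study.
for n_variables in (10, 20, 50):
    hours = estimated_collection_hours(500, n_variables)
    print(f"500 patients x {n_variables} variables: about {hours:.0f} hours")
# 500 patients x 10 variables: about 83 hours
# 500 patients x 20 variables: about 167 hours
# 500 patients x 50 variables: about 417 hours
```

Under these assumptions, doubling either the number of patients or the number of variables doubles the total.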

That’s time not spent on data analysis, literature review, or scientific thinking. And if a reviewer asks for an additional variable after manuscript submission, you often have to redo part of the collection.

Data quality

Manual collection is also a source of heterogeneity. When two people extract the same information from the same chart, they won’t necessarily make the same choices:

Which lab result to use if there are several on the same day?
How to interpret an ambiguous diagnosis?
What value to enter when information is partially missing?

These micro-decisions, repeated hundreds of times, can introduce a bias that is hard to detect. One way to reduce this risk is to define each variable precisely before starting the collection.
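One way to make a variable definition precise is to write it as an explicit rule. The sketch below is only an illustration: the rule ("first value measured within 24 hours of admission"), the function name, and the sample creatinine values are all assumptions chosen for the example, not a recommendation from this article.

```python
from datetime import datetime, timedelta


def first_value_in_window(results, admission, window_hours=24):
    """Return the earliest result taken within `window_hours` after admission.

    `results` is a list of (timestamp, value) pairs in any order.
    Returns None when no result falls inside the window.
    """
    window_end = admission + timedelta(hours=window_hours)
    in_window = [(t, v) for t, v in results if admission <= t <= window_end]
    if not in_window:
        return None  # explicit rule for missing data: leave empty, never guess
    earliest = min(in_window, key=lambda pair: pair[0])
    return earliest[1]


# Hypothetical data: three creatinine results for one patient.
admission = datetime(2024, 1, 10, 8, 0)
creatinine = [
    (datetime(2024, 1, 10, 14, 30), 132.0),
    (datetime(2024, 1, 10, 9, 15), 118.0),
    (datetime(2024, 1, 12, 7, 0), 95.0),  # outside the 24-hour window
]
print(first_value_in_window(creatinine, admission))  # 118.0
```

Whatever rule is chosen, writing it down this explicitly means two people applying it to the same chart should arrive at the same value.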

A silent problem

Data entry errors are invisible in the final spreadsheet. They don’t generate error messages or trigger alerts — they simply skew the results, without anyone knowing.
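Some of these errors can be made visible with simple plausibility checks on the finished table. The sketch below is a hypothetical illustration: the variable names and ranges are assumptions picked for the example, not clinical thresholds.

```python
# Illustrative plausibility ranges (assumptions, not clinical reference values).
PLAUSIBLE_RANGES = {
    "age_years": (0, 120),
    "heart_rate_bpm": (20, 300),
    "spo2_percent": (50, 100),
}


def find_implausible(rows):
    """Return (row_index, variable, value) for every value outside its range.

    Missing values (None) are skipped rather than flagged.
    """
    flagged = []
    for i, row in enumerate(rows):
        for name, (low, high) in PLAUSIBLE_RANGES.items():
            value = row.get(name)
            if value is not None and not (low <= value <= high):
                flagged.append((i, name, value))
    return flagged


rows = [
    {"age_years": 67, "heart_rate_bpm": 88, "spo2_percent": 97},
    {"age_years": 670, "heart_rate_bpm": 92, "spo2_percent": 95},  # extra digit typed
    {"age_years": 54, "heart_rate_bpm": 110, "spo2_percent": None},  # missing value
]
print(find_implausible(rows))  # [(1, 'age_years', 670)]
```

A check like this only catches values that are impossible. A value that is wrong but plausible, such as an age of 76 typed instead of 67, still passes silently.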

Reproducibility

If another researcher wants to reproduce your study, they’ll need to redo the same collection. If the collection choices aren’t thoroughly documented — which value to retain among several, which time window to consider — the results may differ from one collection to another.

Data already available

Hospitals have gradually adopted electronic health records (EHRs). These systems record a vast amount of information as part of routine care:

Administrative data: age, sex, admission dates…
Lab results: CBC, metabolic panel, CRP…
Vital signs: HR, BP, SpO2, temperature…
Prescriptions: medications, doses, routes…
Coded diagnoses: ICD-10 (International Classification of Diseases), procedure codes…
Clinical notes: consultations, letters, operative reports…

This data is not collected for research: it is generated as part of patient care. However, it can be reused for research purposes. This is known as secondary use of data, or data reuse.

Two approaches, compared

Let’s take a concrete example: a retrospective study on 500 ICU patients.

Criterion	Manual collection	Health data warehouse
Collection time	Weeks to months	Several days *
Number of patients	Limited by available time	All patients in the unit
Available variables	Only those planned upfront	Everything recorded in the EHR
Adding a variable	Restart the collection	Add a column to the query
Human errors	Inherent to the process	Limited (values are read directly from the source data)
Reproducibility	Low	High

* The initial effort can be significant when working with new variables. But the work compounds: a variable properly integrated and validated for one study is directly reusable for the next ones.
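To give a sense of what "add a column to the query" can mean in practice, here is a minimal sketch using Python's built-in sqlite3 module. The schema (tables `stays` and `labs`), the column names, and the sample values are hypothetical. A real warehouse will have its own data model and access rules.

```python
import sqlite3

# Hypothetical, in-memory stand-in for a warehouse (invented schema and data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stays (patient_id INTEGER, age INTEGER, sex TEXT, los_days REAL);
    CREATE TABLE labs (patient_id INTEGER, name TEXT, value REAL);
    INSERT INTO stays VALUES (1, 67, 'F', 4.5), (2, 54, 'M', 12.0);
    INSERT INTO labs VALUES (1, 'crp', 48.0), (2, 'crp', 210.0),
                            (1, 'lactate', 1.4), (2, 'lactate', 3.9);
""")


def extract(conn, lab_names):
    """Return one row per patient, with one extra column per requested lab."""
    for name in lab_names:
        # Lab names become column names, so only accept plain identifiers.
        if not name.isidentifier():
            raise ValueError(f"unexpected lab name: {name!r}")
    lab_columns = "".join(
        f", MAX(CASE WHEN l.name = '{name}' THEN l.value END) AS {name}"
        for name in lab_names
    )
    query = f"""
        SELECT s.patient_id, s.age, s.sex, s.los_days{lab_columns}
        FROM stays AS s
        LEFT JOIN labs AS l ON l.patient_id = s.patient_id
        GROUP BY s.patient_id, s.age, s.sex, s.los_days
        ORDER BY s.patient_id
    """
    return conn.execute(query).fetchall()


# Initial extraction: one lab variable.
print(extract(conn, ["crp"]))
# A reviewer asks for lactate: one more name in the list, then rerun.
print(extract(conn, ["crp", "lactate"]))
```

In this toy setup each patient has a single result per lab, so `MAX` simply picks that value. With real data, the rule for choosing among several results would have to be defined explicitly.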

Manual collection remains necessary

Some data doesn’t flow automatically into the EHR, or requires the clinician’s interpretation and expertise — particularly free-text data (discharge summaries, clinical notes…). For these cases, manual collection remains essential. However, data warehouses can significantly reduce the human workload for a large portion of the variables to extract.

Why not use EHR data directly?

If the data already exists, why do clinicians keep copying it by hand? Because accessing EHR data for research is far from straightforward:

Technical access is complex

EHRs are not designed for bulk data export. Extracting lab results for 500 patients often requires technical skills (SQL, programming).

Data is scattered

A single patient may have data across the EHR, the lab system, the pharmacy software, the billing system… Bringing it all together requires integration work.
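At its simplest, integration means bringing the records of each source system together under one patient identifier. The sketch below assumes, optimistically, that all systems share the same identifier. The source names, fields, and values are invented for illustration.

```python
# Hypothetical extracts from three systems, keyed by a shared patient ID.
ehr = {101: {"age": 67, "sex": "F"}, 102: {"age": 54, "sex": "M"}}
lab_system = {101: {"crp": 48.0}, 102: {"crp": 210.0}}
pharmacy = {101: {"norepinephrine": True}}


def integrate(*sources):
    """Merge per-patient records from several sources into one record per patient."""
    merged = {}
    for source in sources:
        for patient_id, fields in source.items():
            merged.setdefault(patient_id, {}).update(fields)
    return merged


table = integrate(ehr, lab_system, pharmacy)
print(table[101])  # {'age': 67, 'sex': 'F', 'crp': 48.0, 'norepinephrine': True}
print(table[102])  # {'age': 54, 'sex': 'M', 'crp': 210.0}
```

Real integration work is harder than this sketch suggests: identifiers, units, and coding conventions may differ from one system to another and have to be reconciled first.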

Regulatory requirements are strict

Access to health data is governed by strict rules (GDPR — General Data Protection Regulation, HIPAA — Health Insurance Portability and Accountability Act, and other local regulations). Authorizations and a secure infrastructure are required.

Suitable tools are lacking

Even when data is accessible, available tools are often designed for technical profiles, not for clinicians.

Health data warehouses were created to address exactly these four problems.

Toward health data warehouses

A health data warehouse (HDW) gathers, structures, and secures data from various hospital systems into a single space designed for research. With a data warehouse, a clinician can — within an appropriate regulatory framework — query data from thousands of patients.

Health data warehouses are the topic of an upcoming article

But before exploring data warehouses, it’s essential to know how to define your variables well. That’s the topic of the next article.

Key takeaways

Manual collection works, but it is slow, prone to errors, and poorly reproducible.
EHRs already contain most of the data needed for clinical research.
Using this data requires appropriate tools: that’s the role of health data warehouses.

Next article: Defining your variables well: the key to reliable data collection