Summary
The previous articles laid the groundwork: manual data collection and its limitations, variable definition, how clinical data warehouses work, data quality, data organization, and medical terminologies. It’s time to see concretely how to set up a research project on a clinical data warehouse, from the first contact with the data warehouse team to the analysis of results.
The process in 6 steps
Each hospital has its own organization, but the process generally follows the same major steps. Here is an overview.
First contact with the data warehouse team
The team that manages the clinical data warehouse is the entry point for any research project. They support researchers, perform data extractions, and ensure data quality.
Writing the protocol
The protocol describes the research question, objectives, inclusion and exclusion criteria, variables to collect, and the analysis plan. It is the project's reference document. The more precise it is, the faster and more reliable the feasibility study will be.
Feasibility study
The data warehouse team evaluates whether the project is feasible: Are the required data available? Is the number of patients meeting the criteria sufficient?
Regulatory requirements
If the study is feasible, the necessary approvals must be obtained before accessing the data. The data warehouse team can usually guide you through this process.
Data access
Once approvals are obtained, the data is extracted and made available in a secure environment.
Analysis and results
The clinician can explore, visualize, and analyze their data — ideally with tools suited to their profile.
Let’s detail each of these steps.
1. First contact with the data warehouse team
The data warehouse team is the primary point of contact. This is the team that:
- Manages and maintains the clinical data warehouse
- Supports researchers in designing their projects
- Performs feasibility studies and data extractions
- Ensures data quality and security
Where to start?
Find out who manages the data warehouse at your institution and how to contact them. The simplest approach is often to ask a colleague who has already conducted a project using the data warehouse.
This first contact allows for a general assessment of the project’s feasibility before even writing a full protocol. The data warehouse team can quickly indicate whether the envisioned data is available and guide the researcher through the process.
2. Writing the protocol
The protocol is the backbone of the project. It formalizes:
- The research question and objectives (primary, secondary)
- The inclusion and exclusion criteria for patients
- The variables to collect, with their precise definition (concept, temporal anchor, time window, aggregate function)
- The statistical analysis plan
The more precise and structured the protocol, the faster the project will move forward.
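To make the four-part variable definition concrete, it can be sketched as a small data structure. This is a hypothetical illustration only — the field names are assumptions, not a Study Designer format:

```python
from dataclasses import dataclass

@dataclass
class VariableDefinition:
    # What is measured (ideally tied to a standard terminology code)
    concept: str
    # Event that anchors the time window (e.g. ICU admission)
    temporal_anchor: str
    # Window around the anchor, in hours: (start, end)
    time_window_hours: tuple
    # How repeated measurements inside the window are summarized
    aggregate: str

# Example: worst creatinine in the first 24 hours after ICU admission
creatinine_24h = VariableDefinition(
    concept="serum creatinine (LOINC 2160-0)",
    temporal_anchor="ICU admission",
    time_window_hours=(0, 24),
    aggregate="max",
)
```

Writing each variable down this explicitly — rather than as "creatinine at admission" — is exactly what lets the data warehouse team translate the protocol into an extraction without back-and-forth.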
The Study Designer: a tool to structure the protocol
The Study Designer is an open source tool, accessible from this website, that guides you step by step through the design of a research protocol on a data warehouse. It allows you to define objectives, criteria, variables — and automatically generate a Word, Excel, or Markdown document ready to share with the data warehouse team.
It saves time for both the researcher and the data warehouse team: the protocol arrives structured, with the right information, in a usable format. And it’s a higher-quality protocol, because the tool guides you toward best practices (rigorous variable definitions, explicit criteria, etc.).
3. Feasibility study
The feasibility study is an essential step. The data warehouse team checks two things:
Are the data available?
Not all data is necessarily present in the data warehouse at any given time. As explained in the article on clinical data warehouses, integrating a new data source (a laboratory system, a prescription tool, a monitoring device…) represents significant work: technical development, quality control, and ongoing maintenance. This work consumes resources and takes time.
The feasibility study identifies precisely which variables are already available and which are not yet integrated.
Is the number of patients sufficient?
Even if the data is available, there must be a sufficient number of patients meeting the inclusion criteria. The data warehouse team can query the warehouse to estimate this number and compare it to the required sample size calculated in the protocol.
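In practice, this count boils down to counting distinct patients who meet the inclusion criteria. A minimal sketch, with an in-memory SQLite database standing in for the warehouse — table and column names follow OMOP conventions, and the data and criteria are made up for illustration:

```python
import sqlite3

# Toy OMOP-style tables standing in for the real warehouse
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER, year_of_birth INTEGER);
CREATE TABLE condition_occurrence (person_id INTEGER, condition_concept_id INTEGER);
INSERT INTO person VALUES (1, 1950), (2, 1980), (3, 1945);
INSERT INTO condition_occurrence VALUES (1, 201820), (3, 201820), (2, 4329847);
""")

# Inclusion criteria: diabetes mellitus (OMOP concept 201820), born before 1960
n_eligible = conn.execute("""
    SELECT COUNT(DISTINCT p.person_id)
    FROM person p
    JOIN condition_occurrence c ON c.person_id = p.person_id
    WHERE c.condition_concept_id = 201820
      AND p.year_of_birth < 1960
""").fetchone()[0]

print(n_eligible)  # 2 patients meet the criteria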
A partially feasible project is still valuable
If some data is not in the data warehouse, the project is not necessarily impossible. Some variables can be extracted automatically, and the rest collected manually. This is a hybrid approach that remains far more efficient than entirely manual collection: each automated variable saves time across all patients.
4. Regulatory requirements
If the study is feasible, the necessary approvals must be obtained before accessing the data. Depending on the type of study and local regulations, requirements vary: ethical approval, compliance declarations, institutional authorizations… The data warehouse team can usually guide you through these procedures.
5. Data access
Once approvals are obtained, the data is extracted from the data warehouse and made available to the researcher in a secure environment. Depending on the available tools, several options are possible:
- A programming environment: RStudio, Jupyter Notebook, Python… for those comfortable with code
- A graphical interface that allows the clinician to query their data without writing code: reconstructing a patient record to view cases one by one, creating charts on cohorts, performing descriptive or analytical statistics
6. Analysis and results
The clinician can then explore, visualize, and analyze their data. Depending on their profile, they can use a programming environment or a graphical interface.
Linkr: the interface for clinicians
This is exactly what Linkr does: it provides an environment where clinicians can explore, visualize, and analyze their data without programming skills.
The clinician is involved in the analysis from the start: they can verify the data, spot inconsistencies, and test hypotheses. Communication with the data scientist is made easier — they work on the same interface. And for more technical users, Linkr also provides access to a full programming environment (R, Python).
What if the project goes multicenter?
A project that works at one site can naturally expand to multiple centers to increase statistical power. But multicenter studies add complexity:
- They require a budget: coordination time, data harmonization across sites, technical infrastructure. Funding calls exist for this type of research.
- The data must be interoperable across centers — this is the whole point of standardized terminologies and the OMOP model.
The advantage of standards
Linkr is built on the OMOP model and standardized data dictionaries like INDICATE. This means that analysis scripts developed locally are directly shareable with other centers using the same standards — scaling from a single-center to a multicenter study is considerably easier.
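To illustrate why this works: a script written against OMOP table names and standard concept identifiers depends only on the model, not on any one hospital's schema, so "sharing" it with another center means simply pointing it at that site's database. A hypothetical sketch (SQLite stands in for a site's warehouse; the data is made up):

```python
import sqlite3

def cohort_size(conn, condition_concept_id):
    """Count distinct patients with a given OMOP condition concept.

    Because it references only OMOP table and column names, this
    function runs unchanged at any center following the OMOP model.
    """
    row = conn.execute(
        "SELECT COUNT(DISTINCT person_id) FROM condition_occurrence "
        "WHERE condition_concept_id = ?",
        (condition_concept_id,),
    ).fetchone()
    return row[0]

# Each center supplies its own connection; the analysis code never changes.
conn = sqlite3.connect(":memory:")  # stands in for one site's OMOP database
conn.execute(
    "CREATE TABLE condition_occurrence "
    "(person_id INTEGER, condition_concept_id INTEGER)"
)
conn.executemany(
    "INSERT INTO condition_occurrence VALUES (?, ?)",
    [(1, 201820), (2, 201820), (2, 201820), (3, 4329847)],
)

print(cohort_size(conn, 201820))  # 2 distinct patients
```

Without a shared model, the same study would need one bespoke query per center, each mapped to that hospital's local schema and local codes.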
An investment that builds over time
A research project on a data warehouse requires significant human expertise: data scientist time, clinical knowledge, quality control. In return, each study contributes to improving the quality and coverage of the data warehouse. As discussed in the article on data quality, cleaned variables, integrated data sources, concept mappings — all of this builds up over time.
As projects accumulate, it becomes easier and faster to conduct studies.
An ongoing effort
Even as studies become progressively easier to conduct, ongoing effort is still needed to maintain the data warehouse and its data quality over time.
Looking ahead — The importance of trained clinicians
Throughout this process, clinician expertise is essential:
- To define variables precisely — clinicians know which measurements are relevant and in what context they were collected
- To certify data quality — clinicians know the data best, because they use the software that produces it on a daily basis
- To interpret results — a statistical anomaly only has meaning if a domain expert can explain it
Having one or more clinicians trained in data science within a department considerably smooths the entire process. They become key partners for the data warehouse team, capable of designing rigorous protocols, actively participating in the analysis, and training their colleagues.
Getting trained
University programs exist for healthcare professionals: University Diplomas (short courses) that can be pursued alongside clinical practice, or Master’s degrees that can be completed during residency, for example. See our article Data science training in healthcare for an overview of available options.
Key takeaways
- The resources on this site provide the foundational knowledge for approaching a research project on a data warehouse: data, variables, quality, organization, terminologies.
- The Study Designer supports protocol design: criteria, variables, analysis plan — and generates a document ready to share with the data warehouse team.
- Linkr lets clinicians visualize, analyze, and explore their data in a tailored interface, collaborate with data scientists, and work reproducibly.
- Each study is a cumulative investment: data quality improves project after project, making subsequent studies faster and less costly.