Overview of public healthcare databases

In a nutshell

Several clinical data warehouses are publicly available for research and learning — MIMIC, eICU-CRD, AmsterdamUMCdb, HiRID, SICdb. Most of them focus on intensive care, where data are densest and best structured. This article offers an overview to help you choose the one that fits your project.

Why public databases?

As discussed in the article on clinical data warehouses, most hospitals now operate an internal CDW, but accessing these data still requires lengthy regulatory steps. Public databases offer a valuable alternative: de-identified data made available to the scientific community under defined access conditions (typically training and a data use agreement), with documented schemas and an active surrounding community.

They serve three main purposes:

Learning — working with real clinical data, practicing SQL, testing statistical methods or machine learning models on a representative playground.
Research — publishing observational studies, comparing practices, testing clinical hypotheses on cohorts of tens of thousands of patients.
Reproducibility — replicating the analyses of a publication on the same data, and running multicenter international studies by combining several databases.

Why mostly intensive care?

The vast majority of public databases cover intensive care (Kallout et al., 2025). In ICUs, most data (vital signs, ventilation, medications) are captured continuously in a single software system, the patient data management system, which makes extraction and de-identification much easier. In other wards, the data are less dense, more heterogeneous, and scattered across multiple applications, which makes building coherent, shareable databases more difficult.

A brief history

Opening clinical data to the scientific community is a relatively recent movement.

1996: Early preparatory work at MIT (Moody & Mark, 1996) on a database designed for the development and evaluation of ICU monitoring algorithms.
2003: Launch of the MIMIC project, NIH-funded, as a collaboration between MIT, Beth Israel Deaconess Medical Center, and Philips Medical Systems.
2011: Public release of MIMIC-II on PhysioNet (Saeed et al., 2011, Critical Care Medicine). For the first time, an ICU database is accessible to any researcher who has completed ethics training.
2016: Release of MIMIC-III (Johnson et al., 2016, Scientific Data), which becomes the global reference for ICU research.
2018: Release of the eICU Collaborative Research Database (Pollard et al., 2018, Scientific Data), the first multicenter database, with over 200,000 stays from 208 US hospitals.
2019: First freely accessible European database: AmsterdamUMCdb (reference paper: Thoral et al., 2021, Critical Care Medicine).
2020: Release of HiRID (Bern University Hospital, Switzerland), with very fine temporal resolution for several parameters. Reference paper: Hyland et al., 2020, Nature Medicine.
2023: Publication of the MIMIC-IV reference paper (Johnson et al., 2023, Scientific Data; earlier versions of the database were already available on PhysioNet) and release of SICdb (Rodemund et al., 2023, Intensive Care Medicine).

This movement goes hand in hand with the rise of the OMOP model (see our article on clinical data warehouses), which standardizes schemas and lets the same SQL query run on several databases.

The main databases

MIMIC (United States)

The MIMIC (Medical Information Mart for Intensive Care) database is the most widely known and used. It contains data from patients admitted to intensive care at the Beth Israel Deaconess Medical Center (Boston). Several releases coexist (MIMIC-III, MIMIC-IV, and MIMIC-IV-ED for the emergency department), and hundreds of publications per year use them as their testing ground.

Number of patients: ~65,000 ICU patients across ~94,000 ICU stays, out of ~365,000 hospitalized patients and ~546,000 hospitalizations (MIMIC-IV v3.1, October 2024)
Time period: 2008-2022
Center type: single-center, tertiary
Access: PhysioNet, CITI training required
Data model: native schema + OMOP CDM v5.4 version available
Demo: 100 patients in open access, no registration

A dedicated article is available: the MIMIC database.

eICU-CRD (United States, multicenter)

The eICU Collaborative Research Database is the only multicenter database among the main ones presented here. The data come from the Philips eICU telehealth program, which centralizes information from 208 US hospitals.

Number of stays: >200,000 admissions for ~139,000 unique patients over 2014-2015
Center type: 335 units across 208 hospitals, mix of academic and community
Access: PhysioNet, CITI training required
Data model: native eICU schema (no official OMOP ETL to date, a few community initiatives exist)
Of particular interest: diversity of practices (hospitals of varying size and level), ideal for generalizability studies

AmsterdamUMCdb (Netherlands)

AmsterdamUMCdb is the first freely available European database. It is endorsed by the ESICM (European Society of Intensive Care Medicine) and sourced from the Amsterdam UMC.

Number of admissions: 23,106 admissions for 20,109 unique patients, over 2003-2016
Center type: single-center, academic
Access: amsterdammedicaldatascience.nl (code & documentation on GitHub), training and DUA required
Of particular interest: first European database, clinical practices differ from the US (ventilation, sedation, coding)

HiRID (Switzerland)

HiRID (High time Resolution ICU Dataset) comes from Bern University Hospital. Its distinctive feature is its high temporal resolution: most bedside monitoring variables are recorded roughly every 2 minutes.

Number of admissions: ~34,000 over 2008-2016
Center type: single-center, academic
Access: PhysioNet, CITI training required
Of particular interest: fine temporal resolution, widely used for time-series prediction models (deep learning)

SICdb (Austria)

The Salzburg Intensive Care database is one of the most recent in this landscape. It contains minute-level granular data, with a special focus on physiological signals.

Number of admissions: >27,000 over 2013-2021
Center type: single-center (4 ICUs of Salzburg University Hospital)
Access: PhysioNet, CITI training required
Of particular interest: minute-level granularity, recent data

Other notable databases

NWICU (Northwestern ICU) — a recent US database from the Northwestern Memorial HealthCare network (12 hospitals in Chicago), >25,000 patients over 2020-2022. PhysioNet access.
ZFPH (Zigong Fourth People’s Hospital, China) — 2,790 patients over 2019-2020, focused on patients with infections (sepsis, septic shock), useful to broaden geographic diversity. PhysioNet access.
PIC (Paediatric Intensive Care) — Chinese pediatric database (Children’s Hospital, Zhejiang University) over 2010-2018, complementary to the adult databases above. PhysioNet access.

Specialty databases: beyond structured data

The databases above mostly contain structured data (lab values, medications, coded diagnoses). But clinical research is also interested in images, free text and raw physiological signals. Several public databases cover these modalities, often linked to MIMIC to enable richer analyses.

Imaging: MIMIC-CXR

MIMIC-CXR contains 377,110 chest X-rays from 227,835 imaging studies at Beth Israel Deaconess Medical Center (Johnson et al., 2019). Each image is associated with the radiologist’s free-text report. It is the go-to database for medical imaging research and for AI models in radiology.

Key point: the patient identifiers are compatible with MIMIC-IV — so an X-ray can be linked to the patient’s full clinical record (labs, diagnoses, mortality…), which enables multimodal studies.
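The linkage described above can be sketched as a simple join on the shared patient identifier. This is a minimal illustration on toy in-memory tables, assuming the identifier column is named subject_id in both sources; the cxr_studies table name and all rows are hypothetical, and the real files contain many more columns.

```python
import sqlite3

# Toy in-memory tables standing in for a MIMIC-IV patient table and a
# MIMIC-CXR study list. Table "cxr_studies" and all rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (subject_id INTEGER PRIMARY KEY, gender TEXT, anchor_age INTEGER);
    CREATE TABLE cxr_studies (study_id INTEGER PRIMARY KEY, subject_id INTEGER);
    INSERT INTO patients VALUES (1, 'F', 67), (2, 'M', 54), (3, 'F', 81);
    INSERT INTO cxr_studies VALUES (101, 1), (102, 1), (103, 3);
""")


def studies_per_patient(conn):
    """Count imaging studies per patient, keeping patients with no study."""
    query = """
        SELECT p.subject_id, p.gender, p.anchor_age, COUNT(s.study_id) AS n_studies
        FROM patients AS p
        LEFT JOIN cxr_studies AS s ON s.subject_id = p.subject_id
        GROUP BY p.subject_id, p.gender, p.anchor_age
        ORDER BY p.subject_id
    """
    return conn.execute(query).fetchall()


rows = studies_per_patient(conn)
for row in rows:
    print(row)
```

The LEFT JOIN keeps patients without any X-ray, which matters when imaging is only available for a subset of the clinical cohort.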

Clinical notes: MIMIC-IV-Note

MIMIC-IV-Note contains ~332,000 discharge summaries and ~2.3 million radiology reports, de-identified and linked to MIMIC-IV. Useful for clinical natural language processing (NLP): entity extraction, document classification, training medical LLMs.

De-identifying free text

Free text is particularly hard to de-identify: a doctor’s name, an address, a precise date can appear anywhere. MIMIC-IV-Note uses a process combining rules and automated models, with manual validation.
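To give a feel for the rule-based part of such a process, here is a minimal sketch using three regular expressions. It is not the method used for MIMIC-IV-Note: the patterns, placeholders, and example note are all assumptions chosen for illustration.

```python
import re

# Illustrative patterns only; real clinical text needs far broader coverage.
PATTERNS = [
    (re.compile(r"\bDr\.?\s+[A-Z][a-z]+"), "Dr. [NAME]"),   # titled names
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "[DATE]"),        # ISO dates
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),       # US-style phone numbers
]


def scrub(text: str) -> str:
    """Replace every match of each pattern with its placeholder."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


note = "Seen by Dr. Smith on 2019-03-14. Call 555-123-4567 for follow-up."
print(scrub(note))
# A name without a title slips through, which is why rules alone are not enough.
print(scrub("Reviewed with Smith."))
```

The second call returns the text unchanged: the untitled name is missed, which illustrates why rules are combined with automated models and manual validation.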

High-resolution signals: VitalDB

VitalDB is a Korean database (Seoul National University Hospital) of a slightly different kind: it contains raw physiological signals from 6,388 surgical patients (intraoperative data) (Lee et al., 2022). Signals are recorded at high frequency — up to 500 Hz for waveforms (ECG, arterial pressure, EEG, plethysmography) and 1 to 7 seconds for numeric values.

~486,000 data tracks in total (waveforms and numeric series), i.e. roughly 76 per case on average
196 intraoperative parameters, 73 clinical, 34 lab
Fully open access (no CITI training, just registration)
Typical use cases: intraoperative hypotension prediction, monitoring algorithms, deep learning on high-frequency time series

Another level of granularity

In MIMIC or HiRID, arterial pressure is recorded as numeric values every few minutes to every hour. In VitalDB, you have the full pressure waveform at 500 Hz, i.e. 500 points per second. This is a different scale of work, with dedicated signal-processing methods.
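The difference in scale can be made concrete with a little arithmetic. The sketch below counts samples at 500 Hz and collapses a synthetic waveform into per-minute means; the signal is simulated (a sine wave around 90 mmHg, an assumption for illustration), not VitalDB data.

```python
import math
from statistics import fmean

SAMPLE_RATE_HZ = 500        # waveform rate quoted for VitalDB
SECONDS_PER_MINUTE = 60


def samples_per_hour(rate_hz: int) -> int:
    """Number of points one hour of signal contains at the given rate."""
    return rate_hz * 3600


def per_minute_means(signal, rate_hz=SAMPLE_RATE_HZ):
    """Collapse a waveform into one mean value per full minute."""
    step = rate_hz * SECONDS_PER_MINUTE
    return [fmean(signal[i:i + step]) for i in range(0, len(signal) - step + 1, step)]


# Synthetic 2-minute trace: 90 mmHg baseline plus a 1.2 Hz pulse (about 72 bpm).
n = 2 * SECONDS_PER_MINUTE * SAMPLE_RATE_HZ
waveform = [90 + 20 * math.sin(2 * math.pi * 1.2 * t / SAMPLE_RATE_HZ) for t in range(n)]

print(samples_per_hour(SAMPLE_RATE_HZ))                 # 1800000 points per hour
print(len(waveform), len(per_minute_means(waveform)))   # 60000 points -> 2 values
```

One hour of a single waveform is 1.8 million points, whereas a per-minute numeric series holds 60: averaging discards the beat-to-beat shape that waveform methods rely on.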

Going further

Two recent literature reviews offer a very detailed panorama of the public ICU database ecosystem, with precise comparisons that this article cannot fully reproduce:

Kallout et al., 2025 — Contribution of Open Access Databases to Intensive Care Medicine Research: Scoping Review. Recent review, focused on usage and contribution to the literature.
Sauer et al., 2022 — Systematic Review and Comparison of Publicly Available ICU Data Sets: a Decision Guide for Clinicians and Data Scientists. Decision guide with side-by-side comparisons.

If you are hesitating between several databases for a specific project, these two reviews are an excellent starting point.

OMOP adaptations

As introduced in the article on clinical data warehouses, the OMOP model standardizes the data schema and allows the same queries to run on different databases. Several teams have done this conversion work for the public databases — a considerable effort that then benefits the whole community.

MIMIC-OMOP

The most emblematic work is that of Paris, Lamer & Parrot (JMIR Med Inform, 2021), who converted MIMIC-III to OMOP. Their paper has become the community reference and paved the way for the MIMIC-IV conversion, now maintained by OHDSI (OHDSI/MIMIC).

This OMOP version of MIMIC-IV is what we use in our beginner and intermediate interactive tutorials: you query the database directly in your browser, on a demo dataset of 100 patients.

Other adaptations

AmsterdamUMCdb has an OMOP conversion developed by the Amsterdam UMC team, available on their GitHub repository.
eICU-CRD has no official OMOP ETL to date, but several community initiatives exist, and work is also underway for HiRID and SICdb.

BlendedICU: a unified dataset

BlendedICU is a remarkable initiative from a team at the Reunion University Hospital in La Réunion, France (Oliver et al., 2023, Journal of Biomedical Informatics). The idea: gather into a single OMOP format the four main public databases — AmsterdamUMCdb, eICU-CRD, HiRID, and MIMIC-IV.

The final dataset contains:

41 longitudinal variables (time series) extracted and harmonized
Exposure to 113 active medication ingredients
The pipeline code is open source on GitHub, so anyone can reproduce and adapt the harmonization

Why is it interesting?

BlendedICU greatly simplifies generalizability studies: with the same script, you can train a model on MIMIC, validate it on AmsterdamUMCdb, and compare its performance on HiRID — without having to redo the cleaning and harmonization work for each database. It is also a fine illustration of what the OMOP standard enables internationally.

How to choose?

The choice depends on your research question and your level of experience.

Need	Recommended database
Learn SQL on clinical data	MIMIC-IV Demo (OMOP, 100 patients, open)
First observational study in intensive care	MIMIC-IV (documented schema, large community)
Multicenter study / generalizability	eICU-CRD
European data, non-US practices	AmsterdamUMCdb, HiRID, SICdb
Fine time series, deep learning	HiRID, SICdb
Reproducibility / external validation	Combine several databases, ideally those already in OMOP (MIMIC, AmsterdamUMCdb)
Medical imaging, NLP, multimodal AI	MIMIC-CXR, MIMIC-IV-Note
High-frequency signals, intraoperative	VitalDB

The multi-database strategy

Replicating a study on several databases is increasingly expected. The OMOP model makes this work easier: provided each source has been converted to the same OMOP version with comparable vocabulary mappings, the same SQL script can run on MIMIC-IV OMOP, on a European database converted to OMOP, or on a French hospital CDW. This is one of the main benefits of the standard.
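As a minimal sketch of this idea, the example below runs one query, written once against OMOP table and column names, on two separate databases. The two SQLite databases are toy stand-ins with made-up rows, and only a small subset of the person and visit_occurrence columns is created; real OMOP instances are much richer.

```python
import sqlite3

# Minimal subset of two OMOP CDM tables; real instances have many more columns.
SCHEMA = """
    CREATE TABLE person (person_id INTEGER PRIMARY KEY, year_of_birth INTEGER);
    CREATE TABLE visit_occurrence (
        visit_occurrence_id INTEGER PRIMARY KEY,
        person_id INTEGER,
        visit_start_date TEXT,
        visit_end_date TEXT
    );
"""

# One query, written once, reused unchanged on every site.
COHORT_QUERY = """
    SELECT COUNT(DISTINCT v.person_id) AS n_patients,
           COUNT(*) AS n_visits,
           AVG(julianday(v.visit_end_date) - julianday(v.visit_start_date)) AS mean_los_days
    FROM visit_occurrence AS v
    JOIN person AS p ON p.person_id = v.person_id
"""


def make_db(persons, visits):
    """Build a toy in-memory OMOP-like database from lists of tuples."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO person VALUES (?, ?)", persons)
    conn.executemany("INSERT INTO visit_occurrence VALUES (?, ?, ?, ?)", visits)
    return conn


# Two hypothetical sites with made-up rows.
site_a = make_db([(1, 1950), (2, 1972)],
                 [(10, 1, "2020-01-01", "2020-01-05"), (11, 2, "2020-02-01", "2020-02-03")])
site_b = make_db([(7, 1961)],
                 [(20, 7, "2019-06-10", "2019-06-16"), (21, 7, "2019-09-01", "2019-09-03")])

results = {name: conn.execute(COHORT_QUERY).fetchone()
           for name, conn in [("site_a", site_a), ("site_b", site_b)]}
for name, row in results.items():
    print(name, row)
```

In practice the SQL dialect differs between engines (the date arithmetic here is SQLite-specific), so a shared schema removes most, but not all, of the porting work.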

Access process: a common path

For most of these databases, the access process follows a similar pattern — detailed in the MIMIC article:

Create an account on the hosting platform (PhysioNet for most).
Complete a research ethics training (CITI Course).
Sign a Data Use Agreement (DUA).
Get approval from the team maintaining the database.

Institutional email address required

Most platforms require a professional email address and a supervisor’s validation. Personal addresses are generally refused.

Limitations to keep in mind

Public databases are a remarkable tool, but they come with limitations:

Selection bias: most of these databases come from academic centers (eICU-CRD, which also includes community hospitals, is the main exception), so practices, populations, and equipment may not be representative.
No external data: once the patient leaves the ICU, follow-up usually stops (except in MIMIC-IV, which provides 1-year mortality).
Imperfect quality: these data are extracted from routine care, with entry errors, outliers, and duplicates. Data quality work remains essential.
De-identification: dates are shifted and some rare variables are removed, which may limit certain analyses (seasonality, epidemics).
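The effect of date shifting can be illustrated with a short sketch. It assumes a scheme in which all of a patient's dates are moved by one random per-patient offset; the offset range and the example dates are hypothetical and do not describe any specific database.

```python
import random
from datetime import date, timedelta


def shift_patient_dates(events, rng):
    """Shift all of one patient's dates by a single random offset.

    Hypothetical scheme: the 50-150 year range is an assumption for illustration.
    """
    offset = timedelta(days=rng.randint(365 * 50, 365 * 150))
    return {name: d + offset for name, d in events.items()}


rng = random.Random(0)
real = {"admission": date(2015, 1, 10), "discharge": date(2015, 1, 24)}
shifted = shift_patient_dates(real, rng)

real_los = (real["discharge"] - real["admission"]).days
shifted_los = (shifted["discharge"] - shifted["admission"]).days
print(real_los, shifted_los)          # intervals within a patient are preserved
print(shifted["admission"])           # the calendar date itself is no longer real
```

Durations such as length of stay survive the shift, while anything tied to the real calendar (a flu season, a specific epidemic wave) can no longer be recovered from the dates alone.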

Sources

History & reference publications

Moody GB, Mark RG. A database to support development and evaluation of intelligent intensive care monitoring. Computers in Cardiology, 1996 (see also the archive page on PhysioNet) — preparatory work for MIMIC.
Saeed M et al. Multiparameter Intelligent Monitoring in Intensive Care II: A public-access intensive care unit database. Crit Care Med. 2011 — MIMIC-II.
Johnson AEW et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016.
Pollard TJ et al. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Sci Data. 2018.
Thoral PJ et al. Sharing ICU Patient Data Responsibly: The AmsterdamUMCdb Example. Crit Care Med. 2021 — database released in 2019.
Hyland SL et al. Early prediction of circulatory failure in the intensive care unit using machine learning. Nat Med. 2020 — HiRID.
Johnson AEW et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data. 2023.
Rodemund N et al. The Salzburg Intensive Care database (SICdb). Intensive Care Med. 2023.

Specialty databases

MIMIC-CXR v2.1.0 on PhysioNet — imaging.
MIMIC-IV-Note v2.2 on PhysioNet — clinical notes.
Lee HC et al. VitalDB, a high-fidelity multi-parameter vital signs database in surgical patients. Sci Data. 2022.

OMOP adaptations

Paris N, Lamer A, Parrot A. Transformation and Evaluation of the MIMIC Database in the OMOP Common Data Model: Development and Usability Study. JMIR Med Inform. 2021.
Oliver M, Allyn J, Carencotte R, Allou N, Ferdynus C. Introducing the BlendedICU dataset, the first harmonized, international intensive care dataset. J Biomed Inform. 2023.

Literature reviews

Kallout J, Lamer A, Grosjean J et al. Contribution of Open Access Databases to Intensive Care Medicine Research: Scoping Review. JMIR. 2025.
Sauer CM et al. Systematic Review and Comparison of Publicly Available ICU Data Sets — A Decision Guide for Clinicians and Data Scientists. Crit Care Explor. 2022.

Key takeaways

Public databases offer a high-value learning and research ground, accessible after completing ethics training.
Most cover intensive care: MIMIC-IV is the reference, and eICU-CRD is the main multicenter database.
European databases (AmsterdamUMCdb, HiRID, SICdb) bring valuable diversity of clinical practices.
Specialty databases complement the ecosystem: MIMIC-CXR (imaging), MIMIC-IV-Note (free text), VitalDB (high-frequency signals).
OMOP adaptations enable external validation and generalizability studies.