The data pipeline

How data flows through a Linkr project: warehouse (long format) → pipeline (long → wide) → lab (analyses).

Summary

Within a project, data flows through three successive spaces. The warehouse holds raw data in long format (one row per event), read-only. The pipeline transforms it into wide format (one row per patient) without ever modifying the source. The lab hosts the resulting analysis-ready datasets, along with statistical analyses and dashboards.


Three spaces, one direction of flow

A Linkr project is organised along the natural path of the data: from the warehouse’s raw records to the final result. The three spaces appear in each project’s navigation.

1 · Warehouse

Raw data, long format

One row per clinical event.
Read-only — the source is never modified.
Explore concepts, build cohorts, check data quality.

2 · Pipeline

Transformations, long → wide

A graph of steps that prepares the data.
Turns long format into wide format.
Each step produces a new dataset.

3 · Lab

Analyses, wide format

One row per patient.
Datasets, built-in analyses, dashboards.
Python / R IDE for custom code.

Long format, wide format: what does it mean?

A clinical database (OMOP, for example) is in long format: one row per event — a measurement, a diagnosis, a prescription. Great for storage, but not for analysis. Clinicians and statisticians think in wide format: one row per patient, one column per variable (age, latest creatinine, mean blood pressure…). The whole point of the pipeline is to go from long to wide. To dig deeper, see the Data organisation resource.
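To make the distinction concrete, here is a minimal sketch of a long-to-wide transformation in plain Python. The event rows, the variable names (creatinine, systolic_bp) and the long_to_wide helper are illustrative assumptions, not Linkr's own API or data model.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical long-format events: one row per (patient, variable, time).
events = [
    {"patient_id": 1, "variable": "creatinine", "time": "2024-01-02", "value": 80.0},
    {"patient_id": 1, "variable": "creatinine", "time": "2024-01-05", "value": 95.0},
    {"patient_id": 1, "variable": "systolic_bp", "time": "2024-01-02", "value": 120.0},
    {"patient_id": 1, "variable": "systolic_bp", "time": "2024-01-03", "value": 130.0},
    {"patient_id": 2, "variable": "creatinine", "time": "2024-02-10", "value": 60.0},
]


def long_to_wide(rows):
    """Return one dict per patient, with one key per derived variable."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        grouped[row["patient_id"]][row["variable"]].append((row["time"], row["value"]))

    wide = []
    for patient_id in sorted(grouped):
        variables = grouped[patient_id]
        creatinine = sorted(variables.get("creatinine", []))
        bp_values = [value for _, value in variables.get("systolic_bp", [])]
        wide.append({
            "patient_id": patient_id,
            # ISO dates sort chronologically, so the last tuple is the latest.
            "latest_creatinine": creatinine[-1][1] if creatinine else None,
            "mean_systolic_bp": mean(bp_values) if bp_values else None,
        })
    return wide


wide_rows = long_to_wide(events)
for row in wide_rows:
    print(row)
```

Five event rows collapse into two patient rows. A patient with no value for a variable gets None in that column, which is typical of wide tables built from sparse clinical events.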

The warehouse: explore without breaking anything

The warehouse is the entry point for clinical data. You connect one or more databases to it — an OMOP extraction, a demo dataset, a hospital export. These databases are always read-only: Linkr never rewrites the source data.

At this level, without writing any code, you can:

Browse the concepts present in the database, by category (diagnoses, drugs, measurements…), with their statistics.
Build cohorts by stacking criteria (age, sex, concepts, period, length of stay) into a logical tree — Linkr generates the corresponding SQL.
Check data quality: completeness, distributions, validation rules.
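As an illustration of how a logical tree of criteria can become SQL, the sketch below compiles a nested AND/OR structure into a parameterised WHERE clause and runs it on an in-memory SQLite table. The tree layout, the compile_criteria function and the patients table are assumptions made for this example; the SQL that Linkr actually generates may look quite different.

```python
import sqlite3

# Hypothetical criteria tree: age >= 65 AND (sex = 'F' OR length_of_stay > 7).
criteria = {
    "op": "AND",
    "children": [
        {"column": "age", "cmp": ">=", "value": 65},
        {
            "op": "OR",
            "children": [
                {"column": "sex", "cmp": "=", "value": "F"},
                {"column": "length_of_stay", "cmp": ">", "value": 7},
            ],
        },
    ],
}

# Whitelists keep column names and operators out of user control.
ALLOWED_COLUMNS = {"age", "sex", "length_of_stay"}
ALLOWED_CMP = {"=", "<", "<=", ">", ">="}


def compile_criteria(node):
    """Return (sql_fragment, params) for one node of the criteria tree."""
    if "children" in node:
        if node["op"] not in ("AND", "OR"):
            raise ValueError("unknown logical operator")
        parts, params = [], []
        for child in node["children"]:
            child_sql, child_params = compile_criteria(child)
            parts.append(child_sql)
            params.extend(child_params)
        return "(" + f" {node['op']} ".join(parts) + ")", params
    if node["column"] not in ALLOWED_COLUMNS or node["cmp"] not in ALLOWED_CMP:
        raise ValueError("unknown column or comparator")
    return f"{node['column']} {node['cmp']} ?", [node["value"]]


where_sql, where_params = compile_criteria(criteria)
cohort_sql = f"SELECT patient_id FROM patients WHERE {where_sql} ORDER BY patient_id"

# Try the generated query on a tiny in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (patient_id INTEGER, age INTEGER, sex TEXT, length_of_stay INTEGER)"
)
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?, ?)",
    [(1, 70, "F", 3), (2, 80, "M", 10), (3, 50, "F", 12), (4, 66, "M", 2)],
)
cohort_ids = [row[0] for row in conn.execute(cohort_sql, where_params)]
conn.close()
print(cohort_sql)
print(cohort_ids)
```

Values travel as bound parameters rather than being pasted into the query string, which is the usual way to keep generated SQL safe.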

For these tools to read a database, you associate a schema with it: a preset (OMOP CDM, MIMIC…) or a custom definition, telling Linkr which table holds patients, visits, measurements, and so on. It is the schema that makes the warehouse “aware”: it teaches Linkr how this particular database is structured.
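One way to picture a schema is as a mapping from logical roles to physical tables. In the sketch below, the OMOP table and key names (person, visit_occurrence, measurement) come from the OMOP CDM, while the dictionary layout, the custom export names and the count_query helper are purely hypothetical and do not reflect Linkr's actual schema format.

```python
# Hypothetical schema definitions: each logical role points to a physical
# table and its key column.
OMOP_SCHEMA = {
    "patients": {"table": "person", "id": "person_id"},
    "visits": {"table": "visit_occurrence", "id": "visit_occurrence_id"},
    "measurements": {"table": "measurement", "id": "measurement_id"},
}

# An invented hospital export with its own naming.
CUSTOM_SCHEMA = {
    "patients": {"table": "pat", "id": "pat_no"},
    "visits": {"table": "stay", "id": "stay_no"},
    "measurements": {"table": "lab_result", "id": "result_no"},
}


def count_query(schema, role):
    """Build the same logical query against whichever schema is supplied."""
    entry = schema[role]
    return f"SELECT COUNT(DISTINCT {entry['id']}) FROM {entry['table']}"


print(count_query(OMOP_SCHEMA, "patients"))
print(count_query(CUSTOM_SCHEMA, "patients"))
```

The tool asks one logical question ("how many patients?") and the schema decides which table and column answer it.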

Aligning local vocabulary with standards

Before analysing, you often need to translate codes specific to an institution into standard vocabularies (SNOMED, LOINC, RxNorm). This step, concept mapping, has its own section: Concept mapping.

The pipeline: preparing data for analysis

Between the warehouse (long format) and the lab (wide format) sits the pipeline: a chain of steps that transform the data. Linkr follows the same principle as dedicated data-preparation tools: the source is never modified, and each transformation produces a new dataset as output. This preserves end-to-end traceability: you can always trace back to the original data.

A pipeline is represented as a graph: a database as input, a cohort that restricts the population, one or more transformation steps, and a dataset as output.

Database (warehouse) → Cohort → Transformation (long → wide) → Dataset (lab)
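The "never modify the source, every step yields a new dataset" principle can be sketched in a few lines of Python. This is only an illustration of the idea under simplifying assumptions (a linear chain rather than a general graph). Step, run_pipeline and the two example steps are invented names, not Linkr's execution engine.

```python
from typing import Callable, NamedTuple


class Step(NamedTuple):
    name: str
    run: Callable[[tuple], tuple]  # takes a dataset, returns a NEW dataset


def restrict_to_cohort(rows):
    cohort_ids = {1, 2}  # hypothetical cohort definition
    return tuple(r for r in rows if r["patient_id"] in cohort_ids)


def count_events_per_patient(rows):
    counts = {}
    for r in rows:
        counts[r["patient_id"]] = counts.get(r["patient_id"], 0) + 1
    return tuple(
        {"patient_id": pid, "n_events": n} for pid, n in sorted(counts.items())
    )


def run_pipeline(source, steps):
    """Run steps in order, keeping every intermediate dataset for traceability."""
    outputs = {"source": source}
    current = source
    for step in steps:
        current = step.run(current)
        outputs[step.name] = current
    return outputs


# Long-format source: a tuple, so steps cannot append to or reorder it.
source = (
    {"patient_id": 1, "concept": "creatinine"},
    {"patient_id": 1, "concept": "heart_rate"},
    {"patient_id": 2, "concept": "creatinine"},
    {"patient_id": 3, "concept": "creatinine"},
)

outputs = run_pipeline(
    source,
    [Step("cohort", restrict_to_cohort), Step("wide", count_events_per_patient)],
)
for name, dataset in outputs.items():
    print(name, len(dataset))
```

Because each step returns a fresh tuple and the runner stores every output under its step name, any result can be traced back through the intermediate datasets to the untouched source.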

The visual pipeline is being finalised

The pipeline’s graphical canvas already exists, but its execution engine is not yet fully wired up. Today, going from long to wide format is done concretely in the built-in IDE (a Python or R script on the dataset) rather than through the graph. This is precisely where Linkr bridges the two worlds: the clinician prepares the cohort with low-code tools, the data scientist writes the transformation in the same project. See Your first project.

Two ways into a dataset

You do not have to start from the warehouse. An analytical dataset can be created in two ways:

From the warehouse

Database → cohort → transformation → dataset. The full path, when starting from long-format clinical data.

By direct import

A file already in wide format (CSV, Excel, Parquet) loaded straight into the lab, without going through the warehouse.

The lab: analysing and reporting

The lab is the space for analysis-ready data, in wide format. This is where the work becomes visible: you describe a population, compare groups, track an indicator.

The lab brings together three building blocks:

Datasets — the wide-format data (one row per patient), with their columns and descriptive statistics.
Built-in analyses — Table 1, Key Indicator, Plot Builder: ready-to-use analyses, no code required.
Dashboards — assembling widgets to explore and report. See the Dashboards section.

For needs that go beyond the graphical tools, the built-in IDE (Python, R, SQL) lets you write your own analyses directly on the dataset.
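As an example of the kind of custom analysis one might write on a wide-format dataset, the sketch below computes a small descriptive summary per group, in the spirit of a Table 1. The dataset, the column names and describe_by_group are assumptions for illustration. This is not the built-in Table 1 analysis, nor the way the IDE exposes a dataset.

```python
from statistics import mean, stdev

# Hypothetical wide-format dataset: one row per patient.
dataset = [
    {"patient_id": 1, "group": "treated", "age": 60},
    {"patient_id": 2, "group": "treated", "age": 70},
    {"patient_id": 3, "group": "control", "age": 50},
    {"patient_id": 4, "group": "control", "age": 54},
]


def describe_by_group(rows, group_col, value_col):
    """Return count, mean and sample standard deviation per group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_col], []).append(row[value_col])
    return {
        name: {
            "n": len(values),
            "mean": mean(values),
            # stdev needs at least two values.
            "sd": stdev(values) if len(values) > 1 else None,
        }
        for name, values in groups.items()
    }


summary = describe_by_group(dataset, "group", "age")
for name, stats in summary.items():
    print(name, stats)
```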

Analysing with plugins

Built-in analyses and widgets cover the common cases, but not all of them. A plugin is a reusable analysis building block that you plug in wherever you need it in the lab:

as a widget in a dashboard,
as an analysis on a dataset.

A plugin can be a simple configurable graphical component, or embed code in Python / R executed on the data. Linkr ships a catalog of ready-to-use analysis plugins — Table 1, Plot Builder, Kaplan-Meier, regression, correlation matrix, Sankey diagram, statistical tests, map… — and an editor to build your own.

This is where the loop closes with the workspace: a plugin is shared at the workspace level, so it can be reused in every project of that workspace. An analysis widget built for one study serves the next one directly.

Building your own plugins

Designing a plugin (structure, Python/R execution, publishing) will be covered separately, in the More → Building a plugin section (coming soon). The use of plugins in dashboards is detailed in Built-in widgets and R, Python and SQL code.

Client-only or full-stack?

This whole journey (importing files, schemas, cohorts, datasets, analyses, plugins) works entirely in client-only mode, in the browser, on local files or extractions. Connecting directly to a hospital warehouse stored in a relational database (PostgreSQL, SQL Server, Oracle…) to query the data at the source requires full-stack mode, which is still under development. See Deployment modes.