Data Quality Assurance Framework

The CLIF Consortium validates every dataset through a rigorous three-pillar framework — Conformance, Completeness, and Plausibility — catching data issues before they reach your analysis.

7 Conformance Checks

5 Completeness Checks

8 Plausibility Checks

Conformance

Schema & Structure

Conformance checks verify structure and schema — do the right tables exist, are the required columns present, are data types valid, and do categorical values belong to the allowed vocabulary? These are the first line of defense: if the skeleton is wrong, nothing downstream matters.

PASS

patient.parquet: 12,847 rows

hospitalization.parquet: 15,203 rows

vitals.parquet: 1,204,881 rows

labs.parquet: 892,456 rows

FAIL

patient.parquet: 12,847 rows

hospitalization.parquet: 15,203 rows

vitals.parquet: MISSING

labs.parquet: 892,456 rows

PASS — hospitalization

hospitalization_id ✓

patient_id ✓

admission_dttm ✓

discharge_dttm ✓

hospital_id ✓

FAIL — hospitalization

hospitalization_id ✓

patient_id ✓

admission_dttm ✓

discharge_dttm ✗ missing

hospital_id ✓
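The required-column check illustrated above can be sketched in a few lines of plain Python. This is a minimal illustration, not clifpy's implementation: the helper name `missing_columns` is hypothetical, and the column list is copied from the card above rather than from the full CLIF schema.

```python
# Illustrative only: the column list mirrors the card above, not the
# full CLIF schema, and this helper is not part of clifpy.
REQUIRED_COLUMNS = {
    "hospitalization": [
        "hospitalization_id",
        "patient_id",
        "admission_dttm",
        "discharge_dttm",
        "hospital_id",
    ],
}


def missing_columns(table, present, required=REQUIRED_COLUMNS):
    """Return the required columns that are absent, in schema order."""
    have = set(present)
    return [col for col in required[table] if col not in have]


complete = ["hospitalization_id", "patient_id", "admission_dttm",
            "discharge_dttm", "hospital_id"]
incomplete = [c for c in complete if c != "discharge_dttm"]

print(missing_columns("hospitalization", complete))    # []
print(missing_columns("hospitalization", incomplete))  # ['discharge_dttm']
```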

PASS

Column	Type
weight	float64
height	float64
age	int64

FAIL

Column	Type
weight	object ("abc")
height	float64
age	int64

PASS

admission_dttm

datetime64[ns] → 2024-01-15 08:32:00

FAIL

admission_dttm

object (string) → "2024-01-15"

PASS — vital_category

"heart_rate"

"sbp"

"dbp"

"spo2"

FAIL — vital_category

"heart_rate"

"sbp"

"heart_rate_bpm" ✗ not in vocabulary

"spo2"

PASS

acetaminophen → analgesic_antipyretic

norepinephrine → vasopressor

fentanyl → opioid

FAIL

acetaminophen → analgesic_antipyretic

acetaminophen → pain_reliever ✗ dual mapping

fentanyl → opioid
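One way to detect a dual mapping like the one above is to collect every group seen for each category and flag categories with more than one. The sketch below assumes the data arrives as (category, group) pairs; `dual_mappings` is a hypothetical helper, not a clifpy function.

```python
from collections import defaultdict


def dual_mappings(pairs):
    """Return {category: [groups]} for categories mapped to 2+ groups."""
    groups = defaultdict(set)
    for category, group in pairs:
        groups[category].add(group)
    return {cat: sorted(g) for cat, g in groups.items() if len(g) > 1}


failing = [
    ("acetaminophen", "analgesic_antipyretic"),
    ("acetaminophen", "pain_reliever"),
    ("fentanyl", "opioid"),
]
print(dual_mappings(failing))
# {'acetaminophen': ['analgesic_antipyretic', 'pain_reliever']}
```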

PASS

potassium: mmol/L

creatinine: mg/dL

hemoglobin: g/dL

FAIL

potassium: mEq/L (expected mmol/L)

creatinine: mg/dL

hemoglobin: g/dL

Completeness

Data Coverage

A dataset can be perfectly conformant yet still unusable if half the values are NULL. Completeness checks look beyond structure to ask: is the data filled in? They flag excessive missing values, verify that conditional fields are populated when they should be, and ensure that expected categorical coverage is met.

PASS — labs

lab_value

8% null

lab_category

1% null

FAIL — labs

lab_value

55% null

lab_category

1% null
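A missingness check reduces to computing a null fraction per column and comparing it with a threshold. The sketch below is an assumption-laden illustration: the 50% threshold is hypothetical (the page does not state the cutoff clifpy uses), columns are plain Python lists with `None` standing in for NULL, and both helper names are invented for the example.

```python
def null_fraction(values):
    """Fraction of entries that are None (0.0 for an empty column)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def missingness_failures(table, threshold=0.5):
    """Map column name -> null fraction for columns above the threshold.

    The default threshold is an assumption made for this sketch.
    """
    fractions = {name: null_fraction(col) for name, col in table.items()}
    return {name: frac for name, frac in fractions.items() if frac > threshold}


# Synthetic columns reproducing the FAIL card: 55% and 1% null.
labs = {
    "lab_value": [None] * 55 + [4.1] * 45,
    "lab_category": [None] * 1 + ["potassium"] * 99,
}
print(missingness_failures(labs))  # {'lab_value': 0.55}
```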

PASS

device_category	fio2_set	peep
IMV	0.6	10
IMV	0.5	8

FAIL

device_category	fio2_set	peep
IMV	NULL	10
IMV	0.5	NULL
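The conditional requirement shown above (ventilator settings must be populated when the device is IMV) can be expressed as a rule table plus a row scan. The rule dictionary and the helper `unmet_conditions` are hypothetical and cover only the two fields in the card.

```python
# Assumed rule for illustration: IMV rows need both settings populated.
CONDITIONAL_FIELDS = {"IMV": ["fio2_set", "peep"]}


def unmet_conditions(rows, rules=CONDITIONAL_FIELDS):
    """Return (row_index, field) for each required field that is None."""
    problems = []
    for index, row in enumerate(rows):
        for field in rules.get(row["device_category"], []):
            if row.get(field) is None:
                problems.append((index, field))
    return problems


failing = [
    {"device_category": "IMV", "fio2_set": None, "peep": 10},
    {"device_category": "IMV", "fio2_set": 0.5, "peep": None},
]
print(unmet_conditions(failing))  # [(0, 'fio2_set'), (1, 'peep')]
```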

PASS — vital_category (8/9)

heart_rate sbp dbp map spo2 resp_rate temp weight

FAIL — vital_category (3/9)

present: heart_rate sbp spo2

missing: dbp map resp_rate temp weight height ✗
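Value coverage compares the categories observed in the data with the expected vocabulary. The sketch below takes the nine category names from the FAIL card as the expected list and assumes that the first three are the ones actually present, which reproduces the 3/9 figure. `vocabulary_coverage` is a hypothetical helper, not clifpy's API.

```python
# Expected vocabulary as listed in the card above (illustrative names).
EXPECTED_VITALS = ["heart_rate", "sbp", "spo2", "dbp", "map",
                   "resp_rate", "temp", "weight", "height"]


def vocabulary_coverage(observed, expected=EXPECTED_VITALS):
    """Split the vocabulary into present, missing, and unexpected values."""
    seen = set(observed)
    present = [v for v in expected if v in seen]
    missing = [v for v in expected if v not in seen]
    unexpected = sorted(seen - set(expected))
    return present, missing, unexpected


# Assumed observations: only three categories ever appear.
observed = ["heart_rate", "sbp", "spo2", "heart_rate"]
present, missing, unexpected = vocabulary_coverage(observed)
print(f"{len(present)}/{len(EXPECTED_VITALS)} covered; missing: {missing}")
```

The same function also surfaces out-of-vocabulary values such as "heart_rate_bpm" through its third return value.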

PASS — labs → hospitalization

labs.hosp_id	in hospitalization?
H001	✓
H002	✓
H003	✓

100% coverage ✓

FAIL — labs → hospitalization

labs.hosp_id	in hospitalization?
H001	✓
H999	✗ (orphan)
H002	✓

Orphan key in child table ✗
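Relational integrity amounts to a set difference: any key in the child table that does not exist in the parent table is an orphan. A minimal sketch, with `orphan_keys` as a hypothetical helper name:

```python
def orphan_keys(child_keys, parent_keys):
    """Return the distinct child keys that have no match in the parent."""
    parents = set(parent_keys)
    return sorted({key for key in child_keys if key not in parents})


hospitalization_ids = ["H001", "H002", "H003"]
labs_ids = ["H001", "H999", "H002"]

print(orphan_keys(labs_ids, hospitalization_ids))  # ['H999']
```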

PASS — IMV episode

respiratory_support: IMV

medication_admin: propofol ✓

medication_admin: fentanyl ✓

FAIL — IMV episode

respiratory_support: IMV

medication_admin: no sedation/analgesia rows ✗

Plausibility

Clinical Sense

Plausibility checks go beyond structure — they verify that data values are clinically reasonable. A heart rate of 450 is technically numeric but clinically impossible. These checks catch subtle ETL bugs that conformance alone would miss.

PASS — labs

collect_dttm: 2024-01-15 08:00

result_dttm: 2024-01-15 09:30

collect ≤ result ✓

FAIL — labs

collect_dttm: 2024-01-15 09:30

result_dttm: 2024-01-15 08:00

result before collect? ✗
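Temporal ordering is a pairwise comparison of two timestamp columns on the same row. The sketch below uses the column names from the card and the standard `datetime` module; `misordered` is a hypothetical helper, not part of clifpy.

```python
from datetime import datetime


def misordered(rows, earlier="collect_dttm", later="result_dttm"):
    """Return the rows where the 'earlier' timestamp follows the 'later' one."""
    return [row for row in rows if row[earlier] > row[later]]


rows = [
    {"collect_dttm": datetime(2024, 1, 15, 8, 0),
     "result_dttm": datetime(2024, 1, 15, 9, 30)},   # collect <= result
    {"collect_dttm": datetime(2024, 1, 15, 9, 30),
     "result_dttm": datetime(2024, 1, 15, 8, 0)},    # result before collect
]
print(len(misordered(rows)))  # 1
```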

PASS — vitals

Vital	Value	Range
heart_rate	88	0–300
sbp	125	0–400
spo2	96	0–100

FAIL — vitals

Vital	Value	Range
heart_rate	450	0–300 ✗
sbp	125	0–400
spo2	96	0–100
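The range check can be sketched as a lookup of inclusive bounds per vital. The bounds below are the ones shown in the table above; the helper name `out_of_range` is an assumption made for illustration.

```python
# Inclusive bounds copied from the table above.
RANGES = {"heart_rate": (0, 300), "sbp": (0, 400), "spo2": (0, 100)}


def out_of_range(readings, ranges=RANGES):
    """Return the (vital, value) pairs that fall outside their bounds."""
    flagged = []
    for vital, value in readings:
        low, high = ranges[vital]
        if not (low <= value <= high):
            flagged.append((vital, value))
    return flagged


failing = [("heart_rate", 450), ("sbp", 125), ("spo2", 96)]
print(out_of_range(failing))  # [('heart_rate', 450)]
```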

PASS

location_category: "icu"

location_type: "MICU"

ICU type matches ICU category ✓

FAIL

location_category: "ward"

location_type: "MICU"

ICU type but ward category? ✗

PASS

norepinephrine (continuous)

dose: 0.15 unit: mcg/kg/min

FAIL

norepinephrine (continuous)

dose: 500 unit: mg (should be rate)

PASS — ADT

MICU Jan 15 08:00 – Jan 18 14:00

Ward Jan 18 14:00 – Jan 20 10:00

Sequential, no overlap ✓

FAIL — ADT

MICU Jan 15 08:00 – Jan 18 14:00

Ward Jan 17 10:00 – Jan 20 10:00

Overlap: Jan 17–18 patient in 2 locations ✗
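Overlap detection can be done by sorting the stays by start time and checking whether each stay begins before an earlier one has ended. In this sketch the year 2024 is an assumption (the card shows only month and day), stays that merely touch at a boundary are not treated as overlapping, and `overlapping_stays` is a hypothetical helper.

```python
from datetime import datetime


def overlapping_stays(stays):
    """stays: list of (location, in_dttm, out_dttm) tuples.

    Returns (earlier_location, later_location) for each stay that starts
    before the latest-ending earlier stay has finished.
    """
    ordered = sorted(stays, key=lambda stay: stay[1])
    if not ordered:
        return []
    overlaps = []
    latest = ordered[0]
    for current in ordered[1:]:
        if current[1] < latest[2]:
            overlaps.append((latest[0], current[0]))
        if current[2] > latest[2]:
            latest = current
    return overlaps


passing = [
    ("MICU", datetime(2024, 1, 15, 8), datetime(2024, 1, 18, 14)),
    ("Ward", datetime(2024, 1, 18, 14), datetime(2024, 1, 20, 10)),
]
failing = [
    ("MICU", datetime(2024, 1, 15, 8), datetime(2024, 1, 18, 14)),
    ("Ward", datetime(2024, 1, 17, 10), datetime(2024, 1, 20, 10)),
]
print(overlapping_stays(passing))  # []
print(overlapping_stays(failing))  # [('MICU', 'Ward')]
```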

PASS — respiratory_support

recorded_dttm	device_category
10:00	NIPPV
10:30	NIPPV
11:00	IMV

Coherent escalation ✓

FAIL — respiratory_support

recorded_dttm	device_category
10:00	IMV
10:01	Room Air
10:02	IMV

Implausible 1-minute toggle ✗

PASS — labs

hosp_id	lab_cat	collect_dttm
H001	potassium	Jan 15 08:00
H001	creatinine	Jan 15 08:00
H001	potassium	Jan 16 08:00

All composite keys unique ✓

FAIL — labs

hosp_id	lab_cat	collect_dttm
H001	potassium	Jan 15 08:00
H001	potassium	Jan 15 08:00
H001	potassium	Jan 16 08:00

Duplicate key detected ✗
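Duplicate composite keys can be found by counting the tuple of key columns for every row. The sketch below uses `collections.Counter` on rows represented as dictionaries; the helper name `duplicate_keys` and the string timestamps are simplifications for illustration.

```python
from collections import Counter


def duplicate_keys(rows, key_columns):
    """Return {key_tuple: count} for composite keys seen more than once."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


key = ["hosp_id", "lab_cat", "collect_dttm"]
failing = [
    {"hosp_id": "H001", "lab_cat": "potassium", "collect_dttm": "Jan 15 08:00"},
    {"hosp_id": "H001", "lab_cat": "potassium", "collect_dttm": "Jan 15 08:00"},
    {"hosp_id": "H001", "lab_cat": "potassium", "collect_dttm": "Jan 16 08:00"},
]
print(duplicate_keys(failing, key))
# {('H001', 'potassium', 'Jan 15 08:00'): 2}
```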

PASS

hospitalization:

admit: Jan 15 discharge: Jan 20

lab collected:

Jan 17 (within stay)

FAIL

hospitalization:

admit: Jan 15 discharge: Jan 20

lab collected:

Jan 23 (3 days after discharge) ✗

clifpy v0.4.9 · Python 3.9+

Run It Yourself

The clifpy package runs every check on this page against your own CLIF data and emits structured results, a PDF report, and a text summary.

all tables · one call

# 1. install (shell command, not Python)
pip install clifpy

# 2. orchestrate
from clifpy.orchestrator import ClifOrchestrator

co = ClifOrchestrator(config='clif_config.yaml')
co.initialize(['labs', 'vitals', 'medications'])
co.validate_all()

# 3. inspect
for err in co.labs.errors:
    print(err.column, err.message)

one table · one pillar

from clifpy.utils.validator import (
    run_conformance_checks,
    run_completeness_checks,
    run_plausibility_checks,
    run_full_dqa,
)

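# df: the labs table, already loaded as a DataFrame
# schema: the schema definition for that table
result = run_full_dqa(df, schema, 'labs')
print(result.passed, result.metrics)

# Serialize or report out
# (the two report generators below must also be imported from clifpy;
# their module path is not shown on this page)
result.to_dict()                 # JSON-safe dict
generate_validation_pdf(result)  # PDF report
generate_text_report(result)     # plain-text summary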

New in clifpy 0.4.x

Nullable / allow_missing support, data profiling and monthly trends, ICD normalization (0.4.8+), atomic-check counts in reports, dual Polars / DuckDB backends for datasets in the hundreds of GB, and PDF + text report generators.

PyPI: pypi.org/project/clifpy

GitHub: source, issues, releases

Demo notebook: examples/dqa_demo.ipynb


Conformance

C.1 Table Presence

C.2 Required Columns

C.3 Data Types

C.4 Datetime Format

C.5 Categorical Values

C.6 Category–Group Mapping

C.7 Lab Reference Units

Completeness

K.1 Missingness

K.2 Conditional Requirements

K.3 mCIDE Value Coverage

K.4 Relational Integrity

K.5 Cross-Table Conditional Completeness

Plausibility

P.1 Temporal Ordering

P.2 Numeric Range

P.3 Field Plausibility

P.4 Medication Dose Units

P.5 Overlapping Periods

P.6 Category Temporal Consistency

P.7 Duplicate Composite Keys

P.8 Cross-Table Temporal Plausibility
