Data Quality Library (dqLib): An R Package for Assessing and Reporting Data Quality in Clinical Research and Care
The data quality library (dqLib) is an R package for data quality (DQ) assessment. The library provides generic methods for calculating DQ metrics and generating reports on detected DQ issues, especially in clinical research and healthcare settings. The package also provides specific functions for reporting on DQ issues that may arise in the context of rare diseases (RDs) and common diseases such as cardiovascular diseases (CVDs). The current version enables the detection and visualization of plausibility issues based on predefined mathematical and logical rules. To enhance usability, this release allows DQ rules to be specified in spreadsheets. Further details on the developed functions are given in the news.
You can install dqLib directly from GitHub by running the following command:
devtools::install_github("https://github.com/KaisTahar/dqLib")
Alternatively, to install dqLib you can clone or download the code repository of the desired version, and then run the following command from the local folder:
devtools::install_local("./dqLib")
dqLib provides multiple metrics to analyze different aspects of DQ. The implemented functions enable users to select desired dimensions and indicators as well as to define and generate customized DQ reports. The following generic DQ indicators are already implemented:
Indicator Abbreviation | Indicator Name | DQ Dimension |
---|---|---|
dqi_co_icr | Item Completeness Rate | Completeness |
dqi_co_vcr | Value Completeness Rate | Completeness |
dqi_co_scr | Subject Completeness Rate | Completeness |
dqi_pl_rpr | Range Plausibility Rate | Plausibility |
dqi_pl_spr | Semantic Plausibility Rate | Plausibility |
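
To make the completeness indicators more concrete, the following base R sketch computes a value completeness rate and a subject completeness rate on a small synthetic data frame. It does not call dqLib. The helper names (`is_missing`, `missing_matrix`, `value_completeness_rate`, `subject_completeness_rate`), the column names, and the definition of each rate as the percentage of non-missing mandatory values (or of complete subjects) are assumptions made for illustration; the package's own definitions may differ.

```r
# Hypothetical helpers for illustration; these are NOT part of dqLib's API.

# Assumption: a value counts as missing if it is NA or an empty string.
is_missing <- function(x) is.na(x) | trimws(as.character(x)) == ""

# Logical matrix (rows = records, columns = mandatory items), TRUE = missing.
missing_matrix <- function(data, mandatory) {
  do.call(cbind, lapply(data[mandatory], is_missing))
}

# Percentage of mandatory values that are present.
value_completeness_rate <- function(data, mandatory) {
  m <- missing_matrix(data, mandatory)
  vm <- length(m)        # number of mandatory values
  vm_misg <- sum(m)      # number of missing mandatory values
  round(100 * (vm - vm_misg) / vm, 2)
}

# Percentage of subjects without any missing mandatory value.
subject_completeness_rate <- function(data, mandatory, id_col) {
  incomplete <- rowSums(missing_matrix(data, mandatory)) > 0
  s <- length(unique(data[[id_col]]))                  # number of subjects
  s_inc <- length(unique(data[[id_col]][incomplete]))  # incomplete subjects
  round(100 * (s - s_inc) / s, 2)
}

# Synthetic example data (invented values).
patients <- data.frame(
  patient_id = c("P1", "P2", "P3", "P4"),
  birthdate  = c("1980-01-01", NA, "1975-06-30", "1990-12-12"),
  icd_code   = c("E84.0", "I21.0", "", "Q87.4"),
  stringsAsFactors = FALSE
)
mandatory <- c("birthdate", "icd_code")

vcr <- value_completeness_rate(patients, mandatory)                 # 6 of 8 values present
scr <- subject_completeness_rate(patients, mandatory, "patient_id") # 2 of 4 subjects complete
print(c(value_completeness = vcr, subject_completeness = scr))
```

In this example, two of the eight mandatory values are missing and two of the four subjects are incomplete, so the sketch yields 75 and 50 percent respectively.
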
In addition to indicators, the DQ reports include the resulting parameters and the information needed to identify potential DQ issues. The dqLib package enables users to specify DQ rules in spreadsheets and to detect DQ issues based on these predefined rules, as described in the news. dqLib provides functions to detect the following common DQ issues:
Abbreviation | DQ Parameter | Description |
---|---|---|
im_misg | missing mandatory data items | number of missing mandatory data items |
vm_misg | missing mandatory data values | number of missing mandatory data values |
s_inc | incomplete subjects | number of incomplete subject records |
vo | outlier values | number of detected outlier values |
vc | contradictory values | number of detected contradictory data values |
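
As a rough illustration of rule-based detection, the base R sketch below counts outlier values (`vo`) from tabular range rules and contradictory values (`vc`) from one logical rule. It does not use dqLib. The rule layout (columns `item`, `min`, `max`), the helper names `count_outliers` and `count_contradictions`, the synthetic records, and the specific contradiction rule (discharge before admission) are all assumptions; the spreadsheet format expected by the package may look different.

```r
# Hypothetical range rules, laid out as they might appear in a spreadsheet.
# This layout is an assumption, not dqLib's documented rule format.
range_rules <- data.frame(
  item = c("age", "systolic_bp"),
  min  = c(0, 50),
  max  = c(120, 300),
  stringsAsFactors = FALSE
)

# Synthetic example records (invented values).
records <- data.frame(
  age         = c(34, 150, 61, NA),
  systolic_bp = c(120, 135, 20, 110),
  admission   = as.Date(c("2023-01-05", "2023-02-01", "2023-03-10", "2023-04-02")),
  discharge   = as.Date(c("2023-01-09", "2023-01-28", "2023-03-15", "2023-04-03"))
)

# Count values outside [min, max] across all rules.
# Missing values are not counted as outliers.
count_outliers <- function(data, rules) {
  per_item <- mapply(function(item, lo, hi) {
    x <- data[[item]]
    sum(!is.na(x) & (x < lo | x > hi))
  }, rules$item, rules$min, rules$max)
  sum(per_item)
}

# Count records violating one logical rule: discharge must not precede admission.
count_contradictions <- function(data) {
  sum(data$discharge < data$admission, na.rm = TRUE)
}

vo <- count_outliers(records, range_rules)  # age 150 and systolic_bp 20
vc <- count_contradictions(records)         # second record
print(c(vo = vo, vc = vc))
```

Keeping the rules in a plain table separates the rule definitions from the checking code, which is the same idea as maintaining them in a spreadsheet.
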
dqLib also provides functions to assess the following specific indicators for RD data:
Indicator Abbreviation | Indicator Name | DQ Dimension |
---|---|---|
dqi_un_cur | RD Case Unambiguity Rate | Uniqueness |
dqi_un_cdr | RD Case Dissimilarity Rate | Uniqueness |
dqi_co_icr | Orphacoding Completeness Rate | Completeness |
dqi_pl_opr | Orphacoding Plausibility Rate | Plausibility |
dqi_cc_rvl | Concordance with Reference Values from Literature | Concordance |
Moreover, dqLib enables annual assessments of selected DQ parameters. The following RD-specific metrics are already implemented:
Abbreviation | DQ Parameter | Description |
---|---|---|
rdCase | RD cases | number of RD cases |
orphaCase | Orpha cases | number of available orpha-coded cases |
tracerCase | tracer cases | number of tracer cases |
rdCase_rel | RD cases rel. frequency | relative frequency of RD cases |
orphaCase_rel | Orpha cases rel. frequency | relative frequency of Orpha cases normalized to 100,000 inpatient cases |
tracerCase_rel | tracer cases rel. frequency | relative frequency of tracer cases normalized to 100,000 inpatient cases |
tracerCase_rel_min | minimal tracer cases in reference values | min. rel. frequency of tracer cases normalized to 100,000 inpatient cases found in the literature |
tracerCase_rel_max | maximal tracer cases in reference values | max. rel. frequency of tracer cases normalized to 100,000 inpatient cases found in the literature |
vm_case_misg | missing mandatory data values in case module | number of missing mandatory data values in the case module |
rdCase_amb | ambiguous RD cases | number of ambiguous RD cases |
rdCase_dup | duplicated RD cases | number of duplicated RD cases |
oc_misg | missing Orphacodes | number of missing Orphacodes by tracer diagnoses |
link_ip | implausible links | number of implausible ICD-10-GM/OC links |
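
The normalization to 100,000 inpatient cases used by several parameters above amounts to simple arithmetic, sketched here in base R. The helper `rel_frequency_per_100k`, the annual counts, and the reference interval are invented for illustration and are not taken from dqLib or from the literature. Comparing the result with a minimum and maximum reference value is shown only as one plausible reading of the concordance check.

```r
# Hypothetical helper for illustration; NOT part of dqLib's API.
# Relative frequency of cases per 100,000 inpatient cases.
rel_frequency_per_100k <- function(case_count, inpatient_cases) {
  stopifnot(inpatient_cases > 0)
  round(case_count / inpatient_cases * 100000, 2)
}

# Synthetic annual counts (invented values).
inpatient_cases <- 45000
tracer_cases    <- 90

tracerCase_rel <- rel_frequency_per_100k(tracer_cases, inpatient_cases)

# Hypothetical reference interval per 100,000 inpatient cases (invented values).
tracerCase_rel_min <- 150
tracerCase_rel_max <- 400

# Assumption: the observed value is treated as concordant if it lies in the interval.
within_reference <- tracerCase_rel >= tracerCase_rel_min &
  tracerCase_rel <= tracerCase_rel_max

print(tracerCase_rel)
print(within_reference)
```

With 90 tracer cases among 45,000 inpatient cases, the normalized frequency is 90 / 45,000 × 100,000 = 200 per 100,000 inpatient cases.
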
The following references are required to assess the quality of RD documentation: (1) the current version of the Alpha-ID-SE terminology [1] and (2) a reference for tracer diagnoses, such as the list provided in [2].
[1] BfArM - Alpha-ID-SE [Internet]. [cited 2022 May 23]. Available from: BfArM
[2] Tahar et al. Rare Diseases in Hospital Information Systems — An Interoperable Methodology for Distributed Data Quality Assessments. Methods Inf Med. 2023 Sep;62(3/4):71–89. DOI: 10.1055/a-2006-1018
- cordDqChecker: A reporting tool for DQ assessment on RD data implemented using dqLib. This tool provides some examples of DQ reports generated using synthetic data.
- cvdDqChecker: A tool for assessing and reporting data quality on CVD data. This tool was also implemented using dqLib. The ./Export folder contains exemplary DQ reports and visualizations.
- To cite dqLib, please use the CITATION file located in the folder ./inst.
- Acknowledgment: This work was funded by the German Centre for Cardiovascular Research (DZHK), grant number 81X1300117, and the "Collaboration on Rare Diseases" of the Medical Informatics Initiative (CORD-MI) under grant number 01ZZ1911R (FKZ-01ZZ1911R).