Data Quality Library (dqLib): An R Package for Assessing and Reporting Data Quality in Clinical Research and Care

1. Description

The data quality library (dqLib) is an R package for data quality (DQ) assessment. It provides generic methods for calculating DQ metrics and for generating reports on detected DQ issues, especially in clinical research and healthcare settings. The package also provides specific functions for reporting on DQ issues that may arise in the context of rare diseases (RDs) and common diseases such as cardiovascular diseases (CVDs). The current version detects and visualizes plausibility issues based on predefined mathematical and logical rules. To enhance usability, this release allows DQ rules to be specified in spreadsheets. Further details on the developed functions are given in the news.

2. Installation

You can install dqLib directly from GitHub by running the following command:

devtools::install_github("https://github.com/KaisTahar/dqLib") 

Alternatively, clone or download the code repository of the desired version, and then run the following command from the folder that contains the local copy:

devtools::install_local("./dqLib")

3. DQ Metrics and Reports

dqLib provides multiple metrics to analyze different aspects of DQ. The implemented functions enable users to select the desired dimensions and indicators and to define and generate customized DQ reports. The following generic DQ indicators are already implemented:

| Abbreviation | DQ Indicator | DQ Dimension |
|---|---|---|
| dqi_co_icr | Item Completeness Rate | Completeness |
| dqi_co_vcr | Value Completeness Rate | Completeness |
| dqi_co_scr | Subject Completeness Rate | Completeness |
| dqi_pl_rpr | Range Plausibility Rate | Plausibility |
| dqi_pl_spr | Semantic Plausibility Rate | Plausibility |
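
To make the completeness indicators more concrete, the following base R sketch computes them on a toy data set. It does not use dqLib's own functions. The data, the list of mandatory items, the helper `rate()`, and the assumption that each rate is the percentage of units not affected by an issue are all illustrative and may differ from the package's actual definitions.

```r
# Toy data set: NA marks a missing value (all names are illustrative).
patients <- data.frame(
  patient_id = c("P1", "P2", "P3", "P4"),
  birth_date = c("1980-01-01", NA, "1975-06-30", "1990-11-12"),
  icd_code   = c("E84.0", "I21.0", NA, NA),
  stringsAsFactors = FALSE
)

# Items assumed to be mandatory; "admission_date" is absent from the data.
mandatory_items <- c("patient_id", "birth_date", "icd_code", "admission_date")

# Assumed generic rate: percentage of units that are not affected by an issue.
rate <- function(n_issues, n_total) round(100 * (1 - n_issues / n_total), 2)

# Item completeness: mandatory items that are missing from the data set.
present_items <- intersect(mandatory_items, names(patients))
im_misg <- length(setdiff(mandatory_items, names(patients)))
item_completeness <- rate(im_misg, length(mandatory_items))

# Value completeness: missing values among the mandatory items that are present.
values <- patients[, present_items, drop = FALSE]
vm_misg <- sum(is.na(values))
value_completeness <- rate(vm_misg, nrow(values) * ncol(values))

# Subject completeness: subjects with at least one missing mandatory value.
s_inc <- sum(!complete.cases(values))
subject_completeness <- rate(s_inc, nrow(values))

print(c(item = item_completeness,
        value = value_completeness,
        subject = subject_completeness))
```

The counts `im_misg`, `vm_misg`, and `s_inc` are named after the DQ parameters that the reports list alongside the indicators.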


In addition to indicators, the DQ reports include the resulting parameters and adequate information to identify potential DQ issues. The dqLib package enables users to specify DQ rules using spreadsheets and to detect DQ issues based on the predefined rules, as described in the news. dqLib provides functions to detect the following common DQ issues:

| Abbreviation | DQ Parameter | Description |
|---|---|---|
| im_misg | missing mandatory data items | number of missing mandatory data items |
| vm_misg | missing mandatory data values | number of missing mandatory data values |
| s_inc | incomplete subjects | number of incomplete subject records |
| vo | outlier values | number of detected outlier values |
| vc | contradictory values | number of detected contradictory data values |
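
The idea of rule-based plausibility checks can be sketched in base R as follows. This is not dqLib's API or its spreadsheet format: the rule tables, their column names (`item`, `min`, `max`, `code_prefix`, `allowed_sex`), and the toy observations are hypothetical. In practice such rule tables could be read from a spreadsheet instead of being typed in.

```r
# Hypothetical range rules, one row per item, as they might appear in a spreadsheet.
range_rules <- data.frame(
  item = c("age", "systolic_bp"),
  min  = c(0, 50),
  max  = c(120, 300),
  stringsAsFactors = FALSE
)

# Toy observations (illustrative only).
obs <- data.frame(
  patient_id  = c("P1", "P2", "P3", "P4"),
  age         = c(34, 130, 58, 7),
  systolic_bp = c(120, 135, 20, NA),
  sex         = c("f", "m", "m", "f"),
  diagnosis   = c("I21.0", "C61", "N80.0", "I10"),
  stringsAsFactors = FALSE
)

# Flag values outside [lo, hi]; missing values are not counted as outliers.
is_outlier <- function(x, lo, hi) !is.na(x) & (x < lo | x > hi)

# One logical column per rule, one row per observation.
outlier_flags <- sapply(seq_len(nrow(range_rules)), function(i) {
  is_outlier(obs[[range_rules$item[i]]], range_rules$min[i], range_rules$max[i])
})
colnames(outlier_flags) <- range_rules$item
vo <- sum(outlier_flags)  # number of detected outlier values

# Hypothetical logical rules: diagnosis codes that are only plausible for one sex.
sex_rules <- data.frame(
  code_prefix = c("C61", "N80"),
  allowed_sex = c("m", "f"),
  stringsAsFactors = FALSE
)

is_contradictory <- rep(FALSE, nrow(obs))
for (i in seq_len(nrow(sex_rules))) {
  hit <- startsWith(obs$diagnosis, sex_rules$code_prefix[i])
  is_contradictory <- is_contradictory | (hit & obs$sex != sex_rules$allowed_sex[i])
}
vc <- sum(is_contradictory)  # number of detected contradictory values

print(list(vo = vo, vc = vc, contradictory = obs$patient_id[is_contradictory]))
```

Keeping the rules in plain tables separates the rule definitions from the checking code, which is the main benefit of a spreadsheet-based specification.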


dqLib also provides functions to assess the following specific indicators for RD data:

| Abbreviation | DQ Indicator | DQ Dimension |
|---|---|---|
| dqi_un_cur | RD Case Unambiguity Rate | Uniqueness |
| dqi_un_cdr | RD Case Dissimilarity Rate | Uniqueness |
| dqi_co_icr | Orphacoding Completeness Rate | Completeness |
| dqi_pl_opr | Orphacoding Plausibility Rate | Plausibility |
| dqi_cc_rvl | Concordance with Reference Values from Literature | Concordance |


Moreover, dqLib enables annual assessments of selected DQ parameters. The following RD-specific metrics are already implemented:

| Abbreviation | DQ Parameter | Description |
|---|---|---|
| rdCase | RD cases | number of RD cases |
| orphaCase | Orpha cases | number of available Orpha-coded cases |
| tracerCase | tracer cases | number of tracer cases |
| rdCase_rel | RD cases rel. frequency | relative frequency of RD cases |
| orphaCase_rel | Orpha cases rel. frequency | relative frequency of Orpha cases normalized to 100,000 inpatient cases |
| tracerCase_rel | tracer cases rel. frequency | relative frequency of tracer cases normalized to 100,000 inpatient cases |
| tracerCase_rel_min | minimal tracer cases in reference values | min. rel. frequency of tracer cases normalized to 100,000 inpatient cases found in the literature |
| tracerCase_rel_max | maximal tracer cases in reference values | max. rel. frequency of tracer cases normalized to 100,000 inpatient cases found in the literature |
| vm_case_misg | missing mandatory data values in case module | number of missing mandatory data values in the case module |
| rdCase_amb | ambiguous RD cases | number of ambiguous RD cases |
| rdCase_dup | duplicated RD cases | number of duplicated RD cases |
| oc_misg | missing Orphacodes | number of missing Orphacodes by tracer diagnoses |
| link_ip | implausible links | number of implausible ICD-10-GM/OC links |
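
As a worked example of the normalization used in the table above, the following base R sketch converts an absolute count into a relative frequency per 100,000 inpatient cases and compares it with a reference range. All figures are invented for illustration, the helper `per_100k()` is hypothetical, and the simple range check is only an assumption about how concordance with literature values might be tested, not a description of how dqLib computes dqi_cc_rvl.

```r
# Hypothetical yearly counts; none of these figures come from a real data set.
inpatient_cases <- 40000
tracerCase <- 12

# Relative frequency normalized to 100,000 inpatient cases.
per_100k <- function(n, total) n * 100000 / total
tracerCase_rel <- per_100k(tracerCase, inpatient_cases)

# Hypothetical reference range, standing in for values taken from the literature.
tracerCase_rel_min <- 10
tracerCase_rel_max <- 50

# Assumed concordance check: the observed frequency lies within the reference range.
is_concordant <- tracerCase_rel >= tracerCase_rel_min &
  tracerCase_rel <= tracerCase_rel_max

print(c(tracerCase_rel = tracerCase_rel, concordant = is_concordant))
```

With these numbers, 12 tracer cases among 40,000 inpatient cases correspond to 30 tracer cases per 100,000 inpatient cases, which lies inside the assumed reference range.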

The following references are required to assess the quality of RD documentation: (1) the current version of the Alpha-ID-SE terminology [1] and (2) a reference for tracer diagnoses, such as the list provided in [2].

[1] BfArM - Alpha-ID-SE [Internet]. [cited 2022 May 23]. Available from: BfArM

[2] Tahar et al. Rare Diseases in Hospital Information Systems — An Interoperable Methodology for Distributed Data Quality Assessments. Methods Inf Med. 2023 Sep;62(3/4):71–89. DOI: 10.1055/a-2006-1018

4. Examples

  • cordDqChecker: A reporting tool for DQ assessment on RD data implemented using dqLib. This tool provides some examples of DQ reports generated using synthetic data.
  • cvdDqChecker: A tool for assessing and reporting data quality on CVD data. This tool was also implemented using dqLib. The ./Export folder contains exemplary DQ reports and visualizations.

5. Notes

  • To cite dqLib, please use the CITATION file located in the folder ./inst.

  • Acknowledgment: This work was funded by the German Centre for Cardiovascular Research (DZHK), grant number 81X1300117, and by the "Collaboration on Rare Diseases" of the Medical Informatics Initiative (CORD-MI), grant number 01ZZ1911R (FKZ-01ZZ1911R).
