Challenges and Solutions for Collecting and Analyzing Real World Data: The ERIC CLL Database as an Illustrative Example
The authors have indicated they have no potential conflicts of interest to disclose.
The development of ERICdb is funded partially by an unrestricted grant from AbbVie and the European Research Initiative on CLL (ERIC).
ERIC, the European Research Initiative on CLL
Chronic lymphocytic leukemia (CLL) is an age-related malignancy of mature B lymphocytes.1 While the diagnosis of CLL is relatively straightforward, the clinical course and outcome are highly heterogeneous.2 Moreover, despite remarkable therapeutic advances achieved in recent years, the disease is mostly incurable.
ERIC, the European Research Initiative on CLL (http://www.ericll.org), is a Scientific Working Group (SWG) of the European Hematology Association (EHA) aimed at improving the management of CLL through collaborative research. Thanks to the active participation of its members, now numbering more than 1300 from all over Europe and beyond, ERIC engages in projects extending from basic to (mainly) translational and clinical research.
Capitalizing on these initiatives, as well as on our expertise in the collection, management and analysis of heterogeneous clinical and biological data,3-5 we have developed, and present here, the ERIC CLL database, a registry of clinical and biological data from patients with CLL.
Challenges of gathering high-quality real-world data
Collection and analysis of real-world data (RWD) can prove both effective and efficient for advancing precision medicine and improving the quality and delivery of medical care, provided that the data themselves are of high quality.6, 7 The amount of biomedical data continuously increases due to technological advances, raising the need to design and develop standardized approaches and methodologies that can be implemented in clinical practice.8
Data acquisition is usually a process distributed among different health professionals, potentially leading to data quality problems across datasets, such as data redundancy (ie, repeated information), heterogeneity (eg, different date formats) and inconsistency (eg, a date of diagnosis after the date of treatment), mainly resulting from a lack of standardization and data curation processes. Such problems are particularly pertinent in the case of multi-institutional efforts, where multilevel and multi-originated data are collected. Furthermore, the rapidly increasing complexity of the data captured during patient care, especially data produced by novel methodologies (eg, next generation sequencing), poses challenges that cannot be addressed with standard computational approaches.
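As a minimal illustration of two of the concrete problems named above, the Python sketch below normalizes heterogeneous date formats and flags records whose date of diagnosis falls after the date of first treatment. The field names, formats and records are hypothetical and do not reflect the actual ERIC data model.

```python
from datetime import date, datetime

# Hypothetical date formats encountered across contributing centres.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y"]

def parse_date(raw: str) -> date:
    """Normalize a heterogeneous date string to a single canonical type."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

def check_consistency(record: dict) -> list[str]:
    """Flag records where the date of diagnosis falls after first treatment."""
    problems = []
    diagnosis = parse_date(record["date_of_diagnosis"])
    treatment = parse_date(record["date_of_first_treatment"])
    if diagnosis > treatment:
        problems.append("date of diagnosis is after date of first treatment")
    return problems

# Example: the second record is internally inconsistent.
records = [
    {"date_of_diagnosis": "2015-03-02", "date_of_first_treatment": "12/07/2016"},
    {"date_of_diagnosis": "03.05.2018", "date_of_first_treatment": "2017-11-20"},
]
for i, rec in enumerate(records):
    for problem in check_consistency(rec):
        print(f"record {i}: {problem}")
```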
Thus, there is an imperative to improve real-world evidence generation by optimizing the integration of heterogeneous information through automated and thorough quality control and curation mechanisms, and by supporting analysis and compatibility with established ontologies. This will provide unified and standardized access to valid, accurate and comparable datasets. Practical and feasible tools are required that are easy to use, flexible and simple, in order to facilitate the data entry procedure and encourage the registration and organization of clinically relevant data from daily practice.9
Towards the development of a unified data management framework
Harmonization of heterogeneous data is a prerequisite for gathering homogenized high-quality datasets and bridging the many forms of biological and medical information.
A common approach that can be adapted to local and project-specific requirements will greatly facilitate biological, translational and clinical research, enabling multi-center projects on, for example, clinical association studies and translational medicine at large.10
Standardization
Agreement on common policies, along with a user-friendly integration of ontologies, terminologies and standards, paves the way for the standardized registration of RWD; the collected data can thus be seamlessly combined into a data integration framework, achieving semantic interoperability.
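By way of illustration, terminology harmonization can be as simple as mapping centre-specific, free-text labels onto a single controlled vocabulary before integration, surfacing unmapped labels for manual curation rather than guessing. The local labels and target terms in this Python sketch are hypothetical and are not ERIC's actual coding scheme.

```python
# Hypothetical local labels mapped to a single controlled vocabulary
# (here, common CLL cytogenetic aberrations as the standard terms).
LOCAL_TO_STANDARD = {
    "del13q": "del(13q)",
    "13q deletion": "del(13q)",
    "trisomy 12": "+12",
    "tri12": "+12",
    "del17p": "del(17p)",
    "17p-": "del(17p)",
}

def harmonize(label: str) -> str:
    """Return the standard term for a centre-specific label."""
    key = label.strip().lower()
    try:
        return LOCAL_TO_STANDARD[key]
    except KeyError:
        # Unmapped terms are surfaced for manual curation, not guessed.
        raise KeyError(f"No standard term for local label {label!r}")

print(harmonize("Tri12"))   # -> "+12"
print(harmonize("del17p"))  # -> "del(17p)"
```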
Retrospective data integration
Development and use of reliable “Extract-Transform-Load” (ETL) software enables and facilitates the bulk import of datasets from external, diverse sources into a centralized repository.11
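The sketch below shows the three ETL stages on a toy CSV export. It is a schematic Python illustration under assumed column names (PatientID, Sex, Binet), not the ETL tooling actually used for the ERIC CLL database.

```python
import csv
from pathlib import Path

# Create a toy centre export so the example is self-contained.
Path("centre_export.csv").write_text(
    "PatientID,Sex,Binet\n P001 ,m,a\nP002,F,C\n", encoding="utf-8"
)

def extract(path: Path) -> list[dict]:
    """Extract: read raw rows from an external source."""
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def transform(rows: list[dict]) -> list[dict]:
    """Transform: normalize values to the repository's conventions."""
    out = []
    for row in rows:
        out.append({
            "patient_code": row["PatientID"].strip(),
            "sex": {"M": "male", "F": "female"}[row["Sex"].strip().upper()],
            "binet_stage": row["Binet"].strip().upper() or None,
        })
    return out

def load(rows: list[dict], target: list[dict]) -> None:
    """Load: append curated rows to the central repository
    (a plain list here, standing in for a database insert)."""
    target.extend(rows)

repository: list[dict] = []
load(transform(extract(Path("centre_export.csv"))), repository)
print(f"Imported {len(repository)} curated records")
```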
Quality assessment
The development of a semantic framework with data cleaning processes12 capable of identifying quality problems in the collected data is a prerequisite for minimizing barriers to data sharing, availability and reusability for research purposes.
A standards-based cleaning, organization and integration approach can ensure relevance and data accuracy while guaranteeing the long-term usability of the collected high-quality data.13
The definition of rules for syntactical and semantic errors, out-of-range values, missing data, and unique-value and functional-dependency violations is a well-recognized objective of any strategy aiming at improving quality control.
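To make these rule classes concrete, the following Python sketch implements one check per class over a toy dataset. The field names, value range and functional dependency (centre determines country) are hypothetical examples, not ERIC's actual validation rules.

```python
def check_missing(records, field):
    """Missing-data rule: the field must be present and non-empty."""
    return [i for i, r in enumerate(records) if not r.get(field)]

def check_range(records, field, lo, hi):
    """Out-of-range rule: values must fall within [lo, hi]."""
    return [i for i, r in enumerate(records)
            if r.get(field) is not None and not lo <= r[field] <= hi]

def check_unique(records, field):
    """Unique-value rule: the field must not repeat across records."""
    seen, dupes = set(), []
    for i, r in enumerate(records):
        if r[field] in seen:
            dupes.append(i)
        seen.add(r[field])
    return dupes

def check_functional_dependency(records, determinant, dependent):
    """FD rule: records agreeing on `determinant` must agree on `dependent`."""
    mapping, violations = {}, []
    for i, r in enumerate(records):
        key, val = r[determinant], r[dependent]
        if key in mapping and mapping[key] != val:
            violations.append(i)
        mapping.setdefault(key, val)
    return violations

records = [
    {"id": "P1", "age_at_diagnosis": 62, "centre": "C1", "country": "IT"},
    {"id": "P2", "age_at_diagnosis": 180, "centre": "C1", "country": "GR"},
    {"id": "P2", "age_at_diagnosis": None, "centre": "C2", "country": "SE"},
]
print("missing:", check_missing(records, "age_at_diagnosis"))          # [2]
print("out of range:", check_range(records, "age_at_diagnosis", 18, 110))  # [1]
print("duplicate ids:", check_unique(records, "id"))                   # [2]
print("FD centre->country:",
      check_functional_dependency(records, "centre", "country"))       # [1]
```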
Building an integrated research infrastructure
Organizing clinical and translational RWD in a standardized and centralized data repository allows unified access for research purposes, improving research efficiency and quality on multiple levels. That said, such integrated approaches must ensure data security and privacy, in particular when aiming for multi-institutional, transnational efforts. In this context, the preferred systems for data collection and retrieval are web-based, with remote data entry, consisting of custom project-related electronic case report forms with simple, user-friendly interfaces to facilitate efficient data registration.
Moreover, data collection, management and sharing must follow standardized procedures that ensure compliance with ethical, regulatory and legal standards and prevent unauthorized access and unintended disclosure. Protection of personal data, conformity with European Union (EU) regulations, at least for EU member states,14 and the principles of beneficence, non-maleficence, respect for autonomy and confidentiality must be guaranteed through the application of anonymization methods to the captured data15 and the development of access control and activity monitoring mechanisms.
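One common building block for de-identification is keyed pseudonymization, sketched below: a direct identifier is replaced by a keyed hash so that records from the same patient remain linkable without the identifier itself being stored. This is a generic illustration, not the anonymization method used by ERIC; note also that pseudonymized data still count as personal data under the GDPR, and the key must be managed outside the database.

```python
import hashlib
import hmac

# Illustrative only: in practice the key lives in a secrets manager,
# never alongside the data it protects.
SECRET_KEY = b"replace-with-a-securely-stored-key"

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a keyed, irreversible token."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()[:16]

record = {"national_id": "AB123456", "binet_stage": "B"}  # hypothetical fields
safe_record = {"pseudonym": pseudonymize(record.pop("national_id")), **record}
print(safe_record)  # no direct identifier remains in the stored record
```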
The ERIC CLL database
The ERIC CLL database is a data management system that supports research and medical knowledge discovery in CLL by implementing the aforementioned methodologies. The database is designed as expandable modules that allow the rapid introduction of additional categories and values when and if needed, based on the projects running at any given time, making it flexible and adjustable. All new information eventually ends up in the central dataset as a stable asset of the database. Currently, the ERIC CLL database includes data from 9147 cases from 19 centres in 10 countries.
The main objectives of the ERIC CLL database are to (1) collect clinically relevant RWD, transform them into evidence and correlate them with biological data in order to provide accurate information about state-of-the-art diagnosis; (2) generate hypotheses regarding important disease characteristics, laboratory studies and therapies; and (3) define relevant parameters influencing the impact of CLL on health systems.
The data categories required to form an accurate and complete representation of the disease course include basic demographic data, disease-related information, treatment options and response, laboratory results and relevant outcome data. The data model has been designed to meet the requirements for an accurate description of the diagnosis, prognostic assessment and management of patients with CLL.
A relational database developed in PostgreSQL has been designed to ensure data integrity and to facilitate data correlation and statistical analysis. To the benefit of the scientific community at large, open-source tools have been used for the development of the ERIC CLL database. A standards-based approach is used to ensure effective data registration and integration, providing useful, accurate and valid information, increasing the availability of clinically relevant structured data and fulfilling the data quality assurance requirements.
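As a sketch of how a relational design enforces integrity, the hypothetical DDL below uses primary keys, a foreign key and a CHECK constraint. The tables and columns are illustrative only, not the actual ERIC schema; the connection string is a placeholder and the example assumes psycopg2 and a reachable PostgreSQL server.

```python
import psycopg2  # assumes a reachable PostgreSQL instance; DSN is illustrative

# Hypothetical tables demonstrating integrity constraints.
DDL = """
CREATE TABLE IF NOT EXISTS patient (
    patient_id   SERIAL PRIMARY KEY,
    pseudonym    TEXT UNIQUE NOT NULL,
    birth_year   INTEGER CHECK (birth_year BETWEEN 1900 AND 2100)
);
CREATE TABLE IF NOT EXISTS treatment (
    treatment_id SERIAL PRIMARY KEY,
    patient_id   INTEGER NOT NULL REFERENCES patient(patient_id),
    start_date   DATE NOT NULL,
    regimen      TEXT NOT NULL
);
"""

with psycopg2.connect("dbname=cll_demo user=demo password=demo") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)  # orphan treatments and invalid years are rejected
```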
A web-based user interface has been developed for prospective data collection as part of routine patient care, designed to ensure data protection, security and availability. The interface allows for controlled database login, real-time registration with data validation mechanisms, and data retrieval and management.
Moreover, a retrospective data registration and import tool has been developed to efficiently and effectively load into the database retrospective patient data collected in purpose-specific template registration spreadsheets. The tool deploys data cleaning processes based on defined rules, ensuring content validation and detecting data inconsistency and redundancy errors. A mapping mechanism is then applied using transformation rules to convert the data to predefined types and import them into the database in the appropriate form. Accordingly, the tool can be configured and applied to transform data exports coming from different sources, regardless of the diversity of the software currently utilized at each institution (ie, different databases), thus enabling interoperability.
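Such a mapping step might look like the declarative Python sketch below, where each source column is paired with a target field and a transformation rule, and conversion errors are collected rather than aborting the import. All column names, field names and rules here are hypothetical.

```python
from datetime import datetime

# Hypothetical rules: source spreadsheet column -> (target field, converter).
MAPPING = {
    "Pat. code":   ("patient_code", str.strip),
    "IGHV status": ("ighv_status", lambda v: v.strip().lower()),
    "Dg date":     ("date_of_diagnosis",
                    lambda v: datetime.strptime(v.strip(), "%d/%m/%Y").date()),
}

def map_row(row: dict) -> tuple[dict, list[str]]:
    """Apply transformation rules; collect errors instead of aborting."""
    mapped, errors = {}, []
    for source_col, (target_field, rule) in MAPPING.items():
        try:
            mapped[target_field] = rule(row[source_col])
        except (KeyError, ValueError) as exc:
            errors.append(f"{source_col}: {exc}")
    return mapped, errors

row = {"Pat. code": " GR-0042 ", "IGHV status": "Mutated",
       "Dg date": "05/09/2012"}
print(map_row(row))
```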
Personal identifying data are not requested or stored in the ERIC CLL database. Concerning the collection of retrospective data, anonymization takes place during the registration and validation processes; the anonymized datasets are then saved and imported into the ERIC CLL database, conforming to EU regulations. Moreover, a user management system has been developed and configured for the authentication and authorization of users to ensure data confidentiality and privacy, providing efficient and secure handling and exchange of information. Centre-based, lab-based and role-based privileges are defined to restrain access and to control, monitor and facilitate data management procedures, according to general and local requirements. Data are stored on a secure dedicated database server, controlled by an Information Security Management System (ISMS) certified under ISO/IEC 27001:2013 and compliant with the GDPR. The infrastructure includes system failover mechanisms, backup processes to prevent data loss, and history management mechanisms providing information about data modifications.
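Centre- and role-based restrictions of the kind described above can be expressed as a simple policy check, as in the Python sketch below. The roles, privileges and policy are hypothetical illustrations rather than the ERIC system's actual authorization model.

```python
# Hypothetical role -> privilege policy.
ROLE_PRIVILEGES = {
    "data_entry":   {"read_own_centre", "write_own_centre"},
    "centre_admin": {"read_own_centre", "write_own_centre", "manage_own_users"},
    "coordinator":  {"read_all_centres"},
}

def is_allowed(user: dict, action: str, record_centre: str) -> bool:
    """Combine role-based privileges with a centre-based restriction."""
    privileges = ROLE_PRIVILEGES.get(user["role"], set())
    if action == "read":
        return ("read_all_centres" in privileges
                or ("read_own_centre" in privileges
                    and user["centre"] == record_centre))
    if action == "write":
        return ("write_own_centre" in privileges
                and user["centre"] == record_centre)
    return False  # deny by default

user = {"role": "data_entry", "centre": "C07"}
print(is_allowed(user, "read", "C07"))   # True: own centre
print(is_allowed(user, "read", "C02"))   # False: another centre
print(is_allowed(user, "write", "C07"))  # True
```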
Concluding remarks
The uniqueness of CLL in terms of clinico-biological heterogeneity and rapidly evolving therapeutic paradigms underlines the need for large-scale, multi-disciplinary collaboration aimed at the realization of precision medicine in CLL. This essentially requires a refined understanding of CLL at the fundamental, pathophysiological level, as well as the integration of multiple layers and sources of biological data with information about disease trajectories and outcomes. The ERIC CLL database is a concrete step in this direction, tailored to user needs and aspiring to contribute to the improved management of patients with CLL.