Phelan-McDermid syndrome data network: Integrating patient reported outcomes with clinical notes and curated genetic reports
Abstract
The heterogeneity of patient phenotype data is an impediment to research into the origins and progression of neuropsychiatric disorders. This difficulty is compounded in the case of rare disorders such as Phelan-McDermid Syndrome (PMS) by the paucity of patient clinical data. PMS is a rare syndromic genetic cause of autism and intellectual deficiency. In this paper, we describe the Phelan-McDermid Syndrome Data Network (PMS_DN), a platform that facilitates research into phenotype–genotype correlation and progression of PMS by: a) integrating knowledge of patient phenotypes extracted from Patient Reported Outcomes (PRO) data and clinical notes, two heterogeneous and underutilized sources of knowledge about patient phenotypes, with curated genetic information from the same patient cohort and b) making this integrated knowledge, along with a suite of statistical tools, available free of charge to authorized investigators on a Web portal https://pmsdn.hms.harvard.edu. PMS_DN is a Patient-Centered Outcomes Research Institute (PCORI) initiative in which patients and their families are involved in all aspects of the management of patient data in driving research into PMS. To foster collaborative research, PMS_DN also makes patient aggregates from this knowledge available to authorized investigators using distributed research networks such as the PCORnet PopMedNet. PMS_DN is hosted on a scalable cloud-based environment and complies with all patient data privacy regulations. As of October 31, 2016, PMS_DN integrates high-quality knowledge extracted from the clinical notes of 112 patients and curated genetic reports of 176 patients with preprocessed PRO data from 415 patients.
1 INTRODUCTION
Genetic causes of neuropsychiatric disorders are not well understood in general (Kerner, 2015). Research investigations using Genome Wide Association Study (GWAS) (Network and Pathway Analysis Subgroup of the Psychiatric Genomics Consortium, 2015; Wood, 2013), exome-based sequencing (Girard et al., 2011; Iossifov et al., 2012; O'Roak et al., 2011; Vissers et al., 2010; Xu et al., 2012), and whole genome sequencing (Kong et al., 2012) techniques have revealed several candidate genes that are associated with common neuropsychiatric disorders such as Autism Spectrum Disorder (ASD), intellectual disability, and schizophrenia. However, in the case of rare disorders, understanding the genetic origins and progressions of disorders—one of the key objectives of Precision Medicine research (Collins & Varmus, 2015; Kohane, 2015; Kohane, Churchill, & Murphy, 2012)—is hindered by small patient population size, the consequent paucity of patient data, and the lack of robust phenotyping protocols (Baynam et al., 2015; Delude, 2015; Robinson, Mungall, & Haendel, 2015).
One such rare neuropsychiatric disorder is Phelan-McDermid Syndrome (PMS) or 22q13 deletion syndrome (OMIM 606232) (Cusmano-Ozog, Manning, & Eugene Hoyme, 2007; Phelan, 2008; Phelan & McDermid, 2012), with approximately 1,400 cases diagnosed worldwide, mostly in children. PMS is caused by deletion of the terminal end of the long arm of chromosome 22 or by mutation and loss of function of the SHANK3 gene (Macedoni-Lukšič, Krgović, Zagradišnik, & Kokalj-Vokač, 2013), which is also implicated in ASD (Gauthier et al., 2008; Uchino & Waga, 2013). Diagnosis is only possible with genetic testing and is often delayed. Early studies have looked at the effect of intranasal insulin therapy (Maxonus, Irnberger, & Rittinger, 2012; Zwanenberg et al., 2016) and the role of Insulin-like Growth Factor-1 (IGF-1) (Kolevzon et al., 2014) in reversing some of the symptoms of PMS, but there is currently no known treatment for the disorder. A wide variety of symptoms have been observed in individuals with PMS, including poor muscle tone, intellectual disability, developmental delays, dysmorphic facial features, vesicoureteral reflux, gastroesophageal reflux, congenital cardiac diseases, and behavioral disorders. Given the scarcity of patient data resulting from the small patient population size, clinical notes and Patient Reported Outcomes (PRO) data—previously underutilized sources of detailed information about patient conditions—assume significant importance in Precision Medicine research into PMS. Comparative analysis of the genetic profiles of the cohort of PMS patients with patient phenotypes reported in the clinical notes and PRO data has the potential to identify correlations between polymorphisms and deletions of specific genes and patient phenotypes, as well as to identify patient subtypes based upon genotypic and phenotypic profiles.
In this paper, we describe the Phelan-McDermid Syndrome Data Network (PMS_DN), a platform that supports such analyses by:
- Extracting knowledge from clinical notes by using a combination of Optical Character Recognition (OCR) and Natural Language Processing (NLP) methods.
- Ensuring the high quality and trustworthiness of the knowledge extracted from clinical notes by allowing experts to cross-check the knowledge against the de-identified source raw text.
- Integrating the knowledge extracted from clinical notes with PRO data and curated genetic reports from the same cohort of PMS patients, facilitating comparative analyses.
- Providing clinical practitioners and investigators researching neuropsychiatric disorders with free, multi-level access to the integrated knowledge over a Web portal and over distributed research networks, while complying with all the stipulations of patient privacy regulations, including the Health Insurance Portability and Accountability Act (HIPAA).
- Allaying concerns about long-term scalability and viability of the project by adopting a cloud-based computation environment.
2 MATERIALS AND METHODS
2.1 Data acquisition
- Patient Reported Outcomes: Patient Reported Outcomes (PRO) data comprises responses by parents and caregivers of PMS patients to detailed questionnaires about diagnoses, procedures, lab tests, medications, patient behavior, and patient conditions, which were collected and stored in the PMS Information Registry (PMSIR, pmsiregistry.patientcrossroads.org).
- Clinical notes: The families of PMS patients provided consent to CareSync (caresync.com), a third-party vendor, to request and obtain their health records, including clinical notes, from various healthcare providers on their behalf. CareSync collected the clinical notes and shared the PDF scans with the patients' families and with the Phelan-McDermid Syndrome Foundation (PMSF). This greatly simplified the cumbersome and time-consuming process of patients obtaining access to their health records (Lester, Boateng, Studeny, & Coustasse, 2016).
- Curated Genetic Reports: Reports of PMS patients from genetic tests including Comparative Genomic Hybridization (CGH) arrays, Single Nucleotide Polymorphism (SNP) arrays, and microarrays were collected, curated by trained genetic counselors, and stored in the PMSIR.
With periodic patient outreach activities, PMSF has progressively improved patient participation in terms of the number of families consenting to share their data with PMSIR and with PMS_DN.
2.2 Data processing
2.2.1 Clinical notes
We used the open source Tesseract OCR tool (Smith, 2007) to extract raw text content from the curated clinical notes. Then, the MITRE MIST tool (Aberdeen et al., 2010) and the Scrubber toolkit (McMurry, Fitch, Savova, Kohane, & Reis, 2013) in the Apache cTAKES NLP engine were used to erase Protected Health Information (PHI) elements from the text. Following de-identification, the Apache cTAKES NLP engine (Savova et al., 2010) was deployed to extract knowledge by identifying occurrences of concepts defined in the Unified Medical Language System (UMLS) (Bodenreider, 2004) in the text. Apache cTAKES also identifies the context in which the concepts are mentioned in the sentence including negation, patient history, family history, and uncertainty. The identified UMLS concepts were mapped to concept definitions in 20 clinical terminologies (Figure 1) including ICD-9/10 (www.icd9data.com, www.icd10data.com), MeSH (Rogers, 1963), SNOMED CT (Schulz & Klein, 2008), and the Human Phenotype Ontology (Robinson et al., 2008).
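The role of the context attributes can be illustrated with a minimal Python sketch. It does not call Tesseract, MIST, the Scrubber, or cTAKES; the `ConceptMention` record layout, the helper name, and the placeholder concept identifiers are assumptions introduced here for illustration only. It shows why negation, uncertainty, patient history, and family history need to travel with each extracted concept: without them, a sentence such as "No seizures reported" would count as evidence of seizures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptMention:
    """One concept found in a de-identified sentence (hypothetical layout)."""
    patient_id: str
    concept_id: str               # placeholder standing in for a UMLS CUI
    sentence: str                 # de-identified source sentence
    negated: bool = False
    uncertain: bool = False
    historical: bool = False      # mentioned as patient history
    family_history: bool = False  # mentioned about a relative


def asserted_current_findings(mentions):
    """Keep only mentions that assert a present finding in the patient."""
    return [
        m for m in mentions
        if not (m.negated or m.uncertain or m.historical or m.family_history)
    ]


# Toy records with placeholder concept identifiers, not real UMLS CUIs.
mentions = [
    ConceptMention("P1", "CUI_HYPOTONIA", "Patient has hypotonia."),
    ConceptMention("P1", "CUI_SEIZURES", "No seizures reported.", negated=True),
    ConceptMention("P2", "CUI_SEIZURES", "Mother has a history of seizures.",
                   family_history=True),
]
kept = asserted_current_findings(mentions)
```

In this toy input only the hypotonia mention for patient P1 survives the filter, since the two seizure mentions are negated or refer to a relative.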

2.2.2 Genetic reports
The genetic reports include results from sequencing, CGH arrays, and Fluorescence In Situ Hybridization (FISH) probes. Genetic reports are first curated by trained genetic counselors who fill in 57 structured fields to represent the genetic abnormalities. Because of the disparity in techniques from which genetic data are obtained, all the curated genetic test result information was manually reviewed to extract the coordinates and genome assembly of the chromosomal abnormalities. Chromosomal coordinates for CGH were extracted from the relevant structured fields (chromosome, gain/loss, start, end), and from the International System for Human Cytogenetic Nomenclature (ISCN 2013) standard (Simons, Shaffer, & Hastings, 2013) and comments where necessary. Chromosomal coordinates for FISH results were directly obtained in the GRCh38/hg38 genome assembly (Miga et al., 2014) from the National Center for Biotechnology Information (NCBI) Clone database (Schneider et al., 2013). When multiple assays were available for the same region, the most recent or the most precise (in terms of resolution) assay was used. In order of decreasing resolution of the sequence data, sequencing output was preferred over array CGH, and array CGH was preferred over FISH. Chromosomal coordinates were transformed from each original human genome assembly to the latest one available at the time of this study, GRCh38/hg38, using the University of California, Santa Cruz (UCSC) liftOver tool (genome.ucsc.edu/cgi-bin/hgLiftOver). All duplications, deletions, and mutations were retained along with the original fields for the standard nomenclature, karyotype, and parental results; the only exceptions were chromosome alterations with coordinates that did not map to GRCh38/hg38.
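The assay preference described above can be sketched in a few lines of Python. This is an illustrative sketch, not the curation procedure itself, which was manual: the `Assay` record, the technique labels, and the choice to rank by resolution first and use recency only as a tie-breaker are assumptions made here, since the text does not say how the two criteria were combined.

```python
from dataclasses import dataclass
from datetime import date

# Lower rank means higher resolution: sequencing > array CGH > FISH.
RESOLUTION_RANK = {"sequencing": 0, "array_cgh": 1, "fish": 2}


@dataclass(frozen=True)
class Assay:
    """One genetic test result for a region (hypothetical layout)."""
    patient_id: str
    region: str        # free-text region label, for example "22q13"
    technique: str     # must be a key of RESOLUTION_RANK
    performed: date


def pick_assay(assays):
    """Pick one assay: best resolution first, most recent as tie-breaker."""
    return min(
        assays,
        key=lambda a: (RESOLUTION_RANK[a.technique], -a.performed.toordinal()),
    )


# Toy records for a single patient and region.
candidates = [
    Assay("P1", "22q13", "fish", date(2015, 6, 1)),
    Assay("P1", "22q13", "array_cgh", date(2012, 3, 15)),
    Assay("P1", "22q13", "array_cgh", date(2014, 9, 30)),
]
chosen = pick_assay(candidates)
```

Here the 2014 array CGH result is chosen: it outranks the more recent FISH result on resolution and the 2012 array CGH result on recency.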
2.2.3 Patient reported outcomes (PRO)
The PRO data comprise responses to the following questionnaires:
- A “clinical” questionnaire with questions regarding diagnosed comorbidities, symptoms, tests, and treatments for the whole range of known pathologies and features associated with PMS,
- A “developmental” questionnaire, focusing on physical, motor, behavioral, cognitive, and social development, and
- An “adult” questionnaire with specific questions aimed at patients aged 12 or more, regarding the evolution of symptoms after puberty.
All the questions from the PRO dataset were manually mapped to UMLS Concept Unique Identifiers (CUIs) by a clinical expert before being preprocessed for statistical analysis.
The knowledge extracted from clinical notes was loaded by dedicated Extract Transform Load (ETL) pipelines into the PMS_DN data repository along with the PRO data and the processed curated genetic reports of the PMS patients.
2.3 Data integration on PMS_DN: Leveraging the i2b2/tranSMART platform
PMS_DN leverages the capabilities of the i2b2/tranSMART knowledge management platform (Patel et al., 2016; Perakslis, van Dam, & Szalma, 2010; Scheufele et al., 2014; Szalma, Koka, Khasanova, & Perakslis, 2010) to integrate heterogeneous datasets, including phenome, exposome, and genome data, and to facilitate browsing and comparative analysis of these datasets. The i2b2/tranSMART platform is layered upon the Informatics for Integrating Biology and the Bedside (i2b2) clinical and biomedical data integration platform (Kohane, Churchill, & Murphy, 2012; Murphy et al., 2010). The i2b2 platform uses a simple and intuitive “observation centric” star schema data model that accommodates a variety of longitudinal patient level datasets including clinical data, prescriptions, and laboratory values. Multiple hierarchical ontologies describe the types of data contained within i2b2, allowing users to start with broad biomedical concepts and drill down to find specific patients and data of interest (Figure 1). New data types can be added to i2b2 by modifying the ontology but without changing the underlying database schema or the software. The ease of use of i2b2 has led to its adoption by over 150 University Hospital research centers worldwide.
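The observation-centric model and the ontology drill-down can be illustrated with a small, self-contained Python sketch using an in-memory SQLite database. The two tables are a heavily simplified stand-in for a star schema: the table and column names follow common i2b2 naming, but the reduced column set, the forward-slash path separator, the `DEMO:` codes, and all rows are assumptions made for illustration and do not reflect the PMS_DN repository.

```python
import sqlite3

# Simplified, hypothetical subset of an observation-centric star schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE concept_dimension (
        concept_path TEXT PRIMARY KEY,  -- position in the ontology hierarchy
        concept_cd   TEXT NOT NULL      -- code used in the fact table
    );
    CREATE TABLE observation_fact (
        patient_num INTEGER NOT NULL,
        concept_cd  TEXT NOT NULL,
        start_date  TEXT NOT NULL
    );
    """
)
conn.executemany(
    "INSERT INTO concept_dimension VALUES (?, ?)",
    [
        ("/Clinical/Neurology/Hypotonia/", "DEMO:HYPOTONIA"),
        ("/Clinical/Neurology/Seizures/", "DEMO:SEIZURES"),
        ("/Clinical/Renal/Reflux/", "DEMO:REFLUX"),
    ],
)
conn.executemany(
    "INSERT INTO observation_fact VALUES (?, ?, ?)",
    [
        (1, "DEMO:HYPOTONIA", "2015-01-10"),
        (1, "DEMO:SEIZURES", "2015-03-02"),
        (2, "DEMO:SEIZURES", "2014-07-21"),
        (2, "DEMO:SEIZURES", "2016-02-11"),
        (3, "DEMO:REFLUX", "2016-05-30"),
    ],
)


def count_patients(connection, path_prefix):
    """Count distinct patients with any observation under an ontology branch."""
    row = connection.execute(
        """
        SELECT COUNT(DISTINCT f.patient_num)
        FROM observation_fact AS f
        JOIN concept_dimension AS c ON c.concept_cd = f.concept_cd
        WHERE c.concept_path LIKE ? || '%'
        """,
        (path_prefix,),
    ).fetchone()
    return row[0]


broad = count_patients(conn, "/Clinical/")
narrow = count_patients(conn, "/Clinical/Neurology/")
leaf = count_patients(conn, "/Clinical/Neurology/Hypotonia/")
```

Drilling down from the broad branch to the leaf narrows the cohort from three patients to two to one. Adding a new kind of observation in this sketch only requires new rows in `concept_dimension` and `observation_fact`, which mirrors the point that the ontology can grow without a schema change.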
2.4 Authorized user access to PMS_DN
The primary target audience for PMS_DN is clinical practitioners and researchers working in the areas of autism and other neuropsychiatric disorders. Qualified applicants affiliated with research institutions with an active interest in research into neuropsychiatric disorders can request access to PMS_DN by filling out a registration form and agreeing to the terms of use. The registration request is reviewed by a Data Network Specialist at PMSF before approval.
Access to PMS_DN is granted at one of two levels: a basic level (Level 1) or an advanced level (Level 2). Level 1 access allows users to browse through and interrogate the patient aggregates of the integrated datasets on PMS_DN's Web portal. Figure 2 demonstrates the use of the i2b2/tranSMART interface to test a hypothesis about the relationship between patient age and hypotonia, a commonly reported symptom in PMS patients. Users with Level 2 access privileges, obtained from PMSF after mandatory Institutional Review Board (IRB) clearances from their institutions of affiliation, can see the raw, de-identified patient level data (Figure 3) and download it as well. In addition, investigators with Level 2 access privileges can access a novel validation tool, which allows them to verify the accuracy of the knowledge extracted from clinical notes by cross-checking the identified concepts against the anonymized sentences from which they were extracted (Figure 4). Besides eliminating residual cTAKES errors caused by ambiguous context in the raw text, the validation tool improves the trustworthiness of the knowledge by allowing authorized investigators to see the raw text source of the knowledge. The input of the investigators is used to immediately update the knowledge in the PMS_DN repository (Figure 5). In a future release of PMS_DN, we will display the credentials of the experts performing the validation to other users, so the credibility of the validation input can be independently assessed. It must be noted that the knowledge validation step is not an exhaustive review of the entirety of the NLP engine's output. Instead, it is an open-ended process where experts choose to cross-check specific concepts of interest identified by the NLP pipeline against the raw anonymized sentences in which they occur.
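The interplay between access levels and the validation step can be sketched as follows. This is a hypothetical illustration in Python, not the PMS_DN implementation: the `ExtractedConcept` record, the `validate` function, the integer access levels, and the audit-log tuple layout are all assumptions introduced here. The sketch restricts corrections to Level 2 users, applies a correction immediately, and records who changed what.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractedConcept:
    """A concept extracted from one anonymized sentence (hypothetical layout)."""
    concept: str
    sentence: str
    negated: bool
    audit_log: list = field(default_factory=list)


def validate(record, user_level, user_id, negated):
    """Apply an expert's judgement of the negation context.

    Only Level 2 users may view source sentences and correct the record.
    A change is applied at once and logged as (user, old value, new value).
    """
    if user_level < 2:
        raise PermissionError("validation requires Level 2 access")
    if record.negated != negated:
        record.audit_log.append((user_id, record.negated, negated))
        record.negated = negated
    return record


# Toy case: the extractor missed the negation in the sentence.
record = ExtractedConcept("hypotonia", "No evidence of hypotonia.", negated=False)
validate(record, user_level=2, user_id="expert-1", negated=True)
```

After the call, the record is marked as negated and the log holds one entry for `expert-1`. A Level 1 caller would receive a `PermissionError` and the record would stay unchanged.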




PMS_DN uses the single sign-on feature of the OAuth2 authorization protocol (oauth.net/2/) so that users can log in to the i2b2/tranSMART Web portal with their existing login credentials from: a) Harvard Medical School, Boston Children's Hospital, or the University of Pittsburgh, or b) NIH eRA Commons or c) Google Mail or d) GitHub (github.com). The OAuth2 based single sign-on feature obviates a potential security loophole associated with the storage of user login credentials on PMS_DN.
To foster collaborative research with similar Patient Powered Research Networks, snapshots of PMS_DN data in the form of patient counts for queried parameters are available to authorized investigators using distributed research networks such as SHRINE (Weber et al., 2009) and the PCORnet PopMedNet (www.popmednet.org).
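The idea of sharing only patient counts for queried parameters can be shown with a short Python sketch. It does not use SHRINE or PopMedNet; the function name and the toy observations are assumptions made for illustration. The point is that the function returns aggregates only, so no patient-level row leaves the local repository.

```python
from collections import defaultdict


def patient_counts(observations, concepts):
    """Return {concept: number of distinct patients} for the queried concepts.

    `observations` is an iterable of (patient_id, concept) pairs. Only the
    aggregate counts are returned, never the patient identifiers.
    """
    seen = defaultdict(set)
    for patient_id, concept in observations:
        if concept in concepts:
            seen[concept].add(patient_id)
    return {concept: len(seen[concept]) for concept in concepts}


# Toy observations; a patient may have the same concept recorded twice.
observations = [
    ("P1", "hypotonia"),
    ("P1", "seizures"),
    ("P2", "seizures"),
    ("P2", "seizures"),
    ("P3", "lymphedema"),
]
counts = patient_counts(observations, ["hypotonia", "seizures", "sleep disorder"])
```

Repeated observations for the same patient are counted once, and a queried concept with no matching patient is reported with a count of zero.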
2.5 Cloud hosting
To ensure long-term scalability and to eliminate concerns about data archival and hardware maintenance and procurement, we have ported the PMS_DN application to a HIPAA compliant cloud-based environment hosted by Amazon Web Services (AWS, aws.amazon.com). The PMS_DN data repository is hosted on a Relational Database Service (RDS) instance of AWS. The ETL pipelines are hosted on dedicated Elastic Compute Cloud (EC2) instances of AWS. The raw clinical notes are stored in a secure Simple Storage Service (S3) bucket of AWS prior to processing. Figure 6 displays the entire cloud-based architecture and data flows of PMS_DN.

3 RESULTS
As of October 31, 2016, 623 families (334 in the USA) provided consent to PMSF to share their data with PMS_DN. PMS_DN integrates: a) the knowledge extracted by Apache cTAKES from the clinical notes of 112 patients comprising 40,320 pages in 2,202 files, b) preprocessed PRO data from 415 patients, and c) curated genetic information from 176 patients. Following integration, 70 patients were linked across the three datasets, that is, PMS_DN has the full complement of clinical notes, genetic reports, and PRO data for 70 patients, enabling comparative analyses across the datasets (Figure 7). This number is expected to increase as more patient data become available. Authorized users can access and interrogate the integrated PMS patient data on the i2b2/tranSMART Web user interface of PMS_DN at https://pmsdn.hms.harvard.edu. Level 2 users with advanced access privileges and the appropriate IRB clearances can: i) obtain advanced, raw data download privileges on PMS_DN from PMSF and also ii) verify the accuracy of the knowledge extracted from clinical notes by cross-checking the identified concepts with anonymized sentences from which they were extracted using the validation tool.
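Linking patients across the three datasets amounts to intersecting the sets of patient identifiers found in each source. The Python sketch below uses made-up identifiers and a hypothetical function name; it shows the operation only and does not reproduce the cohort figures reported above.

```python
def linked_patients(notes_ids, genetics_ids, pro_ids):
    """Return the patients present in all three datasets."""
    return set(notes_ids) & set(genetics_ids) & set(pro_ids)


# Toy identifiers, not real patient data.
notes_ids = ["P1", "P2", "P3"]
genetics_ids = ["P2", "P3", "P4", "P5"]
pro_ids = ["P1", "P3", "P5", "P6"]

linked = linked_patients(notes_ids, genetics_ids, pro_ids)
```

Only `P3` appears in all three toy lists, so it is the only patient for whom a comparative analysis across clinical notes, genetic reports, and PRO data would be possible in this example.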

4 DISCUSSION
PMS_DN facilitates research into the origins and treatment of PMS by making high quality, trustworthy knowledge available to clinical practitioners and investigators researching neuropsychiatric disorders, while safeguarding patient privacy through rigorous patient de-identification methods.
4.1 Patient de-identification
PMS_DN uses a combination of two independent anonymizers—the MITRE MIST anonymizer (Aberdeen et al., 2010) and the Scrubber toolkit (McMurry et al., 2013) in the Apache cTAKES NLP engine—to remove PHI elements from the clinical notes. In a study (McMurry et al., 2013), the Scrubber toolkit in Apache cTAKES identified and removed approximately 98% of the PHI elements (Recall = 98%) from a test corpus of clinical notes selected from the i2b2 De-Identification Challenge dataset (Uzuner, Luo, & Szolovits, 2007). However, the same study reported a very low precision score, that is, a number of useful non-PHI elements were removed from the clinical notes by the Apache cTAKES Scrubber in addition to the PHI elements. Another investigation studied the effectiveness of the MITRE MIST tool in removing PHI elements from clinical notes (Deleger, Molnar, Savova, Xia, Lingren, Li, & Solti, 2013) and reported F-Scores (the harmonic mean of precision and recall metrics) (Hripcsak & Rothschild, 2005) of 93.48% and 95.2% at sentence-level and word-level de-identification. These performance metrics were comparable with the performance of human experts in identifying PHI elements in the same corpus of clinical notes.
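For reference, the metrics discussed above can be computed from counts of true positives, false positives, and false negatives. The Python sketch below is a generic illustration: the function name is hypothetical and the counts in the example are invented, not taken from either cited study. It shows how high recall can coexist with a lower precision, and how the F-score, as the harmonic mean, sits between the two.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F-score) for PHI detection counts.

    precision = TP / (TP + FP): share of removed elements that were PHI.
    recall    = TP / (TP + FN): share of PHI elements that were removed.
    F-score   = harmonic mean of precision and recall.
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Invented counts: 90 PHI elements removed, 30 non-PHI elements removed
# by mistake, and 10 PHI elements missed.
precision, recall, f_score = precision_recall_f1(90, 30, 10)
```

With these invented counts, precision is 0.75, recall is 0.90, and the F-score is about 0.82.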
Because a de-identifier that achieves both maximal precision and maximal recall has yet to be developed, PMS_DN combines the two independent anonymizers to remove PHI elements from the PMS patients' clinical notes to the maximum extent possible. Despite these efforts, the likelihood of the appearance of PHI elements in the clinical notes cannot be ruled out. We have attempted to mitigate this limitation by restricting the visibility of the anonymized raw text of the clinical notes (on the validation window) to only those users with Level 2 advanced access privileges.
Given the early stage of deployment of PMS_DN, the patient de-identification pipeline has not been observed to adversely impact the comprehensibility of the content of the clinical notes so far. A typical example of an anonymized sentence from the clinical notes can be seen in Figure 4 as displayed in the validation window to a Level 2 user for verification. At present, only the exact sentence from which the concept was extracted is displayed in the validation window. In a future version of PMS_DN, we plan to display, in addition to the source sentence for the concept, the sentences immediately preceding and following this source sentence to make the context clearer to the user accessing the validation window.
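One way to combine two independent de-identifiers is to mask every character span that either tool flags, which favors recall over precision. The Python sketch below illustrates that union strategy under stated assumptions: it does not call MIST or the Scrubber, the span format, function names, and `[PHI]` mask are hypothetical, and the text does not specify how the two outputs are actually merged in PMS_DN.

```python
def merge_spans(spans):
    """Merge overlapping or touching (start, end) character spans."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def redact(text, spans_a, spans_b, mask="[PHI]"):
    """Mask every span flagged by either de-identifier."""
    pieces, cursor = [], 0
    for start, end in merge_spans(list(spans_a) + list(spans_b)):
        pieces.append(text[cursor:start])
        pieces.append(mask)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def span_of(text, phrase):
    """Locate the first occurrence of a phrase as a (start, end) span."""
    start = text.index(phrase)
    return (start, start + len(phrase))


# Toy sentence: tool A flags only the surname, tool B flags title and date.
text = "Seen by Dr. Smith on 01/02/2015 for hypotonia."
spans_a = [span_of(text, "Smith")]
spans_b = [span_of(text, "Dr. Smith"), span_of(text, "01/02/2015")]
clean = redact(text, spans_a, spans_b)
```

In this toy case the date is caught only by the second tool, and the overlapping name spans collapse into a single mask, giving `Seen by [PHI] on [PHI] for hypotonia.`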
4.2 Knowledge extraction from clinical notes and expert validation
From the anonymized text in clinical notes, the Apache cTAKES NLP engine identified the mentions of concepts defined in the UMLS in addition to the appropriate context (including negation, uncertainty, patient history, and family history) in which the extracted concept is mentioned. The validation tool allows experts to cross-check whether the context has been correctly identified by the NLP engine and make corrections where necessary. This is intended to be an open-ended process and not an exhaustive review of the functional efficiency of the NLP engine. The credentials of the experts who perform these validations (including research background and interests, institutions of affiliation, and the relevance of their proposed research work with PMS patient data to the objective of PMS Foundation) are carefully reviewed by the steering committee at the PMS Foundation before access is granted. This ensures a certain level of credibility for the validation input from the experts. At present, the identities of the users who perform these validations are logged by PMS_DN but not displayed to the end users. In the future, the credentials of the users will also be displayed to authorized users with Level 2 access privileges so the quality of the input can be independently assessed by users.
4.3 PRO questionnaire
The PRO dataset comprises answers to approximately 1,300 questions that were sourced by the PMSF from existing surveys and databases, including the Autism Genetic Resource Exchange (AGRE) (Lajonchere & Consortium, 2010), the PMS survey by Dr. Katy Phelan, and Common Data Elements and questions in other specific condition surveys about phenotypes reported by PMS patients including seizures, lymphedema, sleep disorders, behavioral disorders, and developmental delays as well as cardiac and renal abnormalities. Expert researchers reviewed and edited the initial draft of questions and delivered two sets of questions: a Clinical Survey of 100 questions split into 23 topics such as Cardiovascular, Seizures, and Sleep, and a Developmental Survey split into 11 topics such as Fine and Gross Motor Skills, Puberty Status, and Communication Development. Some of these questions are specific to the symptoms exhibited by PMS patients such as dysplastic toenails. There are also a number of questions that ask about more common conditions such as seizures, reflux, and behavioral patterns associated with ASD. The ASD related questions are relevant given that studies have reported the prevalence of symptoms of ASD in PMS patients (Oberman, Boccuto, Cascio, Sarasua, & Kaufmann, 2015) and gene-linkage studies have associated SHANK3 mutations with ASD (Leblond et al., 2014; Uchino & Waga, 2013).
4.4 Data sharing
It would be desirable to promote data sharing between PMS_DN and the other Patient Powered Research Networks (PPRNs) to foster collaborative research between these projects. However, the stipulations of patient privacy regulations preclude easy data sharing. Therefore, at present, only patient counts can be shared across these projects over distributed research networks such as SCILHS SHRINE and PCORnet PopMedNet. A unified questionnaire pertinent to PMS patients as well as patients diagnosed with disorders related to other PPRNs would be highly desirable. This would spare the families of patients with rare disorders from the hassle of having to repeatedly provide the same information across different questionnaires. The Research Domain Criteria (RDoC) framework from the National Institute of Mental Health (NIMH) (Insel, Cuthbert, Garvey, Heinssen, Pine, Quinn & Wang, 2010) could be useful in addressing this concern. The objective of the RDoC framework is to bring about synergy between the diverse research projects into mental and behavioral disorders and, by extension, between the various surveys that are used in the research into these disorders. RDoC provides a rigid framework comprising units of analysis (from molecules to self-report) for behavioral and developmental domains including cognitive, positive valence, negative valence, social processes, arousal, and regulatory systems. We are in the process of mapping the questions from the PRO dataset of PMS_DN to the domains of the RDoC framework as initial work in this direction. With similar mapping initiatives from other PPRNs over time, the vision of a unified questionnaire for all rare disorders can be achieved.
5 CONCLUSION
In this paper, we have described PMS_DN, a Patient Powered Research Network that exemplifies the potential of collaborations between academic researchers and family organizations such as PMSF to drive research into a rare genetic disorder: PMS. PMS_DN addresses the paucity of patient data in rare disease research by exploiting the rich yet underutilized sources of knowledge about patient conditions: clinical notes and self-reported outcomes. PMS_DN uses a state-of-the-art NLP engine, Apache cTAKES, to extract context sensitive knowledge from rich text descriptions in patient clinical notes before making this knowledge, along with self-reported outcomes and genetic reports of the same patient cohort, available to authorized investigators. Further, to minimize inaccuracies in the extracted knowledge, PMS_DN implements a novel knowledge validation tool that utilizes clinical expert input to eliminate residual ambiguities. PMS_DN is hosted in a cloud computing environment, which provides scalability while mitigating concerns regarding the long-term viability of the project. By integrating diverse and heterogeneous data about patient phenotypes and genotypes, PMS_DN facilitates research that can identify patient subgroups for targeted therapies based upon genomic and phenotypic profiles. The comparative analyses of integrated datasets, made possible by PMS_DN, have the potential to yield an improved understanding of the associations between genotypic profiles and PMS patient phenotypes.
ACKNOWLEDGMENTS
We thank the Phelan-McDermid Syndrome Foundation, the Phelan-McDermid Syndrome International Registry, the patients and their families, Chris Botka, and the Harvard Medical School Research Computing center. This work was partially funded through a Patient-Centered Outcomes Research Institute (PCORI) Award (PPRN-1306-04814) phase I and II for development of the National Patient-Centered Clinical Research Network, known as PCORnet; by Research Grant EDU_R_FY2015_Q2_HarvardMedicalSchool_Avillach-NEW from Amazon Inc.; and by National Institutes of Health RFA-HG-13-009, Centers of Excellence for Big Data Computing in the Biomedical Sciences (U54), Grant Number 1U54HG007963-01.
CONFLICT OF INTEREST
None
DISCLAIMER
The statements presented in this article are solely the responsibility of the author(s) and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute (PCORI), its Board of Governors or Methodology Committee or other participants in PCORnet.