Volume 2025, Issue 1 6565955
Review Article
Open Access

Genomic and Health Data as Fuel to Advance a Health Data Economy for Artificial Intelligence

Patrick J. Silva (Corresponding Author)

Institute for Bioscience and Technology, Texas A&M University, Houston, Texas, USA

Department of Translational Medical Sciences, School of Medicine, Texas A&M Health, Bryan, Texas, USA

Patrick A. Silva

The John Cooper School, The Woodlands, Texas, USA

Kenneth S. Ramos

Institute for Bioscience and Technology, Texas A&M University, Houston, Texas, USA

Department of Translational Medical Sciences, School of Medicine, Texas A&M Health, Bryan, Texas, USA
First published: 19 July 2025
Academic Editor: Seyed Shahmy

Abstract

Cloud and distributed computing, code repositories, and large language models are democratizing the less computationally intensive use cases of artificial intelligence (AI) in medicine. The convergence and democratization of these powerful tools promise to mobilize and utilize humanity’s knowledge and data, at least the knowledge bases and data that are readily available in the public commons. Healthcare represents a challenge due to the fragmentation of the data fabric and governance mechanisms intrinsic to that sector of the economy. Privacy laws, stewardship practices, and the fragmented nature of the patient data journey (medical record silos) create cumbersome impediments to health data sharing, particularly of longitudinal patient-level data. Consequently, the difficulty of obtaining the data necessary to train and operationalize AI in many healthcare and clinical genomics use cases limits the promise of these new technologies in addressing the complexities of healthcare. We posit that trust, provenance, and fitness of health data, along with transaction costs, represent challenges that blockchain ledgers and smart digital contracts might address. Here, we present frameworks from some of the great economic thinkers that might help address some of the stewardship and agency issues inherent to health data sharing. Our goal is to promote a more equitable and patient-centric healthcare data fabric to address the current challenges of healthcare.

1. Introduction

We all know data has value. Elon Musk said: “Money is just data that allows us to avoid the inconvenience of barter.” Clive Humby has been credited with saying “Data is the new oil” [1, 2]. More recently, cryptocurrencies have shown that data can indeed function as a form of money. In his book Read, Write, Own [3], venture capitalist and blockchain advocate Chris Dixon paints a future in which creators and contributors to the data fabric see the benefits of their contributions. That data fabric will fuel the future of artificial intelligence (AI) technology and its use in many industries and domains of human experience.

Indeed, data is the fuel used to train AI models. Large language models (LLMs) such as ChatGPT, LLaMA, Gemini, and Orca are being widely used across society and even operated in edge computing environments [4]. In fact, LLMs have been rapidly deployed for clinical decision support [5]. For 25 years now, machine learning [6], including deep learning (DL) and neural networks [7], digital twins [8], polygenic risk scores [9, 10], and other methodologies, has been used for healthcare and biomedical research applications. The vast majority of medical use cases for AI technologies require supervised learning and some form of ground truth (e.g., known clinical outcomes) to benchmark model performance and ensure patient safety. These use cases often require data at a case or person level [11]. However, such models can provide robust classification schemas that exceed the limits of human intuition in making inferences from health data. For example, random forest models have been used for over 20 years to develop disease classifiers from gene expression readouts of peripheral blood mononuclear cells that are not directly or mechanistically tied to disease processes in existing knowledge bases. Such an approach enabled insight into the diseased lung using a noninvasive biosampling approach (a blood sample instead of the diseased lung tissue itself) [12].
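As a minimal sketch of this kind of supervised classifier, the Python example below trains a random forest on a synthetic gene expression matrix standing in for real PBMC readouts and known outcomes; the cohort size, gene count, and labels are illustrative assumptions, not data from any study cited here.

```python
# Sketch: a random forest disease classifier on synthetic "gene expression"
# data (a stand-in for real PBMC readouts with known clinical outcomes).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n_patients, n_genes = 200, 500                 # hypothetical cohort and gene panel
X = rng.normal(size=(n_patients, n_genes))     # expression matrix (patients x genes)
y = rng.integers(0, 2, size=n_patients)        # ground-truth labels (e.g., diseased vs. healthy)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X_train, y_train)

# Held-out performance and the genes the forest leaned on most heavily.
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
top_genes = np.argsort(clf.feature_importances_)[::-1][:10]
print(f"Test AUC: {auc:.2f}")
print("Most informative gene indices:", top_genes)
```

With real clinicogenomic data, the same pattern applies; the critical (and hardest) ingredient is the labeled, patient-level dataset itself.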

Digital twins represent a huge opportunity, albeit one with significant challenges, because they are predicated largely on achieving adaptive patient-level predictions from patient-level inputs. A digital twin requires frequent longitudinal input about the patient being modeled across time. This approach goes beyond modeling generalized disease biology and physiology using deidentified data, aggregate data, and external data from a knowledge base; the digital twin provides a dynamic and patient-specific model of disease processes. The challenge with using patient-level data is the myriad intellectual property, privacy, provenance, attribution, and fitness considerations associated with data obtained during real-world clinical care (real-world data, RWD). There are challenging barriers associated with patient-level data moving through the data network and its fabric. The current healthcare data fabric is composed of knowledge bases, electronic medical records (RWD), insurance claim databases, prescription clearinghouses, genomic databases, clinical trial databases, biobanks, and the many other enclaves where patient data resides. The health data ecosystem is fragmented and siloed, albeit to a somewhat lesser extent in Europe than in the United States [13], so moving patient-level health data across the plumbing of these ecosystems is daunting.

While it is true that very useful AI models can be trained using aggregate/population-level data, deidentified data, and molecular, genetic, and biochemical data, the field of precision medicine requires a significant degree of insight about the patient at hand. This creates a conundrum for the implementation of AI in precision medicine use cases. Knowledge about the patient is predicated on inferences drawn from knowledge about similar patients or past information about the index patient. This conundrum is not too different from the “fruit of the poisonous tree” problem of LLMs that are trained using proprietary or copyrighted content in the artistic realm (i.e., music, literature, and video). Consent for data use is routine, requisite, and enumerated in research settings, but in a healthcare setting, such data use often falls within the safe harbor of quality improvement. Currently, the use of data for training AI is inherently an informatics problem rooted in the reality that most data on the internet, and in publicly accessible databases, is effectively treated as fungible. The knowledge base of patient-level health data generally entails limited access, but maintaining the provenance of such data is a crucial determinant of its use in training AI models. When deidentified, patient data ceases to have connectivity to the individual, per the definition of privacy. The researcher or end user is then unable to move backward or forward along the time spectrum to better characterize the individual patient’s natural history of disease, to evaluate predictive utility, or to back-test a predictive tool (test data).

2. Discussion

2.1. Provenance, Property Rights, Agency, and Stewardship

Institutions often exercise patient data governance from the perspective of institutional data ownership, which often complicates and impedes institutional and patient participation in clinical trials. Agency translates to who gets to decide how data is used; this is usually the institution, within the confines of the consent the patient or research participant provides. The institution tends to view data stewardship as a liability, so there is a strong aversion to sharing. Importantly, the institution supersedes patient agency in that the institution often directs and restricts data use, regardless of the wishes of the patient. This reality limits the development of a data commons across our healthcare systems worldwide [14]. Due to a focus on ownership and control instead of the many social and medical benefits inherent in sharing medical data, institutional stewards get caught up in a protective stance of ensuring absolute privacy. Health data stewards emphasize the protection of the institution from the real and substantive liabilities associated with the Health Insurance Portability and Accountability Act (HIPAA) and the real and perceived privacy risks for patients. In fact, it has been argued that HIPAA is heavy on accountability and light on portability [15]. Another legal framework involves the Genetic Information Nondiscrimination Act (GINA), a law designed to provide guidance on consent and privacy requirements for genetic data in the United States [16]. Research participation, consumer genetic testing, and the inherently relational nature of genetic information increase the vulnerability of prevailing privacy preservation practices for genetic data [17]. Use of genetic data in health models presents challenges requiring highly specialized skills that transcend law, medicine, science, and information technology. Institutions recognize the risks in this realm of health AI, and thus it is fraught with risk-aversion tactics that severely limit scientific progress. A full accounting of the implications of GINA in the health data AI value chain is beyond the scope of this paper, but several reviews cover the topic well [16–19]. As we demonstrate below, there are jurisdictional considerations impacting health data available on platforms like Kaggle and the Cancer Imaging Archive [20]. The General Data Protection Regulation (GDPR) in the EU and HIPAA and GINA in the United States involve very different compliance implications. Tracking the provenance and appropriability of health data will be increasingly necessary to navigate these legal ecosystems and make audits practical.

Those who have negotiated contracts to get access to patient data from a hospital system, an insurance company, or a pharma company sponsoring a clinical trial appreciate the fact that patient data stewardship is a multiagency problem. First, the property rights of the institution come into play, with the legal issues inherent in laws, policies, and common practices in institutional stewardship of patient data. Second, such data may sometimes involve third-party rights if obtained through work with a payer/insurer, a health information exchange (HIE), a clinical trial sponsor, or a data clearinghouse such as the All Payer Claims Database. Third, patient agency also comes into play in the form of ensuring the patient has provided informed consent for the intended use of data. Informed consent involving genomic data that is coupled to patient-level medical record data (clinicogenomic data) raises many thorny issues associated with secondary use, as well as some statutory safe harbors for clinical care (quality improvement) [21] and secondary research use [22], with patient [22], researcher [23], and institutional stewardship [24] considerations.

2.2. The Knowledge Base and Complexity

The canonical genetic code has four deoxyribonucleotides and four ribonucleotides, encoding 20 amino acids that can be assembled into an essentially unlimited number of proteins with an almost limitless diversity of functions. This diversity of biomolecules is also subject to regulation spanning genetic, epigenetic, translational, posttranslational, cellular, structural, and integrated signaling levels. The structure of biology in general results in five domains of high-dimensional complexity [25]: the epigenetic, sequence, functional, structural, and environmental domains.

Genomics databases entail the following:
  • Nucleotide sequence databases: GenBank, RefSeq, Ensembl, the Exome Aggregation Consortium (ExAC), reference genomes (Celera and Incyte, private), and pangenomes

  • Variant and phenotype databases: ClinVar, PharmVar, gnomAD, dbGaP, and rare disease databases

  • Clinicogenomics databases: The Cancer Genome Atlas (TCGA) and Foundation Medicine/Flatiron (private)

  • Population genomic databases: All of Us, the 100,000 Genomes Project, the Million Veteran Program, Regeneron (private), and deCODE (private)

  • Epigenetics databases: the database of epigenetic modifiers (dbEM) and the Roadmap Epigenomics Project

  • Gene expression databases: GEO, GTEx, and ENCODE

  • Protein databases: Swiss-Prot

  • Biologic databases: Human Microbiome Project and Human Cell Atlas.

In medicine, the knowledge bases used to train AI models are vast. LLMs have proven transformational in accelerating human utilization of the vast and exploding scientific literature. In essence, LLMs have provided a massive and powerful pattern recognition tool as a viable alternative to trying to simulate and computationally interpret the complexities of biology de novo. The full expanse of the medical literature is difficult to measure; estimates range from the NSF’s count of over 3.3 million scientific articles published annually worldwide to the more than 37 million citations indexed in PubMed. Importantly, the medical knowledge base reflects a rather limited human understanding of the interplay of biological factors (genetics and epigenetics) with environmental components (nonbiological determinants of health). Nonbiologic determinants of health reflect interactions with the healthcare system and broader social domains such as occupation, geography, environment, and behavior, which in turn interact with inherent biologic determinants. A recent axiom in public health has emerged: The zip code is a more significant contributor to health outcomes and life expectancy than the genetic code [26]. So models based on observations of past patients and disease case studies, going beyond biologic and mechanistic considerations, are proving of greater utility than models predicated purely on the biology knowledge base.

2.3. Appropriability

One of the most frequently used tools to navigate the multiagent patient data problem is deidentification of health data. The 18 identifiers enumerated by HIPAA are removed from the health data being shared, with the intention that such data can no longer be traced back to an individual patient. It is questionable whether the removal of HIPAA identifiers is truly privacy preserving [27], especially when genetic data are involved. Indeed, genetic information can be used to link sequences to the faces of individuals [28]. The public benefit versus privacy trade-offs of genomic databases are illustrated by law enforcement use of these databases to link evidence from crimes to individuals. This is an example of the practical, social, and policy countercurrents at odds with the intention of absolute privacy preservation for health data [29]. Whether absolute privacy is an immutable human right, or is even achievable, is beyond the scope of this concept paper; however, it is clear that patient-centric data agency is an increasingly important and unaddressed policy consideration in health data stewardship.
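To make the mechanics of field-level deidentification concrete, the sketch below drops fields corresponding to HIPAA Safe Harbor identifier categories from a hypothetical flat record; the field names and record structure are assumptions, and real pipelines must also handle free text, dates, and quasi-identifiers that simple field removal does not address.

```python
# Sketch: field-level removal of HIPAA Safe Harbor identifiers from a
# hypothetical flat patient record. Real pipelines must also scrub free
# text, shift or generalize dates, and consider re-identification risk
# from quasi-identifiers (e.g., a rare diagnosis plus a zip code).
IDENTIFIER_FIELDS = {
    "name", "street_address", "zip_code", "birth_date", "phone", "fax",
    "email", "ssn", "medical_record_number", "health_plan_id",
    "account_number", "license_number", "vehicle_id", "device_id",
    "url", "ip_address", "biometric_id", "photo",
}

def deidentify(record: dict) -> dict:
    """Return a copy of the record with identifier fields removed."""
    return {k: v for k, v in record.items() if k not in IDENTIFIER_FIELDS}

record = {
    "name": "Jane Doe", "zip_code": "77030", "birth_date": "1984-02-17",
    "diagnosis": "pulmonary arterial hypertension", "gene_panel": "PBMC-RNAseq",
}
print(deidentify(record))   # only 'diagnosis' and 'gene_panel' remain
```

As the surrounding text notes, stripping these fields does not guarantee privacy when genomic data are present, and it forecloses the backward linkage that longitudinal modeling would need.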

Deidentified patient data that finds its way into a database, data commons, data fabric, or knowledge base is largely devoid of economic appropriability. Appropriability is an economic concept where a contribution to a social benefit can be attributed directly back to an individual or organization. In science, this is achieved by citation; in innovation, this is achieved by patents; in business, this is accomplished with financial securities; in real estate, this is achieved via a deed. By definition, privacy preservation generally forecloses backtracking the provenance chain of a data element to the patient it came from originally. The paradox of openness states that innovation requires openness, but commercialization requires property rights protections [30]. This is a challenge that very much vexes the application of AI to healthcare. Patient data is an increasingly important contributor to the innovation ecosystem as a byproduct of healthcare services, clinical trials, and biomedical research [31]. Some of the great economic thinkers of the last century, Joseph Schumpeter and Elinor Ostrom, have done foundational scholarship on the concepts of appropriability and the role of a public commons in innovation ecosystems. In many respects, the private property rights associated with appropriability (i.e., patents and copyrights) may seem at odds with the notion of a public commons.

This is certainly true with a finite resource such as an aquifer or a limited-term monopoly for a drug (i.e., a patent) introduced into a rare disease market. William Press, in his 2013 Presidential Address to the American Association for the Advancement of Science, makes elegant arguments that appropriability and intellectual capital are increasingly important co-contributors to the scientific advancement and economic development associated with heavy-tailed advances in medical science [32]. By heavy-tailed events, Press means infrequent events with outsized impact, such as the discovery of a new molecule like the immune checkpoint inhibitor pembrolizumab (Keytruda), which would require hundreds of millions of dollars to test in clinical trials. Appropriability enables a means to recover those investments and thus serves as an incentive for them. Ostrom published a framework in 1990 for Governing the Commons [33], which provided eight design principles for successful community-based resource management (clearly defined boundaries for the resource and its users, congruence between rules and local conditions, participation of users in rulemaking, monitoring, graduated sanctions, conflict-resolution mechanisms, external recognition of the right to organize, and nested governance across multiple levels). However, data is not a depletable resource, and its use to build predictive models for medical care is really an example of the economic concept of coproduction, as described by Ostrom [34, 35]. When Ostrom introduced the concept of coproduction, it established the basis for public–private partnerships in economic production. In the realm of health data, an example is Medicare or the Veterans Administration working with health insurers and pharmacy benefit managers to collectively produce adherence and utilization data that has enormous value for training AI models. Yet the increasing economic utility of benefits derived from data and models does not flow back to patients, because maintaining appropriability of that data across the data journey is currently administratively intractable. This is in large part due to the multiagency problem and the privacy preservation impetus of health data practices and policies [15]. Blockchain technologies that enable granular data agency through tokenization have unlocked emergent data economics in financial technology (i.e., Chainlink DECO) [36, 37], oil and gas (i.e., Data Gumbo) [38], music (i.e., Napster 3.0) [39], and even hyperlocal weather forecasting incentivized through tokenized weather data (i.e., WeatherXM) [40, 41]. These use cases portend an emerging data economy for personal health data. In his theories on creative destruction and monopolistic practices, Schumpeter offers thoughts on how the inertia of the status quo impedes the deployment of new commodities (i.e., data) and new technologies (i.e., AI models): incumbents deploy protective devices that preserve their advantages and defer the erosion of monopolies by innovators. An example of this in healthcare would be policy advocacy (lobbying) against new payment models for telehealth to preserve the number of hospital encounters for a healthcare system with a large real estate footprint. It can be argued that innovation in the healthcare industry is vulnerable to such tactics as health data becomes an increasingly important contributor to innovative advances and the commercial toolkit.
While companies in the health informatics arena have done much to advance cancer care by collecting and aggregating data, for example, in cancer clinicogenomics [15, 31, 42–44], the US healthcare system data mesh lags behind others like the United Kingdom (UK Biobank) [45], Taiwan (Taiwan Biobank) [46], and even Estonia (Estonian Biobank) [47] in access to population-scale datasets. Industries reliant on health data benefit from economies of scale and network effects, especially in clinicogenomics [48].

The powerful network effects enabled by digital aggregation of consumer data at population scale by the Siren Servers represent a model for collection and curation of consumer data in exchange for free services. “Siren Server” is a business model term coined by Jaron Lanier, named after the Sirens of Homer’s Odyssey. Network effects have benefitted the corporate stewards of social networks and digital communications companies such as Google and Meta. In healthcare, Foundation Medicine and Flatiron Health (initially founded and funded by Google executives) are examples of networks enabling population-scale curation of molecular diagnostic data in cancer patients [49]. In the Siren Server model, consumers share their data in exchange for free services. In healthcare, there really is no such model where free healthcare services are provided in exchange for sharing one’s health data. Lanier argues that the more consumer-centric (or, in this case, patient-centric) the model, the more it can return social benefits, agency, and even wealth to the masses in exchange for the data they share. In Radical Markets, Eric Posner and Glen Weyl further the argument for a more decentralized data ecosystem, liberating data stewardship from institutional practices that are designed to maintain the status quo. Posner and Weyl’s work is predicated on the idea that property rights are inherently monopolistic, and they put forth a new concept of partial common ownership as an alternative to corporate, state, or social ownership to maximize the economic utility of an asset. This is similar to Ostrom’s concept of coproduction. Such a paradigm has great potential to address inequities and disparities in healthcare and to allow the power of the network to empower the healthcare consumer, not the private owner of the network (in Lanier’s example, the Siren Server businesses dominating an industry niche). This framing is not intended to be pejorative with respect to industry practices, only to illustrate the tensions and trade-offs between shareholder versus social benefits of different data stewardship models and the inherent incentives for centralized versus decentralized models of stewardship. To be clear, private investments in tumor sequencing companies have provided a large net social benefit [15] and contributed to remarkable improvements in cancer survival [44].

In Read, Write, Own, venture capitalist Chris Dixon makes a compelling case that in the Web3 paradigm, the network effects of data contributions by creators to the data fabric have great potential to accrue directly to the creator [3]; in other words, Web3-enabled appropriability. Credit bureaus, genetic testing companies, social media companies, and even the companies who curate warranty cards have managed data networks with sensitive consumer data. These firms have been entrusted with stewardship of sensitive consumer data for many years, sometimes with social benefit, occasionally with privacy breaches, and sometimes at odds with consumer or societal interests and agency. Dixon argues that Web3 architectures might enable sustainable decentralized data networks and incentivize righteous and decentralized scaling of these networks. That is, noncorporate, nongovernmental, patient-centric [15, 50] population-scale medical data collections can be self-sustaining and feasible using Web3 governance frameworks. Indeed, blockchain ledger technology was designed to make work done on networks (i.e., data creation, computation, and validation of disintermediated transactions) directly attributable and appropriable to creators (i.e., a bitcoin miner or data contributor). See proof-of-work versus proof-of-stake for a deeper perspective on appropriability with respect to data and computational work [51]. A patient-centric data commons fits this paradigm of creative contribution to the network.

Blockchain ledgers were purposely designed to layer an immutable property right on a piece of data. That data can be a bitcoin, a nonfungible token, or my genome sequence (or a fraction thereof), which has inherent value if used in accordance with my wishes [52, 53]. The ideal is a data commons where all stakeholders in the multiagency problem of health data sharing have a property right and incentives to share [15] and curate [35]. There are nuanced considerations regarding whether blockchain networks for certain uses should have public, private, or semiprivate ledgers, proof-of-work versus proof-of-stake models, and the computational throughput of the Layer-1 platforms on which they reside (i.e., Ethereum, Hyperledger, Solana, and Sui). However, blockchain ledgers provide a technology architecture and a practical means to impart appropriability to health data throughout the data journey [52]. The idea that blockchain ledgers undermine privacy is a misconception: private blockchain network functions are purely at the discretion of network governance and can be as private or open as desired. These architectures are, however, decentralized and thus immutable by design [54], so trust in institutions or stewardship is not necessary for the platform to provide trustless data transit. Trustless means that no trust of intermediaries is required to transact, in contrast to current models of health data sharing where there are large trust barriers between institutional stewards and data users. Importantly, new technologies are being developed that connect the provenance and lineage of data elements used to train an AI model with cryptographic validation at all points in the model lifecycle and across the data governance continuum [55]. These technologies provide powerful tools to manage model quality and decay over time. In effect, both the provenance of the data and the provenance of the model are increasingly important in the implementation of individual health data in AI models. Indeed, Dixon argues that blockchain networks are a new and specialized type of large-scale computer that keeps track of data use and data users in an immutable, distributed ledger across time. A patient’s health data is many things: it is intellectual property, it is a commodity, and certain packaging and validation metadata around it make it comparable to a security; for these reasons, health data becomes valuable property. As such, Web3 architectures can be a tool to help enable disintermediation of data governance, preserve privacy, and demarcate intellectual property rights simultaneously, perhaps unlocking the paradox of openness. These new technologies could soon allow any one of us to share our genomic data, when and with whom we choose, for purposes we know a priori, and to follow how the data contributes to AI models, scholarly results, and other social goods that we care about as consumers, patients, and citizens.
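The core idea of tamper-evident provenance can be illustrated with a toy, hash-chained ledger; this is a conceptual sketch only, not the API of Ethereum, Hyperledger, Solana, or any other platform, and the owner identifiers and actions are hypothetical.

```python
# Sketch: a toy append-only ledger showing how hash chaining makes recorded
# data-use events tamper-evident. Conceptual only; not a real platform's API.
import hashlib, json, time

def sha256(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

class Ledger:
    def __init__(self):
        self.blocks = []

    def record(self, data_element: bytes, owner: str, action: str) -> dict:
        block = {
            "index": len(self.blocks),
            "timestamp": time.time(),
            "data_hash": sha256(data_element),   # fingerprint, not the data itself
            "owner": owner,
            "action": action,                    # e.g., "consented-model-training"
            "prev_hash": self.blocks[-1]["block_hash"] if self.blocks else "0" * 64,
        }
        block["block_hash"] = sha256(json.dumps(block, sort_keys=True).encode())
        self.blocks.append(block)
        return block

    def verify(self) -> bool:
        """Recompute every hash; any edit to a past block is detected."""
        prev = "0" * 64
        for b in self.blocks:
            body = {k: v for k, v in b.items() if k != "block_hash"}
            if b["prev_hash"] != prev:
                return False
            if sha256(json.dumps(body, sort_keys=True).encode()) != b["block_hash"]:
                return False
            prev = b["block_hash"]
        return True

ledger = Ledger()
ledger.record(b"ACGT...variant-call-data...", owner="patient:0x12ab",
              action="consented-model-training")
print(ledger.verify())   # True; altering any recorded field would return False
```

Real networks add consensus, replication, and access control on top of this chaining, but the appropriability argument in the text rests on exactly this property: each use of a data element leaves an attributable, immutable trace.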

There are only a few instances of blockchain technology being deployed for health data management [56]. The most prevalent use cases explored to date have been the internet of things for medical sensors, genomics, and medical records [56]. Blockchains provide tools for governance and provenance of data but are not inherently privacy preserving by themselves. Encryption technologies are a natural complement that can enable computing on encrypted data (i.e., homomorphic encryption) [57, 58]. Nebula Genomics and LunaDNA pioneered the use of blockchain technology to enable patient-centric genomic data sharing for their sequencing businesses and as a means to provide rewards and compensation directly to consumers sharing data [59]. Zenome [60] and Genecoin [61] both represent examples of the data economics token business model, and GitHub and blockchain ecosystems abound with health data sharing projects and networks [56]. Most of these examples are built on private blockchain architectures, restricted to validated, credentialed, or authenticated users. Such an approach has been developed for clinical research, with somewhat limited achievement of FAIR (findable, accessible, interoperable, and reusable) principles [62].
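To make the notion of computing on encrypted data concrete, the toy sketch below uses the additively homomorphic Paillier scheme with deliberately tiny, insecure parameters; it illustrates the principle only and is not a recommendation of any particular library or production scheme.

```python
# Sketch: additively homomorphic encryption (toy Paillier with insecure,
# tiny parameters) showing that two encrypted values can be summed without
# ever decrypting the individual inputs.
from math import gcd

p, q = 61, 53                    # toy primes; real deployments use ~2048-bit moduli
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)

def encrypt(m, r):
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

c1, c2 = encrypt(42, 17), encrypt(100, 23)   # two encrypted measurements
c_sum = (c1 * c2) % n2                        # multiplying ciphertexts adds plaintexts
print(decrypt(c_sum))                         # 142, computed without exposing 42 or 100
```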

MedRec was a concept paper using proof-of-work data economics to reward data sharing from medical records [63, 64]. There is no indication of successful implementation of this approach in the scientific literature. Moving data from medical records is particularly challenging in the United States compared to nationalized health systems like the United Kingdom or Canada. In the United States, many different hospitals and health systems hold stewardship over data from the patient journey. Each of these organizations has different practices for managing compliance and different data standards and ontologies, which collectively impede incentives for interoperable data sharing. However, the potential for blockchain-based data governance models to remove governance barriers to data sharing adds to the impetus to address standardization and interoperability barriers.

The distributed ledger ensures an immutable and tamper-resistant record of data transactions in support of provenance and transparency [43]. This can provide greater transparency and enable dynamic consent management, where patients can grant or revoke access. Smart contracts can automate and digitally enforce data use agreements, helping to address the multiagent problem in health data. Blockchain architectures can enable tokenization, monetization, and other reward structures for sharing data and doing computational work (using proof-of-work [65] or proof-of-stake [66] frameworks).
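A minimal sketch of that dynamic-consent logic is shown below in ordinary Python; the class and method names are hypothetical, and a production system would encode equivalent rules in an on-chain smart contract rather than application code.

```python
# Sketch: dynamic consent expressed as smart-contract-style logic.
# Names (ConsentContract, grant, revoke, request_access) are illustrative,
# not the API of any real smart-contract framework.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ConsentContract:
    patient_id: str
    grants: dict = field(default_factory=dict)      # purpose -> authorized requesters
    audit_log: list = field(default_factory=list)   # append-only record of events

    def grant(self, requester: str, purpose: str) -> None:
        self.grants.setdefault(purpose, set()).add(requester)
        self._log("grant", requester, purpose)

    def revoke(self, requester: str, purpose: str) -> None:
        self.grants.get(purpose, set()).discard(requester)
        self._log("revoke", requester, purpose)

    def request_access(self, requester: str, purpose: str) -> bool:
        allowed = requester in self.grants.get(purpose, set())
        self._log("access-granted" if allowed else "access-denied", requester, purpose)
        return allowed

    def _log(self, event: str, requester: str, purpose: str) -> None:
        self.audit_log.append(
            (datetime.now(timezone.utc).isoformat(), event, requester, purpose))

contract = ConsentContract(patient_id="patient-001")
contract.grant("university-lab-A", "model-training")
print(contract.request_access("university-lab-A", "model-training"))   # True
contract.revoke("university-lab-A", "model-training")
print(contract.request_access("university-lab-A", "model-training"))   # False
```

The point of putting such rules on-chain is that the grant, revoke, and access events become part of the same immutable audit trail as the data-use records themselves.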

There are two significant hurdles to using blockchains for stewardship of health data. The first is getting health data out of traditional stewardship silos in health systems that lack incentives, and have strong disincentives, to accommodate patients’ sharing of their health data. A second constraint is specific to high-density data such as imaging data and large-scale genome sequence data. The throughput of blockchain networks in validating transactions using smart contracts depends on the amount of data in a block and is impeded by large blocks of data such as .bam, .dicom, and .vcf files. Off-chain storage and object-oriented ledgers provide mitigation strategies (see the review by Kasyapa for more details on constraints and mitigation approaches) [67]. Recent efforts to implement data standards and application programming interfaces (APIs) have been guided by the need for continuity-of-care documentation using HL7 FHIR frameworks, structured data formats (JavaScript Object Notation [JSON]), and Representational State Transfer (RESTful) APIs [68]. The myriad data schemas and standards entering practice are reviewed thoroughly elsewhere [69]. The maturation of these standards is increasing uniformity across the health ecosystems of the United States, and as HIEs achieve scale in the tens of millions of patients [70] and consolidate, barriers to data linkage for research and analytic applications are lessening.
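The off-chain pattern mentioned above is straightforward in outline: the large file stays in conventional or distributed storage, and only a small fingerprint and pointer go on-chain. The sketch below uses hypothetical file paths and storage URIs.

```python
# Sketch: keeping a large genomic or imaging file off-chain while anchoring
# its integrity on-chain. File path and storage URI are hypothetical; the
# on-chain record holds only the hash, pointer, and access-policy reference.
import hashlib

def file_fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a large file (e.g., .bam, .vcf, DICOM) and return its SHA-256."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_onchain_record(path: str, storage_uri: str, policy_id: str) -> dict:
    """Small, ledger-friendly record: fingerprint plus pointer, never the bytes."""
    return {
        "sha256": file_fingerprint(path),
        "storage_uri": storage_uri,    # e.g., an object store key or content address
        "access_policy": policy_id,    # reference to a consent contract
    }

# record = build_onchain_record("sample.vcf", "s3://example-bucket/sample.vcf", "consent-001")
# A verifier later re-fetches the file and checks file_fingerprint(...) == record["sha256"].
```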

2.4. Democratizing AI and Distributed Computing

AI models are not new. The use of citizen science models to build, validate, and train AI models is also not a new practice. Before LLMs were announced with much fanfare in 2021, the Rosetta Project and Folding@Home achieved exascale (10^18) operations per second by recruiting personal computers and gaming consoles into a massive, distributed computer to computationally fold proteins to optimum thermodynamic stability. Folding@Home has been an ongoing project for nearly 25 years (since 2000) [71]. These are examples where citizen science has primarily been limited to the use of distributed computing resources coupled with public databases of protein sequences and thermodynamics. An emerging opportunity is for blockchain to help accelerate innovation by making it easier to obtain data from distributed sources, including real-world case-level patient data, for the design of machine learning models that are personalized and inform precision medicine practices. Put another way, citizens can contribute their health data to the scientific enterprise at the intersection of open data and citizen science. A distributed data fabric that can instill agency and privacy for patients’ health data is crucial to unlock this resource but is within reach.

An example of citizen science is a high school demonstration project [72], with students participating virtually from around the world, to build convolutional neural networks (CNNs) that classify oral squamous cell carcinoma (OSCC) using machine vision, from magnified histopathology images of the oral epithelium and from surface images of the oral cavity. Computational resources, pre-existing code from foundational models, and guidance were readily available. Obtaining the necessary data became the most limiting part of the project. It is not unusual for data scientists to lack expertise in medical science and vice versa.

A limited dataset from Kaggle [73] was utilized to build the network. This dataset included 89 images of normal oral epithelium and 439 images of OSCC at 100× magnification, in addition to 201 images of normal epithelium and 495 images of OSCC at 400× magnification [73]. As is typical for publicly available deidentified datasets, not much additional information was available, and important variables, such as the stage of cancer for positive samples and whether the cases were successfully treated or the lesions had metastasized, were missing. Importantly, an inherent limitation of such public, deidentified databases is that the researcher cannot go back to obtain additional information that may help improve the model.

The images were collected using a Leica ICC50 HD microscope from hematoxylin and eosin (H&E)-stained tissue slides from a total of 230 patients, prepared and cataloged by experts [73]. How would this model or classifier perform with images obtained using different optics? Is the training data biased, such that images from a different pathology lab or different optics might not be classified appropriately? Ideally, for most models, additional information about the patients and the methods of sample acquisition would be useful to validate model performance and address biases inherent to the model. But for deidentified data, working backwards in the data journey is not possible.

Images were used to train three models: a VGG16 CNN [74], a ResNet50 CNN [75], and an InceptionV3 CNN [76], imported from the DL framework libraries TensorFlow [77] and Keras [78]. The imported models were refined to ensure compatibility with the histopathology dataset [73]. The corresponding accuracies in detecting the presence or absence of OSCC from histopathology images of the oral epithelium were 85.18%, 58.41%, and 80.91% for the VGG16, ResNet50, and InceptionV3 CNNs, respectively. Additionally, a standard CNN [79], an EfficientNetB3 CNN [80], a VGG16 CNN [74], and an LSTM CNN [81] were designed using a previously existing oral cancer repository from GitHub [78, 82]. These models were adapted to ensure compatibility with the data and then trained using another dataset containing 87 images of oral cancer and 44 images of the normal oral cavity [83], with the goal of detecting OSCC from photographs of the oral cavity. The images were captured in various hospitals in Ahmedabad, India, and classified into cancerous or noncancerous groups [83].
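A hedged sketch of the import-and-adapt (transfer learning) step described above is shown below using a Keras VGG16 backbone; the directory layout, image size, and training settings are illustrative assumptions, not the project’s exact configuration.

```python
# Sketch: transfer learning with a Keras VGG16 backbone for binary OSCC
# classification. Directory layout, image size, and hyperparameters are
# illustrative assumptions, not the exact settings used in the project.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

IMG_SIZE = (224, 224)

train_ds = tf.keras.utils.image_dataset_from_directory(
    "oscc_dataset/train", image_size=IMG_SIZE, batch_size=32, label_mode="binary")
val_ds = tf.keras.utils.image_dataset_from_directory(
    "oscc_dataset/val", image_size=IMG_SIZE, batch_size=32, label_mode="binary")

# Pretrained ImageNet backbone, frozen; only the small classification head is trained.
backbone = VGG16(include_top=False, weights="imagenet", input_shape=IMG_SIZE + (3,))
backbone.trainable = False

model = models.Sequential([
    layers.Rescaling(1.0 / 255),
    backbone,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),   # OSCC vs. normal epithelium
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)
print(model.evaluate(val_ds))   # [loss, accuracy] on held-out images
```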

The corresponding accuracies in detecting OSCC from images of the oral cavity were 66.67%, 66.67%, 62.15%, and 63.33% for the standard CNN, the EfficientNetB3 CNN, the VGG16 CNN, and the LSTM CNN, respectively. After training and testing were completed, the accuracy of each model was evaluated using the T4 GPU cloud computing option in Google Colab. The most accurate model and dataset combination was the VGG16 CNN [74] with the histopathology dataset [73], at an accuracy of 85.18%. This level of performance is likely not adequate for most diagnostic or prognostic clinical use and leaves significant uncertainty unresolved. Larger and better annotated datasets are required to train and validate such models, putting some applications just out of reach of the citizen scientist, motivated student, or researchers in developing nations. Cloud computing has enabled people to work collaboratively in teams to create and improve models, thus fostering and promoting innovation. While the best model detected OSCC reasonably well on images from the histopathology dataset [73], its accuracy was limited, especially for patients outside of the dataset.

In this example, laptop computing was inadequate, but resources on the Google Colab platform were leveraged to train and validate the models. Google Colab is among the most frequently used cloud computing services for this kind of work, offering scale and compute throughput advantages compared with local machine computing. Colab “enables resource sharing on cloud systems, allowing different functions to operate simultaneously across multiple different platforms” [84]. The datasets used do not represent the diversity of OSCC in the human population. The demographic and clinical details of the provenance of these data are mostly unknown (see the Kaggle repositories) [73, 83]. Pigmentation, shapes of lesions, depth of lesions, stage of lesions, image capture technology for photographs, lens optics, and staining technology for micrographs are all factors whose variation is unlikely to be represented in such a small sample set. However, as blockchain technologies advance and are further integrated into society, such data will be easier to access (interoperability), of higher quality (immutability), and more comprehensive (trustless networks). Trustless networks and smart contracts will allow a more extensive user group to view the data while ensuring control over it, and will allow more providers of data to share it [85]. Data immutability will allow for higher quality data by preventing data manipulation and allowing previous versions of the data to be viewed, giving more control over the design and reliability of a model over time. Interoperability will allow for the facile aggregation of different data elements from more patients and the incorporation of more inputs in the design of a model. The control that blockchain affords over data has the potential to significantly increase both the number of ML models being designed and their accuracy.

A note of caution with the open data governance architectures presented is warranted. The private sector undoubtedly plays an important and indispensable role in the data ecosystem [15, 86]. Investment incentives for collection, curation, validation, and quality control of knowledge bases are functions that the private sector is well-suited and highly capable of cultivating for societal benefit. In 2021, DeepMind’s AlphaFold made a breakthrough enabling it to produce folded protein structures highly concordant with ground truth (x-ray crystal structures), in contrast to the thermodynamic functions of the Folding@Home project. Over 200 million protein structures are now available from the DeepMind knowledge base in the public domain. Both AlphaFold and Folding@Home have made distinct, large, and transformational contributions to biomedical science. Contrast this example of citizen science involving limited data with the UK Biobank’s recently announced largest ever functional study of the human proteome [87] that goes beyond the structural information of AlphaFold and Folding@Home. This vast dataset has revealed over 14,000 associations between genetic variants and plasma proteins from analysis of over 3000 proteins in the blood of 54,000 research participants. In addition to whole genome sequences, these patient samples have deep clinical annotation including details about the diseases individuals have reported, their socioeconomic and demographic details, and even MRI scans. The work was funded by 13 biopharmaceutical companies and conducted by over 20,000 scientists in over 50 countries. Imagine a ledger schema where contributions of research participants, funders, and researchers can be attributed to instances of data use, publication, and discovery of new biomarkers and drug targets. Web3 makes a benefit and attribution sharing waterfall for this kind of network appropriable and practicable, and privacy need not necessarily be surrendered to operationalize it.

Unlike the biological knowledge base surrounding these examples in protein sequence and thermodynamics, which is largely in the public domain and available for citizen science, such a crowdsourced approach to health data is impeded by the lack of a case-level data commons. The All of Us program has provided a roadmap for navigating many of the impediments to research and modeling with health data [88]. However, the data sharing procedures (by necessity) still rely on manual and laborious trust-vetting steps for users. The gene expression classifier for pulmonary hypertension published by Bull et al. [12] 20 years ago still cannot be deployed in a dynamic, longitudinal format like a digital twin because moving patient-level data across organizational boundaries, whether in a research context or a healthcare context, entails a formidable multiagent challenge. There are ethical, economic, and medical imperatives for the use of information technology to remove frictions, enable trustless collaboration (i.e., federated learning), and instill individual agency (and address multiagency impediments) in data throughout the data journey. Right-sizing the role of institutions as intermediaries in data governance can also address disparities in access to data, and to the corpus of AI technologies in healthcare, for scientists in developing countries. However, technology adoption will need to overcome legacy practices ossified within regulatory doctrines whose risk–benefit balance has been significantly shifted by social and technological change. Data governance practices need to change and delegate some of the privacy risk to patients themselves to unlock the social benefits of the health data armamentarium.

3. Conclusions

The status quo in health data governance involves cumbersome institutional intermediary processes. As Schumpeter argued, incumbents tend to deploy protective mechanisms to stave off disruption, and there are many instances in history where such tactics have impeded the implementation of new innovations. Lanier has argued for reallocating agency over our societies’ data assets from government and corporate institutional curators to individuals. The Ostrom principles can provide the rules of engagement for addressing multiagency data stewardship challenges, enabling the data commons needed to give all stewards, including patients, agency over their data across time and organizational boundaries. Alas, multilateral consensus building and governance processes can be slow and time consuming, and thus expensive. However, digital smart contracts make enforcement of the Ostrom principles digital and thus practical and cost-effective to implement at population scale. Dixon’s vision of a decentralized network curated and sustained by creators, consumers, or patients has enormous potential to disrupt the status quo of institutional stewardship of patient data governance. The notion of giving the patient agency (akin to nonexclusive property rights) in the data commons, as posited by Lanier, Posner, and Weyl, can better align incentives for data sharing and personalized, technology-driven care, and can potentially mitigate disparities in healthcare access. If money is information, and information is valuable, then health data is valuable currency. In the Web 3.0 era, any one of us can share it for the benefit of society and perhaps receive reasonable rewards, services, and incentives for its use that can help improve our own healthcare and better healthcare for all.

4. Future Directions

The implementation of Web3 technologies has expanded from cryptocurrency to other segments of the financial sector, and other sectors such as music royalties and oil royalties, where microtransactions and digital ledgers can address administrative frictions. Implementation of these technologies with healthcare data will be a challenge with headwinds from both existing policy and practice. Existing intermediaries will cite the righteousness of existing privacy and stewardship practices. Three different forces are likely to usher in change.

First, there is an increasing market demand, driven by AI models, for patient-level data from wearable physiological sensors, continuous glucose monitors, consumer genomics, and other technologies at the intersection of the quantified self and the internet of things. These dynamics can bypass the institutional intermediary, where resistance to implementing new stewardship models is strongest.

Second, data standards will mature and interoperability barriers will be engineered out of the ecosystem, enabling our health data to function more like a liquid currency than a commodity for barter, which is the status quo under institutional stewardship. Credit bureaus play a role in aggregating and curating our individual financial data at no cost to us, in essence providing free services at population scale to enable consumer credit markets (and other functions where an individual’s risk profile is useful). There are models for this in healthcare that have had significant but diffuse social benefits (Foundation Medicine and Flatiron Health), but none yet that have been directly appropriable as a reward or service to the patient.

Third and finally, in the United States, there is a shift in policy and law toward acknowledging a stronger property right for patients to their data. The GDPR in Europe has done that for consumer data, including a right to be forgotten. The right to be forgotten is a significant challenge without a ledger of the data journey, so such an approach provides multiple forms of agency. There is a precedent for this in the realm of political economy that has been applied extensively to data generated as a byproduct of healthcare and biomedical research [11, 15, 42, 43, 48, 89, 90]. Patents arising as a byproduct of federally funded research were rendered appropriable and valuable by the Bayh–Dole Act in 1980 [31, 89], so it stands to reason that a similar statutory shift could catalyze the same with health data if there is a practical alternative means to provide the ethical and privacy protections that institutional intermediaries currently provide [15]. In short, a mix of incentives and policy requirements can unlock the significant social potential of health data in an era of citizen science and personalized AI, and catalyze precision medicine for all.

Ethics Statement

No human subject research was conducted.

Consent

No human subject research was conducted.

Conflicts of Interest

P.J.S. is a paid consultant of Procyon, LLC; P.J.S. is an unpaid advisor and research collaborator with MyEngene, Geneial, and Wave Neuroscience.

Author Contributions

The citizen science project AI CNN model comparison was conducted independently by P.A.S. with mentorship as specified in acknowledgments. Conceptualization: P.J.S., K.S.R., and P.A.S.; methodology: P.A.S.; software: P.A.S.; validation: P.A.S.; writing—original draft preparation: P.J.S. and P.A.S.; writing—review and editing: K.S.R. and P.J.S.; supervision: P.J.S. and K.S.R.; project administration: P.J.S. All authors have read and agreed to the published version of the manuscript.

Funding

No funding was received for this manuscript.

Acknowledgments

The authors wish to thank and acknowledge the Colorado School of Mines: Coding for Good Internship Program and our mentor Lauren Guo for mentorship on the OSCC project. We wish to acknowledge the OSCC project team members Jayden He, Devang Pandey, Zachary Song, and Darsh Mirwani.

Data Availability Statement

Datasets and code associated with the citizen science project are available as follows: CNN code: https://github.com/Praveenanand333/Early-Cancer-Prediction; photograph data: https://www.kaggle.com/datasets/shivam17299/oral-cancer-lips-and-tongue-images?select=OralCancer (CC0 1.0 Creative Commons License); histopathology data: https://www.kaggle.com/datasets/ashenafifasilkebede/dataset/data (CC0 1.0 Creative Commons License).
