Data is supremely important in life sciences and healthcare. Multi-omic analyses yield clues to biomarkers of disease and possible drug targets. In the ag biotech world, genome-wide association studies (GWAS) provide paths to gene-editing targets for more nutritious and sustainable food crops. Complex multi-modal analyses of digital signals such as EEG, ECG, video, voice, and actigraphy yield digital behavior biomarkers that can help quantify mental state and diagnose and stratify neurodevelopmental and neurocognitive diseases. Analyses of real-world health data from EHRs, wearables, devices, and other sources can produce new insights and paths to personalized medicine algorithms. Aggregated sensor data can enable new forms of remote patient monitoring. Data pouring in from scientific publications and social media activity can support adverse event monitoring, providing early warnings of risk as well as surprising hints of new indications. The value of data to drive discovery in life sciences and healthcare has never been greater.
And yet there are significant challenges in managing life sciences and healthcare data. “The world of healthcare is beset with a paradox whereby there is more data than ever flowing through physicians’ hands, but the true value of that data has gone largely untapped because it is unstructured and siloed in systems that generally are unable to talk to one another,” laments Janssen’s Chairman for EMEA, Kris Sterkens. Fortune writes: “new techniques have created a new challenge that the life science industry wasn’t ready for: how to get massive amounts of super complex data into a single place where it's all interlinked and can be fully made into real insights.”
What are the unique challenges in managing life sciences and healthcare data? Here are five that we have identified:
Challenge 1: Heterogeneous, multi-modal data. In the biomedical world, it is common to manage a time series of wearable and sensor data alongside periodic vital signs, labs, and imaging data, combined with unstructured clinical data and new sources of video and audio data. In the ag biotech world, it is common to encounter genomic and RNAseq data, gene-editing targets, and NLP-gleaned insights from publications, along with plant trait information, computer vision input from growing labs, and drone footage of crops.
Solution: Managing data sets this heterogeneous is never going to be easy. However, the newest generation of cloud data engineering solutions supports all of these heterogeneous data types, including JSON, columnar, image, and video. The solution often encompasses many tools, such as warehouses, lakes, graph stores, RDBMSs, and file stores.
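As a rough illustration of how such a platform keeps heterogeneous modalities queryable together, here is a minimal Python sketch of a common "manifest" pattern: structured records go to columnar storage, while images, video, and signal files stay in object storage and are linked by URI. All file names, columns, and bucket paths below are hypothetical.

```python
# Minimal sketch of the "manifest" pattern for heterogeneous data:
# structured records live in columnar files, while images, video, and
# signal blobs stay in object storage and are referenced by URI.
import pandas as pd

# Columnar, analysis-ready data (e.g., periodic labs) -> Parquet
labs = pd.DataFrame({
    "subject_id": ["S001", "S002"],
    "visit": [1, 1],
    "hba1c_pct": [5.4, 6.1],
})
labs.to_parquet("labs.parquet", index=False)

# Unstructured assets stay where they are; a manifest table links them
# to the same subject keys so every modality can be joined and queried.
manifest = pd.DataFrame({
    "subject_id": ["S001", "S001", "S002"],
    "modality": ["ecg", "video", "audio"],
    "uri": [
        "s3://bio-lake/ecg/S001_visit1.edf",      # hypothetical bucket
        "s3://bio-lake/video/S001_visit1.mp4",
        "s3://bio-lake/audio/S002_visit1.wav",
    ],
})
manifest.to_parquet("asset_manifest.parquet", index=False)
```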
Challenge 2: Domain-specific cleaning and pre-processing. In the case of genomic data, sophisticated annotation of the raw data is required to label the structural and functional classifications of sequences and sub-sequences, assigning meaning and biological significance. These annotations require detailed knowledge of biological function and must be done by biologists and bioinformaticians. Unstructured clinical, voice, video, and image data need to be structured using NLP and feature extraction techniques that rely on domain knowledge of biological and medical ontologies. Sensor and device data require sophisticated signal processing algorithms. Protected health information (PHI) requires de-identification workflows. For many applications, labeling workflows are required to make the data useful. It will be impossible to scale a life science organization’s data science mission if every team has to repeat these cleaning and pre-processing steps before beginning its analyses.
Solution: Cloud data engineering solutions provide auditable data engineering pipelines to automate and standardize cleaning and pre-processing steps for an entire organization.
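To make the idea concrete, here is a hedged Python sketch of one auditable pre-processing step: a simplified de-identification pass that records input and output fingerprints for every run. The identifier columns and hashing scheme are illustrative, not a production PHI workflow.

```python
# Sketch of an auditable pre-processing step: each run records what it
# did and to which input, producing a reproducible audit trail.
import hashlib
import json
from datetime import datetime, timezone

import pandas as pd

DIRECT_IDENTIFIERS = ["name", "mrn", "date_of_birth"]  # hypothetical columns

def fingerprint(df: pd.DataFrame) -> str:
    """Stable hash of a dataframe's contents for provenance records."""
    return hashlib.sha256(df.to_csv(index=False).encode()).hexdigest()[:12]

def deidentify(df: pd.DataFrame, audit_log: list) -> pd.DataFrame:
    """Drop direct identifiers and append an audit record for the run."""
    out = df.drop(columns=[c for c in DIRECT_IDENTIFIERS if c in df.columns])
    audit_log.append({
        "step": "deidentify",
        "input_hash": fingerprint(df),
        "output_hash": fingerprint(out),
        "run_at": datetime.now(timezone.utc).isoformat(),
    })
    return out

audit_log = []
raw = pd.DataFrame({"mrn": ["12345"], "name": ["Jane Doe"], "hba1c_pct": [5.4]})
clean = deidentify(raw, audit_log)
print(json.dumps(audit_log, indent=2))
```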
Challenge 3: Expensive data that demands provenance. It is very expensive to conduct preclinical experiments and run clinical trials, and for ag biotech companies, taking a gene-editing experiment all the way to the field is both expensive and time-consuming. This data costs far more to collect than data in most other industries, so it is vital that these assets are properly stored and placed under version control to ensure data provenance and a stable ground truth over time. Without a centralized data warehouse, there is no single source of truth: data is passed from person to person and team to team, with no accountability and no record of who generated what, how it was produced, or whether someone changed it along the way. This leads to questions like: Which is the correct data set? Is this the most recent version? Are there others? Multiple versions floating around easily lead to discrepancies in downstream analyses.
Solution: Cloud data engineering solutions maintain an auditable set of ground-truth data, complete with version management and traceable provenance. Access controls limit who can edit the data, ensuring clear ownership.
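One way to picture this is content-addressed versioning, sketched below in Python: each published data set gets a version identifier derived from a hash of its contents, plus a registry entry recording who published it and when. The paths, schema, and registry format are assumptions for illustration, not any particular product's API.

```python
# Sketch of content-addressed dataset versioning: "which data set is
# correct?" has one answer, because versions are keyed by the hash of
# the data itself and recorded in a shared registry.
import hashlib
import json
import pathlib
from datetime import datetime, timezone

import pandas as pd

REGISTRY = pathlib.Path("dataset_registry.jsonl")  # hypothetical location

def publish(df: pd.DataFrame, name: str, author: str) -> str:
    """Write the data under a content hash and log who published it."""
    payload = df.to_csv(index=False).encode()
    version = hashlib.sha256(payload).hexdigest()[:12]
    pathlib.Path(f"{name}-{version}.csv").write_bytes(payload)
    with REGISTRY.open("a") as f:
        f.write(json.dumps({
            "name": name,
            "version": version,
            "author": author,
            "published_at": datetime.now(timezone.utc).isoformat(),
        }) + "\n")
    return version

trial_arms = pd.DataFrame({"subject_id": ["S001"], "arm": ["treatment"]})
v = publish(trial_arms, "trial_arms", author="[email protected]")
print(f"ground truth version: {v}")
```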
Challenge 4: Small, wide data sets and scarce expertise. In any given analysis, there are typically few observations, often fewer than 1,000, and many features (tens of thousands or more). Feature engineering workflows are therefore essential to maximize signal, avoid overfitting, and achieve generalizable results, and devising them often requires deep domain expertise from specialists who are in short supply. As our team has previously published, “feature engineering is the most important part of real-world data science.” Maintaining a shared feature store among a team of scientists is essential to maximize scientific productivity. Further, with such small but valuable data sets, there is a heavy focus on model experimentation, and tracking experimental results is critical to avoid wasting time re-doing work that has already been done. Without a centralized data repository, experimental results are often maintained by individual teams in spreadsheets.
Solution: Cloud data engineering solutions support versioned feature engineering workflows and a feature store, providing your scientists with a unified view of previous work and shared approaches for managing data. Tracking modeling experiments ensures a controlled and auditable process in which every scientist operates with complete information.
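Here is a minimal sketch of the shared-feature-store idea, assuming a simple in-process registry: each feature is a named, versioned, documented function, so every scientist computes it the same way instead of re-deriving it per team. The feature shown (RMSSD, a common heart-rate-variability measure) and all names are illustrative.

```python
# Sketch of a shared feature registry: features are named, versioned,
# documented transformations that every team resolves from one place.
import pandas as pd

FEATURES = {}

def feature(name: str, version: int):
    """Decorator that registers a feature function by name and version."""
    def wrap(fn):
        FEATURES[(name, version)] = fn
        return fn
    return wrap

@feature("hrv_rmssd", version=1)
def hrv_rmssd(rr_ms: pd.Series) -> float:
    """Root mean square of successive RR-interval differences (ms)."""
    return float((rr_ms.diff().dropna() ** 2).mean() ** 0.5)

# Every scientist computes the feature the same way, by key.
rr = pd.Series([812, 798, 805, 830])
print(FEATURES[("hrv_rmssd", 1)](rr))
```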
Challenge 5: Continuously arriving data and regulated machine learning. New data emerges hourly from trials, real-world data sources, and the lab. In the biomedical space, there is a growing focus on reducing bias and increasing equity in research data sets. The FDA has released a discussion paper as it considers how to support continuous learning algorithms for Software as a Medical Device (SaMD), providing guidance on Good Machine Learning Practices (GMLP). All of these factors mean that organizations must be able to accept a never-ending stream of incoming data, maintain a clear version history of the data sets used in learning algorithms, track model version history, document model QA, and monitor models in production. This is a big ask.
Solution: A unified data management strategy is the foundation of a sound ML program. Without one, it is impossible to properly manage model versions, put adequate measures of bias in place, or monitor models in production.
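As a sketch of what that linkage looks like in practice, the Python below ties a model version to the exact data version it was trained on and adds a toy drift check for production monitoring. The identifiers, metrics, and tolerance threshold are hypothetical, not a regulatory recipe.

```python
# Sketch linking a model version to its training data version, with a
# toy drift check: the kind of traceability GMLP-style documentation
# expects from a continuous learning program.
import json
from datetime import datetime, timezone

def register_model(model_id: str, dataset_version: str, metrics: dict) -> dict:
    """Immutable record linking model, training data, and QA metrics."""
    return {
        "model_id": model_id,
        "dataset_version": dataset_version,   # from the data registry
        "qa_metrics": metrics,
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }

def drift_alert(train_mean: float, live_mean: float, tol: float = 0.1) -> bool:
    """Flag when a monitored input shifts beyond tolerance vs. training."""
    return abs(live_mean - train_mean) / max(abs(train_mean), 1e-9) > tol

record = register_model("sepsis-risk-v3", dataset_version="a1b2c3d4e5f6",
                        metrics={"auroc": 0.91})
print(json.dumps(record, indent=2))
print("drift:", drift_alert(train_mean=5.4, live_mean=6.2))
```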
There is a significant opportunity for life sciences and healthcare organizations to implement a comprehensive data management strategy. Cloud data engineering solutions, composed of a rich ecosystem of tools and apps, are increasingly able to support sophisticated ML use cases and handle heterogeneous data types. Many tools support the rigorous data provenance, audit trails, and versioning that life sciences and healthcare organizations need. Now is the time for these organizations to invest in a unified data engineering solution that makes their data valuable and accessible and advances the organization's data science mission.