Leveraging text-based data within machine learning models is a major challenge for many biotech and healthcare companies. This text-based data can be found in clinical trial notes, medical records, messages between study site coordinators and study participants, and claims documentation. Even supposedly structured data, such as lists of co-morbidities, medications, or procedures, contains so many possible values with so few patterns that it more closely resembles unstructured text. This data typically holds a wealth of information but often goes unused because of its variability, its lack of structure, and the inherent difficulty of translating it into numerical information. Here, we present how natural language processing can unlock the potential of your text-based data using techniques available from open source pre-trained large language models (LLMs).
Generally, when people think about natural language processing (NLP), they imagine algorithms that interpret and generate human-like language, sentiment analysis on product reviews, or chatbots taking over customer service. To organizations focused on training ML models, NLP can feel esoteric, specialized, and not applicable to the primary objective. However, there are straightforward NLP use cases that help create much more useful and predictive ML models. Given the significant value added, we include these NLP techniques in most of our ML projects across domains and modeling approaches.
Categorical information (data stored as text-based categories) represents a significant opportunity to transform your raw data using NLP and capitalize on unrealized data value. Let’s say you are trying to predict health outcomes for your study participants. You have collected their job titles, what they eat for breakfast every day, their pre-existing medical conditions, and a brief description of how they felt that day. Now, you want to use this data to predict who is at risk for disease or negative health outcomes. Even if you limit each entry to 1-3 words, you will end up with tens of thousands of discrete categories or unique terms. The data you collected isn’t usable in its raw form because it lacks the uniformity needed to identify patterns.
In this scenario, you’ll end up with job titles like “secretary”, “administrative assistant”, “office manager”, and “project coordinator”. Without NLP, each of these becomes its own category, and your ML model will struggle to learn across categories. You therefore need an approach that automatically condenses similar responses into concise buckets and then goes a step further and learns the similarity between these buckets.
Using NLP, we can condense messy text into concise buckets and then use word vectors to calculate the similarity between discrete words and phrases. A word’s vector encodes rich context about what the term means, how it is used, and which other terms it is related to. Word vectors therefore provide a path to identify synonymous or highly related entries, which allows items that belong to similar classes to be aggregated and mitigates the challenges presented by high cardinality in categorical labels.
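To make the idea of vector similarity concrete, here is a minimal sketch of cosine similarity, one common way to compare word vectors. The three-dimensional vectors in `TOY_VECTORS` are invented for illustration and are not taken from any real model; vectors from a pre-trained model would have many more dimensions and would be loaded from that model rather than typed by hand.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# Real pre-trained word vectors typically have hundreds of dimensions.
TOY_VECTORS = {
    "secretary": [0.9, 0.1, 0.0],
    "administrative assistant": [0.8, 0.2, 0.1],
    "mechanic": [0.1, 0.9, 0.2],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


sim_admin = cosine_similarity(
    TOY_VECTORS["secretary"], TOY_VECTORS["administrative assistant"]
)
sim_mechanic = cosine_similarity(
    TOY_VECTORS["secretary"], TOY_VECTORS["mechanic"]
)

print(f"secretary vs administrative assistant: {sim_admin:.2f}")
print(f"secretary vs mechanic: {sim_mechanic:.2f}")
```

With these toy values, “secretary” scores much closer to “administrative assistant” than to “mechanic”, which is the kind of signal that lets related entries be aggregated.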
Since these vectors are pre-trained as part of open source large language models (LLMs), any organization can quickly augment categorical data and draw value from text fragments, adding contextual depth and richness. For instance, the job titles listed above can be automatically grouped into “Administrative” and kept separate from other professions such as “waitress” or “mechanic”. Breakfast choices of bananas, oatmeal, and pears can be grouped separately from donuts and kolaches. From the chaos of free-entry text and diverse categories, NLP can transform the data into actionable groups and patterns, thereby supporting much more useful and predictive ML models.
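The grouping described above can be sketched as a nearest-bucket assignment: each term goes to the bucket whose vector it most resembles. This is one possible approach under stated assumptions, not a definitive implementation. The vectors, the bucket labels “Skilled trade” and “Food service”, the `assign_bucket` helper, and the `min_similarity` threshold are all hypothetical; in practice the vectors would come from a pre-trained model.

```python
import math

# Hypothetical toy vectors standing in for pre-trained embeddings.
TERM_VECTORS = {
    "secretary": [0.9, 0.1, 0.0],
    "administrative assistant": [0.8, 0.2, 0.1],
    "office manager": [0.85, 0.15, 0.2],
    "project coordinator": [0.7, 0.1, 0.3],
    "waitress": [0.1, 0.2, 0.9],
    "mechanic": [0.1, 0.9, 0.2],
}

# Hypothetical reference vectors, one per target bucket.
BUCKET_VECTORS = {
    "Administrative": [1.0, 0.0, 0.0],
    "Skilled trade": [0.0, 1.0, 0.0],
    "Food service": [0.0, 0.0, 1.0],
}


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_bucket(term_vector, buckets, min_similarity=0.5):
    """Return the most similar bucket label, or "Other" if nothing is close."""
    best_label, best_score = max(
        ((label, cosine_similarity(term_vector, vec)) for label, vec in buckets.items()),
        key=lambda pair: pair[1],
    )
    return best_label if best_score >= min_similarity else "Other"


grouped = {
    term: assign_bucket(vec, BUCKET_VECTORS) for term, vec in TERM_VECTORS.items()
}

for term, bucket in grouped.items():
    print(f"{term} -> {bucket}")
```

The “Other” fallback keeps terms that resemble no bucket from being forced into a poor match; the threshold value would need tuning on real data.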
But what if you have longer, more complicated blocks of text? You may want to get value from clinician notes, participant status updates written in multiple sentences by your study coordinator, or short descriptions of an incident or event. In each case, you want the ability to reduce this text down to its key points, words, terms, or topics.
If you had a limited number of known terms of interest, you could simply run a search for those terms using something like regular expressions (RegEx). For instance, does a patient file use the word “cancer”? However, NLP allows you to go beyond simple look-ups in several powerful ways.
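For comparison, the simple look-up mentioned above might look like the following sketch using Python’s standard `re` module. The example notes and the `mentions_cancer` helper are invented for illustration.

```python
import re

# Match the whole word "cancer", ignoring case.
CANCER_PATTERN = re.compile(r"\bcancer\b", re.IGNORECASE)


def mentions_cancer(text):
    """Return True if the text contains the standalone word "cancer"."""
    return CANCER_PATTERN.search(text) is not None


# Invented example notes, not real patient data.
notes = [
    "Patient has a family history of breast cancer.",
    "No significant findings; follow up in six months.",
    "CANCER screening scheduled.",
    "Biopsy showed a cancerous lesion.",
]

flags = [mentions_cancer(note) for note in notes]
print(flags)
```

Note that the last note is not flagged, because the pattern matches only the exact word and not the related form “cancerous”. An exact-match search finds only the terms you anticipated in advance.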
Many organizations are sitting on a gold mine of unstructured text and categorical data that they fail to utilize effectively, often because they simply don’t realize it’s possible to do so. Using NLP, you can impose structure on unwieldy blocks of text and extract meaning from text fragments to bring order to your data, uncover previously hidden relationships, and create more predictive ML models.