
A Better Facial Emotion Recognition Model

Major cloud service providers offer off-the-shelf emotion prediction APIs, but their accuracy and interpretability often fail to meet the needs of our clients. Read how our team developed a new approach to predict emotions from video. Here we describe the model-building process, its current performance, and future considerations.

Facial Emotion Recognition (FER) is increasingly useful in healthcare and life sciences applications, particularly in neuroscience, as well as in areas like marketing, training, and quality assurance for services delivered by video conferencing. In healthcare, FER can be applied in a number of interesting use cases. For example:

Telemedicine platforms can employ AI/ML-based FER to augment their existing patient engagement workflows to increase efficiency and quality of care. FER will make it possible for mental health care providers to automatically summarize and assess a patient's emotional state over a session or, longitudinally, over a period of time, helping to determine treatment needs and measure response to medication or therapy. It could also allow providers to annotate notes to record, and search for, topics of high emotional impact to the patient, offering a way to rapidly quality-assure video visits.

Clinical trials are increasingly administered remotely in a distributed fashion. Distributed trials require more personalized attention to participants to prevent attrition and to ensure patients stay engaged with the process. AI/ML-based FER provides a basis for rapidly assessing patient engagement, so that clinical trial administrators can better focus their resources on preventing patient dropout. As with telemedicine-based mental health services, video analysis of patients has the potential to provide insights into response to therapies at very little additional cost.

Insurance payers can use FER to assess the emotional and mental health state of their members. Emotional state is highly correlated with other health outcomes. Particularly in a value-based care ecosystem, the ability to identify members with emotional states of concern could accelerate diagnoses and interventions.

Google and Amazon both host off-the-shelf facial recognition APIs that are able to return emotion predictions indicating whether an individual’s face is expressing “happiness,” “sadness,” or other emotions during a video. In practice, however, we have found that results returned by these APIs are often inconsistent with human annotations and unable to capture behavioral nuances. As a result of these limitations, we have built our own deep learning model that is better suited to the needs of those in healthcare and life sciences than the current off-the-shelf APIs.

Approach

The model is based on 468 3D facial landmarks captured using MediaPipe. We derived features from these facial landmarks and used them to train an ensemble model. We then passed the ensemble model's predictions into a deep learning meta-model to classify and continuously evaluate emotion.
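As a rough, minimal sketch of this pipeline (not the production implementation), the landmark step below uses MediaPipe's FaceMesh solution, while the feature derivation, the choice of base learners, and the meta-model are illustrative stand-ins:

```python
# Minimal sketch of the landmark -> features -> ensemble -> meta-model flow.
# MediaPipe FaceMesh returns 468 3D landmarks; everything downstream of it here
# (features, base learners, meta-model) is an assumed, illustrative stand-in.
import cv2
import mediapipe as mp
import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.neural_network import MLPClassifier


def frame_to_landmarks(frame_bgr):
    """Return a (468, 3) array of x, y, z face-mesh landmarks, or None if no face is found."""
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=True, max_num_faces=1) as fm:
        result = fm.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not result.multi_face_landmarks:
        return None
    pts = result.multi_face_landmarks[0].landmark
    return np.array([[p.x, p.y, p.z] for p in pts], dtype=np.float32)


def landmarks_to_features(landmarks):
    """Illustrative feature derivation: landmark coordinates centred on one reference point."""
    return (landmarks - landmarks[0]).ravel()  # index 0 chosen arbitrarily as an anchor


def build_emotion_classifier():
    """Stand-in for 'ensemble model -> deep learning meta-model': stacked tree ensembles
    with a small neural-network meta-learner trained on their class probabilities."""
    base_learners = [
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ]
    meta = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
    return StackingClassifier(estimators=base_learners, final_estimator=meta,
                              stack_method="predict_proba")
```

Given per-frame feature vectors X and emotion labels y, build_emotion_classifier().fit(X, y) would train the stack, and predict_proba would then yield per-frame emotion probabilities.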


Finally, we used all of these models' learned features to train a supervised UMAP model and clustered the transformed data. Ultimately, this modeling pipeline generates an embedding of facial emotions and allows us to project unseen or future data into this space.
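A minimal sketch of this last step, assuming the umap-learn and hdbscan packages; X here stands for whatever learned features the upstream models produce, and the hyperparameters and clustering algorithm are illustrative choices rather than the ones we use:

```python
# Sketch of supervised UMAP + clustering; package choices (umap-learn, hdbscan)
# and hyperparameters are assumptions for illustration only.
import umap
import hdbscan


def fit_emotion_embedding(X_train, y_train, X_new):
    # Supervised UMAP: passing the emotion labels y shapes the low-dimensional embedding.
    reducer = umap.UMAP(n_components=2, n_neighbors=30, random_state=0)
    embedding_train = reducer.fit_transform(X_train, y=y_train)

    # Cluster the embedded training data (the clustering algorithm is unspecified above;
    # HDBSCAN is used here purely as an example).
    cluster_ids = hdbscan.HDBSCAN(min_cluster_size=25).fit_predict(embedding_train)

    # Project unseen/future data into the same embedding space.
    embedding_new = reducer.transform(X_new)
    return embedding_train, cluster_ids, embedding_new
```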


Training data for our model consisted of video frames with labeled emotions from two open-source datasets: RAVDESS and Aff-Wild2. Before training our model, we balanced emotion labels to promote high model performance across each category (see table below).

Emotion labels
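One simple way to balance emotion labels is to randomly downsample every class to the size of the rarest class; the exact balancing strategy used for training is not detailed here, so the sketch below is only an assumed example:

```python
# Illustrative label balancing by random downsampling to the rarest emotion class.
# features and labels are assumed to be NumPy arrays of the same length.
import numpy as np


def balance_emotion_labels(features, labels, seed=0):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    n_per_class = counts.min()
    keep = np.concatenate([
        rng.choice(np.where(labels == c)[0], size=n_per_class, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return features[keep], labels[keep]
```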

To test the model’s performance, we reserved a validation set of videos and video frames that were not included in the training set.

Results

The model showed improved performance over both Amazon’s and Google’s algorithms. When tested on the validation set, Amazon’s algorithm achieved 69.5% accuracy (averaged across 5 emotions) and Google’s achieved 73.3% accuracy (averaged across only 4 emotions), while our approach achieved 73.4% accuracy (averaged across 5 emotions). The table below outlines performance in predicting individual emotions.
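One reasonable reading of "accuracy averaged across N emotions" is per-emotion recall averaged over the emotion classes (macro-averaged recall); the scikit-learn sketch below shows that calculation on hypothetical labels, not our validation data:

```python
# Hypothetical example of per-emotion accuracy and its average across emotions.
from sklearn.metrics import recall_score

y_true = ["angry", "fear", "happy", "happy", "neutral", "sad", "sad"]
y_pred = ["happy", "fear", "happy", "happy", "neutral", "sad", "angry"]
emotions = ["angry", "fear", "happy", "neutral", "sad"]

per_emotion = recall_score(y_true, y_pred, labels=emotions, average=None)
print(dict(zip(emotions, per_emotion)))               # accuracy within each emotion
print(recall_score(y_true, y_pred, average="macro"))  # averaged across the 5 emotions
```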


Considerations

Looking beyond performance, this FER model has a number of benefits:

Optimized for AI/ML: Our approach automatically generates predictions for all video frames and can contextualize multiple adjacent frames to summarize emotional response across time. All output data is structured into a time series that is ready for downstream data science analysis (see the sketch at the end of this section).

Scalable & Lower Cost: Off-the-shelf emotion prediction APIs are priced in terms of the amount of video data processed; however, at scale and in cases where every frame in every video must be analyzed, this can be prohibitively expensive. This approach offers a more flexible pricing model to support use cases at scale.

More Private and Secure: Handling audio and video in environments that require high levels of data privacy and security is challenging. To address privacy and security concerns, this emotion recognition model can be flexibly deployed on premises, at the edge, or in the cloud, ensuring all data remains private and secure, never leaving the hands of trusted administrators.

Transparent: Off-the-shelf APIs provide opaque emotion predictions based on proprietary data sets, without transparent version control. This model can be retrained to meet specific project needs and additional labeled training data can be incorporated to tune performance and mitigate bias.
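To illustrate the time-series output mentioned under "Optimized for AI/ML" above, the sketch below shows an assumed frame-level schema and a rolling summary across adjacent frames; the column names, values, and window size are hypothetical, not the platform's actual output format:

```python
# Hypothetical frame-level emotion time series and a rolling summary across adjacent frames.
import pandas as pd

frames = pd.DataFrame({
    "timestamp_s": [0.00, 0.04, 0.08, 0.12, 0.16],
    "happy":       [0.70, 0.68, 0.22, 0.15, 0.10],
    "sad":         [0.05, 0.06, 0.10, 0.12, 0.15],
    "angry":       [0.05, 0.06, 0.48, 0.55, 0.57],
    "neutral":     [0.20, 0.20, 0.20, 0.18, 0.18],
})

# Contextualize adjacent frames, e.g. with a short rolling mean over the per-frame scores.
smoothed = frames.set_index("timestamp_s").rolling(window=3, min_periods=1).mean()
print(smoothed)
```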

Conclusions

First, it is important to note that emotion prediction models do not measure actual emotion. Instead, they tell us what someone is expressing, based on labeled training images that encode social norms. We have, in fact, performed in-depth research to understand how social norms impact facial emotion prediction and generally found that there are relatively few cross-cultural effects. Nevertheless, for applications in healthcare and neuroscience in particular, it is important not to treat predictions as ground truth.

Second, companies in the healthcare and life sciences space require a more nuanced representation of emotion than the relatively small set of emotion categories that current off-the-shelf models predict. Toward this end, our next step is to improve the ability of the VIVO FER model to capture subtle facial differences that span more than one emotional category.

Finally, image-based algorithms alone are inadequate to capture the full spectrum of emotion. To augment the performance of our models, we are leveraging other signals captured by our video analytics platform to derive voice and text features that provide necessary context and help modulate results, particularly when a subject is speaking (i.e., when their face is not actively conveying an emotion).

Written by:
Michael Bell
VP, Product
Savannah Gosnell
Senior Data Scientist
Wen Chan
Clinical Data Scientist
Published On:
January 6, 2022