Many companies have begun storing every piece of data they can think of, hoping to uncover hidden insights about their customers, products, processes, and financial metrics. Data collection is the new gold rush, and every piece of data may be mined for useful information. However, the path from raw data to valuable insights isn’t always clear, leaving many companies collecting data without a plan for using it. So you have data. Now what?
Most conversations about AI and ML center on modeling and model improvements, sidestepping what is arguably the most important aspect of data science: feature engineering. If data is the raw ore, feature engineering is the process of shaping it into structured pieces that can then be assembled into a final modeled product. The shape and form of these features dictate the final assembled product. Likewise, creating new features out of the same data can highlight different patterns, fit into new models, and result in new discoveries. The form of the features needs to fit the function of the model and the business objective.
“Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.” ~Andrew Ng
The collected data provides the raw material and the business goal is the commissioned product; the design process between these two points is what requires the artistry. This is why feature engineering tends to be the most time-intensive step of any data science project: it’s both an art and a science.
In their purest form, ML features are mathematical transformations of the data: mixing, shuffling, and refining raw data in order to highlight underlying patterns and trends. This can take the form of representative statistics (like the mean or kurtosis of a variable), measures of change or variance over time from time series transformations, or more complex representations of the data built through convolutions. Clustering analysis and dimension reduction techniques can also be used to extract hidden features in the data, cull low-value features, and improve the performance of predictive models. As with any other craft, each of these tools has an intended function; knowing which to use and how to get the best result is the line between success and failure.
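To make these transformations concrete, here is a minimal sketch using only the Python standard library. The `readings` values are made up for illustration, and the helper names `rolling_mean` and `first_difference` are hypothetical, not taken from any particular library. It derives two summary statistics and two simple time series features from the same raw series.

```python
import statistics


def rolling_mean(series, window):
    """Mean of each consecutive run of `window` values."""
    return [
        statistics.fmean(series[i - window + 1 : i + 1])
        for i in range(window - 1, len(series))
    ]


def first_difference(series):
    """Change between each observation and the one before it."""
    return [b - a for a, b in zip(series, series[1:])]


# Hypothetical daily readings, made up for illustration.
readings = [10.0, 12.0, 11.0, 15.0, 14.0, 18.0]

features = {
    "mean": statistics.fmean(readings),           # overall level
    "variance": statistics.pvariance(readings),   # overall spread
    "rolling_mean_3": rolling_mean(readings, 3),  # smoothed trend
    "first_difference": first_difference(readings),  # step-to-step change
}

for name, value in features.items():
    print(name, value)
```

Each entry describes the same six numbers from a different angle: the mean and variance compress the series into single values, while the rolling mean and first difference preserve how it moves over time.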
For this reason, collaboration between data scientists and domain experts is essential during feature creation. Domain experts can suggest hypothesis-driven features that tend to improve signal over a broad set of less curated features. They can also troubleshoot feature performance and help adjust the underlying hypotheses to iteratively improve performance. This collaboration is also valuable early in a company's life cycle, when it is beginning to think about what data to collect in order to provide the best path to successfully answering the key objective.
Here lies one of the largest challenges within data science: the original data may contain amazing insights... which may be completely missed because you chose the wrong features.
For instance, take the simple question of whether two distributions or populations are different. We could look at the mean, median, percentiles, standard deviation, kurtosis, number of outliers, skewness, etc. Each of these features highlights a different aspect of the data, presenting it in a new light. The objective of the ML problem will determine which view is most useful, with different objectives supported by different features. Scaling up the complexity of the problem, for instance to identifying a disease state from ECG and EEG patterns, drastically enlarges the feature engineering space, making it more challenging to find the features that highlight signal while minimizing noise.
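The two-population comparison can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a recommended test procedure: the two samples are synthetic, drawn from normal distributions with the same center and different spread, and `summary_features` is a hypothetical helper written for this example.

```python
import random
import statistics


def summary_features(values):
    """Return a few summary statistics that could each serve as a feature."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    # Standardized third and fourth central moments.
    skewness = sum(((v - mean) / std) ** 3 for v in values) / n
    excess_kurtosis = sum(((v - mean) / std) ** 4 for v in values) / n - 3.0
    return {
        "mean": mean,
        "median": statistics.median(values),
        "std": std,
        "skewness": skewness,
        "excess_kurtosis": excess_kurtosis,
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Two synthetic populations: same center, different spread (assumed data).
narrow = [rng.gauss(0.0, 1.0) for _ in range(5000)]
wide = [rng.gauss(0.0, 3.0) for _ in range(5000)]

narrow_features = summary_features(narrow)
wide_features = summary_features(wide)

# Print the two feature sets side by side.
for name in narrow_features:
    print(f"{name:>16}: {narrow_features[name]:8.3f} {wide_features[name]:8.3f}")
```

In this setup the mean and median are nearly identical for both samples, so a model fed only those features would see no difference; the standard deviation is the feature that separates them. With other data, skewness or kurtosis might be the view that matters.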
Feature engineering is the most important part of real-world data science and also one of the most time-consuming. With the right transformations, the same raw data can answer multiple questions. When considering your AI/ML strategy, it is vital to budget time and resources for the feature engineering step and to set expectations for the desired, best-case, and worst-case outcomes.