21.01.2020 | Felix Kaus

A peek into Automated Machine Learning

A typical Machine Learning (ML) problem is about prediction, such as in Forecasting for data-driven decision making or Credit Risk Analysis Using Machine Learning. If the predicted value is a real number, we use regression; for discrete labels (like TRUE and FALSE or WARM and COLD), we use classification. In order to train a machine to accomplish this task, we structure the data into X and Y as follows:

X consists of a set of features x1,…,xn. Y is the label. Assuming X carries information on predicting Y, we need a model that uses this information to correctly fill in Y. For training purposes, we need to prepare a set of data with both X and Y; at prediction time, we usually only have X. Learning from such labeled data is called supervised learning. However, before a model can be deployed in a real use case, there are questions to be addressed:

  1. How to properly represent the data as X and Y? (Data Preparation)
  2. Which features maximize prediction quality? (Feature Engineering)
  3. Which algorithm is suitable for model training? (Model Selection)
  4. How to adjust the model? (Hyperparameter Tuning)

Although Machine Learning is about automation, the points above depend on expert knowledge and can be time-consuming. It is therefore crucial to automate as many of them as possible along the process. This is where AutoML becomes relevant: automating the learning process for Machine Learning itself.

The benefits of using AutoML solutions include freeing up time otherwise spent on developing and testing models for strategic tasks, reducing human bias and errors, and ultimately advancing organizations by improving models and optimizing data-driven processes. The main parts of this article are based on the work of He, Zhao et al. (2019). Axel de Romblay has provided a figure that displays the areas to automate:

The figure shows the core Machine Learning process and covers the four questions raised above when applying supervised learning. We now take a deeper look into each automation task separately.

Data Preprocessing

Data Preprocessing is usually the first step in a Machine Learning pipeline and comprises data collection and data cleaning.

From a practical point of view, it is hard to imagine that data collection will ever be fully automated, since data sources and requirements are very heterogeneous. The first step usually consists of setting up a data interface after locating the data. This interface could tap into a data lake, a data warehouse, SQL or NoSQL databases, web scraping, or simply access raw data locally or in the cloud. However, AutoML tools can assist the Data Scientist with data augmentation, pseudo-labeling (a semi-supervised learning approach in which one part of the data is labeled by humans and the other by a machine), and label balancing.

If the data cleaning problem can be solved via statistical methods such as aggregation, normalization, scaling, or filling in missing entries with averages, or via other tasks such as one-hot encoding, it can (and should) be done automatically. In practice, packages enable the automated encoding of categorical variables, missing-data encoding, handling of dates, and so on.
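As an illustration of such automated cleaning (independent of any specific AutoML library), here is a minimal sketch using scikit-learn; the DataFrame and its column names age and embarked are hypothetical:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data with missing entries in both columns
df = pd.DataFrame({"age": [22.0, None, 38.0],
                   "embarked": ["S", "C", None]})

# Numeric columns: fill missing entries with the average, then scale
numeric = Pipeline([("impute", SimpleImputer(strategy="mean")),
                    ("scale", StandardScaler())])

# Categorical columns: fill missing entries, then one-hot encode
categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                        ("onehot", OneHotEncoder(handle_unknown="ignore"))])

preprocess = ColumnTransformer([("num", numeric, ["age"]),
                                ("cat", categorical, ["embarked"])])

X = preprocess.fit_transform(df)  # cleaned, fully numeric feature matrix

Once wrapped in a pipeline like this, the same cleaning steps are applied automatically to any new batch of data.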

Feature Engineering

This step is crucial in the pipeline because it finalizes the training and testing data. The overall goal is to provide the model with the information it needs while keeping the dimensions and overall size of the data as small as possible. It involves selecting appropriate features that contribute to learning a model that represents reality as closely as possible. It might therefore also be useful to construct new features based on the available data. Rather than adding new features via construction, we might also consider altering the feature space (Feature Extraction). AutoML can contribute to these three sub-tasks as follows (a short sketch follows the list):

  • Feature Selection: remove feature redundancy via correlation coefficients or by checking whether a variable is non-stable (i.e. the data distribution of a feature drifts apart between the training and testing dataset)
  • Feature Construction: create new features from the given feature space with the help of simple mathematical operations such as min, max, negation, addition, etc.
  • Feature Extraction: alter the original features with advanced techniques for dimension reduction such as Principal Component Analysis or Multidimensional Scaling
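To make the selection, construction, and extraction steps concrete, here is a minimal sketch using pandas and scikit-learn on the built-in wine dataset; the correlation threshold of 0.95 and the constructed ratio feature are illustrative assumptions:

import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA

data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Feature Selection: drop one of every pair of highly correlated features
corr = df.corr().abs()
drop = [col for i, col in enumerate(corr.columns)
        if any(corr.iloc[:i][col] > 0.95)]
reduced = df.drop(columns=drop)

# Feature Construction: build a new feature via simple math on existing ones
reduced["flavanoid_ratio"] = df["flavanoids"] / df["total_phenols"]

# Feature Extraction: project onto a lower-dimensional space via PCA
components = PCA(n_components=3).fit_transform(reduced)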

Model Selection and Hyperparameter Tuning

The problem of finding a suitable model and determining its hyperparameters is at the core of Machine Learning. When I ran my first experiments five years ago, I used a method called Grid Search to train multiple models with various configurations on a small dataset. The idea was to exhaustively try different models with all available parameter combinations, score each of them, and pick the best model-parameter combination.

This approach has a few problems. Since there are multiple parameters per model and some of them are real-valued, the number of combinations is infinite. A Data Scientist therefore needs experience and an understanding of all the applied algorithms and their characteristics to predefine the parameter space. But even then, human intuition might not lead to the desired outcome. There is also the problem of computational efficiency: even with small datasets, this can be a costly experiment. Let us assume we have 2 possible algorithms with three parameters each, where each parameter has 4 variations. Exhaustively, we train

2 ⋅ 4 ⋅ 4 ⋅ 4 = 128

models. As one can imagine, using Grid Search on Big Data problems quickly becomes computationally infeasible.
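As an illustration, here is a minimal sketch of this exhaustive approach using scikit-learn's GridSearchCV; the two algorithms and the parameter values are illustrative assumptions, chosen so that exactly 2 ⋅ 4 ⋅ 4 ⋅ 4 = 128 configurations are evaluated:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Two algorithms, three hyperparameters each, four values per parameter:
# 2 * 4 * 4 * 4 = 128 candidate configurations in total
candidates = [
    (RandomForestClassifier(), {"n_estimators": [10, 50, 100, 200],
                                "max_depth": [2, 4, 8, 16],
                                "min_samples_leaf": [1, 2, 4, 8]}),
    (DecisionTreeClassifier(), {"max_depth": [2, 4, 8, 16],
                                "min_samples_leaf": [1, 2, 4, 8],
                                "min_samples_split": [2, 4, 8, 16]}),
]

best_score, best_model = -1.0, None
for estimator, grid in candidates:
    search = GridSearchCV(estimator, grid, cv=3)  # exhaustive search
    search.fit(X, y)
    if search.best_score_ > best_score:
        best_score, best_model = search.best_score_, search.best_estimator_

print(best_model, best_score)

Every single one of the 128 configurations is trained and cross-validated, which is exactly the cost the methods below try to avoid.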

He, Zhao et al. (2019) suggest various ways to solve this by recasting the sub-process as an optimization or Machine Learning problem in its own right (effectively adding another layer of Machine Learning to the Machine Learning platform):

  • Reinforcement Learning
  • Evolutionary Optimization
  • Gradient Descent
  • Bayesian Optimization

Reinforcement Learning is not based on an already established solution; instead, the Machine Learning infrastructure itself builds a cost function for the decision via rewards and punishments. Evolutionary Optimization and Gradient Descent proceed in gradual steps: with each step, the method tries to improve on the previous one. In the best case, we find a global optimum, although this is not guaranteed; in the worst case, we get stuck in a local optimum that does not provide good results. The most widespread method is based on Bayesian Optimization: rather than exhaustively searching for the solution, the optimizer narrows down the search space via probabilities.
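To illustrate the last point, here is a minimal sketch of Bayesian hyperparameter optimization; the scikit-optimize library and the chosen search space are assumptions on our side, not prescribed by the paper:

from skopt import gp_minimize
from skopt.space import Integer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Search space: two hyperparameters of a random forest
space = [Integer(10, 200, name="n_estimators"),
         Integer(2, 16, name="max_depth")]

def objective(params):
    n_estimators, max_depth = params
    model = RandomForestClassifier(n_estimators=n_estimators,
                                   max_depth=max_depth, random_state=0)
    # gp_minimize minimizes, so return the negative accuracy
    return -cross_val_score(model, X, y, cv=3).mean()

# A Gaussian-process surrogate proposes promising configurations
# instead of exhaustively enumerating a grid
result = gp_minimize(objective, space, n_calls=20, random_state=0)
print(result.x, -result.fun)

Instead of 128 exhaustive fits, the optimizer spends a fixed budget of 20 evaluations on the configurations where its surrogate model expects the most improvement.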

Furthermore, the paper shows how a system can even generate its own model automatically by combining a set of primitive operations such as concatenation or elementary math. The main idea is to construct and develop a model via Neural Architecture Search, which includes Hyperparameter Optimization and Network Evaluation techniques. This goes beyond the scope of this article, as we focus on traditional models.

Combining AutoML with Python

To conclude, we showcase part of the AutoML functionality using Python libraries such as auto_ml or mlbox. Both libraries can be installed into your Python environment via pip:

!pip install auto_ml

!pip install mlbox

The following script shows how to apply the mlbox library to the well-known Titanic dataset from Kaggle. It covers the whole pipeline, from reading in the data and removing unstable features to searching for the best model-parameter combination and predicting new instances at the end:
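A minimal sketch along the lines of the mlbox documentation, assuming the Kaggle files train.csv and test.csv lie in the working directory and Survived is the target column; the search space values are illustrative:

from mlbox.preprocessing import Reader, Drift_thresholder
from mlbox.optimisation import Optimiser
from mlbox.prediction import Predictor

# Read the raw Kaggle files and split them into train and test structures
data = Reader(sep=",").train_test_split(["train.csv", "test.csv"], "Survived")

# Remove features whose distribution drifts apart between train and test
data = Drift_thresholder().fit_transform(data)

# Search space: estimator strategy and depth (illustrative values)
space = {"est__strategy": {"search": "choice",
                           "space": ["RandomForest", "ExtraTrees", "LightGBM"]},
         "est__max_depth": {"search": "choice", "space": [5, 10, 15]}}

# Score candidate pipelines via cross-validation and keep the best one
best = Optimiser(scoring="accuracy", n_folds=5).optimise(space, data, max_evals=10)

# Fit the best pipeline on the training data and predict the test set
Predictor().fit_predict(best, data)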

We also want to point to the Azure Machine Learning platform, whose AutoML solution is based on the research of Fusi et al. (2018). Other services such as Google Cloud Platform or Amazon AWS also offer solutions to automate Machine Learning.

We have provided a peek into the exciting field of AutoML, and there are already plenty of solutions ready for use right now. As the demand for automated processes and data-driven solutions increases, we will see more of them in future services. For some tasks, such as data collection, it is unclear whether there will ever be an automated process that adequately covers the needs of businesses. For the rest of the pipeline, it is reasonable to assume that it will be automated in one way or another. Innovative approaches and a lot of hands-on work from Data Scientists and Software Engineers are required to further the vision of AutoML.

References

  • Fusi, N., Sheth, R., & Elibol, M. (2018). Probabilistic matrix factorization for automated machine learning. In: Advances in Neural Information Processing Systems (pp. 3348-3357).
  • He, X., Zhao, K., & Chu, X. (2019). AutoML: A Survey of the State-of-the-Art. arXiv preprint arXiv:1908.00709.
Felix Kaus

Felix is a Machine Learning and Natural Language Processing enthusiast. He studied Business Information Systems at Karlsruhe University of Applied Sciences and Philipps University Marburg and specialises as a Full-Stack Data Scientist.
