29 Jul 2018

Machine Learning for Beginners

Masterclass

Machine Learning: A Blessing or Curse for Science?

Machine Learning (ML), Artificial Intelligence (AI) and Deep Learning have all become buzzwords which people pass around without necessarily understanding what they mean. So what is behind all these terms? Artificial Intelligence can be any intelligence displayed by a machine as opposed to living beings. ML is a subdivision of AI, and Deep Learning is in turn a subdivision of ML. Deep Learning uses a special type of artificial neural network which is particularly “deep”, meaning that it has lots of layers.

Conceptually ML is nothing more than a statistical procedure that improves its own accuracy without the need for anyone to interfere. This is both a blessing and a curse, because the accuracy improvements can be hard or impossible to understand. This is especially dangerous for us as scientists who want to understand what we’re doing. Another criticism often raised is that Machine Learning is a hammer which makes everything sort of look like a nail. Due to the frenzy of a lot of hammer wielders and the black box nature of ML applications, you will encounter a lot of researchers who are opposed to using ML in our research. I have heard one professor call it “the worst thing that ever happened to science”, due to how quickly accurate results can be achieved without the need to understand the underlying theory or even how the prediction works. As with any powerful tool, we have to use it wisely and understand it thoroughly in order to make proper use of it. So let’s try to understand what the whole craze about ML is.

What is Machine Learning?

The term Machine Learning was coined in 1959 by Arthur Samuel, an IBM computer scientist and pioneer in the field of Artificial Intelligence. The concept itself is quite old, but only the recent emergence of powerful computers has brought about the coming of age of ML applications. Since its conception ML has diverged from the frequentist (observing and counting) approach of statistics to a more computer science grounded trial and error procedure. With growing computing prowess there is no need anymore to think carefully about which parameters might be optimal for a certain problem: we can just try out all of them and see which one does best. This is partly why ML is criticised in science. We don’t have to think about how fast, for example, certain populations migrate; we can simply try out all possible migration rates and see which describes nature best. ML can make you a pretty complacent scientist.

However mysterious it might be to the beholder, ML is no black magic. ML is an information lever: it can generate a lot of knowledge from very little knowledge. The more information you put in, the more information ML will give you, which is an important concept to remember when deciding what to feed into the ML. The goal of any ML approach is to make a prediction. Whatever flavour you use, the principle is always to learn underlying structures from as much data as possible (a process called training) and apply that knowledge to new unseen data (a process called generalisation). The data used in training can be labelled so that the algorithm can learn the specific attributes of a class, that is, of all samples sharing the same label. This is called supervised ML. The unsupervised counterpart deals with unlabelled data and tries to understand the underlying principles of how the data was generated. Sticking with the concept of ML as a lever, if there are labels available for your data, they should always be used. In line with this concept there is semi-supervised learning, in which some of the data is labelled and some is not. It is used so that the information contained in the few labels is not lost to unsupervised learning.

What goes into Machine Learning?

Data is often fed to machine learning in spreadsheets. Some problems are already formulated in spreadsheets (think finance), but in problems like image recognition some thought is necessary to put the problem into a table. The process of extracting features into a form ML can understand is called feature extraction and can be quite demanding. There are different types of data to consider: binary (True and False, often expressed as 0 and 1), count (1, 2, 5), categorical (red, white, blue) and real-valued (0.3, 1.4, 6.7) data, which don’t necessarily mix well. Data also often has to be scaled so that the different columns are comparable. If one feature is measured in cm and another in km, the cm column is likely to have numerically much larger variation and therefore contributes much more to decision making. Scaling can easily be achieved for both columns by making the biggest value 1 and the smallest 0 and modifying the values in between accordingly (min-max scaling). There are numerous scalers you can apply to your data.
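To make min-max scaling concrete, here is a minimal sketch in plain Python. The function name min_max_scale and the two example columns are made up for illustration; mapping a constant column to 0 is an assumption of this sketch, not a rule from the text.

```python
def min_max_scale(column):
    """Rescale a list of numbers so the smallest becomes 0 and the largest 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant column has no variation, so map it to 0.0
        # instead of dividing by zero.
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


# Hypothetical measurements: one feature in cm, another in km.
height_cm = [150.0, 160.0, 170.0, 190.0]
distance_km = [1.0, 2.0, 3.0, 5.0]

scaled_height = min_max_scale(height_cm)
scaled_distance = min_max_scale(distance_km)
print(scaled_height)
print(scaled_distance)
```

After scaling, both columns live on the same 0 to 1 range, so neither dominates just because of its unit.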

The resulting table is usually organised so that rows are samples of your data, whereas columns are features, which are also called dimensions. You will encounter people speaking with dread of high-dimensional data. High-dimensional data is nothing more than a spreadsheet with a lot of columns, and the dread is caused by something called the curse of dimensionality. This ominous phrase refers to the common problem of data becoming sparse in high dimensions: with every added column the same number of samples is spread over a much larger space, and in many tables only a few values are non-zero, which complicates analysis. With growing dimensionality, data analysis quickly becomes a needle in a haystack problem.

What can Machine Learning do for you?

Overall, supervised and unsupervised learning are the two flavours in which ML comes, and each is used for two basic purposes:

Supervised Machine Learning

Classification divides data into different groups. In training data all samples with the same label constitute one class. The algorithm learns the specific characteristics of one class by looking at the underlying data. New unseen data is then divided into the classes seen in training. A typical classification application is spam filtering, where the two labels are spam and non-spam. The border between classes which is used to group new data is called the decision boundary.
Regression learns the relationship between variables in training and applies this knowledge to new unseen values. Instead of a class it predicts a continuous number. For example, the input can be time points and the value to predict can be the net worth of a company, so that the model can estimate the net worth at a time point it has not seen. This is often used in finance to do market predictions.
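As an illustration of classification, here is a minimal sketch of a nearest-centroid classifier in plain Python. This is just one simple classifier chosen for brevity, not a method named in the text, and the two features per e-mail (number of links, number of exclamation marks) as well as all values are invented for the example.

```python
def train_centroids(samples, labels):
    """Training: compute the mean feature vector (centroid) of each class."""
    sums, counts = {}, {}
    for features, label in zip(samples, labels):
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def predict(centroids, features):
    """Generalisation: assign the label of the closest class centroid."""
    def squared_distance(centroid):
        return sum((c - f) ** 2 for c, f in zip(centroid, features))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Hypothetical features per e-mail: [number of links, number of exclamation marks]
train_x = [[8, 9], [7, 6], [9, 7], [0, 1], [1, 0], [2, 1]]
train_y = ["spam", "spam", "spam", "non-spam", "non-spam", "non-spam"]

centroids = train_centroids(train_x, train_y)
print(predict(centroids, [6, 8]))
print(predict(centroids, [1, 2]))
```

For this particular classifier the decision boundary is the set of points that are equally far from both centroids.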

Unsupervised Machine Learning

Clustering is similar to classification in that it tries to group data by looking at the relationship between samples. However, it operates without the information given to classification through the presence of labels. Clustering can be used to assign labels to every sample in a given cluster. These can in turn be used for supervised ML.
Dimensionality Reduction is a measure that can reduce the burden of having very high-dimensional data. Dimensionality reduction tries to capture the information contained in high-dimensional space as well as possible in lower-dimensional space. In practice this means it boils your table with 60000 columns down to 50 columns without losing a lot of information. This not only makes everything quicker, but can also reduce noise in the data-set.
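To give a feel for clustering, here is a minimal sketch of k-means, one common clustering algorithm (the text does not name a specific one). The points are invented, and starting from two hand-picked centroids is a simplification of this sketch; real implementations usually choose the starting centroids more carefully.

```python
def nearest(point, centroids):
    """Index of the centroid closest to a 2-D point."""
    distances = [(point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2
                 for c in centroids]
    return distances.index(min(distances))


def k_means(points, initial_centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    centroids = list(initial_centroids)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        assignment = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    # Final cluster index for every point, usable as a label afterwards.
    return centroids, [nearest(p, centroids) for p in points]


# Hypothetical unlabelled samples forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 2.0), (8.0, 8.0), (9.0, 9.0), (8.0, 9.0)]
centroids, cluster_labels = k_means(points, [points[0], points[3]])
print(cluster_labels)
```

The returned cluster indices are exactly the kind of labels that could then be handed to a supervised method.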

Noise is everything in the data which doesn’t help to solve the problem at hand, and it can be a big issue. Fitting an ML model too closely to the noise specific to your data-set will make it unlikely to perform well on unseen data, a problem called overfitting. Essentially your model has too many parameters and is so finely tuned to your specific set of observations that every new observation with new unseen noise will throw it off. The opposite is an underfitted model, where the underlying structure of the data is not captured: the model was not fit well enough and probably has too few parameters.

Does it work?

To get a good estimate of how well the model fits unseen data, the whole data-set is often split into a training and a testing part. If you used the whole data-set to fit a model, you would have no clue how it will react to new data. You could check performance on the data used for training, but that will lead to an overly optimistic impression of the quality of the model. A typical split is 75% training data on which the model is fit and 25% test data on which performance is measured. If there is not a lot of data, you might not want to withhold 25% of it from training just for quality testing. In this case k-fold cross-validation is often used. Here the data is split into k subsets of equal size (called folds), of which k-1 are used for training and 1 for testing. The procedure cycles through the folds until every fold has been used for testing once (which takes k rounds). Then performance can be averaged over all rounds to get a reasonable estimate of how good the model is.
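The two splitting schemes can be sketched in a few lines of plain Python. The function names are made up for this example, and the sketch deliberately does not shuffle the data; in practice the samples are usually shuffled first so that the order of the table does not influence the split.

```python
def train_test_split(samples, test_fraction=0.25):
    """Hold out the last `test_fraction` of the samples for testing."""
    n_test = int(len(samples) * test_fraction)
    cut = len(samples) - n_test
    return samples[:cut], samples[cut:]


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    # Every k-th sample goes into the same fold, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = sorted(idx for j, fold in enumerate(folds) if j != i
                       for idx in fold)
        yield train, test_fold


data = list(range(12))  # stand-in for 12 samples
train, test = train_test_split(data)
print(len(train), len(test))

for train_idx, test_idx in k_fold_indices(len(data), k=4):
    # Fit the model on train_idx and score it on test_idx here,
    # then average the k scores afterwards.
    print(test_idx)
```

Every sample ends up in the test fold exactly once, which is what makes the averaged score a fair estimate.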

So how is performance measured? For classification the simplest scorer is accuracy, which is the fraction of labels that are assigned correctly. However, for unbalanced class problems like cancer detection, where 99% of the labels are no cancer, accuracy quickly becomes useless. Any predictor which just says no cancer every time will achieve 99% accuracy but is useless as a model. Therefore multiple scorers have been designed, and which one is applicable depends on the underlying problem. To understand ML performance scorers one first has to understand the concepts of True Positive, False Positive, True Negative and False Negative:

True Positives (TP), where the sample is predicted as a certain label by the model and truly has this label.
True Negatives (TN), where the sample is not predicted as a certain label and truly does not have this label.
False Positives (FP), where the sample is predicted as a certain label, but does not truly have that label.
False Negatives (FN), where the sample is not predicted as a certain label, but truly has that label.

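The concept is easier to understand in a table:

|                        | Truly has the label | Truly does not have the label |
|------------------------|---------------------|-------------------------------|
| Predicted as label     | True Positive (TP)  | False Positive (FP)           |
| Not predicted as label | False Negative (FN) | True Negative (TN)            |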

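A perfect classifier would have only TP and TN and we would be at the end of optimisation. However, this is rarely the case in reality, and trying to increase one desirable number like TP often ends up also increasing FP. So different trade-offs have to be calibrated carefully to achieve an optimal score for your problem. There are multiple scorers available to be optimised:

Accuracy, the fraction of all labels that are assigned correctly: (TP + TN) / (TP + TN + FP + FN).
Precision, the fraction of assigned labels that are correct: TP / (TP + FP).
Recall, the fraction of truly labelled samples that are found: TP / (TP + FN).
F1, the harmonic mean of Precision and Recall: 2 × Precision × Recall / (Precision + Recall).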

There is not one optimal scorer for every task. Precision, for example, should be measured when you want to be sure that, of all assigned labels, most are correct. If you’re looking at algorithms that suggest new videos based on your taste, like the one applied on YouTube, you’d want to be very sure of your recommendation once you make it (TP). There are enough videos to draw from, so you don’t care if you do not catch all appropriate recommendations, but you do care that the ones you pick are good suggestions. On the contrary, when trying to predict cancer, recall is the most important scorer, because you’re trying to minimise undetected cancer (FN). You’d rather have more False Positives who then get thoroughly checked for cancer than miss any patients with cancer. For every single problem that is thrown at ML you have to think carefully about which scorer you want to optimise. If you do not know whether to go for Precision or Recall, F1 can be a good scorer, as it is the harmonic mean of both.
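A minimal sketch of how these scorers can be computed, using their standard definitions and the unbalanced cancer example from the text. The function names are made up, and returning 0 when a scorer is undefined (for example Precision when nothing is predicted positive) is a convention assumed for this sketch.

```python
def confusion_counts(true_labels, predicted_labels, positive):
    """Count TP, FP, TN and FN with respect to one label of interest."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == positive and truth == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def scores(tp, fp, tn, fn):
    """Accuracy, Precision, Recall and F1 from the four counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Assumed convention: an undefined scorer (division by zero) is reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# 100 hypothetical patients, 1 of whom has cancer.
truth = ["cancer"] * 1 + ["no cancer"] * 99
# A useless "model" that always answers no cancer.
always_negative = ["no cancer"] * 100

result = scores(*confusion_counts(truth, always_negative, positive="cancer"))
print(result)
```

The always-negative predictor reaches 99% accuracy, but its Recall and F1 are 0, which exposes it as useless for finding cancer.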

How do I do it?

At the beginning of any ML workflow we stand with our hammer which is ML and we are trying to see whether our problem is truly a nail. There is a lot to consider here:

Does the problem fall into the basic categories of ML applications (Regression, Classification, Clustering or Dimensionality Reduction)?
Can I express my problem in a way that can be captured in features (can I express my problem in a spreadsheet)?
Do I have enough data to give an ML model the chance to learn enough to perform well? Unfortunately there is no one answer to whether you have enough data. It depends heavily on how many dimensions your data has and on how much random noise there is. Generally speaking, you need more training data when dealing with high dimensions and/or lots of noise.
Are there similar undertakings which have used ML? Why not benefit from the wisdom of those who came before you in employing novel techniques?
Do I have enough computational power to perform ML? A lot of punch is necessary to handle the wealth of data an ML algorithm needs. Depending on your specific model, the calculations can also be quite computationally expensive. Neural networks, for example, require a lot of training and a high number of CPUs or GPUs to be built. If you only have access to a laptop, ML might not be a viable option.

None of these questions is necessarily easy to answer, and often only experience can reliably tell you the answer, which doesn’t make the start any easier. There is no shame, however, in bothering someone with more experience or consulting forums for their wisdom (I’m also always happy to help at nicolas.arning@bdi.ox.ac.uk). You will find that the ML community is actually quite helpful, even to someone who has no experience whatsoever.

Once ML has been chosen as a viable method, features need to be extracted to be fed into a model. It is a good idea to start off with exploratory data analysis, which is just a general term for looking at your data. For example, it is a good idea to plot the data in 2-dimensional or 3-dimensional space and colour by label to see how well the labels separate. The better the separation, the more likely a classification algorithm is to succeed. If your data is high-dimensional, the plotting can be preceded by a dimensionality reduction to try to capture the information in 2- or 3-dimensional space. You can also look at how sparse your data is by calculating the fraction of non-zero values. Generally, the less sparse your data is (so the more non-zero values it has), the better an ML model performs. There are also certain ML models which can handle sparsity better than others.
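Checking sparsity takes only a few lines. The sketch below assumes the feature table is a list of rows; the function name and the values are invented for illustration.

```python
def non_zero_fraction(table):
    """Fraction of entries in a table (list of rows) that are not zero."""
    values = [value for row in table for value in row]
    return sum(1 for value in values if value != 0) / len(values)


# Hypothetical feature table: 3 samples (rows) with 4 features (columns).
table = [
    [0, 0, 3, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 2],
]

density = non_zero_fraction(table)  # share of non-zero entries
sparsity = 1 - density              # share of zero entries
print(density, sparsity)
```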

After you have explored your data, an ML model has to be fitted to it. This is where the training data comes into play. Optimisation includes trying different parameters and checking which one results in the highest performance (as measured by scorers). The appropriate scorers could be taken from similar problems or from your own reasoning. If you don’t really know which scorer to use, just use all of them. Then you can see whether one parameter scores best in most or all of them. You can use the test data (or a k-fold cross-validation) to gauge how good your model is. If you don’t know which model to use, you can look in the literature to see what kind of model has been applied to similar problems. However, if you have enough time and computational power, you might just want to try all of them and see which one performs best. Next week we will take a look at which models are out there, what the theory behind them is and how to apply them in Python. Once the best classifier has been optimised and chosen, it is ready to be applied to new unseen data exhibiting the same features, which is the goal of ML.
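The "try all parameters and keep the best" idea can be sketched as follows. The model here is deliberately trivial (a single threshold on one feature), the data and candidate values are invented, and F1 is just one possible choice of scorer.

```python
def f1_score(truth, predicted):
    """F1 for boolean labels, where True is the label of interest."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    if tp == 0:
        return 0.0  # assumed convention when Precision or Recall is undefined or 0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def predict(values, threshold):
    """Toy model: predict True whenever the feature reaches the threshold."""
    return [value >= threshold for value in values]


# Hypothetical one-feature data-set, already split into training and test data.
train_x = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
train_y = [False, False, False, True, True, True]
test_x = [0.2, 0.5, 0.8]
test_y = [False, True, True]

# Try every candidate parameter on the training data and keep the best one.
candidates = [0.2, 0.5, 0.8]
train_scores = {t: f1_score(train_y, predict(train_x, t)) for t in candidates}
best_threshold = max(train_scores, key=train_scores.get)

# Only the chosen parameter is then judged on the held-out test data.
test_score = f1_score(test_y, predict(test_x, best_threshold))
print(best_threshold, test_score)
```

The test data plays no part in choosing the threshold, so the final score stays an honest estimate of performance on unseen data.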

Written by Nicolas Arning