Artificial Intelligence (AI) is extremely popular right now. The well-known AI researcher Andrew Ng even called it the new electricity, because in his opinion AI will transform countless industries, just like electricity did in the past. Although we have not fully reached this point yet, we can already observe an increasing demand for AI experts. I see, for instance, far more job openings in this field today than I did five years ago. To prepare people for this future, Andrew Ng recently released a whole new series of online courses to train more experts in the field. More precisely, these courses focus on a subfield of AI called Machine Learning (ML), which has had a lot of success during the last couple of years and is therefore probably its most famous subfield at the moment (the courses are actually about Deep Learning, a particular approach to Machine Learning, but I do not want to go into too much detail about that here). But what actually is Machine Learning? Well, Arthur Samuel, a famous pioneer in AI, defined it as follows (in 1959):
Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed.
So, the computer learns to solve problems by itself. Where is this useful? Well, it turns out that you probably encounter learning algorithms every day without even noticing it. Every time you use a search engine like Google or Bing, for instance, a learning algorithm helps to rank the web pages that are relevant to your search query. Every time you use a voice recognition system like Siri or OK Google to control your smartphone, learning algorithms process your voice commands. These are just two examples of where Machine Learning is used, and you can find more and more of them nowadays. But why do we actually use learning algorithms for all this? Well, it turns out that there are various reasons, like the following ones:
- Through the web and increasing automation we get larger and larger datasets, which would be cumbersome to analyze by hand.
- Some applications, like autonomous helicopters or voice recognition systems, are too difficult to program by hand.
- Programs like recommendation systems (e.g. in Netflix) need to be customized for each user, which is cumbersome if you have many users.
Machine Learning can help in all these cases. But how can it learn by itself? What is it, in detail? Well, when we talk about Machine Learning, we are not talking about one particular algorithm. It rather consists of a whole bunch of different algorithms, which can be divided into subcategories as shown in the following image.
Supervised Learning
In supervised learning we want the learning algorithm to give us a certain prediction when we give it some data. Depending on the kind of prediction, we call this process regression or classification. However, before the algorithm is able to predict, we first need to teach it how to do that. To do so, we show it historical data together with the correct predictions, so that it can learn by example, over time, how to predict by itself. This process is called training. It might still sound a little abstract right now, but I hope it will become clearer with some examples. In the following, I also want to talk more about the differences between regression and classification.
Regression
Instead of beginning with a formal explanation of regression, let us start right away with an example. I hope this makes it easier to grasp what regression is all about. Imagine you work in the housing market and, to improve your accounting, you want to know which future revenues you can expect, i.e. you want to know at which prices you will most likely be able to sell your houses. How could you do that? Well, you are probably not a psychic! All you have is some historical data about the houses that you have already sold. This data probably consists of some information about the properties of the houses, like the size (in square feet) or the number of bedrooms, and the price for which you were able to sell them. In ML terminology we call the house properties features and the prices targets.

So, how can we predict the prices now? Maybe we can use our historical data for this! Since we could sell larger houses at higher prices in the past, we can probably expect the same for the future. So, could it be that there is some relationship between our features and our targets? Well, to better answer that question, let us first visualize our data. We could use a coordinate system for this, where the target and each of our features are represented by their own dimension. Unfortunately, we can only use one of our features here, because otherwise we would have more than two dimensions, which would make the data hard to visualize (a visualization with three dimensions is possible, but it might be harder to recognize relationships in the data; more than three dimensions is usually impossible for humans to see). Since we only need a visualization to explain the concept of regression, let us just pick one of our features for now; we will later see a technique to visualize the data while keeping all of our desired features. So let us take the house size, for instance. Then we can simply fill our data into the coordinate system, and as a result we would get something like the following.
As you can see, the data points (our houses) are not just randomly distributed. There really seems to be some kind of relationship between the price and the house size. We could imagine a linear relationship (blue line) or maybe even a non-linear one (pink line), both of which could be expressed as a function. The function might not fit our data perfectly (i.e. our data points do not always lie exactly on the line), as you can also see in the coordinate system above, but it still comes pretty close, so this is usually not a problem in practice. Actually, it can sometimes even be harmful to have a function that fits all of our data points exactly (it would be a highly non-linear function!), since there could be noise in the data. Wait! What is noise? Well, unfortunately I cannot come up with a good example for our housing market problem, but when you deal with robots, for instance, you can think of noise as inaccuracies in your sensor inputs, and we do not want to fit these, since they are probably specific to one sensor and not characteristic of all sensors (different sensors probably have different inaccuracies). There is probably something similar in our housing data. However, I do not want to go into more detail here. If you want to know more about which kind of function you should choose to fit your data, I recommend the Machine Learning course from Andrew Ng, which in my opinion is a good introduction to the field (maybe I will also write a separate blog post about this topic later). Just remember: we want to find a function that fits our data as well as possible, but without fitting any noise at the same time. Anyway, let us not worry too much about that yet and simply choose the linear function for now. The pink function might be even better, but for simplicity let us stick with the blue one. Mathematically, it looks like the following:
price = m · size + n

How can we use this function now to predict future house prices? Can we just use it right away to calculate the expected price of a new house? Well, let us take a closer look. The size is a property of our houses, so we have that. What we are still missing, however, are the parameters m and n (in ML terminology called weights). Only if we have these two can we use the function for future predictions of the price. Of course, the calculated price might not be perfect because of the noise in the real world. However, if the data more or less contains a linear relationship (which we assume here), it will probably still be pretty close. But again, for this we need m and n! How can we get them? It turns out that we can calculate them mathematically with the help of the prices and the sizes of our historical data! But I do not want to go into too much detail here already, since that would go far beyond an introduction to the field. Instead, I will cover this in future blog posts. If you already want to know more about it now, I can refer you to the Machine Learning course from Andrew Ng again.
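To make this a little more concrete, here is a minimal sketch in Python (the numbers are made up and NumPy's polyfit is just one of many possible ways to fit a straight line) of how m and n could be obtained from historical data and then used for a prediction:

```python
import numpy as np

# Made-up historical data: house sizes in square feet and the prices they sold for.
sizes = np.array([850, 1200, 1500, 1800, 2100, 2500], dtype=float)
prices = np.array([95000, 138000, 172000, 199000, 235000, 279000], dtype=float)

# Fit the straight line price = m * size + n to the historical data.
# np.polyfit returns the weights of the best-fitting polynomial (degree 1 = a line).
m, n = np.polyfit(sizes, prices, deg=1)

# Use the learned weights to predict the price of a new, unseen house.
new_size = 1650
predicted_price = m * new_size + n
print(f"Predicted price for {new_size} sq ft: {predicted_price:.0f}")
```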
As a final remark, I want to mention a more practical application where regression is used: the detection of facial keypoints (e.g. the nose tip or the mouth corners), which is itself needed for a whole bunch of higher-level applications like the following ones:
- Tracking faces in videos
- Analyzing facial expressions
- Detecting abnormal facial signs for medical diagnosis
- Face recognition in surveillance systems
Although it might not seem so at first glance, we can actually use the same approach here again that we already used for our housing market problem. We only have different features and targets, of course. Our features could be the pixel values of the images, for instance, and our targets would be the pixel coordinates of the desired keypoints. However, since we have a lot more features and targets here, we cannot visualize our data anymore with our simple approach from above. Nevertheless, we can still find the prediction function mathematically in the same way, as sketched below. If you want to know more about it, I can refer you to the tutorial from Daniel Nouri for the facial keypoint detection challenge on Kaggle (a bit more advanced!).
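Just to illustrate that this is conceptually the same kind of regression, here is a small, purely hypothetical sketch: the images and keypoint coordinates below are random placeholder arrays, and plain linear regression from scikit-learn stands in for the much more powerful model (a neural network) used in Daniel Nouri's tutorial:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder data: in reality these would be grayscale face images and the
# (x, y) pixel coordinates of keypoints such as the nose tip and mouth corners.
n_images, height, width = 100, 96, 96
images = np.random.rand(n_images, height, width)   # fake images
keypoints = np.random.rand(n_images, 4) * width    # fake targets: 2 keypoints = 4 coordinates

# Features: the raw pixel values, flattened into one long vector per image.
X = images.reshape(n_images, -1)

# Several numerical targets per image, so this is multi-output regression,
# but conceptually the same as predicting a single house price.
model = LinearRegression()
model.fit(X, keypoints)

predicted_keypoints = model.predict(X[:1])  # predict the keypoints of the first image
```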
Classification
Classification is probably the most widely used Machine Learning technique. As the name already says, we try to classify our data here. To explain how it works, I want to use an example again. Let us say we want to classify tumors into benign and malignant ones, with the goal of initiating the appropriate medical treatment. How can we do that? Well, usually a doctor who is specialized in tumor detection would take a look at the data of the tumor, like its size or its growth rate for instance (in reality many more properties, of course), and then decide to which of the two classes it belongs. However, we would now like to automate this process. Maybe Machine Learning can help here, too? Well, let us see. First, we need to collect some historical data again, i.e. we need a collection of tumor cases for which we already know whether they are benign or malignant. Moreover, we also need some relevant tumor properties for each case. It probably makes the most sense to take the properties that are also important for the doctor's decision (e.g. tumor size, tumor growth rate). As a result, we again have our features (tumor size, tumor growth rate) and our targets (benign, malignant). However, there is one difference to regression now. Did you notice it? Our targets are a little different! For regression they were numerical values, but here the targets are categorical (the two classes benign and malignant). How should we deal with this? How should we find a function that fits our data now? Before we talk about that, let us first visualize the data again.
As before, the dimensions x1 and x2 represent our features here. However, since our targets are categorical now, it does not make much sense anymore to represent them as their own dimension as well. A better way is to use a different symbol for each class of our data points. We could, for instance, represent the class benign with a circle and the class malignant with a cross, as we did above. This way it is now possible to see how our classes are distributed in the coordinate system. What do we notice? Well, as we can see above, the classes are not randomly distributed. The circles are in one corner and the crosses are in another one. This actually makes sense, because we assume the cases within a class to have similar feature values, and as a result the data points of each class will probably cluster together somewhere, as they really do here. The only condition is that we chose good features that are relevant for the problem. Otherwise it could be that we do not get any clusters (i.e. circles and crosses are mixed), and it would also not make much sense to classify data with features that are irrelevant for the problem. Well, let us just assume for now that we have appropriate features for our tumor classification. But how can we find good features in general? It turns out that there is actually a whole bunch of methods to select the best possible features. If you want to know more about them, you can take a look here or here. Just one more remark: even if we find good features, we can still have at least a few outliers (e.g. a single circle inside the cross cluster) due to noise in our data. How to handle them goes beyond an introduction again, so let us leave them aside for now. Just keep in mind that they can appear.
Okay, now we have talked about our features and our targets. What is left? Oh, yeah! What about our prediction function now? It turns out that, since our targets are not represented as a dimension anymore, a function like the one for regression does not make sense here anymore. But if we have a new tumor case, how can we decide whether it is benign or malignant (i.e. whether it is a circle or a cross)? Well, we should probably look at which class cluster it is close to! If it is close to the circles, it should probably be a circle. If it is close to the crosses, it should probably be a cross. Hence, there must be some kind of boundary between the two clusters, and for our example above we have even already marked it in the coordinate system, as you can see. But how can we represent this boundary? It turns out that we can use a function for it as well, and similar to regression we also need to choose this function as well as possible with respect to our data. But instead of predicting the target value, it separates the different classes, and if we want to predict the class of a new case, we just need to look on which side of the boundary it is located. Usually we would do that "looking" mathematically, since with many features we cannot easily visualize our data anymore, as already mentioned, and hence we cannot simply look at it. But for our toy example it is still possible and, for instance, if we see that the new data point is on the right side of the boundary, we know it belongs to the class circle. If it is on the left side, we know it belongs to the class cross. But again, I do not want to go into any more detail here about how to do this mathematically or even how to find this boundary function (a small sketch follows below, though). If you want to know more about it, I can refer you to a whole bunch of resources, like the Machine Learning course again, this tutorial from Quoc Le, the course Neural Networks for Machine Learning from Geoffrey Hinton (a bit more advanced!) or the Neural Networks course from Hugo Larochelle (a bit more advanced!).
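As a small illustration of such a boundary function, here is a sketch with made-up tumor data; logistic regression from scikit-learn is used as just one possible choice of a (linear) boundary:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up historical tumor cases: [size, growth rate], labelled with
# 0 = benign (circles) and 1 = malignant (crosses).
X = np.array([[1.0, 0.2], [1.5, 0.3], [2.0, 0.4],    # benign cluster
              [4.0, 1.5], [4.5, 1.8], [5.0, 2.0]])   # malignant cluster
y = np.array([0, 0, 0, 1, 1, 1])

# Fit a linear boundary that separates the two clusters.
clf = LogisticRegression()
clf.fit(X, y)

# "Looking" on which side of the boundary a new case lies, done mathematically:
new_case = np.array([[2.2, 0.5]])
print(clf.predict(new_case))        # -> [0], i.e. benign
print(clf.predict_proba(new_case))  # how confident the model is about it
```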
Finally, I just want to show a more practical example again where classification is actually used. As I already mentioned at the beginning, it is probably the most widely used Machine Learning technique, and you can find it in a wide range of applications. One of them is, for instance, detecting objects in images. There are actually many ways to approach this problem. A simple one is to slide a search window over the image and at each position ask: Does this image patch show a cat? Does it show a banana? Does it show an orange? So, basically we have a separate classification problem for each image patch, where our target classes are cat, orange, banana etc., and as features we could again choose, for instance, the pixel values of the patches (see the sketch below). If you want to know more about this topic, you can take a look at the Object Detection course on Coursera (in Spanish!), this Stanford course or this blog post (contains a few more modern approaches).
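Here is a rough sketch of that sliding-window idea; the image is random data and the classifier is only a dummy stand-in (in practice it would be a model trained on labelled image patches), so treat it as an illustration of the loop rather than a working detector:

```python
import numpy as np

# Stand-in for a classifier trained on image patches (cat, banana, orange, background, ...).
# The rule below is nonsense; it only exists so that the sketch runs.
def classify_patch(patch):
    return "cat" if patch.mean() > 0.5 else "background"

image = np.random.rand(128, 128)   # fake grayscale image
window, stride = 32, 16            # size of the search window and its step size

detections = []
for top in range(0, image.shape[0] - window + 1, stride):
    for left in range(0, image.shape[1] - window + 1, stride):
        patch = image[top:top + window, left:left + window]
        label = classify_patch(patch)          # one classification problem per patch
        if label != "background":
            detections.append((top, left, label))

print(detections[:3])  # positions where the (dummy) classifier "found" something
```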
Unsupervised Learning
In unsupervised learning we do not care about predictions as we do in supervised learning. Here we only have the data in the form of our features, and we want to find some kind of structure in them. This can be done, for instance, by identifying clusters or by compressing the features. Since we do not need to predict anything here, we also do not have any targets at all. Hence, we call it unsupervised, because we no longer need a "supervisor" that teaches us an input-output relation (i.e. features to targets). Well, how does it work then? To answer this question, let us take a closer look at how you can perform unsupervised learning through clustering and compression.
Clustering
When we want to find some structure in our data, we could check whether we can find any clusters in it. Wait a moment! Did we not also talk about clusters when we discussed classification? Yes, but the difference here is that we do not have any targets. Hence, we do not know a priori how many clusters (i.e. classes) are in our data. Thus, the goal here is not to classify anything, but to find out whether there are any clusters at all and, if yes, how many. If you want to know more about clustering, you can take a look at the Machine Learning course again, where a common clustering algorithm is explained.
But where is this used in practice? Well, a famous example of clustering is Google News. On the internet we have a lot of news articles, and some of them are about the same topic; they are just published by different news websites. Google News clusters all articles with similar topics, so that it can show the news in a structured way. This makes them easier to read, since you probably do not want to encounter articles about the same topic again and again, do you?
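As a small sketch of what clustering looks like in code, here is k-means (the common algorithm covered in the Machine Learning course) applied to made-up, unlabelled two-dimensional data; trying several cluster counts and comparing how tightly each solution fits the data is one simple way to guess how many clusters there actually are:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up, unlabelled data: three blobs of points, but the algorithm is not told that.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
                  for c in ([0, 0], [3, 3], [0, 4])])

# Try different numbers of clusters and look at the "inertia" (how tightly the
# points sit around their cluster centers); it drops sharply until k matches
# the real number of clusters and then flattens out.
for k in range(1, 6):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    print(k, round(kmeans.inertia_, 1))

# With the chosen k, each data point gets a cluster label, much like Google News
# groups articles about the same topic together.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)
```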
Dimensionality Reduction
Another unsupervised technique is dimensionality reduction, where we want to reduce the number of dimensions in our data. What does this mean? Well, remember how we have visualized our data so far. We created a coordinate system where each of our features was represented as its own dimension (for regression we did the same with our target, too, but since we do not have any targets in unsupervised learning, we do not care about that here). Now we want to reduce the number of these dimensions, which is the same as saying that we want to reduce the number of our features. But wait! Did we not talk about this before? When we talked about classification above, we also mentioned that there are techniques to automatically select good features. Is this not the same? Well, not quite. In contrast to feature selection, we do not want to simply select some of our features; we rather want to compress them! How can we do that? Well, to answer that, let us first look at a visualization again.
As we can see here, we have a bunch of data points in a coordinate system with the two dimensions v1 and v2, which represent the two features of our data. However, as we have already seen before, our data points are not just randomly distributed. Instead, it looks like they are located around an imaginary line (cf. line PC1 above). We have actually already seen something similar to this when we talked about regression. However, this is different, because for regression we were interested in a target-feature relation (e.g. house price and house size), whereas here we are interested in a relation between the features only. Remember, in unsupervised learning we do not have any targets. Moreover, we do not have just a single "line" in the data, but multiple ones, as you can also see in our example above (cf. line PC1 and line PC2). Let us call these "lines" directions, and it turns out that we have as many of them in our data as we have features.
However, they are not all equally important. What does this mean? Well, to answer this, we should first talk about how to finally obtain our compressed features. Let us say we do not want to use our current dimensions anymore. Instead, we want to use the directions in our data for this. More precisely, forget the axes v1 and v2 and imagine that PC1 and PC2 are our new axes now. Are we done with this? Well, not yet. We still have two dimensions; they are just different now. To reduce them, let us remove one of them and transform our two-dimensional coordinate system into a one-dimensional one. However, which dimension should we keep? Well, let us just choose PC1 for now. Next, we need to get our data points from the two-dimensional representation to the newly created one-dimensional one. For this we project them onto the "line" PC1. Unfortunately, we also lose information when we do that. In our coordinate system above, this loss is marked by the distance from the data point to the line PC1 (cf. the red line orthogonal to PC1). However, if this loss is not too high, it will usually not be a problem in practice, as in our example here, since all data points are located quite close to the line. But could we also have chosen PC2? Yes, we could have. However, if we had chosen PC2, the loss of information would have been quite high, since the data points are not as close to PC2 as they are to PC1. As a result, we can say that the value of the direction PC1 is higher than the value of the direction PC2, so it was indeed good to choose PC1. But how do we actually get these directions in the first place? Well, it turns out that we can derive them mathematically, together with their "importance values". However, again I do not want to go into too much detail here. If you want to know more about it, you can take a look at the Machine Learning course or the linear algebra course from fast.ai.
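The directions PC1 and PC2 described above are what principal component analysis (PCA) computes, so here is a tiny sketch on made-up data with two correlated features; scikit-learn's PCA returns both the directions and their "importance values":

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data with two correlated features, roughly spread along one direction.
rng = np.random.default_rng(0)
v1 = rng.normal(size=200)
v2 = 0.8 * v1 + rng.normal(scale=0.2, size=200)
X = np.column_stack([v1, v2])

# PCA finds the directions (PC1, PC2) and their "importance values".
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # PC1 explains most of the variation

# Keep only PC1: project the two-dimensional points onto the most important direction.
X_compressed = PCA(n_components=1).fit_transform(X)
print(X.shape, "->", X_compressed.shape)  # (200, 2) -> (200, 1)
```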
However, what does our new feature (i.e. dimension) actually represent now? Well, it is not quite clear! Before, it was easier, since we usually used properties like the house size or the tumor growth rate as features. But now we have an artificially created feature that is not easily interpretable anymore. So, what does it represent? We do not know! At least not without further investigation. However, often it is not even necessary to understand these new features. They are still useful for a wide range of applications like the following ones:
- You can compress your data down to two or three dimensions, so that it can be visualized (which was not possible with more than three dimensions).
- Through the compression you can make your data smaller in terms of storage size.
- A new representation with new kinds of features can be beneficial for a succeeding supervised learning algorithm (e.g. boundary function might be easier to find, since our classes could be better to separate now).
- You could find some structure in the data (similar to clustering) through the possible directions in it.
A practical example of the latter point can be found, for instance, in algorithms that remove the background of videos, which can be beneficial in surveillance systems. Why is that beneficial? Well, the goal of surveillance systems is to find suspicious activities in videos (which are basically just sequences of images), and it might be easier to find those if we remove all the unimportant parts, like the image backgrounds. If you want to know more about how this works, you can take a look at the linear algebra course from fast.ai, in which they describe how to implement such an algorithm.
Reinforcement Learning
Reinforcement learning (RL) is the third type of Machine Learning, and compared to supervised and unsupervised learning it works a little differently. Instead of showing the learning algorithm data from which it can learn, we have an agent here that should learn from experience over time. What does this mean? Well, imagine, for instance, that our agent is a robot that plays chess. The goal of our robot is to win games, and to achieve that goal it needs to perform actions (i.e. chess moves). However, randomly choosing actions will probably not make our robot win a game. It also needs to observe its environment (i.e. the chess board) to see where its opponent's pieces are, so that it can make its moves in a more clever way. Hence, our game play would probably look like the following:
1. First we need to look where our opponent's pieces are (i.e. we need to observe the chess board)
2. Based on our observation from step 1, we make a chess move (i.e. perform an action)
3. Our opponent makes their own chess move, which changes the state of the board again
4. Since the board state has changed, we jump back to step 1 to look again where our opponent's pieces are now
As you can already see, this turns out to be an iterative process in which we alternate between observing the board and performing actions, and we keep doing that until the game is finished (hopefully with a victory for our robot).
But where does learning come in now? Well, it turns out that just by observing the environment we will probably not win a game in most cases, since our robot can only react to immediate attacks by its opponent. But what about attacks that the opponent has planned over a longer period of time? This is where learning finally comes into play. But how can our robot learn? Well, through experience! To better understand this, let us compare it to human learning. How does an infant learn, for instance? It tries things out! When it waves its arms or plays, it gets feedback and experiences the consequences of its actions. This way it learns over time to avoid actions that lead to negative consequences and to prefer actions that lead to positive ones. Our robot needs to do the same. When it keeps playing games of chess, it will, after some time of learning, try to avoid bad moves and prefer good ones. Hence, it can then not only react to immediate situations, but also plan long-term strategies. This process is actually shown quite nicely in the following video, where a robot learns to play the video game Atari Breakout.
However, how the learning works in detail again goes beyond an introduction, and thus I do not want to talk too much about it here. If you want to know more, you can check out this post on Quora, where you can find some references to lectures and blog posts about it. Personally, I have not worked with reinforcement learning much yet. Thus, I recently started reading the book by Richard Sutton (a famous RL researcher), which is supposed to be a nice introduction to the field and therefore probably also a good reference. Along the way I plan to write blog posts to better comprehend the material. Maybe they will also be helpful for someone else. But more on this later.
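To make the idea of learning from experience a bit more tangible, here is a toy sketch: instead of chess, a made-up corridor world in which the agent is rewarded for reaching the goal, and a simple tabular Q-learning rule as just one of many possible RL algorithms (none of this is taken from the resources above, it is only an illustration):

```python
import random

# A tiny made-up environment instead of chess: a corridor of 5 positions.
# The agent starts at position 0 and gets a reward of +1 for reaching position 4.
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]  # step left or step right

def step(state, action):
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

# One value estimate per (state, action) pair, learned purely from experience.
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration rate

for episode in range(200):
    state, done = 0, False
    while not done:
        # Mostly pick the action that looked best so far, sometimes explore randomly.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: (q[(state, a)], random.random()))
        next_state, reward, done = step(state, action)
        # Learn from the experienced consequence of the action (Q-learning update).
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state

# After training, the agent prefers +1 (move right, toward the goal) in every
# state before the goal, i.e. it has learned good moves from experience.
print([max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)])
```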
As a final remark, I want to talk about a real-world application again where this kind of Machine Learning is used. It turns out that there is actually a wide range of applications, like game-playing agents as mentioned above (e.g. for chess, Atari or the game of Go), vacuum cleaning robots or even autonomous cars, which seem to be a hot topic right now (there is even a course about them here). But wait! What do vacuum cleaning robots or autonomous cars have in common with playing a game as described above? Well, maybe more than you might think at first glance. An autonomous car also performs actions (here in the form of e.g. braking and steering), and for that it also needs to observe its environment (through e.g. cameras and radar), so that it does not bump into another car, for instance. Through experience, it should then get better and better at this over time. During the last couple of years, more and more companies have started to work on this problem, and they have been able to achieve some quite nice results. The following video shows, for instance, the newest self-driving car from Google.
Summary
So, what have we learned?
- Machine Learning is needed for programs that cannot really be programmed by hand anymore
- There are three types of Machine Learning: supervised, unsupervised and reinforcement learning
- In supervised learning we teach the learning algorithm to make a prediction when we give it some data
- There are two types of supervised learning: if our prediction is numerical, we call it regression; if categorical, we call it classification
- In unsupervised learning we try to find structure in our data (we do not predict anything here!) through clustering or feature compression
- In reinforcement learning we let an agent try to learn by experience over time