In my last blog post I talked about the different problems in the field of computer vision (or visual recognition) that people work on. The most basic one in this context is the image classification problem, since many other tasks, such as object detection or image segmentation, can be reduced to it. However, image classification also has many practical applications of its own. For instance, there has been research on using image classification to detect diabetic retinopathy (a severe eye disease that often leads to blindness) or to monitor how distracted drivers are while driving. So, briefly speaking, it is an important problem. However, before we go into any details about how to solve it, let us quickly recap what the problem is actually about.
Let us imagine we have a system that is able to classify an image into one of a fixed set of predefined categories (also called labels or classes), i.e. when it receives an image, it needs to look at it and assign it a label. Of course, this label should not be chosen arbitrarily by our system. Instead, it should be the label that represents the main object of that image, and our system should find it either by predicting it directly or by generating a probability distribution over all the labels to indicate its confidence, i.e. we get a probability for each label saying how “sure” our system is that this label corresponds to the main object of the image.
But wait! How do we actually determine the set of labels our system should consider for its predictions!? Well, it turns out that we need to know in advance what kind of images we are going to receive. Let us say we have a bunch of photos that show either a cat or a dog, and we want our system to organize them automatically, so that in the end we have one bin with all the cat pictures and one bin with all the dog pictures. In order to do that, our system needs to be aware of the two classes cat and dog, and as a result we have {cat, dog} as our set of predefined categories. But what if we do not know in advance which labels to expect!? It turns out that we then need something slightly more complicated. However, since I want to keep it simple for now, I do not want to go into any details here and rather stick to the situation where we always know our categories in advance (but if you are still interested, check out one-class classification in my last blog post for the main idea at least). Okay, let us now look at an example to better understand the whole process.
Let us imagine we have images of cats, dogs, hats and mugs, and we want to organize them automatically with the help of our system. The first image we receive is an image of a cat like the one above, which should be classified with respect to our fixed set of labels {cat, dog, hat, mug}. So, we give that image to our system and after some analysis (we will talk about this in a second) we get back a probability distribution over all the labels (in our example above: 82% cat, 15% dog, 2% hat, 1% mug). If our system works correctly, the probability of the label cat will be the highest, since that is the correct label (which is indeed the case in our example above, where we get a probability of 82% for the label cat). Of course, we can also have images containing objects of multiple categories, like photos that show e.g. cats and dogs together, but again, to keep it simple for now let us only consider cases where we need to assign exactly one label (i.e. binary or multiclass classification; for details about the different types of classification see my last blog post).
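To make the last step concrete, here is a minimal sketch of how a system could turn such a probability distribution into a single predicted label. The class names and probabilities are just the ones from the example above:

```python
# Hypothetical output of our classifier for the cat image above:
# one probability per predefined category.
probabilities = {"cat": 0.82, "dog": 0.15, "hat": 0.02, "mug": 0.01}

# The predicted label is simply the category with the highest probability.
predicted_label = max(probabilities, key=probabilities.get)
print(predicted_label)  # -> "cat"
```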
But wait a moment! Recognizing a cat in an image seems like an easy problem, doesn’t it!? Well, it actually really is…for us humans at least! Our visual system is specialized to perform these kinds of visual recognition tasks, and that is why they seem so simple to us. However, it took nature millions of years to develop this ability, and it is actually really difficult to reproduce it with a machine. We should keep in mind: what does a computer see when it looks at an image? It does not immediately get the holistic idea of a cat, which we humans obtain when we look at the image above. Instead, the computer represents an image as a large grid of numbers (see my last blog post for some more details about this), and it is really not so easy to distill the idea of a cat from that grid. This problem is sometimes called the Semantic Gap, which refers to the gap between a semantic concept like a cat and the pixel values of the image that the computer sees (the numbers in the grid). Closing this gap is a really hard problem, since when we try to do so, we are confronted with a bunch of difficult challenges, which I want to discuss in the following.
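To see what “a large grid of numbers” means in practice, here is a small sketch that loads an image and looks at its raw pixel values. It assumes the Pillow and NumPy libraries are installed, and cat.jpg is just a hypothetical example file:

```python
import numpy as np
from PIL import Image

# Load the image and convert it into a grid (array) of numbers.
image = Image.open("cat.jpg")   # hypothetical example file
pixels = np.array(image)

print(pixels.shape)  # e.g. (600, 800, 3): height x width x RGB color channels
print(pixels.dtype)  # uint8: each value is an integer between 0 and 255
print(pixels[0, 0])  # the three color values of the top-left pixel, e.g. [121  97  84]
```

This grid of integers is all the computer has to work with — the “idea of a cat” has to be distilled from it somehow.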
Challenges
To recognize an image of e.g. a cat, it is not enough to simply remember its pixel values, because even small, subtle changes in the picture can cause the grid of numbers to change entirely, and it turns out that such changes are not rare at all in practice. They actually occur quite often:
- Viewpoint Variation: Let us say we take an image of e.g. a mug like the one below. Then we move the camera a bit and take another picture without moving the mug in any way. Although both images show the same mug, their pixel grids end up being totally different.
- Illumination: When we take pictures, we usually do not always have the same lighting conditions. An image taken in the evening is probably darker than one taken during the day. This has a drastic effect on the pixel level, since colors (represented by the pixel values) usually turn out darker at night. So, if we take a photo of a cat in daylight and a photo of the same cat during sunset, the two images will turn out to be quite different due to the different lighting conditions, even though they show the same cat.
- Deformation: Many objects are not rigid bodies, but can be deformed in extreme ways. Cats, for instance, are really good at this, and even though they take on a lot of varying poses, we humans are still able to recognize them as cats.
- Occlusion: We do not always see the whole object in the image, but only parts of it. Nevertheless, we humans can still recognize it in most cases. Take for instance the right image of the cat below. We can only see a tail, but we are still able to tell that this is probably a cat hiding under a couch pillow.
- Background Clutter: Sometimes foreground objects, like for instance the cats in the images below, can look quite similar to their backgrounds, and thus their pixel values are probably quite similar as well. However, we humans are still able to recognize them.
- Scale Variation: Some objects can also appear in different sizes, and when I talk about sizes here, I mean sizes in the real world, not just in terms of their extent in the image (objects close by vs. far away). We can have for instance big dogs or small dogs, but we humans are still able to recognize them all as dogs.
- Intraclass Variation: Objects within one class do not always look that similar to each other. Take chairs, for instance: they come in a large variety of colors, shapes and sizes. However, they share at least some common properties that make us humans recognize them as chairs (e.g. the seating surface).
Data-Driven Approach
So, we have seen that we are confronted with a bunch of difficult challenges when we try to develop an image classifier. Now the problem does not seem so easy anymore, and it is thus really amazing that we have algorithms today which actually work quite well. In some limited situations they can even reach human accuracy. But how do they work? How can we implement such a classifier? Well, before we get to the details, let us think about an API first. We probably need some kind of function that takes an image and gives us a class label back. So, it could look like the following:
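Something like this minimal Python sketch, where the function name and the returned value are of course just placeholders:

```python
def classify_image(image):
    # Somehow analyze the pixel values of the image and figure out
    # which of our predefined labels fits best...
    # (this is exactly the part that is so hard to write down by hand)
    return "cat"  # placeholder: a real implementation would compute this label
```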
Okay, now that we have the API, let us think about how we could obtain the label from the image. Well, unlike e.g. sorting a list of numbers, there is really no obvious way to hard-code an algorithm for recognizing objects like cats. When we try to sort a list, we can (more or less) easily write down the required steps, but for classifying images it is not at all clear how to do that. However, this did not stop people from coming up with various attempts. For instance, they computed the edges of an image and then developed a bunch of hand-coded rules that tried to find corners and boundaries to detect the ears, eyes or noses of cats. Unfortunately, this turned out not to work very well, because it is actually quite brittle. Furthermore, it is also not a very scalable approach: if we want to write an algorithm to recognize dogs, we need to start all over again and come up with new rules. So, we would need to do that for every new object category, which is quite tedious. It would be better to have an algorithm that scales much more naturally to all the objects in the world, so that we do not need to start all over again for every new object.
Thus, people came up with the data-driven approach, which is a lot more suitable for these kinds of problems. What is it about? Well, we can actually compare it with how a child learns something: we provide the computer with many examples of each class we want it to recognize, and then have a learning algorithm look at these examples and learn something about the visual appearance of each class. This can be done in three steps.
- Collect a dataset of images and their corresponding labels. We often refer to this as the training set. Collecting a dataset requires a lot of effort, since in practice we often have hundreds of classes and thousands of images per class. Fortunately, other people have often already created high-quality datasets for us, e.g. CIFAR, ImageNet or the datasets on the data science platform Kaggle.
- Use Machine Learning to train a classifier with the collected training data, i.e. the classifier should take all the data, summarize it in some way and create a model from it, which contains the summarized knowledge of how to recognize the different classes (it should basically learn what each of the classes looks like).
- Evaluate the quality of the classifier with regard to its ability to correctly recognize e.g. cats or dogs in new images that the classifier has not seen before. To do this, we simply predict the labels of these new images and compare them with the true labels (often called the ground truth) using a certain classification metric (I will probably write more about metrics in another blog post, but if you are already interested, check out this, this and this article on Wikipedia). A tiny end-to-end sketch of all three steps follows right after this list.
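To make the three steps a bit more tangible, here is a minimal sketch of the whole pipeline. It assumes TensorFlow (only used to download CIFAR-10) and scikit-learn are installed, uses raw pixel values as features, a simple nearest-neighbor model as the classifier and plain accuracy as the metric; a real system would of course make much better choices at each step:

```python
import numpy as np
from tensorflow.keras.datasets import cifar10        # assumed: TensorFlow is installed
from sklearn.neighbors import KNeighborsClassifier   # assumed: scikit-learn is installed
from sklearn.metrics import accuracy_score

# Step 1: collect a dataset of images and labels (here: the CIFAR-10 train/test split).
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

# Flatten each 32x32x3 image into a single row of 3072 pixel values
# (a very naive representation, just to keep the sketch short),
# and use only a small subset to keep it fast.
x_train = x_train.reshape(len(x_train), -1)[:5000]
y_train = y_train.ravel()[:5000]
x_test = x_test.reshape(len(x_test), -1)[:1000]
y_test = y_test.ravel()[:1000]

# Step 2: train a classifier on the training data (here: a simple nearest-neighbor model).
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(x_train, y_train)

# Step 3: evaluate the classifier on images it has not seen before,
# by comparing its predictions with the true labels (the ground truth).
predictions = classifier.predict(x_test)
print("Accuracy:", accuracy_score(y_test, predictions))
```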
However, with the data-driven approach our classifier API from above actually changes a bit. We now need two functions: one for training the classifier, which takes the training data (images and labels) and gives us a model back, and one for predicting the labels of new images, which takes the new images as well as the model and gives us the corresponding labels. Thus, the API could rather look like the following:
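In Python, a sketch of this new two-function API could look like this (again, all names are just placeholders):

```python
def train(images, labels):
    # Look at all the training images together with their labels and
    # summarize what each class looks like in some kind of model.
    model = ...  # placeholder: the actual learning happens here
    return model

def predict(model, test_images):
    # Use the knowledge stored in the model to assign a label
    # to each of the new, previously unseen images.
    test_labels = ...  # placeholder: the actual prediction happens here
    return test_labels
```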
Okay, now we need an implementation for the training and prediction functions. However, a lot of things are involved here, so it is probably better to split this topic up over a series of follow-up blog posts and finish this one here.
Summary
So, what have we learned?
- Image classification is a core problem of Computer Vision and it has a lot of practical applications
- Classification process: the classifier receives an image and predicts a label for this image corresponding to the main object in that image
- However, predicting the correct label is not an easy task for a computer (unlike for humans), since there are many challenges (e.g. illumination)
- Instead of hard-coding rules to predict a label, it is better to use a lot of data and Machine Learning (data-driven approach)
- High-quality datasets include e.g. CIFAR, ImageNet and the various datasets on Kaggle