Online courses can be a good option to get into a new topic. For instance, regarding Machine Learning, the ML course held by Andrew Ng is highly recommended, since in my opinion it gives a really good introduction to the field. Recently, I also discovered the online courses from fast.ai, which caught my attention because they follow an interesting approach. Instead of explaining fundamental concepts first and then using them to describe bigger ones (as in a traditional math class in high school), they start with the biggest concept first (i.e. the application itself, like recognizing a certain object in an image) and explain all the details along the way. The idea behind this is that students are less likely to lose sight of the big picture and, as a result, their motivation. Harvard professor David Perkins uses baseball as an analogy in his book Making Learning Whole: instead of explaining all the rules and technical details of baseball to kids first, we just let them play. They may not be perfect at the beginning, but they will pick up all the important things over time and improve step by step. This is the approach that fast.ai follows as well. You can find more information about their teaching philosophy in this blog post and this talk.
Currently they offer three courses: two about Deep Learning (part 1 and part 2) and one about Computational Linear Algebra. I decided to start with the latter, since linear algebra is also used in Deep Learning (and Machine Learning in general), so the knowledge I gain here might be helpful for the other two as well. An overview of the Computational Linear Algebra course can be found here. While taking it, I want to write about each lesson. This helps me digest the material and makes sure that I really understood the topic. Furthermore, I hope it might be helpful for someone else too. The first lesson is about the following question:
How can we do matrix computations with acceptable speed and acceptable accuracy?
OK, it seems to be all about matrices. Why is that? Well, we can basically find matrices everywhere around us, for example in the form of Excel spreadsheets or even pictures. The latter might seem surprising if you have not dealt with computational image processing or computer vision yet. So, what do pictures have to do with matrices? Well, pictures are basically matrices of color values!
If we have a black and white image, for instance like the one above, the matrix values (here the numbers from 0 to 255) correspond to how bright or dark the color at that spot of the image (i.e. the pixel) is. For colored images the concept is the same, but we have three matrices instead of a single one: one for each color channel (red, green, blue). However, I do not want to go into detail here. If you are interested, you can read more about the RGB color model on Wikipedia.
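To make this concrete, here is a tiny, made-up example in Python with numpy (the pixel values are arbitrary, just for illustration):

```python
import numpy as np

# A tiny 3x3 grayscale "image": each entry is the brightness of one pixel,
# from 0 (black) to 255 (white).
image = np.array([
    [  0, 128, 255],
    [ 64, 192,  32],
    [255,   0, 128],
], dtype=np.uint8)

print(image.shape)  # (3, 3) -> 3 rows and 3 columns of pixels
print(image[0, 2])  # brightness of the pixel in the first row, last column: 255
```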
But what can we do with these matrices now? Well, basically we can do two things with them:
- We can combine several matrices into a new one
- We can split a matrix into several smaller ones
And why do we want to do this? It turns out that this is needed for a wide range of applications, such as object detection in images or removing the background of an image, both of which can be important components of a surveillance system. However, again I do not want to go into detail here, since this seems to be covered in more depth in later lessons of the course.
Let us rather concentrate on the matrix computations themselves for now. When we choose or design an algorithm, we should keep three things in mind:
- Accuracy
- Memory Usage
- Speed
However, there are often trade-offs between these categories; an approximate approach, for instance, can sometimes be a lot faster than an exact one. We will see later what this means.
Accuracy
Before we talk about accuracy, we need to look at the way the computer stores our matrix values, which in practice will often be real numbers. In this context, let us take a look at a small example first:
Let us say we want to call the function f, which doubles its argument and wraps the result back into [0, 1] (i.e. f(x) = 2x for x ≤ 1/2 and f(x) = 2x − 1 for x > 1/2), with the value 0.1, take the result, plug it back into f as an argument and repeat this procedure. What happens? Let us look at the results when we calculate it by hand:
0.2, 0.4, 0.8, 0.6, 0.2, 0.4, 0.8, 0.6, 0.2, 0.4, ...
It loops! However, what does the computer do? Of course we would expect the computer to give the same results, but let us have a look.
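For reference, here is a minimal sketch of this experiment in Python; the exact definition of f and the rounding of the printed values are my reconstruction from the hand-computed sequence above, not necessarily the code used in the course.

```python
def f(x):
    # Doubling map reconstructed from the sequence above:
    # double the value and wrap back into [0, 1].
    if x <= 1 / 2:
        return 2 * x
    else:
        return 2 * x - 1

x = 0.1
for _ in range(80):
    x = f(x)
    print(round(x, 12))  # rounding is only for printing; the error still accumulates
```

Running something along these lines produces output like the following: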
...
0.2
0.4
0.8
0.6
0.200000000001
0.400000000001
0.800000000003
0.600000000006
0.200000000012
0.400000000023
0.800000000047
0.600000000093
0.200000000186
0.400000000373
0.800000000745
0.60000000149
0.20000000298
0.40000000596
0.800000011921
0.600000023842
0.200000047684
0.400000095367
0.800000190735
0.60000038147
0.200000762939
0.400001525879
0.800003051758
0.600006103516
0.200012207031
0.400024414062
0.800048828125
0.60009765625
0.2001953125
0.400390625
0.80078125
0.6015625
0.203125
0.40625
0.8125
0.625
0.25
0.5
1.0
1.0
...
What happened?! I let the program run for 80 iterations, and at some point we can observe that there is a tiny error. This tiny error adds up over time and leads to a really wrong result in the end.
The problem is that computers cannot store numbers with infinite accuracy, since computers are discrete and finite, whereas math is continuous and infinite. As a result, computers cannot represent arbitrarily large or small numbers, and there must be gaps between the numbers they can represent. Thus, we need to care about accuracy, because these small inaccuracies can add up to really huge errors, as in our example above.
But how does the computer store real numbers? The most widely used approach on modern computers is floating-point arithmetic. Here a real number consists of three parts: a sign, a mantissa (also called the significand) and an exponent.
So, what are these components? Let us look at an example. We want to store the real number -1567.2. We can also write this number as -1.5672 * 10³, which is equal. The sign is just a bit that stores whether the number is positive or negative (in our case it is negative). The mantissa is 1.5672 and the exponent is 3, where the base of the exponent is assumed to be 10 here (computers usually use base 2, but since base 10 is easier for humans to comprehend, I used it in my example).
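As a side note, Python lets you peek at the corresponding base-2 decomposition that is actually stored; a small sketch:

```python
import math

x = -1567.2
mantissa, exponent = math.frexp(x)  # x == mantissa * 2**exponent, with 0.5 <= |mantissa| < 1
print(mantissa, exponent)           # roughly -0.765234375 and 11
print(x.hex())                      # sign, significand and exponent in hexadecimal notation
```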
As already mentioned, these numbers cannot be infinitely large or small, which means that there are boundaries. For example, in the IEEE double precision standard the magnitude of a (normal) real number lies roughly between 2.23 * 10⁻³⁰⁸ and 1.79 * 10³⁰⁸. Furthermore, I also mentioned that there need to be gaps between the numbers, since they cannot be stored with infinite accuracy. However, these gaps are not all of the same size: they are small around zero and get bigger the further away from zero you go.
However, floating-point arithmetic is not the only way of storing numbers. There are many more, each with its own advantages and disadvantages, but as already mentioned, floating-point arithmetic is the most widely used today. For more information you can take a look into the Handbook of Floating-Point Arithmetic (chapter 1 can be accessed for free), into this paper (there seems to be an edited reprint here) or into this Wikipedia article.
When designing an algorithm, we should keep in mind how the computer stores numbers. Otherwise, errors can appear that get worse and worse over time, as we saw in our example above, and these errors can sometimes be expensive. For example, the system of the Ariane 5 rocket (the European Space Agency spent 10 years and $7 billion on it) tried to fit a 64-bit number into a 16-bit space (integer overflow!), i.e. it tried to store a number that was too big. What could possibly go wrong?
However, as mentioned above, we do not always need highly accurate results. We just need to make sure that our algorithm gives nearly the correct result with nearly the correct data. Sometimes a small decrease in accuracy is acceptable if we gain an increase in speed in return, for instance by using approximate algorithms. These algorithms usually give the correct answer with some probability, and by re-running the algorithm you can increase this probability. An example of such an algorithm is the Bloom filter, which can sometimes reduce memory usage by a lot.
It checks whether an element belongs to a set. If it does not, it clearly says NO. Otherwise, it says MAYBE and you can check that element further. Bloom filters are used, for instance, in web browsers to handle a set of blocked pages (e.g. pages with malware). If you want to know in more depth how they work, you can get more information through this tutorial.
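To give a feeling for how this works, here is a toy Bloom filter sketch in Python. The class name, sizes and hashing scheme are made up for illustration and not tuned for real use.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive several bit positions per item from a salted hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely not in the set"; True only means "maybe".
        return all(self.bits[pos] for pos in self._positions(item))

blocked = BloomFilter()
blocked.add("http://malware.example")
print(blocked.might_contain("http://malware.example"))  # True (maybe -> check further)
print(blocked.might_contain("http://safe.example"))     # very likely False
```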
Memory Usage
Above we talked about how numbers are stored by computers. Now we come to the matrices themselves. How are they stored internally? Well, for instance with an array-like structure. However, although this is also an important topic, I rather want to talk about something else here. Usually we would simply store all our matrix values in such a data structure. This is called dense storage.
But what if a lot of the elements in a matrix are zero, like in the matrix above? We could just ignore the zeros and only store the non-zero values! This way we can sometimes save a lot of memory. This is called sparse storage. Thus, you should always consider whether dense or sparse storage makes more sense for your problem.
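As a small sketch (assuming scipy is available), here is how much storage a mostly-zero matrix needs in dense versus sparse form:

```python
import numpy as np
from scipy import sparse

# A 1000 x 1000 matrix with only three non-zero entries.
dense = np.zeros((1000, 1000))
dense[0, 0] = 1.0
dense[500, 123] = 2.5
dense[999, 999] = -3.0

compressed = sparse.csr_matrix(dense)  # sparse storage: keep only the non-zeros
print(dense.nbytes)                    # 8,000,000 bytes for the dense array
print(compressed.data.nbytes)          # 24 bytes for the three stored values
                                       # (plus a little overhead for the index arrays)
```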
Speed
Besides accuracy and memory usage, the speed of the algorithm is also an important factor. It is usually influenced by the following:
- Complexity
- Vectorization
- Locality
- Scaling to multiple cores/nodes
Complexity
First of all, the speed of an algorithm is influenced by the complexity of the algorithm itself. In this context, Big O notation is often used to express how complex a program is. If you have not heard of Big O notation, you can read about it here (you can also practice on Code Academy).
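As a small illustration of what complexity means for matrix computations: multiplying two matrices the naive way takes three nested loops and therefore roughly n³ multiplications for n x n matrices, i.e. it is O(n³). A sketch:

```python
import numpy as np

def matmul_naive(A, B):
    # Three nested loops: about n * m * p multiplications,
    # i.e. O(n^3) for square n x n matrices.
    n, m = A.shape
    m2, p = B.shape
    assert m == m2, "inner dimensions must match"
    C = np.zeros((n, p))
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.random.rand(50, 60)
B = np.random.rand(60, 40)
np.testing.assert_allclose(matmul_naive(A, B), A @ B)  # same result as numpy's optimized @
```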
Vectorization
Instead of using for-loops, it is usually faster to use vectorization. Vectorization means that you do not iterate through your data element by element; instead, you take the whole vector or matrix and execute the desired operation on it at once to get the same result. This is a lot faster, since highly optimized libraries like BLAS or LAPACK are used to execute the computation, and they can do it way faster than running the computation on every single element while iterating through the data with a for-loop. If you want to know more about vectorization, you can take a look here and here.
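A small sketch of the difference (the exact timings will of course depend on your machine):

```python
import time
import numpy as np

x = np.random.rand(1_000_000)

start = time.perf_counter()
doubled_loop = [value * 2.0 for value in x]  # element by element in pure Python
print("loop:      ", time.perf_counter() - start)

start = time.perf_counter()
doubled_vec = x * 2.0                        # one vectorized operation on the whole array
print("vectorized:", time.perf_counter() - start)
```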
Locality
We have different kinds of storage (e.g. disk, RAM, cache, registers) from which we can access our data. Some of them are fast and some are slow. Unfortunately, we have much less fast storage than slow storage. For instance, loading data from disk is quite slow, whereas loading data from a register is quite fast. As a result, we should keep two things in mind. First, once our data is in fast storage, we should do as much with it as possible while it is still there: if we do just one thing, put the data back into slow storage and later load it again, our program will be slow because of the loading times. Second, we should store the next data item we need close to the current one, so that it is faster to access. If you want to know more about storage, you can take a look into this paper. It is a bit old already, but maybe still useful (check out the discussion about it on reddit too).
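A small sketch of the effect: numpy arrays are stored row-major by default, so walking along rows touches memory contiguously, while walking along columns jumps around (again, the exact numbers depend on your machine):

```python
import time
import numpy as np

a = np.random.rand(4000, 4000)  # C order: each row is contiguous in memory

start = time.perf_counter()
for i in range(a.shape[0]):
    a[i, :].sum()               # contiguous rows: good locality
print("row-wise:   ", time.perf_counter() - start)

start = time.perf_counter()
for j in range(a.shape[1]):
    a[:, j].sum()               # strided columns: poor locality, usually slower
print("column-wise:", time.perf_counter() - start)
```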
The following video (minutes 1:00–13:00) is a good example of how you can implement an algorithm that blurs an image in different ways while keeping storage in mind, each with its own trade-offs.
As a final remark I want to talk about temporaries. Temporaries are variables in a program that are created to hold an intermediate result: when we calculate something, we store the result in such a variable. However, this variable is usually located in the slower RAM, so when we need the value again for further computations, we have to load it from RAM again. Instead, we could just keep the result in fast storage and do all calculations there. Unfortunately, the Python library numpy (which we will use a lot later) usually creates such temporaries for every operation or function. For instance, a = b * c² + ln(d) already creates four temporaries, since we have four operations/functions here.
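A small sketch of what this looks like in numpy, together with one way to cut down on the temporaries by reusing preallocated buffers via out= (whether that is worth the loss in readability depends on your use case):

```python
import numpy as np

b = np.random.rand(1_000_000)
c = np.random.rand(1_000_000)
d = np.random.rand(1_000_000)

# Straightforward version: numpy allocates a temporary array for each step.
a = b * c ** 2 + np.log(d)

# Version with fewer temporaries: reuse preallocated output buffers.
a2 = np.empty_like(b)
tmp = np.empty_like(d)
np.square(c, out=a2)        # a2 = c**2
np.multiply(b, a2, out=a2)  # a2 = b * c**2
np.log(d, out=tmp)          # tmp = ln(d)
np.add(a2, tmp, out=a2)     # a2 = b * c**2 + ln(d)

print(np.allclose(a, a2))   # True
```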
Scaling to multiple cores/nodes
If our data cannot completely fit into memory, or if the computation on it takes too long, we can split the data and distribute it over multiple cores or nodes. This way we can execute the computation in parallel. If you want to know how to do this in Scala, you could take a look at this and this online course.
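To illustrate the idea in Python with a toy sketch using the standard library (the function and the numbers are made up for illustration):

```python
from multiprocessing import Pool

import numpy as np

def chunk_sum_of_squares(chunk):
    # The work done independently on one piece of the data.
    return np.square(chunk).sum()

if __name__ == "__main__":
    data = np.random.rand(10_000_000)
    chunks = np.array_split(data, 4)  # split the data into 4 pieces
    with Pool(processes=4) as pool:
        partial_results = pool.map(chunk_sum_of_squares, chunks)  # one chunk per worker
    print(sum(partial_results))
```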
Summary
So, what have we learned in this first lesson?
- We can find matrices everywhere, e.g. in Excel tables, images etc.
- There are two kinds of matrix operations: combining and splitting
- These matrix operations are needed for a wide range of applications, e.g. in image object detection or background removal in surveillance systems
- When choosing or designing an algorithm for a certain matrix operation you need to keep accuracy, memory usage and speed in mind
- Math is continuous and infinite, but computers are discrete and finite, and thus they cannot store numbers with infinite accuracy
- However, our algorithm for the matrix operation only needs to give nearly the correct result with nearly the correct data
- We can store matrices in dense or sparse form
- Regarding speed we should keep complexity, vectorization, locality and parallelization in mind
But why do we actually need to know all this about matrix computations? Could we not just use a library that takes care of all that for us? Yes! But sometimes it is also good to know how these algorithms work. It might help us choose the right one for our problem, for instance, or if there is no library that implements a particular algorithm, we can implement it ourselves. Hence, it is good to know about this. If you want to take a look at the original lesson on fast.ai, you can find it here.