Tuesday, March 24, 2015

The Philosophy of PANNS -- A Very Preliminary Evaluation & A Not-So-Fair Comparison (maybe)

Though I promised a long time ago [here] to perform a simple evaluation of PANNS, I only got it done today because of my overwhelming workload before moving to Cambridge.


I. Caveat

I must clarify several things before we start. The purpose is not merely to fend off criticism, but to emphasize the context and goal of building PANNS as a small tool.

First, frankly, a large part of the hyped data science deals with data engineering. However, it is well known that there is no silver bullet in the engineering world, which means a good engineer has to look into the specific application area to determine the proper tool. Different data sets may lead to different conclusions when you evaluate your machine learning/data mining algorithms, as we will see in this post.

Second, I only evaluated the accuracy and the index size here. I admit that PANNS will be beaten miserably by other software in terms of index building time. However, we assume the index is not rebuilt frequently; once it is built, it is only queried. Besides, the index should be as small as possible so that it can handle large data sets. Please check "The Philosophy of PANNS" for more details.

Third, I only compare against Annoy in this post. I am not saying the other candidates are bad; simply, such comparisons already exist and Annoy seems superior to the others in many aspects. In fact, Annoy can be an order of magnitude faster than PANNS in terms of speed. But you will also see how PANNS wins out in other aspects in the rest of this post.


II. Evaluation

The data set used in the evaluation is synthetic. Each data point is 2000-dimensional and follows a standard normal distribution $\mathcal{N}(0,1)$. The index contains 256 RPTrees. We measure accuracy and index size; the numbers presented are the averages over 50 experiments. The two tables below summarize our results, one for Euclidean similarity and one for angular similarity. In most cases, I prefer tables to figures when presenting numbers, so I did not bother to plot the results.
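For what it is worth, the setup is easy to reproduce with a few lines of NumPy. The snippet below is my own reconstruction of the data generation and the accuracy metric (recall of the true nearest neighbours found by brute force), not the original evaluation script; all names are made up for illustration.

    import numpy as np

    def make_dataset(n_points, dim=2000, seed=42):
        # Synthetic data: every coordinate is drawn i.i.d. from N(0, 1).
        rng = np.random.RandomState(seed)
        return rng.normal(0.0, 1.0, size=(n_points, dim))

    def accuracy(true_ids, approx_ids):
        # Fraction of the brute-force k nearest neighbours that the index returns.
        return len(set(true_ids) & set(approx_ids)) / float(len(true_ids))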


Data Set Size       5000       10000      15000      20000      25000      30000
Accuracy - PANNS    75.0 %     51.2 %     39.2 %     30.4 %     27.2 %     25.2 %
Accuracy - Annoy    58.2 %     38.0 %     29.4 %     23.2 %     20.6 %     18.0 %
Index - PANNS       29.6 MB    59 MB      88 MB      116 MB     149 MB     174 MB
Index - Annoy       56.0 MB    112 MB     169 MB     224 MB     279 MB     334 MB

Table 1. Comparison between PANNS and Annoy using Euclidean similarity.


Data Set Size       5000       10000      15000      20000      25000      30000
Accuracy - PANNS    75.0 %     53.6 %     36.0 %     37.0 %     27.0 %     25.0 %
Accuracy - Annoy    65.0 %     36.8 %     26.4 %     26.4 %     19.2 %     17.2 %
Index - PANNS       29.6 MB    59 MB      88 MB      116 MB     149 MB     174 MB
Index - Annoy       35.0 MB    70 MB      93 MB      140 MB     159 MB     188 MB

Table 2. Comparison between PANNS and Annoy using Angular similarity.


From both tables, the first thing we notice is the difference in index size. PANNS saves much more space when using the same number of RPTrees; in the case of Euclidean similarity, the PANNS index is only half the size of Annoy's. This benefit becomes even more noticeable when dealing with extremely large data sets and many RPTrees. These results also explain why PANNS lost quite badly in some evaluations where the index only contained a small number of RPTrees.

In terms of accuracy, PANNS consistently outperforms Annoy in all cases in our evaluation, though the difference diminishes as the number of data points grows. Since PANNS uses much less space for storing the index, we could incorporate more RPTrees within the same index size to achieve better accuracy, although it then becomes hard to argue that the comparison is still fair.

In terms of building time, Annoy is indeed faster. However, I must point out that this only holds for serial execution. PANNS provides parallel index building to take advantage of the multiple cores on your workstation. In my case, PANNS turned out to be much faster than Annoy because my workstation has 16 cores. I hope this does not count as cheating ;-)
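For reference, enabling parallel building in PANNS looks roughly like the sketch below. This is written from memory rather than copied from the library, so treat the class and parameter names as assumptions and check the PANNS README for the exact API.

    from panns import PannsIndex   # exact import and constructor signature may differ

    # Build an index over 2000-dimensional vectors with Euclidean distance (sketch only).
    p = PannsIndex(2000, 'euclidean')
    for v in dataset:              # `dataset` is assumed to be an iterable of vectors
        p.add_vector(v)
    p.parallelize(True)            # use all available cores while building
    p.build(256)                   # 256 RPTrees, as in the evaluation above
    p.save('panns.idx')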

In terms of scalability, I also tried building an index from the English Wikipedia dump, which contains approximately 3.8 million documents. PANNS got the job done decently (though it took a long time), whereas Annoy always failed due to memory issues. However, further investigation is definitely needed here.


III. Inconclusive Conclusion

Well, Erik already gave a good summary of "what to use in which context" in his post. I only provide some complementary advice below.

In general, PANNS generates a much smaller index without sacrificing accuracy. This becomes more important when you are dealing with large data sets and still want to incorporate as many RPTrees as possible to achieve satisfactory accuracy.

The future trend is parallel computing: we need to squeeze all the computational resources out of our devices. The slowness of PANNS can be effectively ameliorated by parallel building and parallel querying on multiple cores (or even across the network). Please check out our paper on parallel RPTree building on Apache Spark [here].

We are still carrying on research to improve the accuracy of PANNS. As mentioned, this relates more to data engineering. There are even better algorithms which can outperform PANNS by 20% to 100%, of course at the price of even longer building times. In the future, we will gradually incorporate these algorithms, balancing efficiency and accuracy. All in all, PANNS will remain as simple and compact as possible for teaching purposes, as we already mentioned in the [previous article].








Saturday, March 07, 2015

The Spherical Cow in System Science - Matrix Multiplication on Apache Spark (Part III)

As mentioned in the previous [blog], we introduced the big data framework into our teaching in 2014, and we specifically chose Apache Spark in 2015. Besides calculating basic statistics, we also asked the students to do something more skillful: matrix multiplications. The detailed assignment description can be found [here].


I. Two Simple (But Representative) Matrix Multiplication Problems

In the assignment, we gave the students a tall and thin matrix of size $10^6 \times 10^3$; let's call it matrix $A$. The matrix $A$ has one million rows, and each row contains one thousand double numbers, so one billion doubles in total. We asked the students to calculate $X$ as below:

  1. $X = A \times A^T \times A$
  2. $X = diag(A \times A^T)$

Solving large linear equation systems is very common in both engineering and scientific fields, and knowing how to manipulate matrices is a basic skill for engineers and scientists. These two examples, though seemingly simple, are quite representative of scientific computing. So how do we actually solve them?


II. Order Matters

Let's start with the first question, $X = A \times A^T \times A$. There are two ways to calculate $X$, namely $X = (A \times A^T ) \times A$ and $X = A \times (A^T \times A)$. Mathematically, they are equivalent in the sense that they give exactly the same result. However, from an engineering and systems perspective, there is a big difference! Why?

Let me first ask you: "What is the size of $X$?" Yes, it is exactly the same as $A$, namely $10^6 \times 10^3$. No matter whether we calculate the first product first (i.e., $A \times A^T$) or the second one first (i.e., $A^T \times A$), the intermediate result needs to be transmitted to many nodes to finish the calculation and produce an $X$ of the same size as $A$. However, the two intermediate results are quite different, as the figure below shows:


You must get the point after seeing this figure, right? If we call the intermediate result matrix $B$, we can see that $A \times A^T$ leads to a huge $B$ of size $10^6 \times 10^6$ in case I, whereas $A^T \times A$ leads to a very small $B$ of size $10^3 \times 10^3$. Now, you tell me: which matrix is easier to distribute over the network? There are several orders of magnitude difference in the network traffic. If the matrix $A$ is very, very tall (i.e., contains a huge number of rows), it may not even be possible to finish the calculation of $A \times A^T$. So the lesson we learn here is: order does matter!
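To put rough numbers on it (my own back-of-the-envelope estimate, not part of the assignment): the $10^6 \times 10^6$ intermediate matrix holds $10^{12}$ doubles, roughly 8 TB, while the $10^3 \times 10^3$ one holds only $10^6$ doubles, roughly 8 MB, which is small enough to broadcast to every node.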


III. Decomposition and Aggregation

We know that getting a smaller $B$ is more beneficial, but how shall we actually compute $A^T \times A$? We know Spark cuts $A$ into many chunks and stores them separately on different machines. Let's call these chunks of $A$ $A_1, A_2, A_3, A_4 \ldots A_m$. Note that these chunks have the same number of columns (i.e., $10^3$) but may have different numbers of rows; in other words, their chunk sizes may differ.

Now, let me ask you another question: "Choose any $i \leq m$; what is the size of $A_i^T \times A_i$?" Yes, they are all the same, namely $10^3 \times 10^3$. Each of these is just a partial result of the final $B$. To calculate the final $B$, we just need to aggregate these partial results and sum them up as below:

$B = \sum_{i=1}^m A_i^T \times A_i$

After having $B$, getting the final result $X$ is trivial. We just need to broadcast $B$ to every node, then finish the calculation on every chunk as below:

$X_i = A_i \times B$

Assemble all the $X_i$ (more precisely, concatenate them in order) and you have $X$.
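To make the aggregation concrete, here is a minimal PySpark sketch of the whole pipeline. It is my own illustration rather than the reference solution: the input path and variable names are made up, each line of the input file is assumed to hold one space-separated row of $A$, and each RDD element is a single row, so the per-chunk partial result degenerates to an outer product.

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="AtA-example")

    # Each RDD element is one row of A (a length-10^3 NumPy vector).
    rows = sc.textFile("hdfs:///path/to/A.txt") \
             .map(lambda line: np.array([float(x) for x in line.split()]))

    # B = A^T * A, accumulated as a sum of small 10^3 x 10^3 partial products.
    B = rows.map(lambda r: np.outer(r, r)).reduce(lambda x, y: x + y)

    # Broadcast the small B to every node and finish X = A * B row by row.
    B_bc = sc.broadcast(B)
    X = rows.map(lambda r: r.dot(B_bc.value))   # each element is one row of X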


IV. Getting Diagonal Elements

I was actually a bit hesitant about whether to include the second question at all; I thought it might be too easy for the students. However, the truth is that most of the students actually tried to compute the complete matrix $A \times A^T$, which is so big that it cannot even fit into normal memory, and then extract the diagonal elements.

When the students brought such solutions to me and complained about memory issues, I just asked them to do one thing: "Can you tell me the definition of the diagonal elements and describe how you calculate them?"

They usually started with -- "Well, the first row of $A$ (dot) multiplied by the first column of $A^T$ ...".

Then I continued to ask -- "Then what is the first column of $A^T$?"

Suddenly, the students got the point -- "Haha, it is still the first row of the $A$!!!"

So, you do not need to calculate the whole matrix in order to get the diagonal elements. You just need to process the data set line by line; each row gives you one diagonal element:

$X_{i,i} = \sum_{j=1}^{10^3} A_{i,j}^2$
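Following the same hypothetical PySpark sketch from above, the second question collapses to a one-liner: each row of $A$ only has to be dotted with itself.

    # diag(A * A^T): one element per row of A, no big intermediate matrix needed.
    diag = rows.map(lambda r: float(r.dot(r)))   # X_{i,i} = sum_j A_{i,j}^2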


V. Common Issues

As with the previous [median example], this set of questions was also very successful. The students quite liked the problem design, even though very few of them actually got all the solutions right. One common issue was that the students tried very hard to gain precise control over how the matrix/data set is partitioned in Spark. Controlling how the data is divided is not impossible, but it is unnecessary for solving the problem, provided you have sufficient knowledge of linear algebra and matrix operations.


Read this series: The Spherical Cow in System Science - Part I, Part II, Part III

Friday, March 06, 2015

The Spherical Cow in System Science - Calculating Median Value in Big Data Sets (Part II)

As system students of the 90s, we remember that the classic distributed systems course usually made strong assumptions about the heterogeneity of nodes' geography and configurations (regarding both software and hardware). Many universities used Tanenbaum's classic textbooks, and the distributed algorithms taught in class mostly dealt with fault tolerance and consistency, focusing on a point-to-point communication paradigm. These algorithms are still very valuable and widely used in various systems. However, industry forces have shifted the paradigm towards more centralized, more homogeneous, high-bandwidth, high-performance computer clusters. Or, to use the more widely known and hyped name, cloud computing.

To catch up with the trend, we introduced big data frameworks into the DSP course (check my previous blog) in 2014 and 2015. We chose Hadoop in the 2014 course and "upgraded" to Spark in 2015, but the questions remained more or less the same. Compared with the previous example I introduced, this set of exercises turned out to be a big success. So what did we actually ask the students to do?


I. Finding A Needle in A Haystack

The power of big data frameworks lies in quickly processing and analyzing huge data sets by taking advantage of the computational and storage resources in high-performance computer clusters. In the question set, we gave the students a data set with a very simple structure: just one double number per line. The data set contains 10 billion lines. To generate the data, I used a mixture of Weibull and Gaussian distributions. Technically, this is not really "Big Data", but I think it is plenty for the students to grasp the idea. We asked the students to provide the following statistics of the data set.

  • Minimum value
  • Maximum value
  • Mean value
  • Variance value
  • Median value

By the way, we did not allow the students to use the default functions provided by the framework. We asked them to implement the functions themselves using the map-reduce paradigm (see the figure below). Some students did complain about reinventing the wheel, but in my opinion, as a system engineer, you should know what happens behind the scenes. To help the students get started, I provided example code for calculating the minimum value.

A figure summarizes the map-reduce-based data-parallel computing paradigm. 
Getting the maximum is very easy: simply tweak my example code. Map-reduce is very good at handling data-parallel computation. A data set is divided into chunks, distributed over multiple nodes, and then processed in parallel. The minimum of the minimums (of all the chunks) is the minimum of the whole data set, and the same applies to the maximum. However, the students needed to pay a bit more attention when dealing with the mean value, since the mean of the per-chunk means is simply not the mean of the whole data set. It turned out some students did make this silly mistake.
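To illustrate the correct way around the mean pitfall, here is a minimal PySpark sketch, assuming `data` is an RDD with one double per element (the names are mine, not the course's reference code): each chunk contributes a (sum, count) pair, and adding the pairs gives the global mean.

    # Combine (sum, count) pairs instead of averaging the per-chunk means.
    s, n = data.map(lambda x: (x, 1)) \
               .reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    mean = s / n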

The variance needs even more thinking. We know there are (at least) two ways to calculate it:

  1. $Var(X) = E[(X - \mu)^2]$
  2. $Var(X) = E[X^2] - (E[X])^2$

The first method is familiar to almost everyone, and indeed most students chose to use it. However, it needs to iterate over the data set twice, whereas the second method only needs a single pass. It is worth noting that iteration used to be expensive, especially for Hadoop in the early days. Though Spark has improved things quite a lot, scheduling still contributes a significant time overhead, so reducing the number of iterations is still a good idea when dealing with big data.
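The same (sum, count) trick extends to the one-pass variance. A sketch on the same hypothetical `data` RDD:

    # Var(X) = E[X^2] - (E[X])^2 computed in a single pass over the data.
    s, s2, n = data.map(lambda x: (x, x * x, 1)) \
                   .reduce(lambda a, b: (a[0] + b[0], a[1] + b[1], a[2] + b[2]))
    variance = s2 / n - (s / n) ** 2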


II. Handling Order Statistic

Things turned out to be extremely interesting when the students started calculating the median value. Similar to the mean, the median of the per-chunk medians is not the median of the whole data set. The median is one of the order statistics, like quantiles, percentiles, deciles, etc. By definition, the median separates a data set into two equal parts. Technically, you need to sort the data from the smallest to the largest and choose the middle one (breaking the tie if necessary). However, this conventional way of getting the median is not really suitable for large data sets.

Can you imagine sorting a data set of hundreds of GB, or TB, or even PB? First, you simply do not have enough memory to do that on one node. Second, you will introduce huge traffic if you try to do it in a distributed way. Well, some of the students did try to sort the data, then quickly got frustrated. At that point, they realized they needed to think differently in the context of big data. What's more, they also realized that map-reduce may not be as straightforward as they thought.

The students eventually came up with various solutions, which can be categorized as below:

Median of medians - this is an approximation algorithm which gives a quick estimate of the median. But you still need to sort the data set at least partially, and you only get an estimate, which is not good. For the algorithmic details, please check the [wiki page].

Random sampling - this method takes a random sample from the data set and uses the median of the sample to estimate the median of the population. It may work well on data following a well-behaved distribution such as Gaussian, uniform, etc. However, imagine a data set wherein the first half of the numbers are 1 and the latter half are 100: the median of almost any random sample will be 1 or 100, nowhere near the true middle of the data, so the random sampling method is doomed to fail with high probability.


Binary search - if we think about the definition of the median, it is not difficult to notice that the median is bigger than 50% of the data and smaller than the other 50%, because this is simply its definition. So we can just guess a number and check how many data points fall on the left (i.e., are smaller than the guess) and how many fall on the right (i.e., are bigger). Then we know whether we under- or over-estimated the median, and we can narrow down the range by iterating this search until we find the true median.
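A rough sketch of the binary-search idea, assuming `data` is a PySpark RDD of doubles and that we already know an initial range [lo, hi] containing the median (e.g., from the min and max computed earlier); this is my own illustration, not a student solution.

    def median_binary_search(data, lo, hi, tol=1e-9):
        total = data.count()
        while hi - lo > tol:
            guess = (lo + hi) / 2.0
            below = data.filter(lambda x: x < guess).count()   # one pass per iteration
            if below > total // 2:
                hi = guess    # over-estimated: the median lies to the left of the guess
            else:
                lo = guess    # under-estimated (or exact): the median lies to the right
        return (lo + hi) / 2.0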


Buckets/Histogram method - binary search is already very close to the correct solution. But let me ask another question: how many iterations does binary search need to find the actual median? Easy, it is just $\log_2(N)$, where $N$ is the size of the data set. For those students who successfully reached the binary search solution, I asked: "Can you further improve the performance by reducing the number of iterations?"

Of course you can! In some sense, binary search uses two buckets and counts the number of data points in each bucket. We can certainly use more buckets and apply the same algorithmic logic, which narrows down the range where the median falls much faster. E.g., if we use $b$ buckets, then we only need $\log_b(N)$ iterations, so feel free to choose $b = 1000$ or $b = 10000$; the introduced traffic overhead is negligible. This method also has another name: the histogram method. The figure below sketches the crude idea, and there are other materials on the Internet about this method, e.g., this YouTube video [here].
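The bucket generalization, sketched under the same assumptions as the binary-search snippet above (a PySpark RDD of doubles and a known initial range): with $b$ buckets, each pass over the data narrows the candidate range by a factor of $b$.

    def median_histogram(data, lo, hi, n_buckets=1000, tol=1e-9):
        total = data.count()
        target = total // 2                       # rank of the median (ignoring tie-breaking)
        below_lo = 0                              # number of values known to be smaller than lo
        while hi - lo > tol:
            width = (hi - lo) / n_buckets
            # One pass: count how many values fall into each bucket of [lo, hi).
            counts = data.filter(lambda x: lo <= x < hi) \
                         .map(lambda x: (min(int((x - lo) / width), n_buckets - 1), 1)) \
                         .reduceByKey(lambda a, b: a + b) \
                         .collectAsMap()
            acc = below_lo
            for idx in range(n_buckets):
                c = counts.get(idx, 0)
                if acc + c > target:              # this bucket contains the median
                    lo, hi = lo + idx * width, lo + (idx + 1) * width
                    below_lo = acc
                    break
                acc += c
        return (lo + hi) / 2.0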



III. Do the Students Like This Exercise?

It turned out this exercise set was extremely successful. From a teaching perspective, the exercise only uses simple statistics which every student can understand with very little mathematical background. The wonderful part of the exercise is that all these statistics look similar and straightforward in the mathematical sense; however, from a (data) engineering perspective, you need to treat them completely differently. The exercise set reminds the students that porting classic algorithms into the big data context may not be as straightforward as they originally thought.


In the actual teaching, all the students solved min and max without any problem. Surprisingly, over 60% of the students still chose the slower way to calculate the variance. For the median, it is even more striking: over these two years, only three students independently came up with the histogram method without any help from me. Regarding the students' attitude, in general they liked the question set very much; the main reason, I believe, is their strong curiosity about the hyped big data.

Read this series: The Spherical Cow in System Science - Part I, Part II, Part III

Sunday, March 01, 2015

The Popular Abstract of My PhD Dissertation

In less than one month's time, I am going to have my doctoral defense in Helsinki, Finland. In order to register my thesis in the university's E-thesis system, I need to provide two different versions of the abstract. One is the scientific abstract, which is mainly for experts working in related fields; the other is the popular abstract, which targets the public with no prior knowledge of the subject and must be written in Finnish or Swedish. I am not sure whether this is a unique invention in Finland, but asking a non-Finnish speaker to provide such an abstract without the university offering enough help seems a bit ridiculous.

 


Anyway, thanks to Maria's great help, I got the popular abstract done, though at some point I really doubted whether she knew what she was doing. :D Interestingly, having no field knowledge at all turned out to be a great advantage when Maria was writing the popular abstract for me. You can find the two abstracts below.

The Scientific Abstract

In-network caching aims at improving content delivery and alleviating pressure on network bandwidth by leveraging universally networked caches. This thesis studies the design of cooperative in-network caching strategies from three perspectives: content, topology and cooperation, focusing specifically on the mechanisms of content delivery and cooperation policy and their impacts on the performance of cache networks.

The main contributions of this thesis are twofold. From the measurement perspective, we show that the conventional metric, hit rate, is not sufficient for evaluating a caching strategy on non-trivial topologies; therefore we introduce footprint reduction and coupling factor, which carry richer information. We show that the cooperation policy is the key to balancing various tradeoffs in caching strategy design, and we further investigate the performance impact of content per se via different chunking schemes.

From the design perspective, we first show that different caching heuristics and smart routing schemes can significantly improve caching performance and facilitate content delivery. We then incorporate a well-defined fairness metric into the design and derive the unique optimal caching solution on the Pareto boundary within a bargaining game framework. In addition, our study of the functional relationship between cooperation overhead and neighborhood size indicates that collaboration should be constrained to a small neighborhood, since its cost grows exponentially on general network topologies.

The Popular Abstract

Verkonsisäinen välimuistitallennus pyrkii parantamaan sisällöntoimitusta ja helpottamaan painetta verkon siirtonopeudessa hyödyntämällä universaaleja verkottuneita välimuisteja. Tämä väitöskirja tutkii yhteistoiminnallisen verkonsisäisen välimuistitallennuksen suunnittelua kolmesta näkökulmasta: sisällön, topologian ja yhteistyön kautta, erityisesti keskittyen sisällöntoimituksen mekanismeihin ja yhteistyökäytäntöihin sekä näiden vaikutuksiin välimuistiverkkojen performanssiin.

Väitöskirjan suurimmat aikaansaannokset ovat kahdella saralla. Mittaamisen näkökulmasta näytämme, että perinteinen metrinen välimuistin osumatarkkuus ei ole riittävä ei-triviaalin välimuistitallennusstrategian arvioinnissa, joten esittelemme parempaa informaatiota sisältävät jalanjäljen pienentämisen sekä yhdistämistekijän. Näytämme, että yhteistyökäytäntö on avain erilaisten välimuistitallennusstrategian suunnitteluun liittyvien kompromissien tasapainotukseen ja tutkimme lisää sisällön erilaisten lohkomisjärjestelmien kautta aiheuttamaa vaikutusta performanssiin.

Suunnittelun näkökulmasta näytämme ensin, kuinka erilaiset välimuistitallennuksen heuristiikat ja viisaan reitityksen järjestelmät parantavat merkittävästi välimuistitallennusperformanssia (Did you see how long this word is?!?! Oh, dear, Finnish!) sekä helpottavat sisällön toimitusta. Sisällytämme sitten suunnitteluun hyvin määritellyn oikeudenmukaisuusmittarin ja johdamme uniikin optimaalin välimuistitallennusratkaisun Pareto-rintamalla neuvottelupelin kehyksissä. Lisäksi tutkimuksemme yhteistyökustannusten ja naapurustokoon funktionaalisesta suhteesta viittaa siihen, että yhteistyö on syytä rajoittaa pieneen naapurustoon sen kustannusten kasvaessa eksponentiaalisesti yleisessä verkkotopologiassa.


P.S. If you happen to have any interest in information-centric networking, you can find my thesis online via [this link].

The Spherical Cow in System Science - Teaching Distributed System Course (Part I)


I. What Do We Want to Teach Students in System Courses?

I have been the teaching assistant of the Distributed System Project course, which is one of the compulsory courses for every master's student specializing in systems and networking, for over five years at the University of Helsinki. This is my last year handling the course before moving to the University of Cambridge. Over these years, I learnt quite a lot by interacting with different students in class. Since I designed most of the exercises for the course and graded all the students' solutions, I have first-hand experience of how the students thought about the teaching and the content. I feel obliged to write down some of my thoughts and share the experience with current and future teaching assistants and staff.



The Distributed System Project was the first course I was ever involved in teaching. I independently taught it in 2013 while my PhD supervisor was away on sabbatical in the U.S. Because it is a project course, the core philosophy is "learning by doing": we do not really need to prepare slides to lecture the students in class. Instead, we try to let the students gain hands-on experience by finishing some small projects. Therefore, the course assignments need to be carefully designed so that the students can apply the abstract theories (i.e., the spherical cow mentioned in the title) they have learnt in the distributed systems course in a more concrete scenario. You can find all the information about the course via the following links:


The exercises are generally divided into two categories. The first category usually covers the classic distributed algorithms (e.g., Lamport clocks, vector clocks, election algorithms, etc.), which are quite trivial and do not require much effort to design. The second category is much more challenging, since we try to make the students take a system-level view when solving the problems. Some of the questions were very successful, some were not. In this series of posts, I will pick some examples and discuss them one by one.

About the title: assuming you know the joke, my purpose should be quite obvious. So let me safely skip explaining the title and simply start with our first example.

II. Example I - The Migration of Complexity in the Distributed System Design

This was the exercise we gave in both 2011 and 2012. The exercise is very straightforward; the detailed assignment description can be found [here]. In short, we simply asked the students to implement a naive calculator application which can plot the sine function using the client/server model. We chose to use a web browser, AJAX and a RESTful API to avoid making the students develop everything from scratch. We also specified that every request to the service can only carry one operator and two operands.

However, the story did not end there. We asked the students to implement three versions of the aforementioned calculator under three different cases, as follows:

  • The server is smart enough to know how to calculate the sine function. The client only plots the figure.
  • The server is stupid and only plots the figure for the given points. The client is smart enough to know how to calculate the sine function.
  • Both the server and the client are stupid. The server only knows the four arithmetic operations +, -, *, /, and the client only knows how to plot.

I think you have probably already got the idea. Cases 1 and 2 actually represent the evolution of our computation model over the past several decades. In the early days of computer systems, we had a really "powerful" server (powerful in a relative sense) which took care of all the computation. A user only needed a terminal to connect to the server. A terminal, in some sense, is just a naked and stupid client responsible for submitting jobs and displaying the results. In technical terms, we call this the thin client and fat server model. This model is reflected in our first case.

As time went by, functionality started shifting to the client side. CPUs became much faster, storage grew much bigger, and prices dropped a lot. More and more features were added to the client side. As PCs became more powerful, applications also grew in complexity, and more of them started running on the client side to improve the user experience. This shift in usage patterns eventually led to the fat client and thin server model. Nowadays, even a moderate smartphone is more powerful than a mainframe from decades ago. Apparently, our second case tries to capture this model.

It is very interesting to notice that the hyped cloud computing is shifting computation back to a centralized entity again. Note that the entity here can refer to a cluster of computers instead of a standalone server. The key enabler of this trend is virtualization technology. People realized that horizontal scalability is a more feasible solution from both an economic and an engineering perspective.

III. Do Not Forget the Communication Channel Is Also An Integral Part of A Distributed System!

However, our third case tries to capture another (maybe a bit unfortunate) scenario wherein both the server and the client are stupid. In such a case, how can a client collaborate with a server supporting only the four basic arithmetic operations +, -, *, / to compute the sine function? One solution is to use a Taylor series to approximate the sine function. Because we explicitly required the solution to reach a certain precision, the students had to figure out the minimal degree needed in the polynomial. When solving this case, we can clearly see that the complexity migrates to the communication channel: the client and the server need to collaborate over many rounds in order to finish the computation task.
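As a small illustration of the idea (my own sketch, not the model solution), the client can build sin(x) term by term, where each new term only needs one multiplication and one division that the "stupid" server can perform, and stop once the next term drops below the required precision. In the actual assignment every such elementary operation would be a separate request, so the number of round trips grows with the required precision.

    def sin_taylor(x, eps=1e-6):
        # sin(x) = x - x^3/3! + x^5/5! - ...
        # Each term is derived from the previous one using only *, / and -.
        term, total, k = x, x, 1
        while abs(term) > eps:
            term = -term * x * x / ((2 * k) * (2 * k + 1))
            total += term
            k += 1
        return total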

I am pretty sure you have already got the big picture of this assignment. In the first case, the server takes care of the (computation) complexity, while the client handles it in the second case. However, if neither can deal with the complexity, the work needed to finish the computation still has to go somewhere, so it goes into the communication channel in our third case. The exercise reminds the students that the communication channel (usually the network) is also an integral part of a distributed system design.

IV. Is This Really A Good Exercise?

Regarding this exercise, the comments from my colleagues were quite positive. However, the actual feedback from the students in class was rather bad. Most students failed to grasp the core idea of the exercise and simply thought we had asked them to develop a naive AJAX application. I even got an angry complaint from a student saying "I am not coming here to become a web developer!" Oh, well, you know what? The web is actually the biggest distributed computer system in the world nowadays (excluding the mobile devices ;-) ). Besides, being a web developer is not a bad idea at all in the first place :D

All in all, we finally realized that, without explicitly explaining the purpose of this exercise, the students could not really grasp its idea. However, if we gave everything away up front, there would simply be no fun left in solving the problem. Eventually, we dropped this exercise after its two trials in 2011 and 2012, which was really sad in my opinion!

Read this series: The Spherical Cow in System Science - Part I, Part II, Part III