Saturday, March 11, 2017

Build a Neural Network from Scratch in 60 lines of OCaml Code

People have been asking me about the current development state of Owl (a numerical library in OCaml). Well, I think it is about time to show what Owl can actually do at the moment with its newly added AD (Algorithmic Differentiation) module.

I will demonstrate how to build a small two-layer neural network from scratch to learn the hand-written digits in the MNIST dataset. First, open `utop`, load the `Owl` library, then type `Dataset.download_all ();;` to download all the datasets used in the example.

The following code snippet defines a simple two-layer neural network with `tanh` and `softmax` as the activation functions for the first and second layers respectively. Remember to open the `Owl` and `Algodiff.AD` modules.


Defining a network seems trivial, but how about the core component of all neural networks: back propagation? It turns out that writing back propagation in Owl takes about a dozen lines of code. Well, actually 12 lines of code in total :)


The reason for this brevity is that algorithmic differentiation is a generalisation of back propagation. The `Owl.Algodiff` module relieves us from manually deriving the derivatives of the activation functions, which is a laborious and tedious task.
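To make the "generalisation of back propagation" point a bit more concrete, here is a minimal sketch of reverse-mode differentiation on scalars. Note that this is not Owl code and says nothing about how `Owl.Algodiff` is implemented: it is written in Python purely for illustration, and the names `Var`, `tanh` and `backward` are hypothetical. The idea is that every operation records its local derivative, so the derivative of `tanh` is written once and never derived by hand again for each network.

```python
import math

class Var:
    """A scalar that remembers how it was computed (hypothetical helper, not Owl)."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def tanh(x):
    t = math.tanh(x.value)
    # d tanh(x) / dx = 1 - tanh(x)^2, recorded once and reused everywhere
    return Var(t, ((x, 1.0 - t * t),))

def backward(output):
    """Apply the chain rule from the output back to every input."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)  # parents are appended before their children

    visit(output)
    output.grad = 1.0
    for node in reversed(order):  # output first, inputs last
        for parent, local in node.parents:
            parent.grad += node.grad * local

# One "neuron": y = tanh(w * x + b)
w, x, b = Var(0.5), Var(2.0), Var(-0.3)
y = tanh(w * x + b)
backward(y)
print(w.grad, b.grad)  # dy/dw and dy/db, obtained without any manual derivation
```

Back propagation in a neural network is the same procedure applied to matrices instead of scalars, with the loss as the output node.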

Now, you can use the following code in `utop` to train the model then test the model on the test dataset.


You should be able to see the following output in your terminal. It seems this small neural network works just fine. E.g., our model predicts the following hand-written digit as 6, correct!



How about more complicated ones such as convolutional networks, recurrent neural networks, and so on? Well, you can either define them yourself with the `Owl.Algodiff` module, or wait for me to wrap everything up and add a new module to Owl specifically for neural networks.

In general, `Owl` just makes my life so easy when dealing with these numerical tasks in OCaml. I hope you also find it useful.

Saturday, March 05, 2016

Manage My Supervisions at Cambridge University


Undoubtedly, the University of Cambridge is famous for its college and supervision systems. Both are quite unique, and you probably cannot easily find close equivalents elsewhere. The longer you stay in Cambridge, the more attractive you will find this amazing system. In some sense, it leaves a strong Cambridge label on its students and sets them apart from other students. Sometimes, I must admit that I am more impressed by the undergraduates I supervised than by the PhD students (especially those who did not do their undergraduate degree here).

The supervision usually takes place within a small group of two or three students with a supervisor. The form of supervision can be flexible and versatile, but there is usually a lot of discussion between the supervisor and supervisees. Even though I had long experience as a TA (Teaching Assistant) at the University of Helsinki, I must say the teaching in Cambridge is much more challenging and also much more fun. Of course, it also has financial benefits since the colleges pay you extra for your teaching.

I was lucky enough to get affiliated with Queens' College (one of the oldest colleges) just a couple of months after arriving in Cambridge, which gave me an invaluable chance to look into this system. Moreover, I also started doing supervisions soon after the Michaelmas term started. Here, I will share some of my experience of doing supervisions at the University of Cambridge, which might be useful for other newly joined postdocs, especially those who were not undergraduates here.

The university does provide a decent system (CamCORS) for writing final reports for students. However, in practice there is no useful system to help with scheduling supervisions, so a supervisor and his students need to handle the scheduling all by themselves. In Michaelmas term, when I was taking care of 20 students (7 groups), I was quickly overwhelmed by the email exchanges needed to agree on suitable times. In this Lent term, I need to supervise 27 students (10 groups) on two topics (Information Retrieval & Operating Systems), so I definitely need to come up with something smarter.

Therefore, I wrote the following email and sent it to my students. Haha, now it is their job to do the scheduling and compete for the "CPU resources" themselves. So far, this works pretty well and has saved me a lot of hassle.


Hello all,

I made a Doodle poll which provided a list of all my available time slots in this Lent term. These time slots can be the candidates of our supervisions in the following weeks. To reduce hassles and email exchanges, let's schedule together using the following rules (essentially a distributed algorithm with relaxed contention avoidance ;-)

1. Each supervision group makes one poll. Do not make another separate poll if your group already made one. So for all the supervisions of a given course, each group should only take one row in the poll.

2. When making the poll, the group name should follow the following convention: COURSE_NAME(crsid_1; crsid2; crsid_3) so that I can know which course and which group it is. We have two courses: please use OS for Operating System and IR for Information Retrieval. Here is an example: OS(lw525; mbt11; szuh233)  another example: IR(tzx234; xrt233; bgh512)

3. For OS, we have three supervisions so you need to mark three time slots in all the available ones (try to distribute them evenly in three weeks). You need to discuss with your group members first to agree on a time before filling the poll. Do not mark multiple time slots for one supervision. For IR, we have two supervisions, the same rule applies.

4. If you notice that a time slot is already taken by another group, do NOT mark it anymore. Essentially, each column in the poll should NOT have more than one reservation. If contention does occur (less likely), run your randomisation or back-off protocol to resolve the contention ;-)

Here is the link to the poll —> [link]

I hope you can finish the poll as soon as possible to make all of our lives much easier. If you have any questions, please do ask me. Thank you very much.

Best regards,

Liang Wang

Computer Laboratory




Sunday, May 03, 2015

Working Environment & Personal Career - My Fourth Week in Cambridge

My fourth week in Cambridge was filled with all kinds of work as usual. However, the time spent on life-related or administration-related stuff has been significantly reduced, so I have much more time for pure research and reading books.

Looking back at what I have done lately: got the scandex paper accepted and gave a talklet in the Lab; finished the scoped-flooding draft; finished all the simulation work on different topologies; set up the website for the N4D Lab; set up the website for ACM DEV; finalised the user scenarios in the UMobile project; got familiar with the docker system and started the system design; even finished a history book, and so on. Working efficiency seems very high here in the Computer Lab. I do not know if it is because of pressure or because I simply feel happy to work more.

The atmosphere in the Lab is really nice. The Computer Lab at Cambridge University is the biggest computer science department I have ever seen, and is well known for its System Research Group. In terms of working environment and infrastructure, however, I must admit the University of Helsinki certainly wins, maybe because of its much smaller size. As we (system scientists) know, scalability issues generally apply to every system. The University of Helsinki is so generous to its employees that you can hardly find a competitor for it. However, having good infrastructure is not a sufficient condition for being a top university. In most cases, "software" can be as important as "hardware", and we researchers are the software that continuously contributes to the university's research outcomes and builds up its reputation.

In my opinion, a good lab/institute should at least have the following characteristics:

  • Trust - You simply cannot work (efficiently) in an uneasy environment. Pressure and tension kill the creativity.
  • Fairness - Reward people for their achievements, no matter how big or small. Being mean can simply kill people's motivation.
  • Collaboration - A moderate level of competition between groups, or even within a group, is fine. However, collaboration is more important since nowadays excellent work can hardly be done by one person.
  • Promotion - it does not necessarily mean an increase in salary or change in position especially in academia. It is more about opportunities. Well, it might be a big problem for Finnish universities, and hopefully not for here.
  • Flat - having a flat structure makes communication very efficient. However, it is not going to be the case in Cambridge due to its super long history. But the good thing is that the university is at least very liberal and democratic.
  • Be nice - Last but probably the most important one. Good universities (should) never treat their researchers like slaves, especially those with a high throughput of incoming and outgoing scholars. Building up a network is the key to boosting both personal careers and the institution's reputation.
So far, the Computer Lab seems to be doing a great job from my own perspective. I do hope my following years here are going to be productive, and more importantly - happy. :)

My first talklet on service caching in the Computer Lab of Cambridge University. The audience were quite nice and gentle to me during the talk, probably because I am new and junior in the Lab :)

Saturday, April 25, 2015

Walking Around in the City and History - My Second and Third Week in Cambridge

Three weeks have passed. I bought a bike, got a SIM card, opened a bank account, received my first month's salary, and even made a business trip to meet UCL folks in London. I think I have really settled down in Cambridge, except that I occasionally apply very bad Scandinavian humour at the wrong moment. I think I am getting more and more familiar with everything here, even with that peculiar accent.

Besides always being overwhelmed with research work, I finished a very thick book on British history. The book is in Chinese, almost 400 pages long, and covers the history from 6000 BC to 2000 AD. Due to this large time span, the book cannot really dig into details. Luckily, the author is a very good storyteller. Instead of stating "cold" facts, she narrated a lively and coherent story. It is quite funny that history used to be the subject I hated most when I was in middle school and high school. I remember clearly that I complained to my mom every day about why I had to remember so many names and why we had such a long history. However, things started changing as I got older. I realised you simply cannot get rid of history, and history always repeats itself. I sometimes made fun of myself in front of my colleagues -- "Well, you know, maybe it is because I am becoming part of the history now ... fading away ...."

Except for the Friday happy hour (which is a long tradition in the Computer Laboratory at Cambridge University), when I usually go and grab a pint, I did not really visit pubs or bars. But I did visit the Fitzwilliam Museum and King's College with its famous Chapel. Recall that I told you about the importance of your university card in my previous post. Because of the card, I am not only able to visit these places for free but can also take two guests free of charge. Lovely!


Fitzwilliam museum and a corner in King's college.









Magnificent view of the King's College and its famous Chapel.

People are punting on the famous river Cam in such lovely weather.

I certainly have to mention my first visit to the Cambridge University Library. In total, Cambridge University has over one hundred libraries, and every faculty may also have its own library. The main university library is very close to the lab, less than five minutes by bike. To get into the library, you need a university card. I think the public can also get in as long as they apply for a card at the registration office. I was completely impressed by the historical atmosphere the moment I set foot in the hall. It is such a sharp and strong contrast to the Kaisa library and Kumpula library, which are exemplars of modernism and functionalism in Scandinavian design.




Cambridge University main library, feels like being in a Harry Potter movie.


Last thing, a side note, the water in Cambridge is really really hard. The moment I opened the kettle lid in the kitchen, I was impressed again, like I was impressed before by the library. I made fun of my British colleagues - "Are you growing (sea) fish in the kettle, this definitely looks like coral ...." However, I have been drinking it anyway since all others are drinking it. You know, "When in Rome, do as the Romans do".


Sunday, April 19, 2015

Settling Down into Cambridge

During these four years struggling with my PhD research, I had always been planning to take a long holiday after I got my PhD hat: a holiday of at least three months, without doing anything, just keeping my brain blank and relaxed, or maybe travelling around Europe. But life always surprises you. After I passed my PhD defence, I got only about a one-week break, then hastily moved to Cambridge, UK to start my postdoc research in the Computer Laboratory.



Finding accommodation in Cambridge is not difficult, but finding a good place at a reasonable price is quite tricky. Eventually, I got a bedroom with a family, and the rent costs me about 550 EUR per month. The downside of living with a family is obvious -- privacy. Since I am not very social and my life is rather simple (maybe even a bit tedious and boring), privacy is not really a problem for me. On the other hand, the advantage of living with a family is also significant: you always have someone to ask, and you get familiar with the local society faster. My host family, Mr and Mrs Kell, are typical middle class, gentle and kind. Mr Kell gave me a lift when I first arrived in Cambridge, and they even kindly invited me to their Easter dinner.



The first week is always difficult. I was totally overwhelmed with all kinds of administrative and work-related stuff. Though the HR office in the Computer Laboratory did a great job in helping me settle down, there were still a lot of things that were not written in the instructions but that I had to handle. In the following, I will list some important things you might want to know before moving here, so that your migration will be much smoother than mine.



Transportation

Cambridge is not big at all, especially the university area. In fact, you can go everywhere on foot if you do not mind a 30- to 40-minute walk. Walking is definitely OK in the first couple of days. But eventually, you will need a bike. There are many cyclists in the city, but some roads do not really have a 'well-defined' bike lane. So you need to be very careful, especially before you get familiar with the town. My landlord told me local drivers do not really like cyclists. As for the bike, I got mine from a shop in King's street for 135 pounds with a three-month guarantee. It is second-hand but seems to be in good condition. Later, I noticed you could also get pretty good and cheap ones on Amazon UK.



Bank Account

If you come to Cambridge to work as I did, a bank account is definitely the first thing you want to have. The university needs your bank account number in order to pay your salary. I almost missed the deadline for my first month's payroll. In order to open an account, you need to have a current address in the UK. The department will issue a document stating your current address and confirming that you work for the university. Besides, it is better to have a UK phone number. Which bank to choose? They are more or less the same. I chose the one recommended by both my landlord and the department -- Barclays.


National Insurance Number

You need to apply for a national insurance number if you want to use the local health service. First, you need to call the centre for a short interview, then they will send a paper application form by ordinary post. You can find their phone number [here]. Remember you definitely need a UK phone number before calling them, since they require it for the registration. The paper application form will arrive within 10 working days. Fill in the form, then send it back. There is a pre-paid envelope in the application package, so you don't have to pay. This part is pretty easy.


UK Mobile Phone Number

This part really depends on what kind of user you are. If you are not an active user, you can just order Pay as You Go, which does not have any monthly fee. As for me, I am a heavy Internet user, so I chose giffgaff, which is considered a cheap option in the UK. Of course, BT is also good in the sense that they have a lot of wifi hotspots in the city. You can always use BT's free wifi if you are a subscriber. In general, the price is not very high, but it is definitely cheaper in Finland. :)


University Card

You will get the card from the HR, so you don't need to do anything except emailing them a nice photo of you before coming here. Note that this is not just a card to let you open the door, you will realise how useful it is in the future. As my landlord told me, Cambridge university is (arguably) the most powerful and influential entity in this town. ;-)


Then What Now?

Well, it totally depends on you now. For me, I will focus on my work, do interesting research and enjoy the life in the No.1 European university. :)



Wednesday, April 01, 2015

Book Review of "Apache Hive Essentials"

I am currently reviewing the book "Apache Hive Essentials" published by PACKT. PACKT is a pretty young publishing company based in Birmingham, UK. The company has a very interesting and agile publishing model called Print on Demand.

Given the hyped data science and big data framework buzzwords, the topic this book covers is definitely relevant and important to big data practitioners. The author appears to have a long and solid experience in the industry which gave him much practical knowledge on the subject. Having quickly skimmed through the book, my first impression is the book has a broad coverage of Apache Hive, ranging from the basic setup to security, data manipulation and the detailed explanation on the grammar, complemented with relatively straightforward examples.

My current feeling is that, as a thin book of 200 pages, it does quite a decent job. Of course, I will read the book more carefully later this month and then post a more detailed review on my blog.

Tuesday, March 24, 2015

The Philosophy of PANNS -- A Very Preliminary Evaluation & A Not-So-Fair Comparison (maybe)

Though I promised a long time ago [here] to perform a simple evaluation of PANNS, I only got it done today because of my overwhelming workload before moving to Cambridge.


I. Caveat

Before we start, I must clarify several things beforehand. The purpose is not merely trying to avoid criticism, but to emphasize the context and goal of making PANNS as a small tool.

First, frankly, a large part of the hyped data science deals with data engineering stuff. However, it is well-known that there is no silver bullet in the engineering world. It further means that a good engineer has to look into a specific application area to determine the proper tool. Different data sets may lead to different conclusions when you are evaluating your machine learning/data mining algorithms, as we will see in this post.

Second, I only evaluated the accuracy and the index size here. I admit that PANNS will be beaten miserably by other software in terms of index building time. However, we assume that the index will not be built frequently. Once the index is built, it will only be queried. Besides, the index should be as small as possible so that it can work on large data sets. Please check "The Philosophy of PANNS" for more details.

Third, I only compare with Annoy in this post. I am not saying the other candidates are bad. It is simply because such comparisons already exist and Annoy seems superior to the others in many aspects. In fact, Annoy can be an order of magnitude faster than PANNS. But you will also see how PANNS wins out in other aspects in the rest of this post.


II. Evaluation

The data set used in the evaluation is synthetic. Each data point in the data set is 2000-dimensional and follows a standard normal distribution $\mathcal{N}(0,1)$. The index contains 256 RPTrees. The accuracy and index size are measured, and the numbers presented are the average results of 50 experiments. The two tables below summarize our results, one for Euclidean similarity and one for Angular similarity. In most cases, I prefer tables to figures when presenting numbers, so I did not bother to plot the results.


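| Data Set Size    | 5000    | 10000  | 15000  | 20000  | 25000  | 30000  |
|------------------|---------|--------|--------|--------|--------|--------|
| Accuracy - Panns | 75.0 %  | 51.2 % | 39.2 % | 30.4 % | 27.2 % | 25.2 % |
| Accuracy - Annoy | 58.2 %  | 38.0 % | 29.4 % | 23.2 % | 20.6 % | 18.0 % |
| Index - Panns    | 29.6 MB | 59 MB  | 88 MB  | 116 MB | 149 MB | 174 MB |
| Index - Annoy    | 56.0 MB | 112 MB | 169 MB | 224 MB | 279 MB | 334 MB |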

Table 1. Comparison between PANNS and Annoy using Euclidean similarity.


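| Data Set Size    | 5000    | 10000  | 15000  | 20000  | 25000  | 30000  |
|------------------|---------|--------|--------|--------|--------|--------|
| Accuracy - Panns | 75.0 %  | 53.6 % | 36.0 % | 37.0 % | 27.0 % | 25.0 % |
| Accuracy - Annoy | 65.0 %  | 36.8 % | 26.4 % | 26.4 % | 19.2 % | 17.2 % |
| Index - Panns    | 29.6 MB | 59 MB  | 88 MB  | 116 MB | 149 MB | 174 MB |
| Index - Annoy    | 35.0 MB | 70 MB  | 93 MB  | 140 MB | 159 MB | 188 MB |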

Table 2. Comparison between PANNS and Annoy using Angular similarity.
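For readers who want to repeat this kind of measurement, the setup described above can be sketched roughly as follows. This is only a sketch under stated assumptions, not the script I used: I assume here that "accuracy" means the fraction of the true k nearest neighbours (found by brute force) that the approximate index returns, and the helper names (`make_dataset`, `exact_neighbours`, `accuracy`, `mean_accuracy`) are made up for illustration. The sizes are kept tiny so that it runs in pure Python; the real evaluation used 2000 dimensions and averaged over 50 experiments.

```python
import random

def make_dataset(n_points, dim, seed=0):
    """Synthetic data: every coordinate is drawn from N(0, 1)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_points)]

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_neighbours(data, query, k):
    """Brute-force ground truth: indices of the k closest points."""
    ranked = sorted(range(len(data)),
                    key=lambda i: squared_euclidean(data[i], query))
    return ranked[:k]

def accuracy(approx_ids, exact_ids):
    """ASSUMED metric: share of the true neighbours that were returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

def mean_accuracy(search, data, queries, k):
    """Average accuracy of `search(query, k)` over all queries."""
    scores = [accuracy(search(q, k), exact_neighbours(data, q, k))
              for q in queries]
    return sum(scores) / len(scores)

# Toy sizes only; plug an approximate index in place of `brute`.
data = make_dataset(200, 20, seed=1)
queries = make_dataset(5, 20, seed=2)
brute = lambda q, k: exact_neighbours(data, q, k)
print(mean_accuracy(brute, data, queries, k=10))  # exact search scores 1.0 by construction
```

To evaluate a real index, replace `brute` with a function that queries the index and returns the ids of the points it found.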


From both tables, the first thing we notice is the difference in index size. PANNS saves much more space when using the same number of RPTrees. In the case of Euclidean similarity, the PANNS index is only about half the size of the Annoy index. This benefit becomes even more noticeable when dealing with extremely large data sets and many RPTrees. From these results, we can understand why PANNS lost quite a lot in some evaluations where the index only had a small number of RPTrees.

In terms of accuracy, PANNS consistently outperforms Annoy in all cases in our evaluation, although the difference diminishes as the number of data points grows. Moreover, since PANNS uses much less space for storing the index, we can incorporate more RPTrees into the index for the same index size to achieve better accuracy. At that point, though, it becomes difficult to argue whether the comparison is still fair.
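As a quick sanity check on the "half the size" and "more RPTrees for the same index size" remarks, the numbers from the two tables can be worked through directly. The index sizes below are copied from Table 1 and Table 2. The tree-budget estimate is only a back-of-the-envelope figure: it assumes that index size grows linearly with the number of RPTrees, which I did not measure here.

```python
# Index sizes in MB for 5000 ... 30000 points, 256 RPTrees each (from the tables).
panns_mb = [29.6, 59, 88, 116, 149, 174]
annoy_euclidean_mb = [56.0, 112, 169, 224, 279, 334]
annoy_angular_mb = [35.0, 70, 93, 140, 159, 188]

def size_ratios(panns, other):
    """PANNS index size as a fraction of the other index, per data set size."""
    return [p / o for p, o in zip(panns, other)]

def trees_at_equal_size(n_trees, panns_size, other_size):
    """Trees PANNS could hold in the other index's footprint.

    ASSUMPTION: index size is proportional to the number of trees.
    """
    return int(n_trees * other_size / panns_size)

euclidean = size_ratios(panns_mb, annoy_euclidean_mb)
angular = size_ratios(panns_mb, annoy_angular_mb)

print([round(r, 2) for r in euclidean])  # all between 0.51 and 0.54
print([round(r, 2) for r in angular])    # between 0.82 and 0.95
print(trees_at_equal_size(256, panns_mb[-1], annoy_euclidean_mb[-1]))  # 491
```

So under that linearity assumption, the space Annoy needs for 256 trees on the 30000-point Euclidean data set would hold roughly 490 PANNS trees, which is exactly why the comparison stops being apples to apples.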

In terms of building time, it is true that Annoy is faster. However, I must point out this only holds when you use serial execution. PANNS provides parallel index building to take advantage of multiple cores on your workstation. In my case, it turned out PANNS is much faster than Annoy because I have 16 cores on my workstation. I hope this does not count as cheating ;-)

In terms of scalability, I also tried building the index from the English Wikipedia dump, which consists of approximately 3.8 million documents. PANNS was able to get the job done decently (though it took a long time), whereas Annoy always failed due to memory issues. However, I think further investigation is definitely needed.


III. Inconclusive Conclusion

Well, Erik already gave good summaries on "what to use in which context" in his post. I only provide some complementary advice in the following.

In general, PANNS generates a much smaller index without sacrificing accuracy. This becomes more important when you are dealing with large data sets and still want to incorporate as many RPTrees as possible to achieve satisfactory accuracy.

The future trend is parallel computing. We need to squeeze all the computational resources out of our devices. The slowness of PANNS can be effectively ameliorated by using parallel building and parallel query on multiple cores (even across the network). Please check out our paper on parallel RPTree building on Apache Spark [here].

We are still carrying on the research work on improving the accuracy of PANNS. As mentioned, this relates more to the data engineering stuff. There are even better algorithms which can outperform PANNS by 20% to 100%, of course at the price of even longer building time. In the future, we will gradually incorporate these algorithms with a balance between efficiency and accuracy. All in all, PANNS is going to remain as simple and compact as possible for teaching purposes, as we already mentioned in the [previous article].