Recent Articles

Deploying Machine Learning Model in Python Talk at PyCon HK 2018

Nov 23, 2018  

PyCon HK 2018 was held on 23-24th November 2018 at Cyberport. I gave a talk on how to deploy machine learning models in Python. The slides of the talk can be found at the link below: The video of the talk can be found on YouTube at


Generating N-grams from Sentences in Python

Jun 3, 2018  

N-grams are contiguous sequences of n items in a sentence. N can be 1, 2 or any other positive integer, although we usually do not consider very large N, because such long n-grams rarely appear in many different places. When performing machine learning tasks related to natural language processing, we usually need to generate n-grams from input sentences. For example, in text classification tasks, in addition to using each individual token found in the corpus, we may want to add bi-grams or tri-grams as features to represent our documents.
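As a rough illustration of the idea, n-grams can be generated with the standard library alone. This is a minimal sketch rather than the code from the full article: the function name `generate_ngrams` and the simple tokenization (lowercasing, then treating every character other than a-z, 0-9 and whitespace as a separator) are assumptions made for this example.

```python
import re


def generate_ngrams(sentence, n):
    """Return the n-grams of a sentence as space-joined strings."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Assumed tokenization: lowercase the text, replace every character
    # that is not a-z, 0-9 or whitespace with a space, then split.
    tokens = re.sub(r"[^a-z0-9\s]", " ", sentence.lower()).split()
    # Zip n staggered views of the token list; each tuple is one n-gram.
    staggered = (tokens[i:] for i in range(n))
    return [" ".join(gram) for gram in zip(*staggered)]


print(generate_ngrams("Natural language processing is fun.", 2))
# ['natural language', 'language processing', 'processing is', 'is fun']
```

A sentence with fewer than n tokens simply yields an empty list, so the same function can be called for several values of n when building a feature set.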


Gradient Boosting Talk at PyCon HK 2017

Nov 6, 2017  

PyCon HK 2017 was held on 4-5th November 2017 at the City University of Hong Kong. I gave a talk on using the LightGBM library to build gradient boosting models. The slides of the talk can be found at the link below: The video of the talk can be found on YouTube at


Deep Learning and Its Applications - Research Seminar at HSMC

Jul 21, 2017  

I gave a talk on deep learning and its applications in a research seminar at the Deep Learning Research & Application Centre (DLC), Hang Seng Management College on 20th July, 2017. The slides of the talk can be found at the link below:


Making pandas Operations Faster

Jul 8, 2017  

pandas is one of the most commonly used Python libraries in data analysis and machine learning. It is versatile and can handle many different types of data. Before feeding training data to a model, one would most probably pre-process the data and perform feature extraction on data stored in a pandas DataFrame. I have been using pandas extensively in my work, and have recently discovered that the time required to manipulate data stored in a DataFrame can vary hugely depending on the method you use.


Performing Sequence Labelling using CRF in Python

May 23, 2017  

In natural language processing, it is a common task to extract words or phrases of particular types from a given sentence or paragraph. For example, when analysing a corpus of news articles, we may want to know which countries are mentioned in the articles, and how many articles are related to each of these countries. This is actually a special case of sequence labelling in NLP (others include POS tagging and chunking), in which the goal is to assign a label to each member of the sequence.
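To make "a label for each member of the sequence" concrete, the sketch below shows what the output of such a task might look like and how country mentions could be counted from it. It does not train a CRF. The sentence, the hand-written labels, the BIO tagging scheme (B- begins an entity, I- continues it, O is outside any entity) and the helper `extract_spans` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical sentence with hand-written labels in the BIO scheme.
tokens = ["Leaders", "of", "Japan", "and", "South", "Korea",
          "met", "in", "Japan", "."]
labels = ["O", "O", "B-COUNTRY", "O", "B-COUNTRY", "I-COUNTRY",
          "O", "O", "B-COUNTRY", "O"]


def extract_spans(tokens, labels, entity_type):
    """Collect the phrases labelled with the given entity type."""
    spans, current = [], []
    for token, label in zip(tokens, labels):
        if label == f"B-{entity_type}":
            # A new entity starts; close any entity still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif label == f"I-{entity_type}" and current:
            # Continue the entity that is currently open.
            current.append(token)
        else:
            # Any other label ends the open entity, if there is one.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


country_counts = Counter(extract_spans(tokens, labels, "COUNTRY"))
print(country_counts)
# Counter({'Japan': 2, 'South Korea': 1})
```

In practice the labels would be predicted by a trained model rather than written by hand; the extraction step afterwards stays the same.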


Matrix Factorization: A Simple Tutorial and Implementation in Python

Apr 23, 2017  

(This is an updated version of the article published on my previous personal website and quuxlab.) There is probably no need to say that there is too much information on the Web nowadays. Search engines help us a little. Better still is to have something interesting recommended to us automatically, without our asking. Indeed, from something as simple as a list of the most popular questions and answers on Quora to the more personalized recommendations we receive on Amazon, we are constantly offered recommendations on the Web.


Deploying Jupyter in Ubuntu with Nginx and Supervisor

Mar 21, 2017  

The IPython Notebook, now called Jupyter Notebook, is a convenient and interactive Web application for fast prototyping and testing ideas in Python (and R, Julia, Scala, and others) in the Web browser. Installing it on Ubuntu is easy, but it takes a little more effort to deploy it on a server and have it run as a service. This article serves as a simple guide to deploying Jupyter on an Ubuntu server, using the Nginx Web server and the Supervisor system.


Location and Friendship: Data Mining in Facebook

Sep 5, 2010  

In the past, studying social issues such as the mobility of a group of people generally required a huge amount of effort. Questionnaires would have had to be prepared, distributed, and collected after they were filled in. It was and still is a labor-intensive task when face-to-face interviews are required to obtain various personal data. Nowadays, we have more and more people connected to the Internet, and many of these Internet users participate in various kinds of social interactions on the Web.