Classification And Regression Tree

Classification And Regression Tree(CART) is a kind of decision tree model can be used both for classification and regression. CART models are always binary trees.

Classification Tree

The main process of a classification tree is shown below.

  1. Choose a variable $x_i$ and a split point $v_i$, then split the data space into two parts. All data in the first part satisfy that $x_i \leq v_i$ and all data in the second part satisfy that $x_i > v_i$. For discrete data, the condition is equivalent to $x_i = v_i$ and $x_i \neq v_i$.
  2. Split the space recursively until the stopping condition.
  3. Stopping condition is that all data in this subspace is in the same class. There are also some other conditions like using $\chi^2$ value or other independence test and stop the spliting when are splited data are independent.

A question is how to choose the split point. In classification task, Gini impurity is widely used. The Gini impurity can be simply understand as the probability of misclassified.

Gini(p) = \sum_{k = 1}^mp_k(1-p_k) = 1-\sum_{k = 1}^mp_k^2

Under this situation, $p_k = \frac{|C_k|}{|D|}$ where $C_k$ is the subset of $D$ with data labeled as $k^{th}$ class.

If $D_1 = {X|x_i \leq v_i}$, $D_2 = {X|x_i > v_i}$, we have $D_1\cup D_2 = D$ and $D_1\cap D_2 = \emptyset$, the gain of gini impurity shown below.

Gain(D, x_i) = \sum_{j=1}^2\frac{|D_j|}{|D|}Gini(D_j)

Here, the smaller gain of gini is, the less misclassification. So we always choose the split and $x_i$ makes the gain of gini smallest. CART will combine catagories into two super-catagories before spliting if there are more than two catagories.

Read more »


When I use torch.nn.utils.rnn.pad_sequence to Padding words and feed the padded sequence into LSTM/RNN, a input sorted by length is neccessary. But an order-changed sequence will increase the difficulty of evaluation. So here is a way to recover the sorted tensor using Pytorch functions.

Let’s go

x = torch.randn(10)

Here x is tensor([-0.4321, 0.3852, 0.6008, 0.8452, -0.4709, 0.7610, -0.9743, -0.9819, -1.1142, -0.1249]) and then we do the sort.

sorted_x, idx = torch.sort(x)

Here idx is the index of x, tensor([8, 7, 6, 4, 0, 9, 1, 2, 5, 3]). Then we can get the original order just by sorting the idx.

_, rev_idx = torch.sort(idx)

We can see that the script prints the tensor([-0.4321, 0.3852, 0.6008, 0.8452, -0.4709, 0.7610, -0.9743, -0.9819, -1.1142, -0.1249]) which is equals to the original x. It’s amazing, isn’t it? I’ll then show you why it works.

Read more »


Duality and KKT condition are very important for machine learning, especially in SVM models. I’ll focus more on the high-level idea and the derivation of Lagrange Duality and how to introduce to KKT condition. There are some connceptions should be covered first.

Optimization Problems without Restrictions

The basic form of optimization problem without restrictions is just like finding the $x \in \Re^d$ makes that

A simple solution is to calculate the derivatives of $f(x)$, solve the equation and test if is the minimal value.

Read more »

What and Why

CAEN is the information technology (IT) services department for the University of Michigan (U-M) College of Engineering, and offers IT resources to support the College’s educational, research, and administrative needs. It’s quite unefficient to manage files on CAEN using command line tools if I need to text our code in CAEN environment. I need to type the whole sftp command and path every time. Plugins for editors is a great solution. There are many tutorials about connecting using Sublime Text Editor on the Internet, but there is no documentations about VScode. As a fan of VScode, that’s why I want to write this article.


This is my own running environment may but not necessary.

  • Operating system: Windows 10
  • VSCode Version: 1.36.1
  • Plugin: SFTP (by liximomo)
Read more »


Optimization is a focus on many kinds of machine learning algorithms like Linear Regression, SVM and K-means. But actually many kinds of target function is non-convex which means we can only find its local minima. But convex functions still plays an important role in machine learning. And Hessian Matrix is a great algebra tool to analyze convex functions since in most cases our target function will be real, continuous and $2^{nd}$-order differentiable. The main goal of this article is to record the proof of the equivalence between Convex Functions and their Hessians. Here is the some important definitions.

Convex Set

A Convex Set $C\subseteq \Re^n$ is a set of points s.t. $\forall x, y \in C$ and $t \in [0,1]$, $tx+(1-t)y \in C$.

Convex Function

A function $f:\Re^n \rightarrow \Re$ is a Convex Function if for $x, y \in D$, where $D$ is a convex set, $f$ and any $t \in [0,1]$ makes

Hessian Matrix

A Hessian Matrix is a square matrix of second-order partial derivatives of a function $f:\Re^n \rightarrow \Re$, usually written as:

Positive Definite/Semi-Definite Matrix

A real symmetric matrix $P$ is called Positive Semi-Definite (PSD) when for all $x \in \Re^n$, there are $x^TPx \geq 0$. And it’s called Positive Definite (PD) when for all $x \neq 0 \in \Re^n$, there are $x^TPx > 0$.

Read more »


In this article I will try to solve Best Time to Buy and Sell Stock series problem, including Best Time to Buy and Sell Stock I, II, III, IV and with Cooldown. Most of them are solved by dynamic programming and I will focus on construct transition equation and dimension reduction.


The description of Best Time to Buy and Sell Stock I is:

Say you have an array for which the $i^{th}$ element is the price of a given stock on day $i$.

If you were only permitted to complete at most one transaction (i.e., buy one and sell one share of the stock), design an algorithm to find the maximum profit.

Note that you cannot sell a stock before you buy one.


Input: [7,1,5,3,6,4]
Output: 5
Explanation: Buy on day 2 (price = 1) and sell on day 5 (price = 6), profit = 6-1 = 5.
Read more »


In this article I will describe two dynamic programming algorithms solving LIS problem and STL functions lower_bound() and upper_bound().


Given an unsorted array of integers, find the length of longest increasing subsequence.

Input: $[10,9,2,5,3,7,101,18]$
Output: 4
Explanation: The longest increasing subsequence is $[2,3,7,101]$, therefore the length is $4$.

Read more »


Design and implement a data structure for Least Recent Used(LRU) cache. It should support the following operations: get and put.

get(key) -Get the value (will always be positive) of the key if the key exists in the cache, otherwise return -1.
put(key, value) -Set or insert the value if the key is not already present. When the cache reached its capacity, it should invalidate the least recently used item before inserting a new item.

The cache is initialized with a positive capacity.

Basic Idea

To solve this problem, we need to design a kind of data structure with the properties as follow:

  1. The data structure can visit the and set/insert the item as soon as possible(such as vector or map).
  2. The data structure can order the data by the operation time.
  3. The data structure can quickly check for the overflow of capacity.

Due to the data structure in different language is not the same, I will choose python and cpp as my solution language.

Read more »


Given a linked list and return where the circle begins. For example, a linked list $[3, 2, 0, 4]$ having circle $[2, 0, 4]$ is shown below.


The algorithm should return the second node. I use C++ to solve this problem and define the node as below.

//Definition for singly-linked list.
struct ListNode {
int val;
ListNode *next;
ListNode(int x) : val(x), next(NULL) {}
Read more »