Erebos's blog (Hexo) — https://erebos.top/

# CART, Gradient Boosting and XGboost
https://erebos.top/2020/01/02/CART-Random-Forest-and-XGboost/ (2020-01-01)

## Classification And Regression Tree

Classification And Regression Tree (CART) is a decision tree model that can be used for both classification and regression. CART models are always binary trees.

### Classification Tree

The main process of a classification tree is shown below.

1. Choose a variable $x_i$ and a split point $v_i$, then split the data space into two parts: all data in the first part satisfy $x_i \leq v_i$ and all data in the second part satisfy $x_i > v_i$. For discrete data, the condition is instead $x_i = v_i$ versus $x_i \neq v_i$.
2. Split the space recursively until a stopping condition is met.
3. The basic stopping condition is that all data in the subspace belong to the same class. There are also other conditions, such as applying a $\chi^2$ or other independence test and stopping the splitting when the candidate split is independent of the labels.

A key question is how to choose the split point. In classification tasks, Gini impurity is widely used; it can be intuitively understood as the probability of misclassifying a sample that is labeled randomly according to the class distribution.

$$Gini(p) = \sum_{k = 1}^mp_k(1-p_k) = 1-\sum_{k = 1}^mp_k^2$$

Under this situation, $p_k = \frac{|C_k|}{|D|}$, where $C_k$ is the subset of $D$ containing the data labeled as the $k^{th}$ class.

If $D_1 = \{X \mid x_i \leq v_i\}$ and $D_2 = \{X \mid x_i > v_i\}$, then $D_1\cup D_2 = D$ and $D_1\cap D_2 = \emptyset$, and the Gini gain of the split is shown below.

$$Gain(D, x_i) = \sum_{j=1}^2\frac{|D_j|}{|D|}Gini(D_j)$$

Here, the smaller the Gini gain is, the less misclassification there is, so we choose the feature $x_i$ and split point that make the Gini gain smallest. If a discrete feature has more than two categories, CART combines them into two super-categories before splitting.
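As an illustration (not from the original post), here is a minimal sketch of choosing a split by Gini gain; the helper names `gini` and `best_split` are my own:

```python
def gini(labels):
    """Gini impurity: 1 - sum of squared class probabilities."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    """Return the threshold v minimizing the weighted Gini of {x <= v} vs {x > v}."""
    n = len(xs)
    best_v, best_gain = None, float("inf")
    for v in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= v]
        right = [y for x, y in zip(xs, ys) if x > v]
        if not left or not right:
            continue  # skip degenerate splits with an empty side
        gain = len(left) / n * gini(left) + len(right) / n * gini(right)
        if gain < best_gain:
            best_v, best_gain = v, gain
    return best_v, best_gain
```

On a toy sample like `xs = [1, 2, 3, 4]`, `ys = [0, 0, 1, 1]`, the threshold $v = 2$ separates the classes perfectly, so the weighted Gini gain drops to zero there.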

### Regression Tree

The main process of a regression tree is very similar to that of a classification tree. There are several differences between them.

1. Fit the residuals of the previous regression results to the labels and add the trees together (this is how regression trees are used in boosting).
2. Usually use the inner-class minimal mean squared error instead of Gini impurity as the splitting measurement.

CART chooses the best splitting point by solving the following optimization problem.

$$min_{j,s}[min_{c_1}\sum_{x_i\in R_1(j,s)}(y_i-c_1)^2 + min_{c_2}\sum_{x_i\in R_2(j,s)}(y_i-c_2)^2]$$

Here the $R_i(j,s)$'s are the subspaces after splitting by condition $(j,s)$: $x_j$ is the splitting feature and $s$ is the splitting point. We use this criterion instead of Gini impurity because we want to minimize the inner-class distances.

Then we can get $M$ subspaces and for each subspace $R_m$, we calculate the mean value as the regression value. i.e. $\hat c_m = \frac{1}{N_m}\sum_{x_i\in R_m} y_i$.

The final regression function is shown below.

$$f(x) = \sum_{m=1}^M \hat c_m I(x\in R_m)$$

Then we can use mean squared error to evaluate the tree and fit the residuals to improve the model; this is a simple boosting method. Let $T_i(x)$ be a CART estimate of $y - f_{i-1}(x)$; then we have $f_i(x) = f_{i-1}(x) + T_i(x)$. It's a special case of Gradient Boosting Decision Tree (the case where the loss function is mean squared error).
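This residual-fitting loop can be sketched with depth-1 stumps standing in for full CARTs (all names here are illustrative, not from the post):

```python
def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold, two leaf means, by least squares."""
    best = None
    for v in sorted(set(xs))[:-1]:  # exclude the max so the right side is never empty
        left = [y for x, y in zip(xs, ys) if x <= v]
        right = [y for x, y in zip(xs, ys) if x > v]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, v, lm, rm)
    _, v, lm, rm = best
    return lambda x: lm if x <= v else rm

def boost(xs, ys, rounds=3):
    """f_i = f_{i-1} + T_i, where each T_i fits the residuals y - f_{i-1}(x)."""
    trees = []
    for _ in range(rounds):
        resid = [y - sum(t(x) for t in trees) for x, y in zip(xs, ys)]
        trees.append(fit_stump(xs, resid))
    return lambda x: sum(t(x) for t in trees)

f = boost([1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0])
```

After the first stump fits the data exactly, later rounds see all-zero residuals and contribute nothing, so the ensemble reproduces the step function.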

## Gradient Boosting Decision Tree

The boosting strategy mentioned above has a more general form. Gradient Boosting Decision Tree (GBDT) uses a similar recursive formula.

$$F_m(x) = F_{m-1}(x) + argmin_{h\in H}\sum_{i=1}^nLoss(y_i,F_{m-1}(x_i) + h(x_i))$$

We can treat the loss function as a function of the vector $F_{m-1}(x)$. Then, using the gradient descent method, $F_{m}(x)$ can be calculated as $F_{m-1}(x) - \eta\nabla_{F_{m-1}} Loss$. We use $F_{m-1}(x)$ instead of $x$ as the gradient variable because we cannot get an expression in $x$ from a decision tree model. Our target is then to find a way to calculate $\nabla_{F_{m-1}}Loss$ or a reasonable estimate of it.

Again using CART: if we use the pairs $\{(x_i,\ -\frac{\partial Loss(y_i,F_{m-1}(x_i))}{\partial F_{m-1}(x_i)})\}$ to build a CART $T_m(x)$, then $T_m(x)$ is an estimate of $-\nabla_{F_{m-1}}Loss$.

### Classification

The classification tree here is not the same as the CART classification tree, because the CART classification tree has no gradient. The way to classify is to use log-odds values: just like logistic regression or a neural network classifier, we first estimate a continuous value $logit = ln\frac{P(y=1|x)}{P(y=0|x)}$ and use the sigmoid function (or softmax in higher dimensions) to translate it into a probability. As in logistic regression, we can use cross entropy as our loss function. Here we use binary classification as an example.

$$loss(x_i,y_i) = -y_ilog\hat y_i - (1-y_i)log(1-\hat y_i)$$

The function of probability is shown below.

$$P(y=1|x) = \frac{1}{1 + e^{-F_{m-1}(x)}}$$

So we can get

$$loss(y_i, F_{m-1}(x_i)) = y_ilog(1+e^{-F_{m-1}(x_i)}) + (1-y_i)[F_{m-1}(x_i) + log(1+e^{-F_{m-1}(x_i)})]$$

$$-\frac{\partial loss}{\partial F_{m-1}}(x_i,y_i) = y_i - \hat y_i$$
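A quick numerical check of this gradient identity (a sketch, not from the post):

```python
import math

def loss(y, F):
    """Cross entropy written directly in terms of the raw score F."""
    return y * math.log(1 + math.exp(-F)) + (1 - y) * (F + math.log(1 + math.exp(-F)))

def sigmoid(F):
    return 1 / (1 + math.exp(-F))

y, F, eps = 1.0, 0.3, 1e-6
# central finite difference for d loss / dF
num_grad = (loss(y, F + eps) - loss(y, F - eps)) / (2 * eps)
# the closed form above says that -d loss/dF equals y - sigmoid(F)
```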

If we have $k$ labels, we need one-hot encoding and the softmax function, and we fit $k$ trees in each iteration, one per dimension.

To fit the model better, there is a variation.

$$F_m(x) = F_{m-1}(x) + \eta\rho_mT_m(x)$$

where $\rho_m$ is the result of the line search $argmin_\rho\sum_{i}loss(x_i,y_i|F_{m-1}(x_i)+ \rho T_m(x_i))$ and $\eta$ is the learning rate.

## XGBoost

In XGBoost, the regression result is represented by the formula

$$\hat y = \sum_{k = 1}^K f_k(x)$$

Here every $f_k(x)$ is a regression tree. Assume the tree has $T$ leaves, $q(x)\in \{1,2,\dots,T\}$ is the leaf that $x$ falls into, and $f_k(x) = w_{q(x)}$ is the score assigned to the input.

### Regularization

The regularization of XGBoost has two parts: the complexity of the tree and the scale of the leaf scores.

$$\Omega(f_t(x)) = \gamma T + \frac{1}{2}\lambda |w|^2$$

Here $T$ is the number of leaves in the tree and $w \in \Re^T$ is the score of leaves.

### Splitting Choice

XGBoost uses a second-order Taylor expansion to approximate the loss function. Assume the loss function is $l(y_i,\hat y_i)$; then $l(y_i,\hat y_i^{(t-1)} + f_t(x_i)) \approx l(y_i,\hat y_i^{(t-1)}) + g_if_t(x_i) + \frac{1}{2}h_if_t^2(x_i)$, where
$$g_i = \frac{\partial l(y_i, \hat y^{(t-1)}_i)}{\partial \hat y^{(t-1)}_i}$$
$$h_i = \frac{\partial^2 l(y_i, \hat y^{(t-1)}_i)}{\partial (\hat y^{(t-1)}_i)^2}$$

Removing the constant term, we have the objective function.

$$\mathcal{L}^{(t)} = \sum_{i=1}^n[g_if_t(x_i) + \frac{1}{2}h_if_t^2(x_i)] + \gamma T + \frac{1}{2}\lambda |w|^2$$

Rewriting the function by grouping data points by the leaf they fall into, we have

$$\mathcal{L}^{(t)} = \sum_{j=1}^T[w_j\sum_{i\in I_j}g_i + \frac{1}{2}w_j^2(\sum_{i\in I_j}h_i + \lambda)] + \gamma T.$$

where $I_j = \{i\mid q(x_i) = j\}$. Then calculate the derivative and its zero point.

$$\frac{\partial \mathcal{L}^{(t)}}{\partial w_j} = [\sum_{i\in I_j}g_i + w_j(\sum_{i\in I_j}h_i + \lambda)] = 0$$

We can get the optimal score by solving the equation above.
$$w_j^* = -\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i + \lambda}$$

Then, substituting it back, we have the optimal objective value.
$$\mathcal{L}^{(t)} = -\frac{1}{2}\sum_{j=1}^T\frac{(\sum_{i\in I_j}g_i)^2}{\sum_{i\in I_j}h_i+\lambda} + \gamma T.$$

The smaller $\mathcal{L}$ is, the better the tree structure is, so we choose the splitting point that makes $\mathcal{L}$ smallest.

The idea is to choose a splitting point making the following value as large as possible.

$$\mathcal L_{split} = \mathcal L_{Ori} - \mathcal L_{L} - \mathcal L_{R} = \frac{1}{2}[\frac{(\sum_{i\in I_L}g_i)^2}{\sum_{i\in I_L}h_i+\lambda} + \frac{(\sum_{i\in I_R}g_i)^2}{\sum_{i\in I_R}h_i+\lambda} - \frac{(\sum_{i\in I}g_i)^2}{\sum_{i\in I}h_i+\lambda}] - \gamma$$

The algorithm stops creating subtrees when $\mathcal{L}_{split} < 0$ or when the maximal depth is reached.
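The optimal leaf weight and the split gain above can be sketched as follows (illustrative helper names, with $\lambda$ and $\gamma$ as parameters; not from the post):

```python
def leaf_weight(g, h, lam=1.0):
    """Optimal leaf score w* = -G / (H + lambda) for gradients g and hessians h in one leaf."""
    return -sum(g) / (sum(h) + lam)

def score(g, h, lam=1.0):
    """Contribution -1/2 * G^2 / (H + lambda) of one leaf to the objective."""
    return -0.5 * sum(g) ** 2 / (sum(h) + lam)

def split_gain(g, h, left_idx, lam=1.0, gamma=0.0):
    """Gain of splitting a sorted node at position left_idx: L_Ori - L_L - L_R - gamma."""
    gl, hl = g[:left_idx], h[:left_idx]
    gr, hr = g[left_idx:], h[left_idx:]
    return score(g, h, lam) - score(gl, hl, lam) - score(gr, hr, lam) - gamma
```

For example, splitting gradients `[1, 1, -1, -1]` (all hessians 1) down the middle separates the two gradient signs, so the gain is positive and the split would be accepted.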

# Recover sorted tensor in Pytorch
https://erebos.top/2019/11/28/Recover-sorted-tensor-in-Pytorch/ (2019-11-28)

## Problem

When I use `torch.nn.utils.rnn.pad_sequence` to pad words and feed the padded sequence into an LSTM/RNN, an input sorted by length is necessary. But a sequence in changed order increases the difficulty of evaluation, so here is a way to recover the sorted tensor using PyTorch functions.

## Let’s go
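The whole recipe is only a few lines:

```python
import torch

x = torch.randn(10)            # the original tensor
sorted_x, idx = torch.sort(x)  # idx[i] is where sorted_x[i] came from in x
_, rev_idx = torch.sort(idx)   # sorting idx yields the inverse permutation
recovered = sorted_x[rev_idx]  # equal to x again
```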

Here `x` is `tensor([-0.4321, 0.3852, 0.6008, 0.8452, -0.4709, 0.7610, -0.9743, -0.9819, -1.1142, -0.1249])`, and then we do the sort with `sorted_x, idx = torch.sort(x)`.

Here `idx` holds the original positions of the sorted elements, `tensor([8, 7, 6, 4, 0, 9, 1, 2, 5, 3])`. Then we can recover the original order just by sorting `idx`: `_, rev_idx = torch.sort(idx)` and then indexing `sorted_x[rev_idx]`.

We can see that the script prints `tensor([-0.4321, 0.3852, 0.6008, 0.8452, -0.4709, 0.7610, -0.9743, -0.9819, -1.1142, -0.1249])`, which equals the original `x`. It's amazing, isn't it? I'll now show you why it works.

## Mathematical Explanation

We suppose there is an $n$-permutation corresponding to our tensor:

$$(i_1, i_2, \dots, i_n)$$

Then we do the sort and get a new permutation:

$$i_{j_1} \leq i_{j_2} \leq \dots \leq i_{j_n}$$

Here `idx` corresponds to the vector $(j_1,j_2,\dots,j_n)$. The second sort gives $j_{k_1} \leq j_{k_2} \leq \dots \leq j_{k_n}$, and `rev_idx` corresponds to the vector $(k_1,k_2,\dots,k_n)$. The code `sorted_x[rev_idx]` selects the elements with indices $(k_1,k_2,\dots,k_n)$ from the sorted permutation, which means it selects the vector $(i_{j_{k_1}}, i_{j_{k_2}},\dots,i_{j_{k_n}})$.

Note that the vector $(j_1, j_2,\dots,j_n)$ is a permutation of $(1,2,\dots,n)$, so the sorted vector $(j_{k_1}, j_{k_2},\dots,j_{k_n})$ is also a permutation of $(1,2,\dots,n)$ and, being sorted, must equal it: for all $l\in \{1,\dots,n\}$, $j_{k_l} = l$. Finally, $(i_{j_{k_1}}, i_{j_{k_2}},\dots,i_{j_{k_n}}) = (i_1, i_2,\dots,i_n)$, which is the original tensor.

# Machine Learning Basis: Lagrange Duality and KKT Condition
https://erebos.top/2019/11/04/Machine-Learning-Basis-Lagrange-Duality-and-KKT-Condition/ (2019-11-03)

## Introduction

Duality and the KKT condition are very important for machine learning, especially in SVM models. I'll focus on the high-level idea and the derivation of Lagrange duality, and on how it leads to the KKT condition. There are some concepts that should be covered first.

### Optimization Problems without Restrictions

The basic form of an optimization problem without restrictions is simply to find the $x \in \Re^d$ that achieves

$$min_{x\in \Re^d} f(x)$$

A simple solution is to calculate the derivative of $f(x)$, solve the equation $f'(x^*) = 0$ and test whether $x^*$ is the minimum.

### Lagrange Multiplier

Consider an optimization problem with equality restrictions:

$$min_{x\in \Re^d} f(x) \quad s.t.\ h_i(x) = 0,\ i = 1,\dots,n$$

The Lagrange multiplier method solves this kind of problem. We can rewrite the objective function as $f(x) + \sum^n_{i=1}\lambda_i h_i(x)$. Then we can prove that the solution of $min_{x\in \Re^d,\lambda_i\in \Re }f(x) + \sum^n_{i=1}\lambda_i h_i(x)$ is equal to the solution of the previous problem. Here the $\lambda_i$ are called Lagrange multipliers. The new optimization problem is

$$min_{x\in \Re^d,\ \lambda\in\Re^n}\ f(x) + \sum^n_{i=1}\lambda_i h_i(x)$$

And here the new function $\mathcal{L}(x,\lambda) = f(x) + \sum^n_{i=1}\lambda_i h_i(x)$ is called Lagrange function.

If any $h_i(x) \neq 0$, the minimum becomes $-\infty$ because $\lambda_i$ is unrestricted, so we add the restriction $\nabla_{x,\lambda}\mathcal{L}(x, \lambda)= 0$, which keeps the solution finite and forces $h_i(x) = 0$.

### Dual Problem

Let $\mathcal{L}(x,\lambda, \mu) = f(x) + \sum_{i=1}^n\lambda_i h_i(x) + \sum_{j=1}^m\mu_j g_j(x)$; then there is a simple theorem (weak duality) that

$$d^* = max_{\lambda, \mu\geq0}\ min_{x}\ \mathcal{L}(x, \lambda,\mu) \leq min_{x}\ max_{\lambda, \mu\geq0}\ \mathcal{L}(x, \lambda,\mu) = p^*$$

Here $d^* = max_{\lambda, \mu}(min_{x}(\mathcal{L}(x, \lambda,\mu)))$ is called the dual problem of the primal problem $p^*$.

## Derivation

### Transformation of Primal Problem

Assume $f(x), h_i(x), g_j(x)$ are continuous functions on $\Re^d$, then consider the restricted optimization problem:

$$min_{x\in \Re^d} f(x) \quad s.t.\ h_i(x) = 0,\ i = 1,\dots,n; \quad g_j(x) \leq 0,\ j = 1,\dots,m$$

We already know that the problem without restrictions can be solved easily by calculating derivatives and testing. So our first step is to translate the primal problem into a problem without restrictions.

We have an extended Lagrange function of the form $\mathcal{L}(x,\lambda, \mu) = f(x) + \sum_{i=1}^n\lambda_i h_i(x) + \sum_{j=1}^m\mu_j g_j(x)$. Here $\mu_j \geq 0$ because the direction of the inequality $g_j(x) \leq 0$ is fixed.

Define a new function $d(x) = max_{\lambda,\mu\geq0}\mathcal{L}(x, \lambda, \mu)$; we can conclude that $min_{x}d(x) = min_{x}f(x)$, where the right-hand minimum is taken under all primal constraints.

Obviously, $d(x) \geq \mathcal{L}(x, 0, 0) = f(x)$, so $d(x)$ is an upper bound of $f(x)$. Under all constraints we have $h_i(x) = 0$ and $\mu_j g_j(x) \leq 0$, so the inner maximum is attained at $\mu = 0$ and $d(x) = f(x)$; when some constraint is violated, $d(x) = +\infty$.

In conclusion, the primal problem has the equivalent unrestricted form

$$p^* = min_{x}\ d(x) = min_{x}\ max_{\lambda,\mu\geq0}\ \mathcal{L}(x,\lambda,\mu)$$

### KKT condition

We already know the equivalent form of the primal problem, but in that form we still have to handle the constraints through the inner maximization, which makes the calculation too complicated. The next step is to find a simpler characterization of the best solution.

Considering the dual problem, a desirable property would be $d^* = p^*$ when $x = x^*$ is the best solution of the primal problem.

Thinking back to the transformation of the primal problem: if the dual problem equals the primal problem at $x = x^*$, the formula should be

$$d^* = max_{\lambda,\mu\geq0}\ min_{x}\ \mathcal{L}(x,\lambda,\mu) = min_{x}\ max_{\lambda,\mu\geq0}\ \mathcal{L}(x,\lambda,\mu) = p^*$$

Then consider the first-order (Lagrange) conditions of both inner optimizations, $max_{\lambda, \mu}\mathcal{L}(x,\lambda,\mu)$ and $min_{x}\mathcal{L}(x,\lambda,\mu)$. This leads to $\nabla_x\mathcal{L}(x^*) = 0$ and $\nabla_\lambda\mathcal{L} = 0$.

Then consider the parameters $\mu_j$. There are two situations for $g_j(x)$: either the minimizing point has $g_j(x) = 0$, or it has $g_j(x) < 0$.

For the first case, the inequality constraint becomes an equality constraint. That is,

$$g_j(x^*) = 0 \text{ with } \mu_j \geq 0, \text{ so } \mu_jg_j(x^*) = 0$$

For the second case, the constraint is inactive, and the inner maximization over $\mu_j \geq 0$ forces the multiplier to vanish, that is,

$$g_j(x^*) < 0 \text{ and } \mu_j = 0, \text{ so } \mu_jg_j(x^*) = 0$$

So, combining the two situations, we have $\mu_j g_j(x)=0$. Under this constraint, $\mathcal{L}$ becomes a regular Lagrange function, which leads to the Lagrange multiplier constraints

$$\nabla_x\mathcal{L}(x^*,\lambda,\mu) = 0, \quad h_i(x^*) = 0$$

So the final set of constraints becomes

$$\begin{cases}
\nabla_x\mathcal{L}(x^*,\lambda,\mu) = 0\\
h_i(x^*) = 0, \quad i = 1,\dots,n\\
g_j(x^*) \leq 0, \quad j = 1,\dots,m\\
\mu_j \geq 0, \quad \mu_jg_j(x^*) = 0, \quad j = 1,\dots,m
\end{cases}$$

This is the KKT condition.

# Connecting CAEN using VScode
https://erebos.top/2019/10/06/Connecting-CAEN-using-VScode/ (2019-10-06)

### What and Why

CAEN is the information technology (IT) services department for the University of Michigan (U-M) College of Engineering, and offers IT resources to support the College's educational, research, and administrative needs. It's quite inefficient to manage files on CAEN with command line tools when I need to test code in the CAEN environment: I have to type the whole sftp command and path every time. Editor plugins are a great solution. There are many tutorials about connecting with the Sublime Text editor on the Internet, but there is no documentation for VSCode. As a fan of VSCode, that's why I wanted to write this article.

### Environment

This is my own running environment; other setups may work but are untested.

• Operating system: Windows 10
• VSCode Version: 1.36.1
• Plugin: SFTP (by liximomo)

### Getting Started

After installing the plugin, press Ctrl+Shift+P and run the SFTP: config command. This command creates a configuration file named sftp.json in your folder (you may need to open a folder first).
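The generated file is not reproduced in this post; a minimal sketch consistent with the parameters described below might look like this (values are placeholders, not an exact copy of the plugin's default):

```json
{
    "name": "CAEN",
    "host": "login.engin.umich.edu",
    "protocol": "sftp",
    "port": 22,
    "username": "uniqname",
    "remotePath": "/home/uniqname",
    "uploadOnSave": true
}
```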

Name your server whatever you like in name and type the host address of the CAEN machine in host; it will look something like login.engin.umich.edu. It's not necessary to change the protocol or port. The username is your U-M uniqname.

Here is the explanation of these parameters.

• name is your own label for this server; you can name it anything.
• host is the host address of the CAEN machine, like login.engin.umich.edu.
• protocol is the connection protocol; you don't need to change the default sftp.
• port is the port of the connecting server.
• username is your own U-M uniqname, which is needed for signing in to the server.
• remotePath is the path on the CAEN machine that your local files will be uploaded to, for example /home/username.
• uploadOnSave is the switch for auto-uploading: if the value is true, files are automatically uploaded to the server when you save them locally.

After all these settings are saved, you will see a new icon on the Activity Bar.

### Two-factor Authentication

The University of Michigan uses two-factor authentication to authenticate your account, so we need one more parameter to handle this: add a new attribute interactiveAuth in the json file and set it to true.
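The whole configuration file would then look something like this sketch (placeholder values):

```json
{
    "name": "CAEN",
    "host": "login.engin.umich.edu",
    "protocol": "sftp",
    "port": 22,
    "username": "uniqname",
    "remotePath": "/home/uniqname",
    "uploadOnSave": true,
    "interactiveAuth": true
}
```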

### Connecting

Double-click the server under the SFTP section on the activity bar. After connecting, an input window appears at the top, followed by a two-factor authentication prompt; pass the authentication through the app or a message and you will see the directory of the server machine. Note that if you use the address "/" in remotePath, you will connect to the public area and will not have permission to open private folders, including your own.

# Machine Learning Basis: Convex Function and Hessian Matrix
https://erebos.top/2019/09/09/Machine-Learning-Basis-Convex-Function-and-Hessian-Matrix/ (2019-09-08)

## Introduction

Optimization is a focus of many machine learning algorithms such as linear regression, SVM and K-means. In practice many target functions are non-convex, which means we can only find local minima, but convex functions still play an important role in machine learning. The Hessian matrix is a great algebraic tool for analyzing convex functions, since in most cases our target function is real, continuous and $2^{nd}$-order differentiable. The main goal of this article is to record the proof of the equivalence between convex functions and their Hessians. Here are some important definitions.

### Convex Set

A Convex Set $C\subseteq \Re^n$ is a set of points s.t. $\forall x, y \in C$ and $t \in [0,1]$, $tx+(1-t)y \in C$.

### Convex Function

A function $f:\Re^n \rightarrow \Re$ is a Convex Function if its domain $D$ is a convex set and for all $x, y \in D$ and any $t \in [0,1]$,

$$f(tx + (1-t)y) \leq tf(x) + (1-t)f(y)$$

### Hessian Matrix

A Hessian Matrix is a square matrix of second-order partial derivatives of a function $f:\Re^n \rightarrow \Re$, usually written as:

$$H = \nabla^2f(x) = \left[ \begin{array}{cccc} \frac{\partial^2 f}{\partial x_1\partial x_1} & \frac{\partial^2 f}{\partial x_1\partial x_2} & \dots & \frac{\partial^2 f}{\partial x_1\partial x_d}\\ \frac{\partial^2 f}{\partial x_2\partial x_1} & \frac{\partial^2 f}{\partial x_2\partial x_2} & \dots & \frac{\partial^2 f}{\partial x_2\partial x_d}\\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_d\partial x_1} & \frac{\partial^2 f}{\partial x_d\partial x_2} & \dots & \frac{\partial^2 f}{\partial x_d\partial x_d} \end{array} \right]_{d\times d}$$

### Positive Definite/Semi-Definite Matrix

A real symmetric matrix $P$ is called Positive Semi-Definite (PSD) when for all $x \in \Re^n$ we have $x^TPx \geq 0$. It's called Positive Definite (PD) when for all $x \neq 0 \in \Re^n$ we have $x^TPx > 0$.
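These definitions are easy to check numerically; a sketch (not part of the original post) using the fact that a real symmetric matrix is PSD iff all its eigenvalues are non-negative:

```python
import numpy as np

def is_psd(P, tol=1e-10):
    """Check positive semi-definiteness of a real symmetric matrix via its eigenvalues."""
    return bool(np.all(np.linalg.eigvalsh(P) >= -tol))

# Hessian of the convex function f(x) = x1^2 + x2^2 is 2*I, which is PD (hence PSD).
H_convex = np.array([[2.0, 0.0], [0.0, 2.0]])
# Hessian of the saddle f(x) = x1^2 - x2^2 is diag(2, -2), which is not PSD.
H_saddle = np.array([[2.0, 0.0], [0.0, -2.0]])
```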

## The equivalence of convex function

There is a strong relationship between Convex Functions and their Hessians. Here is what I want to prove today.

A $2^{nd}$-order differentiable function $f$ with convex domain $D$ is (strict) convex if and only if its Hessian is PSD (PD).

This conclusion is also called the Second Order Condition of a convex function. To prove this, we need to introduce a First Order Condition that is

A $1^{st}$-order differentiable function $f$ with convex domain $D$ is (strict) convex if and only if for any $x, y\in D$, $f(y) \geq f(x) + \nabla^T f(x)(y-x)$

### Proof of First Order Condition

I divide the proof into two parts. First we prove that if $f$ is a convex function, then the first order condition holds.
If $f$ is convex, for any $t\in(0,1]$ we have

$$f(x + t(y-x)) = f((1-t)x + ty) \leq (1-t)f(x) + tf(y)$$

So, we can see

$$f(y) \geq f(x) + \frac{f(x + t(y-x)) - f(x)}{t}$$

Let $t\rightarrow 0$; the difference quotient converges to the directional derivative, giving

$$f(y) \geq f(x) + \nabla^T f(x)(y-x)$$

Then we prove that, under the first order condition, $f$ is a convex function.
If $f$ satisfies the first order condition, for all $x, y\in \Re^n$ and $t\in [0,1]$, let $z = tx + (1-t)y$; we have

$$f(x) \geq f(z) + \nabla^T f(z)(x-z), \qquad f(y) \geq f(z) + \nabla^T f(z)(y-z)$$

Multiplying the first inequality by $t$, the second by $(1-t)$, and adding them, the gradient terms cancel because $t(x-z) + (1-t)(y-z) = 0$, so

$$tf(x) + (1-t)f(y) \geq f(z) = f(tx + (1-t)y)$$

So $f(x)$ is a convex function.

### Proof of Second Order Condition

Now all prerequisites are proved; it's time to prove the Second Order Condition! Again, I divide the proof into two parts.
First we prove that if the Hessian of $f$, $H$, is PSD, then $f$ is convex.

If $H$ is PSD, by Taylor's theorem with the Lagrange remainder there exists $\xi$ on the segment between $x$ and $y$ such that

$$f(y) = f(x) + \nabla^T f(x)(y-x) + \frac{1}{2}(y-x)^T\nabla^2f(\xi)(y-x) \geq f(x) + \nabla^T f(x)(y-x)$$

So $f$ is convex due to the first order condition.

Then we can prove the reverse part.
If $f$ is convex, according to the first order condition, for all $y$ and all small $\lambda > 0$,

$$f(x+\lambda y) \geq f(x) + \lambda\nabla^T f(x)y$$

Then, expanding the left side by Taylor's theorem,

$$f(x) + \lambda\nabla^T f(x)y + \frac{\lambda^2}{2}y^T\nabla^2f(x)y + o(\lambda^2) \geq f(x) + \lambda\nabla^T f(x)y \implies \frac{\lambda^2}{2}y^T\nabla^2f(x)y + o(\lambda^2) \geq 0$$

Dividing by $\frac{\lambda^2}{2}$ and letting $\lambda\rightarrow0$, we have $y^T\nabla^2f(x)y \geq 0$ for all $y$,
so $\nabla^2f(x)$ is PSD.

# LeetCode Solution: Best Time to Buy and Sell Stock
https://erebos.top/2019/09/05/LeetCode-Solution-Best-Time-Stock-to-Buy-and-Sell-Stock/ (2019-09-04)

## Introduction

In this article I will solve the Best Time to Buy and Sell Stock series, including Best Time to Buy and Sell Stock I, II, III, IV and with Cooldown. Most of them are solved by dynamic programming, and I will focus on constructing transition equations and on dimension reduction.

### Description

The description of Best Time to Buy and Sell Stock I is:

Say you have an array for which the $i^{th}$ element is the price of a given stock on day $i$.

If you were only permitted to complete at most one transaction (i.e., buy one and sell one share of the stock), design an algorithm to find the maximum profit.

Note that you cannot sell a stock before you buy one.

Example:

### Solution of Problem I

A simple idea is to use dp[i] as the most profit attainable when buying on the $i^{th}$ day. Then the transition equation will be dp[i] = max(prices[j] - prices[i]) for all j > i, and the solution is max(dp). This is an $O(n^2)$ algorithm, but it wastes computation. Suppose $j$ is the specific day where dp[i] = prices[j] - prices[i]; if there is a day $k$ with $i < k \leq j$ and prices[k] < prices[i], then we have

$$dp[k] \geq prices[j] - prices[k] > prices[j] - prices[i] = dp[i]$$

So the solution won't come from dp[i]. Under this circumstance, we can simplify the algorithm by always tracking the lowest-price day so far as the buying day, recording the current price minus that buying price, and generating a sequence of profits: profit[i] is the difference between the $i^{th}$ day's price and the lowest price on or before day $i$. Then max(profit) is the solution. By doing so, we reduce the method to $O(n)$ time.
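The post's original C++ listing is not reproduced here; a Python sketch of the same one-pass scan:

```python
def max_profit(prices):
    """One-pass scan: track the lowest price so far and the best profit so far."""
    best, lowest = 0, float("inf")
    for p in prices:
        lowest = min(lowest, p)       # best buying day seen so far
        best = max(best, p - lowest)  # profit if we sell today
    return best
```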

### Solution of Problem II

In problem II there is no limit on the number of transactions; we can buy/sell any number of times. If we try to use dp[i] as above, we find it hard to build a transition equation because we don't know how many transactions there will be, so we have to change the state description. There are only three possible actions in a day (buying, selling, doing nothing), so we can describe a day with two states: a day holding stock and a day holding no stock. Let nohold[i] be the maximal profit when we hold no stock on the $i^{th}$ day, and hold[i] the maximal profit when we hold stock. Then the transition equations are hold[i] = max(hold[i-1], nohold[i-1] - prices[i]) and nohold[i] = max(nohold[i-1], hold[i-1] + prices[i]).

That simply means that if we hold stock on the $i^{th}$ day, it was either bought today or already held yesterday, and if we hold no stock on the $i^{th}$ day, we either sold today or were already empty-handed. With these equations we can solve the problem in one pass. Don't forget the initialization hold[0] = -prices[0].

There is another solution that doesn't use DP. A simple idea is that if we buy at the beginning of every increasing run and sell at its end, we get the maximal profit.
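Equivalently, that greedy collects every positive day-to-day difference; a sketch:

```python
def max_profit_unlimited(prices):
    """Sum every positive day-to-day difference: same as trading each increasing run."""
    return sum(max(b - a, 0) for a, b in zip(prices, prices[1:]))
```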

### Solution of III & IV

Problem III is a special case of Problem IV, so we just cover Problem IV. In Problem IV we have the limitation that we can buy at most $k$ times ($k$ is given). It can be solved like the DP algorithm of Problem II, with a similar state description plus one extra dimension for the number of transactions. Let hold[i][j] be the maximal profit when we hold stock on day $i$ after $j$ transactions, and nohold[i][j] the maximal profit when we hold no stock on day $i$ after $j$ transactions. As in Problem II, the transition equations can be written as hold[i][j] = max(hold[i-1][j], nohold[i-1][j-1] - prices[i]) and nohold[i][j] = max(nohold[i-1][j], hold[i-1][j] + prices[i]).

The solution will be nohold[n-1][k-1]. What need to be mentioned is that we counting transaction by counting buying numbers but not selling. Then it’s a one-pass method.

But the code did not pass! We got a Memory Limit Exceeded. So I started to reduce the dimension of the equations. Obviously, hold[i][j] and nohold[i][j] depend only on hold[i-1][*] and nohold[i-1][*], so we can drop the day dimension:

hold[j] = max(hold[j], nohold[j-1] - prices[i])
nohold[j] = max(nohold[j], hold[j] + prices[i])

Also, using a sentinel $0$ for nohold[-1] makes the code look better (it removes an $if$). So we get the code like this.

~Ok, we have already solved it!~ Wait, it’s still Memory Limit Exceeded! But why? Consider a $k$ so large that the limit is meaningless to the problem; the problem then reduces to Problem II, yet our solution still costs $O(k \cdot n)$ time and $O(k)$ space, which is huge compared with the $O(n)$ solution of Problem II. We can fix this with a simple $if$ statement: when $k \geq n/2$ the limit can never bind, so we fall back to the greedy of Problem II.

And here is the whole program of Problem IV.
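A possible whole program for Problem IV along these lines, including the large-$k$ shortcut (the function name maxProfitIV is mine):

```cpp
#include <vector>
#include <algorithm>
#include <climits>
using namespace std;

int maxProfitIV(int k, const vector<int>& prices) {
    int n = (int)prices.size();
    if (n == 0 || k == 0) return 0;
    // If k is large enough the limit never binds: fall back to the
    // O(n) greedy of Problem II to avoid O(k*n) time and O(k) memory.
    if (k >= n / 2) {
        int total = 0;
        for (int i = 1; i < n; ++i)
            total += max(0, prices[i] - prices[i - 1]);
        return total;
    }
    // hold[j] / nohold[j]: best profit so far with the j-th transaction
    // (0-indexed, counted at buying) currently open / closed.
    vector<int> hold(k, INT_MIN), nohold(k, 0);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < k; ++j) {
            // sentinel: nohold[-1] is treated as 0 profit
            int prevNohold = (j == 0) ? 0 : nohold[j - 1];
            hold[j] = max(hold[j], prevNohold - prices[i]);
            nohold[j] = max(nohold[j], hold[j] + prices[i]);
        }
    }
    return nohold[k - 1];
}
```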

### Solution of Problem with Cooldown

Cooldown means we have to ~relax and take a coffee~ the day after selling: buying the day after a sell is not allowed. Does that mean our state description above can no longer be used? Of course not! We just make a small modification, adding a new vector cooldown[i], the maximal profit when we have just sold or done nothing on the $i^{th}$ day, alongside hold_stock[i] and hold_no_stock[i] as above. The transition for cooldown is cooldown[i] = max(hold_no_stock[i-1], hold_stock[i-1] + prices[i]), which means that today we either sell the stock or do nothing. The transition for hold_stock is still hold_stock[i] = max(hold_stock[i-1], hold_no_stock[i-1] - prices[i]), because the cooldown doesn’t influence buying. Finally, the transition for hold_no_stock[i] is hold_no_stock[i] = max(hold_no_stock[i-1], cooldown[i-1]), meaning that yesterday was a cooldown day or a no-stock day. Combining the three equations gives the full transition.
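A sketch of this three-state transition, reduced to scalars since each day only needs the previous day’s values (the function name maxProfitCooldown is mine):

```cpp
#include <vector>
#include <algorithm>
#include <climits>
using namespace std;

int maxProfitCooldown(const vector<int>& prices) {
    if (prices.empty()) return 0;
    int hold_stock = -prices[0];  // holding a share
    int hold_no_stock = 0;        // holding nothing, free to buy
    int cooldown = INT_MIN;       // just sold (or resting) today
    for (size_t i = 1; i < prices.size(); ++i) {
        int prev_hold = hold_stock, prev_nostock = hold_no_stock, prev_cd = cooldown;
        cooldown = max(prev_nostock, prev_hold + prices[i]);    // sell today or idle
        hold_stock = max(prev_hold, prev_nostock - prices[i]);  // cooldown doesn't block buying
        hold_no_stock = max(prev_nostock, prev_cd);             // yesterday: cooldown or no-stock
    }
    return max(hold_no_stock, cooldown);
}
```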

Don’t forget the initialization hold_stock = -prices[0]; cooldown = INT_MIN;. It’s also an $O(n)$ one-pass method. In conclusion, all of these problems can be solved with the dynamic programming idea, and the key step is forming the transition equations. The number of state variables and the number of dimensions are interchangeable when constructing the equations, so if you have no idea how to form them, adding the constrained quantity (such as the transaction count) as an extra state dimension is a good choice.

]]>
<h3 id="Introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h3><p>In this article I will try to solve <strong>Best Time to Buy and Sell Stock</strong> series problem, including <strong>Best Time to Buy and Sell Stock I, II, III, IV</strong> and <strong>with Cooldown.</strong> Most of them are solved by <strong>dynamic programming</strong> and I will focus on construct transition equation and dimension reduction. </p> <h3 id="Description"><a href="#Description" class="headerlink" title="Description"></a>Description</h3><p>The description of <strong>Best Time to Buy and Sell Stock I</strong> is:</p> <p>Say you have an array for which the $i^{th}$ element is the price of a given stock on day $i$.</p> <p>If you were only permitted to complete at most one transaction (i.e., buy one and sell one share of the stock), design an algorithm to find the maximum profit.</p> <p>Note that you cannot sell a stock before you buy one.</p> <p>Example:</p> <figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">Input: [<span class="number">7</span>,<span class="number">1</span>,<span class="number">5</span>,<span class="number">3</span>,<span class="number">6</span>,<span class="number">4</span>]</span><br><span class="line">Output: <span class="number">5</span></span><br><span class="line">Explanation: Buy on day <span class="number">2</span> (price = <span class="number">1</span>) <span class="keyword">and</span> sell on day <span class="number">5</span> (price = <span class="number">6</span>), profit = <span class="number">6</span><span class="number">-1</span> = <span class="number">5.</span></span><br></pre></td></tr></table></figure>
LeetCode Solution: #300 Longest Increasing Subsequence https://erebos.top/2019/09/01/LeetCode-Solution-300-Longest-Increasing-Subsequence/ 2019-09-01T04:07:18.000Z 2019-10-05T23:35:09.114Z Introduction

In this article I will describe two dynamic programming algorithms solving the LIS problem, and the STL functions lower_bound() and upper_bound().

### Description

Given an unsorted array of integers, find the length of longest increasing subsequence.
Example:

Input: $[10,9,2,5,3,7,101,18]$
Output: 4
Explanation: The longest increasing subsequence is $[2,3,7,101]$, therefore the length is $4$.

### $O(n^2)$ Dynamic Programming Solution

A natural formulation: let dp[i] be the length of the longest increasing subsequence ending with the $i^{th}$ element. The value of dp[i] is determined by every earlier position that keeps the increasing property with the $i^{th}$ value; mathematically, dp[i] is determined by all the values dp[j] with $j < i$ and $nums[j] < nums[i]$, where nums is the input vector. So the state transition equation is

dp[i] = max(dp[j]) + 1 with j < i, nums[j] < nums[i]

This method needs two nested loops, so it’s an $O(n^2)$ algorithm.
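A sketch of this $O(n^2)$ DP (the function name is mine):

```cpp
#include <vector>
#include <algorithm>
using namespace std;

int lengthOfLIS_quadratic(const vector<int>& nums) {
    if (nums.empty()) return 0;
    vector<int> dp(nums.size(), 1);  // dp[i]: LIS length ending at index i
    int best = 1;
    for (size_t i = 1; i < nums.size(); ++i) {
        for (size_t j = 0; j < i; ++j)
            if (nums[j] < nums[i])          // i can extend a subsequence ending at j
                dp[i] = max(dp[i], dp[j] + 1);
        best = max(best, dp[i]);
    }
    return best;
}
```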

### $O(nlgn)$ Dynamic Programming Solution

Among all increasing subsequences of the same length, the one with the smallest last number gives the best chance that a newly added number keeps the subsequence optimal. For example, the sequence $[1,3,5,2,7,4,5]$ has two increasing subsequences of length $4$: $[1,3,5,7]$ and $[1,2,4,5]$.

Then we add $6$ to the sequence: the first subsequence stays $[1,3,5,7]$, while the second one becomes $[1,2,4,5,6]$.

But how do we guarantee that the subsequence has the smallest possible last number? We can replace the first number greater than or equal to the new number with the new number: the replacement doesn’t change the length of the subsequence but generally decreases the stored values.
There is also a very nice property: the maintained subsequence is itself increasing, which means that given the subsequence and a new number, we can find the correct position of the new number in only $O(lgn)$ time with binary search. We extend the subsequence when the new number is larger than all numbers in it and replace otherwise, so the whole time complexity is $O(nlgn)$.
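A sketch of the $O(nlgn)$ method using lower_bound on the maintained tail array (names are mine):

```cpp
#include <vector>
#include <algorithm>
using namespace std;

int lengthOfLIS(const vector<int>& nums) {
    // tails[len-1]: the smallest possible last number of an
    // increasing subsequence of length len.
    vector<int> tails;
    for (int x : nums) {
        auto it = lower_bound(tails.begin(), tails.end(), x);
        if (it == tails.end())
            tails.push_back(x);  // x extends the longest subsequence
        else
            *it = x;             // replace the first tail >= x
    }
    return (int)tails.size();
}
```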

### lower_bound and upper_bound in STL

Notice that I used the lower_bound function in the previous code. It’s a binary search function in the STL; both it and upper_bound perform binary search and return a position in a range. The difference is that lower_bound returns the position of the first number greater than or equal to the target, while upper_bound returns the position of the first number strictly greater than the target. Both functions take three parameters: an iterator to the beginning of the search range, an iterator to the end of it, and the target value. Here is the source code of lower_bound.
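A typical implementation, close to the possible implementation shown on cppreference (renamed my_lower_bound here so it does not collide with the std version):

```cpp
#include <iterator>
#include <vector>  // only needed for trying the function out

// Binary search that keeps halving the range [first, last) until it
// pins down the first element that is NOT less than value.
template <class ForwardIt, class T>
ForwardIt my_lower_bound(ForwardIt first, ForwardIt last, const T& value) {
    typename std::iterator_traits<ForwardIt>::difference_type
        count = std::distance(first, last), step;
    while (count > 0) {
        ForwardIt it = first;
        step = count / 2;
        std::advance(it, step);  // probe the middle element
        if (*it < value) {
            first = ++it;        // middle too small: search the right half
            count -= step + 1;
        } else {
            count = step;        // keep the left half (it may be the answer)
        }
    }
    return first;
}
```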

What should be mentioned is that the begin position is included but the end position is not. The function uses binary search, so the time complexity is $O(lgn)$, where $n$ is the distance between the two iterators.

]]>
<h3 id="Introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h3><p>In this article I will describe two <strong>dynamic programming</strong> algorithms solving LIS problem and <strong>STL functions</strong> <code>lower_bound()</code> and <code>upper_bound()</code>.</p> <h3 id="Description"><a href="#Description" class="headerlink" title="Description"></a>Description</h3><p>Given an unsorted array of integers, find the length of longest increasing subsequence.<br>Example:</p> <blockquote> <p>Input: $[10,9,2,5,3,7,101,18]$<br>Output: 4<br>Explanation: The longest increasing subsequence is $[2,3,7,101]$, therefore the length is $4$.</p> </blockquote>
Leetcode Solution: #146 LRUcache https://erebos.top/2019/08/28/Leetcode-Solution-LRUcache/ 2019-08-28T10:25:14.000Z 2019-10-05T23:35:25.697Z Description

Design and implement a data structure for a Least Recently Used (LRU) cache. It should support the following operations: get and put.

get(key) -Get the value (will always be positive) of the key if the key exists in the cache, otherwise return -1.
put(key, value) -Set or insert the value if the key is not already present. When the cache reached its capacity, it should invalidate the least recently used item before inserting a new item.

The cache is initialized with a positive capacity.

### Basic Idea

To solve this problem, we need to design a kind of data structure with the properties as follow:

1. The data structure can visit and set/insert an item as fast as possible (such as vector or map).
2. The data structure can order the data by the operation time.
3. The data structure can quickly check for the overflow of capacity.

Since the available data structures differ between languages, I will choose python and cpp as my solution languages.

### Python solution

I will introduce a python data structure called OrderedDict. This is a kind of dictionary (in fact it inherits from python’s dict) that preserves insertion order. Python uses an extra circular linked list whose nodes have the form $[PREV, NEXT, KEY]$ to realize this data structure. Obviously, this data structure is perfectly suited to our problem.

The only remaining problem is that we need a data structure ordered by operation time, not insertion time. We can achieve this by simply deleting and re-inserting the key on every operation.

### Cpp solution(using STL container)

CPP provides many kinds of STL containers, but there is nothing like python’s OrderedDict. The design idea is to combine two or more kinds of containers (as the OrderedDict source code does). If we only wanted insertion order, a stack would be a first choice, but we also want to keep the fast insertion and deletion of a map/vector, which conflicts with a stack’s restriction that we cannot move or delete a node in its middle. Therefore a linked list (std::list in the STL) is a great choice, which also matches python’s design. To keep the order of operation time, the problems we need to solve are as follows:

1. We need a fast way to visit/insert/delete a node in the linked list given a key.
2. We need a fast way to move a node to the front of the linked list after every operation.
3. When we evict the node at the back of the linked list (the least recently used one), we need a fast way to delete the corresponding pair in the map/vector.

These properties maintain a linked list in operation-time order with little modification cost. Actually, properties $1$ and $2$ can be combined thanks to the nature of a linked list, and a map satisfies both: storing (key, node) pairs lets us reach a linked list node quickly (in $O(1)$ with unordered_map). For property $3$, we could keep a reverse_map(value, key) and erase reverse_map[node.value] from the map, but that is inconvenient and costs extra space; we can instead simply store (key, value) in each linked list node and read the key back from the evicted node.

In conclusion, what we need is to combine a map(or unordered_map) and a list.

There is also a tricky case: we need to check whether the key is already in the cache first, because changing a value in the cache requires deleting the old entry and then inserting a new one.
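A sketch of the whole cpp class along these lines: a std::list of (key, value) pairs plus an unordered_map from key to list iterator. list::splice moves a node to the front in $O(1)$ without invalidating iterators.

```cpp
#include <list>
#include <unordered_map>
using namespace std;

class LRUCache {
    int capacity;
    list<pair<int, int>> items;  // front = most recently used; stores (key, value)
    unordered_map<int, list<pair<int, int>>::iterator> pos;  // key -> list node
public:
    LRUCache(int capacity) : capacity(capacity) {}

    int get(int key) {
        auto it = pos.find(key);
        if (it == pos.end()) return -1;
        items.splice(items.begin(), items, it->second);  // refresh: move to front
        return it->second->second;
    }

    void put(int key, int value) {
        auto it = pos.find(key);
        if (it != pos.end()) {            // existing key: update and refresh
            it->second->second = value;
            items.splice(items.begin(), items, it->second);
            return;
        }
        if ((int)items.size() == capacity) {  // evict least recently used (back)
            pos.erase(items.back().first);
            items.pop_back();
        }
        items.emplace_front(key, value);
        pos[key] = items.begin();
    }
};
```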

The combination of STL containers is not the only way to solve this problem; many hand-written data structures perform better. A specific example is using circular linked list nodes, just as python does in OrderedDict. I won’t cover this method here; you can find related articles easily.

]]>
<h3 id="Description"><a href="#Description" class="headerlink" title="Description"></a>Description</h3><p>Design and implement a data structure for <strong>Least Recent Used(LRU) cache</strong>. It should support the following operations: <code>get</code> and <code>put</code>.</p> <p><code>get(key)</code> -Get the value (will always be positive) of the key if the key exists in the cache, otherwise return -1.<br><code>put(key, value)</code> -Set or insert the value if the key is not already present. When the cache reached its capacity, it should invalidate the least recently used item before inserting a new item.</p> <p>The cache is initialized with a <strong>positive</strong> capacity.</p> <h3 id="Basic-Idea"><a href="#Basic-Idea" class="headerlink" title="Basic Idea"></a>Basic Idea</h3><p>To solve this problem, we need to design a kind of data structure with the properties as follow:</p> <ol> <li>The data structure can visit the and set/insert the item as soon as possible(such as <strong>vector or map</strong>).</li> <li>The data structure can order the data <strong>by the operation time.</strong></li> <li>The data structure can quickly check for the <strong>overflow of capacity.</strong></li> </ol> <p>Due to the data structure in different language is not the same, I will choose <strong>python</strong> and <strong>cpp</strong> as my solution language.<br>

Given a linked list, return where the cycle begins. For example, a linked list $[3, 2, 0, 4]$ with cycle $[2, 0, 4]$ is shown below; the algorithm should return the second node. I use C++ to solve this problem and define the node as below.

### Solution 1: hashset

A trivial idea is to save each visited node in a hashset while traversing and report the first node we see twice.

This method has O(n) time complexity and O(n) space complexity. But is there a method that solves the problem with O(1) space complexity?
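A sketch of the hashset method (the ListNode definition repeats the one from the problem so the snippet is self-contained):

```cpp
#include <unordered_set>
#include <cstddef>

// Definition for singly-linked list.
struct ListNode {
    int val;
    ListNode *next;
    ListNode(int x) : val(x), next(NULL) {}
};

ListNode* detectCycle(ListNode* head) {
    std::unordered_set<ListNode*> seen;
    for (ListNode* cur = head; cur != NULL; cur = cur->next) {
        if (seen.count(cur)) return cur;  // first revisited node = cycle start
        seen.insert(cur);
    }
    return NULL;  // reached the end: no cycle
}
```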

### Solution 2: fast and slow pointers

Set two pointers: the slow one moves one step at a time and the fast one moves two steps at a time. Once they meet, reset one pointer to the head; then, with both advancing one step at a time, they will finally meet at the first node of the cycle. This is an algorithm without extra space. But why does it work?

Here is the mathematical idea. Suppose the distance from the head node to the beginning of the cycle is $x_1$, the distance from the beginning of the cycle to the meeting point along the cycle is $x_2$, and the distance from the meeting point back to the beginning of the cycle is $x_3$. Since the fast pointer travels twice as far as the slow one, the velocity equation is

$$2(x_1 + x_2) = x_1 + x_2 + n(x_2 + x_3), \quad n \geq 1$$

which simplifies to $x_1 = (n-1)(x_2 + x_3) + x_3$.

It means that the difference between $x_1$ and $x_3$ is a multiple of the cycle length. Therefore, if one pointer moves from the head node while the other moves from the meeting point, both one step at a time, they will finally meet at the beginning of the cycle.

Don’t forget the special case of a NULL head pointer, and note that if the fast pointer reaches NULL there is no cycle.
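A sketch of the fast/slow pointer method (ListNode repeated for self-containedness; the function name detectCycleFloyd is mine):

```cpp
#include <cstddef>

// Definition for singly-linked list.
struct ListNode {
    int val;
    ListNode *next;
    ListNode(int x) : val(x), next(NULL) {}
};

ListNode* detectCycleFloyd(ListNode* head) {
    ListNode *slow = head, *fast = head;
    while (fast != NULL && fast->next != NULL) {
        slow = slow->next;
        fast = fast->next->next;
        if (slow == fast) {        // pointers met inside the cycle
            ListNode* p = head;
            while (p != slow) {    // advance both one step at a time
                p = p->next;
                slow = slow->next;
            }
            return p;              // both arrive at the cycle's first node
        }
    }
    return NULL;  // fast hit NULL: no cycle
}
```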

]]>
<h3 id="Description"><a href="#Description" class="headerlink" title="Description"></a>Description</h3><p>Given a linked list and return where the circle begins. For example, a linked list $[3, 2, 0, 4]$ having circle $[2, 0, 4]$ is shown below.</p> <p><img src="https://raw.githubusercontent.com/hongxin-y/picture4blog/master/linked_circle.png?token=AKSH2L5CED5EO3FVD4ZH5425ZCI6Q" alt="Linked_circle"></p> <p>The algorithm should return the second node. I use C++ to solve this problem and define the node as below.</p> <figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">//Definition for singly-linked list.</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">ListNode</span> &#123;</span></span><br><span class="line"> <span class="keyword">int</span> val;</span><br><span class="line"> ListNode *next;</span><br><span class="line"> ListNode(<span class="keyword">int</span> x) : val(x), next(<span class="literal">NULL</span>) &#123;&#125;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure>