Classification And Regression Tree (CART) is a kind of decision tree model that can be used for both classification and regression. CART models are always binary trees.

The main process of a classification tree is shown below.

- Choose a variable $x_i$ and a value $v_i$ (the *split point*), then split the data space into two parts: all data in the first part satisfy $x_i \leq v_i$ and all data in the second part satisfy $x_i > v_i$. For discrete data, the condition is equivalent to $x_i = v_i$ and $x_i \neq v_i$.
- Split the space recursively until the stopping condition is reached.
- The usual stopping condition is that all data in the subspace belong to the same class. There are also other conditions, such as using the $\chi^2$ value or another independence test and stopping the splitting when the split subsets are independent.

A question is how to choose the split point. In classification tasks, Gini impurity is widely used. The Gini impurity can be simply understood as **the probability of misclassification**.

$$

Gini(p) = \sum_{k = 1}^mp_k(1-p_k) = 1-\sum_{k = 1}^mp_k^2

$$

Under this situation, $p_k = \frac{|C_k|}{|D|}$ where $C_k$ is the subset of $D$ with data labeled as the $k^{th}$ class.

If $D_1 = \{X \mid x_i \leq v_i\}$ and $D_2 = \{X \mid x_i > v_i\}$, we have $D_1\cup D_2 = D$ and $D_1\cap D_2 = \emptyset$; the Gini impurity of the split is shown below.

$$

Gain(D, x_i) = \sum_{j=1}^2\frac{|D_j|}{|D|}Gini(D_j)

$$

Here, the smaller this value is, the less misclassification, so we always choose the split point and variable $x_i$ that make it smallest. **CART will combine categories into two super-categories before splitting if there are more than two categories**.
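The split search described above can be sketched in a few lines of Python (illustrative code, not from any library; all names are mine):

```python
# Exhaustive search for the (variable, value) split that minimizes the
# weighted Gini impurity described above. Purely illustrative code.
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Return (feature index, split value, weighted Gini) of the best split."""
    n = len(y)
    best = (None, None, float("inf"))
    for i in range(len(X[0])):                 # candidate variable x_i
        for v in {row[i] for row in X}:        # candidate value v_i
            left = [y[j] for j in range(n) if X[j][i] <= v]
            right = [y[j] for j in range(n) if X[j][i] > v]
            if not left or not right:
                continue
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if score < best[2]:
                best = (i, v, score)
    return best
```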

The main process of a regression tree is much like that of the classification tree. There are several differences between them.

- Fit the residuals of the previous regression results instead of the labels, and add the results together.
- Usually use the inner-class minimal mean squared error instead of Gini impurity as the measurement.

CART chooses the best splitting point by solving the following optimization problem.

$$

\min_{j,s}\left[\min_{c_1}\sum_{x_i\in R_1(j,s)}(y_i-c_1)^2 + \min_{c_2}\sum_{x_i\in R_2(j,s)}(y_i-c_2)^2\right]

$$

Here the $R_i(j,s)$ are the subspaces after splitting by condition $(j,s)$: $x_j$ is the splitting feature and $s$ is the splitting point. We use this criterion instead of Gini impurity because we want to minimize the inner-class distances.
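A brute-force sketch of the $(j,s)$ search above: each side is fitted with its mean (the inner minimization over $c_1, c_2$) and we keep the split with the least total squared error. Illustrative names only.

```python
# Search every (feature j, split point s) pair; fit each side with its mean
# and keep the split with the smallest total squared error.
def best_regression_split(X, y):
    n = len(y)

    def sse(vals):
        # Sum of squared errors around the mean: the inner min over c.
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    best = (None, None, float("inf"))
    for j in range(len(X[0])):                 # splitting feature j
        for s in {row[j] for row in X}:        # splitting point s
            left = [y[i] for i in range(n) if X[i][j] <= s]
            right = [y[i] for i in range(n) if X[i][j] > s]
            if not left or not right:
                continue
            err = sse(left) + sse(right)
            if err < best[2]:
                best = (j, s, err)
    return best
```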

Then we get $M$ subspaces, and for each subspace $R_m$ we **calculate the mean value as the regression value**, i.e. $\hat c_m = \frac{1}{N_m}\sum_{x_i\in R_m} y_i$.

The final regression function is shown below.

$$

f(x) = \sum_{m=1}^M \hat c_m I(x\in R_m)

$$

Then we can use the mean squared error to evaluate the tree and fit the residuals to improve the model. It’s a simple **boosting method**. Let $T_i(x)$ be the estimate of $y - f_{i-1}(x)$ based on CART; then we have $f_i(x) = f_{i-1}(x) + T_i(x)$. It’s a special case of Gradient Boosting Decision Tree (GBDT).

The boosting strategy mentioned above has a more general form. Gradient Boosting Decision Tree (GBDT) uses a similar recursive formula.

$$

F_m(x) = F_{m-1}(x) + \operatorname*{argmin}_{h\in H}\sum_{i=1}^n Loss(y_i, F_{m-1}(x_i) + h(x_i))

$$

We can treat the loss function as a function of the vector $F_{m-1}(x)$. Then, using the **gradient descent method**, $F_{m}(x)$ can be calculated as $F_{m-1}(x) - \eta\nabla_{F_{m-1}} Loss(x)$. We use $F_{m-1}(x)$ instead of $x$ as the gradient variable because we cannot get an expression in $x$ from a decision tree model. Our target is then to find a way to calculate $\nabla_{F_{m-1}}Loss(x)$, or an approximation of it.

Also using CART: if we use $\{(x_i, -\frac{\partial Loss(y_i,F_{m-1}(x_i))}{\partial F_{m-1}})\}$ to build a CART $T_m(x)$, then it is an estimate of $-\nabla_{F_{m-1}}Loss(x)$.

The classification tree here is not the same as the CART classification tree, because the CART classification tree does not have a gradient. The way to do classification is to use the **log-odds value**. Just like logistic regression or neural network classification, we first estimate a continuous value $logit = \ln\frac{P(y=1|x)}{P(y=0|x)}$ and use the sigmoid function (or softmax in higher dimensions) to translate it into a probability. Just like logistic regression, we can use the cross-entropy loss

$$

loss(x_i,y_i) = -y_i\log\hat y_i - (1-y_i)\log(1-\hat y_i)

$$

The function of probability is shown below.

$$

P(y=1|x) = \frac{1}{1 + e^{-F_{m-1}(x)}}

$$

So we can get

$$

loss(y_i, F_{m-1}(x_i)) = y_i\log(1+e^{-F_{m-1}(x_i)}) + (1-y_i)[F_{m-1}(x_i) + \log(1+e^{-F_{m-1}(x_i)})]

$$

Then calculate the gradient,

$$

-\frac{\partial\, loss}{\partial F_{m-1}}(x_i,y_i) = y_i - \hat y_i

$$

If we have $k$ labels, we need to use one-hot encoding and the softmax function, then **fit $k$ trees in each iteration**, one for each dimension.

To fit the model better, there is a variation.

$$F_m(x) = F_{m-1}(x) + \eta\rho_m T_m(x)$$

where $\rho_m$ is the result of the line search $\operatorname{argmin}_\rho\sum_{i}loss(x_i,y_i|F_{m-1}(x_i)+ \rho T_m(x_i))$ and $\eta$ is the learning rate.
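A toy sketch of one GBDT loop for squared loss, where the negative gradient is just the residual $y - F(x)$. The weak learners are one-feature threshold "stumps"; every name is illustrative, not from a real library.

```python
# Toy gradient boosting: at each round, fit a stump to the negative gradient
# (the residual, for squared loss) and add it with shrinkage eta.
def fit_stump(x, r):
    """Weak learner h: threshold on x, predicting the mean residual per side."""
    best = None
    for s in sorted(set(x))[:-1]:              # keep both sides non-empty
        left = [r[i] for i in range(len(x)) if x[i] <= s]
        right = [r[i] for i in range(len(x)) if x[i] > s]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or err < best[0]:
            best = (err, s, lm, rm)
    _, s, lm, rm = best
    return lambda v: lm if v <= s else rm

def gbdt_fit(x, y, rounds=30, eta=0.5):
    """F_m(x) = F_{m-1}(x) + eta * h_m(x), with h_m fitted to the residuals."""
    pred = [0.0] * len(x)
    stumps = []
    for _ in range(rounds):
        residual = [y[i] - pred[i] for i in range(len(x))]  # -dLoss/dF for L2
        h = fit_stump(x, residual)
        stumps.append(h)
        pred = [pred[i] + eta * h(x[i]) for i in range(len(x))]
    return lambda v: sum(eta * h(v) for h in stumps)
```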

In XGBoost, the regression result is represented by the formula

$$

\hat y = \sum_{k = 1}^K f_k(x)

$$

Here every $f_k(x)$ is a regression tree. Assume the tree has $T$ leaves; $q(x)\in \{1,2,\dots,T\}$ is the leaf index of $x$ and $f_k(x) = w_{q(x)}$ is the score of the input.

The regularization of XGBoost has two parts: the complexity of the tree and the magnitude of the leaf scores.

$$

\Omega(f_t(x)) = \gamma T + \frac{1}{2}\lambda |w|^2

$$

Here $T$ is the number of leaves in the tree and $w \in \Re^T$ holds the scores of the leaves.

XGBoost uses a second order Taylor expansion to approximate the loss function. Assume the loss function is $l(y_i,\hat y_i)$; we have $l(y_i,\hat y_i^{(t-1)} + f_t(x_i)) \approx l(y_i, \hat y_i^{(t-1)}) + g_if_t(x_i) + \frac{1}{2}h_if_t^2(x_i)$ where

$$

g_i = \frac{\partial l(y_i, \hat y^{(t-1)}_i)}{\partial \hat y^{(t-1)}_i}

$$

$$

h_i = \frac{\partial^2 l(y_i, \hat y^{(t-1)}_i)}{\partial (\hat y^{(t-1)}_i)^2}

$$

Removing the constant term, we have the objective function.

$$

\mathcal{L}^{(t)} = \sum_{i=1}^n[g_if_t(x_i) + \frac{1}{2}h_if_t^2(x_i)] + \gamma T + \frac{1}{2}\lambda |w|^2

$$

Rewriting the function, we have

$$

\mathcal{L}^{(t)} = \sum_{j=1}^T[w_j\sum_{i\in I_j}g_i + \frac{1}{2}w_j^2(\sum_{i\in I_j}h_i + \lambda)] + \gamma T.

$$

where $I_j = \{i \mid q(x_i) = j\}$ is the set of indices of data points assigned to leaf $j$. Then calculate the derivative and find its zero.

$$

\frac{\partial \mathcal{L}^{(t)}}{\partial w_j} = [\sum_{i\in I_j}g_i + w_j(\sum_{i\in I_j}h_i + \lambda)] = 0

$$

We can get the optimal score by solving the equation above.

$$

w_j^* = -\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i + \lambda}

$$

Then substituting it back, we have the optimal objective function.

$$

\mathcal{L}^{(t)} = -\frac{1}{2}\sum_{j=1}^T\frac{(\sum_{i\in I_j}g_i)^2}{\sum_{i\in I_j}h_i+\lambda} + \gamma T.

$$

The smaller $\mathcal{L}$ is, the better the tree structure is, so we choose the splitting point that makes $\mathcal{L}$ smallest.

The idea is to choose a splitting point making the following value as large as possible.

$$

\mathcal L_{split} = \mathcal L_{Ori} - \mathcal L_{L} - \mathcal L_{R} = \frac{1}{2}[\frac{(\sum_{i\in I_L}g_i)^2}{\sum_{i\in I_L}h_i+\lambda} + \frac{(\sum_{i\in I_R}g_i)^2}{\sum_{i\in I_R}h_i+\lambda} - \frac{(\sum_{i\in I}g_i)^2}{\sum_{i\in I}h_i+\lambda}] - \gamma

$$

The algorithm will stop creating subtrees when $\mathcal{L}_{split} < 0$ or reach the maximal depth.
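The structure-score formulas above can be written out directly: the optimal leaf weight $w^*$, the leaf objective $-G^2/(2(H+\lambda))$, and the split gain $\mathcal{L}_{split}$. This is an illustrative sketch, not XGBoost library code.

```python
# Optimal leaf weight, per-leaf objective, and split gain from the
# derivation above, for lists of per-example gradients g and hessians h.
def leaf_weight(g, h, lam=1.0):
    """w* = -G / (H + lambda) for one leaf."""
    return -sum(g) / (sum(h) + lam)

def leaf_obj(g, h, lam=1.0):
    """Contribution of one leaf to L: -G^2 / (2(H + lambda))."""
    return -0.5 * sum(g) ** 2 / (sum(h) + lam)

def split_gain(gL, hL, gR, hR, lam=1.0, gamma=0.0):
    """L_split for splitting a node into left/right; larger is better."""
    parent = leaf_obj(gL + gR, hL + hR, lam)
    return parent - leaf_obj(gL, hL, lam) - leaf_obj(gR, hR, lam) - gamma
```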

When I use `torch.nn.utils.rnn.pad_sequence` to pad words and feed the padded sequence into an LSTM/RNN, an input sorted by length is necessary. But a length-sorted sequence makes evaluation harder because the original order is lost. So here is a way to recover the original order of the sorted tensor using PyTorch functions.

```python
x = torch.randn(10)
```

Here x is `tensor([-0.4321, 0.3852, 0.6008, 0.8452, -0.4709, 0.7610, -0.9743, -0.9819, -1.1142, -0.1249])`

and then we do the sort.

```python
sorted_x, idx = torch.sort(x)
```

Here `idx` holds the indices of `x` in sorted order, `tensor([8, 7, 6, 4, 0, 9, 1, 2, 5, 3])`. Then we can get the original order back just by sorting `idx` itself.

```python
_, rev_idx = torch.sort(idx)
print(sorted_x[rev_idx])
```

We can see that the script prints `tensor([-0.4321, 0.3852, 0.6008, 0.8452, -0.4709, 0.7610, -0.9743, -0.9819, -1.1142, -0.1249])`, which equals the original `x`. It’s amazing, isn’t it? I’ll now show you why it works.

We suppose the sort corresponds to an $n$-permutation $\sigma$, i.e. `idx` is the permutation with `sorted_x[i] = x[idx[i]]` and $x_{\sigma(1)} \leq x_{\sigma(2)} \leq \dots \leq x_{\sigma(n)}$.

Then we do the second sort, this time on `idx` itself. Since `idx` contains every index exactly once, sorting it yields $(0, 1, \dots, n-1)$, and the returned `rev_idx` satisfies `idx[rev_idx[i]] = i`. In other words, `rev_idx` is the inverse permutation $\sigma^{-1}$.

The code `sorted_x[rev_idx]` selects elements of the sorted tensor with subscripts taken from $\sigma^{-1}$. **So for all $i$, `sorted_x[rev_idx[i]] = x[idx[rev_idx[i]]] = x[i]`.** Finally, `sorted_x[rev_idx]` is the original tensor.
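The same double-sort trick can be demonstrated in plain Python, with the permutations explicit; `argsort` here plays the role of the `idx` returned by `torch.sort`.

```python
# Sort, then sort the index array: the second sort produces the inverse
# permutation, which restores the original order.
def argsort(seq):
    return sorted(range(len(seq)), key=lambda i: seq[i])

x = [-0.4321, 0.3852, 0.6008, 0.8452, -0.4709,
     0.7610, -0.9743, -0.9819, -1.1142, -0.1249]
idx = argsort(x)                     # sorted_x[i] == x[idx[i]]
sorted_x = [x[i] for i in idx]
rev_idx = argsort(idx)               # inverse permutation: idx[rev_idx[i]] == i
restored = [sorted_x[i] for i in rev_idx]
assert restored == x                 # the original order is recovered
```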

Duality and the KKT conditions are very important in machine learning, especially for SVM models. I’ll focus on the high-level idea, the derivation of Lagrange duality, and how it leads to the KKT conditions. Some concepts should be covered first.

The basic form of an optimization problem without restrictions is just finding the $x \in \Re^d$ that achieves

$$
\min_x f(x)
$$

A simple solution is to calculate the derivatives of $f(x)$, solve the resulting equation and test whether the solution is the minimum.

Consider an optimization problem with equality restrictions.

$$
\min_x f(x) \quad \text{s.t.}\quad h_j(x) = 0,\ j = 1,\dots,l
$$

The Lagrange multiplier method solves this kind of problem. We can rewrite the objective function as $L(x, \beta) = f(x) + \sum_{j=1}^l \beta_j h_j(x)$, and one can prove that the solution of $\min_x \max_\beta L(x, \beta)$ is equal to the solution of the previous problem. Here the $\beta_j$ are called Lagrange multipliers. The new optimization problem is

$$
\min_x \max_\beta L(x, \beta)
$$

and the new function $L(x, \beta)$ is called the Lagrange function.

If there is any $h_j(x) \neq 0$, the inner maximum becomes unbounded due to the unrestricted $\beta_j$, so the outer minimization is forced onto the feasible set, which makes the solution finite and equal to that of the constrained problem.

Let $\theta_D(\beta) = \min_x L(x, \beta)$; then there is a trivial theorem (weak duality) that

$$
\max_\beta \min_x L(x, \beta) \leq \min_x \max_\beta L(x, \beta)
$$

Here $\max_\beta \theta_D(\beta)$ is the dual problem of $\min_x \max_\beta L(x, \beta)$.

Assume $f, g_i, h_j$ are continuous functions on $\Re^d$, then consider the restricted optimization problem.

$$
\min_x f(x) \quad \text{s.t.}\quad g_i(x) \leq 0,\ i = 1,\dots,k, \qquad h_j(x) = 0,\ j = 1,\dots,l
$$

We already know that a problem without restrictions can be solved easily by calculating derivatives and testing. So our first step is to translate the primal problem into a problem without restrictions.

We have an enhanced Lagrange function of the form $L(x, \alpha, \beta) = f(x) + \sum_{i=1}^k \alpha_i g_i(x) + \sum_{j=1}^l \beta_j h_j(x)$. Here $\alpha_i \geq 0$ because **the direction** of the inequality $g_i(x) \leq 0$ has been restricted.

Define a new function $\theta_P(x) = \max_{\alpha \geq 0,\, \beta} L(x, \alpha, \beta)$; we can conclude that $\theta_P(x) = f(x)$ under all primal constraints.

Obviously $\theta_P(x) \geq L(x, \alpha, \beta)$, so $\theta_P$ is an upper bound of $L$. Then under all constraints we have $\alpha_i g_i(x) \leq 0$ and $h_j(x) = 0$, so $L(x, \alpha, \beta) \leq f(x)$, and the maximum $\theta_P(x) = f(x)$ is attained when $\alpha_i g_i(x) = 0$. For an infeasible $x$, $\theta_P(x) = +\infty$.

In conclusion, the primal problem has the equivalent form

$$
\min_x \theta_P(x) = \min_x \max_{\alpha \geq 0,\, \beta} L(x, \alpha, \beta)
$$

We already know the equivalent form of the primal problem, but in this form we must still handle the inner maximization under the constraints, which makes the calculation too complicated. The next step is to find a simpler way of characterizing the **best solution**.

Consider the dual problem $\max_{\alpha \geq 0,\, \beta} \min_x L(x, \alpha, \beta)$; a desirable property is that it attains the same value as the primal problem when $x^*$ is the best solution of the primal problem (strong duality).

Thinking back to the transformation of the primal problem, if the dual problem equals the primal problem at $x = x^*$ with multipliers $(\alpha^*, \beta^*)$, the formula should be

$$
f(x^*) = L(x^*, \alpha^*, \beta^*) = \min_x L(x, \alpha^*, \beta^*)
$$

Then consider the stationarity of the inner optimization: minimizing $L$ over $x$ at $x^*$ leads to

$$
\nabla_x L(x^*, \alpha^*, \beta^*) = \nabla f(x^*) + \sum_{i=1}^k \alpha_i^* \nabla g_i(x^*) + \sum_{j=1}^l \beta_j^* \nabla h_j(x^*) = 0
$$

Then consider the parameters $\alpha_i^*$. There are two situations for each inequality constraint. The first is that the minimizer lies on the boundary, $g_i(x^*) = 0$; the other is that it lies strictly inside, $g_i(x^*) < 0$.

For the first case, the inequality constraint becomes an equality constraint, and $\alpha_i^*$ may be positive.

For the second case, the inequality constraint disappears: $f(x^*) = L(x^*, \alpha^*, \beta^*)$ forces $\alpha_i^* g_i(x^*) = 0$, so $\alpha_i^* = 0$.

Combining the two situations, we have $\alpha_i^* g_i(x^*) = 0$ (complementary slackness). Under this constraint, $L$ becomes a regular Lagrange function, which keeps the multiplier constraint $\alpha_i^* \geq 0$.

So the final set of constraints becomes

$$
\begin{cases}
\nabla_x L(x^*, \alpha^*, \beta^*) = 0 \\
g_i(x^*) \leq 0, \quad h_j(x^*) = 0 \\
\alpha_i^* \geq 0 \\
\alpha_i^* g_i(x^*) = 0
\end{cases}
$$

This is the KKT condition.
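As a tiny worked example of these conditions (my own illustration, not from the derivation above): minimize $f(x) = x^2$ subject to $g(x) = 1 - x \leq 0$.

$$
\begin{aligned}
L(x, \alpha) &= x^2 + \alpha(1 - x), \quad \alpha \geq 0\\
\text{stationarity: } & 2x - \alpha = 0\\
\text{complementary slackness: } & \alpha(1 - x) = 0
\end{aligned}
$$

If $\alpha = 0$ then $x = 0$, which violates $1 - x \leq 0$, so the constraint must be active: $x^* = 1$ and $\alpha^* = 2 \geq 0$. All four KKT conditions hold at $(x^*, \alpha^*) = (1, 2)$.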

CAEN is the information technology (IT) services department of the University of Michigan (U-M) College of Engineering, and offers IT resources to support the College’s educational, research, and administrative needs. It’s quite inefficient to manage files on CAEN using command line tools when I need to test our code in the CAEN environment: I need to type the whole sftp command and path every time. Editor plugins are a great solution. There are many tutorials on the Internet about connecting with the Sublime Text editor, but there is no documentation about VS Code. As a fan of VS Code, that’s why I want to write this article.

This is my own running environment; yours does not have to match it exactly.

- Operating system: Windows 10
- VSCode Version: 1.36.1
- Plugin: SFTP (by liximomo)

After downloading the plugin, press `Ctrl+Shift+P` and run the `SFTP: config` command. This command will build a configuration file named `sftp.json`

in your folder (you may need to open a folder first), and it will look something like the following (the values here are illustrative):

```json
{
    "name": "My Server",
    "host": "login.engin.umich.edu",
    "protocol": "sftp",
    "port": 22,
    "username": "uniqname",
    "remotePath": "/home/uniqname",
    "uploadOnSave": false
}
```

Name your server casually in `name` and type the host address of the CAEN machine in `host`. The host will be something like `login.engin.umich.edu`. It’s not necessary to change the `protocol` and `port`. The `username` is your UMich unique name.

Here is the explanation of these parameters.

- `name` is your own name for this server; you can name it casually.
- `host` is the host address of the CAEN machine, like `login.engin.umich.edu`.
- `protocol` is the connection protocol; you don’t need to change the default `sftp`.
- `port` is the port of the connecting server.
- `username` is your own UMich uniqname, which is needed for signing in to the server.
- `remotePath` is the path *on the CAEN machine* where your local files will be uploaded, for example `home/username`.
- `uploadOnSave` is the switch for auto-uploading to the server. If the value is true, files will be automatically uploaded to your server whenever you save them locally.

After all these settings are saved, you will see a new icon on the Activity Bar.

The University of Michigan uses two-factor authentication to authenticate your account, so we need to add a new parameter to handle this. Add a new attribute `interactiveAuth` in the JSON file and set it to `true`. The whole configuration file will then look like (values illustrative):

```json
{
    "name": "My Server",
    "host": "login.engin.umich.edu",
    "protocol": "sftp",
    "port": 22,
    "username": "uniqname",
    "remotePath": "/home/uniqname",
    "uploadOnSave": false,
    "interactiveAuth": true
}
```

Double click the server under the SFTP section of the activity bar.

After connecting, you will see an input window at the top of the editor, followed by a two-factor authentication prompt. Enter the code and pass the authentication through the app or a message, and then you will see the directory of your server machine.

Note that if you use the address `"/"` in `remotePath`, you will connect to the public area and will **not have permission to open the private folders, including yours**.

Optimization is the focus of many kinds of machine learning algorithms, like linear regression, SVM and K-means. Many target functions are non-convex, which means we can only find their local minima, but convex functions still play an important role in machine learning. And the Hessian matrix is a great algebraic tool for analyzing convex functions, since in most cases our target function will be real, continuous and $2^{nd}$-order differentiable. The main goal of this article is to record the proof of the equivalence between convex functions and their Hessians. Here are some important definitions.

A **Convex Set** $C\subseteq \Re^n$ is a set of points s.t. $\forall x, y \in C$ and $t \in [0,1]$, $tx+(1-t)y \in C$.

A function $f:\Re^n \rightarrow \Re$ is a **Convex Function** if its domain $D$ is a **convex set** and for any $x, y \in D$ and any $t \in [0,1]$,

$$
f(tx + (1-t)y) \leq tf(x) + (1-t)f(y)
$$

A **Hessian Matrix** is a square matrix of **second-order partial derivatives** of a function $f:\Re^n \rightarrow \Re$, usually written as:

$$
(\nabla^2 f)_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}
$$

A **real symmetric matrix** $P$ is called **Positive Semi-Definite** (PSD) when for all $x \in \Re^n$, there are $x^TPx \geq 0$. And it’s called **Positive Definite** (PD) when for all $x \neq 0 \in \Re^n$, there are $x^TPx > 0$.

There is a strong relationship between Convex Functions and their Hessians. Here is what I want to prove today.

A $2^{nd}$-order differentiable function $f$ with convex domain $D$ is (strictly) convex **if and only if** its Hessian is PSD (PD).

This conclusion is also called the **Second Order Condition** of a convex function. To prove this, we need to introduce a **First Order Condition** that is

A $1^{st}$-order differentiable function $f$ with convex domain $D$ is (strictly) convex **if and only if** for any $x, y\in D$, $f(y) \geq f(x) + \nabla^T f(x)(y-x)$.

I divided the proof into two parts. First we prove that if $f$ is a convex function, then the first order condition holds.

If $f$ is convex, we have

$$
f(x + t(y-x)) = f((1-t)x + ty) \leq (1-t)f(x) + tf(y)
$$

So, we can see

$$
\frac{f(x + t(y-x)) - f(x)}{t} \leq f(y) - f(x)
$$

Let $t\rightarrow 0$,

$$
\nabla^T f(x)(y-x) \leq f(y) - f(x)
$$

Then we can prove that, under the first order condition, $f$ is a convex function.

If $f$ satisfies the first order condition, for all $x, y\in \Re^n$ and $t\in [0,1]$, let $z = tx + (1-t)y$; we have

$$
f(x) \geq f(z) + \nabla^T f(z)(x-z), \qquad f(y) \geq f(z) + \nabla^T f(z)(y-z)
$$

Adding them together with weights $t$ and $1-t$ (the gradient terms cancel, since $t(x-z) + (1-t)(y-z) = 0$), we have

$$
tf(x) + (1-t)f(y) \geq f(z) = f(tx + (1-t)y)
$$

So $f(x)$ is a convex function.

Now all prerequisites are proved; it’s time to prove the *Second Order Condition*! Again, I split the proof into two parts.

First we prove that if the Hessian $H$ of $f$ is PSD, then $f$ is convex.

By Taylor’s theorem with the Lagrange remainder, for any $x, y$ there exists $\xi$ on the segment between them such that

$$
f(y) = f(x) + \nabla^T f(x)(y-x) + \frac{1}{2}(y-x)^T H(\xi)(y-x) \geq f(x) + \nabla^T f(x)(y-x)
$$

So $f$ is convex due to the **first order condition**.

Then we can prove the reverse part.

If $f$ is convex, according to the **first order condition**, for any direction $y$ and small $\lambda > 0$ we have

$$
f(x + \lambda y) \geq f(x) + \lambda \nabla^T f(x) y
$$

Then, expanding the left side to second order,

$$
\frac{\lambda^2}{2} y^T \nabla^2 f(x) y + o(\lambda^2) \geq 0
$$

Divide by $\lambda^2$ and let $\lambda\rightarrow0$; we have $y^T\nabla^2f(x)y \geq 0$.

So $\nabla^2f(x)$ is PSD.
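A small numeric illustration of the second order condition (my own example): for the convex function $f(x, y) = x^2 + xy + y^2$ the Hessian is constant, its leading principal minors are positive (so it is in fact PD), and sampling $v^T H v$ on a few directions is consistent with that.

```python
# Check PD-ness of the Hessian of f(x, y) = x^2 + x*y + y^2 via its leading
# principal minors, and sample the quadratic form on a few directions.
H = [[2.0, 1.0],
     [1.0, 2.0]]

def quad_form(H, v):
    """Compute v^T H v."""
    return sum(v[i] * H[i][j] * v[j] for i in range(2) for j in range(2))

minors = (H[0][0], H[0][0] * H[1][1] - H[0][1] * H[1][0])  # 2 and 3: both > 0
samples = [(1.0, 0.0), (0.0, 1.0), (1.0, -1.0), (-2.0, 3.0)]
assert all(m > 0 for m in minors)
assert all(quad_form(H, v) > 0 for v in samples)
```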

In this article I will try to solve the **Best Time to Buy and Sell Stock** series, including **Best Time to Buy and Sell Stock I, II, III, IV** and **with Cooldown.** Most of them are solved by **dynamic programming**, and I will focus on constructing transition equations and on dimension reduction.

The description of **Best Time to Buy and Sell Stock I** is:

Say you have an array for which the $i^{th}$ element is the price of a given stock on day $i$.

If you were only permitted to complete at most one transaction (i.e., buy one and sell one share of the stock), design an algorithm to find the maximum profit.

Note that you cannot sell a stock before you buy one.

Example:

```
Input: [7,1,5,3,6,4]
Output: 5
Explanation: Buy on day 2 (price = 1) and sell on day 5 (price = 6), profit = 6-1 = 5.
```

A simple idea is using `dp[i]` as **the most profit when buying on the $i^{th}$ day.** Then the transition equation will be `dp[i] = max(prices[j] - prices[i]) for all j > i` and the solution is `max(dp)`. It is an $O(n^2)$ algorithm, but it wastes computation. Suppose $j$ is the specific day such that `dp[i] = prices[j] - prices[i]`; if there is a day $k > i$ with a lower price than day $i$, then `dp[k] >= dp[i]`, so `dp[i]` can never beat `dp[k]` as the solution. Under this circumstance, we can simplify the algorithm by **always tracking the lowest-price day as the buying day**: record the current price minus the buying-day price (**the lowest price before/on the current $i^{th}$ day**) and generate a sequence of profits. `profit[i]` means the difference between the $i^{th}$ day’s price and the lowest price before/on the $i^{th}$ day, so `max(profit)` is the solution. By doing so, we reduce the method to $O(n)$ time.

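A minimal sketch of this one-pass idea (in Python for brevity):

```python
# Track the lowest price so far (the best buying day) and the best profit
# obtainable by selling today.
def max_profit(prices):
    best, lowest = 0, float("inf")
    for p in prices:
        lowest = min(lowest, p)        # lowest price before/on today
        best = max(best, p - lowest)   # profit if we sell today
    return best
```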

In Problem II, we have no limit on the number of transactions: **we can buy/sell any number of times.** When we try to use `dp[i]` as above, we find it hard to build a transition equation because we don’t know how many transactions there will be, so we have to change the state description. We have **only three actions** in a day (buying, selling and doing nothing), so we can use two states to describe a day, i.e. **a day with stock** and **a day without stock**. Let `nohold[i]` be the maximal profit when we hold no stock on the $i^{th}$ day and `hold[i]` be the maximal profit when we hold stock. Then the transition equation will be

```cpp
hold[i] = max(hold[i-1], nohold[i-1] - prices[i]);
nohold[i] = max(nohold[i-1], hold[i-1] + prices[i]);
```

That simply means: if we hold stock on the $i^{th}$ day, it was either bought today or already held yesterday; and if we hold no stock on the $i^{th}$ day, it was either sold today or we already had none. With this equation we can solve the problem in one pass. Don’t forget the initialization `hold[0] = -prices[0]`.

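The hold/nohold DP above, sketched in Python with $O(1)$ space:

```python
# Two rolling states: hold = best profit while holding a share,
# nohold = best profit while holding none.
def max_profit_unlimited(prices):
    if not prices:
        return 0
    hold, nohold = -prices[0], 0
    for p in prices[1:]:
        hold, nohold = max(hold, nohold - p), max(nohold, hold + p)
    return nohold
```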

There is another solution that **does not use DP.** A trivial idea is that if we buy at the beginning of every **increasing run** and sell at its end, we get the most profit.

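A one-line Python sketch of that greedy: summing every positive day-to-day difference is equivalent to buying at the start of each increasing run and selling at its end.

```python
# Sum of all positive consecutive differences = total profit of trading
# every increasing run.
def max_profit_greedy(prices):
    return sum(max(prices[i] - prices[i - 1], 0) for i in range(1, len(prices)))
```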

Problem III is a special case of Problem IV, so we just introduce Problem IV. In Problem IV, we have the limitation that **we can buy at most $k$ times ($k$ is given).** It can be solved much like the DP algorithm of Problem II: we use a similar state description and just add a dimension for the **number of transactions.** Let `hold[i][j]` be the maximal profit when we hold stock after $j$ transactions on the $i^{th}$ day, and `nohold[i][j]` the maximal profit when we hold no stock after $j$ transactions on the $i^{th}$ day. As in Problem II, the transition equation can be written as

```cpp
hold[i][j] = max(hold[i-1][j], nohold[i-1][j-1] - prices[i]);
nohold[i][j] = max(nohold[i-1][j], hold[i-1][j] + prices[i]);
```

The solution will be `nohold[n-1][k-1]`. What needs to be mentioned is that **we count transactions by counting buys, not sells.** It is again a one-pass method.


But the code **did not pass!** We got a **Memory Limit Exceeded**, so I started to reduce the dimension of the equation. Obviously, both `hold[i][j]` and `nohold[i][j]` depend only on `hold[i-1][*]` and `nohold[i-1][*]`, so we can reduce the state to one dimension:

```cpp
hold[j] = max(hold[j], nohold[j-1] - prices[i]);
nohold[j] = max(nohold[j], hold[j] + prices[i]);
```

Also, using a **sentinel $0$ in `nohold[j]`** makes the code look better (it reduces the number of `if`s). So we get code like this.


~Ok, we have already solved it!~ Wait, it’s still **Memory Limit Exceeded!** But why? If we consider a $k$ so large that the limitation is meaningless, the problem **reduces to Problem II.** But the time complexity of our solution is still $O(k \cdot n)$, a very large number compared with the $O(n)$ solution of Problem II. We can fix this with a simple `if` statement.

```cpp
if (k > n/2) {  // k is effectively unlimited: fall back to the O(n) greedy of Problem II
    int profit = 0;
    for (int i = 1; i < n; ++i) profit += max(prices[i] - prices[i-1], 0);
    return profit;
}
```

And here is the whole program of Problem IV.

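A Python sketch of Problem IV combining both fixes discussed above: the `k > n/2` shortcut and the 1D states with a sentinel `nohold[0] = 0`.

```python
# k-transaction DP with 1D state arrays; nohold[0] is the sentinel state
# "zero transactions used".
def max_profit_k(k, prices):
    n = len(prices)
    if n == 0 or k == 0:
        return 0
    if k > n // 2:  # the limit is meaningless: Problem II greedy
        return sum(max(prices[i] - prices[i - 1], 0) for i in range(1, n))
    hold = [float("-inf")] * (k + 1)   # hold[j]: best profit holding, j buys used
    nohold = [0] * (k + 1)
    for p in prices:
        for j in range(k, 0, -1):      # descending j keeps day i-1 values intact
            nohold[j] = max(nohold[j], hold[j] + p)
            hold[j] = max(hold[j], nohold[j - 1] - p)
    return nohold[k]
```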

Cooldown means we have to ~have a rest and take a coffee~ the day after selling: **buying the day after a selling is not allowed.** Does that mean our state description above cannot be used again? Of course not! We just need a little modification: add a new vector `cooldown[i]` meaning the maximal profit when we **just sold or do nothing** on the $i^{th}$ day, and keep `hold_stock[i]` and `hold_no_stock[i]` as above. The transition of cooldown is `cooldown[i] = max(hold_no_stock[i-1], hold_stock[i-1] + prices[i])`, which means that today we either sell the stock or do nothing. The transition of `hold_stock` is still `hold_stock[i] = max(hold_stock[i-1], hold_no_stock[i-1] - prices[i])`, because the cooldown doesn’t influence buying. Finally, the transition equation of `hold_no_stock[i]` can be `hold_no_stock[i] = max(hold_no_stock[i-1], cooldown[i-1])`, meaning that yesterday was a **cooldown day or a no-stock day.** Combining them together, we have

```cpp
hold_no_stock[i] = max(hold_no_stock[i-1], cooldown[i-1]);
hold_stock[i] = max(hold_stock[i-1], hold_no_stock[i-1] - prices[i]);
cooldown[i] = max(hold_no_stock[i-1], hold_stock[i-1] + prices[i]);
```

Don’t forget the initialization `hold_stock[0] = -prices[0]; cooldown[0] = INT_MIN;`. It’s also an $O(n)$ one-pass method now. In conclusion, all these kinds of problems can be solved with the dynamic programming idea, and the basic step is to form the transition equation. **The number of state variables and the number of dimensions are interchangeable** when constructing the equation, so if you have no idea how to form it, increasing the number of state variables is a good choice.

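The three-state cooldown DP above, sketched in Python with $O(1)$ space; the simultaneous assignment uses yesterday's values, matching the `[i-1]` indices in the text.

```python
# Three rolling states: hold (holding a share), nohold (no share, free to buy),
# cooldown (just sold today, or doing nothing).
def max_profit_cooldown(prices):
    if not prices:
        return 0
    hold, nohold, cooldown = -prices[0], 0, float("-inf")
    for p in prices[1:]:
        hold, nohold, cooldown = (
            max(hold, nohold - p),      # keep holding, or buy today
            max(nohold, cooldown),      # today is a rest / no-stock day
            max(nohold, hold + p),      # sell today, or do nothing
        )
    return max(nohold, cooldown)
```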

In this article I will describe two **dynamic programming** algorithms solving the LIS problem, and the **STL functions** `lower_bound()` and `upper_bound()`.

Given an unsorted array of integers, find the length of longest increasing subsequence.

Example:

Input: $[10,9,2,5,3,7,101,18]$

Output: 4

Explanation: The longest increasing subsequence is $[2,3,7,101]$, therefore the length is $4$.

Here is a trivial state description: `dp[i]` means the length of the longest increasing subsequence **ending with the $i^{th}$ element.** We can easily see that the value of `dp[i]` is determined by all increasing subsequences ending at some $j < i$ that **maintain the increasing property** with the $i^{th}$ value. Mathematically, `dp[i]` is determined by all values `dp[j]` with $j < i$ and $nums[i] > nums[j]$, where `nums` is the input vector. So the state transition equation is

dp[i] = max(dp[j]) + 1 with j < i, nums[j] < nums[i]

This method needs two nested loops, so it’s an $O(n^2)$ algorithm.

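A sketch of the $O(n^2)$ DP above (in Python for brevity):

```python
# dp[i] = length of the longest increasing subsequence ending with nums[i].
def length_of_lis(nums):
    if not nums:
        return 0
    dp = [1] * len(nums)
    for i in range(1, len(nums)):
        for j in range(i):
            if nums[j] < nums[i]:
                dp[i] = max(dp[i], dp[j] + 1)
    return max(dp)
```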

Comparing all optimal subsequences of the same length, the one **with the least last number** guarantees that when a new number is added, the new subsequence will still be optimal. For example, in the sequence $[1,3,5,2,7,4,5]$ we have two increasing subsequences of length $4$: $[1,3,5,7]$ and $[1,2,4,5]$. If we then append $6$ to the sequence, the first subsequence stays $[1,3,5,7]$ while the second becomes $[1,2,4,5,6]$.

But how do we guarantee that the subsequence has the least last number? We can do so by replacing the number **just larger than the new number** with the new number. The replacement won’t change the length of the subsequence but generally decreases its values.

There is a very nice property: the maintained subsequence is itself sorted, which means that, given the subsequence and a new number, we can find the **correct position** of the new number in only $O(\lg n)$ time. We generate a longer increasing subsequence by **appending the new number** if it’s larger than all numbers in the subsequence, and do the replacement if not. The whole time complexity is $O(n\lg n)$.

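A Python sketch of the $O(n\lg n)$ method above, with `bisect_left` playing the role of C++ `lower_bound`:

```python
# tails[k] holds the least possible last number of an increasing subsequence
# of length k + 1; its length at the end is the LIS length.
import bisect

def length_of_lis_fast(nums):
    tails = []
    for x in nums:
        pos = bisect.bisect_left(tails, x)  # first element >= x
        if pos == len(tails):
            tails.append(x)                 # x extends the longest subsequence
        else:
            tails[pos] = x                  # replace the number just larger
    return len(tails)
```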

**`lower_bound` and `upper_bound` in STL.** Notice that I used the `lower_bound` function in the previous code. It’s a binary search function in the STL. Both it and `upper_bound` use binary search and return a position in a range. The difference is that `lower_bound` returns the position of the first element larger than **or equal to** the target, while `upper_bound` returns the position of the first element **strictly** larger than the target. Both functions take three parameters: the first is an iterator referring to the beginning of the search range, the second an iterator referring to the end, and the third is the target value. Here is a reference implementation of `lower_bound`.

```cpp
template <class ForwardIterator, class T>
ForwardIterator lower_bound(ForwardIterator first, ForwardIterator last, const T& val) {
    typename iterator_traits<ForwardIterator>::difference_type count, step;
    count = distance(first, last);
    while (count > 0) {
        ForwardIterator it = first;
        step = count / 2;
        advance(it, step);
        if (*it < val) {        // target is in the right half
            first = ++it;
            count -= step + 1;
        } else {
            count = step;       // target is in the left half (or here)
        }
    }
    return first;
}
```

What should be mentioned is that **the begin position is included but the end position is not.** The function uses **binary search**, so the time complexity is $O(\lg n)$ where $n$ is the distance between the two iterators.

Design and implement a data structure for a **Least Recently Used (LRU) cache**. It should support the following operations: `get` and `put`.

- `get(key)`: Get the value (will always be positive) of the key if the key exists in the cache; otherwise return -1.
- `put(key, value)`: Set or insert the value if the key is not already present. When the cache reaches its capacity, it should invalidate the least recently used item before inserting a new item.

The cache is initialized with a **positive** capacity.

To solve this problem, we need to design a data structure with the following properties:

- It can visit and set/insert an item as fast as possible (such as a **vector or map**).
- It can order the data **by operation time.**
- It can quickly check for **overflow of capacity.**

Because the available data structures differ between languages, I will choose **Python** and **C++** as my solution languages.

I will introduce a Python data structure called **OrderedDict**. This is a kind of dictionary (in fact it inherits from Python’s `dict`) ordered by insertion time. Python uses an extra **circular linked list**, saving nodes of the form $[PREV, NEXT, KEY]$, to realize this data structure. Obviously, it is perfectly suitable for our problem.

The only remaining problem is that we need a structure ordered by **operation time**, not **insertion time**. So we simply **delete and re-insert** on every operation.

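A sketch of the delete-and-reinsert OrderedDict approach described above; `move_to_end` re-inserts a key at the back, so the least recently used key is always at the front and `popitem(last=False)` evicts it.

```python
# LRU cache on top of collections.OrderedDict: the dict order is kept equal
# to the order of last use.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, key):
        if key not in self.cache:
            return -1
        self.cache.move_to_end(key)         # mark as most recently used
        return self.cache[key]

    def put(self, key, value):
        if key in self.cache:
            del self.cache[key]             # delete first, then insert again
        elif len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used
        self.cache[key] = value
```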

C++ provides many kinds of STL containers, but nothing like Python’s OrderedDict. The design idea is to **combine two or more containers** (as the OrderedDict source code does). If we want a structure with **insertion order**, a stack would be a first thought, but we also want to **keep the fast insertion and deletion** of a map/vector, which conflicts with a stack’s property that we cannot move or delete a node in the middle. Therefore a **linked list** (`list` in the STL) is a great choice, which also matches Python’s choice. To keep the order of operation time, we need to solve the following problems:

- We need a fast way to **visit/insert/delete** a node of the linked list **given a key**.
- We need a fast way to **move a node to the front** of the linked list after every operation.
- When we evict the first value of the linked list, we need a fast way to **delete the corresponding pair in the map/vector.**

These properties keep the linked list ordered by operation time with a small modification cost. Actually, properties $1$ and $2$ can be combined thanks to the nature of a linked list. A map satisfies properties $1$ and $2$: we can use a **map(key, node)** to reach a linked list node in $O(1)$ time. For property $3$, we could keep a **reverse_map(value, key)** to quickly delete the corresponding pair in the map, but that is inconvenient and costs extra space; we can just store **node(value, key)** pairs in the linked list to do the same thing.

In conclusion, what we need is to combine a **map**(or unordered_map) and a **list.**


There are also some tricky cases: for example, we need to **check whether the key is already in the cache first**, because changing a value in the cache requires deleting the old value and then inserting a new one.

The combination of STL containers is not the only way to solve this problem; many hand-built data structures perform better. A specific example is using **circular linked list nodes**, just like Python does in OrderedDict. I won’t cover this method here; you can find related articles easily.

Given a linked list, return where the circle begins. For example, the linked list $[3, 2, 0, 4]$ with circle $[2, 0, 4]$ is shown below.

The algorithm should return the second node. I use C++ to solve this problem and define the node as below.

```cpp
// Definition for singly-linked list.
struct ListNode {
    int val;
    ListNode *next;
    ListNode(int x) : val(x), next(NULL) {}
};
```

A trivial idea for this problem is to save each visited node in a hash set while traversing and check whether the current node has been visited before.

```cpp
class Solution {
public:
    ListNode *detectCycle(ListNode *head) {
        unordered_set<ListNode*> visited;
        while (head != NULL) {
            if (visited.count(head)) return head;  // first revisited node: circle begin
            visited.insert(head);
            head = head->next;
        }
        return NULL;  // reached the end: no circle
    }
};
```

This method has $O(n)$ time complexity and $O(n)$ space complexity. But is there a method that solves the problem with $O(1)$ space complexity?

Set two pointers: the slower one moves one step at a time and the faster one moves two steps at a time. Once they meet, **reset the faster one to the head pointer** and let both move one step at a time; they will finally meet at the begin node of the circle. This is an algorithm without extra space. But why does it work?

There is a mathematical idea. Suppose the distance from the head node to the begin of the circle is $x_1$, the distance from the begin of the circle to the meeting point along the circle is $x_2$, and the distance from the meeting point back to the begin of the circle is $x_3$. When the pointers meet, the fast pointer has traveled twice as far as the slow one, and the extra distance is some number $n$ of full laps, so there is the velocity equation

$$
2(x_1 + x_2) = x_1 + x_2 + n(x_2 + x_3) \quad\Rightarrow\quad x_1 = (n-1)(x_2 + x_3) + x_3
$$

It means that the difference between $x_3$ and $x_1$ is a multiple of the circle length. Due to this relation between $x_3$ and $x_1$, if the fast pointer moves from the head node while the slow pointer moves from the meeting point, **they will finally meet at the begin of the circle.**

```cpp
class Solution {
public:
    ListNode *detectCycle(ListNode *head) {
        ListNode *slow = head, *fast = head;
        while (fast != NULL && fast->next != NULL) {
            slow = slow->next;
            fast = fast->next->next;
            if (slow == fast) {          // they meet inside the circle
                fast = head;             // reset the faster one to the head
                while (slow != fast) {
                    slow = slow->next;
                    fast = fast->next;
                }
                return slow;             // the begin node of the circle
            }
        }
        return NULL;                     // fast reached NULL: no circle
    }
};
```

Don’t forget the special case of a NULL head pointer; and if the fast pointer reaches NULL, it means **there is no circle.**