# Journal Club 01 - Singular Value Decomposition and its Applications

Download the pdf file of the slides by clicking [here]

# EECS 589 Computational Data Science Cheat Sheet

You can download the cheat sheet directly here [pdf]. If you want to edit the pdf based on the current version, you can turn to my Overleaf project online at this link [overleaf project]. Sorry, I have only enabled view access to the project.

## Inner/outer product

$e_ie_j^T\rightarrow E_{i,j}=1$, $R(A)=span(A)=\{Ax:x\in R^n\}$

Vector input into a single hidden neuron: $f(x)=f_a(w^Tx+b)$, where $f_a$ is the activation function

Euclidean Norm (length) of vector: $||x||_2=\sqrt{x^Hx}$

If $Q$ is a unitary matrix then: $||Qx||_2=||x||_2$

The trace can only be applied to square matrices.

Left-multiplying by a matrix transforms the rows, while right-multiplying transforms the columns. Accordingly, $e_i^TA$ is the $i$-th row of $A$ while $Ae_i$ is the $i$-th column of $A$.

### Determinant

$$\det (A)=\det (A^T),\quad\det(AB)=\det(BA)$$

Optional:

$$\det \left[\begin{array}{cc}

A & B \\

C & D

\end{array} \right]=\det(A)\det(D-CA^{-1}B)$$

$$\det \left[\begin{array}{cc}

A & B \\

C & D

\end{array} \right]=\det(D)\det(A-BD^{-1}C)$$

### Orthonormal Matrix

A matrix $A$ is orthonormal if:

$$AA^T=A^TA=I$$

which means that both the column vectors and the row vectors are orthonormal vectors. More generally, if $A$ and $B$ are square matrices and $AB=I$, then $BA=AB=I$.

If $U$ and $V$ are orthonormal matrices, then $UV$ and $VU$ are both orthonormal matrices.

## Singular Value Decomposition

$$A=\sum_{i=1}^r\sigma_iu_iv_i^H$$

$u_i/v_i$: left/right singular vectors, $u_i^Hu_j=v_i^Hv_j=\delta_{ij}$

SVD is not unique: $\{\sigma_i,-u_i,-v_i\}$ is also a valid triplet

$$A=\sum_{i=1}^r\sigma_iu_iv_i^H\Rightarrow R(A)=R(u_1,\dots,u_r)$$

Null space: $N(A)=\{x:Ax=0\}$

Full SVD: $A=U\Sigma V^H,U\in F^{m\times m}, \Sigma \in F^{m\times n}, V\in F^{n\times n}$

Economy or reduced SVD…

Truncated SVD: $A_k=\sum_{i=1}^k\sigma_iu_iv_i^H$

Theorem (Eckart-Young):

$$A_k=\arg \min ||A-X||_F\quad s.t.\quad rank(X)\le k$$
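A quick numerical check of this theorem (a numpy sketch; the matrix shape, seed, and `k` are arbitrary choices of mine): the Frobenius error of the rank-$k$ truncation equals the norm of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))

# Full SVD, then keep the k leading triplets: A_k = sum_{i<=k} sigma_i u_i v_i^H
U, s, Vh = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] * s[:k] @ Vh[:k, :]

# Eckart-Young: the Frobenius error of the best rank-k approximation
# is the l2 norm of the discarded singular values.
err = np.linalg.norm(A - A_k, "fro")
print(np.isclose(err, np.sqrt(np.sum(s[k:] ** 2))))  # True
```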

## Projection Matrices

Let $Q_r=[q_1,\dots,q_r]$ be a matrix whose columns are orthonormal (unitary) unit-norm vectors; then the orthogonal/unitary projection matrix onto $R(Q_r)$ is $P_{R(Q_r)}=Q_rQ_r^H$

$$P_{R(Q_r)}x=(Q_rQ_r^H)x=\sum_{i=1}^rq_i(q_i^Hx)$$

$$P_{R(q_1,q_2,\dots,q_k)}+P_{R(q_{k+1},\dots,q_n)}=I_n$$

Gram-Schmidt Orthogonalization

$$\beta_n=v_n-\sum_{i=1}^{n-1}(v_n^T\eta_i)\eta_i,\quad\eta_n=\frac{\beta_n}{||\beta_n||}$$
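The recursion above can be coded directly; a minimal numpy sketch (the function name `gram_schmidt` and the test matrix are my own):

```python
import numpy as np

def gram_schmidt(V):
    """Orthonormalize the columns of V (assumed linearly independent)."""
    Q = np.zeros_like(V, dtype=float)
    for n in range(V.shape[1]):
        # beta_n = v_n - sum_i (v_n^T eta_i) eta_i
        beta = V[:, n] - Q[:, :n] @ (Q[:, :n].T @ V[:, n])
        Q[:, n] = beta / np.linalg.norm(beta)
    return Q

rng = np.random.default_rng(1)
V = rng.standard_normal((5, 3))
Q = gram_schmidt(V)
print(np.allclose(Q.T @ Q, np.eye(3)))  # True
```

The columns of `Q` span the same subspace as those of `V`, so the projector $QQ^H$ leaves `V` unchanged.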

Basis for matrix $A$ using SVD: $A=\sum_{i=1}^r\sigma_iu_iv_i^H$; then a basis of $R(A)$ is $u_1,\dots,u_r$:

$$P_{R(A)}=U_1U_1^H=\sum_{i=1}^ru_iu_i^H$$

$$P_{R^\perp (A)}=I_m-U_1U_1^H=U_2U_2^H=\sum_{i=r+1}^mu_iu_i^H$$

$$P_{N(A)}=I_n-V_1V_1^H=\sum_{i=r+1}^nv_iv_i^H$$

$$P_{N^\perp (A)}=V_1V_1^H=\sum_{i=1}^rv_iv_i^H$$

$$R^\perp (A)=N(A^T)$$

## Norm of matrix

### P-norm of a matrix

The p-norm of a matrix is defined as:

$$||A||_p = \sup_{x\neq0}\frac{||Ax||_p}{||x||_p}$$

where the p-norm of a vector is defined as:

$$||x||_p=(\sum_i |x_i|^p)^{1/p}$$

then for the 2-norm of a matrix:

$$||A||_2 = \sigma_{max}$$

where $\sigma_{max}$ is the largest singular value from the SVD.

### Frobenius norm of a matrix

$$||A||_F=\sqrt{\sum_i\sum_j|a_{ij}|^2}$$

$$||A||_F^2=Tr(A^HA)=Tr(AA^H)=\sum_{i=1}^r\sigma_i^2$$

Here is a brief proof:

$$Tr(A^HA)=Tr\Big(\sum_i \sigma_i v_iu_i^H\sum_j \sigma_j u_jv_j^H\Big)=Tr\Big(\sum_i \sigma_i^2v_iv_i^H\Big)$$

$$=\sum_iTr(\sigma_i^2v_iv_i^H)=\sum_i\sigma_i^2Tr(v_i^Hv_i)=\sum_i\sigma^2_i$$
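Both norm identities are easy to verify numerically; a numpy sketch with an arbitrary random matrix of my own:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((7, 4))
s = np.linalg.svd(A, compute_uv=False)  # singular values, descending

# ||A||_2 (the induced 2-norm) is the largest singular value.
print(np.isclose(np.linalg.norm(A, 2), s[0]))                        # True

# ||A||_F^2 = Tr(A^H A) = sum of squared singular values.
print(np.isclose(np.linalg.norm(A, "fro") ** 2, np.trace(A.T @ A)))  # True
print(np.isclose(np.trace(A.T @ A), np.sum(s ** 2)))                 # True
```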

## Least Squares and SVD

Pseudo-inverse:

$$AA^\dagger A=A, A^\dagger A A^\dagger = A^\dagger$$

$$A^\dagger = A^T(AA^T)^{-1}\quad(\text{when }A\text{ has full row rank})$$

SVD: Pseudo-inverse $A^\dagger$ is an $n\times m$ matrix given by:

$$A^\dagger =V\Sigma^\dagger U^H=\sum_{i=1}^r\sigma^{-1}_iv_iu_i^H$$

$$(u_i)^\dagger = u_i^H, (u_iu_i^H)^\dagger = u_iu_i^H$$


$$AA^\dagger = P_{R(A)},\quad I-AA^\dagger = P_{R^\perp (A)}$$

$$A^\dagger A=P_{N^\perp (A)},\quad I-A^\dagger A=P_{N(A)}$$

$$x_{LS}=\arg \min_x ||b-Ax||_2^2=A^\dagger b$$
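A small numpy check that three routes to $x_{LS}$ agree (the shapes and seed are arbitrary choices of mine): `pinv`, `lstsq`, and the SVD sum.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((10, 4))   # overdetermined, full column rank
b = rng.standard_normal(10)

# x_LS = A^dagger b, three equivalent routes:
x_pinv = np.linalg.pinv(A) @ b
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]

# via the SVD sum  sum_i sigma_i^{-1} v_i (u_i^H b)
U, s, Vh = np.linalg.svd(A, full_matrices=False)
x_svd = Vh.T @ ((U.T @ b) / s)

print(np.allclose(x_pinv, x_lstsq) and np.allclose(x_pinv, x_svd))  # True
```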

Here we prove the theorem by a change of variables:

$$J(x)=||Ax-b||_2^2\Rightarrow ||\Sigma V^Hx-U^Hb||_2^2$$

let $\tilde{x}=V^Hx$ and $\tilde{b}=U^Hb$, then

$$J(\tilde{x})=||\Sigma \tilde{x}-\tilde{b}||_2^2=\sum_{i=1}^r(\sigma_i\tilde{x}_i-\tilde{b}_i)^2+\sum_{i=r+1}^m\tilde{b}_i^2.$$

Then it is obvious that:

$$\tilde{x}_{LS,i}=\frac{\tilde{b}_i}{\sigma_i}\Rightarrow …\Rightarrow x_{LS}=A^\dagger b.$$

So far we have completed the proof of the least squares solution when $r=n$. However, when $r<n$, comes the tricky part of the least squares method:

$$\tilde{x}=V^Hx\in R^n,x=V\tilde{x}\in R^n$$

the previous proof tells us that:

$$x=\sum_{i=1}^n\tilde{x}_iv_i,\quad \tilde{x}_i=\frac{\tilde{b}_i}{\sigma_i}$$

However, if $r<n$ then $\sigma_i=0$ for $i>r$. One question from the mid-term exam: $Ax=b$, $A\in R^{m\times n}$, $rank(A)<n$; then the solution set for $x$ is:

$$x=A^\dagger b + (I-A^\dagger A)w,w\in R^n$$

where the first term is the component in $span\{v_1,\dots,v_r\}$ and the second term is the projection onto $span\{v_{r+1},\dots,v_n\}=N(A)$.
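A numpy sketch of this general solution for an underdetermined consistent system (the sizes and seed are mine): every $x=A^\dagger b+(I-A^\dagger A)w$ solves $Ax=b$, and the two components are orthogonal.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 6))    # underdetermined, rank 3
b = rng.standard_normal(3)

A_pinv = np.linalg.pinv(A)
x0 = A_pinv @ b                    # minimum-norm solution
P_null = np.eye(6) - A_pinv @ A    # projector onto N(A)

# Any x = x0 + (I - A^dagger A) w also solves Ax = b.
w = rng.standard_normal(6)
x = x0 + P_null @ w
print(np.allclose(A @ x, b))               # True
# x0 lies in the row space, orthogonal to the null-space component:
print(np.isclose(x0 @ (P_null @ w), 0.0))  # True
```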

### Error vector as a projection

$$x_{LS}=A^\dagger b\Rightarrow e=b-Ax_{LS}=b-AA^\dagger b$$

$$\Rightarrow e=P_{R^\perp (A)}b=(I-P_{R(A)})b$$

using SVD, $e=[u_{r+1},\dots,u_m][u_{r+1},\dots,u_m]^Hb$

Over-determined: $m>n$, full rank, $error\neq 0$

Under-determined: $m<n$, full rank, $error=0$.

One lesson from this analysis is that sometimes you should step back from the SVD and reason directly about the column space of the original matrix.

## Extension of Least Squares

### Truncated Least Squares

Original least squares:

$$x_{LS}=\arg \min_x||b-Ax||_2^2$$

then the Truncated Least Squares is:

$$x_{TLS}^{(k)}=\sum_{i=1}^k\frac{1}{\sigma_i}v_iu_i^Hb$$

which means that the original optimization problem becomes:

$$x^{(k)}_{TLS}=\arg \min_x||b-A_kx||_2^2$$

### Regularized Least Squares

$$x_{opt}=\arg \min_x ||Ax-b||_2^2+\delta^2||x||_2^2$$

is equivalent to:

$$x_{opt}=\arg \min_x||\tilde{A}x-\tilde{b}||_2^2$$

where $\tilde{A}$ stacks $A$ on top of $\delta I$ and $\tilde{b}$ stacks $b$ on top of $0$. The solution to this problem is:

$$x_{opt}=\tilde{A}^\dagger \tilde{b}=(A^HA+\delta^2I)^{-1}A^Hb=\sum_{i=1}^r\frac{\sigma_i}{\sigma_i^2+\delta^2}v_i(u_i^Hb)$$
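The three forms above (closed form, augmented/stacked least squares, and the filtered SVD sum) can be checked against each other; a numpy sketch with arbitrary data and $\delta$ of my own:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((12, 5))
b = rng.standard_normal(12)
delta = 0.7

# Closed form: (A^H A + delta^2 I)^{-1} A^H b
x_closed = np.linalg.solve(A.T @ A + delta**2 * np.eye(5), A.T @ b)

# Augmented least squares: stack [A; delta*I] over [b; 0]
A_tilde = np.vstack([A, delta * np.eye(5)])
b_tilde = np.concatenate([b, np.zeros(5)])
x_aug = np.linalg.lstsq(A_tilde, b_tilde, rcond=None)[0]

# Filtered SVD sum: sigma_i / (sigma_i^2 + delta^2) v_i (u_i^H b)
U, s, Vh = np.linalg.svd(A, full_matrices=False)
x_svd = Vh.T @ (s / (s**2 + delta**2) * (U.T @ b))

print(np.allclose(x_closed, x_aug) and np.allclose(x_closed, x_svd))  # True
```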

## Learning SVD

$$v_1=\arg \max_{||x||_2=1}||Ax||_2$$

Here is the proof:

$$J(x)=||Ax||_2^2=||U\Sigma V^Hx||_2^2=||\Sigma V^H x||_2^2,$$

let $\tilde{x}=V^Hx$, then:

$$J(\tilde{x})=||\Sigma \tilde{x}||_2^2=\sum_{i=1}^n\sigma_i^2\tilde{x}_i^2\le \sigma_1^2 \sum_{i=1}^n\tilde{x}_i^2=\sigma_1^2$$

$$\tilde{x}_{opt}=e_1\Rightarrow x_{opt}=v_1$$

Similarly,

$$u_1=\arg \max_{||x||_2=1}||x^HA||_2$$

Furthermore,

$$v_2=\arg \max_{||x||_2=1}||AP_{R(v_1)^\perp }x||$$

Proof: $AP_{R(v_1)^\perp}=\sum_{i=1}^r\sigma_iu_iv_i^H\sum_{i=2}^nv_iv_i^H=\sum_{i=2}^r\sigma_iu_iv_i^H$

$$v_k=\arg \max_{||x||_2=1}||AP_{R(v_1,..v_{k-1})^\perp}x||$$

Equivalent formulation:

$$x_{opt}=\arg \max_{||x||_2=1}||Ax||_2,\quad s.t.\quad x\perp v_1,v_2…v_{k-1}$$

It is the same for $u_k$:

$$u_k=\arg \max_{||x||_2=1}||x^HA||_2\quad s.t. \quad x\perp u_1,\dots,u_{k-1}$$

$$u_k=\arg \max_{||x||_2=1}||x^HB||_2,\quad B=P_{R(u_1,\dots,u_{k-1})^\perp}A$$

Claim: $x_{opt}=\pm v_k/\pm u_k$ (not unique); another possibility is that all the $\sigma_i$ are equal, in which case any unit vector in $span\{v_1,\dots,v_r\}$ is optimal
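The characterization $v_1=\arg\max_{||x||_2=1}||Ax||_2$ suggests computing $v_1$ iteratively. Power iteration on $A^HA$ (a standard method, not derived above) is the simplest version; a sketch with my own random matrix:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((20, 8))

# v1 is the top eigenvector of A^H A; power iteration finds it.
x = rng.standard_normal(8)
x /= np.linalg.norm(x)
for _ in range(500):
    x = A.T @ (A @ x)          # multiply by A^H A
    x /= np.linalg.norm(x)     # renormalize to stay on the unit sphere

_, s, Vh = np.linalg.svd(A)
v1 = Vh[0]
# x converges to +-v1 (the sign is not determined)
print(np.isclose(abs(x @ v1), 1.0))  # True
```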

## PCA and SVD

Note that the format of the data is that each column is a sample and each row is an attribute.

### Principal Component Analysis

Learning interpretation:

$$v=\arg \max_{||x||_2=1} ||Ax||_2,\quad u=\arg \max_{||x||_2=1} ||x^HA||_2$$

the first step of PCA is to center the data:

$$\tilde{A}=A-\mu_A\cdot 1^T,\quad \mu_A=\frac{1}{n}A1$$

then for the principal component factorization:

$$A=W_{pca}\cdot X_{pca}$$

$$W_{pca}=U\Sigma,X_{pca}=(U\Sigma)^\dagger \mu_A 1^T +V^H$$

Besides maximizing the variance, PCA also minimizes the correlation between the new attributes. The covariance quantifies the correlation between two variables:

$$cov(x, y)=E(x\cdot y)-E(x)E(y)$$

where two variables are considered uncorrelated if their covariance is zero. The covariance matrix is:

$$\hat{\Sigma}_X=\frac{1}{n}XX^T-\mu_X\mu_X^T$$

Under the principal component factorization, the coordinates of the original dataset become $V^T$ (centered) and the covariance matrix is diagonal, which means that the different attributes are uncorrelated.

There is also another way to get the PCA: first center the matrix $A$ to $\bar{A}$, and then find the eigenvectors of the covariance matrix $\bar{A}\bar{A}^T$.
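A numpy sketch comparing the two routes on random data (columns as samples, as assumed above; the shapes and seed are mine): the SVD of the centered matrix and the eigendecomposition of the covariance give the same principal directions up to sign.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((4, 200))   # rows are attributes, columns are samples

mu = A.mean(axis=1, keepdims=True)
A0 = A - mu                          # center the data

# Route 1: SVD of the centered matrix.
U, s, Vh = np.linalg.svd(A0, full_matrices=False)

# Route 2: eigenvectors of the covariance matrix A0 A0^T / n.
evals, evecs = np.linalg.eigh(A0 @ A0.T / A.shape[1])
order = np.argsort(evals)[::-1]      # eigh returns ascending order
evecs = evecs[:, order]

# The principal directions agree up to sign.
print(np.allclose(np.abs(U.T @ evecs), np.eye(4), atol=1e-6))  # True
```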

### Independent Component Analysis

The final form of ICA is quite similar with PCA:

$$A=W_{ica}\cdot X_{ica}$$

$$W_{ica}=U\Sigma Q_{ica},\quad X_{ica}=(U\Sigma Q_{ica})^\dagger\mu_A1^T+Q_{ica}^HV^H$$

where the unitary matrix $Q_{ica}$ is found by maximizing:

$$Q_{ica}=\arg \max_Q \sum_{i=1}^m |kurtosis(Q[:,i]^HV^H)|$$

ICA tries to make the different attributes as independent as possible. Independence concerns the full joint relation between two random variables, while correlation only captures their linear relationship.

k-th moment of a random variable is defined as:

$$E[x^k]=\mu_k=\int x^k f(x)dx$$

where the $f(x)$ is the p.d.f. of $x$. Then the cumulants of a centered random variable:

$$\kappa_2 = \mu_2, \kappa_3=\mu_3,\kappa_4=\mu_4-3\mu_2^2, \kappa_5,….$$

where $\kappa_2$ is the variance, $\kappa_3$ the skewness, and $\kappa_4$ the kurtosis. There are two useful facts: first, maximizing the absolute value of the kurtosis is equivalent to maximizing the “non-Gaussianity”; second, if $x_1$ and $x_2$ are independent, then:

$$\kappa_k(x_1+x_2)=\kappa_k(x_1)+\kappa_k(x_2), \kappa_k(wx)=w^k\kappa_k(x)$$

Here to prove that the ICA unmixes mixed independent random variables:

Let $Q$ be an orthogonal matrix and $x_1$ and $x_2$ be independent, and then:

$$[y_1,y_2]^T=Q[x_1,x_2]^T$$

and the ICA algorithm is:

$$w_{opt}=\arg \max_{||w||_2=1}|\kappa_4(w^Ty)|$$

we want to prove that:

$$w_{opt}=\pm q_1$$

Here is the simple proof:

$$\kappa_4(w^Ty)=\kappa_4(w^TQx)=\kappa_4(\tilde{w}^Tx)=\kappa_4(\tilde{w}_1x_1)+\kappa_4(\tilde{w}_2x_2)$$

$$=\tilde{w}_1^4\kappa_4(x_1)+\tilde{w}_2^4\kappa_4(x_2),\quad |\kappa_4(w^Ty)|\le (\tilde{w}_1^4+\tilde{w}_2^4)\max_i|\kappa_4(x_i)|\le 1\cdot\max_i|\kappa_4(x_i)|$$

$$\tilde{w}=\pm e_i\Rightarrow w=\pm q_1\ \text{or}\ \pm q_2$$
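A small numerical experiment in the spirit of this proof (the sources, sample size, and mixing angle are my own choices): mix two independent uniform sources with a rotation $Q$, scan unit vectors $w$, and keep the one maximizing $|\kappa_4(w^Ty)|$; it should align with a column of $Q$.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200_000
x = rng.uniform(-1, 1, size=(2, n))   # independent, centered, kappa_4 < 0

theta = 0.6                            # an arbitrary mixing rotation
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
y = Q @ x                              # mixed observations

def kurt(z):                           # kappa_4 of a centered sample
    return np.mean(z ** 4) - 3 * np.mean(z ** 2) ** 2

# Scan unit vectors w and keep the one maximizing |kappa_4(w^T y)|.
angles = np.linspace(0, np.pi, 361)
ws = np.stack([np.cos(angles), np.sin(angles)])
best = ws[:, np.argmax([abs(kurt(w @ y)) for w in ws.T])]

# w_opt should align with +-q_1 or +-q_2.
align = np.max(np.abs(best @ Q))
print(align)  # should be very close to 1
```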

## Eigen-decomposition of a symmetric matrix

Assume that $A$ is an $n\times n$ symmetric matrix. Then there are $n$ real eigenvalues $\lambda_1\ge \lambda_2 \ge \dots\ge\lambda_n$, and each eigenvalue has an eigenvector:

$$Aq_i=\lambda_iq_i$$

then the eigenvalue decomposition of the matrix is:

$$A=\sum_{i=1}^n\lambda_iq_iq_i^T=Q\Lambda Q^T$$

where $Q$ is an orthogonal matrix and each column of $Q$ is an eigenvector of $A$.

Then consider the optimization problem with a quadratic form and a spherical constraint:

$$x_{opt}=\arg \max_x x^TAx,\quad s.t.\quad x^Tx=1$$

It is obvious that when $A$ is a symmetric matrix, the solution to the optimization problem is:

$$x_{opt}=q_1$$
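A quick numerical illustration (random symmetric matrix, my own seed): the top eigenvector attains the maximum of the quadratic form over unit vectors.

```python
import numpy as np

rng = np.random.default_rng(9)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                    # a random symmetric matrix

evals, evecs = np.linalg.eigh(A)     # eigenvalues in ascending order
q1 = evecs[:, -1]                    # eigenvector of the largest eigenvalue

# No random unit vector beats q1: x^T A x <= lambda_max = q1^T A q1.
X = rng.standard_normal((5, 1000))
X /= np.linalg.norm(X, axis=0)
rayleigh = ((A @ X) * X).sum(axis=0)  # x_i^T A x_i for every column
print(np.all(rayleigh <= q1 @ A @ q1 + 1e-12))  # True
```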

## Application 1: Procrustes Analysis

### Normal Procrustes Problem

Here every column is an attribute while every row is a sample.

Procrustes analysis learns the transformation matrix, including rotation, translation and scaling.

If there is only rotation, then the optimization problem is:

$$Q_{opt}=\arg \min_Q ||X-YQ||_F^2, s.t.Q^TQ=I$$

$$\min_Q||X-YQ||_F^2=\min_Q Tr[(X-YQ)(X-YQ)^T]$$

$$=\min_Q Tr(X^TX+Q^TY^TYQ-X^TYQ-Q^TY^TX)$$

$$\Leftrightarrow\max_Q Tr(X^TYQ+Q^TY^TX)=\max_Q 2\,Tr(X^TYQ)$$

Let $U\Sigma V^T$ be the SVD of $X^TY$, then:

$$Tr[X^TYQ]=Tr[U\Sigma V^TQ]=Tr[\Sigma V^TQU]$$

where $W=V^TQU$ is an orthogonal matrix and the objective function reaches the maximum when $W=I$, that is:

$$W=V^TQU=I\rightarrow Q=VU^T$$
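A numpy sketch of this solution (the sizes, seed, and synthetic rotation are mine): build $X=YQ_{true}$ and recover $Q_{true}$ from the SVD of $X^TY$.

```python
import numpy as np

rng = np.random.default_rng(10)
Y = rng.standard_normal((30, 3))       # rows are sampled points

# Build X by rotating Y with a known orthogonal matrix.
Q_true, _ = np.linalg.qr(rng.standard_normal((3, 3)))
X = Y @ Q_true

# Procrustes solution: Q = V U^T from the SVD of X^T Y.
U, _, Vh = np.linalg.svd(X.T @ Y)
Q_hat = Vh.T @ U.T

print(np.allclose(Q_hat, Q_true))  # True
```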

In practice, there will be translation and scaling besides the rotation, the complete optimization problem is:

$$\min_{Q,\alpha,\mu}||X-\alpha YQ-1_m\mu^T||_F^2, s.t. QQ^T=I$$

the $\mu$ can be directly calculated given the other parameters; another way to understand it is that we can first center the datasets to get rid of the translation part.

$$\mu = [\frac{1}{m}1_m^T(X-\alpha YQ)]^T$$

$$\min_{Q,\alpha} ||X_0-\alpha Y_0Q||_F^2,s.t.Q^TQ=I$$

where $X_0=X-1\mu_X^T$, $Y_0=Y-1\mu_Y^T$.

We can then see that $\alpha$ does not influence $Q$, because scaling changes only the singular values, not the singular vectors. Then:

$$Q=VU^T, X_0^TY_0=U\Sigma V^H$$

and finally for the scaling parameter $\alpha$:

$$\alpha = Tr[X_0^TY_0Q]/Tr[Y_0^TY_0]$$

## Application 2: MDS to recover the coordinates

The zero-centroid multidimensional scaling (MDS) algorithm can be used to recover coordinates from a Euclidean distance matrix. The algorithm is shown below:

$$K=-\frac{1}{2}PSP^T,S=D.^2,P=I-\frac{1}{n}11^T$$

Let $K=U\Lambda U^T$ be the eigenvalue decomposition of $K$. Then the $d$-dimensional MDS coordinates of the points are given by the rows of:

$$X_r=U[:,1:d]\Lambda ^{1/2}[1:d,1:d]$$

Here is the proof; it is somewhat involved. Assume $x$ holds the original coordinates with one point per row; then:

$$S_{ij}=D_{ij}^2=||x_i-x_j||_2^2=||x_i||_2^2+||x_j||_2^2-2x_i^Tx_j$$

$$S=c1^T+1c^T-2xx^T\rightarrow xx^T=-\frac{1}{2}(S-c1^T-1c^T)$$

where:

$$c=[||x_1||^2,||x_2||^2,…,||x_n||^2]^T$$

On one side, let $K=xx^T$,

$$S=c1^T+1c^T-2K\rightarrow \frac{1^TS1}{n}=\frac{1^Tc1^T1}{n}+\frac{1^T1c^T1}{n}-2\frac{1^Txx^T1}{n}$$

$$\frac{1^TS1}{n}=1^Tc+c^T1-2\frac{1^Txx^T1}{n}\rightarrow2(c^T1)=\frac{1^TS1}{n}+2\frac{1^Txx^T1}{n}$$

On the other side,

$$\frac{S1}{n}=c+\frac{1c^T1}{n}-2\frac{xx^T1}{n}\rightarrow c=\frac{S1}{n}-\frac{1c^T1}{n}+2x\frac{x^T1}{n}$$

If we place the origin at the centroid of the coordinates, then $x^T1=0$. Finally, $K=xx^T$ is:

$$K=-\frac{1}{2}(S-c1^T-1c^T)=-\frac{1}{2}PSP^T=U\Lambda U^T$$

Then the coordinates are:

$$x=U\Lambda ^{1/2}$$
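The whole recipe fits in a few lines of numpy (the points, dimension, and seed are my own). Since coordinates are only recoverable up to rotation and translation, the check compares distance matrices:

```python
import numpy as np

rng = np.random.default_rng(11)
n, d = 10, 2
x = rng.standard_normal((n, d))        # rows are points

# Squared Euclidean distance matrix S = D.^2.
S = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=2)

P = np.eye(n) - np.ones((n, n)) / n    # centering matrix
K = -0.5 * P @ S @ P                   # double-centered Gram matrix

evals, evecs = np.linalg.eigh(K)
idx = np.argsort(evals)[::-1][:d]      # d largest eigenvalues
x_rec = evecs[:, idx] * np.sqrt(evals[idx])

# Recovered coordinates match the originals up to rotation/translation,
# so compare the distance matrices instead.
S_rec = np.sum((x_rec[:, None, :] - x_rec[None, :, :]) ** 2, axis=2)
print(np.allclose(S, S_rec))  # True
```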

## Application 3: learn to Partition Graphs

A graph is a collection of nodes (vertices) ${v_i}$ and edges ${e_i}$. The degree of a vertex is the number of edges that connect to it. The degree sequence of a graph is a list having the degree of each vertex.

The adjacency matrix of a simple graph with $n$ vertices is an $n\times n$ matrix with $A_{ij}=1$ if $v_i$ and $v_j$ are connected and $0$ otherwise.

The notion of modularity is based on the elegant idea: “A good division is one where there are fewer than expected edges between communities.”

Let $d_i$ and $d_j$ be the degree of vertex $i$ and vertex $j$ respectively. Let $g_i$ and $g_j$ denote the groups they belong to and let $\delta_{g_ig_j}$ be the Kronecker delta function that is equal to one when $g_i=g_j$ and zero otherwise.

Then the modularity $Q$ of a group assignment encoded by $g_i$ is given by:

$$Q=\frac{1}{N}\sum_{i,j}[A_{ij}-P_{ij}]\delta_{g_ig_j}$$

where $N$ is the total number of edges and $P_{ij}$ is the expected probability of an edge between vertex $i$ and vertex $j$ in a random graph.

A random graph with the same total number of edges as the given graph, with $d_i$ edges from vertex $i$ and $d_j$ edges from vertex $j$, has:

$$P_{ij}=\frac{d_i\cdot d_j}{N}$$

Let $s_i=1$ if vertex $i$ belongs to group 1 and $s_i=-1$ if it belongs to group 2. Then:

$$\delta_{g_ig_j}=\frac{s_is_j+1}{2}$$

Then the former optimization problem becomes:

$$s_{opt}=\arg \max_s\sum_{i,j} B_{ij}s_is_j,\quad s.t.\quad s_i=\pm 1$$

where the modularity matrix $B$ is defined as $B_{ij}=[A_{ij}-\frac{d_id_j}{N}]$.

This problem is too hard to solve exactly, so here is a relaxation of the previous problem:

$$s_{eig}=\arg \max_s s^TBs,\quad s.t.\quad s^Ts=n$$

$$s_{opt}=sign.(s_{eig})$$
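A sketch of this spectral partition on an obvious two-community graph (my own toy example: two 4-cliques joined by one edge), taking $N$ as the sum of degrees so that $P_{ij}=d_id_j/N$:

```python
import numpy as np

# Two 4-node cliques joined by a single edge.
A = np.zeros((8, 8))
for grp in (range(4), range(4, 8)):
    for i in grp:
        for j in grp:
            if i != j:
                A[i, j] = 1
A[3, 4] = A[4, 3] = 1                  # the bridge

deg = A.sum(axis=1)
N = deg.sum()                           # sum of degrees
B = A - np.outer(deg, deg) / N          # modularity matrix

evals, evecs = np.linalg.eigh(B)
s = np.sign(evecs[:, -1])               # s_opt = sign.(s_eig)
# The sign pattern splits the two cliques cleanly.
print(abs(s[:4].sum()) == 4 and abs(s[4:].sum()) == 4 and s[0] != s[7])  # True
```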

# Writing Homework with Jupyter Notebook

# Introduction

Sometimes we need to submit the homework with equations and code, then the Jupyter Notebook will be a good choice.

Jupyter enables you to use both Markdown and code in the same environment, which makes it convenient to give a brief explanation alongside your code. From this perspective, Jupyter Notebook stands out among development environments.

Here is an example of the Jupyter Notebook.

In Jupyter Notebook, there are two kinds of cells: markdown and code. In a markdown cell you can write explanations, such as important equations. You can insert equations in a markdown cell by typing them within the `$...$` or `$$...$$` environment. Refer to [Link] for typing mathematical expressions.

Actually, I think Jupyter Notebook is designed for research-related work rather than to be used as a development environment; it is not efficient for developing a large-scale project.

# Install

There are many ways to install Jupyter Notebook. The official website suggests installing it via [Anaconda], in which case Jupyter Notebook is installed automatically. To launch the Notebook, all you need to do is open the terminal or Command Prompt and type:

```
jupyter notebook
```

# Julia

Julia is quite similar to Python and Matlab and is designed for researchers. To some extent, it combines the advantages of Python and Matlab, for example, the matrix calculation in Julia is much more friendly than that in Python. However, none of the environments for Julia is satisfactory. Maybe Jupyter Notebook is the best one.

# CEE 553 Infrastructure System Optimization Mid-term Exam Review

I completed this review almost within one day, so it can only be used as a reference; I cannot guarantee that everything below is right. Email me with any questions.

## Notations:

- $f(x)$: objective function
- $X$: feasible region
- $h_i(x)$: equality constraints
- $g_i(x)$: inequality constraints

then for an optimization problem, the general formulation is:

$$z=\min_{x\in X}f(x)$$

subject to:

$$

\left\{

\begin{aligned}

h_i(x) = 0 \\

g_i(x)\le 0

\end{aligned}

\right.

$$

## Formulate the Optimization Problem

Three main steps to formulate the optimization problem:

- Translate the problem: an optimization problem is composed of an objective, decision variables, parameters, and constraints. Think about what the decision variables, objective, and constraints are before writing down the mathematical expression;
- Mathematical expression of the problem;
- Fine-tuning.

General principles for fine-tuning:

- Minimize the number of decision variables and constraints.
- Try to maintain linearity or convexity.
- Avoid introducing binary or integer variables.
- Pay attention to the units in which quantities are measured

### Fine-tuning Example 1: Get Rid of “if…else…”

There are two decision variables, $x_1$ and $x_2$, if we add a constraint like this:

If the $x_1$ is greater than $a_1$, then the $x_2$ must be greater than $a_2$.

One feasible method is to split the optimization into two sub-problems, but this is too complicated. The alternative is to introduce an extra binary variable $\delta \in \{0,1\}$, where $1$ and $0$ indicate that the condition is true or false respectively.

Then the “if…else…” condition can be converted to:

$$

\left\{

\begin{array}{c}

a_1\delta \le x_1 \le a_1 + M\delta\\

a_2 \delta \le x_2 \\

\delta \in \{0, 1\}

\end{array}

\right.

$$
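A brute-force sanity check of this encoding (the values of $a_1$, $a_2$, $M$, and the grid are my own): over a grid of $(x_1,x_2)$, feasibility of the linearized system for some $\delta\in\{0,1\}$ should match the original “if…else…” condition.

```python
import numpy as np
from itertools import product

a1, a2, M = 3.0, 5.0, 100.0

def encoded_feasible(x1, x2):
    # Feasible iff some delta in {0,1} satisfies
    #   a1*delta <= x1 <= a1 + M*delta  and  a2*delta <= x2
    return any(
        a1 * d <= x1 <= a1 + M * d and a2 * d <= x2
        for d in (0, 1)
    )

for x1, x2 in product(np.linspace(0, 10, 21), repeat=2):
    want = (x2 >= a2) if x1 > a1 else True   # the original condition
    assert encoded_feasible(x1, x2) == want, (x1, x2)
print("encoding matches on the grid")
```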

### Fine-tuning Example 2:

When the optimization is like this:

$$z=\min_x (\max (f(x),g(x),h(x)))$$

this optimization can be understood as minimizing the upper contour of the three functions, and it is not a linear problem. It can be replaced by a linear problem by introducing an extra decision variable $p$:

$$z = \min_{x,p} p$$ subject to: $$p\ge f(x),\quad p\ge g(x),\quad p\ge h(x)$$
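When the pieces are linear, this epigraph reformulation is a linear program. A scipy sketch with three linear pieces of my own choosing, checked against a brute-force grid search:

```python
import numpy as np
from scipy.optimize import linprog

# minimize  max(x+1, -2x+4, 0.5x)  over x in [0, 5]
# Epigraph form with variables z = [x, p]: minimize p subject to
#   x + 1 - p <= 0,   -2x + 4 - p <= 0,   0.5x - p <= 0
c = np.array([0.0, 1.0])                  # objective: p
A_ub = np.array([[ 1.0, -1.0],
                 [-2.0, -1.0],
                 [ 0.5, -1.0]])
b_ub = np.array([-1.0, -4.0, 0.0])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 5), (None, None)])

# Brute-force check on a fine grid.
xs = np.linspace(0, 5, 100001)
brute = np.min(np.maximum.reduce([xs + 1, -2 * xs + 4, 0.5 * xs]))
print(np.isclose(res.fun, brute, atol=1e-3))  # True
```

Here the pointwise maximum is minimized where the rising piece $x+1$ meets the falling piece $-2x+4$, so the LP lands at $x=1$, $p=2$.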

## Finding the Solutions

I think this is the most theoretical part of this course and it is very interesting.

### Convexity

Definition of convex set:

$\forall x,y \in X$ and $\forall \lambda \in[0,1]$, the set $X$ is convex if

$$\lambda x+ (1-\lambda)y\in X$$

Using this definition, we can easily prove that the intersection of two convex sets is also convex.

Definition of convex function:

$\forall x_1,x_2 \in X$ (a convex set) and $\forall \lambda \in [0,1]$:

$$f(\lambda x_1+(1-\lambda)x_2)\le \lambda f(x_1)+(1-\lambda)f(x_2)$$

A function is convex if the Hessian matrix of the function is positive semi-definite; the Hessian matrix is defined as:

$$

\nabla^2 f(x_1,\dots,x_n) =

\left(

\begin{matrix}

\frac{\partial^2f}{\partial x_1^2} & \frac{\partial^2f}{\partial x_1 \partial x_2} & \dots & \frac{\partial^2f}{\partial x_1 \partial x_n} \\

\frac{\partial^2f}{\partial x_2 \partial x_1} & \frac{\partial^2f}{\partial x_2^2} & \dots & \frac{\partial^2f}{\partial x_2 \partial x_n} \\

\vdots & \vdots & \ddots & \vdots \\

\frac{\partial^2f}{\partial x_n \partial x_1} & \frac{\partial^2f}{\partial x_n \partial x_2} & \dots & \frac{\partial^2f}{\partial x_n^2}

\end{matrix}

\right)

$$

The Hessian matrix is symmetric (when $f(x)\in C^2$). A symmetric matrix is positive definite if and only if all its eigenvalues are positive (and positive semi-definite iff they are non-negative).

Relationship between convex function and convex set:

For the inequality constraint $g(x)\le 0$, if $g(x)$ is convex, then the feasible set given by the IEC is also convex, here is the proof:

Since $g(x)$ is a convex function, for all $x_1, x_2$ in the feasible set $\{x:g(x)\le 0\}$ and $\forall \lambda \in [0,1]$:

$$g(\lambda x_1+(1-\lambda) x_2)\le \lambda g(x_1) + (1-\lambda )g(x_2)\le 0$$

Another property of a convex function is that the increase from a specific point is always no less than the increase predicted by the first-order derivative:

$$f(y)-f(x)\ge \nabla f(x)^T (y-x),\forall x,y \in X$$

### Extreme Values

Definition of global minimum:

$$f(x^*)\le f(x), \forall x\in X$$

Definition of local minimum:

$$f(x^{*})\le f(x),\forall x\in X, ||x-x^*||<\epsilon$$

It is hard to find the local/global minimum using the definitions directly, so here we explore the properties of the local/global minimum to find alternative conditions.

For a local minimum ($X$ is a convex set):

$$\nabla f(x^{*})^T (x-x^*)\ge 0, \forall x \in X$$

The most straightforward understanding is that all feasible directions are non-descent directions. This is a necessary condition for a local minimum, and it holds for both constrained and unconstrained problems. If the problem is unconstrained, then all directions are feasible and the condition reduces to:

$$\nabla f(x^*)=0$$

If $f(x)$ is a convex function and $X$ is a convex set, then $x^*$ will become the global minimum point. Here is the proof:

Since $f(x)$ is a convex function, we have:

$$f(x)-f(x^*)\ge \nabla f(x^*)^T(x-x^*)\ge 0$$

### Unconstrained Optimization

$$z=\min_{x \in X}f(x)$$

For the local minimum point,

- Necessary condition: $\nabla f(x^*)=0$ and $\nabla^2f(x^*)$ is positive semi-definite at $x^*$.
- Sufficient condition: $\nabla f(x^*)=0$ and $\nabla^2f(x^*)$ is positive definite at $x^*$.

General procedure:

- Find the stationary point, $\nabla f(x^*)=0$
- Examine the sufficient condition, $\nabla^2f(x^*) \succ 0$

#### Newton’s Method

Newton’s method is used to numerically find a stationary point of the function. For one-dimensional optimization,

$$x^{n+1}=x^n-\frac{f’(x^n)}{f’’(x^n)}$$

The procedure for Newton’s method:

- Select $x^1$ and set $k=1$
- If $||\nabla f(x^k)||=0$, stop; otherwise go to the next step
- Update the $x^{k+1}$

$$x^{k+1}=x^{k}-[\nabla^2 f(x^k)]^{-1}\nabla f(x^k)$$

This method can only be used when the inverse of the Hessian exists.
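A one-dimensional sketch of the update $x^{n+1}=x^n-f'(x^n)/f''(x^n)$ (the objective $f(x)=x^2-\ln x$ is my own example, with stationary point $x^*=1/\sqrt{2}$):

```python
import numpy as np

# Minimize f(x) = x^2 - ln(x) on x > 0.
# f'(x) = 2x - 1/x,  f''(x) = 2 + 1/x^2,  stationary point x* = 1/sqrt(2).
f1 = lambda x: 2 * x - 1 / x
f2 = lambda x: 2 + 1 / x ** 2

x = 1.0
for _ in range(20):
    if abs(f1(x)) < 1e-12:     # stop at a (numerical) stationary point
        break
    x = x - f1(x) / f2(x)      # Newton update

print(np.isclose(x, 1 / np.sqrt(2)))  # True
```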

#### Iterative Descent Method

Framework:

- Start at an initial point $x_0\in X,k=0$
- Choose a descent direction $d_k$, which follows $\nabla f(x_k)^Td_k<0$
- Choose a step size $\alpha_k$
- Update the solution $x_{k+1}=x_k+\alpha_k\cdot d_k$
- Stop if the criterion is met; otherwise repeat steps 2-4

#### Steepest Descent Method

It is one of the iterative descent methods:

$$

\left\{

\begin{array}{c}

d_k = -\nabla f(x_k) \\

\alpha_k = \arg \min_{\alpha_k}f(x_k+\alpha_k d_k)

\end{array}

\right.

$$
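A numpy sketch on a positive definite quadratic (my own test problem), where the exact line search has the closed form $\alpha_k=d_k^Td_k/d_k^TAd_k$ for $f(x)=\frac{1}{2}x^TAx-b^Tx$:

```python
import numpy as np

rng = np.random.default_rng(12)
M = rng.standard_normal((4, 4))
A = M.T @ M + 4 * np.eye(4)        # positive definite
b = rng.standard_normal(4)

# f(x) = 0.5 x^T A x - b^T x,  grad f = A x - b.
x = np.zeros(4)
for _ in range(200):
    d = b - A @ x                   # d_k = -grad f(x_k)
    if np.linalg.norm(d) < 1e-12:
        break
    alpha = (d @ d) / (d @ A @ d)   # exact minimizer along d
    x = x + alpha * d

print(np.allclose(x, np.linalg.solve(A, b)))  # True
```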

### Constrained Optimization

$$z=\min_{x\in X}f(x)$$

subject to:

$$

\left\{

\begin{aligned}

h_i(x) = 0 \\

g_i(x)\le 0

\end{aligned}

\right.

$$

KKT condition:

$$

\left\{

\begin{array}{c}

\nabla f(x^*)+\sum_{i=1}^m\lambda_i^*\nabla h_i(x^*)+\sum_{j=1}^r\mu_j^*\nabla g_j(x^*)=0\quad (Gradient) \\

h_i(x^*)=0,\forall i\in \{1,2,\dots,m\} \quad (Feasibility)\\

g_j(x^*)\le 0, \forall j\in \{1,2,\dots,r\} \quad (Feasibility)\\

\mu_j^* \ge 0, \forall j\in \{1,2,\dots,r\} \quad (Non-negativity) \\

\mu_j^* g_j(x^*)=0,\forall j \in \{1,2,\dots,r\} \quad (Complementary\ Slackness\ Condition)

\end{array}

\right.

$$

Here is a brief explanation of the KKT condition:

Firstly, when we define $d$ as feasible direction, it can be understood as:

$$

\left\{

\begin{array}{c}

d=x-x^* =[dx_1, dx_2,…dx_n]^T,||x-x^*||<\epsilon\rightarrow 0 \\

x^*+d\in X\rightarrow h(x^*+d)=0

\end{array}

\right.

$$

for all the feasible directions:

$$h(x^*+d)-h(x^*)=\nabla h(x^*)^T d=0$$

if there are no inequality constraints, the local minimum $x^*$ has:

$$\nabla f(x^*)^T d=0$$

Combining the two equations, we have:

$$(\lambda_1\nabla h(x^*)+\lambda_2\nabla f(x^*))^Td=0\rightarrow \nabla f(x^*) + \lambda \nabla h(x^*)=0$$

Things become a little different after adding the inequality constraints. For an inequality constraint $g(x)\le 0$: if it is binding, then $g(x)=0\rightarrow \mu g(x)=0$. If it is not binding, then $g(x)<0$, all directions around $x^*$ are feasible, and the gradient of $f(x)$ must already vanish along them, so the constraint contributes nothing. Therefore, if an inequality constraint is not binding, then $\mu=0\rightarrow \mu g(x)=0$.

For the non-negativity condition:

$$

\begin{array}{c}

(\nabla f(x^*) +\sum \lambda \nabla h(x^*) +\sum \mu \nabla g(x^*))^Td=0 \\

\Rightarrow \nabla f(x^*)^Td + \sum \mu \nabla g(x^*)^T d =0 \\

\Rightarrow … \Rightarrow \mu \nabla g(x^*)^Td\le 0\Rightarrow \mu \ge 0

\end{array}

$$

Therefore, $\mu_j\ge 0$ constrains the point to be a minimum point. In other words, if there is any binding inequality constraint, the KKT point can only be a minimum point.

From the above we have roughly proved that:

$x$ is a local minimum $\Rightarrow$ $x$ is a KKT point

which means that the KKT conditions are only necessary for a local minimum. However, if $f(x)$ is a convex function on a convex feasible region $X$, then the KKT conditions become sufficient, and the local minimum is also the global minimum (we have proven this before ^_^).

In the lecture, the sufficient conditions given by Prof. Yin is “linear $h$, convex $g$, convex $f$ $\Rightarrow$ global minimum”. I think the reason here is that:

linear $h\Rightarrow$ $X_h$ convex; convex $g\Rightarrow$ $X_g$ convex. Then $X=X_h\cap X_g$ is also convex, because the intersection of two convex sets is a convex set.

Obviously, this condition is stronger than necessary.

For more of the course review of CEE 553, see CEE 553 Final Term Review, which is the continuation of this post.

# Introduction

This is my personal blog. I will update my recent progress, share useful tools that can help a researcher work more efficiently, and share my stories with my parents and my best friends. This is the first year of my study in the US. It is important for me to develop good habits during such a long journey, so I decided to write this blog to record my life and to use it as my notebook. There is an old Chinese saying, “工欲善其事，必先利其器” (“To do a good job, one must first sharpen one's tools”). I hope this blog will be a powerful tool in my daily life and work.

Here I want to thank my undergraduate roommate Jacob Zhong [Link] for his help in establishing this blog. Zhong inspired me a lot during my undergraduate study, especially in the use of all kinds of fancy tools. I am lucky to continue to be Zhong's neighbor, maybe for my whole graduate study at U of M, so I can still learn from him.

The following instructions may help me quickly become familiar with writing this blog.

## Insert Equation

The following code will enable the mathjax before inserting the mathematical expressions:

```
mathjax: true
```

For inline equations: $a_0=1$

```
$a_0=1$
```

For centered formulas:

$$ \delta Q=dU+pdV$$

```
$$ \delta Q=dU+pdV$$
```

### Escape in the Markdown within `$$...$$`

But it is quite annoying when inserting equations, because the markdown may be compiled first and only then the math part. You need to double the `\\` to escape the markdown even inside the math environment. Here are some examples:

The output you want is shown as below:

$$

\left\{

\begin{aligned}

h_i(x) = 0 \\

g_i(x)\le 0

\end{aligned}

\right.

$$

In LaTeX or MathJax, the code is quite straightforward:

1 | $$ |

Unfortunately, although the code block is within the `$$...$$` environment, it still goes wrong due to the escape issue. You have to double the slash `\\` to avoid the error. That is:

1 | $$ |

It is still acceptable if I can simply change the single slash into double slash to avoid the problem. However, it seems that sometimes you can only input one slash. For example,

$$

\left(

\begin{matrix}

b & b & c & d \\

1 & 2 & 3 & d \\

1 & 2 & 3 & d \\

1 & 2 & 3 & d

\end{matrix}

\right)

$$

the right code should be:

1 | $$ |

instead of

1 | $$ |

nor

1 | $$ |

This “bug” almost drove me crazy. The best solution for now may be to remember the special cases. The following table lists the special cases I have encountered so far.

Function | Latex | Markdown |
---|---|---|
Change line | `\\` | `\\\\` |
star | `*` | `\*` |