You can directly download the cheat sheet here [pdf]. If you want to edit the pdf based on the current version, you can turn to my overleaf project oneline in this link [overleaf project]. I am sorry I only open the rights to view the projects.

Inner/outer product

$e_ie_j^T\rightarrow E_{i,j}=1$, $R(A)=span(A)={Ax:x\in R^n}$

Vector input into a single hidden neuron: $f(x)=f_a.(w^Tx+b)$

Euclidean Norm (length) of vector: $||x||_2=\sqrt{x^Hx}$

If $Q$ is a unitary matrix then: $||Qx||_2=||x||_2$

$Trace$ call only applied to square matrix.

Left product a matrix is to transform the row while right product a matrix is to transform the column. Similarly, $e_i^TA$ is $i$-th row of $A$ while $Ae_i$ is the $i$-th column of $A$.

Determinant

$$\det (A)= \ det (A^T),\det(AB)=\det(BA)$$
Optional:
$$\det \left[\begin{array}{cc} A & B \\ C & D \end{array} \right]=\det(A)\det(D-CA^{-1}B)$$
$$\det \left[\begin{array}{cc} A & B \\ C & D \end{array} \right]=\det(D)\det(A-BD^{-1}C)$$

Orthonormal Matrix

A matrix $A$ is orthonormal if:
$$AA^T=A^TA=I$$
which means that both the column vectors and the row vectors are orthonormal vectors. More generally, if $A$ and $B$ are square matrices and $AB=I$, then $BA=AB=I$.

If $U$ and $V$ are orthonormal matrices, then $UV$ and $VU$ are all orthonormal matrices.

Singular Value Decomposition

$$A=\sum_{i=1}^r\sigma_iu_iv_i^H$$
$u_i/v_i$: left/right singular vectors, $u_i^Hu_i=v_i^Hv_i=\delta_{ij}$

SVD is not unique: ${\sigma_i,-u_i,-v_i}$ is also a valid triplet

$$A=\sum_{i=1}^r\sigma_iu_iv_i^H\Rightarrow R(A)=R(u_1,..u_r)$$
Null space: $N(A)={x:Ax=0}$

Full SVD: $A=U\Sigma V^H,U\in F^{m\times m}, \Sigma \in F^{m\times n}, V\in F^{n\times n}$

Economy or reduced SVD…

Truncated SVD: $A_k=\sum_{i=1}^k\sigma_iu_iv_i^H$

Theorem (Eckart-Young):
$$A_k=\arg \min ||A-X||_F\quad s.t.\quad rank(X)\le k$$

Projection Matrices

Let $Q_r=[q_1,…q_r]$ be an orthogonal (unitary) set of unit-norm vectors, then the orthogonal/unitary projection matrix onto $R(Q_r)$: $P_R(Q_r)=Q_rQ_r^H$
$$P_{R(Q_r)}x=(Q_rQ_r^H)x\Rightarrow \sum_{i=1}^kq_i(q_i^Hx)$$
$$P_{R(q_1,q_2..q_k)}+P_{R(q_{k+1},…q_n)}=I_n$$
Gram-Schmidt Orthogonalization
$$\beta_n=v_n-\sum_{i=1}^{n-1}(v_n^T\eta_i)\eta_i,\eta_n=\frac{\beta_n}{||\beta_n||}$$
Base for matrix $A$ using SVD: $A=\sum_{i=1}^r\sigma_iu_iv_i^H$, then the bases of $A$ are: $u_1,…u_r$:
$$P_{R(A)}=U_1U_1^H=\sum_{i=1}^ru_iu_i^H$$
$$P_{R^\perp (A)}=I_m-U_1U_1^H=U_2U_2^H=\sum_{i=r+1}^mu_iu_i^H$$
$$P_{N(A)}=I_n-V_1V_1^H=\sum_{i=r+1}^nv_iv_i^H$$
$$P_{N^\perp (A)}=V_1V_1^H=\sum_{i=1}^rv_iv_i^H$$
$$R^\perp (A)=N(A^T)$$

Norm of matrix

P-norm of a matrix

The p-norm of a matrix is defined as:

$$||A||_p = \sup_{x\neq0}\frac{||Ax||_p}{||x||_p}$$

where the p-norm of a vector is defined as:

$$||x||_p=(\sum_i |x_i|^p)^{1/p}$$

then obviously for second norm of a matrix:

$$||A||_2 = \sigma_{max}$$

where the $\sigma_{max}$ is the maximum singular value of the SVD.

Frobenius norm of a matrix

$$||A||_F=\sqrt{\sum_i\sum_ja_{ij}^2}$$
$$||A||_F^2=Tr(A^HA)=Tr(AA^H)=\sum_{i=1}^r\sigma_i^2$$

Here is a brief proof:
$$Tr(A^HA)=Tr(\sum_i \sigma_i u_iv_i^H\sum_i \sigma_i u_iv_i^H)=Tr(\sum_i \sigma_i^2u_iu_i^H)$$
$$=\sum_iTr(\sigma_i^2u_iu_i^H)=\sum_iTr(\sigma_i^2u_i^Hu_i)=\sum_i\sigma^2_i$$

Least Squares and SVD

Pseudo-inverse:

$$AA^\dagger A=A, A^\dagger A A^\dagger = A^\dagger$$
$$A^\dagger = A^T(AA^T)^{-1}$$

SVD: Pseudo-inverse $A^\dagger$ is an $n\times m$ matrix given by:
$$A^\dagger =V\Sigma^\dagger U^H=\sum_{i=1}^r\sigma^{-1}_iv_iu_i^H$$
$$(u_i)^\dagger = u_i^H, (u_iu_i^H)^\dagger = u_iu_i^H$$

or sum of outer-products: $A^\dagger = \sum_{i=1}^r\sigma_i^{-1}v_iu_i^H$.
$$AA^\dagger = P_{R(A)},\quad I-AA^\dagger = P_{R^\perp (A)}$$
$$A^\dagger A=P_{N^\perp (A)},\quad I-A^\dagger A=P_{N(A)}$$
$$x_{LS}=\arg \min_x ||b-Ax||_2^2=A^\dagger b$$
Here to prove the theorem by changing the variables:
$$J(x)=||Ax-b||_2^2\Rightarrow ||\Sigma V^Hx-U^Hb||_2^2$$
let $\tilde{x}=V^Hx$ and $\tilde{b}=U^Hb$, then
$$J(\tilde{x})=||\Sigma \tilde{x}-\tilde{b}||_2^2=\sum_{i=1}^r(\sigma_i\tilde{x}i-\tilde{b})^2+\sum{i=r+1}^nb_i^2.$$
Then it is obvious that:
$$\tilde{x}_{LS}=\frac{\tilde{b_i}}{\sigma_i}\Rightarrow …\Rightarrow x_{LS}=A^\dagger b.$$

So far we have completed the proof of mean square method when $r=n$. However, when $r<n$, then comes the tricky part of least square method:
$$\tilde{x}=V^Hx\in R^n,x=V\tilde{x}\in R^n$$

the previous proof tells us that:
$$x=\sum_{i=1}^n\tilde{x_i}v_i, \tilde{x}_i=\frac{\tilde{b}_i}{\sigma_i}$$

However, if $r<n$ then $\sigma_i=0,i>r$. One question from mid-term exam, $Ax=b,A\in R^{m\times n}$ and when $rank(A)<n$, and then for the $x$:
$$x=A^\dagger b + (I-A^\dagger A)w,w\in R^n$$

where the first term gives the projection to the $span{v_1,..v_r}$ and the second term is the projection to the $span{v_{r+1},…r_n}=$.

Error vector as a projection

$$x_{LS}=A^\dagger b\Rightarrow e=b-Ax_{LS}=b-AA^\dagger b$$
$$\Rightarrow e=P_{R^\perp (A)}b=(I-P_{R(A)})x$$
using SVD, $error=[u_{r+1},…u_m][u_{r+1},…u_m]^Hb$
Over-determined: $m>n$, full rank, $error\neq 0$
Under-determined: $m<n$, full rank, $error=0$.
One lesson from the analysis here is that sometimes you should jump out of the SVD and go back to the original matrix towards the column space.

Extension of Least Squares

Truncated Least Squares

Original least squares:
$$x_{LS}=\arg \min_x||b-Ax||_2^2$$
then the Truncated Least Squares is:
$$x_{TLS}^{(k)}\sum_{i=1}^k\frac{1}{\sigma_i}v_iu_i^Hb$$

which means that the original optimization problem becomes:
$$x^{(k)}_{TLS}=\arg \min_x||b-A_kx||_2^2$$

Regularized Least Squares

$$x_{opt}=\arg \min_x ||Ax-b||_2^2+\delta^2||x||_2^2$$
is equivalent to:
$$x_{opt}=\arg \min_x||\tilde{A}x-\tilde{b}||_2^2$$
where $\tilde{A}=[A, \delta I]^T$ and $\tilde{b}=[b,0]^T$. The solution to this problem is:
$$x_{opt}=\tilde{A}^\dagger \tilde{b}=(A^HA+\delta^2I)^{-1}A^Hb=\sum_{i=1}^r\frac{\sigma_i}{\sigma_i^2+\delta^2}v_i(u_i^Hb)$$

Learning SVD

$$v_1=\arg \max_{||x||_2=1}||Ax||_2$$
Here gives the proof:
$$J(x)=||Ax||_2^2=||U\Sigma V^Hx||_2^2=||\Sigma V^H x||_2^2,$$
let $\tilde{x}=V^Hx$, then:
$$J(\tilde{x})=||\Sigma \tilde{x}||_2^2=\sum_{i=1}^n\sigma_i^2||x_i||_2^2\le \sigma_i^2 \sum_{i=1}^r||x_i||_2^2=\sigma_i^2$$
$$\tilde{x}_{opt}=e_1\Rightarrow x_{opt}=v_1$$
Similarly,
$$u_1=\arg \max_{||x||_2=1}||x^HA||_2$$
Furthermore,
$$v_2=\arg \max_{||x||_2=1}||AP_{R(v_1)^\perp }x||$$
Proof: $AP_{R(v_1)}=\sum\sigma_iu_iv_i^H\sum_{i=2}^rv_iv_i^H=\sum_{i=2}^r\sigma_iu_iv_i^H$
$$v_k=\arg \max_{||x||_2=1}||AP_{R(v_1,..v_{k-1})^\perp}x||$$
Equivalent formulation:
$$x_{opt}=\arg \max_{||x||_2=1}||Ax||_2,\quad s.t.\quad x\perp v_1,v_2…v_{k-1}$$
It is the same for $u_k$:
$$u_k=\arg \max_{||x||_2}||x^HA||_2\quad s.t. \quad x\perp u_1,..u_{k-1}$$
$$u_k=\arg \min_{||x||_2=1}||x^HB||_2,B=P_{R(u_1,…u_{k-1})^\perp}A$$
Claim: $x_{opt}=\pm v_k/\pm u_k$ (not unique), another possibility is that all the $\sigma_i$ are the same, then $x_{opt}=span{v_1,..v_r}$

PCA and SVD

Note that the format of the data is that each column is a sampled data and each row is a attribute.

Principal Component Analysis

Learning interpretation:
$$u=\arg \max_{||x||_2=1} ||Ax||_2, v=\arg \max_{||x||_2=1} ||xA||_2$$

the first step of PCA is to center the data:
$$\tilde{A}=A-\mu_A\cdot 1^T, \mu_A=\frac{1}{n}A1^T$$

then for the principal component factorization:
$$A=W_{pca}\cdot X_{pca}$$
$$W_{pca}=U\Sigma,X_{pca}=(U\Sigma)^\dagger \mu_A 1^T +V^H$$

Besides maximizing the variance, the PCA also minimizes the correlation between the update attributes. The covariance quantifies the correlation between two variables:
$$cov(x, y)=E(x\cdot y)-E(x)E(y)$$

where the variables are considered to be uncorrelated if their covariance is zero. And the covariance matrix is :
$$\hat{\Sigma}_X=\frac{1}{n}XX^T-\mu_X\mu_X^T$$

under the principal component factorization, the coordinates of the original dataset becomes $V^T$ (centered) and the covariance matrix is $I$ (diagonal) which means that the different attributes are uncorrelated.

There is also another way to get the PCA, first centered the matrix $A$ to $\bar{A}$ and find the eigenvectors of the covariance matrix $\bar{A}\bar{A}^T$.

Independent Component Analysis

The final form of ICA is quite similar with PCA:
$$A=W_{ica}\cdot X_{ica}$$
$$W_{ica}=U\Sigma Q_{ica}, X_{ica}=(U\Sigma Q_ica)^\dagger\mu_A1^T+Q_{ica}^HV^H$$

where the unitary matrix $Q_{ica}$ by maximize:
$$Q_{ica}=\arg \max_Q \sum_{i=1}^m |kurtosis(Q[:,i]^HV^H)|$$

Actually, ICA can try to make the different attributes as independent as possible. Independence is more about the internal relation between two random variables while the correlation is more about the linear relationship between two random variables.

k-th moment of a random variable is defined as:
$$E[x^k]=\mu_k=\int x^k f(x)dx$$

where the $f(x)$ is the p.d.f. of $x$. Then the cumulants of a centered random variable:
$$\kappa_2 = \mu_2, \kappa_3=\mu_3,\kappa_4=\mu_4-3\mu_2^2, \kappa_5,….$$
where the $\kappa_2$ is the variance, $\kappa_3$ is the skewness and the $\kappa_4$ is the kurtosis. There are two facts: the first is that maximizing the absolute value of kurtosis value is equivalent to maximizing the “non-Gaussianity”. Another is that if $x_1$ and $x_2$ are independent, then:
$$\kappa_k(x_1+x_2)=\kappa_k(x_1)+\kappa_k(x_2), \kappa_k(wx)=w^k\kappa_k(x)$$

Here to prove that the ICA unmixes mixed independent random variables:

Let $Q$ be orthogonal matrix and $x_1$ and $x_2$ be independent and then:
$$[y_1,y_2]^T=Q[x_1,x_2]^T$$

and the ICA algorithm is:
$$w_{opt}=\arg \max_{||w||_2=1}|\kappa_4(w^Ty)|$$

we want to prove that:
$$w_{opt}=\pm q_1$$

Here is the simple proof:
$$\kappa_4(w^Ty)=\kappa_4(w^TQx)=\kappa_4(\tilde{w}^Tx)=\kappa_4(\tilde{w}_1x_1)+\kappa_4(\tilde{w}_2x_2)$$
$$=\tilde{w}_1\kappa_4(x_1)+\tilde{w}_2\kappa_4(x_2)\le 1\cdot \max(\kappa_4(x_1),\kappa_4(x_2))$$
$$\tilde{w}=e_i\rightarrow w=\pm q_1/q_2$$

Eigen-decomposition of a symmetric matrix

Assume that $A$ is a $n\times n$ symmetric matrix. Then there will be $n$ real eigenvalues $\lambda_1\ge \lambda_2 \ge …\lambda_n$ and each eigenvalue has an eigenvector:
$$Aq_i=\lambda_iq_i$$

then the eigenvalue decomposition of the matrix is:
$$A=\sum_{i=1}^n\lambda_iq_iq_i^T=Q\Lambda Q^T$$

where $Q$ is an arthogonal matrix and each column of this matrix is an eigenvector of the matrix.

Then for the optimization problem with the quadratic form with spherical constraint:
$$x_{opt}=\arg \max x^TAx,s.t.x^Tx=1$$

It is obvious that when $A$ is a symmetric matrix, the solution to the optimization problem is:
$$x_{opt}=q_1$$

Application 1: Procrustes Analysis

Normal Procrustes Problem

Here every column is an attribute while every row is a sampled data.

The Procrustes is to learn the transfer matrix including rotation, translation and scaling.

If there is only rotation, then the optimization problem is:
$$Q_{opt}=\arg \min_Q ||X-YQ||_F^2, s.t.Q^TQ=I$$
$$\min_Q||X-YQ||_F^2=\min_Q Tr[(X-YQ)(X-YQ)^T]$$
$$=\min_Q Tr(X^TX+Q^TY^TYQ-X^TYQ-Q^TY^TX)$$
$$=-\min_QTr(X^TYQ+Q^TY^TX)=\max Tr(X^TYQ)$$

Let $U\sigma V^T$ be the SVD of $X^TY$, then:
$$Tr[X^TYQ]=Tr[U\Sigma V^TQ]=Tr[\Sigma V^TQU]$$

where $W=V^TQU$ is an orthogonal matrix and the objective function reaches the maximum when $W=I$, that is:
$$W=V^TQU=I\rightarrow Q=VU^T$$

In practice, there will be translation and scaling besides the rotation, the complete optimization problem is:
$$\min_{Q,\alpha,\mu}||X-\alpha YQ-1_m\mu^T||_F^2, s.t. QQ^T=I$$

the $\mu$ can be directly calculated given the other parameters and the other way to understand it is that we can first center the dataset to get rid of the translation part.
$$\alpha = [\frac{1}{m}1_m^T(X-\alpha YQ)]^T$$
$$\min_{Q,\alpha} ||X_0-\alpha Y_0Q||_F^2,s.t.Q^TQ=I$$
where $X_0=X-1\mu_X^T$, $Y_0=Y-1\mu_Y^T$.

And then we can find that the $\alpha$ do not exert any influence on the $Q$ because only the singular vectors do not change. Then:
$$Q=VU^T, X_0^TY_0=U\Sigma V^H$$

and finally for the scaling parameter $\alpha$:
$$\alpha = Tr[X_0^TY_0Q]/Tr[Y_0Y_0]$$

Application 2: MDS to recover the coordinates

The zero centroid multidimensional scaling (MDS) coordinates can be used to recovering coordinates from a Euclidean distance matrix. The algorithm is shown as blew:
$$K=-\frac{1}{2}PSP^T,S=D.^2,P=I-\frac{1}{n}11^T$$

Let $K=U\Lambda U^T$ be the eigenvalue decomposition of $K$, The the d-dimensional MDS coordinates of the points are given by the rows of:
$$X_r=U[:,1:d]\Lambda ^{1/2}[1:d,1:d]$$

Here is the proof, it is kind of complicated. Assume $x$ is the original coordinates and every row stands for a point, then:
$$S_{ij}=D_{ij}^2=||x_i-x_j||_2^2=||x_i||_2^2+||x_j||_2^2=2x_i^Tx_j$$
$$S=c1^T+1c^T-2xx^T\rightarrow xx^T=-\frac{1}{2}(S-c1^T-1c^T)$$
where:
$$c=[||x_1||^2,||x_2||^2,…,||x_n||^2]^T$$

On one side, let $K=xx^T$,
$$S=c1^T+1c^T-2K\rightarrow \frac{1^TS1}{n}=\frac{1c1^T1}{n}+\frac{1^T1c^T1}{n}-2\frac{1^Txx^T1}{n}$$
$$\frac{1^TS1}{n}=1^Tc+c^T1-2\frac{1^Txx^T1}{n}\rightarrow2(c^T1)=\frac{1^TS1}{n}+2\frac{1^Txx^T1}{n}$$
One the other side,
$$\frac{S1}{n}=c+\frac{1c^T1}{n}-2\frac{xx^T1}{n}\rightarrow c=\frac{S1}{n}-\frac{1c^T1}{n}+2x\frac{x^T1}{n}$$

If we set the original coordinates centroid, then $x^T1=0$. Then finally the $K=xx^T$ is:
$$K=1\frac{1}{2}(S-c1^T-1c^T)=-\frac{1}{2}PSP^T=Q\Lambda Q^T$$

The the coordinates are:
$$x=Q\Lambda ^{1/2}$$

Application 3: learn to Partition Graphs

A graph is a collection of nodes (vertices) ${v_i}$ and edges ${e_i}$. The degree of a vertex is the number of edges that connect to it. The degree sequence of a graph is a list having the degree of each vertex.
Ajacency matrix of a simple graph with $n$ vertices is an $n\times n$ matrix with $A_{ij}=1$ if $v_i$ and $v_j$ is connected otherwise $0$.
The notion of modularity is based on the elegant idea: “A good division is one where there are fewer than expected edges between communities.”

Let $d_i$ and $d_j$ be the degree of vertex $i$ and vertex $j$ respectively. Let $g_i$ and $g_j$ denote the groups they belong to and let $\delta_{g_ig_j}$ be the Kronecker delta function that is equal to one when $g_i=g_j$ and zero otherwise.

Then the modularity $Q$ of a group assignment encoded by $g_i$ is given by:
$$Q=\frac{1}{N}\sum_{i,j}[A_{ij}-P_{ij}]\delta_{g_ig_j}$$
where $N$ is the total number of edges and $P_{ij}$ is the expected probability of an edge between vertex $i$ and vertex $j$ in a random graph.

A random graph with the same total number of edges as given graph and with $d_i$ edge from vertex $i$ and $d_j$ edges from vertex $j$ has:
$$P_{ij}=\frac{d_i\cdot d_j}{N}$$
Let $s_i=1$ if vertex $i$ belongs to group 1 and $s_i=-1$ if it belongs to group 2. Then:
$$\delta_{g_ig_j}=\frac{s_is_j+1}{2}$$

Then the former optimization problem becomes:
$$s_{opt}=\arg \max_s\sum B_{ij}s_ss_j,s.t.s_i=\pm 1$$

where the modularity matrix $B$ is defined as $B_{ij}=[A_{ij}-\frac{d_id_j}{N}]$.

This problem is to hard to solve and here is the relaxation of the previous problem:

$$s_{eig}=\arg \max s^TBs,s.t.s^Ts=n$$
$$s_{opt}=sign.(s_{eig})$$