It’s been a while since I tried to solve a system of equations without

using a numerical library, so I figured it was time to do a linear

algebra review.

\[

A=\left[\begin{array}{cc}

1 & 2\\

3 & 4

\end{array}\right]=\left[\begin{array}{c}

r_{1}^{T}\\

r_{2}^{T}

\end{array}\right]=\left[\begin{array}{cc}

c_{1} & c_{2}\end{array}\right]

\]

This should be the easiest matrix to work with: small size, nice integer values, non-colinear columns and rows, non-zero determinant, full rank, etc. Let’s go through the basic definitions, just because it’s been a while. $A$ is 2×2 (MxN), which is small and square. Let’s check the column space of $A$ for colinearity by reducing to row echelon form by adding $-3r_{1}^{T}$ to row 2:

\[

\left[\begin{array}{cc}

1 & 2\\

0 & -2

\end{array}\right]

\]

Then adding $r_{2}^{T}$ to row 1:

\[

\left[\begin{array}{cc}

1 & 0\\

0 & -2

\end{array}\right]

\]

So the good news is that the column space of $A$ has linearly independent columns, $\left[\begin{array}{c}

1\\

3

\end{array}\right]$ and $\left[\begin{array}{c}

2\\

4

\end{array}\right]$, which form a basis for $\mathbb{R^{\mathrm{2}}}$, and rank($A$) is 2. So the null space of $A$ is empty. That’s the best possible outcome for a matrix because it means that an inverse exists.

But before exploring that, let’s think about the row space of $A$. Using the same row reductions as above, we can conclude that $A$ has linearly independent rows, $\left[\begin{array}{cc} 1 & 2\end{array}\right]$ and $\left[\begin{array}{cc} 3 & 4\end{array}\right]$, which form a basis for $\mathbb{R^{\mathrm{2}}}$, and rank($A^{T}$) is 2. So the left null space of $A$ is also empty. The row space of $A$ is isomorphic to the column space of $A^{T}$ by definition, and $A$ happens to have full rank, so an inverse exists. Let’s use Gauss-Jordan elimination to find it:

\[

\left[\begin{array}{ccccc}

1 & 2 & | & 1 & 0\\

3 & 4 & | & 0 & 1

\end{array}\right]

\]

Starting with an augmented matrix $\left[\begin{array}{ccc}

A & | & I\end{array}\right]$, we can use row operations to find $\left[\begin{array}{ccc}

I & | & A^{-1}\end{array}\right]$. Add $-3r_{1}^{T}$ to row 2:

\[

\left[\begin{array}{ccccc}

1 & 2 & | & 1 & 0\\

0 & -2 & | & -3 & 1

\end{array}\right]

\]

Then add $r_{2}^{T}$ to row 1:

\[

\left[\begin{array}{ccccc}

1 & 0 & | & -2 & 1\\

0 & 2 & | & -3 & 1

\end{array}\right]

\]

Finally rescale row 2:

\[

\left[\begin{array}{ccccc}

1 & 0 & | & -2 & 1\\

0 & 1 & | & -\frac{3}{2} & \frac{1}{2}

\end{array}\right]

\]

Therefore:

\[

A^{-1}=\left[\begin{array}{cc}

-2 & 1\\

-\frac{3}{2} & \frac{1}{2}

\end{array}\right]

\]

Now we need to check that $A^{-1}A=I$:

\begin{eqnarray*}

\left[\begin{array}{cc}

2 & 1\\

-\frac{3}{2} & \frac{1}{2}

\end{array}\right]\left[\begin{array}{cc}

1 & 2\\

3 & 4

\end{array}\right] & = & \left[\begin{array}{cc}

1 & 0\\

0 & 1

\end{array}\right]

\end{eqnarray*}

Of course, there is an explicit formula for the inverse of a 2×2 matrix:

\[

\left[\begin{array}{cc}

a & b\\

c & d

\end{array}\right]^{-1}=\frac{1}{\det A}\left[\begin{array}{cc}

d & -b\\

-c & a

\end{array}\right]

\]

For our $A$, $\det A=ad-bc=-2$. Unfortunately, analytical inverses don’t exist for larger matrices, or they are so long and complex as to be of limited utility. But there is at least one important idea to take away from the inverse: it can only exist if the determinant is non-zero. This becomes a very important fact for the eigenvalue problem.

One last thing I wanted to write about for now: the L2 norm (or Euclidean norm) of a vector $x=\left[\begin{array}{cccc}

x_{1} & x_{2} & \cdots & x_{n}\end{array}\right]^{T}$ is defined as:

\[

\left\Vert x\right\Vert _{2}^{2}=\sum_{i=1}^{n}x_{i}^{2}

\]

Suppose that we’re fitting data $\left(a_{ij},y_{i}\right)$ to a known linear model$A$ and we want to determine the unknown coefficients $x$ that best fit the data using ordinary least squares:

\begin{eqnarray*}

\left[\begin{array}{c}

y_{1}\\

y_{2}\\

\vdots\\

y_{m}

\end{array}\right] & = & \left[\begin{array}{cccc}

a_{1,1} & a_{1,2} & \cdots & a_{1,n}\\

a_{2,1} & a_{2,2} & \cdots & a_{2,n}\\

\vdots & \vdots & \ddots & \vdots\\

a_{m,1} & a_{m,2} & \cdots & a_{m,n}

\end{array}\right]\left[\begin{array}{c}

x_{1}\\

x_{2}\\

\vdots\\

x_{n}

\end{array}\right]\\

y & = & Ax

\end{eqnarray*}

And we want to choose the $x$ which minimizes the L2 norm of the residuals because we assume them to be Gaussian:

\[

J\left(x\right)=\left\Vert Ax-y\right\Vert _{2}^{2}

\]

Then $J\left(x\right)$ can be expanded:

\begin{eqnarray*}

\left\Vert Ax-y\right\Vert _{2}^{2} & = & \left(Ax-y\right)^{T}\left(Ax-y\right)\\

& = & \left(x^{T}A^{T}-y^{T}\right)\left(Ax-y\right)\\

& = & x^{T}A^{T}Ax-x^{T}A^{T}y-y^{T}Ax+y^{T}y

\end{eqnarray*}

However, $x^{T}A^{T}y$ is a scalar that can be computed as $y^{T}Ax$ by reversing the order of the multiplications, so the last expression can be further simplied:

\begin{eqnarray*}

x^{T}A^{T}Ax-x^{T}A^{T}y-y^{T}Ax+y^{T}y & = & x^{T}A^{T}Ax-2y^{T}Ax+y^{T}y

\end{eqnarray*}

This might look ugly, but we can now minimize $J\left(x\right)$ by

taking the derivative with respect to $x$:

\begin{eqnarray*}

\frac{dJ\left(x\right)}{dx} & = & 2x^{T}A^{T}A-2y^{T}A\\

& = & 2A^{T}Ax-2A^{T}y

\end{eqnarray*}

Where we used the fact that $x^{T}A^{T}A=A^{T}Ax$ and $y^{T}A=A^{T}y$ because it’s just changing the order of the multiplications again.

Now we can derive the celebrated pseudo-inverse of $A$ by setting the derivative to zero:

\begin{eqnarray*}

2A^{T}Ax-2A^{T}y & = & 0\\

A^{T}Ax & = & A^{T}y\\

x & = & \left(A^{T}A\right)^{-1}A^{T}y

\end{eqnarray*}

However, the inverse of $A^{T}A$ may not exist, or it may be very hard to compute (due to numerical instability). An alternative solution is to use gradient descent. Since we already have $\frac{dJ\left(x\right)}{dx}$ in a nice form:

\begin{eqnarray*}

\frac{dJ\left(x\right)}{dx} & = & 2A^{T}\left(Ax-y\right)

\end{eqnarray*}

We can start at any point $x=x_{0}$ and take a step along the direction given by the derivative, $x_{1}=x_{0}-\gamma\frac{dJ\left(x\right)}{dx}$. The problem is how big of a step to take. Despite the existence of a global minimum and an analytical form for the derivative, the steps could either be too small (taking forever to converge) or too large (diverging even when starting near the global minimum). This is where I wrote a paper about bracketing the minimum using a priori constraints (i.e., Golden section search and Brent’s method), but variations on line search minimization are also possible.

I’ve read that the conjugate gradient method is the more popular solution to this problem now. Gradient descent searches strictly along the derivative, whereas the conjugate gradient method chooses a different search direction every time. The Grahm-Schmidt procedure is used to orthogonalize the gradient vectors, and then the conjugate gradient method moves along that new basis. This can be much faster than gradient descent, but it can become slow if the condition number of $A$ is too large. But it’s still a good choice because it doesn’t require the Hessian matrix to be calculated or inverted (as per Newton’s method).

If you’ve got lots of memory and $M,N$ are small-ish, the Levenberg-Marquardt algorithm can converge even faster because it approximates the Hessian with the Jacobian matrix and chooses directions either along the derivative or the Hessian, whichever is better. Unfortunately, it doesn’t work with regularization, and it has a few more internal parameters, and it usually runs out of memory when computing $\left(J^{T}J+\lambda I\right)^{-1}$. So I usually end up using the conjugate gradient method anyway because it can be regularized and doesn’t require crazy amounts of memory.

At CodeMash 2017 I heard a presentation about artificial neural networks where the presenter complained bitterly about how his L2 minimization (“backtracking”) in the neural network was converging very slowly. I thought to suggest an improved algorithm (conjugate gradient), but his talk was focused on a high level introduction with no math, so it didn’t seem appropriate at the time. That inspired me to write this post. Then I saw that someone else already did conjugate gradients with artificial neural networks in 1992. 😛