5 Interesting Facts about Matrix Calculus

Matrix calculus is the extension of calculus to the vector and matrix setting. I struggled a lot with matrix calculus early in grad school, and I've seen other people struggle as well. This mostly happens because standard linear algebra courses rarely cover the topic. However, the math in matrix calculus is basically trivial. Anyway, here are a few interesting facts that made my life much easier. I have used them to derive many expressions in my machine learning practice (e.g. gradients in SGD-based algorithms). See if they help you too.

Fact 1

Inside a trace, objects (vectors and matrices; tensors?) can be cycled, and the whole product can be replaced by its transpose, as long as every product involved is defined. For example:

  1. \text{tr}(AB) = \text{tr}(BA)
    \text{tr}(ABC) = \text{tr}(BCA) = \text{tr}(CAB)
    In general, \text{tr}(ABCD \dots Z) = \text{tr}(ZABC \dots Y) = \text{tr}(YZAB \dots X)
  2. \text{tr}(A^T) = \text{tr}(A)
    \text{tr}(ABC) = \text{tr}(C^T B^T A^T)
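
Both properties are easy to check numerically. The snippet below is a minimal sketch using NumPy; the shapes and the random seed are arbitrary choices made for illustration, picked so that ABC is square and every cyclic shift of the product is defined.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the check is reproducible

# Illustrative shapes: ABC is 3x3, BCA is 4x4, CAB is 5x5.
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 3))

t_abc = np.trace(A @ B @ C)
t_bca = np.trace(B @ C @ A)          # cycled once
t_cab = np.trace(C @ A @ B)          # cycled twice
t_rev = np.trace(C.T @ B.T @ A.T)    # transpose of the whole product

cyclic_ok = bool(np.isclose(t_abc, t_bca) and np.isclose(t_abc, t_cab))
transpose_ok = bool(np.isclose(t_abc, t_rev))
print(cyclic_ok, transpose_ok)
```

Note that only cyclic shifts are allowed: with these shapes a product such as ACB is not even defined.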

Fact 2

Inner products of objects can be written as traces.

  1. \langle A,B \rangle = \text{tr}(A^T B), where A and B are matrices of the same shape.
  2. \langle a,x \rangle = a^T x = \text{tr}(a^T x), where a and x are vectors. The second equality holds because the trace of a scalar is just the scalar. Once the product is inside a trace, we can have a lot of fun with it, as the next fact shows.
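
As a quick sanity check, the trace form can be compared with the elementwise sum \sum_{ij} A_{ij} B_{ij}, which is the inner product these formulas describe. This is a sketch with NumPy; the shapes and seed are arbitrary, and the vectors are stored as columns so that a^T x is a 1x1 matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

# Matrix case: <A, B> as an elementwise sum versus tr(A^T B).
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))
elementwise = np.sum(A * B)
trace_form = np.trace(A.T @ B)

# Vector case: a and x are column vectors, so a^T x is a 1x1 matrix
# whose trace is the scalar itself.
a = rng.standard_normal((5, 1))
x = rng.standard_normal((5, 1))
dot = (a.T @ x).item()
dot_trace = np.trace(a.T @ x)

matrix_ok = bool(np.isclose(elementwise, trace_form))
vector_ok = bool(np.isclose(dot, dot_trace))
print(matrix_ok, vector_ok)
```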

Fact 3

Facts 1 and 2 lead to interesting consequences:

  1. \langle A, BC \rangle = \text{tr}(A^T BC) = \text{tr}(((A^T B)^T)^T C) = \text{tr}((B^T A)^T C) = \langle B^T A, C \rangle … Notice how B moved to the other side of the inner product and picked up a transpose.
  2. \langle A, BC \rangle = \text{tr}(A^T BC) = \text{tr}(C A^T B) \text{ [cycle]} = \text{tr}((A C^T)^T B)
    = \langle AC^T,B \rangle … Notice how C moved to the other side of the inner product and picked up a transpose.
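
The two identities can be verified numerically as well. In this sketch the helper `inner` is a name introduced here for illustration (it simply evaluates tr(P^T Q)), and the shapes are arbitrary apart from A having the same shape as BC.

```python
import numpy as np

def inner(P, Q):
    """Trace inner product <P, Q> = tr(P^T Q)."""
    return np.trace(P.T @ Q)

rng = np.random.default_rng(2)
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))
A = rng.standard_normal((3, 5))  # same shape as the product BC

lhs = inner(A, B @ C)
move_b = inner(B.T @ A, C)   # B moved across, transposed
move_c = inner(A @ C.T, B)   # C moved across, transposed

move_b_ok = bool(np.isclose(lhs, move_b))
move_c_ok = bool(np.isclose(lhs, move_c))
print(move_b_ok, move_c_ok)
```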

Fact 4

The derivative of the inner product of an object X (vector or matrix) with a constant object A is the constant object itself: \frac{\partial}{\partial X} \langle A,X \rangle = A
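
One way to convince yourself of this is a finite-difference check. The sketch below assumes the convention that the derivative with respect to X has the same shape as X, with entry (i, j) equal to the partial derivative with respect to X_{ij}. The helper `numerical_gradient` is a hypothetical name written for this example, not a library function.

```python
import numpy as np

def numerical_gradient(f, X, eps=1e-6):
    """Central-difference estimate of df/dX, one entry of X at a time."""
    grad = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        step = np.zeros_like(X)
        step[idx] = eps
        grad[idx] = (f(X + step) - f(X - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((3, 4))

def f(M):
    return np.trace(A.T @ M)  # <A, M>

grad = numerical_gradient(f, X)
fact4_ok = bool(np.allclose(grad, A, atol=1e-6))
print(fact4_ok)
```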

Fact 5

\frac{\partial f(X,X)}{\partial X} = \left[ \frac{\partial f(X_1, X_2)}{\partial X_1} + \frac{\partial f(X_1, X_2)}{\partial X_2} \right]_{X_1=X_2=X}
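
A scalar sanity check of this formula: take f(X_1, X_2) = X_1 X_2, so that f(X, X) = X^2. Then

\frac{d}{dX} X^2 = \left[ \frac{\partial (X_1 X_2)}{\partial X_1} + \frac{\partial (X_1 X_2)}{\partial X_2} \right]_{X_1=X_2=X} = \left[ X_2 + X_1 \right]_{X_1=X_2=X} = 2X

which is the familiar product rule.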

We really need to understand this final fact. To differentiate a function that contains multiple instances of the differentiation variable, proceed as follows. Treat only one instance of the variable as active (i.e. hold all the other instances constant), take the derivative, and then substitute the actual variable back for all of its instances. Repeat this for each of the other instances and add up the results. That explanation is still rather verbose, so I hope the following example clarifies it:


f(X) = \|AX-B\|^2, \quad \frac{\partial}{\partial X}f(X) = ?


f(X) = \|AX-B\|^2 = \langle AX-B,AX-B \rangle
= \text{tr}((AX-B)^T(AX-B))
= \text{tr}(X^TA^TAX-B^TAX-X^TA^TB+B^TB)
= \text{tr}(X^TA^TAX)-\text{tr}(B^TAX)-\text{tr}(B^TAX)+\text{tr}(B^TB)

\therefore \frac{\partial}{\partial X}f(X)
= \frac{\partial}{\partial X}\text{tr}(X^TA^TAX)-2\frac{\partial}{\partial X}\text{tr}(B^TAX)
= \frac{\partial}{\partial X}\langle AX,AX \rangle - 2\frac{\partial}{\partial X} \langle A^TB,X \rangle
= \left[\frac{\partial}{\partial X_1} \langle AX_1, AX_2 \rangle +\frac{\partial}{\partial X_2} \langle AX_1, AX_2 \rangle \right]_{X_1=X_2=X} - 2A^TB
= \left[\frac{\partial}{\partial X_1} \langle X_1,A^TAX_2 \rangle +\frac{\partial}{\partial X_2} \langle A^TAX_1,X_2 \rangle \right]_{X_1=X_2=X} - 2A^TB
= A^TAX + A^TAX - 2A^TB
= 2A^TAX-2A^TB
= 2A^T(AX-B)
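
The final result, and the Fact 5 split used to reach it, can both be checked with finite differences. This is a sketch under a few assumptions: the norm is taken to be the Frobenius norm (the one induced by the trace inner product), the shapes and seed are arbitrary, and `numerical_gradient` is a hypothetical helper written for this example.

```python
import numpy as np

def numerical_gradient(f, X, eps=1e-6):
    """Central-difference estimate of df/dX, one entry of X at a time."""
    grad = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        step = np.zeros_like(X)
        step[idx] = eps
        grad[idx] = (f(X + step) - f(X - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 3))
X = rng.standard_normal((3, 2))
B = rng.standard_normal((5, 2))

# Full example: f(X) = ||AX - B||^2 (squared Frobenius norm).
def f(M):
    return np.sum((A @ M - B) ** 2)

closed_form = 2 * A.T @ (A @ X - B)
example_ok = bool(np.allclose(numerical_gradient(f, X), closed_form, atol=1e-4))

# Fact 5 on the quadratic term: g(X1, X2) = <A X1, A X2>.
def g(X1, X2):
    return np.trace((A @ X1).T @ (A @ X2))

part1 = numerical_gradient(lambda M: g(M, X), X)  # X2 held constant
part2 = numerical_gradient(lambda M: g(X, M), X)  # X1 held constant
split_ok = bool(np.allclose(part1 + part2, 2 * A.T @ A @ X, atol=1e-4))
print(example_ok, split_ok)
```

Each partial gradient comes out as A^TAX, matching the two equal terms in the derivation.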


Exercise: prove all the facts.