Neural Networks: Mathematics of Backpropagation
Sylesh Suresh and Nikhil Sardana
October 2017
Introduction
Recall our network from the previous lecture:
Through forward propagation, we calculated our prediction $x_2$ for the input $x_0$. Keep in mind, however, that this prediction is essentially a stab in the dark. We multiplied the input by randomly initialized weights and added randomly initialized biases through each layer and calculated an output. At this stage, the output is simply a random guess and is probably very different from the target/true output $y$. In order to make our neural network more accurate, we must change the weights and biases so that our predicted output $x_2$ gets as close to the target output $y$ as possible.
To accomplish this, we will use backpropagation. The first step is to quantify exactly how wrong our prediction is; in other words, we need an error function. We define the error function as
$$E = \frac{1}{2}\|x_2 - y\|^2$$
This makes sense: $E$ increases as the difference between $x_2$ and $y$ increases. The goal of backpropagation is to minimize this error, which brings $x_2$ and $y$ as close together as possible, making $x_2$ a better prediction.
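As a quick sanity check, here is a minimal sketch of this error function in numpy; the prediction and target below are made-up example values, not taken from the lecture:

```python
import numpy as np

def error(x2, y):
    """Squared-error loss E = 1/2 * ||x2 - y||^2."""
    return 0.5 * np.sum((x2 - y) ** 2)

# Hypothetical prediction and target, just for illustration.
x2 = np.array([0.7, 0.2])
y = np.array([1.0, 0.0])
print(error(x2, y))  # 0.5 * (0.09 + 0.04) = 0.065
```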
To minimize $E$, we must change the weights and biases accordingly. Unfortunately, there are many weights and biases (just look at the diagram above), so we can't simply test many different combinations of weights and biases and pick the one that yields the least error. Instead, we can use differentiation to help us. By taking the partial derivative of the error with respect to a particular weight, we learn how to change that weight in order to decrease the error: changing the weight in the direction of the negative partial derivative decreases the error. Please see "Neural Networks: Part 2", Sections 4.1-4.3, for a more in-depth explanation of why calculating the partial derivatives helps us minimize the error function.
Non-Vectorized Backpropagation
We’ve already covered how to backpropagate in the vectorized form (Neural Networks: Part 2, Section 4.5). However, this can be confusing to many students, particularly the parts with matrix transformations and Hadamard products. To alleviate any confusion, let’s walk through backpropagation without any vectors at all.
Defining the Network
Consider the following neural network, with three input nodes $n_1, n_2, n_3$, four hidden nodes $n_4, \dots, n_7$, and two output nodes $n_8, n_9$:
Let's take the partial derivative of the error with respect to a particular weight. Our target outputs are $y_1$ and $y_2$. We define the value at each node $i$ to be $n_i$ and the bias of node $i$ to be $b_i$. Note that input nodes do not have biases; they are simply input values into the network. We can thus rewrite the error function as:
$$E = \frac{1}{2}\left((n_8 - y_1)^2 + (n_9 - y_2)^2\right)$$
A Simple Example
Let's calculate $\frac{\partial E}{\partial w_{48}}$. By the chain rule:
$$\frac{\partial E}{\partial w_{48}} = (n_8 - y_1)\frac{\partial (n_8 - y_1)}{\partial w_{48}} + (n_9 - y_2)\frac{\partial (n_9 - y_2)}{\partial w_{48}}$$
We know:
$$n_8 = \sigma(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)$$
$$n_9 = \sigma(w_{49}n_4 + w_{59}n_5 + w_{69}n_6 + w_{79}n_7 + b_9)$$
Neither $n_9$ nor $y_2$, which is part of the target output, depends on $w_{48}$, so the equation becomes:
$$\frac{\partial E}{\partial w_{48}} = (n_8 - y_1)\frac{\partial (n_8 - y_1)}{\partial w_{48}}$$
Let's look at $\frac{\partial (n_8 - y_1)}{\partial w_{48}}$. $y_1$ does not depend on $w_{48}$. Looking at $n_8$, we see that the only term inside the sigmoid function that depends on $w_{48}$ is $w_{48}n_4$. Thus:
$$\frac{\partial (n_8 - y_1)}{\partial w_{48}} = \frac{\partial \sigma(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)}{\partial w_{48}}$$
$$\frac{\partial (n_8 - y_1)}{\partial w_{48}} = \sigma'(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)\,\frac{\partial (w_{48}n_4)}{\partial w_{48}}$$
We know:
$$n_4 = \sigma(w_{14}n_1 + w_{24}n_2 + w_{34}n_3 + b_4)$$
This expression does not involve $w_{48}$, so $n_4$ is a constant with respect to $w_{48}$ and we get:
$$\frac{\partial (n_8 - y_1)}{\partial w_{48}} = \sigma'(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)\,n_4$$
Substituting back in, we finally arrive at:
$$\frac{\partial E}{\partial w_{48}} = (n_8 - y_1)\,\sigma'(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)\,n_4$$
To shorten this expression, we'll refer to $\sigma'(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)$ as $\sigma'_8$:
$$\frac{\partial E}{\partial w_{48}} = (n_8 - y_1)\,\sigma'_8\,n_4$$
Updating Weights
A network learns by changing its weights to minimize the error. Now that we've found the partial derivative of the error with respect to $w_{48}$, we simply update $w_{48}$ according to the following rule:
$$w_{48} = w_{48} - \alpha\frac{\partial E}{\partial w_{48}}$$
where $\alpha$ is the learning rate, a small constant that we choose.
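To make the update concrete, here is a minimal non-vectorized sketch of this single-weight update in Python. All numeric values and variable names below are made up for illustration; only the gradient formula comes from the derivation above:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Hypothetical values for the relevant nodes, weights, bias, and target.
n4, n5, n6, n7 = 0.5, 0.1, 0.9, 0.3
w48, w58, w68, w78, b8 = 0.2, -0.4, 0.7, 0.1, 0.05
y1 = 1.0
alpha = 0.1  # learning rate

z8 = w48*n4 + w58*n5 + w68*n6 + w78*n7 + b8  # input to node 8's sigmoid
n8 = sigmoid(z8)

# dE/dw48 = (n8 - y1) * sigma'_8 * n4, as derived above.
dE_dw48 = (n8 - y1) * sigmoid_prime(z8) * n4

# Gradient-descent update.
w48 = w48 - alpha * dE_dw48
```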
A Complex Example
Let's begin calculating $\frac{\partial E}{\partial w_{14}}$. We know that $y_1$ and $y_2$ are just constants. We also know:
$$n_8 = \sigma(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)$$
$$n_9 = \sigma(w_{49}n_4 + w_{59}n_5 + w_{69}n_6 + w_{79}n_7 + b_9)$$
$$n_4 = \sigma(w_{14}n_1 + w_{24}n_2 + w_{34}n_3 + b_4)$$
Nowhere else does $w_{14}$ appear in the network. Let's begin chain-ruling.
$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\frac{\partial (n_8 - y_1)}{\partial w_{14}} + (n_9 - y_2)\frac{\partial (n_9 - y_2)}{\partial w_{14}}$$
Calculations
The process for calculating each of these partial derivatives is extraordinarily similar, so for now let's focus on just the first term:
$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\frac{\partial (n_8 - y_1)}{\partial w_{14}}$$
At the end, we repeat the same process for the second partial and add it back in (until then, we abuse notation slightly and write $\frac{\partial E}{\partial w_{14}}$ for this first term alone).
$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\frac{\partial n_8}{\partial w_{14}}$$
$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\frac{\partial \sigma(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)}{\partial w_{14}}$$
$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\,\sigma'(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)\,\frac{\partial (w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)}{\partial w_{14}}$$
To make our expression shorter, let's again write $\sigma'(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)$ as $\sigma'_8$. We can rewrite what we have so far as:
$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\,\sigma'_8\,\frac{\partial (w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)}{\partial w_{14}}$$
Let's continue. Remember, only $n_4$ depends on $w_{14}$.
$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\,\sigma'_8\,\frac{\partial (w_{48}n_4)}{\partial w_{14}}$$
$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\,\sigma'_8\,w_{48}\,\frac{\partial \sigma(w_{14}n_1 + w_{24}n_2 + w_{34}n_3 + b_4)}{\partial w_{14}}$$
$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\,\sigma'_8\,w_{48}\,\sigma'(w_{14}n_1 + w_{24}n_2 + w_{34}n_3 + b_4)\,\frac{\partial (w_{14}n_1 + w_{24}n_2 + w_{34}n_3 + b_4)}{\partial w_{14}}$$
Again, let's write $\sigma'(w_{14}n_1 + w_{24}n_2 + w_{34}n_3 + b_4)$ as $\sigma'_4$. We can rewrite what we have so far as:
$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\,\sigma'_8\,w_{48}\,\sigma'_4\,\frac{\partial (w_{14}n_1 + w_{24}n_2 + w_{34}n_3 + b_4)}{\partial w_{14}}$$
$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\,\sigma'_8\,w_{48}\,\sigma'_4\,\frac{\partial (w_{14}n_1)}{\partial w_{14}}$$
$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\,\sigma'_8\,w_{48}\,\sigma'_4\,n_1$$
The Final Expression
Let's add back in the second partial. Repeating the calculations for it, we finally arrive at:
$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\,\sigma'_8\,w_{48}\,\sigma'_4\,n_1 + (n_9 - y_2)\,\sigma'_9\,w_{49}\,\sigma'_4\,n_1$$
$$\frac{\partial E}{\partial w_{14}} = \big((n_8 - y_1)\,\sigma'_8\,w_{48} + (n_9 - y_2)\,\sigma'_9\,w_{49}\big)\,\sigma'_4\,n_1$$
Of course, in these equations,
$$\sigma'_9 = \sigma'(w_{49}n_4 + w_{59}n_5 + w_{69}n_6 + w_{79}n_7 + b_9)$$
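One good way to check a derivation like this is to compare the formula against a numerical finite-difference estimate of the same partial derivative. Below is a sketch that does this for $\frac{\partial E}{\partial w_{14}}$ on the 3-4-2 network above; all weights, biases, and inputs are randomly chosen placeholders, and the dictionaries w, b, n, y are just this sketch's bookkeeping, not notation from the lecture:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

random.seed(0)
# Hypothetical parameters for the 3-4-2 network: w[(i, j)] connects node i to node j.
w = {(i, j): random.uniform(-1, 1) for j in range(4, 8) for i in range(1, 4)}
w.update({(i, j): random.uniform(-1, 1) for j in range(8, 10) for i in range(4, 8)})
b = {j: random.uniform(-1, 1) for j in range(4, 10)}
n = {1: 0.2, 2: -0.5, 3: 0.8}   # inputs n1, n2, n3
y = {8: 1.0, 9: 0.0}            # targets y1, y2

def forward(w, b, n):
    """Non-vectorized forward pass; returns node values and pre-activations."""
    vals, z = dict(n), {}
    for j in range(4, 8):        # hidden layer, nodes 4-7
        z[j] = sum(w[(i, j)] * vals[i] for i in range(1, 4)) + b[j]
        vals[j] = sigmoid(z[j])
    for j in range(8, 10):       # output layer, nodes 8-9
        z[j] = sum(w[(i, j)] * vals[i] for i in range(4, 8)) + b[j]
        vals[j] = sigmoid(z[j])
    return vals, z

def error(w, b, n, y):
    vals, _ = forward(w, b, n)
    return 0.5 * ((vals[8] - y[8]) ** 2 + (vals[9] - y[9]) ** 2)

# Analytic gradient for w14 from the derivation above.
vals, z = forward(w, b, n)
analytic = ((vals[8] - y[8]) * sigmoid_prime(z[8]) * w[(4, 8)]
            + (vals[9] - y[9]) * sigmoid_prime(z[9]) * w[(4, 9)]) * sigmoid_prime(z[4]) * n[1]

# Numerical estimate: (E(w14 + h) - E(w14 - h)) / (2h).
h = 1e-6
w[(1, 4)] += h
e_plus = error(w, b, n, y)
w[(1, 4)] -= 2 * h
e_minus = error(w, b, n, y)
w[(1, 4)] += h
numeric = (e_plus - e_minus) / (2 * h)
print(analytic, numeric)  # these should agree to several decimal places
```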
Biases
Let's calculate the partial derivative of the error with respect to $b_4$. As with the weights, let's start chain-ruling:
$$\frac{\partial E}{\partial b_4} = (n_8 - y_1)\frac{\partial (n_8 - y_1)}{\partial b_4} + (n_9 - y_2)\frac{\partial (n_9 - y_2)}{\partial b_4}$$
Recall that:
$$n_8 = \sigma(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)$$
$$n_9 = \sigma(w_{49}n_4 + w_{59}n_5 + w_{69}n_6 + w_{79}n_7 + b_9)$$
$$n_4 = \sigma(w_{14}n_1 + w_{24}n_2 + w_{34}n_3 + b_4)$$
Let's look at $\frac{\partial (n_8 - y_1)}{\partial b_4}$. Looking at $n_8$, we see that the only term inside the sigmoid function that depends on $b_4$ is $w_{48}n_4$, and $w_{48}$ is constant with respect to $b_4$. Armed with this knowledge, we start chain-ruling:
$$\frac{\partial (n_8 - y_1)}{\partial b_4} = \sigma'(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)\,w_{48}\,\frac{\partial n_4}{\partial b_4}$$
Let's break down $\frac{\partial n_4}{\partial b_4}$. Looking at $n_4$, we see that the only term inside the sigmoid function that depends on $b_4$ is $b_4$ itself, whose derivative is just 1. Thus:
$$\frac{\partial n_4}{\partial b_4} = \sigma'(w_{14}n_1 + w_{24}n_2 + w_{34}n_3 + b_4)$$
Substituting and using our previous shorthand notation, we find:
$$\frac{\partial (n_8 - y_1)}{\partial b_4} = \sigma'_8\,\sigma'_4\,w_{48}$$
We can use the same process to find that:
$$\frac{\partial (n_9 - y_2)}{\partial b_4} = \sigma'_9\,\sigma'_4\,w_{49}$$
Finally, we get:
$$\frac{\partial E}{\partial b_4} = \sigma'_4\big((n_8 - y_1)\,\sigma'_8\,w_{48} + (n_9 - y_2)\,\sigma'_9\,w_{49}\big)$$
The process is simpler for the third layer. Take $b_8$, for example. We start the same way:
$$\frac{\partial E}{\partial b_8} = (n_8 - y_1)\frac{\partial (n_8 - y_1)}{\partial b_8} + (n_9 - y_2)\frac{\partial (n_9 - y_2)}{\partial b_8}$$
$n_9$ does not depend on $b_8$, so the right-hand term evaluates to 0 and drops out. For the remaining term, $y_1$ is again constant, so we find that:
$$\frac{\partial E}{\partial b_8} = (n_8 - y_1)\frac{\partial n_8}{\partial b_8}$$
The only term inside the sigmoid function in the expression for $n_8$ that depends on $b_8$ is $b_8$ itself. Thus:
$$\frac{\partial E}{\partial b_8} = (n_8 - y_1)\,\sigma'_8$$
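The bias gradients reuse pieces we already computed for the weight gradients. Here is a tiny sketch of the two results above, with hypothetical stand-in values for quantities a forward pass would have produced:

```python
# Hypothetical values, standing in for quantities computed during a forward pass.
n8, n9 = 0.6, 0.3                   # outputs
y1, y2 = 1.0, 0.0                   # targets
sp4, sp8, sp9 = 0.24, 0.21, 0.19    # sigma'_4, sigma'_8, sigma'_9
w48, w49 = 0.2, -0.1

# dE/db8 = (n8 - y1) * sigma'_8
dE_db8 = (n8 - y1) * sp8

# dE/db4 = sigma'_4 * ((n8 - y1) * sigma'_8 * w48 + (n9 - y2) * sigma'_9 * w49)
dE_db4 = sp4 * ((n8 - y1) * sp8 * w48 + (n9 - y2) * sp9 * w49)
```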
A Matrix Representation
Let’s first define the vectorized network (we’ve done this in “Neural Networks: Part 2” already). In this case:
$$W_1 = \begin{bmatrix} w_{14} & w_{24} & w_{34} \\ w_{15} & w_{25} & w_{35} \\ w_{16} & w_{26} & w_{36} \\ w_{17} & w_{27} & w_{37} \end{bmatrix} \qquad W_2 = \begin{bmatrix} w_{48} & w_{58} & w_{68} & w_{78} \\ w_{49} & w_{59} & w_{69} & w_{79} \end{bmatrix}$$
$$x_0 = \begin{bmatrix} n_1 \\ n_2 \\ n_3 \end{bmatrix} \qquad x_1 = \begin{bmatrix} n_4 \\ n_5 \\ n_6 \\ n_7 \end{bmatrix} \qquad x_2 = \begin{bmatrix} n_8 \\ n_9 \end{bmatrix}$$
$$b_1 = \begin{bmatrix} b_4 \\ b_5 \\ b_6 \\ b_7 \end{bmatrix} \qquad b_2 = \begin{bmatrix} b_8 \\ b_9 \end{bmatrix} \qquad y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$$
And from vectorized forward propagation, we know:
$$x_1 = \sigma(W_1x_0 + b_1)$$
$$x_2 = \sigma(W_2x_1 + b_2)$$
Note that
$$W_1x_0 + b_1 = \begin{bmatrix} w_{14} & w_{24} & w_{34} \\ w_{15} & w_{25} & w_{35} \\ w_{16} & w_{26} & w_{36} \\ w_{17} & w_{27} & w_{37} \end{bmatrix}\begin{bmatrix} n_1 \\ n_2 \\ n_3 \end{bmatrix} + \begin{bmatrix} b_4 \\ b_5 \\ b_6 \\ b_7 \end{bmatrix} = \begin{bmatrix} w_{14}n_1 + w_{24}n_2 + w_{34}n_3 + b_4 \\ w_{15}n_1 + w_{25}n_2 + w_{35}n_3 + b_5 \\ w_{16}n_1 + w_{26}n_2 + w_{36}n_3 + b_6 \\ w_{17}n_1 + w_{27}n_2 + w_{37}n_3 + b_7 \end{bmatrix}$$
$$W_2x_1 + b_2 = \begin{bmatrix} w_{48} & w_{58} & w_{68} & w_{78} \\ w_{49} & w_{59} & w_{69} & w_{79} \end{bmatrix}\begin{bmatrix} n_4 \\ n_5 \\ n_6 \\ n_7 \end{bmatrix} + \begin{bmatrix} b_8 \\ b_9 \end{bmatrix} = \begin{bmatrix} w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8 \\ w_{49}n_4 + w_{59}n_5 + w_{69}n_6 + w_{79}n_7 + b_9 \end{bmatrix}$$
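Here is a minimal numpy sketch of this vectorized forward pass; the weights, biases, and input below are random placeholders rather than values from the lecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # weights into the hidden layer (nodes 4-7)
b1 = rng.standard_normal((4, 1))
W2 = rng.standard_normal((2, 4))   # weights into the output layer (nodes 8-9)
b2 = rng.standard_normal((2, 1))

x0 = np.array([[0.2], [-0.5], [0.8]])   # hypothetical input column vector

x1 = sigmoid(W1 @ x0 + b1)   # hidden-layer activations, shape (4, 1)
x2 = sigmoid(W2 @ x1 + b2)   # network output, shape (2, 1)
```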
The First Matrix
We can repeat the process outlined in Section 2.2 to calculate the partial derivative of the error function with respect to each weight. The final results for the weights between the second and third layers of nodes follow a similar pattern:
$$\frac{\partial E}{\partial w_{48}} = (n_8 - y_1)\,\sigma'_8\,n_4 \qquad \frac{\partial E}{\partial w_{49}} = (n_9 - y_2)\,\sigma'_9\,n_4$$
$$\frac{\partial E}{\partial w_{58}} = (n_8 - y_1)\,\sigma'_8\,n_5 \qquad \frac{\partial E}{\partial w_{59}} = (n_9 - y_2)\,\sigma'_9\,n_5$$
$$\frac{\partial E}{\partial w_{68}} = (n_8 - y_1)\,\sigma'_8\,n_6 \qquad \frac{\partial E}{\partial w_{69}} = (n_9 - y_2)\,\sigma'_9\,n_6$$
$$\frac{\partial E}{\partial w_{78}} = (n_8 - y_1)\,\sigma'_8\,n_7 \qquad \frac{\partial E}{\partial w_{79}} = (n_9 - y_2)\,\sigma'_9\,n_7$$
Hmmmm... these all look similar! I bet we could rearrange them to create a more compact matrix representation, which we will call $\frac{\partial E}{\partial W_2}$. We wish to simplify weight updating to just
$$W_2 = W_2 - \alpha\frac{\partial E}{\partial W_2}$$
So $\frac{\partial E}{\partial W_2}$ must be the same shape as $W_2$, which is $2 \times 4$.
$$\frac{\partial E}{\partial W_2} = \begin{bmatrix} \frac{\partial E}{\partial w_{48}} & \frac{\partial E}{\partial w_{58}} & \frac{\partial E}{\partial w_{68}} & \frac{\partial E}{\partial w_{78}} \\ \frac{\partial E}{\partial w_{49}} & \frac{\partial E}{\partial w_{59}} & \frac{\partial E}{\partial w_{69}} & \frac{\partial E}{\partial w_{79}} \end{bmatrix}$$
Note that:
$$\begin{bmatrix} (n_8 - y_1)\,\sigma'_8 \\ (n_9 - y_2)\,\sigma'_9 \end{bmatrix}\begin{bmatrix} n_4 & n_5 & n_6 & n_7 \end{bmatrix} = \begin{bmatrix} \frac{\partial E}{\partial w_{48}} & \frac{\partial E}{\partial w_{58}} & \frac{\partial E}{\partial w_{68}} & \frac{\partial E}{\partial w_{78}} \\ \frac{\partial E}{\partial w_{49}} & \frac{\partial E}{\partial w_{59}} & \frac{\partial E}{\partial w_{69}} & \frac{\partial E}{\partial w_{79}} \end{bmatrix}$$
We can also write this as:
$$\left(\begin{bmatrix} n_8 - y_1 \\ n_9 - y_2 \end{bmatrix} \odot \begin{bmatrix} \sigma'_8 \\ \sigma'_9 \end{bmatrix}\right)\begin{bmatrix} n_4 & n_5 & n_6 & n_7 \end{bmatrix} = \begin{bmatrix} \frac{\partial E}{\partial w_{48}} & \frac{\partial E}{\partial w_{58}} & \frac{\partial E}{\partial w_{68}} & \frac{\partial E}{\partial w_{78}} \\ \frac{\partial E}{\partial w_{49}} & \frac{\partial E}{\partial w_{59}} & \frac{\partial E}{\partial w_{69}} & \frac{\partial E}{\partial w_{79}} \end{bmatrix}$$
where $\odot$ is the Hadamard product, or element-wise multiplication. Let's vectorize this further:
$$\begin{bmatrix} \sigma'_8 \\ \sigma'_9 \end{bmatrix} = \begin{bmatrix} \sigma'(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8) \\ \sigma'(w_{49}n_4 + w_{59}n_5 + w_{69}n_6 + w_{79}n_7 + b_9) \end{bmatrix} = \sigma'(W_2x_1 + b_2)$$
$$\begin{bmatrix} n_8 - y_1 \\ n_9 - y_2 \end{bmatrix} = \begin{bmatrix} n_8 \\ n_9 \end{bmatrix} - \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = x_2 - y$$
$$\begin{bmatrix} n_4 & n_5 & n_6 & n_7 \end{bmatrix} = x_1^T$$
Thus, we get:
$$\frac{\partial E}{\partial W_2} = \left[(x_2 - y) \odot \sigma'(W_2x_1 + b_2)\right]x_1^T$$
For the sake of simplicity, we define
$$\delta_2 = (x_2 - y) \odot \sigma'(W_2x_1 + b_2)$$
Finally, we reach our old vectorized backpropagation formula:
$$\frac{\partial E}{\partial W_2} = \delta_2 x_1^T$$
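A short numpy sketch of this result follows. It repeats a placeholder forward pass so that it runs on its own; the data and parameters are again made up:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))
x0 = np.array([[0.2], [-0.5], [0.8]])
y = np.array([[1.0], [0.0]])

# Forward pass.
x1 = sigmoid(W1 @ x0 + b1)
x2 = sigmoid(W2 @ x1 + b2)

# delta_2 = (x2 - y) ⊙ sigma'(W2 x1 + b2), and dE/dW2 = delta_2 x1^T.
delta2 = (x2 - y) * sigmoid_prime(W2 @ x1 + b2)
dE_dW2 = delta2 @ x1.T    # shape (2, 4), same as W2
```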
The Second Matrix
We can repeat the process outlined in Section 2.4 to calculate the partial derivative of the error function with respect to each weight. The final results for the weights between the first and second layers of nodes follow a similar pattern:
$$\frac{\partial E}{\partial w_{14}} = \big((n_8 - y_1)\,\sigma'_8\,w_{48} + (n_9 - y_2)\,\sigma'_9\,w_{49}\big)\,\sigma'_4\,n_1$$
$$\frac{\partial E}{\partial w_{15}} = \big((n_8 - y_1)\,\sigma'_8\,w_{58} + (n_9 - y_2)\,\sigma'_9\,w_{59}\big)\,\sigma'_5\,n_1$$
$$\frac{\partial E}{\partial w_{16}} = \big((n_8 - y_1)\,\sigma'_8\,w_{68} + (n_9 - y_2)\,\sigma'_9\,w_{69}\big)\,\sigma'_6\,n_1$$
$$\frac{\partial E}{\partial w_{17}} = \big((n_8 - y_1)\,\sigma'_8\,w_{78} + (n_9 - y_2)\,\sigma'_9\,w_{79}\big)\,\sigma'_7\,n_1$$
$$\frac{\partial E}{\partial w_{24}} = \big((n_8 - y_1)\,\sigma'_8\,w_{48} + (n_9 - y_2)\,\sigma'_9\,w_{49}\big)\,\sigma'_4\,n_2$$
$$\frac{\partial E}{\partial w_{25}} = \big((n_8 - y_1)\,\sigma'_8\,w_{58} + (n_9 - y_2)\,\sigma'_9\,w_{59}\big)\,\sigma'_5\,n_2$$
$$\frac{\partial E}{\partial w_{26}} = \big((n_8 - y_1)\,\sigma'_8\,w_{68} + (n_9 - y_2)\,\sigma'_9\,w_{69}\big)\,\sigma'_6\,n_2$$
$$\frac{\partial E}{\partial w_{27}} = \big((n_8 - y_1)\,\sigma'_8\,w_{78} + (n_9 - y_2)\,\sigma'_9\,w_{79}\big)\,\sigma'_7\,n_2$$
$$\frac{\partial E}{\partial w_{34}} = \big((n_8 - y_1)\,\sigma'_8\,w_{48} + (n_9 - y_2)\,\sigma'_9\,w_{49}\big)\,\sigma'_4\,n_3$$
$$\frac{\partial E}{\partial w_{35}} = \big((n_8 - y_1)\,\sigma'_8\,w_{58} + (n_9 - y_2)\,\sigma'_9\,w_{59}\big)\,\sigma'_5\,n_3$$
$$\frac{\partial E}{\partial w_{36}} = \big((n_8 - y_1)\,\sigma'_8\,w_{68} + (n_9 - y_2)\,\sigma'_9\,w_{69}\big)\,\sigma'_6\,n_3$$
$$\frac{\partial E}{\partial w_{37}} = \big((n_8 - y_1)\,\sigma'_8\,w_{78} + (n_9 - y_2)\,\sigma'_9\,w_{79}\big)\,\sigma'_7\,n_3$$
To update each respective weight, we subtract these partials (multiplied by $\alpha$) from their respective weights, just as we did for $w_{48}$ above.
Our matrix representation needs to look like this:
$$\frac{\partial E}{\partial W_1} = \begin{bmatrix} \frac{\partial E}{\partial w_{14}} & \frac{\partial E}{\partial w_{24}} & \frac{\partial E}{\partial w_{34}} \\ \frac{\partial E}{\partial w_{15}} & \frac{\partial E}{\partial w_{25}} & \frac{\partial E}{\partial w_{35}} \\ \frac{\partial E}{\partial w_{16}} & \frac{\partial E}{\partial w_{26}} & \frac{\partial E}{\partial w_{36}} \\ \frac{\partial E}{\partial w_{17}} & \frac{\partial E}{\partial w_{27}} & \frac{\partial E}{\partial w_{37}} \end{bmatrix}$$
We notice that:
$$\begin{bmatrix} w_{48} & w_{49} \\ w_{58} & w_{59} \\ w_{68} & w_{69} \\ w_{78} & w_{79} \end{bmatrix}\begin{bmatrix} \sigma'_8(n_8 - y_1) \\ \sigma'_9(n_9 - y_2) \end{bmatrix} = \begin{bmatrix} w_{48}\sigma'_8(n_8 - y_1) + w_{49}\sigma'_9(n_9 - y_2) \\ w_{58}\sigma'_8(n_8 - y_1) + w_{59}\sigma'_9(n_9 - y_2) \\ w_{68}\sigma'_8(n_8 - y_1) + w_{69}\sigma'_9(n_9 - y_2) \\ w_{78}\sigma'_8(n_8 - y_1) + w_{79}\sigma'_9(n_9 - y_2) \end{bmatrix}$$
which can be vectorized and simplified as:
$$\begin{bmatrix} w_{48} & w_{49} \\ w_{58} & w_{59} \\ w_{68} & w_{69} \\ w_{78} & w_{79} \end{bmatrix}\left(\begin{bmatrix} \sigma'_8 \\ \sigma'_9 \end{bmatrix} \odot \begin{bmatrix} n_8 - y_1 \\ n_9 - y_2 \end{bmatrix}\right) = \begin{bmatrix} w_{48}\sigma'_8(n_8 - y_1) + w_{49}\sigma'_9(n_9 - y_2) \\ w_{58}\sigma'_8(n_8 - y_1) + w_{59}\sigma'_9(n_9 - y_2) \\ w_{68}\sigma'_8(n_8 - y_1) + w_{69}\sigma'_9(n_9 - y_2) \\ w_{78}\sigma'_8(n_8 - y_1) + w_{79}\sigma'_9(n_9 - y_2) \end{bmatrix}$$
$$W_2^T\left[\sigma'(W_2x_1 + b_2) \odot (x_2 - y)\right] = \begin{bmatrix} w_{48}\sigma'_8(n_8 - y_1) + w_{49}\sigma'_9(n_9 - y_2) \\ w_{58}\sigma'_8(n_8 - y_1) + w_{59}\sigma'_9(n_9 - y_2) \\ w_{68}\sigma'_8(n_8 - y_1) + w_{69}\sigma'_9(n_9 - y_2) \\ w_{78}\sigma'_8(n_8 - y_1) + w_{79}\sigma'_9(n_9 - y_2) \end{bmatrix}$$
$$W_2^T\delta_2 = \begin{bmatrix} w_{48}\sigma'_8(n_8 - y_1) + w_{49}\sigma'_9(n_9 - y_2) \\ w_{58}\sigma'_8(n_8 - y_1) + w_{59}\sigma'_9(n_9 - y_2) \\ w_{68}\sigma'_8(n_8 - y_1) + w_{69}\sigma'_9(n_9 - y_2) \\ w_{78}\sigma'_8(n_8 - y_1) + w_{79}\sigma'_9(n_9 - y_2) \end{bmatrix}$$
This $\delta_2$ is the same as the one we defined when calculating the previous vectorized representation. Now, this is fine and dandy, but we've so far ignored the $\sigma'_4 n_1$, $\sigma'_5 n_1$, etc. sitting outside the parentheses of our partial derivative calculations. Let's work those in.
$$\left(\begin{bmatrix} w_{48}\sigma'_8(n_8 - y_1) + w_{49}\sigma'_9(n_9 - y_2) \\ w_{58}\sigma'_8(n_8 - y_1) + w_{59}\sigma'_9(n_9 - y_2) \\ w_{68}\sigma'_8(n_8 - y_1) + w_{69}\sigma'_9(n_9 - y_2) \\ w_{78}\sigma'_8(n_8 - y_1) + w_{79}\sigma'_9(n_9 - y_2) \end{bmatrix} \odot \begin{bmatrix} \sigma'_4 \\ \sigma'_5 \\ \sigma'_6 \\ \sigma'_7 \end{bmatrix}\right)\begin{bmatrix} n_1 & n_2 & n_3 \end{bmatrix}$$
$$= \begin{bmatrix} \big((n_8 - y_1)\sigma'_8 w_{48} + (n_9 - y_2)\sigma'_9 w_{49}\big)\sigma'_4 n_1 & \big((n_8 - y_1)\sigma'_8 w_{48} + (n_9 - y_2)\sigma'_9 w_{49}\big)\sigma'_4 n_2 & \big((n_8 - y_1)\sigma'_8 w_{48} + (n_9 - y_2)\sigma'_9 w_{49}\big)\sigma'_4 n_3 \\ \big((n_8 - y_1)\sigma'_8 w_{58} + (n_9 - y_2)\sigma'_9 w_{59}\big)\sigma'_5 n_1 & \big((n_8 - y_1)\sigma'_8 w_{58} + (n_9 - y_2)\sigma'_9 w_{59}\big)\sigma'_5 n_2 & \big((n_8 - y_1)\sigma'_8 w_{58} + (n_9 - y_2)\sigma'_9 w_{59}\big)\sigma'_5 n_3 \\ \big((n_8 - y_1)\sigma'_8 w_{68} + (n_9 - y_2)\sigma'_9 w_{69}\big)\sigma'_6 n_1 & \big((n_8 - y_1)\sigma'_8 w_{68} + (n_9 - y_2)\sigma'_9 w_{69}\big)\sigma'_6 n_2 & \big((n_8 - y_1)\sigma'_8 w_{68} + (n_9 - y_2)\sigma'_9 w_{69}\big)\sigma'_6 n_3 \\ \big((n_8 - y_1)\sigma'_8 w_{78} + (n_9 - y_2)\sigma'_9 w_{79}\big)\sigma'_7 n_1 & \big((n_8 - y_1)\sigma'_8 w_{78} + (n_9 - y_2)\sigma'_9 w_{79}\big)\sigma'_7 n_2 & \big((n_8 - y_1)\sigma'_8 w_{78} + (n_9 - y_2)\sigma'_9 w_{79}\big)\sigma'_7 n_3 \end{bmatrix}$$
$$= \begin{bmatrix} \frac{\partial E}{\partial w_{14}} & \frac{\partial E}{\partial w_{24}} & \frac{\partial E}{\partial w_{34}} \\ \frac{\partial E}{\partial w_{15}} & \frac{\partial E}{\partial w_{25}} & \frac{\partial E}{\partial w_{35}} \\ \frac{\partial E}{\partial w_{16}} & \frac{\partial E}{\partial w_{26}} & \frac{\partial E}{\partial w_{36}} \\ \frac{\partial E}{\partial w_{17}} & \frac{\partial E}{\partial w_{27}} & \frac{\partial E}{\partial w_{37}} \end{bmatrix} = \frac{\partial E}{\partial W_1}$$
Note that:
$$\begin{bmatrix} \sigma'_4 \\ \sigma'_5 \\ \sigma'_6 \\ \sigma'_7 \end{bmatrix} = \sigma'\left(\begin{bmatrix} w_{14}n_1 + w_{24}n_2 + w_{34}n_3 + b_4 \\ w_{15}n_1 + w_{25}n_2 + w_{35}n_3 + b_5 \\ w_{16}n_1 + w_{26}n_2 + w_{36}n_3 + b_6 \\ w_{17}n_1 + w_{27}n_2 + w_{37}n_3 + b_7 \end{bmatrix}\right) = \sigma'(W_1x_0 + b_1)$$
$$x_0^T = \begin{bmatrix} n_1 & n_2 & n_3 \end{bmatrix}$$
Thus, we get:
$$\frac{\partial E}{\partial W_1} = \left[W_2^T\delta_2 \odot \sigma'(W_1x_0 + b_1)\right]x_0^T$$
We denote
$$\delta_1 = W_2^T\delta_2 \odot \sigma'(W_1x_0 + b_1)$$
yielding
$$\frac{\partial E}{\partial W_1} = \delta_1 x_0^T$$
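Extending the same placeholder setup, the first-layer gradient follows the identical pattern; the forward pass is repeated here so the sketch stands on its own:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))
x0 = np.array([[0.2], [-0.5], [0.8]])
y = np.array([[1.0], [0.0]])

# Forward pass.
x1 = sigmoid(W1 @ x0 + b1)
x2 = sigmoid(W2 @ x1 + b2)

# Backward pass for both weight matrices.
delta2 = (x2 - y) * sigmoid_prime(W2 @ x1 + b2)
delta1 = (W2.T @ delta2) * sigmoid_prime(W1 @ x0 + b1)
dE_dW2 = delta2 @ x1.T    # shape (2, 4), same as W2
dE_dW1 = delta1 @ x0.T    # shape (4, 3), same as W1
```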
The Bias Vectors
For the second layer, the bias partials are:
$$\frac{\partial E}{\partial b_4} = \sigma'_4\big((n_8 - y_1)\,\sigma'_8\,w_{48} + (n_9 - y_2)\,\sigma'_9\,w_{49}\big)$$
$$\frac{\partial E}{\partial b_5} = \sigma'_5\big((n_8 - y_1)\,\sigma'_8\,w_{58} + (n_9 - y_2)\,\sigma'_9\,w_{59}\big)$$
$$\frac{\partial E}{\partial b_6} = \sigma'_6\big((n_8 - y_1)\,\sigma'_8\,w_{68} + (n_9 - y_2)\,\sigma'_9\,w_{69}\big)$$
$$\frac{\partial E}{\partial b_7} = \sigma'_7\big((n_8 - y_1)\,\sigma'_8\,w_{78} + (n_9 - y_2)\,\sigma'_9\,w_{79}\big)$$
For the third layer, the bias partials are:
$$\frac{\partial E}{\partial b_8} = (n_8 - y_1)\,\sigma'_8$$
$$\frac{\partial E}{\partial b_9} = (n_9 - y_2)\,\sigma'_9$$
We wish to have a simple way to express how to update the bias matrix, like so:
$$b_i = b_i - \alpha\frac{\partial E}{\partial b_i}$$
where $b_i$ is a bias matrix. In other words, we need to calculate $\frac{\partial E}{\partial b_i}$. Let's look at the second layer first. $b_1$ is a $4 \times 1$ matrix, so $\frac{\partial E}{\partial b_1}$ must have the same dimensions. Without vectors, our update is:
$$b_j = b_j - \alpha\frac{\partial E}{\partial b_j}$$
where $b_j$ is the bias for a single node, i.e. an element of a bias matrix. Looking at the partials for the biases of the second layer, we can simply define $\frac{\partial E}{\partial b_1}$ as the $4 \times 1$ matrix of the partials for each bias, like so:
$$\frac{\partial E}{\partial b_1} = \begin{bmatrix} \frac{\partial E}{\partial b_4} \\ \frac{\partial E}{\partial b_5} \\ \frac{\partial E}{\partial b_6} \\ \frac{\partial E}{\partial b_7} \end{bmatrix}$$
Similarly, for the second bias matrix:
$$\frac{\partial E}{\partial b_2} = \begin{bmatrix} \frac{\partial E}{\partial b_8} \\ \frac{\partial E}{\partial b_9} \end{bmatrix}$$
To derive our equations for the bias update for vectorized backpropagation, let's first look at the second bias matrix. If we write out each element, we find:
$$\frac{\partial E}{\partial b_2} = \begin{bmatrix} (n_8 - y_1)\,\sigma'_8 \\ (n_9 - y_2)\,\sigma'_9 \end{bmatrix} = \begin{bmatrix} (n_8 - y_1)\,\sigma'(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8) \\ (n_9 - y_2)\,\sigma'(w_{49}n_4 + w_{59}n_5 + w_{69}n_6 + w_{79}n_7 + b_9) \end{bmatrix}$$
We can rewrite this as a Hadamard product:
$$\frac{\partial E}{\partial b_2} = \begin{bmatrix} n_8 - y_1 \\ n_9 - y_2 \end{bmatrix} \odot \begin{bmatrix} \sigma'(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8) \\ \sigma'(w_{49}n_4 + w_{59}n_5 + w_{69}n_6 + w_{79}n_7 + b_9) \end{bmatrix}$$
The first matrix can be written in vectorized form as $x_2 - y$, and the second matrix can be written as $\sigma'(W_2x_1 + b_2)$. We then arrive at:
$$\frac{\partial E}{\partial b_2} = (x_2 - y) \odot \sigma'(W_2x_1 + b_2)$$
Notice that this is the same as the $\delta_2$ we calculated for the weights earlier. We can write the bias formula as just:
$$\frac{\partial E}{\partial b_2} = \delta_2$$
Looking at the first bias matrix, we see that we can write it out as:
$$\frac{\partial E}{\partial b_1} = \begin{bmatrix} \sigma'_4\big((n_8 - y_1)\,\sigma'_8\,w_{48} + (n_9 - y_2)\,\sigma'_9\,w_{49}\big) \\ \sigma'_5\big((n_8 - y_1)\,\sigma'_8\,w_{58} + (n_9 - y_2)\,\sigma'_9\,w_{59}\big) \\ \sigma'_6\big((n_8 - y_1)\,\sigma'_8\,w_{68} + (n_9 - y_2)\,\sigma'_9\,w_{69}\big) \\ \sigma'_7\big((n_8 - y_1)\,\sigma'_8\,w_{78} + (n_9 - y_2)\,\sigma'_9\,w_{79}\big) \end{bmatrix}$$
Looking carefully at this matrix, we see that we can write it as a product of two matrices involving $\frac{\partial E}{\partial b_2}$:
$$\frac{\partial E}{\partial b_1} = \begin{bmatrix} \sigma'_4 w_{48} & \sigma'_4 w_{49} \\ \sigma'_5 w_{58} & \sigma'_5 w_{59} \\ \sigma'_6 w_{68} & \sigma'_6 w_{69} \\ \sigma'_7 w_{78} & \sigma'_7 w_{79} \end{bmatrix}\begin{bmatrix} (n_8 - y_1)\,\sigma'_8 \\ (n_9 - y_2)\,\sigma'_9 \end{bmatrix}$$
The first matrix looks very similar to some combination of $W_2$ and $\sigma'(W_1x_0 + b_1)$, whose elements are $\sigma'_4, \dots, \sigma'_7$. We can rewrite and rearrange the equation with the help of the Hadamard product:
$$\frac{\partial E}{\partial b_1} = \left(\begin{bmatrix} w_{48} & w_{49} \\ w_{58} & w_{59} \\ w_{68} & w_{69} \\ w_{78} & w_{79} \end{bmatrix}\begin{bmatrix} (n_8 - y_1)\,\sigma'_8 \\ (n_9 - y_2)\,\sigma'_9 \end{bmatrix}\right) \odot \begin{bmatrix} \sigma'_4 \\ \sigma'_5 \\ \sigma'_6 \\ \sigma'_7 \end{bmatrix}$$
This, in turn, can be written as:
$$\frac{\partial E}{\partial b_1} = W_2^T\frac{\partial E}{\partial b_2} \odot \sigma'(W_1x_0 + b_1)$$
$$\frac{\partial E}{\partial b_1} = W_2^T\delta_2 \odot \sigma'(W_1x_0 + b_1)$$
Hey... this is the same as the $\delta_1$ from earlier! Now our equation is just:
$$\frac{\partial E}{\partial b_1} = \delta_1$$
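In code, the bias updates therefore need nothing beyond the deltas themselves. A tiny sketch, with random stand-ins for the deltas and biases (in practice they would come from the backward pass shown earlier):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for delta1 (4x1) and delta2 (2x1) computed as in the earlier sketch.
delta1, delta2 = rng.standard_normal((4, 1)), rng.standard_normal((2, 1))
b1, b2 = rng.standard_normal((4, 1)), rng.standard_normal((2, 1))
alpha = 0.1

# Since dE/db1 = delta1 and dE/db2 = delta2, the bias updates are simply:
b1 = b1 - alpha * delta1
b2 = b2 - alpha * delta2
```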
A General Form
We can generalize this for any layer. The only difference is the delta for the last layer:
$$\delta_L = (x_L - y) \odot \sigma'(W_Lx_{L-1} + b_L)$$
The delta for every other layer is:
$$\delta_i = W_{i+1}^T\delta_{i+1} \odot \sigma'(W_ix_{i-1} + b_i)$$
The gradient for each weight matrix is calculated and the weight matrix is updated as follows:
$$\frac{\partial E}{\partial W_i} = \delta_i x_{i-1}^T$$
$$W_i = W_i - \alpha\frac{\partial E}{\partial W_i}$$
For biases, the rule is simpler:
$$b_i = b_i - \alpha\delta_i$$
That is the essence of backpropagation. Note that these formulas work for any activation function; the reason the sigmoid is used for teaching is that its derivative is fairly straightforward:
$$\sigma'(x) = \sigma(x)(1 - \sigma(x))$$
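Putting the general form together, here is a minimal numpy sketch of one training step for an arbitrary fully-connected sigmoid network, using the delta recursion and update rules above; the layer sizes, data, and learning rate are placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # sigma'(x) = sigma(x) * (1 - sigma(x))

def train_step(weights, biases, x0, y, alpha=0.1):
    """One forward/backward pass with gradient-descent updates."""
    # Forward pass, keeping every layer's activation x_i and pre-activation z_i.
    xs, zs = [x0], []
    for W, b in zip(weights, biases):
        zs.append(W @ xs[-1] + b)
        xs.append(sigmoid(zs[-1]))

    # Backward pass: delta_L for the last layer, then delta_i for the rest.
    deltas = [None] * len(weights)
    deltas[-1] = (xs[-1] - y) * sigmoid_prime(zs[-1])
    for i in range(len(weights) - 2, -1, -1):
        deltas[i] = (weights[i + 1].T @ deltas[i + 1]) * sigmoid_prime(zs[i])

    # Updates: W_i -= alpha * delta_i x_{i-1}^T and b_i -= alpha * delta_i.
    for i in range(len(weights)):
        weights[i] -= alpha * (deltas[i] @ xs[i].T)
        biases[i] -= alpha * deltas[i]
    return weights, biases

# Example with the 3-4-2 network from this lecture and placeholder data.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [rng.standard_normal((4, 1)), rng.standard_normal((2, 1))]
x0 = np.array([[0.2], [-0.5], [0.8]])
y = np.array([[1.0], [0.0]])
weights, biases = train_step(weights, biases, x0, y)
```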