Neural Networks: Mathematics of Backpropagation

Sylesh Suresh and Nikhil Sardana

October 2017

Introduction

Recall our network from the previous lecture:

[Figure: Network 2]

Through forward propagation, we calculated our prediction $x_2$ for the input $x_0$. Keep in mind, however, that this prediction is essentially a stab in the dark: we multiplied the input by randomly initialized weights and added randomly initialized biases at each layer, then calculated an output. At this stage, the output is simply a random guess and is probably very different from the target (true) output $y$. In order to make our neural network more accurate, we must change the weights and biases so that our predicted output $x_2$ gets as close to the target output $y$ as possible.

To accomplish this, we will use backpropagation. The first step is to quantify exactly how wrong our prediction is; in other words, we will use an error function. We define the error function as
$$E = \frac{1}{2}\|x_2 - y\|^2$$
This makes sense; $E$ increases as the difference between $x_2$ and $y$ increases. The goal of backpropagation is to minimize this error, which will get $x_2$ and $y$ as close together as possible, making $x_2$ a better prediction.
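As a quick illustration, here is a minimal NumPy sketch of this error function. The variable names simply mirror the lecture's notation, and the example numbers are made up:

```python
import numpy as np

def error(x2, y):
    """E = 1/2 * ||x2 - y||^2"""
    return 0.5 * np.sum((x2 - y) ** 2)

# Example: a prediction of [0.7, 0.1] against a target of [1.0, 0.0]
print(error(np.array([0.7, 0.1]), np.array([1.0, 0.0])))  # 0.05
```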

To minimize $E$, we must change the weights and biases accordingly. Unfortunately, there are many weights and biases (just look at the diagram above), so we can’t simply test many different combinations of weights and biases and pick the one that yields the least error. Instead, we can use differentiation to help us. By taking the partial derivative of the error with respect to a particular weight, we know how to change that weight in order to decrease the error: changing the weight in the direction of the negative partial derivative will decrease the error. Please see “Neural Networks: Part 2”, Sections 4.1-4.3, for a more in-depth explanation of why calculating the partial derivatives helps us minimize the error function.
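To make the idea concrete, here is a tiny one-parameter sketch that is not part of the network above: we invent a toy error $E(w) = (w - 3)^2$, whose derivative is $2(w - 3)$, and repeatedly step against that derivative.

```python
# Toy example: E(w) = (w - 3)^2, so dE/dw = 2 * (w - 3).
w = 0.0
alpha = 0.1                # learning rate
for _ in range(100):
    grad = 2 * (w - 3)     # derivative of the error at the current w
    w -= alpha * grad      # step in the direction of the negative derivative
print(w)                   # converges toward 3, the minimizer of E
```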

Non-Vectorized Backpropagation

We’ve already covered how to backpropagate in the vectorized form (Neural Networks: Part 2, Section 4.5). However, this can be confusing to many students, particularly the parts with matrix transformations and Hadamard products. To alleviate any confusion, let’s walk through backpropagation without any vectors at all.

Defining the Network

Consider the following neural network:

[Figure: the network used in this lecture, with three input nodes $n_1, n_2, n_3$, four hidden nodes $n_4, n_5, n_6, n_7$, and two output nodes $n_8, n_9$]

Let’s take the partial derivative of the error with respect to a particular weight. Our target outputs are $y_1$ and $y_2$. We define the value at each node $i$ to be $n_i$, and the bias of node $i$ to be $b_i$. Note that input nodes do not have biases; they are simply input values into the network. We can thus rewrite the error function as:

$$E = \frac{1}{2}\left((n_8 - y_1)^2 + (n_9 - y_2)^2\right)$$

A Simple Example

Let’s calculate $\frac{\partial E}{\partial w_{48}}$. By the chain rule:
$$\frac{\partial E}{\partial w_{48}} = (n_8 - y_1)\frac{\partial (n_8 - y_1)}{\partial w_{48}} + (n_9 - y_2)\frac{\partial (n_9 - y_2)}{\partial w_{48}}$$

We know:
$$n_8 = \sigma(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)$$
$$n_9 = \sigma(w_{49}n_4 + w_{59}n_5 + w_{69}n_6 + w_{79}n_7 + b_9)$$

Neither $n_9$ nor $y_2$, a part of the target output, depends on $w_{48}$, so the equation becomes:

$$\frac{\partial E}{\partial w_{48}} = (n_8 - y_1)\frac{\partial (n_8 - y_1)}{\partial w_{48}}$$

Let’s look at $\frac{\partial (n_8 - y_1)}{\partial w_{48}}$. $y_1$ does not depend on $w_{48}$. Looking at $n_8$, we see that the only term in the sigmoid function that depends on $w_{48}$ is $w_{48}n_4$. Thus:
$$\frac{\partial (n_8 - y_1)}{\partial w_{48}} = \frac{\partial\,\sigma(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)}{\partial w_{48}}$$
$$\frac{\partial (n_8 - y_1)}{\partial w_{48}} = \sigma'(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)\,\frac{\partial (w_{48}n_4)}{\partial w_{48}}$$

We know: $n_4 = \sigma(w_{14}n_1 + w_{24}n_2 + w_{34}n_3 + b_4)$.

This equation does not involve $w_{48}$, so we get:

$$\frac{\partial (n_8 - y_1)}{\partial w_{48}} = \sigma'(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)\, n_4$$

Substituting back in, we finally arrive at:

$$\frac{\partial E}{\partial w_{48}} = (n_8 - y_1)\,\sigma'(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)\, n_4$$

To shorten this expression, we’ll refer to $\sigma'(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)$ as $\sigma_8'$:
$$\frac{\partial E}{\partial w_{48}} = (n_8 - y_1)\,\sigma_8'\, n_4$$
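As a sanity check, we can compare this formula against a numerical derivative. The sketch below uses made-up values for the hidden activations $n_4, \dots, n_7$, the weights into node 8, and the target $y_1$ (none of these numbers come from the lecture); since $n_9$ does not depend on $w_{48}$, its term is omitted from $E$.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical values (the hidden activations n4..n7 do not depend on w48).
n4, n5, n6, n7 = 0.5, 0.1, 0.9, 0.3
w48, w58, w68, w78, b8 = 0.2, -0.4, 0.6, 0.1, 0.05
y1 = 1.0

def n8(w):
    # n8 = sigma(w48*n4 + w58*n5 + w68*n6 + w78*n7 + b8), as a function of w48
    return sigmoid(w * n4 + w58 * n5 + w68 * n6 + w78 * n7 + b8)

def E(w):
    # only the (n8 - y1)^2 term of the error depends on w48
    return 0.5 * (n8(w) - y1) ** 2

s8 = n8(w48)
sigma8_prime = s8 * (1 - s8)              # sigma'_8, using sigma' = sigma*(1 - sigma)
analytic = (s8 - y1) * sigma8_prime * n4  # the formula derived above

eps = 1e-6
numeric = (E(w48 + eps) - E(w48 - eps)) / (2 * eps)
print(analytic, numeric)                  # the two values should agree closely
```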

Updating Weights

A network learns by changing its weights to minimize the error. Now that we’ve figured out the partial derivative of the error with respect to $w_{48}$, we simply update $w_{48}$ according to the following rule:

$$w_{48} = w_{48} - \alpha\frac{\partial E}{\partial w_{48}}$$

where $\alpha$ is the learning rate, a small constant that controls the step size.
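A minimal sketch of this update, using made-up numbers for the current weight and its partial derivative:

```python
alpha = 0.1                      # learning rate
w48, dE_dw48 = 0.2, 0.035        # hypothetical current weight and gradient
w48 = w48 - alpha * dE_dw48      # step against the gradient
print(w48)                       # 0.1965
```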

A Complex Example

Let’s begin calculating $\frac{\partial E}{\partial w_{14}}$. We know that $y_1$ and $y_2$ are just constants. We also know

$$n_8 = \sigma(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)$$
$$n_9 = \sigma(w_{49}n_4 + w_{59}n_5 + w_{69}n_6 + w_{79}n_7 + b_9)$$
$$n_4 = \sigma(w_{14}n_1 + w_{24}n_2 + w_{34}n_3 + b_4)$$

Nowhere else does $w_{14}$ appear in the network. Let’s begin chain-ruling.

$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\frac{\partial (n_8 - y_1)}{\partial w_{14}} + (n_9 - y_2)\frac{\partial (n_9 - y_2)}{\partial w_{14}}$$

Calculations

The process for calculating each of these partial derivatives is extraordinarily similar. Let’s just focus on the first term:
$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\frac{\partial (n_8 - y_1)}{\partial w_{14}}$$
(For now we will write $\frac{\partial E}{\partial w_{14}}$ for just this first term; at the end, we repeat the same process for the second term and add the two together.)

$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\frac{\partial n_8}{\partial w_{14}}$$
$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\frac{\partial\,\sigma(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)}{\partial w_{14}}$$
$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\,\sigma'(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)\,\frac{\partial (w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)}{\partial w_{14}}$$

To make our expression shorter, let’s again write $\sigma'(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)$ as $\sigma_8'$. We can rewrite what we have so far as:
$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\,\sigma_8'\,\frac{\partial (w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)}{\partial w_{14}}$$
Let’s continue. Remember, only $n_4$ depends on $w_{14}$.

$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\,\sigma_8'\, w_{48}\,\frac{\partial n_4}{\partial w_{14}}$$
$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\,\sigma_8'\, w_{48}\,\frac{\partial\,\sigma(w_{14}n_1 + w_{24}n_2 + w_{34}n_3 + b_4)}{\partial w_{14}}$$
$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\,\sigma_8'\, w_{48}\,\sigma'(w_{14}n_1 + w_{24}n_2 + w_{34}n_3 + b_4)\,\frac{\partial (w_{14}n_1 + w_{24}n_2 + w_{34}n_3 + b_4)}{\partial w_{14}}$$

Again, let’s just write $\sigma'(w_{14}n_1 + w_{24}n_2 + w_{34}n_3 + b_4)$ as $\sigma_4'$. We can rewrite what we have so far as:

$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\,\sigma_8'\, w_{48}\,\sigma_4'\,\frac{\partial (w_{14}n_1 + w_{24}n_2 + w_{34}n_3 + b_4)}{\partial w_{14}}$$
$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\,\sigma_8'\, w_{48}\,\sigma_4'\,\frac{\partial (w_{14}n_1)}{\partial w_{14}}$$
$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\,\sigma_8'\, w_{48}\,\sigma_4'\, n_1$$

The Final Expression

Let’s add back in the second partial. Repeating the calculations, we finally achieve:
$$\frac{\partial E}{\partial w_{14}} = (n_8 - y_1)\,\sigma_8'\, w_{48}\,\sigma_4'\, n_1 + (n_9 - y_2)\,\sigma_9'\, w_{49}\,\sigma_4'\, n_1$$
$$\frac{\partial E}{\partial w_{14}} = \left((n_8 - y_1)\,\sigma_8'\, w_{48} + (n_9 - y_2)\,\sigma_9'\, w_{49}\right)\sigma_4'\, n_1$$

Of course, in these equations, $\sigma_9' = \sigma'(w_{49}n_4 + w_{59}n_5 + w_{69}n_6 + w_{79}n_7 + b_9)$.
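Again, a numerical check can build confidence in the result. The sketch below runs a full forward pass with made-up inputs, weights, and targets (none of the numbers are from the lecture) and compares the expression above to a centered finite difference.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical inputs, weights, and targets for the 3-4-2 network.
rng = np.random.default_rng(0)
n1, n2, n3 = 0.2, 0.7, -0.4
w14, w24, w34, b4 = 0.3, -0.1, 0.5, 0.05
w_567 = rng.normal(size=(3, 3))      # weights into n5, n6, n7 (one row each)
b_567 = rng.normal(size=3)           # biases of n5, n6, n7
w48, w58, w68, w78, b8 = 0.4, 0.1, -0.3, 0.2, 0.0
w49, w59, w69, w79, b9 = -0.2, 0.6, 0.1, -0.5, 0.1
y1, y2 = 1.0, 0.0

def forward(w14):
    n4 = sigmoid(w14 * n1 + w24 * n2 + w34 * n3 + b4)
    n5, n6, n7 = sigmoid(w_567 @ np.array([n1, n2, n3]) + b_567)
    n8 = sigmoid(w48 * n4 + w58 * n5 + w68 * n6 + w78 * n7 + b8)
    n9 = sigmoid(w49 * n4 + w59 * n5 + w69 * n6 + w79 * n7 + b9)
    return n4, n8, n9

def E(w14):
    _, n8, n9 = forward(w14)
    return 0.5 * ((n8 - y1) ** 2 + (n9 - y2) ** 2)

n4, n8, n9 = forward(w14)
sp = lambda s: s * (1 - s)           # sigma'(z) expressed in terms of s = sigma(z)
analytic = ((n8 - y1) * sp(n8) * w48 + (n9 - y2) * sp(n9) * w49) * sp(n4) * n1

eps = 1e-6
numeric = (E(w14 + eps) - E(w14 - eps)) / (2 * eps)
print(analytic, numeric)             # the two values should agree closely
```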

Biases

Let’s calculate the partial derivative of the error with respect to $b_4$. As with the weights, let’s start chain-ruling:
$$\frac{\partial E}{\partial b_4} = (n_8 - y_1)\frac{\partial (n_8 - y_1)}{\partial b_4} + (n_9 - y_2)\frac{\partial (n_9 - y_2)}{\partial b_4}$$

Recall that:
$$n_8 = \sigma(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)$$
$$n_9 = \sigma(w_{49}n_4 + w_{59}n_5 + w_{69}n_6 + w_{79}n_7 + b_9)$$
$$n_4 = \sigma(w_{14}n_1 + w_{24}n_2 + w_{34}n_3 + b_4)$$

Let’s look at $\frac{\partial (n_8 - y_1)}{\partial b_4}$. Looking at $n_8$, we see that the only term in the sigmoid function that relies on $b_4$ is $w_{48}n_4$, and $w_{48}$ is constant with respect to $b_4$. Armed with this knowledge, we start chain-ruling:

$$\frac{\partial (n_8 - y_1)}{\partial b_4} = \sigma'(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)\, w_{48}\,\frac{\partial n_4}{\partial b_4}$$

Let’s break down $\frac{\partial n_4}{\partial b_4}$. Looking at $n_4$, we see that the only term inside the sigmoid function that relies on $b_4$ is $b_4$ itself, whose derivative is just 1. Thus:

$$\frac{\partial n_4}{\partial b_4} = \sigma'(w_{14}n_1 + w_{24}n_2 + w_{34}n_3 + b_4)$$

Substituting and using our previous shorthand notation, we find:
$$\frac{\partial (n_8 - y_1)}{\partial b_4} = \sigma_8'\,\sigma_4'\, w_{48}$$

We can use the same process to find that:
$$\frac{\partial (n_9 - y_2)}{\partial b_4} = \sigma_9'\,\sigma_4'\, w_{49}$$

Finally, we get:
$$\frac{\partial E}{\partial b_4} = \sigma_4'\left((n_8 - y_1)\,\sigma_8'\, w_{48} + (n_9 - y_2)\,\sigma_9'\, w_{49}\right)$$

The process is simpler for the third layer. Take $b_8$, for example. We start the same way:
$$\frac{\partial E}{\partial b_8} = (n_8 - y_1)\frac{\partial (n_8 - y_1)}{\partial b_8} + (n_9 - y_2)\frac{\partial (n_9 - y_2)}{\partial b_8}$$

$n_9$ does not depend on $b_8$, so the second term evaluates to 0 and goes away. For the first term, $y_1$ is again constant, so we find that:
$$\frac{\partial E}{\partial b_8} = (n_8 - y_1)\frac{\partial n_8}{\partial b_8}$$
The only term in the sigmoid function in the expression for $n_8$ that depends on $b_8$ is $b_8$ itself, whose derivative is just 1. Thus:
$$\frac{\partial E}{\partial b_8} = (n_8 - y_1)\,\sigma_8'$$
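This bias formula can be checked numerically in the same way as the weight formulas. A small sketch with made-up values around node 8 only, since nothing else appears in $\frac{\partial E}{\partial b_8}$:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical values for the layer feeding node 8.
n4, n5, n6, n7 = 0.5, 0.1, 0.9, 0.3
w48, w58, w68, w78, b8 = 0.2, -0.4, 0.6, 0.1, 0.05
y1 = 1.0

def E(b):
    n8 = sigmoid(w48 * n4 + w58 * n5 + w68 * n6 + w78 * n7 + b)
    return 0.5 * (n8 - y1) ** 2       # the n9 term does not depend on b8

n8 = sigmoid(w48 * n4 + w58 * n5 + w68 * n6 + w78 * n7 + b8)
analytic = (n8 - y1) * n8 * (1 - n8)  # (n8 - y1) * sigma'_8

eps = 1e-6
numeric = (E(b8 + eps) - E(b8 - eps)) / (2 * eps)
print(analytic, numeric)              # should agree closely
```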

A Matrix Representation

Let’s first define the vectorized network (we’ve done this in “Neural Networks: Part 2” already). In this case:

[Figure: Network 2]

$$W_1 = \begin{bmatrix} w_{14} & w_{24} & w_{34}\\ w_{15} & w_{25} & w_{35}\\ w_{16} & w_{26} & w_{36}\\ w_{17} & w_{27} & w_{37}\end{bmatrix}\qquad W_2 = \begin{bmatrix} w_{48} & w_{58} & w_{68} & w_{78}\\ w_{49} & w_{59} & w_{69} & w_{79}\end{bmatrix}$$
$$x_0 = \begin{bmatrix} n_1\\ n_2\\ n_3\end{bmatrix}\qquad x_1 = \begin{bmatrix} n_4\\ n_5\\ n_6\\ n_7\end{bmatrix}\qquad x_2 = \begin{bmatrix} n_8\\ n_9\end{bmatrix}$$
$$b_1 = \begin{bmatrix} b_4\\ b_5\\ b_6\\ b_7\end{bmatrix}\qquad b_2 = \begin{bmatrix} b_8\\ b_9\end{bmatrix}\qquad y = \begin{bmatrix} y_1\\ y_2\end{bmatrix}$$

And from vectorized forward propagation, we know
$$x_1 = \sigma(W_1x_0 + b_1)$$
$$x_2 = \sigma(W_2x_1 + b_2)$$

Note that
$$W_1x_0 + b_1 = \begin{bmatrix} w_{14} & w_{24} & w_{34}\\ w_{15} & w_{25} & w_{35}\\ w_{16} & w_{26} & w_{36}\\ w_{17} & w_{27} & w_{37}\end{bmatrix}\begin{bmatrix} n_1\\ n_2\\ n_3\end{bmatrix} + \begin{bmatrix} b_4\\ b_5\\ b_6\\ b_7\end{bmatrix} = \begin{bmatrix} w_{14}n_1 + w_{24}n_2 + w_{34}n_3 + b_4\\ w_{15}n_1 + w_{25}n_2 + w_{35}n_3 + b_5\\ w_{16}n_1 + w_{26}n_2 + w_{36}n_3 + b_6\\ w_{17}n_1 + w_{27}n_2 + w_{37}n_3 + b_7\end{bmatrix}$$
$$W_2x_1 + b_2 = \begin{bmatrix} w_{48} & w_{58} & w_{68} & w_{78}\\ w_{49} & w_{59} & w_{69} & w_{79}\end{bmatrix}\begin{bmatrix} n_4\\ n_5\\ n_6\\ n_7\end{bmatrix} + \begin{bmatrix} b_8\\ b_9\end{bmatrix} = \begin{bmatrix} w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8\\ w_{49}n_4 + w_{59}n_5 + w_{69}n_6 + w_{79}n_7 + b_9\end{bmatrix}$$
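As a quick check on shapes, here is a minimal NumPy sketch of this vectorized forward pass. The parameters are random placeholders (not values from the lecture), shaped to match the definitions above:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))   # layer 1 -> 2
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))   # layer 2 -> 3

x0 = np.array([[0.1], [0.5], [-0.2]])    # column vector of inputs n1, n2, n3
x1 = sigmoid(W1 @ x0 + b1)               # hidden activations n4..n7, shape (4, 1)
x2 = sigmoid(W2 @ x1 + b2)               # outputs n8, n9, shape (2, 1)
print(x2)
```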

The First Matrix

We can repeat the process outlined above in “A Simple Example” to calculate the partial derivative of the error function with respect to each weight. The final results for the weights between the second and third layers of nodes follow a similar pattern:

$$\frac{\partial E}{\partial w_{48}} = (n_8 - y_1)\,\sigma_8'\, n_4 \qquad \frac{\partial E}{\partial w_{49}} = (n_9 - y_2)\,\sigma_9'\, n_4$$

$$\frac{\partial E}{\partial w_{58}} = (n_8 - y_1)\,\sigma_8'\, n_5 \qquad \frac{\partial E}{\partial w_{59}} = (n_9 - y_2)\,\sigma_9'\, n_5$$

$$\frac{\partial E}{\partial w_{68}} = (n_8 - y_1)\,\sigma_8'\, n_6 \qquad \frac{\partial E}{\partial w_{69}} = (n_9 - y_2)\,\sigma_9'\, n_6$$

$$\frac{\partial E}{\partial w_{78}} = (n_8 - y_1)\,\sigma_8'\, n_7 \qquad \frac{\partial E}{\partial w_{79}} = (n_9 - y_2)\,\sigma_9'\, n_7$$

Hmmmm... these all look similar! I bet we could rearrange them to create a more compact matrix representation, which we will call $\frac{\partial E}{\partial W_2}$. We wish to simplify weight updating to just

$$W_2 = W_2 - \alpha\frac{\partial E}{\partial W_2}$$
So $\frac{\partial E}{\partial W_2}$ must be the same shape as $W_2$, which is $2\times 4$.

$$\frac{\partial E}{\partial W_2} = \begin{bmatrix}\frac{\partial E}{\partial w_{48}} & \frac{\partial E}{\partial w_{58}} & \frac{\partial E}{\partial w_{68}} & \frac{\partial E}{\partial w_{78}}\\ \frac{\partial E}{\partial w_{49}} & \frac{\partial E}{\partial w_{59}} & \frac{\partial E}{\partial w_{69}} & \frac{\partial E}{\partial w_{79}}\end{bmatrix}$$

Note that:
$$\begin{bmatrix}(n_8 - y_1)\,\sigma_8'\\ (n_9 - y_2)\,\sigma_9'\end{bmatrix}\begin{bmatrix} n_4 & n_5 & n_6 & n_7\end{bmatrix} = \begin{bmatrix}\frac{\partial E}{\partial w_{48}} & \frac{\partial E}{\partial w_{58}} & \frac{\partial E}{\partial w_{68}} & \frac{\partial E}{\partial w_{78}}\\ \frac{\partial E}{\partial w_{49}} & \frac{\partial E}{\partial w_{59}} & \frac{\partial E}{\partial w_{69}} & \frac{\partial E}{\partial w_{79}}\end{bmatrix}$$

We can also write this as:
$$\left(\begin{bmatrix} n_8 - y_1\\ n_9 - y_2\end{bmatrix}\circ\begin{bmatrix}\sigma_8'\\ \sigma_9'\end{bmatrix}\right)\begin{bmatrix} n_4 & n_5 & n_6 & n_7\end{bmatrix} = \begin{bmatrix}\frac{\partial E}{\partial w_{48}} & \frac{\partial E}{\partial w_{58}} & \frac{\partial E}{\partial w_{68}} & \frac{\partial E}{\partial w_{78}}\\ \frac{\partial E}{\partial w_{49}} & \frac{\partial E}{\partial w_{59}} & \frac{\partial E}{\partial w_{69}} & \frac{\partial E}{\partial w_{79}}\end{bmatrix}$$
where $\circ$ is the Hadamard product, or element-wise multiplication. Let’s vectorize this more:
$$\begin{bmatrix}\sigma_8'\\ \sigma_9'\end{bmatrix} = \begin{bmatrix}\sigma'(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)\\ \sigma'(w_{49}n_4 + w_{59}n_5 + w_{69}n_6 + w_{79}n_7 + b_9)\end{bmatrix} = \sigma'(W_2x_1 + b_2)$$
$$\begin{bmatrix} n_8 - y_1\\ n_9 - y_2\end{bmatrix} = \begin{bmatrix} n_8\\ n_9\end{bmatrix} - \begin{bmatrix} y_1\\ y_2\end{bmatrix} = x_2 - y$$
$$\begin{bmatrix} n_4 & n_5 & n_6 & n_7\end{bmatrix} = x_1^T$$

Thus, we get
$$\frac{\partial E}{\partial W_2} = \left[(x_2 - y)\circ\sigma'(W_2x_1 + b_2)\right]x_1^T$$

For the sake of simplicity, we define
$$\delta_2 = (x_2 - y)\circ\sigma'(W_2x_1 + b_2)$$

Finally, we reach our old vectorized backpropagation formula:
$$\frac{\partial E}{\partial W_2} = \delta_2 x_1^T$$
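Here is a hedged, self-contained sketch of this formula; the forward pass is repeated with the same placeholder numbers as before so the snippet runs on its own:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))
x0 = np.array([[0.1], [0.5], [-0.2]])
y = np.array([[1.0], [0.0]])

# Forward pass
x1 = sigmoid(W1 @ x0 + b1)
x2 = sigmoid(W2 @ x1 + b2)

# delta_2 = (x2 - y) o sigma'(W2 x1 + b2); since x2 = sigma(W2 x1 + b2),
# the derivative is x2 * (1 - x2) elementwise.
delta2 = (x2 - y) * x2 * (1 - x2)
dE_dW2 = delta2 @ x1.T          # shape (2, 4), matching W2
print(dE_dW2.shape)             # (2, 4)
```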

The Second Matrix

We can repeat the process outlined above in “A Complex Example” to calculate the partial derivative of the error function with respect to each weight. The final results for the weights between the first and second layers of nodes follow a similar pattern:

$$\frac{\partial E}{\partial w_{14}} = \left((n_8 - y_1)\,\sigma_8'\, w_{48} + (n_9 - y_2)\,\sigma_9'\, w_{49}\right)\sigma_4'\, n_1$$
$$\frac{\partial E}{\partial w_{15}} = \left((n_8 - y_1)\,\sigma_8'\, w_{58} + (n_9 - y_2)\,\sigma_9'\, w_{59}\right)\sigma_5'\, n_1$$
$$\frac{\partial E}{\partial w_{16}} = \left((n_8 - y_1)\,\sigma_8'\, w_{68} + (n_9 - y_2)\,\sigma_9'\, w_{69}\right)\sigma_6'\, n_1$$
$$\frac{\partial E}{\partial w_{17}} = \left((n_8 - y_1)\,\sigma_8'\, w_{78} + (n_9 - y_2)\,\sigma_9'\, w_{79}\right)\sigma_7'\, n_1$$
$$\frac{\partial E}{\partial w_{24}} = \left((n_8 - y_1)\,\sigma_8'\, w_{48} + (n_9 - y_2)\,\sigma_9'\, w_{49}\right)\sigma_4'\, n_2$$
$$\frac{\partial E}{\partial w_{25}} = \left((n_8 - y_1)\,\sigma_8'\, w_{58} + (n_9 - y_2)\,\sigma_9'\, w_{59}\right)\sigma_5'\, n_2$$
$$\frac{\partial E}{\partial w_{26}} = \left((n_8 - y_1)\,\sigma_8'\, w_{68} + (n_9 - y_2)\,\sigma_9'\, w_{69}\right)\sigma_6'\, n_2$$
$$\frac{\partial E}{\partial w_{27}} = \left((n_8 - y_1)\,\sigma_8'\, w_{78} + (n_9 - y_2)\,\sigma_9'\, w_{79}\right)\sigma_7'\, n_2$$
$$\frac{\partial E}{\partial w_{34}} = \left((n_8 - y_1)\,\sigma_8'\, w_{48} + (n_9 - y_2)\,\sigma_9'\, w_{49}\right)\sigma_4'\, n_3$$
$$\frac{\partial E}{\partial w_{35}} = \left((n_8 - y_1)\,\sigma_8'\, w_{58} + (n_9 - y_2)\,\sigma_9'\, w_{59}\right)\sigma_5'\, n_3$$
$$\frac{\partial E}{\partial w_{36}} = \left((n_8 - y_1)\,\sigma_8'\, w_{68} + (n_9 - y_2)\,\sigma_9'\, w_{69}\right)\sigma_6'\, n_3$$
$$\frac{\partial E}{\partial w_{37}} = \left((n_8 - y_1)\,\sigma_8'\, w_{78} + (n_9 - y_2)\,\sigma_9'\, w_{79}\right)\sigma_7'\, n_3$$

To update each respective weight, we would subtract these partials (multiplied by $\alpha$) from their respective weights, just as we did for $w_{48}$ above.

Our matrix representation needs to look like this:

$$\frac{\partial E}{\partial W_1} = \begin{bmatrix}\frac{\partial E}{\partial w_{14}} & \frac{\partial E}{\partial w_{24}} & \frac{\partial E}{\partial w_{34}}\\ \frac{\partial E}{\partial w_{15}} & \frac{\partial E}{\partial w_{25}} & \frac{\partial E}{\partial w_{35}}\\ \frac{\partial E}{\partial w_{16}} & \frac{\partial E}{\partial w_{26}} & \frac{\partial E}{\partial w_{36}}\\ \frac{\partial E}{\partial w_{17}} & \frac{\partial E}{\partial w_{27}} & \frac{\partial E}{\partial w_{37}}\end{bmatrix}$$

We notice that:
$$\begin{bmatrix} w_{48} & w_{49}\\ w_{58} & w_{59}\\ w_{68} & w_{69}\\ w_{78} & w_{79}\end{bmatrix}\begin{bmatrix}\sigma_8'(n_8 - y_1)\\ \sigma_9'(n_9 - y_2)\end{bmatrix} = \begin{bmatrix} w_{48}\sigma_8'(n_8 - y_1) + w_{49}\sigma_9'(n_9 - y_2)\\ w_{58}\sigma_8'(n_8 - y_1) + w_{59}\sigma_9'(n_9 - y_2)\\ w_{68}\sigma_8'(n_8 - y_1) + w_{69}\sigma_9'(n_9 - y_2)\\ w_{78}\sigma_8'(n_8 - y_1) + w_{79}\sigma_9'(n_9 - y_2)\end{bmatrix}$$
which can be vectorized and simplified as
$$\begin{bmatrix} w_{48} & w_{49}\\ w_{58} & w_{59}\\ w_{68} & w_{69}\\ w_{78} & w_{79}\end{bmatrix}\left(\begin{bmatrix}\sigma_8'\\ \sigma_9'\end{bmatrix}\circ\begin{bmatrix} n_8 - y_1\\ n_9 - y_2\end{bmatrix}\right) = W_2^T\left[\sigma'(W_2x_1 + b_2)\circ(x_2 - y)\right] = W_2^T\delta_2$$

This $\delta_2$ is the same as the one we defined when calculating the previous vectorized representation. Now, this is fine and dandy, but we’ve so far ignored the $\sigma_4'n_1$, $\sigma_5'n_1$, etc., sitting outside the parentheses of our partial derivative calculations. Let’s work those in.
$$\left(\begin{bmatrix} w_{48}\sigma_8'(n_8 - y_1) + w_{49}\sigma_9'(n_9 - y_2)\\ w_{58}\sigma_8'(n_8 - y_1) + w_{59}\sigma_9'(n_9 - y_2)\\ w_{68}\sigma_8'(n_8 - y_1) + w_{69}\sigma_9'(n_9 - y_2)\\ w_{78}\sigma_8'(n_8 - y_1) + w_{79}\sigma_9'(n_9 - y_2)\end{bmatrix}\circ\begin{bmatrix}\sigma_4'\\ \sigma_5'\\ \sigma_6'\\ \sigma_7'\end{bmatrix}\right)\begin{bmatrix} n_1 & n_2 & n_3\end{bmatrix} = \begin{bmatrix}\frac{\partial E}{\partial w_{14}} & \frac{\partial E}{\partial w_{24}} & \frac{\partial E}{\partial w_{34}}\\ \frac{\partial E}{\partial w_{15}} & \frac{\partial E}{\partial w_{25}} & \frac{\partial E}{\partial w_{35}}\\ \frac{\partial E}{\partial w_{16}} & \frac{\partial E}{\partial w_{26}} & \frac{\partial E}{\partial w_{36}}\\ \frac{\partial E}{\partial w_{17}} & \frac{\partial E}{\partial w_{27}} & \frac{\partial E}{\partial w_{37}}\end{bmatrix} = \frac{\partial E}{\partial W_1}$$
Each entry of the resulting matrix is exactly one of the twelve partial derivatives listed above.

Note that
$$\begin{bmatrix}\sigma_4'\\ \sigma_5'\\ \sigma_6'\\ \sigma_7'\end{bmatrix} = \sigma'\left(\begin{bmatrix} w_{14}n_1 + w_{24}n_2 + w_{34}n_3 + b_4\\ w_{15}n_1 + w_{25}n_2 + w_{35}n_3 + b_5\\ w_{16}n_1 + w_{26}n_2 + w_{36}n_3 + b_6\\ w_{17}n_1 + w_{27}n_2 + w_{37}n_3 + b_7\end{bmatrix}\right) = \sigma'(W_1x_0 + b_1)$$
$$\begin{bmatrix} n_1 & n_2 & n_3\end{bmatrix} = x_0^T$$

Thus, we get
$$\frac{\partial E}{\partial W_1} = \left[W_2^T\delta_2\circ\sigma'(W_1x_0 + b_1)\right]x_0^T$$

We denote
$$\delta_1 = W_2^T\delta_2\circ\sigma'(W_1x_0 + b_1)$$
yielding
$$\frac{\partial E}{\partial W_1} = \delta_1 x_0^T$$

The Bias Vectors

For the second layer, the bias partials are:
$$\frac{\partial E}{\partial b_4} = \sigma_4'\left((n_8 - y_1)\,\sigma_8'\, w_{48} + (n_9 - y_2)\,\sigma_9'\, w_{49}\right)$$
$$\frac{\partial E}{\partial b_5} = \sigma_5'\left((n_8 - y_1)\,\sigma_8'\, w_{58} + (n_9 - y_2)\,\sigma_9'\, w_{59}\right)$$
$$\frac{\partial E}{\partial b_6} = \sigma_6'\left((n_8 - y_1)\,\sigma_8'\, w_{68} + (n_9 - y_2)\,\sigma_9'\, w_{69}\right)$$
$$\frac{\partial E}{\partial b_7} = \sigma_7'\left((n_8 - y_1)\,\sigma_8'\, w_{78} + (n_9 - y_2)\,\sigma_9'\, w_{79}\right)$$
For the third layer, the bias partials are:
$$\frac{\partial E}{\partial b_8} = (n_8 - y_1)\,\sigma_8' \qquad \frac{\partial E}{\partial b_9} = (n_9 - y_2)\,\sigma_9'$$

We wish to have a simple way to express how to update the bias matrix, like so:

$$b_i = b_i - \alpha\frac{\partial E}{\partial b_i}$$

where $b_i$ is a bias vector. In other words, we need to calculate $\frac{\partial E}{\partial b_i}$. Let’s look at the second layer first. $b_1$ is a $4\times 1$ matrix, so $\frac{\partial E}{\partial b_1}$ must have the same dimensions. Without vectors, our update for a single bias is:

$$b_j = b_j - \alpha\frac{\partial E}{\partial b_j}$$

where $b_j$ is the bias of a single node, i.e. an element of a bias vector. Looking at the partials for the biases of the second layer, we can simply stack them into a $4\times 1$ matrix like so:

$$\frac{\partial E}{\partial b_1} = \begin{bmatrix}\frac{\partial E}{\partial b_4}\\ \frac{\partial E}{\partial b_5}\\ \frac{\partial E}{\partial b_6}\\ \frac{\partial E}{\partial b_7}\end{bmatrix}$$

Similarly, for the second bias matrix:

$$\frac{\partial E}{\partial b_2} = \begin{bmatrix}\frac{\partial E}{\partial b_8}\\ \frac{\partial E}{\partial b_9}\end{bmatrix}$$
To derive our equations for the bias update for vectorized backpropagation, let’s first look at the second bias matrix. If we write out each element, we find:
$$\frac{\partial E}{\partial b_2} = \begin{bmatrix}(n_8 - y_1)\,\sigma_8'\\ (n_9 - y_2)\,\sigma_9'\end{bmatrix} = \begin{bmatrix}(n_8 - y_1)\,\sigma'(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)\\ (n_9 - y_2)\,\sigma'(w_{49}n_4 + w_{59}n_5 + w_{69}n_6 + w_{79}n_7 + b_9)\end{bmatrix}$$
We can rewrite this as a Hadamard product:
$$\frac{\partial E}{\partial b_2} = \begin{bmatrix} n_8 - y_1\\ n_9 - y_2\end{bmatrix}\circ\begin{bmatrix}\sigma'(w_{48}n_4 + w_{58}n_5 + w_{68}n_6 + w_{78}n_7 + b_8)\\ \sigma'(w_{49}n_4 + w_{59}n_5 + w_{69}n_6 + w_{79}n_7 + b_9)\end{bmatrix}$$
The first matrix can be written in vectorized form as $x_2 - y$, and the second matrix can be written as $\sigma'(W_2x_1 + b_2)$. We can then arrive at:
$$\frac{\partial E}{\partial b_2} = (x_2 - y)\circ\sigma'(W_2x_1 + b_2)$$
Notice that this is the same as the $\delta_2$ we calculated for the weights earlier. We can write the bias formula as just:
$$\frac{\partial E}{\partial b_2} = \delta_2$$

Looking at the first bias matrix, we see that we can write it out as:
$$\frac{\partial E}{\partial b_1} = \begin{bmatrix}\sigma_4'\left((n_8 - y_1)\,\sigma_8'\, w_{48} + (n_9 - y_2)\,\sigma_9'\, w_{49}\right)\\ \sigma_5'\left((n_8 - y_1)\,\sigma_8'\, w_{58} + (n_9 - y_2)\,\sigma_9'\, w_{59}\right)\\ \sigma_6'\left((n_8 - y_1)\,\sigma_8'\, w_{68} + (n_9 - y_2)\,\sigma_9'\, w_{69}\right)\\ \sigma_7'\left((n_8 - y_1)\,\sigma_8'\, w_{78} + (n_9 - y_2)\,\sigma_9'\, w_{79}\right)\end{bmatrix}$$
Looking carefully at this matrix, we see that we can write it as a product of two matrices, one of which is $\frac{\partial E}{\partial b_2}$:
$$\frac{\partial E}{\partial b_1} = \begin{bmatrix}\sigma_4'w_{48} & \sigma_4'w_{49}\\ \sigma_5'w_{58} & \sigma_5'w_{59}\\ \sigma_6'w_{68} & \sigma_6'w_{69}\\ \sigma_7'w_{78} & \sigma_7'w_{79}\end{bmatrix}\begin{bmatrix}(n_8 - y_1)\,\sigma_8'\\ (n_9 - y_2)\,\sigma_9'\end{bmatrix}$$
The first matrix looks very similar to some combination of $W_2$ and $\sigma'(W_1x_0 + b_1)$. We can rewrite and rearrange the equation with the help of the Hadamard product:
$$\frac{\partial E}{\partial b_1} = \begin{bmatrix} w_{48} & w_{49}\\ w_{58} & w_{59}\\ w_{68} & w_{69}\\ w_{78} & w_{79}\end{bmatrix}\begin{bmatrix}(n_8 - y_1)\,\sigma_8'\\ (n_9 - y_2)\,\sigma_9'\end{bmatrix}\circ\begin{bmatrix}\sigma_4'\\ \sigma_5'\\ \sigma_6'\\ \sigma_7'\end{bmatrix}$$
This, in turn, can be written as
$$\frac{\partial E}{\partial b_1} = W_2^T\frac{\partial E}{\partial b_2}\circ\sigma'(W_1x_0 + b_1) = W_2^T\delta_2\circ\sigma'(W_1x_0 + b_1)$$

Hey… this is the same as $\delta_1$ from earlier! Now our equation is just:
$$\frac{\partial E}{\partial b_1} = \delta_1$$
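Putting the weight and bias gradients together, here is a sketch of one complete backward pass and update for this 3-4-2 network, again with placeholder numbers and with variable names that simply mirror the notation above:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))
x0 = np.array([[0.1], [0.5], [-0.2]])
y = np.array([[1.0], [0.0]])
alpha = 0.5

# Forward pass
x1 = sigmoid(W1 @ x0 + b1)
x2 = sigmoid(W2 @ x1 + b2)

# Backward pass: deltas, then gradients
delta2 = (x2 - y) * x2 * (1 - x2)         # (x2 - y) o sigma'(W2 x1 + b2)
delta1 = (W2.T @ delta2) * x1 * (1 - x1)  # W2^T delta2 o sigma'(W1 x0 + b1)

dE_dW2, dE_db2 = delta2 @ x1.T, delta2
dE_dW1, dE_db1 = delta1 @ x0.T, delta1

# Parameter updates
W2 -= alpha * dE_dW2
b2 -= alpha * dE_db2
W1 -= alpha * dE_dW1
b1 -= alpha * dE_db1
```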

A General Form

We can generalize this to any number of layers. The only difference is the delta for the last layer $L$:
$$\delta_L = (x_L - y)\circ\sigma'(W_Lx_{L-1} + b_L)$$

The delta for every other layer is:
$$\delta_i = W_{i+1}^T\delta_{i+1}\circ\sigma'(W_ix_{i-1} + b_i)$$

The gradient for each weight matrix is then calculated, and the weight matrices are updated, as follows:
$$\frac{\partial E}{\partial W_i} = \delta_i x_{i-1}^T$$

$$W_i = W_i - \alpha\frac{\partial E}{\partial W_i}$$

For biases, the rule is simpler:

$$b_i = b_i - \alpha\,\delta_i$$

That is the essence of backpropagation. Note that these formulas work for any activation function; the reason the sigmoid is often used for teaching is that its derivative is fairly straightforward:
$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$
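For completeness, here is a hedged sketch of these general rules as a small NumPy training step. The layer sizes, data, and learning rate are placeholders, and `sigmoid_prime` uses the identity above:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)                   # sigma'(x) = sigma(x) * (1 - sigma(x))

# Placeholder architecture and data: 3 inputs -> 4 hidden -> 4 hidden -> 2 outputs.
sizes = [3, 4, 4, 2]
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.normal(size=(m, 1)) for m in sizes[1:]]
x0 = np.array([[0.1], [0.5], [-0.2]])
y = np.array([[1.0], [0.0]])
alpha = 0.5

def train_step(Ws, bs, x0, y):
    # Forward pass, keeping every activation x_i and pre-activation W_i x_{i-1} + b_i.
    xs, zs = [x0], []
    for W, b in zip(Ws, bs):
        zs.append(W @ xs[-1] + b)
        xs.append(sigmoid(zs[-1]))

    # Backward pass: delta_L for the last layer, then the recursive rule.
    deltas = [None] * len(Ws)
    deltas[-1] = (xs[-1] - y) * sigmoid_prime(zs[-1])
    for i in range(len(Ws) - 2, -1, -1):
        deltas[i] = (Ws[i + 1].T @ deltas[i + 1]) * sigmoid_prime(zs[i])

    # Updates: W_i -= alpha * delta_i x_{i-1}^T and b_i -= alpha * delta_i.
    for i in range(len(Ws)):
        Ws[i] -= alpha * deltas[i] @ xs[i].T
        bs[i] -= alpha * deltas[i]

# Train on the single example; the prediction should move toward the target y.
for _ in range(500):
    train_step(Ws, bs, x0, y)

x = x0
for W, b in zip(Ws, bs):
    x = sigmoid(W @ x + b)
print(x)    # should move toward [[1.0], [0.0]]
```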
