CS760 Homework 6


1 Neural Network Family Portrait
A fixed feedforward neural network architecture induces a family of functions {fw : X → Y} indexed by the edge weights w. In this question you will randomly generate members of the family and visualize the function mapping. This helps us understand the variety of functions that neural networks can produce. For visualization, the input is x = (x1, x2) ∈ R² and the output is y ∈ R.
1. Implement a single hidden layer with 10 ReLU units. The i-th ReLU unit has weights bi, wi1, wi2 and outputs
oi = max(bi + wi1 x1 + wi2 x2, 0).
Implement an output layer with a single sigmoid unit. It has weights bo, wo1, ..., wo,10 and outputs
y = σ(bo + wo1 o1 + ... + wo,10 o10)
where σ(·) is the sigmoid function. For this question, set all b and w to 1. Show the values of o1, ..., o10, y for three input points, respectively: x = (1, 1), x = (1, −1), x = (−1, −1).
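This forward pass can be sketched in a few lines of Python (a minimal sketch assuming the convention above, with every b and w set to 1):

```python
import math

def relu(z):
    return max(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, n_hidden=10, b=1.0, w=1.0):
    """Single hidden layer of ReLU units feeding one sigmoid output unit."""
    # Each hidden unit: o_i = max(b_i + w_i1*x1 + w_i2*x2, 0); here all b, w = 1.
    o = [relu(b + w * x1 + w * x2) for _ in range(n_hidden)]
    # Output unit: y = sigmoid(b_o + sum_i w_oi * o_i), with all w_oi = 1.
    y = sigmoid(b + sum(o))
    return o, y

for x in [(1, 1), (1, -1), (-1, -1)]:
    o, y = forward(*x)
    print(x, o, y)
```

With all weights 1, x = (1, 1) gives every oi = 3 and y = σ(31) ≈ 1, while x = (−1, −1) gives every oi = 0 and y = σ(1) ≈ 0.731.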
2. Now generate a random network by drawing each and every b and w independently from a standard Gaussian N(0, 1). With this random network fw, visualize y = fw(x) for x = (x1, x2) on the 2D grid x1 ∈ [−5, 5], x2 ∈ [−5, 5]. You can do this with a 3D surface plot, a 2D contour plot, or whatever is visually clear and convenient for your programming language. If you are not familiar with visualization in your language, ask on Piazza. For example, in Matlab with meshgrid(-5:0.1:5, -5:0.1:5) and surf(X,Y,F) my random network looks like

[Figure: 3D surface plot of a random network over x1, x2 ∈ [−5, 5]; the output values range roughly from 0 to 0.05.]
Generate as many random networks as you like, but visualize the first three networks you generate (so they form an unbiased sample), then visualize three more networks that you find interesting among the ones you generate. That is a total of six plots.
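One way to generate and evaluate such a random network on the grid (a sketch using NumPy; the plotting calls are left as comments since any 3D surface or contour tool works):

```python
import numpy as np

rng = np.random.default_rng(0)       # seed is arbitrary
Wh = rng.standard_normal((10, 3))    # row i holds (b_i, w_i1, w_i2) for ReLU unit i
wo = rng.standard_normal(11)         # (b_o, w_o1, ..., w_o10) for the sigmoid unit

def f(x1, x2):
    o = np.maximum(Wh[:, 0] + Wh[:, 1] * x1 + Wh[:, 2] * x2, 0.0)
    return 1.0 / (1.0 + np.exp(-(wo[0] + wo[1:] @ o)))

xs = np.arange(-5.0, 5.0 + 1e-9, 0.1)   # grid step 0.1, as in the Matlab example
F = np.array([[f(x1, x2) for x1 in xs] for x2 in xs])

# For example, with matplotlib:
#   import matplotlib.pyplot as plt
#   X1, X2 = np.meshgrid(xs, xs)
#   plt.contourf(X1, X2, F, levels=30); plt.colorbar(); plt.show()
```

Re-drawing Wh and wo gives a new member of the family; repeat for each of the six plots.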
3. Repeat question 1 but with 5 hidden layers, each with 2 ReLU units (the first hidden layer produces outputs o1, o2, the second produces o3, o4, and so on), and the same single sigmoid output unit. This is a deeper network. For this question, set all b and w to 1. Show the values of o1, ..., o10, y for three input points, respectively: x = (1, 1), x = (1, −1), x = (−1, −1).
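A sketch of the deeper forward pass (assuming all b and w equal 1; with equal weights, the two units in each layer produce identical outputs):

```python
import math

def deep_forward(x1, x2, n_layers=5):
    """5 hidden layers of 2 ReLU units each, all b and w = 1, sigmoid output."""
    v1, v2 = x1, x2
    outs = []
    for _ in range(n_layers):
        # Both units in a layer see the same inputs with the same weights.
        o = max(1.0 + v1 + v2, 0.0)
        v1 = v2 = o
        outs += [o, o]
    y = 1.0 / (1.0 + math.exp(-(1.0 + v1 + v2)))
    return outs, y

for x in [(1, 1), (1, -1), (-1, -1)]:
    outs, y = deep_forward(*x)
    print(x, outs, y)
```

For x = (−1, −1), the first layer outputs 0 and depth then builds up o = 0, 0, 1, 1, 3, 3, 7, 7, 15, 15 with y = σ(31); for x = (1, 1) the outputs grow as 3, 3, 7, 7, 15, 15, 31, 31, 63, 63.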
4. Repeat question 2 but with the 5-hidden-layer network. For example, one of my random networks is visualized as

[Figure: 3D surface plot of a random 5-hidden-layer network over x1, x2 ∈ [−5, 5]; the output values range roughly from 0.02 to 0.16.]
Produce those six plots.
2 Back Propagation
We will build a neural network to perform binary classification. Each input item has two real-valued features
x = (x1, x2), and the class label y is either 0 or 1. Our neural network has a very simple structure:

[Figure: network diagram. Inputs x1, x2 and constant bias inputs of 1 feed the ReLU units A and B; A, B, and a bias feed the sigmoid output unit C. The edge weights are w1 (bias→A), w2 (x1→A), w3 (x2→A), w4 (bias→B), w5 (x1→B), w6 (x2→B), w7 (bias→C), w8 (A→C), w9 (B→C).]
There is one hidden layer with two hidden units A, B, and one output layer with a single output unit C. The input
layer is fully connected to the hidden layer, and the hidden layer is fully connected to the output layer. Each unit
also has a constant bias input of 1 with a corresponding weight. Units A and B are ReLU, namely
fA(z) = fB(z) = max(z, 0).
Unit C is a sigmoid:
fC(z) = σ(z) = 1 / (1 + e^(−z)).
Implement the following steps. We provide a few examples for you to debug your code.
1. This neural network is fully defined by the nine weights w1, ..., w9. This first question focuses on
predictions given fixed weights.
Remark: we will use both a single index and a double index to refer to a weight. The single index corresponds to
the figure above. The double index, on the other hand, is used to describe the algorithm and denotes the "from
→ to" nodes that the edge connects. For example, w8 is the same as wA,C, w2 is the same as wx1,A,
and w1 is the same as w1,A, where we use "1" to denote the constant bias input of one. These should be
clear from the context.
Recall that in a neural network, any unit j first collects input from the lower units:
uj = Σ_{i: i→j} wij vi
where vi is the output of lower unit i. Specifically, if i is an input unit then vi = xi; if i is the bias then
vi = 1. The unit j then passes uj through its nonlinear function fj(·) to produce its output vj:
vj = fj(uj).
Given weights w1, ..., w9 and input x1, x2, print uA, vA, uB, vB, uC, vC on the same line separated by
spaces. For example,
(weights) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 (input) 1 -1
0.00000 0.00000 0.30000 0.30000 0.97000 0.72512
(weights) 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 (input) -0.2 1.7
2.18000 2.18000 1.43000 1.43000 1.34000 0.79249
(weights) 4 3 2 1 0 -1 -2 -3 -4 (input) -4 1
-6.00000 0.00000 0.00000 0.00000 -2.00000 0.11920
Now compute uA, vA, uB, vB, uC, vC for weights 0.1 -0.2 0.3 -0.4 0.5 -0.6 0.7 -0.8 0.9 and input 1 -1.
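A sketch of this forward pass, with the single-index weights mapped as in the figure (w1, w2, w3 into A from bias, x1, x2; w4, w5, w6 into B; w7, w8, w9 into C from bias, A, B) — a mapping consistent with the debugging examples above:

```python
import math

def forward_pass(w, x1, x2):
    w1, w2, w3, w4, w5, w6, w7, w8, w9 = w
    uA = w1 + w2 * x1 + w3 * x2        # bias, x1, x2 into A
    vA = max(uA, 0.0)                  # ReLU
    uB = w4 + w5 * x1 + w6 * x2        # bias, x1, x2 into B
    vB = max(uB, 0.0)                  # ReLU
    uC = w7 + w8 * vA + w9 * vB        # bias, A, B into C
    vC = 1.0 / (1.0 + math.exp(-uC))   # sigmoid
    return uA, vA, uB, vB, uC, vC

vals = forward_pass((0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9), 1, -1)
print(" ".join("%.5f" % v for v in vals))
# 0.00000 0.00000 0.30000 0.30000 0.97000 0.72512
```

This reproduces the first debugging example; the same function answers the question's own weight/input combination.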
2. Given a training item x = (x1, x2) and its label y, the squared error made by the neural network on the item
is defined as
E = (1/2)(vC − y)².
The partial derivative with respect to the output layer variable vC is
∂E/∂vC = vC − y.
The partial derivative with respect to the intermediate variable uC is
∂E/∂uC = (∂E/∂vC) f′C(uC).
Recall f′C(uC) = σ′(uC) = σ(uC)(1 − σ(uC)) = vC(1 − vC).
Print E, ∂E/∂vC, and ∂E/∂uC. For example,
(weights) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 (input) 1 -1 (y) 1
0.03778 -0.27488 -0.05479
(weights) 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 (input) -0.2 1.7 (y) 0
0.31402 0.79249 0.13032
(weights) 4 3 2 1 0 -1 -2 -3 -4 (input) -4 1 (y) 0
0.00710 0.11920 0.01252
Now compute E, ∂E/∂vC, and ∂E/∂uC for weights 0.1 -0.2 0.3 -0.4 0.5 -0.6 0.7 -0.8 0.9, input 1 -1, and label
y = 1.
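These three quantities follow directly from vC and y, using σ′(uC) = vC(1 − vC); a minimal sketch:

```python
def output_grads(vC, y):
    E = 0.5 * (vC - y) ** 2            # squared error
    dE_dvC = vC - y
    dE_duC = dE_dvC * vC * (1.0 - vC)  # chain rule through the sigmoid
    return E, dE_dvC, dE_duC

# First debugging example: vC = 0.72512 from the forward pass, y = 1
print(output_grads(0.72512, 1))  # approximately (0.03778, -0.27488, -0.05479)
```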
3. The partial derivative with respect to a hidden layer variable vj is
∂E/∂vj = Σ_{k: j→k} wjk ∂E/∂uk.
And
∂E/∂uj = (∂E/∂vj)(∂vj/∂uj).
Recall our hidden layer units are ReLU, for which
∂vj/∂uj = ∂ max(uj, 0)/∂uj = 1 if uj ≥ 0, and 0 if uj < 0.
Note we define the derivative to be 1 when uj = 0 (look up subderivative to learn more).
Print ∂E/∂vA, ∂E/∂uA, ∂E/∂vB, and ∂E/∂uB. For example,
(weights) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 (input) 1 -1 (y) 1
-0.04383 -0.04383 -0.04931 -0.04931
(weights) 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 (input) -0.2 1.7 (y) 0
0.03910 0.03910 0.02606 0.02606
(weights) 4 3 2 1 0 -1 -2 -3 -4 (input) -4 1 (y) 0
-0.03755 0.00000 -0.05006 -0.05006
Now print ∂E/∂vA, ∂E/∂uA, ∂E/∂vB, and ∂E/∂uB for weights 0.1 -0.2 0.3 -0.4 0.5 -0.6 0.7 -0.8 0.9, input 1 -1, and label
y = 1.
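A self-contained sketch that runs the forward pass (weight layout as in the examples: w1–w3 into A, w4–w6 into B, w7–w9 into C) and backpropagates to the hidden units:

```python
import math

def hidden_grads(w, x1, x2, y):
    w1, w2, w3, w4, w5, w6, w7, w8, w9 = w
    # Forward pass
    uA = w1 + w2 * x1 + w3 * x2; vA = max(uA, 0.0)
    uB = w4 + w5 * x1 + w6 * x2; vB = max(uB, 0.0)
    uC = w7 + w8 * vA + w9 * vB
    vC = 1.0 / (1.0 + math.exp(-uC))
    # Backward pass to the hidden layer
    dE_duC = (vC - y) * vC * (1.0 - vC)
    dE_dvA = w8 * dE_duC                          # only edge out of A is A->C
    dE_duA = dE_dvA * (1.0 if uA >= 0 else 0.0)   # ReLU subderivative (1 at 0)
    dE_dvB = w9 * dE_duC
    dE_duB = dE_dvB * (1.0 if uB >= 0 else 0.0)
    return dE_dvA, dE_duA, dE_dvB, dE_duB
```

For instance, the third debugging example has uA = −6 < 0, so ∂E/∂uA = 0 even though ∂E/∂vA is nonzero.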
4. Now we can compute the partial derivatives with respect to the edge weights:
∂E/∂wij = vi ∂E/∂uj.
Print ∂E/∂w1, ..., ∂E/∂w9. For example,
(weights) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 (input) 1 -1 (y) 1
-0.04383 -0.04383 0.04383 -0.04931 -0.04931 0.04931 -0.05479 0.00000 -0.01644
(weights) 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 (input) -0.2 1.7 (y) 0
0.03910 -0.00782 0.06647 0.02606 -0.00521 0.04431 0.13032 0.28411 0.18636
(weights) 4 3 2 1 0 -1 -2 -3 -4 (input) -4 1 (y) 0
0.00000 0.00000 0.00000 -0.05006 0.20025 -0.05006 0.01252 0.00000 0.00000
Now print ∂E/∂w1, ..., ∂E/∂w9 for weights 0.1 -0.2 0.3 -0.4 0.5 -0.6 0.7 -0.8 0.9, input 1 -1, and label y = 1.
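Extending the same sketch one more step gives all nine weight gradients via ∂E/∂wij = vi ∂E/∂uj (weight layout as in the examples; v = 1 for the bias inputs):

```python
import math

def weight_grads(w, x1, x2, y):
    w1, w2, w3, w4, w5, w6, w7, w8, w9 = w
    # Forward pass
    uA = w1 + w2 * x1 + w3 * x2; vA = max(uA, 0.0)
    uB = w4 + w5 * x1 + w6 * x2; vB = max(uB, 0.0)
    uC = w7 + w8 * vA + w9 * vB
    vC = 1.0 / (1.0 + math.exp(-uC))
    # Backward pass
    dE_duC = (vC - y) * vC * (1.0 - vC)
    dE_duA = w8 * dE_duC * (1.0 if uA >= 0 else 0.0)
    dE_duB = w9 * dE_duC * (1.0 if uB >= 0 else 0.0)
    # dE/dw_ij = v_i * dE/du_j, with v_i = 1 for bias inputs
    return (1 * dE_duA, x1 * dE_duA, x2 * dE_duA,
            1 * dE_duB, x1 * dE_duB, x2 * dE_duB,
            1 * dE_duC, vA * dE_duC, vB * dE_duC)
```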
5. Now we perform one step of stochastic gradient descent. With step size η, we update the weights:
wi = wi − η ∂E/∂wi,  i = 1, ..., 9.
For weights w1, ..., w9, input x1, x2, label y, and step size η, print four lines:
(a) the old w1 ... w9
(b) the error E under the old w
(c) the updated w1 ... w9
(d) the error E after the update
For example,
(weights) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 (x) 1 -1 (y) 1 (eta) 0.1
0.10000 0.20000 0.30000 0.40000 0.50000 0.60000 0.70000 0.80000 0.90000
0.03778
0.10438 0.20438 0.29562 0.40493 0.50493 0.59507 0.70548 0.80000 0.90164
0.03617
(weights) 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 (x) -0.2 1.7 (y) 0 (eta) 0.1
1.00000 0.90000 0.80000 0.70000 0.60000 0.50000 0.40000 0.30000 0.20000
0.31402
0.99609 0.90078 0.79335 0.69739 0.60052 0.49557 0.38697 0.27159 0.18136
0.29972
(weights) 4 3 2 1 0 -1 -2 -3 -4 (x) -4 1 (y) 0 (eta) 0.1
4.00000 3.00000 2.00000 1.00000 0.00000 -1.00000 -2.00000 -3.00000 -4.00000
0.00710
4.00000 3.00000 2.00000 1.00501 -0.02002 -0.99499 -2.00125 -3.00000 -4.00000
0.00371
Print these for weights 0.1 -0.2 0.3 -0.4 0.5 -0.6 0.7 -0.8 0.9, input x = 1 -1, label y = 1, step size η = 0.1.
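One SGD step can be sketched by combining the forward pass, the weight gradients, and the update rule (helper names are my own; weight layout w1–w9 as before):

```python
import math

def forward_error(w, x1, x2, y):
    uA = w[0] + w[1] * x1 + w[2] * x2; vA = max(uA, 0.0)
    uB = w[3] + w[4] * x1 + w[5] * x2; vB = max(uB, 0.0)
    vC = 1.0 / (1.0 + math.exp(-(w[6] + w[7] * vA + w[8] * vB)))
    return vA, vB, uA, uB, vC, 0.5 * (vC - y) ** 2

def sgd_step(w, x1, x2, y, eta):
    vA, vB, uA, uB, vC, E = forward_error(w, x1, x2, y)
    dE_duC = (vC - y) * vC * (1.0 - vC)
    dE_duA = w[7] * dE_duC * (1.0 if uA >= 0 else 0.0)
    dE_duB = w[8] * dE_duC * (1.0 if uB >= 0 else 0.0)
    g = [dE_duA, x1 * dE_duA, x2 * dE_duA,
         dE_duB, x1 * dE_duB, x2 * dE_duB,
         dE_duC, vA * dE_duC, vB * dE_duC]
    return [wi - eta * gi for wi, gi in zip(w, g)]

# Reproduce the four printed lines of the first example:
w0 = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
print(" ".join("%.5f" % v for v in w0))
print("%.5f" % forward_error(w0, 1, -1, 1)[-1])
w1 = sgd_step(w0, 1, -1, 1, 0.1)
print(" ".join("%.5f" % v for v in w1))
print("%.5f" % forward_error(w1, 1, -1, 1)[-1])
```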
6. We provide a training data set data.txt where each row is a labeled item x1, x2, y. Starting from the initial
weights 0.1 -0.2 0.3 -0.4 0.5 -0.6 0.7 -0.8 0.9, with step size 0.1, run SGD for 10000 rounds: in each round,
select a training item uniformly at random from data.txt and take one gradient step on it. After every 100 rounds,
compute the training set error Σi Ei, where Ei is the error on the i-th training item. Use these to make a plot of round
vs. training set error.
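The full loop might look like the following sketch. It assumes rows of data.txt are whitespace-separated `x1 x2 y` (an assumption about the file format); a small in-memory dataset stands in below so the sketch is self-contained:

```python
import math
import random

def forward_error(w, x1, x2, y):
    uA = w[0] + w[1] * x1 + w[2] * x2; vA = max(uA, 0.0)
    uB = w[3] + w[4] * x1 + w[5] * x2; vB = max(uB, 0.0)
    vC = 1.0 / (1.0 + math.exp(-(w[6] + w[7] * vA + w[8] * vB)))
    return vA, vB, uA, uB, vC, 0.5 * (vC - y) ** 2

def sgd_step(w, x1, x2, y, eta):
    vA, vB, uA, uB, vC, E = forward_error(w, x1, x2, y)
    dE_duC = (vC - y) * vC * (1.0 - vC)
    dE_duA = w[7] * dE_duC * (1.0 if uA >= 0 else 0.0)
    dE_duB = w[8] * dE_duC * (1.0 if uB >= 0 else 0.0)
    g = [dE_duA, x1 * dE_duA, x2 * dE_duA,
         dE_duB, x1 * dE_duB, x2 * dE_duB,
         dE_duC, vA * dE_duC, vB * dE_duC]
    return [wi - eta * gi for wi, gi in zip(w, g)]

def train(data, w, eta=0.1, rounds=10000, every=100, seed=0):
    rng = random.Random(seed)
    w = list(w)
    history = []  # (round, training-set error) pairs for the plot
    for t in range(1, rounds + 1):
        x1, x2, y = rng.choice(data)     # uniform random training item
        w = sgd_step(w, x1, x2, y, eta)
        if t % every == 0:
            total = sum(forward_error(w, a, b, c)[-1] for a, b, c in data)
            history.append((t, total))
    return w, history

# With the real file: data = [tuple(map(float, line.split())) for line in open("data.txt")]
data = [(1.0, -1.0, 1.0), (-1.0, 1.0, 0.0), (0.5, 0.5, 1.0)]  # stand-in dataset
w0 = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8, 0.9]
w_final, history = train(data, w0)
```

The `history` pairs are exactly what the question asks to plot (round vs. training set error).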
