# Deep Learning for NLP WS18/19
## Exercise Sheet 3 - PyTorch Introduction
This exercise sheet is due on 19.11.19 at 11:59 pm. There is a total of 10 points for this exercise sheet. Please send your solution to [profilmodul1920@cis.uni-muenchen.de](mailto:profilmodul1920@cis.uni-muenchen.de). Please submit a completed version of this file in Python 3. You may submit in teams of 2 or 3 students.

Please rename the file to pytorch_intro_last_names.ipynb

### Installation of required packages

To install PyTorch, see <http://pytorch.org/>. Note that CUDA is required if you want to run PyTorch on a GPU. The program below doesn't require a lot of computation, so a CPU-only installation is enough.

The sklearn package can be installed with pip (pip3 for Python 3) or with the same procedure you used to install numpy for the last exercise sheet.

#### If you have any problems regarding the installation feel free to send an email to [profilmodul1819@cis.lmu.de](mailto:profilmodul1819@cis.lmu.de)

### Exercise 1
#### Linear Regression on the California Housing Dataset (5 points)

Useful PyTorch tutorials and resources are mentioned in the lecture slides.


The California Housing Dataset is often used as a linear regression example. The dataset is provided with the sklearn module.

You will have to complete the code where marked with ***TODO***

First, we import PyTorch, the sklearn datasets module, numpy, math and shuffle.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
import sklearn.datasets
import numpy as np
import math
from random import shuffle

We define a function that returns three Boolean arrays, which together describe a shuffled train/dev/test split of the dataset.

In [None]:
def random_train_dev_test_split(num_total_items, train_ratio = 0.5, dev_ratio = 0.25):
    num_train_items = math.floor(num_total_items * train_ratio)
    num_dev_items = math.floor(num_total_items * dev_ratio)
    num_test_items = num_total_items - num_train_items - num_dev_items
    split = [0] * num_train_items + [1] * num_dev_items + [2] * num_test_items
    shuffle(split)
    split = np.asarray(split)
    return split == 0, split == 1, split == 2

Now we start the main program. First, we get the dataset from the sklearn module and assign features and labels to x and y.

In [None]:
# Load the California Housing data (downloaded on first use)
housing = sklearn.datasets.fetch_california_housing()
x = housing.data
y = housing.target
num_items, num_features = x.shape

#### TODO: normalize the data so that each feature has 0 mean and unit standard deviation (1 p.)
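
For reference, this kind of standardization is commonly written as below. The symbols are introduced here for illustration only: $x_{ij}$ is the value of feature $j$ for item $i$, and $\mu_j$ and $\sigma_j$ are the mean and standard deviation of feature $j$ over all items.

$$x'_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}$$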

In [None]:
# TODO

Next, we need to get a 50%-25%-25% train/dev/test split of the shuffled(!) data (you can use the provided function random_train_dev_test_split).

If you use the predefined method, note:
 * The returned arrays contain Boolean indicators of which elements are contained in the respective sets
 * You can use Boolean indexing and Numpy to access those elements
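
As a small illustration of Boolean indexing, the sketch below applies hand-written masks to a toy array. The names `toy_x`, `toy_train_mask` and `toy_dev_mask` and all values are made up for this example and are not part of the exercise data.

```python
import numpy as np

# Toy feature matrix with 4 items and 2 features (illustrative values only).
toy_x = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0],
                  [7.0, 8.0]])

# Hand-written Boolean masks; True marks the rows that belong to each set.
toy_train_mask = np.array([True, False, True, False])
toy_dev_mask = np.array([False, True, False, False])

# Indexing with a Boolean mask keeps exactly the rows where the mask is True.
toy_train_x = toy_x[toy_train_mask]
toy_dev_x = toy_x[toy_dev_mask]

print(toy_train_x.shape)  # (2, 2)
print(toy_dev_x.shape)    # (1, 2)
```

The mask must have one entry per row of the array it indexes, which is the case for the arrays returned by random_train_dev_test_split(num_items).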

In [None]:
train_spl, dev_spl, test_spl = random_train_dev_test_split(num_items)

#### TODO: get a k x n dimensional numpy array, where k is the number of items in the training data and n the number of features. (0.5 p.)

In [None]:
train_x = # TODO

#### TODO: get a k x 1 dimensional numpy array of the training targets. Hint: use np.expand_dims(...) to get an extra dimension, i.e. a matrix instead of a vector (check the shape of train_y to see the difference). (0.5 p.)
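
To see the effect of np.expand_dims on shapes, here is a minimal sketch on a toy vector. The names `toy_targets` and `toy_column` are hypothetical and only used for this illustration.

```python
import numpy as np

toy_targets = np.array([1.5, 2.5, 3.5])           # a vector of shape (3,)
toy_column = np.expand_dims(toy_targets, axis=1)  # a matrix of shape (3, 1)

print(toy_targets.shape)  # (3,)
print(toy_column.shape)   # (3, 1)
```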


In [None]:
train_y = # TODO

#### TODO: Similarly, get the dev feature matrix with associated dev targets.

In [None]:
dev_x = # TODO
dev_y = # TODO

We create the LinearRegression class, which inherits from PyTorch's nn.Module. Information about `__init__` and `forward` can be found in the lecture slides.

In [None]:
# Define model
class LinearRegression(nn.Module):
    
    def __init__(self, num_features):
        super(LinearRegression, self).__init__()
        self.final_layer = nn.Linear(num_features, 1)
        
    def forward(self, x):
        return self.final_layer(x)

Definition of Mean Squared Error as loss criterion.

In [None]:
criterion = nn.MSELoss()
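
As a quick sanity check of what this criterion computes, the sketch below compares nn.MSELoss (with its default settings) against a manual mean of squared differences. The tensors and the `toy_` names are invented for this example.

```python
import torch
import torch.nn as nn

toy_criterion = nn.MSELoss()  # by default, averages over all elements

toy_prediction = torch.tensor([[1.0], [2.0], [4.0]])
toy_target = torch.tensor([[1.0], [4.0], [1.0]])

# The squared errors are 0, 4 and 9, so their mean is 13 / 3.
toy_loss = toy_criterion(toy_prediction, toy_target)
manual_loss = ((toy_prediction - toy_target) ** 2).mean()

print(toy_loss.item(), manual_loss.item())
```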

The np_to_torch function is a convenience function used later to convert numpy arrays to torch tensors.

In [None]:
def np_to_torch(np_array):
    return torch.from_numpy(np_array).float()
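
The call to .float() matters because numpy arrays are typically float64, while the parameters of nn.Linear are float32 by default. The sketch below repeats the helper so that it runs on its own; `toy_array` and `toy_tensor` are made-up names.

```python
import numpy as np
import torch

def np_to_torch(np_array):
    # Same helper as in the cell above, repeated so this sketch is self-contained.
    return torch.from_numpy(np_array).float()

toy_array = np.array([[0.5, 1.5], [2.5, 3.5]])  # numpy defaults to float64
toy_tensor = np_to_torch(toy_array)

print(toy_array.dtype)   # float64
print(toy_tensor.dtype)  # torch.float32
print(toy_tensor.shape)  # torch.Size([2, 2])
```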

The training iterates over the data 20 times. The gradient is backpropagated after every training example, and the average per-example training loss is printed at the end of each epoch.

#### TODO: Extend the code so that it prints the average per-example loss on the development data at the end of each epoch. (1 p.)

In [None]:
# Create model
linreg_model = LinearRegression(num_features)
optimizer = optim.SGD(linreg_model.parameters(), lr=0.0001)

#Train model
num_epochs = 20
for epoch in range(num_epochs):
    loss_accum = 0.0
    for i in range(len(train_y)):
        x_i = np_to_torch(train_x[i])
        y_i = np_to_torch(train_y[i])
        optimizer.zero_grad()   # zero the gradient buffers
        output = linreg_model(x_i)   # calling the module runs forward()
        loss = criterion(output, y_i)
        loss_accum += loss.item()
        loss.backward()
        optimizer.step()    # Does the update
    # Evaluate model
    print("train loss:", loss_accum/len(train_y))
    # TODO: Also print the dev loss

Some gradient-based optimizers work well when each update step looks at the entire data set. One of those optimizers is LBFGS.

#### TODO: Replace the SGD optimizer with the LBFGS optimizer. Check the PyTorch documentation for the optim package for more information: <http://pytorch.org/docs/master/optim.html> (2 p.)
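
The closure pattern that LBFGS expects can be sketched on a toy problem unrelated to the housing data. The example below minimizes the made-up objective $(w - 3)^2$ with default optimizer settings; all `toy_` names are hypothetical, and this is only an illustration of the calling convention, not the solution to the task.

```python
import torch
import torch.optim as optim

# Toy objective f(w) = (w - 3)^2, which has its minimum at w = 3.
toy_w = torch.zeros(1, requires_grad=True)
toy_optimizer = optim.LBFGS([toy_w])

def toy_closure():
    toy_optimizer.zero_grad()              # clear old gradients
    toy_loss = ((toy_w - 3.0) ** 2).sum()  # re-evaluate the objective
    toy_loss.backward()                    # compute new gradients
    return toy_loss                        # LBFGS needs the loss value back

# One call to step() may evaluate the closure several times internally.
toy_optimizer.step(toy_closure)

print(toy_w.item())
```

Unlike SGD, the optimizer re-evaluates the objective itself, which is why step() takes the closure as an argument.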

In [None]:
# Create new model
linreg_model = LinearRegression(num_features)
optimizer = #TODO: LBFGS instead of SGD optimizer.

def closure():
    optimizer.zero_grad()   # zero the gradient buffers
    output = linreg_model(np_to_torch(train_x))
    loss = criterion(output, np_to_torch(train_y))
    loss.backward()
    return loss

In [None]:
# Train new model
# TODO: perform the optimizer step (once).

# Evaluate new model
# TODO: Print train and dev loss.

### Exercise 2
#### Neural Network Regression on the California Housing Dataset (5 points)
Now implement the regression with a hidden layer and an activation function after that hidden layer. 

Recall that normal linear regression uses the function:

$$\hat y = \mathrm{Linear}(x) = Wx + b$$

Neural Network regression uses two (distinct) linear transformations, and a non-linear activation function in between. Using the tanh activation function, we get:

$$\hat y = \mathrm{Linear}(\mathrm{Tanh}(\mathrm{Linear}(x))) = W_B \tanh(W_A x + b_A) + b_B$$

The output of the first linear transformation (+ non-linear activation) is also called the hidden layer.
You can use, for example, 10 as its size (hidden_size).
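
To make the shapes in this formula concrete, here is a minimal numpy sketch of the forward computation with random toy parameters. It applies the formula row-wise to a batch of items. All names and sizes (`toy_x`, `toy_W_A`, 4 items, 8 features) are assumptions for illustration; it is not the PyTorch model you are asked to write.

```python
import numpy as np

rng = np.random.default_rng(0)

toy_num_items, toy_num_features, toy_hidden_size = 4, 8, 10  # toy sizes

toy_x = rng.normal(size=(toy_num_items, toy_num_features))   # one row per item
toy_W_A = rng.normal(size=(toy_hidden_size, toy_num_features))
toy_b_A = np.zeros(toy_hidden_size)
toy_W_B = rng.normal(size=(1, toy_hidden_size))
toy_b_B = np.zeros(1)

# Row-wise version of W_B tanh(W_A x + b_A) + b_B.
toy_hidden = np.tanh(toy_x @ toy_W_A.T + toy_b_A)  # shape (4, 10)
toy_y_hat = toy_hidden @ toy_W_B.T + toy_b_B       # shape (4, 1)

print(toy_hidden.shape, toy_y_hat.shape)
```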

#### TODO: You need to change the `__init__` and the `forward` methods. (2 p.)

See <https://pytorch.org/docs/stable/nn.html?#non-linear-activations-weighted-sum-nonlinearity> for a list of pre-defined activation functions

Optional: Feel free to experiment with different activation functions!

In [None]:
# Define NN model
class NeuralNetworkRegression(nn.Module):
    def __init__(self, num_features, hidden_size):
        super(NeuralNetworkRegression, self).__init__()
        self.final_layer = nn.Linear(num_features, 1) # TODO
        
    def forward(self, x):
        return self.final_layer((x)) # TODO

#### TODO: Analogous to the Linear Regression model, create a Neural Network Regression model, train it using the LBFGS optimizer and report the MSE on the train and dev set. (2 p.)

In [None]:
# Create NN model
# TODO

In [None]:
# Train NN model
# TODO

# Evaluate NN model
# TODO

#### TODO: How does the Neural Network Regression compare to Linear Regression? (1 p.)

Optional: Feel free to experiment with different optimizers for both models!