# Deep Learning for NLP WS18/19
## Exercise Sheet 7 - Keras Sentiment Prediction
This exercise sheet is due on 11.12.18 11:59 pm. There is a total of **10 points**.
Since some students wanted to try the functional API, there is an optional dot-product attention model to implement for additional bonus points. **You can get the full 10 points without the attention model.**

Please work with Python 3 and in teams of 2 or 3 students. Rename the file to keras_sentiment_last_names.ipynb and send your completed version to [anne.beyer@campus.lmu.de](mailto:anne.beyer@campus.lmu.de).

As usual, you will have to complete the code/questions marked with ***TODO***.

In this exercise, you will predict the sentiment polarity (positive or negative) of IMDB movie reviews. 
You will implement different architectures in Keras and compare their performance.

Some of these models take a long time to train. For development, you can reduce the MAXEPOCHS parameter. Remember to restore the original value before evaluation.

In [None]:
import numpy as np
np.random.seed(123)
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
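from keras.layers import *
from keras import backend as K  # used by the Attention layer below; imported explicitly instead of relying on the wildcard import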
from keras.models import Sequential, Model
from keras.callbacks import EarlyStopping
import matplotlib.pyplot as plot
import time

In [None]:
MAXLEN = 500
BATCHSIZE = 16
EMBSIZE = 50
HIDDENSIZE = 50
KERNELSIZE = 5
VOCABSIZE = 10000

MAXEPOCHS = 20
#MAXEPOCHS = 2 # uncomment during development


In [None]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(path = "imdb.npz",
                                                      num_words = VOCABSIZE,
                                                      skip_top = 0,
                                                      maxlen = MAXLEN,
                                                      start_char = 1,
                                                      oov_char = 2,
                                                      index_from = 3)

y_train = np.expand_dims(y_train, -1)
y_test = np.expand_dims(y_test, -1)

We reduce the validation set size to reduce runtime. **You should not do this in a serious experiment.**

In [None]:
x_test = x_test[:1000]
y_test = y_test[:1000]

print("# training samples: {}; # validation samples: {}".format(len(x_train), len(x_test)))

BATCHES_PER_EPOCH = len(x_train) // BATCHSIZE

If we applied pad_sequences to all of x_train and x_test at once, the resulting matrices would become very big (number of samples x length of the longest review). Instead, we will use a generator function that pads each batch of training data on the fly.
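
To make the size argument concrete, the following minimal sketch counts matrix cells for global versus per-batch padding. `pad_pre` is a hypothetical pure-Python stand-in that mimics the default behaviour of `pad_sequences` (zeros are added at the front, up to the longest sequence in the input). The two toy batches are made up and are not IMDB data.

```python
def pad_pre(seqs):
    """Left-pad every sequence with zeros to the length of the longest one."""
    width = max(len(s) for s in seqs)
    return [[0] * (width - len(s)) + list(s) for s in seqs]


def cells(matrix):
    """Number of entries in a list-of-lists matrix."""
    return sum(len(row) for row in matrix)


# Two hand-made toy batches of word-index sequences (illustrative only).
batch_a = [[5, 8], [7]]
batch_b = [[3, 9, 4, 6, 2, 11], [12, 4, 7, 9, 3]]

# Option 1: pad everything at once, so every row gets the global maximum length.
padded_all = pad_pre(batch_a + batch_b)

# Option 2: pad each batch separately, so short batches stay short.
padded_a = pad_pre(batch_a)
padded_b = pad_pre(batch_b)

cells_global = cells(padded_all)
cells_per_batch = cells(padded_a) + cells(padded_b)

print(padded_a)
print("global:", cells_global, "per batch:", cells_per_batch)
```

The saving grows with the spread of review lengths, since a single long review no longer dictates the width of every batch.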

**TODO: The following generator loops through x and y and yields tuples (x_batch, y_batch). x_batch and y_batch are slices of length BATCHSIZE from x and y. Complete the code in the places marked TODO (1 p.)**

In [None]:
def generator(x, y, return_positions = False):
    while True:
        for i in range(0, len(x), BATCHSIZE):
            x_batch = None # TODO
            y_batch = None # TODO
                
            yield pad_sequences(x_batch), y_batch

In [None]:
# generic dropout layer for use below
dropout_L = Dropout(0.25)

Here, we implement a bidirectional GRU model. You can use it as a template for the other models below.

In [None]:
def build_gru_model():
    embedding_L = Embedding(input_dim = VOCABSIZE, output_dim = EMBSIZE, mask_zero = True)
    gru_L = Bidirectional(GRU(units = HIDDENSIZE // 2))
    output_L = Dense(units = 1, activation = "sigmoid")
    return Sequential([embedding_L, dropout_L, gru_L, dropout_L, output_L])

**TODO: Implement a 1-Dimensional CNN model. It should have the following layers: 1) Word embeddings 2) Dropout 3) 1-D convolution with kernel length KERNELSIZE and #filters HIDDENSIZE 4) global maximum pooling 5) Dropout 6) a linear layer with sigmoid activation (2.5 p.)**

*Hint: Keras convolution layers do not support input masks, so set mask_zero to False in the embedding layer.*
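
As a reminder of what layers 3 and 4 compute, here is a minimal pure-Python sketch of a single convolution filter followed by global max pooling. It assumes no padding (a "valid" convolution), no bias and no activation. The function names and toy numbers are illustrative and not part of Keras.

```python
def conv1d_single_filter(embedded, kernel):
    """Slide one filter over a sequence of embedding vectors.

    embedded: list of T vectors, each of size d
    kernel:   list of k vectors, each of size d
    Returns T - k + 1 scores, one per window position.
    """
    k = len(kernel)
    scores = []
    for start in range(len(embedded) - k + 1):
        window = embedded[start:start + k]
        # Sum of element-wise products between the filter and the window.
        score = sum(w * x
                    for k_vec, x_vec in zip(kernel, window)
                    for w, x in zip(k_vec, x_vec))
        scores.append(score)
    return scores


def global_max_pool(scores):
    """Keep only the strongest response of the filter over all positions."""
    return max(scores)


embedded = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [0.0, 0.0]]  # T = 4, d = 2
kernel = [[1.0, 1.0], [0.5, -1.0]]                          # k = 2

scores = conv1d_single_filter(embedded, kernel)
pooled = global_max_pool(scores)
print(scores, pooled)
```

A layer with several filters repeats this once per filter, so after pooling each review is described by one number per filter, regardless of its length.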

In [None]:
def build_cnn_model():
    return None # TODO

On many tasks, a simple model that averages together word embeddings achieves surprisingly good performance. Therefore, it is a good idea to compare against this baseline.
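
Conceptually, such a baseline reduces a review to the mean of its word vectors, so word order is ignored. A minimal sketch with a made-up two-dimensional embedding table (the words, values and the helper `average_embedding` are hypothetical):

```python
# Hypothetical toy embedding table, not learned from data.
toy_embeddings = {
    "great": [1.0, 0.5],
    "boring": [-1.0, 0.0],
    "movie": [0.0, 0.25],
}


def average_embedding(tokens, table):
    """Average the embedding vectors of all tokens, dimension by dimension."""
    vectors = [table[t] for t in tokens]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


review_a = average_embedding(["great", "movie"], toy_embeddings)
review_b = average_embedding(["movie", "great"], toy_embeddings)

print(review_a)
print(review_a == review_b)  # same words in a different order
```

Both orderings give the same vector, which is exactly the information the GRU and CNN models can exploit and the baseline cannot.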

**TODO: Implement an embedding-only baseline. It should have the following layers: 1) Word embeddings 2) Dropout 3) global average pooling 4) Dropout 5) a linear layer with sigmoid activation (2.5 p.)**

In [None]:
def build_emb_model():
    return None # TODO
    

The following two steps are **optional**, but note that there are **two more obligatory tasks below**!

**OPTIONAL:** Complete the following dot product attention layer. (2 p.) You can get full points without this exercise.

*Hint: Attention gets a list of three tensors \[query, key, value\]. query has size (batch_size, 20), key has size (batch_size, T, 20), value has size (batch_size, T, HIDDENSIZE). Use backend functions such as K.batch_dot, K.softmax, K.sum in the call() method.*
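
For orientation, one common way to write scaled dot-product attention for a single sample is given below. The notation is an assumption of this note, not taken from the exercise code: $q$ is the query vector, $k_t$ and $v_t$ are the rows of key and value at time step $t$, and $d_k$ is the key dimension (20 here). Masking of padded positions is not covered.

$$
e_t = \frac{q^\top k_t}{\sqrt{d_k}}, \qquad
\alpha_t = \frac{\exp(e_t)}{\sum_{t'=1}^{T} \exp(e_{t'})}, \qquad
c = \sum_{t=1}^{T} \alpha_t \, v_t
$$

The division by $\sqrt{d_k}$ corresponds to the scaling line that is already present in the skeleton below.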

In [None]:
class Attention(Layer):
    def __init__(self, **kwargs):
        super(Attention, self).__init__(**kwargs)
        self.supports_masking = True
        
    def compute_output_shape(self, input_shape):
        query_shape, key_shape, value_shape = input_shape
        output_shape = None # TODO (output shape should be tuple with two entries)
        return output_shape
    
    def call(self, inputs, mask = None):
        query, key, value = inputs
        
        energy = None # TODO
        energy /= np.sqrt(K.int_shape(key)[-1]) # scale by square root of Dkey
        attention = None # TODO
        output = None # TODO
        return output    

**OPTIONAL:** Complete the following GRU+attention model. (2 p.) You can get full points without this exercise.

In [None]:
def build_gru_attn_model():
    input_L = Input((None,))
    
    embedding_L = Embedding(input_dim = VOCABSIZE, output_dim = EMBSIZE, mask_zero = True)
    
    gru_L = Bidirectional(GRU(units = HIDDENSIZE // 2, return_sequences = True))
    
    attn_L = Attention()
    output_L = Dense(units = 1, activation = "sigmoid") 
    
    key_L = TimeDistributed(Dense(units = 20))
    query_L = Dense(units = 20)
    value_L = TimeDistributed(Dense(units = HIDDENSIZE))
    
    # We will calculate our query from the last GRU hidden vector h_T.
    # However, the GRU returns the full matrix of hidden vectors H.
    # We therefore need a layer that takes H and returns h_T.
    # TODO: Implement last_vector and last_vector_shape
    
    def last_vector(x):
        return None # TODO
    
    def last_vector_shape(shape):
        return None # TODO
    
    last_vector_L = Lambda(last_vector, output_shape = last_vector_shape)
    
    H = gru_L(embedding_L(input_L))
    h_t = last_vector_L(H)
    
    query = query_L(h_t)
    keys = key_L(H)
    values = value_L(H)
    
    c_t = None # TODO
    
    output = output_L(concatenate([h_t, c_t]))
    
    model = Model(inputs = [input_L], outputs = [output]) # TODO
    return model

**TODO: Complete the compile function with the appropriate loss function, an optimizer of your choice and accuracy as a metric (1 p.)**

In [None]:
def train_model(model, x_train, y_train, x_test, y_test):
    model.compile() # TODO
    
    earlystop = EarlyStopping(monitor = "val_acc", patience = 7)
    
    history = model.fit_generator(generator(x_train, y_train),
                                  steps_per_epoch = BATCHES_PER_EPOCH,
                                  validation_data = (pad_sequences(x_test), y_test),
                                  epochs = MAXEPOCHS, callbacks = [earlystop])
    
    return history.history
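

The `patience` argument of `EarlyStopping` can be read as "stop once the monitored value has failed to improve for this many epochs in a row". The following pure-Python sketch imitates that rule for a metric where higher is better. It is a simplified illustration under that assumption, not the Keras implementation, and the validation accuracies are invented.

```python
def epochs_until_stop(val_scores, patience):
    """Return the 1-based epoch after which training would stop.

    Training stops once the score has not improved on the best value
    seen so far for `patience` consecutive epochs. If that never
    happens, all epochs are run.
    """
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_scores)


# Invented validation accuracies for six epochs.
val_acc = [0.80, 0.85, 0.84, 0.83, 0.85, 0.82]

stop_short = epochs_until_stop(val_acc, patience=3)
stop_long = epochs_until_stop(val_acc, patience=7)
print(stop_short, stop_long)
```

Note that merely matching the best value (epoch 5 in the toy data) does not reset the counter in this sketch. With a patience of 7 and only a few epochs, stopping never triggers, which is worth keeping in mind when MAXEPOCHS is reduced during development.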

To evaluate, uncomment all models that you have implemented:

In [None]:
models = {}

#uncomment all models that you have implemented:
#models["gru+attn"] = build_gru_attn_model()
#models["gru"] = build_gru_model()
#models["cnn"] = build_cnn_model()
#models["emb"] = build_emb_model()

histories = {}
traintimes = {}

for name in sorted(models.keys()):
    print("Training", name)
    before = time.time()
    histories[name] = train_model(models[name], x_train, y_train, x_test, y_test)
    duration = time.time() - before
    traintimes[name] = duration / len(histories[name]["loss"]) / BATCHES_PER_EPOCH

Since we have measured the training time for every architecture, we can compare their efficiency. **Important:** This comparison is only meaningful with respect to your current hardware. For instance, parallelization is more effective on GPUs than on CPUs.
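
The per-batch time used above is the wall-clock duration divided by the number of epochs that actually ran (early stopping may end training before MAXEPOCHS) and by the number of batches per epoch. A minimal worked example with invented numbers (`seconds_per_batch` is a hypothetical helper):

```python
def seconds_per_batch(duration, epochs_run, batches_per_epoch):
    """Average wall-clock seconds spent per training batch."""
    return duration / epochs_run / batches_per_epoch


# Invented measurement: 300 s in total, stopped after 12 epochs,
# 25 batches per epoch.
t = seconds_per_batch(300.0, 12, 25)
print(round(t, 4), "seconds")
```

Because the measured duration also includes the validation pass at the end of every epoch, the figure slightly overestimates the pure training time per batch.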

In [None]:
print("Average training time per batch")
for name in traintimes:
    print(name, round(traintimes[name], 4), "seconds")

We plot training and validation set metrics:

In [None]:
# Take the metric names from any trained model, so the plot also works if "gru" was not trained
metrics = sorted(next(iter(histories.values())).keys())

for i, metric in enumerate(metrics):
    plot.subplot(1, len(metrics), i+1)
    
    for name in sorted(histories.keys()):
        plot.plot(range(1, len(histories[name][metric]) + 1),
                        histories[name][metric], label = name)
    plot.title(metric)
    plot.xlabel("Epoch")
    
plot.tight_layout()
plot.legend()

**TODO: Compare the different architectures in a few sentences. You should comment on overall performance, overfitting and convergence time (3 p.)**