# Deep Learning for NLP WS19/20
## Exercise Sheet 5 - PyTorch Word Embeddings
This exercise sheet is due on 26.11.19 at 11:59 pm. It is worth a total
of 5 points. Please send your solution in a suitable format to
[profilmodul1920@cis.lmu.de](mailto:profilmodul1920@cis.lmu.de). Submit a
completed version of this file in Python 3, working in teams of
2 or 3 students.


You will have to complete the code and answer the questions marked with ***TODO***.

Please rename the file to pytorch_wordEmbeddings_last_names.ipynb

### Setup

Please refer to the last exercise sheet if you have trouble installing PyTorch. This time you will also have to install nltk (e.g. with pip).

In [None]:
import random
import numpy as np
import nltk
from collections import Counter
from nltk.corpus import brown
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data as data_utils

The nltk library provides datasets that can be used to dry-run an approach or to verify a hypothesis. We will use the Brown corpus. If you don't yet have the Brown corpus on your machine, run the following cell once to download it.

In [None]:
nltk.download("brown")

While developing and debugging your code, you may want to use only the first tokens of the corpus, because training on the full corpus takes quite some time. Use only the first 1000 words until your code works; afterwards you can switch to the whole corpus.

In [None]:
brown_word_list = [w.lower() for w in brown.words()] #TODO only use a subset during developing/debugging
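
One possible way to restrict the word list during development is sketched below. The helper name lowercased_prefix is hypothetical and not part of the exercise; the toy list stands in for brown.words() so that the snippet runs without the corpus.

```python
from itertools import islice

def lowercased_prefix(words, n=None):
    """Lower-case the first n items of an iterable of words (all items if n is None).

    Hypothetical helper, not part of the exercise sheet.
    """
    return [w.lower() for w in islice(words, n)]

# Toy stand-in for brown.words(); with the real corpus one could write
# brown_word_list = lowercased_prefix(brown.words(), 1000)
toy_words = ["The", "Fulton", "County", "Grand", "Jury"]
toy_subset = lowercased_prefix(toy_words, 3)
print(toy_subset)
```

Passing n=None keeps every token, so the same call can be reused once the code works on the small subset.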

Next we create the vocabulary. The Brown corpus contains 56057 unique words/tokens (lower-cased). For the sake of computation time we will only use the 10 000 most common words/tokens.

In [None]:
# creating vocabulary
word_counts = Counter(brown_word_list)
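vocab_size = 10000  # number of most common words/tokens kept in the vocabulary
# most_common() yields (word, count) pairs sorted by frequency; the rank becomes the word id
vocab = {word: idx for idx, (word, _count) in enumerate(word_counts.most_common(vocab_size))}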
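
To see what the resulting mapping looks like, here is a small sketch on a toy token list (the toy data and the names toy_tokens, toy_counts and toy_vocab are made up for illustration). Words outside the most common ones simply do not appear in the dictionary.

```python
from collections import Counter

# Toy data, not taken from the Brown corpus.
toy_tokens = ["the", "cat", "saw", "the", "dog", "the", "cat"]
toy_counts = Counter(toy_tokens)

# Keep only the 2 most common words; the frequency rank is used as the word id.
toy_vocab = {word: idx for idx, (word, _count) in enumerate(toy_counts.most_common(2))}
print(toy_vocab)  # {'the': 0, 'cat': 1}
print("dog" in toy_vocab)  # False
```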

We define some hyperparameters. Make sure you understand what each of them does. You can always send questions to [profilmodul1819@cis.lmu.de](mailto:profilmodul1819@cis.lmu.de).

In [None]:
# Hyper-parameters of the algorithm
window_size = 1 # maximum distance between center word and context word
# You can change the number of negative samples to a higher value, e.g. 7. 
# This tends to give better results, but takes longer to train.
neg_samples_factor = 2 # number of negative samples per positive pair
dims = 64 # embedding dimension
learning_rate = 0.01
batch_size = 4096
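
As a rough sketch of how window_size and neg_samples_factor influence the amount of training data, the hypothetical helper below counts the tuples generated for one center word. It assumes the word is not at the corpus boundary and that all of its neighbors are in the vocabulary; real counts will be somewhat lower.

```python
def tuples_per_center_token(window_size, neg_samples_factor):
    """Approximate number of training tuples per center word (hypothetical helper).

    Assumes a full window on both sides and that every neighbor is in the vocabulary.
    """
    positives = 2 * window_size                 # neighbors to the left and right
    negatives = positives * neg_samples_factor  # negatives drawn per positive pair
    return positives + negatives

print(tuples_per_center_token(1, 2))  # 6
print(tuples_per_center_token(1, 7))  # 16
```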

The following function takes a list of strings (words) and returns a generator of tuples of the following form:
    
(center_word, context_word, label)

where center_word is the word at a given position, and context_word is a word at most max_distance tokens away from it.

Each pair carries a label: positive cooccurrences are labeled True; negative cooccurrences, for which the context word is drawn uniformly at random from the vocabulary, are labeled False.

Words are represented by their integer ids rather than by strings.

Only pairs in which both words are in the vocabulary are considered.

Note: cooccurrence only holds between words at different positions, not between a position and itself.

In [None]:
def positive_and_negative_cooccurrences(tokens, max_distance, neg_samples_factor, vocab_to_id):
    """
    :param tokens: list of strings (words)
    :param max_distance: max distance of context word to center word
    :param neg_samples_factor: number of sampled negative tuples for each positive tuple
    :param vocab_to_id: dictionary (string to int) mapping each word to its id (=row in embedding matrices)
    :return: generator over tuples of the form (center_word_id:int, context_word_id:int, label:boolean)
    """
    for center_position in range(len(tokens)):
        center_word = tokens[center_position]
        if center_word not in vocab_to_id:
            continue
        context_start = max(0, center_position - max_distance)
        context_end = min(len(tokens), center_position + max_distance + 1)
        for context_position in range(context_start, context_end):
            if context_position != center_position:
                context_word = tokens[context_position]
                if context_word not in vocab_to_id:
                    continue
                yield (vocab_to_id[center_word], vocab_to_id[context_word], True)
                for i in range(neg_samples_factor):
                    yield (vocab_to_id[center_word], random.randint(0, len(vocab_to_id) - 1), False)
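
The windowing logic can be tried out in isolation. The following is a simplified sketch that yields only the positive pairs (no labels, no negative sampling, no vocabulary filtering); positive_pairs is a hypothetical helper and the ids 10, 11, 12 are arbitrary.

```python
def positive_pairs(token_ids, max_distance):
    """Yield (center_id, context_id) for all positions at most max_distance apart.

    Simplified, hypothetical variant of the generator above: positives only.
    """
    for center_position, center_id in enumerate(token_ids):
        start = max(0, center_position - max_distance)
        end = min(len(token_ids), center_position + max_distance + 1)
        for context_position in range(start, end):
            if context_position != center_position:  # a position never pairs with itself
                yield (center_id, token_ids[context_position])

pairs = list(positive_pairs([10, 11, 12], max_distance=1))
print(pairs)  # [(10, 11), (11, 10), (11, 12), (12, 11)]
```

Note that every neighboring pair appears twice, once with each word in the center role.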


Next we build our model. As you have seen in the last exercise, you define a model by writing a class that inherits from torch.nn.Module and defines the init and forward methods.

#### init method

***TODO*** You will have to create two embeddings for the vocabulary, one for words as center words and one for words as context words. Hint: There is an Embedding layer in PyTorch which should be used here. (0.5p)

***TODO*** How are the weight values in the embedding layer initialized internally? (0.5p) 

***TODO*** Inspect the structure of the embedding layer and change the initial weights to a torch Tensor of normally distributed float values with mean 0 and variance 0.1. (Consulting the documentation for torch.Tensor might help.) (1p)

***TODO*** Can you think of a scenario in which you would initialize the embedding weights with specific values? (0.5p)

#### forward method

***TODO*** What is the shape of center_context_idxs? (0.5p)

***TODO*** What are the shapes of cntr_idxs and ctxt_idxs? (0.5p)

***TODO*** What shape does ctxt_vecs have? What shape does torch.bmm require ctxt_vecs to have? (0.5p)

***TODO*** Bring ctxt_vecs into the required shape (use variable.view(...)) (0.5p)

***TODO*** What is the shape of the returned Tensor? (0.5p)
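
For orientation, here is a generic sketch of the two tensor operations mentioned above, torch.bmm and Tensor.view, on arbitrary toy shapes that are unrelated to the shapes asked for in the questions. The variable names are made up for illustration.

```python
import torch

# torch.bmm multiplies two batches of matrices: (b, n, m) x (b, m, p) -> (b, n, p).
a = torch.randn(2, 3, 4)
b = torch.randn(2, 4, 5)
c = torch.bmm(a, b)
print(c.shape)  # torch.Size([2, 3, 5])

# view() reinterprets the same data with a new shape; -1 lets PyTorch infer that dimension.
flat = torch.arange(6)
reshaped = flat.view(-1, 3, 1)
print(reshaped.shape)  # torch.Size([2, 3, 1])
```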

In [None]:
class Word2Vec(nn.Module):
    def __init__(self, vocab_size, embedding_size):
        super(Word2Vec, self).__init__()
        self.embedding_size = embedding_size
        # TODO: create embeddings for vocabulary as center and context words. (0.5p)
        self.embeddings_center = None # TODO
        self.embeddings_context = None #TODO
        # TODO: Initialize the word vectors so that the components are normally distributed 
        # and have mean 0 and variance 0.1 (1p)
        
    def forward(self, center_context_idxs):
        # TODO: What is the shape of center_context_idxs? (0.5p)
        cntr_idxs = center_context_idxs[:, 0]
        ctxt_idxs = center_context_idxs[:, 1]
        # TODO: What are the shapes of cntr_idxs and ctxt_idxs? (0.5p)
        
        # Bring the center embeddings into the required shape for torch.bmm()
        cntr_vecs = self.embeddings_center(cntr_idxs).view(-1, 1, self.embedding_size)
        # resulting shape: batch_size x 1 x embedding_size
        
        ctxt_vecs = self.embeddings_context(ctxt_idxs) # TODO
        # TODO: What shape does ctxt_vecs have? What shape does torch.bmm() require ctxt_vecs to have? (0.5p)
        # TODO: Bring ctxt_vecs into the required shape (using variable.view(...) ) (0.5p)
       
        scores = torch.bmm(cntr_vecs, ctxt_vecs) # Batch-wise matrix multiplication.
        # resulting shape: batch_size x 1 x 1
        
        # TODO: What is the shape of the returned Tensor? (0.5p)
        return scores.view(-1,1) 

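    def center_sims(self, word_idx):
        # Cosine similarity between the center embedding of word_idx and all center embeddings.
        m = self.embeddings_center.weight
        v = m[word_idx]
        return F.cosine_similarity(m, v.expand(m.size())).detach().numpy()

    def context_sims(self, word_idx):
        # Same as center_sims, but computed on the context embeddings.
        m = self.embeddings_context.weight
        v = m[word_idx]
        return F.cosine_similarity(m, v.expand(m.size())).detach().numpy()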


From this point on, there is nothing left for you to implement. Please read through the remaining cells, consult the documentation to see what each line does, and feel free to experiment with different hyperparameters.

Understanding how the model learns and predicts will be essential for your upcoming projects.
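
One part of the training code that may be worth trying in isolation is the batching with TensorDataset and DataLoader. The sketch below uses a handful of made-up (center id, context id) pairs and labels; shuffling is switched off only to keep the output predictable.

```python
import torch
import torch.utils.data as data_utils

# Made-up word id pairs and labels (1.0 = positive, 0.0 = negative).
toy_pairs = torch.LongTensor([[0, 1], [1, 0], [1, 2], [2, 1], [0, 2]])
toy_labels = torch.FloatTensor([1, 1, 1, 1, 0]).view(-1, 1)

toy_dataset = data_utils.TensorDataset(toy_pairs, toy_labels)
toy_loader = data_utils.DataLoader(toy_dataset, batch_size=2, shuffle=False)

# Each iteration yields one batch of inputs together with the matching labels.
batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in toy_loader]
print(batch_shapes)
```

The last batch is smaller than batch_size whenever the dataset size is not a multiple of it.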

In [None]:
w2v_model = Word2Vec(vocab_size, dims)

criterion = nn.BCEWithLogitsLoss()

optimizer = optim.Adam(w2v_model.parameters(), lr=learning_rate)

pos_neg_list = list(positive_and_negative_cooccurrences(brown_word_list, window_size, neg_samples_factor, vocab))
data_size = len(pos_neg_list)
# Integer array with one row per tuple; the boolean labels become 1 (True) and 0 (False).
train_data = np.asarray(pos_neg_list)

num_epochs = 20

data_tensor = torch.LongTensor(train_data[:, 0:2].tolist())
target_tensor = torch.FloatTensor(train_data[:, 2].tolist()).view(-1,1)

train = data_utils.TensorDataset(data_tensor, target_tensor)
train_loader = data_utils.DataLoader(train, batch_size=batch_size, shuffle=True)

for epoch_nr in range(num_epochs):
    loss_accum = 0.0
    print("epoch", epoch_nr)
    for center_context_idxs, labels in train_loader:
        optimizer.zero_grad()
        output = w2v_model(center_context_idxs)  # calling the module invokes forward()
        loss = criterion(output, labels)
        loss_accum += loss.item()
        loss.backward()
        optimizer.step()
    print("current loss:", loss_accum)
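
To illustrate what the criterion computes, the sketch below compares nn.BCEWithLogitsLoss on raw scores with the two-step variant of applying a sigmoid first and then the binary cross-entropy. The scores and labels are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up raw scores (logits) and labels for three pairs.
logits = torch.tensor([[2.0], [-1.0], [0.5]])
targets = torch.tensor([[1.0], [0.0], [0.0]])

# Fused version, as used in the training loop above.
loss_fused = nn.BCEWithLogitsLoss()(logits, targets)

# Equivalent two-step version: sigmoid, then binary cross-entropy.
loss_two_step = F.binary_cross_entropy(torch.sigmoid(logits), targets)

print(loss_fused.item(), loss_two_step.item())
```

Both losses average over the batch by default, so the value accumulated in loss_accum is a sum of per-batch means and grows with the number of batches.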


Finally, we retrieve the words whose embeddings are most similar to those of a few example words. For each example word, the first list shows its nearest neighbors among the center embeddings, the second its nearest neighbors among the context embeddings.

In [None]:
sorted_words = sorted(vocab.keys(), key=vocab.get)
def top_words_for(word, n=10, for_center=True):
    print(word)
    query_index = vocab[word]
    if for_center:
        sims = list(zip(w2v_model.center_sims(query_index), sorted_words))
    else:
        sims = list(zip(w2v_model.context_sims(query_index), sorted_words))
    sims.sort(key=lambda x: -x[0])
    return sims[:n]

print("word".ljust(20), "similarity")
for ww in ["the", "jury", "city", "man", "any", "1", "two"]:
    print("="*35)
    # The first element of each tuple is a cosine similarity (higher means more similar).
    for sim, nw in top_words_for(ww, for_center=True):
        print(nw.ljust(20), sim)
    for sim, nw in top_words_for(ww, for_center=False):
        print(nw.ljust(20), sim)
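
The similarity ranking can also be checked on a tiny example. The sketch below applies the same F.cosine_similarity call to a made-up 3 x 2 matrix whose rows play the role of word vectors for the hypothetical words "a", "b" and "c".

```python
import torch
import torch.nn.functional as F

# Made-up embedding matrix: one row per word.
toy_matrix = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_words = ["a", "b", "c"]

query = toy_matrix[0]  # vector of word "a"
# Compare the query with every row; cosine_similarity works along dim=1 by default.
toy_sims = F.cosine_similarity(toy_matrix, query.expand(toy_matrix.size()))

ranked = sorted(zip(toy_sims.tolist(), toy_words), key=lambda x: -x[0])
print(ranked)
```

As in the cell above, the query word itself comes first with similarity 1, which is why the top entry of each printed list is the example word.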