Welcome
In my bachelor thesis I trained German word embeddings with gensim's word2vec library and evaluated them with generated test sets. This page gives an overview of the project and provides download links for scripts, source and evaluation files. The whole project is licensed under the MIT license.
Training and Evaluation
I found the following parameter configuration to be optimal for training German language models with word2vec (a minimal gensim sketch of these settings follows the list):
- a corpus as big as possible (and as diverse as possible without being informal)
- filtering of punctuation and stopwords
- forming bigram tokens
- using skip-gram as training algorithm with hierarchical softmax
- window size between 5 and 10
- dimensionality of feature vectors of 300 or more
- using negative sampling with 10 samples
- ignoring all words with total frequency lower than 50
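Expressed directly against gensim's Word2Vec class, this configuration might look like the following minimal sketch. The toolkit's training.py is the authoritative implementation; the parameter names follow current gensim releases (older versions call vector_size simply size), and the corpus path is only an example:

```python
# Minimal sketch of the configuration above using gensim's Word2Vec directly.
# training.py is the authoritative implementation; this only illustrates how
# the listed parameters map to gensim arguments.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# assumes a preprocessed corpus file (punctuation/stopwords filtered,
# bigram tokens formed) with one tokenized sentence per line
sentences = LineSentence('corpus/dewiki.corpus')

model = Word2Vec(
    sentences,
    sg=1,             # skip-gram training algorithm
    hs=1,             # hierarchical softmax
    negative=10,      # negative sampling with 10 samples
    window=5,         # window size between 5 and 10
    vector_size=300,  # dimensionality of feature vectors (size= in older gensim)
    min_count=50,     # ignore words with total frequency lower than 50
    workers=4,        # parallel training threads
)
model.save('model/my.model')
```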
The following table shows some statistics from training a model with the above specification:
| statistic | value |
| --- | --- |
| training time | 6.16 h |
| training speed | 26,626 words/s |
| vocab size | 608,130 words |
| corpus size | 651,219,519 words |
| model size | 720 MB |
To train this model, download this toolkit, navigate to its directory and use the following snippets, which rely on the preprocessing.py and training.py scripts.
Make working directories:
mkdir corpus
mkdir model
Build news corpus:
wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.de.shuffled.gz
gzip -d news.2013.de.shuffled.gz
python preprocessing.py news.2013.de.shuffled corpus/news.2013.de.shuffled.corpus -psub
rm news.2013.de.shuffled
Build wikipedia corpus:
wget http://download.wikimedia.org/dewiki/latest/dewiki-latest-pages-articles.xml.bz2
wget http://medialab.di.unipi.it/Project/SemaWiki/Tools/WikiExtractor.py
python WikiExtractor.py -c -b 25M -o extracted dewiki-latest-pages-articles.xml.bz2
find extracted -name '*bz2' -exec bzip2 -k -c -d {} \; > dewiki.xml
printf "Number of articles: "
grep -o "<doc" dewiki.xml | wc -w
sed -i 's/<[^>]*>//g' dewiki.xml
rm -rf extracted
python preprocessing.py dewiki.xml corpus/dewiki.corpus -psub
rm dewiki.xml
Training:
python training.py corpus/ model/my.model -s 300 -w 5 -n 10 -m 50
Subsequently the evaluation.py script can be used to evaluate the trained model:
python evaluation.py model/my.model -u -t 10
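For reference, the analogy-style questions can also be scored with gensim's built-in evaluation. A minimal sketch, assuming the question files follow gensim's analogy format (": section" headers followed by four-word lines) and the model trained above:

```python
# Hedged sketch: scoring the trained model on an analogy question file with
# gensim's built-in evaluation (evaluation.py covers further test types such
# as the best-match and doesn't-fit tasks).
from gensim.models import Word2Vec

model = Word2Vec.load('model/my.model')
score, sections = model.wv.evaluate_word_analogies('syntactic.questions')
print(f'overall analogy accuracy: {score:.3f}')
for section in sections:
    correct, incorrect = len(section['correct']), len(section['incorrect'])
    total = correct + incorrect
    if total:
        print(f"{section['section']}: {correct / total:.3f}")
```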
Further examples and code explanations can be found in the following IPython notebooks:
Semantic arithmetic
With basic vector arithmetic it is possible to show the meaning of words represented by the model: word vectors are added or subtracted, and cosine similarity is used to find the vector(s) nearest to the result. Some interesting examples are shown below:
Frau + Kind = Mutter (0.831)
Frau + Hochzeit = Ehefrau (0.795)
A common family relationship: a woman with a child added is a mother. In word2vec terms: adding the vector of child to the vector of woman results in a vector which is closest to mother, with a comparatively high cosine similarity of 0.831. In the same way, a woman with a wedding results in a wife.
Obama - USA + Russland = Putin (0.780)
The model is able to find the leader of a given country. Here Obama without USA is the feature of a country leader. Adding this feature to Russia results in Putin. It is also successful for other countries.
Verwaltungsgebaeude + Buecher = Bibliothek (0.718)
Verwaltungsgebaeude + Buergermeister = Rathaus (0.746)
Haus + Filme = Kino (0.713)
Haus + Filme + Popcorn = Kino (0.721)
The relationship between a building and its function is found correctly. Here an administration building with books is logically the library, and an administration building with a mayor is the city hall. Moreover, a house with movies results in a cinema. Note that when adding popcorn to the equation, the resulting vector gets a little closer to the vector of the word cinema.
Becken + Wasser = Schwimmbecken (0.790)
Sand + Wasser = Schlamm (0.792)
Meer + Sand = Strand (0.725)
Some nice examples with water: sand and water result in mud, sea and sand result in beach, and a basin with water is a pool.
Planet + Wasser = Erde (0.717)
Planet - Wasser = Pluto (0.385)
The main feature of our planet is correctly represented by the model: a planet with water is the earth, while a planet without water is Pluto. That is not quite accurate, though, since about one third of Pluto consists of water ice...
Kerze + Feuerzeug = brennende_Kerze (0.768)
Here is quite a good example of a semantically correct guess of a bigram token: candle and lighter result in a burning_candle.
The examples shown above are the results of a quick manual search for useful vector equations in the model. There are more amazing semantic relations for sure.
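These equations can be reproduced with gensim's most_similar() method, which adds the positive vectors, subtracts the negative ones and returns the nearest words by cosine similarity. A minimal sketch, assuming the model trained above:

```python
# Hedged sketch: reproducing the vector equations with gensim's most_similar().
from gensim.models import Word2Vec

model = Word2Vec.load('model/my.model')

# Frau + Kind ~ Mutter
print(model.wv.most_similar(positive=['Frau', 'Kind'], topn=3))

# Obama - USA + Russland ~ Putin
print(model.wv.most_similar(positive=['Obama', 'Russland'], negative=['USA'], topn=3))
```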
Visualizing features with PCA
Principal Component Analysis (PCA) is a method to reduce the number of dimensions of high-dimensional vectors while keeping the main features (the principal components). Here the 300-dimensional vectors of my German language model were reduced to a two-dimensional representation and plotted with Python's matplotlib for some word classes.
The plots above are created with the visualize.py script of this project. Some further examples and code explanations can be found in the PCA IPython notebook.
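A projection of this kind can be sketched in a few lines, assuming scikit-learn's PCA and matplotlib; visualize.py is the authoritative implementation, and the word list below is only illustrative:

```python
# Hedged sketch: reducing a few 300-dimensional word vectors to 2-D with
# scikit-learn's PCA and plotting them with matplotlib. The word list is
# only an example, not taken from the project's plots.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from gensim.models import Word2Vec

model = Word2Vec.load('model/my.model')
words = ['Mann', 'Frau', 'Kind', 'Koenig', 'Koenigin']  # example word class
vectors = [model.wv[word] for word in words]

# reduce 300 dimensions to 2 principal components
points = PCA(n_components=2).fit_transform(vectors)

for (x, y), word in zip(points, words):
    plt.scatter(x, y)
    plt.annotate(word, xy=(x, y))
plt.show()
```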
Download
Model
The German language model, trained with word2vec on the German Wikipedia (15th May 2015) and German news articles (15th May 2015):
german.model [704 MB]
Syntactic Questions
10k questions with German umlauts:
syntactic.questions
The same 10k questions with transformed German umlauts:
syntactic.questions.nouml
Evaluation source files:
adjectives.txt
nouns.txt
verbs.txt
Semantic Questions
300 opposite questions with German umlauts:
semantic_op.questions
The same 300 opposite questions with transformed German umlauts:
semantic_op.questions.nouml
540 best match questions with German umlauts:
semantic_bm.questions
The same 540 best match questions with transformed German umlauts:
semantic_bm.questions.nouml
110 doesn't fit questions with German umlauts:
semantic_df.questions
The same 110 doesn't fit questions with transformed German umlauts:
semantic_df.questions.nouml
Evaluation source files:
opposite.txt
bestmatch.txt
doesntfit.txt
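As a side note on the doesn't fit questions above: this task corresponds to gensim's doesnt_match() method, which returns the word whose vector lies furthest from the mean of the given group. A minimal sketch with illustrative words (not taken from doesntfit.txt):

```python
# Hedged sketch of the "doesn't fit" task with gensim's doesnt_match().
# The words are illustrative examples, not entries from doesntfit.txt.
from gensim.models import Word2Vec

model = Word2Vec.load('model/my.model')
print(model.wv.doesnt_match(['Apfel', 'Birne', 'Kirsche', 'Auto']))  # expected: 'Auto'
```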