

A toolkit to obtain and preprocess German corpora, train models and evaluate them with generated testsets



In my bachelor thesis I trained German word embeddings with gensim's word2vec library and evaluated them with generated test sets. This page offers an overview of the project and download links for scripts, source and evaluation files. The whole project is licensed under the MIT license.

Training and Evaluation

I found the following parameter configuration to be optimal for training German language models with word2vec:

The following table shows some statistics for a training run with the above specification:

training time     6.16 h
training speed    26,626 words/s
vocab size        608,130 words
corpus size       651,219,519 words
model size        720 MB

To train this model, you can use the following snippets after downloading this toolkit and navigating to its directory, where the preprocessing and training scripts are used.

Make working directories:

mkdir corpus
mkdir model

Build news corpus:

gzip -d
python corpus/ -psub

Build wikipedia corpus:

python -c -b 25M -o extracted dewiki-latest-pages-articles.xml.bz2
find extracted -name '*bz2' -exec bzip2 -k -c -d {} \; > dewiki.xml
printf "Number of articles: "
grep -o "<doc" dewiki.xml | wc -w
sed -i 's/<[^>]*>//g' dewiki.xml
rm -rf extracted
python dewiki.xml corpus/dewiki.corpus -psub
rm dewiki.xml

Train the model:

python corpus/ model/my.model -s 300 -w 5 -n 10 -m 50

Subsequently, the evaluation script can be used to evaluate the trained model:

python model/my.model -u -t 10

Further examples and code explanations can be found in the following IPython notebooks:

  • Preprocessing
  • Training
  • Evaluation
  • Semantic arithmetic

    With basic vector arithmetic it is possible to reveal word meanings that are represented in the model: the word vectors are added or subtracted, and the vector(s) nearest to the result are found via cosine similarity. Some interesting examples are shown below:

    Frau + Kind = Mutter (0.831)
    Frau + Hochzeit = Ehefrau (0.795)

    A common family relationship: a woman with a child added is a mother. In word2vec terms: adding the vector of child to the vector of woman results in a vector closest to mother, with a comparatively high cosine similarity of 0.831. In the same way, a woman with a wedding results in a wife.

    Obama - USA + Russland = Putin (0.780)

    The model is able to find the leader of a given country. Here, Obama minus USA captures the feature of being a country's leader; adding this feature to Russia yields Putin. This also works for other countries.

    Verwaltungsgebaeude + Buecher = Bibliothek (0.718)
    Verwaltungsgebaeude + Buergermeister = Rathaus (0.746)
    Haus + Filme = Kino (0.713)
    Haus + Filme + Popcorn = Kino (0.721)

    The relationship between a building and its function is captured correctly: an administration building with books is logically the library, and an administration building with a mayor is the city hall. Moreover, a house with movies results in a cinema. Note that adding popcorn to the equation moves the resulting vector a little closer to the vector of the word cinema.

    Becken + Wasser = Schwimmbecken (0.790)
    Sand + Wasser = Schlamm (0.792)
    Meer + Sand = Strand (0.725)

    Some nice examples with water: sand and water result in mud, sea and sand result in beach and a basin with water is a pool.

    Planet + Wasser = Erde (0.717)
    Planet - Wasser = Pluto (0.385)

    The main feature of our planet is represented correctly by the model: a planet with water is the Earth, while a planet without water is Pluto. That is not quite accurate, though, since about a third of Pluto consists of water ice...

    Kerze + Feuerzeug = brennende_Kerze (0.768)

    Here is a good example of a semantically correct guess involving a bigram token: candle and lighter result in a burning_candle.

    The examples shown above are the results of a quick manual search for useful vector equations in the model. There are more amazing semantic relations for sure.
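
The nearest-neighbour search behind such equations can be sketched with plain numpy. The 3-dimensional toy vectors below are made up purely for illustration; with the trained model, gensim's most_similar(positive=[...], negative=[...]) does the same job:

```python
import numpy as np

# Toy embedding table standing in for the real 300-dimensional model.
emb = {
    "Frau":   np.array([1.0, 0.0, 0.0]),
    "Kind":   np.array([0.0, 1.0, 0.0]),
    "Mutter": np.array([0.7, 0.7, 0.0]),
    "Haus":   np.array([0.0, 0.0, 1.0]),
}

def nearest(query, exclude):
    """Return the word (and similarity) whose vector is closest to query by cosine similarity."""
    best, best_sim = None, -1.0
    for word, vec in emb.items():
        if word in exclude:
            continue  # the input words themselves are not valid answers
        sim = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best, best_sim

# Frau + Kind = ?  (excluding the query words from the candidates)
word, sim = nearest(emb["Frau"] + emb["Kind"], exclude={"Frau", "Kind"})
```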

    Visualizing features with PCA

    Principal Component Analysis (PCA) is a method for reducing the dimensionality of high-dimensional vectors while keeping their main features (the principal components). Using PCA, the 300 dimensions of my German language model's vectors were reduced to a two-dimensional representation and plotted with Python's matplotlib for several word classes.
    PCA: Capital of a country
    Countries and capitals are grouped correctly. The connecting lines are approximately parallel (except the one for Sweden maybe...) and of the same length. So the model understands the concept of capitals and countries.
    PCA: Currency of a country
    Countries and their currencies are also grouped correctly. As with the capitals, the model has a good grasp of the concept of a country's currency. Note: british_pounds is more accurate here than just pounds because of the multiple meanings of the word; the same holds for US-Dollar and Dollar.
    PCA: Language of a country
    Finally another great example of grouped features with languages of countries.

    The plots above are created with the visualization script of this project. Some further examples and code explanation can be found in the PCA IPython notebook.
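
The dimensionality reduction itself can be sketched in a few lines of numpy: center the vectors, then project them onto the first two principal axes obtained via SVD. The random matrix below is only a placeholder for real word vectors; the project's plots additionally use matplotlib for drawing:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(8, 300))  # e.g. 8 country/capital word vectors

# Center the data, then take the top two right-singular vectors as principal axes.
centered = vectors - vectors.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ Vt[:2].T  # shape (8, 2), ready for a scatter plot
```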



    The German language model, trained with word2vec on the German Wikipedia (15th May 2015) and German news articles (15th May 2015):
    german.model [704 MB]

    Syntactic Questions

    10k questions with German umlauts:
    The same 10k questions with transformed German umlauts:

    Evaluation source files:

    Semantic Questions

    300 opposite questions with German umlauts:
    The same 300 opposite questions with transformed German umlauts:

    540 best match questions with German umlauts:
    The same 540 best match questions with transformed German umlauts:

    110 doesn't fit questions with German umlauts:
    The same 110 doesn't fit questions with transformed German umlauts:

    Evaluation source files: