Welcome
In my bachelor thesis I trained German word embeddings with gensim's word2vec library and evaluated them with generated test sets. This page gives an overview of the project and provides download links for scripts, source and evaluation files. The whole project is licensed under the MIT license.
Training and Evaluation
I found the following parameter configuration to be optimal for training German language models with word2vec:
- a corpus as big as possible (and as diverse as possible without being informal)
- filtering of punctuation and stopwords
- forming bigram tokens (see the sketch after this list)
- using skip-gram as training algorithm with hierarchical softmax
- window size between 5 and 10
- dimensionality of feature vectors of 300 or more
- using negative sampling with 10 samples
- ignoring all words with total frequency lower than 50
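One way to form such bigram tokens is gensim's Phrases model, which joins frequently co-occurring word pairs into single tokens such as brennende_Kerze. A minimal sketch, assuming an already tokenized corpus file; the file name and thresholds are illustrative, not the project's exact settings:
# Sketch of bigram token detection with gensim's Phrases.
# The corpus path and the thresholds are illustrative assumptions.
from gensim.models.phrases import Phrases
from gensim.models.word2vec import LineSentence

sentences = LineSentence("corpus/dewiki.corpus")         # one tokenized sentence per line
bigram = Phrases(sentences, min_count=50, threshold=10)  # learn frequent word pairs

# Applying the model to a sentence joins frequent pairs with an underscore.
print(bigram[["eine", "brennende", "Kerze"]])            # e.g. ['eine', 'brennende_Kerze']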
The following table shows some statistics from training a model with the above configuration:
| training time  | 6.16 h            |
| training speed | 26,626 words/s    |
| vocab size     | 608,130 words     |
| corpus size    | 651,219,519 words |
| model size     | 720 MB            |
To train this model, download this toolkit, navigate to its directory and use the following snippets, which rely on the preprocessing.py and training.py scripts.
Make working directories:
mkdir corpus
mkdir model
Build news corpus:
wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.de.shuffled.gz
gzip -d news.2013.de.shuffled.gz
python preprocessing.py news.2013.de.shuffled corpus/news.2013.de.shuffled.corpus -psub
rm news.2013.de.shuffled.gz
Build Wikipedia corpus:
wget http://download.wikimedia.org/dewiki/latest/dewiki-latest-pages-articles.xml.bz2
wget http://medialab.di.unipi.it/Project/SemaWiki/Tools/WikiExtractor.py
python WikiExtractor.py -c -b 25M -o extracted dewiki-latest-pages-articles.xml.bz2
find extracted -name '*bz2' \! -exec bzip2 -k -c -d {} \; > dewiki.xml
printf "Number of articles: "
grep -o "<doc" dewiki.xml | wc -w
sed -i 's/<[^>]*>//g' dewiki.xml
rm -rf extracted
python preprocessing.py dewiki.xml corpus/dewiki.corpus -psub
rm dewiki.xml
Training:
python training.py corpus/ model/my.model -s 300 -w 5 -n 10 -m 50
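Under the hood, training.py presumably passes these flags on to gensim's Word2Vec class. A minimal stand-alone sketch with the parameters listed above; the corpus file name is illustrative and the keyword names follow gensim 4.x (older versions use size instead of vector_size):
# Minimal gensim training sketch with the recommended parameters.
import multiprocessing
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream one preprocessed sentence per line from a corpus file (illustrative path).
sentences = LineSentence("corpus/dewiki.corpus")
model = Word2Vec(
    sentences,
    sg=1,             # skip-gram training algorithm
    hs=1,             # hierarchical softmax
    negative=10,      # negative sampling with 10 samples
    vector_size=300,  # dimensionality of the feature vectors
    window=5,         # window size (5 to 10 recommended above)
    min_count=50,     # ignore words with total frequency lower than 50
    workers=multiprocessing.cpu_count(),
)
model.save("model/my.model")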
Subsequently, the evaluation.py script can be used to evaluate the trained model:
python evaluation.py model/my.model -u -t 10
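For comparison, gensim also ships a built-in analogy evaluation that can be run directly on such question files. A minimal sketch, assuming the model saved above and a questions file in the word2vec questions-words format (lines of four words grouped under ": section" headers):
# Sketch of gensim's built-in analogy evaluation (gensim 4.x API assumed).
from gensim.models import Word2Vec

model = Word2Vec.load("model/my.model")
score, sections = model.wv.evaluate_word_analogies("syntactic.questions")
print("analogy accuracy: %.3f" % score)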
Further examples and code explanation can be found in the project's IPython notebooks.
Semantic arithmetic
With basic vector arithmetic it is possible to explore the word meanings captured by the model: word vectors are added or subtracted, and cosine similarity is used to find the vector(s) closest to the result. Some interesting examples are shown below, followed by a short gensim sketch.
Frau + Kind = Mutter (0.831)
Frau + Hochzeit = Ehefrau (0.795)
A common family relationship: a woman with a child added is a mother. In word2vec terms: adding the vector of child to the vector of woman results in a vector that is closest to mother, with a comparatively high cosine similarity of 0.831. In the same way, a woman with a wedding results in a wife.
Obama - USA + Russland = Putin (0.780)
The model is able to find the leader of a given country. Here, Obama minus USA captures the feature of being a country's leader; adding this feature to Russia results in Putin. This also works for other countries.
Verwaltungsgebaeude + Buecher = Bibliothek (0.718)
Verwaltungsgebaeude + Buergermeister = Rathaus (0.746)
Haus + Filme = Kino (0.713)
Haus + Filme + Popcorn = Kino (0.721)
The relationship between a building and its function is captured correctly: an administration building with books is logically the library, and an administration building with a mayor is the city hall. Moreover, a house with movies results in a cinema. Note that adding popcorn to the equation moves the resulting vector a little closer to the vector of the word cinema.
Becken + Wasser = Schwimmbecken (0.790)
Sand + Wasser = Schlamm (0.792)
Meer + Sand = Strand (0.725)
Some nice examples with water: sand and water result in mud, sea and sand result in a beach, and a basin with water is a swimming pool.
Planet + Wasser = Erde (0.717)
Planet - Wasser = Pluto (0.385)
The main feature of our planet is correctly represented by the model: a planet with water is the Earth, while a planet without water is Pluto. That is not quite accurate, though, since about a third of Pluto consists of water ice...
Kerze + Feuerzeug = brennende_Kerze (0.768)
Here is quite a good example of a semantically correct guess involving a bigram token: candle and lighter result in a burning_candle.
The examples shown above are the results of a quick manual search for useful vector equations in the model. There are surely more amazing semantic relations to be found.
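Such equations can be reproduced with gensim's most_similar method, which adds the positive vectors, subtracts the negative ones and ranks the vocabulary by cosine similarity to the result. A minimal sketch, assuming the model trained in the snippets above:
# Reproduce the vector arithmetic examples with gensim (gensim 4.x API assumed).
from gensim.models import Word2Vec

model = Word2Vec.load("model/my.model")

# Frau + Kind: expected to be closest to "Mutter"
print(model.wv.most_similar(positive=["Frau", "Kind"], topn=3))

# Obama - USA + Russland: expected to be closest to "Putin"
print(model.wv.most_similar(positive=["Obama", "Russland"], negative=["USA"], topn=3))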
Visualizing features with PCA
Principal Component Analysis (PCA) is a method to reduce the number of dimensions of high-dimensional vectors while keeping the main features (the principal components). Here, the 300 dimensions of the vectors of my German language model were reduced to a two-dimensional representation and plotted with Python's matplotlib for some word classes. Among the currency words, british_pounds is more accurate than just pounds because of the multiple meanings of that word; the same holds for US-Dollar and Dollar. The plots are created with the visualize.py script of this project. Some further examples and code explanation can be found in the PCA IPython notebook.
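A minimal sketch of such a plot, assuming scikit-learn and matplotlib are installed and using the model trained above; the word list is illustrative and every word must be in the model's vocabulary:
# Reduce 300-dimensional word vectors to 2D with PCA and plot them.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from gensim.models import Word2Vec

model = Word2Vec.load("model/my.model")
words = ["Euro", "Dollar", "Yen", "Rubel", "Franken"]  # illustrative currency word class
vectors = [model.wv[w] for w in words]                 # 300-dimensional word vectors

# Keep only the two principal components and plot each word as a labeled point.
points = PCA(n_components=2).fit_transform(vectors)
for (x, y), word in zip(points, words):
    plt.scatter(x, y)
    plt.annotate(word, (x, y))
plt.show()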
Download
Model
The German language model, trained with word2vec on the German Wikipedia (15th May 2015) and German news articles (15th May 2015):
german.model [704 MB]
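To work with the downloaded model in gensim, it can presumably be loaded as a binary file in the word2vec format; a minimal sketch (the binary flag is an assumption about the file format):
# Load the downloaded vectors (binary word2vec format assumed) and query them.
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("german.model", binary=True)
print(model.most_similar("Deutschland", topn=5))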
Syntactic Questions
10k questions with German umlauts:
syntactic.questions
The same 10k questions with transformed German umlauts (see the transformation sketch below):
syntactic.questions.nouml
Evaluation source files:
adjectives.txt
nouns.txt
verbs.txt
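The *.nouml variants presumably replace German umlauts and the sharp s with their ASCII transliterations, so that the questions match a model trained on transliterated text (compare tokens like Verwaltungsgebaeude above). A small sketch of such a transformation; the exact mapping used by the project is an assumption:
# Assumed umlaut transformation for the *.nouml question files.
UMLAUT_MAP = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}

def replace_umlauts(text):
    """Replace German umlauts and sharp s with ASCII transliterations."""
    for umlaut, replacement in UMLAUT_MAP.items():
        text = text.replace(umlaut, replacement)
    return text

print(replace_umlauts("Mütter Mädchen schöner Straße"))  # Muetter Maedchen schoener Strasse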
Semantic Questions
300 opposite questions with German umlauts:
semantic_op.questions
The same 300 opposite questions with transformed German umlauts:
semantic_op.questions.nouml
540 best match questions with German umlauts:
semantic_bm.questions
The same 540 best match questions with transformed German umlauts:
semantic_bm.questions.nouml
110 doesn't fit questions with German umlauts:
semantic_df.questions
The same 110 doesn't fit questions with transformed German umlauts:
semantic_df.questions.nouml
Evaluation source files:
opposite.txt
bestmatch.txt
doesntfit.txt