Welcome
In my bachelor thesis I trained German word embeddings with gensim's word2vec library and evaluated them with generated test sets. This page gives an overview of the project and provides download links for scripts, source and evaluation files. The whole project is licensed under the MIT license.
Training and Evaluation
I found the following parameter configuration to be optimal for training German language models with word2vec (a minimal gensim sketch of these settings follows the list):
- a corpus as big as possible (and as diverse as possible without being informal)
- filtering of punctuation and stopwords
- forming bigram tokens
- using skip-gram as training algorithm with hierarchical softmax
- window size between 5 and 10
- dimensionality of feature vectors of 300 or more
- using negative sampling with 10 samples
- ignoring all words with total frequency lower than 50
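Expressed directly against gensim's Word2Vec class, this configuration might look like the following minimal sketch. The toolkit's training.py is the authoritative implementation; the parameter names follow current gensim releases (older versions call vector_size simply size), and the corpus path is only an example:

```python
# Minimal sketch of the configuration above using gensim's Word2Vec directly.
# training.py is the authoritative implementation; this only illustrates how
# the listed parameters map to gensim arguments.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# assumes a preprocessed corpus file (punctuation/stopwords filtered,
# bigram tokens formed) with one tokenized sentence per line
sentences = LineSentence('corpus/dewiki.corpus')

model = Word2Vec(
    sentences,
    sg=1,             # skip-gram training algorithm
    hs=1,             # hierarchical softmax
    negative=10,      # negative sampling with 10 samples
    window=5,         # window size between 5 and 10
    vector_size=300,  # dimensionality of feature vectors (size= in older gensim)
    min_count=50,     # ignore words with total frequency lower than 50
    workers=4,        # parallel training threads
)
model.save('model/my.model')
```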
The following table shows some statistics from training a model with the above specification:
| statistic | value |
| --- | --- |
| training time | 6.16 h |
| training speed | 26,626 words/s |
| vocab size | 608,130 words |
| corpus size | 651,219,519 words |
| model size | 720 MB |
To train this model, download this toolkit, navigate to its directory and use the following snippets, which rely on the preprocessing.py and training.py scripts.
Make working directories:
mkdir corpus
mkdir model
Build news corpus:
wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.de.shuffled.gz
gzip -d news.2013.de.shuffled.gz
python preprocessing.py news.2013.de.shuffled corpus/news.2013.de.shuffled.corpus -psub
rm news.2013.de.shuffled
Build wikipedia corpus:
wget http://download.wikimedia.org/dewiki/latest/dewiki-latest-pages-articles.xml.bz2
wget http://medialab.di.unipi.it/Project/SemaWiki/Tools/WikiExtractor.py
python WikiExtractor.py -c -b 25M -o extracted dewiki-latest-pages-articles.xml.bz2
find extracted -name '*bz2' -exec bzip2 -k -c -d {} \; > dewiki.xml
printf "Number of articles: "
grep -o "<doc" dewiki.xml | wc -w
sed -i 's/<[^>]*>//g' dewiki.xml
rm -rf extracted
python preprocessing.py dewiki.xml corpus/dewiki.corpus -psub
rm dewiki.xml
Training:
python training.py corpus/ model/my.model -s 300 -w 5 -n 10 -m 50
Subsequently the evaluation.py script can be used to evaluate the trained model:
python evaluation.py model/my.model -u -t 10
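For reference, the analogy-style questions can also be scored with gensim's built-in evaluation. A minimal sketch, assuming the question files follow gensim's analogy format (": section" headers followed by four-word lines) and the model trained above:

```python
# Hedged sketch: scoring the trained model on an analogy question file with
# gensim's built-in evaluation (evaluation.py covers further test types such
# as the best-match and doesn't-fit tasks).
from gensim.models import Word2Vec

model = Word2Vec.load('model/my.model')
score, sections = model.wv.evaluate_word_analogies('syntactic.questions')
print(f'overall analogy accuracy: {score:.3f}')
for section in sections:
    correct, incorrect = len(section['correct']), len(section['incorrect'])
    total = correct + incorrect
    if total:
        print(f"{section['section']}: {correct / total:.3f}")
```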
Further examples and code explanations can be found in the following IPython notebooks:
Semantic arithmetic
With basic vector arithmetic it is possible to show the meaning of words represented by the model: word vectors are added or subtracted, and cosine similarity is used to find the vector(s) nearest to the result. Some interesting examples are shown below:
Frau + Kind = Mutter (0.831)
Frau + Hochzeit = Ehefrau (0.795)
A common family relationship: a woman with a child added is a mother. In word2vec terms: adding the vector of child to the vector of woman results in a vector which is closest to mother, with a comparatively high cosine similarity of 0.831. In the same way, a woman with a wedding results in a wife.
Obama - USA + Russland = Putin (0.780)
The model is able to find the leader of a given country. Here Obama without USA is the feature of a country leader. Adding this feature to Russia results in Putin. It is also successful for other countries.
Verwaltungsgebaeude + Buecher = Bibliothek (0.718)
Verwaltungsgebaeude + Buergermeister = Rathaus (0.746)
Haus + Filme = Kino (0.713)
Haus + Filme + Popcorn = Kino (0.721)
The relationship between a building and its function is found correctly. Here an administration building with books is logically the library, and an administration building with a mayor is the city hall. Moreover, a house with movies results in a cinema. Note that when adding popcorn to the equation, the resulting vector gets a little closer to the vector of the word cinema.
Becken + Wasser = Schwimmbecken (0.790)
Sand + Wasser = Schlamm (0.792)
Meer + Sand = Strand (0.725)
Some nice examples with water: sand and water result in mud, sea and sand result in beach, and a basin with water is a pool.
Planet + Wasser = Erde (0.717)
Planet - Wasser = Pluto (0.385)
The main feature of our planet is correctly represented by the model: a planet with water is the earth, while a planet without water is Pluto. That is not quite accurate, though, since about one third of Pluto consists of water ice...
Kerze + Feuerzeug = brennende_Kerze (0.768)
Here is quite a good example of a semantically correct guess of a bigram token: candle and lighter result in a burning_candle.
The examples shown above are the results of a quick manual search for useful vector equations in the model. There are more amazing semantic relations for sure.
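These equations can be reproduced with gensim's most_similar() method, which adds the positive vectors, subtracts the negative ones and returns the nearest words by cosine similarity. A minimal sketch, assuming the model trained above:

```python
# Hedged sketch: reproducing the vector equations with gensim's most_similar().
from gensim.models import Word2Vec

model = Word2Vec.load('model/my.model')

# Frau + Kind ~ Mutter
print(model.wv.most_similar(positive=['Frau', 'Kind'], topn=3))

# Obama - USA + Russland ~ Putin
print(model.wv.most_similar(positive=['Obama', 'Russland'], negative=['USA'], topn=3))
```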
Visualizing features with PCA
Principal Component Analysis (PCA) is a method to reduce the number of dimensions of high-dimensional vectors while keeping the main features (the principal components). Here the 300-dimensional vectors of my German language model were reduced to a two-dimensional representation and plotted with Python's matplotlib for some word classes.
The plots above are created with the visualize.py script of this project. Some further examples and code explanations can be found in the PCA IPython notebook.
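A projection of this kind can be sketched in a few lines, assuming scikit-learn's PCA and matplotlib; visualize.py is the authoritative implementation, and the word list below is only illustrative:

```python
# Hedged sketch: reducing a few 300-dimensional word vectors to 2-D with
# scikit-learn's PCA and plotting them with matplotlib. The word list is
# only an example, not taken from the project's plots.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from gensim.models import Word2Vec

model = Word2Vec.load('model/my.model')
words = ['Mann', 'Frau', 'Kind', 'Koenig', 'Koenigin']  # example word class
vectors = [model.wv[word] for word in words]

# reduce 300 dimensions to 2 principal components
points = PCA(n_components=2).fit_transform(vectors)

for (x, y), word in zip(points, words):
    plt.scatter(x, y)
    plt.annotate(word, xy=(x, y))
plt.show()
```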
Download
Model
The German language model, trained with word2vec on the German Wikipedia (15th May 2015) and German news articles (15th May 2015):
german.model [704 MB]
Syntactic Questions
10k questions with German umlauts:
syntactic.questions
The same 10k questions with transformed German umlauts:
syntactic.questions.nouml
Evaluation source files:
adjectives.txt
nouns.txt
verbs.txt
Semantic Questions
300 opposite questions with German umlauts:
semantic_op.questions
The same 300 opposite questions with transformed German umlauts:
semantic_op.questions.nouml
540 best match questions with German umlauts:
semantic_bm.questions
The same 540 best match questions with transformed German umlauts:
semantic_bm.questions.nouml
110 doesn't fit questions with German umlauts:
semantic_df.questions
The same 110 doesn't fit questions with transformed German umlauts:
semantic_df.questions.nouml
Evaluation source files:
opposite.txt
bestmatch.txt
doesntfit.txt
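As a side note on the doesn't fit questions above: this task corresponds to gensim's doesnt_match() method, which returns the word whose vector lies furthest from the mean of the given group. A minimal sketch with illustrative words (not taken from doesntfit.txt):

```python
# Hedged sketch of the "doesn't fit" task with gensim's doesnt_match().
# The words are illustrative examples, not entries from doesntfit.txt.
from gensim.models import Word2Vec

model = Word2Vec.load('model/my.model')
print(model.wv.doesnt_match(['Apfel', 'Birne', 'Kirsche', 'Auto']))  # expected: 'Auto'
```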