Training a fastText model from scratch using Python

January 22, 2024

By now, the astonishing results that large language models produce surprise nobody. Models such as GPT-4, Bard, BERT or RoBERTa have sparked intense research and media attention, and have changed many people’s workflows. However, these models have issues. A common critique is that they function as black boxes: users do not know much about their training data or modelling choices. Moreover, training them usually requires gigantic datasets and processing power. There is therefore value in alternative models that researchers can train themselves, with full control over the input data and the training process. In this tutorial, we explain how to train a natural language processing model using fastText: a lightweight, easy-to-implement and efficient word embedding model that has shown good performance on various natural language tasks over the years.

First, a bit about word-embedding models. Word-embedding models are one type of natural language processing model. By producing real-valued vector representations of words, they offer a powerful way to capture the semantic meaning of words in large datasets, which is why they are widely used in diverse research applications. Indeed, their uses are numerous: semantic similarity, text generation, document representation, author recognition, knowledge graph construction, sentiment analysis, or bias detection (Caliskan et al., 2017).

Installation

To install fastText, we recommend the fasttext-wheel module on PyPI. To verify that the installation succeeded, import the package in a Python script or interpreter.
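A minimal installation command, assuming you use pip (other package managers will differ):

pip install fasttext-wheel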

>>> import fasttext

If there are no error messages, the installation succeeded and we can move on to the training part.

Training the model

Training data

To train fastText, you need a corpus: a large collection of text. The required size of these corpora varies depending on the research purpose: from several thousand to billions of words. Some research benefits from smaller, well-curated corpora; other research benefits from large unstructured corpora. However, while the exact size needed is hard to determine, do keep in mind that the text in the training data has to relate to your research question! If you want to use word embeddings for studying doctors’ notes, you need doctors’ notes - and not legal medical texts. If you want to study niche cultural sub-groups, you need data from these groups - and not necessarily a corpus of random Internet interactions. The corpus is an integral part of your research! Generally, the larger the research-related corpus you can get, the better.

In this tutorial we use a freely available corpus of Science-Fiction texts downloaded from Kaggle. Preferably, the text you feed to fastText should have each sentence on a new line.
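As a minimal sketch of that preprocessing step, assuming a hypothetical raw text file raw_scifi.txt, a hypothetical output filename, and a naive punctuation-based sentence splitter (a dedicated sentence tokenizer would do a better job), you could write something like:

import re

# Naive sentence splitting on ., ! and ? followed by whitespace;
# good enough for a rough first pass, not for careful linguistic work.
SENTENCE_END = re.compile(r'(?<=[.!?])\s+')

with open('raw_scifi.txt', encoding='utf-8') as infile:
    text = infile.read()

with open('scifi_one_sentence_per_line.txt', 'w', encoding='utf-8') as outfile:
    for sentence in SENTENCE_END.split(text):
        sentence = sentence.strip()
        if sentence:
            outfile.write(sentence + '\n')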

Hyperparameters

We will train an unsupervised fastText model, which means that a lot of implementation decisions need to be made. If you don’t have specific methodological reasons and/or you lack the time or computing power for a proper grid search, we suggest you go with the default parameter options - which perform reasonably well in many research contexts - but switch the ‘dim’ parameter to 300. Empirical research has shown that a dimensionality of 300 leads to good performance in most settings, even though it increases computational cost and training time. If you can afford to spend time on hyperparameters, you could tune the training model (CBOW or skipgram), the learning rate, the dimensionality of the vectors, the context window size, and more. You can see here the full list of available training parameters.
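For illustration, this is what a call that sets several of these parameters explicitly could look like (the values below simply spell out the interface; they are not recommendations):

>>> model = fasttext.train_unsupervised(
...     'internet_archive_scifi_v3.txt',
...     model='skipgram',   # 'skipgram' or 'cbow'
...     dim=300,            # dimensionality of the word vectors
...     ws=5,               # context window size
...     epoch=5,            # number of passes over the data
...     lr=0.05,            # learning rate
...     minCount=5,         # ignore words occurring fewer than 5 times
... )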

Fitting the model

We fit the model with the following command:

>>> model = fasttext.train_unsupervised('internet_archive_scifi_v3.txt', dim = 300)

Then, you can save the trained model so that you do not have to train it again. For this, pass the save_model() method the path to save the file to. Make sure the path ends in ‘.bin’ so the model is saved as a .bin file.

>>> model.save_model('scifi_fasttext_model.bin')

Re-opening a saved model is done with the load_model() method:

>>> model = fasttext.load_model('scifi_fasttext_model.bin')

Using word embeddings

Now that we have trained the model, the word embeddings are ready to be used. And, luckily, fastText comes with some nice functions to work with word embeddings! Here we highlight two possible uses of word embeddings: obtaining the most similar words, and analogies - but remember there are more possible uses. We start by simply retrieving the word embeddings. This can be done with either of the two following commands.

>>> model.get_word_vector('villain')
>>> model['villain']
array([ 0.01417591, -0.06866349,  0.09390495, -0.04146367,  0.10481305,
       -0.2541916 ,  0.26757774, -0.04365376, -0.02336818,  0.07684527,
       -0.05139925,  0.14692445,  0.07103274,  0.23373744, -0.28555775,
        ..............................................................
       -0.14082788,  0.27454248,  0.02602287,  0.03754443,  0.18067479,
        0.20172128,  0.02454677,  0.04874028, -0.17860755, -0.01387627,
        0.02247835,  0.05518318,  0.04844297, -0.2925061 , -0.05710272],
      dtype=float32)

Since fastText does not only train an embedding for the full word, but also for the character n-grams within each word, subwords and their embeddings can be accessed as follows:

>>> ngrams, hashes = model.get_subwords('villain')
>>> for ngram, h in zip(ngrams, hashes):
...     print(ngram, model.get_input_vector(h))

Note: the get_subwords() method returns two lists, one with the n-grams as strings, the other with hashes. These hashes are not the same as embeddings; rather, they are the identifiers that fastText uses to store and retrieve embeddings. Therefore, to get a (sub-)word embedding from a hash, the get_input_vector() method has to be used.

Furthermore, vectors can be created for full sentences as well:

>>> model.get_sentence_vector('the villain defeated the hero, tyrrany reigned throughout the galaxy for a thousand eons.')
array([-2.73631997e-02,  7.83981197e-03, -1.97590180e-02, -1.42770987e-02,
        6.88663125e-03, -1.63909234e-02,  5.72902411e-02,  1.44126266e-02,
       -1.64726824e-02,  8.55281111e-03, -5.33024594e-02,  4.74718548e-02,
        .................................................................
        3.30820642e-02,  7.64035881e-02,  7.33195152e-03,  4.60342802e-02,
        4.94049815e-03,  2.52075139e-02, -2.30138078e-02, -3.56832631e-02,
       -2.22732662e-03, -1.84207838e-02,  2.37668958e-03, -1.00214258e-02],
      dtype=float32)

Most similar words

A nice use case of fastText is retrieving similar words. For instance, you can retrieve the 10 words with the most similar meaning (i.e., the most similar word vectors) to a target word using a nearest-neighbour search based on cosine similarity.

>>> model.get_nearest_neighbors('villain')
[(0.9379335641860962, 'villainy'), (0.9019550681114197, 'villain,'), (0.890184223651886, 'villain.'), (0.8709720969200134, 'villains'), (0.8297745585441589, 'villains.'), (0.8225630521774292, 'villainous'), (0.8214142918586731, 'villains,'), (0.6485553979873657, 'Villains'), (0.6020095944404602, 'heroine'), (0.5941146612167358, 'villa,')]
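The scores in this output are cosine similarities between word vectors. If you want to compute such a score for a specific pair of words yourself, a small sketch using NumPy (assuming NumPy is installed; the word pair here is an arbitrary example) could look like this:

import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector norms;
    # values close to 1 indicate very similar directions.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(model['villain'], model['villainy']))
print(cosine_similarity(model['villain'], model['spaceship']))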

Interestingly, get_nearest_neighbors() also works for words that are not in the training corpus, including misspelled words!

>>> model.get_nearest_neighbors('vilain')
[(0.6722341179847717, 'villain'), (0.619519829750061, 'villain.'), (0.6137816309928894, 'lain'), (0.6128077507019043, 'villainous'), (0.609745979309082, 'villainy'), (0.6089878678321838, 'Glain'), (0.5980470180511475, 'slain'), (0.5925296545028687, 'villain,'), (0.5779100060462952, 'villains'), (0.5764451622962952, 'chaplain')]

Analogies

Another nice use of fastText is creating analogies. Since the word embedding vectors are created in relation to every other word in the corpus, these relations should be preserved in the vector space, so that analogies can be formed. For analogies, a triplet of words is required, following the formula ‘A is to B as [output] is to C’. For example, if we take ‘men is to father as [output] is to mother’, we get the expected answer of women.

>>> model.get_analogies('men', 'father', 'mother')
[(0.6985629200935364, 'women'), (0.6015384793281555, 'all'), (0.5977899432182312, 'man'), (0.5835891366004944, 'out'), (0.5830296874046326, 'now'), (0.5767865180969238, 'one'), (0.5711579322814941, 'in'), (0.5671708583831787, 'wingmen'), (0.567089855670929, 'women"'), (0.5663136839866638, 'were')]
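Under the hood, such an analogy query roughly corresponds to vector arithmetic on the embeddings: the query vector is approximately A - B + C (‘men’ - ‘father’ + ‘mother’), and the answer is the word whose vector lies closest to it. A toy sketch of that idea, scoring only a small hypothetical candidate list rather than the full vocabulary as fastText does:

import numpy as np

# Analogy arithmetic: 'men' is to 'father' as [?] is to 'mother'.
query = model['men'] - model['father'] + model['mother']

# Hypothetical candidates; fastText searches its full vocabulary instead.
candidates = ['women', 'child', 'ship', 'galaxy']
best = max(
    candidates,
    key=lambda w: np.dot(query, model[w])
    / (np.linalg.norm(query) * np.linalg.norm(model[w])),
)
print(best)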

However, since the model was trained on uncleaned data from a relatively small corpus, our output is not perfect. For example, with the following analogy triplet, the correct answer ‘bad’ comes fourth, after ‘villainy’, ‘villain.’ and ‘villain,’, showing that for a better model we should do some additional cleaning of our data (e.g., removing punctuation).

>>> model.get_analogies('good', 'hero', 'villain')
[(0.5228292942047119, 'villainy'), (0.5205934047698975, 'villain.'), (0.5122538208961487, 'villain,'), (0.5047158598899841, 'bad'), (0.483129620552063, 'villains.'), (0.4676515460014343, 'good"'), (0.4662466049194336, 'vill'), (0.46115875244140625, 'villains'), (0.4569159746170044, "good'"), (0.4529685974121094, 'excellent."')]

Conclusion

In this blogpost we have shown how to train a lightweight, efficient natural language processing model using fastText. After installing it, we have shown how to use some of fastText’s functions to train the model, retrieve word embeddings, and use them for different questions. While this was a toy example, we hope you found it inspiring for your own research! And remember, if you have a research idea that entails using natural language models and you are stuck or do not know how to start - you can contact us!


Bonus: Tips to improve your model performance

Depending on your research, various tweaks can be made to your data to improve fastText’s performance. For example, if multi-word phrases (e.g., Social Data Science) play a key role in your analyses, you might want to join these phrases in the data with a dash or underscore (e.g., Social_Data_Science) so that each phrase is trained as a single token, rather than as the sum of the tokens Social, Data, and Science.
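A minimal sketch of such a replacement step, assuming a hypothetical hand-curated phrase list and a hypothetical output filename (tools for automatic phrase detection exist but are not covered here):

# Hypothetical list of multi-word phrases to merge into single tokens.
phrases = ['Social Data Science', 'faster than light']

with open('internet_archive_scifi_v3.txt', encoding='utf-8') as infile:
    text = infile.read()

for phrase in phrases:
    text = text.replace(phrase, phrase.replace(' ', '_'))

with open('scifi_with_phrases.txt', 'w', encoding='utf-8') as outfile:
    outfile.write(text)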

As shown with the good-hero-villain analogy, removing punctuation and other non-alphabetic characters can help avoid learning representations for unwanted tokens. Stemming (reducing words to their stems) and lemmatization (converting words to their base form) can also be useful here. Similarly, two other ways to deal with unwanted tokens are to remove stop-words from your data, and to play around with the minCount parameter (i.e., the minimum number of times a word needs to occur in the data for it to get a trained token) when training your model.
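As a rough sketch of this kind of cleaning (lowercasing, stripping punctuation, and removing a small, hypothetical stop-word list), combined with a higher minCount at training time; this illustrates the idea rather than prescribing a pipeline:

import re
import fasttext

# Small hypothetical stop-word list; in practice you would use a fuller one.
stopwords = {'the', 'a', 'an', 'and', 'of', 'to', 'in'}

with open('internet_archive_scifi_v3.txt', encoding='utf-8') as infile, \
     open('scifi_cleaned.txt', 'w', encoding='utf-8') as outfile:
    for line in infile:
        # Lowercase and keep only alphabetic characters and whitespace.
        line = re.sub(r'[^a-z\s]', ' ', line.lower())
        tokens = [tok for tok in line.split() if tok not in stopwords]
        if tokens:
            outfile.write(' '.join(tokens) + '\n')

# Require a word to occur at least 10 times before it gets its own embedding.
model = fasttext.train_unsupervised('scifi_cleaned.txt', dim=300, minCount=10)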

Most importantly, try to gather as much knowledge as possible about the domain of your research. This tip may seem obvious, but a proper understanding of the topic you are researching is the most important skill to have when working with language models. Take the sci-fi corpus we used as an example: the 10th nearest neighbour of the word villain was the token ‘villa,’. If you did not know what either of those words means, you would not notice that this result seems fishy (since the model we created has low internal quality, it relies heavily on the trained n-grams; because both words contain the n-gram ‘villa’, they are rated as close in the vector space). Therefore, make sure to understand the domain to the best of your abilities and scrutinize your findings to get reliable results.