30.10.2019 | Felix Kaus

Using Word Embeddings for Business Insights

Transforming raw text into applicable business insights for data-driven decision making is no trivial task. Following the Natural Language Processing (NLP) breakthrough of a Google research team on Word Embeddings, words or even sentences can be efficiently represented as vectors (please refer to Mikolov et al., 2013a, and Mikolov et al., 2013b).

Given these vectors, unstructured text is transformed into a machine-interpretable format, opening up a wide range of statistical and Machine Learning techniques for extracting valuable business knowledge. For example, Word Embeddings enable businesses to automatically analyze project offers for requirements, construct skill portfolios, detect knowledge gaps, or assess company-wide market fit.

Getting the Gist of Word Embeddings

We use Word Embeddings – as opposed to commonly used Bag-of-Words models – because their data structure is both memory- and Machine-Learning-friendly. Furthermore, the word context is captured within the model (similar to N-Gram models). A Word Embedding maps each word to an M-dimensional real-valued vector. Given such vectors, we can compute similarities both syntactically and semantically. Moreover, the original paper shows an algebraic operation performed on the word vectors:

vector(‘King’) − vector(‘Man’) + vector(‘Woman’) ≈ vector(‘Queen’)
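The arithmetic can be illustrated with hand-picked toy vectors (a sketch; a real model learns such directions from data, and the answer is the nearest neighbour of the result by cosine similarity):

```python
import numpy as np

# Hand-picked 2-D toy embeddings: axis 0 ~ "royal", axis 1 ~ "male".
# A trained model would learn such directions from data.
vectors = {
    "king":  np.array([0.9, 0.8]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.1]),
    "queen": np.array([0.9, 0.1]),
    "apple": np.array([0.0, 0.5]),  # unrelated distractor
}

query = vectors["king"] - vectors["man"] + vectors["woman"]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Nearest neighbour of the query, excluding the three input words:
best = max((w for w in vectors if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(query, vectors[w]))
print(best)  # queen
```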

Widespread implementations of Word Embeddings include Mikolov et al.'s Word2Vec, FastText, and GloVe. Word2Vec proposes two different architectures for building a model: Continuous Bag-of-Words (CBOW) and Skip-Gram. Both are three-layer neural networks consisting of an Input, Projection, and Output layer. The following figure shows the Skip-Gram model:

This architecture works according to the principle of “you shall know a word by the company it keeps” (Firth, 1957). It basically answers the question: What are the most probable words occurring in the immediate neighborhood of a search word? Think of a Skip-Gram as a tuple consisting of a word pair and a binary label. If the word is in the immediate neighborhood, it is labeled 1, else 0:

((search_word, word_in_neighborhood), 1) or ((search_word, word_not_in_neighborhood), 0)
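Generating the positive skip-gram pairs for a sentence can be sketched as follows (negative pairs, labeled 0, are drawn from a noise distribution during training and are omitted here):

```python
def skip_gram_pairs(tokens, window=2):
    """Positive skip-gram pairs ((center, context), 1) for one sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        # All words within `window` positions of the center word:
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append(((center, tokens[j]), 1))
    return pairs

sentence = "you shall know a word".split()
print(skip_gram_pairs(sentence, window=1))
```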

Furthermore, the source code comprises two kinds of algorithms to choose from: Hierarchical Softmax and Negative Sampling. In this example, Hierarchical Softmax is used. This article won’t go into further detail here (please refer to the original papers or the work by Rong, 2014). Based on the word frequencies, a binary classification tree is built. The algorithm traverses down the tree and produces a probability for every word in the vocabulary, thus learning which words will most likely occur next to the input word. After model training, this information is also encoded in the vector space, letting us efficiently conduct similarity calculations.

Setting up a Word2Vec-Model

Here, we use the excellent Python library gensim to showcase the Word2Vec modelling. The library can be accessed via pip:

!pip install gensim

Then we set up the proper imports, load the text data, train, and save the model:

Subtleties behind the code

As shown above, given an input text (this can be quite large, e.g., all Wikipedia articles), a model can be trained within a few lines of code. In practice, however, one must carefully evaluate all the available hyperparameters and should construct a memory-friendly data pipeline that streams one sequence at a time. While the latter is a Data Engineering problem, which also includes complex data retrieval and NLP preprocessing such as N-Gram training, the former is one of the core Machine Learning problems, with no analytical solution available. We recommend testing various hyperparameter combinations against publicly available model benchmarks.
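Such a streaming pipeline can be sketched as a simple iterable that yields one tokenised sentence at a time; gensim accepts any such iterable as input (a tiny corpus file is created here purely for illustration):

```python
# Write a tiny demo corpus, one sentence per line (illustration only).
with open("corpus.txt", "w", encoding="utf-8") as fh:
    fh.write("Word embeddings map words to vectors\n")
    fh.write("Streaming keeps memory usage flat\n")

class StreamingCorpus:
    """Yield one tokenised sentence at a time instead of loading
    the whole corpus into memory."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as fh:
            for line in fh:
                yield line.lower().split()

sentences = list(StreamingCorpus("corpus.txt"))
print(sentences[0])
```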

To conclude, a proper Word2Vec-training environment is highly dependent on skilled Data Engineering and Machine Learning expertise.

It is also possible to make use of already trained models. These models are publicly available in binary form, which prevents the user from training the model further. Given that a proper model has been trained and saved, we only care about the resulting vectors, which can be gathered as follows:

Use Case 1: Match people with projects

For this experiment, we gathered publicly available text data from German freelancer platforms for a time span of over three months. The following screenshot displays the most popular words within the text data:

Also, we collected hundreds of real, anonymized skill sets from company sources. These projects and skill sets served as the training set for our Word2Vec model. One use case is to compare a set of skills with the available project texts to find the best matches:

As we can see, employee no. 1 has a certain skill set, which is displayed together with their Top-3 project leads within the same JSON file (note: for display purposes, employee no. 1 is not a real person). To compare skill sets with projects we use the cosine similarity, which can be computed as follows:

cos(a, b) = (Σᵢ aᵢbᵢ) / (√(Σᵢ aᵢ²) · √(Σᵢ bᵢ²)), with i running from 1 to d,

where each vector consists of d elements. This type of comparison is useful because it is indifferent to the actual length of the word sets. With these similarities at hand, one natural use case is to build and analyze clusters of words.
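The formula translates directly into NumPy (the example vectors are illustrative; in practice a and b would be word vectors or e.g. averaged vectors of a skill set and a project text):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two d-dimensional vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Dot product divided by the product of the Euclidean norms:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

skills  = np.array([0.2, 0.9, 0.1])
project = np.array([0.3, 0.8, 0.0])
print(cosine_similarity(skills, project))
```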

Use Case 2: Generating a skill tree out of raw text

Clustering algorithms try to minimize the distance between objects within a cluster whilst maximizing the distance between clusters. One way to find such clusters is hierarchical agglomerative clustering (available via the SciPy library). Fortunately, you don’t have to predetermine the number of clusters (as in k-means), and the result can be automatically displayed as a natural hierarchy of skills via a dendrogram:
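Such a clustering can be sketched with SciPy's hierarchical clustering and the Ward linkage criterion (toy coordinates stand in for trained word vectors; the words are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# In practice, these would be word vectors from a trained model.
words = ["sap", "abap", "java", "jsf", "primefaces"]
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.1],
    [0.2, 0.8, 0.3],
    [0.2, 0.7, 0.4],
])

# Agglomerative clustering with the Ward linkage criterion:
Z = linkage(vectors, method="ward")

# scipy.cluster.hierarchy.dendrogram(Z, labels=words) would render
# the tree; here we just cut it into two flat clusters:
clusters = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(words, clusters)))
```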

We first computed the five most similar words to each of “SAP” and “JAVA” and constructed the tree with the Ward algorithm. One can see the big gap between the two keyword sets of “SAP” and “JAVA”. Traversing down the tree, we find further clusters: “primefaces” and “jsf” are more similar to each other than either is to “rest”.

This example shows that the vector space is indeed capable of distinguishing between two different concepts. To finish, let’s take a more general concept like “DATA_SCIENCE”, collect the 15 most similar vectors, and plot the results:

From top down, there are two big groups: one we refer to as “cloud_themed” on the left-hand side, the other as “general_data_science”. According to the “cloud_themed”-cluster, you should learn Scala, whereas Python seems to be the choice for “general_data_science”.

This simple example shows the unique and valuable insights such an analysis can provide. With the amount of unused, unstructured text data at our disposal, the NLP journey is just beginning. Here is a list of use cases already implemented for the German market:

  • Skillset Matching Algorithm
  • Topic Modelling: filter important tags from project leads
  • Trend Analysis of expert and market skills
  • Recommendation Systems via Clustering and Market Basket Analysis
  • Automated Ontology Learning

We are able to provide unique and insightful business knowledge with our innovative approaches. Please, contact us for further information.


1. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

2. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).

3. Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146.

4. Pennington, J., Socher, R., & Manning, C. (2014, October). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543).

5. Rong, X. (2014). word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.

6. Rehurek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks.

7. Jones, E., Oliphant, T., & Peterson, P. (2001). SciPy: Open source scientific tools for Python.

data science machine learning neural networks nlp word embedding word2vec

Felix Kaus

Felix is a Machine Learning and Natural Language Processing enthusiast. He studied Business Information Systems at Karlsruhe University of Applied Sciences and Philipps University Marburg and specialises as a Full-Stack Data Scientist.
