In addition to natural language processing, you can also use word embeddings (Word2Vec) to handle categorical features

When using machine learning methods to solve problems, having the right data is critical. Unfortunately, raw data is often "dirty" and unstructured. Natural Language Processing (NLP) practitioners know this well, because the data they work with is text. Since most machine learning algorithms do not accept raw strings as input, word embeddings are used to transform the text before feeding it to the learning algorithm. But this situation is not unique to text data: it also arises in standard, non-NLP tasks in the form of categorical features. In fact, many of us struggle with processing categorical features, so what role can word embeddings play in that scenario?

The goal of this post is to show how a word embedding method, Word2Vec (Mikolov et al., 2013), can transform a categorical feature with a large number of modalities into a smaller set of numerical features that are not only easy to use, but also capture the relationships between the modalities, much in the way classic word embeddings capture relationships in language.

Word2Vec

You shall know a word by the company it keeps. (Firth, J. R. 1957:11)

The quote above accurately describes the goal of Word2Vec: it tries to determine the meaning of a word by analyzing its neighbors (also called its context). There are two model architectures for this approach: CBOW and Skip-Gram. Given a corpus, the model loops over the words of each sentence and either predicts the neighbors (the context) from the current word, or predicts the current word from its context. The former is called Skip-Gram and the latter is called Continuous Bag of Words (CBOW). The number of words considered in each context is controlled by a parameter called the window size.
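As a quick sketch (the post itself shows no code), both architectures and the window size appear as parameters when training a model with the gensim library, which is assumed here purely for illustration:

```python
from gensim.models import Word2Vec

# A toy corpus: one list of tokens per sentence.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=1 selects Skip-Gram (predict the context from the current word),
# sg=0 selects CBOW (predict the current word from its context);
# window controls how many neighbors on each side count as context.
skipgram = Word2Vec(sentences, vector_size=16, window=2, sg=1, min_count=1, epochs=100)
cbow = Word2Vec(sentences, vector_size=16, window=2, sg=0, min_count=1, epochs=100)

print(skipgram.wv["cat"][:5])  # first 5 values of the learned 16-dimensional vector for "cat"
```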

If you choose the Skip-Gram method, Word2Vec uses a shallow neural network, i.e., a network with a single hidden layer, to learn the word embeddings. The network first initializes its weights randomly, then uses each word to predict its context, iteratively adjusting the weights during training to minimize the prediction error. After a reasonably successful training run, the vector for each word is obtained by multiplying the network's weight matrix by the one-hot vector of that word.
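To make that last step concrete, here is a minimal numpy sketch (not the author's code) showing that multiplying the weight matrix by a word's one-hot vector simply selects that word's row, i.e., its embedding:

```python
import numpy as np

vocab_size, embedding_dim = 5, 3  # toy sizes for illustration
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, embedding_dim))  # input weight matrix, one row per word

word_index = 2                      # index of the word in the vocabulary
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

embedding = one_hot @ W             # multiplying by a one-hot vector...
assert np.allclose(embedding, W[word_index])  # ...just selects that word's row
```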

Note: In addition to allowing text data to be represented numerically, the resulting embeddings also learn some interesting relationships between words that can be used to answer questions like: a king is to a queen as a father is to ...?
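For instance, with gensim's downloadable pre-trained GloVe vectors (an assumption for illustration; the post does not name a specific model), the analogy query can be expressed as:

```python
import gensim.downloader as api

# Small pre-trained GloVe vectors (downloaded on first use).
wv = api.load("glove-wiki-gigaword-50")

# "a king is to a queen as a father is to ...?"  ->  queen - king + father
print(wv.most_similar(positive=["queen", "father"], negative=["king"], topn=3))
# the top result is typically "mother"
```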

If you want to learn about Word2Vec in more detail, you can check out this Stanford course or the TensorFlow tutorial.

Application

We provide online math exercises on the Kwyk platform (https://www.kwyk.fr/). Teachers assign homework to their students, and some data is stored every time an exercise is completed. We then use the collected data to assess each student's level and give them tailored review exercises to help them improve. For each solved exercise, we store a set of identifiers telling us which exercise it was, which student answered it, which chapter it belongs to, and so on. We also store a score, 0 or 1, depending on whether the student solved the problem successfully. To assess a student's level, we then have to predict this score and take the probability of success output by our classifier.

As you can see, many of our features are categorical. Usually, when the number of modalities is small enough, you can simply convert an n-modality categorical feature into n-1 dummy variables and train on those. But when there are thousands of modalities, as is the case for some of our features, relying on dummy variables becomes inefficient and impractical.
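A small pandas sketch (with a hypothetical `exercise_id` column, not the actual Kwyk schema) shows why this breaks down at scale:

```python
import pandas as pd

# A toy categorical column; imagine "exercise_id" with thousands of distinct values instead of 4.
df = pd.DataFrame({"exercise_id": ["ex_1", "ex_2", "ex_3", "ex_4", "ex_1"]})

# n-1 dummy variables per categorical feature (drop_first avoids the redundant column).
dummies = pd.get_dummies(df["exercise_id"], prefix="exercise", drop_first=True)
print(dummies.shape)  # (5, 3) here, but (n_rows, n_modalities - 1) in general:
                      # with thousands of modalities this matrix becomes huge and sparse.
```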

To solve this problem, we use a little trick that lets Word2Vec convert these categorical features into a fairly small number of usable continuous features. To illustrate the idea, let's take "exercise_id" as an example: this categorical feature tells us which exercise was solved. To use Word2Vec, we need to provide a corpus, a set of sentences to feed to the algorithm. However, the raw feature is just a list of IDs, which is not a corpus per se: the order is arbitrary, and an ID carries no information about its neighbors. Our trick is to treat each homework assignment given by a teacher as a "sentence", i.e., a series of exercise_ids. As a result, exercises naturally end up grouped by level, chapter, and so on, and Word2Vec can learn exercise embeddings (the counterpart of word embeddings) directly from these sentences, as the sketch below shows.
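A minimal sketch of this trick, assuming gensim and hypothetical column names (`homework_id`, `exercise_id`; the real schema is not shown in the post):

```python
import pandas as pd
from gensim.models import Word2Vec

# Toy log of completed exercises, grouped by the homework assignment they belong to.
log = pd.DataFrame(
    {
        "homework_id": [1, 1, 1, 2, 2, 3, 3, 3],
        "exercise_id": ["ex_12", "ex_7", "ex_31", "ex_7", "ex_9", "ex_31", "ex_9", "ex_12"],
    }
)

# Each homework assignment becomes one "sentence": a list of exercise ids.
sentences = log.groupby("homework_id")["exercise_id"].apply(list).tolist()

# Train Word2Vec on these sentences exactly as if they were natural-language text.
model = Word2Vec(sentences, vector_size=32, window=5, min_count=1, sg=1, epochs=50)

# The learned embedding of an exercise can now replace the raw categorical id.
print(model.wv["ex_7"])
```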

In fact, it is precisely because of these artificial sentences that we can use Word2Vec and get beautiful results:


As we can see, the resulting embedding has structure. The 3D projection of the exercises is spiral-shaped, with higher-level exercises following lower-level ones. This means the embedding successfully learned to distinguish exercises of different levels and to group exercises of similar levels together. But that's not all: using a non-linear dimensionality reduction technique, we can reduce the entire embedding to a single real-valued variable that preserves this property. In other words, we obtain a feature describing the complexity of an exercise: it is smallest for 6th-grade exercises and grows as the exercises become more complex, reaching its maximum for 12th-grade exercises.
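The post does not name the dimensionality reduction technique used; as an illustration only, t-SNE with a single component is one way to collapse the embeddings into such a one-dimensional "complexity" feature:

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy stand-in for the learned exercise embeddings (one 32-d vector per exercise id);
# in practice these would come from the Word2Vec model trained on the homework "sentences".
rng = np.random.default_rng(0)
exercise_ids = ["ex_7", "ex_9", "ex_12", "ex_31"]
X = rng.normal(size=(len(exercise_ids), 32))

# Non-linear reduction to a single component: one real-valued feature per exercise,
# usable directly by a classifier in place of the high-cardinality categorical id.
complexity = TSNE(n_components=1, perplexity=2, random_state=0).fit_transform(X)

for ex, c in zip(exercise_ids, complexity.ravel()):
    print(ex, float(c))
```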

Furthermore, just as Mikolov did with English words, the embeddings also learn relationships between exercises:

The figure above shows some examples of the relationships our embeddings are able to learn. When we ask "an exercise about adding numbers is to an exercise about subtracting numbers as an exercise about adding times is to ...?", the embeddings answer: "an exercise about subtracting times". Concretely, this means that if we take embedding[Subtract(Numbers)] - embedding[Add(Numbers)] and add it to the embedding of an exercise where the student must add times (hours, minutes, etc.), then the closest embedding is that of an exercise about subtracting times.
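As a sketch of how such an analogy query would be expressed (the exercise IDs below are hypothetical and the tiny corpus exists only so the snippet runs end to end; it does not reproduce the post's result):

```python
from gensim.models import Word2Vec

# Hypothetical exercise ids standing in for Add(Numbers), Subtract(Numbers), Add(Time), Subtract(Time).
sentences = [
    ["ex_add_numbers", "ex_subtract_numbers", "ex_add_time", "ex_subtract_time"],
    ["ex_subtract_time", "ex_add_time", "ex_subtract_numbers", "ex_add_numbers"],
]
model = Word2Vec(sentences, vector_size=8, window=4, sg=1, min_count=1, epochs=200)

# "Add(Numbers) is to Subtract(Numbers) as Add(Time) is to ...?" is answered with the
# usual vector arithmetic: Subtract(Numbers) - Add(Numbers) + Add(Time).
print(model.wv.most_similar(
    positive=["ex_subtract_numbers", "ex_add_time"],
    negative=["ex_add_numbers"],
    topn=1,
))
```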

Conclusion

In summary, word embedding techniques are useful for converting text data into real-valued vectors that machine learning algorithms can use directly. Although word embeddings are mainly used in natural language processing applications such as machine translation, we have shown, through a concrete example from Kwyk, that they can also be applied to categorical features. To use a technique like Word2Vec, however, you must build a corpus, that is, a set of sentences in which the labels are arranged so that a context is implicitly created. In the example above, we used the homework assignments given on the website as "sentences" of exercises and learned exercise embeddings from them. As a result, we obtained new numerical features that successfully capture the relationships between exercises and are more useful than the raw IDs we started with.

A big thank you to Christophe Gabar, one of our developers at Kwyk, who came up with the idea of using Word2Vec on categorical features.

Original link: https://medium.com/towards-data-science/a-non-nlp-application-of-word2vec-c637e35d3668
