In addition to natural language processing, you can also use word embeddings (Word2Vec) to handle categorical features

When using machine learning methods to solve problems, having the right data is critical. Unfortunately, raw data is often "dirty" and unstructured. Natural Language Processing (NLP) practitioners know this well, because the data they work with is text. Since most machine learning algorithms do not accept raw strings as input, word embeddings are used to transform the text before feeding it to the learning algorithm. But this situation is not unique to text data: it also arises in standard, non-NLP tasks in the form of categorical features. In fact, many of us struggle with processing categorical features, so what role can word embeddings play in that scenario?

The goal of this post is to show how a word embedding method, Word2Vec (Mikolov et al., 2013), can transform a categorical feature with a large number of modalities into a smaller set of numerical features that are not only easy to use, but also capture the relationships between the modalities, much in the way classic word embeddings capture relationships in language.

Word2Vec

You shall know a word by the company it keeps. (Firth, J. R. 1957:11)

The quote above accurately describes the goal of Word2Vec: it tries to determine the meaning of a word by analyzing its neighbors (also called its context). There are two model architectures for this approach: CBOW and Skip-Gram. Given a corpus, the model loops over the words of each sentence and either predicts the neighbors (the context) from the current word, or predicts the current word from its context. The former is called Skip-Gram and the latter is called Continuous Bag of Words (CBOW). The number of words considered in each context is controlled by a parameter called the window size.
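As a quick sketch (the post itself shows no code), both architectures and the window size appear as parameters when training a model with the gensim library, which is assumed here purely for illustration:

```python
from gensim.models import Word2Vec

# A toy corpus: one list of tokens per sentence.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=1 selects Skip-Gram (predict the context from the current word),
# sg=0 selects CBOW (predict the current word from its context);
# window controls how many neighbors on each side count as context.
skipgram = Word2Vec(sentences, vector_size=16, window=2, sg=1, min_count=1, epochs=100)
cbow = Word2Vec(sentences, vector_size=16, window=2, sg=0, min_count=1, epochs=100)

print(skipgram.wv["cat"][:5])  # first 5 values of the learned 16-dimensional vector for "cat"
```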

If you choose the Skip-Gram method, Word2Vec uses a shallow neural network, i.e., a network with a single hidden layer, to learn the word embeddings. The network first initializes its weights randomly, then uses each word to predict its context, iteratively adjusting the weights during training to minimize the prediction error. After a reasonably successful training run, the vector for each word is obtained by multiplying the network's weight matrix by the one-hot vector of that word.
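To make that last step concrete, here is a minimal numpy sketch (not the author's code) showing that multiplying the weight matrix by a word's one-hot vector simply selects that word's row, i.e., its embedding:

```python
import numpy as np

vocab_size, embedding_dim = 5, 3  # toy sizes for illustration
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, embedding_dim))  # input weight matrix, one row per word

word_index = 2                      # index of the word in the vocabulary
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

embedding = one_hot @ W             # multiplying by a one-hot vector...
assert np.allclose(embedding, W[word_index])  # ...just selects that word's row
```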

Note: In addition to allowing text data to be represented numerically, the resulting embeddings also learn some interesting relationships between words that can be used to answer questions like: a king is to a queen as a father is to ...?
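For instance, with gensim's downloadable pre-trained GloVe vectors (an assumption for illustration; the post does not name a specific model), the analogy query can be expressed as:

```python
import gensim.downloader as api

# Small pre-trained GloVe vectors (downloaded on first use).
wv = api.load("glove-wiki-gigaword-50")

# "a king is to a queen as a father is to ...?"  ->  queen - king + father
print(wv.most_similar(positive=["queen", "father"], negative=["king"], topn=3))
# the top result is typically "mother"
```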

If you want to learn about Word2Vec in more detail, you can check out this Stanford course or the TensorFlow tutorial.

Application

We provide online math exercises on the Kwyk platform (https://www.kwyk.fr/). Teachers assign homework to their students, and some data is stored every time an exercise is completed. We then use the collected data to assess each student's level and give them tailored review exercises to help them improve. For each solved exercise, we store a set of identifiers telling us which exercise it was, which student answered it, which chapter it belongs to, and so on. We also store a score, 0 or 1, depending on whether the student solved the problem successfully. To assess a student's level, we then have to predict this score and take the probability of success output by our classifier.

As you can see, many of our features are categorical. Usually, when the number of modalities is small enough, you can simply convert an n-modality categorical feature into n-1 dummy variables and train on those. But when there are thousands of modalities, as is the case for some of our features, relying on dummy variables becomes inefficient and impractical.
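A small pandas sketch (with a hypothetical `exercise_id` column, not the actual Kwyk schema) shows why this breaks down at scale:

```python
import pandas as pd

# A toy categorical column; imagine "exercise_id" with thousands of distinct values instead of 4.
df = pd.DataFrame({"exercise_id": ["ex_1", "ex_2", "ex_3", "ex_4", "ex_1"]})

# n-1 dummy variables per categorical feature (drop_first avoids the redundant column).
dummies = pd.get_dummies(df["exercise_id"], prefix="exercise", drop_first=True)
print(dummies.shape)  # (5, 3) here, but (n_rows, n_modalities - 1) in general:
                      # with thousands of modalities this matrix becomes huge and sparse.
```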

To solve this problem, we use a little trick that lets Word2Vec convert these categorical features into a fairly small number of usable continuous features. To illustrate the idea, let's take "exercise_id" as an example: this categorical feature tells us which exercise was solved. To use Word2Vec, we need to provide a corpus, a set of sentences to feed to the algorithm. However, the raw feature is just a list of IDs, which is not a corpus per se: the order is arbitrary, and an ID carries no information about its neighbors. Our trick is to treat each homework assignment given by a teacher as a "sentence", i.e., a series of exercise_ids. As a result, exercises naturally end up grouped by level, chapter, and so on, and Word2Vec can learn exercise embeddings (the counterpart of word embeddings) directly from these sentences, as the sketch below shows.
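A minimal sketch of this trick, assuming gensim and hypothetical column names (`homework_id`, `exercise_id`; the real schema is not shown in the post):

```python
import pandas as pd
from gensim.models import Word2Vec

# Toy log of completed exercises, grouped by the homework assignment they belong to.
log = pd.DataFrame(
    {
        "homework_id": [1, 1, 1, 2, 2, 3, 3, 3],
        "exercise_id": ["ex_12", "ex_7", "ex_31", "ex_7", "ex_9", "ex_31", "ex_9", "ex_12"],
    }
)

# Each homework assignment becomes one "sentence": a list of exercise ids.
sentences = log.groupby("homework_id")["exercise_id"].apply(list).tolist()

# Train Word2Vec on these sentences exactly as if they were natural-language text.
model = Word2Vec(sentences, vector_size=32, window=5, min_count=1, sg=1, epochs=50)

# The learned embedding of an exercise can now replace the raw categorical id.
print(model.wv["ex_7"])
```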

In fact, it is precisely because of these artificial sentences that we can use Word2Vec and get beautiful results:


As we can see, the resulting embedding has structure. The 3D projection of the exercises is spiral-shaped, with higher-level exercises following lower-level ones. This means the embedding successfully learned to distinguish exercises of different levels and to group exercises of similar levels together. But that's not all: using a non-linear dimensionality reduction technique, we can reduce the entire embedding to a single real-valued variable that preserves this property. In other words, we obtain a feature describing the complexity of an exercise: it is smallest for 6th-grade exercises and grows as the exercises become more complex, reaching its maximum for 12th-grade exercises.
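The post does not name the dimensionality reduction technique used; as an illustration only, t-SNE with a single component is one way to collapse the embeddings into such a one-dimensional "complexity" feature:

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy stand-in for the learned exercise embeddings (one 32-d vector per exercise id);
# in practice these would come from the Word2Vec model trained on the homework "sentences".
rng = np.random.default_rng(0)
exercise_ids = ["ex_7", "ex_9", "ex_12", "ex_31"]
X = rng.normal(size=(len(exercise_ids), 32))

# Non-linear reduction to a single component: one real-valued feature per exercise,
# usable directly by a classifier in place of the high-cardinality categorical id.
complexity = TSNE(n_components=1, perplexity=2, random_state=0).fit_transform(X)

for ex, c in zip(exercise_ids, complexity.ravel()):
    print(ex, float(c))
```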

Furthermore, just as Mikolov did with English words, the embeddings also learn relationships between exercises:

The figure above shows some examples of the relationships our embeddings are able to learn. When we ask "an exercise about adding numbers is to an exercise about subtracting numbers as an exercise about adding times is to ...?", the embeddings answer: "an exercise about subtracting times". Concretely, this means that if we take embedding[Subtract(Numbers)] - embedding[Add(Numbers)] and add it to the embedding of an exercise where the student must add times (hours, minutes, etc.), then the closest embedding is that of an exercise about subtracting times.
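As a sketch of how such an analogy query would be expressed (the exercise IDs below are hypothetical and the tiny corpus exists only so the snippet runs end to end; it does not reproduce the post's result):

```python
from gensim.models import Word2Vec

# Hypothetical exercise ids standing in for Add(Numbers), Subtract(Numbers), Add(Time), Subtract(Time).
sentences = [
    ["ex_add_numbers", "ex_subtract_numbers", "ex_add_time", "ex_subtract_time"],
    ["ex_subtract_time", "ex_add_time", "ex_subtract_numbers", "ex_add_numbers"],
]
model = Word2Vec(sentences, vector_size=8, window=4, sg=1, min_count=1, epochs=200)

# "Add(Numbers) is to Subtract(Numbers) as Add(Time) is to ...?" is answered with the
# usual vector arithmetic: Subtract(Numbers) - Add(Numbers) + Add(Time).
print(model.wv.most_similar(
    positive=["ex_subtract_numbers", "ex_add_time"],
    negative=["ex_add_numbers"],
    topn=1,
))
```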

Conclusion

In summary, word embedding techniques are useful for converting text data into real-valued vectors that machine learning algorithms can use directly. Although word embeddings are mainly used in natural language processing applications such as machine translation, we have shown, through a concrete example from Kwyk, that they can also be applied to categorical features. To use a technique like Word2Vec, however, you must build a corpus, that is, a set of sentences in which the labels are arranged so that a context is implicitly created. In the example above, we used the homework assignments given on the website as "sentences" of exercises and learned exercise embeddings from them. As a result, we obtained new numerical features that successfully capture the relationships between exercises and are more useful than the raw IDs we started with.

A big thank you to Christophe Gabar, one of our developers at Kwyk, who came up with the idea of using Word2Vec on categorical features.

Original link: https://medium.com/towards-data-science/a-non-nlp-application-of-word2vec-c637e35d3668
