Some details and thoughts on “Guess the Picture Song”

Quickdraw’s CNN-RNN model

The QuickDraw model used in "Guess the Picture Song" is essentially a classification model. Its input is the coordinates of the stroke points plus a flag marking the starting point of each stroke. Several cascaded one-dimensional convolutions are applied first, followed by a BiLSTM layer whose outputs are summed. Finally, a softmax layer performs the classification.

The entire network structure is shown in the figure:


Model structure
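
As a rough sketch of the final stage described above (summing the per-timestep BiLSTM outputs and classifying with softmax), the following NumPy snippet may help. The function name, the array shapes, the random weights, and the masking of padded timesteps are all assumptions made for illustration; this is not the model's actual code.

```python
import numpy as np

def classify_from_rnn_outputs(outputs, length, weights, bias):
    """Sum per-timestep outputs over the valid steps, then apply softmax.

    outputs: (max_len, features) per-timestep RNN outputs (hypothetical)
    length:  number of valid (non-padding) timesteps
    weights: (features, num_classes); bias: (num_classes,)
    """
    # Assumed detail: padded timesteps are masked out before summing.
    mask = (np.arange(outputs.shape[0]) < length)[:, None]
    summed = np.sum(outputs * mask, axis=0)
    logits = summed @ weights + bias
    # Numerically stable softmax.
    exp = np.exp(logits - np.max(logits))
    return exp / np.sum(exp)

rng = np.random.default_rng(0)
outputs = rng.normal(size=(10, 8))  # 10 timesteps, 8 features
weights = rng.normal(size=(8, 5))   # 5 classes
bias = np.zeros(5)
probs = classify_from_rnn_outputs(outputs, length=7, weights=weights, bias=bias)
```

Here `probs` is a probability vector over the five hypothetical classes, and anything stored in the padded rows of `outputs` has no influence on it.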

For the open-source data and code, please refer to the reference document below. The network is relatively simple; with its default parameters the final model reaches an accuracy of 75%, as shown in the figure below. Since this is not a demanding scenario, that is good enough.

Here are a few interesting details that I noticed (corrections from experts are welcome).

Small details

Data preprocessing

For stroke-3 (x, y, n) data, Google's default TFRecord preprocessing normalizes the coordinates and takes the differences (deltas) between consecutive points.

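    import numpy as np

    # np_ink: (N, 3) array of stroke-3 points (x, y, n).
    # 1. Size normalization: rescale x and y to [0, 1].
    lower = np.min(np_ink[:, 0:2], axis=0)
    upper = np.max(np_ink[:, 0:2], axis=0)
    scale = upper - lower
    scale[scale == 0] = 1  # avoid division by zero for flat drawings
    np_ink[:, 0:2] = (np_ink[:, 0:2] - lower) / scale
    # 2. Compute deltas between consecutive points.
    np_ink[1:, 0:2] -= np_ink[0:-1, 0:2]
    np_ink = np_ink[1:, :]  # the first row has no predecessor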
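
To make the stroke-3 layout concrete, here is a minimal sketch of how an array like `np_ink` could be assembled from raw strokes. The function name and the sample strokes are hypothetical, and the flag convention (1 on the first point of each stroke) simply follows the description in this article; the actual dataset tooling may mark stroke boundaries differently.

```python
import numpy as np

def strokes_to_stroke3(strokes):
    """Flatten a list of strokes into an (N, 3) array of (x, y, n).

    strokes: list of (xs, ys) pairs, one pair per stroke.
    n is 1 on the first point of each stroke and 0 elsewhere
    (assumed convention, see the note above).
    """
    total = sum(len(xs) for xs, _ in strokes)
    ink = np.zeros((total, 3), dtype=np.float32)
    t = 0
    for xs, ys in strokes:
        n = len(xs)
        if n == 0:
            continue  # skip empty strokes
        ink[t:t + n, 0] = xs
        ink[t:t + n, 1] = ys
        ink[t, 2] = 1  # mark the start of this stroke
        t += n
    return ink

# Two strokes: a three-point zigzag and a two-point horizontal line.
strokes = [([0, 10, 20], [0, 5, 0]), ([30, 40], [10, 10])]
np_ink = strokes_to_stroke3(strokes)
```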

Why Normalization?

Similar to the role of BN at the input layer: the data distribution is moved out of the saturation region of the activation function and into the region where the gradient is larger.

We only care about the stroke trend of the drawing, not its size. In other words, drawing a large circle and drawing a small circle should make little difference in the input data.

Why deltas (differences between consecutive points)?

They remove the effect of the starting coordinates: drawing the same shape in the middle of the canvas or in any of its four corners makes little difference at the level of the input data.
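
The effect of the two preprocessing steps can be checked with a small experiment. The sketch below wraps the preprocessing code shown earlier in a hypothetical helper, `preprocess_ink`, and feeds it a unit square and the same square drawn 50 times larger at a different position; both drawings and the scale and offset values are made up for illustration.

```python
import numpy as np

def preprocess_ink(ink):
    """Size-normalize the x, y columns and convert them to deltas."""
    ink = np.array(ink, dtype=np.float64)  # work on a copy
    lower = np.min(ink[:, 0:2], axis=0)
    upper = np.max(ink[:, 0:2], axis=0)
    scale = upper - lower
    scale[scale == 0] = 1
    ink[:, 0:2] = (ink[:, 0:2] - lower) / scale
    ink[1:, 0:2] -= ink[0:-1, 0:2]
    return ink[1:, :]

# A unit square drawn as one closed stroke: columns are (x, y, n).
small = np.array([[0, 0, 1], [1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 0]],
                 dtype=np.float64)

# The same square, 50 times larger and drawn elsewhere on the canvas.
big = small.copy()
big[:, 0:2] = big[:, 0:2] * 50 + np.array([200.0, 300.0])

same_input = np.allclose(preprocess_ink(small), preprocess_ink(big))
```

With these inputs `same_input` is `True`: after preprocessing, the model sees identical sequences for both drawings.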

Convolutional Layer

Multiple one-dimensional convolutions (conv1d) are cascaded, with a linear activation function and no pooling layer.

  • Changing the linear activation to ReLU lowered the accuracy slightly, to 73%.
  • Changing the linear activation to ReLU and adding a pooling layer (size=4, strides=4) lowered the accuracy further, to 70%.

Why do the linear activation and the absence of a pooling layer improve accuracy by 2-3 points?
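
One property of the linear setup that may be worth keeping in mind: without a nonlinearity in between, a cascade of convolutions is mathematically equivalent to a single convolution with a wider kernel. The sketch below illustrates this for the simplified case of one channel, no bias, and "full" convolution; the signal and kernels are random placeholders, and the snippet is not meant as an explanation of the accuracy numbers above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)   # a made-up 1-D input sequence
k1 = rng.normal(size=5)   # first-layer kernel (hypothetical)
k2 = rng.normal(size=3)   # second-layer kernel (hypothetical)

# Two linear layers in cascade ...
two_layers = np.convolve(np.convolve(x, k1), k2)
# ... equal one layer whose kernel is the convolution of both kernels.
one_layer = np.convolve(x, np.convolve(k1, k2))
cascade_collapses = np.allclose(two_layers, one_layer)

def relu(v):
    return np.maximum(v, 0)

# With ReLU between the layers the equivalence no longer holds.
with_relu = np.convolve(relu(np.convolve(x, k1)), k2)
relu_differs = not np.allclose(with_relu, one_layer)
```

In other words, the linear convolution stack behaves like a bank of learned linear filters over the stroke sequence, and the nonlinear modeling is left to the RNN that follows.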

What are the functions of the pooling layer?

  1. Reducing the size of the feature maps and hence the amount of computation. Indeed, adding a pooling layer cut the training time by more than half.
  2. Maintaining local invariance of features. Our input, however, is not complex image pixel data but stroke information that has already been converted to deltas, so local invariance is not much needed.
  3. Reducing redundancy and removing noise. This may not be particularly effective for stick figures.
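
One plausible reason for the shorter training time is that pooling shrinks the sequence the RNN has to process. The sketch below counts timesteps under some assumptions that are not stated in the article: three convolution layers with "same" padding and stride 1 (so they preserve the length), a pooling layer with "valid" padding after each of them, and a 200-point drawing.

```python
def pooled_length(length, pool_size=4, strides=4):
    """Output length of a 1-D pooling layer with 'valid' padding."""
    if length < pool_size:
        return 0
    return (length - pool_size) // strides + 1

def timesteps_after_convs(length, num_layers, use_pooling):
    """Sequence length handed to the RNN after the convolution stack."""
    for _ in range(num_layers):
        # Assumed: the convolution itself keeps the length unchanged.
        if use_pooling:
            length = pooled_length(length)
    return length

without_pool = timesteps_after_convs(200, num_layers=3, use_pooling=False)
with_pool = timesteps_after_convs(200, num_layers=3, use_pooling=True)
```

Under these assumptions the RNN processes 200 timesteps without pooling and only 3 with it, which also hints at how much sequential detail pooling discards.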

My (simplistic) understanding is that a stick figure is already a highly abstract human representation of an object, so there is no need for a complex CNN to abstract features further; the global features are captured by the subsequent RNN layer.

Small Thoughts

Google launched the web version of QuickDraw in November 2016, and it has recently become popular again thanks to the mini-program. The large amount of real user data collected earlier was used to optimize the accuracy of this mini-program.

What else can the model be used for?

Recently, I saw an article that studied the relationship between the order in which people from different countries draw these stick figures and their national character. In addition, time-series classification models have seen a lot of research and progress in anomaly analysis, handwriting recognition, speech recognition, and text classification.


Drawing circles differently

When I was a graduate student, I studied anomaly analysis of computer users: I built a classification model on mouse trajectories and keyboard operations to identify whether the legitimate user was the one operating the computer. Looking back, this model should work quite well on that earlier task.

What other innovations can we have at the product level?

  • AutoDraw: automatically turn your doodles into polished artwork (Google has already launched this).
  • Drawing story: draw a four-panel comic and the system automatically generates a story (this should be feasible with NLG technology layered on top).
  • Drawing scoring: automatically score your drawings for creativity, technique, completeness, and so on.

What other value can be mined from this drawing data?

Drawing is a way for people to describe the world as they understand it. Starting from these simple sketches, we can learn how people understand objects and the world. At the simple end, this knowledge could be transferred to the high-level abstraction stage of current image recognition algorithms to improve certain tasks. At the more ambitious end, it could even be used to strengthen the reasoning ability of machines by learning the human ability to model objects and the world abstractly (admittedly wild speculation).
