Some details and thoughts on “Guess the Picture Song”

QuickDraw's CNN-RNN model

The QuickDraw model used in "Guess the Picture Song" is essentially a classification model. Its input is the coordinates of the stroke points plus a flag marking the starting point of each stroke. Several cascaded one-dimensional convolutions are applied, followed by a BiLSTM layer whose outputs are summed. Finally, a Softmax layer performs the classification.
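As a rough illustration of the input described above, the following sketch flattens a list of strokes into a single array of (x, y, flag) rows. The function name `strokes_to_ink`, the (xs, ys) stroke layout, and the sample data are assumptions made for this example; the flag is placed on the first point of each stroke because that is how the text describes it.

```python
import numpy as np

def strokes_to_ink(strokes):
    """Flatten strokes into a (num_points, 3) array of (x, y, flag) rows.

    Each stroke is assumed to be a pair (xs, ys) of equal-length lists.
    The flag column is 1 on the first point of every stroke, else 0.
    """
    total_points = sum(len(xs) for xs, _ in strokes)
    ink = np.zeros((total_points, 3), dtype=np.float32)
    row = 0
    for xs, ys in strokes:
        n = len(xs)
        if n == 0:
            continue  # skip empty strokes
        ink[row:row + n, 0] = xs
        ink[row:row + n, 1] = ys
        ink[row, 2] = 1.0  # flag the starting point of this stroke
        row += n
    return ink

# Hypothetical drawing: a 3-point stroke followed by a 2-point stroke.
example_strokes = [([0, 10, 20], [0, 5, 0]), ([30, 40], [10, 10])]
example_ink = strokes_to_ink(example_strokes)
```

Here `example_ink` has one row per point, so the convolution and recurrent layers can treat the drawing as a single time series.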

The entire network structure is shown in the figure:

Model structure

For the open source data and code, see the reference document below. The network as a whole is relatively simple, and with its default parameters the final model reaches 75% accuracy, as shown in the figure below. This is not a demanding scenario, so that accuracy is good enough.

Here are a few interesting details I noticed (experts, please go easy on me).

Small details

Data preprocessing

For stroke-3 data (x, y, n), Google's default code for producing the TFRecord data normalizes the coordinates and then takes the differences (deltas) between consecutive points.

 import numpy as np

 # np_ink: float array of shape (num_points, 3) in stroke-3 format (x, y, n).
 # 1. Size normalization.
 lower = np.min(np_ink[:, 0:2], axis=0)
 upper = np.max(np_ink[:, 0:2], axis=0)
 scale = upper - lower
 scale[scale == 0] = 1  # avoid division by zero
 np_ink[:, 0:2] = (np_ink[:, 0:2] - lower) / scale
 # 2. Compute deltas.
 np_ink[1:, 0:2] -= np_ink[0:-1, 0:2]
 np_ink = np_ink[1:, :]

Why Normalization?

It plays a role similar to BN at the input layer: the data distribution is shifted out of the saturated region of the activation function and into the region where gradients are larger.

We only care about the stroke trend of the drawing, not its size. In other words, drawing a large circle and drawing a small circle should make little difference in the input data.
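The size argument can be checked with a small sketch. The helper `normalize_size` mirrors step 1 of the preprocessing code, and the two circles are made-up data for this illustration.

```python
import numpy as np

def normalize_size(points):
    """Rescale x and y into [0, 1] using the drawing's own bounding box."""
    points = np.asarray(points, dtype=np.float64)
    lower = points.min(axis=0)
    upper = points.max(axis=0)
    scale = upper - lower
    scale[scale == 0] = 1  # avoid division by zero for flat drawings
    return (points - lower) / scale

# Hypothetical data: the same circle drawn small and large.
angles = np.linspace(0.0, 2.0 * np.pi, 16)
small_circle = np.stack([10 * np.cos(angles), 10 * np.sin(angles)], axis=1)
large_circle = np.stack([200 * np.cos(angles), 200 * np.sin(angles)], axis=1)

same_after_normalization = np.allclose(
    normalize_size(small_circle), normalize_size(large_circle)
)
```

After normalization both circles produce the same array, so the model never sees how large the user drew.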

Why compute deltas?

Taking differences removes the effect of the starting position: drawing the same shape in the middle of the canvas or in any of its four corners makes little difference at the input-data level.
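A minimal sketch of this point, using only the delta step (consecutive differences, as in step 2 of the preprocessing code) and leaving normalization aside. The helper `to_deltas` and the square coordinates are assumptions for illustration.

```python
import numpy as np

def to_deltas(points):
    """Replace absolute coordinates with offsets between consecutive points.

    The first point has no predecessor, so the result has one row fewer.
    """
    points = np.asarray(points, dtype=np.float64)
    return points[1:] - points[:-1]

# Hypothetical data: one square drawn near a corner and again mid-canvas.
square = np.array([[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]], dtype=np.float64)
corner_square = square + np.array([2.0, 3.0])
center_square = square + np.array([120.0, 80.0])

same_deltas = np.allclose(to_deltas(corner_square), to_deltas(center_square))
```

Both squares yield identical deltas, because a constant offset cancels when consecutive points are subtracted.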

Convolutional Layer

Multiple one-dimensional convolutions (conv1d) are cascaded, with a linear activation function and no pooling layer.
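To make the layer stack concrete, here is a NumPy-only sketch of cascaded one-dimensional convolutions with a linear activation (that is, no nonlinearity) and no pooling. The function `conv1d_same`, the zero "same" padding, the filter counts and kernel widths, and the random weights are all illustrative assumptions rather than the model's actual configuration.

```python
import numpy as np

def conv1d_same(x, kernel, bias):
    """1-D convolution over time with zero 'same' padding, no activation.

    x:      (time, in_channels)
    kernel: (width, in_channels, out_channels)
    bias:   (out_channels,)
    """
    width = kernel.shape[0]
    pad_left = (width - 1) // 2
    pad_right = width - 1 - pad_left
    padded = np.pad(x, ((pad_left, pad_right), (0, 0)))  # zero padding
    out = np.empty((x.shape[0], kernel.shape[2]))
    for t in range(x.shape[0]):
        window = padded[t:t + width]  # (width, in_channels)
        out[t] = np.tensordot(window, kernel, axes=([0, 1], [0, 1])) + bias
    return out

rng = np.random.default_rng(0)
ink = rng.normal(size=(50, 3))  # 50 points with 3 features each
layer_shapes = [(5, 3, 48), (5, 48, 64), (3, 64, 96)]  # assumed sizes

features = ink
for shape in layer_shapes:
    kernel = rng.normal(scale=0.1, size=shape)
    bias = np.zeros(shape[2])
    features = conv1d_same(features, kernel, bias)  # linear: no activation
```

Because nothing is pooled, `features` keeps one row per input point, so the recurrent layer that follows still sees the full stroke sequence.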

Changing the linear activation to ReLU lowered the accuracy slightly, to 73%.
Changing it to ReLU plus a pooling layer (size=4, strides=4) lowered the accuracy further, to 70%.

Why do using a linear activation and removing the pooling layer improve accuracy by 2-3 points?

What are the functions of the pooling layer?

Reducing the number of parameters and the amount of computation. In fact, adding a pooling layer cut the training time by more than half.
Maintaining local invariance of features. Our input, however, is not complex image pixel data but stroke data, and after the delta processing local invariance is not much needed.
Reducing redundancy and removing noise, which may not help much for stick figures.
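The training-time effect is easier to see with a sketch of what a pooling layer with size=4 and strides=4 does to the sequence. The helper `max_pool1d`, the choice of max pooling, and the sample array are assumptions for illustration, and the sequence is assumed to be at least `size` steps long.

```python
import numpy as np

def max_pool1d(x, size=4, strides=4):
    """Max pooling over the time axis of a (time, channels) array."""
    steps = (x.shape[0] - size) // strides + 1
    return np.stack(
        [x[i * strides:i * strides + size].max(axis=0) for i in range(steps)]
    )

# Hypothetical feature map: 100 time steps, 8 channels.
features = np.arange(100 * 8, dtype=np.float64).reshape(100, 8)
pooled = max_pool1d(features)
```

With these settings a 100-step sequence shrinks to 25 steps, so the recurrent layer has a quarter as many steps to process. That would plausibly account for the shorter training time, at the cost of discarding three out of every four positions along the stroke.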

My (simple) understanding is that stick figures are already a highly abstract human representation of objects, so there is no need for a complex CNN to abstract features, and the global features are captured by the RNN layer that follows.

Small Thoughts

Google launched the web version of QuickDraw in November 2016, and it has recently become popular again thanks to the mini-program. The large amount of real user data collected earlier was used to optimize the mini-program's effectiveness.

What else can the model be used for?

Recently, I saw an article that used this stick-figure data to study the relationship between the order in which people from different countries draw and their national characters. In addition, time series classification models have seen a lot of research and progress in anomaly analysis, handwriting recognition, speech recognition, and text classification.

Drawing circles differently

When I was a graduate student, I studied anomaly analysis of computer users. I built a classification model based on the user's mouse trajectory and keyboard operations to identify whether it was really that user operating the computer. In hindsight, this model would probably do quite well on that earlier task.

What other innovations can we have at the product level?

AutoDraw: Automatically turn your doodles into polished artwork (already launched by Google)
Drawing Story: Draw a four-panel comic and the system automatically generates a story (this should be no problem with NLG technology on top)
Drawing Scoring: Automatically score your drawings for creativity, technique, completeness, etc.

What other value can be mined from this drawing data?

Drawing is a way for people to describe the world as they understand it. Starting from these simple sketches, we can learn how people understand objects and the world. At the simple end, this knowledge could be transferred to the high-level abstraction stage of current image recognition algorithms to improve results on certain tasks. At the more ambitious end, it could even be used to strengthen machine reasoning and to learn the human ability to model objects and the world abstractly (a wild idea).
