What? Can the verification code I fill in every day be used for charity?

CAPTCHA is a security mechanism widely used on websites, in applications, and in other systems. It asks users to correctly enter a combination of characters or digits to confirm that they are human and to block malicious behavior such as bulk account registration and brute-force password cracking. CAPTCHAs help prevent attacks on and abuse of a system by automated programs such as bots and scripts, which in turn protects user data and privacy. They are also commonly used for uniqueness checks, permission checks for specific operations, and verification steps within interactive workflows.

Sentinels to guard against automated programs abusing online services

CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart". It was first proposed in 2000 by Luis von Ahn, Manuel Blum, Nicholas Hopper, and John Langford of Carnegie Mellon University, and is widely used on the Internet to distinguish computer programs (such as bots) from real human users.

A typical CAPTCHA is an image containing several distorted characters, like the one in Figure 1, and usually appears at the bottom of a web form. Users are asked to type these squiggly characters to "prove" that they are human. Computer programs at the time could not read distorted text as well as humans could, so CAPTCHAs served as sentinels against automated programs that abuse online services. Because they work well as a security measure, CAPTCHAs protect many kinds of websites, including free email providers, ticket sales sites, social networks, wikis, and blogs. For example, they can stop scalpers from using programs to buy up large quantities of concert tickets and resell them at inflated prices. Free email providers such as Gmail and Yahoo Mail use CAPTCHAs to block bulk registration of accounts that would be used to send spam.

Figure 1 CAPTCHA example (Image source [1])

If you have ever filled out a verification code like the one in Figure 2, then congratulations and thank you, because you have done something meaningful for humanity without knowing it.

Figure 2 reCAPTCHA interface (Image source [1])

The story begins with a bold idea. According to estimates by Luis von Ahn's team, people around the world were typing more than 100 million CAPTCHAs every day as of 2008. Although recognizing and typing the distorted characters takes only a few seconds each time, in aggregate this adds up to hundreds of thousands of hours per day. CAPTCHAs are very effective at preventing large-scale abuse of online services, but the effort each person spends solving one is otherwise wasted. Waste on this scale led the team to ask whether these scattered fragments of time could be put to use. Improbable as it sounded, they found an answer: digitizing old printed books.

At that time, large-scale projects to digitize old printed books (such as Google Books and the non-profit Internet Archive) caught the attention of Luis von Ahn's team. Digitizing old books matters: it helps preserve human knowledge and makes information easier to access, search, and analyze.

Back then, old books were digitized by scanning the pages into images and then converting the images into text files with optical character recognition (OCR) software. For old books with faded ink and yellowed paper, OCR can recognize only about 80% of the words [1]. Humans transcribe such printed material far more accurately, achieving better than 99% word-level accuracy with transcription plus proofreading [1]. Unfortunately, manual transcription is very expensive.

Since manually transcribing old books is expensive and OCR alone is not accurate enough, Luis von Ahn's team asked: why not let CAPTCHA users read the words in the scanned images? That raised a second question: if the system itself does not know the answer, how can it tell whether the respondent is a real person rather than a malicious program? With these goals in mind, the team started from the standard CAPTCHA, replaced its randomly generated character images with images of scanned words, and introduced two-word verification. The result was a new CAPTCHA system: reCAPTCHA.

reCAPTCHA two-word verification method

Each reCAPTCHA challenge consists of two words, both taken from scanned images of old books. The user must recognize and type both words, and can proceed only after passing verification.

Figure 3 reCAPTCHA interface (Image source [1])

As shown in Figure 3, reCAPTCHA presents the user with two words: an "unknown" word ("morning"), which the computer could not recognize, and a "control" word ("overlooks"), whose answer is already known.

Each word in a scanned text is analyzed by two different OCR programs. Any word on which the two programs disagree, or which cannot be found in the dictionary, is marked as "suspicious". A suspicious word is first sent to users as an "unknown" word. If three users give identical answers that differ from both OCR results, the word becomes a "control" word and is presented to users at random. If users' answers disagree, the word continues to be sent to more users as an "unknown" word, with each user's answer counted as one vote and each OCR result counted as half a vote.
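The two rules in this paragraph can be sketched in a few lines of Python. This is only an illustration of the description above, not the actual reCAPTCHA implementation: the function names, the case-insensitive comparison, and the reading of "three identical answers" as the first three answers received are all assumptions made for the example.

```python
from typing import List, Optional, Set


def is_suspicious(ocr_a: str, ocr_b: str, dictionary: Set[str]) -> bool:
    """Flag a word when the two OCR readings disagree or the reading
    is not a dictionary word (hypothetical helper)."""
    a, b = ocr_a.lower(), ocr_b.lower()
    return a != b or a not in dictionary


def promote_to_control(human_guesses: List[str], ocr_a: str, ocr_b: str) -> Optional[str]:
    """Return the agreed spelling if the first three human guesses are
    identical and differ from both OCR readings; otherwise return None."""
    if len(human_guesses) < 3:
        return None
    first_three = [g.lower() for g in human_guesses[:3]]
    candidate = first_three[0]
    all_agree = all(g == candidate for g in first_three)
    differs_from_ocr = candidate not in (ocr_a.lower(), ocr_b.lower())
    return candidate if all_agree and differs_from_ocr else None
```

For example, with OCR readings "moming" and "mornmg", three users who all type "morning" would turn the word into a control word, while a single dissenting answer would leave it in the "unknown" pool.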

Figure 4: How reCAPTCHA works

Each "unknown" word is placed in an image alongside a "control" word, in random order, and both words are further distorted so that automated programs cannot decipher them. To reduce the chance that an automated program guesses the control word correctly, the frequencies of the control words are normalized so that, for example, the common word "today" and the less common word "abridged" are equally likely to be presented.
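One simple way to picture frequency normalization is to draw control words uniformly from the set of distinct words instead of from the raw list of occurrences. The sketch below assumes that interpretation; the function names and the sample data are hypothetical, and the real system may normalize differently.

```python
import random
from collections import Counter
from typing import Dict, List


def pick_control_word(control_occurrences: List[str], rng: random.Random) -> str:
    """Pick a control word uniformly over distinct words, ignoring how
    often each word occurs in the scanned text (assumed interpretation)."""
    distinct = sorted(set(control_occurrences))
    return rng.choice(distinct)


def serving_share(control_occurrences: List[str], draws: int, seed: int = 0) -> Dict[str, float]:
    """Estimate how often each word would be served over many draws."""
    rng = random.Random(seed)
    counts = Counter(pick_control_word(control_occurrences, rng) for _ in range(draws))
    return {word: count / draws for word, count in counts.items()}


# "today" occurs 99 times in this toy corpus and "abridged" only once,
# yet each is served roughly half of the time.
shares = serving_share(["today"] * 99 + ["abridged"], draws=10_000)
```

Without normalization, a program that always answered "today" would be right about 99% of the time on this toy corpus; with it, the same guess succeeds only about half of the time, and far less often with a realistic vocabulary.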

When a user types in an "unknown" word and a "control" word, the user is judged to be a real person if the "control" word is entered correctly. The user's answer for the "unknown" word is then recorded as a vote, and once an answer has collected 2.5 votes or more it is accepted as the correct reading of that word.
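The check and the vote count described here might look roughly like the following. It is a minimal sketch under stated assumptions: the function and constant names are invented for illustration, answers are compared case-insensitively, and only guesses from users who passed the control word are expected to be passed in.

```python
from collections import defaultdict
from typing import Dict, List, Optional

HUMAN_VOTE = 1.0        # each human answer counts as one vote
OCR_VOTE = 0.5          # each OCR reading counts as half a vote
ACCEPT_THRESHOLD = 2.5  # votes needed before a reading is accepted


def passes_control(control_answer: str, typed_control: str) -> bool:
    """The user is treated as human if the control word is typed correctly."""
    return typed_control.strip().lower() == control_answer.lower()


def resolve_unknown(human_guesses: List[str], ocr_guesses: List[str]) -> Optional[str]:
    """Tally votes for an unknown word and return the leading reading
    once it reaches the threshold; return None while it is undecided."""
    votes: Dict[str, float] = defaultdict(float)
    for guess in human_guesses:
        votes[guess.strip().lower()] += HUMAN_VOTE
    for guess in ocr_guesses:
        votes[guess.strip().lower()] += OCR_VOTE
    if not votes:
        return None
    best, score = max(votes.items(), key=lambda item: item[1])
    return best if score >= ACCEPT_THRESHOLD else None
```

Under these rules, two humans plus one OCR program agreeing on "morning" gives exactly 2.5 votes, which is enough; two humans alone, or one human and both OCR programs, is not.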

Deployed at scale, with recognition results collected and analyzed, reCAPTCHA achieved a word-level accuracy of 99.1% [1], whereas standard OCR reached only 83.5% [1]. An accuracy of 99.1% meets the "better than 99%" guarantee that is the industry standard for transcription services.

After one year of operation, humans had solved over 1.2 billion CAPTCHAs, which equates to over 440 million suspicious words correctly deciphered. Assuming that each book has 100,000 words (400 pages, 250 words per page) and that about 25% of the words in each book are marked as suspicious by the algorithm, this equates to more than 17,600 books transcribed by hand. The popularity of the system continued to grow: in 2008, the transcription rate exceeded 4 million suspicious words per day, which equates to about 160 books per day. To achieve this rate with traditional manual transcription, a team of more than 1,500 people deciphering words 40 hours a week would have been required (assuming an average of 60 words per minute) [1].
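The book-equivalent figures follow directly from the stated assumptions, as this short calculation shows. The variable and function names are made up for the example; the numbers are the ones quoted in the paragraph above.

```python
PAGES_PER_BOOK = 400
WORDS_PER_PAGE = 250
SUSPICIOUS_FRACTION = 0.25  # share of words flagged as suspicious

WORDS_PER_BOOK = PAGES_PER_BOOK * WORDS_PER_PAGE                 # 100,000 words
SUSPICIOUS_WORDS_PER_BOOK = WORDS_PER_BOOK * SUSPICIOUS_FRACTION  # 25,000 words


def books_equivalent(suspicious_words: float) -> float:
    """Number of whole books whose suspicious words were deciphered."""
    return suspicious_words / SUSPICIOUS_WORDS_PER_BOOK


print(books_equivalent(440_000_000))  # first year: 17600.0 books
print(books_equivalent(4_000_000))    # per day in 2008: 160.0 books
```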

Because "control" words are words that both OCR programs failed to recognize, any program that could read them with high probability would itself be an improvement over those OCR programs, and therefore an advance in OCR technology.

Google acquired reCAPTCHA in September 2009 and has continued to develop and improve the system since then, integrating it into its own products and services, including Gmail, Google Search, and Google Forms. reCAPTCHA is used not only to verify that a user is human, but also to produce training data for machine learning, improving image recognition and automation technology.

While developing reCAPTCHA, Google introduced new algorithms and techniques to better identify bots and malicious behavior. The system evolved from the classic CAPTCHA (typing hard-to-read text) to No CAPTCHA reCAPTCHA, where verification relies on analyzing user behavior and the user only needs to tick an "I'm not a robot" checkbox instead of typing anything, and then to Invisible reCAPTCHA, where verification runs in the background and no challenge is displayed when the user is judged to be low risk. These improvements aim to provide a better user experience and stronger protection against bots.

Future Outlook

Future CAPTCHA technology is likely to become more intelligent, less intrusive, multi-factor, and more secure and reliable, offering a better user experience while protecting websites from bots and malicious behavior. At the same time, it will need continual innovation and optimization to keep pace with rapidly evolving online threats and user needs.

References:

[1] Luis von Ahn, Ben Maurer, Colin McMillen, David Abraham and Manuel Blum. reCAPTCHA: Human-Based Character Recognition via Web Security Measures. Science, September 12, 2008. Pages 1465-1468.

Author: Cheng Xinyan

Affiliation: China Mobile Smart Home Operation Center
