Use Node.js to segment text content and extract keywords

Articles on Zhongcheng Translation carry tags. Users can quickly filter for articles that interest them by tag, and related articles can be recommended based on tag associations. However, the tags are set by hand when an article is recommended, and they are all in English. Manual tagging is also inevitably inconsistent and incomplete. Although tags can be edited after publication, we cannot expect users or administrators to keep them accurate all the time, so we need a tool that generates tags automatically.

Among the open source word segmentation tools currently available, jieba is a powerful, high-performance Chinese word segmentation component. Fortunately, it has a Node.js version, nodejieba.

Installing and using nodejieba is simple:

 npm install nodejieba

 var nodejieba = require("nodejieba");
 var result = nodejieba.cut("Imperialism wants to divide up our sweet potatoes");
 console.log(result);
 // [ 'Imperialism', 'want', 'to', 'divide', 'our', 'land', 'take', 'away' ]
 
 result = nodejieba.cut('Landlord, where is my golden hoop?');
 console.log(result);
 // [ 'Land', '，', 'I', 'Old', 'Sun', 'of', 'golden hoop', 'where', '?' ]
 
 result = nodejieba.cut('Great Sage, your golden hoop stick is great because it matches your head shape!');
 console.log(result);
 // [ 'Monkey King', ',', 'your', 'golden hoop', 'is', 'great', 'especially', 'match', 'your', 'head', '!' ]
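For tag generation, punctuation tokens such as '，' or '!' in the results above are noise. A minimal sketch of stripping them from a token array might look like this; `dropPunctuation` is a hypothetical helper, not part of nodejieba, and the sample array is modeled on the second result above.

```javascript
// Hypothetical helper (not part of nodejieba): remove tokens that consist
// only of punctuation, symbols, or whitespace.
function dropPunctuation(tokens) {
  // \p{P} = Unicode punctuation, \p{S} = Unicode symbols (needs the u flag)
  const onlyPunctuation = /^[\p{P}\p{S}\s]+$/u;
  return tokens.filter(token => !onlyPunctuation.test(token));
}

const tokens = ['Land', '，', 'I', 'Old', 'Sun', 'of', 'golden hoop', 'where', '?'];
console.log(dropPunctuation(tokens));
// -> [ 'Land', 'I', 'Old', 'Sun', 'of', 'golden hoop', 'where' ]
```

Tokens that mix letters and spaces, such as 'golden hoop', are kept because the pattern only matches tokens made up entirely of punctuation, symbols, or whitespace.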

We can load our own dictionary and set the weight and part of speech for each word in the dictionary:

Edit user.utf8:

 Sweet potato 9999 n
 Golden Hoop 9999 n
 The best part is 9999
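Each dictionary line thus carries a word, optionally followed by a weight and a part-of-speech tag. As a rough illustration of that layout (not of how nodejieba itself reads the file), a hypothetical `parseDictLine` helper could split a line as below. The sketch assumes the fields are separated by whitespace and that the word itself contains no spaces; the sample lines are made up for the example.

```javascript
// Hypothetical helper: split one user-dictionary line of the assumed form
// "word [weight] [partOfSpeech]" into an object.
function parseDictLine(line) {
  const [word, weight, tag] = line.trim().split(/\s+/);
  if (!word) return null; // blank line
  const entry = { word };
  if (weight !== undefined) entry.weight = Number(weight);
  if (tag !== undefined) entry.tag = tag;
  return entry;
}

console.log(parseDictLine('HTTP/2 9999 n'));
// -> { word: 'HTTP/2', weight: 9999, tag: 'n' }
console.log(parseDictLine('performance'));
// -> { word: 'performance' }
```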

Then load the dictionary through nodejieba.load.

 var nodejieba = require("nodejieba");
 
 nodejieba.load({
   userDict: './user.utf8',
 });
 
 var result = nodejieba.cut("Imperialism wants to divide up our sweet potatoes");
 console.log(result);
 // [ 'Imperialism', 'want', 'to', 'our', 'sweet potatoes', 'divide', 'get rid of' ]
 
 result = nodejieba.cut('Landlord, where is my golden hoop?');
 console.log(result);
 // [ 'Land', '，', 'I', 'Old', 'Sun', 'of', 'golden hoop', 'where', '?' ]
 
 result = nodejieba.cut('Great Sage, your golden hoop stick is great because it matches your head shape!');
 console.log(result);
 // [ 'Monkey King', '，', 'You', 'Your', 'Golden Hoop', "It's great because", "It's special", 'matches', 'You', 'Your', 'Head shape', '！' ]
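Why a dictionary entry changes the result can be illustrated with a toy segmenter. The sketch below is a greedy forward longest-match over a word set. It is not how jieba works internally (jieba also weighs word frequencies); it only shows that adding a word to the dictionary can change where the text is split. The Chinese sentence and the tiny word lists are illustrative assumptions: 地瓜分掉 can be read as 地瓜 / 分掉 (sweet potato / divide up) or as 地 / 瓜分 / 掉 (land / carve up / away).

```javascript
// Toy forward longest-match segmenter (NOT jieba's algorithm): at each
// position, take the longest dictionary word, else a single character.
function segment(text, dictionary) {
  const maxLen = Math.max(1, ...[...dictionary].map(w => [...w].length));
  const chars = [...text];
  const tokens = [];
  let i = 0;
  while (i < chars.length) {
    let matched = chars[i]; // fall back to one character
    for (let len = Math.min(maxLen, chars.length - i); len > 1; len--) {
      const candidate = chars.slice(i, i + len).join('');
      if (dictionary.has(candidate)) {
        matched = candidate;
        break;
      }
    }
    tokens.push(matched);
    i += [...matched].length;
  }
  return tokens;
}

// Illustrative word lists; a real dictionary is far larger.
const baseWords = new Set(['帝国主义', '我们', '瓜分']);
const sentence = '帝国主义要把我们的地瓜分掉';

console.log(segment(sentence, baseWords).join(' / '));
// 帝国主义 / 要 / 把 / 我们 / 的 / 地 / 瓜分 / 掉

console.log(segment(sentence, new Set([...baseWords, '地瓜'])).join(' / '));
// 帝国主义 / 要 / 把 / 我们 / 的 / 地瓜 / 分 / 掉
```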

In addition to word segmentation, we can use nodejieba to extract keywords:

 const content = `
 HTTP, HTTP/2, and performance optimization 
 
 The purpose of this article is to tell you through comparison why you should migrate from HTTP to HTTPS and why you should add support for HTTP/2. Before comparing HTTP and HTTP/2, let's take a look at what HTTP is. 
 
 What is HTTP
 HTTP is a set of rules for communicating on the World Wide Web. HTTP is an application layer protocol that runs on top of the TCP/IP layer. When a user requests a web page through a browser, HTTP is responsible for processing the request and establishing a connection between the web server and the client. 
 
 With HTTP/2, performance can be improved by not using sprites, compression, or concatenation. However, this does not mean that these techniques should not be used. However, it clearly shows the need to move from HTTP/1.1 to HTTP/2.
 `; 
 
 const nodejieba = require("nodejieba");
 
 // Extract the 20 highest-weighted keywords from the text
 const result = nodejieba.extract(content, 20);
 
 console.log(result);

The output is similar to the following:

 [ { word: 'HTTP', weight: 140.8704516850025 },
   { word: 'request', weight: 14.23018001394 },
   { word: 'should', weight: 14.052171126120001 },
   { word: 'World Wide Web', weight: 12.2912397395 },
   { word: 'TCP', weight: 11.739204307083542 },
   { word: '1.1', weight: 11.739204307083542 },
   { word: 'Web', weight: 11.739204307083542 },
   { word: 'Sprite', weight: 11.739204307083542 },
   { word: 'HTTPS', weight: 11.739204307083542 },
   { word: 'IP', weight: 11.739204307083542 },
   { word: 'Application layer', weight: 11.2616203224 },
   { word: 'client', weight: 11.1926274509 },
   { word: 'browser', weight: 10.8561552143 },
   { word: 'splice', weight: 9.85762638414 },
   { word: 'comparison', weight: 9.5435285574 },
   { word: 'webpage', weight: 9.53122979951 },
   { word: 'server', weight: 9.41204128224 },
   { word: 'use', weight: 9.03259988558 },
   { word: 'necessity', weight: 8.81927328699 },
   { word: 'Add', weight: 8.0484751722 } ]

We add some new keywords to the dictionary:

 performance
 HTTP/2

With these entries in the dictionary, the output becomes:

 [ { word: 'HTTP', weight: 105.65283876375187 },
   { word: 'HTTP/2', weight: 58.69602153541771 },
   { word: 'request', weight: 14.23018001394 },
   { word: 'should', weight: 14.052171126120001 },
   { word: 'Performance', weight: 12.61259281884 },
   { word: 'World Wide Web', weight: 12.2912397395 },
   { word: 'IP', weight: 11.739204307083542 },
   { word: 'HTTPS', weight: 11.739204307083542 },
   { word: '1.1', weight: 11.739204307083542 },
   { word: 'TCP', weight: 11.739204307083542 },
   { word: 'Web', weight: 11.739204307083542 },
   { word: 'Sprite', weight: 11.739204307083542 },
   { word: 'Application layer', weight: 11.2616203224 },
   { word: 'client', weight: 11.1926274509 },
   { word: 'browser', weight: 10.8561552143 },
   { word: 'splice', weight: 9.85762638414 },
   { word: 'comparison', weight: 9.5435285574 },
   { word: 'webpage', weight: 9.53122979951 },
   { word: 'server', weight: 9.41204128224 },
   { word: 'use', weight: 9.03259988558 } ]
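The weights in the two outputs are consistent with a TF-IDF-style score, in which a word's weight is its occurrence count multiplied by a per-word rarity factor. That reading is an assumption here, not something the outputs state: several words share exactly 11.739204307083542, and the `HTTP` and `HTTP/2` weights are whole multiples of that value. A quick check of the arithmetic, with `impliedCount` as a hypothetical helper name:

```javascript
// Weight shared by 'TCP', 'Web', 'HTTPS', 'IP', ... in the outputs above.
const unitWeight = 11.739204307083542;

// Assumption: weight = occurrence count * per-word factor (TF-IDF style),
// so dividing by the factor recovers the implied count.
function impliedCount(weight, perWordFactor) {
  return weight / perWordFactor;
}

const observed = {
  'HTTP (before)': 140.8704516850025,
  'HTTP (after)': 105.65283876375187,
  'HTTP/2 (after)': 58.69602153541771,
};

for (const [label, weight] of Object.entries(observed)) {
  console.log(label, impliedCount(weight, unitWeight).toFixed(6));
}
// HTTP (before) 12.000000
// HTTP (after) 9.000000
// HTTP/2 (after) 5.000000
```

On this reading, adding `HTTP/2` to the dictionary moved occurrences that were previously counted under `HTTP` to the new keyword, which would explain why the `HTTP` weight dropped.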

On this basis, we use a whitelist to keep only the keywords that are suitable as tags:

 const content = `
 HTTP, HTTP/2, and performance optimization 
 
 The purpose of this article is to tell you through comparison why you should migrate from HTTP to HTTPS and why you should add support for HTTP/2. Before comparing HTTP and HTTP/2, let's take a look at what HTTP is. 
 
 What is HTTP
 HTTP is a set of rules for communicating on the World Wide Web. HTTP is an application layer protocol that runs on top of the TCP/IP layer. When a user requests a web page through a browser, HTTP is responsible for processing the request and establishing a connection between the web server and the client. 
 
 With HTTP/2, performance can be improved by not using sprites, compression, or concatenation. However, this does not mean that these techniques should not be used. However, it clearly shows the need to move from HTTP/1.1 to HTTP/2.
 `; 
 
 const nodejieba = require("nodejieba");
 
 nodejieba.load({
   userDict: './user.utf8',
 });
 
 const result = nodejieba.extract(content, 20);
 
 // Whitelist of words that are acceptable as tags
 const tagList = ['HTTPS', 'HTTP', 'HTTP/2', 'Web', 'browser', 'Performance'];
 
 // Keep only the extracted keywords that appear in the whitelist
 console.log(result.filter(item => tagList.indexOf(item.word) >= 0));

We get:

 [ { word: 'HTTP', weight: 105.65283876375187 },
   { word: 'HTTP/2', weight: 58.69602153541771 },
   { word: 'Performance', weight: 12.61259281884 },
   { word: 'HTTPS', weight: 11.739204307083542 },
   { word: 'Web', weight: 11.739204307083542 },
   { word: 'browser', weight: 10.8561552143 } ]

This is what we want.
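Because the filter above compares strings exactly, a whitelist entry only matches a keyword with identical capitalization. A possible variant, sketched here under the assumption that tags should match regardless of case, normalizes both sides and uses a Set for lookups. `pickTags` is a hypothetical name, and the sample keywords are a shortened copy of the output above.

```javascript
// Hypothetical helper: keep the keywords whose word appears in the
// whitelist, comparing case-insensitively.
function pickTags(keywords, whitelist) {
  const allowed = new Set(whitelist.map(tag => tag.toLowerCase()));
  return keywords.filter(item => allowed.has(item.word.toLowerCase()));
}

const sampleKeywords = [
  { word: 'HTTP', weight: 105.65283876375187 },
  { word: 'request', weight: 14.23018001394 },
  { word: 'browser', weight: 10.8561552143 },
];

console.log(pickTags(sampleKeywords, ['HTTP', 'Browser']).map(item => item.word));
// -> [ 'HTTP', 'browser' ]
```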

The above covers the basic usage of the nodejieba word segmentation library. In the future, we can use it to automatically analyze the translations published on Zhongcheng Translation and add appropriate tags, giving translators and readers a better experience.
