adambuj
ServiceNow Employee

AI Search New Language Support 

We are excited to announce a significant enhancement to the AI Search product that will greatly improve search accuracy and relevance for many of our international customers. Starting August 1st, 2024, AI Search will support advanced linguistic search features for Portuguese, Korean, Italian, Swedish, and Dutch. This expansion includes language-specific tokenization, lemmatization, normalization, and decompounding, which are critical for improving document retrieval accuracy in these languages. 

With advanced linguistic features extended to Portuguese, Korean, Italian, Swedish, and Dutch, AI Search will better handle the nuances of these languages. Language-specific tokenization and decompounding break text into words and word parts more accurately. Lemmatization maps words to their root forms, and normalization ensures that text is processed consistently. Additionally, typo handling (spellcheck) will be enabled for Portuguese, Italian, Swedish, and Dutch. These improvements are designed to provide more accurate and relevant search results for our users. 

 

How to Take Advantage of This Feature 

To use the new advanced linguistic search capabilities, customers must reindex, after the feature launch date, all indexed sources containing content in these languages. Reindexing is essential because it activates the new features within your search experience. Once reindexing completes, users will benefit from more precise and relevant search results immediately. 

 

Key Dates and Actions 

Feature Release Date: August 1st, 2024 

Action Required: Reindex any content in Portuguese, Korean, Italian, Swedish, and Dutch to activate enhanced support. 

 

Current Feature Support 

AI Search supports indexing and search in all BCP-47 languages where words are space delimited. In addition to the new languages, advanced linguistic search features are already available in English, French Canadian, French, German, Japanese, Spanish, and Traditional & Simplified Chinese. Feature support by language is summarized below: 

| Feature | English | French, French Canadian, German, Spanish | *New* Portuguese, Korean, Italian, Swedish, Dutch | Japanese, Traditional & Simplified Chinese | All other languages where words are space delimited |
| --- | --- | --- | --- | --- | --- |
| “exact-match” search | ✓ | | | | |
| Character Normalization | ✓ | | | | |
| Synonyms | ✓ | | | | |
| Stop words | ✓ | | | | |
| Result Improvement Rules | ✓ | | | | |
| Now Assist Actions Genius Results | ✓ | | | | |
| Now Assist Q&A Genius Results | ✓ | | | | |
| Language-specific tokenization, lemmatization, word normalization, & decompounding | ✓ | | | | |
| Typo handling | ✓ | | | | |

 

Multilingual Search Terminology 

Search is fundamentally about matching terms in your query to documents in your index containing these terms. Results relevancy comprises both precision—the percentage of retrieved results that are relevant—and recall—the percentage of relevant results retrieved. A search engine's precision and recall depend on the natural language processing (NLP) applied to indexed documents and query text.  
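As a small worked illustration of the two definitions above, the following sketch computes precision and recall for one query. The document IDs and the `precision_recall` helper are made up for the example and are not part of AI Search.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant results retrieved / all results retrieved
    recall    = relevant results retrieved / all relevant results
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result set: 4 documents returned, 3 documents actually relevant.
retrieved = ["doc1", "doc2", "doc3", "doc4"]
relevant = ["doc1", "doc2", "doc5"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four returned documents are relevant (precision 0.50), and two of the three relevant documents were found (recall of about 0.67).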

  

Tokenization 

Before you can match query terms to indexed document terms, you need to tokenize the text, i.e., break it apart into discrete words. Tokenization is straightforward in languages, such as English, where words are mostly space delimited. In Japanese, on the other hand, words aren’t space delimited. Both AI Search and Zing perform Japanese tokenization based on morphological analysis, which breaks up text into real words, ensuring accurate search results. 

  

Here is an example: 

東京都の人口 (Tokyo population) 

  

Using morphological analysis, this text gets tokenized as follows: 

東京 (Tokyo) 
都 (city) 
の ([possessive particle]) 
人口 (population) 

  

By contrast, simple substring matching would incorrectly match the query 京都 (Kyoto), because those two characters appear inside 東京都, decreasing search precision. 
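The difference can be sketched in a few lines of Python. The token list below is written by hand to mirror the example; a real system would get it from a morphological analyzer, which this sketch does not include.

```python
text = "東京都の人口"   # "Tokyo population"
query = "京都"          # "Kyoto"

# Hand-written tokens mirroring the morphological analysis in the example
# (an assumption for illustration; no analyzer is run here).
tokens = ["東京", "都", "の", "人口"]

# Substring matching: the query characters happen to span two real words.
substring_match = query in text

# Token matching: the query must equal a whole token.
token_match = query in tokens

print("substring match:", substring_match)  # True  (false positive)
print("token match:", token_match)          # False (correct)
```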

   

Language-Specific Normalization 

Next, text needs to be normalized at the character and word level to ensure that a query for one form of a word matches all other forms. As illustrated by the following examples, at the character level, accents should be removed since search end users frequently omit them from queries, and Asian half-width characters should be normalized to match their full-width counterparts: 

  

Pâtisserie → Patisserie 
ﾎﾃﾙ → ホテル 
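Both kinds of character-level normalization can be approximated with Python's standard `unicodedata` module. This is only a sketch of the general technique, not a description of how AI Search implements it; the helper names `strip_accents` and `fold_width` are made up for the example.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_width(text):
    """Map half-width forms to their standard counterparts via NFKC."""
    return unicodedata.normalize("NFKC", text)


# Half-width katakana for "hotel", written as escapes so the code points are explicit.
HALF_WIDTH_HOTEL = "\uFF8E\uFF83\uFF99"

print(strip_accents("Pâtisserie"))   # Patisserie
print(fold_width(HALF_WIDTH_HOTEL))  # ホテル
```

Note that NFKC also folds other compatibility characters (for example, full-width Latin letters), so a production pipeline may want a narrower mapping.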

  

AI Search supports lemmatization—the identification of the dictionary form of the word, based on context. By contrast, Zing supports lemmatization only for Japanese (with certain constraints not imposed by AI Search). Zing uses stemming—an alternative normalization method that truncates words via simplistic rules without considering context—in English, French, and German. Stemming can reduce recall because related terms may have different stems. 

  

For example, in English: 

| Input | Zing | AI Search |
| --- | --- | --- |
| selling | sell | sell |
| sold | sold | sell |

  

Stemming can also lower precision because unrelated words may share the same stem, as in this French example: 

| Input | Zing | AI Search |
| --- | --- | --- |
| faut (necessary) | faut | faut |
| faute (mistake) | faut | faute |
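The contrast between the two approaches can be reproduced with a toy example. The suffix rules and the lemma lookup below are hand-written assumptions chosen to mirror the English and French examples above; they are not the actual rules used by Zing or AI Search.

```python
# Toy stemmer: strip the first matching suffix, with no knowledge of context.
SUFFIXES = ("ing", "e")  # hypothetical rules for illustration only

# Toy lemmatizer: a hand-written lookup standing in for dictionary-based analysis.
LEMMAS = {
    "selling": "sell",
    "sold": "sell",
    "faut": "faut",
    "faute": "faute",
}


def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    return LEMMAS.get(word, word)


for word in ("selling", "sold", "faut", "faute"):
    print(f"{word:8} stem={stem(word):6} lemma={lemmatize(word)}")
```

With stemming, "sold" never meets "sell" (lower recall) while "faut" and "faute" collapse to the same term (lower precision). The lookup keeps both pairs correct.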

  

Decompounding 

In certain languages, such as German, compound words are prevalent. These words must be broken down into constituent parts to maximize recall. For example, the compound “humanressourcen” should be broken down into its components, “human” and “ressourcen.” This ensures that queries for the component terms match documents containing the compound and vice versa.  
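A minimal sketch of dictionary-based decompounding is shown below. It assumes a small known-word vocabulary and a simple left-to-right split; the `decompound` and `index_terms` helpers are hypothetical, and AI Search's actual decompounding logic is not documented here.

```python
def decompound(word, vocabulary, min_len=3):
    """Split a compound into known vocabulary words, or return [] if no split fits."""
    word = word.lower()
    if word in vocabulary:
        return [word]
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in vocabulary:
            rest = decompound(tail, vocabulary, min_len)
            if rest:
                return [head] + rest
    return []


def index_terms(word, vocabulary):
    """Index the compound itself plus its parts, so either form can match."""
    parts = decompound(word, vocabulary)
    return [word.lower()] + parts if len(parts) > 1 else [word.lower()]


# Tiny made-up vocabulary for the example.
vocabulary = {"human", "ressourcen"}

print(decompound("humanressourcen", vocabulary))   # ['human', 'ressourcen']
print(index_terms("humanressourcen", vocabulary))  # ['humanressourcen', 'human', 'ressourcen']
```

Keeping the full compound as an indexed term alongside its parts is one way to let a ranking step still favor exact matches over partial ones.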

 

Conclusion 

With this upgrade, AI Search now provides comprehensive support for a wider range of languages, significantly improving the search experience for our non-English-speaking users. We are committed to continuously enhancing our capabilities to meet the needs of our global customer base. Stay tuned for more updates and enhancements. 

 

Comments
Magnus Hovik
Tera Contributor

@adambuj when can we expect to see AI search improvements in other languages? Norwegian specifically in my case 🙂

Max Thomsson
Tera Contributor

@adambuj The decompounding feature kind of breaks the whole search engine for Swedish. When using the Swedish locale and searching for a compound word, the top results are now for completely different words that match only part of the search term (and the exact matches for the whole search term appear way down in the results). Is there any way to disable this feature (or at least make sure that exact matches are ranked higher than matches for parts of the compound word)? We get A LOT of incidents from our users saying that search is no longer working, and we have identified the decompounding feature as the cause.

Version history
Last update:
‎07-15-2024 06:33 AM