Enabling Multilingual Search with AI Search: Utah Edition

Heather Phipps · 03-14-2023

Introduction

NOW users expect search to work seamlessly across multiple languages. They want to be able to issue queries across all available content, regardless of language, and retrieve highly relevant results. This article describes the benefits of AI Search relative to Zing for multilingual content, how document filtering by language works in AI Search, and how to customize your configuration to improve the search experience for international users.

AI Search supports indexing and search in all BCP-47 languages where words are space delimited. In addition, advanced linguistic search features are available in English, French Canada, French, German, Japanese, Spanish, and Traditional & Simplified Chinese. Feature support by language in Utah is summarized below:

Feature	English	French, French Canada, German, Spanish	Japanese	Traditional & Simplified Chinese	All other languages in which words are space delimited
Indexing & “exact-match” search	✓	✓	✓	✓	✓
Character Normalization	✓	✓	✓	✓	✓
Synonyms	✓	✓	✓	✓	✓
Stop words	✓	✓	✓	✓	✓
Result Improvement Rules	✓	✓	✓	✓	✓
Catalog Item Genius Results	✓	✓	✓	✓	✓
Language-specific tokenization, word normalization, & decompounding	✓	✓	✓	✓ NEW
Typo handling	✓	✓
Q&A Genius Results	✓

Multilingual Search Terminology

Search is fundamentally about matching terms in your query to documents in your index containing these terms. Results relevancy comprises both precision—the percentage of retrieved results that are relevant—and recall—the percentage of relevant results retrieved. A search engine's precision and recall depend on the natural language processing (NLP) applied to indexed documents and query text. Out of the box, before any tuning, AI Search provides a ~10% relevancy lift over Zing¹ largely due to its more sophisticated NLP. NLP for search includes three essential tasks: tokenization, language-specific normalization, and decompounding.
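To make the two definitions concrete, here is a minimal sketch that computes precision and recall for a single query. The document IDs and the function name are invented for illustration and do not correspond to any AI Search API.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs the search engine returned.
    relevant:  IDs a human judged relevant for the query.
    """
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = retrieved_set & relevant_set
    # Precision: share of retrieved results that are relevant.
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of relevant results that were retrieved.
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical query: 4 results returned, 3 of them relevant,
# out of 6 relevant documents in the index.
retrieved = ["KB001", "KB002", "KB003", "KB009"]
relevant = ["KB001", "KB002", "KB003", "KB004", "KB005", "KB006"]
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.50
```

A change in NLP usually moves both numbers at once, which is why the sections below discuss each technique in terms of its effect on precision, recall, or both.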

Tokenization

Before you can match query terms to indexed document terms, you need to tokenize the text, i.e., break it apart into discrete words. Tokenization is straightforward in languages, such as English, where words are mostly space delimited. In Japanese, on the other hand, words aren’t space delimited. Both AI Search and Zing perform Japanese tokenization based on morphological analysis, which breaks up text into real words, ensuring accurate search results.

Here is an example:

東京都の人口 (Tokyo population)

Using morphological analysis, this text gets tokenized as follows:

東京 (Tokyo)
都 (city)
の ([possessive particle])
人口 (population)

By contrast, simple substring matching would result in an incorrect match for the query 京都 (Kyoto), decreasing search precision.
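The difference can be sketched in a few lines of Python. The token list is copied from the example above rather than produced by a real morphological analyzer, and the two helper functions are hypothetical names used only for this illustration.

```python
text = "東京都の人口"
# Tokens as listed in the morphological analysis example above.
tokens = ["東京", "都", "の", "人口"]


def substring_match(query, text):
    """Naive matching: the query hits if it appears anywhere in the raw text."""
    return query in text


def token_match(query, tokens):
    """Token matching: the query hits only if it equals a whole token."""
    return query in tokens


print(substring_match("京都", text))  # True  (false positive: Kyoto is not in the text)
print(token_match("京都", tokens))    # False (correctly rejected)
print(token_match("東京", tokens))    # True  (Tokyo is a real token)
```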

Only AI Search supports tokenization in Traditional and Simplified Chinese.

Language-Specific Normalization

Next, text needs to be normalized at the character and word level to ensure that a query for one form of a word matches all other forms. As illustrated by the following examples, at the character level, accents should be removed since search end users frequently omit them from queries, and Asian half-width characters should be normalized to match their full-width counterparts:

Pâtisserie → Patisserie
ﾎﾃﾙ → ホテル
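Both character-level examples can be reproduced with Python's standard `unicodedata` module. This is a minimal sketch of the general technique, not a description of how AI Search implements it; the function name is hypothetical, and the choice to strip only marks in the Combining Diacritical Marks block (U+0300 to U+036F) is an assumption made so that Japanese voiced-sound marks are left intact.

```python
import unicodedata


def normalize_chars(text: str) -> str:
    """Fold half-width katakana to full-width and remove Latin accents."""
    # NFKC maps half-width katakana to their full-width counterparts.
    text = unicodedata.normalize("NFKC", text)
    # NFD splits accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop only Latin-style combining marks (U+0300 to U+036F).
    stripped = "".join(ch for ch in decomposed if not ("\u0300" <= ch <= "\u036f"))
    # Recompose whatever was decomposed but not stripped (e.g. Japanese ガ).
    return unicodedata.normalize("NFC", stripped)


print(normalize_chars("Pâtisserie"))  # Patisserie
print(normalize_chars("ﾎﾃﾙ"))         # ホテル
```

Applying the same function to both indexed text and query text is what makes a query typed without accents match a document written with them.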

AI Search supports lemmatization—the identification of the dictionary form of the word, based on context—in English, French Canada, French, German, Japanese, Spanish, and Traditional & Simplified Chinese. By contrast, Zing supports lemmatization only for Japanese (with certain constraints not imposed by AI Search). Zing uses stemming—an alternative normalization method that truncates words via simplistic rules without considering context—in English, French, and German. Stemming can reduce recall because related terms may have different stems.

For example, in English:

Input	Zing	AI Search
selling	sell	sell
sold	sold	sell

Stemming can also lower precision because unrelated words may share the same stem, as in this French example:

Input	Zing	AI Search
faut (necessary)	faut	faut
faute (mistake)	faut	faute
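The two tables above can be reproduced with a toy comparison. The suffix lists and the lemma dictionary below are invented for this sketch; they are not Zing's or AI Search's actual rules, and a real lemmatizer uses context and a full dictionary rather than a hard-coded lookup.

```python
# Hypothetical suffix rules for a simplistic stemmer.
SUFFIXES = {"en": ("ing", "ed", "s"), "fr": ("es", "e", "s")}

# Hypothetical lemma dictionary mirroring the AI Search column in the tables.
LEMMAS = {
    "en": {"selling": "sell", "sold": "sell"},
    "fr": {"faut": "faut", "faute": "faute"},
}


def toy_stem(word, lang):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES[lang]:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemma(word, lang):
    """Look up the dictionary form; unknown words are returned unchanged."""
    return LEMMAS[lang].get(word, word)


for lang, word in [("en", "selling"), ("en", "sold"), ("fr", "faut"), ("fr", "faute")]:
    print(word, toy_stem(word, lang), toy_lemma(word, lang))
# selling sell sell
# sold sold sell
# faut faut faut
# faute faut faute
```

The stemmer misses the link between "sold" and "sell" (lower recall) and conflates "faut" with "faute" (lower precision), while the dictionary lookup gets both cases right.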

Decompounding

In certain languages, such as German, compound words are prevalent. These words must be broken down into constituent parts to maximize recall. For example, the compound “humanressourcen” should be broken down into its components, “human” and “ressourcen.” This ensures that queries for the component terms match documents containing the compound and vice versa. Only AI Search supports German decompounding.
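As a rough illustration of the idea, the sketch below splits a compound by looking for a position where both halves are known words, then indexes the compound together with its parts. The vocabulary, function names, and two-part limit are assumptions made for this example; production decompounders handle linking elements and compounds with more than two parts.

```python
def decompound(word, vocabulary, min_part_len=3):
    """Split a compound into two known words, or return it unchanged."""
    word = word.lower()
    for split in range(min_part_len, len(word) - min_part_len + 1):
        head, tail = word[:split], word[split:]
        if head in vocabulary and tail in vocabulary:
            return [head, tail]
    return [word]


def index_terms(word, vocabulary):
    """Terms to index: the compound itself plus its components."""
    return {word.lower(), *decompound(word, vocabulary)}


VOCABULARY = {"human", "ressourcen"}  # hypothetical dictionary
print(decompound("Humanressourcen", VOCABULARY))
# ['human', 'ressourcen']
print(sorted(index_terms("Humanressourcen", VOCABULARY)))
# ['human', 'humanressourcen', 'ressourcen']
```

Because the same term set is produced for documents and queries, a query for "ressourcen" matches a document containing only the compound, and a query for the compound matches documents containing the components.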

Searchable Content in AI Search

There are two categories of translated content in the Now Platform:

Translated fields, such as Catalog Item fields.
Translated documents, such as Knowledge articles.

The default AI Search filtering behavior differs for these two content types.

In the case of Catalog Item search, if a field lacks a translation in the user’s session language, AI Search effectively falls back to exact matching against the English-language field value.

By contrast, in the case of Knowledge search, AI Search only searches articles that are in the same language as the user’s session.
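The two default behaviors can be summarized with a small sketch. The data structures, record numbers, and function names are hypothetical and only model the behavior described in this section; they are not the platform's actual tables or APIs.

```python
from dataclasses import dataclass

FALLBACK_LANGUAGE = "en"  # per this section, untranslated fields fall back to English


@dataclass
class KnowledgeArticle:
    number: str
    language: str


def catalog_field_value(translations: dict, session_language: str) -> str:
    """Translated field: prefer the session-language value, else the English one."""
    return translations.get(session_language, translations[FALLBACK_LANGUAGE])


def default_knowledge_filter(articles, session_language):
    """Translated documents: keep only articles in the session language."""
    return [a for a in articles if a.language == session_language]


laptop_name = {"en": "Laptop", "fr": "Ordinateur portable"}
print(catalog_field_value(laptop_name, "fr"))  # Ordinateur portable
print(catalog_field_value(laptop_name, "de"))  # Laptop (no German translation)

articles = [KnowledgeArticle("KB0001", "en"), KnowledgeArticle("KB0002", "fr")]
print([a.number for a in default_knowledge_filter(articles, "fr")])  # ['KB0002']
```

Note that in this model the English value is consulted only when the session-language translation is missing, so a French user searching a fully translated item matches against the French value.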

Configuration Options: From Rome to Vancouver

AI Search provides a consumer-like search experience where users can express their intent in their language of choice and get back all relevant results across all relevant languages. We've heard your feedback that while AI Search provides better relevancy than Zing within each supported language, you need more flexibility around defining the set of globally searchable documents. From San Diego to Utah, we've delivered numerous enhancements in this area, with more to come in Vancouver.

The Evolution of Multilingual Search in AI Search

In San Diego, we productized the workaround introduced in Rome that lets users search in both their session language and English by introducing the concept of a global fallback locale. However, the use case for this feature remains narrow: it works best when designating all English articles as globally searchable.

We extended “tier 1” language support to Traditional and Simplified Chinese in Tokyo. We also added support for locales consisting of a language and country code for customers with granularly localized content. For example, such customers might have Mexican Spanish content distinct from Castilian Spanish content. This was a change in the Now Platform at large, supported by AI Search.

In Utah, we added the ability to designate any Knowledge article in any language as global and searchable by all users. We accomplished this without losing the benefits of language-specific processing by adding lightweight language identification at query time. So, for example, when a search user switches from English to Japanese, we detect this automatically and process the query accordingly to retrieve relevant Japanese documents. We also added language ID to task table indexing for better relevancy for non-English task table content.
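The sketch below models the two Utah ideas side by side: a filter that admits globally searchable articles in any language, and a query-time language check that decides how the query text should be processed. Everything here is an assumption made for illustration. In particular, the kana-based check is a toy heuristic, not the language identification AI Search uses, and it cannot distinguish a kanji-only Japanese query from Chinese.

```python
def detect_query_language(query: str, session_language: str) -> str:
    """Toy language ID: any hiragana or katakana character implies Japanese."""
    for ch in query:
        if "\u3040" <= ch <= "\u30ff":  # Hiragana (3040-309F), Katakana (30A0-30FF)
            return "ja"
    return session_language


def is_searchable(article: dict, session_language: str) -> bool:
    """Searchable if in the session language or designated as global."""
    return article["language"] == session_language or article["global"]


# Hypothetical articles; "global" marks globally searchable content.
articles = [
    {"number": "KB0001", "language": "en", "global": False},
    {"number": "KB0002", "language": "ja", "global": True},
    {"number": "KB0003", "language": "ja", "global": False},
]

session_language = "en"
print([a["number"] for a in articles if is_searchable(a, session_language)])
# ['KB0001', 'KB0002']
print(detect_query_language("東京都の人口", session_language))      # ja
print(detect_query_language("Tokyo population", session_language))  # en
```

With both pieces in place, an English-session user who types a Japanese query has it processed as Japanese and can match the global Japanese article.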

Looking ahead to Vancouver, we're focusing on best practices documentation for filtering and promoting content by geography and language. This documentation will help Admins and Knowledge Managers construct Result Improvement Rules to promote contextually relevant content so that users can easily find information relevant to their geography, role, and organization. We're also adding the ability to configure searchable languages by country. This will enable users in countries with multiple official languages to search across Knowledge content in all official languages. Lastly, we have some Japanese tokenization improvements planned for better recall.
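To illustrate what configuring searchable languages by country could look like conceptually, here is a minimal sketch. The mapping contents, its name, and the function are hypothetical; this section does not describe the actual Vancouver configuration, so treat this as a model of the idea only.

```python
# Hypothetical admin-defined mapping from country code to searchable languages.
COUNTRY_LANGUAGES = {
    "CA": ["en", "fr"],
    "BE": ["nl", "fr", "de"],
    "CH": ["de", "fr", "it"],
}


def searchable_languages(session_language, user_country):
    """Session language first, then any additional languages for the country."""
    languages = [session_language]
    for lang in COUNTRY_LANGUAGES.get(user_country, []):
        if lang not in languages:
            languages.append(lang)
    return languages


print(searchable_languages("fr", "CA"))  # ['fr', 'en']
print(searchable_languages("ja", "JP"))  # ['ja'] (no mapping for this country)
```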

Configuration Recommendations

With all these new features, you may wonder which are most appropriate for your use case and requirements. The table below summarizes our recommendations:

Feature	When to Use	Usage Notes
Global Fallback Workaround (Rome)	End users need to search across KBs in their session language + English.	Deprecated. We recommend unwinding the workaround with the steps outlined in the next section, especially if you designated non-English documents as global.
Global Fallback Locale (San Diego+)	End users need to search across KBs in their session language + English.	From Utah+, we recommend using the globally searchable content filter instead, especially if you need to designate non-English documents as global.
Language Fallback (Tokyo+)	If you have granularly localized content (e.g., Mexican Spanish content distinct from Castilian Spanish content).	Example: Spanish is the fallback for both Mexican Spanish and Castilian Spanish. Mexican Spanish users can search all KBs in Mexican Spanish and Spanish. Likewise, Castilian Spanish users can search all KBs in Castilian Spanish and Spanish.
Globally Searchable Knowledge Filter (Utah+)	End users need to search across a flexibly configured set of KBs (in addition to docs in their session language).	Recommended over the Global Fallback Locale for new configurations because it’s more flexible and doesn’t impact language-specific document processing in non-English languages.
Country to Languages Mapping (Vancouver+)	When the searchable content filter needs to be set dynamically at runtime, based on the user's country (sys_user country value).	Compatible with global fallback, language fallback, and globally searchable knowledge filter.
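The Language Fallback example (Mexican Spanish and Castilian Spanish both falling back to Spanish) can be sketched as follows. The sketch assumes locales are written as a language code optionally followed by a hyphen and a country code, such as es-MX; the function names are hypothetical.

```python
def language_fallback(locale):
    """'es-MX' -> 'es'; a bare language code has no fallback."""
    language, _, region = locale.partition("-")
    return language if region else None


def searchable_locales(session_locale):
    """Locales whose KBs the user can search: own locale, then its fallback."""
    locales = [session_locale]
    fallback = language_fallback(session_locale)
    if fallback:
        locales.append(fallback)
    return locales


print(searchable_locales("es-MX"))  # ['es-MX', 'es']
print(searchable_locales("es-ES"))  # ['es-ES', 'es']
print(searchable_locales("fr"))     # ['fr']
```

Mexican Spanish and Castilian Spanish users share the generic Spanish content but do not see each other's country-specific KBs.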

Unwinding the Rome Workaround

While you may continue to use the global fallback workaround introduced in Rome, it’s no longer officially supported. If you used it to designate non-English documents as “global,” we highly recommend you unwind it because the globally searchable knowledge filter introduced in Utah better serves this use case. Please follow these steps if you wish to roll back the workaround:

Navigate to AI Search > AI Search Index > Indexed Sources.
Edit the Knowledge Table record.
In the Field Settings & Mapping related list, locate the map_to_raw setting that has translation_language_id as its value and change its field from u_search_index_language to language.
Reindex all tables for the Knowledge Table indexed source.
(OPTIONAL) Delete or disable the business rule created to populate the u_search_index_language field.

Conclusion

Thanks for reading! If you have feedback on the changes described in this article or a multilingual search request, please don’t hesitate to get in touch via the comments below or direct message.

¹(AI Search NDCG) - (Zing NDCG) as measured on hand-labeled golden sets.

Eric Davis · 02-14-2024

What is the roadmap and timeline for enabling lemmatization for additional languages, specifically Polish and Dutch?

Daniel Oderbolz · 04-04-2024

Hi @Heather Phipps

Thanks for this great article. Is there any update for Washington D.C.?

Best

Daniel

adambuj · 04-04-2024

@Daniel Oderbolz No major updates to the multilingual search experience in Washington D.C. However, we have plans to extend our advanced linguistic features (seen in the first table of this article) to additional languages in upcoming releases.

adambuj · 07-15-2024

@Daniel Oderbolz @Eric Davis You can see the latest plans for extending support here.

J2 · 09-25-2024

"In the case of Catalog Item search, if a field lacks a translation in the user’s session language, AI Search effectively falls back to exact matching against the English-language field value" :

If I have a catalog item with an English name and a French name, using translated text, is there any way I can allow a user with their locale set to French to search for the item by its English name? I would assume the fallback settings serve this purpose, but I cannot seem to get it to work. Is this a mistake on my part, or is this simply not a feature, as suggested here? https://www.servicenow.com/community/now-platform-forum/ai-search-catalog-item-translations/m-p/1119...