AI Search Introduces Enhanced Linguistic Support for Portuguese, Korean, Italian, Swedish & Dutch

adambuj · ‎07-15-2024


We are excited to announce an enhancement to AI Search that improves search accuracy and relevance for many of our international customers. Starting August 1st, 2024, AI Search will support advanced linguistic search features for Portuguese, Korean, Italian, Swedish, and Dutch. This expansion includes language-specific tokenization, lemmatization, normalization, and decompounding, all of which are critical for improving document retrieval accuracy in these languages.

With this expansion, AI Search will better understand the nuances of Portuguese, Korean, Italian, Swedish, and Dutch. Language-specific tokenization and decompounding let AI Search break text into words more accurately. Lemmatization maps inflected words to their dictionary forms, and normalization ensures that text is processed consistently. Additionally, typo handling (spellcheck) will be enabled for Portuguese, Italian, Swedish, and Dutch, but not for Korean. These improvements are designed to provide more accurate and relevant search results for our users.

How to Take Advantage of This Feature

To use the new advanced linguistic search capabilities, customers must reindex all indexed sources containing content in these languages after the feature launch date. Reindexing is essential because it activates the new features within your search experience. Once reindexing is complete, users will immediately benefit from more precise and relevant search results.

Key Dates and Actions

Feature Release Date: August 1st, 2024

Action Required: Reindex any content in Portuguese, Korean, Italian, Swedish, and Dutch to activate enhanced support.

Current Feature Support

AI Search supports indexing and search in all BCP-47 languages where words are space delimited. In addition to the new languages, advanced linguistic search features are already available in English, French Canadian, French, German, Japanese, Spanish, and Traditional & Simplified Chinese. Feature support by language is summarized below:

Feature	English	French, French Canadian, German, Spanish	New: Portuguese, Korean, Italian, Swedish, Dutch	Japanese, Traditional & Simplified Chinese	All other languages where words are space delimited
“exact-match” search	✓	✓	✓	✓	✓
Character Normalization	✓	✓	✓	✓	✓
Synonyms	✓	✓	✓	✓	✓
Stop words	✓	✓	✓	✓	✓
Result Improvement Rules	✓	✓	✓	✓	✓
Now Assist Actions Genius Results	✓	✓	✓	✓	✓
Now Assist Q&A Genius Results	✓	✓	✓	✓	✓
Language-specific tokenization, lemmatization, word normalization, & decompounding	✓	✓	✓	✓	–
Typo handling	✓	✓	✓ (except Korean)	–	–

Multilingual Search Terminology

Search is fundamentally about matching terms in your query to documents in your index containing these terms. Results relevancy comprises both precision—the percentage of retrieved results that are relevant—and recall—the percentage of relevant results retrieved. A search engine's precision and recall depend on the natural language processing (NLP) applied to indexed documents and query text.
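To make the two definitions above concrete, here is a minimal Python sketch that computes precision and recall for a single query. The function name and the document IDs are invented for illustration and are not part of AI Search.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs of the documents the search engine returned
    relevant:  IDs of the documents that are actually relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: share of retrieved results that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant results that were retrieved.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 results returned, 3 documents are relevant, 2 overlap.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
print(p, round(r, 2))  # 0.5 0.67
```

Here two of the four returned documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).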

Tokenization

Before you can match query terms to indexed document terms, you need to tokenize the text, i.e., break it apart into discrete words. Tokenization is straightforward in languages such as English, where words are mostly space delimited. In Japanese, on the other hand, words aren’t space delimited. Both AI Search and Zing perform Japanese tokenization based on morphological analysis, which breaks up text into real words, ensuring accurate search results.

Here is an example:

東京都の人口 (Tokyo population)

Using morphological analysis, this text gets tokenized as follows:

東京 (Tokyo)
都 (city)
の ([possessive particle])
人口 (population)

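By contrast, simple substring matching would produce an incorrect match for the query 京都 (Kyoto), because those two characters happen to appear side by side across the boundary between 東京 and 都. Such false matches decrease search precision.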
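The difference between token matching and substring matching can be sketched in a few lines of Python. The token list below is copied by hand from the example above; it is not produced by a real morphological analyzer, and the two helper functions are hypothetical names used only for this illustration.

```python
text = "東京都の人口"

# Tokens as listed in the example above (hand-copied, not computed).
tokens = ["東京", "都", "の", "人口"]


def substring_match(query, text):
    """Naive matching: the query only needs to appear somewhere in the raw text."""
    return query in text


def token_match(query, tokens):
    """Token matching: the query must equal one of the tokens."""
    return query in tokens


print(substring_match("京都", text))  # True  (false positive for Kyoto)
print(token_match("京都", tokens))    # False (Kyoto is not a token)
print(token_match("東京", tokens))    # True  (Tokyo is a token)
```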

Language-Specific Normalization

Next, text needs to be normalized at the character and word level to ensure that a query for one form of a word matches all other forms. As illustrated by the following examples, at the character level, accents should be removed since search end users frequently omit them from queries, and Asian half-width characters should be normalized to match their full-width counterparts:

Pâtisserie → Patisserie
ﾎﾃﾙ → ホテル
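Both character-level normalizations can be approximated with Python's standard unicodedata module. This is only a sketch of the general idea and not how AI Search implements it. The function name normalize_chars is an assumption, as is the choice to strip only the combining marks in the range U+0300 to U+036F, which leaves Japanese voicing marks intact.

```python
import unicodedata


def normalize_chars(text):
    # NFKC folds compatibility forms, e.g. half-width katakana to full-width.
    folded = unicodedata.normalize("NFKC", text)
    # NFD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", folded)
    # Drop only Latin-style combining diacritics (U+0300..U+036F), so that
    # Japanese voicing marks (U+3099, U+309A) survive.
    stripped = "".join(ch for ch in decomposed if not 0x0300 <= ord(ch) <= 0x036F)
    # Recompose whatever is left.
    return unicodedata.normalize("NFC", stripped)


print(normalize_chars("Pâtisserie"))  # Patisserie
print(normalize_chars("ﾎﾃﾙ"))         # ホテル
```

Applying the same function to both indexed text and query text ensures that either spelling matches the other.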

AI Search supports lemmatization—the identification of the dictionary form of the word, based on context. By contrast, Zing supports lemmatization only for Japanese (with certain constraints not imposed by AI Search). Zing uses stemming—an alternative normalization method that truncates words via simplistic rules without considering context—in English, French, and German. Stemming can reduce recall because related terms may have different stems.

For example, the two English verb forms below receive different stems from Zing but the same lemma from AI Search:

Input	Zing	AI Search
selling	sell	sell
sold	sold	sell

Stemming can also lower precision because unrelated words may share the same stem, as in this French example:

Input	Zing	AI Search
faut (necessary)	faut	faut
faute (mistake)	faut	faute
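The two tables can be reproduced with a toy Python sketch. The suffix rules and the lemma dictionary below are invented to cover only the four example words. They are not the actual rules used by Zing or AI Search, and a real lemmatizer also takes context into account, which this lookup does not.

```python
# Toy suffix rules, for illustration only.
SUFFIXES = ("ing", "e")


def toy_stem(word):
    """Strip the first matching suffix without looking at context."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical lemma dictionary covering only the examples above.
LEMMAS = {"selling": "sell", "sold": "sell", "faut": "faut", "faute": "faute"}


def toy_lemma(word):
    """Look up the dictionary form; fall back to the word itself."""
    return LEMMAS.get(word, word)


for word in ("selling", "sold", "faut", "faute"):
    print(word, toy_stem(word), toy_lemma(word))
# selling sell sell
# sold sold sell
# faut faut faut
# faute faut faute
```

With stemming, a query for "sold" misses documents containing "selling" (lower recall), while a query for "faute" also matches "faut" (lower precision). The lemma lookup avoids both problems for these words.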

Decompounding

In certain languages, such as German, compound words are prevalent. These words must be broken down into constituent parts to maximize recall. For example, the compound “humanressourcen” should be broken down into its components, “human” and “ressourcen.” This ensures that queries for the component terms match documents containing the compound and vice versa.
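As a rough illustration of the idea, here is a minimal dictionary-based decompounder in Python. The vocabulary, the function name, and the longest-prefix-first strategy are assumptions made for this sketch. The article does not describe how AI Search actually splits compounds.

```python
def decompound(word, vocabulary, min_part=3):
    """Split `word` into known vocabulary entries, or return [word] if it cannot be split."""
    word = word.lower()

    def split(rest):
        if not rest:
            return []
        # Try the longest known prefix first, then recurse on the remainder.
        for end in range(len(rest), min_part - 1, -1):
            head = rest[:end]
            if head in vocabulary:
                tail = split(rest[end:])
                if tail is not None:
                    return [head] + tail
        return None  # no full split found

    parts = split(word)
    return parts if parts and len(parts) > 1 else [word]


# Hypothetical two-word vocabulary for the example above.
vocabulary = {"human", "ressourcen"}
print(decompound("humanressourcen", vocabulary))  # ['human', 'ressourcen']
print(decompound("human", vocabulary))            # ['human']
```

Applying the same split at index time and at query time is what would let a query for "ressourcen" match a document containing "humanressourcen", and the reverse.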

Conclusion

With this upgrade, AI Search provides advanced linguistic support for a wider range of languages, improving the search experience for our non-English-speaking users. We are committed to continuously enhancing our capabilities to meet the needs of our global customer base. Stay tuned for more updates and enhancements.

Magnus Hovik · ‎02-03-2025

@adambuj when can we expect to see AI search improvements in other languages? Norwegian specifically in my case 🙂

Max Thomsson · ‎03-07-2025

@adambuj The decompounding feature kind of breaks the whole search engine for Swedish. When using the Swedish locale and searching for compound words, the top results are now for completely different words that match only part of the search term (and the exact matches for the whole search term appear way down in the results). Is there any way to disable this feature, or at least to make sure that exact matches are ranked higher than matches for parts of the compound words? We get A LOT of incidents from our users saying that search is no longer working, and we have identified the decompounding feature as the cause.