Explaining the Search Index Issue

Written by LunarSpotlight
Monday, 05-Aug-24 23:20:45 UTC


This week was a bit lean for station-related work, so we’ll take the opportunity to dive into more detail about something we fixed recently: website search and character encoding.

Earlier this week, a listener pointed out a discrepancy between a browsable album on the website and the search results. The cause was related to a table of indexed search terms, which separately stores things like album and song titles. When we migrated and updated systems back in June, we also updated the encoding for a column related to album titles in order to allow RD-Sounds’ album 「𠷡」 (Yuwai) to be stored.

The kanji character for this album sits in an extended range of Unicode, and storing it in UTF-8 takes one more byte than the old database allowed for. This is a technical thing, but basically, every character can be represented in a binary format for storage and recall. For example, A is represented by the decimal number 65, or the binary number “01000001”, and these eight binary digits make up a single unit known as a “byte”. Japanese characters are stored as “multi-byte” characters because thousands of characters need to be represented, and a single byte in UTF-8 can only represent 128 distinct characters (one of its bits is reserved to signal whether more bytes follow). Using 2 bytes brings this limit up to around 2,000 characters, and each additional byte raises the limit much further.
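For anyone who wants to poke at this, here is a minimal Python sketch of the idea. It is only an illustration, not code from the site: the helper name `describe` and the sample characters are picked for this example.

```python
def describe(ch: str) -> tuple[int, str, int]:
    """Return (code point, UTF-8 bytes written as bits, number of UTF-8 bytes)."""
    encoded = ch.encode("utf-8")
    # Format every byte as eight binary digits, e.g. 65 -> "01000001".
    bits = " ".join(f"{byte:08b}" for byte in encoded)
    return ord(ch), bits, len(encoded)


# A Latin letter, a hiragana character, and the Yuwai character.
for ch in ["A", "あ", "𠷡"]:
    code_point, bits, size = describe(ch)
    # Print the code point in U+XXXX form so the output stays plain ASCII.
    print(f"U+{code_point:04X} (decimal {code_point}): {size} byte(s): {bits}")
```

The first line of output shows A as decimal 65 stored in the single byte 01000001. The hiragana character takes three bytes, and the Yuwai character takes four.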

Until Yuwai came along, we had been fine using up to 3 bytes of storage space per character, which in UTF-8 covers Unicode’s Basic Multilingual Plane, where nearly all everyday Japanese text lives. Over time, though, more characters get defined in Unicode, with a major modern-day contributor being emoji. The Yuwai character is a bit of a weird case because it’s an older Chinese character which was added to Unicode around the year 2001, and to store it in UTF-8, four bytes are required, as defined by the Unicode Standard.
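The byte count follows directly from how large a character's Unicode number (its code point) is. The sketch below spells out the standard UTF-8 ranges. The function names `utf8_length` and `needs_four_bytes` are made up for this example, and the sample string is only an illustration.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses for a given Unicode code point."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a valid Unicode code point")
    if code_point <= 0x7F:      # ASCII
        return 1
    if code_point <= 0x7FF:     # e.g. accented Latin, Greek, Cyrillic
        return 2
    if code_point <= 0xFFFF:    # rest of the Basic Multilingual Plane
        return 3
    return 4                    # supplementary planes: rarer kanji, emoji


def needs_four_bytes(text: str) -> list[str]:
    """Return the characters in `text` that do not fit in 3 bytes."""
    return [ch for ch in text if utf8_length(ord(ch)) == 4]


offenders = needs_four_bytes("RD-Sounds 𠷡")
print(len(offenders))  # 1: only the Yuwai character needs four bytes
```

A check like this could be run over a list of titles to spot anything that a 3-byte column would reject.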

The old database was effectively hardcoded to use 3 bytes for character storage. When we migrated to the new system, the database was set up to use four bytes and albums were changed to use four, but the search table still used three. When search terms were re-indexed, the process would silently fail partway through the albums and omit everything that came after, including songs. This is the root cause of some expected albums and songs not appearing in search results. With that said, other issues persist, including consistent display of the Yuwai character: although it is now stored correctly, the PWA may still show a “?” as a placeholder.
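To make the failure mode concrete, here is a toy Python model of it. This is not our actual indexing code: the 3-byte limit is simulated by hand, and every name and title in it (`store_term`, `reindex_fragile`, `reindex_reporting`, “Album One” and so on) is hypothetical.

```python
MAX_BYTES_PER_CHAR = 3  # assumption: stands in for the old search table's limit


def store_term(index: list[str], term: str) -> None:
    """Add a term to the index, rejecting characters wider than the limit."""
    if any(len(ch.encode("utf-8")) > MAX_BYTES_PER_CHAR for ch in term):
        raise ValueError("term contains a character the table cannot store")
    index.append(term)


def reindex_fragile(albums: list[str], songs: list[str]) -> list[str]:
    """Albums first, then songs; the first error ends the whole run quietly."""
    index: list[str] = []
    try:
        for term in albums + songs:
            store_term(index, term)
    except ValueError:
        pass  # swallowed: nothing is logged, so the gap goes unnoticed
    return index


def reindex_reporting(albums: list[str], songs: list[str]) -> tuple[list[str], list[str]]:
    """Same loop, but a bad term is skipped and reported instead."""
    index: list[str] = []
    skipped: list[str] = []
    for term in albums + songs:
        try:
            store_term(index, term)
        except ValueError:
            skipped.append(term)
    return index, skipped


albums = ["Album One", "𠷡", "Album Three"]  # placeholder titles
songs = ["Song A"]
print(len(reindex_fragile(albums, songs)))       # 1: everything after the failure is missing
print(len(reindex_reporting(albums, songs)[0]))  # 3: only the one title is skipped
```

In the fragile version a single unstorable title takes later albums and all songs down with it, which matches the symptom we saw. The reporting version is just one possible way to make that kind of problem visible.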

Looking forward, we want to integrate a dedicated search solution which handles the full indexing process and returns more relevant search results. We will likely also reintegrate our information library at that time. Just another thing to add to the list.

That’s all for this week, thanks for tuning in to my TED* talk, and thanks for listening!

*not actually a TED talk.

[Knowledge #178]


