Mattermost, Inc.

Supporting full text search for Japanese and other ideographic languages

By default, full text search in English requires a minimum of three characters and requires a trailing * to match the start of a longer term (e.g. “real” doesn’t find “realtime”, but “real*” does).
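
To make the default behavior concrete, here is a toy model in Python. It is only a sketch of the rules described above, not how MySQL or Postgres actually implement full text search; the name `term_search`, the `min_len` parameter, and the whitespace splitting are all assumptions made for illustration.

```python
def term_search(query: str, text: str, min_len: int = 3) -> bool:
    """Toy model: whole-term matching, where a trailing * means 'prefix'."""
    # Assumption: terms are produced by splitting on whitespace.
    terms = text.lower().split()
    query = query.lower()

    is_prefix = query.endswith("*")
    needle = query[:-1] if is_prefix else query

    # Queries shorter than the minimum length never match.
    if len(needle) < min_len:
        return False

    if is_prefix:
        return any(term.startswith(needle) for term in terms)
    return needle in terms


print(term_search("real", "realtime chat"))   # False: "real" is not a whole term
print(term_search("real*", "realtime chat"))  # True: prefix match on "realtime"
print(term_search("re*", "realtime chat"))    # False: shorter than three characters
```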

This doesn’t work well for Japanese, Korean, Chinese or other ideographic languages.
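
One likely reason, sketched below in Python: Japanese text is usually written without spaces between words, so any search that relies on whitespace-separated terms sees a whole sentence as a single term. The helper names are hypothetical, and real database parsers are more sophisticated than `str.split()`; this only illustrates the general problem.

```python
english = "realtime chat is fun"
japanese = "今日は会議があります"  # "There is a meeting today"


def whitespace_terms(text: str) -> list[str]:
    """Assumption: a naive tokenizer that splits on whitespace only."""
    return text.split()


def whole_term_match(query: str, text: str) -> bool:
    """True only if the query equals one of the whitespace-separated terms."""
    return query in whitespace_terms(text)


print(len(whitespace_terms(english)))        # 4 separate terms
print(len(whitespace_terms(japanese)))       # 1 term: the whole sentence
print(whole_term_match("会議", japanese))     # False, although the word is present
print("会議" in japanese)                     # True with a plain substring check
```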

Because this behavior is defined by the full text search capabilities of the underlying database (MySQL or Postgres), we’re hoping someone familiar with Japanese can test whether certain database configurations could be used to optimize searching in Japanese:
http://textsearch-ja.projects.pgfoundry.org/index-ja.html

For example, are there full text search parameters for Japanese in Postgres that would change the default behavior, so that * wouldn’t be needed to match characters and search would support single characters by default, unlike the default English settings?

Would anyone be open to helping test and share advice for other international users?

EDIT: After talking about this with a friend, I think I was answering a completely different question. I guess I’ll rewrite my answer when I have spare time again.

I’ve noticed you can’t search the chat in Japanese, so I’ll give my two cents to help with the problem.
FWIW, I am Japanese, with no experience in Postgres and a bit in Oracle and MySQL.

I’m not sure if I understood your problem, but here’s what I think you are saying.

  1. How do we optimize full text search (which is presumably costly) in multi-byte character languages?

  2. We aren’t able to search multi-byte strings.

  3. On point 1: I am not fully up to date on the current state of full text search optimization, but there are many full text search engines, such as “mroonga”, that handle the optimization. I know this particular product works for Japanese, and I’d assume any popular full text search engine would work for almost any language anyway.

  4. On point 2: a SQL query like “SELECT * FROM FOO WHERE FOO.BAR LIKE '(some random Japanese string)%'” works perfectly fine, just as it does for English. Regex works the same way, so for example, “(some random Japanese string).*” would match (some random Japanese string) followed by any characters (Japanese or English).
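
A quick way to check the LIKE and regex claims is the sketch below. It uses Python's built-in SQLite as a stand-in, which is an assumption on my part since the thread is about MySQL and Postgres; the table `foo`, the column `bar`, and the sample rows are made up for the example.

```python
import re
import sqlite3

# Hypothetical sample data; SQLite stands in for MySQL/Postgres here.
ROWS = ["東京タワーに行きました", "京都に行きました", "realtime chat"]


def like_prefix(prefix: str) -> list[str]:
    """Return rows whose text starts with the prefix, using SQL LIKE 'prefix%'."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE foo (bar TEXT)")
        conn.executemany("INSERT INTO foo (bar) VALUES (?)", [(r,) for r in ROWS])
        cursor = conn.execute(
            "SELECT bar FROM foo WHERE bar LIKE ? ORDER BY rowid", (prefix + "%",)
        )
        return [row[0] for row in cursor.fetchall()]
    finally:
        conn.close()


def regex_prefix(prefix: str) -> list[str]:
    """Return rows matching the regex '<prefix>.*' anchored at the start."""
    pattern = re.compile(re.escape(prefix) + ".*")
    return [r for r in ROWS if pattern.match(r)]


print(like_prefix("東京"))   # only the row starting with 東京
print(like_prefix("real"))   # only the English row
print(regex_prefix("東京"))  # same result as the LIKE query
```

Note that this is prefix pattern matching on the stored text, which is a different mechanism from the database's full text index.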

For pattern matching like this, there is no difference between ideographic languages (as you put it) and English, because programs just interpret the text as bytes, and the characters simply happen to be single-byte or multi-byte depending on the language.
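
The point about bytes can be illustrated with a small Python sketch, assuming the text is stored as UTF-8 (the thread does not say which encoding the database uses). The helper names are hypothetical.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def bytes_prefix_match(prefix: str, text: str) -> bool:
    """Prefix comparison performed purely on the encoded bytes."""
    return text.encode("utf-8").startswith(prefix.encode("utf-8"))


print(utf8_len("real"))   # 4: one byte per character
print(utf8_len("会議"))    # 6: three bytes per character

# The byte-level comparison behaves the same for both languages.
print(bytes_prefix_match("real", "realtime"))       # True
print(bytes_prefix_match("会議", "会議があります"))   # True
print(bytes_prefix_match("会議", "今日は会議"))       # False: not at the start
```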

Hope this helps.