[LU-15768] Add tests for UTF-8 and UTF-16 handling. Created: 20/Apr/22  Updated: 21/Apr/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Colin Faber Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Rank (Obsolete): 9223372036854775807

 Description   

AFAIK, we don't test UTF-8 or UTF-16 support, for a variety of reasons (mostly because the file system itself should support it without issue). That said, given the wide range of possible file names these days, including file names with emojis, we should test this to confirm that it does in fact work as expected.

Also note that the existing tests may break with these character sets and may need adjustments, such as setting LANG=C.UTF-8, to ensure the environment is ready for UTF-8.

Thoughts?



 Comments   
Comment by John Hammond [ 21/Apr/22 ]

Linux system calls do not support paths encoded as UTF-16 strings. When the string "foo/bar" is encoded in UTF-16, every other byte will be NUL (0). To the Linux kernel and to Lustre, a path is a NUL-terminated sequence of bytes, so such a path would end at the first NUL.
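
As a rough illustration of the point above, the following Python sketch encodes "foo/bar" both ways and shows what a NUL-terminated reader would see. The helper as_c_string is hypothetical and only mimics C string termination; it is not part of Lustre or of any library.

```python
def as_c_string(data: bytes) -> bytes:
    """Return the bytes a NUL-terminated C string reader would see."""
    return data.split(b"\x00", 1)[0]


path = "foo/bar"
utf8_bytes = path.encode("utf-8")
utf16_bytes = path.encode("utf-16-le")  # little-endian, no byte order mark

print(utf8_bytes)                # b'foo/bar'
print(utf16_bytes)               # b'f\x00o\x00o\x00/\x00b\x00a\x00r\x00'
print(as_c_string(utf8_bytes))   # b'foo/bar'
print(as_c_string(utf16_bytes))  # b'f'
```

The UTF-8 form survives intact because UTF-8 never produces a NUL byte except for the NUL character itself.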

I would be surprised if a well-written application attempted to use a UTF-16 encoded string as a path on Linux.

There are no UTF-16 locales.

UTF-8 encoded strings with emojis or whatever work just fine as pathnames. They also work just fine with the kind of string handling done in a command like lfs.
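
A minimal sketch of the kind of check the ticket asks for, assuming a throwaway temporary directory rather than a Lustre mount: create a file whose name is UTF-8 bytes containing an emoji, then confirm the directory listing returns the same bytes. The function name name_round_trips and the sample file name are made up for illustration. Byte paths are used throughout so the result does not depend on the locale.

```python
import os
import tempfile


def name_round_trips(name: str) -> bool:
    """Create a file named `name` (as UTF-8 bytes) and check it lists back unchanged."""
    raw_name = name.encode("utf-8")
    with tempfile.TemporaryDirectory() as tmp:
        raw_dir = os.fsencode(tmp)
        full_path = os.path.join(raw_dir, raw_name)
        fd = os.open(full_path, os.O_CREAT | os.O_WRONLY, 0o600)
        os.close(fd)
        # os.listdir() returns bytes entries when given a bytes path.
        return raw_name in os.listdir(raw_dir)


result = name_round_trips("report_\N{PARTY POPPER}.txt")
print(result)
```

A real test would point the directory at a Lustre client mount and would likely also exercise lfs on the resulting path.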

One issue that may arise is with something like Python, which is very strict about encoding correctness. To make up for the strictness, Python includes a path container type (see https://docs.python.org/3/library/pathlib.html).
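
To make the strictness concrete, here is a small sketch using a made-up file name whose bytes are not valid UTF-8. A strict decode fails, while the "surrogateescape" error handler lets the bytes pass through a str and back unchanged. On POSIX systems the standard os.fsdecode and os.fsencode helpers apply this handler with the filesystem encoding; the explicit codec calls below are used only to keep the example independent of the locale.

```python
raw = b"caf\xe9.txt"  # Latin-1 bytes for a hypothetical name; not valid UTF-8

# A strict decode rejects the name outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate code point...
text = raw.decode("utf-8", errors="surrogateescape")

# ...and encoding with the same handler restores the original bytes.
restored = text.encode("utf-8", errors="surrogateescape")

print(strict_ok)        # False
print(restored == raw)  # True
```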
