[LU-15768] Add tests for UTF-8 and UTF-16 handling. Created: 20/Apr/22 Updated: 21/Apr/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Colin Faber | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
AFAIK, we don't test UTF-8 or UTF-16 support, for a variety of reasons (mostly because the file system itself should support it without issue). That said, given the wide range of file names in use these days, including names containing emoji, we should test this to confirm that it does in fact work as expected. Note also that the existing tests may break with these character sets and may need adjustments, such as setting LANG=C.UTF-8, to ensure the environment is ready for UTF-8. Thoughts? |
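A minimal sketch of the kind of round-trip check described above, written in Python rather than in the existing test framework. The helper name roundtrip_names and the sample file names are hypothetical, and a temporary directory stands in for a Lustre mount point. Names are encoded to UTF-8 bytes explicitly, so the result should not depend on LANG.

```python
import os
import tempfile


def roundtrip_names(directory, names):
    """Create an empty file per name and return the names listed back, sorted."""
    dir_bytes = os.fsencode(directory)
    for name in names:
        # Encode explicitly so the check does not depend on the locale.
        path = os.path.join(dir_bytes, name.encode("utf-8"))
        with open(path, "wb"):
            pass
    # Passing bytes to os.listdir() returns the raw bytes of each entry.
    return sorted(entry.decode("utf-8") for entry in os.listdir(dir_bytes))


# Hypothetical sample names: ASCII, accented Latin, and an emoji.
names = ["plain.txt", "caf\u00e9.txt", "\U0001F600.txt"]

# Assumption: a temporary directory stands in for a Lustre mount point.
with tempfile.TemporaryDirectory() as tmp:
    listed = roundtrip_names(tmp, names)

print(listed == sorted(names))
```

Pointing the directory at a path on a mounted Lustre client would exercise the file system itself; that path is deliberately left out here.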
| Comments |
| Comment by John Hammond [ 21/Apr/22 ] |
|
Linux system calls do not support paths encoded as UTF-16 strings. When the string "foo/bar" is encoded in UTF-16, every other byte is NUL (0). To the Linux kernel and to Lustre, a path is a NUL-terminated sequence of bytes. I would be surprised if a well-written application attempted to use a UTF-16 encoded string as a path on Linux. There are no UTF-16 locales. UTF-8 encoded strings with emoji or anything else work just fine as pathnames. They also work just fine with the kind of string handling done in a command like lfs. One issue that may arise is with something like Python, which is very strict about encoding correctness. To make up for the strictness, Python includes a path container type (see https://docs.python.org/3/library/pathlib.html). |
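The byte-level point above can be checked with a short Python sketch. The variable names and the sample byte string are illustrative only. The last part uses os.fsdecode and os.fsencode, which are my addition rather than something from the comment: on POSIX systems they let Python carry file names that are not valid UTF-8 without losing bytes.

```python
import os

ascii_path = "foo/bar"

# UTF-16 (little-endian, no BOM) uses two bytes per ASCII character,
# so every second byte is NUL and the kernel would see the path end early.
utf16 = ascii_path.encode("utf-16-le")
utf8 = ascii_path.encode("utf-8")

has_nul_utf16 = b"\x00" in utf16
has_nul_utf8 = b"\x00" in utf8

# An emoji in UTF-8 is a multi-byte sequence that contains no NUL byte.
emoji_name = "\U0001F600.txt"
emoji_bytes = emoji_name.encode("utf-8")
has_nul_emoji = b"\x00" in emoji_bytes

# Illustrative name with a byte that is not valid UTF-8; it is still a
# legal Linux file name because it contains neither NUL nor "/".
raw = b"bad-\xff-name"
as_str = os.fsdecode(raw)   # bytes -> str using the file system encoding
back = os.fsencode(as_str)  # str -> the original bytes

print(len(utf16), has_nul_utf16, has_nul_utf8, has_nul_emoji, back == raw)
```

A test that feeds arbitrary byte strings to Python tooling would likely want to pass paths as bytes or go through os.fsencode, rather than assume every name decodes as strict UTF-8.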