Details
Type: Bug
Resolution: Fixed
Priority: Critical
Affects Version: Lustre 2.15.2
Environment: ZFS MDT
Severity: 3
Rank: 9223372036854775807
Description
While running several directory restripes (parent directory only, with "lfs migrate -m -d") at the same time, the Lustre client lost its connection, and the MDS logged the typical messages asking to resume the migration process.
Resuming doesn't work; starting the migration process again on the affected directories returns "File descriptor in bad state (77)".
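For illustration, the migrations were started roughly like this (the MDT index and path are placeholders, not the exact values used):

  lfs migrate -m 0 -d /testfs/some/parent/dir

Resuming was attempted by re-running the same command on the affected directories.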
The current stripe settings of the affected directories show hash types of "none" and "bad_type", e.g.
lmv_stripe_count: 5 lmv_stripe_offset: 2 lmv_hash_type: none,bad_type
mdtidx           FID[seq:oid:ver]
     2           [0x280000bd1:0x6fb:0x0]
     0           [0x200000400:0xd16:0x0]
     2           [0x280000402:0xd16:0x0]
     1           [0x240000401:0xd16:0x0]
     3           [0x2c0000402:0xd16:0x0]
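This kind of output can be obtained with lfs getdirstripe, e.g. (the path is a placeholder):

  lfs getdirstripe /testfs/some/affected/dir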
The directories can still be traversed, but no changes can be made to them (adding or removing files fails).
Several directory trees are affected across multiple levels. Trying to migrate the directories from the top level results in "Invalid argument (22)" instead of -EBADFD.
Debug messages on the MDS show that lmv_is_sane() is failing when trying to access the broken directories:
[Thu Apr 6 09:36:53 2023] LustreError: 1952126:0:(lustre_lmv.h:468:lmv_is_sane2()) insane LMV: magic=0xcd20cd0 count=5 index=1 hash=none:0x20000000 version=3 migrate offset=1 migrate hash=fnv_1a_64:33554434.
[Thu Apr 6 09:36:53 2023] LustreError: 1952126:0:(lustre_lmv.h:446:lmv_is_sane()) insane LMV: magic=0xcd20cd0 count=5 index=1 hash=none:0x20000000 version=3 migrate offset=1 migrate hash=fnv_1a_64:33554434.
Removing the directories is also impossible ("cannot remove <path>: invalid argument").
LFSCK only reports "striped_shards_skipped" but doesn't repair anything.
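For reference, a namespace LFSCK can be started and its counters checked roughly like this (the MDT device name is a placeholder for our actual targets):

  lctl lfsck_start -M testfs-MDT0000 -t namespace -A
  lctl get_param mdd.testfs-MDT0000.lfsck_namespace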
The migration might have been started in parallel on several levels within a directory tree, possibly causing races or corruption.
When folders are affected, the whole path of directories from one level below the root down to a directory with no further subdirectories is affected, e.g.
/root/dir1/dir2/dir3, with dir1, dir2 and dir3 all being affected.
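Affected directories can be located by walking the tree and looking for the broken hash type, e.g. something along these lines (the top-level path is a placeholder):

  find /root -type d -exec lfs getdirstripe {} + 2>/dev/null | grep -B 1 bad_type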
Is there any way to repair or remove the affected files? Data loss is not an issue for us, as this is a pre-production test setup, but I want to report the issue anyway because the problem occurred immediately on starting the migration.