[LU-16717] Directory restripe breaking lmv stripe settings Created: 06/Apr/23  Updated: 04/Oct/23  Resolved: 31/May/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.2
Fix Version/s: Lustre 2.16.0, Lustre 2.15.4

Type: Bug Priority: Critical
Reporter: Patrick Keller Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: migration_improvements
Environment:

ZFS MDT


Attachments: Text File debug_migrate_rmdir_leafdir.txt     File dmesg_dommigrate.out    
Issue Links:
Related
is related to LU-14975 DNE3: directory migration in non-recu... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While running several directory restripes at the same time (parent directory only, with "lfs migrate -m -d"), the Lustre client lost its connection, with the typical messages on the MDS asking to resume the migration process.

Resuming doesn't work; restarting the migration on the affected directories returns "File descriptor in bad state (77)".

The current stripe settings of the affected directories show hash types of "none" and "bad_type", e.g.

lmv_stripe_count: 5 lmv_stripe_offset: 2 lmv_hash_type: none,bad_type
mdtidx           FID[seq:oid:ver]
     2           [0x280000bd1:0x6fb:0x0]
     0           [0x200000400:0xd16:0x0]
     2           [0x280000402:0xd16:0x0]
     1           [0x240000401:0xd16:0x0]
     3           [0x2c0000402:0xd16:0x0]
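(For reference, output like the above is produced by "lfs getdirstripe"; the path below is only a placeholder for one of the affected directories, not a path from this system.)

# lfs getdirstripe /lustre/affected_dir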
 

The directories can be traversed, but changes to them (adding or removing files) are not possible.

Several directory trees are affected across multiple levels. Trying to migrate the directories from the top level results in "Invalid argument (22)" instead of -EBADFD.

Debug messages on the MDS show that lmv_is_sane() is failing when trying to access the broken directories:

[Thu Apr  6 09:36:53 2023] LustreError: 1952126:0:(lustre_lmv.h:468:lmv_is_sane2()) insane LMV: magic=0xcd20cd0 count=5 index=1 hash=none:0x20000000 version=3 migrate offset=1 migrate hash=fnv_1a_64:33554434.
[Thu Apr  6 09:36:53 2023] LustreError: 1952126:0:(lustre_lmv.h:446:lmv_is_sane()) insane LMV: magic=0xcd20cd0 count=5 index=1 hash=none:0x20000000 version=3 migrate offset=1 migrate hash=fnv_1a_64:33554434.
 

Removing the directories is also impossible ("cannot remove <path>: invalid argument").

LFSCK only reports "striped_shards_skipped", but doesn't repair anything.

The migration might have been started in parallel on several levels within a directory tree, possibly causing races or corruption.

If directories are affected, they are part of a whole affected path from root+1 down to a directory with no further subdirectories, e.g.

/root/dir1/dir2/dir3 with directories dir1 dir2 dir3 being affected.

Is there any way to repair or remove the affected files? Data loss is not an issue for us, as this is a pre-production test setup, but I want to report the issue anyway because the problem occurred immediately upon starting the migration.



 Comments   
Comment by Patrick Keller [ 06/Apr/23 ]

The "lfsck_striped_dir.c" documentation suggests that the "LMV_HASH_FLAG_BAD_TYPE" flag has been introduced to distinguish between cases with a valid hash on master LMV EA and the ones where anyone needs to be trusted because there is no valid hash found yet.

However, in my case there seems to be no valid hash to be found anywhere. There should be a way to ignore LMV_HASH_FLAG_BAD_TYPE in order to remove the file. As of right now, the directory appears to be marked read-only, as it is supposed to be for "LMV_HASH_TYPE_UNKNOWN", but there is no exception that would still allow deleting it.

 

I have a few debug outputs showing lfsck trying to fix directories with bad name hashes (flag = 4), which seems to match the behavior described above. I assume that lfsck set the bad_type flag but can't remove it.

Comment by Andreas Dilger [ 06/Apr/23 ]

keller,
I haven't looked into the details of this issue yet, but wanted to take a step back to ask what your goal for the directory restripe was, so that the resolution can be geared toward addressing the issue in the right way. Were you restriping the whole directory tree from 1-stripe to 4-stripe directories? IMHO, that would be counter-productive in terms of performance and reliability vs. using 1-stripe directories for all but the very largest directories (1M entries or more).

For Lustre 2.15 there is automatic MDT space balancing (should keep MDT usage within about 5% of each other). By default the top 3 levels of the directory tree have round-robin MDT directory creation to better utilize all MDTs in the system. This will create 1-stripe directories on remote MDTs to ensure all of the MDTs are being used. The round-robin MDT allocation can be enabled deeper into the directory tree with "lfs setdirstripe -D -c 1 -i -1 --max-inherit-rr=N <dir>", if needed, but the defaults should provide reasonable MDT usage out of the box. With the MDT space balancing there is very little need to use striped directories except for huge single directories.
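As a hedged example (the directory path and the depth value below are placeholders, not values from this ticket), the round-robin default could be applied to a project root and then verified with the -D option of getdirstripe:

# lfs setdirstripe -D -c 1 -i -1 --max-inherit-rr=5 /lustre/projects
# lfs getdirstripe -D /lustre/projects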

Comment by Patrick Keller [ 06/Apr/23 ]

The intent was to test the behavior of the file system when doing a lot of directory restriping, to see if that's something that could be done in a production environment for rebalancing MDTs, as well as specifically to test the impact on single-directory metadata performance.

For production, we will set the round-robin depth to go into the first level within project-specific directories for initial load distribution and will let the load balancing do the rest.

The ticket is only supposed to point at a possible issue that came up during restriping and/or LFSCK, and is not supposed to lead to any new features.

Comment by Andreas Dilger [ 07/Apr/23 ]

keller, thanks for the background, and I appreciate the bug report. I just wanted to make sure that you weren't intentionally trying to restripe the whole filesystem to fully-striped directories, since we've had a number of significant performance problems when users think that will somehow improve performance, and I wanted to steer you away from that. As I wrote, the 2.15 MDT space balancing and round-robin directory creation have proven very useful.

In case we aren't able to reproduce the issue here, it would definitely be useful to collect full debug logs from the MDS (assuming you have a client mounted there also):

# lctl set_param debug=all debug_mb=1024
# lctl clear
# lctl mark "migrate"
# lfs migrate -m -d ... <dir>
# lctl mark "rmdir"
# rmdir <dir>
# lctl dk /tmp/debug.migrate.txt

along with any console errors from the MDS from the time of the original migrate error.

Comment by Patrick Keller [ 07/Apr/23 ]

I've attached logs (debug_migrate_rmdir_leafdir) that show the debug output when trying to migrate and remove an affected directory. I ran the command on the MDS with MDT0002 because I think that's where the issue occurs.

I created the directory list that I used for directory migration after the first "rc = -22" errors appeared, so the issue might have been present before restriping of the directories even started. The FIDs affected at that time can't be resolved by lfs fid2path right now due to -EBADFD.

I also did DoM restriping earlier the same day, so that might actually be the root cause of the issue. I attached the kernel messages for Lustre from the client doing the DoM migration in dmesg_dommigrate. I left the dump_lsm() messages in for completeness, although they seem to be redundant.

Comment by Lai Siyao [ 11/Apr/23 ]

Resuming directory migration is not supported when the hash type is bad. Under this condition, the user can still access the directory and its sub-files, but can't unlink them, because the server code doesn't support name lookup with a bad hash (it should try each stripe, as is done on the client).

I will look into these two issues:

  1. upon bad hash type, directory migration can try to resume anyway, and there is a good chance of success.
  2. support unlink of such directories and files.
Comment by Andreas Dilger [ 11/Apr/23 ]

Lai, I've always wondered about cases like this (e.g. bad hash type). It should be possible for the MDS to "complete" the migration in any case. At worst it could pick a new hash type and move more entries around, but at least it would get the directory out of the "MIGRATION" state.

Comment by Gerrit Updater [ 28/Apr/23 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50796
Subject: LU-16717 mdt: treat unknown hash type as sane type
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: be2d62766e7a0d75846c9482f8105fd5ae7bf461

Comment by Gerrit Updater [ 28/Apr/23 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50797
Subject: LU-16717 mdt: resume dir migration with bad_type
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4054056c6268ee47850f54a57019b819f108738f

Comment by Lai Siyao [ 28/Apr/23 ]

Patrick, I just pushed two patches to address the issues listed above. Could you try the second one, https://review.whamcloud.com/c/fs/lustre-release/+/50797, to see if it fixes your issue?
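For reference, a common way to apply a Gerrit change to a lustre-release checkout is to fetch and cherry-pick it (this assumes patch set 1 is still the current one; adjust the trailing number if a newer patch set has been pushed):

$ git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/97/50797/1
$ git cherry-pick FETCH_HEAD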

Comment by Xing Huang [ 08/May/23 ]

2023-05-08: Two patches being worked on.

Comment by Patrick Keller [ 09/May/23 ]

Sorry for the late response (LUG was in the way).

I can confirm that removing the directories was possible in all cases with the patched server. LFSCK didn't seem to repair anything, although I'm not sure if it's supposed to change anything for directories that have already been broken by my previous LFSCK runs.

I will have to set up the filesystem again, as we are going into production and can't do any more testing.

Comment by Xing Huang [ 20/May/23 ]

2023-05-20: Two patches being worked on; the first patch is being reviewed, and the other one passed code review and depends on the first.

Comment by Xing Huang [ 29/May/23 ]

2023-05-29: Both patches being worked on are ready to land (on the master-next branch).

Comment by Gerrit Updater [ 31/May/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50796/
Subject: LU-16717 mdt: treat unknown hash type as sane type
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 05cdb71ba6813570123613993f3cfcf74fc83561

Comment by Gerrit Updater [ 31/May/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50797/
Subject: LU-16717 mdt: resume dir migration with bad_type
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 151650e468ab423e831c30d635ea380e0434a122

Comment by Peter Jones [ 31/May/23 ]

Landed for 2.16

Comment by Gerrit Updater [ 06/Jun/23 ]

"Xing Huang <hxing@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51235
Subject: LU-16717 mdt: treat unknown hash type as sane type
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: ed44665dcfe40b5ae15d5733b177f434396854de

Comment by Gerrit Updater [ 07/Jun/23 ]

"Xing Huang <hxing@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51243
Subject: LU-16717 mdt: resume dir migration with bad_type
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 1b9b0c6f9c50218ab8549fa9795e5dd40f243b2d

Comment by Gerrit Updater [ 02/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51235/
Subject: LU-16717 mdt: treat unknown hash type as sane type
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: e4208468b65a34c84c20d5d932f35b29f9025722

Comment by Gerrit Updater [ 02/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51243/
Subject: LU-16717 mdt: resume dir migration with bad_type
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 1c882aebeaac4970c78a3616f1dd96d0920d133f
