[LU-15913] rename stress test leads to REMOTE_PARENT_DIR corruption Created: 06/Jun/22  Updated: 08/Feb/24

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Artem Blagodarenko (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14570 e2fsck reports "Entry '..' an incorre... Open
is related to LU-17426 parallel cross-directory rename of re... Open
is related to LU-12125 Allow parallel rename of regular files Resolved
is related to LU-15830 distribute mkdir should lookup target... Resolved
is related to LU-12834 MDT hung during failover Open
is related to LU-17448 LBUG in racer with layout change Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

A stress test with active renaming of files and directories inside a striped directory failed with an error:

Performing actions for 'rename(2,676pp674,aaaaaaaaaa,bbbbbbbbbb,cccccccccc,dddddddddd,eeeeeeeeee,ffffffffff,gggggggggg,hhhhhhhhhh,iiiiiiiiii,jjjjjjjjjj,kkkkkkkkkk,llllllllll,mmmmmmmmmm,nnnnnnnnnn,oooooo -> 3,676pp674,aaaaaaaaaa,bbbbbbbbbb,cccccccccc,dddddddddd,eeeeeeeeee,ffffffffff,gggggggggg,hhhhhhhhhh,i) error Input/output error (5)'
numerrs=1
Fri Jun  3 01:06:30 CDT 2022 

and left the MDT partition corrupted. E2fsck reports the following symptoms:

1) an incorrect filetype

2) a link to a directory

3) '..' points to /REMOTE_PARENT_DIR but should point to the special directory named by the sequence

./data.20220423/server/e2fsck.pre_read_only.kjcf04n03.6.1-010.43.20220504165818.out.kjcf04n03:Entry '14,37275pp37273,aaaaaaaaaa,bbbbbbbbbb,cccccccccc,dddddddddd,eeeeeeeeee,ffffffffff,gggggggggg,hhh' in /REMOTE_PARENT_DIR/0x24006497a:0x129bc:0x0 (4014289161) has an incorrect filetype (was 17, should be 2).

./data.20220423/server/e2fsck.pre_read_only.kjcf04n03.6.1-010.43.20220504165818.out.kjcf04n03:Entry '0x24006b05d:0xa4c8:0x0' in /REMOTE_PARENT_DIR (4030089985) is a link to directory /REMOTE_PARENT_DIR/0x24006497a:0x129bc:0x0/14,37275pp37273,aaaaaaaaaa,bbbbbbbbbb,cccccccccc,dddddddddd,eeeeeeeeee,ffffffffff,gggggggggg,hhh (4014289162 fid=[0x20006a4b1:0x42c6:0x0]).

./data.20220423/server/e2fsck.pre_read_only.kjcf04n03.6.1-010.43.20220504165818.out.kjcf04n03:'..' in /REMOTE_PARENT_DIR/0x24006497a:0x129bc:0x0/14,37275pp37273,aaaaaaaaaa,bbbbbbbbbb,cccccccccc,dddddddddd,eeeeeeeeee,ffffffffff,gggggggggg,hhh (4014289162) is /REMOTE_PARENT_DIR (4030089985), should be /REMOTE_PARENT_DIR/0x24006497a:0x129bc:0x0 (4014289161).  

 



 Comments   
Comment by Artem Blagodarenko (Inactive) [ 06/Jun/22 ]

Now I think this symptom is important.

Jun  2 19:01:42 kjcf04n02 kernel: LustreError: 55470:0:(out_handler.c:910:out_tx_end()) kjcf04-MDT0000-osd: undo for /builddir/build/BUILD/lustre-2.15.0.1_rc2_cray_105_g679d729/lustre/ptlrpc/../../lustre/target/out_handler.c:445: rc = -524 

This message hints that an error in out_handler may be the clue.

#define LUSTRE_ENOTSUPP         524     /* Operation is not supported */   

The message is printed in out_tx_end() when an undo operation is not supported:

if (ta->ta_args[i]->undo_fn != NULL)
        ta->ta_args[i]->undo_fn(env, ta->ta_handle,
                                ta->ta_args[i]);
else
        CERROR("%s: undo for %s:%d: rc = %d\n",
               dt_obd_name(ta->ta_handle->th_dev),
               ta->ta_args[i]->file,
               ta->ta_args[i]->line, -ENOTSUPP);

The message says that the undo operation for out_xattr_set() is not supported.

There are two questions now:

  1. Which operation failed here:

     for (i = 0; i < ta->ta_argno; i++) {
             rc = ta->ta_args[i]->exec_fn(env, ta->ta_handle,
                                          ta->ta_args[i]);

  2. Why the undo operation in out_tx_end() is not supported.

 

Comment by Andreas Dilger [ 09/Jun/22 ]

This is likely due to the parallel rename locking changes in LU-12125. Before this change there was a global "Big Filesystem Lock" that serialized all renames in the whole filesystem.

It would also be useful to determine whether this affects only directory renames, or file renames as well. The LU-12125 patch added checks for parallel directory and file renames separately, so it would be desirable to be able to enable/disable parallel renames separately for files and directories.

Comment by Cory Spitz [ 10/Jun/22 ]

It appears that only file renames are affected, but artem_blagodarenko can correct me if I'm wrong.

A symptom seen before the metadata corruption is EIO returned to rename(). With parallel rename from LU-12125 reverted, our reproducer can now survive 12+ hours without an EIO. We'll report more next week after an e2fsck, but it looks promising. Thanks for the suggestion, Andreas!

Comment by Cory Spitz [ 10/Jun/22 ]

Test passed with no new e2fsck errors! I'd say we found our culprit.

Comment by Andreas Dilger [ 10/Jun/22 ]

It would be useful to know more details of the reproducer workload. Is this renaming in local directories or striped/remote directories, within one directory or within multiple directories, from a single client or multiple clients?

The parallel rename patch should only allow parallel renames within a single local directory, so (in theory) the REMOTE_PARENT_DIR should not be involved. It might be that renames within a striped directory could trigger this if the "rename within same parent" check is incorrect.

Comment by Gerrit Updater [ 11/Jun/22 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47593
Subject: LU-15913 mdt: disable parallel rename by default
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9d4673cc87739dd21f4516f599b21934d846b04b

Comment by Andreas Dilger [ 12/Jun/22 ]

Patch 47593 has tunable parameters ("mdt.*.enable_parallel_rename_file|dir", default off) to independently control parallel rename of files and directories. In addition, I believe it also fixes the issue by disabling parallel renames within a striped directory. It would be useful if HPE could test both the default patch mode and enable_parallel_rename_*=1 to determine whether this is a proper fix for the issue.
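For reference, a minimal sketch of how the two tunables could be checked and toggled on the MDS with the patch applied (a sketch only; the parameter names are taken from the patch description above and should be confirmed against the landed patch):

# check the current settings (expected default after the patch: 0 = disabled)
lctl get_param mdt.*.enable_parallel_rename_file mdt.*.enable_parallel_rename_dir

# re-enable parallel same-dir renames for files and for directories while testing
lctl set_param mdt.*.enable_parallel_rename_file=1
lctl set_param mdt.*.enable_parallel_rename_dir=1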

The patch could be further optimized to allow parallel rename within a single shard of the same directory, and to add additional locking to REMOTE_PARENT_DIR, but it isn't clear if that is needed, or if exchanging one level of locking is more efficient than locking in a different area of the code.

Comment by Cory Spitz [ 13/Jun/22 ]

> I'd say we found our culprit.
While our test ran much longer with the revert of LU-12125, it did trigger EIO for renames over the weekend. We'll need to investigate and perform e2fsck to understand if the signature of the fallout has changed.

Andreas, thanks for posting your patch to try and bypass the problem. We'll get more details about the tests here shortly and we can decide what to do for both the code and the upcoming release of 2.15.0.

Comment by Peggy Gazzola [ 13/Jun/22 ]

Regarding the reproducer, the test by default runs in local dirs, remote dirs, striped dirs.  It's random, based on a mkdir wrapper script.  The failures have mostly (possibly always) involved striped dirs; we'll need to check/verify that.
The problem hits both file and directory renames.
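For illustration, a wrapper like that might randomize the directory type roughly as follows (a hedged sketch, not the actual script; the test path and MDT indices are hypothetical):

#!/bin/bash
# Hypothetical sketch: create each test directory as local, remote, or striped at random.
DIR=${1:-/mnt/lustre/renametest/dir.$$}
case $((RANDOM % 3)) in
0) mkdir "$DIR" ;;                         # local directory
1) lfs mkdir -i $((RANDOM % 2)) "$DIR" ;;  # remote directory on a random MDT
2) lfs mkdir -c 2 "$DIR" ;;                # striped directory across 2 MDTs
esac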

Comment by Artem Blagodarenko (Inactive) [ 14/Jun/22 ]

Hi Andreas,

Thank you for sharing https://review.whamcloud.com/#/c/47593/

I reverted these patches on Friday:

commit c76f4a2ab19c203e8dd6d826e2921721ea3430b5 (HEAD -> LU-15285-revert)
Author: Artem Blagodarenko <artem.blagodarenko@gmail.com>
Date:   Wed Jun 8 17:41:51 2022 -0400

    LUS-10934 Revert "LU-12125 mds: allow parallel directory rename"

    This reverts commit 90979ab390a72a084f6a77cf7fdc29a4329adb41.

    Change-Id: I885a135ccb3aa11d4a349ae4658c60664634af64

commit a6c4b929bd69fd4a0a6dfc083be577f4bdfe7906
Author: Artem Blagodarenko <artem.blagodarenko@gmail.com>
Date:   Wed Jun 8 17:39:51 2022 -0400

    Revert "LU-12125 mds: allow parallel regular file rename"

    This reverts commit d76cc65d5d68ed3e04bfbd9b7527f64ab0ee0ca7.

    Change-Id: I37893e0a3b49560003eac0905f54724c4a76a20a

commit 4b3d30030298389b93bfffab46a5f57c8977ba5c
Author: Artem Blagodarenko <artem.blagodarenko@gmail.com>
Date:   Wed Jun 8 17:22:44 2022 -0400

    Revert "LU-15285 mdt: fix same-dir racing rename deadlock"

    This reverts commit 82ec537d8b4cc9261828f4efe6b03d8d33f38432.

    Change-Id: I9527f74eeea30c7625cdbad3cc990b3e2c114377

After 2 days and 19 hours we hit 9 rename issues again.

We can test your patch. Possibly my revert is incomplete; I am still comparing the code.

Comment by Andreas Dilger [ 15/Jun/22 ]

And after 2 days and 19 hours hit 9 rename issues again.

Artem, the revert patches look correct, so it seems probable that the remaining EIO error is unrelated to the parallel rename? It looks like about 8h per failure. I don't have any info on how often the test was failing before the patches were reverted, but presumably it was failing more often? Does the current EIO failure also report corruption with e2fsck, or is it just the EIO error to userspace? Did this same test ever run against 2.14.0 for the same amount of time?

My patch does three things:

  • by default it disables both file and directory parallel same-dir renames
  • there are tunable parameters to separately re-enable parallel same-dir renames for files and directories
  • it permanently disables parallel renames for files and directories within a striped directory.

It would be useful to test the patch with at least the default mode (all parallel renames disabled), as well as with parallel renames enabled, to see if disabling the striped directory renames also "mostly fixes" the problem. I think at this point the chance of hitting the "mostly fixed" problem is extremely low, given that this is itself an unlikely workload, so if the patch fixes the problem to the same level as reverting the patches, then I think the patch should land and the release can be made as-is.

Comment by Artem Blagodarenko (Inactive) [ 15/Jun/22 ]

Hi Andreas,

>so it seems probable that the remaining EIO error is unrelated to the parallel rename?

It seems so, but I still hope that my revert is incomplete or incorrect, so I want to see the testing results with your patch.

>It looks like about 8h per failure. I don't have any info on how often the test was failing before the patches were reverted, but presumably it was failing more often? 

It looks like the rate is about the same. We previously ran for 48h and saw about the same number of EIO and e2fsck errors.

>Does the current EIO failure also report corruption with e2fsck, or is it just the EIO error to userspace?

Yes, e2fsck reports the same corruptions:

[root@kjcf04n00 admin]# diff <(grep -h "has an incorrect filetype" *20220610141112*) <(grep -h "has an incorrect filetype" *20220613171537*)
4a5
> Entry '9,14403pp14401,aaaaaaaaaa,bbbbbbbbbb,cccccccccc,dddddddddd,eeeeeeeeee,ffffffffff,gggggggggg,h' in /REMOTE_PARENT_DIR/0x2000b8d6e:0x2730:0x0 (3808878414) has an incorrect filetype (was 17, should be 2).
[root@kjcf04n00 admin]#
[root@kjcf04n00 admin]# diff <(grep -h "is a link to directory" *20220610141112*) <(grep -h "is a link to directory" *20220613171537*)
5a6
> Entry '0x2000bb431:0x314:0x0' in /REMOTE_PARENT_DIR (1472988673) is a link to directory /REMOTE_PARENT_DIR/0x2000b8d6e:0x2730:0x0/9,14403pp14401,aaaaaaaaaa,bbbbbbbbbb,cccccccccc,dddddddddd,eeeeeeeeee,ffffffffff,gggggggggg,h (3808878415).
6a8
> Entry '0x2000b8df1:0x36e6:0x0' in /REMOTE_PARENT_DIR (1472988673) is a link to directory /REMOTE_PARENT_DIR/0x2000b8d6e:0x164f:0x0/8:2,9035pp8575,aaaaaaaaaa,bbbbbbbbbb,cccccccccc,dddddddddd,eeeeeeeeee,ffffffffff,gggggggg (3362505009 fid=[0x2400bac60:0x277b:0x0]).
8a11
> Entry '0x2400b4ee8:0x111a6:0x0' in /REMOTE_PARENT_DIR (4030089985) is a link to directory /REMOTE_PARENT_DIR/0x2400b4ee6:0x13807:0x0/[0x2400b4ea0:0x1dd3:0x0]:0/26,26044pp26042,aaaaaaaaaa,bbbbbbbbbb,cccccccccc,dddddddddd,eeeeeeeeee,ffffffffff,gggggggggg,hhhhhhhhhh,iiiiiiiiii,jjjjjjjjjj,kkkkkkkkkk,llllllllll,mmmmmmmmmm,nnnnnnnnnn,oooooooooo (1296546125).
[root@kjcf04n00 admin]# 

>Did this same test ever run against 2.14.0 for the same amount of time?
As far as I know, it didn't. Thanks for the idea.

Andreas, the testing we did previously is currently on hold due to technical problems. Do you have any ideas on whether this issue could be reproduced in a smaller environment? Do you think the client count matters? The I/O rate?

Thanks.

 

Comment by Andreas Dilger [ 16/Jun/22 ]

Andreas, the testing we did previously is currently on hold due to technical problems. Do you have any ideas on whether this issue could be reproduced in a smaller environment? Do you think the client count matters? The I/O rate?

Artem, my current theory is that the problem is caused by parallel renames by different clients/mountpoints in a single directory, which cause the updates in REMOTE_PARENT_DIR to fail. It is possible that the conflicting renames can only happen while parallel rename is allowed, otherwise the renames are serialized and cannot race.
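A minimal sketch of the kind of race described above, assuming two client mountpoints of the same filesystem (the mount paths, file counts, and stripe count are hypothetical):

#!/bin/bash
# Two clients (or mountpoints) rename files back and forth inside one striped directory.
lfs mkdir -c 2 /mnt/lustre1/stripedir 2>/dev/null
for i in $(seq 0 99); do touch /mnt/lustre1/stripedir/f.$i; done

rename_loop() {                 # $1 = client mountpoint
        while true; do
                mv "$1/stripedir/f.$((RANDOM % 100))" \
                   "$1/stripedir/f.$((RANDOM % 100))" 2>/dev/null
        done
}
rename_loop /mnt/lustre1 &      # client 1
rename_loop /mnt/lustre2 &      # client 2
wait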

Comment by Gerrit Updater [ 16/Jun/22 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47643
Subject: LU-15913 tests: add rename stress test via racer
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7441c13b7931907f91cd0b8945c1e26b50607eed

Comment by Artem Blagodarenko (Inactive) [ 16/Jun/22 ]

>Artem, my current theory is that the problem is caused by parallel renames by different clients/mountpoints in a single directory, which cause the updates in REMOTE_PARENT_DIR to fail. It is possible that the conflicting renames can only happen while parallel rename is allowed, otherwise the renames are serialized and cannot race

The problem reproduced even with https://review.whamcloud.com/#/c/47593/ (default configuration) applied, so serialization unfortunately doesn't help.

Comment by Andreas Dilger [ 16/Jun/22 ]

The problem reproduced even with https://review.whamcloud.com/#/c/47593/ (default configuration) applied, so serialization unfortunately doesn't help.

The patch in the default mode is essentially the same as reverting the parallel rename patches, so it seems like the parallel rename patches are not the source of the problem. That is a bit contradictory to Cory's previous comment, so some clarification is needed. Was the "problem going away after reverting" just a lucky gap in failures?

How many clients/mountpoints/threads are needed to hit this issue, and what type of rename operation is hitting the EIO? I have been trying to reproduce with multiple mountpoints and test directories, and 4x threads per directory creating and renaming files within those directories (because of focus on parallel rename patches) but have not hit the problem.

This all makes me think that the parallel rename patches are not the cause of the current problems (or are at most making the problem easier to hit), and either this issue predates 2.14.0, or possibly was caused by one of the other changes in the MDT code. My suggestion would be to test 2.14.0 to see if the problem is hit there, and if not the code should be bisected to find the patch that causes the problem.
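If bisecting turns out to be necessary, a rough outline (the tags below are placeholders for whichever builds are known good and bad):

# hypothetical bisect between a known-good and a known-bad build
git bisect start
git bisect bad HEAD              # build that reproduces the corruption
git bisect good <last-good-tag>  # e.g. the 2.14.0 build, if it passes the test
# build and install the suggested commit, run the rename stress test, then mark it:
git bisect good    # or: git bisect bad
# repeat until git reports the first bad commit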

Also, given that the problem seems extremely difficult to reproduce outside of a dedicated rename stress test, I don't think this is a risk of being hit by any normal user workload and should no longer be considered a blocker for 2.15.0. When the root cause is found we can always issue a patch for 2.15.1, which itself will be released fairly quickly afterward because of el8.6 and other needs.

Comment by Artem Blagodarenko (Inactive) [ 16/Jun/22 ]

>That is a bit contradictory to Cory's previous comment, so some clarification is needed. 
We were too optimistic after 14 hours of testing. 48 hours of testing showed the same symptoms.

Comment by Artem Blagodarenko (Inactive) [ 30/Jun/22 ]

The problem is fixed with the patch from LU-15830. We still need to understand how the rollback leads to the corruption, but avoiding the rollback is a solution for this exact issue.

Comment by Cory Spitz [ 30/Jun/22 ]

> but avoiding a rollback is a solution for this exact issue
Yes, but avoiding rollback is just that, avoiding it. We still have the problem of corruption from rollback if it does happen. LU-14570 is a case that illustrates the symptoms; it was opened before LU-15830.

Comment by Artem Blagodarenko (Inactive) [ 30/Jun/22 ]

Oops... actually we have patches here that have not landed yet, so it is too early to close this. I have reopened the issue.

Comment by Andreas Dilger [ 30/Jun/22 ]

The problem is fixed with the patch from the LU-15830.

This is a bit confusing to me, since patch https://review.whamcloud.com/47226 "LU-15830 mdt: mkdir to lookup target name" was landed on 2022-05-12 and included in 2.15.0-RC5 (2022-05-31), but this ticket was filed on 2022-06-06. Was testing not being done with 2.15.0-RC5?

That said, the root of the problem has been described as the rename failing because a conflicting target name appears. I don't think this should be happening, given that the source and target FIDs should be locked before the rename is actually done. Is it possible that the LU-15830 patch to do the remote target lookup hasn't just worked around the problem, but actually fixed it fully? With proper parent locking and lookup, it shouldn't be possible for a link to that name to appear in the target dir or be linked to the target file.

Comment by Alex Zhuravlev [ 30/Jun/22 ]

One option is to make the failing update (e.g. insert) the very first update in the batch, so there would be nothing to roll back.

Comment by Cory Spitz [ 30/Jun/22 ]

> Was testing not being done with 2.15.0-RC5?
Correct. The report was filed for the previous RC. I regret that that wasn't made clear. But, at least it has uncovered this broken rollback, which could still happen in theory for other reasons (say, an ENOMEM experienced along the way).

Comment by Gerrit Updater [ 18/Jul/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47593/
Subject: LU-15913 mdt: disable parallel rename for striped dirs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f238540c879dc668e18cf99cba62f117ccae64d6

Comment by Gerrit Updater [ 23/Jan/24 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53768
Subject: LU-15913 tests: add rename stress test via racer (testing)
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b49b73d379b33705688a82c535b824c660c64641

Comment by Gerrit Updater [ 08/Feb/24 ]

"Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53981
Subject: LU-15913 tests: clean between racer 1 and 2
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 81cb6f72de12dc6a3aca486a8585405dca755fc3
