[LU-16405] regression in create that may cause directory entries with the same name Created: 15/Dec/22  Updated: 28/Aug/23  Resolved: 27/Jul/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Critical
Reporter: Lai Siyao Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-10235 mkdir should check for directory exis... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

In LU-10235, name lookup is moved before parent locking to avoid unnecessary lock revoke, and it's believed that backend filesystem will check name before entry insert, but some test shows this is not always true.

# ls -lisad pan_*
162139365236040503 4 lrwxrwxrwx 1 rdx rd   65 Dec  8 01:51 pan_000 -> /ec/fws5/sb/work/rd/cxsj/hvxz/MMSF/2017050100/longrange/an_centre
162139357216535883 4 drwxr-x--- 2 rdx rd 4096 Dec  8 01:51 pan_001
162139341278207222 4 drwxr-x--- 2 rdx rd 4096 Dec  8 01:51 pan_002
162139365689006218 4 drwxr-x--- 2 rdx rd 4096 Dec  8 01:51 pan_003
162139362014851225 4 drwxr-x--- 2 rdx rd 4096 Dec  8 01:51 pan_004
162139364397192377 4 drwxr-x--- 2 rdx rd 4096 Dec  8 01:51 pan_005
162139362031645470 4 drwxr-x--- 2 rdx rd 4096 Dec  8 01:51 pan_006
162139361696099316 4 drwxr-x--- 2 rdx rd 4096 Dec  8 01:51 pan_007
162139361343777112 4 drwxr-x--- 2 rdx rd 4096 Dec  8 01:51 pan_008
162139353626266212 4 drwxr-x--- 2 rdx rd 4096 Dec  8 01:51 pan_009
162139350539298147 4 drwxr-x--- 2 rdx rd 4096 Dec  8 01:51 pan_010
162139350539298147 4 drwxr-x--- 2 rdx rd 4096 Dec  8 01:51 pan_010   ## Here 
162139362769806682 4 drwxr-x--- 2 rdx rd 4096 Dec  8 01:51 pan_011
162139353609530857 4 drwxr-x--- 2 rdx rd 4096 Dec  8 01:51 pan_012
162139356226776195 4 drwxr-x--- 2 rdx rd 4096 Dec  8 01:51 pan_013
162139346428878656 4 drwxr-x--- 2 rdx rd 4096 Dec  8 01:51 pan_014
162139361343777140 4 drwxr-x--- 2 rdx rd 4096 Dec  8 01:51 pan_015
162139366141987047 4 drwxr-x--- 2 rdx rd 4096 Dec  8 01:51 pan_016
162139358508461812 4 drwxr-x--- 2 rdx rd 4096 Dec  8 01:51 pan_017
162139353072661755 4 drwxr-x--- 2 rdx rd 4096 Dec  8 01:51 pan_018
162139353072661755 4 drwxr-x--- 2 rdx rd 4096 Dec  8 01:51 pan_018 ## and there

The problem is hit with single-block directories, so is not related to htree split or similar:

#> debugfs -c /dev/mapper/vg_mdt0003_f02-mdt0003
debugfs 1.46.2.wc3 (18-Jun-2021)
/dev/mapper/vg_mdt0003_f02-mdt0003: catastrophic mode - not reading inode or group bitmaps
debugfs:  cd REMOTE_PARENT_DIR/0x240007130:0x1:0x0/work/2017050100/longrange
debugfs: ls
[...snip]
 1032403872  (44) psuhvxz14.noSnow    1032403874  (1020) psuhvxz14
 1038197629  (36) pan_004    1032401372  (48) __ICMGG3.noSnow.tmp
 1039177323  (36) pan_010    1039222784  (36) pan_010
 1032971017  (48) targethvxz20170501002    1032401160  (36) pert_001
[...]
debugfs:  stat <1039177323>
Inode: 1039177323   Type: directory    Mode:  0750   Flags: 0x20000000
Generation: 3276743085    Version: 0x0000002f:d1139426
User:   388   Group:  1100   Project:    13   Size: 4096
File ACL: 0
Links: 2   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x63914328:4490c0f0 -- Thu Dec  8 01:51:36 2022
 atime: 0x63988ad8:00000000 -- Tue Dec 13 14:23:20 2022
 mtime: 0x63914328:4490c0f0 -- Thu Dec  8 01:51:36 2022
crtime: 0x63914328:4490c0f0 -- Thu Dec  8 01:51:36 2022
Size of extra inode fields: 32
Extended attributes:
  lma: fid=[0x24008e159:0x1e563:0x0] compat=0 incompat=0
  linkea: idx=0 parent=[0x24008e347:0x14e34:0x0] name='pan_010'
BLOCKS:
(0):649617594
TOTAL: 1

debugfs:  stat <1039222784>
Inode: 1039222784   Type: directory    Mode:  0750   Flags: 0x20000000
Generation: 3276743100    Version: 0x0000002f:d1139472
User:   388   Group:  1100   Project:    13   Size: 4096
File ACL: 0
Links: 2   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x6391432b:00000000 -- Thu Dec  8 01:51:39 2022
 atime: 0x63914328:450ad5a4 -- Thu Dec  8 01:51:36 2022
 mtime: 0x6391432b:00000000 -- Thu Dec  8 01:51:39 2022
crtime: 0x63914328:450ad5a4 -- Thu Dec  8 01:51:36 2022
Size of extra inode fields: 32
Extended attributes:
  lma: fid=[0x24008e422:0x1fcab:0x0] compat=0 incompat=0
  linkea: idx=0 parent=[0x24008e347:0x14e34:0x0] name='pan_010'
debugfs:  ls <1039177323>
 1039177323  (12) .    1032400653  (4084) ..
debugfs:  ls <1039222784>
 1039222784  (12) .    1032400653  (28) ..    1039271133  (40) ICMSHhvxzINIT
 1039279349  (40) ICMGGhvxzINIT    1039281023  (3976) ggml199



 Comments   
Comment by Alex Zhuravlev [ 15/Dec/22 ]

looks quite strange.. ext4_find_dest_de() checks full block for duplicates. a hole in pdirops locking? is it specific to some kernel?

Comment by Andreas Dilger [ 15/Dec/22 ]

Alex, would be el7.9 kernel.

Comment by Sebastien Piechurski [ 21/Feb/23 ]

Is there a smarter way to avoid this than reverting LU-10235 ?

Comment by Lai Siyao [ 22/Feb/23 ]

It's not proved yet, we tried to reproduce and capture debug log, but couldn't reproduce.

Comment by Sebastien Piechurski [ 23/Feb/23 ]

It is reverted on the client used on customer's production for last 1 month and a half.

I am currently waiting for confirmation that the problem was not seen since then.

I will report here when I get confirmation.

If this was proved to be the reason, would there be a better way than reverting ?

Comment by Andreas Dilger [ 23/Feb/23 ]

Looking at the original LU-10235 patch, it seems possible to add some lightweight check after the parent is locked (if no entry was found during the unlocked lookup) to determine if the directory was modified between the time the lookup was done and when it was locked. We don't want to reintroduce the overhead of reverting the whole patch, or worse always do two lookups and make the code slower than it was before.

In the use case addressed by the LU-10235 patch (many threads opening the same file) this should not add any overhead, but in the concurrent directory modification case it would decide if another lookup was needed with the lock held. It might be possible to take and drop the object lock before the parent lock, and then determine if the object lock was cancelled (i.e. conflicting operation) to force another lookup under the lock.

Comment by Lai Siyao [ 07/Mar/23 ]

But directory time is not accurate enough to know whether parent was changed, because create only modifies parent directory time to seconds.

Comment by Andreas Dilger [ 07/Mar/23 ]

I wasn't thinking of using ctime (unless it changes to use nanoseconds?), but rather the directory version (either i_version or another in-memory version counter).

Comment by Alex Zhuravlev [ 10/Mar/23 ]

iirc, htree locks specific block to add a new direntry? I'd try to reproduce this with a change to ldiskfs_find_dest_de() searching through a whole block for the given name.

Comment by Gerrit Updater [ 03/Apr/23 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50505
Subject: LU-16405 tests: a reproducer
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: db199a9398cb6b773ccd58505eb7ca7d43053f8b

Comment by Gerrit Updater [ 04/Apr/23 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50521
Subject: LU-16405 osd: lookup cache
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a3d94d8abfb8c0d9320687be578073d3e3387fa6

Comment by Gerrit Updater [ 27/Jul/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50521/
Subject: LU-16405 osd: lookup cache
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 29f8eb2a67ba2806d91d93de1e82e05a63f76382

Comment by Peter Jones [ 27/Jul/23 ]

Fix landed for 2.16

Comment by Alex Zhuravlev [ 31/Jul/23 ]

the patch(#50521) uses function obj_name2lu_name which was added in patch(https://review.whamcloud.com/#/c/fs/lustre-release/+/43390/)

yes, that's to support filesystem encryption. if we don't need to support encryption in this version, then we don't need to call obd_name2lu_name() and can use the passed key directly. feel free to ask if you have more questions.

Comment by Alex Zhuravlev [ 31/Jul/23 ]

can you please port the patch to b_es5_2 please?

sure, will do

Generated at Sat Feb 10 03:26:44 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.