[LU-9868] dcache/namei fixes for lustre Created: 11/Aug/17  Updated: 24/Nov/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Upstream, Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.5
Fix Version/s: Upstream

Type: Bug Priority: Major
Reporter: James A Simmons Assignee: James A Simmons
Resolution: Unresolved Votes: 0
Labels: patch, upstream
Environment:

Upstream, as well as any Lustre 2.11 client.


Issue Links:
Duplicate
is duplicated by LU-16179 race in ll_splice_alias() may lead to... Resolved
Related
is related to LU-11501 use the dcache properly with .lustre/fid In Progress
is related to LU-9735 Sles12Sp2 and 2.9 getcwd() sometimes ... Resolved
is related to LU-12997 getcwd() returns ENOENT on RHEL7 Resolved
is related to LU-13486 tests sanity/32* fail Resolved
is related to LU-12511 Prepare lustre for adoption into the ... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

From Neil Brown, who works at SuSE:

I was drawn to look at this code due to the tests on
DCACHE_DISCONNECTED which are often wrong, and it turns out
they are used wrongly in lustre too. Fixing one led to some
clean-up. Fixing the other is straightforward.

A particular change here from the previous posting is
the first patch which tests for DCACHE_PAR_LOOKUP in ll_dcompare().
Without this patch, two threads can be looking up the same
name in a given directory in parallel. This parallelism led
to my concerns about needing improved locking in ll_splice_alias().
Instead of improving the locking, I now avoid the need for it
by fixing ll_dcompare.

This code passes basic "smoke tests".

Note that the cast to "struct dentry *" in the first patch is because
we have a "const struct dentry *" but d_in_lookup() requires a
pointer to a non-const structure. I'll send a separate patch to
change d_in_lookup().
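
For illustration, here is a minimal sketch of the kind of ->d_compare() check described above, assuming the four-argument prototype used by recent kernels. d_lustre_invalid() is Lustre's existing validity helper from llite_internal.h; the exact landed patch differs in detail, and the cast mirrors the note above about d_in_lookup() taking a non-const pointer at the time.

#include <linux/dcache.h>   /* d_in_lookup() */
#include <linux/string.h>   /* memcmp() */
/* d_lustre_invalid() is declared in lustre/llite/llite_internal.h */

/* Sketch only, not the exact landed patch. */
static int ll_dcompare(const struct dentry *dentry, unsigned int len,
                       const char *str, const struct qstr *name)
{
        if (len != name->len)
                return 1;

        if (memcmp(str, name->name, len))
                return 1;

        /* An in-lookup (DCACHE_PAR_LOOKUP) dentry must be reported as a
         * match so that a second thread waits on it in d_alloc_parallel()
         * instead of starting a parallel lookup of the same name.  The
         * cast is needed because d_in_lookup() takes a non-const pointer.
         */
        if (d_in_lookup((struct dentry *)dentry))
                return 0;

        /* Invalid dentries are reported as a mismatch so that a fresh
         * lookup is started for them, as Lustre normally does.
         */
        if (d_lustre_invalid(dentry))
                return 1;

        return 0;
}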



 Comments   
Comment by Gerrit Updater [ 11/Aug/17 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/28486
Subject: LU-9868 llite: dcache/namei fixes for lustre
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ee439ac5304343708bc344c62297749c5909d59e

Comment by Marek Magrys [ 28/Sep/18 ]

I gave it a spin, but opening a file under vim (probably a mix of stat/open/write/chmod operations) hangs the process. The trace is:

Sep 28 17:50:46 p0851 kernel: INFO: rcu_sched self-detected stall on CPU { 7}  (t=60000 jiffies g=16256 c=16255 q=13733)
Sep 28 17:50:46 p0851 kernel: Task dump for CPU 7:
Sep 28 17:50:46 p0851 kernel: vim             R  running task        0 28301  28299 0x0000000b
Sep 28 17:50:46 p0851 kernel: Call Trace:
Sep 28 17:50:46 p0851 kernel: <IRQ>  [<ffffffffa6ed1268>] sched_show_task+0xa8/0x110
Sep 28 17:50:46 p0851 kernel: [<ffffffffa6ed4e39>] dump_cpu_task+0x39/0x70
Sep 28 17:50:46 p0851 kernel: [<ffffffffa6f4bb00>] rcu_dump_cpu_stacks+0x90/0xd0
Sep 28 17:50:46 p0851 kernel: [<ffffffffa6f4f1a2>] rcu_check_callbacks+0x442/0x730
Sep 28 17:50:46 p0851 kernel: [<ffffffffa6f046c0>] ? tick_sched_do_timer+0x50/0x50
Sep 28 17:50:46 p0851 kernel: [<ffffffffa6ea7a16>] update_process_times+0x46/0x80
Sep 28 17:50:46 p0851 kernel: [<ffffffffa6f044c0>] tick_sched_handle+0x30/0x70
Sep 28 17:50:46 p0851 kernel: [<ffffffffa6f046f9>] tick_sched_timer+0x39/0x80
Sep 28 17:50:46 p0851 kernel: [<ffffffffa6ec2163>] __hrtimer_run_queues+0xf3/0x270
Sep 28 17:50:46 p0851 kernel: [<ffffffffa6ec26ef>] hrtimer_interrupt+0xaf/0x1d0
Sep 28 17:50:46 p0851 kernel: [<ffffffffa6e596ab>] local_apic_timer_interrupt+0x3b/0x60
Sep 28 17:50:46 p0851 kernel: [<ffffffffa752a083>] smp_apic_timer_interrupt+0x43/0x60
Sep 28 17:50:46 p0851 kernel: [<ffffffffa75267b2>] apic_timer_interrupt+0x162/0x170
Sep 28 17:50:46 p0851 kernel: <EOI>  [<ffffffffa6f0bb43>] ? native_queued_spin_lock_slowpath+0x1d3/0x200
Sep 28 17:50:46 p0851 kernel: [<ffffffffa750dc5f>] queued_spin_lock_slowpath+0xb/0xf
Sep 28 17:50:46 p0851 kernel: [<ffffffffa751b750>] _raw_spin_lock+0x20/0x30
Sep 28 17:50:46 p0851 kernel: [<ffffffffa703bf3e>] igrab+0x1e/0x60
Sep 28 17:50:46 p0851 kernel: [<ffffffffc1026040>] ll_inode_from_resource_lock+0xa0/0xb0 [lustre]
Sep 28 17:50:46 p0851 kernel: [<ffffffffc103ed92>] ll_md_blocking_ast+0x52/0x730 [lustre]
Sep 28 17:50:46 p0851 kernel: [<ffffffffa6f9f8e4>] ? free_one_page+0x2e4/0x310
Sep 28 17:50:46 p0851 kernel: [<ffffffffc0e1321a>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
Sep 28 17:50:46 p0851 kernel: [<ffffffffc0e81f22>] ? null_free_repbuf+0xd2/0x200 [ptlrpc]
Sep 28 17:50:46 p0851 kernel: [<ffffffffa6fa01fd>] ? __free_pages+0x1d/0x30
Sep 28 17:50:46 p0851 kernel: [<ffffffffa6fa0412>] ? __free_memcg_kmem_pages+0x22/0x50
Sep 28 17:50:46 p0851 kernel: [<ffffffffc0e1e9b0>] ldlm_cli_cancel_local+0xa0/0x420 [ptlrpc]
Sep 28 17:50:46 p0851 kernel: [<ffffffffc0e2303a>] ldlm_cli_cancel_list_local+0xea/0x280 [ptlrpc]
Sep 28 17:50:46 p0851 kernel: [<ffffffffc0e2335b>] ldlm_cancel_resource_local+0x18b/0x280 [ptlrpc]
Sep 28 17:50:46 p0851 kernel: [<ffffffffc0a93e92>] mdc_resource_get_unused+0x142/0x2a0 [mdc]
Sep 28 17:50:46 p0851 kernel: [<ffffffffc0a94217>] mdc_setattr+0x227/0x4a0 [mdc]
Sep 28 17:50:46 p0851 kernel: [<ffffffffa7037779>] ? dput+0x29/0x160
Sep 28 17:50:46 p0851 kernel: [<ffffffffc0f5ff54>] lmv_setattr+0x1c4/0x570 [lmv]
Sep 28 17:50:46 p0851 kernel: [<ffffffffc103115b>] ll_setattr_raw+0x28b/0x1350 [lustre]
Sep 28 17:50:46 p0851 kernel: [<ffffffffc103228c>] ll_setattr+0x6c/0xd0 [lustre]
Sep 28 17:50:46 p0851 kernel: [<ffffffffa703dc54>] notify_change+0x2c4/0x420
Sep 28 17:50:46 p0851 kernel: [<ffffffffa701c837>] chmod_common+0x137/0x160
Sep 28 17:50:46 p0851 kernel: [<ffffffffa701dac7>] SyS_fchmodat+0x57/0xc0
Sep 28 17:50:46 p0851 kernel: [<ffffffffa701db49>] SyS_chmod+0x19/0x20
Sep 28 17:50:46 p0851 kernel: [<ffffffffa7525a1b>] tracesys+0xa3/0xc9

Tested on CentOS 7.5 with the 3.10.0-862.14.4.el7.x86_64 kernel and Lustre 2.10.5.

Comment by James A Simmons [ 28/Sep/18 ]

Yeah, the original patch was based on an earlier version posted for upstream. A later patch set was done and landed. I just pushed the new patch. My testing shows it's in much better shape.

Sorry about the breakage.

Comment by Marek Magrys [ 01/Oct/18 ]

We moved patch set 17 into production after some initial testing, and so far it has been running for over 3000 node-days without any problems. Is there any plan to have the patch set included in b2_10 or master?

Comment by James A Simmons [ 01/Oct/18 ]

Wow, that is awesome news. Before your report of fixing an actual production system there wasn't really a push to merge this for 2.12. Now it looks like it's an important fix for production systems.

The way it works is that we have to port patches exactly from upstream to OpenSFS, so we are looking at 5 patches instead of the one test patch I have currently. Since this fixes real problems I will push to 2.10.X once it lands in the latest Lustre.

Comment by Gerrit Updater [ 01/Oct/18 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/33258
Subject: LU-9868 llite: use d_splice_alias for directories.
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 025cf259d4a66b2a5de57a66891cb88f0e860e14

Comment by Gerrit Updater [ 01/Oct/18 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/33259
Subject: LU-9868 llite: remove directory-specific code from ll_find_alias()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 58df6511f22d17a7cbe6d8d0f13d3c6da198c060

Comment by Gerrit Updater [ 01/Oct/18 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/33260
Subject: LU-9868 llite: simplify ll_find_alias()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 808f713c96423555d57f11dd69d0a372cc30c769

Comment by Gerrit Updater [ 01/Oct/18 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/33261
Subject: LU-9868 llite: refine ll_find_alias based on d_exact_alias
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 160ed51460d9ab7be92a7abf4581560f9d97093a

Comment by James A Simmons [ 02/Oct/18 ]

I see that while this patch fixes things on newer kernels, it breaks ancient kernels like RHEL's.

Comment by James A Simmons [ 09/Oct/18 ]

I found a workaround for the bug that breaks older kernels. I'm testing it now.

Comment by Peter Jones [ 11/Oct/18 ]

We think that any changes in this area would need to soak a while and get tested broadly before being included in a release.

Comment by James A Simmons [ 15/Oct/18 ]

I agree. This is a complex problem so caution is required here.

Comment by Lukasz Flis [ 17/Oct/18 ]

We are running https://review.whamcloud.com/#/c/28486 - patch set 17 with b2_10, and as Marek said there were no major issues, except for a few nodes with processes blocked in a rename operation invoked by software called OOMMF. We are not sure it's related, but I thought it was worth reporting.

Stacktrace:

[455219.524004] oxs             D ffff984d68ba1fa0     0  3818   3806 0x00000000
[455219.558389] Call Trace:
[455219.571231]  [<ffffffffa3f19e59>] schedule_preempt_disabled+0x29/0x70
[455219.602131]  [<ffffffffa3f17c17>] __mutex_lock_slowpath+0xc7/0x1d0
[455219.631887]  [<ffffffffa3f16fff>] mutex_lock+0x1f/0x2f
[455219.656923]  [<ffffffffa3a2b50a>] lock_rename+0xda/0xe0
[455219.681973]  [<ffffffffa3a315bf>] SYSC_renameat2+0x22f/0x5a0
[455219.709742]  [<ffffffffa39caefd>] ? handle_mm_fault+0x39d/0x9b0
[455219.739183]  [<ffffffffa3f2056c>] ? __do_page_fault+0x1bc/0x4f0
[455219.767746]  [<ffffffffa3a327ae>] SyS_renameat2+0xe/0x10
[455219.793358]  [<ffffffffa3a327ee>] SyS_rename+0x1e/0x20
[455219.818174]  [<ffffffffa3f2579b>] system_call_fastpath+0x22/0x27

According to our logs the issue started to appear a few days after deployment of a client patched with https://review.whamcloud.com/#/c/28486 - patch set 17.

I have noticed that the current patch set is 22 - do you recommend giving the upgrade a try?

UPDATE#1: Or is it better to patch the RHEL kernel, like SuSE did, to address the issues now?

Comment by James A Simmons [ 17/Oct/18 ]

I broke the patch up, so it's more than one now. Also, a bug in Lustre was exposed while testing this fix. On newer kernels it reports an error, but on older ones it will crash the node. Once I figure everything out I will port a backported patch to 2.10. The rename issue is new.

Comment by Lukasz Flis [ 17/Oct/18 ]

OK. If so, we will revert patch set 17 from https://review.whamcloud.com/#/c/28486 and see if the rename issue is gone, as the OOMMF crash is easily reproducible (however, we don't have a separate reproducer yet).
We'll update the ticket to make clear whether it's related.

Comment by Lukasz Flis [ 19/Oct/18 ]

Reverting patch set 17 from https://review.whamcloud.com/#/c/28486 fixed the problem with clients hanging on rename in OOMMF.

We have four processes interacting: probably one handles file writes, then signals another one which does rename operations, which leads to a client lock-up after several dozen such operations.

[pid  5643] futex(0x2b774caf1d20, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  5643] sendto(17, "reply 1762 0 0\r\n", 16, 0, NULL, 0) = 16
[pid  5643] recvfrom(18, "query 13 datafile {/net/scratch/"..., 4096, 0, NULL, NULL) = 305
[pid  5643] stat("/net/scratch/people/plgczuchr/OOMMF/dwieBariery/./dwieBariery/freeAniz_20000_Voltage_0.1500_mtop_-1_mfree_1_mbottom_1_VProfileType_0_/dwieBariery-Oxs_TimeDriver-Magnetization-00-0044000.omf", {st_mode=S_IFREG|0644, st_size=388216, ...}) = 0
[pid  5643] lstat("/net/scratch/people/plgczuchr/slurm_jobdir/13738101/tmp/dwieBariery-p1311.prometheus-5582.omf", {st_mode=S_IFREG|0644, st_size=388216, ...}) = 0
[pid  5643] lstat("/net/scratch/people/plgczuchr/OOMMF/dwieBariery/./dwieBariery/freeAniz_20000_Voltage_0.1500_mtop_-1_mfree_1_mbottom_1_VProfileType_0_/dwieBariery-Oxs_TimeDriver-Magnetization-00-0044000.omf", {st_mode=S_IFREG|0644, st_size=388216, ...}) = 0
[pid  5643] stat("/net/scratch/people/plgczuchr/OOMMF/dwieBariery/./dwieBariery/freeAniz_20000_Voltage_0.1500_mtop_-1_mfree_1_mbottom_1_VProfileType_0_/dwieBariery-Oxs_TimeDriver-Magnetization-00-0044000.omf", {st_mode=S_IFREG|0644, st_size=388216, ...}) = 0
[pid  5643] chmod("/net/scratch/people/plgczuchr/OOMMF/dwieBariery/./dwieBariery/freeAniz_20000_Voltage_0.1500_mtop_-1_mfree_1_mbottom_1_VProfileType_0_/dwieBariery-Oxs_TimeDriver-Magnetization-00-0044000.omf", 0644) = 0
[pid  5643] rename("/net/scratch/people/plgczuchr/slurm_jobdir/13738101/tmp/dwieBariery-p1311.prometheus-5582.omf", "/net/scratch/people/plgczuchr/OOMMF/dwieBariery/./dwieBariery/freeAniz_20000_Voltage_0.1500_mtop_-1_mfree_1_mbottom_1_VProfileType_0_/dwieBariery-Oxs_TimeDriver-Magnetization-00-0044000.omf") = 0
[pid  5643] write(9, "\0", 1 <unfinished ...>
[pid  5652] <... select resumed> )      = 1 (in [8])
[pid  5643] <... write resumed> )       = 1

Comment by James A Simmons [ 23/Oct/18 ]

Thank you for this info and also for testing out this work. I will talk with Neil about this problem to see if we can resolve it.

Comment by James A Simmons [ 24/Oct/18 ]

Which version of Lustre are you running, and is it only patched with https://review.whamcloud.com/#/c/28486/17? Better yet, can you share your RPMs with us?

Comment by Gerrit Updater [ 11/Feb/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/28486/
Subject: LU-9868 llite: handle DCACHE_PAR_LOOKUP in ll_dcompare
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 32e547aa9cb913f5736ee3d58cb79f4e63ce2c0b

Comment by Gerrit Updater [ 03/Apr/19 ]

Sebastien Piechurski (sebastien.piechurski@atos.net) uploaded a new patch: https://review.whamcloud.com/34582
Subject: LU-9868 llite: handle DCACHE_PAR_LOOKUP in ll_dcompare
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: fb6aa13622d441b3495287cbc2630cc47da7c4ae

Comment by Sebastien Piechurski [ 03/Apr/19 ]

Any chance of having the patch included in 2.12.1?

Comment by Peter Jones [ 04/Apr/19 ]

spiechurski it would be possible, but what is the driver for doing so?

Comment by James A Simmons [ 04/Apr/19 ]

Note the work is not complete. I need to solve the dcache hard link breakage issue. I have an idea on how to fix it but it will take some time to resolve it.

Comment by Sebastien Piechurski [ 02/Oct/19 ]

pjones: one of our customers (Cardiff University) is hitting this issue from time to time.

simmonsja: Have you made progress on this topic since April? Maybe in another ticket?

Comment by James A Simmons [ 02/Oct/19 ]

Oh, I thought that with LU-9735 being closed this was not a pressing issue. I can look at it; I just need to find a few cycles to work on this.

Comment by Sebastien Piechurski [ 02/Oct/19 ]

Indeed, with LU-9735 the occurrences have dramatically decreased, but there are still a few left.

No pressure though, I was just looking for information on progress.

Thanks,

Seb.

Comment by James A Simmons [ 30/Oct/19 ]

I have been looking at how to resolve these issues and thinking of some ideas. The breakage you see is due to the way Lustre manages its dcache so that the special .lustre/fid directory works. This special handling is what is causing these problems. The patches from upstream resolve the issues you see but end up breaking the .lustre/fid handling; I cover what that breakage is under ticket LU-11501. Additionally, the special .lustre/fid directory doesn't work with file sets, so I really need to think of a way to handle all of these conditions.

Before we can land the patches under these tickets, .lustre/fid needs to be resolved first. What I have been thinking of is using an overlayfs-type approach which uses d_real() to map a dentry to the real dentry under the hood. This way a dentry for .lustre/fid/$FID can be mapped to the real dentry. This means moving the .lustre/fid dentries out of the dcache into their own private cache for lookup. This is needed anyway, since you don't want a mounted file set being able to reach a FID outside of its tree. At the same time, work is being done in the tools to move away from using .lustre/fid to using fhandles. This might help resolve some of the issues as well.
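
To make the d_real() idea above concrete, below is a purely illustrative sketch of what such a mapping could look like. The ->d_real() prototype has changed across kernel versions (the two-argument form is assumed here), and ll_dentry_is_fid_stub() and ll_fid_cache_lookup() are hypothetical helper names invented for this sketch; nothing like this exists in Lustre today.

#include <linux/dcache.h>

/* Illustrative sketch only: map a virtual .lustre/fid/$FID dentry to
 * the "real" dentry kept in a private FID cache, in the same spirit as
 * overlayfs's use of ->d_real().
 */
static struct dentry *ll_fid_d_real(struct dentry *dentry,
                                    const struct inode *inode)
{
        struct dentry *real;

        /* Only dentries under .lustre/fid would need remapping
         * (hypothetical helper).
         */
        if (!ll_dentry_is_fid_stub(dentry))
                return dentry;

        /* Resolve the FID through a private cache rather than the
         * dcache, so a mounted file set cannot reach a FID outside of
         * its tree (hypothetical helper).
         */
        real = ll_fid_cache_lookup(dentry);
        return real ?: dentry;
}

static const struct dentry_operations ll_fid_d_ops = {
        .d_real = ll_fid_d_real,
};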

Comment by Gerrit Updater [ 13/Dec/19 ]

James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/37013
Subject: LU-9868 llite: dcache/namei fixes for lustre
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b148c9e3f634ecff0b766d3ae89f27440690c001

Comment by Gerrit Updater [ 27/Feb/20 ]

Neil Brown (neilb@suse.de) uploaded a new patch: https://review.whamcloud.com/37741
Subject: LU-9868 llite: remove lld_it field of ll_dentry_data
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c21a2a28bd71d9fe4468c211cb64dc8ca27174e1

Comment by Gerrit Updater [ 05/Mar/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37741/
Subject: LU-9868 llite: remove lld_it field of ll_dentry_data
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f172a59c99f7c6a3398b5f590f4cf1cecabdaee4

Comment by Gerrit Updater [ 14/Apr/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/24175/
Subject: LU-9868 llite: Get rid of ll_dcompare
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 787231f53ab63c72634250f8fe9d27bc66cc4e46

Comment by Gerrit Updater [ 03/Jul/21 ]

James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/44135
Subject: LU-9868 lustre: switch to use of ->d_init()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c704d8547f63613b34c96151f9451f407f0ef664

Comment by Gerrit Updater [ 11/Sep/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44135/
Subject: LU-9868 lustre: switch to use of ->d_init()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 57569871375dd857a89e5e03524d0eec093fef5d

Comment by Gerrit Updater [ 24/Nov/22 ]

"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49237
Subject: LU-9868 llite: remove lld_nfs_dentry flag
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 66e550a888a89e333033b8d19e14f4e1b41ca1c9
