[LU-8907] LBUG: (namei.c:816:ll_create_node()) ASSERTION( list_empty(&inode->i_dentry) ) failed Created: 05/Dec/16  Updated: 07/Jun/17  Resolved: 07/Jun/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Minor
Reporter: Sebastien Piechurski Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: p4b
Environment:

RHEL 6 Bull kernel 2.6.32-642.4.2.el6.Bull.100.x86_64
Lustre build based on 2.5.3.90


Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

One Lustre client crashed right after a recovery with the following messages:

2016-10-29 16:00:01 Lustre: DEBUG MARKER: Sat Oct 29 16:00:01 2016
 2016-10-29 16:00:01
 2016-10-29 16:01:07 Lustre: 17766:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1477749640/real 1477749640] req@ffff8811e1b9a400 x1547883906151520/t0(0) 
 o101->scratch3-MDT0000-mdc-ffff88205ec91400@JO.BOO.AL.IL@o2ib2:12/10 lens 632/1136 e 0 to 1 dl 1477749667 ref 2 fl Rpc:XP/0/ffffffff rc 0/-1
 2016-10-29 16:01:07 Lustre: 17766:0:(client.c:1942:ptlrpc_expire_one_request()) Skipped 12 previous similar messages
 2016-10-29 16:01:07 Lustre: scratch3-MDT0000-mdc-ffff88205ec91400: Connection to scratch3-MDT0000 (at JO.BOO.AL.IL@o2ib2) was lost; in progress operations using this service will wait for recovery to complete
 2016-10-29 16:01:07 Lustre: Skipped 12 previous similar messages
 2016-10-29 16:01:07 Lustre: scratch3-MDT0000-mdc-ffff88205ec91400: Connection restored to scratch3-MDT0000 (at JO.BOO.AL.IL@o2ib2)
 2016-10-29 16:01:07 Lustre: Skipped 12 previous similar messages
 2016-10-29 16:01:07 LustreError: 17766:0:(namei.c:816:ll_create_node()) ASSERTION( list_empty(&inode->i_dentry) ) failed:
 2016-10-29 16:01:07 LustreError: 17766:0:(namei.c:816:ll_create_node()) LBUG

The stack is as follows:

PID: 17766 TASK: ffff8819ab5b0ab0 CPU: 6 COMMAND: "rsync"
 #0 [ffff88189c697b20] machine_kexec at ffffffff8103ff4b
 #1 [ffff88189c697b80] crash_kexec at ffffffff810cfce2
 #2 [ffff88189c697c50] panic at ffffffff81546ce9
 #3 [ffff88189c697cd0] lbug_with_loc at ffffffffa067beeb [libcfs]
 #4 [ffff88189c697cf0] ll_create_nd at ffffffffa0dbf854 [lustre]
 #5 [ffff88189c697d70] vfs_create at ffffffff811a7946
 #6 [ffff88189c697db0] do_filp_open at ffffffff811ab75e
 #7 [ffff88189c697f20] do_sys_open at ffffffff81194e87
 #8 [ffff88189c697f70] sys_open at ffffffff81194f90
 #9 [ffff88189c697f80] system_call_fastpath at ffffffff8100b0d2
 RIP: 0000003c88adb480 RSP: 00007fffd85b6bc8 RFLAGS: 00010246
 RAX: 0000000000000002 RBX: ffffffff8100b0d2 RCX: 0000000000000001
 RDX: 0000000000000180 RSI: 00000000000000c2 RDI: 00007fffd85b7d70
 RBP: 00007fffd85b7d9b R8: 7a672e7761722e30 R9: 00000000ffffffff
 R10: 0000000000000001 R11: 0000000000000246 R12: ffffffff81194f90
 R13: ffff88189c697f78 R14: 00007fffd85b7d9c R15: 0000000000000000
 ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b


Even though the intent is to create the inode with a dentry in the filesystem namespace (/somepath/.tmpfile), the inode's i_dentry field is populated with one dentry pointing at /.lustre/fid/[0x298cd542a:0x3b3d:0x0].

Strangely enough, even though the crash occurred in ll_create_nd() -> ll_create_it() -> ll_create_node(), i.e. before the call to d_instantiate() in ll_create_it(), the fid really does point back to the requested location:

# lfs fid2path /mountpoint [0x298cd542a:0x3b3d:0x0]
/somepath/.tmpfile

I can't upload the crashdump due to site restrictions, but I will be happy to provide more information as you request it.



 Comments   
Comment by Bruno Faccini (Inactive) [ 05/Dec/16 ]

Hello Sebastien,
Would it be possible for me to work on the crash-dump with you, either on-site or when you access it from your office?

Comment by Sebastien Piechurski [ 06/Dec/16 ]

Hi Bruno,


For sure, I'd be happy to work with you on this !

Comment by Bruno Faccini (Inactive) [ 06/Dec/16 ]

Ok, then I will try to contact you offline to see how/when/where we can schedule a meeting.

Comment by Sebastien Piechurski [ 08/Dec/16 ]

I forgot to mention that the intent dentry, which should not have had time to be instantiated (since the assertion fired before the call to d_instantiate()), really is /somepath/.tmpfile (the file currently exists on the filesystem).


Comment by Sebastien Piechurski [ 03/Jan/17 ]

Hi Bruno, and Happy New Year to you!

Were you able to discuss the data we extracted from the dump with your colleagues on this issue?

Comment by Bruno Faccini (Inactive) [ 09/Jan/17 ]

Thank you Seb and I wish the same for you and your family!
About the problem, it definitely looks like a previously unseen one-off to us.
But I am still working through both the Lustre and kernel code in order to find a hole that could lead to it.
BTW, can you tell me how your kernel config (i.e., whether inode.i_dentry/dentry.d_alias uses an hlist or a doubly-linked list) causes the ll_d_hlist_empty() macro to be converted?

Comment by Bruno Faccini (Inactive) [ 09/Jan/17 ]

Oops, forget my previous question, the conversion can be found in the LBUG text ...

Comment by Bruno Faccini (Inactive) [ 09/Jan/17 ]

Hello Seb,
As we already discussed off-line, can you also give us an idea of any additional patches that have been applied on top of Lustre 2.5.3.90 and kernel 2.6.32-642.4.2?
Thanks in advance.

Comment by Bruno Faccini (Inactive) [ 11/Jan/17 ]

Hello Seb, did you see my previous update/query?
On my side, I am working on a possible reproducer, based on some other known occurrences of this same LBUG. Will let you know asap how it goes.

Comment by Bruno Faccini (Inactive) [ 12/Jan/17 ]

Hmm, the first attempt to reproduce looks interesting (promising?).
Running a modified version (to enable it on a single-node config) of replay-single/test_70b in a loop, I just got the following LBUG at the 25th pass:

Lustre: DEBUG MARKER: -----============= acceptance-small: replay-single ============----- Wed Jan 11 16:40:48 UTC 2017
Lustre: DEBUG MARKER: only running test 70b
Lustre: DEBUG MARKER: excepting tests: 61d 33a 33b 89 62
Lustre: DEBUG MARKER: skipping tests SLOW=no: 1 2 3 4 6 12 16 44a 44b 65 66 67 68
Lustre: DEBUG MARKER: Using TIMEOUT=20
LustreError: 27268:0:(llite_lib.c:1225:ll_clear_inode()) ASSERTION( lli->u.d.d_sai == ((void *)0) ) failed: 
LustreError: 27268:0:(llite_lib.c:1225:ll_clear_inode()) LBUG
Pid: 27268, comm: rm

Call Trace:
 [<ffffffffa03fe895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa03fee97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0f64e72>] ll_clear_inode+0x8e2/0x920 [lustre]
 [<ffffffffa0f63300>] ? ll_delete_inode+0x0/0x1e0 [lustre]
 [<ffffffff811a69fc>] clear_inode+0xac/0x140
 [<ffffffffa0f6338d>] ll_delete_inode+0x8d/0x1e0 [lustre]
 [<ffffffffa0f34000>] ? return_if_equal+0x0/0x30 [lustre]
 [<ffffffff811a70fe>] generic_delete_inode+0xde/0x1d0
 [<ffffffff811a7255>] generic_drop_inode+0x65/0x80
 [<ffffffff811a60a2>] iput+0x62/0x70
 [<ffffffffa0f3590a>] ll_d_iput+0xfa/0x820 [lustre]
 [<ffffffff811a2c7c>] dentry_iput+0x7c/0x100
 [<ffffffff811a2df1>] d_kill+0x31/0x60
 [<ffffffff811a49cc>] dput+0x7c/0x150
 [<ffffffff8119b1d9>] do_rmdir+0x199/0x1f0
 [<ffffffff81196df6>] ? final_putname+0x26/0x50
 [<ffffffff810e2257>] ? audit_syscall_entry+0x1d7/0x200
 [<ffffffff8119b25d>] sys_unlinkat+0x2d/0x40
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

which is only reported in LU-2048 (quite old!) and without any analysis/explanation!

But unfortunately, kdump over the LAN/nfs failed:

……………..
mapping eth0 to eth0
ADDRCONF(NETDEV_UP): eth0: link is not ready
udhcpc (v1.15.1) started
Sending discover...
e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Sending discover...
Sending discover...
Unable to get a DHCP address retry...
sd 5:0:0:0: [sdf] Synchronizing SCSI cache
No lease, failing
eth0 failed to come up
Saving to remote location eagle-2.eagle.hpdd.intel.com:/export/scratch
mount: RPC: Remote system error - Network is unreachable
sd 4:0:0:0: [sde] Synchronizing SCSI cache
sd 3:0:0:0: [sdd] Synchronizing SCSI cache
sd 2:0:0:0: [sdc] Synchronizing SCSI cache
sd 1:0:0:0: [sdb] Synchronizing SCSI cache
sd 0:0:0:0: [sda] Synchronizing SCSI cache
e1000e 0000:02:00.0: PCI INT A disabled
e1000e 0000:01:00.0: PCI INT A disabled
Restarting system.
machine restart
Comment by Sebastien Piechurski [ 12/Jan/17 ]

Hi Bruno, 


Here is the list of patches we apply to our servers' kernels:

0003-Applied-Bull_lmigr-x86_64.el6.0.7.patch-for-RHEL-2.6.patch
Patch which delays migration of pages between numa nodes

0006-Applied-Bull_311740-GENERIC_ALLOCATOR-4.patch
Include generic memory allocator for x86 architecture.

0008-Applied-Bull_Dynamic_Power_Limitation_2.6.32-573.patch
CState manipulation interface through /sys

0011-Applied-Bull_mce75xx_2.6.32-504.diff.patch
Intel Xeon 7500 series corrected memory error driver

0012-Applied-Bull_improve-x86_64.csum.patch.patch
Performance improvement on TCP/IP checksumming

0017-Applied-raid5-mmp-unplug-dev-rhel6.Bull.3.patch
Forces sync reads on MD devices' disks for MMP compatibility

0018-lustre-blkdev_tunables-2.6-rhel6.patch.patch
Some block device parameters value modifications

0021-lustre-bh_lru_size_config.patch.patch
Buffer head per-cpu LRU size increased from 8 to 16

0022-quota-replace-dqptr-sem.patch
Replaces dqptr semaphore by spin locks (lustre server kernel patch)

0029-Applied-proc_pid_limits.patch.patch
Ease use of writes to /proc/<pid>/limits file

0030-Applied-patch-Bull_oomprotect.diff.patch
Patch to protect root's processes from being killed by the OOM killer

0055-quota-avoid-dqget-calls.patch
(lustre server kernel patch)

0058-fix-CONFIG_XEN.patch
Some modifications to have kvm work when XEN is not enabled

0059-ocfs2-getname.patch
fix getname/putname changes in ocfs2

0090-kernel_H-2.6.32-573.1.1-config-x86_64.patch
0091-kernel_H-2.6.32-573.1.1-config-x86_64.debug.patch
Some modifications to the default RHEL kernel configuration.


For the lustre code, this is based on 2.5.3.90 (commit 35bb8577c83f787451cd9842a07ac670cb6f4d3f), with some patches that I will send you by mail.

Comment by Bruno Faccini (Inactive) [ 17/Jan/17 ]

Thanks Seb, I don't see any Kernel/Lustre patch that could be related to the problem.

Concerning the other LBUG, related to LU-2048, that I triggered during my first attempts to reproduce: it is purely a statahead problem, where some fixes appear to be missing in 2.5.3.90.

On the other hand, I have been able to reproduce your LBUG by introducing the following delay just before the failing LASSERT:

[root@eagle-28 lustre-release]# git status
# On branch b2_5
# Changed but not updated:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#       modified:   lustre/include/obd_support.h
#       modified:   lustre/llite/namei.c
#
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#       lustre/include/obd_support.h.bfi
#       lustre/llite/namei.c.bfi
#       lustre/scripts/make_META.pl
no changes added to commit (use "git add" and/or "git commit -a")
[root@eagle-28 lustre-release]# 
[root@eagle-28 lustre-release]# git diff
diff --git a/lustre/include/obd_support.h b/lustre/include/obd_support.h
index 37bee22..a21f8a0 100644
--- a/lustre/include/obd_support.h
+++ b/lustre/include/obd_support.h
@@ -430,6 +430,7 @@ int obd_alloc_fail(const void *ptr, const char *name, const char *type,
 #define OBD_FAIL_MDC_GETATTR_ENQUEUE     0x803
 #define OBD_FAIL_MDC_RPCS_SEM           0x804
 #define OBD_FAIL_MDC_LIGHTWEIGHT        0x805
+#define OBD_FAIL_MDC_CREATE_PAUSE       0x805
 
 #define OBD_FAIL_MGS                     0x900
 #define OBD_FAIL_MGS_ALL_REQUEST_NET     0x901
diff --git a/lustre/llite/namei.c b/lustre/llite/namei.c
index 39cd6d5..1cf3518 100644
--- a/lustre/llite/namei.c
+++ b/lustre/llite/namei.c
@@ -813,6 +813,8 @@ static struct inode *ll_create_node(struct inode *dir, const char *name,
         if (rc)
                 GOTO(out, inode = ERR_PTR(rc));
 
+       OBD_FAIL_TIMEOUT(OBD_FAIL_MDC_CREATE_PAUSE, 5);
+
        LASSERT(ll_d_hlist_empty(&inode->i_dentry));
 
         /* We asked for a lock on the directory, but were granted a
[root@eagle-28 lustre-release]# 

and by running the following snippet :

[root@eagle-28 lustre-release]# date; lctl set_param fail_loc=0x80000805 ; echo foofour > /mnt/lustre/foo8 & sleep 2 ; date ; fid=$(lfs changelog lustre-MDT0000 | grep ' foo8' | sed 's/.* t=//' | sed 's/ p=.*//') ; date ; cat /mnt/lustre/.lustre/fid/$fid ; date
Tue Jan 17 16:31:09 UTC 2017
fail_loc=0x80000805
[1] 27152
Tue Jan 17 16:31:11 UTC 2017
Tue Jan 17 16:31:11 UTC 2017
Tue Jan 17 16:31:11 UTC 2017
[root@eagle-28 lustre-release]# 
Message from syslogd@eagle-28 at Jan 17 16:31:14 ...
 kernel:LustreError: 27152:0:(namei.c:818:ll_create_node()) ASSERTION( list_empty(&inode->i_dentry) ) failed: 
Message from syslogd@eagle-28 at Jan 17 16:31:14 ...
 kernel:LustreError: 27152:0:(namei.c:818:ll_create_node()) LBUG

So this seems to mean that, as I have suspected from the start of our joint analysis, there is a potential race between file creation (including directory/dentry population) and concurrent access via the .lustre/fid method that is available to a ChangeLog reader, some kind of
"insider trading", so to speak. Or maybe the LBUG is simply inaccurate and could be removed?

More to come from crash-dump/debug-log analysis; I will also check whether the problem (i.e., the same LBUG) is still alive in master.

Comment by Gerrit Updater [ 18/Jan/17 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/24976
Subject: LU-8907 llite: handle client racy case during create
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: 376e27c4ac97d31f9729f351317891bff3f54147

Comment by Bruno Faccini (Inactive) [ 18/Jan/17 ]

I have just pushed https://review.whamcloud.com/24976 to the b2_5 branch. It introduces the sanity/test_161d sub-test, first to check that the reproducer I have identified also fails in the auto-test environment, and then, with the next patch set, to verify that removing the LBUG fixes the problem without introducing any regression.

Next, will also check, using the same way, if problem also exists in master.

Comment by Sebastien Piechurski [ 02/Feb/17 ]

Hi Bruno, did you see the problem reproduced in the auto-test environment?

Comment by Gerrit Updater [ 07/Feb/17 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/25296
Subject: LU-8907 llite: handle client racy case during create
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3d35f58c3b6c736e5db3d22dea2451bf634cdb95

Comment by Bruno Faccini (Inactive) [ 08/Feb/17 ]

Yes, the new sanity/test_161d sub-test is also a reproducer in the auto-test environment, and it also demonstrates that the problem is still alive in master!
This has been proven by patch-set #2 testing of https://review.whamcloud.com/25296, and now patch-set #3 introduces a first attempt at a fix, simply removing the inaccurate

LASSERT(ll_d_hlist_empty(&inode->i_dentry));

in ll_create_node(), and exposes it to the full test suites.

Comment by Sebastien Piechurski [ 02/Mar/17 ]

Hello Bruno,


What is the next step for this patch?

Are we waiting for reviewers?

Comment by Gerrit Updater [ 26/Mar/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25296/
Subject: LU-8907 llite: handle client racy case during create
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ee25babfe72378f9496a9732742984f26eb7d4a5

Comment by Bruno Faccini (Inactive) [ 27/Mar/17 ]

Hello Seb,
As you can see https://review.whamcloud.com/25296/ has just landed on master.
And as per its content (at least its pure-fix part in core code, i.e. simply removing the LBUG), it should be pretty easy for you to back-port it on top of your Lustre branch based on 2.5.3.90.

Comment by Sebastien Piechurski [ 07/Jun/17 ]

Thanks Bruno.

The patch was backported to our branch.

This ticket can be closed.

Comment by Peter Jones [ 07/Jun/17 ]

Thanks

Comment by Peter Jones [ 07/Jun/17 ]

Landed for 2.10

Generated at Sat Feb 10 02:21:34 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.