[LU-5883] DNE II testing: LustreError: 6618:0:(statahead.c:262:sa_kill()) ASSERTION( !list_empty(&entry->se_list) ) failed Created: 07/Nov/14  Updated: 09/Oct/21  Resolved: 09/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Patrick Farrell (Inactive) Assignee: Lai Siyao
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Versions of 2.6.54 on clients & servers.
Cray SLES11SP3 clients, CentOS servers (2.6.32-431.5.1.el6.x86_64).

Most recent commit on clients:
Ie7a2a98be8cc97db9af7a64476c06fc7321544eb
http://review.whamcloud.com/12142

Most recent commit on servers:
If24443955290b091fd22905dfb74b0d6a6d1b4e8
http://review.whamcloud.com/12490


Severity: 3
Rank (Obsolete): 16451

 Description   

While doing a general purpose test of master with DNE II (2 MDSes with 3 MDTs each, 6 total MDTs), we did an ls to check the status of something, and our client LBUGged. We turned up debug on a different client, did the ls again, and it crashed as well:

2014-11-06T20:08:43.974554-06:00 c1-0c0s0n3 LustreError: 3797:0:(statahead.c:262:sa_kill()) ASSERTION( !list_empty(&entry->se_list) ) failed:
2014-11-06T20:08:43.974591-06:00 c1-0c0s0n3 LustreError: 3797:0:(statahead.c:262:sa_kill()) LBUG
2014-11-06T20:08:43.974597-06:00 c1-0c0s0n3 Pid: 3797, comm: ls
2014-11-06T20:08:43.974604-06:00 c1-0c0s0n3 Call Trace:
2014-11-06T20:08:43.974617-06:00 c1-0c0s0n3 [<ffffffff81006591>] try_stack_unwind+0x161/0x1a0
2014-11-06T20:08:43.974622-06:00 c1-0c0s0n3 [<ffffffff81004de9>] dump_trace+0x89/0x440
2014-11-06T20:08:43.974632-06:00 c1-0c0s0n3 [<ffffffffa0176897>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
2014-11-06T20:08:43.974639-06:00 c1-0c0s0n3 [<ffffffffa0176de7>] lbug_with_loc+0x47/0xc0 [libcfs]
2014-11-06T20:08:44.015207-06:00 c1-0c0s0n3 [<ffffffffa0750512>] sa_put+0x332/0x370 [lustre]
2014-11-06T20:08:44.015232-06:00 c1-0c0s0n3 [<ffffffffa0752b6c>] do_statahead_enter+0xfdc/0x1cb0 [lustre]
2014-11-06T20:08:44.015238-06:00 c1-0c0s0n3 [<ffffffffa073a695>] ll_lookup_it+0x7b5/0x1b70 [lustre]
2014-11-06T20:08:44.015245-06:00 c1-0c0s0n3 [<ffffffffa073bad5>] ll_lookup_nd+0x85/0x570 [lustre]
2014-11-06T20:08:44.015250-06:00 c1-0c0s0n3 [<ffffffff81148edc>] d_alloc_and_lookup+0x4c/0x80
2014-11-06T20:08:44.015261-06:00 c1-0c0s0n3 [<ffffffff8114a57e>] do_lookup+0x29e/0x3a0
2014-11-06T20:08:44.015273-06:00 c1-0c0s0n3 [<ffffffff8114cdb3>] path_lookupat+0xc3/0x5e0
2014-11-06T20:08:44.015281-06:00 c1-0c0s0n3 [<ffffffff8114d305>] do_path_lookup+0x35/0xd0
2014-11-06T20:08:44.015287-06:00 c1-0c0s0n3 [<ffffffff8114e043>] user_path_at_empty+0x83/0xb0
2014-11-06T20:08:44.015303-06:00 c1-0c0s0n3 [<ffffffff8114e081>] user_path_at+0x11/0x20
2014-11-06T20:08:44.015311-06:00 c1-0c0s0n3 [<ffffffff81142d85>] vfs_fstatat+0x55/0x90
2014-11-06T20:08:44.054257-06:00 c1-0c0s0n3 [<ffffffff81142e2e>] vfs_lstat+0x1e/0x20
2014-11-06T20:08:44.054275-06:00 c1-0c0s0n3 [<ffffffff81142e54>] sys_newlstat+0x24/0x50
2014-11-06T20:08:44.054281-06:00 c1-0c0s0n3 [<ffffffff8133776b>] system_call_fastpath+0x16/0x1b
2014-11-06T20:08:44.054293-06:00 c1-0c0s0n3 [<00007ffff78b9515>] 0x7ffff78b9515

I'll make the client dump available in a few minutes.



 Comments   
Comment by Patrick Farrell (Inactive) [ 07/Nov/14 ]

Dump and associated kos (and system logs for that node) going up here:
ftp.whamcloud.com/uploads/LU-5883/c1-0c0s1n2-20141107.tar.gz

The node had full debugging enabled when it was crashed.

Confirmed from the engineer testing it - The ls was of the root of the file system. There should not have been any remote directories in the root of the file system, so this may have nothing to do with DNE II.

Comment by Patrick Farrell (Inactive) [ 07/Nov/14 ]

Full Lustre log extracted from the client is here:
ftp.whamcloud.com/uploads/LU-5883/c1-0c0s1n2.lustre.log

Comment by Patrick Farrell (Inactive) [ 07/Nov/14 ]

Note that, weirdly, dirdata was off on MDT0. This system was used to troubleshoot LU-5626 and we forgot to reenable dirdata. We'll reenable it before the next run.

Comment by Di Wang [ 08/Nov/14 ]

From the debug log it seems not related with DNE, hmm, I saw

    LU-3270 statahead: statahead thread wait for RPCs to finish
    
    Statahead thread should wait for inflight stat RPCs to finish in
    case statahead RPC callback may access data allocated in statahead
    thread context.
    
    ll_sa_entry_fini() should keep old entry if stat RPC is not
    finished yet.
    
    Simplify sai refcounting:
    * newly allocated sai will hold one refcount, and it will put it
      after starting statahead thread.
    * statahead thread holds one refcount.
    * agl thread holds one refcount.
    * stat process calls do_statahead_enter() which will try to get
      sai, and if it's valid, it will revalidate from statahead cache,
      and put refcount after use.
    
    Signed-off-by: Lai Siyao <lai.siyao@intel.com>
    Change-Id: I55a4fe66a5f6c04595d3bc84f0cd3750f20e0ee4
    Reviewed-on: http://review.whamcloud.com/9663
    Tested-by: Jenkins
    Tested-by: Maloo <hpdd-maloo@intel.com>
    Reviewed-by: Fan Yong <fan.yong@intel.com>
    Reviewed-by: James Simmons <uja.ornl@gmail.com>
    Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>

Just changed these area in 2.6. Lai, could you please have a look. Thanks.

Comment by Lai Siyao [ 10/Nov/14 ]

this is a known issue: revalidate_statahead_dentry() may fail to wait for a statahead entry to become ready, in this case it should not release this entry. originally this was found in 2.4 statahead patches, and I'm waiting for their response. I'll update the last patch of LU-3270: http://review.whamcloud.com/#/c/9667/, so if you want to test statahead, it's better to apply this one and test.

Comment by Lai Siyao [ 10/Nov/14 ]

Patch updated, if test is okay, I'll mark this duplicate.

Comment by Patrick Farrell (Inactive) [ 10/Nov/14 ]

Thanks, Lai - We'll use that patch when we next update the system. That will be later this week, I'll try to give test results shortly after that.

Comment by Andrew Zenk [ 03/Feb/15 ]

I see that the patch seems to have been applied, though we seem to be having a similar problem. We're using 2.6.93 on the clients and 2.6.92 on the servers. The problem can be reproduced reliably on our end, so I'd be happy to provide additional logs/diagnostic information (Is there a standard procedure for this? I couldn't seem to find anything via google.) I've included the kernel log messages that were dumped to the console as well as the stack trace below.

kernel:LustreError: 13007:0:(statahead.c:262:sa_kill()) ASSERTION( !list_empty(&entry->se_list) ) failed:
kernel:LustreError: 13007:0:(statahead.c:262:sa_kill()) LBUG

PID: 13007 TASK: ffff880b0aea2040 CPU: 25 COMMAND: "rsync"
#0 [ffff881c20ef16f0] machine_kexec at ffffffff8103b5bb
/usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/arch/x86/kernel/machine_kexec_64.c: 336
#1 [ffff881c20ef1750] crash_kexec at ffffffff810c9852
/usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/kernel/kexec.c: 1106
#2 [ffff881c20ef1820] panic at ffffffff8152927e
/usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/kernel/panic.c: 111
#3 [ffff881c20ef18a0] lbug_with_loc at ffffffffa03c5eeb [libcfs]
/usr/src/debug/lustre-2.6.93/libcfs/libcfs/linux/linux-debug.c: 175
#4 [ffff881c20ef18c0] revalidate_statahead_dentry at ffffffffa0a0e31d [lustre]
/usr/src/debug/lustre-2.6.93/lustre/llite/statahead.c: 263
#5 [ffff881c20ef1a00] ll_statahead at ffffffffa0a0e642 [lustre]
/usr/src/debug/lustre-2.6.93/libcfs/include/libcfs/libcfs_debug.h: 219
#6 [ffff881c20ef1ae0] ll_lookup_it at ffffffffa09f8907 [lustre]
/usr/src/debug/lustre-2.6.93/lustre/llite/namei.c: 541
#7 [ffff881c20ef1ba0] ll_lookup_nd at ffffffffa09f9029 [lustre]
/usr/src/debug/lustre-2.6.93/lustre/llite/namei.c: 771
#8 [ffff881c20ef1bf0] do_lookup at ffffffff8119dc65
/usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 1063
#9 [ffff881c20ef1c50] __link_path_walk at ffffffff8119e8f4
/usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 1239
#10 [ffff881c20ef1d30] path_walk at ffffffff8119f40a
/usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 558
#11 [ffff881c20ef1d70] filename_lookup at ffffffff8119f61b
/usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 1375
#12 [ffff881c20ef1db0] user_path_at at ffffffff811a0747
/usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/namei.c: 1597
#13 [ffff881c20ef1e80] vfs_fstatat at ffffffff81193bc0
/usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/stat.c: 84
#14 [ffff881c20ef1ee0] vfs_lstat at ffffffff81193c7e
/usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/stat.c: 107
#15 [ffff881c20ef1ef0] sys_newlstat at ffffffff81193ca4
/usr/src/debug/kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/fs/stat.c: 257
#16 [ffff881c20ef1f80] system_call_fastpath at ffffffff8100b072
/usr/src/debug////////kernel-2.6.32-504.3.3.el6/linux-2.6.32-504.3.3.el6.x86_64/arch/x86/kernel/entry_64.S: 489
RIP: 00000031f78daf75 RSP: 00007fff51e03da8 RFLAGS: 00010206
RAX: 0000000000000006 RBX: ffffffff8100b072 RCX: 00007fff51e040b0
RDX: 00007fff51dffb10 RSI: 00007fff51dffb10 RDI: 00007fff51e00ba0
RBP: 0000000000000000 R8: 00007fff51e00be3 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff51e00ba0
R13: 00007fff51dffb10 R14: 00007fff51dffb10 R15: 00007fff51e00ba0
ORIG_RAX: 0000000000000006 CS: 0033 SS: 002b

Generated at Sat Feb 10 01:55:20 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.