[LU-5883] DNE II testing: LustreError: 6618:0:(statahead.c:262:sa_kill()) ASSERTION( !list_empty(&entry->se_list) ) failed Created: 07/Nov/14 Updated: 09/Oct/21 Resolved: 09/Oct/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Patrick Farrell (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Versions of 2.6.54 on clients & servers. Most recent commit on clients: Most recent commit on servers: |
||
| Severity: | 3 |
| Rank (Obsolete): | 16451 |
| Description |
|
While doing a general-purpose test of master with DNE II (2 MDSes with 3 MDTs each, 6 total MDTs), we ran an ls to check the status of something, and our client LBUGged. We turned up debugging on a different client, ran the ls again, and it crashed as well:
2014-11-06T20:08:43.974554-06:00 c1-0c0s0n3 LustreError: 3797:0:(statahead.c:262:sa_kill()) ASSERTION( !list_empty(&entry->se_list) ) failed:
I'll make the client dump available in a few minutes. |
| Comments |
| Comment by Patrick Farrell (Inactive) [ 07/Nov/14 ] |
|
Dump and associated kos (and system logs for that node) going up here:
The node had full debugging enabled when it crashed. Confirmed with the engineer testing it: the ls was of the root of the file system. There should not have been any remote directories in the root of the file system, so this may have nothing to do with DNE II. |
| Comment by Patrick Farrell (Inactive) [ 07/Nov/14 ] |
|
Full Lustre log extracted from the client is here: |
| Comment by Patrick Farrell (Inactive) [ 07/Nov/14 ] |
|
Note that, weirdly, dirdata was off on MDT0. This system was used to troubleshoot |
| Comment by Di Wang [ 08/Nov/14 ] |
|
From the debug log, this does not seem to be related to DNE. Hmm, I saw this commit:
LU-3270 statahead: statahead thread wait for RPCs to finish
Statahead thread should wait for inflight stat RPCs to finish in
case statahead RPC callback may access data allocated in statahead
thread context.
ll_sa_entry_fini() should keep old entry if stat RPC is not
finished yet.
Simplify sai refcounting:
* newly allocated sai will hold one refcount, and it will put it
after starting statahead thread.
* statahead thread holds one refcount.
* agl thread holds one refcount.
* stat process calls do_statahead_enter() which will try to get
sai, and if it's valid, it will revalidate from statahead cache,
and put refcount after use.
Signed-off-by: Lai Siyao <lai.siyao@intel.com>
Change-Id: I55a4fe66a5f6c04595d3bc84f0cd3750f20e0ee4
Reviewed-on: http://review.whamcloud.com/9663
Tested-by: Jenkins
Tested-by: Maloo <hpdd-maloo@intel.com>
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: James Simmons <uja.ornl@gmail.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
This area was just changed in 2.6. Lai, could you please have a look? Thanks. |
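The refcounting rules listed in the quoted commit message can be illustrated with a minimal sketch. This is not the Lustre implementation: `struct sai_sketch` and the `sai_sketch_*` helpers are hypothetical names introduced only for illustration, and the counter is a plain `int` with no locking or atomics, so it only models the order of get/put calls in a single thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the statahead info ("sai") object; this is
 * NOT the real Lustre structure. */
struct sai_sketch {
	int  refcount;  /* number of current holders */
	bool released;  /* set when the last reference is dropped */
};

/* A newly allocated sai holds one reference, owned by its creator. */
static struct sai_sketch *sai_sketch_alloc(void)
{
	struct sai_sketch *sai = calloc(1, sizeof(*sai));

	if (sai != NULL)
		sai->refcount = 1;
	return sai;
}

/* Take a reference (statahead thread, agl thread, or a stat process). */
static void sai_sketch_get(struct sai_sketch *sai)
{
	assert(!sai->released);
	sai->refcount++;
}

/* Drop a reference; the last put marks the object released. The caller
 * frees the memory afterwards so the flag can still be inspected. */
static void sai_sketch_put(struct sai_sketch *sai)
{
	assert(sai->refcount > 0);
	if (--sai->refcount == 0)
		sai->released = true;
}
```

Under these assumptions, the sequence in the commit message would be: allocate (count 1), one get each for the statahead and agl threads (count 3), the creator's put once the statahead thread has started (count 2), a temporary get/put pair around each stat process lookup, and a final put from each thread as it exits.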
| Comment by Lai Siyao [ 10/Nov/14 ] |
|
This is a known issue: revalidate_statahead_dentry() may fail to wait for a statahead entry to become ready, and in that case it should not release the entry. This was originally found in the 2.4 statahead patches, and I'm waiting for their response. I'll update the last patch of |
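A minimal sketch of the rule described in this comment, under stated assumptions: the types and functions below (`sa_entry_sketch`, `sa_sketch_kill`, `sa_sketch_revalidate`) are hypothetical and do not reproduce the Lustre code. The `on_list` flag stands in for `!list_empty(&entry->se_list)`, and the link between an early release and the failed assertion is an assumption made for illustration, not something confirmed in this ticket.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical entry states; not the real Lustre definitions. */
enum sa_sketch_state { SA_SKETCH_PENDING, SA_SKETCH_READY };

struct sa_entry_sketch {
	enum sa_sketch_state state;
	bool on_list;  /* models !list_empty(&entry->se_list) */
};

/* Teardown in this sketch requires that the entry is still linked,
 * mirroring the shape of the assertion reported in the ticket. */
static void sa_sketch_kill(struct sa_entry_sketch *entry)
{
	assert(entry->on_list);
	entry->on_list = false;
}

/* Revalidation path: release the entry only if the wait succeeded and
 * the entry is ready. Otherwise leave it linked so that its owner can
 * tear it down later. Returns true if the entry was released here. */
static bool sa_sketch_revalidate(struct sa_entry_sketch *entry, bool wait_ok)
{
	if (!wait_ok || entry->state != SA_SKETCH_READY)
		return false;

	sa_sketch_kill(entry);
	return true;
}
```

In this model, an entry whose wait failed stays on its list, so a later teardown still finds it linked. Releasing it in the failed-wait path would make that later teardown trip the `on_list` check.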
| Comment by Lai Siyao [ 10/Nov/14 ] |
|
Patch updated. If testing is okay, I'll mark this as a duplicate. |
| Comment by Patrick Farrell (Inactive) [ 10/Nov/14 ] |
|
Thanks, Lai. We'll use that patch when we next update the system. That will be later this week; I'll try to give test results shortly after that. |
| Comment by Andrew Zenk [ 03/Feb/15 ] |
|
I see that the patch seems to have been applied, but we seem to be having a similar problem. We're using 2.6.93 on the clients and 2.6.92 on the servers. The problem can be reproduced reliably on our end, so I'd be happy to provide additional logs/diagnostic information. (Is there a standard procedure for this? I couldn't find anything via Google.) I've included the kernel log messages that were dumped to the console as well as the stack trace below.
kernel:LustreError: 13007:0:(statahead.c:262:sa_kill()) ASSERTION( !list_empty(&entry->se_list) ) failed:
PID: 13007 TASK: ffff880b0aea2040 CPU: 25 COMMAND: "rsync" |