[LU-2388] Oops in ll_sai_unplug Created: 26/Nov/12 Updated: 02/May/16 Resolved: 30/Mar/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Christopher Morrone | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | LB, sequoia, topsequoia |
| Environment: | Sequoia LAC node, 2.3.54-2chaos (github.com/lustre/chaos) |
| Attachments: | console.seqlac2.txt |
| Severity: | 3 |
| Rank (Obsolete): | 5663 |
| Description |
|
We have companion nodes to Sequoia called "LAC nodes". LAC stands for "large application compile", or something like that. These nodes are NOT normal BG/Q architecture; instead they are regular large Power 7 architecture nodes. But otherwise they are similar to Sequoia I/O Nodes in that they are ppc64, 64K pages, and running RHEL for the OS.
seqlac2 crashed on 2012-11-19 16:18:40, hitting the following Oops:
2012-11-19 16:18:40 Faulting instruction address: 0xd0000000139b6710
2012-11-19 16:18:40 Oops: Kernel access of bad area, sig: 11 [#1]
2012-11-19 16:18:40 SMP NR_CPUS=1024 NUMA pSeries
2012-11-19 16:18:40 Modules linked in: xt_owner nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ptlrpc(U) obdclass(U) lvfs(U) nfs fscache lockd nfs_acl auth_rpcgss ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) sunrpc ipt_LOG xt_multiport iptable_filter ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa uinput raid1 ses enclosure sg mlx4_ib ib_mad ib_core mlx4_en mlx4_core e1000e ehea ext4 jbd2 mbcache raid456 async_pq async_xor xor async_raid6_recov raid6_pq async_memcpy async_tx sd_mod crc_t10dif ipr dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
2012-11-19 16:18:40 NIP: d0000000139b6710 LR: d0000000139b6700 CTR: c0000000005be960
2012-11-19 16:18:40 REGS: c000000f54aab080 TRAP: 0300 Not tainted (2.6.32-279.11.1.1chaos.bgq62.ppc64)
2012-11-19 16:18:40 MSR: 8000000000009032 <EE,ME,IR,DR> CR: 24008442 XER: 20000000
2012-11-19 16:18:40 DAR: 0000000000000008, DSISR: 0000000042000000
2012-11-19 16:18:40 TASK = c000000f539312d0[10079] 'ls' THREAD: c000000f54aa8000 CPU: 32
2012-11-19 16:18:40 GPR00: c00000025f6e5d90 c000000f54aab300 d000000013a0d878 c000000ecf32d120
2012-11-19 16:18:40 GPR04: 5a5a5a5a5a5a5a5a 0000000000000004 c0000005ff4e506c 0000000000000002
2012-11-19 16:18:40 GPR08: c000000000ff7500 0000000000000000 c000000001099004 0000000000000000
2012-11-19 16:18:40 GPR12: d0000000139d2610 c000000000ff7500 0000000010022ff8 00000000000000e8
2012-11-19 16:18:40 GPR16: c000000739a8f200 0000000000000000 0000000000000001 0000000000000003
2012-11-19 16:18:40 GPR20: c000000ecf32cee0 0000000000000000 c00000025f6e5d80 0000000000000000
2012-11-19 16:18:40 GPR24: c000000f45ae0000 c00000025f6e5d80 d00000000cecfdc0 d00000000cecfdc4
2012-11-19 16:18:40 GPR28: 0000000000003300 c000000ecf32cd80 d000000013a0aca8 c000000f54aab300
2012-11-19 16:18:40 NIP [d0000000139b6710] .ll_sai_unplug+0x210/0x860 [lustre]
2012-11-19 16:18:40 LR [d0000000139b6700] .ll_sai_unplug+0x200/0x860 [lustre]
2012-11-19 16:18:40 Call Trace:
2012-11-19 16:18:40 [c000000f54aab300] [d0000000139b6700] .ll_sai_unplug+0x200/0x860 [lustre] (unreliable)
2012-11-19 16:18:40 [c000000f54aab430] [d0000000139b7bc8] .do_statahead_enter+0x218/0x22b0 [lustre]
2012-11-19 16:18:40 [c000000f54aab5f0] [d000000013933dec] .ll_revalidate_it+0x82c/0x21f0 [lustre]
2012-11-19 16:18:40 [c000000f54aab750] [d00000001393595c] .ll_revalidate_nd+0x1ac/0x560 [lustre]
2012-11-19 16:18:40 [c000000f54aab810] [c0000000001d55b0] .do_lookup+0xa0/0x2d0
2012-11-19 16:18:40 [c000000f54aab8e0] [c0000000001d8c50] .__link_path_walk+0x9b0/0x15a0
2012-11-19 16:18:40 [c000000f54aaba10] [c0000000001d9bf8] .path_walk+0x98/0x180
2012-11-19 16:18:40 [c000000f54aabab0] [c0000000001d9eec] .do_path_lookup+0x7c/0xf0
2012-11-19 16:18:40 [c000000f54aabb40] [c0000000001dac50] .user_path_at+0x60/0xb0
2012-11-19 16:18:40 [c000000f54aabc90] [c0000000001ce144] .vfs_fstatat+0x44/0xb0
2012-11-19 16:18:40 [c000000f54aabd30] [c0000000001ce314] .SyS_newlstat+0x24/0x50
2012-11-19 16:18:40 [c000000f54aabe30] [c000000000008564] syscall_exit+0x0/0x40
2012-11-19 16:18:40 Instruction dump:
2012-11-19 16:18:40 81f90058 ea1d0000 79ef06e0 3a10ff08 39ef00e8 79e31764 7c7d1a14 4801bf15
2012-11-19 16:18:40 e8410028 e9390018 e9790010 38190010 <f92b0008> f9690000 f8190018 f8190010
Note that this node was hitting many order:1 page allocation failures that day, with the most recent one less than a minute before the Oops in Lustre. The attached file "console.seqlac2.txt" contains a more complete console log.
Using gdb to look up the line number of ll_sai_unplug+0x200 revealed that it was at lustre/llite/statahead.c:133, which puts it in ll_sa_entry_unhash() at the first cfs_spin_lock(). So if functions were not inlined, the end of the backtrace would probably look like this:
ll_sa_entry_unhash
do_sai_entry_fini
ll_sa_entry_fini
ll_sai_unplug
do_statahead_enter
ll_statahead_enter
ll_revalidate_it
ll_revalidate_nd
Because of the inlining, I cannot be entirely certain which of the three do_sai_entry_fini() calls in ll_sa_entry_fini() this was. |
| Comments |
| Comment by Lai Siyao [ 27/Nov/12 ] |
|
Statahead is started, accessed and stopped by the scanner only, but the backtrace shows that the sa_entry may be unplugged twice. After reviewing the code, one possible cause is that statahead has been stopped (but lli->lli_sai is not put because there are in-flight statahead RPCs) while the scanner process still accesses the statahead data. To fix, check lli->lli_opendir_key before accessing. The patch is at http://review.whamcloud.com/#change,4718 |
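A minimal sketch of the check described above, not the actual patch (see the Gerrit change linked in the comment): the scanner would verify that the directory's statahead state is still registered before touching it. sai_still_usable below is a hypothetical helper name; lli_opendir_key, lli_sai and lli_sa_lock are the fields named in this ticket.
/* Hypothetical helper sketching the "check lli_opendir_key before
 * accessing" idea from the comment above; not the landed patch. */
static int sai_still_usable(struct ll_inode_info *lli)
{
        int usable;

        spin_lock(&lli->lli_sa_lock);
        /* Statahead may already be stopped (opendir key cleared) while
         * lli_sai lingers because in-flight RPCs still hold references;
         * in that case the scanner must not use the statahead data. */
        usable = (lli->lli_opendir_key != NULL && lli->lli_sai != NULL);
        spin_unlock(&lli->lli_sa_lock);

        return usable;
}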
| Comment by Lai Siyao [ 08/Jan/13 ] |
|
Chris, the patch has landed on master. Did you verify it, or can I close this issue? |
| Comment by Christopher Morrone [ 08/Jan/13 ] |
|
I don't know how to reproduce the bug, so you can go ahead and close this issue. We rebased onto 2.3.58, so we have this patch now. We can always reopen if we hit the problem again. Thanks! |
| Comment by Christopher Morrone [ 22/Jan/13 ] |
|
The problem is not fixed. We hit this again on the LAC node in 2.3.58-4chaos. |
| Comment by Christopher Morrone [ 06/Feb/13 ] |
|
Hit again. |
| Comment by Christopher Morrone [ 15/Feb/13 ] |
|
We are hitting this often enough on login nodes to warrant increasing the priority to blocker, and adding the topsequoia label. This morning it hit again running 2.3.58-9chaos. |
| Comment by Lai Siyao [ 18/Feb/13 ] |
|
The NIP shows it's the first do_sai_entry_fini(), that is to say, entry is not NULL. The backtrace also shows it's at do_statahead_enter(), statahead.c line 1613, so only_unplug should be true, and this should be revalidation of a negative dentry. However, all of the above cannot explain why it oopses at spin_unlock, and the console line
2012-11-19 16:18:40 Unable to handle kernel paging request for data at address 0x00000008
suggests cfs_list_del_init(&entry->se_hash) is more likely to be the place where it oopses. Since the backtrace says the top function in the call trace is unreliable, is it possible to get a crash dump and analyse it? |
| Comment by Christopher Morrone [ 19/Feb/13 ] |
|
The sysadmins are working on getting kdump working on the ppc64 LAC nodes. It basically works, but for whatever reason the root image for the crash kernel loads all of the lustre modules, and lustre hits an assertion on load of the ptlrpc module. So we either need to fix the lustre bug or fix the image to stop loading lustre modules before crash will work. |
| Comment by Peter Jones [ 06/Mar/13 ] |
|
Chris, any progress with getting a crash dump? Peter |
| Comment by Christopher Morrone [ 06/Mar/13 ] |
|
Nope. It hasn't hit recently as far as I know. |
| Comment by Peter Jones [ 06/Mar/13 ] |
|
ok then let's drop the priority for now and escalate if more information becomes available. |
| Comment by Christopher Morrone [ 07/Mar/13 ] |
|
Just got a report that a LAC node crashed on this yesterday. No crash dump. |
| Comment by Christopher Morrone [ 14/Mar/13 ] |
|
Hit this Oops again. We got a crash dump this time! I found that we died on line 89 here:
crash> gdb list *(0xd000000018395a50)
0xd000000018395a50 is in ll_sai_unplug (/usr/src/kernels/2.6.32-348.1chaos.bgq62.ppc64/include/linux/list.h:89).
84 * This is only for internal list manipulation where we know
85 * the prev/next entries already!
86 */
87 static inline void __list_del(struct list_head * prev, struct list_head * next)
88 {
89 next->prev = prev;
90 prev->next = next;
91 }
92
93 /**
Also, note that I see this on the console as part of the Oops output:
2013-03-13 19:34:20 Unable to handle kernel paging request for data at address 0x00000008
0x00000008 is probably the offset into the structure; NULL was passed in to __list_del in the "next" parameter. Matching the crash address to code and unwinding inlines, I believe this is the call stack where we are crashing:
__list_del
cfs_list_del_init
ll_sa_entry_unhash
do_sai_entry_fini
ll_sa_entry_fini
ll_sai_unplug
do_statahead_enter
ll_statahead_enter
ll_revalidate_it
ll_revalidate_nd
do_lookup
etc. |
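For reference, a minimal sketch of the layout reasoning above, assuming the stock 64-bit list_head definition: next sits at offset 0 and prev at offset 8, so handing __list_del() a NULL "next" makes the first store fault at address 0x8, which matches the faulting address 0x00000008 reported above.
struct list_head {
        struct list_head *next;   /* offset 0 */
        struct list_head *prev;   /* offset 8 on a 64-bit kernel */
};

static inline void __list_del(struct list_head *prev, struct list_head *next)
{
        next->prev = prev;        /* next == NULL => write to (void *)0x8 */
        prev->next = next;
}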
| Comment by Christopher Morrone [ 14/Mar/13 ] |
|
I believe the particular ll_sai_unplug() call that we are in under do_statahead_enter() is this one on line 1672:
1665                 if ((bits & MDS_INODELOCK_LOOKUP) &&
1666                     d_lustre_invalid(*dentryp))
1667                         d_lustre_revalidate(*dentryp);
1668                 ll_intent_release(&it);
1669                 }
1670         }
1671
1672         ll_sai_unplug(sai, entry);
1673         RETURN(rc);
1674 }
Further, the console shows:
2013-03-13 19:34:20 NIP [d000000018395a50] .ll_sai_unplug+0x210/0x860 [lustre]
2013-03-13 19:34:20 LR [d000000018395a40] .ll_sai_unplug+0x200/0x860 [lustre]
I've already shown that that seems to put us in __list_del. If I back the address up a bit to find out where we are, I get:
(gdb) list *(ll_sai_unplug+0x198)
0x859d8 is in ll_sai_unplug (/builddir/build/BUILD/lustre-2.3.58/lustre/llite/statahead.c:191).
186 /builddir/build/BUILD/lustre-2.3.58/lustre/llite/statahead.c: No such file or directory.
in /builddir/build/BUILD/lustre-2.3.58/lustre/llite/statahead.c
That turns out to be in the inline function is_omitted_entry(), which tells us that we are in ll_sa_entry_fini(), in either the second or third call to do_sai_entry_fini(). So let's back up a bit more:
(gdb) list *(ll_sai_unplug+0x192)
0x859d2 is in ll_sai_unplug (/builddir/build/BUILD/lustre-2.3.58/lustre/llite/statahead.c:374).
369 in /builddir/build/BUILD/lustre-2.3.58/lustre/llite/statahead.c
That puts us here:
373 /* drop old entry from sent list */
374 cfs_list_for_each_entry_safe(pos, next, &sai->sai_entries_sent,
375 se_list) {
376 if (is_omitted_entry(sai, pos->se_index))
377 do_sai_entry_fini(sai, pos);
378 else
379 break;
380 }
And then walking the address forward:
(gdb) list *(ll_sai_unplug+0x234)
0x85a74 is in ll_sai_unplug (/builddir/build/BUILD/lustre-2.3.58/lustre/llite/statahead.c:353).
348 /builddir/build/BUILD/lustre-2.3.58/lustre/llite/statahead.c: No such file or directory.
in /builddir/build/BUILD/lustre-2.3.58/lustre/llite/statahead.c
puts us on line 353:
346 static inline void
347 do_sai_entry_fini(struct ll_statahead_info *sai, struct ll_sa_entry *entry)
348 {
349 struct ll_inode_info *lli = ll_i2info(sai->sai_inode);
350
351 ll_sa_entry_unhash(sai, entry);
352
353 spin_lock(&lli->lli_sa_lock);
354 entry->se_stat = SA_ENTRY_DEST;
355 if (likely(!ll_sa_entry_unlinked(entry)))
356 cfs_list_del_init(&entry->se_list);
357 spin_unlock(&lli->lli_sa_lock);
358
359 ll_sa_entry_put(sai, entry);
360 }
So I think that means we are in fact in the cfs_list_del_init() that is under the ll_sa_entry_unhash() on line 351, and not the later cfs_list_del_init() at line 356. So if the addresses can be trusted, everything seems to point to the stack I gave in the previous comment, and we're in the loop dropping entries from the sent list. |
| Comment by Lai Siyao [ 15/Mar/13 ] |
|
The analysis looks reasonable, as it shows that sa_entry->se_hash.next is NULL; but in the life of an sa_entry, it is NULL only when the entry is just allocated, and at that moment it won't yet be found and used. I will review the related code later. |
| Comment by Lai Siyao [ 18/Mar/13 ] |
|
Chris, could you help analyze the crash dump to find whether an 'll_sa_<ls pid>' thread exists, and if so, use `bt <statahead_thread_pid>` to see its backtrace? |
| Comment by Christopher Morrone [ 19/Mar/13 ] |
|
Yes, there is an ll_sa_13026 thread. There is also an ll_agl_13026 thread. Not too much in the bt:
crash> bt 13027 -s -x
PID: 13027  TASK: c000000d089271d0  CPU: 5   COMMAND: "ll_sa_13026"
 #0 [c000000053857a90] .schedule+0x3f8 at c0000000005bb0a8
 #1 [c000000053857d90] .cfs_waitq_wait+0x10 at d00000000c962020 [libcfs]
 #2 [c000000053857e00] .ll_statahead_thread+0x5ec at d00000001839dc1c [lustre]
 #3 [c000000053857f90] .kernel_thread+0x54 at c000000000032fd4
crash> bt 13028 -s -x
PID: 13028  TASK: c0000000536fda50  CPU: 30  COMMAND: "ll_agl_13026"
 #0 [c0000000387dbae0] .schedule+0x3f8 at c0000000005bb0a8
 #1 [c0000000387dbde0] .cfs_waitq_wait+0x10 at d00000000c962020 [libcfs]
 #2 [c0000000387dbe50] .ll_agl_thread+0x550 at d00000001839f130 [lustre]
 #3 [c0000000387dbf90] .kernel_thread+0x54 at c000000000032fd4
The ll_sa_13026 thread appears to be in this l_wait_event:
keep_it:
l_wait_event(thread->t_ctl_waitq,
!sa_sent_full(sai) ||
!sa_received_empty(sai) ||
!agl_list_empty(sai) ||
!thread_is_running(thread),
&lwi);
interpret_it:
And ll_agl_13026 is in this l_wait_event:
while (1) {
l_wait_event(thread->t_ctl_waitq,
!agl_list_empty(sai) ||
!thread_is_running(thread),
&lwi);
if (!thread_is_running(thread))
|
| Comment by Lai Siyao [ 26/Mar/13 ] |
|
The culprit should be in ll_sa_entry_fini():
/* drop old entry from sent list */
cfs_list_for_each_entry_safe(pos, next, &sai->sai_entries_sent,
se_list) {
if (is_omitted_entry(sai, pos->se_index))
do_sai_entry_fini(sai, pos);
else
break;
}
/* drop old entry from stated list */
cfs_list_for_each_entry_safe(pos, next, &sai->sai_entries_stated,
se_list) {
if (is_omitted_entry(sai, pos->se_index))
do_sai_entry_fini(sai, pos);
else
break;
}
Iterating these two lists is not protected by ll_sai_lock. Although no entry will be removed in the meantime (because this is the only place entries are removed), an entry might be unlinked from the list and left dangling; thus, when iterating, both next and next->next may be the same entry, and one entry may be removed twice, which may cause the crash above. I'll commit a fix later. |
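A minimal sketch of one conventional way to close this kind of race, not the landed patch (linked in a later comment): take lli_sa_lock for the whole walk, move the stale entries onto a private list, and finish them outside the lock, so the iterator's cached next pointer can never dangle. sa_prune_sent_entries is a hypothetical helper; the other names come from the snippets quoted in this ticket, and CFS_LIST_HEAD/cfs_list_move_tail are assumed to be available libcfs list helpers in this tree.
/* Hypothetical sketch, not the landed patch. */
static void sa_prune_sent_entries(struct ll_statahead_info *sai)
{
        struct ll_inode_info *lli = ll_i2info(sai->sai_inode);
        struct ll_sa_entry *pos, *next;
        CFS_LIST_HEAD(zombies);

        /* Unlink stale entries while holding the lock, so no other
         * context can remove them (or our cached "next") concurrently. */
        spin_lock(&lli->lli_sa_lock);
        cfs_list_for_each_entry_safe(pos, next, &sai->sai_entries_sent,
                                     se_list) {
                if (!is_omitted_entry(sai, pos->se_index))
                        break;
                pos->se_stat = SA_ENTRY_DEST;
                cfs_list_move_tail(&pos->se_list, &zombies);
        }
        spin_unlock(&lli->lli_sa_lock);

        /* Finish the unlinked entries outside the lock. */
        cfs_list_for_each_entry_safe(pos, next, &zombies, se_list) {
                cfs_list_del_init(&pos->se_list);
                ll_sa_entry_unhash(sai, pos);
                ll_sa_entry_put(sai, pos);
        }
}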
| Comment by Christopher Morrone [ 26/Mar/13 ] |
|
Great work! |
| Comment by Lai Siyao [ 26/Mar/13 ] |
|
The patch is at http://review.whamcloud.com/#change,5842 |
| Comment by Peter Jones [ 30/Mar/13 ] |
|
Landed for 2.4 |