Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.4.0
    • Affects Version/s: Lustre 2.4.0
    • Environment: Sequoia LAC node, 2.3.54-2chaos (github.com/lustre/chaos)
    • 3
    • 5663

    Description

      We have companion nodes to Sequoia called "LAC nodes". LAC stands for "large application compile", or something like that. These nodes are NOT normal BG/Q architecture; instead they are regular large Power 7 architecture nodes. But otherwise they are similar to Sequoia I/O Nodes in that they are ppc64, use 64K pages, and run RHEL for the OS.

      seqlac2 crashed on 2012-11-19 16:18:40, hitting the following Oops:

      2012-11-19 16:18:40 Faulting instruction address: 0xd0000000139b6710
      2012-11-19 16:18:40 Oops: Kernel access of bad area, sig: 11 [#1]
      2012-11-19 16:18:40 SMP NR_CPUS=1024 NUMA pSeries
      2012-11-19 16:18:40 Modules linked in: xt_owner nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ptlrpc(U) obdclass(U) lvfs(U) nfs fscache lockd nfs_acl auth_rpcgss ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) sunrpc ipt_LOG xt_multiport iptable_filter ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa uinput raid1 ses enclosure sg mlx4_ib ib_mad ib_core mlx4_en mlx4_core e1000e ehea ext4 jbd2 mbcache raid456 async_pq async_xor xor async_raid6_recov raid6_pq async_memcpy async_tx sd_mod crc_t10dif ipr dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
      2012-11-19 16:18:40 NIP: d0000000139b6710 LR: d0000000139b6700 CTR: c0000000005be960
      2012-11-19 16:18:40 REGS: c000000f54aab080 TRAP: 0300   Not tainted  (2.6.32-279.11.1.1chaos.bgq62.ppc64)
      2012-11-19 16:18:40 MSR: 8000000000009032 <EE,ME,IR,DR>  CR: 24008442  XER: 20000000
      2012-11-19 16:18:40 DAR: 0000000000000008, DSISR: 0000000042000000
      2012-11-19 16:18:40 TASK = c000000f539312d0[10079] 'ls' THREAD: c000000f54aa8000 CPU: 32
      2012-11-19 16:18:40 GPR00: c00000025f6e5d90 c000000f54aab300 d000000013a0d878 c000000ecf32d120 
      2012-11-19 16:18:40 GPR04: 5a5a5a5a5a5a5a5a 0000000000000004 c0000005ff4e506c 0000000000000002 
      2012-11-19 16:18:40 GPR08: c000000000ff7500 0000000000000000 c000000001099004 0000000000000000 
      2012-11-19 16:18:40 GPR12: d0000000139d2610 c000000000ff7500 0000000010022ff8 00000000000000e8 
      2012-11-19 16:18:40 GPR16: c000000739a8f200 0000000000000000 0000000000000001 0000000000000003 
      2012-11-19 16:18:40 GPR20: c000000ecf32cee0 0000000000000000 c00000025f6e5d80 0000000000000000 
      2012-11-19 16:18:40 GPR24: c000000f45ae0000 c00000025f6e5d80 d00000000cecfdc0 d00000000cecfdc4 
      2012-11-19 16:18:40 GPR28: 0000000000003300 c000000ecf32cd80 d000000013a0aca8 c000000f54aab300 
      2012-11-19 16:18:40 NIP [d0000000139b6710] .ll_sai_unplug+0x210/0x860 [lustre]
      2012-11-19 16:18:40 LR [d0000000139b6700] .ll_sai_unplug+0x200/0x860 [lustre]
      2012-11-19 16:18:40 Call Trace:
      2012-11-19 16:18:40 [c000000f54aab300] [d0000000139b6700] .ll_sai_unplug+0x200/0x860 [lustre] (unreliable)
      2012-11-19 16:18:40 [c000000f54aab430] [d0000000139b7bc8] .do_statahead_enter+0x218/0x22b0 [lustre]
      2012-11-19 16:18:40 [c000000f54aab5f0] [d000000013933dec] .ll_revalidate_it+0x82c/0x21f0 [lustre]
      2012-11-19 16:18:40 [c000000f54aab750] [d00000001393595c] .ll_revalidate_nd+0x1ac/0x560 [lustre]
      2012-11-19 16:18:40 [c000000f54aab810] [c0000000001d55b0] .do_lookup+0xa0/0x2d0
      2012-11-19 16:18:40 [c000000f54aab8e0] [c0000000001d8c50] .__link_path_walk+0x9b0/0x15a0
      2012-11-19 16:18:40 [c000000f54aaba10] [c0000000001d9bf8] .path_walk+0x98/0x180
      2012-11-19 16:18:40 [c000000f54aabab0] [c0000000001d9eec] .do_path_lookup+0x7c/0xf0
      2012-11-19 16:18:40 [c000000f54aabb40] [c0000000001dac50] .user_path_at+0x60/0xb0
      2012-11-19 16:18:40 [c000000f54aabc90] [c0000000001ce144] .vfs_fstatat+0x44/0xb0
      2012-11-19 16:18:40 [c000000f54aabd30] [c0000000001ce314] .SyS_newlstat+0x24/0x50
      2012-11-19 16:18:40 [c000000f54aabe30] [c000000000008564] syscall_exit+0x0/0x40
      2012-11-19 16:18:40 Instruction dump:
      2012-11-19 16:18:40 81f90058 ea1d0000 79ef06e0 3a10ff08 39ef00e8 79e31764 7c7d1a14 4801bf15 
      2012-11-19 16:18:40 e8410028 e9390018 e9790010 38190010 <f92b0008> f9690000 f8190018 f8190010 
      

      Note that this node was hitting many order:1 page allocation failures that day, with the most recent one less than a minute before the Oops in Lustre.

      The attached file "console.seqlac2.txt" contains a more complete console log.

      Using gdb to look up the line number of ll_sai_unplug+0x200 revealed that it was at lustre/llite/statahead.c:133, which puts it in ll_sa_entry_unhash() at the first cfs_spin_lock(). So if functions were not inlined, the end of the backtrace would probably look like this:

      ll_sa_entry_unhash
      do_sai_entry_fini
      ll_sa_entry_fini
      ll_sai_unplug
      do_statahead_enter
      ll_statahead_enter
      ll_revalidate_it
      ll_revalidate_nd
      

      Because of the inlining, I can not be entirely certain which of the three do_sai_entry_fini() calls in ll_sa_entry_fini() this was.
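
      For reference, the symbol+offset lookup described above can be reproduced with gdb against a lustre.ko that still has its debuginfo (the module path below is hypothetical):

      $ gdb /path/to/lustre.ko
      (gdb) list *(ll_sai_unplug+0x200)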

    Attachments

    Issue Links

    Activity

    [LU-2388] Oops in ll_sai_unplug
            pjones Peter Jones added a comment -

            Landed for 2.4

            laisiyao Lai Siyao added a comment -

            patch is on http://review.whamcloud.com/#change,5842
            laisiyao Lai Siyao added a comment -

            The culprit should be in ll_sa_entry_fini():

                    /* drop old entry from sent list */
                    cfs_list_for_each_entry_safe(pos, next, &sai->sai_entries_sent,
                                                 se_list) {
                            if (is_omitted_entry(sai, pos->se_index))
                                    do_sai_entry_fini(sai, pos);
                            else
                                    break;
                    }
            
                    /* drop old entry from stated list */
                    cfs_list_for_each_entry_safe(pos, next, &sai->sai_entries_stated,
                                                 se_list) {
                            if (is_omitted_entry(sai, pos->se_index))
                                    do_sai_entry_fini(sai, pos);
                            else
                                    break;
                    }
            

            Iterating these two lists is not protected by ll_sai_lock. Although no entry will be removed concurrently (this is the only place entries are removed), an entry might be unlinked from a list and left dangling, so while iterating, both next and next->next may end up pointing at the same entry, and one entry may be removed twice, which could cause the crash above.

            I'll commit a fix later.
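
            To make that staleness concrete, here is a small user-space sketch (illustration only, not Lustre code; the list helpers below are minimal stand-ins for the kernel primitives) of how a "_safe" walker's cached next pointer goes stale when another context unlinks that entry:

            #include <stdio.h>
            #include <stddef.h>

            /* Minimal user-space stand-ins for the kernel list primitives. */
            struct list_head { struct list_head *next, *prev; };

            #define LIST_HEAD_INIT(name) { &(name), &(name) }

            static void list_add_tail(struct list_head *n, struct list_head *h)
            {
                    n->prev = h->prev;
                    n->next = h;
                    h->prev->next = n;
                    h->prev = n;
            }

            static void list_del_init(struct list_head *e)
            {
                    e->next->prev = e->prev;        /* __list_del() */
                    e->prev->next = e->next;
                    e->next = e->prev = e;          /* re-initialize */
            }

            struct entry { int se_index; struct list_head se_list; };

            #define entry_of(ptr) \
                    ((struct entry *)((char *)(ptr) - offsetof(struct entry, se_list)))

            int main(void)
            {
                    struct list_head sent = LIST_HEAD_INIT(sent);
                    struct entry a = { .se_index = 1 }, b = { .se_index = 2 };

                    list_add_tail(&a.se_list, &sent);
                    list_add_tail(&b.se_list, &sent);

                    /* A cfs_list_for_each_entry_safe()-style walker processing "a"
                     * has already cached next = a.se_list.next (that is, "b") ... */
                    struct list_head *next = a.se_list.next;

                    /* ... when another context unlinks "b". */
                    list_del_init(&b.se_list);

                    /* The walker still advances to the stale cached pointer and
                     * revisits "b" even though it is off the sent list, so the
                     * entry can be handed to the fini path a second time. */
                    printf("walker revisits entry %d; still linked? %s\n",
                           entry_of(next)->se_index,
                           next->next == next ? "no" : "yes");
                    return 0;
            }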


            morrone Christopher Morrone (Inactive) added a comment -

            Yes, there is an ll_sa_13026 thread. There is also an ll_agl_13026 thread. Not too much in the bt:

            crash> bt 13027 -s -x
            PID: 13027  TASK: c000000d089271d0  CPU: 5   COMMAND: "ll_sa_13026"
             #0 [c000000053857a90] .schedule+0x3f8 at c0000000005bb0a8
             #1 [c000000053857d90] .cfs_waitq_wait+0x10 at d00000000c962020 [libcfs]
             #2 [c000000053857e00] .ll_statahead_thread+0x5ec at d00000001839dc1c [lustre]
             #3 [c000000053857f90] .kernel_thread+0x54 at c000000000032fd4
            crash> bt 13028 -s -x
            PID: 13028  TASK: c0000000536fda50  CPU: 30  COMMAND: "ll_agl_13026"
             #0 [c0000000387dbae0] .schedule+0x3f8 at c0000000005bb0a8
             #1 [c0000000387dbde0] .cfs_waitq_wait+0x10 at d00000000c962020 [libcfs]
             #2 [c0000000387dbe50] .ll_agl_thread+0x550 at d00000001839f130 [lustre]
             #3 [c0000000387dbf90] .kernel_thread+0x54 at c000000000032fd4
            

            The ll_sa_13026 thread appears to be in this l_wait_event:

            keep_it:
                                    l_wait_event(thread->t_ctl_waitq,
                                                 !sa_sent_full(sai) ||
                                                 !sa_received_empty(sai) ||
                                                 !agl_list_empty(sai) ||
                                                 !thread_is_running(thread),
                                                 &lwi);
            
            interpret_it:
            

            And ll_agl_13026 is in this l_wait_event:

                    while (1) {
                            l_wait_event(thread->t_ctl_waitq,
                                         !agl_list_empty(sai) ||
                                         !thread_is_running(thread),
                                         &lwi);
            
                            if (!thread_is_running(thread))
            
            laisiyao Lai Siyao added a comment -

            Chris, could you help analyze the crash dump to find whether 'll_sa_<ls pid>' thread exists, and if yes, use `bt <statahead_thread_pid>` to see its backtrace?

            laisiyao Lai Siyao added a comment -

            The analysis looks reasonable: it shows that an sa_entry's se_hash.next is NULL, but in the life of an sa_entry it is NULL only when the entry has just been allocated, and at that moment the entry cannot yet be found and used. I will review the related code later.


            morrone Christopher Morrone (Inactive) added a comment -

            I believe the particular ll_sai_unplug() call that we are in under do_statahead_enter() is this one on line 1672:

              1665                                  if ((bits & MDS_INODELOCK_LOOKUP) &&
              1666                                      d_lustre_invalid(*dentryp))
              1667                                          d_lustre_revalidate(*dentryp);
              1668                                  ll_intent_release(&it);
              1669                          }
              1670                  }
              1671  
              1672                  ll_sai_unplug(sai, entry);
              1673                  RETURN(rc);
              1674          }
            

            Further, the console shows:

            2013-03-13 19:34:20 NIP [d000000018395a50] .ll_sai_unplug+0x210/0x860 [lustre]
            2013-03-13 19:34:20 LR [d000000018395a40] .ll_sai_unplug+0x200/0x860 [lustre]
            

            I've already shown that seems to put us in __list_del. If I back the address up a bit to find out where we are, I get:

            (gdb) list *(ll_sai_unplug+0x198)
            0x859d8 is in ll_sai_unplug (/builddir/build/BUILD/lustre-2.3.58/lustre/llite/statahead.c:191).
            186     /builddir/build/BUILD/lustre-2.3.58/lustre/llite/statahead.c: No such file or directory.
                    in /builddir/build/BUILD/lustre-2.3.58/lustre/llite/statahead.c
            

            That turns out to be in the inline function is_omitted_entry(), which tells us that we are in ll_sa_entry_fini(), in either the second or the third call to do_sai_entry_fini(). So let's back up a bit more:

            (gdb) list *(ll_sai_unplug+0x192)
            0x859d2 is in ll_sai_unplug (/builddir/build/BUILD/lustre-2.3.58/lustre/llite/statahead.c:374).
            369     in /builddir/build/BUILD/lustre-2.3.58/lustre/llite/statahead.c
            

            That puts us here:

               373          /* drop old entry from sent list */
               374          cfs_list_for_each_entry_safe(pos, next, &sai->sai_entries_sent,
               375                                       se_list) {
               376                  if (is_omitted_entry(sai, pos->se_index))
               377                          do_sai_entry_fini(sai, pos);
               378                  else
               379                          break;
               380          }
            

            And then walking the address forward

            (gdb) list *(ll_sai_unplug+0x234)
            0x85a74 is in ll_sai_unplug (/builddir/build/BUILD/lustre-2.3.58/lustre/llite/statahead.c:353).
            348     /builddir/build/BUILD/lustre-2.3.58/lustre/llite/statahead.c: No such file or directory.
                    in /builddir/build/BUILD/lustre-2.3.58/lustre/llite/statahead.c
            

            puts us on line 353:

              346  static inline void
               347  do_sai_entry_fini(struct ll_statahead_info *sai, struct ll_sa_entry *entry)
               348  {
               349          struct ll_inode_info *lli = ll_i2info(sai->sai_inode);
               350  
               351          ll_sa_entry_unhash(sai, entry);
               352  
               353          spin_lock(&lli->lli_sa_lock);
               354          entry->se_stat = SA_ENTRY_DEST;
               355          if (likely(!ll_sa_entry_unlinked(entry)))
               356                  cfs_list_del_init(&entry->se_list);
               357          spin_unlock(&lli->lli_sa_lock);
               358  
               359          ll_sa_entry_put(sai, entry);
               360  }
            

            So I think that means we are in fact in the cfs_list_del_init() that is under the ll_sa_entry_unhash() on line 351, and not the later cfs_list_del_init() at line 356.

            So if the addresses can be trusted, everything seems to point to the stack I gave in the previous comment, and we're in the loop dropping entries from the sent list.


            morrone Christopher Morrone (Inactive) added a comment - edited

            Hit this Oops again. We got a crash dump this time!

            I found that we died on line 89 here:

            crash> gdb list *(0xd000000018395a50)
            0xd000000018395a50 is in ll_sai_unplug (/usr/src/kernels/2.6.32-348.1chaos.bgq62.ppc64/include/linux/list.h:89).
            84       * This is only for internal list manipulation where we know
            85       * the prev/next entries already!
            86       */
            87      static inline void __list_del(struct list_head * prev, struct list_head * next)
            88      {
            89              next->prev = prev;
            90              prev->next = next;
            91      }
            92
            93      /**
            

            Also, note that I see this on the console as part of the Oops output:

            2013-03-13 19:34:20 Unable to handle kernel paging request for data at address 0x00000008
            

            0x00000008 is probably the offset into the structure; NULL was passed to __list_del() in the "next" parameter.
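
            A quick user-space check (assuming the same two-pointer struct list_head layout as the kernel) shows why a NULL "next" faults at exactly offset 0x8:

            #include <stdio.h>
            #include <stddef.h>

            /* Same layout as the kernel's struct list_head: next first, prev
             * second, so prev sits at offset 8 on an LP64 target.  With
             * next == NULL, the store "next->prev = prev" in __list_del()
             * writes to address 0x8, matching the faulting address above. */
            struct list_head { struct list_head *next, *prev; };

            int main(void)
            {
                    printf("offsetof(struct list_head, prev) = %zu\n",
                           offsetof(struct list_head, prev));  /* prints 8 */
                    return 0;
            }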

            Matching the crash address to code and unwinding inlines, I believe this is the call stack where we are crashing:

            __list_del
            cfs_list_del_init
            ll_sa_entry_unhash
            do_sai_entry_fini
            ll_sa_entry_fini
            ll_sai_unplug
            do_statahead_enter
            ll_statahead_enter
            ll_revalidate_it
            ll_revalidate_nd
            do_lookup
            etc.
            

            morrone Christopher Morrone (Inactive) added a comment -

            Just got a report that a LAC node crashed on this yesterday. No crash dump.


    People

      Assignee: laisiyao Lai Siyao
      Reporter: morrone Christopher Morrone (Inactive)
      Votes: 0
      Watchers: 7
