Details
-
Bug
-
Resolution: Fixed
-
Major
-
Lustre 2.7.0, Lustre 2.8.0, Lustre 2.5.4
-
3
-
15902
Description
After OSTs were umounted, when running lustre_rmmod, two OSSs crashed with same reason.
<4>Lustre: Failing over lustre-OST0019
<4>Lustre: Skipped 2 previous similar messages
<4>Lustre: server umount lustre-OST0019 complete
<4>Lustre: Skipped 2 previous similar messages
<4>Lustre: 28870:0:(client.c:1907:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1404434289/real 1404434289] req@ffff881ff8a42c00 x1472650587225264/t0(0) o251->MGC10.0.10.151@o2ib@10.0.10.151@o2ib:26/25 lens 224/224 e 0 to 1 dl 1404434295 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
<4>Lustre: server umount lustre-OST0014 complete
<4>Lustre: Skipped 4 previous similar messages
<3>LNetError: 32984:0:(lib-move.c:1937:lnet_parse()) 10.0.8.230@o2ib, src 10.0.8.230@o2ib: Dropping PUT (error -108 looking up sender)
<1>BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
<1>IP: [<ffffffffa09e2409>] kiblnd_pool_alloc_node+0x49/0x2a0 [ko2iblnd]
<4>PGD 0
<4>Oops: 0000 1 SMP
<4>last sysfs file: /sys/devices/pci0000:40/0000:40:03.0/0000:42:00.0/host7/target7:0:0/7:0:0:66/state
<4>CPU 0
<4>Modules linked in: jbd2 ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) dm_round_robin autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_srp(U) scsi_transport_srp(U) 8021q garp stp llc acpi_cpufreq freq_table mperf rdma_ucm(U) ib_ucm(U) rdma_cm(U) iw_cm(U) ib_ipoib(U) ib_cm(U) ib_uverbs(U) ib_umad(U) mlx5_ib(U) mlx5_core(U) mlx4_en(U) mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) ipv6 mlx4_core(U) compat(U) dm_multipath vhost_net macvtap macvlan tun kvm knem(U) uinput microcode iTCO_wdt iTCO_vendor_support dcdbas power_meter ses enclosure sg shpchp tg3 ptp pps_core lpc_ich mfd_core ext3 jbd mbcache sr_mod cdrom sd_mod crc_t10dif ahci wmi megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: fld]
<4>
<4>Pid: 32983, comm: kiblnd_sd_00_02 Not tainted 2.6.32-431.17.1.el6_lustre.2.5.18.ddn2.x86_64 #1 Dell Inc. PowerEdge R620/01W23F
<4>RIP: 0010:[<ffffffffa09e2409>] [<ffffffffa09e2409>] kiblnd_pool_alloc_node+0x49/0x2a0 [ko2iblnd]
<4>RSP: 0018:ffff88100bfa5a30 EFLAGS: 00010207
<4>RAX: 0000000000000000 RBX: ffff88202619ad80 RCX: 000000000000003f
<4>RDX: 0000000000000010 RSI: 0000000000000002 RDI: ffff88202619ad80
<4>RBP: ffff88100bfa5a70 R08: 4730000000000000 R09: 3980000000000000
<4>R10: 0000000000000000 R11: 0000000000000000 R12: ffff88202619adb0
<4>R13: ffff88100bfa5a38 R14: ffff88202619ad90 R15: 0000000000000012
<4>FS: 0000000000000000(0000) GS:ffff880061c00000(0000) knlGS:0000000000000000
<4>CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>CR2: 0000000000000010 CR3: 0000001025add000 CR4: 00000000001407f0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process kiblnd_sd_00_02 (pid: 32983, threadinfo ffff88100bfa4000, task ffff88100a87cae0)
<4>Stack:
<4> 00000000000007ed 0000000000000000 ffff88100bfa5a70 ffff88100b7874f0
<4><d> ffff880ff19b2ea8 ffff880ff19b2eb8 ffff88100d33c680 0000000000000012
<4><d> ffff88100bfa5a90 ffffffffa09ed149 ffff88100b382000 ffff880ff19b2e00
<4>Call Trace:
<4> [<ffffffffa09ed149>] kiblnd_get_idle_tx+0x29/0x2c0 [ko2iblnd]
<4> [<ffffffffa09f0765>] kiblnd_check_sends+0x425/0x610 [ko2iblnd]
<4> [<ffffffffa09f2d7e>] kiblnd_post_rx+0x15e/0x3b0 [ko2iblnd]
<4> [<ffffffffa09f30e6>] kiblnd_recv+0x116/0x560 [ko2iblnd]
<4> [<ffffffffa057ae9b>] lnet_ni_recv+0xbb/0x320 [lnet]
<4> [<ffffffffa057bcc1>] lnet_drop_message+0x81/0xa0 [lnet]
<4> [<ffffffffa05805a2>] lnet_parse+0x1b2/0x18c0 [lnet]
<4> [<ffffffff8106a39b>] ? enqueue_task_fair+0xfb/0x100
<4> [<ffffffff81061b6e>] ? try_to_wake_up+0x24e/0x3e0
<4> [<ffffffffa03a3bf4>] ? mlx5_ib_poll_cq+0x1b4/0xbf0 [mlx5_ib]
<4> [<ffffffffa09f37fb>] kiblnd_handle_rx+0x2cb/0x640 [ko2iblnd]
<4> [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
<4> [<ffffffffa09f44e3>] kiblnd_rx_complete+0x2d3/0x420 [ko2iblnd]
<4> [<ffffffffa09f4692>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
<4> [<ffffffffa09f4a4a>] kiblnd_scheduler+0x33a/0x7b0 [ko2iblnd]
<4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa09f4710>] ? kiblnd_scheduler+0x0/0x7b0 [ko2iblnd]
<4> [<ffffffff8109ab56>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109aac0>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>Code: 44 00 00 4c 8d 67 30 4c 8d 77 10 4c 8d 6d c8 48 89 fb 48 89 df e8 48 86 b4 e0 48 8b 43 30 4c 39 e0 48 89 45 c8 74 77 48 8d 50 10 <48> 39 50 10 74 61 83 40 30 01 48 8b 15 a6 b4 22 e1 48 8b 45 c8
<1>RIP [<ffffffffa09e2409>] kiblnd_pool_alloc_node+0x49/0x2a0 [ko2iblnd]
<4> RSP <ffff88100bfa5a30>
<pre>
<4>CR2: 0000000000000010
</pre>
Following is some output of crash command:
crash /usr/lib/debug/lib/modules/2.6.32-431.17.1.el6_lustre.2.5.18.ddn2.x86_64/vmlinux /var/crash/127.0.0.1-2014-07-04-09\:38\:48/vmcore
crash> mod -s ko2iblnd
crash> disas kiblnd_pool_alloc_node
Dump of assembler code for function kiblnd_pool_alloc_node:
0xffffffffa09db3c0 <+0>: push %rbp
0xffffffffa09db3c1 <+1>: mov %rsp,%rbp
0xffffffffa09db3c4 <+4>: sub $0x40,%rsp
0xffffffffa09db3c8 <+8>: mov %rbx,-0x28(%rbp)
0xffffffffa09db3cc <+12>: mov %r12,-0x20(%rbp)
0xffffffffa09db3d0 <+16>: mov %r13,-0x18(%rbp)
0xffffffffa09db3d4 <+20>: mov %r14,-0x10(%rbp)
0xffffffffa09db3d8 <+24>: mov %r15,-0x8(%rbp)
0xffffffffa09db3dc <+28>: nopl 0x0(%rax,%rax,1)
0xffffffffa09db3e1 <+33>: lea 0x30(%rdi),%r12
0xffffffffa09db3e5 <+37>: lea 0x10(%rdi),%r14
0xffffffffa09db3e9 <+41>: lea -0x38(%rbp),%r13
0xffffffffa09db3ed <+45>: mov %rdi,%rbx
0xffffffffa09db3f0 <+48>: mov %rbx,%rdi
0xffffffffa09db3f3 <+51>: callq 0xffffffff8152aa40 <_spin_lock>
0xffffffffa09db3f8 <+56>: mov 0x30(%rbx),%rax
0xffffffffa09db3fc <+60>: cmp %r12,%rax
0xffffffffa09db3ff <+63>: mov %rax,-0x38(%rbp)
0xffffffffa09db403 <+67>: je 0xffffffffa09db47c <kiblnd_pool_alloc_node+188>
0xffffffffa09db405 <+69>: lea 0x10(%rax),%rdx
0xffffffffa09db409 <+73>: cmp %rdx,0x10(%rax)
– Crash here
0xffffffffa09db40d <+77>: je 0xffffffffa09db470 <kiblnd_pool_alloc_node+176>
0xffffffffa09db40f <+79>: addl $0x1,0x30(%rax)
0xffffffffa09db413 <+83>: mov -0x1edcdb5a(%rip),%rdx # 0xffffffff81c0d8c0
0xffffffffa09db41a <+90>: mov -0x38(%rbp),%rax
0xffffffffa09db41e <+94>: add $0x493e0,%rdx
0xffffffffa09db425 <+101>: mov %rdx,0x28(%rax)
0xffffffffa09db429 <+105>: mov -0x38(%rbp),%rax
0xffffffffa09db42d <+109>: mov 0x10(%rax),%r12
0xffffffffa09db431 <+113>: mov %r12,%rdi
0xffffffffa09db434 <+116>: callq 0xffffffff81294630 <list_del>
0xffffffffa09db439 <+121>: mov 0x78(%rbx),%rax
0xffffffffa09db43d <+125>: test %rax,%rax
0xffffffffa09db440 <+128>: je 0xffffffffa09db44b <kiblnd_pool_alloc_node+139>
0xffffffffa09db442 <+130>: mov %r12,%rsi
0xffffffffa09db445 <+133>: mov -0x38(%rbp),%rdi
0xffffffffa09db449 <+137>: callq *%rax
0xffffffffa09db44b <+139>: incw (%rbx)
0xffffffffa09db44e <+142>: data32 xchg %ax,%ax
0xffffffffa09db451 <+145>: mov %r12,%rax
0xffffffffa09db454 <+148>: mov -0x28(%rbp),%rbx
0xffffffffa09db458 <+152>: mov -0x20(%rbp),%r12
0xffffffffa09db45c <+156>: mov -0x18(%rbp),%r13
0xffffffffa09db460 <+160>: mov -0x10(%rbp),%r14
0xffffffffa09db464 <+164>: mov -0x8(%rbp),%r15
0xffffffffa09db468 <+168>: leaveq
0xffffffffa09db469 <+169>: retq
0xffffffffa09db46a <+170>: nopw 0x0(%rax,%rax,1)
0xffffffffa09db470 <+176>: mov (%rax),%rax
0xffffffffa09db473 <+179>: cmp %r12,%rax
0xffffffffa09db476 <+182>: mov %rax,-0x38(%rbp)
0xffffffffa09db47a <+186>: jne 0xffffffffa09db405 <kiblnd_pool_alloc_node+69>
0xffffffffa09db47c <+188>: mov 0x58(%rbx),%r15d
0xffffffffa09db480 <+192>: test %r15d,%r15d
0xffffffffa09db483 <+195>: jne 0xffffffffa09db558 <kiblnd_pool_alloc_node+408>
0xffffffffa09db489 <+201>: mov -0x1edcdbd0(%rip),%rax # 0xffffffff81c0d8c0
0xffffffffa09db490 <+208>: cmp 0x50(%rbx),%rax
0xffffffffa09db494 <+212>: js 0xffffffffa09db645 <kiblnd_pool_alloc_node+645>
0xffffffffa09db49a <+218>: movl $0x1,0x58(%rbx)
0xffffffffa09db4a1 <+225>: incw (%rbx)
0xffffffffa09db4a4 <+228>: data32 xchg %ax,%ax
0xffffffffa09db4a7 <+231>: testb $0x2,-0x4deee9(%rip) # 0xffffffffa04fc5c5
0xffffffffa09db4ae <+238>: je 0xffffffffa09db510 <kiblnd_pool_alloc_node+336>
0xffffffffa09db4b0 <+240>: testb $0x8,-0x4deef6(%rip) # 0xffffffffa04fc5c1
0xffffffffa09db4b7 <+247>: je 0xffffffffa09db510 <kiblnd_pool_alloc_node+336>
0xffffffffa09db4b9 <+249>: mov %r14,%rdx
0xffffffffa09db4bc <+252>: mov $0xffffffffa09f42c8,%rsi
0xffffffffa09db4c3 <+259>: mov $0xffffffffa0a01760,%rdi
0xffffffffa09db4ca <+266>: xor %eax,%eax
0xffffffffa09db4cc <+268>: movl $0x800,0x2629a(%rip) # 0xffffffffa0a01770 <msgdata.69576+16>
0xffffffffa09db4d6 <+278>: movq $0xffffffffa09f3528,0x2627f(%rip) # 0xffffffffa0a01760 <msgdata.69576>
0xffffffffa09db4e1 <+289>: movq $0xffffffffa09f2660,0x2627c(%rip) # 0xffffffffa0a01768 <msgdata.69576+8>
0xffffffffa09db4ec <+300>: movl $0x763,0x2627e(%rip) # 0xffffffffa0a01774 <msgdata.69576+20>
0xffffffffa09db4f6 <+310>: movq $0x0,0x2627f(%rip) # 0xffffffffa0a01780 <msgdata.69576+32>
0xffffffffa09db501 <+321>: movl $0x200,0x2626d(%rip) # 0xffffffffa0a01778 <msgdata.69576+24>
0xffffffffa09db50b <+331>: callq 0xffffffffa04db8d0
0xffffffffa09db510 <+336>: mov 0x5c(%rbx),%esi
0xffffffffa09db513 <+339>: mov %r13,%rdx
0xffffffffa09db516 <+342>: mov %rbx,%rdi
0xffffffffa09db519 <+345>: callq *0x68(%rbx)
0xffffffffa09db51c <+348>: mov %rbx,%rdi
0xffffffffa09db51f <+351>: mov %eax,%r15d
0xffffffffa09db522 <+354>: callq 0xffffffff8152aa40 <_spin_lock>
0xffffffffa09db527 <+359>: test %r15d,%r15d
0xffffffffa09db52a <+362>: movl $0x0,0x58(%rbx)
0xffffffffa09db531 <+369>: jne 0xffffffffa09db5d8 <kiblnd_pool_alloc_node+536>
0xffffffffa09db537 <+375>: mov 0x38(%rbx),%rsi
0xffffffffa09db53b <+379>: mov -0x38(%rbp),%rdi
0xffffffffa09db53f <+383>: mov %r12,%rdx
0xffffffffa09db542 <+386>: callq 0xffffffff812946d0 <__list_add>
0xffffffffa09db547 <+391>: incw (%rbx)
0xffffffffa09db54a <+394>: data32 xchg %ax,%ax
0xffffffffa09db54d <+397>: jmpq 0xffffffffa09db3f0 <kiblnd_pool_alloc_node+48>
0xffffffffa09db552 <+402>: nopw 0x0(%rax,%rax,1)
0xffffffffa09db558 <+408>: incw (%rbx)
0xffffffffa09db55b <+411>: data32 xchg %ax,%ax
0xffffffffa09db55e <+414>: testb $0x2,-0x4defa0(%rip) # 0xffffffffa04fc5c5
0xffffffffa09db565 <+421>: je 0xffffffffa09db5c7 <kiblnd_pool_alloc_node+519>
0xffffffffa09db567 <+423>: testb $0x8,-0x4defad(%rip) # 0xffffffffa04fc5c1
0xffffffffa09db56e <+430>: je 0xffffffffa09db5c7 <kiblnd_pool_alloc_node+519>
0xffffffffa09db570 <+432>: mov %r14,%rdx
0xffffffffa09db573 <+435>: mov $0xffffffffa09f4280,%rsi
0xffffffffa09db57a <+442>: mov $0xffffffffa0a017a0,%rdi
0xffffffffa09db581 <+449>: xor %eax,%eax
0xffffffffa09db583 <+451>: movl $0x800,0x26223(%rip) # 0xffffffffa0a017b0 <msgdata.69574+16>
0xffffffffa09db58d <+461>: movq $0xffffffffa09f3528,0x26208(%rip) # 0xffffffffa0a017a0 <msgdata.69574>
0xffffffffa09db598 <+472>: movq $0xffffffffa09f2660,0x26205(%rip) # 0xffffffffa0a017a8 <msgdata.69574+8>
0xffffffffa09db5a3 <+483>: movl $0x755,0x26207(%rip) # 0xffffffffa0a017b4 <msgdata.69574+20>
0xffffffffa09db5ad <+493>: movq $0x0,0x26208(%rip) # 0xffffffffa0a017c0 <msgdata.69574+32>
0xffffffffa09db5b8 <+504>: movl $0x200,0x261f6(%rip) # 0xffffffffa0a017b8 <msgdata.69574+24>
0xffffffffa09db5c2 <+514>: callq 0xffffffffa04db8d0
0xffffffffa09db5c7 <+519>: callq 0xffffffff81527c70 <schedule>
0xffffffffa09db5cc <+524>: jmpq 0xffffffffa09db3f0 <kiblnd_pool_alloc_node+48>
0xffffffffa09db5d1 <+529>: nopl 0x0(%rax)
0xffffffffa09db5d8 <+536>: mov -0x1edcdd1f(%rip),%rax # 0xffffffff81c0d8c0
0xffffffffa09db5df <+543>: mov %r14,%rdx
0xffffffffa09db5e2 <+546>: mov $0xffffffffa09f4248,%rsi
0xffffffffa09db5e9 <+553>: mov $0xffffffffa0a01720,%rdi
0xffffffffa09db5f0 <+560>: add $0x3e8,%rax
0xffffffffa09db5f6 <+566>: mov %rax,0x50(%rbx)
0xffffffffa09db5fa <+570>: xor %eax,%eax
0xffffffffa09db5fc <+572>: movl $0x800,0x2612a(%rip) # 0xffffffffa0a01730 <msgdata.69578+16>
0xffffffffa09db606 <+582>: movq $0xffffffffa09f3528,0x2610f(%rip) # 0xffffffffa0a01720 <msgdata.69578>
0xffffffffa09db611 <+593>: movq $0xffffffffa09f2660,0x2610c(%rip) # 0xffffffffa0a01728 <msgdata.69578+8>
0xffffffffa09db61c <+604>: movl $0x76e,0x2610e(%rip) # 0xffffffffa0a01734 <msgdata.69578+20>
0xffffffffa09db626 <+614>: movq $0xffffffffa0a01750,0x2610f(%rip) # 0xffffffffa0a01740 <msgdata.69578+32>
0xffffffffa09db631 <+625>: movl $0x20000,0x260fd(%rip) # 0xffffffffa0a01738 <msgdata.69578+24>
0xffffffffa09db63b <+635>: callq 0xffffffffa04db8d0
0xffffffffa09db640 <+640>: jmpq 0xffffffffa09db547 <kiblnd_pool_alloc_node+391>
0xffffffffa09db645 <+645>: incw (%rbx)
0xffffffffa09db648 <+648>: data32 xchg %ax,%ax
0xffffffffa09db64b <+651>: xor %r12d,%r12d
0xffffffffa09db64e <+654>: jmpq 0xffffffffa09db451 <kiblnd_pool_alloc_node+145>
End of assembler dump.
It seems ps->ps_pool_list was broken since it had a NULL entry.