[LU-8022] LNet: BUG: unable to handle kernel NULL pointer dereference Created: 14/Apr/16 Updated: 27/Nov/19 Resolved: 31/May/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Frank Heckes (Inactive) | Assignee: | Doug Oucharek (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | soak | ||
| Environment: |
lola |
||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
The error happened during soak testing of build '20160413' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160413). DNE is enabled. OSTs have been formatted using zfs, MDTs using ldiskfs. OSS and MDT nodes are configured in an HA active-active failover configuration.

During system boot, an MDS node that had been restarted crashed with the following error during LNet initialization:

Lustre: Lustre: Build Version: 2.8.51_28_gba2ac35
LNetError: 3247:0:(o2iblnd_cb.c:2310:kiblnd_passive_connect()) Can't accept conn from 192.168.1.108@o2ib10 on NA (ib0:0:192.168.1.110): bad dst nid 192.168.1.110@o2ib10
BUG: unable to handle kernel NULL pointer dereference LNet: Added LNI 192.168.1.110@o2ib10 [8/256/0/180] at 0000000000000080
IP: [<ffffffffa0b861e6>] kiblnd_passive_connect+0x466/0x17e0 [ko2iblnd]
PGD 839067067 PUD 8383e1067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/module/lnet/initstate
CPU 0
Modules linked in: ko2iblnd(U) ptlrpc(+)(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) sb_edac edac_core joydev lpc_ich mfd_core i2c_i801 ioatdma sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t10dif ahci wmi isci libsas mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod scsi_dh_rdac [last unloaded: scsi_wait_scan]
Pid: 3247, comm: ib_cm/0 Tainted: P -- ------------ 2.6.32-573.22.1.el6_lustre.x86_64 #1 Intel Corporation SandyBridge Platform/To be filled by O.E.M.
RIP: 0010:[<ffffffffa0b861e6>] [<ffffffffa0b861e6>] kiblnd_passive_connect+0x466/0x17e0 [ko2iblnd]
RSP: 0018:ffff8804318e7b20 EFLAGS: 00010246
RAX: 0000000000000008 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000012
RBP: ffff8804318e7be0 R08: 000000000001b9c2 R09: 00000000fffffffb
R10: 0000000000000003 R11: 0000000000000000 R12: ffff880835a6dc20
R13: ffffffffa0b92263 R14: ffff880432df7800 R15: ffffffffa06a1020
FS: 0000000000000000(0000) GS:ffff880038600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000080 CR3: 0000000835bd0000 CR4: 00000000000407f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ib_cm/0 (pid: 3247, threadinfo ffff8804318e4000, task ffff880431015520)
Stack:
 ffff880835a6dc20 ffffffffa06a1020 0000000000000004 ffff8808369be5d0
<d> ffff8804318e7b80 00000000814e7731 0005000ac0a8016c ffff880800000012
<d> 000300120be91b91 0000000000000000 0000100000000008 ffffffffa011bcbc
Call Trace:
 [<ffffffffa011bcbc>] ? ib_find_cached_gid+0xec/0x110 [ib_core]
 [<ffffffffa0b87c3d>] kiblnd_cm_callback+0x6dd/0x20e0 [ko2iblnd]
 [<ffffffffa034a011>] cma_req_handler+0x371/0x640 [rdma_cm]
 [<ffffffffa011692b>] ? rdma_port_get_link_layer+0x1b/0x60 [ib_core]
 [<ffffffffa0322b27>] cm_process_work+0x27/0x110 [ib_cm]
 [<ffffffffa0323735>] cm_req_handler+0x6b5/0xac0 [ib_cm]
 [<ffffffffa0324140>] ? cm_work_handler+0x0/0x1206 [ib_cm]
 [<ffffffffa0324275>] cm_work_handler+0x135/0x1206 [ib_cm]
 [<ffffffffa0324140>] ? cm_work_handler+0x0/0x1206 [ib_cm]
 [<ffffffff8109ab40>] worker_thread+0x170/0x2a0
 [<ffffffff810a1820>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8109a9d0>] ? worker_thread+0x0/0x2a0
 [<ffffffff810a138e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
Code: e8 90 ff a6 ff 0f b7 95 78 ff ff ff 8b bd 78 ff ff ff 48 89 de 66 89 55 84 e8 27 46 00 00 83 bd 78 ff ff ff 11 66 89 45 90 74 10 <48> 8b 83 80 00 00 00 8b 50 1c 85 d2 89 d0 75 05 b8 00 01 00 00
RIP [<ffffffffa0b861e6>] kiblnd_passive_connect+0x466/0x17e0 [ko2iblnd]
 RSP <ffff8804318e7b20>
CR2: 0000000000000080
---[ end trace 01db8c57e9900e3f ]---
Kernel panic - not syncing: Fatal exception
Pid: 3247, comm: ib_cm/0 Tainted: P D -- ------------ 2.6.32-573.22.1.el6_lustre.x86_64 #1
Call Trace:
 [<ffffffff815394d1>] ? panic+0xa7/0x16f
 [<ffffffff8153e2d4>] ? oops_end+0xe4/0x100
 [<ffffffff8104e8cb>] ? no_context+0xfb/0x260
 [<ffffffff8104eb55>] ? __bad_area_nosemaphore+0x125/0x1e0
 [<ffffffff8104ec23>] ? bad_area_nosemaphore+0x13/0x20
 [<ffffffff8104f31c>] ? __do_page_fault+0x30c/0x500
 [<ffffffff81336a9f>] ? extract_buf+0x9f/0x130
 [<ffffffff815401fe>] ? do_page_fault+0x3e/0xa0
 [<ffffffff8153d5a5>] ? page_fault+0x25/0x30
 [<ffffffffa0b861e6>] ? kiblnd_passive_connect+0x466/0x17e0 [ko2iblnd]
 [<ffffffffa011bcbc>] ? ib_find_cached_gid+0xec/0x110 [ib_core]
 [<ffffffffa0b87c3d>] ? kiblnd_cm_callback+0x6dd/0x20e0 [ko2iblnd]
 [<ffffffffa034a011>] ? cma_req_handler+0x371/0x640 [rdma_cm]
 [<ffffffffa011692b>] ? rdma_port_get_link_layer+0x1b/0x60 [ib_core]
 [<ffffffffa0322b27>] ? cm_process_work+0x27/0x110 [ib_cm]
 [<ffffffffa0323735>] ? cm_req_handler+0x6b5/0xac0 [ib_cm]
 [<ffffffffa0324140>] ? cm_work_handler+0x0/0x1206 [ib_cm]
 [<ffffffffa0324275>] ? cm_work_handler+0x135/0x1206 [ib_cm]
 [<ffffffffa0324140>] ? cm_work_handler+0x0/0x1206 [ib_cm]
 [<ffffffff8109ab40>] ? worker_thread+0x170/0x2a0
 [<ffffffff810a1820>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8109a9d0>] ? worker_thread+0x0/0x2a0
 [<ffffffff810a138e>] ? kthread+0x9e/0xc0
 [<ffffffff8100c28a>] ? child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20

Unfortunately, no crash dump was written. The only error message available was extracted from the console log of the affected node (lola-10). |
| Comments |
| Comment by Joseph Gmitter (Inactive) [ 15/Apr/16 ] |
|
Hi Doug, Could you have a look at this? Thanks. |
| Comment by Doug Oucharek (Inactive) [ 15/Apr/16 ] |
|
Ok, I see two problems here: 1- The network interface (NI) for the IB card seems to have "disappeared", almost as if the device ib0 went down and was removed from our list of available NIs. However, we still received a connection request to that NI, and that is the failure being reported in the error log. I am going to use this ticket to fix problem 2 (dereferencing the NULL NI pointer) so we don't crash. I don't have enough info to address the first problem, so that will have to wait until it is reproduced with this fix in place to prevent a core dump. |
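For illustration, the kind of guard described for problem 2 can be sketched as below. In the register dump in the description, RBX is 0 and CR2 is 0000000000000080, which is consistent with reading a field at offset 0x80 through a NULL pointer. This is only a minimal sketch under stated assumptions: the structures, field names, default value, and function names (`demo_ni`, `demo_tunables`, `demo_queue_depth`, `demo_passive_connect`) are hypothetical stand-ins, not the real LNet or ko2iblnd definitions, and the code is not the content of patch 19614.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-NI tunables block (not the LNet type). */
struct demo_tunables {
    char pad[0x1c];
    int  peer_credits;              /* hypothetical field */
};

/* Hypothetical stand-in for a network interface descriptor. */
struct demo_ni {
    char pad[0x80];                 /* padding so the pointer sits at 0x80 */
    struct demo_tunables *tunables;
};

/* With ni == NULL, reading ni->tunables would fault at address 0x80. */
static_assert(offsetof(struct demo_ni, tunables) == 0x80,
              "tunables pointer is expected at offset 0x80");

#define DEMO_DEFAULT_QUEUE_DEPTH 256    /* assumed fallback value */

/*
 * Return the queue depth for this NI, or -1 when the NI (or its tunables)
 * is missing, so the caller can reject the request instead of crashing.
 */
static int demo_queue_depth(const struct demo_ni *ni)
{
    if (ni == NULL || ni->tunables == NULL)
        return -1;

    return ni->tunables->peer_credits != 0 ?
           ni->tunables->peer_credits : DEMO_DEFAULT_QUEUE_DEPTH;
}

/*
 * Sketch of a passive-connect path: a request whose destination NI was
 * not found is rejected with -1 rather than dereferencing a NULL pointer.
 */
static int demo_passive_connect(const struct demo_ni *ni)
{
    int depth = demo_queue_depth(ni);

    if (depth < 0)
        return -1;                  /* reject: no matching NI */

    return depth;                   /* continue with a valid NI */
}
```

With this shape, a connection request that arrives for an NI that is no longer in the list is rejected cleanly, which leaves the underlying question of why the NI went missing (problem 1) observable on a node that stays up.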
| Comment by Gerrit Updater [ 15/Apr/16 ] |
|
Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: http://review.whamcloud.com/19614 |
| Comment by Frank Heckes (Inactive) [ 27/Apr/16 ] |
|
The patch has been included in build '20160427' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160427) and will be verified in the soak test session associated with this build. |
| Comment by Frank Heckes (Inactive) [ 09/May/16 ] |
|
In the soak test session for build '20160427', which includes patch 1 of #19614, the error has not occurred again. The soak session has now been running for 10 days. |
| Comment by Matt Ezell [ 24/May/16 ] |
|
We hit this today on our LNET routers when upgrading a cluster to 2.8 with |
| Comment by Gerrit Updater [ 31/May/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19614/ |
| Comment by Peter Jones [ 31/May/16 ] |
|
Landed for 2.9 |
| Comment by Gerrit Updater [ 27/Jun/16 ] |
|
Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: http://review.whamcloud.com/21001 |
| Comment by Gerrit Updater [ 05/Jul/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21001/ |