[LU-8022] LNet: BUG: unable to handle kernel NULL pointer dereference Created: 14/Apr/16  Updated: 27/Nov/19  Resolved: 31/May/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Blocker
Reporter: Frank Heckes (Inactive) Assignee: Doug Oucharek (Inactive)
Resolution: Fixed Votes: 0
Labels: soak
Environment:

lola
build: https://build.hpdd.intel.com/job/lustre-master/3346


Attachments: File lola-10.log.bz2    
Issue Links:
Duplicate
Related
is related to LU-7101 Lnet: Support per NI map-on-demand Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The error occurred during soak testing of build '20160413' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160413). DNE is enabled. OSTs were formatted using ZFS, MDTs using ldiskfs. OSS and MDT nodes are configured in an HA active-active failover configuration.

During system boot, an MDS node that had been restarted crashed with the following error during LNet initialization:

Lustre: Lustre: Build Version: 2.8.51_28_gba2ac35
LNetError: 3247:0:(o2iblnd_cb.c:2310:kiblnd_passive_connect()) Can't accept conn from 192.168.1.108@o2ib10 on NA (ib0:0:192.168.1.110): bad dst nid 192.168.1.110@o2ib10
BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
LNet: Added LNI 192.168.1.110@o2ib10 [8/256/0/180]
IP: [<ffffffffa0b861e6>] kiblnd_passive_connect+0x466/0x17e0 [ko2iblnd]
PGD 839067067 PUD 8383e1067 PMD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/module/lnet/initstate
CPU 0 
Modules linked in: ko2iblnd(U) ptlrpc(+)(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm dm_round_robin dm_multipath microcode iTCO_wdt iTCO_vendor_support zfs(P)(U) zcommon(P)(U) znvpair(P)(U) spl(U) zlib_deflate zavl(P)(U) zunicode(P)(U) sb_edac edac_core joydev lpc_ich mfd_core i2c_i801 ioatdma sg igb dca i2c_algo_bit i2c_core ext3 jbd mbcache sd_mod crc_t10dif ahci wmi isci libsas mpt2sas scsi_transport_sas raid_class mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core dm_mirror dm_region_hash dm_log dm_mod scsi_dh_rdac [last unloaded: scsi_wait_scan]

Pid: 3247, comm: ib_cm/0 Tainted: P           -- ------------    2.6.32-573.22.1.el6_lustre.x86_64 #1 Intel Corporation SandyBridge Platform/To be filled by O.E.M.
RIP: 0010:[<ffffffffa0b861e6>]  [<ffffffffa0b861e6>] kiblnd_passive_connect+0x466/0x17e0 [ko2iblnd]
RSP: 0018:ffff8804318e7b20  EFLAGS: 00010246
RAX: 0000000000000008 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000012
RBP: ffff8804318e7be0 R08: 000000000001b9c2 R09: 00000000fffffffb
R10: 0000000000000003 R11: 0000000000000000 R12: ffff880835a6dc20
R13: ffffffffa0b92263 R14: ffff880432df7800 R15: ffffffffa06a1020
FS:  0000000000000000(0000) GS:ffff880038600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000080 CR3: 0000000835bd0000 CR4: 00000000000407f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ib_cm/0 (pid: 3247, threadinfo ffff8804318e4000, task ffff880431015520)
Stack:
 ffff880835a6dc20 ffffffffa06a1020 0000000000000004 ffff8808369be5d0
<d> ffff8804318e7b80 00000000814e7731 0005000ac0a8016c ffff880800000012
<d> 000300120be91b91 0000000000000000 0000100000000008 ffffffffa011bcbc
Call Trace:
 [<ffffffffa011bcbc>] ? ib_find_cached_gid+0xec/0x110 [ib_core]
 [<ffffffffa0b87c3d>] kiblnd_cm_callback+0x6dd/0x20e0 [ko2iblnd]
 [<ffffffffa034a011>] cma_req_handler+0x371/0x640 [rdma_cm]
 [<ffffffffa011692b>] ? rdma_port_get_link_layer+0x1b/0x60 [ib_core]
 [<ffffffffa0322b27>] cm_process_work+0x27/0x110 [ib_cm]
 [<ffffffffa0323735>] cm_req_handler+0x6b5/0xac0 [ib_cm]
 [<ffffffffa0324140>] ? cm_work_handler+0x0/0x1206 [ib_cm]
 [<ffffffffa0324275>] cm_work_handler+0x135/0x1206 [ib_cm]
 [<ffffffffa0324140>] ? cm_work_handler+0x0/0x1206 [ib_cm]
 [<ffffffff8109ab40>] worker_thread+0x170/0x2a0
 [<ffffffff810a1820>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8109a9d0>] ? worker_thread+0x0/0x2a0
 [<ffffffff810a138e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
Code: e8 90 ff a6 ff 0f b7 95 78 ff ff ff 8b bd 78 ff ff ff 48 89 de 66 89 55 84 e8 27 46 00 00 83 bd 78 ff ff ff 11 66 89 45 90 74 10 <48> 8b 83 80 00 00 00 8b 50 1c 85 d2 89 d0 75 05 b8 00 01 00 00 
RIP  [<ffffffffa0b861e6>] kiblnd_passive_connect+0x466/0x17e0 [ko2iblnd]
 RSP <ffff8804318e7b20>
CR2: 0000000000000080
---[ end trace 01db8c57e9900e3f ]---
Kernel panic - not syncing: Fatal exception
Pid: 3247, comm: ib_cm/0 Tainted: P      D    -- ------------    2.6.32-573.22.1.el6_lustre.x86_64 #1
Call Trace:
 [<ffffffff815394d1>] ? panic+0xa7/0x16f
 [<ffffffff8153e2d4>] ? oops_end+0xe4/0x100
 [<ffffffff8104e8cb>] ? no_context+0xfb/0x260
 [<ffffffff8104eb55>] ? __bad_area_nosemaphore+0x125/0x1e0
 [<ffffffff8104ec23>] ? bad_area_nosemaphore+0x13/0x20
 [<ffffffff8104f31c>] ? __do_page_fault+0x30c/0x500
 [<ffffffff81336a9f>] ? extract_buf+0x9f/0x130
 [<ffffffff815401fe>] ? do_page_fault+0x3e/0xa0
 [<ffffffff8153d5a5>] ? page_fault+0x25/0x30
 [<ffffffffa0b861e6>] ? kiblnd_passive_connect+0x466/0x17e0 [ko2iblnd]
 [<ffffffffa011bcbc>] ? ib_find_cached_gid+0xec/0x110 [ib_core]
 [<ffffffffa0b87c3d>] ? kiblnd_cm_callback+0x6dd/0x20e0 [ko2iblnd]
 [<ffffffffa034a011>] ? cma_req_handler+0x371/0x640 [rdma_cm]
 [<ffffffffa011692b>] ? rdma_port_get_link_layer+0x1b/0x60 [ib_core]
 [<ffffffffa0322b27>] ? cm_process_work+0x27/0x110 [ib_cm]
 [<ffffffffa0323735>] ? cm_req_handler+0x6b5/0xac0 [ib_cm]
 [<ffffffffa0324140>] ? cm_work_handler+0x0/0x1206 [ib_cm]
 [<ffffffffa0324275>] ? cm_work_handler+0x135/0x1206 [ib_cm]
 [<ffffffffa0324140>] ? cm_work_handler+0x0/0x1206 [ib_cm]
 [<ffffffff8109ab40>] ? worker_thread+0x170/0x2a0
 [<ffffffff810a1820>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8109a9d0>] ? worker_thread+0x0/0x2a0
 [<ffffffff810a138e>] ? kthread+0x9e/0xc0
 [<ffffffff8100c28a>] ? child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20

Unfortunately, no crash dump was written. The only error output available was extracted from the console log of the affected node (lola-10), so only that MDS console log has been attached.
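
The register dump above is consistent with a load through a NULL base pointer: CR2 is 0x80, RBX is 0, and the faulting bytes marked in the Code line (`48 8b 83 80 00 00 00`) decode to `mov 0x80(%rbx),%rax`. The C sketch below only illustrates that arithmetic. The struct and its member names are hypothetical; the sole value taken from the oops is the 0x80 offset, and nothing here claims to match the real LNet structure layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: only the 0x80 offset is taken from the oops. */
struct sketch_ni {
    unsigned char earlier_fields[0x80]; /* placeholder for whatever precedes */
    void *field_at_0x80;                /* the member the faulting load reads */
};

/* Address that a read of field_at_0x80 touches for a given base pointer. */
static uintptr_t sketch_fault_address(uintptr_t base)
{
    return base + offsetof(struct sketch_ni, field_at_0x80);
}
```

With a base of 0, `sketch_fault_address` yields 0x80, which is the address reported in CR2.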



 Comments   
Comment by Joseph Gmitter (Inactive) [ 15/Apr/16 ]

Hi Doug,

Could you have a look at this?

Thanks.
Joe

Comment by Doug Oucharek (Inactive) [ 15/Apr/16 ]

Ok, I see two problems here:

1- The network interface (NI) for the IB card seems to have "disappeared", almost as if the device ib0 went down and was removed from our list of available NIs. However, we still received a connection request for that NI, and that is the failure reported in the error log.
2- The failure path then tries to dereference the NULL NI pointer, which causes the crash. That, of course, must be fixed.

I am going to use this ticket to fix problem 2 (dereferencing the NULL NI pointer) so we don't crash. I don't have enough information to address the first problem, so that will have to wait until it is reproduced with this fix in place to prevent the crash.
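
As an illustration of problem 2, here is a minimal C sketch of a failure path that tolerates a NULL NI. It is not the patch itself: every type, function, and constant below (`demo_ni`, `demo_tunables`, `demo_build_reject`, the default of 8) is a hypothetical stand-in, and the only idea taken from the analysis above is that the rejection must be built without assuming the NI lookup succeeded.

```c
#include <stddef.h>

/* Hypothetical stand-ins, not the real LNet structures. */
struct demo_tunables { int queue_depth; };
struct demo_ni { struct demo_tunables *tunables; };
struct demo_reject { int reason; int queue_depth; };

enum {
    DEMO_REJECT_FATAL = 1,        /* placeholder reason code */
    DEMO_DEFAULT_QUEUE_DEPTH = 8  /* placeholder default value */
};

/* Queue depth to advertise; falls back to a default when no NI was found. */
static int demo_queue_depth(const struct demo_ni *ni)
{
    if (ni == NULL || ni->tunables == NULL)
        return DEMO_DEFAULT_QUEUE_DEPTH;
    return ni->tunables->queue_depth;
}

/* Failure path: fill in the rejection without dereferencing a NULL NI. */
static struct demo_reject demo_build_reject(const struct demo_ni *ni)
{
    struct demo_reject rej;

    rej.reason = DEMO_REJECT_FATAL;
    rej.queue_depth = demo_queue_depth(ni);
    return rej;
}
```

Calling `demo_build_reject(NULL)` returns a rejection carrying the default depth instead of faulting, while a valid NI still reports its own configured value.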

Comment by Gerrit Updater [ 15/Apr/16 ]

Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: http://review.whamcloud.com/19614
Subject: LU-8022 lnet: Don't access NULL NI on failure path
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7163df3dcd22199609539530a6a761acc6fd689e

Comment by Frank Heckes (Inactive) [ 27/Apr/16 ]

The patch has been included in build '20160427' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160427) and will be verified in the soak test session associated with this build.

Comment by Frank Heckes (Inactive) [ 09/May/16 ]

In the soak test session for build '20160427', which includes patch set 1 of #19614, the error has not recurred. The soak run has lasted 10 days so far.

Comment by Matt Ezell [ 24/May/16 ]

We hit this today on our LNET routers when upgrading a cluster to 2.8 with LU-7101. Router pinger messages come in before all the NIs have been added, causing this failure.
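
The ordering described here can be sketched as a lookup that misses while the NI list is still being populated. This is a single-threaded C illustration under stated assumptions: the table, the `boot_*` names, and the use of a plain integer as the NID key are all hypothetical and do not reflect how LNet stores or locks its NI list.

```c
#include <stddef.h>

#define BOOT_MAX_NIS 4

/* Hypothetical NI record; nid is an opaque illustrative key. */
struct boot_ni { unsigned long nid; };

struct boot_ni_table {
    struct boot_ni nis[BOOT_MAX_NIS];
    size_t count;
};

/* Register an NI; returns 0 on success, -1 if the table is full. */
static int boot_add_ni(struct boot_ni_table *t, unsigned long nid)
{
    if (t->count >= BOOT_MAX_NIS)
        return -1;
    t->nis[t->count++].nid = nid;
    return 0;
}

/* Look up an NI by destination NID; NULL if it has not been added yet. */
static const struct boot_ni *boot_find_ni(const struct boot_ni_table *t,
                                          unsigned long nid)
{
    for (size_t i = 0; i < t->count; i++)
        if (t->nis[i].nid == nid)
            return &t->nis[i];
    return NULL;
}

/* Returns 0 to accept the request, -1 to reject it cleanly. */
static int boot_handle_request(const struct boot_ni_table *t,
                               unsigned long dst_nid)
{
    return boot_find_ni(t, dst_nid) != NULL ? 0 : -1;
}
```

A request for a NID that arrives before `boot_add_ni` has run is rejected, and the same request is accepted once the NI is registered.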

Comment by Gerrit Updater [ 31/May/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19614/
Subject: LU-8022 lnet: Don't access NULL NI on failure path
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f5c7fec23cb26219d959290a4a311119747cc609

Comment by Peter Jones [ 31/May/16 ]

Landed for 2.9

Comment by Gerrit Updater [ 27/Jun/16 ]

Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: http://review.whamcloud.com/21001
Subject: LU-8022 lnet: Correct position of lnet_ni_decref()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 046a485e69dc879bf112690c1434dee86292554b

Comment by Gerrit Updater [ 05/Jul/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21001/
Subject: LU-8022 lnet: Correct position of lnet_ni_decref()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e8278552cfbcf518209a38f82548a16833686ae9

Generated at Sat Feb 10 02:13:57 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.