[LU-13454] NULL dereference in lnet_health_check lnet_incr_hstats Created: 15/Apr/20 Updated: 28/Jun/21 Resolved: 23/Apr/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0, Lustre 2.12.4 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Chris Horn | Assignee: | Chris Horn |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The LNet messages used for replies to optimized GETs, created via lnet_create_reply_msg(), are only ever committed for rx. As such, their msg_txni and msg_txpeer fields are NULL. lnet_incr_hstats() does not account for this situation, so when it is passed one of these messages it attempts to dereference a NULL pointer.
[534987.484660] LNet: 33866:0:(o2iblnd_cb.c:2081:kiblnd_close_conn_locked()) Closing conn to 10.16.100.20@o2ib: error 0(sending)(sending_nocred)(waiting)
[534987.500344] LustreError: 166827:0:(events.c:453:server_bulk_callback()) event type 3, status -103, desc ffff89b65f3aa600
[534987.500406] LNetError: 166825:0:(lib-msg.c:479:lnet_handle_local_failure()) ni 10.16.100.55@o2ib added to recovery queue. Health = 900
[534987.500412] LustreError: 166825:0:(events.c:453:server_bulk_callback()) event type 5, status -103, desc ffff89b65494ba00
[534987.500416] LustreError: 166825:0:(events.c:453:server_bulk_callback()) event type 5, status -103, desc ffff89e751f58800
[534987.500460] BUG: unable to handle kernel NULL pointer dereference at 00000000000000ec
[534987.500498] IP: [<ffffffffc0ea5889>] lnet_finalize+0xb99/0xdc0 [lnet]
[534987.500499] PGD 0
[534987.500501] Oops: 0002 [#1] SMP
[534987.500532] Modules linked in: osd_zfs(OE) mdt(OE) mdd(OE) lod(OE) mgs(OE) osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) raid5_pd(POE) raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 ext4 mbcache jbd2 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack xt_multiport iptable_filter xt_CT nf_conntrack libcrc32c iptable_raw dm_service_time dm_multipath mst_pciconf(OE) mlx4_ib(OE) mlx4_en(OE) mlx4_core(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) sd_mod crc_t10dif crct10dif_generic sg ib_umad(OE) ib_ipoib(OE) ib_cm(OE) zfs(POE) zunicode(POE) zlua(POE) edac_mce_amd kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel
[534987.500557] lrw gf128mul glue_helper ablk_helper cryptd zcommon(POE) znvpair(POE) mlx5_ib(OE) zavl(POE) pcspkr icp(POE) ib_uverbs(OE) spl(OE) ib_core(OE) ast mlx5_core(OE) ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm drm_panel_orientation_quirks mlx_compat(OE) dm_mod mlxfw devlink mpt3sas(OE) raid_class scsi_transport_sas i2c_piix4 i2c_designware_platform i2c_designware_core pinctrl_amd acpi_cpufreq ip_tables nfsv3 nfs_acl nfs lockd grace fscache team_mode_activebackup team crct10dif_pclmul crct10dif_common crc32c_intel igb i2c_algo_bit dca ptp pps_core nvme nvme_core nfit libnvdimm sunrpc bonding ipmi_si ipmi_devintf ipmi_msghandler [last unloaded: libcfs]
[534987.500560] CPU: 10 PID: 166825 Comm: kiblnd_connd Kdump: loaded Tainted: P W OE ------------ 3.10.0-957.1.3957.1.3.x4.1.6.x86_64 #1
[534987.500561] Hardware name: Viking Enterprise Solutions VSSEP1EA/VSSEP1EA, BIOS 10.01 03/04/2020
[534987.500562] task: ffff89baf1de1040 ti: ffff89b659a70000 task.ti: ffff89b659a70000
[534987.500576] RIP: 0010:[<ffffffffc0ea5889>] [<ffffffffc0ea5889>] lnet_finalize+0xb99/0xdc0 [lnet]
[534987.500577] RSP: 0018:ffff89b659a73cf0 EFLAGS: 00010293
[534987.500578] RAX: ffff89bb25bdfa80 RBX: ffff89adb1978898 RCX: 0000000000000000
[534987.500579] RDX: 0000000000000000 RSI: ffffffffc0ea5889 RDI: ffff89bb2be8ae00
[534987.500580] RBP: ffff89b659a73d40 R08: 0000000000000000 R09: 00000001804a0018
[534987.500581] R10: 000000008c5e9601 R11: fffff16efe317a00 R12: 00000000ffffff99
[534987.500582] R13: 0000000000000000 R14: 0000000000000005 R15: 0000000000000000
[534987.500583] FS: 0000000000000000(0000) GS:ffff89cb2ee80000(0000) knlGS:0000000000000000
[534987.500584] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[534987.500585] CR2: 00000000000000ec CR3: 0000000fd52f6000 CR4: 0000000000340fe0
[534987.500586] Call Trace:
[534987.500601] [<ffffffffc0d1fd22>] ? kiblnd_pool_free_node+0x82/0x170 [ko2iblnd]
[534987.500609] [<ffffffffc0d296dd>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
[534987.500616] [<ffffffffc0d299fb>] kiblnd_txlist_done+0x4b/0x60 [ko2iblnd]
[534987.500624] [<ffffffffc0d2f05d>] kiblnd_abort_txs+0xed/0x240 [ko2iblnd]
[534987.500631] [<ffffffffc0d2f243>] kiblnd_finalise_conn+0x93/0x120 [ko2iblnd]
[534987.500637] [<ffffffffc0d336f1>] kiblnd_connd+0x251/0xa00 [ko2iblnd]
[534987.500642] [<ffffffffa1cd6b10>] ? wake_up_state+0x20/0x20
[534987.500649] [<ffffffffc0d334a0>] ? kiblnd_cm_callback+0x2380/0x2380 [ko2iblnd]
[534987.500651] [<ffffffffa1cc1f81>] kthread+0xd1/0xe0
[534987.500653] [<ffffffffa1cc1eb0>] ? insert_kthread_work+0x40/0x40
[534987.500657] [<ffffffffa2377c1d>] ret_from_fork_nospec_begin+0x7/0x21
[534987.500659] [<ffffffffa1cc1eb0>] ? insert_kthread_work+0x40/0x40
[534987.500677] Code: c0 e8 cc df ed ff f0 ff 82 e8 00 00 00 83 40 58 01 48 8b 3d ca 17 04 00 31 f6 e8 83 60 ef ff 0f b6 43 6d 83 e0 01 e9 c8 f5 ff ff <f0> ff 82 ec 00 00 00 83 40 5c 01 eb d9 f0 ff 82 e4 00 00 00 83
[534987.500689] RIP [<ffffffffc0ea5889>] lnet_finalize+0xb99/0xdc0 [lnet]
[534987.500690] RSP <ffff89b659a73cf0>
[534987.500691] CR2: 00000000000000ec |
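The failure mode is simple to state: a reply message created by lnet_create_reply_msg() is committed only for rx, so its msg_txni and msg_txpeer pointers remain NULL, and any stats path that dereferences them unconditionally oopses. The sketch below is not the landed patch from https://review.whamcloud.com/38237; it is a minimal user-space illustration in which the struct layouts and the names ni_stats, peer_stats, and incr_hstats are hypothetical simplifications of the real LNet types, showing the kind of NULL guard that avoids the crash.

```c
/*
 * Minimal user-space sketch (not the landed patch) of the failure mode
 * and a NULL guard that avoids it.  The types below are simplified
 * stand-ins for the real lnet_msg/lnet_ni/lnet_peer_ni definitions.
 */
#include <stdio.h>
#include <stddef.h>

struct ni_stats   { int local_timeout; };   /* stand-in for lnet_ni health stats   */
struct peer_stats { int remote_timeout; };  /* stand-in for lnet_peer_ni stats     */

struct msg {
	struct ni_stats   *msg_txni;    /* NULL for rx-only reply messages */
	struct peer_stats *msg_txpeer;  /* NULL for rx-only reply messages */
};

/* Increment health stats only when the tx-side objects actually exist. */
static void incr_hstats(struct msg *m)
{
	if (m->msg_txni)            /* guard: replies from lnet_create_reply_msg() */
		m->msg_txni->local_timeout++;
	if (m->msg_txpeer)
		m->msg_txpeer->remote_timeout++;
}

int main(void)
{
	/* A reply message committed only for rx: both tx fields are NULL. */
	struct msg reply = { .msg_txni = NULL, .msg_txpeer = NULL };

	incr_hstats(&reply);  /* without the guards this would dereference NULL */
	printf("survived stats update for rx-only message\n");
	return 0;
}
```

Guarding on the tx-side pointers keeps the rx-only reply path cheap while leaving normal tx accounting unchanged.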
| Comments |
| Comment by Gerrit Updater [ 15/Apr/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/38237 |
| Comment by Michael Ethier (Inactive) [ 18/Apr/20 ] |
|
Hi, we have 8 LNet routers and they have been rebooting randomly with the exact same stack trace as in this ticket. I thought that if I disabled LNet "health" with the following two lines it would fix the problem, but the reboots are still occurring.
lnetctl set retry_count 0
Has it been verified that this patch fixes the problem? Does the version of MLNX OFED matter or not? Thanks, |
| Comment by Chris Horn [ 21/Apr/20 ] |
|
The patch fixes the problem identified by the ticket. AFAIK, the version of MLNX OFED does not have an impact on this issue. LNet Health cannot really be disabled. You can disable the retry and recovery mechanisms via those two tunables, but you cannot disable the statistics gathering, response tracking, etc. Note that this bug only occurs when certain network messages fail. As such, if you are experiencing this issue regularly, that may indicate some underlying problem with your network. |
| Comment by Michael Ethier (Inactive) [ 21/Apr/20 ] |
|
Thanks Chris for your feedback. I will have to dig into our network like you said. |
| Comment by Gerrit Updater [ 23/Apr/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38237/ |
| Comment by Peter Jones [ 23/Apr/20 ] |
|
Landed for 2.14 |