Details
Bug
Resolution: Fixed
Major
Lustre 2.11.0, Lustre 2.12.0, Lustre 2.13.0
Description
Reproducer:
[root@cent75build01 ~]# cat /etc/modprobe.d/lnet.conf
options lnet networks=tcp
options lnet forwarding=enabled
[root@cent75build01 ~]# modprobe lnet
[root@cent75build01 ~]# lctl net up
LNET configured
[root@cent75build01 ~]# while true; do lnetctl export > /dev/null; echo "still alive"; done
still alive
still alive
still alive
still alive
still alive
still alive
still alive
Write failed: Broken pipe
[root@control01 ~]# ssh cent75build01
Last login: Tue Apr 2 22:01:13 2019 from 192.168.1.10
[root@cent75build01 ~]# cd /var/crash/127.0.0.1-2019-04-02-22:02:31
[root@cent75build01 127.0.0.1-2019-04-02-22:02:31]# tail --lines 36 vmcore-dmesg.txt
[ 156.209529] BUG: unable to handle kernel paging request at 0000007a00000002
[ 156.209598] IP: [<ffffffffa53fae34>] kmem_cache_alloc+0x74/0x1f0
[ 156.209648] PGD 800000081d0c4067 PUD 0
[ 156.209672] Oops: 0000 [#1] SMP
[ 156.209695] Modules linked in: ksocklnd(OE) lnet(OE) libcfs(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) ptp pps_core mlx4_ib(OE) ib_core(OE) mlx4_core(OE) mlx_compat(OE) devlink sb_edac coretemp iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev vmw_balloon pcspkr joydev sg vmw_vmci parport_pc parport shpchp i2c_piix4 binfmt_misc ip_tables ext4 mbcache jbd2 sr_mod sd_mod cdrom crc_t10dif crct10dif_generic ata_generic pata_acpi vmwgfx drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_piix mptspi drm scsi_transport_spi crct10dif_pclmul crct10dif_common mptscsih crc32c_intel libata serio_raw mptbase vmxnet3
[ 156.210184] i2c_core floppy dm_mirror dm_region_hash dm_log dm_mod
[ 156.210239] CPU: 25 PID: 2169 Comm: lnetctl Kdump: loaded Tainted: G OE ------------ 3.10.0-862.14.4.el7.x86_64 #1
[ 156.210289] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/30/2013
[ 156.210335] task: ffff97c376ba0fd0 ti: ffff97c377424000 task.ti: ffff97c377424000
[ 156.210368] RIP: 0010:[<ffffffffa53fae34>] [<ffffffffa53fae34>] kmem_cache_alloc+0x74/0x1f0
[ 156.210409] RSP: 0018:ffff97c377427d10 EFLAGS: 00010282
[ 156.210433] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000004993
[ 156.210465] RDX: 0000000000004992 RSI: 00000000000080d0 RDI: ffff97c0bfc03700
[ 156.210495] RBP: ffff97c377427d40 R08: 000000000001bb00 R09: ffffffffa54217ec
[ 156.210526] R10: 8080808080808080 R11: 0000000000000000 R12: 0000007a00000002
[ 156.210557] R13: 00000000000080d0 R14: ffff97c0bfc03700 R15: ffff97c0bfc03700
[ 156.210588] FS: 0000000000000000(0000) GS:ffff97c77fc40000(0000) knlGS:0000000000000000
[ 156.210623] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 156.210649] CR2: 0000007a00000002 CR3: 00000008322c6000 CR4: 00000000000407e0
[ 156.210728] Call Trace:
[ 156.210756] [<ffffffffa54217ec>] ? get_empty_filp+0x5c/0x1a0
[ 156.210786] [<ffffffffa54217ec>] get_empty_filp+0x5c/0x1a0
[ 156.210817] [<ffffffffa543019d>] path_openat+0x4d/0x640
[ 156.210846] [<ffffffffa53c8544>] ? handle_pte_fault+0x2f4/0xd10
[ 156.211731] [<ffffffffa5431dbd>] do_filp_open+0x4d/0xb0
[ 156.212562] [<ffffffffa53caefd>] ? handle_mm_fault+0x39d/0x9b0
[ 156.213420] [<ffffffffa543f167>] ? __alloc_fd+0x47/0x170
[ 156.214258] [<ffffffffa541e0d7>] do_sys_open+0x137/0x240
[ 156.215070] [<ffffffffa59256d5>] ? system_call_after_swapgs+0xa2/0x146
[ 156.215865] [<ffffffffa541e1fe>] SyS_open+0x1e/0x20
[ 156.216661] [<ffffffffa592579b>] system_call_fastpath+0x22/0x27
[ 156.217457] [<ffffffffa59256e1>] ? system_call_after_swapgs+0xae/0x146
[ 156.218236] Code: 63 c1 5a 49 8b 50 08 4d 8b 20 49 8b 40 10 4d 85 e4 0f 84 28 01 00 00 48 85 c0 0f 84 1f 01 00 00 49 63 46 20 48 8d 4a 01 4d 8b 06 <49> 8b 1c 04 4c 89 e0 65 49 0f c7 08 0f 94 c0 84 c0 74 ba 49 63
[ 156.220699] RIP [<ffffffffa53fae34>] kmem_cache_alloc+0x74/0x1f0
[ 156.221468] RSP <ffff97c377427d10>
[ 156.222214] CR2: 0000007a00000002
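The reproducer loop runs until the node goes down, which makes it awkward to use as a pass/fail check on a patched build. A bounded variant might look like the sketch below. The helper name run_repro and the iteration count are illustrative and not part of the original report. Only the "lnetctl export" command itself comes from the reproducer.

```shell
#!/bin/sh
# Hypothetical helper (not from the report): run a command a fixed number
# of times, discarding its stdout as the original reproducer loop does.
# Usage: run_repro COUNT CMD [ARGS...]
run_repro() {
    count=$1
    shift
    i=0
    while [ "$i" -lt "$count" ]; do
        # Stop early and report if the command itself fails.
        "$@" > /dev/null || { echo "command failed after $i iterations"; return 1; }
        i=$((i + 1))
        echo "still alive ($i/$count)"
    done
    echo "completed $count iterations"
}

# On a test node with LNet configured as in the report
# (this can crash an unpatched node, so it is left commented out):
# run_repro 100 lnetctl export
```

In the session above the node died after seven iterations, so a count in the hundreds should be a reasonable smoke test, although the report does not say how many iterations are needed in general.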
The problem is in lustre_lnet_show_routing(). I verified this by applying the following patch:
diff --git a/lnet/utils/lnetctl.c b/lnet/utils/lnetctl.c
index c503223..62d34bb 100644
--- a/lnet/utils/lnetctl.c
+++ b/lnet/utils/lnetctl.c
@@ -1550,13 +1550,6 @@ static int jt_export(int argc, char **argv)
 		err_rc = NULL;
 	}
 
-	rc = lustre_lnet_show_routing(-1, &show_rc, &err_rc, backup);
-	if (rc != LUSTRE_CFG_RC_NO_ERR) {
-		cYAML_print_tree2file(stderr, err_rc);
-		cYAML_free_tree(err_rc);
-		err_rc = NULL;
-	}
-
 	rc = lustre_lnet_show_peer(NULL, 2, -1, &show_rc, &err_rc, backup);
 	if (rc != LUSTRE_CFG_RC_NO_ERR) {
 		cYAML_print_tree2file(stderr, err_rc);
With that patch applied the node does not crash.
I checked master, Lustre 2.12.0 and Lustre 2.11.0 and the problem exists in all those versions.
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34922/
Subject: LU-12152 lnet: Cleanup lnet_get_rtr_pool_cfg
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: b5cbe49a16b68ad60a8e7293d1b5450e0f97a430
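To check whether a local b2_12 checkout already includes this commit, something like the following sketch could be used. The helper name has_commit and the checkout path are illustrative. The hash is the b2_12 commit listed above, and the same change on other branches may carry a different hash.

```shell
#!/bin/sh
# Hypothetical helper (not from the report): succeed if COMMIT is an
# ancestor of HEAD in the git checkout at REPO.
# Usage: has_commit REPO COMMIT
has_commit() {
    repo=$1
    commit=$2
    # merge-base --is-ancestor exits 0 when COMMIT is reachable from HEAD,
    # and non-zero when it is not reachable or is unknown to this repository.
    git -C "$repo" merge-base --is-ancestor "$commit" HEAD 2>/dev/null
}

# Example, assuming a lustre-release checkout at ./lustre-release:
# has_commit ./lustre-release b5cbe49a16b68ad60a8e7293d1b5450e0f97a430 \
#     && echo "fix present" || echo "fix not found"
```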