Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
Lustre 2.12.8
-
None
-
3
-
9223372036854775807
Description
Found by hornc. When discovery is off, LNetPrimaryNID() can enter into a loop discovering a peer with no LNet module loaded.
Attachments
Issue Links
Activity
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38890/
Subject: LU-12424 lnet: prevent loop in LNetPrimaryNID()
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 7cc0b292ecdbc578b3ca4d6312b7c74874712fdc
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38890
Subject: LU-12424 lnet: prevent loop in LNetPrimaryNID()
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: f9ab0e2129425024b1114baa8e1d28c687b954ed
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35191/
Subject: LU-12424 lnet: prevent loop in LNetPrimaryNID()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 439520f762b093edba9af2f4ab63011eafab28d5
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35191
Subject: LU-12424 lnet: prevent loop in LNetPrimaryNID()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bd28e8f5358ed51427dc6e667f6a5ca0cb6e381f
This was my original report to Amir:
I have four servers.
---------------- snx11922n002 <-- MGS and MDS ---------------- 10.12.0.50@o2ib40 ---------------- snx11922n003 <-- Seconds MDS ---------------- 10.12.0.51@o2ib40 ---------------- snx11922n004 <-- OSS ---------------- 10.12.0.52@o2ib40 ---------------- snx11922n005 <-- OSS ---------------- 10.12.0.53@o2ib40
Discovery is disabled in the lnet.conf:
[root@snx11922n002 ~]# grep discover /etc/modprobe.d/lnet.conf options lnet lnet_peer_discovery_disabled=1 [root@snx11922n002 ~]#
Unload lnet on all servers:
[root@snx11922n000 ~]# pdsh -g lustre 'lctl net down; lustre_rmmod' snx11922n005: LNET busy snx11922n004: LNET busy snx11922n003: LNET busy snx11922n002: LNET busy [root@snx11922n000 ~]#
Mount the MGS
[root@snx11922n002 ~]# mount -t lustre -o errors=panic,journal_checksum /dev/md/snx11922n003:md65 /data/snx11922n003:md65
The mount hangs and console log shows stuck thread:
Jun 6 14:02:05 snx11922n002 kernel: NMI watchdog: BUG: soft lockup - CPU#15 stuck for 23s! [llog_process_th:17820] Jun 6 14:02:05 snx11922n002 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) loop raid10 ext4 mbcache jbd2 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack xt_multiport iptable_filter xt_CT nf_conntrack libcrc32c iptable_raw mst_pciconf(OE) rpcrdma(OE) xprtrdma(OE) ib_isert(OE) ib_iser(OE) ib_srpt(OE) ib_srp(OE) mlx4_ib(OE) mlx4_en(OE) mlx4_core(OE) ib_ipoib(OE) rdma_ucm(OE) ib_ucm(OE) ib_umad(OE) rdma_cm(OE) ib_cm(OE) iw_cm(OE) sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm mlx5_ib(OE) irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support ib_uverbs(OE) ib_core(OE) ast ttm drm_kms_helper Jun 6 14:02:05 snx11922n002 kernel: syscopyarea sysfillrect sysimgblt pcspkr fb_sys_fops joydev drm i2c_i801 lpc_ich mlx5_core(OE) sg ipmi_si mlx_compat(OE) devlink ipmi_devintf wmi shpchp ipmi_msghandler acpi_power_meter acpi_cpufreq nfsd ip_tables nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sd_mod sr_mod crc_t10dif cdrom crct10dif_generic team_mode_activebackup team uas usb_storage ahci igb libahci i2c_algo_bit mpt3sas i2c_core dca libata crct10dif_pclmul ptp raid_class crct10dif_common crc32c_intel scsi_transport_sas pps_core sunrpc bonding [last unloaded: libcfs] Jun 6 14:02:05 snx11922n002 kernel: CPU: 15 PID: 17820 Comm: llog_process_th Tainted: G OEL ------------ 3.10.0-693.21.1.x3.2.12.x86_64 #1 Jun 6 14:02:05 snx11922n002 kernel: Hardware name: Seagate Laguna Seca/Laguna Seca, BIOS v02.0040 06/29/2018 Jun 6 14:02:05 snx11922n002 kernel: task: ffff881fb5f98000 ti: ffff881d25bcc000 task.ti: ffff881d25bcc000 Jun 6 14:02:05 snx11922n002 kernel: RIP: 0010:[<ffffffff8133197c>] [<ffffffff8133197c>] strcpy+0x1c/0x30 Jun 6 14:02:05 snx11922n002 kernel: RSP: 0018:ffff881d25bcf5b0 EFLAGS: 00000282 Jun 6 14:02:05 snx11922n002 kernel: RAX: ffff881d84fca0af RBX: ffffffffc0b69280 RCX: 0000000000000069 Jun 6 14:02:05 snx11922n002 kernel: RDX: ffff881d84fca0b6 RSI: ffffffffc0b5bd77 RDI: ffff881d84fca0af Jun 6 14:02:05 snx11922n002 kernel: RBP: ffff881d25bcf5b0 R08: 000000000000ffff R09: 000000000000ffff Jun 6 14:02:05 snx11922n002 kernel: R10: 0000000000000000 R11: ffff881d25bcf5e6 R12: ffff881d84fcb000 Jun 6 14:02:05 snx11922n002 kernel: R13: ffff881d84fca0fa R14: ffffffff8133413b R15: ffff881d25bcf540 Jun 6 14:02:05 snx11922n002 kernel: FS: 0000000000000000(0000) GS:ffff881fff7c0000(0000) knlGS:0000000000000000 Jun 6 14:02:05 snx11922n002 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jun 6 14:02:05 snx11922n002 kernel: CR2: 000055f78939a670 CR3: 0000000001a0a000 CR4: 00000000003607e0 Jun 6 14:02:05 snx11922n002 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jun 6 14:02:05 snx11922n002 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Jun 6 14:02:05 snx11922n002 kernel: Call Trace: Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc0a30896>] libcfs_debug_vmsg2+0x446/0xb30 [libcfs] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffff81333793>] ? number.isra.2+0x323/0x360 Jun 6 14:02:05 snx11922n002 kernel: [<ffffffff81335df9>] ? snprintf+0x49/0x70 Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc0a30fd7>] libcfs_debug_msg+0x57/0x80 [libcfs] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc0b275c2>] ? libcfs_nid2str_r+0xe2/0x130 [lnet] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc0b530e0>] lnet_discover_peer_locked+0x2f0/0x3d0 [lnet] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffff810b4fc0>] ? wake_up_atomic_t+0x30/0x30 Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc0b53230>] LNetPrimaryNID+0x70/0x1a0 [lnet] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc0dd92ee>] ptlrpc_connection_get+0x3e/0x450 [ptlrpc] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc0dcd40c>] ptlrpc_uuid_to_connection+0xec/0x1a0 [ptlrpc] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc0d9fa62>] import_set_conn+0xb2/0x7a0 [ptlrpc] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc0a28bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc0da12f9>] client_obd_setup+0xd19/0x1430 [ptlrpc] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc0bc758e>] ? export_handle_addref+0xe/0x10 [obdclass] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc0bda4c1>] ? lprocfs_counter_sub+0xc1/0x130 [obdclass] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc147e8bd>] osp_init0.isra.19+0x13ed/0x1f60 [osp] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffff811e4ffb>] ? __kmalloc+0x1eb/0x230 Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc0bfb559>] ? lu_context_key_revive+0x19/0x30 [obdclass] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc0bfb5e2>] ? lu_context_key_revive_many+0x72/0xb0 [obdclass] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc147a090>] ? osp_type_start+0x20/0x30 [osp] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc147f4b6>] osp_device_alloc+0x86/0x130 [osp] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc0be5179>] obd_setup+0x119/0x280 [obdclass] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc0be5588>] class_setup+0x2a8/0x840 [obdclass] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc0be94e6>] class_process_config+0x1726/0x2830 [obdclass] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffff813329bb>] ? simple_strtoull+0x3b/0x70 Jun 6 14:02:05 snx11922n002 kernel: [<ffffffff813329f9>] ? simple_strtoul+0x9/0x10 Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc0bed220>] ? target_name2index+0x90/0xc0 [obdclass] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc0bed2b7>] ? server_name2index+0x67/0xa0 [obdclass] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc0bec129>] class_config_llog_handler+0x819/0x1520 [obdclass] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc0baf7bf>] llog_process_thread+0x82f/0x1900 [obdclass] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffff811e4ffb>] ? __kmalloc+0x1eb/0x230 Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc1073408>] ? htree_lock_alloc+0x38/0x130 [ldiskfs] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc0bb129f>] llog_process_thread_daemonize+0x9f/0xe0 [obdclass] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffffc0bb1200>] ? llog_backup+0x500/0x500 [obdclass] Jun 6 14:02:05 snx11922n002 kernel: [<ffffffff810b4031>] kthread+0xd1/0xe0 Jun 6 14:02:05 snx11922n002 kernel: [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40 Jun 6 14:02:05 snx11922n002 kernel: [<ffffffff816c455d>] ret_from_fork+0x5d/0xb0 Jun 6 14:02:05 snx11922n002 kernel: [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
In the dklog we see:
00000400:00000200:19.0:1559846993.957191:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete 00000400:00000200:19.0:1559846993.957192:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114 00000400:00000200:19.0:1559846993.957193:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete 00000400:00000200:19.0:1559846993.957194:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114 00000400:00000200:19.0:1559846993.957194:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete 00000400:00000200:19.0:1559846993.957195:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114 00000400:00000200:19.0:1559846993.957196:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete 00000400:00000200:19.0:1559846993.957197:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114 00000400:00000200:19.0:1559846993.957198:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete 00000400:00000200:19.0:1559846993.957199:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114 00000400:00000200:19.0:1559846993.957199:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete 00000400:00000200:19.0:1559846993.957200:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114 00000400:00000200:19.0:1559846993.957201:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete 00000400:00000200:19.0:1559846993.957202:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114 00000400:00000200:19.0:1559846993.957203:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete 00000400:00000200:19.0:1559846993.957203:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114 00000400:00000200:19.0:1559846993.957204:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete 00000400:00000200:19.0:1559846993.957205:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114 00000400:00000200:19.0:1559846993.957206:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete and on and on ...
Those messages repeat until I load lnet on the other servers, at which point you see
00000400:00000200:19.0:1559846994.565726:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114 00000400:00000200:19.0:1559846994.565728:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete 00000400:00000200:19.0:1559846994.565733:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete 00000400:00000200:19.0:1559846994.565734:0:6520:0:(peer.c:1180:LNetPrimaryNID()) NID 10.12.0.51@o2ib40 primary NID 10.12.0.51@o2ib40 rc 0
pid 6520 in this case is the mount process.
I believe the thread is stuck in this loop in LNetPrimaryNID():
while (!lnet_peer_is_uptodate(lp)) { rc = lnet_discover_peer_locked(lpni, cpt, true); if (rc) goto out_decref; lp = lpni->lpni_peer_net->lpn_peer; }
I'm reopening this issue because the Gatekeeper's helper found the following crashes match this issue:
conf-santiy test 32d - https://testing.whamcloud.com/test_sets/280d0203-dc31-43a3-b57a-b26d0226f085
replay-single test 26 - https://testing.whamcloud.com/test_sets/a173753e-8bef-478f-afed-86c2eb90048b
both for 2.12.8.
Please review the logs and see if these crashes are the same issue or if we need to open a new ticket.