[LU-12424] LNet MR routing: possible loop when discovery is off Created: 11/Jun/19  Updated: 15/Dec/21

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.8
Fix Version/s: Lustre 2.13.0, Lustre 2.12.6

Type: Bug Priority: Minor
Reporter: Amir Shehata (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-12442 recovery-small test_136: mounts stuck... Resolved
is related to LU-10931 failed peer discovery still taking to... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Found by hornc. When discovery is off, LNetPrimaryNID() can enter an infinite loop while trying to discover a peer whose node has no LNet module loaded.



 Comments   
Comment by Chris Horn [ 11/Jun/19 ]

This was my original report to Amir:

I have four servers.

----------------
snx11922n002   <-- MGS and MDS
----------------
10.12.0.50@o2ib40
----------------
snx11922n003  <-- Second MDS
----------------
10.12.0.51@o2ib40
----------------
snx11922n004  <-- OSS
----------------
10.12.0.52@o2ib40
----------------
snx11922n005 <-- OSS
----------------
10.12.0.53@o2ib40

Discovery is disabled in the lnet.conf:

[root@snx11922n002 ~]# grep discover /etc/modprobe.d/lnet.conf
options lnet lnet_peer_discovery_disabled=1
[root@snx11922n002 ~]#

Unload lnet on all servers:

[root@snx11922n000 ~]# pdsh -g lustre 'lctl net down; lustre_rmmod'
snx11922n005: LNET busy
snx11922n004: LNET busy
snx11922n003: LNET busy
snx11922n002: LNET busy
[root@snx11922n000 ~]#

Mount the MGS:

[root@snx11922n002 ~]# mount -t lustre -o errors=panic,journal_checksum /dev/md/snx11922n003:md65 /data/snx11922n003:md65

The mount hangs and console log shows stuck thread:

Jun  6 14:02:05 snx11922n002 kernel: NMI watchdog: BUG: soft lockup - CPU#15 stuck for 23s! [llog_process_th:17820]
Jun  6 14:02:05 snx11922n002 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) loop raid10 ext4 mbcache jbd2 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack xt_multiport iptable_filter xt_CT nf_conntrack libcrc32c iptable_raw mst_pciconf(OE) rpcrdma(OE) xprtrdma(OE) ib_isert(OE) ib_iser(OE) ib_srpt(OE) ib_srp(OE) mlx4_ib(OE) mlx4_en(OE) mlx4_core(OE) ib_ipoib(OE) rdma_ucm(OE) ib_ucm(OE) ib_umad(OE) rdma_cm(OE) ib_cm(OE) iw_cm(OE) sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm mlx5_ib(OE) irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support ib_uverbs(OE) ib_core(OE) ast ttm drm_kms_helper
Jun  6 14:02:05 snx11922n002 kernel:  syscopyarea sysfillrect sysimgblt pcspkr fb_sys_fops joydev drm i2c_i801 lpc_ich mlx5_core(OE) sg ipmi_si mlx_compat(OE) devlink ipmi_devintf wmi shpchp ipmi_msghandler acpi_power_meter acpi_cpufreq nfsd ip_tables nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sd_mod sr_mod crc_t10dif cdrom crct10dif_generic team_mode_activebackup team uas usb_storage ahci igb libahci i2c_algo_bit mpt3sas i2c_core dca libata crct10dif_pclmul ptp raid_class crct10dif_common crc32c_intel scsi_transport_sas pps_core sunrpc bonding [last unloaded: libcfs]
Jun  6 14:02:05 snx11922n002 kernel: CPU: 15 PID: 17820 Comm: llog_process_th Tainted: G           OEL ------------   3.10.0-693.21.1.x3.2.12.x86_64 #1
Jun  6 14:02:05 snx11922n002 kernel: Hardware name: Seagate Laguna Seca/Laguna Seca, BIOS v02.0040 06/29/2018
Jun  6 14:02:05 snx11922n002 kernel: task: ffff881fb5f98000 ti: ffff881d25bcc000 task.ti: ffff881d25bcc000
Jun  6 14:02:05 snx11922n002 kernel: RIP: 0010:[<ffffffff8133197c>]  [<ffffffff8133197c>] strcpy+0x1c/0x30
Jun  6 14:02:05 snx11922n002 kernel: RSP: 0018:ffff881d25bcf5b0  EFLAGS: 00000282
Jun  6 14:02:05 snx11922n002 kernel: RAX: ffff881d84fca0af RBX: ffffffffc0b69280 RCX: 0000000000000069
Jun  6 14:02:05 snx11922n002 kernel: RDX: ffff881d84fca0b6 RSI: ffffffffc0b5bd77 RDI: ffff881d84fca0af
Jun  6 14:02:05 snx11922n002 kernel: RBP: ffff881d25bcf5b0 R08: 000000000000ffff R09: 000000000000ffff
Jun  6 14:02:05 snx11922n002 kernel: R10: 0000000000000000 R11: ffff881d25bcf5e6 R12: ffff881d84fcb000
Jun  6 14:02:05 snx11922n002 kernel: R13: ffff881d84fca0fa R14: ffffffff8133413b R15: ffff881d25bcf540
Jun  6 14:02:05 snx11922n002 kernel: FS:  0000000000000000(0000) GS:ffff881fff7c0000(0000) knlGS:0000000000000000
Jun  6 14:02:05 snx11922n002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun  6 14:02:05 snx11922n002 kernel: CR2: 000055f78939a670 CR3: 0000000001a0a000 CR4: 00000000003607e0
Jun  6 14:02:05 snx11922n002 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun  6 14:02:05 snx11922n002 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jun  6 14:02:05 snx11922n002 kernel: Call Trace:
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0a30896>] libcfs_debug_vmsg2+0x446/0xb30 [libcfs]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffff81333793>] ? number.isra.2+0x323/0x360
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffff81335df9>] ? snprintf+0x49/0x70
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0a30fd7>] libcfs_debug_msg+0x57/0x80 [libcfs]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0b275c2>] ? libcfs_nid2str_r+0xe2/0x130 [lnet]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0b530e0>] lnet_discover_peer_locked+0x2f0/0x3d0 [lnet]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffff810b4fc0>] ? wake_up_atomic_t+0x30/0x30
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0b53230>] LNetPrimaryNID+0x70/0x1a0 [lnet]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0dd92ee>] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0dcd40c>] ptlrpc_uuid_to_connection+0xec/0x1a0 [ptlrpc]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0d9fa62>] import_set_conn+0xb2/0x7a0 [ptlrpc]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0a28bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0da12f9>] client_obd_setup+0xd19/0x1430 [ptlrpc]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0bc758e>] ? export_handle_addref+0xe/0x10 [obdclass]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0bda4c1>] ? lprocfs_counter_sub+0xc1/0x130 [obdclass]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc147e8bd>] osp_init0.isra.19+0x13ed/0x1f60 [osp]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffff811e4ffb>] ? __kmalloc+0x1eb/0x230
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0bfb559>] ? lu_context_key_revive+0x19/0x30 [obdclass]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0bfb5e2>] ? lu_context_key_revive_many+0x72/0xb0 [obdclass]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc147a090>] ? osp_type_start+0x20/0x30 [osp]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc147f4b6>] osp_device_alloc+0x86/0x130 [osp]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0be5179>] obd_setup+0x119/0x280 [obdclass]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0be5588>] class_setup+0x2a8/0x840 [obdclass]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0be94e6>] class_process_config+0x1726/0x2830 [obdclass]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffff813329bb>] ? simple_strtoull+0x3b/0x70
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffff813329f9>] ? simple_strtoul+0x9/0x10
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0bed220>] ? target_name2index+0x90/0xc0 [obdclass]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0bed2b7>] ? server_name2index+0x67/0xa0 [obdclass]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0bec129>] class_config_llog_handler+0x819/0x1520 [obdclass]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0baf7bf>] llog_process_thread+0x82f/0x1900 [obdclass]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffff811e4ffb>] ? __kmalloc+0x1eb/0x230
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc1073408>] ? htree_lock_alloc+0x38/0x130 [ldiskfs]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0bb129f>] llog_process_thread_daemonize+0x9f/0xe0 [obdclass]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0bb1200>] ? llog_backup+0x500/0x500 [obdclass]
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffff810b4031>] kthread+0xd1/0xe0
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffff816c455d>] ret_from_fork+0x5d/0xb0
Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40

In the dklog we see:

00000400:00000200:19.0:1559846993.957191:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete
00000400:00000200:19.0:1559846993.957192:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114
00000400:00000200:19.0:1559846993.957193:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete
00000400:00000200:19.0:1559846993.957194:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114
00000400:00000200:19.0:1559846993.957194:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete
00000400:00000200:19.0:1559846993.957195:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114
00000400:00000200:19.0:1559846993.957196:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete
00000400:00000200:19.0:1559846993.957197:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114
00000400:00000200:19.0:1559846993.957198:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete
00000400:00000200:19.0:1559846993.957199:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114
00000400:00000200:19.0:1559846993.957199:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete
00000400:00000200:19.0:1559846993.957200:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114
00000400:00000200:19.0:1559846993.957201:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete
00000400:00000200:19.0:1559846993.957202:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114
00000400:00000200:19.0:1559846993.957203:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete
00000400:00000200:19.0:1559846993.957203:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114
00000400:00000200:19.0:1559846993.957204:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete
00000400:00000200:19.0:1559846993.957205:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114
00000400:00000200:19.0:1559846993.957206:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete
and on
and on
...

Those messages repeat until I load LNet on the other servers, at which point you see:

00000400:00000200:19.0:1559846994.565726:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114
00000400:00000200:19.0:1559846994.565728:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete
00000400:00000200:19.0:1559846994.565733:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete
00000400:00000200:19.0:1559846994.565734:0:6520:0:(peer.c:1180:LNetPrimaryNID()) NID 10.12.0.51@o2ib40 primary NID 10.12.0.51@o2ib40 rc 0

pid 6520 in this case is the mount process.

I believe the thread is stuck in this loop in LNetPrimaryNID():

        while (!lnet_peer_is_uptodate(lp)) {
                rc = lnet_discover_peer_locked(lpni, cpt, true);
                if (rc)
                        goto out_decref;
                lp = lpni->lpni_peer_net->lpn_peer;
        }
Comment by Gerrit Updater [ 11/Jun/19 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35191
Subject: LU-12424 lnet: prevent loop in LNetPrimaryNID()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bd28e8f5358ed51427dc6e667f6a5ca0cb6e381f

Comment by Gerrit Updater [ 27/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35191/
Subject: LU-12424 lnet: prevent loop in LNetPrimaryNID()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 439520f762b093edba9af2f4ab63011eafab28d5

Comment by Peter Jones [ 27/Jun/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 10/Jun/20 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38890
Subject: LU-12424 lnet: prevent loop in LNetPrimaryNID()
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: f9ab0e2129425024b1114baa8e1d28c687b954ed

Comment by Gerrit Updater [ 07/Aug/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38890/
Subject: LU-12424 lnet: prevent loop in LNetPrimaryNID()
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 7cc0b292ecdbc578b3ca4d6312b7c74874712fdc

Comment by James Nunez (Inactive) [ 06/Dec/21 ]

I'm reopening this issue because the Gatekeeper's helper found that the following crashes match this issue:

conf-sanity test 32d - https://testing.whamcloud.com/test_sets/280d0203-dc31-43a3-b57a-b26d0226f085
replay-single test 26 - https://testing.whamcloud.com/test_sets/a173753e-8bef-478f-afed-86c2eb90048b

Both crashes are on 2.12.8.

Please review the logs and see if these crashes are the same issue or if we need to open a new ticket.

Generated at Sat Feb 10 02:52:28 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.