
LNet MR routing: possible loop when discovery is off

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.6
    • Affects Version/s: Lustre 2.12.8
    • Labels: None
    • Severity: 3

    Description

      Found by hornc. When discovery is off, LNetPrimaryNID() can enter a loop while repeatedly trying to discover a peer on which no LNet module is loaded.
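
      To make the failure mode concrete, here is a minimal, self-contained userspace model (illustrative only; the types and functions below are hypothetical stand-ins, not the LNet API). When discovery is disabled and the remote node has no LNet module loaded, each discovery attempt reports success without the peer ever becoming up to date, so an uncapped retry loop simply spins:

        #include <stdbool.h>
        #include <stdio.h>

        /* Hypothetical stand-in for the LNet peer state -- illustration only. */
        struct fake_peer {
                bool up_to_date;   /* stays false while the remote has no LNet loaded */
        };

        /* Models what the dklog excerpt below shows: "discovery complete"
         * (rc = 0) is reported, but nothing about the peer is learned. */
        static int fake_discover(struct fake_peer *p)
        {
                (void)p;
                return 0;
        }

        int main(void)
        {
                struct fake_peer peer = { .up_to_date = false };
                long tries = 0;

                /* Same shape as the LNetPrimaryNID() loop quoted at the end of
                 * this ticket.  Capped here so the demo terminates; the real
                 * loop has no cap, hence the soft lockup on the mount thread. */
                while (!peer.up_to_date && tries < 10000000L) {
                        if (fake_discover(&peer) != 0)
                                break;
                        tries++;
                }

                printf("gave up after %ld tries, up_to_date=%d\n",
                       tries, peer.up_to_date);
                return 0;
        }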

      Attachments

        Issue Links

          Activity

            [LU-12424] LNet MR routing: possible loop when discovery is off

            jamesanunez James Nunez (Inactive) added a comment -

            I'm reopening this issue because the Gatekeeper's helper found that the following crashes match this issue:

            conf-sanity test 32d - https://testing.whamcloud.com/test_sets/280d0203-dc31-43a3-b57a-b26d0226f085
            replay-single test 26 - https://testing.whamcloud.com/test_sets/a173753e-8bef-478f-afed-86c2eb90048b

            both for 2.12.8.

            Please review the logs and see if these crashes are the same issue or if we need to open a new ticket.


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38890/
            Subject: LU-12424 lnet: prevent loop in LNetPrimaryNID()
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 7cc0b292ecdbc578b3ca4d6312b7c74874712fdc


            gerrit Gerrit Updater added a comment -

            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38890
            Subject: LU-12424 lnet: prevent loop in LNetPrimaryNID()
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: f9ab0e2129425024b1114baa8e1d28c687b954ed

            pjones Peter Jones added a comment -

            Landed for 2.13


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35191/
            Subject: LU-12424 lnet: prevent loop in LNetPrimaryNID()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 439520f762b093edba9af2f4ab63011eafab28d5


            gerrit Gerrit Updater added a comment -

            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35191
            Subject: LU-12424 lnet: prevent loop in LNetPrimaryNID()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: bd28e8f5358ed51427dc6e667f6a5ca0cb6e381f

            hornc Chris Horn added a comment - edited

            This was my original report to Amir:

            I have four servers:

            snx11922n002   10.12.0.50@o2ib40   <-- MGS and MDS
            snx11922n003   10.12.0.51@o2ib40   <-- Second MDS
            snx11922n004   10.12.0.52@o2ib40   <-- OSS
            snx11922n005   10.12.0.53@o2ib40   <-- OSS

            Discovery is disabled in the lnet.conf:

            [root@snx11922n002 ~]# grep discover /etc/modprobe.d/lnet.conf
            options lnet lnet_peer_discovery_disabled=1
            [root@snx11922n002 ~]#
            

            Unload lnet on all servers:

            [root@snx11922n000 ~]# pdsh -g lustre 'lctl net down; lustre_rmmod'
            snx11922n005: LNET busy
            snx11922n004: LNET busy
            snx11922n003: LNET busy
            snx11922n002: LNET busy
            [root@snx11922n000 ~]#
            

            Mount the MGS:

            [root@snx11922n002 ~]# mount -t lustre -o errors=panic,journal_checksum /dev/md/snx11922n003:md65 /data/snx11922n003:md65
            

            The mount hangs and the console log shows a stuck thread:

            Jun  6 14:02:05 snx11922n002 kernel: NMI watchdog: BUG: soft lockup - CPU#15 stuck for 23s! [llog_process_th:17820]
            Jun  6 14:02:05 snx11922n002 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) loop raid10 ext4 mbcache jbd2 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack xt_multiport iptable_filter xt_CT nf_conntrack libcrc32c iptable_raw mst_pciconf(OE) rpcrdma(OE) xprtrdma(OE) ib_isert(OE) ib_iser(OE) ib_srpt(OE) ib_srp(OE) mlx4_ib(OE) mlx4_en(OE) mlx4_core(OE) ib_ipoib(OE) rdma_ucm(OE) ib_ucm(OE) ib_umad(OE) rdma_cm(OE) ib_cm(OE) iw_cm(OE) sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm mlx5_ib(OE) irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support ib_uverbs(OE) ib_core(OE) ast ttm drm_kms_helper
            Jun  6 14:02:05 snx11922n002 kernel:  syscopyarea sysfillrect sysimgblt pcspkr fb_sys_fops joydev drm i2c_i801 lpc_ich mlx5_core(OE) sg ipmi_si mlx_compat(OE) devlink ipmi_devintf wmi shpchp ipmi_msghandler acpi_power_meter acpi_cpufreq nfsd ip_tables nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sd_mod sr_mod crc_t10dif cdrom crct10dif_generic team_mode_activebackup team uas usb_storage ahci igb libahci i2c_algo_bit mpt3sas i2c_core dca libata crct10dif_pclmul ptp raid_class crct10dif_common crc32c_intel scsi_transport_sas pps_core sunrpc bonding [last unloaded: libcfs]
            Jun  6 14:02:05 snx11922n002 kernel: CPU: 15 PID: 17820 Comm: llog_process_th Tainted: G           OEL ------------   3.10.0-693.21.1.x3.2.12.x86_64 #1
            Jun  6 14:02:05 snx11922n002 kernel: Hardware name: Seagate Laguna Seca/Laguna Seca, BIOS v02.0040 06/29/2018
            Jun  6 14:02:05 snx11922n002 kernel: task: ffff881fb5f98000 ti: ffff881d25bcc000 task.ti: ffff881d25bcc000
            Jun  6 14:02:05 snx11922n002 kernel: RIP: 0010:[<ffffffff8133197c>]  [<ffffffff8133197c>] strcpy+0x1c/0x30
            Jun  6 14:02:05 snx11922n002 kernel: RSP: 0018:ffff881d25bcf5b0  EFLAGS: 00000282
            Jun  6 14:02:05 snx11922n002 kernel: RAX: ffff881d84fca0af RBX: ffffffffc0b69280 RCX: 0000000000000069
            Jun  6 14:02:05 snx11922n002 kernel: RDX: ffff881d84fca0b6 RSI: ffffffffc0b5bd77 RDI: ffff881d84fca0af
            Jun  6 14:02:05 snx11922n002 kernel: RBP: ffff881d25bcf5b0 R08: 000000000000ffff R09: 000000000000ffff
            Jun  6 14:02:05 snx11922n002 kernel: R10: 0000000000000000 R11: ffff881d25bcf5e6 R12: ffff881d84fcb000
            Jun  6 14:02:05 snx11922n002 kernel: R13: ffff881d84fca0fa R14: ffffffff8133413b R15: ffff881d25bcf540
            Jun  6 14:02:05 snx11922n002 kernel: FS:  0000000000000000(0000) GS:ffff881fff7c0000(0000) knlGS:0000000000000000
            Jun  6 14:02:05 snx11922n002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            Jun  6 14:02:05 snx11922n002 kernel: CR2: 000055f78939a670 CR3: 0000000001a0a000 CR4: 00000000003607e0
            Jun  6 14:02:05 snx11922n002 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            Jun  6 14:02:05 snx11922n002 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
            Jun  6 14:02:05 snx11922n002 kernel: Call Trace:
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0a30896>] libcfs_debug_vmsg2+0x446/0xb30 [libcfs]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffff81333793>] ? number.isra.2+0x323/0x360
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffff81335df9>] ? snprintf+0x49/0x70
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0a30fd7>] libcfs_debug_msg+0x57/0x80 [libcfs]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0b275c2>] ? libcfs_nid2str_r+0xe2/0x130 [lnet]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0b530e0>] lnet_discover_peer_locked+0x2f0/0x3d0 [lnet]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffff810b4fc0>] ? wake_up_atomic_t+0x30/0x30
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0b53230>] LNetPrimaryNID+0x70/0x1a0 [lnet]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0dd92ee>] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0dcd40c>] ptlrpc_uuid_to_connection+0xec/0x1a0 [ptlrpc]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0d9fa62>] import_set_conn+0xb2/0x7a0 [ptlrpc]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0a28bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0da12f9>] client_obd_setup+0xd19/0x1430 [ptlrpc]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0bc758e>] ? export_handle_addref+0xe/0x10 [obdclass]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0bda4c1>] ? lprocfs_counter_sub+0xc1/0x130 [obdclass]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc147e8bd>] osp_init0.isra.19+0x13ed/0x1f60 [osp]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffff811e4ffb>] ? __kmalloc+0x1eb/0x230
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0bfb559>] ? lu_context_key_revive+0x19/0x30 [obdclass]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0bfb5e2>] ? lu_context_key_revive_many+0x72/0xb0 [obdclass]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc147a090>] ? osp_type_start+0x20/0x30 [osp]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc147f4b6>] osp_device_alloc+0x86/0x130 [osp]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0be5179>] obd_setup+0x119/0x280 [obdclass]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0be5588>] class_setup+0x2a8/0x840 [obdclass]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0be94e6>] class_process_config+0x1726/0x2830 [obdclass]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffff813329bb>] ? simple_strtoull+0x3b/0x70
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffff813329f9>] ? simple_strtoul+0x9/0x10
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0bed220>] ? target_name2index+0x90/0xc0 [obdclass]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0bed2b7>] ? server_name2index+0x67/0xa0 [obdclass]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0bec129>] class_config_llog_handler+0x819/0x1520 [obdclass]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0baf7bf>] llog_process_thread+0x82f/0x1900 [obdclass]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffff811e4ffb>] ? __kmalloc+0x1eb/0x230
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc1073408>] ? htree_lock_alloc+0x38/0x130 [ldiskfs]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0bb129f>] llog_process_thread_daemonize+0x9f/0xe0 [obdclass]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffffc0bb1200>] ? llog_backup+0x500/0x500 [obdclass]
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffff810b4031>] kthread+0xd1/0xe0
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffff816c455d>] ret_from_fork+0x5d/0xb0
            Jun  6 14:02:05 snx11922n002 kernel:  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
            

            In the dklog we see:

            00000400:00000200:19.0:1559846993.957191:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete
            00000400:00000200:19.0:1559846993.957192:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114
            00000400:00000200:19.0:1559846993.957193:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete
            00000400:00000200:19.0:1559846993.957194:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114
            00000400:00000200:19.0:1559846993.957194:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete
            00000400:00000200:19.0:1559846993.957195:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114
            00000400:00000200:19.0:1559846993.957196:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete
            00000400:00000200:19.0:1559846993.957197:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114
            00000400:00000200:19.0:1559846993.957198:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete
            00000400:00000200:19.0:1559846993.957199:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114
            00000400:00000200:19.0:1559846993.957199:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete
            00000400:00000200:19.0:1559846993.957200:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114
            00000400:00000200:19.0:1559846993.957201:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete
            00000400:00000200:19.0:1559846993.957202:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114
            00000400:00000200:19.0:1559846993.957203:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete
            00000400:00000200:19.0:1559846993.957203:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114
            00000400:00000200:19.0:1559846993.957204:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete
            00000400:00000200:19.0:1559846993.957205:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114
            00000400:00000200:19.0:1559846993.957206:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete
            and on
            and on
            ...
            

            Those messages repeat until I load lnet on the other servers, at which point you see:

            00000400:00000200:19.0:1559846994.565726:0:6520:0:(peer.c:1898:lnet_peer_queue_for_discovery()) Queue peer 10.12.0.51@o2ib40: -114
            00000400:00000200:19.0:1559846994.565728:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete
            00000400:00000200:19.0:1559846994.565733:0:6520:0:(peer.c:2218:lnet_discover_peer_locked()) peer 10.12.0.51@o2ib40 NID 10.12.0.51@o2ib40: 0. discovery complete
            00000400:00000200:19.0:1559846994.565734:0:6520:0:(peer.c:1180:LNetPrimaryNID()) NID 10.12.0.51@o2ib40 primary NID 10.12.0.51@o2ib40 rc 0
            

            pid 6520 in this case is the mount process.

            I believe the thread is stuck in this loop in LNetPrimaryNID():

                    while (!lnet_peer_is_uptodate(lp)) {
                            rc = lnet_discover_peer_locked(lpni, cpt, true);
                            if (rc)
                                    goto out_decref;
                            lp = lpni->lpni_peer_net->lpn_peer;
                    }
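
            The patches above are titled "prevent loop in LNetPrimaryNID()". As a rough sketch of the kind of guard that addresses this (paraphrased, not the merged change; the lnet_is_discovery_disabled() helper is an assumption here and the actual fix may use a different check), the retry can bail out after a single attempt once discovery is known to be disabled for the peer:

                    while (!lnet_peer_is_uptodate(lp)) {
                            rc = lnet_discover_peer_locked(lpni, cpt, true);
                            if (rc)
                                    goto out_decref;
                            lp = lpni->lpni_peer_net->lpn_peer;

                            /* Sketch: with discovery disabled the peer may never
                             * become up to date, so only try once instead of
                             * spinning forever. */
                            if (lnet_is_discovery_disabled(lp))
                                    break;
                    }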

            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: ashehata Amir Shehata (Inactive)
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated: