Lustre / LU-10123

Repeated timeouts during mount() - MDS and client

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version: Lustre 2.11.0
    • Fix Version: Lustre 2.11.0
    • Environment: Soak cluster, lustre-master build 3654, version=2.10.54_13_g84f690e

    Description

      While attempting to mount the filesystem (DNE configuration), the MGT/MDS node hung on the initial mount.

      Oct 16 16:38:01 soak-8 kernel: LDISKFS-fs warning (device dm-0): ldiskfs_multi_mount_protect:322: MMP interval 42 higher than expected, please wait.
      Oct 16 16:38:44 soak-8 kernel: LDISKFS-fs (dm-0): recovery complete
      Oct 16 16:38:44 soak-8 kernel: LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
      Oct 16 16:38:45 soak-8 kernel: Lustre: MGS: Connection restored to e7ad9453-d165-e9e1-75ba-304460ab678f (at 0@lo)
      Oct 16 16:38:50 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1508171925/real 1508171925]  req@ffff88081c348c00 x1581432791105776/t0(0) o8->soaked-OST0001-osc-MDT0000@192.168.1.103@o2ib:28/4 lens 520/544 e 0 to 1 dl 1508171930 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Oct 16 16:39:10 soak-8 kernel: LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
      Oct 16 16:39:15 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1508171950/real 1508171950]  req@ffff880814820c00 x1581432791105856/t0(0) o8->soaked-OST0000-osc-MDT0000@192.168.1.103@o2ib:28/4 lens 520/544 e 0 to 1 dl 1508171955 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Oct 16 16:39:15 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
      Oct 16 16:39:24 soak-8 sssd_be: GSSAPI client step 1
      Oct 16 16:39:24 soak-8 sssd_be: GSSAPI client step 1
      Oct 16 16:39:24 soak-8 sssd_be: GSSAPI client step 1
      Oct 16 16:39:24 soak-8 sssd_be: GSSAPI client step 2
      Oct 16 16:39:45 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1508171975/real 1508171975]  req@ffff880814821800 x1581432791105936/t0(0) o38->soaked-MDT0003-osp-MDT0000@192.168.1.111@o2ib:24/4 lens 520/544 e 0 to 1 dl 1508171985 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Oct 16 16:39:45 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
      Oct 16 16:40:00 soak-8 kernel: LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
      Oct 16 16:40:01 soak-8 systemd: Started Session 7 of user root.
      Oct 16 16:40:01 soak-8 systemd: Starting Session 7 of user root.
      Oct 16 16:40:01 soak-8 CROND[13027]: (root) CMD (/usr/lib64/sa/sa1 1 1)
      Oct 16 16:40:10 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1508172000/real 1508172000]  req@ffff880814822d00 x1581432791106048/t0(0) o8->soaked-OST0000-osc-MDT0000@192.168.1.103@o2ib:28/4 lens 520/544 e 0 to 1 dl 1508172010 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Oct 16 16:40:10 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
      Oct 16 16:40:40 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1508172025/real 1508172025]  req@ffff880814823c00 x1581432791106128/t0(0) o38->soaked-MDT0003-osp-MDT0000@192.168.1.111@o2ib:24/4 lens 520/544 e 0 to 1 dl 1508172040 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Oct 16 16:40:40 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
      Oct 16 16:40:50 soak-8 kernel: LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
      Oct 16 16:40:53 soak-8 kernel: INFO: task mount.lustre:12918 blocked for more than 120 seconds.
      Oct 16 16:40:54 soak-8 kernel: mount.lustre    D 0000000000000001     0 12918  12917 0x00000082
      Oct 16 16:40:54 soak-8 kernel: ffff8804b9bbb7b0 0000000000000082 ffff88083f9e0000 ffff8804b9bbbfd8
      Oct 16 16:40:54 soak-8 kernel: ffff8804b9bbbfd8 ffff8804b9bbbfd8 ffff88083f9e0000 ffff8808203a8150
      Oct 16 16:40:54 soak-8 kernel: 7fffffffffffffff ffff8808203a8148 ffff88083f9e0000 0000000000000001
      Oct 16 16:40:54 soak-8 kernel: Call Trace:
      Oct 16 16:40:54 soak-8 kernel: [<ffffffff816a94c9>] schedule+0x29/0x70
      Oct 16 16:40:54 soak-8 kernel: [<ffffffff816a6fd9>] schedule_timeout+0x239/0x2c0
      Oct 16 16:40:54 soak-8 kernel: [<ffffffff810c9ef9>] ? select_task_rq_fair+0x549/0x700
      Oct 16 16:40:54 soak-8 kernel: [<ffffffff816a987d>] wait_for_completion+0xfd/0x140
      Oct 16 16:40:54 soak-8 kernel: [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a2f6b4>] llog_process_or_fork+0x244/0x450 [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a2f8d4>] llog_process+0x14/0x20 [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a5fe45>] class_config_parse_llog+0x125/0x350 [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc10235a8>] mgc_process_cfg_log+0x788/0xc40 [mgc]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc1026413>] mgc_process_log+0x3d3/0x890 [mgc]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a68550>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc1026b18>] ? do_config_log_add+0x248/0x580 [mgc]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc10279f0>] mgc_process_config+0x890/0x13f0 [mgc]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a6bfee>] lustre_process_log+0x2de/0xaf0 [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc090dba7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a57339>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a98c32>] server_start_targets+0x1352/0x2a70 [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a6ef5d>] ? lustre_start_mgc+0x20d/0x2800 [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a68550>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a9b3dd>] server_fill_super+0x108d/0x184c [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a71ff8>] lustre_fill_super+0x328/0x950 [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a71cd0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffff81204b0d>] mount_nodev+0x4d/0xb0
      Oct 16 16:40:54 soak-8 kernel: [<ffffffffc0a69e18>] lustre_mount+0x38/0x60 [obdclass]
      Oct 16 16:40:54 soak-8 kernel: [<ffffffff81205599>] mount_fs+0x39/0x1b0
      Oct 16 16:40:54 soak-8 kernel: [<ffffffff81222077>] vfs_kern_mount+0x67/0x110
      Oct 16 16:40:54 soak-8 kernel: [<ffffffff81224583>] do_mount+0x233/0xaf0
      Oct 16 16:40:54 soak-8 kernel: [<ffffffff8118760e>] ? __get_free_pages+0xe/0x40
      Oct 16 16:40:54 soak-8 kernel: [<ffffffff812251c6>] SyS_mount+0x96/0xf0
      Oct 16 16:40:54 soak-8 kernel: [<ffffffff816b4fc9>] system_call_fastpath+0x16/0x1b
      

      Power-cycled soak-8; the mount completed after the power cycle. Continued bringing up the filesystem, and
      soak-10 (MDT0003) hung during mount.

      Oct 16 16:57:37 soak-10 kernel: INFO: task mount.lustre:13350 blocked for more than 120 seconds.
      Oct 16 16:57:37 soak-10 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Oct 16 16:57:38 soak-10 kernel: mount.lustre    D 0000000000000001     0 13350  13349 0x00000082
      Oct 16 16:57:38 soak-10 kernel: ffff880817bdf7b0 0000000000000086 ffff880817783f40 ffff880817bdffd8
      Oct 16 16:57:38 soak-10 kernel: ffff880817bdffd8 ffff880817bdffd8 ffff880817783f40 ffff88082ce87c30
      Oct 16 16:57:38 soak-10 kernel: 7fffffffffffffff ffff88082ce87c28 ffff880817783f40 0000000000000001
      Oct 16 16:57:38 soak-10 kernel: Call Trace:
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff816a94c9>] schedule+0x29/0x70
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff816a6fd9>] schedule_timeout+0x239/0x2c0
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff810c76f5>] ? sched_clock_cpu+0x85/0xc0
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff810c1309>] ? ttwu_do_wakeup+0x19/0xd0
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff816a987d>] wait_for_completion+0xfd/0x140
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a406b4>] llog_process_or_fork+0x244/0x450 [obdclass]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a408d4>] llog_process+0x14/0x20 [obdclass]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a70e45>] class_config_parse_llog+0x125/0x350 [obdclass]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc14505a8>] mgc_process_cfg_log+0x788/0xc40 [mgc]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc1453413>] mgc_process_log+0x3d3/0x890 [mgc]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc1453b77>] ? do_config_log_add+0x2a7/0x580 [mgc]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc14549f0>] mgc_process_config+0x890/0x13f0 [mgc]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a7cfee>] lustre_process_log+0x2de/0xaf0 [obdclass]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc091b4e4>] ? libcfs_log_return+0x24/0x30 [libcfs]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0aa9c32>] server_start_targets+0x1352/0x2a70 [obdclass]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a80259>] ? lustre_start_mgc+0x509/0x2800 [obdclass]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a79550>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0aac3dd>] server_fill_super+0x108d/0x184c [obdclass]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a82ff8>] lustre_fill_super+0x328/0x950 [obdclass]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a82cd0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff81204b0d>] mount_nodev+0x4d/0xb0
      Oct 16 16:57:38 soak-10 kernel: [<ffffffffc0a7ae18>] lustre_mount+0x38/0x60 [obdclass]
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff81205599>] mount_fs+0x39/0x1b0
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff81222077>] vfs_kern_mount+0x67/0x110
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff81224583>] do_mount+0x233/0xaf0
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff8118760e>] ? __get_free_pages+0xe/0x40
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff812251c6>] SyS_mount+0x96/0xf0
      Oct 16 16:57:38 soak-10 kernel: [<ffffffff816b4fc9>] system_call_fastpath+0x16/0x1b
      

      Power-cycled that node and continued; the second mount attempt on soak-10 worked.
      While mounting the clients, soak-40 hung.

      Oct 16 17:17:56 soak-40 kernel: INFO: task mount.lustre:2933 blocked for more than 120 seconds.
      Oct 16 17:17:56 soak-40 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Oct 16 17:17:56 soak-40 kernel: mount.lustre    D 0000000000000001     0  2933   2932 0x00000080
      Oct 16 17:17:56 soak-40 kernel: ffff88107f827940 0000000000000086 ffff88085c2e5ee0 ffff88107f827fd8
      Oct 16 17:17:56 soak-40 kernel: ffff88107f827fd8 ffff88107f827fd8 ffff88085c2e5ee0 ffff88085cb5c5d0
      Oct 16 17:17:57 soak-40 kernel: 7fffffffffffffff ffff88085cb5c5c8 ffff88085c2e5ee0 0000000000000001
      Oct 16 17:17:57 soak-40 kernel: Call Trace:
      Oct 16 17:17:57 soak-40 kernel: [<ffffffff816a94e9>] schedule+0x29/0x70
      Oct 16 17:17:57 soak-40 kernel: [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0
      Oct 16 17:17:57 soak-40 kernel: [<ffffffff810c76f5>] ? sched_clock_cpu+0x85/0xc0
      Oct 16 17:17:57 soak-40 kernel: [<ffffffff810c12c8>] ? check_preempt_curr+0x78/0xa0
      Oct 16 17:17:57 soak-40 kernel: [<ffffffff810c1309>] ? ttwu_do_wakeup+0x19/0xd0
      Oct 16 17:17:57 soak-40 kernel: [<ffffffff816a989d>] wait_for_completion+0xfd/0x140
      Oct 16 17:17:57 soak-40 kernel: [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
      Oct 16 17:17:57 soak-40 kernel: [<ffffffffc0ba66b4>] llog_process_or_fork+0x244/0x450 [obdclass]
      Oct 16 17:17:57 soak-40 kernel: [<ffffffffc0ba68d4>] llog_process+0x14/0x20 [obdclass]
      Oct 16 17:17:57 soak-40 kernel: [<ffffffffc0bd6d65>] class_config_parse_llog+0x125/0x350 [obdclass]
      Oct 16 17:17:57 soak-40 kernel: [<ffffffffc0844368>] mgc_process_cfg_log+0x788/0xc40 [mgc]
      Oct 16 17:17:57 soak-40 kernel: [<ffffffffc08471e3>] mgc_process_log+0x3d3/0x890 [mgc]
      Oct 16 17:17:57 soak-40 kernel: [<ffffffffc0847947>] ? do_config_log_add+0x2a7/0x580 [mgc]
      Oct 16 17:17:57 soak-40 kernel: [<ffffffffc08487c0>] mgc_process_config+0x890/0x13f0 [mgc]
      Oct 16 17:17:58 soak-40 kernel: [<ffffffffc0be2d7e>] lustre_process_log+0x2de/0xaf0 [obdclass]
      Oct 16 17:17:58 soak-40 kernel: [<ffffffff816aba0e>] ? _raw_spin_unlock_bh+0x1e/0x20
      Oct 16 17:17:58 soak-40 kernel: [<ffffffff8132318b>] ? fprop_local_init_percpu+0x1b/0x30
      Oct 16 17:17:58 soak-40 kernel: [<ffffffffc0d7b608>] ll_fill_super+0xaf8/0x1220 [lustre]
      Oct 16 17:17:58 soak-40 kernel: [<ffffffffc0be8ce6>] lustre_fill_super+0x286/0x910 [obdclass]
      Oct 16 17:17:58 soak-40 kernel: [<ffffffffc0be8a60>] ? lustre_common_put_super+0x270/0x270 [obdclass]
      Oct 16 17:17:58 soak-40 kernel: [<ffffffff81204b0d>] mount_nodev+0x4d/0xb0
      Oct 16 17:17:58 soak-40 kernel: [<ffffffffc0be0ba8>] lustre_mount+0x38/0x60 [obdclass]
      Oct 16 17:17:58 soak-40 kernel: [<ffffffff81205599>] mount_fs+0x39/0x1b0
      Oct 16 17:17:58 soak-40 kernel: [<ffffffff81222067>] vfs_kern_mount+0x67/0x110
      Oct 16 17:17:58 soak-40 kernel: [<ffffffff81224573>] do_mount+0x233/0xaf0
      Oct 16 17:17:58 soak-40 kernel: [<ffffffff8118760e>] ? __get_free_pages+0xe/0x40
      Oct 16 17:17:58 soak-40 kernel: [<ffffffff812251b6>] SyS_mount+0x96/0xf0
      Oct 16 17:17:58 soak-40 kernel: [<ffffffff816b5009>] system_call_fastpath+0x16/0x1b
      Oct 16 17:19:59 soak-40 kernel: INFO: task mount.lustre:2933 blocked for more than 120 seconds.
      Oct 16 17:19:59 soak-40 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      

      Dumped a -1 (full debug mask) lustre-log from the client; attached.

Activity


            Cliff, I have just rebased my change #30147 on latest master.

            bfaccini Bruno Faccini (Inactive) added a comment

            We are attempting to test master now; can you rebase your patch on current master?

            cliffw Cliff White (Inactive) added a comment

            We are doing b2-10 now; we will consider this when we move back to master.

            cliffw Cliff White (Inactive) added a comment

            Sarah, Cliff,
            Could you give the more recent patch-set #7 of my patch, change #30147, a try?

            bfaccini Bruno Faccini (Inactive) added a comment
            sarah Sarah Liu added a comment

            Hi Bruno,

            Yesterday I tried to make a patch that includes both your patch and the one for LU-10134 for soak. I checked out your patch and rebased it on master, then checked out the LU-10134 patch and rebased it on top of yours. The new patch set #2 (which hit the LBUG) should contain no change other than the rebase on master.

            The test patch that includes both patches also hit the LBUG:
            https://review.whamcloud.com/#/c/30187/


            Hmm, it seems that my patch has also introduced a regression that causes this new LBUG on the client side in various auto-test sub-tests/steps (lnet-selftest, lustre-initialization, upon mount, ...):

            [ 5914.956548] LNetError: 23347:0:(peer.c:272:lnet_destroy_peer_locked()) ASSERTION( list_empty(&lp->lp_peer_nets) ) failed: 
            [ 5914.960593] LNetError: 23347:0:(peer.c:272:lnet_destroy_peer_locked()) LBUG
            [ 5914.964063] Pid: 23347, comm: lnet_discovery
            [ 5914.967234] 
            [ 5914.967234] Call Trace:
            [ 5914.973027]  [<ffffffffc07337ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
            [ 5914.976440]  [<ffffffffc073383c>] lbug_with_loc+0x4c/0xb0 [libcfs]
            [ 5914.979816]  [<ffffffffc06bab42>] lnet_destroy_peer_locked+0x232/0x340 [lnet]
            [ 5914.983278]  [<ffffffffc06bb175>] lnet_peer_discovery_complete+0x295/0x340 [lnet]
            [ 5914.986724]  [<ffffffffc06c0c30>] lnet_peer_discovery+0x6f0/0x11b0 [lnet]
            [ 5914.990141]  [<ffffffff816a904d>] ? __schedule+0x39d/0x8b0
            [ 5914.993348]  [<ffffffff810b1920>] ? autoremove_wake_function+0x0/0x40
            [ 5914.996599]  [<ffffffffc06c0540>] ? lnet_peer_discovery+0x0/0x11b0 [lnet]
            [ 5914.999836]  [<ffffffff810b099f>] kthread+0xcf/0xe0
            [ 5915.002827]  [<ffffffff810b08d0>] ? kthread+0x0/0xe0
            [ 5915.005771]  [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
            [ 5915.008696]  [<ffffffff810b08d0>] ? kthread+0x0/0xe0
            

            Cliff, do you also trigger this?

            bfaccini Bruno Faccini (Inactive) added a comment

            Mount performance is much better with this patch.

            cliffw Cliff White (Inactive) added a comment

            Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/30147
            Subject: LU-10123 lnet: ensure peer put back on dc request queue
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 3bb495fcca17bdedd89a11cc9d9bf200604af0aa

            gerrit Gerrit Updater added a comment
            bfaccini Bruno Faccini (Inactive) added a comment (edited)

            The attached Lustre debug log indicates that during the mount process the client fetches the soaked-client config llog from the MGS and runs the different config steps according to its content.

            It appears to be stuck during LNet peer discovery of the 192.168.1.110@o2ib NID while creating the soaked-MDT0002-mdc import:

            00000040:00001000:67.0:1508174041.158589:0:2946:0:(llog.c:513:llog_process_thread()) processing rec 0xffff8808f39f5698 type 0x10620000
            00000040:00001000:67.0:1508174041.158590:0:2946:0:(llog.c:519:llog_process_thread()) after swabbing, type=0x10620000 idx=38
            00000040:00001000:67.0:1508174041.158591:0:2946:0:(llog.c:595:llog_process_thread()) lrh_index: 38 lrh_len: 144 (2408 remains)
            00000020:00000001:67.0:1508174041.158592:0:2946:0:(obd_config.c:1496:class_config_llog_handler()) Process entered
            00000020:00000010:67.0:1508174041.158592:0:2946:0:(obd_config.c:1629:class_config_llog_handler()) kmalloced 'inst_name': 39 at ffff88083c75d040.
            00000020:01000000:67.0:1508174041.158594:0:2946:0:(obd_config.c:1637:class_config_llog_handler()) cmd cf003, instance name: soaked-MDT0002-mdc-ffff88105dcb8800
            00000020:00000010:67.0:1508174041.158595:0:2946:0:(obd_config.c:1707:class_config_llog_handler()) kmalloced 'lcfg_new': 136 at ffff88103f70e900.
            00000020:00000080:67.0:1508174041.158596:0:2946:0:(obd_config.c:1180:class_process_config()) processing cmd: cf003
            00000020:00000001:67.0:1508174041.158597:0:2946:0:(obd_config.c:479:class_setup()) Process entered
            00000001:00000001:67.0:1508174041.158598:0:2946:0:(hash.c:1005:cfs_hash_create()) Process entered
            00000001:00000010:67.0:1508174041.158599:0:2946:0:(hash.c:1032:cfs_hash_create()) alloc '(hs)': 144 at ffff88103f70e840 (tot 21621812).
            00000001:00000010:67.0:1508174041.158600:0:2946:0:(hash.c:893:cfs_hash_buckets_realloc()) alloc '(new_bkts)': 32 at ffff88104f3f1580 (tot 21621844).
            00000001:00000010:67.0:1508174041.158602:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4c600 (tot 21622124).
            00000001:00000010:67.0:1508174041.158604:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4c800 (tot 21622404).
            00000001:00000010:67.0:1508174041.158605:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4ca00 (tot 21622684).
            00000001:00000010:67.0:1508174041.158607:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4cc00 (tot 21622964).
            00000001:00000001:67.0:1508174041.158608:0:2946:0:(hash.c:1005:cfs_hash_create()) Process entered
            00000001:00000010:67.0:1508174041.158609:0:2946:0:(hash.c:1032:cfs_hash_create()) alloc '(hs)': 144 at ffff88103f70e780 (tot 21623108).
            00000001:00000010:67.0:1508174041.158610:0:2946:0:(hash.c:893:cfs_hash_buckets_realloc()) alloc '(new_bkts)': 32 at ffff88104f3f15a0 (tot 21623140).
            00000001:00000010:67.0:1508174041.158611:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4ce00 (tot 21623420).
            00000001:00000010:67.0:1508174041.158613:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4d000 (tot 21623700).
            00000001:00000010:67.0:1508174041.158614:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4d200 (tot 21623980).
            00000001:00000010:67.0:1508174041.158616:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4d400 (tot 21624260).
            00000001:00000001:67.0:1508174041.158617:0:2946:0:(hash.c:1005:cfs_hash_create()) Process entered
            00000001:00000010:67.0:1508174041.158617:0:2946:0:(hash.c:1032:cfs_hash_create()) alloc '(hs)': 144 at ffff88103f70e6c0 (tot 21624404).
            00000001:00000010:67.0:1508174041.158619:0:2946:0:(hash.c:893:cfs_hash_buckets_realloc()) alloc '(new_bkts)': 32 at ffff88104f3f15e0 (tot 21624436).
            00000001:00000010:67.0:1508174041.158620:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4d600 (tot 21624716).
            00000001:00000010:67.0:1508174041.158622:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4d800 (tot 21624996).
            00000001:00000010:67.0:1508174041.158623:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4da00 (tot 21625276).
            00000001:00000010:67.0:1508174041.158625:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4dc00 (tot 21625556).
            00000001:00000001:67.0:1508174041.158626:0:2946:0:(hash.c:1005:cfs_hash_create()) Process entered
            00000001:00000010:67.0:1508174041.158626:0:2946:0:(hash.c:1032:cfs_hash_create()) alloc '(hs)': 144 at ffff88103f70e600 (tot 21625700).
            00000001:00000010:67.0:1508174041.158627:0:2946:0:(hash.c:893:cfs_hash_buckets_realloc()) alloc '(new_bkts)': 32 at ffff88104f3f1600 (tot 21625732).
            00000001:00000010:67.0:1508174041.158629:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4de00 (tot 21626012).
            00000001:00000010:67.0:1508174041.158630:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4e000 (tot 21626292).
            00000001:00000010:67.0:1508174041.158631:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4e200 (tot 21626572).
            00000001:00000010:67.0:1508174041.158633:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4e400 (tot 21626852).
            00000020:00000001:67.0:1508174041.158634:0:2946:0:(genops.c:853:class_new_export()) Process entered
            00000020:00000010:67.0:1508174041.158636:0:2946:0:(genops.c:855:class_new_export()) kmalloced 'export': 968 at ffff88107f8f0400.
            00000020:00000001:67.0:1508174041.158637:0:2946:0:(lustre_handles.c:64:class_handle_hash()) Process entered
            00000020:00000040:67.0:1508174041.158638:0:2946:0:(lustre_handles.c:98:class_handle_hash()) added object ffff88107f8f0400 with handle 0x36c91a8e859b7cc8 to hash
            00000020:00000001:67.0:1508174041.158639:0:2946:0:(lustre_handles.c:99:class_handle_hash()) Process leaving
            00000020:00000001:67.0:1508174041.158640:0:2946:0:(obd_class.h:991:obd_init_export()) Process entered
            00000020:00000001:67.0:1508174041.158641:0:2946:0:(obd_class.h:995:obd_init_export()) Process leaving (rc=0 : 0 : 0)
            00000020:00000040:67.0:1508174041.158642:0:2946:0:(obd_config.c:753:class_incref()) incref soaked-MDT0002-mdc-ffff88105dcb8800 (ffff88107fdacf60) now 2
            00000020:00000001:67.0:1508174041.158643:0:2946:0:(genops.c:928:class_new_export()) Process leaving (rc=18446612203173774336 : -131870535777280 : ffff88107f8f0400)
            00000020:00000040:67.0:1508174041.158644:0:2946:0:(genops.c:828:class_export_put()) PUTting export ffff88107f8f0400 : new refcount 1
            00000020:00000001:67.0:1508174041.158645:0:2946:0:(obd_class.h:582:obd_setup()) Process entered
            00000002:00000001:67.0:1508174041.158646:0:2946:0:(mdc_request.c:2483:mdc_setup()) Process entered
            00000100:00000001:67.0:1508174041.158647:0:2946:0:(ptlrpcd.c:944:ptlrpcd_addref()) Process entered
            00000100:00000001:67.0:1508174041.158648:0:2946:0:(ptlrpcd.c:953:ptlrpcd_addref()) Process leaving (rc=0 : 0 : 0)
            00010000:00000001:67.0:1508174041.158649:0:2946:0:(ldlm_lib.c:277:client_obd_setup()) Process entered
            00010000:00000010:67.0:1508174041.158651:0:2946:0:(ldlm_lib.c:444:client_obd_setup()) kmalloced 'cli->cl_mod_tag_bitmap': 64 at ffff88083c75df80.
            00010000:00000001:67.0:1508174041.158652:0:2946:0:(ldlm_lockd.c:2788:ldlm_get_ref()) Process entered
            00010000:00000001:67.0:1508174041.158653:0:2946:0:(ldlm_lockd.c:2797:ldlm_get_ref()) Process leaving (rc=0 : 0 : 0)
            00000020:00000010:67.0:1508174041.158654:0:2946:0:(genops.c:1067:class_new_import()) kmalloced 'imp': 1480 at ffff88083e668000.
            00000020:00000040:67.0:1508174041.158655:0:2946:0:(obd_config.c:753:class_incref()) incref soaked-MDT0002-mdc-ffff88105dcb8800 (ffff88107fdacf60) now 3
            00000020:00000001:67.0:1508174041.158657:0:2946:0:(lustre_handles.c:64:class_handle_hash()) Process entered
            00000020:00000040:67.0:1508174041.158658:0:2946:0:(lustre_handles.c:98:class_handle_hash()) added object ffff88083e668000 with handle 0x36c91a8e859b7ccf to hash
            00000020:00000001:67.0:1508174041.158659:0:2946:0:(lustre_handles.c:99:class_handle_hash()) Process leaving
            00000020:00000001:67.0:1508174041.158661:0:2946:0:(genops.c:1030:class_import_put()) Process entered
            00000020:00000040:67.0:1508174041.158661:0:2946:0:(genops.c:1037:class_import_put()) import ffff88083e668000 refcount=1 obd=soaked-MDT0002-mdc-ffff88105dcb8800
            00000020:00000001:67.0:1508174041.158663:0:2946:0:(genops.c:1046:class_import_put()) Process leaving
            00010000:00000001:67.0:1508174041.158664:0:2946:0:(ldlm_lib.c:61:import_set_conn()) Process entered
            00000100:00000200:67.0:1508174041.158666:0:2946:0:(events.c:542:ptlrpc_uuid_to_peer()) 192.168.1.110@o2ib->12345-192.168.1.110@o2ib
            00000100:00000001:67.0:1508174041.158667:0:2946:0:(connection.c:48:ptlrpc_connection_get()) Process entered
            00000400:00000010:67.0:1508174041.158671:0:2946:0:(peer.c:239:lnet_peer_alloc()) alloc '(lp)': 184 at ffff88083e75c900 (tot 21627036).
            00000400:00000200:67.0:1508174041.158673:0:2946:0:(peer.c:261:lnet_peer_alloc()) ffff88083e75c900 nid 192.168.1.110@o2ib
            00000400:00000010:67.0:1508174041.158675:0:2946:0:(peer.c:204:lnet_peer_net_alloc()) alloc '(lpn)': 48 at ffff88105b155080 (tot 21627084).
            00000400:00000200:67.0:1508174041.158676:0:2946:0:(peer.c:212:lnet_peer_net_alloc()) ffff88105b155080 net o2ib
            00000400:00000010:67.0:1508174041.158680:0:2946:0:(peer.c:154:lnet_peer_ni_alloc()) alloc '(lpni)': 336 at ffff88085d7a8200 (tot 21627420).
            00000400:00000200:67.0:1508174041.158682:0:2946:0:(peer.c:194:lnet_peer_ni_alloc()) ffff88085d7a8200 nid 192.168.1.110@o2ib
            00000400:00000200:67.0:1508174041.158685:0:2946:0:(peer.c:1186:lnet_peer_attach_peer_ni()) peer 192.168.1.110@o2ib NID 192.168.1.110@o2ib flags 0x0
            00000400:00000200:67.0:1508174041.158691:0:2946:0:(peer.c:1751:lnet_peer_queue_for_discovery()) Queue peer 192.168.1.110@o2ib: 0
            00000400:00000200:68.0:1508174041.158724:0:2519:0:(peer.c:3030:lnet_peer_discovery_wait_for_work()) woken: 0
            00000400:00000200:68.0:1508174041.158728:0:2519:0:(peer.c:3136:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0x20
            00000400:00000010:68.0:1508174041.158730:0:2519:0:(api-ni.c:1076:lnet_ping_buffer_alloc()) alloc '(pbuf)': 280 at ffff88105c1e3600 (tot 21627700).
            00000400:00000010:68.0:1508174041.158732:0:2519:0:(lib-lnet.h:228:lnet_md_alloc()) slab-alloced 'md' of size 128 at ffff8810537faf00.
            00000400:00000010:68.0:1508174041.158736:0:2519:0:(lib-lnet.h:479:lnet_msg_alloc()) alloc '(msg)': 416 at ffff88105c1e3400 (tot 21628116).
            00000400:00000200:68.0:1508174041.158738:0:2519:0:(lib-move.c:3187:LNetGet()) LNetGet -> 12345-192.168.1.110@o2ib
            00000400:00000200:68.0:1508174041.158742:0:2519:0:(peer.c:856:lnet_peer_ni_set_non_mr_pref_nid()) peer 192.168.1.110@o2ib nid 192.168.1.140@o2ib: 0
            00000400:00000200:68.0:1508174041.158746:0:2519:0:(lib-move.c:2114:lnet_select_pathway()) TRACE: 192.168.1.140@o2ib(192.168.1.140@o2ib:<?>) -> 192.168.1.110@o2ib(192.168.1.110@o2ib:192.168.1.110@o2ib) : GET
            00000800:00000200:68.0:1508174041.158749:0:2519:0:(o2iblnd_cb.c:1511:kiblnd_send()) sending 0 bytes in 0 frags to 12345-192.168.1.110@o2ib
            00000800:00000010:68.0:1508174041.158757:0:2519:0:(o2iblnd.c:326:kiblnd_create_peer()) alloc '(peer_ni)': 112 at ffff88085c21d900 (tot 21628228).
            00000800:00000200:68.0:1508174041.158759:0:2519:0:(o2iblnd_cb.c:1479:kiblnd_launch_tx()) peer_ni[ffff88085c21d900] -> 192.168.1.110@o2ib (1)++
            00000800:00000200:68.0:1508174041.158763:0:2519:0:(o2iblnd_cb.c:1281:kiblnd_connect_peer()) peer_ni[ffff88085c21d900] -> 192.168.1.110@o2ib (2)++
            00000800:00000200:68.0:1508174041.158782:0:2519:0:(o2iblnd_cb.c:1234:kiblnd_resolve_addr()) bound to port 1023
            00000800:00000200:68.0:1508174041.158783:0:2519:0:(o2iblnd_cb.c:1302:kiblnd_connect_peer()) 192.168.1.110@o2ib: connection bound to ib0:192.168.1.140:mlx4_0
            00000800:00000200:68.0:1508174041.158785:0:2519:0:(o2iblnd_cb.c:1486:kiblnd_launch_tx()) peer_ni[ffff88085c21d900] -> 192.168.1.110@o2ib (3)--
            00000400:00000200:68.0:1508174041.158787:0:2519:0:(peer.c:2763:lnet_peer_send_ping()) peer 192.168.1.110@o2ib
            00000400:00000200:68.0:1508174041.158788:0:2519:0:(peer.c:3157:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0x120 rc 0
            00000800:00000200:15.0:1508174041.159054:0:691:0:(o2iblnd_cb.c:2991:kiblnd_cm_callback()) 192.168.1.110@o2ib Addr resolved: 0
            00000800:00000200:14.0:1508174041.159155:0:6:0:(o2iblnd_cb.c:3021:kiblnd_cm_callback()) 192.168.1.110@o2ib Route resolved: 0
            00000800:00000010:14.0:1508174041.159164:0:6:0:(o2iblnd.c:777:kiblnd_create_conn()) alloc '(init_qp_attr)': 96 at ffff8808592e7ba0 (tot 21628324).
            00000800:00000010:14.0:1508174041.159179:0:6:0:(o2iblnd.c:784:kiblnd_create_conn()) alloc '(conn)': 280 at ffff8800748ee000 (tot 21628604).
            00000800:00000010:14.0:1508174041.159188:0:6:0:(o2iblnd.c:808:kiblnd_create_conn()) alloc '(conn->ibc_connvars)': 136 at ffff8808467e26c0 (tot 21628740).
            00000800:00000010:14.0:1508174041.159204:0:6:0:(o2iblnd.c:843:kiblnd_create_conn()) alloc '(conn->ibc_rxs)': 1872 at ffff88085c4f2000 (tot 21630612).
            00000800:00000010:14.0:1508174041.159206:0:6:0:(o2iblnd.c:1242:kiblnd_alloc_pages()) alloc '(p)': 152 at ffff8808467e2480 (tot 21630764).
            00000800:00000200:14.0:1508174041.159214:0:6:0:(o2iblnd.c:1316:kiblnd_map_rx_descs()) rx 0: ffff880841b63000 0x841b63000(0x841b63000)
            00000800:00000200:14.0:1508174041.159217:0:6:0:(o2iblnd.c:1316:kiblnd_map_rx_descs()) rx 1: ffff880859708000 0x859708000(0x859708000)
            ……………………
            00000800:00000200:14.0:1508174041.159237:0:6:0:(o2iblnd.c:1316:kiblnd_map_rx_descs()) rx 16: ffff88083a615000 0x83a615000(0x83a615000)
            00000800:00000200:14.0:1508174041.159238:0:6:0:(o2iblnd.c:1316:kiblnd_map_rx_descs()) rx 17: ffff88083b2d8000 0x83b2d8000(0x83b2d8000)
            00000800:00000200:14.0:1508174041.160476:0:6:0:(o2iblnd.c:915:kiblnd_create_conn()) original send wr 2056, created with 4079
            00000800:00000010:14.0:1508174041.160478:0:6:0:(o2iblnd.c:917:kiblnd_create_conn()) kfreed 'init_qp_attr': 96 at ffff8808592e7ba0 (tot 21630668).
            00000800:00000200:14.0:1508174041.160480:0:6:0:(o2iblnd_cb.c:194:kiblnd_post_rx()) conn[ffff8800748ee000] (19)++
            00000800:00000200:14.0:1508174041.160482:0:6:0:(o2iblnd_cb.c:223:kiblnd_post_rx()) conn[ffff8800748ee000] (20)--
            …………………..
            00000800:00000200:14.0:1508174041.160513:0:6:0:(o2iblnd_cb.c:194:kiblnd_post_rx()) conn[ffff8800748ee000] (19)++
            00000800:00000200:14.0:1508174041.160514:0:6:0:(o2iblnd_cb.c:223:kiblnd_post_rx()) conn[ffff8800748ee000] (20)--
            00000800:00000200:15.0:1508174041.162919:0:691:0:(o2iblnd_cb.c:3088:kiblnd_cm_callback()) ESTABLISHED(active): 192.168.1.110@o2ib
            00000800:00000200:15.0:1508174041.162923:0:691:0:(o2iblnd_cb.c:2123:kiblnd_connreq_done()) 192.168.1.110@o2ib: active(1), version(12), status(0)
            00000800:00000010:15.0:1508174041.162925:0:691:0:(o2iblnd_cb.c:2131:kiblnd_connreq_done()) kfreed 'conn->ibc_connvars': 136 at ffff8808467e26c0 (tot 21630532).
            00000800:00000200:15.0:1508174041.162941:0:691:0:(o2iblnd_cb.c:2150:kiblnd_connreq_done()) conn[ffff8800748ee000] (19)++
            00000800:00000200:15.0:1508174041.162944:0:691:0:(o2iblnd_cb.c:2192:kiblnd_connreq_done()) conn[ffff8800748ee000] (20)++
            00000800:00000200:15.0:1508174041.162945:0:691:0:(o2iblnd_cb.c:1162:kiblnd_queue_tx_locked()) conn[ffff8800748ee000] (21)++
            00000800:00000200:15.0:1508174041.162950:0:691:0:(o2iblnd_cb.c:2212:kiblnd_connreq_done()) conn[ffff8800748ee000] (22)--
            00000800:00000200:20.2F:1508174041.162997:0:0:0:(o2iblnd_cb.c:3516:kiblnd_cq_completion()) conn[ffff8800748ee000] (21)++
            00000800:00000200:20.0F:1508174041.163022:0:2510:0:(o2iblnd_cb.c:3638:kiblnd_scheduler()) conn[ffff8800748ee000] (22)++
            00000800:00000200:20.0:1508174041.163028:0:2510:0:(o2iblnd_cb.c:74:kiblnd_tx_done()) conn[ffff8800748ee000] (23)--
            00000400:00000200:20.0:1508174041.163032:0:2510:0:(peer.c:2326:lnet_discovery_event_handler()) Received event: 5
            00000400:00000200:20.0:1508174041.163036:0:2510:0:(peer.c:2278:lnet_discovery_event_send()) Ping Send to 192.168.1.110@o2ib: 0
            00000400:00000010:20.0:1508174041.163040:0:2510:0:(lib-lnet.h:489:lnet_msg_free()) kfreed 'msg': 416 at ffff88105c1e3400 (tot 21630116).
            00000800:00000200:20.0:1508174041.163043:0:2510:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (22)--
            00000800:00000200:20.0:1508174041.163045:0:2510:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (21)--
            00000800:00000200:20.2:1508174041.163169:0:0:0:(o2iblnd_cb.c:3516:kiblnd_cq_completion()) conn[ffff8800748ee000] (20)++
            00000800:00000200:20.0:1508174041.163176:0:2511:0:(o2iblnd_cb.c:3638:kiblnd_scheduler()) conn[ffff8800748ee000] (21)++
            00000800:00000200:20.0:1508174041.163182:0:2511:0:(o2iblnd_cb.c:326:kiblnd_handle_rx()) Received d1[1] from 192.168.1.110@o2ib
            00000400:00000200:20.0:1508174041.163187:0:2511:0:(lib-move.c:2599:lnet_parse()) TRACE: 192.168.1.140@o2ib(192.168.1.140@o2ib) <- 192.168.1.110@o2ib : REPLY - for me
            00000400:00000010:20.0:1508174041.163191:0:2511:0:(lib-lnet.h:479:lnet_msg_alloc()) alloc '(msg)': 416 at ffff88085cab6400 (tot 21630532).
            00000400:00000200:20.0:1508174041.163198:0:2511:0:(lib-move.c:2366:lnet_parse_reply()) 192.168.1.140@o2ib: Reply from 12345-192.168.1.110@o2ib of length 48/48 into md 0xad
            00000400:00000200:20.0:1508174041.163202:0:2511:0:(peer.c:2326:lnet_discovery_event_handler()) Received event: 3
            00000400:00000200:20.0:1508174041.163205:0:2511:0:(peer.c:881:lnet_peer_ni_clr_non_mr_pref_nid()) peer 192.168.1.110@o2ib: 0
            00000400:00000200:20.0:1508174041.163209:0:2511:0:(peer.c:2237:lnet_discovery_event_reply()) peer 192.168.1.110@o2ib data present 1
            00000400:00000200:20.0:1508174041.163214:0:2511:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff8810537faf00
            00000400:00000010:20.0:1508174041.163216:0:2511:0:(lib-lnet.h:259:lnet_md_free()) slab-freed 'md' at ffff8810537faf00.
            00000800:00000200:47.0:1508174041.163218:0:2508:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (22)--
            00000400:00000010:20.0:1508174041.163219:0:2511:0:(lib-lnet.h:489:lnet_msg_free()) kfreed 'msg': 416 at ffff88085cab6400 (tot 21630116).
            00000800:00000200:20.0:1508174041.163221:0:2511:0:(o2iblnd_cb.c:194:kiblnd_post_rx()) conn[ffff8800748ee000] (21)++
            00000800:00000200:20.0:1508174041.163224:0:2511:0:(o2iblnd_cb.c:223:kiblnd_post_rx()) conn[ffff8800748ee000] (22)--
            00000800:00000200:20.0:1508174041.163225:0:2511:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (21)--
            00000400:00000200:68.0:1508174041.163252:0:2519:0:(peer.c:3030:lnet_peer_discovery_wait_for_work()) woken: 0
            00000400:00000200:68.0:1508174041.163257:0:2519:0:(peer.c:3136:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0x61
            00000400:00000010:68.0:1508174041.163259:0:2519:0:(peer.c:2399:lnet_peer_merge_data()) alloc '(curnis)': 16 at ffff88105a776630 (tot 21630132).
            00000400:00000010:68.0:1508174041.163261:0:2519:0:(peer.c:2400:lnet_peer_merge_data()) alloc '(addnis)': 16 at ffff88105a776640 (tot 21630148).
            00000400:00000010:68.0:1508174041.163262:0:2519:0:(peer.c:2401:lnet_peer_merge_data()) alloc '(delnis)': 16 at ffff88105a776650 (tot 21630164).
            00000400:00000010:68.0:1508174041.163263:0:2519:0:(peer.c:2470:lnet_peer_merge_data()) kfreed 'curnis': 16 at ffff88105a776630 (tot 21630148).
            00000400:00000010:68.0:1508174041.163264:0:2519:0:(peer.c:2471:lnet_peer_merge_data()) kfreed 'addnis': 16 at ffff88105a776640 (tot 21630132).
            00000400:00000010:68.0:1508174041.163265:0:2519:0:(peer.c:2472:lnet_peer_merge_data()) kfreed 'delnis': 16 at ffff88105a776650 (tot 21630116).
            00000400:00000010:68.0:1508174041.163266:0:2519:0:(api-ni.c:1089:lnet_ping_buffer_free()) kfreed 'pbuf': 280 at ffff88105c1e3600 (tot 21629836).
            00000400:00000200:68.0:1508174041.163268:0:2519:0:(peer.c:2474:lnet_peer_merge_data()) peer 192.168.1.110@o2ib: 0
            00000400:00000200:68.0:1508174041.163269:0:2519:0:(peer.c:2634:lnet_peer_data_present()) peer 192.168.1.110@o2ib: 0
            00000400:00000200:68.0:1508174041.163271:0:2519:0:(peer.c:3157:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0xa1 rc 1
            00000400:00000200:68.0:1508174041.163274:0:2519:0:(peer.c:3136:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0xa1
            00000400:00000010:68.0:1508174041.163275:0:2519:0:(lib-lnet.h:228:lnet_md_alloc()) slab-alloced 'md' of size 128 at ffff88085c21d500.
            00000400:00000010:68.0:1508174041.163279:0:2519:0:(lib-lnet.h:479:lnet_msg_alloc()) alloc '(msg)': 416 at ffff88105c1e3600 (tot 21630252).
            00000400:00000200:68.0:1508174041.163281:0:2519:0:(lib-move.c:2984:LNetPut()) LNetPut -> 12345-192.168.1.110@o2ib
            00000400:00000200:68.0:1508174041.163288:0:2519:0:(lib-move.c:2114:lnet_select_pathway()) TRACE: 192.168.1.140@o2ib(192.168.1.140@o2ib:<?>) -> 192.168.1.110@o2ib(192.168.1.110@o2ib:192.168.1.110@o2ib) : PUT
            00000800:00000200:68.0:1508174041.163291:0:2519:0:(o2iblnd_cb.c:1511:kiblnd_send()) sending 48 bytes in 1 frags to 12345-192.168.1.110@o2ib
            00000800:00000200:68.0:1508174041.163296:0:2519:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff88085c21d900] -> 192.168.1.110@o2ib (2) version: 12
            00000800:00000200:68.0:1508174041.163298:0:2519:0:(o2iblnd_cb.c:1392:kiblnd_launch_tx()) conn[ffff8800748ee000] (20)++
            00000800:00000200:68.0:1508174041.163299:0:2519:0:(o2iblnd_cb.c:1162:kiblnd_queue_tx_locked()) conn[ffff8800748ee000] (21)++
            00000800:00000200:68.0:1508174041.163303:0:2519:0:(o2iblnd_cb.c:1398:kiblnd_launch_tx()) conn[ffff8800748ee000] (22)--
            00000400:00000200:68.0:1508174041.163304:0:2519:0:(peer.c:2869:lnet_peer_send_push()) peer 192.168.1.110@o2ib
            00000400:00000200:68.0:1508174041.163305:0:2519:0:(peer.c:3157:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0x2a1 rc 0
            00000800:00000200:20.2:1508174041.163309:0:0:0:(o2iblnd_cb.c:3516:kiblnd_cq_completion()) conn[ffff8800748ee000] (21)++
            00000800:00000200:20.0:1508174041.163315:0:2510:0:(o2iblnd_cb.c:3638:kiblnd_scheduler()) conn[ffff8800748ee000] (22)++
            00000800:00000200:20.0:1508174041.163318:0:2510:0:(o2iblnd_cb.c:74:kiblnd_tx_done()) conn[ffff8800748ee000] (23)--
            00000400:00000200:20.0:1508174041.163320:0:2510:0:(peer.c:2326:lnet_discovery_event_handler()) Received event: 5
            00000400:00000200:20.0:1508174041.163323:0:2510:0:(peer.c:2278:lnet_discovery_event_send()) Push Send to 192.168.1.110@o2ib: 0
            00000400:00000010:20.0:1508174041.163326:0:2510:0:(lib-lnet.h:489:lnet_msg_free()) kfreed 'msg': 416 at ffff88105c1e3600 (tot 21629836).
            00000800:00000200:20.0:1508174041.163328:0:2510:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (22)--
            00000800:00000200:20.0:1508174041.163330:0:2510:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (21)--
            00000800:00000200:20.2:1508174041.163475:0:0:0:(o2iblnd_cb.c:3516:kiblnd_cq_completion()) conn[ffff8800748ee000] (20)++
            00000800:00000200:20.0:1508174041.163481:0:2508:0:(o2iblnd_cb.c:3638:kiblnd_scheduler()) conn[ffff8800748ee000] (21)++
            00000800:00000200:20.0:1508174041.163485:0:2508:0:(o2iblnd_cb.c:326:kiblnd_handle_rx()) Received d1[0] from 192.168.1.110@o2ib
            00000400:00000200:20.0:1508174041.163489:0:2508:0:(lib-move.c:2599:lnet_parse()) TRACE: 192.168.1.140@o2ib(192.168.1.140@o2ib) <- 192.168.1.110@o2ib : ACK - for me
            00000400:00000010:20.0:1508174041.163491:0:2508:0:(lib-lnet.h:479:lnet_msg_alloc()) alloc '(msg)': 416 at ffff88085cab6400 (tot 21630252).
            00000400:00000200:20.0:1508174041.163496:0:2508:0:(lib-move.c:2419:lnet_parse_ack()) 192.168.1.140@o2ib: ACK from 12345-192.168.1.110@o2ib into md 0xb5
            00000400:00000200:20.0:1508174041.163498:0:2508:0:(peer.c:2326:lnet_discovery_event_handler()) Received event: 4
            00000400:00000200:20.0:1508174041.163500:0:2508:0:(peer.c:2111:lnet_discovery_event_ack()) peer 192.168.1.110@o2ib ev->status 0
            00000400:00000200:20.0:1508174041.163505:0:2508:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff88085c21d500
            00000400:00000010:20.0:1508174041.163506:0:2508:0:(lib-lnet.h:259:lnet_md_free()) slab-freed 'md' at ffff88085c21d500.
            00000400:00000010:20.0:1508174041.163508:0:2508:0:(lib-lnet.h:489:lnet_msg_free()) kfreed 'msg': 416 at ffff88085cab6400 (tot 21629836).
            00000800:00000200:20.0:1508174041.163510:0:2508:0:(o2iblnd_cb.c:194:kiblnd_post_rx()) conn[ffff8800748ee000] (22)++
            00000800:00000200:20.0:1508174041.163512:0:2508:0:(o2iblnd_cb.c:223:kiblnd_post_rx()) conn[ffff8800748ee000] (23)--
            00000800:00000200:20.0:1508174041.163513:0:2508:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (22)--
            00000800:00000200:20.0:1508174041.163515:0:2508:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (21)--
            00000800:00000200:20.2:1508174041.163551:0:0:0:(o2iblnd_cb.c:3516:kiblnd_cq_completion()) conn[ffff8800748ee000] (20)++
            00000800:00000200:20.0:1508174041.163556:0:2510:0:(o2iblnd_cb.c:3638:kiblnd_scheduler()) conn[ffff8800748ee000] (21)++
            00000800:00000200:20.0:1508174041.163560:0:2510:0:(o2iblnd_cb.c:326:kiblnd_handle_rx()) Received d1[1] from 192.168.1.110@o2ib
            00000400:00000200:20.0:1508174041.163563:0:2510:0:(lib-move.c:2599:lnet_parse()) TRACE: 192.168.1.140@o2ib(192.168.1.140@o2ib) <- 192.168.1.110@o2ib : PUT - for me
            00000400:00000010:20.0:1508174041.163565:0:2510:0:(lib-lnet.h:479:lnet_msg_alloc()) alloc '(msg)': 416 at ffff88085cab6400 (tot 21630252).
            00000800:00000200:4.0:1508174041.163565:0:2509:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (22)--
            00000400:00000200:20.0:1508174041.163569:0:2510:0:(lib-ptl.c:571:lnet_ptl_match_md()) Request from 12345-192.168.1.110@o2ib of length 48 into portal 0 MB=0x8000000000000000
            00000400:00000200:20.0:1508174041.163575:0:2510:0:(lib-ptl.c:200:lnet_try_match_md()) Incoming put index 0 from 12345-192.168.1.110@o2ib of length 48/48 into md 0x9 [1] + 0
            00000400:00000200:68.0:1508174041.163577:0:2519:0:(peer.c:3030:lnet_peer_discovery_wait_for_work()) woken: 0
            00000400:00000200:20.0:1508174041.163581:0:2510:0:(peer.c:1939:lnet_peer_push_event()) Stale Push from 192.168.1.110@o2ib: got 1 have 1
            00000400:00000200:68.0:1508174041.163585:0:2519:0:(peer.c:3136:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0xa1
            00000400:00000200:68.0:1508174041.163588:0:2519:0:(peer.c:2917:lnet_peer_discovered()) peer 192.168.1.110@o2ib
            00000400:00000200:68.0:1508174041.163589:0:2519:0:(peer.c:3157:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0x89 rc 0
            00000400:00000200:20.0:1508174041.163592:0:2510:0:(peer.c:1751:lnet_peer_queue_for_discovery()) Queue peer 192.168.1.110@o2ib: -114
            00000400:00000200:20.0:1508174041.163603:0:2510:0:(lib-move.c:2114:lnet_select_pathway()) TRACE: 192.168.1.140@o2ib(192.168.1.140@o2ib:192.168.1.140@o2ib) -> 192.168.1.110@o2ib(192.168.1.110@o2ib:192.168.1.110@o2ib) : ACK
            00000800:00000200:20.0:1508174041.163608:0:2510:0:(o2iblnd_cb.c:1511:kiblnd_send()) sending 0 bytes in 1 frags to 12345-192.168.1.110@o2ib
            00000800:00000200:20.0:1508174041.163613:0:2510:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff88085c21d900] -> 192.168.1.110@o2ib (2) version: 12
            00000800:00000200:20.0:1508174041.163616:0:2510:0:(o2iblnd_cb.c:1392:kiblnd_launch_tx()) conn[ffff8800748ee000] (21)++
            00000800:00000200:20.0:1508174041.163618:0:2510:0:(o2iblnd_cb.c:1162:kiblnd_queue_tx_locked()) conn[ffff8800748ee000] (22)++
            00000800:00000200:20.0:1508174041.163621:0:2510:0:(o2iblnd_cb.c:1398:kiblnd_launch_tx()) conn[ffff8800748ee000] (23)--
            00000800:00000200:20.0:1508174041.163622:0:2510:0:(o2iblnd_cb.c:194:kiblnd_post_rx()) conn[ffff8800748ee000] (22)++
            00000800:00000200:20.2:1508174041.163625:0:2510:0:(o2iblnd_cb.c:3516:kiblnd_cq_completion()) conn[ffff8800748ee000] (22)++
            00000800:00000200:20.0:1508174041.163630:0:2510:0:(o2iblnd_cb.c:223:kiblnd_post_rx()) conn[ffff8800748ee000] (24)--
            00000800:00000200:20.0:1508174041.163631:0:2510:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (23)--
            00000800:00000200:20.0:1508174041.163633:0:2510:0:(o2iblnd_cb.c:3638:kiblnd_scheduler()) conn[ffff8800748ee000] (22)++
            00000800:00000200:20.0:1508174041.163638:0:2510:0:(o2iblnd_cb.c:74:kiblnd_tx_done()) conn[ffff8800748ee000] (23)--
            00000400:00000010:20.0:1508174041.163640:0:2510:0:(lib-lnet.h:489:lnet_msg_free()) kfreed 'msg': 416 at ffff88085cab6400 (tot 21629836).
            00000800:00000200:20.0:1508174041.163641:0:2510:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (22)--
            

Here, PID 2946 should be the forked llog parser thread for the soaked-client config, PID 2519 the LNet discovery thread, and PIDs 2508/2510 ko2iblnd scheduler threads.

It looks like there is a race between the LNet discovery thread's handling of the ACK for the PUT message it previously sent to 192.168.1.110@o2ib, and the ko2iblnd scheduler thread (PID 2510) handling a subsequent "stale" PUT message received from 192.168.1.110@o2ib (unexpected, or perhaps due to Multi-Rail?). This causes the LNET_PEER_DISCOVERING flag to be set again and the peer to be kept in the discovery process/loop, but very likely on the ln_dc_working queue instead of the ln_dc_request one, so it is out of the discovery thread's control and would probably only have been timed out later.

            bfaccini Bruno Faccini (Inactive) added a comment - - edited The attached Lustre debug log indicates that during the mount process, Client is fetching the soaked-client config LLOG from MGS and running the different config steps according to its content. And it seems to be stuck during the LNet peer discovery of 192.168.1.110@o2ib NID for soaked-MDT0002-mdc import creation : 00000040:00001000:67.0:1508174041.158589:0:2946:0:(llog.c:513:llog_process_thread()) processing rec 0xffff8808f39f5698 type 0x10620000 00000040:00001000:67.0:1508174041.158590:0:2946:0:(llog.c:519:llog_process_thread()) after swabbing, type=0x10620000 idx=38 00000040:00001000:67.0:1508174041.158591:0:2946:0:(llog.c:595:llog_process_thread()) lrh_index: 38 lrh_len: 144 (2408 remains) 00000020:00000001:67.0:1508174041.158592:0:2946:0:(obd_config.c:1496:class_config_llog_handler()) Process entered 00000020:00000010:67.0:1508174041.158592:0:2946:0:(obd_config.c:1629:class_config_llog_handler()) kmalloced 'inst_name': 39 at ffff88083c75d040. 00000020:01000000:67.0:1508174041.158594:0:2946:0:(obd_config.c:1637:class_config_llog_handler()) cmd cf003, instance name: soaked-MDT0002-mdc-ffff88105dcb8800 00000020:00000010:67.0:1508174041.158595:0:2946:0:(obd_config.c:1707:class_config_llog_handler()) kmalloced 'lcfg_new': 136 at ffff88103f70e900. 00000020:00000080:67.0:1508174041.158596:0:2946:0:(obd_config.c:1180:class_process_config()) processing cmd: cf003 00000020:00000001:67.0:1508174041.158597:0:2946:0:(obd_config.c:479:class_setup()) Process entered 00000001:00000001:67.0:1508174041.158598:0:2946:0:(hash.c:1005:cfs_hash_create()) Process entered 00000001:00000010:67.0:1508174041.158599:0:2946:0:(hash.c:1032:cfs_hash_create()) alloc '(hs)': 144 at ffff88103f70e840 (tot 21621812). 00000001:00000010:67.0:1508174041.158600:0:2946:0:(hash.c:893:cfs_hash_buckets_realloc()) alloc '(new_bkts)': 32 at ffff88104f3f1580 (tot 21621844). 
00000001:00000010:67.0:1508174041.158602:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4c600 (tot 21622124). 00000001:00000010:67.0:1508174041.158604:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4c800 (tot 21622404). 00000001:00000010:67.0:1508174041.158605:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4ca00 (tot 21622684). 00000001:00000010:67.0:1508174041.158607:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4cc00 (tot 21622964). 00000001:00000001:67.0:1508174041.158608:0:2946:0:(hash.c:1005:cfs_hash_create()) Process entered 00000001:00000010:67.0:1508174041.158609:0:2946:0:(hash.c:1032:cfs_hash_create()) alloc '(hs)': 144 at ffff88103f70e780 (tot 21623108). 00000001:00000010:67.0:1508174041.158610:0:2946:0:(hash.c:893:cfs_hash_buckets_realloc()) alloc '(new_bkts)': 32 at ffff88104f3f15a0 (tot 21623140). 00000001:00000010:67.0:1508174041.158611:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4ce00 (tot 21623420). 00000001:00000010:67.0:1508174041.158613:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4d000 (tot 21623700). 00000001:00000010:67.0:1508174041.158614:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4d200 (tot 21623980). 00000001:00000010:67.0:1508174041.158616:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4d400 (tot 21624260). 00000001:00000001:67.0:1508174041.158617:0:2946:0:(hash.c:1005:cfs_hash_create()) Process entered 00000001:00000010:67.0:1508174041.158617:0:2946:0:(hash.c:1032:cfs_hash_create()) alloc '(hs)': 144 at ffff88103f70e6c0 (tot 21624404). 00000001:00000010:67.0:1508174041.158619:0:2946:0:(hash.c:893:cfs_hash_buckets_realloc()) alloc '(new_bkts)': 32 at ffff88104f3f15e0 (tot 21624436). 
00000001:00000010:67.0:1508174041.158620:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4d600 (tot 21624716). 00000001:00000010:67.0:1508174041.158622:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4d800 (tot 21624996). 00000001:00000010:67.0:1508174041.158623:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4da00 (tot 21625276). 00000001:00000010:67.0:1508174041.158625:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4dc00 (tot 21625556). 00000001:00000001:67.0:1508174041.158626:0:2946:0:(hash.c:1005:cfs_hash_create()) Process entered 00000001:00000010:67.0:1508174041.158626:0:2946:0:(hash.c:1032:cfs_hash_create()) alloc '(hs)': 144 at ffff88103f70e600 (tot 21625700). 00000001:00000010:67.0:1508174041.158627:0:2946:0:(hash.c:893:cfs_hash_buckets_realloc()) alloc '(new_bkts)': 32 at ffff88104f3f1600 (tot 21625732). 00000001:00000010:67.0:1508174041.158629:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4de00 (tot 21626012). 00000001:00000010:67.0:1508174041.158630:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4e000 (tot 21626292). 00000001:00000010:67.0:1508174041.158631:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4e200 (tot 21626572). 00000001:00000010:67.0:1508174041.158633:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc '(new_bkts[i])': 280 at ffff88107fd4e400 (tot 21626852). 00000020:00000001:67.0:1508174041.158634:0:2946:0:(genops.c:853:class_new_export()) Process entered 00000020:00000010:67.0:1508174041.158636:0:2946:0:(genops.c:855:class_new_export()) kmalloced 'export': 968 at ffff88107f8f0400. 
00000020:00000001:67.0:1508174041.158637:0:2946:0:(lustre_handles.c:64:class_handle_hash()) Process entered 00000020:00000040:67.0:1508174041.158638:0:2946:0:(lustre_handles.c:98:class_handle_hash()) added object ffff88107f8f0400 with handle 0x36c91a8e859b7cc8 to hash 00000020:00000001:67.0:1508174041.158639:0:2946:0:(lustre_handles.c:99:class_handle_hash()) Process leaving 00000020:00000001:67.0:1508174041.158640:0:2946:0:(obd_class.h:991:obd_init_export()) Process entered 00000020:00000001:67.0:1508174041.158641:0:2946:0:(obd_class.h:995:obd_init_export()) Process leaving (rc=0 : 0 : 0) 00000020:00000040:67.0:1508174041.158642:0:2946:0:(obd_config.c:753:class_incref()) incref soaked-MDT0002-mdc-ffff88105dcb8800 (ffff88107fdacf60) now 2 00000020:00000001:67.0:1508174041.158643:0:2946:0:(genops.c:928:class_new_export()) Process leaving (rc=18446612203173774336 : -131870535777280 : ffff88107f8f0400) 00000020:00000040:67.0:1508174041.158644:0:2946:0:(genops.c:828:class_export_put()) PUTting export ffff88107f8f0400 : new refcount 1 00000020:00000001:67.0:1508174041.158645:0:2946:0:(obd_class.h:582:obd_setup()) Process entered 00000002:00000001:67.0:1508174041.158646:0:2946:0:(mdc_request.c:2483:mdc_setup()) Process entered 00000100:00000001:67.0:1508174041.158647:0:2946:0:(ptlrpcd.c:944:ptlrpcd_addref()) Process entered 00000100:00000001:67.0:1508174041.158648:0:2946:0:(ptlrpcd.c:953:ptlrpcd_addref()) Process leaving (rc=0 : 0 : 0) 00010000:00000001:67.0:1508174041.158649:0:2946:0:(ldlm_lib.c:277:client_obd_setup()) Process entered 00010000:00000010:67.0:1508174041.158651:0:2946:0:(ldlm_lib.c:444:client_obd_setup()) kmalloced 'cli->cl_mod_tag_bitmap': 64 at ffff88083c75df80. 
00010000:00000001:67.0:1508174041.158652:0:2946:0:(ldlm_lockd.c:2788:ldlm_get_ref()) Process entered 00010000:00000001:67.0:1508174041.158653:0:2946:0:(ldlm_lockd.c:2797:ldlm_get_ref()) Process leaving (rc=0 : 0 : 0) 00000020:00000010:67.0:1508174041.158654:0:2946:0:(genops.c:1067:class_new_import()) kmalloced 'imp': 1480 at ffff88083e668000. 00000020:00000040:67.0:1508174041.158655:0:2946:0:(obd_config.c:753:class_incref()) incref soaked-MDT0002-mdc-ffff88105dcb8800 (ffff88107fdacf60) now 3 00000020:00000001:67.0:1508174041.158657:0:2946:0:(lustre_handles.c:64:class_handle_hash()) Process entered 00000020:00000040:67.0:1508174041.158658:0:2946:0:(lustre_handles.c:98:class_handle_hash()) added object ffff88083e668000 with handle 0x36c91a8e859b7ccf to hash 00000020:00000001:67.0:1508174041.158659:0:2946:0:(lustre_handles.c:99:class_handle_hash()) Process leaving 00000020:00000001:67.0:1508174041.158661:0:2946:0:(genops.c:1030:class_import_put()) Process entered 00000020:00000040:67.0:1508174041.158661:0:2946:0:(genops.c:1037:class_import_put()) import ffff88083e668000 refcount=1 obd=soaked-MDT0002-mdc-ffff88105dcb8800 00000020:00000001:67.0:1508174041.158663:0:2946:0:(genops.c:1046:class_import_put()) Process leaving 00010000:00000001:67.0:1508174041.158664:0:2946:0:(ldlm_lib.c:61:import_set_conn()) Process entered 00000100:00000200:67.0:1508174041.158666:0:2946:0:(events.c:542:ptlrpc_uuid_to_peer()) 192.168.1.110@o2ib->12345-192.168.1.110@o2ib 00000100:00000001:67.0:1508174041.158667:0:2946:0:(connection.c:48:ptlrpc_connection_get()) Process entered 00000400:00000010:67.0:1508174041.158671:0:2946:0:(peer.c:239:lnet_peer_alloc()) alloc '(lp)': 184 at ffff88083e75c900 (tot 21627036). 00000400:00000200:67.0:1508174041.158673:0:2946:0:(peer.c:261:lnet_peer_alloc()) ffff88083e75c900 nid 192.168.1.110@o2ib 00000400:00000010:67.0:1508174041.158675:0:2946:0:(peer.c:204:lnet_peer_net_alloc()) alloc '(lpn)': 48 at ffff88105b155080 (tot 21627084). 
00000400:00000200:67.0:1508174041.158676:0:2946:0:(peer.c:212:lnet_peer_net_alloc()) ffff88105b155080 net o2ib 00000400:00000010:67.0:1508174041.158680:0:2946:0:(peer.c:154:lnet_peer_ni_alloc()) alloc '(lpni)': 336 at ffff88085d7a8200 (tot 21627420). 00000400:00000200:67.0:1508174041.158682:0:2946:0:(peer.c:194:lnet_peer_ni_alloc()) ffff88085d7a8200 nid 192.168.1.110@o2ib 00000400:00000200:67.0:1508174041.158685:0:2946:0:(peer.c:1186:lnet_peer_attach_peer_ni()) peer 192.168.1.110@o2ib NID 192.168.1.110@o2ib flags 0x0 00000400:00000200:67.0:1508174041.158691:0:2946:0:(peer.c:1751:lnet_peer_queue_for_discovery()) Queue peer 192.168.1.110@o2ib: 0 00000400:00000200:68.0:1508174041.158724:0:2519:0:(peer.c:3030:lnet_peer_discovery_wait_for_work()) woken: 0 00000400:00000200:68.0:1508174041.158728:0:2519:0:(peer.c:3136:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0x20 00000400:00000010:68.0:1508174041.158730:0:2519:0:(api-ni.c:1076:lnet_ping_buffer_alloc()) alloc '(pbuf)': 280 at ffff88105c1e3600 (tot 21627700). 00000400:00000010:68.0:1508174041.158732:0:2519:0:(lib-lnet.h:228:lnet_md_alloc()) slab-alloced 'md' of size 128 at ffff8810537faf00. 00000400:00000010:68.0:1508174041.158736:0:2519:0:(lib-lnet.h:479:lnet_msg_alloc()) alloc '(msg)': 416 at ffff88105c1e3400 (tot 21628116). 
00000400:00000200:68.0:1508174041.158738:0:2519:0:(lib-move.c:3187:LNetGet()) LNetGet -> 12345-192.168.1.110@o2ib 00000400:00000200:68.0:1508174041.158742:0:2519:0:(peer.c:856:lnet_peer_ni_set_non_mr_pref_nid()) peer 192.168.1.110@o2ib nid 192.168.1.140@o2ib: 0 00000400:00000200:68.0:1508174041.158746:0:2519:0:(lib-move.c:2114:lnet_select_pathway()) TRACE: 192.168.1.140@o2ib(192.168.1.140@o2ib:<?>) -> 192.168.1.110@o2ib(192.168.1.110@o2ib:192.168.1.110@o2ib) : GET 00000800:00000200:68.0:1508174041.158749:0:2519:0:(o2iblnd_cb.c:1511:kiblnd_send()) sending 0 bytes in 0 frags to 12345-192.168.1.110@o2ib 00000800:00000010:68.0:1508174041.158757:0:2519:0:(o2iblnd.c:326:kiblnd_create_peer()) alloc '(peer_ni)': 112 at ffff88085c21d900 (tot 21628228). 00000800:00000200:68.0:1508174041.158759:0:2519:0:(o2iblnd_cb.c:1479:kiblnd_launch_tx()) peer_ni[ffff88085c21d900] -> 192.168.1.110@o2ib (1)++ 00000800:00000200:68.0:1508174041.158763:0:2519:0:(o2iblnd_cb.c:1281:kiblnd_connect_peer()) peer_ni[ffff88085c21d900] -> 192.168.1.110@o2ib (2)++ 00000800:00000200:68.0:1508174041.158782:0:2519:0:(o2iblnd_cb.c:1234:kiblnd_resolve_addr()) bound to port 1023 00000800:00000200:68.0:1508174041.158783:0:2519:0:(o2iblnd_cb.c:1302:kiblnd_connect_peer()) 192.168.1.110@o2ib: connection bound to ib0:192.168.1.140:mlx4_0 00000800:00000200:68.0:1508174041.158785:0:2519:0:(o2iblnd_cb.c:1486:kiblnd_launch_tx()) peer_ni[ffff88085c21d900] -> 192.168.1.110@o2ib (3)-- 00000400:00000200:68.0:1508174041.158787:0:2519:0:(peer.c:2763:lnet_peer_send_ping()) peer 192.168.1.110@o2ib 00000400:00000200:68.0:1508174041.158788:0:2519:0:(peer.c:3157:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0x120 rc 0 00000800:00000200:15.0:1508174041.159054:0:691:0:(o2iblnd_cb.c:2991:kiblnd_cm_callback()) 192.168.1.110@o2ib Addr resolved: 0 00000800:00000200:14.0:1508174041.159155:0:6:0:(o2iblnd_cb.c:3021:kiblnd_cm_callback()) 192.168.1.110@o2ib Route resolved: 0 
00000800:00000010:14.0:1508174041.159164:0:6:0:(o2iblnd.c:777:kiblnd_create_conn()) alloc '(init_qp_attr)': 96 at ffff8808592e7ba0 (tot 21628324).
00000800:00000010:14.0:1508174041.159179:0:6:0:(o2iblnd.c:784:kiblnd_create_conn()) alloc '(conn)': 280 at ffff8800748ee000 (tot 21628604).
00000800:00000010:14.0:1508174041.159188:0:6:0:(o2iblnd.c:808:kiblnd_create_conn()) alloc '(conn->ibc_connvars)': 136 at ffff8808467e26c0 (tot 21628740).
00000800:00000010:14.0:1508174041.159204:0:6:0:(o2iblnd.c:843:kiblnd_create_conn()) alloc '(conn->ibc_rxs)': 1872 at ffff88085c4f2000 (tot 21630612).
00000800:00000010:14.0:1508174041.159206:0:6:0:(o2iblnd.c:1242:kiblnd_alloc_pages()) alloc '(p)': 152 at ffff8808467e2480 (tot 21630764).
00000800:00000200:14.0:1508174041.159214:0:6:0:(o2iblnd.c:1316:kiblnd_map_rx_descs()) rx 0: ffff880841b63000 0x841b63000(0x841b63000)
00000800:00000200:14.0:1508174041.159217:0:6:0:(o2iblnd.c:1316:kiblnd_map_rx_descs()) rx 1: ffff880859708000 0x859708000(0x859708000)
……………………
00000800:00000200:14.0:1508174041.159237:0:6:0:(o2iblnd.c:1316:kiblnd_map_rx_descs()) rx 16: ffff88083a615000 0x83a615000(0x83a615000)
00000800:00000200:14.0:1508174041.159238:0:6:0:(o2iblnd.c:1316:kiblnd_map_rx_descs()) rx 17: ffff88083b2d8000 0x83b2d8000(0x83b2d8000)
00000800:00000200:14.0:1508174041.160476:0:6:0:(o2iblnd.c:915:kiblnd_create_conn()) original send wr 2056, created with 4079
00000800:00000010:14.0:1508174041.160478:0:6:0:(o2iblnd.c:917:kiblnd_create_conn()) kfreed 'init_qp_attr': 96 at ffff8808592e7ba0 (tot 21630668).
00000800:00000200:14.0:1508174041.160480:0:6:0:(o2iblnd_cb.c:194:kiblnd_post_rx()) conn[ffff8800748ee000] (19)++
00000800:00000200:14.0:1508174041.160482:0:6:0:(o2iblnd_cb.c:223:kiblnd_post_rx()) conn[ffff8800748ee000] (20)--
…………………..
00000800:00000200:14.0:1508174041.160513:0:6:0:(o2iblnd_cb.c:194:kiblnd_post_rx()) conn[ffff8800748ee000] (19)++
00000800:00000200:14.0:1508174041.160514:0:6:0:(o2iblnd_cb.c:223:kiblnd_post_rx()) conn[ffff8800748ee000] (20)--
00000800:00000200:15.0:1508174041.162919:0:691:0:(o2iblnd_cb.c:3088:kiblnd_cm_callback()) ESTABLISHED(active): 192.168.1.110@o2ib
00000800:00000200:15.0:1508174041.162923:0:691:0:(o2iblnd_cb.c:2123:kiblnd_connreq_done()) 192.168.1.110@o2ib: active(1), version(12), status(0)
00000800:00000010:15.0:1508174041.162925:0:691:0:(o2iblnd_cb.c:2131:kiblnd_connreq_done()) kfreed 'conn->ibc_connvars': 136 at ffff8808467e26c0 (tot 21630532).
00000800:00000200:15.0:1508174041.162941:0:691:0:(o2iblnd_cb.c:2150:kiblnd_connreq_done()) conn[ffff8800748ee000] (19)++
00000800:00000200:15.0:1508174041.162944:0:691:0:(o2iblnd_cb.c:2192:kiblnd_connreq_done()) conn[ffff8800748ee000] (20)++
00000800:00000200:15.0:1508174041.162945:0:691:0:(o2iblnd_cb.c:1162:kiblnd_queue_tx_locked()) conn[ffff8800748ee000] (21)++
00000800:00000200:15.0:1508174041.162950:0:691:0:(o2iblnd_cb.c:2212:kiblnd_connreq_done()) conn[ffff8800748ee000] (22)--
00000800:00000200:20.2F:1508174041.162997:0:0:0:(o2iblnd_cb.c:3516:kiblnd_cq_completion()) conn[ffff8800748ee000] (21)++
00000800:00000200:20.0F:1508174041.163022:0:2510:0:(o2iblnd_cb.c:3638:kiblnd_scheduler()) conn[ffff8800748ee000] (22)++
00000800:00000200:20.0:1508174041.163028:0:2510:0:(o2iblnd_cb.c:74:kiblnd_tx_done()) conn[ffff8800748ee000] (23)--
00000400:00000200:20.0:1508174041.163032:0:2510:0:(peer.c:2326:lnet_discovery_event_handler()) Received event: 5
00000400:00000200:20.0:1508174041.163036:0:2510:0:(peer.c:2278:lnet_discovery_event_send()) Ping Send to 192.168.1.110@o2ib: 0
00000400:00000010:20.0:1508174041.163040:0:2510:0:(lib-lnet.h:489:lnet_msg_free()) kfreed 'msg': 416 at ffff88105c1e3400 (tot 21630116).
00000800:00000200:20.0:1508174041.163043:0:2510:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (22)--
00000800:00000200:20.0:1508174041.163045:0:2510:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (21)--
00000800:00000200:20.2:1508174041.163169:0:0:0:(o2iblnd_cb.c:3516:kiblnd_cq_completion()) conn[ffff8800748ee000] (20)++
00000800:00000200:20.0:1508174041.163176:0:2511:0:(o2iblnd_cb.c:3638:kiblnd_scheduler()) conn[ffff8800748ee000] (21)++
00000800:00000200:20.0:1508174041.163182:0:2511:0:(o2iblnd_cb.c:326:kiblnd_handle_rx()) Received d1[1] from 192.168.1.110@o2ib
00000400:00000200:20.0:1508174041.163187:0:2511:0:(lib-move.c:2599:lnet_parse()) TRACE: 192.168.1.140@o2ib(192.168.1.140@o2ib) <- 192.168.1.110@o2ib : REPLY - for me
00000400:00000010:20.0:1508174041.163191:0:2511:0:(lib-lnet.h:479:lnet_msg_alloc()) alloc '(msg)': 416 at ffff88085cab6400 (tot 21630532).
00000400:00000200:20.0:1508174041.163198:0:2511:0:(lib-move.c:2366:lnet_parse_reply()) 192.168.1.140@o2ib: Reply from 12345-192.168.1.110@o2ib of length 48/48 into md 0xad
00000400:00000200:20.0:1508174041.163202:0:2511:0:(peer.c:2326:lnet_discovery_event_handler()) Received event: 3
00000400:00000200:20.0:1508174041.163205:0:2511:0:(peer.c:881:lnet_peer_ni_clr_non_mr_pref_nid()) peer 192.168.1.110@o2ib: 0
00000400:00000200:20.0:1508174041.163209:0:2511:0:(peer.c:2237:lnet_discovery_event_reply()) peer 192.168.1.110@o2ib data present 1
00000400:00000200:20.0:1508174041.163214:0:2511:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff8810537faf00
00000400:00000010:20.0:1508174041.163216:0:2511:0:(lib-lnet.h:259:lnet_md_free()) slab-freed 'md' at ffff8810537faf00.
00000800:00000200:47.0:1508174041.163218:0:2508:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (22)--
00000400:00000010:20.0:1508174041.163219:0:2511:0:(lib-lnet.h:489:lnet_msg_free()) kfreed 'msg': 416 at ffff88085cab6400 (tot 21630116).
00000800:00000200:20.0:1508174041.163221:0:2511:0:(o2iblnd_cb.c:194:kiblnd_post_rx()) conn[ffff8800748ee000] (21)++
00000800:00000200:20.0:1508174041.163224:0:2511:0:(o2iblnd_cb.c:223:kiblnd_post_rx()) conn[ffff8800748ee000] (22)--
00000800:00000200:20.0:1508174041.163225:0:2511:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (21)--
00000400:00000200:68.0:1508174041.163252:0:2519:0:(peer.c:3030:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:68.0:1508174041.163257:0:2519:0:(peer.c:3136:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0x61
00000400:00000010:68.0:1508174041.163259:0:2519:0:(peer.c:2399:lnet_peer_merge_data()) alloc '(curnis)': 16 at ffff88105a776630 (tot 21630132).
00000400:00000010:68.0:1508174041.163261:0:2519:0:(peer.c:2400:lnet_peer_merge_data()) alloc '(addnis)': 16 at ffff88105a776640 (tot 21630148).
00000400:00000010:68.0:1508174041.163262:0:2519:0:(peer.c:2401:lnet_peer_merge_data()) alloc '(delnis)': 16 at ffff88105a776650 (tot 21630164).
00000400:00000010:68.0:1508174041.163263:0:2519:0:(peer.c:2470:lnet_peer_merge_data()) kfreed 'curnis': 16 at ffff88105a776630 (tot 21630148).
00000400:00000010:68.0:1508174041.163264:0:2519:0:(peer.c:2471:lnet_peer_merge_data()) kfreed 'addnis': 16 at ffff88105a776640 (tot 21630132).
00000400:00000010:68.0:1508174041.163265:0:2519:0:(peer.c:2472:lnet_peer_merge_data()) kfreed 'delnis': 16 at ffff88105a776650 (tot 21630116).
00000400:00000010:68.0:1508174041.163266:0:2519:0:(api-ni.c:1089:lnet_ping_buffer_free()) kfreed 'pbuf': 280 at ffff88105c1e3600 (tot 21629836).
00000400:00000200:68.0:1508174041.163268:0:2519:0:(peer.c:2474:lnet_peer_merge_data()) peer 192.168.1.110@o2ib: 0
00000400:00000200:68.0:1508174041.163269:0:2519:0:(peer.c:2634:lnet_peer_data_present()) peer 192.168.1.110@o2ib: 0
00000400:00000200:68.0:1508174041.163271:0:2519:0:(peer.c:3157:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0xa1 rc 1
00000400:00000200:68.0:1508174041.163274:0:2519:0:(peer.c:3136:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0xa1
00000400:00000010:68.0:1508174041.163275:0:2519:0:(lib-lnet.h:228:lnet_md_alloc()) slab-alloced 'md' of size 128 at ffff88085c21d500.
00000400:00000010:68.0:1508174041.163279:0:2519:0:(lib-lnet.h:479:lnet_msg_alloc()) alloc '(msg)': 416 at ffff88105c1e3600 (tot 21630252).
00000400:00000200:68.0:1508174041.163281:0:2519:0:(lib-move.c:2984:LNetPut()) LNetPut -> 12345-192.168.1.110@o2ib
00000400:00000200:68.0:1508174041.163288:0:2519:0:(lib-move.c:2114:lnet_select_pathway()) TRACE: 192.168.1.140@o2ib(192.168.1.140@o2ib:<?>) -> 192.168.1.110@o2ib(192.168.1.110@o2ib:192.168.1.110@o2ib) : PUT
00000800:00000200:68.0:1508174041.163291:0:2519:0:(o2iblnd_cb.c:1511:kiblnd_send()) sending 48 bytes in 1 frags to 12345-192.168.1.110@o2ib
00000800:00000200:68.0:1508174041.163296:0:2519:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff88085c21d900] -> 192.168.1.110@o2ib (2) version: 12
00000800:00000200:68.0:1508174041.163298:0:2519:0:(o2iblnd_cb.c:1392:kiblnd_launch_tx()) conn[ffff8800748ee000] (20)++
00000800:00000200:68.0:1508174041.163299:0:2519:0:(o2iblnd_cb.c:1162:kiblnd_queue_tx_locked()) conn[ffff8800748ee000] (21)++
00000800:00000200:68.0:1508174041.163303:0:2519:0:(o2iblnd_cb.c:1398:kiblnd_launch_tx()) conn[ffff8800748ee000] (22)--
00000400:00000200:68.0:1508174041.163304:0:2519:0:(peer.c:2869:lnet_peer_send_push()) peer 192.168.1.110@o2ib
00000400:00000200:68.0:1508174041.163305:0:2519:0:(peer.c:3157:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0x2a1 rc 0
00000800:00000200:20.2:1508174041.163309:0:0:0:(o2iblnd_cb.c:3516:kiblnd_cq_completion()) conn[ffff8800748ee000] (21)++
00000800:00000200:20.0:1508174041.163315:0:2510:0:(o2iblnd_cb.c:3638:kiblnd_scheduler()) conn[ffff8800748ee000] (22)++
00000800:00000200:20.0:1508174041.163318:0:2510:0:(o2iblnd_cb.c:74:kiblnd_tx_done()) conn[ffff8800748ee000] (23)--
00000400:00000200:20.0:1508174041.163320:0:2510:0:(peer.c:2326:lnet_discovery_event_handler()) Received event: 5
00000400:00000200:20.0:1508174041.163323:0:2510:0:(peer.c:2278:lnet_discovery_event_send()) Push Send to 192.168.1.110@o2ib: 0
00000400:00000010:20.0:1508174041.163326:0:2510:0:(lib-lnet.h:489:lnet_msg_free()) kfreed 'msg': 416 at ffff88105c1e3600 (tot 21629836).
00000800:00000200:20.0:1508174041.163328:0:2510:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (22)--
00000800:00000200:20.0:1508174041.163330:0:2510:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (21)--
00000800:00000200:20.2:1508174041.163475:0:0:0:(o2iblnd_cb.c:3516:kiblnd_cq_completion()) conn[ffff8800748ee000] (20)++
00000800:00000200:20.0:1508174041.163481:0:2508:0:(o2iblnd_cb.c:3638:kiblnd_scheduler()) conn[ffff8800748ee000] (21)++
00000800:00000200:20.0:1508174041.163485:0:2508:0:(o2iblnd_cb.c:326:kiblnd_handle_rx()) Received d1[0] from 192.168.1.110@o2ib
00000400:00000200:20.0:1508174041.163489:0:2508:0:(lib-move.c:2599:lnet_parse()) TRACE: 192.168.1.140@o2ib(192.168.1.140@o2ib) <- 192.168.1.110@o2ib : ACK - for me
00000400:00000010:20.0:1508174041.163491:0:2508:0:(lib-lnet.h:479:lnet_msg_alloc()) alloc '(msg)': 416 at ffff88085cab6400 (tot 21630252).
00000400:00000200:20.0:1508174041.163496:0:2508:0:(lib-move.c:2419:lnet_parse_ack()) 192.168.1.140@o2ib: ACK from 12345-192.168.1.110@o2ib into md 0xb5
00000400:00000200:20.0:1508174041.163498:0:2508:0:(peer.c:2326:lnet_discovery_event_handler()) Received event: 4
00000400:00000200:20.0:1508174041.163500:0:2508:0:(peer.c:2111:lnet_discovery_event_ack()) peer 192.168.1.110@o2ib ev->status 0
00000400:00000200:20.0:1508174041.163505:0:2508:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff88085c21d500
00000400:00000010:20.0:1508174041.163506:0:2508:0:(lib-lnet.h:259:lnet_md_free()) slab-freed 'md' at ffff88085c21d500.
00000400:00000010:20.0:1508174041.163508:0:2508:0:(lib-lnet.h:489:lnet_msg_free()) kfreed 'msg': 416 at ffff88085cab6400 (tot 21629836).
00000800:00000200:20.0:1508174041.163510:0:2508:0:(o2iblnd_cb.c:194:kiblnd_post_rx()) conn[ffff8800748ee000] (22)++
00000800:00000200:20.0:1508174041.163512:0:2508:0:(o2iblnd_cb.c:223:kiblnd_post_rx()) conn[ffff8800748ee000] (23)--
00000800:00000200:20.0:1508174041.163513:0:2508:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (22)--
00000800:00000200:20.0:1508174041.163515:0:2508:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (21)--
00000800:00000200:20.2:1508174041.163551:0:0:0:(o2iblnd_cb.c:3516:kiblnd_cq_completion()) conn[ffff8800748ee000] (20)++
00000800:00000200:20.0:1508174041.163556:0:2510:0:(o2iblnd_cb.c:3638:kiblnd_scheduler()) conn[ffff8800748ee000] (21)++
00000800:00000200:20.0:1508174041.163560:0:2510:0:(o2iblnd_cb.c:326:kiblnd_handle_rx()) Received d1[1] from 192.168.1.110@o2ib
00000400:00000200:20.0:1508174041.163563:0:2510:0:(lib-move.c:2599:lnet_parse()) TRACE: 192.168.1.140@o2ib(192.168.1.140@o2ib) <- 192.168.1.110@o2ib : PUT - for me
00000400:00000010:20.0:1508174041.163565:0:2510:0:(lib-lnet.h:479:lnet_msg_alloc()) alloc '(msg)': 416 at ffff88085cab6400 (tot 21630252).
00000800:00000200:4.0:1508174041.163565:0:2509:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (22)--
00000400:00000200:20.0:1508174041.163569:0:2510:0:(lib-ptl.c:571:lnet_ptl_match_md()) Request from 12345-192.168.1.110@o2ib of length 48 into portal 0 MB=0x8000000000000000
00000400:00000200:20.0:1508174041.163575:0:2510:0:(lib-ptl.c:200:lnet_try_match_md()) Incoming put index 0 from 12345-192.168.1.110@o2ib of length 48/48 into md 0x9 [1] + 0
00000400:00000200:68.0:1508174041.163577:0:2519:0:(peer.c:3030:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:20.0:1508174041.163581:0:2510:0:(peer.c:1939:lnet_peer_push_event()) Stale Push from 192.168.1.110@o2ib: got 1 have 1
00000400:00000200:68.0:1508174041.163585:0:2519:0:(peer.c:3136:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0xa1
00000400:00000200:68.0:1508174041.163588:0:2519:0:(peer.c:2917:lnet_peer_discovered()) peer 192.168.1.110@o2ib
00000400:00000200:68.0:1508174041.163589:0:2519:0:(peer.c:3157:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0x89 rc 0
00000400:00000200:20.0:1508174041.163592:0:2510:0:(peer.c:1751:lnet_peer_queue_for_discovery()) Queue peer 192.168.1.110@o2ib: -114
00000400:00000200:20.0:1508174041.163603:0:2510:0:(lib-move.c:2114:lnet_select_pathway()) TRACE: 192.168.1.140@o2ib(192.168.1.140@o2ib:192.168.1.140@o2ib) -> 192.168.1.110@o2ib(192.168.1.110@o2ib:192.168.1.110@o2ib) : ACK
00000800:00000200:20.0:1508174041.163608:0:2510:0:(o2iblnd_cb.c:1511:kiblnd_send()) sending 0 bytes in 1 frags to 12345-192.168.1.110@o2ib
00000800:00000200:20.0:1508174041.163613:0:2510:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff88085c21d900] -> 192.168.1.110@o2ib (2) version: 12
00000800:00000200:20.0:1508174041.163616:0:2510:0:(o2iblnd_cb.c:1392:kiblnd_launch_tx()) conn[ffff8800748ee000] (21)++
00000800:00000200:20.0:1508174041.163618:0:2510:0:(o2iblnd_cb.c:1162:kiblnd_queue_tx_locked()) conn[ffff8800748ee000] (22)++
00000800:00000200:20.0:1508174041.163621:0:2510:0:(o2iblnd_cb.c:1398:kiblnd_launch_tx()) conn[ffff8800748ee000] (23)--
00000800:00000200:20.0:1508174041.163622:0:2510:0:(o2iblnd_cb.c:194:kiblnd_post_rx()) conn[ffff8800748ee000] (22)++
00000800:00000200:20.2:1508174041.163625:0:2510:0:(o2iblnd_cb.c:3516:kiblnd_cq_completion()) conn[ffff8800748ee000] (22)++
00000800:00000200:20.0:1508174041.163630:0:2510:0:(o2iblnd_cb.c:223:kiblnd_post_rx()) conn[ffff8800748ee000] (24)--
00000800:00000200:20.0:1508174041.163631:0:2510:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (23)--
00000800:00000200:20.0:1508174041.163633:0:2510:0:(o2iblnd_cb.c:3638:kiblnd_scheduler()) conn[ffff8800748ee000] (22)++
00000800:00000200:20.0:1508174041.163638:0:2510:0:(o2iblnd_cb.c:74:kiblnd_tx_done()) conn[ffff8800748ee000] (23)--
00000400:00000010:20.0:1508174041.163640:0:2510:0:(lib-lnet.h:489:lnet_msg_free()) kfreed 'msg': 416 at ffff88085cab6400 (tot 21629836).
00000800:00000200:20.0:1508174041.163641:0:2510:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (22)--

Here PID 2946 should be the forked llog parser thread for the soak-client config, PID 2519 the LNet discovery thread, and PIDs 2508/2510 the ko2iblnd scheduler threads.

It looks like there is a race between the LNet discovery thread's handling of the ACK for the previously sent PUT message to 192.168.1.110@o2ib, and the 2510/ko2iblnd scheduler thread handling a consecutive "stale" PUT message (unexpected, or due to multi-rail?) received from 192.168.1.110@o2ib. This causes the LNET_PEER_DISCOVERING flag to be set again and the peer to be kept in the discovery process/loop, but very likely on the ln_dc_working queue instead of the ln_dc_request one, so it is out of the discovery thread's control and should probably have been timed out later.
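The suspected interleaving can be sketched as a toy, single-threaded model. The names (LNET_PEER_DISCOVERING, ln_dc_request, ln_dc_working, the -114/-EALREADY return) are borrowed from the log above; the bit value and the logic are deliberate simplifications for illustration, not the actual kernel code:

```python
LNET_PEER_DISCOVERING = 0x20   # hypothetical bit value, for illustration only

class Peer:
    def __init__(self, nid):
        self.nid = nid
        self.state = 0

ln_dc_request = []   # peers waiting for the discovery thread to pick up
ln_dc_working = []   # peers the discovery thread is currently processing

def lnet_peer_queue_for_discovery(peer):
    """Called when traffic arrives for a peer (e.g. an incoming Push/PUT)."""
    peer.state |= LNET_PEER_DISCOVERING
    if peer in ln_dc_request or peer in ln_dc_working:
        return -114          # -EALREADY, as in "Queue peer ...: -114"
    ln_dc_request.append(peer)
    return 0

peer = Peer("192.168.1.110@o2ib")

# 1. First traffic queues the peer; the discovery thread moves it onto the
#    working list and starts the ping/push exchange.
lnet_peer_queue_for_discovery(peer)
ln_dc_working.append(ln_dc_request.pop(0))

# 2. While the exchange is still in flight, the scheduler thread handles a
#    stale PUT from the same peer and re-queues it for discovery: the flag
#    is set again, but the peer stays on ln_dc_working.
rc = lnet_peer_queue_for_discovery(peer)

# 3. The discovery thread finishes its pass, but because DISCOVERING was
#    re-set it considers the peer still under discovery and leaves it on
#    ln_dc_working -- a queue its main loop never services -- so the peer
#    is stuck until some timeout intervenes.
```

If the model is right, the peer ends with the DISCOVERING bit set while sitting on a list the discovery thread no longer looks at, matching the "out of discovery thread control" observation.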

            Talis Rus Taliss (Inactive) added a comment - I realize you are working on the future 2.11 release. This post is not related to Soak, but looking at the log it appears that striped sets are giving a timeout problem. May I ask why not use IB RDMA? I plan an 8-node cluster running Lustre 2.10 with an IB interconnect; I wish to hold the DB in memory, distributed across the nodes. What is the best installation using a Mellanox ConnectX-2 VPI 40Gb adapter per node, in pure IB through an SX6036 switch? These methods need integration within Lustre; at what stage is that?

            Is the WQE encapsulated inside the PCIe doorbell and written via Programmed IO? Or is the WQE fetched in a separate DMA read by the HCA?
            It can be either. The usual way is writing WQE and ringing the doorbell, in which case the HCA will fetch the WQE through DMA. The less common way is what is called "Blue Flame" - part of the PCIe BAR is used for WQEs, and the WQE is written to the HCA the same way the doorbell is written.
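            The distinction between the two paths can be illustrated with a toy software model. The Device class, its fields, and the strings are all invented for illustration; real HCAs do this in hardware over PCIe:

```python
class Device:
    """Toy stand-in for an HCA; not a real driver API."""
    def __init__(self):
        self.bar_wqe = None        # BAR region written by Blue Flame PIO
        self.doorbell_count = 0
        self.fetched = []          # WQEs the "HCA" pulled via DMA

    def ring_doorbell(self, queue):
        # Doorbell path: the write only tells the HCA new work exists;
        # the HCA then fetches the WQE from host memory by DMA.
        self.doorbell_count += 1
        self.fetched.append(queue[-1])

    def blue_flame_write(self, wqe):
        # Blue Flame path: the WQE itself is written into the BAR by
        # programmed I/O, so no separate DMA read is needed.
        self.bar_wqe = wqe

send_queue = []          # host-memory work queue
dev = Device()

send_queue.append("WQE#1")
dev.ring_doorbell(send_queue)      # HCA DMA-fetches WQE#1 from host memory

dev.blue_flame_write("WQE#2")      # WQE#2 travels inside the PIO write itself
```

The trade-off the model hints at: Blue Flame saves the DMA round trip (lower latency for small WQEs) at the cost of a larger PIO write.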

            The main issue: how does Lustre map to an in-memory distributed store? Even on a two-node split using only ConnectX-2 VPI (dual port) or ConnectX-3 (though I hear IB performance is slower on the X3). The new features of Lustre 2.10 make local routing a breeze, direct from the header.

            I don't wish for this MPI issue to dominate (even though I'm not using MPI directly): can you say whether MPI libraries are used at all in Lustre, or whether UDP binaries are affected, in case I use UDP?

            Is IB datagram mode available as a DMA (or other-layer) transport within Lustre 2.10?

            MPI IP performance is affected by the socket-buffer sizes for sending and receiving UDP data. These are defined by two network tuning parameters udp_sendspace and udp_recvspace. When the buffer for sending data is too small and quickly becomes full, UDP data transfer can be delayed. When the buffer for receiving data is too small, incoming UDP data can be dropped due to insufficient buffer space, resulting in send-side retransmission and very poor performance.
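            The drop behaviour described above is easy to demonstrate with a plain UDP socket whose receive buffer is deliberately shrunk (this uses the generic SO_RCVBUF socket option; the specific buffer size and datagram counts are arbitrary choices for the demo):

```python
import socket

# Receiver with a deliberately tiny kernel receive buffer.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)
rx.bind(("127.0.0.1", 0))
port = rx.getsockname()[1]

# Blast far more data at it than the buffer can hold, without reading.
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for _ in range(1000):
    tx.sendto(b"x" * 512, ("127.0.0.1", port))

# Drain whatever actually fit in the buffer; the rest was silently dropped.
rx.setblocking(False)
received = 0
while True:
    try:
        rx.recvfrom(2048)
        received += 1
    except BlockingIOError:
        break

tx.close()
rx.close()
```

With ~512 KB sent into a few-KB buffer, only a handful of datagrams survive; the sender gets no error, which is exactly why undersized receive buffers show up as retransmission and poor throughput rather than explicit failures.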

            Here's v2 of the Mellanox IB utils:
            http://www.mellanox.com/page/products_dyn?product_family=110&mtag=monitoring_debug

            Worth asking whether Sarah or others might be interested in a remote install/dev engagement on a NY cluster, on a casual basis.

            yujian Jian Yu added a comment - conf-sanity test 53b also hit this issue on the master branch, which is affecting patch review testing:
            https://testing.hpdd.intel.com/test_sets/e63e81e6-c94c-11e7-a066-52540065bddc
            https://testing.hpdd.intel.com/test_sets/683390e6-c570-11e7-a066-52540065bddc

            People

              bfaccini Bruno Faccini (Inactive)
              cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 9
