LU-13692: MDS slow/hung threads at mdt_object_local_lock

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.6
    • Affects Version/s: Lustre 2.12.4
    • Components: None
    • Environment: clients running l2.12.3
    • Severity: 2

    Description

      MDS threads are hanging or running slowly; stack trace dumps are being generated:

      Jun 16 14:20:20 nbp8-mds1 kernel: [1660354.377038] Lustre: MGS: Connection restored to 09397a7f-3fe2-2dc8-d25a-74d157cb2008 (at 10.151.50.72@o2ib)
      Jun 16 14:20:20 nbp8-mds1 kernel: [1660354.377044] Lustre: Skipped 277 previous similar messages
      Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.852981] Lustre: 8579:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (30/30), not sending early reply
      Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.852981]   req@ffff8979f9e16300 x1668974534442592/t1531807577764(0) o36->67852282-a3ed-5acb-a9e2-3cae43fe0406@10.151.0.201@o2ib:5/0 lens 488/3152 e 0 to 0 dl 1592342570 ref 2 fl Interpret:/0/0 rc 0/0
      Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.952277] Lustre: 8579:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 134 previous similar messages
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.333617] LNet: Service thread pid 14118 was inactive for 551.86s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.389992] LNet: Skipped 3 previous similar messages
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407185] Pid: 14118, comm: mdt00_115 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407190] Call Trace:
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407202]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412473]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412496]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412506]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412516]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412530]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412541]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412552]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412562]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412571]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412623]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412659]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412691]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412696]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412699]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412722]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412725] LustreError: dumping log to /tmp/lustre-log.1592342774.14118
      Jun 16 14:26:14 nbp8-mds1 sec[2849]: SEC_EVENT |msg lustre hung thread
      Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.178343] LNet: Service thread pid 8575 was inactive for 552.75s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.234458] Pid: 8575, comm: mdt00_043 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.234459] Call Trace:
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.234472]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239700]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239749]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239763]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239773]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239785]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239797]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239809]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239819]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239829]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239874]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239909]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239945]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239952]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239958]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239982]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239988] Pid: 14083, comm: mdt00_088 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239988] Call Trace:
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240020]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240050]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240060]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240071]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240082]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240093]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240104]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240116]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240127]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240137]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240174]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240207]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240240]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240244]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240247]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240251]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240254] Pid: 10527, comm: mdt00_069 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240254] Call Trace:
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240285]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240313]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240325]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240335]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240344]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240357]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240369]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240380]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240389]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240400]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240439]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240478]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240510]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240514]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240516]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240520]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240523] Pid: 14084, comm: mdt00_089 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240523] Call Trace:
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240554]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240585]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240598]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240610]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240620]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240631]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240647]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240657]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240666]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240677]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240716]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240749]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240781]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240784]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240788]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240792]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240796] LNet: Service thread pid 8568 was inactive for 552.81s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
       

      Attachments

        1. lustre-log.1592343544.8525
          4.75 MB
        2. messages
          51 kB
        3. vmcore-dmesg.txt
          1021 kB

        Issue Links

          Activity

            [LU-13692] MDS slow/hung threads at mdt_object_local_lock

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40412/
            Subject: LU-13692 ldlm: Ensure we reprocess the resource on ast error
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: d3dfcba36ea04809d09d38ccd97da053be85404c

            pjones Peter Jones added a comment -

            Landed for 2.14


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39598/
            Subject: LU-13692 ldlm: Ensure we reprocess the resource on ast error
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 24e3b5395bc61333a32b1e9725a0d7273925ef05

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40412
            Subject: LU-13692 ldlm: Ensure we reprocess the resource on ast error
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: ac40c31fd49b17ad809652cbd85d39617906d213

            mhanafi Mahmoud Hanafi added a comment -

            Patch #4 appears stable - I haven't seen any more hung threads.

            mhanafi Mahmoud Hanafi added a comment -

            I have been trying to re-test but have been unable to reproduce the issue. I will keep trying.

            mhanafi Mahmoud Hanafi added a comment -

            Amir,

            I re-ran the test, saving the router stats before and after. None of the drop counters were incremented. Please look at the router debug logs from the last set uploaded to see if you can find anything unusual.

            ashehata Amir Shehata (Inactive) added a comment -

            I think it would be helpful to have a script that monitors the statistics on the routers and shows whether the drop counters or other error counters increment around the time of the issue.

            Looking at the stats, it appears that some of the routers have a very high number of drops, >12K. Do you think that would indicate a problem? The stats can't be cleared, so these numbers accumulate from the time the routers came up, but a script can be written to compute the deltas and see whether they increment in bursts.
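
            For illustration only (this is not from the ticket), a minimal polling sketch along these lines could report the per-interval deltas; the counter names are assumptions based on typical "lnetctl stats show" output and may differ between LNet versions:

            #!/bin/bash
            # Hypothetical delta-monitoring sketch, not part of this ticket.
            # Polls `lnetctl stats show` and prints how much each counter of
            # interest grew during each interval. Counter names are assumed.
            INTERVAL=${1:-60}
            COUNTERS="drop_count remote_dropped_count errors"

            read_counter() {
                # stats show prints lines like "    drop_count: 12345"
                lnetctl stats show | awk -v k="$1:" '$1 == k { print $2 }'
            }

            declare -A prev
            for c in $COUNTERS; do
                prev[$c]=$(read_counter "$c")
            done

            while sleep "$INTERVAL"; do
                out="$(date '+%F %T')"
                for c in $COUNTERS; do
                    cur=$(read_counter "$c")
                    out="$out ${c}:+$(( cur - ${prev[$c]} ))"
                    prev[$c]=$cur
                done
                echo "$out"
            done

            Running something like this on each router and comparing any bursts against the MDS hang timestamps would give the correlation discussed above.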

            I'm also seeing a lot of remote_dropped_count in the stats show output.

            There are only 2 places where that counter can get incremented:

            1. The tx completes with failure, i.e. the message is posted but the underlying stack fails to complete it.
            2. The "peer not alive" drop path:

               CNETERR("Dropping message for %s: peer not alive\n",
                       libcfs_id2str(msg->msg_target));
               msg->msg_health_status = LNET_MSG_STATUS_REMOTE_DROPPED;

            Case 2 should be easy to verify: in the router logs, do you see that message?
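
            For example (illustrative only, not from the ticket; the paths and file names are placeholders), the router's console log or a decoded Lustre debug dump could be searched for that string:

            # Console/syslog on the router (path may differ by distro):
            grep "peer not alive" /var/log/messages

            # Or convert a dumped binary debug log to text with lctl and search it:
            lctl debug_file /tmp/lustre-log.<timestamp>.<pid> /tmp/lustre-log.txt
            grep "peer not alive" /tmp/lustre-log.txt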

            I think that could explain why some of the messages are not making it to the clients. They could be dropped by the router.


            mhanafi Mahmoud Hanafi added a comment -

            We do run with 'health_sensitivity=0'. I saved the "net show" and "stats show" output; these were not cleared beforehand. I'll try to re-run and clear the stats before the test.

            ftp uploads/LU-13692/router.netshow.2020_09_11

            I was able to re-test today with router logs; see ftp:/uploads/LU-13692/LU-13692_patch4_2020_09_11.tgz

            Here is what I saw. The server sends the RPC:

             --- SERVER ---
            00000100:00100000:7.0:1599839141.481820:0:8199:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.141.6.236@o2ib417, seq: 2361
            00000100:00100000:7.0:1599839141.481824:0:8199:0:(service.c:2140:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_012:9308d1b4-ec54-b743-7579-5daa9b33d4cd+88:4780:x1677516826795584:12345-10.141.6.236@o2ib417:101
            00010000:00010000:7.0:1599839141.507240:0:8199:0:(ldlm_lock.c:1073:ldlm_granted_list_add_lock()) ### About to add lock: ns: mdt-nbptest3-MDT0000_UUID lock: ffff99b56d30a1c0/0x8e13058be0aade00 lrc: 3/0,0 mode: CR/CR res: [0x200019271:0x2:0x0].0x0 bits 0x8/0x0 rrc: 489 type: IBT flags: 0x50200400000000 nid: 10.141.6.236@o2ib417 remote: 0xd532d2c8ba1b0d1 expref: 97 pid: 8199 timeout: 0 lvb_type: 0
            00010000:00010000:7.0:1599839141.507253:0:8199:0:(ldlm_lock.c:681:ldlm_add_bl_work_item()) ### lock incompatible; sending blocking AST. ns: mdt-nbptest3-MDT0000_UUID lock: ffff99b56d30a1c0/0x8e13058be0aade00 lrc: 3/0,0 mode: CR/CR res: [0x200019271:0x2:0x0].0x0 bits 0x8/0x0 rrc: 489 type: IBT flags: 0x40200400000000 nid: 10.141.6.236@o2ib417 remote: 0xd532d2c8ba1b0d1 expref: 97 pid: 8199 timeout: 0 lvb_type: 0
            00010000:00010000:7.0:1599839141.507263:0:8199:0:(ldlm_lockd.c:903:ldlm_server_blocking_ast()) ### server preparing blocking AST ns: mdt-nbptest3-MDT0000_UUID lock: ffff99b56d30a1c0/0x8e13058be0aade00 lrc: 4/0,0 mode: CR/CR res: [0x200019271:0x2:0x0].0x0 bits 0x8/0x0 rrc: 489 type: IBT flags: 0x50200400000020 nid: 10.141.6.236@o2ib417 remote: 0xd532d2c8ba1b0d1 expref: 97 pid: 8199 timeout: 0 lvb_type: 0
            00010000:00010000:7.0:1599839141.507269:0:8199:0:(ldlm_lockd.c:465:ldlm_add_waiting_lock()) ### adding to wait list(timeout: 412, AT: on) ns: mdt-nbptest3-MDT0000_UUID lock: ffff99b56d30a1c0/0x8e13058be0aade00 lrc: 5/0,0 mode: CR/CR res: [0x200019271:0x2:0x0].0x0 bits 0x8/0x0 rrc: 489 type: IBT flags: 0x70200400000020 nid: 10.141.6.236@o2ib417 remote: 0xd532d2c8ba1b0d1 expref: 97 pid: 8199 timeout: 1350 lvb_type: 0
            00000100:00100000:7.0:1599839141.507276:0:8199:0:(client.c:1630:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc mdt00_012:nbptest3-MDT0000_UUID:8199:1677552166027200:10.141.6.236@o2ib417:104
            00000400:00000200:7.0:1599839141.507285:0:8199:0:(lib-move.c:4684:LNetPut()) LNetPut -> 12345-10.141.6.236@o2ib417
            00000400:00000200:7.0:1599839141.507288:0:8199:0:(lib-move.c:2478:lnet_handle_send_case_locked()) Source ANY to NMR:  10.141.6.236@o2ib417 routed destination
            00000400:00000200:7.0:1599839141.507296:0:8199:0:(lib-move.c:1752:lnet_handle_send()) TRACE: 10.151.27.53@o2ib(10.151.27.53@o2ib:<?>) -> 10.141.6.236@o2ib417(10.141.6.236@o2ib417:10.151.26.234@o2ib) : PUT try# 0
            00000100:00000200:7.0:1599839141.507889:0:8199:0:(events.c:93:reply_in_callback()) @@@ type 6, status 0  req@ffff99b5661a0900 x1677552166027200/t0(0) o104->nbptest3-MDT0000@10.141.6.236@o2ib417:15/16 lens 296/224 e 0 to 0 dl 1599839764 ref 1 fl Rpc:R/0/ffffffff rc 0/-1
            00000100:00000200:7.0:1599839141.507895:0:8199:0:(events.c:114:reply_in_callback()) @@@ unlink  req@ffff99b5661a0900 x1677552166027200/t0(0) o104->nbptest3-MDT0000@10.141.6.236@o2ib417:15/16 lens 296/224 e 0 to 0 dl 1599839764 ref 1 fl Rpc:R/0/ffffffff rc 0/-1
            00000100:00100000:7.0:1599839141.507904:0:8199:0:(client.c:2060:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc mdt00_012:nbptest3-MDT0000_UUID:8199:1677552166027200:10.141.6.236@o2ib417:104
            00010000:00010000:7.0:1599839141.507912:0:8199:0:(ldlm_lock.c:210:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: mdt-nbptest3-MDT0000_UUID lock: ffff99b56d30f740/0x8e13058be0aaddf2 lrc: 0/0,0 mode: --/EX res: [0x200019271:0x2:0x0].0x0 bits 0x8/0x0 rrc: 489 type: IBT flags: 0x44000000000000 nid: 10.141.6.236@o2ib417 remote: 0xd532d2c8ba1b0d1 expref: 97 pid: 8199 timeout: 0 lvb_type: 3
            00010000:00010000:7.0:1599839141.507918:0:8199:0:(ldlm_lockd.c:465:ldlm_add_waiting_lock()) ### not re-adding to wait list(timeout: 412, AT: on) ns: mdt-nbptest3-MDT0000_UUID lock: ffff99b56d30a1c0/0x8e13058be0aade00 lrc: 4/0,0 mode: CR/CR res: [0x200019271:0x2:0x0].0x0 bits 0x8/0x0 rrc: 488 type: IBT flags: 0x70200400000020 nid: 10.141.6.236@o2ib417 remote: 0xd532d2c8ba1b0d1 expref: 96 pid: 8199 timeout: 1350 lvb_type: 0
            00010000:00010000:7.0:1599839141.507924:0:8199:0:(ldlm_lockd.c:1401:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=0, rc=0) ns: mdt-nbptest3-MDT0000_UUID lock: ffff99b56d30a1c0/0x8e13058be0aade00 lrc: 4/0,0 mode: CR/CR res: [0x200019271:0x2:0x0].0x0 bits 0x8/0x0 rrc: 488 type: IBT flags: 0x60200400000020 nid: 10.141.6.236@o2ib417 remote: 0xd532d2c8ba1b0d1 expref: 96 pid: 8199 timeout: 1350 lvb_type: 0
            00010000:00000200:7.0:1599839141.507935:0:8199:0:(ldlm_lib.c:2967:target_send_reply_msg()) @@@ sending reply  req@ffff99b5702d2d00 x1677516826795584/t201863467188(0) o101->9308d1b4-ec54-b743-7579-5daa9b33d4cd@10.141.6.236@o2ib417:131/0 lens 376/1032 e 0 to 0 dl 1599839846 ref 1 fl Interpret:/0/0 rc 0/0
            00000400:00000200:7.0:1599839141.507944:0:8199:0:(lib-move.c:4684:LNetPut()) LNetPut -> 12345-10.141.6.236@o2ib417
            00000400:00000200:7.0:1599839141.507946:0:8199:0:(lib-move.c:2478:lnet_handle_send_case_locked()) Source Specified: 10.151.27.53@o2ib to NMR:  10.141.6.236@o2ib417 routed destination
            00000400:00000200:7.0:1599839141.507955:0:8199:0:(lib-move.c:1752:lnet_handle_send()) TRACE: 10.151.27.53@o2ib(10.151.27.53@o2ib:10.151.27.53@o2ib) -> 10.141.6.236@o2ib417(10.141.6.236@o2ib417:10.151.26.236@o2ib) : PUT try# 0
            00000100:00100000:7.0:1599839141.507968:0:8199:0:(service.c:2190:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_012:9308d1b4-ec54-b743-7579-5daa9b33d4cd+96:4780:x1677516826795584:12345-10.141.6.236@o2ib417:101 Request processed in 26145us (26195us total) trans 201863467188 rc 0/0
            00000100:00100000:7.0:1599839141.507973:0:8199:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.141.6.236@o2ib417, seq: 2361
            

            The router gets the request and (tries to) send it out. I wasn't sure if it actually makes it out of the router.

            00000800:00000200:71.0:1599839141.508041:0:25250:0:(o2iblnd_cb.c:343:kiblnd_handle_rx()) Received d1[0] from 10.151.27.53@o2ib
            00000400:00000200:71.0:1599839141.508045:0:25250:0:(lib-move.c:4190:lnet_parse()) TRACE: 10.141.6.236@o2ib417(10.151.26.236@o2ib) <- 10.151.27.53@o2ib : PUT - routed
            00000400:00000200:71.0:1599839141.508050:0:25250:0:(lib-msg.c:918:lnet_is_health_check()) health check = 0, status = 0, hstatus = 0
            00000400:00000200:71.0:1599839141.508053:0:25250:0:(lib-move.c:2478:lnet_handle_send_case_locked()) Source ANY to NMR:  10.141.6.236@o2ib417 local destination
            00000400:00000200:71.0:1599839141.508061:0:25250:0:(lib-move.c:1752:lnet_handle_send()) TRACE: 10.151.27.53@o2ib(10.141.26.236@o2ib417:<?>) -> 10.141.6.236@o2ib417(10.141.6.236@o2ib417:10.141.6.236@o2ib417) : PUT try# 0
            00000800:00000200:71.0:1599839141.508065:0:25250:0:(o2iblnd_cb.c:1663:kiblnd_send()) sending 1000 bytes in 1 frags to 12345-10.141.6.236@o2ib417
            00000800:00000200:71.0:1599839141.508071:0:25250:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff9c5a323c9d80] -> 10.141.6.236@o2ib417 (2) version: 12
            00000800:00000200:71.0:1599839141.508072:0:25250:0:(o2iblnd_cb.c:1543:kiblnd_launch_tx()) conn[ffff9c4700f2ea00] (68)++
            00000800:00000200:71.0:1599839141.508074:0:25250:0:(o2iblnd_cb.c:1316:kiblnd_queue_tx_locked()) conn[ffff9c4700f2ea00] (69)++
            00000800:00000200:71.0:1599839141.508077:0:25250:0:(o2iblnd_cb.c:1549:kiblnd_launch_tx()) conn[ffff9c4700f2ea00] (70)--
            00000800:00000200:71.0:1599839141.508078:0:25250:0:(o2iblnd_cb.c:205:kiblnd_post_rx()) conn[ffff9c5b223f4200] (69)++
            00000800:00000200:71.0:1599839141.508080:0:25250:0:(o2iblnd_cb.c:239:kiblnd_post_rx()) conn[ffff9c5b223f4200] (70)--
            00000800:00000200:71.0:1599839141.508081:0:25250:0:(o2iblnd_cb.c:3875:kiblnd_scheduler()) conn[ffff9c5b223f4200] (69)--
            00000800:00000200:71.0:1599839141.508457:0:25250:0:(o2iblnd_cb.c:3859:kiblnd_scheduler()) conn[ffff9c5ad91c0200] (70)++
            00000800:00000200:71.0:1599839141.508461:0:25250:0:(o2iblnd_cb.c:3875:kiblnd_scheduler()) conn[ffff9c5ad91c0200] (71)--
            00000800:00000200:71.0:1599839141.508462:0:25250:0:(o2iblnd_cb.c:3859:kiblnd_scheduler()) conn[ffff9c5ad91c0200] (70)++
            

            There is no completion for conn[ffff9c5ad91c0200].
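
            As an illustration only (not from the ticket; the decoded-log file names below are hypothetical), one way to correlate the server and router debug logs is to search both for the same XID or conn pointer:

            grep "x1677552166027200" server-debug.txt router-debug.txt
            grep "ffff9c5ad91c0200" router-debug.txt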


            ashehata Amir Shehata (Inactive) added a comment -

            I was talking with Oleg; there are a few things we can try to collect/do to see if there are issues on the LNet/networking side.

            First of all, let's grab the LNet statistics from the nodes with the problem (servers/routers/clients) to see if there are drops:

            lnetctl net show -v 4
            lnetctl stats show 

            Oleg was saying that he sees the message depart the server but never arrive on the client. There are no router logs.

            If we get the above stats from the router, we can see if there are drops happening on the routers. We can also monitor the stats on some interval and see if the errors/drops increase at the same time frame as the problem. We can do the same type of monitoring on the servers and clients. I think if we're able to correlate LNet errors with the time frame of the problem, we'll have some lead to go on. We can at least look at the type of errors which are happening during that time and see how we handle them.

            The other thing I'm thinking is to try and reduce the complexity of LNet. Can we try disabling health, if we haven't already:

            lnetctl set health_sensitivity 0 

            This avoids any resends at the LNet layer.
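
            As an illustrative check (not from the ticket), the current value should appear among the global LNet parameters, so something like the following could confirm the setting is active:

            lnetctl global show | grep -i health_sensitivity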


            People

              Assignee: green Oleg Drokin
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 15

              Dates

                Created:
                Updated:
                Resolved: