Lustre / LU-13692

MDS slow/hung threads at mdt_object_local_lock

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.6
    • Affects Version/s: Lustre 2.12.4
    • Labels: None
    • Environment: clients running l2.12.3
    • Severity: 2

    Description

      MDS threads are hung or slow, and we are getting stack trace dumps:

      Jun 16 14:20:20 nbp8-mds1 kernel: [1660354.377038] Lustre: MGS: Connection restored to 09397a7f-3fe2-2dc8-d25a-74d157cb2008 (at 10.151.50.72@o2ib)
      Jun 16 14:20:20 nbp8-mds1 kernel: [1660354.377044] Lustre: Skipped 277 previous similar messages
      Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.852981] Lustre: 8579:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (30/30), not sending early reply
      Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.852981]   req@ffff8979f9e16300 x1668974534442592/t1531807577764(0) o36->67852282-a3ed-5acb-a9e2-3cae43fe0406@10.151.0.201@o2ib:5/0 lens 488/3152 e 0 to 0 dl 1592342570 ref 2 fl Interpret:/0/0 rc 0/0
      Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.952277] Lustre: 8579:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 134 previous similar messages
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.333617] LNet: Service thread pid 14118 was inactive for 551.86s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.389992] LNet: Skipped 3 previous similar messages
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407185] Pid: 14118, comm: mdt00_115 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407190] Call Trace:
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407202]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412473]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412496]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412506]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412516]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412530]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412541]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412552]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412562]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412571]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412623]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412659]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412691]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412696]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412699]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412722]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412725] LustreError: dumping log to /tmp/lustre-log.1592342774.14118
      Jun 16 14:26:14 nbp8-mds1 sec[2849]: SEC_EVENT |msg lustre hung thread
      Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.178343] LNet: Service thread pid 8575 was inactive for 552.75s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.234458] Pid: 8575, comm: mdt00_043 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.234459] Call Trace:
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.234472]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239700]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239749]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239763]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239773]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239785]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239797]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239809]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239819]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239829]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239874]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239909]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239945]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239952]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239958]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239982]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239988] Pid: 14083, comm: mdt00_088 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239988] Call Trace:
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240020]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240050]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240060]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240071]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240082]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240093]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240104]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240116]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240127]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240137]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240174]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240207]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240240]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240244]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240247]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240251]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240254] Pid: 10527, comm: mdt00_069 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240254] Call Trace:
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240285]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240313]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240325]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240335]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240344]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240357]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240369]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240380]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240389]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240400]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240439]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240478]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240510]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240514]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240516]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240520]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240523] Pid: 14084, comm: mdt00_089 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240523] Call Trace:
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240554]  [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240585]  [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240598]  [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240610]  [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240620]  [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240631]  [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240647]  [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240657]  [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240666]  [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240677]  [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240716]  [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240749]  [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240781]  [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240784]  [<ffffffff858c61f1>] kthread+0xd1/0xe0
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240788]  [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240792]  [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240796] LNet: Service thread pid 8568 was inactive for 552.81s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
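      As a side note, the /tmp/lustre-log.<timestamp>.<pid> files referenced above are binary Lustre debug dumps; assuming the standard lctl utility is available, they can be converted to readable text roughly as follows (the file name is just the one from the messages above):

      # Decode a binary debug dump into plain text for inspection
      lctl debug_file /tmp/lustre-log.1592342774.14118 /tmp/lustre-log.1592342774.14118.txt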
       

      Attachments

        1. lustre-log.1592343544.8525
          4.75 MB
        2. messages
          51 kB
        3. vmcore-dmesg.txt
          1021 kB


          Activity

            green Oleg Drokin added a comment -

            I checked for the first eviction, and that seems to have been removed from the client log by the time it was dumped, so it's not possible to see what was going on on the client at the time.

            I see there are still a bunch of stuck threads after that; I need to see if I can find out why. Probably another lost AST, but it's not a given that the client log (whatever it was) survived either, I guess.


            mhanafi Mahmoud Hanafi added a comment -

            Uploaded logs running with patch #4.

            debug="ioctl neterror net warning dlmtrace error emerg ha rpctrace config console lfsck"

            ftp:/uploads/LU-13692/LU-13692_patch4_20200902.tgz
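            For reference, a rough sketch of applying a mask like the one above and collecting a dump with lctl (standard lctl subcommands; the output path is only an example):

            # Set the debug mask used for this run (this overwrites the current mask)
            lctl set_param debug="ioctl neterror net warning dlmtrace error emerg ha rpctrace config console lfsck"

            # Clear the in-kernel debug buffer, reproduce the problem, then dump the buffer to a file
            lctl clear
            lctl dk /tmp/server.dk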

            green Oleg Drokin added a comment -

            Patchset 4, if you can, please.

            I agree we have at least two issues at hand.


            mhanafi Mahmoud Hanafi added a comment -

            I think we have two issues here: first the LDLM deadlock, and second the credit issue.
            I am going to retest with different credit values. Should I test with Patch #2 or Patch #3?

            green Oleg Drokin added a comment -

            Andreas, the thread is sleeping in ptlrpc_set_wait in ptlrpc, so waking it up from ldlm is going to be very nontrivial. We can find the thread easily, as it's recorded in the lock, but then we need to determine the ptlrpc set, which is privately allocated and referenced only on that thread's stack.

            green Oleg Drokin added a comment -

            I discussed this with Amir, and he thinks your situation resembles a case where there are not enough peer credits, so messages are dropped.

            I checked the logs and I do see this:

            00000800:00000200:5.0:1595883687.321977:0:15687:0:(o2iblnd_cb.c:900:kiblnd_post_tx_locked()) 10.151.26.238@o2ib: no credits
            

            But the date is strange: it's from July. There are no newer messages, but this is only printed at trace level "net", which you have not been including in the later runs (I guess you were not removing older server.dk files?).

            He says you need to check the peer_credits_hiw and peer_credits values; peer_credits_hiw should be set to 1/2 of peer_credits to ensure proper reclaim of credits for certain types of messages (growing peer_credits might not be a bad idea either, depending on what your settings are now).

            "lnetctl net show -v 4" and "lnetctl peer show -v" should show some related stats and min credits.


            mhanafi Mahmoud Hanafi added a comment -

            In the last set of logs I had "nettrace" and "neterror". We didn't see any network errors. I can run with +net if you would like.
            Would it make sense to see if you can reproduce this locally?
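            (Assuming lctl is available on the MDS, a sketch of adding that flag at runtime without clearing the rest of the mask:)

            # Append the "net" flag to the current debug mask, then verify the result
            lctl set_param debug=+net
            lctl get_param debug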


            adilger Andreas Dilger added a comment -

            Oleg, in addition to the BL AST "granting" the lock with LDLM_FL_CBPENDING, could it also wake up the thread waiting for the lock reply to allow it to finish processing? There would need to be some way to trigger this so that the client thinks the RPC was received successfully.

            Probably a better idea is for the RPC wait in that thread to have a periodic timeout that checks whether the lock is already granted, and to exit the wait itself. Then the BL AST can mark the lock granted with CBPENDING, and when the waiting thread next wakes up it will see the lock is granted, clean itself up, and continue processing the syscall.

            green Oleg Drokin added a comment -

            I don't see anything different in this log vs. the previous one:

            00010000:00010000:4.0:1598299915.176480:0:14307:0:(ldlm_lockd.c:1401:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=0, rc=0) ns: mdt-nbptest3-MDT0000_UUID lock: ffff9c95908f1200/0xcb77a9c20dcbc88e lrc: 4/0,0 mode: CR/CR res: [0x20000d6f1:0x1:0x0].0x0 bits 0x8/0x0 rrc: 488 type: IBT flags: 0x60200400000020 nid: 10.141.6.214@o2ib417 remote: 0xdd261a7834f00346 expref: 98 pid: 14307 timeout: 6658 lvb_type: 0
            00010000:00010000:4.0:1598299915.176495:0:14307:0:(ldlm_lockd.c:1479:ldlm_handle_enqueue0()) ### server-side enqueue handler END (lock ffff9c95908f1200, rc 0)
            00000100:00100000:4.0:1598299915.176508:0:14307:0:(service.c:2190:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_005:221c8e76-3a4a-d7bb-7d3c-6ea3526db183+98:4765:x1675936175078016:12345-10.141.6.214@o2ib417:101 Request processed in 8860us (8880us total) trans 107374186758 rc 0/0
            00000100:00100000:4.0:1598299915.176512:0:14307:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.141.6.214@o2ib417, seq: 2640
            00000100:00100000:4.0:1598299915.176518:0:14307:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.141.3.114@o2ib417, seq: 2761
            00000100:00100000:4.0:1598299915.176521:0:14307:0:(service.c:2140:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_005:44cfaec0-6c8c-6d1b-0b71-5fc557b7218e+97:4797:x1675936177175232:12345-10.141.3.114@o2ib417:101
            

            The request was handled and the reply was supposedly sent (at least there are no messages from LNet indicating otherwise), yet the client did not receive it.


            mhanafi Mahmoud Hanafi added a comment -

            Any updates?
            mhanafi Mahmoud Hanafi added a comment - edited

            Uploaded new debug logs with nettrace
            ftp:/uploads/LU-13692/LU-13692_patch2_testing_2020_08_24.tgz

            Server-side threads remain stuck until the server is rebooted.


            People

              Assignee: green Oleg Drokin
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 15
