  Lustre / LU-11644

LNet: Service thread inactive for 300 causes client evictions

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: Lustre 2.10.5, Lustre 2.12.1

    Description

      After updating to 2.10.5 we are now seeing periods of mass evictions from the servers. On the server we see the following stack trace:

      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.033253] Pid: 11080, comm: ll_ost01_220 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.033260] Call Trace:
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.033274]  [<ffffffffa0c1d0e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038258]  [<ffffffffa0bdae43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038285]  [<ffffffffa0bfbabb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038294]  [<ffffffffa10e78a4>] ofd_intent_policy+0x444/0xa40 [ofd]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038318]  [<ffffffffa0bda2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038346]  [<ffffffffa0c03b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038390]  [<ffffffffa0c89262>] tgt_enqueue+0x62/0x210 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038425]  [<ffffffffa0c8ceca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038455]  [<ffffffffa0c354bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038484]  [<ffffffffa0c394a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038489]  [<ffffffff810b1131>] kthread+0xd1/0xe0
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038492]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038512]  [<ffffffffffffffff>] 0xffffffffffffffff
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038515] LustreError: dumping log to /tmp/lustre-log.1541619192.11080
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.254898] LNet: Service thread pid 9724 was inactive for 303.19s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.310852] Pid: 9724, comm: ll_ost01_019 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.310854] Call Trace:
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.310866]  [<ffffffffa0c1d0e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332869]  [<ffffffffa0bdae43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332902]  [<ffffffffa0bfbabb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332912]  [<ffffffffa10e78a4>] ofd_intent_policy+0x444/0xa40 [ofd]
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332936]  [<ffffffffa0bda2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.332988]  [<ffffffffa0c03b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333032]  [<ffffffffa0c89262>] tgt_enqueue+0x62/0x210 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333067]  [<ffffffffa0c8ceca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333099]  [<ffffffffa0c354bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333128]  [<ffffffffa0c394a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333134]  [<ffffffff810b1131>] kthread+0xd1/0xe0
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333137]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333158]  [<ffffffffffffffff>] 0xffffffffffffffff
      

      We will upload the dump to ftp:/uploads/LU11613/lustre-log.1541619192.11080.

      We didn't have rpctrace or dlmtrace enabled, so the dump may not be very useful.
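
      For reference, a minimal sketch of how rpctrace/dlmtrace might be enabled ahead of the next occurrence (assuming the usual lctl debug interface; the buffer size and dump path are illustrative):

      # add RPC and DLM tracing to the Lustre debug mask on the affected OSS
      lctl set_param debug=+rpctrace
      lctl set_param debug=+dlmtrace
      # optionally enlarge the per-CPU debug buffer so the window of interest is kept
      lctl set_param debug_mb=1024
      # dump the debug buffer to a file after the next eviction event
      lctl dk /tmp/lustre-debug.$(date +%s)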

      Could be related to https://jira.whamcloud.com/browse/LU-11613

       

       

      Attachments

        1. zero.io.top.20210130.19.31.37
          480 kB
        2. s618.out
          37 kB
        3. s214_bt.20200108.18.21.23
          990 kB
        4. nasa_lu11644.patch
          15 kB
        5. lnet_metrics_during_eviction.pdf
          398 kB
        6. eviction_s611.06.05.19
          23 kB
        7. client_evictions_charts.pdf
          63 kB

        Issue Links

          Activity

            [LU-11644] LNet: Service thread inactive for 300 causes client evictions

            mhanafi Mahmoud Hanafi added a comment -

            I checked at_min, at_max, and the time on all servers and clients; they are the same.

            timeout=100

            at_min=275

            at_max=600

            We know that the server receives obd_pings from other clients and sends replies. The clients that get evicted are sending pings to the other OSTs; otherwise we would see evictions on more than one OST.
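
            For reference, a hedged sketch of how those values might be compared directly on the client in question and on the OSS (assuming they are exposed via lctl get_param and the ptlrpc module parameters on this version; paths are illustrative):

            # obd timeout as the node currently sees it
            lctl get_param timeout
            # adaptive-timeout bounds set as ptlrpc module parameters
            cat /sys/module/ptlrpc/parameters/at_min
            cat /sys/module/ptlrpc/parameters/at_max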
            green Oleg Drokin added a comment -

            Andreas, other pings are apparently coming in to other OSTs on this node. We don't print the target information, so I don't know how to confirm that theory 100%, but it seems the most likely one.

            Mahmoud - can you please directly check the timeout value on the client I looked at vs. the server, just to be sure?

            The watchdog timeout is set based on AT values, so it varies; the bigger it gets, the more latency there was in the system beforehand. But doing sysrq-t as you see fit should show you where everything is stuck.

            mhanafi Mahmoud Hanafi added a comment -

            We do mount multiple filesystems, but we don't save those timeout values on the MGS. I double-checked the MGS config logs; there were no timeout settings.
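
            For reference, a hedged sketch of how the MGS configuration llogs might be checked for timeout settings (assuming lctl llog_catlist / llog_print are available on the MGS node; the filesystem name is illustrative):

            # on the MGS: list the configuration logs, then dump one client log
            lctl --device MGS llog_catlist
            lctl --device MGS llog_print nbp2-client | grep -i timeout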

            adilger Andreas Dilger added a comment -

            The 3rd option is that the client's and the server's idea of the obd timeout differ (with the client having a significantly longer timeout than the server), in which case the client might be sending pings a lot less frequently than the server expects.

            Mahmoud, do you have multiple different filesystems mounted on the same client, possibly with different timeout values configured by their MGS config logs? There is an old problem, LU-9912, that relates to this, in the uncommon case where different filesystems have been tuned with different timeouts...

            mhanafi Mahmoud Hanafi added a comment -

            I don't think TBF is involved, because we saw this with FIFO as well.

            adilger Andreas Dilger added a comment -

            Oleg, is it possible that something in the NRS TBF path is bypassing or blocking the update of the export and preventing the request from being processed by the server?

            Your previous example shows that this client is active for other pings and yet it is being evicted shortly thereafter. At the time of eviction it looks like a request is being dequeued for the export, but it probably doesn't make sense for the target to evict a client that still has a pending request in the RPC queue:

            nrs_tbf_req_get()) TBF dequeues: class@ffffa08b32e6d200 gen 2 token 2, rule@ffffa095af158100 rate 5000 gen 2
            ping_evictor_main()) nbp2-OST0025: haven't heard from client (at 10.149.3.186@o2ib313) in 227 seconds. I think it's dead, and I am evicting it. cur 1612064083 expire 1612063933 last 1612063856

            So it might be useful to print a bit more information in nrs_tbf_req_get() about the request so that we can see what it is. Maybe change the CDEBUG(D_RPCTRACE, ...) there to a DEBUG_REQ(D_RPCTRACE, ...). Also, adding the target name into the ptlrpc_server_handle_request() "Handling" and "Handled" lines (from req->rq_export->exp_obd->obd_name, if rq_export isn't NULL) would definitely be very useful for lots of reasons.
            mhanafi Mahmoud Hanafi added a comment - edited

            This filesystem never sees more than 1 second of zero reads and writes, but during this event it saw zero I/O for 2 minutes, so something must be locked up. I will try to get a stack trace of all threads.

            The file I attached yesterday (s618.out) shows that all I/O drops to zero and the server load starts to climb from 19:31:36 until 19:34:56, when the evictions happen.

            The watchdog timers (500 seconds) are much higher than the obd_ping timeout, so we wouldn't see those warnings.
            green Oleg Drokin added a comment -

            First of all, I do not think you have any outstanding I/O requests from this particular client. If you did, we'd see aborted bulk messages, and I don't think I saw any?

            If the I/O thread were taking too long in the actual filesystem layer, we'd see messages like this for writes: "transaction handle %p was open for too long: now %lld, alloced %lld, started %lld, closed %lld", and there are "slow I/O" messages of another kind that I cannot readily find. In addition, the LNet "thread took too long" warning would also appear, and I think you did not have any incidents of any watchdogs triggering this time?

            If you had requests sitting too long in the serving queue, there'd be the "Already past deadline" messages.

            If you had requests taking too long while actually being serviced, the client would time out (and you'd also have watchdogs trigger).

            Looking at the code, we prolong the export several times: when a request comes in and is added to the incoming request queue, when we start serving it, and when we finish serving it.
            So while it's possible for an overly long request to end up in a ping eviction in the absence of other requests, the fact that the export's "last served" time did not advance in all this time, and that we can see it matching the last confirmed request from this client, is pretty solid evidence that this particular client did not really send a ping request for this OST, or, if it did, that the ping did not arrive at the OSS node. Note that if this were the case, you should see a timeout for opc 400 on the client, though.

            The 3rd option is that the client's and the server's idea of obd_timeout differ (with the client having a significantly longer timeout than the server); then the client might be sending pings a lot less frequently than the server expects.

            mhanafi Mahmoud Hanafi added a comment -

            I don't think this is due to dropped messages. If we had dropped messages, it would affect multiple servers and OSTs. There are no errors on the clients until they get evicted, no errors on the routers, and no LNet errors on the servers.

            What is troubling is why all I/O halts while all the ll_ost_io threads are sitting in 'D' state. I will try to get a dump of the threads' stacks.

            If an I/O request takes longer to service (while the I/O threads are sitting in 'D' state) than the obd_ping timeout, will the client get evicted?
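
            For reference, a hedged sketch of how a full task dump might be captured the next time I/O stalls (standard sysrq triggers, not Lustre-specific; the output file name is illustrative):

            # dump all task states and stacks into the kernel log, then save it
            echo t > /proc/sysrq-trigger
            dmesg > /tmp/sysrq_t.$(hostname).$(date +%Y%m%d.%H%M%S)
            # on newer kernels, 'w' dumps only uninterruptible (D-state) tasks
            echo w > /proc/sysrq-trigger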
            green Oleg Drokin added a comment - edited

            I checked the debug log and I don't think I see any traces of requests being stuck in the incoming queue for a long time. The various "Already past deadline" sorts of messages are not in the logs.

            I picked the very first evicted client, "10.149.3.186@o2ib313" talking to "nbp2-OST0025", and I see the last message from it as expected - getting a PR lock.

            I do see a bunch of ping requests from this same nid/client across time, but there's no way to know which particular OST they were sent to (and there's more than one on this node). The important part is that there's no significant delay between a request's arrival and it being taken into actual processing.

            E.g. the eviction happened at:

            00000100:00080000:11.0:1612064007.003210:0:97446:0:(service.c:1128:ptlrpc_update_export_timer()) nbp2-OST0025: Think about evicting 10.149.3.186@o2ib313 from 1612063856
            ...
            00000100:00080000:9.0:1612064083.110023:0:50801:0:(pinger.c:610:ping_evictor_main()) evicting all exports of obd nbp2-OST0025 older than 1612063933
            00000100:00100000:12.0:1612064083.110024:0:21995:0:(nrs_tbf.c:3127:nrs_tbf_req_get()) TBF dequeues: class@ffffa08b32e6d200 rate 5000 gen 2 token 2, rule@ffffa095af158100 rate 5000 gen 2
            00000100:02000400:9.0:1612064083.110025:0:50801:0:(pinger.c:636:ping_evictor_main()) nbp2-OST0025: haven't heard from client b58b2f27-aabc-f700-545a-8d4bc7d4c55c (at 10.149.3.186@o2ib313) in 227 seconds. I think it's dead, and I am evicting it. exp ffffa07a96360000, cur 1612064083 expire 1612063933 last 1612063856
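
            (For reference, assuming the stock evictor policy in which PING_INTERVAL = obd_timeout / 4 and PING_EVICT_TIMEOUT = 6 * PING_INTERVAL: cur - expire = 1612064083 - 1612063933 = 150 s = 1.5 x obd_timeout with timeout=100, and cur - last = 1612064083 - 1612063856 = 227 s, which matches the "in 227 seconds" in the eviction message above.)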
            

            Now looking back from that point for traffic to 10.149.3.186@o2ib313:

            00000400:00000200:9.0:1612064056.837794:0:34195:0:(lib-move.c:4114:lnet_parse()) TRACE: 10.151.26.114@o2ib(10.151.26.114@o2ib) <- 10.149.3.186@o2ib313 : PUT - for me
            00000400:00000200:9.0:1612064056.837798:0:34195:0:(lib-ptl.c:571:lnet_ptl_match_md()) Request from 12345-10.149.3.186@o2ib313 of length 224 into portal 28 MB=0x6011a1473a440
            00000400:00000200:9.0:1612064056.837801:0:34195:0:(lib-ptl.c:200:lnet_try_match_md()) Incoming put index 1c from 12345-10.149.3.186@o2ib313 of length 224/224 into md 0x86246a0871 [1] + 0
            00000400:00000200:9.0:1612064056.837803:0:34195:0:(lib-md.c:65:lnet_md_unlink()) Queueing unlink of md ffffa07b4745a2a8
            00000100:00000200:9.0:1612064056.837804:0:34195:0:(events.c:305:request_in_callback()) event type 2, status 0, service ost
            00000100:00100000:9.0:1612064056.837807:0:34195:0:(events.c:351:request_in_callback()) peer: 12345-10.149.3.186@o2ib313 (source: 12345-10.149.3.186@o2ib313)
            00000400:00000200:9.0:1612064056.837809:0:34195:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffffa07b4745a2a8
            00000400:00000200:9.0:1612064056.837810:0:34195:0:(lib-msg.c:816:lnet_is_health_check()) health check = 0, status = 0, hstatus = 0
            00000800:00000200:9.0:1612064056.837811:0:34195:0:(o2iblnd_cb.c:205:kiblnd_post_rx()) conn[ffffa081f0c66000] (69)++
            00000100:00100000:10.0:1612064056.837812:0:37831:0:(service.c:1989:ptlrpc_server_handle_req_in()) got req x1690061384164416
            00000800:00000200:9.0:1612064056.837812:0:34195:0:(o2iblnd_cb.c:239:kiblnd_post_rx()) conn[ffffa081f0c66000] (70)--
            00000800:00000200:9.0:1612064056.837813:0:34195:0:(o2iblnd_cb.c:3859:kiblnd_scheduler()) conn[ffffa081f0c66000] (69)--
            00000100:00100000:10.0:1612064056.837817:0:37831:0:(nrs_tbf.c:3210:nrs_tbf_req_add()) TBF enqueues: class@ffffa081e90ff600 rate 5000 gen 2 token 0, rule@ffffa07df979db00 rate 5000 gen 2
            00000100:00100000:10.0:1612064056.837821:0:37831:0:(nrs_tbf.c:3127:nrs_tbf_req_get()) TBF dequeues: class@ffffa081e90ff600 rate 5000 gen 2 token 1, rule@ffffa07df979db00 rate 5000 gen 2
            00000100:00100000:10.0:1612064056.837828:0:37831:0:(service.c:2140:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost00_250:b58b2f27-aabc-f700-545a-8d4bc7d4c55c+7:3580:x1690061384164416:12345-10.149.3.186@o2ib313:400
            

            So as you can see, processing is instantaneous.

            What does the dmesg on the client side say? Could the RPCs be dropped somewhere along the way, for example at the routers?

            I don't see any obvious signs of anything being slow - no complaints about slow I/O, nor any threads being stuck for a long time like the original bug report shows (the "LNet: Service thread pid XXX was inactive for" messages).


            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 1
              Watchers: 15
