LU-11644: LNet: Service thread inactive for 300 causes client evictions

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: Lustre 2.10.5, Lustre 2.12.1
    • Severity: 2

    Description

      After updating to 2.10.5, we are now seeing periods of mass evictions from the servers. On the server we see the following stack trace:

      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.033253] Pid: 11080, comm: ll_ost01_220 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.033260] Call Trace:
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.033274]  [<ffffffffa0c1d0e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038258]  [<ffffffffa0bdae43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038285]  [<ffffffffa0bfbabb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038294]  [<ffffffffa10e78a4>] ofd_intent_policy+0x444/0xa40 [ofd]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038318]  [<ffffffffa0bda2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038346]  [<ffffffffa0c03b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038390]  [<ffffffffa0c89262>] tgt_enqueue+0x62/0x210 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038425]  [<ffffffffa0c8ceca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038455]  [<ffffffffa0c354bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038484]  [<ffffffffa0c394a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038489]  [<ffffffff810b1131>] kthread+0xd1/0xe0
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038492]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038512]  [<ffffffffffffffff>] 0xffffffffffffffff
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038515] LustreError: dumping log to /tmp/lustre-log.1541619192.11080
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.254898] LNet: Service thread pid 9724 was inactive for 303.19s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.310852] Pid: 9724, comm: ll_ost01_019 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.310854] Call Trace:
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.310866]  [<ffffffffa0c1d0e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332869]  [<ffffffffa0bdae43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332902]  [<ffffffffa0bfbabb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332912]  [<ffffffffa10e78a4>] ofd_intent_policy+0x444/0xa40 [ofd]
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332936]  [<ffffffffa0bda2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.332988]  [<ffffffffa0c03b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333032]  [<ffffffffa0c89262>] tgt_enqueue+0x62/0x210 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333067]  [<ffffffffa0c8ceca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333099]  [<ffffffffa0c354bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333128]  [<ffffffffa0c394a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333134]  [<ffffffff810b1131>] kthread+0xd1/0xe0
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333137]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333158]  [<ffffffffffffffff>] 0xffffffffffffffff
      

      We will upload the dumped log to ftp:/uploads/LU11613/lustre-log.1541619192.11080

      We didn't have rpctrace or dlmtrace enabled, so it may not be very useful.

      Could be related to https://jira.whamcloud.com/browse/LU-11613

       

       

      Attachments

        1. client_evictions_charts.pdf (63 kB)
        2. eviction_s611.06.05.19 (23 kB)
        3. lnet_metrics_during_eviction.pdf (398 kB)
        4. nasa_lu11644.patch (15 kB)
        5. s214_bt.20200108.18.21.23 (990 kB)
        6. s618.out (37 kB)
        7. zero.io.top.20210130.19.31.37 (480 kB)

        Issue Links

          Activity


            mhanafi Mahmoud Hanafi added a comment -

            I checked at_min, at_max, and the time on all servers and clients; they are the same.

            timeout=100

            at_min=275

            at_max=600

            We know that the server receives obd_pings from other clients and sends replies. The clients that get evicted are sending pings to the other OSTs; otherwise we would see evictions on more than one OST.

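            For reference, a minimal worked sketch of how these settings map to the 150-second eviction window quoted elsewhere in this ticket ("cur 1612064083 expire 1612063933 last 1612063856"). It assumes the stock definitions PING_INTERVAL = obd_timeout / 4 and PING_EVICT_TIMEOUT = 6 * PING_INTERVAL; these were not verified on this system:

            #include <stdio.h>

            /* Assumed stock defaults (obd_support.h); the timeout value is taken
             * from the comment above, not read from the system. */
            #define OBD_TIMEOUT        100                  /* "timeout=100" */
            #define PING_INTERVAL      (OBD_TIMEOUT / 4)    /* client pings every 25s */
            #define PING_EVICT_TIMEOUT (PING_INTERVAL * 6)  /* evictor window: 150s */

            int main(void)
            {
                    /* 1612064083 - 1612063933 = 150s window;
                     * 1612064083 - 1612063856 = 227s of silence from that export. */
                    printf("ping interval: %ds, eviction window: %ds\n",
                           PING_INTERVAL, PING_EVICT_TIMEOUT);
                    return 0;
            }
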
            green Oleg Drokin added a comment -

            Andreas, other pings are apparently coming to other OSTs on this node. We don't print the target information, so I don't know how to confirm that theory 100%, but it seems to be the most likely one.

            Mahmoud - can you please directly check the timeout value on the client I looked at vs the server just to be sure?

            The watchdog timeout is set based on AT values, so it varies; the bigger it gets, the more latency there was in the system beforehand. But running sysrq-t as you see fit should show you where everything is stuck.


            mhanafi Mahmoud Hanafi added a comment -

            We do mount multiple filesystems, but we don't save those timeout values on the MGS. I double-checked the MGS config logs; there were no timeout settings.

            adilger Andreas Dilger added a comment -

            A third option is that the client's and the server's idea of the obd timeout differ (with the client having a significantly longer timeout than the server); in that case the client might be sending pings a lot less frequently than the server expects.

            Mahmoud, do you have multiple different filesystems mounted on the same client, possibly with different timeout values configured by their MGS config logs? There is an old problem LU-9912 that relates to this, in the uncommon case where different filesystems have been tuned with different timeouts...


            mhanafi Mahmoud Hanafi added a comment -

            I don't think TBF is involved, because we saw this with FIFO.


            adilger Andreas Dilger added a comment -

            Oleg, is it possible that something in the NRS TBF path is bypassing/blocking the update of the export and preventing the request from being processed by the server?

            In your previous example, it shows that this client is active for other pings and yet it is being evicted shortly thereafter. At the time of eviction, it looks like a request is being dequeued for the export, but it probably doesn't make sense for the target to evict a client that still has a pending request in the RPC queue:

            nrs_tbf_req_get()) TBF dequeues: class@ffffa08b32e6d200 gen 2 token 2, rule@ffffa095af158100 rate 5000 gen 2
            ping_evictor_main()) nbp2-OST0025: haven't heard from client (at 10.149.3.186@o2ib313) in 227 seconds. I think it's dead, and I am evicting it. cur 1612064083 expire 1612063933 last 1612063856
            

            So it might be useful to print a bit more information in nrs_tbf_req_get() about the request, so that we can see what it is. Maybe change the CDEBUG(D_RPCTRACE, ...) there to a DEBUG_REQ(D_RPCTRACE, ...). Also, adding the target name into the ptlrpc_server_handle_request() "Handling" and "Handled" lines (from req->rq_export->exp_obd->obd_name, if rq_export isn't NULL) would definitely be very useful for lots of reasons.

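            A minimal sketch of that suggested logging change, for illustration only (this is not an actual patch: it assumes the enclosing ptlrpc_request is reachable from the NRS request via container_of() on its rq_nrq member, and the helper name plus the 'cli'/'ntoken' locals are placeholders for whatever nrs_tbf_req_get() already has in scope):

            static void tbf_dequeue_trace(struct ptlrpc_nrs_request *nrq,
                                          struct nrs_tbf_client *cli, __u64 ntoken)
            {
                    struct ptlrpc_request *req = container_of(nrq,
                                                              struct ptlrpc_request,
                                                              rq_nrq);

                    /* DEBUG_REQ() prefixes the message with the request's opcode,
                     * xid, peer NID and export, which the current class-only
                     * CDEBUG() in nrs_tbf_req_get() does not show. */
                    DEBUG_REQ(D_RPCTRACE, req, "TBF dequeues: class@%p token %llu",
                              cli, ntoken);
            }
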
            mhanafi Mahmoud Hanafi added a comment - - edited

            This filesystem never sees more than 1 second of zero reads and writes, but during this event it saw zero I/O for 2 minutes, so something must be locked up. I will try to get a stack trace of all threads.

            The file I attached yesterday (s618.out) shows that all I/O drops to zero and the server load starts to climb from 19:31:36 to 19:34:56, when the evictions happen.

            The watchdog timers (500 seconds) are much higher than the obd_ping timeout, so we wouldn't see those warnings.

            green Oleg Drokin added a comment -

            First of all, I do not think you have any outstanding I/O requests from this particular client. If you did, we'd see aborted bulk messages, and I don't think I saw any?

            If the I/O thread were taking too long in the actual filesystem layer, we'd see messages like this for writes: "transaction handle %p was open for too long: now %lld, alloced %lld, started %lld, closed %lld", and there are slow-I/O messages of other kinds that I cannot readily find right now. In addition, the LNet "thread took too long" watchdog would also fire, and I think you did not have any incidents of any watchdogs triggering this time?

            If you had requests sitting too long in the serving queue, there'd be "Already past deadline" messages.

            If you had requests taking too long to actually be serviced, the client would time out (and you'd also have watchdogs trigger).

            Looking at the code, we prolong the export several times: when a request comes in and is added to the incoming request queue, when we start serving it, and when we finish serving it.
            So while it's possible for a too-long request to end up in a ping eviction in the absence of other requests, the fact that the export's "last served" time did not advance in all this time, and that we can see it matching the last confirmed request from this client, is pretty solid evidence that at least this particular client did not really send a ping request to this OST, or that if it did, it did not arrive at the OSS node. Note that if this were the case, you should see a timeout for opc 400 on the client, though.
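
            A paraphrased sketch of that flow (not the actual service.c code: ptlrpc_update_export_timer() does exist in ptlrpc/service.c, but the wrapper function, call sites and 'extra_delay' below are simplified placeholders):

            /* The export timestamp that ping_evictor_main() checks is refreshed at
             * three points around request handling, so the evictor should only
             * fire if none of them ran for this export within its window. */
            static int handle_one_request_sketch(struct ptlrpc_service_part *svcpt,
                                                 struct ptlrpc_request *req,
                                                 time64_t extra_delay)
            {
                    int rc;

                    /* 1. request received and added to the incoming queue */
                    ptlrpc_update_export_timer(req->rq_export, 0);

                    /* 2. just before a service thread starts handling it */
                    ptlrpc_update_export_timer(req->rq_export, 0);

                    rc = svcpt->scp_service->srv_ops.so_req_handler(req);

                    /* 3. after the handler returns */
                    ptlrpc_update_export_timer(req->rq_export, extra_delay);

                    return rc;
            }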

            A third option is that the client's and the server's idea of the obd timeout differ (with the client having a significantly longer timeout than the server); in that case the client might be sending pings a lot less frequently than the server expects.


            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 1
              Watchers: 15

              Dates

                Created:
                Updated: