Lustre / LU-11644

LNet: Service thread inactive for 300 causes client evictions

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: Lustre 2.10.5, Lustre 2.12.1

    Description

      After updating to 2.10.5 we are now seeing periods of mass evictions from the servers. On the server we have the following stack trace:

      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.033253] Pid: 11080, comm: ll_ost01_220 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.033260] Call Trace:
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.033274]  [<ffffffffa0c1d0e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038258]  [<ffffffffa0bdae43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038285]  [<ffffffffa0bfbabb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038294]  [<ffffffffa10e78a4>] ofd_intent_policy+0x444/0xa40 [ofd]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038318]  [<ffffffffa0bda2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038346]  [<ffffffffa0c03b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038390]  [<ffffffffa0c89262>] tgt_enqueue+0x62/0x210 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038425]  [<ffffffffa0c8ceca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038455]  [<ffffffffa0c354bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038484]  [<ffffffffa0c394a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038489]  [<ffffffff810b1131>] kthread+0xd1/0xe0
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038492]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038512]  [<ffffffffffffffff>] 0xffffffffffffffff
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038515] LustreError: dumping log to /tmp/lustre-log.1541619192.11080
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.254898] LNet: Service thread pid 9724 was inactive for 303.19s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.310852] Pid: 9724, comm: ll_ost01_019 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.310854] Call Trace:
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.310866]  [<ffffffffa0c1d0e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332869]  [<ffffffffa0bdae43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332902]  [<ffffffffa0bfbabb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332912]  [<ffffffffa10e78a4>] ofd_intent_policy+0x444/0xa40 [ofd]
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332936]  [<ffffffffa0bda2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.332988]  [<ffffffffa0c03b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333032]  [<ffffffffa0c89262>] tgt_enqueue+0x62/0x210 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333067]  [<ffffffffa0c8ceca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333099]  [<ffffffffa0c354bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333128]  [<ffffffffa0c394a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333134]  [<ffffffff810b1131>] kthread+0xd1/0xe0
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333137]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333158]  [<ffffffffffffffff>] 0xffffffffffffffff
      

      We will upload the log to ftp:/uploads/LU11613/lustre-log.1541619192.11080

      We didn't have rpctrace or dlmtrace enabled, so it may not be very useful.

      Could be related to https://jira.whamcloud.com/browse/LU-11613

       

       

      Attachments

        1. client_evictions_charts.pdf
          63 kB
        2. eviction_s611.06.05.19
          23 kB
        3. lnet_metrics_during_eviction.pdf
          398 kB
        4. nasa_lu11644.patch
          15 kB
        5. s214_bt.20200108.18.21.23
          990 kB
        6. s618.out
          37 kB
        7. zero.io.top.20210130.19.31.37
          480 kB

        Issue Links

          Activity

            [LU-11644] LNet: Service thread inactive for 300 causes client evictions
            pjones Peter Jones added a comment -

            Do I understand correctly that the patch that landed was not a fix for the issue but just made the issue easier to debug?


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37193/
            Subject: LU-11644 ptlrpc: show target name in req_history
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 83b6c6608e94c05766ed1eddad1b7f2fee5e79fa
            adilger Andreas Dilger added a comment -

            For very large clusters it would also be nice to set obd_ping rate and eviction timeout independent of at_min. At 10k nodes our OSS average ~4000 obd_ping rpcs/sec

            For newer Lustre versions (2.12 client+server), clients will disconnect from an OST in the background if they are not actively communicating with it. This will reduce the number of pings from an idle client, and will also reduce the load during recovery due to fewer clients being connected.
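            (For reference, a minimal sketch of inspecting/tuning the 2.12 idle-disconnect behaviour mentioned above. The osc.*.idle_timeout parameter name is from memory and should be verified against your client version; the values below are only illustrative.)

              # On a 2.12+ client: show the current idle disconnect timeout (seconds)
              lctl get_param osc.*.idle_timeout
              # Illustrative: shorten it so idle clients drop their OST connections sooner
              lctl set_param osc.*.idle_timeout=60
              # Setting it to 0 is expected to disable idle disconnect entirely
              lctl set_param osc.*.idle_timeout=0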

            mhanafi Mahmoud Hanafi added a comment -

            For now we have enabled rpctrace. Long term, I think elevating ping RPC priority may be a good option. For very large clusters it would also be nice to set obd_ping rate and eviction timeout independent of at_min. At 10k nodes our OSS average ~4000 obd_ping rpcs/sec.

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37193
            Subject: LU-11644 ptlrpc: show target name in req_history
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 3d7e754be0a0aec62cd74c977cab159796b4b8a8
            adilger Andreas Dilger added a comment - - edited

            It looks like 2.10.x has the patch https://review.whamcloud.com/17345 "LU-7470 nrs: extend TBF with NID/JobID/OPCode expression", which allows you to rate-limit RPCs by opcode. While I would not normally suggest limiting LDLM traffic, you could try this for opc 101 (LDLM_ENQUEUE) to see if the problem can be avoided. I'm not sure of the specific rate limit you want, but looking at the recently-attached graph you could try 4000 as a reasonable cap (i.e. above normal usage, but enough to slow down the huge spike of incoming RPCs). You may have a more fine-grained breakdown of RPCs by type to see how many of the RPCs in that spike are LDLM_ENQUEUE.
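            (A possible starting point for the opcode-based TBF rule described above. This is only a sketch: the exact nrs_policies/nrs_tbf_rule syntax and the opcode name may differ between releases, so check it against the manual for your version before relying on it. LDLM enqueues are handled by the regular "ost" service, which is why that service is targeted rather than ost_io.)

              # Enable the TBF policy keyed by opcode on the regular OST service (syntax per LU-7470; verify for your release)
              lctl set_param ost.OSS.ost.nrs_policies="tbf opcode"
              # Rate-limit LDLM_ENQUEUE (opc 101) to ~4000 RPCs/sec, above normal usage but below the observed spike
              lctl set_param ost.OSS.ost.nrs_tbf_rule="start ldlm_limit opcode={ldlm_enqueue} rate=4000"
              # Inspect the active rules
              lctl get_param ost.OSS.ost.nrs_tbf_rule
              # Remove the rule again when done testing
              lctl set_param ost.OSS.ost.nrs_tbf_rule="stop ldlm_limit"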

            If enabling +rpctrace debugging is too heavy-weight, you could also try to enable the RPC history functionality, via "lctl set_param ost.OSS.*.req_buffer_history_max=10000" and then grab it every second via "lctl get_param ost.OSS.*.req_history >> /var/log/lustre_oss_history-$(date +%Y%m%d%H%M)" (make sure you have a rule in /etc/logrotate.d/lustre.conf to clean up those logs if you cannot reproduce it quickly) and then post-process it after an event for a detailed RPC log (sort -u would be needed to discard duplicate records in a given logfile). The fields are:

            timestamp:target_nid:client_nid:client_xid:req_length:req_state:arrival_time:expiry_time opc number
            6780438299461025792:192.168.20.1@tcp:12345-192.168.20.159@tcp:x1654112309314912:328:Complete:1578693813.544069:-1.455930s(-43.0s) opc 101
            

            Since the RPC history does not involve printing to the logs (just saving the request into a list), it may be less overhead than the kernel debug logging, and allow you to capture information when the problem happens.
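            (A small capture-loop sketch built from the commands above; the log path and one-second cadence are taken from the suggestion, the loop structure itself is just illustrative.)

              #!/bin/bash
              # Sketch: enlarge the RPC history buffer, then sample it once per second into date-stamped files
              lctl set_param ost.OSS.*.req_buffer_history_max=10000
              while true; do
                  lctl get_param ost.OSS.*.req_history >> /var/log/lustre_oss_history-$(date +%Y%m%d%H%M)
                  sleep 1
              done
              # After an event, discard the duplicate records before analysis:
              #   sort -u /var/log/lustre_oss_history-* > /tmp/oss_history.sorted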


            I haven't been able to track down the user or application creating the glimpse RPCs. Part of the issue is that enabling RPCTRACE slows things down enough to prevent the evictions. I tried to limit RPCs using TBF, but all RPCs are issued as user root, so that didn't work.

            I will try to set up some additional debugging triggers to determine the application/user.

            green Oleg Drokin added a comment -

            Can you please describe what the workload is? Some big activity on shared files, but enough of them to generate plenty of glimpses? Something else?

            You can try scaling up the number of OST threads and see if that helps, though obviously it's not ideal. There are probably a number of other workarounds we could use, such as elevating ping RPC priority if it's not already high.
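            (If you want to experiment with the thread-count suggestion, a hedged sketch of the usual tunables follows; parameter names are from memory and the target value is only an example, so confirm them on your servers.)

              # Current count and cap for the regular OST service threads (the ll_ost_* threads)
              lctl get_param ost.OSS.ost.threads_started ost.OSS.ost.threads_max
              # Raise the cap, e.g. from 512 to 1024; additional threads are started on demand
              lctl set_param ost.OSS.ost.threads_max=1024
              # For a persistent setting, the oss_num_threads module option can be set in /etc/modprobe.d/lustre.conf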


            adilger Andreas Dilger added a comment -

            Mahmoud, do you know what the client application is doing at this point in the run? Glimpse RPCs are generated when clients do "stat()" operations on files to get the size: the client sends an LDLM glimpse RPC for the OST object(s) in the file, which may in turn cause the OST to send RPCs to the client(s) holding locks on the file if it is actively being written. So if multiple clients were doing parallel directory tree traversal in the same directory where other clients are writing, that could generate a lot of glimpses; alternatively, the application may be malformed and calling stat() repeatedly on a shared file for some reason (e.g. to poll for updates/completion)?

            mhanafi Mahmoud Hanafi added a comment -

            I was able to get a backtrace of all threads during a period when the server's ping RPCs drop to zero. It shows 508 out of 512 ll_ost threads in ldlm_run_ast_work. This must block receiving all other RPCs.

            What options do we have to slow down the rate of ldlm_glimpse_enqueues?

            s214_bt.20200108.18.21.23
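            (For completeness, counts like the 508-of-512 above can be pulled straight out of a dump with something along these lines, assuming the attached file uses the same "comm: ll_ost..." / call-trace format as the traces quoted in the description.)

              # How many threads are sitting in ldlm_run_ast_work in the captured backtraces
              grep -c ldlm_run_ast_work s214_bt.20200108.18.21.23
              # Total number of ll_ost service threads in the same dump, for comparison
              grep -c 'comm: ll_ost' s214_bt.20200108.18.21.23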

            mhanafi Mahmoud Hanafi added a comment -

            I am trying to create a reproducer for the case where an OSS gets a large spike in ldlm_glimpse_enqueue RPCs. What is the best way to recreate this RPC workload?
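            (One way to approximate that workload, under the assumption from Andreas's comment above that the glimpses come from stat() on files whose extent locks are held by actively-writing clients: have a few clients keep writing shared files while many other clients stat() them in a tight loop. A rough pdsh sketch; the hostnames, file system path, and counts are made up for illustration.)

              # Writers: keep write extent locks active on their files (one file per writer node)
              pdsh -w writer[01-04] 'dd if=/dev/zero of=/lustrefs/testdir/shared.$(hostname) bs=1M count=100000 conv=notrunc' &

              # Stat storm: many clients repeatedly glimpse the writers' files, driving LDLM_ENQUEUE (glimpse) RPCs at the OSS
              pdsh -w client[001-500] 'for i in $(seq 1 10000); do stat /lustrefs/testdir/shared.* > /dev/null; done'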

            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 1
              Watchers: 15
