Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11644

LNet: Service thread inactive for 300 causes client evictions

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.10.5, Lustre 2.12.1
    • None
    • 2
    • 9223372036854775807

    Description

      Update to 2.10.5 now we are seeing periods of mass evictions from servers. On the server we have the following stack trace

      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.033253] Pid: 11080, comm: ll_ost01_220 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.033260] Call Trace:
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.033274]  [<ffffffffa0c1d0e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038258]  [<ffffffffa0bdae43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038285]  [<ffffffffa0bfbabb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038294]  [<ffffffffa10e78a4>] ofd_intent_policy+0x444/0xa40 [ofd]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038318]  [<ffffffffa0bda2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038346]  [<ffffffffa0c03b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038390]  [<ffffffffa0c89262>] tgt_enqueue+0x62/0x210 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038425]  [<ffffffffa0c8ceca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038455]  [<ffffffffa0c354bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038484]  [<ffffffffa0c394a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038489]  [<ffffffff810b1131>] kthread+0xd1/0xe0
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038492]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038512]  [<ffffffffffffffff>] 0xffffffffffffffff
      Nov  7 11:33:12 nbp8-oss7 kernel: [531465.038515] LustreError: dumping log to /tmp/lustre-log.1541619192.11080
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.254898] LNet: Service thread pid 9724 was inactive for 303.19s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.310852] Pid: 9724, comm: ll_ost01_019 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.310854] Call Trace:
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.310866]  [<ffffffffa0c1d0e0>] ptlrpc_set_wait+0x4c0/0x920 [ptlrpc]
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332869]  [<ffffffffa0bdae43>] ldlm_run_ast_work+0xd3/0x3a0 [ptlrpc]
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332902]  [<ffffffffa0bfbabb>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332912]  [<ffffffffa10e78a4>] ofd_intent_policy+0x444/0xa40 [ofd]
      Nov  7 11:33:14 nbp8-oss7 kernel: [531467.332936]  [<ffffffffa0bda2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.332988]  [<ffffffffa0c03b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333032]  [<ffffffffa0c89262>] tgt_enqueue+0x62/0x210 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333067]  [<ffffffffa0c8ceca>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333099]  [<ffffffffa0c354bb>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333128]  [<ffffffffa0c394a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333134]  [<ffffffff810b1131>] kthread+0xd1/0xe0
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333137]  [<ffffffff816a14f7>] ret_from_fork+0x77/0xb0
      Nov  7 11:33:15 nbp8-oss7 kernel: [531467.333158]  [<ffffffffffffffff>] 0xffffffffffffffff
      

      will upload to ftp:/uploads/LU11613/lustre-log.1541619192.11080

      we didn't have rpctrace or dlmtrace so may not be very useful.

      Could be related to https://jira.whamcloud.com/browse/LU-11613

       

       

      Attachments

        1. client_evictions_charts.pdf
          63 kB
          Mahmoud Hanafi
        2. eviction_s611.06.05.19
          23 kB
          Mahmoud Hanafi
        3. lnet_metrics_during_eviction.pdf
          398 kB
          Mahmoud Hanafi
        4. nasa_lu11644.patch
          15 kB
          Amir Shehata
        5. s214_bt.20200108.18.21.23
          990 kB
          Mahmoud Hanafi
        6. s618.out
          37 kB
          Mahmoud Hanafi
        7. zero.io.top.20210130.19.31.37
          480 kB
          Mahmoud Hanafi

        Issue Links

          Activity

            People

              ashehata Amir Shehata (Inactive)
              mhanafi Mahmoud Hanafi
              Votes:
              1 Vote for this issue
              Watchers:
              15 Start watching this issue

              Dates

                Created:
                Updated: