Lustre / LU-12096

ldlm_run_ast_work call traces and network errors on overloaded OSS

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: Lustre 2.12.0
    • Severity: 3

    Description

      Our OSS servers running 2.12.0 on Fir had been running fine until this morning. We are now seeing network errors and call traces, and all the servers seem overloaded. The filesystem is still responsive. I wanted to share the following logs with you in case you see anything wrong. This really looks like a network issue, but we spent some time investigating and didn't find any problems on our different IB fabrics; LNet does show dropped packets, though.

      Mar 21 09:44:43 fir-io1-s1 kernel: LNet: Skipped 2 previous similar messages
      Mar 21 09:44:43 fir-io1-s1 kernel: Pid: 96368, comm: ll_ost01_045 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
      Mar 21 09:44:44 fir-io1-s1 kernel: Call Trace:
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0dcd890>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0d8b185>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0dac86b>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc166b10b>] ofd_intent_policy+0x69b/0x920 [ofd]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0d8bec6>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0db48a7>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0e3b302>] tgt_enqueue+0x62/0x210 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0e4235a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0de692b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffc0dea25c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffff850c1c31>] kthread+0xd1/0xe0
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffff85774c24>] ret_from_fork_nospec_begin+0xe/0x21
      Mar 21 09:44:44 fir-io1-s1 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      Mar 21 09:44:44 fir-io1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1553186684.96368
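
      For reference, the binary debug dump mentioned above can be converted to readable text with lctl debug_file; a minimal sketch (the output path is just an example):

      # Decode the binary Lustre debug dump referenced in the console message
      # into plain text (the second argument is an arbitrary output file).
      lctl debug_file /tmp/lustre-log.1553186684.96368 /tmp/lustre-log.1553186684.96368.txt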
      
      [root@fir-io1-s1 ~]# lnetctl stats show
      statistics:
          msgs_alloc: 0
          msgs_max: 16371
          rst_alloc: 283239
          errors: 0
          send_count: 2387805172
          resend_count: 0
          response_timeout_count: 203807
          local_interrupt_count: 0
          local_dropped_count: 33
          local_aborted_count: 0
          local_no_route_count: 0
          local_timeout_count: 961
          local_error_count: 13
          remote_dropped_count: 3
          remote_error_count: 0
          remote_timeout_count: 12
          network_timeout_count: 0
          recv_count: 2387455644
          route_count: 0
          drop_count: 2971
          send_length: 871207195166809
          recv_length: 477340920770381
          route_length: 0
          drop_length: 1291240
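
      One way to tell whether the drop and timeout counters above are still climbing is to sample them periodically; a minimal sketch (the 30-second interval and the selected counters are just an example):

      # Print a timestamp and the LNet drop/timeout counters every 30 seconds
      # so it is easy to see whether they keep increasing.
      while true; do
          date
          lnetctl stats show | grep -E 'drop_count|local_timeout_count|response_timeout_count'
          sleep 30
      done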
      

      I was able to dump the kernel tasks using sysrq and am attaching that as fir-io1-s1-sysrq-t.log.
      I'm also attaching the full kernel logs as fir-io1-s1-kern.log.
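
      The task dump was taken with sysrq-t; a minimal sketch of the sequence (assuming sysrq is enabled on the node and that the ring buffer is saved with dmesg, both assumptions):

      # Enable sysrq, dump all task states to the kernel ring buffer,
      # then save the buffer (filename matches the attachment above).
      echo 1 > /proc/sys/kernel/sysrq
      echo t > /proc/sysrq-trigger
      dmesg > fir-io1-s1-sysrq-t.log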

      We use DNE, PFL and DOM. The OST backend is ldiskfs on mdraid.

      Meanwhile, we'll keep investigating a possible network issue.

      Thanks!
      Stephane

      Attachments

        Activity

          People

            Assignee: Amir Shehata (ashehata, Inactive)
            Reporter: Stephane Thiell (sthiell)
            Votes: 0
            Watchers: 5
