Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-10035

Many threads hanging on OST, lustre-log dumps

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.10.1
    • Soak cluster, b2.10.1 + wangdi diag patch for LU-9983
    • 3
    • 9223372036854775807

    Description

      Soak OSS are dumping multiple Lustre logs, many threads hanging.
      OSS

      Sep 26 16:13:09 soak-3 kernel: LNet: Service thread pid 55178 was inactive for 200.74s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Sep 26 16:13:09 soak-3 kernel: Pid: 55178, comm: ll_ost00_037
      Sep 26 16:13:09 soak-3 kernel: #012Call Trace:
      Sep 26 16:13:09 soak-3 kernel: [<ffffffff816a94e9>] schedule+0x29/0x70
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc09404d5>] cv_wait_common+0x125/0x150 [spl]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffff810b1910>] ? autoremove_wake_function+0x0/0x40
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc0940515>] __cv_wait+0x15/0x20 [spl]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc2464a0b>] dmu_tx_wait+0x20b/0x3c0 [zfs]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc2464c51>] dmu_tx_assign+0x91/0x490 [zfs]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc0ba7efa>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc0f96c8d>] tgt_client_data_update+0x2ed/0x5d0 [ptlrpc]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc0f9738b>] tgt_client_new+0x41b/0x610 [ptlrpc]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc111c553>] ofd_obd_connect+0x3a3/0x4c0 [ofd]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc0efeaec>] target_handle_connect+0x12bc/0x3200 [ptlrpc]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffff810d0408>] ? enqueue_task_fair+0x208/0x6c0
      Sep 26 16:13:09 soak-3 kernel: [<ffffffff810c76f5>] ? sched_clock_cpu+0x85/0xc0
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc0fa1b62>] tgt_request_handle+0x402/0x1370 [ptlrpc]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc0f4aec6>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc0f474f8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
      Sep 26 16:13:09 soak-3 kernel: [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc0f4e602>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc0f4db70>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
      Sep 26 16:13:09 soak-3 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      Sep 26 16:13:09 soak-3 kernel: [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
      Sep 26 16:13:09 soak-3 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      Sep 26 16:13:09 soak-3 kernel:
      Sep 26 16:13:09 soak-3 kernel: LustreError: dumping log to /tmp/lustre-log.1506442389.55178
      Sep 26 16:13:11 soak-3 kernel: LNet: Service thread pid 64795 was inactive for 201.48s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Sep 26 16:13:11 soak-3 kernel: Pid: 64795, comm: ll_ost00_043
      Sep 26 16:13:11 soak-3 kernel: #012Call Trace:
      Sep 26 16:13:11 soak-3 kernel: [<ffffffff816a94e9>] schedule+0x29/0x70
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc09404d5>] cv_wait_common+0x125/0x150 [spl]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffff810b1910>] ? autoremove_wake_function+0x0/0x40
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc0940515>] __cv_wait+0x15/0x20 [spl]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc2464a0b>] dmu_tx_wait+0x20b/0x3c0 [zfs]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc2464c51>] dmu_tx_assign+0x91/0x490 [zfs]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc0ba7efa>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc0f96c8d>] tgt_client_data_update+0x2ed/0x5d0 [ptlrpc]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc0f9738b>] tgt_client_new+0x41b/0x610 [ptlrpc]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc111c553>] ofd_obd_connect+0x3a3/0x4c0 [ofd]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc0efeaec>] target_handle_connect+0x12bc/0x3200 [ptlrpc]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffff810d0408>] ? enqueue_task_fair+0x208/0x6c0
      Sep 26 16:13:11 soak-3 kernel: [<ffffffff810c76f5>] ? sched_clock_cpu+0x85/0xc0
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc0fa1b62>] tgt_request_handle+0x402/0x1370 [ptlrpc]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc0f4aec6>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc0f474f8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
      Sep 26 16:13:11 soak-3 kernel: [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc0f4e602>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffff81029557>] ? __switch_to+0xd7/0x510
      Sep 26 16:13:11 soak-3 kernel: [<ffffffff816a8f00>] ? __schedule+0x2f0/0x8b0
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc0f4db70>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
      Sep 26 16:13:11 soak-3 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      Sep 26 16:13:11 soak-3 kernel: [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
      Sep 26 16:13:11 soak-3 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      Sep 26 16:13:11 soak-3 kernel:
      

      MDS report disconnection from OST at this time:

      ep 26 16:15:23 soak-8 kernel: Lustre: 2119:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1506441922/real 1506441922]  req@ffff8807bab2e300 x1579578644250848/t0(0) o1000->soaked-OST000d-osc-MDT0000@192.168.1.103@o2ib:24/4 lens 368/4320 e 23 to 1 dl 1506442523 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
      Sep 26 16:15:23 soak-8 kernel: Lustre: soaked-OST0001-osc-MDT0000: Connection to soaked-OST0001 (at 192.168.1.103@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Sep 26 16:15:23 soak-8 kernel: Lustre: Skipped 4 previous similar messages
      Sep 26 16:15:23 soak-8 kernel: Lustre: soaked-OST0001-osc-MDT0000: Connection restored to 192.168.1.103@o2ib (at 192.168.1.103@o2ib)
      

      We see connections dropping and being restored quite a lot. example covering one minute:

      soak-10.log:Sep 26 16:10:00 soak-10 kernel: Lustre: soaked-MDT0002: Connection restored to 6bb03e1f-2a66-5ad2-7153-3a2729fcc702 (at 192.168.1.143@o2ib)
      soak-10.log:Sep 26 16:10:19 soak-10 kernel: Lustre: soaked-MDT0002: Connection restored to f6aa261f-cc7b-f1d5-147f-cc08e579672d (at 192.168.1.124@o2ib)
      soak-11.log:Sep 26 16:12:39 soak-11 kernel: Lustre: soaked-OST000d-osc-MDT0003: Connection restored to 192.168.1.103@o2ib (at 192.168.1.103@o2ib)
      soak-2.log:Sep 26 16:10:04 soak-2 kernel: Lustre: soaked-OST0006: Connection restored to  (at 192.168.1.117@o2ib)
      soak-2.log:Sep 26 16:10:20 soak-2 kernel: Lustre: soaked-OST0000: Connection restored to  (at 192.168.1.125@o2ib)
      soak-3.log:Sep 26 16:11:48 soak-3 kernel: Lustre: soaked-OST0001: Connection restored to  (at 192.168.1.111@o2ib)
      soak-4.log:Sep 26 16:10:31 soak-4 kernel: Lustre: soaked-OST0002: Connection restored to  (at 192.168.1.133@o2ib)
      soak-5.log:Sep 26 16:10:00 soak-5 kernel: Lustre: soaked-OST0009: Connection restored to  (at 192.168.1.143@o2ib)
      soak-5.log:Sep 26 16:10:19 soak-5 kernel: Lustre: soaked-OST0015: Connection restored to  (at 192.168.1.124@o2ib)
      soak-6.log:Sep 26 16:10:00 soak-6 kernel: Lustre: soaked-OST0010: Connection restored to  (at 192.168.1.143@o2ib)
      soak-6.log:Sep 26 16:10:00 soak-6 kernel: Lustre: soaked-OST0016: Connection restored to  (at 192.168.1.143@o2ib)
      soak-6.log:Sep 26 16:10:00 soak-6 kernel: Lustre: soaked-OST000a: Connection restored to  (at 192.168.1.143@o2ib)
      soak-6.log:Sep 26 16:10:18 soak-6 kernel: Lustre: soaked-OST0016: Connection restored to  (at 192.168.1.124@o2ib)
      soak-6.log:Sep 26 16:10:18 soak-6 kernel: Lustre: soaked-OST000a: Connection restored to  (at 192.168.1.124@o2ib)
      soak-7.log:Sep 26 16:15:23 soak-7 kernel: Lustre: soaked-OST0005: Connection restored to  (at 192.168.1.108@o2ib)
      soak-8.log:Sep 26 16:15:23 soak-8 kernel: Lustre: soaked-OST0001-osc-MDT0000: Connection restored to 192.168.1.103@o2ib (at 192.168.1.103@o2ib)
      soak-9.log:Sep 26 16:10:00 soak-9 kernel: Lustre: soaked-MDT0001: Connection restored to 6bb03e1f-2a66-5ad2-7153-3a2729fcc702 (at 192.168.1.143@o2ib)
      soak-9.log:Sep 26 16:10:19 soak-9 kernel: Lustre: soaked-MDT0001: Connection restored to f6aa261f-cc7b-f1d5-147f-cc08e579672d (at 192.168.1.124@o2ib)
      

      Attachments

        Issue Links

          Activity

            People

              ashehata Amir Shehata (Inactive)
              cliffw Cliff White (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated: