Lustre / LU-10035

Many threads hanging on OST, lustre-log dumps

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.1
    • Environment: Soak cluster, b2.10.1 + wangdi diag patch for LU-9983
    • Severity: 3

    Description

      The soak-cluster OSS nodes are dumping multiple Lustre debug logs, and many service threads are hanging.
      OSS console log:

      Sep 26 16:13:09 soak-3 kernel: LNet: Service thread pid 55178 was inactive for 200.74s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Sep 26 16:13:09 soak-3 kernel: Pid: 55178, comm: ll_ost00_037
      Sep 26 16:13:09 soak-3 kernel: #012Call Trace:
      Sep 26 16:13:09 soak-3 kernel: [<ffffffff816a94e9>] schedule+0x29/0x70
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc09404d5>] cv_wait_common+0x125/0x150 [spl]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffff810b1910>] ? autoremove_wake_function+0x0/0x40
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc0940515>] __cv_wait+0x15/0x20 [spl]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc2464a0b>] dmu_tx_wait+0x20b/0x3c0 [zfs]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc2464c51>] dmu_tx_assign+0x91/0x490 [zfs]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc0ba7efa>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc0f96c8d>] tgt_client_data_update+0x2ed/0x5d0 [ptlrpc]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc0f9738b>] tgt_client_new+0x41b/0x610 [ptlrpc]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc111c553>] ofd_obd_connect+0x3a3/0x4c0 [ofd]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc0efeaec>] target_handle_connect+0x12bc/0x3200 [ptlrpc]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffff810d0408>] ? enqueue_task_fair+0x208/0x6c0
      Sep 26 16:13:09 soak-3 kernel: [<ffffffff810c76f5>] ? sched_clock_cpu+0x85/0xc0
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc0fa1b62>] tgt_request_handle+0x402/0x1370 [ptlrpc]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc0f4aec6>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc0f474f8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
      Sep 26 16:13:09 soak-3 kernel: [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc0f4e602>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffffc0f4db70>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
      Sep 26 16:13:09 soak-3 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
      Sep 26 16:13:09 soak-3 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      Sep 26 16:13:09 soak-3 kernel: [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
      Sep 26 16:13:09 soak-3 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      Sep 26 16:13:09 soak-3 kernel:
      Sep 26 16:13:09 soak-3 kernel: LustreError: dumping log to /tmp/lustre-log.1506442389.55178
      Sep 26 16:13:11 soak-3 kernel: LNet: Service thread pid 64795 was inactive for 201.48s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Sep 26 16:13:11 soak-3 kernel: Pid: 64795, comm: ll_ost00_043
      Sep 26 16:13:11 soak-3 kernel: #012Call Trace:
      Sep 26 16:13:11 soak-3 kernel: [<ffffffff816a94e9>] schedule+0x29/0x70
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc09404d5>] cv_wait_common+0x125/0x150 [spl]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffff810b1910>] ? autoremove_wake_function+0x0/0x40
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc0940515>] __cv_wait+0x15/0x20 [spl]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc2464a0b>] dmu_tx_wait+0x20b/0x3c0 [zfs]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc2464c51>] dmu_tx_assign+0x91/0x490 [zfs]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc0ba7efa>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc0f96c8d>] tgt_client_data_update+0x2ed/0x5d0 [ptlrpc]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc0f9738b>] tgt_client_new+0x41b/0x610 [ptlrpc]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc111c553>] ofd_obd_connect+0x3a3/0x4c0 [ofd]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc0efeaec>] target_handle_connect+0x12bc/0x3200 [ptlrpc]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffff810d0408>] ? enqueue_task_fair+0x208/0x6c0
      Sep 26 16:13:11 soak-3 kernel: [<ffffffff810c76f5>] ? sched_clock_cpu+0x85/0xc0
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc0fa1b62>] tgt_request_handle+0x402/0x1370 [ptlrpc]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc0f4aec6>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc0f474f8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
      Sep 26 16:13:11 soak-3 kernel: [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc0f4e602>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffff81029557>] ? __switch_to+0xd7/0x510
      Sep 26 16:13:11 soak-3 kernel: [<ffffffff816a8f00>] ? __schedule+0x2f0/0x8b0
      Sep 26 16:13:11 soak-3 kernel: [<ffffffffc0f4db70>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
      Sep 26 16:13:11 soak-3 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
      Sep 26 16:13:11 soak-3 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      Sep 26 16:13:11 soak-3 kernel: [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
      Sep 26 16:13:11 soak-3 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      Sep 26 16:13:11 soak-3 kernel:
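
      The /tmp/lustre-log.* files referenced in these messages are binary debug dumps. A hedged sketch for converting one to text with the standard lctl utility (the dump path is the one printed by the kernel above; the guard makes this a no-op on machines without lctl or the file):

```shell
#!/bin/sh
# Convert a binary Lustre debug dump (as referenced in the console messages
# above) to readable text. 'lctl df' is the short alias for 'lctl debug_file'.
dump=/tmp/lustre-log.1506442389.55178
if command -v lctl >/dev/null 2>&1 && [ -f "$dump" ]; then
    lctl debug_file "$dump" "${dump}.txt"
    wc -l "${dump}.txt"   # rough sense of how much trace was captured
else
    echo "lctl or $dump not available here"
fi
```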
      

      The MDS reports a disconnection from the OST at the same time:

      Sep 26 16:15:23 soak-8 kernel: Lustre: 2119:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1506441922/real 1506441922]  req@ffff8807bab2e300 x1579578644250848/t0(0) o1000->soaked-OST000d-osc-MDT0000@192.168.1.103@o2ib:24/4 lens 368/4320 e 23 to 1 dl 1506442523 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
      Sep 26 16:15:23 soak-8 kernel: Lustre: soaked-OST0001-osc-MDT0000: Connection to soaked-OST0001 (at 192.168.1.103@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      Sep 26 16:15:23 soak-8 kernel: Lustre: Skipped 4 previous similar messages
      Sep 26 16:15:23 soak-8 kernel: Lustre: soaked-OST0001-osc-MDT0000: Connection restored to 192.168.1.103@o2ib (at 192.168.1.103@o2ib)
      

      We see connections dropping and being restored frequently. An example covering one minute:

      soak-10.log:Sep 26 16:10:00 soak-10 kernel: Lustre: soaked-MDT0002: Connection restored to 6bb03e1f-2a66-5ad2-7153-3a2729fcc702 (at 192.168.1.143@o2ib)
      soak-10.log:Sep 26 16:10:19 soak-10 kernel: Lustre: soaked-MDT0002: Connection restored to f6aa261f-cc7b-f1d5-147f-cc08e579672d (at 192.168.1.124@o2ib)
      soak-11.log:Sep 26 16:12:39 soak-11 kernel: Lustre: soaked-OST000d-osc-MDT0003: Connection restored to 192.168.1.103@o2ib (at 192.168.1.103@o2ib)
      soak-2.log:Sep 26 16:10:04 soak-2 kernel: Lustre: soaked-OST0006: Connection restored to  (at 192.168.1.117@o2ib)
      soak-2.log:Sep 26 16:10:20 soak-2 kernel: Lustre: soaked-OST0000: Connection restored to  (at 192.168.1.125@o2ib)
      soak-3.log:Sep 26 16:11:48 soak-3 kernel: Lustre: soaked-OST0001: Connection restored to  (at 192.168.1.111@o2ib)
      soak-4.log:Sep 26 16:10:31 soak-4 kernel: Lustre: soaked-OST0002: Connection restored to  (at 192.168.1.133@o2ib)
      soak-5.log:Sep 26 16:10:00 soak-5 kernel: Lustre: soaked-OST0009: Connection restored to  (at 192.168.1.143@o2ib)
      soak-5.log:Sep 26 16:10:19 soak-5 kernel: Lustre: soaked-OST0015: Connection restored to  (at 192.168.1.124@o2ib)
      soak-6.log:Sep 26 16:10:00 soak-6 kernel: Lustre: soaked-OST0010: Connection restored to  (at 192.168.1.143@o2ib)
      soak-6.log:Sep 26 16:10:00 soak-6 kernel: Lustre: soaked-OST0016: Connection restored to  (at 192.168.1.143@o2ib)
      soak-6.log:Sep 26 16:10:00 soak-6 kernel: Lustre: soaked-OST000a: Connection restored to  (at 192.168.1.143@o2ib)
      soak-6.log:Sep 26 16:10:18 soak-6 kernel: Lustre: soaked-OST0016: Connection restored to  (at 192.168.1.124@o2ib)
      soak-6.log:Sep 26 16:10:18 soak-6 kernel: Lustre: soaked-OST000a: Connection restored to  (at 192.168.1.124@o2ib)
      soak-7.log:Sep 26 16:15:23 soak-7 kernel: Lustre: soaked-OST0005: Connection restored to  (at 192.168.1.108@o2ib)
      soak-8.log:Sep 26 16:15:23 soak-8 kernel: Lustre: soaked-OST0001-osc-MDT0000: Connection restored to 192.168.1.103@o2ib (at 192.168.1.103@o2ib)
      soak-9.log:Sep 26 16:10:00 soak-9 kernel: Lustre: soaked-MDT0001: Connection restored to 6bb03e1f-2a66-5ad2-7153-3a2729fcc702 (at 192.168.1.143@o2ib)
      soak-9.log:Sep 26 16:10:19 soak-9 kernel: Lustre: soaked-MDT0001: Connection restored to f6aa261f-cc7b-f1d5-147f-cc08e579672d (at 192.168.1.124@o2ib)
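
      A quick way to quantify the flapping, sketched with standard tools only (the sample file under /tmp is illustrative; in practice point the awk at the soak-*.log syslog files), is to tally "Connection restored" events per peer NID:

```shell
# Inline a few sample syslog lines for illustration.
cat >/tmp/reconnects.sample <<'EOF'
Sep 26 16:10:00 soak-6 kernel: Lustre: soaked-OST0010: Connection restored to  (at 192.168.1.143@o2ib)
Sep 26 16:10:00 soak-6 kernel: Lustre: soaked-OST0016: Connection restored to  (at 192.168.1.143@o2ib)
Sep 26 16:10:18 soak-6 kernel: Lustre: soaked-OST0016: Connection restored to  (at 192.168.1.124@o2ib)
EOF
# Each matching line ends in "(at <nid>)"; strip the parens and count per NID.
awk '/Connection restored/ { p = $NF; gsub(/[()]/, "", p); n[p]++ }
     END { for (p in n) print n[p], p }' /tmp/reconnects.sample | sort -rn
```

      Run against the full logs, the highest counts identify the peers that are flapping the most.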
      

          Activity

            [LU-10035] Many threads hanging on OST, lustre-log dumps

            Running with the debug patch, we are seeing a lot of reconnects, in bursts:

            /scratch/logs/syslog/soak-41.log:Sep 26 23:57:58 soak-41 kernel: Lustre: soaked-OST000f-osc-ffff880173a8c000: Connection to soaked-OST000f (at 192.168.1.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            /scratch/logs/syslog/soak-41.log:Sep 26 23:57:58 soak-41 kernel: Lustre: soaked-OST000f-osc-ffff880173a8c000: Connection restored to 192.168.1.105@o2ib (at 192.168.1.105@o2ib)
            /scratch/logs/syslog/soak-41.log:Sep 26 23:59:47 soak-41 kernel: Lustre: soaked-MDT0000-mdc-ffff880173a8c000: Connection to soaked-MDT0000 (at 192.168.1.108@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            /scratch/logs/syslog/soak-41.log:Sep 26 23:59:47 soak-41 kernel: Lustre: soaked-MDT0000-mdc-ffff880173a8c000: Connection restored to 192.168.1.108@o2ib (at 192.168.1.108@o2ib)
            /scratch/logs/syslog/soak-42.log:Sep 26 23:57:59 soak-42 kernel: Lustre: soaked-OST000f-osc-ffff880173f87000: Connection to soaked-OST000f (at 192.168.1.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            /scratch/logs/syslog/soak-42.log:Sep 26 23:57:59 soak-42 kernel: Lustre: soaked-OST000f-osc-ffff880173f87000: Connection restored to 192.168.1.105@o2ib (at 192.168.1.105@o2ib)
            /scratch/logs/syslog/soak-42.log:Sep 26 23:59:46 soak-42 kernel: Lustre: soaked-MDT0000-mdc-ffff880173f87000: Connection to soaked-MDT0000 (at 192.168.1.108@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            /scratch/logs/syslog/soak-42.log:Sep 26 23:59:46 soak-42 kernel: Lustre: soaked-MDT0000-mdc-ffff880173f87000: Connection restored to 192.168.1.108@o2ib (at 192.168.1.108@o2ib)
            /scratch/logs/syslog/soak-43.log:Sep 26 23:57:59 soak-43 kernel: Lustre: soaked-OST000f-osc-ffff880173ae4000: Connection to soaked-OST000f (at 192.168.1.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            /scratch/logs/syslog/soak-43.log:Sep 26 23:57:59 soak-43 kernel: Lustre: soaked-OST000f-osc-ffff880173ae4000: Connection restored to 192.168.1.105@o2ib (at 192.168.1.105@o2ib)
            /scratch/logs/syslog/soak-43.log:Sep 26 23:59:48 soak-43 kernel: Lustre: soaked-MDT0000-mdc-ffff880173ae4000: Connection to soaked-MDT0000 (at 192.168.1.108@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            /scratch/logs/syslog/soak-43.log:Sep 26 23:59:48 soak-43 kernel: Lustre: soaked-MDT0000-mdc-ffff880173ae4000: Connection restored to 192.168.1.108@o2ib (at 192.168.1.108@o2ib)
            /scratch/logs/syslog/soak-44.log:Sep 26 23:57:58 soak-44 kernel: Lustre: soaked-OST000f-osc-ffff88083e79f800: Connection to soaked-OST000f (at 192.168.1.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            /scratch/logs/syslog/soak-44.log:Sep 26 23:57:59 soak-44 kernel: Lustre: soaked-OST000f-osc-ffff88083e79f800: Connection restored to 192.168.1.105@o2ib (at 192.168.1.105@o2ib)
            /scratch/logs/syslog/soak-44.log:Sep 26 23:59:48 soak-44 kernel: Lustre: soaked-MDT0000-mdc-ffff88083e79f800: Connection to soaked-MDT0000 (at 192.168.1.108@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            /scratch/logs/syslog/soak-44.log:Sep 26 23:59:48 soak-44 kernel: Lustre: soaked-MDT0000-mdc-ffff88083e79f800: Connection restored to 192.168.1.108@o2ib (at 192.168.1.108@o2ib)
            /scratch/logs/syslog/soak-5.log:Sep 26 23:57:58 soak-5 kernel: Lustre: soaked-OST000f: Connection restored to  (at 192.168.1.141@o2ib)
            /scratch/logs/syslog/soak-8.log:Sep 26 23:51:32 soak-8 kernel: Lustre: soaked-OST000f-osc-MDT0000: Connection to soaked-OST000f (at 192.168.1.105@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            /scratch/logs/syslog/soak-8.log:Sep 26 23:58:07 soak-8 kernel: Lustre: soaked-MDT0000: Connection restored to 22754519-5c3c-7d4f-2c9f-eacb7fdfd149 (at 192.168.1.140@o2ib)
            

            The server side is dumping logs as well; for example:

            Sep 26 23:56:08 soak-5 systemd: Stopping User Slice of root.
            Sep 26 23:56:41 soak-5 kernel: Lustre: 43683:0:(service.c:1346:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply#012  req@ffff8808145b9450 x1579643797320240/t0(0) o5->soaked-MDT0000-mdtlov_UUID@192.168.1.108@o2ib:586/0 lens 432/432 e 1 to 0 dl 1506470206 ref 2 fl Interpret:/0/0 rc 0/0
            Sep 26 23:56:41 soak-5 kernel: Lustre: 43683:0:(service.c:1346:ptlrpc_at_send_early_reply()) Skipped 10 previous similar messages
            Sep 26 23:56:47 soak-5 kernel: LNet: Service thread pid 149701 was inactive for 1203.59s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            Sep 26 23:56:47 soak-5 kernel: LNet: Skipped 3 previous similar messages
            Sep 26 23:56:47 soak-5 kernel: Pid: 149701, comm: ll_ost01_037
            Sep 26 23:56:47 soak-5 kernel: #012Call Trace:
            Sep 26 23:56:47 soak-5 kernel: [<ffffffff816aa409>] schedule_preempt_disabled+0x29/0x70
            Sep 26 23:56:47 soak-5 kernel: [<ffffffff816a8337>] __mutex_lock_slowpath+0xc7/0x1d0
            Sep 26 23:56:47 soak-5 kernel: [<ffffffff816a774f>] mutex_lock+0x1f/0x2f
            Sep 26 23:56:48 soak-5 kernel: [<ffffffffc127ccfb>] ofd_create_hdl+0xdcb/0x2090 [ofd]
            Sep 26 23:56:48 soak-5 kernel: [<ffffffffc0ed1de7>] ? lustre_msg_add_version+0x27/0xa0 [ptlrpc]
            Sep 26 23:56:48 soak-5 kernel: [<ffffffffc0ed213f>] ? lustre_pack_reply_v2+0x14f/0x280 [ptlrpc]
            Sep 26 23:56:48 soak-5 kernel: [<ffffffffc0ed2461>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
            Sep 26 23:56:48 soak-5 kernel: [<ffffffffc0f35085>] tgt_request_handle+0x925/0x1370 [ptlrpc]
            Sep 26 23:56:48 soak-5 kernel: [<ffffffffc0eddec6>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
            Sep 26 23:56:48 soak-5 kernel: [<ffffffffc0eda4f8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            Sep 26 23:56:48 soak-5 kernel: [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
            Sep 26 23:56:48 soak-5 kernel: [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
            Sep 26 23:56:48 soak-5 kernel: [<ffffffffc0ee1602>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
            Sep 26 23:56:48 soak-5 kernel: [<ffffffffc0ee0b70>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
            Sep 26 23:56:48 soak-5 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
            Sep 26 23:56:49 soak-5 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
            Sep 26 23:56:49 soak-5 kernel: [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
            Sep 26 23:56:49 soak-5 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
            Sep 26 23:56:49 soak-5 kernel:
            Sep 26 23:56:49 soak-5 kernel: LustreError: dumping log to /tmp/lustre-log.1506470209.149701
            

            We will run lnet_selftest tomorrow to validate the hardware after the upgrade.

            cliffw Cliff White (Inactive) added a comment
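
            The lnet_selftest run mentioned above could look like the following minimal sketch (the NIDs are the ones appearing in this ticket; the session, group, and batch names are arbitrary; the guard makes it a no-op on nodes without the Lustre utilities):

```shell
#!/bin/sh
# Minimal lnet_selftest bandwidth check; a sketch, not a tuned benchmark.
run_lst_check() {
    export LST_SESSION=$$
    lst new_session rw_check
    lst add_group clients 192.168.1.[140-144]@o2ib
    lst add_group servers 192.168.1.105@o2ib
    lst add_batch bulk
    lst add_test --batch bulk --from clients --to servers brw write size=1M
    lst run bulk
    lst stat --count 5 clients servers   # sample bandwidth/RPC rates
    lst stop bulk
    lst end_session
}
# Only meaningful on a node with the lnet_selftest module loaded.
if command -v lst >/dev/null 2>&1; then
    run_lst_check
else
    echo "lst not found; run on a Lustre node with lnet_selftest loaded"
fi
```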

            Rolling back the patch leaves us with a system that LBUGs after 15 seconds of operation, so I'd say no.

            cliffw Cliff White (Inactive) added a comment

            So this might not be related to LNet; it could be an issue with the applied patch. Can we roll the patch back and retry?

            ashehata Amir Shehata (Inactive) added a comment

            The 192.168.1.114 and 192.168.1.115 nodes are LNet->OPA routers. At the time this test was run, those nodes were powered off due to missing OPA bits (since fixed), so I would ignore any messages referencing those nodes; they are just servers trying to ping dead routers.

            cliffw Cliff White (Inactive) added a comment

            Is this using OPA? Are there multiple configured interfaces on some nodes?

            ashehata Amir Shehata (Inactive) added a comment - edited

            In the soak-3.lustre-log.1506442389.55178 log there is a continuous stream of LNet messages that imply some sort of connection problem:

            00000800:00000100:5.0:1506439845.140835:0:5558:0:(o2iblnd_cb.c:490:kiblnd_rx_complete()) Rx from 192.168.1.115@o2ib failed: 5
            00000800:00000100:5.0:1506439845.140838:0:5558:0:(o2iblnd_cb.c:490:kiblnd_rx_complete()) Rx from 192.168.1.115@o2ib failed: 5
            00000800:00000100:1.0:1506439845.140839:0:79688:0:(o2iblnd_cb.c:2677:kiblnd_rejected()) 192.168.1.115@o2ib rejected: no listener at 987
            00000800:00000100:5.0:1506439845.140840:0:5558:0:(o2iblnd_cb.c:490:kiblnd_rx_complete()) Rx from 192.168.1.115@o2ib failed: 5
            00000800:00000100:1.0:1506439845.148485:0:79688:0:(o2iblnd_cb.c:2651:kiblnd_check_reconnect()) 192.168.1.115@o2ib: reconnect (invalid service id), 12, 12, msg_size: 4096, queue_depth: 8/-1, max_frags: 256/-1
            

            It isn't clear whether there is a network configuration problem, or whether LNet multi-rail/DD is doing the wrong thing and flooding the network with extra requests.

            adilger Andreas Dilger added a comment
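
            A hedged triage sketch for those rejects (the NIDs are the router addresses from the log excerpt above; on a machine without the Lustre utilities the script simply reports no response): check whether the suspect routers answer at the LNet level, then review the configured routes.

```shell
#!/bin/sh
# Probe the suspect router NIDs seen in the o2iblnd reject messages.
for nid in 192.168.1.114@o2ib 192.168.1.115@o2ib; do
    if lctl ping "$nid" >/dev/null 2>&1; then
        echo "$nid: responds to LNet ping"
    else
        echo "$nid: no LNet response (node down or no listener)"
    fi
done
# Show configured LNet routes so dead routers can be spotted and removed.
lnetctl route show 2>/dev/null || lctl route_list 2>/dev/null || true
```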

            Hi Amir,

            Can you please have a look into this?

            Thanks.
            Joe

            jgmitter Joseph Gmitter (Inactive) added a comment

            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 9