Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4497

tgt_recov hung task timeout OSSes when caching is on

    XMLWordPrintable

Details

    • Bug
    • Resolution: Not a Bug
    • Major
    • None
    • Lustre 2.5.0
    • None
    • HA environment, 2 * 2 OSS pairs. ~500 Client nodes, all over QDR InfiniBand.
    • 3
    • 12302

    Description

      After updating a very stable Lustre 1.8.4 environment to 2.5.0 we regularly were getting hung task messages on OSSes often leading to unexpected failovers (which triggered some other bugs). We tried various clients (2.5.0, patched 2.5.0, 1.8.9, ...) and the most stable configuration seemed to be the one with 1.8.9 clients.

      The symptoms disappeared after we turned off read and write caching!

      lctl set_param obdfilter.*.read_cache_enable=0
      lctl set_param obdfilter.*.writethrough_cache_enable=0

      We do have quota turned on, and somehow the bug reminded of LU-952, but the hangs are different. The server hardware is not very powerful, and they get quite some load. We're using 63 concurrent_sends. Anyway, I'd expect that caching ... eases the load and helps, and can't really imagine that server side caching is broken in 2.5.0. But maybe I'm wrong?

      Regards,
      Erich

      Lustre: Lustre: Build Version: 2.5.0-RC1--PRISTINE-2.6.32-358.18.1.el6_lustre.x86_64
      LNet: Added LNI 10.188.20.43@o2ib [63/2016/0/180]
      LNet: Added LNI 10.188.8.43@tcp [8/256/0/180]
      LNet: Accept all, port 988
      RPC: Registered named UNIX socket transport module.
      RPC: Registered udp transport module.
      RPC: Registered tcp transport module.
      RPC: Registered tcp NFSv4.1 backchannel transport module.
      802.1Q VLAN Support v1.8 Ben Greear <greearb@candelatech.com>
      All bugs added by David S. Miller <davem@redhat.com>
      8021q: adding VLAN 0 to HW filter on device eth0
      8021q: adding VLAN 0 to HW filter on device eth1
      8021q: adding VLAN 0 to HW filter on device eth2
      ipmi device interface
      zunicode: module license 'CDDL' taints kernel.
      Disabling lock debugging due to kernel taint
      SPL: Loaded module v0.6.1-1
      ZFS: Loaded module v0.6.1-1, ZFS pool version 5000, ZFS filesystem version 5
      LDISKFS-fs (sdf): mounted filesystem with ordered data mode. quota=on. Opts: 
      LDISKFS-fs (sdd): mounted filesystem with ordered data mode. quota=on. Opts: 
      LDISKFS-fs (sdj): mounted filesystem with ordered data mode. quota=on. Opts: 
      LDISKFS-fs (sdc): mounted filesystem with ordered data mode. quota=on. Opts: 
      LDISKFS-fs (sdh): mounted filesystem with ordered data mode. quota=on. Opts: 
      LDISKFS-fs (sdi): 
      LDISKFS-fs (sde): mounted filesystem with ordered data mode. quota=on. Opts: 
      mounted filesystem with ordered data mode. quota=on. Opts: 
      LDISKFS-fs (sdg): mounted filesystem with ordered data mode. quota=on. Opts: 
      
      Lustre: 2419:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1389200837/real 1389200837]  req@ffff8805ff93ec00 x1456682189193224/t0(0) o250->MGC10.188
      .20.32@o2ib@10.188.8.32@tcp:26/25 lens 400/544 e 0 to 1 dl 1389200842 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      LustreError: 4392:0:(obd_mount_server.c:1123:server_register_target()) lnec-OST000b: error registering with the MGS: rc = -5 (not fatal)
      Lustre: 5801:0:(ofd_dev.c:222:ofd_process_config()) For interoperability, skip this ost.quota_type. It is obsolete.
      Lustre: 2419:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1389200842/real 1389200842]  req@ffff8805ff93e000 x1456682189193240/t0(0) o250->MGC10.188
      .20.32@o2ib@10.188.20.31@o2ib:26/25 lens 400/544 e 0 to 1 dl 1389200847 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      LustreError: 13a-8: Failed to get MGS log params and no local copy.
      Lustre: 2419:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1389200847/real 1389200847]  req@ffff8802f0eb8c00 x1456682189193252/t0(0) o250->MGC10.188
      .20.32@o2ib@10.188.8.31@tcp:26/25 lens 400/544 e 0 to 1 dl 1389200852 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      LustreError: 13a-8: Failed to get MGS log params and no local copy.
      LustreError: 11-0: lnec-MDT0000-lwp-OST000b: Communicating with 10.188.20.31@o2ib, operation mds_connect failed with -19.
      LustreError: 4749:0:(client.c:1080:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff8805ffa31800 x1456682189193268/t0(0) o101->MGC10.188.20.32@o2ib@10.188.20.32@o2ib:26/25 lens 328/344 e 0 t
      o 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      LustreError: 4392:0:(client.c:1080:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff8802f0eb8800 x1456682189193276/t0(0) o101->MGC10.188.20.32@o2ib@10.188.20.32@o2ib:26/25 lens 328/344 e 0 t
      o 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      LustreError: 13a-8: Failed to get MGS log params and no local copy.
      LustreError: 137-5: lnec-OST0008_UUID: not available for connect from 10.188.20.31@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: 137-5: lnec-OST0009_UUID: not available for connect from 10.188.20.31@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
      Lustre: 2419:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1389200852/real 1389200852]  req@ffff8802f0eb8000 x1456682189193264/t0(0) o250->MGC10.188.20.32@o2ib@10.188.20.32@o2ib:26/25 lens 400/544 e 0 to 1 dl 1389200862 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      LustreError: 4749:0:(client.c:1080:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff8805ffa31800 x1456682189193292/t0(0) o101->MGC10.188.20.32@o2ib@10.188.8.32@tcp:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      Lustre: 2419:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1389200862/real 1389200862]  req@ffff8802f0eb8400 x1456682189193284/t0(0) o250->MGC10.188.20.32@o2ib@10.188.8.32@tcp:26/25 lens 400/544 e 0 to 1 dl 1389200872 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      LustreError: 4395:0:(obd_mount_server.c:1123:server_register_target()) lnec-OST000c: error registering with the MGS: rc = -5 (not fatal)
      Lustre: 5872:0:(ofd_dev.c:222:ofd_process_config()) For interoperability, skip this ost.quota_type. It is obsolete.
      LustreError: 4395:0:(client.c:1080:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff8805ffa31000 x1456682189193320/t0(0) o101->MGC10.188.20.32@o2ib@10.188.20.31@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      LustreError: 13a-8: Failed to get MGS log params and no local copy.
      Lustre: 2419:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1389200872/real 1389200872]  req@ffff8802f0eb8800 x1456682189193316/t0(0) o250->MGC10.188.20.32@o2ib@10.188.20.31@o2ib:26/25 lens 400/544 e 0 to 1 dl 1389200882 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      LustreError: 4749:0:(client.c:1080:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880339cf1c00 x1456682189193340/t0(0) o101->MGC10.188.20.32@o2ib@10.188.8.31@tcp:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      LustreError: 13a-8: Failed to get MGS log params and no local copy.
      Lustre: 2419:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1389200882/real 1389200882]  req@ffff8802f0eb8000 x1456682189193336/t0(0) o250->MGC10.188.20.32@o2ib@10.188.8.31@tcp:26/25 lens 400/544 e 0 to 1 dl 1389200892 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 2419:0:(client.c:1897:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      LustreError: 4749:0:(client.c:1080:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880339cf1c00 x1456682189193352/t0(0) o101->MGC10.188.20.32@o2ib@10.188.8.31@tcp:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      LustreError: 4749:0:(client.c:1080:ptlrpc_import_delay_req()) Skipped 1 previous similar message
      Lustre: lnec-OST000c: Will be in recovery for at least 5:00, or until 2 clients reconnect
      Lustre: lnec-OST000b: Denying connection for new client lnec-MDT0000-mdtlov_UUID (at 10.188.20.31@o2ib), waiting for all 2 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 5:00
      LustreError: 137-5: lnec-OST000e_UUID: not available for connect from 10.188.20.31@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 1 previous similar message
      Lustre: Skipped 1 previous similar message
      Lustre: lnec-OST000c: Denying connection for new client lnec-MDT0000-mdtlov_UUID (at 10.188.20.31@o2ib), waiting for all 2 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 4:59
      LustreError: 4749:0:(client.c:1080:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880339cf1c00 x1456682189193368/t0(0) o101->MGC10.188.20.32@o2ib@10.188.8.31@tcp:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      LustreError: 4749:0:(client.c:1080:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
      Lustre: 5953:0:(ofd_dev.c:222:ofd_process_config()) For interoperability, skip this ost.quota_type. It is obsolete.
      LustreError: 13a-8: Failed to get MGS log params and no local copy.
      LustreError: Skipped 1 previous similar message
      Lustre: lnec-OST000f: Will be in recovery for at least 5:00, or until 3 clients reconnect
      Lustre: lnec-OST0009: Will be in recovery for at least 5:00, or until 3 clients reconnect
      Lustre: 6030:0:(ofd_dev.c:222:ofd_process_config()) For interoperability, skip this ost.quota_type. It is obsolete.
      Lustre: 6030:0:(ofd_dev.c:222:ofd_process_config()) Skipped 1 previous similar message
      LustreError: 13a-8: Failed to get MGS log params and no local copy.
      LustreError: Skipped 7 previous similar messages
      Lustre: lnec-OST000d: Will be in recovery for at least 5:00, or until 3 clients reconnect
      Lustre: lnec-OST000e: Will be in recovery for at least 5:00, or until 3 clients reconnect
      Lustre: Skipped 2 previous similar messages
      Lustre: lnec-OST000c: Denying connection for new client lnec-MDT0000-mdtlov_UUID (at 10.188.20.31@o2ib), waiting for all 2 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 4:35
      Lustre: Skipped 1 previous similar message
      Lustre: lnec-OST000b: Denying connection for new client lnec-MDT0000-mdtlov_UUID (at 10.188.20.31@o2ib), waiting for all 2 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 4:10
      Lustre: lnec-OST000c: Denying connection for new client lnec-MDT0000-mdtlov_UUID (at 10.188.20.31@o2ib), waiting for all 2 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 4:10
      Lustre: lnec-OST000c: Denying connection for new client lnec-MDT0000-mdtlov_UUID (at 10.188.20.31@o2ib), waiting for all 2 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 3:45
      Lustre: lnec-OST000b: Denying connection for new client lnec-MDT0000-mdtlov_UUID (at 10.188.20.31@o2ib), waiting for all 2 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 3:45
      Lustre: lnec-OST000c: Denying connection for new client lnec-MDT0000-mdtlov_UUID (at 10.188.20.31@o2ib), waiting for all 2 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 3:20
      Lustre: Skipped 1 previous similar message
      Lustre: lnec-OST000b: Denying connection for new client lnec-MDT0000-mdtlov_UUID (at 10.188.20.31@o2ib), waiting for all 2 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:55
      Lustre: lnec-OST000c: Denying connection for new client lnec-MDT0000-mdtlov_UUID (at 10.188.20.31@o2ib), waiting for all 2 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:55
      INFO: task tgt_recov:5864 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      tgt_recov     D 000000000000000d     0  5864      2 0x00000000
       ffff88031d269da0 0000000000000046 0000000000000000 0000000000000003
       ffff88031d269d30 ffffffff81055f96 ffff88031d269d40 ffff880338ade080
       ffff8803390cbaf8 ffff88031d269fd8 000000000000fb88 ffff8803390cbaf8
      Call Trace:
       [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
       [<ffffffffa064e210>] ? check_for_clients+0x0/0x70 [ptlrpc]
       [<ffffffffa064f87d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
       [<ffffffffa064df00>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
       [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
       [<ffffffffa0656140>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
       [<ffffffffa0656680>] target_recovery_thread+0x540/0x1920 [ptlrpc]
       [<ffffffff81063422>] ? default_wake_function+0x12/0x20
       [<ffffffffa0656140>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
       [<ffffffff81096a36>] kthread+0x96/0xa0
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffff810969a0>] ? kthread+0x0/0xa0
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      INFO: task tgt_recov:5944 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      tgt_recov     D 000000000000000d     0  5944      2 0x00000000
       ffff8803280cbda0 0000000000000046 0000000000000000 0000000000000003
       ffff8803280cbd30 ffffffff81055f96 ffff8803280cbd40 ffff8806298dcae0
       ffff8802f0ef85f8 ffff8803280cbfd8 000000000000fb88 ffff8802f0ef85f8
      Call Trace:
       [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
       [<ffffffffa064e210>] ? check_for_clients+0x0/0x70 [ptlrpc]
       [<ffffffffa064f87d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
       [<ffffffffa064df00>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
       [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
       [<ffffffffa0656140>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
       [<ffffffffa0656680>] target_recovery_thread+0x540/0x1920 [ptlrpc]
       [<ffffffff81063422>] ? default_wake_function+0x12/0x20
       [<ffffffffa0656140>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
       [<ffffffff81096a36>] kthread+0x96/0xa0
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffff810969a0>] ? kthread+0x0/0xa0
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      INFO: task tgt_recov:5956 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      tgt_recov     D 000000000000000d     0  5956      2 0x00000000
       ffff88033959dda0 0000000000000046 ffff88033959dd68 ffff88033959dd64
       ffff88033959dd30 ffff88033fc25d00 ffff88034ac36740 0000000000000400
       ffff880330f6dab8 ffff88033959dfd8 000000000000fb88 ffff880330f6dab8
      Call Trace:
       [<ffffffffa064e210>] ? check_for_clients+0x0/0x70 [ptlrpc]
       [<ffffffffa064f87d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
       [<ffffffffa064df00>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
       [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
       [<ffffffffa0656140>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
       [<ffffffffa0656680>] target_recovery_thread+0x540/0x1920 [ptlrpc]
       [<ffffffff81063422>] ? default_wake_function+0x12/0x20
       [<ffffffffa0656140>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
       [<ffffffff81096a36>] kthread+0x96/0xa0
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffff810969a0>] ? kthread+0x0/0xa0
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      INFO: task tgt_recov:5961 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      tgt_recov     D 000000000000000d     0  5961      2 0x00000000
       ffff88033969bda0 0000000000000046 0000000000000000 0000000000000003
       ffff88033969bd30 ffffffff81055f96 ffff88033969bd40 ffff8803391fd540
       ffff88032c0da5f8 ffff88033969bfd8 000000000000fb88 ffff88032c0da5f8
      Call Trace:
       [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
       [<ffffffffa064e210>] ? check_for_clients+0x0/0x70 [ptlrpc]
       [<ffffffffa064f87d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
       [<ffffffffa064df00>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
       [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
       [<ffffffffa0656140>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
       [<ffffffffa0656680>] target_recovery_thread+0x540/0x1920 [ptlrpc]
       [<ffffffff81063422>] ? default_wake_function+0x12/0x20
       [<ffffffffa0656140>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
       [<ffffffff81096a36>] kthread+0x96/0xa0
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffff810969a0>] ? kthread+0x0/0xa0
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      INFO: task tgt_recov:6033 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      tgt_recov     D 000000000000000d     0  6033      2 0x00000000
       ffff8802ef76fda0 0000000000000046 0000000000000000 0000000000000003
       ffff8802ef76fd30 ffffffff81055f96 ffff8802ef76fd40 ffff880338129500
       ffff8802ef69daf8 ffff8802ef76ffd8 000000000000fb88 ffff8802ef69daf8
      Call Trace:
       [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
       [<ffffffffa064e210>] ? check_for_clients+0x0/0x70 [ptlrpc]
       [<ffffffffa064f87d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
       [<ffffffffa064df00>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
       [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
       [<ffffffffa0656140>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
       [<ffffffffa0656680>] target_recovery_thread+0x540/0x1920 [ptlrpc]
       [<ffffffff81063422>] ? default_wake_function+0x12/0x20
       [<ffffffffa0656140>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
       [<ffffffff81096a36>] kthread+0x96/0xa0
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffff810969a0>] ? kthread+0x0/0xa0
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      INFO: task tgt_recov:6044 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      tgt_recov     D 000000000000000d     0  6044      2 0x00000000
       ffff8802eac99da0 0000000000000046 ffff8802eac99d68 ffff8802eac99d64
       ffff8802eac99d30 ffff88033fc25d00 ffff88034ac36740 0000000000000400
       ffff8803382de5f8 ffff8802eac99fd8 000000000000fb88 ffff8803382de5f8
      Call Trace:
       [<ffffffffa064e210>] ? check_for_clients+0x0/0x70 [ptlrpc]
       [<ffffffffa064f87d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
       [<ffffffffa064df00>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
       [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
       [<ffffffffa0656140>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
       [<ffffffffa0656680>] target_recovery_thread+0x540/0x1920 [ptlrpc]
       [<ffffffff81063422>] ? default_wake_function+0x12/0x20
       [<ffffffffa0656140>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
       [<ffffffff81096a36>] kthread+0x96/0xa0
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffff810969a0>] ? kthread+0x0/0xa0
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      INFO: task tgt_recov:6049 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      tgt_recov     D 0000000000000006     0  6049      2 0x00000000
       ffff8802eaeb1da0 0000000000000046 0000000000000000 ffff8802eaeb1d64
       ffff880200000000 ffff88033fc24f00 ffff88034ac36740 0000000000000400
       ffff88033a2e4638 ffff8802eaeb1fd8 000000000000fb88 ffff88033a2e4638
      Call Trace:
       [<ffffffffa064e210>] ? check_for_clients+0x0/0x70 [ptlrpc]
       [<ffffffffa064f87d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
       [<ffffffffa064df00>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
       [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
       [<ffffffffa0656140>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
       [<ffffffffa0656680>] target_recovery_thread+0x540/0x1920 [ptlrpc]
       [<ffffffff81063422>] ? default_wake_function+0x12/0x20
       [<ffffffffa0656140>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
       [<ffffffff81096a36>] kthread+0x96/0xa0
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffff810969a0>] ? kthread+0x0/0xa0
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      INFO: task tgt_recov:6054 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      tgt_recov     D 0000000000000006     0  6054      2 0x00000000
       ffff8802eaf2bda0 0000000000000046 ffff880028276740 0000000000000003
       ffff8802eaf2bd30 ffffffff81055f96 ffff8802eaf2bd40 ffff880338aaeaa0
       ffff880330ff65f8 ffff8802eaf2bfd8 000000000000fb88 ffff880330ff65f8
      Call Trace:
       [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
       [<ffffffff8109708e>] ? prepare_to_wait+0x4e/0x80
       [<ffffffffa064e210>] ? check_for_clients+0x0/0x70 [ptlrpc]
       [<ffffffffa064f87d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
       [<ffffffffa064df00>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
       [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
       [<ffffffffa0656140>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
       [<ffffffffa0656680>] target_recovery_thread+0x540/0x1920 [ptlrpc]
       [<ffffffff81063422>] ? default_wake_function+0x12/0x20
       [<ffffffffa0656140>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
       [<ffffffff81096a36>] kthread+0x96/0xa0
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffff810969a0>] ? kthread+0x0/0xa0
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      Lustre: lnec-OST000b: Denying connection for new client lnec-MDT0000-mdtlov_UUID (at 10.188.20.31@o2ib), waiting for all 2 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:30
      Lustre: Skipped 1 previous similar message
      Lustre: lnec-OST000b: Denying connection for new client lnec-MDT0000-mdtlov_UUID (at 10.188.20.31@o2ib), waiting for all 2 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 1:40
      Lustre: lnec-OST000c: Denying connection for new client lnec-MDT0000-mdtlov_UUID (at 10.188.20.31@o2ib), waiting for all 2 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 1:40
      Lustre: Skipped 2 previous similar messages
      Lustre: Skipped 2 previous similar messages
      INFO: task tgt_recov:5864 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      tgt_recov     D 000000000000000d     0  5864      2 0x00000000
       ffff88031d269da0 0000000000000046 0000000000000000 0000000000000003
       ffff88031d269d30 ffffffff81055f96 ffff88031d269d40 ffff880338ade080
       ffff8803390cbaf8 ffff88031d269fd8 000000000000fb88 ffff8803390cbaf8
      Call Trace:
       [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
       [<ffffffffa064e210>] ? check_for_clients+0x0/0x70 [ptlrpc]
       [<ffffffffa064f87d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
       [<ffffffffa064df00>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
       [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
       [<ffffffffa0656140>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
       [<ffffffffa0656680>] target_recovery_thread+0x540/0x1920 [ptlrpc]
       [<ffffffff81063422>] ? default_wake_function+0x12/0x20
       [<ffffffffa0656140>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
       [<ffffffff81096a36>] kthread+0x96/0xa0
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffff810969a0>] ? kthread+0x0/0xa0
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      INFO: task tgt_recov:5944 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      tgt_recov     D 000000000000000d     0  5944      2 0x00000000
       ffff8803280cbda0 0000000000000046 0000000000000000 0000000000000003
       ffff8803280cbd30 ffffffff81055f96 ffff8803280cbd40 ffff8806298dcae0
       ffff8802f0ef85f8 ffff8803280cbfd8 000000000000fb88 ffff8802f0ef85f8
      Call Trace:
       [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
       [<ffffffffa064e210>] ? check_for_clients+0x0/0x70 [ptlrpc]
       [<ffffffffa064f87d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
       [<ffffffffa064df00>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
       [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
       [<ffffffffa0656140>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
       [<ffffffffa0656680>] target_recovery_thread+0x540/0x1920 [ptlrpc]
       [<ffffffff81063422>] ? default_wake_function+0x12/0x20
       [<ffffffffa0656140>] ? target_recovery_thread+0x0/0x1920 [ptlrpc]
       [<ffffffff81096a36>] kthread+0x96/0xa0
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffff810969a0>] ? kthread+0x0/0xa0
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      Lustre: lnec-OST000c: recovery is timed out, evict stale exports
      Lustre: lnec-OST000b: disconnecting 2 stale clients
      Lustre: lnec-OST000b: Recovery over after 5:00, of 2 clients 0 recovered and 2 were evicted.
      Lustre: Skipped 1 previous similar message
      Lustre: lnec-OST000c: disconnecting 2 stale clients
      Lustre: lnec-OST000c: Recovery over after 5:00, of 2 clients 0 recovered and 2 were evicted.
      Lustre: lnec-OST000f: recovery is timed out, evict stale exports
      Lustre: lnec-OST0009: recovery is timed out, evict stale exports
      Lustre: lnec-OST0009: disconnecting 2 stale clients
      Lustre: lnec-OST0009: Recovery over after 5:00, of 3 clients 1 recovered and 2 were evicted.
      Lustre: lnec-OST0009: deleting orphan objects from 0x0:9393083 to 0x0:9394465
      Lustre: lnec-OST000f: deleting orphan objects from 0x0:9374587 to 0x0:9375201
      Lustre: lnec-OST000a: recovery is timed out, evict stale exports
      Lustre: lnec-OST000e: disconnecting 2 stale clients
      Lustre: Skipped 3 previous similar messages
      Lustre: lnec-OST0008: Recovery over after 5:00, of 3 clients 1 recovered and 2 were evicted.
      Lustre: Skipped 1 previous similar message
      Lustre: lnec-OST0008: deleting orphan objects from 0x0:9365659 to 0x0:9366273
      Lustre: lnec-OST000d: deleting orphan objects from 0x0:9397403 to 0x0:9398145
      Lustre: lnec-OST000e: deleting orphan objects from 0x0:9377403 to 0x0:9378017
      Lustre: Skipped 3 previous similar messages
      Lustre: lnec-OST000a: deleting orphan objects from 0x0:9370459 to 0x0:9371201
      Lustre: lnec-OST000b: deleting orphan objects from 0x0:9366091 to 0x0:9366705
      Lustre: lnec-OST000c: deleting orphan objects from 0x0:9381147 to 0x0:9382657
      Lustre: Failing over lnec-OST000c
      Lustre: Failing over lnec-OST000d
      Lustre: Failing over lnec-OST000f
      Lustre: Skipped 1 previous similar message
      LustreError: 137-5: lnec-OST000d_UUID: not available for connect from 10.188.23.12@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: 137-5: lnec-OST000c_UUID: not available for connect from 10.188.23.12@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 2 previous similar messages
      LustreError: Skipped 2 previous similar messages
      Lustre: server umount lnec-OST000d complete
      

      Sorry for adding so much of the dmesg output, maybe there is some hint there...

      Attachments

        Activity

          People

            jfc John Fuchs-Chesney (Inactive)
            efocht Erich Focht
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: