Lustre / LU-4565

recovery-mds-scale test_failover_ost: failed mounting ost after reboot


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: Lustre 2.6.0
    • None
    • Environment: lustre-master build # 1837 RHEL6 zfs
    • Severity: 3
    • 12460

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/ed6584da-847f-11e3-9133-52540035b04c.

      The sub-test test_failover_ost failed with the following error:

      test failed to respond and timed out

      OST dmesg:

      Lustre: DEBUG MARKER: zpool list -H lustre-ost2 >/dev/null 2>&1 ||
      			zpool import -f -o cachefile=none -d /dev/lvm-Role_OSS lustre-ost2
      Lustre: DEBUG MARKER: mkdir -p /mnt/ost2; mount -t lustre   		                   lustre-ost2/ost2 /mnt/ost2
      Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1390291785/real 1390291785]  req@ffff88006c21d800 x1457826572533816/t0(0) o38->lustre-MDT0000-lwp-OST0000@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390291790 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      LustreError: 13a-8: Failed to get MGS log params and no local copy.
      Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1390291791/real 1390291791]  req@ffff88006b122c00 x1457826572533912/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390291796 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 3 clients reconnect
      LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.10.4.200@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1390291801/real 1390291801]  req@ffff88006c21dc00 x1457826572533940/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390291811 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: lustre-OST0000: Recovery over after 0:07, of 3 clients 3 recovered and 0 were evicted.
      Lustre: lustre-OST0000: deleting orphan objects from 0x0:833 to 0x0:833
      Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1390291816/real 1390291816]  req@ffff880037b0ec00 x1457826572533968/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390291831 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.10.4.200@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 18 previous similar messages
      Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1390291836/real 1390291836]  req@ffff8800680a7000 x1457826572534004/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390291856 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1390291866/real 1390291866]  req@ffff880067d01000 x1457826572534060/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390291891 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.10.4.200@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 14 previous similar messages
      Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1390291871/real 1390291871]  req@ffff88006addb800 x1457826572534072/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390291896 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.10.4.200@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 14 previous similar messages
      Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1390291931/real 1390291931]  req@ffff88006addbc00 x1457826572534196/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390291956 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
      LustreError: 137-5: lustre-OST0002_UUID: not available for connect from 10.10.4.200@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 35 previous similar messages
      INFO: task mount.lustre:3972 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mount.lustre  D 0000000000000001     0  3972   3971 0x00000080
       ffff88006af93718 0000000000000086 ffff88006af936a8 ffffffff81065c75
       000000106af936b8 ffff88007e4b2ad8 ffff8800022167a8 ffff880002216740
       ffff88006a935098 ffff88006af93fd8 000000000000fb88 ffff88006a935098
      Call Trace:
       [<ffffffff81065c75>] ? enqueue_entity+0x125/0x410
       [<ffffffff8103c7d8>] ? pvclock_clocksource_read+0x58/0xd0
       [<ffffffff8150f475>] schedule_timeout+0x215/0x2e0
       [<ffffffff81065c75>] ? enqueue_entity+0x125/0x410
       [<ffffffff810572f4>] ? check_preempt_wakeup+0x1a4/0x260
       [<ffffffff8106605b>] ? enqueue_task_fair+0xfb/0x100
       [<ffffffff8150f0f3>] wait_for_common+0x123/0x180
       [<ffffffff81063990>] ? default_wake_function+0x0/0x20
       [<ffffffff8150f20d>] wait_for_completion+0x1d/0x20
       [<ffffffffa0858c03>] llog_process_or_fork+0x353/0x5f0 [obdclass]
       [<ffffffffa0858eb4>] llog_process+0x14/0x20 [obdclass]
       [<ffffffffa088d444>] class_config_parse_llog+0x1e4/0x330 [obdclass]
       [<ffffffffa105c2d2>] mgc_process_log+0xd22/0x18e0 [mgc]
       [<ffffffffa1056360>] ? mgc_blocking_ast+0x0/0x810 [mgc]
       [<ffffffffa0aaf2a0>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
       [<ffffffffa105e4f5>] mgc_process_config+0x645/0x11d0 [mgc]
       [<ffffffffa089d476>] lustre_process_log+0x256/0xa70 [obdclass]
       [<ffffffffa086f832>] ? class_name2dev+0x42/0xe0 [obdclass]
       [<ffffffff81168043>] ? kmem_cache_alloc_trace+0x1a3/0x1b0
       [<ffffffffa086f8de>] ? class_name2obd+0xe/0x30 [obdclass]
       [<ffffffffa08ce71c>] server_start_targets+0x1c4c/0x1e00 [obdclass]
       [<ffffffffa08a0a7b>] ? lustre_start_mgc+0x48b/0x1e60 [obdclass]
       [<ffffffffa08989e0>] ? class_config_llog_handler+0x0/0x1880 [obdclass]
       [<ffffffffa08d3708>] server_fill_super+0xb98/0x1a64 [obdclass]
       [<ffffffffa08a2628>] lustre_fill_super+0x1d8/0x530 [obdclass]
       [<ffffffffa08a2450>] ? lustre_fill_super+0x0/0x530 [obdclass]
       [<ffffffff811845df>] get_sb_nodev+0x5f/0xa0
       [<ffffffffa089a365>] lustre_get_sb+0x25/0x30 [obdclass]
       [<ffffffff81183c1b>] vfs_kern_mount+0x7b/0x1b0
       [<ffffffff81183dc2>] do_kern_mount+0x52/0x130
       [<ffffffff81195382>] ? vfs_ioctl+0x22/0xa0
       [<ffffffff811a3f82>] do_mount+0x2d2/0x8d0
       [<ffffffff811a4610>] sys_mount+0x90/0xe0
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1390291996/real 1390291996]  req@ffff88006adf3400 x1457826572534332/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.198@tcp:12/10 lens 400/544 e 0 to 1 dl 1390292021 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 3388:0:(client.c:1903:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
      LustreError: 137-5: lustre-OST0003_UUID: not available for connect from 10.10.4.202@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      LustreError: Skipped 124 previous similar messages
      INFO: task mount.lustre:3972 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mount.lustre  D 0000000000000001     0  3972   3971 0x00000080
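
      For context on the trace above: mount.lustre is blocked in llog_process_or_fork() under mgc_process_log(), i.e. the OST mount is stuck waiting for the MGS configuration log, which matches the earlier "13a-8: Failed to get MGS log params and no local copy" error. A minimal sketch of checks one might run on the rebooted OSS to confirm LNet/MGS reachability and that the pool imported, assuming the MGS runs on the MDS node at 10.10.4.198@tcp as the log suggests (this is illustrative, not a confirmed root-cause analysis):

      # show the local NIDs configured after the reboot
      lctl list_nids
      # verify the MDS node is reachable over LNet (NID taken from the log; assumed to also host the MGS)
      lctl ping 10.10.4.198@tcp
      # verify the OST pool was imported before mount.lustre was started
      zpool list -H lustre-ost2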
      


            People

              Assignee: Mikhail Pershin (tappro)
              Reporter: Maloo (maloo)
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: