Lustre / LU-2824

recovery-mds-scale test_failover_ost: tar: etc/localtime: Cannot open: Input/output error

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.1.5, Lustre 1.8.9
    • Lustre 1.8.9
    • None
    • 3
    • 6837

    Description

      While running recovery-mds-scale test_failover_ost, the test failed after about 15 hours with the following output:

      ==== Checking the clients loads AFTER failover -- failure NOT OK
      Client load failed on node client-31vm6, rc=1
      Client load failed during failover. Exiting...
      Found the END_RUN_FILE file: /home/autotest/.autotest/shared_dir/2013-02-12/172229-70152412386500/end_run_file
      client-31vm6.lab.whamcloud.com
      Client load  failed on node client-31vm6.lab.whamcloud.com:
      /logdir/test_logs/2013-02-12/lustre-b1_8-el5-x86_64-vs-lustre-b1_8-el6-x86_64--review--1_1_1__13121__-70152412386500-172228/recovery-mds-scale.test_failover_ost.run__stdout.client-31vm6.lab.whamcloud.com.log
      /logdir/test_logs/2013-02-12/lustre-b1_8-el5-x86_64-vs-lustre-b1_8-el6-x86_64--review--1_1_1__13121__-70152412386500-172228/recovery-mds-scale.test_failover_ost.run__debug.client-31vm6.lab.whamcloud.com.log
      2013-02-13 10:50:55 Terminating clients loads ...
      Duration:               86400
      Server failover period: 900 seconds
      Exited after:           56345 seconds
      Number of failovers before exit:
      mds: 0 times
      ost1: 8 times
      ost2: 3 times
      ost3: 13 times
      ost4: 7 times
      ost5: 10 times
      ost6: 10 times
      ost7: 12 times
      Status: FAIL: rc=1
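
The summary above can be cross-checked with a short script. This is only a sketch for reading the numbers: the `SUMMARY` text is copied from the output above, and the helper name `parse_summary` is hypothetical, not part of the Lustre test suite.

```python
import re

# Summary lines copied from the test output above.
SUMMARY = """\
Duration:               86400
Server failover period: 900 seconds
Exited after:           56345 seconds
Number of failovers before exit:
mds: 0 times
ost1: 8 times
ost2: 3 times
ost3: 13 times
ost4: 7 times
ost5: 10 times
ost6: 10 times
ost7: 12 times
"""


def parse_summary(text):
    """Return (per-target failover counts, failover period, elapsed seconds)."""
    counts = {
        m.group(1): int(m.group(2))
        for m in re.finditer(r"^(\w+): (\d+) times$", text, re.M)
    }
    period = int(
        re.search(r"^Server failover period:\s+(\d+) seconds$", text, re.M).group(1)
    )
    elapsed = int(
        re.search(r"^Exited after:\s+(\d+) seconds$", text, re.M).group(1)
    )
    return counts, period, elapsed


counts, period, elapsed = parse_summary(SUMMARY)
total = sum(counts.values())
print(f"total failovers: {total}")
print(f"elapsed: {elapsed / 3600:.2f} hours")
print(f"elapsed / period: {elapsed / period:.1f}")
```

The total of 63 failovers is roughly in line with one failover per 900-second period over the 56345 seconds the test ran, so the failure appears to have hit late in a long run rather than at the start.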
      

      The output of the "tar" operation on client-31vm6 showed:

      tar: etc/chef/solo.rb: Cannot open: Input/output error
      tar: etc/chef/client.rb: Cannot open: Input/output error
      tar: etc/prelink.cache: Cannot open: Input/output error
      tar: etc/readahead.conf: Cannot open: Input/output error
      tar: etc/localtime: Cannot open: Input/output error
      tar: Exiting with failure status due to previous errors
      

      dmesg on the MDS node (client-31vm3) showed:

      Lustre: DEBUG MARKER: Starting failover on ost2
      Lustre: 7396:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1426822752375573 sent from lustre-OST0000-osc to NID 10.10.4.195@tcp 7s ago has timed out (7s prior to deadline).
        req@ffff810048893000 x1426822752375573/t0 o13->lustre-OST0000_UUID@10.10.4.195@tcp:7/4 lens 192/528 e 0 to 1 dl 1360781077 ref 1 fl Rpc:N/0/0 rc 0/0
      Lustre: 7396:0:(client.c:1529:ptlrpc_expire_one_request()) Skipped 117 previous similar messages
      Lustre: lustre-OST0000-osc: Connection to service lustre-OST0000 via nid 10.10.4.195@tcp was lost; in progress operations using this service will wait for recovery to complete.
      Lustre: lustre-OST0005-osc: Connection to service lustre-OST0005 via nid 10.10.4.195@tcp was lost; in progress operations using this service will wait for recovery to complete.
      Lustre: Skipped 4 previous similar messages
      Lustre: lustre-OST0006-osc: Connection to service lustre-OST0006 via nid 10.10.4.195@tcp was lost; in progress operations using this service will wait for recovery to complete.
      Lustre: 7398:0:(import.c:517:import_select_connection()) lustre-OST0000-osc: tried all connections, increasing latency to 2s
      Lustre: 7398:0:(import.c:517:import_select_connection()) Skipped 59 previous similar messages
      LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 0/1: rc = -11
      LustreError: 7725:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 0/1: rc = -5
      LustreError: 7725:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 524307: rc = -5
      LustreError: 7725:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
      LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 1/1: rc = -11
      LustreError: 7750:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 524307: rc = -5
      LustreError: 7750:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
      Lustre: MGS: haven't heard from client 71727980-7899-77af-8af0-a42b0349985a (at 10.10.4.195@tcp) in 49 seconds. I think it's dead, and I am evicting it.
      LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 2/1: rc = -11
      LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) Skipped 1 previous similar message
      LustreError: 7745:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 524307: rc = -5
      LustreError: 7745:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
      LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 3/1: rc = -11
      LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) Skipped 1 previous similar message
      LustreError: 7739:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 524307: rc = -5
      LustreError: 7739:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
      LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 4/1: rc = -11
      LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) Skipped 1 previous similar message
      LustreError: 7731:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 524307: rc = -5
      LustreError: 7731:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
      Lustre: 7397:0:(quota_master.c:1724:mds_quota_recovery()) Only 6/7 OSTs are active, abort quota recovery
      Lustre: 7397:0:(quota_master.c:1724:mds_quota_recovery()) Skipped 6 previous similar messages
      Lustre: lustre-OST0000-osc: Connection restored to service lustre-OST0000 using nid 10.10.4.191@tcp.
      Lustre: Skipped 6 previous similar messages
      Lustre: MDS lustre-MDT0000: lustre-OST0000_UUID now active, resetting orphans
      Lustre: Skipped 6 previous similar messages
      LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 5/1: rc = -11
      LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) Skipped 1 previous similar message
      Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
      Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
      Lustre: DEBUG MARKER: /usr/sbin/lctl mark Client load failed on node client-31vm6, rc=1
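
For readers less familiar with kernel return codes, the `rc` values in the log above are negated Linux errno numbers. The sketch below decodes the two values that appear here; the lookup table is hard-coded with the Linux values, the sample lines are copied from the dmesg output, and the helper name `decode_rc` is hypothetical.

```python
import re

# Linux errno values for the two return codes seen in the log.
# Kernel code returns them negated (rc = -5, rc = -11).
ERRNO_NAMES = {
    5: "EIO (Input/output error)",
    11: "EAGAIN (Resource temporarily unavailable)",
}

# Sample lines copied from the MDS dmesg output above.
LOG = """\
LustreError: 7396:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 0/1: rc = -11
LustreError: 7725:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x80013 sub-object on OST idx 0/1: rc = -5
LustreError: 7725:0:(mds_open.c:442:mds_create_objects()) error creating objects for inode 524307: rc = -5
LustreError: 7725:0:(mds_open.c:827:mds_finish_open()) mds_create_objects: rc = -5
"""


def decode_rc(line):
    """Return (function name, rc, errno description) or None if no rc is found."""
    m = re.search(r"\(\w+\.c:\d+:(\w+)\(\)\).*rc = (-?\d+)$", line)
    if m is None:
        return None
    func, rc = m.group(1), int(m.group(2))
    return func, rc, ERRNO_NAMES.get(-rc, "unknown")


decoded = [decode_rc(line) for line in LOG.splitlines()]
for func, rc, name in decoded:
    print(f"{func}: rc = {rc} -> {name}")
```

Read this way, the -5 returned from `mds_create_objects()` corresponds to EIO, which matches the "Input/output error" text that tar reported on the client.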
      

      Maloo report: https://maloo.whamcloud.com/test_sets/40f45b4c-760f-11e2-b5e2-52540035b04c

          People

            niu Niu Yawei (Inactive)
            yujian Jian Yu