Lustre / LU-5785

recovery-mds-scale test_failover_ost: test_failover_ost returned 1

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Environment: client and server: lustre-master build # 2690
      client is SLES11 SP3
    • Severity: 3
    • Rank (Obsolete): 16237

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/548b249c-565a-11e4-97f1-5254006e85c2.

      The sub-test test_failover_ost failed with the following error:

      test_failover_ost returned 1
      
      Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
      Lustre: DEBUG MARKER: /usr/sbin/lctl mark ost5 has failed over 1 times, and counting...
      Lustre: DEBUG MARKER: ost5 has failed over 1 times, and counting...
      Lustre: lustre-OST0006: Will be in recovery for at least 1:00, or until 3 clients reconnect
      LustreError: 360:0:(ldlm_lib.c:1730:check_for_next_transno()) lustre-OST0006: waking for gap in transno, VBR is OFF (skip: 4295345303, ql: 2, comp: 1, conn: 3, next: 4295345312, last_committed: 4295345240)
      Lustre: lustre-OST0006: Recovery over after 0:22, of 3 clients 3 recovered and 0 were evicted.
      LustreError: 30883:0:(ldlm_lockd.c:1288:ldlm_handle_enqueue0()) ### delayed lvb init failed (rc -2) ns: filter-lustre-OST0006_UUID lock: ffff88006744d080/0x8abfa2edda4a4ccc lrc: 2/0,0 mode: --/PR res: [0xde56:0x0:0x0].0 rrc: 1 type: EXT [0->0] (req 0->0) flags: 0x40000000000000 nid: local remote: 0x7a6f60280c307e9a expref: -99 pid: 30883 timeout: 0 lvb_type: 0
      LustreError: 30888:0:(ofd_io.c:600:ofd_preprw_write()) lustre-OST0006: BRW to missing obj 0x0:56912
      LustreError: 30888:0:(ofd_io.c:600:ofd_preprw_write()) Skipped 1 previous similar message
      Lustre: DEBUG MARKER: /usr/sbin/lctl mark Duration:               86400
      

          Activity

            yujian Jian Yu added a comment -

            On Client 3 (onyx-45vm6):

                tar: etc/sysconfig/network/ifroute-lo: Cannot write: No such file or directory
                tar: etc/sysconfig/network/routes: Cannot write: No such file or directory
                tar: etc/sysconfig/network/if-down.d/ndp-proxy: Cannot stat: No such file or directory
                tar: Exiting with failure status due to previous errors
            

            It looks like this is a duplicate of LU-4621.

            Hi Hongchao,
            Could you please confirm this?


            adilger Andreas Dilger added a comment -

            The console logs report:

            23:57:55:Lustre: lustre-OST0006: deleting orphan objects from 0x0:56878 to 0x0:56897
            00:19:16:Lustre: lustre-OST0006: deleting orphan objects from 0x0:57116 to 0x0:57142

            so it doesn't look like orphan cleanup caused the object to be removed (the missing object 0x0:56912 is in neither range). Maybe there is something in the debug logs?

            I wonder if it makes sense to dump the debug log to shared storage just before the reboot to try and capture more useful information for debugging?

            It looks like this test is run several times a day, but almost all of the runs are failing for one reason or another. It makes sense to investigate and fix some of these failures.
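            The range check behind the comment above can be done mechanically. The following is a minimal sketch, not part of the Lustre test framework: it assumes the console messages keep the exact wording quoted in this ticket and that the "from ... to ..." bounds are inclusive. The helper names (orphan_ranges, missing_objects, deleted_as_orphan) are hypothetical.

```python
import re

# Patterns match the message wording quoted in this ticket (an assumption;
# other Lustre versions may format these messages differently).
ORPHAN_RE = re.compile(
    r"deleting orphan objects from (0x[0-9a-fA-F]+):(\d+) to (0x[0-9a-fA-F]+):(\d+)"
)
MISSING_RE = re.compile(r"BRW to missing obj (0x[0-9a-fA-F]+):(\d+)")


def orphan_ranges(lines):
    """Return (seq, first_id, last_id) for every orphan-cleanup message."""
    found = []
    for line in lines:
        m = ORPHAN_RE.search(line)
        # Only keep ranges whose start and end share the same sequence.
        if m and m.group(1) == m.group(3):
            found.append((m.group(1), int(m.group(2)), int(m.group(4))))
    return found


def missing_objects(lines):
    """Return (seq, object_id) for every 'BRW to missing obj' message."""
    return [
        (m.group(1), int(m.group(2)))
        for m in (MISSING_RE.search(line) for line in lines)
        if m
    ]


def deleted_as_orphan(seq, oid, ranges):
    """True if the object falls inside any cleanup range (bounds assumed inclusive)."""
    return any(s == seq and lo <= oid <= hi for s, lo, hi in ranges)


# Lines copied from the console output quoted in this ticket.
console = [
    "23:57:55:Lustre: lustre-OST0006: deleting orphan objects from 0x0:56878 to 0x0:56897",
    "00:19:16:Lustre: lustre-OST0006: deleting orphan objects from 0x0:57116 to 0x0:57142",
    "LustreError: 30888:0:(ofd_io.c:600:ofd_preprw_write()) lustre-OST0006: BRW to missing obj 0x0:56912",
]

ranges = orphan_ranges(console)
missing = missing_objects(console)
for seq, oid in missing:
    print(f"{seq}:{oid} deleted as orphan: {deleted_as_orphan(seq, oid, ranges)}")
```

            For the lines quoted here, the check reports that 0x0:56912 lies outside both cleanup ranges, which matches the conclusion that orphan cleanup did not remove the object.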

            jlevi Jodi Levi (Inactive) added a comment -

            Mike,
            Could you please have a look at this one and comment?
            Thank you!

            People

              tappro Mikhail Pershin
              maloo Maloo
              Votes: 0
              Watchers: 6
