LU-1362: replay-dual test_16 fails to remount mdt

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.1
    • Component/s: None
    • Severity: 3
    • Rank: 6409

    Description

      replay-dual test_16 failed:

      == replay-dual test 16: fail MDS during recovery (3571) == 17:38:52 (1335573532)
      Filesystem 1K-blocks Used Available Use% Mounted on
      service360@o2ib:/lustre
      3937056 205112 3531816 6% /mnt/nbp0-1
      total: 25 creates in 0.04 seconds: 678.21 creates/second
      total: 1 creates in 0.00 seconds: 389.26 creates/second
      Failing mds1 on node service360
      Stopping /mnt/mds1 (opts
      affected facets: mds1
      Failover mds1 to service360
      17:39:07 (1335573547) waiting for service360 network 900 secs ...
      17:39:07 (1335573547) network interface is UP
      Starting mds1: -o errors=panic,acl /dev/sdb1 /mnt/mds1
      service360: mount.lustre: mount /dev/sdb1 at /mnt/mds1 failed: Invalid argument
      service360: This may have multiple causes.
      service360: Are the mount options correct?
      service360: Check the syslog for more info.
      mount -t lustre /dev/sdb1 /mnt/mds1
      Start of /dev/sdb1 on mds1 failed 22
      replay-dual test_16: @@@@@@ FAIL: Restart of mds1 failed!
      Dumping lctl log to /var/acc-sm/test_logs//1335573120/replay-dual.test_16.*.1335573548.log
      tar: Removing leading `/' from member names
      /var/acc-sm/test_logs//1335573120/replay-dual-1335573548.tar.bz2
      FAIL 16 (45s)

      The "Invalid argument" was about extents, which we do not turn on on MDS.

      The replay-dual.test_16.dmesg.service360.1335573548.log seemed to suggest
      the problem was filesystem corruption:
      LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. Opts:
      LDISKFS-fs warning (device sdb1): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
      LDISKFS-fs (sdb1): ldiskfs_check_descriptors: Checksum for group 0 failed (27004!=29265)
      LDISKFS-fs (sdb1): group descriptors corrupted!
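
      If the group descriptors really are corrupted on disk, a forced read-only
      fsck should report it with Lustre out of the picture (a sketch, assuming
      Lustre's patched e2fsprogs and an unmounted /dev/sdb1):

      # Full check, answering "no" to every repair prompt (read-only):
      e2fsck -fn /dev/sdb1
      # If the primary superblock area is damaged, retry with a backup
      # superblock (32768 is the usual location on 4k-block filesystems):
      e2fsck -fn -b 32768 /dev/sdb1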

      LU-699 seemed to have encountered a data corruption problem in replay-dual test_1. I applied its patch and rebuilt the Lustre server package, but the test still failed.

      REPLAY_DUAL-16.tgz is attached.

      The failure is 100% reproducible.

      Could the data corruption be caused by trying to fail over the MDS to the
      same node? In other words, is this a test-case problem or a real problem?
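
      The log line "Failover mds1 to service360" shows the framework restarting
      the MDS on its primary node, which is what happens when no failover
      partner is configured. For comparison, a sketch of how a separate
      failover node would be declared in the test configuration (following the
      test-framework ${facet}failover_HOST convention; service361 is a
      hypothetical second server):

      # e.g. in the local test config sourced by the test suites:
      mds1_HOST=service360
      mds1failover_HOST=service361   # fail mds1 over to a different node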

      Attachments

        1. REPLAY_DUAL-16.tgz
          6.43 MB
        2. replay-dual-14b.tar.bz2
          4.56 MB
        3. replay-dual-16.tar.bz2
          1.66 MB
        4. replay-dual-16.tar.bz2
          9.87 MB

          People

            Assignee: Lai Siyao (laisiyao)
            Reporter: Jay Lan (jaylan, Inactive)
