[LU-7117] replay-single test_70d: timeout and mkdir/rmdir stopped

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.8.0, Lustre 2.9.0
    • Labels: None
    • Severity: 3

    Description

      This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/56e1b56e-53ff-11e5-8f2c-5254006e85c2.

      The sub-test test_70d failed with the following error:

      error on LL_IOC_LMV_SETSTRIPE '/mnt/lustre/d70d.replay-single/test1' (3): stripe already set
      error: mkdir: create stripe dir '/mnt/lustre/d70d.replay-single/test1' failed
      mkdir fails
      /usr/lib64/lustre/tests/replay-single.sh: line 2236: kill: (25189) - No such process
      mkdir/rmdir 25189 stopped
      

      There have been several test_70d failures and timeouts since 2015-09-02, so I suspect that a patch which landed on that day or the day before introduced a regression.

      Info required for matching: replay-single 70d

          Activity

            sbuisson Sebastien Buisson (Inactive) added a comment - A lot more occurrences recently, like this one on master: https://testing.hpdd.intel.com/test_sets/4d6f9aea-4a09-11e6-8968-5254006e85c2
            bfaccini Bruno Faccini (Inactive) added a comment - +1 on master at https://testing.hpdd.intel.com/test_sets/79710324-49a3-11e6-a80f-5254006e85c2
            sbuisson Sebastien Buisson (Inactive) added a comment - Another occurrence: https://testing.hpdd.intel.com/test_sets/823971fa-433e-11e6-acf3-5254006e85c2

            gerrit Gerrit Updater added a comment - Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/21064
            Subject: LU-7117 mdt: mdt unlink should lock before lookup
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0ae3d50ed2745366f677b79586f3fe72645330ee
            bogl Bob Glossman (Inactive) added a comment - another on master: https://testing.hpdd.intel.com/test_sets/6a54094a-3a9e-11e6-a0ce-5254006e85c2

            yong.fan nasf (Inactive) added a comment - "No new RPCs (except update log redo) should be sent until the recovery is completed?"
            MDT0 is in recovery, but MDT1 is running normally, so RPCs from the client to MDT1 are not blocked.

            gerrit Gerrit Updater added a comment - Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/20940
            Subject: LU-7117 osp: control RPC to be sent when recovery
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5fb68c49af3fa08189110a4e980ad792efd7128b

            bzzz Alex Zhuravlev added a comment - No new RPCs (except update log redo) should be sent until the recovery is completed?

            yong.fan nasf (Inactive) added a comment - After painful debugging through millions of lines of logs, I found that the failure should be related to the following scenario:
            1) The client sent an unlink (PFID/test1) RPC to MDT1;
            2) MDT0 failed over;
            3) MDT1 was replaying with MDT0;
            4) At that time, step 1) triggered a lookup(test1) RPC from MDT1 to MDT0 to get test1's FID;
            5) Because the MDT0 recovery had not completed yet, test1's FID on MDT0 was still the old one, so MDT0 replied to MDT1 with an invalid FID for test1;
            6) MDT1 used the FID given by MDT0 to locate the target child object to be deleted and found nothing, so the unlink operation failed.
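The six steps above can be sketched as a toy model. This is only an illustration under stated assumptions, not Lustre code and not what any of the patches on this ticket do: the classes `ToyMDT0` and `ToyMDT1`, the function `run_scenario`, and the FID strings `fid-old` and `fid-new` are all hypothetical names invented for the example. It assumes test1 was recreated shortly before the failover, so the name-to-FID entry that MDT0 holds during recovery is stale until replay finishes.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyMDT0:
    """Hypothetical stand-in for MDT0: maps entry names to FIDs."""
    # State available right after failover (stale until replay finishes).
    on_disk: Dict[str, str]
    # Updates that still have to be replayed during recovery.
    pending_replay: Dict[str, str]
    in_recovery: bool = True

    def finish_recovery(self) -> None:
        # Apply the replayed updates, then leave recovery.
        self.on_disk.update(self.pending_replay)
        self.pending_replay.clear()
        self.in_recovery = False

    def lookup(self, name: str) -> Optional[str]:
        # Answers from its current state, even in the middle of recovery.
        return self.on_disk.get(name)


@dataclass
class ToyMDT1:
    """Hypothetical stand-in for MDT1: holds child objects keyed by FID."""
    objects: Dict[str, str] = field(default_factory=dict)

    def unlink(self, mdt0: ToyMDT0, name: str) -> bool:
        # Step 4: ask MDT0 for the FID; steps 5-6: use it to find the child.
        fid = mdt0.lookup(name)
        if fid is None or fid not in self.objects:
            return False  # nothing found under that FID, unlink fails
        del self.objects[fid]
        return True


def run_scenario(wait_for_recovery: bool) -> bool:
    """Return True if the unlink of test1 succeeds in the toy model."""
    mdt0 = ToyMDT0(on_disk={"test1": "fid-old"},
                   pending_replay={"test1": "fid-new"})
    mdt1 = ToyMDT1(objects={"fid-new": "test1 object"})
    if wait_for_recovery:
        mdt0.finish_recovery()
    return mdt1.unlink(mdt0, "test1")


if __name__ == "__main__":
    print("unlink during MDT0 recovery:", run_scenario(False))
    print("unlink after MDT0 recovery: ", run_scenario(True))
```

In this model the unlink fails only when the lookup is answered before MDT0 has finished replay, which is the ordering problem described in steps 4) to 6).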

            gerrit Gerrit Updater added a comment - Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/20853
            Subject: LU-7117 tests: enable full debug for reply-single test_70d
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 61ec7fa12d28daffa43bc7a4f0e927d4e6550061
            yong.fan nasf (Inactive) added a comment - More failure instances on master:
            https://testing.hpdd.intel.com/test_sets/478fbade-3180-11e6-bbf5-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/9d99755c-316f-11e6-a0ce-5254006e85c2

            People

              laisiyao Lai Siyao
              maloo Maloo