LU-8729

conf-sanity test_84: FAIL: /dev/mapper/mds1_flakey failed to initialize!

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Lustre 2.10.1, Lustre 2.11.0
    • Severity: 3

    Description

      With patch http://review.whamcloud.com/7200 on the master branch, conf-sanity test 84 failed as follows:

      CMD: onyx-31vm7 e2label /dev/mapper/mds1_flakey 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
      Update not seen after 90s: wanted '' got 'lustre:MDT0000'
       conf-sanity test_84: @@@@@@ FAIL: /dev/mapper/mds1_flakey failed to initialize! 
      

      https://testing.hpdd.intel.com/test_sets/e88a61c2-89bf-11e6-a8b7-5254006e85c2
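
      The wait that times out above can be run by hand on the MDS node; a minimal sketch, assuming the same /dev/mapper/mds1_flakey device set up by the test framework:

      # The check the test polls (taken from the CMD line above); the test waits
      # up to 90s for its output to become '' but kept getting 'lustre:MDT0000'.
      e2label /dev/mapper/mds1_flakey 2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
      # Inspect the device-mapper table backing the flakey device:
      dmsetup table mds1_flakey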

    Attachments

    Issue Links

    Activity

            pjones Peter Jones added a comment -

            No - let's just focus on RHEL 7.4


            simmonsja James A Simmons added a comment -

            So the good news is that this is fixed in RHEL7.4. If someone really wants to work on RHEL7.3, they can compare the sources of RHEL7.4 to RHEL7.3 to see which change fixed this problem, so that we can include the patch for RHEL7.3 if so desired. In the meantime, I updated the patch for LU-684 to move this work forward.

            yujian Jian Yu added a comment -

            The autotest system uses the build with the following kernel version supported by the latest master branch:

            $ head -n2 lustre/kernel_patches/targets/3.10-rhel7.target.in 
            lnxmaj="3.10.0"
            lnxrel="514.21.1.el7"
            

            So, the commit doesn't work and the failure still exists.
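
            For anyone cross-checking, the kernel a test node actually boots can be compared against that target file (commands are generic; no specific node is assumed):

              # Kernel running on the test node:
              uname -r        # e.g. 3.10.0-514.21.1.el7.x86_64
              # Kernel version the master branch target is built against:
              head -n2 lustre/kernel_patches/targets/3.10-rhel7.target.in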


            hongchao.zhang Hongchao Zhang added a comment -

            I have tested the commit locally. The commit can fix the problem on 3.10.0-514.2.2.el7.x86_64,
            but not on 3.10.0-514.21.1.el7.x86_64, which is being used in our autotest system.

            yujian Jian Yu added a comment -

            Hi James,

            Can you try linux-commit: 299f6230bc6d0ccd5f95bb0fb865d80a9c7d5ccc

            I tried this in https://review.whamcloud.com/26788 (patch set 25). The same error still occurred:

            Buffer I/O error on dev dm-6, logical block 524272, async page read
            

            simmonsja James A Simmons added a comment -

            Can you try linux-commit: 299f6230bc6d0ccd5f95bb0fb865d80a9c7d5ccc

            dm flakey: fix reads to be issued if drop_writes configured

            v4.8-rc3 commit 99f3c90d0d ("dm flakey: error READ bios during the
            down_interval") overlooked the 'drop_writes' feature, which is meant to
            allow reads to be issued rather than errored, during the down_interval.

            Fixes: 99f3c90d0d ("dm flakey: error READ bios during the down_interval")
            Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
            Signed-off-by: Mike Snitzer <snitzer@redhat.com>
            Cc: stable@vger.kernel.org


            hongchao.zhang Hongchao Zhang added a comment -

            I have set up RHEL7.3 (3.10.0-514.2.2.el7.x86_64) and reproduced the problem using ext4 and the dm-flakey module;
            the reproducer is attached.
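
            The attached reproducer itself is not shown here; a rough sketch of that kind of ext4 + dm-flakey reproducer, with illustrative device names and timings, might look like:

              DEV=/dev/sdb1                          # illustrative backing device
              mkfs.ext4 -q -L lustre:MDT0000 $DEV    # label the backing filesystem
              SZ=$(blockdev --getsz $DEV)
              # flakey target: up 5s, down 5s, writes dropped during the down interval
              dmsetup create mds1_flakey --table "0 $SZ flakey $DEV 0 5 5 1 drop_writes"
              sleep 6                                # land inside the down interval
              # On an affected kernel this read fails and dmesg shows "Buffer I/O error";
              # with the fix the label is returned normally.
              e2label /dev/mapper/mds1_flakey
              dmesg | tail -n 20 | grep -i 'buffer i/o error' || echo "no read errors"
              dmsetup remove mds1_flakey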


            hongchao.zhang Hongchao Zhang added a comment -

            Status Update:

            I have tested it in Maloo by disabling different patches contained in http://review.whamcloud.com/22113 and
            http://review.whamcloud.com/23560 (the debug patch is tracked at https://review.whamcloud.com/#/c/26788/, patch sets 15 ~ 21).
            The error "Buffer I/O error" still occurs.

            I looked at the test results in the original patch for LU-684 (https://review.whamcloud.com/#/c/7200/): there is no such problem prior to
            patch set 20, and it starts to occur from patch set 21. The difference in Lustre itself between the two patch sets is the two patches
            http://review.whamcloud.com/22113 and http://review.whamcloud.com/23560, but the bigger difference is the change of kernel:
            patch set 20 was still tested on RHEL7.2, while patch set 21 began to use RHEL7.3.

            Patch Set 20: https://testing.hpdd.intel.com/test_sessions/c01176f8-a5ef-11e6-b605-5254006e85c2 (3.10.0-327.36.3.el7.x86_64)
            Patch Set 21: https://testing.hpdd.intel.com/test_sessions/41fe0cb2-beeb-11e6-92c6-5254006e85c2 (3.10.0-514.el7.x86_64)

            So the issue could be caused by the RHEL7.3 kernel (3.10.0-514.el7.x86_64).
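
            Since the regression tracks the RHEL7.2 to RHEL7.3 kernel jump, one way to narrow it down is to diff the dm-flakey driver between the two kernel source trees; a sketch, assuming the kernel sources have been unpacked under these (illustrative) directories:

              diff -u linux-3.10.0-327.36.3.el7/drivers/md/dm-flakey.c \
                      linux-3.10.0-514.el7/drivers/md/dm-flakey.c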

            yujian Jian Yu added a comment -

            Hi Hongchao,

            Could you please proceed with this blocker? Thank you.

            yujian Jian Yu added a comment - edited

            Hi Andreas,

            I also removed "--nolockfs" and got the same failure and full stack. So, the failure is not related to "--nolockfs" and/or "--noflush" options.
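
            For context, a minimal sketch of the dmsetup suspend flags being discussed (whether this matches the exact call made by the test framework is an assumption):

              # --noflush skips flushing queued I/O; --nolockfs skips freezing the
              # filesystem before the suspend.
              dmsetup suspend --noflush --nolockfs mds1_flakey
              dmsetup resume mds1_flakey
              # variant without --nolockfs that was also tried:
              dmsetup suspend --noflush mds1_flakey
              dmsetup resume mds1_flakey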


            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: yujian Jian Yu
              Votes: 0
              Watchers: 12

            Dates

              Created:
              Updated:
              Resolved: