[LU-4470] replay-dual test_21b: FAIL: lustre-MDT0000: ffff88007f581100 CANNOT BE SET READONLY: rc = -95


    Description

      replay-dual test_21b failed as follows:

      Started clients client-9vm1: 
      CMD: client-9vm1 mount | grep /mnt/lustre' '
      client-9vm3@tcp:/lustre on /mnt/lustre type lustre (rw,user_xattr,acl,flock)
      CMD: client-9vm1 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/mpi/gcc/openmpi/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/bin:/bin:/sbin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"-1\" \" 0xffb7e3ff\" 32 
       replay-dual test_21b: @@@@@@ FAIL: The test cannot check whether COS works or not: all renames are replied w/o COS 
      

      Maloo report: https://maloo.whamcloud.com/test_sets/e3e6ca96-7989-11e3-a27b-52540035b04c


          Activity


            Andreas Dilger added a comment:

            Saurabh, please file a new bug and link it to this bug. It may be related, or it may just have the same symptoms, but it is better not to re-open this old bug, which was marked fixed for 2.6.0, as that makes tracking the fixes much more difficult.

            Saurabh Tandan added a comment:

            Encountered the same issue for tag 2.7.66 (FULL, EL7.1 server / EL6.7 client, master, build #3314):
            https://testing.hpdd.intel.com/test_sets/84997ede-ca91-11e5-9609-5254006e85c2

            Started lustre-MDT0000
            stat: cannot read file system information for `/mnt/lustre': Input/output error
             replay-dual test_0a: @@@@@@ FAIL: test_0a failed with 1
            Jian Yu added a comment:

            Patch for Lustre b2_5 branch: http://review.whamcloud.com/9288
            Peter Jones added a comment:

            Landed for 2.6.

            Bob Glossman added a comment:

            The build hasn't completed yet, but it has already proceeded far enough that I can see

            checking if Linux was built with symbol dev_set_rdonly exported... yes

            in the build logs for sp2 and sp3.

            Bob Glossman added a comment:

            I think the build problem may come down to the way the autoconf function LB_CHECK_SYMBOL_EXPORT is defined in config/lustre-build-linux.m4.

            It checks both the kernel source and the symbol file for the relevant symbol. However, it looks for the symbol file in $LINUX, and I'm pretty sure it should be looking in $LINUX_OBJ. This makes no difference in CentOS builds, where $LINUX and $LINUX_OBJ are really the same, nor in my local builds on SLES, where they are also the same. On typical SLES builds, however, they are different.

            I suspect that simply changing $LINUX to $LINUX_OBJ in that one autoconf function may fix things. I will create and push a patch to do that. If it passes builds in all the distros we use, it may be all that is needed.
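            The $LINUX versus $LINUX_OBJ distinction can be illustrated with a small C sketch of the symbol-file half of such a check. This is not the m4 code in config/lustre-build-linux.m4: the helpers symvers_has_symbol() and kernel_exports_symbol() are hypothetical, and the sketch assumes the symbol file is Module.symvers, with the exported symbol name in the second tab-separated field of each line. It only tries to show that the answer depends entirely on which directory is searched.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper. Return 1 if SYMBOL is the second tab-separated field
 * of any line in a Module.symvers-style stream
 * ("<crc>\t<symbol>\t<module>\t<export type>..."), 0 otherwise.
 * Assumes every line fits in the 512-byte buffer.
 */
static int symvers_has_symbol(FILE *fp, const char *symbol)
{
    char line[512];

    assert(fp != NULL && symbol != NULL);
    size_t want = strlen(symbol);

    while (fgets(line, sizeof(line), fp) != NULL) {
        char *tab = strchr(line, '\t');      /* end of the CRC field */
        if (tab == NULL)
            continue;
        const char *name = tab + 1;          /* start of the symbol field */
        size_t len = strcspn(name, "\t\n");  /* length of the symbol field */
        if (len == want && strncmp(name, symbol, len) == 0)
            return 1;
    }
    return 0;
}

/*
 * Hypothetical helper. Report whether SYMBOL is listed in
 * <obj_dir>/Module.symvers. obj_dir is meant to be the kernel object
 * tree ($LINUX_OBJ), not the source tree ($LINUX).
 * Returns 1 if found, 0 if not found, -1 if the file cannot be read.
 */
static int kernel_exports_symbol(const char *obj_dir, const char *symbol)
{
    char path[4096];
    int n, found;
    FILE *fp;

    assert(obj_dir != NULL && symbol != NULL);
    n = snprintf(path, sizeof(path), "%s/Module.symvers", obj_dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;                           /* path too long */
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;                           /* no symbol file in this tree */
    found = symvers_has_symbol(fp, symbol);
    fclose(fp);
    return found;
}
```

            If, as suggested above, the symbol file on typical SLES builds exists only under $LINUX_OBJ, then calling kernel_exports_symbol() with the $LINUX directory would not find it, and a check built on the same lookup could report "no" even though dev_set_rdonly is exported.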
            Bob Glossman added a comment (edited):

            I very strongly suspect that this problem comes down to a flaw in the SLES build. Here's my reasoning:

            The failed tests all show errors like the following in the test log:

            CMD: client-9vm3 /usr/sbin/lctl --device lustre-MDT0000 readonly
            client-9vm3: error: readonly: Operation not supported
            

            Turning devices read-only is a critical part of the barriers used in the tests. With that feature not working, it's no surprise that many tests fail.

            Looking in the MDS dmesg log for an explanation, errors like the following can be seen:

            [32702.560584] LustreError: 12374:0:(osd_handler.c:1267:osd_ro()) lustre-MDT0000: ffff88007f7b2100 CANNOT BE SET READONLY: rc = -95
            

            Investigating the code in osd_ro(), where the error comes from, shows that the "CANNOT BE SET READONLY" error can happen only if Lustre is configured with #undef HAVE_DEV_SET_RDONLY. It's not clear why Lustre is being configured that way. Autoconf should be setting #define HAVE_DEV_SET_RDONLY, since dev_set_rdonly() really does exist in the SLES kernels and autoconf should detect that. When I do local builds of Lustre on SLES, it does in fact configure correctly and puts "#define HAVE_DEV_SET_RDONLY 1" into Lustre's config.h.
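            The osd_ro() behaviour described above comes down to conditional compilation. Below is a simplified C model of that pattern, not the real osd_handler.c code; sketch_osd_ro() and fake_dev_set_rdonly() are hypothetical names. On Linux, errno 95 is EOPNOTSUPP ("Operation not supported"), so the rc = -95 in the MDS log is consistent with the lctl "Operation not supported" error in the test log.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * configure writes "#define HAVE_DEV_SET_RDONLY 1" into config.h when it
 * detects the exported symbol. Uncomment this line, or compile with
 * -DHAVE_DEV_SET_RDONLY, to model a correctly configured build.
 */
/* #define HAVE_DEV_SET_RDONLY 1 */

#ifdef HAVE_DEV_SET_RDONLY
/* Hypothetical stand-in for the patched kernel's dev_set_rdonly(). */
static void fake_dev_set_rdonly(void *bdev)
{
    (void)bdev;
}
#endif

/* Hypothetical, simplified model of the osd_ro() control flow. */
static int sketch_osd_ro(const char *name, void *bdev)
{
    int rc = 0;

    assert(name != NULL);
#ifdef HAVE_DEV_SET_RDONLY
    (void)name;
    fake_dev_set_rdonly(bdev);
#else
    /* Built without the configure result: read-only mode is unavailable. */
    rc = -EOPNOTSUPP;
    fprintf(stderr, "%s: %p CANNOT BE SET READONLY: rc = %d\n", name, bdev, rc);
#endif
    return rc;
}
```

            Compiled as-is, sketch_osd_ro() takes the #else branch and returns -EOPNOTSUPP; compiled with -DHAVE_DEV_SET_RDONLY it returns 0. The same source thus behaves differently depending only on what configure wrote into config.h.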

            I speculate that there is some bad interaction between lbuild and Lustre's autoconf, on SLES only.

            Checking back in any recent Jenkins build logs, the following can be seen:

            checking if Linux was built with symbol dev_set_rdonly exported... no
            configure: WARNING: kernel missing dev_set_rdonly patch for testing
            

            This confirms that Lustre on SLES is indeed being configured without HAVE_DEV_SET_RDONLY in Jenkins builds.

            I suspect that all of the recently reported SLES-only test bugs have this same underlying cause.


            Andreas Dilger added a comment:

            So far, this problem is only being seen on the most recent SLES11 test runs.

            People

              bogl Bob Glossman (Inactive)
              yujian Jian Yu