[LU-4470] replay-dual test_21b: FAIL: lustre-MDT0000: ffff88007f581100 CANNOT BE SET READONLY: rc = -95 Created: 10/Jan/14  Updated: 27/Jun/16  Resolved: 04/Feb/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.5.1, Lustre 2.8.0
Fix Version/s: Lustre 2.6.0, Lustre 2.5.1

Type: Bug Priority: Minor
Reporter: Jian Yu Assignee: Bob Glossman (Inactive)
Resolution: Fixed Votes: 0
Labels: ssf
Environment:

Lustre build: http://build.whamcloud.com/job/lustre-reviews/20841/
Distro/Arch: SLES11SP3/x86_64 (both server and client, kernel version: 3.0.101-0.8)


Issue Links:
Duplicate
is duplicated by LU-3742 Failure on test suite replay-single t... Resolved
is duplicated by LU-4467 replay-single test_44c: FAIL: unliked... Resolved
is duplicated by LU-4468 replay-dual test_0a: stat: cannot rea... Resolved
is duplicated by LU-5611 WARNING: kernel missing dev_set_rdonl... Resolved
Related
is related to LU-8333 replay-dual test_21b: can't check if ... Reopened
is related to LU-7740 replay-dual test_0a: test_0a failed w... Resolved
Severity: 3
Rank (Obsolete): 12241

 Description   

replay-dual test 21b failed as follows:

Started clients client-9vm1: 
CMD: client-9vm1 mount | grep /mnt/lustre' '
client-9vm3@tcp:/lustre on /mnt/lustre type lustre (rw,user_xattr,acl,flock)
CMD: client-9vm1 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/mpi/gcc/openmpi/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/bin:/bin:/sbin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"-1\" \" 0xffb7e3ff\" 32 
 replay-dual test_21b: @@@@@@ FAIL: The test cannot check whether COS works or not: all renames are replied w/o COS 

Maloo report: https://maloo.whamcloud.com/test_sets/e3e6ca96-7989-11e3-a27b-52540035b04c



 Comments   
Comment by Andreas Dilger [ 15/Jan/14 ]

So far, this problem is only being seen on the most recent SLES11 test runs.

Comment by Bob Glossman (Inactive) [ 29/Jan/14 ]

I very strongly suspect that this problem comes down to a flaw in the SLES build. Here's my reasoning:

Failed tests all show errors like the following in the test log:

CMD: client-9vm3 /usr/sbin/lctl --device lustre-MDT0000 readonly
client-9vm3: error: readonly: Operation not supported

Turning devices read-only is a critical part of the barriers used in the tests. With that feature not working, it's no surprise that many tests fail.

Looking in the MDS dmesg log for an explanation, errors like the following can be seen:

[32702.560584] LustreError: 12374:0:(osd_handler.c:1267:osd_ro()) lustre-MDT0000: ffff88007f7b2100 CANNOT BE SET READONLY: rc = -95

Investigating the code in osd_ro(), where the error comes from, shows that the error "CANNOT BE SET READONLY" can happen only if Lustre is configured with #undef HAVE_DEV_SET_RDONLY. It's not clear why Lustre is getting configured that way. autoconf should be setting #define HAVE_DEV_SET_RDONLY, because dev_set_rdonly() really does exist in the SLES kernels and autoconf should detect that. When I do local builds of Lustre on SLES, it does in fact configure correctly and puts "#define HAVE_DEV_SET_RDONLY 1" into Lustre's config.h.
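
A quick way to confirm which way a given build was configured is to inspect the generated config.h. The following is a minimal sketch only; the function name check_rdonly_config is hypothetical, and the location of config.h depends on the build tree being examined.

```shell
# Hypothetical helper: report whether a generated config.h enables
# HAVE_DEV_SET_RDONLY. Prints "defined" or "undefined".
# Usage: check_rdonly_config path/to/config.h
check_rdonly_config() {
    config_h=$1
    # Match an active "#define HAVE_DEV_SET_RDONLY ..." line only; the
    # commented-out "/* #undef HAVE_DEV_SET_RDONLY */" form does not match.
    if grep -Eq '^[[:space:]]*#[[:space:]]*define[[:space:]]+HAVE_DEV_SET_RDONLY([[:space:]]|$)' "$config_h"; then
        echo "defined"
    else
        echo "undefined"
    fi
}
```

A build that prints "undefined" here would be expected to hit the rc = -95 path in osd_ro() described above.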

I speculate that there is some bad interaction between lbuild and lustre autoconf on SLES only.

Checking back in any recent Jenkins build log, the following can be seen:

checking if Linux was built with symbol dev_set_rdonly exported... no
configure: WARNING: kernel missing dev_set_rdonly patch for testing

This confirms that Lustre on SLES is indeed being configured without HAVE_DEV_SET_RDONLY in Jenkins builds.
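
The same check can be scripted against a saved build log. This is a minimal sketch, assuming the log has been downloaded as a plain text file; the function name rdonly_configure_result is hypothetical, and the matched text is the configure message quoted above.

```shell
# Hypothetical helper: print the answer ("yes" or "no") that configure
# recorded for the dev_set_rdonly export check in a saved build log.
# If the check appears more than once, the last occurrence wins.
# Usage: rdonly_configure_result path/to/build.log
rdonly_configure_result() {
    sed -n 's/^checking if Linux was built with symbol dev_set_rdonly exported\.\.\. \([a-z]*\).*$/\1/p' "$1" | tail -n 1
}
```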

I suspect that the whole set of recently reported SLES-only test bugs has this same underlying cause.

Comment by Bob Glossman (Inactive) [ 29/Jan/14 ]

I think the build problem may come down to the way the autoconf function LB_CHECK_SYMBOL_EXPORT is defined in config/lustre-build-linux.m4.

This checks both the kernel source and the symbol file for the relevant symbol. However, it looks for the symbol file in $LINUX. I'm pretty sure it should be looking in $LINUX_OBJ. This makes no difference in CentOS builds, where $LINUX and $LINUX_OBJ are really the same. It also makes no difference in my local builds on SLES, where they are also the same. On typical SLES builds they are different.

I suspect a simple change from $LINUX to $LINUX_OBJ in that one autoconf function may fix things. Will create and push a patch to do that. If it passes builds in all the distros we use, it may be all that is needed.
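
To illustrate why the directory matters, here is a rough sketch of a symbol-file lookup. It is not the actual m4 macro: the function name symbol_in_symvers is hypothetical, and the file name Module.symvers and its column layout (CRC, symbol, module, export type) are assumptions about a typical kernel build tree.

```shell
# Hypothetical helper, not the real LB_CHECK_SYMBOL_EXPORT: succeed (exit 0)
# if symbol $1 is listed in the Module.symvers file under directory $2.
# Usage: symbol_in_symvers dev_set_rdonly "$LINUX_OBJ"
symbol_in_symvers() {
    sym=$1
    symvers=$2/Module.symvers   # assumed file name
    # A tree without a symbol file (e.g. a source-only tree) cannot match.
    [ -f "$symvers" ] || return 1
    # Assumed columns: CRC, symbol, module, export type. Compare the second
    # field exactly so that a prefix of a longer symbol name does not match.
    awk -v s="$sym" '$2 == s { found = 1 } END { exit !found }' "$symvers"
}
```

On a builder where the two trees differ, running this once with "$LINUX" and once with "$LINUX_OBJ" should show whether only the object tree carries the symbol file.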

Comment by Bob Glossman (Inactive) [ 29/Jan/14 ]

http://review.whamcloud.com/9056

Comment by Bob Glossman (Inactive) [ 29/Jan/14 ]

The build hasn't completed yet, but it has already proceeded far enough that I can see

checking if Linux was built with symbol dev_set_rdonly exported... yes

in the build logs for SP2 and SP3.

Comment by Peter Jones [ 03/Feb/14 ]

Landed for 2.6

Comment by Jian Yu [ 17/Feb/14 ]

Patch for Lustre b2_5 branch: http://review.whamcloud.com/9288

Comment by Saurabh Tandan (Inactive) [ 03/Feb/16 ]

Encountered the same issue for tag 2.7.66 for FULL - EL7.1 Server/EL6.7 Client, master, build #3314
https://testing.hpdd.intel.com/test_sets/84997ede-ca91-11e5-9609-5254006e85c2

Started lustre-MDT0000
stat: cannot read file system information for `/mnt/lustre': Input/output error
 replay-dual test_0a: @@@@@@ FAIL: test_0a failed with 1 

Comment by Andreas Dilger [ 04/Feb/16 ]

Saurabh, please file a new bug and link it to this bug. It may be related, or it may just have the same symptoms, but it is better not to reopen this old bug, which was marked fixed for 2.6.0, as that makes tracking the fixes much more difficult.

Generated at Sat Feb 10 01:43:01 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.