[LU-9260] posix failure: access.43 Unresolved Created: 27/Mar/17  Updated: 18/May/20  Resolved: 18/May/20

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0, Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.4, Lustre 2.13.0, Lustre 2.10.6, Lustre 2.12.1, Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Sarah Liu Assignee: Sarah Liu
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

client and server: EL7


Issue Links:
Duplicate
is duplicated by LU-9261 posix failure: chmod.18 Unresolved Resolved
is duplicated by LU-9262 posix failure: creat.28 Unresolved Resolved
is duplicated by LU-9264 posix failure: link.23 Unresolved Resolved
is duplicated by LU-9732 Posix test fail Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Maloo link: https://testing.hpdd.intel.com/test_sets/461c1d7e-12fc-11e7-b742-5254006e85c2

test log

FAILURE SUMMARY:

POSIX failures: 6

Test Name                   Baseline   Lustre Report
access.43                  Succeeded      Unresolved
chmod.18                   Succeeded      Unresolved
chown.18                   Succeeded      Unresolved
creat.28                   Succeeded      Unresolved
creat.30                   Succeeded      Unresolved
link.23                    Succeeded      Unresolved

FAILURE DESCRIPTIONS:

####################################################
Test Name: access.43 Unresolved

	Test Description:
If the implementation supports a read-only file system, EROFS in errno
and a return value of -1 on a call to access(path, amode) when write
access is requested for a file on a read-only file system.
Posix Ref: Component ACCESS Assertion 5.6.3.4-48(C)

	Test Information:
deletion reason: mnt_ro(/dev/loop0, access-d.43) failed


 Comments   
Comment by Andreas Dilger [ 27/Mar/17 ]

It isn't clear why the test script is trying to mount /dev/loop0 (which might be the MDT0000 device?) as read-only, instead of remounting the client filesystem read-only? Definitely the right test here would be remounting the client filesystem read-only.

It would be worthwhile to look at some older POSIX test logs to see why this is failing now, when (presumably) it didn't fail before.

Comment by Sarah Liu [ 12/Apr/17 ]

After investigation, it looks like the POSIX read-only tests have always used a loop device when run against Lustre, even on EL6. That is clearly wrong; the MGS/MDS should be provided here instead.

Test access.43 uses setuprofs to set up the read-only file system:

test43()
{
        char    *errptr;
        int     pathok = 0;

        /* write access on read only file system */

        DBUG_ENTER("test43");

        testfail = 0;

        globok = 0;
        if (setuprofs(t43_dir, t43_file, 'f', (mode_t) MODEANY) != 0)
        {
                DBUG_VOID_RETURN;
        }

setuprofs.c

if ((rofs = tet_getvar(VSX_ROFS)) == NULL || *rofs == '\0')
        {
                xx_rpt(DELETION);
                in_rpt("deletion reason: parameter %s is not set", VSX_ROFS);
                DBUG_PRINTF("return", "setuprofs = 1");
                DBUG_RETURN(1);
        }

in scripts/vsx-pcts/parameterisations.sh

echo "VSX_ROFS=\"$NOSPC_DEV\"" >> $1

in test_sets/SRC/vsxparams

NOSPC_DEV="/dev/loop0"

EL6 passed before but EL7 fails, apparently because on EL7 the loop device cannot be cleaned up, so the subsequent remount of the same device fails.

In order to re run the test suites at a later date run the
rerun_tests program in vsx0's home directory as the vsx0 user

/usr/src/posix/ext4
rm: cannot remove '/usr/src/posix/ext4/TESTROOT/tset/POSIX.os/ioprim/write/d.write/write-d.25': Device or resource busy
rm: cannot remove '/usr/src/posix/ext4/TESTROOT/tset/POSIX.os/ioprim/write/d.write/write-d.16': Device or resource busy
rm: cannot remove '/usr/src/posix/ext4/TESTROOT/tset/POSIX.os/files/mkdir/d.mkdir/mkdir-d.19': Device or resource busy
rm: cannot remove '/usr/src/posix/ext4/TESTROOT/tset/POSIX.os/files/mkfifo/d.mkfifo/mkfifo-d.17': Device or resource busy
rm: cannot remove '/usr/src/posix/ext4/TESTROOT/tset/POSIX.os/files/link/d.link/link-d.25': Device or resource busy
rm: cannot remove '/usr/src/posix/ext4/TESTROOT/tset/POSIX.os/files/rmdir/d.rmdir/rmdir-d.9': Device or resource busy
rm: cannot remove '/usr/src/posix/ext4/TESTROOT/tset/POSIX.os/files/open/d.open/open-d.46': Device or resource busy
rm: cannot remove '/usr/src/posix/ext4/TESTROOT/tset/POSIX.os/files/rename/d.rename/rename-d.17': Device or resource busy
rm: cannot remove '/usr/src/posix/ext4/TESTROOT/tset/POSIX.os/files/unlink/d.unlink/unlink-d.9': Device or resource busy
Install and build POSIX test suite successfully!
Run POSIX test against lustre filesystem
Comment by Andreas Dilger [ 13/Apr/17 ]

I don't think it is the MDS that should be mounted read-only, but rather the client. The test is running on the client. Also, I don't think that the MDS can be mounted read-only and still work.

Comment by Sarah Liu [ 13/Apr/17 ]

Yes, I meant mounting the Lustre client read-only, not the MDS. Mounting the client requires the MGS info, though. Before the POSIX read-only tests run, the file system has to be remounted read-only, but the suite always uses /dev/loop0 (see above: rofs is always /dev/loop0), so it is not testing Lustre at all. I think we need to replace /dev/loop0 with mgs:/fsname.
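
As a rough sketch of that replacement (not the landed patch), the parameter generation could choose the device by file system type. NOSPC_DEV and VSX_ROFS come from the scripts quoted above; the function names, the fstype argument, and the MGSNID and FSNAME variables are assumptions made for illustration.

```shell
# Sketch only: rofs_device and write_rofs_param are hypothetical helpers,
# and MGSNID / FSNAME are assumed variable names.

# Print the device that the read-only tests should mount.
rofs_device() {
    fstype=$1
    if [ "$fstype" = "lustre" ]; then
        # A Lustre client mounts <MGS NID>:/<fsname>, not a local block device.
        printf '%s:/%s\n' "$MGSNID" "$FSNAME"
    else
        # Local file systems keep the loop device default.
        printf '%s\n' "${NOSPC_DEV:-/dev/loop0}"
    fi
}

# Append the VSX_ROFS parameter to a parameter file, mirroring the
# existing echo in parameterisations.sh.
write_rofs_param() {
    fstype=$1
    outfile=$2
    echo "VSX_ROFS=\"$(rofs_device "$fstype")\"" >> "$outfile"
}
```

With MGSNID=onyx-69@tcp and FSNAME=lustre, `write_rofs_param lustre "$1"` would append a VSX_ROFS line naming onyx-69@tcp:/lustre, while any other type falls back to NOSPC_DEV.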

Comment by James Nunez (Inactive) [ 20/Apr/17 ]

I think I understand why we are using the loop device when mounting a readonly file system. In the configuration file TESTROOT/tetexec.cfg, we can set what file system to mount as read only

#	File system which can be mounted read only.
#	Can be the same as VSX_MOUNT_DEV and VSX_NOSPC_DEV.
#	Set to "unsup" if read only file systems are not supported.
VSX_ROFS=/dev/loop0

I need to look into how to change this configuration file in our POSIX configuration/setup.

Comment by Sarah Liu [ 26/May/17 ]

A quick update on progress. I passed MGSNID:/FSNAME into the suite in place of "/dev/loop0" and got the following error. The error has changed: the suite now seems to be trying a read-write mount, while the test itself seems to be testing a read-only mount. I will add some debug info to those tests and see what happens there.

FAILURE SUMMARY:

POSIX failures: 5

Test Name                   Baseline   Lustre Report
chmod.18                   Succeeded      Unresolved
chown.18                   Succeeded      Unresolved
creat.28                   Succeeded      Unresolved
creat.30                   Succeeded      Unresolved
link.23                    Succeeded      Unresolved

FAILURE DESCRIPTIONS:

####################################################
Test Name: chmod.18 Unresolved

        Test Description:
If the implementation supports a read-only file system, EROFS in errno
and a return value of -1 on a call to chmod(path, mode) when the named
file resides on a read-only file system. No change to the file mode
shall occur.
Posix Ref: Component CHMOD Assertion 5.6.4.4-39(C)

        Test Information:


deletion reason: mnt_rw(onyx-69@tcp:/lustre, chmod-d.18) failed

####################################################
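
For the debugging mentioned above, one way to see which mode a mount actually ended up in is to read the options column of the mount table. This is only a sketch: mount_is_ro is a hypothetical helper, it assumes the usual /proc/mounts layout (device, mount point, type, options), and it does not handle mount points containing spaces.

```shell
# Sketch only: mount_is_ro is a hypothetical helper, not part of the suite.
# Usage: mount_is_ro <mountpoint> [mounts-file]
# Exit status: 0 = mounted read-only, 1 = mounted read-write, 2 = not mounted.
mount_is_ro() {
    mnt=$1
    table=${2:-/proc/mounts}
    awk -v mnt="$mnt" '
        # Remember the options of the last entry for this mount point.
        $2 == mnt { opts = $4; found = 1 }
        END {
            if (!found) exit 2
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++)
                if (o[i] == "ro") exit 0
            exit 1
        }
    ' "$table"
}
```

Calling it right after the suite's mount step, with the client mount point as the argument, would show whether the file system was mounted read-only or read-write at that moment.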

Comment by Sarah Liu [ 04/Aug/17 ]

I did more investigation and here is what I found:
1. The change above, using MGS:/FSNAME instead of the loop device, is needed.
2. In addition to 1, FSTYPE in LSB.tools/userintf.c needs to be changed to lustre, and the code recompiled, before running POSIX against Lustre.

That C file defines the fstype as ext2:

/* modify the following for different mountable filesystem types */
#define __USERINTF_FSTYPE "ext2"

So the process would be:
1. Compile the POSIX code and run it with the default file system.
2. Change the C file and recompile.
3. Run the tests against Lustre.
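
Step 2 could be scripted roughly as follows. This is a sketch rather than the toolkit change itself: set_userintf_fstype is a hypothetical helper, and it assumes the define sits on a single line starting with `#define __USERINTF_FSTYPE` and that the new type contains no `|` or `&` characters.

```shell
# Sketch only: set_userintf_fstype is a hypothetical helper.
# Usage: set_userintf_fstype <path/to/userintf.c> <fstype>
set_userintf_fstype() {
    src=$1
    fstype=$2
    tmp="$src.tmp.$$"
    # Rewrite only the __USERINTF_FSTYPE definition; other lines pass through.
    sed "s|^#define __USERINTF_FSTYPE .*|#define __USERINTF_FSTYPE \"$fstype\"|" \
        "$src" > "$tmp" && mv "$tmp" "$src"
}
```

Running `set_userintf_fstype LSB.tools/userintf.c lustre` after the baseline run, and then recompiling, would cover the edit in step 2; the recompile itself is left to the suite's own build.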

Comment by Gerrit Updater [ 23/Aug/17 ]

Wei Liu (wei3.liu@intel.com) uploaded a new patch: https://review.whamcloud.com/28661
Subject: LU-9260 test: Use the correct mount device when test against lustre
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ee7d8850801800d398e67d0ddd1e041b0e87dc4d

Comment by Sarah Liu [ 23/Aug/17 ]

The above patch fixes the problem on the Lustre side. Corresponding changes to the POSIX test suite are also needed; they will be made in the toolkit.

Comment by Gerrit Updater [ 23/Aug/17 ]

Wei Liu (wei3.liu@intel.com) uploaded a new patch: https://review.whamcloud.com/28669
Subject: LU-9260 test: Use correct parameters when test with lustre
Project: build/toolkit
Branch: master
Current Patch Set: 1
Commit: 8a0db0e2de2a1915b1bac6f3180e8bb9bf3c0c0d

Comment by Gerrit Updater [ 30/Aug/17 ]

Minh Diep (minh.diep@intel.com) merged in patch https://review.whamcloud.com/28669/
Subject: LU-9260 test: Use correct parameters when test with lustre
Project: build/toolkit
Branch: master
Current Patch Set:
Commit: 4ed7df5ea427b48f43d153c263bac5a41f57307c

Comment by Gerrit Updater [ 13/Sep/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28661/
Subject: LU-9260 test: Use the correct mount device when test against lustre
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7b59ed3ab3c9bfae95b9904982869d31a7e65770

Comment by Peter Jones [ 13/Sep/17 ]

Landed for 2.11

Comment by Gerrit Updater [ 16/Oct/17 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/29628
Subject: LU-9260 test: Use the correct mount device when test against lustre
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 9e5e025b0bfc1c4280b2fcd4caabd7fc4db7ea67

Comment by James Casper [ 17/Oct/17 ]

Seen again with 2.10.54:

https://testing.hpdd.intel.com/test_sessions/77fc6fa0-eddf-4fdb-b9ad-8724d84acb75

Comment by Gerrit Updater [ 26/Oct/17 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/29628/
Subject: LU-9260 test: Use the correct mount device when test against lustre
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 384cdeac7f40873220193b37fb083970b834fc03

Comment by Jian Yu [ 22/Jan/18 ]

The failure occurred at least 9 times in the last week.

Here is a failure instance on 2.10.57:
https://testing.hpdd.intel.com/test_sets/366580ca-fda5-11e7-a7cd-52540065bddc

Comment by Sarah Liu [ 24/Jan/18 ]

I searched Maloo for the last week and saw 4 failures across master-branch testing and interop testing between b2_10 and master. Failures seen in interop testing between master and b2_9 and/or prior should not count, since the client doesn't have the fix.

Of the 4 failures, 3 were regular configs and 1 was an interop config. All 3 regular-config failures were on ONYX, so I suspect this is an environment-related issue.

Comment by Sarah Liu [ 02/Feb/18 ]

Cannot reproduce the problem on onyx with physical nodes or VMs. I will try provisioning the node from a snapshot and see what I can find.

Comment by Sarah Liu [ 05/Feb/18 ]

Cannot reproduce the problem on a VM either; the nodes were installed via snapshot.

https://testing.hpdd.intel.com/test_sessions/c64e9304-0aa5-11e8-bd00-52540065bddc

Comment by Joseph Gmitter (Inactive) [ 27/Mar/18 ]

Should we resolve this issue as fixed in 2.11.0 if it is no longer seen nor reproducible?

Comment by James Nunez (Inactive) [ 28/Mar/18 ]

We are still seeing this issue or something very close to it. Here is an example: https://testing.hpdd.intel.com/test_sets/5fbcc25c-2a4c-11e8-b6a0-52540065bddc

Comment by Jian Yu [ 14/Dec/18 ]

+1 on master branch:
https://testing.whamcloud.com/test_sets/8484a974-fdf0-11e8-b837-52540065bddc

Comment by James Nunez (Inactive) [ 18/May/20 ]

We will not fix this issue because we’ve replaced the POSIX test suite with pjdfstest.

Generated at Sat Feb 10 02:24:36 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.