[LU-12231] parallel-scale-nfsv4 test racer_on_nfs fails with 'test_racer_on_nfs failed with 1' Created: 26/Apr/19  Updated: 19/May/23  Resolved: 19/May/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.1, Lustre 2.12.3, Lustre 2.12.4, Lustre 2.12.5, Lustre 2.12.6
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: ubuntu, ubuntu18, ubuntu20
Environment:

Ubuntu 18.04 clients


Issue Links:
Duplicate
duplicates LU-14294 parallel-scale-nfsv4 fails to start w... Resolved
Related
is related to LU-10566 parallel-scale-nfsv4 test_metabench: ... Reopened
is related to LU-15781 Ubuntu 22.04 LTS release support Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

parallel-scale-nfsv4 test_racer_on_nfs fails with the following errors on Ubuntu 18.04 clients only, and it hangs 100% of the time on those clients.

The logs for a recent failure, at https://testing.whamcloud.com/test_sets/4ca25e50-6661-11e9-aeec-52540065bddc, contain several messages that may be related to the failure. In the suite_log, we see that the main issue is that the file system is read-only:

== parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 07:04:44 (1556089484)
CMD: trevis-43vm10,trevis-43vm9.trevis.whamcloud.com MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
trevis-43vm9: mkdir: cannot create directory '/mnt/lustre/d0.parallel-scale-nfs': Read-only file system
trevis-43vm10: mkdir: cannot create directory '/mnt/lustre/d0.parallel-scale-nfs': Read-only file system
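
To triage many test sessions for this symptom, the suite_log can be scanned for the 'Read-only file system' lines shown above. The following is a minimal sketch, not part of the Lustre test framework; the function name find_read_only_errors and the regular expression are illustrative assumptions based only on the line format in this excerpt.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   trevis-43vm9: mkdir: cannot create directory '/mnt/...': Read-only file system
# The optional "host: " prefix is how the suite_log excerpt labels remote output.
# Assumption: host names contain only letters, digits, '.', '_' and '-'.
EROFS_LINE = re.compile(
    r"^(?:(?P<host>[\w.-]+): )?(?P<cmd>\w+): (?P<msg>.*): Read-only file system\s*$"
)


def find_read_only_errors(log_text):
    """Return {host: [(command, message), ...]} for each read-only error line."""
    hits = defaultdict(list)
    for line in log_text.splitlines():
        match = EROFS_LINE.match(line.strip())
        if match:
            # Lines without a host prefix are grouped under "(local)".
            host = match.group("host") or "(local)"
            hits[host].append((match.group("cmd"), match.group("msg")))
    return dict(hits)
```

Called on the text of a downloaded suite_log, this returns one entry per client that reported the error, which makes it easy to see whether every client is affected or only some.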

In the console log for the MDS (vm12), we see:

[  238.469591] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 07:04:44 (1556089484)
[  539.092142] RPC request reserved 84 but used 100
[  545.688053] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  parallel-scale-nfsv4 test_racer_on_nfs: @@@@@@ FAIL: test_racer_on_nfs failed with 1 

For a few of the failures, we don’t see this RPC message.
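
Because the RPC message appears in some failures and not others, it can help to check each MDS console log for it mechanically. This is a hedged sketch that assumes the message always has the form seen in the excerpt above ('RPC request reserved N but used M' after a bracketed kernel timestamp); the name rpc_reserve_warnings is hypothetical.

```python
import re

# Pattern taken from the single console line quoted in this ticket:
#   [  539.092142] RPC request reserved 84 but used 100
RPC_LINE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+RPC request reserved (?P<reserved>\d+) but used (?P<used>\d+)"
)


def rpc_reserve_warnings(console_text):
    """Return a list of (timestamp, reserved, used) tuples found in the log."""
    found = []
    for line in console_text.splitlines():
        match = RPC_LINE.match(line)
        if match:
            found.append(
                (
                    float(match.group("ts")),
                    int(match.group("reserved")),
                    int(match.group("used")),
                )
            )
    return found
```

An empty list for a failed session would correspond to the cases where the RPC message is absent.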

In the console log for both clients we see

[  228.129232] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 07:04:44 (1556089484)
[  228.311480] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
[  265.262396] random: crng init done
[  265.263333] random: 4 urandom warning(s) missed due to ratelimiting

There are several more examples of this failure; here are a couple of additional links to logs:
https://testing.whamcloud.com/test_sets/32607d94-6358-11e9-8bb1-52540065bddc
https://testing.whamcloud.com/test_sets/4b6d37a0-6614-11e9-aeec-52540065bddc



 Comments   
Comment by James Nunez (Inactive) [ 16/Oct/19 ]

We're still seeing racer_on_nfs fail with the 'read-only' file system issue described here, and we are now also seeing test metabench fail with 'read-only' file system errors. See https://testing.whamcloud.com/test_sets/ead02450-eb26-11e9-b62b-52540065bddc for logs:

== parallel-scale-nfsv4 test metabench: metabench ==================================================== 06:11:52 (1570687912)
OPTIONS:
METABENCH=/usr/bin/metabench
clients=trevis-63vm10,trevis-63vm9.trevis.whamcloud.com
mbench_NFILES=10000
mbench_THREADS=4
trevis-63vm10
trevis-63vm9.trevis.whamcloud.com
mkdir: cannot create directory '/mnt/lustre/d0.parallel-scale-nfs': Read-only file system
chmod: cannot access '/mnt/lustre/d0.parallel-scale-nfs/d0.metabench': No such file or directory
+ /usr/bin/metabench -w /mnt/lustre/d0.parallel-scale-nfs/d0.metabench -c 10000 -C -S 
+ chmod 0777 /mnt/lustre
chmod: changing permissions of '/mnt/lustre': Read-only file system
dr-xr-xr-x 24 root root 4096 Aug 29 20:41 /mnt/lustre
+ su mpiuser sh -c "/usr/bin/mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -mca boot ssh --oversubscribe -machinefile /tmp/parallel-scale-nfs.machines -np 8 /usr/bin/metabench -w /mnt/lustre/d0.parallel-scale-nfs/d0.metabench -c 10000 -C -S "
Metadata Test <no-name> on 10/10/2019 at 06:11:55

Rank   0 process on node trevis-63vm10.trevis.whamcloud.com
Rank   1 process on node trevis-63vm10.trevis.whamcloud.com
Rank   2 process on node trevis-63vm10.trevis.whamcloud.com
Rank   3 process on node trevis-63vm10.trevis.whamcloud.com
Rank   4 process on node trevis-63vm9.trevis.whamcloud.com
Rank   5 process on node trevis-63vm9.trevis.whamcloud.com
Rank   6 process on node trevis-63vm9.trevis.whamcloud.com
Rank   7 process on node trevis-63vm9.trevis.whamcloud.com

[10/10/2019 06:11:56] FATAL error on process 0
Proc 0: cannot create component d0.parallel-scale-nfs in /mnt/lustre/d0.parallel-scale-nfs/d0.metabench: Read-only file system
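
One way to confirm the read-only state on a client before the test runs is to inspect its mount table. The ticket does not say how the mount was checked, so this is only a sketch under that assumption: it parses text in the /proc/mounts format (device, mount point, file system type, comma-separated options) and reports NFS mounts that carry the 'ro' option. The function name read_only_mounts is hypothetical.

```python
def read_only_mounts(mounts_text, fstype_prefix="nfs"):
    """Return [(mount_point, device), ...] for read-only mounts.

    mounts_text is expected in /proc/mounts format. Only file system types
    starting with fstype_prefix (for example "nfs" or "nfs4") are considered.
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fstype, options = fields[:4]
        # "ro" must be a whole option, not a substring of another option.
        if fstype.startswith(fstype_prefix) and "ro" in options.split(","):
            result.append((mount_point, device))
    return result
```

On a Linux client this could be called as read_only_mounts(open("/proc/mounts").read()); a non-empty result for /mnt/lustre would be consistent with the mkdir and chmod errors in the log above.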
Comment by James Nunez (Inactive) [ 11/Nov/19 ]

We are also seeing this issue on non-Ubuntu clients. Here is an example of a RHEL ZFS test session that fails with these errors: https://testing.whamcloud.com/test_sets/0d54458c-035b-11ea-b934-52540065bddc .

Comment by James A Simmons [ 19/May/23 ]

Patch https://review.whamcloud.com/c/fs/lustre-release/+/49062 resolved this bug.

Generated at Sat Feb 10 02:50:46 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.