[LU-10566] parallel-scale-nfsv4 test_metabench: mkdir: cannot create directory on Read-only file system Created: 25/Jan/18  Updated: 14/Apr/21

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.10.4, Lustre 2.12.5, Lustre 2.12.6
Fix Version/s: Lustre 2.12.0, Lustre 2.10.4

Type: Bug Priority: Minor
Reporter: Sarah Liu Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-10663 obdfilter-survey Resolved
is related to LU-10851 parallel-scale-nfsv4 hangs on unmount Open
is related to LU-9892 parallel-scale-nfsv3 no sub tests fai... Reopened
is related to LU-12231 parallel-scale-nfsv4 test racer_on_nf... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

parallel-scale-nfsv4 test_metabench - metabench failed! 1
^^^^^^^^^^^^^ DO NOT REMOVE LINE ABOVE ^^^^^^^^^^^^^

This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run:
https://testing.hpdd.intel.com/test_sets/6e890fe6-fd53-11e7-a6ad-52540065bddc

test_metabench failed with the following error:

metabench failed! 1

server: 2.10.57 RHEL7 ldiskfs
client: SLES12SP3

test log

== parallel-scale-nfsv4 test metabench: metabench ==================================================== 11:23:59 (1516389839)
OPTIONS:
METABENCH=/usr/bin/metabench
clients=onyx-33vm1,onyx-33vm2
mbench_NFILES=10000
mbench_THREADS=4
onyx-33vm1
onyx-33vm2
mkdir: cannot create directory ‘/mnt/lustre/d0.parallel-scale-nfs’: Read-only file system
chmod: cannot access '/mnt/lustre/d0.parallel-scale-nfs/d0.metabench': No such file or directory
+ /usr/bin/metabench -w /mnt/lustre/d0.parallel-scale-nfs/d0.metabench -c 10000 -C -S 
+ chmod 0777 /mnt/lustre
chmod: changing permissions of '/mnt/lustre': Read-only file system
dr-xr-xr-x 23 root root 4096 Jan 19 00:29 /mnt/lustre
+ su mpiuser sh -c "/usr/lib64/mpi/gcc/openmpi/bin/mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -mca boot ssh -machinefile /tmp/parallel-scale-nfs.machines -np 8 /usr/bin/metabench -w /mnt/lustre/d0.parallel-scale-nfs/d0.metabench -c 10000 -C -S "
[onyx-33vm2:14600] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_mtl_ofi: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[onyx-33vm2:14600] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_mtl_psm: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[onyx-33vm2:14601] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_mtl_ofi: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[onyx-33vm2:14601] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_mtl_psm: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[onyx-33vm1:09898] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_mtl_ofi: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[onyx-33vm1:09898] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_mtl_psm: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[onyx-33vm1:09900] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_mtl_ofi: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[onyx-33vm1:09900] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_mtl_psm: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[onyx-33vm1:09899] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_mtl_ofi: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[onyx-33vm1:09899] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_mtl_psm: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[onyx-33vm2:14602] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_mtl_ofi: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[onyx-33vm2:14602] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_mtl_psm: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[onyx-33vm2:14604] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_mtl_ofi: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[onyx-33vm2:14604] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_mtl_psm: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[onyx-33vm1:09902] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_mtl_ofi: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[onyx-33vm1:09902] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_mtl_psm: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
Metadata Test <no-name> on 01/19/2018 at 11:23:59

Rank   0 process on node onyx-33vm1
Rank   1 process on node onyx-33vm1
Rank   2 process on node onyx-33vm1
Rank   3 process on node onyx-33vm1
Rank   4 process on node onyx-33vm2
Rank   5 process on node onyx-33vm2
Rank   6 process on node onyx-33vm2
Rank   7 process on node onyx-33vm2

[01/19/2018 11:23:59] FATAL error on process 0
Proc 0: cannot create component d0.parallel-scale-nfs in /mnt/lustre/d0.parallel-scale-nfs/d0.metabench: Read-only file system
[onyx-33vm1][[7407,1],1][btl_tcp_frag.c:238:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)


 Comments   
Comment by James Nunez (Inactive) [ 26/Jan/18 ]

From the top of the suite_log, we see that the file system is 99% full:

UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID      1165900       85376      977328   8% /mnt/lustre[MDT:0]
lustre-OST0000_UUID      1933276     1801184       10852  99% /mnt/lustre[OST:0]
lustre-OST0001_UUID      1933276     1795316       16720  99% /mnt/lustre[OST:1]
lustre-OST0002_UUID      1933276     1795232       16776  99% /mnt/lustre[OST:2]
lustre-OST0003_UUID      1933276     1795292       16660  99% /mnt/lustre[OST:3]
lustre-OST0004_UUID      1933276     1795180       16828  99% /mnt/lustre[OST:4]
lustre-OST0005_UUID      1933276     1795300       16596  99% /mnt/lustre[OST:5]
lustre-OST0006_UUID      1933276     1803440        8596 100% /mnt/lustre[OST:6]

filesystem_summary:     13532932    12580944      103028  99% /mnt/lustre

Does this explain why the file system changed to read-only? I suspect that the NFS mount is read-only, but we should confirm that the underlying Lustre file system is not also read-only.
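One way to check would be to compare the mount options of the NFS mount on the clients against the Lustre mount on the NFS server. A minimal sketch of that check, parsing the options field of /proc/mounts-style entries; the two mount lines below are illustrative samples, not taken from this test run:

```shell
# Check whether a /proc/mounts-style entry carries the "ro" option.
# Field 4 of each entry is the comma-separated mount option list.
is_readonly() {
    echo "$1" | awk '{print $4}' | tr ',' '\n' | grep -qx ro
}

# Sample entries (hypothetical): an NFS client mount and the Lustre
# mount on the NFS server exporting it.
nfs_line="onyx-33vm1:/mnt/lustre /mnt/lustre nfs4 ro,relatime 0 0"
lustre_line="10.9.6.12@tcp:/lustre /mnt/lustre lustre rw,flock 0 0"

is_readonly "$nfs_line"    && echo "NFS mount is read-only"
is_readonly "$lustre_line" || echo "Lustre mount is read-write"
```

On the live nodes the same function could be run over the real entries from /proc/mounts on both the client and the NFS server; if only the NFS side shows `ro`, the export configuration is the suspect rather than the Lustre file system.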

Comment by James Nunez (Inactive) [ 06/Feb/18 ]

We are also seeing this issue in failover test sessions with DNE configured and ZFS servers, with both servers and clients running EL7:
https://testing.hpdd.intel.com/test_sets/ed44110c-fd83-11e7-a7cd-52540065bddc

Comment by Minh Diep [ 08/Feb/18 ]

+1 on b2_10
https://testing.hpdd.intel.com/test_sets/ba76657c-0b5c-11e8-a6ad-52540065bddc

Comment by Gerrit Updater [ 08/Feb/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31231
Subject: LU-10566 test: don't direct lfs df to /dev/null
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b9bc92bd01d9a769ddb4d8669b27fe6db8e7cf54

Comment by Minh Diep [ 09/Feb/18 ]

I found that obdfilter-survey test_1c did not clean up properly:

=============> Destroy 1 on 10.9.6.12:lustre-OST0000_ecc
error: destroy: invalid objid '3'
destroy OST object <objid> [num [verbose]]
usage: destroy <num> objects, starting at objid <objid>
run <command> after connecting to device <devno>
--device <devno> <command [args ...]>

Comment by James Nunez (Inactive) [ 14/Mar/18 ]

It looks like we are hitting this again with 2.10.59 RHEL 7 ldiskfs servers and RHEL 7 clients.

This time, the file system is not nearly full. From the suite_log, before any parallel-scale-nfsv4 tests run:

UUID                   1K-blocks        Used   Available Use% Mounted on

lustre-MDT0000_UUID      1165900       17368     1045336   2% /mnt/lustre[MDT:0]
lustre-OST0000_UUID      1933276       26956     1781868   1% /mnt/lustre[OST:0]
lustre-OST0001_UUID      1933276       26944     1785064   1% /mnt/lustre[OST:1]
lustre-OST0002_UUID      1933276       31044     1780088   2% /mnt/lustre[OST:2]
lustre-OST0003_UUID      1933276       26956     1784908   1% /mnt/lustre[OST:3]
lustre-OST0004_UUID      1933276       26948     1784972   1% /mnt/lustre[OST:4]
lustre-OST0005_UUID      1933276       26988     1784812   1% /mnt/lustre[OST:5]
lustre-OST0006_UUID      1933276       26960     1784960   1% /mnt/lustre[OST:6]

filesystem_summary:     13532932      192796    12486672   2% /mnt/lustre 

We see the same output from metabench as in the description:

== parallel-scale-nfsv4 test metabench: metabench ==================================================== 08:43:41 (1521017021)
OPTIONS:
METABENCH=/usr/bin/metabench
clients=onyx-30vm5.onyx.hpdd.intel.com,onyx-30vm6
mbench_NFILES=10000
mbench_THREADS=4
onyx-30vm5.onyx.hpdd.intel.com
onyx-30vm6
mkdir: cannot create directory '/mnt/lustre/d0.parallel-scale-nfs': Read-only file system
chmod: cannot access '/mnt/lustre/d0.parallel-scale-nfs/d0.metabench': No such file or directory
+ /usr/bin/metabench -w /mnt/lustre/d0.parallel-scale-nfs/d0.metabench -c 10000 -C -S
+ chmod 0777 /mnt/lustre
chmod: changing permissions of '/mnt/lustre': Read-only file system
dr-xr-xr-x 23 root root 4096 Mar 14 07:59 /mnt/lustre 

https://testing.hpdd.intel.com/test_sets/734ea438-2773-11e8-9e0e-52540065bddc

https://testing.hpdd.intel.com/test_sets/3b702578-2769-11e8-9e0e-52540065bddc

Comment by Gerrit Updater [ 16/Mar/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31679
Subject: LU-10566 test: debug
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7550568625d2732afeb36a52df36db0109bda82d

Comment by Gerrit Updater [ 09/Apr/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31679/
Subject: LU-10566 test: fix nfs exports clean up
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2cdc1ad8b86d013fdb8ffc70ee567284537eee47

Comment by Peter Jones [ 09/Apr/18 ]

Landed for 2.12

Comment by Gerrit Updater [ 11/Apr/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31953
Subject: LU-10566 test: fix nfs exports clean up
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 5a86bc8daac771e428ad839f27ea1542a4a40f48

Comment by Gerrit Updater [ 16/Apr/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31953/
Subject: LU-10566 test: fix nfs exports clean up
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 741347aafb8053d02294650add007e1bf050e978

Comment by Sarah Liu [ 29/Nov/18 ]

Hit this again on b2_10 2.10.6-rc2 with ZFS and DNE:

https://testing.whamcloud.com/test_sets/05a4f148-ef60-11e8-bfe1-52540065bddc

Comment by James Nunez (Inactive) [ 04/Jun/20 ]

I'm reopening this ticket because we are still seeing the read-only file system problem for 2.12.5 RC1 at https://testing.whamcloud.com/test_sets/bd833e99-d557-4ec6-a768-91440b98b55e .

Maybe LU-12231 is the same issue, and this ticket can be closed in its favor since LU-12231 is still open?

Generated at Sat Feb 10 02:36:14 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.