[LU-9940] posix no sub tests failed: Created: 01/Sep/17  Updated: 19/Nov/20  Resolved: 18/May/20

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.10.2, Lustre 2.12.0, Lustre 2.10.3, Lustre 2.10.5, Lustre 2.13.0, Lustre 2.10.7, Lustre 2.12.1, Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Casper Assignee: WC Triage
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

Trevis, full
server: RHEL 7.3, ldiskfs, branch b2_10, v2.10.0.38, b12
client: RHEL 7.4, branch master, v2.10.52, b3631


Issue Links:
Related
is related to LU-14137 parallel-scale-nfsv4 racer_on_nfs ser... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

https://testing.whamcloud.com/test_sessions/0cca9fbf-af0d-4ad7-ad37-7797a1864e19

From suite_log:

Setup mgs, mdt, osts
CMD: trevis-10vm4 mkdir -p /mnt/lustre-mds1
CMD: trevis-10vm4 test -b /dev/lvm-Role_MDS/P1
CMD: trevis-10vm4 e2label /dev/lvm-Role_MDS/P1
Starting mds1:   /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
CMD: trevis-10vm4 mkdir -p /mnt/lustre-mds1; mount -t lustre   		                   /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
trevis-10vm4: mount.lustre: according to /etc/mtab /dev/mapper/lvm--Role_MDS-P1 is already mounted on /mnt/lustre-mds1

Note: The "already mounted" message was also present in LU-9487.
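
For reference, a minimal pre-flight check for this condition (a sketch only, run by hand on the MDS node; the device and mountpoint names are taken from the log above) could be:

# Hypothetical manual check on trevis-10vm4: verify the MDT mountpoint is free
# before test-framework.sh tries to mount the device again.
if mountpoint -q /mnt/lustre-mds1; then
    echo "MDT is still mounted from a previous run:"
    grep /mnt/lustre-mds1 /proc/mounts
    # manual cleanup (umount /mnt/lustre-mds1) would go here
fi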



 Comments   
Comment by James Nunez (Inactive) [ 01/Sep/17 ]

If you look at the posix.test_complete.stack_trace.trevis-10vm1.log for this test suite, you'll see multiple errors:

23:16:40:[10995.167125] nfs: server trevis-10vm4 not responding, timed out
23:16:40:[10995.168097] nfs: server trevis-10vm4 not responding, timed out

So, something is wrong with the NFS server upon entering this test suite.
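
A quick way to confirm that the server side is unresponsive (a manual probe, not part of the posix test suite) would be to query the NFS service on trevis-10vm4 from the client:

# Hypothetical manual probe from trevis-10vm1: check that the NFS server on
# trevis-10vm4 still answers RPC calls and still exports /mnt/lustre.
rpcinfo -t trevis-10vm4 nfs
showmount -e trevis-10vm4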

If you look at the preceding test suite, parallel-scale-nfsv4, in parallel-scale-nfsv4.suite_stdout.trevis-10vm1.log, we see an issue unmounting the Lustre NFS mount:

22:15:35:PASS racer_on_nfs (307s)
22:15:35:== parallel-scale-nfsv4 test complete, duration 2509 sec ============================================= 22:15:26 (1503526526)
22:15:35:
22:15:35:Unmounting NFS clients...
22:15:35:CMD: trevis-10vm1.trevis.hpdd.intel.com,trevis-10vm2 umount -f /mnt/lustre
22:15:35:trevis-10vm1: umount.nfs4: /mnt/lustre: device is busy
22:15:35:
22:15:35:Unexporting Lustre filesystem...
22:15:35:CMD: trevis-10vm1.trevis.hpdd.intel.com,trevis-10vm2 chkconfig --list rpcidmapd 2>/dev/null |
22:15:35:			       grep -q rpcidmapd && service rpcidmapd stop ||
22:15:35:			       true
22:15:35:CMD: trevis-10vm4 chkconfig --list nfsserver > /dev/null 2>&1 &&
22:15:35:				 service nfsserver stop || service nfs stop
22:15:35:trevis-10vm4: Redirecting to /bin/systemctl stop nfs.service
22:15:35:CMD: trevis-10vm4 exportfs -u *:/mnt/lustre
22:15:35:trevis-10vm4: exportfs: Could not find '*:/mnt/lustre' to unexport.
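
Since "umount -f" fails while the mount is busy, the export is never released and the stale state carries over into the posix run. A more forgiving cleanup sequence (a sketch only; the current test-framework.sh does not do this) might fall back to a lazy unmount before unexporting:

# Hypothetical cleanup fallback on the NFS clients (trevis-10vm1, trevis-10vm2):
# force the unmount, and if the mount is still busy, detach it lazily so the
# server can unexport /mnt/lustre afterwards.
umount -f /mnt/lustre || umount -l /mnt/lustre
# On the server (trevis-10vm4), confirm nothing is left exported:
exportfs -v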
Comment by James Nunez (Inactive) [ 25/Apr/19 ]

We’re seeing a similar issue on ARM architectures with Lustre 2.12.1 RC1: https://testing.whamcloud.com/test_sets/4fdace66-66c7-11e9-bd0e-52540065bddc

In the console log for client 2 (vm26), we see

[79692.698703] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 18:16:30 (1556129790)
[79693.576177] Lustre: DEBUG MARKER: MDSCOUNT=4 OSTCOUNT=8 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
[79796.948911] 16[32169]: unhandled level 3 translation fault (11) at 0x00000008, esr 0x92000007, in ld-2.17.so[ffffa3810000+20000]
[79796.964022] CPU: 1 PID: 32169 Comm: 16 Kdump: loaded Tainted: G           OE  ------------   4.14.0-115.2.2.el7a.aarch64 #1
[79796.970552] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[79796.974656] task: ffff8000389dcc00 task.stack: ffff000010320000
[79796.978150] PC is at 0xffffa381ab3c
[79796.980218] LR is at 0xffffa3813ce4
[79796.982295] pc : [<0000ffffa381ab3c>] lr : [<0000ffffa3813ce4>] pstate: 60000000
[79796.986812] sp : 0000ffffd4d2c3c0
[79796.988760] x29: 0000ffffd4d2c3c0 x28: 0000ffffa3840ff8 
[79796.991864] x27: 0000000000000000 x26: 0000000000000000 
[79796.995014] x25: 0000ffffa3840000 x24: 0000ffffa3840a20 
[79796.998168] x23: 0000000000000000 x22: 0000ffffa3841168 
[79797.001282] x21: 0000ffffa383f000 x20: 0000000000000001 
[79797.004501] x19: 0000000000000001 x18: 0000000000000000 
[79797.007655] x17: 0000ffffa382587c x16: 0000ffffa383ff80 
[79797.010772] x15: 0000ffffa38252e4 x14: 0000ffffa3840000 
[79797.013994] x13: 0000000000010000 x12: 0000000400000006 
[79797.017088] x11: 756e694c00000000 x10: 00000078756e694c 
[79797.020238] x9 : 0000000000000000 x8 : 0000ffffa3840000 
[79797.023462] x7 : 000000000000001c x6 : 0000ffffa383fc70 
[79797.026778] x5 : 0000ffffa3842260 x4 : 0000ffffa3840000 
[79797.029896] x3 : 0000000000000000 x2 : 0000ffffa383f000 
[79797.033018] x1 : 0000000000000000 x0 : 0000000000000000 
[79994.698005] NFS: server trevis-54vm11 error: fileid changed

This occurs for racer_on_nfs in both parallel-scale-nfsv3 and parallel-scale-nfsv4. We see a similar message in the client 1 console log.

Comment by James Nunez (Inactive) [ 18/May/20 ]

We will not fix this issue because we’ve replaced the POSIX test suite with pjdfstest.
