[LU-11948] parallel-scale-nfsv4 test connectathon fails "connectathon failed: 1" Created: 08/Feb/19  Updated: 23/Nov/22

Status: In Progress
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Alex Deiter
Resolution: Unresolved Votes: 0
Labels: None
Environment:

DNE/ZFS


Issue Links:
Related
is related to LU-12230 parallel-scale-nfsv3 test_connectatho... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

parallel-scale-nfsv4 test_connectathon fails with "connectathon failed: 1". So far, we’ve seen this issue only once, in a DNE/ZFS test session: https://testing.whamcloud.com/test_sets/a1fe9392-2b96-11e9-90fb-52540065bddc . The client test_log shows a segmentation fault in ./test1:

Congratulations, you passed the basic tests!
... Pass 4 ...

Starting BASIC tests: test directory /mnt/lustre/d0.parallel-scale-nfs/d0.connectathon (arg: -f)

./test1: File and directory creation test
	./test1: (/opt/connectathon/basic) runtests: line 28:  8630 Segmentation fault      ./test1 $TESTARG
basic tests failed
 parallel-scale-nfsv4 test_connectathon: @@@@@@ FAIL: connectathon failed: 1 

The MDS1, 3 (vm11) console log shows the following errors:

[124786.514802] Lustre: DEBUG MARKER: ./runtests -N 10 -l -f /mnt/lustre/d0.parallel-scale-nfs/d0.connectathon
[124859.432032] LustreError: 28274:0:(file.c:3941:ll_file_flock()) unknown fcntl lock command: 1029
[124928.294500] LustreError: 28274:0:(file.c:3941:ll_file_flock()) unknown fcntl lock command: 1029
[125035.913851] LustreError: 28274:0:(file.c:3941:ll_file_flock()) unknown fcntl lock command: 1029
[125065.916330] LustreError: 28274:0:(file.c:3941:ll_file_flock()) unknown fcntl lock command: 1029
[125105.547691] LustreError: 28274:0:(file.c:3941:ll_file_flock()) unknown fcntl lock command: 1029
[125135.550497] LustreError: 28274:0:(file.c:3941:ll_file_flock()) unknown fcntl lock command: 1029
[125175.230096] LustreError: 28274:0:(file.c:3941:ll_file_flock()) unknown fcntl lock command: 1029
[125244.986669] LustreError: 28274:0:(file.c:3941:ll_file_flock()) unknown fcntl lock command: 1029
[125244.987704] LustreError: 28274:0:(file.c:3941:ll_file_flock()) Skipped 1 previous similar message
[125251.814692] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null


 Comments   
Comment by Andreas Dilger [ 08/Feb/19 ]

Looking at the kernel code, it looks like 1029 is the combination of these flags (1024 + 4 + 1):

#define FL_POSIX        1
#define FL_DELEG        4       /* NFSv4 delegation */
#define FL_OFDLCK       1024    /* lock is "owned" by struct file */
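
As a quick check of the arithmetic, the following minimal sketch tests a value against the three flags listed above. The helper name describe_lock_flags is hypothetical, and the constants are copied from the listing rather than included from a kernel header. It illustrates only the bit-flag reading of 1029; it does not establish that the logged number is a flags word rather than an fcntl command number.

```c
#include <stddef.h>
#include <string.h>

/* Values copied from the listing above (not included from a kernel header). */
#define FL_POSIX   1
#define FL_DELEG   4     /* NFSv4 delegation */
#define FL_OFDLCK  1024  /* lock is "owned" by struct file */

/*
 * Hypothetical helper: write the names of the known flags that are set in
 * `value` into `buf`, separated by '|', and return any bits that are not
 * covered by the three flags above.
 */
static unsigned int describe_lock_flags(unsigned int value, char *buf, size_t len)
{
	static const struct {
		unsigned int bit;
		const char *name;
	} known[] = {
		{ FL_POSIX,  "FL_POSIX"  },
		{ FL_DELEG,  "FL_DELEG"  },
		{ FL_OFDLCK, "FL_OFDLCK" },
	};
	size_t i;

	if (len == 0)
		return value;
	buf[0] = '\0';

	for (i = 0; i < sizeof(known) / sizeof(known[0]); i++) {
		if (!(value & known[i].bit))
			continue;
		if (buf[0] != '\0')
			strncat(buf, "|", len - strlen(buf) - 1);
		strncat(buf, known[i].name, len - strlen(buf) - 1);
		value &= ~known[i].bit;	/* clear the bit we just named */
	}
	return value;
}
```

Calling describe_lock_flags(1029, buf, sizeof(buf)) fills buf with FL_POSIX|FL_DELEG|FL_OFDLCK and returns 0, meaning no bits are left unexplained under this reading.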

FL_OFDLCK was "added" in v3.15-rc1-15-gcff2fce58b2b, but that is really just a rename of FL_FILE_PVT, which was added in v3.14-rc1-10-gc918d42a27a9. This is a new type of lock that is attached to the process's struct file rather than to the file descriptor table (which may be shared between tasks). This avoids problems with file locks being accidentally dropped if a file descriptor is cloned due to fork/exec and the file is then closed by the second process.
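
For reference, here is a minimal userspace sketch of how an application might request this kind of lock on Linux, using the F_OFD_SETLK command from fcntl(2). The helper name ofd_write_lock is hypothetical, the fallback value for F_OFD_SETLK is the Linux definition and is only used if the headers do not provide it, and nothing here shows how Lustre handles the request.

```c
#define _GNU_SOURCE	/* exposes F_OFD_SETLK in <fcntl.h> on glibc */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#ifndef F_OFD_SETLK
#define F_OFD_SETLK 37	/* Linux value, fallback for older headers */
#endif

/*
 * Hypothetical helper: try to take a non-blocking, whole-file write lock
 * that is owned by the open file description behind `fd`, not by the
 * calling process. Returns 0 on success, -1 with errno set on failure.
 */
static int ofd_write_lock(int fd)
{
	struct flock fl;

	memset(&fl, 0, sizeof(fl));
	fl.l_type = F_WRLCK;
	fl.l_whence = SEEK_SET;
	fl.l_start = 0;
	fl.l_len = 0;	/* 0 means "to end of file" */
	fl.l_pid = 0;	/* must be 0 for the OFD lock commands */

	return fcntl(fd, F_OFD_SETLK, &fl);
}
```

If the same file is opened twice in one process, a lock taken through the first descriptor makes ofd_write_lock() on the second fail until the first descriptor is closed. Traditional F_SETLK locks would not conflict in that case, because they are owned by the process.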

Comment by Andreas Dilger [ 08/Feb/19 ]

Andriy, it looks like you have been working in this area most recently?

Comment by James Nunez (Inactive) [ 05/Jan/21 ]

We do still see the "unknown fcntl lock command: 1029" messages when running connectathon for parallel-scale-nfsv*, but they do not cause the connectathon test to fail.

One example is at https://testing.whamcloud.com/test_sets/bc5183ad-2cad-4b97-aba4-604b73b9765f where the messages can be seen on the MDS console log. There are several other examples.

Comment by Andreas Dilger [ 05/Jan/21 ]

It would be good to add support for this type of lock, since I expect userspace servers like Samba and Ganesha are using it, or will use it, to avoid complexities in flock lock handling.

However, I have no idea how easy or hard it is to implement. It could be as simple as adding a flag to quiet the error message, or it could require a complete protocol change because the MDS needs different locking semantics. I'm not really sure, but I hope/suspect it is on the easier side.

Generated at Sat Feb 10 02:48:20 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.