[LU-3701] Failure on test suite posix subtest test_1: fcntl.18/fcntl.35 Unresolved
Created: 05/Aug/13  Updated: 22/Nov/13  Resolved: 09/Sep/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.5.0, Lustre 2.4.2

Type: Bug
Priority: Blocker
Reporter: Maloo
Assignee: Bruno Faccini (Inactive)
Resolution: Fixed
Votes: 0
Labels: mn1
Environment:

server and client: lustre-master build: 1952


Issue Links:
Related
is related to LU-2665 LBUG while unmounting client Resolved
Severity: 3
Rank (Obsolete): 9549

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/261745ba-fb5b-11e2-8c6e-52540035b04c.

The sub-test test_1 failed with the following error:

Run POSIX testsuite on /mnt/lustre failed

test log

SUCCESS SUMMARY:

News POSIX successes: 1

Test Name                   Baseline   Lustre Report
read.15                       Failed       Succeeded


FAILURE SUMMARY:

POSIX failures: 2

Test Name                   Baseline   Lustre Report
fcntl.18                   Succeeded      Unresolved
fcntl.35                   Succeeded      Unresolved

FAILURE DESCRIPTIONS:

####################################################
Test Name: fcntl.18 Unresolved

	Test Description:
For the XNFS specification:
    If the implementation supports file locking for files residing on
    a remote file system: On a call to fcntl(fildes, F_SETLKW, arg)
    when the lock specified by arg can not be set, waits until the
    lock can be set.
For the XSH specification:
    On a call to fcntl(fildes, F_SETLKW, arg) when the lock specified
    by arg can not be set, waits until the lock can be set.
    Posix Ref: Component FCNTL Assertion 6.5.2.2-23(A)

	Test Information:
deletion reason: External error - waitsync failed
deletion reason: External error - waitsync failed

####################################################
Test Name: fcntl.35 Unresolved

	Test Description:
For the XNFS specification:
    If the implementation supports file locking for files residing on
    a remote file system: EINTR in errno and -1 returned by fcntl() if
    the operation is interrupted by a signal.
For the XSH specification:
    EINTR in errno and -1 returned by fcntl() if the operation is
    interrupted by a signal.
    Posix Ref: Component FCNTL Assertion 6.5.2.4-40(A)

	Test Information:
child process timed out


 Comments   
Comment by Andreas Dilger [ 08/Aug/13 ]

Is the client being mounted with "-o flock"?
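
One way to answer that on a client is to look at the options column for the mount point in /proc/mounts. The following is a minimal sketch only: it assumes the Lustre client reports the option as a plain "flock" entry there, and the sample line (device name and option list) is hypothetical, not taken from the failing node.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point from /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            return fields[3].split(",")
    return None


def has_flock(mounts_text, mount_point):
    """True only if the exact option "flock" is present ("localflock" does not count)."""
    opts = mount_options(mounts_text, mount_point)
    return opts is not None and "flock" in opts


# Hypothetical sample line; a real check would read /proc/mounts instead.
SAMPLE = "mgs@tcp:/lustre /mnt/lustre lustre rw,flock 0 0\n"
print(has_flock(SAMPLE, "/mnt/lustre"))
```

On a live client the same check would be has_flock(open("/proc/mounts").read(), "/mnt/lustre").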

Comment by Oleg Drokin [ 08/Aug/13 ]

in client dmesg:

Lustre: DEBUG MARKER: Run POSIX test against lustre filesystem
LustreError: 11-0: lustre-MDT0000-mdc-ffff880331591c00: Communicating with 192.168.4.20@o2ib, operation ldlm_enqueue failed with -11.
LustreError: 11-0: lustre-MDT0000-mdc-ffff880331591c00: Communicating with 192.168.4.20@o2ib, operation ldlm_enqueue failed with -11.
LustreError: 11-0: lustre-MDT0000-mdc-ffff880331591c00: Communicating with 192.168.4.20@o2ib, operation ldlm_enqueue failed with -11.
LustreError: 11-0: lustre-MDT0000-mdc-ffff880331591c00: Communicating with 192.168.4.20@o2ib, operation ldlm_enqueue failed with -11.
LustreError: Skipped 8 previous similar messages
LustreError: 11-0: lustre-MDT0000-mdc-ffff880331591c00: Communicating with 192.168.4.20@o2ib, operation ldlm_enqueue failed with -35.
LustreError: Skipped 2 previous similar messages

But nothing like that on MDS
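
For readers decoding the return codes above: assuming they are negated Linux errno values, as is the kernel convention, -11 is EAGAIN and -35 is EDEADLK. A small sketch of the lookup, with the Linux numbers hard-coded because errno numbering differs between operating systems:

```python
# Linux errno numbers, hard-coded so the lookup does not depend on the host OS.
LINUX_ERRNO = {
    4: ("EINTR", "Interrupted system call"),
    11: ("EAGAIN", "Resource temporarily unavailable"),
    35: ("EDEADLK", "Resource deadlock avoided"),
}


def decode_rc(rc):
    """Translate a negative kernel return code into its errno name and text."""
    name, text = LINUX_ERRNO.get(abs(rc), ("UNKNOWN", "not in this table"))
    return f"{rc}: {name} ({text})"


for rc in (-11, -35):
    print(decode_rc(rc))
```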

Comment by Jian Yu [ 12/Aug/13 ]

Hi Oleg,

On the Lustre b2_4 branch, this is a regression introduced by the patch in build http://build.whamcloud.com/job/lustre-b2_4/28/ :

https://maloo.whamcloud.com/test_sets/23814618-02b1-11e3-a4b4-52540035b04c
https://maloo.whamcloud.com/test_sets/4854f044-0283-11e3-a4b4-52540035b04c
https://maloo.whamcloud.com/test_sets/c5113ad6-0286-11e3-b384-52540035b04c
https://maloo.whamcloud.com/test_sets/6dd63b10-0261-11e3-a4b4-52540035b04c
https://maloo.whamcloud.com/test_sets/4493f2a0-0249-11e3-a4b4-52540035b04c

FYI, the posix test passed on Lustre b2_4 build #27.

Comment by Jian Yu [ 12/Aug/13 ]

Hi Oleg,

LU-2665 mdc: Keep resend FLocks (http://review.whamcloud.com/6415) caused the above regression.

On the master branch, the posix test passed on build #1560. Builds #1561 and #1562 were not tested, and the test failed on build #1563. Here are the patches in those builds:

Build #1561: LU-2665 mdc: Keep resend FLocks
Build #1562: LU-3568 contrib: ignore initial comments
Build #1563: LU-3478 iokit: fix sgpdd-survey scripts (output and plotting)

Only "LU-2665 mdc: Keep resend FLocks" is in Lustre b2_4 build #28, so that's the culprit.
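
The elimination above can be written out as a set intersection. This is only an illustration of the reasoning, using the build numbers and patch subjects listed in this comment; the helper name common_suspects is made up for the sketch.

```python
# Patches that landed on master between the last passing build (#1560)
# and the first tested failing build (#1563).
master_candidates = {
    1561: "LU-2665 mdc: Keep resend FLocks",
    1562: "LU-3568 contrib: ignore initial comments",
    1563: "LU-3478 iokit: fix sgpdd-survey scripts (output and plotting)",
}

# Patch in the first failing b2_4 build (#27 passed, #28 failed).
b2_4_candidates = {
    28: "LU-2665 mdc: Keep resend FLocks",
}


def common_suspects(*branches):
    """Return the patch subjects that are candidates on every branch."""
    subject_sets = [set(branch.values()) for branch in branches]
    return sorted(set.intersection(*subject_sets))


print(common_suspects(master_candidates, b2_4_candidates))
```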

Comment by Peter Jones [ 12/Aug/13 ]

Oleg

What do you suggest here?

Peter

Comment by Peter Jones [ 20/Aug/13 ]

Reverted LU-2665 from b2_4 to avoid this problem, but we still need to find a solution for 2.5.

Comment by Sebastien Buisson (Inactive) [ 20/Aug/13 ]

Hi,

If the patch http://review.whamcloud.com/6415 from LU-2665 introduces a regression, we also need to find a solution for 2.1 and 2.4.
As I mentioned in LU-2665, the b2_1 patch will be rolled out at CEA in a couple of weeks!

Thanks,
Sebastien.

Comment by Peter Jones [ 20/Aug/13 ]

Bruno will look into this

Comment by Bruno Faccini (Inactive) [ 21/Aug/13 ]

Judging from the descriptions of the failing POSIX tests and the content of the LU-2665 patch, a regression is plausible for fcntl.35 (EINTR handling), but it is less obvious for fcntl.18 (the patch should not affect the forced wait).
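
For reference, the two behaviors under test can be sketched on a local file system with the Python standard library. This is not the VSX test code and does not exercise Lustre. It assumes that fcntl.lockf() without LOCK_NB issues a blocking F_SETLKW-style request. Because Python retries system calls that fail with EINTR, the signal handler here raises an exception to surface the interruption; in C the same situation would show up as fcntl() returning -1 with errno set to EINTR. The names wait_for_lock and LockInterrupted are made up for this sketch, and it must run in the main thread because it installs a signal handler.

```python
import fcntl
import os
import signal
import tempfile
import time


class LockInterrupted(Exception):
    """Raised by the SIGALRM handler to abandon a blocked lock request."""


def _raise_interrupted(signum, frame):
    raise LockInterrupted()


def wait_for_lock(hold_seconds, alarm_seconds=None):
    """Request a lock held by a child process; return (outcome, seconds waited)."""
    with tempfile.NamedTemporaryFile() as tmp:
        ready_r, ready_w = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: take the lock, tell the parent, hold it, then exit.
            try:
                os.close(ready_r)
                fd = os.open(tmp.name, os.O_RDWR)
                fcntl.lockf(fd, fcntl.LOCK_EX)
                os.write(ready_w, b"x")
                time.sleep(hold_seconds)
            finally:
                os._exit(0)  # process exit releases the record lock
        os.close(ready_w)
        os.read(ready_r, 1)  # wait until the child holds the lock
        os.close(ready_r)
        old_handler = signal.signal(signal.SIGALRM, _raise_interrupted)
        start = time.monotonic()
        try:
            if alarm_seconds is not None:
                signal.setitimer(signal.ITIMER_REAL, alarm_seconds)
            # Blocking request: waits until the lock can be set (fcntl.18 case).
            fcntl.lockf(tmp.fileno(), fcntl.LOCK_EX)
            outcome = "acquired"
            fcntl.lockf(tmp.fileno(), fcntl.LOCK_UN)
        except LockInterrupted:
            # A signal arrived while blocked (fcntl.35 case).
            outcome = "interrupted"
        finally:
            signal.setitimer(signal.ITIMER_REAL, 0)
            signal.signal(signal.SIGALRM, old_handler)
            waited = time.monotonic() - start
            os.kill(pid, signal.SIGKILL)
            os.waitpid(pid, 0)
        return outcome, waited


if __name__ == "__main__":
    print(wait_for_lock(0.5))                     # waits for the child to release
    print(wait_for_lock(5.0, alarm_seconds=0.2))  # interrupted by SIGALRM
```

The first call should return once the child exits and releases the lock; the second should return shortly after the alarm fires, without waiting for the child.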

BTW, where can I find the POSIX test suite? It does not appear to be part of lustre-tests.

Comment by Jian Yu [ 22/Aug/13 ]

BTW, where can I find the POSIX test suite? It does not appear to be part of lustre-tests.

http://build.whamcloud.com/job/toolkit/arch=x86_64,distro=el6/lastSuccessfulBuild/artifact/_topdir/RPMS/x86_64/posix-1.0-wc1.x86_64.rpm

After installing the above package on the test node, we can run lustre/tests/posix.sh to install, build and run the LSB-VSX POSIX test suite on $BASELINE_FS and on Lustre, and then compare the test results.

Comment by Bruno Faccini (Inactive) [ 26/Aug/13 ]

Hello Jian,
Thanks for the link and the hint!
Unfortunately, when I run lustre/tests/posix.sh on a fresh, recent master install, it fails in build-posix.exp with the following messages:

Enter the root password:
Password: 
losetup: /dev/loop0: device is busy
Aborting installation
mv: cannot stat `/usr/src/posix/ext4/tet/test_sets/results/0002e': No such file or directory
child process exited abnormally
    while executing
"system "mv $results_dir/0002e $results_dir/lustre_baseline""
    (file "build-posix.exp" line 161)
failed to build POSIX test suite.
 posix test_1: @@@@@@ FAIL: Setup POSIX test suite failed
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4200:error_noexit()
  = /usr/lib64/lustre/tests/test-framework.sh:4227:error()
  = ./posix.sh:106:test_1()
  = /usr/lib64/lustre/tests/test-framework.sh:4466:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:4499:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:4369:run_test()
  = ./posix.sh:118:main()
Dumping lctl log to /tmp/test_logs/1377521275/posix.test_1.*.1377521325.log
Dumping logs only on local client.

This looks like some odd loop-device configuration issue. I am trying to debug and fix it, but any other help or hints are welcome.

On the other hand, I think that the original patch for LU-2665 could be refined to make both the LU-2665 bug and the POSIX test suite happy. A new change attempt has been pushed to Gerrit at http://review.whamcloud.com/7453.

Comment by Jian Yu [ 27/Aug/13 ]

On the other hand, I think that the original patch for LU-2665 could be refined to make both the LU-2665 bug and the POSIX test suite happy. A new change attempt has been pushed to Gerrit at http://review.whamcloud.com/7453.

Please add the following test parameter to the commit message to see whether the posix test suite passes:

Test-Parameters: testlist=posix

Comment by Bruno Faccini (Inactive) [ 28/Aug/13 ]

I did, but it seems that only the "posix" test ran. Is that the expected behavior? I thought that Test-Parameters would run tests in addition to the default set, unless "fortestonly" is specified ...

On the other hand, the "posix" test was successful, so I now need to check that the LU-2665 problem is still fixed too.

Comment by Bruno Faccini (Inactive) [ 29/Aug/13 ]

In fact, the default auto-tests set did eventually run against the build/patch.

Also, I verified that the new change http://review.whamcloud.com/7453 preserves correct behavior in the LU-2665 scenario.

I will ask for reviews now and, if they are OK, I need to provide at least a b2_1 version and also push it for patch-less Client Kernel integration (in addition to the patch for LU-2665 already pushed!).

Comment by Peter Jones [ 09/Sep/13 ]

Landed to 2.5.

Comment by Bruno Faccini (Inactive) [ 09/Sep/13 ]

The b2_1 version of the patch is at http://review.whamcloud.com/7586. Patch-less Client Kernel integration will occur automatically now that the master patch (http://review.whamcloud.com/7453) has landed.

Comment by Jian Yu [ 22/Nov/13 ]

Patch http://review.whamcloud.com/7453 was cherry-picked to Lustre b2_4 branch.
