
Failure on test suite posix subtest test_1: fcntl.18/fcntl.35 Unresolved

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.5.0, Lustre 2.4.2
    • Lustre 2.5.0
    • server and client: lustre-master build: 1952
    • 3
    • 9549

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/261745ba-fb5b-11e2-8c6e-52540035b04c.

      The sub-test test_1 failed with the following error:

      Run POSIX testsuite on /mnt/lustre failed

      test log

      SUCCESS SUMMARY:
      
      News POSIX successes: 1
      
      Test Name                   Baseline   Lustre Report
      read.15                       Failed       Succeeded
      
      
      FAILURE SUMMARY:
      
      POSIX failures: 2
      
      Test Name                   Baseline   Lustre Report
      fcntl.18                   Succeeded      Unresolved
      fcntl.35                   Succeeded      Unresolved
      
      FAILURE DESCRIPTIONS:
      
      ####################################################
      Test Name: fcntl.18 Unresolved
      
      	Test Description:
      For the XNFS specification:
          If the implementation supports file locking for files residing on
          a remote file system: On a call to fcntl(fildes, F_SETLKW, arg)
          when the lock specified by arg can not be set, waits until the
          lock can be set.
      For the XSH specification:
          On a call to fcntl(fildes, F_SETLKW, arg) when the lock specified
          by arg can not be set, waits until the lock can be set.
          Posix Ref: Component FCNTL Assertion 6.5.2.2-23(A)
      
      	Test Information:
      deletion reason: External error - waitsync failed
      deletion reason: External error - waitsync failed
      
      ####################################################
      Test Name: fcntl.35 Unresolved
      
      	Test Description:
      For the XNFS specification:
          If the implementation supports file locking for files residing on
          a remote file system: EINTR in errno and -1 returned by fcntl() if
          the operation is interrupted by a signal.
      For the XSH specification:
          EINTR in errno and -1 returned by fcntl() if the operation is
          interrupted by a signal.
          Posix Ref: Component FCNTL Assertion 6.5.2.4-40(A)
      
      	Test Information:
      child process timed out
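
The two assertions above both concern a blocking `F_SETLKW` request: it must wait while the lock is held elsewhere (fcntl.18) and must be interruptible by a signal (fcntl.35). The following is a minimal Python sketch of that behaviour, not the LSB-VSX test code. The helper names are hypothetical, and it assumes a POSIX system where `fcntl.lockf` without `LOCK_NB` maps to a blocking `fcntl()` lock request. Because Python may retry interrupted system calls on its own, the signal handler raises an exception to surface the interruption.

```python
import fcntl
import os
import signal
import tempfile


class Interrupted(Exception):
    """Raised by the SIGALRM handler to abort the blocked lock request."""


def _on_alarm(signum, frame):
    raise Interrupted()


def blocked_lock_is_interruptible(path, timeout=0.5):
    """Return True if a child's blocking lock request is interrupted by a signal.

    The parent holds an exclusive lock on `path`; the child asks for the same
    lock without LOCK_NB, so it must wait until the timer signal arrives.
    """
    with open(path, "w") as holder:
        fcntl.lockf(holder, fcntl.LOCK_EX)  # parent owns the lock
        pid = os.fork()
        if pid == 0:  # child: contend for the lock
            status = 1  # stays 1 if the lock was (wrongly) granted
            try:
                signal.signal(signal.SIGALRM, _on_alarm)
                signal.setitimer(signal.ITIMER_REAL, timeout)
                with open(path, "w") as waiter:
                    fcntl.lockf(waiter, fcntl.LOCK_EX)  # blocks here
            except (Interrupted, InterruptedError):
                status = 0  # the wait was interrupted, as fcntl.35 expects
            finally:
                os._exit(status)
        _, wait_status = os.waitpid(pid, 0)
        return os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0


if __name__ == "__main__":
    fd, demo_path = tempfile.mkstemp()
    os.close(fd)
    try:
        print("interruptible:", blocked_lock_is_interruptible(demo_path))
    finally:
        os.unlink(demo_path)
```

Pointing `path` at a file on the Lustre mount instead of a local temporary file would be one way to compare behaviour against the baseline file system, assuming the client supports file locking there.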
      

          Activity

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Jian,
            Thanks for the link and the hint!
            Unfortunately, when I run lustre/tests/posix.sh on a fresh install of recent master, it fails in build-posix.exp with the following messages:

            Enter the root password:
            Password:
            losetup: /dev/loop0: device is busy
            Aborting installation
            mv: cannot stat `/usr/src/posix/ext4/tet/test_sets/results/0002e': No such file or directory
            child process exited abnormally
                while executing
            "system "mv $results_dir/0002e $results_dir/lustre_baseline""
                (file "build-posix.exp" line 161)
            failed to build POSIX test suite.
             posix test_1: @@@@@@ FAIL: Setup POSIX test suite failed
              Trace dump:
              = /usr/lib64/lustre/tests/test-framework.sh:4200:error_noexit()
              = /usr/lib64/lustre/tests/test-framework.sh:4227:error()
              = ./posix.sh:106:test_1()
              = /usr/lib64/lustre/tests/test-framework.sh:4466:run_one()
              = /usr/lib64/lustre/tests/test-framework.sh:4499:run_one_logged()
              = /usr/lib64/lustre/tests/test-framework.sh:4369:run_test()
              = ./posix.sh:118:main()
            Dumping lctl log to /tmp/test_logs/1377521275/posix.test_1.*.1377521325.log
            Dumping logs only on local client.

            This looks like an odd loop-device configuration issue. I am trying to debug and fix it, but any help or hints are welcome.

            On the other hand, I think the original test/patch for LU-2665 could be refined to satisfy both the LU-2665 bug and the POSIX test suite. A new change has been pushed to Gerrit at http://review.whamcloud.com/7453.

            yujian Jian Yu added a comment -

            "BTW, where can I find the POSIX test suite? It does not appear to be part of lustre-tests."

            http://build.whamcloud.com/job/toolkit/arch=x86_64,distro=el6/lastSuccessfulBuild/artifact/_topdir/RPMS/x86_64/posix-1.0-wc1.x86_64.rpm

            After installing the above package on the test node, we can run lustre/tests/posix.sh to install, build and run the LSB-VSX POSIX test suite on $BASELINE_FS and on Lustre, then compare the test results.


            bfaccini Bruno Faccini (Inactive) added a comment -

            Based on the descriptions of the failing POSIX tests and the content of the LU-2665 patch, we can suspect a regression for fcntl.35 (EINTR handling), but it is less obvious for fcntl.18 (the patch should not affect a forced wait).

            BTW, where can I find the POSIX test suite? It does not appear to be part of lustre-tests.

            pjones Peter Jones added a comment -

            Bruno will look into this


            sebastien.buisson Sebastien Buisson (Inactive) added a comment -

            Hi,

            If the patch http://review.whamcloud.com/6415 from LU-2665 introduces a regression, we also need to find a solution for 2.1 and 2.4.
            As I mentioned in LU-2665, the b2_1 patch will be rolled out at CEA in a couple of weeks!

            Thanks,
            Sebastien.

            pjones Peter Jones added a comment -

            Reverted LU-2665 from b2_4 to avoid this problem but we still need to find a solution for 2.5

            pjones Peter Jones added a comment -

            Oleg

            What do you suggest here?

            Peter

            yujian Jian Yu added a comment -

            Hi Oleg,

            LU-2665 mdc: Keep resend FLocks (http://review.whamcloud.com/6415) caused the above regression.

            On master branch, posix test passed on build #1560. However, build #1561 and #1562 were not tested. The test failed on build #1563. Here are the patches in those builds:

            Build #1561: LU-2665 mdc: Keep resend FLocks
            Build #1562: LU-3568 contrib: ignore initial comments
            Build #1563: LU-3478 iokit: fix sgpdd-survey scripts (output and plotting)

            Only "LU-2665 mdc: Keep resend FLocks" is in Lustre b2_4 build #28, so that's the culprit.
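
The elimination argument above can be written out as a small Python sketch. The build numbers and patch titles are taken from this thread (including the b2_4 pass on build #27 and failure on build #28); the function name and data layout are illustrative only.

```python
def suspects(last_good, first_bad, patches_by_build):
    """Return the patches that landed after last_good, up to and including first_bad."""
    return {
        patch
        for build, patch in patches_by_build.items()
        if last_good < build <= first_bad
    }


# master: posix passed on #1560 and failed on #1563 (#1561 and #1562 untested)
master_patches = {
    1561: "LU-2665 mdc: Keep resend FLocks",
    1562: "LU-3568 contrib: ignore initial comments",
    1563: "LU-3478 iokit: fix sgpdd-survey scripts (output and plotting)",
}

# b2_4: posix passed on #27 and failed on #28
b2_4_patches = {
    28: "LU-2665 mdc: Keep resend FLocks",
}

master_suspects = suspects(1560, 1563, master_patches)
b2_4_suspects = suspects(27, 28, b2_4_patches)

# A patch that explains both regressions must be a suspect on both branches.
culprits = master_suspects & b2_4_suspects
print(culprits)
```

The intersection is what narrows three master candidates down to one; the master results alone cannot distinguish between builds #1561 to #1563.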

            yujian Jian Yu added a comment -

            Hi Oleg,

            On Lustre b2_4 branch, this is a regression issue introduced by the patch in build http://build.whamcloud.com/job/lustre-b2_4/28/:

            https://maloo.whamcloud.com/test_sets/23814618-02b1-11e3-a4b4-52540035b04c
            https://maloo.whamcloud.com/test_sets/4854f044-0283-11e3-a4b4-52540035b04c
            https://maloo.whamcloud.com/test_sets/c5113ad6-0286-11e3-b384-52540035b04c
            https://maloo.whamcloud.com/test_sets/6dd63b10-0261-11e3-a4b4-52540035b04c
            https://maloo.whamcloud.com/test_sets/4493f2a0-0249-11e3-a4b4-52540035b04c

            FYI, the posix test passed on Lustre b2_4 build #27.

            green Oleg Drokin added a comment -

            in client dmesg:

            Lustre: DEBUG MARKER: Run POSIX test against lustre filesystem
            LustreError: 11-0: lustre-MDT0000-mdc-ffff880331591c00: Communicating with 192.168.4.20@o2ib, operation ldlm_enqueue failed with -11.
            LustreError: 11-0: lustre-MDT0000-mdc-ffff880331591c00: Communicating with 192.168.4.20@o2ib, operation ldlm_enqueue failed with -11.
            LustreError: 11-0: lustre-MDT0000-mdc-ffff880331591c00: Communicating with 192.168.4.20@o2ib, operation ldlm_enqueue failed with -11.
            LustreError: 11-0: lustre-MDT0000-mdc-ffff880331591c00: Communicating with 192.168.4.20@o2ib, operation ldlm_enqueue failed with -11.
            LustreError: Skipped 8 previous similar messages
            LustreError: 11-0: lustre-MDT0000-mdc-ffff880331591c00: Communicating with 192.168.4.20@o2ib, operation ldlm_enqueue failed with -35.
            LustreError: Skipped 2 previous similar messages
            

            But nothing like that on MDS
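
For readers decoding the numeric return codes in the client log above, here is a small Python sketch that extracts the operation and error number from each line. It assumes standard Linux errno numbering (11 is EAGAIN, 35 is EDEADLK); the thread itself does not interpret these values, and the table and function names are illustrative.

```python
import re

# Assumption: Linux errno numbering on the client. Only the two codes that
# appear in the log are listed; extend as needed.
LINUX_ERRNO = {11: "EAGAIN", 35: "EDEADLK"}

FAILED_OP = re.compile(r"operation (\w+) failed with -(\d+)\.")


def decode(line):
    """Return (operation, errno number, errno name), or None if the line has no error code."""
    match = FAILED_OP.search(line)
    if match is None:
        return None
    code = int(match.group(2))
    return match.group(1), code, LINUX_ERRNO.get(code, "unknown")


log = [
    "LustreError: 11-0: lustre-MDT0000-mdc-ffff880331591c00: Communicating with "
    "192.168.4.20@o2ib, operation ldlm_enqueue failed with -11.",
    "LustreError: Skipped 8 previous similar messages",
    "LustreError: 11-0: lustre-MDT0000-mdc-ffff880331591c00: Communicating with "
    "192.168.4.20@o2ib, operation ldlm_enqueue failed with -35.",
]

for entry in log:
    print(decode(entry))
```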


            adilger Andreas Dilger added a comment -

            Is the client being mounted with "-o flock"?
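
One way to answer the question above is to inspect the mount options on the client. The Python sketch below parses text in `/proc/mounts` format and reports the locking mode for each Lustre mount. It assumes that the `flock` or `localflock` option is listed in the options field when set and that its absence means `noflock`; the sample line is illustrative, not captured from the test node.

```python
def lustre_flock_modes(mounts_text):
    """Map each Lustre mount point to 'flock', 'localflock', or 'noflock'."""
    modes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] != "lustre":
            continue
        options = fields[3].split(",")
        mode = "noflock"  # assumed when neither option is listed
        for candidate in ("flock", "localflock"):
            if candidate in options:
                mode = candidate
        modes[fields[1]] = mode
    return modes


# Illustrative sample; on a real client, read the text from /proc/mounts.
sample = (
    "192.168.4.20@o2ib:/lustre /mnt/lustre lustre rw,flock 0 0\n"
    "/dev/sda1 / ext4 rw,relatime 0 0\n"
)
print(lustre_flock_modes(sample))
```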

            People

              bfaccini Bruno Faccini (Inactive)
              maloo Maloo