Lustre / LU-15816

sanity test_398m: FAIL: parallel dio write with failure on first stripe succeeded

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0, Lustre 2.15.4
    • Affects Version/s: Lustre 2.15.0
    • Labels: None
    • Severity: 3

    Description

      After clean upgrade from Lustre 2.12.8 on el7.9 to Lustre 2.15 on el8.5, sanity test 398m encountered an error:

      parallel dio write with failure on first stripe succeeded

       

      == sanity test 398m: test RPC failures with parallel dio ========================================================== 22:57:43 (1651532263)
      fail_loc=0x20e
      fail_val=1
      dd: error writing '/mnt/lustre/f398m.sanity': Input/output error
      1+0 records in
      0+0 records out
      0 bytes copied, 56.4822 s, 0.0 kB/s
      fail_loc=0
      fail_val=0
      8+0 records in
      8+0 records out
      67108864 bytes (67 MB, 64 MiB) copied, 2.49378 s, 26.9 MB/s
      fail_loc=0x20f
      fail_val=1
      dd: error reading '/mnt/lustre/f398m.sanity': Input/output error
      0+0 records in
      0+0 records out
      0 bytes copied, 56.0012 s, 0.0 kB/s
      fail_loc=0
      fail_val=0
      fail_loc=0x20e
      fail_val=2
      8+0 records in
      8+0 records out
      67108864 bytes (67 MB, 64 MiB) copied, 2.83038 s, 23.7 MB/s
       sanity test_398m: @@@@@@ FAIL: parallel dio write with failure on first stripe succeeded
        Trace dump:
        = /lib64/lustre/tests/test-framework.sh:6406:error()
        = /lib64/lustre/tests/sanity.sh:24681:test_398m()
        = /lib64/lustre/tests/test-framework.sh:6723:run_one()
        = /lib64/lustre/tests/test-framework.sh:6770:run_one_logged()
        = /lib64/lustre/tests/test-framework.sh:6611:run_test()
        = /lib64/lustre/tests/sanity.sh:24697:main()
      Dumping lctl log to /tmp/test_logs/2022-05-02/211139/sanity.test_398m.*.1651532382.log
      fail_loc=0
      fail_loc=0
      

       

          Activity

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52066/
            Subject: LU-15816 tests: use correct ost host to manage failure
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 5db1bc57996b674b3df19a1ae0ee6b20f4668586

            gerrit Gerrit Updater added a comment -

            "xinliang <xinliang.liu@linaro.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52066
            Subject: LU-15816 tests: use correct ost host to manage failure
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: cd866dcf14c18dd650d407fedbdaa18c4bb1d2ac

            pjones Peter Jones added a comment -

            Landed for 2.16

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49248/
            Subject: LU-15816 tests: use correct ost host to manage failure
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 6e66cbdb5c8c08193c36262649667747127b6d90

            gerrit Gerrit Updater added a comment -

            "Neil Brown <neilb@suse.de>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49248
            Subject: LU-15816 tests: use correct ost host to manage failure
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5ea236e3b450fea6aea7eab4c4a7d5333de6436f

            neilb Neil Brown added a comment -

            The error message is misleading - it is the second stripe that fails the test.

            It will always fail if the second OST (called OST1 in the script) is on a different host than the first (OST0).

            This is because, while "do_facet ost1" is correctly used to access OST0, it is ALSO used to access OST1, which is wrong whenever OST1 is served by a different host.

            We need "do_facet ost2" to manage OST1.
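
The facet mix-up described above can be illustrated with a minimal sketch. It does not quote the actual sanity.sh code: do_facet is replaced here by a hypothetical stub that only reports which host would receive the command, the host names oss-a and oss-b are invented, and the reading of fail_val=2 as "target OST1" is an assumption inferred from the log in the description.

```shell
#!/bin/sh
# Hypothetical facet-to-host mapping (illustration only): OST0 is served by
# facet ost1 on host oss-a, OST1 by facet ost2 on host oss-b.
facet_host() {
	case "$1" in
		ost1) echo oss-a ;;
		ost2) echo oss-b ;;
	esac
}

# Stub standing in for the test framework's do_facet: rather than running
# the command remotely, it prints the host that would have received it.
do_facet() {
	facet=$1
	shift
	echo "$(facet_host "$facet"): $*"
}

# Buggy form: the failure meant for OST1 is still sent through facet ost1,
# so it lands on the host that serves OST0.
buggy=$(do_facet ost1 lctl set_param fail_loc=0x20e fail_val=2)

# Fixed form: the command for OST1 goes through facet ost2.
fixed=$(do_facet ost2 lctl set_param fail_loc=0x20e fail_val=2)

echo "buggy -> $buggy"
echo "fixed -> $fixed"
```

When both OSTs share one host the two forms reach the same server, which would explain why the test only fails on configurations with OSTs on separate hosts.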

             

            People

              neilb Neil Brown
              anikitenko Alena Nikitenko (Inactive)
              Votes: 0
              Watchers: 4
