LU-8062

recovery-small test_115b: @@@@@@ FAIL: dd success

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.10.1, Lustre 2.11.0
    • Affects Version/s: Lustre 2.8.0, Lustre 2.9.0, Lustre 2.10.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      == recovery-small test 115b: write: late REQ MDunlink and no bulk == 21:12:09 (1461384729)
      Filesystem 1K-blocks Used Available Use% Mounted on
      onyx-38vm7@tcp:/lustre
      74157152 309236 69890576 1% /mnt/lustre
      fail_loc=0x8000051b
      fail_val=4
      Filesystem 1K-blocks Used Available Use% Mounted on
      onyx-38vm7@tcp:/lustre
      74157152 309236 69890576 1% /mnt/lustre
      CMD: onyx-38vm8 lctl set_param fail_val=0 fail_loc=0x80000215
      fail_val=0
      fail_loc=0x80000215
      1+0 records in
      1+0 records out
      4096 bytes (4.1 kB) copied, 2.13538 s, 1.9 kB/s
      recovery-small test_115b: @@@@@@ FAIL: dd success
      Trace dump:
      = /usr/lib64/lustre/tests/test-framework.sh:4764:error()
      = /usr/lib64/lustre/tests/recovery-small.sh:2161:test_115_write()
      = /usr/lib64/lustre/tests/recovery-small.sh:2181:test_115b()
      = /usr/lib64/lustre/tests/test-framework.sh:5028:run_one()
      = /usr/lib64/lustre/tests/test-framework.sh:5067:run_one_logged()
      = /usr/lib64/lustre/tests/test-framework.sh:4914:run_test()
      = /usr/lib64/lustre/tests/recovery-small.sh:2183:main()
      Dumping lctl log to /logdir/test_logs/2016-04-22/lustre-reviews-el6_7-x86_64-review-dne-part-1-1_6_1_38438_-70130481106820-100010/recovery-small.test_115b.*.1461384732.log
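
      For readers unfamiliar with the fail_loc lines above: Lustre's fault injection is driven with "lctl set_param fail_loc=<hex> fail_val=<n>", where the low bits of fail_loc select a fault site compiled into the code, the 0x80000000 bit makes the fault fire only once, and fail_val parameterizes it (here presumably a delay in seconds for the "late REQ MDunlink" site). The standalone C program below is illustrative only, not Lustre source; the macro names are invented for the example, and only the one-shot bit and the two values are taken from the log.

      /* decode_fail_loc.c - toy decoder for the fail_loc values in this log.
       * NOT Lustre source; FAIL_ONCE/FAIL_SITE_MASK are illustrative names. */
      #include <stdio.h>
      #include <stdint.h>

      #define FAIL_ONCE       0x80000000u   /* one-shot behaviour flag       */
      #define FAIL_SITE_MASK  0x0000ffffu   /* low bits: fault site selector */

      static void decode(uint32_t loc, uint32_t val)
      {
              printf("fail_loc=0x%08x -> site 0x%04x%s, fail_val=%u\n",
                     loc, loc & FAIL_SITE_MASK,
                     (loc & FAIL_ONCE) ? " (one-shot)" : "", val);
      }

      int main(void)
      {
              decode(0x8000051b, 4);   /* client: "late REQ MDunlink" site      */
              decode(0x80000215, 0);   /* set on onyx-38vm8 per the CMD: line   */
              return 0;
      }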

Attachments

Issue Links

Activity

            niu Niu Yawei (Inactive) added a comment - Hit on master: https://testing.hpdd.intel.com/test_sets/3c6bacec-a680-11e6-8859-5254006e85c2
            yong.fan nasf (Inactive) added a comment - +1 on master: https://testing.hpdd.intel.com/test_sets/0409e532-a1d3-11e6-9ab0-5254006e85c2
            yujian Jian Yu added a comment - One more failure instance on master branch: https://testing.hpdd.intel.com/test_sets/11d1b22e-5b2b-11e6-b2e2-5254006e85c2
            niu Niu Yawei (Inactive) added a comment - https://testing.hpdd.intel.com/test_sets/5766e584-41c7-11e6-bbf5-5254006e85c2 The failure reoccurred in master review.

            vitaly_fertman Vitaly Fertman added a comment - in reply to Andreas Dilger's comment ("Closing this bug, since the problematic patch was reverted, and LU-7434 is still open to track the re-landing of the fixed patch."):

            Actually, the fix was submitted above. Please re-land the original patch with the fix above.

            parinay parinay v kondekar (Inactive) added a comment - s/ LU-7343 / LU-7434 / (correcting the typo)
            adilger Andreas Dilger added a comment - edited

            Closing this bug, since the problematic patch was reverted, and LU-7434 is still open to track the re-landing of the fixed patch.


            rhenwood Richard Henwood (Inactive) added a comment - Another occurrence of this failure, on Master, on review-dne-part-1 https://testing.hpdd.intel.com/sub_tests/637e1824-0a36-11e6-855a-5254006e85c2

            gerrit Gerrit Updater added a comment - Bhagyesh Dudhediya (bhagyesh.dudhediya@seagate.com) uploaded a new patch: http://review.whamcloud.com/19758
            Subject: LU-8062 test: fix fail_val in recovery-small/115b
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f5eb98a57ed6cfd36321cb2e7e6e2ac2eb309528


            adilger Andreas Dilger added a comment - Vitaly, could you please take a look at this? Is there a simple fix, or should the patch be reverted to give you more time to look into it?

            adilger Andreas Dilger added a comment - It appears the root cause of these failures is the following patch, which added test_115b and landed on 2016-04-21:

            commit 55f8520817a31dabf19fe0a8ac2492b85d039c38
            Author:     Vitaly Fertman <vitaly.fertman@seagate.com>
            CommitDate: Thu Apr 21 02:27:54 2016 +0000
            
                LU-7434 ptlrpc: lost bulk leads to a hang
                
                The reverse order of request_out_callback() and reply_in_callback()
                puts the RPC into UNREGISTERING state, which is waiting for RPC &
                bulk md unlink, whereas only RPC md unlink has been called so far.
                If bulk is lost, even expired_set does not check for UNREGISTERING
                state.
                
                The same for write if server returns an error.
                
                This phase is ambiguous, split to UNREG_RPC and UNREG_BULK.
                
                Signed-off-by: Vitaly Fertman <vitaly.fertman@seagate.com>
                Reviewed-on: http://review.whamcloud.com/17221
            
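            To make the phase split described in the commit message concrete: with a single UNREGISTERING phase the timeout logic cannot tell whether it is still waiting on the request MD, the bulk MD, or both, so a lost bulk can wait forever; giving each wait its own phase makes the stuck state observable. The small self-contained C model below is illustrative only, not the Lustre implementation, and all names in it are invented.

            /* toy_phase_split.c - model of splitting one ambiguous "unregistering"
             * phase into UNREG_RPC and UNREG_BULK.  NOT Lustre source code. */
            #include <stdbool.h>
            #include <stdio.h>

            enum phase { PHASE_RPC, PHASE_UNREG_RPC, PHASE_UNREG_BULK, PHASE_DONE };

            struct toy_req {
                    bool req_md_unlinked;    /* request MD unlinked */
                    bool bulk_md_unlinked;   /* bulk MD unlinked    */
                    enum phase phase;
            };

            /* Advance the phase as unlink events arrive; each outstanding MD now
             * has its own phase, so an expiry check can see exactly what is stuck. */
            static void advance(struct toy_req *rq)
            {
                    if (rq->phase == PHASE_RPC)
                            rq->phase = PHASE_UNREG_RPC;
                    if (rq->phase == PHASE_UNREG_RPC && rq->req_md_unlinked)
                            rq->phase = PHASE_UNREG_BULK;
                    if (rq->phase == PHASE_UNREG_BULK && rq->bulk_md_unlinked)
                            rq->phase = PHASE_DONE;
            }

            int main(void)
            {
                    /* Lost-bulk case: the request MD unlinks, the bulk MD never does. */
                    struct toy_req rq = { .req_md_unlinked = true,
                                          .bulk_md_unlinked = false,
                                          .phase = PHASE_RPC };
                    advance(&rq);
                    printf("stuck in phase %d (UNREG_BULK=%d)\n", rq.phase, PHASE_UNREG_BULK);
                    return 0;
            }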

People

  wc-triage WC Triage
  529964 Bhagyesh Dudhediya (Inactive)
  Votes: 0
  Watchers: 16

Dates

  Created:
  Updated:
  Resolved: