[LU-8062] recovery-small test_115b: @@@@@@ FAIL: dd success

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.10.1, Lustre 2.11.0
    • Affects Version/s: Lustre 2.8.0, Lustre 2.9.0, Lustre 2.10.0
    • Labels: None
    • Severity: 3

    Description

      == recovery-small test 115b: write: late REQ MDunlink and no bulk == 21:12:09 (1461384729)
      Filesystem 1K-blocks Used Available Use% Mounted on
      onyx-38vm7@tcp:/lustre
      74157152 309236 69890576 1% /mnt/lustre
      fail_loc=0x8000051b
      fail_val=4
      Filesystem 1K-blocks Used Available Use% Mounted on
      onyx-38vm7@tcp:/lustre
      74157152 309236 69890576 1% /mnt/lustre
      CMD: onyx-38vm8 lctl set_param fail_val=0 fail_loc=0x80000215
      fail_val=0
      fail_loc=0x80000215
      1+0 records in
      1+0 records out
      4096 bytes (4.1 kB) copied, 2.13538 s, 1.9 kB/s
      recovery-small test_115b: @@@@@@ FAIL: dd success
      Trace dump:
      = /usr/lib64/lustre/tests/test-framework.sh:4764:error()
      = /usr/lib64/lustre/tests/recovery-small.sh:2161:test_115_write()
      = /usr/lib64/lustre/tests/recovery-small.sh:2181:test_115b()
      = /usr/lib64/lustre/tests/test-framework.sh:5028:run_one()
      = /usr/lib64/lustre/tests/test-framework.sh:5067:run_one_logged()
      = /usr/lib64/lustre/tests/test-framework.sh:4914:run_test()
      = /usr/lib64/lustre/tests/recovery-small.sh:2183:main()
      Dumping lctl log to /logdir/test_logs/2016-04-22/lustre-reviews-el6_7-x86_64-review-dne-part-1-1_6_1_38438_-70130481106820-100010/recovery-small.test_115b.*.1461384732.log
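
      For reference, the flow visible in this log is roughly as follows. This is only a sketch of the reproduction, not the actual test_115b body: it assumes the standard test-framework.sh helpers (do_facet, error), that the OST facet is ost1, and the file name and dd options are placeholders; the meaning of the two fail_loc values is discussed in the comments below.

        # Client: per the analysis below, 0x8000051b makes the write RPC wait on
        # REQ/bulk MD unlink; fail_val=4 adds a delay before the request is sent.
        lctl set_param fail_val=4 fail_loc=0x8000051b

        # OST: 0x80000215 makes the next RPC the server handles fail with an error.
        do_facet ost1 "lctl set_param fail_val=0 fail_loc=0x80000215"

        # The 4 KB write is expected to fail with the injected error; the test
        # reports "dd success" when it unexpectedly completes.
        dd if=/dev/zero of=/mnt/lustre/f115b bs=4096 count=1 oflag=sync &&
            error "dd success"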

    Activity

            niu Niu Yawei (Inactive) added a comment -
            The failure reoccurred in a master review run:
            https://testing.hpdd.intel.com/test_sets/5766e584-41c7-11e6-bbf5-5254006e85c2

            vitaly_fertman Vitaly Fertman added a comment -

            Andreas Dilger added a comment - edited: "Closing this bug, since the problematic patch was reverted, and LU-7434 is still open to track the re-landing of the fixed patch."

            Actually, the fix was submitted above. Please re-land the original patch with the fix above.


            parinay parinay v kondekar (Inactive) added a comment -
            s/LU-7343/LU-7434/
            (correcting the typo)

            adilger Andreas Dilger added a comment - edited

            Closing this bug, since the problematic patch was reverted, and LU-7434 is still open to track the re-landing of the fixed patch.


            rhenwood Richard Henwood (Inactive) added a comment -
            Another occurrence of this failure on master, in review-dne-part-1:

            https://testing.hpdd.intel.com/sub_tests/637e1824-0a36-11e6-855a-5254006e85c2


            gerrit Gerrit Updater added a comment -
            Bhagyesh Dudhediya (bhagyesh.dudhediya@seagate.com) uploaded a new patch: http://review.whamcloud.com/19758
            Subject: LU-8062 test: fix fail_val in recovery-small/115b
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f5eb98a57ed6cfd36321cb2e7e6e2ac2eb309528
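
            The proposed test fix can be fetched for local testing directly from Gerrit. A minimal sketch, assuming the usual Gerrit refspec layout (refs/changes/<last two digits of change>/<change>/<patch set>) for patch set 1 of change 19758:

                # Fetch patch set 1 of change 19758 from fs/lustre-release and check it out
                # (refspec follows the standard Gerrit convention; adjust if it differs).
                git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/58/19758/1
                git checkout FETCH_HEAD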


            adilger Andreas Dilger added a comment -
            Vitaly, could you please take a look at this? Is there a simple fix, or should the patch be reverted to give you more time to look into it?


            adilger Andreas Dilger added a comment -
            It appears the root cause of these failures is the following patch, which added test_115b and landed on 2016-04-21:

            commit 55f8520817a31dabf19fe0a8ac2492b85d039c38
            Author:     Vitaly Fertman <vitaly.fertman@seagate.com>
            CommitDate: Thu Apr 21 02:27:54 2016 +0000
            
                LU-7434 ptlrpc: lost bulk leads to a hang
                
                The reverse order of request_out_callback() and reply_in_callback()
                puts the RPC into UNREGISTERING state, which is waiting for RPC &
                bulk md unlink, whereas only RPC md unlink has been called so far.
                If bulk is lost, even expired_set does not check for UNREGISTERING
                state.
                
                The same for write if server returns an error.
                
                This phase is ambiguous, split to UNREG_RPC and UNREG_BULK.
                
                Signed-off-by: Vitaly Fertman <vitaly.fertman@seagate.com>
                Reviewed-on: http://review.whamcloud.com/17221
            

            adilger Andreas Dilger added a comment -
            This appears to be the top failing test on master in the past week, so increasing the priority.

            I checked a couple of the failing patches, and there doesn't seem to be any common source of the failure (i.e. it isn't caused by a bad patch that is repeatedly being retested) and some of the failures are on master. The few that I looked at more closely are based on commit 0f62ba942939c26edf07176d2eb082d38e95caec:

            CommitDate: Thu Apr 21 02:28:41 2016 +0000
            
                LU-8024 kernel: kernel update [SLES12 SP1 3.12.57-60.35]
                
                Update target and kernel_config files for new version
                
                Test-Parameters: clientdistro=sles12 testgroup=review-ldiskfs \
                  mdsdistro=sles12 ossdistro=sles12 mdsfilesystemtype=ldiskfs \
                  mdtfilesystemtype=ldiskfs ostfilesystemtype=ldiskfs
                
                Signed-off-by: Bob Glossman <bob.glossman@intel.com>
                Change-Id: Ic9f33c4877f93249b37726d8ad60c98ee624f719
                Reviewed-on: http://review.whamcloud.com/19593
            

            This isn't to imply that this kernel update is the root cause of the problem (many of the test failures are on RHEL 6.7, not SLES at all), but rather that this is the base commit on which the failing patches were tested, so the problematic commit landed somewhere before this point.


            adilger Andreas Dilger added a comment -
            It looks like this failure has only been happening since 2016-04-21:
            https://testing.hpdd.intel.com/sub_tests/b09ba16c-07d9-11e6-9b34-5254006e85c2

            so it is likely related to some patch that landed shortly before that time.


            529964 Bhagyesh Dudhediya (Inactive) added a comment - edited
            In recovery-small/test_115b:
            An I/O RPC is injected with a fail_loc value so that it waits for a while on some completion (req/reply LNet buffer unlink, or even bulk unlink). A delay is also injected before the request is sent.
            The delay leaves time to inject a second fail_loc, this time on the server side, which makes the server return an error in this particular case. Whereas the first fail_loc is injected precisely (there are no other I/O RPCs at that moment), the problem with the test is that the second fail_loc is too generic: whatever RPC the server handles next gets the error. That means if some other RPC arrives just before our I/O RPC, that RPC gets the error and our I/O does not, so dd succeeds and the test fails.
            On the OST side, the following log lines show that the fail_loc which should be caught in the I/O path is actually caught by the statfs thread:

            00002000:00000020:1.0:1461384731.470636:0:7993:0:(ofd_obd.c:842:ofd_statfs()) 2317411 blocks: 2307535 free, 2183803 avail; 618944 objects: 618375 free; state 0
            00002000:02000000:1.0:1461384731.470641:0:7993:0:(libcfs_fail.h:96:cfs_fail_check_set()) *** cfs_fail_loc=215, val=0***
            

            i.e., a statfs is in progress in thread 7993, and that thread catches the 0x215 fail_loc instead of the I/O request.
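
            The race can be confirmed directly from the OST debug log. A minimal sketch, assuming a standard Lustre test environment where ost1 is the OST facet and lctl dk dumps the kernel debug buffer (the grep pattern just matches the lines quoted above):

                # Dump the OST debug log and see which handler consumed fail_loc 0x215;
                # if it appears next to ofd_statfs() rather than the write path, the
                # injected error was eaten by a statfs RPC and dd will succeed.
                do_facet ost1 "lctl dk" | grep -E "cfs_fail_loc=215|ofd_statfs"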


            People

              Assignee: wc-triage WC Triage
              Reporter: 529964 Bhagyesh Dudhediya (Inactive)
              Votes: 0
              Watchers: 16
