Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-6573

multiple tests: client evicted, Input/output error

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.8.0
    • Lustre 2.8.0
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for wangdi <di.wang@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/f9a7140c-f404-11e4-9769-5254006e85c2.

      The sub-test test_10b failed with the following error:

      dd /mnt/lustre2
      
      10:05:09:Lustre: DEBUG MARKER: == sanityn test 10b: write of file with sub-page size on multiple mounts == 10:04:38 (1430906678)
      10:05:09:LustreError: 11-0: lustre-OST0002-osc-ffff88007b4d1400: operation ldlm_enqueue to node 10.1.5.213@tcp failed: rc = -107
      10:05:09:Lustre: lustre-OST0002-osc-ffff88007b4d1400: Connection to lustre-OST0002 (at 10.1.5.213@tcp) was lost; in progress operations using this service will wait for recovery to complete
      10:05:09:LustreError: 167-0: lustre-OST0002-osc-ffff88007b4d1400: This client was evicted by lustre-OST0002; in progress operations using this service will fail.
      10:05:09:Lustre: lustre-OST0002-osc-ffff88007b4d1400: Connection restored to lustre-OST0002 (at 10.1.5.213@tcp)
      10:05:09:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanityn test_10b: @@@@@@ FAIL: dd \/mnt\/lustre2 
      10:05:09:Lustre: DEBUG MARKER: sanityn test_10b: @@@@@@ FAIL: dd /mnt/lustre2
      10:05:09:Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /logdir/test_logs/2015-05-06/lustre-reviews-el6_6-x86_64--review-dne-part-1--2_8_1__32016__-69851820759220-072240/sanityn.test_10b.debug_log.$(hostname -s).1430906678.log;
      10:05:09:         dmesg > /logdir/test_logs/2015-05-06/lustre-reviews-el
      10:05:09:Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanityn test 11: execution of file opened for write should return error ====== 10:04:40 \(1430906680\)
      
      

      Attachments

        Issue Links

          Activity

            [LU-6573] multiple tests: client evicted, Input/output error

            This was introduced by patch http://review.whamcloud.com/14399 "LU-6441 ptlrpc: ptlrpc_bulk_abort unlink all entries in bd_mds".

            adilger Andreas Dilger added a comment - This was introduced by patch http://review.whamcloud.com/14399 " LU-6441 ptlrpc: ptlrpc_bulk_abort unlink all entries in bd_mds".
            pjones Peter Jones added a comment -

            Landed for 2.8

            pjones Peter Jones added a comment - Landed for 2.8

            Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/14844/
            Subject: LU-6573 test: fix sanity test_224c
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 551dc77e6ecd590d35fe7759124b98961642e831

            gerrit Gerrit Updater added a comment - Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/14844/ Subject: LU-6573 test: fix sanity test_224c Project: fs/lustre-release Branch: master Current Patch Set: Commit: 551dc77e6ecd590d35fe7759124b98961642e831

            Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/14844
            Subject: LU-6573 test: fix sanity test_224c
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 942195be07082d82fca7dee225e9ddb0fe311290

            gerrit Gerrit Updater added a comment - Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/14844 Subject: LU-6573 test: fix sanity test_224c Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 942195be07082d82fca7dee225e9ddb0fe311290
            niu Niu Yawei (Inactive) added a comment - - edited

            I saw in some failures that ping interval is 1 second (the default obd_timeout for autotest is 20 seconds, and the default ping interval should be 5 seconds), I think that sanity test_224c could be the culprit: it changed obd_timeout to 5 seconds during testing, it did change back to default value, but it didn't wait to it take effect. I fixed this in: http://review.whamcloud.com/14844

            In most zfs test failures, the ping interval is the default 5 seconds, I'm wondering if we'd use longer obd_timeout/ping_interval for zfs testing?

            niu Niu Yawei (Inactive) added a comment - - edited I saw in some failures that ping interval is 1 second (the default obd_timeout for autotest is 20 seconds, and the default ping interval should be 5 seconds), I think that sanity test_224c could be the culprit: it changed obd_timeout to 5 seconds during testing, it did change back to default value, but it didn't wait to it take effect. I fixed this in: http://review.whamcloud.com/14844 In most zfs test failures, the ping interval is the default 5 seconds, I'm wondering if we'd use longer obd_timeout/ping_interval for zfs testing?
            bogl Bob Glossman (Inactive) added a comment - I think this is another instance, seen in master: https://testing.hpdd.intel.com/test_sets/5265d50a-fa23-11e4-a7e2-5254006e85c2

            I doubt that client doesn't send PINGs as expected, which caused server evict client at the end. The debug patch http://review.whamcloud.com/#/c/14777/ provides more verbose messages on client handling PINGs, hope we can reproduce the problem with the patch applied.

            niu Niu Yawei (Inactive) added a comment - I doubt that client doesn't send PINGs as expected, which caused server evict client at the end. The debug patch http://review.whamcloud.com/#/c/14777/ provides more verbose messages on client handling PINGs, hope we can reproduce the problem with the patch applied.

            Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/14777
            Subject: LU-6573 ptlrpc: debug patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: eb866c9274821fb6a818d6ec42c2b486b324d0cc

            gerrit Gerrit Updater added a comment - Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/14777 Subject: LU-6573 ptlrpc: debug patch Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: eb866c9274821fb6a818d6ec42c2b486b324d0cc

            Oops, 8beae9f3-b969-37e0-7b9b-eb08f4d2c9ef and 8beae9f3-b969-37e0-7b9b-eb08f4d2c9ef are different mounts from same NID.

            niu Niu Yawei (Inactive) added a comment - Oops, 8beae9f3-b969-37e0-7b9b-eb08f4d2c9ef and 8beae9f3-b969-37e0-7b9b-eb08f4d2c9ef are different mounts from same NID.

            People

              niu Niu Yawei (Inactive)
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: