
multiple tests: client evicted, Input/output error

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.8.0
    • Labels: None
    • Severity: 3
    • 9223372036854775807

    Description

      This issue was created by maloo for wangdi <di.wang@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/f9a7140c-f404-11e4-9769-5254006e85c2.

      The sub-test test_10b failed with the following error:

      dd /mnt/lustre2
      
      10:05:09:Lustre: DEBUG MARKER: == sanityn test 10b: write of file with sub-page size on multiple mounts == 10:04:38 (1430906678)
      10:05:09:LustreError: 11-0: lustre-OST0002-osc-ffff88007b4d1400: operation ldlm_enqueue to node 10.1.5.213@tcp failed: rc = -107
      10:05:09:Lustre: lustre-OST0002-osc-ffff88007b4d1400: Connection to lustre-OST0002 (at 10.1.5.213@tcp) was lost; in progress operations using this service will wait for recovery to complete
      10:05:09:LustreError: 167-0: lustre-OST0002-osc-ffff88007b4d1400: This client was evicted by lustre-OST0002; in progress operations using this service will fail.
      10:05:09:Lustre: lustre-OST0002-osc-ffff88007b4d1400: Connection restored to lustre-OST0002 (at 10.1.5.213@tcp)
      10:05:09:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanityn test_10b: @@@@@@ FAIL: dd \/mnt\/lustre2 
      10:05:09:Lustre: DEBUG MARKER: sanityn test_10b: @@@@@@ FAIL: dd /mnt/lustre2
      10:05:09:Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /logdir/test_logs/2015-05-06/lustre-reviews-el6_6-x86_64--review-dne-part-1--2_8_1__32016__-69851820759220-072240/sanityn.test_10b.debug_log.$(hostname -s).1430906678.log;
      10:05:09:         dmesg > /logdir/test_logs/2015-05-06/lustre-reviews-el
      10:05:09:Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanityn test 11: execution of file opened for write should return error ====== 10:04:40 \(1430906680\)
      
      


          Activity


            adilger Andreas Dilger added a comment -

            This was introduced by patch http://review.whamcloud.com/14399 "LU-6441 ptlrpc: ptlrpc_bulk_abort unlink all entries in bd_mds".

            pjones Peter Jones added a comment -

            Landed for 2.8


            gerrit Gerrit Updater added a comment -

            Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/14844/
            Subject: LU-6573 test: fix sanity test_224c
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 551dc77e6ecd590d35fe7759124b98961642e831

            gerrit Gerrit Updater added a comment -

            Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/14844
            Subject: LU-6573 test: fix sanity test_224c
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 942195be07082d82fca7dee225e9ddb0fe311290

            niu Niu Yawei (Inactive) added a comment - edited

            I saw in some failures that the ping interval is 1 second, although the default obd_timeout for autotest is 20 seconds, so the default ping interval should be 5 seconds. I think sanity test_224c could be the culprit: it changed obd_timeout to 5 seconds during testing, and while it did change it back to the default value afterwards, it didn't wait for the change to take effect. I fixed this in http://review.whamcloud.com/14844 (a sketch of the idea follows below).

            In most ZFS test failures the ping interval is the default 5 seconds, so I'm wondering if we should use a longer obd_timeout/ping_interval for ZFS testing?
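
            Those numbers fit the pinger deriving its interval from obd_timeout, roughly obd_timeout/4: a 20-second timeout gives the expected 5-second interval, and a 5-second timeout gives the observed 1-second one. A minimal sketch of the shape of such a fix, assuming the test tunes obd_timeout through lctl's "timeout" parameter; the exact wait is an assumption, not the actual patch:

            # Save obd_timeout, shrink it for the test, then restore it and give
            # the pinger time to notice the restored value before the next test.
            SAVED_TIMEOUT=$(lctl get_param -n timeout)
            lctl set_param timeout=5
            # ... test body runs with a 5s obd_timeout (~1s ping interval) ...
            lctl set_param timeout=$SAVED_TIMEOUT
            # Wait out at least one short ping interval plus slack so no
            # 1-second-interval pings leak into the following test.
            sleep 2
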
            bogl Bob Glossman (Inactive) added a comment -

            I think this is another instance, seen on master: https://testing.hpdd.intel.com/test_sets/5265d50a-fa23-11e4-a7e2-5254006e85c2

            niu Niu Yawei (Inactive) added a comment -

            I suspect the client doesn't send PINGs as expected, which causes the server to evict the client in the end. The debug patch http://review.whamcloud.com/#/c/14777/ provides more verbose messages on client PING handling; hopefully we can reproduce the problem with the patch applied.
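
            For reference, a sketch of how one might capture the client-side PING evidence with standard lctl debug controls (the output path and grep patterns are assumptions):

            # Enable RPC tracing on the client, clear the debug buffer,
            # reproduce the eviction, then dump the log and look for the
            # client's outgoing RPCs and pinger activity.
            lctl set_param debug=+rpctrace
            lctl clear
            # ... reproduce the failure ...
            lctl dk > /tmp/client-ping-trace.log
            grep -E 'Sending RPC|pinger\.c' /tmp/client-ping-trace.log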

            gerrit Gerrit Updater added a comment -

            Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/14777
            Subject: LU-6573 ptlrpc: debug patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: eb866c9274821fb6a818d6ec42c2b486b324d0cc

            niu Niu Yawei (Inactive) added a comment -

            Oops, e6747c7f-1ec5-b75c-2bcb-cd4c06ffb96a and 8beae9f3-b969-37e0-7b9b-eb08f4d2c9ef are different mounts from the same NID.

            niu Niu Yawei (Inactive) added a comment -

            Looks like something went wrong in the eviction mechanism.

            Take this failure https://testing.hpdd.intel.com/test_sets/50e05d66-f589-11e4-91fd-5254006e85c2 as an example:

            00000100:00100000:0.0:1431048749.096032:0:24985:0:(service.c:2075:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_002:e6747c7f-1ec5-b75c-2bcb-cd4c06ffb96a+8:9457:x1500561708378636:12345-10.1.4.112@tcp:400

            The MDT received a PING from client 10.1.4.112 at time 1431048749, and the PING was processed.

            00000100:00080000:0.0:1431048756.180583:0:25594:0:(service.c:1098:ptlrpc_update_export_timer()) lustre-MDT0000: Think about evicting 10.1.4.112@tcp from 1431048725

            However, at time 1431048756 the MDT thinks the last request from 10.1.4.112 arrived at time 1431048725, which is much older than 1431048749.

            00000100:00080000:1.0:1431048773.266900:0:24981:0:(pinger.c:637:ping_evictor_main()) evicting all exports of obd lustre-MDT0000 older than 1431048743
            00000100:02000400:1.0:1431048773.266911:0:24981:0:(pinger.c:663:ping_evictor_main()) lustre-MDT0000: haven't heard from client 8beae9f3-b969-37e0-7b9b-eb08f4d2c9ef (at 10.1.4.112@tcp) in 48 seconds. I think it's dead, and I am evicting it. exp ffff880072a0b800, cur 1431048773 expire 1431048743 last 1431048725

            Finally, at time 1431048773, the client 10.1.4.112 is evicted.

            It seems the last request time for 10.1.4.112 was somehow not updated properly.
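
            The timestamps are internally consistent: 1431048773 - 1431048725 = 48 matches the "in 48 seconds" message, and the expire cutoff 1431048743 is cur minus 30 seconds, i.e. six of the default 5-second ping intervals. A sketch of how one might pull these three kinds of lines out of a server-side debug log for comparison (the log path is a placeholder):

            # Dump the kernel debug buffer on the MDS, then extract the PING
            # handling, export-timer updates, and evictor decisions (the three
            # functions quoted in the excerpts above).
            lctl dk > /tmp/mds-debug.log
            grep -E 'ptlrpc_server_handle_request|ptlrpc_update_export_timer|ping_evictor_main' \
                /tmp/mds-debug.log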

            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 9
