Failover failure on test suite replay-single test_58c: test_58c failed with 2

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.5.3, Lustre 2.8.0
    • Environment: lustre-b2_6-rc2; client is SLES11 SP3
    • 3
    • 15044

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/79386184-0f6e-11e4-aee3-5254006e85c2.

      The sub-test test_58c failed with the following error:

      test_58c failed with 2

      This may be related to LU-3625.

      test log

      == replay-single test 58c: resend/reconstruct setxattr op ============================================ 09:15:20 (1405786520)
      CMD: onyx-63vm3 dumpe2fs -h /dev/lvm-Role_MDS/P1 2>&1 |
      		grep -E -q '(ea_inode|large_xattr)'
      Starting client: onyx-63vm1: -o user_xattr,flock onyx-63vm3:onyx-63vm7:/lustre /mnt/lustre2
      CMD: onyx-63vm1 mkdir -p /mnt/lustre2
      CMD: onyx-63vm1 mount -t lustre -o user_xattr,flock onyx-63vm3:onyx-63vm7:/lustre /mnt/lustre2
      mount.lustre: mount onyx-63vm3:onyx-63vm7:/lustre at /mnt/lustre2 failed: Input/output error
      Is the MGS running?
      CMD: onyx-63vm3 lctl set_param fail_loc=0x123
      fail_loc=0x123
      CMD: onyx-63vm1 setfattr -n trusted.foo -v bar /mnt/lustre/d58c.replay-single/f58c.replay-single
      CMD: onyx-63vm3 lctl set_param fail_loc=0
      fail_loc=0
      getfattr: /mnt/lustre2/d58c.replay-single/f58c.replay-single: No such file or directory
       replay-single test_58c: @@@@@@ FAIL: test_58c failed with 2 
      

      client dmesg

      [129081.475975] Lustre: DEBUG MARKER: == replay-single test 58c: resend/reconstruct setxattr op ============================================ 09:15:20 (1405786520)
      [129081.758649] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre2
      [129081.771133] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock onyx-63vm3:onyx-63vm7:/lustre /mnt/lustre2
      [129081.784961] LustreError: 15c-8: MGC10.2.5.138@tcp: The configuration from log 'lustre-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      [129081.785213] Lustre: Unmounted lustre-client
      [129081.785422] LustreError: 22063:0:(obd_mount.c:1342:lustre_fill_super()) Unable to mount  (-5)
      [129082.023875] Lustre: DEBUG MARKER: setfattr -n trusted.foo -v bar /mnt/lustre/d58c.replay-single/f58c.replay-single
      [129134.337350] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_58c: @@@@@@ FAIL: test_58c failed with 2 
      [129134.445678] Lustre: DEBUG MARKER: replay-single test_58c: @@@@@@ FAIL: test_58c failed with 2
      

          Activity

            [LU-5407] Failover failure on test suite replay-single test_58c: test_58c failed with 2

            Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/14766/
            Subject: LU-5407 tests: Error message for replay-single 58b and 58c
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 88e555dbabfc35521345851ff41516156217b1ec

            Gerrit Updater (gerrit) added a comment

            Andreas Dilger (andreas.dilger@intel.com) merged in patch http://review.whamcloud.com/14792/
            Subject: LU-5407 test: wait MGC import to finish recovery
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: c8602de66d24be2e4cf4750ce79a95e51ef5676d

            Gerrit Updater (gerrit) added a comment

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14864/
            Subject: LU-5407 tests: Disable replay-single test 58c
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 5b0ce8303e4033b3c7b09fda50f013e6d9d002b0

            Gerrit Updater (gerrit) added a comment

            James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/14864
            Subject: LU-5407 tests: Disable replay-single test 58c
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0dbceb9e6d2835c00733a2fd76b950ff77976305

            Gerrit Updater (gerrit) added a comment

            Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/14792
            Subject: LU-5407 test: wait MGC import to reconnect
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 3e529aea4d9e52e433084287c40792700c9bdd63

            Gerrit Updater (gerrit) added a comment

            This problem is caused by the MGC import on the client still being disconnected: it has not yet recovered after the MDS failover in "test_58b".

            Log from the MDT:

            00000100:00100000:0.0:1431060944.791714:0:20964:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.1.5.31@tcp, seq: 74
            00000100:00100000:0.0:1431060944.791720:0:20964:0:(service.c:2075:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0002:0+-99:2640:x1500570864992708:12345-10.1.5.31@tcp:101
            00000020:00080000:0.0:1431060944.791729:0:20964:0:(tgt_handler.c:622:tgt_request_handle()) operation 101 on unconnected OST from 12345-10.1.5.31@tcp
            00000100:00100000:0.0:1431060944.791779:0:20964:0:(service.c:2125:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0002:0+-99:2640:x1500570864992708:12345-10.1.5.31@tcp:101 Request procesed in 59us (193us total) trans 0 rc -107/-107
            00000100:00100000:0.0:1431060944.791787:0:20964:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.1.5.31@tcp, seq: 74
            00000100:00100000:0.0:1431060944.792290:0:3344:0:(events.c:349:request_in_callback()) peer: 12345-10.1.5.31@tcp
            00000100:00100000:0.0:1431060944.792296:0:20964:0:(service.c:1927:ptlrpc_server_handle_req_in()) got req x1500570864992712
            00000100:00100000:0.0:1431060944.792313:0:20964:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.1.5.31@tcp, seq: 75
            00000100:00100000:0.0:1431060944.792314:0:20964:0:(service.c:2075:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0002:0+-99:539:x1500570864992712:12345-10.1.5.31@tcp:250
            00010000:00080000:0.0:1431060944.792327:0:20964:0:(ldlm_lib.c:1045:target_handle_connect()) MGS: connection from 5108cdc4-cb46-6929-784d-073b1dfbc568@10.1.5.31@tcp t0 exp (null) cur 1431060944 last 0
            00000020:00000080:0.0:1431060944.792353:0:20964:0:(genops.c:1146:class_connect()) connect: client 5108cdc4-cb46-6929-784d-073b1dfbc568, cookie 0x96ce524a05765d59
            00000020:01000000:0.0:1431060944.792358:0:20964:0:(lprocfs_status_server.c:307:lprocfs_exp_setup()) using hash ffff8800791d5180
            00000100:00100000:0.0:1431060944.792411:0:20964:0:(service.c:2125:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0002:5108cdc4-cb46-6929-784d-073b1dfbc568+4:539:x1500570864992712:12345-10.1.5.31@tcp:250 Request procesed in 96us (122us total) trans 0 rc 0/0
            00000100:00100000:0.0:1431060944.792413:0:20964:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.1.5.31@tcp, seq: 75
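
            As a reading aid for logs like the one above, the opcode and return code can be pulled out of each "Handled RPC" line. This is only a sketch: handled_rpc_summary is a made-up name, not a Lustre tool, and the pattern assumes the line layout shown in this log.

```shell
# Hypothetical helper (not part of Lustre): read debug-log lines on stdin and
# print "opc=<opcode> rc=<return code>" for every "Handled RPC" line.
# Assumes the layout seen above: "...:<opc> Request ... rc <rc>/<rc>".
handled_rpc_summary() {
    sed -n 's/.*Handled RPC [^ ]* [^ ]*:\([0-9][0-9]*\) Request .* rc \(-\{0,1\}[0-9][0-9]*\)\/.*/opc=\1 rc=\2/p'
}
```

            Applied to the two "Handled RPC" lines above, this should report opc=101 with rc=-107 and opc=250 with rc=0.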
            

            The config lock is enqueued by the MGC at 1431060944.791714 and fails in "tgt_request_handle" with "operation 101 on unconnected OST from 12345-10.1.5.31@tcp" (rc -107, ENOTCONN).
            The MGC sends MGS_CONNECT (250) to the MGS only slightly later (1431060944.792313).
            In test_58c, the Lustre mount at "/mnt/lustre2" failed for this reason, and the subsequent "getfattr" then failed with -2 (ENOENT).
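
            One way to avoid the race described above is to poll the MGC import state before starting the second mount. The following is a minimal sketch of such a wait loop, not the code from any patch on this ticket. The helper name wait_for_state, the POLL_INTERVAL variable, and the lctl parameter in the usage comment are assumptions.

```shell
# Hypothetical helper: run CMD repeatedly until it prints EXPECTED.
# usage: wait_for_state EXPECTED MAX_TRIES CMD [ARGS...]
# POLL_INTERVAL (seconds between polls, default 1) is an assumed knob.
wait_for_state() {
    wfs_expected=$1
    wfs_max=$2
    shift 2
    wfs_try=0
    wfs_state=
    while [ "$wfs_try" -lt "$wfs_max" ]; do
        # Treat a failing command as "no state yet" and keep polling.
        wfs_state=$("$@" 2>/dev/null) || wfs_state=
        if [ "$wfs_state" = "$wfs_expected" ]; then
            return 0
        fi
        wfs_try=$((wfs_try + 1))
        sleep "${POLL_INTERVAL:-1}"
    done
    echo "state '$wfs_state' != '$wfs_expected' after $wfs_max tries" >&2
    return 1
}

# Assumed usage on a Lustre client. The parameter name and its output format
# are assumptions; verify them on your version before relying on this:
#   wait_for_state FULL 30 sh -c 'lctl get_param -n mgc.*.mgs_server_uuid | cut -f2'
```

            Passing the state-reading command as arguments keeps the loop independent of any particular lctl parameter, so it can be exercised with a stub command.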

            Hongchao Zhang (hongchao.zhang) added a comment

            Hi Andreas, okay, I'll analyze this and create a patch ASAP. Thanks!

            Hongchao Zhang (hongchao.zhang) added a comment

            This problem is causing about 1/3 of all test failures in review-zfs, so I am raising it to blocker status.

            Hong Chao, can you please treat this as a priority?

            Andreas Dilger (adilger) added a comment

            Please note that patch http://review.whamcloud.com/14766 only modifies the error message and checks if the client mount succeeds.

            This patch does not fix the test failure.
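
            The kind of check described here can be sketched as below: run the mount command, and if it fails, report a mount-specific error instead of continuing to the xattr steps. This is not the code from the patch; check_mount and its message format are hypothetical.

```shell
# Hypothetical helper: run a command and fail with a descriptive message
# if it returns non-zero, passing the command's exit status through.
# usage: check_mount DESCRIPTION CMD [ARGS...]
check_mount() {
    cm_desc=$1
    shift
    cm_rc=0
    "$@" || cm_rc=$?
    if [ "$cm_rc" -ne 0 ]; then
        echo "FAIL: $cm_desc failed with $cm_rc" >&2
        return "$cm_rc"
    fi
    return 0
}

# Assumed usage in a test (device and mount point are placeholders):
#   check_mount "mount /mnt/lustre2" mount -t lustre -o user_xattr,flock "$DEVICE" /mnt/lustre2 || exit 1
```

            With a check like this, the failure is reported as a failed mount rather than as a later "No such file or directory" from getfattr.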

            James Nunez (jamesanunez, Inactive) added a comment

            James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/14766
            Subject: LU-5407 tests: Error message for replay-single 58b and 58c
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: da8b8fe156ec0fcb06a096c1f6cddb85beec3fe9

            Gerrit Updater (gerrit) added a comment

            People

              Assignee: Hongchao Zhang (hongchao.zhang)
              Reporter: Maloo (maloo)