Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-6841

replay-single test_30: multiop 20786 failed

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for John Hammond <john.hammond@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/3f0a5c78-27ba-11e5-b37a-5254006e85c2.

      The sub-test test_30 failed with the following error:

      multiop 20786 failed
      

      Please provide additional information about the failure here.

      Info required for matching: replay-single 30

      Attachments

        Issue Links

          Activity

            [LU-6841] replay-single test_30: multiop 20786 failed
            utopiabound Nathaniel Clark added a comment - - edited

            Just encountered on master but test replay-single/test_31
            https://testing.hpdd.intel.com/test_sets/7708f4da-7e14-11e5-9c23-5254006e85c2

            utopiabound Nathaniel Clark added a comment - - edited Just encountered on master but test replay-single/test_31 https://testing.hpdd.intel.com/test_sets/7708f4da-7e14-11e5-9c23-5254006e85c2
            pjones Peter Jones added a comment -

            Let's close the ticket as cannot report and reopen if we do see it again

            pjones Peter Jones added a comment - Let's close the ticket as cannot report and reopen if we do see it again

            This issue has not been seen on any branch since 7/24 and may have been resolved by related landings. We are going to reduce the severity and leave it open.

            jgmitter Joseph Gmitter (Inactive) added a comment - This issue has not been seen on any branch since 7/24 and may have been resolved by related landings. We are going to reduce the severity and leave it open.

            I don't remember having seen this test fail during my local testing.

            However, I have to admit that the kernel I use does not include the dev_read_only patch, which means the MDT device cannot be made readonly by replay_barrier. This means the test environment is not exactly the same.

            pichong Gregoire Pichon added a comment - I don't remember having seen this test fail during my local testing. However, I have to admit that the kernel I use does not include the dev_read_only patch, which means the MDT device cannot be made readonly by replay_barrier . This means the test environment is not exactly the same.

            Hi Gregoire, have you seen any problems similar to this in your testing?

            adilger Andreas Dilger added a comment - Hi Gregoire, have you seen any problems similar to this in your testing?

            Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/15814
            Subject: LU-6841 target: debug patch to collect more logs
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: d8fa2f729b6dd8fa5c9ae604308f39306ddd6aa9

            gerrit Gerrit Updater added a comment - Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/15814 Subject: LU-6841 target: debug patch to collect more logs Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: d8fa2f729b6dd8fa5c9ae604308f39306ddd6aa9
            yujian Jian Yu added a comment -

            Hi Hongchao,
            I think the logs were missing because the nodes were re-provisioned for testing other patches.

            yujian Jian Yu added a comment - Hi Hongchao, I think the logs were missing because the nodes were re-provisioned for testing other patches.
            hongchao.zhang Hongchao Zhang added a comment - - edited

            this could be related to LU-5319 (multiple slots per client in last_rcvd), for it was encountered firstly in the test of the patch of it. https://testing.hpdd.intel.com/sub_tests/153ee6d4-ffba-11e4-ac6b-5254006e85c2 (at 2015-05-20)

            then this issue disappeared and occurred again just after the patch (http://review.whamcloud.com/#/c/14860/) was landed at July 1th.

            the reason of the bug is the committed replay request (open request) is timed out during replaying

            00000100:00100000:0.0:1437707068.959718:0:6229:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3
            00000100:00100000:0.0:1437707068.959727:0:6229:0:(client.c:2622:ptlrpc_free_committed()) @@@ keeping (FL_REPLAY)  req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 0 to 0 dl 1437707075 ref 1 fl Complete:RP/4/0 rc 0/0
            00000100:00100000:0.0:1437707068.959731:0:6229:0:(client.c:2622:ptlrpc_free_committed()) @@@ keeping (FL_REPLAY)  req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 0 to 0 dl 1437707075 ref 1 fl Complete:RP/4/0 rc 0/0
            
            ...
            
            00000100:00080000:0.0:1437707094.279383:0:5882:0:(import.c:1188:ptlrpc_connect_interpret()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from CONNECTING to REPLAY
            00000100:00080000:0.0:1437707094.279386:0:5882:0:(import.c:1448:ptlrpc_import_recovery_state_machine()) replay requested by lustre-MDT0000_UUID
            00000100:00100000:0.0:1437707094.279389:0:5882:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3
            00000100:00100000:0.0:1437707094.279394:0:5882:0:(client.c:2617:ptlrpc_free_committed()) @@@ stopping search  req@ffff88007a0c4000 x1507534108223680/t163208757257(163208757257) o36->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 640/416 e 0 to 0 dl 1437707076 ref 1 fl Complete:R/4/0 rc 0/0
            00000100:00080000:0.0:1437707094.279401:0:5882:0:(recover.c:93:ptlrpc_replay_next()) import ffff88007c3bf000 from lustre-MDT0000_UUID committed 163208757256 last 0
            00000100:00080000:0.0:1437707094.279408:0:5882:0:(client.c:2928:ptlrpc_replay_req()) @@@ REPLAY  req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 0 to 0 dl 1437707075 ref 1 fl New:RP/4/0 rc 0/0
            
            ...
            
            00000100:00000400:0.0:1437707100.279594:0:5882:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1437707094/real 1437707094]  req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707096 ref 2 fl Rpc:XP/4/ffffffff rc 0/-1
            00000100:00100000:0.0:1437707100.279621:0:5882:0:(client.c:2049:ptlrpc_expire_one_request()) @@@ err -110, sent_state=REPLAY (now=REPLAY)  req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707096 ref 2 fl Rpc:XP/4/ffffffff rc 0/-1
            00000100:00020000:0.0:1437707100.279637:0:5882:0:(client.c:2816:ptlrpc_replay_interpret()) request replay timed out, restarting recovery
            00000100:00080000:0.0:1437707100.282045:0:5882:0:(import.c:673:ptlrpc_connect_import()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from REPLAY to CONNECTING
            00000100:00080000:0.0:1437707100.282053:0:5882:0:(import.c:518:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to NID 10.1.5.29@tcp last attempt 4306605556
            00000100:00080000:0.0:1437707100.282062:0:5882:0:(import.c:596:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: import ffff88007c3bf000 using connection 10.1.5.29@tcp/10.1.5.29@tcp
            00000100:00100000:0.0:1437707100.282089:0:5882:0:(import.c:760:ptlrpc_connect_import()) @@@ (re)connect request (timeout 5)  req@ffff88007a0c4f00 x0/t0(0) o38->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1
            00000100:00100000:0.0:1437707100.282101:0:5882:0:(client.c:1951:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108223584:10.1.5.29@tcp:101
            00000100:00100000:0.0:1437707100.282113:0:5882:0:(client.c:1574:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108223988:10.1.5.29@tcp:38
            00000100:00080000:0.0:1437707100.283658:0:5882:0:(import.c:1004:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to target with instance 39
            00000100:00080000:0.0:1437707100.283671:0:5882:0:(import.c:1163:ptlrpc_connect_interpret()) reconnected to lustre-MDT0000_UUID@10.1.5.29@tcp after partition
            00000100:00080000:0.0:1437707100.283673:0:5882:0:(import.c:1173:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: reconnected to lustre-MDT0000_UUID during replay
            00000100:00080000:0.0:1437707100.283676:0:5882:0:(import.c:1179:ptlrpc_connect_interpret()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from CONNECTING to REPLAY
            00000100:00080000:0.0:1437707100.283679:0:5882:0:(import.c:1448:ptlrpc_import_recovery_state_machine()) replay requested by lustre-MDT0000_UUID
            00000100:00100000:0.0:1437707100.283682:0:5882:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3
            00000100:00100000:0.0:1437707100.283686:0:5882:0:(client.c:2617:ptlrpc_free_committed()) @@@ stopping search  req@ffff88007a0c4000 x1507534108223680/t163208757257(163208757257) o36->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 640/416 e 0 to 0 dl 1437707076 ref 1 fl Complete:R/4/0 rc 0/0
            00000100:00080000:0.0:1437707100.283693:0:5882:0:(recover.c:93:ptlrpc_replay_next()) import ffff88007c3bf000 from lustre-MDT0000_UUID committed 163208757256 last 0
            00000100:00080000:0.0:1437707100.283699:0:5882:0:(client.c:2928:ptlrpc_replay_req()) @@@ REPLAY  req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-
            
            ..
            
            00000100:00000400:0.0:1437707106.283630:0:5882:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1437707100/real 1437707100]  req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707102 ref 2 fl Rpc:XP/6/ffffffff rc 0/-1
            00000100:00100000:0.0:1437707106.283659:0:5882:0:(client.c:2049:ptlrpc_expire_one_request()) @@@ err -110, sent_state=REPLAY (now=REPLAY)  req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707102 ref 2 fl Rpc:XP/6/ffffffff rc 0/-1
            00000100:00020000:0.0:1437707106.283676:0:5882:0:(client.c:2816:ptlrpc_replay_interpret()) request replay timed out, restarting recovery
            00000100:00080000:0.0:1437707106.285949:0:5882:0:(import.c:673:ptlrpc_connect_import()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from REPLAY to CONNECTING
            00000100:00080000:0.0:1437707106.285956:0:5882:0:(import.c:518:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to NID 10.1.5.29@tcp last attempt 4306611560
            00000100:00080000:0.0:1437707106.285965:0:5882:0:(import.c:596:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: import ffff88007c3bf000 using connection 10.1.5.29@tcp/10.1.5.29@tcp
            00000100:00100000:0.0:1437707106.285996:0:5882:0:(import.c:760:ptlrpc_connect_import()) @@@ (re)connect request (timeout 5)  req@ffff88007a0c5800 x0/t0(0) o38->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1
            00000100:00100000:0.0:1437707106.286008:0:5882:0:(client.c:1951:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108223592:10.1.5.29@tcp:101
            00000100:00100000:0.0:1437707106.286020:0:5882:0:(client.c:1574:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108224040:10.1.5.29@tcp:38
            00000100:00080000:0.0:1437707106.287815:0:5882:0:(import.c:1004:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to target with instance 39
            00000100:00080000:0.0:1437707106.287828:0:5882:0:(import.c:1163:ptlrpc_connect_interpret()) reconnected to lustre-MDT0000_UUID@10.1.5.29@tcp after partition
            00000100:00080000:0.0:1437707106.287831:0:5882:0:(import.c:1173:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: reconnected to lustre-MDT0000_UUID during replay
            00000100:00080000:0.0:1437707106.287833:0:5882:0:(import.c:1179:ptlrpc_connect_interpret()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from CONNECTING to REPLAY
            00000100:00080000:0.0:1437707106.287836:0:5882:0:(import.c:1448:ptlrpc_import_recovery_state_machine()) replay requested by lustre-MDT0000_UUID
            00000100:00100000:0.0:1437707106.287839:0:5882:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3
            00000100:00100000:0.0:1437707106.287843:0:5882:0:(client.c:2617:ptlrpc_free_committed()) @@@ stopping search  req@ffff88007a0c4000 x1507534108223680/t163208757257(163208757257) o36->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 640/416 e 0 to 0 dl 1437707076 ref 1 fl Complete:R/4/0 rc 0/0
            00000100:00080000:0.0:1437707106.287852:0:5882:0:(recover.c:93:ptlrpc_replay_next()) import ffff88007c3bf000 from lustre-MDT0000_UUID committed 163208757256 last 0
            00000100:00080000:0.0:1437707106.287857:0:5882:0:(client.c:2928:ptlrpc_replay_req()) @@@ REPLAY  req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707102 ref 1 fl New:EXP/6/ffffffff rc 0/-1
            

            Hi Yujian,
            In https://testing.hpdd.intel.com/sub_tests/61999b00-32d7-11e5-a4fd-5254006e85c2 (July 24),
            the debug logs at MDS side during recovery is dumped to "/tmp/lustre-log.1437707274.9697",
            and how I can get that log? Thanks!!

            hongchao.zhang Hongchao Zhang added a comment - - edited this could be related to LU-5319 (multiple slots per client in last_rcvd), for it was encountered firstly in the test of the patch of it. https://testing.hpdd.intel.com/sub_tests/153ee6d4-ffba-11e4-ac6b-5254006e85c2 (at 2015-05-20) then this issue disappeared and occurred again just after the patch ( http://review.whamcloud.com/#/c/14860/ ) was landed at July 1th. the reason of the bug is the committed replay request (open request) is timed out during replaying 00000100:00100000:0.0:1437707068.959718:0:6229:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3 00000100:00100000:0.0:1437707068.959727:0:6229:0:(client.c:2622:ptlrpc_free_committed()) @@@ keeping (FL_REPLAY) req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 0 to 0 dl 1437707075 ref 1 fl Complete:RP/4/0 rc 0/0 00000100:00100000:0.0:1437707068.959731:0:6229:0:(client.c:2622:ptlrpc_free_committed()) @@@ keeping (FL_REPLAY) req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 0 to 0 dl 1437707075 ref 1 fl Complete:RP/4/0 rc 0/0 ... 00000100:00080000:0.0:1437707094.279383:0:5882:0:(import.c:1188:ptlrpc_connect_interpret()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from CONNECTING to REPLAY 00000100:00080000:0.0:1437707094.279386:0:5882:0:(import.c:1448:ptlrpc_import_recovery_state_machine()) replay requested by lustre-MDT0000_UUID 00000100:00100000:0.0:1437707094.279389:0:5882:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3 00000100:00100000:0.0:1437707094.279394:0:5882:0:(client.c:2617:ptlrpc_free_committed()) @@@ stopping search req@ffff88007a0c4000 x1507534108223680/t163208757257(163208757257) o36->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 640/416 e 0 to 0 dl 1437707076 ref 1 fl Complete:R/4/0 rc 0/0 00000100:00080000:0.0:1437707094.279401:0:5882:0:(recover.c:93:ptlrpc_replay_next()) import ffff88007c3bf000 from lustre-MDT0000_UUID committed 163208757256 last 0 00000100:00080000:0.0:1437707094.279408:0:5882:0:(client.c:2928:ptlrpc_replay_req()) @@@ REPLAY req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 0 to 0 dl 1437707075 ref 1 fl New:RP/4/0 rc 0/0 ... 00000100:00000400:0.0:1437707100.279594:0:5882:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1437707094/real 1437707094] req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707096 ref 2 fl Rpc:XP/4/ffffffff rc 0/-1 00000100:00100000:0.0:1437707100.279621:0:5882:0:(client.c:2049:ptlrpc_expire_one_request()) @@@ err -110, sent_state=REPLAY (now=REPLAY) req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707096 ref 2 fl Rpc:XP/4/ffffffff rc 0/-1 00000100:00020000:0.0:1437707100.279637:0:5882:0:(client.c:2816:ptlrpc_replay_interpret()) request replay timed out, restarting recovery 00000100:00080000:0.0:1437707100.282045:0:5882:0:(import.c:673:ptlrpc_connect_import()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from REPLAY to CONNECTING 00000100:00080000:0.0:1437707100.282053:0:5882:0:(import.c:518:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to NID 10.1.5.29@tcp last attempt 4306605556 00000100:00080000:0.0:1437707100.282062:0:5882:0:(import.c:596:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: import ffff88007c3bf000 using connection 10.1.5.29@tcp/10.1.5.29@tcp 00000100:00100000:0.0:1437707100.282089:0:5882:0:(import.c:760:ptlrpc_connect_import()) @@@ (re)connect request (timeout 5) req@ffff88007a0c4f00 x0/t0(0) o38->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1 00000100:00100000:0.0:1437707100.282101:0:5882:0:(client.c:1951:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108223584:10.1.5.29@tcp:101 00000100:00100000:0.0:1437707100.282113:0:5882:0:(client.c:1574:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108223988:10.1.5.29@tcp:38 00000100:00080000:0.0:1437707100.283658:0:5882:0:(import.c:1004:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to target with instance 39 00000100:00080000:0.0:1437707100.283671:0:5882:0:(import.c:1163:ptlrpc_connect_interpret()) reconnected to lustre-MDT0000_UUID@10.1.5.29@tcp after partition 00000100:00080000:0.0:1437707100.283673:0:5882:0:(import.c:1173:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: reconnected to lustre-MDT0000_UUID during replay 00000100:00080000:0.0:1437707100.283676:0:5882:0:(import.c:1179:ptlrpc_connect_interpret()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from CONNECTING to REPLAY 00000100:00080000:0.0:1437707100.283679:0:5882:0:(import.c:1448:ptlrpc_import_recovery_state_machine()) replay requested by lustre-MDT0000_UUID 00000100:00100000:0.0:1437707100.283682:0:5882:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3 00000100:00100000:0.0:1437707100.283686:0:5882:0:(client.c:2617:ptlrpc_free_committed()) @@@ stopping search req@ffff88007a0c4000 x1507534108223680/t163208757257(163208757257) o36->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 640/416 e 0 to 0 dl 1437707076 ref 1 fl Complete:R/4/0 rc 0/0 00000100:00080000:0.0:1437707100.283693:0:5882:0:(recover.c:93:ptlrpc_replay_next()) import ffff88007c3bf000 from lustre-MDT0000_UUID committed 163208757256 last 0 00000100:00080000:0.0:1437707100.283699:0:5882:0:(client.c:2928:ptlrpc_replay_req()) @@@ REPLAY req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc- .. 00000100:00000400:0.0:1437707106.283630:0:5882:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1437707100/real 1437707100] req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707102 ref 2 fl Rpc:XP/6/ffffffff rc 0/-1 00000100:00100000:0.0:1437707106.283659:0:5882:0:(client.c:2049:ptlrpc_expire_one_request()) @@@ err -110, sent_state=REPLAY (now=REPLAY) req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707102 ref 2 fl Rpc:XP/6/ffffffff rc 0/-1 00000100:00020000:0.0:1437707106.283676:0:5882:0:(client.c:2816:ptlrpc_replay_interpret()) request replay timed out, restarting recovery 00000100:00080000:0.0:1437707106.285949:0:5882:0:(import.c:673:ptlrpc_connect_import()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from REPLAY to CONNECTING 00000100:00080000:0.0:1437707106.285956:0:5882:0:(import.c:518:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to NID 10.1.5.29@tcp last attempt 4306611560 00000100:00080000:0.0:1437707106.285965:0:5882:0:(import.c:596:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: import ffff88007c3bf000 using connection 10.1.5.29@tcp/10.1.5.29@tcp 00000100:00100000:0.0:1437707106.285996:0:5882:0:(import.c:760:ptlrpc_connect_import()) @@@ (re)connect request (timeout 5) req@ffff88007a0c5800 x0/t0(0) o38->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1 00000100:00100000:0.0:1437707106.286008:0:5882:0:(client.c:1951:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108223592:10.1.5.29@tcp:101 00000100:00100000:0.0:1437707106.286020:0:5882:0:(client.c:1574:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108224040:10.1.5.29@tcp:38 00000100:00080000:0.0:1437707106.287815:0:5882:0:(import.c:1004:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to target with instance 39 00000100:00080000:0.0:1437707106.287828:0:5882:0:(import.c:1163:ptlrpc_connect_interpret()) reconnected to lustre-MDT0000_UUID@10.1.5.29@tcp after partition 00000100:00080000:0.0:1437707106.287831:0:5882:0:(import.c:1173:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: reconnected to lustre-MDT0000_UUID during replay 00000100:00080000:0.0:1437707106.287833:0:5882:0:(import.c:1179:ptlrpc_connect_interpret()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from CONNECTING to REPLAY 00000100:00080000:0.0:1437707106.287836:0:5882:0:(import.c:1448:ptlrpc_import_recovery_state_machine()) replay requested by lustre-MDT0000_UUID 00000100:00100000:0.0:1437707106.287839:0:5882:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3 00000100:00100000:0.0:1437707106.287843:0:5882:0:(client.c:2617:ptlrpc_free_committed()) @@@ stopping search req@ffff88007a0c4000 x1507534108223680/t163208757257(163208757257) o36->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 640/416 e 0 to 0 dl 1437707076 ref 1 fl Complete:R/4/0 rc 0/0 00000100:00080000:0.0:1437707106.287852:0:5882:0:(recover.c:93:ptlrpc_replay_next()) import ffff88007c3bf000 from lustre-MDT0000_UUID committed 163208757256 last 0 00000100:00080000:0.0:1437707106.287857:0:5882:0:(client.c:2928:ptlrpc_replay_req()) @@@ REPLAY req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707102 ref 1 fl New:EXP/6/ffffffff rc 0/-1 Hi Yujian, In https://testing.hpdd.intel.com/sub_tests/61999b00-32d7-11e5-a4fd-5254006e85c2 (July 24), the debug logs at MDS side during recovery is dumped to "/tmp/lustre-log.1437707274.9697", and how I can get that log? Thanks!!

            status update: still analyzing the logs, and still need more time to find the cause of it.

            hongchao.zhang Hongchao Zhang added a comment - status update: still analyzing the logs, and still need more time to find the cause of it.
            pjones Peter Jones added a comment -

            Hongchao

            Could you please look into this issue?

            Thanks

            Peter

            pjones Peter Jones added a comment - Hongchao Could you please look into this issue? Thanks Peter

            Hmm, this is replay-single 30, and no DNE operation in this test. So it is unlikely related with DNE changes.

            di.wang Di Wang (Inactive) added a comment - Hmm, this is replay-single 30, and no DNE operation in this test. So it is unlikely related with DNE changes.

            People

              hongchao.zhang Hongchao Zhang
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: