Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-6841

replay-single test_30: multiop 20786 failed

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for John Hammond <john.hammond@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/3f0a5c78-27ba-11e5-b37a-5254006e85c2.

      The sub-test test_30 failed with the following error:

      multiop 20786 failed
      

      Please provide additional information about the failure here.

      Info required for matching: replay-single 30

      Attachments

        Issue Links

          Activity

            [LU-6841] replay-single test_30: multiop 20786 failed

            This issue could be marked as a duplicate of LU-5951

            hongchao.zhang Hongchao Zhang added a comment - This issue could be marked as a duplicate of LU-5951

            Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/17303
            Subject: LU-6841 target: check the last reply for resent
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 21583ba45fefda43841c36ab3609e529dbadda1d

            gerrit Gerrit Updater added a comment - Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/17303 Subject: LU-6841 target: check the last reply for resent Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 21583ba45fefda43841c36ab3609e529dbadda1d

            Status update:
            as per the logs, this issue is caused by the resent replay request during recovery, but MDT doesn't find the corresponding reply
            of the resent request, and It could be related to the Multiple Slots in last_rcvd. Still analysis the related code lines to find where
            the problem is and will update the status once there is any progress.

            hongchao.zhang Hongchao Zhang added a comment - Status update: as per the logs, this issue is caused by the resent replay request during recovery, but MDT doesn't find the corresponding reply of the resent request, and It could be related to the Multiple Slots in last_rcvd. Still analysis the related code lines to find where the problem is and will update the status once there is any progress.
            utopiabound Nathaniel Clark added a comment - - edited

            Just encountered on master but test replay-single/test_31
            https://testing.hpdd.intel.com/test_sets/7708f4da-7e14-11e5-9c23-5254006e85c2

            utopiabound Nathaniel Clark added a comment - - edited Just encountered on master but test replay-single/test_31 https://testing.hpdd.intel.com/test_sets/7708f4da-7e14-11e5-9c23-5254006e85c2
            pjones Peter Jones added a comment -

            Let's close the ticket as cannot report and reopen if we do see it again

            pjones Peter Jones added a comment - Let's close the ticket as cannot report and reopen if we do see it again

            This issue has not been seen on any branch since 7/24 and may have been resolved by related landings. We are going to reduce the severity and leave it open.

            jgmitter Joseph Gmitter (Inactive) added a comment - This issue has not been seen on any branch since 7/24 and may have been resolved by related landings. We are going to reduce the severity and leave it open.

            I don't remember having seen this test fail during my local testing.

            However, I have to admit that the kernel I use does not include the dev_read_only patch, which means the MDT device cannot be made readonly by replay_barrier. This means the test environment is not exactly the same.

            pichong Gregoire Pichon added a comment - I don't remember having seen this test fail during my local testing. However, I have to admit that the kernel I use does not include the dev_read_only patch, which means the MDT device cannot be made readonly by replay_barrier . This means the test environment is not exactly the same.

            Hi Gregoire, have you seen any problems similar to this in your testing?

            adilger Andreas Dilger added a comment - Hi Gregoire, have you seen any problems similar to this in your testing?

            Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/15814
            Subject: LU-6841 target: debug patch to collect more logs
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: d8fa2f729b6dd8fa5c9ae604308f39306ddd6aa9

            gerrit Gerrit Updater added a comment - Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/15814 Subject: LU-6841 target: debug patch to collect more logs Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: d8fa2f729b6dd8fa5c9ae604308f39306ddd6aa9
            yujian Jian Yu added a comment -

            Hi Hongchao,
            I think the logs were missing because the nodes were re-provisioned for testing other patches.

            yujian Jian Yu added a comment - Hi Hongchao, I think the logs were missing because the nodes were re-provisioned for testing other patches.
            hongchao.zhang Hongchao Zhang added a comment - - edited

            this could be related to LU-5319 (multiple slots per client in last_rcvd), for it was encountered firstly in the test of the patch of it. https://testing.hpdd.intel.com/sub_tests/153ee6d4-ffba-11e4-ac6b-5254006e85c2 (at 2015-05-20)

            then this issue disappeared and occurred again just after the patch (http://review.whamcloud.com/#/c/14860/) was landed at July 1th.

            the reason of the bug is the committed replay request (open request) is timed out during replaying

            00000100:00100000:0.0:1437707068.959718:0:6229:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3
            00000100:00100000:0.0:1437707068.959727:0:6229:0:(client.c:2622:ptlrpc_free_committed()) @@@ keeping (FL_REPLAY)  req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 0 to 0 dl 1437707075 ref 1 fl Complete:RP/4/0 rc 0/0
            00000100:00100000:0.0:1437707068.959731:0:6229:0:(client.c:2622:ptlrpc_free_committed()) @@@ keeping (FL_REPLAY)  req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 0 to 0 dl 1437707075 ref 1 fl Complete:RP/4/0 rc 0/0
            
            ...
            
            00000100:00080000:0.0:1437707094.279383:0:5882:0:(import.c:1188:ptlrpc_connect_interpret()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from CONNECTING to REPLAY
            00000100:00080000:0.0:1437707094.279386:0:5882:0:(import.c:1448:ptlrpc_import_recovery_state_machine()) replay requested by lustre-MDT0000_UUID
            00000100:00100000:0.0:1437707094.279389:0:5882:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3
            00000100:00100000:0.0:1437707094.279394:0:5882:0:(client.c:2617:ptlrpc_free_committed()) @@@ stopping search  req@ffff88007a0c4000 x1507534108223680/t163208757257(163208757257) o36->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 640/416 e 0 to 0 dl 1437707076 ref 1 fl Complete:R/4/0 rc 0/0
            00000100:00080000:0.0:1437707094.279401:0:5882:0:(recover.c:93:ptlrpc_replay_next()) import ffff88007c3bf000 from lustre-MDT0000_UUID committed 163208757256 last 0
            00000100:00080000:0.0:1437707094.279408:0:5882:0:(client.c:2928:ptlrpc_replay_req()) @@@ REPLAY  req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 0 to 0 dl 1437707075 ref 1 fl New:RP/4/0 rc 0/0
            
            ...
            
            00000100:00000400:0.0:1437707100.279594:0:5882:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1437707094/real 1437707094]  req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707096 ref 2 fl Rpc:XP/4/ffffffff rc 0/-1
            00000100:00100000:0.0:1437707100.279621:0:5882:0:(client.c:2049:ptlrpc_expire_one_request()) @@@ err -110, sent_state=REPLAY (now=REPLAY)  req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707096 ref 2 fl Rpc:XP/4/ffffffff rc 0/-1
            00000100:00020000:0.0:1437707100.279637:0:5882:0:(client.c:2816:ptlrpc_replay_interpret()) request replay timed out, restarting recovery
            00000100:00080000:0.0:1437707100.282045:0:5882:0:(import.c:673:ptlrpc_connect_import()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from REPLAY to CONNECTING
            00000100:00080000:0.0:1437707100.282053:0:5882:0:(import.c:518:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to NID 10.1.5.29@tcp last attempt 4306605556
            00000100:00080000:0.0:1437707100.282062:0:5882:0:(import.c:596:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: import ffff88007c3bf000 using connection 10.1.5.29@tcp/10.1.5.29@tcp
            00000100:00100000:0.0:1437707100.282089:0:5882:0:(import.c:760:ptlrpc_connect_import()) @@@ (re)connect request (timeout 5)  req@ffff88007a0c4f00 x0/t0(0) o38->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1
            00000100:00100000:0.0:1437707100.282101:0:5882:0:(client.c:1951:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108223584:10.1.5.29@tcp:101
            00000100:00100000:0.0:1437707100.282113:0:5882:0:(client.c:1574:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108223988:10.1.5.29@tcp:38
            00000100:00080000:0.0:1437707100.283658:0:5882:0:(import.c:1004:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to target with instance 39
            00000100:00080000:0.0:1437707100.283671:0:5882:0:(import.c:1163:ptlrpc_connect_interpret()) reconnected to lustre-MDT0000_UUID@10.1.5.29@tcp after partition
            00000100:00080000:0.0:1437707100.283673:0:5882:0:(import.c:1173:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: reconnected to lustre-MDT0000_UUID during replay
            00000100:00080000:0.0:1437707100.283676:0:5882:0:(import.c:1179:ptlrpc_connect_interpret()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from CONNECTING to REPLAY
            00000100:00080000:0.0:1437707100.283679:0:5882:0:(import.c:1448:ptlrpc_import_recovery_state_machine()) replay requested by lustre-MDT0000_UUID
            00000100:00100000:0.0:1437707100.283682:0:5882:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3
            00000100:00100000:0.0:1437707100.283686:0:5882:0:(client.c:2617:ptlrpc_free_committed()) @@@ stopping search  req@ffff88007a0c4000 x1507534108223680/t163208757257(163208757257) o36->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 640/416 e 0 to 0 dl 1437707076 ref 1 fl Complete:R/4/0 rc 0/0
            00000100:00080000:0.0:1437707100.283693:0:5882:0:(recover.c:93:ptlrpc_replay_next()) import ffff88007c3bf000 from lustre-MDT0000_UUID committed 163208757256 last 0
            00000100:00080000:0.0:1437707100.283699:0:5882:0:(client.c:2928:ptlrpc_replay_req()) @@@ REPLAY  req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-
            
            ..
            
            00000100:00000400:0.0:1437707106.283630:0:5882:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1437707100/real 1437707100]  req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707102 ref 2 fl Rpc:XP/6/ffffffff rc 0/-1
            00000100:00100000:0.0:1437707106.283659:0:5882:0:(client.c:2049:ptlrpc_expire_one_request()) @@@ err -110, sent_state=REPLAY (now=REPLAY)  req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707102 ref 2 fl Rpc:XP/6/ffffffff rc 0/-1
            00000100:00020000:0.0:1437707106.283676:0:5882:0:(client.c:2816:ptlrpc_replay_interpret()) request replay timed out, restarting recovery
            00000100:00080000:0.0:1437707106.285949:0:5882:0:(import.c:673:ptlrpc_connect_import()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from REPLAY to CONNECTING
            00000100:00080000:0.0:1437707106.285956:0:5882:0:(import.c:518:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to NID 10.1.5.29@tcp last attempt 4306611560
            00000100:00080000:0.0:1437707106.285965:0:5882:0:(import.c:596:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: import ffff88007c3bf000 using connection 10.1.5.29@tcp/10.1.5.29@tcp
            00000100:00100000:0.0:1437707106.285996:0:5882:0:(import.c:760:ptlrpc_connect_import()) @@@ (re)connect request (timeout 5)  req@ffff88007a0c5800 x0/t0(0) o38->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1
            00000100:00100000:0.0:1437707106.286008:0:5882:0:(client.c:1951:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108223592:10.1.5.29@tcp:101
            00000100:00100000:0.0:1437707106.286020:0:5882:0:(client.c:1574:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108224040:10.1.5.29@tcp:38
            00000100:00080000:0.0:1437707106.287815:0:5882:0:(import.c:1004:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to target with instance 39
            00000100:00080000:0.0:1437707106.287828:0:5882:0:(import.c:1163:ptlrpc_connect_interpret()) reconnected to lustre-MDT0000_UUID@10.1.5.29@tcp after partition
            00000100:00080000:0.0:1437707106.287831:0:5882:0:(import.c:1173:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: reconnected to lustre-MDT0000_UUID during replay
            00000100:00080000:0.0:1437707106.287833:0:5882:0:(import.c:1179:ptlrpc_connect_interpret()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from CONNECTING to REPLAY
            00000100:00080000:0.0:1437707106.287836:0:5882:0:(import.c:1448:ptlrpc_import_recovery_state_machine()) replay requested by lustre-MDT0000_UUID
            00000100:00100000:0.0:1437707106.287839:0:5882:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3
            00000100:00100000:0.0:1437707106.287843:0:5882:0:(client.c:2617:ptlrpc_free_committed()) @@@ stopping search  req@ffff88007a0c4000 x1507534108223680/t163208757257(163208757257) o36->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 640/416 e 0 to 0 dl 1437707076 ref 1 fl Complete:R/4/0 rc 0/0
            00000100:00080000:0.0:1437707106.287852:0:5882:0:(recover.c:93:ptlrpc_replay_next()) import ffff88007c3bf000 from lustre-MDT0000_UUID committed 163208757256 last 0
            00000100:00080000:0.0:1437707106.287857:0:5882:0:(client.c:2928:ptlrpc_replay_req()) @@@ REPLAY  req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707102 ref 1 fl New:EXP/6/ffffffff rc 0/-1
            

            Hi Yujian,
            In https://testing.hpdd.intel.com/sub_tests/61999b00-32d7-11e5-a4fd-5254006e85c2 (July 24),
            the debug logs at MDS side during recovery is dumped to "/tmp/lustre-log.1437707274.9697",
            and how I can get that log? Thanks!!

            hongchao.zhang Hongchao Zhang added a comment - - edited this could be related to LU-5319 (multiple slots per client in last_rcvd), for it was encountered firstly in the test of the patch of it. https://testing.hpdd.intel.com/sub_tests/153ee6d4-ffba-11e4-ac6b-5254006e85c2 (at 2015-05-20) then this issue disappeared and occurred again just after the patch ( http://review.whamcloud.com/#/c/14860/ ) was landed at July 1th. the reason of the bug is the committed replay request (open request) is timed out during replaying 00000100:00100000:0.0:1437707068.959718:0:6229:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3 00000100:00100000:0.0:1437707068.959727:0:6229:0:(client.c:2622:ptlrpc_free_committed()) @@@ keeping (FL_REPLAY) req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 0 to 0 dl 1437707075 ref 1 fl Complete:RP/4/0 rc 0/0 00000100:00100000:0.0:1437707068.959731:0:6229:0:(client.c:2622:ptlrpc_free_committed()) @@@ keeping (FL_REPLAY) req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 0 to 0 dl 1437707075 ref 1 fl Complete:RP/4/0 rc 0/0 ... 00000100:00080000:0.0:1437707094.279383:0:5882:0:(import.c:1188:ptlrpc_connect_interpret()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from CONNECTING to REPLAY 00000100:00080000:0.0:1437707094.279386:0:5882:0:(import.c:1448:ptlrpc_import_recovery_state_machine()) replay requested by lustre-MDT0000_UUID 00000100:00100000:0.0:1437707094.279389:0:5882:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3 00000100:00100000:0.0:1437707094.279394:0:5882:0:(client.c:2617:ptlrpc_free_committed()) @@@ stopping search req@ffff88007a0c4000 x1507534108223680/t163208757257(163208757257) o36->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 640/416 e 0 to 0 dl 1437707076 ref 1 fl Complete:R/4/0 rc 0/0 00000100:00080000:0.0:1437707094.279401:0:5882:0:(recover.c:93:ptlrpc_replay_next()) import ffff88007c3bf000 from lustre-MDT0000_UUID committed 163208757256 last 0 00000100:00080000:0.0:1437707094.279408:0:5882:0:(client.c:2928:ptlrpc_replay_req()) @@@ REPLAY req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 0 to 0 dl 1437707075 ref 1 fl New:RP/4/0 rc 0/0 ... 00000100:00000400:0.0:1437707100.279594:0:5882:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1437707094/real 1437707094] req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707096 ref 2 fl Rpc:XP/4/ffffffff rc 0/-1 00000100:00100000:0.0:1437707100.279621:0:5882:0:(client.c:2049:ptlrpc_expire_one_request()) @@@ err -110, sent_state=REPLAY (now=REPLAY) req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707096 ref 2 fl Rpc:XP/4/ffffffff rc 0/-1 00000100:00020000:0.0:1437707100.279637:0:5882:0:(client.c:2816:ptlrpc_replay_interpret()) request replay timed out, restarting recovery 00000100:00080000:0.0:1437707100.282045:0:5882:0:(import.c:673:ptlrpc_connect_import()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from REPLAY to CONNECTING 00000100:00080000:0.0:1437707100.282053:0:5882:0:(import.c:518:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to NID 10.1.5.29@tcp last attempt 4306605556 00000100:00080000:0.0:1437707100.282062:0:5882:0:(import.c:596:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: import ffff88007c3bf000 using connection 10.1.5.29@tcp/10.1.5.29@tcp 00000100:00100000:0.0:1437707100.282089:0:5882:0:(import.c:760:ptlrpc_connect_import()) @@@ (re)connect request (timeout 5) req@ffff88007a0c4f00 x0/t0(0) o38->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1 00000100:00100000:0.0:1437707100.282101:0:5882:0:(client.c:1951:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108223584:10.1.5.29@tcp:101 00000100:00100000:0.0:1437707100.282113:0:5882:0:(client.c:1574:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108223988:10.1.5.29@tcp:38 00000100:00080000:0.0:1437707100.283658:0:5882:0:(import.c:1004:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to target with instance 39 00000100:00080000:0.0:1437707100.283671:0:5882:0:(import.c:1163:ptlrpc_connect_interpret()) reconnected to lustre-MDT0000_UUID@10.1.5.29@tcp after partition 00000100:00080000:0.0:1437707100.283673:0:5882:0:(import.c:1173:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: reconnected to lustre-MDT0000_UUID during replay 00000100:00080000:0.0:1437707100.283676:0:5882:0:(import.c:1179:ptlrpc_connect_interpret()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from CONNECTING to REPLAY 00000100:00080000:0.0:1437707100.283679:0:5882:0:(import.c:1448:ptlrpc_import_recovery_state_machine()) replay requested by lustre-MDT0000_UUID 00000100:00100000:0.0:1437707100.283682:0:5882:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3 00000100:00100000:0.0:1437707100.283686:0:5882:0:(client.c:2617:ptlrpc_free_committed()) @@@ stopping search req@ffff88007a0c4000 x1507534108223680/t163208757257(163208757257) o36->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 640/416 e 0 to 0 dl 1437707076 ref 1 fl Complete:R/4/0 rc 0/0 00000100:00080000:0.0:1437707100.283693:0:5882:0:(recover.c:93:ptlrpc_replay_next()) import ffff88007c3bf000 from lustre-MDT0000_UUID committed 163208757256 last 0 00000100:00080000:0.0:1437707100.283699:0:5882:0:(client.c:2928:ptlrpc_replay_req()) @@@ REPLAY req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc- .. 00000100:00000400:0.0:1437707106.283630:0:5882:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1437707100/real 1437707100] req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707102 ref 2 fl Rpc:XP/6/ffffffff rc 0/-1 00000100:00100000:0.0:1437707106.283659:0:5882:0:(client.c:2049:ptlrpc_expire_one_request()) @@@ err -110, sent_state=REPLAY (now=REPLAY) req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707102 ref 2 fl Rpc:XP/6/ffffffff rc 0/-1 00000100:00020000:0.0:1437707106.283676:0:5882:0:(client.c:2816:ptlrpc_replay_interpret()) request replay timed out, restarting recovery 00000100:00080000:0.0:1437707106.285949:0:5882:0:(import.c:673:ptlrpc_connect_import()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from REPLAY to CONNECTING 00000100:00080000:0.0:1437707106.285956:0:5882:0:(import.c:518:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to NID 10.1.5.29@tcp last attempt 4306611560 00000100:00080000:0.0:1437707106.285965:0:5882:0:(import.c:596:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: import ffff88007c3bf000 using connection 10.1.5.29@tcp/10.1.5.29@tcp 00000100:00100000:0.0:1437707106.285996:0:5882:0:(import.c:760:ptlrpc_connect_import()) @@@ (re)connect request (timeout 5) req@ffff88007a0c5800 x0/t0(0) o38->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1 00000100:00100000:0.0:1437707106.286008:0:5882:0:(client.c:1951:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108223592:10.1.5.29@tcp:101 00000100:00100000:0.0:1437707106.286020:0:5882:0:(client.c:1574:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108224040:10.1.5.29@tcp:38 00000100:00080000:0.0:1437707106.287815:0:5882:0:(import.c:1004:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to target with instance 39 00000100:00080000:0.0:1437707106.287828:0:5882:0:(import.c:1163:ptlrpc_connect_interpret()) reconnected to lustre-MDT0000_UUID@10.1.5.29@tcp after partition 00000100:00080000:0.0:1437707106.287831:0:5882:0:(import.c:1173:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: reconnected to lustre-MDT0000_UUID during replay 00000100:00080000:0.0:1437707106.287833:0:5882:0:(import.c:1179:ptlrpc_connect_interpret()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from CONNECTING to REPLAY 00000100:00080000:0.0:1437707106.287836:0:5882:0:(import.c:1448:ptlrpc_import_recovery_state_machine()) replay requested by lustre-MDT0000_UUID 00000100:00100000:0.0:1437707106.287839:0:5882:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3 00000100:00100000:0.0:1437707106.287843:0:5882:0:(client.c:2617:ptlrpc_free_committed()) @@@ stopping search req@ffff88007a0c4000 x1507534108223680/t163208757257(163208757257) o36->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 640/416 e 0 to 0 dl 1437707076 ref 1 fl Complete:R/4/0 rc 0/0 00000100:00080000:0.0:1437707106.287852:0:5882:0:(recover.c:93:ptlrpc_replay_next()) import ffff88007c3bf000 from lustre-MDT0000_UUID committed 163208757256 last 0 00000100:00080000:0.0:1437707106.287857:0:5882:0:(client.c:2928:ptlrpc_replay_req()) @@@ REPLAY req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707102 ref 1 fl New:EXP/6/ffffffff rc 0/-1 Hi Yujian, In https://testing.hpdd.intel.com/sub_tests/61999b00-32d7-11e5-a4fd-5254006e85c2 (July 24), the debug logs at MDS side during recovery is dumped to "/tmp/lustre-log.1437707274.9697", and how I can get that log? Thanks!!

            People

              hongchao.zhang Hongchao Zhang
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: