Details
Type: Bug
Resolution: Duplicate
Priority: Critical
Fix Version/s: None
Affects Version/s: None
Severity: 3
Rank: 9223372036854775807
Description
This issue was created by maloo for John Hammond <john.hammond@intel.com>
This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/3f0a5c78-27ba-11e5-b37a-5254006e85c2.
The sub-test test_30 failed with the following error:
multiop 20786 failed
Please provide additional information about the failure here.
Info required for matching: replay-single 30
Attachments
Issue Links
Activity
Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/17303
Subject: LU-6841 target: check the last reply for resent
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 21583ba45fefda43841c36ab3609e529dbadda1d
Status update:
As per the logs, this issue is caused by a resent replay request during recovery: the MDT does not find the corresponding reply for the resent request. It could be related to the multiple reply slots per client in last_rcvd. I am still analyzing the related code to find where the problem is and will update the status once there is any progress.
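To make the suspected failure mode concrete, here is a minimal, self-contained C sketch of the idea behind per-client reply slots in last_rcvd. It is illustrative only, not Lustre code; all names (struct reply_slot, struct client_slots, find_saved_reply, REPLY_SLOTS_PER_CLIENT) are hypothetical. The NULL return models the case described above: the target has no saved reply matching the XID of the resent replay request.

/*
 * Illustrative sketch only (not Lustre code): models how a target could
 * search a client's reply slots in last_rcvd for the saved reply of a
 * resent request.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define REPLY_SLOTS_PER_CLIENT 4   /* assumed slot count, for illustration */

struct reply_slot {
	bool     in_use;
	uint64_t xid;      /* client XID the reply answered */
	uint64_t transno;  /* transaction number assigned to that reply */
	int      result;   /* status that was returned to the client */
};

/* Per-client reply data as it might be laid out in last_rcvd. */
struct client_slots {
	struct reply_slot slots[REPLY_SLOTS_PER_CLIENT];
};

/*
 * On a resend, look for the reply already generated for this XID.
 * Returning NULL corresponds to the failure discussed in this ticket:
 * the MDT cannot find the saved reply for the resent replay request.
 */
static const struct reply_slot *
find_saved_reply(const struct client_slots *cs, uint64_t resent_xid)
{
	for (int i = 0; i < REPLY_SLOTS_PER_CLIENT; i++)
		if (cs->slots[i].in_use && cs->slots[i].xid == resent_xid)
			return &cs->slots[i];
	return NULL;	/* no slot recorded for this XID */
}

int main(void)
{
	struct client_slots cs = {
		.slots = {
			{ .in_use = true, .xid = 1001, .transno = 163208757253, .result = 0 },
			{ .in_use = true, .xid = 1002, .transno = 163208757254, .result = 0 },
		},
	};

	const struct reply_slot *rs = find_saved_reply(&cs, 1003);
	if (rs == NULL)
		printf("no saved reply for resent xid 1003 -> replay cannot be matched\n");

	rs = find_saved_reply(&cs, 1002);
	if (rs != NULL)
		printf("resent xid 1002 matched slot with transno %llu\n",
		       (unsigned long long)rs->transno);
	return 0;
}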
Just encountered this on master, but in test replay-single/test_31:
https://testing.hpdd.intel.com/test_sets/7708f4da-7e14-11e5-9c23-5254006e85c2
Let's close the ticket as cannot reproduce and reopen it if we see it again.
This issue has not been seen on any branch since 7/24 and may have been resolved by related landings. We are going to reduce the severity and leave it open.
I don't remember having seen this test fail during my local testing.
However, I have to admit that the kernel I use does not include the dev_read_only patch, which means the MDT device cannot be made read-only by replay_barrier, so the test environment is not exactly the same.
Hi Gregoire, have you seen any problems similar to this in your testing?
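For readers unfamiliar with the dev_read_only dependency mentioned above, here is a minimal C sketch of the mechanism being referred to; it is illustrative only, not the actual kernel patch, and the names (struct block_dev, submit_write) are hypothetical. The idea is that once the target device is flagged read-only by replay_barrier, later writes are silently dropped, so the on-disk state stays frozen and the client must replay those operations after failover.

/*
 * Illustrative sketch only (not the dev_read_only patch itself): a write
 * path that silently discards writes once the device is frozen read-only.
 */
#include <stdbool.h>
#include <stdio.h>

struct block_dev {
	const char *name;
	bool        read_only;	/* set when replay_barrier marks the device */
};

/* Submit a write; drop it silently when the device is frozen read-only. */
static int submit_write(struct block_dev *dev, const char *what)
{
	if (dev->read_only) {
		printf("%s: write of '%s' silently dropped (read-only barrier)\n",
		       dev->name, what);
		return 0;	/* caller sees success, but nothing reaches disk */
	}
	printf("%s: write of '%s' committed to disk\n", dev->name, what);
	return 0;
}

int main(void)
{
	struct block_dev mdt = { .name = "mdt-dev", .read_only = false };

	submit_write(&mdt, "transaction before barrier");
	mdt.read_only = true;		/* replay_barrier: freeze on-disk state */
	submit_write(&mdt, "open request that must later be replayed");
	return 0;
}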
Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/15814
Subject: LU-6841 target: debug patch to collect more logs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d8fa2f729b6dd8fa5c9ae604308f39306ddd6aa9
Hi Hongchao,
I think the logs were missing because the nodes were re-provisioned for testing other patches.
This could be related to LU-5319 (multiple slots per client in last_rcvd), since it was first encountered while testing that patch: https://testing.hpdd.intel.com/sub_tests/153ee6d4-ffba-11e4-ac6b-5254006e85c2 (on 2015-05-20).
The issue then disappeared and occurred again just after that patch (http://review.whamcloud.com/#/c/14860/) was landed on July 1st.
The reason for the bug is that a committed replay request (an open request) times out during replay, after which the client restarts recovery and replays the same request again, as shown in the log excerpt below (a simplified sketch of this loop follows the excerpt):
00000100:00100000:0.0:1437707068.959718:0:6229:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3
00000100:00100000:0.0:1437707068.959727:0:6229:0:(client.c:2622:ptlrpc_free_committed()) @@@ keeping (FL_REPLAY) req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 0 to 0 dl 1437707075 ref 1 fl Complete:RP/4/0 rc 0/0
00000100:00100000:0.0:1437707068.959731:0:6229:0:(client.c:2622:ptlrpc_free_committed()) @@@ keeping (FL_REPLAY) req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 0 to 0 dl 1437707075 ref 1 fl Complete:RP/4/0 rc 0/0
...
00000100:00080000:0.0:1437707094.279383:0:5882:0:(import.c:1188:ptlrpc_connect_interpret()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from CONNECTING to REPLAY
00000100:00080000:0.0:1437707094.279386:0:5882:0:(import.c:1448:ptlrpc_import_recovery_state_machine()) replay requested by lustre-MDT0000_UUID
00000100:00100000:0.0:1437707094.279389:0:5882:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3
00000100:00100000:0.0:1437707094.279394:0:5882:0:(client.c:2617:ptlrpc_free_committed()) @@@ stopping search req@ffff88007a0c4000 x1507534108223680/t163208757257(163208757257) o36->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 640/416 e 0 to 0 dl 1437707076 ref 1 fl Complete:R/4/0 rc 0/0
00000100:00080000:0.0:1437707094.279401:0:5882:0:(recover.c:93:ptlrpc_replay_next()) import ffff88007c3bf000 from lustre-MDT0000_UUID committed 163208757256 last 0
00000100:00080000:0.0:1437707094.279408:0:5882:0:(client.c:2928:ptlrpc_replay_req()) @@@ REPLAY req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 0 to 0 dl 1437707075 ref 1 fl New:RP/4/0 rc 0/0
...
00000100:00000400:0.0:1437707100.279594:0:5882:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1437707094/real 1437707094] req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707096 ref 2 fl Rpc:XP/4/ffffffff rc 0/-1
00000100:00100000:0.0:1437707100.279621:0:5882:0:(client.c:2049:ptlrpc_expire_one_request()) @@@ err -110, sent_state=REPLAY (now=REPLAY) req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707096 ref 2 fl Rpc:XP/4/ffffffff rc 0/-1
00000100:00020000:0.0:1437707100.279637:0:5882:0:(client.c:2816:ptlrpc_replay_interpret()) request replay timed out, restarting recovery
00000100:00080000:0.0:1437707100.282045:0:5882:0:(import.c:673:ptlrpc_connect_import()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from REPLAY to CONNECTING
00000100:00080000:0.0:1437707100.282053:0:5882:0:(import.c:518:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to NID 10.1.5.29@tcp last attempt 4306605556
00000100:00080000:0.0:1437707100.282062:0:5882:0:(import.c:596:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: import ffff88007c3bf000 using connection 10.1.5.29@tcp/10.1.5.29@tcp
00000100:00100000:0.0:1437707100.282089:0:5882:0:(import.c:760:ptlrpc_connect_import()) @@@ (re)connect request (timeout 5) req@ffff88007a0c4f00 x0/t0(0) o38->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1
00000100:00100000:0.0:1437707100.282101:0:5882:0:(client.c:1951:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108223584:10.1.5.29@tcp:101
00000100:00100000:0.0:1437707100.282113:0:5882:0:(client.c:1574:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108223988:10.1.5.29@tcp:38
00000100:00080000:0.0:1437707100.283658:0:5882:0:(import.c:1004:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to target with instance 39
00000100:00080000:0.0:1437707100.283671:0:5882:0:(import.c:1163:ptlrpc_connect_interpret()) reconnected to lustre-MDT0000_UUID@10.1.5.29@tcp after partition
00000100:00080000:0.0:1437707100.283673:0:5882:0:(import.c:1173:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: reconnected to lustre-MDT0000_UUID during replay
00000100:00080000:0.0:1437707100.283676:0:5882:0:(import.c:1179:ptlrpc_connect_interpret()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from CONNECTING to REPLAY
00000100:00080000:0.0:1437707100.283679:0:5882:0:(import.c:1448:ptlrpc_import_recovery_state_machine()) replay requested by lustre-MDT0000_UUID
00000100:00100000:0.0:1437707100.283682:0:5882:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3
00000100:00100000:0.0:1437707100.283686:0:5882:0:(client.c:2617:ptlrpc_free_committed()) @@@ stopping search req@ffff88007a0c4000 x1507534108223680/t163208757257(163208757257) o36->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 640/416 e 0 to 0 dl 1437707076 ref 1 fl Complete:R/4/0 rc 0/0
00000100:00080000:0.0:1437707100.283693:0:5882:0:(recover.c:93:ptlrpc_replay_next()) import ffff88007c3bf000 from lustre-MDT0000_UUID committed 163208757256 last 0
00000100:00080000:0.0:1437707100.283699:0:5882:0:(client.c:2928:ptlrpc_replay_req()) @@@ REPLAY req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc- ..
00000100:00000400:0.0:1437707106.283630:0:5882:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1437707100/real 1437707100] req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707102 ref 2 fl Rpc:XP/6/ffffffff rc 0/-1
00000100:00100000:0.0:1437707106.283659:0:5882:0:(client.c:2049:ptlrpc_expire_one_request()) @@@ err -110, sent_state=REPLAY (now=REPLAY) req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707102 ref 2 fl Rpc:XP/6/ffffffff rc 0/-1
00000100:00020000:0.0:1437707106.283676:0:5882:0:(client.c:2816:ptlrpc_replay_interpret()) request replay timed out, restarting recovery
00000100:00080000:0.0:1437707106.285949:0:5882:0:(import.c:673:ptlrpc_connect_import()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from REPLAY to CONNECTING
00000100:00080000:0.0:1437707106.285956:0:5882:0:(import.c:518:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to NID 10.1.5.29@tcp last attempt 4306611560
00000100:00080000:0.0:1437707106.285965:0:5882:0:(import.c:596:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: import ffff88007c3bf000 using connection 10.1.5.29@tcp/10.1.5.29@tcp
00000100:00100000:0.0:1437707106.285996:0:5882:0:(import.c:760:ptlrpc_connect_import()) @@@ (re)connect request (timeout 5) req@ffff88007a0c5800 x0/t0(0) o38->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1
00000100:00100000:0.0:1437707106.286008:0:5882:0:(client.c:1951:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108223592:10.1.5.29@tcp:101
00000100:00100000:0.0:1437707106.286020:0:5882:0:(client.c:1574:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108224040:10.1.5.29@tcp:38
00000100:00080000:0.0:1437707106.287815:0:5882:0:(import.c:1004:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to target with instance 39
00000100:00080000:0.0:1437707106.287828:0:5882:0:(import.c:1163:ptlrpc_connect_interpret()) reconnected to lustre-MDT0000_UUID@10.1.5.29@tcp after partition
00000100:00080000:0.0:1437707106.287831:0:5882:0:(import.c:1173:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: reconnected to lustre-MDT0000_UUID during replay
00000100:00080000:0.0:1437707106.287833:0:5882:0:(import.c:1179:ptlrpc_connect_interpret()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from CONNECTING to REPLAY
00000100:00080000:0.0:1437707106.287836:0:5882:0:(import.c:1448:ptlrpc_import_recovery_state_machine()) replay requested by lustre-MDT0000_UUID
00000100:00100000:0.0:1437707106.287839:0:5882:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3
00000100:00100000:0.0:1437707106.287843:0:5882:0:(client.c:2617:ptlrpc_free_committed()) @@@ stopping search req@ffff88007a0c4000 x1507534108223680/t163208757257(163208757257) o36->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 640/416 e 0 to 0 dl 1437707076 ref 1 fl Complete:R/4/0 rc 0/0
00000100:00080000:0.0:1437707106.287852:0:5882:0:(recover.c:93:ptlrpc_replay_next()) import ffff88007c3bf000 from lustre-MDT0000_UUID committed 163208757256 last 0
00000100:00080000:0.0:1437707106.287857:0:5882:0:(client.c:2928:ptlrpc_replay_req()) @@@ REPLAY req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707102 ref 1 fl New:EXP/6/ffffffff rc 0/-1
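The pattern above (replay of the same XID, err -110, restart recovery, reconnect, replay the same request again) can be summarized with a minimal C sketch. This is illustrative only, not the ptlrpc implementation; the function names, the bounded attempt count, and the printed messages are assumptions made for the example.

/*
 * Illustrative sketch only: models the client-side loop visible in the log.
 * When a replayed request times out (-ETIMEDOUT / err -110), recovery is
 * restarted, the import reconnects, and the same request is replayed again,
 * so the client makes no progress.
 */
#include <stdio.h>
#include <errno.h>

#define MAX_ATTEMPTS 3	/* bounded here only so the example terminates */

/* Pretend the server never answers the replayed open request. */
static int send_replay_request(unsigned long long xid)
{
	printf("REPLAY x%llu ... no reply within timeout\n", xid);
	return -ETIMEDOUT;	/* corresponds to "err -110" in the log */
}

static void restart_recovery(void)
{
	printf("request replay timed out, restarting recovery\n");
	printf("import state: REPLAY -> CONNECTING -> REPLAY (reconnect)\n");
}

int main(void)
{
	unsigned long long xid = 1507534108223584ULL;	/* first stuck request in the log */

	for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
		int rc = send_replay_request(xid);
		if (rc == 0)
			break;		/* replay acknowledged, move on to the next request */
		restart_recovery();	/* otherwise the same request is replayed again */
	}
	printf("client keeps cycling on the same replay request\n");
	return 0;
}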
Hi Yujian,
In https://testing.hpdd.intel.com/sub_tests/61999b00-32d7-11e5-a4fd-5254006e85c2 (July 24),
the debug log on the MDS side during recovery was dumped to "/tmp/lustre-log.1437707274.9697".
How can I get that log? Thanks!
This issue could be marked as a duplicate of LU-5951.