[LU-6841] replay-single test_30: multiop 20786 failed Created: 13/Jul/15  Updated: 28/Jan/16  Resolved: 28/Jan/16

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Hongchao Zhang
Resolution: Duplicate Votes: 0
Labels: p4hc

Issue Links:
Related
is related to LU-5319 Support multiple slots per client in ... Resolved
is related to LU-3127 replay-single test_73b: fchmod: No su... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for John Hammond <john.hammond@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/3f0a5c78-27ba-11e5-b37a-5254006e85c2.

The sub-test test_30 failed with the following error:

multiop 20786 failed

Please provide additional information about the failure here.

Info required for matching: replay-single 30



 Comments   
Comment by James Nunez (Inactive) [ 24/Jul/15 ]

We've hit this several times in review-dne-part-2:
2015-07-02 22:11:37 - https://testing.hpdd.intel.com/test_sets/2a34f08c-214e-11e5-b398-5254006e85c2
2015-07-06 01:11:19 - https://testing.hpdd.intel.com/test_sets/5dfd340e-23c6-11e5-a5be-5254006e85c2
2015-07-07 14:35:41 - https://testing.hpdd.intel.com/test_sets/964e83c2-2505-11e5-9d0f-5254006e85c2
2015-07-07 16:17:47 - https://testing.hpdd.intel.com/test_sets/0dc78356-251d-11e5-bf7b-5254006e85c2
2015-07-08 13:58:58 - https://testing.hpdd.intel.com/test_sets/bd146d1c-25c5-11e5-a6b1-5254006e85c2
2015-07-11 01:17:04 - https://testing.hpdd.intel.com/test_sets/3f0a5c78-27ba-11e5-b37a-5254006e85c2
2015-07-23 22:17:07 - https://testing.hpdd.intel.com/test_sets/c3e8da08-31d5-11e5-84cf-5254006e85c2
2015-07-23 14:54:47 - https://testing.hpdd.intel.com/test_sets/a8bae426-3192-11e5-a788-5254006e85c2
2015-07-23 10:36:11 - https://testing.hpdd.intel.com/test_sets/0da58c80-3178-11e5-9558-5254006e85c2
2015-07-23 08:49:47 - https://testing.hpdd.intel.com/test_sets/54d88a1a-312f-11e5-9558-5254006e85c2
2015-07-20 11:51:53 - https://testing.hpdd.intel.com/test_sets/9bb5ee50-2f22-11e5-bb5a-5254006e85c2
2015-07-18 05:32:09 - https://testing.hpdd.intel.com/test_sets/4abb96d6-2d5a-11e5-a112-5254006e85c2
2015-07-17 13:17:20 - https://testing.hpdd.intel.com/test_sets/56c54190-2db2-11e5-8ad8-5254006e85c2
2015-07-13 19:14:15 - https://testing.hpdd.intel.com/test_sets/c30d68dc-2a01-11e5-b04d-5254006e85c2

Comment by James Nunez (Inactive) [ 24/Jul/15 ]

Di - Would you please review these errors and see if they are related to recent DNE landings?
Thank you.

Comment by Di Wang [ 24/Jul/15 ]

Hmm, this is replay-single test 30, and there are no DNE operations in this test, so it is unlikely to be related to the DNE changes.

Comment by Peter Jones [ 28/Jul/15 ]

Hongchao

Could you please look into this issue?

Thanks

Peter

Comment by Hongchao Zhang [ 29/Jul/15 ]

Status update: I am still analyzing the logs and need more time to find the cause.

Comment by Hongchao Zhang [ 30/Jul/15 ]

This could be related to LU-5319 (multiple slots per client in last_rcvd), because it was first encountered while testing that patch: https://testing.hpdd.intel.com/sub_tests/153ee6d4-ffba-11e4-ac6b-5254006e85c2 (on 2015-05-20).

The issue then disappeared and occurred again just after the patch (http://review.whamcloud.com/#/c/14860/) landed on July 1st.

The cause of the bug is that a committed replay request (an open request) times out during replay:

00000100:00100000:0.0:1437707068.959718:0:6229:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3
00000100:00100000:0.0:1437707068.959727:0:6229:0:(client.c:2622:ptlrpc_free_committed()) @@@ keeping (FL_REPLAY)  req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 0 to 0 dl 1437707075 ref 1 fl Complete:RP/4/0 rc 0/0
00000100:00100000:0.0:1437707068.959731:0:6229:0:(client.c:2622:ptlrpc_free_committed()) @@@ keeping (FL_REPLAY)  req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 0 to 0 dl 1437707075 ref 1 fl Complete:RP/4/0 rc 0/0

...

00000100:00080000:0.0:1437707094.279383:0:5882:0:(import.c:1188:ptlrpc_connect_interpret()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from CONNECTING to REPLAY
00000100:00080000:0.0:1437707094.279386:0:5882:0:(import.c:1448:ptlrpc_import_recovery_state_machine()) replay requested by lustre-MDT0000_UUID
00000100:00100000:0.0:1437707094.279389:0:5882:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3
00000100:00100000:0.0:1437707094.279394:0:5882:0:(client.c:2617:ptlrpc_free_committed()) @@@ stopping search  req@ffff88007a0c4000 x1507534108223680/t163208757257(163208757257) o36->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 640/416 e 0 to 0 dl 1437707076 ref 1 fl Complete:R/4/0 rc 0/0
00000100:00080000:0.0:1437707094.279401:0:5882:0:(recover.c:93:ptlrpc_replay_next()) import ffff88007c3bf000 from lustre-MDT0000_UUID committed 163208757256 last 0
00000100:00080000:0.0:1437707094.279408:0:5882:0:(client.c:2928:ptlrpc_replay_req()) @@@ REPLAY  req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 0 to 0 dl 1437707075 ref 1 fl New:RP/4/0 rc 0/0

...

00000100:00000400:0.0:1437707100.279594:0:5882:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1437707094/real 1437707094]  req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707096 ref 2 fl Rpc:XP/4/ffffffff rc 0/-1
00000100:00100000:0.0:1437707100.279621:0:5882:0:(client.c:2049:ptlrpc_expire_one_request()) @@@ err -110, sent_state=REPLAY (now=REPLAY)  req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707096 ref 2 fl Rpc:XP/4/ffffffff rc 0/-1
00000100:00020000:0.0:1437707100.279637:0:5882:0:(client.c:2816:ptlrpc_replay_interpret()) request replay timed out, restarting recovery
00000100:00080000:0.0:1437707100.282045:0:5882:0:(import.c:673:ptlrpc_connect_import()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from REPLAY to CONNECTING
00000100:00080000:0.0:1437707100.282053:0:5882:0:(import.c:518:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to NID 10.1.5.29@tcp last attempt 4306605556
00000100:00080000:0.0:1437707100.282062:0:5882:0:(import.c:596:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: import ffff88007c3bf000 using connection 10.1.5.29@tcp/10.1.5.29@tcp
00000100:00100000:0.0:1437707100.282089:0:5882:0:(import.c:760:ptlrpc_connect_import()) @@@ (re)connect request (timeout 5)  req@ffff88007a0c4f00 x0/t0(0) o38->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1
00000100:00100000:0.0:1437707100.282101:0:5882:0:(client.c:1951:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108223584:10.1.5.29@tcp:101
00000100:00100000:0.0:1437707100.282113:0:5882:0:(client.c:1574:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108223988:10.1.5.29@tcp:38
00000100:00080000:0.0:1437707100.283658:0:5882:0:(import.c:1004:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to target with instance 39
00000100:00080000:0.0:1437707100.283671:0:5882:0:(import.c:1163:ptlrpc_connect_interpret()) reconnected to lustre-MDT0000_UUID@10.1.5.29@tcp after partition
00000100:00080000:0.0:1437707100.283673:0:5882:0:(import.c:1173:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: reconnected to lustre-MDT0000_UUID during replay
00000100:00080000:0.0:1437707100.283676:0:5882:0:(import.c:1179:ptlrpc_connect_interpret()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from CONNECTING to REPLAY
00000100:00080000:0.0:1437707100.283679:0:5882:0:(import.c:1448:ptlrpc_import_recovery_state_machine()) replay requested by lustre-MDT0000_UUID
00000100:00100000:0.0:1437707100.283682:0:5882:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3
00000100:00100000:0.0:1437707100.283686:0:5882:0:(client.c:2617:ptlrpc_free_committed()) @@@ stopping search  req@ffff88007a0c4000 x1507534108223680/t163208757257(163208757257) o36->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 640/416 e 0 to 0 dl 1437707076 ref 1 fl Complete:R/4/0 rc 0/0
00000100:00080000:0.0:1437707100.283693:0:5882:0:(recover.c:93:ptlrpc_replay_next()) import ffff88007c3bf000 from lustre-MDT0000_UUID committed 163208757256 last 0
00000100:00080000:0.0:1437707100.283699:0:5882:0:(client.c:2928:ptlrpc_replay_req()) @@@ REPLAY  req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-

...

00000100:00000400:0.0:1437707106.283630:0:5882:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1437707100/real 1437707100]  req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707102 ref 2 fl Rpc:XP/6/ffffffff rc 0/-1
00000100:00100000:0.0:1437707106.283659:0:5882:0:(client.c:2049:ptlrpc_expire_one_request()) @@@ err -110, sent_state=REPLAY (now=REPLAY)  req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707102 ref 2 fl Rpc:XP/6/ffffffff rc 0/-1
00000100:00020000:0.0:1437707106.283676:0:5882:0:(client.c:2816:ptlrpc_replay_interpret()) request replay timed out, restarting recovery
00000100:00080000:0.0:1437707106.285949:0:5882:0:(import.c:673:ptlrpc_connect_import()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from REPLAY to CONNECTING
00000100:00080000:0.0:1437707106.285956:0:5882:0:(import.c:518:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to NID 10.1.5.29@tcp last attempt 4306611560
00000100:00080000:0.0:1437707106.285965:0:5882:0:(import.c:596:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: import ffff88007c3bf000 using connection 10.1.5.29@tcp/10.1.5.29@tcp
00000100:00100000:0.0:1437707106.285996:0:5882:0:(import.c:760:ptlrpc_connect_import()) @@@ (re)connect request (timeout 5)  req@ffff88007a0c5800 x0/t0(0) o38->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1
00000100:00100000:0.0:1437707106.286008:0:5882:0:(client.c:1951:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108223592:10.1.5.29@tcp:101
00000100:00100000:0.0:1437707106.286020:0:5882:0:(client.c:1574:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108224040:10.1.5.29@tcp:38
00000100:00080000:0.0:1437707106.287815:0:5882:0:(import.c:1004:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to target with instance 39
00000100:00080000:0.0:1437707106.287828:0:5882:0:(import.c:1163:ptlrpc_connect_interpret()) reconnected to lustre-MDT0000_UUID@10.1.5.29@tcp after partition
00000100:00080000:0.0:1437707106.287831:0:5882:0:(import.c:1173:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: reconnected to lustre-MDT0000_UUID during replay
00000100:00080000:0.0:1437707106.287833:0:5882:0:(import.c:1179:ptlrpc_connect_interpret()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from CONNECTING to REPLAY
00000100:00080000:0.0:1437707106.287836:0:5882:0:(import.c:1448:ptlrpc_import_recovery_state_machine()) replay requested by lustre-MDT0000_UUID
00000100:00100000:0.0:1437707106.287839:0:5882:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3
00000100:00100000:0.0:1437707106.287843:0:5882:0:(client.c:2617:ptlrpc_free_committed()) @@@ stopping search  req@ffff88007a0c4000 x1507534108223680/t163208757257(163208757257) o36->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 640/416 e 0 to 0 dl 1437707076 ref 1 fl Complete:R/4/0 rc 0/0
00000100:00080000:0.0:1437707106.287852:0:5882:0:(recover.c:93:ptlrpc_replay_next()) import ffff88007c3bf000 from lustre-MDT0000_UUID committed 163208757256 last 0
00000100:00080000:0.0:1437707106.287857:0:5882:0:(client.c:2928:ptlrpc_replay_req()) @@@ REPLAY  req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707102 ref 1 fl New:EXP/6/ffffffff rc 0/-1
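
For illustration, the following is a minimal userspace model of the loop visible in the logs above. This is a sketch, not Lustre source: the struct and helper names (replay_req, replay_next) are invented here, loosely mirroring the ptlrpc_replay_next()/ptlrpc_replay_req() functions named in the log lines. It shows why recovery never completes: the request kept for replay is neither committed nor abandoned after each timeout, so every reconnect picks it up and replays it again.

/* Toy model (illustrative only, not Lustre code) of the replay livelock:
 * a request kept with FL_REPLAY is re-sent after each reconnect, times
 * out on the server, recovery restarts, and the request is picked again.
 */
#include <stdbool.h>
#include <stdio.h>

struct replay_req {
        unsigned long long transno; /* transaction number to replay */
        bool keep;                  /* still on the replay list     */
};

/* Pick the next request to replay: the first one still kept. */
static struct replay_req *replay_next(struct replay_req *reqs, int n)
{
        for (int i = 0; i < n; i++)
                if (reqs[i].keep)
                        return &reqs[i];
        return NULL;
}

int main(void)
{
        /* Two open requests kept for replay, as in the logs above. */
        struct replay_req reqs[] = {
                { 163208757253ULL, true },
                { 163208757254ULL, true },
        };

        for (int attempt = 1; attempt <= 3; attempt++) {
                struct replay_req *req = replay_next(reqs, 2);

                if (!req)
                        break;
                printf("attempt %d: REPLAY t%llu\n", attempt, req->transno);
                /* The MDT never replies, so the request times out and
                 * recovery restarts; req->keep is never cleared, so the
                 * next pass replays a kept request again.  In the real
                 * logs, replay advances to t163208757254 and loops there. */
                printf("attempt %d: timed out, restarting recovery\n",
                       attempt);
        }
        return 0;
}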

Hi Yujian,
In https://testing.hpdd.intel.com/sub_tests/61999b00-32d7-11e5-a4fd-5254006e85c2 (July 24),
the debug log on the MDS side during recovery was dumped to "/tmp/lustre-log.1437707274.9697".
How can I get that log? Thanks!

Comment by Jian Yu [ 31/Jul/15 ]

Hi Hongchao,
I think the logs are missing because the nodes were re-provisioned to test other patches.

Comment by Gerrit Updater [ 31/Jul/15 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/15814
Subject: LU-6841 target: debug patch to collect more logs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d8fa2f729b6dd8fa5c9ae604308f39306ddd6aa9

Comment by Andreas Dilger [ 07/Aug/15 ]

Hi Gregoire, have you seen any problems similar to this in your testing?

Comment by Gregoire Pichon [ 19/Aug/15 ]

I don't remember having seen this test fail during my local testing.

However, I have to admit that the kernel I use does not include the dev_read_only patch, so the MDT device cannot be made read-only by replay_barrier. This means my test environment is not exactly the same.

Comment by Joseph Gmitter (Inactive) [ 25/Aug/15 ]

This issue has not been seen on any branch since 7/24 and may have been resolved by related landings. We are going to reduce the severity and leave it open.

Comment by Peter Jones [ 28/Aug/15 ]

Let's close the ticket as "cannot reproduce" and reopen it if we do see this again.

Comment by Nathaniel Clark [ 30/Oct/15 ]

Just encountered this on master, but in test replay-single/test_31:
https://testing.hpdd.intel.com/test_sets/7708f4da-7e14-11e5-9c23-5254006e85c2

Comment by Hongchao Zhang [ 06/Nov/15 ]

Status update:
Per the logs, this issue is caused by a replay request being resent during recovery while the MDT fails to find the corresponding reply for the resent request. It could be related to the multiple-slots-per-client support in last_rcvd (LU-5319). I am still analyzing the related code to find where the problem is, and will update the status once there is any progress.
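
To make the suspected failure mode concrete, here is a hedged sketch of the lookup that appears to go wrong; the struct and function names (saved_reply, lookup_saved_reply) are invented for illustration and do not match the actual Lustre code. With multiple reply slots per client, the server must find the saved reply of a resent replay request by its xid; if no slot matches, the MDT can neither reconstruct the reply nor safely re-execute the request, and the client's replay keeps timing out, as in the logs above.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* One saved reply slot (field names are assumptions for this sketch). */
struct saved_reply {
        uint64_t xid;      /* client request xid          */
        uint64_t transno;  /* transaction number assigned */
        int      result;   /* saved return code           */
};

/* Look up the saved reply for a resent request by its xid. */
static const struct saved_reply *
lookup_saved_reply(const struct saved_reply *slots, size_t nslots,
                   uint64_t xid)
{
        for (size_t i = 0; i < nslots; i++)
                if (slots[i].xid == xid)
                        return &slots[i];
        /* NULL here is the failure mode suspected in this ticket. */
        return NULL;
}

int main(void)
{
        const struct saved_reply slots[] = {
                { 1507534108223584ULL, 163208757253ULL, 0 },
        };
        /* A resent xid whose reply slot was not preserved. */
        const struct saved_reply *r =
                lookup_saved_reply(slots, 1, 1507534108223592ULL);

        printf("reply for resent request %s\n", r ? "found" : "NOT found");
        return 0;
}

Judging by its subject line, the patch in the next comment ("check the last reply for resent") targets this lookup path.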

Comment by Gerrit Updater [ 20/Nov/15 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/17303
Subject: LU-6841 target: check the last reply for resent
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 21583ba45fefda43841c36ab3609e529dbadda1d

Comment by Hongchao Zhang [ 10/Dec/15 ]

This issue could be marked as a duplicate of LU-5951.

Comment by Hongchao Zhang [ 28/Jan/16 ]

Closed as a duplicate of LU-5951.
