[LU-6841] replay-single test_30: multiop 20786 failed Created: 13/Jul/15 Updated: 28/Jan/16 Resolved: 28/Jan/16 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Hongchao Zhang |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | p4hc | ||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
This issue was created by maloo for John Hammond <john.hammond@intel.com> This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/3f0a5c78-27ba-11e5-b37a-5254006e85c2. The sub-test test_30 failed with the following error: multiop 20786 failed Please provide additional information about the failure here. Info required for matching: replay-single 30 |
| Comments |
| Comment by James Nunez (Inactive) [ 24/Jul/15 ] |
|
We've hit this several times in review-dne-part-2: |
| Comment by James Nunez (Inactive) [ 24/Jul/15 ] |
|
Di - Would you please review these errors and see if they are related to recent DNE landings? |
| Comment by Di Wang [ 24/Jul/15 ] |
|
Hmm, this is replay-single 30, and no DNE operation in this test. So it is unlikely related with DNE changes. |
| Comment by Peter Jones [ 28/Jul/15 ] |
|
Hongchao Could you please look into this issue? Thanks Peter |
| Comment by Hongchao Zhang [ 29/Jul/15 ] |
|
status update: still analyzing the logs, and still need more time to find the cause of it. |
| Comment by Hongchao Zhang [ 30/Jul/15 ] |
|
this could be related to then this issue disappeared and occurred again just after the patch (http://review.whamcloud.com/#/c/14860/) was landed at July 1th. the reason of the bug is the committed replay request (open request) is timed out during replaying 00000100:00100000:0.0:1437707068.959718:0:6229:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3 00000100:00100000:0.0:1437707068.959727:0:6229:0:(client.c:2622:ptlrpc_free_committed()) @@@ keeping (FL_REPLAY) req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 0 to 0 dl 1437707075 ref 1 fl Complete:RP/4/0 rc 0/0 00000100:00100000:0.0:1437707068.959731:0:6229:0:(client.c:2622:ptlrpc_free_committed()) @@@ keeping (FL_REPLAY) req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 0 to 0 dl 1437707075 ref 1 fl Complete:RP/4/0 rc 0/0 ... 00000100:00080000:0.0:1437707094.279383:0:5882:0:(import.c:1188:ptlrpc_connect_interpret()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from CONNECTING to REPLAY 00000100:00080000:0.0:1437707094.279386:0:5882:0:(import.c:1448:ptlrpc_import_recovery_state_machine()) replay requested by lustre-MDT0000_UUID 00000100:00100000:0.0:1437707094.279389:0:5882:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3 00000100:00100000:0.0:1437707094.279394:0:5882:0:(client.c:2617:ptlrpc_free_committed()) @@@ stopping search req@ffff88007a0c4000 x1507534108223680/t163208757257(163208757257) o36->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 640/416 e 0 to 0 dl 1437707076 ref 1 fl Complete:R/4/0 rc 0/0 00000100:00080000:0.0:1437707094.279401:0:5882:0:(recover.c:93:ptlrpc_replay_next()) import ffff88007c3bf000 from lustre-MDT0000_UUID committed 163208757256 last 0 00000100:00080000:0.0:1437707094.279408:0:5882:0:(client.c:2928:ptlrpc_replay_req()) @@@ REPLAY req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 0 to 0 dl 1437707075 ref 1 fl New:RP/4/0 rc 0/0 ... 00000100:00000400:0.0:1437707100.279594:0:5882:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1437707094/real 1437707094] req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707096 ref 2 fl Rpc:XP/4/ffffffff rc 0/-1 00000100:00100000:0.0:1437707100.279621:0:5882:0:(client.c:2049:ptlrpc_expire_one_request()) @@@ err -110, sent_state=REPLAY (now=REPLAY) req@ffff88007a0c7000 x1507534108223584/t163208757253(163208757253) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707096 ref 2 fl Rpc:XP/4/ffffffff rc 0/-1 00000100:00020000:0.0:1437707100.279637:0:5882:0:(client.c:2816:ptlrpc_replay_interpret()) request replay timed out, restarting recovery 00000100:00080000:0.0:1437707100.282045:0:5882:0:(import.c:673:ptlrpc_connect_import()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from REPLAY to CONNECTING 00000100:00080000:0.0:1437707100.282053:0:5882:0:(import.c:518:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to NID 10.1.5.29@tcp last attempt 4306605556 00000100:00080000:0.0:1437707100.282062:0:5882:0:(import.c:596:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: import ffff88007c3bf000 using connection 10.1.5.29@tcp/10.1.5.29@tcp 00000100:00100000:0.0:1437707100.282089:0:5882:0:(import.c:760:ptlrpc_connect_import()) @@@ (re)connect request (timeout 5) req@ffff88007a0c4f00 x0/t0(0) o38->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1 00000100:00100000:0.0:1437707100.282101:0:5882:0:(client.c:1951:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108223584:10.1.5.29@tcp:101 00000100:00100000:0.0:1437707100.282113:0:5882:0:(client.c:1574:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108223988:10.1.5.29@tcp:38 00000100:00080000:0.0:1437707100.283658:0:5882:0:(import.c:1004:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to target with instance 39 00000100:00080000:0.0:1437707100.283671:0:5882:0:(import.c:1163:ptlrpc_connect_interpret()) reconnected to lustre-MDT0000_UUID@10.1.5.29@tcp after partition 00000100:00080000:0.0:1437707100.283673:0:5882:0:(import.c:1173:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: reconnected to lustre-MDT0000_UUID during replay 00000100:00080000:0.0:1437707100.283676:0:5882:0:(import.c:1179:ptlrpc_connect_interpret()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from CONNECTING to REPLAY 00000100:00080000:0.0:1437707100.283679:0:5882:0:(import.c:1448:ptlrpc_import_recovery_state_machine()) replay requested by lustre-MDT0000_UUID 00000100:00100000:0.0:1437707100.283682:0:5882:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3 00000100:00100000:0.0:1437707100.283686:0:5882:0:(client.c:2617:ptlrpc_free_committed()) @@@ stopping search req@ffff88007a0c4000 x1507534108223680/t163208757257(163208757257) o36->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 640/416 e 0 to 0 dl 1437707076 ref 1 fl Complete:R/4/0 rc 0/0 00000100:00080000:0.0:1437707100.283693:0:5882:0:(recover.c:93:ptlrpc_replay_next()) import ffff88007c3bf000 from lustre-MDT0000_UUID committed 163208757256 last 0 00000100:00080000:0.0:1437707100.283699:0:5882:0:(client.c:2928:ptlrpc_replay_req()) @@@ REPLAY req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc- .. 00000100:00000400:0.0:1437707106.283630:0:5882:0:(client.c:2020:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1437707100/real 1437707100] req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707102 ref 2 fl Rpc:XP/6/ffffffff rc 0/-1 00000100:00100000:0.0:1437707106.283659:0:5882:0:(client.c:2049:ptlrpc_expire_one_request()) @@@ err -110, sent_state=REPLAY (now=REPLAY) req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707102 ref 2 fl Rpc:XP/6/ffffffff rc 0/-1 00000100:00020000:0.0:1437707106.283676:0:5882:0:(client.c:2816:ptlrpc_replay_interpret()) request replay timed out, restarting recovery 00000100:00080000:0.0:1437707106.285949:0:5882:0:(import.c:673:ptlrpc_connect_import()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from REPLAY to CONNECTING 00000100:00080000:0.0:1437707106.285956:0:5882:0:(import.c:518:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to NID 10.1.5.29@tcp last attempt 4306611560 00000100:00080000:0.0:1437707106.285965:0:5882:0:(import.c:596:import_select_connection()) lustre-MDT0000-mdc-ffff88007c3bb800: import ffff88007c3bf000 using connection 10.1.5.29@tcp/10.1.5.29@tcp 00000100:00100000:0.0:1437707106.285996:0:5882:0:(import.c:760:ptlrpc_connect_import()) @@@ (re)connect request (timeout 5) req@ffff88007a0c5800 x0/t0(0) o38->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1 00000100:00100000:0.0:1437707106.286008:0:5882:0:(client.c:1951:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108223592:10.1.5.29@tcp:101 00000100:00100000:0.0:1437707106.286020:0:5882:0:(client.c:1574:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:89a94e57-b8e5-bb46-dc76-c5424df90dcb:5882:1507534108224040:10.1.5.29@tcp:38 00000100:00080000:0.0:1437707106.287815:0:5882:0:(import.c:1004:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: connect to target with instance 39 00000100:00080000:0.0:1437707106.287828:0:5882:0:(import.c:1163:ptlrpc_connect_interpret()) reconnected to lustre-MDT0000_UUID@10.1.5.29@tcp after partition 00000100:00080000:0.0:1437707106.287831:0:5882:0:(import.c:1173:ptlrpc_connect_interpret()) lustre-MDT0000-mdc-ffff88007c3bb800: reconnected to lustre-MDT0000_UUID during replay 00000100:00080000:0.0:1437707106.287833:0:5882:0:(import.c:1179:ptlrpc_connect_interpret()) ffff88007c3bf000 lustre-MDT0000_UUID: changing import state from CONNECTING to REPLAY 00000100:00080000:0.0:1437707106.287836:0:5882:0:(import.c:1448:ptlrpc_import_recovery_state_machine()) replay requested by lustre-MDT0000_UUID 00000100:00100000:0.0:1437707106.287839:0:5882:0:(client.c:2591:ptlrpc_free_committed()) lustre-MDT0000-mdc-ffff88007c3bb800: committing for last_committed 163208757256 gen 3 00000100:00100000:0.0:1437707106.287843:0:5882:0:(client.c:2617:ptlrpc_free_committed()) @@@ stopping search req@ffff88007a0c4000 x1507534108223680/t163208757257(163208757257) o36->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 640/416 e 0 to 0 dl 1437707076 ref 1 fl Complete:R/4/0 rc 0/0 00000100:00080000:0.0:1437707106.287852:0:5882:0:(recover.c:93:ptlrpc_replay_next()) import ffff88007c3bf000 from lustre-MDT0000_UUID committed 163208757256 last 0 00000100:00080000:0.0:1437707106.287857:0:5882:0:(client.c:2928:ptlrpc_replay_req()) @@@ REPLAY req@ffff88007a0c5e00 x1507534108223592/t163208757254(163208757254) o101->lustre-MDT0000-mdc-ffff88007c3bb800@10.1.5.29@tcp:12/10 lens 816/600 e 1 to 1 dl 1437707102 ref 1 fl New:EXP/6/ffffffff rc 0/-1 Hi Yujian, |
| Comment by Jian Yu [ 31/Jul/15 ] |
|
Hi Hongchao, |
| Comment by Gerrit Updater [ 31/Jul/15 ] |
|
Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/15814 |
| Comment by Andreas Dilger [ 07/Aug/15 ] |
|
Hi Gregoire, have you seen any problems similar to this in your testing? |
| Comment by Gregoire Pichon [ 19/Aug/15 ] |
|
I don't remember having seen this test fail during my local testing. However, I have to admit that the kernel I use does not include the dev_read_only patch, which means the MDT device cannot be made readonly by replay_barrier. This means the test environment is not exactly the same. |
| Comment by Joseph Gmitter (Inactive) [ 25/Aug/15 ] |
|
This issue has not been seen on any branch since 7/24 and may have been resolved by related landings. We are going to reduce the severity and leave it open. |
| Comment by Peter Jones [ 28/Aug/15 ] |
|
Let's close the ticket as cannot report and reopen if we do see it again |
| Comment by Nathaniel Clark [ 30/Oct/15 ] |
|
Just encountered on master but test replay-single/test_31 |
| Comment by Hongchao Zhang [ 06/Nov/15 ] |
|
Status update: |
| Comment by Gerrit Updater [ 20/Nov/15 ] |
|
Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/17303 |
| Comment by Hongchao Zhang [ 10/Dec/15 ] |
|
This issue could be marked as a duplicate of |
| Comment by Hongchao Zhang [ 28/Jan/16 ] |
|
closed as duplicate of |