[LU-5312] sanity test_161a: cannot create regular file '...f.sanity.161a': Input/output error
| Created: | 09/Jul/14 | Updated: | 08/Sep/15 | Resolved: | 01/Dec/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | dne |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 14840 |
| Description |
|
This issue was created by maloo for John Hammond <john.hammond@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/ba2bfa4a-6e59-11e3-bae0-52540035b04c.

The sub-test test_161a failed with the following error:

Info required for matching: sanity 161a |
| Comments |
| Comment by Andreas Dilger [ 09/Jul/14 ] |
|
This appears to have the same symptoms as |
| Comment by Di Wang [ 09/Jul/14 ] |
| Comment by Oleg Drokin [ 13/Aug/14 ] |
|
Apparently this is still being hit from time to time: |
| Comment by Di Wang [ 14/Aug/14 ] |
|
This is actually a different problem; it seems related to OSP rather than LWP:

00000100:00080000:1.0:1407905310.504717:0:3659:0:(import.c:1004:ptlrpc_connect_interpret()) @@@ lustre-MDT0000-osp-MDT0001: evicting (reconnect/recover flags not set: 4) req@ffff88007c0c2800 x1476289415644816/t0(0) o38->lustre-MDT0000-osp-MDT0001@10.2.4.185@tcp:24/4 lens 400/264 e 0 to 0 dl 1407905346 ref 1 fl Interpret:RN/0/0 rc 0/0
00000100:00080000:1.0:1407905310.504720:0:3659:0:(import.c:1007:ptlrpc_connect_interpret()) ffff880079bdf800 lustre-MDT0000_UUID: changing import state from CONNECTING to EVICTED

So it seems MDT1 is being evicted by MDT0 (which is being restarted during the test). According to the debug log:

1. MDT1 is in the final stage of recovery at 1407570696, so it sends the final PING to MDT0:

00000100:00080000:1.0:1407570696.905010:0:3680:0:(import.c:1420:ptlrpc_import_recovery_state_machine()) ffff880067730800 lustre-MDT0000_UUID: changing import state from REPLAY to REPLAY_LOCKS
00010000:00010000:1.0:1407570696.905011:0:3680:0:(ldlm_request.c:2278:ldlm_cancel_unused_locks_for_replay()) Dropping as many unused locks as possible beforereplay for namespace lustre-MDT0000-osp-MDT0001 (0)
00010000:00010000:1.0:1407570696.905012:0:3680:0:(ldlm_request.c:2287:ldlm_cancel_unused_locks_for_replay()) Canceled 0 unused locks from namespace lustre-MDT0000-osp-MDT0001
00000100:00080000:1.0:1407570696.905287:0:3680:0:(import.c:1430:ptlrpc_import_recovery_state_machine()) ffff880067730800 lustre-MDT0000_UUID: changing import state from REPLAY_LOCKS to REPLAY_WAIT
00000100:00100000:1.0:1407570696.905292:0:3680:0:(client.c:1863:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:lustre-MDT0001-mdtlov_UUID:3680:1475938571525140:10.2.4.194@tcp:38
00000100:00100000:1.0:1407570696.905297:0:3680:0:(client.c:1480:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:lustre-MDT0002-mdtlov_UUID:3680:1475938571525152:10.2.4.194@tcp:400
00000100:00100000:1.0:1407570696.905318:0:3680:0:(client.c:1480:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:lustre-MDT0001-mdtlov_UUID:3680:1475938571525156:10.2.4.194@tcp:400

2. MDT0 queues the ping at the same time, 1407570696:

00000100:00100000:0.0:1407570696.907586:0:710:0:(service.c:2092:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_out00_001:lustre-MDT0001-mdtlov_UUID+4:3680:x1475938571525156:12345-10.2.4.198@tcp:400
00010000:00080000:0.0:1407570696.907589:0:710:0:(ldlm_lib.c:2226:target_queue_recovery_request()) @@@ queue final req req@ffff880066ddf000 x1475938571525156/t0(0) o400->lustre-MDT0001-mdtlov_UUID@10.2.4.198@tcp:0/0 lens 224/0 e 0 to 0 dl 1407570702 ref 2 fl Interpret:/c0/ffffffff rc 0/-1
00000100:00100000:0.0:1407570696.907593:0:710:0:(service.c:2142:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_out00_001:lustre-MDT0001-mdtlov_UUID+4:3680:x1475938571525156:12345-10.2.4.198@tcp:400 Request procesed in 7us (51us total) trans 0 rc 0/-999

3. MDT0 processes the final ping request 24 seconds later; it should reply to MDT1 with "RECOVERY complete" to signal that recovery is done:

00010000:00080000:1.0:1407570720.856384:0:747:0:(ldlm_lib.c:2061:target_recovery_thread()) @@@ processing final ping from 10.2.4.198@tcp: req@ffff880066ddf000 x1475938571525156/t0(0) o400->lustre-MDT0001-mdtlov_UUID@10.2.4.198@tcp:0/0 lens 224/0 e 710876 to 0 dl 1407570712 ref 1 fl Complete:/c0/ffffffff rc 0/-1

4. For some unknown reason, MDT1 gets the "RECOVERY complete" reply from MDT0 only after about 20 seconds:

00000100:00080000:1.0:1407570744.024908:0:3680:0:(import.c:1439:ptlrpc_import_recovery_state_machine()) ffff880067730800 lustre-MDT0000_UUID: changing import state from REPLAY_WAIT to RECOVER
00000100:00080000:1.0:1407570744.024910:0:3680:0:(import.c:1446:ptlrpc_import_recovery_state_machine()) reconnected to lustre-MDT0000_UUID@10.2.4.194@tcp
00000100:00080000:1.0:1407570744.024911:0:3680:0:(client.c:2578:ptlrpc_resend_req()) @@@ going to resend req@ffff88005c8f3400 x1475938571525156/t0(0) o400->lustre-MDT0000-osp-MDT0001@10.2.4.194@tcp:24/4 lens 224/192 e 710876 to 0 dl 1407570754 ref 1 fl Interpret:R/c0/0 rc 0/0
00000100:00080000:1.0:1407570744.024914:0:3680:0:(client.c:2585:ptlrpc_resend_req()) @@@ it has reply, so skip it req@ffff88005c8f3400 x1475938571525156/t0(0) o400->lustre-MDT0000-osp-MDT0001@10.2.4.194@tcp:24/4 lens 224/192 e 710876 to 0 dl 1407570754 ref 1 fl Interpret:R/c0/0 rc 0/0
00000100:00080000:1.0:1407570744.024916:0:3680:0:(import.c:1451:ptlrpc_import_recovery_state_machine()) ffff880067730800 lustre-MDT0000_UUID: changing import state from RECOVER to FULL
00000100:02000000:1.0:1407570744.024918:0:3680:0:(import.c:1459:ptlrpc_import_recovery_state_machine()) lustre-MDT0000-osp-MDT0001: Connection restored to lustre-MDT0000 (at 10.2.4.194@tcp)

5. In the meantime, the ping evictor on MDT0 evicts the export from MDT1, because MDT1 cannot ping MDT0 during the recovery stage, i.e. while the import state is not FULL:

00000100:00080000:0.0:1407570744.291927:0:700:0:(pinger.c:641:ping_evictor_main()) evicting all exports of obd lustre-MDT0000 older than 1407570714
00000100:02000400:0.0:1407570744.291936:0:700:0:(pinger.c:667:ping_evictor_main()) lustre-MDT0000: haven't heard from client lustre-MDT0001-mdtlov_UUID (at 10.2.4.198@tcp) in 49 seconds. I think it's dead, and I am evicting it. exp ffff88007a9cec00, cur 1407570744 expire 1407570714 last 1407570695

I am not sure why we are seeing this only now; it is probably due to some recent changes, but I have not dug into it yet. I think the way to fix this might be to update exp_last_request_time in stage 3, because the client cannot ping the server while it is waiting for the "final recovery" signal, so we should refresh exp_last_request_time once the server is ready to accept pings and other requests. |
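For illustration only, here is a minimal, self-contained C sketch of the timing race described above. It is not Lustre code: the names (export_model, would_evict, EVICT_WINDOW) are hypothetical, the timestamps are taken from the debug log (last request 1407570695, final ping processed 1407570720, evictor pass at 1407570744), and the 30-second window is an assumption inferred from "older than 1407570714" at cur 1407570744. The second half of main() models the proposed fix of refreshing exp_last_request_time in stage 3.

#include <stdio.h>
#include <stdbool.h>
#include <time.h>

/*
 * Toy model of the recovery/ping-evictor race in this ticket.
 * Not Lustre code; constants come from the log lines quoted above.
 */
#define EVICT_WINDOW 30 /* assumed staleness window, in seconds */

struct export_model {
    time_t last_request_time;   /* stands in for exp_last_request_time */
};

/* Model of the evictor's check: evict exports older than cur - window. */
static bool would_evict(const struct export_model *exp, time_t cur)
{
    return exp->last_request_time < cur - EVICT_WINDOW;
}

int main(void)
{
    time_t last_ping       = 1407570695;  /* MDT1's last request before recovery */
    time_t final_ping_done = 1407570720;  /* stage 3: server processes final ping */
    time_t evictor_pass    = 1407570744;  /* stage 5: ping evictor runs on MDT0 */

    struct export_model exp = { .last_request_time = last_ping };

    /* Current behaviour: nothing refreshes the export during recovery. */
    printf("without refresh: stale for %lds -> %s\n",
           (long)(evictor_pass - exp.last_request_time),
           would_evict(&exp, evictor_pass) ? "EVICTED" : "kept");

    /* Proposed fix (stage 3): refresh the timestamp when the final ping
     * is processed, since the client cannot ping any earlier. */
    exp.last_request_time = final_ping_done;
    printf("with refresh:    stale for %lds -> %s\n",
           (long)(evictor_pass - exp.last_request_time),
           would_evict(&exp, evictor_pass) ? "EVICTED" : "kept");

    return 0;
}

Run as-is, the first line reports the export as 49 seconds stale (matching the "haven't heard from client ... in 49 seconds" eviction message) and evicted; with the stage-3 refresh it is only 24 seconds stale and survives the evictor pass.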
| Comment by Di Wang [ 14/Aug/14 ] |
| Comment by Jodi Levi (Inactive) [ 01/Dec/14 ] |
|
Patches landed to Master. |