[LU-5312] sanity test_161a: cannot create regular file '...f.sanity.161a': Input/output error Created: 09/Jul/14  Updated: 08/Sep/15  Resolved: 01/Dec/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Major
Reporter: Maloo Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: dne

Issue Links:
Related
is related to LU-4420 sanity test_161a: cannot create regul... Resolved
is related to LU-7115 fld_client_rpc() may run into deadloop Resolved
Severity: 3
Rank (Obsolete): 14840

 Description   

This issue was created by maloo for John Hammond <john.hammond@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/ba2bfa4a-6e59-11e3-bae0-52540035b04c.

The sub-test test_161a failed with the following error:

failed to hardlink many files

Info required for matching: sanity 161a



 Comments   
Comment by Andreas Dilger [ 09/Jul/14 ]

This appears to have the same symptoms as LU-4420, but we're moving it to a separate bug because LU-4420 had a fix landed for 2.6.0 and it isn't clear if this new failure is the same root cause or just the same failure symptoms. There haven't been any other failures of this test in the past 4 weeks, while LU-4420 failed regularly.

Comment by Di Wang [ 09/Jul/14 ]

http://review.whamcloud.com/11039

Comment by Oleg Drokin [ 13/Aug/14 ]

Apparently this is still being hit from time to time:
https://testing.hpdd.intel.com/test_sets/9c3f9cea-22d0-11e4-94dd-5254006e85c2

Comment by Di Wang [ 14/Aug/14 ]

This is actually a different problem; it seems related to OSP instead of LWP.

00000100:00080000:1.0:1407905310.504717:0:3659:0:(import.c:1004:ptlrpc_connect_interpret()) @@@ lustre-MDT0000-osp-MDT0001: evicting (reconnect/recover flags not set: 4)  req@ffff88007c0c2800 x1476289415644816/t0(0) o38->lustre-MDT0000-osp-MDT0001@10.2.4.185@tcp:24/4 lens 400/264 e 0 to 0 dl 1407905346 ref 1 fl Interpret:RN/0/0 rc 0/0
00000100:00080000:1.0:1407905310.504720:0:3659:0:(import.c:1007:ptlrpc_connect_interpret()) ffff880079bdf800 lustre-MDT0000_UUID: changing import state from CONNECTING to EVICTED

So it seems MDT1 is being evicted by MDT0 (which was restarted during the test).

According to the debug log:

1. MDT1 is in the final stage of recovery at 1407570696, so it sends the final PING to MDT0:

00000100:00080000:1.0:1407570696.905010:0:3680:0:(import.c:1420:ptlrpc_import_recovery_state_machine()) ffff880067730800 lustre-MDT0000_UUID: changing import state from REPLAY to REPLAY_LOCKS
00010000:00010000:1.0:1407570696.905011:0:3680:0:(ldlm_request.c:2278:ldlm_cancel_unused_locks_for_replay()) Dropping as many unused locks as possible beforereplay for namespace lustre-MDT0000-osp-MDT0001 (0)
00010000:00010000:1.0:1407570696.905012:0:3680:0:(ldlm_request.c:2287:ldlm_cancel_unused_locks_for_replay()) Canceled 0 unused locks from namespace lustre-MDT0000-osp-MDT0001
00000100:00080000:1.0:1407570696.905287:0:3680:0:(import.c:1430:ptlrpc_import_recovery_state_machine()) ffff880067730800 lustre-MDT0000_UUID: changing import state from REPLAY_LOCKS to REPLAY_WAIT
00000100:00100000:1.0:1407570696.905292:0:3680:0:(client.c:1863:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:lustre-MDT0001-mdtlov_UUID:3680:1475938571525140:10.2.4.194@tcp:38
00000100:00100000:1.0:1407570696.905297:0:3680:0:(client.c:1480:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:lustre-MDT0002-mdtlov_UUID:3680:1475938571525152:10.2.4.194@tcp:400
00000100:00100000:1.0:1407570696.905318:0:3680:0:(client.c:1480:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_rcv:lustre-MDT0001-mdtlov_UUID:3680:1475938571525156:10.2.4.194@tcp:400

2. MDT0 queues the ping at the same time (1407570696):

00000100:00100000:0.0:1407570696.907586:0:710:0:(service.c:2092:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_out00_001:lustre-MDT0001-mdtlov_UUID+4:3680:x1475938571525156:12345-10.2.4.198@tcp:400
00010000:00080000:0.0:1407570696.907589:0:710:0:(ldlm_lib.c:2226:target_queue_recovery_request()) @@@ queue final req  req@ffff880066ddf000 x1475938571525156/t0(0) o400->lustre-MDT0001-mdtlov_UUID@10.2.4.198@tcp:0/0 lens 224/0 e 0 to 0 dl 1407570702 ref 2 fl Interpret:/c0/ffffffff rc 0/-1
00000100:00100000:0.0:1407570696.907593:0:710:0:(service.c:2142:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_out00_001:lustre-MDT0001-mdtlov_UUID+4:3680:x1475938571525156:12345-10.2.4.198@tcp:400 Request procesed in 7us (51us total) trans 0 rc 0/-999

3. MDT0 processes the final ping request 24 seconds later; it should reply to MDT1 with "RECOVERY complete" to indicate that recovery is done:

00010000:00080000:1.0:1407570720.856384:0:747:0:(ldlm_lib.c:2061:target_recovery_thread()) @@@ processing final ping from 10.2.4.198@tcp:   req@ffff880066ddf000 x1475938571525156/t0(0) o400->lustre-MDT0001-mdtlov_UUID@10.2.4.198@tcp:0/0 lens 224/0 e 710876 to 0 dl 1407570712 ref 1 fl Complete:/c0/ffffffff rc 0/-1

4. For some unknown reason, MDT1 gets the "RECOVERY complete" reply from MDT0 more than 20 seconds later:

00000100:00080000:1.0:1407570744.024908:0:3680:0:(import.c:1439:ptlrpc_import_recovery_state_machine()) ffff880067730800 lustre-MDT0000_UUID: changing import state from REPLAY_WAIT to RECOVER
00000100:00080000:1.0:1407570744.024910:0:3680:0:(import.c:1446:ptlrpc_import_recovery_state_machine()) reconnected to lustre-MDT0000_UUID@10.2.4.194@tcp
00000100:00080000:1.0:1407570744.024911:0:3680:0:(client.c:2578:ptlrpc_resend_req()) @@@ going to resend  req@ffff88005c8f3400 x1475938571525156/t0(0) o400->lustre-MDT0000-osp-MDT0001@10.2.4.194@tcp:24/4 lens 224/192 e 710876 to 0 dl 1407570754 ref 1 fl Interpret:R/c0/0 rc 0/0
00000100:00080000:1.0:1407570744.024914:0:3680:0:(client.c:2585:ptlrpc_resend_req()) @@@ it has reply, so skip it  req@ffff88005c8f3400 x1475938571525156/t0(0) o400->lustre-MDT0000-osp-MDT0001@10.2.4.194@tcp:24/4 lens 224/192 e 710876 to 0 dl 1407570754 ref 1 fl Interpret:R/c0/0 rc 0/0
00000100:00080000:1.0:1407570744.024916:0:3680:0:(import.c:1451:ptlrpc_import_recovery_state_machine()) ffff880067730800 lustre-MDT0000_UUID: changing import state from RECOVER to FULL
00000100:02000000:1.0:1407570744.024918:0:3680:0:(import.c:1459:ptlrpc_import_recovery_state_machine()) lustre-MDT0000-osp-MDT0001: Connection restored to lustre-MDT0000 (at 10.2.4.194@tcp)

5. In the meantime, the ping evictor on MDT0 evicts the export for MDT1, because MDT1 cannot ping MDT0 during the recovery stage, i.e. while the import state is not FULL:

00000100:00080000:0.0:1407570744.291927:0:700:0:(pinger.c:641:ping_evictor_main()) evicting all exports of obd lustre-MDT0000 older than 1407570714
00000100:02000400:0.0:1407570744.291936:0:700:0:(pinger.c:667:ping_evictor_main()) lustre-MDT0000: haven't heard from client lustre-MDT0001-mdtlov_UUID (at 10.2.4.198@tcp) in 49 seconds. I think it's dead, and I am evicting it. exp ffff88007a9cec00, cur 1407570744 expire 1407570714 last 1407570695

I am not sure why we are seeing this now; it is probably due to some recent changes that I have not dug into yet. I think the way to fix this might be to update exp_last_request_time in stage 3, because the client cannot ping the server while it is waiting for the "final recovery" signal. So we should refresh exp_last_request_time once the server is ready to accept pings and other requests.

Comment by Di Wang [ 14/Aug/14 ]

http://review.whamcloud.com/11443

Comment by Jodi Levi (Inactive) [ 01/Dec/14 ]

Patches landed to Master.

Generated at Sat Feb 10 01:50:23 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.