[LU-4584] Lock revocation process fails consistently Created: 05/Feb/14  Updated: 13/Oct/21  Resolved: 12/Dec/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Christopher Morrone Assignee: Bruno Faccini (Inactive)
Resolution: Cannot Reproduce Votes: 1
Labels: llnl

Attachments: File 172.16.66.4@tcp.log.bz2     File 172.16.66.5@tcp.log.bz2     File 172.20.20.201@o2ib500.log.bz2     Text File client_log_20140206.txt     File inflames.log    
Issue Links:
Duplicate
duplicates LU-4963 client eviction during IOR test - loc... Closed
Related
is related to LU-2827 mdt_intent_fixup_resent() cannot find... Resolved
is related to LU-5525 ASSERTION( new_lock->l_readers + new_... Resolved
is related to LU-5632 ldlm_lock_addref()) ASSERTION( lock !... Resolved
is related to LU-5686 (mdt_handler.c:3203:mdt_intent_lock_r... Resolved
Severity: 3
Rank (Obsolete): 12530

 Description   

Some users have reported to us that the "rm" command is taking a long time. Some investigation revealed that at least the first "rm" in a directory takes just over 100 seconds, which of course sounds like OBD_TIMEOUT_DEFAULT.

This isn't necessarily the simplest reproducer, but the following one is completely consistent (a command-level sketch follows the list):

  1. set directory striping default count to 48
  2. touch a file on client A
  3. rm file on client B
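
For anyone trying to replicate this, a minimal command-level sketch (the mount point and directory name here are illustrative, not from the original report):

# any client: make new files in the test directory default to 48 stripes
lfs setstripe -c 48 /mnt/fsv/testdir

# client A
touch /mnt/fsv/testdir/foo

# client B: this is the rm that stalls for ~100 seconds
rm /mnt/fsv/testdir/foo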

The clients are running 2.4.0-19chaos, servers are at 2.4.0-21chaos. The servers are using zfs as the backend.

I have some lustre logs that I will share and talk about in additional posts to this ticket. But essentially it looks like the server always times out on an AST to client A (explaining the 100 second delay). It is not really clear to me yet why that happens, because client A appears to be completely responsive. My current suspicion is that the MDT is to blame.



 Comments   
Comment by Christopher Morrone [ 05/Feb/14 ]

Actually, it looks like the MDT evicts both the client on which the file was created and the client that is running robinhood (changelog client).

Here is an example. The players are:

  • 172.16.66.4@tcp - Lustre client on which I ran "touch foo"
  • 172.16.66.5@tcp - Lustre client on which I ran "rm foo"
  • 172.16.66.30@tcp - Lustre client which is consuming changelogs (it is running robinhood)
  • 172.20.20.201@o2ib500 - Lustre MDS

So I start by doing "rm foo" on 172.16.66.5@tcp:

00000080:00200000:10.0:1391563489.083622:0:212662:0:(file.c:3001:ll_inode_permission()) VFS Op:inode=144115923974954140/33554603(ffff8817ad125238), inode mode 41c0 mask 1
00000080:00200000:10.0:1391563489.083623:0:212662:0:(dcache.c:401:ll_revalidate_it()) VFS Op:name=foo,intent=getattr
00000080:00200000:10.0:1391563489.083625:0:212662:0:(file.c:3001:ll_inode_permission()) VFS Op:inode=144115923974954140/33554603(ffff8817ad125238), inode mode 41c0 mask 3
00000080:00200000:10.0:1391563489.083627:0:212662:0:(namei.c:1263:ll_unlink_generic()) VFS Op:name=foo,dir=144115923974954140/33554603(ffff8817ad125238)
00000100:00100000:10.0:1391563489.083656:0:212662:0:(client.c:1441:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc rm:a3cebad5-b812-989a-0fc6-cab928270ea2:212662:1457223513206688:172.20.20.201@o2ib500:36
00000100:00100000:10.0:1391563489.083668:0:212662:0:(client.c:2132:ptlrpc_set_wait()) set ffff8810df789240 going to sleep for 61 seconds

Let's follow that RPC over to the MDS on 172.20.20.201@o2ib500:

00000100:00100000:8.0:1391563489.083302:0:18680:0:(service.c:1867:ptlrpc_server_handle_req_in()) got req x1457223513206688
00000100:00100000:8.0:1391563489.083331:0:18680:0:(nrs_fifo.c:182:nrs_fifo_req_get()) NRS start fifo request from 12345-172.16.66.5@tcp, seq: 322652232
00000100:00100000:8.0:1391563489.083335:0:18680:0:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt02_078:a3cebad5-b812-989a-0fc6-cab928270ea2+86:212662:x1457223513206688:12345-172.16.66.5@tcp:36
00000004:00000002:8.0:1391563489.083339:0:18680:0:(mdt_handler.c:1806:mdt_reint_opcode()) @@@ reint opt = 4  req@ffff88082d057800 x1457223513206688/t0(0) o36->a3cebad5-b812-989a-0fc6-cab928270ea2@172.16.66.5@tcp:0/0 lens 488/0 e 0 to 0 dl 1391563550 ref 1 fl Interpret:/0/ffffffff rc 0/-1
00000004:00000002:8.0:1391563489.083347:0:18680:0:(mdt_reint.c:686:mdt_reint_unlink()) @@@ unlink [0x20000ab57:0x11c9c:0x0]/foo  req@ffff88082d057800 x1457223513206688/t0(0) o36->a3cebad5-b812-989a-0fc6-cab928270ea2@172.16.66.5@tcp:0/0 lens 488/696 e 0 to 0 dl 1391563550 ref 1 fl Interpret:/0/0 rc 0/0
00000004:00000002:8.0:1391563489.083361:0:18680:0:(mdt_reint.c:100:mdt_obj_version_get()) FID [0x20000ab57:0x11c9c:0x0] version is 0x14183989a3
00000100:00100000:8.0:1391563489.083383:0:18680:0:(client.c:1441:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc mdt02_078:fsv-MDT0000_UUID:18680:1458673558372844:172.16.66.5@tcp:104
00000100:00100000:8.0:1391563489.083395:0:18680:0:(client.c:1441:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc mdt02_078:fsv-MDT0000_UUID:18680:1458673558372848:172.16.66.30@tcp:104
00000100:00100000:8.0:1391563489.083404:0:18680:0:(client.c:1441:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc mdt02_078:fsv-MDT0000_UUID:18680:1458673558372852:172.16.66.30@tcp:104
00000100:00100000:8.0:1391563489.083411:0:18680:0:(client.c:1441:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc mdt02_078:fsv-MDT0000_UUID:18680:1458673558372856:172.16.66.4@tcp:104
00000100:00100000:8.0:1391563489.083419:0:18680:0:(client.c:1441:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc mdt02_078:fsv-MDT0000_UUID:18680:1458673558372860:172.16.66.4@tcp:104
00000100:00100000:8.0:1391563489.083425:0:18680:0:(client.c:1441:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc mdt02_078:fsv-MDT0000_UUID:18680:1458673558372864:172.16.66.4@tcp:104
00000100:00100000:8.0:1391563489.083429:0:18680:0:(client.c:2132:ptlrpc_set_wait()) set ffff880e25a0c900 going to sleep for 61 seconds
00000100:00100000:8.0:1391563489.084163:0:18680:0:(client.c:1809:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc mdt02_078:fsv-MDT0000_UUID:18680:1458673558372852:172.16.66.30@tcp:104
00000100:00100000:8.0:1391563489.084185:0:18680:0:(client.c:1809:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc mdt02_078:fsv-MDT0000_UUID:18680:1458673558372856:172.16.66.4@tcp:104
00000100:00100000:8.0:1391563489.084191:0:18680:0:(client.c:1809:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc mdt02_078:fsv-MDT0000_UUID:18680:1458673558372844:172.16.66.5@tcp:104
00000100:00100000:8.0:1391563489.084194:0:18680:0:(client.c:1809:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc mdt02_078:fsv-MDT0000_UUID:18680:1458673558372860:172.16.66.4@tcp:104
00000100:00100000:8.0:1391563489.084234:0:18680:0:(client.c:1809:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc mdt02_078:fsv-MDT0000_UUID:18680:1458673558372848:172.16.66.30@tcp:104
00000100:00100000:8.0:1391563489.084240:0:18680:0:(client.c:1809:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc mdt02_078:fsv-MDT0000_UUID:18680:1458673558372864:172.16.66.4@tcp:104
00000100:00100000:6.0:1391563489.084258:0:14480:0:(events.c:352:request_in_callback()) peer: 12345-172.16.66.30@tcp
00000100:00100000:13.0:1391563489.084300:0:32258:0:(service.c:1867:ptlrpc_server_handle_req_in()) got req x1455014558386756
00000100:00100000:13.0:1391563489.084310:0:32258:0:(nrs_fifo.c:182:nrs_fifo_req_get()) NRS start fifo request from 12345-172.16.66.30@tcp, seq: 24327657
00000100:00100000:13.0:1391563489.084314:0:32258:0:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cn03_006:2120d720-a7c5-a236-f306-8bdb94a8b962+9371:6961:x1455014558386756:12345-172.16.66.30@tcp:103
00010000:00000002:13.0:1391563489.084317:0:32258:0:(ldlm_lockd.c:2368:ldlm_cancel_handler()) cancel
00000100:00100000:13.0:1391563489.084332:0:32258:0:(service.c:2055:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cn03_006:2120d720-a7c5-a236-f306-8bdb94a8b962+9370:6961:x1455014558386756:12345-172.16.66.30@tcp:103 Request procesed in 21us (75us total) trans 0 rc 0/0
00000100:00100000:13.0:1391563489.084334:0:32258:0:(nrs_fifo.c:244:nrs_fifo_req_stop()) NRS stop fifo request from 12345-172.16.66.30@tcp, seq: 24327657
00000100:00100000:14.0:1391563489.084423:0:14484:0:(events.c:352:request_in_callback()) peer: 12345-172.16.66.4@tcp
00000100:00100000:13.0:1391563489.084426:0:32258:0:(service.c:1867:ptlrpc_server_handle_req_in()) got req x1450886429351844
00000100:00100000:13.0:1391563489.084432:0:32258:0:(nrs_fifo.c:182:nrs_fifo_req_get()) NRS start fifo request from 12345-172.16.66.4@tcp, seq: 24327658
00000100:00100000:13.0:1391563489.084434:0:32258:0:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cn03_006:1444d025-e75e-a124-88f8-c2f78d40926b+2471:10919:x1450886429351844:12345-172.16.66.4@tcp:103
00010000:00000002:13.0:1391563489.084436:0:32258:0:(ldlm_lockd.c:2368:ldlm_cancel_handler()) cancel
00000100:00100000:13.0:1391563489.084444:0:32258:0:(service.c:2055:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cn03_006:1444d025-e75e-a124-88f8-c2f78d40926b+2470:10919:x1450886429351844:12345-172.16.66.4@tcp:103 Request procesed in 11us (23us total) trans 0 rc 0/0
00000100:00100000:13.0:1391563489.084447:0:32258:0:(nrs_fifo.c:244:nrs_fifo_req_stop()) NRS stop fifo request from 12345-172.16.66.4@tcp, seq: 24327658

So the MDS node immediately turns around and creates 6 RPCs, 3 for 172.16.66.4@tcp, 1 for 172.16.66.5@tcp, and 2 for 172.16.66.30@tcp. Those RPCs all appear to complete in very little time.

Shortly after that, the server receives LDLM_CANCEL rpcs from both 172.16.66.4@tcp (the client where the file was originally created) and 172.16.66.30@tcp (the robinhood client node).

Here is the 172.16.66.4@tcp client handling the three LDLM_BL_CALLBACKs. For whatever reason, this only seems to result in a single LDLM_CANCEL back to the server. Is that a problem?

00000100:00100000:14.0F:1391563489.084416:0:9947:0:(events.c:352:request_in_callback()) peer: 12345-172.20.20.201@o2ib500
00000100:00100000:13.0F:1391563489.084429:0:9949:0:(events.c:352:request_in_callback()) peer: 12345-172.20.20.201@o2ib500
00000100:00100000:8.0F:1391563489.084461:0:51449:0:(service.c:1867:ptlrpc_server_handle_req_in()) got req x1458673558372856
00000100:00100000:9.0F:1391563489.084465:0:11003:0:(service.c:1867:ptlrpc_server_handle_req_in()) got req x1458673558372860
00000100:00100000:4.0F:1391563489.084487:0:9943:0:(events.c:352:request_in_callback()) peer: 12345-172.20.20.201@o2ib500
00000100:00100000:9.0:1391563489.084497:0:11003:0:(service.c:1867:ptlrpc_server_handle_req_in()) got req x1458673558372864
00000100:00100000:8.0:1391563489.084503:0:51449:0:(nrs_fifo.c:182:nrs_fifo_req_get()) NRS start fifo request from 12345-172.20.20.201@o2ib500, seq: 1412060
00000100:00100000:8.0:1391563489.084510:0:51449:0:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb02_005:LMV_MDC_UUID+8:18680:x1458673558372860:12345-172.20.20.201@o2ib500:104
00000100:00100000:9.0:1391563489.084515:0:11003:0:(nrs_fifo.c:182:nrs_fifo_req_get()) NRS start fifo request from 12345-172.20.20.201@o2ib500, seq: 1412061
00000100:00100000:9.0:1391563489.084522:0:11003:0:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb02_000:LMV_MDC_UUID+8:18680:x1458673558372856:12345-172.20.20.201@o2ib500:104
00000100:00100000:8.0:1391563489.084567:0:51449:0:(service.c:2055:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb02_005:LMV_MDC_UUID+8:18680:x1458673558372860:12345-172.20.20.201@o2ib500:104 Request procesed in 61us (140us total) trans 0 rc 0/0
00000100:00100000:10.0F:1391563489.084571:0:48512:0:(nrs_fifo.c:182:nrs_fifo_req_get()) NRS start fifo request from 12345-172.20.20.201@o2ib500, seq: 1412062
00000100:00100000:9.0:1391563489.084571:0:11003:0:(service.c:2055:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb02_000:LMV_MDC_UUID+8:18680:x1458673558372856:12345-172.20.20.201@o2ib500:104 Request procesed in 53us (158us total) trans 0 rc 0/0
00000100:00100000:8.0:1391563489.084575:0:51449:0:(nrs_fifo.c:244:nrs_fifo_req_stop()) NRS stop fifo request from 12345-172.20.20.201@o2ib500, seq: 1412060
00000100:00100000:9.0:1391563489.084577:0:11003:0:(nrs_fifo.c:244:nrs_fifo_req_stop()) NRS stop fifo request from 12345-172.20.20.201@o2ib500, seq: 1412061
00000100:00100000:10.0:1391563489.084581:0:48512:0:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb02_003:LMV_MDC_UUID+6:18680:x1458673558372864:12345-172.20.20.201@o2ib500:104
00000100:00100000:10.0:1391563489.084601:0:48512:0:(service.c:2055:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb02_003:LMV_MDC_UUID+6:18680:x1458673558372864:12345-172.20.20.201@o2ib500:104 Request procesed in 26us (124us total) trans 0 rc -22/-22
00000100:00100000:10.0:1391563489.084604:0:48512:0:(nrs_fifo.c:244:nrs_fifo_req_stop()) NRS stop fifo request from 12345-172.20.20.201@o2ib500, seq: 1412062
00000100:00100000:5.0F:1391563489.084744:0:10919:0:(ptlrpcd.c:387:ptlrpcd_check()) transfer 1 async RPCs [2->3]
00000100:00100000:5.0:1391563489.084750:0:10919:0:(client.c:1441:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_3:1444d025-e75e-a124-88f8-c2f78d40926b:10919:1450886429351844:172.20.20.201@o2ib500:103
00000100:00100000:5.0:1391563489.085483:0:10919:0:(client.c:1809:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_3:1444d025-e75e-a124-88f8-c2f78d40926b:10919:1450886429351844:172.20.20.201@o2ib500:103

But the client seems to think its work is done, because the only RPCs I see it sending to the server over the next 100 seconds are the normal OBD_PING aliveness checks.

The server is clearly not satisfied, because later, after the 100 second timeout, it evicts both 172.16.66.4@tcp and 172.16.66.30@tcp. Here is the server doing that:

00010000:00020000:8.1F:1391563589.243260:0:0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 287s: evicting client at 172.16.66.30@tcp  ns: mdt-fsv-MDT0000_UUID lock: ffff880f091cea00/0xf70254b1ce3273a8 lrc: 3/0,0 mode: PR/PR res: 8589983111/32 bits 0x1b rrc: 3 type: IBT flags: 0x200000000020 nid: 172.16.66.30@tcp remote: 0x60ad8f9843a9644 expref: 9434 pid: 16453 timeout: 4759057840 lvb_type: 0
...
00010000:00020000:8.1:1391563589.283179:0:0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 302s: evicting client at 172.16.66.4@tcp  ns: mdt-fsv-MDT0000_UUID lock: ffff880f17f60200/0xf70254b1ce2bd28a lrc: 3/0,0 mode: CR/CR res: 8589983111/32 bits 0x8 rrc: 3 type: IBT flags: 0x20 nid: 172.16.66.4@tcp remote: 0x1dfa39e9c64af6c1 expref: 2469 pid: 17054 timeout: 4759057840 lvb_type: 3

Note that the "timer expired after" numbers are bogus (which we recognized in LU-4579). If you check the timestamps you'll see that it really is 100 seconds after the lock cancel was issued.
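
As a quick sanity check against the timestamps quoted above:

MDS sends the blocking ASTs:   1391563489.0834xx
MDS evicts 172.16.66.4@tcp:    1391563589.283179
elapsed:                       ~100.2 seconds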

After the evictions the server can finally reply to the initial unlink rpc:

00000004:00000002:9.0:1391563589.289089:0:18680:0:(mdt_reint.c:100:mdt_obj_version_get()) FID [0x20000bd87:0x20:0x0] version is 0x141a11cf09
00000040:00100000:9.0:1391563589.293701:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7f3d:1: idx: 34051, 64
00000040:00100000:9.0:1391563589.293729:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7ede:1: idx: 34754, 64
00000040:00100000:9.0:1391563589.293743:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7edf:1: idx: 34957, 64
00000040:00100000:9.0:1391563589.293757:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7ee0:1: idx: 34238, 64
00000040:00100000:9.0:1391563589.293772:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7ee1:1: idx: 34149, 64
00000040:00100000:9.0:1391563589.293787:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7ee2:1: idx: 34303, 64
00000040:00100000:9.0:1391563589.293803:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7ee3:1: idx: 35304, 64
00000040:00100000:9.0:1391563589.293827:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7ee4:1: idx: 34051, 64
00000040:00100000:9.0:1391563589.293842:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7ee5:1: idx: 36690, 64
00000040:00100000:9.0:1391563589.293857:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7ee6:1: idx: 35442, 64
00000040:00100000:9.0:1391563589.293883:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7ee7:1: idx: 35208, 64
00000040:00100000:9.0:1391563589.293898:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7ee8:1: idx: 35410, 64
00000040:00100000:9.0:1391563589.293914:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7ee9:1: idx: 34747, 64
00000040:00100000:9.0:1391563589.293940:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7eea:1: idx: 33825, 64
00000040:00100000:9.0:1391563589.293955:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7eeb:1: idx: 33073, 64
00000040:00100000:9.0:1391563589.293969:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7eec:1: idx: 35776, 64
00000040:00100000:9.0:1391563589.293984:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7eed:1: idx: 34923, 64
00000040:00100000:9.0:1391563589.293999:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7eee:1: idx: 34477, 64
00000040:00100000:9.0:1391563589.294014:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7eef:1: idx: 34434, 64
00000040:00100000:9.0:1391563589.294029:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7ef0:1: idx: 34534, 64
00000040:00100000:9.0:1391563589.294043:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7ef1:1: idx: 34941, 64
00000040:00100000:9.0:1391563589.294068:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7ef2:1: idx: 35734, 64
00000040:00100000:9.0:1391563589.294083:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7ef3:1: idx: 35572, 64
00000040:00100000:9.0:1391563589.294097:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7ef4:1: idx: 35245, 64
00000040:00100000:9.0:1391563589.294122:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7ef5:1: idx: 35781, 64
00000040:00100000:9.0:1391563589.294136:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7ef6:1: idx: 35071, 64
00000040:00100000:9.0:1391563589.294151:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7ef7:1: idx: 35647, 64
00000040:00100000:9.0:1391563589.294166:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7ef8:1: idx: 35247, 64
00000040:00100000:9.0:1391563589.294181:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7ef9:1: idx: 34787, 64
00000040:00100000:9.0:1391563589.294205:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7efa:1: idx: 34774, 64
00000040:00100000:9.0:1391563589.294221:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7efb:1: idx: 35288, 64
00000040:00100000:9.0:1391563589.294236:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7efc:1: idx: 35788, 64
00000040:00100000:9.0:1391563589.294253:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7efd:1: idx: 34607, 64
00000040:00100000:9.0:1391563589.294284:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7efe:1: idx: 35667, 64
00000040:00100000:9.0:1391563589.294300:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7eff:1: idx: 35319, 64
00000040:00100000:9.0:1391563589.294317:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7f00:1: idx: 35735, 64
00000040:00100000:9.0:1391563589.294343:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7f01:1: idx: 34925, 64
00000040:00100000:9.0:1391563589.294359:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7f02:1: idx: 35698, 64
00000040:00100000:9.0:1391563589.294374:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7f03:1: idx: 34234, 64
00000040:00100000:9.0:1391563589.294390:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7f04:1: idx: 36572, 64
00000040:00100000:9.0:1391563589.294405:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7f05:1: idx: 35444, 64
00000040:00100000:9.0:1391563589.294421:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7f06:1: idx: 35313, 64
00000040:00100000:9.0:1391563589.294436:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7f07:1: idx: 34590, 64
00000040:00100000:9.0:1391563589.294451:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7f08:1: idx: 35230, 64
00000040:00100000:9.0:1391563589.294467:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7f09:1: idx: 35106, 64
00000040:00100000:9.0:1391563589.294482:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7f0a:1: idx: 35179, 64
00000040:00100000:9.0:1391563589.294498:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7f0b:1: idx: 34959, 64
00000040:00100000:9.0:1391563589.294512:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x7f0c:1: idx: 36182, 64
00000040:00100000:9.0:1391563589.294532:0:18680:0:(llog_osd.c:499:llog_osd_write_rec()) added record 0x829e:1: idx: 43128, 96
00000004:00000002:9.0:1391563589.294535:0:18680:0:(mdt_recovery.c:562:mdt_txn_stop_cb()) transno = 86336738210, last_committed = 86336737853
00000004:00000002:9.0:1391563589.294876:0:18680:0:(mdt_handler.c:404:mdt_pack_attr2body()) [0x20000bd87:0x20:0x0]: nlink=0, mode=100600, size=0
00000100:00100000:9.0:1391563589.294944:0:18680:0:(service.c:2055:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt02_078:a3cebad5-b812-989a-0fc6-cab928270ea2+85:212662:x1457223513206688:12345-172.16.66.5@tcp:36 Request procesed in 100211610us (100211668us total) trans 86336738210 rc 0/0
00000100:00100000:9.0:1391563589.294947:0:18680:0:(nrs_fifo.c:244:nrs_fifo_req_stop()) NRS stop fifo request from 12345-172.16.66.5@tcp, seq: 322652232

This information was gathered on production systems, so I've selected the key parts that I thought were relevant. I will also attach the full logs, in case anyone wants to sort through them and look for something else.

Comment by Christopher Morrone [ 05/Feb/14 ]

This won't be terribly surprising, but here is the backtrace from the mdt thread that handled the initial unlink-related RPC. The backtrace was taken in the middle of the 100s waiting period:

 #0 [ffff88100d3af8d0] schedule at ffffffff81510bd2
 #1 [ffff88100d3af998] schedule_timeout+0x192 at ffffffff81511a32
 #2 [ffff88100d3afa48] cfs_waitq_timedwait+0x11 at ffffffffa05bc751 [libcfs]
 #3 [ffff88100d3afa58] ldlm_completion_ast+0x4ed at ffffffffa089143d [ptlrpc]
 #4 [ffff88100d3afb08] ldlm_cli_enqueue_local+0x1f8 at ffffffffa0890b78 [ptlrpc]
 (inlined mdt_fid_lock)
 #5 [ffff88100d3afb98] mdt_object_lock0+0x33b at ffffffffa0e3ac7b [mdt]
 #6 [ffff88100d3afc48] mdt_object_lock+0x14 at ffffffffa0e3b4f4 [mdt]
 #7 [ffff88100d3afc58] mdt_reint_unlink+0x66e at ffffffffa0e57f2e [mdt]
 #8 [ffff88100d3afcd8] mdt_reint_rec+0x41 at ffffffffa0e54a31 [mdt]
 #9 [ffff88100d3afcf8] mdt_reint_internal+0x4c3 at ffffffffa0e39ae3 [mdt]
#10 [ffff88100d3afd38] mdt_reint+0x44 at ffffffffa0e39de4 [mdt]
#11 [ffff88100d3afd58] mdt_handle_common+0x648 at ffffffffa0e3f758 [mdt]
#12 [ffff88100d3afda8] mds_regular_handle+0x15 at ffffffffa0e78405 [mdt]
#13 [ffff88100d3afdb8] ptlrpc_server_handle_request+0x398 at ffffffffa08caf88 [ptlrpc]
#14 [ffff88100d3afeb8] ptlrpc_main+0xace at ffffffffa08cc31e [ptlrpc]
#15 [ffff88100d3aff48] child_rip+0xa at ffffffff8100c10a

But this also happens to be a common stack trace from hundreds of threads on our MDS nodes when they go unresponsive for hours at a time. So I am keen to understand this problem soon in case it turns out to be related to our wider production Lustre problems.

Comment by Bruno Faccini (Inactive) [ 05/Feb/14 ]

Hello Chris,
It is unclear to me whether RobinHood's presence is required to reproduce. I presume RobinHood and ChangeLog usage are enabled by default on your production systems, but it would help if we could determine whether RobinHood has any impact on this issue.

Comment by Bruno Faccini (Inactive) [ 05/Feb/14 ]

I am also trying to find/set up a similar platform to run your reproducer on.

Comment by Christopher Morrone [ 06/Feb/14 ]

I don't know yet if a changelog consumer is a requirement either. I haven't been able to reproduce this problem yet in our testbed, which is both smaller and doesn't have a changelog user.

But I was hoping that someone could explain to me how lustre's protocol is expected to work. Since I have a great reproducer, I'm happy to investigate any point where the protocol is not behaving as expected.

For instance, could someone explain what else the server might need from the clients? The clients have each sent an LDLM_CANCEL...is that not sufficient? Are the clients failing to do something, or did the server just fail to correctly associate those cancels with the outstanding RPC? Is the number of RPCs generated earlier in the sequence as expected? Do three LDLM_BL_CALLBACKs for the node where the file was created and two more for the node running the changelog client make sense? What is their purpose?

Comment by Bruno Faccini (Inactive) [ 06/Feb/14 ]

I think it is likely that the multiple LDLM_CANCEL answers from the Clients have been packed in the same/single RPC.
And I think multiple LDLM_BL_CALLBACKs can be sent by the MDS for different inode bits being locked.

But to definitely validate these assumptions I need more/full traces.

So, would it be possible for you to enable both rpctrace and dlmtrace on the Client and Server/MDS sides, set /proc/sys/lustre/dump_on_eviction (works only on Clients I think, so you will need to "lctl dk" on the MDS), and maybe also grow the debug-log buffer size, in order to get all traces after running your reproducer again? Or maybe you already have this in the full logs you promised to attach?
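
For reference, a rough sketch of the settings I mean (illustrative only; adjust paths and sizes to your site):

# Clients and MDS: add rpctrace and dlmtrace, and grow the debug buffer
lctl set_param debug=+rpctrace
lctl set_param debug=+dlmtrace
lctl set_param debug_mb=512

# Clients: dump the debug log automatically on eviction
echo 1 > /proc/sys/lustre/dump_on_eviction

# MDS, after reproducing: dump the debug buffer by hand
lctl dk /tmp/mds_debug.log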

BTW, the "rc -22/-22" during one of the LDLM_BL_CALLBACK you posted looks also suspicious.

Comment by Christopher Morrone [ 06/Feb/14 ]

Doh! I definitely uploaded the logs to some ticket, but I must have accidentally attached them to the wrong one. Here they are in the right place.

Comment by Christopher Morrone [ 06/Feb/14 ]

I think it is likely that the multiple LDLM_CANCEL answers from the Clients have been packed in the same/single RPC.

Ah, is that perhaps what this shows?

00000100:00100000:5.0F:1391563489.084744:0:10919:0:(ptlrpcd.c:387:ptlrpcd_check()) transfer 1 async RPCs [2->3]

Comment by Christopher Morrone [ 07/Feb/14 ]

See the attached file "client_log_20140206.txt".

I can't get more detailed logs from the MDS right now because it is too heavily loaded to make a higher debugging level practical.

But the new client log is from the client where I created a file named "foo". I did the "rm foo" on another client. This time I added "rpctrace dlmtrace vfstrace" to the log level on the client. dlmtrace is new in this log. With that, you can see better where the -22 comes from on the client.

Comment by Bruno Faccini (Inactive) [ 07/Feb/14 ]

No, this trace means that 1 RPC has been handled by ptlrpcd #3 after being stolen from partner #2; the packed requests will appear in the "CDEBUG(D_DLMTRACE, "%d locks packed\n", packed);" msg/trace from ldlm_cancel_pack().

Comment by Bruno Faccini (Inactive) [ 07/Feb/14 ]

BTW, in the last Client log you attached (from the node doing the "touch"), there is only one ldlm_cancel_pack() trace reporting a single LDLM_CANCEL RPC to be sent back to the MDS. And it concerns only 2 of the LDLM_BL_CALLBACKs sent by the Server, for the same lock/handle ffff8801f8c5e200/0x6aff086a1c937ce8 as identified by the Client.

And the -22/-EINVAL is for the 3rd LDLM_BL_CALLBACK; it concerns another handle (0x6aff086a1c937ce1) for which the Client can't find the corresponding lock:

00000100:00100000:3.0F:1391738160.718872:0:4105:0:(events.c:352:request_in_callback()) peer: 12345-172.20.20.201@o2ib500
00000100:00100000:5.0F:1391738160.718889:0:4108:0:(events.c:352:request_in_callback()) peer: 12345-172.20.20.201@o2ib500
00000100:00100000:0.0F:1391738160.718900:0:4106:0:(events.c:352:request_in_callback()) peer: 12345-172.20.20.201@o2ib500
00000100:00100000:8.0F:1391738160.718910:0:228232:0:(service.c:1867:ptlrpc_server_handle_req_in()) got req x1458673631329952
00000100:00100000:8.0:1391738160.718931:0:228232:0:(service.c:1867:ptlrpc_server_handle_req_in()) got req x1458673631329956
00000100:00100000:8.0:1391738160.718935:0:228232:0:(service.c:1867:ptlrpc_server_handle_req_in()) got req x1458673631329960
00000100:00100000:10.0F:1391738160.718940:0:177839:0:(nrs_fifo.c:182:nrs_fifo_req_get()) NRS start fifo request from 12345-172.20.20.201@o2ib500, seq: 734293
00000100:00100000:8.0:1391738160.718954:0:228232:0:(nrs_fifo.c:182:nrs_fifo_req_get()) NRS start fifo request from 12345-172.20.20.201@o2ib500, seq: 734294
00000100:00100000:10.0:1391738160.718961:0:177839:0:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb02_005:LMV_MDC_UUID+8:18904:x1458673631329952:12345-172.20.20.201@o2ib500:104
00000100:00100000:8.0:1391738160.718962:0:228232:0:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb02_003:LMV_MDC_UUID+8:18904:x1458673631329956:12345-172.20.20.201@o2ib500:104
00000100:00100000:9.0F:1391738160.718968:0:5327:0:(nrs_fifo.c:182:nrs_fifo_req_get()) NRS start fifo request from 12345-172.20.20.201@o2ib500, seq: 734295
00000100:00100000:9.0:1391738160.718978:0:5327:0:(service.c:2011:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb02_001:LMV_MDC_UUID+8:18904:x1458673631329960:12345-172.20.20.201@o2ib500:104
00010000:00010000:9.0:1391738160.718988:0:5327:0:(ldlm_lockd.c:2242:ldlm_callback_handler()) callback on lock 0x6aff086a1c937ce1 - lock disappeared
00000100:00100000:8.0:1391738160.718993:0:228232:0:(service.c:2055:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb02_003:LMV_MDC_UUID+8:18904:x1458673631329956:12345-172.20.20.201@o2ib500:104 Request procesed in 37us (106us total) trans 0 rc 0/0
00000100:00100000:10.0:1391738160.718996:0:177839:0:(service.c:2055:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb02_005:LMV_MDC_UUID+8:18904:x1458673631329952:12345-172.20.20.201@o2ib500:104 Request procesed in 42us (126us total) trans 0 rc 0/0
00000100:00100000:8.0:1391738160.718996:0:228232:0:(nrs_fifo.c:244:nrs_fifo_req_stop()) NRS stop fifo request from 12345-172.20.20.201@o2ib500, seq: 734294
00000100:00100000:10.0:1391738160.718999:0:177839:0:(nrs_fifo.c:244:nrs_fifo_req_stop()) NRS stop fifo request from 12345-172.20.20.201@o2ib500, seq: 734293
00010000:00010000:9.0:1391738160.719012:0:5327:0:(ldlm_lockd.c:2095:ldlm_callback_errmsg()) @@@ Operate with invalid parameter: [nid 12345-172.20.20.201@o2ib500] [rc 0] [lock 0x6aff086a1c937ce1]  req@ffff8803db6f7c00 x1458673631329960/t0(0) o104->LMV_MDC_UUID@172.20.20.201@o2ib500:0/0 lens 296/192 e 0 to 0 dl 1391738221 ref 1 fl Interpret:/0/0 rc -22/-22
00000100:00100000:9.0:1391738160.719019:0:5327:0:(service.c:2055:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb02_001:LMV_MDC_UUID+6:18904:x1458673631329960:12345-172.20.20.201@o2ib500:104 Request procesed in 48us (121us total) trans 0 rc -22/-22
00000100:00100000:9.0:1391738160.719022:0:5327:0:(nrs_fifo.c:244:nrs_fifo_req_stop()) NRS stop fifo request from 12345-172.20.20.201@o2ib500, seq: 734295
00010000:00010000:10.0:1391738160.719028:0:178165:0:(ldlm_lockd.c:1696:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e200/0x6aff086a1c937ce8 lrc: 3/0,0 mode: CR/CR res: 8589983387/1 bits 0x8 rrc: 1 type: IBT flags: 0x420000000000 nid: local remote: 0xf70254b5f35563f2 expref: -99 pid: 217102 timeout: 0 lvb_type: 3
00010000:00010000:8.0:1391738160.719031:0:178073:0:(ldlm_lockd.c:1696:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e200/0x6aff086a1c937ce8 lrc: 3/0,0 mode: CR/CR res: 8589983387/1 bits 0x8 rrc: 1 type: IBT flags: 0x420000000000 nid: local remote: 0xf70254b5f35563f2 expref: -99 pid: 217102 timeout: 0 lvb_type: 3
00010000:00010000:10.0:1391738160.719032:0:178165:0:(ldlm_lockd.c:1709:ldlm_handle_bl_callback()) Lock ffff8801f8c5e200 already unused, calling callback (ffffffffa0c7f470)
00010000:00010000:10.0:1391738160.719034:0:178165:0:(ldlm_request.c:1127:ldlm_cli_cancel_local()) ### client-side cancel ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e200/0x6aff086a1c937ce8 lrc: 4/0,0 mode: CR/CR res: 8589983387/1 bits 0x8 rrc: 1 type: IBT flags: 0x428400000000 nid: local remote: 0xf70254b5f35563f2 expref: -99 pid: 217102 timeout: 0 lvb_type: 3
00010000:00010000:8.0:1391738160.719034:0:178073:0:(ldlm_lockd.c:1709:ldlm_handle_bl_callback()) Lock ffff8801f8c5e200 already unused, calling callback (ffffffffa0c7f470)
00010000:00010000:8.0:1391738160.719036:0:178073:0:(ldlm_request.c:1358:ldlm_cli_cancel()) ### lock is already being destroyed

00010000:00010000:8.0:1391738160.719037:0:178073:0:(ldlm_lockd.c:1718:ldlm_handle_bl_callback()) ### client blocking callback handler END ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e200/0x6aff086a1c937ce8 lrc: 4/0,0 mode: CR/CR res: 8589983387/1 bits 0x8 rrc: 1 type: IBT flags: 0x428400000000 nid: local remote: 0xf70254b5f35563f2 expref: -99 pid: 217102 timeout: 0 lvb_type: 3
00010000:00010000:10.0:1391738160.719076:0:178165:0:(ldlm_request.c:1186:ldlm_cancel_pack()) ### packing ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e200/0x6aff086a1c937ce8 lrc: 2/0,0 mode: --/CR res: 8589983387/1 bits 0x8 rrc: 1 type: IBT flags: 0xc29400000000 nid: local remote: 0xf70254b5f35563f2 expref: -99 pid: 217102 timeout: 0 lvb_type: 3
00010000:00010000:10.0:1391738160.719079:0:178165:0:(ldlm_request.c:1190:ldlm_cancel_pack()) 1 locks packed
00010000:00010000:10.0:1391738160.719082:0:178165:0:(ldlm_lockd.c:1718:ldlm_handle_bl_callback()) ### client blocking callback handler END ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e200/0x6aff086a1c937ce8 lrc: 1/0,0 mode: --/CR res: 8589983387/1 bits 0x8 rrc: 1 type: IBT flags: 0xc29400000000 nid: local remote: 0xf70254b5f35563f2 expref: -99 pid: 217102 timeout: 0 lvb_type: 3
00010000:00010000:10.0:1391738160.719085:0:178165:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e200/0x6aff086a1c937ce8 lrc: 0/0,0 mode: --/CR res: 8589983387/1 bits 0x8 rrc: 1 type: IBT flags: 0xc29400000000 nid: local remote: 0xf70254b5f35563f2 expref: -99 pid: 217102 timeout: 0 lvb_type: 3
00000100:00100000:8.0:1391738160.719148:0:5249:0:(ptlrpcd.c:387:ptlrpcd_check()) transfer 1 async RPCs [11->10]
00000100:00100000:8.0:1391738160.719152:0:5249:0:(client.c:1441:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_10:a3cebad5-b812-989a-0fc6-cab928270ea2:5249:1457223537877828:172.20.20.201@o2ib500:103
00000100:00100000:8.0:1391738160.719934:0:5249:0:(client.c:1809:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_10:a3cebad5-b812-989a-0fc6-cab928270ea2:5249:1457223537877828:172.20.20.201@o2ib500:103

Going up in the log, we can see the following trace sequence for the failing handle 0x6aff086a1c937ce1:

00000080:00200000:1.0:1391738148.867174:0:217102:0:(namei.c:503:ll_lookup_it()) VFS Op:name=foo,dir=144115923974954140/33554603(ffff881e3b002bf8),intent=open|creat
00000100:00100000:1.0:1391738148.867209:0:217102:0:(client.c:1441:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc touch:a3cebad5-b812-989a-0fc6-cab928270ea2:217102:1457223537877616:172.20.20.201@o2ib500:700
00000100:00100000:1.0:1391738148.867234:0:217102:0:(client.c:2132:ptlrpc_set_wait()) set ffff8804dd189200 going to sleep for 61 seconds
00000100:00100000:1.0:1391738148.868002:0:217102:0:(client.c:1809:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc touch:a3cebad5-b812-989a-0fc6-cab928270ea2:217102:1457223537877616:172.20.20.201@o2ib500:700
00000002:00010000:1.0:1391738148.868013:0:217102:0:(mdc_locks.c:1083:mdc_intent_lock()) (name: foo,[0x20000be9b:0x1:0x0]) in obj [0x20000ab57:0x11c9c:0x0], intent: open|creat flags 0104502
00010000:00010000:1.0:1391738148.868041:0:217102:0:(ldlm_lock.c:796:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(CW) ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e800/0x6aff086a1c937ce1 lrc: 3/0,1 mode: --/CW res: 8589978455/72860 bits 0x0 rrc: 1 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 217102 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1391738148.868046:0:217102:0:(ldlm_request.c:926:ldlm_cli_enqueue()) ### client-side enqueue START, flags 1000
 ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e800/0x6aff086a1c937ce1 lrc: 3/0,1 mode: --/CW res: 8589978455/72860 bits 0x2 rrc: 1 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 217102 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1391738148.868050:0:217102:0:(ldlm_request.c:988:ldlm_cli_enqueue()) ### sending request ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e800/0x6aff086a1c937ce1 lrc: 3/0,1 mode: --/CW res: 8589978455/72860 bits 0x2 rrc: 1 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 217102 timeout: 0 lvb_type: 0
00000100:00100000:1.0:1391738148.868054:0:217102:0:(client.c:1441:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc touch:a3cebad5-b812-989a-0fc6-cab928270ea2:217102:1457223537877620:172.20.20.201@o2ib500:101
00000100:00100000:1.0:1391738148.868065:0:217102:0:(client.c:2132:ptlrpc_set_wait()) set ffff8804dd189200 going to sleep for 61 seconds
00000100:00100000:10.0F:1391738148.878667:0:4112:0:(events.c:126:reply_in_callback()) truncate req ffff880d7a7ab800 rpc 1136 - 2232+192
00000100:00100000:1.0:1391738148.878772:0:217102:0:(client.c:2132:ptlrpc_set_wait()) set ffff8804dd189200 going to sleep for 61 seconds
00000100:00100000:1.0:1391738148.879522:0:217102:0:(client.c:1809:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc touch:a3cebad5-b812-989a-0fc6-cab928270ea2:217102:1457223537877620:172.20.20.201@o2ib500:101
00010000:00010000:1.0:1391738148.879539:0:217102:0:(ldlm_request.c:552:ldlm_cli_enqueue_fini()) ### client-side enqueue END (ABORTED) ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e800/0x6aff086a1c937ce1 lrc: 4/0,1 mode: --/CW res: 8589978455/72860 bits 0x2 rrc: 1 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 217102 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1391738148.879545:0:217102:0:(ldlm_request.c:501:failed_lock_cleanup()) ### setting FL_LOCAL_ONLY | LDLM_FL_FAILED | LDLM_FL_ATOMIC_CB | LDLM_FL_CBPENDING ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e800/0x6aff086a1c937ce1 lrc: 4/0,1 mode: --/CW res: 8589978455/72860 bits 0x2 rrc: 1 type: IBT flags: 0x206400000000 nid: local remote: 0x0 expref: -99 pid: 217102 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1391738148.879548:0:217102:0:(ldlm_lock.c:849:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(CW) ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e800/0x6aff086a1c937ce1 lrc: 4/0,1 mode: --/CW res: 8589978455/72860 bits 0x2 rrc: 1 type: IBT flags: 0x206400000000 nid: local remote: 0x0 expref: -99 pid: 217102 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1391738148.879551:0:217102:0:(ldlm_lock.c:900:ldlm_lock_decref_internal()) ### final decref done on cbpending lock ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e800/0x6aff086a1c937ce1 lrc: 3/0,0 mode: --/CW res: 8589978455/72860 bits 0x2 rrc: 1 type: IBT flags: 0x206400000000 nid: local remote: 0x0 expref: -99 pid: 217102 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1391738148.879554:0:217102:0:(ldlm_lockd.c:1696:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e800/0x6aff086a1c937ce1 lrc: 4/0,0 mode: --/CW res: 8589978455/72860 bits 0x2 rrc: 1 type: IBT flags: 0x206400000000 nid: local remote: 0x0 expref: -99 pid: 217102 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1391738148.879556:0:217102:0:(ldlm_lockd.c:1709:ldlm_handle_bl_callback()) Lock ffff8801f8c5e800 already unused, calling callback (ffffffffa0c7f470)
00010000:00010000:1.0:1391738148.879558:0:217102:0:(ldlm_request.c:1127:ldlm_cli_cancel_local()) ### client-side cancel ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e800/0x6aff086a1c937ce1 lrc: 5/0,0 mode: --/CW res: 8589978455/72860 bits 0x2 rrc: 1 type: IBT flags: 0x20e400000000 nid: local remote: 0x0 expref: -99 pid: 217102 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1391738148.879562:0:217102:0:(ldlm_request.c:1140:ldlm_cli_cancel_local()) not sending request (at caller's instruction)
00010000:00010000:1.0:1391738148.879564:0:217102:0:(ldlm_lockd.c:1718:ldlm_handle_bl_callback()) ### client blocking callback handler END ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e800/0x6aff086a1c937ce1 lrc: 3/0,0 mode: --/CW res: 8589978455/72860 bits 0x2 rrc: 1 type: IBT flags: 0xa0f400000000 nid: local remote: 0x0 expref: -99 pid: 217102 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1391738148.879567:0:217102:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e800/0x6aff086a1c937ce1 lrc: 0/0,0 mode: --/CW res: 8589978455/72860 bits 0x2 rrc: 1 type: IBT flags: 0xa0f400000000 nid: local remote: 0x0 expref: -99 pid: 217102 timeout: 0 lvb_type: 0
00000002:00100000:1.0:1391738148.879580:0:217102:0:(mdc_locks.c:579:mdc_finish_enqueue()) @@@ op: 3 disposition: 37, status: 0  req@ffff880d7a7ab800 x1457223537877620/t86458366655(86458366655) o101->fsv-MDT0000-mdc-ffff882019e97c00@172.20.20.201@o2ib500:12/10 lens 576/2232 e 0 to 0 dl 1391738254 ref 2 fl Complete:RP/4/0 rc 301/301
00000002:00100000:1.0:1391738148.879586:0:217102:0:(mdc_request.c:781:mdc_set_open_replay_data()) @@@ Set up open replay data  req@ffff880d7a7ab800 x1457223537877620/t86458366655(86458366655) o101->fsv-MDT0000-mdc-ffff882019e97c00@172.20.20.201@o2ib500:12/10 lens 576/2232 e 0 to 0 dl 1391738254 ref 2 fl Complete:RP/4/0 rc 301/301

and then, in immediate continuation, the trace sequence for the "valid" handle (still within the same "touch" process context):

00000080:00200000:1.0:1391738148.879598:0:217102:0:(file.c:3001:ll_inode_permission()) VFS Op:inode=144115923974954140/33554603(ffff881e3b002bf8), inode mode 41c0 mask 3
00000080:00200000:1.0:1391738148.879600:0:217102:0:(namei.c:848:ll_create_it()) VFS Op:name=foo,dir=144115923974954140/33554603(ffff881e3b002bf8),intent=open|creat
00000080:00200000:1.0:1391738148.879642:0:217102:0:(llite_lib.c:1913:ll_read_inode2()) VFS Op:inode=[0x20000be9b:0x1:0x0](ffff880fcd53f678)
00000080:00200000:1.0:1391738148.879750:0:217102:0:(namei.c:170:ll_iget()) got inode: ffff880fcd53f678 for [0x20000be9b:0x1:0x0]
00000080:00010000:1.0:1391738148.879754:0:217102:0:(namei.c:817:ll_create_node()) setting l_ast_data to inode ffff880fcd53f678 (144116006720110593/33554622)
00000080:00010000:1.0:1391738148.879755:0:217102:0:(llite_internal.h:1541:ll_set_lock_data()) setting l_data to inode ffff880fcd53f678 (144116006720110593/33554622) for lock 0x0
00000080:00200000:1.0:1391738148.879759:0:217102:0:(file.c:520:ll_file_open()) VFS Op:inode=144116006720110593/33554622(ffff880fcd53f678), flags 104501
00000002:00100000:1.0:1391738148.879767:0:217102:0:(mdc_request.c:781:mdc_set_open_replay_data()) @@@ Set up open replay data  req@ffff880d7a7ab800 x1457223537877620/t86458366655(86458366655) o101->fsv-MDT0000-mdc-ffff882019e97c00@172.20.20.201@o2ib500:12/10 lens 1688/2232 e 0 to 0 dl 1391738254 ref 2 fl Complete:RP/4/0 rc 301/301
00000080:00200000:1.0:1391738148.879772:0:217102:0:(file.c:3001:ll_inode_permission()) VFS Op:inode=144116006720110593/33554622(ffff880fcd53f678), inode mode 8180 mask 0
00000080:00200000:1.0:1391738148.879784:0:217102:0:(llite_lib.c:1501:ll_setattr_raw()) fsv: setattr inode ffff880fcd53f678/fid:[0x20000be9b:0x1:0x0] from 0 to 18446612189935735584, valid 70
00000100:00100000:1.0:1391738148.879797:0:217102:0:(client.c:1441:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc touch:a3cebad5-b812-989a-0fc6-cab928270ea2:217102:1457223537877624:172.20.20.201@o2ib500:36
00000100:00100000:1.0:1391738148.879808:0:217102:0:(client.c:2132:ptlrpc_set_wait()) set ffff8804dd189200 going to sleep for 61 seconds
00000100:00100000:1.0:1391738148.880673:0:217102:0:(client.c:1809:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc touch:a3cebad5-b812-989a-0fc6-cab928270ea2:217102:1457223537877624:172.20.20.201@o2ib500:36
00000080:00010000:1.0:1391738148.880691:0:217102:0:(file.c:3493:ll_layout_refresh()) ### fsv: requeue layout lock for file ffff880fcd53f678/[0x20000be9b:0x1:0x0].

00010000:00010000:1.0:1391738148.880703:0:217102:0:(ldlm_lock.c:796:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(CR) ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e200/0x6aff086a1c937ce8 lrc: 3/1,0 mode: --/CR res: 8589983387/1 bits 0x0 rrc: 1 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 217102 timeout: 0 lvb_type: 3
00010000:00010000:1.0:1391738148.880706:0:217102:0:(ldlm_request.c:926:ldlm_cli_enqueue()) ### client-side enqueue START, flags 1000
 ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e200/0x6aff086a1c937ce8 lrc: 3/1,0 mode: --/CR res: 8589983387/1 bits 0x8 rrc: 1 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 217102 timeout: 0 lvb_type: 3
00010000:00010000:1.0:1391738148.880709:0:217102:0:(ldlm_request.c:988:ldlm_cli_enqueue()) ### sending request ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e200/0x6aff086a1c937ce8 lrc: 3/1,0 mode: --/CR res: 8589983387/1 bits 0x8 rrc: 1 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 217102 timeout: 0 lvb_type: 3
00000100:00100000:1.0:1391738148.880712:0:217102:0:(client.c:1441:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc touch:a3cebad5-b812-989a-0fc6-cab928270ea2:217102:1457223537877628:172.20.20.201@o2ib500:101
00000100:00100000:1.0:1391738148.880724:0:217102:0:(client.c:2132:ptlrpc_set_wait()) set ffff8804dd189200 going to sleep for 61 seconds
00000100:00100000:6.0F:1391738148.881445:0:4110:0:(events.c:126:reply_in_callback()) truncate req ffff8808735dc000 rpc 416 - 1496+192
00000100:00100000:1.0:1391738148.881499:0:217102:0:(client.c:2132:ptlrpc_set_wait()) set ffff8804dd189200 going to sleep for 61 seconds
00000100:00100000:1.0:1391738148.882288:0:217102:0:(client.c:1809:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc touch:a3cebad5-b812-989a-0fc6-cab928270ea2:217102:1457223537877628:172.20.20.201@o2ib500:101
00010000:00010000:1.0:1391738148.882295:0:217102:0:(ldlm_lock.c:1090:ldlm_granted_list_add_lock()) ### About to add lock: ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e200/0x6aff086a1c937ce8 lrc: 4/1,0 mode: CR/CR res: 8589983387/1 bits 0x8 rrc: 1 type: IBT flags: 0x0 nid: local remote: 0xf70254b5f35563f2 expref: -99 pid: 217102 timeout: 0 lvb_type: 3
00010000:00010000:1.0:1391738148.882303:0:217102:0:(ldlm_request.c:704:ldlm_cli_enqueue_fini()) ### client-side enqueue END ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e200/0x6aff086a1c937ce8 lrc: 4/1,0 mode: CR/CR res: 8589983387/1 bits 0x8 rrc: 1 type: IBT flags: 0x0 nid: local remote: 0xf70254b5f35563f2 expref: -99 pid: 217102 timeout: 0 lvb_type: 3
00000002:00100000:1.0:1391738148.882307:0:217102:0:(mdc_locks.c:579:mdc_finish_enqueue()) @@@ op: 1024 disposition: 0, status: 0  req@ffff8808735dc000 x1457223537877628/t0(0) o101->fsv-MDT0000-mdc-ffff882019e97c00@172.20.20.201@o2ib500:12/10 lens 376/1496 e 0 to 0 dl 1391738254 ref 1 fl Complete:R/2/0 rc 0/0
00000002:00010000:1.0:1391738148.882311:0:217102:0:(mdc_locks.c:699:mdc_finish_enqueue()) ### layout lock returned by: layout, lvb_len: 1184
 ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e200/0x6aff086a1c937ce8 lrc: 3/1,0 mode: CR/CR res: 8589983387/1 bits 0x8 rrc: 1 type: IBT flags: 0x0 nid: local remote: 0xf70254b5f35563f2 expref: -99 pid: 217102 timeout: 0 lvb_type: 3
00000080:00010000:1.0:1391738148.882317:0:217102:0:(llite_internal.h:1541:ll_set_lock_data()) setting l_data to inode ffff880fcd53f678 (144116006720110593/33554622) for lock 0x6aff086a1c937ce8
00000080:00010000:1.0:1391738148.882320:0:217102:0:(file.c:3330:ll_layout_lock_set()) ### File ffff880fcd53f678/[0x20000be9b:0x1:0x0] being reconfigured: 1.
 ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e200/0x6aff086a1c937ce8 lrc: 3/1,0 mode: CR/CR res: 8589983387/1 bits 0x8 rrc: 1 type: IBT flags: 0x0 nid: local remote: 0xf70254b5f35563f2 expref: -99 pid: 217102 timeout: 0 lvb_type: 3
00000080:00200000:1.0:1391738148.882346:0:217102:0:(vvp_object.c:142:vvp_conf_set()) layout lock change: 4294967294 -> 0
00010000:00010000:1.0:1391738148.882359:0:217102:0:(ldlm_lock.c:849:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(CR) ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e200/0x6aff086a1c937ce8 lrc: 3/1,0 mode: CR/CR res: 8589983387/1 bits 0x8 rrc: 1 type: IBT flags: 0x20000000000 nid: local remote: 0xf70254b5f35563f2 expref: -99 pid: 217102 timeout: 0 lvb_type: 3
00010000:00010000:1.0:1391738148.882362:0:217102:0:(ldlm_lock.c:917:ldlm_lock_decref_internal()) ### add lock into lru list ns: fsv-MDT0000-mdc-ffff882019e97c00 lock: ffff8801f8c5e200/0x6aff086a1c937ce8 lrc: 2/0,0 mode: CR/CR res: 8589983387/1 bits 0x8 rrc: 1 type: IBT flags: 0x20000000000 nid: local remote: 0xf70254b5f35563f2 expref: -99 pid: 217102 timeout: 0 lvb_type: 3

……….

So it looks like the Server/MDS sent an ELDLM_LOCK_ABORTED, but was still expecting the Client to cancel the lock even though the Client found it unnecessary?
I will dig more into the source code to find out if/why this can happen. But in the meantime, can you help identify the differences/patches between the Client and Server versions you run?

Comment by Christopher Morrone [ 07/Feb/14 ]

But in the meantime, can you help identify the differences/patches between the Client and Server versions you run?

We mirror our branches on github. See http://github.com/chaos/lustre. Both of the mentioned tags, 2.4.0-19chaos and 2.4.0-21chaos, are on the 2.4.0-llnl branch.

Comment by Bruno Faccini (Inactive) [ 13/Feb/14 ]

The diffs between 2.4.0-19chaos and 2.4.0-21chaos are:

commit 4496fd9562bb1b2b0d9e1f925d7771e8dd795f4d
Author: wang di <di.wang@intel.com>
Date:   Thu Dec 12 10:33:56 2013 -0800

    LU-4222 mdt: extra checking for getattr RPC.

    It should check whether getattr RPC can hold
    layout MD(RMF_MDT_MD), in case the client sends
    some invalid RPC, which can cause panic on MDT.

    Client will retrieve cl_max_md_size/cl_default_md_size
    from MDS during mount process, so it will initialize
    cl_max_md_size/cl_default_md_size before sending getattr
    to MDS.

    Signed-off-by: wang di <di.wang@intel.com>
    Change-Id: Ic12f281d574e7138ee0e1d5aae6161b7607837cc

commit 4cd160ea66f09c71ff5cdced1bd19c09998b69b1
Author: Bruno Faccini <bruno.faccini@intel.com>
Date:   Thu Nov 14 17:20:00 2013 +0100

    LU-4194 ldlm: set l_lvb_type coherent when layout is returned

    In case layout has been packed into server reply when not
    requested, lock l_lvb_type must be set accordingly.

    Signed-off-by: Bruno Faccini <bruno.faccini@intel.com>
    Change-Id: Iaf54c9ba27785e529f4f2fb967d2fad4fc1dfbcb

commit fa3ec2e46e0e57d0c05ada2e20276cb86b3d9840
Author: Christopher J. Morrone <morrone2@llnl.gov>
Date:   Fri Nov 15 13:40:19 2013 -0800

    LU-4194 ldlm: Make OBD_[ALLOC|FREE]_LARGE use consistent

    struct ldlm_lock's l_lvb_data field is freed in ldlm_lock_put()
    using OBD_FREE.  However, some other code paths can attach
    a buffer to l_lvb_data that was allocated using OBD_ALLOC_LARGE.
    This can lead to a kfree() of a vmalloc()ed buffer, which can
    trigger a kernel Oops.

    Change-Id: Ic75a67530862eeb4d065c14bbbac80939bff5731
    Signed-off-by: Christopher J. Morrone <morrone2@llnl.gov>

commit 3e6edac776dab99b6de5db91406bbc6350a2a4d9
Author: Niu Yawei <yawei.niu@intel.com>
Date:   Wed Nov 6 02:52:32 2013 -0500

    LU-2619 quota: add debug message in osd_acct_index_lookup()

    Added debug messages in osd_acct_index_lookup().

    Signed-off-by: Niu Yawei <yawei.niu@intel.com>
    Change-Id: I6d7a1a55c0772255f8d4f4ec0f36cd69f73165ec

commit 101bf16586f8a1398be2c05c514ec14ab619b57e
Author: wang di <di.wang@intel.com>
Date:   Thu Aug 21 20:47:26 2014 -0700

    LU-4179 mdt: skip open lock enqueue during resent

    Skip open lock enqueue, if the open lock has been
    acquired(mdt_intent_fixup_resent) during resent.

    Signed-off-by: wang di <di.wang@intel.com>
    Change-Id: I625ca438e28520416ee2af884d0a9f9e6f21cf2e

commit e522f74ce63262fb16b747785c53f809dd19079a
Author: Hongchao Zhang <hongchao.zhang@intel.com>
Date:   Mon Jul 22 05:40:37 2013 +0800

    LU-2613 recovery: free open/close request promptly

    - For the non-create open or committed open, the open request
      should be freed along with the close request as soon as the
      close done, despite that the transno of open/close is
      greater than the last committed transno known by client or not.

    - Move the committed open request into another dedicated list,
      that will avoid scanning a huge replay list on receiving each
      reply (when there are many open files).

    Signed-off-by: Niu Yawei <yawei.niu@intel.com>
    Signed-off-by: Hongchao Zhang <hongchao.zhang@intel.com>
    Change-Id: I1a25a35fe7a16681368d92d16964680b6209b3ee
    Reviewed-on: http://review.whamcloud.com/6665
    Tested-by: Jenkins
    Tested-by: Maloo <hpdd-maloo@intel.com>
    Reviewed-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
    Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>

And I did not find anything in these changes that might indicate an incompatibility between Clients running 2.4.0-19chaos and Servers running 2.4.0-21chaos.

I am now investigating the ELDLM_LOCK_ABORTED reason/handling in the code.

Comment by Zhenyu Xu [ 14/Feb/14 ]

from 172.20.20.201@o2ib500.log (MDS log)

00010000:00020000:8.1:1391563589.283179:0:0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 302s: evicting client at 172.16.66.4@tcp ns: mdt-fsv-MDT0000_UUID lock: ffff880f17f60200/0xf70254b1ce2bd28a lrc: 3/0,0 mode: CR/CR res: 8589983111/32 bits 0x8 rrc: 3 type: IBT flags: 0x20 nid: 172.16.66.4@tcp remote: 0x1dfa39e9c64af6c1 expref: 2469 pid: 17054 timeout: 4759057840 lvb_type: 3

The problem lock's bits field is 0x8 (MDS_INODELOCK_LAYOUT) and its lvb_type is 3 (LVB_T_LAYOUT), so this looks like a layout lock issue.
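
For readers decoding the "bits" field in these lock dumps, the MDS inodebits values (from my reading of the 2.4-era headers; treat the exact constants as an assumption) are roughly:

/* MDS inodebits, as used in the "bits 0x.." field of the lock dumps above */
MDS_INODELOCK_LOOKUP = 0x01  /* name/dentry */
MDS_INODELOCK_UPDATE = 0x02  /* attributes, directory contents */
MDS_INODELOCK_OPEN   = 0x04  /* open handle */
MDS_INODELOCK_LAYOUT = 0x08  /* file layout */
MDS_INODELOCK_PERM   = 0x10  /* permission/ACL */

so the evicted lock above ("bits 0x8") covers only the layout, while the earlier "bits 0x1b" lock on the robinhood client combines LOOKUP|UPDATE|LAYOUT|PERM.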

Comment by Bruno Faccini (Inactive) [ 17/Feb/14 ]

Thanks Bobi for having a look at the Server side/trace; until now I had only focused on the Client side and the client_log_20140206.txt log... But anyway it is not really helpful either, since it is the only trace about this particular lock in both the Client and Server logs.

So we are stuck here, and as we discussed I started to build a platform to try to reproduce with full debug enabled, and to see whether the problem is related to the large striping layout, as we currently suspect, or to anything else.

Comment by Bruno Faccini (Inactive) [ 17/Feb/14 ]

I built a configuration that allows a 48-stripe count on zfs, with robinhood, but I am unable to reproduce at the moment…

I wonder if you could try to configure LDLM/RPC traces with a big debug buffer on your MDS, with dump_on_eviction set on the Clients, and re-run the reproducer during some quiet period so we can get detailed traces for both sides?

Comment by Bruno Faccini (Inactive) [ 18/Feb/14 ]

Finally I have been able to reproduce locally, running with Lustre 2.4.0-RC2 only!! So I have the full traces for both the MDS and Client sides.
First analysis indicates that the lock causing the eviction is one that was created through mdt_intent_lock_replace() to replace an older one, and this seems to lead to the lock being duplicated for some reason:

00010000:00000040:29.0:1392685427.887184:0:127322:0:(ldlm_resource.c:1429:ldlm_resource_dump()) ### ### ns: mdt-lustrez-MDT0000_UUID lock: ffff880811cc8a00/0x2609ba00ab659d1c lrc: 2/0,0 mode: CR/CR res: 8589935617/2 bits 0x9 rrc: 9 type: IBT flags: 0x200000000000 nid: 10.10.18.181@tcp remote: 0x6963eaa818aefa09 expref: 15 pid: 103045 timeout: 0 lvb_type: 0
00010000:00000040:29.0:1392685427.887190:0:127322:0:(ldlm_resource.c:1429:ldlm_resource_dump()) ### ### ns: mdt-lustrez-MDT0000_UUID lock: ffff8808030dbc00/0x2609ba00ab659cf9 lrc: 2/0,0 mode: CR/CR res: 8589935617/2 bits 0x9 rrc: 9 type: IBT flags: 0x200000000000 nid: 10.10.18.181@tcp remote: 0x6963eaa818aefa09 expref: 15 pid: 103045 timeout: 0 lvb_type: 0

More to come.

Comment by Jinshan Xiong (Inactive) [ 18/Feb/14 ]

Hi Bruno,

Good job reproducing it locally.

"And the -22/-EINVAL is for the 3rd LDLM_BL_CALLBACK and it concerns an other handle (0x6aff086a1c937ce1) that the Client can't find the corresponding lock " - Usually this means that the lock has already been canceled from the client side. This is a normal race.

Comment by Christopher Morrone [ 18/Feb/14 ]

and running with Lustre 2.4.0-RC2 only

Just FYI, 2.4.0-RC2 and 2.4.0 tag the same commit.

Are you suggesting that later 2.4.X tags do not have this problem?

Comment by Bruno Faccini (Inactive) [ 19/Feb/14 ]

Chris: No, I just wanted to indicate that, according to our internal tag nomenclature, I was able to reproduce with the latest 2.4.0-RC2 build running on both the Client and Server sides. I will try to reproduce with the later 2.4.1-RC2/2.4.2-RC2 versions and let you know.

Jinshan: OK regarding the -22/EINVAL result caused by the "normal" race. But concerning the problem itself, after more analysis of the local reproducer traces, it looks like on the Server side we create 2 identical IBITS locks/handles for the same Client/remote, but the Client only answers for one of them during the Blocking-AST/cancel process and simply discards the other, which later leads the Server to evict the Client because it gets no answer for the 2nd lock.
Does this sound like an already known issue to you?

Comment by Christopher Morrone [ 19/Feb/14 ]

I think you guys only use the -RC2 nomenclature while it is a release candidate, and then drop the -RC2 when the real release is done. There is no reason to keep using the release candidate number when the release is finished.

Comment by Bruno Faccini (Inactive) [ 20/Feb/14 ]

More analysis of the Lustre debug traces seems to indicate that the "orphan" lock on the MDS causing the Client eviction is the one acquired for the 1st open request via mdt_reint_internal()/mdt_reint_rec()/mdt_reint_open()/mdt_object_open_lock(), while the "successfully" canceled one is from the 2nd/reconstructed open request via mdt_reint_internal()/mdt_check_resent()/mdt_reconstruct_open()/mdt_reint_open()/mdt_object_open_lock().

The strange thing is that this should have been fixed by patch http://review.whamcloud.com/8142 from LU-4179, which is already integrated… Or maybe its test "if (lustre_handle_is_used(&lhc->mlh_reg_lh))" does not cover all cases, or the Client should have set MSG_REPLAY?

Also, I need to find the reason for the reconstruct-open situation.

Comment by Bruno Faccini (Inactive) [ 20/Feb/14 ]

The reason for the resend (only a resend, not a replay) from the Client is that the 1st reply from the Server was truncated: only a 2048-byte reply buffer had been reserved, while the answer (including the layout for a 48-stripe width) required 2232 bytes.
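As a rough cross-check of those numbers (standalone illustration, not Lustre code; the 1032-byte "fixed" part of the reply is an assumption chosen so the total matches the 2232 bytes above, and the LOV v3 header/per-stripe sizes are approximate on-wire sizes), the 48-stripe layout alone already accounts for well over a kilobyte:

/* Back-of-the-envelope check: why a 2048-byte reply buffer cannot hold the
 * open reply for a 48-stripe file.  Assumed sizes, for illustration only. */
#include <stdio.h>

int main(void)
{
        const unsigned int stripes       = 48;
        const unsigned int fixed_reply   = 1032;  /* assumed non-layout part (mdt_body, ldlm_reply, ...) */
        const unsigned int lmm_v3_header = 48;    /* ~sizeof(struct lov_mds_md_v3)   */
        const unsigned int per_stripe    = 24;    /* ~sizeof(struct lov_ost_data_v1) */
        unsigned int need = fixed_reply + lmm_v3_header + stripes * per_stripe;

        printf("reply bytes needed: %u (buffer reserved: 2048)\n", need);
        return 0;
}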

Comment by Bruno Faccini (Inactive) [ 20/Feb/14 ]

I am looking for a possible fix (have the Client ask for a replay when an open reply is truncated?) and also a possible workaround (such as growing the "default" reply buffer?).
I also plan to exercise the reproducer on b2_5/master.

Comment by Di Wang [ 20/Feb/14 ]

I did not read through the whole discussion, but if the client resends an RPC with an open lock request, the server should return the original lock. If you see two locks there, something might be wrong IMHO.
Hmm, you might need http://review.whamcloud.com/#/c/8642/ from LU-4403, especially if 8142 is already integrated.

Comment by Jinshan Xiong (Inactive) [ 21/Feb/14 ]

Hi Bruno,

Patch 8142 only applies to the open-by-name path. The same fix has to be applied to the mdt_open_by_fid_lock() code path as well. Will you please create a patch for it?

Jinshan

Comment by Bruno Faccini (Inactive) [ 21/Feb/14 ]

Di: thanks for pointing to LU-4403/Change #8642. I will give it a try, but I already wonder whether it can really fix our problem here, since it does not change the mdt_intent_fixup_resent() behavior in the open_lock case, and the uninitialized "ibits" variable will indeed be zeroed by mdt_object_open_lock() on its 1st use.

Jinshan: I created a patch that replicates Change #8142 from LU-4179 in mdt_open_by_fid_lock(); it is at http://review.whamcloud.com/9342. But I also wonder whether it can be the fix for this problem, since in my reproducer debug logs I always see open requests with the file name provided, and indeed no mdt_open_by_fid_lock() traces.

Chris: what was the reason for running the Servers with 2.4.0-21chaos, which includes the patch from LU-4179? Had you encountered the MDS LBUG that patch fixes?

Lastly, I pushed patch http://review.whamcloud.com/9326 to try out my own idea that the Client should ask for a replay if the reply is truncated during open and the request is resent.

Comment by Christopher Morrone [ 21/Feb/14 ]

Yes, we pulled that in because our MDS crashed with the same assertion and stack as LU-4179.

Comment by Bruno Faccini (Inactive) [ 25/Feb/14 ]

Some more results/facts :

_ the reproducer triggers the problem on the latest b2_4 (i.e., 2.4.2-RC2 + latest patches) and b2_5 builds. I will also try with the latest master, but I suspect the problem will be there as well.

_ change #9342, where I simply replicated in mdt_open_by_fid_lock() the new controls and behavior that skip open_lock upon resend in mdt_reint_open() (introduced by patch #8142 for LU-4179), fails in auto-tests. It seems to cause setstripe failures... So it needs some rework, not just a simple mimic of patch #8142.

_ I abandoned change #9326 because, after more analysis of the replay mechanisms, I found this approach to be wrong. The resend mechanisms are OK (with patch #8142 from LU-4179!!, but maybe also needing #8642 from LU-4403??) and should be able to handle this scenario, so the problem must be somewhere else (lock hashing/lookup, …).

_ I had a discussion with Johann about this, and I am now working on a debug patch to print more information upon resend, both during the 1st lock retrieval via the hash mechanism that seems to fail, and during the blocking callback on the 2nd/excess lock that also seems to fail (the Client must answer, even if only to return an error)…

Comment by Bruno Faccini (Inactive) [ 27/Feb/14 ]

After simplifying the reproducer further (RobinHood is no longer required; it now consists only of a "dd" command to create a 48-stripe file and a "file" command in a row on the same node, then an "rm" from another node) and more debug trace analysis, it definitely seems that, as already indicated, things start to go wrong when a 2nd lock is enqueued on the Server upon truncate+resend. There are then 2 possible reasons why the 1st lock is not found by the lookup:

_ the 1st lock has not been hashed, or has been unhashed

_ the hashing mechanism is buggy

I pushed a debug patch at http://review.whamcloud.com/9421 that should help to find the real cause now.

Also of interest: I have not been able to reproduce with the latest master at this time…

Comment by Bruno Faccini (Inactive) [ 03/Mar/14 ]

Latest news, and of no less interest, for this ticket:

_ I have finally been able to reproduce with the latest master!!!

_ I have been able to reproduce the problem running with a b2_4 build including my debug patch, and I am now analyzing the traces. The 1st observation so far is that the hashing mechanism is working and is not causing the problem. More to come.

Comment by Bruno Faccini (Inactive) [ 03/Mar/14 ]

BTW, a more in-depth analysis of the Lustre debug logs generated by my debug patch suggests that the old/1st lock is not found during the hash lookup because the new lock has already been hashed!! I will try to push a new patch/fix proposal where new_lock hashing is deferred, or where there is a way to find the old lock during lookup if the new one is found 1st...

Comment by Bruno Faccini (Inactive) [ 04/Mar/14 ]

So, thanks to my additional debug traces from #9421, the reason the 1st/original lock is not found during resend handling on the Server is the early create+hash of the new lock in ldlm_handle_enqueue0(): the cfs_hash_lookup() call in mdt_intent_fixup_resent() returns the new lock instead of the 1st/old one.

I will try to develop a fix where mdt_intent_fixup_resent() uses cfs_hash_for_each_key() instead, with a callback that reports a match only when the current hashed lock is not new_lock, so that the old/resent lock, if any, can still be looked up. A toy model of this idea is sketched below.
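A toy userspace model of that idea (not the actual #9488 code; the hash is mimicked with a plain chain, and the remote-handle value is simply borrowed from the log excerpt earlier in this ticket) shows why a first-match lookup returns the freshly hashed duplicate while an iterate-and-skip lookup still finds the original lock:

/* Toy model only: first-match lookup vs iterate-and-skip lookup. */
#include <stdio.h>
#include <stdint.h>

struct toy_lock {
        uint64_t         remote_handle;   /* key: client-side lock handle */
        const char      *name;
        struct toy_lock *next;            /* hash chain for this key      */
};

/* Models cfs_hash_lookup(): return whichever entry is first on the chain. */
static struct toy_lock *lookup_first(struct toy_lock *chain, uint64_t key)
{
        for (; chain != NULL; chain = chain->next)
                if (chain->remote_handle == key)
                        return chain;
        return NULL;
}

/* Models the cfs_hash_for_each_key() callback idea: ignore new_lock so the
 * original/resent lock, if any, is the one that gets found. */
static struct toy_lock *lookup_skip(struct toy_lock *chain, uint64_t key,
                                    struct toy_lock *skip)
{
        for (; chain != NULL; chain = chain->next)
                if (chain->remote_handle == key && chain != skip)
                        return chain;
        return NULL;
}

int main(void)
{
        struct toy_lock old_lock = { 0x6963eaa818aefa09ULL, "old/original", NULL };
        struct toy_lock new_lock = { 0x6963eaa818aefa09ULL, "new/duplicate", &old_lock };
        struct toy_lock *head = &new_lock;   /* new lock was hashed first */

        printf("first-match lookup finds: %s\n",
               lookup_first(head, old_lock.remote_handle)->name);
        printf("skip-new lookup finds:    %s\n",
               lookup_skip(head, old_lock.remote_handle, &new_lock)->name);
        return 0;
}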

Comment by Bruno Faccini (Inactive) [ 04/Mar/14 ]

A new patch/fix attempt implementing the solution from the previous comment is at http://review.whamcloud.com/9488.

Comment by Bruno Faccini (Inactive) [ 07/Mar/14 ]

With patch-set #3 of #9488, running the reproducer triggered the crash/LBUG tracked in LU-4403!!:

LustreError: 45837:0:(ldlm_lock.c:851:ldlm_lock_decref_internal_nolock()) ASSERTION( lock->l_readers > 0 ) failed: 
LustreError: 45837:0:(ldlm_lock.c:851:ldlm_lock_decref_internal_nolock()) LBUG
Pid: 45837, comm: mdt00_002

Call Trace:
 [<ffffffffa105a895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa105ae97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa1218402>] ldlm_lock_decref_internal_nolock+0xd2/0x180 [ptlrpc]
 [<ffffffffa121cafd>] ldlm_lock_decref_internal+0x4d/0xaa0 [ptlrpc]
 [<ffffffffa1065d98>] ? libcfs_log_return+0x28/0x40 [libcfs]
 [<ffffffffa121dfb9>] ldlm_lock_decref+0x39/0x90 [ptlrpc]
 [<ffffffffa1791723>] mdt_save_lock+0x63/0x300 [mdt]
 [<ffffffffa1265890>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
 [<ffffffffa1791a1c>] mdt_object_unlock+0x5c/0x160 [mdt]
 [<ffffffffa17bfccc>] mdt_object_open_unlock+0xac/0x110 [mdt]
 [<ffffffffa17c6ce4>] mdt_reint_open+0xdd4/0x21d0 [mdt]
 [<ffffffffa17c876c>] mdt_reconstruct_open+0x68c/0xc30 [mdt]
 [<ffffffffa106b2d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa17bb415>] mdt_reconstruct+0x45/0x120 [mdt]
 [<ffffffffa1796d1b>] mdt_reint_internal+0x6bb/0x780 [mdt]
 [<ffffffffa17970ad>] mdt_intent_reint+0x1ed/0x520 [mdt]
 [<ffffffffa1794f5e>] mdt_intent_policy+0x39e/0x720 [mdt]
 [<ffffffffa121a831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
 [<ffffffffa12411cf>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
 [<ffffffffa17953e6>] mdt_enqueue+0x46/0xe0 [mdt]
 [<ffffffffa179c797>] mdt_handle_common+0x647/0x16d0 [mdt]
 [<ffffffffa17d56e5>] mds_regular_handle+0x15/0x20 [mdt]
 [<ffffffffa12733b8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
 [<ffffffffa106ce05>] ? lc_watchdog_touch+0xd5/0x170 [libcfs]
 [<ffffffffa126a719>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
 [<ffffffffa106b2d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
 [<ffffffffa127474e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
 [<ffffffffa1273c80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffffa1273c80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffffa1273c80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Kernel panic - not syncing: LBUG
Pid: 45837, comm: mdt00_002 Tainted: P           ---------------    2.6.32-358.23.2.el6_lustre.gc4f4d18.x86_64 #1
Call Trace:
 [<ffffffff8150deec>] ? panic+0xa7/0x16f
 [<ffffffffa105aeeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa1218402>] ? ldlm_lock_decref_internal_nolock+0xd2/0x180 [ptlrpc]
 [<ffffffffa121cafd>] ? ldlm_lock_decref_internal+0x4d/0xaa0 [ptlrpc]
 [<ffffffffa1065d98>] ? libcfs_log_return+0x28/0x40 [libcfs]
 [<ffffffffa121dfb9>] ? ldlm_lock_decref+0x39/0x90 [ptlrpc]
 [<ffffffffa1791723>] ? mdt_save_lock+0x63/0x300 [mdt]
 [<ffffffffa1265890>] ? lustre_swab_ldlm_reply+0x0/0x40 [ptlrpc]
 [<ffffffffa1791a1c>] ? mdt_object_unlock+0x5c/0x160 [mdt]
 [<ffffffffa17bfccc>] ? mdt_object_open_unlock+0xac/0x110 [mdt]
 [<ffffffffa17c6ce4>] ? mdt_reint_open+0xdd4/0x21d0 [mdt]
 [<ffffffffa17c876c>] ? mdt_reconstruct_open+0x68c/0xc30 [mdt]
 [<ffffffffa106b2d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa17bb415>] ? mdt_reconstruct+0x45/0x120 [mdt]
 [<ffffffffa1796d1b>] ? mdt_reint_internal+0x6bb/0x780 [mdt]
 [<ffffffffa17970ad>] ? mdt_intent_reint+0x1ed/0x520 [mdt]
 [<ffffffffa1794f5e>] ? mdt_intent_policy+0x39e/0x720 [mdt]
 [<ffffffffa121a831>] ? ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
 [<ffffffffa12411cf>] ? ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
 [<ffffffffa17953e6>] ? mdt_enqueue+0x46/0xe0 [mdt]
 [<ffffffffa179c797>] ? mdt_handle_common+0x647/0x16d0 [mdt]
 [<ffffffffa17d56e5>] ? mds_regular_handle+0x15/0x20 [mdt]
 [<ffffffffa12733b8>] ? ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
 [<ffffffffa106ce05>] ? lc_watchdog_touch+0xd5/0x170 [libcfs]
 [<ffffffffa126a719>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
 [<ffffffffa106b2d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
 [<ffffffffa127474e>] ? ptlrpc_main+0xace/0x1700 [ptlrpc]
 [<ffffffffa1273c80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
 [<ffffffffa1273c80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffffa1273c80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

This is because the lock (created or not) is expected to have at least a reference when leaving mdt_reint_open().

But the fix from LU-4403 will not help: running the latest master version, which includes it, I reproduced the lock duplication in the truncate/resend scenario. This may be because it is apparently intended to handle only the pure MDS_OPEN_LOCK case.

So, after discussing this with Johann again, starting with patch-set #4 I now take a reference on the old lock found for the resent request.

Comment by Bruno Faccini (Inactive) [ 07/Mar/14 ]

My intensive local tests of patch-set #5 against the reproducer are successful!

Actually, I take/restore the required reference on the old/original lock now found upon truncate/resend (thus avoiding a new/duplicate lock being created and the 1st one becoming an orphan that leads to the eviction during the blocking-AST process) with ldlm_lock_addref(); a toy model of why this reference is needed is sketched below. But maybe more checks need to be done against racing (cancel?) access?? We will see how the full auto-tests run.
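A minimal toy model of why restoring that reference matters (userspace illustration, not the actual patch; mdt_object_open_unlock()/mdt_save_lock() are only named to indicate where the later decref happens): without the addref, the decref performed by the unlock path would trip exactly the ASSERTION( lock->l_readers > 0 ) quoted earlier in this ticket.

/* Toy refcount model: the fix restores the reader reference before the
 * open/unlock path drops it again. */
#include <assert.h>
#include <stdio.h>

struct toy_lock { int readers; };

static void toy_addref(struct toy_lock *l) { l->readers++; }

static void toy_decref(struct toy_lock *l)
{
        assert(l->readers > 0);          /* models the LBUG check */
        l->readers--;
}

int main(void)
{
        struct toy_lock old_lock = { .readers = 0 };  /* as found on resend */

        toy_addref(&old_lock);   /* the fix: restore the reference          */
        toy_decref(&old_lock);   /* later unlock path: OK with the addref,  */
                                 /* would assert without it                 */
        printf("readers after open/unlock round-trip: %d\n", old_lock.readers);
        return 0;
}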

Di, if I write a master version of this patch, I will actually need to remove (or change?) your very restrictive test in the patch for LU-4403, so that new lock creation is also skipped for layout_lock upon resend. What do you think??

Comment by Bruno Faccini (Inactive) [ 10/Mar/14 ]

Auto-tests of #9488 patch-set #5 have been successful too. So even though the patch still needs to be reviewed, I am pretty confident that it solves the original problem.

I may push a new version of the patch with more protection against simultaneous actions/races, possibly based on using ldlm_revalidate_lock_handle() instead of ldlm_lock_addref() to restore the reference on the lock.

And to fit with patch #8642/#8680 for LU-4403 from Di, I may simply add a new test to allow the lookup also for layout_lock (new_lock->l_policy_data.l_inodebits.bits & MDS_INODELOCK_LAYOUT) and not only open_lock.

Comment by Bruno Faccini (Inactive) [ 24/Mar/14 ]

Di, after a more in-depth look into the source code, I don't think I can check for MDS_INODELOCK_LAYOUT in mdt_intent_fixup_resent(), because the decision to add the layout bit to the open is made further down/later, in mdt_object_open_lock()/try_layout...

BTW, the check/filter "if ((opcode != MDT_IT_OPEN) || (opcode == MDT_IT_OPEN && info->mti_spec.sp_cr_flags & MDS_OPEN_LOCK))" that you added in the patch (#8642/b2_4 and #8680/master) for LU-4403 is only an optimization attempt to save unnecessary hash lookups upon resend, but the problem/scenario discovered as part of this LU-4584 ticket seems to indicate that it is wrong (see the small truth-table check below). What do you think??
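A tiny truth-table check of that filter (illustrative only; the opcode and flag values below are placeholders, not the Lustre constants) shows that it skips the resent-lock lookup precisely for an open that does not request MDS_OPEN_LOCK, which is the case where the implicit layout lock then gets duplicated:

/* Evaluate the filter quoted above for the three interesting cases. */
#include <stdio.h>

enum { MDT_IT_OPEN = 1, MDT_IT_GETATTR = 2 };   /* placeholder values */
#define MDS_OPEN_LOCK 0x1                        /* placeholder value  */

static int fixup_allowed(int opcode, unsigned long cr_flags)
{
        /* The LU-4403 filter, as quoted in the comment above. */
        return (opcode != MDT_IT_OPEN) ||
               (opcode == MDT_IT_OPEN && (cr_flags & MDS_OPEN_LOCK));
}

int main(void)
{
        printf("getattr intent:              %d\n", fixup_allowed(MDT_IT_GETATTR, 0));
        printf("open with MDS_OPEN_LOCK:     %d\n", fixup_allowed(MDT_IT_OPEN, MDS_OPEN_LOCK));
        printf("open without MDS_OPEN_LOCK:  %d\n", fixup_allowed(MDT_IT_OPEN, 0));
        /* The last case prints 0: the resent-lock lookup is skipped, so the
         * implicit layout lock from mdt_object_open_lock() gets duplicated. */
        return 0;
}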

Accordingly, my patch version for the master branch is at http://review.whamcloud.com/9764. Since this is the master version, where the patch for LU-4403 has already landed, I had to remove its overly restrictive optimization in order to allow the implicit layout_lock during open handling upon resend. But I kept the [un]lock_res_and_lock() protection it introduced, even though I am not convinced it is required in this case.

Comment by Matt Ezell [ 30/Mar/14 ]

We are seeing consistent evictions with 2.5 clients when running a shared-file IOR (stripe count 64). I'm seeing some similar messages in our client logs:

00010000:00010000:2.0:1396143171.559956:0:3053:0:(ldlm_lockd.c:1668:ldlm_handle_bl_callback()) Lock ffff8803f9c19100 already unused, calling callback (ffffffffa0793880)
00010000:00010000:2.0:1396143171.559958:0:3053:0:(ldlm_request.c:1135:ldlm_cli_cancel_local()) ### client-side cancel ns: atlastds-MDT0000-mdc-ffff880403f44c00 lock: ffff8803f9c19100/0x330d1ef03081582 lrc: 3/0,0 mode: CR/CR res: [0x200003ac6:0x648:0x0].0 bits 0x9 rrc: 1 type: IBT flags: 0x428400000000 nid: local remote: 0xe7f79ec3f6545595 expref: -99 pid: 4138 timeout: 0 lvb_type: 3
00010000:00010000:2.0:1396143171.559987:0:3053:0:(ldlm_request.c:1194:ldlm_cancel_pack()) ### packing ns: atlastds-MDT0000-mdc-ffff880403f44c00 lock: ffff8803f9c19100/0x330d1ef03081582 lrc: 2/0,0 mode: --/CR res: [0x200003ac6:0x648:0x0].0 bits 0x9 rrc: 1 type: IBT flags: 0x4c29400000000 nid: local remote: 0xe7f79ec3f6545595 expref: -99 pid: 4138 timeout: 0 lvb_type: 3
00010000:00010000:2.0:1396143171.559991:0:3053:0:(ldlm_request.c:1198:ldlm_cancel_pack()) 1 locks packed
00010000:00010000:2.0:1396143171.560003:0:3053:0:(ldlm_lockd.c:1677:ldlm_handle_bl_callback()) ### client blocking callback handler END ns: atlastds-MDT0000-mdc-ffff880403f44c00 lock: ffff8803f9c19100/0x330d1ef03081582 lrc: 1/0,0 mode: --/CR res: [0x200003ac6:0x648:0x0].0 bits 0x9 rrc: 1 type: IBT flags: 0x4c29400000000 nid: local remote: 0xe7f79ec3f6545595 expref: -99 pid: 4138 timeout: 0 lvb_type: 3
00010000:00010000:2.0:1396143171.560007:0:3053:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: atlastds-MDT0000-mdc-ffff880403f44c00 lock: ffff8803f9c19100/0x330d1ef03081582 lrc: 0/0,0 mode: --/CR res: [0x200003ac6:0x648:0x0].0 bits 0x9 rrc: 1 type: IBT flags: 0x4c29400000000 nid: local remote: 0xe7f79ec3f6545595 expref: -99 pid: 4138 timeout: 0 lvb_type: 3
00010000:00010000:5.0:1396143171.560209:0:2437:0:(ldlm_lockd.c:2194:ldlm_callback_handler()) callback on lock 0x330d1ef03081582 - lock disappeared
00010000:00010000:5.0:1396143171.560220:0:2437:0:(ldlm_lockd.c:2055:ldlm_callback_errmsg()) @@@ Operate with invalid parameter: [nid 12345-10.36.226.79@o2ib] [rc 0] [lock 0x330d1ef03081582]  req@ffff8802b2680c00 x1463740310294316/t0(0) o104->LMV_MDC_UUID@10.36.226.79@o2ib:0/0 lens 296/192 e 0 to 0 dl 1396143263 ref 1 fl Interpret:/0/0 rc -22/-22

Is the re-send due to LLNL's client-side optimization to avoid vmallocs, or something else? If it is, with only 64 stripes I don't think a re-send is necessary. But I need to do some more analysis and get traces with RPCTRACE enabled.

Bruno - is the patch ready for reviewers and possible landing, or is more work still needed?
Chris - have you guys tried out this patch?

Comment by Christopher Morrone [ 31/Mar/14 ]

Chris - have you guys tried out this patch?

No, I have been waiting for the patch to get closer to its final form. It sounds like we are close.

Comment by Oleg Drokin [ 31/Mar/14 ]

Apparently this is a dup of LU-2827; there's even a test case in http://review.whamcloud.com/#/c/6000/ that we should use along with this proposed patch to ensure it works as expected.

Comment by Oleg Drokin [ 01/Apr/14 ]

BTW, Vitaly advises that he's almost done creating a proper patch for this issue (in LU-2827) and will upload it as soon as it passes internal inspections at Xyratex.

Comment by Bruno Faccini (Inactive) [ 02/Apr/14 ]

Matt: Yes, the b2_4/master patches were ready for reviewers, and the reviews finally and very recently happened...

Chris: maybe I did not make it clear enough, but it has now been about a month that patch #9488-5 works and fixes your problem with no regression on b2_4. And this applies to the master patch too.

Oleg: Sorry, I had no idea of the parallel work at Xyratex... BTW, LU-2827 has not been updated/commented on since 29/Apr/13!! OK, I will add a test case to my master patch #9764, inspired by #6000/#5978-3 from LU-2827, but maybe also with more specific steps to reproduce the resend scenario of this LU-4584 ticket. Again, I may not have been clear enough in my previous updates, but both my patches, for b2_4 and master, work against Chris's solid/permanent reproducer, with no regressions found by auto-tests.

Comment by James A Simmons [ 02/Apr/14 ]

I tested both your b2_4 patch with 2.4 and your master patch with b2_5. Both times my MDS went up in flames. I'm in the process of asking the admins for the logs of the event.

Comment by James A Simmons [ 02/Apr/14 ]

Here is the stack trace resulting from this patch. I named the log inflames.log.

Comment by Christopher Morrone [ 02/Apr/14 ]

maybe I did not make it clear enough, but it has now been about a month that patch

That was fairly clear. And I really appreciate your work!

But I'm not really ready to take a patch for code that I don't understand and run it in production until it has, at a minimum, gone through the code review process and is on its way to landing.

Comment by James A Simmons [ 03/Apr/14 ]

I have a single-shared-file IOR job that duplicates this problem. I turned on all debugging and pushed the logs to ftp.whamcloud.com/uploads/LU-4584/LU-4584.log.gz. Hope this helps.

Comment by James A Simmons [ 09/Apr/14 ]

Uploaded the clients' debug logs to ftp.whamcloud.com/uploads/LU-4584/client-run1.

Comment by James A Simmons [ 16/Apr/14 ]

The syslogs and nid mapping have also been uploaded to uploads/LU-4584/console-20140409

Comment by James A Simmons [ 21/Apr/14 ]

Did the logs shed any light on this problem?

Comment by Bruno Faccini (Inactive) [ 28/Apr/14 ]

James, I am sorry but I am puzzled by the latest information you uploaded.
I 1st thought it was intended to document, as you indicated you would!, an occurrence of the following LBUG you triggered when running with my patch:

Mar 31 14:18:21 atlas-tds-mds1 kernel: [  565.057997] LustreError: 12309:0:(mdt_handler.c:3615:mdt_intent_lock_replace()) ASSERTION( new_lock->l_readers + new_lock->l_writers == 0 ) failed: 
Mar 31 14:18:21 atlas-tds-mds1 kernel: [  565.058059] LustreError: 12460:0:(mdt_handler.c:3615:mdt_intent_lock_replace()) ASSERTION( new_lock->l_readers + new_lock->l_writers == 0 ) failed: 
Mar 31 14:18:21 atlas-tds-mds1 kernel: [  565.058062] LustreError: 12460:0:(mdt_handler.c:3615:mdt_intent_lock_replace()) LBUG

and which was shown in the inflames.log file you attached to this ticket.

But in fact, I now understand that it is most likely a different problem, so I need some help/clarification:
_ you commented "I have a single-shared-file IOR job that duplicates this problem.", but which problem exactly? The original one for this ticket? The LBUG?
_ did you also trigger it when running with my patch?
_ what is the main symptom? I cannot find any LBUG in the logs, only tons of Client evictions...
_ can you provide Server logs for the same time period?

I am also still very interested in getting the logs for the LBUG; can you finally provide them?

Comment by Lukasz Flis [ 28/Apr/14 ]

LU-4963 has been marked as a duplicate of this ticket, so I'll report things here.

I have uploaded logs from a client captured just after the crash (dlmtrace):

ftp.whamcloud.com/uploads/LU-4963/lustre-client-2.4.1-6.cyfronet.log.gz

The problem described in LU-4963 occurs with both 2.5.1 and 2.4.x clients.

Comment by James A Simmons [ 29/Apr/14 ]

Looking at the logs, I noticed that the source of our problems is the value of at_max. A JIRA search led me to LU-4578; I tried the patch for that ticket and, lo and behold, my single-shared-file failures went away. So, for the people experiencing this problem, please try patch http://review.whamcloud.com/#/c/9100 and post your results.

Comment by Bruno Faccini (Inactive) [ 30/Apr/14 ]

Lukasz: I am working on the trace you provided and will let you know whether I can confirm this is the same problem.
James: thanks for the details on what you have found. But again, can you provide the logs from the LBUG occurrence?

Also, since both tickets (and their patches!) are closely related, I tried running the reproducer of this LU-4584 ticket against the latest build #23312 from ticket/Gerrit change LU-2827/#5978-8. But the MDS just hit the LASSERT(lock->l_granted_mode != lock->l_req_mode) that the change introduced in ldlm_process_inodebits_lock().
In fact, I did not even have to run the full (create+touch+rm) reproducer scenario, since the MDS crashed during the file creation!!...
The platform I used was made of one MDS and 2 OSSs with a total of 154 OSTs, and full default striping...
I will investigate the crash-dump and debug logs to understand/qualify what went wrong.

Comment by James A Simmons [ 30/Apr/14 ]

I did more testing last night and found that the patch from LU-4578 only helps reduce the occurrence of this bug. It still happened, but much less often. The new reproducer I use is:

aprun -n 112 /lustre/sultan/scratch/jsimmons/simul -d /lustre/sultan/scratch/jsimmons/test_simul

where test_simul is striped at 56. This fails consistently for me.

Sorry Bruno, I haven't gotten back to you about the patch. We dropped it from our testing recently, but we should give it another try.

Comment by Christopher Morrone [ 30/Apr/14 ]

I am a little worried that we are taking the wrong approach on this issue. I don't fully understand what you guys are working on, so maybe my concern is misplaced.

Folks keep talking about this problem along the lines of "if a message is lost, we are not handling that situation correctly". But that is not the situation in this ticket, to the best of my knowledge. The issue in this ticket is 100% reliable. If a message is getting "lost" every single time I run this test, then clearly that is itself a bug that needs to be worked in Lustre. That would also be the primary issue that needs to be worked under this ticket.

It would also be good to respond quickly and correctly to lost message situations, but that sounds like a somewhat separate issue.

Comment by Bruno Faccini (Inactive) [ 02/May/14 ]

Hello Chris,
I completely agree with you that this ticket does not address a lost-message issue. The problem here is only about correctly handling the situation where an open request's reply from the Server is truncated (because, due to the striping layout, the LVB does not fit into the reply buffer) and the same request is resent by the Client, which ends up duplicating the layout lock on the Server and finally causes the eviction during blocking-callback processing.

My b2_4 patch #9488 has successfully passed auto-tests and received 3 positive reviews, but a negative one from James, since he experienced an MDS LBUG when he integrated it. I am waiting for him to provide the necessary information to allow me to find what went wrong.

Also, since it was discovered that similar work is ongoing as part of LU-2827, with a more generic patch being developed to also address other intent-lock cases, I decided to run your reproducer against it, but I triggered an MDS LASSERT(lock->l_granted_mode != lock->l_req_mode) that the change introduced in ldlm_process_inodebits_lock(). I am currently working on the crash-dump from this LASSERT.

Comment by James A Simmons [ 02/May/14 ]

I was just cross-referencing another ticket that seemed to help ease the pain. The description seemed kind of loose to me, so I figured it was an umbrella for all things lock- and AST-breakage related. Also, I had to get something going for the users to be able to test on. Now that I have a seemingly stable system, I can focus again on this work, which will be applied to my smaller testbed. Yes, that patch is not a cure-all. Now that we have LU-2827 as well, which patch do I test? Will both pieces of work be combined, and under which ticket?

Comment by Christopher Morrone [ 02/May/14 ]

James, could you clarify whether or not patch 9488 is introducing a new problem in your testing? If the problems you are seeing are unrelated to that patch, please clear your -1 review.

Comment by James A Simmons [ 05/May/14 ]

The last time I tested it, my MDS oopsed. I removed the patch and the oops went away. From Oleg's comments, this patch will not be landed on master, since a more generic solution needs to be developed.

Comment by Bruno Faccini (Inactive) [ 06/May/14 ]

James, the more generic solution is addressed by the current state (patch-set #8) of the patch for LU-2827, and it should not be combined with my Change #9488 for this ticket.
BTW, as I already reported here and in LU-2827, when testing the LU-2827 patch against Chris's reproducer for this ticket, I triggered an MDS LASSERT(lock->l_granted_mode != lock->l_req_mode) that was introduced in ldlm_process_inodebits_lock()... This LASSERT may simply have to be removed, but we need to confirm this.

Comment by Oleg Drokin [ 10/May/14 ]

Chris, now that we see why there are resends even without lost packets, I have another question.

Do you carry a patch for LU-3338 or something of a similar nature? Looking at your chaos tree seems to imply this, even though I don't see the exact commit.
This would potentially explain why you seem to be hitting this with 2.4 code. (2.5 and its additional layout lock payload seems to add a similar failure mode, but you need 2.5 servers and 2.5 clients for that.)
Any chance you can disable the patch and see if it helps you?

Comment by Bruno Faccini (Inactive) [ 10/May/14 ]

Oleg,
Sorry to interject, and maybe I did not make it clear enough, but you may have missed that I already indicated that Chris's reproducer also triggers the described problem scenario on homogeneous b2_5/master Server/Client platforms.

Comment by Oleg Drokin [ 11/May/14 ]

I think James already verified that while 2.5/2.5 triggers the problem due to the layout lock (and the increased request sizes that come with it, I imagine), 2.4/2.4 and 2.4/2.5 do not trigger any ill effects.

Comment by Bruno Faccini (Inactive) [ 11/May/14 ]

Oleg, sorry, my 1st answer to your LU-3338 suggestion had the wrong/reverse meaning; I should have pointed out again that the reproducer also triggers when running with homogeneous b2_4 and 2.4.0-RC2 Server/Client platforms!! So I don't think LU-3338 can be suspected here, even if I agree it can raise the likelihood of the problem occurring. BTW, as I detailed above, during my attempt to reproduce locally I saw that the reply buffer size allocated on the Client was 2K, into which a 46-stripe layout cannot fit. So the truncation, and then the problem, will occur anyway whenever the reply buffer is smaller than what the layout size requires.

Comment by Christopher Morrone [ 12/May/14 ]

Do you carry a patch for LU-3338 or something of a similar nature? Looking at your chaos tree seems to imply this, even though I don't see the exact commit.

Yes. It is probably the one that begins the first line of its commit comment with "LU-3338".

Comment by Bruno Faccini (Inactive) [ 20/May/14 ]

Chris, have you been able to check Oleg's proposal to run without the patch from LU-3338?

On the other hand, I have submitted an add-on patch (http://review.whamcloud.com/10378, to handle the layout-lock case you trigger here) to the "generic" one (http://review.whamcloud.com/5978/) handling resends from LU-2827; together, the two patches should fix the specific issue you reported in this ticket.

Comment by James A Simmons [ 21/May/14 ]

Currently I'm in the middle of my 24-hour test shot window, and I have some interesting results. First, the problem I reported here happened with 2.4 servers (without any patches from this ticket or LU-2827) and 2.5 clients. We had two 2.4 file systems with the same OST count (1008), but the OST sizes were different: the production file system's OSTs are 14TB in size and the test file system's OSTs are 255GB. The problems have only shown up on the small OSTs so far. I placed the vmcore and dmesg at
ftp.whamcloud.com/uploads/LU-4584. All the involved files end in *--20-5-2014.

Comment by Bruno Faccini (Inactive) [ 21/May/14 ]

James,
Thanks for providing this information, but to allow vmcore analysis I also need the associated vmlinux binary (or, better, the 2x kernel-debuginfo[-common] RPMs) and the Lustre modules (i.e., the lustre-modules or, better, lustre-debuginfo RPM).

Also, am I right in understanding, from your last comment, that you hit these occurrences of

(mdt_handler.c:3615:mdt_intent_lock_replace()) ASSERTION( new_lock->l_readers + new_lock->l_writers == 0 ) failed:
(mdt_handler.c:3615:mdt_intent_lock_replace()) LBUG

the LBUG/assert on a platform made of 2.4 Servers and 2.5 Clients, none of them carrying any of the patches from this LU-4584 ticket or from LU-2827?

Comment by James A Simmons [ 21/May/14 ]

Yes, this showed up with 2.4 servers and 2.5 clients without any patches. I uploaded all the debuginfo RPMs to the same FTP location, so now you will be able to look at the core dumps. If you need anything else, let me know. Thank you.

Comment by Christopher Morrone [ 21/May/14 ]

James, it sounds like your assertion is unrelated to this ticket, LU-4584. Could you please start a separate ticket for that?

Comment by James A Simmons [ 22/May/14 ]

Yes, it is looking like this assertion is not related to this ticket. I am waiting on feedback from Bruno and/or Oleg about the results of the core dump analysis; then it will most likely be moved to its own ticket.

Comment by Bruno Faccini (Inactive) [ 23/May/14 ]

James,
According to the crash dump analysis, the LBUG occurs during resend+reconstruct_open. But since only a few Lustre debug traces are available (only default traces were enabled...), to confirm this I can only rely on the latest traces of each concerned thread (there are a bunch of mdt threads triggering this same LBUG at almost the same time...), like the following:

00000004:00080000:2.0:1400620887.232132:0:17723:0:(mdt_recovery.c:681:mdt_req_from_lcd()) @@@ restoring transno 51539764600/status 0  req@ffff881e4d7cf800 x1468652173289564/t0(0) o101->3472002b-ab6b-3d44-377e-cbe24f35879e@1395@gni100:0/0 lens 664/3512 e 0 to 0 dl 1400621204 ref 1 fl Interpret:/2/0 rc 0/0
...........
00000004:00040000:2.0:1400620887.245915:0:17723:0:(mdt_handler.c:3615:mdt_intent_lock_replace()) ASSERTION( new_lock->l_readers + new_lock->l_writers == 0 ) failed:
............
00000004:00040000:2.0:1400620887.261204:0:17723:0:(mdt_handler.c:3615:mdt_intent_lock_replace()) LBUG

with all the similar threads' stacks looking like the following:

crash> bt 17723
PID: 17723  TASK: ffff881e84174ae0  CPU: 2   COMMAND: "mdt00_012"
 #0 [ffff88011c427e90] crash_nmi_callback at ffffffff8102d3a6
 #1 [ffff88011c427ea0] notifier_call_chain at ffffffff81512105
 #2 [ffff88011c427ee0] atomic_notifier_call_chain at ffffffff8151216a
 #3 [ffff88011c427ef0] notify_die at ffffffff8109cf0e
 #4 [ffff88011c427f20] do_nmi at ffffffff8150fdcb
 #5 [ffff88011c427f50] nmi at ffffffff8150f690
    [exception RIP: delay_tsc+50]
    RIP: ffffffff81280c72  RSP: ffff881e841bb620  RFLAGS: 00000012
    RAX: fffffffff796f896  RBX: 0000000000000002  RCX: 00000000f796f896
    RDX: 0000000000000688  RSI: 00000000002625ec  RDI: 00000000000009c5
    RBP: ffff881e841bb640   R8: ffffffff81c07840   R9: 0000000000000000
    R10: 0000000000000005  R11: 0000000000000096  R12: fffffffff796f20e
    R13: 0000000000000002  R14: 00000000000009c5  R15: ffffffff813542e0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff881e841bb620] delay_tsc at ffffffff81280c72
 #7 [ffff881e841bb648] __const_udelay at ffffffff81280c36
 #8 [ffff881e841bb658] wait_for_xmitr at ffffffff81354262
 #9 [ffff881e841bb688] serial8250_console_putchar at ffffffff81354306
#10 [ffff881e841bb6a8] uart_console_write at ffffffff8134fc9e
#11 [ffff881e841bb6e8] serial8250_console_write at ffffffff8135466d
#12 [ffff881e841bb738] __call_console_drivers at ffffffff8106e775
#13 [ffff881e841bb768] _call_console_drivers at ffffffff8106e7da
#14 [ffff881e841bb788] release_console_sem at ffffffff8106ee28
#15 [ffff881e841bb7c8] vprintk at ffffffff8106f528
#16 [ffff881e841bb868] printk at ffffffff8150c005
#17 [ffff881e841bb8c8] __print_symbol at ffffffff810b9636
#18 [ffff881e841bb9e8] print_trace_address at ffffffffa03b1920 [libcfs]
#19 [ffff881e841bba38] print_context_stack at ffffffff8100f551
#20 [ffff881e841bbaa8] dump_trace at ffffffff8100e4a0
#21 [ffff881e841bbb48] libcfs_debug_dumpstack at ffffffffa03b1895 [libcfs]
#22 [ffff881e841bbb68] lbug_with_loc at ffffffffa03b1e97 [libcfs]
#23 [ffff881e841bbb88] mdt_intent_lock_replace at ffffffffa0d365a6 [mdt]
#24 [ffff881e841bbbd8] mdt_intent_reint at ffffffffa0d3c387 [mdt]
#25 [ffff881e841bbc28] mdt_intent_policy at ffffffffa0d39fae [mdt]
#26 [ffff881e841bbc68] ldlm_lock_enqueue at ffffffffa0699841 [ptlrpc]
#27 [ffff881e841bbcc8] ldlm_handle_enqueue0 at ffffffffa06c02cf [ptlrpc]
#28 [ffff881e841bbd38] mdt_enqueue at ffffffffa0d3a436 [mdt]
#29 [ffff881e841bbd58] mdt_handle_common at ffffffffa0d3ebc7 [mdt]
#30 [ffff881e841bbda8] mds_regular_handle at ffffffffa0d7aae5 [mdt]
#31 [ffff881e841bbdb8] ptlrpc_server_handle_request at ffffffffa06f2568 [ptlrpc]
#32 [ffff881e841bbeb8] ptlrpc_main at ffffffffa06f38fe [ptlrpc]
#33 [ffff881e841bbf48] kernel_thread at ffffffff8100c0ca
crash> 

So this looks like the same kind of issue that has been reported starting with this ticket, even if the symptom is different (LBUG vs eviction), and it is likely caused by a large striping/layout.
BTW, my patch (#9488) from this ticket, or both patches (#5978, #10378) from LU-2827, should fix this...

Comment by James A Simmons [ 23/May/14 ]

Thank you for looking at our logs. I have tried the #9488 patch and also #5978, but it did not help. I have not tried #5978 and #10378 together yet. That is next on the agenda.

Comment by Bruno Faccini (Inactive) [ 26/May/14 ]

James,
If you run a new test session, it would be very helpful to do it with full Lustre debug traces enabled, at least on the Client side, and with additional rpctrace+dlmtrace on the Server/MDS side. Full traces on both sides would be a must. Then, if you reproduce the LBUG, crash-dump analysis will be much easier when read alongside the full debug traces. You may also set panic_on_lbug to 1 on the MDS side to reduce the MDS crash-dump/debug-trace content to the minimum required volume of information, and set both dump_on_[timeout,eviction] on the Client side too, in order to ensure the debug log buffer is saved and preserved automatically when possible.

Comment by James A Simmons [ 02/Jun/14 ]

Got a nice lockup with an ls of a directory containing a single shared file. I did a full lctl set_param debug=+all. You can find the logs at:

ftp.whamcloud.com/uploads/LU-2827/client-lockup-2-6-2014.log
ftp.whamcloud.com/uploads/LU-2827/server-lockup-2-6-2014.log

The client has multiple file systems mounted, so filter for atlastds, which is the 2.5 file system.

Comment by Bruno Faccini (Inactive) [ 03/Jun/14 ]

Thanks James, but can you also provide the dmesg/syslog messages/stacks that should have been displayed when the lockup occurred?
That will greatly help me find the interesting sequences in these logs faster.

Also, I did not find any eviction trace in these logs.

Comment by James A Simmons [ 03/Jun/14 ]

With the two patches from LU-2827 I don't get client evictions any more, just pure lockups. I end up having to kill the job or, in the case of the ls test, kill the ssh session. The logs were not too verbose on either side, but here is what I got. Hope it helps.

From the client side

2014-06-02T09:19:38.010961-04:00 c0-0c0s6n1 Lustre: 15326:0:(client.c:1908:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1401714172/real 1401714172] req@ffff88014aca5400 x1469570034748356/t0(0) o36->atlastds-MDT0000-mdc-ffff8801491a7800@10.36.226.79@o2ib:12/10 lens 496/3880 e 12 to 1 dl 1401715177 ref 2 fl Rpc:IX/0/ffffffff rc 0/-1
2014-06-02T09:19:38.011028-04:00 c0-0c0s6n1 Lustre: atlastds-MDT0000-mdc-ffff8801491a7800: Connection to atlastds-MDT0000 (at 10.36.226.79@o2ib) was lost; in progress operations using this service will wait for recovery to complete
2014-06-02T09:19:38.140645-04:00 c0-0c0s6n1 Lustre: atlastds-MDT0000-mdc-ffff8801491a7800: Connection restored to atlastds-MDT0000 (at 10.36.226.79@o2ib)

MDS side

Jun 2 09:12:47 atlas-tds-mds1 kernel: [323268.180954] Lustre: 16989:0:(service.c:1349:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
Jun 2 09:12:47 atlas-tds-mds1 kernel: [323268.180956] req@ffff8807e3f53400 x1469570034748356/t0(0) o36->c4edc84a-1bf9-d28e-6029-43f8715edbaf@13@gni2:0/0 lens 496/3128 e 12 to 0 dl 1401714772 ref 2 fl Interpret:/0/0 rc 0/0
Jun 2 09:15:22 atlas-tds-mds1 kernel: [323423.238665] Lustre: lock timed out (enqueued at 1401714172, 750s ago)
Jun 2 09:19:37 atlas-tds-mds1 kernel: [323678.373844] Lustre: atlastds-MDT0000: Client c4edc84a-1bf9-d28e-6029-43f8715edbaf (at 13@gni2) reconnecting
Jun 2 13:25:36 atlas-tds-mds1 kernel: [338442.226261] Lustre: 17228:0:(service.c:1349:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
Jun 2 13:25:36 atlas-tds-mds1 kernel: [338442.226262] req@ffff8807e2c4ac00 x1469570037379276/t0(0) o101->c9bf7a8c-2264-2712-4013-110119f32786@12@gni2:0/0 lens 584/3384 e 12 to 0 dl 1401729941 ref 2 fl Interpret:/0/0 rc 0/0
Jun 2 13:28:11 atlas-tds-mds1 kernel: [338597.286986] Lustre: lock timed out (enqueued at 1401729341, 750s ago)
Jun 2 13:32:26 atlas-tds-mds1 kernel: [338852.487432] Lustre: atlastds-MDT0000: Client c9bf7a8c-2264-2712-4013-110119f32786 (at 12@gni2) reconnecting
Jun 2 13:49:11 atlas-tds-mds1 kernel: [339857.980604] Lustre: atlastds-MDT0000: Client c9bf7a8c-2264-2712-4013-110119f32786 (at 12@gni2) reconnecting
Jun 2 14:05:56 atlas-tds-mds1 kernel: [340863.498596] Lustre: atlastds-MDT0000: Client c9bf7a8c-2264-2712-4013-110119f32786 (at 12@gni2) reconnecting
Jun 2 14:16:12 atlas-tds-mds1 kernel: [341478.913322] LNetError: 12708:0:(o2iblnd_cb.c:3012:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 20 seconds
Jun 2 14:16:12 atlas-tds-mds1 kernel: [341478.924224] LNetError: 12708:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 10.36.230.231@o2ib200 (50): c: 0, oc: 0, rc: 63
Jun 2 14:16:13 atlas-tds-mds1 kernel: [341479.913856] LNetError: 12708:0:(o2iblnd_cb.c:3012:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 21 seconds
Jun 2 14:16:13 atlas-tds-mds1 kernel: [341479.924903] LNetError: 12708:0:(o2iblnd_cb.c:3012:kiblnd_check_txs_locked()) Skipped 1 previous similar message
Jun 2 14:16:13 atlas-tds-mds1 kernel: [341479.935921] LNetError: 12708:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 10.36.230.233@o2ib200 (51): c: 0, oc: 0, rc: 63
Jun 2 14:16:13 atlas-tds-mds1 kernel: [341479.949351] LNetError: 12708:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Skipped 1 previous similar message
Jun 2 14:22:41 atlas-tds-mds1 kernel: [341868.934059] Lustre: atlastds-MDT0000: Client c9bf7a8c-2264-2712-4013-110119f32786 (at 12@gni2) reconnecting

Comment by Cory Spitz [ 04/Jun/14 ]

James, I think that you said you needed 48 OSTs to trigger the buffer resize case. Is that all? We have a system with 48 OSTs that does not see this issue.

Comment by James A Simmons [ 04/Jun/14 ]

Did you set the default stripe count to 48?

Comment by Christopher Morrone [ 04/Jun/14 ]

We certainly only needed to stripe over 48 OSTs, as stated in this ticket's Description section.

Comment by Bruno Faccini (Inactive) [ 05/Jun/14 ]

James,
Based on the latest dmesg/syslog messages you provided, I tried to navigate your 2 last Client+Server logs in detail, and here is what I can tell:

_ log uploads/LU-2827/client-lockup-2-6-2014.log starts at 10:33:06 and ends at 13:17:11, if I have matched your time zone correctly.

_ log uploads/LU-2827/server-lockup-2-6-2014.log starts at 13:15:37 and ends at 13:16:08. It contains only 61734 lines, and I wonder whether you raised the debug log buffer size there?

_ so, according to this, I have only been able to work on transaction-id x1469570037379276 from the syslog messages you pointed out, and I am already missing the end of its processing on both the Client and Server sides. But I can see that the request that caused it to be sent from the Client is:

00800000:00000002:5.0:1401729341.437250:0:21275:0:(lmv_intent.c:321:lmv_intent_lock()) INTENT LOCK 'getattr' for 'ior_fpp_185100' on [0x200000400:0x2:0x0]
..............
00010000:00010000:5.0:1401729341.437311:0:21275:0:(ldlm_lock.c:795:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(CR) ns: atlastds-MDT0000-mdc-ffff880333607800 lock: ffff88027e231880/0x8aab3c4b568e4c64 lrc: 3/1,0 mode: --/CR res: [0x200000400:0x2:0x0].0 bits 0x0 rrc: 3 type: IBT flags: 0x10000000000000 nid: local remote: 0x0 expref: -99 pid: 21275 timeout: 0 lvb_type: 0
00010000:00010000:5.0:1401729341.437320:0:21275:0:(ldlm_request.c:926:ldlm_cli_enqueue()) ### client-side enqueue START, flags 1000
 ns: atlastds-MDT0000-mdc-ffff880333607800 lock: ffff88027e231880/0x8aab3c4b568e4c64 lrc: 3/1,0 mode: --/CR res: [0x200000400:0x2:0x0].0 bits 0x2 rrc: 3 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 21275 timeout: 0 lvb_type: 0
00010000:00010000:5.0:1401729341.437325:0:21275:0:(ldlm_request.c:988:ldlm_cli_enqueue()) ### sending request ns: atlastds-MDT0000-mdc-ffff880333607800 lock: ffff88027e231880/0x8aab3c4b568e4c64 lrc: 3/1,0 mode: --/CR res: [0x200000400:0x2:0x0].0 bits 0x2 rrc: 3 type: IBT flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 21275 timeout: 0 lvb_type: 0
00000100:00000001:5.0:1401729341.437329:0:21275:0:(client.c:2676:ptlrpc_queue_wait()) Process entered
00000100:00000001:5.0:1401729341.437330:0:21275:0:(client.c:861:ptlrpc_prep_set()) Process entered
00000100:00000010:5.0:1401729341.437331:0:21275:0:(client.c:862:ptlrpc_prep_set()) kmalloced 'set': 144 at ffff88021dcf0f00.
00000100:00000001:5.0:1401729341.437332:0:21275:0:(client.c:878:ptlrpc_prep_set()) Process leaving (rc=18446612141404262144 : -131932305289472 : ffff88021dcf0f00)
00000100:00000001:5.0:1401729341.437333:0:21275:0:(client.c:2605:ptlrpc_request_addref()) Process entered
00000100:00000001:5.0:1401729341.437334:0:21275:0:(client.c:2607:ptlrpc_request_addref()) Process leaving (rc=18446612141417609216 : -131932291942400 : ffff88021e9ab800)
00000020:00000001:5.0:1401729341.437336:0:21275:0:(class_obd.c:131:lustre_get_jobid()) Process entered
00000020:00000001:5.0:1401729341.437337:0:21275:0:(class_obd.c:136:lustre_get_jobid()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:5.0:1401729341.437338:0:21275:0:(client.c:2107:ptlrpc_set_wait()) Process entered
00000100:00000001:5.0:1401729341.437339:0:21275:0:(client.c:1415:ptlrpc_send_new_req()) Process entered
00000100:00000040:5.0:1401729341.437343:0:21275:0:(lustre_net.h:3283:ptlrpc_rqphase_move()) @@@ move req "New" -> "Rpc"  req@ffff88021e9ab800 x1469570037379276/t0(0) o101->atlastds-MDT0000-mdc-ffff880333607800@10.36.226.79@o2ib:12/10 lens 584/2344 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1
00000100:00000001:5.0:1401729341.437348:0:21275:0:(client.c:1066:ptlrpc_import_delay_req()) Process entered
00000100:00000001:5.0:1401729341.437349:0:21275:0:(client.c:1117:ptlrpc_import_delay_req()) Process leaving (rc=0 : 0 : 0)
02000000:00000001:5.0:1401729341.437350:0:21275:0:(sec.c:647:sptlrpc_req_refresh_ctx()) Process entered
02000000:00000001:5.0:1401729341.437351:0:21275:0:(sec.c:674:sptlrpc_req_refresh_ctx()) Process leaving (rc=0 : 0 : 0)
00000100:00100000:5.0:1401729341.437353:0:21275:0:(client.c:1476:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc bash:c9bf7a8c-2264-2712-4013-110119f32786:21275:1469570037379276:10.36.226.79@o2ib:101

at around 13:15:41.

_ then nothing else can be found about this lock request/transaction-id in the remaining 1:30 of traces in the Client's log.

_ on the Server side we can see its handling start with the following traces:

00000100:00100000:4.0:1401729341.436978:0:17237:0:(service.c:2079:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt01_160:c9bf7a8c-2264-2712-4013-110119f32786+17:21275:x1469570037379276:12345-12@gni2:101
...............
00010000:00010000:4.0:1401729341.437006:0:17237:0:(ldlm_lockd.c:1259:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-atlastds-MDT0000_UUID lock: ffff8807e81f6940/0x65bb92c64a60ecce lrc: 2/0,0 mode: --/CR res: [0x200000400:0x2:0x0].0 bits 0x0 rrc: 15 type: IBT flags: 0x40000000000000 nid: local remote: 0x8aab3c4b568e4c64 expref: -99 pid: 17237 timeout: 0 lvb_type: 0
...............
00000004:00000002:4.0:1401729341.440386:0:17237:0:(mdt_handler.c:1315:mdt_getattr_name_lock()) getattr with lock for [0x200000400:0x2:0x0]/ior_fpp_185100, ldlm_rep = ffff8807e3fca1f8
...............
00010000:00010000:4.0:1401729341.440443:0:17237:0:(ldlm_lock.c:1089:ldlm_granted_list_add_lock()) ### About to add lock: ns: mdt-atlastds-MDT0000_UUID lock: ffff8807e7578b00/0x65bb92c64a60ecd5 lrc: 3/1,0 mode: CR/CR res: [0x200000400:0x2:0x0].0 bits 0x2 rrc: 16 type: IBT flags: 0x50210000000000 nid: local remote: 0x0 expref: -99 pid: 17237 timeout: 0 lvb_type: 0
...............
00010000:00010000:4.0:1401729341.440453:0:17237:0:(ldlm_request.c:472:ldlm_cli_enqueue_local()) ### client-side local enqueue handler, new lock created ns: mdt-atlastds-MDT0000_UUID lock: ffff8807e7578b00/0x65bb92c64a60ecd5 lrc: 3/1,0 mode: CR/CR res: [0x200000400:0x2:0x0].0 bits 0x2 rrc: 16 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 17237 timeout: 0 lvb_type: 0
...............
00010000:00010000:4.0:1401729341.440486:0:17237:0:(ldlm_lock.c:1089:ldlm_granted_list_add_lock()) ### About to add lock: ns: mdt-atlastds-MDT0000_UUID lock: ffff8807e7578700/0x65bb92c64a60ecdc lrc: 3/1,0 mode: PR/PR res: [0x200000400:0x2:0x0].b30b5d7b bits 0x2 rrc: 6 type: IBT flags: 0x50210000000000 nid: local remote: 0x0 expref: -99 pid: 17237 timeout: 0 lvb_type: 0
................
00010000:00010000:4.0:1401729341.440493:0:17237:0:(ldlm_request.c:472:ldlm_cli_enqueue_local()) ### client-side local enqueue handler, new lock created ns: mdt-atlastds-MDT0000_UUID lock: ffff8807e7578700/0x65bb92c64a60ecdc lrc: 3/1,0 mode: PR/PR res: [0x200000400:0x2:0x0].b30b5d7b bits 0x2 rrc: 6 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 17237 timeout: 0 lvb_type: 0
.................
00000004:00000002:4.0:1401729341.440522:0:17237:0:(osd_handler.c:4031:osd_add_oi_cache()) add [0x200000401:0xd7:0x0] 3649491:0 to info ffff8807efb9a000
.................
00000004:00000040:4.0:1401729341.440528:0:17237:0:(mdt_handler.c:2468:mdt_object_find()) Find object for [0x200000401:0xd7:0x0]
.................
00010000:00000001:4.0:1401729341.440817:0:17237:0:(ldlm_lock.c:750:ldlm_add_ast_work_item()) Process entered
00010000:00000001:4.0:1401729341.440818:0:17237:0:(ldlm_lock.c:756:ldlm_add_ast_work_item()) Process leaving
00010000:00000001:4.0:1401729341.440818:0:17237:0:(ldlm_lock.c:750:ldlm_add_ast_work_item()) Process entered
00010000:00000001:4.0:1401729341.440818:0:17237:0:(ldlm_lock.c:756:ldlm_add_ast_work_item()) Process leaving
00010000:00000001:4.0:1401729341.440819:0:17237:0:(ldlm_lock.c:750:ldlm_add_ast_work_item()) Process entered
00010000:00000001:4.0:1401729341.440819:0:17237:0:(ldlm_lock.c:756:ldlm_add_ast_work_item()) Process leaving
00010000:00000001:4.0:1401729341.440820:0:17237:0:(ldlm_inodebits.c:163:ldlm_inodebits_compat_queue()) Process leaving (rc=0 : 0 : 0)
00010000:00010000:4.0:1401729341.440821:0:17237:0:(ldlm_resource.c:1316:ldlm_resource_add_lock()) ### About to add this lock:
 ns: mdt-atlastds-MDT0000_UUID lock: ffff8807e7578300/0x65bb92c64a60ece3 lrc: 3/1,0 mode: --/PR res: [0x200000401:0xd7:0x0].0 bits 0x13 rrc: 619 type: IBT flags: 0x50210000000000 nid: local remote: 0x0 expref: -99 pid: 17237 timeout: 0 lvb_type: 0
00010000:00000001:4.0:1401729341.440824:0:17237:0:(ldlm_lock.c:1973:ldlm_run_ast_work()) Process leaving (rc=0 : 0 : 0)
00010000:00000001:4.0:1401729341.440824:0:17237:0:(ldlm_inodebits.c:238:ldlm_process_inodebits_lock()) Process leaving (rc=0 : 0 : 0)
00010000:00000001:4.0:1401729341.440825:0:17237:0:(ldlm_lock.c:1759:ldlm_lock_enqueue()) Process leaving via out (rc=0 : 0 : 0x0)
00010000:00000001:4.0:1401729341.440826:0:17237:0:(ldlm_request.c:238:ldlm_completion_ast()) Process entered
00010000:00010000:4.0:1401729341.440826:0:17237:0:(ldlm_request.c:252:ldlm_completion_ast()) ### client-side enqueue returned a blocked lock, sleeping ns: mdt-atlastds-MDT0000_UUID lock: ffff8807e7578300/0x65bb92c64a60ece3 lrc: 3/1,0 mode: --/PR res: [0x200000401:0xd7:0x0].0 bits 0x13 rrc: 619 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 17237 timeout: 0 lvb_type: 0
00000100:00000001:1.0:1401729341.451231:0:12772:0:(client.c:1967:ptlrpc_expired_set()) Process entered
.................

which shows that the parent dir's [0x200000400:0x2:0x0] locks have been granted, the child/file [0x200000401:0xd7:0x0] lock is blocked, and then nothing else happens for these fids/locks in the Server's remaining 27s of log/traces. This seems to indicate that one of the conflicting locks had already had a blocking AST sent earlier and had not yet canceled it at this time.

_ Also, even though these logs do not cover the whole story, I can't find in them any truncate/resend situation of the kind handled and modified by the patches being worked on as part of this LU-4584 ticket and LU-2827. The only remaining possibility would be that a truncate/resend occurred earlier, but with another Client accessing the same file...

But it seems to me that this new lockup problem (vs the previous evictions and LBUG) looks like something different, and I wonder whether it should be tracked in a new ticket.

Comment by Christopher Morrone [ 05/Jun/14 ]

Yes, please move that discussion to a new ticket.

Comment by James A Simmons [ 06/Jun/14 ]

Something is really broken in 2.5, which might be independent of this patch; at this point I don't know, and only data will provide the needed evidence. So I tried the next step and moved to the latest 2.5.59 with the LU-2827 patches to see what the difference is. Well, it was back to good old-fashioned client evictions, so I set up debugging to collect data, and guess what happened? When I set debug=+all on the clients, all the evictions went away. So we have a very nasty race condition causing this problem. The next step is collecting just rpctrace and dlmtrace from the clients only; maybe the race will still be there.

Comment by James A Simmons [ 09/Jun/14 ]

I collected data for the evictions I'm seeing from two different test runs. The clients enabled rpctrace+dlmtrace for extra info. The logs are uploaded to ftp.whamcloud.com/uploads/LU-4584/20140609-run1.tbz and 20140609-run2.tbz.

Comment by James A Simmons [ 16/Jun/14 ]

Any insight gained from the logs?

Comment by Bruno Faccini (Inactive) [ 16/Jun/14 ]

James, sorry, I was out but am now back. I will give you feedback soon on the analysis of these new logs.

Comment by Bruno Faccini (Inactive) [ 17/Jun/14 ]

James,
In your last uploaded sets of logs (20140609-run[1,2].tbz), I only find a bunch of OST/extent-lock related evictions.
It looks like the OST, upon a full-extent PW lock request, asked the same Client to release multiple granted, conflicting PR locks. The Client seems to have handled the requests correctly, but the OST does not seem to receive confirmation that the locks have been canceled, hence the evictions.
I can't fully confirm that the Client actually sent the replies over the wire, since rpctrace does not seem to be enabled on the Client concerned.

BTW, this definitely confirms that it needs to be addressed in another ticket, since there is no relation to the resend mechanism that is concerned/fixed in both LU-2827 and this ticket.

Comment by James A Simmons [ 17/Jun/14 ]

Can you open the ticket, since you will be able to describe the problem far better than I can?

Comment by Bruno Faccini (Inactive) [ 18/Jun/14 ]

I just created LU-5225 to track this specific scenario.
James, can you please update it with any details you think are interesting?

Comment by Peter Jones [ 21/Jun/14 ]

So now that the unrelated issue being tracked for ORNL has moved to LU-5225 and the patches from LU-2827 have landed to master, can this issue be marked as a duplicate of LU-2827?

Comment by Christopher Morrone [ 23/Jun/14 ]

So now that the unrelated issue being tracked for ORNL has moved to LU-5225 and the patches from LU-2827 have landed to master, can this issue be marked as a duplicate of LU-2827?

First, can we get an explanation of how that fixes this problem, and then a clear list of the patch(es) I need to apply to b2_4?

Comment by Bruno Faccini (Inactive) [ 26/Jun/14 ]

Prior to the patches from this ticket and/or LU-2827, there was a bug in the server's handling of resent requests: a new lock was created so early that the lookup found the new lock itself instead of the first one, causing the first/old lock to become orphaned and duplicated.
It has been decided that the patch(es) from LU-2827 will be used to fix this issue, because they are more generic and handle all cases, mainly by detecting the resent case earlier and avoiding the unnecessary new lock creation.
I am presently porting and testing a b2_4 backport of the LU-2827 patches and will provide updates asap.
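To make the ordering problem more concrete, here is a minimal sketch of the fix's approach, using hypothetical helper names (find_granted_lock(), create_new_lock()) rather than the real mdt/ldlm code paths; the point is only the order of operations, i.e. detect the resent case and reuse the already-granted lock before ever creating a new one:

#include <stdbool.h>
#include <stddef.h>

struct lock { int cookie; };
struct request { bool resent; int cookie; };

/* Stubs for illustration only. */
static struct lock *find_granted_lock(const struct request *req)
{
	(void)req;
	return NULL;	/* would look up the lock granted for the first send */
}

static struct lock *create_new_lock(const struct request *req)
{
	(void)req;
	return NULL;	/* would allocate and enqueue a brand-new lock */
}

struct lock *handle_intent_lock_request(const struct request *req)
{
	if (req->resent) {
		struct lock *old = find_granted_lock(req);

		if (old != NULL)
			return old;	/* reply with the original lock */
	}
	return create_new_lock(req);
}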

Comment by Christopher Morrone [ 26/Jun/14 ]

Why are messages being resent?

Comment by Bruno Faccini (Inactive) [ 27/Jun/14 ]

Because the client's reply buffer was not big enough to receive the server's first reply, which includes the LVB/layout, due to the large default striping.
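As a rough illustration of this resend trigger (the per-stripe size and helper below are assumptions for illustration, not the real Lustre layout/LVB accounting):

#include <stdio.h>

#define PER_STRIPE_BYTES 24	/* assumed cost of one stripe in the layout */
#define BASE_REPLY_BYTES 512	/* assumed fixed part of the reply */

static int reply_buffer_size(int expected_stripes)
{
	return BASE_REPLY_BYTES + expected_stripes * PER_STRIPE_BYTES;
}

int main(void)
{
	int allocated = reply_buffer_size(4);	/* client guesses a small count */
	int needed    = reply_buffer_size(48);	/* directory default is 48 stripes */

	if (needed > allocated) {
		/* The server's first reply (with the LVB/layout) does not fit,
		 * so the client must enlarge its buffer and resend the request;
		 * that resend path is what the LU-2827/LU-4584 patches fix on
		 * the server side. */
		printf("reply needs %d bytes but only %d allocated: resend\n",
		       needed, allocated);
	}
	return 0;
}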

Comment by Bruno Faccini (Inactive) [ 30/Jun/14 ]

The merged b2_4 backport of both the #5978 and #10378 master changes from LU-2827 is at http://review.whamcloud.com/10902.

Comment by Bruno Faccini (Inactive) [ 22/Jul/14 ]

BTW, I forgot to mention here that my b2_4 back-port of LU-2827 (http://review.whamcloud.com/10902) still has a problem and needs some rework, because the MDS crashes with "(ldlm_lock.c:851:ldlm_lock_decref_internal_nolock()) ASSERTION( lock->l_readers > 0 ) failed" when running the LLNL reproducer from LU-4584 or recovery-small/test_53 in auto-tests.
More to come; the crash dump is under investigation, but in the meantime we can still use http://review.whamcloud.com/9488 as a fix for b2_4.

Comment by James A Simmons [ 29/Jul/14 ]

Just finished a test shot with Cray 2.5 clients to see if the client evictions stopped. Their default client, which is some 2.5 version with many, many patches, lacked the LU-2827 and LU-4861 patches that I found helped with 2.5.2. So I applied the patches from LU-2827 and LU-4861, but still had client evictions. I collected the logs from the server side and have placed them here:

ftp.whamcloud.com/uploads/LU-4584/atlas2_testshot_Jul_29_2014_debug_logs.tar.gz

Comment by Christopher Morrone [ 11/Aug/14 ]

James, can you share the patch stack you are using? That might help us figure out if you are reporting the same issue or something else. And if it isn't exactly the same issue, we really need to get you to report it in another ticket.

Comment by James A Simmons [ 12/Aug/14 ]

It was my bad. For the last test shot we used our 2.4 production file system, which didn't have the patch from here, so the above breakage is expected. We are in the process of testing this on a larger-scale (500 node) production machine. Yes, ORNL has created a public git tree:

https://github.com/ORNL-TechInt/lustre

So people can examine our special sauce.

Comment by James A Simmons [ 18/Aug/14 ]

I have been testing with the LU-4584 patch and I'm still seeing client evictions. Would it be possible to get the LU-2827 patch working on 2.4?

Comment by Christopher Morrone [ 20/Aug/14 ]

FYI, we put the LU-4584 patch into production, and it didn't take too long before we hit the assertion that James reported. I opened LU-5525 for that bug.

I am beginning to suspect that the patch either caused the assertion or made it more common.

Comment by Christopher Morrone [ 26/Aug/14 ]

Could you please provide an explanation of what operations will not be fixed by the LU-4584 patch, as compared with the more general LU-2827 fix?

Comment by James A Simmons [ 26/Aug/14 ]

I have several different reproducers of this problem. What I found is that the LU-4584 patch addresses some of the reproducers, but not all of them. The patch for LU-2827 addressed more of my reproducers, but I still had client evictions.

Comment by Christopher Morrone [ 26/Aug/14 ]

Yes, that was my belief. I would like Intel to enumerate the failure modes that users can expect to be fixed, and those that will not be fixed by LU-4584.

Comment by Oleg Drokin [ 10/Sep/14 ]

Patch http://review.whamcloud.com/#/c/9488/ has now been updated to eliminate the assertion in mdt_intent_lock_replace().
Also, 2.4 deployments would need to carry patch http://review.whamcloud.com/6511 from LU-3428. This should address all related woes in b2_4. 2.6+ will be fixed by the patches from LU-2827 and friends. As for 2.5, I think we'll still go with LU-2827 as the more generic solution.

Comment by Christopher Morrone [ 10/Sep/14 ]

Ok, I will give those patches a try.

We are planning to try to move to b2_5 in less than 2 months, so we'll need the LU-2827 solution for b2_5 as soon as possible.

Comment by Oleg Drokin [ 11/Sep/14 ]

For LU-2827, we have:
LU-2827 itself http://review.whamcloud.com/5978 and http://review.whamcloud.com/#/c/10378
LU-5266 http://review.whamcloud.com/10903
LU-5496 http://review.whamcloud.com/11469 and http://review.whamcloud.com/#/c/11644
LU-5579 http://review.whamcloud.com/#/c/11839/
and the final nail in this bug, LU-5530: http://review.whamcloud.com/#/c/11841/1

Some of those patches do not apply cleanly to b2_5, but I know James Simmons from ORNL has ported them. All of this is now in testing, after which they will hopefully publish their tree with the backports.

Another easy way to get rid of all of these problems (at least occurring this frequently, since a resend might also happen if a reply from the server is genuinely lost on the network) is to drop the LU-3338 patch, which is not part of standard b2_5 (it only landed in b2_6).

Comment by James A Simmons [ 11/Sep/14 ]

Actually, our Cray clients don't have the LU-3338 patch and the LU-5530 problems still showed up, so these patches are still needed. We plan to do a full-scale test shot Tuesday next week with all the above patches plus a few others. If it goes well, we will leave this system running 2.5 clients and 2.5 servers.

We still need this work because another file system running at the lab will be stuck at 2.4 for the time being, but some of our clients will be moving to 2.5, so these problems will show up there.

Comment by Oleg Drokin [ 11/Sep/14 ]

Re: 1.8 clients - since they don't know about wide striping, I suspect they hit this anyway once you have enough stripes, because their buffers are not large enough.

The patches are needed regardless, of course, because the bugs are real during resends. It's just that resends used to be rare, happening only in the face of lost messages.

Comment by James A Simmons [ 17/Sep/14 ]

Chris, did you try the patch set for 2.4? If you did, were there positive results?

Comment by Christopher Morrone [ 17/Sep/14 ]

We won't know until it is installed on larger systems.

Comment by James A Simmons [ 19/Sep/14 ]

I have been testing the b2_4 version of the LU-4584 patch and I see this:

LustreError: 4012:0:(ldlm_resource.c:1445:ldlm_resource_dump()) --- Resource: [0xdeaf:0x0:0x0].0 (ffff88020290da80) refcount = 2

It's my simul job struggling.

Comment by Andreas Dilger [ 12/Dec/17 ]

No updates in this ticket since the affected sites have upgraded to newer releases.
