[LU-5193] 2.6 DNE stress testing: osp_sync_interpret()) ASSERTION( rc || req->rq_transno ) failed: Created: 13/Jun/14  Updated: 17/Apr/15  Resolved: 07/Oct/14

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Patrick Farrell (Inactive) Assignee: Alex Zhuravlev
Resolution: Duplicate Votes: 0
Labels: dne

Issue Links:
Related
is related to LU-5629 osp_sync_interpret() ASSERTION( rc ||... Resolved
Severity: 3
Rank (Obsolete): 14511

 Description   

During Cray stress testing of 2.6 servers and 2.5 clients using remote directories (but NOT striped directories), we hit the following LBUG on one of our MDSes:

<0>LustreError: 23068:0:(osp_sync.c:350:osp_sync_interpret()) ASSERTION( rc || req->rq_transno ) failed:
<0>LustreError: 23068:0:(osp_sync.c:350:osp_sync_interpret()) LBUG
<4>Pid: 23068, comm: ptlrpcd_10
<4>
<4>Call Trace:
<4> [<ffffffffa0a92895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0a92e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa16b2d30>] osp_sync_process_record+0x0/0x13f0 [osp]
<4> [<ffffffffa0e7e2dc>] ptlrpc_check_set+0x2bc/0x1b50 [ptlrpc]
<4> [<ffffffffa0eaadfb>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
<4> [<ffffffffa0eab44b>] ptlrpcd+0x33b/0x3f0 [ptlrpc]
<4> [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0eab110>] ? ptlrpcd+0x0/0x3f0 [ptlrpc]
<4> [<ffffffff8109aee6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 23068, comm: ptlrpcd_10 Tainted: G W --------------- 2.6.32-431.5.1.el6.x86_64 #1
<4>Call Trace:
<4> [<ffffffff81527513>] ? panic+0xa7/0x16f
<4> [<ffffffffa0a92eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa16b2d30>] ? osp_sync_process_record+0x0/0x13f0 [osp]
<4> [<ffffffffa0e7e2dc>] ? ptlrpc_check_set+0x2bc/0x1b50 [ptlrpc]
<4> [<ffffffffa0eaadfb>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
<4> [<ffffffffa0eab44b>] ? ptlrpcd+0x33b/0x3f0 [ptlrpc]
<4> [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa0eab110>] ? ptlrpcd+0x0/0x3f0 [ptlrpc]
<4> [<ffffffff8109aee6>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20



 Comments   
Comment by Patrick Farrell (Inactive) [ 13/Jun/14 ]

Uploading dump to:
ftp.whamcloud.com:/uploads/LU-5193/dne2_osp_sync_interpret_140613.tar.gz

Comment by Jodi Levi (Inactive) [ 16/Jun/14 ]

Di,
Could you please comment on this one?
Thank you!

Comment by Jodi Levi (Inactive) [ 20/Jun/14 ]

In Di's assessment he has determined that this not related to DNE.
Alex, can you please have a look at this one?

Comment by Di Wang [ 24/Jun/14 ]

extracted debug log

00000100:00080000:1.0:1402544092.443256:0:23057:0:(client.c:2567:ptlrpc_resend_req()) @@@ going to resend  req@ffff8807e6903800 x1470649362992752/t0(0) o6->esfprod-OST001b-osc-MDT0007@10.151.10.12@o2ib8:28/4 lens 664/432 e 7 to 0 dl 1402544095 ref 1 fl Rpc:/0/ffffffff rc 0/-1
00000100:00080000:1.0:1402544092.443272:0:23057:0:(client.c:2567:ptlrpc_resend_req()) @@@ going to resend  req@ffff880ffe32c400 x1470649362995304/t0(0) o6->esfprod-OST001b-osc-MDT0007@10.151.10.12@o2ib8:28/4 lens 664/432 e 4 to 0 dl 1402544140 ref 1 fl Rpc:/0/ffffffff rc 0/-1
00000100:00080000:1.0:1402544092.443282:0:23057:0:(client.c:2567:ptlrpc_resend_req()) @@@ going to resend  req@ffff880ffedcb400 x1470649363002864/t0(0) o6->esfprod-OST001b-osc-MDT0007@10.151.10.12@o2ib8:28/4 lens 664/432 e 0 to 0 dl 1402544156 ref 1 fl Rpc:/0/ffffffff rc 0/-1
00000100:00080000:1.0:1402544092.443293:0:23057:0:(client.c:2567:ptlrpc_resend_req()) @@@ going to resend  req@ffff8807e7702c00 x1470649363004640/t0(0) o6->esfprod-OST001b-osc-MDT0007@10.151.10.12@o2ib8:28/4 lens 664/432 e 0 to 0 dl 1402544235 ref 1 fl Rpc:/0/ffffffff rc 0/-1
00000100:00080000:1.0:1402544092.443300:0:23057:0:(client.c:2567:ptlrpc_resend_req()) @@@ going to resend  req@ffff8807102b2c00 x1470649363004808/t0(0) o6->esfprod-OST001b-osc-MDT0007@10.151.10.12@o2ib8:28/4 lens 664/432 e 0 to 0 dl 1402544235 ref 1 fl Rpc:/0/ffffffff rc 0/-1
00000100:00080000:1.0:1402544092.443308:0:23057:0:(client.c:2567:ptlrpc_resend_req()) @@@ going to resend  req@ffff8807e6e4c800 x1470649363004952/t0(0) o6->esfprod-OST001b-osc-MDT0007@10.151.10.12@o2ib8:28/4 lens 664/432 e 0 to 0 dl 1402544235 ref 1 fl Rpc:/0/ffffffff rc 0/-1
00000100:00080000:1.0:1402544092.443316:0:23057:0:(client.c:2567:ptlrpc_resend_req()) @@@ going to resend  req@ffff8807874ac400 x1470649363005104/t0(0) o6->esfprod-OST001b-osc-MDT0007@10.151.10.12@o2ib8:28/4 lens 664/432 e 0 to 0 dl 1402544235 ref 1 fl Rpc:/0/ffffffff rc 0/-1
00000100:00080000:1.0:1402544092.443327:0:23057:0:(import.c:1443:ptlrpc_import_recovery_state_machine()) ffff8807f32c4800 esfprod-OST001b_UUID: changing import state from RECOVER to FULL
00000004:00080000:1.0:1402544092.443334:0:23057:0:(osp_dev.c:1073:osp_import_event()) got connected
00000100:02000000:1.0:1402544092.443336:0:23057:0:(import.c:1451:ptlrpc_import_recovery_state_machine()) esfprod-OST001b-osc-MDT0007: Connection restored to esfprod-OST001b (at 10.151.10.12@o2ib8)
00000100:00080000:1.0:1402544092.443346:0:23057:0:(recover.c:233:ptlrpc_wake_delayed()) @@@ waking (set ffff88080b044180):  req@ffff8807102b2c00 x1470649363004808/t0(0) o6->esfprod-OST001b-osc-MDT0007@10.151.10.12@o2ib8:28/4 lens 664/432 e 0 to 0 dl 1402544235 ref 1 fl Rpc:S/0/ffffffff rc -11/-1
00000004:00080000:14.0:1402544092.497759:0:26266:0:(mdt_open.c:660:mdt_mfd_set_mode()) [0xdc0003394:0x1b2d6:0x0] Change mfd mode 00 -> 01102.
00000004:00080000:14.0:1402544092.498096:0:26266:0:(mdt_handler.c:2570:mdt_save_lock()) request = ffff880732d2c800 reply state = ffff8807cd6da000 transno = 73040286438
00000001:00080000:0.0:1402544095.173008:0:23753:0:(tgt_lastrcvd.c:434:tgt_cb_last_committed()) esfprod-MDT0007: transno 73040286435 is committed
00000001:00080000:0.0:1402544095.173016:0:23753:0:(tgt_lastrcvd.c:434:tgt_cb_last_committed()) esfprod-MDT0007: transno 73040286436 is committed
00000001:00080000:0.0:1402544095.173019:0:23753:0:(tgt_lastrcvd.c:434:tgt_cb_last_committed()) esfprod-MDT0007: transno 73040286438 is committed
00000001:00080000:0.0:1402544095.173022:0:23753:0:(tgt_lastrcvd.c:434:tgt_cb_last_committed()) esfprod-MDT0007: transno 73040286439 is committed
00000004:00080000:8.0:1402544095.506180:0:23786:0:(mdt_open.c:660:mdt_mfd_set_mode()) [0xdc0003394:0x1b2d6:0x0] Change mfd mode 00 -> 01102.
00000004:00080000:8.0:1402544095.506563:0:23786:0:(mdt_handler.c:2570:mdt_save_lock()) request = ffff88079ce94c00 reply state = ffff8807e81c4000 transno = 73040286441
00000004:00080000:10.0:1402544098.515284:0:23786:0:(mdt_open.c:660:mdt_mfd_set_mode()) [0xdc0003394:0x1b2d6:0x0] Change mfd mode 00 -> 01102.
00000004:00080000:10.0:1402544098.515639:0:23786:0:(mdt_handler.c:2570:mdt_save_lock()) request = ffff880fb5d0f800 reply state = ffff8807cc0f6000 transno = 73040286444
00000001:00080000:0.0:1402544101.178949:0:23753:0:(tgt_lastrcvd.c:434:tgt_cb_last_committed()) esfprod-MDT0007: transno 73040286441 is committed
00000001:00080000:0.0:1402544101.178958:0:23753:0:(tgt_lastrcvd.c:434:tgt_cb_last_committed()) esfprod-MDT0007: transno 73040286442 is committed
00000001:00080000:0.0:1402544101.178962:0:23753:0:(tgt_lastrcvd.c:434:tgt_cb_last_committed()) esfprod-MDT0007: transno 73040286444 is committed
00000001:00080000:0.0:1402544101.178965:0:23753:0:(tgt_lastrcvd.c:434:tgt_cb_last_committed()) esfprod-MDT0007: transno 73040286445 is committed
00000004:00080000:8.0:1402544101.524149:0:23786:0:(mdt_open.c:660:mdt_mfd_set_mode()) [0xdc0003394:0x1b2d6:0x0] Change mfd mode 00 -> 01102.
00000004:00080000:8.0:1402544101.524508:0:23786:0:(mdt_handler.c:2570:mdt_save_lock()) request = ffff880fa77fb000 reply state = ffff8807a5dcf000 transno = 73040286447
00000004:00080000:8.0:1402544104.533003:0:23786:0:(mdt_open.c:660:mdt_mfd_set_mode()) [0xdc0003394:0x1b2d6:0x0] Change mfd mode 00 -> 01102.
00000004:00080000:12.0:1402544104.533325:0:23786:0:(mdt_handler.c:2570:mdt_save_lock()) request = ffff8807e6d8b000 reply state = ffff8807d66a4000 transno = 73040286450
00000001:00080000:0.0:1402544107.179018:0:23753:0:(tgt_lastrcvd.c:434:tgt_cb_last_committed()) esfprod-MDT0007: transno 73040286447 is committed
00000001:00080000:0.0:1402544107.179026:0:23753:0:(tgt_lastrcvd.c:434:tgt_cb_last_committed()) esfprod-MDT0007: transno 73040286448 is committed
00000001:00080000:0.0:1402544107.179029:0:23753:0:(tgt_lastrcvd.c:434:tgt_cb_last_committed()) esfprod-MDT0007: transno 73040286450 is committed
00000001:00080000:0.0:1402544107.179031:0:23753:0:(tgt_lastrcvd.c:434:tgt_cb_last_committed()) esfprod-MDT0007: transno 73040286451 is committed
00000004:00080000:12.0:1402544107.535962:0:23786:0:(mdt_open.c:660:mdt_mfd_set_mode()) [0xdc0003394:0x1b2d6:0x0] Change mfd mode 00 -> 01102.
00000004:00080000:12.0:1402544107.536330:0:23786:0:(mdt_open.c:660:mdt_mfd_set_mode()) [0xdc0003394:0x1b2d6:0x0] Change mfd mode 00 -> 01102.
00000004:00080000:12.0:1402544107.536655:0:23786:0:(mdt_handler.c:2570:mdt_save_lock()) request = ffff8807ca394c00 reply state = ffff88070b41e000 transno = 73040286454
00000004:00080000:12.0:1402544107.537181:0:23786:0:(mdt_handler.c:2570:mdt_save_lock()) request = ffff880fbb68ec00 reply state = ffff88070b7d2000 transno = 73040286456
00000004:00080000:12.0:1402544107.574463:0:23786:0:(mdt_open.c:660:mdt_mfd_set_mode()) [0xdc0003394:0x1b2d5:0x0] Change mfd mode 00 -> 0200001.
00000004:00080000:12.0:1402544107.576380:0:23786:0:(mdt_handler.c:2570:mdt_save_lock()) request = ffff8807c35e5c00 reply state = ffff88077e353000 transno = 73040286459
00000004:00080000:12.0:1402544107.576429:0:23786:0:(mdt_handler.c:2570:mdt_save_lock()) request = ffff8807c35e5c00 reply state = ffff88077e353000 transno = 73040286459
00000004:00080000:12.0:1402544107.602191:0:23786:0:(mdd_dir.c:1413:mdd_finish_unlink()) Object [0xdc0003394:0x1b2d5:0x0] is inserted into orphan list, open count = 1
00000001:00080000:0.0:1402544107.603279:0:23753:0:(tgt_lastrcvd.c:434:tgt_cb_last_committed()) esfprod-MDT0007: transno 73040286454 is committed
00000001:00080000:0.0:1402544107.603287:0:23753:0:(tgt_lastrcvd.c:434:tgt_cb_last_committed()) esfprod-MDT0007: transno 73040286455 is committed
00000001:00080000:0.0:1402544107.603291:0:23753:0:(tgt_lastrcvd.c:434:tgt_cb_last_committed()) esfprod-MDT0007: transno 73040286456 is committed
00000001:00080000:0.0:1402544107.603293:0:23753:0:(tgt_lastrcvd.c:434:tgt_cb_last_committed()) esfprod-MDT0007: transno 73040286457 is committed
00000001:00080000:0.0:1402544107.603303:0:23753:0:(tgt_lastrcvd.c:434:tgt_cb_last_committed()) esfprod-MDT0007: transno 73040286459 is committed
00000001:00080000:0.0:1402544107.603305:0:23753:0:(tgt_lastrcvd.c:434:tgt_cb_last_committed()) esfprod-MDT0007: transno 73040286460 is committed
00000004:00080000:2.0:1402544107.603338:0:23867:0:(osp_sync.c:870:osp_sync_process_queues()) 1 changes, 0 in progress, 0 in flight
00000004:00080000:12.0:1402544107.603369:0:23786:0:(mdt_handler.c:2570:mdt_save_lock()) request = ffff880ff54c9400 reply state = ffff8807154d8000 transno = 73040286460
00000004:00080000:2.0:1402544107.603382:0:23867:0:(osp_sync.c:729:osp_sync_process_record()) found record 10692404, 64, idx 34642, id 1381748: 0
00010000:00080000:12.0:1402544107.603410:0:23786:0:(ldlm_lib.c:2497:target_send_reply()) Schedule reply immediately
00000004:00080000:14.0:1402544107.603764:0:24867:0:(mdd_object.c:1779:mdd_close()) Object [0xdc0003394:0x1b2d5:0x0] is deleted from orphan list, OSS objects to be destroyed.
00000001:00080000:14.0:1402544107.603787:0:24867:0:(tgt_lastrcvd.c:1233:tgt_txn_stop_cb()) More than one transaction 73040286461
00000004:00080000:0.0:1402544107.604014:0:23062:0:(osp_sync.c:349:osp_sync_interpret()) reply req ffff8807e80b5c00/2, rc 0, transno 871167
00000004:00080000:0.0:1402544108.413047:0:23064:0:(osp_sync.c:315:osp_sync_request_commit_cb()) commit req ffff8807e80b5c00, transno 64425380607
00000100:00000400:0.0:1402544111.068865:0:23068:0:(client.c:304:ptlrpc_at_adj_net_latency()) Reported service time 318 > total measured time 19
00000004:00080000:0.0:1402544111.068893:0:23068:0:(osp_sync.c:315:osp_sync_request_commit_cb()) commit req ffff880ffedcb400, transno 0
00000004:00080000:0.0:1402544111.068897:0:23068:0:(osp_sync.c:349:osp_sync_interpret()) reply req ffff880ffedcb400/1, rc 0, transno 0
00000004:00040000:0.0:1402544111.068900:0:23068:0:(osp_sync.c:350:osp_sync_interpret()) ASSERTION( rc || req->rq_transno ) failed: 
00000004:00040000:0.0:1402544111.079949:0:23068:0:(osp_sync.c:350:osp_sync_interpret()) LBUG
Generated at Sat Feb 10 01:49:20 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.