[LU-6846] dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed: |
| Created: | 14/Jul/15 | Updated: | 28/Mar/17 | Resolved: | 19/Sep/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Di Wang | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | dne2 |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Found this in a 24-hour failover test on the OpenSFS cluster.

Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: mds7 has failed over 1 times, and counting...
Lustre: 9887:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1436839484/real 1436839484] req@ffff88062c53a080 x1506632146392040/t0(0) o400->lustre-MDT0006-osp-MDT0001@192.168.2.127@o2ib:24/4 lens 224/224 e 1 to 1 dl 1436839486 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
Lustre: 9887:0:(client.c:2018:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: 9887:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1436839503/real 1436839503] req@ffff88062c53a080 x1506632146392208/t0(0) o400->lustre-MDT0006-osp-MDT0001@192.168.2.127@o2ib:24/4 lens 224/224 e 1 to 1 dl 1436839505 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
Lustre: 9887:0:(client.c:2018:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: lustre-MDT0006-osp-MDT0001: Connection restored to lustre-MDT0006 (at 192.168.2.127@o2ib)
LustreError: 12030:0:(dt_object.c:512:dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed:
LustreError: 12030:0:(dt_object.c:512:dt_record_write()) LBUG
Pid: 12030, comm: mdt_out03_005
Call Trace:
[<ffffffffa0506875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa0506e77>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa065978f>] dt_record_write+0xbf/0x130 [obdclass]
[<ffffffffa08f4d0e>] out_tx_write_exec+0x7e/0x300 [ptlrpc]
[<ffffffffa08ed30a>] out_tx_end+0xda/0x5d0 [ptlrpc]
[<ffffffffa08f1e7b>] out_handle+0xd9b/0x17e0 [ptlrpc]
[<ffffffffa083afb0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
[<ffffffffa08ea212>] tgt_request_handle+0xa42/0x1230 [ptlrpc]
[<ffffffffa0892891>] ptlrpc_main+0xe41/0x1920 [ptlrpc]
[<ffffffffa0891a50>] ? ptlrpc_main+0x0/0x1920 [ptlrpc]
[<ffffffff8109abf6>] kthread+0x96/0xa0
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffff8109ab60>] ? kthread+0x0/0xa0
[<ffffffff8100c200>] ? child_rip+0x0/0x20
LustreError: dumping log to /tmp/lustre-log.1436839511.12030 |
| Comments |
| Comment by Gerrit Updater [ 14/Jul/15 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15595 |
| Comment by Di Wang [ 23/Jul/15 ] |
|
Clearly, this fix is not enough; even with it we still hit the LBUG on Hyperion:

2015-07-23 12:40:30 Lustre: 6294:0:(service.c:2102:ptlrpc_server_handle_request()) Skipped 1 previous similar message |
| Comment by Di Wang [ 27/Jul/15 ] |
|
It seems this is caused by a race between the llog record cancel and the llog object destroy. In llog_cancel_rec():

/* returns negative on error; 0 if success; 1 if success & log destroyed */
int llog_cancel_rec(const struct lu_env *env, struct llog_handle *loghandle,
                    int index)
{
        struct llog_log_hdr *llh = loghandle->lgh_hdr;
        int rc = 0;
        ENTRY;

        CDEBUG(D_RPCTRACE, "Canceling %d in log "DOSTID"\n",
               index, POSTID(&loghandle->lgh_id.lgl_oi));

        if (index == 0) {
                CERROR("Can't cancel index 0 which is header\n");
                RETURN(-EINVAL);
        }

        rc = llog_cancel_rec_internal(env, loghandle, index);
        if (rc < 0) {
                CERROR("%s: fail to write header for llog #"DOSTID
                       "#%08x: rc = %d\n",
                       loghandle->lgh_ctxt->loc_obd->obd_name,
                       POSTID(&loghandle->lgh_id.lgl_oi),
                       loghandle->lgh_id.lgl_ogen, rc);
                RETURN(rc);
        }

        if ((llh->llh_flags & LLOG_F_ZAP_WHEN_EMPTY) &&
            (llh->llh_count == 1) &&
            (loghandle->lgh_last_idx == LLOG_HDR_BITMAP_SIZE(llh) - 1)) {
                rc = llog_destroy(env, loghandle);
                if (rc < 0) {
                        /* Sigh, can not destroy the final plain llog, but
                         * the bitmap has been clearly, so the record can not
                         * be accessed anymore, let's return 0 for now, and
                         * ... */
The cancel record and llog_destroy run in different transactions. So when cancelling the remote record, if the RPCs arrive out of order, i.e. the cancel record arrives later than llog_destroy, we hit this bug. The fix should merge them into a single transaction. The race is sketched below. |
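To make the ordering problem concrete, here is a small standalone C model. This is not Lustre code: the types and helpers (dt_obj, llog_destroy_model(), cancel_rec_model()) are invented for illustration. It mimics how a destroy that clears the object's body ops before a late-arriving cancel-record write trips the dbo_write assertion, and why applying both updates as one atomic unit closes the window.

#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Toy stand-in for the dt_object body operations table. */
struct body_ops {
        int (*dbo_write)(const char *buf);
};

static int plain_write(const char *buf)
{
        printf("wrote: %s\n", buf);
        return 0;
}

struct dt_obj {
        struct body_ops *do_body_ops;
};

/* Model of llog_destroy(): the object is gone, its body ops are cleared. */
static void llog_destroy_model(struct dt_obj *o)
{
        o->do_body_ops = NULL;
}

/* Model of applying a "cancel record" update: it must write the header. */
static void cancel_rec_model(struct dt_obj *o)
{
        /* Analogue of the LASSERT in dt_record_write(). */
        assert(o->do_body_ops != NULL && o->do_body_ops->dbo_write != NULL);
        o->do_body_ops->dbo_write("clear llog bitmap bit");
}

int main(void)
{
        struct body_ops ops = { .dbo_write = plain_write };
        struct dt_obj obj = { .do_body_ops = &ops };

        /* Single transaction / in-order: cancel, then destroy -- fine. */
        cancel_rec_model(&obj);
        llog_destroy_model(&obj);

        /* Separate transactions applied out of order: destroy first, then
         * the late cancel record hits the cleared ops table, mirroring the
         * dt_record_write() LBUG above. */
        obj.do_body_ops = &ops;
        llog_destroy_model(&obj);
        cancel_rec_model(&obj);         /* assertion fires here */

        return 0;
}

Built and run as-is (e.g. gcc -o model model.c && ./model), this prints one successful write and then aborts on the assertion, which is the same failure mode the out-of-order replay produces on the MDT.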
| Comment by Gerrit Updater [ 27/Jul/15 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15730 |
| Comment by Gerrit Updater [ 15/Aug/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15595/ |
| Comment by Gerrit Updater [ 19/Aug/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15730/ |
| Comment by Peter Jones [ 19/Aug/15 ] |
|
Landed for 2.8 |
| Comment by Cliff White (Inactive) [ 24/Aug/15 ] |
|
We are hitting this on Hyperion with 2.7.58:

Aug 22 02:34:36 iws12 kernel: LustreError: 11526:0:(dt_object.c:512:dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed:
Aug 22 02:34:36 iws12 kernel: LustreError: 11526:0:(dt_object.c:512:dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed:
Aug 22 02:34:36 iws12 kernel: LustreError: 11526:0:(dt_object.c:512:dt_record_write()) LBUG
Aug 22 02:34:36 iws12 kernel: LustreError: 11526:0:(dt_object.c:512:dt_record_write()) LBUG
Aug 22 02:34:36 iws12 kernel: Pid: 11526, comm: mdt_out03_005
Aug 22 02:34:36 iws12 kernel: Call Trace:
Aug 22 02:34:36 iws12 kernel: [<ffffffffa04a6875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Aug 22 02:34:36 iws12 kernel: [<ffffffffa04a6e77>] lbug_with_loc+0x47/0xb0 [libcfs]
Aug 22 02:34:36 iws12 kernel: [<ffffffffa09a658f>] dt_record_write+0xbf/0x130 [obdclass]
Aug 22 02:34:36 iws12 kernel: [<ffffffffa0c428ce>] out_tx_write_exec+0x7e/0x300 [ptlrpc]
Aug 22 02:34:36 iws12 kernel: [<ffffffffa0c3aeca>] out_tx_end+0xda/0x5d0 [ptlrpc]
Aug 22 02:34:36 iws12 kernel: [<ffffffffa0c3fa3b>] out_handle+0xd9b/0x17e0 [ptlrpc]
Aug 22 02:34:36 iws12 kernel: [<ffffffffa0b88010>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
Aug 22 02:34:36 iws12 kernel: [<ffffffffa0c37dc2>] tgt_request_handle+0xa42/0x1230 [ptlrpc]
Aug 22 02:34:36 iws12 kernel: [<ffffffffa0bdff91>] ptlrpc_main+0xe41/0x1920 [ptlrpc]
Aug 22 02:34:36 iws12 kernel: [<ffffffff8152a39e>] ? thread_return+0x4e/0x7d0
Aug 22 02:34:36 iws12 kernel: [<ffffffffa0bdf150>] ? ptlrpc_main+0x0/0x1920 [ptlrpc]
Aug 22 02:34:36 iws12 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0
Aug 22 02:34:36 iws12 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Aug 22 02:34:36 iws12 kernel: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
Aug 22 02:34:36 iws12 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20 |
| Comment by Cliff White (Inactive) [ 02/Sep/15 ] |
|
Hit again:

Sep 1 17:42:47 iws12 kernel: LustreError: 7434:0:(dt_object.c:512:dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed:
Sep 1 17:42:47 iws12 kernel: LustreError: 7434:0:(dt_object.c:512:dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed:
Sep 1 17:42:47 iws12 kernel: LustreError: 7434:0:(dt_object.c:512:dt_record_write()) LBUG
Sep 1 17:42:47 iws12 kernel: LustreError: 7434:0:(dt_object.c:512:dt_record_write()) LBUG
Sep 1 17:42:47 iws12 kernel: Pid: 7434, comm: mdt_out03_020
Sep 1 17:42:47 iws12 kernel: Call Trace:
Sep 1 17:42:47 iws12 kernel: [<ffffffffa04c9875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Sep 1 17:42:47 iws12 kernel: [<ffffffffa04c9e77>] lbug_with_loc+0x47/0xb0 [libcfs]
Sep 1 17:42:47 iws12 kernel: [<ffffffffa09c350f>] dt_record_write+0xbf/0x130 [obdclass]
Sep 1 17:42:47 iws12 kernel: [<ffffffffa0c6144e>] out_tx_write_exec+0x7e/0x300 [ptlrpc]
Sep 1 17:42:47 iws12 kernel: [<ffffffffa0c59a4a>] out_tx_end+0xda/0x5d0 [ptlrpc]
Sep 1 17:42:47 iws12 kernel: [<ffffffffa0c5e5bb>] out_handle+0xd9b/0x17e0 [ptlrpc]
Sep 1 17:42:47 iws12 kernel: [<ffffffffa0ba61d0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
Sep 1 17:42:47 iws12 kernel: [<ffffffffa0c568ec>] tgt_request_handle+0xa4c/0x1290 [ptlrpc]
Sep 1 17:42:47 iws12 kernel: [<ffffffffa0bfe5b1>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
Sep 1 17:42:47 iws12 kernel: [<ffffffffa0bfd770>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
Sep 1 17:42:47 iws12 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0
Sep 1 17:42:47 iws12 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Sep 1 17:42:47 iws12 kernel: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
Sep 1 17:42:47 iws12 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20 |
| Comment by Cliff White (Inactive) [ 02/Sep/15 ] |
|
Lustre-log dump |
| Comment by James A Simmons [ 08/Sep/15 ] |
|
I hit this bug over the weekend as well. |
| Comment by Cliff White (Inactive) [ 08/Sep/15 ] |
|
And we have hit it with 2.7.59:

Sep 5 14:54:15 iws10 kernel: LustreError: 52672:0:(dt_object.c:512:dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed:
Sep 5 14:54:15 iws10 kernel: LustreError: 52672:0:(dt_object.c:512:dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed:
Sep 5 14:54:15 iws10 kernel: LustreError: 52672:0:(dt_object.c:512:dt_record_write()) LBUG
Sep 5 14:54:15 iws10 kernel: LustreError: 52672:0:(dt_object.c:512:dt_record_write()) LBUG
Sep 5 14:54:15 iws10 kernel: Pid: 52672, comm: mdt_out02_018
Sep 5 14:54:15 iws10 kernel: Call Trace:
Sep 5 14:54:15 iws10 kernel: [<ffffffffa07e4875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Sep 5 14:54:15 iws10 kernel: [<ffffffffa07e4e77>] lbug_with_loc+0x47/0xb0 [libcfs]
Sep 5 14:54:15 iws10 kernel: [<ffffffffa09e12df>] dt_record_write+0xbf/0x130 [obdclass]
Sep 5 14:54:15 iws10 kernel: [<ffffffffa0c7f4ae>] out_tx_write_exec+0x7e/0x300 [ptlrpc]
Sep 5 14:54:15 iws10 kernel: [<ffffffffa0c77aaa>] out_tx_end+0xda/0x5d0 [ptlrpc]
Sep 5 14:54:15 iws10 kernel: [<ffffffffa0c7c61b>] out_handle+0xd9b/0x17e0 [ptlrpc]
Sep 5 14:54:15 iws10 kernel: [<ffffffffa0bc41d0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
Sep 5 14:54:15 iws10 kernel: [<ffffffffa0c7494c>] tgt_request_handle+0xa4c/0x1290 [ptlrpc]
Sep 5 14:54:15 iws10 kernel: [<ffffffffa0c1c5b1>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
Sep 5 14:54:15 iws10 kernel: [<ffffffffa0c1b770>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
Sep 5 14:54:15 iws10 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0
Sep 5 14:54:15 iws10 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Sep 5 14:54:15 iws10 kernel: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
Sep 5 14:54:15 iws10 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20 |
| Comment by Gerrit Updater [ 09/Sep/15 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/16333 |
| Comment by Di Wang [ 09/Sep/15 ] |
|
Cliff, James: please try this patch. Thanks. |
| Comment by James A Simmons [ 09/Sep/15 ] |
|
Just did. Now with sanity test 60 I get the following exception:

Sep 9 13:17:09 ninja11.ccs.ornl.gov kernel: [12210.969737] Lustre: 9313:0:(llog_test.c:1010:llog_test_7()) |
| Comment by Di Wang [ 10/Sep/15 ] |
|
Hi James, I just updated the patch. Thanks for the update. |
| Comment by James A Simmons [ 10/Sep/15 ] |
|
Much better!! |
| Comment by James A Simmons [ 10/Sep/15 ] |
|
Still failing for sanity test 300k. I'm getting the following backtrace:

Sep 10 12:35:19 spoon17.ccs.ornl.gov kernel: [ 283.502856] Lustre: DEBUG MARKER: == sanity test 300k: test large striped dir |
| Comment by Di Wang [ 10/Sep/15 ] |
|
Hi James, this is a known issue: the credits required for creating a big striped directory are larger than the maximum credits here, as roughly illustrated below. And I believe you can see this on current master as well. |
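A rough back-of-the-envelope sketch of that claim (the per-stripe cost and the journal cap below are made-up numbers, not values from ldiskfs or the OSD code): the credits requested grow linearly with the stripe count, while the journal handle has a fixed per-transaction maximum, so a sufficiently wide striped directory overflows it.

#include <stdio.h>

int main(void)
{
        /* Hypothetical numbers purely for illustration. */
        const int credits_per_stripe = 20;      /* assumed cost per stripe object */
        const int journal_max_credits = 1024;   /* assumed per-handle cap */

        for (int stripes = 8; stripes <= 256; stripes *= 2) {
                int needed = stripes * credits_per_stripe;

                printf("%4d stripes -> %6d credits%s\n", stripes, needed,
                       needed > journal_max_credits ?
                       "  (exceeds the journal cap; transaction start fails)" : "");
        }
        return 0;
}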
| Comment by James A Simmons [ 10/Sep/15 ] |
|
I posted the full log under ticket LU-6831. You can see it here: https://jira.hpdd.intel.com/secure/attachment/18890/kern-09-10-2015.log |
| Comment by Di Wang [ 10/Sep/15 ] |
|
Hmm, I only saw 300g and 300h fail in your log:

Sep 10 12:49:43 spoon17.ccs.ornl.gov kernel: [ 239.577103] Lustre: DEBUG MARKER: == sanity test 300g: check default striped directory for normal directory == 12:49:43 (1441903783)
Sep 10 12:49:44 spoon17.ccs.ornl.gov kernel: [ 240.291185] Lustre: DEBUG MARKER: sanity test_300g: @@@@@@ FAIL: stripe count 1 != 0 for /lustre/lustre/d300g.sanity/normal_dir/test1
Sep 10 12:49:45 spoon17.ccs.ornl.gov kernel: [ 241.048817] Lustre: DEBUG MARKER: == sanity test 300h: check default striped directory for striped directory == 12:49:45 (1441903785)
Sep 10 12:49:45 spoon17.ccs.ornl.gov kernel: [ 241.837071] Lustre: DEBUG MARKER: sanity test_300h: @@@@@@ FAIL: stripe count 1 != 0 for

Is http://review.whamcloud.com/#/c/16130/ in your series? I think there is a fix in that patch (see test_300_check_default_striped_dir) to deal with this failure. |
| Comment by James A Simmons [ 10/Sep/15 ] |
|
Yes, it is in my series. Ah, I haven't updated my local test scripts. It still blows up the node with 300k. |
| Comment by Di Wang [ 10/Sep/15 ] |
|
Hmm, I guess something in ldiskfs still complains about this super-giant credit reservation. Do you have a stack trace from when the panic happens? |
| Comment by James A Simmons [ 11/Sep/15 ] |
|
All I have is what I posted to LU-6831 for the trace. The node just reboots with no kdump or anything. |
| Comment by Gerrit Updater [ 19/Sep/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16333/ |
| Comment by Peter Jones [ 19/Sep/15 ] |
|
Landed for 2.8 |
| Comment by Gerrit Updater [ 14/Dec/15 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/17579 |
| Comment by Gerrit Updater [ 15/Dec/15 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/17616 |