[LU-6846] dt_record_write() ASSERTION( dt->do_body_ops->dbo_write ) failed: Created: 14/Jul/15  Updated: 28/Mar/17  Resolved: 19/Sep/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Major
Reporter: Di Wang Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: dne2

Attachments: Text File iws12.lbug.txt.gz    
Issue Links:
Related
is related to LU-6831 The ticket for tracking all DNE2 bugs Reopened
is related to LU-7063 llog_osd_destroy() leaks a lock on error Resolved
is related to LU-9197 Incorrect return value for llog_cance... Resolved
Severity: 3

 Description   

Found this in a 24-hour failover test on the OpenSFS cluster.

Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: mds7 has failed over 1 times, and counting...
Lustre: 9887:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1436839484/real 1436839484]  req@ffff88062c53a080 x1506632146392040/t0(0) o400->lustre-MDT0006-osp-MDT0001@192.168.2.127@o2ib:24/4 lens 224/224 e 1 to 1 dl 1436839486 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
Lustre: 9887:0:(client.c:2018:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Lustre: 9887:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1436839503/real 1436839503]  req@ffff88062c53a080 x1506632146392208/t0(0) o400->lustre-MDT0006-osp-MDT0001@192.168.2.127@o2ib:24/4 lens 224/224 e 1 to 1 dl 1436839505 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
Lustre: 9887:0:(client.c:2018:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Lustre: lustre-MDT0006-osp-MDT0001: Connection restored to lustre-MDT0006 (at 192.168.2.127@o2ib)
LustreError: 12030:0:(dt_object.c:512:dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed: 
LustreError: 12030:0:(dt_object.c:512:dt_record_write()) LBUG
Pid: 12030, comm: mdt_out03_005

Call Trace:
 [<ffffffffa0506875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0506e77>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa065978f>] dt_record_write+0xbf/0x130 [obdclass]
 [<ffffffffa08f4d0e>] out_tx_write_exec+0x7e/0x300 [ptlrpc]
 [<ffffffffa08ed30a>] out_tx_end+0xda/0x5d0 [ptlrpc]
 [<ffffffffa08f1e7b>] out_handle+0xd9b/0x17e0 [ptlrpc]
 [<ffffffffa083afb0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa08ea212>] tgt_request_handle+0xa42/0x1230 [ptlrpc]
 [<ffffffffa0892891>] ptlrpc_main+0xe41/0x1920 [ptlrpc]
 [<ffffffffa0891a50>] ? ptlrpc_main+0x0/0x1920 [ptlrpc]
 [<ffffffff8109abf6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

LustreError: dumping log to /tmp/lustre-log.1436839511.12030
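
For context, the assertion fires at the top of dt_record_write() in lustre/obdclass/dt_object.c, before the write is dispatched to the object's body operations: the OUT handler is executing a write against a dt object whose dbo_write method is NULL. A paraphrased sketch (from memory, not verbatim source; the real function also passes capability/quota arguments to ->dbo_write and converts a short write into an error code):

{noformat}
/* lustre/obdclass/dt_object.c (paraphrased) */
int dt_record_write(const struct lu_env *env, struct dt_object *dt,
                    const struct lu_buf *buf, loff_t *pos, struct thandle *th)
{
        LASSERT(dt != NULL);
        LASSERT(th != NULL);
        LASSERT(dt->do_body_ops);
        /* the LBUG above: the object has body ops but no dbo_write,
         * i.e. do_body_ops was never (re)set after the object was created */
        LASSERT(dt->do_body_ops->dbo_write);

        return dt->do_body_ops->dbo_write(env, dt, buf, pos, th);
}
{noformat}

The first patch below, "osd: reset do_body_ops after creation", targets exactly that window.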



 Comments   
Comment by Gerrit Updater [ 14/Jul/15 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15595
Subject: LU-6846 osd: reset do_body_ops after creation
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 981452bf05fbe3e7ac06f8289cd309abf12d41bb

Comment by Di Wang [ 23/Jul/15 ]

Clearly, this fix is not enough; even with it, we still hit the LBUG on Hyperion:

{noformat}

2015-07-23 12:40:30 Lustre: 6294:0:(service.c:2102:ptlrpc_server_handle_request()) Skipped 1 previous similar message
2015-07-23 12:40:42 Lustre: lustre-MDT0003-osp-MDT0001: Connection restored to lustre-MDT0003 (at 192.168.120.10@o2ib)
2015-07-23 12:43:44 LustreError: 6700:0:(dt_object.c:513:dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed: FID [0x280000402:0x3:0x0]
2015-07-23 12:43:44 LustreError: 6700:0:(dt_object.c:513:dt_record_write()) LBUG
2015-07-23 12:43:44 Pid: 6700, comm: mdt_out02_002
2015-07-23 12:43:44
2015-07-23 12:43:44 Call Trace:
2015-07-23 12:43:44 [<ffffffffa049e875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2015-07-23 12:43:44 [<ffffffffa049ee77>] lbug_with_loc+0x47/0xb0 [libcfs]
2015-07-23 12:43:44 [<ffffffffa063849f>] dt_record_write+0xcf/0x140 [obdclass]
2015-07-23 12:43:44 [<ffffffffa08d32de>] out_tx_write_exec+0x7e/0x300 [ptlrpc]
2015-07-23 12:43:44 [<ffffffffa08cb8da>] out_tx_end+0xda/0x5d0 [ptlrpc]
2015-07-23 12:43:45 [<ffffffffa08d044b>] out_handle+0xd9b/0x17e0 [ptlrpc]
2015-07-23 12:43:45 [<ffffffffa0819dd0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
2015-07-23 12:43:45 [<ffffffffa08c87e2>] tgt_request_handle+0xa42/0x1230 [ptlrpc]
2015-07-23 12:43:45 [<ffffffffa0871201>] ptlrpc_main+0xe41/0x1920 [ptlrpc]
2015-07-23 12:43:45 [<ffffffff8152a2be>] ? thread_return+0x4e/0x7d0
2015-07-23 12:43:45 [<ffffffffa08703c0>] ? ptlrpc_main+0x0/0x1920 [ptlrpc]
2015-07-23 12:43:45 [<ffffffff8109e78e>] kthread+0x9e/0xc0
2015-07-23 12:43:45 [<ffffffff8100c28a>] child_rip+0xa/0x20
2015-07-23 12:43:45 [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
2015-07-23 12:43:45 [<ffffffff8100c280>] ? child_rip+0x0/0x20
2015-07-23 12:43:45
{noformat}


Comment by Di Wang [ 27/Jul/15 ]

It seems this is caused by a race between llog record cancel and llog object destroy. In llog_cancel_rec():

{noformat}
/* returns negative on error; 0 if success; 1 if success & log destroyed */
int llog_cancel_rec(const struct lu_env *env, struct llog_handle *loghandle,
                    int index)
{
        struct llog_log_hdr *llh = loghandle->lgh_hdr;
        int rc = 0;
        ENTRY;

        CDEBUG(D_RPCTRACE, "Canceling %d in log "DOSTID"\n",
               index, POSTID(&loghandle->lgh_id.lgl_oi));

        if (index == 0) {
                CERROR("Can't cancel index 0 which is header\n");
                RETURN(-EINVAL);
        }
  
        rc = llog_cancel_rec_internal(env, loghandle, index);
        if (rc < 0) {
                CERROR("%s: fail to write header for llog #"DOSTID
                       "#%08x: rc = %d\n",
                       loghandle->lgh_ctxt->loc_obd->obd_name,
                       POSTID(&loghandle->lgh_id.lgl_oi),
                       loghandle->lgh_id.lgl_ogen, rc);
                RETURN(rc);
        }
        
        if ((llh->llh_flags & LLOG_F_ZAP_WHEN_EMPTY) &&
            (llh->llh_count == 1) &&
            (loghandle->lgh_last_idx == LLOG_HDR_BITMAP_SIZE(llh) - 1)) {
                rc = llog_destroy(env, loghandle);
                if (rc < 0) {
                        /* Sigh, can not destroy the final plain llog, but
                         * the bitmap has been cleared, so the record can not
                         * be accessed anymore, let's return 0 for now, and
{noformat}

The record cancel and llog_destroy run in different transactions, so when cancelling a remote record, if the updates arrive out of order, i.e. the cancel-record update arrives after llog_destroy, we hit this bug. The fix should be to merge them into a single transaction, as sketched below.
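
A minimal sketch of that single-transaction idea, for illustration only (this is not the actual patch 15730: llog_declare_write_rec() and the dt_trans_*() calls are existing interfaces, while llog_declare_destroy(), llog_hdr_clear_bit(), llog_write_hdr() and llog_trans_destroy() are stand-ins for whatever declare/execute hooks the real patch adds):

{noformat}
/* Sketch: cancel the record and, if the llog becomes empty, destroy it
 * inside the SAME transaction, so a remote target can never apply the
 * destroy before the cancel. */
static int llog_cancel_rec_internal(const struct lu_env *env,
                                    struct llog_handle *loghandle, int index)
{
        struct llog_log_hdr *llh = loghandle->lgh_hdr;
        struct dt_device *dt = lu2dt_dev(loghandle->lgh_obj->do_lu.lo_dev);
        struct thandle *th;
        int rc;
        ENTRY;

        th = dt_trans_create(env, dt);
        if (IS_ERR(th))
                RETURN(PTR_ERR(th));

        /* declare the bitmap/header update */
        rc = llog_declare_write_rec(env, loghandle, &llh->llh_hdr, index, th);
        if (rc < 0)
                GOTO(out_trans, rc);

        /* llh_count is still 2 here (header + the last record): declare
         * the destroy in the same transaction instead of a separate one */
        if ((llh->llh_flags & LLOG_F_ZAP_WHEN_EMPTY) && llh->llh_count == 2)
                rc = llog_declare_destroy(env, loghandle, th);
        if (rc < 0)
                GOTO(out_trans, rc);

        rc = dt_trans_start(env, dt, th);
        if (rc < 0)
                GOTO(out_trans, rc);

        llog_hdr_clear_bit(llh, index);             /* hypothetical helper */
        rc = llog_write_hdr(env, loghandle, th);    /* hypothetical helper */
        if (rc == 0 && (llh->llh_flags & LLOG_F_ZAP_WHEN_EMPTY) &&
            llh->llh_count == 1)
                rc = llog_trans_destroy(env, loghandle, th); /* hypothetical */
out_trans:
        dt_trans_stop(env, dt, th);
        RETURN(rc);
}
{noformat}

Either way, both updates travel to the remote target as one update request, so the destroy can never be applied or replayed ahead of the cancel.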

Comment by Gerrit Updater [ 27/Jul/15 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/15730
Subject: LU-6846 llog: combine cancel rec and destroy
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 649af4d6ac28378b182d18f1dc4c45d46127f74d

Comment by Gerrit Updater [ 15/Aug/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15595/
Subject: LU-6846 osd: reset do_body_ops after creation
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 037759bc49ce7f6adc9acb4a748a25c0f4c4ab64

Comment by Gerrit Updater [ 19/Aug/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15730/
Subject: LU-6846 llog: combine cancel rec and destroy
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a96be7018d90585dce16ce4b126485ce41629cd2

Comment by Peter Jones [ 19/Aug/15 ]

Landed for 2.8

Comment by Cliff White (Inactive) [ 24/Aug/15 ]

We are hitting this on Hyperion with 2.7.58

Aug 22 02:34:36 iws12 kernel: LustreError: 11526:0:(dt_object.c:512:dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed:
Aug 22 02:34:36 iws12 kernel: LustreError: 11526:0:(dt_object.c:512:dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed:
Aug 22 02:34:36 iws12 kernel: LustreError: 11526:0:(dt_object.c:512:dt_record_write()) LBUG
Aug 22 02:34:36 iws12 kernel: LustreError: 11526:0:(dt_object.c:512:dt_record_write()) LBUG
Aug 22 02:34:36 iws12 kernel: Pid: 11526, comm: mdt_out03_005
Aug 22 02:34:36 iws12 kernel:
Aug 22 02:34:36 iws12 kernel: Call Trace:
Aug 22 02:34:36 iws12 kernel: [<ffffffffa04a6875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Aug 22 02:34:36 iws12 kernel: [<ffffffffa04a6e77>] lbug_with_loc+0x47/0xb0 [libcfs]
Aug 22 02:34:36 iws12 kernel: [<ffffffffa09a658f>] dt_record_write+0xbf/0x130 [obdclass]
Aug 22 02:34:36 iws12 kernel: [<ffffffffa0c428ce>] out_tx_write_exec+0x7e/0x300 [ptlrpc]
Aug 22 02:34:36 iws12 kernel: [<ffffffffa0c3aeca>] out_tx_end+0xda/0x5d0 [ptlrpc] 
Aug 22 02:34:36 iws12 kernel: [<ffffffffa0c3fa3b>] out_handle+0xd9b/0x17e0 [ptlrpc]
Aug 22 02:34:36 iws12 kernel: [<ffffffffa0b88010>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
Aug 22 02:34:36 iws12 kernel: [<ffffffffa0c37dc2>] tgt_request_handle+0xa42/0x1230 [ptlrpc]
Aug 22 02:34:36 iws12 kernel: [<ffffffffa0bdff91>] ptlrpc_main+0xe41/0x1920 [ptlrpc]
Aug 22 02:34:36 iws12 kernel: [<ffffffff8152a39e>] ? thread_return+0x4e/0x7d0
Aug 22 02:34:36 iws12 kernel: [<ffffffffa0bdf150>] ? ptlrpc_main+0x0/0x1920 [ptlrpc]
Aug 22 02:34:36 iws12 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0
Aug 22 02:34:36 iws12 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Aug 22 02:34:36 iws12 kernel: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0 
Aug 22 02:34:36 iws12 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Aug 22 02:34:36 iws12 kernel:
Comment by Cliff White (Inactive) [ 02/Sep/15 ]

Hit again

Sep  1 17:42:47 iws12 kernel: LustreError: 7434:0:(dt_object.c:512:dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed:
Sep  1 17:42:47 iws12 kernel: LustreError: 7434:0:(dt_object.c:512:dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed:
Sep  1 17:42:47 iws12 kernel: LustreError: 7434:0:(dt_object.c:512:dt_record_write()) LBUG
Sep  1 17:42:47 iws12 kernel: LustreError: 7434:0:(dt_object.c:512:dt_record_write()) LBUG
Sep  1 17:42:47 iws12 kernel: Pid: 7434, comm: mdt_out03_020
Sep  1 17:42:47 iws12 kernel:
Sep  1 17:42:47 iws12 kernel: Call Trace:
Sep  1 17:42:47 iws12 kernel: [<ffffffffa04c9875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Sep  1 17:42:47 iws12 kernel: [<ffffffffa04c9e77>] lbug_with_loc+0x47/0xb0 [libcfs]
Sep  1 17:42:47 iws12 kernel: [<ffffffffa09c350f>] dt_record_write+0xbf/0x130 [obdclass]
Sep  1 17:42:47 iws12 kernel: [<ffffffffa0c6144e>] out_tx_write_exec+0x7e/0x300 [ptlrpc]
Sep  1 17:42:47 iws12 kernel: [<ffffffffa0c59a4a>] out_tx_end+0xda/0x5d0 [ptlrpc] 
Sep  1 17:42:47 iws12 kernel: [<ffffffffa0c5e5bb>] out_handle+0xd9b/0x17e0 [ptlrpc]
Sep  1 17:42:47 iws12 kernel: [<ffffffffa0ba61d0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
Sep  1 17:42:47 iws12 kernel: [<ffffffffa0c568ec>] tgt_request_handle+0xa4c/0x1290 [ptlrpc]
Sep  1 17:42:47 iws12 kernel: [<ffffffffa0bfe5b1>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
Sep  1 17:42:47 iws12 kernel: [<ffffffffa0bfd770>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
Sep  1 17:42:47 iws12 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0
Sep  1 17:42:47 iws12 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Sep  1 17:42:47 iws12 kernel: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0 
Sep  1 17:42:47 iws12 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Comment by Cliff White (Inactive) [ 02/Sep/15 ]

Lustre-log dump

Comment by James A Simmons [ 08/Sep/15 ]

I hit this bug over the weekend as well.

Comment by Cliff White (Inactive) [ 08/Sep/15 ]

And we have hit it with 2.7.59

Sep  5 14:54:15 iws10 kernel: LustreError: 52672:0:(dt_object.c:512:dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed: 
Sep  5 14:54:15 iws10 kernel: LustreError: 52672:0:(dt_object.c:512:dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed: 
Sep  5 14:54:15 iws10 kernel: LustreError: 52672:0:(dt_object.c:512:dt_record_write()) LBUG
Sep  5 14:54:15 iws10 kernel: LustreError: 52672:0:(dt_object.c:512:dt_record_write()) LBUG
Sep  5 14:54:15 iws10 kernel: Pid: 52672, comm: mdt_out02_018
Sep  5 14:54:15 iws10 kernel:
Sep  5 14:54:15 iws10 kernel: Call Trace:
Sep  5 14:54:15 iws10 kernel: [<ffffffffa07e4875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Sep  5 14:54:15 iws10 kernel: [<ffffffffa07e4e77>] lbug_with_loc+0x47/0xb0 [libcfs]
Sep  5 14:54:15 iws10 kernel: [<ffffffffa09e12df>] dt_record_write+0xbf/0x130 [obdclass]
Sep  5 14:54:15 iws10 kernel: [<ffffffffa0c7f4ae>] out_tx_write_exec+0x7e/0x300 [ptlrpc]
Sep  5 14:54:15 iws10 kernel: [<ffffffffa0c77aaa>] out_tx_end+0xda/0x5d0 [ptlrpc]
Sep  5 14:54:15 iws10 kernel: [<ffffffffa0c7c61b>] out_handle+0xd9b/0x17e0 [ptlrpc]
Sep  5 14:54:15 iws10 kernel: [<ffffffffa0bc41d0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
Sep  5 14:54:15 iws10 kernel: [<ffffffffa0c7494c>] tgt_request_handle+0xa4c/0x1290 [ptlrpc]
Sep  5 14:54:15 iws10 kernel: [<ffffffffa0c1c5b1>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
Sep  5 14:54:15 iws10 kernel: [<ffffffffa0c1b770>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
Sep  5 14:54:15 iws10 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0
Sep  5 14:54:15 iws10 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Sep  5 14:54:15 iws10 kernel: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
Sep  5 14:54:15 iws10 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Comment by Gerrit Updater [ 09/Sep/15 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/16333
Subject: LU-6846 llog: create remote llog synchronously
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 22504242f5646e34feda559e33d544d8354e0fb1

Comment by Di Wang [ 09/Sep/15 ]

Cliff, James: Please try this patch. Thanks

Comment by James A Simmons [ 09/Sep/15 ]

Just did. Now with sanity test 60 I get the following exception:

Sep 9 13:17:09 ninja11.ccs.ornl.gov kernel: [12210.969737] Lustre: 9313:0:(llog_test.c:1010:llog_test_7()) 7a: test llog_logid_rec
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12211.888955] LustreError: 9313:0:(llog_test.c:80:verify_handle()) 7_sub: expected 1 active recs after write, found 2
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12211.899565] LustreError: 9313:0:(osd_handler.c:2488:osd_object_destroy()) ASSERTION( !lu_object_is_dying(dt->do_lu.lo_header) ) failed:
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12211.911973] LustreError: 9313:0:(osd_handler.c:2488:osd_object_destroy()) LBUG
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12211.919387] Pid: 9313, comm: lctl
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12211.922812]
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12211.922813] Call Trace:
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12211.926946] [<ffffffffa0448875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12211.933996] [<ffffffffa0448e77>] lbug_with_loc+0x47/0xb0 [libcfs]
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12211.940297] [<ffffffffa0e9e80b>] osd_object_destroy+0x51b/0x5a0 [osd_ldiskfs]
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12211.947683] [<ffffffffa0e9d177>] ? osd_object_ref_del+0x187/0x4e0 [osd_ldiskfs]
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12211.955277] [<ffffffffa058ddba>] llog_osd_destroy+0x1ba/0x9d0 [obdclass]
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12211.962177] [<ffffffffa057d686>] llog_destroy+0x2b6/0x470 [obdclass]
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12211.968712] [<ffffffffa0527ccc>] llog_test_7_sub+0x29c/0x920 [llog_test]
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12211.975597] [<ffffffffa0528eaf>] llog_test_7+0x13f/0xa10 [llog_test]
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12211.982111] [<ffffffffa052dbb5>] llog_run_tests+0x1355/0x1d40 [llog_test]
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12211.989099] [<ffffffffa0582b7e>] ? llog_setup+0x4ae/0x8b0 [obdclass]
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12211.995643] [<ffffffffa052ed49>] llog_test_setup+0x7a9/0x8f0 [llog_test]
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12212.002538] [<ffffffffa05b00fb>] obd_setup+0x19b/0x290 [obdclass]
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12212.008810] [<ffffffff81175113>] ? kmem_cache_alloc_trace+0x1b3/0x1c0
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12212.015463] [<ffffffffa0594a53>] ? class_new_export+0x883/0x9e0 [obdclass]
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12212.022553] [<ffffffffa05b044f>] class_setup+0x25f/0x940 [obdclass]
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12212.029018] [<ffffffffa05b7631>] class_process_config+0x1151/0x2840 [obdclass]
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12212.036522] [<ffffffffa0575fd7>] ? obd_ioctl_getdata+0x127/0x1310 [obdclass]
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12212.043762] [<ffffffffa0454bb1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12212.050477] [<ffffffff8117546c>] ? __kmalloc+0x21c/0x230
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12212.055985] [<ffffffffa058fbf2>] class_handle_ioctl+0xbb2/0x21e0 [obdclass]
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12212.063131] [<ffffffff8148b5a1>] ? netlink_create+0x191/0x200
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12212.069075] [<ffffffffa05752ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12212.075785] [<ffffffff811a4222>] vfs_ioctl+0x22/0xa0
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12212.080930] [<ffffffff811acc5f>] ? generic_drop_inode+0x4f/0x80
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12212.087028] [<ffffffff811a43c4>] do_vfs_ioctl+0x84/0x580
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12212.092516] [<ffffffff81448871>] ? sock_release+0x71/0x90
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12212.098104] [<ffffffff814489bf>] ? __sock_create+0xff/0x240
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12212.103849] [<ffffffff811a4941>] sys_ioctl+0x81/0xa0
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12212.108989] [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Sep 9 13:17:10 ninja11.ccs.ornl.gov kernel: [12212.115100]

Comment by Di Wang [ 10/Sep/15 ]

Hi James, I just updated the patch. Thanks for the update.

Comment by James A Simmons [ 10/Sep/15 ]

Much better!!

Comment by James A Simmons [ 10/Sep/15 ]

Still failing for sanity test 300k. I'm getting the following backtrace:

Sep 10 12:35:19 spoon17.ccs.ornl.gov kernel: [ 283.502856] Lustre: DEBUG MARKER: == sanity test 300k: test large striped directory == 12:35:19 (1441902919)
Sep 10 12:35:19 spoon17.ccs.ornl.gov kernel: [ 283.555631] Lustre: *** cfs_fail_loc=1703, val=0***
Sep 10 12:35:19 spoon17.ccs.ornl.gov kernel: [ 283.563480] Lustre: 29844:0:(osd_handler.c:1106:osd_trans_start()) lustre-MDT0000: too many transaction credits (51379 > 1024)
Sep 10 12:35:19 spoon17.ccs.ornl.gov kernel: [ 283.575080] Lustre: 29844:0:(osd_handler.c:1041:osd_trans_dump_creds()) create: 146/1168/0, destroy: 0/0/0
Sep 10 12:35:19 spoon17.ccs.ornl.gov kernel: [ 283.585090] Lustre: 29844:0:(osd_handler.c:1048:osd_trans_dump_creds()) attr_set: 130/130/0, xattr_set: 259/16513/0
Sep 10 12:35:19 spoon17.ccs.ornl.gov kernel: [ 283.597106] Lustre: 29844:0:(osd_handler.c:1058:osd_trans_dump_creds()) write: 70/2146/0, punch: 0/0/0, quota 2/2/0
Sep 10 12:35:19 spoon17.ccs.ornl.gov kernel: [ 283.607934] Lustre: 29844:0:(osd_handler.c:1065:osd_trans_dump_creds()) insert: 932/30778/0, delete: 0/0/0
Sep 10 12:35:19 spoon17.ccs.ornl.gov kernel: [ 283.617951] Lustre: 29844:0:(osd_handler.c:1072:osd_trans_dump_creds()) ref_add: 642/642/0, ref_del: 0/0/0
Sep 10 12:35:19 spoon17.ccs.ornl.gov kernel: [ 283.627962] Pid: 29844, comm: mdt01_005
Sep 10 12:35:19 spoon17.ccs.ornl.gov kernel: [ 283.631922]
Sep 10 12:35:19 spoon17.ccs.ornl.gov kernel: [ 283.631923] Call Trace:
Sep 10 12:35:19 spoon17.ccs.ornl.gov kernel: [ 283.636098] [<ffffffffa040c875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Sep 10 12:35:19 spoon17.ccs.ornl.gov kernel: [ 283.643201] [<ffffffffa0e57b3f>] osd_trans_start+0x3ef/0x410 [osd_ldiskfs]
Sep 10 12:35:19 spoon17.ccs.ornl.gov kernel: [ 283.650441] [<ffffffffa0837977>] top_trans_start+0x357/0x960 [ptlrpc]
Sep 10 12:35:19 spoon17.ccs.ornl.gov kernel: [ 283.657134] [<ffffffffa1082c11>] lod_trans_start+0x61/0x70 [lod]
Sep 10 12:35:19 spoon17.ccs.ornl.gov kernel: [ 283.663447] [<ffffffffa0f6c5c4>] mdd_trans_start+0x14/0x20 [mdd]
Sep 10 12:35:19 spoon17.ccs.ornl.gov kernel: [ 283.669699] [<ffffffffa0f5b45d>] mdd_create+0x9cd/0x1280 [mdd]
Sep 10 12:35:19 spoon17.ccs.ornl.gov kernel: [ 283.675797] [<ffffffffa0fcc77c>] ? mdt_version_save+0x8c/0x1a0 [mdt]
Sep 10 12:35:19 spoon17.ccs.ornl.gov kernel: [ 283.682471] [<ffffffffa0fcef3c>] mdt_reint_create+0xbbc/0xcc0 [mdt]
Sep 10 12:35:19 spoon17.ccs.ornl.gov kernel: [ 283.688988] [<ffffffffa05ac240>] ? lu_ucred+0x20/0x30 [obdclass]
Sep 10 12:35:19 spoon17.ccs.ornl.gov kernel: [ 283.695246] [<ffffffffa0fae675>] ? mdt_ucred+0x15/0x20 [mdt]
Sep 10 12:35:19 spoon17.ccs.ornl.gov kernel: [ 283.701251] [<ffffffffa0fc785c>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
Sep 10 12:21:13 spoon17.ccs.ornl.gov kernel: [ 515.052812] [<ffffffffa07db532>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
Sep 10 12:21:13 spoon17.ccs.ornl.gov kernel: [ 515.059811] [<ffffffffa0f3699d>] mdt_reint_rec+0x5d/0x200 [mdt]
Sep 10 12:21:13 spoon17.ccs.ornl.gov kernel: [ 515.065938] [<ffffffffa0f227ab>] mdt_reint_internal+0x62b/0xb80 [mdt]
Sep 10 12:21:13 spoon17.ccs.ornl.gov kernel: [ 515.072622] [<ffffffffa0f2319b>] mdt_reint+0x6b/0x120 [mdt]
Sep 10 12:21:13 spoon17.ccs.ornl.gov kernel: [ 515.078449] [<ffffffffa081d72c>] tgt_request_handle+0xa4c/0x1290 [ptlrpc]
Sep 10 12:21:13 spoon17.ccs.ornl.gov kernel: [ 515.085555] [<ffffffffa07c53b1>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
Sep 10 12:21:13 spoon17.ccs.ornl.gov kernel: [ 515.091995] [<ffffffffa07c4570>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
Sep 10 12:21:13 spoon17.ccs.ornl.gov kernel: [ 515.098371] [<ffffffff8109e88e>] kthread+0x9e/0xc0
Sep 10 12:21:13 spoon17.ccs.ornl.gov kernel: [ 515.103354] [<ffffffff8100c28a>] child_rip+0xa/0x20
Sep 10 12:21:13 spoon17.ccs.ornl.gov kernel: [ 515.108422] [<ffffffff8109e7f0>] ? kthread+0x0/0xc0
Sep 10 12:21:13 spoon17.ccs.ornl.gov kernel: [ 515.113536] [<ffffffff8100c280>] ? child_rip+0x0/0x20
Sep 10 12:21:13 spoon17.ccs.ornl.gov kernel: [ 515.118808]
Sep 10 12:21:13 spoon17.ccs.ornl.gov kernel: [ 515.120489] LustreError: 30312:0:(osd_internal.h:1030:osd_trans_exec_op()) lustre-MDT0000-osd: op = 3, rb = 4
Sep 10 12:21:13 spoon17.ccs.ornl.gov kernel: [ 515.423620] Lustre: 30312:0:(osd_handler.c:1041:osd_trans_dump_creds()) create: 146/1168/11, destroy: 0/0/0
Sep 10 12:21:13 spoon17.ccs.ornl.gov kernel: [ 515.433696] Lustre: 30312:0:(osd_handler.c:1041:osd_trans_dump_creds()) Skipped 419 previous similar messages
Sep 10 12:21:13 spoon17.ccs.ornl.gov kernel: [ 515.443770] Lustre: 30312:0:(osd_handler.c:1048:osd_trans_dump_creds()) attr_set: 130/130/0, xattr_set: 259/16513/0

Comment by Di Wang [ 10/Sep/15 ]

Hi James, this is a known issue: the credits required for creating a big striped directory are larger than the maximum credits here. I believe you can see this on current master as well.
But it is just a warning for now, and it should not cause 300k to fail (see the sketch below for where the warning comes from). Did you see any other error messages there? Thanks.
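
For reference, the "too many transaction credits" warning comes from a credit sanity check at transaction start in osd-ldiskfs; roughly (paraphrased from memory, helper and field names approximate):

{noformat}
/* lustre/osd-ldiskfs/osd_handler.c, osd_trans_start() (paraphrased) */
if (oh->ot_credits > osd_transaction_size(dev)) {
        /* the "(51379 > 1024)" line in the log above */
        CWARN("%s: too many transaction credits (%d > %d)\n",
              osd_name(dev), oh->ot_credits, osd_transaction_size(dev));
        /* followed by the per-op credit breakdown and a stack dump */
        osd_trans_dump_creds(env, th);
        libcfs_debug_dumpstack(NULL);
}
{noformat}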

Comment by James A Simmons [ 10/Sep/15 ]

I posted the full log under ticket LU-6831. You can see it here : https://jira.hpdd.intel.com/secure/attachment/18890/kern-09-10-2015.log

Comment by Di Wang [ 10/Sep/15 ]

Hmm, I only saw 300g and 300h fail in your log.

Sep 10 12:49:43 spoon17.ccs.ornl.gov kernel: [  239.577103] Lustre: DEBUG MARKER: == sanity test 300g: check default striped directory for normal directory == 12:49:43 (1441903783)
Sep 10 12:49:44 spoon17.ccs.ornl.gov kernel: [  240.291185] Lustre: DEBUG MARKER: sanity test_300g: @@@@@@ FAIL: stripe count 1 != 0 for /lustre/lustre/d300g.sanity/normal_dir/test1
Sep 10 12:49:45 spoon17.ccs.ornl.gov kernel: [  241.048817] Lustre: DEBUG MARKER: == sanity test 300h: check default striped directory for striped directory == 12:49:45 (1441903785)
Sep 10 12:49:45 spoon17.ccs.ornl.gov kernel: [  241.837071] Lustre: DEBUG MARKER: sanity test_300h: @@@@@@ FAIL: stripe count 1 != 0 for 

Is http://review.whamcloud.com/#/c/16130/ in your series? I think there is a fix in that patch (see test_300_check_default_striped_dir) to deal with this failure.

Comment by James A Simmons [ 10/Sep/15 ]

Yes, it is in my series. Ah, I hadn't updated my local test scripts. 300k still blows up the node.

Comment by Di Wang [ 10/Sep/15 ]

Hmm, I guess somewhere in ldiskfs still complains about this super-giant credit reservation. Do you have a stack trace from when the panic happens?
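
(For reference, if an oversized reservation reaches the journal layer, jbd2 refuses the handle outright; paraphrased from memory from start_this_handle() in fs/jbd2/transaction.c, message text approximate:)

{noformat}
/* fs/jbd2/transaction.c, start_this_handle() (paraphrased) */
if (nblocks > journal->j_max_transaction_buffers) {
        printk(KERN_ERR "JBD2: %s wants too many credits (%d > %d)\n",
               current->comm, nblocks, journal->j_max_transaction_buffers);
        return -ENOSPC;
}
{noformat}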

Comment by James A Simmons [ 11/Sep/15 ]

All I have is what I posted to LU-6831 for the trace. The node just reboots, with no kdump or anything.

Comment by Gerrit Updater [ 19/Sep/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16333/
Subject: LU-6846 llog: create remote llog synchronously
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 36f59f94c06c74887a32f2a7757e7c962c6cf8dd

Comment by Peter Jones [ 19/Sep/15 ]

Landed for 2.8

Comment by Gerrit Updater [ 14/Dec/15 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/17579
Subject: LU-6846 llog: create remote llog synchronously
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b66f9d956c27b7f3b1ee7289ef2e230de93b71e4

Comment by Gerrit Updater [ 15/Dec/15 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/17616
Subject: LU-6846 llog: combine cancel rec and destroy
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0e22e861ff702c7515f8716b29770874d1e230e1
