Details
Bug
Resolution: Duplicate
Major
None
Lustre 2.7.0
None
3
16718
Description
lfsck_layout_slave_conditional_destroy() and lfsck_layout_slave_repair_pfid() both call dt_trans_start_local() while holding a DT write lock on an object. This was found by code inspection.
I also ran with the following:
diff --git a/lustre/osd-ldiskfs/osd_handler.c b/lustre/osd-ldiskfs/osd_handler.c
index 80d1e67..ed15b0c 100644
--- a/lustre/osd-ldiskfs/osd_handler.c
+++ b/lustre/osd-ldiskfs/osd_handler.c
@@ -1058,6 +1058,8 @@ int osd_trans_start(const struct lu_env *env, struct dt_device *d,
         if (!IS_ERR(jh)) {
                 oh->ot_handle = jh;
                 LASSERT(oti->oti_txns == 0);
+                LASSERT(oti->oti_w_locks == 0);
+                LASSERT(oti->oti_r_locks == 0);
                 lu_context_init(&th->th_ctx, th->th_tags);
                 lu_context_enter(&th->th_ctx);
Running sanity-lfsck.sh, I was able to trigger a crash from lfsck_layout_slave_repair_pfid():
[ 1827.711965] Lustre: DEBUG MARKER: == sanity-lfsck test 19b: OST-object inconsistency self repair == 11:01:32 (1417798892)
[ 1827.813194] Lustre: DEBUG MARKER: cancel_lru_locks osc start
[ 1827.838214] Lustre: *** cfs_fail_loc=1611, val=0***
[ 1827.839821] Lustre: Skipped 3 previous similar messages
[ 1827.877070] Lustre: DEBUG MARKER: cancel_lru_locks osc stop
[ 1827.957214] LustreError: 21862:0:(osd_handler.c:1061:osd_trans_start()) ASSERTION( oti->oti_w_locks == 0 ) failed:
[ 1827.960100] LustreError: 21862:0:(osd_handler.c:1061:osd_trans_start()) LBUG
[ 1827.961698] Pid: 21862, comm: inconsistency_v
[ 1827.962700]
[ 1827.962702] Call Trace:
[ 1827.963692] [<ffffffffa052f8c5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[ 1827.965393] [<ffffffffa052fec7>] lbug_with_loc+0x47/0xb0 [libcfs]
[ 1827.966913] [<ffffffffa0d98809>] osd_trans_start+0x389/0x730 [osd_ldiskfs]
[ 1827.968553] [<ffffffffa0c44ed2>] lfsck_layout_slave_in_notify+0x982/0xcf0 [lfsck]
[ 1827.969963] [<ffffffffa0c0cd37>] lfsck_in_notify+0xf7/0x5a0 [lfsck]
[ 1827.971155] [<ffffffffa0fe6037>] ofd_inconsistency_verification_main+0x367/0xdb0 [ofd]
[ 1827.972639] [<ffffffff8105e59d>] ? finish_task_switch+0x7d/0x110
[ 1827.973774] [<ffffffff8105e568>] ? finish_task_switch+0x48/0x110
[ 1827.974896] [<ffffffff81061d90>] ? default_wake_function+0x0/0x20
[ 1827.976039] [<ffffffffa0fe5cd0>] ? ofd_inconsistency_verification_main+0x0/0xdb0 [ofd]
[ 1827.977539] [<ffffffff8109e856>] kthread+0x96/0xa0
[ 1827.978448] [<ffffffff8100c30a>] child_rip+0xa/0x20
[ 1827.979368] [<ffffffff815562e0>] ? _spin_unlock_irq+0x30/0x40
[ 1827.980507] [<ffffffff8100bb10>] ? restore_args+0x0/0x30
[ 1827.981704] [<ffffffff8109e7c0>] ? kthread+0x0/0xa0
[ 1827.982787] [<ffffffff8100c300>] ? child_rip+0x0/0x20
[ 1827.983912]
I also found that OFD does this in enough places that I added || strstr(current->comm, "ll_ost") != NULL to the assertions above, so that they do not fire in ll_ost service threads. Should OFD be following the same order?
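For reference, the ordering that the added assertions check can be modelled outside the kernel. This is only a userspace sketch, not Lustre code: the sketch_* functions and struct sketch_thread_info are hypothetical stand-ins, and only the counter names (oti_txns, oti_w_locks, oti_r_locks) come from the patch above. It assumes the intended order is to start the transaction before taking the object lock, as the assertions imply.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-thread OSD state; only the three
 * counters referenced by the debugging patch are modelled. */
struct sketch_thread_info {
	int oti_txns;
	int oti_w_locks;
	int oti_r_locks;
};

static void sketch_write_lock(struct sketch_thread_info *oti)
{
	oti->oti_w_locks++;
}

static void sketch_write_unlock(struct sketch_thread_info *oti)
{
	assert(oti->oti_w_locks > 0);
	oti->oti_w_locks--;
}

/* Mirrors the checks in the patch, but reports a violation by returning
 * false instead of triggering an LBUG. */
static bool sketch_trans_start(struct sketch_thread_info *oti)
{
	if (oti->oti_txns != 0 || oti->oti_w_locks != 0 ||
	    oti->oti_r_locks != 0)
		return false;
	oti->oti_txns++;
	return true;
}

static void sketch_trans_stop(struct sketch_thread_info *oti)
{
	assert(oti->oti_txns > 0);
	oti->oti_txns--;
}

/* The order reported in this ticket: lock first, then start. */
static bool sketch_lock_then_start(struct sketch_thread_info *oti)
{
	bool started;

	sketch_write_lock(oti);
	started = sketch_trans_start(oti);
	if (started)
		sketch_trans_stop(oti);
	sketch_write_unlock(oti);
	return started;
}

/* The assumed intended order: start first, then lock. */
static bool sketch_start_then_lock(struct sketch_thread_info *oti)
{
	if (!sketch_trans_start(oti))
		return false;
	sketch_write_lock(oti);
	/* ... modify the object under the lock ... */
	sketch_write_unlock(oti);
	sketch_trans_stop(oti);
	return true;
}
```

In this model, sketch_lock_then_start() is rejected by the check because oti_w_locks is nonzero when the transaction starts, while sketch_start_then_lock() passes and leaves all counters at zero.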