Lustre / LU-5994

DT transaction start and object lock ordering

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • None
    • Lustre 2.7.0
    • None
    • 3
    • 16718

    Description

      lfsck_layout_slave_conditional_destroy() and lfsck_layout_slave_repair_pfid() both call dt_trans_start_local() while holding a DT write lock on an object. This was found by code inspection.
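
      For reference, the ordering in question looks roughly like the sketch below (illustrative only, using the generic dt_* wrappers; the dt_declare_* steps and error handling of the real LFSCK call sites are omitted):

      /* Order used by the two LFSCK routines above: lock first, then transaction. */
      dt_write_lock(env, obj, 0);              /* DT write lock taken on the object */
      th = dt_trans_create(env, dev);
      rc = dt_trans_start_local(env, dev, th); /* transaction started while the lock is held */
      /* ... modify obj ... */
      dt_trans_stop(env, dev, th);
      dt_write_unlock(env, obj);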

      I also ran with the following:

      diff --git a/lustre/osd-ldiskfs/osd_handler.c b/lustre/osd-ldiskfs/osd_handler.c
      index 80d1e67..ed15b0c 100644
      --- a/lustre/osd-ldiskfs/osd_handler.c
      +++ b/lustre/osd-ldiskfs/osd_handler.c
      @@ -1058,6 +1058,8 @@ int osd_trans_start(const struct lu_env *env, struct dt_device *d,
               if (!IS_ERR(jh)) {
                       oh->ot_handle = jh;
                       LASSERT(oti->oti_txns == 0);
      +               LASSERT(oti->oti_w_locks == 0);
      +               LASSERT(oti->oti_r_locks == 0);
                       lu_context_init(&th->th_ctx, th->th_tags);
                       lu_context_enter(&th->th_ctx);
      

      Running sanity-lfsck.sh I was able to trigger a crash from lfsck_layout_slave_repair_pfid():

      [ 1827.711965] Lustre: DEBUG MARKER: == sanity-lfsck test 19b: OST-object inconsistency self repair == 11:01:32 (1417798892)
      [ 1827.813194] Lustre: DEBUG MARKER: cancel_lru_locks osc start
      [ 1827.838214] Lustre: *** cfs_fail_loc=1611, val=0***
      [ 1827.839821] Lustre: Skipped 3 previous similar messages
      [ 1827.877070] Lustre: DEBUG MARKER: cancel_lru_locks osc stop
      [ 1827.957214] LustreError: 21862:0:(osd_handler.c:1061:osd_trans_start()) ASSERTION( oti->oti_w_locks == 0 ) failed:
      [ 1827.960100] LustreError: 21862:0:(osd_handler.c:1061:osd_trans_start()) LBUG
      [ 1827.961698] Pid: 21862, comm: inconsistency_v
      [ 1827.962700]
      [ 1827.962702] Call Trace:
      [ 1827.963692]  [<ffffffffa052f8c5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [ 1827.965393]  [<ffffffffa052fec7>] lbug_with_loc+0x47/0xb0 [libcfs]
      [ 1827.966913]  [<ffffffffa0d98809>] osd_trans_start+0x389/0x730 [osd_ldiskfs]
      [ 1827.968553]  [<ffffffffa0c44ed2>] lfsck_layout_slave_in_notify+0x982/0xcf0 [lfsck]
      [ 1827.969963]  [<ffffffffa0c0cd37>] lfsck_in_notify+0xf7/0x5a0 [lfsck]
      [ 1827.971155]  [<ffffffffa0fe6037>] ofd_inconsistency_verification_main+0x367/0xdb0 [ofd]
      [ 1827.972639]  [<ffffffff8105e59d>] ? finish_task_switch+0x7d/0x110
      [ 1827.973774]  [<ffffffff8105e568>] ? finish_task_switch+0x48/0x110
      [ 1827.974896]  [<ffffffff81061d90>] ? default_wake_function+0x0/0x20
      [ 1827.976039]  [<ffffffffa0fe5cd0>] ? ofd_inconsistency_verification_main+0x0/0xdb0 [ofd]
      [ 1827.977539]  [<ffffffff8109e856>] kthread+0x96/0xa0
      [ 1827.978448]  [<ffffffff8100c30a>] child_rip+0xa/0x20
      [ 1827.979368]  [<ffffffff815562e0>] ? _spin_unlock_irq+0x30/0x40
      [ 1827.980507]  [<ffffffff8100bb10>] ? restore_args+0x0/0x30
      [ 1827.981704]  [<ffffffff8109e7c0>] ? kthread+0x0/0xa0
      [ 1827.982787]  [<ffffffff8100c300>] ? child_rip+0x0/0x20
      [ 1827.983912]
      

      I also found that OFD does this in enough places that I added

       || strstr(current->comm, "ll_ost") != NULL

      to the assertions above. Should OFD be following the same order?
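
      For clarity, the relaxed check described above would look roughly like the following inside osd_trans_start() (a sketch of the local debugging change, not a proposed patch; "ll_ost" matches the names of the OST service threads):

      /* No DT locks should be held when the transaction starts, except on
       * OST service threads, which currently take the object lock first. */
      LASSERT(oti->oti_w_locks == 0 ||
              strstr(current->comm, "ll_ost") != NULL);
      LASSERT(oti->oti_r_locks == 0 ||
              strstr(current->comm, "ll_ost") != NULL);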

      Attachments

        Issue Links

          Activity


            yong.fan nasf (Inactive) added a comment -
            It will be fixed via LU-10048.

            yong.fan nasf (Inactive) added a comment -
            Part of the issues with the order of starting a transaction and locking an object will be handled via the patch http://review.whamcloud.com/23689.

            yong.fan nasf (Inactive) added a comment -
            Then I would say that it will not be a small change. We need to unify the dt_lock and transaction ordering on both the MDT and the OST, which may affect not only scalability but may also introduce potential deadlocks. Currently, the dt_lock/transaction order used by LFSCK matches both the MDT and the OST (different functions use different orders).
            Since Lustre 2.7 feature landing has already closed, we can resolve this known technical debt in the next release.

            bzzz Alex Zhuravlev added a comment -
            Well, I think it'd be very good to have the same order, of course. As for the MDS: basically we use dt_write_lock() to control nlink+/nlink-, IIRC. If we take the lock outside of a transaction, then basically all operations in a directory are serialized. Actually, both OSDs protect nlink internally, and I wonder whether that wouldn't be enough. Originally this "local" locking was introduced in 2.0 to support a mode where the LDLM lock is taken per node, not per thread. Something to take into account, I guess.

            yong.fan nasf (Inactive) added a comment -
            If we follow the local filesystem convention, we should take the dt_lock before the transaction is started, right? But as a distributed filesystem, what is the potential trouble if we change the MDS-side order to match the OST side? And what should we do for this ticket? We have kept these inconsistent orders for a long time.

            bzzz Alex Zhuravlev added a comment -
            Yes, this is known technical debt. We always wanted a locks-then-transaction order (this is what the majority of filesystems do) and implemented OFD this way, but that would serialize directory operations on the MDS.

            yong.fan nasf (Inactive) added a comment -
            Hm, I remember now why LFSCK uses such a transaction/dt_lock order. Normally, that is on the MDT, we start the transaction first and then take the dt_lock; but on the OST we use the reverse order, for example in ofd_object_destroy().
            I do not know why the OFD behaves differently from the MDT.
            Alex, what's your suggestion for that?
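
            A condensed sketch of the two orders discussed in this thread (illustrative only, using the generic dt_* wrappers; dt_declare_* steps and error handling omitted, not the literal MDT/OFD code):

            /* MDT-style order: transaction first, then the object lock. */
            th = dt_trans_create(env, dev);
            /* ... dt_declare_* calls ... */
            rc = dt_trans_start(env, dev, th);
            dt_write_lock(env, obj, 0);
            /* ... apply updates ... */
            dt_write_unlock(env, obj);
            dt_trans_stop(env, dev, th);

            /* OST-style order (e.g. ofd_object_destroy()): object lock first. */
            dt_write_lock(env, obj, 0);
            th = dt_trans_create(env, dev);
            /* ... dt_declare_* calls ... */
            rc = dt_trans_start(env, dev, th);
            /* ... apply updates ... */
            dt_trans_stop(env, dev, th);
            dt_write_unlock(env, obj);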

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: jhammond John Hammond
              Votes: 0
              Watchers: 6

              Dates

                Created:
                Updated:
                Resolved: