[LU-17125] jbd2 commit deadlock Created: 18/Sep/23 Updated: 25/Oct/23 Resolved: 25/Oct/23 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Vladimir Saveliev | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 1 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
The following commit deadlock has been observed on a customer side: Sep 7 16:41:55 lustren002 kernel: jbd2/md65-8 D 0 259595 2 0x80004000 Sep 7 16:41:55 lustren002 kernel: Call Trace: Sep 7 16:41:55 lustren002 kernel: __schedule+0x2c4/0x700 Sep 7 16:41:55 lustren002 kernel: ? __wake_up_common_lock+0x89/0xc0 Sep 7 16:41:55 lustren002 kernel: schedule+0x38/0xa0 Sep 7 16:41:55 lustren002 kernel: jbd2_log_wait_commit+0xac/0x120 [jbd2] Sep 7 16:41:55 lustren002 kernel: ? finish_wait+0x80/0x80 Sep 7 16:41:55 lustren002 kernel: ldiskfs_sync_fs+0x1c1/0x1d0 [ldiskfs] Sep 7 16:41:55 lustren002 kernel: osd_sync+0xdf/0x180 [osd_ldiskfs] Sep 7 16:41:55 lustren002 kernel: qsd_acquire+0x468/0xdd0 [lquota] Sep 7 16:41:55 lustren002 kernel: qsd_op_begin0+0x161/0x890 [lquota] Sep 7 16:41:55 lustren002 kernel: ? __kmalloc+0x102/0x240 Sep 7 16:41:55 lustren002 kernel: qsd_reserve_or_free_quota+0x2e1/0x7b0 [lquota] Sep 7 16:41:55 lustren002 kernel: osd_reserve_or_free_quota+0xe6/0x1c0 [osd_ldiskfs] Sep 7 16:41:55 lustren002 kernel: ? lu_context_init+0xa5/0x1a0 [obdclass] Sep 7 16:41:55 lustren002 kernel: tgt_cb_last_committed+0xc2/0x5c0 [ptlrpc] Sep 7 16:41:55 lustren002 kernel: osd_trans_commit_cb+0xe6/0x300 [osd_ldiskfs] Sep 7 16:41:55 lustren002 kernel: ldiskfs_journal_commit_callback+0xa5/0xe0 [ldiskfs] Sep 7 16:41:55 lustren002 kernel: jbd2_journal_commit_transaction+0x17ba/0x19e0 [jbd2] Sep 7 16:41:55 lustren002 kernel: kjournald2+0xbd/0x270 [jbd2] Revert of " |
| Comments |
| Comment by Alex Zhuravlev [ 18/Sep/23 ] |
|
I rather think that calling qsd_acquire() from commit callback context is a bad idea. |
| Comment by Vladimir Saveliev [ 19/Sep/23 ] |
ok, then would it make sense to ask Hongchao Zhang to take a look, as that was done in: From: Hongchao Zhang <hongchao@whamcloud.com> Date: Fri, 2 Apr 2021 14:53:59 +0800 Subject: [PATCH] LU-11303 quota: enforce block quota for chgrp |
| Comment by Alexander Zarochentsev [ 25/Oct/23 ] |
|
The issue was hit with the lustre code w/o
dt_reserve_or_free_quota(&temp_env, th->th_dev,
th->th_reserved_quota.qrr_type,
th->th_reserved_quota.qrr_id.qid_uid,
th->th_reserved_quota.qrr_id.qid_gid,
-th->th_reserved_quota.qrr_count,
false);
the "count" parameter in dt_reserve_or_free_quota prototype is inly 4-bytes "int", while it is used to pass an 64bit value down the call stack. For enough big numbers of quota space to be released, the sign bit of the value can be lost.. the underlaying functions use sign of the passing value to determine whether it is quota "reservation" or "freeing". Eventually, in qsd_reserve_or_free_quota(), it causes wrong function to be called, qsd_op_begin0 instead of qsd_op_end0 at the transaction commit time. here the values from a crash dump: crash> osd_thandle.ot_super.th_reserved_quota -x ffff95df1fc4ee00
ot_super.th_reserved_quota = {
qrr_type = GRPQUOTA,
qrr_id = {
qid_fid = {
f_seq = 0x760a83c4,
f_oid = 0x0,
f_ver = 0x0
},
qid_uid = 0x760a83c4,
qid_gid = 0x760a83c4,
qid_projid = 0x760a83c4
},
qrr_count = 0x80001000
},
crash>
And, |
| Comment by Alexander Zarochentsev [ 25/Oct/23 ] |
|
Closing as a dup of |