[LU-17125] jbd2 commit deadlock Created: 18/Sep/23  Updated: 25/Oct/23  Resolved: 25/Oct/23

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Vladimir Saveliev Assignee: WC Triage
Resolution: Duplicate Votes: 1
Labels: None

Issue Links:
Duplicate
duplicates LU-15880 ASSERTION( lqe->u.se.lse_pending_writ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The following commit deadlock has been observed on a customer side:

Sep  7 16:41:55 lustren002 kernel: jbd2/md65-8     D    0 259595      2 0x80004000
Sep  7 16:41:55 lustren002 kernel: Call Trace:
Sep  7 16:41:55 lustren002 kernel:  __schedule+0x2c4/0x700
Sep  7 16:41:55 lustren002 kernel:  ? __wake_up_common_lock+0x89/0xc0
Sep  7 16:41:55 lustren002 kernel:  schedule+0x38/0xa0
Sep  7 16:41:55 lustren002 kernel:  jbd2_log_wait_commit+0xac/0x120 [jbd2]
Sep  7 16:41:55 lustren002 kernel:  ? finish_wait+0x80/0x80
Sep  7 16:41:55 lustren002 kernel:  ldiskfs_sync_fs+0x1c1/0x1d0 [ldiskfs]
Sep  7 16:41:55 lustren002 kernel:  osd_sync+0xdf/0x180 [osd_ldiskfs]
Sep  7 16:41:55 lustren002 kernel:  qsd_acquire+0x468/0xdd0 [lquota]
Sep  7 16:41:55 lustren002 kernel:  qsd_op_begin0+0x161/0x890 [lquota]
Sep  7 16:41:55 lustren002 kernel:  ? __kmalloc+0x102/0x240
Sep  7 16:41:55 lustren002 kernel:  qsd_reserve_or_free_quota+0x2e1/0x7b0 [lquota]
Sep  7 16:41:55 lustren002 kernel:  osd_reserve_or_free_quota+0xe6/0x1c0 [osd_ldiskfs]
Sep  7 16:41:55 lustren002 kernel:  ? lu_context_init+0xa5/0x1a0 [obdclass]
Sep  7 16:41:55 lustren002 kernel:  tgt_cb_last_committed+0xc2/0x5c0 [ptlrpc]
Sep  7 16:41:55 lustren002 kernel:  osd_trans_commit_cb+0xe6/0x300 [osd_ldiskfs]
Sep  7 16:41:55 lustren002 kernel:  ldiskfs_journal_commit_callback+0xa5/0xe0 [ldiskfs]
Sep  7 16:41:55 lustren002 kernel:  jbd2_journal_commit_transaction+0x17ba/0x19e0 [jbd2]
Sep  7 16:41:55 lustren002 kernel:  kjournald2+0xbd/0x270 [jbd2]

Revert of "LU-14370 quota: use dt_sync() to flush pending writes" could be a potential fix for the problem.



 Comments   
Comment by Alex Zhuravlev [ 18/Sep/23 ]

I rather think that calling qsd_acquire() from commit callback context is a bad idea.

Comment by Vladimir Saveliev [ 19/Sep/23 ]

I rather think that calling qsd_acquire() from commit callback context is a bad idea.

ok, then would it make sense to ask Hongchao Zhang to take a look, as that was done in:

From: Hongchao Zhang <hongchao@whamcloud.com>
Date: Fri, 2 Apr 2021 14:53:59 +0800
Subject: [PATCH] LU-11303 quota: enforce block quota for chgrp
Comment by Alexander Zarochentsev [ 25/Oct/23 ]

The issue was hit with the lustre code w/o LU-15880 landed, that code contains an incorrect integer values conversion when tgt_cb_last_committed() calls dt_reserve_or_free_quota():

                dt_reserve_or_free_quota(&temp_env, th->th_dev,
                                         th->th_reserved_quota.qrr_type,
                                         th->th_reserved_quota.qrr_id.qid_uid,
                                         th->th_reserved_quota.qrr_id.qid_gid,
                                         -th->th_reserved_quota.qrr_count,
                                         false);

the "count" parameter in dt_reserve_or_free_quota prototype is inly 4-bytes "int", while it is used to pass an 64bit value down the call stack. For enough big numbers of quota space to be released, the sign bit of the value can be lost.. the underlaying functions use sign of the passing value to determine whether it is quota "reservation" or "freeing". Eventually, in qsd_reserve_or_free_quota(), it causes wrong function to be called, qsd_op_begin0 instead of qsd_op_end0 at the transaction commit time.

here the values from a crash dump:

crash> osd_thandle.ot_super.th_reserved_quota -x ffff95df1fc4ee00
  ot_super.th_reserved_quota = {
    qrr_type = GRPQUOTA,
    qrr_id = {
      qid_fid = {
        f_seq = 0x760a83c4,
        f_oid = 0x0,
        f_ver = 0x0
      },
      qid_uid = 0x760a83c4,
      qid_gid = 0x760a83c4,
      qid_projid = 0x760a83c4
    },
    qrr_count = 0x80001000
  },
crash> 

And, LU-15880 fixes that by getting rid of "int count" parameter in dt_reserve_or_free_quota(). So the issue should not be reproducible with the changes from LU-15880 patches in place.

Comment by Alexander Zarochentsev [ 25/Oct/23 ]

Closing as a dup of LU-15880.

Generated at Sat Feb 10 03:32:51 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.