Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      The following commit deadlock has been observed on a customer side:

      Sep  7 16:41:55 lustren002 kernel: jbd2/md65-8     D    0 259595      2 0x80004000
      Sep  7 16:41:55 lustren002 kernel: Call Trace:
      Sep  7 16:41:55 lustren002 kernel:  __schedule+0x2c4/0x700
      Sep  7 16:41:55 lustren002 kernel:  ? __wake_up_common_lock+0x89/0xc0
      Sep  7 16:41:55 lustren002 kernel:  schedule+0x38/0xa0
      Sep  7 16:41:55 lustren002 kernel:  jbd2_log_wait_commit+0xac/0x120 [jbd2]
      Sep  7 16:41:55 lustren002 kernel:  ? finish_wait+0x80/0x80
      Sep  7 16:41:55 lustren002 kernel:  ldiskfs_sync_fs+0x1c1/0x1d0 [ldiskfs]
      Sep  7 16:41:55 lustren002 kernel:  osd_sync+0xdf/0x180 [osd_ldiskfs]
      Sep  7 16:41:55 lustren002 kernel:  qsd_acquire+0x468/0xdd0 [lquota]
      Sep  7 16:41:55 lustren002 kernel:  qsd_op_begin0+0x161/0x890 [lquota]
      Sep  7 16:41:55 lustren002 kernel:  ? __kmalloc+0x102/0x240
      Sep  7 16:41:55 lustren002 kernel:  qsd_reserve_or_free_quota+0x2e1/0x7b0 [lquota]
      Sep  7 16:41:55 lustren002 kernel:  osd_reserve_or_free_quota+0xe6/0x1c0 [osd_ldiskfs]
      Sep  7 16:41:55 lustren002 kernel:  ? lu_context_init+0xa5/0x1a0 [obdclass]
      Sep  7 16:41:55 lustren002 kernel:  tgt_cb_last_committed+0xc2/0x5c0 [ptlrpc]
      Sep  7 16:41:55 lustren002 kernel:  osd_trans_commit_cb+0xe6/0x300 [osd_ldiskfs]
      Sep  7 16:41:55 lustren002 kernel:  ldiskfs_journal_commit_callback+0xa5/0xe0 [ldiskfs]
      Sep  7 16:41:55 lustren002 kernel:  jbd2_journal_commit_transaction+0x17ba/0x19e0 [jbd2]
      Sep  7 16:41:55 lustren002 kernel:  kjournald2+0xbd/0x270 [jbd2]
      

      Revert of "LU-14370 quota: use dt_sync() to flush pending writes" could be a potential fix for the problem.

      Attachments

        Issue Links

          Activity

            [LU-17125] jbd2 commit deadlock

            Closing as a dup of LU-15880.

            zam Alexander Zarochentsev added a comment - Closing as a dup of LU-15880 .

            The issue was hit with the lustre code w/o LU-15880 landed, that code contains an incorrect integer values conversion when tgt_cb_last_committed() calls dt_reserve_or_free_quota():

                            dt_reserve_or_free_quota(&temp_env, th->th_dev,
                                                     th->th_reserved_quota.qrr_type,
                                                     th->th_reserved_quota.qrr_id.qid_uid,
                                                     th->th_reserved_quota.qrr_id.qid_gid,
                                                     -th->th_reserved_quota.qrr_count,
                                                     false);
            

            the "count" parameter in dt_reserve_or_free_quota prototype is inly 4-bytes "int", while it is used to pass an 64bit value down the call stack. For enough big numbers of quota space to be released, the sign bit of the value can be lost.. the underlaying functions use sign of the passing value to determine whether it is quota "reservation" or "freeing". Eventually, in qsd_reserve_or_free_quota(), it causes wrong function to be called, qsd_op_begin0 instead of qsd_op_end0 at the transaction commit time.

            here the values from a crash dump:

            crash> osd_thandle.ot_super.th_reserved_quota -x ffff95df1fc4ee00
              ot_super.th_reserved_quota = {
                qrr_type = GRPQUOTA,
                qrr_id = {
                  qid_fid = {
                    f_seq = 0x760a83c4,
                    f_oid = 0x0,
                    f_ver = 0x0
                  },
                  qid_uid = 0x760a83c4,
                  qid_gid = 0x760a83c4,
                  qid_projid = 0x760a83c4
                },
                qrr_count = 0x80001000
              },
            crash> 
            

            And, LU-15880 fixes that by getting rid of "int count" parameter in dt_reserve_or_free_quota(). So the issue should not be reproducible with the changes from LU-15880 patches in place.

            zam Alexander Zarochentsev added a comment - The issue was hit with the lustre code w/o LU-15880 landed, that code contains an incorrect integer values conversion when tgt_cb_last_committed() calls dt_reserve_or_free_quota(): dt_reserve_or_free_quota(&temp_env, th->th_dev, th->th_reserved_quota.qrr_type, th->th_reserved_quota.qrr_id.qid_uid, th->th_reserved_quota.qrr_id.qid_gid, -th->th_reserved_quota.qrr_count, false ); the "count" parameter in dt_reserve_or_free_quota prototype is inly 4-bytes "int", while it is used to pass an 64bit value down the call stack. For enough big numbers of quota space to be released, the sign bit of the value can be lost.. the underlaying functions use sign of the passing value to determine whether it is quota "reservation" or "freeing". Eventually, in qsd_reserve_or_free_quota(), it causes wrong function to be called, qsd_op_begin0 instead of qsd_op_end0 at the transaction commit time. here the values from a crash dump: crash> osd_thandle.ot_super.th_reserved_quota -x ffff95df1fc4ee00 ot_super.th_reserved_quota = { qrr_type = GRPQUOTA, qrr_id = { qid_fid = { f_seq = 0x760a83c4, f_oid = 0x0, f_ver = 0x0 }, qid_uid = 0x760a83c4, qid_gid = 0x760a83c4, qid_projid = 0x760a83c4 }, qrr_count = 0x80001000 }, crash> And, LU-15880 fixes that by getting rid of "int count" parameter in dt_reserve_or_free_quota(). So the issue should not be reproducible with the changes from LU-15880 patches in place.

            I rather think that calling qsd_acquire() from commit callback context is a bad idea.

            ok, then would it make sense to ask Hongchao Zhang to take a look, as that was done in:

            From: Hongchao Zhang <hongchao@whamcloud.com>
            Date: Fri, 2 Apr 2021 14:53:59 +0800
            Subject: [PATCH] LU-11303 quota: enforce block quota for chgrp
            
            vsaveliev Vladimir Saveliev added a comment - I rather think that calling qsd_acquire() from commit callback context is a bad idea. ok, then would it make sense to ask Hongchao Zhang to take a look, as that was done in: From: Hongchao Zhang <hongchao@whamcloud.com> Date: Fri, 2 Apr 2021 14:53:59 +0800 Subject: [PATCH] LU-11303 quota: enforce block quota for chgrp

            I rather think that calling qsd_acquire() from commit callback context is a bad idea.

            bzzz Alex Zhuravlev added a comment - I rather think that calling qsd_acquire() from commit callback context is a bad idea.

            People

              wc-triage WC Triage
              vsaveliev Vladimir Saveliev
              Votes:
              1 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: