Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.13.0, Lustre 2.12.3
    • Upstream
    • Red Hat 7.7 on VMware
      Red Hat 7.7 on HPE ProLiant DL380 Gen10
      Red Hat 7.7 on HPE Synergy 480 Gen10

    Description

      After successfully creating packages for Red Hat 7.7 (e.g. lustre-2.12.57_35_g55a7e2d-1.el7.x86_64.rpm), I get CPU soft lockups when trying to create an MGS with an LDISKFS backend:

      NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [mkfs.lustre:31220]

      More details from log:

      Sep  6 10:41:00 mgs1 kernel: Call Trace:
      Sep  6 10:41:00 mgs1 kernel: [<ffffffff9bd73365>] queued_spin_lock_slowpath+0xb/0xf
      Sep  6 10:41:00 mgs1 kernel: [<ffffffff9bd81ad0>] _raw_spin_lock+0x20/0x30
      Sep  6 10:41:00 mgs1 kernel: [<ffffffff9b865e2e>] igrab+0x1e/0x60
      Sep  6 10:41:00 mgs1 kernel: [<ffffffffc06bd88b>] ldiskfs_quota_off+0x3b/0x130 [ldiskfs]
      Sep  6 10:41:00 mgs1 kernel: [<ffffffffc06c091d>] ldiskfs_put_super+0x4d/0x400 [ldiskfs]
      Sep  6 10:41:00 mgs1 kernel: [<ffffffff9b84b13d>] generic_shutdown_super+0x6d/0x100
      Sep  6 10:41:00 mgs1 kernel: [<ffffffff9b84b5b7>] kill_block_super+0x27/0x70
      Sep  6 10:41:00 mgs1 kernel: [<ffffffff9b84b91e>] deactivate_locked_super+0x4e/0x70
      Sep  6 10:41:00 mgs1 kernel: [<ffffffff9b84c0a6>] deactivate_super+0x46/0x60
      Sep  6 10:41:00 mgs1 kernel: [<ffffffff9b86abff>] cleanup_mnt+0x3f/0x80
      Sep  6 10:41:00 mgs1 kernel: [<ffffffff9b86ac92>] __cleanup_mnt+0x12/0x20
      Sep  6 10:41:00 mgs1 kernel: [<ffffffff9b6c1c0b>] task_work_run+0xbb/0xe0
      Sep  6 10:41:00 mgs1 kernel: [<ffffffff9b62cc65>] do_notify_resume+0xa5/0xc0
      Sep  6 10:41:00 mgs1 kernel: [<ffffffff9bd8c23b>] int_signal+0x12/0x17
      Sep  6 10:41:00 mgs1 kernel: Code: 47 fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 66 90 b9 01 00 00 00 8b 17 85 d2 74 0d 83 fa 03 74 08 f3 90 <8b> 17 85 d2 75 f3 89 d0 f0 0f b1 0f 39 c2 75 e3 5d 66 90 c3 0f

      I also tried to set up an MDS/MGS pair on the DL380, but mkfs.lustre got stuck the same way as on VMware.

        Activity

          [LU-12755] CPU soft lockup on mkfs.lustre
          pjones Peter Jones added a comment -

          Landed for 2.13


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36203/
          Subject: LU-12755 ldiskfs: fix project quota unpon unpatched kernel
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: d780f15a2d63c8bde5ae6345aed85b4b44904fb5

          gerrit Gerrit Updater added a comment -

          Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36270
          Subject: LU-12755 ldiskfs: fix project quota unpon unpatched kernel
          Project: fs/lustre-release
          Branch: b2_12
          Current Patch Set: 1
          Commit: e56b4bc0970275ce6413883794bd52d2dfa7b164

          wshilong Wang Shilong (Inactive) added a comment -

          Li Xi,

          I worked around that issue by temporarily setting the PRJQUOTA active flag and then calling dquot_initialize(), which in turn calls ext4_get_projid().
          I just tested the fixed version, and it seems to work:

          +	sb_dqopt(sb)->flags |= dquot_state_flag(DQUOT_USAGE_ENABLED, PRJQUOTA);
          +	dquot_initialize(root);
          +	sb_dqopt(sb)->flags &= ~dquot_state_flag(DQUOT_USAGE_ENABLED, PRJQUOTA);

          With some printk debugging added, the detection appears to work.

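The probe described in this comment can be illustrated with a small userland model. This is only a sketch of the idea as it reads from the thread, not the merged patch: every `model_*` name is hypothetical, and the assumption that an unpatched kernel walks two quota types while a patched one walks three is taken from the later remark about checking `MAXQUOTAS == 3`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userland model; none of these names are kernel APIs. */
enum { MODEL_USRQUOTA = 0, MODEL_GRPQUOTA = 1, MODEL_PRJQUOTA = 2 };

struct model_sb {
	unsigned int quota_flags;  /* one "usage enabled" bit per quota type */
	int kernel_maxquotas;      /* assumed: 2 if unpatched, 3 if patched */
	bool projid_hook_called;   /* set when the project-ID callback runs */
};

static unsigned int model_usage_flag(int type)
{
	return 1u << type;
}

/* Stand-in for dquot_initialize(): walks only the types the kernel knows. */
static void model_dquot_initialize(struct model_sb *sb)
{
	assert(sb != NULL);
	for (int type = 0; type < sb->kernel_maxquotas; type++) {
		if (!(sb->quota_flags & model_usage_flag(type)))
			continue;
		if (type == MODEL_PRJQUOTA)
			sb->projid_hook_called = true; /* stand-in for ext4_get_projid() */
	}
}

/* Set the project flag, probe, then clear the flag again. */
static bool model_kernel_has_project_quota(struct model_sb *sb)
{
	assert(sb != NULL);
	sb->projid_hook_called = false;
	sb->quota_flags |= model_usage_flag(MODEL_PRJQUOTA);
	model_dquot_initialize(sb);
	sb->quota_flags &= ~model_usage_flag(MODEL_PRJQUOTA);
	return sb->projid_hook_called;
}
```

In this model the probe leaves `quota_flags` as it found them, provided the project bit was clear beforehand, and reports whether the project-ID callback was reached. Whether the real kernel path behaves the same way depends on details the thread does not spell out.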
          lixi_wc Li Xi added a comment -

          Shilong, I assume you mean the updated patch https://review.whamcloud.com/#/c/36203/

          Is there any reason why ext4_get_projid() is called before ext4_enable_quotas()? I thought ext4_enable_quotas() was always called before anyone calls ext4_get_projid().

          yujian Jian Yu added a comment -

          Sure, Shilong.


          wshilong Wang Shilong (Inactive) added a comment -

          The workaround seems to work in limited testing. Yu Jian, could you continue testing with the refreshed patch:

          1) Build and install ldiskfs on a patched kernel, then switch to an unpatched kernel.
          2) Build and install ldiskfs on an unpatched kernel, then switch to a patched kernel.

          This is to make sure that mounting and unmounting Lustre works in both cases.

          lixi_wc Li Xi added a comment -

          During the mount phase, dquot_initialize() is called on the ROOT inode for the first time, but at that point ext4_enable_quotas() has not been called yet. Unfortunately, we also need to check whether the kernel supports PRJQUOTA inside ext4_enable_quotas().

          I think we can call ext4_enable_quotas() first for group quota and user quota, then temporarily set the flag that sb_has_quota_active(PRJQUOTA) checks, and then call dquot_initialize() to check whether MAXQUOTAS == 3.

          lixi_wc Li Xi added a comment -

          Instead of printing an error, returning an error seems better. It would prevent any future problems.

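The difference between warning and failing can be sketched in userland C. The example assumes the module's view of the quota-type count is fixed at build time by `HAVE_PROJECT_QUOTA` (the macro used in the warning proposal in this thread) and that the running kernel's count can somehow be obtained at mount time. `MODEL_BUILD_MAXQUOTAS` and `model_check_quota_abi()` are hypothetical names, not part of ldiskfs or the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical: number of quota types the module was compiled for. */
#ifdef HAVE_PROJECT_QUOTA
#define MODEL_BUILD_MAXQUOTAS 3
#else
#define MODEL_BUILD_MAXQUOTAS 2
#endif

/*
 * Return 0 when the build-time and running-kernel quota type counts agree,
 * and -EINVAL otherwise, so that the caller can refuse the mount instead of
 * only logging a warning.
 */
static int model_check_quota_abi(int running_maxquotas)
{
	assert(running_maxquotas > 0);
	if (running_maxquotas == MODEL_BUILD_MAXQUOTAS)
		return 0;
	fprintf(stderr,
		"module built for %d quota types, kernel provides %d\n",
		MODEL_BUILD_MAXQUOTAS, running_maxquotas);
	return -EINVAL;
}
```

A mount path written this way would propagate the non-zero return value to the caller, so a mismatched module fails early rather than continuing with inconsistent quota structures.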

          wshilong Wang Shilong (Inactive) added a comment -

          The good news is that this issue won't exist for RHEL8 and later kernels, though.

          wshilong Wang Shilong (Inactive) added a comment -

          Unfortunately, it looks hard to make this work ideally. We should not be blocked on this; at the least we need to fix the common case. How about adding some warning messages at mount time?

          diff --git a/fs/ext4/super.c b/fs/ext4/super.c
          index ca8b50c8..cddf6595 100644
          --- a/fs/ext4/super.c
          +++ b/fs/ext4/super.c
          @@ -4489,6 +4489,13 @@ no_journal:
                  ratelimit_state_init(&sbi->s_msg_ratelimit_state, 5 * HZ, 10);

                  kfree(orig_data);
          +#ifdef  HAVE_PROJECT_QUOTA
          +       ext4_msg(sb, KERN_WARNING,
          +               "ext4 module compiled with patched kernel won't work on unpatched kernel");
          +#else
          +       ext4_msg(sb, KERN_WARNING,
          +               "ext4 module compiled with unpatched kernel won't work on patched kernel");
          +#endif
                  return 0;

          We can mark it as a known issue, so that administrators are at least aware of it.

          People

            yujian Jian Yu
            kazinczy Tamas Kazinczy (Inactive)