Lustre / LU-20227

Divide-by-zero kernel panic in ll_statfs_project() when running the Linux du command on a snapshot-mounted directory with project quota


Details

    • Bug
    • Resolution: Unresolved
    • Medium
    • None
    • Lustre 2.17.0, Lustre 2.15.6
    • None
    • Rocky 8.10, ZFS backend
    • 3

    Description

      Summary

      This issue was first identified at a customer site running an HA configuration. Running the Linux @du@ command on one of the nodes triggered the panic, which in turn caused a failover.

      When a Lustre snapshot MDT is mounted in a DNE (Distributed Namespace Extension) configuration, @lod_statfs()@ iterates sub-MDT OSPs via @lod_foreach_mdt@. If a sub-MDT OSP returns @rc=0, os_bsize=0@ (uninitialized @opd_statfs@ cache), @lod_statfs_sum()@ silently right-shifts @sfs->os_bsize@ down to zero.

      The MDT then sends this corrupted value to the client with @OS_STATFS_SUM@ set. The client trusts the SUM result without further validation, and @ll_statfs_project()@ subsequently divides by @sfs->f_bsize=0@, triggering a CPU #DE (Divide Error) exception and kernel panic.

      The panic was reproduced in a Proxmox virtual environment.

      Environment

      Lustre version : 2.15.6 / 2.17.0 (2.17.0 used for this reproduction)
      Kernel : 4.18.0-553.5.1
      Backend FS : ZFS
      Topology : MGS, 3 MDTs, 5 OSTs, project quota enabled

      Reproduction Conditions

      All five conditions must hold simultaneously for the panic to occur:

      1. Snapshot client mounted : Entry point for the code path
      2. Snapshot MDT has sub-MDTs (DNE configuration) : The @lod_foreach_mdt@ loop actually iterates
      3. Sub-MDT OSP @opd_statfs@ cache uninitialized (@os_bsize=0@) : Zero is fed into @lod_statfs_sum()@
      4. Project quota enabled : @ll_statfs_project()@ is called
      5. Project block hard limit > 0 on the target directory : The division is executed

      Steps to Reproduce

      For detailed reproduction steps, please refer to the attached reproduction.sh script.

      Step 1. Set project quota on target directory (e.g. @-p 1000 --block-hardlimit 1G@)

      Step 2. Create snapshot by @lctl snapshot@

      Step 3. Mount snapshot targets

      Step 4. Extract snapshot fsname and wait for MDT recovery

      Step 5. Mount snapshot client (read-only)

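      Step 6. Trigger the bug by running @du@ on the quota-limited directory inside the snapshot mount (e.g. @du -sh /mnt/snap-client-lsnap/testdir@)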

      dmesg output

      [  304.329137] repro_lsnapshot: [DU] running du -sh /mnt/snap-client-lsnap/testdir
      [  304.335103] divide error: 0000 [#1] SMP PTI
      [  304.336510] CPU: 6 PID: 18020 Comm: du Kdump: loaded Tainted: P           OE     -------- -  - 4.18.0-553.5.1.el8_10.x86_64 #1
      [  304.336997] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
      [  304.337342] RIP: 0010:ll_statfs_project+0x7c9/0xab0 [lustre]
      [  304.337840] Code: fd ff 89 c3 85 c0 0f 85 29 01 00 00 48 8b 44 24 68 48 85 c0 75 05 48 8b 44 24 60 49 8b 75 08 48 c1 e0 0a 48 39 c6 77 3b 31 d2 <48> f7 f6 48 89 c1 49 39 45 10 76 2d 48 8b 44 24 70 31 d2 49 89 4d
      [  304.338546] RSP: 0018:ffffb850d0eb7c60 EFLAGS: 00010246
      [  304.338901] RAX: 0000000040000000 RBX: 0000000000000000 RCX: 0000000000006611
      [  304.339211] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000380e0
      [  304.339508] RBP: ffff95b4f40a6800 R08: 0000000080000000 R09: 0000000000000200
      [  304.339841] R10: 0000000069f2b7a6 R11: ffff95b4dad0f3f8 R12: ffff95b4cf7ec480
      [  304.340141] R13: ffffb850d0eb7eb0 R14: ffff95b4cf7ec510 R15: ffff95b4cf7ec488
      [  304.340429] FS:  00007f5a4d4a8540(0000) GS:ffff95b577d80000(0000) knlGS:0000000000000000
      [  304.340780] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  304.341075] CR2: 00007f5a4d3a6000 CR3: 0000000177654001 CR4: 0000000000170ee0
      [  304.341360] Call Trace:
      [  304.341774]  ? __die_body+0x1a/0x60
      [  304.342265]  ? die+0x2a/0x50
      [  304.342740]  ? do_trap+0xe7/0x110
      [  304.343029]  ? ll_statfs_project+0x7c9/0xab0 [lustre]
      [  304.343356]  ? do_divide_error+0x33/0x40
      [  304.343658]  ? ll_statfs_project+0x7c9/0xab0 [lustre]
      [  304.343947]  ? divide_error+0x14/0x20
      [  304.344218]  ? ll_statfs_project+0x7c9/0xab0 [lustre]
      [  304.344543]  ll_statfs+0x1e0/0x1f0 [lustre]
      [  304.344830]  statfs_by_dentry+0x67/0x90
      [  304.346448]  vfs_statfs+0x16/0xc0
      [  304.346939]  fd_statfs+0x2d/0x70
      [  304.347210]  __do_sys_fstatfs+0x20/0x60
      [  304.347472]  do_syscall_64+0x5b/0x1b0
      [  304.347719]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
      [  304.347961] RIP: 0033:0x7f5a4cee2cab
      [  304.348192] Code: 73 01 c3 48 8b 0d dd 81 39 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 8a 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ad 81 39 00 f7 d8 64 89 01 48
      [  304.348697] RSP: 002b:00007fff543fbd68 EFLAGS: 00000206 ORIG_RAX: 000000000000008a
      [  304.348939] RAX: ffffffffffffffda RBX: 0000558db5151600 RCX: 00007f5a4cee2cab
      [  304.349170] RDX: 000000000000000f RSI: 00007fff543fbd80 RDI: 0000000000000003
      [  304.349399] RBP: 0000558db51530d0 R08: 0000558db5153130 R09: 0000558db5150030
      [  304.349648] R10: 0000000000000000 R11: 0000000000000206 R12: 0000558db51528c0
      [  304.349880] R13: 0000000000000003 R14: 0000558db5151690 R15: 0000558db5151600
      [  304.350104] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc(OE) ko2iblnd(OE) obdclass(OE) lnet(OE) libcfs(OE) bonding rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_umad dm_multipath dm_mod rdma_cm ib_ipoib iw_cm ib_cm intel_rapl_msr intel_rapl_common sb_edac kvm_intel kvm irqbypass crc32_pclmul rapl pcspkr mlx5_ib ib_uverbs bochs drm_vram_helper ib_core drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt drm joydev i2c_piix4 xfs libcrc32c sr_mod cdrom sd_mod sg ata_generic crct10dif_pclmul mlx5_core crc32c_intel serio_raw ghash_clmulni_intel nvme ata_piix virtio_net mlxfw libata pci_hyperv_intf nvme_core tls net_failover failover psample virtio_scsi t10_pi zfs(POE) spl(OE)
      

      lctl check mdts output

      360a4637-MDT0000-mdc-ffff8a8ff369f000 active.
      360a4637-MDT0001-mdc-ffff8a8ff369f000 active.
      360a4637-MDT0002-mdc-ffff8a8ff369f000 active.
      ClstVol-MDT0000-mdc-ffff8a8f771e1800 active.
      ClstVol-MDT0001-mdc-ffff8a8f771e1800 active.
      ClstVol-MDT0002-mdc-ffff8a8f771e1800 active.
      

      lctl get_param output for Normal MDT

      lctl get_param osp.ClstVol-MDT0001-osp-MDT0000.blocksize \
      >                  osp.ClstVol-MDT0001-osp-MDT0000.kbytestotal \
      >                  osp.ClstVol-MDT0001-osp-MDT0000.kbytesavail
      osp.ClstVol-MDT0001-osp-MDT0000.blocksize=4096
      osp.ClstVol-MDT0001-osp-MDT0000.kbytestotal=199557632
      osp.ClstVol-MDT0001-osp-MDT0000.kbytesavail=199547776
      

      lctl get_param output for Snapshot MDT

      lctl get_param osp.360a4637-MDT0001-osp-MDT0000.blocksize \
      >                  osp.360a4637-MDT0001-osp-MDT0000.kbytestotal \
      >                  osp.360a4637-MDT0001-osp-MDT0000.kbytesavail
      osp.360a4637-MDT0001-osp-MDT0000.blocksize=0
      osp.360a4637-MDT0001-osp-MDT0000.kbytestotal=0
      osp.360a4637-MDT0001-osp-MDT0000.kbytesavail=0
      

      stat check on normal mount path

      stat --file-system /lustre/agent/ClstVol
        File: "/lustre/agent/ClstVol"
          ID: 3d67df9200000000  Namelen: 255  Type: lustre
      Block size: 4096       Fundamental block size: 4096
      Blocks: Total: 363066880  Free: 363046144  Available: 363043584
      Inodes: Total: 45381809   Free: 45380768
      

      stat check on snapshot path

      stat --file-system /mnt/snap-client-lsnap
        File: "/mnt/snap-client-lsnap"
          ID: 566b393000000000  Namelen: 255  Type: lustre
      Block size: 0          Fundamental block size: 0
      Blocks: Total: 0          Free: 0          Available: 0
      Inodes: Total: 36441549   Free: 36441227
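
The zero block size shown above can also be detected programmatically. The following is a minimal userspace sketch, not part of Lustre: `reports_zero_bsize()` is a hypothetical helper name, and it relies only on the standard Linux `statfs(2)` call. In this report `stat --file-system` on the snapshot mount root completed while `du` on the quota-limited directory panicked, so the sketch assumes it is pointed at the mount root only; whether that is safe on every affected system is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/vfs.h>

/*
 * Hypothetical helper: returns true when statfs(2) on `path` succeeds but
 * reports a block size of zero, which is the symptom seen on the snapshot
 * mount. Returns false on error (after printing the reason) or when the
 * block size is non-zero.
 */
static bool reports_zero_bsize(const char *path)
{
	struct statfs st;

	if (statfs(path, &st) != 0) {
		perror(path);
		return false;
	}
	return st.f_bsize == 0;
}
```

A caller could run `reports_zero_bsize("/mnt/snap-client-lsnap")` before allowing tools such as `du` to walk the snapshot.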
      

      Call chain

      du -sh /mnt/snap/testdir
        │
      vfs_statfs()
        │
        └─► ll_statfs
              │
              ├─[A] ll_statfs_internal
              │       │
              │       └─ statfs RPC to MDT
              │              │
              │              ▼  server side
              │       mdt_statfs() → lod_statfs() 
              │              │
              │              ├─ lod_foreach_mdt loop   ←  BUG
              │              │    ├─ osp_statfs() → *sfs = d->opd_statfs (\{0})
              │              │    │      rc=0, os_bsize=0 returned
              │              │    ├─ if (rc) continue;         ← NO os_bsize check!
              │              │    └─ lod_statfs_sum(sfs, &ost_sfs=\{os_bsize=0}, &bs)
              │              │         while (0 < bs=4096):
              │              │           *bs >>= 1             4096→2048→...→0
              │              │           sfs->os_bsize >>= 1   4096→2048→...→0  ← CORRUPTED
              │              │         loop exits: sfs->os_bsize = 0
              │              │
              │              └─ lod_foreach_ost loop
              │                   └─ if (rc || ost_sfs.os_bsize == 0) continue;  ← PROTECTED
              │
              └─[C] ll_statfs_project
                      │
                      ├─ quotactl → dqb_bhardlimit = 1G (> 0)
                      │
                      └─ limit = (dqb_bhardlimit * 1024) / sfs->f_bsize
                                                         ^^^^^^^^^^^^^^
                                                         0 → CPU #DE → KERNEL PANIC
      

      BUG POINT 1 — Uninitialized os_bsize silently corrupts sfs->os_bsize

      When a sub-MDT OSP returns @rc=0@ but with an uninitialized @opd_statfs@ cache (@os_bsize=0@), the existing guard @if (rc) continue;@ does not catch it. @lod_statfs_sum()@ then enters a right-shift loop that drives both @bs@ and @sfs->os_bsize@ down to zero. Unlike @lod_foreach_ost@, which explicitly checks @ost_sfs.os_bsize == 0@ before proceeding, @lod_foreach_mdt@ has no such protection, so the corruption is silent and no error is returned to the caller.

              │              ├─ lod_foreach_mdt loop   ←  BUG
              │              │    ├─ osp_statfs() → *sfs = d->opd_statfs (\{0})
              │              │    │      rc=0, os_bsize=0 returned
              │              │    ├─ if (rc) continue;         ← NO os_bsize check!
              │              │    └─ lod_statfs_sum(sfs, &ost_sfs=\{os_bsize=0}, &bs)
              │              │         while (0 < bs=4096):
              │              │           *bs >>= 1             4096→2048→...→0
              │              │           sfs->os_bsize >>= 1   4096→2048→...→0  ← CORRUPTED
              │              │         loop exits: sfs->os_bsize = 0
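
The shift loop traced above can be modelled in userspace. This is a simplified sketch under stated assumptions: `model_statfs` and `model_statfs_sum()` are hypothetical stand-ins that keep only a block size and a block count, and the loop shape follows the trace in this report rather than the exact Lustre source.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the statfs data; only two fields are modelled. */
struct model_statfs {
	uint64_t os_blocks;
	uint32_t os_bsize;
};

/*
 * Model of the normalization step: the running sum and the target are
 * rescaled to the smaller block size, then the block counts are added.
 * A target block size of 0 keeps the first loop running until *bs and
 * sum->os_bsize have both been shifted down to 0.
 */
static void model_statfs_sum(struct model_statfs *sum,
			     struct model_statfs *tgt, uint32_t *bs)
{
	while (tgt->os_bsize < *bs) {
		*bs >>= 1;
		sum->os_bsize >>= 1;
		sum->os_blocks <<= 1;
	}
	while (tgt->os_bsize > *bs) {
		tgt->os_bsize >>= 1;
		tgt->os_blocks <<= 1;
	}
	sum->os_blocks += tgt->os_blocks;
}
```

Starting from `bs = 4096` and a sum with `os_bsize = 4096`, feeding a zero-initialized target leaves both `bs` and `sum.os_bsize` at 0 after 13 halvings, whereas a target with `os_bsize = 1024` simply rescales the sum to 1024-byte blocks.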
      

      BUG POINT 2 — Corrupted f_bsize=0 reaches division in ll_statfs_project()

      The corrupted @os_bsize=0@ is packed into the statfs RPC reply and delivered to the client with @OS_STATFS_SUM@ set. The client accepts the SUM result without any validation. When project quota is active and @dqb_bhardlimit > 0@, @ll_statfs_project()@ unconditionally divides by @sfs->f_bsize@ — now zero — triggering a CPU #DE (Divide Error) exception and an immediate kernel panic.

              └─[C] ll_statfs_project
                      │
                      ├─ quotactl → dqb_bhardlimit = 1G (> 0)
                      │
                      └─ limit = (dqb_bhardlimit * 1024) / sfs->f_bsize
                                                         ^^^^^^^^^^^^^^
                                                         0 → CPU #DE → KERNEL PANIC
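
The arithmetic at the fault can be worked through in userspace. The sketch below is illustrative only: `quota_limit_blocks()` and `FALLBACK_BSIZE` are hypothetical names, and the limit is assumed to be expressed in KiB, as the `* 1024` in the formula suggests. A 1 GiB hard limit is 1048576 KiB, so the dividend is 1048576 * 1024 = 0x40000000, which is consistent with `RAX: 0000000040000000` and `RSI: 0000000000000000` in the register dump if the faulting instruction divides RAX by RSI.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: mirrors the 4096-byte fallback value mentioned in this report. */
#define FALLBACK_BSIZE 4096u

/*
 * Convert a quota limit given in KiB into filesystem blocks.
 * A zero block size is replaced with FALLBACK_BSIZE so the division
 * can never trap.
 */
static uint64_t quota_limit_blocks(uint64_t limit_kb, uint64_t f_bsize)
{
	if (f_bsize == 0)
		f_bsize = FALLBACK_BSIZE;
	return (limit_kb * 1024) / f_bsize;
}
```

With a healthy 4096-byte block size the 1 GiB limit becomes 262144 blocks, and the guarded version returns the same value when it is handed `f_bsize = 0`.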
      

      Proposed Fix

      Server-side fix (main)

      When a sub-MDT with @os_bsize=0@ is skipped, its @os_files@ and @os_ffree@ are skipped as well, so that sub-MDT's inode count is temporarily absent from @df@ output. Once the statfs RPC reply arrives and the cache is populated, the next @lod_statfs()@ call includes it normally. This is a safe and acceptable trade-off compared with a kernel panic.

      lod_foreach_mdt(lod, tgt) {
         rc = dt_statfs(env, tgt->ltd_tgt, &ost_sfs);
         /* ignore errors */
      -   if (rc)
      +   /* also skip targets reporting os_bsize == 0 to prevent a
      +    * divide-by-zero kernel panic on the client
      +    */
      +   if (rc || ost_sfs.os_bsize == 0)
             continue;
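
The trade-off of the guard can be illustrated with a small userspace model. Everything here is a sketch under assumptions: `mdt_result` and `model_sum_mdts()` are hypothetical names, and the loop only models the inode counters that the description says are skipped together with the zero block size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-target result: return code plus the fields of interest. */
struct mdt_result {
	int rc;
	uint64_t os_files;
	uint64_t os_ffree;
	uint32_t os_bsize;
};

/*
 * Add the inode counters of every usable sub-MDT result to the totals.
 * Results with an error or with os_bsize == 0 (uninitialized cache) are
 * skipped. Returns the number of skipped targets.
 */
static size_t model_sum_mdts(uint64_t *files, uint64_t *ffree,
			     const struct mdt_result *tgts, size_t n)
{
	size_t skipped = 0;

	for (size_t i = 0; i < n; i++) {
		if (tgts[i].rc || tgts[i].os_bsize == 0) {
			skipped++;
			continue;
		}
		*files += tgts[i].os_files;
		*ffree += tgts[i].os_ffree;
	}
	return skipped;
}
```

A skipped target contributes nothing to the totals for that call, which matches the temporarily missing inode count described above; once its cache is populated it is summed like any other target.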
      

      Client-side fix (additional defensive logic)

      static int ll_statfs_project(struct inode *inode, struct kstatfs *sfs)
      ...
      
      
                if (ret) {
                    /* ignore errors if project ID does not have
                    * a quota limit or feature unsupported.
                    */
                    if (ret == -ESRCH || ret == -EOPNOTSUPP)
                        ret = 0;
                    return ret;
                }
      
      
      +        /* f_bsize=0: the server's lod_foreach_mdt loop passed an
      +         * uninitialized sub-MDT opd_statfs (os_bsize=0) into
      +         * lod_statfs_sum(), corrupting os_bsize. Fall back to
      +         * 1 << COMPAT_BSIZE_SHIFT, the value mdt_statfs() sends in
      +         * steady state for non-GRANT_PARAM connections.
      +         */
      +        if (unlikely(sfs->f_bsize == 0)) {
      +                CWARN("%s: f_bsize=0 in ll_statfs_project, "
      +                      "sub-MDT opd_statfs uninitialized (lod_foreach_mdt bug), "
      +                      "corrected to %u\n",
      +                      ll_i2sbi(inode)->ll_fsname, 1U << COMPAT_BSIZE_SHIFT);
      +
      +                sfs->f_bsize = 1 << COMPAT_BSIZE_SHIFT;
      +        }
      
      
                limit = ((qctl.qc_dqblk.dqb_bsoftlimit ?
                qctl.qc_dqblk.dqb_bsoftlimit :
                qctl.qc_dqblk.dqb_bhardlimit) * 1024) / sfs->f_bsize;
      ...
      

      Immediate Workaround - fix_statfs.ko

      A loadable kernel module that places a kprobe on the entry of @ll_statfs_project()@ and replaces @f_bsize=0@ with @PAGE_SIZE@ before the division executes. This is a symptom-level workaround only. The correct fallback value is @1 << COMPAT_BSIZE_SHIFT@ (= 4096), defined in @lustre/include/lu_target.h@. However, the @fix_statfs@ module is built against standard kernel headers only, so Lustre internal headers are not available to it. @PAGE_SIZE@ was used as a practical proxy because it also equals 4096 on x86_64. The additional defensive logic in @llite_lib.c@ uses @1 << COMPAT_BSIZE_SHIFT@ directly.

      Result of workaround

      insmod fix_statfs.ko
      
      lsmod | grep fix
      
      
      fix_statfs             16384  0
      
      ./reproduction.sh
      
      mounted the snapshot repro_snap1 with fsname 7eb5b902
      === sub-MDT OSP blocksize (must be 0 to trigger crash) ===
      osp.7eb5b902-MDT0001-osp-MDT0000.blocksize=0
      osp.7eb5b902-MDT0002-osp-MDT0000.blocksize=0
      >>> du -sh /mnt/snap-client-lsnap/testdir
      1.1M    /mnt/snap-client-lsnap/testdir
      

      Attachments

        1. kexec-dmesg.log (66 kB, Sunghwan Kim)
        2. reproduction.sh (3 kB, Sunghwan Kim)
        3. vmcore (128.07 MB, Sunghwan Kim)
        4. vmcore-dmesg.txt (68 kB, Sunghwan Kim)

          People

            wc-triage WC Triage
            shkim3220 Sunghwan Kim