Details
-
Bug
-
Resolution: Unresolved
-
Medium
-
None
-
Lustre 2.17.0, Lustre 2.15.6
-
None
-
Rocky 8.10, ZFS backend
-
3
Description
Summary
This issue was first identified at a customer site. The environment used an HA configuration, and running the Linux @du@ command on one of the nodes triggered a kernel panic, which subsequently led to a failover.
When a Lustre snapshot MDT is mounted in a DNE (Distributed Namespace Extension) configuration, @lod_statfs()@ iterates sub-MDT OSPs via @lod_foreach_mdt@. If a sub-MDT OSP returns @rc=0, os_bsize=0@ (uninitialized @opd_statfs@ cache), @lod_statfs_sum()@ silently right-shifts @sfs->os_bsize@ down to zero.
The MDT then sends this corrupted value to the client with @OS_STATFS_SUM@ set. The client trusts the SUM result without further validation, and @ll_statfs_project()@ subsequently divides by @sfs->f_bsize=0@, triggering a CPU #DE (Divide Error) exception and kernel panic.
The panic was reproduced in a Proxmox virtual environment.
Environment
Lustre version : 2.15.6 / 2.17.0 (2.17.0 used for this reproduction)
Kernel : 4.18.0-553.5.1
Backend FS : ZFS
Topology : MGS, MDT * 3, OST * 5, Project quota enabled
Reproduction Conditions
All five conditions must hold simultaneously for the panic to occur:
1. Snapshot client mounted : Entry point for the code path
2. Snapshot MDT has sub-MDTs (DNE configuration) : @lod_foreach_mdt@ loop actually iterates
3. Sub-MDT OSP @opd_statfs@ cache uninitialized (@os_bsize=0@) : Zero is fed into @lod_statfs_sum()@
4. Project quota enabled : @ll_statfs_project()@ is called
5. Project block hard limit > 0 on the target directory : Division is executed
Steps to Reproduce
For detailed reproduction steps, please refer to the attached reproduction.sh script.
Step 1. Set project quota on target directory (e.g. @-p 1000 --block-hardlimit 1G@)
Step 2. Create a snapshot with @lctl snapshot@
Step 3. Mount snapshot targets
Step 4. Extract snapshot fsname and wait for MDT recovery
Step 5. Mount snapshot client (read-only)
Step 6. Trigger the bug
dmesg output
[ 304.329137] repro_lsnapshot: [DU] running du -sh /mnt/snap-client-lsnap/testdir
[ 304.335103] divide error: 0000 [#1] SMP PTI
[ 304.336510] CPU: 6 PID: 18020 Comm: du Kdump: loaded Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1
[ 304.336997] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[ 304.337342] RIP: 0010:ll_statfs_project+0x7c9/0xab0 [lustre]
[ 304.337840] Code: fd ff 89 c3 85 c0 0f 85 29 01 00 00 48 8b 44 24 68 48 85 c0 75 05 48 8b 44 24 60 49 8b 75 08 48 c1 e0 0a 48 39 c6 77 3b 31 d2 <48> f7 f6 48 89 c1 49 39 45 10 76 2d 48 8b 44 24 70 31 d2 49 89 4d
[ 304.338546] RSP: 0018:ffffb850d0eb7c60 EFLAGS: 00010246
[ 304.338901] RAX: 0000000040000000 RBX: 0000000000000000 RCX: 0000000000006611
[ 304.339211] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000380e0
[ 304.339508] RBP: ffff95b4f40a6800 R08: 0000000080000000 R09: 0000000000000200
[ 304.339841] R10: 0000000069f2b7a6 R11: ffff95b4dad0f3f8 R12: ffff95b4cf7ec480
[ 304.340141] R13: ffffb850d0eb7eb0 R14: ffff95b4cf7ec510 R15: ffff95b4cf7ec488
[ 304.340429] FS: 00007f5a4d4a8540(0000) GS:ffff95b577d80000(0000) knlGS:0000000000000000
[ 304.340780] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 304.341075] CR2: 00007f5a4d3a6000 CR3: 0000000177654001 CR4: 0000000000170ee0
[ 304.341360] Call Trace:
[ 304.341774] ? __die_body+0x1a/0x60
[ 304.342265] ? die+0x2a/0x50
[ 304.342740] ? do_trap+0xe7/0x110
[ 304.343029] ? ll_statfs_project+0x7c9/0xab0 [lustre]
[ 304.343356] ? do_divide_error+0x33/0x40
[ 304.343658] ? ll_statfs_project+0x7c9/0xab0 [lustre]
[ 304.343947] ? divide_error+0x14/0x20
[ 304.344218] ? ll_statfs_project+0x7c9/0xab0 [lustre]
[ 304.344543] ll_statfs+0x1e0/0x1f0 [lustre]
[ 304.344830] statfs_by_dentry+0x67/0x90
[ 304.346448] vfs_statfs+0x16/0xc0
[ 304.346939] fd_statfs+0x2d/0x70
[ 304.347210] __do_sys_fstatfs+0x20/0x60
[ 304.347472] do_syscall_64+0x5b/0x1b0
[ 304.347719] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 304.347961] RIP: 0033:0x7f5a4cee2cab
[ 304.348192] Code: 73 01 c3 48 8b 0d dd 81 39 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 8a 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ad 81 39 00 f7 d8 64 89 01 48
[ 304.348697] RSP: 002b:00007fff543fbd68 EFLAGS: 00000206 ORIG_RAX: 000000000000008a
[ 304.348939] RAX: ffffffffffffffda RBX: 0000558db5151600 RCX: 00007f5a4cee2cab
[ 304.349170] RDX: 000000000000000f RSI: 00007fff543fbd80 RDI: 0000000000000003
[ 304.349399] RBP: 0000558db51530d0 R08: 0000558db5153130 R09: 0000558db5150030
[ 304.349648] R10: 0000000000000000 R11: 0000000000000206 R12: 0000558db51528c0
[ 304.349880] R13: 0000000000000003 R14: 0000558db5151690 R15: 0000558db5151600
[ 304.350104] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc(OE) ko2iblnd(OE) obdclass(OE) lnet(OE) libcfs(OE) bonding rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_umad dm_multipath dm_mod rdma_cm ib_ipoib iw_cm ib_cm intel_rapl_msr intel_rapl_common sb_edac kvm_intel kvm irqbypass crc32_pclmul rapl pcspkr mlx5_ib ib_uverbs bochs drm_vram_helper ib_core drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt drm joydev i2c_piix4 xfs libcrc32c sr_mod cdrom sd_mod sg ata_generic crct10dif_pclmul mlx5_core crc32c_intel serio_raw ghash_clmulni_intel nvme ata_piix virtio_net mlxfw libata pci_hyperv_intf nvme_core tls net_failover failover psample virtio_scsi t10_pi zfs(POE) spl(OE)
lctl check mdts output
360a4637-MDT0000-mdc-ffff8a8ff369f000 active.
360a4637-MDT0001-mdc-ffff8a8ff369f000 active.
360a4637-MDT0002-mdc-ffff8a8ff369f000 active.
ClstVol-MDT0000-mdc-ffff8a8f771e1800 active.
ClstVol-MDT0001-mdc-ffff8a8f771e1800 active.
ClstVol-MDT0002-mdc-ffff8a8f771e1800 active.
lctl get_param output for Normal MDT
lctl get_param osp.ClstVol-MDT0001-osp-MDT0000.blocksize \
> osp.ClstVol-MDT0001-osp-MDT0000.kbytestotal \
> osp.ClstVol-MDT0001-osp-MDT0000.kbytesavail
osp.ClstVol-MDT0001-osp-MDT0000.blocksize=4096
osp.ClstVol-MDT0001-osp-MDT0000.kbytestotal=199557632
osp.ClstVol-MDT0001-osp-MDT0000.kbytesavail=199547776
lctl get_param output for Snapshot MDT
lctl get_param osp.360a4637-MDT0001-osp-MDT0000.blocksize \
> osp.360a4637-MDT0001-osp-MDT0000.kbytestotal \
> osp.360a4637-MDT0001-osp-MDT0000.kbytesavail
osp.360a4637-MDT0001-osp-MDT0000.blocksize=0
osp.360a4637-MDT0001-osp-MDT0000.kbytestotal=0
osp.360a4637-MDT0001-osp-MDT0000.kbytesavail=0
stat check on normal mount path
stat --file-system /lustre/agent/ClstVol
  File: "/lustre/agent/ClstVol"
    ID: 3d67df9200000000 Namelen: 255 Type: lustre
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 363066880 Free: 363046144 Available: 363043584
Inodes: Total: 45381809 Free: 45380768
stat check on snapshot path
stat --file-system /mnt/snap-client-lsnap
  File: "/mnt/snap-client-lsnap"
    ID: 566b393000000000 Namelen: 255 Type: lustre
Block size: 0 Fundamental block size: 0
Blocks: Total: 0 Free: 0 Available: 0
Inodes: Total: 36441549 Free: 36441227
Call chain
du -sh /mnt/snap/testdir
 │
vfs_statfs()
 │
 └─► ll_statfs
      │
      ├─[A] ll_statfs_internal
      │   │
      │   └─ statfs RPC to MDT
      │       │
      │       ▼ server side
      │     mdt_statfs() → lod_statfs()
      │       │
      │       ├─ lod_foreach_mdt loop                        ← BUG
      │       │   ├─ osp_statfs() → *sfs = d->opd_statfs ({0})
      │       │   │     rc=0, os_bsize=0 returned
      │       │   ├─ if (rc) continue;                       ← NO os_bsize check!
      │       │   └─ lod_statfs_sum(sfs, &ost_sfs={os_bsize=0}, &bs)
      │       │        while (0 < bs=4096):
      │       │          *bs >>= 1            4096→2048→...→0
      │       │          sfs->os_bsize >>= 1  4096→2048→...→0  ← CORRUPTED
      │       │        loop exits: sfs->os_bsize = 0
      │       │
      │       ├─ lod_foreach_ost loop
      │       └─ if (rc || ost_sfs.os_bsize == 0) continue;  ← PROTECTED
      │
      └─[C] ll_statfs_project
          │
          ├─ quotactl → dqb_bhardlimit = 1G (> 0)
          │
          └─ limit = (dqb_bhardlimit * 1024) / sfs->f_bsize
                                               ^^^^^^^^^^^^
                                               0 → CPU #DE → KERNEL PANIC
BUG POINT 1 — Uninitialized os_bsize silently corrupts sfs->os_bsize
When a sub-MDT OSP returns @rc=0@ but with an uninitialized @opd_statfs@ cache (@os_bsize=0@), the existing guard @if (rc) continue;@ passes without catching it. @lod_statfs_sum()@ then enters a right-shift loop that drives both @bs@ and @sfs->os_bsize@ down to zero. Unlike @lod_foreach_ost@, which explicitly checks @ost_sfs.os_bsize == 0@ before proceeding, @lod_foreach_mdt@ has no such protection — making this a silent corruption with no error returned to the caller.
      │       ├─ lod_foreach_mdt loop                        ← BUG
      │       │   ├─ osp_statfs() → *sfs = d->opd_statfs ({0})
      │       │   │     rc=0, os_bsize=0 returned
      │       │   ├─ if (rc) continue;                       ← NO os_bsize check!
      │       │   └─ lod_statfs_sum(sfs, &ost_sfs={os_bsize=0}, &bs)
      │       │        while (0 < bs=4096):
      │       │          *bs >>= 1            4096→2048→...→0
      │       │          sfs->os_bsize >>= 1  4096→2048→...→0  ← CORRUPTED
      │       │        loop exits: sfs->os_bsize = 0
BUG POINT 2 — Corrupted f_bsize=0 reaches division in ll_statfs_project()
The corrupted @os_bsize=0@ is packed into the statfs RPC reply and delivered to the client with @OS_STATFS_SUM@ set. The client accepts the SUM result without any validation. When project quota is active and @dqb_bhardlimit > 0@, @ll_statfs_project()@ unconditionally divides by @sfs->f_bsize@ — now zero — triggering a CPU #DE (Divide Error) exception and an immediate kernel panic.
      └─[C] ll_statfs_project
          │
          ├─ quotactl → dqb_bhardlimit = 1G (> 0)
          │
          └─ limit = (dqb_bhardlimit * 1024) / sfs->f_bsize
                                               ^^^^^^^^^^^^
                                               0 → CPU #DE → KERNEL PANIC
Proposed Fix
server side fix (main)
When a sub-MDT with @os_bsize=0@ is skipped, its @os_files@ and @os_ffree@ are skipped as well, so that sub-MDT's inode count is temporarily absent from @df@ output. Once the RPC reply arrives, the next @lod_statfs()@ call includes it normally. This is a safe and acceptable trade-off compared with a kernel panic.
 lod_foreach_mdt(lod, tgt) {
 	rc = dt_statfs(env, tgt->ltd_tgt, &ost_sfs);
 	/* ignore errors */
-	if (rc)
+	/* add zero-check for os_bsize to prevent division-by-zero kernel panic */
+	if (rc || ost_sfs.os_bsize == 0)
 		continue;
client side fix (additional defensive logic)
 static int ll_statfs_project(struct inode *inode, struct kstatfs *sfs)
 ...
 	if (ret) {
 		/* ignore errors if project ID does not have
 		 * a quota limit or feature unsupported.
 		 */
 		if (ret == -ESRCH || ret == -EOPNOTSUPP)
 			ret = 0;
 		return ret;
 	}
+	/* f_bsize=0: server lod_foreach_mdt passed uninitialized sub-MDT
+	 * opd_statfs (os_bsize=0) into lod_statfs_sum(), corrupting os_bsize.
+	 * Fall back to COMPAT_BSIZE_SHIFT — the value mdt_statfs() sends in
+	 * steady state for non-GRANT_PARAM connections.
+	 */
+	if (unlikely(sfs->f_bsize == 0)) {
+		CWARN("%s: f_bsize=0 in ll_statfs_project, "
+		      "sub-MDT opd_statfs uninitialized (lod_foreach_mdt bug), "
+		      "corrected to %u\n",
+		      ll_i2sbi(inode)->ll_fsname, 1 << COMPAT_BSIZE_SHIFT);
+
+		sfs->f_bsize = 1 << COMPAT_BSIZE_SHIFT;
+	}
 	limit = ((qctl.qc_dqblk.dqb_bsoftlimit ?
 		  qctl.qc_dqblk.dqb_bsoftlimit :
 		  qctl.qc_dqblk.dqb_bhardlimit) * 1024) / sfs->f_bsize;
 ...
Immediate Workaround - fix_statfs.ko
A loadable kernel module that uses a kprobe to intercept entry to @ll_statfs_project()@ and replace @f_bsize=0@ with @PAGE_SIZE@ before the division executes. This is a symptom-level workaround. The correct fallback value is @1 << COMPAT_BSIZE_SHIFT@ (= 4096), defined in lustre/include/lu_target.h. However, the @fix_statfs@ module is built against standard kernel headers only, so Lustre internal headers are not available to it. @PAGE_SIZE@ was used as a practical proxy that happens to equal 4096 on x86_64. The additional defensive logic in @llite_lib.c@ uses @1 << COMPAT_BSIZE_SHIFT@ directly.
result of workaround
insmod fix_statfs.ko
lsmod | grep fix
fix_statfs 16384 0
./reproduction.sh
mounted the snapshot repro_snap1 with fsname 7eb5b902
=== sub-MDT OSP blocksize (must be 0 to trigger crash) ===
osp.7eb5b902-MDT0001-osp-MDT0000.blocksize=0
osp.7eb5b902-MDT0002-osp-MDT0000.blocksize=0
>>> du -sh /mnt/snap-client-lsnap/testdir
1.1M /mnt/snap-client-lsnap/testdir