Lustre / LU-20053

buffer_head leak on inode deletion

Details

    • Bug
    • Resolution: Unresolved
    • Critical
    • None
    • None
    • Rocky 9 (kernel 5.14.0-611.34.1.el9_7)
    • 3
    • 9223372036854775807

    Description

      Deleting files on an ldiskfs-backed Lustre filesystem permanently leaks buffer_head slab objects. The leaked references survive unmount, module unload, and drop_caches; only a reboot frees these objects.

      The leak is proportional to the number of inodes deleted and only occurs when the ea_inode filesystem feature is enabled (the default).

      Although the reproduction below shows only a small leak, we have confirmed that it grows to tens of GiB in a workload that creates and deletes hundreds of millions of files with mdtest.

      Reproduction

      Single-node Lustre on Rocky 9 (kernel 5.14.0-611.34.1.el9_7), 2 MDTs + 2 OSTs on NVMe, default mkfs options. We created 50K empty files under /mnt/lustre/, then deleted them, recording the buffer_head count from /proc/slabinfo at each stage and running drop_caches between steps. The full reproduction script is attached as test-bh-leak.sh.

       

      Step                          buffer_head count
      baseline (before Lustre)      1,287
      after mount                   3,483
      after create 50K files        10,534
      after delete + drop_caches    21,916
      after umount + drop_caches    14,311
      after rmmod + drop_caches     14,258
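
The counts above come from /proc/slabinfo. As a minimal sketch of how such a sample could be taken programmatically (this is not the attached test-bh-leak.sh), the C helpers below pull one cache's count out of slabinfo-format text. The function names slab_active_objs and slab_count_from_file are hypothetical, and the sketch assumes the figure of interest is the first numeric column (active_objs); the report does not say which column was recorded.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the first numeric column (active_objs) for cache `name`
 * from a single slabinfo-format line, or -1 if the line is for a
 * different cache or cannot be parsed.
 */
long slab_active_objs(const char *line, const char *name)
{
    size_t n = strlen(name);
    long active = -1;

    /* The cache name must match exactly and be followed by whitespace,
     * so "buffer_head" does not match a longer name with that prefix. */
    if (strncmp(line, name, n) != 0 || (line[n] != ' ' && line[n] != '\t'))
        return -1;
    if (sscanf(line + n, "%ld", &active) != 1)
        return -1;
    return active;
}

/*
 * Scan a slabinfo-format file (normally /proc/slabinfo, which usually
 * requires root to read) and return the count for `name`, or -1 if the
 * file cannot be opened or the cache is not listed.
 */
long slab_count_from_file(const char *path, const char *name)
{
    char line[512];
    long result = -1;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        long v = slab_active_objs(line, name);
        if (v >= 0) {
            result = v;
            break;
        }
    }
    fclose(f);
    return result;
}
```

Calling slab_count_from_file("/proc/slabinfo", "buffer_head") after each step (mount, create, delete, umount, rmmod), with drop_caches in between, would give a series comparable to the table.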

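      A controlled experiment shows that ea_inode is involved. We ran the same test on the same build after reformatting with mkfs.lustre --mkfsoptions="-O ^ea_inode". With ea_inode disabled, the buffer_head count falls back to near the baseline after unmount; with it enabled, the objects remain allocated.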

      Step                          ea_inode ON    ea_inode OFF
      after delete + drop_caches    21,916         22,097
      after umount + drop_caches    14,311         2,576
      after rmmod + drop_caches     14,258         2,652

      Root Cause

      This is a bug in ext4, which ldiskfs is built from.

      ext4_xattr_inode_dec_ref_all() computes the end boundary of the inode xattr space differently depending on whether it is processing block xattrs or inline (ibody) xattrs. For the inline path (block_csum == false), it calls ext4_get_inode_loc() to obtain the raw inode and derive the boundary, but never calls brelse() on the resulting iloc.bh. Every call through this path bumps the buffer_head refcount by one and never releases it.
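
The reference-count pattern described above can be modelled in userspace. The sketch below is not the kernel source: struct mock_bh, mock_get_inode_loc(), mock_brelse(), and the two dec_ref_all_* functions are hypothetical stand-ins that model only the refcount behaviour, assuming the fix is simply to release the buffer once the boundary is no longer needed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer_head: only the refcount is modelled. */
struct mock_bh {
    int refcount;
    char data[256];
};

/* Stand-in for ext4_get_inode_loc(): takes a reference on the buffer. */
static char *mock_get_inode_loc(struct mock_bh *bh)
{
    bh->refcount++;
    return bh->data;
}

/* Stand-in for brelse(): drops one reference. */
static void mock_brelse(struct mock_bh *bh)
{
    if (bh->refcount > 0)
        bh->refcount--;
}

/* Leaky shape: the inline path takes a reference to compute `end`
 * and returns without ever dropping it. */
void dec_ref_all_leaky(struct mock_bh *inode_bh, bool block_csum)
{
    const char *end = NULL;

    if (!block_csum)
        end = mock_get_inode_loc(inode_bh) + sizeof(inode_bh->data);
    (void)end; /* the xattr entries would be walked up to `end` here */
}

/* Fixed shape (assumed): release the reference once `end` is unused. */
void dec_ref_all_fixed(struct mock_bh *inode_bh, bool block_csum)
{
    const char *end = NULL;
    bool have_ref = false;

    if (!block_csum) {
        end = mock_get_inode_loc(inode_bh) + sizeof(inode_bh->data);
        have_ref = true;
    }
    (void)end; /* the xattr entries would be walked up to `end` here */
    if (have_ref)
        mock_brelse(inode_bh);
}

/* Run `fn` once per simulated inode deletion on the inline path and
 * report how many references are still held afterwards. */
int refs_after_deletes(void (*fn)(struct mock_bh *, bool), int n)
{
    struct mock_bh bh = { 0 };

    for (int i = 0; i < n; i++)
        fn(&bh, false);
    return bh.refcount;
}
```

In this model the leaky variant leaves one reference behind per deletion, which matches the observation that the leak scales with the number of inodes deleted, while the fixed variant leaves none.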

      The bug was introduced in kernel v6.14.3, and the change that introduced it has also been backported to RHEL 8, RHEL 9, and other distribution kernels.

      ref. https://elixir.bootlin.com/linux/v6.14.3/source/fs/ext4/xattr.c 

       

            People

              skoyama Sohei Koyama