Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-17696

Kernel 6.8 client LBUG: (dcache.c:136:ll_ddelete()) ASSERTION( d_count(de) == 1 ) failed

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.16.0, Lustre 2.15.6
    • Lustre 2.16.0
    • None
    • 3
    • 9223372036854775807

    Description

      After applying patch series https://review.whamcloud.com/54229 on master branch, Lustre client build passed with kernel 6.8.0-060800-generic. However, after mounting the filesystem, touching a file hit the following LBUG:

      LustreError: 12066:0:(dcache.c:136:ll_ddelete()) ASSERTION( d_count(de) == 1 ) failed:
      LustreError: 12066:0:(dcache.c:136:ll_ddelete()) LBUG
      CPU: 0 PID: 12066 Comm: lfs Tainted: G           OE      6.8.0-060800-generic #202403131158
      Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      Call Trace:
       <TASK>
       dump_stack_lvl+0x48/0x70
       dump_stack+0x10/0x20
       lbug_with_loc+0x3e/0x70 [libcfs]
       ll_ddelete+0x278/0x280 [lustre]
       dput+0x174/0x1b0
       step_into+0x2f1/0x390
       walk_component+0x51/0x190
       path_lookupat+0x6a/0x1b0
       filename_lookup+0xe4/0x200
       vfs_statx+0x95/0x1d0
       do_statx+0x64/0xb0
       __x64_sys_statx+0x67/0x90
       do_syscall_64+0x85/0x180
       ? irqentry_exit+0x43/0x50
       ? exc_page_fault+0x94/0x1b0
       entry_SYSCALL_64_after_hwframe+0x6e/0x76
      RIP: 0033:0x7eeaf551ac6e
      Code: 31 0e 00 ba ff ff ff ff 64 c7 00 16 00 00 00 e9 a5 fd ff ff e8 d3 e2 01 00 0f 1f 00 f3 0f 1e fa 41 89 ca b8 4c 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2a 89 c1 85 c0 74 0f 48 8b 05 75 31 0e 00 64 
      RSP: 002b:00007ffdade73a08 EFLAGS: 00000202 ORIG_RAX: 000000000000014c
      RAX: ffffffffffffffda RBX: 00007ffdade73b38 RCX: 00007eeaf551ac6e
      RDX: 0000000000000000 RSI: 00007ffdade7a1d8 RDI: 00000000ffffff9c
      RBP: 00007ffdade7a1d8 R08: 00007ffdade73a10 R09: 0000000000000001
      R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffdade74b90
      R13: 00007ffdade73b38 R14: 0000000000000000 R15: 0000000000000001
       </TASK>
      Kernel panic - not syncing: LBUG
      CPU: 0 PID: 12066 Comm: lfs Tainted: G           OE      6.8.0-060800-generic #202403131158
      Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      Call Trace:
       <TASK>
       dump_stack_lvl+0x48/0x70
       dump_stack+0x10/0x20
       panic+0x35f/0x3c0
       lbug_with_loc+0x70/0x70 [libcfs]
       ll_ddelete+0x278/0x280 [lustre]
       dput+0x174/0x1b0
       step_into+0x2f1/0x390
       walk_component+0x51/0x190
       path_lookupat+0x6a/0x1b0
       filename_lookup+0xe4/0x200
       vfs_statx+0x95/0x1d0
       do_statx+0x64/0xb0
       __x64_sys_statx+0x67/0x90
       do_syscall_64+0x85/0x180
       ? irqentry_exit+0x43/0x50
       ? exc_page_fault+0x94/0x1b0
       entry_SYSCALL_64_after_hwframe+0x6e/0x76
      RIP: 0033:0x7eeaf551ac6e
      Code: 31 0e 00 ba ff ff ff ff 64 c7 00 16 00 00 00 e9 a5 fd ff ff e8 d3 e2 01 00 0f 1f 00 f3 0f 1e fa 41 89 ca b8 4c 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2a 89 c1 85 c0 74 0f 48 8b 05 75 31 0e 00 64
      RSP: 002b:00007ffdade73a08 EFLAGS: 00000202 ORIG_RAX: 000000000000014c
      RAX: ffffffffffffffda RBX: 00007ffdade73b38 RCX: 00007eeaf551ac6e
      RDX: 0000000000000000 RSI: 00007ffdade7a1d8 RDI: 00000000ffffff9c
      RBP: 00007ffdade7a1d8 R08: 00007ffdade73a10 R09: 0000000000000001
      R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffdade74b90
      R13: 00007ffdade73b38 R14: 0000000000000000 R15: 0000000000000001
       </TASK>
      Kernel Offset: 0x2a200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      ---[ end Kernel panic - not syncing: LBUG ]---
      

      Attachments

        Issue Links

          Activity

            [LU-17696] Kernel 6.8 client LBUG: (dcache.c:136:ll_ddelete()) ASSERTION( d_count(de) == 1 ) failed

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56830/
            Subject: LU-17696 llite: remove LASSERT from ll_ddelete()
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 91da450ba33580f85098e9f587e9220fe5dd2012

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56830/ Subject: LU-17696 llite: remove LASSERT from ll_ddelete() Project: fs/lustre-release Branch: b2_15 Current Patch Set: Commit: 91da450ba33580f85098e9f587e9220fe5dd2012

            "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56830
            Subject: LU-17696 llite: remove LASSERT from ll_ddelete()
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 4f9799cd2337a3d1aa77c5e4a5770b46f1845b2f

            gerrit Gerrit Updater added a comment - "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56830 Subject: LU-17696 llite: remove LASSERT from ll_ddelete() Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: 4f9799cd2337a3d1aa77c5e4a5770b46f1845b2f
            pjones Peter Jones added a comment -

            Merged for 2.16

            pjones Peter Jones added a comment - Merged for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/54676/
            Subject: LU-17696 llite: remove LASSERT from ll_ddelete()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 0176629ab3f71e88850ab95796b0e519c4d0f740

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/54676/ Subject: LU-17696 llite: remove LASSERT from ll_ddelete() Project: fs/lustre-release Branch: master Current Patch Set: Commit: 0176629ab3f71e88850ab95796b0e519c4d0f740

            "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54676
            Subject: LU-17696 llite: remove LASSERT from ll_ddelete()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 50bd3822d2977cd45e56521d137aec2ce5829529

            gerrit Gerrit Updater added a comment - "Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54676 Subject: LU-17696 llite: remove LASSERT from ll_ddelete() Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 50bd3822d2977cd45e56521d137aec2ce5829529
            yujian Jian Yu added a comment -
            yujian Jian Yu added a comment - Sure, adegremont_nvda .

            The code use to be:

            #ifdef HAVE_DCACHE_LOCK
                    LASSERT(d_count(de) == 0);
            #else
                    /* kernel >= 2.6.38 last refcount is decreased after this function. */
                    LASSERT(d_count(de) == 1);
            #endif

            Then simply became

                    /* kernel >= 2.6.38 last refcount is decreased after this function. */
                    LASSERT(ll_d_count(de) == 1);

            Based on the kernel commit above. Isn't it just simpler to get rid of the LASSERT?

            adegremont_nvda Aurelien Degremont added a comment - The code use to be: #ifdef HAVE_DCACHE_LOCK         LASSERT(d_count(de) == 0); # else         /* kernel >= 2.6.38 last refcount is decreased after this function. */         LASSERT(d_count(de) == 1); #endif Then simply became         /* kernel >= 2.6.38 last refcount is decreased after this function. */       LASSERT(ll_d_count(de) == 1); Based on the kernel commit above. Isn't it just simpler to get rid of the LASSERT?
            yujian Jian Yu added a comment -

            ll_ddelete() was called from dput()->fast_dput()->retain_dentry(), so it's the following commit made the change:

            commit 2f42f1eb9093834b635991c70d0273fbe249eabf
            Author:     Al Viro <viro@zeniv.linux.org.uk>
            AuthorDate: Mon Oct 30 01:09:50 2023 -0400
            Commit:     Al Viro <viro@zeniv.linux.org.uk>
            CommitDate: Sat Nov 25 02:33:56 2023 -0500
            
                Call retain_dentry() with refcount 0
                
                Instead of bumping it from 0 to 1, calling retain_dentry(), then
                decrementing it back to 0 (with ->d_lock held all the way through),
                just leave refcount at 0 through all of that.
                
                It will have a visible effect for ->d_delete() - now it can be
                called with refcount 0 instead of 1 and it can no longer play
                silly buggers with dropping/regaining ->d_lock.  Not that any
                in-tree instances tried to (it's pretty hard to get right).
                
                Any out-of-tree ones will have to adjust (assuming they need any
                changes).
                
                Note that we do not need to extend rcu-critical area here - we have
                verified that refcount is non-negative after having grabbed ->d_lock,
                so nobody will be able to free dentry until they get into __dentry_kill(),
                which won't happen until they manage to grab ->d_lock.
                
                Reviewed-by: Christian Brauner <brauner@kernel.org>
                Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
            

            I'm creating a patch to add a configure check and macro.

            yujian Jian Yu added a comment - ll_ddelete() was called from dput()->fast_dput()->retain_dentry() , so it's the following commit made the change: commit 2f42f1eb9093834b635991c70d0273fbe249eabf Author: Al Viro <viro@zeniv.linux.org.uk> AuthorDate: Mon Oct 30 01:09:50 2023 -0400 Commit: Al Viro <viro@zeniv.linux.org.uk> CommitDate: Sat Nov 25 02:33:56 2023 -0500 Call retain_dentry() with refcount 0 Instead of bumping it from 0 to 1, calling retain_dentry(), then decrementing it back to 0 (with ->d_lock held all the way through), just leave refcount at 0 through all of that. It will have a visible effect for ->d_delete() - now it can be called with refcount 0 instead of 1 and it can no longer play silly buggers with dropping/regaining ->d_lock. Not that any in-tree instances tried to (it's pretty hard to get right). Any out-of-tree ones will have to adjust (assuming they need any changes). Note that we do not need to extend rcu-critical area here - we have verified that refcount is non-negative after having grabbed ->d_lock, so nobody will be able to free dentry until they get into __dentry_kill(), which won't happen until they manage to grab ->d_lock. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> I'm creating a patch to add a configure check and macro.
            yujian Jian Yu added a comment -

            There are 40 commits related to dcache codes:

            commit 499aa1ca4eb6602df38afaecb88fc14edf50cdbb
            Merge: bf4e7080aeed 1b6ae9f6e6c3
            Author:     Linus Torvalds <torvalds@linux-foundation.org>
            AuthorDate: Thu Jan 11 20:11:35 2024 -0800
            Commit:     Linus Torvalds <torvalds@linux-foundation.org>
            CommitDate: Thu Jan 11 20:11:35 2024 -0800
            
                Merge tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
                
                Pull dcache updates from Al Viro:
                 "Change of locking rules for __dentry_kill(), regularized refcounting
                  rules in that area, assorted cleanups and removal of weird corner
                  cases (e.g. now ->d_iput() on child is always called before the parent
                  might hit __dentry_kill(), etc)"
                
                * tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
                  dcache: remove unnecessary NULL check in dget_dlock()
                  kill DCACHE_MAY_FREE
                  __d_unalias() doesn't use inode argument
                  d_alloc_parallel(): in-lookup hash insertion doesn't need an RCU variant
                  get rid of DCACHE_GENOCIDE
                  d_genocide(): move the extern into fs/internal.h
                  simple_fill_super(): don't bother with d_genocide() on failure
                  nsfs: use d_make_root()
                  d_alloc_pseudo(): move setting ->d_op there from the (sole) caller
                  kill d_instantate_anon(), fold __d_instantiate_anon() into remaining caller
                  retain_dentry(): introduce a trimmed-down lockless variant
                  __dentry_kill(): new locking scheme
                  d_prune_aliases(): use a shrink list
                  switch select_collect{,2}() to use of to_shrink_list()
                  to_shrink_list(): call only if refcount is 0
                  fold dentry_kill() into dput()
                  don't try to cut corners in shrink_lock_dentry()
                  fold the call of retain_dentry() into fast_dput()
                  Call retain_dentry() with refcount 0
                  dentry_kill(): don't bother with retain_dentry() on slow path
                  ...
            
            yujian Jian Yu added a comment - There are 40 commits related to dcache codes: commit 499aa1ca4eb6602df38afaecb88fc14edf50cdbb Merge: bf4e7080aeed 1b6ae9f6e6c3 Author: Linus Torvalds <torvalds@linux-foundation.org> AuthorDate: Thu Jan 11 20:11:35 2024 -0800 Commit: Linus Torvalds <torvalds@linux-foundation.org> CommitDate: Thu Jan 11 20:11:35 2024 -0800 Merge tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull dcache updates from Al Viro: "Change of locking rules for __dentry_kill(), regularized refcounting rules in that area, assorted cleanups and removal of weird corner cases (e.g. now ->d_iput() on child is always called before the parent might hit __dentry_kill(), etc)" * tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits) dcache: remove unnecessary NULL check in dget_dlock() kill DCACHE_MAY_FREE __d_unalias() doesn't use inode argument d_alloc_parallel(): in-lookup hash insertion doesn't need an RCU variant get rid of DCACHE_GENOCIDE d_genocide(): move the extern into fs/internal.h simple_fill_super(): don't bother with d_genocide() on failure nsfs: use d_make_root() d_alloc_pseudo(): move setting ->d_op there from the (sole) caller kill d_instantate_anon(), fold __d_instantiate_anon() into remaining caller retain_dentry(): introduce a trimmed-down lockless variant __dentry_kill(): new locking scheme d_prune_aliases(): use a shrink list switch select_collect{,2}() to use of to_shrink_list() to_shrink_list(): call only if refcount is 0 fold dentry_kill() into dput() don't try to cut corners in shrink_lock_dentry() fold the call of retain_dentry() into fast_dput() Call retain_dentry() with refcount 0 dentry_kill(): don't bother with retain_dentry() on slow path ...
            yujian Jian Yu added a comment -

            The following commit in kernel 6.8 made changes to dget_dlock():

            commit 1b6ae9f6e6c3e3c35aad0f11b116a81780b8aa03
            Author:     Vegard Nossum <vegard.nossum@oracle.com>
            AuthorDate: Mon Nov 6 14:44:17 2023 +0100
            Commit:     Al Viro <viro@zeniv.linux.org.uk>
            CommitDate: Sat Nov 25 02:51:56 2023 -0500
            
                dcache: remove unnecessary NULL check in dget_dlock()
                
                dget_dlock() requires dentry->d_lock to be held when called, yet
                contains a NULL check for dentry.
                
                An audit of all calls to dget_dlock() shows that it is never called
                with a NULL pointer (as spin_lock()/spin_unlock() would crash in these
                cases):
                ......
                After taking out the NULL check, dget_dlock() becomes almost identical
                to __dget_dlock(); the only difference is that dget_dlock() returns the
                dentry that was passed in. These are static inline helpers, so we can
                rely on the compiler to discard unused return values. We can therefore
                also remove __dget_dlock() and replace calls to it by dget_dlock().
                ......
            

             

            yujian Jian Yu added a comment - The following commit in kernel 6.8 made changes to dget_dlock(): commit 1b6ae9f6e6c3e3c35aad0f11b116a81780b8aa03 Author: Vegard Nossum <vegard.nossum@oracle.com> AuthorDate: Mon Nov 6 14:44:17 2023 +0100 Commit: Al Viro <viro@zeniv.linux.org.uk> CommitDate: Sat Nov 25 02:51:56 2023 -0500 dcache: remove unnecessary NULL check in dget_dlock() dget_dlock() requires dentry->d_lock to be held when called, yet contains a NULL check for dentry. An audit of all calls to dget_dlock() shows that it is never called with a NULL pointer (as spin_lock()/spin_unlock() would crash in these cases): ...... After taking out the NULL check, dget_dlock() becomes almost identical to __dget_dlock(); the only difference is that dget_dlock() returns the dentry that was passed in. These are static inline helpers, so we can rely on the compiler to discard unused return values. We can therefore also remove __dget_dlock() and replace calls to it by dget_dlock(). ......  

            People

              yujian Jian Yu
              yujian Jian Yu
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: