Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.16.0
    • Upstream
    • 3
    • 9223372036854775807

    Description

      We have a script that checks failover process and if it is delayed the node is crashed. Usually 30seconds after umount complete are enough to finish all working stuff and have a zero reference for a raid device. But with high IO and two MDTs on a node makes things worse. Delayed works takes much time, we have seen 20+ minutes after umount complete.

      The umount jobs are divided on

      1) umount syscall
      2) mntput() is delayed and processing by kworker
      3) iputs, LU-15404 made it works at separate super block work queue and mntput waits it
      4) fputs() using delayed and shares global kernel list with others MDTs and list are processed with a single kworker

      md65 mount has 1747359 references and 2387523 files at fput list.

      [353410.384294] Lustre: server umount kjcf04-MDT0000 complete
      [353454.532032] sysrq: SysRq : Trigger a crash
      [0]: ffffd193c1a1f584
      struct mnt_pcp {
        mnt_count = 436042, 
        mnt_writers = 0
      }
      crash> mnt_pcp 0x3a1fd1e1f584:1
      [1]: ffffd193c1a5f584
      struct mnt_pcp {
        mnt_count = 436156, 
        mnt_writers = 0
      }
      crash> mnt_pcp 0x3a1fd1e1f584:a
      [0]: ffffd193c1a1f584
      struct mnt_pcp {
        mnt_count = 436042, 
        mnt_writers = 0
      }
      cat mount_count.pcp | tr -d ,|grep mnt_count|awk '{ sum += $3 } END { print sum }'
      1747359
      
      crash> file ffff978de3176300
      struct file {
        f_u = {
          fu_llist = {
            next = 0xffff9794e4529e00
          }, 
          fu_rcuhead = {
            next = 0xffff9794e4529e00, 
            func = 0x0
          }
        }, 
        f_path = {
          mnt = 0xffff9793b5200da0, 
          dentry = 0xffff97912a254a80
        }, 
        f_inode = 0xffff979ddd8351c0, 
        f_op = 0xffffffffc1c2b640 <ldiskfs_dir_operations>, 
      
       list 0xffff9794e4529e00 | wc -l
      
      2387523
       

       
      So in such situation the total umount time is unpredictable. Let's fix it.
       

      Attachments

        Issue Links

          Activity

            [LU-16973] Busy device after successful umount

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57768
            Subject: LU-16973 osd: adds SB_KERNMOUNT flag
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 13314ed47bb026b150dff0c3a3bbba17ab75936e

            gerrit Gerrit Updater added a comment - "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57768 Subject: LU-16973 osd: adds SB_KERNMOUNT flag Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: 13314ed47bb026b150dff0c3a3bbba17ab75936e

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57767
            Subject: LU-16973 ptlrpc: flush delayed file desc if idle
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 55231b2b135f10839759c14e8f20ecfccfa43ea7

            gerrit Gerrit Updater added a comment - "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57767 Subject: LU-16973 ptlrpc: flush delayed file desc if idle Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: 55231b2b135f10839759c14e8f20ecfccfa43ea7
            pjones Peter Jones added a comment -

            Landed for 2.16

            pjones Peter Jones added a comment - Landed for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51731/
            Subject: LU-16973 osd: adds SB_KERNMOUNT flag
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: eff11c8ce1f89f30dcc5af88b67b3d6c15a631a6

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51731/ Subject: LU-16973 osd: adds SB_KERNMOUNT flag Project: fs/lustre-release Branch: master Current Patch Set: Commit: eff11c8ce1f89f30dcc5af88b67b3d6c15a631a6

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51805/
            Subject: LU-16973 ptlrpc: flush delayed file desc if idle
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 2feb4a7bb01c5e98763a62fb0bd64edf933c95de

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51805/ Subject: LU-16973 ptlrpc: flush delayed file desc if idle Project: fs/lustre-release Branch: master Current Patch Set: Commit: 2feb4a7bb01c5e98763a62fb0bd64edf933c95de

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51805
            Subject: LU-16973 ptlrpc: flush delayed file desc if idle
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 6c7327812230cc149ec75de6ea604b4a7740e0f6

            gerrit Gerrit Updater added a comment - "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51805 Subject: LU-16973 ptlrpc: flush delayed file desc if idle Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 6c7327812230cc149ec75de6ea604b4a7740e0f6
            adilger Andreas Dilger added a comment - - edited

            fput is designed to be delayed until exit from the syscall or as a separate work.

            This is clearly a problem since the Lustre service threads never "exit from a syscall", so it looks like all fput() operations are being delayed. The "__fput_sync()" function was added in commit v3.5-rc6-284-g4a9d4b024a31 when delayed fput() was added, but was not exported until commit v5.17-rc5-285-gf00432063db1. We could convert (some/all?) users to use __fput_sync() (with a configure check):

            /*
             * synchronous analog of fput(); for kernel threads that might be needed
             * in some umount() (and thus can't use flush_delayed_fput() without
             * risking deadlocks), need to wait for completion of __fput() and know
             * for this specific struct file it won't involve anything that would
             * need them.  Use only if you really need it - at the very least,
             * don't blindly convert fput() by kernel thread to that.
             */
            void __fput_sync(struct file *file)
            

            For kernels where this is not exported, we could use kallsyms_lookup_symbol() to access __fput_sync().

            There is also a way to force delayed fput() calls to be flushed immediately:

            /*
             * If kernel thread really needs to have the final fput() it has done
             * to complete, call this.  Please, don't add more callers without
             * very good reasons; in particular, never call that with locks
             * held and never call that from a thread that might need to do
             * some work on any kind of umount.
             */
            void flush_delayed_fput(void)
            {
                    delayed_fput(NULL);
            }
            EXPORT_SYMBOL_GPL(flush_delayed_fput);
            

            This could be called from ptlrpc_main() before the thread goes to sleep, when there are no more RPCs to process. That will delay the fput work until the system is not totally overloaded, but should make RPC processing a tiny bit more efficient to do this in the background by an idle thread:

                            if (ptlrpc_server_request_pending(svcpt, false)) {
                                    lu_context_enter(&env->le_ctx);
                                    ptlrpc_server_handle_request(svcpt, thread);
                                    lu_context_exit(&env->le_ctx);
            -               }
            +               } else {
            +                        flush_delayed_fput();
            +               }
            
            adilger Andreas Dilger added a comment - - edited fput is designed to be delayed until exit from the syscall or as a separate work. This is clearly a problem since the Lustre service threads never "exit from a syscall", so it looks like all fput() operations are being delayed. The " __fput_sync() " function was added in commit v3.5-rc6-284-g4a9d4b024a31 when delayed fput() was added, but was not exported until commit v5.17-rc5-285-gf00432063db1. We could convert (some/all?) users to use __fput_sync() (with a configure check): /* * synchronous analog of fput(); for kernel threads that might be needed * in some umount() (and thus can't use flush_delayed_fput() without * risking deadlocks), need to wait for completion of __fput() and know * for this specific struct file it won't involve anything that would * need them. Use only if you really need it - at the very least, * don't blindly convert fput() by kernel thread to that. */ void __fput_sync(struct file *file) For kernels where this is not exported, we could use kallsyms_lookup_symbol() to access __fput_sync() . There is also a way to force delayed fput() calls to be flushed immediately: /* * If kernel thread really needs to have the final fput() it has done * to complete, call this . Please, don't add more callers without * very good reasons; in particular, never call that with locks * held and never call that from a thread that might need to do * some work on any kind of umount. */ void flush_delayed_fput(void) { delayed_fput(NULL); } EXPORT_SYMBOL_GPL(flush_delayed_fput); This could be called from ptlrpc_main() before the thread goes to sleep, when there are no more RPCs to process. That will delay the fput work until the system is not totally overloaded, but should make RPC processing a tiny bit more efficient to do this in the background by an idle thread: if (ptlrpc_server_request_pending(svcpt, false )) { lu_context_enter(&env->le_ctx); ptlrpc_server_handle_request(svcpt, thread); lu_context_exit(&env->le_ctx); - } + } else { + flush_delayed_fput(); + }
            panda Andrew Perepechko added a comment - - edited

            Andreas, delayed fput is not related to the delayed iput hack in the xattr removal path. fput is designed to be delayed until exit from the syscall or as a separate work.

            I'm not sure if there's a need for fput to be delayed in Lustre and maybe can just call the postponed version ___fput() instead of fput(). I think one of the problems with this approach is that the postponed call is not exported.

            panda Andrew Perepechko added a comment - - edited Andreas, delayed fput is not related to the delayed iput hack in the xattr removal path. fput is designed to be delayed until exit from the syscall or as a separate work. I'm not sure if there's a need for fput to be delayed in Lustre and maybe can just call the postponed version ___fput() instead of fput(). I think one of the problems with this approach is that the postponed call is not exported.

            People

              aboyko Alexander Boyko
              aboyko Alexander Boyko
              Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: