<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:31:32 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-16973] Busy device after successful umount</title>
                <link>https://jira.whamcloud.com/browse/LU-16973</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We have a script that monitors the failover process and crashes the node if failover is delayed. Usually 30 seconds after umount completes is enough to finish all outstanding work and bring the RAID device to zero references. High IO with two MDTs on a node makes things worse: the delayed work takes much longer, and we have seen 20+ minutes pass after umount completes.&lt;/p&gt;

&lt;p&gt;The umount work is divided into:&lt;/p&gt;

&lt;p&gt;1) the umount syscall&lt;br/&gt;
2) mntput(), which is delayed and processed by a kworker&lt;br/&gt;
3) iputs: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15404&quot; title=&quot;kernel panic and filesystem corruption in setxattr due to journal transaction restart&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15404&quot;&gt;&lt;del&gt;LU-15404&lt;/del&gt;&lt;/a&gt; moved them to a separate superblock work queue, and mntput waits for it&lt;br/&gt;
4) fput() calls, which are delayed and share a global kernel list with the other MDTs; the list is processed by a single kworker&lt;/p&gt;

&lt;p&gt;The md65 mount has 1747359 references, and 2387523 files are on the fput list.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[353410.384294] Lustre: server umount kjcf04-MDT0000 complete
[353454.532032] sysrq: SysRq : Trigger a crash
[0]: ffffd193c1a1f584
struct mnt_pcp {
  mnt_count = 436042, 
  mnt_writers = 0
}
crash&amp;gt; mnt_pcp 0x3a1fd1e1f584:1
[1]: ffffd193c1a5f584
struct mnt_pcp {
  mnt_count = 436156, 
  mnt_writers = 0
}
crash&amp;gt; mnt_pcp 0x3a1fd1e1f584:a
[0]: ffffd193c1a1f584
struct mnt_pcp {
  mnt_count = 436042, 
  mnt_writers = 0
}
cat mount_count.pcp | tr -d ,|grep mnt_count|awk &apos;{ sum += $3 } END { print sum }&apos;
1747359

crash&amp;gt; file ffff978de3176300
struct file {
  f_u = {
    fu_llist = {
      next = 0xffff9794e4529e00
    }, 
    fu_rcuhead = {
      next = 0xffff9794e4529e00, 
      func = 0x0
    }
  }, 
  f_path = {
    mnt = 0xffff9793b5200da0, 
    dentry = 0xffff97912a254a80
  }, 
  f_inode = 0xffff979ddd8351c0, 
  f_op = 0xffffffffc1c2b640 &amp;lt;ldiskfs_dir_operations&amp;gt;, 

 list 0xffff9794e4529e00 | wc -l

2387523
 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;So in this situation the total umount time is unpredictable. Let&apos;s fix it.&lt;/p&gt;</description>
                <environment></environment>
        <key id="77092">LU-16973</key>
            <summary>Busy device after successful umount</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="aboyko">Alexander Boyko</assignee>
                                    <reporter username="aboyko">Alexander Boyko</reporter>
                        <labels>
                            <label>patch</label>
                    </labels>
                <created>Fri, 21 Jul 2023 11:18:15 +0000</created>
                <updated>Tue, 19 Dec 2023 15:48:46 +0000</updated>
                            <resolved>Wed, 6 Sep 2023 13:09:15 +0000</resolved>
                                    <version>Upstream</version>
                                    <fixVersion>Lustre 2.16.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>11</watches>
                                                                            <comments>
                            <comment id="379650" author="gerrit" created="Fri, 21 Jul 2023 11:27:45 +0000"  >&lt;p&gt;&quot;Alexander Boyko &amp;lt;alexander.boyko@hpe.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/51731&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/51731&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16973&quot; title=&quot;Busy device after successful umount&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16973&quot;&gt;&lt;del&gt;LU-16973&lt;/del&gt;&lt;/a&gt; osd: adds SB_KERNMOUNT flag&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 0792975beb025cd6ff497f12b01351e02f13d2ed&lt;/p&gt;</comment>
                            <comment id="379756" author="adilger" created="Sat, 22 Jul 2023 02:29:21 +0000"  >&lt;p&gt;Instead of deferring all of this work to unmount, does it make sense to (somehow) schedule this work more aggressively during runtime, so that it doesn&apos;t get backlogged?  Having 2.4M files to clean up suggests that the cleanup is falling behind and unable to keep up with the workload.&lt;/p&gt;

&lt;p&gt;Can we put more threads to process the workqueue, or increase the thread priority?&lt;/p&gt;</comment>
                            <comment id="379857" author="aboyko" created="Mon, 24 Jul 2023 12:08:40 +0000"  >&lt;p&gt;Unfortunately, the kernel is designed with a single delayed_fput_list, and delayed_fput() does not divide the list among kworker threads. The reason is probably the small number of files handled at kernel level. The best approach would be a per-CPU delayed list, since by default the kernel already has two kworker threads per CPU. &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=adilger&quot; class=&quot;user-hover&quot; rel=&quot;adilger&quot;&gt;adilger&lt;/a&gt; Is it possible to minimize the calls to alloc_file_pseudo()?&lt;/p&gt;</comment>
                            <comment id="380115" author="adilger" created="Wed, 26 Jul 2023 04:46:37 +0000"  >&lt;p&gt;The use of &lt;tt&gt;alloc_file_pseudo()&lt;/tt&gt; was added for upstream kernel compatibility/cleanliness, but if this is causing problems by generating millions of delayed &lt;tt&gt;fput()&lt;/tt&gt; calls at unmount time, I would rather fix the core problem than make &quot;clean up 2M file descriptors at unmount&quot; work better.  Previously we just allocated &quot;&lt;tt&gt;struct file&lt;/tt&gt;&quot; on the stack and filled in the required fields directly for the few interfaces, such as &lt;tt&gt;ioctl()&lt;/tt&gt; or &lt;tt&gt;iterate_dir()&lt;/tt&gt;, that need a &quot;&lt;tt&gt;struct file&lt;/tt&gt;&quot; argument.&lt;/p&gt;

&lt;p&gt;In most cases, the &lt;tt&gt;struct file&lt;/tt&gt; returned by &lt;tt&gt;alloc_file_pseudo()&lt;/tt&gt; is only used for the duration of a single function.  Rather than allocating and (delayed) freeing millions of file descriptors for handling these temporary calls, it would be much more efficient to avoid allocating these file descriptors entirely, if possible.&lt;/p&gt;

&lt;p&gt;Do you know which function most of the file handles are coming from?  Is it mostly the delayed xattr handling?  Could we add something to the MDT service threads that calls &lt;tt&gt;flush_workqueue(sbi-&amp;gt;s_misc_wq)&lt;/tt&gt; every 5s (or 60s or similar) to avoid accumulating so much delayed work?&lt;/p&gt;

&lt;p&gt;Looking through the callers of &lt;tt&gt;alloc_file_pseudo()&lt;/tt&gt; I think some things can be done to improve this:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;we should &lt;b&gt;always&lt;/b&gt; be passing &lt;tt&gt;FMODE_NONOTIFY&lt;/tt&gt; to these file descriptors to avoid overhead.  That was done internally for the compat code via &lt;tt&gt;OPEN_FMODE()&lt;/tt&gt; for pre-4.18 kernels but should instead be explicitly done by all the callers&lt;/li&gt;
	&lt;li&gt;&lt;tt&gt;osd_lseek()&lt;/tt&gt; appears to use only &lt;tt&gt;f_lock&lt;/tt&gt;, &lt;tt&gt;f_mode&lt;/tt&gt;, and &lt;tt&gt;f_pos&lt;/tt&gt; for files, and doesn&apos;t really need a &quot;proper&quot; file descriptor.  For &lt;tt&gt;ext4_dir_llseek()&lt;/tt&gt; on directories it also uses &lt;tt&gt;f_mapping&lt;/tt&gt;.&lt;/li&gt;
	&lt;li&gt;&lt;tt&gt;osd_execute_punch()&lt;/tt&gt; looks like it is only using &lt;tt&gt;f_inode&lt;/tt&gt; and &lt;tt&gt;f_mode&lt;/tt&gt; (should have &lt;tt&gt;S_NOCMTIME&lt;/tt&gt; set to skip &lt;tt&gt;file_modified_flags()&lt;/tt&gt; and &lt;tt&gt;S_NOSEC&lt;/tt&gt; to skip &lt;tt&gt;__file_remove_privs()&lt;/tt&gt;).&lt;/li&gt;
	&lt;li&gt;&lt;tt&gt;server_ioctl()&lt;/tt&gt; could use a single file descriptor saved in &lt;tt&gt;struct lustre_sb_info&lt;/tt&gt; across all calls, as it is always for the root inode&lt;/li&gt;
	&lt;li&gt;&lt;tt&gt;osd_check_lmv()&lt;/tt&gt; could potentially do a lookup instead of &lt;tt&gt;iterate_dir()&lt;/tt&gt;, but I think this should be called only rarely.&lt;/li&gt;
	&lt;li&gt;&lt;tt&gt;osd_object_sync()&lt;/tt&gt; could just call &lt;tt&gt;inode-&amp;gt;i_fop-&amp;gt;fsync()&lt;/tt&gt; directly, but that still needs &lt;tt&gt;filp&lt;/tt&gt; and does a lot of messy stuff which we probably don&apos;t need, so I don&apos;t think this would be easily handled.&lt;/li&gt;
&lt;/ul&gt;
</comment>
                            <comment id="380163" author="panda" created="Wed, 26 Jul 2023 10:26:32 +0000"  >&lt;p&gt;Andreas, delayed fput is not related to the delayed iput hack in the xattr removal path. fput is designed to be delayed until exit from the syscall or as a separate work.&lt;/p&gt;

&lt;p&gt;I&apos;m not sure there&apos;s a need for fput to be delayed in Lustre; maybe we can just call the postponed version ___fput() instead of fput(). I think one of the problems with this approach is that the postponed call is not exported.&lt;/p&gt;</comment>
                            <comment id="380282" author="adilger" created="Wed, 26 Jul 2023 21:38:26 +0000"  >&lt;blockquote&gt;
&lt;p&gt;fput is designed to be delayed until exit from the syscall or as a separate work.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;This is clearly a problem since the Lustre service threads never &quot;exit from a syscall&quot;, so it looks like all &lt;tt&gt;fput()&lt;/tt&gt; operations are being delayed.  The &quot;&lt;tt&gt;&amp;#95;&amp;#95;fput_sync()&lt;/tt&gt;&quot; function was added in commit v3.5-rc6-284-g4a9d4b024a31 when delayed &lt;tt&gt;fput()&lt;/tt&gt; was added, but was not exported until commit v5.17-rc5-285-gf00432063db1.  We could convert (some/all?) users to use &lt;tt&gt;&amp;#95;&amp;#95;fput_sync()&lt;/tt&gt; (with a configure check):&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
/*
 * synchronous analog of fput(); for kernel threads that might be needed
 * in some umount() (and thus can&apos;t use flush_delayed_fput() without
 * risking deadlocks), need to wait for completion of __fput() and know
 * for this specific struct file it won&apos;t involve anything that would
 * need them.  Use only if you really need it - at the very least,
 * don&apos;t blindly convert fput() by kernel thread to that.
 */
void __fput_sync(struct file *file)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;For kernels where this is not exported, we could use &lt;tt&gt;kallsyms_lookup_name()&lt;/tt&gt; to access &lt;tt&gt;__fput_sync()&lt;/tt&gt;.&lt;/p&gt;

&lt;p&gt;There is also a way to force delayed &lt;tt&gt;fput()&lt;/tt&gt; calls to be flushed immediately:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
/*
 * If kernel thread really needs to have the final fput() it has done
 * to complete, call this.  Please, don&apos;t add more callers without
 * very good reasons; in particular, never call that with locks
 * held and never call that from a thread that might need to do
 * some work on any kind of umount.
 */
void flush_delayed_fput(void)
{
        delayed_fput(NULL);
}
EXPORT_SYMBOL_GPL(flush_delayed_fput);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This could be called from &lt;tt&gt;ptlrpc_main()&lt;/tt&gt; before the thread goes to sleep, when there are no more RPCs to process.  That delays the fput work until the system is no longer totally overloaded, and doing it in the background from an idle thread should make RPC processing a tiny bit more efficient:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (ptlrpc_server_request_pending(svcpt, &lt;span class=&quot;code-keyword&quot;&gt;false&lt;/span&gt;)) {
                        lu_context_enter(&amp;amp;env-&amp;gt;le_ctx);
                        ptlrpc_server_handle_request(svcpt, thread);
                        lu_context_exit(&amp;amp;env-&amp;gt;le_ctx);
-               }
+               } &lt;span class=&quot;code-keyword&quot;&gt;else&lt;/span&gt; {
+                        flush_delayed_fput();
+               }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="380634" author="gerrit" created="Sat, 29 Jul 2023 16:34:00 +0000"  >&lt;p&gt;&quot;Andreas Dilger &amp;lt;adilger@whamcloud.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/51805&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/51805&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16973&quot; title=&quot;Busy device after successful umount&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16973&quot;&gt;&lt;del&gt;LU-16973&lt;/del&gt;&lt;/a&gt; ptlrpc: flush delayed file desc if idle&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 6c7327812230cc149ec75de6ea604b4a7740e0f6&lt;/p&gt;</comment>
                            <comment id="383046" author="gerrit" created="Sat, 19 Aug 2023 05:35:05 +0000"  >&lt;p&gt;&quot;Oleg Drokin &amp;lt;green@whamcloud.com&amp;gt;&quot; merged in patch &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/51805/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/51805/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16973&quot; title=&quot;Busy device after successful umount&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16973&quot;&gt;&lt;del&gt;LU-16973&lt;/del&gt;&lt;/a&gt; ptlrpc: flush delayed file desc if idle&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 2feb4a7bb01c5e98763a62fb0bd64edf933c95de&lt;/p&gt;</comment>
                            <comment id="384890" author="gerrit" created="Wed, 6 Sep 2023 06:15:29 +0000"  >&lt;p&gt;&quot;Oleg Drokin &amp;lt;green@whamcloud.com&amp;gt;&quot; merged in patch &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/51731/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/51731/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16973&quot; title=&quot;Busy device after successful umount&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16973&quot;&gt;&lt;del&gt;LU-16973&lt;/del&gt;&lt;/a&gt; osd: adds SB_KERNMOUNT flag&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: eff11c8ce1f89f30dcc5af88b67b3d6c15a631a6&lt;/p&gt;</comment>
                            <comment id="384968" author="pjones" created="Wed, 6 Sep 2023 13:09:15 +0000"  >&lt;p&gt;Landed for 2.16&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="67776">LU-15404</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="77144">LU-16982</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10040" key="com.atlassian.jira.plugin.system.customfieldtypes:labels">
                        <customfieldname>Epic</customfieldname>
                        <customfieldvalues>
                                        <label>server</label>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i03r33:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>