<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:21:18 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-15788] lazystatfs + FOFB + mpich problems</title>
                <link>https://jira.whamcloud.com/browse/LU-15788</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;During FOFB tests with IOR and mpich we are observing the following errors. I&apos;ve created a timeline for the issue.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Using Time Stamp 1648109998 (0x623c29ae) for Data Signature  (03:19:58)
delaying 15 seconds . . .
 Commencing write performance test.
 Thu Mar 24 03:21:10 2022

 write     717.93     1048576    1024.00    0.113480   91.17      0.010149   91.28      3    XXCEL
 Verifying contents of the file(s) just written.
 Thu Mar 24 03:22:41 2022

 delaying 15 seconds . . .
 [RANK 000] open for reading file /lus/kjcf05/disk/ostest.vers/alsorun.20220324030303.12286.walleye-p5/CL_IOR_sel_32ovs_mpiio_wr_8iter_n64_1m.1.LKuc9T.1648109355/CL_IOR_sel_32ovs_mpiio_wr_8iter_n64_1m/IORfile_1m XXCEL
 Commencing read performance test.
 Thu Mar 24 03:23:27 2022

 read      2698.93    1048576    1024.00    0.030882   24.25      0.005629   24.28      3    XXCEL
 Using Time Stamp 1648110232 (0x623c2a98) for Data Signature (03:24:42)
 delaying 15 seconds . . . (~03:24:57)

Mar 24 03:24:51 kjcf05n03 kernel: Lustre: Failing over kjcf05-MDT0000

 ** error **
 ** error **
 ADIO_RESOLVEFILETYPE_FNCALL(387): Invalid file name /lus/kjcf05/disk/ostest.vers/alsorun.20220324030303.12286.walleye-p5/CL_IOR_sel_32ovs_mpiio_wr_8iter_n64_1m.1.LKuc9T.1648109355/CL_IOR_sel_32ovs_mpiio_wr_8iter_n64_1m/IORfile_1m, mpi_check_status: 939600165, mpi_check_status_errno: 107
 MPI File does not exist, error stack:
 (unknown)(): Invalid file name, mpi_check_status: 939600165, mpi_check_status_errno: 2

Rank 0 [Thu Mar 24 03:25:00 2022] [c3-0c0s12n0] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0


Mar 24 03:25:46 kjcf05n03 kernel: Lustre: server umount kjcf05-MDT0000 complete
Mar 24 03:25:46 kjcf05n03 kernel: md65: detected capacity change from 21009999921152 to 0
Mar 24 03:25:46 kjcf05n03 kernel: md: md65 stopped.
Mar 24 03:25:48 kjcf05n02 kernel: md: md65 stopped.
00000020:00000001:22.0:1648110350.625691:0:512728:0:(obd_mount_server.c:1352:server_start_targets()) Process entered
Mar 24 03:25:51 kjcf05n02 kernel: Lustre: kjcf05-MDT0000: Will be in recovery for at least 15:00, or until 24 clients reconnect
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The failure is caused by the following mpich code path:&lt;br/&gt;
MPI_File_open()-&amp;gt;ADIO_ResolveFileType()-&amp;gt;ADIO_FileSysType_fncall()-&amp;gt;statfs() &lt;/p&gt;

&lt;p&gt;The VFS statfs path does a lookup for the file and then calls ll_statfs. If the cluster loses the MDT between these two calls, ll_statfs fails with one of the following errors: EAGAIN, ENOTCONN, or ENODEV. The exact error depends on the MDT failover stage. The error breaks the MPICH logic for detecting the FS type and fails the IOR run. The error doesn&apos;t happen with nolazystatfs because ll_statfs is then blocking and waits for the MDT.&lt;br/&gt;
Lazystatfs was designed not to block statfs. However, an OST failover does not produce an ll_statfs error because statfs returns only MDT data and rc 0.&lt;br/&gt;
mpich also has a workaround for the ESTALE error from NFS:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;static void ADIO_FileSysType_fncall(const char *filename, int *fstype, int *error_code)
{
    int err;
    int64_t file_id;
    static char myname[] = &quot;ADIO_RESOLVEFILETYPE_FNCALL&quot;;


/* NFS can get stuck and end up returning ESTALE &quot;forever&quot; */
#define MAX_ESTALE_RETRY 10000
    int retry_cnt;

    *error_code = MPI_SUCCESS;

    retry_cnt = 0;
    do {
        err = romio_statfs(filename, &amp;amp;file_id);
    } while (err &amp;amp;&amp;amp; (errno == ESTALE) &amp;amp;&amp;amp; retry_cnt++ &amp;lt; MAX_ESTALE_RETRY);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
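&lt;p&gt;To illustrate how this loop treats different errno values, here is a minimal user-space sketch. It is not Lustre or MPICH code: fake_statfs(), mask_statfs_errno() and statfs_with_retry() are hypothetical names, and the fake call simply fails a fixed number of times with a chosen errno before succeeding. It assumes the only transient errors of interest are the three named above.&lt;/p&gt;
<![CDATA[
```c
#include <errno.h>

/* Hypothetical stand-in for statfs() during an MDT failover:
 * fails `failures_left` more times with `fail_errno`, then succeeds. */
struct fake_fs {
    int failures_left;
    int fail_errno;
};

static int fake_statfs(struct fake_fs *fs)
{
    if (fs->failures_left > 0) {
        fs->failures_left--;
        errno = fs->fail_errno;
        return -1;
    }
    return 0;
}

/* Hypothetical masking helper: report the transient failover errors
 * named in this ticket as ESTALE and leave every other value unchanged. */
static int mask_statfs_errno(int err)
{
    switch (err) {
    case EAGAIN:
    case ENOTCONN:
    case ENODEV:
        return ESTALE;
    default:
        return err;
    }
}

#define MAX_ESTALE_RETRY 10000

/* Same shape as the ROMIO loop quoted above: retry only on ESTALE.
 * If `mask` is non-zero, errno is masked before the loop condition runs.
 * Returns 0 on success, -1 on failure; *attempts counts the calls made. */
static int statfs_with_retry(struct fake_fs *fs, int mask, int *attempts)
{
    int err;
    int retry_cnt = 0;

    *attempts = 0;
    do {
        err = fake_statfs(fs);
        (*attempts)++;
        if (err && mask)
            errno = mask_statfs_errno(errno);
    } while (err && (errno == ESTALE) && retry_cnt++ < MAX_ESTALE_RETRY);

    return err;
}
```
]]>
&lt;p&gt;With a fake_fs that fails three times with ENOTCONN, the unmasked call gives up after the first attempt, while the masked call keeps retrying and succeeds on the fourth attempt.&lt;/p&gt;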
&lt;p&gt;I suggest masking these ll_statfs errors as ESTALE. This would make MPICH work with the lazystatfs option during FOFB.&lt;/p&gt;</description>
                <environment></environment>
        <key id="70033">LU-15788</key>
            <summary>lazystatfs + FOFB + mpich problems</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="aboyko">Alexander Boyko</assignee>
                                    <reporter username="aboyko">Alexander Boyko</reporter>
                        <labels>
                            <label>patch</label>
                    </labels>
                <created>Wed, 27 Apr 2022 08:35:20 +0000</created>
                <updated>Wed, 15 Jun 2022 05:32:01 +0000</updated>
                            <resolved>Sat, 11 Jun 2022 15:19:13 +0000</resolved>
                                    <version>Lustre 2.15.0</version>
                                    <fixVersion>Lustre 2.15.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="333098" author="gerrit" created="Wed, 27 Apr 2022 08:47:03 +0000"  >&lt;p&gt;&quot;Alexander Boyko &amp;lt;alexander.boyko@hpe.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/47152&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/47152&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15788&quot; title=&quot;lazystatfs + FOFB + mpich problems&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15788&quot;&gt;&lt;del&gt;LU-15788&lt;/del&gt;&lt;/a&gt; llite: statfs error masking&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: e33e50b695eb77a877b65c1070df08398fc76a8d&lt;/p&gt;</comment>
                            <comment id="333100" author="aboyko" created="Wed, 27 Apr 2022 08:57:13 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=adilger&quot; class=&quot;user-hover&quot; rel=&quot;adilger&quot;&gt;adilger&lt;/a&gt; could you take a look at the description? I&apos;ve pushed the patch for discussion only. We have no agreement about the fix yet. This could also be fixed in the mpich library. I also want to mention that Lustre returns error codes that are not documented for this syscall; however, ESTALE is also wrong based on the man pages. The whole user-mode concept of detecting the FS type with a statfs call, especially for a distributed FS, brings me to tears.&lt;/p&gt;</comment>
                            <comment id="333158" author="adilger" created="Wed, 27 Apr 2022 15:24:38 +0000"  >&lt;p&gt;Probably the obd_statfs() call for MDT0000 should not be lazy, since MDT0000 is required for filesystem operation. That should also avoid this problem, and be &quot;more correct&quot; for users as well - they will get some valid return rather than an error. &lt;/p&gt;</comment>
                            <comment id="333265" author="tappro" created="Thu, 28 Apr 2022 11:45:57 +0000"  >&lt;p&gt;Does it mean that turning lazystatfs off would remove the problem as well?&lt;/p&gt;</comment>
                            <comment id="333270" author="aboyko" created="Thu, 28 Apr 2022 12:34:07 +0000"  >&lt;p&gt;Yep, turning lazystatfs off makes ll_statfs blocking. The ptlrpc layer handles errors and resends the statfs request when MDT0 finishes recovery.&lt;/p&gt;</comment>
                            <comment id="333321" author="adilger" created="Thu, 28 Apr 2022 16:38:15 +0000"  >&lt;p&gt;Mike, lazystatfs has been enabled by default for a long time. However, it &lt;b&gt;should&lt;/b&gt; only apply to &quot;lfs df&quot; to return individual OST stats, not cause the whole statfs to fail. That is a bad interaction between STATFS_SUM (which only sends one RPC to one MDS) and lazystatfs (which allows individual RPCs to fail, but expects &lt;b&gt;most&lt;/b&gt; of them to work). &lt;/p&gt;

&lt;p&gt;I think the current patch is a reasonable compromise. It retries the STATFS_SUM multiple times to different MDTs (which shouldn&apos;t all be failing at the same time), and should also block (loop retrying) if all MDTs are down. &lt;/p&gt;</comment>
                            <comment id="337374" author="gerrit" created="Sat, 11 Jun 2022 05:29:46 +0000"  >&lt;p&gt;&quot;Oleg Drokin &amp;lt;green@whamcloud.com&amp;gt;&quot; merged in patch &lt;a href=&quot;https://review.whamcloud.com/47152/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/47152/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15788&quot; title=&quot;lazystatfs + FOFB + mpich problems&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15788&quot;&gt;&lt;del&gt;LU-15788&lt;/del&gt;&lt;/a&gt; lmv: try another MDT if statfs failed&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 57f3262baa7d8931176a81cde05bc057facfc3b6&lt;/p&gt;</comment>
                            <comment id="337481" author="pjones" created="Sat, 11 Jun 2022 15:19:13 +0000"  >&lt;p&gt;Landed for 2.15&lt;/p&gt;</comment>
                            <comment id="337789" author="bzzz" created="Wed, 15 Jun 2022 04:06:52 +0000"  >&lt;p&gt;With this patch landed, I hit the following failure almost 100% of the time:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
PASS 150 (9s)
== recovery-small test complete, duration 4839 sec ======= 04:02:47 (1655265767)
rm: cannot remove &apos;/mnt/lustre/d110h.recovery-small/target_dir/tgt_file&apos;: Input/output error
 recovery-small : @@@@@@ FAIL: remove sub-test dirs failed 
  Trace dump:
  = ./../tests/test-framework.sh:6522:error()
  = ./../tests/test-framework.sh:6006:check_and_cleanup_lustre()
  = recovery-small.sh:3306:main()
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;bisection:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
COMMIT		TESTED	PASSED	FAILED		COMMIT DESCRIPTION
a3cba2ead7      1       0       1       BAD     LU-13547 tests: remove ea_inode from mkfs MDT options
4c47900889      5       4       1       BAD     LU-12186 ec: add necessary structure member for EC file
b762319d5a      5       4       1       BAD     LU-14195 libcfs: test for nla_strscpy
57f3262baa      2       1       1       BAD     LU-15788 lmv: try another MDT if statfs failed
b00ac5f703      5       5       0       GOOD    LU-12756 lnet: Avoid redundant peer NI lookups
23028efcae      5       5       0       GOOD    LU-6864 osp: manage number of modify RPCs in flight
7f157f8ef3      5       5       0       GOOD    LU-15841 lod: iterate component to collect avoid array
eb71aec27e      5       5       0       GOOD    LU-15786 tests: get maxage param on mds1 properly
9523e99046      5       5       0       GOOD    LU-15754 lfsck: skip an inode if iget() returns -ENOMEM
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="337792" author="adilger" created="Wed, 15 Jun 2022 05:30:58 +0000"  >&lt;p&gt;Oleg had problems with v2 of this patch:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;Oleg Drokin, 05-29 20:21&lt;/p&gt;

&lt;p&gt;Patch Set 2: Verified-1&lt;br/&gt;
This seems to introduce a 100% recovery-small timeout in janitor testing.&lt;/p&gt;&lt;/blockquote&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="68081">LU-15457</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i02ob3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>