<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:07:09 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-14137] parallel-scale-nfsv4 racer_on_nfs server not responding leads pjdfstest to hang or fail</title>
                <link>https://jira.whamcloud.com/browse/LU-14137</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We&#8217;ve seen that racer_on_nfs for parallel-scale-nfsv3/4 can prevent the next test suite from running because the NFS server stops responding; see &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9940&quot; title=&quot;posix no sub tests failed: &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9940&quot;&gt;&lt;del&gt;LU-9940&lt;/del&gt;&lt;/a&gt;. We see this with pjdfstest, the test suite that runs immediately after parallel-scale-nfsv4; &lt;a href=&quot;https://testing.whamcloud.com/test_sets/177f113b-8697-4e90-8a5b-a6b9fc4960d4&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/177f113b-8697-4e90-8a5b-a6b9fc4960d4&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In client 1 (vm1) console&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[104844.990816] Lustre: DEBUG MARKER: /usr/sbin/lctl mark -----============= acceptance-small: pjdfstest ============----- Thu Nov 12 13:36:11 UTC 2020
[104845.232371] Lustre: DEBUG MARKER: -----============= acceptance-small: pjdfstest ============----- Thu Nov 12 13:36:11 UTC 2020
[104847.239225] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == run_pjdfstest test complete, duration -o sec ====================================================== 13:36:13 \(1605188173\)
[104847.480089] Lustre: DEBUG MARKER: == run_pjdfstest test complete, duration -o sec ====================================================== 13:36:13 (1605188173)
[104848.978709] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2&amp;gt;/dev/null || /usr/sbin/lctl lustre_build_version 2&amp;gt;/dev/null || /usr/sbin/lctl --version 2&amp;gt;/dev/null | cut -d&apos; &apos; -f2
[104849.221429] Lustre: DEBUG MARKER: /usr/sbin/lctl mark excepting tests: chown_00 utimensat_08
[104849.461831] Lustre: DEBUG MARKER: excepting tests: chown_00 utimensat_08
[105031.551815] nfs: server trevis-29vm4 not responding, still trying
[105055.615001] nfs: server trevis-29vm4 not responding, still trying
[105055.616105] nfs: server trevis-29vm4 not responding, timed out
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;But if you look at the last test suite that completed, parallel-scale-nfsv4 racer_on_nfs, you see segfaults&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[104509.071245] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 13:30:35 \(1605187835\)
[104509.316855] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 13:30:35 (1605187835)
[104509.542351] Lustre: DEBUG MARKER: MDSCOUNT=1 OSTCOUNT=7 LFS=/usr/bin/lfs /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/d0.parallel-scale-nfs
[104545.372425] 10[30478]: segfault at 8 ip 00007fbf2f4b07e8 sp 00007fff363082c0 error 4 in ld-2.17.so[7fbf2f4a5000+22000]
[104554.287433] 3[22368]: segfault at 8 ip 00007f0cf07fb7e8 sp 00007fffde597b50 error 4 in ld-2.17.so[7f0cf07f0000+22000]
[104597.669765] 4[4673]: segfault at 8 ip 00007f12b63407e8 sp 00007fff862aa300 error 4 in ld-2.17.so[7f12b6335000+22000]
[104815.417110] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2&amp;gt;/dev/null
[104815.783689] Lustre: DEBUG MARKER: rc=0; val=$(/usr/sbin/lctl get_param -n catastrophe 2&amp;gt;&amp;amp;1); if [[ $? -eq 0 &amp;amp;&amp;amp; $val -ne 0 ]]; then echo $(hostname -s): $val; rc=$val; fi; exit $rc
[104817.231922] Lustre: DEBUG MARKER: dmesg
[104818.127101] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv4 test complete, duration 2975 sec ============================================= 13:35:44 \(1605188144\)
[104818.381560] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test complete, duration 2975 sec ============================================= 13:35:44 (1605188144)
[104818.597884] Lustre: DEBUG MARKER: umount -f /mnt/lustre
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Looking at the OSS (vm3) console log, we see LNet issues&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[104421.383451] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 13:30:35 (1605187835)
[104541.739060] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[104541.740628] LNetError: Skipped 5 previous similar messages
[104541.741587] LNetError: 21967:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[104541.743275] LNetError: 21967:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 5 previous similar messages
[104541.744877] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[104541.747458] LNetError: Skipped 5 previous similar messages
[104727.505977] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2&amp;gt;/dev/null
[104728.383449] Lustre: DEBUG MARKER: rc=0;
[104728.383449] 			val=$(/usr/sbin/lctl get_param -n catastrophe 2&amp;gt;&amp;amp;1);
[104728.383449] 			if [[ $? -eq 0 &amp;amp;&amp;amp; $val -ne 0 ]]; then
[104728.383449] 				echo $(hostname -s): $val;
[104728.383449] 				rc=$val;
[104728.383449] 			fi;
[104728.383449] 			exit $rc
[104729.290527] Lustre: DEBUG MARKER: dmesg
[104730.187533] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv4 test complete, duration 2975 sec ============================================= 13:35:44 \(1605188144\)
[104730.428143] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test complete, duration 2975 sec ============================================= 13:35:44 (1605188144)
[104754.664116] Lustre: lustre-OST0000: Connection restored to 6ef95858-0684-3186-09bf-845f7502685d (at 10.9.5.101@tcp)
[104754.666078] Lustre: Skipped 6 previous similar messages
[104755.850798] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == rpc test complete, duration -o sec ================================================================ 13:36:09 \(1605188169\)
[104756.089115] Lustre: DEBUG MARKER: == rpc test complete, duration -o sec ================================================================ 13:36:09 (1605188169)
[104756.487966] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-29vm2.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[104756.726639] Lustre: DEBUG MARKER: trevis-29vm2.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[104757.052466] Lustre: DEBUG MARKER: /usr/sbin/lctl mark -----============= acceptance-small: pjdfstest ============----- Thu Nov 12 13:36:11 UTC 2020
[104757.308588] Lustre: DEBUG MARKER: -----============= acceptance-small: pjdfstest ============----- Thu Nov 12 13:36:11 UTC 2020
[104759.302601] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == run_pjdfstest test complete, duration -o sec ====================================================== 13:36:13 \(1605188173\)
[104759.557414] Lustre: DEBUG MARKER: == run_pjdfstest test complete, duration -o sec ====================================================== 13:36:13 (1605188173)
[104760.753243] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2&amp;gt;/dev/null ||
[104760.753243] 				/usr/sbin/lctl lustre_build_version 2&amp;gt;/dev/null ||
[104760.753243] 				/usr/sbin/lctl --version 2&amp;gt;/dev/null | cut -d&apos; &apos; -f2
[104761.283326] Lustre: DEBUG MARKER: /usr/sbin/lctl mark excepting tests: chown_00 utimensat_08
[104761.523935] Lustre: DEBUG MARKER: excepting tests: chown_00 utimensat_08
[104798.827334] Lustre: lustre-OST0003: haven&apos;t heard from client 6c4f2798-36a0-1283-9a2a-402a8215fdc6 (at 10.9.5.103@tcp) in 55 seconds. I think it&apos;s dead, and I am evicting it. exp ffff908f82928000, cur 1605188213 expire 1605188183 last 1605188158
[104798.831028] Lustre: Skipped 6 previous similar messages
[105141.722329] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
[105141.723961] LNetError: Skipped 6 previous similar messages
[105141.724931] LNetError: 21965:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
[105141.726576] LNetError: 21965:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 6 previous similar messages
[105141.728191] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
[105141.730788] LNetError: Skipped 6 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the console log of the MDS (vm4), we see issues during and after parallel-scale-nfsv4 racer_on_nfs&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[104422.879162] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test racer_on_nfs: racer on NFS client ======================================= 13:30:35 (1605187835)
[104428.673403] LustreError: 5794:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x3060:0x0] get parent: rc = -2
[104428.675745] LustreError: 5794:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 31 previous similar messages
[104445.738012] LustreError: 5795:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x3031:0x0] get parent: rc = -2
[104485.005485] LustreError: 5799:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x324c:0x0] get parent: rc = -2
[104485.008076] LustreError: 5799:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 16 previous similar messages
[104566.409419] LustreError: 5796:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x3502:0x0] get parent: rc = -2
[104566.411605] LustreError: 5796:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 63 previous similar messages
[104704.226358] LustreError: 5794:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) lustre: failure inode [0x20006e9a6:0x37e0:0x0] get parent: rc = -2
[104704.228861] LustreError: 5794:0:(llite_nfs.c:336:ll_dir_get_parent_fid()) Skipped 69 previous similar messages
[104729.025141] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2&amp;gt;/dev/null
[104730.349763] Lustre: DEBUG MARKER: rc=0;
[104730.349763] 			val=$(/usr/sbin/lctl get_param -n catastrophe 2&amp;gt;&amp;amp;1);
[104730.349763] 			if [[ $? -eq 0 &amp;amp;&amp;amp; $val -ne 0 ]]; then
[104730.349763] 				echo $(hostname -s): $val;
[104730.349763] 				rc=$val;
[104730.349763] 			fi;
[104730.349763] 			exit $rc
[104730.802926] Lustre: DEBUG MARKER: dmesg
[104731.684167] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv4 test complete, duration 2975 sec ============================================= 13:35:44 \(1605188144\)
[104731.944230] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test complete, duration 2975 sec ============================================= 13:35:44 (1605188144)
[104743.180814] Lustre: DEBUG MARKER: { [[ -e /etc/SuSE-release ]] &amp;amp;&amp;amp;
[104743.180814] 				 service nfsserver stop; } ||
[104743.180814] 				 service nfs stop
[104743.428775] nfsd: last server has exited, flushing export cache
[104743.619356] Lustre: DEBUG MARKER: sed -i &apos;/lustre/d&apos; /etc/exports
[104744.102408] Lustre: DEBUG MARKER: exportfs -v
[104744.498993] Lustre: DEBUG MARKER: grep -c /mnt/lustre&apos; &apos; /proc/mounts
[104744.870704] Lustre: DEBUG MARKER: lsof -t /mnt/lustre
[104745.380995] Lustre: DEBUG MARKER: umount -f /mnt/lustre 2&amp;gt;&amp;amp;1
[104745.568011] Lustre: setting import lustre-MDT0000_UUID INACTIVE by administrator request
[104745.569374] Lustre: Skipped 7 previous similar messages
[104745.570520] LustreError: 25275:0:(file.c:216:ll_close_inode_openhandle()) lustre-clilmv-ffff96e419fc4000: inode [0x20006e9a6:0x351a:0x0] mdc close failed: rc = -108
[104745.572860] LustreError: 25275:0:(file.c:216:ll_close_inode_openhandle()) Skipped 9 previous similar messages
[104745.585316] Lustre: 3529:0:(llite_lib.c:2762:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.5.103@tcp:/lustre/fid: [0x20006e9a6:0x38cd:0x0]/ may get corrupted (rc -108)
&#8230;
[104745.630358] Lustre: 3529:0:(llite_lib.c:2762:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.5.103@tcp:/lustre/fid: [0x20006e9a6:0x3912:0x0]/ may get corrupted (rc -108)
[104755.671260] Lustre: Unmounted lustre-client
[104756.077342] Lustre: MGS: Connection restored to 43522391-bc41-831c-eaa2-9a67ae6d0c48 (at 10.9.5.101@tcp)
[104756.079241] Lustre: Skipped 4 previous similar messages
[104757.351305] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == rpc test complete, duration -o sec ================================================================ 13:36:09 \(1605188169\)
[104757.588086] Lustre: DEBUG MARKER: == rpc test complete, duration -o sec ================================================================ 13:36:09 (1605188169)
[104757.986890] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-29vm2.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[104758.226754] Lustre: DEBUG MARKER: trevis-29vm2.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[104758.553655] Lustre: DEBUG MARKER: /usr/sbin/lctl mark -----============= acceptance-small: pjdfstest ============----- Thu Nov 12 13:36:11 UTC 2020
[104758.798368] Lustre: DEBUG MARKER: -----============= acceptance-small: pjdfstest ============----- Thu Nov 12 13:36:11 UTC 2020
[104760.797535] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == run_pjdfstest test complete, duration -o sec ====================================================== 13:36:13 \(1605188173\)
[104761.044320] Lustre: DEBUG MARKER: == run_pjdfstest test complete, duration -o sec ====================================================== 13:36:13 (1605188173)
[104761.408777] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2&amp;gt;/dev/null ||
[104761.408777] 				/usr/sbin/lctl lustre_build_version 2&amp;gt;/dev/null ||
[104761.408777] 				/usr/sbin/lctl --version 2&amp;gt;/dev/null | cut -d&apos; &apos; -f2
[104761.849199] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2&amp;gt;/dev/null ||
[104761.849199] 				/usr/sbin/lctl lustre_build_version 2&amp;gt;/dev/null ||
[104761.849199] 				/usr/sbin/lctl --version 2&amp;gt;/dev/null | cut -d&apos; &apos; -f2
[104762.783875] Lustre: DEBUG MARKER: /usr/sbin/lctl mark excepting tests: chown_00 utimensat_08
[104763.023125] Lustre: DEBUG MARKER: excepting tests: chown_00 utimensat_08
[104796.252581] Lustre: lustre-MDT0000: haven&apos;t heard from client 6c4f2798-36a0-1283-9a2a-402a8215fdc6 (at 0@lo) in 50 seconds. I think it&apos;s dead, and I am evicting it. exp ffff96e3f8f80c00, cur 1605188208 expire 1605188178 last 1605188158
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Sometimes pjdfstest does not hang, but simply fails; &lt;a href=&quot;https://testing.whamcloud.com/test_sets/01ee37b4-627e-46cf-abc2-c097b2ce5b69&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/01ee37b4-627e-46cf-abc2-c097b2ce5b69&lt;/a&gt;.&lt;/p&gt;</description>
                <environment></environment>
        <key id="61681">LU-14137</key>
            <summary>parallel-scale-nfsv4 racer_on_nfs server not responding leads pjdfstest to hang or fail</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="jamesanunez">James Nunez</reporter>
                        <labels>
                    </labels>
                <created>Mon, 16 Nov 2020 20:46:54 +0000</created>
                <updated>Tue, 19 Oct 2021 02:41:32 +0000</updated>
                                            <version>Lustre 2.14.0</version>
                    <version>Lustre 2.12.6</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>1</watches>
                                                                                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="48102">LU-9940</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i01f9b:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>