<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:20:25 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1871] MDS oops in mdsrate-create-small.sh: Thread overran stack, or stack corrupted</title>
                <link>https://jira.whamcloud.com/browse/LU-1871</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I think this is an MPI-related issue.&lt;br/&gt;
Test log from client 1:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[[55105,1],10]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: client-10vm1.lab.whamcloud.com

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
CMA: no RDMA devices found
CMA: no RDMA devices found
CMA: no RDMA devices found
CMA: no RDMA devices found
CMA: no RDMA devices found
CMA: no RDMA devices found
CMA: no RDMA devices found
CMA: no RDMA devices found
CMA: no RDMA devices found
CMA: no RDMA devices found
CMA: no RDMA devices found
CMA: no RDMA devices found
CMA: no RDMA devices found
CMA: no RDMA devices found
CMA: no RDMA devices found
CMA: no RDMA devices found
r= 0: create /mnt/lustre/d0.write_append_truncate/f0.wat, max size: 3703701, seed 1345680120: No such file or directory
r= 0 l=0000: WR A  203927/0x031c97, AP a  157830/0x026886, TR@  308317/0x04b45d
[client-10vm1.lab.whamcloud.com:12004] 15 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[client-10vm1.lab.whamcloud.com:12004] Set MCA parameter &quot;orte_base_help_aggregate&quot; to 0 to see all help / error messages
r= 0 l=1000: WR M  391981/0x05fb2d, AP m  363671/0x058c97, TR@  495994/0x07917a
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment>server/client: lustre-b2_3/build #1/RHEL6</environment>
        <key id="15584">LU-1871</key>
            <summary>MDS oops in mdsrate-create-small.sh: Thread overran stack, or stack corrupted</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="laisiyao">Lai Siyao</assignee>
                                    <reporter username="sarah">Sarah Liu</reporter>
                        <labels>
                            <label>releases</label>
                    </labels>
                <created>Thu, 23 Aug 2012 17:22:22 +0000</created>
                <updated>Wed, 12 Sep 2012 11:23:52 +0000</updated>
                            <resolved>Wed, 12 Sep 2012 11:23:52 +0000</resolved>
                                                                        <due></due>
                            <votes>0</votes>
                                    <watches>1</watches>
                                                                            <comments>
                            <comment id="44471" author="mdiep" created="Mon, 10 Sep 2012 00:31:55 +0000"  >&lt;p&gt;Are there more logs (dmesg, console)? The test should not fail even with these messages.&lt;/p&gt;</comment>
                            <comment id="44487" author="chris" created="Mon, 10 Sep 2012 09:36:28 +0000"  >&lt;p&gt;From Minh&apos;s comments, this seems like a Lustre test issue. The Lustre test needs to be fixed to fall back appropriately to a slower device.&lt;/p&gt;

&lt;p&gt;It is very hard to be sure because no link to the failing tests was provided; it&apos;s not even possible to know which test was running.&lt;/p&gt;</comment>
                            <comment id="44489" author="mdiep" created="Mon, 10 Sep 2012 09:43:51 +0000"  >&lt;p&gt;The &apos;lower performance&apos; warning can be ignored: &lt;a href=&quot;http://cac.engin.umich.edu/faq.html&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://cac.engin.umich.edu/faq.html&lt;/a&gt;&lt;br/&gt;
Maybe this will help: &lt;a href=&quot;http://www.open-mpi.org/community/lists/users/2012/02/18465.php&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://www.open-mpi.org/community/lists/users/2012/02/18465.php&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="44490" author="mdiep" created="Mon, 10 Sep 2012 09:49:08 +0000"  >&lt;p&gt;What has changed:&lt;/p&gt;

&lt;p&gt;before: + su mpiuser sh -c &quot;/usr/lib64/openmpi/bin/mpirun -mca boot ssh  -mca btl tcp,self -np 1 -machinefile /tmp/mdsrate-create-small.machines /usr/lib64/lustre/tests/mdsrate --create --time 600 --nfiles 129674 --dir /mnt/lustre/mdsrate/single --filefmt &apos;f%%d&apos; &quot;&lt;/p&gt;

&lt;p&gt;today: + su mpiuser sh -c &quot;/usr/lib64/openmpi/bin/mpirun -mca boot ssh -np 6 -machinefile /tmp/mdsrate-create-small.machines /usr/lib64/lustre/tests/mdsrate --create --time 600 --nfiles 128386 --dir /mnt/lustre/mdsrate/multi --filefmt &apos;f%%d&apos; &quot;&lt;/p&gt;

&lt;p&gt;We removed -mca btl tcp,self. However, adding those options back would only suppress the message; their absence is not causing any issue.&lt;/p&gt;

&lt;p&gt;If I recall correctly, we removed those because of an issue when running on IB. Perhaps we need to check whether we are running on IB or TCP and set the appropriate options.&lt;/p&gt;</comment>
                            <comment id="44491" author="mdiep" created="Mon, 10 Sep 2012 10:01:50 +0000"  >&lt;p&gt;This test passed a couple of days before: &lt;a href=&quot;https://maloo.whamcloud.com/test_sets/506f3352-fa9b-11e1-887d-52540035b04c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sets/506f3352-fa9b-11e1-887d-52540035b04c&lt;/a&gt;&lt;br/&gt;
It started failing on 9/9: &lt;a href=&quot;https://maloo.whamcloud.com/test_sets/693d643e-fa51-11e1-887d-52540035b04c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sets/693d643e-fa51-11e1-887d-52540035b04c&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The client log shows that mdsrate was hung:&lt;/p&gt;

&lt;p&gt;20:46:53:Lustre: 8471:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 3 previous similar messages&lt;br/&gt;
20:47:24:INFO: task mdsrate:21390 blocked for more than 120 seconds.&lt;br/&gt;
20:47:24:&quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot; disables this message.&lt;br/&gt;
20:47:24:mdsrate       D 0000000000000000     0 21390  21387 0x00000080&lt;br/&gt;
20:47:24: ffff88004c1d7be8 0000000000000082 ffff88000000012d ffffffff0000012d&lt;br/&gt;
20:47:24: ffff880000000065 ffffffffa0b404a0 000000000000012d ffffffffa0715cc5&lt;br/&gt;
20:47:24: ffff880048d5c638 ffff88004c1d7fd8 000000000000fb88 ffff880048d5c638&lt;br/&gt;
20:47:24:Call Trace:&lt;br/&gt;
20:47:24: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa06c042d&amp;gt;&amp;#93;&lt;/span&gt; ? lustre_msg_buf+0x5d/0x60 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
20:47:24: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa06ed4a6&amp;gt;&amp;#93;&lt;/span&gt; ? __req_capsule_get+0x176/0x750 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
20:47:24: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff814fefbe&amp;gt;&amp;#93;&lt;/span&gt; __mutex_lock_slowpath+0x13e/0x180&lt;br/&gt;
20:47:24: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff814fee5b&amp;gt;&amp;#93;&lt;/span&gt; mutex_lock+0x2b/0x50&lt;br/&gt;
20:47:24: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa06609b3&amp;gt;&amp;#93;&lt;/span&gt; mdc_close+0x193/0x9a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;mdc&amp;#93;&lt;/span&gt;&lt;br/&gt;
20:47:24: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa061b8c6&amp;gt;&amp;#93;&lt;/span&gt; lmv_close+0x2d6/0x5a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;lmv&amp;#93;&lt;/span&gt;&lt;br/&gt;
20:47:24: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa099c7af&amp;gt;&amp;#93;&lt;/span&gt; ll_close_inode_openhandle+0x30f/0x1050 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
20:47:24: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa099d69a&amp;gt;&amp;#93;&lt;/span&gt; ll_md_real_close+0x1aa/0x220 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
20:47:24: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa099d96b&amp;gt;&amp;#93;&lt;/span&gt; ll_md_close+0x25b/0x760 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
20:47:24: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff814ff546&amp;gt;&amp;#93;&lt;/span&gt; ? down_read+0x16/0x30&lt;br/&gt;
20:47:24: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa099df8b&amp;gt;&amp;#93;&lt;/span&gt; ll_file_release+0x11b/0x3e0 &lt;span class=&quot;error&quot;&gt;&amp;#91;lustre&amp;#93;&lt;/span&gt;&lt;br/&gt;
20:47:24: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8117ca65&amp;gt;&amp;#93;&lt;/span&gt; __fput+0xf5/0x210&lt;br/&gt;
20:47:24: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8117cba5&amp;gt;&amp;#93;&lt;/span&gt; fput+0x25/0x30&lt;br/&gt;
20:47:24: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff811785cd&amp;gt;&amp;#93;&lt;/span&gt; filp_close+0x5d/0x90&lt;br/&gt;
20:47:24: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff811786a5&amp;gt;&amp;#93;&lt;/span&gt; sys_close+0xa5/0x100&lt;br/&gt;
20:47:24: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100b0f2&amp;gt;&amp;#93;&lt;/span&gt; system_call_fastpath+0x16/0x1b&lt;br/&gt;
20:48:15:Lustre: 8471:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: &lt;span class=&quot;error&quot;&gt;&amp;#91;sent 1347162469/real 1347162469&amp;#93;&lt;/span&gt;  req@ffff88007a52dc00 x1412601028282197/t0(0) o250-&amp;gt;MGC10.10.4.170@tcp@10.10.4.170@tcp:26/25 lens 400/544 e 0 to 1 dl 1347162494 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1&lt;br/&gt;
20:48:15:Lustre: 8471:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 7 previous similar messages&lt;br/&gt;
20:49:26:INFO: task mdsrate:21390 blocked for more than 120 seconds.&lt;br/&gt;
20:49:26:&quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot; disables this message.&lt;/p&gt;


&lt;p&gt;Perhaps it is related to this Oops on the MDS:&lt;br/&gt;
&lt;a href=&quot;https://maloo.whamcloud.com/test_sets/693d643e-fa51-11e1-887d-52540035b04c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_sets/693d643e-fa51-11e1-887d-52540035b04c&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;https://maloo.whamcloud.com/test_logs/6a1db8f4-fa51-11e1-887d-52540035b04c/show_text&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://maloo.whamcloud.com/test_logs/6a1db8f4-fa51-11e1-887d-52540035b04c/show_text&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;23:02:04:Lustre: DEBUG MARKER: ===== mdsrate-create-small.sh&lt;br/&gt;
23:02:04:BUG: unable to handle kernel paging request at 0000000781bbc060&lt;br/&gt;
23:02:04:IP: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8104f1c7&amp;gt;&amp;#93;&lt;/span&gt; resched_task+0x17/0x80&lt;br/&gt;
23:02:04:PGD 374e5067 PUD 0 &lt;br/&gt;
23:02:04:Thread overran stack, or stack corrupted&lt;br/&gt;
23:02:04:Oops: 0000 &amp;#91;#1&amp;#93; SMP &lt;br/&gt;
23:02:04:last sysfs file: /sys/devices/system/cpu/possible&lt;br/&gt;
23:02:04:CPU 0 &lt;br/&gt;
23:02:04:Modules linked in: osd_ldiskfs(U) fsfilt_ldiskfs(U) ldiskfs(U) lustre(U) obdfilter(U) ost(U) cmm(U) mdt(U) mdd(U) mds(U) mgs(U) obdecho(U) mgc(U) lquota(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) sha512_generic sha256_generic jbd2 nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod &lt;span class=&quot;error&quot;&gt;&amp;#91;last unloaded: libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
23:02:04:&lt;br/&gt;
23:02:04:Pid: 21654, comm: mdt00_001 Not tainted 2.6.32-279.5.1.el6_lustre.x86_64 #1 Red Hat KVM&lt;br/&gt;
23:02:04:RIP: 0010:&lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8104f1c7&amp;gt;&amp;#93;&lt;/span&gt;  &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8104f1c7&amp;gt;&amp;#93;&lt;/span&gt; resched_task+0x17/0x80&lt;br/&gt;
23:02:04:RSP: 0018:ffff880002203de8  EFLAGS: 00010087&lt;br/&gt;
23:02:04:RAX: 00000000000166c0 RBX: ffff880068462b18 RCX: 00000000ffff8800&lt;br/&gt;
23:02:04:RDX: ffff880031c2a000 RSI: 0000000000000400 RDI: ffff880068462ae0&lt;br/&gt;
23:02:04:RBP: ffff880002203de8 R08: 0000000000989680 R09: 0000000000000000&lt;br/&gt;
23:02:04:R10: 0000000000000010 R11: 0000000000000000 R12: ffff880002216728&lt;br/&gt;
23:02:04:R13: 0000000000000000 R14: 0000000000000000 R15: ffff880068462ae0&lt;br/&gt;
23:02:04:FS:  00007f0480a26700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000&lt;br/&gt;
23:02:04:CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b&lt;br/&gt;
23:02:04:CR2: 0000000781bbc060 CR3: 000000007089b000 CR4: 00000000000006f0&lt;br/&gt;
23:02:04:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000&lt;br/&gt;
23:02:04:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400&lt;br/&gt;
23:02:04:Process mdt00_001 (pid: 21654, threadinfo ffff880031c2a000, task ffff880068462ae0)&lt;br/&gt;
23:02:04:Stack:&lt;br/&gt;
23:02:04: ffff880002203e18 ffffffff8105484c ffff8800022166c0 0000000000000000&lt;br/&gt;
23:02:04:&amp;lt;d&amp;gt; 00000000000166c0 0000000000000000 ffff880002203e58 ffffffff81057fa1&lt;br/&gt;
23:02:04:&amp;lt;d&amp;gt; ffff880002203e58 ffff880068462ae0 0000000000000000 0000000000000000&lt;br/&gt;
23:02:04:Call Trace:&lt;br/&gt;
23:02:04: &amp;lt;IRQ&amp;gt; &lt;/p&gt;</comment>
                            <comment id="44509" author="adilger" created="Mon, 10 Sep 2012 12:14:00 +0000"  >&lt;p&gt;We need to run &quot;checkstack&quot; on the current master, as well as on an older Lustre (maybe b2_1), so that we can see which functions have grown in stack usage. Unfortunately, there is no stack trace that shows what the call path is.&lt;/p&gt;</comment>
                            <comment id="44519" author="pjones" created="Mon, 10 Sep 2012 12:58:03 +0000"  >&lt;p&gt;Lai&lt;/p&gt;

&lt;p&gt;Could you please look into this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="44593" author="laisiyao" created="Tue, 11 Sep 2012 10:20:57 +0000"  >&lt;p&gt;I reproduced this once locally, but the result doesn&apos;t print a backtrace either. I&apos;ll try to reproduce it again and find more useful information from before the MDS oops.&lt;/p&gt;</comment>
                            <comment id="44693" author="laisiyao" created="Wed, 12 Sep 2012 11:23:52 +0000"  >&lt;p&gt;This is a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1881&quot; title=&quot;sanity test 116 soft lockup&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1881&quot;&gt;&lt;del&gt;LU-1881&lt;/del&gt;&lt;/a&gt;, and this test passes with the patch for LU-1881.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzus4n:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>2219</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>