<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:53:12 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-12508] (llite_mmap.c:71:our_vma()) ASSERTION( !down_write_trylock(&amp;mm-&gt;mmap_sem) ) failed when writing in multiple threads</title>
                <link>https://jira.whamcloud.com/browse/LU-12508</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
2019-07-02T01:45:11-05:00 nanny1926 kernel: LustreError:
 251884:0:(llite_mmap.c:71:our_vma()) ASSERTION(
 !down_write_trylock(&amp;amp;mm-&amp;gt;mmap_sem) ) failed:
 2019-07-02T01:45:11-05:00 nanny1926 kernel: LustreError:
 251884:0:(llite_mmap.c:71:our_vma()) LBUG
 2019-07-02T01:45:11-05:00 nanny1926 kernel: Pid: 251884, comm: java
 2019-07-02T01:45:11-05:00 nanny1926 kernel: #012Call Trace:
 2019-07-02T01:45:11-05:00 nanny1926 kernel: [&amp;lt;ffffffffc03d67ae&amp;gt;]
 libcfs_call_trace+0x4e/0x60 [libcfs]
 2019-07-02T01:45:11-05:00 nanny1926 kernel: [&amp;lt;ffffffffc03d683c&amp;gt;]
 lbug_with_loc+0x4c/0xb0 [libcfs]
 2019-07-02T01:45:11-05:00 nanny1926 kernel: [&amp;lt;ffffffffc116e66b&amp;gt;]
 our_vma+0x16b/0x170 [lustre]
 2019-07-02T01:45:11-05:00 nanny1926 kernel: [&amp;lt;ffffffffc11857f9&amp;gt;]
 vvp_io_rw_lock+0x409/0x6e0 [lustre]
 2019-07-02T01:45:11-05:00 nanny1926 kernel: [&amp;lt;ffffffffc0fbb312&amp;gt;] ?
 lov_io_iter_init+0x302/0x8b0 [lov]
 2019-07-02T01:45:11-05:00 nanny1926 kernel: [&amp;lt;ffffffffc1185b29&amp;gt;]
 vvp_io_write_lock+0x59/0xf0 [lustre]
 2019-07-02T01:45:11-05:00 nanny1926 kernel: [&amp;lt;ffffffffc063ebec&amp;gt;]
 cl_io_lock+0x5c/0x3d0 [obdclass]
 2019-07-02T01:45:11-05:00 nanny1926 kernel: [&amp;lt;ffffffffc063f1db&amp;gt;]
 cl_io_loop+0x11b/0xc90 [obdclass]
 2019-07-02T01:45:11-05:00 nanny1926 kernel: [&amp;lt;ffffffffc1133258&amp;gt;]
 ll_file_io_generic+0x498/0xc40 [lustre]
 2019-07-02T01:45:11-05:00 nanny1926 kernel: [&amp;lt;ffffffffc1133cdd&amp;gt;]
 ll_file_aio_write+0x12d/0x1f0 [lustre]
 2019-07-02T01:45:11-05:00 nanny1926 kernel: [&amp;lt;ffffffffc1133e6e&amp;gt;]
 ll_file_write+0xce/0x1e0 [lustre]
 2019-07-02T01:45:11-05:00 nanny1926 kernel: [&amp;lt;ffffffff81200cad&amp;gt;]
 vfs_write+0xbd/0x1e0
 2019-07-02T01:45:11-05:00 nanny1926 kernel: [&amp;lt;ffffffff8111f394&amp;gt;] ?
 __audit_syscall_entry+0xb4/0x110
 2019-07-02T01:45:11-05:00 nanny1926 kernel: [&amp;lt;ffffffff81201abf&amp;gt;]
 SyS_write+0x7f/0xe0
 2019-07-02T01:45:11-05:00 nanny1926 kernel: [&amp;lt;ffffffff816b5292&amp;gt;]
 tracesys+0xdd/0xe2
 2019-07-02T01:45:11-05:00 nanny1926 kernel:
 2019-07-02T01:45:11-05:00 nanny1926 kernel: Kernel panic - not syncing: LBUG
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;It is reading in up to 256 threads, and writing 16 files in up to 16 threads.&#160;&lt;br/&gt;
&#160;&lt;br/&gt;
It is reproducible (though it does not fail every time) on this particular machine, which might just come down to particular network timing.&lt;br/&gt;
I will try to reproduce it on another machine and get back to you if successful.&lt;br/&gt;
&#160;&lt;br/&gt;
Any ideas why this lock would have failed?&lt;br/&gt;
A quick analysis shows that the only place our_vma is called from is lustre/llite/vvp_io.c:453, and it only acquires the read lock:&lt;br/&gt;
vvp_mmap_locks:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
452 &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; down_read(&amp;amp;mm-&amp;gt;mmap_sem);
&#160;453 &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt;((vma = our_vma(mm, addr, count)) != NULL) {
&#160;454 &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; struct dentry *de = file_dentry(vma-&amp;gt;vm_file);
&#160;455 &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; struct inode *inode = de-&amp;gt;d_inode;
&#160;456 &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; flags = CEF_MUST;
&#160;&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;whereas our_vma has this:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
70 &#160; &#160; &#160; &#160; &lt;span class=&quot;code-comment&quot;&gt;/* mmap_sem must have been held by caller. */&lt;/span&gt;
71 &#160; &#160; &#160; &#160; LASSERT(!down_write_trylock(&amp;amp;mm-&amp;gt;mmap_sem));
&#160;&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
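&lt;p&gt;For reference, a sketch (my own, not code from the tree) of the rwsem semantics involved: down_write_trylock() never blocks or spins; it returns nonzero only when it actually takes the semaphore, which requires that no reader or writer holds it at all. So with mmap_sem held for read by the caller, the trylock must fail and the LASSERT should always hold:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
down_read(&amp;amp;mm-&amp;gt;mmap_sem);                    /* reader count now at least 1 */
/* ... later, in our_vma(): trylock can only succeed if the count is 0 ... */
LASSERT(!down_write_trylock(&amp;amp;mm-&amp;gt;mmap_sem)); /* must hold while any reader exists */
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Concurrent readers alone therefore cannot make the trylock succeed; if the assertion fires, the reader count must have transiently reached zero.&lt;/p&gt;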
&lt;p&gt;So I guess that if there are multiple threads in vvp_mmap_locks and more than one happens to acquire the read lock, or one of them acquires the write lock, then the trylock in the other would fail, no?&lt;/p&gt;</description>
                <environment>Linux nanny1926 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux</environment>
        <key id="56271">LU-12508</key>
            <summary>(llite_mmap.c:71:our_vma()) ASSERTION( !down_write_trylock(&amp;mm-&gt;mmap_sem) ) failed when writing in multiple threads</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="6">Not a Bug</resolution>
                                        <assignee username="ys">Yang Sheng</assignee>
                                    <reporter username="Tomaka">Jacek Tomaka</reporter>
                        <labels>
                    </labels>
                <created>Fri, 5 Jul 2019 01:13:08 +0000</created>
                <updated>Mon, 13 Dec 2021 13:21:29 +0000</updated>
                            <resolved>Tue, 19 Nov 2019 02:49:31 +0000</resolved>
                                    <version>Lustre 2.10.1</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="250698" author="ys" created="Fri, 5 Jul 2019 01:46:29 +0000"  >&lt;p&gt;Hi, Jacek,&lt;/p&gt;

&lt;p&gt;Please upgrade the kernel to 3.10.0-957.12.1 to fix this issue.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
YangSheng &lt;/p&gt;</comment>
                            <comment id="250700" author="tomaka" created="Fri, 5 Jul 2019 02:18:21 +0000"  >&lt;p&gt;Hi Yang, &lt;/p&gt;

&lt;p&gt;Thank you!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://access.redhat.com/solutions/3393611&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://access.redhat.com/solutions/3393611&lt;/a&gt; also mentions that the issue is fixed in &lt;tt&gt;kernel-3.10.0-693.47.2.el7&lt;/tt&gt; for CentOS 7.4. Would that work as well?&lt;/p&gt;

&lt;p&gt;Regards.&lt;/p&gt;

&lt;p&gt;Jacek Tomaka&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="250701" author="ys" created="Fri, 5 Jul 2019 02:26:57 +0000"  >&lt;p&gt;Hi, Jacek,&lt;/p&gt;

&lt;p&gt;Yes, 693.47.2 also included this fix.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
YangSheng&lt;/p&gt;</comment>
                            <comment id="250828" author="ys" created="Mon, 8 Jul 2019 13:51:03 +0000"  >&lt;p&gt;Hi, Jacek,&lt;/p&gt;

&lt;p&gt;Just for the record, could you please tell us what kind of machine this was reproducible on? Especially the CPU architecture.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
YangSheng&lt;/p&gt;</comment>
                            <comment id="250865" author="tomaka" created="Mon, 8 Jul 2019 22:58:29 +0000"  >&lt;p&gt;Hi YangSheng,&#160;&lt;/p&gt;

&lt;p&gt;So far I managed to reproduce it on three nodes:&#160;&lt;/p&gt;

&lt;p&gt;Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz (three times); twice it happened straight after reboot, and once it required running overnight in a loop, crashing by the morning.&#160;&lt;/p&gt;

&lt;p&gt;Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz (two different nodes, once each).&lt;/p&gt;

&lt;p&gt;I believe we started seeing it after a change that made one of the workflows write in parallel.&#160;&lt;/p&gt;

&lt;p&gt;Please let me know if you need more information.&lt;/p&gt;

&lt;p&gt;We still have not deployed the newer version of the kernel for testing, but I will confirm whether it fixes the issue.&#160;&lt;/p&gt;

&lt;p&gt;Regards.&lt;/p&gt;

&lt;p&gt;Jacek Tomaka&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="250994" author="tomaka" created="Thu, 11 Jul 2019 07:34:10 +0000"  >&lt;p&gt;Hi YangSheng, &lt;br/&gt;
I managed to reproduce it with Lustre 2.12.2 and kernel 3.10.0-957.21.3 (CentOS 7.6).&lt;br/&gt;
This time it is a read!&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Lustre: Lustre: Build Version: 2.12.2
2019-07-11T02:13:50-05:00 nanny2309 kernel: LustreError: 161434:0:(llite_mmap.c:61:our_vma()) ASSERTION( !down_write_trylock(&amp;amp;mm-&amp;gt;mmap_sem) ) failed: 
2019-07-11T02:13:50-05:00 nanny2309 kernel: LustreError: 161434:0:(llite_mmap.c:61:our_vma()) LBUG
2019-07-11T02:13:50-05:00 nanny2309 kernel: Pid: 161434, comm: pool-HeadersRea 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019
2019-07-11T02:13:50-05:00 nanny2309 kernel: Call Trace:
2019-07-11T02:13:50-05:00 nanny2309 kernel: [&amp;lt;ffffffffc073b7cc&amp;gt;] libcfs_call_trace+0x8c/0xc0 [libcfs]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [&amp;lt;ffffffffc073b87c&amp;gt;] lbug_with_loc+0x4c/0xa0 [libcfs]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [&amp;lt;ffffffffc0e3cb4b&amp;gt;] our_vma+0x16b/0x170 [lustre]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [&amp;lt;ffffffffc0e522e9&amp;gt;] vvp_io_rw_lock+0x439/0x760 [lustre]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [&amp;lt;ffffffffc0e526ce&amp;gt;] vvp_io_read_lock+0x3e/0xe0 [lustre]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [&amp;lt;ffffffffc08942ff&amp;gt;] cl_io_lock+0x5f/0x3d0 [obdclass]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [&amp;lt;ffffffffc089488a&amp;gt;] cl_io_loop+0xba/0x1c0 [obdclass]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [&amp;lt;ffffffffc0e0b5b0&amp;gt;] ll_file_io_generic+0x590/0xcb0 [lustre]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [&amp;lt;ffffffffc0e0c868&amp;gt;] ll_file_aio_read+0x2b8/0x3d0 [lustre]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [&amp;lt;ffffffffc0e0ca24&amp;gt;] ll_file_read+0xa4/0x170 [lustre]
2019-07-11T02:13:50-05:00 nanny2309 kernel: [&amp;lt;ffffffffa32416ff&amp;gt;] vfs_read+0x9f/0x170
2019-07-11T02:13:50-05:00 nanny2309 kernel: [&amp;lt;ffffffffa32425bf&amp;gt;] SyS_read+0x7f/0xf0
2019-07-11T02:13:50-05:00 nanny2309 kernel: [&amp;lt;ffffffffa377606b&amp;gt;] tracesys+0xa3/0xc9
2019-07-11T02:13:50-05:00 nanny2309 kernel: [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
2019-07-11T02:13:50-05:00 nanny2309 kernel: Kernel panic - not syncing: LBUG
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So I was wondering if you could answer my question about the code analysis I did in the description of this bug, i.e. why are we expecting down_write_trylock to succeed when the caller only holds the read lock? When there is contention it will obviously fail. Shouldn&apos;t it in this case just spin?&lt;/p&gt;</comment>
                            <comment id="251000" author="ys" created="Thu, 11 Jul 2019 12:30:30 +0000"  >&lt;p&gt;Hi, Jacek,&lt;/p&gt;

&lt;p&gt;OK, that is really a surprise. Can you get a vmcore for analysis? We expect a failed return from down_write_trylock there; it means the caller already holds this lock, and then the ASSERTION would not be triggered. &lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Yangsheng&lt;/p&gt;</comment>
                            <comment id="251005" author="simmonsja" created="Thu, 11 Jul 2019 13:29:18 +0000"  >&lt;p&gt;It looks like you are still having problems. Can you try&#160;&lt;a href=&quot;https://review.whamcloud.com/#/c/35271/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/35271&lt;/a&gt;&#160;to see if it helps&lt;/p&gt;</comment>
                            <comment id="251598" author="tomaka" created="Thu, 18 Jul 2019 02:44:35 +0000"  >&lt;p&gt;Yang Sheng: &lt;br/&gt;
Sorry for the delay; it looks like setting up kdump was a bit of a task &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/wink.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt; But I think I have it now. With kdump in place, the reproducer has been running for a couple of hours and still has not failed, but I will keep it running. &lt;br/&gt;
 My kdump.conf says:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;core_collector makedumpfile -F -l --message-level 1 -d 31
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
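&lt;p&gt;(For context, if I read the makedumpfile documentation correctly, dump level 31 is the sum of all five page-exclusion bits:)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# -d 31 = 1 + 2 + 4 + 8 + 16, excluding:
#    1  pages filled with zero
#    2  non-private cache pages
#    4  private cache pages
#    8  user-process data pages
#   16  free pages
# Kernel data structures (task structs, semaphores, etc.) are still captured.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;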
&lt;p&gt;Will that be good enough, or will you need some of the pages that I have excluded?&lt;/p&gt;

&lt;p&gt;James A Simmons: &lt;br/&gt;
I can give it a go once I give you the dump. Why do you think it will help?&lt;/p&gt;</comment>
                            <comment id="251612" author="tomaka" created="Thu, 18 Jul 2019 04:29:10 +0000"  >&lt;p&gt;OK, Yang Sheng, I do have a kernel crash dump. It is 5.4GB. We are happy to give you access to it or upload it somewhere, but unfortunately we cannot make it public for everyone. &lt;br/&gt;
Alternatively, I can be your proxy: if you let me know what commands to issue against the dump, I can run them and paste the output. &lt;br/&gt;
Please let me know how you would like to proceed. &lt;br/&gt;
Regards.&lt;br/&gt;
Jacek Tomaka&lt;/p&gt;</comment>
                            <comment id="251636" author="pjones" created="Thu, 18 Jul 2019 13:20:37 +0000"  >&lt;p&gt; Sometimes the diagnostic data collected as part of Lustre troubleshooting is too large to be attached to a JIRA ticket. For these cases, Whamcloud provides an anonymous write-only FTP upload service. In order to use this service, you&apos;ll need an FTP client (e.g. ncftp, ftp, etc.) and a JIRA issue. Use the &apos;uploads&apos; directory and create a new subdirectory using your Jira issue as a name.&lt;br/&gt;
In the following example, there are three debug logs in a single directory and the JIRA issue &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4242&quot; title=&quot;mdt_open.c:1685:mdt_reint_open()) LBUG&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4242&quot;&gt;&lt;del&gt;LU-4242&lt;/del&gt;&lt;/a&gt; has been created. After completing the upload, please update the relevant issue with a note mentioning the upload, so that our engineers know where to find your logs.&lt;br/&gt;
$ ls -lh&lt;br/&gt;
total 333M&lt;br/&gt;
-rw-r--r-- 1 mjmac mjmac 98M Feb 23 17:36 mds-debug&lt;br/&gt;
-rw-r--r-- 1 mjmac mjmac 118M Feb 23 17:37 oss-00-debug&lt;br/&gt;
-rw-r--r-- 1 mjmac mjmac 118M Feb 23 17:37 oss-01-debug&lt;br/&gt;
$ ncftp ftp.whamcloud.com&lt;br/&gt;
NcFTP 3.2.2 (Sep 04, 2008) by Mike Gleason (&lt;a href=&quot;http://www.NcFTP.com/contact/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://www.NcFTP.com/contact/&lt;/a&gt;).&lt;br/&gt;
Connecting to 99.96.190.235...&lt;br/&gt;
(vsFTPd 2.2.2)&lt;br/&gt;
Logging in...&lt;br/&gt;
Login successful.&lt;br/&gt;
Logged in to ftp.whamcloud.com.&lt;br/&gt;
ncftp / &amp;gt; cd uploads&lt;br/&gt;
Directory successfully changed.&lt;br/&gt;
ncftp /uploads &amp;gt; mkdir &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4242&quot; title=&quot;mdt_open.c:1685:mdt_reint_open()) LBUG&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4242&quot;&gt;&lt;del&gt;LU-4242&lt;/del&gt;&lt;/a&gt;&lt;br/&gt;
ncftp /uploads &amp;gt; cd &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4242&quot; title=&quot;mdt_open.c:1685:mdt_reint_open()) LBUG&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4242&quot;&gt;&lt;del&gt;LU-4242&lt;/del&gt;&lt;/a&gt;&lt;br/&gt;
Directory successfully changed.&lt;br/&gt;
ncftp /uploads/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4242&quot; title=&quot;mdt_open.c:1685:mdt_reint_open()) LBUG&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4242&quot;&gt;&lt;del&gt;LU-4242&lt;/del&gt;&lt;/a&gt; &amp;gt; put *&lt;br/&gt;
mds-debug: 97.66 MB 11.22 MB/s&lt;br/&gt;
oss-00-debug: 117.19 MB 11.16 MB/s&lt;br/&gt;
oss-01-debug: 117.48 MB 11.18 MB/s&lt;br/&gt;
ncftp /uploads/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4242&quot; title=&quot;mdt_open.c:1685:mdt_reint_open()) LBUG&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4242&quot;&gt;&lt;del&gt;LU-4242&lt;/del&gt;&lt;/a&gt; &amp;gt;&lt;br/&gt;
Please note that this is a WRITE-ONLY FTP service, so you will not be able to see (with ls) the files or directories you&apos;ve created, nor will you (or anyone other than Whamcloud staff) be able to see or read them.&lt;/p&gt;</comment>
                            <comment id="251683" author="tomaka" created="Fri, 19 Jul 2019 01:50:03 +0000"  >&lt;p&gt;Hi Peter!&lt;br/&gt;
Thank you for the description: &lt;br/&gt;
Uploaded the files: &lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
ls -la uploads/LU-12508
-rw-r--r--    1 99       50            101 Jul 19 01:46 md5.sums
-rw-r--r--    1 99       50         149799 Jul 19 01:45 vmcore-dmesg.txt
-rw-r--r--    1 99       50       4032739061 Jul 19 01:46 vmcore.flat.bz2
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Also, please treat this data as sensitive and delete as soon as you are done with investigation. Please do not share with third parties. &lt;br/&gt;
Happy hunting!&lt;/p&gt;</comment>
                            <comment id="251934" author="tomaka" created="Wed, 24 Jul 2019 15:02:59 +0000"  >&lt;p&gt;Hello, any news on this?&lt;/p&gt;

&lt;p&gt;Here is the semaphore state from the crashed kernel:&#160;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; mmap_sem = {
    {
      count = {
        counter = 0xfffffffe00000002
      }, 
      __UNIQUE_ID_rh_kabi_hide4 = {
        count = 0xfffffffe00000002
      }, 
      {&amp;lt;No data fields&amp;gt;}
    }, 
    wait_lock = {
      raw_lock = {
        val = {
          counter = 0
        }
      }
    }, 
    osq = {
      tail = {
        counter = 0
      }
    }, 
    wait_list = {
      next = 0xffff9b2d10fa7a30
    }, 
    owner = 0x1
  }, 
 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;BTW: do you know if the patch that RedHat claims fixes this problem is&#160;&lt;a href=&quot;https://lkml.org/lkml/2018/11/29/961&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://lkml.org/lkml/2018/11/29/961&lt;/a&gt; ?&lt;/p&gt;

&lt;p&gt;Also, when I reduce the number of writer threads to 4 in our workload, I can no longer reproduce the problem.&#160;&lt;/p&gt;

&lt;p&gt;I have also tested without adaptive ticks (without setting nohz_full=1-255) and the problem is still there (with 16 writer threads).&#160;&lt;/p&gt;</comment>
                            <comment id="251999" author="ys" created="Thu, 25 Jul 2019 02:47:49 +0000"  >&lt;p&gt;Hi, Jacek,&lt;/p&gt;

&lt;p&gt;From the source diff, we can conclude this patch is the only change between the 957 kernel and the previous one. I think this issue was caused by some memory-barrier problem. It only shows up on certain kinds of CPU, but I still cannot narrow down the range. It looks like you have a reliable way to reproduce it; can you share it?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
YangSheng&lt;/p&gt;</comment>
                            <comment id="254101" author="ys" created="Wed, 4 Sep 2019 15:54:40 +0000"  >&lt;p&gt;Hi, Guys,&lt;/p&gt;

&lt;p&gt;Please upgrade to the latest RHEL 7.7 kernel to fix this issue.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
YangSheng&lt;/p&gt;</comment>
                            <comment id="254151" author="tomaka" created="Thu, 5 Sep 2019 02:53:39 +0000"  >&lt;p&gt;YangSheng, &lt;br/&gt;
Would you be so kind as to point me to the bug fix that you think solves the issue? &lt;br/&gt;
It requires significant effort to test on a newer version, and no one likes to be sent on a wild goose chase, as happened with 7.6 before. &lt;br/&gt;
Regards.&lt;br/&gt;
Jacek Tomaka&lt;/p&gt;</comment>
                            <comment id="254156" author="ys" created="Thu, 5 Sep 2019 06:32:50 +0000"  >&lt;p&gt;Hi, Jacek,&lt;/p&gt;

&lt;p&gt;This is patch comment:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;From a9e9bcb45b1525ba7aea26ed9441e8632aeeda58 Mon Sep 17 00:00:00 2001
From: Waiman Long &amp;lt;longman@redhat.com&amp;gt;
Date: Sun, 28 Apr 2019 17:25:38 -0400
Subject: [PATCH] locking/rwsem: Prevent decrement of reader count before
 increment

During my rwsem testing, it was found that after a down_read(), the
reader count may occasionally become 0 or even negative. Consequently,
a writer may steal the lock at that time and execute with the reader
in parallel thus breaking the mutual exclusion guarantee of the write
lock. In other words, both readers and writer can become rwsem owners
simultaneously.

The current reader wakeup code does it in one pass to clear waiter-&amp;gt;task
and put them into wake_q before fully incrementing the reader count.
Once waiter-&amp;gt;task is cleared, the corresponding reader may see it,
finish the critical section and do unlock to decrement the count before
the count is incremented. This is not a problem if there is only one
reader to wake up as the count has been pre-incremented by 1.  It is
a problem if there are more than one readers to be woken up and writer
can steal the lock.

The wakeup was actually done in 2 passes before the following v4.9 commit:

  70800c3c0cc5 (&quot;locking/rwsem: Scan the wait_list for readers only once&quot;)

To fix this problem, the wakeup is now done in two passes
again. In the first pass, we collect the readers and count them.
The reader count is then fully incremented. In the second pass, the
waiter-&amp;gt;task is then cleared and they are put into wake_q to be woken
up later.
Thanks,

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt; 
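&lt;p&gt;As an illustration (my own sketch, not from the patch itself), the race the commit message describes with the one-pass wakeup looks roughly like this:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;waker (one-pass wakeup)                 woken reader
-----------------------                 ------------
clears waiter-&amp;gt;task,
puts reader on wake_q
                                        wakes, finishes its critical section,
                                        calls up_read(): count decremented
                                        before the waker has incremented it
reader count transiently hits 0 (or goes negative)
a writer&apos;s down_write_trylock() sees count == 0 and succeeds,
so LASSERT(!down_write_trylock(...)) fires
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;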

&lt;p&gt;Thanks,&lt;br/&gt;
YangSheng&lt;/p&gt;</comment>
                            <comment id="254157" author="tomaka" created="Thu, 5 Sep 2019 06:41:00 +0000"  >&lt;p&gt;That looks reasonable. Thanks!&lt;/p&gt;</comment>
                            <comment id="258486" author="tomaka" created="Tue, 19 Nov 2019 02:49:31 +0000"  >&lt;p&gt;Hi YangSheng, &lt;br/&gt;
I would like to confirm that applying the patch you referenced makes the otherwise reliable reproducer no longer hit this issue. Thank you for your help!&lt;br/&gt;
Regards.&lt;br/&gt;
Jacek Tomaka&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                            <outwardlinks description="duplicates">
                                        <issuelink>
            <issuekey id="50852">LU-10678</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                                        </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="66806">LU-15156</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00j7j:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>