<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:50:09 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-5284] GPF in radix_tree_lookup_slot on OSS</title>
                <link>https://jira.whamcloud.com/browse/LU-5284</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;OSS hits a general protection fault with the following trace:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;PID: 22785  TASK: ffff880612f90830  CPU: 1   COMMAND: &quot;ll_ost_io_1011&quot;
 #0 [ffff880612fd7620] machine_kexec at ffffffff8102902b
 #1 [ffff880612fd7680] crash_kexec at ffffffff810a5292
 #2 [ffff880612fd7750] oops_end at ffffffff8149a050
 #3 [ffff880612fd7780] die at ffffffff8100714b
 #4 [ffff880612fd77b0] do_general_protection at ffffffff81499be2
 #5 [ffff880612fd77e0] general_protection at ffffffff814993b5
    [exception RIP: radix_tree_lookup_slot+5]
    RIP: ffffffff81261465  RSP: ffff880612fd7890  RFLAGS: 00010286
    RAX: e940201000000010  RBX: e940201000000008  RCX: 0000000000000000
    RDX: 00000000000200d2  RSI: 0000000000000000  RDI: e940201000000008
    RBP: ffff880612fd78b0   R8: ffff880612fdc140   R9: 0000000000000008
    R10: 0000000000001000  R11: 0000000000000001  R12: 0000000000000000
    R13: 0000000000000000  R14: e940201000000000  R15: 20105fa000080221
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffff880612fd7898] find_get_page at ffffffff810ffe8e
 #7 [ffff880612fd78b8] find_lock_page at ffffffff8110112a
 #8 [ffff880612fd78e8] find_or_create_page at ffffffff8110129f
 #9 [ffff880612fd7938] filter_get_page at ffffffffa0c4b065 [obdfilter]
#10 [ffff880612fd7968] filter_preprw_read at ffffffffa0c4d64d [obdfilter]
#11 [ffff880612fd7a98] filter_preprw at ffffffffa0c4dedc [obdfilter]
#12 [ffff880612fd7ad8] obd_preprw at ffffffffa0c09051 [ost]
#13 [ffff880612fd7b48] ost_brw_read at ffffffffa0c10091 [ost]
#14 [ffff880612fd7c88] ost_handle at ffffffffa0c16423 [ost]
#15 [ffff880612fd7da8] ptlrpc_main at ffffffffa07fd4e6 [ptlrpc]
#16 [ffff880612fd7f48] kernel_thread at ffffffff8100412a
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Looking at the dump, there appears to be a race leading to corruption of the struct inode that is passed from filter_preprw_read() to filter_get_page().&lt;/p&gt;

&lt;p&gt;I have attached my complete dump-analysis log.&lt;/p&gt;

&lt;p&gt;I can also upload the dump on request.&lt;/p&gt;</description>
                <environment>bullx supercomputer suite</environment>
        <key id="25406">LU-5284</key>
            <summary>GPF in radix_tree_lookup_slot on OSS</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="bfaccini">Bruno Faccini</assignee>
                                    <reporter username="spiechurski">Sebastien Piechurski</reporter>
                        <labels>
                            <label>p4b</label>
                    </labels>
                <created>Wed, 2 Jul 2014 14:22:40 +0000</created>
                <updated>Wed, 7 Jun 2017 12:02:01 +0000</updated>
                            <resolved>Wed, 7 Jun 2017 12:02:01 +0000</resolved>
                                    <version>Lustre 2.1.6</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="87967" author="pjones" created="Wed, 2 Jul 2014 14:34:20 +0000"  >&lt;p&gt;Bruno&lt;/p&gt;

&lt;p&gt;Could you please look into this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="87972" author="bfaccini" created="Wed, 2 Jul 2014 14:50:01 +0000"  >&lt;p&gt;Hello Seb,&lt;br/&gt;
Thanks for your crash-dump analysis log! But can you check for me whether the %rbp value at the time of the filter_get_page() call in filter_preprw_read() is really 0xffff880040aa57c0? It should be saved at location 0xffff880612fd7970 on the stack, but it is better to confirm this by disassembling the first instructions of filter_get_page() ...&lt;br/&gt;
Also, can you check whether the inode address you found (0xffff880040aa57c0) is part of one of the inode slab areas/caches?&lt;/p&gt;

&lt;p&gt;Or maybe you can upload the crash-dump if you want?&lt;/p&gt;</comment>
                            <comment id="87977" author="spiechurski" created="Wed, 2 Jul 2014 16:02:47 +0000"  >&lt;p&gt;As seen in the stack of filter_get_page(), the base pointer of filter_preprw_read() is really 0xffff880612fd7a90 (I think you swapped the address in your comment with that of the inode struct):&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;#9 [ffff880612fd7938] filter_get_page at ffffffffa0c4b065 [obdfilter]
    ffff880612fd7940: 00000000531db489 000000001c4be826 
    ffff880612fd7950: ffff880612fd7c4c ffff880610f480b8 
    ffff880612fd7960: ffff880612fd7a90 ffffffffa0c4d64d 
#10 [ffff880612fd7968] filter_preprw_read at ffffffffa0c4d64d [obdfilter]
[...]
    ffff880612fd7a80: ffff880091cab1c8 ffff8805235be548 
    ffff880612fd7a90: ffff880612fd7ad0 ffffffffa0c4dedc 
#11 [ffff880612fd7a98] filter_preprw at ffffffffa0c4dedc [obdfilter]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Looking for the struct address in the slab, it would seem the ldiskfs_inode_cache slab itself is corrupted:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; kmem -s ffff880040aa57c0
kmem: ldiskfs_inode_cache: full list: slab: ffff880040aa5280  bad next pointer: 0
kmem: ldiskfs_inode_cache: full list: slab: ffff880040aa5280  bad prev pointer: 0
kmem: ldiskfs_inode_cache: full list: slab: ffff880040aa5280  bad inuse counter: 0
kmem: ldiskfs_inode_cache: full list: slab: ffff880040aa5280  bad s_mem pointer: 0
kmem: ldiskfs_inode_cache: partial list: slab: ffff880040aa5280  bad next pointer: 0
kmem: ldiskfs_inode_cache: partial list: slab: ffff880040aa5280  bad s_mem pointer: 0
kmem: ldiskfs_inode_cache: full list: slab: ffff880040aa5280  bad next pointer: 0
kmem: ldiskfs_inode_cache: full list: slab: ffff880040aa5280  bad s_mem pointer: 0
kmem: ldiskfs_inode_cache: free list: slab: ffff880040aa5280  bad next pointer: 0
kmem: ldiskfs_inode_cache: free list: slab: ffff880040aa5280  bad s_mem pointer: 0
kmem: ldiskfs_inode_cache: partial list: slab: ffff880040aa5280  bad next pointer: 0
kmem: ldiskfs_inode_cache: partial list: slab: ffff880040aa5280  bad s_mem pointer: 0
kmem: ldiskfs_inode_cache: full list: slab: ffff880040aa5280  bad next pointer: 0
kmem: ldiskfs_inode_cache: full list: slab: ffff880040aa5280  bad s_mem pointer: 0
kmem: ldiskfs_inode_cache: free list: slab: ffff880040aa5280  bad next pointer: 0
kmem: ldiskfs_inode_cache: free list: slab: ffff880040aa5280  bad s_mem pointer: 0
kmem: ldiskfs_inode_cache: address not found in cache: ffff880040aa57c0
kmem: 16 errors encountered
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I am also uploading the dump to the FTP site; it should be available tomorrow.&lt;/p&gt;</comment>
                            <comment id="88094" author="bfaccini" created="Thu, 3 Jul 2014 14:41:14 +0000"  >&lt;p&gt;Seb, yes you are right, I should have written 0xffff880612fd7960 as the location where the %rbp value is saved at the time of the filter_get_page() call in filter_preprw_read(). In fact, since it was not available in your crash extracts for cut+paste, I simply used 0xffff880612fd7970 and then forgot to modify it manually!&lt;/p&gt;

&lt;p&gt;And by the way, the &quot;kmem&quot; sub-command confirms the validity of the inode address, and its errors confirm the slab corruption.&lt;/p&gt;

&lt;p&gt;Thanks for uploading the crash-dump.&lt;/p&gt;</comment>
                            <comment id="90147" author="bfaccini" created="Mon, 28 Jul 2014 11:52:19 +0000"  >&lt;p&gt;Crash-dump analysis findings:&lt;/p&gt;

&lt;p&gt;              _ there are 2 additional threads with references to the same page/slab/inode address range; both triggered the same GPF, with the same stack!&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; bt 21929 22191
PID: 21929  TASK: ffff880c12206830  CPU: 2   COMMAND: &quot;ll_ost_io_155&quot;
 #0 [ffff880c12277780] die at ffffffff8100711e
 #1 [ffff880c122777b0] do_general_protection at ffffffff81499be2
 #2 [ffff880c122777e0] general_protection at ffffffff814993b5
    [exception RIP: radix_tree_lookup_slot+5]
    RIP: ffffffff81261465  RSP: ffff880c12277890  RFLAGS: 00010286
    RAX: e940201000000010  RBX: e940201000000008  RCX: 0000000000000000
    RDX: 00000000000200d2  RSI: 0000000000000008  RDI: e940201000000008
    RBP: ffff880c122778b0   R8: ffff880614e926c0   R9: 00000000000000f8
    R10: 0000000000001000  R11: 0000000000000001  R12: 0000000000000008
    R13: 0000000000000008  R14: e940201000000000  R15: 20105fa000080221
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #3 [ffff880c12277898] find_get_page at ffffffff810ffe8e
 #4 [ffff880c122778b8] find_lock_page at ffffffff8110112a
 #5 [ffff880c122778e8] find_or_create_page at ffffffff8110129f
 #6 [ffff880c12277938] filter_get_page at ffffffffa0c4b065 [obdfilter]
 #7 [ffff880c12277968] filter_preprw_read at ffffffffa0c4d64d [obdfilter]
 #8 [ffff880c12277a98] filter_preprw at ffffffffa0c4dedc [obdfilter]
 #9 [ffff880c12277ad8] obd_preprw at ffffffffa0c09051 [ost]
#10 [ffff880c12277b48] ost_brw_read at ffffffffa0c10091 [ost]
#11 [ffff880c12277c88] ost_handle at ffffffffa0c16423 [ost]
#12 [ffff880c12277da8] ptlrpc_main at ffffffffa07fd4e6 [ptlrpc]
#13 [ffff880c12277f48] kernel_thread at ffffffff8100412a

PID: 22191  TASK: ffff8806144fd0c0  CPU: 0   COMMAND: &quot;ll_ost_io_417&quot;
 #0 [ffff880614617780] die at ffffffff8100711e
 #1 [ffff8806146177b0] do_general_protection at ffffffff81499be2
 #2 [ffff8806146177e0] general_protection at ffffffff814993b5
    [exception RIP: radix_tree_lookup_slot+5]
    RIP: ffffffff81261465  RSP: ffff880614617890  RFLAGS: 00010286
    RAX: e940201000000010  RBX: e940201000000008  RCX: 0000000000000000
    RDX: 00000000000200d2  RSI: 0000000000000100  RDI: e940201000000008
    RBP: ffff8806146178b0   R8: ffff88061461c140   R9: 0000000000000008
    R10: 0000000000001000  R11: 0000000000000001  R12: 0000000000000100
    R13: 0000000000000100  R14: e940201000000000  R15: 20105fa000080221
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #3 [ffff880614617898] find_get_page at ffffffff810ffe8e
 #4 [ffff8806146178b8] find_lock_page at ffffffff8110112a
 #5 [ffff8806146178e8] find_or_create_page at ffffffff8110129f
 #6 [ffff880614617938] filter_get_page at ffffffffa0c4b065 [obdfilter]
 #7 [ffff880614617968] filter_preprw_read at ffffffffa0c4d64d [obdfilter]
 #8 [ffff880614617a98] filter_preprw at ffffffffa0c4dedc [obdfilter]
 #9 [ffff880614617ad8] obd_preprw at ffffffffa0c09051 [ost]
#10 [ffff880614617b48] ost_brw_read at ffffffffa0c10091 [ost]
#11 [ffff880614617c88] ost_handle at ffffffffa0c16423 [ost]
#12 [ffff880614617da8] ptlrpc_main at ffffffffa07fd4e6 [ptlrpc]
#13 [ffff880614617f48] kernel_thread at ffffffff8100412a
crash&amp;gt; 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;              _ The ldiskfs_inode slab page corruption looks like:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; rd 0xffff880040aa5000 512 | grep -v &quot;0000000000000000 0000000000000000&quot; | head
ffff880040aa5600:  0000041100000401 20025bc900000421   ........!....[. 
ffff880040aa5610:  0000000000040002 d514200200000000   ............. ..
ffff880040aa5620:  0000041200000402 2010176b00000622   ........&quot;...k.. 
ffff880040aa5630:  0000000000050000 3e44201000000000   ............. D&amp;gt;
ffff880040aa5640:  0000041300000403 20103b4500000823   ........#...E;. 
ffff880040aa5650:  0000000000050000 ec37201000000000   ............. 7.
ffff880040aa5660:  0000041400000404 20102f7700000a24   ........$...w/. 
ffff880040aa5670:  0000000000050000 3928201000000000   ............. (9
ffff880040aa5680:  0000041500000405 2010299500000c25   ........%....). 
ffff880040aa5690:  0000000000050000 d013201000000000   ............. ..
grep: write error
crash&amp;gt; 
crash&amp;gt; rd 0xffff880040aa5000 512 | grep -v &quot;0000000000000000 0000000000000000&quot; | tail
ffff880040aa5f60:  0020001b0020000b 201080000020162b   .. ... .+. .... 
ffff880040aa5f70:  0000000000070000 2a74201000000000   ............. t*
ffff880040aa5f80:  0020001c0020000c 201080000020182c   .. ... .,. .... 
ffff880040aa5f90:  0000000000070000 6e46201000000000   ............. Fn
ffff880040aa5fa0:  0020001d0020000d 2010800000201a2d   .. ... .-. .... 
ffff880040aa5fb0:  0000000000070000 024d201000000000   ............. M.
ffff880040aa5fc0:  0020001e0020000e 2010800000201c2e   .. ... ... .... 
ffff880040aa5fd0:  0000000000070000 b650201000000000   ............. P.
ffff880040aa5fe0:  0020001f0020000f 2010800000201e2f   .. ... ./. .... 
ffff880040aa5ff0:  0000000000070000 da5b201000000000   ............. [.
crash&amp;gt; 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;              _ the dump is missing the 2 previous/next 4K virtual/real pages (inside the same 2M page!) because they are unmapped/free.&lt;/p&gt;
</comment>
                            <comment id="90190" author="bfaccini" created="Mon, 28 Jul 2014 15:43:31 +0000"  >&lt;p&gt;Even though this will impact crash-dump time and on-disk space, would it be possible to configure kdump to allow full memory/all pages to be dumped next time?&lt;/p&gt;</comment>
                            <comment id="92591" author="bfaccini" created="Wed, 27 Aug 2014 12:36:20 +0000"  >&lt;p&gt;Seb, any news or a new occurrence?&lt;/p&gt;</comment>
                            <comment id="198412" author="spiechurski" created="Wed, 7 Jun 2017 08:02:56 +0000"  >&lt;p&gt;Given that we have not seen this issue for almost 3 years, and all our customers have migrated away from 2.1, I guess you can close this ticket.&lt;/p&gt;</comment>
                            <comment id="198430" author="pjones" created="Wed, 7 Jun 2017 12:02:01 +0000"  >&lt;p&gt;OK, thanks&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="15282" name="Dump-analysis-140310-BS0206.txt" size="13229" author="spiechurski" created="Wed, 2 Jul 2014 14:22:40 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10490" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>End date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Wed, 27 Aug 2014 14:22:40 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwqfj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>14747</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10493" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>Start date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Wed, 2 Jul 2014 14:22:40 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                    </customfields>
    </item>
</channel>
</rss>