<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:00:32 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-13351] Out of memory on server perhaps due to unreachable lnet network</title>
                <link>https://jira.whamcloud.com/browse/LU-13351</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;Last Saturday, we hit the following crash on one of Fir&apos;s OSS nodes. This is with Lustre 2.12.3.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;      KERNEL: /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl2.x86_64/vmlinux
    DUMPFILE: fir-io3-s2_vmcore_2020-03-07-23-00-05  [PARTIAL DUMP]
        CPUS: 48
        DATE: Sat Mar  7 22:59:56 2020
      UPTIME: 88 days, 17:04:42
LOAD AVERAGE: 392.09, 183.09, 89.88
       TASKS: 2344
    NODENAME: fir-io3-s2
     RELEASE: 3.10.0-957.27.2.el7_lustre.pl2.x86_64
     VERSION: #1 SMP Thu Nov 7 15:26:16 PST 2019
     MACHINE: x86_64  (1996 Mhz)
      MEMORY: 255.6 GB
       PANIC: &quot;Kernel panic - not syncing: Out of memory and no killable processes...&quot;
         PID: 6889
     COMMAND: &quot;ll_ost_io02_054&quot;
        TASK: ffff9c36fc49b0c0  [THREAD_INFO: ffff9c340a59c000]
         CPU: 38
       STATE: TASK_RUNNING (PANIC)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Kernel memory:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; kmem -i
...(garbage)...
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM  65891310     251.4 GB         ----
         FREE   590325       2.3 GB    0% of TOTAL MEM
         USED  65300985     249.1 GB   99% of TOTAL MEM
       SHARED   100074     390.9 MB    0% of TOTAL MEM
      BUFFERS    46259     180.7 MB    0% of TOTAL MEM
       CACHED    29672     115.9 MB    0% of TOTAL MEM
         SLAB  63120835     240.8 GB   95% of TOTAL MEM

   TOTAL HUGE        0            0         ----
    HUGE FREE        0            0    0% of TOTAL HUGE

   TOTAL SWAP  1048575         4 GB         ----
    SWAP USED   267787         1 GB   25% of TOTAL SWAP
    SWAP FREE   780788         3 GB   74% of TOTAL SWAP

 COMMIT LIMIT  33994230     129.7 GB         ----
    COMMITTED   270117         1 GB    0% of TOTAL LIMIT
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Unfortunately, the vmcore appears to be partially corrupted, as I cannot access the slab information:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; kmem -s
kmem: invalid kernel virtual address: b3dd52e21b11fbc  type: &quot;list entry&quot;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Last week, we had to change the LNet routes live on Fir, and we migrated routers from o2ib4/6 to other new LNet networks for further expansion (long story). This all went well, but this specific server may have kept a reference to o2ib4, as we can see many occurrences of the following log messages:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[7663978.048351] Lustre: 124006:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1583649999/real 1583649999]  req@ffff9c2e05194380 x1652574281110560/t0(0) o106-&amp;gt;fir-OST0019@10.9.0.63@o2ib4:15/16 lens 296/280 e 0 to 1 dl 1583650006 ref 2 fl Rpc:eX/2/ffffffff rc 0/-1
[7663978.076127] Lustre: 124006:0:(client.c:2133:ptlrpc_expire_one_request()) Skipped 21840089 previous similar messages
[7663988.066479] LNetError: 80427:0:(lib-move.c:2007:lnet_handle_find_routed_path()) no route to 10.9.0.63@o2ib4 from &amp;lt;?&amp;gt;
[7663988.077177] LNetError: 80427:0:(lib-move.c:2007:lnet_handle_find_routed_path()) Skipped 21832206 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Again, this is normal, as o2ib4 had just been decommissioned. Fir is using o2ib7 and we have IB/IB routers to several IB fabrics, each with its own o2ib index. It&apos;s possible that a memory leak occurred due to this specific situation. We don&apos;t consider this a major issue, but wanted to report it anyway because a server crashed.&lt;/p&gt;

&lt;p&gt;I&apos;m attaching the vmcore dmesg output as &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/34428/34428_fir-io3-s2_vmcore-dmesg_2020-03-07-23-00-05.txt&quot; title=&quot;fir-io3-s2_vmcore-dmesg_2020-03-07-23-00-05.txt attached to LU-13351&quot;&gt;fir-io3-s2_vmcore-dmesg_2020-03-07-23-00-05.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; and the output of &lt;tt&gt;foreach bt&lt;/tt&gt; as&#160;&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/34427/34427_fir-io3-s2_foreachbt_2020-03-07-23-00-05.txt&quot; title=&quot;fir-io3-s2_foreachbt_2020-03-07-23-00-05.txt attached to LU-13351&quot;&gt;fir-io3-s2_foreachbt_2020-03-07-23-00-05.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;. Also, the full vmcore has been uploaded to the FTP as &lt;tt&gt;fir-io3-s2_vmcore_2020-03-07-23-00-05&lt;/tt&gt; (but it looks like it&apos;s incomplete).&lt;/p&gt;</description>
                <environment>CentOS 7.6</environment>
        <key id="58336">LU-13351</key>
            <summary>Out of memory on server perhaps due to unreachable lnet network</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="ssmirnov">Serguei Smirnov</assignee>
                                    <reporter username="sthiell">Stephane Thiell</reporter>
                        <labels>
                    </labels>
                <created>Tue, 10 Mar 2020 20:35:43 +0000</created>
                <updated>Wed, 11 Mar 2020 14:19:30 +0000</updated>
                                            <version>Lustre 2.12.3</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="265108" author="pjones" created="Wed, 11 Mar 2020 14:19:29 +0000"  >&lt;p&gt;Serguei&lt;/p&gt;

&lt;p&gt;Could you please investigate?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="34427" name="fir-io3-s2_foreachbt_2020-03-07-23-00-05.txt" size="1557379" author="sthiell" created="Tue, 10 Mar 2020 20:33:54 +0000"/>
                            <attachment id="34428" name="fir-io3-s2_vmcore-dmesg_2020-03-07-23-00-05.txt" size="1052872" author="sthiell" created="Tue, 10 Mar 2020 20:33:45 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00v9r:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>