<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:30:38 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-9939] Bad checksums from clients using SR-IOV</title>
                <link>https://jira.whamcloud.com/browse/LU-9939</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;A Lustre client on a KVM hypervisor using SR-IOV for IB has started to generate&#160;the following errors:&lt;/p&gt;

&lt;p&gt;OSS (oak-io1-s1 10.0.2.101@o2ib5):&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Aug 31 11:27:04 oak-io1-s1 kernel: LustreError: 168-f: BAD WRITE CHECKSUM: oak-OST001a from 12345-10.0.2.225@o2ib5 inode [0x200002f84:0x6b6a:0x0] object 0x0:4413301 extent [726925312-727973887]: client csum 4ecd330, server csum 5610e5e5


&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The second OSS in production also has the same errors.&lt;/p&gt;

&lt;p&gt;SR-IOV based client (oak-gw06 10.0.2.225@o2ib5):&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Aug 31 11:27:05 oak-gw06 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.0.2.101@o2ib5 inode [0x200002f84:0x6b6a:0x0] object 0x0:4413301 extent [726925312-727973887]


&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The client also gets some read checksum errors later:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Aug 31 11:37:42 oak-gw06 kernel: LustreError: 133-1: oak-OST001a-osc-ffff88041b99c000: BAD READ CHECKSUM: from 10.0.2.101@o2ib5 inode [0x0:0x0:0x0] object 0x0:4413301 extent [1581252608-1582301183]


&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;I will attach kernel logs of both.&lt;/p&gt;

&lt;p&gt;In this particular case, the client is a Globus endpoint, using Lustre as the backend. This is actually the second time we&apos;ve seen this: the&#160;same issue was seen on another VM running rsnapshot jobs. Rebooting the impacted VM does fix the issue.&lt;/p&gt;

&lt;p&gt;Are you aware of such issues when using&#160;SR-IOV? Any idea how we could troubleshoot this?&lt;/p&gt;

&lt;p&gt;Thanks!&lt;br/&gt;
 Stephane Thiell&lt;/p&gt;</description>
                <environment>Lustre 2.9 including fixes for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8851&quot; title=&quot;nodemap: add flags to limit mapping to UID or GID only&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8851&quot;&gt;&lt;strike&gt;LU-8851&lt;/strike&gt;&lt;/a&gt; (nodemap: add uid/gid only flags to control mapping) and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9258&quot; title=&quot;nodemap: group quota ID not properly mapped &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9258&quot;&gt;&lt;strike&gt;LU-9258&lt;/strike&gt;&lt;/a&gt; (nodemap: group quota ID not properly mapped), kernel 3.10.0-514.16.1.el7_lustre.x86_64 on servers, 3.10.0-514.10.2.el7_lustre.x86_64 on clients</environment>
        <key id="48101">LU-9939</key>
            <summary>Bad checksums from clients using SR-IOV</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="bfaccini">Bruno Faccini</assignee>
                                    <reporter username="sthiell">Stephane Thiell</reporter>
                        <labels>
                    </labels>
                <created>Fri, 1 Sep 2017 18:26:30 +0000</created>
                <updated>Mon, 6 Nov 2017 19:54:03 +0000</updated>
                                            <version>Lustre 2.9.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="207332" author="bfaccini" created="Sat, 2 Sep 2017 23:39:02 +0000"  >&lt;p&gt;Hello Stephane!&lt;br/&gt;
Can you check whether your Lustre 2.9 version includes the patch for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8376&quot; title=&quot;Enhance debugging infos available for Lustre checksum errors&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8376&quot;&gt;&lt;del&gt;LU-8376&lt;/del&gt;&lt;/a&gt;?&lt;br/&gt;
If it does, you should enable the page dump on checksum errors on both the client and OST sides, which may give more information to help find the cause of the error.&lt;/p&gt;</comment>
                            <comment id="207493" author="sthiell" created="Wed, 6 Sep 2017 00:08:22 +0000"  >&lt;p&gt;Hey Bruno!&lt;/p&gt;

&lt;p&gt;That&apos;s very good to know!&#160;And no, I don&apos;t have the patch from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8376&quot; title=&quot;Enhance debugging infos available for Lustre checksum errors&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8376&quot;&gt;&lt;del&gt;LU-8376&lt;/del&gt;&lt;/a&gt; in&#160;our 2.9, but we plan to upgrade to 2.10.1 in the near future, so I will just wait for that and then enable this&#160;new debugging option.&lt;/p&gt;

&lt;p&gt;Thanks!&lt;/p&gt;

&lt;p&gt;Stephane&lt;/p&gt;</comment>
                            <comment id="207525" author="bfaccini" created="Wed, 6 Sep 2017 07:06:17 +0000"  >&lt;p&gt;OK. Unfortunately, I don&apos;t have any other option for debugging this, and I did not find any similar report of checksum errors when running SR-IOV.&lt;/p&gt;</comment>
                            <comment id="212024" author="sthiell" created="Thu, 26 Oct 2017 02:43:46 +0000"  >&lt;p&gt;Hi Bruno!&lt;/p&gt;

&lt;p&gt;The problem occurred again on a VM used as a Globus endpoint. The good news is that we are now running 2.10.1 on this system (clients and servers), so I&#160;did enable this checksum_dump thing&#160;and&#160;attached some of&#160;the resulting files to this ticket. Do you know how to&#160;troubleshoot this?&lt;/p&gt;

&lt;p&gt;Thanks!&lt;/p&gt;

&lt;p&gt;Stephane&lt;/p&gt;</comment>
                            <comment id="212054" author="bfaccini" created="Thu, 26 Oct 2017 12:24:09 +0000"  >&lt;p&gt;Hello Stephane,&lt;br/&gt;
We missed you at the last LAD!!&lt;br/&gt;
Your tarball of dumps only contains files from the client side, so did you also enable the page dump for bulk transfers with checksum errors on the OSS side? If so, there should be corresponding dumps available on the OSSs that would be of interest for comparison.&lt;br/&gt;
The syslogs of the affected clients and OSSs would also be helpful, along with the striping information for each affected file/FID, and the file contents themselves if they are still available and unmodified.&lt;/p&gt;</comment>
                            <comment id="212908" author="sthiell" created="Mon, 6 Nov 2017 19:54:03 +0000"  >&lt;p&gt;Thanks Bruno! I&apos;m still waiting to see a new occurrence to further troubleshoot this issue on the OSS side. I believe our Globus endpoint VMs have been less loaded lately, which might be why I haven&apos;t seen&#160;the problem yet. I&apos;ll keep you posted.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="28547" name="lustre-log-checksum_dump.tar.gz" size="22871880" author="sthiell" created="Thu, 26 Oct 2017 02:40:14 +0000"/>
                            <attachment id="28180" name="oak-gw06_kernel_Aug31.log" size="219747" author="sthiell" created="Fri, 1 Sep 2017 18:25:18 +0000"/>
                            <attachment id="28181" name="oak-gw06_kernel_full.log" size="4886725" author="sthiell" created="Fri, 1 Sep 2017 18:25:13 +0000"/>
                            <attachment id="28179" name="oak-io1-s1_kernel_Aug31.log" size="112563" author="sthiell" created="Fri, 1 Sep 2017 18:25:21 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzzjgf:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>