<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:02:54 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-13632] BAD WRITE CHECKSUM</title>
                <link>https://jira.whamcloud.com/browse/LU-13632</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;One of our Lustre storage servers has become unstable. Below are the messages we found on the 2 OSS that fell over (rebooted) due to bad write checksums. We don&apos;t recall seeing this before. We tracked down the files referenced via the hex codes in the error messages below with the lfs fid2path command and killed the user&apos;s jobs. I will mention that we did update our lnet routers from 2.13.0 to 2.12.4 in the past week or two. Also the lustre clients that are accessing our Lustre storage are running lustre 2.10.7-1.&lt;/p&gt;

&lt;p&gt;This particular lustre server that is unstable is running lustre 2.12.3:&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@holyscratch01mds01 ~&amp;#93;&lt;/span&gt;# rpm -qa |grep lustre&lt;br/&gt;
kernel-3.10.0-1062.1.1.el7_lustre.x86_64&lt;br/&gt;
kmod-lustre-2.12.3-1.el7.x86_64&lt;br/&gt;
kmod-lustre-osd-ldiskfs-2.12.3-1.el7.x86_64&lt;br/&gt;
kernel-devel-3.10.0-1062.1.1.el7_lustre.x86_64&lt;br/&gt;
lustre-osd-zfs-mount-2.12.3-1.el7.x86_64&lt;br/&gt;
lustre-2.12.3-1.el7.x86_64&lt;br/&gt;
lustre-zfs-dkms-2.12.3-1.el7.noarch&lt;br/&gt;
lustre-resource-agents-2.12.3-1.el7.x86_64&lt;br/&gt;
lustre-ldiskfs-zfs-5.0.0-1.el7.x86_64&lt;br/&gt;
kernel-mft-4.13.3-3.10.0_1062.1.1.el7_lustre.x86_64.x86_64&lt;br/&gt;
lustre-osd-ldiskfs-mount-2.12.3-1.el7.x86_64&lt;br/&gt;
kmod-spl-3.10.0-1062.1.1.el7_lustre.x86_64-0.7.13-1.el7.x86_64&lt;/p&gt;

&lt;p&gt;Here are the errors:&lt;/p&gt;

&lt;p&gt;Jun  3 18:23:54 holyscratch01oss03 kernel: LustreError: 168-f: scratch1-OST001a: BAD WRITE CHECKSUM: from 12345-10.31.164.172@o2ib via 10.31.179.131@o2ib4 inode &lt;span class=&quot;error&quot;&gt;&amp;#91;0x20001bc69:0xa3f6:0x0&amp;#93;&lt;/span&gt; object 0x0:76971842 extent &lt;span class=&quot;error&quot;&gt;&amp;#91;71303168-75497471&amp;#93;&lt;/span&gt;: client csum e6b01811, server csum 880d3728&lt;br/&gt;
Jun  3 18:23:55 holyscratch01oss03 kernel: LustreError: 168-f: scratch1-OST0009: BAD WRITE CHECKSUM: from 12345-10.31.163.222@o2ib via 10.31.179.133@o2ib4 inode &lt;span class=&quot;error&quot;&gt;&amp;#91;0x200012232:0xb3f4:0x0&amp;#93;&lt;/span&gt; object 0x0:76503185 extent &lt;span class=&quot;error&quot;&gt;&amp;#91;1279262720-1283457023&amp;#93;&lt;/span&gt;: client csum e611ac34, server csum 82fdc50a&lt;/p&gt;
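
&lt;p&gt;(A minimal sketch of the lfs fid2path lookup described above; it assumes the filesystem is mounted at /n/holyscratch01 on the client:)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# resolve the inode FID from the first error message back to a pathname
lfs fid2path /n/holyscratch01 &apos;[0x20001bc69:0xa3f6:0x0]&apos;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;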

&lt;p&gt;Any ideas as to why this is happening?&lt;/p&gt;</description>
                <environment>Server side is all Dell gear. MDS and OSS are R740, storage is Dell ME4 storage enclosures. lnet routers are Lenovo SR630 running lustre 2.12.4, clients are mostly Dell and Lenovo gear running 2.10.7-1.</environment>
        <key id="59445">LU-13632</key>
            <summary>BAD WRITE CHECKSUM</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wshilong">Wang Shilong</assignee>
                                    <reporter username="mre64">Michael Ethier</reporter>
                        <labels>
                    </labels>
                <created>Thu, 4 Jun 2020 00:29:08 +0000</created>
                <updated>Thu, 23 Jul 2020 19:47:54 +0000</updated>
                                            <version>Lustre 2.12.3</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>10</watches>
                                                                            <comments>
                            <comment id="271907" author="pjones" created="Thu, 4 Jun 2020 01:07:45 +0000"  >&lt;p&gt;Michael&lt;/p&gt;

&lt;p&gt;You have opened this ticket as severity 1 which means that your whole filesystem is out of service - is that the case? From the description it sounds like you are in production but want to root cause what has happened to the impacted files.&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="271908" author="mre64" created="Thu, 4 Jun 2020 01:17:04 +0000"  >&lt;p&gt;Hi Peter,&lt;br/&gt;
It has gone up and down at least 2 times since 5pm today. We had it running but shortly after we lost 2 OSS. So basically it&apos;s not stable and usable at this point, but it&apos;s not completely down. Sorry for not selecting the correct value. Your help page didn&apos;t have info on that selection when I looked.&lt;br/&gt;
Mike&lt;/p&gt;</comment>
                            <comment id="271909" author="mre64" created="Thu, 4 Jun 2020 01:26:12 +0000"  >&lt;p&gt;We are seeing errors like this on an OST, &lt;span class=&quot;image-wrap&quot; style=&quot;&quot;&gt;&lt;a id=&quot;35053_thumb&quot; href=&quot;https://jira.whamcloud.com/secure/attachment/35053/35053_Screen+Shot+2020-06-03+at+9.24.17+PM.png&quot; title=&quot;Screen Shot 2020-06-03 at 9.24.17 PM.png&quot; file-preview-type=&quot;image&quot; file-preview-id=&quot;35053&quot; file-preview-title=&quot;Screen Shot 2020-06-03 at 9.24.17 PM.png&quot;&gt;&lt;img src=&quot;https://jira.whamcloud.com/secure/thumbnail/35053/_thumb_35053.png&quot; style=&quot;border: 0px solid black&quot; role=&quot;presentation&quot;/&gt;&lt;/a&gt;&lt;/span&gt;  see attached.&lt;/p&gt;</comment>
                            <comment id="271911" author="pjones" created="Thu, 4 Jun 2020 01:37:48 +0000"  >&lt;p&gt;No problem Michael - I just wanted to be clear as to what the priority was at this point. It sounds like getting things stable is the first priority.&#160;&lt;/p&gt;

&lt;p&gt;Shilong - could you please advise?&lt;/p&gt;</comment>
                            <comment id="271913" author="mre64" created="Thu, 4 Jun 2020 01:43:54 +0000"  >&lt;p&gt;Also we continue to have issues with OSTs with these errors and that OSTs won&#8217;t stay mounted, and this is causing OSS to consistently reboot&lt;/p&gt;</comment>
                            <comment id="271916" author="mre64" created="Thu, 4 Jun 2020 02:10:51 +0000"  >&lt;p&gt;Update: we haven&#8217;t had an OST failover in about 30min so it seems stable, but we have OSTs that won&#8217;t fail back without failure.&lt;/p&gt;</comment>
                            <comment id="271925" author="dongyang" created="Thu, 4 Jun 2020 04:04:07 +0000"  >&lt;p&gt;Michael,&lt;/p&gt;

&lt;p&gt;can you run&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
lctl get_param osc.*.checksum_type
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;on one of the clients seeing the error, e.g.&#160;10.31.164.172 or&#160;10.31.163.222&lt;/p&gt;

&lt;p&gt;and gather the output?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;</comment>
                            <comment id="271957" author="mre64" created="Thu, 4 Jun 2020 13:49:00 +0000"  >&lt;p&gt;Hi Dongyang,&lt;br/&gt;
The output of that command is attached.&lt;br/&gt;
 &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/35060/35060_holy2b11102.out&quot; title=&quot;holy2b11102.out attached to LU-13632&quot;&gt;holy2b11102.out&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; &lt;/p&gt;</comment>
                            <comment id="271965" author="jhammond" created="Thu, 4 Jun 2020 14:50:18 +0000"  >&lt;p&gt;Hi &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=mre64&quot; class=&quot;user-hover&quot; rel=&quot;mre64&quot;&gt;mre64&lt;/a&gt;,&lt;/p&gt;

&lt;p&gt;Are the clients, LNet routers, and servers all using ECC memory? Are there any memory errors in the logs on the path from client to server? It would also be good to rule out any network hardware errors.&lt;/p&gt;</comment>
                            <comment id="271968" author="mre64" created="Thu, 4 Jun 2020 15:48:36 +0000"  >&lt;p&gt;Hi John,&lt;br/&gt;
Yes they are using ECC memory. I checked the server side Lustre nodes, the lnet routers, and the client nodes that were referenced by the BAD WRITE CHECKSUM and none of them have memory errors.&lt;/p&gt;</comment>
                            <comment id="271977" author="jhammond" created="Thu, 4 Jun 2020 16:39:11 +0000"  >&lt;p&gt;Thanks Michael.&lt;/p&gt;

&lt;p&gt;Some questions to help us isolate this:&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;Are the checksum errors associated with multiple applications?&lt;/li&gt;
	&lt;li&gt;Did they start soon after the router upgrade?&lt;/li&gt;
	&lt;li&gt;Do they only occur on routed clients?&lt;/li&gt;
	&lt;li&gt;Would it be possible to revert the router upgrade and see if the checksum errors stop for RPCs that use the downgraded router?&lt;/li&gt;
	&lt;li&gt;Would it be possible to bring some clients to 2.12.4 and see if they occur?&lt;/li&gt;
&lt;/ol&gt;
</comment>
                            <comment id="271994" author="mre64" created="Thu, 4 Jun 2020 17:55:30 +0000"  >&lt;p&gt;Hi John,&lt;/p&gt;

&lt;p&gt;1. It seems the CHECKSUM error was coming from the same type of application and user. However, that user has been running the same type of jobs for months on the same group of nodes.&lt;br/&gt;
2. We have been changing the lnet routers to be exactly the same over the past 2-3 weeks, so they were not all done within a short period of time.&lt;br/&gt;
3. We saw the BAD WRITE CHECKSUM on both routed and directly connected (via HDR IB) nodes.&lt;br/&gt;
4. I don&apos;t think we are in a position to change the lnet routers back to 2.13.0. Also, since we saw the CHECKSUM error coming from nodes that are not using the lnet routers, it seems the routers are not the problem.&lt;br/&gt;
5. We are running a specific kernel and OS, so it may be possible to update the client on some of the compute nodes if 2.12.4 installs without issues or errors. Our compute nodes are running CentOS 7.6.1810 with the 3.10.0-957.12.1.el7.x86_64 kernel and OFED INBOX drivers, not MLNX OFED. We would rather not do this, to be honest.&lt;/p&gt;</comment>
                            <comment id="272031" author="dongyang" created="Fri, 5 Jun 2020 00:54:45 +0000"  >&lt;p&gt;Thanks Michael,&lt;/p&gt;

&lt;p&gt;from the output we can see the checksum type is crc32c:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
osc.scratch1-OST0009-osc-ffff9ff17a698000.checksum_type=crc32 adler [crc32c]
osc.scratch1-OST001a-osc-ffff9ff17a698000.checksum_type=crc32 adler [crc32c]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;You mentioned the clients are using OFED inbox drivers,&lt;/p&gt;

&lt;p&gt;Have they always been using the inbox drivers, and which version?&lt;/p&gt;

&lt;p&gt;What about the lnet routers and OSS servers? Are they using MOFED, and is it the same version as on the clients?&lt;/p&gt;</comment>
                            <comment id="272033" author="mre64" created="Fri, 5 Jun 2020 01:12:22 +0000"  >&lt;p&gt;Hi Dongyang,&lt;/p&gt;

&lt;p&gt;The compute clients have always been using the OFED INBOX drivers, for example:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@holy7c02108 ~&amp;#93;&lt;/span&gt;# modinfo mlx5_ib&lt;br/&gt;
filename:       /lib/modules/3.10.0-957.12.1.el7.x86_64/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko.xz&lt;br/&gt;
license:        Dual BSD/GPL&lt;br/&gt;
description:    Mellanox Connect-IB HCA IB driver&lt;br/&gt;
author:         Eli Cohen &amp;lt;eli@mellanox.com&amp;gt;&lt;br/&gt;
retpoline:      Y&lt;br/&gt;
rhelversion:    7.6&lt;br/&gt;
srcversion:     3B27ACD7C17E508C4D27B18&lt;br/&gt;
depends:        mlx5_core,ib_core&lt;br/&gt;
intree:         Y&lt;br/&gt;
vermagic:       3.10.0-957.12.1.el7.x86_64 SMP mod_unload modversions &lt;br/&gt;
signer:         CentOS Linux kernel signing key&lt;br/&gt;
sig_key:        2C:7C:17:70:5C:86:D4:20:80:50:D3:F5:54:56:9A:7B:D3:BF:D1:BF&lt;br/&gt;
sig_hashalgo:   sha256&lt;/p&gt;


&lt;p&gt;The lnet routers are using MLNX_OFED_LINUX-4.7-1.0.0.1 (OFED-4.7-1.0.0).&lt;/p&gt;

&lt;p&gt;The lustre server is using 4.7.1.0.0.1:&lt;br/&gt;
-bash-4.2$ rpm -qa |grep -i mlnx&lt;br/&gt;
mlnx-ofa_kernel-4.7-OFED.4.7.1.0.0.1.g1c4bf42.x86_64&lt;br/&gt;
libibumad-static-43.1.1.MLNX20190905.1080879-0.1.47329.x86_64&lt;br/&gt;
mlnx-ofa_kernel-devel-4.7-OFED.4.7.1.0.0.1.g1c4bf42.x86_64&lt;br/&gt;
ibutils2-2.1.1-0.113.MLNX20191121.g1c29603.47329.x86_64&lt;br/&gt;
libibumad-devel-43.1.1.MLNX20190905.1080879-0.1.47329.x86_64&lt;br/&gt;
libibumad-43.1.1.MLNX20190905.1080879-0.1.47329.x86_64&lt;br/&gt;
kmod-mlnx-ofa_kernel-4.7-OFED.4.7.1.0.0.1.g1c4bf42.x86_64&lt;/p&gt;</comment>
                            <comment id="272034" author="mre64" created="Fri, 5 Jun 2020 01:27:05 +0000"  >&lt;p&gt;Also, the lnet routers have the MLNX OFED &quot;upstream&quot; drivers which are the latest/greatest while it appears the server side is using the MLNX legacy driver:&lt;/p&gt;

&lt;p&gt;Server side:&lt;br/&gt;
-bash-4.2$ modinfo mlx5_ib&lt;br/&gt;
filename:       /lib/modules/3.10.0-1062.1.1.el7_lustre.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko&lt;br/&gt;
license:        Dual BSD/GPL&lt;br/&gt;
description:    Mellanox Connect-IB HCA IB driver&lt;br/&gt;
author:         Eli Cohen &amp;lt;eli@mellanox.com&amp;gt;&lt;br/&gt;
retpoline:      Y&lt;br/&gt;
rhelversion:    7.7&lt;br/&gt;
srcversion:     706E928F4D4ECF2659B961F&lt;br/&gt;
depends:        mlx5_core,ib_core,ib_uverbs,mlx_compat&lt;br/&gt;
vermagic:       3.10.0-1062.1.1.el7_lustre.x86_64 SMP mod_unload modversions &lt;br/&gt;
parm:           dc_cnak_qp_depth:DC CNAK QP depth (uint)&lt;br/&gt;
-bash-4.2$ rpm -qf /lib/modules/3.10.0-1062.1.1.el7_lustre.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko&lt;br/&gt;
kmod-mlnx-ofa_kernel-4.7-OFED.4.7.1.0.0.1.g1c4bf42.x86_64&lt;/p&gt;

&lt;p&gt;However, like I mentioned before, we have some compute nodes that are connected to the same HDR fabric as the lustre server (holyscratch01) and don&apos;t go through the lnet routers. Those nodes were also giving the BAD WRITE CHECKSUM error.&lt;/p&gt;</comment>
                            <comment id="272038" author="mre64" created="Fri, 5 Jun 2020 01:55:38 +0000"  >&lt;p&gt;I would like to mention I&apos;m seeing a lot of these server_bulk_callback errors every 3-5 sec on oss03 and also I have seen these errors on oss06 (which has been flakey lately) which we ended up failing all of oss06 OSTs to oss05 because of its instability:&lt;/p&gt;

&lt;p&gt;Jun  4 21:49:48 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ecbd649ae00&lt;br/&gt;
Jun  4 21:49:48 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ec19c3c1000&lt;br/&gt;
Jun  4 21:49:48 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9eb2c6e59000&lt;br/&gt;
Jun  4 21:49:52 holyscratch01oss03 kernel: LustreError: 10895:0:(ldlm_lib.c:3262:target_bulk_io()) @@@ network error on bulk READ  req@ffff9ea2b5496050 x1661185873568080/t0(0) o3-&amp;gt;64c2d217-765e-c934-68ad-1480c3b9eac2@10.31.130.246@tcp:2/0 lens 608/440 e 0 to 0 dl 1591321807 ref 1 fl Interpret:/0/0 rc 0/0&lt;br/&gt;
Jun  4 21:49:52 holyscratch01oss03 kernel: LustreError: 10895:0:(ldlm_lib.c:3262:target_bulk_io()) Skipped 134 previous similar messages&lt;br/&gt;
Jun  4 21:49:54 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ec449cf6c00&lt;br/&gt;
Jun  4 21:50:00 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ec19c3c2400&lt;br/&gt;
Jun  4 21:50:00 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ef420aeb200&lt;br/&gt;
Jun  4 21:50:07 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9eb2c6e58600&lt;br/&gt;
Jun  4 21:50:07 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ec449cf0600&lt;br/&gt;
Jun  4 21:50:13 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ec31bdc2c00&lt;br/&gt;
Jun  4 21:50:19 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ebbd9c8f200&lt;br/&gt;
Jun  4 21:50:19 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ef075f1a800&lt;br/&gt;
Jun  4 21:50:26 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ec54f615a00&lt;br/&gt;
Jun  4 21:50:32 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ecbd6499000&lt;br/&gt;
Jun  4 21:50:38 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9e9ac410d600&lt;br/&gt;
Jun  4 21:50:38 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9e98ca39bc00&lt;br/&gt;
Jun  4 21:50:43 holyscratch01oss03 systemd: Starting IML Swap Emitter...&lt;br/&gt;
Jun  4 21:50:43 holyscratch01oss03 systemd: Started IML Swap Emitter.&lt;br/&gt;
Jun  4 21:50:45 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9eb49b8b9a00&lt;br/&gt;
Jun  4 21:50:57 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ecd6d92ba00&lt;br/&gt;
Jun  4 21:51:04 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff9ee813e90000&lt;br/&gt;
Jun  4 21:51:04 holyscratch01oss03 kernel: Lustre: scratch1-OST0014: Bulk IO write error with a10910a5-f0a8-e30e-b540-a728a40bd183 (at 10.31.163.219@o2ib), client will retry: rc = -110&lt;br/&gt;
Jun  4 21:51:04 holyscratch01oss03 kernel: Lustre: Skipped 73 previous similar messages&lt;br/&gt;
Jun  4 21:51:07 holyscratch01oss03 kernel: LustreError: 11313:0:(sec.c:2485:sptlrpc_svc_unwrap_bulk()) @@@ truncated bulk GET 3145728(4194304)  req@ffff9eaf5d2a8850 x1644618841213632/t0(0) o4-&amp;gt;a5f73f03-2a7d-38af-559e-bc8ad5f84416@10.31.167.138@o2ib:182/0 lens 608/448 e 0 to 0 dl 1591321987 ref 1 fl Interpret:/0/0 rc 0/0&lt;br/&gt;
Jun  4 21:51:07 holyscratch01oss03 kernel: LustreError: 11313:0:(sec.c:2485:sptlrpc_svc_unwrap_bulk()) Skipped 68 previous similar messages&lt;br/&gt;
Jun  4 21:51:10 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9eb161ddb200&lt;br/&gt;
Jun  4 21:51:10 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ee83c77a600&lt;br/&gt;
Jun  4 21:51:10 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ebbd9c8f400&lt;br/&gt;
Jun  4 21:51:23 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ebf48b49a00&lt;br/&gt;
Jun  4 21:51:23 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ecc98b41c00&lt;br/&gt;
Jun  4 21:51:29 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9edc45d7bc00&lt;br/&gt;
Jun  4 21:51:29 holyscratch01oss03 kernel: LustreError: 8683:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ee83c77e800&lt;/p&gt;</comment>
                            <comment id="272039" author="dongyang" created="Fri, 5 Jun 2020 02:09:12 +0000"  >&lt;p&gt;Michael,&lt;/p&gt;

&lt;p&gt;can you create a file using lfs setstripe, with the stripes on the OSTs running on oss03 (or whichever OSS is showing the errors),&lt;/p&gt;

&lt;p&gt;and do a simple dd (both read and write) from the client, and see if you hit the checksum errors?&lt;/p&gt;</comment>
                            <comment id="272066" author="mre64" created="Fri, 5 Jun 2020 14:44:48 +0000"  >&lt;p&gt;Hi Dongyang,&lt;br/&gt;
I tried that and I am not seeing any checksum errors on the oss03 /var/log/messages file when I run the dd.&lt;/p&gt;</comment>
                            <comment id="272081" author="mre64" created="Fri, 5 Jun 2020 16:30:18 +0000"  >&lt;p&gt;Hi Dongyang, this is what I did exactly:&lt;/p&gt;

&lt;p&gt;local OSTs on oss03:&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@holyscratch01oss03 ~&amp;#93;&lt;/span&gt;# df -hl&lt;br/&gt;
Filesystem                                  Size  Used Avail Use% Mounted on&lt;br/&gt;
devtmpfs                                    189G     0  189G   0% /dev&lt;br/&gt;
tmpfs                                       189G   39M  189G   1% /dev/shm&lt;br/&gt;
tmpfs                                       189G   43M  189G   1% /run&lt;br/&gt;
tmpfs                                       189G     0  189G   0% /sys/fs/cgroup&lt;br/&gt;
/dev/mapper/centos_holyscratch01oss03-root  218G  7.7G  211G   4% /&lt;br/&gt;
/dev/sda1                                  1014M  200M  815M  20% /boot&lt;br/&gt;
/dev/mapper/mpathg                           85T   46T   35T  57% /mnt/scratch1-OST0014&lt;br/&gt;
/dev/mapper/mpathe                           85T   45T   36T  56% /mnt/scratch1-OST000e&lt;br/&gt;
/dev/mapper/mpathi                           85T   44T   37T  55% /mnt/scratch1-OST0002&lt;br/&gt;
/dev/mapper/mpathj                           85T   44T   38T  54% /mnt/scratch1-OST0008&lt;br/&gt;
/dev/mapper/mpathm                           85T   47T   35T  58% /mnt/scratch1-OST001a&lt;br/&gt;
/dev/mapper/mpathc                           85T   47T   35T  58% /mnt/scratch1-OST002c&lt;br/&gt;
/dev/mapper/mpathn                           85T   44T   37T  55% /mnt/scratch1-OST0020&lt;br/&gt;
/dev/mapper/mpatha                           85T   45T   36T  56% /mnt/scratch1-OST0026&lt;/p&gt;

&lt;p&gt;Lists the OSTs on the holyscratch01 lustre FS:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@holy2c24108 ~&amp;#93;&lt;/span&gt;# lfs osts /n/holyscratch01&lt;br/&gt;
OBDS:&lt;br/&gt;
0: scratch1-OST0000_UUID ACTIVE&lt;br/&gt;
1: scratch1-OST0001_UUID ACTIVE&lt;br/&gt;
2: scratch1-OST0002_UUID ACTIVE&lt;br/&gt;
3: scratch1-OST0003_UUID ACTIVE&lt;br/&gt;
4: scratch1-OST0004_UUID ACTIVE&lt;br/&gt;
5: scratch1-OST0005_UUID ACTIVE&lt;br/&gt;
6: scratch1-OST0006_UUID ACTIVE&lt;br/&gt;
7: scratch1-OST0007_UUID ACTIVE&lt;br/&gt;
8: scratch1-OST0008_UUID ACTIVE&lt;br/&gt;
9: scratch1-OST0009_UUID ACTIVE&lt;br/&gt;
10: scratch1-OST000a_UUID ACTIVE&lt;br/&gt;
11: scratch1-OST000b_UUID ACTIVE&lt;br/&gt;
12: scratch1-OST000c_UUID ACTIVE&lt;br/&gt;
13: scratch1-OST000d_UUID ACTIVE&lt;br/&gt;
14: scratch1-OST000e_UUID ACTIVE&lt;br/&gt;
15: scratch1-OST000f_UUID ACTIVE&lt;br/&gt;
16: scratch1-OST0010_UUID ACTIVE&lt;br/&gt;
17: scratch1-OST0011_UUID ACTIVE&lt;br/&gt;
18: scratch1-OST0012_UUID ACTIVE&lt;br/&gt;
19: scratch1-OST0013_UUID ACTIVE&lt;br/&gt;
20: scratch1-OST0014_UUID ACTIVE&lt;br/&gt;
21: scratch1-OST0015_UUID ACTIVE&lt;br/&gt;
22: scratch1-OST0016_UUID ACTIVE&lt;br/&gt;
23: scratch1-OST0017_UUID ACTIVE&lt;br/&gt;
24: scratch1-OST0018_UUID ACTIVE&lt;br/&gt;
25: scratch1-OST0019_UUID ACTIVE&lt;br/&gt;
26: scratch1-OST001a_UUID ACTIVE&lt;br/&gt;
27: scratch1-OST001b_UUID ACTIVE&lt;br/&gt;
28: scratch1-OST001c_UUID ACTIVE&lt;br/&gt;
29: scratch1-OST001d_UUID ACTIVE&lt;br/&gt;
30: scratch1-OST001e_UUID ACTIVE&lt;br/&gt;
31: scratch1-OST001f_UUID ACTIVE&lt;br/&gt;
32: scratch1-OST0020_UUID ACTIVE&lt;br/&gt;
33: scratch1-OST0021_UUID ACTIVE&lt;br/&gt;
34: scratch1-OST0022_UUID ACTIVE&lt;br/&gt;
35: scratch1-OST0023_UUID ACTIVE&lt;br/&gt;
36: scratch1-OST0024_UUID ACTIVE&lt;br/&gt;
37: scratch1-OST0025_UUID ACTIVE&lt;br/&gt;
38: scratch1-OST0026_UUID ACTIVE&lt;br/&gt;
39: scratch1-OST0027_UUID ACTIVE&lt;br/&gt;
40: scratch1-OST0028_UUID ACTIVE&lt;br/&gt;
41: scratch1-OST0029_UUID ACTIVE&lt;br/&gt;
42: scratch1-OST002a_UUID ACTIVE&lt;br/&gt;
43: scratch1-OST002b_UUID ACTIVE&lt;br/&gt;
44: scratch1-OST002c_UUID ACTIVE&lt;br/&gt;
45: scratch1-OST002d_UUID ACTIVE&lt;br/&gt;
46: scratch1-OST002e_UUID ACTIVE&lt;br/&gt;
47: scratch1-OST002f_UUID ACTIVE&lt;/p&gt;

&lt;p&gt;Execute the lfs setstripe:&lt;br/&gt;
lfs setstripe --ost-list 2,8,14,20,26,32,38,44 /n/holyscratch01/rc_admin/methier/teststripe&lt;/p&gt;

&lt;p&gt;The dd write:&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@holy2c24108 ~&amp;#93;&lt;/span&gt;# dd if=/dev/urandom of=/n/holyscratch01/rc_admin/methier/teststripe count=1024 bs=10M&lt;/p&gt;

&lt;p&gt;For the read, I just swapped if and of:&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@holy2c24108 ~&amp;#93;&lt;/span&gt;# dd of=/dev/urandom if=/n/holyscratch01/rc_admin/methier/teststripe count=1024 bs=10M&lt;/p&gt;</comment>
                            <comment id="272084" author="mre64" created="Fri, 5 Jun 2020 17:10:12 +0000"  >&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@holy2c24108 ~&amp;#93;&lt;/span&gt;# lfs getstripe /n/holyscratch01/rc_admin/methier/teststripe&lt;br/&gt;
/n/holyscratch01/rc_admin/methier/teststripe&lt;br/&gt;
lmm_stripe_count:  8&lt;br/&gt;
lmm_stripe_size:   1048576&lt;br/&gt;
lmm_pattern:       1&lt;br/&gt;
lmm_layout_gen:    0&lt;br/&gt;
lmm_stripe_offset: 2&lt;br/&gt;
	obdidx		 objid		 objid		 group&lt;br/&gt;
	     2	      78461277	    0x4ad395d	             0&lt;br/&gt;
	     8	      79385285	    0x4bb52c5	             0&lt;br/&gt;
	    14	      77802814	    0x4a32d3e	             0&lt;br/&gt;
	    20	      77643661	    0x4a0bf8d	             0&lt;br/&gt;
	    26	      77265238	    0x49af956	             0&lt;br/&gt;
	    32	      79283352	    0x4b9c498	             0&lt;br/&gt;
	    38	      78144208	    0x4a862d0	             0&lt;br/&gt;
	    44	      77354743	    0x49c56f7	             0&lt;/p&gt;</comment>
                            <comment id="272089" author="adilger" created="Fri, 5 Jun 2020 17:54:15 +0000"  >&lt;p&gt;Just to clarify, you are still running 2.12.3 on the servers, and not 2.12.4?  There was a known issue with 2.12.3 (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13020&quot; title=&quot;ko2iblnd tuning&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13020&quot;&gt;&lt;del&gt;LU-13020&lt;/del&gt;&lt;/a&gt;, &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13145&quot; title=&quot;LNet Health: increase transaction timeout&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13145&quot;&gt;&lt;del&gt;LU-13145&lt;/del&gt;&lt;/a&gt;) that affected LNet under load on larger system that may be contributing to the problem here. It appears that you are seeing the checksum errors because the bulk data transfers are being interrupted. &lt;/p&gt;

&lt;p&gt;A workaround to get behavior equivalent to 2.12.4 systems, without upgrading or applying a patch, is to run the following commands on all of the 2.12.3 nodes in the order shown:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;echo 150 &amp;gt; /sys/module/lnet/parameters/lnet_transaction_timeout
echo 3 &amp;gt; /sys/module/lnet/parameters/lnet_retry_count
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This only changes these values temporarily, but they can be set permanently by adding the following line to &lt;tt&gt;/etc/modprobe.d/lnet.conf&lt;/tt&gt; on all 2.12.3 nodes:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options lnet lnet_retry_count=3 lnet_transaction_timeout=150
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
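
&lt;p&gt;The applied values can then be confirmed at runtime with lnetctl, e.g.:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# check the currently active LNet settings on each node
lnetctl global show | grep -E &apos;retry_count|transaction_timeout&apos;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;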

&lt;p&gt;Another thing to try to reduce the severity of the problem, if the above does not help, would be setting &quot;&lt;tt&gt;lctl set_param osc.&amp;#42;.max_pages_per_rpc=1M&lt;/tt&gt;&quot; on the clients. This reduces the size of each bulk transfer per RPC, which should at least avoid the checksum errors being reported, and avoid network traffic congestion if there &lt;b&gt;are&lt;/b&gt; still transfer errors, since smaller RPCs need less data resent each time an error is hit. This would only affect the current clients, so you could also set &quot;&lt;tt&gt;lctl set_param obdfilter.&amp;#42;.brw_size=1&lt;/tt&gt;&quot; on all OSS nodes to limit this for future client mounts as well.&lt;/p&gt;</comment>
                            <comment id="272090" author="mre64" created="Fri, 5 Jun 2020 18:05:58 +0000"  >&lt;p&gt;Hi Andreas,&lt;br/&gt;
Thanks for the useful info. Below is what we have set on the lnet routers; do you see an issue with these settings?&lt;br/&gt;
Thanks,&lt;br/&gt;
Mike&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@cannonlnet06 ~&amp;#93;&lt;/span&gt;# more /etc/modprobe.d/lustre.conf &lt;br/&gt;
options lnet networks=&quot;o2ib(ib1),o2ib2(ib1),o2ib4(ib0),tcp(bond0),tcp4(bond0.2475)&quot;&lt;br/&gt;
options lnet forwarding=&quot;enabled&quot;&lt;br/&gt;
options lnet lnet_peer_discovery_disabled=1&lt;br/&gt;
options lnet lnet_health_sensitivity=0&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@cannonlnet06 ~&amp;#93;&lt;/span&gt;# lnetctl global show&lt;br/&gt;
global:&lt;br/&gt;
    numa_range: 0&lt;br/&gt;
    max_intf: 200&lt;br/&gt;
    discovery: 0&lt;br/&gt;
    drop_asym_route: 0&lt;br/&gt;
    retry_count: 0&lt;br/&gt;
    transaction_timeout: 50&lt;br/&gt;
    health_sensitivity: 0&lt;br/&gt;
    recovery_interval: 1&lt;/p&gt;</comment>
                            <comment id="272094" author="mre64" created="Fri, 5 Jun 2020 18:16:09 +0000"  >&lt;p&gt;Also the lustre FS we have checksum errors on is lustre 2.12.3, yes. We have another lustre fs nearby running 2.12.4 and we have not seen checksum errors or instability on it recently. Its not getting hit with as much file I/O most likely as its a lab data storage server. This scratch01 (where checksums errors occur) server gets hammered by 2000 or so compute nodes as a temporary disk space to write/read their compute output.&lt;/p&gt;</comment>
                            <comment id="272115" author="adilger" created="Sat, 6 Jun 2020 05:27:05 +0000"  >&lt;p&gt;Serguei, could you review the LNet parameters in &lt;a href=&quot;#comment-272090&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;comment-272090&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I would recommend also trying the newer 2.12.4 clients on at least some of the clients to determine if this resolves the issue.&lt;/p&gt;</comment>
                            <comment id="272172" author="ssmirnov" created="Sun, 7 Jun 2020 16:08:38 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;LNet parameters look fine to me. Perhaps &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=ashehata&quot; class=&quot;user-hover&quot; rel=&quot;ashehata&quot;&gt;ashehata&lt;/a&gt; can take a quick look to double-check. Because there appear to be multiple interfaces of the same kind,&#160;I&apos;d also recommend checking if Linux routing is setup as outlined here: &lt;a href=&quot;http://wiki.lustre.org/LNet_Router_Config_Guide#ARP_flux_issue_for_MR_node.&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://wiki.lustre.org/LNet_Router_Config_Guide#ARP_flux_issue_for_MR_node.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="272303" author="mre64" created="Mon, 8 Jun 2020 19:40:13 +0000"  >&lt;p&gt;Hi Andreas,&lt;br/&gt;
I tried setting:&lt;br/&gt;
echo 150 &amp;gt; /sys/module/lnet/parameters/lnet_transaction_timeout&lt;br/&gt;
echo 3 &amp;gt; /sys/module/lnet/parameters/lnet_retry_count&lt;/p&gt;

&lt;p&gt;But it&apos;s not letting me change retry_count:&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@salt ~&amp;#93;&lt;/span&gt;# salt &apos;holyscratch01*&apos; cmd.run &quot;echo 3 &amp;gt; /sys/module/lnet/parameters/lnet_retry_count&quot;&lt;br/&gt;
holyscratch01oss01.rc.fas.harvard.edu:&lt;br/&gt;
    /bin/sh: line 0: echo: write error: Invalid argument&lt;br/&gt;
holyscratch01oss06.rc.fas.harvard.edu:&lt;br/&gt;
    /bin/sh: /sys/module/lnet/parameters/lnet_retry_count: No such file or directory&lt;br/&gt;
holyscratch01oss04.rc.fas.harvard.edu:&lt;br/&gt;
    /bin/sh: line 0: echo: write error: Invalid argument&lt;br/&gt;
holyscratch01oss02.rc.fas.harvard.edu:&lt;/p&gt;

&lt;p&gt;holyscratch01oss03.rc.fas.harvard.edu:&lt;br/&gt;
    /bin/sh: /sys/module/lnet/parameters/lnet_retry_count: No such file or directory&lt;br/&gt;
holyscratch01mds02.rc.fas.harvard.edu:&lt;br/&gt;
    /bin/sh: line 0: echo: write error: Invalid argument&lt;br/&gt;
holyscratch01oss05.rc.fas.harvard.edu:&lt;br/&gt;
    /bin/sh: line 0: echo: write error: Invalid argument&lt;br/&gt;
holyscratch01mds01.rc.fas.harvard.edu:&lt;/p&gt;

&lt;p&gt;What do you recommend? Is the order correct? I changed transaction_timeout to 150, then tried to change it to a lower value, and it doesn&apos;t let me now. There must be some kind of dependency between these two parameters that doesn&apos;t allow you to change them.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Mike&lt;/p&gt;</comment>
                            <comment id="272305" author="mre64" created="Mon, 8 Jun 2020 19:50:58 +0000"  >&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@holyscratch01oss04 ~&amp;#93;&lt;/span&gt;# lnetctl global show&lt;br/&gt;
global:&lt;br/&gt;
    numa_range: 0&lt;br/&gt;
    max_intf: 200&lt;br/&gt;
    discovery: 0&lt;br/&gt;
    drop_asym_route: 0&lt;br/&gt;
    retry_count: 0&lt;br/&gt;
    transaction_timeout: 150&lt;br/&gt;
    health_sensitivity: 0&lt;br/&gt;
    recovery_interval: 1&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@holyscratch01oss04 ~&amp;#93;&lt;/span&gt;# lnetctl set retry_count 3&lt;br/&gt;
add:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;retry_count:&lt;br/&gt;
          errno: -5&lt;br/&gt;
          descr: &quot;cannot configure retry count: Invalid argument&quot;&lt;/li&gt;
&lt;/ul&gt;
</comment>
                            <comment id="272307" author="ssmirnov" created="Mon, 8 Jun 2020 20:12:40 +0000"  >&lt;p&gt;Hi &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=mre64&quot; class=&quot;user-hover&quot; rel=&quot;mre64&quot;&gt;mre64&lt;/a&gt;,&lt;/p&gt;

&lt;p&gt;If you have health_sensitivity set to 0, you are prevented from setting non-zero retry_count as the health feature is off. You should still be able to change transaction_timeout though.&lt;/p&gt;
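
&lt;p&gt;(So if retries are wanted, the ordering matters: raise health_sensitivity first, then set retry_count. A sketch of the runtime commands follows; the health_sensitivity value of 100 is illustrative, and the equivalent lnet module options take effect at the next module load:)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# enable LNet health first, otherwise setting a non-zero retry_count fails
lnetctl set health_sensitivity 100
lnetctl set retry_count 3
lnetctl set transaction_timeout 150
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;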

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei&lt;/p&gt;</comment>
                            <comment id="272309" author="mre64" created="Mon, 8 Jun 2020 20:27:25 +0000"  >&lt;p&gt;Hi Serguei,&lt;br/&gt;
Thanks. So in our case, setting lnet_transaction_timeout=150 will be the only setting we should change, per Andreas&apos; suggestion? Does that mean that if retry_count is set to 3 or some other value, it&apos;s ignored when health is turned off?&lt;br/&gt;
Mike&lt;/p&gt;</comment>
                            <comment id="272312" author="adilger" created="Mon, 8 Jun 2020 21:20:02 +0000"  >&lt;p&gt;Right, if &lt;tt&gt;health_sensitivity=&lt;/tt&gt; then these parameters are ignored, which is fine.  It means that LNet will not interrupt incomplete RPCs to retry sending them to the server.  Sorry, I didn&apos;t realize this was the case.&lt;/p&gt;</comment>
                            <comment id="272317" author="mre64" created="Mon, 8 Jun 2020 22:27:48 +0000"  >&lt;p&gt;Hi Andreas, I went ahead and set this on all the hosts that make up the scratch01 filesystem in /etc/modprobe.d/lustre.conf:&lt;br/&gt;
options lnet lnet_health_sensitivity=100 lnet_retry_count=3 lnet_transaction_timeout=150&lt;br/&gt;
The reason we had health turned off is that it seemed to cause issues for us in the past. Do you recommend we turn health on or off?&lt;br/&gt;
Mike&lt;/p&gt;</comment>
                            <comment id="272465" author="mre64" created="Wed, 10 Jun 2020 14:32:42 +0000"  >&lt;p&gt;Hi Andreas,&lt;/p&gt;

&lt;p&gt;I set the following settings, and we still have the Bulk IO and server_bulk_callback messages on 2 of the OSS machines:&lt;/p&gt;

&lt;p&gt;options lnet lnet_health_sensitivity=100 lnet_retry_count=3 lnet_transaction_timeout=150 lnet_peer_discovery_disabled=1&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@holyscratch01oss06 ~&amp;#93;&lt;/span&gt;# lnetctl global show&lt;br/&gt;
global:&lt;br/&gt;
    numa_range: 0&lt;br/&gt;
    max_intf: 200&lt;br/&gt;
    discovery: 0&lt;br/&gt;
    drop_asym_route: 0&lt;br/&gt;
    retry_count: 3&lt;br/&gt;
    transaction_timeout: 150&lt;br/&gt;
    health_sensitivity: 100&lt;br/&gt;
    recovery_interval: 1&lt;/p&gt;

&lt;p&gt;Also checked lctl get_param osc.*.max_pages_per_rpc from one of the clients that gives the bulk I/O and callback messages, and it&apos;s all 1M:&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@holy2c18214 ~&amp;#93;&lt;/span&gt;# lctl get_param osc.*.max_pages_per_rpc |grep scratch&lt;br/&gt;
osc.scratch1-OST0000-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0001-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0002-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0003-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0004-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0005-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0006-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0007-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0008-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0009-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST000a-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST000b-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST000c-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST000d-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST000e-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST000f-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0010-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0011-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0012-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0013-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0014-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0015-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0016-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0017-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0018-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0019-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST001a-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST001b-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST001c-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST001d-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST001e-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST001f-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0020-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0021-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0022-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0023-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0024-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0025-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0026-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0027-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0028-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST0029-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST002a-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST002b-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST002c-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST002d-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST002e-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;br/&gt;
osc.scratch1-OST002f-osc-ffff957cc6b6c800.max_pages_per_rpc=1024&lt;/p&gt;

&lt;p&gt;We are going to see if we can update the lustre client to 2.12.4. It does build on our older CentOS 7.6 OS; we just have to set options lnet lnet_peer_discovery_disabled=1 in order for it to mount all the lustre FS properly. Can you update 2.10.7 to 2.12.4 while jobs are running and the FS are mounted, and then reboot to get the new version, or do you have to remove the older version completely and install the new one with no lustre mounts?&lt;/p&gt;

&lt;p&gt;Any other suggestions ?&lt;br/&gt;
Thanks,&lt;br/&gt;
Mike&lt;/p&gt;</comment>
                            <comment id="272520" author="mre64" created="Wed, 10 Jun 2020 20:30:31 +0000"  >&lt;p&gt;This is what I see on the OSS nodes:&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@salt ~&amp;#93;&lt;/span&gt;# salt &apos;holyscratch01oss*&apos; cmd.run &apos;lctl get_param obdfilter.*.brw_size&apos;&lt;br/&gt;
holyscratch01oss05.rc.fas.harvard.edu:&lt;br/&gt;
    obdfilter.scratch1-OST0000.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST0006.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST000c.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST0012.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST0018.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST001e.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST0024.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST002a.brw_size=4&lt;br/&gt;
holyscratch01oss02.rc.fas.harvard.edu:&lt;br/&gt;
    obdfilter.scratch1-OST0004.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST000a.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST0011.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST0016.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST001d.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST0023.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST0029.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST002f.brw_size=4&lt;br/&gt;
holyscratch01oss06.rc.fas.harvard.edu:&lt;br/&gt;
    obdfilter.scratch1-OST0001.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST0007.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST000d.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST0013.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST0019.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST001f.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST0025.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST002b.brw_size=4&lt;br/&gt;
holyscratch01oss04.rc.fas.harvard.edu:&lt;br/&gt;
    obdfilter.scratch1-OST0003.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST0009.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST000f.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST0015.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST001b.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST0021.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST0027.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST002d.brw_size=4&lt;br/&gt;
holyscratch01oss01.rc.fas.harvard.edu:&lt;br/&gt;
    obdfilter.scratch1-OST0005.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST000b.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST0010.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST0017.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST001c.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST0022.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST0028.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST002e.brw_size=4&lt;br/&gt;
holyscratch01oss03.rc.fas.harvard.edu:&lt;br/&gt;
    obdfilter.scratch1-OST0002.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST0008.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST000e.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST0014.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST001a.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST0020.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST0026.brw_size=4&lt;br/&gt;
    obdfilter.scratch1-OST002c.brw_size=4&lt;/p&gt;</comment>
                            <comment id="272597" author="adilger" created="Thu, 11 Jun 2020 10:13:19 +0000"  >&lt;p&gt;Note that &quot;&lt;tt&gt;osc.&amp;#42;.max_pages_per_rpc=1024&lt;/tt&gt;&quot; is 4MB (with &lt;tt&gt;PAGE_SIZE=4096&lt;/tt&gt; on x86 clients).  This matches with &quot;&lt;tt&gt;obdfilter.&amp;#42;.brw_size=4&lt;/tt&gt;&quot; on the OSTs. If you set &quot;&lt;tt&gt;...max_pages_per_rpc=1M&lt;/tt&gt;&quot; it is internally converted to 256x 4KB pages.  &lt;/p&gt;</comment>
                            <comment id="273141" author="mre64" created="Wed, 17 Jun 2020 20:03:28 +0000"  >&lt;p&gt;Hi,&lt;br/&gt;
So nothing has worked so far to solve our problem with the bulk errors, and we can only run on 4 of the 6 OSS for now. We are going to try to update the lustre client side to 2.12.4 as soon as we can. Also, the scratch01 lustre server that has issues is running 2.12.3, and at some point we will update that to 2.12.4.&lt;br/&gt;
Thanks,&lt;br/&gt;
Mike&lt;/p&gt;</comment>
                            <comment id="276052" author="mre64" created="Thu, 23 Jul 2020 19:01:05 +0000"  >&lt;p&gt;Hello,&lt;/p&gt;

&lt;p&gt;We upgraded all 2150 of our compute nodes&apos; lustre clients to 2.12.4 this past Monday and it seems to have greatly stabilized things for us. Apparently this was the main fix we needed. So we have most devices (not all) on 2.12.4 (clients, lnet routers and Lustre FS). We still have some of our main lustre FS on v2.12.3.&lt;/p&gt;

&lt;p&gt;We are planning on upgrading our compute nodes to the CentOS 7.8 OS in October. Do you recommend we stay on 2.12.4 or update the lustre clients to 2.12.5? Our lustre storage will most likely stay on 2.12.3 or 2.12.4 for a while longer.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Mike&lt;/p&gt;</comment>
                            <comment id="276053" author="pjones" created="Thu, 23 Jul 2020 19:47:54 +0000"  >&lt;p&gt;Mike&lt;/p&gt;

&lt;p&gt;I think that you&apos;ll need to move to 2.12.5 in order to get support for CentOS 7.8&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="35053" name="Screen Shot 2020-06-03 at 9.24.17 PM.png" size="579963" author="mre64" created="Thu, 4 Jun 2020 01:25:58 +0000"/>
                            <attachment id="35060" name="holy2b11102.out" size="41057" author="mre64" created="Thu, 4 Jun 2020 13:48:56 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i011yv:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10020"><![CDATA[1]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>