<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:40:53 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-11093] clients hang when over quota</title>
                <link>https://jira.whamcloud.com/browse/LU-11093</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;client nodes hang and loop forever when codes go over group quota. Lustre is very verbose when this happens. the below is pretty typical. john50 is a client. the arkles are OSS&apos;s.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Jun 23 13:22:42 john50 kernel: LNetError: 895:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.32@o2ib44, match 1603955949771616 length 1048576 too big: 1048208 left, 1048208 allowed
Jun 23 13:22:42 arkle2 kernel: LustreError: 297785:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8815b4044a00
Jun 23 13:22:42 arkle2 kernel: LustreError: 297785:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8815b4044a00
Jun 23 13:22:42 arkle2 kernel: LustreError: 272882:0:(ldlm_lib.c:3253:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff8816bf52cc50 x1603955949771616/t0(0) o4-&amp;gt;fe53a66a-b8c6-e1de-1353-a3b91bd42058@192.168.44.150@o2ib44:548/0 lens 608/448 e 0 to 0 dl 1529724168 ref 1 fl Interpret:/0/0 rc 0/0
Jun 23 13:22:42 arkle2 kernel: LustreError: 272882:0:(ldlm_lib.c:3253:target_bulk_io()) Skipped 73 previous similar messages
Jun 23 13:22:42 arkle2 kernel: Lustre: dagg-OST0002: Bulk IO write error with fe53a66a-b8c6-e1de-1353-a3b91bd42058 (at 192.168.44.150@o2ib44), client will retry: rc = -110
Jun 23 13:22:42 arkle2 kernel: Lustre: Skipped 73 previous similar messages
Jun 23 13:22:49 john50 kernel: Lustre: 906:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1529724162/real 1529724162]  req@ffff8817b7ab4b00 x1603955949771616/t0(0) o4-&amp;gt;dagg-OST0002-osc-ffff882fcafb0800@192.168.44.32@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1529724169 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Jun 23 13:22:49 john50 kernel: Lustre: 906:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Jun 23 13:22:49 john50 kernel: Lustre: dagg-OST0002-osc-ffff882fcafb0800: Connection to dagg-OST0002 (at 192.168.44.32@o2ib44) was lost; in progress operations using this service will wait for recovery to complete
Jun 23 13:22:49 arkle2 kernel: Lustre: dagg-OST0002: Client fe53a66a-b8c6-e1de-1353-a3b91bd42058 (at 192.168.44.150@o2ib44) reconnecting
Jun 23 13:22:49 arkle2 kernel: Lustre: Skipped 72 previous similar messages
Jun 23 13:22:49 arkle2 kernel: Lustre: dagg-OST0002: Connection restored to  (at 192.168.44.150@o2ib44)
Jun 23 13:22:49 arkle2 kernel: Lustre: Skipped 59 previous similar messages
Jun 23 13:22:49 john50 kernel: Lustre: dagg-OST0002-osc-ffff882fcafb0800: Connection restored to 192.168.44.32@o2ib44 (at 192.168.44.32@o2ib44)
Jun 23 13:22:49 john50 kernel: LNetError: 893:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.32@o2ib44, match 1603955949773296 length 1048576 too big: 1048208 left, 1048208 allowed
Jun 23 13:22:49 arkle2 kernel: LustreError: 297783:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881055ca0e00
Jun 23 13:22:49 arkle2 kernel: LustreError: 297783:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881055ca0e00
Jun 23 13:22:56 john50 kernel: Lustre: 906:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1529724169/real 1529724169]  req@ffff8817b7ab4b00 x1603955949771616/t0(0) o4-&amp;gt;dagg-OST0002-osc-ffff882fcafb0800@192.168.44.32@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1529724176 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Jun 23 13:22:56 john50 kernel: Lustre: dagg-OST0002-osc-ffff882fcafb0800: Connection to dagg-OST0002 (at 192.168.44.32@o2ib44) was lost; in progress operations using this service will wait for recovery to complete
Jun 23 13:22:56 john50 kernel: Lustre: dagg-OST0002-osc-ffff882fcafb0800: Connection restored to 192.168.44.32@o2ib44 (at 192.168.44.32@o2ib44)
Jun 23 13:22:56 john50 kernel: LNetError: 895:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.32@o2ib44, match 1603955949773424 length 1048576 too big: 1048208 left, 1048208 allowed
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;the messages don&apos;t go away when the code exits. the node stays (somewhat) broken afterwards. only rebooting the client seems to fix it.&lt;/p&gt;

&lt;p&gt;reports from the users seem to indicate that it&apos;s not 100% repeatable, but is reasonably close.&lt;/p&gt;

&lt;p&gt;our OSTs are pretty plain and simple z3 12+3&apos;s with 4 of those making up one OST pool. 2M recordsize and compression on.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[arkle2]root: zfs get all | grep /OST | egrep &apos;compression|record&apos;
arkle2-dagg-OST2-pool/OST2  recordsize            2M                                         local
arkle2-dagg-OST2-pool/OST2  compression           lz4                                        inherited from arkle2-dagg-OST2-pool
arkle2-dagg-OST3-pool/OST3  recordsize            2M                                         local
arkle2-dagg-OST3-pool/OST3  compression           lz4                                        inherited from arkle2-dagg-OST3-pool
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</description>
                <environment>centos7, x86_64, OPA, zfs, compression on.</environment>
        <key id="52588">LU-11093</key>
            <summary>clients hang when over quota</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="hongchao.zhang">Hongchao Zhang</assignee>
                                    <reporter username="scadmin">SC Admin</reporter>
                        <labels>
                    </labels>
                <created>Sat, 23 Jun 2018 06:53:50 +0000</created>
                <updated>Mon, 13 Aug 2018 13:50:07 +0000</updated>
                            <resolved>Mon, 13 Aug 2018 13:50:07 +0000</resolved>
                                    <version>Lustre 2.10.4</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="229701" author="scadmin" created="Sat, 23 Jun 2018 06:57:24 +0000"  >&lt;p&gt;I&apos;ll also note that our &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10683&quot; title=&quot;write checksum errors&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10683&quot;&gt;&lt;del&gt;LU-10683&lt;/del&gt;&lt;/a&gt; seems possibly related to this. I saw some write checksum errors in one of the over quota incidents today.&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</comment>
                            <comment id="229703" author="pjones" created="Sun, 24 Jun 2018 05:00:33 +0000"  >&lt;p&gt;Hongchao&lt;/p&gt;

&lt;p&gt;Can you please assist with this issue?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="229777" author="hongchao.zhang" created="Thu, 28 Jun 2018 10:53:52 +0000"  >&lt;p&gt;It could be related to the LNet, the client does&apos;t receive the data from OSS for the following error&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Jun 23 13:22:42 john50 kernel: LNetError: 895:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.32@o2ib44, match 1603955949771616 length 1048576 too big: 1048208 left, 1048208 allowed
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
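
&lt;p&gt;if I read those numbers right, the incoming bulk is a full 1MiB (1048576 bytes) but the buffer the client has posted only allows 1048208 bytes, i.e. it is short by 368 bytes:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# illustrative arithmetic only, taken from the message above
$ echo $((1048576 - 1048208))
368
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;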

&lt;p&gt;the target_bulk_io in the OSS failed with -ETIMEDOUT (-110), causing the client to initiate the recovery process.&lt;br/&gt;
after the connection between the client and the OSS is restored, the above LNet issue is triggered again and causes the client to hang.&lt;/p&gt;</comment>
                            <comment id="229841" author="scadmin" created="Mon, 2 Jul 2018 05:15:09 +0000"  >&lt;p&gt;Hi Hongchao Zhang,&lt;/p&gt;

&lt;p&gt;I&apos;m not sure if that statement is directed at us or not, but we have no lnet or networking problems that we&apos;re aware of in this cluster.&lt;br/&gt;
only clients that are going over quota print the above message.&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</comment>
                            <comment id="229879" author="hongchao.zhang" created="Tue, 3 Jul 2018 09:17:40 +0000"  >&lt;p&gt;Hi Robin,&lt;/p&gt;

&lt;p&gt;Is it possible to apply some debug patch at your site and collect some logs when this issue is triggered?&lt;br/&gt;
I can&apos;t reproduce this issue locally, and it is better to have more logs to trace this problem.&lt;/p&gt;

&lt;p&gt;btw, what are the following values at your site?&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;#at OST
lctl get_param obdfilter.*.brw_size

#at Client
lctl get_param osc.*.max_pages_per_rpc
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Hongchao&lt;/p&gt;</comment>
                            <comment id="229920" author="scadmin" created="Wed, 4 Jul 2018 09:06:16 +0000"  >&lt;p&gt;Hi Hongchao,&lt;/p&gt;

&lt;p&gt;thanks for looking at this.&lt;/p&gt;

&lt;p&gt;all clients have the same&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;$ lctl get_param osc.*.max_pages_per_rpc
osc.apps-OST0000-osc-ffff8ad8dad4c000.max_pages_per_rpc=256
osc.apps-OST0001-osc-ffff8ad8dad4c000.max_pages_per_rpc=256
osc.dagg-OST0000-osc-ffff8ac199cad800.max_pages_per_rpc=512
osc.dagg-OST0001-osc-ffff8ac199cad800.max_pages_per_rpc=512
osc.dagg-OST0002-osc-ffff8ac199cad800.max_pages_per_rpc=512
osc.dagg-OST0003-osc-ffff8ac199cad800.max_pages_per_rpc=512
osc.dagg-OST0004-osc-ffff8ac199cad800.max_pages_per_rpc=512
osc.dagg-OST0005-osc-ffff8ac199cad800.max_pages_per_rpc=512
osc.dagg-OST0006-osc-ffff8ac199cad800.max_pages_per_rpc=512
osc.dagg-OST0007-osc-ffff8ac199cad800.max_pages_per_rpc=512
osc.dagg-OST0008-osc-ffff8ac199cad800.max_pages_per_rpc=512
osc.dagg-OST0009-osc-ffff8ac199cad800.max_pages_per_rpc=512
osc.dagg-OST000a-osc-ffff8ac199cad800.max_pages_per_rpc=512
osc.dagg-OST000b-osc-ffff8ac199cad800.max_pages_per_rpc=512
osc.dagg-OST000c-osc-ffff8ac199cad800.max_pages_per_rpc=512
osc.dagg-OST000d-osc-ffff8ac199cad800.max_pages_per_rpc=512
osc.dagg-OST000e-osc-ffff8ac199cad800.max_pages_per_rpc=512
osc.dagg-OST000f-osc-ffff8ac199cad800.max_pages_per_rpc=512
osc.home-OST0000-osc-ffff8af01e3ff800.max_pages_per_rpc=256
osc.home-OST0001-osc-ffff8af01e3ff800.max_pages_per_rpc=256
osc.images-OST0000-osc-ffff8ad8da08d000.max_pages_per_rpc=256
osc.images-OST0001-osc-ffff8ad8da08d000.max_pages_per_rpc=256
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and on servers for the big filesystem (group quotas)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;obdfilter.dagg-OST0000.brw_size=2
obdfilter.dagg-OST0001.brw_size=2
obdfilter.dagg-OST0002.brw_size=2
obdfilter.dagg-OST0003.brw_size=2
obdfilter.dagg-OST0004.brw_size=2
obdfilter.dagg-OST0005.brw_size=2
obdfilter.dagg-OST0006.brw_size=2
obdfilter.dagg-OST0007.brw_size=2
obdfilter.dagg-OST0008.brw_size=2
obdfilter.dagg-OST0009.brw_size=2
obdfilter.dagg-OST000a.brw_size=2
obdfilter.dagg-OST000b.brw_size=2
obdfilter.dagg-OST000c.brw_size=2
obdfilter.dagg-OST000d.brw_size=2
obdfilter.dagg-OST000e.brw_size=2
obdfilter.dagg-OST000f.brw_size=2
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and the small filesystems (only /home has user quotas, the rest have no quotas)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;obdfilter.apps-OST0000.brw_size=1
obdfilter.apps-OST0001.brw_size=1
obdfilter.home-OST0000.brw_size=1
obdfilter.home-OST0001.brw_size=1
obdfilter.images-OST0000.brw_size=1
obdfilter.images-OST0001.brw_size=1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
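
&lt;p&gt;if my arithmetic is right, those settings are consistent with each other (assuming 4KiB pages):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# illustrative check: max_pages_per_rpc * page size vs brw_size (in MiB)
$ echo $((512 * 4096))    # dagg OSTs, brw_size=2
2097152
$ echo $((256 * 4096))    # apps/home/images OSTs, brw_size=1
1048576
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;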

&lt;p&gt;it&apos;ll take a while for us to try to reproduce the problem artificially. I don&apos;t think I even have a user account with quotas, so I&apos;ll have to get one set up, etc.&lt;/p&gt;
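
&lt;p&gt;presumably something along these lines once an account is sorted out (the group name, limits and paths below are just a guess at what we&apos;d do):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# set a small group block quota (limits in kbytes: ~1GiB soft / 2GiB hard),
# then write past it from a client
lfs setquota -g testgroup -b 1048576 -B 2097152 /dagg
dd if=/dev/zero of=/dagg/testgroup/fill bs=1M count=4096
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;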

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</comment>
                            <comment id="229954" author="hongchao.zhang" created="Thu, 5 Jul 2018 11:45:57 +0000"  >&lt;p&gt;I have managed to reproduce the &quot;BAD CHECKSUM ERROR&quot; locality, but can&apos;t reproduce the &quot;lnet_try_match_md&quot; issue.&lt;br/&gt;
but it could be the same one, which is caused by some bug in the osd_zfs module.&lt;/p&gt;

&lt;p&gt;Could you please try the patch &lt;a href=&quot;https://review.whamcloud.com/32788&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/32788&lt;/a&gt; in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10683&quot; title=&quot;write checksum errors&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10683&quot;&gt;&lt;del&gt;LU-10683&lt;/del&gt;&lt;/a&gt;?&lt;br/&gt;
Thanks!&lt;/p&gt;</comment>
                            <comment id="230855" author="scadmin" created="Tue, 24 Jul 2018 19:00:55 +0000"  >&lt;p&gt;ok. we&apos;re running that patch on the largest of the 4 filesystems now. we&apos;ll let you know if we see it again. thanks!&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</comment>
                            <comment id="231852" author="scadmin" created="Mon, 13 Aug 2018 06:37:23 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;we haven&apos;t seen this issue again so are presuming it&apos;s fixed by &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10683&quot; title=&quot;write checksum errors&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10683&quot;&gt;&lt;del&gt;LU-10683&lt;/del&gt;&lt;/a&gt;.&lt;br/&gt;
thanks!&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</comment>
                            <comment id="231857" author="pjones" created="Mon, 13 Aug 2018 13:50:07 +0000"  >&lt;p&gt;Good news - thanks!&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="50865">LU-10683</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzzyfr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>