<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:19:17 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-15550] WBC: retry the batched RPC when the reply buffer is overflowed</title>
                <link>https://jira.whamcloud.com/browse/LU-15550</link>
                <project id="10000" key="LU">Lustre</project>
<description>&lt;p&gt;Before sending the batched RPC, the client has no idea of the actual reply buffer size that will be needed. &lt;br/&gt;
If the reply buffer the client prepared turns out to be smaller than the reply requires, the reply buffer can be grown accordingly.&lt;/p&gt;

&lt;p&gt;But when the needed reply buffer size is larger than BUT_MAXREPSIZE (1000 * 1024), the server returns the -EOVERFLOW error code. In that case the server has executed only part of the sub requests in the batched RPC; the sub requests that overflowed the reply buffer are not executed.&lt;/p&gt;

&lt;p&gt;Thus the client needs a retry mechanism: when it finds that the reply buffer overflowed, the client rebuilds the batched RPC from the sub requests that were not executed on the server and sends it again so they are re-executed.&lt;/p&gt;</description>
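<!--
The retry flow described in the issue can be sketched as a small self-contained C
program. This is an illustrative sketch only, not the actual Lustre patch:
send_batch() is a hypothetical stand-in for the real batched-RPC send path, here
mocked so that at most two sub-requests fit into one reply buffer.

```c
#include <assert.h>
#include <errno.h>

struct sub_req { int id; };

/* Mock transport: pretend the server's reply buffer only fits two
 * sub-replies per RPC; any extra sub-requests are not executed and
 * the server reports -EOVERFLOW (partial execution). */
static int send_batch(struct sub_req *reqs, int count, int *executed)
{
	(void)reqs;
	if (count > 2) {
		*executed = 2;
		return -EOVERFLOW;
	}
	*executed = count;
	return 0;
}

/* Client-side retry: on -EOVERFLOW, rebuild the batch starting at the
 * first sub-request the server did not execute, and resend it. Returns
 * the number of sub-requests executed, or a negative error code. */
static int send_batch_with_retry(struct sub_req *reqs, int count)
{
	int done = 0;

	while (done < count) {
		int executed = 0;
		int rc = send_batch(reqs + done, count - done, &executed);

		done += executed;
		if (rc == 0)
			break;
		if (rc != -EOVERFLOW)
			return rc;	/* a real failure, give up */
	}
	return done;
}
```

The loop makes progress on every iteration because each resend executes at least
the sub-requests that fit, so the retry terminates once the tail of the batch
fits into a single reply.
-->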
                <environment></environment>
        <key id="68629">LU-15550</key>
            <summary>WBC: retry the batched RPC when the reply buffer is overflowed</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="qian_wc">Qian Yingjin</assignee>
                                    <reporter username="qian_wc">Qian Yingjin</reporter>
                        <labels>
                    </labels>
                <created>Fri, 11 Feb 2022 09:18:33 +0000</created>
                <updated>Fri, 16 Jun 2023 15:27:50 +0000</updated>
                            <resolved>Mon, 1 May 2023 06:25:28 +0000</resolved>
                                                    <fixVersion>Lustre 2.16.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="326573" author="gerrit" created="Thu, 17 Feb 2022 03:59:29 +0000"  >&lt;p&gt;&quot;Yingjin Qian &amp;lt;qian@ddn.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/46540&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/46540&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15550&quot; title=&quot;WBC: retry the batched RPC when the reply buffer is overflowed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15550&quot;&gt;&lt;del&gt;LU-15550&lt;/del&gt;&lt;/a&gt; ptlrpc: retry mechanism for overflowed batched RPCs&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 3a33b10c3169056c5837869482fe96a6ea814a14&lt;/p&gt;</comment>
                            <comment id="370999" author="gerrit" created="Mon, 1 May 2023 04:08:14 +0000"  >&lt;p&gt;&quot;Oleg Drokin &amp;lt;green@whamcloud.com&amp;gt;&quot; merged in patch &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/46540/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/46540/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15550&quot; title=&quot;WBC: retry the batched RPC when the reply buffer is overflowed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15550&quot;&gt;&lt;del&gt;LU-15550&lt;/del&gt;&lt;/a&gt; ptlrpc: retry mechanism for overflowed batched RPCs&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 668f48f87bec3999892ce1daad24b6dba9ae362b&lt;/p&gt;</comment>
                            <comment id="371021" author="pjones" created="Mon, 1 May 2023 06:25:28 +0000"  >&lt;p&gt;Landed for 2.16&lt;/p&gt;</comment>
                            <comment id="371616" author="bzzz" created="Tue, 9 May 2023 07:15:03 +0000"  >&lt;p&gt;please check:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
[ 4604.470775] Lustre: DEBUG MARKER: == sanity test 123f: Retry mechanism with large wide striping files ========================================================== 15:59:19 (1683388759)
[ 4668.109917] Lustre: lustre-OST0000-osc-MDT0000: update sequence from 0x280000bd0 to 0x2800013a0
[ 4668.697696] Lustre: lustre-OST0001-osc-MDT0000: update sequence from 0x2c0000bd0 to 0x2c00013a0
[ 4791.809264] Lustre: lustre-OST0001-osc-MDT0000: update sequence from 0x2c00013a0 to 0x2c00013a1
[ 4791.859391] Lustre: lustre-OST0000-osc-MDT0000: update sequence from 0x2800013a0 to 0x2800013a1
[ 4809.801406] ------------[ cut here ]------------
[ 4809.801558] Max IOV exceeded: 257 should be &amp;lt; 256
[ 4809.801678] WARNING: CPU: 0 PID: 116874 at /home/lustre/master-mine/lnet/lnet/lib-md.c:257 lnet_md_build.part.1+0x4b3/0x770 [lnet]
[ 4809.801847] Modules linked in: lustre(O) ofd(O) osp(O) lod(O) ost(O) mdt(O) mdd(O) mgs(O) osd_zfs(O) lquota(O) lfsck(O) obdecho(O) mgc(O) mdc(O) lov(O) osc(O) lmv(O) fid(O) fld(O) ptlrpc(O) obdclass(O) ksocklnd(O) lnet(O) libcfs(O) zfs(O) zunicode(O) zzstd(O) zlua(O) zcommon(O) znvpair(O) zavl(O) icp(O) spl(O) [last unloaded: libcfs]
[ 4809.802263] CPU: 0 PID: 116874 Comm: mdt_out00_000 Tainted: G        W  O     --------- -  - 4.18.0 #2
[ 4809.802389] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 4809.802492] RIP: 0010:lnet_md_build.part.1+0x4b3/0x770 [lnet]
[ 4809.802589] Code: 00 00 48 c7 05 da 55 07 00 00 00 00 00 e8 f5 88 fb ff e9 05 fc ff ff ba 00 01 00 00 89 ee 48 c7 c7 38 49 7d c0 e8 c3 d7 91 d6 &amp;lt;0f&amp;gt; 0b e9 f6 fc ff ff f6 05 4b 38 fd ff 10 49 c7 c4 f4 ff ff ff 0f
[ 4809.802838] RSP: 0018:ffff972524367c10 EFLAGS: 00010282
[ 4809.802918] RAX: 0000000000000025 RBX: 0000000000000168 RCX: 0000000000000007
[ 4809.803028] RDX: 0000000000000007 RSI: 0000000000000022 RDI: ffff97256cbe5450
[ 4809.803147] RBP: 0000000000000101 R08: 000004cd62abfed5 R09: 0000000000000000
[ 4809.803257] R10: 0000000000000001 R11: 00000000ffffffff R12: ffff97247bb2a000
[ 4809.803367] R13: 0000000000000000 R14: ffff972242f00168 R15: ffff972524367c90
[ 4809.803477] FS:  0000000000000000(0000) GS:ffff97256ca00000(0000) knlGS:0000000000000000
[ 4809.803587] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4809.803679] CR2: 000055aeee75bf44 CR3: 00000003104c2000 CR4: 00000000000006b0
[ 4809.803792] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4809.803910] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 4809.804020] Call Trace:
[ 4809.804074]  LNetMDBind+0x3e/0x380 [lnet]
[ 4809.804193]  ptl_send_buf+0x117/0x570 [ptlrpc]
[ 4809.804307]  ? at_measured+0x1eb/0x300 [ptlrpc]
[ 4809.804419]  ? reply_out_callback+0x390/0x390 [ptlrpc]
[ 4809.804529]  ptlrpc_send_reply+0x2b4/0x8f0 [ptlrpc]
[ 4809.804647]  target_send_reply+0x343/0x770 [ptlrpc]
[ 4809.804768]  tgt_request_handle+0x9ca/0x1a40 [ptlrpc]
[ 4809.804885]  ? lustre_msg_get_transno+0x6f/0xd0 [ptlrpc]
[ 4809.804998]  ptlrpc_main+0x1784/0x32a0 [ptlrpc]
[ 4809.805083]  ? __kthread_parkme+0x33/0x90
[ 4809.805172]  ? ptlrpc_wait_event+0x4b0/0x4b0 [ptlrpc]
[ 4809.805254]  kthread+0x129/0x140
[ 4809.805314]  ? kthread_flush_work_fn+0x10/0x10
[ 4809.805391]  ret_from_fork+0x1f/0x30
[ 4809.805450] ---[ end trace 02ea061606b4429e ]---
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
<comment id="371618" author="qian_wc" created="Tue, 9 May 2023 07:41:35 +0000"  >&lt;p&gt;Is there any stable reproducer for this panic?&lt;/p&gt;

&lt;p&gt;It seems something is wrong when calculating the message buffer size... &lt;/p&gt;</comment>
                            <comment id="371619" author="bzzz" created="Tue, 9 May 2023 07:53:55 +0000"  >&lt;blockquote&gt;&lt;p&gt;Is there any stable reproducer for this panic?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;no, just a single hit so far, local testing.&lt;/p&gt;</comment>
<comment id="371727" author="qian_wc" created="Wed, 10 May 2023 03:37:53 +0000"  >&lt;p&gt;The panic was also hit in Maloo testing...&lt;br/&gt;
&lt;a href=&quot;https://testing.whamcloud.com/test_logs/b34c1c0d-4572-43b2-bc87-b4e32ba5e992/show_text&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_logs/b34c1c0d-4572-43b2-bc87-b4e32ba5e992/show_text&lt;/a&gt;&lt;br/&gt;
It should be a bug related to batched statahead. &lt;/p&gt;</comment>
                            <comment id="371729" author="qian_wc" created="Wed, 10 May 2023 03:55:40 +0000"  >&lt;p&gt;Adding just a single line here:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
                        max = BUT_MAXREPSIZE - req-&amp;gt;rq_replen;
                        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (used_len + msg_len &amp;gt; len)
                                len = used_len + msg_len;

                        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (len &amp;gt; max)
                                len += max;
                        &lt;span class=&quot;code-keyword&quot;&gt;else&lt;/span&gt;
                                len += len;

                        len += 10 * PAGE_SIZE; &lt;span class=&quot;code-comment&quot;&gt;// +++++++++++
&lt;/span&gt;                        rc = req_capsule_server_grow(&amp;amp;req-&amp;gt;rq_pill,
                                                     &amp;amp;RMF_BUT_REPLY, len);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It caused the panic:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
[ 6059.646103] Max IOV exceeded: 262 should be &amp;lt; 256
[ 6059.646267] RIP: 0010:lnet_md_build.part.10+0x525/0x790 [lnet]
[ 6059.646324]  LNetMDBind+0x48/0x380 [lnet]
[ 6059.646408]  ptl_send_buf+0x144/0x5a0 [ptlrpc]
[ 6059.646436]  ? ptlrpc_ni_fini+0x60/0x60 [ptlrpc]
[ 6059.646461]  ptlrpc_send_reply+0x2ad/0x8d0 [ptlrpc]
[ 6059.646493]  target_send_reply+0x324/0x7d0 [ptlrpc]
[ 6059.646526]  tgt_request_handle+0xe81/0x1920 [ptlrpc]
[ 6059.646554]  ptlrpc_server_handle_request+0x31d/0xbc0 [ptlrpc]
[ 6059.646580]  ? lprocfs_counter_add+0x12a/0x1a0 [obdclass]
[ 6059.646607]  ptlrpc_main+0xc4e/0x1510 [ptlrpc]
[ 6059.646610]  ? __schedule+0x2cc/0x700
[ 6059.646637]  ? ptlrpc_wait_event+0x590/0x590 [ptlrpc]
[ 6059.646639]  kthread+0x116/0x130
[ 6059.646640]  ? kthread_flush_work_fn+0x10/0x10
[ 6059.646641]  ret_from_fork+0x1f/0x40
[ 6059.646642] ---[ end trace d2884932fb9123c4 ]---
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The calculation for max allowed reply buffer size is incorrect.&lt;/p&gt;</comment>
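<!--
The growth logic quoted in the comment above can be contrasted with a clamped
version. A minimal sketch, assuming the goal is simply that the grown reply
length never exceeds BUT_MAXREPSIZE; the names mirror the snippet in the ticket,
but this is not the actual fix that landed.

```c
#include <assert.h>
#include <stddef.h>

/* Same cap as discussed in the ticket: BUT_MAXREPSIZE = 1000 * 1024. */
#define BUT_MAXREPSIZE (1000 * 1024)

/* Sketch of a clamped reply-buffer growth: grow geometrically, but never
 * past the maximum reply size, so the reply cannot exceed what LNet can
 * map into its I/O vector (the "Max IOV exceeded" warning above). */
static size_t grow_reply_len(size_t used_len, size_t msg_len, size_t len)
{
	if (used_len + msg_len > len)
		len = used_len + msg_len;

	len *= 2;			/* geometric growth */
	if (len > BUT_MAXREPSIZE)
		len = BUT_MAXREPSIZE;	/* hard clamp */
	return len;
}
```

With the clamp in place, an artificially inflated request (like the extra
10 * PAGE_SIZE in the reproducer) can no longer push the reply past the cap.
-->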
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="70916">LU-15975</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="76601">LU-16907</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="61685">LU-14139</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i02i47:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>