<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:48:08 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-5053] soft lockup in ptlrpcd on client</title>
                <link>https://jira.whamcloud.com/browse/LU-5053</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;On our BG/Q client nodes (ppc64), we frequently see soft lockup messages like so:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2014-05-12 17:24:44.936384 {RMP08Ma185908506} [mmcs]{0}.0.1:  kernel:BUG: soft lockup - CPU#32 stuck for 67s! [ptlrpcd_0:3202]
2014-05-12 17:24:45.140774 {RMP08Ma185908506} [mmcs]{0}.16.1: BUG: soft lockup - CPU#65 stuck for 67s! [ptlrpcd_1:3203]
2014-05-12 17:24:45.141140 {RMP08Ma185908506} [mmcs]{0}.16.1: Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm
2014-05-12 17:24:45.141506 {RMP08Ma185908506} [mmcs]{0}.16.1: NIP: 80000000003715bc LR: 80000000003717cc CTR: 000000000000007d
2014-05-12 17:24:45.141979 {RMP08Ma185908506} [mmcs]{0}.16.1: REGS: c0000003eb43ab40 TRAP: 0901   Not tainted  (2.6.32-358.11.1.bgq.4blueos.V1R2M1.bl2.1_0.ppc64)
2014-05-12 17:24:45.142274 {RMP08Ma185908506} [mmcs]{0}.16.1: MSR: 0000000080029000 &amp;lt;EE,ME,CE&amp;gt;  CR: 44224424  XER: 00000000
2014-05-12 17:24:45.142575 {RMP08Ma185908506} [mmcs]{0}.16.1: TASK = c0000003ecc14060[3203] &apos;ptlrpcd_1&apos; THREAD: c0000003eb438000 CPU: 65
2014-05-12 17:24:45.142979 {RMP08Ma185908506} [mmcs]{0}.16.1: GPR00: 000000000003dfb8 c0000003eb43adc0 80000000003b5550 00000000000015b0 
2014-05-12 17:24:45.143352 {RMP08Ma185908506} [mmcs]{0}.16.1: GPR04: c00000031d82c330 0000000000000000 0000000000000000 0000000000000053 
2014-05-12 17:24:45.143728 {RMP08Ma185908506} [mmcs]{0}.16.1: GPR08: 0000000000000048 c00000031d82d110 0000000000000071 000000001eaffaf3 
2014-05-12 17:24:45.144048 {RMP08Ma185908506} [mmcs]{0}.16.1: GPR12: 0000000000002720 c0000000007f7200 
2014-05-12 17:24:45.144424 {RMP08Ma185908506} [mmcs]{0}.16.1: NIP [80000000003715bc] .__adler32+0x9c/0x220 [libcfs]
2014-05-12 17:24:45.144861 {RMP08Ma185908506} [mmcs]{0}.16.1: LR [80000000003717cc] .adler32_update+0x1c/0x40 [libcfs]
2014-05-12 17:24:45.145111 {RMP08Ma185908506} [mmcs]{0}.16.1: Call Trace:
2014-05-12 17:24:45.145442 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43adc0] [c0000003eb43ae50] 0xc0000003eb43ae50 (unreliable)
2014-05-12 17:24:45.145800 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43ae40] [c0000000002041c4] .crypto_shash_update+0x4c/0x60
2014-05-12 17:24:45.146124 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43aeb0] [c000000000204218] .shash_compat_update+0x40/0x80
2014-05-12 17:24:45.146441 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43af70] [800000000037072c] .cfs_crypto_hash_update_page+0x8c/0xb0 [libcfs]
2014-05-12 17:24:45.146758 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b030] [8000000001284f94] .osc_checksum_bulk+0x1c4/0x890 [osc]
2014-05-12 17:24:45.147160 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b180] [80000000012863c0] .osc_brw_prep_request+0xd60/0x1d80 [osc]
2014-05-12 17:24:45.147512 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b340] [800000000129931c] .osc_build_rpc+0xc8c/0x2320 [osc]
2014-05-12 17:24:45.147814 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b4f0] [80000000012bf304] .osc_send_write_rpc+0x594/0xb40 [osc]
2014-05-12 17:24:45.148150 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b6d0] [80000000012bff84] .osc_check_rpcs+0x6d4/0x1770 [osc]
2014-05-12 17:24:45.148447 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b900] [80000000012c132c] .osc_io_unplug0+0x30c/0x6b0 [osc]
2014-05-12 17:24:45.148812 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43ba20] [800000000129b174] .brw_interpret+0x7c4/0x1940 [osc]
2014-05-12 17:24:45.149257 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43bb60] [8000000000edb6b4] .ptlrpc_check_set+0x3a4/0x4bb0 [ptlrpc]
2014-05-12 17:24:45.149659 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43bd20] [8000000000f2a80c] .ptlrpcd_check+0x66c/0x870 [ptlrpc]
2014-05-12 17:24:45.149960 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43be40] [8000000000f2acc8] .ptlrpcd+0x2b8/0x4c0 [ptlrpc]
2014-05-12 17:24:45.150341 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43bf90] [c00000000001b9a8] .kernel_thread+0x54/0x70
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I have also noticed, watching cpu usage with &lt;tt&gt;top&lt;/tt&gt;, that about two CPUs are pegged at exactly 100% usage, all in the sys column.  The rest of the CPUs are more evenly split among the usage columns: a little in user, most in sys and wa.&lt;/p&gt;

&lt;p&gt;The CPUs with the 100% usage seem to correspond to the ptlrpcd processes that trigger the soft lockup warning, which I suppose is not terribly surprising.&lt;/p&gt;

&lt;p&gt;We have seen this happening for well over a year, but are only now looking at it more closely due to other recent work on the BG/Q clients.&lt;/p&gt;</description>
                <environment></environment>
        <key id="24693">LU-5053</key>
            <summary>soft lockup in ptlrpcd on client</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="green">Oleg Drokin</assignee>
                                    <reporter username="morrone">Christopher Morrone</reporter>
                        <labels>
                            <label>mn4</label>
                    </labels>
                <created>Tue, 13 May 2014 00:44:38 +0000</created>
                <updated>Mon, 6 Jun 2022 20:59:32 +0000</updated>
                            <resolved>Wed, 11 Jun 2014 16:18:51 +0000</resolved>
                                    <version>Lustre 2.4.1</version>
                                    <fixVersion>Lustre 2.6.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="83952" author="morrone" created="Tue, 13 May 2014 00:46:46 +0000"  >&lt;p&gt;I&apos;ll start with the most simple question: Is it possible that with 128 writer threads, we are able to keep the list of work constantly growing, and the ptlrpcd will happily continue working for long periods of time without ever hitting schedule point?&lt;/p&gt;</comment>
                            <comment id="83982" author="pjones" created="Tue, 13 May 2014 13:20:04 +0000"  >&lt;p&gt;Oleg&lt;/p&gt;

&lt;p&gt;Could you please advise?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="83997" author="green" created="Tue, 13 May 2014 15:26:11 +0000"  >&lt;p&gt;So is this just the case of clients being really slow with adler32 checksums?&lt;br/&gt;
When you load the lustre modules, the speeds of the different checksum algorithms are printed into the debug logs, so perhaps peek there to see if you can choose a different checksum algorithm that&apos;s lighter on your cpus?&lt;/p&gt;

&lt;p&gt;Currently the preference is based on server performance, as that&apos;s considered more important, but your case might be different.&lt;br/&gt;
I assume you are not really keen on disabling the checksums altogether.&lt;/p&gt;</comment>
                            <comment id="83998" author="green" created="Tue, 13 May 2014 15:29:05 +0000"  >&lt;p&gt;I guess some schedules might help too, but that would just get rid of the warning&lt;/p&gt;</comment>
                            <comment id="84049" author="adilger" created="Tue, 13 May 2014 20:40:17 +0000"  >&lt;p&gt;My patch at &lt;a href=&quot;http://review.whamcloud.com/9990&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/9990&lt;/a&gt; improves the checksum logging a bit - it moves the performance numbers into the D_CONFIG so that the checksum performance is always logged at startup (one time only), and increases the checksum size to better reflect RPC sizes.  &lt;/p&gt;</comment>
                            <comment id="84058" author="morrone" created="Tue, 13 May 2014 22:28:37 +0000"  >&lt;blockquote&gt;&lt;p&gt;So is this just the case of clients being really slow with adler32 checksums?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;No, as far as I know that is not a problem.  Early in BG/Q&apos;s life the ptlrpcd was parallelized so that we could distribute the checksumming load across more ptlrpcds, and checksumming performance has not, to the best of my knowledge, been a problem.&lt;/p&gt;

&lt;p&gt;The problem is that we have a kernel thread that ran in excess of 67 seconds without ever calling schedule().  Why did that happen?&lt;/p&gt;</comment>
                            <comment id="84063" author="morrone" created="Tue, 13 May 2014 23:35:30 +0000"  >&lt;blockquote&gt;&lt;p&gt;My patch at &lt;a href=&quot;http://review.whamcloud.com/9990&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/9990&lt;/a&gt; improves the checksum logging a bit&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;I applied the patch, but I&apos;m not getting anything useful.  It always says zero for all hash types:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Using crypto hash: crc32 (crc32-table) speed 0 MB/s&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Each of those lines in the log is repeated 10s to hundreds of times for each hash type.  Here&apos;s a summary:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;$ grep &quot;Using crypto&quot; log | cut -b 98- | uniq -c
    338 Using crypto hash: adler32 (adler32-zlib) speed 0 MB/s
     84 Using crypto hash: crc32 (crc32-table) speed 0 MB/s
     78 Using crypto hash: md5 (md5-generic) speed 0 MB/s
     24 Using crypto hash: sha1 (sha1-generic) speed 0 MB/s
     68 Using crypto hash: crc32c (crc32c-generic) speed 0 MB/s
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="84064" author="morrone" created="Wed, 14 May 2014 00:37:00 +0000"  >&lt;p&gt;Ah, the one message that I really needed already moved to D_CONFIG on master, so Andreas&apos; patch didn&apos;t touch that one for me.  I fixed that and found:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Crypto hash algorithm adler32 speed = 341 MB/s&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;341 MB/s * 67 seconds is over 22 GB.  So if you think that the ptlrpcd is spending the entire time on adler32, then I would argue that checksumming over 22 GB without a schedule() is not particularly reasonable.&lt;/p&gt;</comment>
                            <comment id="84071" author="adilger" created="Wed, 14 May 2014 05:27:46 +0000"  >&lt;p&gt;Chris, just to clarify, is the core problem that the ptlrpcd thread is generating the watchdog dump, or is the problem that the ptlrpcd thread is hogging this core while other threads need to run on that specific core?&lt;/p&gt;

&lt;p&gt;It seems fairly straightforward to add a schedule in the main loop so that other processes get a chance to run and the watchdog does not get triggered.  If the problem is that there is just too much work being done on the core, then that would need more work to try and improve the ptlrpcd scheduling of the checksum work.&lt;/p&gt;</comment>
                            <comment id="84072" author="adilger" created="Wed, 14 May 2014 05:30:17 +0000"  >&lt;p&gt;Also, is there no other checksum type available than adler32?  On my very old Pentium system it still has a few choices (all software):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Crypto hash algorithm adler32 speed = 694 MB/s
Crypto hash algorithm crc32 speed = 165 MB/s
Crypto hash algorithm crc32c speed = 402 MB/s
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Are there patches for newer kernels that implement CPU-optimized PPC checksum code?&lt;/p&gt;</comment>
                            <comment id="84118" author="morrone" created="Wed, 14 May 2014 20:33:25 +0000"  >&lt;blockquote&gt;&lt;p&gt;Chris, just to clarify, is the core problem that the ptlrpcd thread is generating the watchdog dump&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Yes, that is the primary problem.  It is never good kernel programming practice to have a process with unbounded run time monopolizing a CPU.&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;is there no other checksum type available than adler32?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;There were others, but we use adler32, which was also the fastest.  Here is the full list on the BG/Q client:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Crypto hash algorithm adler32 speed = 341 MB/s
Crypto hash algorithm crc32 speed = 82 MB/s
Crypto hash algorithm md5 speed = 78 MB/s
Crypto hash algorithm sha1 speed = 23 MB/s
Crypto hash algorithm crc32c speed = 67 MB/s
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The print statements say &quot;-1 MB/s&quot; for the unsupported algorithms.  That print statement could be improved as well.  Here are the unfiltered results for reference:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000001:01000000:0.0F:1400027577.586961:2224:3144:0:(linux-crypto.c:127:cfs_crypto_hash_alloc()) Failed to alloc crypto hash null
00000001:01000000:0.0:1400027577.586977:1792:3144:0:(linux-crypto.c:297:cfs_crypto_performance_test()) Crypto hash algorithm null, err = -2
00000001:01000000:0.0:1400027577.586988:1792:3144:0:(linux-crypto.c:305:cfs_crypto_performance_test()) Crypto hash algorithm null speed = -1 MB/s
00000001:01000000:0.0:1400027578.587671:1792:3144:0:(linux-crypto.c:305:cfs_crypto_performance_test()) Crypto hash algorithm adler32 speed = 341 MB/s
00000001:01000000:0.0:1400027579.580961:1792:3144:0:(linux-crypto.c:305:cfs_crypto_performance_test()) Crypto hash algorithm crc32 speed = 82 MB/s
00000001:01000000:0.0:1400027580.586248:1792:3144:0:(linux-crypto.c:305:cfs_crypto_performance_test()) Crypto hash algorithm md5 speed = 78 MB/s
00000001:01000000:0.0:1400027581.593861:1792:3144:0:(linux-crypto.c:305:cfs_crypto_performance_test()) Crypto hash algorithm sha1 speed = 23 MB/s
00000001:01000000:0.0:1400027584.015726:2224:3144:0:(linux-crypto.c:127:cfs_crypto_hash_alloc()) Failed to alloc crypto hash sha256
00000001:01000000:0.0:1400027584.015738:1792:3144:0:(linux-crypto.c:297:cfs_crypto_performance_test()) Crypto hash algorithm sha256, err = -2
00000001:01000000:0.0:1400027584.015750:1792:3144:0:(linux-crypto.c:305:cfs_crypto_performance_test()) Crypto hash algorithm sha256 speed = -1 MB/s
00000001:01000000:0.0:1400027584.031881:2224:3144:0:(linux-crypto.c:127:cfs_crypto_hash_alloc()) Failed to alloc crypto hash sha384
00000001:01000000:0.0:1400027584.031893:1792:3144:0:(linux-crypto.c:297:cfs_crypto_performance_test()) Crypto hash algorithm sha384, err = -2
00000001:01000000:0.0:1400027584.031905:1792:3144:0:(linux-crypto.c:305:cfs_crypto_performance_test()) Crypto hash algorithm sha384 speed = -1 MB/s
00000001:01000000:0.0:1400027584.048161:2224:3144:0:(linux-crypto.c:127:cfs_crypto_hash_alloc()) Failed to alloc crypto hash sha512
00000001:01000000:0.0:1400027584.048173:1792:3144:0:(linux-crypto.c:297:cfs_crypto_performance_test()) Crypto hash algorithm sha512, err = -2
00000001:01000000:0.0:1400027584.048184:1792:3144:0:(linux-crypto.c:305:cfs_crypto_performance_test()) Crypto hash algorithm sha512 speed = -1 MB/s
00000001:01000000:0.0:1400027585.047812:1792:3144:0:(linux-crypto.c:305:cfs_crypto_performance_test()) Crypto hash algorithm crc32c speed = 67 MB/s
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="84209" author="morrone" created="Thu, 15 May 2014 18:57:04 +0000"  >&lt;blockquote&gt;&lt;p&gt;If the problem is that there is just too much work being done on the core, then that would need more work to try and improve the ptlrpcd scheduling of the checksum work.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;By the way, there &lt;em&gt;is&lt;/em&gt; some question in my mind about whether the work load is being sanely distributed across the 16 ptlrpcd.  It seems odd to me that only one or two of the ptlrpcd spin at 100% like this.  But I think we can leave that issue to a future ticket.&lt;/p&gt;</comment>
                            <comment id="84229" author="adilger" created="Fri, 16 May 2014 02:57:41 +0000"  >&lt;p&gt;I was looking at the code for ptlrpcd() to add a schedule() call, or cpu_relax(). I need to look into that more closely, but it needs to be done in any case. &lt;/p&gt;

&lt;p&gt;It may be that the reason that core #0 and core #1 are handling more work with ptlrpcd is because the BG/L scheduler is not running any user processes on this core. The ptlrpcd() thread will grab RPCs to process from neighboring cores if the local one is idle and the others are busy. This allows the &quot;producer&quot; threads to keep working while the ptlrpcd runs on an otherwise idle core to offload checksums and other request handling work. &lt;/p&gt;

&lt;p&gt;Whether this load balancing is working optimally is up for discussion. &lt;/p&gt;</comment>
                            <comment id="84251" author="liang" created="Fri, 16 May 2014 15:51:51 +0000"  >&lt;p&gt;I actually saw a few similar tickets, so I looked into ptlrpcd and found some suspicious code; here is a patch that could be helpful: &lt;a href=&quot;http://review.whamcloud.com/#/c/10351/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/10351/&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="84275" author="morrone" created="Fri, 16 May 2014 18:01:51 +0000"  >&lt;blockquote&gt;&lt;p&gt;It may be that the reason that core #0 and core #1 are handling more work with ptlrpcd is because the BG/L scheduler is not running any user processes on this core.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Andreas, I think you are confusing this ticket&apos;s symptoms with &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5043&quot; title=&quot;ptlrpcd threads only run on one CPU&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5043&quot;&gt;&lt;del&gt;LU-5043&lt;/del&gt;&lt;/a&gt;.  That issue is resolved.&lt;/p&gt;

&lt;p&gt;In this ticket core 0 and 1 are not handling more work.  When ptlrpcd gets stuck working for minutes at a time without a single schedule, it can be just about any CPU and any of the ptlrpcd.  But so far I have never seen more than 2 (maybe 3?) out of the 16 ptlrpcd threads spinning like this at any particular time.  This is why I suspect load balancing issues in Lustre; but admittedly my current visibility into what is going on is fairly limited.&lt;/p&gt;

&lt;p&gt;Also, the &quot;BG/L scheduler&quot; is just Linux on PPC.  I don&apos;t think there are any bluegene-specific scheduler changes in this kernel.&lt;/p&gt;</comment>
                            <comment id="84276" author="morrone" created="Fri, 16 May 2014 18:05:57 +0000"  >&lt;blockquote&gt;&lt;p&gt;I actually saw a few similar tickets, so I looked into ptlrpcd and found some suspicious code; here is a patch that could be helpful: &lt;a href=&quot;http://review.whamcloud.com/#/c/10351/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/10351/&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Thanks Liang!  I think that &lt;em&gt;is&lt;/em&gt; the general idea.&lt;/p&gt;

&lt;p&gt;But I think maybe the approach should be a little different.  Since ptlrpc_check_set() is used in a number of places, and ptlrpcd() is the one place that wants everything to be different, I think perhaps ptlrpcd() just needs its own function to walk the queue of work.  Maybe it shouldn&apos;t even use the ptlrpc_request_set(), and instead have a list more specific to its own needs.  But I could be missing something there.&lt;/p&gt;</comment>
                            <comment id="84323" author="morrone" created="Fri, 16 May 2014 21:37:06 +0000"  >&lt;p&gt;I made a simpler cond_resched() patch that is only intended for testing, but I&apos;ll share it here for reference: &lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/10358&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/10358&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In limited testing, so far the soft lockups are gone as expected.  Load on CPUs looks pretty even, with no runaway Lustre threads monopolizing cores.  Better still, write throughput is up 40%!  Read performance looks improved as well.&lt;/p&gt;</comment>
                            <comment id="84338" author="morrone" created="Sat, 17 May 2014 00:36:05 +0000"  >&lt;p&gt;I have seen a couple of these console messages on the clients while using test patch &lt;a href=&quot;http://review.whamcloud.com/10358&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;10358&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2014-05-16 14:51:42.541470 {RMP16Ma135918551} [mmcs]{10}.6.1: Lustre: 3876:0:(service.c:1889:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 6s  req@c0000002450f32c0 x1465655019342052/t0(0) o104-&amp;gt;LOV_OSC_UUID@172.20.20.18@o2ib500:0/0 lens 296/0 e 0
 to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That thread was ldlm_bl_19.&lt;/p&gt;

&lt;p&gt;The client is still probably faster and better behaved than it has been in the past.&lt;/p&gt;</comment>
                            <comment id="84345" author="liang" created="Sat, 17 May 2014 13:50:44 +0000"  >&lt;p&gt;Hi, after looking into the code again, I suspect the major reason is that ptlrpcd_check() is too expensive a condition for l_wait_event():&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;l_wait_event() will call the condition callback (ptlrpcd_check() in this case) three times in each loop&lt;/li&gt;
	&lt;li&gt;ptlrpcd_check() will usually scan the whole request list twice: the first time in ptlrpc_check_set(), the second time to finish completed requests&lt;/li&gt;
	&lt;li&gt;which means ptlrpcd() will scan all requests 3 * 2 = 6 times in each loop, which could be very time consuming. Even worse, the longer it takes, the more likely it can&apos;t sleep, because more requests/replies may arrive&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;So I changed my patch in this way:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;only call cond_resched() in the main loop of ptlrpcd, because very likely we just call into ptlrpcd_check() again and again&lt;/li&gt;
	&lt;li&gt;ptlrpc_check_set() will put completed requests at the head of ptlrpc_request_set::set_requests&lt;/li&gt;
	&lt;li&gt;ptlrpcd_check() only needs to scan a small part of ptlrpc_request_set::set_requests while finishing completed requests&lt;/li&gt;
	&lt;li&gt;adding a &quot;wakeup version&quot;, ptlrpc_request_set::set_wake_ver
	&lt;ul&gt;
		&lt;li&gt;this version number increases whenever there is a wakeup call for the ptlrpc_request_set&lt;/li&gt;
		&lt;li&gt;ptlrpcd will check set::set_wake_ver in ptlrpcd_check(); if it&apos;s different from the version that ptlrpcd last checked/saved, then ptlrpcd saves the new version number and calls into ptlrpc_check_set(), otherwise ptlrpcd can skip the scan&lt;/li&gt;
		&lt;li&gt;With this change, ptlrpcd_check() could be a lot cheaper if it&apos;s called more than once in the same loop (very short interval).&lt;/li&gt;
	&lt;/ul&gt;
	&lt;/li&gt;
&lt;/ul&gt;
</comment>
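The wakeup-version scheme sketched in the preceding comment can be illustrated in a minimal userspace model: the set carries a counter that is bumped on every wakeup, and the daemon rescans only when the counter has moved since its last scan. This is an editor's sketch under invented names (fake_set, wake_set, need_rescan), not code from the actual patch.

```c
#include <assert.h>

/* Invented miniature of a request set carrying the proposed
 * set_wake_ver counter; not the real ptlrpc_request_set. */
struct fake_set {
        int wake_ver;   /* bumped on every wakeup of the set */
        int seen_ver;   /* version at which the daemon last scanned */
};

/* Waker side: every wakeup call increases the version. */
static void wake_set(struct fake_set *set)
{
        set->wake_ver++;
}

/* Daemon side: returns 1 only if a wakeup arrived since the last
 * scan, so repeated checks in one loop degrade to a cheap compare. */
static int need_rescan(struct fake_set *set)
{
        if (set->wake_ver == set->seen_ver)
                return 0;                  /* nothing new, skip the scan */
        set->seen_ver = set->wake_ver;     /* remember what we scanned at */
        return 1;
}
```

When ptlrpcd_check() runs several times in the same l_wait_event() loop, only the first call after a wakeup would pay for a full list walk under this scheme.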
                            <comment id="84348" author="liang" created="Sun, 18 May 2014 07:46:52 +0000"  >&lt;p&gt;Peter/Andreas/Oleg, I will be on vacation for most of the next two weeks, so I probably can&apos;t update my patch in time.  If it makes sense to you, could you please find someone to take it over, or feel free to abandon it and take a different approach if it&apos;s not correct.  Thanks.&lt;/p&gt;</comment>
                            <comment id="84395" author="morrone" created="Mon, 19 May 2014 18:47:38 +0000"  >&lt;p&gt;Liang, generally speaking I think there are some good improvements that you are working on.  However, I do not think that they address the fundamental bug of this ticket, which is this:&lt;/p&gt;

&lt;p&gt;The processing time in ptlrpc_check_set() is unbounded.&lt;/p&gt;

&lt;p&gt;Reducing the number of times the list is scanned by a factor of 6 would be good, but that will still leave the ptlrpcd using the CPU for far too long without a schedule.  &quot;Way too long&quot; divided by six is still &quot;too long&quot;.  We must either somehow bound the time spent in ptlrpc_check_set(), or we must call cond_resched() while processing the list.&lt;/p&gt;</comment>
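The bounding argument above can be sketched in userspace: a reschedule point at the top of each loop iteration limits the gap between yields to the cost of one list element. Here cond_resched() is modeled by a counter plus sched_yield(); the request struct and all names are invented for illustration, not taken from the patch.

```c
#include <sched.h>
#include <stddef.h>

/* Invented miniature of a request list entry; not the real struct. */
struct fake_req {
        struct fake_req *next;
        int done;
};

static int resched_points;     /* reschedule opportunities offered */

/* Userspace stand-in for the kernel's cond_resched(). */
static void fake_cond_resched(void)
{
        resched_points++;
        sched_yield();
}

/* Walk the set once, offering a reschedule before each element, so
 * the time between yields is bounded by one element's processing. */
static int check_set(struct fake_req *head)
{
        int completed = 0;
        struct fake_req *req;

        for (req = head; req != NULL; req = req->next) {
                fake_cond_resched();   /* the one-liner under discussion */
                if (req->done)
                        completed++;
        }
        return completed;
}
```

However long the list grows, the thread can never run more than one element past the point where the scheduler wanted it off the CPU.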
                            <comment id="84399" author="morrone" created="Mon, 19 May 2014 19:00:58 +0000"  >&lt;p&gt;Andreas had a couple of questions about my testing patch, &lt;a href=&quot;http://review.whamcloud.com/#/c/10358/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/10358/&lt;/a&gt; that I will answer here so they hopefully have a better chance of informing the final patch.&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;  cfs_list_for_each_safe(tmp, next, &amp;amp;set-&amp;gt;set_requests) {
	  struct ptlrpc_request *req =
	  cfs_list_entry(tmp, struct ptlrpc_request,
	  rq_set_chain);
	  struct obd_import *imp = req-&amp;gt;rq_import;
	  &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; unregistered = 0;
	  &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; rc = 0;
+
+        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (resched_allowed)
+                cond_resched();
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;blockquote&gt;&lt;p&gt;Any reason not to allow reschedule for all callers and just drop the parameter?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Allowing it for all callers would certainly be cleaner code.  But ensuring that lustre never does anything stupid, like holding a spin lock while calling ptlrpc_check_set(), would require a much, much larger code audit on my part.  I have seen ptlrpc_check_set() appear in far too many backtraces over the past few years to make the change without greater understanding.  It looked safe to schedule from the ptlrpcd() path, but I just don&apos;t know about the others.&lt;/p&gt;

&lt;p&gt;You folks probably have a better idea than I do whether it would be safe to cond_resched() in all cases.&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;It would also make more sense to do this after processing the RPC instead of before (no point to reschedule before doing any work)&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;I did consider that as well.  The problem is that ptlrpc_check_set() has nearly 20 &lt;tt&gt;continue&lt;/tt&gt; statements.  I don&apos;t see a clear way to ensure that our CPU time is bounded, and therefore correct, without putting the cond_resched() first.  And if the very first cond_resched() results in a real schedule before we do any rpc processing, then clearly the thread exhausted its time slice already and &lt;em&gt;should&lt;/em&gt; be rescheduled.&lt;/p&gt;</comment>
                            <comment id="84827" author="green" created="Fri, 23 May 2014 23:25:26 +0000"  >&lt;p&gt;I actually tested with a very similar patch, which only inserted an unconditional cond_resched() before calling the RPC completion handler (which is the really expensive place, I suspect).&lt;br/&gt;
It passed my really rigorous testing on the race-crazy setup, so I think it&apos;s totally fine to do it for all callers.&lt;/p&gt;

&lt;p&gt;What do you think?&lt;/p&gt;

&lt;p&gt;We do not allow calling ptlrpc_check_next while holding a spinlock anyway.&lt;/p&gt;</comment>
                            <comment id="84830" author="morrone" created="Sat, 24 May 2014 00:13:45 +0000"  >&lt;p&gt;Can you prove that it is correct?  I see quite a few continue statements before ptlrpc_req_interpret().  &lt;/p&gt;

&lt;p&gt;What would be the worst case time between cond_resched() calls?  Is it bounded?&lt;/p&gt;

&lt;p&gt;By putting the cond_resched() at the beginning of the loop, we bound the maximum time that we exceed our run time to just over the maximum time it takes to process a single element of the list.  Let&apos;s call that X.&lt;/p&gt;

&lt;p&gt;If we put the cond_resched() before the completion handler, the processing time for each element is assumed to be less than X, so let&apos;s call that Y.  But now we&apos;ll need to process N elements before a cond_resched(), so let&apos;s call the processing time Y*N.  &lt;/p&gt;

&lt;p&gt;Yes, Y is probably less than X.  But N is unknown and potentially quite large.  Yes, Y*N may &lt;em&gt;often&lt;/em&gt; be less than X.  But this is computer science; we need to know the worst case and handle it reasonably.  In the worst case, Y*N may be much, much larger than X.&lt;/p&gt;

&lt;p&gt;This all assumes that ptlrpc_req_interpret() and other calls are not buggy and not unbounded in run time.  But that is fine for now; let&apos;s just focus on making ptlrpc_check_set() correct before addressing other problems.&lt;/p&gt;

&lt;p&gt;Since N is unknown, we need to assume it is large.  That means we need the cond_resched() at the top of the loop.&lt;/p&gt;

&lt;p&gt;Is&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;unlikely(test_thread_flag(TIF_NEED_RESCHED))&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;too high a price to pay to ensure proper run time?&lt;/p&gt;

&lt;p&gt;If you think it is too expensive, then we need to audit more code and put protections in place to limit N.&lt;/p&gt;</comment>
                            <comment id="85523" author="green" created="Mon, 2 Jun 2014 22:27:26 +0000"  >&lt;p&gt;As we just discussed on the call - I guess it&apos;s more future-proof to just do cond_resched() at every iteration. Since we already allow sleeping in the interpret callbacks, the patch should be basically a one-liner&lt;/p&gt;</comment>
                            <comment id="85676" author="morrone" created="Wed, 4 Jun 2014 02:25:05 +0000"  >&lt;p&gt;I updated my patch to be that one-liner, plus comments:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/#/c/10358/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/10358/&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="86329" author="pjones" created="Wed, 11 Jun 2014 16:18:51 +0000"  >&lt;p&gt;Landed for 2.6&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="25391">LU-5279</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwmdr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>13959</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>