[LU-5053] soft lockup in ptlrpcd on client Created: 13/May/14 Updated: 06/Jun/22 Resolved: 11/Jun/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1 |
| Fix Version/s: | Lustre 2.6.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Christopher Morrone | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | mn4 |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 13959 |
| Description |
|
On our BG/Q client nodes (ppc64), we frequently see soft lockup messages like so:
2014-05-12 17:24:44.936384 {RMP08Ma185908506} [mmcs]{0}.0.1: kernel:BUG: soft lockup - CPU#32 stuck for 67s! [ptlrpcd_0:3202]
2014-05-12 17:24:45.140774 {RMP08Ma185908506} [mmcs]{0}.16.1: BUG: soft lockup - CPU#65 stuck for 67s! [ptlrpcd_1:3203]
2014-05-12 17:24:45.141140 {RMP08Ma185908506} [mmcs]{0}.16.1: Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm
2014-05-12 17:24:45.141506 {RMP08Ma185908506} [mmcs]{0}.16.1: NIP: 80000000003715bc LR: 80000000003717cc CTR: 000000000000007d
2014-05-12 17:24:45.141979 {RMP08Ma185908506} [mmcs]{0}.16.1: REGS: c0000003eb43ab40 TRAP: 0901 Not tainted (2.6.32-358.11.1.bgq.4blueos.V1R2M1.bl2.1_0.ppc64)
2014-05-12 17:24:45.142274 {RMP08Ma185908506} [mmcs]{0}.16.1: MSR: 0000000080029000 <EE,ME,CE> CR: 44224424 XER: 00000000
2014-05-12 17:24:45.142575 {RMP08Ma185908506} [mmcs]{0}.16.1: TASK = c0000003ecc14060[3203] 'ptlrpcd_1' THREAD: c0000003eb438000 CPU: 65
2014-05-12 17:24:45.142979 {RMP08Ma185908506} [mmcs]{0}.16.1: GPR00: 000000000003dfb8 c0000003eb43adc0 80000000003b5550 00000000000015b0
2014-05-12 17:24:45.143352 {RMP08Ma185908506} [mmcs]{0}.16.1: GPR04: c00000031d82c330 0000000000000000 0000000000000000 0000000000000053
2014-05-12 17:24:45.143728 {RMP08Ma185908506} [mmcs]{0}.16.1: GPR08: 0000000000000048 c00000031d82d110 0000000000000071 000000001eaffaf3
2014-05-12 17:24:45.144048 {RMP08Ma185908506} [mmcs]{0}.16.1: GPR12: 0000000000002720 c0000000007f7200
2014-05-12 17:24:45.144424 {RMP08Ma185908506} [mmcs]{0}.16.1: NIP [80000000003715bc] .__adler32+0x9c/0x220 [libcfs]
2014-05-12 17:24:45.144861 {RMP08Ma185908506} [mmcs]{0}.16.1: LR [80000000003717cc] .adler32_update+0x1c/0x40 [libcfs]
2014-05-12 17:24:45.145111 {RMP08Ma185908506} [mmcs]{0}.16.1: Call Trace:
2014-05-12 17:24:45.145442 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43adc0] [c0000003eb43ae50] 0xc0000003eb43ae50 (unreliable)
2014-05-12 17:24:45.145800 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43ae40] [c0000000002041c4] .crypto_shash_update+0x4c/0x60
2014-05-12 17:24:45.146124 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43aeb0] [c000000000204218] .shash_compat_update+0x40/0x80
2014-05-12 17:24:45.146441 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43af70] [800000000037072c] .cfs_crypto_hash_update_page+0x8c/0xb0 [libcfs]
2014-05-12 17:24:45.146758 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b030] [8000000001284f94] .osc_checksum_bulk+0x1c4/0x890 [osc]
2014-05-12 17:24:45.147160 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b180] [80000000012863c0] .osc_brw_prep_request+0xd60/0x1d80 [osc]
2014-05-12 17:24:45.147512 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b340] [800000000129931c] .osc_build_rpc+0xc8c/0x2320 [osc]
2014-05-12 17:24:45.147814 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b4f0] [80000000012bf304] .osc_send_write_rpc+0x594/0xb40 [osc]
2014-05-12 17:24:45.148150 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b6d0] [80000000012bff84] .osc_check_rpcs+0x6d4/0x1770 [osc]
2014-05-12 17:24:45.148447 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b900] [80000000012c132c] .osc_io_unplug0+0x30c/0x6b0 [osc]
2014-05-12 17:24:45.148812 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43ba20] [800000000129b174] .brw_interpret+0x7c4/0x1940 [osc]
2014-05-12 17:24:45.149257 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43bb60] [8000000000edb6b4] .ptlrpc_check_set+0x3a4/0x4bb0 [ptlrpc]
2014-05-12 17:24:45.149659 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43bd20] [8000000000f2a80c] .ptlrpcd_check+0x66c/0x870 [ptlrpc]
2014-05-12 17:24:45.149960 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43be40] [8000000000f2acc8] .ptlrpcd+0x2b8/0x4c0 [ptlrpc]
2014-05-12 17:24:45.150341 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43bf90] [c00000000001b9a8] .kernel_thread+0x54/0x70
I have also noticed that when I watch the CPU usage with top, I will see about two CPUs pegged at exactly 100% usage, all in the sys column. The rest of the CPUs are more evenly split among the different usage columns: a little in user, most in sys and wa. The CPUs with the 100% usage seem to correspond to the ptlrpcd processes that trigger the soft lockup warning, which I suppose is not terribly surprising. We have seen this happening for well over a year, but are only now getting around to looking at it more closely due to other recent work on the BG/Q clients. |
| Comments |
| Comment by Christopher Morrone [ 13/May/14 ] |
|
I'll start with the simplest question: is it possible that, with 128 writer threads, we are able to keep the list of work constantly growing, so that ptlrpcd will happily continue working for long periods of time without ever hitting a schedule point? |
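For illustration, a minimal sketch of the failure mode being asked about, under invented names (worker, work_list, waitq and process_one_item are hypothetical and are not actual ptlrpcd code): a kthread whose only voluntary schedule point is the empty-list wait never yields while producers keep the list populated.

#include <linux/kthread.h>
#include <linux/list.h>
#include <linux/wait.h>

struct worker {
	struct list_head	work_list;	/* fed by the 128 writer threads */
	wait_queue_head_t	waitq;
};

void process_one_item(struct worker *w);	/* hypothetical work function */

static int worker_thread(void *arg)
{
	struct worker *w = arg;

	while (!kthread_should_stop()) {
		if (list_empty(&w->work_list)) {
			/* The only voluntary schedule point is here... */
			wait_event(w->waitq, !list_empty(&w->work_list) ||
					     kthread_should_stop());
			continue;
		}
		/* ...so if producers keep the list non-empty, this branch can
		 * run for minutes and trip the soft lockup watchdog. */
		process_one_item(w);
	}
	return 0;
}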
| Comment by Peter Jones [ 13/May/14 ] |
|
Oleg, could you please advise? Thanks. Peter |
| Comment by Oleg Drokin [ 13/May/14 ] |
|
So is this just a case of the clients being really slow with adler32 checksums? Currently the preference is based on server performance, as that's considered more important, but your case might be different. |
| Comment by Oleg Drokin [ 13/May/14 ] |
|
I guess some schedule calls might help too, but that would just get rid of the warning. |
| Comment by Andreas Dilger [ 13/May/14 ] |
|
My patch at http://review.whamcloud.com/9990 improves the checksum logging a bit - it moves the performance numbers into the D_CONFIG so that the checksum performance is always logged at startup (one time only), and increases the checksum size to better reflect RPC sizes. |
| Comment by Christopher Morrone [ 13/May/14 ] |
No, as far as I know that is not a problem. Early in BG/Q's life the ptlrpcd code was parallelized so that we could distribute the checksumming load across more ptlrpcds, and checksumming performance has not, to the best of my knowledge, been a problem. The problem is that we have a kernel thread that ran in excess of 67 seconds without ever calling schedule(). Why did that happen? |
| Comment by Christopher Morrone [ 13/May/14 ] |
I applied the patch, but I'm not getting anything useful. It always says zero for all hash types:
Using crypto hash: crc32 (crc32-table) speed 0 MB/s
Each of those lines in the log is repeated tens to hundreds of times for each hash type. Here's a summary:
$ grep "Using crypto" log | cut -b 98- | uniq -c
338 Using crypto hash: adler32 (adler32-zlib) speed 0 MB/s
84 Using crypto hash: crc32 (crc32-table) speed 0 MB/s
78 Using crypto hash: md5 (md5-generic) speed 0 MB/s
24 Using crypto hash: sha1 (sha1-generic) speed 0 MB/s
68 Using crypto hash: crc32c (crc32c-generic) speed 0 MB/s
|
| Comment by Christopher Morrone [ 14/May/14 ] |
|
Ah, the one message that I really needed had already moved to D_CONFIG on master, so Andreas' patch didn't touch that one for me. I fixed that and found:
Crypto hash algorithm adler32 speed = 341 MB/s
341 MB/s * 67 seconds is over 22 GB. So if you think that the ptlrpcd is spending the entire time on adler32, then I would argue that checksumming over 22 GB without a schedule() is not particularly reasonable. |
| Comment by Andreas Dilger [ 14/May/14 ] |
|
Chris, just to clarify, is the core problem that the ptlrpcd thread is generating the watchdog dump, or is the problem that the ptlrpcd thread is hogging its core and other threads need to run on that specific core? It seems fairly straightforward to add a schedule in the main loop so that other processes get a chance to run and the watchdog does not get triggered. If the problem is that there is just too much work being done on the core, then more work would be needed to improve the ptlrpcd scheduling of the checksum work. |
| Comment by Andreas Dilger [ 14/May/14 ] |
|
Also, is there no other checksum type available than adler32? My very old Pentium system still has a few choices (all software):
Crypto hash algorithm adler32 speed = 694 MB/s
Crypto hash algorithm crc32 speed = 165 MB/s
Crypto hash algorithm crc32c speed = 402 MB/s
Are there any patches for newer kernels that implement CPU-optimized PPC checksum code in the kernel? |
| Comment by Christopher Morrone [ 14/May/14 ] |
Yes, that is the primary problem. It is never good kernel programming practice to have a process with unbounded run time monopolizing a CPU.
There were others, but we use adler32, and it was also the fastest. Here is the fuller list on the BG/Q client:
Crypto hash algorithm adler32 speed = 341 MB/s
Crypto hash algorithm crc32 speed = 82 MB/s
Crypto hash algorithm md5 speed = 78 MB/s
Crypto hash algorithm sha1 speed = 23 MB/s
Crypto hash algorithm crc32c speed = 67 MB/s
The print statements say "-1 MB/s" for the unsupported algorithms. That print statement could be improved as well. Here are the unfiltered results for reference:
00000001:01000000:0.0F:1400027577.586961:2224:3144:0:(linux-crypto.c:127:cfs_crypto_hash_alloc()) Failed to alloc crypto hash null
00000001:01000000:0.0:1400027577.586977:1792:3144:0:(linux-crypto.c:297:cfs_crypto_performance_test()) Crypto hash algorithm null, err = -2
00000001:01000000:0.0:1400027577.586988:1792:3144:0:(linux-crypto.c:305:cfs_crypto_performance_test()) Crypto hash algorithm null speed = -1 MB/s
00000001:01000000:0.0:1400027578.587671:1792:3144:0:(linux-crypto.c:305:cfs_crypto_performance_test()) Crypto hash algorithm adler32 speed = 341 MB/s
00000001:01000000:0.0:1400027579.580961:1792:3144:0:(linux-crypto.c:305:cfs_crypto_performance_test()) Crypto hash algorithm crc32 speed = 82 MB/s
00000001:01000000:0.0:1400027580.586248:1792:3144:0:(linux-crypto.c:305:cfs_crypto_performance_test()) Crypto hash algorithm md5 speed = 78 MB/s
00000001:01000000:0.0:1400027581.593861:1792:3144:0:(linux-crypto.c:305:cfs_crypto_performance_test()) Crypto hash algorithm sha1 speed = 23 MB/s
00000001:01000000:0.0:1400027584.015726:2224:3144:0:(linux-crypto.c:127:cfs_crypto_hash_alloc()) Failed to alloc crypto hash sha256
00000001:01000000:0.0:1400027584.015738:1792:3144:0:(linux-crypto.c:297:cfs_crypto_performance_test()) Crypto hash algorithm sha256, err = -2
00000001:01000000:0.0:1400027584.015750:1792:3144:0:(linux-crypto.c:305:cfs_crypto_performance_test()) Crypto hash algorithm sha256 speed = -1 MB/s
00000001:01000000:0.0:1400027584.031881:2224:3144:0:(linux-crypto.c:127:cfs_crypto_hash_alloc()) Failed to alloc crypto hash sha384
00000001:01000000:0.0:1400027584.031893:1792:3144:0:(linux-crypto.c:297:cfs_crypto_performance_test()) Crypto hash algorithm sha384, err = -2
00000001:01000000:0.0:1400027584.031905:1792:3144:0:(linux-crypto.c:305:cfs_crypto_performance_test()) Crypto hash algorithm sha384 speed = -1 MB/s
00000001:01000000:0.0:1400027584.048161:2224:3144:0:(linux-crypto.c:127:cfs_crypto_hash_alloc()) Failed to alloc crypto hash sha512
00000001:01000000:0.0:1400027584.048173:1792:3144:0:(linux-crypto.c:297:cfs_crypto_performance_test()) Crypto hash algorithm sha512, err = -2
00000001:01000000:0.0:1400027584.048184:1792:3144:0:(linux-crypto.c:305:cfs_crypto_performance_test()) Crypto hash algorithm sha512 speed = -1 MB/s
00000001:01000000:0.0:1400027585.047812:1792:3144:0:(linux-crypto.c:305:cfs_crypto_performance_test()) Crypto hash algorithm crc32c speed = 67 MB/s |
| Comment by Christopher Morrone [ 15/May/14 ] |
By the way, there is some question in my mind about whether the workload is being sanely distributed across the 16 ptlrpcd threads. It seems odd to me that only one or two of them spin at 100% like this. But I think we can leave that issue for a future ticket. |
| Comment by Andreas Dilger [ 16/May/14 ] |
|
I was looking at the code for ptlrpcd() to add a schedule() call, or cpu_relax(). I need to look into that more closely, but it needs to be done in any case. It may be that the reason core #0 and core #1 are handling more ptlrpcd work is that the BG/L scheduler is not running any user processes on those cores. The ptlrpcd() thread will grab RPCs to process from neighboring cores if the local one is idle and the others are busy. This allows the "producer" threads to keep working while ptlrpcd runs on an otherwise idle core to offload checksums and other request handling work. Whether this load balancing is working optimally is up for discussion. |
| Comment by Liang Zhen (Inactive) [ 16/May/14 ] |
|
I actually saw a few similar tickets, so I looked into ptlrpcd and found some suspicious code. Here is a patch that could be helpful: http://review.whamcloud.com/#/c/10351/ |
| Comment by Christopher Morrone [ 16/May/14 ] |
Andreas, I think you are confusing this ticket's symptoms with those of another ticket. In this ticket, cores 0 and 1 are not handling more work. When ptlrpcd gets stuck working for minutes at a time without a single schedule, it can be just about any CPU and any of the ptlrpcd threads. But so far I have never seen more than 2 (maybe 3?) out of the 16 ptlrpcd threads spinning like this at any particular time. This is why I suspect load balancing issues in Lustre; but admittedly my current visibility into what is going on is fairly limited. Also, the "BG/L scheduler" is just Linux on PPC. I don't think there are any bluegene-specific scheduler changes in this kernel. |
| Comment by Christopher Morrone [ 16/May/14 ] |
Thanks Liang! I think that is the general idea, but maybe the approach should be a little different. Since ptlrpc_check_set() is used in a number of places, and ptlrpcd() is the one place that wants everything to be different, I think perhaps ptlrpcd() just needs its own function to walk the queue of work. Maybe it shouldn't even use ptlrpc_request_set, and should instead have a list more specific to its own needs. But I could be missing something there. |
| Comment by Christopher Morrone [ 16/May/14 ] |
|
I made a simpler cond_resched() patch that is only intended for testing, but I'll share it here for reference: http://review.whamcloud.com/10358 In limited testing so far, the soft lockups are gone, as expected. Load on the CPUs looks pretty even, with no runaway Lustre threads monopolizing cores. Better still, write throughput is up 40%! Read performance looks improved as well. |
| Comment by Christopher Morrone [ 17/May/14 ] |
|
I have seen a couple of these console messages on the clients while using test patch 10358.
2014-05-16 14:51:42.541470 {RMP16Ma135918551} [mmcs]{10}.6.1: Lustre: 3876:0:(service.c:1889:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 6s req@c0000002450f32c0 x1465655019342052/t0(0) o104->LOV_OSC_UUID@172.20.20.18@o2ib500:0/0 lens 296/0 e 0
to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1
That thread was ldlm_bl_19. The client is still probably faster and better behaved than it has been in the past. |
| Comment by Liang Zhen (Inactive) [ 17/May/14 ] |
|
Hi, after looking into the code again, I suspect the major reason is that ptlrpcd_check() is too expensive a condition for l_wait_event():
So I changed my patch in this way:
|
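As a generic illustration of the cost concern above (not Liang's actual change; pd_waitq and pd_new_req_count are hypothetical names): the condition passed to l_wait_event() is re-evaluated on every wakeup, so a condition that walks the whole request set is paid over and over, whereas waking on a cheap counter defers the expensive scan until after the wait returns.

/* expensive: the full request-set scan in ptlrpcd_check() runs on every wakeup */
l_wait_event(pd->pd_waitq, ptlrpcd_check(env, pd), &lwi);

/* cheaper pattern: wake on a lightweight counter, then scan once */
l_wait_event(pd->pd_waitq, atomic_read(&pd->pd_new_req_count) > 0, &lwi);
ptlrpcd_check(env, pd);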
| Comment by Liang Zhen (Inactive) [ 18/May/14 ] |
|
Peter/Andreas/Oleg, I will be on vacation for most of the next two weeks, so I probably can't update my patch in time. If it makes sense to you, could you please find someone to take it over? Or feel free to abandon it and take a different approach if it's not correct. Thanks. |
| Comment by Christopher Morrone [ 19/May/14 ] |
|
Liang, generally speaking I think you are working on some good improvements. However, I do not think they address the fundamental bug of this ticket, which is this: the processing time in ptlrpc_check_set() is unbounded. Reducing the number of times the list is scanned by a factor of 6 would be good, but that will still leave ptlrpcd using the CPU for far too long without a schedule. "Way too long" divided by six is still "too long". We must either somehow bound the time spent in ptlrpc_check_set(), or we must call cond_resched() while processing the list. |
| Comment by Christopher Morrone [ 19/May/14 ] |
|
Andreas had a couple of questions about my testing patch, http://review.whamcloud.com/#/c/10358/, that I will answer here so the answers have a better chance of informing the final patch.
cfs_list_for_each_safe(tmp, next, &set->set_requests) {
struct ptlrpc_request *req =
cfs_list_entry(tmp, struct ptlrpc_request,
rq_set_chain);
struct obd_import *imp = req->rq_import;
int unregistered = 0;
int rc = 0;
+
+ if (resched_allowed)
+ cond_resched();
Allowing it for all callers would certainly be cleaner code. But ensuring that Lustre never does anything stupid, like holding a spinlock while calling ptlrpc_check_set(), would require a much, much larger code audit on my part. I have seen ptlrpc_check_set() appear in far too many backtraces over the past few years to make the change without greater understanding. It looked safe to schedule from the ptlrpcd() path, but I just don't know about the others. You folks probably have a better idea than I do of whether it would be safe to cond_resched() in all cases.
I did consider that as well. The problem is that ptlrpc_check_set() has nearly 20 continue statements. I don't see a clear way to ensure that our CPU time is bounded, and therefore correct, without putting the cond_resched() first. And if the very first cond_resched() results in a real schedule before we do any RPC processing, then clearly the thread has already exhausted its time slice and should be rescheduled. |
| Comment by Oleg Drokin [ 23/May/14 ] |
|
I actually tested with a very similar patch, only it inserted an unconditional cond_resched() before calling the RPC completion handler (which is the really expensive place, I suspect). What do you think? We do not allow calling ptlrpc_check_next while holding a spinlock anyway. |
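Roughly what that alternative placement would look like inside the request loop of ptlrpc_check_set() (simplified and illustrative only, not Oleg's actual test patch):

		/* ...roughly 20 'continue' paths above can skip this point
		 * entirely, so some requests are handled with no yield at all... */
		cond_resched();
		rc = ptlrpc_req_interpret(env, req, req->rq_status);	/* completion handler */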
| Comment by Christopher Morrone [ 24/May/14 ] |
|
Can you prove that it is correct? I see quite a few continue statements before ptlrpc_req_interpret(). What would be the worst-case time between cond_resched() calls? Is it bounded? By putting the cond_resched() at the beginning of the loop, we bound the maximum time that we exceed our run time to just over the maximum time it takes to process a single element of the list. Let's call that X. If we instead put the cond_resched() just before the completion handler, the processing time for each element is assumed to be less than X, so let's call that Y. But now we may need to process N elements before reaching a cond_resched(), so let's call the processing time Y*N. Yes, Y is probably less than X. But N is unknown and potentially quite large. Yes, Y*N may often be less than X. But this is computer science: we need to know the worst case and handle it reasonably. In the worst case, Y*N may be much, much larger than X. This all assumes that ptlrpc_req_interpret() and the other calls are not buggy and not unbounded in run time. But that is fine for now; let's just focus on making ptlrpc_check_set() correct before addressing other problems. Since N is unknown, we need to assume it is large. That means we need the cond_resched() at the top of the loop. Is unlikely(test_thread_flag(TIF_NEED_RESCHED)) too high a price to pay to ensure proper run time? If you think it is too expensive, then we need to audit more code and put protections in place to limit N. |
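Restated symbolically (same X, Y, N as above; T is the maximum time between successive schedule points):

T_{\text{top of loop}} \le X
\qquad\text{vs.}\qquad
T_{\text{before interpret only}} \le N \cdot Y, \quad Y < X,\ N\ \text{unbounded}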
| Comment by Oleg Drokin [ 02/Jun/14 ] |
|
As we just discussed on the call, I guess it's more future-proof to just do a cond_resched() at every iteration. Since we already allow sleeping in the interpret callbacks, this should make the patch basically a one-liner. |
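A sketch of roughly what that one-liner looks like in context, reusing the list walk shown earlier in this ticket (simplified; not a verbatim copy of the patch that eventually landed):

	cfs_list_for_each_safe(tmp, next, &set->set_requests) {
		struct ptlrpc_request *req =
			cfs_list_entry(tmp, struct ptlrpc_request,
				       rq_set_chain);

		/* Yield once per request: ptlrpc_check_set() is not called
		 * under a spinlock, and the interpret callbacks are already
		 * allowed to sleep, so this bounds the time between schedule
		 * points to a single iteration. */
		cond_resched();

		/* ...the existing per-request state machine, including the
		 * interpret/completion callback, continues unchanged... */
	}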
| Comment by Christopher Morrone [ 04/Jun/14 ] |
|
I updated my patch to be that one-liner, plus comments: |
| Comment by Peter Jones [ 11/Jun/14 ] |
|
Landed for 2.6 |