Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.6.0
    • Affects Version/s: Lustre 2.4.1
    • 3
    • 13959

    Description

      On our BG/Q client nodes (ppc64), we frequently see soft lockup messages like so:

      2014-05-12 17:24:44.936384 {RMP08Ma185908506} [mmcs]{0}.0.1:  kernel:BUG: soft lockup - CPU#32 stuck for 67s! [ptlrpcd_0:3202]
      2014-05-12 17:24:45.140774 {RMP08Ma185908506} [mmcs]{0}.16.1: BUG: soft lockup - CPU#65 stuck for 67s! [ptlrpcd_1:3203]
      2014-05-12 17:24:45.141140 {RMP08Ma185908506} [mmcs]{0}.16.1: Modules linked in: lmv(U) mgc(U) lustre(U) mdc(U) fid(U) fld(U) lov(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) bgvrnic bgmudm
      2014-05-12 17:24:45.141506 {RMP08Ma185908506} [mmcs]{0}.16.1: NIP: 80000000003715bc LR: 80000000003717cc CTR: 000000000000007d
      2014-05-12 17:24:45.141979 {RMP08Ma185908506} [mmcs]{0}.16.1: REGS: c0000003eb43ab40 TRAP: 0901   Not tainted  (2.6.32-358.11.1.bgq.4blueos.V1R2M1.bl2.1_0.ppc64)
      2014-05-12 17:24:45.142274 {RMP08Ma185908506} [mmcs]{0}.16.1: MSR: 0000000080029000 <EE,ME,CE>  CR: 44224424  XER: 00000000
      2014-05-12 17:24:45.142575 {RMP08Ma185908506} [mmcs]{0}.16.1: TASK = c0000003ecc14060[3203] 'ptlrpcd_1' THREAD: c0000003eb438000 CPU: 65
      2014-05-12 17:24:45.142979 {RMP08Ma185908506} [mmcs]{0}.16.1: GPR00: 000000000003dfb8 c0000003eb43adc0 80000000003b5550 00000000000015b0 
      2014-05-12 17:24:45.143352 {RMP08Ma185908506} [mmcs]{0}.16.1: GPR04: c00000031d82c330 0000000000000000 0000000000000000 0000000000000053 
      2014-05-12 17:24:45.143728 {RMP08Ma185908506} [mmcs]{0}.16.1: GPR08: 0000000000000048 c00000031d82d110 0000000000000071 000000001eaffaf3 
      2014-05-12 17:24:45.144048 {RMP08Ma185908506} [mmcs]{0}.16.1: GPR12: 0000000000002720 c0000000007f7200 
      2014-05-12 17:24:45.144424 {RMP08Ma185908506} [mmcs]{0}.16.1: NIP [80000000003715bc] .__adler32+0x9c/0x220 [libcfs]
      2014-05-12 17:24:45.144861 {RMP08Ma185908506} [mmcs]{0}.16.1: LR [80000000003717cc] .adler32_update+0x1c/0x40 [libcfs]
      2014-05-12 17:24:45.145111 {RMP08Ma185908506} [mmcs]{0}.16.1: Call Trace:
      2014-05-12 17:24:45.145442 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43adc0] [c0000003eb43ae50] 0xc0000003eb43ae50 (unreliable)
      2014-05-12 17:24:45.145800 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43ae40] [c0000000002041c4] .crypto_shash_update+0x4c/0x60
      2014-05-12 17:24:45.146124 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43aeb0] [c000000000204218] .shash_compat_update+0x40/0x80
      2014-05-12 17:24:45.146441 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43af70] [800000000037072c] .cfs_crypto_hash_update_page+0x8c/0xb0 [libcfs]
      2014-05-12 17:24:45.146758 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b030] [8000000001284f94] .osc_checksum_bulk+0x1c4/0x890 [osc]
      2014-05-12 17:24:45.147160 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b180] [80000000012863c0] .osc_brw_prep_request+0xd60/0x1d80 [osc]
      2014-05-12 17:24:45.147512 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b340] [800000000129931c] .osc_build_rpc+0xc8c/0x2320 [osc]
      2014-05-12 17:24:45.147814 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b4f0] [80000000012bf304] .osc_send_write_rpc+0x594/0xb40 [osc]
      2014-05-12 17:24:45.148150 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b6d0] [80000000012bff84] .osc_check_rpcs+0x6d4/0x1770 [osc]
      2014-05-12 17:24:45.148447 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43b900] [80000000012c132c] .osc_io_unplug0+0x30c/0x6b0 [osc]
      2014-05-12 17:24:45.148812 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43ba20] [800000000129b174] .brw_interpret+0x7c4/0x1940 [osc]
      2014-05-12 17:24:45.149257 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43bb60] [8000000000edb6b4] .ptlrpc_check_set+0x3a4/0x4bb0 [ptlrpc]
      2014-05-12 17:24:45.149659 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43bd20] [8000000000f2a80c] .ptlrpcd_check+0x66c/0x870 [ptlrpc]
      2014-05-12 17:24:45.149960 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43be40] [8000000000f2acc8] .ptlrpcd+0x2b8/0x4c0 [ptlrpc]
      2014-05-12 17:24:45.150341 {RMP08Ma185908506} [mmcs]{0}.16.1: [c0000003eb43bf90] [c00000000001b9a8] .kernel_thread+0x54/0x70
      

      I have also noticed, watching CPU usage with top, that about two CPUs are pegged at exactly 100% usage, all in the sys column. The rest of the CPUs are more evenly split among the different usage columns: a little in user, most in sys and wa.

      The CPUs with the 100% usage seem to correspond to the ptlrpcd processes that trigger the soft lockup warning, which I suppose is not terribly surprising.

      We have seen this happening for well over a year, but we are only now getting around to looking at it more closely, due to other recent work on the BG/Q clients.

    Attachments

    Issue Links

    Activity

            [LU-5053] soft lockup in ptlrpcd on client
            pjones Peter Jones added a comment -

            Landed for 2.6


            morrone Christopher Morrone (Inactive) added a comment -

            I updated my patch to be that one-liner, plus comments:

            http://review.whamcloud.com/#/c/10358/
            green Oleg Drokin added a comment -

            As we just discussed on the call - I guess it's more future-proof to just do cond_resched() at every iteration. Since we already allow sleeping in the interpret callbacks, this should make the patch basically a one-liner.

            morrone Christopher Morrone (Inactive) added a comment -

            Can you prove that it is correct? I see quite a few continue statements before ptlrpc_req_interpret().

            What would be the worst case time between cond_resched() calls? Is it bounded?

            By putting the cond_resched() at the beginning of the loop, we bound the maximum time that we exceed our run time to just over the maximum time it takes to process a single element of the list. Let's call that X.

            If we put the cond_resched() at the end of the loop, the processing time for each element is assumed to be less than X, so let's call that Y. But now we will need to process N elements before reaching a cond_resched(), so the time between calls becomes Y*N.

            Yes, Y is probably less than X, and Y*N may often be less than X. But N is unknown and potentially quite large, and this is computer science: we need to know the worst case and handle it reasonably. In the worst case, Y*N may be much, much larger than X.

            This all assumes that ptlrpc_req_interpret() and the other calls are not buggy and not unbounded in run time. But that is fine for now; let's just focus on making ptlrpc_check_set() correct before addressing other problems.

            Since N is unknown, we need to assume it is large. That means we need the cond_resched() at the top of the loop.

            Is

            unlikely(test_thread_flag(TIF_NEED_RESCHED))

            too high a price to pay to ensure proper run time?

            If you think it is too expensive, then we need to audit more code and put protections in place to limit N.
            green Oleg Drokin added a comment -

            I actually tested with a very similar patch, only I inserted an unconditional cond_resched() before calling the RPC completion handler (which is the really expensive place, I suspect).
            It passed my rather rigorous testing on the race-crazy setup, and so I think it's totally fine to do it for all callers.

            What do you think?

            We do not allow calling ptlrpc_check_next while holding a spinlock anyway.

            morrone Christopher Morrone (Inactive) added a comment -

            Andreas had a couple of questions about my testing patch, http://review.whamcloud.com/#/c/10358/, that I will answer here so they hopefully have a better chance of informing the final patch.

              cfs_list_for_each_safe(tmp, next, &set->set_requests) {
                      struct ptlrpc_request *req =
                              cfs_list_entry(tmp, struct ptlrpc_request,
                                             rq_set_chain);
                      struct obd_import *imp = req->rq_import;
                      int unregistered = 0;
                      int rc = 0;
            +
            +         if (resched_allowed)
            +                 cond_resched();

            Any reason not to allow reschedule for all callers and just drop the parameter?

            Allowing it for all callers would certainly be cleaner code. But ensuring that lustre never does anything stupid, like holding a spin lock while calling ptlrpc_check_set(), would require a much, much larger code audit on my part. I have seen ptlrpc_check_set() appear in far too many backtraces over the past few years to make the change without greater understanding. It looked safe to schedule from the ptlrpcd() path, but I just don't know about the others.

            You folks probably have a better idea than I do whether it would be safe to cond_resched() in all cases.

            It would also make more sense to do this after processing the RPC instead of before (no point to reschedule before doing any work)

            I did consider that as well. The problem is that ptlrpc_check_set() has nearly 20 continue statements. I don't see a clear way to ensure that our CPU time is bounded, and therefore correct, without putting the cond_resched() first. And if the very first cond_resched() results in a real schedule before we do any rpc processing, then clearly the thread had already exhausted its time slice and should be rescheduled.

            morrone Christopher Morrone (Inactive) added a comment -

            Liang, generally speaking I think there are some good improvements that you are working on. However, I do not think that they address the fundamental bug of this ticket, which is this:

            The processing time in ptlrpc_check_set() is unbounded.

            Reducing the number of times the list is scanned by a factor of 6 would be good, but that will still leave the ptlrpcd using the CPU for far too long without a schedule. "Way too long" divided by six is still "too long". We must either somehow bound the time spent in ptlrpc_check_set(), or we must call cond_resched() while processing the list.

            liang Liang Zhen (Inactive) added a comment -

            Peter/Andreas/Oleg, I will be on vacation for most of the next two weeks, so I probably can't update my patch in time. If it makes any sense to you, could you please find someone to take it over? Or feel free to abandon it and take a different approach if it's not correct. Thanks.

            liang Liang Zhen (Inactive) added a comment -

            Hi, after looking into the code again, I suspect the major reason is that ptlrpcd_check() is too expensive a condition for l_wait_event():

            • l_wait_event() will call the condition callback (ptlrpcd_check() in this case) three times in each loop
            • ptlrpcd_check() will usually scan the whole request list twice: the first time in ptlrpc_check_set(), the second time while finishing completed requests
            • which means ptlrpcd() will scan all requests 3 * 2 = 6 times in each loop, which can be very time consuming. Even worse, the longer it takes, the less likely it is that it can sleep, because more requests/replies may arrive in the meantime

            So I changed my patch in this way:

            • only call cond_resched() in the main loop of ptlrpcd(), because very likely we just call into ptlrpcd_check() again and again
            • ptlrpc_check_set() will put completed requests at the head of ptlrpc_request_set::set_requests
            • ptlrpcd_check() then only needs to scan a small part of ptlrpc_request_set::set_requests while finishing completed requests
            • add a "wakeup version", ptlrpc_request_set::set_wake_ver
              • this version number increases whenever there is a wakeup call for the ptlrpc_request_set
              • ptlrpcd will check set_wake_ver in ptlrpcd_check(); if it differs from the version that ptlrpcd has already checked/saved, then ptlrpcd saves the new version number and calls into ptlrpc_check_set(), otherwise ptlrpcd can skip the scan
              • with this change, ptlrpcd_check() could be a lot cheaper when it is called more than once in the same loop (very short interval)

            morrone Christopher Morrone (Inactive) added a comment -

            I have seen a couple of these console messages on the clients while using test patch 10358:

            2014-05-16 14:51:42.541470 {RMP16Ma135918551} [mmcs]{10}.6.1: Lustre: 3876:0:(service.c:1889:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 6s  req@c0000002450f32c0 x1465655019342052/t0(0) o104->LOV_OSC_UUID@172.20.20.18@o2ib500:0/0 lens 296/0 e 0
             to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1

            That thread was ldlm_bl_19.

            The client is still probably faster and better behaved than it has been in the past.

            People

              green Oleg Drokin
              morrone Christopher Morrone (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: