[LU-4733] All mdt thread stuck in cfs_waitq_wait Created: 07/Mar/14 Updated: 02/Oct/14 Resolved: 02/Oct/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Sebastien Buisson (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 13008 |
| Description |
|
Hi, We are seeing all mdt threads on the MDS stuck in "cfs_waitq_wait". At the same time, we have a lot of RPC requests (15k/s). What is strange is that cfs_time_beforeq_64(max_age, obd->obd_osfs_age) should be true. This issue was hit 4 times during February. Please find attached the dmesg and 'foreach bt' outputs. Sebastien. |
| Comments |
| Comment by Bruno Faccini (Inactive) [ 07/Mar/14 ] |
|
Looks like a dup of |
| Comment by Sebastien Buisson (Inactive) [ 07/Mar/14 ] |
|
Hi Bruno, I was not so sure as problem described in |
| Comment by Peter Jones [ 07/Mar/14 ] |
|
Niu Could you please confirm whether this is a duplicate of Thanks Peter |
| Comment by Liang Zhen (Inactive) [ 07/Mar/14 ] |
|
This issue should have been fixed on master; please see lov_update_set() on master, where set->set_completes is an atomic counter, which is safe:
void lov_update_set(struct lov_request_set *set,
struct lov_request *req, int rc)
{
req->rq_complete = 1;
req->rq_rc = rc;
cfs_atomic_inc(&set->set_completes);
if (rc == 0)
cfs_atomic_inc(&set->set_success);
wake_up(&set->set_waitq);
}
In 2.1.x, by contrast, set->set_completes is just a plain integer, which is not safe (increments may be lost under a race). This means lov_set_finished() may never return true, and qos_statfs_done() will never be called if that happens:
void lov_update_set(struct lov_request_set *set,
struct lov_request *req, int rc)
{
req->rq_complete = 1;
req->rq_rc = rc;
set->set_completes++;
if (rc == 0)
set->set_success++;
cfs_waitq_signal(&set->set_waitq);
}
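To illustrate the race outside of Lustre: the sketch below is plain C11 + pthreads, purely illustrative and not Lustre source; two threads increment a plain int and an atomic counter the same number of times, and the plain counter typically ends up short, which is the analogue of set_completes never reaching set_count.

/* Illustration of the lost-increment race described above (not Lustre code). */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define LOOPS 1000000

static int plain_completes;          /* like set_completes in 2.1.x: plain int */
static atomic_int atomic_completes;  /* like set_completes on master: atomic   */

static void *worker(void *arg)
{
        (void)arg;
        for (int i = 0; i < LOOPS; i++) {
                plain_completes++;                      /* non-atomic read-modify-write: increments can be lost */
                atomic_fetch_add(&atomic_completes, 1); /* atomic read-modify-write: never lost                 */
        }
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* plain is typically below expected after a concurrent run; atomic always matches */
        printf("plain=%d atomic=%d expected=%d\n",
               plain_completes, atomic_load(&atomic_completes), 2 * LOOPS);
        return 0;
}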
|
| Comment by Niu Yawei (Inactive) [ 09/Mar/14 ] |
|
Liang, 2.1 has only one ptlrpcd thread to process the statfs requests, so I think the ++ operation should be safe. Probably statfs can't finish on a certain OST? Sebastien, were there any abnormal messages on the OSTs? Also, I see the following messages on the MDT: [655444.910448] Uhhuh. NMI received for unknown reason 3d on CPU 0.
[655444.916528] Do you have a strange power saving mode enabled?
[655444.922337] Kernel panic - not syncing: NMI: Not continuing
[655444.928060] Pid: 0, comm: swapper Not tainted 2.6.32-220.23.1.bl6.Bull.28.10.x86_64 #1
[655444.936173] Call Trace:
[655444.938768] <NMI> [<ffffffff814851a0>] ? panic+0x78/0x143
[655444.944518] [<ffffffff81488ff0>] ? do_nmi+0x240/0x2b0
[655444.949801] [<ffffffff81488830>] ? nmi+0x20/0x30
[655444.954655] [<ffffffff812a9831>] ? intel_idle+0xb1/0x170
[655444.960200] <<EOE>> [<ffffffff813aa267>] ? cpuidle_idle_call+0xa7/0x140
[655444.967165] [<ffffffff81001e06>] ? cpu_idle+0xb6/0x110
[655444.972536] [<ffffffff8146ff2a>] ? rest_init+0x7a/0x80
[655444.977909] [<ffffffff81adff8c>] ? start_kernel+0x481/0x48d
[655444.983721] [<ffffffff81adf31e>] ? x86_64_start_reservations+0x125/0x129
[655444.990651] [<ffffffff81adf41c>] ? x86_64_start_kernel+0xfa/0x109
This looks like a hardware error; could it cause a system hang under load? |
| Comment by Bruno Faccini (Inactive) [ 09/Mar/14 ] |
|
Niu, my 2 cents: this comes from a forced crash-dump attempt or HA fencing via NMI ... |
| Comment by Alexandre Louvet [ 10/Mar/14 ] |
|
The Lustre version running at the customer site is based on Lustre 2.1.6 plus the following list of patches
so:
|
| Comment by Niu Yawei (Inactive) [ 11/Mar/14 ] |
|
Alexandre, any error messages on OSTs? |
| Comment by Patrick Valentin (Inactive) [ 11/Mar/14 ] |
|
Niu, I have attached the logs provided by the support team, corresponding to the two last crashes, which occurred on 2 and 3 March. Here is a copy of the analysis they have done:
crash> struct lov_qos.lq_statfs_in_progress 0xffff881eff05a6b0
lq_statfs_in_progress = 1
crash> struct lov_statfs_data 0xffff881e3b9a2b80
struct lov_statfs_data {
lsd_oi = {
oi_policy = {
.....
oi_flags = 5, <--- OBD_STATFS_PTLRPCD | OBD_STATFS_NODELAY
crash> p *ptlrpcds
$1 = {
pd_size = 6352,
pd_index = 0,
pd_nthreads = 32,
pd_thread_rcv = {
pc_flags = 258,
pc_lock = {
raw_lock = {
slock = 0
}
}
.....
crash> struct ptlrpc_request_set 0xffff881fa6c702c0
struct ptlrpc_request_set {
set_refcount = {
counter = 1
},
set_new_count = {
counter = 0
},
set_remaining = {
counter = 0
}
......
It seems the RPCs have been sent but no responses were received. In cb_statfs_update() the conditions are not satisfied:
lov_update_set(lovreq->rq_rqset, lovreq, rc);
if (rc)
GOTO(out, rc);
out:
if (lovreq->rq_rqset->set_oi->oi_flags & OBD_STATFS_PTLRPCD &&
lov_set_finished(lovreq->rq_rqset, 0)) {
lov_statfs_interpret(NULL, lovreq->rq_rqset,
lovreq->rq_rqset->set_success !=
lovreq->rq_rqset->set_count);
qos_statfs_done(lov);
|
| Comment by Sebastien Buisson (Inactive) [ 20/Mar/14 ] |
|
Hi, Please find attached the error logs on the OSS. Sebastien. |
| Comment by Niu Yawei (Inactive) [ 24/Mar/14 ] |
|
I see some OSTs reported slow reads in the log:
bcluster210.log-20140309.gz:2014-03-02 13:09:40 Lustre: 24804:0:(filter_io.c:475:filter_preprw_read()) scratch3-OST002c: slow start_page_read 30s
bcluster210.log-20140309.gz:2014-03-02 13:09:40 Lustre: 24804:0:(filter_io.c:475:filter_preprw_read()) Skipped 30 previous similar messages
Maybe the OST is too busy to reply to the statfs in time? Does the MDT get out of the stuck state after the statfs RPC times out and the burst of create requests from clients levels off? |
| Comment by Sebastien Buisson (Inactive) [ 24/Mar/14 ] |
|
I do not have answers to your questions, but given their time of occurrence the 'slow' messages seem to be completely independent of the problem where all mdt threads are stuck in "cfs_waitq_wait", right? |
| Comment by Niu Yawei (Inactive) [ 25/Mar/14 ] |
I don't know for sure. I just want to figure out whether it's a deadlock (or missing wakeup) problem: if the threads recovered when the system load got lower, then it's unlikely to be a deadlock/missing-wakeup problem. BTW: is there any way to check the code running on the servers (to make sure all the patches were applied properly)? |
| Comment by Sebastien Buisson (Inactive) [ 25/Mar/14 ] |
|
> BTW: Is there anyway to check the code running on servers? (to make sure the all the patches were applied properly) Sure, but what do you suggest? I mean, how can we proceed to 'check the code'? |
| Comment by Niu Yawei (Inactive) [ 25/Mar/14 ] |
|
Could you upload the code to the Whamcloud FTP server (just as you upload logs)? Or show me where I can check out the code? Thanks. |
| Comment by Sebastien Buisson (Inactive) [ 27/Mar/14 ] |
|
Hi, On Whamcloud's FTP (ftp.whamcloud.com), I have created a directory " Sebastien. |
| Comment by Niu Yawei (Inactive) [ 02/Apr/14 ] |
|
Thank you, Sebastien. The code looks correct to me. It's possible that the connection between the MDT and some OST is broken, so the statfs can't be updated; and since there were many ongoing create requests, lq_statfs_in_progress could be set back to 1 immediately after a statfs cycle completes, so the waiters go back to waiting on qos_statfs_ready(). When the problem arises again, could you try to collect a Lustre log on the MDT with D_QOS & D_RPCTRACE enabled? Thank you.
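As an illustration of that scenario, here is a minimal userspace model; it is not Lustre source, the names lq_statfs_in_progress, qos_statfs_ready() and qos_statfs_done() are only borrowed from this ticket, and the logic is an assumption about the 2.1 code path: if another create request flips the in-progress flag back to 1 before a waiter runs, the waiter re-checks the predicate, finds it false, and goes straight back to sleep, which would match threads sitting in cfs_waitq_wait.

/* Minimal model of the wait/re-arm scenario (illustrative only, not Lustre code). */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lq_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  lq_waitq = PTHREAD_COND_INITIALIZER;
static int lq_statfs_in_progress;   /* 1 while a statfs cycle is in flight (assumed semantics) */

/* Analogue of qos_statfs_ready(): the predicate the waiting MDT threads check. */
static bool statfs_ready(void)
{
        return lq_statfs_in_progress == 0;
}

/* What each waiting thread does (analogue of sleeping in cfs_waitq_wait()). */
static void wait_for_statfs(void)
{
        pthread_mutex_lock(&lq_lock);
        while (!statfs_ready())
                /* If another create request sets lq_statfs_in_progress back to 1
                 * before this thread runs, the predicate is false again and the
                 * thread goes straight back to sleep. */
                pthread_cond_wait(&lq_waitq, &lq_lock);
        pthread_mutex_unlock(&lq_lock);
}

/* Analogue of qos_statfs_done(): end of a statfs cycle, wake all waiters. */
static void statfs_done(void)
{
        pthread_mutex_lock(&lq_lock);
        lq_statfs_in_progress = 0;
        pthread_cond_broadcast(&lq_waitq);
        pthread_mutex_unlock(&lq_lock);
}
|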
| Comment by Patrick Valentin (Inactive) [ 09/Apr/14 ] |
|
Hi, Several things were done:
The attached file "lctl-dk.tgz" contains the following traces:
Patrick |
| Comment by Niu Yawei (Inactive) [ 21/Apr/14 ] |
Could you provide the stack traces for all processes on the MDS when unmounting the device?
00000100:00000400:2.0F:1396898402.999140:0:12693:0:(service.c:1035:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply req@ffff880f02f9b800 x1463563674626235/t0(0) o400->4b5b918b-511a-1159-a7ee-db25715da035@JO.BOO.PA.BLW@o2ib2:0/0 lens 192/0 e 0 to 0 dl 1396898407 ref 2 fl New:H/0/ffffffff rc 0/-1
00000100:00000400:2.0:1396898402.999154:0:12693:0:(service.c:1035:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply req@ffff880f02f9bc00 x1463563600312609/t0(0) o400->23a5a711-176e-4fea-3690-e6e8f81b7ba3@JO.BOO.IW.WW@o2ib2:0/0 lens 192/0 e 0 to 0 dl 1396898407 ref 2 fl New:H/0/ffffffff rc 0/-1
00000100:00000400:2.0:1396898402.999165:0:12693:0:(service.c:1035:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply req@ffff88091d277000 x1463563647245622/t0(0) o400->e081ff72-cc00-6425-c96f-b093a974f1a4@JO.BOO.WB.AW@o2ib2:0/0 lens 192/0 e 0 to 0 dl 1396898407 ref 2 fl New:H/0/ffffffff rc 0/-1
00000100:00000400:2.0:1396898402.999176:0:12693:0:(service.c:1035:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply req@ffff88091d277400 x1463563646360851/t0(0) o400->822cdf71-bac5-e141-3e77-a03c858db938@JO.BOO.PA.BBA@o2ib2:0/0 lens 192/0 e 0 to 0 dl 1396898407 ref 2 fl New:H/0/ffffffff rc 0/-1
Is JO.BOO.PA.BLW@o2ib2 one of the OSTs? It looks like the connections between the MDT and some OSTs are not stable. I tend to think this is the same issue as |
| Comment by Patrick Valentin (Inactive) [ 25/Apr/14 ] |
|
Hi, |
| Comment by Niu Yawei (Inactive) [ 17/Jun/14 ] |
bcluster231.log-20140309.gz:2014-03-03 23:18:53 LustreError: 25883:0:(o2iblnd_cb.c:2992:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
bcluster231.log-20140309.gz:2014-03-03 23:18:53 LustreError: 25883:0:(o2iblnd_cb.c:3055:kiblnd_check_conns()) Timed out RDMA with JO.BOO.AL.IL@o2ib2 (151): c: 8, oc: 0, rc: 8
bcluster231.log-20140309.gz:2014-03-03 23:18:53 Mar 3 23:18:53 bcluster231 kernel: LustreError: 25883:0:(o2iblnd_cb.c:2992:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
bcluster231.log-20140309.gz:2014-03-03 23:18:53 Mar 3 23:18:53 bcluster231 kernel: LustreError: 25883:0:(o2iblnd_cb.c:3055:kiblnd_check_conns()) Timed out RDMA with JO.BOO.AL.IL@o2ib2 (151): c: 8, oc: 0, rc: 8
bcluster231.log-20140309.gz:2014-03-03 23:22:50 Lustre: scratch3-OST01df: haven't heard from client scratch3-MDT0000-mdtlov_UUID (at JO.BOO.AL.IL@o2ib2) in 902 seconds. I think it's dead, and I am evicting it. exp ffff880dc9df1000, cur 1393885370 expire 1393884770 last 1393884468
The OSS log shows that something was wrong with the network when the problem happened on the MDS; could you check your network as Amir suggested in |
| Comment by Sebastien Buisson (Inactive) [ 02/Oct/14 ] |
|
Hi, The customer has upgraded to Lustre 2.4 and since then has not been able to reproduce the issue. Thanks, |
| Comment by Peter Jones [ 02/Oct/14 ] |
|
ok thanks Sebastien |