
LU-4733: All MDT threads stuck in cfs_waitq_wait

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.6
    • Labels: None
    • Severity: 3
    • Rank: 13008

    Description

      Hi,

      We are seeing all MDT threads on the MDS stuck in "cfs_waitq_wait". At the same time, we have a lot of RPC requests (15k/s).
      Looking closely at the console and the 'bt' output from crash, we can see that those threads come from qos_statfs_update(), where they block in l_wait_event and never wake up.

      What is strange is that cfs_time_beforeq_64(max_age, obd->obd_osfs_age) should be true.
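
      For context, the wait described above sits in qos_statfs_update() in lov/lov_qos.c. The sketch below is a simplified reconstruction based on the function and field names cited in this ticket (locking, error handling and the statfs RPC handling are omitted, and names not cited here, such as lq_statfs_waitq, are approximate); it is not the exact 2.1.6 source:

      /* Simplified sketch of qos_statfs_update(), not the exact 2.1.6 code. */
      void qos_statfs_update(struct obd_device *obd, __u64 max_age, int wait)
      {
              struct lov_obd *lov = &obd->u.lov;
              struct l_wait_info lwi = { 0 };

              if (cfs_time_beforeq_64(max_age, obd->obd_osfs_age))
                      /* cached statfs data is recent enough: return without waiting */
                      return;

              /* ... otherwise OST_STATFS requests are queued; their completion
               * is expected to refresh obd->obd_osfs_age, clear
               * lq_statfs_in_progress and wake up the waiters ... */

              if (wait)
                      /* the stuck MDT threads in the backtraces are parked here */
                      l_wait_event(lov->lov_qos.lq_statfs_waitq,
                                   qos_statfs_ready(obd, max_age), &lwi);
      }

      If cfs_time_beforeq_64(max_age, obd->obd_osfs_age) is true at the time of the crash, then qos_statfs_ready() should also evaluate to true (assuming it rechecks the same age comparison) and the waiters should have been woken, which is what makes the hang surprising.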

      This issue was hit 4 times during February.

      Please find attached the dmesg and 'foreach bt' outputs.

      Sebastien.

      Attachments

        1. bt-all.txt
          604 kB
        2. dmesg(1).txt
          173 kB
        3. log_2014-03-02-03.tgz
          1.36 MB
        4. log_OSS_2014-03-02_04.log2.gz
          105 kB
        5. lctl-dk.tgz
          322 kB
        6. backtrace-20140407
          1.65 MB

        Activity

          pjones Peter Jones added a comment -

          ok thanks Sebastien


          sebastien.buisson Sebastien Buisson (Inactive) added a comment -

          Hi,

          The customer has upgraded to Lustre 2.4 and has not been able to reproduce the issue since.
          So this ticket can be closed.

          Thanks,
          Sebastien.

          niu Niu Yawei (Inactive) added a comment -

          bcluster231.log-20140309.gz:2014-03-03 23:18:53 LustreError: 25883:0:(o2iblnd_cb.c:2992:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
          bcluster231.log-20140309.gz:2014-03-03 23:18:53 LustreError: 25883:0:(o2iblnd_cb.c:3055:kiblnd_check_conns()) Timed out RDMA with JO.BOO.AL.IL@o2ib2 (151): c: 8, oc: 0, rc: 8
          bcluster231.log-20140309.gz:2014-03-03 23:18:53 Mar  3 23:18:53 bcluster231 kernel: LustreError: 25883:0:(o2iblnd_cb.c:2992:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
          bcluster231.log-20140309.gz:2014-03-03 23:18:53 Mar  3 23:18:53 bcluster231 kernel: LustreError: 25883:0:(o2iblnd_cb.c:3055:kiblnd_check_conns()) Timed out RDMA with JO.BOO.AL.IL@o2ib2 (151): c: 8, oc: 0, rc: 8
          bcluster231.log-20140309.gz:2014-03-03 23:22:50 Lustre: scratch3-OST01df: haven't heard from client scratch3-MDT0000-mdtlov_UUID (at JO.BOO.AL.IL@o2ib2) in 902 seconds. I think it's dead, and I am evicting it. exp ffff880dc9df1000, cur 1393885370 expire 1393884770 last 1393884468

          The OSS log shows that something was wrong with the network when the problem happened on the MDS. Could you check your network, as Amir suggested in LU-4195?

          patrick.valentin Patrick Valentin (Inactive) added a comment -

          Hi,
          I have attached the stack trace of all processes (2014/04/07) provided by the on-site support team.
          They also mentioned that the JO.BOO.PA.xxx addresses in the lctl-dk_rpc file are clients' (compute nodes) addresses, NOT OSS.

          niu Niu Yawei (Inactive) added a comment -

          > Several things were done:
          > a list of disconnected clients was used to attempt to access the filesystem (df/ls/touch); all attempts failed.
          > umount of the device on the MDS failed; the umount process got stuck in cfs_waitq_wait.

          Could you provide the stack trace of all processes on the MDS while the device is being unmounted?

          00000100:00000400:2.0F:1396898402.999140:0:12693:0:(service.c:1035:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
            req@ffff880f02f9b800 x1463563674626235/t0(0) o400->4b5b918b-511a-1159-a7ee-db25715da035@JO.BOO.PA.BLW@o2ib2:0/0 lens 192/0 e 0 to 0 dl 1396898407 ref 2 fl New:H/0/ffffffff rc 0/-1
          00000100:00000400:2.0:1396898402.999154:0:12693:0:(service.c:1035:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
            req@ffff880f02f9bc00 x1463563600312609/t0(0) o400->23a5a711-176e-4fea-3690-e6e8f81b7ba3@JO.BOO.IW.WW@o2ib2:0/0 lens 192/0 e 0 to 0 dl 1396898407 ref 2 fl New:H/0/ffffffff rc 0/-1
          00000100:00000400:2.0:1396898402.999165:0:12693:0:(service.c:1035:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
            req@ffff88091d277000 x1463563647245622/t0(0) o400->e081ff72-cc00-6425-c96f-b093a974f1a4@JO.BOO.WB.AW@o2ib2:0/0 lens 192/0 e 0 to 0 dl 1396898407 ref 2 fl New:H/0/ffffffff rc 0/-1
          00000100:00000400:2.0:1396898402.999176:0:12693:0:(service.c:1035:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
            req@ffff88091d277400 x1463563646360851/t0(0) o400->822cdf71-bac5-e141-3e77-a03c858db938@JO.BOO.PA.BBA@o2ib2:0/0 lens 192/0 e 0 to 0 dl 1396898407 ref 2 fl New:H/0/ffffffff rc 0/-1

          Is JO.BOO.PA.BLW@o2ib2 one of the OSTs? It looks like the connections between the MDT and some OSTs are not stable. I tend to think this is the same issue as LU-4335 & LU-4195.

          patrick.valentin Patrick Valentin (Inactive) added a comment -

          Hi,
          two new occurrences of this issue were detected at TGCC on April 7 and 9.

          Several things were done:

          • a list of disconnected clients was used to attempt to access the filesystem (df/ls/touch); all attempts failed.
          • umount of the device on the MDS failed; the umount process got stuck in cfs_waitq_wait.

          The attached file "lctl-dk.tgz" contains the following traces:

          • mantis518_dk.log (collected with an echo -1 > /proc/sys/lnet/debug)
          • mantis518_rpc.log (collected with an echo +other +rpctrace > /proc/sys/lnet/debug)

          Patrick

          niu Niu Yawei (Inactive) added a comment -

          Thank you, Sebastien. The code looks correct to me.

          It's possible that the connection between the MDT and some OST is broken, so the statfs data can't be updated; and since there were many ongoing create requests, lq_statfs_in_progress could be set back to 1 immediately after a statfs completed, so the waiters would go back to waiting in qos_statfs_ready().
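
          The scenario described above can be illustrated with a small self-contained userspace model (pthreads). The names osfs_age, statfs_in_progress and statfs_ready() below are stand-ins for the Lustre fields and helpers mentioned in this ticket; this is not Lustre code, just a sketch of the suspected livelock:

          #include <pthread.h>
          #include <stdbool.h>
          #include <stdio.h>

          static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
          static pthread_cond_t  statfs_waitq = PTHREAD_COND_INITIALIZER;

          static unsigned long long osfs_age = 0;   /* stand-in for obd->obd_osfs_age */
          static bool statfs_in_progress = true;    /* stand-in for lq_statfs_in_progress */
          static int  times_slept = 0;

          /* stand-in for the qos_statfs_ready() predicate: data fresh enough,
           * or no statfs refresh currently in flight */
          static bool statfs_ready(unsigned long long max_age)
          {
                  return osfs_age >= max_age || !statfs_in_progress;
          }

          static void *waiter(void *arg)
          {
                  unsigned long long max_age = *(unsigned long long *)arg;

                  pthread_mutex_lock(&lock);
                  while (!statfs_ready(max_age)) {
                          times_slept++;
                          pthread_cond_wait(&statfs_waitq, &lock);
                  }
                  pthread_mutex_unlock(&lock);
                  printf("waiter released after sleeping %d time(s)\n", times_slept);
                  return NULL;
          }

          int main(void)
          {
                  unsigned long long max_age = 100;  /* waiter wants data at least this new */
                  pthread_t tid;

                  pthread_create(&tid, NULL, waiter, &max_age);

                  /* Model the broken OST connection: every statfs cycle completes
                   * WITHOUT refreshing osfs_age, and a new create request starts the
                   * next cycle right away, so statfs_in_progress is set again before
                   * the waiter can observe it clear. */
                  for (int cycle = 0; cycle < 1000; cycle++) {
                          pthread_mutex_lock(&lock);
                          statfs_in_progress = false;         /* cycle done, wake waiters */
                          pthread_cond_broadcast(&statfs_waitq);
                          statfs_in_progress = true;          /* next cycle re-arms the flag */
                          pthread_mutex_unlock(&lock);
                  }

                  /* Only once an OST finally answers does the cached data become fresh,
                   * letting the waiter leave the loop even if a refresh is in flight. */
                  pthread_mutex_lock(&lock);
                  osfs_age = max_age;
                  pthread_cond_broadcast(&statfs_waitq);
                  pthread_mutex_unlock(&lock);

                  pthread_join(tid, NULL);
                  return 0;
          }

          In this model (build with cc -pthread) the waiter keeps going back to sleep for as long as the cached age never becomes fresh and the in-progress flag is re-armed between the wake-up and the re-check, which would match threads appearing permanently parked in cfs_waitq_wait.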

          When the problem arises again, could you collect the Lustre log on the MDT with D_QOS & D_RPCTRACE enabled? Thank you.

          sebastien.buisson Sebastien Buisson (Inactive) added a comment -

          Hi,

          On Whamcloud's FTP server (ftp.whamcloud.com), I have created a directory "LU-4733" containing the file lustre_216_bull_sources.tgz.

          Sebastien.

          niu Niu Yawei (Inactive) added a comment -

          Could you upload the code to the Whamcloud FTP server (just as you upload logs)? Or show me where I can check out the code? Thanks.

          sebastien.buisson Sebastien Buisson (Inactive) added a comment -

          > BTW: Is there any way to check the code running on the servers? (to make sure all the patches were applied properly)

          Sure, but what do you suggest? I mean, how can we proceed to 'check the code'?

          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: sebastien.buisson Sebastien Buisson (Inactive)
            Votes: 0
            Watchers: 11
