Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • None
    • Affects Version/s: Lustre 2.4.1
    • None
    • Environment: our lustre source tree is at:
      https://github.com/jlan/lustre-nas
    • Severity: 3
    • 12492

    Description

      mdt threads hung. Forced reboot of the MDS 2 different times.

      uploading the following to ftp site:
      lustre-log.1391239242.7851.txt.gz
      vmcore-dmesg.txt.gz

      Lustre: MGS: haven't heard from client c546719d-1bcc-571f-a4e3-17f67dc35b50 (at 10.151.31.4@o2ib) in 199 seconds. I think it's dead, and I am evicting it. exp ffff880fd0587800, cur 1391236411 expire 1391236261 last 1391236212
      LNet: Service thread pid 7851 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Pid: 7851, comm: mdt01_055
      
      Call Trace:
       [<ffffffff815404c2>] schedule_timeout+0x192/0x2e0
       [<ffffffff81080610>] ? process_timeout+0x0/0x10
       [<ffffffffa04156d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
       [<ffffffffa06d201d>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc]
       [<ffffffffa06cd790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
       [<ffffffff81063be0>] ? default_wake_function+0x0/0x20
       [<ffffffffa06d1758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
       [<ffffffffa06d1b30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
       [<ffffffffa0dd7a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
       [<ffffffffa0dddc0c>] mdt_object_lock0+0x28c/0xaf0 [mdt]
       [<ffffffffa0dd7a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
       [<ffffffffa06d1b30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
       [<ffffffffa0dde534>] mdt_object_lock+0x14/0x20 [mdt]
       [<ffffffffa0dde5a1>] mdt_object_find_lock+0x61/0x170 [mdt]
       [<ffffffffa0e0c80c>] mdt_reint_open+0x8cc/0x20e0 [mdt]
       [<ffffffffa043185e>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
       [<ffffffffa06fadcc>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
       [<ffffffffa05921b0>] ? lu_ucred+0x20/0x30 [obdclass]
       [<ffffffffa0dd7015>] ? mdt_ucred+0x15/0x20 [mdt]
       [<ffffffffa0df31cc>] ? mdt_root_squash+0x2c/0x410 [mdt]
       [<ffffffffa0df7981>] mdt_reint_rec+0x41/0xe0 [mdt]
       [<ffffffffa0ddcb03>] mdt_reint_internal+0x4c3/0x780 [mdt]
       [<ffffffffa0ddd090>] mdt_intent_reint+0x1f0/0x530 [mdt]
       [<ffffffffa0ddaf3e>] mdt_intent_policy+0x39e/0x720 [mdt]
       [<ffffffffa06b2831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
       [<ffffffffa06d91ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
       [<ffffffffa0ddb3c6>] mdt_enqueue+0x46/0xe0 [mdt]
       [<ffffffffa0de1ad7>] mdt_handle_common+0x647/0x16d0 [mdt]
       [<ffffffffa0e1b615>] mds_regular_handle+0x15/0x20 [mdt]
       [<ffffffffa070b3c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
       [<ffffffffa04155de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
       [<ffffffffa0426d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
       [<ffffffffa0702729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
       [<ffffffff81055813>] ? __wake_up+0x53/0x70
       [<ffffffffa070c75e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
       [<ffffffffa070bc90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffffa070bc90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffffa070bc90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      LustreError: dumping log to /tmp/lustre-log.1391239242.7851
      

    Attachments

    Issue Links

    Activity

      [LU-4572] hung mdt threads

            jay Jinshan Xiong (Inactive) added a comment -

            I decided to fix this issue in ticket LU-7372.


            jay Jinshan Xiong (Inactive) added a comment -

            I reopened this ticket because I tend to think ldlm_expired_completion_wait() should return an error code, at least on the client side, so that the client enqueue process won't be stuck on the completion AST forever.

            The same issue is seen on LU-7372.
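
            To make the idea above concrete, here is a minimal user-space sketch; it is not Lustre code, and fake_lock, expired_completion_wait() and completion_wait() are hypothetical stand-ins. It only models the proposal that the expired-completion handler return an error (e.g. -ETIMEDOUT) on the client side instead of silently re-arming the wait, so the enqueue path can give up rather than block forever:

            /*
             * Illustrative sketch only -- not Lustre code. It models a completion
             * wait whose timeout handler returns an error on the client side so
             * the caller can propagate it instead of waiting forever.
             */
            #include <errno.h>
            #include <stdbool.h>
            #include <stdio.h>

            /* Hypothetical stand-in for the lock being enqueued. */
            struct fake_lock {
                bool granted;
                bool is_client;   /* client-side enqueue vs. server-side */
            };

            /* Analogue of the expired-completion handler: return an error on the
             * client instead of 0 ("keep waiting"). */
            static int expired_completion_wait(struct fake_lock *lock)
            {
                if (lock->is_client)
                    return -ETIMEDOUT;
                return 0;
            }

            /* Analogue of the completion wait loop: stop as soon as the expired
             * handler reports an error. */
            static int completion_wait(struct fake_lock *lock, int max_waits)
            {
                for (int i = 0; i < max_waits; i++) {
                    if (lock->granted)
                        return 0;
                    int rc = expired_completion_wait(lock);
                    if (rc != 0)
                        return rc;   /* propagate to the enqueue caller */
                }
                return -ETIMEDOUT;
            }

            int main(void)
            {
                struct fake_lock lock = { .granted = false, .is_client = true };
                printf("completion_wait() -> %d\n", completion_wait(&lock, 3));
                return 0;
            }

            In the real code the decision would presumably live in ldlm_expired_completion_wait() itself, as the comment above suggests; the exact plumbing is not shown here.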

            pjones Peter Jones added a comment -

            ok thanks Mahmoud


            mhanafi Mahmoud Hanafi added a comment -

            Please close; we have not seen the issue again.

            green Oleg Drokin added a comment -

            I see. Let's see if switching to HPET helps any, I guess.
            With no CPU usage when this happened, I do not see why things would be delayed by 70 seconds other than a broken clock.
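
            As a side note on the clocksource question, here is a small sketch (assuming the node exposes the standard Linux sysfs clocksource interface under /sys/devices/system/clocksource/clocksource0/) that prints the current and available clocksources before switching to hpet:

            #include <stdio.h>

            /* Print one sysfs attribute if it is readable. */
            static void dump_attr(const char *path)
            {
                char buf[128];
                FILE *f = fopen(path, "r");

                if (f != NULL) {
                    if (fgets(buf, sizeof(buf), f) != NULL)
                        printf("%s: %s", path, buf);
                    fclose(f);
                }
            }

            int main(void)
            {
                dump_attr("/sys/devices/system/clocksource/clocksource0/current_clocksource");
                dump_attr("/sys/devices/system/clocksource/clocksource0/available_clocksource");
                return 0;
            }

            The switch itself would typically be made persistent with the clocksource=hpet kernel boot parameter; treat both the paths and the parameter as things to verify on the actual MDS rather than as a definitive recipe.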


            mhanafi Mahmoud Hanafi added a comment -

            1. I am working through the logs to see if there were any IB fabric issues.

            2. When the system was in that state, CPU utilization was minimal.

            3. We have seen the clocksource message in our logs, but we haven't found it to cause any issues. I will switch it to HPET, though.

            4. This filesystem has 84 OSTs.

            green Oleg Drokin added a comment -

            One more question: how many OSTs do you have in your system?

            green Oleg Drokin added a comment -

            Well, the timeout you posted in the ticket seems to be related to client 10.151.31.4@o2ib being unresponsive. Since the client is unresponsive, it presumably does not send any lock cancels either, so the locks it holds cannot be cancelled and others who need those locks are forced to wait.
            I also see a number of network errors in the logs (and in the client logs too); are you having some network problems there?

            Further on I see a different kind of problem:

            <4>Lustre: 65757:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1391239042/real 1391239042]  req@ffff8806c8f5d800 x1458619173294584/t0(0) o601->nbp7-MDT0000-lwp-MDT0000@0@lo:12/10 lens 336/336 e 3 to 1 dl 1391240399 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
            <4>Lustre: 65757:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
            <4>Lustre: 65757:0:(service.c:2031:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (1110:247s); client may timeout.  req@ffff8806d18a2000 x1458067638265204/t0(0) o101->446c8dd2-5b53-9b80-51b7-1573c85a021b@10.151.17.177@o2ib:0/0 lens 616/544 e 3 to 0 dl 1391240152 ref 1 fl Complete:/0/0 rc 301/301
            <4>LNet: Service thread pid 7918 completed after 1357.11s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            

            So here we are trying to send a quota RPC over the locally connected lwp and it's timing out? That does not sound right. The timeout for this local connection is also huge for whatever reason, and I do not see it being assigned any sane value either.

            We can see that over time the timeout there grows even more.

            The other worrying sign is this:

             <4>Clocksource tsc unstable (delta = -68719428120 ns).  Enable clocksource failover by adding clocksource_failover kernel parameter.
             <4>Lustre: 6577:0:(service.c:1889:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 76s  req@ffff880f7ded2850 x1447159886197772/t0(0) o400->449796f8-1d26-c76e-7df6-0123fb984781@10.151.57.111@o2ib:0/0 lens 192/0 e 0 to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1
             <4>Lustre: mdt: This server is not able to keep up with request traffic (cpu-bound).
            

            So 70+ seconds to handle the request_in code is way too excessive. When this happens, do you experience huge CPU utilization?
            Also, I see the tsc unstable message - I think the kernel uses that as an internal timestamp counter, so if it deviates this hugely, could it be that all internal timeouts are just not working properly? Is this something you can address, just to be sure?


            mhanafi Mahmoud Hanafi added a comment -

            I have attached the console logs from 10.151.17.136 (r445i5n0.conlog.gz).

            We have both 2.4 and 2.1.5 clients.

            at_max = 600.

            The info was taken from a crash dump.

            green Oleg Drokin added a comment -

            Also: do you run the 2.4 clients from that source as well? Can you please show me dmesg from, e.g., 10.151.17.136 during all of this happening?
            Did you crashdump the node to get all of this info, or did you let it stay alive and just live-dump it?


            People

              Assignee: green Oleg Drokin
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 9
