[LU-4572] hung mdt threads Created: 01/Feb/14  Updated: 06/Jan/16  Resolved: 06/Jan/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Mahmoud Hanafi Assignee: Oleg Drokin
Resolution: Duplicate Votes: 0
Labels: None
Environment:

our lustre source tree is at:
https://github.com/jlan/lustre-nas


Attachments: File r445i5n0.conlog.gz    
Issue Links:
Duplicate
duplicates LU-2419 mdt threads stuck in ldlm_expired_com... Closed
Related
is related to LU-2944 Client evictions - watchdog timeouts ... Resolved
is related to LU-7372 replay-dual test_26: test failed to r... Resolved
Severity: 3
Rank (Obsolete): 12492

 Description   

mdt threads hung. Forced a reboot of the MDS 2 different times.

Uploading the following to the FTP site:
lustre-log.1391239242.7851.txt.gz
vmcore-dmesg.txt.gz

Lustre: MGS: haven't heard from client c546719d-1bcc-571f-a4e3-17f67dc35b50 (at 10.151.31.4@o2ib) in 199 seconds. I think it's dead, and I am evicting it. exp ffff880fd0587800, cur 1391236411 expire 1391236261 last 1391236212
LNet: Service thread pid 7851 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 7851, comm: mdt01_055

Call Trace:
 [<ffffffff815404c2>] schedule_timeout+0x192/0x2e0
 [<ffffffff81080610>] ? process_timeout+0x0/0x10
 [<ffffffffa04156d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
 [<ffffffffa06d201d>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc]
 [<ffffffffa06cd790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
 [<ffffffff81063be0>] ? default_wake_function+0x0/0x20
 [<ffffffffa06d1758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
 [<ffffffffa06d1b30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
 [<ffffffffa0dd7a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
 [<ffffffffa0dddc0c>] mdt_object_lock0+0x28c/0xaf0 [mdt]
 [<ffffffffa0dd7a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
 [<ffffffffa06d1b30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
 [<ffffffffa0dde534>] mdt_object_lock+0x14/0x20 [mdt]
 [<ffffffffa0dde5a1>] mdt_object_find_lock+0x61/0x170 [mdt]
 [<ffffffffa0e0c80c>] mdt_reint_open+0x8cc/0x20e0 [mdt]
 [<ffffffffa043185e>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
 [<ffffffffa06fadcc>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
 [<ffffffffa05921b0>] ? lu_ucred+0x20/0x30 [obdclass]
 [<ffffffffa0dd7015>] ? mdt_ucred+0x15/0x20 [mdt]
 [<ffffffffa0df31cc>] ? mdt_root_squash+0x2c/0x410 [mdt]
 [<ffffffffa0df7981>] mdt_reint_rec+0x41/0xe0 [mdt]
 [<ffffffffa0ddcb03>] mdt_reint_internal+0x4c3/0x780 [mdt]
 [<ffffffffa0ddd090>] mdt_intent_reint+0x1f0/0x530 [mdt]
 [<ffffffffa0ddaf3e>] mdt_intent_policy+0x39e/0x720 [mdt]
 [<ffffffffa06b2831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
 [<ffffffffa06d91ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
 [<ffffffffa0ddb3c6>] mdt_enqueue+0x46/0xe0 [mdt]
 [<ffffffffa0de1ad7>] mdt_handle_common+0x647/0x16d0 [mdt]
 [<ffffffffa0e1b615>] mds_regular_handle+0x15/0x20 [mdt]
 [<ffffffffa070b3c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
 [<ffffffffa04155de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
 [<ffffffffa0426d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
 [<ffffffffa0702729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
 [<ffffffff81055813>] ? __wake_up+0x53/0x70
 [<ffffffffa070c75e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
 [<ffffffffa070bc90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffffa070bc90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffffa070bc90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
LustreError: dumping log to /tmp/lustre-log.1391239242.7851
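
(Side note: the lustre-log file above is a binary Lustre debug dump; the uploaded .txt.gz is presumably its decoded form. A minimal sketch of the conversion, assuming lctl is available wherever the dump was copied:)

# Decode the binary Lustre debug dump into readable text, then compress it for upload
lctl debug_file /tmp/lustre-log.1391239242.7851 /tmp/lustre-log.1391239242.7851.txt
gzip /tmp/lustre-log.1391239242.7851.txt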


 Comments   
Comment by Mahmoud Hanafi [ 01/Feb/14 ]

Also uploaded a backtrace (bt) of all tasks: bt.all.lu-4572.gz

Comment by Peter Jones [ 02/Feb/14 ]

Nathaniel

Could you please look into this one?

Thanks

Peter

Comment by Nathaniel Clark [ 04/Feb/14 ]

This looks like it may be the same issue as LU-2419 (the stack traces are very similar).

Comment by Mahmoud Hanafi [ 04/Feb/14 ]

Do you need any additional info to debug the issue?

Comment by Oleg Drokin [ 20/Feb/14 ]

What's your at_max?

Comment by Oleg Drokin [ 20/Feb/14 ]

Also: do you run 2.4 clients from that source as well? Can you please show me dmesg from e.g. 10.151.17.136 while all of this was happening?
Did you crashdump the node to get all of this info, or did you let it stay alive and take a live dump?

Comment by Mahmoud Hanafi [ 21/Feb/14 ]

I have attached the console logs from 10.151.17.136 (r445i5n0.conlog.gz).

We have both 2.4 and 2.1.5 clients.

at_max = 600.

The info was taken from crash dump.
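
(For reference, the at_max value above can be confirmed on the MDS with something like the following; the paths assume the standard ptlrpc module parameter layout and are not taken from this system.)

# Adaptive timeout ceiling (at_max) as loaded by the ptlrpc module
cat /sys/module/ptlrpc/parameters/at_max
# Base obd_timeout, for comparison
lctl get_param -n timeout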

Comment by Oleg Drokin [ 22/Feb/14 ]

Well, the timeout you posted in the ticket seems to be related to client 10.151.31.4@o2ib being unresponsive. Since the client is unresponsive, I assume it does not send any lock cancels either, so the locks it holds cannot be cancelled and others who need those locks are forced to wait.
I also see a number of network errors in the logs (and in the client logs too); are you having network problems there?

Further on I see a different kind of problem:

<4>Lustre: 65757:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1391239042/real 1391239042]  req@ffff8806c8f5d800 x1458619173294584/t0(0) o601->nbp7-MDT0000-lwp-MDT0000@0@lo:12/10 lens 336/336 e 3 to 1 dl 1391240399 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
<4>Lustre: 65757:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
<4>Lustre: 65757:0:(service.c:2031:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (1110:247s); client may timeout.  req@ffff8806d18a2000 x1458067638265204/t0(0) o101->446c8dd2-5b53-9b80-51b7-1573c85a021b@10.151.17.177@o2ib:0/0 lens 616/544 e 3 to 0 dl 1391240152 ref 1 fl Complete:/0/0 rc 301/301
<4>LNet: Service thread pid 7918 completed after 1357.11s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).

So here we are trying to send a quota RPC over the locally connected LWP and it is timing out? That does not sound right. Also, the timeout for this local connection is huge for whatever reason, and I do not see it being assigned any sane value either.

We can see that over time the timeout there grows even more.

The other worrying sign is this:

<4>Clocksource tsc unstable (delta = -68719428120 ns).  Enable clocksource failover by adding clocksource_failover kernel parameter.
<4>Lustre: 6577:0:(service.c:1889:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 76s  req@ffff880f7ded2850 x1447159886197772/t0(0) o400->449796f8-1d26-c76e-7df6-0123fb984781@10.151.57.111@o2ib:0/0 lens 192/0 e 0 to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1
<4>Lustre: mdt: This server is not able to keep up with request traffic (cpu-bound).

So 70+ seconds to handle the request_in code is way too excessive. When this happens, do you experience huge CPU utilization?
Also, I see the tsc unstable message. I think the kernel uses that as its internal timestamp counter, so if it deviates this hugely, could it be that all the internal timeouts are simply not working properly? Is this something you can address, just to be sure?
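
(If it helps, the clocksource can be checked and switched at runtime through sysfs, with a kernel command-line option making the change persistent; these are the standard paths, not specific to this system.)

# Show the available clocksources and the one currently in use
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
# Switch to hpet at runtime (does not survive a reboot)
echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource
# To make it persistent, boot with "clocksource=hpet" on the kernel command line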

Comment by Oleg Drokin [ 22/Feb/14 ]

One more question: how many OSTs do you have in your system?

Comment by Mahmoud Hanafi [ 25/Feb/14 ]

1. I am working through the logs to see if there were any IB fabric issues.

2. When the system was in that state, CPU utilization was minimal.

3. We have seen the clocksource message in our logs, but we haven't found it to cause any issues. I will switch it to hpet, though.

4. This filesystem has 84 OSTs.

Comment by Oleg Drokin [ 25/Feb/14 ]

I see. Let's see if switching to HPET helps any, I guess.
With no CPU usage when this happened, I do not see why things would be delayed by 70 seconds other than a broken clock.

Comment by Mahmoud Hanafi [ 29/Oct/14 ]

Please close; we have not seen the issue again.

Comment by Peter Jones [ 29/Oct/14 ]

ok thanks Mahmoud

Comment by Jinshan Xiong (Inactive) [ 06/Jan/16 ]

I reopened this ticket because I tend to think ldlm_expired_completion_wait() should return an error code, at least on the client side, so that the client enqueue process won't be stuck on the completion AST forever.

The same issue is seen on LU-7372.

Comment by Jinshan Xiong (Inactive) [ 06/Jan/16 ]

I decided to fix this issue in ticket LU-7372.
