[LU-4148] Clients experiencing massive watchdogs in mdtest rmdir Created: 25/Oct/13  Updated: 09/Oct/21  Resolved: 09/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Cliff White (Inactive) Assignee: Lai Siyao
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Hyperion/LLNL


Attachments: Text File iwc101.dump.txt    
Severity: 3
Rank (Obsolete): 11263

 Description   

Running mdtest, seeing a performance drop in rmdir.
All clients appear to be hitting watchdogs, example:

INFO: task mdtest:7072 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mdtest        D 0000000000000009     0  7072   7058 0x00000000
 ffff880870771e08 0000000000000082 ffff880871506aa0 ffff880871506aa0
 ffff880871506aa0 000000000000000b ffff880871506aa0 0000001081065d54
 ffff880871507058 ffff880870771fd8 000000000000fb88 ffff880871507058
Call Trace:
 [<ffffffff8118f541>] ? path_put+0x31/0x40
 [<ffffffff8150f78e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8150f62b>] mutex_lock+0x2b/0x50
 [<ffffffff81192367>] do_rmdir+0xb7/0x120
 [<ffffffff8100c535>] ? math_state_restore+0x45/0x60
 [<ffffffff81192426>] sys_rmdir+0x16/0x20
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

No errors on MDS



 Comments   
Comment by Oleg Drokin [ 28/Oct/13 ]

I guess in these cases it would be very helpful to have a list of all processes with stack traces, to see where this happens: on the client, or whether a server thread got wedged (less likely, I guess, since there are no errors on the MDTs).

A crash dump from such a client might be helpful too. I assume there are no other errors in the client logs?
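
As a rough aid for collecting that information, the sketch below groups hung-task reports from a saved client console log by the call chain they are blocked in. It is only an illustration: the function names (parse_hung_tasks, summarize) are hypothetical, and the regular expressions assume the report layout shown in the description (an "INFO: task NAME:PID blocked for more than N seconds." line followed by "[<address>] symbol+0x..." frames, with "?" marking unreliable frames).

```python
import re
from collections import Counter

# Header line printed by the kernel hung-task detector (layout assumed from the ticket).
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
# One stack frame, e.g. " [<ffffffff81192367>] do_rmdir+0xb7/0x120".
# A "?" before the symbol marks a frame the unwinder considers unreliable.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?(?P<sym>[\w.]+)\+0x"
)


def parse_hung_tasks(log_text):
    """Return one dict per hung-task report: comm, pid, secs, frames.

    `frames` holds only the reliable frames, innermost first.
    """
    reports = []
    current = None
    for line in log_text.splitlines():
        header = HUNG_RE.search(line)
        if header:
            current = {
                "comm": header["comm"],
                "pid": int(header["pid"]),
                "secs": int(header["secs"]),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None and not frame["unreliable"]:
            current["frames"].append(frame["sym"])
    return reports


def summarize(reports, depth=3):
    """Count reports by their innermost `depth` reliable frames."""
    return Counter(" <- ".join(r["frames"][:depth]) for r in reports)
```

Fed the trace from the description, this should report a single mdtest task (pid 7072) whose innermost reliable frames are __mutex_lock_slowpath, mutex_lock and do_rmdir. Running it over the logs of all clients would show whether every watchdog is in the same place.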

Comment by Cliff White (Inactive) [ 28/Oct/13 ]

At this point I am only seeing the watchdogs. I will reproduce it again, maybe with fewer clients. The test does complete eventually; there are no errors causing test aborts.

Comment by Oleg Drokin [ 28/Oct/13 ]

So the MDT is just slow, apparently.
Is this happening on shared-directory delete?

Comment by Peter Jones [ 28/Oct/13 ]

Lai

Oleg was wondering if this might be a side effect of this patch - http://review.whamcloud.com/7257

What do you think? If not, do you have some other idea?

Thanks

Peter

Comment by Lai Siyao [ 29/Oct/13 ]

The backtrace shows the process is waiting on the parent directory's i_mutex in do_rmdir(). If this can be reproduced, seeing which process holds this lock would help analyse the cause.

http://review.whamcloud.com/7257 doesn't look to be a direct cause of this slowness, as long as no processes are constantly changing the parent directory's permissions.
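
Since the waiters are queued on the parent directory's i_mutex, removals under one shared parent serialize on the client, while removals under separate parents do not contend for the same lock. The sketch below is a minimal, hypothetical reproducer for comparing the two patterns. It is not mdtest; the function names, worker counts and directory names are assumptions for illustration, and on a local filesystem the difference may be too small to notice.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def _create_and_remove(parent, prefix, count):
    """Create `count` empty directories under `parent`, then rmdir them.

    Returns the wall-clock time spent in the rmdir phase only.
    """
    names = [os.path.join(parent, f"{prefix}.{i}") for i in range(count)]
    for name in names:
        os.mkdir(name)
    start = time.monotonic()
    for name in names:
        os.rmdir(name)
    return time.monotonic() - start


def rmdir_workload(root, workers=4, dirs_per_worker=100, shared=True):
    """Run concurrent rmdir workers under `root`; return per-worker rmdir times.

    shared=True:  every worker removes entries from the same parent (`root`).
    shared=False: every worker gets its own parent directory under `root`.
    """
    parents = []
    for w in range(workers):
        if shared:
            parents.append(root)
        else:
            private = os.path.join(root, f"private.{w}")
            os.mkdir(private)
            parents.append(private)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(_create_and_remove, parents[w], f"w{w}", dirs_per_worker)
            for w in range(workers)
        ]
        times = [f.result() for f in futures]

    if not shared:
        for private in parents:
            os.rmdir(private)
    return times


if __name__ == "__main__":
    # Assumption: replace the temporary directory with a path on the
    # filesystem under test to exercise a real client mount.
    for mode in (True, False):
        with tempfile.TemporaryDirectory() as root:
            times = rmdir_workload(root, shared=mode)
            print("shared" if mode else "private", max(times))
```

If the shared case is much slower than the private case on the affected clients, that would be consistent with contention on the parent lock rather than a wedged server thread.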

Comment by Cliff White (Inactive) [ 01/Nov/13 ]

Dump of all registers and stacks from a hung client.

Generated at Sat Feb 10 01:40:06 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.