[LU-4148] Clients experiencing massive watchdogs in mdtest rmdir Created: 25/Oct/13 Updated: 09/Oct/21 Resolved: 09/Oct/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Cliff White (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Hyperion/LLNL |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 11263 |
| Description |
|
Running mdtest, seeing a performance drop in rmdir.
INFO: task mdtest:7072 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mdtest D 0000000000000009 0 7072 7058 0x00000000
ffff880870771e08 0000000000000082 ffff880871506aa0 ffff880871506aa0
ffff880871506aa0 000000000000000b ffff880871506aa0 0000001081065d54
ffff880871507058 ffff880870771fd8 000000000000fb88 ffff880871507058
Call Trace:
[<ffffffff8118f541>] ? path_put+0x31/0x40
[<ffffffff8150f78e>] __mutex_lock_slowpath+0x13e/0x180
[<ffffffff8150f62b>] mutex_lock+0x2b/0x50
[<ffffffff81192367>] do_rmdir+0xb7/0x120
[<ffffffff8100c535>] ? math_state_restore+0x45/0x60
[<ffffffff81192426>] sys_rmdir+0x16/0x20
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
No errors on MDS |
| Comments |
| Comment by Oleg Drokin [ 28/Oct/13 ] |
|
I guess in these cases it would be very helpful to have a list of all processes with stack traces, to see where this happens: on the client, or whether a server thread got wedged (less likely, I guess, because there are no errors on the MDTs). A crashdump from such a client might be helpful too. I assume there are no other errors in the client logs? |
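The kind of per-process listing requested above could be gathered with a minimal sketch like the following. It assumes a Linux client whose kernel exposes /proc/<pid>/status and /proc/<pid>/stack (reading the latter normally requires root). The function name blocked_tasks and its proc_root parameter are hypothetical, introduced only for illustration. It scans top-level /proc entries, so it only sees thread-group leaders, not the individual threads under /proc/<pid>/task.

```python
import os

def blocked_tasks(proc_root="/proc"):
    """Yield (pid, name, kernel_stack) for tasks in uninterruptible sleep (state D).

    proc_root is a hypothetical parameter so the scan can be pointed at a
    copy of /proc; on a live client the default is what you want.
    """
    for entry in sorted(os.listdir(proc_root)):
        if not entry.isdigit():
            continue  # skip non-PID entries such as "self" or "sys"
        task_dir = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(task_dir, "status")) as f:
                fields = dict(line.split(":", 1) for line in f if ":" in line)
        except OSError:
            continue  # the task exited while we were scanning
        state = fields.get("State", "").strip()
        if not state.startswith("D"):
            continue  # only tasks blocked in uninterruptible sleep
        try:
            with open(os.path.join(task_dir, "stack")) as f:
                stack = f.read()
        except OSError:
            stack = "(stack unavailable; reading it usually needs root)"
        yield int(entry), fields.get("Name", "").strip(), stack
```

Run as root on a client while the watchdogs are firing, iterating over blocked_tasks() and printing each tuple would show which tasks are stuck and where. It complements a crashdump rather than replacing it.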
| Comment by Cliff White (Inactive) [ 28/Oct/13 ] |
|
At this point I am only seeing the watchdogs. I will recreate again, maybe with fewer clients. The test does complete eventually; there are no errors causing test aborts. |
| Comment by Oleg Drokin [ 28/Oct/13 ] |
|
So the MDT is just slow, apparently. |
| Comment by Peter Jones [ 28/Oct/13 ] |
|
Lai, Oleg was wondering if this might be a side effect of this patch: http://review.whamcloud.com/7257
What do you think? If not, do you have some other idea? Thanks, Peter |
| Comment by Lai Siyao [ 29/Oct/13 ] |
|
The backtrace shows the process is waiting on the parent directory's i_mutex in do_rmdir(). If this can be reproduced, seeing which process is holding this lock would help analyse the cause. http://review.whamcloud.com/7257 doesn't look to be a direct cause of this slowness, as long as no processes are constantly changing the parent directory's permissions. |
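The reading of the backtrace above can be checked mechanically. The following is a minimal sketch, not part of any Lustre tooling: the helper names parse_frames and waits_on_rmdir_mutex and the regular expression are assumptions made for illustration. It extracts the symbols from a hung-task trace, drops the frames the kernel marks with "?" as unreliable, and tests whether mutex_lock was called directly from do_rmdir. Applied to every task in a full dump, the tasks that do not match are the candidates for holding the lock.

```python
import re

# Matches frames such as "[<ffffffff8150f62b>] mutex_lock+0x2b/0x50".
# A "?" before the symbol marks a frame the kernel unwinder is unsure of.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

def parse_frames(trace_text):
    """Return (symbol, reliable) pairs in the order they appear in the trace."""
    return [(m.group(2), m.group(1) is None) for m in FRAME_RE.finditer(trace_text)]

def waits_on_rmdir_mutex(trace_text):
    """True if the reliable frames show mutex_lock called directly from do_rmdir."""
    reliable = [sym for sym, ok in parse_frames(trace_text) if ok]
    for inner, outer in zip(reliable, reliable[1:]):
        if inner == "mutex_lock" and outer == "do_rmdir":
            return True
    return False

# The call trace from the description of this ticket.
TRACE = """
[<ffffffff8118f541>] ? path_put+0x31/0x40
[<ffffffff8150f78e>] __mutex_lock_slowpath+0x13e/0x180
[<ffffffff8150f62b>] mutex_lock+0x2b/0x50
[<ffffffff81192367>] do_rmdir+0xb7/0x120
[<ffffffff8100c535>] ? math_state_restore+0x45/0x60
[<ffffffff81192426>] sys_rmdir+0x16/0x20
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
"""
```

For the trace in the description, waits_on_rmdir_mutex(TRACE) evaluates to True, which matches the analysis that the task is blocked taking the parent's i_mutex.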
| Comment by Cliff White (Inactive) [ 01/Nov/13 ] |
|
Dump of all registers and stacks from a hung client