[LU-6470] SWL tests appear to wedge on mutex, clients are evicted Created: 16/Apr/15 Updated: 09/Oct/21 Resolved: 09/Oct/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Cliff White (Inactive) | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Hyperion, 2.7.52 tag, ldiskfs format, 200 clients |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Running the SWL test on Hyperion, multiple clients time out and are eventually evicted due to lock timeouts.

    INFO: task ior:76875 blocked for more than 120 seconds.
    Not tainted 2.6.32-431.29.2.el6.x86_64 #1
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    ior D 0000000000000000 0 76875 76869 0x00000000
     ffff88083f5ddd18 0000000000000082 0000000000000000 ffff8808341ab1d8
     ffff88083f5ddc88 ffffffff81227e9f ffff88083f5ddd68 ffffffff81199045
     ffff880871a9b058 ffff88083f5ddfd8 000000000000fbc8 ffff880871a9b058
    Call Trace:
     [<ffffffff81227e9f>] ? security_inode_permission+0x1f/0x30
     [<ffffffff81199045>] ? __link_path_walk+0x145/0x1000
     [<ffffffff8152a5be>] __mutex_lock_slowpath+0x13e/0x180
     [<ffffffff8152a45b>] mutex_lock+0x2b/0x50
     [<ffffffff8119ba76>] do_filp_open+0x2d6/0xd20
     [<ffffffff811bd6b8>] ? do_statfs_native+0x98/0xb0
     [<ffffffff8128f83a>] ? strncpy_from_user+0x4a/0x90
     [<ffffffff811a8b82>] ? alloc_fd+0x92/0x160
     [<ffffffff81185be9>] do_sys_open+0x69/0x140
     [<ffffffff81185d00>] sys_open+0x20/0x30
     [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Server side:

    Apr 16 12:45:55 iws5 kernel: LustreError: 0:0:(ldlm_lockd.c:341:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 192.168.124.165@o2ib ns: filter-lustre-OST0024_UUID lock: ffff8801eac5e740/0x9c91b8d7046afd8 lrc: 3/0,0 mode: PR/PR res: [0x1d8c6:0x0:0x0].0 rrc: 13 type: EXT [0->18446744073709551615] (req 29796335616->29930553343) flags: 0x60000000010020 nid: 192.168.124.165@o2ib remote: 0x42e28ecfaef2b33a expref: 6 pid: 109819 timeout: 4391206162 lvb_type: 0
    Apr 16 13:45:23 iws3 kernel: LustreError: 0:0:(ldlm_lockd.c:341:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.124.165@o2ib ns: filter-lustre-OST002d_UUID lock: ffff8805ccaf1180/0xc2e6f2e60c6a3a1f lrc: 3/0,0 mode: PR/PR res: [0x28183:0x0:0x0].0 rrc: 10 type: EXT [0->18446744073709551615] (req 29527900160->29662117887) flags: 0x60000000010020 nid: 192.168.124.165@o2ib remote: 0x42e28ecfaef2c102 expref: 5 pid: 23687 timeout: 4394807701 lvb_type: 0

Maybe related to DDN-56?

Easy to reproduce if more data is required. |
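For context, a minimal sketch of the kind of commands that could be used on an evicted client to gather the data referenced in this ticket; the output paths are illustrative and not from the original report:

```sh
# Capture the kernel log containing the hung-task and eviction messages.
dmesg > /tmp/client-dmesg.txt

# Dump the Lustre kernel debug buffer before it wraps.
lctl dk /tmp/client-lustre-debug.log

# Show the import details of each OSC, to see which OSTs evicted the client.
lctl get_param osc.*.import
```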
| Comments |
| Comment by Cliff White (Inactive) [ 16/Apr/15 ] |
|
Dmesg from the client after eviction. The client was evicted by multiple OSTs. |
| Comment by Oleg Drokin [ 20/Apr/15 ] |
|
The client backtraces indicate that the MDS is stuck doing something. |
| Comment by Andreas Dilger [ 20/Apr/15 ] |
|
Cliff, is this running ZFS or ldiskfs on the MDT/OST? It wouldn't be surprising to see this under a heavy metadata load on a ZFS MDT, but it would be more surprising with ldiskfs. Could you please fill in the FSTYPE and client count in the Environment field? Is racer/tar/dbench also running on other clients during the IOR run? |
| Comment by Andreas Dilger [ 20/Apr/15 ] |
|
Also, getting the stack traces from the client would be useful, since even if a client thread is blocked waiting for the MDS (which is what all of the stack traces shown indicate), that shouldn't prevent lock callbacks from the OSTs from being processed. |
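One way to capture those client-side stack traces, sketched here for illustration (the sysrq approach is generic Linux rather than anything specific to this ticket, and the output paths are illustrative):

```sh
# On the client: dump all task stack traces into the kernel log.
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger

# Save the kernel log and the Lustre debug buffer from the same client.
dmesg > /tmp/client-stacks.txt
lctl dk /tmp/client-lustre-debug.log
```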
| Comment by Cliff White (Inactive) [ 20/Apr/15 ] |
|
The setup has already been torn down, but if I get a repeat I will collect client traces. The failure was on ldiskfs; we are currently testing with ZFS. |
| Comment by Cliff White (Inactive) [ 21/Apr/15 ] |
|
Ah, the stack trace and Lustre log dump from the client are already attached. The client is iwc115; see the two attachments. |
| Comment by Cliff White (Inactive) [ 21/Apr/15 ] |
|
I have re-checked the logs, and the only errors in that time period were from the OSS nodes; there were no MDS errors. |