[LU-6470] SWL tests appear to wedge on mutex, clients are evicted Created: 16/Apr/15  Updated: 09/Oct/21  Resolved: 09/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Cliff White (Inactive) Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Hyperion, 2.7.52 tag, ldiskfs format, 200 clients


Attachments: Text File iwc115.dmesg.txt     File iwc115.evict.log.txt.gz    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While running the SWL test on Hyperion, multiple clients time out and are eventually evicted due to lock callback timeouts.
Typical client stack:

INFO: task ior:76875 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.29.2.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ior           D 0000000000000000     0 76875  76869 0x00000000
 ffff88083f5ddd18 0000000000000082 0000000000000000 ffff8808341ab1d8
 ffff88083f5ddc88 ffffffff81227e9f ffff88083f5ddd68 ffffffff81199045
 ffff880871a9b058 ffff88083f5ddfd8 000000000000fbc8 ffff880871a9b058
Call Trace:
 [<ffffffff81227e9f>] ? security_inode_permission+0x1f/0x30
 [<ffffffff81199045>] ? __link_path_walk+0x145/0x1000
 [<ffffffff8152a5be>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8152a45b>] mutex_lock+0x2b/0x50
 [<ffffffff8119ba76>] do_filp_open+0x2d6/0xd20
 [<ffffffff811bd6b8>] ? do_statfs_native+0x98/0xb0
 [<ffffffff8128f83a>] ? strncpy_from_user+0x4a/0x90
 [<ffffffff811a8b82>] ? alloc_fd+0x92/0x160
 [<ffffffff81185be9>] do_sys_open+0x69/0x140
 [<ffffffff81185d00>] sys_open+0x20/0x30
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Server side:

Apr 16 12:45:55 iws5 kernel: LustreError: 0:0:(ldlm_lockd.c:341:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 192.168.124.165@o2ib  ns: filter-lustre-OST0024_UUID lock: ffff8801eac5e740/0x9c91b8d7046afd8 lrc: 3/0,0 mode: PR/PR res: [0x1d8c6:0x0:0x0].0 rrc: 13 type: EXT [0->18446744073709551615] (req 29796335616->29930553343) flags: 0x60000000010020 nid: 192.168.124.165@o2ib remote: 0x42e28ecfaef2b33a expref: 6 pid: 109819 timeout: 4391206162 lvb_type: 0
Apr 16 13:45:23 iws3 kernel: LustreError: 0:0:(ldlm_lockd.c:341:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.124.165@o2ib  ns: filter-lustre-OST002d_UUID lock: ffff8805ccaf1180/0xc2e6f2e60c6a3a1f lrc: 3/0,0 mode: PR/PR res: [0x28183:0x0:0x0].0 rrc: 10 type: EXT [0->18446744073709551615] (req 29527900160->29662117887) flags: 0x60000000010020 nid: 192.168.124.165@o2ib remote: 0x42e28ecfaef2c102 expref: 5 pid: 23687 timeout: 4394807701 lvb_type: 0
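For reference, the timeout settings behind the ~100s lock callback timer can be checked on the servers and the evicted client. This is a generic sketch of standard Lustre tunables, not values taken from the Hyperion setup:

# On an OSS (e.g. iws5/iws3): base RPC timeout and adaptive-timeout bounds
lctl get_param timeout
lctl get_param at_min at_max at_history

# On the evicted client (192.168.124.165@o2ib): OST import state after the eviction
lctl get_param osc.*.import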

Maybe related to DDN-56? This is easy to reproduce if more data is required.
I dumped the Lustre log from a client immediately after an eviction; the file is attached.
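For anyone reproducing this, the client-side dump can be taken roughly along these lines; the debug flags and filename below are illustrative (dlmtrace/rpctrace are the usual flags for lock-callback issues), not necessarily what was used here:

lctl set_param debug=+dlmtrace     # widen the debug mask before reproducing
lctl set_param debug=+rpctrace
# ... reproduce the eviction ...
lctl dk /tmp/iwc115.evict.log      # dump and clear the kernel debug buffer
gzip /tmp/iwc115.evict.log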



 Comments   
Comment by Cliff White (Inactive) [ 16/Apr/15 ]

Dmesg from the client after eviction. The client was evicted by multiple OSTs.

Comment by Oleg Drokin [ 20/Apr/15 ]

The client backtraces indicate that the MDS is stuck doing something.
So it would be great to get the MDS side of the story from the MDS logs; would that still be possible?

Comment by Andreas Dilger [ 20/Apr/15 ]

Cliff, is this running ZFS on the MDT/OST or ldiskfs? It wouldn't be surprising if there is a heavy metadata load on a ZFS MDT, but it would be more surprising on ldiskfs. Could you please fill in the FSTYPE and client count in the Environment field? Is racer/tar/dbench also running on other clients during the IOR run?

Comment by Andreas Dilger [ 20/Apr/15 ]

Also, getting the stack traces from the client would be useful, since even if a client thread is blocked waiting for the MDS (which is true of all the stack traces shown), it shouldn't prevent lock callbacks from the OSTs from being processed.
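For example, something along these lines on the client would capture all task stacks when this recurs; this is generic Linux sysrq usage on the RHEL6 client, and the output filename is just illustrative:

echo 1 > /proc/sys/kernel/sysrq     # enable sysrq if it is disabled
echo t > /proc/sysrq-trigger        # dump all task stacks to the kernel log
dmesg > /tmp/iwc115.stacks.txt      # save the result for the ticket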

Comment by Cliff White (Inactive) [ 20/Apr/15 ]

The setup has already been torn down, but if I get a repeat I will get client traces. The failure was on ldiskfs; we are currently testing with ZFS.

Comment by Cliff White (Inactive) [ 21/Apr/15 ]

Ah, the stack trace and Lustre log dump from the client are already attached; the client is iwc115, see the two attachments.

Comment by Cliff White (Inactive) [ 21/Apr/15 ]

I have re-checked the logs; the only errors in that time period were from the OSS nodes, with no MDS errors.
