[LU-10892] hang at 'echo clear > /proc/fs/lustre/ldlm/namespaces/.../lru_size' Created: 09/Apr/18 Updated: 16/Apr/18 Resolved: 16/Apr/18 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Olaf Faaland | Assignee: | Peter Jones |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl | ||
| Severity: | 3 |
| Description |
|
We are encountering frequent hangs when we execute:

echo clear > $server/lru_size

where $server is a path like /proc/fs/lustre/ldlm/namespaces/ls6-OST000a-osc-<UUID>/.

In the cases we've documented, the target is an OST. That OST shows as active in lfs check servers. We see no indication of problems on the OST (nothing in console logs, no flapping connections, etc.).

The stack trace of the hung process looks like this:

__ldlm_bl_to_thread+0x144
ldlm_bl_to_thread+0x473
ldlm_bl_to_thread_list+0x19
ldlm_cancel_lru+0x70
lprocfs_lru_size_seq_write+0x10c
proc_reg_write+0x7e
...

The client version is lustre-2.5.5-11chaos. The server version is lustre 2.8.2.

Code where the stuck thread is blocking:

(gdb) l *(__ldlm_bl_to_thread+0x144)
0x28874 is in __ldlm_bl_to_thread (/usr/src/debug/lustre-2.5.5/lustre/ldlm/ldlm_lockd.c:1997).
1992 wake_up(&blp->blp_waitq);
1993
1994 /* can not check blwi->blwi_flags as blwi could be already freed in
1995 LCF_ASYNC mode */
1996 if (!(cancel_flags & LCF_ASYNC))
1997 wait_for_completion(&blwi->blwi_comp);
1998
1999 RETURN(0);
2000 }
2001
(gdb) quit
We are working on retiring our Lustre 2.5 systems, so a workaround is sufficient. Our questions are: |
| Comments |
| Comment by Andreas Dilger [ 11/Apr/18 ] |
|
Olaf,
Killing the operation will, at worst, mean that some of the DLM locks may not be cancelled immediately, and will be expired by some other mechanism (age, number of locks, server load). |
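
One way to apply this as a workaround is to bound the write in time, so that a clear which hangs is abandoned rather than blocking the caller. The following is only a minimal sketch: the helper name clear_lru and the 30-second default are hypothetical, it assumes timeout(1) from GNU coreutils is installed, and it assumes the blocked write actually responds to the signal, which this ticket does not confirm.

```shell
#!/bin/sh
# Hypothetical helper (not part of Lustre): write "clear" to one lru_size
# file, giving up after a time limit.
#   $1 = path to an lru_size file
#   $2 = time limit in seconds (assumed default: 30)
clear_lru() {
    lru_file=$1
    limit=${2:-30}
    # timeout(1) sends SIGTERM to the child shell once the limit expires
    # and then exits with status 124.
    timeout "$limit" sh -c 'echo clear > "$1"' sh "$lru_file"
}

# Example use (paths follow the pattern quoted in the description):
#   for f in /proc/fs/lustre/ldlm/namespaces/*-osc-*/lru_size; do
#       clear_lru "$f" 30 || echo "gave up on $f" >&2
#   done
```

Any locks left uncancelled by an abandoned write would then be expired by the other mechanisms mentioned above.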
| Comment by Olaf Faaland [ 16/Apr/18 ] |
|
Thank you. |