[LU-15034] Lustre 2.12.7 client deadlock on quota check Created: 25/Sep/21 Updated: 05/Oct/21 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Jim Matthews | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7.9 with the included CentOS OFED. Lustre server and client version 2.12.7. |
| Epic/Theme: | clientdeadlock |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Summary: Lustre 2.12.7 clients occasionally deadlock in the quota check routine on file access (so far this has happened on ~9 nodes out of ~1100). The deadlocked processes will not terminate on their own. It appears the clients deadlock in one of two ways: either the client gets stuck in ptlrpc_queue_wait, or ptlrpc_queue_wait fails and the client then hangs in cl_sync_io_wait. We are doing quota enforcement, and the user processes that deadlock are not over quota. We recently upgraded the server from Lustre 2.8 to 2.12.7 and the client from 2.10.8 to 2.12.7. Below are the two types of deadlocked stacks.

Type 1: deadlock in ptlrpc_queue_wait. Deadlocked stack:

[<ffffffffc0bd5c60>] ptlrpc_set_wait+0x480/0x790 [ptlrpc]
Type 2: deadlock after ptlrpc_queue_wait fails. Message sent to syslog:

[4155019.167715] LustreError: 17861:0:(osc_quota.c:308:osc_quotactl()) ptlrpc_queue_wait failed, rc: -4

Followed by the deadlocked stack:

[<ffffffffc0a765b5>] cl_sync_io_wait+0x2b5/0x3d0 [obdclass]

Env:
OS: CentOS 7.9 (CentOS-packaged OFED on client)
Kernel: 3.10.0-1160.36.2.el7.x86_64
Lustre client: 2.12.7
Network: InfiniBand (combination of EDR, FDR and QDR) |
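For reference, rc: -4 is the kernel-style negative errno for EINTR: the wait for the quotactl RPC reply was interrupted by a signal rather than completing. A minimal userspace sketch (my own illustration for decoding the value, not Lustre code):

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	/* rc as reported by osc_quotactl() in the syslog message above */
	int rc = -4;

	/* Kernel code returns negative errno values; on Linux EINTR == 4,
	 * so rc == -EINTR: the RPC wait was interrupted by a signal. */
	printf("rc %d => %s%s\n", rc, rc == -EINTR ? "-EINTR: " : "",
	       strerror(-rc));
	return 0;
}
```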
| Comments |
| Comment by Jim Matthews [ 25/Sep/21 ] |
|
The editor above interpreted my '>' characters, it seems; the line shown crossed out should not be crossed out. |
| Comment by Jim Matthews [ 25/Sep/21 ] |
|
I should clarify my statement above: "The deadlocked processes will not terminate on their own." The processes can't be killed with kill -9; the only way to clear them is to reboot the node. |
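Being unkillable even by SIGKILL is consistent with the tasks sitting in uninterruptible sleep (state D) inside the waits shown in the stacks above. A small diagnostic sketch (my own, not from this ticket; the /proc parse is simplified and can be confused by a comm containing ')') to check a task's state:

```c
#include <stdio.h>

/* Print the scheduler state of a process from /proc/<pid>/stat.
 * State 'D' is uninterruptible sleep: such a task ignores all
 * signals, including SIGKILL, until the wait it is blocked in ends. */
int main(int argc, char **argv)
{
	char path[64], comm[64];
	char state;
	int pid;
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/proc/%s/stat", argv[1]);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	/* /proc/<pid>/stat begins: pid (comm) state ... */
	if (fscanf(f, "%d (%63[^)]) %c", &pid, comm, &state) == 3)
		printf("pid %d (%s) state %c%s\n", pid, comm, state,
		       state == 'D' ? "  <- uninterruptible sleep" : "");
	fclose(f);
	return 0;
}
```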
| Comment by Jim Matthews [ 01/Oct/21 ] |
|
Just wondering if anyone had a chance to look at this... Thanks! |