[LU-15034] Lustre 2.12.7 client deadlock on quota check Created: 25/Sep/21  Updated: 05/Oct/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.7
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Jim Matthews Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

CentOS 7.9 with the included CentOS OFED. Lustre server and client version 2.12.7.


Issue Links:
Related
Epic/Theme: clientdeadlock
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Summary:

Lustre 2.12.7 clients occasionally (so far on ~9 nodes out of ~1100) deadlock in the quota check routine on file access.  The deadlocked processes will not terminate on their own.  The clients appear to deadlock in one of two ways: either the client gets stuck in ptlrpc_queue_wait > ptlrpc_set_wait, or ptlrpc_queue_wait fails and the client then deadlocks in cl_lock_request > cl_sync_io_wait.  A process that deadlocks in ptlrpc_queue_wait does not eventually fail and move on to cl_lock_request.  On the server side, when this occurs the OSS server(s) will usually report a message about the client reconnecting, but no other errors.  Sometimes these deadlocks only seem to affect the stuck processes; other times they also block other users from accessing files on the Lustre mount (it may depend on how many processes deadlock; we had one node where processes from 4 separate users deadlocked, and on that node the Lustre mount was completely hosed).

We are doing quota enforcement, and the users whose processes deadlock are not over quota.
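(For anyone looking at this, the affected users can be confirmed to be under quota with the standard lfs query on the client, something along these lines; the username and mount point below are placeholders for our actual values:)

# report block/inode usage and limits for a user on the Lustre mount
lfs quota -u <username> /mnt/lustre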

We recently upgraded the servers from Lustre 2.8 to 2.12.7 and the clients from 2.10.8 to 2.12.7.

Below are the two types of deadlocked stacks.

Type 1: deadlock in ptlrpc_queue_wait

[<ffffffffc0bd5c60>] ptlrpc_set_wait+0x480/0x790 [ptlrpc]
[<ffffffffc0bd5ff3>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
[<ffffffffc0bbac42>] ldlm_cli_enqueue+0x3e2/0x930 [ptlrpc]
[<ffffffffc0d47bc9>] osc_enqueue_base+0x219/0x690 [osc]
[<ffffffffc0d525c9>] osc_lock_enqueue+0x379/0x830 [osc]
[<ffffffffc0a52225>] cl_lock_enqueue+0x65/0x120 [obdclass]
[<ffffffffc0cf72e5>] lov_lock_enqueue+0x95/0x150 [lov]
[<ffffffffc0a52225>] cl_lock_enqueue+0x65/0x120 [obdclass]
[<ffffffffc0a527b7>] cl_lock_request+0x67/0x1f0 [obdclass]
[<ffffffffc0a566bb>] cl_io_lock+0x2bb/0x3d0 [obdclass]
[<ffffffffc0a569ea>] cl_io_loop+0xba/0x1c0 [obdclass]
[<ffffffffc0e250e0>] ll_file_io_generic+0x590/0xc90 [lustre]
[<ffffffffc0e265b3>] ll_file_aio_read+0x3a3/0x450 [lustre]
[<ffffffffc0e26760>] ll_file_read+0x100/0x1c0 [lustre]
[<ffffffffbc24e3af>] vfs_read+0x9f/0x170
[<ffffffffbc24f22f>] SyS_read+0x7f/0xf0
[<ffffffffbc795f92>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff
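(For completeness: a kernel stack like the one above can be re-captured on a live node straight from procfs once the stuck PID is known; <pid> below is a placeholder.)

# dump the current kernel stack of the hung task
cat /proc/<pid>/stack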

 

Type 2: deadlock after ptlrpc_queue_wait fails.

Message sent to syslog:

[4155019.167715] LustreError: 17861:0:(osc_quota.c:308:osc_quotactl()) ptlrpc_queue_wait failed, rc: -4

Followed by the deadlocked stack:

[<ffffffffc0a765b5>] cl_sync_io_wait+0x2b5/0x3d0 [obdclass]
[<ffffffffc0a73906>] cl_lock_request+0x1b6/0x1f0 [obdclass]
[<ffffffffc0f8e9b1>] cl_glimpse_lock+0x311/0x370 [lustre]
[<ffffffffc0f8ed3d>] cl_glimpse_size0+0x20d/0x240 [lustre]
[<ffffffffc0f491ca>] ll_getattr+0x22a/0x5c0 [lustre]
[<ffffffff89853e99>] vfs_getattr+0x49/0x80
[<ffffffff89853f15>] vfs_fstat+0x45/0x80
[<ffffffff89854484>] SYSC_newfstat+0x24/0x60
[<ffffffff8985485e>] SyS_newfstat+0xe/0x10
[<ffffffff89d95f92>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff
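A note on the error code above: rc: -4 maps to -EINTR (Interrupted system call) in the standard Linux errno table, which suggests the quotactl RPC wait was interrupted rather than rejected by the server. The mapping can be confirmed on any node (assuming a python binary is on the PATH):

# errno 4 is EINTR on Linux
python -c "import errno, os; print(errno.errorcode[4], os.strerror(4))"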

Env:

OS: CentOS 7.9 (CentOS packaged OFED on client)

Kernel: 3.10.0-1160.36.2.el7.x86_64
Lustre server: 2.12.7

Lustre client: 2.12.7

Network: InfiniBand (combination of EDR, FDR, and QDR)



 Comments   
Comment by Jim Matthews [ 25/Sep/21 ]

The editor above interpreted my >'s, it seems; the line shown crossed out should not be crossed out.

Comment by Jim Matthews [ 25/Sep/21 ]

I should clarify my statement above: "The deadlocked processes will not terminate on their own."  The processes can't be killed even with kill -9; the only way to clear them is to reboot the node.
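For what it's worth, the stuck processes are presumably sitting in uninterruptible (D) sleep inside the kernel waits shown above, which would explain why kill -9 has no effect. Generic commands like the following (run as root; sysrq must be enabled for the second one) can be used to list them and to get all blocked-task stacks into dmesg before rebooting:

# list tasks in uninterruptible sleep and what they are waiting on
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'

# dump kernel stacks of all blocked (D-state) tasks to the kernel log
echo w > /proc/sysrq-trigger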

Comment by Jim Matthews [ 01/Oct/21 ]

Just wondering if anyone had a chance to look at this...  Thanks!
