Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 2.12.7
- Labels: None
- Environment: CentOS 7.9 with included CentOS OFED. Lustre server and client version 2.12.7.
- Severity: 3
Description
Summary:
Lustre 2.12.7 clients occasionally deadlock in the quota check routine on file access; so far this has happened on ~9 nodes out of ~1100. The deadlocked processes will not terminate on their own. The clients appear to deadlock in one of two ways: either the client gets stuck in ptlrpc_queue_wait > ptlrpc_set_wait, or ptlrpc_queue_wait fails and the client then deadlocks in cl_lock_request > cl_sync_io_wait. A process that deadlocks in ptlrpc_queue_wait never eventually fails and moves on to the cl_lock_request path.
On the server side, when this occurs the OSS server(s) usually log a message about the client reconnecting, but no other errors. Sometimes these deadlocks affect only the stuck processes; other times they also block other users from accessing files on the Lustre mount. This may depend on how many processes deadlock: on one node where processes from 4 separate users deadlocked, the Lustre mount became completely unusable.
We are enforcing quotas, and the users whose processes deadlock are not over quota (a quick check is sketched below).
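For reference, here is a minimal sketch of how the quota status of affected users can be confirmed with the standard `lfs quota -u` interface; the mount point and user list are placeholders, not our actual values:

#!/usr/bin/env python3
# Minimal sketch: confirm the owners of deadlocked processes are under
# quota via `lfs quota`. MOUNT and USERS are placeholders.
import subprocess

MOUNT = "/mnt/lustre"    # placeholder Lustre mount point
USERS = ["someuser"]     # placeholder: owners of deadlocked processes

for user in USERS:
    # `lfs quota -u <user> <mount>` reports block/inode usage and limits.
    result = subprocess.run(["lfs", "quota", "-u", user, MOUNT],
                            capture_output=True, text=True, check=False)
    print(f"--- {user} ---")
    print(result.stdout.strip())
    # lfs marks an exceeded limit with '*' next to the usage figure.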
We recently upgraded the servers from Lustre 2.8 to 2.12.7 and the clients from Lustre 2.10.8 to 2.12.7.
Below are the two types of deadlocked stacks.
Type 1: deadlock in ptlrpc_queue_wait
[<ffffffffc0bd5c60>] ptlrpc_set_wait+0x480/0x790 [ptlrpc]
[<ffffffffc0bd5ff3>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
[<ffffffffc0bbac42>] ldlm_cli_enqueue+0x3e2/0x930 [ptlrpc]
[<ffffffffc0d47bc9>] osc_enqueue_base+0x219/0x690 [osc]
[<ffffffffc0d525c9>] osc_lock_enqueue+0x379/0x830 [osc]
[<ffffffffc0a52225>] cl_lock_enqueue+0x65/0x120 [obdclass]
[<ffffffffc0cf72e5>] lov_lock_enqueue+0x95/0x150 [lov]
[<ffffffffc0a52225>] cl_lock_enqueue+0x65/0x120 [obdclass]
[<ffffffffc0a527b7>] cl_lock_request+0x67/0x1f0 [obdclass]
[<ffffffffc0a566bb>] cl_io_lock+0x2bb/0x3d0 [obdclass]
[<ffffffffc0a569ea>] cl_io_loop+0xba/0x1c0 [obdclass]
[<ffffffffc0e250e0>] ll_file_io_generic+0x590/0xc90 [lustre]
[<ffffffffc0e265b3>] ll_file_aio_read+0x3a3/0x450 [lustre]
[<ffffffffc0e26760>] ll_file_read+0x100/0x1c0 [lustre]
[<ffffffffbc24e3af>] vfs_read+0x9f/0x170
[<ffffffffbc24f22f>] SyS_read+0x7f/0xf0
[<ffffffffbc795f92>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff
Type 2: deadlock after ptlrpc_queue_wait fails.
Message sent to syslog (rc -4 is -EINTR):
[4155019.167715] LustreError: 17861:0:(osc_quota.c:308:osc_quotactl()) ptlrpc_queue_wait failed, rc: -4
Followed by deadlocked stack:
[<ffffffffc0a765b5>] cl_sync_io_wait+0x2b5/0x3d0 [obdclass]
[<ffffffffc0a73906>] cl_lock_request+0x1b6/0x1f0 [obdclass]
[<ffffffffc0f8e9b1>] cl_glimpse_lock+0x311/0x370 [lustre]
[<ffffffffc0f8ed3d>] cl_glimpse_size0+0x20d/0x240 [lustre]
[<ffffffffc0f491ca>] ll_getattr+0x22a/0x5c0 [lustre]
[<ffffffff89853e99>] vfs_getattr+0x49/0x80
[<ffffffff89853f15>] vfs_fstat+0x45/0x80
[<ffffffff89854484>] SYSC_newfstat+0x24/0x60
[<ffffffff8985485e>] SyS_newfstat+0xe/0x10
[<ffffffff89d95f92>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff
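Stacks like the two above can be gathered with something along the lines of the following sketch (an assumed approach, not a transcript of what we ran): scan /proc for tasks in uninterruptible sleep (state 'D') and dump /proc/<pid>/stack, which this 3.10.0 kernel provides. Requires root.

#!/usr/bin/env python3
# Minimal sketch: dump kernel stacks of tasks stuck in uninterruptible
# sleep, the state the deadlocked processes sit in.
import os

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
        # comm is parenthesized and may contain spaces; the state letter
        # is the first field after the closing ')'. 'D' = uninterruptible.
        comm = stat[stat.index("(") + 1:stat.rindex(")")]
        state = stat.rsplit(")", 1)[1].split()[0]
        if state != "D":
            continue
        with open(f"/proc/{pid}/stack") as f:
            print(f"pid {pid} ({comm}):\n{f.read()}")
    except (OSError, ValueError, IndexError):
        continue  # task exited mid-scan or access was denied; skip it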
Env:
OS: CentOS 7.9 (CentOS-packaged OFED on client)
Kernel: 3.10.0-1160.36.2.el7.x86_64
Lustre server: 2.12.7
Lustre client: 2.12.7
Network: InfiniBand (combination of EDR, FDR, and QDR)