[LU-14378] ldlm_resource_complain()) MGC172.19.3.1@o2ib600: namespace resource [0x68736c:0x2:0x0].0x0 (ffff972b9abea0c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Created: 28/Jan/21 Updated: 11/Jan/23 Resolved: 16/Jan/22 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Olaf Faaland | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl |
| Environment: | MDT |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Console log on node with MDT0005 (NID 172.19.3.1@o2ib600) reports:
LustreError: 18990:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1609295691, 300s ago), entering recovery for MGS@172.19.3.1@o2ib600 ns: MGC172.19.3.1@o2ib600 lock: ffff97619d5133c0/0xa8790faf05ee75fc lrc: 4/1,0 mode: --/CR res: [0x68736c:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x3481cb7270a3b1bc expref: -99 pid: 18990 timeout: 0 lvb_type: 0
LustreError: 18990:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
LustreError: 25121:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC172.19.3.1@o2ib600: namespace resource [0x68736c:0x2:0x0].0x0 (ffff972b9abea0c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 25121:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message |
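As a triage aid (not part of the original report), the two LustreError patterns above can be pulled out of a saved console log so the affected resource and refcount can be lined up with the lock-timeout events. A minimal Python sketch, assuming a plain-text `console.log` file; the path and regular expressions are illustrative assumptions:

```python
import re

# Match the "lock timed out" and "refcount nonzero" messages seen above.
TIMEOUT_RE = re.compile(
    r"ldlm_expired_completion_wait\(\).*lock timed out \(enqueued at (\d+), (\d+)s ago\)"
)
COMPLAIN_RE = re.compile(
    r"ldlm_resource_complain\(\)\)\s+(\S+): namespace resource (\[[^\]]+\]\.\S+) "
    r"\((\w+)\) refcount nonzero \((\d+)\)"
)

def scan(path="console.log"):  # path is an assumption, not from the ticket
    events = []
    with open(path, errors="replace") as f:
        for line in f:
            m = TIMEOUT_RE.search(line)
            if m:
                events.append(("lock timed out", int(m.group(1)), int(m.group(2))))
                continue
            m = COMPLAIN_RE.search(line)
            if m:
                # groups: namespace, resource id, memory address, refcount
                events.append(("refcount nonzero", *m.groups()))
    return events

for event in scan():
    print(event)
```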
| Comments |
| Comment by Olaf Faaland [ 28/Jan/21 ] |
|
For my records my local ticket is TOSS5036 |
| Comment by Olaf Faaland [ 28/Jan/21 ] |
|
On the day of the issue we see repeated communication failures with lsh-MDT000c, during the 2am, 7am, and 9am hours. For example:
2020-12-29 02:40:20 [4122150.157428] LNetError: 12879:0:(o2iblnd_cb.c:3359:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
2020-12-29 02:40:20 [4122150.168998] LNetError: 12879:0:(o2iblnd_cb.c:3434:kiblnd_check_conns()) Timed out RDMA with 172.19.3.13@o2ib600 (11): c: 6, oc: 0, rc: 8
2020-12-29 02:40:20 [4122150.183730] Lustre: 18519:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1609238415/real 1609238420] req@ffff9724f4650900 x1683090791024000/t0(0) o41->lsh-MDT000c-osp-MDT0005@172.19.3.13@o2ib600:24/4 lens 224/368 e 0 to 1 dl 1609238521 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1
2020-12-29 02:40:20 [4122150.189175] Lustre: lsh-MDT0005: Received new LWP connection from 172.19.3.13@o2ib600, removing former export from same NID
2020-12-29 02:40:20 [4122150.189193] Lustre: lsh-MDT0005: Connection restored to (at 172.19.3.13@o2ib600)
2020-12-29 02:40:20 [4122150.237218] Lustre: lsh-MDT000c-osp-MDT0005: Connection to lsh-MDT000c (at 172.19.3.13@o2ib600) was lost; in progress operations using this service will wait for recovery to complete and
2020-12-29 07:27:28 [4139376.886078] LNetError: 12879:0:(o2iblnd_cb.c:3359:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
2020-12-29 07:27:28 [4139376.897645] LNetError: 12879:0:(o2iblnd_cb.c:3434:kiblnd_check_conns()) Timed out RDMA with 172.19.3.13@o2ib600 (11): c: 7, oc: 0, rc: 8
2020-12-29 07:27:28 [4139376.912177] Lustre: 18530:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1609255643/real 1609255648] req@ffff9760e9b92400 x1683090802516928/t0(0) o41->lsh-MDT000c-osp-MDT0005@172.19.3.13@o2ib600:24/4 lens 224/368 e 0 to 1 dl 1609255749 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1
2020-12-29 07:27:28 [4139376.917039] Lustre: lsh-MDT0005: Received new LWP connection from 172.19.3.13@o2ib600, removing former export from same NID
2020-12-29 07:27:28 [4139376.917054] Lustre: lsh-MDT0005: Connection restored to (at 172.19.3.13@o2ib600)
2020-12-29 07:27:28 [4139376.965671] Lustre: lsh-MDT000c-osp-MDT0005: Connection to lsh-MDT000c (at 172.19.3.13@o2ib600) was lost; in progress operations using this service will wait for recovery to complete and then timed out requests
2020-12-29 09:49:57 [4147924.759391] Lustre: 18536:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1609264091/real 1609264091] req@ffff9724efdeb180 x1683090808250432/t0(0) o400->lsh-MDT000c-osp-MDT0005@172.19.3.14@o2ib600:24/4 lens 224/224 e 0 to 1 dl 1609264197 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
2020-12-29 09:50:22 [4147949.701554] Lustre: 18531:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1609264116/real 1609264116] req@ffff9724efdebf00 x1683090808267264/t0(0) o400->lsh-MDT000c-osp-MDT0005@172.19.3.14@o2ib600:24/4 lens 224/224 e 0 to 1 dl 1609264222 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 |
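To double-check the 2am/7am/9am clustering described above, one could bucket the LNet/ptlrpc error lines by hour of day. A small sketch, where `console.log` and the matched keywords are assumptions for illustration:

```python
from collections import Counter
import re

# Console lines in this ticket start with "YYYY-MM-DD HH:MM:SS [uptime]".
STAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}):\d{2}:\d{2}")
KEYWORDS = ("Timed out RDMA", "failed due to network error",
            "timed out for slow reply")

def failure_hours(path="console.log"):  # path is an assumption
    hours = Counter()
    with open(path, errors="replace") as f:
        for line in f:
            if any(k in line for k in KEYWORDS):
                m = STAMP_RE.match(line)
                if m:
                    hours[(m.group(1), m.group(2))] += 1
    return hours

for (day, hour), count in sorted(failure_hours().items()):
    print(f"{day} {hour}:00  {count} error lines")
```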
| Comment by Olaf Faaland [ 28/Jan/21 ] |
|
Possibly related to |
| Comment by Olaf Faaland [ 28/Jan/21 ] |
|
I have console logs for all 3 nodes. I do not have debug logs from the dates/times above because those entries were pushed out by newer ones. The problem seems to have persisted until I crashed the nodes today. I do have crash dumps for the nodes with the MGS, MDT0, and MDT5 in case there's useful information to be obtained there. |
| Comment by Peter Jones [ 28/Jan/21 ] |
|
Mike, could you please look into the reason behind the initial eviction here? Thanks, Peter |
| Comment by Etienne Aujames [ 03/Feb/21 ] |
|
Hello, we experienced the same type of issue recently on a Lustre 2.12.4 MGS: after re-mounting an OST, the MGS displayed the same type of message, first for an EX lock and then for a CR lock on the same MGS filesystem resource (LDLM_PLAIN, type=CONFIG_T_RECOVER). Then all the MGCs disconnected and reconnected to the MGS every 5 minutes. After 1h30 the HA detected a connection issue (not responding) with the MGS node and killed it (stonith). After some research we found that our issue could be associated with the following tickets: |
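The 5-minute disconnect/reconnect cadence described here could be confirmed from a client-side syslog by measuring the intervals between MGC "Connection restored" lines. A rough sketch (not from the ticket), assuming a `messages` file with `YYYY-MM-DD HH:MM:SS` timestamps:

```python
from datetime import datetime
import re

STAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

def reconnect_intervals(path="messages"):  # path/format are assumptions
    times = []
    with open(path, errors="replace") as f:
        for line in f:
            if "MGC" in line and "Connection restored" in line:
                m = STAMP_RE.match(line)
                if m:
                    times.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"))
    # Minutes between consecutive reconnects; ~5.0 would match the report.
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]

print([round(x, 1) for x in reconnect_intervals()])
```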
| Comment by Mikhail Pershin [ 04/Feb/21 ] |
|
Etienne, does your |
| Comment by Etienne Aujames [ 05/Feb/21 ] |
|
I think so, but I don't know for sure: there is no reproducer or test. We are determining the causes of the MGS failover from a crashdump. We think it might be that the mgs threads took too much CPU time (as in ).
In dmesg we have a message like this:
[3653076.838597] sched: RT throttling activated
In the vmcore, we see an accumulation of locks in the resource waiting list and nothing in the granted list (as said in a comment on the LU-10674 patch). |
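On a live node (rather than in a vmcore), the lock pile-up described here could be watched by sampling the LDLM namespace counters over time. A sketch only: it assumes `lctl get_param` is in PATH and that the MGS namespace name matches the `*MGS*` wildcard on this Lustre version.

```python
import subprocess
import time

def sample(pattern="ldlm.namespaces.*MGS*"):
    # lock_count and resource_count are LDLM namespace parameters; the
    # namespace wildcard is an assumption about how the MGS ns is named.
    out = subprocess.run(
        ["lctl", "get_param", f"{pattern}.lock_count", f"{pattern}.resource_count"],
        capture_output=True, text=True, check=False,
    ).stdout
    return dict(line.split("=", 1) for line in out.splitlines() if "=" in line)

for _ in range(10):  # ten one-minute samples
    print(time.strftime("%H:%M:%S"), sample())
    time.sleep(60)
```

A steadily growing lock_count with a flat resource_count on the MGC/MGS namespaces would be consistent with locks queuing in the waiting list without being granted.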