[LU-14378] ldlm_resource_complain()) MGC172.19.3.1@o2ib600: namespace resource [0x68736c:0x2:0x0].0x0 (ffff972b9abea0c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Created: 28/Jan/21  Updated: 11/Jan/23  Resolved: 16/Jan/22

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Olaf Faaland Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

MDT
3.10.0-1160.4.1.1chaos.ch6.x86_64
lustre-2.12.5_10.llnl-3.ch6.x86_64


Issue Links:
Related
is related to LU-14379 ls hangs on client and console log me... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Console log on node with MDT0005 (NID 172.19.3.1@o2ib600) reports:

LustreError: 18990:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1609295691, 300s ago), entering recovery for MGS@172.19.3.1@o2ib600 ns: MGC172.19.3.1@o2ib600 lock: ffff97619d5133c0/0xa8790faf05ee75fc lrc: 4/1,0 mode: --/CR res: [0x68736c:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x3481cb7270a3b1bc expref: -99 pid: 18990 timeout: 0 lvb_type: 0
LustreError: 18990:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
LustreError: 25121:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC172.19.3.1@o2ib600: namespace resource [0x68736c:0x2:0x0].0x0 (ffff972b9abea0c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 25121:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message 
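
To make the complaint above concrete, here is a minimal, self-contained C sketch of the pattern it reports: namespace cleanup cancels every lock on a resource and then expects the resource's refcount to reach zero, warning and forcing the cleanup if it does not. This is only an illustration under that assumption; the struct and function names below are hypothetical and are not the actual ldlm_namespace_cleanup()/ldlm_resource_complain() code.

/* Simplified, self-contained illustration (NOT Lustre source) of the
 * "refcount nonzero after lock cleanup; forcing cleanup" pattern.
 * All names and fields below are hypothetical. */
#include <stdio.h>
#include <stdlib.h>

struct fake_lock {
	struct fake_lock *next;		/* next lock on the same resource */
};

struct fake_resource {
	struct fake_lock *locks;	/* all locks, granted or still waiting */
	int refcount;			/* references held by locks and other users */
};

/* Cancel every lock on the resource, dropping one reference per lock. */
static void cleanup_locks(struct fake_resource *res)
{
	while (res->locks != NULL) {
		struct fake_lock *lck = res->locks;

		res->locks = lck->next;
		res->refcount--;
		free(lck);
	}
}

/* After cancelling all locks the refcount should be zero; if it is not,
 * warn (this mirrors the console message above) and force the cleanup. */
static void cleanup_resource(struct fake_resource *res)
{
	cleanup_locks(res);
	if (res->refcount != 0)
		fprintf(stderr,
			"namespace resource (%p) refcount nonzero (%d) "
			"after lock cleanup; forcing cleanup.\n",
			(void *)res, res->refcount);
	/* a forced cleanup would release the resource here regardless */
}

int main(void)
{
	/* one lock that holds a reference, plus one extra reference that
	 * is never dropped: the leak the message is complaining about */
	struct fake_resource res = { .locks = NULL, .refcount = 1 };
	struct fake_lock *lck = calloc(1, sizeof(*lck));

	if (lck == NULL)
		return 1;
	res.locks = lck;
	res.refcount++;

	cleanup_resource(&res);
	return 0;
}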


 Comments   
Comment by Olaf Faaland [ 28/Jan/21 ]

For my records my local ticket is TOSS5036

Comment by Olaf Faaland [ 28/Jan/21 ]

On the day of the issue we see repeated communication failures with lsh-MDT000c, around 2am, 7am, and 9am:

For example

2020-12-29 02:40:20 [4122150.157428] LNetError: 12879:0:(o2iblnd_cb.c:3359:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
2020-12-29 02:40:20 [4122150.168998] LNetError: 12879:0:(o2iblnd_cb.c:3434:kiblnd_check_conns()) Timed out RDMA with 172.19.3.13@o2ib600 (11): c: 6, oc: 0, rc: 8
2020-12-29 02:40:20 [4122150.183730] Lustre: 18519:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1609238415/real 1609238420]  req@ffff9724f4650900 x1683090791024000/t0(0) o41->lsh-MDT000c-osp-MDT0005@172.19.3.13@o2ib600:24/4 lens 224/368 e 0 to 1 dl 1609238521 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1
2020-12-29 02:40:20 [4122150.189175] Lustre: lsh-MDT0005: Received new LWP connection from 172.19.3.13@o2ib600, removing former export from same NID
2020-12-29 02:40:20 [4122150.189193] Lustre: lsh-MDT0005: Connection restored to  (at 172.19.3.13@o2ib600)
2020-12-29 02:40:20 [4122150.237218] Lustre: lsh-MDT000c-osp-MDT0005: Connection to lsh-MDT000c (at 172.19.3.13@o2ib600) was lost; in progress operations using this service will wait for recovery to complete

and

2020-12-29 07:27:28 [4139376.886078] LNetError: 12879:0:(o2iblnd_cb.c:3359:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
2020-12-29 07:27:28 [4139376.897645] LNetError: 12879:0:(o2iblnd_cb.c:3434:kiblnd_check_conns()) Timed out RDMA with 172.19.3.13@o2ib600 (11): c: 7, oc: 0, rc: 8
2020-12-29 07:27:28 [4139376.912177] Lustre: 18530:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1609255643/real 1609255648]  req@ffff9760e9b92400 x1683090802516928/t0(0) o41->lsh-MDT000c-osp-MDT0005@172.19.3.13@o2ib600:24/4 lens 224/368 e 0 to 1 dl 1609255749 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1
2020-12-29 07:27:28 [4139376.917039] Lustre: lsh-MDT0005: Received new LWP connection from 172.19.3.13@o2ib600, removing former export from same NID
2020-12-29 07:27:28 [4139376.917054] Lustre: lsh-MDT0005: Connection restored to  (at 172.19.3.13@o2ib600)
2020-12-29 07:27:28 [4139376.965671] Lustre: lsh-MDT000c-osp-MDT0005: Connection to lsh-MDT000c (at 172.19.3.13@o2ib600) was lost; in progress operations using this service will wait for recovery to complete 

and then timed-out requests:

2020-12-29 09:49:57 [4147924.759391] Lustre: 18536:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1609264091/real 1609264091]  req@ffff9724efdeb180 x1683090808250432/t0(0) o400->lsh-MDT000c-osp-MDT0005@172.19.3.14@o2ib600:24/4 lens 224/224 e 0 to 1 dl 1609264197 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
2020-12-29 09:50:22 [4147949.701554] Lustre: 18531:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1609264116/real 1609264116]  req@ffff9724efdebf00 x1683090808267264/t0(0) o400->lsh-MDT000c-osp-MDT0005@172.19.3.14@o2ib600:24/4 lens 224/224 e 0 to 1 dl 1609264222 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 
Comment by Olaf Faaland [ 28/Jan/21 ]

Possibly related to LU-12735

Comment by Olaf Faaland [ 28/Jan/21 ]

I have console logs for all 3 nodes.

I do not have debug logs from the dates/times above because those entries were pushed out by newer ones.

The problem seems to have persisted until I crashed the nodes today. I do have crash dumps from the nodes with the MGS, MDT0, and MDT5 in case there's useful information to be obtained there.

Comment by Peter Jones [ 28/Jan/21 ]

Mike

Could you please look into the reason behind the initial eviction here?

Thanks

Peter

Comment by Etienne Aujames [ 03/Feb/21 ]

Hello,

We experienced the same type of issue recently on a Lustre 2.12.4 MGS:

After re-mounting an OST, the MGS displayed the same type of message, first for an EX lock and then for a CR lock on the same MGS filesystem resources (LDLM_PLAIN, type=CONFIG_T_RECOVER).

Then all the MGCs disconnected and reconnected to the MGS every 5 minutes.

After 1h30 the HA detected a connection issue (node not responding) with the MGS node and killed it (STONITH).

After some research we found that our issue could be associated with the following tickets: LU-13356, LU-10674, or LU-12735.

Comment by Mikhail Pershin [ 04/Feb/21 ]

Etienne, does your LU-13356 patch help with that issue? So far I'd say that is what's needed.

Comment by Etienne Aujames [ 05/Feb/21 ]

I think so, but I don't know for sure: there is no reproducer or test.

We are determining the cause of the MGS failover from a crash dump. We think it might be that the mgs threads took too much CPU time (as in LU-12735).

In dmesg we have a message like this:

[3653076.838597] sched: RT throttling activated

In the vmcore, we see an accumulation of locks in the resource's waiting list and nothing in the granted list (as mentioned in a comment on the LU-10674 patch).
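
To illustrate what an empty granted list plus an accumulating waiting list looks like structurally, here is a minimal C sketch under the simplified assumption that a resource keeps two lock lists; the struct and field names are hypothetical and do not match the real struct ldlm_resource.

/* Minimal sketch (NOT the real struct ldlm_resource): a resource keeps
 * separate granted and waiting lock lists.  The pathological state seen
 * in the vmcore is an empty granted list with a long waiting list. */
#include <stdio.h>

struct sketch_lock {
	struct sketch_lock *next;
};

struct sketch_resource {
	struct sketch_lock *granted;	/* locks currently held */
	struct sketch_lock *waiting;	/* locks queued behind them */
};

static int list_len(const struct sketch_lock *l)
{
	int n = 0;

	for (; l != NULL; l = l->next)
		n++;
	return n;
}

/* Flag the stuck state: waiting locks pile up while the granted list
 * stays empty, so no enqueue ever completes and clients keep timing
 * out and reconnecting. */
static void check_resource(const struct sketch_resource *res)
{
	int g = list_len(res->granted);
	int w = list_len(res->waiting);

	if (g == 0 && w > 0)
		printf("stuck: %d waiting lock(s), granted list empty\n", w);
	else
		printf("ok: %d granted, %d waiting\n", g, w);
}

int main(void)
{
	struct sketch_lock w2 = { NULL };
	struct sketch_lock w1 = { &w2 };
	struct sketch_resource res = { .granted = NULL, .waiting = &w1 };

	check_resource(&res);
	return 0;
}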
