LU-14378: ldlm_resource_complain()) MGC172.19.3.1@o2ib600: namespace resource [0x68736c:0x2:0x0].0x0 (ffff972b9abea0c0) refcount nonzero (1) after lock cleanup; forcing cleanup.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: None
    • Fix Version/s: None
    • Components: MDT
    • Environment:
      3.10.0-1160.4.1.1chaos.ch6.x86_64
      lustre-2.12.5_10.llnl-3.ch6.x86_64
    • Severity: 3

    Description

      The console log on the node hosting MDT0005 (NID 172.19.3.1@o2ib600) reports:

      LustreError: 18990:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1609295691, 300s ago), entering recovery for MGS@172.19.3.1@o2ib600 ns: MGC172.19.3.1@o2ib600 lock: ffff97619d5133c0/0xa8790faf05ee75fc lrc: 4/1,0 mode: --/CR res: [0x68736c:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x3481cb7270a3b1bc expref: -99 pid: 18990 timeout: 0 lvb_type: 0
      LustreError: 18990:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
      LustreError: 25121:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC172.19.3.1@o2ib600: namespace resource [0x68736c:0x2:0x0].0x0 (ffff972b9abea0c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
      LustreError: 25121:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message 
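      For reference, the state of the MGC's ldlm namespace can be sampled on a live node with lctl before things get this far; a minimal sketch, assuming the standard ldlm.namespaces.* parameters (the exact MGC namespace name depends on the MGS NID, hence the wildcard):

      # number of ldlm locks and resources currently held by the MGC namespace
      lctl get_param 'ldlm.namespaces.MGC*.lock_count' 'ldlm.namespaces.MGC*.resource_count'

      # client-side LRU settings for the same namespace
      lctl get_param 'ldlm.namespaces.MGC*.lru_size' 'ldlm.namespaces.MGC*.lru_max_age'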
      

    Activity


            eaujames Etienne Aujames added a comment -

            I think so, but I don't know for sure: there is no reproducer or test.

            We are determining the cause of the MGS failover from a crash dump. We think it might be that the MGS threads took too much CPU time (as in LU-12735).

            In dmesg we have a message like this:

            [3653076.838597] sched: RT throttling activated
            

            In the vmcore, we see an accumulation of locks on the resource's waiting list and nothing on the granted list (as noted in a comment on the LU-10674 patch).
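            For reference, those lists can be walked in the crash utility roughly as follows; the resource address is the one from the console message above, and the ldlm_resource.lr_refcount/lr_granted/lr_waiting and ldlm_lock.l_res_link/l_req_mode/l_granted_mode names are taken from lustre_dlm.h and may differ slightly between releases:

            # load the Lustre debuginfo so the ldlm structures resolve (ldlm is built into ptlrpc.ko)
            crash> mod -s ptlrpc

            # show the resource's refcount plus its granted and waiting list heads
            crash> struct ldlm_resource.lr_refcount,lr_granted,lr_waiting ffff972b9abea0c0

            # walk the waiting list; <lr_waiting addr> is the list head address printed above
            crash> list -H <lr_waiting addr> -o ldlm_lock.l_res_link -s ldlm_lock.l_req_mode,l_granted_mode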


            tappro Mikhail Pershin added a comment -

            Etienne, does your LU-13356 patch help with that issue? So far I'd say that is what is needed.


            eaujames Etienne Aujames added a comment -

            Hello,

            We experienced the same type of issue recently on a Lustre 2.12.4 MGS:

            After re-mounting an OST, the MGS displayed the same type of message, first for an EX lock and then for a CR lock on the same MGS filesystem resources (LDLM_PLAIN, type=CONFIG_T_RECOVER).

            Then all the MGCs disconnected/reconnected to the MGS every 5 minutes.

            After 1h30, the HA detected a connection issue (not responding) with the MGS node and killed it (stonith).

            After some research, we found that our issue could be related to the following tickets: LU-13356, LU-10674 or LU-12735.

            pjones Peter Jones added a comment -

            Mike

            Could you please look into the reason behind the initial eviction here?

            Thanks

            Peter

            ofaaland Olaf Faaland added a comment - edited

            I have console logs for all 3 nodes.

            I do not have debug logs from the dates/times above because those entries were pushed out by newer ones.

            The problem seems to have persisted until I crashed the nodes today. I do have those crash dumps for the nodes with the MGS, MDT0, and MDT5 in case there's useful information to be obtained there.

            ofaaland Olaf Faaland added a comment -

            Possibly related to LU-12735

            ofaaland Olaf Faaland added a comment -

            On the day of the issue, we see repeated communication failures with lsh-MDT000c during the 2am, 7am, and 9am hours:

            For example:

            2020-12-29 02:40:20 [4122150.157428] LNetError: 12879:0:(o2iblnd_cb.c:3359:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
            2020-12-29 02:40:20 [4122150.168998] LNetError: 12879:0:(o2iblnd_cb.c:3434:kiblnd_check_conns()) Timed out RDMA with 172.19.3.13@o2ib600 (11): c: 6, oc: 0, rc: 8
            2020-12-29 02:40:20 [4122150.183730] Lustre: 18519:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1609238415/real 1609238420]  req@ffff9724f4650900 x1683090791024000/t0(0) o41->lsh-MDT000c-osp-MDT0005@172.19.3.13@o2ib600:24/4 lens 224/368 e 0 to 1 dl 1609238521 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1
            2020-12-29 02:40:20 [4122150.189175] Lustre: lsh-MDT0005: Received new LWP connection from 172.19.3.13@o2ib600, removing former export from same NID
            2020-12-29 02:40:20 [4122150.189193] Lustre: lsh-MDT0005: Connection restored to  (at 172.19.3.13@o2ib600)
            2020-12-29 02:40:20 [4122150.237218] Lustre: lsh-MDT000c-osp-MDT0005: Connection to lsh-MDT000c (at 172.19.3.13@o2ib600) was lost; in progress operations using this service will wait for recovery to complete

            and

            2020-12-29 07:27:28 [4139376.886078] LNetError: 12879:0:(o2iblnd_cb.c:3359:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
            2020-12-29 07:27:28 [4139376.897645] LNetError: 12879:0:(o2iblnd_cb.c:3434:kiblnd_check_conns()) Timed out RDMA with 172.19.3.13@o2ib600 (11): c: 7, oc: 0, rc: 8
            2020-12-29 07:27:28 [4139376.912177] Lustre: 18530:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1609255643/real 1609255648]  req@ffff9760e9b92400 x1683090802516928/t0(0) o41->lsh-MDT000c-osp-MDT0005@172.19.3.13@o2ib600:24/4 lens 224/368 e 0 to 1 dl 1609255749 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1
            2020-12-29 07:27:28 [4139376.917039] Lustre: lsh-MDT0005: Received new LWP connection from 172.19.3.13@o2ib600, removing former export from same NID
            2020-12-29 07:27:28 [4139376.917054] Lustre: lsh-MDT0005: Connection restored to  (at 172.19.3.13@o2ib600)
            2020-12-29 07:27:28 [4139376.965671] Lustre: lsh-MDT000c-osp-MDT0005: Connection to lsh-MDT000c (at 172.19.3.13@o2ib600) was lost; in progress operations using this service will wait for recovery to complete 

            and then timed-out requests:

            2020-12-29 09:49:57 [4147924.759391] Lustre: 18536:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1609264091/real 1609264091]  req@ffff9724efdeb180 x1683090808250432/t0(0) o400->lsh-MDT000c-osp-MDT0005@172.19.3.14@o2ib600:24/4 lens 224/224 e 0 to 1 dl 1609264197 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            2020-12-29 09:50:22 [4147949.701554] Lustre: 18531:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1609264116/real 1609264116]  req@ffff9724efdebf00 x1683090808267264/t0(0) o400->lsh-MDT000c-osp-MDT0005@172.19.3.14@o2ib600:24/4 lens 224/224 e 0 to 1 dl 1609264222 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 
            ofaaland Olaf Faaland added a comment -

            For my records, my local ticket is TOSS5036.


            People

              Assignee: tappro Mikhail Pershin
              Reporter: ofaaland Olaf Faaland
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved: