OOM caused by huge number of peers in case of INVALID_SERVICE_ID

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.10.0
    • None
    • 3

      Activity

        [LU-9094] OOM caused by huge number of peers in case of INVALID_SERVICE_ID

        Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25376/
        Subject: LU-9094 o2iblnd: kill timedout txs from ibp_tx_queue
        Project: fs/lustre-release
        Branch: master
        Current Patch Set:
        Commit: 824120da92fe8feb4b4308a136e33ec65fe3b635

        gerrit Gerrit Updater added a comment

        Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25375/
        Subject: LU-9094 lnet: remove ni from lnet_finalize
        Project: fs/lustre-release
        Branch: master
        Current Patch Set:
        Commit: dab78a9efd05e4f22fc83232bdadce347d3dafda

        gerrit Gerrit Updater added a comment

        Please note that "reconnect peer for REJ_INVALID_SERVICE_ID" without "kill timedout txs from ibp_tx_queue" causes lctl ping to hang when lnet is not loaded on the target node (lctl ping waits indefinitely).
        That is why I pushed all 3 patches together.

        scherementsev Sergey Cheremencev added a comment

        Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25378/
        Subject: LU-9094 o2iblnd: reconnect peer for REJ_INVALID_SERVICE_ID
        Project: fs/lustre-release
        Branch: master
        Current Patch Set:
        Commit: 603aa7a1df6ee6ce6fe0d501a8b2bd1bfdf43bb8

        gerrit Gerrit Updater added a comment

        Sergey Cheremencev (sergey.cheremencev@seagate.com) uploaded a new patch: https://review.whamcloud.com/25378
        Subject: LU-9094 o2iblnd: reconnect peer for REJ_INVALID_SERVICE_ID
        Project: fs/lustre-release
        Branch: master
        Current Patch Set: 1
        Commit: 496c48a3daa21d0423a387625821de03a57db443

        gerrit Gerrit Updater added a comment

        Sergey Cheremencev (sergey.cheremencev@seagate.com) uploaded a new patch: https://review.whamcloud.com/25376
        Subject: LU-9094 o2iblnd: kill timedout txs from ibp_tx_queue
        Project: fs/lustre-release
        Branch: master
        Current Patch Set: 1
        Commit: fdee92bf4859793dc3fe4911b491ad9d0b21533e

        gerrit Gerrit Updater added a comment

        Sergey Cheremencev (sergey.cheremencev@seagate.com) uploaded a new patch: https://review.whamcloud.com/25375
        Subject: LU-9094 lnet: remove ni from lnet_finalize
        Project: fs/lustre-release
        Branch: master
        Current Patch Set: 1
        Commit: a312fcd635e63a79c53dd072ece9a066c1baf342

        gerrit Gerrit Updater added a comment

        A peer shouldn't be killed every time INVALID_SERVICE_ID is returned. Doing so
        produces a huge number of peers for the same NID and may cause an OOM.

        The issue can be reproduced simply by running lctl ping against a node where lnet is not loaded (IB should be up).

        [root@pink03 tests]# cat ~/oom.sh
        while true; do
        	lctl ping 172.18.56.129@o2ib0
        done
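
A bounded variant of the loop above may be safer on a shared test node. The following is only a sketch, not part of the original report: the helper names `meminfo_kb` and `ping_until_low`, the `PING_CMD` and `MEMINFO` override variables, and the use of `MemFree` as the stop signal are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print one field (in kB) from a meminfo-style file.
meminfo_kb() {
    awk -v f="$1:" '$1 == f { print $2 }' "${2:-/proc/meminfo}"
}

# Hypothetical helper: ping a NID at most max_iter times and stop early
# once MemFree drops below min_free_kb.
# PING_CMD and MEMINFO are illustrative overrides, not lctl options.
ping_until_low() {
    nid=$1
    max_iter=$2
    min_free_kb=$3
    ping_cmd=${PING_CMD:-lctl ping}
    i=0
    free_kb=
    while [ "$i" -lt "$max_iter" ]; do
        i=$((i + 1))
        # The ping is expected to fail when lnet is not loaded on the target.
        $ping_cmd "$nid" >/dev/null 2>&1 || true
        free_kb=$(meminfo_kb MemFree "${MEMINFO:-/proc/meminfo}")
        if [ -n "$free_kb" ] && [ "$free_kb" -lt "$min_free_kb" ]; then
            echo "stopped after $i pings: MemFree=$free_kb kB"
            return 1
        fi
    done
    echo "completed $i pings: MemFree=$free_kb kB"
    return 0
}
```

For example, `ping_until_low 172.18.56.129@o2ib0 100000 524288` would stop once free memory falls below 512 MiB; both numbers are arbitrary and should be adapted to the node under test.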
        
        

        The OOM was frequently seen with mlnx-ofa-kernel-2.3, where mlx4_cq_free
        used an RCU mechanism. In later mlx4 versions, mlx4_cq_free was reworked
        to mitigate the issue and no longer uses RCU.
        In any case, we shouldn't create and remove tons of peers with the same NID, so as not to hurt performance and memory usage.

        The OOM should also be reproducible on all mlx5 setups, regardless of the mlnx-ofa-kernel version.
        I reproduced it on mlnx-ofa_kernel-3.4 with mlx5.

        I have prepared and tested a set of patches for it and will send them shortly.

        scherementsev Sergey Cheremencev added a comment

        Please change the Topic to something like "OOM caused by huge number of peers in case of INVALID_SERVICE_ID".

        scherementsev Sergey Cheremencev added a comment

        People

          doug Doug Oucharek (Inactive)
          scherementsev Sergey Cheremencev