
OOM on routers with a faulty link/interface with 1 node

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Environment: Production, Lustre 2.12.7 on router and computes, Lustre 2.12.9 + patches on servers
      peer_credits = 42
      InfiniBand (MOFED 5.4 on router and computes, MOFED 4.7 on servers)

    Description

      An LNet router crashes regularly with OOM on a compute partition at the CEA.
      Each time, the router complains about a compute node (RDMA timeouts) and then crashes with OOM.
      This issue seems linked to a defective compute rack or InfiniBand interface, but that should not cause the LNet router to crash.

      Environment:

      x32         infiniband    x12       infiniband    ~ x100
      computes    <--o2ib1-->   routers   <--o2ib0-->   servers
      
      peer_credits = 42
      discovery = 0
      health_sensitivity = 0
      transaction_timeout = 50
      retry_count = 0
      
      router RAM: 48 GB
      

      Kdump information:
      On the peer interface (lnet_peer_ni) to the faulty compute:
      tx credits: ~ -4500
      I read the message tx queue (lpni_txq) and sorted the messages by source NID: for 69 NIDs, I counted 42 messages each (the peer_credits value) blocked in the tx queue.

      I found the peer interface of a server NID that has 42 messages blocked on tx:
      peer buffer credits: ~ -17000
      On that peer's router queue (lpni_rtrq), the messages seem to reference a different kib_conn (kib_rx.rx_conn) every 42 messages.
      These connections are in the disconnected state, with ibc_list and ibc_sched_list unlinked (poison values inside), but their QP and CQ are not freed.
      A QP takes 512 pages and a CQ takes 256 pages: ~3 MB per connection.

      So this looks like a connection leak.
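As a quick sanity check on the numbers above, here is a back-of-the-envelope sketch (assuming the standard 4 KiB page size; the 48 GB figure is the router RAM from the environment description):

```python
PAGE_SIZE = 4096  # bytes, assuming standard 4 KiB pages

# Per-connection cost of a leaked, half-torn-down connection,
# using the page counts observed in the kdump:
qp_pages = 512
cq_pages = 256
bytes_per_conn = (qp_pages + cq_pages) * PAGE_SIZE
print(bytes_per_conn // 2**20)  # 3 -> ~3 MiB per leaked connection

# Number of leaked connections that would exhaust a 48 GB router:
router_ram = 48 * 2**30
print(router_ram // bytes_per_conn)  # 16384
```

So on the order of 16k leaked connections are enough to exhaust the router's memory; a sustained disconnect/reconnect loop against the faulty peer could plausibly accumulate that many over time.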

      Analysis
      Here is what I understood from the lnet/ko2iblnd sources:

      1. The compute node has an issue and does not answer (or answers only partially) the router.
      2. Messages from the servers to the compute node are queued, and the peer's tx credits go negative.
      3. When a server peer interface has more than 42 messages blocked on tx, its peer_buffer_credits goes negative (by default, peer_buffer_credits == peer_credits). In that case, new messages from the server are queued in lpni_rtrq.
      4. After that, the server can no longer send any message to the router because peer_buffer_credits < 0. All messages the server sends to the router time out (RDMA timeout).
      5. The server disconnects/reconnects to the router, resets its tx credits and resends its messages.
      6. On the router, the old connection is marked disconnected but never freed, because the old Rx messages are not cleaned up and still reference the old connection.
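To make steps 2–4 concrete, here is a toy model of the router-side credit accounting. This is illustrative Python, not the actual ko2iblnd code; the class and counters only loosely mimic lnet_peer_ni with lpni_txcredits/lpni_rtrcredits, using the peer_credits = 42 and 69 server NIDs from the kdump observations:

```python
PEER_CREDITS = 42          # ko2iblnd peer_credits (setup value above)
PEER_BUFFER_CREDITS = 42   # defaults to peer_credits
NUM_SERVERS = 69           # server NIDs counted in the kdump tx queue

class PeerNI:
    """Very simplified stand-in for lnet_peer_ni credit accounting."""
    def __init__(self):
        self.tx_credits = PEER_CREDITS          # ~ lpni_txcredits
        self.rtr_credits = PEER_BUFFER_CREDITS  # ~ lpni_rtrcredits
        self.txq = []   # ~ lpni_txq: sends waiting for a tx credit
        self.rtrq = []  # ~ lpni_rtrq: routed msgs waiting for a buffer credit

compute = PeerNI()  # the faulty peer: never completes anything
servers = [PeerNI() for _ in range(NUM_SERVERS)]

def router_recv(sender, msg):
    # A routed message first consumes a buffer credit of the *sender*.
    sender.rtr_credits -= 1
    if sender.rtr_credits < 0:
        sender.rtrq.append(msg)  # step 3: parked on the sender's lpni_rtrq
        return
    # Forwarding then consumes a tx credit of the *destination* peer.
    compute.tx_credits -= 1
    if compute.tx_credits < 0:
        compute.txq.append(msg)  # step 2: queued; the compute never drains it

# The compute never answers, so no credit is ever returned:
for s in servers:
    for i in range(100):
        router_recv(s, f"msg-{i}")

print(compute.tx_credits)      # deeply negative (42 - 69*42 = -2856)
print(len(compute.txq))        # 2856 messages blocked on lpni_txq
print(servers[0].rtr_credits)  # -58: this server stalls entirely (step 4)
```

In the real code the stalled servers then hit RDMA timeouts and reconnect (steps 4–5), and each abandoned connection keeps its QP/CQ pinned through the stale Rx messages (step 6), which is where the ~3 MB per connection adds up.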

      Can someone help me with this?
      I am not used to debugging LNet/ko2iblnd.

      Attachments

        Activity

          [LU-16530] OOM on routers with a faulty link/interface with 1 node

          eaujames Etienne Aujames added a comment:

          Here is some context for the logs:
          o2ibxx: storage network
          o2ibyy: compute network
          For vmcore-dmesg_hide_router272a_20221220_213634_1.txt: BB.BB.ID8@o2ibyy is the faulty client node.
          For vmcore-dmesg_hide_router272a_20221209_152308_1.txt: BB.BB.ID17@o2ibyy is the faulty client node.

          eaujames Etienne Aujames added a comment:

          Hi Cyril,

          Sorry for the delay.

          I have submitted 2 dmesg logs:
          vmcore-dmesg_hide_router272a_20221209_152308_1.txt
          vmcore-dmesg_hide_router272a_20221220_213634_1.txt

          Those are the logs from 2 crashes of the router in production.

          The situation was stabilized by replacing the CPU of the faulty client node.

          cbordage Cyril Bordage added a comment:

          Hello Etienne,

          Yes, a dmesg could be useful.

          Thank you.

          eaujames Etienne Aujames added a comment:

          Hi Cyril,

          I can't get you a debug_log (maybe some dmesg if you want).
          The setup is no longer available because the issue was reproduced on a router from the cluster (reproducing it requires a node with 2 IB interfaces on different networks).
          I tried to reproduce it with tcp <-> ib, but without success.

          cbordage Cyril Bordage added a comment:

          Hello Etienne,

          Do you have logs of your tests? Is your setup still available?

          Thank you.

          cbordage Cyril Bordage added a comment:

          Hello Etienne,

          I did take a look but then moved on to something else… Sorry about that. I plan to work on it again very soon.

          Thank you.

          eaujames Etienne Aujames added a comment:

          Hi Cyril,

          Have you had time to look into this issue?

          People

            Assignee: cbordage Cyril Bordage
            Reporter: eaujames Etienne Aujames
            Votes: 0
            Watchers: 5
