  1. Lustre
  2. LU-9372

OOM happens on OSS during Lustre recovery for more than 5000 clients


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Lustre 2.11.0, Lustre 2.10.6
    • None
    • Server running with b2_7_fe
      Clients are a mix of IEEL3 (RH7/SCS5), 2.5.3.90 (RH6/AE4), 2.7.3 (CentOS7)
    • 3

      I was on-site working with Bruno Travouillon (Atos) on one of their crash dumps.

      After joint analysis, it looks like a large share of memory is being consumed by "ptlrpc_request_buffer_desc" structures. Each one is about 17KB because of the embedded request, and they are allocated from 32KB slabs, which nearly doubles their memory footprint.

      Looking at the relevant source code, it appears that additional "ptlrpc_request_buffer_desc" structures can be allocated on demand by ptlrpc_check_rqbd_pool(), but they are never freed until ptlrpc_service_purge_all() runs at OST umount/stop.

      This problem has caused several OSS failovers to fail due to OOM.

            Assignee:
            bfaccini Bruno Faccini (Inactive)
            Reporter:
            bfaccini Bruno Faccini (Inactive)
            Votes: 0
            Watchers: 11
