LU-7054: ib_cm scaling issue when Lustre clients connect to OSS


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.5.3
    • Labels: None
    • Environment: OFED3.5, MOFED241, and MOFED3.5
    • Severity: 3
    • 9223372036854775807

    Description

      When a large number of Lustre clients (>3000) try to connect to an OSS/MDS at the same time, the ib_cm threads on the OSS/MDS are unable to service the incoming connections in time. Using ibdump we have seen server replies taking 30 seconds; by that time the clients have timed out the request and are retrying, which results in even more work for ib_cm.

      ib_cm is never able to catch up, and recovery usually requires a reboot of the server. Sometimes we have been able to recover by bringing the IB interface down (ifdown) to give ib_cm time to catch up, and then bringing it back up (ifup).

      Most of the threads will be in 'D' state. Here is an example stack trace:

      0xffff88062f3c0aa0     1655        2  0    1   D  0xffff88062f3c1140  ib_cm/1
      sp                 ip                 Function (args)
      0xffff880627237a90 0xffffffff81559b50 thread_return
      0xffff880627237b58 0xffffffff8155b30e __mutex_lock_slowpath+0x13e (0xffff88062f76d260)
      0xffff880627237bc8 0xffffffff8155b1ab mutex_lock+0x2b (0xffff88062f76d260)
      0xffff880627237be8 0xffffffffa043f23e [rdma_cm]cma_disable_callback+0x2e (0xffff88062f76d000, unknown)
      0xffff880627237c18 0xffffffffa044440f [rdma_cm]cma_req_handler+0x8f (0xffff880365eec200, 0xffff880494844698)
      0xffff880627237d28 0xffffffffa0393e37 [ib_cm]cm_process_work+0x27 (0xffff880365eec200, 0xffff880494844600)
      0xffff880627237d78 0xffffffffa0394aaa [ib_cm]cm_req_handler+0x6ba (0xffff880494844600)
      0xffff880627237de8 0xffffffffa0395735 [ib_cm]cm_work_handler+0x145 (0xffff880494844600)
      0xffff880627237e38 0xffffffff81093f30 worker_thread+0x170 (0xffffe8ffffc431c0)
      0xffff880627237ee8 0xffffffff8109a106 kthread+0x96 (0xffff880627ae5da8)
      0xffff880627237f48 0xffffffff8100c20a child_rip+0xa (unknown, unknown)
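
      Every ib_cm worker in the trace is blocked on the same mutex (0xffff88062f76d260), which appears to be the handler_mutex of the rdma_cm ID passed to cma_disable_callback(). As a rough, paraphrased sketch of the rdma_cm code of roughly this vintage (an assumption; the exact OFED/kernel source in use may differ), the entry point every incoming REQ has to pass through looks like:

      /* Paraphrased from drivers/infiniband/core/cma.c of this era (assumed):
       * every rdma_cm event callback, including each incoming connection REQ,
       * must first take the owning ID's handler_mutex. */
      static int cma_disable_callback(struct rdma_id_private *id_priv,
                                      enum rdma_cm_state state)
      {
              mutex_lock(&id_priv->handler_mutex);    /* the contended mutex above */
              if (id_priv->state != state) {
                      mutex_unlock(&id_priv->handler_mutex);
                      return -EINVAL;                 /* event dropped; client retries */
              }
              return 0;                               /* returns with the mutex held */
      }

      For a connection request, cma_req_handler() calls this on the single listening ID, so several thousand simultaneous connection attempts all serialize behind one mutex; once replies start taking longer than the client timeout, the retries only add to the backlog.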
      

      Using SystemTap I was able to get a trace of ib_cm; it shows that a great deal of time is spent in spin_lock_irq. See the attached file.
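
      The snippet below is only a hypothetical illustration of that pattern (the names cm_state_lock and handle_one_req are made up, not taken from ib_cm); it shows why heavy spin_lock_irq time translates into serialized request handling when many work items contend for one shared lock:

      #include <linux/spinlock.h>

      /* Hypothetical sketch, not ib_cm source: a single spinlock guarding shared
       * connection-manager state means N concurrent REQ work items are processed
       * one at a time, with interrupts disabled while the lock is held. */
      static DEFINE_SPINLOCK(cm_state_lock);          /* made-up name */

      static void handle_one_req(void *req)
      {
              spin_lock_irq(&cm_state_lock);          /* all workers funnel through here */
              /* ... look up / insert per-connection state ... */
              spin_unlock_irq(&cm_state_lock);
      }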

      Attachments

        1. load.pdf (85 kB, Mahmoud Hanafi)
        2. lustre-log.1445147654.68807.gz (0.2 kB, Mahmoud Hanafi)
        3. lustre-log.1445147717.68744.gz (0.2 kB, Mahmoud Hanafi)
        4. lustre-log.1445147754.68673.gz (0.2 kB, Mahmoud Hanafi)
        5. nbp8-os11.var.log.messages.oct.17.gz (27 kB, Mahmoud Hanafi)
        6. opensfs-HLDForSMPnodeaffinity-060415-1623-4.pdf (564 kB, Amir Shehata)
        7. read.pdf (92 kB, Mahmoud Hanafi)
        8. service104.+net+malloc.gz (0.2 kB, Bob Ciotti)
        9. service115.+net.gz (1.04 MB, Bob Ciotti)
        10. trace.ib_cm_1rack.out.gz (759 kB, Mahmoud Hanafi)
        11. write.pdf (91 kB, Mahmoud Hanafi)

        Issue Links

          Activity

            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 14

              Dates

                Created:
                Updated:
                Resolved: