
LU-7054: ib_cm scaling issue when Lustre clients connect to OSS

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: Lustre 2.8.0
    • Affects Version: Lustre 2.5.3
    • Labels: None
    • Environment: OFED3.5, MOFED241, and MOFED3.5
    • Severity: 3

    Description

      When a large number of Lustre clients (>3000) try connecting to an OSS/MDS at the same time, the ib_cm threads on the OSS/MDS are unable to service the incoming connections in time. Using ibdump we have seen server replies taking 30 seconds; by that time the clients have timed out the request and are retrying, which results in even more work for ib_cm.

      ib_cm is never able to catch up and usually requires a reboot of the server. Sometimes we have been able to recover by bringing the IB interface down (ifdown) to give ib_cm time to catch up, and then bringing it back up (ifup).

      Most of the threads will be in 'D' state; here is an example stack trace:

      0xffff88062f3c0aa0     1655        2  0    1   D  0xffff88062f3c1140  ib_cm/1
      sp                ip                Function (args)
      0xffff880627237a90 0xffffffff81559b50 thread_return
      0xffff880627237b58 0xffffffff8155b30e __mutex_lock_slowpath+0x13e (0xffff88062f76d260)
      0xffff880627237bc8 0xffffffff8155b1ab mutex_lock+0x2b (0xffff88062f76d260)
      0xffff880627237be8 0xffffffffa043f23e [rdma_cm]cma_disable_callback+0x2e (0xffff88062f76d000, unknown)
      0xffff880627237c18 0xffffffffa044440f [rdma_cm]cma_req_handler+0x8f (0xffff880365eec200, 0xffff880494844698)
      0xffff880627237d28 0xffffffffa0393e37 [ib_cm]cm_process_work+0x27 (0xffff880365eec200, 0xffff880494844600)
      0xffff880627237d78 0xffffffffa0394aaa [ib_cm]cm_req_handler+0x6ba (0xffff880494844600)
      0xffff880627237de8 0xffffffffa0395735 [ib_cm]cm_work_handler+0x145 (0xffff880494844600)
      0xffff880627237e38 0xffffffff81093f30 worker_thread+0x170 (0xffffe8ffffc431c0)
      0xffff880627237ee8 0xffffffff8109a106 kthread+0x96 (0xffff880627ae5da8)
      0xffff880627237f48 0xffffffff8100c20a child_rip+0xa (unknown, unknown)
      

      Using SystemTap I was able to get a trace of ib_cm; it shows that a great deal of time is spent in spin_lock_irq. See the attached file.
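
      The retry amplification described above is what the connection back-off work discussed in the comments below tries to damp. As a rough, generic sketch only (invented names; this is not the actual o2iblnd change), randomized exponential back-off on reconnect attempts keeps thousands of clients from retrying in lock-step:

      #include <linux/delay.h>
      #include <linux/random.h>
      #include <linux/types.h>

      static bool try_connect(void);  /* hypothetical: a single connection attempt */

      static void reconnect_with_backoff(void)
      {
              unsigned int window_ms = 100;              /* first retry window */
              const unsigned int max_window_ms = 30000;  /* cap the window at 30s */

              while (!try_connect()) {
                      /* Sleep a random amount within the current window so that
                       * many clients do not retry in lock-step, then widen the
                       * window for the next attempt. */
                      msleep((prandom_u32() % window_ms) + 1);
                      if (window_ms < max_window_ms)
                              window_ms *= 2;
              }
      }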

      Attachments

        1. load.pdf
          85 kB
        2. lustre-log.1445147654.68807.gz
          0.2 kB
        3. lustre-log.1445147717.68744.gz
          0.2 kB
        4. lustre-log.1445147754.68673.gz
          0.2 kB
        5. nbp8-os11.var.log.messages.oct.17.gz
          27 kB
        6. opensfs-HLDForSMPnodeaffinity-060415-1623-4.pdf
          564 kB
        7. read.pdf
          92 kB
        8. service104.+net+malloc.gz
          0.2 kB
        9. service115.+net.gz
          1.04 MB
        10. trace.ib_cm_1rack.out.gz
          759 kB
        11. write.pdf
          91 kB

        Issue Links

          Activity

            [LU-7054] ib_cm scaling issue when Lustre clients connect to OSS

            Patch has landed to master for 2.8.

            jgmitter Joseph Gmitter (Inactive) added a comment

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16470/
            Subject: LU-7054 o2iblnd: less intense allocating retry
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d96879f34ce229695566a3e5de1f5160f4c9ef02

            gerrit Gerrit Updater added a comment

            I updated LU-7224 with some tuning recommendations for this situation.

            We were looking at the code to see if FMR saves us on memory allocation. It does not; it might actually use a bit more memory. With Mellanox cards, we have not noticed much improvement in performance except in high-latency cases (like WAN conditions). It benefits TrueScale IB cards the most. Note: Mellanox has discontinued support for FMR in their latest mlx5-based cards.

            So, I'm not optimistic that FMR will help in this case. See LU-7224 for a summary of tuning suggestions.

            doug Doug Oucharek (Inactive) added a comment

            I think you are correct; the two issues are the same.

            The changes we have made (raising NTX, back-off, and interface credits) have allowed the OSS to recover and not get stuck with ib_cm in 'D' state. I think it happens less frequently now. We are going to test increasing NTX and interface credits by 2X. How much will enabling FMR help us?

            mhanafi Mahmoud Hanafi added a comment

            With regards to the two issues: could not the same cause be behind both? We are speculating that memory pool allocation is the cause of issue 2 (many reconnections). Could not the original problem be triggered by the same thing? With so many clients aggressively sending to the OSS, could it not be freezing on memory allocation for several seconds causing evictions?

            This is why I have been focusing so much on addressing the pool allocation problem. If the two problems are truly independent, then I suggest we start a different Jira ticket for problem 1 and use this ticket only for problem 2. That way the evidence behind each problem does not get mingled.

            With the changes made (raising NTX, back-off, and interface credits), has problem 2 happened again, or has its frequency just been reduced? If the frequency is only reduced, can NTX be increased even more?

            doug Doug Oucharek (Inactive) added a comment

            Just uploaded a new debug dump (ftp:/uploads/LU7054/sec.20151031.11.31.38.gz). The OSS experienced the issue; the first log entry, at 1446316297.011251, is a bulk timeout.

            00010000:00020000:11.0:1446316297.011251:0:20421:0:(ldlm_lib.c:2715:target_bulk_io()) @@@ timeout on bulk PUT after 150+0s  req@ffff8809c4ae0400 x1516094020471028/t0(0) o3->236aaab5-0544-9964-0bd9-544f78fd896b@10.151.11.64@o2ib:0/0 lens 488/432 e 2 to 0 dl 1446316318 ref 1 fl Interpret:/0/0 rc 0/0
            00010000:00020000:6.0:1446316297.011279:0:15047:0:(ldlm_lib.c:2715:target_bulk_io()) @@@ timeout on bulk PUT after 150+0s  req@ffff880cf9000c00 x1515275583687516/t0(0) o3->f89cd266-d025-1f42-d13b-d6a51a9b5e01@10.151.11.151@o2ib:0/0 lens 488/432 e 2 to 0 dl 1446316318 ref 1 fl Interpret:/0/0 rc 0/0
            
            mhanafi Mahmoud Hanafi added a comment

            I guess there are 2 major issues here:
            1. Why does the server get into the state that requires all clients to reconnect at the same time?
            2. The server handling and recovering from the connection storm.

            The connection back-off patch and raising the ntx and interface credits have helped with item 2, but I think we need more tuning here: enable FMR? Higher peer_credits? Higher ntx and interface credits? The pool allocation (LIBCFS_CPT_ALLOC) happens during this part.

            But the root cause is item 1: why do all the clients need to reconnect at the same time? We have ruled out the IB fabric because it isn't logging any errors during this event.

            From the logs I uploaded yesterday (sec.20151029.21.03.25.gz) the connections start at 20:59 (1446177579.541489). Can you see anything in the logs to indicate why?

            The load and utilization on the server were light before the connections started.
            I uploaded load.pdf, write.pdf, and read.pdf.

            mhanafi Mahmoud Hanafi added a comment

            Mahmoud: On Oct. 14 you indicated the time was spent in a for loop, and the code you copied includes the macro LIBCFS_CPT_ALLOC(). Was the time spent in this macro call?

            If the allocation code is freezing, there could be one of two reasons: memory has been exhausted, or memory has become so fragmented that we cannot get the number of contiguous pages we are requesting. I have seen our allocation macros freeze the calling kernel thread until such time as the memory can be allocated. This could be a very long time and absolutely needs to be avoided. This could explain the 12-13 second delays.

            So, it is important to figure out whether the freezing is in the allocation call.
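
            If the stall really is inside a large, physically contiguous allocation, one generic kernel-side mitigation (a sketch only, not the Lustre code and not the landed patch) is to make the first attempt non-blocking and fall back to virtually contiguous memory:

            #include <linux/slab.h>
            #include <linux/vmalloc.h>

            /* Sketch: try for physically contiguous memory without letting the
             * caller stall in reclaim/compaction; fall back to vmalloc, which
             * does not need high-order pages.  Free the result with kvfree(). */
            static void *alloc_big_buffer(size_t size)
            {
                    void *buf;

                    buf = kmalloc(size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
                    if (buf)
                            return buf;

                    return vmalloc(size);
            }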

            Looking at the pool code, I see something that worries me. When a new pool is allocated (not the first one), it is given a deadline of 300 seconds. Once the 300 seconds have expired, whenever buffers are returned to that pool we check whether any of its buffers are still in use; if none are, we free the pool and all associated memory. This potential "yoyo" effect could be causing a problem. It might be interesting to try setting the deadline value to something enormous so we never free allocated pools and just leave them around to be reused.
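
            For reference, a minimal sketch of the deadline-driven reclaim behaviour described above (names invented for illustration; this is not the kiblnd pool code):

            #include <linux/jiffies.h>
            #include <linux/list.h>
            #include <linux/slab.h>

            #define DEMO_POOL_DEADLINE_SECS 300  /* make this huge to keep pools around */

            struct demo_pool {
                    struct list_head  dp_list;      /* chained on the pool set */
                    unsigned long     dp_deadline;  /* jiffies after which it may be freed */
                    int               dp_inuse;     /* buffers currently handed out */
                    void             *dp_buffers;   /* the pool's memory */
            };

            /* Called whenever a buffer is returned to its pool. */
            static void demo_pool_put(struct demo_pool *pool)
            {
                    pool->dp_inuse--;

                    /* Past the deadline with nothing outstanding: free the whole
                     * pool, even though it may have to be reallocated again soon;
                     * this is the "yoyo" effect described above. */
                    if (pool->dp_inuse == 0 &&
                        time_after(jiffies, pool->dp_deadline)) {
                            list_del(&pool->dp_list);
                            kfree(pool->dp_buffers);
                            kfree(pool);
                    }
            }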

            That being said, if a large enough pool is allocated at initialization time, it never gets freed, which would sidestep this issue. Was the test to increase NTX suggested above ever tried? Do the servers have enough memory for a large pool allocation?

            If the system is truly running out of memory due to a large load, the peer_credits of the clients can be lowered from the default of 8 to something like 6 or 4. That will cause the large number of clients you have to back off, reducing the load on the servers and thereby forgoing the need for such a big pool allocation. Worst case for a server: number of clients * peer_credits = maximum number of tx buffers that could be used.
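
            For example, with the roughly 3000 clients mentioned in the description, the default peer_credits of 8 gives a worst case of about 3000 * 8 = 24,000 tx buffers on a server, while peer_credits of 4 halves that to 12,000.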

            Another thing to check is memory statistics once a system gets into trouble. On Linux, if you "echo m > /proc/sysrq-trigger" and look in /var/log/messages, there are some very useful memory statistics there. If you don't get such memory statistics, then SysRq is not enabled; you can enable it with "echo 1 > /proc/sys/kernel/sysrq". On newer kernels, you can also "cat /proc/buddyinfo" to get a subset of the same information.

            doug Doug Oucharek (Inactive) added a comment

            Uploaded 2 new debug logs and /var/log/messages to ftp:/uploads/LU7054/
            messages.gz
            sec.20151029.21.03.25.gz
            sec.20151029.21.05.55.gz

            At Oct 29 21:03:25 we got this in /var/log/messages:

            Oct 29 21:03:25 nbp8-oss14 kernel: [1828390.200665] LustreError: 68632:0:(ldlm_lib.c:2715:target_bulk_io()) @@@ timeout on bulk PUT after 150+0s  req@ffff8812e7162c00 x1516263417841768/t0(0) o3->d07e5b2b-77f6-1c68-f1fd-e6e4f8f614d7@10.151.57.146@o2ib:0/0 lens 4568/432 e 1 to 0 dl 1446177826 ref 1 fl Interpret:/0/0 rc 0/0
            

            We dumped debug logs to sec.20151029.21.03.25.gz; it shows all clients connecting.

            00000800:00000200:0.2:1446177579.541489:0:0:0:(o2iblnd_cb.c:3306:kiblnd_cq_completion()) conn[ffff8804e04ad400] (20)++
            ....
            00000800:00000200:1.0:1446177803.579278:0:11134:0:(o2iblnd_cb.c:993:kiblnd_check_sends()) conn[ffff880da89faa00] (31)--
            

            I don't understand why the debug logs don't show any other activity before this, other than:

            00000400 00000001 18.1 Thu Oct 29 20:59:39 PDT 2015 0 0 0 (watchdog.c 123 lcw_cb()) Process entered
            00000400 00000001 18.1 Thu Oct 29 20:59:39 PDT 2015 0 0 0 (watchdog.c 126 lcw_cb()) Process leaving
            

            Hoping you guys can make sense of it.

            mhanafi Mahmoud Hanafi added a comment

            Uploaded lustre debug dump to ftpsite:/uploads/LU7054/nbp9-oss16.ldebug.gz

            It shows a lock callback timer eviction at

            Oct 19 22:20:39 nbp9-oss16 kernel: LustreError: 11651:0:(ldlm_lockd.c:346:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 10.151.6.134@o2ib  ns: filter-nbp9-OST003f_UUID lock: ffff8800810877c0/0x15ed8b9f78013cb2 lrc: 3/0,0 mode: PW/PW res: [0x85f72a:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000000000020 nid: 10.151.6.134@o2ib remote: 0x810eb6843fa5e1d1 expref: 9 pid: 11647 timeout: 5261835188 lvb_type: 0
            

            From the debug dump we see that starting 151 seconds earlier (22:18:08) the server was running in

            o2iblnd_cb.c 3306 kiblnd_cq_completion()) conn[ffff880f0c5d7000] (20)++
            

            for a long time.

            Is it possible that the callback request was never sent?

            mhanafi Mahmoud Hanafi added a comment

            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 14
