Lustre / LU-7390

Router memory leak when starting a new router on an operational configuration

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.7.0
    • Components: None
    • Environment: redhat7 mlx5 EDR and Connect-IB
    • Severity: 3

    Description

       Router memory leak when starting a new router on an operational configuration

      Configuration:

      Lustre servers running 2.5.3.90, each with one IB card and 2 IP addresses: QQ.P.BBO.SY, QQ.P.BBB.SY

      2 Lustre routers running 2.7, each with 4 IB cards and 4 IP addresses:
      IB0 - JO.BOO.RX.RY
      IB1 - QQ.P.BBO.RY
      IB2 - JO.BOB.RX.RY
      IB3 - QQ.P.BBB.RY

      ~130 Lustre clients running 2.7, each with one IB card and 2 IP addresses: JO.BOO.CX.CY, JO.BOB.CX.CY

      We start all servers, one router, and all clients, and wait for
      production to start.

      Then we start the second router with modprobe lustre. The router never
      starts correctly and panics with "Out of memory and no killable processes...":

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-229.7.2.el7.x86_64/vmlinux
      DUMPFILE: /var/crash/127.0.0.1-2015.09.23-09:00:12/vmcore [PARTIAL DUMP]
      CPUS: 32
      DATE: Wed Sep 23 08:59:56 2015
      UPTIME: 14:49:59
      LOAD AVERAGE: 11.71, 10.11, 5.64
      TASKS: 547
      NODENAME: neel121
      RELEASE: 3.10.0-229.7.2.el7.x86_64
      VERSION: #1 SMP Fri May 15 21:38:46 EDT 2015
      MACHINE: x86_64 (2299 Mhz)
      MEMORY: 127.9 GB
      PANIC: "Kernel panic - not syncing: Out of memory and no killable processes..."
      PID: 5002
      COMMAND: "kworker/u64:1"
      TASK: ffff8810154816c0 [THREAD_INFO: ffff882028314000]
      CPU: 23
      STATE: TASK_RUNNING (PANIC)

      crash> kmem -i
      PAGES TOTAL PERCENTAGE
      TOTAL MEM 32900006 125.5 GB ----
      FREE 131353 513.1 MB 0% of TOTAL MEM
      USED 32768653 125 GB 99% of TOTAL MEM
      SHARED 79 316 KB 0% of TOTAL MEM
      BUFFERS 0 0 0% of TOTAL MEM
      CACHED 6497 25.4 MB 0% of TOTAL MEM
      SLAB 993205 3.8 GB 3% of TOTAL MEM

      TOTAL SWAP 0 0 ----
      SWAP USED 0 0 100% of TOTAL SWAP
      SWAP FREE 0 0 0% of TOTAL SWAP
      crash> bt
      PID: 5002 TASK: ffff8810154816c0 CPU: 23 COMMAND: "kworker/u64:1"
      #0 [ffff882028317690] machine_kexec at ffffffff8104c4eb
      #1 [ffff8820283176f0] crash_kexec at ffffffff810e2052
      #2 [ffff8820283177c0] panic at ffffffff815fdc31
      #3 [ffff882028317840] out_of_memory at ffffffff8115a96a
      #4 [ffff8820283178d8] __alloc_pages_nodemask at ffffffff81160af5
      #5 [ffff882028317a10] dma_generic_alloc_coherent at ffffffff8101981f
      #6 [ffff882028317a58] x86_swiotlb_alloc_coherent at ffffffff810560e1
      #7 [ffff882028317a88] mlx5_dma_zalloc_coherent_node at ffffffffa012607d [mlx5_core]
      #8 [ffff882028317ac8] mlx5_buf_alloc_node at ffffffffa0126627 [mlx5_core]
      #9 [ffff882028317b18] mlx5_buf_alloc at ffffffffa0126755 [mlx5_core]
      #10 [ffff882028317b28] create_kernel_qp at ffffffffa0158903 [mlx5_ib]
      #11 [ffff882028317ba0] create_qp_common at ffffffffa0159236 [mlx5_ib]
      #12 [ffff882028317c38] __create_qp at ffffffffa0159ab1 [mlx5_ib]
      #13 [ffff882028317c98] mlx5_ib_create_qp at ffffffffa015a023 [mlx5_ib]
      #14 [ffff882028317cc8] ib_create_qp at ffffffffa00ed3b2 [ib_core]
      #15 [ffff882028317d00] rdma_create_qp at ffffffffa0549999 [rdma_cm]
      #16 [ffff882028317d28] kiblnd_create_conn at ffffffffa0926747 [ko2iblnd]
      #17 [ffff882028317d90] kiblnd_cm_callback at ffffffffa0934b89 [ko2iblnd]
      #18 [ffff882028317df8] cma_work_handler at ffffffffa054c98c [rdma_cm]
      #19 [ffff882028317e20] process_one_work at ffffffff8108f0bb
      #20 [ffff882028317e68] worker_thread at ffffffff8108fe8b
      #21 [ffff882028317ec8] kthread at ffffffff8109726f
      #22 [ffff882028317f50] ret_from_fork at ffffffff81614158

      There are a lot of zombie connections on the list:

      crash> p kiblnd_data.kib_connd_zombies
      $48 = {
      next = 0xffff881fac9ed418,
      prev = 0xffff8810aae96818
      }
      crash> list 0xffff881fac9ed418 | wc -l
      122060
      crash>

      All the connections have ibc_state = 0x5. Of these, 120688 have
      ibc_comms_error = 0xfffffffb (-5, EIO Input/output error) and the other 1372 have ibc_comms_error = 0.
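
      For reference, the zombie connections can be dumped together with their state
      and error fields directly from crash. This is only a sketch: it assumes the
      ko2iblnd debuginfo is loaded and that the 2.7 struct/member names
      kib_conn.ibc_list, ibc_state and ibc_comms_error match this build (state 5
      should correspond to IBLND_CONN_DISCONNECTED, i.e. connections queued for
      teardown by connd):

      crash> mod -s ko2iblnd
      crash> p &kiblnd_data.kib_connd_zombies
      crash> list -H <address printed above> -o kib_conn.ibc_list -s kib_conn.ibc_state,kib_conn.ibc_comms_error | head -20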

      We can see some failed connections in the Lustre debug trace:

      [root@neel121 127.0.0.1-2015.09.23-09:54:45]# grep kiblnd_rx_complete lustre.log
      00000800:00000100:18.0:1442994842.103700:0:4513:0:(o2iblnd_cb.c:491:kiblnd_rx_complete()) Rx from JO.BOO.BZP.LW@o2ib3 failed: 5
      00000800:00000200:18.0:1442994842.103701:0:4513:0:(o2iblnd_cb.c:537:kiblnd_rx_complete()) rx ffff881080c31000 conn ffff8810b37a6000
      00000800:00000100:23.0:1442994846.067198:0:4517:0:(o2iblnd_cb.c:491:kiblnd_rx_complete()) Rx from JO.BOB.BZP.BLP@o2ib30 failed: 5
      00000800:00000200:23.0:1442994846.067199:0:4517:0:(o2iblnd_cb.c:537:kiblnd_rx_complete()) rx ffff8810819cc000 conn ffff88109266f600
      00000800:00000100:18.0:1442994863.480144:0:4511:0:(o2iblnd_cb.c:491:kiblnd_rx_complete()) Rx from JO.BOO.BZZ.FL@o2ib3 failed: 5
      00000800:00000200:18.0:1442994863.480144:0:4511:0:(o2iblnd_cb.c:537:kiblnd_rx_complete()) rx ffff881085047000 conn ffff8810b31ccc00

      I don't understand why so many connections have an EIO error, but that would explain the memory leak.
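
      For context, the two debug lines above come from the receive completion
      handler in o2iblnd_cb.c. A condensed sketch of that error path (from my
      reading of the 2.7 source, not verbatim) shows why a failed Rx completion
      ends up recorded as ibc_comms_error = -EIO on the connection; status 5 in
      the log is the IB completion code IB_WC_WR_FLUSH_ERR, i.e. the QP was
      already in the error state and its posted receives were flushed:

      static void
      kiblnd_rx_complete(kib_rx_t *rx, int status, int nob)
      {
              kib_conn_t *conn = rx->rx_conn;

              if (status != IB_WC_SUCCESS) {
                      /* o2iblnd_cb.c:491 */
                      CNETERR("Rx from %s failed: %d\n",
                              libcfs_nid2str(conn->ibc_peer->ibp_nid), status);
                      goto failed;
              }

              /* ... normal message handling elided ... */
              return;

       failed:
              /* o2iblnd_cb.c:537 */
              CDEBUG(D_NET, "rx %p conn %p\n", rx, conn);
              kiblnd_close_conn(conn, -EIO);  /* stored in conn->ibc_comms_error */
      }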

      The router works fine if we start all routers before starting Lustre on the clients.
      The issue is reproducible only if we start the second router after real production traffic has started.

      I found LU-5718 in the Intel Lustre Jira database. Could you confirm whether LU-5718 could help with this issue?

      Lustre versions:
      For clients and routers:
      lustre-modules-2.7.0-3.10.0_229.7.2.el7.x86_64_1.el7.Bull.0.005.20150727.x86_64.rpm
      For servers:
      lustre-modules_H-2.5.3.90-2.6.32_573.1.1.el6.Bull.80.x86_64_Bull.4.113.el6.20150731.x86_64.rpm

      Lustre configuration

      Router:
      networks.conf
      LNET_OPTIONS='networks=o2ib3(ib0),o2ib30(ib2),o2ib2(ib1.8110),o2ib20(ib3.8111)'
      routers.conf
      LNET_ROUTER_OPTIONS='forwarding="enabled"'

      Client:
      networks.conf
      LNET_OPTIONS='o2ib3(ib0),o2ib30(ib0:1)'
      routers.conf
      LNET_ROUTER_OPTIONS='routes="o2ib2 JO.BOO.184.[121-122]@o2ib3;o2ib20 JO.BOB.184.[121-122]@o2ib30" dead_router_check_interval=59 live_router_check_interval=107 check _routers_before_use=1'

      Server:
      networks.conf
      LNET_OPTIONS='o2ib2(ib0.8110),o2ib20(ib0.8111)'
      routers.conf
      LNET_ROUTER_OPTIONS='routes="o2ib3 QQ.P.BBO.[121-122]@o2ib2;o2ib30 QQ.P.BBB.[121-122]@o2ib30" dead_router_check_interval=59 live_router_check_interval=107 check _routers_before_use=1'

      On the server side, there are many other routes that I did not include in the LNET_ROUTER_OPTIONS above,
      and the IB configuration on the server-side IB network also uses PKEYs.
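
      For reference, here is a rough sketch of what the settings above amount to
      as lnet module options, assuming the Bull networks.conf / routers.conf
      variables are simply passed through to modprobe (the file name
      /etc/modprobe.d/lustre.conf is only an example):

      # router (sketch)
      options lnet networks="o2ib3(ib0),o2ib30(ib2),o2ib2(ib1.8110),o2ib20(ib3.8111)" forwarding="enabled"

      # client (sketch)
      options lnet networks="o2ib3(ib0),o2ib30(ib0:1)" routes="o2ib2 JO.BOO.184.[121-122]@o2ib3;o2ib20 JO.BOB.184.[121-122]@o2ib30" dead_router_check_interval=59 live_router_check_interval=107 check_routers_before_use=1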

      Attachments

        Issue Links

          Activity

            [LU-7390] Router memory leak when starting a new router on an operational configuration
            ruth.klundt@gmail.com Ruth Klundt (Inactive) made changes -
            Description edited
            doug Doug Oucharek (Inactive) made changes -
            Assignee Original: Doug Oucharek [ doug ] New: Amir Shehata [ ashehata ]
            pjones Peter Jones made changes -
            End date New: 08/Jan/16
            Start date New: 05/Nov/15
            pjones Peter Jones made changes -
            Link Original: This issue is related to JFC-10 [ JFC-10 ]
            jfc John Fuchs-Chesney (Inactive) made changes -
            Link New: This issue is related to JFC-10 [ JFC-10 ]
            doug Doug Oucharek (Inactive) made changes -
            Link New: This issue is related to LU-7569 [ LU-7569 ]
            doug Doug Oucharek (Inactive) made changes -
            Assignee Original: Amir Shehata [ ashehata ] New: Doug Oucharek [ doug ]

            doug Doug Oucharek (Inactive) added a comment -

            Patch 14600 has been "reinvented" as: http://review.whamcloud.com/#/c/17661. This is new and needs validation. I need to spend some time to determine if it can address this ticket. However, if you have time, please remove 14600 and apply 17661 and see if this addresses your problem.
            hornc Chris Horn added a comment -

            FWIW, we (Cray) seem to be hitting this issue as well and the patch http://review.whamcloud.com/#/c/14600 did not resolve the issue.

            pjones Peter Jones made changes -
            Link Original: This issue is related to JFC-10 [ JFC-10 ]

            People

              ashehata Amir Shehata (Inactive)
              apercher Antoine Percher
              Votes: 0
              Watchers: 14
