Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Environment: Power8 running RHEL7 alt kernel
    • Severity: 3
    • 9223372036854775807

    Description

      After the LNet health merge I'm seeing a new bug on Power8 platforms. Currently I can't even ping the MGS from my Power8 client. I see the following backtrace:

      [  172.614537] [c000001fffdc7180] [c000000000c683a0] _raw_spin_unlock_bh+0x50/0x80
      [  172.614610] [c000001fffdc71b0] [c000000000a4d0e8] peernet2id+0x78/0xd0
      [  172.614670] [c000001fffdc71f0] [c000000000acd06c] netlink_broadcast_filtered+0x31c/0x740
      [  172.614749] [c000001fffdc72b0] [d00000000cc33298] rdma_nl_multicast+0x58/0x90 [ib_core]
      [  172.614826] [c000001fffdc72f0] [d00000000cc3a270] send_mad+0x4e0/0x6a0 [ib_core]
      [  172.614903] [c000001fffdc7390] [d00000000cc3bdcc] ib_sa_path_rec_get+0x21c/0x5b0 [ib_core]
      [  172.614977] [c000001fffdc7460] [d00000000fc629e4] path_rec_start+0xb4/0x190 [ib_ipoib]
      [  172.615051] [c000001fffdc7500] [d00000000fc65e1c] ipoib_start_xmit+0x63c/0x7e0 [ib_ipoib]
      [  172.615122] [c000001fffdc75b0] [c000000000a6906c] dev_hard_start_xmit+0xec/0x2f0
      [  172.615193] [c000001fffdc7640] [c000000000ab7ef4] sch_direct_xmit+0x164/0x260
      [  172.615264] [c000001fffdc76e0] [c000000000a69908] __dev_queue_xmit+0x698/0x9e0
      [  172.615335] [c000001fffdc7790] [c000000000a7900c] neigh_connected_output+0xfc/0x170
      [  172.615406] [c000001fffdc77e0] [c000000000a80194] neigh_update+0x644/0x790
      [  172.615465] [c000001fffdc7860] [c000000000b400a8] arp_process+0x2c8/0x850
      [  172.615525] [c000001fffdc7940] [c000000000b407cc] arp_rcv+0x19c/0x230
      [  172.615584] [c000001fffdc79b0] [c000000000a57c4c] __netif_receive_skb_core+0x73c/0x1010
      [  172.615694] [c000001fffdc7a70] [c000000000a5f8a8] netif_receive_skb_internal+0x58/0x160
      [  172.615831] [c000001fffdc7ab0] [c000000000a62e38] napi_gro_receive+0x1c8/0x2f0
      [  172.615983] [c000001fffdc7af0] [d00000000c8e85fc] mlx5i_handle_rx_cqe+0x20c/0x3c0 [mlx5_core]
      [  172.616158] [c000001fffdc7ba0] [d00000000c8e7878] mlx5e_poll_rx_cq+0x278/0xb50 [mlx5_core]
      [  172.616308] [c000001fffdc7c30] [d00000000c8e8a30] mlx5e_napi_poll+0x160/0xe50 [mlx5_core]
      [  172.616444] [c000001fffdc7cf0] [c000000000a62aec] net_rx_action+0x3bc/0x540
      [  172.616559] [c000001fffdc7e00] [c000000000c690cc] __do_softirq+0x14c/0x3dc
      [  172.616675] [c000001fffdc7ef0] [c0000000001423d4] irq_exit+0x1e4/0x1f0
      [  172.616791] [c000001fffdc7f20] [c000000000017190] __do_irq+0xa0/0x200
      [  172.616905] [c000001fffdc7f90] [c00000000002ea40] call_do_irq+0x14/0x24
      [  172.617019] [c000000ffba43a40] [c000000000017390] do_IRQ+0xa0/0x120
      [  172.617135] [c000000ffba43aa0] [c000000000008bd4] hardware_interrupt_common+0x114/0x120
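
      For reference, the ping that fails is a plain LNet-level ping of the MGS; a minimal reproduction looks like the sketch below, where the NID is illustrative and should be replaced with the real MGS NID for this cluster:

      # from the Power8 client, once LNet is up
      lnetctl ping 10.0.2.10@o2ib    # illustrative MGS NID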

    Attachments

    Issue Links

    Activity

    [LU-11278] LNet failures on Power8

            This might not be a Power8-only issue. The problem shows up when attempting to mount ZFS with a Lustre version using the MOFED 4.4 stack. Can you try that setup to see if you can reproduce this problem?

            simmonsja James A Simmons added a comment

            I would suggest building a version without Health and trying to reproduce the mlx5 issue. There haven't been any changes in the o2iblnd stack, except very minor flag settings that are only triggered when there is a failure sending traffic.

            Also, when does the mlx5 failure occur? Does it occur on bootup? Can you determine whether it is hit before or after LNet is loaded? If you simply don't start up LNet, do you still see the issue? If it occurs only when LNet is loaded, can you determine the specific operation LNet is taking that triggers this scenario? I'm not clear about the order of operations from the info you've shared.

            I don't mind joining you in a debug session to try and figure this out.
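
            A sketch of that isolation procedure, assuming standard lnetctl usage (the interface name and NID below are illustrative):

            # 1. Boot without LNet and check whether the mlx5 trace already appears:
            dmesg | grep -i mlx5
            lsmod | grep -E 'lnet|ko2iblnd'        # confirm LNet is not loaded

            # 2. Bring LNet up one step at a time and note which step triggers it:
            modprobe lnet
            lnetctl lnet configure
            lnetctl net add --net o2ib --if ib0    # illustrative interface
            lnetctl ping 10.0.2.10@o2ib            # illustrative peer NID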

            ashehata Amir Shehata (Inactive) added a comment
            simmonsja James A Simmons added a comment - Going to try patch https://patchwork.ozlabs.org/patch/764801

            I'm still seeing the mlx5 failures with the latest LNet stack. I gave it another try and it failed the same way. Somehow LNet is exposing an mlx5 bug.

            simmonsja James A Simmons added a comment

            If there are other issues please open a new ticket.

            ashehata Amir Shehata (Inactive) added a comment

            I'm rebuilding new images with the 3 health patches and Sonia's peer config fix.

            simmonsja James A Simmons added a comment

            Just about the "peer section" issue - if there are no peers configured on the node and we export the configuration to a YAML file, then the YAML file does have a "peer" root, but because no peers are configured there is no other information, just the root "peer".
            While importing that YAML file, it then shows the error, which is misleading. So that issue does need to be taken care of. But yes, this is not LNet health related.
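
            For illustration, an export from a node with a net configured but no peers looks roughly like this (the net values are placeholders; the point is the bare "peer" root at the end):

            net:
                - net type: o2ib
                  local NI(s):
                      - interfaces:
                            0: ib0
            peer: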

            sharmaso Sonia Sharma (Inactive) added a comment
            ashehata Amir Shehata (Inactive) added a comment - - edited

            The config attached doesn't have an empty peer section; I'm able to import it correctly. Also, if there is an empty peer section, then the error listed above makes sense, and it doesn't stop the rest of the elements from being configured. So that's not an issue.

            As regards the stack trace listed, that's most likely an issue with the mlx5 driver having a problem when it's coming up. It's not related to the Health feature. I believe it's not reproducible.

            Regarding the YAML file, that's not new behavior, and I think it's the correct behavior. It's simply being informative, telling you that something in the config file is not formed correctly or is missing, which is the case.

            Unless there is a real issue here, I'd rather close this ticket as not a bug.

            UPDATE:

            The fix Sonia points out is for not printing the "peer:" root if there are no peers. As regards the error while importing a file, I think it is entirely valid.


            The empty peer issue is resolved by this ticket - https://jira.whamcloud.com/browse/LU-11006

            sharmaso Sonia Sharma (Inactive) added a comment
            simmonsja James A Simmons added a comment - - edited

            The first problem is that the YAML file I have has an empty peer section, so I get:

            lnetctl import < ~/lnet-multirail.config
            add:
                - net:
                      errno: 0
                      descr: "success"
                - peer:
                      errno: -2
                      descr: error copying nids from YAML block
                - max_interfaces:
                      errno: 0
                      descr: "success"
            .......

            You can reproduce it with the attached file.
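
            The round trip that produces such a file is just an export on a node with no peers configured, followed by an import; a sketch with an illustrative path:

            lnetctl export > /tmp/lnet-multirail.config    # emits a bare "peer:" root
            lnetctl import < /tmp/lnet-multirail.config    # the peer block then fails with errno -2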

             


            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: