[LU-11278] LNet failures on Power8 Created: 23/Aug/18  Updated: 19/Dec/18  Resolved: 19/Dec/18

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: James A Simmons Assignee: Amir Shehata (Inactive)
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

Power8 running RHEL7 alt kernel.


Attachments: File lnet-multirail.config    
Issue Links:
Related
is related to LU-6387 Add Power8 support to Lustre Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After the LNet Health merge I'm seeing a new bug on Power8 platforms. Currently I can't even ping the MGS from my Power8 client. I see the following back trace:

[  172.614537] [c000001fffdc7180] [c000000000c683a0] _raw_spin_unlock_bh+0x50/0x80
[  172.614610] [c000001fffdc71b0] [c000000000a4d0e8] peernet2id+0x78/0xd0
[  172.614670] [c000001fffdc71f0] [c000000000acd06c] netlink_broadcast_filtered+0x31c/0x740
[  172.614749] [c000001fffdc72b0] [d00000000cc33298] rdma_nl_multicast+0x58/0x90 [ib_core]
[  172.614826] [c000001fffdc72f0] [d00000000cc3a270] send_mad+0x4e0/0x6a0 [ib_core]
[  172.614903] [c000001fffdc7390] [d00000000cc3bdcc] ib_sa_path_rec_get+0x21c/0x5b0 [ib_core]
[  172.614977] [c000001fffdc7460] [d00000000fc629e4] path_rec_start+0xb4/0x190 [ib_ipoib]
[  172.615051] [c000001fffdc7500] [d00000000fc65e1c] ipoib_start_xmit+0x63c/0x7e0 [ib_ipoib]
[  172.615122] [c000001fffdc75b0] [c000000000a6906c] dev_hard_start_xmit+0xec/0x2f0
[  172.615193] [c000001fffdc7640] [c000000000ab7ef4] sch_direct_xmit+0x164/0x260
[  172.615264] [c000001fffdc76e0] [c000000000a69908] __dev_queue_xmit+0x698/0x9e0
[  172.615335] [c000001fffdc7790] [c000000000a7900c] neigh_connected_output+0xfc/0x170
[  172.615406] [c000001fffdc77e0] [c000000000a80194] neigh_update+0x644/0x790
[  172.615465] [c000001fffdc7860] [c000000000b400a8] arp_process+0x2c8/0x850
[  172.615525] [c000001fffdc7940] [c000000000b407cc] arp_rcv+0x19c/0x230
[  172.615584] [c000001fffdc79b0] [c000000000a57c4c] __netif_receive_skb_core+0x73c/0x1010
[  172.615694] [c000001fffdc7a70] [c000000000a5f8a8] netif_receive_skb_internal+0x58/0x160
[  172.615831] [c000001fffdc7ab0] [c000000000a62e38] napi_gro_receive+0x1c8/0x2f0
[  172.615983] [c000001fffdc7af0] [d00000000c8e85fc] mlx5i_handle_rx_cqe+0x20c/0x3c0 [mlx5_core]
[  172.616158] [c000001fffdc7ba0] [d00000000c8e7878] mlx5e_poll_rx_cq+0x278/0xb50 [mlx5_core]
[  172.616308] [c000001fffdc7c30] [d00000000c8e8a30] mlx5e_napi_poll+0x160/0xe50 [mlx5_core]
[  172.616444] [c000001fffdc7cf0] [c000000000a62aec] net_rx_action+0x3bc/0x540
[  172.616559] [c000001fffdc7e00] [c000000000c690cc] __do_softirq+0x14c/0x3dc
[  172.616675] [c000001fffdc7ef0] [c0000000001423d4] irq_exit+0x1e4/0x1f0
[  172.616791] [c000001fffdc7f20] [c000000000017190] __do_irq+0xa0/0x200
[  172.616905] [c000001fffdc7f90] [c00000000002ea40] call_do_irq+0x14/0x24
[  172.617019] [c000000ffba43a40] [c000000000017390] do_IRQ+0xa0/0x120
[  172.617135] [c000000ffba43aa0] [c000000000008bd4] hardware_interrupt_common+0x114/0x120



 Comments   
Comment by James A Simmons [ 23/Aug/18 ]

The first problem is that the YAML file I have has an empty peer section, so I get:

lnetctl import < ~/lnet-multirail.config
add:
    - net:
          errno: 0
          descr: "success"
    - peer:
          errno: -2
          descr: error copying nids from YAML block
    - max_interfaces:
          errno: 0
          descr: "success"
.......

 

You can reproduce it with the attached file.
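For illustration, a rough sketch of the kind of YAML whose peer block produces the errno -2 above; the net and interface values are placeholders and this is not the attached lnet-multirail.config:

net:
    - net type: o2ib                 # placeholder network
      local NI(s):
        - interfaces:
              0: ib0                 # placeholder interface
peer:                                # bare root with no peer entries; import reports errno -2 for this block

A bare "peer:" root appears to match the export behavior Sonia describes below when no peers are configured.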

 

Comment by Sonia Sharma (Inactive) [ 23/Aug/18 ]

The empty peer issue is resolved by this ticket: https://jira.whamcloud.com/browse/LU-11006

Comment by Amir Shehata (Inactive) [ 23/Aug/18 ]

The config attached doesn't have an empty peer section. I'm able to import it correctly. Also if there is an empty peer section, then the error listed above makes sense and it doesn't stop the rest of the elements from being configured. So that's not an issue.

Regarding the stack trace listed, that's most likely an issue with the mlx5 driver having a problem when it's coming up. It's not related to the Health feature. I believe it's not reproducible.

Regarding the YAML file, that's not new behavior, and I think it's the correct behavior. It's simply being informative and telling you that something in the config file is not formed correctly or is missing, which is the case.

Unless there is a real issue here, I'd rather close this ticket as not a bug.

UPDATE:

The fix Sonia points out is for not printing the "peer:" root when there are no peers. As for the error reported while importing a file, I think it is entirely valid.

Comment by Sonia Sharma (Inactive) [ 23/Aug/18 ]

Just about the "peer section" issue: if there are no peers configured on the node and we export the configuration to a YAML file, the YAML file does contain a "peer" root, but because no peers are configured there is no other information under it; only the bare "peer" root is there.
When importing that YAML file, the misleading error shows up, so that issue does need to be taken care of. But yes, this is not LNet Health related.
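To make the round trip concrete, here is a sketch under the assumption that the node has nets but no peers configured (the file path is a placeholder):

# export on a node with nets configured but no peers; the file ends with a bare "peer:" root
lnetctl export > /tmp/lnet.conf
# re-importing the same file then reports the misleading peer error
lnetctl import < /tmp/lnet.conf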

Comment by James A Simmons [ 23/Aug/18 ]

I'm rebuilding new images with the 3 health patches and Sonia's peer config fix.

Comment by Amir Shehata (Inactive) [ 24/Aug/18 ]

If there are other issues please open a new ticket.

Comment by James A Simmons [ 24/Aug/18 ]

I'm still seeing the mlx5 failures with the latest LNet stack. I gave it another try and it failed the same way. Somehow LNet is exposing an mlx5 bug.

Comment by James A Simmons [ 24/Aug/18 ]

Going to try patch https://patchwork.ozlabs.org/patch/764801

Comment by Amir Shehata (Inactive) [ 25/Aug/18 ]

I would suggest building a version without Health and trying to reproduce the mlx5 issue. There haven't been any changes in the o2iblnd stack, except very minor flag settings that are only triggered when there is a failure sending traffic.

Also, when does the mlx5 failure occur? Does it occur on bootup? Can you determine if it is hit before or after LNet is loaded? If you simply don't start up LNet, do you still see the issue? If it occurs only when LNet is loaded, can you determine the specific operation LNet is taking that triggers this scenario? I'm not clear about the order of operations from the info you've shared.
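One possible way to split those cases, as a sketch (addresses and NIDs are placeholders; adjust to the actual fabric):

# step 1: with the mlx5/IPoIB stack up but LNet not loaded, drive plain IPoIB traffic and watch for the trace
ping <peer-ipoib-address>
dmesg | grep -i "mlx5\|ipoib"

# step 2: load and configure LNet, then issue a first LNet operation over the same fabric
modprobe lnet
lnetctl lnet configure --all          # bring up the nets from the module parameters
lnetctl ping <peer-nid>               # e.g. an @o2ib NID; check dmesg again afterwards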

I don't mind joining you in a debug session to try and figure this out.

Comment by James A Simmons [ 04/Sep/18 ]

This might not be a Power8-only issue. The problem shows up when attempting to mount ZFS with a Lustre version using the MOFED 4.4 stack. Can you try that setup to see if you can reproduce this problem?
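A reproduction sketch under the stated assumptions; the pool, target, NID, and fsname below are placeholders, not taken from this setup:

# server side: mount the ZFS-backed Lustre target
mount -t lustre mgtpool/mgt /mnt/lustre-mgt
# client side: mount over the o2ib net provided by MOFED 4.4
mount -t lustre <mgs-nid>@o2ib:/<fsname> /mnt/<fsname>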
