[LU-16552] LNet: crash on deleting an NI using lnetctl Created: 14/Feb/23  Updated: 08/Jan/24  Resolved: 31/Aug/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Serguei Smirnov Assignee: James A Simmons
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-9680 Improve the user land to kernel space... In Progress
Severity: 3

 Description   

Here are the steps to reproduce:

---- create an alias interface eth0:0 or use any other available interface ----
[root@lustre10 ~]# modprobe lnet
[root@lustre10 ~]# lnetctl lnet configure
[root@lustre10 ~]# lnetctl net add --net tcp --if eth0
[root@lustre10 ~]# lnetctl net add --net tcp --if eth0:0
[root@lustre10 ~]# lnetctl net del --net tcp --if eth0:0
--- crash ---
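
The steps above can be sketched as a small script. This is a hedged reproducer, not something taken from the ticket: the alias address 192.168.100.2/24, the DRY_RUN switch, and the helper names run and reproduce are illustrative assumptions. Only the lnetctl/modprobe commands and the eth0 and eth0:0 names come from the steps above. DRY_RUN defaults to 1, so the script only prints the commands unless told otherwise.

```shell
#!/bin/sh
# Reproducer sketch. WARNING: on an affected build the final "net del" is
# expected to crash the node, so use a disposable VM.
# ASSUMPTIONS (not from the ticket): ALIAS_ADDR value, DRY_RUN switch,
# and the run/reproduce helper names.
IFACE="${IFACE:-eth0}"
ALIAS="${IFACE}:0"
ALIAS_ADDR="${ALIAS_ADDR:-192.168.100.2/24}"   # hypothetical address

# Print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reproduce() {
    # Create the alias interface as a labelled secondary address.
    run ip addr add "$ALIAS_ADDR" dev "$IFACE" label "$ALIAS"
    run modprobe lnet
    run lnetctl lnet configure
    run lnetctl net add --net tcp --if "$IFACE"
    run lnetctl net add --net tcp --if "$ALIAS"
    run lnetctl net del --net tcp --if "$ALIAS"   # crash expected here
}

reproduce
```

Running it as root with DRY_RUN=0 executes the commands for real; with the default it lists them in order so they can be reviewed first.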

Here is the trace from the resulting crash:

[  396.152532] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  396.152608] IP: [<ffffffffc0c957ee>] lnet_net_cmd+0x42e/0xc30 [lnet]
[  396.152725] PGD 8000000076c7c067 PUD 76fd5067 PMD 0 
[  396.152768] Oops: 0000 [#1] SMP 
[  396.152807] Modules linked in: ksocklnd(OE) lnet(OE) libcfs(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_core(OE) mlxfw(OE) bridge stp llc bonding ip_set nfnetlink sunrpc snd_hda_codec_generic iosf_mbi crc32_pclmul snd_hda_intel ghash_clmulni_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device ppdev aesni_intel snd_pcm lrw gf128mul glue_helper ablk_helper cryptd sg pcspkr snd_timer joydev virtio_balloon snd soundcore i2c_piix4 parport_pc parport knem(OE) xfs libcrc32c virtio_net virtio_console virtio_blk net_failover failover sr_mod cdrom ata_generic pata_acpi 8139too mlx4_ib(OE) crct10dif_pclmul crct10dif_common crc32c_intel mlx4_en(OE) ib_core(OE) ptp pps_core serio_raw qxl drm_kms_helper syscopyarea
[  396.153546]  sysfillrect sysimgblt fb_sys_fops ttm drm 8139cp mii ata_piix mlx4_core(OE) devlink libata mlx_compat(OE) floppy virtio_pci virtio_ring virtio drm_panel_orientation_quirks dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ip_tables]
[  396.153754] CPU: 0 PID: 2394 Comm: lnetctl Kdump: loaded Tainted: G           OE  ------------   3.10.0-1160.25.1.el7_lustre.x86_64 #1
[  396.153830] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[  396.153886] task: ffff9a6977210000 ti: ffff9a69740dc000 task.ti: ffff9a69740dc000
[  396.153933] RIP: 0010:[<ffffffffc0c957ee>]  [<ffffffffc0c957ee>] lnet_net_cmd+0x42e/0xc30 [lnet]
[  396.154011] RSP: 0018:ffff9a69740df5d0  EFLAGS: 00010286
[  396.154047] RAX: 00000000ffffffff RBX: ffff9a696b9e8710 RCX: 0000000000000000
[  396.154096] RDX: 0000000000000005 RSI: ffff9a69740df640 RDI: ffff9a69771bd2c8
[  396.154172] RBP: ffff9a69740df9c8 R08: 000000000000003a R09: ffff9a696b9e8710
[  396.154247] R10: 0000000000000006 R11: 0000000000000000 R12: ffff9a69740df640
[  396.154321] R13: ffff9a69740bdc00 R14: ffff9a69740dfa18 R15: ffff9a6977258640
[  396.154397] FS:  00007fc625135740(0000) GS:ffff9a697fc00000(0000) knlGS:0000000000000000
[  396.154477] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  396.154557] CR2: 0000000000000000 CR3: 00000000772ce000 CR4: 00000000001406f0
[  396.154668] Call Trace:
[  396.154743]  [<ffffffff99e95ea8>] genl_family_rcv_msg+0x208/0x430
[  396.154819]  [<ffffffff99e9082f>] ? __netlink_sendskb+0x5f/0x180
[  396.154892]  [<ffffffff99b07d5c>] ? security_sock_rcv_skb+0x1c/0x20
[  396.154965]  [<ffffffff99e9612b>] genl_rcv_msg+0x5b/0xc0
[  396.155031]  [<ffffffff99e960d0>] ? genl_family_rcv_msg+0x430/0x430
[  396.155102]  [<ffffffff99e9411b>] netlink_rcv_skb+0xab/0xc0
[  396.155170]  [<ffffffff99e94658>] genl_rcv+0x28/0x40
[  396.155233]  [<ffffffff99e93aa0>] netlink_unicast+0x170/0x210
[  396.156316]  [<ffffffff99e93e48>] netlink_sendmsg+0x308/0x420
[  396.157019]  [<ffffffff99e363a6>] sock_sendmsg+0xb6/0xf0
[  396.157779]  [<ffffffffc06d43f4>] ? xfs_iunlock+0x114/0x120 [xfs]
[  396.158477]  [<ffffffff99e37269>] ___sys_sendmsg+0x3e9/0x400
[  396.159180]  [<ffffffff999be20b>] ? unlock_page+0x2b/0x30
[  396.159929]  [<ffffffff999f5e80>] ? handle_mm_fault+0xa20/0xfb0
[  396.160714]  [<ffffffff99f90678>] ? __do_page_fault+0x238/0x500
[  396.161416]  [<ffffffff99e38921>] __sys_sendmsg+0x51/0x90
[  396.162125]  [<ffffffff99e38972>] SyS_sendmsg+0x12/0x20
[  396.162860]  [<ffffffff99f95f92>] system_call_fastpath+0x25/0x2a
[  396.163597] Code: e9 89 c1 4c 89 e6 c1 e9 10 a9 80 80 00 00 0f 44 c1 48 8d 4a 02 48 0f 44 d1 00 c0 48 83 da 03 4c 29 e2 e8 66 ab ef d8 85 c0 74 4a <4c> 8b 2c 25 00 00 00 00 49 39 dd 75 a3 44 8b a5 28 fc ff ff 8b 
[  396.165199] RIP  [<ffffffffc0c957ee>] lnet_net_cmd+0x42e/0xc30 [lnet]
[  396.165938]  RSP <ffff9a69740df5d0>
[  396.166658] CR2: 0000000000000000


 Comments   
Comment by James A Simmons [ 16/Feb/23 ]

Patch https://review.whamcloud.com/#/c/fs/lustre-release/+/50026 will resolve this.

Comment by Gerrit Updater [ 15/Mar/23 ]

"jsimmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50302
Subject: LU-16552 test: add new lnet test for Multi-Rail setups
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2f610812a6a1798bdb824f9b3b42e9f01af08d63

Comment by Gerrit Updater [ 31/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50302/
Subject: LU-16552 test: add new lnet test for Multi-Rail setups
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8785f25b053c69b4303e901c6c8dc5d0d4d6dfc1
