[LU-15944] LNet: adding dst udsp rule before peer is discovered causes oops on peer discovery Created: 14/Jun/22  Updated: 08/Mar/23  Resolved: 08/Mar/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Serguei Smirnov Assignee: Cyril Bordage
Resolution: Fixed Votes: 0
Labels: lnet, udsp

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This was found and reported by hornc:

The following sequence of commands causes a crash:

# lnetctl peer del --prim_nid=10.1.0.60@o2ib1 # <-- make sure there is no record of this peer
# lnetctl udsp add --dst tcp --prio 1
# lnetctl discover 192.168.122.60@tcp

The trace is as follows:

[5449781.397300] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
[5449781.399193] IP: [<ffffffffc0c36ddb>] lnet_udsp_apply_rule_on_lpni+0xbb/0x7b0 [lnet]
[5449781.400130] PGD 8000000055a7f067 PUD 4964e067 PMD 0 
[5449781.400717] Oops: 0000 [#1] SMP 
[5449781.418329] Call Trace:
[5449781.419109]  [<ffffffffc0c35844>] lnet_udsp_apply_single_policy+0xf4/0x540 [lnet]
[5449781.419881]  [<ffffffffc0c35cce>] lnet_udsp_apply_policies_helper.part.8+0x3e/0x70 [lnet]
[5449781.420644]  [<ffffffffc0c37db6>] lnet_udsp_apply_policies_on_lpni+0x56/0x80 [lnet]
[5449781.421386]  [<ffffffffc0c36d20>] ? lnet_udsp_apply_rte_rule_on_nets+0x130/0x130 [lnet]
[5449781.422228]  [<ffffffffc0c28231>] lnet_peer_attach_peer_ni+0x161/0x600 [lnet]
[5449781.422987]  [<ffffffffc0c2883e>] lnet_peer_ni_traffic_add+0x16e/0x2b0 [lnet]
[5449781.423761]  [<ffffffffc0c2de25>] lnet_peerni_by_nid_locked+0xe5/0x140 [lnet]
[5449781.424521]  [<ffffffffc0c2df5e>] lnet_nid2peerni_locked+0xde/0xf0 [lnet]
[5449781.425281]  [<ffffffffc0bf8713>] LNetCtl+0x14d3/0x1c80 [lnet]
[5449781.426061]  [<ffffffffc0bf59fb>] ? LNetNIInit+0x8b/0xd50 [lnet]
[5449781.426818]  [<ffffffffc0c18a33>] lnet_ioctl+0x63/0x270 [lnet]
[5449781.427581]  [<ffffffff8ad90b6f>] notifier_call_chain+0x4f/0x70
[5449781.428345]  [<ffffffff8a6cc15d>] __blocking_notifier_call_chain+0x4d/0x70
[5449781.429083]  [<ffffffff8a6cc196>] blocking_notifier_call_chain+0x16/0x20
[5449781.429837]  [<ffffffffc0bbc3ad>] libcfs_psdev_ioctl+0x43d/0x5c0 [libcfs]
[5449781.430580]  [<ffffffff8a863590>] do_vfs_ioctl+0x3a0/0x5b0
[5449781.431319]  [<ffffffff8a863841>] SyS_ioctl+0xa1/0xc0
[5449781.432065]  [<ffffffff8ad95f92>] system_call_fastpath+0x25/0x2a
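
A fault address of 0x28 on a NULL pointer dereference is consistent with reading a structure member that sits 0x28 bytes from the start of an object whose base pointer is NULL. The sketch below only illustrates that arithmetic. The struct layout and the names example_obj and access_address are hypothetical and do not reflect the real LNet definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, NOT the real LNet structure: five 8-byte members
 * followed by the member that gets read. */
struct example_obj {
	uint64_t pad[5];   /* occupies bytes 0x00..0x27 */
	uint32_t wanted;   /* starts at byte 0x28 */
};

/* Address that would be touched when reading `wanted` through an object
 * located at `base`. With base == 0 (a NULL pointer) the result is just
 * the member offset, which is what the oops line reports. */
static uintptr_t access_address(uintptr_t base)
{
	return base + offsetof(struct example_obj, wanted);
}
```

Calling access_address(0) on this hypothetical layout yields the member offset itself, which is why a small non-zero fault address usually points at a member access through a NULL base pointer rather than at a wild pointer.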


 Comments   
Comment by Serguei Smirnov [ 14/Jun/22 ]

Temporary fix applied by Chris locally:

diff --git a/lnet/lnet/udsp.c b/lnet/lnet/udsp.c
index 08c1a7fccc..1f55b9289f 100644
--- a/lnet/lnet/udsp.c
+++ b/lnet/lnet/udsp.c
@@ -536,6 +536,9 @@ lnet_udsp_apply_rule_on_lpni(struct udsp_info *udi)
         &lp_match->ud_net_id.udn_net_num_range,
         &lp_match->ud_addr_range);
+    if (!udi->udi_lpn)
+        udi->udi_lpn = lpni->lpni_peer_net;
+
     /* check if looking for a net match */
     if (!rc &&
         (lnet_get_list_len(&lp_match->ud_addr_range) ||

This prevents the crash, but the NID priority is then inherited from the net priority previously set for the peer.
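
The guard in the temporary fix can be sketched in isolation as follows. This is a minimal model, assuming simplified stand-in structures: only the member names udi_lpn and lpni_peer_net come from the diff above, while udi_lpni, lpn_priority, udi_priority and the function apply_rule_sketch are hypothetical and are not the real lnet_udsp_apply_rule_on_lpni() logic.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; the real LNet structures have many more members. */
struct peer_net {
	int lpn_priority;               /* hypothetical member */
};

struct peer_ni {
	struct peer_net *lpni_peer_net; /* name taken from the diff */
};

struct udsp_info {
	struct peer_net *udi_lpn;       /* name taken from the diff; may be NULL */
	struct peer_ni *udi_lpni;       /* hypothetical member */
	int udi_priority;               /* hypothetical member */
};

/* Returns 0 on success, -1 if no peer net can be resolved. */
static int apply_rule_sketch(struct udsp_info *udi)
{
	struct peer_ni *lpni = udi->udi_lpni;

	if (!lpni)
		return -1;

	/* Same guard as the temporary fix: fall back to the peer NI's own
	 * peer net when the caller did not supply one. */
	if (!udi->udi_lpn)
		udi->udi_lpn = lpni->lpni_peer_net;

	/* Extra defensive check, not part of the original diff. */
	if (!udi->udi_lpn)
		return -1;

	/* Without the guard, this access would dereference NULL. */
	udi->udi_lpn->lpn_priority = udi->udi_priority;
	return 0;
}
```

In this model, a udsp_info whose udi_lpn is NULL but whose peer NI already points at a peer net is handled without a fault. The sketch does not reproduce the priority-inheritance side effect described above.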

Comment by Gerrit Updater [ 07/Oct/22 ]

"Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48801
Subject: LU-15944 lnet: remove crash with UDSP
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cea355d4837c945a3f0193fb92331abb65c13d5c

Comment by Gerrit Updater [ 08/Mar/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48801/
Subject: LU-15944 lnet: remove crash with UDSP
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c56b9455f05f760aea6785c47061761bbc76f3b6

Comment by Peter Jones [ 08/Mar/23 ]

Landed for 2.16
