[LU-9525] "lctl network down" won't work if network brought up by a mount Created: 17/May/17  Updated: 16/Apr/19  Resolved: 12/Apr/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Doug Oucharek (Inactive) Assignee: Patrick Farrell (Inactive)
Resolution: Not a Bug Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The code for both "lctl network down" and "lnetctl net unconfigure" was written assuming the network was originally brought up using either "lctl network configure" or "lnetctl net configure".  If the network came up as part of a mount or "modprobe lustre", a flag is left unset, which prevents lctl or lnetctl from bringing the network down.

This enforces a restriction that the mechanism which configures the network is the same one which unconfigures the network.  I see no reason for that restriction.  It is confusing and should not be there.

This ticket has been opened to change that part of the code to let us switch to a different mechanism to unconfigure from the one used to configure.  Otherwise, debugging in the field is just no fun!



 Comments   
Comment by Nathaniel Clark [ 21/Feb/19 ]

With Lustre 2.12.0, there doesn't appear to be any way to get lctl net down to work:

# modprobe lnet
# lctl list_nids
IOC_LIBCFS_GET_NI error 100: Network is down
# lctl net up
# lctl list_nids
192.168.56.20@tcp
# lctl net down
LNET busy

Debug log with ALL the debugging:

# lctl set_param debug=-1
debug=-1
# lctl dk > /dev/null ; lctl net down; lctl dk
LNET busy
00000400:00000001:1.0F:1550767099.121831:0:5433:0:(module.c:69:libcfs_ioctl()) Process entered
00000400:00000001:1.0:1550767099.121833:0:5433:0:(linux-module.c:113:libcfs_ioctl_getdata()) Process entered
00000400:00000010:1.0:1550767099.121834:0:5433:0:(linux-module.c:136:libcfs_ioctl_getdata()) alloc '(*hdr_pp)': 128 at ffffa0cd5829d900 (tot 502104).
00000400:00000001:1.0:1550767099.121835:0:5433:0:(linux-module.c:143:libcfs_ioctl_getdata()) Process leaving (rc=0 : 0 : 0)
00000400:00000001:1.0:1550767099.121836:0:5433:0:(linux-module.c:91:libcfs_ioctl_data_adjust()) Process entered
00000400:00000001:1.0:1550767099.121837:0:5433:0:(linux-module.c:105:libcfs_ioctl_data_adjust()) Process leaving (rc=0 : 0 : 0)
00000400:00000080:1.0:1550767099.121837:0:5433:0:(module.c:90:libcfs_ioctl()) libcfs ioctl cmd 3221775672
00000400:00000010:1.0:1550767099.121839:0:5433:0:(module.c:118:libcfs_ioctl()) kfreed 'hdr': 128 at ffffa0cd5829d900 (tot 501976).
00000400:00000001:1.0:1550767099.121840:0:5433:0:(module.c:119:libcfs_ioctl()) Process leaving (rc=18446744073709551600 : -16 : fffffffffffffff0)
Debug log: 9 lines, 9 kept, 0 dropped, 0 bad.
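
The last debug line prints the ioctl result three ways (rc=18446744073709551600 : -16 : fffffffffffffff0). As a minimal sketch of how to read it, the Python below reinterprets the unsigned 64-bit value as a signed integer and looks up the errno name. The helper name decode_rc is made up for illustration, and the errno names come from the machine running the script, not from Lustre.

```python
import errno


def decode_rc(unsigned_rc, bits=64):
    """Reinterpret an unsigned return code as signed and name its errno.

    decode_rc is a hypothetical helper, not part of lctl or Lustre.
    """
    # Two's complement: values with the top bit set are negative.
    if unsigned_rc >= 1 << (bits - 1):
        signed = unsigned_rc - (1 << bits)
    else:
        signed = unsigned_rc
    if signed < 0:
        # Kernel code returns -errno on failure.
        name = errno.errorcode.get(-signed, "unknown")
    else:
        name = "OK"
    return signed, name


signed, name = decode_rc(18446744073709551600)
print(signed, name, hex(18446744073709551600))
```

On Linux, errno 16 is EBUSY, which is consistent with the "LNET busy" message that lctl printed.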
Comment by Patrick Farrell (Inactive) [ 21/Feb/19 ]

Nathaniel,

What happens if you run lustre_rmmod in there, as we discussed on Skype?  (Not saying you should have to, just curious what happens.)

Comment by Nathaniel Clark [ 21/Feb/19 ]

Doing lustre_rmmod tears everything down.  Doing a partial teardown results in a panic (probably because of polling by iml-agent); see LU-11986.

Comment by Patrick Farrell (Inactive) [ 11/Apr/19 ]

I spent a while on this before realizing that, as far as I can tell, the code is fine.  I can't reproduce the scenario you reported:

# modprobe lnet
# lctl list_nids
IOC_LIBCFS_GET_NI error 100: Network is down
# lctl net up
# lctl list_nids
192.168.56.20@tcp
# lctl net down
LNET busy

lctl net down works fine for me in this scenario, both on master and on b2_12.

I also can't reproduce the configuration problem Doug described.  It's true that you cannot unconfigure LNet while ptlrpc is loaded, but that's because ptlrpc won't load without network interfaces available.  ptlrpc doesn't support being loaded without NIs (network interfaces) configured, so you can't unconfigure them without unloading it.

But it's not just "a flag that's set": that flag reflects whether LNet initialized its own config or whether ptlrpc did.  If it was done by ptlrpc, then LNet can't just tear it down.  All of the scenarios and orderings I could think of for config done by LNet and config done by ptlrpc (which is the same as "done by a mount") work fine, provided you accept that you can't unconfigure LNet while ptlrpc is loaded.

Maybe there's just something IML is doing that's making LNet busy...?  Because this code seems to be fine.

Comment by Joe Grund [ 12/Apr/19 ]

Sounds like this is expected behavior. I'll try using lustre_rmmod ptlrpc unconditionally before bringing down LNet.
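
A minimal sketch of that ordering, under stated assumptions: the Python below only plans the commands and does not run them. It checks /proc/modules-style text for ptlrpc and, if present, schedules the "lustre_rmmod ptlrpc" invocation quoted above (copied from this comment, not verified here) ahead of "lctl net down". This is a conditional variant of the unconditional approach described; the function names and the sample module listing are made up for illustration.

```python
def loaded_modules(proc_modules_text):
    """Return the module names found in /proc/modules-style text.

    Each non-empty line starts with the module name.
    """
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def teardown_plan(proc_modules_text):
    """Return the commands, in order, to bring LNet down (hypothetical helper)."""
    plan = []
    if "ptlrpc" in loaded_modules(proc_modules_text):
        # Per the discussion above, LNet cannot be unconfigured
        # while ptlrpc is loaded, so unload it first.
        plan.append(["lustre_rmmod", "ptlrpc"])
    plan.append(["lctl", "net", "down"])
    return plan


# Illustrative sample only; real input would come from /proc/modules.
sample = (
    "ptlrpc 2478080 1 lnet, Live 0x0000000000000000\n"
    "lnet 704512 2 ptlrpc, Live 0x0000000000000000\n"
)
for cmd in teardown_plan(sample):
    print(" ".join(cmd))
```

Keeping the planning separate from execution makes the ordering easy to check before anything touches a live node.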

Comment by Patrick Farrell (Inactive) [ 12/Apr/19 ]

Per our conversation this morning, it looks like, once LU-11986 is fixed (in progress; a patch exists), IML will be able to do what it needs to do.
