[LU-9525] "lctl network down" won't work if network brought up by a mount Created: 17/May/17 Updated: 16/Apr/19 Resolved: 12/Apr/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Doug Oucharek (Inactive) | Assignee: | Patrick Farrell (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The code for both "lctl network down" and "lnetctl net unconfigure" was written assuming the network was originally brought up using either "lctl network configure" or "lnetctl net configure". If the network came up as part of a mount or "modprobe lustre", there is a flag which does not get set thereby preventing lctl or lnetctl from bringing down the network. This enforces a restriction that the mechanism which configures the network is the same one which unconfigures the network. I see no reason for that restriction. It is confusing and should not be there. This ticket has been opened to change that part of the code to let us switch to a different mechanism to unconfigure from the one used to configure. Otherwise, debugging in the field is just no fun! |
| Comments |
| Comment by Nathaniel Clark [ 21/Feb/19 ] |
|
With Lustre 2.12.0 it doesn't appear there's any way for lctl net down to work:

# modprobe lnet
# lctl list_nids
IOC_LIBCFS_GET_NI error 100: Network is down
# lctl net up
# lctl list_nids
192.168.56.20@tcp
# lctl net down
LNET busy

Debug log with ALL the debugging:

# lctl set_param debug=-1
debug=-1
# lctl dk > /dev/null ; lctl net down; lctl dk
LNET busy
00000400:00000001:1.0F:1550767099.121831:0:5433:0:(module.c:69:libcfs_ioctl()) Process entered
00000400:00000001:1.0:1550767099.121833:0:5433:0:(linux-module.c:113:libcfs_ioctl_getdata()) Process entered
00000400:00000010:1.0:1550767099.121834:0:5433:0:(linux-module.c:136:libcfs_ioctl_getdata()) alloc '(*hdr_pp)': 128 at ffffa0cd5829d900 (tot 502104).
00000400:00000001:1.0:1550767099.121835:0:5433:0:(linux-module.c:143:libcfs_ioctl_getdata()) Process leaving (rc=0 : 0 : 0)
00000400:00000001:1.0:1550767099.121836:0:5433:0:(linux-module.c:91:libcfs_ioctl_data_adjust()) Process entered
00000400:00000001:1.0:1550767099.121837:0:5433:0:(linux-module.c:105:libcfs_ioctl_data_adjust()) Process leaving (rc=0 : 0 : 0)
00000400:00000080:1.0:1550767099.121837:0:5433:0:(module.c:90:libcfs_ioctl()) libcfs ioctl cmd 3221775672
00000400:00000010:1.0:1550767099.121839:0:5433:0:(module.c:118:libcfs_ioctl()) kfreed 'hdr': 128 at ffffa0cd5829d900 (tot 501976).
00000400:00000001:1.0:1550767099.121840:0:5433:0:(module.c:119:libcfs_ioctl()) Process leaving (rc=18446744073709551600 : -16 : fffffffffffffff0)
Debug log: 9 lines, 9 kept, 0 dropped, 0 bad. |
| Comment by Patrick Farrell (Inactive) [ 21/Feb/19 ] |
|
Nathaniel, What happens if you do lustre_rmmod in there, like we discussed in Skype? (Not saying you should have to, just curious what happens) |
| Comment by Nathaniel Clark [ 21/Feb/19 ] |
|
Doing lustre_rmmod tears everything down. Doing a partial teardown results in a panic (probably because of polling by iml-agent) see |
| Comment by Patrick Farrell (Inactive) [ 11/Apr/19 ] |
|
So I spent a bit of time on this before I realized that, as far as I can tell, the code is fine, and I can't reproduce this:

# modprobe lnet
# lctl list_nids
IOC_LIBCFS_GET_NI error 100: Network is down
# lctl net up
# lctl list_nids
192.168.56.20@tcp
# lctl net down
LNET busy

lctl net down works fine for me in this scenario, both on master and on b2_12.

I also can't reproduce the problem Doug described about the config thing. It's true that you cannot unconfigure LNet while ptlrpc is loaded, but that's because ptlrpc won't load without network interfaces available. ptlrpc doesn't support being loaded without NIs configured, so you can't unconfigure them without unloading it.

But it's not just "a flag that's set" - that flag reflects whether LNet initialized its own config or if it was done by ptlrpc. If it was done by ptlrpc, then LNet can't just tear it down.

All of the scenarios and orders I could think of for "config done by lnet, done by ptlrpc (this is the same as "done by a mount")" work fine, given that you accept that you can't unconfigure lnet while ptlrpc is loaded. Maybe there's just something IML is doing that's making LNet busy...? Because this code seems to be fine. |
| Comment by Joe Grund [ 12/Apr/19 ] |
|
Sounds like this is expected behavior. I'll try using lustre_rmmod ptlrpc unconditionally before bringing down LNet. |
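The ordering described above (unload ptlrpc first, then bring LNet down) could be sketched roughly as follows. This is only an illustration, not what IML actually runs: the function names and the MODULES_FILE and DRY_RUN variables are hypothetical helpers added so the logic can be tried without touching a live node. The only real commands it invokes are the ones already mentioned in this ticket, lustre_rmmod ptlrpc and lctl net down. Unlike the unconditional unload proposed above, the sketch first checks whether ptlrpc is loaded.

```shell
#!/bin/sh
# Sketch only: unload ptlrpc (if it is loaded) before bringing LNet down.
# MODULES_FILE and DRY_RUN are hypothetical knobs for illustration/testing.
MODULES_FILE="${MODULES_FILE:-/proc/modules}"

# Succeeds if the named module appears in the modules list.
# /proc/modules has one module per line, with the name in the first field.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "$MODULES_FILE"
}

# Run a command, or only print it when DRY_RUN is set to a non-empty value.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# LNet cannot be unconfigured while ptlrpc is loaded, so unload it first.
lnet_down() {
    if module_loaded ptlrpc; then
        run lustre_rmmod ptlrpc || return 1
    fi
    run lctl net down
}
```

After sourcing this file, setting DRY_RUN=1 and calling lnet_down prints the commands that would be run instead of executing them, which is a cheap way to check the ordering before trying it on a real node.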
| Comment by Patrick Farrell (Inactive) [ 12/Apr/19 ] |
|
Per our conversation this morning, it looks like with |