[LU-7736] lustre_rmmod does not remove all the Lustre modules Created: 03/Feb/16  Updated: 09/May/17  Resolved: 09/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: Gregoire Pichon Assignee: Bob Glossman (Inactive)
Resolution: Fixed Votes: 0
Labels: patch
Environment:

Lustre 2.7.66


Issue Links:
Duplicate
duplicates LU-9439 Introduce an lnet systemd service Resolved
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

lustre_rmmod does not remove all the Lustre modules. A second call to the command does.

[root@rio10 ~]# modprobe lustre
[root@rio10 ~]# lctl list_nids
10.1.0.64@o2ib
[root@rio10 ~]# lustre_rmmod
Modules still loaded: 
lnet/lnet/lnet.o libcfs/libcfs/libcfs.o
[root@rio10 ~]# lustre_rmmod

After analysing the problem, it appears that:

  • the lnet module does not have a direct dependency on the lnd modules (ko2iblnd, ksocklnd, ...). This is because lnet uses function pointers to call the lnd routines (lnd_startup(), lnd_shutdown(), ...). This avoids a loop in the module dependencies, because the lnd modules also depend on lnet.
  • unloading the ptlrpc module triggers the shutdown of the lnet network interfaces and thus decrements the lnd refcount, after which the lnd modules are no longer "in use".
  • the lustre_rmmod script unloads the lustre modules recursively starting from libcfs, unloading each dependent module first.

By chance, in previous lustre versions (2.1, 2.4 or 2.5) the dependency order made lustre_rmmod unload ptlrpc before ko2iblnd.

Unfortunately, since lustre version 2.7, ko2iblnd is still in use when lustre_rmmod first tries to unload it, which then prevents lnet from unloading. The lnet unload is not attempted again after ko2iblnd is finally removed, so lnet and libcfs are still loaded at the end.

[root@rio10 ~]# modprobe lustre
[root@rio10 ~]# lustre_rmmod
DEBUG: rmmod lustre
DEBUG: rmmod mdc
DEBUG: rmmod fid
DEBUG: rmmod lmv
DEBUG: rmmod fld
DEBUG: rmmod lmv
rmmod: ERROR: Module lmv is not currently loaded
DEBUG: rmmod mdc
rmmod: ERROR: Module mdc is not currently loaded
DEBUG: rmmod lov
DEBUG: rmmod ko2iblnd
rmmod: ERROR: Module ko2iblnd is in use
DEBUG: rmmod ptlrpc
DEBUG: rmmod obdclass
DEBUG: rmmod ptlrpc
rmmod: ERROR: Module ptlrpc is not currently loaded
DEBUG: rmmod lnet
rmmod: ERROR: Module lnet is in use by: ko2iblnd
DEBUG: rmmod ko2iblnd
DEBUG: rmmod lustre
rmmod: ERROR: Module lustre is not currently loaded
DEBUG: rmmod obdclass
rmmod: ERROR: Module obdclass is not currently loaded
DEBUG: rmmod ptlrpc
rmmod: ERROR: Module ptlrpc is not currently loaded
DEBUG: rmmod libcfs
rmmod: ERROR: Module libcfs is in use by: lnet
Modules still loaded: 
lnet/lnet/lnet.o libcfs/libcfs/libcfs.o
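
The ordering problem visible in the DEBUG trace above can be reproduced with a toy model. The following is a minimal sketch only, not the lustre_rmmod code and not the submitted patch: it touches no real kernel modules, the names LOADED, is_loaded, in_use, sim_rmmod, one_pass and unload_all are hypothetical, and the holder rules are a simplification of the analysis above (ko2iblnd is treated as pinned for as long as ptlrpc is loaded). One fixed pass leaves lnet and libcfs behind; repeating the pass until nothing is left, or until a pass makes no progress, removes them.

```shell
#!/bin/sh
# Toy model: the "loaded modules" are just words in a variable.
LOADED="libcfs lnet ko2iblnd ptlrpc"

# Succeeds if module $1 is in the simulated loaded list.
is_loaded() {
    case " $LOADED " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Succeeds if module $1 is pinned by a module that is still loaded.
# Assumed rules, simplified from the analysis in the description.
in_use() {
    case "$1" in
        libcfs)   is_loaded lnet || is_loaded ko2iblnd ;;
        lnet)     is_loaded ko2iblnd || is_loaded ptlrpc ;;
        ko2iblnd) is_loaded ptlrpc ;;  # hidden refcount, no symbol dependency
        *)        return 1 ;;
    esac
}

# Simulated rmmod: fails if the module is absent or in use.
sim_rmmod() {
    if ! is_loaded "$1"; then return 1; fi
    if in_use "$1"; then return 1; fi
    new=""
    for m in $LOADED; do
        if [ "$m" != "$1" ]; then new="$new $m"; fi
    done
    LOADED="${new# }"
}

# One pass in the order seen in the trace; lnet is tried only once,
# before ko2iblnd has been removed.
one_pass() {
    for mod in ko2iblnd ptlrpc lnet ko2iblnd libcfs; do
        sim_rmmod "$mod" || true
    done
}

# Repeat passes until everything is gone; give up if a pass removes nothing.
unload_all() {
    while [ -n "$LOADED" ]; do
        before="$LOADED"
        one_pass
        if [ "$LOADED" = "$before" ]; then return 1; fi
    done
    return 0
}

one_pass
echo "after one pass: ${LOADED:-<none>}"
unload_all
echo "after retrying: ${LOADED:-<none>}"
```

The no-progress check in unload_all is a design choice for the sketch: it keeps the loop from spinning forever when a module is pinned by something that no amount of retrying will release.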


 Comments   
Comment by Gerrit Updater [ 03/Feb/16 ]

Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/18279
Subject: LU-7736 scripts: ensure lustre_rmmod unload all modules
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fcbf64af0e2b4a2b5ce7429590c12f2faf2d1374

Comment by Peter Jones [ 03/Feb/16 ]

Bob

Could you please look after this patch?

Thanks

Peter

Comment by James Nunez (Inactive) [ 03/Mar/16 ]

We might be seeing this in our autotest results. For the POSIX test results at https://testing.hpdd.intel.com/test_sets/03f82950-e14e-11e5-8edf-5254006e85c2, it looks like not all modules are removed. The last thing in the suite_stdout is

...
04:52:08:Stopping /mnt/ost8 (opts:-f) on onyx-34vm8
04:52:08:CMD: onyx-34vm8 umount -d -f /mnt/ost8
04:52:19:CMD: onyx-34vm8 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
04:52:19:CMD: onyx-34vm1.onyx.hpdd.intel.com lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
04:52:19: 15 UP osc lustre-OST0000_osc lustre-OST0000_osc_UUID 4
04:52:19: 17 UP osc lustre-OST0001_osc lustre-OST0001_osc_UUID 4
04:52:19: 19 UP osc lustre-OST0002_osc lustre-OST0002_osc_UUID 4
04:52:19: 21 UP osc lustre-OST0003_osc lustre-OST0003_osc_UUID 4
04:52:19: 23 UP osc lustre-OST0004_osc lustre-OST0004_osc_UUID 4
04:52:19: 25 UP osc lustre-OST0005_osc lustre-OST0005_osc_UUID 4
04:52:19: 27 UP osc lustre-OST0006_osc lustre-OST0006_osc_UUID 4
04:52:19: 29 UP osc lustre-OST0007_osc lustre-OST0007_osc_UUID 4
04:52:19:Modules still loaded: 
04:52:19:lustre/osc/osc.o lustre/ptlrpc/ptlrpc.o lustre/obdclass/obdclass.o lnet/klnds/socklnd/ksocklnd.o lnet/lnet/lnet.o libcfs/libcfs/libcfs.o
05:51:43:********** Timeout by autotest system **********

Comment by Bob Glossman (Inactive) [ 10/Mar/16 ]

I haven't been able to reproduce the problem, as I have no IB hardware on hand, and I don't see it with only ksocklnd loaded. I don't know for sure that the patch fixes the whole problem, but I've given it +review anyway.

Comment by Gerrit Updater [ 14/Mar/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18279/
Subject: LU-7736 scripts: ensure lustre_rmmod unload all modules
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 40e316910744429e2294bf353f85cb1061261d46

Comment by John Salinas (Inactive) [ 06/May/17 ]

I appear to be seeing this issue:

Load lnet

# pdsh -w node0[1-8] "modprobe lnet" 
# pdsh -w node0[1-8] "/usr/sbin/lustre_rmmod" 
node08: Modules still loaded: 
node08: lnet/klnds/socklnd/ksocklnd.o lnet/lnet/lnet.o libcfs/libcfs/libcfs.o
pdsh@natasha: node08: ssh exited with exit code 1
node01: Modules still loaded: 
node01: lnet/klnds/socklnd/ksocklnd.o lnet/lnet/lnet.o libcfs/libcfs/libcfs.o
pdsh@natasha: node01: ssh exited with exit code 1
node05: Modules still loaded: 
node05: lnet/klnds/socklnd/ksocklnd.o lnet/lnet/lnet.o libcfs/libcfs/libcfs.o
pdsh@natasha: node05: ssh exited with exit code 1
node07: Modules still loaded: 
node07: lnet/klnds/socklnd/ksocklnd.o lnet/lnet/lnet.o libcfs/libcfs/libcfs.o
pdsh@natasha: node07: ssh exited with exit code 1
node03: Modules still loaded: 
node03: lnet/klnds/socklnd/ksocklnd.o lnet/lnet/lnet.o libcfs/libcfs/libcfs.o
pdsh@natasha: node03: ssh exited with exit code 1
node02: Modules still loaded: 
node02: lnet/klnds/socklnd/ksocklnd.o lnet/lnet/lnet.o libcfs/libcfs/libcfs.o
pdsh@natasha: node02: ssh exited with exit code 1
node04: Modules still loaded: 
node04: lnet/klnds/socklnd/ksocklnd.o lnet/lnet/lnet.o libcfs/libcfs/libcfs.o
pdsh@natasha: node04: ssh exited with exit code 1
node06: Modules still loaded: 
node06: lnet/klnds/socklnd/ksocklnd.o lnet/lnet/lnet.o libcfs/libcfs/libcfs.o
pdsh@natasha: node06: ssh exited with exit code 1

Try to unload, but we cannot remove either lnet or ksocklnd

[root@natasha jsalinas]# pdsh -w node0[1-8] "lsmod |grep lnet " 
node08: lnet                  444969  2 ksocklnd
node08: libcfs                405310  2 lnet,ksocklnd
node07: lnet                  444969  2 ksocklnd
node04: lnet                  444969  2 ksocklnd
node07: libcfs                405310  2 lnet,ksocklnd
node04: libcfs                405310  2 lnet,ksocklnd
node03: lnet                  444969  2 ksocklnd
node03: libcfs                405310  2 lnet,ksocklnd
node02: lnet                  444969  2 ksocklnd
node02: libcfs                405310  2 lnet,ksocklnd
node05: lnet                  444969  2 ksocklnd
node05: libcfs                405310  2 lnet,ksocklnd
node06: lnet                  444969  2 ksocklnd
node06: libcfs                405310  2 lnet,ksocklnd
node01: lnet                  444969  2 ksocklnd
node01: libcfs                405310  2 lnet,ksocklnd
[root@natasha jsalinas]# pdsh -w node0[1-8] "lsmod |grep ksocklnd" 
node03: ksocklnd              179299  1 
node03: lnet                  444969  2 ksocklnd
node03: libcfs                405310  2 lnet,ksocklnd
node01: ksocklnd              179299  1 
node01: lnet                  444969  2 ksocklnd
node01: libcfs                405310  2 lnet,ksocklnd
node07: ksocklnd              179299  1 
node07: lnet                  444969  2 ksocklnd
node07: libcfs                405310  2 lnet,ksocklnd
node05: ksocklnd              179299  1 
node05: lnet                  444969  2 ksocklnd
node05: libcfs                405310  2 lnet,ksocklnd
node08: ksocklnd              179299  1 
node08: lnet                  444969  2 ksocklnd
node08: libcfs                405310  2 lnet,ksocklnd
node02: ksocklnd              179299  1 
node02: lnet                  444969  2 ksocklnd
node02: libcfs                405310  2 lnet,ksocklnd
node04: ksocklnd              179299  1 
node04: lnet                  444969  2 ksocklnd
node04: libcfs                405310  2 lnet,ksocklnd
node06: ksocklnd              179299  1 
node06: lnet                  444969  2 ksocklnd
node06: libcfs                405310  2 lnet,ksocklnd

Can't win this battle:
[root@node01 ~]# modprobe -r ksocklnd
modprobe: FATAL: Module ksocklnd is in use.
[root@node01 ~]# modprobe -r lnet
modprobe: FATAL: Module lnet is in use.

[root@node01 ~]# rpm -qa |grep lustre
kmod-lustre-client-tests-2.9.0-1.el7.centos.x86_64
lustre-client-tests-2.9.0-1.el7.centos.x86_64
kmod-lustre-client-2.9.0-1.el7.centos.x86_64
lustre-client-debuginfo-2.9.0-1.el7.centos.x86_64
lustre-client-2.9.0-1.el7.centos.x86_64
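
In the lsmod output in this comment, ksocklnd shows a use count of 1 with an empty "Used by" column, which is consistent with the description's point that lnet holds the lnd through function pointers rather than a symbol dependency. As a rough aid, the sketch below lists modules in that state from lsmod-style input. The function name hidden_refs is hypothetical, and a non-zero count with no named holder can have other causes as well (for example an open device or a mounted filesystem), so the result is only a hint.

```shell
#!/bin/sh
# Read lsmod-style lines (Module, Size, Used, optional "by" list) on stdin
# and print modules with a non-zero use count but no named holder.
hidden_refs() {
    awk '$1 != "Module" && $3 > 0 && NF < 4 { print $1 }'
}

# Sample taken from the per-node output above, with the pdsh "nodeNN:"
# prefix removed.
hidden_refs <<'EOF'
ksocklnd              179299  1 
lnet                  444969  2 ksocklnd
libcfs                405310  2 lnet,ksocklnd
EOF
```

On a live node the same function could be fed directly, as in `lsmod | hidden_refs`.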

Comment by John Salinas (Inactive) [ 06/May/17 ]

This appears to happen in the step between modprobe lnet:
[root@natasha jsalinas]# pdsh -w node0[1-8] "lsmod |grep lnet"
node07: lnet 444969 0
node07: libcfs 405310 1 lnet
node01: lnet 444969 0
node01: libcfs 405310 1 lnet
node08: lnet 444969 0
node08: libcfs 405310 1 lnet
node03: lnet 444969 0
node03: libcfs 405310 1 lnet
node02: lnet 444969 0
node02: libcfs 405310 1 lnet
node04: lnet 444969 0
node04: libcfs 405310 1 lnet
node05: lnet 444969 0
node05: libcfs 405310 1 lnet
node06: lnet 444969 0
node06: libcfs 405310 1 lnet

and lctl network up:
[root@natasha jsalinas]# pdsh -w node0[1-8] "lctl network up "
node04: LNET configured
node02: LNET configured
node01: LNET configured
node07: LNET configured
node08: LNET configured
node03: LNET configured
node05: LNET configured
node06: LNET configured
[root@natasha jsalinas]# pdsh -w node0[1-8] "lsmod |grep lnet"
node01: lnet 444969 2 ko2iblnd
node01: libcfs 405310 2 lnet,ko2iblnd
node07: lnet 444969 2 ko2iblnd
node07: libcfs 405310 2 lnet,ko2iblnd
node03: lnet 444969 2 ko2iblnd
node03: libcfs 405310 2 lnet,ko2iblnd
node05: lnet 444969 2 ko2iblnd
node05: libcfs 405310 2 lnet,ko2iblnd
node02: lnet 444969 2 ko2iblnd
node02: libcfs 405310 2 lnet,ko2iblnd
node06: lnet 444969 2 ko2iblnd
node06: libcfs 405310 2 lnet,ko2iblnd
node08: lnet 444969 2 ko2iblnd
node08: libcfs 405310 2 lnet,ko2iblnd
node04: lnet 444969 2 ko2iblnd
node04: libcfs 405310 2 lnet,ko2iblnd
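
Since the listings above show the lnd becoming a holder of lnet only after lctl network up, one ordering that follows from the description's analysis is to bring the network down before unloading. The sketch below is an assumption, not a confirmed fix for the case reported in this comment: the function name lnet_teardown and the RUN variable are hypothetical, and it presumes that lctl network down is available in the installed version. By default it only prints the commands it would run.

```shell
#!/bin/sh
# RUN defaults to "echo" so nothing is executed; set RUN= (empty) to run
# the commands for real, as root, on a node with Lustre installed.
RUN=${RUN-echo}

lnet_teardown() {
    # Shut down the LNet interfaces first so the lnd refcounts can drop.
    $RUN lctl network down || return 1
    # Then unload the modules.
    $RUN lustre_rmmod
}
```

Calling lnet_teardown with RUN left at its default prints the two commands in order, which makes it easy to review the sequence before running it on a real node.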

Comment by Andreas Dilger [ 08/May/17 ]

There is work under LU-9439 to fix up the lnet service for RHEL7 and other systems using systemd.

Comment by John Salinas (Inactive) [ 09/May/17 ]

Perhaps I missed it but I didn't see any specific mention of ko2iblnd in LU-9439?

Comment by Peter Jones [ 09/May/17 ]

Can you please open a new ticket to track the similar issue that you are seeing for the 2.10 release?

Comment by John Salinas (Inactive) [ 09/May/17 ]

Will do

Generated at Sat Feb 10 02:11:29 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.