Details
- Bug
- Resolution: Fixed
- Minor
- Lustre 2.4.0
- None
- 3
- 7281
Description
It is easy to reproduce this on a single VM by running only conf-sanity test 66:
== conf-sanity test 66: replace nids == 15:30:00 (1363678200)
Loading modules from /root/lustre-master/lustre/tests/..
detected 1 online CPUs by sysfs
libcfs will create CPU partition based on online CPUs
debug=-1
subsystem_debug=all -lnet -lnd -pinger
gss/krb5 is not supported
quota/lquota options: 'hash_lqs_cur_bits=3'
start mds service on linux
Starting mds1: -o loop /tmp/lustre-mdt1 /mnt/mds1
Started lustre-MDT0000
start ost1 service on linux
Starting ost1: -o loop /tmp/lustre-ost1 /mnt/ost1
Started lustre-OST0000
mount lustre on /mnt/lustre.....
Starting client: linux: -o user_xattr,flock linux@tcp:/lustre /mnt/lustre
replace_nids should fail if MDS, OSTs and clients are UP
error: replace_nids: Operation now in progress
umount lustre on /mnt/lustre.....
Stopping client linux /mnt/lustre (opts:)
sh: lsof: command not found
replace_nids should fail if MDS and OSTs are UP
error: replace_nids: Operation now in progress
stop ost1 service on linux
Stopping /mnt/ost1 (opts:-f) on linux
replace_nids should fail if MDS is UP
error: replace_nids: Operation now in progress
stop mds service on linux
Stopping /mnt/mds1 (opts:-f) on linux
start mds service on linux
Starting mds1: -o nosvc,loop /tmp/lustre-mdt1 /mnt/mds1
Started lustre-MDT0000
command should accept two parameters
replace primary NIDs for a device
usage: replace_nids <device> <nid1>[,nid2,nid3]
correct device name should be passed
error: replace_nids: Invalid argument
wrong nids list should not destroy the system
replace primary NIDs for a device
usage: replace_nids <device> <nid1>[,nid2,nid3]
replace OST nid
command should accept two parameters
replace primary NIDs for a device
usage: replace_nids <device> <nid1>[,nid2,nid3]
wrong nids list should not destroy the system
replace primary NIDs for a device
usage: replace_nids <device> <nid1>[,nid2,nid3]
replace MDS nid
stop mds service on linux
Stopping /mnt/mds1 (opts:-f) on linux
start mds service on linux
Starting mds1: -o loop /tmp/lustre-mdt1 /mnt/mds1
Started lustre-MDT0000
start ost1 service on linux
Starting ost1: -o loop /tmp/lustre-ost1 /mnt/ost1
Started lustre-OST0000
mount lustre on /mnt/lustre.....
Starting client: linux: -o user_xattr,flock linux@tcp:/lustre /mnt/lustre
setup single mount lustre success
umount lustre on /mnt/lustre.....
Stopping client linux /mnt/lustre (opts:)
sh: lsof: command not found
stop ost1 service on linux
Stopping /mnt/ost1 (opts:-f) on linux
stop mds service on linux
Stopping /mnt/mds1 (opts:-f) on linux
Modules still loaded:
  ldiskfs/ldiskfs/ldiskfs.o
  lustre/mdd/mdd.o
  lustre/mgs/mgs.o
  lustre/quota/lquota.o
  lustre/mgc/mgc.o
  lustre/fid/fid.o
  lustre/fld/fld.o
  lustre/ptlrpc/ptlrpc.o
  lustre/obdclass/obdclass.o
  lustre/lvfs/lvfs.o
  lnet/klnds/socklnd/ksocklnd.o
  lnet/lnet/lnet.o
  libcfs/libcfs/libcfs.o
Stopping clients: linux /mnt/lustre (opts:)
Stopping clients: linux /mnt/lustre2 (opts:)
Loading modules from /root/lustre-master/lustre/tests/..
detected 1 online CPUs by sysfs
libcfs will create CPU partition based on online CPUs
debug=-1
subsystem_debug=all -lnet -lnd -pinger
gss/krb5 is not supported
Formatting mgs, mds, osts
Format mds1: /tmp/lustre-mdt1
Format ost1: /tmp/lustre-ost1
Format ost2: /tmp/lustre-ost2
Resetting fail_loc on all nodes...done.
PASS 66 (69s)
............== conf-sanity test complete, duration 113 sec == 15:31:10 (1363678270)
The "Modules still loaded" messages above show that module references are left behind, and this prevents some of my new tests, placed after test 66, from removing and reloading the Lustre kernel modules. The root cause is that the "lctl replace_nids" implementation can leak lu_envs on certain error paths.
I'll post a patch shortly.
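For reference, a minimal sketch of the leak pattern is below. Only lu_env_init(), lu_env_fini(), LCT_MG_THREAD and struct obd_device are taken from the Lustre tree; the function name, argument checks and error values are made up for illustration and are not the actual MGS implementation.

/*
 * Hypothetical sketch of the error-path leak; not the real MGS code.
 */
#include <linux/errno.h>
#include <obd.h>		/* struct obd_device */
#include <lu_object.h>		/* lu_env_init(), lu_env_fini(), LCT_MG_THREAD */

static int replace_nids_sketch(struct obd_device *mgs_obd,
			       char *devname, char *nids)
{
	struct lu_env env;
	int rc;

	rc = lu_env_init(&env, LCT_MG_THREAD);
	if (rc != 0)
		return rc;	/* env never initialized, nothing to undo */

	if (devname == NULL || nids == NULL) {
		/*
		 * The leak: returning directly from an error path like this
		 * ("return -EINVAL;") skips lu_env_fini() below, so the env
		 * keeps its context references and the modules stay pinned.
		 */
		rc = -EINVAL;
		goto out;
	}

	/* ... locate the target device and rewrite its NIDs in the MGS
	 * configuration logs (omitted) ... */

out:
	lu_env_fini(&env);	/* must run on every path after a successful init */
	return rc;
}

The fix is simply to route every post-init error path through the cleanup label so that lu_env_fini() always runs before returning.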