LU-2988: conf-sanity 66: Modules still loaded

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.4.0
    • Affects Version/s: Lustre 2.4.0
    • Labels: None
    • Severity: 3
    • Rank: 7281

    Description

      It is easy to reproduce this on a single VM by running only conf-sanity 66:

      == conf-sanity test 66: replace nids == 15:30:00 (1363678200)
      Loading modules from /root/lustre-master/lustre/tests/..
      detected 1 online CPUs by sysfs
      libcfs will create CPU partition based on online CPUs
      debug=-1
      subsystem_debug=all -lnet -lnd -pinger
      gss/krb5 is not supported
      quota/lquota options: 'hash_lqs_cur_bits=3'
      start mds service on linux
      Starting mds1:   -o loop /tmp/lustre-mdt1 /mnt/mds1
      Started lustre-MDT0000
      start ost1 service on linux
      Starting ost1:   -o loop /tmp/lustre-ost1 /mnt/ost1
      Started lustre-OST0000
      mount lustre on /mnt/lustre.....
      Starting client: linux: -o user_xattr,flock linux@tcp:/lustre /mnt/lustre
      replace_nids should fail if MDS, OSTs and clients are UP
      error: replace_nids: Operation now in progress
      umount lustre on /mnt/lustre.....
      Stopping client linux /mnt/lustre (opts:)
      sh: lsof: command not found
      replace_nids should fail if MDS and OSTs are UP
      error: replace_nids: Operation now in progress
      stop ost1 service on linux
      Stopping /mnt/ost1 (opts:-f) on linux
      replace_nids should fail if MDS is UP
      error: replace_nids: Operation now in progress
      stop mds service on linux
      Stopping /mnt/mds1 (opts:-f) on linux
      start mds service on linux
      Starting mds1: -o nosvc,loop  /tmp/lustre-mdt1 /mnt/mds1
      Started lustre-MDT0000
      command should accept two parameters
      replace primary NIDs for a device
      usage: replace_nids <device> <nid1>[,nid2,nid3]
      correct device name should be passed
      error: replace_nids: Invalid argument
      wrong nids list should not destroy the system
      replace primary NIDs for a device
      usage: replace_nids <device> <nid1>[,nid2,nid3]
      replace OST nid
      command should accept two parameters
      replace primary NIDs for a device
      usage: replace_nids <device> <nid1>[,nid2,nid3]
      wrong nids list should not destroy the system
      replace primary NIDs for a device
      usage: replace_nids <device> <nid1>[,nid2,nid3]
      replace MDS nid
      stop mds service on linux
      Stopping /mnt/mds1 (opts:-f) on linux
      start mds service on linux
      Starting mds1:   -o loop /tmp/lustre-mdt1 /mnt/mds1
      Started lustre-MDT0000
      start ost1 service on linux
      Starting ost1:   -o loop /tmp/lustre-ost1 /mnt/ost1
      Started lustre-OST0000
      mount lustre on /mnt/lustre.....
      Starting client: linux: -o user_xattr,flock linux@tcp:/lustre /mnt/lustre
      setup single mount lustre success
      umount lustre on /mnt/lustre.....
      Stopping client linux /mnt/lustre (opts:)
      sh: lsof: command not found
      stop ost1 service on linux
      Stopping /mnt/ost1 (opts:-f) on linux
      stop mds service on linux
      Stopping /mnt/mds1 (opts:-f) on linux
      Modules still loaded: 
      ldiskfs/ldiskfs/ldiskfs.o lustre/mdd/mdd.o lustre/mgs/mgs.o lustre/quota/lquota.o lustre/mgc/mgc.o lustre/fid/fid.o lustre/fld/fld.o lustre/ptlrpc/ptlrpc.o lustre/obdclass/obdclass.o lustre/lvfs/lvfs.o lnet/klnds/socklnd/ksocklnd.o lnet/lnet/lnet.o libcfs/libcfs/libcfs.o
      Stopping clients: linux /mnt/lustre (opts:)
      Stopping clients: linux /mnt/lustre2 (opts:)
      Loading modules from /root/lustre-master/lustre/tests/..
      detected 1 online CPUs by sysfs
      libcfs will create CPU partition based on online CPUs
      debug=-1
      subsystem_debug=all -lnet -lnd -pinger
      gss/krb5 is not supported
      Formatting mgs, mds, osts
      Format mds1: /tmp/lustre-mdt1
      Format ost1: /tmp/lustre-ost1
      Format ost2: /tmp/lustre-ost2
      Resetting fail_loc on all nodes...done.
      PASS 66 (69s)
      ............== conf-sanity test complete, duration 113 sec == 15:31:10 (1363678270)
      

      This prevents some of my new tests, which are placed after 66, from removing and reloading Lustre kernel modules. The root cause is that the "lctl replace_nids" implementation may leak lu_envs when certain errors happen.

      I'll post a patch shortly.
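
      To illustrate the kind of leak described above, here is a minimal, self-contained sketch. It is not the actual Lustre code or the patch for this ticket. The names demo_env, demo_env_init, demo_env_fini, replace_nids_leaky and replace_nids_fixed are hypothetical stand-ins, and the argument check is only an assumed example of an error path. The sketch shows how an early return after initializing an environment skips the matching teardown, and how routing every exit through a single cleanup label avoids that.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an lu_env-like context (not the Lustre type). */
struct demo_env {
	int initialized;
};

/* Number of contexts that were initialized but never finalized. */
static int live_envs;

static int demo_env_init(struct demo_env *env)
{
	env->initialized = 1;
	live_envs++;
	return 0;
}

static void demo_env_fini(struct demo_env *env)
{
	assert(env->initialized); /* finalizing twice would be a bug */
	env->initialized = 0;
	live_envs--;
}

static int demo_live_envs(void)
{
	return live_envs;
}

/* Leaky variant: the early return on bad arguments skips demo_env_fini(). */
static int replace_nids_leaky(const char *device, const char *nids)
{
	struct demo_env env;
	int rc;

	rc = demo_env_init(&env);
	if (rc != 0)
		return rc;

	if (device == NULL || nids == NULL || nids[0] == '\0')
		return -EINVAL; /* env is still counted as live here */

	/* ... the real work would go here ... */
	demo_env_fini(&env);
	return 0;
}

/* Fixed variant: every path after init leaves through the cleanup label. */
static int replace_nids_fixed(const char *device, const char *nids)
{
	struct demo_env env;
	int rc;

	rc = demo_env_init(&env);
	if (rc != 0)
		return rc;

	if (device == NULL || nids == NULL || nids[0] == '\0') {
		rc = -EINVAL;
		goto out;
	}

	/* ... the real work would go here ... */
	rc = 0;
out:
	demo_env_fini(&env);
	return rc;
}
```

      Calling replace_nids_leaky() with an empty NID list returns -EINVAL and leaves demo_live_envs() at 1, while the same call to replace_nids_fixed() returns -EINVAL and leaves the count unchanged. In a kernel module, a context left behind this way can keep resources pinned, which is consistent with the "Modules still loaded" message in the log.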

          People

            Assignee: Li Wei (Inactive)
            Reporter: Li Wei (Inactive)