Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4054

HSM llog_handle leak

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.5.0
    • Lustre 2.5.0
    • 3
    • 10871

    Description

      Running sanity-hsm I see that we're leaking a log_handle. You can see it by mounting, running full sanity-hsm, then unmounting or by mounting, running 107 222a 222b 224 226 250, then unmounting.

      # export OSTSIZE=1000000
      # export agt1_HOST=t
      # export PTLDEBUG="vfstrace rpctrace dlmtrace neterror ha config ioctl super hsm other malloc info"
      # export DEBUG_SIZE=2048
      
      # dir=/tmp/hsm-leak.$$
      # mkdir $direxport OSTSIZE=1000000
      # export agt1_HOST=t
      # export PTLDEBUG="vfstrace rpctrace dlmtrace neterror ha config ioctl super hsm other malloc info"
      # export DEBUG_SIZE=2048     
      # dir=/tmp/hsm-leak.$$
      # mkdir $dir || exit 1
      # echo $dir
      # rm -rf /tmp/debug-leak.* /tmp/lustre-log.* /tmp/sanity-hsm.log /tmp/test_logs/
      # export ONLY="107 222a 222b 224 226 250"
      # dmesg -c > /dev/null
      # llmount.sh
      # sh ~/lustre-release/lustre/tests/sanity-hsm.sh
      # llmountcleanup.sh
      # dmesg > $dir/dmesg
      # mv /tmp/debug-leak.* /tmp/lustre-log.* /tmp/sanity-hsm.log /tmp/test_logs/ $dir/
      
      Stopping clients: t /mnt/lustre (opts:)
      Stopping clients: t /mnt/lustre2 (opts:)
      
      ...
      only running test 107 222a 222b 224 226 250
      excepting tests: 31a 34 35 36 200 201 221 223a 223b 225
      Killing existing copytools on t
      Set HSM on and start
      Changed after 0s: from '' to 'stopped'
      Waiting 20 secs for update
      Updated after 7s: wanted 'enabled' got 'enabled'
      Start copytool
      Purging archive on t
      Starting copytool agt1 on t
      Set sanity-hsm HSM policy
      
      == sanity-hsm test 107: Copytool re-register after MDS restart == 16:48:19 (1380750499)
      Wakeup copytool agt1 on t
      Changed after 0s: from '' to 'STARTED'
      Waiting 100 secs for update
      Failing mds1 on t
      Stopping /mnt/mds1 (opts:) on t
      reboot facets: mds1
      Failover mds1 to t
      16:48:31 (1380750511) waiting for t network 900 secs ...
      16:48:31 (1380750511) network interface is UP
      mount facets: mds1
      Starting mds1:   -o loop /tmp/lustre-mdt1 /mnt/mds1
      Started lustre-MDT0000
      t: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
      Changed after 0s: from '' to 'STARTED'
      Waiting 100 secs for update
      Copytool is stopped on t
      Resetting fail_loc on all nodes...done.
      PASS 107 (16s)
      
      
      == sanity-hsm test 222a: Changelog for explicit restore == 16:48:35 (1380750515)
      Purging archive on t
      Starting copytool agt1 on t
      lhsmtool_posix[18491]: action=1 src=d0.sanity-hsm/d222/f.sanity-hsm.222a dst=/mnt/lustre/d0.sanity-hsm/d222/f.sanity-hsm.222a mount_point=/mnt/lustre
      lhsmtool_posix[18491]: importing '/mnt/lustre/d0.sanity-hsm/d222/f.sanity-hsm.222a' from '/tmp/arc1/d0.sanity-hsm/d222/f.sanity-hsm.222a'
      lhsmtool_posix[18491]: imported '/mnt/lustre/d0.sanity-hsm/d222/f.sanity-hsm.222a' from '/tmp/arc1/0006/0000/0400/0000/0002/0000/0x200000400:0x6:0x0'=='/tmp/arc1/d0.sanity-hsm/d222/f.sanity-hsm.222a'
      lhsmtool_posix[18491]: process finished, errs: 0 major, 0 minor, rc=0 (Success)
      mdd.lustre-MDT0000.changelog_mask=+hsm
      Changed after 0s: from '' to 'STARTED'
      Waiting 100 secs for update
      Copytool is stopped on t
      lustre-MDT0000: Deregistered changelog user 'cl1'
      Resetting fail_loc on all nodes...done.
      PASS 222a (3s)
      
      == sanity-hsm test 222b: Changelog for implicit restore == 16:48:38 (1380750518)
      Purging archive on t
      Starting copytool agt1 on t
      mdd.lustre-MDT0000.changelog_mask=+hsm
      Changed after 0s: from '' to 'STARTED'
      Waiting 100 secs for update
      8e700881220db1269deb14847f7ffd4d  /mnt/lustre/d0.sanity-hsm/d222/f.sanity-hsm.222b
      Copytool is stopped on t
      lustre-MDT0000: Deregistered changelog user 'cl2'
      Resetting fail_loc on all nodes...done.
      PASS 222b (2s)
      
      
      == sanity-hsm test 224: Changelog for remove == 16:48:40 (1380750520)
      Purging archive on t
      Starting copytool agt1 on t
      mdd.lustre-MDT0000.changelog_mask=+hsm
      Changed after 0s: from '' to 'STARTED'
      Waiting 100 secs for update
      Copytool is stopped on t
      lustre-MDT0000: Deregistered changelog user 'cl3'
      Resetting fail_loc on all nodes...done.
      PASS 224 (3s)
      
      
      == sanity-hsm test 226: changelog for last rm/mv with exiting archive == 16:48:43 (1380750523)
      Purging archive on t
      Starting copytool agt1 on t
      0x200000401:0x7:0x0
      mdd.lustre-MDT0000.changelog_mask=+hsm
      Changed after 0s: from '' to 'STARTED'
      Waiting 100 secs for update
      Changed after 0s: from '' to 'STARTED'
      Waiting 100 secs for update
      Copytool is stopped on t
      lustre-MDT0000: Deregistered changelog user 'cl4'
      Resetting fail_loc on all nodes...done.
      PASS 226 (4s)
      
      
      == sanity-hsm test 250: Coordinator max request == 16:48:47 (1380750527)
      Purging archive on t
      Starting copytool agt1 on t
      mdt.lustre-MDT0000.hsm_control=disabled
      10+0 records in
      10+0 records out
      10485760 bytes (10 MB) copied, 1.87602 s, 5.6 MB/s
      10+0 records in
      10+0 records out
      10485760 bytes (10 MB) copied, 1.69116 s, 6.2 MB/s
      10+0 records in
      10+0 records out
      10485760 bytes (10 MB) copied, 1.75049 s, 6.0 MB/s
      10+0 records in
      10+0 records out
      10485760 bytes (10 MB) copied, 1.83329 s, 5.7 MB/s
      10+0 records in
      10+0 records out
      10485760 bytes (10 MB) copied, 1.66484 s, 6.3 MB/s
      10+0 records in
      10+0 records out
      10485760 bytes (10 MB) copied, 1.65743 s, 6.3 MB/s
      10+0 records in
      10+0 records out
      10485760 bytes (10 MB) copied, 1.65997 s, 6.3 MB/s
      10+0 records in
      10+0 records out
      10485760 bytes (10 MB) copied, 1.65964 s, 6.3 MB/s
      10+0 records in
      10+0 records out
      10485760 bytes (10 MB) copied, 1.65768 s, 6.3 MB/s
      mdt.lustre-MDT0000.hsm_control=enabled
      max=3 started=2 waiting=6
      ...
      max=3 started=0 waiting=0
      Copytool is stopped on t
      Resetting fail_loc on all nodes...done.
      PASS 250 (107s)
      == sanity-hsm test complete, duration 145 sec == 16:50:34 (1380750634)
      Stopping clients: t /mnt/lustre2 (opts:)
      Stopping client t /mnt/lustre2 opts:
      Stopping clients: t /mnt/lustre (opts:-f)
      Stopping client t /mnt/lustre opts:-f
      Stopping clients: t /mnt/lustre2 (opts:-f)
      Stopping /mnt/mds1 (opts:-f) on t
      Stopping /mnt/ost1 (opts:-f) on t
      Stopping /mnt/ost2 (opts:-f) on t
      LustreError: 21879:0:(class_obd.c:730:cleanup_obdclass()) obd_memory max: 70547704, leaked: 8376
      
      # cd /tmp/hsm-leak.*
      # perl /root/lustre-release/lustre/tests/leak_finder.pl debug-leak.* 2>&1 | grep Leak
      *** Leak: 184 bytes allocated at ffff88019d691840 (llog.c:llog_alloc_handle:66, debug file line 77331)
      *** Leak: 8192 bytes allocated at ffff88018e8c2000 (llog.c:llog_init_handle:215, debug file line 77334)
      

      Attachments

        Activity

          People

            jhammond John Hammond
            jhammond John Hammond
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: