[LU-4054] HSM llog_handle leak Created: 03/Oct/13  Updated: 04/Oct/13  Resolved: 04/Oct/13

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.5.0

Type: Bug Priority: Blocker
Reporter: John Hammond Assignee: John Hammond
Resolution: Fixed Votes: 0
Labels: HSM

Severity: 3
Rank (Obsolete): 10871

 Description   

Running sanity-hsm I see that we're leaking an llog_handle. You can reproduce it by mounting, running the full sanity-hsm suite, then unmounting, or by mounting, running only tests 107 222a 222b 224 226 250, then unmounting.

# export OSTSIZE=1000000
# export agt1_HOST=t
# export PTLDEBUG="vfstrace rpctrace dlmtrace neterror ha config ioctl super hsm other malloc info"
# export DEBUG_SIZE=2048

# dir=/tmp/hsm-leak.$$
# mkdir $dir || exit 1
# echo $dir
# rm -rf /tmp/debug-leak.* /tmp/lustre-log.* /tmp/sanity-hsm.log /tmp/test_logs/
# export ONLY="107 222a 222b 224 226 250"
# dmesg -c > /dev/null
# llmount.sh
# sh ~/lustre-release/lustre/tests/sanity-hsm.sh
# llmountcleanup.sh
# dmesg > $dir/dmesg
# mv /tmp/debug-leak.* /tmp/lustre-log.* /tmp/sanity-hsm.log /tmp/test_logs/ $dir/
Stopping clients: t /mnt/lustre (opts:)
Stopping clients: t /mnt/lustre2 (opts:)

...
only running test 107 222a 222b 224 226 250
excepting tests: 31a 34 35 36 200 201 221 223a 223b 225
Killing existing copytools on t
Set HSM on and start
Changed after 0s: from '' to 'stopped'
Waiting 20 secs for update
Updated after 7s: wanted 'enabled' got 'enabled'
Start copytool
Purging archive on t
Starting copytool agt1 on t
Set sanity-hsm HSM policy

== sanity-hsm test 107: Copytool re-register after MDS restart == 16:48:19 (1380750499)
Wakeup copytool agt1 on t
Changed after 0s: from '' to 'STARTED'
Waiting 100 secs for update
Failing mds1 on t
Stopping /mnt/mds1 (opts:) on t
reboot facets: mds1
Failover mds1 to t
16:48:31 (1380750511) waiting for t network 900 secs ...
16:48:31 (1380750511) network interface is UP
mount facets: mds1
Starting mds1:   -o loop /tmp/lustre-mdt1 /mnt/mds1
Started lustre-MDT0000
t: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
Changed after 0s: from '' to 'STARTED'
Waiting 100 secs for update
Copytool is stopped on t
Resetting fail_loc on all nodes...done.
PASS 107 (16s)


== sanity-hsm test 222a: Changelog for explicit restore == 16:48:35 (1380750515)
Purging archive on t
Starting copytool agt1 on t
lhsmtool_posix[18491]: action=1 src=d0.sanity-hsm/d222/f.sanity-hsm.222a dst=/mnt/lustre/d0.sanity-hsm/d222/f.sanity-hsm.222a mount_point=/mnt/lustre
lhsmtool_posix[18491]: importing '/mnt/lustre/d0.sanity-hsm/d222/f.sanity-hsm.222a' from '/tmp/arc1/d0.sanity-hsm/d222/f.sanity-hsm.222a'
lhsmtool_posix[18491]: imported '/mnt/lustre/d0.sanity-hsm/d222/f.sanity-hsm.222a' from '/tmp/arc1/0006/0000/0400/0000/0002/0000/0x200000400:0x6:0x0'=='/tmp/arc1/d0.sanity-hsm/d222/f.sanity-hsm.222a'
lhsmtool_posix[18491]: process finished, errs: 0 major, 0 minor, rc=0 (Success)
mdd.lustre-MDT0000.changelog_mask=+hsm
Changed after 0s: from '' to 'STARTED'
Waiting 100 secs for update
Copytool is stopped on t
lustre-MDT0000: Deregistered changelog user 'cl1'
Resetting fail_loc on all nodes...done.
PASS 222a (3s)

== sanity-hsm test 222b: Changelog for implicit restore == 16:48:38 (1380750518)
Purging archive on t
Starting copytool agt1 on t
mdd.lustre-MDT0000.changelog_mask=+hsm
Changed after 0s: from '' to 'STARTED'
Waiting 100 secs for update
8e700881220db1269deb14847f7ffd4d  /mnt/lustre/d0.sanity-hsm/d222/f.sanity-hsm.222b
Copytool is stopped on t
lustre-MDT0000: Deregistered changelog user 'cl2'
Resetting fail_loc on all nodes...done.
PASS 222b (2s)


== sanity-hsm test 224: Changelog for remove == 16:48:40 (1380750520)
Purging archive on t
Starting copytool agt1 on t
mdd.lustre-MDT0000.changelog_mask=+hsm
Changed after 0s: from '' to 'STARTED'
Waiting 100 secs for update
Copytool is stopped on t
lustre-MDT0000: Deregistered changelog user 'cl3'
Resetting fail_loc on all nodes...done.
PASS 224 (3s)


== sanity-hsm test 226: changelog for last rm/mv with exiting archive == 16:48:43 (1380750523)
Purging archive on t
Starting copytool agt1 on t
0x200000401:0x7:0x0
mdd.lustre-MDT0000.changelog_mask=+hsm
Changed after 0s: from '' to 'STARTED'
Waiting 100 secs for update
Changed after 0s: from '' to 'STARTED'
Waiting 100 secs for update
Copytool is stopped on t
lustre-MDT0000: Deregistered changelog user 'cl4'
Resetting fail_loc on all nodes...done.
PASS 226 (4s)


== sanity-hsm test 250: Coordinator max request == 16:48:47 (1380750527)
Purging archive on t
Starting copytool agt1 on t
mdt.lustre-MDT0000.hsm_control=disabled
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 1.87602 s, 5.6 MB/s
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 1.69116 s, 6.2 MB/s
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 1.75049 s, 6.0 MB/s
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 1.83329 s, 5.7 MB/s
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 1.66484 s, 6.3 MB/s
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 1.65743 s, 6.3 MB/s
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 1.65997 s, 6.3 MB/s
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 1.65964 s, 6.3 MB/s
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 1.65768 s, 6.3 MB/s
mdt.lustre-MDT0000.hsm_control=enabled
max=3 started=2 waiting=6
...
max=3 started=0 waiting=0
Copytool is stopped on t
Resetting fail_loc on all nodes...done.
PASS 250 (107s)
== sanity-hsm test complete, duration 145 sec == 16:50:34 (1380750634)
Stopping clients: t /mnt/lustre2 (opts:)
Stopping client t /mnt/lustre2 opts:
Stopping clients: t /mnt/lustre (opts:-f)
Stopping client t /mnt/lustre opts:-f
Stopping clients: t /mnt/lustre2 (opts:-f)
Stopping /mnt/mds1 (opts:-f) on t
Stopping /mnt/ost1 (opts:-f) on t
Stopping /mnt/ost2 (opts:-f) on t
LustreError: 21879:0:(class_obd.c:730:cleanup_obdclass()) obd_memory max: 70547704, leaked: 8376
# cd /tmp/hsm-leak.*
# perl /root/lustre-release/lustre/tests/leak_finder.pl debug-leak.* 2>&1 | grep Leak
*** Leak: 184 bytes allocated at ffff88019d691840 (llog.c:llog_alloc_handle:66, debug file line 77331)
*** Leak: 8192 bytes allocated at ffff88018e8c2000 (llog.c:llog_init_handle:215, debug file line 77334)
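
The two reported allocations, 184 + 8192 bytes, add up exactly to the 8376 bytes of leaked obd_memory above: one struct llog_handle from llog_alloc_handle() plus the 8192-byte log header that llog_init_handle() attaches to it. That is consistent with a handle that was opened and initialized but never released through llog_close() (or llog_cat_close() for a catalog), which frees both pieces. The user-space sketch below is purely illustrative, with made-up my_log_* names rather than the Lustre API, but it shows the same leak shape; a leak detector such as AddressSanitizer or valgrind reports exactly those two allocations.

/*
 * Hypothetical user-space analogue of the leaked handle pattern.
 * The names (my_log_handle, my_log_open, ...) are illustrative only,
 * not the Lustre API.  Build with -fsanitize=address (or run under
 * valgrind) to see the two leaked allocations: the ~184-byte handle
 * struct and its 8192-byte header buffer, mirroring the
 * leak_finder.pl report above.
 */
#include <stdio.h>
#include <stdlib.h>

#define HDR_SIZE 8192           /* analogue of the 8192-byte llog header */

struct my_log_handle {
        char name[176];         /* sized so the struct is ~184 bytes */
        char *hdr;              /* header buffer, attached at init time */
};

/* analogue of llog_alloc_handle() + llog_open(): allocate the handle */
static struct my_log_handle *my_log_open(const char *name)
{
        struct my_log_handle *h = calloc(1, sizeof(*h));

        if (h)
                snprintf(h->name, sizeof(h->name), "%s", name);
        return h;
}

/* analogue of llog_init_handle(): attach the header buffer */
static int my_log_init(struct my_log_handle *h)
{
        h->hdr = calloc(1, HDR_SIZE);
        return h->hdr ? 0 : -1;
}

/* analogue of llog_close(): release header and handle together */
static void my_log_close(struct my_log_handle *h)
{
        free(h->hdr);
        free(h);
}

/* buggy caller: bails out on an error path without closing the handle */
static int process_records(int fail)
{
        struct my_log_handle *h = my_log_open("hsm_actions");

        if (!h || my_log_init(h) < 0)
                return -1;
        if (fail)
                return -1;      /* BUG: leaks the handle and its header */
        my_log_close(h);
        return 0;
}

int main(void)
{
        process_records(1);     /* leak detector reports two allocations */
        return 0;
}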


 Comments   
Comment by John Hammond [ 03/Oct/13 ]

Please see http://review.whamcloud.com/7847.

Comment by John Hammond [ 04/Oct/13 ]

Patch landed to master.
