Details

Type: Bug
Resolution: Fixed
Priority: Major
Affects Version/s: Lustre 2.7.0, Lustre 2.8.0
Description
TEI-3148 brought out a problem with sanity-hsm test 500 causing a memory leak. sanity-hsm test 500 calls the tests in llapi_hsm_test.c; in particular, test #2 of llapi_hsm_test.c is causing, or bringing out, the memory leak. Interestingly, for ZFS the memory leak prevents Lustre from shutting down and the test times out after an hour, while ldiskfs runs shut down cleanly. In all cases no error is detected and the test status is PASS. This ticket is to report the memory leak caused by llapi_hsm_test test #2.
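For reference, the llapi_hsm_test copytool tests are built around the lustreapi copytool registration calls. Below is a minimal, illustrative sketch of that register/unregister cycle, assuming the 2.7-era lustreapi signatures; the mount point and archive ID are placeholders, and the exact body of test #2 is in lustre/tests/llapi_hsm_test.c:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <lustre/lustreapi.h>

int main(void)
{
        struct hsm_copytool_private *ctdata = NULL;
        int archive = 1;        /* illustrative archive ID */
        int rc;

        /* register this process as an HSM copytool for one archive */
        rc = llapi_hsm_copytool_register(&ctdata, "/mnt/lustre",
                                         1, &archive, 0);
        if (rc < 0) {
                fprintf(stderr, "register failed: %s\n", strerror(-rc));
                return EXIT_FAILURE;
        }

        /* each successful register is paired with an unregister; the
         * leak reported in this ticket shows up even with correct
         * pairing in userspace, so the unfreed memory is on the
         * kernel side */
        rc = llapi_hsm_copytool_unregister(&ctdata);
        if (rc < 0) {
                fprintf(stderr, "unregister failed: %s\n", strerror(-rc));
                return EXIT_FAILURE;
        }
        return EXIT_SUCCESS;
}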
I ran a series of tests to pinpoint which test was causing the leak. For the last test, I hard coded test 500 to fail in the hope that dumping the logs would tell us something, but a quick look over the logs didn't show anything obviously wrong. The test patch is at http://review.whamcloud.com/#/c/14337/ and the latest results it produced, running only sanity-hsm test 500 and subtest #2, are at https://testing.hpdd.intel.com/test_sessions/cff1ffc8-e91e-11e4-9e6e-5254006e85c2. From the end of the suite_stdout log, we see:
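For anyone reproducing this, the standard test-framework conventions apply; a sketch of the approach described above follows (the forced-failure line is illustrative, not the actual patch):

# run only test 500 from sanity-hsm
ONLY=500 bash lustre/tests/sanity-hsm.sh

# inside test_500() in sanity-hsm.sh, a hard coded failure such as
#     error "forced failure to collect logs"
# makes the framework treat the test as FAIL and dump the debug logs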
17:09:46:Stopping /mnt/ost2 (opts:-f) on shadow-25vm3
17:09:46:CMD: shadow-25vm3 umount -d -f /mnt/ost2
17:09:57:CMD: shadow-25vm3 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
17:09:57:CMD: shadow-25vm3 ! zpool list -H lustre-ost2 >/dev/null 2>&1 ||
17:09:57:    grep -q ^lustre-ost2/ /proc/mounts ||
17:09:57:    zpool export lustre-ost2
17:09:57:0022
17:09:57:CMD: shadow-25vm2.shadow.whamcloud.com lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
17:09:57:LustreError: 10994:0:(class_obd.c:680:cleanup_obdclass()) obd_memory max: 3236953, leaked: 48
17:09:57:
17:09:57:Memory leaks detected
18:10:18:********** Timeout by autotest system **********
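The "leaked: 48" figure comes from obdclass's memory accounting: OBD_ALLOC()/OBD_FREE() maintain the obd_memory counter, and module cleanup reports any nonzero remainder. A simplified sketch of that check, paraphrasing obdclass/class_obd.c (the exact code varies by version):

/* simplified: the real check is in cleanup_obdclass() */
__u64 leaked = obd_memory_sum();  /* bytes still accounted to OBD_ALLOC */
__u64 max = obd_memory_max();     /* high-water mark over the run */

CDEBUG(leaked > 0 ? D_ERROR : D_INFO,
       "obd_memory max: %llu, leaked: %llu\n", max, leaked);

A constant 48 bytes leaked per run is therefore consistent with a single small structure allocated through OBD_ALLOC() never being freed.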
For a review-dne-part-2 failure on master, results are at https://testing.hpdd.intel.com/test_sets/b89ca232-e8d4-11e4-9e6e-5254006e85c2. From the suite_stdout log:
07:17:53:Stopping /mnt/ost8 (opts:-f) on shadow-48vm4
07:17:53:CMD: shadow-48vm4 umount -d -f /mnt/ost8
07:17:53:CMD: shadow-48vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
07:17:53:0022
07:17:53:CMD: shadow-48vm6.shadow.whamcloud.com lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
07:18:05:LustreError: 17219:0:(class_obd.c:680:cleanup_obdclass()) obd_memory max: 54130094, leaked: 48
07:18:05:
07:18:05:Memory leaks detected
07:18:05:sanity-hsm returned 0
07:18:05:running: ost-pools
07:18:05:run_suite ost-pools /usr/lib64/lustre/tests/ost-pools.sh
Note that the leaked amount is again 48 bytes even though obd_memory max differs between runs, which suggests the same single small allocation is leaked each time. In a b2_7 sanity-hsm run, the memory leak is reported twice at https://testing.hpdd.intel.com/test_sets/0fca08aa-ca31-11e4-b9cd-5254006e85c2 in the suite_stdout log:
07:29:53:CMD: shadow-5vm11 umount -d -f /mnt/ost2
07:30:04:CMD: shadow-5vm11 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
07:30:04:CMD: shadow-5vm11 ! zpool list -H lustre-ost2 >/dev/null 2>&1 ||
07:30:04:    grep -q ^lustre-ost2/ /proc/mounts ||
07:30:04:    zpool export lustre-ost2
07:30:04:0022
07:30:04:CMD: shadow-5vm1.shadow.whamcloud.com lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
07:30:16:LustreError: 32001:0:(class_obd.c:680:cleanup_obdclass()) obd_memory max: 23480928, leaked: 48
07:30:16:
07:30:16:Memory leaks detected
07:30:16:sanity-hsm returned 0
07:30:16:Stopping clients: shadow-5vm10,shadow-5vm1.shadow.whamcloud.com /mnt/lustre (opts:)
07:30:16:CMD: shadow-5vm10,shadow-5vm1.shadow.whamcloud.com running=\$(grep -c /mnt/lustre' ' /proc/mounts);
07:30:16:if [ \$running -ne 0 ] ; then
07:30:16:echo Stopping client \$(hostname) /mnt/lustre opts:;
07:30:16:lsof /mnt/lustre || need_kill=no;
07:30:16:if [ x != x -a x\$need_kill != xno ]; then
07:30:16:    pids=\$(lsof -t /mnt/lustre | sort -u);
07:30:16:    if [ -n \"\$pids\" ]; then
07:30:16:        kill -9 \$pids;
07:30:16:    fi
07:30:16:fi;
07:30:16:while umount /mnt/lustre 2>&1 | grep -q busy; do
07:30:16:    echo /mnt/lustre is still busy, wait one second && sleep 1;
07:30:16:done;
07:30:16:fi
07:30:16:Stopping clients: shadow-5vm10,shadow-5vm1.shadow.whamcloud.com /mnt/lustre2 (opts:)
07:30:16:CMD: shadow-5vm10,shadow-5vm1.shadow.whamcloud.com running=\$(grep -c /mnt/lustre2' ' /proc/mounts);
07:30:16:if [ \$running -ne 0 ] ; then
07:30:16:echo Stopping client \$(hostname) /mnt/lustre2 opts:;
07:30:16:lsof /mnt/lustre2 || need_kill=no;
07:30:16:if [ x != x -a x\$need_kill != xno ]; then
07:30:16:    pids=\$(lsof -t /mnt/lustre2 | sort -u);
07:30:16:    if [ -n \"\$pids\" ]; then
07:30:16:        kill -9 \$pids;
07:30:16:    fi
07:30:16:fi;
07:30:16:while umount /mnt/lustre2 2>&1 | grep -q busy; do
07:30:16:    echo /mnt/lustre2 is still busy, wait one second && sleep 1;
07:30:16:done;
07:30:16:fi
07:30:16:CMD: shadow-5vm12 grep -c /mnt/mds1' ' /proc/mounts
07:30:16:CMD: shadow-5vm12 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
07:30:16:CMD: shadow-5vm12 ! zpool list -H lustre-mdt1 >/dev/null 2>&1 ||
07:30:16:    grep -q ^lustre-mdt1/ /proc/mounts ||
07:30:16:    zpool export lustre-mdt1
07:30:16:CMD: shadow-5vm11 grep -c /mnt/ost1' ' /proc/mounts
07:30:16:CMD: shadow-5vm11 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
07:30:16:CMD: shadow-5vm11 ! zpool list -H lustre-ost1 >/dev/null 2>&1 ||
07:30:16:    grep -q ^lustre-ost1/ /proc/mounts ||
07:30:16:    zpool export lustre-ost1
07:30:16:CMD: shadow-5vm11 grep -c /mnt/ost2' ' /proc/mounts
07:30:16:CMD: shadow-5vm11 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
07:30:16:CMD: shadow-5vm11 ! zpool list -H lustre-ost2 >/dev/null 2>&1 ||
07:30:16:    grep -q ^lustre-ost2/ /proc/mounts ||
07:30:16:    zpool export lustre-ost2
07:30:16:0022
07:30:16:error: dl: No such file or directory opening /proc/fs/lustre/devices
07:30:16:opening /dev/obd failed: No such device
07:30:16:hint: the kernel modules may not be loaded
07:30:16:Error getting device list: No such device: check dmesg.
07:30:16:CMD: shadow-5vm1.shadow.whamcloud.com lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
07:30:16:LustreError: 32001:0:(class_obd.c:680:cleanup_obdclass()) obd_memory max: 23480928, leaked: 48
07:30:16:
07:30:16:mv: cannot stat `/tmp/debug': No such file or directory
07:30:16:Memory leaks detected
This problem appears to date from the day test 500 was added to sanity-hsm. The first run of test 500 in Maloo with logs, on 2014-11-27, is at https://testing.hpdd.intel.com/test_sets/d1cdb39a-7610-11e4-ad19-5254006e85c2. That run already exhibits the memory leak.