Details
Type: Bug
Resolution: Duplicate
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.5.1
Components: None
Environment: Linux lustre-utils01 3.8.0-37-generic #53~precise1-Ubuntu SMP Wed Feb 19 21:37:54 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Severity: 3
Rank: 13083
Description
We have a virtual machine which runs various admin functions, one of which is forcing group ownership in certain directories.
This may be related to LU-2613.
The behaviour has improved: what happens now is that slab usage increases until we get the following messages in /var/log/kernel:
Mar 12 09:16:32 lustre-utils01 kernel: [153486.607665] LustreError: 11-0: lus13-MDT0000-mdc-ffff8804145e5000: Communicating with 172.17.117.153@tcp, operation mds_close failed with -107.
Mar 12 09:16:32 lustre-utils01 kernel: [153486.608011] Lustre: lus13-MDT0000-mdc-ffff8804145e5000: Connection to lus13-MDT0000 (at 172.17.117.153@tcp) was lost; in progress operations using this service will wait for recovery to complete
Mar 12 09:16:32 lustre-utils01 kernel: [153486.609483] LustreError: 167-0: lus13-MDT0000-mdc-ffff8804145e5000: This client was evicted by lus13-MDT0000; in progress operations using this service will fail.
Mar 12 09:16:33 lustre-utils01 kernel: [153486.972130] LustreError: 27292:0:(file.c:174:ll_close_inode_openhandle()) lus13-clilmv-ffff8804145e5000: inode [0x2000067f4:0xcf79:0x0] mdc close failed: rc = -5
Mar 12 09:16:33 lustre-utils01 kernel: [153486.972388] LustreError: 27292:0:(file.c:174:ll_close_inode_openhandle()) Skipped 90999 previous similar messages
Mar 12 09:16:33 lustre-utils01 kernel: [153487.479055] LustreError: 24002:0:(file.c:174:ll_close_inode_openhandle()) lus13-clilmv-ffff8804145e5000: inode [0x200006759:0x9ecd:0x0] mdc close failed: rc = -108
Mar 12 09:16:33 lustre-utils01 kernel: [153487.479273] LustreError: 24002:0:(file.c:174:ll_close_inode_openhandle()) Skipped 5037 previous similar messages
Mar 12 09:16:34 lustre-utils01 kernel: [153488.487923] LustreError: 28028:0:(file.c:174:ll_close_inode_openhandle()) lus13-clilmv-ffff8804145e5000: inode [0x200006397:0xf7e7:0x0] mdc close failed: rc = -108
Mar 12 09:16:34 lustre-utils01 kernel: [153488.488135] LustreError: 28028:0:(file.c:174:ll_close_inode_openhandle()) Skipped 9000 previous similar messages
Mar 12 09:16:36 lustre-utils01 kernel: [153490.496356] LustreError: 28925:0:(file.c:174:ll_close_inode_openhandle()) lus13-clilmv-ffff8804145e5000: inode [0x2000074b3:0x19e1a:0x0] mdc close failed: rc = -108
Mar 12 09:16:36 lustre-utils01 kernel: [153490.496569] LustreError: 28925:0:(file.c:174:ll_close_inode_openhandle()) Skipped 18377 previous similar messages
Mar 12 09:16:40 lustre-utils01 kernel: [153494.504137] LustreError: 24884:0:(file.c:174:ll_close_inode_openhandle()) lus13-clilmv-ffff8804145e5000: inode [0x200000404:0x47b1:0x0] mdc close failed: rc = -108
Mar 12 09:16:40 lustre-utils01 kernel: [153494.504365] LustreError: 24884:0:(file.c:174:ll_close_inode_openhandle()) Skipped 36626 previous similar messages
Mar 12 09:16:48 lustre-utils01 kernel: [153502.506987] LustreError: 28928:0:(file.c:174:ll_close_inode_openhandle()) lus13-clilmv-ffff8804145e5000: inode [0x200000667:0x13997:0x0] mdc close failed: rc = -108
Mar 12 09:16:48 lustre-utils01 kernel: [153502.507204] LustreError: 28928:0:(file.c:174:ll_close_inode_openhandle()) Skipped 68840 previous similar messages
Mar 12 09:17:04 lustre-utils01 kernel: [153518.507317] LustreError: 3503:0:(file.c:174:ll_close_inode_openhandle()) lus13-clilmv-ffff8804145e5000: inode [0x200000712:0x8c0f:0x0] mdc close failed: rc = -108
Mar 12 09:17:04 lustre-utils01 kernel: [153518.507569] LustreError: 3503:0:(file.c:174:ll_close_inode_openhandle()) Skipped 169045 previous similar messages
Mar 12 09:17:32 lustre-utils01 kernel: [153546.597304] LustreError: 1033:0:(mdc_locks.c:904:mdc_enqueue()) lus13-MDT0000-mdc-ffff8804145e5000: ldlm_cli_enqueue failed: rc = -108
Mar 12 09:17:32 lustre-utils01 kernel: [153546.597482] LustreError: 1033:0:(mdc_locks.c:904:mdc_enqueue()) Skipped 173 previous similar messages
Mar 12 09:17:32 lustre-utils01 kernel: [153546.597618] LustreError: 1033:0:(file.c:3196:ll_inode_revalidate_fini()) lus13: revalidate FID [0x1900001:0xe5eff204:0x0] error: rc = -108
Mar 12 09:17:32 lustre-utils01 kernel: [153546.597792] LustreError: 1033:0:(file.c:3196:ll_inode_revalidate_fini()) Skipped 232 previous similar messages
Mar 12 09:17:33 lustre-utils01 kernel: [153546.859806] LustreError: 3527:0:(mdc_locks.c:904:mdc_enqueue()) lus13-MDT0000-mdc-ffff8804145e5000: ldlm_cli_enqueue failed: rc = -108
Mar 12 09:17:33 lustre-utils01 kernel: [153546.859985] LustreError: 3527:0:(mdc_locks.c:904:mdc_enqueue()) Skipped 20 previous similar messages
Mar 12 09:17:33 lustre-utils01 kernel: [153546.860122] LustreError: 3527:0:(file.c:3196:ll_inode_revalidate_fini()) lus13: revalidate FID [0x1900001:0xe5eff204:0x0] error: rc = -108
Mar 12 09:17:33 lustre-utils01 kernel: [153546.860284] LustreError: 3527:0:(file.c:3196:ll_inode_revalidate_fini()) Skipped 20 previous similar messages
Mar 12 09:17:33 lustre-utils01 kernel: [153547.378561] LustreError: 3573:0:(mdc_locks.c:904:mdc_enqueue()) lus13-MDT0000-mdc-ffff8804145e5000: ldlm_cli_enqueue failed: rc = -108
Mar 12 09:17:33 lustre-utils01 kernel: [153547.378740] LustreError: 3573:0:(mdc_locks.c:904:mdc_enqueue()) Skipped 45 previous similar messages
Mar 12 09:17:33 lustre-utils01 kernel: [153547.378966] LustreError: 3573:0:(file.c:3196:ll_inode_revalidate_fini()) lus13: revalidate FID [0x1900001:0xe5eff204:0x0] error: rc = -108
Mar 12 09:17:33 lustre-utils01 kernel: [153547.379129] LustreError: 3573:0:(file.c:3196:ll_inode_revalidate_fini()) Skipped 45 previous similar messages
Mar 12 09:17:34 lustre-utils01 kernel: [153548.399065] LustreError: 3677:0:(mdc_locks.c:904:mdc_enqueue()) lus13-MDT0000-mdc-ffff8804145e5000: ldlm_cli_enqueue failed: rc = -108
Mar 12 09:17:34 lustre-utils01 kernel: [153548.399243] LustreError: 3677:0:(mdc_locks.c:904:mdc_enqueue()) Skipped 103 previous similar messages
Mar 12 09:17:34 lustre-utils01 kernel: [153548.399380] LustreError: 3677:0:(file.c:3196:ll_inode_revalidate_fini()) lus13: revalidate FID [0x1900001:0xe5eff204:0x0] error: rc = -108
Mar 12 09:17:34 lustre-utils01 kernel: [153548.399541] LustreError: 3677:0:(file.c:3196:ll_inode_revalidate_fini()) Skipped 103 previous similar messages
Mar 12 09:17:36 lustre-utils01 kernel: [153550.497531] LustreError: 3503:0:(file.c:174:ll_close_inode_openhandle()) lus13-clilmv-ffff8804145e5000: inode [0x200006038:0x306:0x0] mdc close failed: rc = -108
Mar 12 09:17:36 lustre-utils01 kernel: [153550.497748] LustreError: 3503:0:(file.c:174:ll_close_inode_openhandle()) Skipped 246788 previous similar messages
Mar 12 09:17:37 lustre-utils01 kernel: [153550.950531] Lustre: lus13-MDT0000-mdc-ffff8804145e5000: Connection restored to lus13-MDT0000 (at 172.17.117.153@tcp)
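For reference, a minimal way to watch the slab growth on the client while the cron job runs (plain procfs, nothing Lustre-specific) is a loop along these lines:
while true; do
    date
    # Slab is the total; SReclaimable/SUnreclaim show how much of it the kernel could give back
    grep -E '^(Slab|SReclaimable|SUnreclaim)' /proc/meminfo
    sleep 60
done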
This is the atop -l output from before the eviction:
ATOP - lustre-utils01 2014/03/12 09:03:14 ------ 10s elapsed
PRC | sys 5.25s | user 3.48s | #proc 224 | #zombie 0 | #exit 444 |
CPU | sys 53% | user 42% | irq 5% | idle 0% | wait 0% |
CPL | avg1 11.58 | avg5 13.64 | avg15 9.28 | csw 158679 | intr 33749 |
MEM | tot 15.7G | free 647.2M | cache 680.5M | buff 14.8M | slab 13.3G |
SWP | tot 37.0G | free 37.0G | | vmcom 407.8M | vmlim 44.8G |
LVM | --dev64-root | busy 0% | read 0 | write 32 | avio 0.12 ms |
DSK | sda | busy 0% | read 0 | write 4 | avio 1.00 ms |
NET | transport | tcpi 31250 | tcpo 31103 | udpi 0 | udpo 0 |
NET | network | ipi 31250 | ipo 31103 | ipfrw 0 | deliv 31250 |
NET | site ---- | pcki 31255 | pcko 31103 | si 19 Mbps | so 18 Mbps |
PID SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPU CMD 1/42
27448 0.29s 2.57s 2108K 2072K 0K 108K -- - R 29% cf-agent
27183 1.76s 0.03s 0K 0K 0K 0K -- - S 18% lfs
1999 1.21s 0.00s 0K 0K 0K 0K -- - S 12% socknal_sd00_0
27439 0.76s 0.00s 0K 0K 0K 0K -- - S 8% ll_sa_27183
2003 0.39s 0.00s 0K 0K 0K 0K -- - S 4% ptlrpcd_0
2004 0.37s 0.00s 0K 0K 0K 0K -- - S 4% ptlrpcd_1
28583 0.03s 0.06s 0K 0K - - NE 0 E 1% <lspci>
28533 0.02s 0.06s 0K 0K - - NE 0 E 1% <lspci>
28633 0.03s 0.05s 0K 0K - - NE 0 E 1% <lspci>
28708 0.02s 0.06s 0K 0K - - NE 0 E 1% <lspci>
28758 0.03s 0.05s 0K 0K - - NE 0 E 1% <lspci>
And after the eviction:
ATOP - lustre-utils01 2014/03/12 09:28:14 ------ 10s elapsed
PRC | sys 0.07s | user 0.02s | #proc 214 | #zombie 0 | #exit 0 |
CPU | sys 0% | user 0% | irq 0% | idle 99% | wait 0% |
CPL | avg1 0.01 | avg5 5.18 | avg15 13.98 | csw 650 | intr 670 |
MEM | tot 15.7G | free 7.1G | cache 316.9M | buff 10.4M | slab 7.2G |
SWP | tot 37.0G | free 37.0G | | vmcom 388.7M | vmlim 44.8G |
LVM | --dev64-root | busy 0% | read 0 | write 3 | avio 0.00 ms |
DSK | sda | busy 0% | read 0 | write 2 | avio 0.00 ms |
NET | transport | tcpi 288 | tcpo 366 | udpi 1 | udpo 1 |
NET | network | ipi 289 | ipo 367 | ipfrw 0 | deliv 289 |
NET | site ---- | pcki 294 | pcko 367 | si 93 Kbps | so 106 Kbps |
PID SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPU CMD 1/1
4270 0.03s 0.02s 0K 0K 0K 0K -- - R 1% atop
492 0.01s 0.00s 0K 0K 0K 0K -- - S 0% rsyslogd
1987 0.01s 0.00s 0K 0K 0K 0K -- - S 0% ntpd
1817 0.01s 0.00s 0K 0K 0K 0K -- - S 0% nrpe
1999 0.01s 0.00s 0K 0K 0K 0K -- - S 0% socknal_sd00_0
1839 0.00s 0.00s 0K 0K 0K 0K -- - S 0% snmpd
260 0.00s 0.00s 0K 0K 0K 4K -- - S 0% jbd2/dm-0-8
The script that we run to provoke the error is:
#!/bin/bash
# run from /etc/cron.d/lustre_scratch107 on isg-disc-mon-03
if [ ! -d /lustre/scratch113/._DO_NOT_DELETE_THIS_FILE ] ;
then
echo "Scratch113 not mounted"
exit 1 ;
fi
project_path=/lustre/scratch113/projects
team_path=/lustre/scratch113/teams
teams=( anderson barrett barroso carter deloukas durbin hgi hurles mcginnis palotie sandhu soranzo tyler-smith zeggini )
groups=( team152 team143 team35 team70 team147 team118 hgi team29 team111 team128 team149 team151 team19 team144 )
projects=`lfs find --maxdepth 0 ${project_path}/* --print0 | xargs -L 1 -0 basename`
test ${#groups[*]} -eq ${#teams[*]} || ( echo "count of groups and teams do not match" && exit 1 )
for project in ${projects}; do lfs find ${project_path}/${project} --print0 ! --group ${project} | xargs -0 -r chgrp -h ${project} ; done
for i in `seq 0 $((${#groups[*]}-1))`; do lfs find ${team_path}/${teams[${i}]}/* --print0 ! --group ${groups[${i}]} | xargs -0 -r chgrp -h ${groups[${i}]}; done
lfs find ${project_path} ${team_path} -type d -print0 | xargs -0 -r stat --printf="%n\0%a\n" | gawk 'BEGIN{FS="\0"; ORS="\0";}!and(rshift(strtonum("0"$2),10),1){print $1}' | xargs -0 -r chmod g+s
(cd /lustre/scratch113/sinbin/ ; for i in *; do j=`echo $i | sed -e 's/.gz$//'`; chown $j ; chgrp -h hgi $i; chmod 750 $i ; done )
#
# RT 304761 fix ownership on actual directory
chmod 755 /lustre/scratch113/sinbin/
chgrp -h hgi /lustre/scratch113/sinbin/
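Given the possible link to LU-2613, a sketch of what could be checked on the client while the script runs, assuming the standard ldlm parameters exposed through lctl (this is not part of the cron job itself):
# number of DLM locks the client holds against the MDC namespace, plus the LRU size
lctl get_param ldlm.namespaces.*mdc*.lock_count
lctl get_param ldlm.namespaces.*mdc*.lru_size
# if locks are what the slab is made of, cancelling unused locks should release it
lctl set_param ldlm.namespaces.*mdc*.lru_size=clear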
And the display as it dies is:
+ for project in '${projects}'
+ xargs -0 -r chgrp -h cichlid
+ lfs find /lustre/scratch113/projects/cichlid --print0 '!' --group cichlid
error: find failed for cichlid.
+ for project in '${projects}'
+ xargs -0 -r chgrp -h cloten
+ lfs find /lustre/scratch113/projects/cloten --print0 '!' --group cloten
error opening /lustre/scratch113/projects/cloten: Cannot send after transport endpoint shutdown (108)
llapi_semantic_traverse: Failed to open '/lustre/scratch113/projects/cloten': Cannot send after transport endpoint shutdown (108)
error: find failed for cloten.
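For completeness, the eviction can also be confirmed from the client side via the mdc import state (assuming the usual lctl import parameter):
# after the eviction the import should show DISCONN/CONNECTING rather than FULL
lctl get_param mdc.lus13-MDT0000-mdc-*.import | grep -E 'state|current_connection'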
I note that "echo 3 > /proc/sys/vm/drop_caches" has the following effect:
MEM | tot 15.7G | free 7.1G | cache 317.3M | buff 10.8M | slab 7.2G |
echo 3 > /proc/sys/vm/drop_caches
MEM | tot 15.7G | free 14.5G | cache 65.2M | buff 3.9M | slab 128.3M |
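To see which caches the slab is actually made of, a one-shot slabtop listing sorted by cache size (run before and after the drop_caches) gives a quick comparison:
# -o prints once and exits, -s c sorts by cache size
slabtop -o -s c | head -20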