Details
Type: Bug
Resolution: Duplicate
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.5.1
Components: None
Environment: Linux lustre-utils01 3.8.0-37-generic #53~precise1-Ubuntu SMP Wed Feb 19 21:37:54 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Severity: 3
Rank: 13083
Description
We have a virtual machine which runs various admin functions, one of which is forcing group ownership in certain directories.
This may be related to LU-2613.
The behaviour has improved: what happens now is that slab usage increases until we get the following messages in /var/log/kernel:
Mar 12 09:16:32 lustre-utils01 kernel: [153486.607665] LustreError: 11-0: lus13-MDT0000-mdc-ffff8804145e5000: Communicating with 172.17.117.153@tcp, operation mds_close failed with -107.
Mar 12 09:16:32 lustre-utils01 kernel: [153486.608011] Lustre: lus13-MDT0000-mdc-ffff8804145e5000: Connection to lus13-MDT0000 (at 172.17.117.153@tcp) was lost; in progress operations using this service will wait for recovery to complete
Mar 12 09:16:32 lustre-utils01 kernel: [153486.609483] LustreError: 167-0: lus13-MDT0000-mdc-ffff8804145e5000: This client was evicted by lus13-MDT0000; in progress operations using this service will fail.
Mar 12 09:16:33 lustre-utils01 kernel: [153486.972130] LustreError: 27292:0:(file.c:174:ll_close_inode_openhandle()) lus13-clilmv-ffff8804145e5000: inode [0x2000067f4:0xcf79:0x0] mdc close failed: rc = -5
Mar 12 09:16:33 lustre-utils01 kernel: [153486.972388] LustreError: 27292:0:(file.c:174:ll_close_inode_openhandle()) Skipped 90999 previous similar messages
Mar 12 09:16:33 lustre-utils01 kernel: [153487.479055] LustreError: 24002:0:(file.c:174:ll_close_inode_openhandle()) lus13-clilmv-ffff8804145e5000: inode [0x200006759:0x9ecd:0x0] mdc close failed: rc = -108
Mar 12 09:16:33 lustre-utils01 kernel: [153487.479273] LustreError: 24002:0:(file.c:174:ll_close_inode_openhandle()) Skipped 5037 previous similar messages
Mar 12 09:16:34 lustre-utils01 kernel: [153488.487923] LustreError: 28028:0:(file.c:174:ll_close_inode_openhandle()) lus13-clilmv-ffff8804145e5000: inode [0x200006397:0xf7e7:0x0] mdc close failed: rc = -108
Mar 12 09:16:34 lustre-utils01 kernel: [153488.488135] LustreError: 28028:0:(file.c:174:ll_close_inode_openhandle()) Skipped 9000 previous similar messages
Mar 12 09:16:36 lustre-utils01 kernel: [153490.496356] LustreError: 28925:0:(file.c:174:ll_close_inode_openhandle()) lus13-clilmv-ffff8804145e5000: inode [0x2000074b3:0x19e1a:0x0] mdc close failed: rc = -108
Mar 12 09:16:36 lustre-utils01 kernel: [153490.496569] LustreError: 28925:0:(file.c:174:ll_close_inode_openhandle()) Skipped 18377 previous similar messages
Mar 12 09:16:40 lustre-utils01 kernel: [153494.504137] LustreError: 24884:0:(file.c:174:ll_close_inode_openhandle()) lus13-clilmv-ffff8804145e5000: inode [0x200000404:0x47b1:0x0] mdc close failed: rc = -108
Mar 12 09:16:40 lustre-utils01 kernel: [153494.504365] LustreError: 24884:0:(file.c:174:ll_close_inode_openhandle()) Skipped 36626 previous similar messages
Mar 12 09:16:48 lustre-utils01 kernel: [153502.506987] LustreError: 28928:0:(file.c:174:ll_close_inode_openhandle()) lus13-clilmv-ffff8804145e5000: inode [0x200000667:0x13997:0x0] mdc close failed: rc = -108
Mar 12 09:16:48 lustre-utils01 kernel: [153502.507204] LustreError: 28928:0:(file.c:174:ll_close_inode_openhandle()) Skipped 68840 previous similar messages
Mar 12 09:17:04 lustre-utils01 kernel: [153518.507317] LustreError: 3503:0:(file.c:174:ll_close_inode_openhandle()) lus13-clilmv-ffff8804145e5000: inode [0x200000712:0x8c0f:0x0] mdc close failed: rc = -108
Mar 12 09:17:04 lustre-utils01 kernel: [153518.507569] LustreError: 3503:0:(file.c:174:ll_close_inode_openhandle()) Skipped 169045 previous similar messages
Mar 12 09:17:32 lustre-utils01 kernel: [153546.597304] LustreError: 1033:0:(mdc_locks.c:904:mdc_enqueue()) lus13-MDT0000-mdc-ffff8804145e5000: ldlm_cli_enqueue failed: rc = -108
Mar 12 09:17:32 lustre-utils01 kernel: [153546.597482] LustreError: 1033:0:(mdc_locks.c:904:mdc_enqueue()) Skipped 173 previous similar messages
Mar 12 09:17:32 lustre-utils01 kernel: [153546.597618] LustreError: 1033:0:(file.c:3196:ll_inode_revalidate_fini()) lus13: revalidate FID [0x1900001:0xe5eff204:0x0] error: rc = -108
Mar 12 09:17:32 lustre-utils01 kernel: [153546.597792] LustreError: 1033:0:(file.c:3196:ll_inode_revalidate_fini()) Skipped 232 previous similar messages
Mar 12 09:17:33 lustre-utils01 kernel: [153546.859806] LustreError: 3527:0:(mdc_locks.c:904:mdc_enqueue()) lus13-MDT0000-mdc-ffff8804145e5000: ldlm_cli_enqueue failed: rc = -108
Mar 12 09:17:33 lustre-utils01 kernel: [153546.859985] LustreError: 3527:0:(mdc_locks.c:904:mdc_enqueue()) Skipped 20 previous similar messages
Mar 12 09:17:33 lustre-utils01 kernel: [153546.860122] LustreError: 3527:0:(file.c:3196:ll_inode_revalidate_fini()) lus13: revalidate FID [0x1900001:0xe5eff204:0x0] error: rc = -108
Mar 12 09:17:33 lustre-utils01 kernel: [153546.860284] LustreError: 3527:0:(file.c:3196:ll_inode_revalidate_fini()) Skipped 20 previous similar messages
Mar 12 09:17:33 lustre-utils01 kernel: [153547.378561] LustreError: 3573:0:(mdc_locks.c:904:mdc_enqueue()) lus13-MDT0000-mdc-ffff8804145e5000: ldlm_cli_enqueue failed: rc = -108
Mar 12 09:17:33 lustre-utils01 kernel: [153547.378740] LustreError: 3573:0:(mdc_locks.c:904:mdc_enqueue()) Skipped 45 previous similar messages
Mar 12 09:17:33 lustre-utils01 kernel: [153547.378966] LustreError: 3573:0:(file.c:3196:ll_inode_revalidate_fini()) lus13: revalidate FID [0x1900001:0xe5eff204:0x0] error: rc = -108
Mar 12 09:17:33 lustre-utils01 kernel: [153547.379129] LustreError: 3573:0:(file.c:3196:ll_inode_revalidate_fini()) Skipped 45 previous similar messages
Mar 12 09:17:34 lustre-utils01 kernel: [153548.399065] LustreError: 3677:0:(mdc_locks.c:904:mdc_enqueue()) lus13-MDT0000-mdc-ffff8804145e5000: ldlm_cli_enqueue failed: rc = -108
Mar 12 09:17:34 lustre-utils01 kernel: [153548.399243] LustreError: 3677:0:(mdc_locks.c:904:mdc_enqueue()) Skipped 103 previous similar messages
Mar 12 09:17:34 lustre-utils01 kernel: [153548.399380] LustreError: 3677:0:(file.c:3196:ll_inode_revalidate_fini()) lus13: revalidate FID [0x1900001:0xe5eff204:0x0] error: rc = -108
Mar 12 09:17:34 lustre-utils01 kernel: [153548.399541] LustreError: 3677:0:(file.c:3196:ll_inode_revalidate_fini()) Skipped 103 previous similar messages
Mar 12 09:17:36 lustre-utils01 kernel: [153550.497531] LustreError: 3503:0:(file.c:174:ll_close_inode_openhandle()) lus13-clilmv-ffff8804145e5000: inode [0x200006038:0x306:0x0] mdc close failed: rc = -108
Mar 12 09:17:36 lustre-utils01 kernel: [153550.497748] LustreError: 3503:0:(file.c:174:ll_close_inode_openhandle()) Skipped 246788 previous similar messages
Mar 12 09:17:37 lustre-utils01 kernel: [153550.950531] Lustre: lus13-MDT0000-mdc-ffff8804145e5000: Connection restored to lus13-MDT0000 (at 172.17.117.153@tcp)
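For reference, a minimal way to watch the slab growth on the client while the cron job runs (plain procfs, nothing Lustre-specific) is a loop along these lines:
while true; do
    date
    # Slab is the total; SReclaimable/SUnreclaim show how much of it the kernel could give back
    grep -E '^(Slab|SReclaimable|SUnreclaim)' /proc/meminfo
    sleep 60
done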
This is the atop -l output from before the eviction:
ATOP - lustre-utils01 2014/03/12 09:03:14 ------ 10s elapsed
PRC | sys 5.25s | user 3.48s | #proc 224 | #zombie 0 | #exit 444 |
CPU | sys 53% | user 42% | irq 5% | idle 0% | wait 0% |
CPL | avg1 11.58 | avg5 13.64 | avg15 9.28 | csw 158679 | intr 33749 |
MEM | tot 15.7G | free 647.2M | cache 680.5M | buff 14.8M | slab 13.3G |
SWP | tot 37.0G | free 37.0G | | vmcom 407.8M | vmlim 44.8G |
LVM | --dev64-root | busy 0% | read 0 | write 32 | avio 0.12 ms |
DSK | sda | busy 0% | read 0 | write 4 | avio 1.00 ms |
NET | transport | tcpi 31250 | tcpo 31103 | udpi 0 | udpo 0 |
NET | network | ipi 31250 | ipo 31103 | ipfrw 0 | deliv 31250 |
NET | site ---- | pcki 31255 | pcko 31103 | si 19 Mbps | so 18 Mbps |
PID SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPU CMD 1/42
27448 0.29s 2.57s 2108K 2072K 0K 108K -- - R 29% cf-agent
27183 1.76s 0.03s 0K 0K 0K 0K -- - S 18% lfs
1999 1.21s 0.00s 0K 0K 0K 0K -- - S 12% socknal_sd00_0
27439 0.76s 0.00s 0K 0K 0K 0K -- - S 8% ll_sa_27183
2003 0.39s 0.00s 0K 0K 0K 0K -- - S 4% ptlrpcd_0
2004 0.37s 0.00s 0K 0K 0K 0K -- - S 4% ptlrpcd_1
28583 0.03s 0.06s 0K 0K - - NE 0 E 1% <lspci>
28533 0.02s 0.06s 0K 0K - - NE 0 E 1% <lspci>
28633 0.03s 0.05s 0K 0K - - NE 0 E 1% <lspci>
28708 0.02s 0.06s 0K 0K - - NE 0 E 1% <lspci>
28758 0.03s 0.05s 0K 0K - - NE 0 E 1% <lspci>
And after the eviction:
ATOP - lustre-utils01 2014/03/12 09:28:14 ------ 10s elapsed
PRC | sys 0.07s | user 0.02s | #proc 214 | #zombie 0 | #exit 0 |
CPU | sys 0% | user 0% | irq 0% | idle 99% | wait 0% |
CPL | avg1 0.01 | avg5 5.18 | avg15 13.98 | csw 650 | intr 670 |
MEM | tot 15.7G | free 7.1G | cache 316.9M | buff 10.4M | slab 7.2G |
SWP | tot 37.0G | free 37.0G | | vmcom 388.7M | vmlim 44.8G |
LVM | --dev64-root | busy 0% | read 0 | write 3 | avio 0.00 ms |
DSK | sda | busy 0% | read 0 | write 2 | avio 0.00 ms |
NET | transport | tcpi 288 | tcpo 366 | udpi 1 | udpo 1 |
NET | network | ipi 289 | ipo 367 | ipfrw 0 | deliv 289 |
NET | site ---- | pcki 294 | pcko 367 | si 93 Kbps | so 106 Kbps |
PID SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPU CMD 1/1
4270 0.03s 0.02s 0K 0K 0K 0K -- - R 1% atop
492 0.01s 0.00s 0K 0K 0K 0K -- - S 0% rsyslogd
1987 0.01s 0.00s 0K 0K 0K 0K -- - S 0% ntpd
1817 0.01s 0.00s 0K 0K 0K 0K -- - S 0% nrpe
1999 0.01s 0.00s 0K 0K 0K 0K -- - S 0% socknal_sd00_0
1839 0.00s 0.00s 0K 0K 0K 0K -- - S 0% snmpd
260 0.00s 0.00s 0K 0K 0K 4K -- - S 0% jbd2/dm-0-8
The script that we run to provoke the error is:
#!/bin/bash
# run from /etc/cron.d/lustre_scratch107 on isg-disc-mon-03
if [ ! -d /lustre/scratch113/._DO_NOT_DELETE_THIS_FILE ] ;
then
echo "Scratch113 not mounted"
exit 1 ;
fi
project_path=/lustre/scratch113/projects
team_path=/lustre/scratch113/teams
teams=( anderson barrett barroso carter deloukas durbin hgi hurles mcginnis palotie sandhu soranzo tyler-smith zeggini )
groups=( team152 team143 team35 team70 team147 team118 hgi team29 team111 team128 team149 team151 team19 team144 )
projects=`lfs find --maxdepth 0 ${project_path}/* --print0 | xargs -L 1 -0 basename`
test ${#groups[*]} -eq ${#teams[*]} || ( echo "count of groups and teams do not match" && exit 1 )
for project in ${projects}; do lfs find ${project_path}/${project} --print0 ! --group ${project} | xargs -0 -r chgrp -h ${project} ; done
for i in `seq 0 $((${#groups[*]}-1))`; do lfs find ${team_path}/${teams[${i}]}/* --print0 ! --group ${groups[${i}]} | xargs -0 -r chgrp -h ${groups[${i}]}; done
lfs find ${project_path} ${team_path} -type d -print0 | xargs -0 -r stat --printf="%n\0%a\n" | gawk 'BEGIN{FS="\0"; ORS="\0";}!and(rshift(strtonum("0"$2),10),1){print $1}' | xargs -0 -r chmod g+s
(cd /lustre/scratch113/sinbin/ ; for i in *; do j=`echo $i | sed -e 's/.gz$//'`; chown $j ; chgrp -h hgi $i; chmod 750 $i ; done )
#
# RT 304761 fix ownership on actual directory
chmod 755 /lustre/scratch113/sinbin/
chgrp -h hgi /lustre/scratch113/sinbin/
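Given the possible link to LU-2613, a sketch of what could be checked on the client while the script runs, assuming the standard ldlm parameters exposed through lctl (this is not part of the cron job itself):
# number of DLM locks the client holds against the MDC namespace, plus the LRU size
lctl get_param ldlm.namespaces.*mdc*.lock_count
lctl get_param ldlm.namespaces.*mdc*.lru_size
# if locks are what the slab is made of, cancelling unused locks should release it
lctl set_param ldlm.namespaces.*mdc*.lru_size=clear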
And the display as it dies is:
+ for project in '${projects}'
+ xargs -0 -r chgrp -h cichlid
+ lfs find /lustre/scratch113/projects/cichlid --print0 '!' --group cichlid
error: find failed for cichlid.
+ for project in '${projects}'
+ xargs -0 -r chgrp -h cloten
+ lfs find /lustre/scratch113/projects/cloten --print0 '!' --group cloten
error opening /lustre/scratch113/projects/cloten: Cannot send after transport endpoint shutdown (108)
llapi_semantic_traverse: Failed to open '/lustre/scratch113/projects/cloten': Cannot send after transport endpoint shutdown (108)
error: find failed for cloten.
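For completeness, the eviction can also be confirmed from the client side via the mdc import state (assuming the usual lctl import parameter):
# after the eviction the import should show DISCONN/CONNECTING rather than FULL
lctl get_param mdc.lus13-MDT0000-mdc-*.import | grep -E 'state|current_connection'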
I note that "echo 3 > /proc/sys/vm/drop_caches" has the following effect:
MEM | tot 15.7G | free 7.1G | cache 317.3M | buff 10.8M | slab 7.2G |
echo 3 > /proc/sys/vm/drop_caches
MEM | tot 15.7G | free 14.5G | cache 65.2M | buff 3.9M | slab 128.3M |
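To see which caches the slab is actually made of, a one-shot slabtop listing sorted by cache size (run before and after the drop_caches) gives a quick comparison:
# -o prints once and exits, -s c sorts by cache size
slabtop -o -s c | head -20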