[LU-5546] Client getting "revalidate FID" errors, a lot of them Created: 26/Aug/14  Updated: 24/Mar/18  Resolved: 24/Mar/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.2
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Haisong Cai (Inactive) Assignee: Emoly Liu
Resolution: Won't Fix Votes: 0
Labels: sdsc
Environment:

Lustre server 2.4.2
Linux dolphin-mds-9-2.local 2.6.32-358.23.2.el6_lustre.x86_64 #1 SMP Thu Dec 19 19:57:45 PST 2013 x86_64 x86_64 x86_64 GNU/Linux


Severity: 3
Rank (Obsolete): 15465

 Description   

Aug 26 07:55:01 tscc-login1 postfix/qmgr[3091]: 80B853C070: removed
Aug 26 10:33:10 tscc-login1 kernel: LustreError: 11295:0:(file.c:3077:ll_inode_revalidate_fini()) dolphin: revalidate FID [0x200005416:0x15ef:0x0] error: rc = -116
Aug 26 10:33:10 tscc-login1 kernel: LustreError: 11475:0:(file.c:3077:ll_inode_revalidate_fini()) dolphin: revalidate FID [0x200005416:0x15ef:0x0] error: rc = -116
Aug 26 10:33:10 tscc-login1 kernel: LustreError: 11475:0:(file.c:3077:ll_inode_revalidate_fini()) Skipped 239 previous similar messages
Aug 26 10:33:11 tscc-login1 kernel: LustreError: 11784:0:(file.c:3077:ll_inode_revalidate_fini()) dolphin: revalidate FID [0x200005416:0x15ef:0x0] error: rc = -116
Aug 26 10:33:11 tscc-login1 kernel: LustreError: 11784:0:(file.c:3077:ll_inode_revalidate_fini()) Skipped 408 previous similar messages
Aug 26 10:33:13 tscc-login1 kernel: LustreError: 12494:0:(file.c:3077:ll_inode_revalidate_fini()) dolphin: revalidate FID [0x200005416:0x15ef:0x0] error: rc = -116
Aug 26 10:33:13 tscc-login1 kernel: LustreError: 12494:0:(file.c:3077:ll_inode_revalidate_fini()) Skipped 942 previous similar messages
Aug 26 10:33:17 tscc-login1 kernel: LustreError: 13818:0:(file.c:3077:ll_inode_revalidate_fini()) dolphin: revalidate FID [0x200005416:0x15ef:0x0] error: rc = -116
Aug 26 10:33:17 tscc-login1 kernel: LustreError: 13818:0:(file.c:3077:ll_inode_revalidate_fini()) Skipped 1743 previous similar messages
Aug 26 10:33:25 tscc-login1 kernel: LustreError: 16658:0:(file.c:3077:ll_inode_revalidate_fini()) dolphin: revalidate FID [0x200005416:0x15ef:0x0] error: rc = -116
Aug 26 10:33:25 tscc-login1 kernel: LustreError: 16658:0:(file.c:3077:ll_inode_revalidate_fini()) Skipped 3763 previous similar messages
Aug 26 10:33:41 tscc-login1 kernel: LustreError: 22267:0:(file.c:3077:ll_inode_revalidate_fini()) dolphin: revalidate FID [0x200005416:0x15ef:0x0] error: rc = -116
Aug 26 10:33:41 tscc-login1 kernel: LustreError: 22267:0:(file.c:3077:ll_inode_revalidate_fini()) Skipped 7418 previous similar messages
Aug 26 10:34:13 tscc-login1 kernel: LustreError: 34084:0:(file.c:3077:ll_inode_revalidate_fini()) dolphin: revalidate FID [0x200005416:0x15ef:0x0] error: rc = -116
Aug 26 10:34:13 tscc-login1 kernel: LustreError: 34084:0:(file.c:3077:ll_inode_revalidate_fini()) Skipped 15550 previous similar messages
Aug 26 10:35:17 tscc-login1 kernel: LustreError: 57966:0:(file.c:3077:ll_inode_revalidate_fini()) dolphin: revalidate FID [0x200005416:0x15ef:0x0] error: rc = -116
Aug 26 10:35:17 tscc-login1 kernel: LustreError: 57966:0:(file.c:3077:ll_inode_revalidate_fini()) Skipped 31707 previous similar messages
Aug 26 10:37:25 tscc-login1 kernel: LustreError: 47314:0:(file.c:3077:ll_inode_revalidate_fini()) dolphin: revalidate FID [0x200005416:0x15ef:0x0] error: rc = -116
Aug 26 10:37:25 tscc-login1 kernel: LustreError: 47314:0:(file.c:3077:ll_inode_revalidate_fini()) Skipped 69915 previous similar messages
Aug 26 10:41:41 tscc-login1 kernel: LustreError: 20908:0:(file.c:3077:ll_inode_revalidate_fini()) dolphin: revalidate FID [0x200005416:0x15ef:0x0] error: rc = -116
Aug 26 10:41:41 tscc-login1 kernel: LustreError: 20908:0:(file.c:3077:ll_inode_revalidate_fini()) Skipped 140545 previous similar messages
Aug 26 10:50:13 tscc-login1 kernel: LustreError: 4158:0:(file.c:3077:ll_inode_revalidate_fini()) dolphin: revalidate FID [0x200005416:0x15ef:0x0] error: rc = -116
Aug 26 10:50:13 tscc-login1 kernel: LustreError: 4158:0:(file.c:3077:ll_inode_revalidate_fini()) Skipped 346337 previous similar messages
Aug 26 11:00:13 tscc-login1 kernel: LustreError: 50703:0:(file.c:3077:ll_inode_revalidate_fini()) dolphin: revalidate FID [0x200005416:0x15ef:0x0] error: rc = -116
Aug 26 11:00:13 tscc-login1 kernel: LustreError: 50703:0:(file.c:3077:ll_inode_revalidate_fini()) Skipped 347381 previous similar messages
Aug 26 11:10:13 tscc-login1 kernel: LustreError: 50839:0:(file.c:3077:ll_inode_revalidate_fini()) dolphin: revalidate FID [0x200005416:0x15ef:0x0] error: rc = -116
Aug 26 11:10:13 tscc-login1 kernel: LustreError: 50839:0:(file.c:3077:ll_inode_revalidate_fini()) Skipped 386631 previous similar messages
Aug 26 11:20:13 tscc-login1 kernel: LustreError: 57406:0:(file.c:3077:ll_inode_revalidate_fini()) dolphin: revalidate FID [0x200005416:0x15ef:0x0] error: rc = -116
Aug 26 11:20:13 tscc-login1 kernel: LustreError: 57406:0:(file.c:3077:ll_inode_revalidate_fini()) Skipped 397281 previous similar messages
Aug 26 11:30:13 tscc-login1 kernel: LustreError: 64118:0:(file.c:3077:ll_inode_revalidate_fini()) dolphin: revalidate FID [0x200005416:0x15ef:0x0] error: rc = -116
Aug 26 11:30:13 tscc-login1 kernel: LustreError: 64118:0:(file.c:3077:ll_inode_revalidate_fini()) Skipped 395513 previous similar messages
...

Aug 26 11:50:13 tscc-login1 kernel: LustreError: 18441:0:(file.c:3077:ll_inode_revalidate_fini()) dolphin: revalidate FID [0x200005416:0x15ef:0x0] error: rc = -116
Aug 26 11:50:13 tscc-login1 kernel: LustreError: 18441:0:(file.c:3077:ll_inode_revalidate_fini()) Skipped 402712 previous similar messages



 Comments   
Comment by Peter Jones [ 27/Aug/14 ]

Emoly

Does this seem related to LU-4084/4522? If not, what do you suggest here?

Thanks

Peter

Comment by Emoly Liu [ 28/Aug/14 ]

I will look into this one.

Comment by Emoly Liu [ 29/Aug/14 ]

Hi Cai,
Did this happen on a NFS export? because -116 means stale NFS handle.
Are there any error messages on MDS? If possible, could you please upload MDS logs? Thanks.

Comment by Andreas Dilger [ 29/Aug/14 ]

This message shouldn't be printed onto the console at all for -ESTALE (-116), since this is a common error that can be hit during normal operation. There is some underlying problem that is causing the revalidate to be called repeatedly (400k calls in 20s by the end of the log) that also needs to be fixed.

Cai, is there a client that is hung during this time? The question of NFS exports is also important. Also, it would be great if you could get a stack trace for when this is happening (ideally through systemtap, but also possibly via repeatedly trying sysrq-p) to see what the callpath is. It isn't clear why the caller isn't handling -ESTALE correctly and actually moving on to some new FID.

Comment by Haisong Cai (Inactive) [ 29/Aug/14 ]

Hi Andreas,

The error showed after we rebooted MDS (because of LU-5503) which took about 1hour.

We have since reboot that client as well and it is now error free.
I will remember to collect more info next when I see it.

thanks,
Haisong

Comment by Peter Jones [ 24/Mar/18 ]

SDSC have moved onto more current releases so I do not think any further work is needed here

Generated at Sat Feb 10 01:52:25 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.