[LU-11123] LustreError in ll_xattr_list() server bug: replied size 236 > 132 Created: 05/Jul/18  Updated: 29/Jul/18  Resolved: 29/Jul/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.3, Lustre 2.10.4
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Stephane Thiell
Assignee: John Hammond
Resolution: Cannot Reproduce
Votes: 0
Labels: None
Environment:

clients: 2.10.4 clients, servers: 2.10.3 + LU-10783 (kernel update RHEL7.4)


Issue Links:
Related: is related to LU-11074 "Invalid argument reading file caps" (Resolved)
Severity: 3

 Description   

Hello,

Today our users started to report intermittent file access issues on Oak. I noticed the following messages on one client (2.10.4):

Jul 05 14:32:21 sh-ln01.stanford.edu kernel: LustreError: 155141:0:(xattr.c:377:ll_xattr_list()) server bug: replied size 236 > 132
Jul 05 14:32:41 sh-ln01.stanford.edu kernel: LustreError: 171588:0:(xattr.c:377:ll_xattr_list()) server bug: replied size 164 > 132
Jul 05 14:32:41 sh-ln01.stanford.edu kernel: LustreError: 171588:0:(xattr.c:377:ll_xattr_list()) Skipped 5 previous similar messages
Jul 05 14:32:47 sh-ln01.stanford.edu kernel: LustreError: 176583:0:(xattr.c:377:ll_xattr_list()) server bug: replied size 172 > 132
Jul 05 14:32:47 sh-ln01.stanford.edu kernel: LustreError: 176583:0:(xattr.c:377:ll_xattr_list()) Skipped 59 previous similar messages
Jul 05 14:33:23 sh-ln01.stanford.edu kernel: LustreError: 10776:0:(xattr.c:377:ll_xattr_list()) server bug: replied size 172 > 132
Jul 05 14:33:23 sh-ln01.stanford.edu kernel: LustreError: 10776:0:(xattr.c:377:ll_xattr_list()) Skipped 58 previous similar messages

These error messages are the only Lustre errors I can see on this impacted client. However, they are not very helpful, as I'm not even sure whether they came from Oak or another Lustre filesystem...

The impacted directories use ACLs, but only a few (fewer than 10). We have other directories with >32 ACLs and haven't seen this issue.

The issue doesn't seem to be easily reproducible either. I'm still investigating.

If you have any ideas on how to troubleshoot this, please let me know.

Thanks!
Stephane

 

 



 Comments   
Comment by Stephane Thiell [ 05/Jul/18 ]

NOTE: I'm not actually sure I need to post here for our former Intel Oak support, please advise.

But... after further investigation, it seems that these messages could be a side effect of a known limitation of nodemapping/Lustre permissions/caching rather than the root cause of our issue, which has since been identified.

It would be nice to definitively fix the client inode cache on Lustre to avoid confusion, as already explained in LU-10884.

Comment by Peter Jones [ 06/Jul/18 ]

John

Can you please advise?

Thanks

Peter

Comment by John Hammond [ 10/Jul/18 ]

Hi Stephane,

Could you give any more detail about the file access issues? Is there any Samba or NFS export involved on this node? Some thoughts:

This error message cannot be reached for the "system.posix_acl_access" xattr, but it can be for "system.posix_acl_default", and the expected and returned value sizes look right for that xattr. It's not clear who or what is asking for "system.posix_acl_default", since this xattr is really used on the server. Note that this message is easy to produce by creating a directory with enough default ACLs and then running setfacl or getfacl on it.
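As a rough check of the sizes in the log, here is a minimal sketch assuming the standard Linux POSIX ACL xattr layout (a 4-byte version header followed by 8-byte entries of tag, permissions and id). The helper names `acl_entry_count` and `decode_acl_xattr` are illustrative only and are not taken from Lustre:

```python
import struct

# Assumed layout of a POSIX ACL xattr value on Linux.
ACL_XATTR_HEADER_SIZE = 4   # little-endian u32 version
ACL_XATTR_ENTRY_SIZE = 8    # u16 tag, u16 perm, u32 id


def acl_entry_count(xattr_size: int) -> int:
    """Return the number of ACL entries implied by a POSIX ACL xattr size."""
    payload = xattr_size - ACL_XATTR_HEADER_SIZE
    if payload < 0 or payload % ACL_XATTR_ENTRY_SIZE:
        raise ValueError(f"{xattr_size} is not a valid POSIX ACL xattr size")
    return payload // ACL_XATTR_ENTRY_SIZE


def decode_acl_xattr(raw: bytes):
    """Split a raw xattr value into (version, [(tag, perm, id), ...])."""
    count = acl_entry_count(len(raw))
    (version,) = struct.unpack_from("<I", raw, 0)
    entries = [
        struct.unpack_from(
            "<HHI", raw, ACL_XATTR_HEADER_SIZE + i * ACL_XATTR_ENTRY_SIZE
        )
        for i in range(count)
    ]
    return version, entries


if __name__ == "__main__":
    # Sizes taken from the client log messages in the description.
    for size in (132, 164, 172, 236):
        print(size, "bytes ->", acl_entry_count(size), "entries")
```

Under that assumption, the 132-byte expected size holds 16 entries, while the replied sizes of 164, 172 and 236 bytes correspond to 20, 21 and 29 entries.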

The message is a bit misleading, since the server does not actually consider the size of the client-side buffer. Instead it just creates the reply with a large enough buffer and sends the value back. And the client-side getxattr code is actually handling this correctly by returning -ERANGE.
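The behaviour described here can be pictured with a toy model. This is plain Python, not Lustre code, and `reply_getxattr` and `client_getxattr` are made-up names: the "server" ignores the caller's buffer size, and the "client" compares sizes and returns -ERANGE when the value does not fit, following the usual getxattr(2) return conventions.

```python
import errno


def reply_getxattr(value: bytes) -> bytes:
    """Toy 'server': always replies with the full value, whatever was asked."""
    return value


def client_getxattr(value_on_server: bytes, buf_size: int):
    """Toy 'client' mimicking getxattr(2) return conventions.

    Returns (rc, data). rc is the value length on success, the required
    length when buf_size is 0 (a size probe), or -ERANGE when the caller's
    buffer is too small for the replied value.
    """
    reply = reply_getxattr(value_on_server)
    if buf_size == 0:
        return len(reply), b""
    if len(reply) > buf_size:
        # This is the point where a "replied size N > M" style message
        # would be logged in this model.
        return -errno.ERANGE, b""
    return len(reply), reply


if __name__ == "__main__":
    value = bytes(236)                      # a 236-byte xattr value
    print(client_getxattr(value, 132)[0])   # buffer too small
    print(client_getxattr(value, 0)[0])     # size probe
    print(client_getxattr(value, 236)[0])   # buffer large enough
```

In this model a caller that sees -ERANGE can probe with a zero-length buffer and retry with the reported size, so the oversized reply is an ordinary retry case rather than data corruption.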

The change https://review.whamcloud.com/#/c/32739/ (LU-11074 mdc: set correct body eadatasize for getxattr()) may help you avoid this situation. Would you be willing to try it?

Comment by Stephane Thiell [ 29/Jul/18 ]

Hi John,
Thanks for your reply and detailed explanation (and sorry for the delay, all notification emails from whamcloud.com got into my Clutter mailbox...).
This was on a Sherlock login node, so no SMB/NFS export involved there. We haven't seen the problem again so I don't think it's worth patching just for that at this point.

Comment by Peter Jones [ 29/Jul/18 ]

ok Stephane. Meanwhile the fix is queued up for a future LTS release so hopefully you'll get it in due course anyway.

Comment by Stephane Thiell [ 29/Jul/18 ]

Awesome, thanks Peter.
