[LU-11123] LustreError in ll_xattr_list() server bug: replied size 236 > 132 Created: 05/Jul/18 Updated: 29/Jul/18 Resolved: 29/Jul/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.3, Lustre 2.10.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Stephane Thiell | Assignee: | John Hammond |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
clients: 2.10.4 clients, servers: 2.10.3 + |
||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
Hello,

Today our users started to report intermittent file access issues on Oak. I noticed the following messages on one client (2.10.4):

Jul 05 14:32:21 sh-ln01.stanford.edu kernel: LustreError: 155141:0:(xattr.c:377:ll_xattr_list()) server bug: replied size 236 > 132
Jul 05 14:32:41 sh-ln01.stanford.edu kernel: LustreError: 171588:0:(xattr.c:377:ll_xattr_list()) server bug: replied size 164 > 132
Jul 05 14:32:41 sh-ln01.stanford.edu kernel: LustreError: 171588:0:(xattr.c:377:ll_xattr_list()) Skipped 5 previous similar messages
Jul 05 14:32:47 sh-ln01.stanford.edu kernel: LustreError: 176583:0:(xattr.c:377:ll_xattr_list()) server bug: replied size 172 > 132
Jul 05 14:32:47 sh-ln01.stanford.edu kernel: LustreError: 176583:0:(xattr.c:377:ll_xattr_list()) Skipped 59 previous similar messages
Jul 05 14:33:23 sh-ln01.stanford.edu kernel: LustreError: 10776:0:(xattr.c:377:ll_xattr_list()) server bug: replied size 172 > 132
Jul 05 14:33:23 sh-ln01.stanford.edu kernel: LustreError: 10776:0:(xattr.c:377:ll_xattr_list()) Skipped 58 previous similar messages

These error messages are the only Lustre errors I can see on this impacted client. However, they are not very helpful, as I'm not even sure whether they relate to Oak or to another Lustre filesystem.

The impacted directories use ACLs, but only a few of them, fewer than 10. We have other directories with >32 ACLs and haven't seen this issue there. The issue doesn't seem to be easily reproducible either. I'm still investigating.

If you have any ideas on how to troubleshoot this, please let me know.

Thanks!
|
| Comments |
| Comment by Stephane Thiell [ 05/Jul/18 ] |
|
NOTE: I'm not actually sure I need to post here for our former Intel Oak support, please advise.

But... After further investigation, it seems that these messages could be a side effect of a known limitation of nodemapping/Lustre permissions/caching, and not the root cause of our issue, which has been identified. It would be nice to definitively fix the client inode cache on Lustre to avoid confusion, as already explained in LU-10884. |
| Comment by Peter Jones [ 06/Jul/18 ] |
|
John,

Can you please advise?

Thanks,
Peter |
| Comment by John Hammond [ 10/Jul/18 ] |
|
Hi Stephane,

Could you give any more detail about the file access issues? Is there any Samba or NFS export involved on this node?

Some thoughts:

This error message cannot be reached for the "system.posix_acl_access" xattr, but it can be for "system.posix_acl_default", and the expected and returned value sizes look right for that xattr.

It's not clear who or what is asking for "system.posix_acl_default", since this xattr is really used on the server.

Note that this message is easy to produce by creating a directory with enough default ACLs and then using setfacl or getfacl on it.

The message is a bit misleading, since the server does not actually consider the size of the client-side buffer. Instead, it just creates the reply with a large enough buffer and sends the value back. The client-side getxattr code is actually handling this correctly by returning -ERANGE.

The change https://review.whamcloud.com/#/c/32739/ (
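As a rough check of the remark that the expected and returned sizes "look right" for a POSIX ACL xattr, the sketch below converts the sizes from the log into ACL entry counts. It assumes the Linux on-disk layout for POSIX ACL xattrs (a 4-byte version header followed by 8-byte entries); the constant and function names are made up for illustration and are not Lustre identifiers.

```python
# Assumption: Linux POSIX ACL xattr layout, i.e. a 4-byte header (version)
# followed by one 8-byte record per ACL entry (tag, perm, id).
ACL_XATTR_HEADER_SIZE = 4  # hypothetical name, for illustration only
ACL_XATTR_ENTRY_SIZE = 8   # hypothetical name, for illustration only


def acl_xattr_size(entries: int) -> int:
    """Size in bytes of an ACL xattr value holding `entries` entries."""
    return ACL_XATTR_HEADER_SIZE + entries * ACL_XATTR_ENTRY_SIZE


def acl_entry_count(size: int) -> int:
    """Number of ACL entries implied by an xattr value of `size` bytes."""
    payload = size - ACL_XATTR_HEADER_SIZE
    if payload < 0 or payload % ACL_XATTR_ENTRY_SIZE != 0:
        raise ValueError(f"{size} bytes is not a valid ACL xattr size")
    return payload // ACL_XATTR_ENTRY_SIZE


if __name__ == "__main__":
    # Sizes taken from the LustreError lines in the description.
    for size in (132, 164, 172, 236):
        print(f"{size} bytes -> {acl_entry_count(size)} ACL entries")
```

Under that assumption, every size in the log is a whole number of entries: the 132-byte limit matches a 16-entry ACL, and the replied sizes 164, 172 and 236 match 20, 21 and 29 entries respectively.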
| Comment by Stephane Thiell [ 29/Jul/18 ] |
|
Hi John, |
| Comment by Peter Jones [ 29/Jul/18 ] |
|
OK, Stephane. Meanwhile, the fix is queued up for a future LTS release, so hopefully you'll get it in due course anyway. |
| Comment by Stephane Thiell [ 29/Jul/18 ] |
|
Awesome, thanks Peter. |