Lustre / LU-11123

LustreError in ll_xattr_list() server bug: replied size 236 > 132

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.10.3, Lustre 2.10.4
    • None
    • clients: 2.10.4 clients, servers: 2.10.3 + LU-10783 (kernel update RHEL7.4)
    • 3
    • 9223372036854775807

    Description

      Hello,

      Today our users started to report intermittent file access issues on Oak. I noticed the following messages on one client (2.10.4):

      Jul 05 14:32:21 sh-ln01.stanford.edu kernel: LustreError: 155141:0:(xattr.c:377:ll_xattr_list()) server bug: replied size 236 > 132
      Jul 05 14:32:41 sh-ln01.stanford.edu kernel: LustreError: 171588:0:(xattr.c:377:ll_xattr_list()) server bug: replied size 164 > 132
      Jul 05 14:32:41 sh-ln01.stanford.edu kernel: LustreError: 171588:0:(xattr.c:377:ll_xattr_list()) Skipped 5 previous similar messages
      Jul 05 14:32:47 sh-ln01.stanford.edu kernel: LustreError: 176583:0:(xattr.c:377:ll_xattr_list()) server bug: replied size 172 > 132
      Jul 05 14:32:47 sh-ln01.stanford.edu kernel: LustreError: 176583:0:(xattr.c:377:ll_xattr_list()) Skipped 59 previous similar messages
      Jul 05 14:33:23 sh-ln01.stanford.edu kernel: LustreError: 10776:0:(xattr.c:377:ll_xattr_list()) server bug: replied size 172 > 132
      Jul 05 14:33:23 sh-ln01.stanford.edu kernel: LustreError: 10776:0:(xattr.c:377:ll_xattr_list()) Skipped 58 previous similar messages
      

      These error messages are the only Lustre errors I can see on this impacted client; however, they are not very helpful, as I'm not even sure whether it happened on Oak or on another Lustre filesystem...

      The impacted directories are using ACLs, but only a few (fewer than 10). We have other directories with >32 ACLs and haven't seen this issue there.

      The issue doesn't seem to be easily reproducible either. I'm still investigating.

      If you have any ideas on how to troubleshoot this, please let me know.

      Thanks!
      Stephane

Attachments

Issue Links

Activity


            sthiell Stephane Thiell added a comment -

            Awesome, thanks Peter.
            pjones Peter Jones added a comment -

            ok Stephane. Meanwhile the fix is queued up for a future LTS release so hopefully you'll get it in due course anyway.


            Hi John,
            Thanks for your reply and detailed explanation (and sorry for the delay, all notification emails from whamcloud.com got into my Clutter mailbox...).
            This was on a Sherlock login node, so no SMB/NFS export involved there. We haven't seen the problem again so I don't think it's worth patching just for that at this point.

            jhammond John Hammond added a comment -

            Hi Stephane,

            Could you give any more detail about the file access issues? Is there any Samba or NFS export involved on this node? Some thoughts:

            This error message cannot be reached for the "system.posix_acl_access" xattr, but it can be for "system.posix_acl_default", and the expected and returned value sizes look right for that xattr. It's not clear who or what is asking for "system.posix_acl_default", since this xattr is really only used on the server. Note that this message is easy to produce by creating a directory with enough default ACLs and then using setfacl or getfacl on it.
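The size arithmetic here can be sanity-checked. A minimal sketch, assuming the value is a POSIX ACL xattr in the kernel's posix_acl_xattr wire format (a 4-byte version header followed by one 8-byte tag/perm/id record per ACL entry); mapping the logged sizes to entry counts is an inference, not something the log itself states:

```python
# Decode the replied sizes from the log, assuming the value is a
# posix_acl_xattr blob: 4-byte header + 8 bytes per ACL entry.
ACL_XATTR_HEADER = 4  # le32 version
ACL_XATTR_ENTRY = 8   # le16 tag + le16 perm + le32 id

def acl_entries(xattr_size: int) -> int:
    """Number of ACL entries a posix_acl_xattr value of this size holds."""
    body = xattr_size - ACL_XATTR_HEADER
    assert body % ACL_XATTR_ENTRY == 0, "not a valid posix_acl_xattr size"
    return body // ACL_XATTR_ENTRY

for size in (236, 164, 172, 132):
    print(f"{size} bytes -> {acl_entries(size)} ACL entries")
# 236 -> 29, 164 -> 20, 172 -> 21; the 132-byte client buffer holds 16.
```

Under that assumption, the server replied with default ACLs of 29, 20, and 21 entries, while the client had sized its reply buffer for 16.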

            The message is a bit misleading, since the server does not actually consider the size of the client-side buffer. Instead it just creates the reply with a large enough buffer and sends the value back. And the client-side getxattr code is actually handling this correctly by returning -ERANGE.
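That -ERANGE return is what the standard userspace pattern around listxattr(2)/getxattr(2) is built for: probe with a zero-size call to learn the needed buffer size, allocate, and retry if the value grew between the two calls. A minimal sketch via ctypes (Linux and glibc assumed; list_xattrs is an illustrative helper name, not a Lustre API):

```python
import ctypes
import errno
import os

libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.listxattr.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_size_t]
libc.listxattr.restype = ctypes.c_ssize_t

def list_xattrs(path):
    """List xattr names on path, retrying when the kernel returns -ERANGE."""
    p = path.encode()
    while True:
        # Probe with size 0: the kernel replies with the size needed now.
        size = libc.listxattr(p, None, 0)
        if size < 0:
            e = ctypes.get_errno()
            raise OSError(e, os.strerror(e))
        if size == 0:
            return []
        buf = ctypes.create_string_buffer(size)
        n = libc.listxattr(p, buf, size)
        if n >= 0:
            # Names come back as a NUL-separated byte sequence.
            return [s.decode() for s in buf.raw[:n].split(b"\0") if s]
        e = ctypes.get_errno()
        if e == errno.ERANGE:
            continue  # the list grew between the two calls; probe again
        raise OSError(e, os.strerror(e))
```

Python's own os.listxattr() handles the sizing in a similar way internally, which is why the kernel console message, rather than an application-level error, is often the only visible symptom.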

            The change https://review.whamcloud.com/#/c/32739/ (LU-11074 mdc: set correct body eadatasize for getxattr()) may help you avoid this situation. Would you be willing to try it?

            pjones Peter Jones added a comment -

            John

            Can you please advise?

            Thanks

            Peter


            NOTE: I'm not actually sure I need to post here for our former Intel Oak support, please advise.

            But... After further investigation, it seems that these messages could be a side effect of a known limitation of nodemapping/Lustre permissions/caching, but not the root cause of our issue, which has been identified.

            It would be nice to definitively fix the client inode cache on Lustre to avoid confusion, as already explained in LU-10884.


            People

              Assignee: jhammond John Hammond
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 3