[LU-4345] failed to update accounting ZAP for user

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.6.0, Lustre 2.5.3
    • None
    • Lustre 2.4.0-19chaos
    • 3
    • 11907

    Description

      We are running Lustre 2.4.0-19chaos on our servers with the ZFS OSD. On some of the OSS nodes we are seeing messages like this:

      Nov  6 00:06:29 stout8 kernel: LustreError: 14909:0:(osd_object.c:973:osd_attr_set()) fsrzb-OST0007: failed to update accounting ZAP for user 132245 (-2)
      Nov  6 00:06:29 stout8 kernel: LustreError: 14909:0:(osd_object.c:973:osd_attr_set()) Skipped 5 previous similar messages
      Nov  6 00:06:38 stout16 kernel: LustreError: 15266:0:(osd_object.c:973:osd_attr_set()) fsrzb-OST000f: failed to update accounting ZAP for user 122392 (-2)
      Nov  6 00:06:38 stout16 kernel: LustreError: 15266:0:(osd_object.c:973:osd_attr_set()) Skipped 3 previous similar messages
      Nov  6 00:06:40 stout12 kernel: LustreError: 15801:0:(osd_object.c:973:osd_attr_set()) fsrzb-OST000b: failed to update accounting ZAP for user 122708 (-2)
      Nov  6 00:06:40 stout12 kernel: LustreError: 15801:0:(osd_object.c:973:osd_attr_set()) Skipped 4 previous similar messages
      
      Nov  7 00:31:36 porter31 kernel: LustreError: 7704:0:(osd_object.c:973:osd_attr_set()) lse-OST001f: failed to update accounting ZAP for user 54916 (-2)
      Nov  7 02:53:05 porter19 kernel: LustreError: 9380:0:(osd_object.c:973:osd_attr_set()) lse-OST0013: failed to update accounting ZAP for user 7230 (-2)
      
      Dec  3 12:01:21 stout7 kernel: Lustre: Skipped 3 previous similar messages
      Dec  3 13:52:30 stout4 kernel: LustreError: 15806:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0003: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout4 kernel: LustreError: 15806:0:(osd_object.c:967:osd_attr_set()) Skipped 3 previous similar messages
      Dec  3 13:52:30 stout1 kernel: LustreError: 15324:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0000: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout1 kernel: LustreError: 15784:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0000: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout14 kernel: LustreError: 16345:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST000d: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout12 kernel: LustreError: 32355:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST000b: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout2 kernel: LustreError: 15145:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0001: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout10 kernel: LustreError: 14570:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0009: failed to update accounting ZAP for user 1752876224 (-2)
      

      First of all, these messages are terrible. If you look at osd_attr_set(), it prints four identical messages. Granted, we can look them up by line number, but it would be even better to make them unique.

      So, looking them up by line numbers 967 and 973, it would appear that we have hit at least the first two of the "failed to update accounting ZAP for user" messages.
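
      As a rough aid for this kind of lookup, the following Python sketch tallies the messages by source file and line number, which makes it easy to see which of the identical call sites actually fired. It assumes the console log format shown above; the regular expression, the group names, and the function tally_call_sites are made up for this illustration and are not part of Lustre. Rate-limited "Skipped N previous similar messages" lines are ignored, so the counts are only a lower bound.

```python
import re
from collections import Counter

# Matches the LustreError format seen in the logs above. The group names are
# labels chosen for this sketch, not Lustre terminology.
PATTERN = re.compile(
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"(?P<target>[\w-]+): failed to update accounting ZAP for user "
    r"(?P<uid>\d+) \((?P<rc>-?\d+)\)"
)


def tally_call_sites(lines):
    """Count matching messages per (source file, line number)."""
    counts = Counter()
    for text in lines:
        m = PATTERN.search(text)
        if m:
            counts[(m.group("file"), int(m.group("line")))] += 1
    return counts


# Three lines copied from the report; the "Skipped" line must not be counted.
sample = [
    "Nov  6 00:06:29 stout8 kernel: LustreError: 14909:0:(osd_object.c:973:osd_attr_set()) fsrzb-OST0007: failed to update accounting ZAP for user 132245 (-2)",
    "Nov  6 00:06:29 stout8 kernel: LustreError: 14909:0:(osd_object.c:973:osd_attr_set()) Skipped 5 previous similar messages",
    "Dec  3 13:52:30 stout4 kernel: LustreError: 15806:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0003: failed to update accounting ZAP for user 1752876224 (-2)",
]

for (src, line), n in sorted(tally_call_sites(sample).items()):
    print(f"{src}:{line} -> {n} message(s)")
```

      Note that line numbers shift between builds (967 and 973 here), so a tally like this is only meaningful against the matching source tree.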

      Note that the UID numbers do not look correct to me. Many of them are clearly not in the valid UID range. But then I don't completely understand what is going on here yet.
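
      One way to make the suspicious IDs stand out is to pull the UID and return code out of each message and compare the UID against a site-specific ceiling. The sketch below is only illustrative: ASSUMED_MAX_UID is a hypothetical threshold (the real valid range depends on the site's account database), and the helper names are invented here. It also decodes the (-2) return code, which is the negated Linux errno ENOENT.

```python
import errno
import re

# Extracts the UID and the negative return code from one log message.
MSG_RE = re.compile(r"failed to update accounting ZAP for user (\d+) \((-?\d+)\)")

# Hypothetical ceiling for this sketch; substitute the real range for your site.
ASSUMED_MAX_UID = 1_000_000


def parse_message(text):
    """Return (uid, errno_name) for a matching line, or None otherwise."""
    m = MSG_RE.search(text)
    if m is None:
        return None
    uid, rc = int(m.group(1)), int(m.group(2))
    # Kernel code returns negated errno values, so -2 maps to errno 2.
    return uid, errno.errorcode.get(-rc, "unknown")


def suspicious_uids(lines, max_uid=ASSUMED_MAX_UID):
    """Return the sorted, de-duplicated UIDs above the assumed ceiling."""
    parsed = (parse_message(text) for text in lines)
    return sorted({p[0] for p in parsed if p and p[0] > max_uid})


# Message tails copied from the logs in this report.
messages = [
    "fsrzb-OST0007: failed to update accounting ZAP for user 132245 (-2)",
    "lse-OST0013: failed to update accounting ZAP for user 7230 (-2)",
    "fsrzb-OST0003: failed to update accounting ZAP for user 1752876224 (-2)",
    "fsrzb-OST0000: failed to update accounting ZAP for user 1752876224 (-2)",
]

print(parse_message(messages[0]))
print(suspicious_uids(messages))
```

      With this arbitrary ceiling only 1752876224 is flagged; a tighter, site-accurate range may flag more of the reported IDs.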

          Activity

            morrone Christopher Morrone (Inactive) added a comment - So what is the actual state of the reported bug? Is it now fixed, but because http://review.whamcloud.com/7157 was reverted we now have a potential performance regression? Or is the bug not yet fixed?

            jlevi Jodi Levi (Inactive) added a comment - Follow on work is being tracked in LU-5129

            adilger Andreas Dilger added a comment - The patch http://review.whamcloud.com/7157 was landed to master and then reverted due to problems. That patch needs to be refreshed.

            spimpale Swapnil Pimpale (Inactive) added a comment - Thanks Niu
            niu Niu Yawei (Inactive) added a comment - b2_4: http://review.whamcloud.com/#/c/10462/
            pjones Peter Jones added a comment -

            Swapnil

            Sorry if I was not clear previously. Yes, I understand that you would like a b2_4 version of this fix, and as soon as we have finalized the form of the fix we will create one.

            Regards

            Peter

            spimpale Swapnil Pimpale (Inactive) added a comment - Peter, Could you please provide a b2_4 backport of this patch? We need it at one of our customer sites. Thanks!
            pjones Peter Jones added a comment -

            Swapnil

            Not yet. The usual practice is to finalize the form of the patch on master before backporting it to earlier branches.

            Peter

            spimpale Swapnil Pimpale (Inactive) added a comment - Is there a b2_4 backport of this patch?
            niu Niu Yawei (Inactive) added a comment - http://review.whamcloud.com/10223

            niu Niu Yawei (Inactive) added a comment -

            we don't store "validity" in llog. so I guess the right fix would be to fill missing uid/gid in llog record with current value?

            You mean get the current ids in lod layer, and pass them to osp by 'attr'? (the 'attr' is 'const')

            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 14

              Dates

                Created:
                Updated:
                Resolved: