[LU-4345] failed to update accounting ZAP for user

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.6.0, Lustre 2.5.3
    • None
    • Lustre 2.4.0-19chaos
    • 3
    • 11907

    Description

      We are running Lustre 2.4.0-19chaos on our servers with the ZFS OSD. On some of the OSS nodes we are seeing messages like this:

      Nov  6 00:06:29 stout8 kernel: LustreError: 14909:0:(osd_object.c:973:osd_attr_set()) fsrzb-OST0007: failed to update accounting ZAP for user 132245 (-2)
      Nov  6 00:06:29 stout8 kernel: LustreError: 14909:0:(osd_object.c:973:osd_attr_set()) Skipped 5 previous similar messages
      Nov  6 00:06:38 stout16 kernel: LustreError: 15266:0:(osd_object.c:973:osd_attr_set()) fsrzb-OST000f: failed to update accounting ZAP for user 122392 (-2)
      Nov  6 00:06:38 stout16 kernel: LustreError: 15266:0:(osd_object.c:973:osd_attr_set()) Skipped 3 previous similar messages
      Nov  6 00:06:40 stout12 kernel: LustreError: 15801:0:(osd_object.c:973:osd_attr_set()) fsrzb-OST000b: failed to update accounting ZAP for user 122708 (-2)
      Nov  6 00:06:40 stout12 kernel: LustreError: 15801:0:(osd_object.c:973:osd_attr_set()) Skipped 4 previous similar messages
      
      Nov  7 00:31:36 porter31 kernel: LustreError: 7704:0:(osd_object.c:973:osd_attr_set()) lse-OST001f: failed to update accounting ZAP for user 54916 (-2)
      Nov  7 02:53:05 porter19 kernel: LustreError: 9380:0:(osd_object.c:973:osd_attr_set()) lse-OST0013: failed to update accounting ZAP for user 7230 (-2)
      
      Dec  3 12:01:21 stout7 kernel: Lustre: Skipped 3 previous similar messages
      Dec  3 13:52:30 stout4 kernel: LustreError: 15806:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0003: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout4 kernel: LustreError: 15806:0:(osd_object.c:967:osd_attr_set()) Skipped 3 previous similar messages
      Dec  3 13:52:30 stout1 kernel: LustreError: 15324:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0000: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout1 kernel: LustreError: 15784:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0000: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout14 kernel: LustreError: 16345:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST000d: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout12 kernel: LustreError: 32355:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST000b: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout2 kernel: LustreError: 15145:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0001: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout10 kernel: LustreError: 14570:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0009: failed to update accounting ZAP for user 1752876224 (-2)
      

      First of all, these messages are terrible. If you look at osd_attr_set(), it prints four identical messages. OK, granted, we can look them up by line number. But even better would be to make them unique.

      So looking them up by line numbers 967 and 973, it would appear that we have hit at least the first two of the "failed to update accounting ZAP for user" messages.
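
A minimal sketch of how the quoted console messages could be grouped by source line and OST, which is roughly the manual lookup described above. The regular expression and the function name `parse_failures` are illustrative assumptions based only on the log format shown in this ticket, not part of Lustre. The return code is decoded with Python's standard `errno` table, where 2 is ENOENT (kernel code reports it as -2).

```python
import errno
import re
from collections import defaultdict

# Pattern for the messages quoted in this ticket; group names are our own labels.
LOG_RE = re.compile(
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s+"
    r"(?P<target>[\w-]+): failed to update accounting ZAP for user "
    r"(?P<uid>\d+) \((?P<rc>-?\d+)\)"
)


def parse_failures(lines):
    """Group accounting failures by source line number, then by OST target."""
    grouped = defaultdict(lambda: defaultdict(list))
    for text in lines:
        m = LOG_RE.search(text)
        if m is None:  # e.g. "Skipped N previous similar messages"
            continue
        rc = int(m.group("rc"))
        # Kernel code returns negative errno values, so negate before lookup.
        name = errno.errorcode.get(-rc, "unknown")
        grouped[int(m.group("line"))][m.group("target")].append(
            (int(m.group("uid")), name)
        )
    return grouped


# Three lines copied from the description above.
sample = [
    "Nov  6 00:06:29 stout8 kernel: LustreError: 14909:0:(osd_object.c:973:osd_attr_set()) fsrzb-OST0007: failed to update accounting ZAP for user 132245 (-2)",
    "Nov  6 00:06:29 stout8 kernel: LustreError: 14909:0:(osd_object.c:973:osd_attr_set()) Skipped 5 previous similar messages",
    "Dec  3 13:52:30 stout4 kernel: LustreError: 15806:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0003: failed to update accounting ZAP for user 1752876224 (-2)",
]
result = parse_failures(sample)
```

Grouping by line number first makes it easy to see which of the identical messages in osd_attr_set() was actually hit on each target.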

      Note that the UID numbers do not look correct to me. Many of them are clearly not in the valid UID range. But then I don't completely understand what is going on here yet.
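
One way to make the "not in the valid UID range" observation checkable is to compare the reported IDs against the highest UID a site actually assigns. The sketch below is only an illustration: the ticket does not state the valid range, so `HYPOTHETICAL_MAX_UID` is an assumed placeholder, and `suspicious_uids` is a hypothetical helper name.

```python
def suspicious_uids(uids, max_assigned_uid):
    """Return the sorted, de-duplicated UIDs above the site's highest assigned UID."""
    return sorted({u for u in uids if u > max_assigned_uid})


# UIDs copied from the log excerpts in the description.
reported = [132245, 122392, 122708, 54916, 7230, 1752876224]

# ASSUMPTION: placeholder threshold for illustration, not a value from this ticket.
HYPOTHETICAL_MAX_UID = 200_000

flagged = suspicious_uids(reported, HYPOTHETICAL_MAX_UID)
```

With a real site limit substituted for the placeholder, the flagged list shows which accounting failures involve IDs that no user should own.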

          Activity


            bzzz Alex Zhuravlev added a comment -

            At umount, the dnodes storing object accounting are still referenced, so dnode_special_close() gets stuck because the meta dnode is referenced by them.

            Why do you think the whole thing is racy?

            behlendorf Brian Behlendorf added a comment -

            Exactly what kind of failure are you seeing? I don't understand what you mean by 'somehow dnodes are still referenced'. Can you point me at a maloo failure which shows the problem, or better describe exactly what the issue is? What you're trying to do in the patch looks reasonable to me on the surface, although the whole thing feels racy.
            bzzz Alex Zhuravlev added a comment - - edited

            2600 was doing OK on my local system; unfortunately it seems to fail on maloo sometimes. I asked Brian B. to help with understanding the root cause: somehow dnodes are still referenced when I use dsl_sync_task_nowait(). Once this is sorted out (Brian, please help), we can try that again.

            the important thing is that w/o 2600 accounting is still racy..


            adilger Andreas Dilger added a comment -

            Chris,
            patch 10223 has landed on master (2.6.0); it addresses what Niu believes to be the major source of inconsistent UID/GIDs on the OSTs for quota accounting.

            The 7157 patch is to be tracked under LU-2600, where it was originally filed. I mistakenly thought it was submitted under this ticket and needed a new patch to track it for landing. It had accidentally landed to master for a very short time, but was reverted because it caused problems and Alex had only intended it for testing at this point.

            morrone Christopher Morrone (Inactive) added a comment -

            So what is the actual state of the reported bug? Is it now fixed, but because http://review.whamcloud.com/7157 was reverted we now have a potential performance regression? Or is the bug not yet fixed?

            jlevi Jodi Levi (Inactive) added a comment -

            Follow-on work is being tracked in LU-5129.

            adilger Andreas Dilger added a comment -

            The patch http://review.whamcloud.com/7157 was landed to master and then reverted due to problems. That patch needs to be refreshed.

            spimpale Swapnil Pimpale (Inactive) added a comment -

            Thanks Niu
            niu Niu Yawei (Inactive) added a comment - b2_4: http://review.whamcloud.com/#/c/10462/
            pjones Peter Jones added a comment -

            Swapnil

            Sorry if I was not clear previously. Yes, I understand that you would like a b2_4 version of this fix and as soon as we have finalized the form of the fix we will create one

            Regards

            Peter


            spimpale Swapnil Pimpale (Inactive) added a comment -

            Peter,

            Could you please provide a b2_4 backport of this patch? We need it at one of our customer sites.

            Thanks!

            People

              niu Niu Yawei (Inactive)
              morrone Christopher Morrone (Inactive)