failed to update accounting ZAP for user

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.6.0, Lustre 2.5.3
    • None
    • Lustre 2.4.0-19chaos
    • 3
    • 11907

    Description

      We are using Lustre 2.4.0-19chaos on our servers running with the ZFS OSD. On some of the OSS nodes we are seeing messages like this:

      Nov  6 00:06:29 stout8 kernel: LustreError: 14909:0:(osd_object.c:973:osd_attr_set()) fsrzb-OST0007: failed to update accounting ZAP for user 132245 (-2)
      Nov  6 00:06:29 stout8 kernel: LustreError: 14909:0:(osd_object.c:973:osd_attr_set()) Skipped 5 previous similar messages
      Nov  6 00:06:38 stout16 kernel: LustreError: 15266:0:(osd_object.c:973:osd_attr_set()) fsrzb-OST000f: failed to update accounting ZAP for user 122392 (-2)
      Nov  6 00:06:38 stout16 kernel: LustreError: 15266:0:(osd_object.c:973:osd_attr_set()) Skipped 3 previous similar messages
      Nov  6 00:06:40 stout12 kernel: LustreError: 15801:0:(osd_object.c:973:osd_attr_set()) fsrzb-OST000b: failed to update accounting ZAP for user 122708 (-2)
      Nov  6 00:06:40 stout12 kernel: LustreError: 15801:0:(osd_object.c:973:osd_attr_set()) Skipped 4 previous similar messages
      
      Nov  7 00:31:36 porter31 kernel: LustreError: 7704:0:(osd_object.c:973:osd_attr_set()) lse-OST001f: failed to update accounting ZAP for user 54916 (-2)
      Nov  7 02:53:05 porter19 kernel: LustreError: 9380:0:(osd_object.c:973:osd_attr_set()) lse-OST0013: failed to update accounting ZAP for user 7230 (-2)
      
      Dec  3 12:01:21 stout7 kernel: Lustre: Skipped 3 previous similar messages
      Dec  3 13:52:30 stout4 kernel: LustreError: 15806:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0003: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout4 kernel: LustreError: 15806:0:(osd_object.c:967:osd_attr_set()) Skipped 3 previous similar messages
      Dec  3 13:52:30 stout1 kernel: LustreError: 15324:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0000: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout1 kernel: LustreError: 15784:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0000: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout14 kernel: LustreError: 16345:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST000d: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout12 kernel: LustreError: 32355:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST000b: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout2 kernel: LustreError: 15145:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0001: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout10 kernel: LustreError: 14570:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0009: failed to update accounting ZAP for user 1752876224 (-2)
      

      First of all, these messages are terrible. If you look at osd_attr_set() there are four exactly identical messages that are printed. Ok, granted, we can look them up by line number. But even better would be to make them unique.

      So looking them up by line numbers 967 and 973, it would appear that we have hit at least the first two of the "failed to update accounting ZAP for user" messages.

      Note that the UID numbers do not look correct to me. Many of them are clearly not in the valid UID range. But then I don't completely understand what is going on here yet.
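On the point about the four identical messages, one way to make them distinguishable is sketched below. This is only an illustration: the `acct_step` labels, the helper `acct_format_error()`, and the message wording are all hypothetical, and the ticket does not say what each of the four call sites in osd_attr_set() actually does.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical labels for the four call sites; the ticket does not name them. */
enum acct_step {
	ACCT_USER_ADD,
	ACCT_USER_DEL,
	ACCT_GROUP_ADD,
	ACCT_GROUP_DEL,
	ACCT_STEP_COUNT
};

/* One distinct description per call site (wording is an assumption). */
static const char *const acct_step_desc[ACCT_STEP_COUNT] = {
	[ACCT_USER_ADD]  = "add new user",
	[ACCT_USER_DEL]  = "remove old user",
	[ACCT_GROUP_ADD] = "add new group",
	[ACCT_GROUP_DEL] = "remove old group",
};

/* Format a message that identifies its origin without relying on line numbers. */
static int acct_format_error(char *buf, size_t len, const char *dev,
			     enum acct_step step, unsigned int id, int rc)
{
	assert(step < ACCT_STEP_COUNT);
	return snprintf(buf, len, "%s: accounting ZAP: failed to %s %u (%d)",
			dev, acct_step_desc[step], id, rc);
}
```

With messages like these, a log line alone would say whether the failing update concerned a user or a group, without needing to map line 967 or 973 back to the source.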

          Activity

            [LU-4345] failed to update accounting ZAP for user

            adilger Andreas Dilger added a comment - The patch http://review.whamcloud.com/7157 was landed to master and then reverted due to problems. That patch needs to be refreshed.

            spimpale Swapnil Pimpale (Inactive) added a comment - Thanks Niu
            niu Niu Yawei (Inactive) added a comment - b2_4: http://review.whamcloud.com/#/c/10462/
            pjones Peter Jones added a comment -

            Swapnil

            Sorry if I was not clear previously. Yes, I understand that you would like a b2_4 version of this fix, and as soon as we have finalized the form of the fix we will create one.

            Regards

            Peter


            spimpale Swapnil Pimpale (Inactive) added a comment -

            Peter,

            Could you please provide a b2_4 backport of this patch? We need it at one of our customer sites.

            Thanks!
            pjones Peter Jones added a comment -

            Swapnil

            Not yet. The usual practice is to finalize the form of the patch on master before backporting to earlier branches.

            Peter


            spimpale Swapnil Pimpale (Inactive) added a comment - Is there a b2_4 backport of this patch?
            niu Niu Yawei (Inactive) added a comment - http://review.whamcloud.com/10223

            niu Niu Yawei (Inactive) added a comment -

            > we don't store "validity" in llog. so I guess the right fix would be to fill missing uid/gid in llog record with current value?

            You mean get the current ids in the lod layer and pass them to osp via 'attr'? (The 'attr' is 'const'.)

            bzzz Alex Zhuravlev added a comment - we don't store "validity" in llog. so I guess the right fix would be to fill missing uid/gid in llog record with current value?

            niu Niu Yawei (Inactive) added a comment -

            I found that osp can sometimes set a random uid/gid on an OST object (when the user sets only the uid or only the gid).

            in osp_sync_add_rec():

                    case MDS_SETATTR64_REC:
                            rc = fid_to_ostid(fid, &osi->osi_oi);
                            LASSERT(rc == 0);
                            osi->osi_hdr.lrh_len = sizeof(osi->osi_setattr);
                            osi->osi_hdr.lrh_type = MDS_SETATTR64_REC;
                            osi->osi_setattr.lsr_oi  = osi->osi_oi;
                            LASSERT(attr);
                            osi->osi_setattr.lsr_uid = attr->la_uid;
                            osi->osi_setattr.lsr_gid = attr->la_gid;
                            break;

            Both the uid and gid from attr are saved into the llog without checking whether both are valid (that is, whether LA_UID and LA_GID are both set in attr->la_valid).

            in osp_sync_new_setattr_job():

                    body->oa.o_oi = rec->lsr_oi;
                    body->oa.o_uid = rec->lsr_uid;
                    body->oa.o_gid = rec->lsr_gid;
                    body->oa.o_valid = OBD_MD_FLGROUP | OBD_MD_FLID |
                                       OBD_MD_FLUID | OBD_MD_FLGID;

            We send both the uid and gid from the llog to the OST, and tell the OST that both are valid (OBD_MD_FLUID & OBD_MD_FLGID).

            This is probably the cause of the random ids on OST objects. I think we should store a flag in llog_setattr64_rec to specify which id is valid. Alex, what do you think?
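The flag idea proposed in this comment could look roughly like the sketch below. Everything here is a simplified, hypothetical stand-in: the `sk_` structures, the `SK_*` bit values, and the `lsr_valid` field are assumptions for illustration and do not reflect the real llog_setattr64_rec layout or the fix that eventually landed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bit values, chosen for this sketch only. */
#define SK_LA_UID      (1u << 0)
#define SK_LA_GID      (1u << 1)
#define SK_MD_FLID     (1u << 0)
#define SK_MD_FLUID    (1u << 1)
#define SK_MD_FLGID    (1u << 2)
#define SK_MD_FLGROUP  (1u << 3)

struct sk_attr {                /* stand-in for the MDS-side attributes */
	uint64_t la_valid;
	uint32_t la_uid;
	uint32_t la_gid;
};

struct sk_setattr_rec {         /* stand-in for the llog setattr record */
	uint32_t lsr_uid;
	uint32_t lsr_gid;
	uint64_t lsr_valid;     /* hypothetical new field: which ids are set */
};

struct sk_obdo {                /* stand-in for the request body sent to the OST */
	uint64_t o_valid;
	uint32_t o_uid;
	uint32_t o_gid;
};

/* Log only the ids the caller marked valid, and remember which ones. */
static void sk_add_rec(struct sk_setattr_rec *rec, const struct sk_attr *attr)
{
	assert(rec != NULL && attr != NULL);
	memset(rec, 0, sizeof(*rec));
	if (attr->la_valid & SK_LA_UID) {
		rec->lsr_uid = attr->la_uid;
		rec->lsr_valid |= SK_LA_UID;
	}
	if (attr->la_valid & SK_LA_GID) {
		rec->lsr_gid = attr->la_gid;
		rec->lsr_valid |= SK_LA_GID;
	}
}

/* Tell the OST that an id is valid only if the record says so. */
static void sk_new_setattr_job(struct sk_obdo *oa,
			       const struct sk_setattr_rec *rec)
{
	assert(oa != NULL && rec != NULL);
	memset(oa, 0, sizeof(*oa));
	oa->o_valid = SK_MD_FLGROUP | SK_MD_FLID;
	if (rec->lsr_valid & SK_LA_UID) {
		oa->o_uid = rec->lsr_uid;
		oa->o_valid |= SK_MD_FLUID;
	}
	if (rec->lsr_valid & SK_LA_GID) {
		oa->o_gid = rec->lsr_gid;
		oa->o_valid |= SK_MD_FLGID;
	}
}
```

In this sketch a chgrp-only request with garbage in la_uid never reaches the OST as a valid uid, which is the behaviour the comment argues for. A real change would also have to deal with records already written in the old format, which the sketch ignores.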

            People

              niu Niu Yawei (Inactive)
              morrone Christopher Morrone (Inactive)