Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4345

failed to update accounting ZAP for user

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.6.0, Lustre 2.5.3
    • None
    • Lustre 2.4.0-19chaos
    • 3
    • 11907

    Description

      We are using lustre 2.4.0-19chaos on our servers running with the ZFS OSD. On some of the OSS nodes we are seeing messages like this:

      Nov  6 00:06:29 stout8 kernel: LustreError: 14909:0:(osd_object.c:973:osd_attr_set()) fsrzb-OST0007: failed to update accounting ZAP for user 132245 (-2)
      Nov  6 00:06:29 stout8 kernel: LustreError: 14909:0:(osd_object.c:973:osd_attr_set()) Skipped 5 previous similar messages
      Nov  6 00:06:38 stout16 kernel: LustreError: 15266:0:(osd_object.c:973:osd_attr_set()) fsrzb-OST000f: failed to update accounting ZAP for user 122392 (-2)
      Nov  6 00:06:38 stout16 kernel: LustreError: 15266:0:(osd_object.c:973:osd_attr_set()) Skipped 3 previous similar messages
      Nov  6 00:06:40 stout12 kernel: LustreError: 15801:0:(osd_object.c:973:osd_attr_set()) fsrzb-OST000b: failed to update accounting ZAP for user 122708 (-2)
      Nov  6 00:06:40 stout12 kernel: LustreError: 15801:0:(osd_object.c:973:osd_attr_set()) Skipped 4 previous similar messages
      
      Nov  7 00:31:36 porter31 kernel: LustreError: 7704:0:(osd_object.c:973:osd_attr_set()) lse-OST001f: failed to update accounting ZAP for user 54916 (-2)
      Nov  7 02:53:05 porter19 kernel: LustreError: 9380:0:(osd_object.c:973:osd_attr_set()) lse-OST0013: failed to update accounting ZAP for user 7230 (-2)
      
      Dec  3 12:01:21 stout7 kernel: Lustre: Skipped 3 previous similar messages
      Dec  3 13:52:30 stout4 kernel: LustreError: 15806:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0003: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout4 kernel: LustreError: 15806:0:(osd_object.c:967:osd_attr_set()) Skipped 3 previous similar messages
      Dec  3 13:52:30 stout1 kernel: LustreError: 15324:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0000: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout1 kernel: LustreError: 15784:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0000: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout14 kernel: LustreError: 16345:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST000d: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout12 kernel: LustreError: 32355:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST000b: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout2 kernel: LustreError: 15145:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0001: failed to update accounting ZAP for user 1752876224 (-2)
      Dec  3 13:52:30 stout10 kernel: LustreError: 14570:0:(osd_object.c:967:osd_attr_set()) fsrzb-OST0009: failed to update accounting ZAP for user 1752876224 (-2)
      

      First of all, these messages are terrible. If you look at osd_attr_set() there are four exactly identical messages that are printed. Ok, granted, we can look them up by line number. But even better would be to make them unique.

      So looking them up by line numbers 967 and 973, it would appear that we have hit at least the first two of the "filed to update accounting ZAP for user" messages.

      Note that the UID numbers do not look correct to me. Many of them are clearly not in the valid UID range. But then I don't completely understand what is going on here yet.

      Attachments

        Issue Links

          Activity

            [LU-4345] failed to update accounting ZAP for user

            Peter,

            Could you please provide a b2_4 backport of this patch? We need it at one of our customer sites.

            Thanks!

            spimpale Swapnil Pimpale (Inactive) added a comment - Peter, Could you please provide a b2_4 backport of this patch? We need it at one of our customer sites. Thanks!
            pjones Peter Jones added a comment -

            Swapnil

            Not yet. The usual practice is to finalize the form of the patch on master before back porting to earlier branches

            Peter

            pjones Peter Jones added a comment - Swapnil Not yet. The usual practice is to finalize the form of the patch on master before back porting to earlier branches Peter

            Is there a b2_4 backport of this patch?

            spimpale Swapnil Pimpale (Inactive) added a comment - Is there a b2_4 backport of this patch?
            niu Niu Yawei (Inactive) added a comment - http://review.whamcloud.com/10223

            we don't store "validity" in llog. so I guess the right fix would be to fill missing uid/gid in llog record with current value?

            You mean get the current ids in lod layer, and pass them to osp by 'attr'? (the 'attr' is 'const')

            niu Niu Yawei (Inactive) added a comment - we don't store "validity" in llog. so I guess the right fix would be to fill missing uid/gid in llog record with current value? You mean get the current ids in lod layer, and pass them to osp by 'attr'? (the 'attr' is 'const')

            we don't store "validity" in llog. so I guess the right fix would be to fill missing uid/gid in llog record with current value?

            bzzz Alex Zhuravlev added a comment - we don't store "validity" in llog. so I guess the right fix would be to fill missing uid/gid in llog record with current value?

            I found that osp sometimes could set a random uid/gid to OST object. (when user set uid or gid only).

            in osp_sync_add_rec():

                    case MDS_SETATTR64_REC:
                            rc = fid_to_ostid(fid, &osi->osi_oi);
                            LASSERT(rc == 0);
                            osi->osi_hdr.lrh_len = sizeof(osi->osi_setattr);
                            osi->osi_hdr.lrh_type = MDS_SETATTR64_REC;
                            osi->osi_setattr.lsr_oi  = osi->osi_oi;
                            LASSERT(attr);
                            osi->osi_setattr.lsr_uid = attr->la_uid;
                            osi->osi_setattr.lsr_gid = attr->la_gid;
                            break;
            

            Both uid and gid from attr are saved into the llog without checking if they are all valid. (if LA_UID & LA_GID are both present in attr->la_valid)

            in osp_sync_new_setattr_job():

                    body->oa.o_oi = rec->lsr_oi;
                    body->oa.o_uid = rec->lsr_uid;
                    body->oa.o_gid = rec->lsr_gid;
                    body->oa.o_valid = OBD_MD_FLGROUP | OBD_MD_FLID |
                                       OBD_MD_FLUID | OBD_MD_FLGID;
            

            We send both the uid & gid from llog to OST, and tell OST that both uid & gid are valid. (OBD_MD_FLUID & OBD_MD_FLGID)

            This could probably the cause of random id on OST object, I think we'd store a flag in llog_setattr64_rec to specify which id is valid. Alex, what do you think?

            niu Niu Yawei (Inactive) added a comment - I found that osp sometimes could set a random uid/gid to OST object. (when user set uid or gid only). in osp_sync_add_rec(): case MDS_SETATTR64_REC: rc = fid_to_ostid(fid, &osi->osi_oi); LASSERT(rc == 0); osi->osi_hdr.lrh_len = sizeof(osi->osi_setattr); osi->osi_hdr.lrh_type = MDS_SETATTR64_REC; osi->osi_setattr.lsr_oi = osi->osi_oi; LASSERT(attr); osi->osi_setattr.lsr_uid = attr->la_uid; osi->osi_setattr.lsr_gid = attr->la_gid; break ; Both uid and gid from attr are saved into the llog without checking if they are all valid. (if LA_UID & LA_GID are both present in attr->la_valid) in osp_sync_new_setattr_job(): body->oa.o_oi = rec->lsr_oi; body->oa.o_uid = rec->lsr_uid; body->oa.o_gid = rec->lsr_gid; body->oa.o_valid = OBD_MD_FLGROUP | OBD_MD_FLID | OBD_MD_FLUID | OBD_MD_FLGID; We send both the uid & gid from llog to OST, and tell OST that both uid & gid are valid. (OBD_MD_FLUID & OBD_MD_FLGID) This could probably the cause of random id on OST object, I think we'd store a flag in llog_setattr64_rec to specify which id is valid. Alex, what do you think?

            I wonder if this is a race in zap_increment().. there is no locking between zap_lookup() and zap_remove()/zap_update(). meaning in some case we can try to remove the record already removed concurrently? I'd like to suggest http://review.whamcloud.com/#/c/7157/ but unfortunately that one depends on specific DMU API to be exported.

            bzzz Alex Zhuravlev added a comment - I wonder if this is a race in zap_increment().. there is no locking between zap_lookup() and zap_remove()/zap_update(). meaning in some case we can try to remove the record already removed concurrently? I'd like to suggest http://review.whamcloud.com/#/c/7157/ but unfortunately that one depends on specific DMU API to be exported.

            The message shows that all these incorrect ids come from owner/group changing (not from punch or internal owner/group setting on first write), so I think it could be some problematic application changes owner/group unexpectedly? Could you check if there are any files on MDT were changed to such invalid owner/group? (hope they are not temporary file).

            Another thing is that why it failed to update ZAP? The ZAP should be updated successfully no matter if the uid/gid is invalid, I think we need some zfs expert to enable debug on zfs and see what's going on the ZAP updating.

            niu Niu Yawei (Inactive) added a comment - The message shows that all these incorrect ids come from owner/group changing (not from punch or internal owner/group setting on first write), so I think it could be some problematic application changes owner/group unexpectedly? Could you check if there are any files on MDT were changed to such invalid owner/group? (hope they are not temporary file). Another thing is that why it failed to update ZAP? The ZAP should be updated successfully no matter if the uid/gid is invalid, I think we need some zfs expert to enable debug on zfs and see what's going on the ZAP updating.

            It looks like we are hitting this on OSS nodes pretty frequently. Here are some examples:

            2014-03-31 10:23:46 LustreError: 6113:0:(ofd_obd.c:888:ofd_setattr()) LU-4345 set uid: 632227776
            2014-03-31 10:23:46 LustreError: 6119:0:(ofd_objects.c:418:ofd_attr_set()) LU-4345 set uid: 632227776
            2014-03-31 10:23:46 LustreError: 6113:0:(ofd_obd.c:888:ofd_setattr()) Skipped 1 previous similar message
            2014-03-31 10:23:46 LustreError: 6113:0:(ofd_objects.c:418:ofd_attr_set()) LU-4345 set uid: 632227776
            2014-03-31 10:48:29 LustreError: 6229:0:(ofd_obd.c:888:ofd_setattr()) LU-4345 set uid: 857467328
            2014-03-31 10:48:29 LustreError: 6229:0:(ofd_objects.c:418:ofd_attr_set()) LU-4345 set uid: 857467328
            2014-03-31 10:48:35 LustreError: 6020:0:(ofd_obd.c:888:ofd_setattr()) LU-4345 set uid: 857467328
            2014-03-31 10:48:35 LustreError: 6020:0:(ofd_objects.c:418:ofd_attr_set()) LU-4345 set uid: 857467328
            2014-03-31 10:48:44 LustreError: 6224:0:(ofd_obd.c:888:ofd_setattr()) LU-4345 set uid: 857467328
            2014-03-31 10:48:44 LustreError: 6159:0:(ofd_objects.c:418:ofd_attr_set()) LU-4345 set uid: 857467328
            2014-03-31 10:48:44 LustreError: 6224:0:(ofd_obd.c:888:ofd_setattr()) Skipped 2 previous similar messages
            2014-03-31 10:48:48 LustreError: 6159:0:(ofd_obd.c:888:ofd_setattr()) LU-4345 set uid: 857467328
            2014-03-31 10:48:48 LustreError: 6159:0:(ofd_objects.c:418:ofd_attr_set()) LU-4345 set uid: 857467328
            2014-03-31 10:48:48 LustreError: 6159:0:(ofd_objects.c:418:ofd_attr_set()) Skipped 2 previous similar messages
            2014-03-31 10:48:52 LustreError: 6307:0:(ofd_obd.c:888:ofd_setattr()) LU-4345 set uid: 857467328
            2014-03-31 10:48:52 LustreError: 6002:0:(ofd_objects.c:418:ofd_attr_set()) LU-4345 set uid: 857467328
            2014-03-31 10:48:52 LustreError: 6307:0:(ofd_obd.c:888:ofd_setattr()) Skipped 1 previous similar message
            2014-03-31 10:48:56 LustreError: 6002:0:(ofd_obd.c:888:ofd_setattr()) LU-4345 set uid: 857467328
            2014-03-31 10:48:56 LustreError: 6002:0:(ofd_objects.c:418:ofd_attr_set()) LU-4345 set uid: 857467328
            2014-03-31 10:48:56 LustreError: 6002:0:(ofd_objects.c:418:ofd_attr_set()) Skipped 1 previous similar message
            2014-03-31 10:49:05 LustreError: 6159:0:(ofd_obd.c:888:ofd_setattr()) LU-4345 set uid: 857467328
            2014-03-31 10:49:05 LustreError: 6159:0:(ofd_obd.c:888:ofd_setattr()) Skipped 3 previous similar messages
            2014-03-31 10:49:05 LustreError: 6119:0:(ofd_objects.c:418:ofd_attr_set()) LU-4345 set uid: 857467328
            2014-03-31 10:49:05 LustreError: 6119:0:(ofd_objects.c:418:ofd_attr_set()) Skipped 3 previous similar messages
            
            2014-03-29 18:55:33 LustreError: 5973:0:(ofd_obd.c:888:ofd_setattr()) LU-4345 set uid: 938789056
            2014-03-29 18:55:33 LustreError: 5968:0:(ofd_objects.c:418:ofd_attr_set()) LU-4345 set uid: 938789056
            2014-03-29 18:55:33 LustreError: 5973:0:(ofd_obd.c:888:ofd_setattr()) Skipped 1 previous similar message
            2014-03-29 18:55:33 LustreError: 5987:0:(ofd_obd.c:888:ofd_setattr()) LU-4345 set uid: 938789056
            2014-03-29 18:55:33 LustreError: 5987:0:(ofd_obd.c:888:ofd_setattr()) Skipped 1 previous similar message
            2014-03-29 18:55:33 LustreError: 5987:0:(ofd_objects.c:418:ofd_attr_set()) LU-4345 set uid: 938789056
            2014-03-29 18:55:33 LustreError: 5968:0:(osd_object.c:974:osd_attr_set()) lse-OST0013: failed to update accounting ZAP for original user 938788544 (-2)
            

            Note that I had set the "> 7000" to "> 70000", and also added "LU-4345" to the comments before you provided the updated patch that uses "> 1700000000".

            morrone Christopher Morrone (Inactive) added a comment - It looks like we are hitting this on OSS nodes pretty frequently. Here are some examples: 2014-03-31 10:23:46 LustreError: 6113:0:(ofd_obd.c:888:ofd_setattr()) LU-4345 set uid: 632227776 2014-03-31 10:23:46 LustreError: 6119:0:(ofd_objects.c:418:ofd_attr_set()) LU-4345 set uid: 632227776 2014-03-31 10:23:46 LustreError: 6113:0:(ofd_obd.c:888:ofd_setattr()) Skipped 1 previous similar message 2014-03-31 10:23:46 LustreError: 6113:0:(ofd_objects.c:418:ofd_attr_set()) LU-4345 set uid: 632227776 2014-03-31 10:48:29 LustreError: 6229:0:(ofd_obd.c:888:ofd_setattr()) LU-4345 set uid: 857467328 2014-03-31 10:48:29 LustreError: 6229:0:(ofd_objects.c:418:ofd_attr_set()) LU-4345 set uid: 857467328 2014-03-31 10:48:35 LustreError: 6020:0:(ofd_obd.c:888:ofd_setattr()) LU-4345 set uid: 857467328 2014-03-31 10:48:35 LustreError: 6020:0:(ofd_objects.c:418:ofd_attr_set()) LU-4345 set uid: 857467328 2014-03-31 10:48:44 LustreError: 6224:0:(ofd_obd.c:888:ofd_setattr()) LU-4345 set uid: 857467328 2014-03-31 10:48:44 LustreError: 6159:0:(ofd_objects.c:418:ofd_attr_set()) LU-4345 set uid: 857467328 2014-03-31 10:48:44 LustreError: 6224:0:(ofd_obd.c:888:ofd_setattr()) Skipped 2 previous similar messages 2014-03-31 10:48:48 LustreError: 6159:0:(ofd_obd.c:888:ofd_setattr()) LU-4345 set uid: 857467328 2014-03-31 10:48:48 LustreError: 6159:0:(ofd_objects.c:418:ofd_attr_set()) LU-4345 set uid: 857467328 2014-03-31 10:48:48 LustreError: 6159:0:(ofd_objects.c:418:ofd_attr_set()) Skipped 2 previous similar messages 2014-03-31 10:48:52 LustreError: 6307:0:(ofd_obd.c:888:ofd_setattr()) LU-4345 set uid: 857467328 2014-03-31 10:48:52 LustreError: 6002:0:(ofd_objects.c:418:ofd_attr_set()) LU-4345 set uid: 857467328 2014-03-31 10:48:52 LustreError: 6307:0:(ofd_obd.c:888:ofd_setattr()) Skipped 1 previous similar message 2014-03-31 10:48:56 LustreError: 6002:0:(ofd_obd.c:888:ofd_setattr()) LU-4345 set uid: 857467328 2014-03-31 10:48:56 LustreError: 6002:0:(ofd_objects.c:418:ofd_attr_set()) LU-4345 set uid: 857467328 2014-03-31 10:48:56 LustreError: 6002:0:(ofd_objects.c:418:ofd_attr_set()) Skipped 1 previous similar message 2014-03-31 10:49:05 LustreError: 6159:0:(ofd_obd.c:888:ofd_setattr()) LU-4345 set uid: 857467328 2014-03-31 10:49:05 LustreError: 6159:0:(ofd_obd.c:888:ofd_setattr()) Skipped 3 previous similar messages 2014-03-31 10:49:05 LustreError: 6119:0:(ofd_objects.c:418:ofd_attr_set()) LU-4345 set uid: 857467328 2014-03-31 10:49:05 LustreError: 6119:0:(ofd_objects.c:418:ofd_attr_set()) Skipped 3 previous similar messages 2014-03-29 18:55:33 LustreError: 5973:0:(ofd_obd.c:888:ofd_setattr()) LU-4345 set uid: 938789056 2014-03-29 18:55:33 LustreError: 5968:0:(ofd_objects.c:418:ofd_attr_set()) LU-4345 set uid: 938789056 2014-03-29 18:55:33 LustreError: 5973:0:(ofd_obd.c:888:ofd_setattr()) Skipped 1 previous similar message 2014-03-29 18:55:33 LustreError: 5987:0:(ofd_obd.c:888:ofd_setattr()) LU-4345 set uid: 938789056 2014-03-29 18:55:33 LustreError: 5987:0:(ofd_obd.c:888:ofd_setattr()) Skipped 1 previous similar message 2014-03-29 18:55:33 LustreError: 5987:0:(ofd_objects.c:418:ofd_attr_set()) LU-4345 set uid: 938789056 2014-03-29 18:55:33 LustreError: 5968:0:(osd_object.c:974:osd_attr_set()) lse-OST0013: failed to update accounting ZAP for original user 938788544 (-2) Note that I had set the "> 7000" to "> 70000", and also added " LU-4345 " to the comments before you provided the updated patch that uses "> 1700000000".

            I updated the debug patch, this time error message will only be printed when uid is greater than 1700000000.

            niu Niu Yawei (Inactive) added a comment - I updated the debug patch, this time error message will only be printed when uid is greater than 1700000000.

            People

              niu Niu Yawei (Inactive)
              morrone Christopher Morrone (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              14 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: