Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4611

too many transaction credits (32279 > 25600)

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.6.0, Lustre 2.5.2
    • Lustre 2.4.1
    • None
    • 3
    • 12616

    Description

      Getting odd quota errors with the following errors on the mdt. This is a newly created filesystem.

      Filesystem volume name:   nbp9-MDT0000
      Last mounted on:          /
      Filesystem UUID:          4615f09e-ac04-44de-a4d1-b463f280d6da
      Filesystem magic number:  0xEF53
      Filesystem revision #:    1 (dynamic)
      Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
      Filesystem flags:         signed_directory_hash 
      Default mount options:    user_xattr acl
      Filesystem state:         clean
      Errors behavior:          Continue
      Filesystem OS type:       Linux
      Inode count:              644251648
      Block count:              322122752
      Reserved block count:     0
      Free blocks:              241421671
      Free inodes:              644239829
      First block:              0
      Block size:               4096
      Fragment size:            4096
      Reserved GDT blocks:      1024
      Blocks per group:         16384
      Fragments per group:      16384
      Inodes per group:         32768
      Inode blocks per group:   4096
      Flex block group size:    16
      Filesystem created:       Wed Feb  5 15:31:14 2014
      Last mount time:          Mon Feb 10 08:45:54 2014
      Last write time:          Mon Feb 10 08:45:54 2014
      Mount count:              5
      Maximum mount count:      -1
      Last checked:             Wed Feb  5 15:31:14 2014
      Check interval:           0 (<none>)
      Lifetime writes:          307 GB
      Reserved blocks uid:      0 (user root)
      Reserved blocks gid:      0 (group root)
      First inode:              11
      Inode size:	          512
      Required extra isize:     28
      Desired extra isize:      28
      Journal inode:            8
      Default directory hash:   half_md4
      Directory Hash Seed:      f50d0c82-4edf-4e98-94ef-69ed3ad456d0
      Journal backup:           inode blocks
      User quota inode:         3
      Group quota inode:        4
      
      nbp9-mds /var/log # tunefs.lustre /dev/mapper/nbp9--vg-mdt9
      checking for existing Lustre data: found
      Reading CONFIGS/mountdata
      
         Read previous values:
      Target:     nbp9-MDT0000
      Index:      0
      Lustre FS:  nbp9
      Mount type: ldiskfs
      Flags:      0x1
                    (MDT )
      Persistent mount opts: user_xattr,errors=remount-ro
      Parameters: mgsnode=10.151.26.5@o2ib lov.stripesize=1048576 lov.stripecount=4
      
      Feb 10 07:37:42 nbp9-mds kernel: Lustre: 5858:0:(osd_handler.c:828:osd_trans_start()) nbp9-MDT0000: too many transaction credits (32279 > 25600)
      Feb 10 07:37:42 nbp9-mds kernel: Lustre: 5858:0:(osd_handler.c:835:osd_trans_start())   create: 170/4250, delete: 0/0, destroy: 0/0
      Feb 10 07:37:43 nbp9-mds kernel: Lustre: 5858:0:(osd_handler.c:840:osd_trans_start())   attr_set: 2/2, xattr_set: 172/2395
      Feb 10 07:37:43 nbp9-mds kernel: Lustre: 5858:0:(osd_handler.c:847:osd_trans_start())   write: 1523/21322, punch: 338/1352, quota 4/52
      Feb 10 07:37:43 nbp9-mds kernel: Lustre: 5858:0:(osd_handler.c:852:osd_trans_start())   insert: 171/2906, delete: 0/0
      Feb 10 07:37:43 nbp9-mds kernel: Lustre: 5858:0:(osd_handler.c:857:osd_trans_start())   ref_add: 0/0, ref_del: 0/0
      Feb 10 07:37:43 nbp9-mds kernel: Pid: 5858, comm: mdt02_043
      Feb 10 07:37:43 nbp9-mds kernel: 
      Feb 10 07:37:43 nbp9-mds kernel: Call Trace:
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0514895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0cfc41e>] osd_trans_start+0x65e/0x680 [osd_ldiskfs]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0e68309>] lod_trans_start+0x1b9/0x250 [lod]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0773117>] mdd_trans_start+0x17/0x20 [mdd]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa076712a>] mdd_create+0x91a/0x1790 [mdd]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0d03937>] ? osd_xattr_get+0x97/0x2d0 [osd_ldiskfs]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0ddf2a2>] mdt_reint_open+0x1362/0x20e0 [mdt]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa053185e>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa07fadcc>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0dc9981>] mdt_reint_rec+0x41/0xe0 [mdt]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0daeb03>] mdt_reint_internal+0x4c3/0x780 [mdt]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0daf090>] mdt_intent_reint+0x1f0/0x530 [mdt]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0dacf3e>] mdt_intent_policy+0x39e/0x720 [mdt]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa07b2831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa07d91ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0dad3c6>] mdt_enqueue+0x46/0xe0 [mdt]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0db3ad7>] mdt_handle_common+0x647/0x16d0 [mdt]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0ded615>] mds_regular_handle+0x15/0x20 [mdt]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa080b3c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa05155de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0526d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0802729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffff81055813>] ? __wake_up+0x53/0x70
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa080c75e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa080bc90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa080bc90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa080bc90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      Feb 10 07:37:43 nbp9-mds kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      Feb 10 07:37:43 nbp9-mds kernel: 
      Feb 10 07:37:43 nbp9-mds kernel: Lustre: 5858:0:(osd_handler.c:828:osd_trans_start()) nbp9-MDT0000: too many transaction credits (32279 > 25600)
      Feb 10 07:37:43 nbp9-mds kernel: Lustre: 5858:0:(osd_handler.c:835:osd_trans_start())   create: 170/4250, delete: 0/0, destroy: 0/0
      Feb 10 07:37:43 nbp9-mds kernel: Lustre: 5858:0:(osd_handler.c:840:osd_trans_start())   attr_set: 2/2, xattr_set: 172/2395
      Feb 10 07:37:43 nbp9-mds kernel: Lustre: 5858:0:(osd_handler.c:847:osd_trans_start())   write: 1523/21322, punch: 338/1352, quota 4/52
      Feb 10 07:37:43 nbp9-mds kernel: Lustre: 5858:0:(osd_handler.c:852:osd_trans_start())   insert: 171/2906, delete: 0/0
      Feb 10 07:37:43 nbp9-mds kernel: Lustre: 5858:0:(osd_handler.c:857:osd_trans_start())   ref_add: 0/0, ref_del: 0/0
      Feb 10 07:38:48 nbp9-mds kernel: Lustre: 5791:0:(osd_handler.c:828:osd_trans_start()) nbp9-MDT0000: too many transaction credits (32279 > 25600)
      Feb 10 07:38:48 nbp9-mds kernel: Lustre: 5791:0:(osd_handler.c:828:osd_trans_start()) Skipped 2 previous similar messages
      Feb 10 07:38:48 nbp9-mds kernel: Lustre: 5791:0:(osd_handler.c:835:osd_trans_start())   create: 170/4250, delete: 0/0, destroy: 0/0
      Feb 10 07:38:48 nbp9-mds kernel: Lustre: 5791:0:(osd_handler.c:835:osd_trans_start()) Skipped 2 previous similar messages
      Feb 10 07:38:48 nbp9-mds kernel: Lustre: 5791:0:(osd_handler.c:840:osd_trans_start())   attr_set: 2/2, xattr_set: 172/2395
      Feb 10 07:38:48 nbp9-mds kernel: Lustre: 5791:0:(osd_handler.c:840:osd_trans_start()) Skipped 2 previous similar messages
      Feb 10 07:38:48 nbp9-mds kernel: Lustre: 5791:0:(osd_handler.c:847:osd_trans_start())   write: 1523/21322, punch: 338/1352, quota 4/52
      Feb 10 07:38:48 nbp9-mds kernel: Lustre: 5791:0:(osd_handler.c:847:osd_trans_start()) Skipped 2 previous similar messages
      Feb 10 07:38:48 nbp9-mds kernel: Lustre: 5791:0:(osd_handler.c:852:osd_trans_start())   insert: 171/2906, delete: 0/0
      Feb 10 07:38:48 nbp9-mds kernel: Lustre: 5791:0:(osd_handler.c:852:osd_trans_start()) Skipped 2 previous similar messages
      Feb 10 07:38:48 nbp9-mds kernel: Lustre: 5791:0:(osd_handler.c:857:osd_trans_start())   ref_add: 0/0, ref_del: 0/0
      Feb 10 07:38:48 nbp9-mds kernel: Lustre: 5791:0:(osd_handler.c:857:osd_trans_start()) Skipped 2 previous similar messages
      Feb 10 07:43:07 nbp9-mds kernel: Lustre: 5860:0:(osd_handler.c:828:osd_trans_start()) nbp9-MDT0000: too many transaction credits (32279 > 25600)
      Feb 10 07:43:07 nbp9-mds kernel: Lustre: 5860:0:(osd_handler.c:835:osd_trans_start())   create: 170/4250, delete: 0/0, destroy: 0/0
      Feb 10 07:43:07 nbp9-mds kernel: Lustre: 5860:0:(osd_handler.c:840:osd_trans_start())   attr_set: 2/2, xattr_set: 172/2395
      Feb 10 07:43:07 nbp9-mds kernel: Lustre: 5860:0:(osd_handler.c:847:osd_trans_start())   write: 1523/21322, punch: 338/1352, quota 4/52
      Feb 10 07:43:07 nbp9-mds kernel: Lustre: 5860:0:(osd_handler.c:852:osd_trans_start())   insert: 171/2906, delete: 0/0
      Feb 10 07:43:07 nbp9-mds kernel: Lustre: 5860:0:(osd_handler.c:857:osd_trans_start())   ref_add: 0/0, ref_del: 0/0
      

      Attachments

        Issue Links

          Activity

            [LU-4611] too many transaction credits (32279 > 25600)
            bogl Bob Glossman (Inactive) added a comment - backport to b2_5: http://review.whamcloud.com/10407
            niu Niu Yawei (Inactive) added a comment - backport to b2_4: http://review.whamcloud.com/9807

            A port to b2_4 is appreciated. Thanks!

            jaylan Jay Lan (Inactive) added a comment - A port to b2_4 is appreciated. Thanks!

            Patch landed to Master. Please reopen ticket if more work is needed.

            jlevi Jodi Levi (Inactive) added a comment - Patch landed to Master. Please reopen ticket if more work is needed.

            from pure ldiskfs side, there is no point to declare punch for the block we write in the same transaction - we'll be modifying same bitmaps, gd, inode, etc. I'm not absolutely sure, but probably that was added to satisfy operation-based verification for credits (that undo thing) - osd_punch() does expect OSD_OT_PUNCH to be declared and osd_write() doesn't declare that.

            also, I tend to think that'd break ZFS with full debug enabled, but it's broken anyway.

            bzzz Alex Zhuravlev added a comment - from pure ldiskfs side, there is no point to declare punch for the block we write in the same transaction - we'll be modifying same bitmaps, gd, inode, etc. I'm not absolutely sure, but probably that was added to satisfy operation-based verification for credits (that undo thing) - osd_punch() does expect OSD_OT_PUNCH to be declared and osd_write() doesn't declare that. also, I tend to think that'd break ZFS with full debug enabled, but it's broken anyway.
            adilger Andreas Dilger added a comment - - edited

            I don't think we need to keep the full number of threads with the full transaction credits busy. There will always be some fraction of threads that are outside the transaction at a given time. That said, reducing the credits count by any amount will always help to avoid premature transaction commit and checkpoint.

            For the llog punch, it surprises me that we would need to reserve extra credits to truncate data just written? At most any llog record could be 2 blocks long, so we shouldn't need more than {2*(bitmap + GDT to free), 1 block to overwrite} + inode = 5 blocks to truncate a single llog record. If we are already reserving this much for the write (assume yes?) why reserve extra for truncate?

            At one point, there were some explicit declares added to handle the error cases, but in the new accounting we also allow the "undo" updates for any declared update for error handling purpose. I would hope that means we can get rid of the explicit punch calls entirely? Is it possible to get rid of the other explicit undo declarations as well (ref_del, delete)?

            adilger Andreas Dilger added a comment - - edited I don't think we need to keep the full number of threads with the full transaction credits busy. There will always be some fraction of threads that are outside the transaction at a given time. That said, reducing the credits count by any amount will always help to avoid premature transaction commit and checkpoint. For the llog punch, it surprises me that we would need to reserve extra credits to truncate data just written? At most any llog record could be 2 blocks long, so we shouldn't need more than {2*(bitmap + GDT to free), 1 block to overwrite} + inode = 5 blocks to truncate a single llog record. If we are already reserving this much for the write (assume yes?) why reserve extra for truncate? At one point, there were some explicit declares added to handle the error cases, but in the new accounting we also allow the "undo" updates for any declared update for error handling purpose. I would hope that means we can get rid of the explicit punch calls entirely? Is it possible to get rid of the other explicit undo declarations as well (ref_del, delete)?
            bzzz Alex Zhuravlev added a comment - - edited

            another way to save some credits can be truncate (dt_decalre_punch()) in llog_osd_write_blob(). it seem to be used by MGS which generates "partial" records and writes header/payload/tail separately - truncate is used to rollback partial changes. probably it's possible to teach MGS to use "full" records and get rid of truncate. with 7 stripes that would save us 56 credits on ldiskfs - 13% on unlink of 7-stripes file - decreasing credits from 436 to 380..

            that said, on 170 stripes this still gives us ~380/7*170=9180 credits. 3 concurrent transactions can easily cause early transaction close. with 25600 credits per transaction and 256 threads, we'd need to decrease this down to 256 credits per thread/transactions. I don't think this is possible with the current design. something like a postponed iteration turning a big blob (essentially LOVEA) into a series of tiny transactions could help.

            bzzz Alex Zhuravlev added a comment - - edited another way to save some credits can be truncate (dt_decalre_punch()) in llog_osd_write_blob(). it seem to be used by MGS which generates "partial" records and writes header/payload/tail separately - truncate is used to rollback partial changes. probably it's possible to teach MGS to use "full" records and get rid of truncate. with 7 stripes that would save us 56 credits on ldiskfs - 13% on unlink of 7-stripes file - decreasing credits from 436 to 380.. that said, on 170 stripes this still gives us ~380/7*170=9180 credits. 3 concurrent transactions can easily cause early transaction close. with 25600 credits per transaction and 256 threads, we'd need to decrease this down to 256 credits per thread/transactions. I don't think this is possible with the current design. something like a postponed iteration turning a big blob (essentially LOVEA) into a series of tiny transactions could help.

            Andreas, have a look at http://review.whamcloud.com/#/c/9258/ please, I did exactly like you suggested for LMA.
            for files like last_rcvd/lov_objids I used optimistic (and cheap) way to recognize overwrite so drop credits required to allocate new blocks.
            that said, we have to declare llogs in the create path, actually - in case of error we should be able to destroy all the stripes just created.
            given we can't deal with gaps in the stream of new OST objects.

            bzzz Alex Zhuravlev added a comment - Andreas, have a look at http://review.whamcloud.com/#/c/9258/ please, I did exactly like you suggested for LMA. for files like last_rcvd/lov_objids I used optimistic (and cheap) way to recognize overwrite so drop credits required to allocate new blocks. that said, we have to declare llogs in the create path, actually - in case of error we should be able to destroy all the stripes just created. given we can't deal with gaps in the stream of new OST objects.

            I checked with the code and observed this is due to extra declare_attr_set() in mdd_declare_object_initialize(). in Orion we tried to remove all extras and instead have all attributes set by dt_create(). this has changed over time, but we can fix that easily with something like:

            @@ -423,7 +423,7 @@ static int osp_declare_attr_set(const struct lu_env *env, struct dt_object *dt
            RETURN(rc);
            }

            • if (o->opo_new) {
              + if (o->opo_new || fid_is_zero(lu_object_fid(&dt->do_lu))) {

            so that OSP just ignores setattr's on the new OST objects.

            this should help with create path, but won't with unlink as we have to write those 170 records.. here is another patch to improve this a bit:
            http://review.whamcloud.com/#/c/9258/

            bzzz Alex Zhuravlev added a comment - I checked with the code and observed this is due to extra declare_attr_set() in mdd_declare_object_initialize(). in Orion we tried to remove all extras and instead have all attributes set by dt_create(). this has changed over time, but we can fix that easily with something like: @@ -423,7 +423,7 @@ static int osp_declare_attr_set(const struct lu_env *env, struct dt_object *dt RETURN(rc); } if (o->opo_new) { + if (o->opo_new || fid_is_zero(lu_object_fid(&dt->do_lu))) { so that OSP just ignores setattr's on the new OST objects. this should help with create path, but won't with unlink as we have to write those 170 records.. here is another patch to improve this a bit: http://review.whamcloud.com/#/c/9258/

            I was wondering where the setxattr operations would be needed, but the LMA FID on the unlink llog files makes sense. That said, the LMA xattr does not consume any space in the filesystem, since it is stored in the inode. I wonder if we can declare the setxattr more precisely, so that the OSD knows the xattr size and can account for it more appropriately?

            I suspect that a number if these operations can similarly be optimized with a bit more information. In particular, for file create I suspect each OSP will reserve a write to the lov_objids file, but those are always overwrites and should not consume more than a single credit. Since there are normally many u64 shared between OSPs in the same block, this could also be reduced significantly without any risk. At most ((max_ost_idx * sizeof(_u64) + blocksize - 1) / blocksize) blocks could be consumed for that file, not the thousands that are reserved today.

            Likewise, updates to the last_rcvd file typically only need a single block, and that could be known if the declare knows the offset and size of the write.

            adilger Andreas Dilger added a comment - I was wondering where the setxattr operations would be needed, but the LMA FID on the unlink llog files makes sense. That said, the LMA xattr does not consume any space in the filesystem, since it is stored in the inode. I wonder if we can declare the setxattr more precisely, so that the OSD knows the xattr size and can account for it more appropriately? I suspect that a number if these operations can similarly be optimized with a bit more information. In particular, for file create I suspect each OSP will reserve a write to the lov_objids file, but those are always overwrites and should not consume more than a single credit. Since there are normally many u64 shared between OSPs in the same block, this could also be reduced significantly without any risk. At most ((max_ost_idx * sizeof( _u64) + blocksize - 1) / blocksize) blocks could be consumed for that file, not the thousands that are reserved today. Likewise, updates to the last_rcvd file typically only need a single block, and that could be known if the declare knows the offset and size of the write.

            People

              niu Niu Yawei (Inactive)
              mhanafi Mahmoud Hanafi
              Votes:
              0 Vote for this issue
              Watchers:
              17 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: