[LU-4611] too many transaction credits (32279 > 25600) Created: 11/Feb/14  Updated: 31/Oct/14  Resolved: 26/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: Lustre 2.6.0, Lustre 2.5.2

Type: Bug Priority: Blocker
Reporter: Mahmoud Hanafi Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-4479 Credit error with unlink for large nu... Resolved
is duplicated by LU-4798 Too many transaction credits (28288 >... Closed
Related
is related to LU-4494 sanity-scrub test 4: lustre-MDT0001: ... Resolved
is related to LU-4193 increase maximum default ldiskfs jour... Resolved
Severity: 3
Rank (Obsolete): 12616

 Description   

Getting odd quota errors with the following errors on the mdt. This is a newly created filesystem.

Filesystem volume name:   nbp9-MDT0000
Last mounted on:          /
Filesystem UUID:          4615f09e-ac04-44de-a4d1-b463f280d6da
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
Filesystem flags:         signed_directory_hash 
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              644251648
Block count:              322122752
Reserved block count:     0
Free blocks:              241421671
Free inodes:              644239829
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1024
Blocks per group:         16384
Fragments per group:      16384
Inodes per group:         32768
Inode blocks per group:   4096
Flex block group size:    16
Filesystem created:       Wed Feb  5 15:31:14 2014
Last mount time:          Mon Feb 10 08:45:54 2014
Last write time:          Mon Feb 10 08:45:54 2014
Mount count:              5
Maximum mount count:      -1
Last checked:             Wed Feb  5 15:31:14 2014
Check interval:           0 (<none>)
Lifetime writes:          307 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:	          512
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      f50d0c82-4edf-4e98-94ef-69ed3ad456d0
Journal backup:           inode blocks
User quota inode:         3
Group quota inode:        4
nbp9-mds /var/log # tunefs.lustre /dev/mapper/nbp9--vg-mdt9
checking for existing Lustre data: found
Reading CONFIGS/mountdata

   Read previous values:
Target:     nbp9-MDT0000
Index:      0
Lustre FS:  nbp9
Mount type: ldiskfs
Flags:      0x1
              (MDT )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters: mgsnode=10.151.26.5@o2ib lov.stripesize=1048576 lov.stripecount=4
Feb 10 07:37:42 nbp9-mds kernel: Lustre: 5858:0:(osd_handler.c:828:osd_trans_start()) nbp9-MDT0000: too many transaction credits (32279 > 25600)
Feb 10 07:37:42 nbp9-mds kernel: Lustre: 5858:0:(osd_handler.c:835:osd_trans_start())   create: 170/4250, delete: 0/0, destroy: 0/0
Feb 10 07:37:43 nbp9-mds kernel: Lustre: 5858:0:(osd_handler.c:840:osd_trans_start())   attr_set: 2/2, xattr_set: 172/2395
Feb 10 07:37:43 nbp9-mds kernel: Lustre: 5858:0:(osd_handler.c:847:osd_trans_start())   write: 1523/21322, punch: 338/1352, quota 4/52
Feb 10 07:37:43 nbp9-mds kernel: Lustre: 5858:0:(osd_handler.c:852:osd_trans_start())   insert: 171/2906, delete: 0/0
Feb 10 07:37:43 nbp9-mds kernel: Lustre: 5858:0:(osd_handler.c:857:osd_trans_start())   ref_add: 0/0, ref_del: 0/0
Feb 10 07:37:43 nbp9-mds kernel: Pid: 5858, comm: mdt02_043
Feb 10 07:37:43 nbp9-mds kernel: 
Feb 10 07:37:43 nbp9-mds kernel: Call Trace:
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0514895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0cfc41e>] osd_trans_start+0x65e/0x680 [osd_ldiskfs]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0e68309>] lod_trans_start+0x1b9/0x250 [lod]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0773117>] mdd_trans_start+0x17/0x20 [mdd]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa076712a>] mdd_create+0x91a/0x1790 [mdd]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0d03937>] ? osd_xattr_get+0x97/0x2d0 [osd_ldiskfs]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0ddf2a2>] mdt_reint_open+0x1362/0x20e0 [mdt]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa053185e>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa07fadcc>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0dc9981>] mdt_reint_rec+0x41/0xe0 [mdt]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0daeb03>] mdt_reint_internal+0x4c3/0x780 [mdt]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0daf090>] mdt_intent_reint+0x1f0/0x530 [mdt]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0dacf3e>] mdt_intent_policy+0x39e/0x720 [mdt]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa07b2831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa07d91ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0dad3c6>] mdt_enqueue+0x46/0xe0 [mdt]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0db3ad7>] mdt_handle_common+0x647/0x16d0 [mdt]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0ded615>] mds_regular_handle+0x15/0x20 [mdt]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa080b3c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa05155de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0526d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa0802729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffff81055813>] ? __wake_up+0x53/0x70
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa080c75e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa080bc90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa080bc90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffffa080bc90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Feb 10 07:37:43 nbp9-mds kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Feb 10 07:37:43 nbp9-mds kernel: 
Feb 10 07:37:43 nbp9-mds kernel: Lustre: 5858:0:(osd_handler.c:828:osd_trans_start()) nbp9-MDT0000: too many transaction credits (32279 > 25600)
Feb 10 07:37:43 nbp9-mds kernel: Lustre: 5858:0:(osd_handler.c:835:osd_trans_start())   create: 170/4250, delete: 0/0, destroy: 0/0
Feb 10 07:37:43 nbp9-mds kernel: Lustre: 5858:0:(osd_handler.c:840:osd_trans_start())   attr_set: 2/2, xattr_set: 172/2395
Feb 10 07:37:43 nbp9-mds kernel: Lustre: 5858:0:(osd_handler.c:847:osd_trans_start())   write: 1523/21322, punch: 338/1352, quota 4/52
Feb 10 07:37:43 nbp9-mds kernel: Lustre: 5858:0:(osd_handler.c:852:osd_trans_start())   insert: 171/2906, delete: 0/0
Feb 10 07:37:43 nbp9-mds kernel: Lustre: 5858:0:(osd_handler.c:857:osd_trans_start())   ref_add: 0/0, ref_del: 0/0
Feb 10 07:38:48 nbp9-mds kernel: Lustre: 5791:0:(osd_handler.c:828:osd_trans_start()) nbp9-MDT0000: too many transaction credits (32279 > 25600)
Feb 10 07:38:48 nbp9-mds kernel: Lustre: 5791:0:(osd_handler.c:828:osd_trans_start()) Skipped 2 previous similar messages
Feb 10 07:38:48 nbp9-mds kernel: Lustre: 5791:0:(osd_handler.c:835:osd_trans_start())   create: 170/4250, delete: 0/0, destroy: 0/0
Feb 10 07:38:48 nbp9-mds kernel: Lustre: 5791:0:(osd_handler.c:835:osd_trans_start()) Skipped 2 previous similar messages
Feb 10 07:38:48 nbp9-mds kernel: Lustre: 5791:0:(osd_handler.c:840:osd_trans_start())   attr_set: 2/2, xattr_set: 172/2395
Feb 10 07:38:48 nbp9-mds kernel: Lustre: 5791:0:(osd_handler.c:840:osd_trans_start()) Skipped 2 previous similar messages
Feb 10 07:38:48 nbp9-mds kernel: Lustre: 5791:0:(osd_handler.c:847:osd_trans_start())   write: 1523/21322, punch: 338/1352, quota 4/52
Feb 10 07:38:48 nbp9-mds kernel: Lustre: 5791:0:(osd_handler.c:847:osd_trans_start()) Skipped 2 previous similar messages
Feb 10 07:38:48 nbp9-mds kernel: Lustre: 5791:0:(osd_handler.c:852:osd_trans_start())   insert: 171/2906, delete: 0/0
Feb 10 07:38:48 nbp9-mds kernel: Lustre: 5791:0:(osd_handler.c:852:osd_trans_start()) Skipped 2 previous similar messages
Feb 10 07:38:48 nbp9-mds kernel: Lustre: 5791:0:(osd_handler.c:857:osd_trans_start())   ref_add: 0/0, ref_del: 0/0
Feb 10 07:38:48 nbp9-mds kernel: Lustre: 5791:0:(osd_handler.c:857:osd_trans_start()) Skipped 2 previous similar messages
Feb 10 07:43:07 nbp9-mds kernel: Lustre: 5860:0:(osd_handler.c:828:osd_trans_start()) nbp9-MDT0000: too many transaction credits (32279 > 25600)
Feb 10 07:43:07 nbp9-mds kernel: Lustre: 5860:0:(osd_handler.c:835:osd_trans_start())   create: 170/4250, delete: 0/0, destroy: 0/0
Feb 10 07:43:07 nbp9-mds kernel: Lustre: 5860:0:(osd_handler.c:840:osd_trans_start())   attr_set: 2/2, xattr_set: 172/2395
Feb 10 07:43:07 nbp9-mds kernel: Lustre: 5860:0:(osd_handler.c:847:osd_trans_start())   write: 1523/21322, punch: 338/1352, quota 4/52
Feb 10 07:43:07 nbp9-mds kernel: Lustre: 5860:0:(osd_handler.c:852:osd_trans_start())   insert: 171/2906, delete: 0/0
Feb 10 07:43:07 nbp9-mds kernel: Lustre: 5860:0:(osd_handler.c:857:osd_trans_start())   ref_add: 0/0, ref_del: 0/0


 Comments   
Comment by Peter Jones [ 11/Feb/14 ]

Niu

Could you please look into this one?

Thanks

Peter

Comment by Andreas Dilger [ 12/Feb/14 ]

There is definitely something wrong with the transaction accounting here. We've seen problems similar to this in the past, but this one is much more obvious to see the problems given the large number of stripes. The significant impact of this incorrect accounting is that creating a single widely-striped file is consuming ALL of the available transaction credits and serializing the create until the transaction handle is released.

An equivalent problem might be hit if the number of concurrent creates of single-stripe files exceeds the transaction limit as well. That is entirely possible with the default MDT thread count of 256.

Mahmoud, I'm assuming that this is a create of a 170-stripe file on a filesystem with at least this number of OSTs? The MDT journal size looks to be 400MB (journal transaction size limit is 1/4 of journal size, so journal size = 25600 blocks * 4 * 4kB).

The numbers reported for the transaction accounting are "number of operations/maximum blocks needed for that operation".

create: 170/4250, delete: 0/0, destroy: 0/0
attr_set: 2/2, xattr_set: 172/2395
write: 1523/21322, punch: 338/1352, quota 4/52
insert: 171/2906, delete: 0/0

I suspect that there is some confusion in the OSD layer when creating files for an OST vs an MDT. Some of the operations don't make sense at all for create on an MDT. We shouldn't need an xattr_set for every object (@ 13 blocks per xattr), 2x punch for every object (not sure why this is included at all), nor even a "create" for every OST object created. It also doesn't make sense to have 8x write for every create, nor an insert for every stripe (@16 blocks per insert).

What I would suggest to start is to add a temporary debug line to osd_trans_declare_op() so that it is more obvious when this is being called.

I suspect part of the problem is a mixup in transaction declaration that is accounting for OST objects being created, even though this is happening on the MDT and any OSP credits for these operations should mostly be zero (except a small update in the lov_objids file in case of create).

Comment by Mahmoud Hanafi [ 12/Feb/14 ]

This filesystem has 240 OSTs. I think the user was doing trying to create a full stripe file/dir like this

lfs setstripe -c -1 foo

I am trying to duplicate the error.

Comment by Alex Zhuravlev [ 13/Feb/14 ]

that is because of llog declarations involved mostly - we declare possible "undo" for the file just created. we probably can drop this entirely, need to think a bit. but we can't do this for unlink - we have to write to 170 llogs files. then every llog file is either old or new object (x2). for every object being created we set at least one EA (LMA). I'm looking at write declarations as they consume most of credits.

Comment by Niu Yawei (Inactive) [ 14/Feb/14 ]

Alex & Andreas, I don't see where so many write operations comes from, any idea?

Comment by Andreas Dilger [ 14/Feb/14 ]

I was wondering where the setxattr operations would be needed, but the LMA FID on the unlink llog files makes sense. That said, the LMA xattr does not consume any space in the filesystem, since it is stored in the inode. I wonder if we can declare the setxattr more precisely, so that the OSD knows the xattr size and can account for it more appropriately?

I suspect that a number if these operations can similarly be optimized with a bit more information. In particular, for file create I suspect each OSP will reserve a write to the lov_objids file, but those are always overwrites and should not consume more than a single credit. Since there are normally many u64 shared between OSPs in the same block, this could also be reduced significantly without any risk. At most ((max_ost_idx * sizeof(_u64) + blocksize - 1) / blocksize) blocks could be consumed for that file, not the thousands that are reserved today.

Likewise, updates to the last_rcvd file typically only need a single block, and that could be known if the declare knows the offset and size of the write.

Comment by Alex Zhuravlev [ 14/Feb/14 ]

I checked with the code and observed this is due to extra declare_attr_set() in mdd_declare_object_initialize(). in Orion we tried to remove all extras and instead have all attributes set by dt_create(). this has changed over time, but we can fix that easily with something like:

@@ -423,7 +423,7 @@ static int osp_declare_attr_set(const struct lu_env *env, struct dt_object *dt
RETURN(rc);
}

  • if (o->opo_new) {
    + if (o->opo_new || fid_is_zero(lu_object_fid(&dt->do_lu))) {

so that OSP just ignores setattr's on the new OST objects.

this should help with create path, but won't with unlink as we have to write those 170 records.. here is another patch to improve this a bit:
http://review.whamcloud.com/#/c/9258/

Comment by Alex Zhuravlev [ 14/Feb/14 ]

Andreas, have a look at http://review.whamcloud.com/#/c/9258/ please, I did exactly like you suggested for LMA.
for files like last_rcvd/lov_objids I used optimistic (and cheap) way to recognize overwrite so drop credits required to allocate new blocks.
that said, we have to declare llogs in the create path, actually - in case of error we should be able to destroy all the stripes just created.
given we can't deal with gaps in the stream of new OST objects.

Comment by Alex Zhuravlev [ 17/Feb/14 ]

another way to save some credits can be truncate (dt_decalre_punch()) in llog_osd_write_blob(). it seem to be used by MGS which generates "partial" records and writes header/payload/tail separately - truncate is used to rollback partial changes. probably it's possible to teach MGS to use "full" records and get rid of truncate. with 7 stripes that would save us 56 credits on ldiskfs - 13% on unlink of 7-stripes file - decreasing credits from 436 to 380..

that said, on 170 stripes this still gives us ~380/7*170=9180 credits. 3 concurrent transactions can easily cause early transaction close. with 25600 credits per transaction and 256 threads, we'd need to decrease this down to 256 credits per thread/transactions. I don't think this is possible with the current design. something like a postponed iteration turning a big blob (essentially LOVEA) into a series of tiny transactions could help.

Comment by Andreas Dilger [ 17/Feb/14 ]

I don't think we need to keep the full number of threads with the full transaction credits busy. There will always be some fraction of threads that are outside the transaction at a given time. That said, reducing the credits count by any amount will always help to avoid premature transaction commit and checkpoint.

For the llog punch, it surprises me that we would need to reserve extra credits to truncate data just written? At most any llog record could be 2 blocks long, so we shouldn't need more than {2*(bitmap + GDT to free), 1 block to overwrite} + inode = 5 blocks to truncate a single llog record. If we are already reserving this much for the write (assume yes?) why reserve extra for truncate?

At one point, there were some explicit declares added to handle the error cases, but in the new accounting we also allow the "undo" updates for any declared update for error handling purpose. I would hope that means we can get rid of the explicit punch calls entirely? Is it possible to get rid of the other explicit undo declarations as well (ref_del, delete)?

Comment by Alex Zhuravlev [ 17/Feb/14 ]

from pure ldiskfs side, there is no point to declare punch for the block we write in the same transaction - we'll be modifying same bitmaps, gd, inode, etc. I'm not absolutely sure, but probably that was added to satisfy operation-based verification for credits (that undo thing) - osd_punch() does expect OSD_OT_PUNCH to be declared and osd_write() doesn't declare that.

also, I tend to think that'd break ZFS with full debug enabled, but it's broken anyway.

Comment by Jodi Levi (Inactive) [ 26/Mar/14 ]

Patch landed to Master. Please reopen ticket if more work is needed.

Comment by Jay Lan (Inactive) [ 26/Mar/14 ]

A port to b2_4 is appreciated. Thanks!

Comment by Niu Yawei (Inactive) [ 27/Mar/14 ]

backport to b2_4: http://review.whamcloud.com/9807

Comment by Bob Glossman (Inactive) [ 21/May/14 ]

backport to b2_5:
http://review.whamcloud.com/10407

Generated at Sat Feb 10 01:44:17 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.