[LU-14918] too many ldiskfs transaction credits when unlinking overstriped files Created: 06/Aug/21  Updated: 29/Dec/23  Resolved: 14/Feb/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0, Lustre 2.15.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Major
Reporter: Andreas Dilger Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14965 ldiskfs/namei.c:3331 ldiskfs_orphan_a... Resolved
is related to LU-15380 Hang in osp_precreate_reserve() while... Open
is related to LU-5890 lustre-OST0001: too many transaction ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Removing widely overstriped files from an ldiskfs MDT causes excessively many transaction credits to be reserved. This can be seen in the MDS console logs:

Lustre: DEBUG MARKER: == sanity test 130g: FIEMAP (overstripe file) ========
Lustre: 25401:0:(osd_handler.c:1934:osd_trans_start()) lustre-MDT0000: credits 54595 > trans_max 2592
Lustre: 25401:0:(osd_handler.c:1863:osd_trans_dump_creds())   create: 800/6400/0, destroy: 1/4/0
Lustre: 25401:0:(osd_handler.c:1870:osd_trans_dump_creds())   attr_set: 3/3/0, xattr_set: 804/148/0
Lustre: 25401:0:(osd_handler.c:1880:osd_trans_dump_creds())   write: 4001/34410/0, punch: 0/0/0, quota 6/6/0
Lustre: 25401:0:(osd_handler.c:1887:osd_trans_dump_creds())   insert: 801/13616/0, delete: 2/5/0
Lustre: 25401:0:(osd_handler.c:1894:osd_trans_dump_creds())   ref_add: 1/1/0, ref_del: 2/2/0
Pid: 25401, comm: mdt00_004 3.10.0-1160.36.2.el7_lustre.x86_64 #1 SMP Tue Aug 3 23:03:31 UTC 2021
Call Trace:
libcfs_call_trace+0x90/0xf0 [libcfs]
libcfs_debug_dumpstack+0x26/0x30 [libcfs]
osd_trans_start+0x4bb/0x4e0 [osd_ldiskfs]
top_trans_start+0x702/0x940 [ptlrpc]
lod_trans_start+0x34/0x40 [lod]
mdd_trans_start+0x1a/0x20 [mdd]
mdd_unlink+0x4ee/0xae0 [mdd]
mdo_unlink+0x1b/0x1d [mdt]
mdt_reint_unlink+0xb64/0x1890 [mdt]
mdt_reint_rec+0x83/0x210 [mdt]
mdt_reint_internal+0x720/0xaf0 [mdt]
mdt_reint+0x67/0x140 [mdt]
tgt_request_handle+0x7ea/0x1750 [ptlrpc]
ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
ptlrpc_main+0xb3c/0x14e0 [ptlrpc]
Lustre: 25401:0:(osd_internal.h:1325:osd_trans_exec_op()) lustre-MDT0000: opcode 7: before 2589 < left 34410, rollback = 7

and

Lustre: DEBUG MARKER: == sanity test 27Cd: test maximum stripe count ========
Lustre: 12686:0:(osd_handler.c:1934:osd_trans_start()) lustre-MDT0003: credits 136195 > trans_max 2592
Lustre: 12686:0:(osd_handler.c:1863:osd_trans_dump_creds())   create: 2000/16000/0, destroy: 1/4/0
Lustre: 12686:0:(osd_handler.c:1870:osd_trans_dump_creds())   attr_set: 3/3/0, xattr_set: 2004/148/0
Lustre: 12686:0:(osd_handler.c:1880:osd_trans_dump_creds())   write: 10001/86010/0, punch: 0/0/0, quota 6/6/0
Lustre: 12686:0:(osd_handler.c:1887:osd_trans_dump_creds())   insert: 2001/34016/0, delete: 2/5/0
Lustre: 12686:0:(osd_handler.c:1894:osd_trans_dump_creds())   ref_add: 1/1/0, ref_del: 2/2/0
Pid: 12686, comm: mdt00_000 3.10.0-1160.36.2.el7_lustre.x86_64 #1 SMP Tue Aug 3 23:03:31 UTC 2021
Call Trace:
libcfs_call_trace+0x90/0xf0 [libcfs]
libcfs_debug_dumpstack+0x26/0x30 [libcfs]
osd_trans_start+0x4bb/0x4e0 [osd_ldiskfs]
top_trans_start+0x702/0x940 [ptlrpc]
lod_trans_start+0x34/0x40 [lod]
mdd_trans_start+0x1a/0x20 [mdd]
mdd_unlink+0x4ee/0xae0 [mdd]
mdo_unlink+0x1b/0x1d [mdt]
mdt_reint_unlink+0xb64/0x1890 [mdt]
mdt_reint_rec+0x83/0x210 [mdt]
mdt_reint_internal+0x720/0xaf0 [mdt]
mdt_reint+0x67/0x140 [mdt]
tgt_request_handle+0x7ea/0x1750 [ptlrpc]
ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
ptlrpc_main+0xb3c/0x14e0 [ptlrpc]

and similarly in sanity test_130e, sanity-pfl test_0b, test_1c, always during unlink.

The two examples shown are trying to reserve a whopping 213MiB and 532MiB of journal space, respectively. Since the maximum xattr size for an overstriped file is 64KiB, this is pretty excessive.



 Comments   
Comment by Andreas Dilger [ 06/Aug/21 ]

There was previously patch http://review.whamcloud.com/8881 "LU-4648 ldiskfs: delete external EA in another transaction" that pushed the unlink of the xattr inode to a separate transaction, so either that code is not working properly anymore, or mdd_declare_unlink() (and sub-functions) is counting the entire xattr size in the credits for the unlink, even though it is done in a separate transaction.

Comment by Andreas Dilger [ 26/Aug/21 ]

This may also be fixed by the LU-14965 patch.

Comment by Zhenyu Xu [ 27/Aug/21 ]

I'm confused where these "create/write" requests come from, since mdd_declare_unlink() is quite simple

mdd_declare_unlink()
1628         if (c != NULL) {                                                        
1629                 rc = mdo_declare_ref_del(env, c, handle);                       
1630                 if (rc)                                                         
1631                         return rc;                                              
1632                                                                                 
1633                 rc = mdo_declare_ref_del(env, c, handle);                       
1634                 if (rc)                                                         
1635                         return rc;                                              
1636                                                                                 
1637                 la->la_valid = LA_CTIME;                                        
1638                 rc = mdo_declare_attr_set(env, c, la, handle);                  
1639                 if (rc)                                                         
1640                         return rc;                                              
1641                                                                                 
1642                 rc = mdd_declare_finish_unlink(env, c, handle);                 
1643                 if (rc)                                                         
1644                         return rc;                                              
1645                                                                                 
1646                 /* FIXME: need changelog for remove entry */                    
1647                 rc = mdd_declare_changelog_store(env, mdd, CL_UNLINK, name,     
1648                                                  NULL, handle);                 
1649         }                                             
mdd_declare_finish_unlink()
1519         rc = mdd_mark_orphan_object(env, obj, handle, true);                    
1520         if (rc != 0)                                                            
1521                 return rc;                                                              
1522                                                                                 
1523         rc = mdo_declare_destroy(env, obj, handle);                             
1524         if (rc != 0)                                                            
1525                 return rc;                                                      
1526                                                                                 
1527         rc = mdd_orphan_declare_insert(env, obj, mdd_object_type(obj), handle); 
1528         if (rc != 0)                                                            
1529                 return rc;                                                      
1530                                                                                 
1531         return mdd_declare_links_del(env, obj, handle);    

mdd_declare_changelog_store() should not generate so much write requests I think.

Comment by Andreas Dilger [ 28/Aug/21 ]

The debug message shows that it is logging thousands of separate operations:

create: 800/6400/0, destroy: 1/4/0
write: 4001/34410/0, punch: 0/0/0, quota 6/6/0
insert: 801/13616/0, delete: 2/5/0

This means 800 creates for 6400 credits, 4000 writes (5 per create/insert) for 34410 credits, and 801 inserts for 13616 credits. The sanity test_130g itself is unlinking an overstriped file with $((OSTCOUNT * 100)) = 800 stripes, and test_27Cd is unlinking a 2000-stripe file, but those objects are on the OSTs, not on the MDT. It looks like the LOD layer is declaring local credits for each remote object being unlinked? I could understand that LOD needs a few credits to update llog records for each unlinks (e.g. enough to allocate a few blocks for each OST), but it doesn't need hundreds of MB of credits.

It wouldn't be too hard to add extra debugging into osd_declare_create(), osd_declare_write(), and osd_declare_insert() that if there are more than, say, 50 such operations in a single handle to print a CWARN() and libcfs_debug_dumpstack() (and set a flag so that it doesn't happen another 750 or 3950 times) to see where this is happening in the code. However, looking at lod_declare_destroy() it appears clear that this is doing a declare for every stripe, and not limiting itself when there are multiple objects per OST.

Firstly, it is worthwhile to confirm that there is not a "sub handle" for each stripe instead of for each OST. That would both increase the RAM usage, and make it difficult to aggregate credits for a single OST's objects. Next, if ldo_dir_stripe_count > ost_count to declare a bitmap for the number of OSTs and then limit the number of credits if that OST has already been assigned credits for logging the unlink of the objects.

Alex should know the details, but I expect the number of credits needed to unlink an object are essentially one llog block with indirects to write an llog record (64 bytes for llog_unlink64_rec, if that is the right one, but close enough), plus a rare chance of creating a new llog file and inserting it into the O/1/dN directory. A new llog file might happen once per unlink per OST, but not 2000 times. Similarly, destroying 2000 objects doesn't take 4000 llog blocks with 34000 blocks. Instead, it would take either one llog block+ overhead per OST, or about 2000x64/4096 bytes =32KB = 8 blocks + 32 indirects if all the objects were on a single OST.

Comment by Zhenyu Xu [ 31/Aug/21 ]

Yes, the write/create declaration come from the llog record for unlinking all stripe objects.

like this example

[10033.829237] [<ffffffffa757a3ae>] dump_stack+0x19/0x1b
[10033.829244] [<ffffffffc1908d74>] osd_declare_write+0x434/0x540 [osd_ldiskfs]
[10033.829255] [<ffffffffc104a150>] llog_osd_declare_write_rec+0xe0/0x3a0 [obdclass]
[10033.829265] [<ffffffffc103e196>] llog_declare_write_rec+0x1e6/0x240 [obdclass]
[10033.829289] [<ffffffffc104475c>] llog_cat_declare_add_rec+0x9c/0x260 [obdclass]
[10033.829297] [<ffffffffc103b3ef>] llog_declare_add+0x14f/0x1c0 [obdclass]
[10033.829302] [<ffffffffc1bb75ba>] osp_sync_declare_add+0x11a/0x490 [osp]
[10033.829306] [<ffffffffc1ba6443>] osp_declare_destroy+0x1a3/0x200 [osp]
[10033.829312] [<ffffffffc1b54a6e>] lod_sub_declare_destroy+0xce/0x2d0 [lod]
[10033.829318] [<ffffffffc1b2a51d>] lod_obj_stripe_destroy_cb+0x8d/0xa0 [lod]
[10033.829325] [<ffffffffc1b37c8e>] lod_obj_for_each_stripe+0x11e/0x2c0 [lod]
[10033.829330] [<ffffffffc1b38bf2>] lod_declare_destroy+0x5c2/0x630 [lod]
[10033.829344] [<ffffffffc19c55f3>] mdd_declare_finish_unlink+0x83/0x200 [mdd]
[10033.829349] [<ffffffffc19d63af>] mdd_unlink+0x4af/0xaf0 [mdd]
[10033.829359] [<ffffffffc1a9cb9c>] mdo_unlink+0x1b/0x1d [mdt]
[10033.829368] [<ffffffffc1a5893f>] mdt_reint_unlink+0x9ff/0x1460 [mdt]

 
I think the "sub handle" for each stripe is eventually derived from the device of the stripes, so that's comforting.

lod_sub_declare_destroy(env, dt, th)->lod_sub_get_thandle(env, th, dt, ...)

lod_sub_declare_destroy(env, dt, th)
350         sub_th = lod_sub_get_thandle(env, th, dt, &record_update);    
...
357         rc = dt_declare_destroy(env, dt, sub_th);    
lod_sub_get_thandle(env, th, dt, ...)
104         sub_th = thandle_get_sub(env, th, dt);       
thandle_get_sub(env, th, dt)
474 {                                                                               
475         return thandle_get_sub_by_dt(env, th, lu2dt_dev(dt->do_lu.lo_dev));         
476 } 

So the issue locates in that unlinking a wide-striped file reserves too much credits for llog records of unlinking its sub objects.

Comment by Andreas Dilger [ 31/Aug/21 ]

Bobijam, are you able to work on a fix for this, or should Mike look into it?

Comment by Alex Zhuravlev [ 31/Aug/21 ]

well, it's still possible (though not likely) that, for example, after some manual manipulations such an unlink will need to create that many llog files.

Comment by Alex Zhuravlev [ 31/Aug/21 ]

I think we can postpone llog object removal using a dedicated thread, this would save some credits.

Comment by Andreas Dilger [ 31/Aug/21 ]

Alex, in the overstriping case, how would it be possible for the unlink to have eg. 100 different llog files for a single OST? I am understand it would be possible to create one new llog file, and several new blocks in that file, but with 65000 records per llog file it isn't possible even under crazy cases to need so many credits.

Comment by Alex Zhuravlev [ 31/Aug/21 ]

ok, I got it, how many different OSTs were actually involved?

Comment by Zhenyu Xu [ 01/Sep/21 ]

Needs Mike's input, llog_osd_declare_write_rec() just use loghandle->lgh_ctxt->loc_chunk_size instead of llog records len to write the llog, which is wasteful.

Comment by Alex Zhuravlev [ 01/Sep/21 ]

a single new record involves:

  • record itself, which can't be less than blocksize (4K for ldiskfs)
  • llog header, which in turn can be two different blocks: 1st with llh_count (4K), 2nd with bitmap (4K)
Comment by Andreas Dilger [ 02/Sep/21 ]

Alex, this issue can be seen in any review-dne-part-1 sanity test_27Cd (or other subtest using overstriping), for example:
https://testing.whamcloud.com/test_sets/afe2d412-628d-4a7f-bda4-1c3359a97ff1

[ 1863.395427] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 27Cd: test maximum stripe count =========== 11:50:04 \(1630583404\)
[ 1863.871341] Lustre: DEBUG MARKER: == sanity test 27Cd: test maximum stripe count =========== 11:50:04 (1630583404)
[ 1870.327497] Lustre: 11292:0:(osd_handler.c:1938:osd_trans_start()) lustre-MDT0003: credits 128195 > trans_max 2592
[ 1870.329179] Lustre: 11292:0:(osd_handler.c:1867:osd_trans_dump_creds())   create: 2000/8000/0, destroy: 1/4/0
[ 1870.330663] Lustre: 11292:0:(osd_handler.c:1874:osd_trans_dump_creds())   attr_set: 3/3/0, xattr_set: 2004/148/0
[ 1870.332184] Lustre: 11292:0:(osd_handler.c:1884:osd_trans_dump_creds())   write: 10001/86010/0, punch: 0/0/0, quota 6/6/0
[ 1870.333817] Lustre: 11292:0:(osd_handler.c:1891:osd_trans_dump_creds())   insert: 2001/34016/0, delete: 2/5/0
[ 1870.335318] Lustre: 11292:0:(osd_handler.c:1898:osd_trans_dump_creds())   ref_add: 1/1/0, ref_del: 2/2/0

There are 4 MDTs and 8 OSTs in that case, and a 512MB transaction handle request.

Comment by Zhenyu Xu [ 03/Sep/21 ]

Tried a patch to use llog rec len instread of chunk_size in llog_osd_declare_write_rec(), the write credits demand decreased a little

patch
diff --git a/lustre/obdclass/llog_osd.c b/lustre/obdclass/llog_osd.c
index 5aed12e9b9..18b1b80fdf 100644
--- a/lustre/obdclass/llog_osd.c
+++ b/lustre/obdclass/llog_osd.c
@@ -320,7 +320,6 @@ static int llog_osd_declare_write_rec(const struct lu_env *env,
                                      int idx, struct thandle *th)
 {
        struct llog_thread_info *lgi = llog_info(env);
-       __u32                   chunk_size;
        struct dt_object        *o;
        int                      rc;
 
@@ -335,8 +334,7 @@ static int llog_osd_declare_write_rec(const struct lu_env *env,
        o = loghandle->lgh_obj;
        LASSERT(o);
 
-       chunk_size = loghandle->lgh_ctxt->loc_chunk_size;
-       lgi->lgi_buf.lb_len = chunk_size;
+       lgi->lgi_buf.lb_len = sizeof(struct llog_log_hdr);
        lgi->lgi_buf.lb_buf = NULL;
        /* each time we update header */
        rc = dt_declare_record_write(env, o, &lgi->lgi_buf, 0,
@@ -348,7 +346,7 @@ static int llog_osd_declare_write_rec(const struct lu_env *env,
         * the pad record can be inserted so take into account double
         * record size
         */
-       lgi->lgi_buf.lb_len = chunk_size * 2;
+       lgi->lgi_buf.lb_len = rec->lrh_len * 2;
        lgi->lgi_buf.lb_buf = NULL;
        /* XXX: implement declared window or multi-chunks approach */
        rc = dt_declare_record_write(env, o, &lgi->lgi_buf, -1, th);

sanity.sh test_27Cd
credits demand before the patch

[ 2954.917326] Lustre: 20362:0:(osd_handler.c:1938:osd_trans_start()) lustre-MDT0001: credits 128195 > trans_max 2496
[ 2954.917330] Lustre: 20362:0:(osd_handler.c:1867:osd_trans_dump_creds())   create: 2000/8000/0, destroy: 1/4/0
[ 2954.917333] Lustre: 20362:0:(osd_handler.c:1874:osd_trans_dump_creds())   attr_set: 3/3/0, xattr_set: 2004/148/0
[ 2954.917335] Lustre: 20362:0:(osd_handler.c:1884:osd_trans_dump_creds())   write: 10001/86010/0, punch: 0/0/0, quota 6/6/0
[ 2954.917337] Lustre: 20362:0:(osd_handler.c:1891:osd_trans_dump_creds())   insert: 2001/34016/0, delete: 2/5/0
[ 2954.917339] Lustre: 20362:0:(osd_handler.c:1898:osd_trans_dump_creds())   ref_add: 1/1/0, ref_del: 2/2/0

credits demand after the patch

[ 3371.600695] Lustre: 30759:0:(osd_handler.c:1938:osd_trans_start()) lustre-MDT0000: credits 92195 > trans_max 2496
[ 3371.600699] Lustre: 30759:0:(osd_handler.c:1867:osd_trans_dump_creds())   create: 2000/8000/0, destroy: 1/4/0
[ 3371.600701] Lustre: 30759:0:(osd_handler.c:1874:osd_trans_dump_creds())   attr_set: 3/3/0, xattr_set: 2004/148/0
[ 3371.600704] Lustre: 30759:0:(osd_handler.c:1884:osd_trans_dump_creds())   write: 10001/50010/0, punch: 0/0/0, quota 6/6/0
[ 3371.600706] Lustre: 30759:0:(osd_handler.c:1891:osd_trans_dump_creds())   insert: 2001/34016/0, delete: 2/5/0
[ 3371.600708] Lustre: 30759:0:(osd_handler.c:1898:osd_trans_dump_creds())   ref_add: 1/1/0, ref_del: 2/2/0
Comment by Alex Zhuravlev [ 03/Sep/21 ]

this means that loc_chunk_size is not LLOG_MIN_CHUNK_SIZE for unlink llogs?

Comment by Mikhail Pershin [ 03/Sep/21 ]

Alex, as I understand patch uses just header size and rec->lrh_len to declare llog write but not loc_chunk_size so yes, both are not LLOG_MIN_CHUNK_SIZE.

As I remember we were using loc_chunk_size in declaration just because we don't know where record (and bit in bitmap) will be really written due to llog append nature, that is why we are declaring whole chunk for header and current chunk along with next one for the record. So we can't replace that with declaration of just record size and header size.

Note that write itself is more narrowed, it is done with exact offset/len values as they are known at the moment of write.

Comment by Alex Zhuravlev [ 03/Sep/21 ]

yes, I don't think this change is correct:

+       lgi->lgi_buf.lb_len = rec->lrh_len * 2;

as lrh_len can be more than remaining space in the last block we may need to write a padding record (last block N), then add the new record to block N+1

Comment by Alex Zhuravlev [ 03/Sep/21 ]

what we can do, I guess, is to declare that potential padding block not as an append, but a record within llog's size which would mean a single block to modify (in contrast of append which may need many more blocks - allocation, externt tree, etc)

Comment by Alex Zhuravlev [ 03/Sep/21 ]

but I think this way we don't save much, like Andreas said instead we should be looking for ways to declare at most OSTCOUNT new plain llogs and #stripes records. what about the case when lots of MDT threads are unlinking such overstriped files concurrently?
64K records / 128 threads = 512 records per thread

Comment by Mikhail Pershin [ 03/Sep/21 ]

also as I remember that was mostly due to ZFS which needs exact offset/len to be declared and complains if write doesn't match. So as we can't say for sure exact offset at the declaration time we assume range where it can go - the current chunk and the next one. And also we should take case about non-atomic 'declare'-'write' nature, I think there could be concurrent record additions in between, can't they?

Comment by Andreas Dilger [ 03/Sep/21 ]

Mike, I think that was later fixed by Alex to remove the "specific declare" since that hurt performance and didn't save much space...

Comment by Andreas Dilger [ 03/Sep/21 ]

As for concurrent records, that would be mostly fine. As I wrote before, there may be some interleaving of records, but not 65000 different records written from other threads between each delete of the overstriped object that would trigger a new llog file each time.

Comment by Andreas Dilger [ 27/Nov/21 ]

I'd like to still keep this under consideration for 2.15.0 since it is hit very often during usage, and the transaction size (over 500MB) can hurt performance significantly.

Comment by Andreas Dilger [ 27/Nov/21 ]

Assign to Alex to balance 2.15 tickets.

Comment by Gerrit Updater [ 07/Dec/21 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45765
Subject: LU-14918 osd: don't declare similar writes twice
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 54aef59ac5bd349a3d42b71b3d0bdc9cda93066e

Comment by Gerrit Updater [ 19/Jan/23 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49701
Subject: LU-14918 osd: don't declare similar writes twice
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e4a86ff32d18b6598880df1ca19e16af5a8b781b

Comment by Gerrit Updater [ 14/Feb/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/45765/
Subject: LU-14918 osd: don't declare similar ldiskfs writes twice
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9e6225b2e7385cbb7be0474df01075fafc4966d5

Comment by Gerrit Updater [ 14/Feb/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49701/
Subject: LU-14918 osd: don't declare similar zfs writes twice
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c1936c9d294d53ff39741e1b07ffc74f51fcddb6

Comment by Peter Jones [ 14/Feb/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:13:53 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.