[LU-14918] too many ldiskfs transaction credits when unlinking overstriped files Created: 06/Aug/21 Updated: 29/Dec/23 Resolved: 14/Feb/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0, Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Andreas Dilger | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Removing widely overstriped files from an ldiskfs MDT causes excessively many transaction credits to be reserved. This can be seen in the MDS console logs:

```
Lustre: DEBUG MARKER: == sanity test 130g: FIEMAP (overstripe file) ========
Lustre: 25401:0:(osd_handler.c:1934:osd_trans_start()) lustre-MDT0000: credits 54595 > trans_max 2592
Lustre: 25401:0:(osd_handler.c:1863:osd_trans_dump_creds()) create: 800/6400/0, destroy: 1/4/0
Lustre: 25401:0:(osd_handler.c:1870:osd_trans_dump_creds()) attr_set: 3/3/0, xattr_set: 804/148/0
Lustre: 25401:0:(osd_handler.c:1880:osd_trans_dump_creds()) write: 4001/34410/0, punch: 0/0/0, quota 6/6/0
Lustre: 25401:0:(osd_handler.c:1887:osd_trans_dump_creds()) insert: 801/13616/0, delete: 2/5/0
Lustre: 25401:0:(osd_handler.c:1894:osd_trans_dump_creds()) ref_add: 1/1/0, ref_del: 2/2/0
Pid: 25401, comm: mdt00_004 3.10.0-1160.36.2.el7_lustre.x86_64 #1 SMP Tue Aug 3 23:03:31 UTC 2021
Call Trace:
  libcfs_call_trace+0x90/0xf0 [libcfs]
  libcfs_debug_dumpstack+0x26/0x30 [libcfs]
  osd_trans_start+0x4bb/0x4e0 [osd_ldiskfs]
  top_trans_start+0x702/0x940 [ptlrpc]
  lod_trans_start+0x34/0x40 [lod]
  mdd_trans_start+0x1a/0x20 [mdd]
  mdd_unlink+0x4ee/0xae0 [mdd]
  mdo_unlink+0x1b/0x1d [mdt]
  mdt_reint_unlink+0xb64/0x1890 [mdt]
  mdt_reint_rec+0x83/0x210 [mdt]
  mdt_reint_internal+0x720/0xaf0 [mdt]
  mdt_reint+0x67/0x140 [mdt]
  tgt_request_handle+0x7ea/0x1750 [ptlrpc]
  ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
  ptlrpc_main+0xb3c/0x14e0 [ptlrpc]
Lustre: 25401:0:(osd_internal.h:1325:osd_trans_exec_op()) lustre-MDT0000: opcode 7: before 2589 < left 34410, rollback = 7
```

and:

```
Lustre: DEBUG MARKER: == sanity test 27Cd: test maximum stripe count ========
Lustre: 12686:0:(osd_handler.c:1934:osd_trans_start()) lustre-MDT0003: credits 136195 > trans_max 2592
Lustre: 12686:0:(osd_handler.c:1863:osd_trans_dump_creds()) create: 2000/16000/0, destroy: 1/4/0
Lustre: 12686:0:(osd_handler.c:1870:osd_trans_dump_creds()) attr_set: 3/3/0, xattr_set: 2004/148/0
Lustre: 12686:0:(osd_handler.c:1880:osd_trans_dump_creds()) write: 10001/86010/0, punch: 0/0/0, quota 6/6/0
Lustre: 12686:0:(osd_handler.c:1887:osd_trans_dump_creds()) insert: 2001/34016/0, delete: 2/5/0
Lustre: 12686:0:(osd_handler.c:1894:osd_trans_dump_creds()) ref_add: 1/1/0, ref_del: 2/2/0
Pid: 12686, comm: mdt00_000 3.10.0-1160.36.2.el7_lustre.x86_64 #1 SMP Tue Aug 3 23:03:31 UTC 2021
Call Trace:
  libcfs_call_trace+0x90/0xf0 [libcfs]
  libcfs_debug_dumpstack+0x26/0x30 [libcfs]
  osd_trans_start+0x4bb/0x4e0 [osd_ldiskfs]
  top_trans_start+0x702/0x940 [ptlrpc]
  lod_trans_start+0x34/0x40 [lod]
  mdd_trans_start+0x1a/0x20 [mdd]
  mdd_unlink+0x4ee/0xae0 [mdd]
  mdo_unlink+0x1b/0x1d [mdt]
  mdt_reint_unlink+0xb64/0x1890 [mdt]
  mdt_reint_rec+0x83/0x210 [mdt]
  mdt_reint_internal+0x720/0xaf0 [mdt]
  mdt_reint+0x67/0x140 [mdt]
  tgt_request_handle+0x7ea/0x1750 [ptlrpc]
  ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
  ptlrpc_main+0xb3c/0x14e0 [ptlrpc]
```

and similarly in sanity test_130e, sanity-pfl test_0b, test_1c, always during unlink. The two examples shown are trying to reserve a whopping 213MiB and 532MiB of journal space, respectively. Since the maximum xattr size for an overstriped file is 64KiB, this is pretty excessive. |
| Comments |
| Comment by Andreas Dilger [ 06/Aug/21 ] |
|
There was previously patch http://review.whamcloud.com/8881. |
| Comment by Andreas Dilger [ 26/Aug/21 ] |
|
This may also be fixed by the |
| Comment by Zhenyu Xu [ 27/Aug/21 ] |
|
I'm confused about where these "create/write" requests come from, since mdd_declare_unlink() is quite simple:

```
mdd_declare_unlink():
1628         if (c != NULL) {
1629                 rc = mdo_declare_ref_del(env, c, handle);
1630                 if (rc)
1631                         return rc;
1632
1633                 rc = mdo_declare_ref_del(env, c, handle);
1634                 if (rc)
1635                         return rc;
1636
1637                 la->la_valid = LA_CTIME;
1638                 rc = mdo_declare_attr_set(env, c, la, handle);
1639                 if (rc)
1640                         return rc;
1641
1642                 rc = mdd_declare_finish_unlink(env, c, handle);
1643                 if (rc)
1644                         return rc;
1645
1646                 /* FIXME: need changelog for remove entry */
1647                 rc = mdd_declare_changelog_store(env, mdd, CL_UNLINK, name,
1648                                                  NULL, handle);
1649         }
```

```
mdd_declare_finish_unlink():
1519         rc = mdd_mark_orphan_object(env, obj, handle, true);
1520         if (rc != 0)
1521                 return rc;
1522
1523         rc = mdo_declare_destroy(env, obj, handle);
1524         if (rc != 0)
1525                 return rc;
1526
1527         rc = mdd_orphan_declare_insert(env, obj, mdd_object_type(obj), handle);
1528         if (rc != 0)
1529                 return rc;
1530
1531         return mdd_declare_links_del(env, obj, handle);
```

mdd_declare_changelog_store() should not generate so many write requests, I think. |
| Comment by Andreas Dilger [ 28/Aug/21 ] |
|
The debug message shows that it is logging thousands of separate operations:

```
create: 800/6400/0, destroy: 1/4/0
write: 4001/34410/0, punch: 0/0/0, quota 6/6/0
insert: 801/13616/0, delete: 2/5/0
```

This means 800 creates for 6400 credits, 4000 writes (5 per create/insert) for 34410 credits, and 801 inserts for 13616 credits. The sanity test_130g itself is unlinking an overstriped file with $((OSTCOUNT * 100)) = 800 stripes, and test_27Cd is unlinking a 2000-stripe file, but those objects are on the OSTs, not on the MDT. It looks like the LOD layer is declaring local credits for each remote object being unlinked? I could understand that LOD needs a few credits to update llog records for each unlink (e.g. enough to allocate a few blocks for each OST), but it doesn't need hundreds of MB of credits.

It wouldn't be too hard to add extra debugging into osd_declare_create(), osd_declare_write(), and osd_declare_insert() so that if there are more than, say, 50 such operations in a single handle they print a CWARN() and libcfs_debug_dumpstack() (and set a flag so that it doesn't happen another 750 or 3950 times) to see where this is happening in the code. However, looking at lod_declare_destroy() it appears clear that this is doing a declare for every stripe, and not limiting itself when there are multiple objects per OST. Firstly, it is worthwhile to confirm that there is not a "sub handle" for each stripe instead of for each OST. That would both increase the RAM usage, and make it difficult to aggregate credits for a single OST's objects. Next, if ldo_dir_stripe_count > ost_count, declare a bitmap sized by the number of OSTs and then limit the number of credits if an OST has already been assigned credits for logging the unlink of its objects.

Alex should know the details, but I expect the number of credits needed to unlink an object are essentially one llog block with indirects to write an llog record (64 bytes for llog_unlink64_rec, if that is the right one, but close enough), plus a rare chance of creating a new llog file and inserting it into the O/1/dN directory. A new llog file might happen once per unlink per OST, but not 2000 times. Similarly, destroying 2000 objects doesn't take 4000 llog writes with 34000 credits. Instead, it would take either one llog block plus overhead per OST, or about 2000x64 bytes = 128KB = 32 blocks plus a few indirects if all the objects were on a single OST. |
| Comment by Zhenyu Xu [ 31/Aug/21 ] |
|
Yes, the write/create declarations come from the llog records for unlinking all the stripe objects, as in this example:

```
[10033.829237] [<ffffffffa757a3ae>] dump_stack+0x19/0x1b
```

```
lod_sub_declare_destroy(env, dt, th)
 350         sub_th = lod_sub_get_thandle(env, th, dt, &record_update);
 ...
 357         rc = dt_declare_destroy(env, dt, sub_th);

lod_sub_get_thandle(env, th, dt, ...)
 104         sub_th = thandle_get_sub(env, th, dt);

thandle_get_sub(env, th, dt)
 474 {
 475         return thandle_get_sub_by_dt(env, th, lu2dt_dev(dt->do_lu.lo_dev));
 476 }
```

So the issue is that unlinking a wide-striped file reserves too many credits for the llog records of unlinking its sub-objects.
| Comment by Andreas Dilger [ 31/Aug/21 ] |
|
Bobijam, are you able to work on a fix for this, or should Mike look into it? |
| Comment by Alex Zhuravlev [ 31/Aug/21 ] |
|
well, it's still possible (though not likely) that, for example, after some manual manipulations such an unlink will need to create that many llog files. |
| Comment by Alex Zhuravlev [ 31/Aug/21 ] |
|
I think we can postpone llog object removal using a dedicated thread, this would save some credits. |
| Comment by Andreas Dilger [ 31/Aug/21 ] |
|
Alex, in the overstriping case, how would it be possible for the unlink to need e.g. 100 different llog files for a single OST? I understand it would be possible to create one new llog file, and several new blocks in that file, but with 65000 records per llog file it isn't possible even in crazy cases to need so many credits. |
| Comment by Alex Zhuravlev [ 31/Aug/21 ] |
|
ok, I got it, how many different OSTs were actually involved? |
| Comment by Zhenyu Xu [ 01/Sep/21 ] |
|
This needs Mike's input: llog_osd_declare_write_rec() just uses loghandle->lgh_ctxt->loc_chunk_size instead of the llog record length to declare the llog write, which is wasteful. |
| Comment by Alex Zhuravlev [ 01/Sep/21 ] |
|
a single new record involves:
|
| Comment by Andreas Dilger [ 02/Sep/21 ] |
|
Alex, this issue can be seen in any review-dne-part-1 sanity test_27Cd (or other subtest using overstriping), for example:

```
[ 1863.395427] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 27Cd: test maximum stripe count =========== 11:50:04 \(1630583404\)
[ 1863.871341] Lustre: DEBUG MARKER: == sanity test 27Cd: test maximum stripe count =========== 11:50:04 (1630583404)
[ 1870.327497] Lustre: 11292:0:(osd_handler.c:1938:osd_trans_start()) lustre-MDT0003: credits 128195 > trans_max 2592
[ 1870.329179] Lustre: 11292:0:(osd_handler.c:1867:osd_trans_dump_creds()) create: 2000/8000/0, destroy: 1/4/0
[ 1870.330663] Lustre: 11292:0:(osd_handler.c:1874:osd_trans_dump_creds()) attr_set: 3/3/0, xattr_set: 2004/148/0
[ 1870.332184] Lustre: 11292:0:(osd_handler.c:1884:osd_trans_dump_creds()) write: 10001/86010/0, punch: 0/0/0, quota 6/6/0
[ 1870.333817] Lustre: 11292:0:(osd_handler.c:1891:osd_trans_dump_creds()) insert: 2001/34016/0, delete: 2/5/0
[ 1870.335318] Lustre: 11292:0:(osd_handler.c:1898:osd_trans_dump_creds()) ref_add: 1/1/0, ref_del: 2/2/0
```

There are 4 MDTs and 8 OSTs in that case, and a 512MB transaction handle request. |
| Comment by Zhenyu Xu [ 03/Sep/21 ] |
|
Tried a patch to use the llog rec len instead of chunk_size in llog_osd_declare_write_rec(); the write credits demand decreased a little.

Patch:

```
diff --git a/lustre/obdclass/llog_osd.c b/lustre/obdclass/llog_osd.c
index 5aed12e9b9..18b1b80fdf 100644
--- a/lustre/obdclass/llog_osd.c
+++ b/lustre/obdclass/llog_osd.c
@@ -320,7 +320,6 @@ static int llog_osd_declare_write_rec(const struct lu_env *env,
 					      int idx, struct thandle *th)
 {
 	struct llog_thread_info	*lgi = llog_info(env);
-	__u32			chunk_size;
 	struct dt_object	*o;
 	int			 rc;
 
@@ -335,8 +334,7 @@ static int llog_osd_declare_write_rec(const struct lu_env *env,
 	o = loghandle->lgh_obj;
 	LASSERT(o);
 
-	chunk_size = loghandle->lgh_ctxt->loc_chunk_size;
-	lgi->lgi_buf.lb_len = chunk_size;
+	lgi->lgi_buf.lb_len = sizeof(struct llog_log_hdr);
 	lgi->lgi_buf.lb_buf = NULL;
 	/* each time we update header */
 	rc = dt_declare_record_write(env, o, &lgi->lgi_buf, 0,
@@ -348,7 +346,7 @@ static int llog_osd_declare_write_rec(const struct lu_env *env,
 	 * the pad record can be inserted so take into account double
 	 * record size */
-	lgi->lgi_buf.lb_len = chunk_size * 2;
+	lgi->lgi_buf.lb_len = rec->lrh_len * 2;
 	lgi->lgi_buf.lb_buf = NULL;
 	/* XXX: implement declared window or multi-chunks approach */
 	rc = dt_declare_record_write(env, o, &lgi->lgi_buf, -1, th);
```

sanity.sh test_27Cd before the patch:

```
[ 2954.917326] Lustre: 20362:0:(osd_handler.c:1938:osd_trans_start()) lustre-MDT0001: credits 128195 > trans_max 2496
[ 2954.917330] Lustre: 20362:0:(osd_handler.c:1867:osd_trans_dump_creds()) create: 2000/8000/0, destroy: 1/4/0
[ 2954.917333] Lustre: 20362:0:(osd_handler.c:1874:osd_trans_dump_creds()) attr_set: 3/3/0, xattr_set: 2004/148/0
[ 2954.917335] Lustre: 20362:0:(osd_handler.c:1884:osd_trans_dump_creds()) write: 10001/86010/0, punch: 0/0/0, quota 6/6/0
[ 2954.917337] Lustre: 20362:0:(osd_handler.c:1891:osd_trans_dump_creds()) insert: 2001/34016/0, delete: 2/5/0
[ 2954.917339] Lustre: 20362:0:(osd_handler.c:1898:osd_trans_dump_creds()) ref_add: 1/1/0, ref_del: 2/2/0
```

Credits demand after the patch:

```
[ 3371.600695] Lustre: 30759:0:(osd_handler.c:1938:osd_trans_start()) lustre-MDT0000: credits 92195 > trans_max 2496
[ 3371.600699] Lustre: 30759:0:(osd_handler.c:1867:osd_trans_dump_creds()) create: 2000/8000/0, destroy: 1/4/0
[ 3371.600701] Lustre: 30759:0:(osd_handler.c:1874:osd_trans_dump_creds()) attr_set: 3/3/0, xattr_set: 2004/148/0
[ 3371.600704] Lustre: 30759:0:(osd_handler.c:1884:osd_trans_dump_creds()) write: 10001/50010/0, punch: 0/0/0, quota 6/6/0
[ 3371.600706] Lustre: 30759:0:(osd_handler.c:1891:osd_trans_dump_creds()) insert: 2001/34016/0, delete: 2/5/0
[ 3371.600708] Lustre: 30759:0:(osd_handler.c:1898:osd_trans_dump_creds()) ref_add: 1/1/0, ref_del: 2/2/0
```
 |
| Comment by Alex Zhuravlev [ 03/Sep/21 ] |
|
this means that loc_chunk_size is not LLOG_MIN_CHUNK_SIZE for unlink llogs? |
| Comment by Mikhail Pershin [ 03/Sep/21 ] |
|
Alex, as I understand it, the patch uses just the header size and rec->lrh_len to declare the llog write rather than loc_chunk_size, so yes, neither is LLOG_MIN_CHUNK_SIZE. As I remember, we were using loc_chunk_size in the declaration just because we don't know where the record (and its bit in the bitmap) will really be written, due to the append nature of llogs; that is why we declare the whole chunk for the header, and the current chunk along with the next one for the record. So we can't replace that with a declaration of just the record size and header size. Note that the write itself is narrower: it is done with exact offset/len values, as they are known at the moment of the write. |
| Comment by Alex Zhuravlev [ 03/Sep/21 ] |
|
yes, I don't think this change is correct:

```
+	lgi->lgi_buf.lb_len = rec->lrh_len * 2;
```

as lrh_len can be more than the remaining space in the last block, we may need to write a padding record (last block N), then add the new record to block N+1 |
| Comment by Alex Zhuravlev [ 03/Sep/21 ] |
|
what we can do, I guess, is to declare that potential padding block not as an append, but as a record within the llog's current size, which would mean a single block to modify (in contrast to an append, which may need many more blocks: allocation, extent tree, etc.) |
| Comment by Alex Zhuravlev [ 03/Sep/21 ] |
|
but I think this way we don't save much. Like Andreas said, we should instead be looking for ways to declare at most OSTCOUNT new plain llogs and #stripes records. What about the case when lots of MDT threads are unlinking such overstriped files concurrently? |
| Comment by Mikhail Pershin [ 03/Sep/21 ] |
|
also, as I remember, that was mostly due to ZFS, which needs exact offset/len to be declared and complains if the write doesn't match. So as we can't say for sure the exact offset at declaration time, we assume a range where it can go: the current chunk and the next one. And we should also take care about the non-atomic 'declare'-'write' nature; I think there could be concurrent record additions in between, couldn't there? |
| Comment by Andreas Dilger [ 03/Sep/21 ] |
|
Mike, I think that was later fixed by Alex to remove the "specific declare" since that hurt performance and didn't save much space... |
| Comment by Andreas Dilger [ 03/Sep/21 ] |
|
As for concurrent records, that would be mostly fine. As I wrote before, there may be some interleaving of records, but not 65000 different records written from other threads between each delete of the overstriped object, which is what would be needed to trigger a new llog file each time. |
| Comment by Andreas Dilger [ 27/Nov/21 ] |
|
I'd like to still keep this under consideration for 2.15.0 since it is hit very often during usage, and the transaction size (over 500MB) can hurt performance significantly. |
| Comment by Andreas Dilger [ 27/Nov/21 ] |
|
Assign to Alex to balance 2.15 tickets. |
| Comment by Gerrit Updater [ 07/Dec/21 ] |
|
"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45765 |
| Comment by Gerrit Updater [ 19/Jan/23 ] |
|
"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49701 |
| Comment by Gerrit Updater [ 14/Feb/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/45765/ |
| Comment by Gerrit Updater [ 14/Feb/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49701/ |
| Comment by Peter Jones [ 14/Feb/23 ] |
|
Landed for 2.16 |