Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14918

too many ldiskfs transaction credits for llog when unlinking overstriped files

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.16.0
    • Lustre 2.14.0, Lustre 2.15.0
    • 3
    • 9223372036854775807

    Description

      Removing widely overstriped files from an ldiskfs MDT causes excessively many transaction credits to be reserved. This can be seen in the MDS console logs:

      Lustre: DEBUG MARKER: == sanity test 130g: FIEMAP (overstripe file) ========
      Lustre: 25401:0:(osd_handler.c:1934:osd_trans_start()) lustre-MDT0000: credits 54595 > trans_max 2592
      Lustre: 25401:0:(osd_handler.c:1863:osd_trans_dump_creds())   create: 800/6400/0, destroy: 1/4/0
      Lustre: 25401:0:(osd_handler.c:1870:osd_trans_dump_creds())   attr_set: 3/3/0, xattr_set: 804/148/0
      Lustre: 25401:0:(osd_handler.c:1880:osd_trans_dump_creds())   write: 4001/34410/0, punch: 0/0/0, quota 6/6/0
      Lustre: 25401:0:(osd_handler.c:1887:osd_trans_dump_creds())   insert: 801/13616/0, delete: 2/5/0
      Lustre: 25401:0:(osd_handler.c:1894:osd_trans_dump_creds())   ref_add: 1/1/0, ref_del: 2/2/0
      Pid: 25401, comm: mdt00_004 3.10.0-1160.36.2.el7_lustre.x86_64 #1 SMP Tue Aug 3 23:03:31 UTC 2021
      Call Trace:
      libcfs_call_trace+0x90/0xf0 [libcfs]
      libcfs_debug_dumpstack+0x26/0x30 [libcfs]
      osd_trans_start+0x4bb/0x4e0 [osd_ldiskfs]
      top_trans_start+0x702/0x940 [ptlrpc]
      lod_trans_start+0x34/0x40 [lod]
      mdd_trans_start+0x1a/0x20 [mdd]
      mdd_unlink+0x4ee/0xae0 [mdd]
      mdo_unlink+0x1b/0x1d [mdt]
      mdt_reint_unlink+0xb64/0x1890 [mdt]
      mdt_reint_rec+0x83/0x210 [mdt]
      mdt_reint_internal+0x720/0xaf0 [mdt]
      mdt_reint+0x67/0x140 [mdt]
      tgt_request_handle+0x7ea/0x1750 [ptlrpc]
      ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
      ptlrpc_main+0xb3c/0x14e0 [ptlrpc]
      Lustre: 25401:0:(osd_internal.h:1325:osd_trans_exec_op()) lustre-MDT0000: opcode 7: before 2589 < left 34410, rollback = 7
      

      and

      Lustre: DEBUG MARKER: == sanity test 27Cd: test maximum stripe count ========
      Lustre: 12686:0:(osd_handler.c:1934:osd_trans_start()) lustre-MDT0003: credits 136195 > trans_max 2592
      Lustre: 12686:0:(osd_handler.c:1863:osd_trans_dump_creds())   create: 2000/16000/0, destroy: 1/4/0
      Lustre: 12686:0:(osd_handler.c:1870:osd_trans_dump_creds())   attr_set: 3/3/0, xattr_set: 2004/148/0
      Lustre: 12686:0:(osd_handler.c:1880:osd_trans_dump_creds())   write: 10001/86010/0, punch: 0/0/0, quota 6/6/0
      Lustre: 12686:0:(osd_handler.c:1887:osd_trans_dump_creds())   insert: 2001/34016/0, delete: 2/5/0
      Lustre: 12686:0:(osd_handler.c:1894:osd_trans_dump_creds())   ref_add: 1/1/0, ref_del: 2/2/0
      Pid: 12686, comm: mdt00_000 3.10.0-1160.36.2.el7_lustre.x86_64 #1 SMP Tue Aug 3 23:03:31 UTC 2021
      Call Trace:
      libcfs_call_trace+0x90/0xf0 [libcfs]
      libcfs_debug_dumpstack+0x26/0x30 [libcfs]
      osd_trans_start+0x4bb/0x4e0 [osd_ldiskfs]
      top_trans_start+0x702/0x940 [ptlrpc]
      lod_trans_start+0x34/0x40 [lod]
      mdd_trans_start+0x1a/0x20 [mdd]
      mdd_unlink+0x4ee/0xae0 [mdd]
      mdo_unlink+0x1b/0x1d [mdt]
      mdt_reint_unlink+0xb64/0x1890 [mdt]
      mdt_reint_rec+0x83/0x210 [mdt]
      mdt_reint_internal+0x720/0xaf0 [mdt]
      mdt_reint+0x67/0x140 [mdt]
      tgt_request_handle+0x7ea/0x1750 [ptlrpc]
      ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
      ptlrpc_main+0xb3c/0x14e0 [ptlrpc]
      

      and similarly in sanity test_130e, sanity-pfl test_0b, test_1c, always during unlink.

      The two examples shown are trying to reserve a whopping 213MiB and 532MiB of journal space, respectively. Since the maximum xattr size for an overstriped file is 64KiB, this is pretty excessive.

      Attachments

        Issue Links

          Activity

            [LU-14918] too many ldiskfs transaction credits for llog when unlinking overstriped files

            I'd like to still keep this under consideration for 2.15.0 since it is hit very often during usage, and the transaction size (over 500MB) can hurt performance significantly.

            adilger Andreas Dilger added a comment - I'd like to still keep this under consideration for 2.15.0 since it is hit very often during usage, and the transaction size (over 500MB) can hurt performance significantly.

            As for concurrent records, that would be mostly fine. As I wrote before, there may be some interleaving of records, but not 65000 different records written from other threads between each delete of the overstriped object that would trigger a new llog file each time.

            adilger Andreas Dilger added a comment - As for concurrent records, that would be mostly fine. As I wrote before, there may be some interleaving of records, but not 65000 different records written from other threads between each delete of the overstriped object that would trigger a new llog file each time.

            Mike, I think that was later fixed by Alex to remove the "specific declare" since that hurt performance and didn't save much space...

            adilger Andreas Dilger added a comment - Mike, I think that was later fixed by Alex to remove the "specific declare" since that hurt performance and didn't save much space...

            also as I remember that was mostly due to ZFS which needs exact offset/len to be declared and complains if write doesn't match. So as we can't say for sure exact offset at the declaration time we assume range where it can go - the current chunk and the next one. And also we should take case about non-atomic 'declare'-'write' nature, I think there could be concurrent record additions in between, can't they?

            tappro Mikhail Pershin added a comment - also as I remember that was mostly due to ZFS which needs exact offset/len to be declared and complains if write doesn't match. So as we can't say for sure exact offset at the declaration time we assume range where it can go - the current chunk and the next one. And also we should take case about non-atomic 'declare'-'write' nature, I think there could be concurrent record additions in between, can't they?

            but I think this way we don't save much, like Andreas said instead we should be looking for ways to declare at most OSTCOUNT new plain llogs and #stripes records. what about the case when lots of MDT threads are unlinking such overstriped files concurrently?
            64K records / 128 threads = 512 records per thread

            bzzz Alex Zhuravlev added a comment - but I think this way we don't save much, like Andreas said instead we should be looking for ways to declare at most OSTCOUNT new plain llogs and #stripes records. what about the case when lots of MDT threads are unlinking such overstriped files concurrently? 64K records / 128 threads = 512 records per thread

            what we can do, I guess, is to declare that potential padding block not as an append, but a record within llog's size which would mean a single block to modify (in contrast of append which may need many more blocks - allocation, externt tree, etc)

            bzzz Alex Zhuravlev added a comment - what we can do, I guess, is to declare that potential padding block not as an append, but a record within llog's size which would mean a single block to modify (in contrast of append which may need many more blocks - allocation, externt tree, etc)

            yes, I don't think this change is correct:

            +       lgi->lgi_buf.lb_len = rec->lrh_len * 2;
            

            as lrh_len can be more than remaining space in the last block we may need to write a padding record (last block N), then add the new record to block N+1

            bzzz Alex Zhuravlev added a comment - yes, I don't think this change is correct: + lgi->lgi_buf.lb_len = rec->lrh_len * 2; as lrh_len can be more than remaining space in the last block we may need to write a padding record (last block N), then add the new record to block N+1

            Alex, as I understand patch uses just header size and rec->lrh_len to declare llog write but not loc_chunk_size so yes, both are not LLOG_MIN_CHUNK_SIZE.

            As I remember we were using loc_chunk_size in declaration just because we don't know where record (and bit in bitmap) will be really written due to llog append nature, that is why we are declaring whole chunk for header and current chunk along with next one for the record. So we can't replace that with declaration of just record size and header size.

            Note that write itself is more narrowed, it is done with exact offset/len values as they are known at the moment of write.

            tappro Mikhail Pershin added a comment - Alex, as I understand patch uses just header size and rec->lrh_len  to declare llog write but not loc_chunk_size so yes, both are not LLOG_MIN_CHUNK_SIZE . As I remember we were using loc_chunk_size in declaration just because we don't know where record (and bit in bitmap) will be really written due to llog append nature, that is why we are declaring whole chunk for header and current chunk along with next one for the record. So we can't replace that with declaration of just record size and header size. Note that write itself is more narrowed, it is done with exact offset/len values as they are known at the moment of write.

            this means that loc_chunk_size is not LLOG_MIN_CHUNK_SIZE for unlink llogs?

            bzzz Alex Zhuravlev added a comment - this means that loc_chunk_size is not LLOG_MIN_CHUNK_SIZE for unlink llogs?
            bobijam Zhenyu Xu added a comment -

            Tried a patch to use llog rec len instread of chunk_size in llog_osd_declare_write_rec(), the write credits demand decreased a little

            patch
            diff --git a/lustre/obdclass/llog_osd.c b/lustre/obdclass/llog_osd.c
            index 5aed12e9b9..18b1b80fdf 100644
            --- a/lustre/obdclass/llog_osd.c
            +++ b/lustre/obdclass/llog_osd.c
            @@ -320,7 +320,6 @@ static int llog_osd_declare_write_rec(const struct lu_env *env,
                                                  int idx, struct thandle *th)
             {
                    struct llog_thread_info *lgi = llog_info(env);
            -       __u32                   chunk_size;
                    struct dt_object        *o;
                    int                      rc;
             
            @@ -335,8 +334,7 @@ static int llog_osd_declare_write_rec(const struct lu_env *env,
                    o = loghandle->lgh_obj;
                    LASSERT(o);
             
            -       chunk_size = loghandle->lgh_ctxt->loc_chunk_size;
            -       lgi->lgi_buf.lb_len = chunk_size;
            +       lgi->lgi_buf.lb_len = sizeof(struct llog_log_hdr);
                    lgi->lgi_buf.lb_buf = NULL;
                    /* each time we update header */
                    rc = dt_declare_record_write(env, o, &lgi->lgi_buf, 0,
            @@ -348,7 +346,7 @@ static int llog_osd_declare_write_rec(const struct lu_env *env,
                     * the pad record can be inserted so take into account double
                     * record size
                     */
            -       lgi->lgi_buf.lb_len = chunk_size * 2;
            +       lgi->lgi_buf.lb_len = rec->lrh_len * 2;
                    lgi->lgi_buf.lb_buf = NULL;
                    /* XXX: implement declared window or multi-chunks approach */
                    rc = dt_declare_record_write(env, o, &lgi->lgi_buf, -1, th);
            

            sanity.sh test_27Cd
            credits demand before the patch

            [ 2954.917326] Lustre: 20362:0:(osd_handler.c:1938:osd_trans_start()) lustre-MDT0001: credits 128195 > trans_max 2496
            [ 2954.917330] Lustre: 20362:0:(osd_handler.c:1867:osd_trans_dump_creds())   create: 2000/8000/0, destroy: 1/4/0
            [ 2954.917333] Lustre: 20362:0:(osd_handler.c:1874:osd_trans_dump_creds())   attr_set: 3/3/0, xattr_set: 2004/148/0
            [ 2954.917335] Lustre: 20362:0:(osd_handler.c:1884:osd_trans_dump_creds())   write: 10001/86010/0, punch: 0/0/0, quota 6/6/0
            [ 2954.917337] Lustre: 20362:0:(osd_handler.c:1891:osd_trans_dump_creds())   insert: 2001/34016/0, delete: 2/5/0
            [ 2954.917339] Lustre: 20362:0:(osd_handler.c:1898:osd_trans_dump_creds())   ref_add: 1/1/0, ref_del: 2/2/0
            

            credits demand after the patch

            [ 3371.600695] Lustre: 30759:0:(osd_handler.c:1938:osd_trans_start()) lustre-MDT0000: credits 92195 > trans_max 2496
            [ 3371.600699] Lustre: 30759:0:(osd_handler.c:1867:osd_trans_dump_creds())   create: 2000/8000/0, destroy: 1/4/0
            [ 3371.600701] Lustre: 30759:0:(osd_handler.c:1874:osd_trans_dump_creds())   attr_set: 3/3/0, xattr_set: 2004/148/0
            [ 3371.600704] Lustre: 30759:0:(osd_handler.c:1884:osd_trans_dump_creds())   write: 10001/50010/0, punch: 0/0/0, quota 6/6/0
            [ 3371.600706] Lustre: 30759:0:(osd_handler.c:1891:osd_trans_dump_creds())   insert: 2001/34016/0, delete: 2/5/0
            [ 3371.600708] Lustre: 30759:0:(osd_handler.c:1898:osd_trans_dump_creds())   ref_add: 1/1/0, ref_del: 2/2/0
            
            bobijam Zhenyu Xu added a comment - Tried a patch to use llog rec len instread of chunk_size in llog_osd_declare_write_rec(), the write credits demand decreased a little patch diff --git a/lustre/obdclass/llog_osd.c b/lustre/obdclass/llog_osd.c index 5aed12e9b9..18b1b80fdf 100644 --- a/lustre/obdclass/llog_osd.c +++ b/lustre/obdclass/llog_osd.c @@ -320,7 +320,6 @@ static int llog_osd_declare_write_rec( const struct lu_env *env, int idx, struct thandle *th) { struct llog_thread_info *lgi = llog_info(env); - __u32 chunk_size; struct dt_object *o; int rc; @@ -335,8 +334,7 @@ static int llog_osd_declare_write_rec( const struct lu_env *env, o = loghandle->lgh_obj; LASSERT(o); - chunk_size = loghandle->lgh_ctxt->loc_chunk_size; - lgi->lgi_buf.lb_len = chunk_size; + lgi->lgi_buf.lb_len = sizeof(struct llog_log_hdr); lgi->lgi_buf.lb_buf = NULL; /* each time we update header */ rc = dt_declare_record_write(env, o, &lgi->lgi_buf, 0, @@ -348,7 +346,7 @@ static int llog_osd_declare_write_rec( const struct lu_env *env, * the pad record can be inserted so take into account double * record size */ - lgi->lgi_buf.lb_len = chunk_size * 2; + lgi->lgi_buf.lb_len = rec->lrh_len * 2; lgi->lgi_buf.lb_buf = NULL; /* XXX: implement declared window or multi-chunks approach */ rc = dt_declare_record_write(env, o, &lgi->lgi_buf, -1, th); sanity.sh test_27Cd credits demand before the patch [ 2954.917326] Lustre: 20362:0:(osd_handler.c:1938:osd_trans_start()) lustre-MDT0001: credits 128195 > trans_max 2496 [ 2954.917330] Lustre: 20362:0:(osd_handler.c:1867:osd_trans_dump_creds()) create: 2000/8000/0, destroy: 1/4/0 [ 2954.917333] Lustre: 20362:0:(osd_handler.c:1874:osd_trans_dump_creds()) attr_set: 3/3/0, xattr_set: 2004/148/0 [ 2954.917335] Lustre: 20362:0:(osd_handler.c:1884:osd_trans_dump_creds()) write: 10001/86010/0, punch: 0/0/0, quota 6/6/0 [ 2954.917337] Lustre: 20362:0:(osd_handler.c:1891:osd_trans_dump_creds()) insert: 2001/34016/0, delete: 2/5/0 [ 2954.917339] Lustre: 20362:0:(osd_handler.c:1898:osd_trans_dump_creds()) ref_add: 1/1/0, ref_del: 2/2/0 credits demand after the patch [ 3371.600695] Lustre: 30759:0:(osd_handler.c:1938:osd_trans_start()) lustre-MDT0000: credits 92195 > trans_max 2496 [ 3371.600699] Lustre: 30759:0:(osd_handler.c:1867:osd_trans_dump_creds()) create: 2000/8000/0, destroy: 1/4/0 [ 3371.600701] Lustre: 30759:0:(osd_handler.c:1874:osd_trans_dump_creds()) attr_set: 3/3/0, xattr_set: 2004/148/0 [ 3371.600704] Lustre: 30759:0:(osd_handler.c:1884:osd_trans_dump_creds()) write: 10001/50010/0, punch: 0/0/0, quota 6/6/0 [ 3371.600706] Lustre: 30759:0:(osd_handler.c:1891:osd_trans_dump_creds()) insert: 2001/34016/0, delete: 2/5/0 [ 3371.600708] Lustre: 30759:0:(osd_handler.c:1898:osd_trans_dump_creds()) ref_add: 1/1/0, ref_del: 2/2/0

            Alex, this issue can be seen in any review-dne-part-1 sanity test_27Cd (or other subtest using overstriping), for example:
            https://testing.whamcloud.com/test_sets/afe2d412-628d-4a7f-bda4-1c3359a97ff1

            [ 1863.395427] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 27Cd: test maximum stripe count =========== 11:50:04 \(1630583404\)
            [ 1863.871341] Lustre: DEBUG MARKER: == sanity test 27Cd: test maximum stripe count =========== 11:50:04 (1630583404)
            [ 1870.327497] Lustre: 11292:0:(osd_handler.c:1938:osd_trans_start()) lustre-MDT0003: credits 128195 > trans_max 2592
            [ 1870.329179] Lustre: 11292:0:(osd_handler.c:1867:osd_trans_dump_creds())   create: 2000/8000/0, destroy: 1/4/0
            [ 1870.330663] Lustre: 11292:0:(osd_handler.c:1874:osd_trans_dump_creds())   attr_set: 3/3/0, xattr_set: 2004/148/0
            [ 1870.332184] Lustre: 11292:0:(osd_handler.c:1884:osd_trans_dump_creds())   write: 10001/86010/0, punch: 0/0/0, quota 6/6/0
            [ 1870.333817] Lustre: 11292:0:(osd_handler.c:1891:osd_trans_dump_creds())   insert: 2001/34016/0, delete: 2/5/0
            [ 1870.335318] Lustre: 11292:0:(osd_handler.c:1898:osd_trans_dump_creds())   ref_add: 1/1/0, ref_del: 2/2/0
            

            There are 4 MDTs and 8 OSTs in that case, and a 512MB transaction handle request.

            adilger Andreas Dilger added a comment - Alex, this issue can be seen in any review-dne-part-1 sanity test_27Cd (or other subtest using overstriping), for example: https://testing.whamcloud.com/test_sets/afe2d412-628d-4a7f-bda4-1c3359a97ff1 [ 1863.395427] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 27Cd: test maximum stripe count =========== 11:50:04 \(1630583404\) [ 1863.871341] Lustre: DEBUG MARKER: == sanity test 27Cd: test maximum stripe count =========== 11:50:04 (1630583404) [ 1870.327497] Lustre: 11292:0:(osd_handler.c:1938:osd_trans_start()) lustre-MDT0003: credits 128195 > trans_max 2592 [ 1870.329179] Lustre: 11292:0:(osd_handler.c:1867:osd_trans_dump_creds()) create: 2000/8000/0, destroy: 1/4/0 [ 1870.330663] Lustre: 11292:0:(osd_handler.c:1874:osd_trans_dump_creds()) attr_set: 3/3/0, xattr_set: 2004/148/0 [ 1870.332184] Lustre: 11292:0:(osd_handler.c:1884:osd_trans_dump_creds()) write: 10001/86010/0, punch: 0/0/0, quota 6/6/0 [ 1870.333817] Lustre: 11292:0:(osd_handler.c:1891:osd_trans_dump_creds()) insert: 2001/34016/0, delete: 2/5/0 [ 1870.335318] Lustre: 11292:0:(osd_handler.c:1898:osd_trans_dump_creds()) ref_add: 1/1/0, ref_del: 2/2/0 There are 4 MDTs and 8 OSTs in that case, and a 512MB transaction handle request.

            People

              bzzz Alex Zhuravlev
              adilger Andreas Dilger
              Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: