[LU-14916] Interop: sanity-pfl test 0b fails with 'Create /mnt/lustre/d0b.sanity-pfl/f0b.sanity-pfl succeeded'

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.15.0
    • 2.13.0 clients with >= 2.14.50.130 servers
    • 3
    • 9223372036854775807

    Description

      sanity-pfl test 0b started failing on 07 March 2021 for 2.13.0 clients and 2.14.50.130 servers; see https://testing.whamcloud.com/test_sets/4bb01bda-d0c3-4e30-8cf5-1b72e5dbff7f.

      Looking at a recent failure at https://testing.whamcloud.com/test_sets/944ecbd7-e7f4-46de-b435-e52f50afca20, we see the following in the MDS (vm4) console log:

      [  283.886793] Lustre: DEBUG MARKER: == sanity-pfl test 0b: Verify comp stripe count limits =============================================== 15:58:00 (1622735880)
      [  284.136777] Lustre: DEBUG MARKER: dumpe2fs -h /dev/mapper/mds1_flakey 2>&1 |
      [  284.136777] 		grep -E -q '(ea_inode|large_xattr)'
      [  284.504759] Lustre: 11026:0:(osd_handler.c:1938:osd_trans_start()) lustre-MDT0000: credits 12995 > trans_max 2592
      [  284.506763] Lustre: 11026:0:(osd_handler.c:1867:osd_trans_dump_creds())   create: 200/800/0, destroy: 1/4/0
      [  284.508514] Lustre: 11026:0:(osd_handler.c:1874:osd_trans_dump_creds())   attr_set: 3/3/0, xattr_set: 204/148/0
      [  284.510309] Lustre: 11026:0:(osd_handler.c:1884:osd_trans_dump_creds())   write: 1001/8610/0, punch: 0/0/0, quota 6/6/0
      [  284.512274] Lustre: 11026:0:(osd_handler.c:1891:osd_trans_dump_creds())   insert: 201/3416/0, delete: 2/5/0
      [  284.514033] Lustre: 11026:0:(osd_handler.c:1898:osd_trans_dump_creds())   ref_add: 1/1/0, ref_del: 2/2/0
      [  284.515721] Pid: 11026, comm: mdt00_000 4.18.0-240.22.1.el8_lustre.x86_64 #1 SMP Sun Apr 11 04:35:52 UTC 2021
      [  284.517504] Call Trace TBD:
      [  284.518265] [<0>] libcfs_call_trace+0x6f/0x90 [libcfs]
      [  284.519281] [<0>] osd_trans_start+0x50c/0x530 [osd_ldiskfs]
      [  284.520707] [<0>] top_trans_start+0x423/0x940 [ptlrpc]
      [  284.521741] [<0>] mdd_unlink+0x495/0xb20 [mdd]
      [  284.522703] [<0>] mdt_reint_unlink+0xb09/0x12a0 [mdt]
      [  284.523656] [<0>] mdt_reint_rec+0x11f/0x250 [mdt]
      [  284.524528] [<0>] mdt_reint_internal+0x498/0x780 [mdt]
      [  284.525480] [<0>] mdt_reint+0x5e/0x100 [mdt]
      [  284.526315] [<0>] tgt_request_handle+0xc78/0x1910 [ptlrpc]
      [  284.527355] [<0>] ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc]
      [  284.528533] [<0>] ptlrpc_main+0xba2/0x14a0 [ptlrpc]
      [  284.529462] [<0>] kthread+0x112/0x130
      [  284.530166] [<0>] ret_from_fork+0x35/0x40
      [  284.536315] Lustre: 11026:0:(osd_internal.h:1304:osd_trans_exec_op()) lustre-MDT0000: opcode 7: before 2593 < left 8610, rollback = 7
      [  284.822666] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity-pfl test_0b: @@@@@@ FAIL: Create \/mnt\/lustre\/d0b.sanity-pfl\/f0b.sanity-pfl succeeded 
      [  285.114266] Lustre: DEBUG MARKER: sanity-pfl test_0b: @@@@@@ FAIL: Create /mnt/lustre/d0b.sanity-pfl/f0b.sanity-pfl succeeded
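
As a rough cross-check of the log above, the per-operation figures printed by osd_trans_dump_creds() can be summed and compared with the "credits 12995 > trans_max 2592" warning. The sketch below assumes each entry reads ops/credits/used, so that the middle field is the declared credit count; that reading is an assumption, not something the log states. The helper names (parse_credits, total_declared) are hypothetical and only serve this illustration.

```python
import re

# Entries copied from the MDS console log above (timestamps and PIDs trimmed).
LOG = """\
create: 200/800/0, destroy: 1/4/0
attr_set: 3/3/0, xattr_set: 204/148/0
write: 1001/8610/0, punch: 0/0/0, quota 6/6/0
insert: 201/3416/0, delete: 2/5/0
ref_add: 1/1/0, ref_del: 2/2/0
"""

# Matches "name: a/b/c" and also "quota a/b/c", which has no colon in the log.
ENTRY = re.compile(r"(\w+):?\s+(\d+)/(\d+)/(\d+)")


def parse_credits(text):
    """Return {operation: (first, middle, last)} for every entry in text."""
    return {m.group(1): tuple(int(g) for g in m.groups()[1:])
            for m in ENTRY.finditer(text)}


def total_declared(entries):
    """Sum the middle field, ASSUMED here to be the declared credits."""
    return sum(credits for _ops, credits, _used in entries.values())


entries = parse_credits(LOG)
total = total_declared(entries)
largest = max(entries, key=lambda name: entries[name][1])
print(total, largest)
```

Under that assumption the middle fields add up to 12995, the figure in the warning, and the write entry (8610) accounts for most of it, which is also the "left 8610" value in the osd_trans_exec_op() message.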
      

          Activity

            adilger Andreas Dilger added a comment - This interop issue was introduced by patch https://review.whamcloud.com/40895 "LU-14191 lod: comp stripe count limit check", but there isn't really a great solution for it. It would be possible to backport this patch to the older branch or skip this test in interop mode with newer servers, but it would still fail in interop with older clients that don't have the patch.
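
The "skip in interop mode" option in the comment above amounts to a version predicate. The sketch below is only an illustration in Python, not test-framework code: the function names are hypothetical, the server threshold is taken from the environment field of this ticket (>= 2.14.50.130), and whether a given client carries the patch is passed in as a flag because a backport cannot be inferred from the version string alone.

```python
def version_tuple(version):
    """Turn '2.14.50.130' into (2, 14, 50, 130), zero-padded to four fields."""
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (4 - len(parts)))


# First server version reported as affected in this ticket.
FIRST_AFFECTED_SERVER = version_tuple("2.14.50.130")


def hits_interop_case(client, server, client_has_patch=False):
    """True when an unpatched client talks to a server at or past the threshold.

    The client version is accepted for reporting only; the decision uses the
    caller-supplied client_has_patch flag (an assumption of this sketch).
    """
    version_tuple(client)  # validates the string; the value is not compared
    return (not client_has_patch) and version_tuple(server) >= FIRST_AFFECTED_SERVER


print(hits_interop_case("2.13.0", "2.14.50.130"))
print(hits_interop_case("2.13.0", "2.14.0"))
```

A check of this shape would let the test be skipped for the failing combination, though as the comment notes it does nothing for older clients that are never updated.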

            adilger Andreas Dilger added a comment - One possibility is that the OSTs have run out of objects, shrinking the file layout, but there weren't any signs of this in the layout (it was round robin across OSTs 0-6 for the whole file).

            adilger Andreas Dilger added a comment - I added debugging to the test to print out the resulting layout:

              lcm_layout_gen:    2
              lcm_mirror_count:  1
              lcm_entry_count:   2
                lcme_id:             1
                lcme_mirror_id:      0
                lcme_flags:          init
                lcme_extent.e_start: 0
                lcme_extent.e_end:   1048576
                  lmm_stripe_count:  720
                  lmm_stripe_size:   1048576
                  lmm_pattern:       raid0,overstriped
                  lmm_layout_gen:    0
                  lmm_stripe_offset: 3
                  lmm_objects:
                    [720 objects]
                lcme_id:             2
                lcme_mirror_id:      0
                lcme_flags:          0
                lcme_extent.e_start: 1048576
                lcme_extent.e_end:   EOF
                  lmm_stripe_count:  2000
                  lmm_stripe_size:   1048576
                  lmm_pattern:       raid0,overstriped
                  lmm_layout_gen:    0
                  lmm_stripe_offset: -1

            It isn't clear why the first component only got 720 objects when 2000 were requested. On further thought, though, it shouldn't be possible for a component to have more than one stripe per LOV_MIN_STRIPE_SIZE of its extent, so a 1MB component should allow at most 16 x 64KB stripes, since the rest are just a waste of space. That doesn't help understand or fix this bug, but it does expose a related issue.
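
The bound suggested in the comment above is simple arithmetic: the component extent divided by LOV_MIN_STRIPE_SIZE. A minimal sketch follows, assuming the 64KB minimum stripe size quoted in the comment; max_useful_stripes is a hypothetical helper, and the limit it computes is the one proposed here, not behaviour that the servers are known to enforce.

```python
LOV_MIN_STRIPE_SIZE = 64 * 1024  # 64KB, as quoted in the comment above


def max_useful_stripes(extent_start, extent_end,
                       min_stripe_size=LOV_MIN_STRIPE_SIZE):
    """Upper bound on stripes that could hold data in [extent_start, extent_end)."""
    if extent_end <= extent_start:
        raise ValueError("extent must be non-empty")
    # Ceiling division: a partial trailing chunk still needs one stripe.
    return -(-(extent_end - extent_start) // min_stripe_size)


# First component from the layout dump: extent 0..1048576, 720 objects created.
bound = max_useful_stripes(0, 1048576)
print(bound)        # 16
print(720 - bound)  # 704 objects beyond the proposed bound
```

For the first component this gives 16, so 704 of the 720 allocated objects fall outside the proposed limit.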

            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)