[LU-14916] Interop: sanity-pfl test 0b fails with 'Create /mnt/lustre/d0b.sanity-pfl/f0b.sanity-pfl succeeded' Created: 06/Aug/21  Updated: 25/Oct/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: interop
Environment:

2.13.0 clients with >= 2.14.50.130 servers


Issue Links:
Related
is related to LU-14191 setstripe: cannot create composite fi... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-pfl test 0b started failing on 07 March 2021 for 2.13.0 clients and 2.14.50.130 servers; see https://testing.whamcloud.com/test_sets/4bb01bda-d0c3-4e30-8cf5-1b72e5dbff7f.

Looking at a recent failure at https://testing.whamcloud.com/test_sets/944ecbd7-e7f4-46de-b435-e52f50afca20, we see the following in the MDS (vm4) console log:

[  283.886793] Lustre: DEBUG MARKER: == sanity-pfl test 0b: Verify comp stripe count limits =============================================== 15:58:00 (1622735880)
[  284.136777] Lustre: DEBUG MARKER: dumpe2fs -h /dev/mapper/mds1_flakey 2>&1 |
[  284.136777] 		grep -E -q '(ea_inode|large_xattr)'
[  284.504759] Lustre: 11026:0:(osd_handler.c:1938:osd_trans_start()) lustre-MDT0000: credits 12995 > trans_max 2592
[  284.506763] Lustre: 11026:0:(osd_handler.c:1867:osd_trans_dump_creds())   create: 200/800/0, destroy: 1/4/0
[  284.508514] Lustre: 11026:0:(osd_handler.c:1874:osd_trans_dump_creds())   attr_set: 3/3/0, xattr_set: 204/148/0
[  284.510309] Lustre: 11026:0:(osd_handler.c:1884:osd_trans_dump_creds())   write: 1001/8610/0, punch: 0/0/0, quota 6/6/0
[  284.512274] Lustre: 11026:0:(osd_handler.c:1891:osd_trans_dump_creds())   insert: 201/3416/0, delete: 2/5/0
[  284.514033] Lustre: 11026:0:(osd_handler.c:1898:osd_trans_dump_creds())   ref_add: 1/1/0, ref_del: 2/2/0
[  284.515721] Pid: 11026, comm: mdt00_000 4.18.0-240.22.1.el8_lustre.x86_64 #1 SMP Sun Apr 11 04:35:52 UTC 2021
[  284.517504] Call Trace TBD:
[  284.518265] [<0>] libcfs_call_trace+0x6f/0x90 [libcfs]
[  284.519281] [<0>] osd_trans_start+0x50c/0x530 [osd_ldiskfs]
[  284.520707] [<0>] top_trans_start+0x423/0x940 [ptlrpc]
[  284.521741] [<0>] mdd_unlink+0x495/0xb20 [mdd]
[  284.522703] [<0>] mdt_reint_unlink+0xb09/0x12a0 [mdt]
[  284.523656] [<0>] mdt_reint_rec+0x11f/0x250 [mdt]
[  284.524528] [<0>] mdt_reint_internal+0x498/0x780 [mdt]
[  284.525480] [<0>] mdt_reint+0x5e/0x100 [mdt]
[  284.526315] [<0>] tgt_request_handle+0xc78/0x1910 [ptlrpc]
[  284.527355] [<0>] ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc]
[  284.528533] [<0>] ptlrpc_main+0xba2/0x14a0 [ptlrpc]
[  284.529462] [<0>] kthread+0x112/0x130
[  284.530166] [<0>] ret_from_fork+0x35/0x40
[  284.536315] Lustre: 11026:0:(osd_internal.h:1304:osd_trans_exec_op()) lustre-MDT0000: opcode 7: before 2593 < left 8610, rollback = 7
[  284.822666] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity-pfl test_0b: @@@@@@ FAIL: Create \/mnt\/lustre\/d0b.sanity-pfl\/f0b.sanity-pfl succeeded 
[  285.114266] Lustre: DEBUG MARKER: sanity-pfl test_0b: @@@@@@ FAIL: Create /mnt/lustre/d0b.sanity-pfl/f0b.sanity-pfl succeeded
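
As a cross-check on the log above, the second slash-separated number of each osd_trans_dump_creds() entry can be summed and compared with the "credits 12995 > trans_max 2592" line. This is a minimal sketch: the reading of the second field as the declared credits is an assumption inferred from the totals matching, not something stated in the log, and parse_credits() is a hypothetical helper written for this illustration.

```python
import re

# Credit dump entries copied from the MDS console log above
# (timestamps and function prefixes removed).
LOG = """\
create: 200/800/0, destroy: 1/4/0
attr_set: 3/3/0, xattr_set: 204/148/0
write: 1001/8610/0, punch: 0/0/0, quota 6/6/0
insert: 201/3416/0, delete: 2/5/0
ref_add: 1/1/0, ref_del: 2/2/0
"""

# Each entry looks like "name: a/b/c"; the "quota" entry has no colon.
ENTRY = re.compile(r"(\w+):?\s+(\d+)/(\d+)/(\d+)")

TRANS_MAX = 2592  # from "credits 12995 > trans_max 2592"


def parse_credits(text):
    """Map each operation name to its second field (assumed: declared credits)."""
    return {name: int(second) for name, _first, second, _third in ENTRY.findall(text)}


credits = parse_credits(LOG)
total = sum(credits.values())
print(total, total > TRANS_MAX)
```

Under that assumption the sum equals the 12995 reported by osd_trans_start(), and the "write" entry alone (8610) already exceeds trans_max; the same 8610 appears as "left 8610" in the osd_trans_exec_op() message.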


 Comments   
Comment by Andreas Dilger [ 30/Oct/21 ]

I added debugging to the test to print out the resulting layout:

lcm_layout_gen:    2
  lcm_mirror_count:  1
  lcm_entry_count:   2
    lcme_id:             1
    lcme_mirror_id:      0
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   1048576
      lmm_stripe_count:  720
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0,overstriped
      lmm_layout_gen:    0
      lmm_stripe_offset: 3
      lmm_objects:
        [720 objects]
    lcme_id:             2
    lcme_mirror_id:      0
    lcme_flags:          0
    lcme_extent.e_start: 1048576
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  2000
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0,overstriped
      lmm_layout_gen:    0
      lmm_stripe_offset: -1

It isn't clear why the first component got only 720 objects when 2000 were requested. Thinking about it more, though, it shouldn't be possible to have more than one stripe per LOV_MIN_STRIPE_SIZE of component extent, so a 1MB component should allow at most 16 x 64KB stripes, since any more are just a waste of space. That doesn't help understand or fix this bug, but it does expose a related issue.
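
The 16-stripe figure is just the component extent divided by the minimum stripe size. A minimal sketch of that arithmetic, assuming LOV_MIN_STRIPE_SIZE is 64KB as the numbers above imply; max_useful_stripes() is a hypothetical helper for illustration, not an existing Lustre function:

```python
LOV_MIN_STRIPE_SIZE = 64 * 1024  # assumed 64KB, per the "16 x 64KB" figure above


def max_useful_stripes(e_start, e_end, min_stripe_size=LOV_MIN_STRIPE_SIZE):
    """Largest stripe count for which every stripe can hold at least
    min_stripe_size bytes of the extent [e_start, e_end).

    Returns None for an extent that runs to EOF (e_end is None),
    since the extent size imposes no bound there.
    """
    if e_end is None:
        return None
    return (e_end - e_start) // min_stripe_size


# First component from the layout above: [0, 1048576)
first = max_useful_stripes(0, 1048576)
# Second component: [1048576, EOF)
second = max_useful_stripes(1048576, None)
print(first, second)
```

By this measure the first component could use at most 16 stripes, far fewer than the 720 it was given.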

Comment by Andreas Dilger [ 30/Oct/21 ]

One possibility is that the OSTs have run out of objects, shrinking the file layout, but there weren't any signs of this in the layout (it was round robin across OSTs 0-6 for the whole file).

Comment by Andreas Dilger [ 01/Mar/22 ]

This interop issue was introduced by patch https://review.whamcloud.com/40895 "LU-14191 lod: comp stripe count limit check", but there isn't really a great solution for it. It would be possible to backport this patch to the older branch or skip this test in interop mode with newer servers, but it would still fail in interop with older clients that don't have the patch.

Generated at Sat Feb 10 03:13:52 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.