
stat(2) should be able to use a good replica

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.16.0
    • Severity: 3

    Description

      A replica representation in the LOV EA can be broken, like this one:

      lcm_layout_gen: 7
      lcm_mirror_count: 2
      lcm_entry_count: 2
      lcme_id: 65538
      lcme_mirror_id: 1
      lcme_flags: init,stale
      lcme_extent.e_start: 134217728
      lcme_extent.e_end: 1073741824
      lmm_stripe_count: 16
      lmm_stripe_size: 16777216
      lmm_pattern: 40000001
      lmm_layout_gen: 1
      lmm_stripe_offset: 4294967295
      lmm_objects:
      - 0: { l_ost_idx: -1, l_fid: [0:0x0:0x0] }
      - 1: { l_ost_idx: -1, l_fid: [0:0x0:0x0] }
      - 2: { l_ost_idx: -1, l_fid: [0:0x0:0x0] }
      - 3: { l_ost_idx: 5, l_fid: [0xbc0000406:0x42fce0eb:0x0] }
      - 4: { l_ost_idx: -1, l_fid: [0:0x0:0x0] }
      - 5: { l_ost_idx: -1, l_fid: [0:0x0:0x0] }
      - 6: { l_ost_idx: -1, l_fid: [0:0x0:0x0] }
      - 7: { l_ost_idx: -1, l_fid: [0:0x0:0x0] }
      - 8: { l_ost_idx: -1, l_fid: [0:0x0:0x0] }
      - 9: { l_ost_idx: -1, l_fid: [0:0x0:0x0] }
      - 10: { l_ost_idx: -1, l_fid: [0:0x0:0x0] }
      - 11: { l_ost_idx: -1, l_fid: [0:0x0:0x0] }
      - 12: { l_ost_idx: -1, l_fid: [0:0x0:0x0] }
      - 13: { l_ost_idx: -1, l_fid: [0:0x0:0x0] }
      - 14: { l_ost_idx: -1, l_fid: [0:0x0:0x0] }
      - 15: { l_ost_idx: -1, l_fid: [0:0x0:0x0] }
       
      lcme_id: 131073
      lcme_mirror_id: 2
      lcme_flags: init
      lcme_extent.e_start: 0
      lcme_extent.e_end: EOF
      lmm_stripe_count: 16
      lmm_stripe_size: 1048576
      lmm_pattern: raid0
      lmm_layout_gen: 0
      lmm_stripe_offset: 5
      lmm_pool: hdd-pool
      lmm_objects:
      - 0: { l_ost_idx: 5, l_fid: [0xbc0000406:0x42feb0aa:0x0] }
      - 1: { l_ost_idx: 8, l_fid: [0x8c0000402:0x3bf10cb:0x0] }
      - 2: { l_ost_idx: 15, l_fid: [0x9c0000402:0x11f1d8f:0x0] }
      - 3: { l_ost_idx: 13, l_fid: [0x900000402:0x77529c35:0x0] }
      - 4: { l_ost_idx: 0, l_fid: [0x300000403:0x3beded4:0x0] }
      - 5: { l_ost_idx: 7, l_fid: [0xa80000402:0x11e9898:0x0] }
      - 6: { l_ost_idx: 12, l_fid: [0x880000402:0x3bef34f:0x0] }
      - 7: { l_ost_idx: 10, l_fid: [0xa40000402:0x11d9d9d:0x0] }
      - 8: { l_ost_idx: 14, l_fid: [0xa00000402:0x11e4d68:0x0] }
      - 9: { l_ost_idx: 2, l_fid: [0xb80000402:0x11d545a:0x0] }
      - 10: { l_ost_idx: 6, l_fid: [0xb40000400:0x11f22d9:0x0] }
      - 11: { l_ost_idx: 4, l_fid: [0x2c0000403:0x4016eb6:0x0] }
      - 12: { l_ost_idx: 9, l_fid: [0x940000402:0xaf7b184a:0x0] }
      - 13: { l_ost_idx: 11, l_fid: [0x980000402:0x11dc273:0x0] }
      - 14: { l_ost_idx: 1, l_fid: [0xac0000404:0x3015313c:0x0] }
      - 15: { l_ost_idx: 3, l_fid: [0xb00000400:0x11dd9bb:0x0] } 
      

      However, a regular stat(2) should use any valid replica, rather than return an error as soon as it meets a bogus one.
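
The requested behaviour can be sketched roughly as follows: walk the mirrors and use the first one that is not stale and whose stripes all reference a known OST, instead of failing on the first bogus mirror. This is only an illustration under simplifying assumptions. The names `sketch_mirror`, `sketch_mirror_usable` and `sketch_pick_mirror` are hypothetical and are not Lustre data structures or the merged patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one mirror; NOT Lustre's in-kernel layout. */
struct sketch_mirror {
	uint16_t        mirror_id;    /* lcme_mirror_id in the dump above */
	int             stale;        /* non-zero if lcme_flags contains "stale" */
	size_t          stripe_count; /* number of entries in ost_idx */
	const uint32_t *ost_idx;      /* l_ost_idx per stripe; "-1" is 0xffffffff */
};

/* A mirror is treated as usable if it is not stale and every stripe
 * points at an OST index below the number of known OSTs (an assumption
 * made for this sketch). */
static int sketch_mirror_usable(const struct sketch_mirror *m, uint32_t ost_count)
{
	size_t i;

	if (m->stale)
		return 0;
	for (i = 0; i < m->stripe_count; i++) {
		if (m->ost_idx[i] >= ost_count)
			return 0;
	}
	return 1;
}

/* Return the array position of the first usable mirror, or -1 if every
 * mirror is broken. A bogus mirror is skipped, not reported as an error. */
static int sketch_pick_mirror(const struct sketch_mirror *mirrors, size_t n,
			      uint32_t ost_count)
{
	size_t i;

	assert(mirrors != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (sketch_mirror_usable(&mirrors[i], ost_count))
			return (int)i;
	}
	return -1;
}
```

Applied to the layout above, such a check would pass over mirror 1 (stale, with `l_ost_idx: -1` entries) and settle on mirror 2.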

          Activity

            [LU-17261] stat(2) should be able to use a good replica
            pjones Peter Jones added a comment -

            Merged for 2.16


            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/54544/
            Subject: LU-17261 lov: unlink can handle bogus striping
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 4ae823762db40d790ddd00c29e969b5c8e376430

            bzzz Alex Zhuravlev added a comment -

            With the patch (https://review.whamcloud.com/c/fs/lustre-release/+/54544) applied, I see that unlink can handle such a file with a broken OST index.

            gerrit Gerrit Updater added a comment -

            "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54544
            Subject: LU-17261 tests: unlink can handle bogus striping
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c9a4f962c8c6832169b29b84de6b1d0714b0cf57

            adilger Andreas Dilger added a comment -

            Well, I used LOV_V1_INSANE_STRIPE_COUNT=65532 as the limit in my proposed lmv_tgt_retry() change. This covers all "valid" OST numbers but excludes the "-1=4294967295" index that is seen in this case.
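
One way to read the threshold discussion is as a three-way classification of an OST index, sketched below. Only the value 65532 is taken from the comment. The enum, the function name `sketch_classify_idx`, and the split into "known", "may appear" and "impossible" are assumptions made for illustration, not Lustre code.

```c
#include <assert.h>
#include <stdint.h>

/* Threshold value quoted in the comment for LOV_V1_INSANE_STRIPE_COUNT. */
#define SKETCH_INSANE_STRIPE_COUNT 65532u

/* Hypothetical classification, for illustration only. */
enum sketch_idx_class {
	SKETCH_IDX_KNOWN,      /* index is below the current OST count */
	SKETCH_IDX_MAY_APPEAR, /* plausible index; the OST may still connect */
	SKETCH_IDX_IMPOSSIBLE  /* beyond the sanity limit; no point in waiting */
};

static enum sketch_idx_class sketch_classify_idx(uint32_t idx, uint32_t ost_count)
{
	/* The sketch assumes the OST count itself is within the limit. */
	assert(ost_count <= SKETCH_INSANE_STRIPE_COUNT);

	if (idx >= SKETCH_INSANE_STRIPE_COUNT)
		return SKETCH_IDX_IMPOSSIBLE;
	if (idx >= ost_count)
		return SKETCH_IDX_MAY_APPEAR;
	return SKETCH_IDX_KNOWN;
}
```

Because the index is an unsigned 32-bit value, the "-1" shown by `lfs getstripe` is the same bit pattern as 4294967295 and lands in the "impossible" class, while an index such as 5 on a two-OST filesystem would still be treated as one that may appear later.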

            bzzz Alex Zhuravlev added a comment -

            What would be the "border" beyond which an OST index becomes "invalid"? That is, which index do we consider "potentially good", so that we wait for the corresponding OST to appear, and which index do we declare "impossible" and handle in a special way?

            adilger Andreas Dilger added a comment -

            The "wait 30s while client connects to new OST" delay is caused by the OST index check in lmv_tgt_retry(), and it would be gone with the change I proposed.

            bzzz Alex Zhuravlev added a comment -

            I did a simple test with unlink:

            == sanity-flr test 210b: handle broken mirrored lovea (unlink) ========================================================== 07:30:01 (1711179001)
            before dd
            lustre-OST0000_UUID      1818580        1524     1700672   1% /mnt/lustre[OST:0]
            lustre-OST0001_UUID      1818580        1524     1700672   1% /mnt/lustre[OST:1]
            [   28.527289] Lustre: DEBUG MARKER: == sanity-flr test 210b: handle broken mirrored lovea (unlink) ========================================================== 07:30:01 (1711179001)
            20+0 records in
            20+0 records out
            20971520 bytes (21 MB, 20 MiB) copied, 0.032342 s, 648 MB/s
            after dd
            lustre-OST0000_UUID      1818580       22004     1659592   2% /mnt/lustre[OST:0]
            lustre-OST0001_UUID      1818580        1524     1700672   1% /mnt/lustre[OST:1]
            fail_loc=0x1428
            [   31.585672] Lustre: *** cfs_fail_loc=1428, val=0***
            [   31.670467] Lustre: 6901:0:(lov_ea.c:299:lsme_unpack()) lustre-clilov_UUID: FID 0x280000401:2 OST index -1 more than OST count 2
            [   31.670469] Lustre: lustre-clilov_UUID: wait 30s while client connects to new OST
            [   61.791678] Lustre: 6901:0:(lov_pack.c:57:lov_dump_lmm_common()) objid 0x3:1026, magic 0x0bd10bd0, pattern 0x1
            [   61.791683] Lustre: 6901:0:(lov_pack.c:61:lov_dump_lmm_common()) stripe_size 4194304, stripe_count 1, layout_gen 0
            [   61.791685] Lustre: 6901:0:(lov_pack.c:81:lov_dump_lmm_objects()) stripe 0 idx 4294967295 subobj 0x280000401:2
            after mirror extend
            [   91.951586] Lustre: 6901:0:(lov_pack.c:57:lov_dump_lmm_common()) objid 0x3:1026, magic 0x0bd10bd0, pattern 0x1
            [   91.951591] Lustre: 6901:0:(lov_pack.c:61:lov_dump_lmm_common()) stripe_size 4194304, stripe_count 1, layout_gen 0
            [   91.951594] Lustre: 6901:0:(lov_pack.c:81:lov_dump_lmm_objects()) stripe 0 idx 4294967295 subobj 0x280000401:2
            [  121.071619] Lustre: 6905:0:(lov_pack.c:57:lov_dump_lmm_common()) objid 0x3:1026, magic 0x0bd10bd0, pattern 0x1
            [  121.071624] Lustre: 6905:0:(lov_pack.c:61:lov_dump_lmm_common()) stripe_size 4194304, stripe_count 1, layout_gen 0
            [  121.071627] Lustre: 6905:0:(lov_pack.c:81:lov_dump_lmm_objects()) stripe 0 idx 4294967295 subobj 0x280000401:2
            now list directory
            total 0
            ===
            [  151.231585] Lustre: 6905:0:(lov_pack.c:57:lov_dump_lmm_common()) objid 0x3:1026, magic 0x0bd10bd0, pattern 0x1
            [  151.231590] Lustre: 6905:0:(lov_pack.c:61:lov_dump_lmm_common()) stripe_size 4194304, stripe_count 1, layout_gen 0
            [  151.231592] Lustre: 6905:0:(lov_pack.c:81:lov_dump_lmm_objects()) stripe 0 idx 4294967295 subobj 0x280000401:2
            Waiting for MDT destroys to complete
            after removal
            lustre-OST0000_UUID      1818580        1524     1700672   1% /mnt/lustre[OST:0]
            lustre-OST0001_UUID      1818580        1524     1700672   1% /mnt/lustre[OST:1]
            PASS 210b (134s)

            That seems to work (at least the space is back), but it takes very long. I am not sure whether this is OK.

            adilger Andreas Dilger added a comment -

            Alex, can we not just skip the bad stripes during unlink? I don't want to say "just skip the bad component and delete the file anyway, let LFSCK handle the orphan objects", since this is how this problem was caused in the first place: by LFSCK linking the orphan object(s) back into the file.

            I'm sure we have the ability to delete files with bad objects in them, though maybe not when the OST index is "-1"? Could we pick a real OST index but assign a bogus FID that is ignored by the lower layers without generating a lot of error messages?
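
The idea of skipping bad stripes during unlink could look roughly like the loop below: stripes with an out-of-range OST index or an all-zero FID are counted and skipped, and only the remaining stripes are handed to a destroy callback. All names here (`sketch_stripe`, `sketch_unlink_stripes`, `sketch_record_destroy`) are hypothetical, and the sketch does not model the MDS-side difficulty of initializing in-core objects that is raised elsewhere in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stripe entry; NOT a Lustre structure. */
struct sketch_stripe {
	uint32_t ost_idx; /* l_ost_idx; "-1" is 0xffffffff */
	uint64_t fid_seq; /* first field of l_fid */
	uint32_t fid_oid; /* second field of l_fid */
};

/* Callback that would issue the object destroy; returns 0 on success. */
typedef int (*sketch_destroy_fn)(const struct sketch_stripe *s, void *arg);

/* Assumption for this sketch: a stripe is bogus if its OST index is not
 * a known OST, or if its FID is all zeroes (no object to destroy). */
static int sketch_stripe_is_bogus(const struct sketch_stripe *s, uint32_t ost_count)
{
	return s->ost_idx >= ost_count || (s->fid_seq == 0 && s->fid_oid == 0);
}

/* Destroy every sane stripe, skip bogus ones, and report both counts. */
static size_t sketch_unlink_stripes(const struct sketch_stripe *stripes, size_t n,
				    uint32_t ost_count, sketch_destroy_fn destroy,
				    void *arg, size_t *skipped)
{
	size_t i, destroyed = 0;

	assert(destroy != NULL && skipped != NULL);
	*skipped = 0;
	for (i = 0; i < n; i++) {
		if (sketch_stripe_is_bogus(&stripes[i], ost_count)) {
			(*skipped)++;
			continue;
		}
		if (destroy(&stripes[i], arg) == 0)
			destroyed++;
	}
	return destroyed;
}

/* Example callback: only records the OST index it was asked to clean up. */
static int sketch_record_destroy(const struct sketch_stripe *s, void *arg)
{
	uint32_t *last_idx = arg;

	*last_idx = s->ost_idx;
	return 0;
}
```

A stripe slot with index -1 and a zero FID references no object at all, so skipping it in this way would leave nothing behind for LFSCK to re-link.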

            bzzz Alex Zhuravlev added a comment -

            "The existing patch here avoids crashing the MDS when accessing the bad layout, but does not actually allow deleting the bad file (or ideally the bad mirror)."

            Yes, I tried to implement that, but it is not trivial. Basically, the problem is that the MDS wants to initialize all in-core structures, including objects, but this is not (currently) possible because the referenced OST(s) do not exist.
            We could probably have a special OSP to redirect "invalid" objects to.

            adilger Andreas Dilger added a comment -

            The existing patch here avoids crashing the MDS when accessing the bad layout, but does not actually allow deleting the bad file (or ideally the bad mirror).

            I binary edited an existing layout to replace one of the stripes in the second mirror with "0000" for the FID and "0xffffffff" for the ost_idx using:

            # lfs setstripe -N -E eof -c 2 -N -E eof -c 2 /mnt/testfs/mirror
            [ unmount and mount MDT as type ldiskfs ]
            # getfattr -n trusted.lov /mnt/testfs/mirror > /tmp/mirror.new
            # vi /tmp/mirror.new  (to replace the second-last FID with "0000" and index with "ffffffff")
            # setfattr --restore /tmp/mirror.new
            [ unmount and remount MDT as type lustre ]

            This produced a layout like:

            /mnt/testfs/mirror
              lcm_layout_gen:    2
              lcm_mirror_count:  2
              lcm_entry_count:   2
                lcme_id:             65537
                lcme_mirror_id:      1
                lcme_flags:          init
                lcme_extent.e_start: 0
                lcme_extent.e_end:   EOF
                  lmm_stripe_count:  2
                  lmm_stripe_size:   1048576
                  lmm_pattern:       raid0
                  lmm_layout_gen:    0
                  lmm_stripe_offset: 0
                  lmm_objects:
                  - 0: { l_ost_idx: 0, l_fid: [0x300000bd3:0x3142:0x0] }
                  - 1: { l_ost_idx: 1, l_fid: [0x340000407:0x32a1:0x0] }

                lcme_id:             131074
                lcme_mirror_id:      2
                lcme_flags:          init
                lcme_extent.e_start: 0
                lcme_extent.e_end:   EOF
                  lmm_stripe_count:  2
                  lmm_stripe_size:   1048576
                  lmm_pattern:       raid0
                  lmm_layout_gen:    0
                  lmm_stripe_offset: 4294967295
                  lmm_objects:
                  - 0: { l_ost_idx: -1, l_fid: [0:0x0:0x0] }
                  - 1: { l_ost_idx: 3, l_fid: [0x3c0000407:0x3725:0x0] }

            However, this file cannot be deleted with rm or unlink. The debug.txt.gz log is attached.

            There is also an interaction with LU-17334: the client now waits 30s whenever a broken file is accessed, when it should just skip outrageously broken index numbers:

            @@ -140,6 +140,10 @@ struct lu_tgt_desc *lmv_tgt_retry(struct lmv_obd *lmv, __u32 index)
                    unsigned int level;
                    int rc;
             
            +       if (index >= LOV_V1_INSANE_STRIPE_COUNT) {
            +               CERROR("%s: bad stripe index %u\n", obd->obd_name, index);
            +               return NULL;
            +       }
                    might_sleep();
             retry:
                    tgt = lmv_tgt(lmv, index);

            People

              bzzz Alex Zhuravlev
              bzzz Alex Zhuravlev