[LU-8417] setstripe -o does not work on directories

Details

    • Improvement
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.7.0
    • 3
    • 9223372036854775807

    Description

      The -o option to lfs setstripe, which specifies the OSTs to use, works on files but fails with an error when applied to a directory.

      [root@Lustre-TG1 lustrefs]# mkdir testdir
      [root@Lustre-TG1 lustrefs]# lfs setstripe -o 0-3 -c 4 testfile
      [root@Lustre-TG1 lustrefs]# lfs getstripe testfile
      testfile
      lmm_stripe_count:   4
      lmm_stripe_size:    1048576
      lmm_pattern:        1
      lmm_layout_gen:     0
      lmm_stripe_offset:  0
              obdidx           objid           objid           group
                   0            9925         0x26c5                0
                   1            1606          0x646                0
                   2            1608          0x648                0
                   3            1609          0x649                0
      
      [root@Lustre-TG1 lustrefs]# lfs setstripe -o 0-3 -c 4 testdir
      error on ioctl 0x4008669a for 'testdir' (3): Invalid argument
      error: setstripe: create stripe file 'testdir' failed
      

          Activity

            adilger Andreas Dilger added a comment

            It looks like the directory getstripe was fixed by patch https://review.whamcloud.com/55311 ("LU-15565 utils: updated lfs getstripe yaml format").

            mvef Marc Vef added a comment (edited)

            It looks like lfs getstripe on a directory now also shows the OST indices:

            # mkdir dir4 
            # lfs setstripe -o 2,1,3,0 dir4
            # touch dir4/{file1,file2}
            # lfs getstripe --yaml dir4
            stripe_count:  4
            stripe_size:   4194304
            pattern:       raid0
            stripe_offset: 2
            lmm_objects:
                  - l_ost_idx: 2
                    l_fid:     0x100020000:0x0:0x0
                  - l_ost_idx: 1
                    l_fid:     0x100010000:0x0:0x0
                  - l_ost_idx: 3
                    l_fid:     0x100030000:0x0:0x0
                  - l_ost_idx: 0
                    l_fid:     0x100000000:0x0:0x0
            lmm_stripe_count:  4
            lmm_stripe_size:   4194304
            lmm_pattern:       raid0
            lmm_layout_gen:    0
            lmm_stripe_offset: 2
            lmm_objects:
                  - l_ost_idx: 2
                    l_fid:     0x2c0000400:0x3:0x0
                  - l_ost_idx: 1
                    l_fid:     0x280000400:0x2:0x0
                  - l_ost_idx: 3
                    l_fid:     0x300000400:0x2:0x0
                  - l_ost_idx: 0
                    l_fid:     0x240000400:0x2:0x0
            lmm_stripe_count:  4
            lmm_stripe_size:   4194304
            lmm_pattern:       raid0
            lmm_layout_gen:    0
            lmm_stripe_offset: 2
            lmm_objects:
                  - l_ost_idx: 2
                    l_fid:     0x2c0000400:0x4:0x0
                  - l_ost_idx: 1
                    l_fid:     0x280000400:0x3:0x0
                  - l_ost_idx: 3
                    l_fid:     0x300000400:0x3:0x0
                  - l_ost_idx: 0
                    l_fid:     0x240000400:0x3:0x0
            
            

            I'm not sure about the l_fid fields, though: plain lfs getstripe (without --yaml) shows the object IDs as N/A:

            # lfs getstripe dir4
            dir4
            stripe_count:  4 stripe_size:   4194304 pattern:       raid0 stripe_offset: 2
                    obdidx           objid           objid           group
                         2             N/A            N/A              N/A
                         1             N/A            N/A              N/A
                         3             N/A            N/A              N/A
                         0             N/A            N/A              N/A
            
            [... omitted further output]
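
For scripts that need the directory's default OST list from the --yaml output above, a minimal sketch follows. It assumes that each per-file block starts with an `lmm_stripe_count:` line, so everything before the first such line describes the directory itself. The function name and the regular expression are illustrative, not part of lfs, and the sample text is trimmed from the output in this comment.

```python
import re

# Matches one "- l_ost_idx: N" list entry in `lfs getstripe --yaml` output.
OST_IDX_RE = re.compile(r"^-\s*l_ost_idx:\s*(\d+)$")


def default_layout_ost_indices(getstripe_yaml):
    """Return the OST indices in the directory's own (default) layout block.

    Assumption: per-file blocks begin with an `lmm_stripe_count:` line, so
    everything before the first such line belongs to the directory itself.
    """
    indices = []
    for line in getstripe_yaml.splitlines():
        stripped = line.strip()
        if stripped.startswith("lmm_stripe_count:"):
            break  # first per-file layout reached
        match = OST_IDX_RE.match(stripped)
        if match:
            indices.append(int(match.group(1)))
    return indices


# Trimmed copy of the directory output shown in this comment.
SAMPLE = """\
stripe_count:  4
stripe_size:   4194304
pattern:       raid0
stripe_offset: 2
lmm_objects:
      - l_ost_idx: 2
        l_fid:     0x100020000:0x0:0x0
      - l_ost_idx: 1
        l_fid:     0x100010000:0x0:0x0
      - l_ost_idx: 3
        l_fid:     0x100030000:0x0:0x0
      - l_ost_idx: 0
        l_fid:     0x100000000:0x0:0x0
lmm_stripe_count:  4
lmm_stripe_size:   4194304
lmm_objects:
      - l_ost_idx: 2
        l_fid:     0x2c0000400:0x3:0x0
"""

print(default_layout_ost_indices(SAMPLE))  # [2, 1, 3, 0]
```

On output where the directory block carries no lmm_objects list, the function returns an empty list, which makes it easy to tell whether the explicit OST layout was printed at all.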
            

             


            adilger Andreas Dilger added a comment

            It looks like specifying explicit OSTs has been working since at least 2.14:

            # lfs setstripe -o 2,3,2,3 /mnt/testfs/specific
            # touch /mnt/testfs/specific/fff
            # lfs getstripe --yaml /mnt/testfs/specific
            stripe_count:  4
            stripe_size:   1048576
            pattern:       raid0,overstriped
            stripe_offset: 2

            lmm_stripe_count:  4
            lmm_stripe_size:   1048576
            lmm_pattern:       raid0,overstriped
            lmm_layout_gen:    0
            lmm_stripe_offset: 2
            lmm_objects:
                  - l_ost_idx: 2
                    l_fid:     0x380000401:0x79:0x0
                  - l_ost_idx: 3
                    l_fid:     0x3c0000401:0x79:0x0
                  - l_ost_idx: 2
                    l_fid:     0x380000401:0x7a:0x0
                  - l_ost_idx: 3
                    l_fid:     0x3c0000401:0x7a:0x0

            The one remaining issue is that "lfs getstripe" on the directory does not print the specific layout properly. It should print the specific OST indices and not just the stripe count.

            adilger Andreas Dilger added a comment

            It looks like https://review.whamcloud.com/12275 implements this to some extent.

            adilger Andreas Dilger added a comment

            Link to the original LU-4665 ticket that introduced this feature.

            adilger Andreas Dilger added a comment

            One issue with using "-o ... -c" to implement "temporary pools" is that there is currently no place to store the object allocation state across file creates, as there is with a proper pool. The allocator therefore cannot be smart about round-robin allocation (e.g. 0+1, 2+3, 4+5, 6+7, 1+2, 3+4, ...), which means tracking the last OST index used and avoiding stripe 0 landing on the same OST repeatedly.

            The MDS would somehow need to dynamically allocate internal allocation state based on the specified OST list, then keep it in memory for some time to handle further allocations with the same group of OSTs. It is OK if the same OST list is shared by different directories, since the file creation and IO load on the OSTs is also shared. There is no significant issue if the allocation state is dropped after some idle time (e.g. a couple of minutes), since the load on the OSTs is also transient.

            Ideally this would be implemented as part of LU-9 so that the imbalance of file creates on a subset of OSTs does not cause global imbalance across other OSTs.
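
The allocation state described in this comment can be illustrated with a small in-memory model. This is only a sketch of the idea, not Lustre's MDS allocator: the class name `AllocationStateCache`, its `allocate` method, and the 120-second idle timeout are all assumptions made for illustration. State is keyed by the OST list itself, so directories that use the same list share it, and state that has been idle too long is treated as if it had been dropped.

```python
import time


class AllocationStateCache:
    """Toy model of per-OST-list round-robin state (hypothetical, not Lustre code)."""

    def __init__(self, idle_timeout=120.0, clock=time.monotonic):
        self._idle_timeout = idle_timeout  # seconds; "a couple of minutes"
        self._clock = clock
        self._states = {}  # tuple(ost_list) -> (next_start, last_used)

    def allocate(self, ost_list, stripe_count):
        """Pick stripe_count OSTs from ost_list, continuing where the last call stopped."""
        key = tuple(ost_list)
        n = len(key)
        if not 0 < stripe_count <= n:
            raise ValueError("stripe_count must be between 1 and len(ost_list)")

        now = self._clock()
        next_start, last_used = self._states.get(key, (0, now))
        if now - last_used > self._idle_timeout:
            next_start = 0  # idle too long: behave as if the state was dropped

        stripes = [key[(next_start + i) % n] for i in range(stripe_count)]

        end = next_start + stripe_count
        if end >= n:
            # A full pass over the list is complete: shift the start by one so
            # stripe 0 does not land on the same OSTs as in the previous pass.
            new_start = (end % n + 1) % n
        else:
            new_start = end
        self._states[key] = (new_start, now)
        return stripes


cache = AllocationStateCache()
for _ in range(6):
    print(cache.allocate(range(8), 2))
```

With eight OSTs and a stripe count of two, the six calls print the pairs from the example in this comment: 0+1, 2+3, 4+5, 6+7, 1+2, 3+4.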

            adilger Andreas Dilger added a comment

            This would need to store the full uninitialized LOV EA on the MDS directory as the template for the file layout. Based on discussion with Gary, this is needed for testing OSS performance. For "real world" usage, it would probably be a "poor man's OST pools" in the sense that one could specify a list of OSTs and a stripe count less than the total OST count, and thereby select a subset of OSTs to create files on.

            People

              wc-triage WC Triage
              ghagensen Gary Hagensen (Inactive)
              Votes: 0
              Watchers: 10
