Lustre / LU-11848

filefrag/FIEMAP doesn't work for PFL or FLR files

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.14.0
    • Affects Version/s: Lustre 2.11.0, Lustre 2.12.0

    Description

      FIEMAP does not appear to work correctly for PFL or FLR files. For a mirrored file I'd expect filefrag to dump the FIEMAP information for every copy, but it exits after printing the first object. I verified with GDB that the fiemap structure returned by the kernel contains only a single extent.

      $ lfs getstripe /mnt/testfs/flr2
      /mnt/testfs/flr2
        lcm_layout_gen:    5
        lcm_mirror_count:  2
        lcm_entry_count:   2
          lcme_id:             65537
          lcme_mirror_id:      1
          lcme_flags:          init
          lcme_extent.e_start: 0
          lcme_extent.e_end:   EOF
            lmm_stripe_count:  1
            lmm_stripe_size:   1048576
            lmm_pattern:       raid0
            lmm_layout_gen:    0
            lmm_stripe_offset: 3
            lmm_objects:
            - 0: { l_ost_idx: 3, l_fid: [0x100030000:0x4:0x0] }
      
          lcme_id:             131074
          lcme_mirror_id:      2
          lcme_flags:          init
          lcme_extent.e_start: 0
          lcme_extent.e_end:   EOF
            lmm_stripe_count:  1
            lmm_stripe_size:   1048576
            lmm_pattern:       raid0
            lmm_layout_gen:    0
            lmm_stripe_offset: 0
            lmm_objects:
            - 0: { l_ost_idx: 0, l_fid: [0x100000000:0x4:0x0] }
      
      $ filefrag -v /mnt/testfs/flr2
      Filesystem type is: bd00bd0
      File size of /mnt/testfs/flr2 is 8388608 (8192 blocks of 1024 bytes)
       ext:     device_logical:        physical_offset: length:  dev: flags:
         0:        0..    8191:     140000..    148191:   8192: 0003: last,net,eof
      /mnt/testfs/flr2: 1 extent found
      

      For FLR files this is at least somewhat usable, since one full copy of the file is reported. For PFL files, however, the kernel returns -EFBIG after the first component is read:

      $ lfs getstripe /mnt/testfs/pfl3x8
        lcm_entry_count:   3
          lcme_extent.e_start: 0
          lcme_extent.e_end:   1048576
            lmm_stripe_count:  1
            lmm_stripe_size:   1048576
            lmm_pattern:       raid0
            lmm_objects:
            - 0: { l_ost_idx: 2, l_fid: [0x100020000:0x2:0x0] }
      
          lcme_extent.e_start: 1048576
          lcme_extent.e_end:   4194304
            lmm_stripe_count:  3
            lmm_stripe_size:   1048576
            lmm_pattern:       raid0
            lmm_objects:
            - 0: { l_ost_idx: 3, l_fid: [0x100030000:0x2:0x0] }
            - 1: { l_ost_idx: 0, l_fid: [0x100000000:0x2:0x0] }
            - 2: { l_ost_idx: 1, l_fid: [0x100010000:0x2:0x0] }
      
          lcme_extent.e_start: 4194304
          lcme_extent.e_end:   EOF
            lmm_stripe_count:  4
            lmm_stripe_size:   1048576
            lmm_pattern:       raid0
            lmm_objects:
            - 0: { l_ost_idx: 3, l_fid: [0x100030000:0x3:0x0] }
            - 1: { l_ost_idx: 0, l_fid: [0x100000000:0x3:0x0] }
            - 2: { l_ost_idx: 1, l_fid: [0x100010000:0x3:0x0] }
            - 3: { l_ost_idx: 2, l_fid: [0x100020000:0x3:0x0] }
      
      $ filefrag -v /mnt/testfs/pfl3x8
      Filesystem type is: bd00bd0
      File size of /mnt/testfs/pfl3x8 is 8388608 (8192 blocks of 1024 bytes)
       ext:     device_logical:        physical_offset: length:  dev: flags:
         0:        0..    1023:     134880..    135903:   1024: 0002: net
         1:        0..    1023:     134880..    135903:   1024: 0000: net
         2:     1024..    2047:     136928..    137951:   1024: 0003: net
      

      It starts almost correctly, with an extent on OST0002 for [0,1MB]. Next comes what looks like a duplicate of the first extent with only the device number changed, followed by an extent on OST0003 for [1MB,2MB]. The remaining 6 extents are not returned by the first call, even though there is plenty of room in the fiemap array.
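As a cross-check on the "6 missing extents" figure, the expected extent count can be derived from the lfs getstripe output above. This is only a minimal sketch: `struct pfl_comp`, `objects_with_data()` and `expected_extents()` are hypothetical names, not Lustre code, and it assumes each OST object's data is reported as one contiguous extent (as in the two valid extents shown).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)
#define EXT_EOF UINT64_MAX /* stands in for "EOF" in lfs getstripe output */

/* Hypothetical, simplified view of one PFL component. */
struct pfl_comp {
	uint64_t start;        /* lcme_extent.e_start */
	uint64_t end;          /* lcme_extent.e_end (exclusive), or EXT_EOF */
	uint32_t stripe_count; /* lmm_stripe_count */
	uint64_t stripe_size;  /* lmm_stripe_size */
};

/* Layout of /mnt/testfs/pfl3x8 as printed by lfs getstripe. */
static const struct pfl_comp pfl3x8[] = {
	{ 0,       1 * MiB, 1, 1 * MiB },
	{ 1 * MiB, 4 * MiB, 3, 1 * MiB },
	{ 4 * MiB, EXT_EOF, 4, 1 * MiB },
};

/* Number of objects in one component that hold data for a file of `size` bytes. */
static uint64_t objects_with_data(const struct pfl_comp *c, uint64_t size)
{
	uint64_t end, stripes;

	assert(c->stripe_size > 0 && c->stripe_count > 0);
	end = c->end < size ? c->end : size;
	if (end <= c->start)
		return 0;
	/* Round up to the number of whole stripes touched by [start, end). */
	stripes = (end - c->start + c->stripe_size - 1) / c->stripe_size;
	return stripes < c->stripe_count ? stripes : c->stripe_count;
}

/* Expected extent count, ASSUMING one contiguous extent per object. */
static uint64_t expected_extents(const struct pfl_comp *comps, size_t n,
				 uint64_t size)
{
	uint64_t total = 0;

	for (size_t i = 0; i < n; i++)
		total += objects_with_data(&comps[i], size);
	return total;
}
```

For the 8 MiB file above this gives 1 + 3 + 4 objects with data, so two of the expected extents were returned correctly and six are missing.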

      Running strace shows:

      ioctl(3, FS_IOC_FIEMAP, {fm_start=0, fm_length=18446744073709551615, fm_flags=0x40000000 /* FIEMAP_FLAG_??? */, fm_extent_count=292} => {fm_flags=0x40000000 /* FIEMAP_FLAG_??? */, fm_mapped_extents=3, fm_extents=...}) = 0
      ioctl(3, FS_IOC_FIEMAP, {fm_start=0, fm_length=18446744073709551615, fm_flags=0x40000000 /* FIEMAP_FLAG_??? */, fm_extent_count=292}) = -1 EFBIG (File too large)
      

      Running under GDB and dumping the returned extents shows that 3 separate extents are indeed returned (so the strangeness is not in filefrag), and that the second ioctl(FIEMAP) call returns -EFBIG when it tries to continue from the last extent.

      247                     rc = ioctl(fd, FS_IOC_FIEMAP, (unsigned long) fiemap);
      248                     if (rc < 0) {
      (gdb) p *fiemap
      $1 = {fm_start = 0, fm_length = 18446744073709551615, fm_flags = 1073741824, 
        fm_mapped_extents = 3, fm_extent_count = 292, fm_reserved = 0, 
        fm_extents = 0x7fffffff9ca0}
      (gdb) p fiemap->fm_extents[0]
      $2 = {fe_logical = 0, fe_physical = 138117120, fe_length = 1048576, 
        fe_reserved64 = {0, 0}, fe_flags = 2147483648, fe_device = 2, fe_reserved = {0, 0}}
      (gdb) p fiemap->fm_extents[1]
      $3 = {fe_logical = 0, fe_physical = 138117120, fe_length = 1048576, 
        fe_reserved64 = {0, 0}, fe_flags = 2147483648, fe_device = 0, fe_reserved = {0, 0}}
      (gdb) p fiemap->fm_extents[2]
      $4 = {fe_logical = 1048576, fe_physical = 140214272, fe_length = 1048576, 
        fe_reserved64 = {0, 0}, fe_flags = 2147483648, fe_device = 3, fe_reserved = {0, 0}}
      :
      :
      305                     if (flags & FIEMAP_FLAG_DEVICE_ORDER) {
      306                             fm_ext[0].fe_logical =  fm_ext[i - 1].fe_logical + fm_ext[i - 1].fe_length;
      308                             fm_ext[0].fe_device =   fm_ext[i - 1].fe_device;
      309                             fiemap->fm_start =      0;
      :
      :
      244                     fiemap->fm_length = ~0ULL;
      245                     fiemap->fm_flags = flags;
      246                     fiemap->fm_extent_count = count;
      247                     rc = ioctl(fd, FS_IOC_FIEMAP, (unsigned long) fiemap);
      248                     if (rc < 0) {
      251                             rc = -errno;
      (gdb) p *fiemap
      $7 = {fm_start = 0, fm_length = 18446744073709551615, fm_flags = 1073741824, 
        fm_mapped_extents = 0, fm_extent_count = 292, fm_reserved = 0, 
        fm_extents = 0x7fffffff9ca0}
      (gdb) p rc
      $8 = -27
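The raw numbers in the GDB dump can be matched against the filefrag and strace output with a few helpers. This is a sketch for reading the dump, not filefrag code: the helper names are hypothetical, and the two flag values are what I believe Lustre's FIEMAP header defines, so treat them as assumptions and check them against your own tree.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* ASSUMED values of the Lustre-specific FIEMAP flags; verify locally. */
#ifndef FIEMAP_FLAG_DEVICE_ORDER
#define FIEMAP_FLAG_DEVICE_ORDER 0x40000000U /* request device-ordered mapping */
#endif
#ifndef FIEMAP_EXTENT_NET
#define FIEMAP_EXTENT_NET 0x80000000U /* extent is on a network device ("net") */
#endif

/* fm_flags as passed to the ioctl: was device ordering requested? */
static int is_device_order(uint32_t fm_flags)
{
	return (fm_flags & FIEMAP_FLAG_DEVICE_ORDER) != 0;
}

/* fe_flags of a returned extent: is it flagged as remote? */
static int is_net_extent(uint32_t fe_flags)
{
	return (fe_flags & FIEMAP_EXTENT_NET) != 0;
}

/* Convert a byte offset or length to filefrag's block units. */
static uint64_t bytes_to_blocks(uint64_t bytes, uint32_t blocksize)
{
	assert(blocksize > 0);
	return bytes / blocksize;
}
```

With 1024-byte blocks, fe_physical = 138117120 is block 134880 and fe_physical = 140214272 is block 136928, matching the physical_offset column that filefrag printed; rc = -27 is -EFBIG on Linux.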
      

      It looks like the LOV code needs to be taught how to iterate over a composite layout to populate the FIEMAP structure, and how to find the right component from which to restart. We may also need to change filefrag itself, since on a continuation call it currently passes back only the file logical offset of the end of the last printed extent and the device number that extent was on. This could be ambiguous in rare cases, if two mirrors had objects on the same OST for the same file offset and FIEMAP was interrupted exactly there, but that doesn't seem likely and is not a primary concern.
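To illustrate the restart problem, here is a minimal sketch of looking up a component from the (logical offset, device) pair that filefrag passes back. It is not the LOV implementation: `struct comp`, `find_restart_comp()` and the fixed `MAX_STRIPES` bound are hypothetical, and the layouts are copied from the lfs getstripe output above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)
#define EXT_EOF UINT64_MAX /* stands in for "EOF" */
#define MAX_STRIPES 4      /* enough for the examples in this ticket */

/* Hypothetical, simplified component: extent plus the OSTs of its objects. */
struct comp {
	uint64_t start, end; /* [start, end) in file logical bytes */
	uint32_t stripe_count;
	uint32_t ost_idx[MAX_STRIPES];
};

static const struct comp pfl3x8[] = {
	{ 0,       1 * MiB, 1, { 2 } },
	{ 1 * MiB, 4 * MiB, 3, { 3, 0, 1 } },
	{ 4 * MiB, EXT_EOF, 4, { 3, 0, 1, 2 } },
};

static const struct comp flr2[] = {
	{ 0, EXT_EOF, 1, { 3 } }, /* mirror 1 */
	{ 0, EXT_EOF, 1, { 0 } }, /* mirror 2 */
};

static int comp_has_ost(const struct comp *c, uint32_t ost)
{
	assert(c->stripe_count <= MAX_STRIPES);
	for (uint32_t i = 0; i < c->stripe_count; i++)
		if (c->ost_idx[i] == ost)
			return 1;
	return 0;
}

/*
 * Return the index of the first component that covers `offset` and has an
 * object on `ost`, or -1 if there is none. If `matches` is non-NULL it
 * receives the number of candidates; more than one means the restart
 * position is ambiguous.
 */
static int find_restart_comp(const struct comp *comps, size_t n,
			     uint64_t offset, uint32_t ost, size_t *matches)
{
	int first = -1;
	size_t found = 0;

	for (size_t i = 0; i < n; i++) {
		const struct comp *c = &comps[i];

		if (offset < c->start || offset >= c->end)
			continue;
		if (!comp_has_ost(c, ost))
			continue;
		if (first < 0)
			first = (int)i;
		found++;
	}
	if (matches)
		*matches = found;
	return first;
}
```

For the PFL file, the restart position after the first call is (2 MiB, OST 3), which resolves to the second component only. Two mirrors with objects on the same OST would both match the same (offset, device) pair, which is the rare ambiguity described above.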

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: adilger Andreas Dilger