Details

    • Type: Technical task
    • Resolution: Fixed
    • Priority: Blocker
    • Lustre 2.5.0
    • Lustre 2.5.0
    • 10028

    Description

      HSM release is not preserving the block count as reported by stat()

      # cd /mnt/lustre
      # dd if=/dev/zero of=Antoshka bs=1M count=10
      10+0 records in
      10+0 records out
      10485760 bytes (10 MB) copied, 0.0740321 s, 142 MB/s
      # stat Antoshka 
        File: `Antoshka'
        Size: 10485760  	Blocks: 20480      IO Block: 4194304 regular file
      Device: 2c54f966h/743766374d	Inode: 144115205255725060  Links: 1
      Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
      Access: 2013-08-30 10:13:48.000000000 -0500
      Modify: 2013-08-30 10:13:48.000000000 -0500
      Change: 2013-08-30 10:13:48.000000000 -0500
      # lfs hsm_archive Antoshka 
      # lfs hsm_release Antoshka
      # stat Antoshka
        File: `Antoshka'
        Size: 10485760  	Blocks: 0          IO Block: 4194304 regular file
      Device: 2c54f966h/743766374d	Inode: 144115205255725060  Links: 1
      Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
      Access: 2013-08-30 10:13:48.000000000 -0500
      Modify: 2013-08-30 10:13:48.000000000 -0500
      Change: 2013-08-30 10:13:48.000000000 -0500
      

      I had intended to fix this with LU-3811 but it will require some work in the MD* attr_set patch.

      If you're thinking (philosophically) "hmm, well, maybe it should report a block count of 0 here", then you're just wrong.

      Attachments

        Activity

          [LU-3864] stat() on HSM released file returns st_blocks = 0

          John,

          I agree with you that checking each possible tool is not the right way.

          • Users will be confused by this at first, but we should teach them. You cannot really expect this to be fully transparent for them. I'm pretty sure that users will ask admins why their file access times seem random, because they do not realize that files may be restored when accessed, and that this can require mounting tapes, which can take a very long time.
          • DMF, which is the major HSM in the HPC environment, and probably the HSM that will be used most with Lustre/HSM, does exactly this. On its XFS/CXFS frontend, DMF reports st_blocks == 0 for released (offline) files, and has probably been doing so for 20 years. It is the main solution on the market right now, and DMF is very interested in Lustre as a frontend because CXFS does not scale (and because Lustre is very widely deployed).
          • Admins like that they can see the difference between theoretical space and real disk usage through stat/du.

          I think this is very simple and that we're being a bit lazy here.

          What will be your proposal?

          I still consider that this behavior could be changed, but it is not as simple as saying it is totally wrong to report 0 for st_blocks. I think reporting st_blocks as 0 is a feature, not a bug. Only the potential data corruption with some "too-smart" tools is a concern to me.

          adegremont Aurelien Degremont (Inactive) added a comment

          Hi John, by design, st_blocks is the actual number of disk blocks consumed by the file data, so it should be reasonable to report 0 for st_blocks on released files; this matches that definition. But I agree that reporting st_blocks as zero is confusing here, because it may cause applications to think this is a totally sparse file, so FIEMAP won't be consulted at all.

          Any decision on this problem is reasonable to me.

          jay Jinshan Xiong (Inactive) added a comment
          jhammond John Hammond added a comment -

          I think this is very simple and that we're being a bit lazy here.

          It is a bug for release to change st_blocks. Remember that HSM is supposed to be transparent to the user. We shouldn't be trying to guess if some application will depend on st_blocks. Instead we should assume that some application depends on st_blocks and that there will be data loss because of this bug. It can be a blocker or not, I don't care so much. But it's a bug.

          The fact that we're focused on cp and tar is troubling. What about bbcp, rsync, zip, and all of those terrible data movers that the open science sites use? Have we checked them too?

          Moreover, it's very easy to think of situations where regular users will be confused by this. For example, when they run 'du -hs' on their homedir and then tell the admin that the 8TB of data they just created must be lost. The admin who uses HSM, on the other hand, can be told and should know that 'du -hs' will not always reflect the frontend FS usage.


          Hi Aurelien and Bruno,

          If we can fix this issue with FIEMAP, that'll be great. I was afraid that we would fix the tar issue but introduce another one by reporting an incorrect block count.

          Jinshan

          jay Jinshan Xiong (Inactive) added a comment

          > Did I understand and is this what you mean ?

          This is perfectly correct!

          > Then, only difference I see already is that we will not use the same cp logic for "real" non-sparse (plain or dense) but released files, if any specifics are made in cp mainly for performance but this could be just nothing regarding time for restore.

          I do not understand what you mean

          What I see is that cp will read the just-released-but-now-restored file sequentially, instead of doing something more efficient, even though the file could really be "a little bit" sparse.
          So before the file was released, cp would have copied it efficiently, creating a sparse copy.
          After restore, it will copy it fully, and the copy will no longer be sparse.
          I think this is acceptable.

          adegremont Aurelien Degremont (Inactive) added a comment

          Aurelien,
          Just to be sure that I fully understand what you mean in your last comment:

          _ for a released file, we can return st_blocks = 0 with the correct st_size, as today.
          _ this allows cp (and tar?) to presume it is a sparse file and go through the related special logic.
          _ but then we need to change Lustre's FIEMAP to return something like FIEMAP_EXTENT_DELALLOC or FIEMAP_EXTENT_UNKNOWN for released files to make cp happy.

          Did I understand and is this what you mean ?

          Then, only difference I see already is that we will not use the same cp logic for "real" non-sparse (plain or dense) but released files, if any specifics are made in cp mainly for performance but this could be just nothing regarding time for restore.

          bfaccini Bruno Faccini (Inactive) added a comment

          Jinshan, even if I thought at first that reporting a 0 value for st_blocks was a good idea, we cannot afford to keep doing this due to the cp/tar issue.

          It is unacceptable that users doing "cp --sparse" or "tar --sparse" copy WRONG data when this is applied to a released file.
          After looking at the coreutils code and LU-2580, I think the bug here is that the FIEMAP output for a released file is wrong.
          coreutils just uses st_blocks to detect whether the file looks sparse.

          It seems we can return a special FIEMAP for released files, with one region that has FIEMAP_EXTENT_DELALLOC or FIEMAP_EXTENT_UNKNOWN set.

          I've not checked tar.

          Andreas, do you think this solution is reasonable?

          adegremont Aurelien Degremont (Inactive) added a comment

          I am late to this thread (even if the ticket is assigned to me!), but even if I understand the possible issue when using "smart" tools, I wonder what will happen with "simple" tools like "du" if st_blocks no longer reflects the current number of consumed blocks.

          Do we know how other HSMs behave, like DataMigration and others? I seem to remember DM did not do anything tricky with st_blocks, but maybe that has changed since.

          bfaccini Bruno Faccini (Inactive) added a comment

          Hi Andreas, is there any side effect if we set st_blocks to be nonzero even if the file doesn't consume any disk blocks in Lustre?

          To be honest, I'm totally okay with reporting st_blocks as zero for released files. Backup usually applies to the latest files, while HSM should move old files to secondary storage, so their targets do not conflict. If a backup has to restore every file in Lustre, this is probably not what the administrator wants.

          jay Jinshan Xiong (Inactive) added a comment

          Another workaround, to avoid storing the previous block count, would be to return a theoretical maximum block count, computed as (FILESIZE / BLOCKSIZE) + 1.

          The drawback of this method is that st_blocks could be larger than it was before the file was released. But this is much simpler to patch.

          I imagine we also have to adjust FIEMAP for released files?

          adegremont Aurelien Degremont (Inactive) added a comment

          "This is not a bug, this is a feature" but as Andreas pointed out, we will have to modify this as this could lead to issue when using, too smart, tools.

          So, where could we store the original block count? There's no place for that in the inode. Should we put it in the HSM EA and use that value when replying to GETATTR?

          adegremont Aurelien Degremont (Inactive) added a comment

          People

            bfaccini Bruno Faccini (Inactive)
            jhammond John Hammond
            Votes: 0
            Watchers: 9
