Details

    • Type: Technical task
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: Lustre 2.5.0
    • Fix Version/s: Lustre 2.5.0
    • 10028

    Description

      HSM release is not preserving the block count as reported by stat()

      # cd /mnt/lustre
      # dd if=/dev/zero of=Antoshka bs=1M count=10
      10+0 records in
      10+0 records out
      10485760 bytes (10 MB) copied, 0.0740321 s, 142 MB/s
      # stat Antoshka 
        File: `Antoshka'
        Size: 10485760  	Blocks: 20480      IO Block: 4194304 regular file
      Device: 2c54f966h/743766374d	Inode: 144115205255725060  Links: 1
      Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
      Access: 2013-08-30 10:13:48.000000000 -0500
      Modify: 2013-08-30 10:13:48.000000000 -0500
      Change: 2013-08-30 10:13:48.000000000 -0500
      # lfs hsm_archive Antoshka 
      # lfs hsm_release Antoshka
      # stat Antoshka
        File: `Antoshka'
        Size: 10485760  	Blocks: 0          IO Block: 4194304 regular file
      Device: 2c54f966h/743766374d	Inode: 144115205255725060  Links: 1
      Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
      Access: 2013-08-30 10:13:48.000000000 -0500
      Modify: 2013-08-30 10:13:48.000000000 -0500
      Change: 2013-08-30 10:13:48.000000000 -0500
      

      I had intended to fix this with LU-3811 but it will require some work in the MD* attr_set patch.

      If you're thinking (philosophically), "hmm, well, maybe it should report a block count of 0 here," then you're just wrong.
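
      For reference, st_blocks is reported in 512-byte units, so the 20480 blocks above account for exactly the 10485760 bytes of file data, while the released copy accounts for none of it. Below is a minimal user-space sketch of how an application would observe this; the path and output format are illustrative only.

      #include <stdio.h>
      #include <sys/stat.h>

      int main(void)
      {
              struct stat st;

              if (stat("/mnt/lustre/Antoshka", &st) != 0) {
                      perror("stat");
                      return 1;
              }

              /* st_blocks counts 512-byte units: 20480 * 512 = 10485760
               * before release, 0 after release. */
              printf("size=%lld blocks=%lld allocated=%lld bytes\n",
                     (long long)st.st_size, (long long)st.st_blocks,
                     (long long)st.st_blocks * 512);
              return 0;
      }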

      Attachments

        Activity

          [LU-3864] stat() on HSM released file returns st_blocks = 0

          Hmm, it seems that "tar --sparse", instead of using FIEMAP as expected, uses an odd optimization (if st_blocks = 0) that causes a released file to be archived as a fully sparse file. I did not check all/latest versions of tar.

          It seems that btrfs encountered this problem with small files (i.e., small enough to have the data stored with the metadata and thus report st_blocks = 0), as detailed in Red Hat Bugzilla #757557 (https://bugzilla.redhat.com/show_bug.cgi?id=757557). If I understand correctly, they fixed it by returning st_blocks = 1. Should we do the same?
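
          To illustrate the kind of heuristic in question, here is a simplified sketch (not GNU tar's actual code): a file whose allocated bytes fall short of its size is treated as containing holes, and st_blocks == 0 is taken to mean the whole file is one hole, which is exactly what a released file now looks like.

          #include <stdbool.h>
          #include <sys/stat.h>

          /* Simplified sketch of the optimization described above, not GNU
           * tar's actual code. */
          static bool looks_sparse(const struct stat *st)
          {
                  /* fewer allocated bytes than the file size => holes */
                  return (long long)st->st_blocks * 512 < (long long)st->st_size;
          }

          static bool looks_entirely_sparse(const struct stat *st)
          {
                  /* st_blocks == 0 with a non-zero size => "all hole";
                   * a released HSM file matches, so its data would be
                   * skipped instead of read back from the archive. */
                  return st->st_blocks == 0 && st->st_size > 0;
          }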

          bfaccini Bruno Faccini (Inactive) added a comment
          bfaccini Bruno Faccini (Inactive) added a comment - - edited

          Hmm, during first patch-set testing I also found (it seems not already reported) that doing a "filefrag" (i.e., the FIEMAP ioctl) on a released file just crashes with the following LBUG:

          LustreError: 13811:0:(lov_obd.c:2488:lov_fiemap()) ASSERTION( fm_local ) failed: 
          LustreError: 13811:0:(lov_obd.c:2488:lov_fiemap()) LBUG
          Pid: 13811, comm: filefrag
          
          Call Trace:
           [<ffffffffa0206895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
           [<ffffffffa0206e97>] lbug_with_loc+0x47/0xb0 [libcfs]
           [<ffffffffa084e56c>] lov_get_info+0x10ac/0x1cb0 [lov]
           [<ffffffff8112fca0>] ? __lru_cache_add+0x40/0x90
           [<ffffffffa08697ab>] ? lov_lsm_addref+0x6b/0x130 [lov]
           [<ffffffffa0dbaab1>] ll_do_fiemap+0x411/0x6b0 [lustre]
           [<ffffffffa0dc5d97>] ll_fiemap+0x117/0x590 [lustre]
           [<ffffffff811956e5>] do_vfs_ioctl+0x505/0x580
           [<ffffffff811957e1>] sys_ioctl+0x81/0xa0
           [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
          

          This happens with or without my patch; it is due to unconditionally freeing "fm_local" at the end of the lov_fiemap() routine, even when it was not allocated (because of no-object/ENOMEM, and now also when the file is released)!

          I will fix that in patch-set #7, in addition to addressing Andreas's comments on patch-set #6.
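
          (For illustration, here is a generic sketch of the cleanup pattern described above, with hypothetical names; it is not the actual lov_fiemap() code.)

          #include <stdlib.h>

          /* Only reach the free() once the buffer may have been allocated,
           * or make the cleanup tolerate NULL instead of asserting.  Names
           * (buf, map_extents_example) are hypothetical. */
          int map_extents_example(int file_is_released)
          {
                  char *buf = NULL;
                  int rc = 0;

                  if (file_is_released)
                          goto out;       /* early exit: buf never allocated */

                  buf = malloc(4096);
                  if (buf == NULL) {
                          rc = -1;
                          goto out;       /* allocation failed: also no buf */
                  }

                  /* ... build the extent mapping in buf ... */

          out:
                  free(buf);              /* free(NULL) is safe; the LBUG came
                                           * from asserting buf always existed */
                  return rc;
          }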


          There is still a patch to land for this bug.

          adilger Andreas Dilger added a comment

          > BTW, do we know what answer DMF and/or GHI provide to FIEMAP?

          OK, I got the information from SGI.
          XFS (for DMF) returns one full extent, with normal flags (not UNKNOWN or DELALLOC, etc.).

          I do not know for GHI.
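
          (For anyone who wants to check what a given filesystem reports, this is roughly what filefrag does; a hedged user-space sketch, with an arbitrary extent-buffer size and output format.)

          #include <stdio.h>
          #include <stdlib.h>
          #include <fcntl.h>
          #include <unistd.h>
          #include <sys/ioctl.h>
          #include <linux/fs.h>
          #include <linux/fiemap.h>

          int main(int argc, char **argv)
          {
                  /* room for the header plus up to 32 extents (arbitrary) */
                  size_t sz = sizeof(struct fiemap) + 32 * sizeof(struct fiemap_extent);
                  struct fiemap *fm = calloc(1, sz);
                  int fd;

                  if (argc < 2 || fm == NULL)
                          return 1;
                  fd = open(argv[1], O_RDONLY);
                  if (fd < 0) {
                          perror("open");
                          return 1;
                  }

                  fm->fm_start = 0;
                  fm->fm_length = FIEMAP_MAX_OFFSET;      /* map the whole file */
                  fm->fm_extent_count = 32;

                  if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
                          perror("FS_IOC_FIEMAP");
                          return 1;
                  }

                  for (unsigned int i = 0; i < fm->fm_mapped_extents; i++)
                          printf("extent %u: logical=%llu len=%llu flags=0x%x\n",
                                 i, (unsigned long long)fm->fm_extents[i].fe_logical,
                                 (unsigned long long)fm->fm_extents[i].fe_length,
                                 fm->fm_extents[i].fe_flags);

                  free(fm);
                  close(fd);
                  return 0;
          }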

          adegremont Aurelien Degremont (Inactive) added a comment

          Hello Aurelien,
          It seems everybody "agrees" with st_blocks being 0 for released files. This is the reason the ticket resolution had been set to "won't fix".

          Concerning the change to FIEMAP, this can be tracked as a new ticket (or still this one), but with lower priority.

          BTW, do we know what answer DMF and/or GHI provide to FIEMAP?

          It seems to me that it is not a frozen interface: when googling, I found that a FIEMAP_EXTENT_SECONDARY flag exists in some implementations (to indicate that extent data are on HSM) and may fit our needs here. But then, is it used/tested by coreutils and other tools?

          Anyway, I wrote a first patch attempt that returns a single extent with (FIEMAP_EXTENT_DELALLOC | FIEMAP_EXTENT_UNKNOWN | FIEMAP_EXTENT_LAST). It is http://review.whamcloud.com/7584.
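
          (Roughly what such a reply looks like: a hypothetical sketch, not the code in the review above, reporting one extent that covers the whole file and carries those three flags.)

          #include <string.h>
          #include <linux/fiemap.h>

          /* Hypothetical sketch of the behavior described above, not the code
           * in http://review.whamcloud.com/7584: report one extent spanning
           * the whole released file, with no usable physical address. */
          static void fill_released_extent(struct fiemap *fiemap, __u64 file_size)
          {
                  struct fiemap_extent *fe = &fiemap->fm_extents[0];

                  memset(fe, 0, sizeof(*fe));
                  fe->fe_logical = 0;             /* starts at file offset 0 */
                  fe->fe_length  = file_size;     /* covers the whole file */
                  fe->fe_flags   = FIEMAP_EXTENT_UNKNOWN |
                                   FIEMAP_EXTENT_DELALLOC |
                                   FIEMAP_EXTENT_LAST;
                  fiemap->fm_mapped_extents = 1;
          }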

          bfaccini Bruno Faccini (Inactive) added a comment

          OK, I've checked with GHI (the GPFS/HPSS Interface), which is the only competing product to Lustre/HSM.
          GHI also returns st_blocks == 0 when files are RELEASED.

          I propose that FIEMAP be modified for RELEASED files, and that st_blocks still return 0. We will see if a lot of people complain about that.

          adegremont Aurelien Degremont (Inactive) added a comment

          Why has this bug been marked as wontfix?

          We must, at least, fix FIEMAP.

          adegremont Aurelien Degremont (Inactive) added a comment

          John,

          I agree with you that checking each possible tool is not the right way.

          • Users will be confused by this at first, but we should teach them. You cannot really consider that this will be fully transparent for them. I'm pretty sure that users will ask admins why their file accesses behave so unpredictably, because they do not realize that files may be restored when accessed, and this can require mounting tapes, which can take a really long time.
          • DMF, which is the major HSM in HPC environments, and probably the HSM that will be used the most with Lustre/HSM, does exactly this. DMF, on its XFS/CXFS frontend, reports st_blocks == 0 for released (offline) files. DMF has probably been doing this for 20 years now. It is the main solution on the market right now, and the DMF team is very interested in Lustre as a frontend because CXFS does not scale (and because Lustre is very widely deployed).
          • Admins like that they can see the difference between theoretical space and real disk usage through stat/du.

          I think this is very simple and that we're being a bit lazy here.

          What would your proposal be?

          I still consider that this behavior could be changed, but it is not as simple as saying: it is totally wrong to report 0 for st_blocks. I think reporting st_blocks as 0 is a feature, not a bug. Only the potential data corruption by some "too-smart" tools is a concern to me.

          adegremont Aurelien Degremont (Inactive) added a comment

          Hi John, by design st_blocks is the actual number of disk blocks consumed by the file data, so it should be reasonable to report 0 for st_blocks for released files, this matches the . But I agree that reporting st_blocks as zero is confusing here, because it may cause applications to think this is a totally sparse file, so FIEMAP won't be consulted at all.

          Any decision for this problem is reasonable to me.

          jay Jinshan Xiong (Inactive) added a comment
          jhammond John Hammond added a comment -

          I think this is very simple and that we're being a bit lazy here.

          It is a bug for release to change st_blocks. Remember that HSM is supposed to be transparent to the user. We shouldn't be trying to guess if some application will depend on st_blocks. Instead we should assume that some application depends on st_blocks and that there will be data loss because of this bug. It can be a blocker or not, I don't care so much. But it's a bug.

          The fact that we're focused on cp and tar is troubling. What about bbcp, rsync, zip, and all of those terrible data movers that the open science sites use? Have we checked them too?

          Moreover, it's very easy to think of situations where regular users will be confused by this. For example, when they run 'du -hs' on their homedir and then tell the admin that the 8TB of data they just created must have been lost. The admin who uses HSM, on the other hand, can be told and should know that 'du -hs' will not always reflect the frontend FS usage.


          Hi Aurelien and Bruno,

          If we can fix this issue via FIEMAP, that'll be great. I was afraid that we would fix the tar issue but introduce another one by reporting an incorrect block count.

          Jinshan

          jay Jinshan Xiong (Inactive) added a comment

          People

            Assignee: bfaccini Bruno Faccini (Inactive)
            Reporter: jhammond John Hammond
            Votes: 0
            Watchers: 9
