LU-417: block usage is reported as zero by stat call for tens of seconds after creating a file

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.2.0, Lustre 2.1.1
    • Affects Version/s: Lustre 2.1.0, Lustre 2.2.0, Lustre 1.8.6
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 4797

    Description

      If a file is written on a Lustre filesystem and is copied to a local (xfs)
      file system immediately, the copied file becomes a sparse file.

      For example:


      sgiadm@recca01:~> df /work /data
      Filesystem 1K-blocks Used Available Use% Mounted on
      10.0.1.2@o2ib:/lustre
      38446862208 25530740868 10963120932 70% /work
      /dev/lxvm/IS5000-File-1
      123036116992 41805493792 81230623200 34% /data

      sgiadm@recca01:/data/sgi> cat test.sh
      #!/bin/sh
      SRC=/work/sgi
      DST=/data/sgi

      rm $SRC/file* $DST/file*

      dd if=/dev/zero of=$SRC/file0 bs=1024k count=100
      cp $SRC/file0 $DST/file0
      dd if=/dev/zero of=$SRC/file1 bs=1024k count=100 oflag=direct
      cp $SRC/file1 $DST/file1
      sync
      wait

      ls -sl $SRC
      ls -sl $DST
      sgiadm@recca01:/data/sgi> ./test.sh
      100+0 records in
      100+0 records out
      104857600 bytes (105 MB) copied, 0.282088 s, 372 MB/s
      100+0 records in
      100+0 records out
      104857600 bytes (105 MB) copied, 1.13752 s, 92.2 MB/s
      total 204804
      102404 -rw-r--r-- 1 sgiadm users 104857600 2011-06-13 16:02 file0
      102404 -rw-r--r-- 1 sgiadm users 104857600 2011-06-13 16:02 file1
      total 102404
      0 -rw-r--r-- 1 sgiadm users 104857600 2011-06-13 16:02 file0
      102400 -rw-r--r-- 1 sgiadm users 104857600 2011-06-13 16:02 file1
      4 -rwxr-xr-x 1 sgiadm users 338 2011-06-13 16:01 test.sh


      In the above case, file0 was copied as a sparse file.
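
A minimal sketch of the check that makes file0 look sparse, assuming GNU stat. The helper names `is_underallocated` and `looks_sparse` are hypothetical and do not come from this ticket. The comparison (allocated bytes smaller than the file size) is similar in spirit to what copy tools are understood to use when deciding whether a source file has holes, but it is not a copy of any tool's actual code.

```shell
#!/bin/sh
# Sketch only: flag a file whose allocated blocks cover less than its size.
# Helper names are hypothetical; stat -c with %s, %b and %B is GNU stat.

# Pure check on numbers.
# $1 = size in bytes, $2 = allocated blocks, $3 = bytes per block
is_underallocated() {
    [ "$1" -gt 0 ] && [ $(( $2 * $3 )) -lt "$1" ]
}

# Wrapper that reads the three values from a real file.
looks_sparse() {
    # %s = size in bytes, %b = allocated blocks, %B = bytes per %b block
    counts=$(stat -c '%s %b %B' -- "$1") || return 2
    set -- $counts
    is_underallocated "$1" "$2" "$3"
}
```

Run as `looks_sparse /work/sgi/file0 && echo "reported as sparse"` right after the dd. On an affected client this would be expected to print the message until the block count is updated.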

      One minute later, the problem no longer occurs.


      sgiadm@recca01:~> cp /work/sgi/file0 /data/sgi/file0-2
      sgiadm@recca01:~> ls -sl /data/sgi
      total 204804
      0 -rw-r--r-- 1 sgiadm users 104857600 2011-06-13 16:02 file0
      102400 -rw-r--r-- 1 sgiadm users 104857600 2011-06-13 16:51 file0-2
      102400 -rw-r--r-- 1 sgiadm users 104857600 2011-06-13 16:02 file1
      4 -rwxr-xr-x 1 sgiadm users 338 2011-06-13 16:01 test.sh


      It looks like the problem happens when the data is still in the client
      cache, and does not happen when using direct I/O.
      Also, I noticed that the stat command reports 0 blocks for about 30 seconds
      after writing a file.


      sgiadm@recca01:/work/sgi> dd if=/dev/zero of=file0 bs=1024k count=1; stat file0; sleep 60; stat file0
      1+0 records in
      1+0 records out
      1048576 bytes (1.0 MB) copied, 0.00106648 s, 983 MB/s
      File: `file0'
      Size: 1048576 Blocks: 0 IO Block: 2097152 regular file
      Device: 2c54f966h/743766374d Inode: 5801177 Links: 1
      Access: (0644/-rw-r--r--) Uid: ( 501/ sgiadm) Gid: ( 100/ users)
      Access: 2011-06-13 19:13:27.000000000 +0900
      Modify: 2011-06-13 19:15:06.000000000 +0900
      Change: 2011-06-13 19:15:06.000000000 +0900
      File: `file0'
      Size: 1048576 Blocks: 2048 IO Block: 2097152 regular file
      Device: 2c54f966h/743766374d Inode: 5801177 Links: 1
      Access: (0644/-rw-r--r--) Uid: ( 501/ sgiadm) Gid: ( 100/ users)
      Access: 2011-06-13 19:13:27.000000000 +0900
      Modify: 2011-06-13 19:15:06.000000000 +0900
      Change: 2011-06-13 19:15:06.000000000 +0900


      I guess the problem happens when the file is copied before its data blocks
      are allocated on the OSTs.
      LU-274 has already been reported, which is a file size issue on the MDS.
      This problem, however, is a block usage issue on the OSS. I think the two
      are very similar but might be different problems.

          Activity

            hudson Build Master (Inactive) added a comment -

            Integrated in lustre-master » x86_64,server,el6,inkernel #393
            LU-417 llite: report non-zero blocks on writing client (Revision 1509e524e3c47d3bb239ff2a8764cff55eb29d4c)

            Result = SUCCESS
            Oleg Drokin : 1509e524e3c47d3bb239ff2a8764cff55eb29d4c
            Files :

            • lustre/obdfilter/filter_lvb.c
            • lustre/lclient/glimpse.c
            • lustre/lclient/lcommon_cl.c
            • lustre/include/lclient.h

            hudson Build Master (Inactive) added a comment -

            Integrated in lustre-master » x86_64,server,el5,inkernel #393
            LU-417 llite: report non-zero blocks on writing client (Revision 1509e524e3c47d3bb239ff2a8764cff55eb29d4c)

            Result = SUCCESS
            Oleg Drokin : 1509e524e3c47d3bb239ff2a8764cff55eb29d4c
            Files :

            • lustre/obdfilter/filter_lvb.c
            • lustre/include/lclient.h
            • lustre/lclient/lcommon_cl.c
            • lustre/lclient/glimpse.c

            hudson Build Master (Inactive) added a comment -

            Integrated in lustre-master » i686,client,el6,inkernel #393
            LU-417 llite: report non-zero blocks on writing client (Revision 1509e524e3c47d3bb239ff2a8764cff55eb29d4c)

            Result = SUCCESS
            Oleg Drokin : 1509e524e3c47d3bb239ff2a8764cff55eb29d4c
            Files :

            • lustre/lclient/glimpse.c
            • lustre/include/lclient.h
            • lustre/obdfilter/filter_lvb.c
            • lustre/lclient/lcommon_cl.c

            hudson Build Master (Inactive) added a comment -

            Integrated in lustre-master » x86_64,client,el5,ofa #393
            LU-417 llite: report non-zero blocks on writing client (Revision 1509e524e3c47d3bb239ff2a8764cff55eb29d4c)

            Result = SUCCESS
            Oleg Drokin : 1509e524e3c47d3bb239ff2a8764cff55eb29d4c
            Files :

            • lustre/obdfilter/filter_lvb.c
            • lustre/include/lclient.h
            • lustre/lclient/glimpse.c
            • lustre/lclient/lcommon_cl.c

            hudson Build Master (Inactive) added a comment -

            Integrated in lustre-master » x86_64,server,el5,ofa #393
            LU-417 llite: report non-zero blocks on writing client (Revision 1509e524e3c47d3bb239ff2a8764cff55eb29d4c)

            Result = FAILURE
            Oleg Drokin : 1509e524e3c47d3bb239ff2a8764cff55eb29d4c
            Files :

            • lustre/obdfilter/filter_lvb.c
            • lustre/lclient/glimpse.c
            • lustre/include/lclient.h
            • lustre/lclient/lcommon_cl.c
            bobijam Zhenyu Xu added a comment - patch tracking at http://review.whamcloud.com/1647

            adilger Andreas Dilger added a comment -

            A straightforward fix for this problem is to have the client increment the in-memory i_blocks counter by (PAGE_SIZE >> 9) for each dirty page in memory for that file when ll_getattr_it() is called. While this is not completely accurate for files that are being overwritten, it avoids the definite problem of stat() returning st_blocks=0 for a file with in-memory data that has not yet been written to the OST backing filesystem, which causes "cp" or "tar" to skip the file data because they think the file is completely sparse.

            Other filesystems that do delayed block allocation, such as ext4, xfs, and zfs, all report in-memory allocated blocks for the inode to stat() before they are written to disk. A simple test shows that for ZFS the initial blocks value is inaccurate (but better than zero) and is "fixed" when the file is actually written:

            $ dd if=/dev/zero of=/zmirror/tmp/foo bs=64k count=1; ls -l /zmirror/tmp/foo; sleep 5; ls -l /zmirror/tmp/foo
            1+0 records in
            1+0 records out
            65536 bytes (66 kB) copied, 0.000911937 s, 71.9 MB/s
            1 -rw-r--r-- 1 root root 65536 Nov 1 16:19 /zmirror/tmp/foo
            65 -rw-r--r-- 1 root root 65536 Nov 1 16:19 /zmirror/tmp/foo

            When I tried to fix this problem several years ago by just incrementing the inode->i_blocks count whenever a page was written beyond EOF (to report i_blocks more accurately), it didn't work. If we don't already track the number of dirty pages in the CLIO code, it might be enough to just add a boolean "dirty" to st_blocks so that it is not reported as zero if there are any unwritten pages on the client.
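
The two estimates proposed in this comment can be illustrated with plain shell arithmetic. This is only a sketch of the arithmetic, not the patch that was merged. The function names `estimate_blocks` and `floor_blocks` are hypothetical, and the page size of 4096 bytes is an assumption.

```shell
#!/bin/sh
# Sketch of the two client-side estimates (hypothetical names, not Lustre code).
PAGE_SIZE=4096   # assumed page size in bytes

# Per-page estimate: add (PAGE_SIZE >> 9) 512-byte sectors for each dirty page.
# $1 = blocks reported by the OST, $2 = dirty pages cached on the client
estimate_blocks() {
    echo $(( $1 + $2 * ($PAGE_SIZE >> 9) ))
}

# Coarser fallback: add a 0/1 "dirty" flag so the result is never zero
# while unwritten pages exist.
# $1 = blocks reported by the OST, $2 = dirty page count (non-zero means dirty)
floor_blocks() {
    echo $(( $1 + ($2 != 0) ))
}
```

With these assumptions, 1 MiB of unwritten data is 256 dirty pages, so `estimate_blocks 0 256` gives 2048, the same value that stat reports in the description once the data has reached the OST. `floor_blocks 0 256` gives 1, which is inaccurate but enough to keep the file from looking completely sparse.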
            pjones Peter Jones added a comment -

            Bobijam

            Could you please look into this one?

            Thanks

            Peter

            saka Yui Sakazume (Inactive) added a comment -

            > Can you please re-run your test scenario, using:
            >
            > dd if=/dev/zero of=$SRC/file0 bs=1024k count=100
            > strace -ttv -o /tmp/cp0.strace cp $SRC/file0 $DST/file0
            > dd if=/dev/zero of=$SRC/file1 bs=1024k count=100 oflag=direct
            > strace -ttv -o /tmp/cp1.strace cp $SRC/file1 $DST/file1

            I attached the strace outputs as cp.strace.tgz.

            saka Yui Sakazume (Inactive) added a comment -

            cp calls seek() instead of write() when a file was written with buffered I/O.
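
One way to confirm this observation from the two trace files is to count the calls of each kind. The sketch below assumes the line layout that `strace -tt -o FILE` normally writes (a timestamp, a space, then the call). The helper name `count_syscall` is hypothetical, and the exact name of the seek call in the traces (for example `lseek`) is an assumption, since the trace contents are not reproduced in this ticket.

```shell
#!/bin/sh
# Sketch only: count how often one system call appears in an strace log
# written with "strace -tt -o FILE". The helper name is hypothetical.
# $1 = system call name, $2 = strace output file
count_syscall() {
    # grep -c prints 0 and exits non-zero when nothing matches; keep the 0.
    grep -c -E "^[0-9:.]+ +$1\(" "$2" || true
}
```

For example, `count_syscall lseek /tmp/cp0.strace` and `count_syscall write /tmp/cp0.strace` can be compared with the same counts for `/tmp/cp1.strace`.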

            adilger Andreas Dilger added a comment -

            Can you please re-run your test scenario, using:

            dd if=/dev/zero of=$SRC/file0 bs=1024k count=100
            strace -ttv -o /tmp/cp0.strace cp $SRC/file0 $DST/file0
            dd if=/dev/zero of=$SRC/file1 bs=1024k count=100 oflag=direct
            strace -ttv -o /tmp/cp1.strace cp $SRC/file1 $DST/file1

            and attach cp0.strace and cp1.strace here.

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: saka Yui Sakazume (Inactive)
              Votes: 0
              Watchers: 6