[LU-417] block usage is reported as zero by stat call for tens of seconds after creating a file
Created: 15/Jun/11  Updated: 17/Jan/13  Resolved: 31/Dec/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0, Lustre 2.2.0, Lustre 1.8.6
Fix Version/s: Lustre 2.2.0, Lustre 2.1.1

Type: Bug
Priority: Major
Reporter: Yui Sakazume
Assignee: Zhenyu Xu
Resolution: Fixed
Votes: 0
Labels: None
Environment:
  • Lustre version
    1.8.5 release from Oracle
  • MDS, OSS
    CentOS 5.5
    kernel: 2.6.18-194.17.1.el5_lustre.1.8.5
  • Client
    SLES11SP1
    kernel: 2.6.32.19-0.3-default

Attachments: File cp.strace.tgz    
Issue Links:
Related
is related to LU-682 optimization for Lustre-tar on comple... Closed
is related to LU-2580 cp with FIEMAP support creates comple... Resolved
Severity: 3
Epic: metadata
Rank (Obsolete): 4797

 Description   

If a file is written on a Lustre filesystem and immediately copied to a local (xfs)
filesystem, the copy becomes a sparse file.

For example:


sgiadm@recca01:~> df /work /data
Filesystem 1K-blocks Used Available Use% Mounted on
10.0.1.2@o2ib:/lustre
38446862208 25530740868 10963120932 70% /work
/dev/lxvm/IS5000-File-1
123036116992 41805493792 81230623200 34% /data

sgiadm@recca01:/data/sgi> cat test.sh
#!/bin/sh
SRC=/work/sgi
DST=/data/sgi

rm $SRC/file* $DST/file*

dd if=/dev/zero of=$SRC/file0 bs=1024k count=100
cp $SRC/file0 $DST/file0
dd if=/dev/zero of=$SRC/file1 bs=1024k count=100 oflag=direct
cp $SRC/file1 $DST/file1
sync
wait

ls -sl $SRC
ls -sl $DST
sgiadm@recca01:/data/sgi> ./test.sh
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.282088 s, 372 MB/s
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 1.13752 s, 92.2 MB/s
total 204804
102404 -rw-r--r-- 1 sgiadm users 104857600 2011-06-13 16:02 file0
102404 -rw-r--r-- 1 sgiadm users 104857600 2011-06-13 16:02 file1
total 102404
0 -rw-r--r-- 1 sgiadm users 104857600 2011-06-13 16:02 file0
102400 -rw-r--r-- 1 sgiadm users 104857600 2011-06-13 16:02 file1
4 -rwxr-xr-x 1 sgiadm users 338 2011-06-13 16:01 test.sh


In the above case, file0 was copied as a sparse file.

One minute later, the problem no longer occurs.


sgiadm@recca01:~> cp /work/sgi/file0 /data/sgi/file0-2
sgiadm@recca01:~> ls -sl /data/sgi
total 204804
0 -rw-r--r-- 1 sgiadm users 104857600 2011-06-13 16:02 file0
102400 -rw-r--r-- 1 sgiadm users 104857600 2011-06-13 16:51 file0-2
102400 -rw-r--r-- 1 sgiadm users 104857600 2011-06-13 16:02 file1
4 -rwxr-xr-x 1 sgiadm users 338 2011-06-13 16:01 test.sh


It looks like the problem happens when the data is still in the client cache, and does not
happen when direct I/O is used.
Also, I noticed that the stat command reports 0 blocks for about 30 seconds after
a file is written.


sgiadm@recca01:/work/sgi> dd if=/dev/zero of=file0 bs=1024k count=1; stat file0; sleep 60; stat file0
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00106648 s, 983 MB/s
File: `file0'
Size: 1048576 Blocks: 0 IO Block: 2097152 regular file
Device: 2c54f966h/743766374d Inode: 5801177 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 501/ sgiadm) Gid: ( 100/ users)
Access: 2011-06-13 19:13:27.000000000 +0900
Modify: 2011-06-13 19:15:06.000000000 +0900
Change: 2011-06-13 19:15:06.000000000 +0900
File: `file0'
Size: 1048576 Blocks: 2048 IO Block: 2097152 regular file
Device: 2c54f966h/743766374d Inode: 5801177 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 501/ sgiadm) Gid: ( 100/ users)
Access: 2011-06-13 19:13:27.000000000 +0900
Modify: 2011-06-13 19:15:06.000000000 +0900
Change: 2011-06-13 19:15:06.000000000 +0900


I guess the problem happens when the file is copied before its data blocks are
allocated on the OSTs.
LU-274 has already reported a similar problem, but that is a file size issue on the MDS,
whereas this problem is a block usage issue on the OSS. I think they are very
similar but might be different problems.



 Comments   
Comment by Andreas Dilger [ 15/Jun/11 ]

There are really two separate problems being described here.

One bug is likely with a broken version of "cp" using the FIEMAP ioctl to determine which parts of the file to copy. Can you please report which distro you are using on the client, and which version of coreutils is installed?

We likely need to make a workaround in the FIEMAP code to handle this safely. The easiest way to handle it would be to have the client-side FIEMAP code always call filemap_fdatawrite_range() and filemap_fdatawait_range() on the region that is being requested by FIEMAP, because we do not support returning "DELALLOC" extents that are only in memory. This will flush all unwritten data from that client's cache. The client should also get a DLM read lock for the requested region in order to flush any unwritten data from other clients. It doesn't have to hold this lock for the duration of the FIEMAP call, just enough to flush the cache, since any returned data would be stale by the time it gets to userspace anyway. I suspect that returning DELALLOC extents would be much more complex to implement.

That the client reports 0 blocks for stat() is related to the same issue (namely that the client is writeback caching data in RAM before it is sent to the OST). I think it is reasonable to have the client locally increment the blocks count of the file as data is being written to it (perhaps at the OSC level), and then drop this local estimate when the dirty pages are written to the OST and it receives updated glimpse data.
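
The local-estimate idea in the paragraph above can be modelled in a few lines. This is only a sketch of the bookkeeping, not Lustre code: the class `LocalBlockEstimate` and its method names are hypothetical, and it assumes the blocks count is kept in 512-byte units.

```python
class LocalBlockEstimate:
    """Hypothetical per-file blocks counter kept on the writing client."""

    def __init__(self):
        self.server_blocks = 0  # last blocks value received from the OST
        self.local_blocks = 0   # estimate for data still cached on the client

    def on_write(self, length):
        # Assumption: blocks are 512-byte units, rounded up per write.
        self.local_blocks += (length + 511) // 512

    def on_glimpse(self, server_blocks):
        # Dirty pages reached the OST: trust its value, drop the estimate.
        self.server_blocks = server_blocks
        self.local_blocks = 0

    def st_blocks(self):
        return self.server_blocks + self.local_blocks


est = LocalBlockEstimate()
est.on_write(1048576)     # 1 MiB buffered write, nothing on the OST yet
print(est.st_blocks())    # non-zero while the data is only in cache
est.on_glimpse(2048)      # writeback finished, OST reports real usage
print(est.st_blocks())
```

Overwrites of already-allocated ranges would be counted twice by this model until the next glimpse, which is acceptable if the only goal is to avoid reporting zero.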

Comment by Andreas Dilger [ 15/Jun/11 ]

I suspect this affects 2.1 also.

Comment by Andreas Dilger [ 15/Jun/11 ]

On further thought, the "blocks = 0" issue is probably helping to trigger this problem with new "cp". I now recall that the workaround in "cp" is to only depend on FIEMAP data if (blocks < size / blocksize), otherwise there is no chance that the file is sparse and the whole file should be copied.

For non-sparse files it is more efficient to skip the FIEMAP call (which depends on another server round-trip to get the layout), so it is better to locally estimate the i_blocks value based on written data than to flush the data to disk. Getting the DLM read lock on the FIEMAP range will serve to flush cached data on other clients and ensure that the FIEMAP result is correct, but in more common write/cp operations on a single node it makes sense to avoid flushing the client cache unnecessarily.
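
The check described above, (blocks < size / blocksize), can be sketched as follows. This is an illustration rather than the coreutils source: the function name `may_be_sparse` is hypothetical, and the block unit is assumed to be 512 bytes, as used for st_blocks on Linux.

```python
def may_be_sparse(st_size: int, st_blocks: int, block_unit: int = 512) -> bool:
    """Return True when the allocated blocks cannot cover the whole file.

    Assumption: st_blocks is expressed in block_unit-byte units.
    """
    return st_blocks < st_size // block_unit


# Values from the description: a 100 MiB file shows 0 blocks right after a
# buffered write, and 102404 KiB ("ls -s"), i.e. 204808 blocks, once flushed.
print(may_be_sparse(104857600, 0))       # True: cp would consult FIEMAP
print(may_be_sparse(104857600, 204808))  # False: cp copies the whole file
```

With st_blocks reported as 0, every freshly written file passes the "may be sparse" test, which is why the zero-blocks problem pushes "cp" onto the FIEMAP path.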

Comment by Oleg Drokin [ 15/Jun/11 ]

I guess that would make the FIEMAP call much more expensive as a result, but on the other hand it seems we don't have much choice if we want it to work completely. We cannot even implement "DELALLOC" mode fully unless the data is stored locally, because we have no way of knowing whether other clients hold dirty data.
I think the DELALLOC case for locally stored data is not going to be all that hard, though?

Comment by Yui Sakazume [ 16/Jun/11 ]

> Can you please report which distro you are using on the client, and what version of coreutils is installed?

The client distribution is SLES11SP1, and coreutils is version 6.12-32.17.

Comment by Andreas Dilger [ 16/Jun/11 ]

Can you please re-run your test scenario, using:

dd if=/dev/zero of=$SRC/file0 bs=1024k count=100
strace -ttv -o /tmp/cp0.strace cp $SRC/file0 $DST/file0
dd if=/dev/zero of=$SRC/file1 bs=1024k count=100 oflag=direct
strace -ttv -o /tmp/cp1.strace cp $SRC/file1 $DST/file1

and attach cp0.strace and cp1.strace here.

Comment by Yui Sakazume [ 17/Jun/11 ]

cp calls seek() instead of write() when a file was written with buffered i/o.

Comment by Yui Sakazume [ 17/Jun/11 ]

> Can you please re-run your test scenario, using:
>
> dd if=/dev/zero of=$SRC/file0 bs=1024k count=100
> strace -ttv -o /tmp/cp0.strace cp $SRC/file0 $DST/file0
> dd if=/dev/zero of=$SRC/file1 bs=1024k count=100 oflag=direct
> strace -ttv -o /tmp/cp1.strace cp $SRC/file1 $DST/file1

I attached strace outputs as cp.strace.tgz.

Comment by Peter Jones [ 28/Sep/11 ]

Bobijam

Could you please look into this one?

Thanks

Peter

Comment by Andreas Dilger [ 01/Nov/11 ]

A straightforward fix for this problem is to have the client increment the in-memory i_blocks counter by (PAGE_SIZE >> 9) for each dirty page in memory for that file when ll_getattr_it() is called. While this is not completely accurate for files that are being overwritten, it avoids the definite problem of stat() returning st_blocks=0 for a file with in-memory data that has not yet been written to the OST backing filesystem, and causing "cp" or "tar" to skip the file because it thinks it is completely sparse.
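
The arithmetic of that estimate can be sketched as below. This only models the calculation, not the actual patch: the function name `estimated_blocks` and the default 4096-byte page size are assumptions made for illustration.

```python
SECTOR_SHIFT = 9  # st_blocks is counted in 512-byte (1 << 9) units


def estimated_blocks(server_blocks: int, dirty_pages: int,
                     page_size: int = 4096) -> int:
    """Add (page_size >> 9) blocks for every dirty page cached on the client."""
    return server_blocks + dirty_pages * (page_size >> SECTOR_SHIFT)


# A 1 MiB buffered write leaves 256 dirty 4 KiB pages and no OST blocks yet.
print(estimated_blocks(0, 256))     # 2048
# Overwriting the same 1 MiB after it was flushed over-reports until writeback.
print(estimated_blocks(2048, 256))  # 4096
```

The first result matches the "Blocks: 2048" that stat eventually reports in the description; the second shows the overwrite inaccuracy mentioned above.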

Other filesystems such as ext4, xfs, zfs that do delayed block allocation all report in-memory allocated blocks for the inode to stat() before they are written to disk. A simple test shows for ZFS that the initial blocks value is inaccurate (but better than zero) and is "fixed" when the file is actually written:

$ dd if=/dev/zero of=/zmirror/tmp/foo bs=64k count=1; ls -l /zmirror/tmp/foo; sleep 5; ls -l /zmirror/tmp/foo
1+0 records in
1+0 records out
65536 bytes (66 kB) copied, 0.000911937 s, 71.9 MB/s
1 -rw-r--r-- 1 root root 65536 Nov 1 16:19 /zmirror/tmp/foo
65 -rw-r--r-- 1 root root 65536 Nov 1 16:19 /zmirror/tmp/foo

When I tried to fix this problem several years ago by just incrementing the inode->i_blocks count when any page was written beyond EOF (to more accurately try to report i_blocks), it didn't work. If we don't already track the number of dirty pages in the CLIO code, it might be enough to just add in a boolean "dirty" to st_blocks so that it is not reported as zero if there are any unwritten pages on the client.
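
The boolean fallback needs no page accounting at all. A minimal sketch, assuming a hypothetical helper `blocks_with_dirty_flag` that is told only whether the client holds any unwritten pages for the file:

```python
def blocks_with_dirty_flag(server_blocks: int, has_dirty_pages: bool) -> int:
    """Add 1 to the reported blocks when any unwritten page exists."""
    return server_blocks + (1 if has_dirty_pages else 0)


print(blocks_with_dirty_flag(0, True))   # 1: no longer looks fully sparse
print(blocks_with_dirty_flag(0, False))  # 0: nothing cached, nothing allocated
```

The reported value is far from accurate, but it is enough to stop a tool from treating a zero-block file as completely sparse and skipping its data.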

Comment by Zhenyu Xu [ 04/Nov/11 ]

patch tracking at http://review.whamcloud.com/1647

Comment by Build Master (Inactive) [ 31/Dec/11 ]

Integrated in lustre-master, build #393
LU-417 llite: report non-zero blocks on writing client (Revision 1509e524e3c47d3bb239ff2a8764cff55eb29d4c)

Result by build configuration:

  • x86_64,server,el5,ofa: FAILURE
  • x86_64,client,el5,ofa: SUCCESS
  • i686,client,el6,inkernel: SUCCESS
  • x86_64,server,el5,inkernel: SUCCESS
  • x86_64,server,el6,inkernel: SUCCESS
  • x86_64,client,sles11,inkernel: SUCCESS
  • x86_64,client,el6,inkernel: SUCCESS
  • x86_64,client,ubuntu1004,inkernel: SUCCESS
  • x86_64,client,el5,inkernel: SUCCESS
  • i686,server,el5,ofa: SUCCESS
  • i686,server,el5,inkernel: SUCCESS
  • i686,server,el6,inkernel: SUCCESS
  • i686,client,el5,inkernel: SUCCESS
  • i686,client,el5,ofa: SUCCESS

Oleg Drokin : 1509e524e3c47d3bb239ff2a8764cff55eb29d4c
Files :

  • lustre/obdfilter/filter_lvb.c
  • lustre/lclient/glimpse.c
  • lustre/include/lclient.h
  • lustre/lclient/lcommon_cl.c

Comment by Peter Jones [ 31/Dec/11 ]

Landed for 2.2

Generated at Sat Feb 10 01:06:49 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.