[LU-4597] inconsistent file size Created: 06/Feb/14 Updated: 14/Apr/14 Resolved: 26/Feb/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.1 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Ned Bass | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | mn4 | ||
| Environment: |
2.4.0-19chaos clients and servers |
||
| Severity: | 3 |
| Rank (Obsolete): | 12564 |
| Description |
|
We have received reports of Lustre clients incorrectly reporting files as 0 length; on a second attempt, the correct non-zero length is reported. This is reminiscent of Here is the reproducer reported by our user:
|
| Comments |
| Comment by Ned Bass [ 06/Feb/14 ] |
|
Oh, I see |
| Comment by Peter Jones [ 06/Feb/14 ] |
|
Niu, could you please help Prakash with any questions he has about this work? Thanks, Peter |
| Comment by Niu Yawei (Inactive) [ 08/Feb/14 ] |
|
The fix of Ned, is it seen only on ZFS, or is it a common issue? Can it be easily reproduced with the procedure you provided? Thank you. |
| Comment by Ned Bass [ 08/Feb/14 ] |
|
I believe we've seen it on both ZFS and ldiskfs, but I'll verify that next week. It's not easily reproducible. I've been running the procedure in a loop for over 24 hours and haven't reproduced it yet. |
| Comment by Ned Bass [ 11/Feb/14 ] |
|
Niu, our archival storage servers log an error message during file transfer if a file size changes after an initial scan, so we can use that as evidence of this bug. The logs show a sharp increase in file sizes changing from 0 after we updated our Lustre servers from 2.1 to 2.4.0-19chaos. This has affected all of our filesystems running both ZFS and ldiskfs. It has been observed from both 2.1 and 2.4 clients. |
| Comment by Ned Bass [ 12/Feb/14 ] |
|
We found a fairly reliable reproducer for this bug. Unfortunately it only works on one of our classified filesystems, so I can't send debug logs. The server is running 2.4.0-24chaos (see https://github.com/chaos/lustre) with ZFS, and the clients are running 2.4.0-19chaos. The reproducer is simple:
    # Create files locally then list them on remote node e8.
    for ((i=0;i<20;i++)) ; do dd if=/dev/urandom of=file$i bs=1k count=1 ; done ; rsh e8 ls -l `pwd`
Create several 1k files, then immediately list them on another node. Some of the files are listed with 0 length, then show the correct lengths if listed again. Usually between 0 and 3 files are affected, but which files and how many varies between attempts.
I captured debug logs from the client and MDS for one successful attempt. The client logs had +dlmtrace +rpctrace and the MDS log had -1. The bug wouldn't reproduce with -1 debugging on the clients. However, I haven't yet been able to find the bug in the logs. Please let me know if you have any tips on how to debug this. Meanwhile, I'll keep trying to reproduce this on an unclassified system so we can send debug logs. |
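To make the "0 length, then correct length" symptom easier to spot across many attempts, the two listings can be compared mechanically. The following is only a sketch: the helper name `size_flips` and the "size name" input format are assumptions made for illustration (they are not part of the reproducer above), and file names are assumed to contain no whitespace.

```shell
# Print the names of files whose size was 0 in the first listing
# and non-zero in the second listing.
# Each input file holds one "size name" pair per line, which could be
# captured with something like: stat -c '%s %n' file* > first.txt
size_flips() {
    awk 'FILENAME == ARGV[1] { first[$2] = $1; next }
         ($2 in first) && first[$2] + 0 == 0 && $1 + 0 > 0 { print $2 }' "$1" "$2"
}
```

Running `size_flips first.txt second.txt` after capturing two listings on the remote node would print only the affected file names.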
| Comment by Ned Bass [ 12/Feb/14 ] |
|
I managed to reproduce the bug on an unclassified system and get debug logs from the clients, MDS, and OSS. I uploaded them in a tarball to ftp.whamcloud.com. Email me privately if you need the file name. There is a README in the tarball with a few notes of relevance. |
| Comment by Ned Bass [ 12/Feb/14 ] |
|
In case it helps interpret the debug logs, here are the NIDs of the nodes involved.
sierra654: 192.168.114.155@o2ib5 # created files
sierra330: 192.168.113.81@o2ib5 # got zero length for file4
porter44: 172.19.1.213@o2ib100 # OSS owning object for file4
porter-mds1: 172.19.1.165@o2ib100 # MDS |
| Comment by Ned Bass [ 12/Feb/14 ] |
|
FWIW, having full debugging enabled on the servers seems to make this bug much easier to reproduce. |
| Comment by Niu Yawei (Inactive) [ 12/Feb/14 ] |
|
I can reproduce it with two mounts. It looks like a race between AGL and a normal getattr, because it can no longer be reproduced when AGL is turned off. I will look into it further. |
| Comment by Ned Bass [ 12/Feb/14 ] |
|
I confirmed that disabling statahead_agl seems to prevent the bug here as well.
    lctl set_param llite.*.statahead_agl=0 |
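For anyone applying the same workaround on several clients, a small wrapper can set the parameter and then read it back to confirm it took effect. This is a sketch only: the function name `disable_agl` is made up for illustration, and it assumes `lctl` is on the PATH and is run with sufficient privileges.

```shell
# Disable statahead AGL on every mounted Lustre client filesystem,
# then read the value back to confirm that each mount reports 0.
disable_agl() {
    lctl set_param 'llite.*.statahead_agl=0' >/dev/null || return 1
    for v in $(lctl get_param -n 'llite.*.statahead_agl'); do
        if [ "$v" != "0" ]; then
            echo "statahead_agl is still $v" >&2
            return 1
        fi
    done
    echo "statahead_agl disabled"
}
```

The pattern is quoted so that the shell passes the wildcard to `lctl` unchanged instead of trying to expand it against the current directory.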
| Comment by Niu Yawei (Inactive) [ 13/Feb/14 ] |
|
patch for master: http://review.whamcloud.com/9249 |
| Comment by Ned Bass [ 19/Feb/14 ] |
|
We set statahead_agl=0 on all our clients to work around this issue until the patch can be deployed. This seemed to work. However, I just learned that the 'size changed from 0' error was reported for 9 files during a run of the "htar" archival storage utility, so there may be another (less frequent) bug that can cause this behavior. |
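To track whether the error still occurs after the workaround, the archival logs can be counted for the quoted message. This is a rough sketch: the helper name `count_zero_size_changes` is hypothetical, and it assumes the log is a plain text file in which each occurrence contains the literal text 'size changed from 0'; the real log format is not shown in this ticket.

```shell
# Print how many lines of a log file contain the text "size changed from 0".
# grep exits with status 1 when nothing matches; that case still prints 0
# and is treated as success. Any other failure returns 2.
count_zero_size_changes() {
    n=$(grep -c -F 'size changed from 0' "$1") || [ $? -eq 1 ] || return 2
    echo "$n"
}
```

Comparing the count for logs written before and after the workaround was applied would give a rough idea of how often the remaining case occurs.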
| Comment by James Nunez (Inactive) [ 20/Feb/14 ] |
|
Patch for b2_5 at http://review.whamcloud.com/#/c/9328/ |
| Comment by Peter Jones [ 26/Feb/14 ] |
|
Landed for 2.5.1 and 2.6 |