Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.1
    • Affects Version/s: Lustre 2.4.0
    • Environment: 2.4.0-19chaos clients and servers
    • Severity: 3
    • 12564

    Description

      We have received reports of Lustre clients incorrectly reporting files with 0 length; on a second attempt the correct non-zero length is reported. This is reminiscent of LU-274. Before digging too far into this, I notice that the fix for LU-274 does not seem to have survived the conversion from obdfilter to ofd. Do we need a similar fix in ofd_intent_policy()?

      Here is the reproducer reported by our user:

      1. On cslic3 using lscratch3, I created 20 files of size 1024 bytes.
      2. Waited 27 minutes (with a wait of around 15 minutes I didn't see incorrect zero sizes).
      3. On cslic8, listed the lscratch3 directory from step 1; I saw 3 files with size 0.
      4. On cslic8, listed the lscratch3 directory again (immediately following step 3) and all files were listed as 1024 bytes. (A scripted form of these steps is sketched below.)
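
      A minimal shell sketch of the four steps above, assuming passwordless ssh from cslic3 to cslic8; the directory path is only a placeholder for a location on the lscratch3 mount visible to both nodes:

          # Run on cslic3; DIR is a hypothetical directory on lscratch3.
          DIR=/p/lscratch3/$USER/lu4597-test
          mkdir -p $DIR
          for i in $(seq 1 20); do dd if=/dev/urandom of=$DIR/file$i bs=1024 count=1; done
          sleep $((27 * 60))        # ~15 minutes was reportedly not long enough
          ssh cslic8 ls -l $DIR     # first listing may show some files with size 0
          ssh cslic8 ls -l $DIR     # a second listing shows the correct 1024 bytes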

      Attachments

        Activity

          [LU-4597] inconsistent file size
          pjones Peter Jones added a comment -

          Landed for 2.5.1 and 2.6

          jamesanunez James Nunez (Inactive) added a comment -

          Patch for b2_5 at http://review.whamcloud.com/#/c/9328/

          nedbass Ned Bass (Inactive) added a comment -

          We set statahead_agl=0 on all our clients to work around this issue until the patch can be deployed. This seemed to work; however, I just learned that the 'size changed from 0' error was reported for 9 files during a run of the "htar" archival storage utility. So there may be another (less frequent) bug that can cause this behavior.
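
          A hedged example of applying and checking the workaround on a single client; note that a plain set_param does not persist across a remount, so it needs to be reapplied or scripted on each client:

          lctl set_param llite.*.statahead_agl=0    # disable AGL statahead on all mounts
          lctl get_param llite.*.statahead_agl      # confirm it now reads back as 0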

          niu Niu Yawei (Inactive) added a comment -

          Patch for master: http://review.whamcloud.com/9249

          nedbass Ned Bass (Inactive) added a comment -

          I confirmed that disabling statahead_agl seems to prevent the bug here as well:

          lctl set_param llite.*.statahead_agl=0
          

          niu Niu Yawei (Inactive) added a comment -

          I can reproduce it with two mounts, and it looks like a race between AGL and the normal getattr (it can no longer be reproduced once AGL is turned off). I will look into it further.
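
          A hedged sketch of the two-mount setup described above; the MGS NID, filesystem name, and mount points are placeholders:

          # Mount the same filesystem twice on one node.
          mkdir -p /mnt/l1 /mnt/l2
          mount -t lustre mgsnode@o2ib:/lscratch3 /mnt/l1
          mount -t lustre mgsnode@o2ib:/lscratch3 /mnt/l2
          # Create small files through the first mount, then list them through the second.
          for i in $(seq 0 19); do dd if=/dev/urandom of=/mnt/l1/file$i bs=1k count=1; done
          ls -l /mnt/l2    # with AGL enabled, some files may be listed with size 0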


          nedbass Ned Bass (Inactive) added a comment -

          FWIW, having full debugging enabled on the servers seems to make this bug much easier to reproduce.


          nedbass Ned Bass (Inactive) added a comment -

          In case it helps interpret the debug logs, here are the NIDs of the nodes involved.

          sierra654: 192.168.114.155@o2ib5     # created files
          sierra330: 192.168.113.81@o2ib5      # got zero length for file4
          porter44: 172.19.1.213@o2ib100       # OSS owning object for file4
          porter-mds1: 172.19.1.165@o2ib100    # MDS
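
          If it helps when scanning the dumped logs, a hedged example of filtering the debug text for lines that mention the client that saw the zero length and the OSS owning the object for file4 (the log file names are placeholders):

          grep -E '192\.168\.113\.81@o2ib5|172\.19\.1\.213@o2ib100' \
              sierra330-debug.log porter44-debug.log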
          

          nedbass Ned Bass (Inactive) added a comment -

          I managed to reproduce the bug on an unclassified system and get debug logs from the clients, MDS, and OSS. I uploaded them in a tarball to ftp.whamcloud.com. Email me privately if you need the file name. There is a README in the tarball with a few notes of relevance.


          nedbass Ned Bass (Inactive) added a comment -

          We found a pretty reliable reproducer for this bug. Unfortunately it only works on one of our classified filesystems, so I can't send debug logs. The server is running 2.4.0-24chaos (see https://github.com/chaos/lustre) with ZFS and the clients are 2.4.0-19chaos. The reproducer is pretty simple.

          # Create files locally then list them on remote node e8.
          for ((i=0;i<20;i++)) ; do dd if=/dev/urandom of=file$i bs=1k count=1 ; done ; rsh e8 ls -l `pwd`
          

          Create several 1k files, then immediately list them on another node. Some of the files are listed with 0 length, then show the correct lengths when listed again. Usually between 0 and 3 files are affected, but which files, and how many, varies between attempts.
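
          A hedged one-liner to count how many files the remote node sees with size 0 on the first listing (field 5 of ls -l is the size; e8 is the remote node from the reproducer above):

          rsh e8 ls -l `pwd` | awk '$NF ~ /^file/ && $5 == 0 { n++ } END { print n+0, "files listed with size 0" }'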

          I captured debug logs from the client and MDS for one successful attempt. The client logs had +dlmtrace +rpctrace enabled and the MDS log had -1 (full debugging). The bug wouldn't reproduce with -1 debugging on the clients, and I haven't been able to find the cause in the logs yet. Please let me know if you have any tips on how to debug this. Meanwhile I'll keep trying to reproduce it on an unclassified system so we can send debug logs.
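
          For reference, a hedged sketch of setting those debug masks and dumping the logs with standard lctl commands; the output file name is only a placeholder:

          # On the clients: add DLM and RPC tracing to the existing debug mask.
          lctl set_param debug=+dlmtrace
          lctl set_param debug=+rpctrace
          # On the MDS: enable all debug flags (-1).
          lctl set_param debug=-1
          # After running the reproducer, dump and clear the kernel debug log on each node.
          lctl dk /tmp/lustre-debug.$(hostname).log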


          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: nedbass Ned Bass (Inactive)
            Votes: 0
            Watchers: 10
