[LU-4597] inconsistent file size Created: 06/Feb/14  Updated: 14/Apr/14  Resolved: 26/Feb/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.6.0, Lustre 2.5.1

Type: Bug Priority: Critical
Reporter: Ned Bass Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: mn4
Environment:

2.4.0-19chaos clients and servers


Severity: 3
Rank (Obsolete): 12564

 Description   

We have received reports of Lustre clients incorrectly reporting files as 0 length; on a second attempt the correct non-zero length is reported. This is reminiscent of LU-274. Before digging too far into this, I notice that the fix for LU-274 did not seem to survive the conversion from obdfilter to ofd. Do we need a similar fix in ofd_intent_policy()?

Here is the reproducer reported by our user (a shell sketch of the steps follows the list):

  1. On cslic3, using lscratch3, I created 20 files of size 1024 bytes
  2. Waited 27 minutes (with a wait of around 15 minutes I didn't see the incorrect zero sizes)
  3. On cslic8, listed the lscratch3 directory from step 1 and saw 3 files with size 0
  4. On cslic8, listed the lscratch3 directory again (immediately following step 3) and all files were listed as 1024 bytes
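
A minimal shell sketch of these steps, assuming rsh access between the nodes and a hypothetical test directory under the lscratch3 mount:

# On cslic3: create 20 files of 1024 bytes each.
cd /p/lscratch3/$USER/lu4597-test     # hypothetical directory on lscratch3
for ((i=0;i<20;i++)); do dd if=/dev/urandom of=file$i bs=1024 count=1; done
# Wait roughly 27 minutes; waits of ~15 minutes reportedly did not trigger the bug.
sleep 1620
# On cslic8: list the directory twice; the first listing may show 0-byte files.
rsh cslic8 "ls -l /p/lscratch3/$USER/lu4597-test; ls -l /p/lscratch3/$USER/lu4597-test"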


 Comments   
Comment by Ned Bass [ 06/Feb/14 ]

Oh, I see LU-274 was fixed for b_2.x in this patch.

Comment by Peter Jones [ 06/Feb/14 ]

Niu

Could you please help Prakash with any questions he has about this work?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 08/Feb/14 ]

The fix for LU-274 is in ldlm_cb_interpret() in 2.4, so this looks like a different problem than LU-274.

Ned, is it only seen on ZFS, or is it a common issue? Can it be easily reproduced with the procedure you provided? Thank you.

Comment by Ned Bass [ 08/Feb/14 ]

I believe we've seen it on both ZFS and ldiskfs, but I'll verify that next week. It's not easily reproducible. I've been running the procedure in a loop for over 24 hours and haven't reproduced it yet.

Comment by Ned Bass [ 11/Feb/14 ]

Niu, our archival storage servers log an error message during file transfer if a file's size changes after an initial scan, so we can use that as evidence of this bug. The logs show a sharp increase in files whose reported size changed from 0 after we updated our Lustre servers from 2.1 to 2.4.0-19chaos. This has affected all of our filesystems, running both ZFS and ldiskfs. It has been observed from both 2.1 and 2.4 clients.

Comment by Ned Bass [ 12/Feb/14 ]

We found a pretty reliable reproducer for this bug. Unfortunately it only works on one of our classified filesystems, so I can't send debug logs. The server is running 2.4.0-24chaos (see https://github.com/chaos/lustre) with ZFS and the clients are 2.4.0-19chaos. The reproducer is pretty simple.

# Create files locally then list them on remote node e8.
for ((i=0;i<20;i++)) ; do dd if=/dev/urandom of=file$i bs=1k count=1 ; done ; rsh e8 ls -l `pwd`

This creates several 1k files, then immediately lists them on another node. Some of the files are listed with 0 length, then show the correct lengths if listed again. Usually between 0 and 3 files are affected, but which files and how many varies between attempts.

I captured debug logs from the client and MDS for one successful attempt. The client logs had +dlmtrace +rpctrace and the MDS log had -1. The bug wouldn't reproduce with -1 debugging on the clients. However, I haven't been able to find the bug in the logs yet. Please let me know if you have any tips on how to debug this. Meanwhile I'll keep trying to reproduce it on an unclassified system so we can send debug logs.
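
For reference, the debug masks and dumps described above can be managed with standard lctl commands along these lines (a sketch; the output file path is a placeholder):

# Client nodes: add DLM and RPC tracing to the current debug mask.
lctl set_param debug="+dlmtrace +rpctrace"
# MDS: enable full debugging.
lctl set_param debug=-1
# After a reproduction attempt, dump the kernel debug buffer on each node.
lctl dk /tmp/lu4597-debug.log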

Comment by Ned Bass [ 12/Feb/14 ]

I managed to reproduce the bug on an unclassified system and get debug logs from the clients, MDS, and OSS. I uploaded them in a tarball to ftp.whamcloud.com. Email me privately if you need the file name. There is a README in the tarball with a few notes of relevance.

Comment by Ned Bass [ 12/Feb/14 ]

In case it helps interpret the debug logs, here are the NIDs of the nodes involved.

sierra654: 192.168.114.155@o2ib5     # created files
sierra330: 192.168.113.81@o2ib5      # got zero length for file4
porter44: 172.19.1.213@o2ib100       # OSS owning object for file4
porter-mds1: 172.19.1.165@o2ib100    # MDS

Comment by Ned Bass [ 12/Feb/14 ]

FWIW, having full debugging enabled on the servers seems to make this bug much easier to reproduce.

Comment by Niu Yawei (Inactive) [ 12/Feb/14 ]

I can reproduce it with two mounts, and it looks like a race between AGL and the normal getattr (it can't be reproduced anymore with AGL turned off). I will look into it further.

Comment by Ned Bass [ 12/Feb/14 ]

I confirmed that disabling statahead_agl seems to prevent the bug here as well.

lctl set_param llite.*.statahead_agl=0

Comment by Niu Yawei (Inactive) [ 13/Feb/14 ]

patch for master: http://review.whamcloud.com/9249

Comment by Ned Bass [ 19/Feb/14 ]

We set statahead_agl=0 on all our clients to work around this issue until the patch can be deployed. This seemed to work; however, I just learned that the 'size changed from 0' error was reported for 9 files during a run of the "htar" archival storage utility. So there may be another (less frequent) bug that can cause this behavior.
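
For completeness, the interim workaround amounts to running the earlier set_param on every client, e.g. (a sketch; pdsh and its host selection are assumptions, and the setting does not persist across remounts):

# Disable statahead-triggered async glimpse locks on all clients.
pdsh -a "lctl set_param llite.*.statahead_agl=0"
# Spot-check that the parameter took effect.
pdsh -a "lctl get_param llite.*.statahead_agl"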

Comment by James Nunez (Inactive) [ 20/Feb/14 ]

Patch for b2_5 at http://review.whamcloud.com/#/c/9328/

Comment by Peter Jones [ 26/Feb/14 ]

Landed for 2.5.1 and 2.6
