[LU-13285] multiple DF returns bad info Created: 21/Feb/20  Updated: 06/Apr/20  Resolved: 06/Apr/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Kevin Konzem Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None
Environment:

CentOS7.7.1908


Attachments: File df.out     File df2.out     Text File dftest.txt    
Issue Links:
Related
is related to LU-13296 statfs isn't work properly with MDT s... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hey all,
I've been struggling with a problem on our newly upgraded Lustre 2.12 cluster, and I don't really know if it's a bug, a configuration problem, or something else.
So here's the setup: I recently set up a small two-OST, single-MDT 2.10 cluster to emulate our production cluster and to test the process of upgrading to 2.12.4. The upgrade went fine; however, there is a problem with how df reports space on the Lustre filesystem, and it is causing problems for our processing software. The software runs a df check to make sure the filesystem isn't too full before beginning a job. The problem is that when multiple df commands are run against the Lustre filesystem from the same client, the command will occasionally return 0 in the Available field, which in turn makes the software think the filesystem is full and drop jobs. I can reproduce this by running 'while [ true ];do /bin/df -TP /performance;done' in two sessions on the same client. As soon as I start the second while loop, the output goes from:
Filesystem                 Type   1024-blocks   Used Available Capacity Mounted on
192.168.0.181@tcp:/perform lustre    71467728 100416  67664944       1% /performance
 
to:
Filesystem                 Type   1024-blocks  Used Available Capacity Mounted on
192.168.0.181@tcp:/perform lustre           0    -0        -0      50% /performance
I am using Lustre 2.12.4 on the client as well, so I've ruled out version-mismatch issues at least.
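For what it's worth, a pre-job capacity check can be made defensive against this kind of bogus output by rejecting a zero or negative Available field outright instead of treating it as a full filesystem. A minimal sketch (the helper name check_df_line is hypothetical, and the field positions assume the df -TP column order shown above):

```shell
#!/bin/sh
# check_df_line takes one data line of `df -TP` output and succeeds only
# when the Available field is a positive integer. Column positions assume
# -T is in effect: 1=fs 2=type 3=blocks 4=used 5=avail 6=capacity 7=mount
check_df_line() {
    avail=$(printf '%s\n' "$1" | awk '{print $5}')
    case "$avail" in
        ''|-*|0)  return 1 ;;   # empty, negative, or zero: treat as bogus
        *[!0-9]*) return 1 ;;   # non-numeric: treat as bogus
    esac
    return 0
}
```

A caller could retry the df a few times before concluding the filesystem is actually full, which would work around the intermittent zeroed result described above.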
 
I've checked all the mount settings between the prod 2.10 cluster and the dev 2.12 cluster, and everything I can find looks the same. The 2.10 prod cluster does not have this problem, and the dev cluster did not have the problem before upgrading from 2.10.
 
I have posted this in the lustre-discuss mailing list and Nathan Dauchy suggested I open a Jira issue so I could upload an strace of the failure.



 Comments   
Comment by Nathan Dauchy (Inactive) [ 24/Feb/20 ]

Kevin,  it looks like you did the strace on the bash process, not on 'df' itself, so the data may not be terribly useful to developers.

I was able to catch a similar problem on our system, and the strace shows that the statfs() call is returning incorrect data.  Here is a "good" and a "bad" run for comparison:

statfs("/mnt/lfs1", {f_type=0xbd00bd0, f_bsize=4096, f_blocks=949817358228, f_bfree=378438906913, f_bavail=368840556468, f_files=3795357040, f_ffree=3500647829, f_fsid={1050737646, 0}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID}) = 0


statfs("/mnt/lfs1", {f_type=0xbd00bd0, f_bsize=0, f_blocks=0, f_bfree=0, f_bavail=0, f_files=18446618905756391232, f_ffree=132349083419, f_fsid={1050737646, 0}, f_namelen=0, f_frsize=0, f_flags=ST_VALID}) = 0
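The zeroed statfs() result can also be probed without strace. Assuming GNU coreutils' stat is available (it is on CentOS 7), a short polling loop flags any call where the kernel reports zero total blocks, mirroring the bad run above; poll_statfs and its defaults are illustrative names, not part of any Lustre tooling:

```shell
#!/bin/sh
# Poll statfs() on a mount point via GNU `stat -f` and count iterations
# where f_blocks comes back as 0, as in the "bad" strace above.
# Usage: poll_statfs <mountpoint> <iterations>; prints the bad-poll count.
poll_statfs() {
    mnt=${1:-/}; n=${2:-20}; bad=0; i=0
    while [ "$i" -lt "$n" ]; do
        blocks=$(stat -f -c '%b' "$mnt")        # f_blocks from statfs()
        if [ "$blocks" -eq 0 ]; then
            bad=$((bad + 1))                    # zeroed statfs result
        fi
        i=$((i + 1))
    done
    echo "$bad"
}
```

Running this in two sessions against the Lustre mount should reproduce the race without involving df's output formatting at all.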

 
The recent change that (to me anyway) seems related is LU-12368. Do you have that in your client build?

Comment by Kevin Konzem [ 26/Feb/20 ]

My bad, sorry about that. I found a better way to strace the actual df command; attached are a good example (df.out) and a bad example (df2.out).

I looked at LU-12368, and it did look promising, so I installed 2.13 on a client to try it out, but the bug remained. Should I try installing 2.13 on the server as well, or is that part of the code only handled by the client?

Also, I tried running 'lfs df' instead of 'df', but got the same result. When run in a loop in two sessions on the same client, it worked fine on 2.10 but failed intermittently on 2.12/2.13.

Comment by Nathan Dauchy (Inactive) [ 26/Feb/20 ]

Another possibly related ticket is LU-13296 (statfs isn't work properly with MDT statfs proxy), which tracks a regression introduced by LU-10018 (MDT as a statfs proxy).

Comment by Cory Spitz [ 27/Feb/20 ]

Yes, I was just popping in here after following your conversation on lustre-discuss. kkonzem, I think you should try your reproducer against the patch in https://review.whamcloud.com/37753. I hope it will work for you. There is a simplified reproducer as part of that patch, too.

Comment by Andreas Dilger [ 06/Apr/20 ]

The patch https://review.whamcloud.com/37753 "LU-13296 obd: make statfs cache working again" was landed to master for 2.14 and backported to b2_12 for 2.12.5.

Generated at Sat Feb 10 02:59:58 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.