[LU-14565] Changing recordsize of OST Breaks 'df' (lfs df works correctly) Created: 26/Mar/21  Updated: 13/May/22  Resolved: 19/May/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.6
Fix Version/s: Lustre 2.12.8, Lustre 2.15.0

Type: Bug Priority: Major
Reporter: Arshad Hussain Assignee: Arshad Hussain
Resolution: Fixed Votes: 0
Labels: None
Environment:

This was tested on a single-node client/server setup (CentOS 7.5, ZFS 0.7.13) with Lustre 2.12.6 branch/master. It was also seen on CentOS 7.8 with Lustre 2.12.3 and 2.12.6, both with ZFS 0.7.13.


Issue Links:
Related
is related to LU-15853 /mnt/lustre path is hardcoded in sani... Resolved
Epic/Theme: zfs
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Changing the recordsize from 1M (default) to 32K breaks the 'df' output; 'lfs df', however, works correctly. The 'Size', 'Used' and 'Avail' fields of the 'df' output show wrong values. This is seen immediately after the change. Switching the recordsize back to 1M fixes the issue.

Steps to recreate:

$ df -h 
$ cp <file> /mnt/lustre
$ df -h 
$ zfs set recordsize=32768 gpool/data
$ df -h /* Almost immediately starts showing wrong results, lfs df is good */
$ zfs set recordsize=1048576 gpool/data
$ df -h /* Results are good again */
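
While stepping through the reproduction, it can help to look at the raw statfs numbers rather than the rounded 'df -h' columns. The following is a minimal Python sketch of roughly how 'df' derives its figures from statvfs(); the helper name df_view and the default path "/" are illustrative assumptions, and /mnt/lustre would be substituted on a real test node.

```python
import os


def df_view(path):
    """Return (size, used, avail) in bytes, roughly as 'df' derives them.

    df_view is a hypothetical helper for illustration only.
    """
    st = os.statvfs(path)
    # Block counts are expressed in units of f_frsize; fall back to f_bsize
    # in case a filesystem reports f_frsize as 0.
    unit = st.f_frsize or st.f_bsize
    size = st.f_blocks * unit
    used = (st.f_blocks - st.f_bfree) * unit
    avail = st.f_bavail * unit
    return size, used, avail


# "/" is used so the sketch runs anywhere; use the Lustre mount point
# (for example /mnt/lustre) when reproducing the issue.
size, used, avail = df_view("/")
print(f"size={size} used={used} avail={avail}")
```

Running this before and after each 'zfs set recordsize=...' step should show whether the block counts themselves change while the reported block size stays the same.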

Details

# df -h
Filesystem Size Used Avail Use% Mounted on
...
gpool/metadata 77M 3.0M 72M 5% /mnt/zfsmdt
gpool/data 76M 3.0M 71M 5% /mnt/zfsost
192.168.50.72@tcp:/lustre 76M 3.0M 71M 5% /mnt/lustre
# lfs df -h
UUID bytes Used Available Use% Mounted on
lustre-MDT0000_UUID 76.6M 3.0M 71.6M 5% /mnt/lustre[MDT:0]
lustre-OST0000_UUID 76.0M 3.0M 71.0M 5% /mnt/lustre[OST:0]
filesystem_summary: 76.0M 3.0M 71.0M 5% /mnt/lustre

 

Verify recordsize

# zfs get recordsize gpool/data
NAME PROPERTY VALUE SOURCE
gpool/data recordsize 1M local
# cp configure /mnt/lustre
# ls -ali configure
670300 -rwxr-xr-x 1 root root 1346008 Mar 26 11:49 configure

 

# df -h
Filesystem Size Used Avail Use% Mounted on
...
gpool/metadata 75M 3.0M 70M 5% /mnt/zfsmdt
gpool/data 76M 5.0M 69M 7% /mnt/zfsost
192.168.50.72@tcp:/lustre 76M 5.0M 69M 7% /mnt/lustre
# lfs df -h
UUID bytes Used Available Use% Mounted on
lustre-MDT0000_UUID 74.6M 3.0M 69.6M 5% /mnt/lustre[MDT:0]
lustre-OST0000_UUID 76.0M 5.0M 69.0M 7% /mnt/lustre[OST:0]
filesystem_summary: 76.0M 5.0M 69.0M 7% /mnt/lustre

Change the record size

# zfs set recordsize=32768 gpool/data
# df -h
...
gpool/metadata 75M 3.0M 70M 5% /mnt/zfsmdt
gpool/data 77M 5.1M 70M 7% /mnt/zfsost
192.168.50.72@tcp:/lustre 2.4G 163M 2.2G 7% /mnt/lustre <~~~ Bumps to 2.4GB
# lfs df -h
UUID bytes Used Available Use% Mounted on
lustre-MDT0000_UUID 74.6M 3.0M 69.6M 5% /mnt/lustre[MDT:0]
lustre-OST0000_UUID 76.8M 5.1M 69.7M 7% /mnt/lustre[OST:0]
filesystem_summary: 76.8M 5.1M 69.7M 7% /mnt/lustre
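
The inflated 'df' figures above are consistent with the correct 'lfs df' figures multiplied by 1M / 32K = 32. A short Python check of that arithmetic, using only the values printed above (the variable names are illustrative):

```python
# Ratio between the default recordsize (1M) and the new one (32K).
FACTOR = (1024 * 1024) // (32 * 1024)

# Values from the 'lfs df -h' filesystem_summary line, in MiB.
lfs_df_mib = {"size": 76.8, "used": 5.1, "avail": 69.7}

# Scale each value by the recordsize ratio.
scaled = {name: value * FACTOR for name, value in lfs_df_mib.items()}

print(FACTOR)
print(f"size  ~ {scaled['size'] / 1024:.1f}G")   # compare with df: 2.4G
print(f"used  ~ {scaled['used']:.0f}M")          # compare with df: 163M
print(f"avail ~ {scaled['avail'] / 1024:.1f}G")  # compare with df: 2.2G
```

All three scaled values match the Size, Used and Avail columns that 'df -h' reported for /mnt/lustre after the recordsize change.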

 



 Comments   
Comment by Jeff Johnson [ 27/Mar/21 ]

`lfs df` (before OST recordsize change to 32K from 1M)

lustre-OST0000_UUID         5.3T       18.0M        5.3T   1% /lustre[OST:0]

strace of `df -Th` (before OST recordsize change to 32K from 1M)

stat("/lustre", {st_mode=S_IFDIR|0755, st_size=10752, ...}) = 0
statfs("/lustre", {f_type=0xbd00bd0, f_bsize=4096, f_blocks=1424878848, f_bfree=1424874240, f_bavail=1424873728, f_files=54550890, f_ffree=54550561, f_fsid={val=[743766374, 0]}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID}) = 0
stat("/lustre", {st_mode=S_IFDIR|0755, st_size=10752, ...}) = 0
write(1, "10.0.50.30@tcp:/lustre  lustre  "..., 6410.0.50.30@tcp:/lustre  lustre    5.4T   18M  5.4T   1% /lustre

`lfs df` (after OST recordsize change to 32K from 1M)

lustre-OST0000_UUID         5.3T       17.5M        5.3T   1% /lustre[OST:0]

strace of `df -Th` (after OST recordsize change to 32K from 1M)

stat("/lustre", {st_mode=S_IFDIR|0755, st_size=10752, ...}) = 0
statfs("/lustre", {f_type=0xbd00bd0, f_bsize=4096, f_blocks=45596122112, f_bfree=45595978752, f_bavail=45595962368, f_files=54550890, f_ffree=54550561, f_fsid={val=[743766374, 0]}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID}) = 0
stat("/lustre", {st_mode=S_IFDIR|0755, st_size=10752, ...}) = 0
write(1, "10.0.50.30@tcp:/lustre  lustre  "..., 6410.0.50.30@tcp:/lustre  lustre    170T  560M  170T   1% /lustre
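
The two statfs() results above differ in f_blocks while f_bsize and f_frsize stay at 4096. A small Python sketch, using only the numbers from the strace output, shows that the block count grew by almost exactly the recordsize ratio (the helper name tib is an illustrative assumption):

```python
FRSIZE = 4096  # f_frsize reported in both statfs() calls

blocks_1m = 1424878848     # f_blocks with recordsize=1M
blocks_32k = 45596122112   # f_blocks with recordsize=32K


def tib(blocks, frsize=FRSIZE):
    """Convert a block count to TiB."""
    return blocks * frsize / 2**40


ratio = blocks_32k / blocks_1m
expected = (1 << 20) // (32 << 10)  # 1M / 32K

print(f"before: {tib(blocks_1m):.2f} TiB")
print(f"after:  {tib(blocks_32k):.2f} TiB")
print(f"f_blocks ratio: {ratio:.6f} (recordsize ratio: {expected})")
```

The before value corresponds to the 5.3T/5.4T shown by 'lfs df' and 'df', and the after value to the 170T shown by 'df'.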

 

Comment by Arshad Hussain [ 28/Mar/21 ]

On Linux, frsize (fragment size, the smallest unit) and bsize (block size, the largest unit) are always the same.
Unfortunately, bsize is also used as the optimal block size (recordsize) of
a dataset, so when the recordsize is changed, bsize changes with it.
This leads to a miscalculation in the 'df' output.

statfs() should always be reported using the maximum block size. Note that dmu_objset_space() reports the same values for both the 1MB recordsize and the 32KB recordsize, and they appear to be correct.

I will upload the patch after local testing for review.

My debug run: 1MB recordsize

Mar 28 05:01:55 mrpel7 kernel: LustreError: 8094:0:(osd_handler.c:496:osd_objset_statfs()) max_blksz=1048576
Mar 28 05:01:55 mrpel7 kernel: LustreError: 8094:0:(osd_handler.c:500:osd_objset_statfs()) usedbytes=3261440, availbytes=94025216 shift=20
Mar 28 05:01:55 mrpel7 kernel: LustreError: 8094:0:(osd_handler.c:504:osd_objset_statfs()) os_blocks=92, os_bfree=89, os_bavail=89

My debug run: 32KB recordsize

Mar 28 05:02:30 mrpel7 kernel: LustreError: 8527:0:(osd_handler.c:496:osd_objset_statfs()) max_blksz=32768
Mar 28 05:02:30 mrpel7 kernel: LustreError: 8527:0:(osd_handler.c:500:osd_objset_statfs()) usedbytes=3261440, availbytes=94022144 shift=15
Mar 28 05:02:30 mrpel7 kernel: LustreError: 8527:0:(osd_handler.c:504:osd_objset_statfs()) os_blocks=2968, os_bfree=2869, os_bavail=2869
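
The logged block counts in both debug runs can be reproduced by shifting the byte counts right by the logged shift. The Python sketch below assumes that is how the counts are derived (the formula is inferred from the logged values, not copied from osd_objset_statfs()), and then illustrates what happens if a consumer converts the 32KB-based count back to bytes with a stale shift of 20. That stale-shift step is an assumption, suggested by the merged patch subject "Do not rely on tgd_blockbit".

```python
def osd_blocks(usedbytes, availbytes, shift):
    """Return (os_blocks, os_bfree) for a given block shift.

    Assumed formula, inferred from the debug output above.
    """
    total = (usedbytes + availbytes) >> shift
    free = availbytes >> shift
    return total, free


# Values from the 1MB recordsize debug run (shift=20).
run_1m = osd_blocks(3261440, 94025216, 20)
# Values from the 32KB recordsize debug run (shift=15).
run_32k = osd_blocks(3261440, 94022144, 15)

print(run_1m)
print(run_32k)

# Hypothetical consumer that still assumes 1MB blocks (shift=20):
correct_bytes = run_32k[0] << 15
stale_bytes = run_32k[0] << 20
print(stale_bytes // correct_bytes)  # inflation factor
```

Under these assumptions the inflation factor equals 2^(20-15), which matches the 32x growth seen in the 'df' output.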
Comment by Gerrit Updater [ 29/Mar/21 ]

Arshad Hussain (arshad.hussain@aeoncomputing.com) uploaded a new patch: https://review.whamcloud.com/43154
Subject: LU-14565 osd_zfs: Make 'statfs' always use 1MB defualt size
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: df160677af3e90e9e62865b18825991877bc46f0

Comment by Gerrit Updater [ 19/May/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43154/
Subject: LU-14565 ofd: Do not rely on tgd_blockbit
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8ee6e1c8825c4fabfd6c39db11081839ca53d454

Comment by Peter Jones [ 19/May/21 ]

Landed for 2.15

Comment by Arshad Hussain [ 21/May/21 ]

Hi Peter, Andreas,

This also applies to the b2_12 branch. Please consider it for a b2_12 backport.

Thanks

Arshad

 

 

Comment by Gerrit Updater [ 09/Jun/21 ]

Arshad Hussain (arshad.hussain@aeoncomputing.com) uploaded a new patch: https://review.whamcloud.com/43954
Subject: LU-14565 ofd: Do not rely on tgd_blockbit
Project: fs/lustre-release
Branch: b2_12-next
Current Patch Set: 1
Commit: 865e7def9d5d3e2fdf247483a81c767ab27bb856

Comment by Gerrit Updater [ 09/Jun/21 ]

Arshad Hussain (arshad.hussain@aeoncomputing.com) uploaded a new patch: https://review.whamcloud.com/43955
Subject: LU-14565 ofd: Do not rely on tgd_blockbit
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 4d5ca21aa94c295acb5dd666925d64c62079461d

Comment by Gerrit Updater [ 14/Nov/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/43955/
Subject: LU-14565 ofd: Do not rely on tgd_blockbit
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: f268a03170abe5375959ace774e10b92af5b14b1
