[LU-13489] Does LFSCK check on-disk information Created: 28/Apr/20  Updated: 10/Jun/20  Resolved: 10/Jun/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.8
Fix Version/s: None

Type: Task Priority: Trivial
Reporter: Runzhou Han Assignee: WC Triage
Resolution: Done Votes: 0
Labels: lfsck
Environment:

CentOS 7, ldiskfs


Epic/Theme: lfsck
Business Value: 5
Epic: LFSCK
Rank (Obsolete): 9223372036854775807

 Description   

Hi,

I'm a PhD student working on a Lustre reliability study. My group found that manually destroying the MDS or OSS layout can lead to a resource leak, meaning part of the storage space or namespace becomes unusable by clients. This problem has been discussed in the 'PFault' paper published at ICS '18, where the resource leak is caused by e2fsck changing the OST layout. However, I found several other ways to trigger the same issue; anything that destroys MDT-OST consistency will do. Here is a simple way to reproduce the scenario (a condensed command sketch follows the list):
1. Create a cluster with 1 client, 1 MDS, and 3 OSSes
2. Write some files to Lustre on the client node, and check space usage with 'lfs df -h'
3. On the MDS, unmount the MDT, reformat the MDT's disk partition, and remount it. This step destroys the consistency between the MDT and the OSTs
4. Check the Lustre directory on the client node: the user files are no longer there, but 'lfs df -h' shows that the space is not released
5. Run lfsck and 'lfs df -h' again. However, lfsck does not move the stale objects on the OSSes to '/lost+found', and the storage space leak is still there
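
For reference, here is a condensed command sketch of those steps from my test setup (the hostnames, the /dev/sdb MDT device, the /mdt and /lustrefs mount points, and the MGS NID are specific to my cluster, and the dd write is just a stand-in for my aging scripts):

# 2. On the client: write some data and record usage
dd if=/dev/zero of=/lustrefs/testfile bs=1M count=100
lfs df -h

# 3. On the MDS: unmount, reformat, and remount the MDT (this breaks MDT-OST consistency)
umount /mdt
mkfs.lustre --fsname=lustre --mgsnode=192.168.1.7@tcp0 --mdt --index=0 --reformat /dev/sdb
mount.lustre /dev/sdb /mdt

# 4. On the client: the files are gone, but 'lfs df -h' still shows the space as used
ls /lustrefs/
lfs df -h

# 5. On the MDS: run LFSCK, then re-check usage from the client
lctl lfsck_start -A -t all -o
lfs df -h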

I'm not sure whether this is in the scope of lfsck's functionality, but I know lfsck's namespace phase is said to be able to remove orphan objects. This problem can potentially damage clusters, since on-disk object files can easily be removed by mis-operations, and the resulting leak is not detected by lfsck.
Thanks!

Runzhou Han
Dept. of Electrical & Computer Engineering
Iowa State University



 Comments   
Comment by Andreas Dilger [ 29/Apr/20 ]

Could you please provide more detail about how you are running LFSCK? Are you using something like "lctl lfsck_start -A -t all -o" to link the orphan objects into .../.lustre/lost+found/ so they can be recovered or removed?
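
I.e., something along these lines (a sketch only; substitute your own client mount point for <mountpoint>):

# On the MDS: start a full LFSCK with orphan handling, and monitor its progress
lctl lfsck_start -A -t all -o
lctl lfsck_query

# On a client: re-attached orphan objects should appear here
ls <mountpoint>/.lustre/lost+found/MDT0000/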

Comment by Runzhou Han [ 29/Apr/20 ]

Thank you for the reply!

I didn't use any additional arguments; I simply ran 'lctl lfsck_start'. I then tried 'lctl lfsck_start -A -t all -o', but I think lfsck still did not link the orphan objects into /.lustre/lost+found/, if that means I should be able to see the orphan objects via ls in Lustre's lost+found folder.

I'm posting the detailed operations and logs for reproducing the problem here.

After setting up the cluster, check the brand-new cluster's usage:

[root@client0 pf_pfs_worker]# lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
lustre-MDT0000_UUID        96.0M        1.7M       85.6M   2% /lustrefs[MDT:0]
lustre-OST0000_UUID       413.4M       13.2M      365.3M   3% /lustrefs[OST:0]
lustre-OST0001_UUID       413.4M       13.2M      365.3M   3% /lustrefs[OST:1]
lustre-OST0002_UUID       413.4M       13.2M      365.3M   3% /lustrefs[OST:2]

filesystem_summary:         1.2G       39.6M        1.1G   3% /lustrefs

 
Then I ran some write workloads to age the cluster:

[root@client0 pf_pfs_worker]# ./pfs_worker_cp.sh 
[root@client0 pf_pfs_worker]# ./pfs_worker_age.sh 

 
Check cluster usage again:

[root@client0 pf_pfs_worker]# lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
lustre-MDT0000_UUID        96.0M        1.7M       85.6M   2% /lustrefs[MDT:0]
lustre-OST0000_UUID       413.4M      136.9M      235.9M  37% /lustrefs[OST:0]
lustre-OST0001_UUID       413.4M      121.9M      250.9M  33% /lustrefs[OST:1]
lustre-OST0002_UUID       413.4M      119.5M      253.5M  32% /lustrefs[OST:2]

filesystem_summary:         1.2G      378.3M      740.3M  34% /lustrefs

This shows that about 300+ MB of data has been written to the OSTs.

On the MDS, unmount the MDT, reformat it, and mount it again:

[root@mds /]# umount /mdt
[root@mds /]# mkfs.lustre --fsname=lustre --mgsnode=192.168.1.7@tcp0 --mdt --index=0 --reformat /dev/sdb

Permanent disk data:
Target:     lustre:MDT0000
Index:      0
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x61
              (MDT first_time update )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters: mgsnode=192.168.1.7@tcp
device size = 200MB
formatting backing filesystem ldiskfs on /dev/sdb
target name   lustre:MDT0000
4k blocks     51200
options        -I 1024 -i 2560 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L lustre:MDT0000  -I 1024 -i 2560 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F /dev/sdb 51200
Writing CONFIGS/mountdata
[root@mds /]# mount.lustre /dev/sdb /mdt
mount.lustre: mount /dev/sdb at /mdt failed: Address already in use
The target service's index is already in use. (/dev/sdb)
[root@mds /]# mount.lustre /dev/sdb /mdt

Interestingly, Lustre did not allow the remount on the first attempt, but the second try worked.

Then check the Lustre directory on the client side, and the cluster usage:

[root@client0 pf_pfs_worker]# ls /lustrefs/
[root@client0 pf_pfs_worker]# lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
lustre-MDT0000_UUID        96.0M        1.7M       85.6M   2% /lustrefs[MDT:0]
lustre-OST0000_UUID       413.4M      130.0M      245.5M  35% /lustrefs[OST:0]
lustre-OST0001_UUID       413.4M      127.5M      248.0M  34% /lustrefs[OST:1]
lustre-OST0002_UUID       413.4M      125.2M      250.3M  33% /lustrefs[OST:2]
 
filesystem_summary:         1.2G      382.8M      743.8M  34% /lustrefs

The user data is no longer visible on the client, but the storage space is not released.

Try to fix this inconsistency with lfsck: 

[root@mds /]# lctl lfsck_start -A -t all -o
Started LFSCK on the device lustre-MDT0000: scrub layout namespace
[root@mds /]# lctl lfsck_query
layout_mdts_init: 0
layout_mdts_scanning-phase1: 0
layout_mdts_scanning-phase2: 0
layout_mdts_completed: 1
layout_mdts_failed: 0
layout_mdts_stopped: 0
layout_mdts_paused: 0
layout_mdts_crashed: 0
layout_mdts_partial: 0
layout_mdts_co-failed: 0
layout_mdts_co-stopped: 0
layout_mdts_co-paused: 0
layout_mdts_unknown: 0
layout_osts_init: 0
layout_osts_scanning-phase1: 0
layout_osts_scanning-phase2: 0
layout_osts_completed: 3
layout_osts_failed: 0
layout_osts_stopped: 0
layout_osts_paused: 0
layout_osts_crashed: 0
layout_osts_partial: 0
layout_osts_co-failed: 0
layout_osts_co-stopped: 0
layout_osts_co-paused: 0
layout_osts_unknown: 0
layout_repaired: 285
namespace_mdts_init: 0
namespace_mdts_scanning-phase1: 0
namespace_mdts_scanning-phase2: 0
namespace_mdts_completed: 1
namespace_mdts_failed: 0
namespace_mdts_stopped: 0
namespace_mdts_paused: 0
namespace_mdts_crashed: 0
namespace_mdts_partial: 0
namespace_mdts_co-failed: 0
namespace_mdts_co-stopped: 0
namespace_mdts_co-paused: 0
namespace_mdts_unknown: 0
namespace_osts_init: 0
namespace_osts_scanning-phase1: 0
namespace_osts_scanning-phase2: 0
namespace_osts_completed: 0
namespace_osts_failed: 0
namespace_osts_stopped: 0
namespace_osts_paused: 0
namespace_osts_crashed: 0
namespace_osts_partial: 0
namespace_osts_co-failed: 0
namespace_osts_co-stopped: 0
namespace_osts_co-paused: 0
namespace_osts_unknown: 0
namespace_repaired: 0

It shows that lfsck has repaired 285 objects. 
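
For completeness, the per-phase statistics can also be read directly from the LFSCK status files; a sketch, assuming the mdd.*.lfsck_layout and obdfilter.*.lfsck_layout parameter names from the Lustre manual:

# On the MDS: detailed layout and namespace LFSCK statistics
lctl get_param -n mdd.lustre-MDT0000.lfsck_layout
lctl get_param -n mdd.lustre-MDT0000.lfsck_namespace

# On each OSS: layout LFSCK statistics for the local OSTs
lctl get_param -n obdfilter.lustre-OST*.lfsck_layout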

On the client node, check the cluster usage again:

[root@client0 pf_pfs_worker]# lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
lustre-MDT0000_UUID        96.0M        1.7M       85.6M   2% /lustrefs[MDT:0]
lustre-OST0000_UUID       413.4M      130.2M      245.4M  35% /lustrefs[OST:0]
lustre-OST0001_UUID       413.4M      127.5M      248.0M  34% /lustrefs[OST:1]
lustre-OST0002_UUID       413.4M      125.2M      250.3M  33% /lustrefs[OST:2]

filesystem_summary:         1.2G      382.9M      743.7M  34% /lustrefs

The storage space is still not released.

Check OSTs’ /.lustre/lost+found, but find they are empty: 

[root@oss0 /]# ls /ost0_bf/lost+found/
[root@oss1 /]# ls /ost1_bf/lost+found/
[root@oss2 /]# ls /ost2_bf/lost+found/
Comment by Andreas Dilger [ 30/Apr/20 ]

The files recovered by LFSCK would not be on the OSTs, but rather they would be re-attached into the filesystem namespace on the client nodes under the directory /lustrefs/.lustre/lost+found/MDT0000/.
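
For example (a sketch; the object names under MDT0000 are illustrative placeholders):

# On a client: list the orphan files that LFSCK re-attached
ls -l /lustrefs/.lustre/lost+found/MDT0000/

# Recover an object by moving it back into the namespace, or remove it to free the space
mv /lustrefs/.lustre/lost+found/MDT0000/<recovered-object> /lustrefs/recovered_file
rm /lustrefs/.lustre/lost+found/MDT0000/<unwanted-object>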

 

Comment by Andreas Dilger [ 30/Apr/20 ]

The lost+found on the OST backing filesystems would be used by the local e2fsck in case there is corruption of the underlying disk filesystem (e.g. directory O/0/d4 is corrupted). In that case, after the local e2fsck runs and puts orphan objects into the local lost+found, then LFSCK OI_Scrub would detect this on restart and rebuild the O/0/d4 directory and restore the objects.
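
A sketch of that local-repair path on one OSS (the /dev/sdc device name and the /ost0_bf mount point are assumptions for illustration):

# Stop the OST and run e2fsck on its ldiskfs backing filesystem
umount /ost0_bf
e2fsck -fy /dev/sdc
# e2fsck links any orphaned objects into the backing filesystem's local lost+found

# Remount the OST; on restart, OI scrub detects this, rebuilds the O/0/d* object
# directories, and restores the objects from the local lost+found
mount -t lustre /dev/sdc /ost0_bf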

Comment by Runzhou Han [ 30/Apr/20 ]

Thanks! This really resolved my long-standing confusion.
