[LU-13489] Does LFSCK check on-disk information Created: 28/Apr/20 Updated: 10/Jun/20 Resolved: 10/Jun/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.8 |
| Fix Version/s: | None |
| Type: | Task | Priority: | Trivial |
| Reporter: | Runzhou Han | Assignee: | WC Triage |
| Resolution: | Done | Votes: | 0 |
| Labels: | lfsck |
| Environment: | CentOS 7, ldiskfs |
| Epic/Theme: | lfsck |
| Business Value: | 5 |
| Epic: | LFSCK |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Hi, I'm a PhD student and I've been working on a Lustre reliability study. My group found that manually destroying the MDS or OSS layout can lead to a resource leak, meaning that part of the storage space or namespace is no longer usable by the client. This problem has actually been discussed in the 'PFault' paper published at ICS '18, where the resource leak is caused by e2fsck changing the OST layout. However, I found several other ways to trigger the same issue, as long as MDT-OST consistency is destroyed. A simple way to rebuild the scenario is to reformat the MDT while data is still present on the OSTs (detailed operations and logs are in my comment below). I'm not sure whether this is in the scope of lfsck's functionality, but I know lfsck's namespace phase is said to be able to remove orphan objects. This problem can potentially damage clusters, since on-disk object files can easily be removed by mistaken operations, and the resulting leak cannot be detected by lfsck. Runzhou Han |
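A minimal sketch of the reproduction path, condensed from the detailed operations and logs in the 29/Apr/20 comment below; the device name (/dev/sdb), mount points (/mdt, /lustrefs), and MGS NID (192.168.1.7@tcp0) are the ones from that log and will differ on other systems, and /some/data is just a placeholder for any file-creating workload:

# On a client: create some files so the OSTs hold objects referenced by the MDT
cp -r /some/data /lustrefs/        # any file-creating workload works here
lfs df -h                          # note the space now used on the OSTs

# On the MDS: destroy the MDT layout while the OST objects stay on disk
umount /mdt
mkfs.lustre --fsname=lustre --mgsnode=192.168.1.7@tcp0 --mdt --index=0 --reformat /dev/sdb
mount.lustre /dev/sdb /mdt         # a second attempt may be needed, per the log below

# On the client: the files are gone from the namespace, but the OST space
# is still reported as used - the objects are now orphans
ls /lustrefs/
lfs df -h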
| Comments |
| Comment by Andreas Dilger [ 29/Apr/20 ] |
|
Could you please provide more detail about how you are running LFSCK? Are you using something like "lctl lfsck_start -A -t all -o" to link the orphan objects into .../.lustre/lost+found/ so they can be recovered or removed? |
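For reference, a minimal sketch of the invocation being asked about, using only commands that appear elsewhere in this ticket (-A runs on all targets, -t all selects every LFSCK component, -o enables orphan handling):

# On the MDS: start every LFSCK component on all targets, linking recovered
# orphan OST objects back into the namespace
lctl lfsck_start -A -t all -o

# Summarize the per-target LFSCK state afterwards
lctl lfsck_query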
| Comment by Runzhou Han [ 29/Apr/20 ] |
|
Thank you for replying! I didn't use any additional arguments and simply ran 'lctl lfsck_start'. I also tried 'lctl lfsck_start -A -t all -o', but I think lfsck still didn't link orphan objects into /.lustre/lost+found/, if that means I should be able to see the orphan objects via ls in Lustre's /lost+found folder. I'm posting the detailed operations and logs of reproducing the problem here.

After setting up the cluster, check the brand new cluster's usage:

[root@client0 pf_pfs_worker]# lfs df -h
UUID                   bytes        Used   Available Use% Mounted on
lustre-MDT0000_UUID    96.0M        1.7M       85.6M   2% /lustrefs[MDT:0]
lustre-OST0000_UUID   413.4M       13.2M      365.3M   3% /lustrefs[OST:0]
lustre-OST0001_UUID   413.4M       13.2M      365.3M   3% /lustrefs[OST:1]
lustre-OST0002_UUID   413.4M       13.2M      365.3M   3% /lustrefs[OST:2]
filesystem_summary:     1.2G       39.6M        1.1G   3% /lustrefs

[root@client0 pf_pfs_worker]# ./pfs_worker_cp.sh
[root@client0 pf_pfs_worker]# ./pfs_worker_age.sh
[root@client0 pf_pfs_worker]# lfs df -h
UUID                   bytes        Used   Available Use% Mounted on
lustre-MDT0000_UUID    96.0M        1.7M       85.6M   2% /lustrefs[MDT:0]
lustre-OST0000_UUID   413.4M      136.9M      235.9M  37% /lustrefs[OST:0]
lustre-OST0001_UUID   413.4M      121.9M      250.9M  33% /lustrefs[OST:1]
lustre-OST0002_UUID   413.4M      119.5M      253.5M  32% /lustrefs[OST:2]
filesystem_summary:     1.2G      378.3M      740.3M  34% /lustrefs

This shows that about 300+ MB of data has been written to the OSTs.

On the MDS, umount the MDT, reformat it, and mount it again:

[root@mds /]# umount /mdt
[root@mds /]# mkfs.lustre --fsname=lustre --mgsnode=192.168.1.7@tcp0 --mdt --index=0 --reformat /dev/sdb

   Permanent disk data:
Target:     lustre:MDT0000
Index:      0
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x61
            (MDT first_time update )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters: mgsnode=192.168.1.7@tcp

device size = 200MB
formatting backing filesystem ldiskfs on /dev/sdb
        target name   lustre:MDT0000
        4k blocks     51200
        options       -I 1024 -i 2560 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L lustre:MDT0000 -I 1024 -i 2560 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F /dev/sdb 51200
Writing CONFIGS/mountdata

[root@mds /]# mount.lustre /dev/sdb /mdt
mount.lustre: mount /dev/sdb at /mdt failed: Address already in use
The target service's index is already in use. (/dev/sdb)
[root@mds /]# mount.lustre /dev/sdb /mdt

Interestingly, Lustre didn't allow the remount at first, but the second try worked.

Then check the Lustre directory and cluster usage from the client side:

[root@client0 pf_pfs_worker]# ls /lustrefs/
[root@client0 pf_pfs_worker]# lfs df -h
UUID                   bytes        Used   Available Use% Mounted on
lustre-MDT0000_UUID    96.0M        1.7M       85.6M   2% /lustrefs[MDT:0]
lustre-OST0000_UUID   413.4M      130.0M      245.5M  35% /lustrefs[OST:0]
lustre-OST0001_UUID   413.4M      127.5M      248.0M  34% /lustrefs[OST:1]
lustre-OST0002_UUID   413.4M      125.2M      250.3M  33% /lustrefs[OST:2]
filesystem_summary:     1.2G      382.8M      743.8M  34% /lustrefs

The client's data is no longer visible to the client, but the storage space is not released.
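To confirm that the unreleased space corresponds to orphaned object files still present on the OSTs, the backing ldiskfs filesystem can be inspected directly. This is a sketch rather than part of the original report; /dev/sdc stands in for an OST's backing device, and /O/0/d0 is one of the ldiskfs object directories of the kind mentioned later in this ticket:

# On an OSS (preferably with the OST target stopped): list one of the
# ldiskfs object directories read-only with debugfs; the object files are
# still there even though the MDT that referenced them was reformatted
debugfs -R 'ls /O/0/d0' /dev/sdc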
Try to fix this inconsistency with lfsck:

[root@mds /]# lctl lfsck_start -A -t all -o
Started LFSCK on the device lustre-MDT0000: scrub layout namespace
[root@mds /]# lctl lfsck_query
layout_mdts_init: 0
layout_mdts_scanning-phase1: 0
layout_mdts_scanning-phase2: 0
layout_mdts_completed: 1
layout_mdts_failed: 0
layout_mdts_stopped: 0
layout_mdts_paused: 0
layout_mdts_crashed: 0
layout_mdts_partial: 0
layout_mdts_co-failed: 0
layout_mdts_co-stopped: 0
layout_mdts_co-paused: 0
layout_mdts_unknown: 0
layout_osts_init: 0
layout_osts_scanning-phase1: 0
layout_osts_scanning-phase2: 0
layout_osts_completed: 3
layout_osts_failed: 0
layout_osts_stopped: 0
layout_osts_paused: 0
layout_osts_crashed: 0
layout_osts_partial: 0
layout_osts_co-failed: 0
layout_osts_co-stopped: 0
layout_osts_co-paused: 0
layout_osts_unknown: 0
layout_repaired: 285
namespace_mdts_init: 0
namespace_mdts_scanning-phase1: 0
namespace_mdts_scanning-phase2: 0
namespace_mdts_completed: 1
namespace_mdts_failed: 0
namespace_mdts_stopped: 0
namespace_mdts_paused: 0
namespace_mdts_crashed: 0
namespace_mdts_partial: 0
namespace_mdts_co-failed: 0
namespace_mdts_co-stopped: 0
namespace_mdts_co-paused: 0
namespace_mdts_unknown: 0
namespace_osts_init: 0
namespace_osts_scanning-phase1: 0
namespace_osts_scanning-phase2: 0
namespace_osts_completed: 0
namespace_osts_failed: 0
namespace_osts_stopped: 0
namespace_osts_paused: 0
namespace_osts_crashed: 0
namespace_osts_partial: 0
namespace_osts_co-failed: 0
namespace_osts_co-stopped: 0
namespace_osts_co-paused: 0
namespace_osts_unknown: 0
namespace_repaired: 0

This shows that lfsck has repaired 285 objects. On the client node, check cluster usage again:

[root@client0 pf_pfs_worker]# lfs df -h
UUID                   bytes        Used   Available Use% Mounted on
lustre-MDT0000_UUID    96.0M        1.7M       85.6M   2% /lustrefs[MDT:0]
lustre-OST0000_UUID   413.4M      130.2M      245.4M  35% /lustrefs[OST:0]
lustre-OST0001_UUID   413.4M      127.5M      248.0M  34% /lustrefs[OST:1]
lustre-OST0002_UUID   413.4M      125.2M      250.3M  33% /lustrefs[OST:2]
filesystem_summary:     1.2G      382.9M      743.7M  34% /lustrefs

The storage space is still not released. Check the OSTs' /.lustre/lost+found, but they are empty:

[root@oss0 /]# ls /ost0_bf/lost+found/
[root@oss1 /]# ls /ost1_bf/lost+found/
[root@oss2 /]# ls /ost2_bf/lost+found/ |
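For a finer-grained view of what that layout_repaired: 285 total covers, the per-target layout LFSCK status files can be dumped. This is a sketch, assuming the target names from this cluster; the exact parameter paths and counter names can vary between Lustre releases:

# On the MDS: detailed layout LFSCK statistics for the MDT, which break the
# repaired total down by category (orphans, dangling references, etc.)
lctl get_param -n mdd.lustre-MDT0000.lfsck_layout

# On each OSS: the OST-side view of the same layout LFSCK pass
lctl get_param -n obdfilter.lustre-OST0000.lfsck_layout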
| Comment by Andreas Dilger [ 30/Apr/20 ] |
|
The files recovered by LFSCK would not be on the OSTs, but rather they would be re-attached into the filesystem namespace on the client nodes under the directory /lustrefs/.lustre/lost+found/MDT0000/.
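In other words, the place to look is a client mount rather than the OST backing filesystems. A sketch using the /lustrefs mount point from the logs above, where NAME stands for whatever entries LFSCK actually created there:

# On a client: list objects that layout LFSCK re-attached to the namespace
ls -l /lustrefs/.lustre/lost+found/MDT0000/

# Inspect a recovered entry, then either move it back to a useful location
# or remove it, which is what finally releases the leaked OST space
lfs getstripe /lustrefs/.lustre/lost+found/MDT0000/NAME
rm /lustrefs/.lustre/lost+found/MDT0000/NAME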
|
| Comment by Andreas Dilger [ 30/Apr/20 ] |
|
The lost+found on the OST backing filesystems would be used by the local e2fsck in case there is corruption of the underlying disk filesystem (e.g. the directory O/0/d4 is corrupted). In that case, after the local e2fsck runs and puts orphan objects into the local lost+found, LFSCK OI_Scrub would detect this on restart, rebuild the O/0/d4 directory, and restore the objects. |
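A sketch of that backing-filesystem repair path, to contrast it with the client-visible lost+found above; the device name and mount point are placeholders, and the oi_scrub parameter path is given as an example of where the scrub status is typically reported:

# On an OSS, with the OST target stopped: let the local e2fsck repair the
# ldiskfs filesystem; orphaned object inodes land in its local lost+found
umount /ost0_bf
e2fsck -f /dev/sdc

# After the OST is mounted again, OI scrub notices the inconsistency,
# rebuilds the O/<seq>/d<N> object directories, and restores the objects;
# its progress can be checked on the OSS
mount.lustre /dev/sdc /ost0_bf
lctl get_param osd-ldiskfs.lustre-OST0000.oi_scrub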
| Comment by Runzhou Han [ 30/Apr/20 ] |
|
Thanks! This really cleared up my long-standing confusion. |