
Does LFSCK check on-disk information

Details

    • Type: Task
    • Resolution: Done
    • Priority: Trivial
    • Affects Version/s: Lustre 2.10.8
    • Environment: CentOS 7, ldiskfs

    Description

      Hi,

      I'm a PhD student, and I've been working on a Lustre reliability study. My group found that manually destroying the MDS or OSS layout can lead to a resource leak, meaning that part of the storage space or namespace becomes unusable by clients. This problem was discussed in the 'PFault' paper published at ICS '18, where the resource leak is caused by e2fsck changing the OST layout. However, I found several other ways to trigger the same issue; anything that destroys MDT-OST consistency will do. Here is a simple way to reproduce the scenario (a condensed shell sketch follows the steps):
      1. Create a cluster with 1 client, 1 MDS, and 3 OSSs.
      2. Write some files to Lustre from the client node, and check the space usage with 'lfs df -h'.
      3. On the MDS, unmount the MDT directory, reformat the MDT's disk partition, and remount it. This step destroys the consistency between the MDT and the OSTs.
      4. Check the Lustre directory on the client node: the user files are no longer there, but 'lfs df -h' shows that the space is not released.
      5. Run lfsck and 'lfs df -h' again. However, lfsck does not move the stale objects on the OSSs to '/lost+found', and the storage space leak is still there.
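
      A condensed sketch of steps 2-5, reusing the hostnames, MDT device (/dev/sdb), and mount points (/mdt, /lustrefs) that appear in the logs further down this ticket; the dd write is just a stand-in for whatever workload populates the OSTs:

      # On the client: write some data and note the usage
      [root@client0 /]# dd if=/dev/zero of=/lustrefs/testfile bs=1M count=100
      [root@client0 /]# lfs df -h
      # On the MDS: destroy MDT-OST consistency by reformatting the MDT
      [root@mds /]# umount /mdt
      [root@mds /]# mkfs.lustre --fsname=lustre --mgsnode=192.168.1.7@tcp0 --mdt --index=0 --reformat /dev/sdb
      [root@mds /]# mount.lustre /dev/sdb /mdt
      # On the client: the files are gone, but the OST space is still consumed
      [root@client0 /]# ls /lustrefs/
      [root@client0 /]# lfs df -h
      # On the MDS: run LFSCK and re-check; the leaked space remains
      [root@mds /]# lctl lfsck_start -A -t all -o
      [root@client0 /]# lfs df -h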

      I'm not sure whether this is within the scope of lfsck's functionality, but I know lfsck's namespace phase is said to be able to remove orphan objects. This problem can potentially damage clusters, since on-disk object files can easily be removed by misoperation, and the damage cannot be detected by lfsck.
      Thanks!

      Runzhou Han
      Dept. of Electrical & Computer Engineering
      Iowa State University


        Activity


          rzhan Runzhou Han (Inactive) added a comment:

          Thanks! This really solved my long-term confusion.

          adilger Andreas Dilger added a comment:

          The lost+found on the OST backing filesystems would be used by the local e2fsck in case there is corruption of the underlying disk filesystem (e.g. directory O/0/d4 is corrupted). In that case, after the local e2fsck runs and puts orphan objects into the local lost+found, LFSCK OI_Scrub would detect this on restart, rebuild the O/0/d4 directory, and restore the objects.
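
          A minimal sketch of that local recovery path, assuming a hypothetical OST backing device /dev/sdc mounted at /ost0 (these names are not from this ticket):

          # On the OSS: unmount the OST and run the local e2fsck on its backing filesystem
          [root@oss0 /]# umount /ost0
          [root@oss0 /]# e2fsck -fy /dev/sdc
          # e2fsck places any orphan objects into the backing filesystem's local lost+found;
          # on remount, LFSCK OI_Scrub can rebuild O/0/d* and restore those objects
          [root@oss0 /]# mount.lustre /dev/sdc /ost0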

          adilger Andreas Dilger added a comment:

          The files recovered by LFSCK would not be on the OSTs, but rather they would be re-attached into the filesystem namespace on the client nodes under the directory /lustrefs/.lustre/lost+found/MDT0000/.
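
          As a sketch (assuming the /lustrefs client mount point used in this ticket), the recovered files would then be visible from any client with:

          [root@client0 /]# ls /lustrefs/.lustre/lost+found/MDT0000/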
          rzhan Runzhou Han (Inactive) added a comment (edited):

          Thank you for replying!

          I didn't use any additional arguments; I simply used 'lctl lfsck_start'. I also tried 'lctl lfsck_start -A -t all -o', but I think lfsck still didn't link the orphan objects into /.lustre/lost+found/, if that means I should be able to see the orphan objects via ls in Lustre's lost+found directory.

          I'm posting the detailed operations and logs for reproducing the problem below.

          After setting up the cluster, check the brand-new cluster's usage:

          [root@client0 pf_pfs_worker]# lfs df -h
          UUID                       bytes        Used   Available Use% Mounted on
          lustre-MDT0000_UUID        96.0M        1.7M       85.6M   2% /lustrefs[MDT:0]
          lustre-OST0000_UUID       413.4M       13.2M      365.3M   3% /lustrefs[OST:0]
          lustre-OST0001_UUID       413.4M       13.2M      365.3M   3% /lustrefs[OST:1]
          lustre-OST0002_UUID       413.4M       13.2M      365.3M   3% /lustrefs[OST:2]
          
          filesystem_summary:         1.2G       39.6M        1.1G   3% /lustrefs
          

           
          Then I ran some write workloads to age the cluster:

          [root@client0 pf_pfs_worker]# ./pfs_worker_cp.sh 
          [root@client0 pf_pfs_worker]# ./pfs_worker_age.sh 
          

           
          Check cluster usage again:

          [root@client0 pf_pfs_worker]# lfs df -h
          UUID                       bytes        Used   Available Use% Mounted on
          lustre-MDT0000_UUID        96.0M        1.7M       85.6M   2% /lustrefs[MDT:0]
          lustre-OST0000_UUID       413.4M      136.9M      235.9M  37% /lustrefs[OST:0]
          lustre-OST0001_UUID       413.4M      121.9M      250.9M  33% /lustrefs[OST:1]
          lustre-OST0002_UUID       413.4M      119.5M      253.5M  32% /lustrefs[OST:2]
          
          filesystem_summary:         1.2G      378.3M      740.3M  34% /lustrefs
          

          It shows that about 340 MB of data was written to the OSTs (378.3M used after the workload vs. 39.6M initially).

          On the MDS, unmount the MDT, reformat it, and mount it again:

          [root@mds /]# umount /mdt
          [root@mds /]# mkfs.lustre --fsname=lustre --mgsnode=192.168.1.7@tcp0 --mdt --index=0 --reformat /dev/sdb
          
          Permanent disk data:
          Target:     lustre:MDT0000
          Index:      0
          Lustre FS:  lustre
          Mount type: ldiskfs
          Flags:      0x61
                        (MDT first_time update )
          Persistent mount opts: user_xattr,errors=remount-ro
          Parameters: mgsnode=192.168.1.7@tcp
          device size = 200MB
          formatting backing filesystem ldiskfs on /dev/sdb
          target name   lustre:MDT0000
          4k blocks     51200
          options        -I 1024 -i 2560 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F
          mkfs_cmd = mke2fs -j -b 4096 -L lustre:MDT0000  -I 1024 -i 2560 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F /dev/sdb 51200
          Writing CONFIGS/mountdata
          [root@mds /]# mount.lustre /dev/sdb /mdt
          mount.lustre: mount /dev/sdb at /mdt failed: Address already in use
          The target service's index is already in use. (/dev/sdb)
          [root@mds /]# mount.lustre /dev/sdb /mdt
          

          Interestingly, Lustre didn't allow me to remount at first ('Address already in use'), but the second try worked.

          Then check the client-side Lustre directory and the cluster usage:

          [root@client0 pf_pfs_worker]# ls /lustrefs/
          [root@client0 pf_pfs_worker]# lfs df -h
          UUID                       bytes        Used   Available Use% Mounted on
          lustre-MDT0000_UUID        96.0M        1.7M       85.6M   2% /lustrefs[MDT:0]
          lustre-OST0000_UUID       413.4M      130.0M      245.5M  35% /lustrefs[OST:0]
          lustre-OST0001_UUID       413.4M      127.5M      248.0M  34% /lustrefs[OST:1]
          lustre-OST0002_UUID       413.4M      125.2M      250.3M  33% /lustrefs[OST:2]
           
          filesystem_summary:         1.2G      382.8M      743.8M  34% /lustrefs
          

          The user data is no longer visible to the client, but the storage space is not released.

          Try to fix this inconsistency with lfsck: 

          [root@mds /]# lctl lfsck_start -A -t all -o
          Started LFSCK on the device lustre-MDT0000: scrub layout namespace
          [root@mds /]# lctl lfsck_query
          layout_mdts_init: 0
          layout_mdts_scanning-phase1: 0
          layout_mdts_scanning-phase2: 0
          layout_mdts_completed: 1
          layout_mdts_failed: 0
          layout_mdts_stopped: 0
          layout_mdts_paused: 0
          layout_mdts_crashed: 0
          layout_mdts_partial: 0
          layout_mdts_co-failed: 0
          layout_mdts_co-stopped: 0
          layout_mdts_co-paused: 0
          layout_mdts_unknown: 0
          layout_osts_init: 0
          layout_osts_scanning-phase1: 0
          layout_osts_scanning-phase2: 0
          layout_osts_completed: 3
          layout_osts_failed: 0
          layout_osts_stopped: 0
          layout_osts_paused: 0
          layout_osts_crashed: 0
          layout_osts_partial: 0
          layout_osts_co-failed: 0
          layout_osts_co-stopped: 0
          layout_osts_co-paused: 0
          layout_osts_unknown: 0
          layout_repaired: 285
          namespace_mdts_init: 0
          namespace_mdts_scanning-phase1: 0
          namespace_mdts_scanning-phase2: 0
          namespace_mdts_completed: 1
          namespace_mdts_failed: 0
          namespace_mdts_stopped: 0
          namespace_mdts_paused: 0
          namespace_mdts_crashed: 0
          namespace_mdts_partial: 0
          namespace_mdts_co-failed: 0
          namespace_mdts_co-stopped: 0
          namespace_mdts_co-paused: 0
          namespace_mdts_unknown: 0
          namespace_osts_init: 0
          namespace_osts_scanning-phase1: 0
          namespace_osts_scanning-phase2: 0
          namespace_osts_completed: 0
          namespace_osts_failed: 0
          namespace_osts_stopped: 0
          namespace_osts_paused: 0
          namespace_osts_crashed: 0
          namespace_osts_partial: 0
          namespace_osts_co-failed: 0
          namespace_osts_co-stopped: 0
          namespace_osts_co-paused: 0
          namespace_osts_unknown: 0
          namespace_repaired: 0
          

          It shows that lfsck repaired 285 layout objects, yet namespace_repaired remains 0.

          On the client node, check the cluster usage again:

          [root@client0 pf_pfs_worker]# lfs df -h
          UUID                       bytes        Used   Available Use% Mounted on
          lustre-MDT0000_UUID        96.0M        1.7M       85.6M   2% /lustrefs[MDT:0]
          lustre-OST0000_UUID       413.4M      130.2M      245.4M  35% /lustrefs[OST:0]
          lustre-OST0001_UUID       413.4M      127.5M      248.0M  34% /lustrefs[OST:1]
          lustre-OST0002_UUID       413.4M      125.2M      250.3M  33% /lustrefs[OST:2]
          
          filesystem_summary:         1.2G      382.9M      743.7M  34% /lustrefs
          

          The storage space is still not released.

          Check the lost+found directories on the OSTs' backing filesystems, but find they are empty:

          [root@oss0 /]# ls /ost0_bf/lost+found/
          [root@oss1 /]# ls /ost1_bf/lost+found/
          [root@oss2 /]# ls /ost2_bf/lost+found/
          

          adilger Andreas Dilger added a comment:

          Could you please provide more detail about how you are running LFSCK? Are you using something like "lctl lfsck_start -A -t all -o" to link the orphan objects into .../.lustre/lost+found/ so they can be recovered or removed?
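
          For reference, a minimal sketch of that invocation plus status checks on the MDS (the mdd.*.lfsck_layout parameter name follows the Lustre manual; adjust the fsname for your system):

          [root@mds /]# lctl lfsck_start -A -t all -o
          [root@mds /]# lctl lfsck_query
          # per-target detail for the layout phase
          [root@mds /]# lctl get_param -n mdd.lustre-MDT0000.lfsck_layout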

          People

            Assignee: wc-triage WC Triage
            Reporter: rzhan Runzhou Han (Inactive)
            Votes: 0
            Watchers: 2
