Description
Hi,
I'm a PhD student and I've been working on Lustre reliability study. My group found by manually destroying MDS or OSS layout can lead to resource leak problem which means part of storage space or namespace are not usable by client. This problem actually has been discussed in the paper 'PFault' published on ICS '18, and in this paper the ressource leak is caused by e2fsck changing OST layout. However I found several other ways to trigger the same issue, as long as to destroy MDT-OST consistency. Here is a simple way to rebuilt the scenario:
1. Create a cluster with 1 client, 1 MDS, and 3 OSSs.
2. Write some files to Lustre on the client node, and check space usage with 'lfs df -h'.
3. Unmount the MDT on the MDS node, reformat the MDT's disk partition, and mount it again. This step destroys the consistency between the MDT and the OSTs.
4. Check the Lustre directory on the client node: the user files are no longer there, but 'lfs df -h' shows that the space has not been released.
5. Run lfsck, then 'lfs df -h' again. However, lfsck does not move the stale objects on the OSSs to '/lost+found', and the storage space leak is still there. (A command-level sketch of these steps follows this list.)
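For concreteness, here is a minimal command-level sketch of steps 2-4. The device /dev/sdb, the fsname 'lustre', and the mount points /mnt/mdt (on the MDS) and /mnt/lustre (on the client) are placeholders for illustration, not our exact setup; it also assumes a combined MGS/MDT:

    # Step 2, on the client: write some data, then check space usage
    dd if=/dev/zero of=/mnt/lustre/testfile bs=1M count=100
    lfs df -h

    # Step 3, on the MDS: reformat the MDT to break MDT-OST consistency
    umount /mnt/mdt
    mkfs.lustre --reformat --fsname=lustre --mgs --mdt --index=0 /dev/sdb
    mount -t lustre /dev/sdb /mnt/mdt

    # Step 4, on the client: the file is gone from the namespace,
    # but 'lfs df -h' still reports the OST space as used
    ls /mnt/lustre
    lfs df -h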
I'm not sure whether this is within the scope of lfsck's functionality, but I know lfsck's namespace phase is said to be able to remove orphan objects. This problem can potentially damage clusters, since on-disk object files can easily be removed by misoperations, and the resulting inconsistency cannot be detected by lfsck.
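For reference, the lfsck run in step 5 looked roughly like the following; the fsname 'lustre' and the target indices are again placeholders:

    # On the MDS: start all lfsck components (scrub, namespace, layout)
    lctl lfsck_start -M lustre-MDT0000 -t all

    # Check the namespace phase status/statistics on the MDS
    lctl get_param mdd.lustre-MDT0000.lfsck_namespace

    # Check the layout phase status on an OSS
    lctl get_param obdfilter.lustre-OST0000.lfsck_layout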
Thanks!
Runzhou Han
Dept. of Electrical & Computer Engineering
Iowa State University