[LU-11306] Moving files from one MDT to another does not free inodes on source MDT Created: 30/Aug/18  Updated: 31/Aug/18  Resolved: 31/Aug/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.4
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Sebastien Piechurski Assignee: WC Triage
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

RHEL 7.5, kernel 3.10.0-862.11.6.el7.x86_64
Seen with 2.10.4 and master (325e23899aa38de32ec00b19ed675bcc64c6e5c8)
ldiskfs MDTs.


Attachments: File move_metadata_debug.tar.xz    
Issue Links:
Related
is related to LU-10192 Agent entry for cross-MDTs reference Resolved
is related to LU-7607 Preserve inode number after MDT migra... Open
Severity: 3

 Description   

When moving files that are on MDT0 to a directory residing on MDT1, the corresponding inodes on MDT0 are not deallocated.

Here is what I see:

[root@lustre211cli test]# for file in {1..999}; do echo $file > $file; done
[root@lustre211cli test]# lfs df -i
UUID                      Inodes       IUsed       IFree IUse% Mounted on
test-MDT0000_UUID         419432        1262      418170   1% /test[MDT:0]
test-MDT0001_UUID         419432         248      419184   1% /test[MDT:1]
test-OST0000_UUID         737280        1389      735891   0% /test[OST:0]

filesystem_summary:       737401        1510      735891   0% /test

[root@lustre211cli test]# lfs mkdir -i 1 dir2
[root@lustre211cli test]# mv {1..999} dir2/
[root@lustre211cli test]# lfs df -i
UUID                      Inodes       IUsed       IFree IUse% Mounted on
test-MDT0000_UUID         419432        1265      418167   1% /test[MDT:0]
test-MDT0001_UUID         419432        1249      418183   1% /test[MDT:1]
test-OST0000_UUID         737280        1389      735891   0% /test[OST:0]

filesystem_summary:       738405        2514      735891   0% /test

[root@lustre211cli test]# ls
dir1  dir2
[root@lustre211cli test]# ls dir1
[root@lustre211cli test]# sync
[root@lustre211cli test]# echo 3 > /proc/sys/vm/drop_caches
[root@lustre211cli test]# lfs df -i
UUID                      Inodes       IUsed       IFree IUse% Mounted on
test-MDT0000_UUID         419432        1265      418167   1% /test[MDT:0]
test-MDT0001_UUID         419432        1249      418183   1% /test[MDT:1]
test-OST0000_UUID         737280        1389      735891   0% /test[OST:0]

filesystem_summary:       738405        2514      735891   0% /test

The number of inodes used on MDT0 never decreases, even after a umount/mount cycle or after unmounting the MDT on the MDS.
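For reference, the change in IUsed between the first and second `lfs df -i` outputs above can be tallied with a small helper. This is only a sketch: the function name `iused_delta` and the temporary snapshot files are illustrative and not part of Lustre, and the rows are copied from the outputs in this ticket.

```shell
# Print the per-target change in IUsed between two saved `lfs df -i` outputs.
# Usage: iused_delta BEFORE_FILE AFTER_FILE   (hypothetical helper)
iused_delta() {
    awk '
        # Only consider rows whose first field is a target UUID (MDT or OST);
        # this skips the header, blank lines and the filesystem_summary row.
        $1 ~ /_UUID$/ {
            if (FNR == NR)
                before[$1] = $3                 # first file: remember IUsed
            else
                printf "%s %+d\n", $1, $3 - before[$1]
        }
    ' "$1" "$2"
}

# Snapshots taken from the ticket: before and after `mv {1..999} dir2/`.
before=$(mktemp)
after=$(mktemp)
cat > "$before" <<'EOF'
UUID                      Inodes       IUsed       IFree IUse% Mounted on
test-MDT0000_UUID         419432        1262      418170   1% /test[MDT:0]
test-MDT0001_UUID         419432         248      419184   1% /test[MDT:1]
test-OST0000_UUID         737280        1389      735891   0% /test[OST:0]
EOF
cat > "$after" <<'EOF'
UUID                      Inodes       IUsed       IFree IUse% Mounted on
test-MDT0000_UUID         419432        1265      418167   1% /test[MDT:0]
test-MDT0001_UUID         419432        1249      418183   1% /test[MDT:1]
test-OST0000_UUID         737280        1389      735891   0% /test[OST:0]
EOF

iused_delta "$before" "$after"
# test-MDT0000_UUID +3
# test-MDT0001_UUID +1001
# test-OST0000_UUID +0

rm -f "$before" "$after"
```

On this data MDT1 gains 1001 inodes for 999 moved files while MDT0 releases none, which matches the behaviour reported here.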

When performing an e2fsck (1.42.13.wc5) on MDT0, the behaviour changes between 2.10.4 and 2.11.54:

  • With 2.10.4, e2fsck will find as many unattached inodes as there were files moved
  • With 2.11.54, e2fsck will not find anything

I have attached the complete debug logs from the client and server, captured during this procedure.



 Comments   
Comment by Andreas Dilger [ 30/Aug/18 ]

Note that renaming a file does not move its inode; only the name is moved to the new MDT. In order to keep ext4 consistent (as you see with the avoidance of e2fsck errors), an "agent" inode needs to be added on the new MDT so that the directory entry has something to point at. If the inode were also moved to the target MDT with a rename, this would cause a number of other problems, such as changing the userspace-visible inode number (due to a new FID being assigned to map to the new MDT), breaking the DLM locking (which is also tied to the FID), breaking open file handles (also tied to the FID), and breaking hard links to the file.

There is an open ticket (LU-7607) for implementing a mechanism to preserve at least the inode number across MDTs, which would allow the common case of closed, nlink = 1 inodes to be transparently moved to another MDT, but this has not been implemented yet.

So, for the time being the behaviour you observe is working as intended.
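One way to observe the point made in this comment, that a cross-MDT rename moves only the name and leaves the FID unchanged, is to compare `lfs path2fid` output before and after the `mv`. The following is a minimal sketch, assuming a Lustre client with `lfs` in PATH; the function name and the paths in the usage line are hypothetical.

```shell
# Rename SRC to DST and report whether the file's FID stayed the same.
# Returns 0 if the FID is unchanged, 1 if it changed, 2 on any error.
# Usage: fid_preserved_by_rename SRC DST   (hypothetical helper)
fid_preserved_by_rename() {
    fpr_before=$(lfs path2fid "$1") || return 2   # FID before the rename
    mv -- "$1" "$2" || return 2
    fpr_after=$(lfs path2fid "$2") || return 2    # FID after the rename
    printf 'before=%s after=%s\n' "$fpr_before" "$fpr_after"
    [ "$fpr_before" = "$fpr_after" ]
}

# Example on a Lustre mount (paths are illustrative only):
# fid_preserved_by_rename /test/1 /test/dir2/1
```

If the explanation above holds, the two FIDs printed should be identical, since the inode stays on the source MDT and only the directory entry is created on the target.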

Comment by Sebastien Piechurski [ 31/Aug/18 ]

Hi Andreas,

Ok, looks like I did not do my homework ...

Next time, I'll read the HLD or the source code before submitting this kind of thing ...

Thanks for the explanation. You can close this ticket.

Comment by Andreas Dilger [ 31/Aug/18 ]

Sebastien, I think your question was perfectly reasonable, and I wish we had already implemented the automatic inode migration functionality. Until that happens, we need the extra overhead to ensure that the on-disk format remains consistent.

I don't think cross-MDT rename is a common case for Lustre, so this shouldn't cause too much overhead.
