Details
- Bug
- Resolution: Cannot Reproduce
- Major
- None
- Lustre 2.4.0, Lustre 2.1.4
- Lustre 2.1.4-5chaos on client, Lustre 2.1.4-5chaos on ldiskfs servers, Lustre 2.4.0-15chaos on zfs servers
- 3
- 10483
Description
With Lustre 2.1.4-5chaos, we are finding that clients do not honor umount correctly. The sysadmins are using the normal "umount" command with no additional options, and it returns relatively quickly.
After the command returns, Linux no longer has a record of the mount in /proc/mounts, and the mount point (/p/lscratchrza) appears to be empty. However, the directory clearly still holds a reference and cannot be removed:
# rzzeus26 /p > ls -la lscratchrza
total 0
drwxr-xr-x 2 root root 40 Aug 13 11:24 .
drwxr-xr-x 4 root root 140 Aug 13 11:24 ..
# rzzeus26 /p > rmdir lscratchrza
rmdir: failed to remove `lscratchrza': Device or resource busy
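The mount-table check described above can be sketched as a small shell helper. This is only an illustration, not a tool from this report: the function name is_mounted is hypothetical, and the helper assumes the mount point is the second whitespace-separated field of each line in /proc/mounts.

```shell
#!/bin/sh
# is_mounted MOUNTPOINT [MOUNTS_FILE]
# Hypothetical helper: succeeds (exit 0) if MOUNTPOINT is listed as the
# second field of any line in the mount table, fails (exit 1) otherwise.
is_mounted() {
    mp=$1
    table=${2:-/proc/mounts}   # default to the kernel's mount table
    awk -v mp="$mp" '$2 == mp { found = 1 } END { exit !found }' "$table"
}

# Usage on the affected client (path taken from this report):
#   is_mounted /p/lscratchrza && echo "still mounted" || echo "not in mount table"
```

On the affected client this would be expected to report that the mount is gone, even though rmdir still fails with "Device or resource busy".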
When we look in /proc/fs/lustre it is clear that most, if not all, objects for this filesystem are still present in llite, osc, mdc, ldlm/namespace, etc.
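A minimal sketch of how the leftover objects might be enumerated. Several things here are assumptions: the function name list_lustre_leftovers is hypothetical, the filesystem name "lsa" is inferred from the "lsa-client" log message, entries are assumed to be named with the filesystem name followed by a dash, and the ldlm directory is assumed to be spelled "namespaces" on disk.

```shell
#!/bin/sh
# list_lustre_leftovers FSNAME [ROOT]
# Hypothetical helper: prints every entry under the llite, osc, mdc and
# ldlm/namespaces directories whose name starts with "FSNAME-".
list_lustre_leftovers() {
    fsname=$1
    root=${2:-/proc/fs/lustre}   # assumed location of the Lustre proc tree
    for sub in llite osc mdc ldlm/namespaces; do
        for entry in "$root/$sub/$fsname"-*; do
            # An unmatched glob stays literal, so test that the entry exists.
            [ -e "$entry" ] && echo "$entry"
        done
    done
    return 0
}

# Usage (fsname assumed from the "lsa-client" log message):
#   list_lustre_leftovers lsa
```

Empty output would suggest the client has finished tearing down the filesystem; any output lists objects that are still present.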
The sysadmins issued the "umount /p/lscratchrza" command at around 9:42am, but this message did not appear on one of the nodes until over five hours later:
2013-09-13 15:18:11 Lustre: Unmounted lsa-client
So there appear to be at least two problems here:
- umount is taking far too long
- umount for Lustre is not blocking until the unmount is complete (it is exhibiting umount "lazy" behavior)
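Until the non-blocking behavior is understood, a script that needs to know when the unmount has really finished could poll for it. The following is only a sketch of that idea, not a fix: the function name wait_for_teardown is hypothetical, and it assumes that the client's llite entry (named with the filesystem name followed by a dash) disappears once teardown completes.

```shell
#!/bin/sh
# wait_for_teardown FSNAME [ROOT] [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Hypothetical helper: polls until no llite entry for FSNAME remains.
# Returns 0 once the entry is gone, 1 if the timeout expires first.
wait_for_teardown() {
    fsname=$1
    root=${2:-/proc/fs/lustre}   # assumed location of the Lustre proc tree
    timeout=${3:-600}
    interval=${4:-5}
    waited=0
    # ls fails when the glob matches nothing, which ends the loop.
    while ls -d "$root/llite/$fsname"-* >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Usage (fsname assumed from the "lsa-client" log message):
#   umount /p/lscratchrza && wait_for_teardown lsa
```

Given the five-hour delay reported here, the default timeout would have to be raised considerably for the wait to succeed.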
I should note that this Lustre client node mounts two Lustre filesystems, and only one was being unmounted. I don't know yet whether it is significant, but the servers for the filesystem that we were trying to unmount are running Lustre 2.1.4-5chaos with ldiskfs, and the servers for the other filesystem are running Lustre 2.4.0-15chaos with zfs.
I was not able to speed up the umount process by running the sync command or "echo 3 > /proc/sys/vm/drop_caches".
I did a "foreach bt" under crash, but I don't see any processes that are obviously stuck sleeping in umount-related call paths.
Real user applications are running on the client nodes while the umounts are going on. "lsof" does not list any usage under /p/lscratchrza (the filesystem that we are trying to unmount).