[LU-3949] umount is lazy and takes hours Created: 13/Sep/13 Updated: 13/Oct/21 Resolved: 13/Oct/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.1.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Christopher Morrone | Assignee: | Hongchao Zhang |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
Lustre 2.1.4-5chaos on client, Lustre 2.1.4-5chaos on ldiskfs servers, Lustre 2.4.0-15chaos on zfs servers |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 10483 |
| Description |
|
With lustre 2.1.4-5chaos, we are finding that clients are not honoring umount correctly. The sysadmins are using the normal "umount" command with no additional options, and it returns relatively quickly. Linux no longer has a record of the mount in /proc/mounts after the command returns, and the mount point (/p/lscratchrza) appears to be empty. However, the directory clearly still has a reference and cannot be removed:

# rzzeus26 /p > ls -la lscratchrza
total 0
drwxr-xr-x 2 root root  40 Aug 13 11:24 .
drwxr-xr-x 4 root root 140 Aug 13 11:24 ..
# rzzeus26 /p > rmdir lscratchrza
rmdir: failed to remove `lscratchrza': Device or resource busy

When we look in /proc/fs/lustre it is clear that most, if not all, objects for this filesystem are still present in llite, osc, mdc, ldlm/namespace, etc. The sysadmins issued the "umount /p/lscratchrza" command at around 9:42am, but this message did not appear on one of the nodes until over five hours later:

2013-09-13 15:18:11 Lustre: Unmounted lsa-client

So there appear to be at least two problems here.
I should note that this lustre client node is mounting two lustre filesystems, and only one was being umounted. I don't know if it is significant yet, but the servers that we were trying to umount are running Lustre 2.1.4-5chaos with ldiskfs, and servers for the other filesystem are running Lustre 2.4.0-15chaos with zfs. I did not seem to be able to speed up the umount process by running the sync command, or "echo 3 > /proc/sys/vm/drop_caches". I did a "foreach bt" under crash, but I don't see any processes that are obviously stuck sleeping in umount related call paths. Real user applications are running on the client nodes while the umounts are going on. "lsof" does not list any usage under /p/lscratchrza (the filesystem that we are trying to unmount). |
| Comments |
| Comment by Peter Jones [ 15/Sep/13 ] |
|
Hongchao Could you please help with this one? Thanks Peter |
| Comment by Hongchao Zhang [ 16/Sep/13 ] |
|
Hi Chris, in the output of "foreach bt", is there any Lustre-related process that was busy processing something? Thanks |
| Comment by Christopher Morrone [ 24/Sep/13 ] |
|
There are lustre processes listed, of course, but they all looked like the normal contingent of processes, waiting for something to work on. The only one that might be interesting is this:

PID: 5833 TASK: ffff8802cec68040 CPU: 3 COMMAND: "flush-lustre-2"
#0 [ffff8802b33c7ce0] schedule at ffffffff8150d6e2
#1 [ffff8802b33c7da8] schedule_timeout at ffffffff8150e522
#2 [ffff8802b33c7e58] schedule_timeout_interruptible at ffffffff8150e6ce
#3 [ffff8802b33c7e68] bdi_writeback_task at ffffffff811ad100
#4 [ffff8802b33c7eb8] bdi_start_fn at ffffffff8113c686
#5 [ffff8802b33c7ee8] kthread at ffffffff81096c76
#6 [ffff8802b33c7f48] kernel_thread at ffffffff8100c0ca

There is certainly traffic in the lustre logs that I dumped, but nothing that appeared to be for the filesystem in question. As far as I can tell it was all related to the other filesystem (which was in use by an application). |
| Comment by Hongchao Zhang [ 24/Sep/13 ] |
|
The mnt_count of the vfsmount used by Lustre probably has not dropped to zero during umount; it could still be held by some other process. From the log, I will try to reproduce it with Lustre-2.1.4-5chaos. |
| Comment by Christopher Morrone [ 24/Sep/13 ] |
|
Even without reproducing, I should think that we can figure out from the code why umount is exhibiting lazy behavior. Obviously the unmount is not complete until ll_put_super is called and Lustre prints that "Unmounted lsa-client" message. However the umount command returns well before that. Is that not a bug? |
| Comment by Hongchao Zhang [ 25/Sep/13 ] |
|
Normally, lazy umount is only triggered by the "umount -l" option (the kernel represents it with MNT_DETACH), but this option is not used anywhere in Lustre (2.1.4-llnl). Is there any chance the "-l" option is being used implicitly by umount in your environment? Thanks! |
| Comment by Christopher Morrone [ 25/Sep/13 ] |
|
No, I don't see anything to suggest an implicit use of -l. But I'm not sure what I should look for that could cause an "implicit" use. I don't see -l used. And just to be clear, I'm not trying to imply that the code thinks it is in lazy mode. I am just saying that the behavior is similar to lazy mode. In other words, Lustre allows umount to return before the client really disconnects from the servers and removes the final references on the mount point. It certainly is allowing umount to return before ll_put_super is complete. I suspect that umount returns before ll_put_super is even called, but I don't have hard evidence for that. |
| Comment by Hongchao Zhang [ 26/Sep/13 ] |
|
The umount must indeed be returning before ll_put_super is called, since the "Unmounted lsa-client" message was printed a long time after the umount command. Hi Chris, has the issue occurred again at your site? Thanks! |
| Comment by Christopher Morrone [ 26/Sep/13 ] |
|
I think that the kernel pretty clearly does not wait until ll_put_super is complete before allowing whatever syscall umount is using to return. That implies to me that lustre is doing the wrong thing. The Lustre client needs to disconnect from the servers before umount returns. I have high confidence that we will see it again on thousands of nodes the next time we use umount. |
| Comment by Christopher Morrone [ 26/Sep/13 ] |
|
Maybe I should clarify my "disconnect from the servers" comment. I mean things like the imports and other components associated with that filesystem should be cleanly shut down before umount returns. I don't expect the lnet connections to necessarily be shut down. |
| Comment by Christopher Morrone [ 28/Sep/13 ] |
|
Apparently there isn't much in the way of explicit callbacks into Lustre at umount time. The callbacks mostly trigger when sufficient references are dropped. There is a call path that looks like this:

umount
  mntput_no_expire
    __mntput
      deactivate_super
        fs->kill_sb (lustre_kill_super)

deactivate_super() also calls put_super() after the call to fs->kill_sb(), which should in theory kick off ll_put_super() if the super_block s_count is decremented to zero. So that leaves us to wonder what part of this failed... I think it seems most likely that the super_block s_count was higher than 1 when __put_super() was called, but it is not clear to me what else would be holding the reference. |
| Comment by Hongchao Zhang [ 30/Sep/13 ] |
|
The most likely case is that super_block.s_active is higher than 1. super_block.s_active is increased in "sget", "get_active_super", "copy_tree" and "do_loopback". Btw, no "sget" or "get_active_super" call is found in Lustre. |
| Comment by Christopher Morrone [ 31/Jan/14 ] |
|
Still a problem. |
| Comment by Hongchao Zhang [ 18/Feb/14 ] |
|
Status update: has the issue occurred again recently, and on which kernel versions? Thanks |
| Comment by Hongchao Zhang [ 18/Feb/14 ] |
|
How about printing the debug info (so it will not be overwritten by the debug logs of the other Lustre mount) in the two functions related to umount:

diff --git a/lustre/llite/llite_lib.c b/lustre/llite/llite_lib.c
index aff4ccb..5787f5d 100644
--- a/lustre/llite/llite_lib.c
+++ b/lustre/llite/llite_lib.c
@@ -1138,7 +1138,7 @@ void ll_put_super(struct super_block *sb)
int ccc_count, next, force = 1, rc = 0;
ENTRY;
- CDEBUG(D_VFSTRACE, "VFS Op: sb %p - %s\n", sb, profilenm);
+ LCONSOLE_WARN("VFS Op: sb %p - %s\n", sb, profilenm);
ll_print_capa_stat(sbi);
@@ -2116,8 +2116,8 @@ void ll_umount_begin(struct super_block *sb)
}
#endif
- CDEBUG(D_VFSTRACE, "VFS Op: superblock %p count %d active %d\n", sb,
- sb->s_count, atomic_read(&sb->s_active));
+ LCONSOLE_WARN("VFS Op: superblock %p count %d active %d\n", sb,
+ sb->s_count, atomic_read(&sb->s_active));
obd = class_exp2obd(sbi->ll_md_exp);
if (obd == NULL) {
Btw, did you see any of these debug messages in the previous debug logs when the issue occurred? |
| Comment by Hongchao Zhang [ 24/Feb/14 ] |
|
I have tested it with the following script:

#!/bin/bash
if [ $# != 1 ]; then
    echo "Usage: $(basename $0) Server_IP"
    exit 0
fi
ADDR=$1
mkdir -p /mnt/lustre_test || (echo "can't mkdir /mnt/lustre_test" && exit)
while [ TRUE ]; do
    mkdir -p /mnt/lustre_test || exit 1
    mount -t lustre $ADDR@tcp:/lustre /mnt/lustre_test || exit 2
    echo -e "$(date) Lustre mounted"
    dd if=/dev/zero of=/mnt/lustre_test/dd.file bs=1024 count=10000 || exit 3
    umount /mnt/lustre_test || exit 4
    rmdir /mnt/lustre_test || exit 5
    echo -e "$(date) Lustre umounted"
    echo -e "\n\n"
    sleep 1
done

There is another Lustre mount on /mnt/lustre (tested with both the same server and a different server), which continuously runs some load (say, tar, dd). |
| Comment by Cliff White (Inactive) [ 24/Feb/14 ] |
|
On Hyperion (12 OSS, 36 OSTs) we normally remount between test passes, and have not noticed this issue. I am currently running the mount test script on some idle clients and will report any errors. |
| Comment by Christopher Morrone [ 25/Feb/14 ] |
|
In patch 6392 from |
| Comment by Hongchao Zhang [ 03/Mar/14 ] |
|
Yes, it could be related to this issue. There are also some "mntget" calls in "push_ctxt" (lustre/lvfs/lvfs_linux.c); I will create a debug patch to verify it. |
| Comment by Hongchao Zhang [ 04/Mar/14 ] |
|
status update: |
| Comment by Hongchao Zhang [ 05/Mar/14 ] |
|
the debug patch against b2_1 is tracked at http://review.whamcloud.com/#/c/9502/ |
| Comment by Christopher Morrone [ 04/Apr/14 ] |
|
FYI, we are no longer using 2.1 in production. We have completely migrated to 2.4. |
| Comment by Hongchao Zhang [ 04/Apr/14 ] |
|
Hi, Did this issue occur again in 2.4? Thanks |
| Comment by Christopher Morrone [ 04/Apr/14 ] |
|
Yes, it is still a problem in 2.4. |
| Comment by Hongchao Zhang [ 18/Apr/14 ] |
|
the debug patch against b2_4 is tracked at http://review.whamcloud.com/#/c/10012/ |
| Comment by Hongchao Zhang [ 04/May/14 ] |
|
Hi, Did you try the debug patch? Did the issue occur again and are there any new logs generated by this debug patch? Thanks |
| Comment by Cliff White (Inactive) [ 10/Jul/14 ] |
|
I ran the umount script on a single client while running SWL on a different mount point. I was also mounting and starting other clients. |
| Comment by Hongchao Zhang [ 11/Jul/14 ] |
|
Hi, do you run mount & umount in a loop? As per the console log, the client was mounted and umounted successfully 5 times. Thanks |
| Comment by Cliff White (Inactive) [ 11/Jul/14 ] |
|
I used your script from this bug. The final mount was /mnt/lustre_test, and the script exits upon failure, so the final mount was the error. |
| Comment by Hongchao Zhang [ 15/Jul/14 ] |
|
The kill_super/put_super triggered in the first 4 umounts is by the "sys_umount" syscall, but not the final one. Hi Cliff, could you please check what the process "slurmstepd" on the client is, and how that process is started? It could be the same issue as at LLNL (some background process holding an extra reference to the namespace of |
| Comment by Cliff White (Inactive) [ 21/Jul/14 ] |
|
Slurmstepd is an LLNL product. |
| Comment by Oleg Drokin [ 07/Aug/14 ] |
|
After rereading everything in the ticket here again and re-examining the in-kernel code, I think Hongchao is right here. At least for the test Cliff has performed it is quite clear that slurmstepd has a private namespace that holds an extra reference to the lustre fs being unmounted by the admin. As such, there's really nothing Lustre can do here until this additional namespace goes away; in fact Lustre is not even called to try and perform an actual unmount until this happens. (Note, this is very different from a task having a file opened on Lustre or its CWD there, because then umount itself will fail with EBUSY.) This is somewhat similar to the case when you do mount --bind /mnt/lustre /mnt/lustre2 – after this you can do umount /mnt/lustre, but because there's still another reference - /mnt/lustre2 - lustre is not going to be unmounted (though in that case you would at least be able to delete /mnt/lustre, unlike the case discussed here). So to summarize: in my view this is not a Lustre bug at all; Lustre has no control over other namespaces referencing it. Cliff - can you please run slurm and slurmstepd under strace somehow and see if clone is called with CLONE_NEWNS anywhere? CLONE_NEWNS is designed to do exactly what we see here - separate the parent namespace from the child, so that they can do further mounts and unmounts without affecting each other. |
| Comment by Oleg Drokin [ 08/Aug/14 ] |
|
Here's a simple reproducer for the CLONE_NEWNS behavior that matches what you are seeing I believe (it certainly does for me):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>
#include <signal.h>

int child_fn(void *ptr)
{
	sleep(100);
	return 0;
}

int main(void)
{
	long *stack = malloc(10000 * sizeof(*stack));
	int rc;

	/* clone() takes the top of the child stack (the stack grows down);
	 * CLONE_NEWNS requires CAP_SYS_ADMIN, so run this as root */
	rc = clone(child_fn, stack + 9999, CLONE_NEWNS | SIGCHLD, NULL);
	if (rc == -1)
		perror("Cannot clone");
	exit(0);
}
Just compile and run it (as root - CLONE_NEWNS requires CAP_SYS_ADMIN); it'll spawn another process that'll hang there for 100 seconds. During these 100 seconds try to unmount a lustre filesystem and you'll see that while umount exits right away with success and the filesystem is gone from /proc/mounts, the "Unmounted" lustre message is not printed:

[383407.089403] Lustre: Mounted lustre-client
[383419.633431] Lustre: setting import lustre-MDT0000_UUID INACTIVE by administrator request
[383465.064335] Lustre: lustre-OST0000: haven't heard from client 9ecf059c-5c43-49c9-61c1-7df1b424d51d (at 0@lo) in 53 seconds. I think it's dead, and I am evicting it. exp ffff8800b00f7bf0, cur 1407457950 expire 1407457920 last 1407457897
[383465.084392] LustreError: 4666:0:(ofd_grant.c:183:ofd_grant_sanity_check()) ofd_statfs: tot_granted 281728 != fo_tot_granted 2378880
[383467.052644] Lustre: lustre-MDT0000: haven't heard from client 9ecf059c-5c43-49c9-61c1-7df1b424d51d (at 0@lo) in 55 seconds. I think it's dead, and I am evicting it. exp ffff88008dd3abf0, cur 1407457952 expire 1407457922 last 1407457897
[383467.056074] Lustre: Skipped 1 previous similar message
[383478.941396] Lustre: Unmounted lustre-client

Granted, the entire idea behind umount -f was to just cut all communications, so we probably cannot easily turn around and change the semantics to "tell the server we are gone too". |
| Comment by Cliff White (Inactive) [ 11/Aug/14 ] |
|
I straced while running SWL. Did not find any use of CLONE_NEWNS.
| Comment by Christopher Morrone [ 11/Aug/14 ] |
Could you please explain exactly what you did? |
| Comment by Cliff White (Inactive) [ 11/Aug/14 ] |
|
While running SWL, on multiple clients I straced slurmd, like this: strace -ff -p $SPID -e trace=clone -o $DIR/$FILE Then I grepped through the results for 'flags=' , sorted for unique, and looked for CLONE_NEWNS. |
| Comment by Cliff White (Inactive) [ 12/Aug/14 ] |
|
Running the new test patch, I see this backtrace repeatedly |
| Comment by Oleg Drokin [ 12/Aug/14 ] |
|
So we clearly see that slurmd or one of its children calls the unshare syscall, which clones the namespace and prevents all filesystems, not just lustre, from unmounting for the life of the cloned namespace (which ends when the child does). What's strange is that I do not see any unshare calls in the slurmd source I downloaded from github, so I wonder if this is coming from some of the libraries in use at LLNL. In any case there's absolutely nothing we (on the lustre side) can do here to speed up unmounting, it appears. The userspace asked the kernel not to allow any real unmounts, after all. |
| Comment by Cliff White (Inactive) [ 12/Aug/14 ] |
|
I then straced slurmd, with -e trace=unshare, and see this result: I see the same call in several other traces. |
| Comment by Cliff White (Inactive) [ 12/Aug/14 ] |
|
Sorry, previous comment was lost. |
| Comment by Cliff White (Inactive) [ 19/Aug/14 ] |
|
Slurm plugins on Hyperion (from /etc/slurm/plugstack.conf). I am not sure if this is the LLNL version, but I do see this in the use-env source:

/* Unshare file namespace. This means only this process and its children
 * will see the following mounts, and when this process and its children
 * terminate, the mounts go away automatically. */
if (unshare (CLONE_NEWNS) < 0) {
	slurm_error ("unshare CLONE_NEWNS: %m");
	return (-1);
} |
| Comment by Christopher Morrone [ 21/Aug/14 ] |
Are you sure about that? It looks to me like that code is from the iorelay plugin, not use-env. The other place mentioning CLONE_NEWNS is the private_mount plugin. I got my sources from: |
| Comment by Christopher Morrone [ 21/Aug/14 ] |
|
Perhaps /etc/slurm/lua.d/ns.lua is more likely. |
| Comment by Christopher Morrone [ 22/Aug/14 ] |
|
I am thinking that ns.lua is the likely culprit. I'll have to get someone at LLNL to look into why we are doing that, and how to do it better. |