[LU-7759] umount hanging in modern distros when OST is unavailable Created: 08/Feb/16  Updated: 26/Nov/17  Resolved: 26/Nov/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0, Lustre 2.9.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: Yang Sheng
Resolution: Fixed Votes: 0
Labels: easy, sles12

Issue Links:
Related
is related to LU-5472 conf-sanity test_32a: failed with 1 Resolved
is related to LU-4039 Failure on test suite replay-single t... Resolved
is related to LU-8544 recovery-double-scale test_pairwise_f... Resolved
is related to LU-8731 lfs df exits with status 0 on failures Resolved
is related to LU-6233 recovery-small test_10d failed with '... Closed
is related to LU-8069 Allow remount to include "flock" and ... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

It looks like modern distros (SLES12, Fedora 21+, possibly earlier) always do a statfs on the mountpoint before issuing the umount call.

If one of the OSTs is not available at that time, the statfs will hang.

In our tests there's a bunch of tests in conf-sanity that have this problem and it was "dealt with" in LU-5472 by just adding the -f umount option.
But this still leaves places like recovery-single test 89 and something in recovery-small at the very least.

And most of all regular users would be affected too. So I wonder if we should deal with this issue somewhat more intelligently so that the unmount does not really hang in such a case?



 Comments   
Comment by Andreas Dilger [ 09/Feb/16 ]

One possible solution is to add a /sbin/umount.lustre script that calls lctl set_param llite.$fsname-*.lazystatfs=1 so that the statfs() doesn't hang? The $fsname-* might be further refined to be the specific mountpoint, but I don't know offhand how to translate a mountpoint to an instance in /proc/fs/lustre and it probably isn't critical to handle the dual mount case (normally only used for testing). The $fsname part can be extracted from the device field of the /proc/mounts line:

mgsnode:/fsname        mountpoint     fstype     options stuff

I'm also not sure what the requirements for umount.lustre are, whether it is run by umount before the filesystem is unmounted, or instead of the regular umount processing?
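
The fsname extraction described in this comment can be sketched as follows. This is only an illustration, not part of any patch on this ticket: the helper names lustre_fsname and lazystatfs_command are hypothetical, the parsing assumes the device field always ends in ":/fsname" (optionally followed by a subdirectory), and the command is only built, not executed.

```python
def lustre_fsname(mounts_line):
    """Return the fsname from one /proc/mounts line, or None if it is
    not a Lustre mount. Hypothetical helper for illustration only."""
    fields = mounts_line.split()
    # Expected layout: device mountpoint fstype options ...
    if len(fields) < 3 or fields[2] != "lustre":
        return None
    # The device looks like "mgsnode:/fsname"; split on the last ":/"
    # so that several colon-separated MGS NIDs are tolerated.
    _, sep, path = fields[0].rpartition(":/")
    if not sep or not path:
        return None
    # Assumption: anything after the first "/" is a subdirectory.
    return path.split("/")[0]


def lazystatfs_command(fsname):
    """Build (but do not run) the lctl call suggested in the comment."""
    return ["lctl", "set_param", "llite.%s-*.lazystatfs=1" % fsname]


if __name__ == "__main__":
    with open("/proc/mounts") as mounts:
        for line in mounts:
            name = lustre_fsname(line)
            if name is not None:
                print(" ".join(lazystatfs_command(name)))
```

Whether such a helper would run early enough to matter depends on the open question above about when umount.lustre is invoked.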

Comment by Peter Jones [ 15/Feb/16 ]

Yang Sheng

Could you please look into this issue?

Thanks

Peter

Comment by Yang Sheng [ 18/Feb/16 ]

It looks like umount.lustre is called after statfs. umount needs to figure out the fstype before invoking umount.{fstype}, which is why it calls statfs. The hard part is that there is no way to tell the kernel that the statfs was called from umount, so avoiding the statfs call is the only thing we can do. Alternatively, would skipping unavailable OSTs be a reasonable solution?

Comment by Andreas Dilger [ 18/Feb/16 ]

I guess it is possible to check strstr(current->comm, "umount") and set sbi->ll_flags |= LL_SBI_LAZYSTATFS on the filesystem, but this would add overhead to every statfs() call.

It might be possible to set LL_SBI_LAZYSTATFS by default, but that may also cause problems with the recovery tests that wait on "df" to return to indicate recovery is complete.
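
As a rough illustration of the trade-off discussed here, the following toy model contrasts lazy and non-lazy aggregation of per-OST statfs results. It is not llite code: OstState, StatfsWouldBlock and aggregate_statfs are hypothetical names, and the real non-lazy path waits on the unavailable OST rather than raising an exception.

```python
from dataclasses import dataclass


@dataclass
class OstState:
    """Hypothetical stand-in for what a client knows about one OST."""
    name: str
    active: bool
    blocks_total: int = 0
    blocks_free: int = 0


class StatfsWouldBlock(Exception):
    """Marks the point where a non-lazy statfs would wait (toy model)."""


def aggregate_statfs(osts, lazy):
    """Sum block counts over OSTs; with lazy=True, skip inactive ones."""
    total = 0
    free = 0
    for ost in osts:
        if not ost.active:
            if lazy:
                continue  # lazystatfs-like behaviour: ignore this OST
            raise StatfsWouldBlock(ost.name)
        total += ost.blocks_total
        free += ost.blocks_free
    return total, free
```

In this model a lazy caller gets totals that omit the unavailable OST, which is why tests that wait on "df" as a sign that recovery has completed could be misled if lazy behaviour were the default.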

Comment by Yang Sheng [ 02/Mar/16 ]

I found a way to handle it. We can add an entry to utab (located by default at /run/mount/utab) when mounting a Lustre filesystem. umount will then use this entry to find the fstype and avoid invoking statfs. Changing mount_lustre.c should be enough. I'll produce a patch.

Comment by Gerrit Updater [ 08/Mar/16 ]

Yang Sheng (yang.sheng@intel.com) uploaded a new patch: http://review.whamcloud.com/18820
Subject: LU-7759 utils: build mount.lustre with libmount
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ffebbb3e30c247e453480f9219985effdf035b05

Comment by Gerrit Updater [ 29/Mar/16 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/19195
Subject: LU-7759 llite: handle inactive OSTs better in statfs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 169a33aaf665fdaef6fc5734665a04d758a443e9

Comment by Gerrit Updater [ 11/Apr/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18820/
Subject: LU-7759 utils: build mount.lustre with libmount
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f1de339d881958de8fc47065fb31a5c8e0c14b60

Comment by Gerrit Updater [ 27/Jun/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19195/
Subject: LU-7759 llite: handle inactive OSTs better in statfs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 23fde1f89bec0adf4f7181ccce5a236eac371a38

Comment by Yang Sheng [ 28/Jun/16 ]

Patch landed. Close ticket.

Comment by Jinshan Xiong (Inactive) [ 16/Nov/17 ]

There is a new occurrence in a recent test: https://testing.hpdd.intel.com/test_sets/0d7d92de-cad2-11e7-9840-52540065bddc

The test failure looks the same, but I am not sure whether it is the same issue.

Comment by Andreas Dilger [ 26/Nov/17 ]

Rather than open up an old ticket that hasn't been seen in 2+ years, it is better to open up a new ticket for the new problem, and if they seem related they can be linked together. Otherwise, tracking the old re-opened issue fix version is more difficult than necessary.
