[LU-3472] MDS can't umount with blocked flock Created: 14/Jun/13  Updated: 09/Jan/20  Resolved: 09/Jan/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Andriy Skulysh Assignee: Bruno Faccini (Inactive)
Resolution: Low Priority Votes: 0
Labels: patch

Severity: 3
Rank (Obsolete): 8700

 Description   

Here is a test case:
# Take an exclusive flock on a Lustre file and hold it for 10 seconds.
flock -e /mnt/lustre/ff -c 'sleep 10' &
sleep 1
# Queue a second exclusive flock on the same file; it blocks behind the first.
flock -e /mnt/lustre/ff -c 'sleep 5' &
sleep 1
# Force-unmount the MDS while the second flock request is still blocked.
echo "umount -f /mnt/mds1"
umount -f /mnt/mds1 || true
killall flock
ps ax
# The client umount fails (EBUSY) even after the flock processes are killed.
echo "umount -f /mnt/lustre"
umount -f /mnt/lustre || true
cleanupall -f || error "cleanup failed"



 Comments   
Comment by Andriy Skulysh [ 14/Jun/13 ]

patch: http://review.whamcloud.com/6647

Comment by Andriy Skulysh [ 14/Jun/13 ]

Xyratex-bug-id: MRP-997

Comment by Bruno Faccini (Inactive) [ 14/Jun/13 ]

I think this is only a timing issue, due to the time needed to give up / time out the FL_UNLCK attempt during the kill or exit of the flock commands. If you wait/sleep a few seconds before trying to umount from the client side, or simply retry, you will succeed.
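
For illustration, a retry loop along the following lines (a minimal sketch; the 5-attempt limit and 2-second sleep are arbitrary values, not taken from the ticket) would be enough to ride out the FL_UNLCK give-up/time-out window:

# Hypothetical retry wrapper for the client umount: keep retrying for a
# few seconds so the blocked FL_UNLCK attempts can give up or time out.
for i in $(seq 1 5); do
    umount -f /mnt/lustre && break
    sleep 2           # arbitrary settling delay
done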

Comment by Andreas Dilger [ 17/Jun/13 ]

Bruno, I thought you were already working on a patch to clean up FLOCK locks at unmount time?

Comment by Bruno Faccini (Inactive) [ 18/Jun/13 ]

Andreas, I am sure you are referring to LU-2665, but the scenario in this ticket is different: the problem there is that the FL_UNLCK request issued upon kill/exit of a process holding a granted FLOCK can be trashed when MDS communication problems occur, leaving an orphan lock that finally causes an LBUG during a later umount. That is a client-side problem requiring more robust error handling during FLock cleanup.

About this ticket's problem, I just ran the reproducer and found that a 2nd/later umount is successful.
I will try to investigate further and fully understand what's going on.

Comment by Bruno Faccini (Inactive) [ 02/Jul/13 ]

I restarted auto-test on patch-set #2 since it triggered the LU-3230 problem/hang during zfs-review/replay-single/test_90.

Comment by Bruno Faccini (Inactive) [ 12/Jul/13 ]

Hello Andriy,
Have you been able to verify the result of your patch? Sorry to ask, but running with it I still get EBUSY on the 1st "umount -f /mnt/lustre" ...
I am working on this now to understand what's going wrong.

Comment by Bruno Faccini (Inactive) [ 17/Jul/13 ]

So definitely, even with the patch (which I agree fixes some coding imbalance and adds consistency by calling ldlm_flock_blocking_unlink() from ldlm_resource_unlink_lock()!), there is still some timing/asynchronous side effect on the client during FLock cleanup that causes this. Since "killall" works the same way (unless you use its -w option), in your reproducer it cannot serialize enough to allow a successful 1st client umount. Maybe I am wrong and missed something (tell me!), but I don't think we can fix this; it is like asking for I/Os to be complete when sync() has just returned.
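
To illustrate the -w suggestion, the tail of the reproducer could be rewritten as in the sketch below (an untested illustration only; the extra sleep is an arbitrary value):

# Wait for the flock processes to actually die (killall -w) so their
# FL_UNLCK requests can complete or time out before the client umount.
killall -w flock
sleep 2            # arbitrary extra settling time
echo "umount -f /mnt/lustre"
umount -f /mnt/lustre || true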

Comment by Andreas Dilger [ 09/Jan/20 ]

Close old bug
