HSM _not only_ small fixes and to do list goes here (LU-3647)

[LU-3601] HSM release causes running restore to hang, hangs itself Created: 16/Jul/13  Updated: 14/Feb/14  Resolved: 14/Feb/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.6.0, Lustre 2.5.1

Type: Technical task Priority: Major
Reporter: John Hammond Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: HSM

Issue Links:
Duplicate
duplicates LU-4152 layout locks can cause deadlock Resolved
Related
is related to LU-4152 layout locks can cause deadlock Resolved
is related to LU-3608 HSM Master Ticket for 2.5 Landings Resolved
Rank (Obsolete): 9136

 Description   

Running the HSM stack as of July 15, 2013, I see a hang when a release is issued while a restore is still running. To reproduce, I run the following:

#!/bin/bash

export MOUNT_2=n
export MDSCOUNT=1
export PTLDEBUG="super inode ioctl warning dlmtrace error emerg ha rpctrace vfstrace config console"
export DEBUG_SIZE=512

hsm_root=/tmp/hsm_root

rm -rf $hsm_root
mkdir $hsm_root

llmount.sh

lctl conf_param lustre-MDT0000.mdt.hsm_control=enabled
# lctl conf_param lustre-MDT0001.mdt.hsm_control=enabled
sleep 10
lhsmtool_posix --verbose --hsm_root=$hsm_root --bandwidth 1 lustre

lctl dk > ~/hsm-0-mount.dk

set -x
cd /mnt/lustre
lfs setstripe -c2 f0
dd if=/dev/urandom of=f0 bs=1M count=100
lctl dk > ~/hsm-1-dd.dk

lfs hsm_archive f0
sleep 10
echo > /proc/fs/lustre/ldlm/dump_namespaces
lctl dk > ~/hsm-2-archive.dk

lfs hsm_release f0
echo > /proc/fs/lustre/ldlm/dump_namespaces
lctl dk > ~/hsm-3-release.dk

lfs hsm_restore f0
echo > /proc/fs/lustre/ldlm/dump_namespaces
lctl dk > ~/hsm-4-restore.dk

lfs hsm_release f0

with the last command never returning. The MDS_CLOSE handler looks like:

10070
[<ffffffffa0f9866e>] cfs_waitq_wait+0xe/0x10 [libcfs]
[<ffffffffa124826a>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
[<ffffffffa1247920>] ldlm_cli_enqueue_local+0x1f0/0x5c0 [ptlrpc]
[<ffffffffa08cee3b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
[<ffffffffa08cf6b4>] mdt_object_lock+0x14/0x20 [mdt]
[<ffffffffa08f9551>] mdt_mfd_close+0x351/0xde0 [mdt]
[<ffffffffa08fb372>] mdt_close+0x662/0xa60 [mdt]
[<ffffffffa08d2c07>] mdt_handle_common+0x647/0x16d0 [mdt]
[<ffffffffa090c9e5>] mds_readpage_handle+0x15/0x20 [mdt]
[<ffffffffa12813d8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
[<ffffffffa128275d>] ptlrpc_main+0xabd/0x1700 [ptlrpc]
[<ffffffff81096936>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

while the MDS_HSM_PROGRESS handler looks like:

10065
[<ffffffffa0f9866e>] cfs_waitq_wait+0xe/0x10 [libcfs]
[<ffffffffa124826a>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
[<ffffffffa1247920>] ldlm_cli_enqueue_local+0x1f0/0x5c0 [ptlrpc]
[<ffffffffa08cee3b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
[<ffffffffa08cf6b4>] mdt_object_lock+0x14/0x20 [mdt]
[<ffffffffa08cf721>] mdt_object_find_lock+0x61/0x170 [mdt]
[<ffffffffa091dc22>] hsm_get_md_attr+0x62/0x270 [mdt]
[<ffffffffa0923253>] mdt_hsm_update_request_state+0x4d3/0x1c20 [mdt]
[<ffffffffa091ae6e>] mdt_hsm_coordinator_update+0x3e/0xe0 [mdt]
[<ffffffffa090931b>] mdt_hsm_progress+0x21b/0x330 [mdt]
[<ffffffffa08d2c07>] mdt_handle_common+0x647/0x16d0 [mdt]
[<ffffffffa090ca05>] mds_regular_handle+0x15/0x20 [mdt]
[<ffffffffa12813d8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
[<ffffffffa128275d>] ptlrpc_main+0xabd/0x1700 [ptlrpc]
[<ffffffff81096936>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

The close handler is waiting on an EX LAYOUT lock on f0, while the progress handler is waiting on a PW UPDATE lock on f0. dump_namespaces does not show that the UPDATE lock is granted.

For reference I'm using the following changes:

# LU-2919 hsm: Implementation of exclusive open
# http://review.whamcloud.com/#/c/6730
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/30/6730/13 && git cherry-pick FETCH_HEAD
 
# LU-1333 hsm: Add hsm_release feature.
# http://review.whamcloud.com/#/c/6526
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/26/6526/9 && git cherry-pick FETCH_HEAD
 
# LU-3339 mdt: HSM on disk actions record
# http://review.whamcloud.com/#/c/6529
# MERGED
 
# LU-3340 mdt: HSM memory requests management
# http://review.whamcloud.com/#/c/6530
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/30/6530/8 && git cherry-pick FETCH_HEAD
 
# LU-3341 mdt: HSM coordinator client interface
# http://review.whamcloud.com/#/c/6532
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/32/6532/13 && git cherry-pick FETCH_HEAD
# Needs rebase in sanity-hsm.sh
 
# LU-3342 mdt: HSM coordinator agent interface
# http://review.whamcloud.com/#/c/6534
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/34/6534/8 && git cherry-pick FETCH_HEAD
 
# LU-3343 mdt: HSM coordinator main thread
# http://review.whamcloud.com/#/c/6912
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/12/6912/3 && git cherry-pick FETCH_HEAD
# lustre/mdt/mdt_internal.h
 
# LU-3561 tests: HSM sanity test suite
# http://review.whamcloud.com/#/c/6913/
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/13/6913/4 && git cherry-pick FETCH_HEAD
# lustre/tests/sanity-hsm.sh
 
# LU-3432 llite: Access to released file trigs a restore
# http://review.whamcloud.com/#/c/6537
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/37/6537/11 && git cherry-pick FETCH_HEAD
 
# LU-3363 api: HSM import uses new released pattern
# http://review.whamcloud.com/#/c/6536
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/36/6536/8 && git cherry-pick FETCH_HEAD
 
# LU-2062 utils: HSM Posix CopyTool
# http://review.whamcloud.com/#/c/4737
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/37/4737/18 && git cherry-pick FETCH_HEAD


 Comments   
Comment by Jinshan Xiong (Inactive) [ 17/Jul/13 ]

Hi JC,

Can you please refresh your coordinator patches based on the release patches? I tried to make a build but unfortunately hit some merge errors.

Thanks,
Jinshan

Comment by jacques-charles lafoucriere [ 17/Jul/13 ]

I am not sure a coordinator refresh will be enough. I can push our latest full branch, based on LU-1333-v10. Is that OK with you?

Comment by John Hammond [ 17/Jul/13 ]

I believe that this situation exposes a limitation of LDLM for inodebits locks. All locks below are on f0.

  1. Starting the restore takes an EX LAYOUT lock on the server.
  2. When the releasing close RPC is sent, the client holds a PR LOOKUP|UPDATE|PERM lock.
  3. The release handler on the server blocks while attempting to take an EX LAYOUT lock.
  4. When the restore completes, the progress update handler blocks while attempting to take a PW UPDATE lock.
  5. The client releases the PR LOOKUP|UPDATE|PERM lock.
  6. The resource (f0) gets reprocessed, but the first waiting lock (EX LAYOUT) cannot be granted, so ldlm_process_inodebits_lock() returns LDLM_ITER_STOP, causing ldlm_reprocess_queue() to stop processing the resource. In particular, it does not check whether the PW UPDATE lock is compatible with all of the granted locks and all of the locks before it in the waiting list.

It also appears that the skip list optimizations in ldlm_inodebits_compat_queue() could be extended/improved by computing compatibility one mode-bits-bunch at a time and by granting locks in bunches.
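
To make the cutoff in step 6 concrete, here is a small stand-alone C program. It is a toy model only: the lock structure, bit values, and mode compatibility rule are simplified assumptions, not Lustre code. It mimics a reprocess loop that stops at the first waiter which cannot be granted, the way ldlm_reprocess_queue() stops when ldlm_process_inodebits_lock() returns LDLM_ITER_STOP.

/*
 * Toy model of the reprocessing cutoff described above -- NOT Lustre code.
 * Scenario: the coordinator holds EX LAYOUT (restore in progress), the
 * client's PR LOOKUP|UPDATE|PERM lock has just been cancelled, and the
 * waiting list holds the releasing close (EX LAYOUT) followed by the HSM
 * progress handler (PW UPDATE).
 */
#include <stdbool.h>
#include <stdio.h>

enum mode { PR, PW, EX };
#define BIT_LOOKUP 0x1
#define BIT_UPDATE 0x2
#define BIT_LAYOUT 0x4

struct lock { enum mode mode; unsigned ibits; const char *owner; };

/* Simplified mode compatibility: only PR is compatible with PR. */
static bool modes_compat(enum mode a, enum mode b)
{
	return a == PR && b == PR;
}

static bool locks_compat(const struct lock *a, const struct lock *b)
{
	return (a->ibits & b->ibits) == 0 || modes_compat(a->mode, b->mode);
}

int main(void)
{
	struct lock granted[] = {
		{ EX, BIT_LAYOUT, "restore (coordinator)" },
	};
	struct lock waiting[] = {
		{ EX, BIT_LAYOUT, "hsm_release close" },
		{ PW, BIT_UPDATE, "hsm progress"      },
	};
	const int ng = sizeof(granted) / sizeof(granted[0]);
	const int nw = sizeof(waiting) / sizeof(waiting[0]);

	for (int i = 0; i < nw; i++) {
		bool grantable = true;

		for (int j = 0; j < ng; j++)
			if (!locks_compat(&waiting[i], &granted[j]))
				grantable = false;
		if (!grantable) {
			/* ldlm_process_inodebits_lock() returns LDLM_ITER_STOP
			 * here, so the rest of the waiting list is skipped. */
			printf("waiter %d (%s): blocked -> stop reprocessing\n",
			       i, waiting[i].owner);
			break;
		}
		printf("waiter %d (%s): granted\n", i, waiting[i].owner);
	}
	return 0;
}

Running it reports only the blocked EX LAYOUT waiter. The PW UPDATE waiter behind it, whose grant would let the restore finish and release the granted LAYOUT lock, is never examined, so the resource stays stuck.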

Comment by John Hammond [ 26/Jul/13 ]

Here is a simpler situation where we can get stuck. (It is also more likely to occur.) Consider the following release vs open race. Assume the file F has already been archived.

  1. Client R starts HSM release on file F.
  2. In lfs_hsm_request, R stats F; the MDT returns a PR LOOKUP,UPDATE,LAYOUT,PERM lock on F.
  3. In lfs_hsm_request, R opens F for path2fid; the MDT returns a CR LOOKUP,LAYOUT lock on F.
  4. In ll_hsm_release/ll_lease_open, R leases F; the MDT returns an EX OPEN lock on F.
  5. Client W tries to open F with MDS_OPEN_LOCK set; the MDT adds a CW OPEN lock to the waiting list.
  6. In ll_hsm_release, client R closes F.
  7. In mdt_hsm_release, the MDT requests a local EX LAYOUT lock on F. This conflicts with the PR and CR locks already held by R, so the server sends blocking ASTs to R for those locks.
  8. The MDT reprocesses the waiting queue for F. The granted list contains the EX OPEN lock; the waiting list contains the CW OPEN lock, followed by the EX LAYOUT lock.
  9. As responses to the blocking ASTs come in, F is reprocessed, but since there is a blocked CW OPEN lock at the head of the waiting list, the locks behind it (including the EX LAYOUT) are not considered.
  10. The EX OPEN lock times out and client R is evicted.

Comment by John Hammond [ 26/Jul/13 ]

Another issue here is that it may be unsafe to access the mount point being used by the copytool, especially to perform manual HSM requests, since the MDC's cl_close_lock will prevent multiple concurrent closes. In particular, we can have a releasing close block (on the EX LAYOUT lock) because a restore is running, which prevents the restore from completing, because any close will block on cl_close_lock.

Comment by Jinshan Xiong (Inactive) [ 26/Jul/13 ]

I will fix the lock issue above.

The close sounds like a real issue here; we shouldn't block a close request from finishing. Let's use a try version of mdt_object_lock() in close.
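
For illustration, here is a minimal stand-alone sketch of that idea. Everything in it, including the mdt_object_lock_try() helper and the stand-in struct, is hypothetical rather than existing Lustre code; it only shows the intended control flow: the close path tries the EX LAYOUT lock and, if a restore still holds it, fails the release with -EBUSY instead of pinning a service thread.

/* Hypothetical sketch of a non-blocking release in the close path. */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct mdt_object { bool layout_busy; };	/* stand-in type */

/* stand-in for a non-blocking EX LAYOUT enqueue */
static bool mdt_object_lock_try(struct mdt_object *o)
{
	return !o->layout_busy;
}

/* stand-in for the releasing part of the close handler */
static int hsm_release_close(struct mdt_object *o)
{
	if (!mdt_object_lock_try(o))
		return -EBUSY;	/* restore still running: don't block the close */
	printf("EX LAYOUT taken, file released\n");
	return 0;
}

int main(void)
{
	struct mdt_object f0 = { .layout_busy = true };	/* restore in progress */

	printf("release close -> %d\n", hsm_release_close(&f0));
	f0.layout_busy = false;				/* restore finished */
	printf("release close -> %d\n", hsm_release_close(&f0));
	return 0;
}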

Comment by John Hammond [ 27/Jul/13 ]

Please see http://review.whamcloud.com/7148 for the LDLM patch we discussed.

Comment by John Hammond [ 30/Jul/13 ]

A similar hang can be triggered by trying to read a file while a restore is still running. To see this, add --bandwidth=1 to the copytool options and do:

# cd /mnt/lustre
# dd if=/dev/urandom of=f0 bs=1M count=10
# lfs hsm_archive f0
# # Wait for archive to complete.
# sleep 15
# lfs hsm_release f0
# lfs hsm_restore f0
# cat f0 > /dev/null

This is addressed by http://review.whamcloud.com/#/c/7148/.

However, even with the latest version (patch set 9) of http://review.whamcloud.com/#/c/6912/ we have an easily exploited race between restore and rename that is not addressed by the change in 7148. A rename onto the file during restore will hang:

cd /mnt/lustre
dd if=/dev/urandom of=f0 bs=1M count=10
lfs hsm_archive f0
# Wait for archive to complete.
sleep 15
lfs hsm_state f0
lfs hsm_release f0
lfs hsm_restore f0; touch f1; sys_rename f1 f0

Since this rename takes MDS_INODELOCK_FULL on f0, I doubt that the choice of using LAYOUT, UPDATE, or another bit in hsm_get_md_attr() matters very much, but I could be wrong.

Comment by John Hammond [ 01/Aug/13 ]

Since the removal of UPDATE lock use from the coordinator, I can no longer reproduce these issues.

Comment by jacques-charles lafoucriere [ 02/Aug/13 ]

We will add sanity-hsm tests for the two simple use cases. That will be safer for future changes.

Comment by Aurelien Degremont (Inactive) [ 03/Aug/13 ]

We already have such a test: the sanity-hsm #33 deadlock was hitting this bug, and John's patch fixed it. I will confirm on Monday that the latest coordinator, without John's patch, no longer triggers this deadlock, but I'm confident.

Comment by jacques-charles lafoucriere [ 04/Aug/13 ]

sanity-hsm #33 hits the same bug, but it was not designed to test concurrent access to a file during the restore phase. We also do not currently test rename/rm during restore.

Comment by Jodi Levi (Inactive) [ 21/Oct/13 ]

Should change 7148 be landed or abandoned?

Comment by John Hammond [ 21/Oct/13 ]

It was landed after being improved per comments on Gerrit.

Comment by John Hammond [ 21/Oct/13 ]

This issue was fixed for 2.5.0 and can be closed now.

Comment by Andreas Dilger [ 28/Oct/13 ]

Andriy wrote in LU-4152:

LU-1876 adds mdt_object_open_lock(), which acquires the lock in two steps for layout locks.
A deadlock is possible since it isn't atomic and ibits locks are reprocessed only until the first blocking lock is found.

Such a situation was hit with mdt_reint_open() and mdt_intent_getattr():

mdt_reint_open()->mdt_open_by_fid_lock() takes the first part of the lock (ibits=5),
mdt_intent_getattr() tries to obtain a lock (ibits=17),
mdt_open_by_fid_lock() tries to obtain the second part but fails due to a conflict with another layout lock (lock2). During cancellation of lock2 only the getattr lock is reprocessed.
http://review.whamcloud.com/#/c/7148/1 can help, but it is better to fix mdt_open_by_fid_lock().

Andriy, was this problem actually hit during testing, or was this problem found by code inspection?

Comment by Vitaly Fertman [ 28/Oct/13 ]

Andreas, it was hit during testing.

process1.lock1: open|lookup, granted
process2.lock1: layout|XXX, granted
process3.lock1: lookup|XXX, waiting on process1.lock1
process1.lock2: layout, waiting on process2.lock1
process2.lock1: cancelled; reprocessing does not reach process1.lock2

process1 is open by fid
process3 is getattr

In other words, as the two locks are not taken atomically, you must guarantee that nobody can take a lock conflicting with the first one in between. Otherwise you need one of the following:

  • full reprocess
  • reordering on waiting list
  • make these 2 enqueue atomic
  • take 1 common lock with all the ibits

By the way, why was the last option not done originally?

As this can deadlock without HSM, I would consider it a blocker.

Comment by Jinshan Xiong (Inactive) [ 28/Oct/13 ]

Indeed, this is a livelock case.

To clarify, process1 must be writing to an empty file that has no layout, so the write will cause a new layout to be created.

By the way, why was the last option not done originally?

The reason I did not acquire one common lock is that we would have to acquire EX mode for the layout lock, which would be too strong for the lookup and open bits since they would have to share the same DLM lock.

Though patch 7148 can fix this problem, acquiring two locks in a row is generally bad. Therefore, I'll fix it by acquiring one lock in EX mode for the above case; however, this lock won't be returned to the client side. As a result, the process will not cache this specific open. This is acceptable, as it will happen rarely.

What do you guys think?

Comment by Patrick Farrell (Inactive) [ 28/Oct/13 ]

Jinshan - This was originally a Cray bug (thank you Andriy and Vitaly for bringing this up), which I've been tracking.

I think eliminating the case where two locks are taken non-atomically is key long term. If you're planning to do that, then that sounds good.
If you're planning to do it only in certain cases, are you completely sure we don't have another possible livelock?

I'd back Vitaly's suggestion that it be a blocker. We're able to trigger it during testing of NFS export, presumably because of the open_by_fid operations caused by NFS export.

Comment by Jinshan Xiong (Inactive) [ 28/Oct/13 ]

Just an update: Oleg is creating a patch for this issue.

Comment by John Hammond [ 29/Oct/13 ]

Links to Oleg's patches (which all reference this issue) may be found in the comments on LU-4152.

Comment by Oleg Drokin [ 29/Oct/13 ]

Patrick: what's your exact reproducer to hit this? We are so far unable to hit it ourselves.

Comment by Patrick Farrell (Inactive) [ 29/Oct/13 ]

Oleg - We hit it while testing NFS exported Lustre during a large-ish test run, with tests drawn primarily from the Linux Test Project. The problem is we don't always hit it with the same test.

The test engineer who's been handling it thinks a way to hit it is concurrent runs of fsx-linux with different command-line options. Those are being run against an NFS export of Lustre.
He's going to try to pin that down this afternoon; I'll update if he's able to be more specific.

Comment by Patrick Farrell (Inactive) [ 29/Oct/13 ]

Moving the conversation about patches to LU-4152; the latest is there.

Comment by Jinshan Xiong (Inactive) [ 13/Nov/13 ]

This is fixed in LU-4152.

Comment by Andreas Dilger [ 10/Feb/14 ]

Patch http://review.whamcloud.com/8084 was landed under this bug, but is not reported here.
