[LU-3669] (mdt_xattr.c:178:mdt_getxattr()) ASSERTION( lu_object_assert_exists(&info->mti_object->mot_obj) ) failed: Created: 30/Jul/13  Updated: 21/Oct/13  Resolved: 10/Sep/13

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.5.0

Type: Bug Priority: Blocker
Reporter: John Hammond Assignee: John Hammond
Resolution: Fixed Votes: 0
Labels: HB, mdt, patch, xattr

Issue Links:
Duplicate
is duplicated by LU-3795 replay-single test_58b test_58c: trus... Resolved
is duplicated by LU-3730 sanity-hsm test_3 Error: 'could not c... Resolved
is duplicated by LU-3746 Test failure on test suite lustre-rsy... Resolved
is duplicated by LU-3911 Failure on test suite sanityn test_14a Resolved
Related
is related to LU-2869 extended attribute cache for Lustre Resolved
is related to LU-3713 Failure on test suite sanity test_17k... Resolved
is related to LU-3938 Test failure on test suite conf-sanit... Resolved
Severity: 3
Rank (Obsolete): 9457

 Description   

Using today's master 2.4.52-72-g643e0ae with the extra racer operations from http://review.whamcloud.com/#/c/5936/ I see LDLM hangs and LBUGs in the getxattr handler.

# MOUNT_2=y llmount.sh
# sh ./lustre/tests/racer.sh

LNet: Service thread pid 16149 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 16149, comm: mdt01_004

# cat /proc/16149/stack
[<ffffffffa04ed641>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
[<ffffffffa07f912d>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc]
[<ffffffffa07f8870>] ldlm_cli_enqueue_local+0x1f0/0x5c0 [ptlrpc]
[<ffffffffa0cd3c0b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
[<ffffffffa0cd4484>] mdt_object_lock+0x14/0x20 [mdt]
[<ffffffffa0cd4520>] mdt_intent_getxattr+0x90/0x160 [mdt]
[<ffffffffa0cd111e>] mdt_intent_policy+0x3ae/0x770 [mdt]
[<ffffffffa07d9471>] ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
[<ffffffffa080219f>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
[<ffffffffa0cd15e6>] mdt_enqueue+0x46/0xe0 [mdt]
[<ffffffffa0cd7db7>] mdt_handle_common+0x647/0x16d0 [mdt]
[<ffffffffa0d12d25>] mds_regular_handle+0x15/0x20 [mdt]
[<ffffffffa0831fd8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
[<ffffffffa083335d>] ptlrpc_main+0xabd/0x1700 [ptlrpc]
[<ffffffff81096936>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff
LustreError: 11483:0:(mdt_xattr.c:178:mdt_getxattr()) ASSERTION( lu_object_assert_exists(&info->mti_object->mot_obj) ) failed: 
LustreError: 11483:0:(mdt_xattr.c:178:mdt_getxattr()) LBUG
Pid: 11483, comm: mdt00_005

crash> bt
PID: 11483  TASK: ffff8801f2651540  CPU: 0   COMMAND: "mdt00_005"
 #0 [ffff8801feb478f8] machine_kexec at ffffffff81035c9b
 #1 [ffff8801feb47958] crash_kexec at ffffffff810c0d22
 #2 [ffff8801feb47a28] panic at ffffffff8150eeff
 #3 [ffff8801feb47aa8] lbug_with_loc at ffffffffa0c08eeb [libcfs]
 #4 [ffff8801feb47ac8] mdt_getxattr at ffffffffa06ad35f [mdt]
 #5 [ffff8801feb47b68] mdt_intent_getxattr at ffffffffa068a52e [mdt]
 #6 [ffff8801feb47bb8] mdt_intent_policy at ffffffffa068711e [mdt]
 #7 [ffff8801feb47bf8] ldlm_lock_enqueue at ffffffffa0e6e471 [ptlrpc]
 #8 [ffff8801feb47c58] ldlm_handle_enqueue0 at ffffffffa0e9719f [ptlrpc]
 #9 [ffff8801feb47cc8] mdt_enqueue at ffffffffa06875e6 [mdt]
#10 [ffff8801feb47ce8] mdt_handle_common at ffffffffa068ddf7 [mdt]
#11 [ffff8801feb47d38] mds_regular_handle at ffffffffa06c8ca5 [mdt]
#12 [ffff8801feb47d48] ptlrpc_server_handle_request at ffffffffa0ec6fd8 [ptlrpc]
#13 [ffff8801feb47e48] ptlrpc_main at ffffffffa0ec835d [ptlrpc]
#14 [ffff8801feb47ee8] kthread at ffffffff81096936
#15 [ffff8801feb47f48] kernel_thread at ffffffff8100c0ca

We can't keep adding IBITs without defining some ordering or hierarchy for when we use them.

00010000:00010000:6.0:1375198422.885704:0:5685:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0x200000401:0x8:0x0].0 (ffff8801eceee480) refcount = 8
00010000:00010000:6.0:1375198422.885707:0:5685:0:(ldlm_resource.c:1451:ldlm_resource_dump()) Granted locks (in reverse order):
00010000:00010000:6.0:1375198422.885708:0:5685:0:(ldlm_resource.c:1454:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801e9cc9d00/0x132cecfff5acfc2f lrc: 3/0,0 mode: PW/PW res: [0x200000401:0x8:0x0].0 bits 0x20 rrc: 8 type: IBT flags: 0x60200000000020 nid: 0@lo remote: 0x132cecfff5acfc05 expref: 30 pid: 3274 timeout: 4294866401 lvb_type: 0
00010000:00010000:6.0:1375198422.885715:0:5685:0:(ldlm_resource.c:1454:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801eeae4d40/0x132cecfff5acf6e1 lrc: 2/0,0 mode: CR/CR res: [0x200000401:0x8:0x0].0 bits 0x9 rrc: 8 type: IBT flags: 0x40200000000000 nid: 0@lo remote: 0x132cecfff5acf6c5 expref: 30 pid: 5252 timeout: 0 lvb_type: 0
00010000:00010000:6.0:1375198422.885721:0:5685:0:(ldlm_resource.c:1454:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801ec5246c0/0x132cecfff5acebf8 lrc: 2/0,0 mode: CR/CR res: [0x200000401:0x8:0x0].0 bits 0x9 rrc: 8 type: IBT flags: 0x40200000000000 nid: 0@lo remote: 0x132cecfff5acebdc expref: 39 pid: 3274 timeout: 0 lvb_type: 0
00010000:00010000:6.0:1375198422.885727:0:5685:0:(ldlm_resource.c:1454:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801eb589280/0x132cecfff5ace354 lrc: 2/0,0 mode: CR/CR res: [0x200000401:0x8:0x0].0 bits 0x8 rrc: 8 type: IBT flags: 0x40000000000000 nid: 0@lo remote: 0x132cecfff5ace34d expref: 30 pid: 3810 timeout: 0 lvb_type: 3
00010000:00010000:6.0:1375198422.885734:0:5685:0:(ldlm_resource.c:1469:ldlm_resource_dump()) Waiting locks:
00010000:00010000:6.0:1375198422.885736:0:5685:0:(ldlm_resource.c:1471:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801eea356c0/0x132cecfff5ad0e42 lrc: 3/1,0 mode: --/PR res: [0x200000401:0x8:0x0].0 bits 0x20 rrc: 8 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 5306 timeout: 0 lvb_type: 0
00010000:00010000:6.0:1375198422.885742:0:5685:0:(ldlm_resource.c:1471:ldlm_resource_dump()) ### ### ns: mdt-lustre-MDT0000_UUID lock: ffff8801e95e8d00/0x132cecfff5ad0f1b lrc: 3/0,1 mode: --/EX res: [0x200000401:0x8:0x0].0 bits 0x2 rrc: 8 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 3274 timeout: 0 lvb_type: 0
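
For reference when reading the dump, the "bits" fields decode against the MDS inodelock bits. A minimal sketch of the constants, assuming the Lustre 2.5 era definitions (so 0x9 above is LOOKUP|LAYOUT, and 0x20 is the XATTR bit added for the xattr cache):

enum mds_ibits {
        MDS_INODELOCK_LOOKUP = 0x01,    /* name/existence of the entry */
        MDS_INODELOCK_UPDATE = 0x02,    /* inode attributes, e.g. size, ctime */
        MDS_INODELOCK_OPEN   = 0x04,    /* open file handles */
        MDS_INODELOCK_LAYOUT = 0x08,    /* file layout */
        MDS_INODELOCK_PERM   = 0x10,    /* permissions */
        MDS_INODELOCK_XATTR  = 0x20,    /* extended attributes */
};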


 Comments   
Comment by Andrew Perepechko [ 31/Jul/13 ]

> We can't keep adding IBITs without defining some ordering or hierarchy for when we use them.

That supposedly means that the list below shows a deadlock or something of that kind.

However, nothing prevents EX from being granted. After the EX lock is granted and setxattr succeeds, the client will release the PW lock and the PR lock request can succeed.

Comment by John Hammond [ 31/Jul/13 ]

Try it and see. Or you can just do something like:

MOUNT_2=y llmount.sh
cd /mnt/lustre
while true; do setfattr -n user.ATTR -v LULZ f0; done &
cd /mnt/lustre2
while true; do getfattr -n user.ATTR f0; done &

Soon both handlers will hang and you'll have a resource that looks like:

Resource: [0x200000400:0x6f:0x0].0 (ffff880194bb20c0) refcount = 6
Granted locks (in reverse order):
### ### ns: mdt-lustre-MDT0000_UUID lock: ffff880187b3e2c0/0x20a0719b52891d17 lrc: 3/0,0 mode: PW/PW res: [0x200000400:0x6f:0x0].0 bits 0x20 rrc: 6 type: IBT flags: 0x60200000000020 nid: 0@lo remote: 0x20a0719b52891d09 expref: 6 pid: 20067 timeout: 4296110059 lvb_type: 0
### ### ns: mdt-lustre-MDT0000_UUID lock: ffff88018fbe3d40/0x20a0719b52891bdc lrc: 2/0,0 mode: CR/CR res: [0x200000400:0x6f:0x0].0 bits 0x8 rrc: 6 type: IBT flags: 0x40000000000000 nid: 0@lo remote: 0x20a0719b52891bd5 expref: 8 pid: 17987 timeout: 0 lvb_type: 3
Waiting locks:
### ### ns: mdt-lustre-MDT0000_UUID lock: ffff880194bfd3c0/0x20a0719b52891d2c lrc: 3/1,0 mode: --/PR res: [0x200000400:0x6f:0x0].0 bits 0x20 rrc: 6 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 17987 timeout: 0 lvb_type: 0
### ### ns: mdt-lustre-MDT0000_UUID lock: ffff88018f920180/0x20a0719b52891d33 lrc: 3/0,1 mode: --/EX res: [0x200000400:0x6f:0x0].0 bits 0x2 rrc: 6 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 20067 timeout: 0 lvb_type: 0

Indeed, the EX lock was added to the waiting list to wait for cancellation of some other lock, and it stays on the list even after that lock has been canceled. But the EX lock will not be granted, since the head of the waiting list is blocked. Please see http://review.whamcloud.com/#/c/7148/ for some discussion.
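
Two LDLM rules are in play here, and a standalone toy model may make them concrete (an illustration, not Lustre code; it mirrors the resource above after the CR locks are gone): ibits locks conflict only when both the lock modes conflict and the bit masks intersect, and, as noted, waiters are reprocessed in queue order, stopping at the first one still blocked.

#include <stdbool.h>
#include <stdio.h>

enum mode { CR, PR, PW, EX };

/* Simplified subset of the LDLM mode compatibility matrix (1 = compatible). */
static const bool mode_compat[4][4] = {
        /*        CR PR PW EX */
        /* CR */ { 1, 1, 1, 0 },
        /* PR */ { 1, 1, 0, 0 },
        /* PW */ { 1, 0, 0, 0 },
        /* EX */ { 0, 0, 0, 0 },
};

struct ibits_lock { enum mode m; unsigned bits; };

/* ibits locks conflict only if the modes conflict AND the bits intersect. */
static bool compat(const struct ibits_lock *a, const struct ibits_lock *b)
{
        return mode_compat[a->m][b->m] || !(a->bits & b->bits);
}

int main(void)
{
        struct ibits_lock pw_xattr  = { PW, 0x20 }; /* granted: setxattr intent  */
        struct ibits_lock pr_xattr  = { PR, 0x20 }; /* waiter 1: getxattr intent */
        struct ibits_lock ex_update = { EX, 0x02 }; /* waiter 2: MDT setxattr    */

        /* The head waiter conflicts with the granted lock (PR/PW modes
         * conflict and the XATTR bits intersect)... */
        printf("PR XATTR  vs PW XATTR: %s\n",
               compat(&pr_xattr, &pw_xattr) ? "compatible" : "BLOCKED");
        /* ...while the second waiter is compatible with everything granted
         * (disjoint bits), yet queue reprocessing stops at the blocked
         * head, so the EX UPDATE lock is never granted. */
        printf("EX UPDATE vs PW XATTR: %s\n",
               compat(&ex_update, &pw_xattr) ? "compatible" : "BLOCKED");
        return 0;
}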

Comment by Andrew Perepechko [ 31/Jul/13 ]

Right. Thanks for the detailed explanation, John.
By the way, how easily can you reproduce the crash?

Comment by John Hammond [ 31/Jul/13 ]

It reproduces within seconds of starting the two loops.

Comment by Andrew Perepechko [ 31/Jul/13 ]

How could you see the deadlock then?

Comment by Andrew Perepechko [ 31/Jul/13 ]

Never mind, I was able to reproduce it on my hardware.

Comment by Andrew Perepechko [ 31/Jul/13 ]

http://review.whamcloud.com/#/c/7191/1 for the crash part

Comment by Andrew Perepechko [ 31/Jul/13 ]

Using http://review.whamcloud.com/#/c/7148/ and http://review.whamcloud.com/#/c/7191/ seems to make racer.sh pass (no traces of eviction or crashes for a few runs).

John, is there a chance that 7148 will be landed or should locking be redesigned anyway?

Comment by John Hammond [ 31/Jul/13 ]

Hi Andrew, I was just updating the ticket when I saw your comment.

> It reproduces within seconds of starting the two loops.

Sorry, I misread your question. The hang reproduces within seconds of starting the two loops. For the crash, use http://review.whamcloud.com/#/c/5936/ and run racer. Four times out of five it will hang; the other time it will crash. I pulled your patch for the crash and it seems OK.

> John, is there a chance that 7148 will be landed or should locking be redesigned anyway?

Not sure. I'm working to get reviewers on it.

I'm still concerned that 7148 + 7191 are not enough. Consider the following case:

# MOUNT_2=y llmount.sh
# cd /mnt/lustre
# while true; do touch f0; rm f0; done &
# cd /mnt/lustre2
# while true; do setfattr -n user.ATTR -v LULZ f0; done &

Eventually the setxattr gets a PW XATTR lock, the unlink enqueues a blocked EX FULL lock, and the setxattr enqueues a blocked EX UPDATE lock. Sorry I didn't mention it earlier; I only thought of it this morning.
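
Written out with the same ibits masks, the cycle would look like this (a reconstruction of the scenario, not an actual dump):

granted:  PW XATTR  (0x20)  client lock held for the setfattr
waiting:  EX FULL   (all)   unlink: EX vs PW modes conflict, bits intersect 0x20
waiting:  EX UPDATE (0x02)  MDT setxattr handler: queued behind the EX FULL
                            waiter (EX/EX modes conflict, bits intersect), while
                            the PW XATTR lock cannot be cancelled until this
                            same setxattr completes; the three form a cycle.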

Comment by Andrew Perepechko [ 31/Jul/13 ]

> The hang reproduces within seconds of starting the two loops.

OK, I see.

> I pulled your patch for the crash and it seems OK.

Great! Thanks a lot for the feedback.

> Eventually the setxattr gets a PW XATTR lock, the unlink enqueues a blocked EX FULL lock, and the setxattr enqueues a blocked EX UPDATE lock. Sorry I didn't mention it earlier; I only thought of it this morning.

Hm, right. It looks like nested LDLM locking is often a headache.

Comment by Andrew Perepechko [ 01/Aug/13 ]

We've discussed the issue with our team, and we think the following might be a solution:

1. UPDATE locking should probably be removed from the MDT setxattr path, because it does not seem to protect any data [needs a bit more investigation].
2. POSIX ACL ACCESS data should be filtered out of the xattr cache.
3. getxattr and setxattr on POSIX ACLs should use the old (non-xattr-cache) code path.
4. listxattr should use the xattr cache, and additionally list XATTR_NAME_ACL_ACCESS ("system.posix_acl_access") if it finds that lli_posix_acl is initialized.

This splits xattr cache and ACL cache and so splits XATTR and LOOKUP locking.
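
A minimal sketch of the filter in item 2, with a hypothetical helper name (the real patch may hook this differently):

#include <stdbool.h>
#include <string.h>

#define XATTR_NAME_ACL_ACCESS "system.posix_acl_access"

/* Hypothetical client-side predicate: names for which this returns false
 * are never stored in the xattr cache, so getxattr/setxattr on POSIX ACLs
 * keep taking the old uncached path (items 2 and 3 above). */
static bool ll_xattr_cache_accepts(const char *name)
{
        return strcmp(name, XATTR_NAME_ACL_ACCESS) != 0;
}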

Comment by Andreas Dilger [ 01/Aug/13 ]

Panda, I think your proposal to separate ACL xattrs from the xattr cache is a reasonable one.

The other thing that would reduce lock contention would be to grant the PR lock only when the object is not under contention, as is done in mdt_getattr_name_lock() for the UPDATE lock. This allows clients to cache readdir data and negative dentries for static directories, but does not cause lock ping-pong when the directory is being modified:

                        /* If the file has not been changed for some time, we
                         * return not only a LOOKUP lock, but also an UPDATE
                         * lock and this might save us RPC on later STAT. For
                         * directories, it also let negative dentry starts
                         * working for this dir. */
                        if (ma->ma_valid & MA_INODE &&
                            ma->ma_attr.la_valid & LA_CTIME &&
                            info->mti_mdt->mdt_namespace->ns_ctime_age_limit +
                                ma->ma_attr.la_ctime < cfs_time_current_sec())
                                child_bits |= MDS_INODELOCK_UPDATE;

Doing something similar for the xattr cache would give, e.g., Samba servers a write-through cache, normal clients a read cache for most files, and no cache for frequently modified files (which is no worse than what we have today).

NB - if you change code near here, please s/dentry starts/dentry cache start/ in the above comment.
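
For concreteness, a minimal sketch of that "something similar" for the xattr bit, reusing the fields from the snippet above (an illustration only, not the landed change):

                        /* Hypothetical: return a cacheable xattr lock only
                         * when the inode has been idle longer than
                         * ns_ctime_age_limit; frequently modified files get
                         * no XATTR ibit and fall back to uncached getxattr. */
                        if (ma->ma_valid & MA_INODE &&
                            ma->ma_attr.la_valid & LA_CTIME &&
                            info->mti_mdt->mdt_namespace->ns_ctime_age_limit +
                                ma->ma_attr.la_ctime < cfs_time_current_sec())
                                child_bits |= MDS_INODELOCK_XATTR;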

Comment by Andreas Dilger [ 21/Aug/13 ]

http://review.whamcloud.com/7208

Comment by Jodi Levi (Inactive) [ 10/Sep/13 ]

With change 7208 landing to master, can this ticket be closed or reduced from blocker?

Comment by John Hammond [ 10/Sep/13 ]

Both patches landed to master.
