[LU-3453] LBUG in mdt_intent_opc(), (layout.c:1916:__req_capsule_get()) ASSERTION( msg != ((void *)0) ) failed Created: 11/Jun/13  Updated: 13/Jun/13  Resolved: 13/Jun/13

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.5.0

Type: Bug
Priority: Blocker
Reporter: John Hammond
Assignee: Li Wei (Inactive)
Resolution: Fixed
Votes: 0
Labels: mdt

Severity: 3
Rank (Obsolete): 8634

 Description   

On 2.4.50-56-gc42672d, I'm seeing an LBUG in the call to req_capsule_server_get() from mdt_intent_opc(). It happens right after mdt_intent_layout() returns -ESTALE (rc = -116 in the debug log below).

static int mdt_intent_opc(long itopc, struct mdt_thread_info *info,
                          struct ldlm_lock **lockp, __u64 flags)
{
...
        if (rc == 0 && flv->it_act != NULL) {
                struct ldlm_reply *rep;

                /* execute policy */
                rc = flv->it_act(opc, info, lockp, flags);

                rep = req_capsule_server_get(pill, &RMF_DLM_REP);
                rep->lock_policy_res2 =
                        ptlrpc_status_hton(rep->lock_policy_res2);
        } else {
                rc = -EOPNOTSUPP;
        }
        RETURN(rc);
}

00000004:00000001:3.0:1370903334.569591:0:745:0:(mdt_handler.c:5048:mdt_object_free()) Process leaving
00000020:00000001:3.0:1370903334.569593:0:745:0:(lu_object.c:238:lu_object_alloc()) Process leaving (rc=18446744073709551500 : -116 : ffffffffffffff8c)
00000004:00000001:3.0:1370903334.569596:0:745:0:(mdt_handler.c:2388:mdt_object_find()) Process leaving (rc=18446744073709551500 : -116 : ffffffffffffff8c)
00000004:00000001:3.0:1370903334.569598:0:745:0:(mdt_handler.c:3757:mdt_intent_layout()) Process leaving (rc=18446744073709551500 : -116 : ffffffffffffff8c)
00000100:00040000:3.0:1370903334.569602:0:745:0:(layout.c:1916:__req_capsule_get()) ASSERTION( msg != ((void *)0) ) failed: 
00000100:00040000:3.0:1370903334.572661:0:745:0:(layout.c:1916:__req_capsule_get()) LBUG
Lustre: DEBUG MARKER: == sanity test 34h: ftruncate file under grouplock should not block == 17:28:54 (1370903334)
LustreError: 745:0:(layout.c:1916:__req_capsule_get()) ASSERTION( msg != ((void *)0) ) failed: 
LustreError: 745:0:(layout.c:1916:__req_capsule_get()) LBUG
Pid: 745, comm: mdt01_000

Call Trace:
 [<ffffffffa0f5b895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0f5be97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa1232de2>] __req_capsule_get+0x632/0x700 [ptlrpc]
 [<ffffffffa0f66d88>] ? libcfs_log_return+0x28/0x40 [libcfs]
 [<ffffffffa0f66d88>] ? libcfs_log_return+0x28/0x40 [libcfs]
 [<ffffffffa1232fb8>] req_capsule_server_get+0x18/0x20 [ptlrpc]
 [<ffffffffa06faf71>] mdt_intent_policy+0x3d1/0x760 [mdt]
 [<ffffffffa11c23f1>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
 [<ffffffffa11e939f>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
 [<ffffffffa06fb406>] mdt_enqueue+0x46/0xe0 [mdt]
 [<ffffffffa0701af8>] mdt_handle_common+0x648/0x1660 [mdt]
 [<ffffffffa073b185>] mds_regular_handle+0x15/0x20 [mdt]
 [<ffffffffa121b6a8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
 [<ffffffffa0f5c5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
 [<ffffffffa0f6dd8f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
 [<ffffffffa1212a09>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
 [<ffffffffa0f6c2c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffff81055ab3>] ? __wake_up+0x53/0x70
 [<ffffffffa121ca3e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
 [<ffffffffa121bf70>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffffa121bf70>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffffa121bf70>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
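
A note on reading the "Process leaving (rc=...)" lines in the debug log above: each one shows the same return code three ways, as an unsigned 64-bit value, as a signed value, and in hex. The sketch below reproduces that mapping for the -116 seen here (116 is ESTALE on Linux). The helper names rc_from_trace() and format_rc() are made up for this illustration and are not Lustre functions.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Reinterpret the unsigned value printed in a "Process leaving (rc=...)"
 * line as a signed return code. Assumes a two's-complement platform,
 * where this conversion wraps as expected.
 */
static int64_t rc_from_trace(uint64_t raw)
{
        return (int64_t)raw;
}

/*
 * Format a return code as "unsigned : signed : hex", mirroring the
 * layout of the log lines. Returns the snprintf() result.
 */
static int format_rc(char *buf, size_t len, int64_t rc)
{
        return snprintf(buf, len, "%" PRIu64 " : %" PRId64 " : %" PRIx64,
                        (uint64_t)rc, rc, (uint64_t)rc);
}
```

Passing 18446744073709551500 to rc_from_trace() gives -116, which matches the value mdt_object_find() and mdt_intent_layout() report just before the assertion fires.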

It's easily reproduced by running sanityn.sh in a loop.



 Comments   
Comment by Nathaniel Clark [ 11/Jun/13 ]

Seen in review on post-2.4 master, in sanity test_63b:
https://maloo.whamcloud.com/test_sets/0fd8ac52-d271-11e2-b1c7-52540035b04c
Also in the same test run, in recovery-small test_6:
https://maloo.whamcloud.com/test_sets/88da2792-d272-11e2-b1c7-52540035b04c

Comment by Sarah Liu [ 11/Jun/13 ]

Hit the same failure on the master branch in sanity test_118f:

https://maloo.whamcloud.com/test_sets/c1fc9f94-d161-11e2-9675-52540035b04c

Comment by Li Wei (Inactive) [ 12/Jun/13 ]

I'll investigate this one.

Comment by Andreas Dilger [ 12/Jun/13 ]

Hit again in recovery-small test_6 (failure rate reported at 7%):
https://maloo.whamcloud.com/sub_tests/895c613a-d272-11e2-b1c7-52540035b04c

Comment by Bob Glossman (Inactive) [ 12/Jun/13 ]

I think this may be another instance, but I'm not 100% sure:
https://maloo.whamcloud.com/test_sets/ebafdf7c-d31e-11e2-ace1-52540035b04c

Comment by Andreas Dilger [ 12/Jun/13 ]

Increasing the priority of this bug. I thought it was only a rare failure (only four tests reported LU-3453 in Maloo), but in fact there are a large number of timeouts for recovery-small mislabeled as LU-1890.

Comment by Andreas Dilger [ 12/Jun/13 ]

Also note that this appears to be a new bug, only failing since 2013-06-11, so it may be fastest to look at patches landed on or just before that day to see if something is obviously causing this regression.

Comment by John Hammond [ 12/Jun/13 ]

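It seems that mdt_intent_layout() can return without calling req_capsule_server_pack(). In that case the reply message is never packed, so the req_capsule_server_get() call that follows it_act() in mdt_intent_opc() hits the msg != NULL assertion.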

Please see http://review.whamcloud.com/6617.
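
To make the failure mode concrete, here is a small self-contained model of it. Everything in it is hypothetical: the toy_* types and functions only stand in for the capsule, req_capsule_server_pack(), req_capsule_server_get() and the it_act callback, and the guard shown is just one possible defensive pattern. It is not the change in the review linked above, whose content is not reproduced here.

```c
#include <errno.h>
#include <stddef.h>

/* Toy reply buffer; "converted" marks that post-processing ran. */
struct toy_reply {
        int converted;
};

/* Toy capsule: reply stays NULL until toy_server_pack() is called. */
struct toy_capsule {
        struct toy_reply *reply;
        struct toy_reply storage;
};

/* Stand-in for packing the reply: makes the reply buffer available. */
static int toy_server_pack(struct toy_capsule *pill)
{
        pill->reply = &pill->storage;
        return 0;
}

/* Stand-in for fetching the reply. The real call asserts instead of
 * returning NULL; the toy version returns NULL so callers can check. */
static struct toy_reply *toy_server_get(struct toy_capsule *pill)
{
        return pill->reply;
}

typedef int (*toy_intent_act)(struct toy_capsule *pill);

/* Handler that bails out before packing, like the -ESTALE path. */
static int toy_intent_fail_early(struct toy_capsule *pill)
{
        (void)pill;
        return -ESTALE;
}

/* Handler that packs the reply before doing its work. */
static int toy_intent_ok(struct toy_capsule *pill)
{
        return toy_server_pack(pill);
}

/* Caller in the shape of mdt_intent_opc(): run the handler, then touch
 * the reply only if it was actually packed. */
static int toy_intent_opc(struct toy_capsule *pill, toy_intent_act act)
{
        int rc = act(pill);
        struct toy_reply *rep = toy_server_get(pill);

        if (rep == NULL)        /* handler returned without packing */
                return rc != 0 ? rc : -EINVAL;

        rep->converted = 1;     /* placeholder for reply post-processing */
        return rc;
}
```

With toy_intent_fail_early() the caller returns -ESTALE and never dereferences the missing reply; with toy_intent_ok() the reply is packed and post-processed as usual. The alternative design is to guarantee that every handler packs the reply on all return paths, which keeps the caller unconditional.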

Comment by John Hammond [ 13/Jun/13 ]

Patch landed to master.

Anyone up for an audit of the other 195 calls to req_capsule_server_get()?
