[LU-2895] recovery-small 24a: osp_precreate_get_fid(): ASSERTION( lu_fid_diff(&d->opd_pre_used_fid, &d->opd_pre_last_created_fid) < 0 ) failed: next fid [0x0:0x1:0x0] last created fid [0x0:0x1:0x0] Created: 01/Mar/13  Updated: 25/Mar/13  Resolved: 25/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: LB

Severity: 3
Rank (Obsolete): 6979

 Description   

This issue was created by maloo for Li Wei <liwei@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/0c78ac7a-81ad-11e2-9f6b-52540035b04c.

The sub-test test_24a failed with the following error:

test failed to respond and timed out

Info required for matching: recovery-small 24a

First, the test failed with return code 1. Second, the MDS hit an assertion failure afterward:

12:39:11:Lustre: DEBUG MARKER: == recovery-small test 24a: fsync error (should return error) == 12:39:10 (1361997550)
12:39:36:Lustre: lustre-OST0000-osc-MDT0000: slow creates, last=[0x0:0x1:0x0], next=[0x0:0x1:0x0], reserved=0, syn_changes=0, syn_rpc_in_progress=0, status=0
12:39:36:LustreError: 18701:0:(osp_precreate.c:1164:osp_precreate_get_fid()) ASSERTION( lu_fid_diff(&d->opd_pre_used_fid, &d->opd_pre_last_created_fid) < 0 ) failed: next fid [0x0:0x1:0x0] last created fid [0x0:0x1:0x0]
12:39:36:LustreError: 18701:0:(osp_precreate.c:1164:osp_precreate_get_fid()) LBUG
12:39:36:Pid: 18701, comm: mdt00_000
12:39:36:
12:39:37:Call Trace:
12:39:37: [<ffffffffa06f1895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
12:39:37: [<ffffffffa06f1e97>] lbug_with_loc+0x47/0xb0 [libcfs]
12:39:37: [<ffffffffa0ba4f63>] osp_precreate_get_fid+0x253/0x310 [osp]
12:39:37: [<ffffffffa07022e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
12:39:37: [<ffffffffa0ba28cb>] osp_object_assign_fid+0x7b/0x160 [osp]
12:39:37: [<ffffffffa0ba30b6>] osp_object_create+0x406/0x890 [osp]
12:39:38: [<ffffffffa07022e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
12:39:38: [<ffffffffa0b5e683>] lod_striping_create+0x193/0x320 [lod]
12:39:38: [<ffffffffa0b5e983>] lod_object_create+0x173/0x260 [lod]
12:39:38: [<ffffffffa0a1a580>] mdd_object_create_internal+0xe0/0x220 [mdd]
12:39:38: [<ffffffffa0a2d22c>] mdd_create+0x77c/0x11f0 [mdd]
12:39:38: [<ffffffffa07022e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
12:39:38: [<ffffffffa0ac0ef9>] mdt_reint_open+0x12b9/0x1ab0 [mdt]
12:39:39: [<ffffffffa07022e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
12:39:39: [<ffffffffa0aab7f1>] mdt_reint_rec+0x41/0xe0 [mdt]
12:39:39: [<ffffffffa0aa4e53>] mdt_reint_internal+0x4e3/0x7d0 [mdt]
12:39:39: [<ffffffffa0aa540d>] mdt_intent_reint+0x1ed/0x4f0 [mdt]
12:39:39: [<ffffffffa0aa0fde>] mdt_intent_policy+0x3ae/0x750 [mdt]
12:39:39: [<ffffffffa0fb0351>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
12:39:39: [<ffffffffa0fd6447>] ldlm_handle_enqueue0+0x4f7/0x1080 [ptlrpc]
12:39:39: [<ffffffffa0aa14b6>] mdt_enqueue+0x46/0x110 [mdt]
12:39:39: [<ffffffffa0a96078>] mdt_handle_common+0x628/0x1620 [mdt]
12:39:39: [<ffffffffa0acd845>] mds_regular_handle+0x15/0x20 [mdt]
12:39:39: [<ffffffffa100802c>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
12:39:39: [<ffffffffa06f25de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
12:39:39: [<ffffffffa0fff759>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
12:39:39: [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
12:39:39: [<ffffffffa1009576>] ptlrpc_main+0xb76/0x1870 [ptlrpc]
12:39:39: [<ffffffffa1008a00>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
12:39:39: [<ffffffff8100c0ca>] child_rip+0xa/0x20
12:39:39: [<ffffffffa1008a00>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
12:39:40: [<ffffffffa1008a00>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
12:39:40: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
12:39:40:
12:39:40:Kernel panic - not syncing: LBUG


 Comments   
Comment by Peter Jones [ 11/Mar/13 ]

Niu

Could you please look into this one?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 12/Mar/13 ]

Looks the osp_precreate_reserve() should timedout:

12:39:36:Lustre: lustre-OST0000-osc-MDT0000: slow creates, last=[0x0:0x1:0x0], next=[0x0:0x1:0x0], reserved=0, syn_changes=0, syn_rpc_in_progress=0, status=0

but the osp_precreate_reserve() didn't return proper error for such case:

                if (cfs_time_aftereq(cfs_time_current(), expire))
                        break;

The mds debug log is missed, so I'm not sure the exact reason of precreation isn't done (even the seq hasn't been initialized), but at least, we should return proper error in such case.

Comment by Niu Yawei (Inactive) [ 12/Mar/13 ]

fix the LASSERT problem: http://review.whamcloud.com/5683

Comment by Alex Zhuravlev [ 12/Mar/13 ]

why rc taken from is not enough ?

Comment by Niu Yawei (Inactive) [ 12/Mar/13 ]

Alex, the opd_pre_status was intialized as -EAGAIN, and it could be cleared to 0 in osp_precreate_thread() -> osp_statfs_update(), and the following osp_precreate_cleanup_orphans() wasn't success somehow, then we'll end up in above state: uninitialized opd_pre_last_created_fid & opd_pre_used_fid, opd_pre_status == 0, and osp_precreate_reserve() is possbile timedout with rc == 0.

Comment by Alex Zhuravlev [ 13/Mar/13 ]

inability to setup FLD client or cleanup orphans should block any activity on this OSP, I'd think.

Comment by Peter Jones [ 25/Mar/13 ]

Landed for 2.4

Generated at Sat Feb 10 01:29:08 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.