[LU-2056] SWL - osd_trans_start()) ASSERTION( oti->oti_w_locks == 0 ) failed: Created: 30/Sep/12 Updated: 19/Apr/13 Resolved: 14/Jan/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Cliff White (Inactive) | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB | ||
| Environment: |
LLNL/Hyperion lustre-reviews 9573 |
||
| Severity: | 3 |
| Rank (Obsolete): | 4299 |
| Description |
<ConMan> Console [hyperion-rst6] log at 2012-09-30 16:00:00 PDT. 2012-09-30 16:20:23 [A[B[1;2B[1;2B[1;2B[1;2B[1;2B[B[B[B[B[B[BLustre: 4162:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1349048656/real 1349048656] req@ffff880176df8c00 x1414547178955781/t0(0) o400->MGC192.168.127.6@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1349048762 ref 1 fl Rpc:RXN/0/ffffffff rc 0/-1 2012-09-30 16:46:02 LustreError: 166-1: MGC192.168.127.6@o2ib: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail 2012-09-30 16:46:02 Lustre: MGS: Client b825c79a-32bd-3ab8-c5fa-b66aeb9bb741 (at 0@lo) reconnecting 2012-09-30 16:46:02 LustreError: 4242:0:(obd_class.h:527:obd_set_info_async()) obd_set_info_async: dev 0 no operation 2012-09-30 16:47:09 Lustre: MGC192.168.127.6@o2ib: Reactivating import 2012-09-30 16:47:09 Lustre: MGC192.168.127.6@o2ib: Connection restored to MGS (at 0@lo) <ConMan> Console [hyperion-rst6] log at 2012-09-30 17:00:00 PDT. 2012-09-30 17:02:38 LustreError: 6069:0:(osd_handler.c:837:osd_trans_start()) ASSERTION( oti->oti_w_locks == 0 ) failed: 2012-09-30 17:02:38 LustreError: 6069:0:(osd_handler.c:837:osd_trans_start()) LBUG 2012-09-30 17:02:38 Pid: 6069, comm: mdt_rdpg03_002 2012-09-30 17:02:38 For some reason, stack did not dump, but vmcore obtained crash> bt
PID: 6069 TASK: ffff880141836aa0 CPU: 7 COMMAND: "mdt_rdpg03_002"
#0 [ffff880156e759c8] machine_kexec at ffffffff8103281b
#1 [ffff880156e75a28] crash_kexec at ffffffff810ba792
#2 [ffff880156e75af8] panic at ffffffff814fd591
#3 [ffff880156e75b78] lbug_with_loc at ffffffffa0393f6b [libcfs]
#4 [ffff880156e75b98] osd_trans_start at ffffffffa0a8d2bc [osd_ldiskfs]
#5 [ffff880156e75bd8] mdd_trans_start at ffffffffa0f043a4 [mdd]
#6 [ffff880156e75be8] mdd_close at ffffffffa0edf4f6 [mdd]
#7 [ffff880156e75c58] cml_close at ffffffffa06baef6 [cmm]
#8 [ffff880156e75c88] mdt_mfd_close at ffffffffa0f8a18e [mdt]
#9 [ffff880156e75ce8] mdt_close at ffffffffa0f8ae0a [mdt]
#10 [ffff880156e75d38] mdt_handle_common at ffffffffa0f66802 [mdt]
#11 [ffff880156e75d88] mdt_readpage_handle at ffffffffa0f676d5 [mdt]
#12 [ffff880156e75d98] ptlrpc_server_handle_request at ffffffffa0966b3c [ptlrpc]
#13 [ffff880156e75e98] ptlrpc_main at ffffffffa0968111 [ptlrpc]
#14 [ffff880156e75f48] kernel_thread at ffffffff8100c14a
|
| Comments |
| Comment by Cliff White (Inactive) [ 30/Sep/12 ] |
|
VMcore is at ~cliffw/lu2056/vmcore-rst6.01Oct on brent |
| Comment by Cliff White (Inactive) [ 30/Sep/12 ] |
|
vmcore is at ~cliffw/lu2056/vmcore-rst6.01Oct on brent |
| Comment by Liang Zhen (Inactive) [ 01/Oct/12 ] |
|
comment in osd_trans_start() /*
* XXX: current rule is that we first start tx,
* then lock object(s), but we can't use
* this rule for data (due to locking specifics
* in ldiskfs). also in long-term we'd like to
* use usually-used (locks;tx) ordering. so,
* UGLY thing is that we'll use one ordering for
* data (ofd) and reverse ordering for metadata
* (mdd). then at some point we'll fix the latter
*/
if (dev->od_is_md) {
LASSERT(oti->oti_r_locks == 0);
LASSERT(oti->oti_w_locks == 0);
}
But code in mdd_close() .....
2457 mdd_write_lock(env, mdd_obj, MOR_TGT_CHILD);
2458 if (handle == NULL && mdd_obj->mod_count == 1 &&
2459 (mdd_obj->mod_flags & ORPHAN_OBJ) != 0) {
2460 mdd_write_unlock(env, mdd_obj);
2461 goto again;
2462 }
......
2503 if (handle == NULL) {
2504 handle = mdd_trans_create(env, mdo2mdd(obj) );
2505 if (IS_ERR(handle))
2506 GOTO(out, rc = PTR_ERR(handle));
2507
2508 rc = mdd_declare_object_kill(env, mdd_obj, ma,
2509 handle);
2510 if (rc)
2511 GOTO(out, rc);
2512
2513 rc = mdd_declare_changelog_store(env, mdd,
2514 NULL, hand le);
2515 if (rc)
2516 GOTO(stop, rc);
2517
2518 rc = mdd_trans_start(env, mdo2mdd(obj), han dle);
We start transaction after locking the object, which violates the rule set in osd_trans_start(). |
| Comment by Liang Zhen (Inactive) [ 01/Oct/12 ] |
|
I've posted a patch for this: http://review.whamcloud.com/4145 , it's better to try it after it's reviewed. |
| Comment by Liang Zhen (Inactive) [ 02/Oct/12 ] |
|
I made some changes for wangdi's review and rebased previous patch to 2.3 (previous one is for master), unfortunately I lost history of it by doing this. New patch is: http://review.whamcloud.com/#change,4152 |
| Comment by Liang Zhen (Inactive) [ 08/Oct/12 ] |
|
patch landed to 2.3 so it's not a 2.3 blocker anymore, but still need to fix it for master |
| Comment by Hongchao Zhang [ 05/Nov/12 ] |
|
the patch against master is tracked at http://review.whamcloud.com/#change,4461 |
| Comment by Oleg Drokin [ 11/Dec/12 ] |
|
I seem to be hitting this (in form of |
| Comment by Liang Zhen (Inactive) [ 14/Dec/12 ] |
|
reassign to hongchao because he ported the patch to master |
| Comment by Jodi Levi (Inactive) [ 11/Jan/13 ] |
|
Since Change, 4461 has landed, can this ticket be closed? |
| Comment by Jodi Levi (Inactive) [ 14/Jan/13 ] |
|
All patches are landed so closing ticket. Please reopen if more work is needed. |