[LU-2056] SWL - osd_trans_start()) ASSERTION( oti->oti_w_locks == 0 ) failed: Created: 30/Sep/12  Updated: 19/Apr/13  Resolved: 14/Jan/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Cliff White (Inactive) Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: HB
Environment:

LLNL/Hyperion lustre-reviews 9573


Severity: 3
Rank (Obsolete): 4299

 Description   
<ConMan> Console [hyperion-rst6] log at 2012-09-30 16:00:00 PDT.
2012-09-30 16:20:23          [A[B[1;2B[1;2B[1;2B[1;2B[1;2B[B[B[B[B[B[BLustre: 4162:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1349048656/real 1349048656]  req@ffff880176df8c00 x1414547178955781/t0(0) o400->MGC192.168.127.6@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1349048762 ref 1 fl Rpc:RXN/0/ffffffff rc 0/-1
2012-09-30 16:46:02 LustreError: 166-1: MGC192.168.127.6@o2ib: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
2012-09-30 16:46:02 Lustre: MGS: Client b825c79a-32bd-3ab8-c5fa-b66aeb9bb741 (at 0@lo) reconnecting
2012-09-30 16:46:02 LustreError: 4242:0:(obd_class.h:527:obd_set_info_async()) obd_set_info_async: dev 0 no operation
2012-09-30 16:47:09 Lustre: MGC192.168.127.6@o2ib: Reactivating import
2012-09-30 16:47:09 Lustre: MGC192.168.127.6@o2ib: Connection restored to MGS (at 0@lo)

<ConMan> Console [hyperion-rst6] log at 2012-09-30 17:00:00 PDT.
2012-09-30 17:02:38 LustreError: 6069:0:(osd_handler.c:837:osd_trans_start()) ASSERTION( oti->oti_w_locks == 0 ) failed: 

2012-09-30 17:02:38 LustreError: 6069:0:(osd_handler.c:837:osd_trans_start()) LBUG
2012-09-30 17:02:38 Pid: 6069, comm: mdt_rdpg03_002
2012-09-30 17:02:38

For some reason, stack did not dump, but vmcore obtained

crash> bt
PID: 6069   TASK: ffff880141836aa0  CPU: 7   COMMAND: "mdt_rdpg03_002"
 #0 [ffff880156e759c8] machine_kexec at ffffffff8103281b
 #1 [ffff880156e75a28] crash_kexec at ffffffff810ba792
 #2 [ffff880156e75af8] panic at ffffffff814fd591
 #3 [ffff880156e75b78] lbug_with_loc at ffffffffa0393f6b [libcfs]
 #4 [ffff880156e75b98] osd_trans_start at ffffffffa0a8d2bc [osd_ldiskfs]
 #5 [ffff880156e75bd8] mdd_trans_start at ffffffffa0f043a4 [mdd]
 #6 [ffff880156e75be8] mdd_close at ffffffffa0edf4f6 [mdd]
 #7 [ffff880156e75c58] cml_close at ffffffffa06baef6 [cmm]
 #8 [ffff880156e75c88] mdt_mfd_close at ffffffffa0f8a18e [mdt]
 #9 [ffff880156e75ce8] mdt_close at ffffffffa0f8ae0a [mdt]
#10 [ffff880156e75d38] mdt_handle_common at ffffffffa0f66802 [mdt]
#11 [ffff880156e75d88] mdt_readpage_handle at ffffffffa0f676d5 [mdt]
#12 [ffff880156e75d98] ptlrpc_server_handle_request at ffffffffa0966b3c [ptlrpc]
#13 [ffff880156e75e98] ptlrpc_main at ffffffffa0968111 [ptlrpc]
#14 [ffff880156e75f48] kernel_thread at ffffffff8100c14a


 Comments   
Comment by Cliff White (Inactive) [ 30/Sep/12 ]

VMcore is at ~cliffw/lu2056/vmcore-rst6.01Oct on brent

Comment by Cliff White (Inactive) [ 30/Sep/12 ]

vmcore is at ~cliffw/lu2056/vmcore-rst6.01Oct on brent

Comment by Liang Zhen (Inactive) [ 01/Oct/12 ]

comment in osd_trans_start()

                /*
                 * XXX: current rule is that we first start tx,
                 *      then lock object(s), but we can't use
                 *      this rule for data (due to locking specifics
                 *      in ldiskfs). also in long-term we'd like to
                 *      use usually-used (locks;tx) ordering. so,
                 *      UGLY thing is that we'll use one ordering for
                 *      data (ofd) and reverse ordering for metadata
                 *      (mdd). then at some point we'll fix the latter
                 */
                if (dev->od_is_md) {
                        LASSERT(oti->oti_r_locks == 0);
                        LASSERT(oti->oti_w_locks == 0);
                }

But code in mdd_close()

             .....
2457         mdd_write_lock(env, mdd_obj, MOR_TGT_CHILD);
2458         if (handle == NULL && mdd_obj->mod_count == 1 &&
2459             (mdd_obj->mod_flags & ORPHAN_OBJ) != 0) {
2460                 mdd_write_unlock(env, mdd_obj);
2461                 goto again;
2462         }
             ......
2503                         if (handle == NULL) {
2504                                 handle = mdd_trans_create(env, mdo2mdd(obj)     );
2505                                 if (IS_ERR(handle))
2506                                         GOTO(out, rc = PTR_ERR(handle));
2507 
2508                                 rc = mdd_declare_object_kill(env, mdd_obj,      ma,
2509                                                              handle);
2510                                 if (rc)
2511                                         GOTO(out, rc);
2512 
2513                                 rc = mdd_declare_changelog_store(env, mdd,
2514                                                                  NULL, hand     le);
2515                                 if (rc)
2516                                         GOTO(stop, rc);
2517 
2518                                 rc = mdd_trans_start(env, mdo2mdd(obj), han     dle);

We start transaction after locking the object, which violates the rule set in osd_trans_start().

Comment by Liang Zhen (Inactive) [ 01/Oct/12 ]

I've posted a patch for this: http://review.whamcloud.com/4145 , it's better to try it after it's reviewed.

Comment by Liang Zhen (Inactive) [ 02/Oct/12 ]

I made some changes for wangdi's review and rebased previous patch to 2.3 (previous one is for master), unfortunately I lost history of it by doing this. New patch is: http://review.whamcloud.com/#change,4152

Comment by Liang Zhen (Inactive) [ 08/Oct/12 ]

patch landed to 2.3 so it's not a 2.3 blocker anymore, but still need to fix it for master

Comment by Hongchao Zhang [ 05/Nov/12 ]

the patch against master is tracked at http://review.whamcloud.com/#change,4461

Comment by Oleg Drokin [ 11/Dec/12 ]

I seem to be hitting this (in form of LU-2443 during racer shutdown) fairly regularly now, please update the patch.

Comment by Liang Zhen (Inactive) [ 14/Dec/12 ]

reassign to hongchao because he ported the patch to master

Comment by Jodi Levi (Inactive) [ 11/Jan/13 ]

Since Change, 4461 has landed, can this ticket be closed?

Comment by Jodi Levi (Inactive) [ 14/Jan/13 ]

All patches are landed so closing ticket. Please reopen if more work is needed.

Generated at Sat Feb 10 01:21:59 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.