[LU-5314] Lustre 2.4.2 MDS hit LBUG and crash Created: 10/Jul/14  Updated: 23/Sep/22  Resolved: 19/Jul/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Haisong Cai (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: sdsc
Environment:

Linux meerkat-mds-10-1.local 2.6.32-358.23.2.el6_lustre.x86_64 #1 SMP Thu Dec 19 19:57:45 PST 2013 x86_64 x86_64 x86_64 GNU/Linux


Issue Links:
Related
is related to LU-2827 mdt_intent_fixup_resent() cannot find... Resolved
is related to LU-11457 osd_oi_insert(): the FID is used by t... Resolved
Severity: 3
Rank (Obsolete): 14850

 Description   

Our MDS hit an LBUG and crashed this evening. Here is the relevant output from /var/log/messages:

Jul 9 18:40:22 meerkat-mds-10-1 kernel: Lustre: meerkat-MDT0000: Client e27741dc-f76c-ea5a-c426-4c6b5e86a758 (at 198.202.118.120@tcp) reconnecting
Jul 9 18:40:24 meerkat-mds-10-1 kernel: Lustre: meerkat-MDT0000: Client 42a919ae-9df5-e771-0e9a-7ee82fdc33d9 (at 198.202.119.106@tcp) reconnecting
Jul 9 18:40:24 meerkat-mds-10-1 kernel: Lustre: Skipped 1 previous similar message
Jul 9 18:40:29 meerkat-mds-10-1 kernel: Lustre: 3457:0:(mdt_handler.c:1338:mdt_getattr_name_lock()) Although resent, but still not get child lockparent:[0x200003fb3:0xaa9f:0x0] child:[0x200003fb3:0xb42e:0x0]
Jul 9 18:40:29 meerkat-mds-10-1 kernel: LustreError: 3457:0:(mdt_handler.c:3568:mdt_intent_lock_replace()) ASSERTION( lustre_msg_get_flags(req->rq_reqmsg) & 0x0002 ) failed:
Jul 9 18:40:29 meerkat-mds-10-1 kernel: LustreError: 3457:0:(mdt_handler.c:3568:mdt_intent_lock_replace()) LBUG
Jul 9 18:40:29 meerkat-mds-10-1 kernel: Pid: 3457, comm: mdt03_003
Jul 9 18:40:29 meerkat-mds-10-1 kernel:
Jul 9 18:40:29 meerkat-mds-10-1 kernel: Call Trace:
Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa02e7895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa02e7e97>] lbug_with_loc+0x47/0xb0 [libcfs]
Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa0c9c3b1>] mdt_intent_lock_replace+0x391/0x400 [mdt]
Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa0cb34c6>] mdt_intent_getattr+0x3b6/0x490 [mdt]
Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa0c9ff1e>] mdt_intent_policy+0x39e/0x720 [mdt]
Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa0575831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa059c1cf>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa0ca03a6>] mdt_enqueue+0x46/0xe0 [mdt]
Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa0ca6a97>] mdt_handle_common+0x647/0x16d0 [mdt]
Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa05beb8c>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa0ce06a5>] mds_regular_handle+0x15/0x20 [mdt]
Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa05ce3a8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa02e85de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa02f9d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa05c5709>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffff81063990>] ? default_wake_function+0x0/0x20
Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa05cf73e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa05cec70>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa05cec70>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa05cec70>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jul 9 18:40:29 meerkat-mds-10-1 kernel:



 Comments   
Comment by John Fuchs-Chesney (Inactive) [ 10/Jul/14 ]

Could you please take a look at this, Niu?
Thanks,
~ jfc.

Comment by Niu Yawei (Inactive) [ 10/Jul/14 ]

mdt_getattr_name_lock() cleared the MSG_RESENT flag:

                        CWARN("Although resent, but still not get child lock"
                              "parent:"DFID" child:"DFID"\n",
                              PFID(mdt_object_fid(parent)),
                              PFID(mdt_object_fid(child)));
                        lustre_msg_clear_flags(req->rq_reqmsg, MSG_RESENT);
                        LDLM_LOCK_PUT(lock);
                        GOTO(relock, 0);

That'll trigger the LASSERT on MSG_RESENT in mdt_intent_lock_replace():

        if (new_lock->l_export == req->rq_export) {
                /*
                 * Already gave this to the client, which means that we
                 * reconstructed a reply.
                 */
                LASSERT(lustre_msg_get_flags(req->rq_reqmsg) &
                        MSG_RESENT);
                lh->mlh_reg_lh.cookie = 0;
                RETURN(ELDLM_LOCK_REPLACED);
        }

This part of the code was heavily changed in LU-2827: the RESENT flag is no longer cleared, so the LASSERT can no longer be triggered. The backport of LU-2827 is at http://review.whamcloud.com/#/c/10902/ .
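
To make the interaction above concrete, here is a minimal stand-alone sketch of the two code paths. It is not Lustre code: every `sketch_*` name and both structs are hypothetical stand-ins, and only the 0x0002 mask is taken from the assertion message in the log. It assumes the lock being replaced belongs to the same export as the request, which is the case the quoted LASSERT guards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask taken from the assertion text in the log: "... & 0x0002". */
#define SKETCH_MSG_RESENT 0x0002u

/* Hypothetical stand-ins for the request and lock; not the real Lustre types. */
struct sketch_req {
        uint32_t flags;     /* models the flags on req->rq_reqmsg */
        int      export_id; /* models req->rq_export */
};

struct sketch_lock {
        int export_id;      /* models new_lock->l_export */
};

/* Models the relock path: the pre-LU-2827 code cleared the flag here. */
static void sketch_relock(struct sketch_req *req, bool clear_flag)
{
        if (clear_flag)
                req->flags &= ~SKETCH_MSG_RESENT;
}

/* Models the check in the replace path; false means the LASSERT would fire. */
static bool sketch_replace_check_holds(const struct sketch_req *req,
                                       const struct sketch_lock *lock)
{
        if (lock->export_id != req->export_id)
                return true; /* the assertion is not reached in this case */
        return (req->flags & SKETCH_MSG_RESENT) != 0;
}

/* Runs one resent request through both paths. */
static bool sketch_run_scenario(bool clear_flag)
{
        struct sketch_req req = { .flags = SKETCH_MSG_RESENT, .export_id = 1 };
        struct sketch_lock lock = { .export_id = 1 };

        assert(req.flags & SKETCH_MSG_RESENT); /* the request starts out resent */
        sketch_relock(&req, clear_flag);
        return sketch_replace_check_holds(&req, &lock);
}
```

In this model, `sketch_run_scenario(true)` returns false, which corresponds to the LBUG reported here, while `sketch_run_scenario(false)` returns true, which corresponds to the behaviour once the flag is no longer cleared.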

Comment by Haisong Cai (Inactive) [ 10/Jul/14 ]

Hi Niu,

The same MDS has also thrown FID errors like the ones below and caused clients to hang. Do you think they are related?

Jul 4 20:51:33 meerkat-mds-10-1 kernel: LustreError: 19626:0:(osd_oi.c:655:osd_oi_insert()) meerkat-MDT0000: the FID [0x2000030d7:0x220:0x0] is used by two objects: 647048266/1095373013 647048267/1095373014
Jul 4 20:51:47 meerkat-mds-10-1 kernel: LustreError: 3655:0:(osd_oi.c:655:osd_oi_insert()) meerkat-MDT0000: the FID [0x2000025ff:0x2c9a:0x0] is used by two objects: 647048307/1095373111 647048308/1095373112
Jul 4 20:56:54 meerkat-mds-10-1 kernel: LustreError: 19257:0:(osd_oi.c:655:osd_oi_insert()) meerkat-MDT0000: the FID [0x200003094:0x20b5:0x0] is used by two objects: 647048393/1095374047 647048397/1095374048
Jul 5 07:28:22 meerkat-mds-10-1 kernel: LustreError: 3657:0:(osd_oi.c:655:osd_oi_insert()) meerkat-MDT0000: the FID [0x200001f67:0x3e3d:0x0] is used by two objects: 647560781/1095604748 647560782/1095604749
Jul 5 17:25:56 meerkat-mds-10-1 kernel: LustreError: 3637:0:(osd_oi.c:655:osd_oi_insert()) meerkat-MDT0000: the FID [0x200002547:0x457:0x0] is used by two objects: 647561074/3949023701 647561076/3949023703
Jul 5 17:25:56 meerkat-mds-10-1 kernel: LustreError: 3534:0:(osd_oi.c:655:osd_oi_insert()) meerkat-MDT0000: the FID [0x2000020c8:0xe6f:0x0] is used by two objects: 647561077/3949023704 647561079/3949023706
Jul 5 17:25:57 meerkat-mds-10-1 kernel: LustreError: 3670:0:(osd_oi.c:655:osd_oi_insert()) meerkat-MDT0000: the FID [0x200001ac1:0x4037:0x0] is used by two objects: 647561031/3949023726 647561045/3949023727
Jul 5 17:35:42 meerkat-mds-10-1 kernel: LustreError: 3685:0:(osd_oi.c:655:osd_oi_insert()) meerkat-MDT0000: the FID [0x2000020a1:0x26f6:0x0] is used by two objects: 647561130/3949042621 647561131/3949042622
Jul 6 00:00:05 meerkat-mds-10-1 kernel: LustreError: 3645:0:(osd_oi.c:655:osd_oi_insert()) meerkat-MDT0000: the FID [0x200001a0e:0x35a8:0x0] is used by two objects: 647561202/3949603984 647561206/3949603985

Comment by Bruno Faccini (Inactive) [ 09/Oct/14 ]

Hello Niu,
You may have missed this, but there are regression issues with my b2_4 backport (http://review.whamcloud.com/#/c/10902/) of the LU-2827 changes. I have checked that the b2_5 version (http://review.whamcloud.com/#/c/10492/) is OK, and it should land soon.

Comment by Peter Jones [ 19/Jul/17 ]

This issue is fixed in newer releases, and SDSC upgraded some time ago.

Generated at Sat Feb 10 01:50:24 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.