Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.4.3
    • Linux meerkat-mds-10-1.local 2.6.32-358.23.2.el6_lustre.x86_64 #1 SMP Thu Dec 19 19:57:45 PST 2013 x86_64 x86_64 x86_64 GNU/Linux
    • 3
    • 14850

    Description

      Our MDS hit an LBUG and crashed this evening. Here is the relevant excerpt from /var/log/messages:

      Jul 9 18:40:22 meerkat-mds-10-1 kernel: Lustre: meerkat-MDT0000: Client e27741dc-f76c-ea5a-c426-4c6b5e86a758 (at 198.202.118.120@tcp) reconnecting
      Jul 9 18:40:24 meerkat-mds-10-1 kernel: Lustre: meerkat-MDT0000: Client 42a919ae-9df5-e771-0e9a-7ee82fdc33d9 (at 198.202.119.106@tcp) reconnecting
      Jul 9 18:40:24 meerkat-mds-10-1 kernel: Lustre: Skipped 1 previous similar message
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: Lustre: 3457:0:(mdt_handler.c:1338:mdt_getattr_name_lock()) Although resent, but still not get child lockparent:[0x200003fb3:0xaa9f:0x0] child:[0x200003fb3:0xb42e:0x0]
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: LustreError: 3457:0:(mdt_handler.c:3568:mdt_intent_lock_replace()) ASSERTION( lustre_msg_get_flags(req->rq_reqmsg) & 0x0002 ) failed:
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: LustreError: 3457:0:(mdt_handler.c:3568:mdt_intent_lock_replace()) LBUG
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: Pid: 3457, comm: mdt03_003
      Jul 9 18:40:29 meerkat-mds-10-1 kernel:
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: Call Trace:
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa02e7895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa02e7e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa0c9c3b1>] mdt_intent_lock_replace+0x391/0x400 [mdt]
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa0cb34c6>] mdt_intent_getattr+0x3b6/0x490 [mdt]
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa0c9ff1e>] mdt_intent_policy+0x39e/0x720 [mdt]
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa0575831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa059c1cf>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa0ca03a6>] mdt_enqueue+0x46/0xe0 [mdt]
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa0ca6a97>] mdt_handle_common+0x647/0x16d0 [mdt]
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa05beb8c>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa0ce06a5>] mds_regular_handle+0x15/0x20 [mdt]
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa05ce3a8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa02e85de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa02f9d9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa05c5709>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffff81063990>] ? default_wake_function+0x0/0x20
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa05cf73e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa05cec70>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa05cec70>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffffa05cec70>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      Jul 9 18:40:29 meerkat-mds-10-1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      Jul 9 18:40:29 meerkat-mds-10-1 kernel:

          Activity

            [LU-5314] Lustre 2.4.2 MDS hit LBUG and crash
            pjones Peter Jones added a comment -

            This issue is fixed in newer releases, and SDSC upgraded some time ago.


            Hello Niu,
            You may have missed this, but there are regression issues with my b2_4 back-port (http://review.whamcloud.com/#/c/10902/) of the LU-2827 changes. I checked that the b2_5 version (http://review.whamcloud.com/#/c/10492/) is OK, and it should land soon.

            bfaccini Bruno Faccini (Inactive) added the comment above.

            Hi Niu,

            The same MDS has also thrown FID errors like those below and caused clients to hang. Do you think they are related?

            Jul 4 20:51:33 meerkat-mds-10-1 kernel: LustreError: 19626:0:(osd_oi.c:655:osd_oi_insert()) meerkat-MDT0000: the FID [0x2000030d7:0x220:0x0] is used by two objects: 647048266/1095373013 647048267/1095373014
            Jul 4 20:51:47 meerkat-mds-10-1 kernel: LustreError: 3655:0:(osd_oi.c:655:osd_oi_insert()) meerkat-MDT0000: the FID [0x2000025ff:0x2c9a:0x0] is used by two objects: 647048307/1095373111 647048308/1095373112
            Jul 4 20:56:54 meerkat-mds-10-1 kernel: LustreError: 19257:0:(osd_oi.c:655:osd_oi_insert()) meerkat-MDT0000: the FID [0x200003094:0x20b5:0x0] is used by two objects: 647048393/1095374047 647048397/1095374048
            Jul 5 07:28:22 meerkat-mds-10-1 kernel: LustreError: 3657:0:(osd_oi.c:655:osd_oi_insert()) meerkat-MDT0000: the FID [0x200001f67:0x3e3d:0x0] is used by two objects: 647560781/1095604748 647560782/1095604749
            Jul 5 17:25:56 meerkat-mds-10-1 kernel: LustreError: 3637:0:(osd_oi.c:655:osd_oi_insert()) meerkat-MDT0000: the FID [0x200002547:0x457:0x0] is used by two objects: 647561074/3949023701 647561076/3949023703
            Jul 5 17:25:56 meerkat-mds-10-1 kernel: LustreError: 3534:0:(osd_oi.c:655:osd_oi_insert()) meerkat-MDT0000: the FID [0x2000020c8:0xe6f:0x0] is used by two objects: 647561077/3949023704 647561079/3949023706
            Jul 5 17:25:57 meerkat-mds-10-1 kernel: LustreError: 3670:0:(osd_oi.c:655:osd_oi_insert()) meerkat-MDT0000: the FID [0x200001ac1:0x4037:0x0] is used by two objects: 647561031/3949023726 647561045/3949023727
            Jul 5 17:35:42 meerkat-mds-10-1 kernel: LustreError: 3685:0:(osd_oi.c:655:osd_oi_insert()) meerkat-MDT0000: the FID [0x2000020a1:0x26f6:0x0] is used by two objects: 647561130/3949042621 647561131/3949042622
            Jul 6 00:00:05 meerkat-mds-10-1 kernel: LustreError: 3645:0:(osd_oi.c:655:osd_oi_insert()) meerkat-MDT0000: the FID [0x200001a0e:0x35a8:0x0] is used by two objects: 647561202/3949603984 647561206/3949603985

            haisong Haisong Cai (Inactive) added the comment above.

            mdt_getattr_name_lock() cleared the MSG_RESENT flag:

                                    CWARN("Although resent, but still not get child lock"
                                          "parent:"DFID" child:"DFID"\n",
                                          PFID(mdt_object_fid(parent)),
                                          PFID(mdt_object_fid(child)));
                                    lustre_msg_clear_flags(req->rq_reqmsg, MSG_RESENT);
                                    LDLM_LOCK_PUT(lock);
                                    GOTO(relock, 0);
            

            That'll trigger the LASSERT on MSG_RESENT in mdt_intent_lock_replace():

                    if (new_lock->l_export == req->rq_export) {
                            /*
                             * Already gave this to the client, which means that we
                             * reconstructed a reply.
                             */
                            LASSERT(lustre_msg_get_flags(req->rq_reqmsg) &
                                    MSG_RESENT);
                            lh->mlh_reg_lh.cookie = 0;
                            RETURN(ELDLM_LOCK_REPLACED);
                    }
            

            This part of the code was heavily changed by LU-2827: the RESENT flag is no longer cleared, so the LASSERT can no longer be triggered. The backport of LU-2827 is at http://review.whamcloud.com/#/c/10902/ .

            niu Niu Yawei (Inactive) added the comment above.
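
            The interaction described in that analysis can be illustrated with a small standalone model. This is only a sketch, not Lustre code: the struct and the `model_*` names are hypothetical, the flag value 0x0002 is taken from the assertion text in the crash log, and the real control flow in mdt_handler.c is considerably more involved. It shows that clearing the flag before the replace step makes the asserted condition false, while leaving the flag set keeps it true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Value taken from the assertion text in the crash log (flags & 0x0002). */
#define MODEL_MSG_RESENT 0x0002u

/* Hypothetical stand-in for the request message; only the flags word matters. */
struct model_msg {
        uint32_t flags;
};

/*
 * Models the retry path quoted from mdt_getattr_name_lock(): when
 * clear_resent is true the flag is dropped before relocking (the behaviour
 * described for 2.4); when false it is left alone (the behaviour described
 * after LU-2827).
 */
static void model_getattr_relock(struct model_msg *msg, bool clear_resent)
{
        if (clear_resent)
                msg->flags &= ~MODEL_MSG_RESENT;
}

/*
 * Models the check quoted from mdt_intent_lock_replace(): returns true if
 * the asserted condition holds, false if the LASSERT would fire. The
 * assertion is only reached when the lock already belongs to the
 * requesting export.
 */
static bool model_lock_replace_check(const struct model_msg *msg,
                                     bool same_export)
{
        if (!same_export)
                return true;    /* assertion not reached on this branch */
        return (msg->flags & MODEL_MSG_RESENT) != 0;
}

/* Runs both steps in order for a request that arrived with the flag set. */
bool model_resent_request(bool clear_resent)
{
        struct model_msg msg = { .flags = MODEL_MSG_RESENT };

        model_getattr_relock(&msg, clear_resent);
        return model_lock_replace_check(&msg, true);
}
```

            Calling `model_resent_request(true)` corresponds to the sequence in the crash log (flag cleared, then asserted), and `model_resent_request(false)` corresponds to the behaviour described for the LU-2827 change.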

            Could you please take a look at this, Niu?
            Thanks,
            ~ jfc.

            jfc John Fuchs-Chesney (Inactive) added the comment above.

            People

              niu Niu Yawei (Inactive)
              haisong Haisong Cai (Inactive)