
MDS oops during mount with latest lustre 2.5.1 snapshot

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.5.2
    • Lustre 2.5.1
    • None
    • MDS server
    • 3
    • 14091

    Description

      With the latest 2.5.1 snapshot, when I attempt to bring up a file system I see the following bug on the MDS during the MDT mount. Because of this I cannot currently mount a 2.5 file system for testing.

      May 27 16:55:19 tick-dne-mds1 kernel: [ 546.512335] LustreError: 13869:0:(osp_dev.c:864:osp_prepare_fid_client()) ASSERTION( osp->opd_obd->u.cli.cl_seq != ((void *)0) ) failed:
      May 27 16:55:19 tick-dne-mds1 kernel: [ 546.548335] LustreError: 13869:0:(osp_dev.c:864:osp_prepare_fid_client()) LBUG
      May 27 16:55:19 tick-dne-mds1 kernel: [ 546.569232] Pid: 13869, comm: ptlrpcd_rcv
      May 27 16:55:19 tick-dne-mds1 kernel: [ 546.579503]
      May 27 16:55:19 tick-dne-mds1 kernel: [ 546.579503] Call Trace:
      May 27 16:55:19 tick-dne-mds1 kernel: [ 546.598249] [<ffffffffa05f3895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      May 27 16:55:19 tick-dne-mds1 kernel: [ 546.618857] [<ffffffffa05f3e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      May 27 16:55:19 tick-dne-mds1 kernel: [ 546.638835] [<ffffffffa108ee34>] osp_import_event+0x3d4/0x410 [osp]
      May 27 16:55:19 tick-dne-mds1 kernel: [ 546.659079] [<ffffffffa09207cc>] ptlrpc_activate_import+0x12c/0x270 [ptlrpc]
      May 27 16:55:19 tick-dne-mds1 kernel: [ 546.688094] [<ffffffffa0923502>] ptlrpc_connect_interpret+0x1912/0x2160 [ptlrpc]
      May 27 16:55:19 tick-dne-mds1 kernel: [ 546.708910] [<ffffffffa08f894c>] ptlrpc_check_set+0x2bc/0x1b50 [ptlrpc]
      May 27 16:55:19 tick-dne-mds1 kernel: [ 546.729238] [<ffffffffa0924cab>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
      May 27 16:55:20 tick-dne-mds1 kernel: [ 546.749421] [<ffffffffa09251cb>] ptlrpcd+0x20b/0x370 [ptlrpc]
      May 27 16:55:20 tick-dne-mds1 kernel: [ 546.769286] [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      May 27 16:55:20 tick-dne-mds1 kernel: [ 546.789361] [<ffffffffa0924fc0>] ? ptlrpcd+0x0/0x370 [ptlrpc]
      May 27 16:55:20 tick-dne-mds1 kernel: [ 546.809414] [<ffffffff8109ab56>] kthread+0x96/0xa0
      May 27 16:55:20 tick-dne-mds1 kernel: [ 546.828800] [<ffffffff8100c20a>] child_rip+0xa/0x20
      May 27 16:55:20 tick-dne-mds1 kernel: [ 546.848257] [<ffffffff8109aac0>] ? kthread+0x0/0xa0
      May 27 16:55:20 tick-dne-mds1 kernel: [ 546.859345] [<ffffffff8100c200>] ? child_rip+0x0/0x20
      May 27 16:55:20 tick-dne-mds1 kernel: [ 546.879015]

          Activity

            [LU-5107] MDS oops during mount with latest lustre 2.5.1 snapshot
            pjones Peter Jones added a comment -

            Landed for 2.5.2. As I understand it, this issue only affected b2_5, so the fix is not needed on other branches.

            adilger Andreas Dilger added a comment -

            The problem was caused by the backport of patch http://review.whamcloud.com/9875 to b2_5.

            simmonsja James A Simmons added a comment -

            The patch appears to have resolved the issue. Thank you.

            di.wang Di Wang added a comment - http://review.whamcloud.com/10476
            di.wang Di Wang added a comment -

            Hmm, there are some problems with 8997 when porting it to 2.5. Since we do not need the OSP (for MDT) to allocate FIDs, osp_prepare_fid_client(d) needs to be moved after the if (d->opd_connect_mdt) check in osp_import_event.

            diff --git a/lustre/osp/osp_dev.c b/lustre/osp/osp_dev.c
            index a4a2f90..15f2ec0 100644
            --- a/lustre/osp/osp_dev.c
            +++ b/lustre/osp/osp_dev.c
            @@ -1053,15 +1053,16 @@ static int osp_import_event(struct obd_device *obd, struct obd_import *imp,
                    case IMP_EVENT_ACTIVE:
                            d->opd_imp_active = 1;
             
            -               if (osp_prepare_fid_client(d) != 0)
            -                       break;
            -
                            if (d->opd_got_disconnected)
                                    d->opd_new_connection = 1;
                            d->opd_imp_connected = 1;
                            d->opd_imp_seen_connected = 1;
                            if (d->opd_connect_mdt)
                                    break;
            +
            +               if (osp_prepare_fid_client(d) != 0)
            +                       break;
            +
                            wake_up(&d->opd_pre_waitq);
                            __osp_sync_check_for_work(d);
                            CDEBUG(D_HA, "got connected\n");
            

            This will probably fix the problem. I will cook a patch.
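
            The reordering in the diff above can be illustrated with a minimal, standalone C sketch. It is a simplified model under stated assumptions, not the Lustre source: struct osp_model, enum model_result, and the model_* functions are hypothetical names, and the LASSERT inside osp_prepare_fid_client() is modeled as a returned MODEL_LBUG value instead of a kernel LBUG. It also assumes, based on the assertion in the log and the comment above, that cl_seq is NULL on an OSP that connects to another MDT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for struct osp_device. */
struct osp_model {
	bool connect_mdt;      /* mirrors d->opd_connect_mdt */
	const void *cl_seq;    /* mirrors osp->opd_obd->u.cli.cl_seq */
	bool imp_connected;    /* mirrors d->opd_imp_connected */
	bool woke_precreate;   /* stands in for wake_up(&d->opd_pre_waitq) */
};

enum model_result { MODEL_OK, MODEL_LBUG };

/* Models the assertion in osp_prepare_fid_client(): cl_seq must be set. */
static enum model_result model_prepare_fid_client(const struct osp_model *d)
{
	assert(d != NULL);
	return d->cl_seq != NULL ? MODEL_OK : MODEL_LBUG;
}

/* Ordering before the fix: FID client setup runs for every OSP. */
static enum model_result model_active_before_fix(struct osp_model *d)
{
	assert(d != NULL);
	if (model_prepare_fid_client(d) != MODEL_OK)
		return MODEL_LBUG;
	d->imp_connected = true;
	if (d->connect_mdt)
		return MODEL_OK;
	d->woke_precreate = true;
	return MODEL_OK;
}

/* Ordering after the fix: MDT-connected OSPs leave before FID setup. */
static enum model_result model_active_after_fix(struct osp_model *d)
{
	assert(d != NULL);
	d->imp_connected = true;
	if (d->connect_mdt)
		return MODEL_OK;
	if (model_prepare_fid_client(d) != MODEL_OK)
		return MODEL_LBUG;
	d->woke_precreate = true;
	return MODEL_OK;
}
```

            In this model, a device with connect_mdt set and cl_seq left NULL reaches the assertion with the old ordering, while the new ordering marks it connected and returns before FID setup is attempted.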


            simmonsja James A Simmons added a comment -

            I reverted patch 8997.

            adilger Andreas Dilger added a comment -

            James, there are two patches on LU-4413:

                • http://review.whamcloud.com/8997 - osp: move seq allocation out of osp_import_event
                • http://review.whamcloud.com/8996 - ptlrpc: don't try to recover no_recov connection

            Which one did you revert to fix the problem?

            simmonsja James A Simmons added a comment -

            I tried a build with a few extra patches. Then I tried the tip of b2_5 and hit the same problem. Yes, it is a DNE setup with 3 MDS servers. When I encountered this error I was using an already formatted 2.5 file system. I later reformatted to make sure that was not the issue, but the MDS oops was still there. I found that reverting LU-4413 appears to make the problem go away.
            I don't think reverting that patch is the solution. I have placed the dmesg log and vmcore dump at ftp.whamcloud.com/uploads/LU-5107.

            di.wang Di Wang added a comment -

            James,

            Did you set up Lustre with a single MDT or with DNE? Are there any other console error messages? Could you tell me which build you are using? Is it a newly formatted FS? Do you have the dump log for this LBUG?

            Thank you.
            WangDi


            jamesanunez James Nunez (Inactive) added a comment -

            Di,

            Would you please comment on this ticket?

            Thank you,
            James

            People

              di.wang Di Wang
              simmonsja James A Simmons