
MDS oops during mount with latest lustre 2.5.1 snapshot

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.5.2
    • Lustre 2.5.1
    • None
    • MDS server
    • 3
    • 14091

    Description

      With the latest 2.5.1 snapshot, when I attempt to bring up a file system I see the following bug on the MDS during the MDT mount. Because of this I can't currently mount a 2.5 file system for testing.

      May 27 16:55:19 tick-dne-mds1 kernel: [ 546.512335] LustreError: 13869:0:(osp_dev.c:864:osp_prepare_fid_client()) ASSERTION( osp->opd_obd->u.cli.cl_seq != ((void *)0) ) failed:
      May 27 16:55:19 tick-dne-mds1 kernel: [ 546.548335] LustreError: 13869:0:(osp_dev.c:864:osp_prepare_fid_client()) LBUG
      May 27 16:55:19 tick-dne-mds1 kernel: [ 546.569232] Pid: 13869, comm: ptlrpcd_rcv
      May 27 16:55:19 tick-dne-mds1 kernel: [ 546.579503]
      May 27 16:55:19 tick-dne-mds1 kernel: [ 546.579503] Call Trace:
      May 27 16:55:19 tick-dne-mds1 kernel: [ 546.598249] [<ffffffffa05f3895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      May 27 16:55:19 tick-dne-mds1 kernel: [ 546.618857] [<ffffffffa05f3e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      May 27 16:55:19 tick-dne-mds1 kernel: [ 546.638835] [<ffffffffa108ee34>] osp_import_event+0x3d4/0x410 [osp]
      May 27 16:55:19 tick-dne-mds1 kernel: [ 546.659079] [<ffffffffa09207cc>] ptlrpc_activate_import+0x12c/0x270 [ptlrpc]
      May 27 16:55:19 tick-dne-mds1 kernel: [ 546.688094] [<ffffffffa0923502>] ptlrpc_connect_interpret+0x1912/0x2160 [ptlrpc]
      May 27 16:55:19 tick-dne-mds1 kernel: [ 546.708910] [<ffffffffa08f894c>] ptlrpc_check_set+0x2bc/0x1b50 [ptlrpc]
      May 27 16:55:19 tick-dne-mds1 kernel: [ 546.729238] [<ffffffffa0924cab>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
      May 27 16:55:20 tick-dne-mds1 kernel: [ 546.749421] [<ffffffffa09251cb>] ptlrpcd+0x20b/0x370 [ptlrpc]
      May 27 16:55:20 tick-dne-mds1 kernel: [ 546.769286] [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      May 27 16:55:20 tick-dne-mds1 kernel: [ 546.789361] [<ffffffffa0924fc0>] ? ptlrpcd+0x0/0x370 [ptlrpc]
      May 27 16:55:20 tick-dne-mds1 kernel: [ 546.809414] [<ffffffff8109ab56>] kthread+0x96/0xa0
      May 27 16:55:20 tick-dne-mds1 kernel: [ 546.828800] [<ffffffff8100c20a>] child_rip+0xa/0x20
      May 27 16:55:20 tick-dne-mds1 kernel: [ 546.848257] [<ffffffff8109aac0>] ? kthread+0x0/0xa0
      May 27 16:55:20 tick-dne-mds1 kernel: [ 546.859345] [<ffffffff8100c200>] ? child_rip+0x0/0x20
      May 27 16:55:20 tick-dne-mds1 kernel: [ 546.879015]

          Activity

            [LU-5107] MDS oops during mount with latest lustre 2.5.1 snapshot
            pjones Peter Jones added a comment -

            Landed for 2.5.2. As I understand it, this issue only affected b2_5, so the fix is not needed on other branches.

            adilger Andreas Dilger added a comment -

            The problem was caused by the backport of patch http://review.whamcloud.com/9875 to b2_5.

            simmonsja James A Simmons added a comment -

            The patch appears to have resolved the issue. Thank you.
            di.wang Di Wang added a comment - http://review.whamcloud.com/10476
            di.wang Di Wang added a comment -

            Hmm, there are some problems with 8997 when porting it to 2.5. Since we do not need the OSP (for MDT) to allocate FIDs, osp_prepare_fid_client(d) needs to be moved after the if (d->opd_connect_mdt) check in osp_import_event.

            diff --git a/lustre/osp/osp_dev.c b/lustre/osp/osp_dev.c
            index a4a2f90..15f2ec0 100644
            --- a/lustre/osp/osp_dev.c
            +++ b/lustre/osp/osp_dev.c
            @@ -1053,15 +1053,16 @@ static int osp_import_event(struct obd_device *obd, struct obd_import *imp,
                    case IMP_EVENT_ACTIVE:
                            d->opd_imp_active = 1;
             
            -               if (osp_prepare_fid_client(d) != 0)
            -                       break;
            -
                            if (d->opd_got_disconnected)
                                    d->opd_new_connection = 1;
                            d->opd_imp_connected = 1;
                            d->opd_imp_seen_connected = 1;
                            if (d->opd_connect_mdt)
                                    break;
            +
            +               if (osp_prepare_fid_client(d) != 0)
            +                       break;
            +
                            wake_up(&d->opd_pre_waitq);
                            __osp_sync_check_for_work(d);
                            CDEBUG(D_HA, "got connected\n");
            

            This will probably fix the problem; I will cook a patch.
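The reordering in the diff above can be illustrated with a minimal standalone sketch. This is not Lustre code: `struct osp_model`, `model_prepare_fid_client()`, `model_active_old()` and `model_active_new()` are hypothetical names invented for illustration. The sketch assumes, based on the failed assertion in the description, that `cl_seq` is unset on an OSP that connects to another MDT, and it returns -1 at the point where the real code would hit the LBUG.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the OSP device state.
 * Only the fields relevant to the diff above are modeled. */
struct osp_model {
	int   opd_connect_mdt;    /* non-zero: this OSP connects to another MDT */
	void *cl_seq;             /* stands in for opd_obd->u.cli.cl_seq */
	int   opd_imp_connected;  /* set once the import is marked connected */
	int   fid_client_ready;   /* set once FID client setup has run */
};

/* Models the precondition asserted in osp_prepare_fid_client().
 * Returns -1 where the real code would fail the assertion. */
static int model_prepare_fid_client(struct osp_model *d)
{
	if (d->cl_seq == NULL)
		return -1;
	d->fid_client_ready = 1;
	return 0;
}

/* Ordering before the fix: FID client setup runs unconditionally,
 * so an MDT-to-MDT connection without cl_seq trips the check. */
static int model_active_old(struct osp_model *d)
{
	assert(d != NULL);
	if (model_prepare_fid_client(d) != 0)
		return -1;
	d->opd_imp_connected = 1;
	return 0;
}

/* Ordering after the fix: MDT connections return early, so FID
 * client setup only runs for connections that need it. */
static int model_active_new(struct osp_model *d)
{
	assert(d != NULL);
	d->opd_imp_connected = 1;
	if (d->opd_connect_mdt)
		return 0;
	return model_prepare_fid_client(d);
}
```

With `opd_connect_mdt = 1` and `cl_seq = NULL`, `model_active_old()` reports the failure while `model_active_new()` marks the import connected and returns 0 without touching the FID client.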


            simmonsja James A Simmons added a comment -

            I reverted patch 8997.

            adilger Andreas Dilger added a comment -

            James, there are two patches on LU-4413:

                • http://review.whamcloud.com/8997 - osp: move seq allocation out of osp_import_event
                • http://review.whamcloud.com/8996 - ptlrpc: don't try to recover no_recov connection

            Which one did you revert to fix the problem?

            simmonsja James A Simmons added a comment -

            I tried a build with a few extra patches. Then I tried the tip of b2_5 and it was the same problem. Yes, it is a DNE setup with 3 MDS servers. When I encountered this error I was using an already formatted 2.5 file system. I later reformatted to make sure that was not the issue, but the MDS oops was still there. I found that reverting LU-4413 appears to make the problem go away.
            I don't think reverting that patch is the solution. I have placed the dmesg log and vmcore dump at ftp.whamcloud.com/uploads/LU-5107.

            People

              di.wang Di Wang
              simmonsja James A Simmons