[LU-5582] mount.lustre failed due to presence of MGC entry after initial failed mount attempt

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.4.1
    • 3
    • 15573

    Description

      Lustre 2.4 clients initially failed to mount the Sonexion Lustre filesystem due to IB issues. Then subsequent mount attempts failed due to the presence of an MGC entry left behind by the failed mount attempt.

      Here is the log from a client for the initial failed mount attempt:

      console-20131016t19:2013-10-16T19:27:10.464396-05:00 c0-0c2s0n3 LustreError: 3603:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff88041e5a9000 x1449100178358368/t0(0) o101->MGC10.10.84.202@o2ib@10.10.84.202@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      console-20131016t19:2013-10-16T19:27:20.047254-05:00 c0-0c2s0n3 LustreError: 3593:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff88081bdee400 x1449100178358376/t0(0) o101->MGC10.10.84.202@o2ib@10.10.84.202@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      console-20131016t19:2013-10-16T19:27:32.906155-05:00 c0-0c2s0n3 LustreError: 3593:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff88081bdee400 x1449100178358380/t0(0) o101->MGC10.10.84.202@o2ib@10.10.84.202@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      console-20131016t19:2013-10-16T19:28:02.179944-05:00 c0-0c2s0n3 LustreError: 3603:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff88081bdf0800 x1449100178358372/t0(0) o101->MGC10.10.84.202@o2ib@10.10.84.202@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      console-20131016t19:2013-10-16T19:28:02.179952-05:00 c0-0c2s0n3 LustreError: 15c-8: MGC10.10.84.202@o2ib: The configuration from log 'snxtest-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      console-20131016t19:2013-10-16T19:28:02.179965-05:00 c0-0c2s0n3 LustreError: 3603:0:(llite_lib.c:1055:ll_fill_super()) Unable to process log: -5
      console-20131016t19:2013-10-16T19:28:02.179973-05:00 c0-0c2s0n3 Lustre: Unmounted snxtest-client
      console-20131016t19:2013-10-16T19:28:20.355005-05:00 c0-0c2s0n3 Lustre: 3529:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1381969623/real 1381969623]  req@ffff88041e5ab800 x1449100178358364/t0(0) o250->MGC10.10.84.202@o2ib@10.10.84.202@o2ib:26/25 lens 400/544 e 0 to 1 dl 1381969698 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      console-20131016t19:2013-10-16T19:28:20.355017-05:00 c0-0c2s0n3 LustreError: 3603:0:(obd_mount.c:1267:lustre_fill_super()) Unable to mount  (-5)
      console-20131016t19:2013-10-16T19:28:20.355024-05:00 c0-0c2s0n3 mount.lustre: mount 10.10.84.202@o2ib:10.10.84.203@o2ib:/snxtest at /mnt/pdraid failed: Input/output error
      console-20131016t19:2013-10-16T19:28:20.355031-05:00 c0-0c2s0n3 Is the MGS running?
      console-20131016t19:2013-10-16T19:28:20.355039-05:00 c0-0c2s0n3 Error mounting lustre filesystem, 10.10.84.202@o2ib:10.10.84.203@o2ib:/snxtest at /mnt/pdraid
      

      Here is the output of "lctl dl" on a client after the failed mount attempt:

      # /sbin/lctl dl
        0 UP mgc MGC10.10.84.39@o2ib 10d641b6-8b1c-9e11-5561-2600bf3be157 5
        1 UP lov snx11000-clilov-ffff88041dfd7c00 fa5ccc52-57a8-e85d-21bf-62a2292eaab7 4
        2 UP lmv snx11000-clilmv-ffff88041dfd7c00 fa5ccc52-57a8-e85d-21bf-62a2292eaab7 4
        3 UP mdc snx11000-MDT0000-mdc-ffff88041dfd7c00 fa5ccc52-57a8-e85d-21bf-62a2292eaab7 5
        4 UP osc snx11000-OST0002-osc-ffff88041dfd7c00 fa5ccc52-57a8-e85d-21bf-62a2292eaab7 5
        5 UP osc snx11000-OST0005-osc-ffff88041dfd7c00 fa5ccc52-57a8-e85d-21bf-62a2292eaab7 5
        6 UP osc snx11000-OST0004-osc-ffff88041dfd7c00 fa5ccc52-57a8-e85d-21bf-62a2292eaab7 5
        7 UP osc snx11000-OST0006-osc-ffff88041dfd7c00 fa5ccc52-57a8-e85d-21bf-62a2292eaab7 5
        8 UP osc snx11000-OST0003-osc-ffff88041dfd7c00 fa5ccc52-57a8-e85d-21bf-62a2292eaab7 5
        9 UP osc snx11000-OST0007-osc-ffff88041dfd7c00 fa5ccc52-57a8-e85d-21bf-62a2292eaab7 5
       10 UP osc snx11000-OST0000-osc-ffff88041dfd7c00 fa5ccc52-57a8-e85d-21bf-62a2292eaab7 5
       11 UP osc snx11000-OST0001-osc-ffff88041dfd7c00 fa5ccc52-57a8-e85d-21bf-62a2292eaab7 5
       12 ST mgc MGC10.10.84.202@o2ib 48f1967d-e484-f070-dbd2-e190e1f7d19a 1
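
In the listing above, every device is in state UP except device 12, the MGC for 10.10.84.202@o2ib, which is in state ST with a reference count of 1. A check for such leftovers could be scripted roughly as follows. This is only a sketch: the function name `stale_devices` and the sample text are illustrative, the column layout is assumed to match the output shown here, and treating any state other than UP as suspicious is an assumption rather than a documented rule.

```python
import re

# One row of `lctl dl` output as shown in this ticket:
# index, state, type, name, UUID, refcount (layout assumed from the listing).
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)\s*$")


def stale_devices(lctl_dl_output, dev_type="mgc"):
    """Return (index, name, state) for devices of dev_type that are not UP."""
    stale = []
    for line in lctl_dl_output.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # skip the shell prompt line and anything unparseable
        index, state, typ, name, _uuid, _refs = match.groups()
        if typ == dev_type and state != "UP":
            stale.append((int(index), name, state))
    return stale


# Abbreviated copy of the listing from this ticket.
SAMPLE = """\
# /sbin/lctl dl
  0 UP mgc MGC10.10.84.39@o2ib 10d641b6-8b1c-9e11-5561-2600bf3be157 5
  1 UP lov snx11000-clilov-ffff88041dfd7c00 fa5ccc52-57a8-e85d-21bf-62a2292eaab7 4
 12 ST mgc MGC10.10.84.202@o2ib 48f1967d-e484-f070-dbd2-e190e1f7d19a 1
"""

if __name__ == "__main__":
    for index, name, state in stale_devices(SAMPLE):
        print(f"device {index} ({name}) is in state {state}")
```

On a live client the same function could be fed the captured output of `lctl dl` instead of the sample string.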
      

      After the initial failed mount attempt, the IB issues were fixed, but the subsequent mount attempts failed. mount.lustre reported "File exists." Here are the log messages on a client.

      console-20131016t19:2013-10-16T19:57:15.046983-05:00 c0-0c2s0n3 LustreError: 3867:0:(genops.c:320:class_newdev()) Device MGC10.10.84.202@o2ib already exists at 12, won't add
      console-20131016t19:2013-10-16T19:57:15.047438-05:00 c0-0c2s0n3 LustreError: 3867:0:(obd_config.c:374:class_attach()) Cannot create device MGC10.10.84.202@o2ib of type mgc : -17
      console-20131016t19:2013-10-16T19:57:15.047721-05:00 c0-0c2s0n3 LustreError: 3867:0:(obd_mount.c:196:lustre_start_simple()) MGC10.10.84.202@o2ib attach error -17
      console-20131016t19:2013-10-16T19:57:15.047727-05:00 c0-0c2s0n3 LustreError: 3867:0:(obd_mount.c:1267:lustre_fill_super()) Unable to mount  (-17)
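
The negative numbers in these logs are kernel-style errno values: -5 accompanies the "Input/output error" from the first mount attempt, and -17 accompanies the "File exists" reported by the later attempts. A small helper for decoding such codes while reading logs might look like the sketch below. The name `describe` is made up for illustration, and the message text returned by `os.strerror` depends on the platform.

```python
import errno
import os


def describe(rc):
    """Map a negative return code such as -17 to (symbol, message)."""
    code = abs(rc)
    # errno.errorcode maps numeric codes to symbolic names on this platform.
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


if __name__ == "__main__":
    for rc in (-5, -17):
        symbol, message = describe(rc)
        print(f"{rc}: {symbol} ({message})")
```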
      

      This error persisted on the client until it was rebooted. It looks like the initial mount failure left behind a stale MGC device entry, which caused later mounts of this filesystem to fail.

          Activity

            haasken Ryan Haasken added a comment -

            I think this bug really is the same as LU-4943. The patch for LU-4943 has been iterated upon and now takes the same approach as Parinay's patch. The patch for LU-4943 (http://review.whamcloud.com/#/c/10129/14) has landed now, and I have tested that the landed patch resolves this issue. This bug should be closed.

            Parinay's patch (http://review.whamcloud.com/#/c/10569/) is almost the same as the above landed patch, so that one can be abandoned.

            cliffw Cliff White (Inactive) added a comment - The patch has failed our testing, is it possible for you to address the issue?

            cliffw Cliff White (Inactive) added a comment - I will monitor this issue. There was a test failure, seeing about a re-run
            parinay parinay v kondekar (Inactive) added a comment - Xyratex-bug-id: MRP-1524 http://review.whamcloud.com/#/c/10569/
            parinay parinay v kondekar (Inactive) added a comment - Filing separate ticket as per the comment from Peter ( here - https://jira.hpdd.intel.com/browse/LU-4943?focusedCommentId=91123&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-91123 )

            People

              cliffw Cliff White (Inactive)
              parinay parinay v kondekar (Inactive)