[LU-5582] mount.lustre failed due to presence of MGC entry after initial failed mount attempt Created: 04/Sep/14  Updated: 24/Oct/14  Resolved: 24/Oct/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: None

Type: Bug Priority: Major
Reporter: parinay v kondekar (Inactive) Assignee: Cliff White (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-4943 Client Failes to mount filesystem Resolved
Severity: 3
Rank (Obsolete): 15573

 Description   

Lustre 2.4 clients initially failed to mount the Sonexion Lustre filesystem due to IB issues. Then subsequent mount attempts failed due to the presence of an MGC entry left behind by the failed mount attempt.

Here is the log from a client for the initial failed mount attempt:

console-20131016t19:2013-10-16T19:27:10.464396-05:00 c0-0c2s0n3 LustreError: 3603:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff88041e5a9000 x1449100178358368/t0(0) o101->MGC10.10.84.202@o2ib@10.10.84.202@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
console-20131016t19:2013-10-16T19:27:20.047254-05:00 c0-0c2s0n3 LustreError: 3593:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff88081bdee400 x1449100178358376/t0(0) o101->MGC10.10.84.202@o2ib@10.10.84.202@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
console-20131016t19:2013-10-16T19:27:32.906155-05:00 c0-0c2s0n3 LustreError: 3593:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff88081bdee400 x1449100178358380/t0(0) o101->MGC10.10.84.202@o2ib@10.10.84.202@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
console-20131016t19:2013-10-16T19:28:02.179944-05:00 c0-0c2s0n3 LustreError: 3603:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff88081bdf0800 x1449100178358372/t0(0) o101->MGC10.10.84.202@o2ib@10.10.84.202@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
console-20131016t19:2013-10-16T19:28:02.179952-05:00 c0-0c2s0n3 LustreError: 15c-8: MGC10.10.84.202@o2ib: The configuration from log 'snxtest-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
console-20131016t19:2013-10-16T19:28:02.179965-05:00 c0-0c2s0n3 LustreError: 3603:0:(llite_lib.c:1055:ll_fill_super()) Unable to process log: -5
console-20131016t19:2013-10-16T19:28:02.179973-05:00 c0-0c2s0n3 Lustre: Unmounted snxtest-client
console-20131016t19:2013-10-16T19:28:20.355005-05:00 c0-0c2s0n3 Lustre: 3529:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1381969623/real 1381969623]  req@ffff88041e5ab800 x1449100178358364/t0(0) o250->MGC10.10.84.202@o2ib@10.10.84.202@o2ib:26/25 lens 400/544 e 0 to 1 dl 1381969698 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
console-20131016t19:2013-10-16T19:28:20.355017-05:00 c0-0c2s0n3 LustreError: 3603:0:(obd_mount.c:1267:lustre_fill_super()) Unable to mount  (-5)
console-20131016t19:2013-10-16T19:28:20.355024-05:00 c0-0c2s0n3 mount.lustre: mount 10.10.84.202@o2ib:10.10.84.203@o2ib:/snxtest at /mnt/pdraid failed: Input/output error
console-20131016t19:2013-10-16T19:28:20.355031-05:00 c0-0c2s0n3 Is the MGS running?
console-20131016t19:2013-10-16T19:28:20.355039-05:00 c0-0c2s0n3 Error mounting lustre filesystem, 10.10.84.202@o2ib:10.10.84.203@o2ib:/snxtest at /mnt/pdraid

Here is the output of "lctl dl" on a client after the failed mount attempt:

# /sbin/lctl dl
  0 UP mgc MGC10.10.84.39@o2ib 10d641b6-8b1c-9e11-5561-2600bf3be157 5
  1 UP lov snx11000-clilov-ffff88041dfd7c00 fa5ccc52-57a8-e85d-21bf-62a2292eaab7 4
  2 UP lmv snx11000-clilmv-ffff88041dfd7c00 fa5ccc52-57a8-e85d-21bf-62a2292eaab7 4
  3 UP mdc snx11000-MDT0000-mdc-ffff88041dfd7c00 fa5ccc52-57a8-e85d-21bf-62a2292eaab7 5
  4 UP osc snx11000-OST0002-osc-ffff88041dfd7c00 fa5ccc52-57a8-e85d-21bf-62a2292eaab7 5
  5 UP osc snx11000-OST0005-osc-ffff88041dfd7c00 fa5ccc52-57a8-e85d-21bf-62a2292eaab7 5
  6 UP osc snx11000-OST0004-osc-ffff88041dfd7c00 fa5ccc52-57a8-e85d-21bf-62a2292eaab7 5
  7 UP osc snx11000-OST0006-osc-ffff88041dfd7c00 fa5ccc52-57a8-e85d-21bf-62a2292eaab7 5
  8 UP osc snx11000-OST0003-osc-ffff88041dfd7c00 fa5ccc52-57a8-e85d-21bf-62a2292eaab7 5
  9 UP osc snx11000-OST0007-osc-ffff88041dfd7c00 fa5ccc52-57a8-e85d-21bf-62a2292eaab7 5
 10 UP osc snx11000-OST0000-osc-ffff88041dfd7c00 fa5ccc52-57a8-e85d-21bf-62a2292eaab7 5
 11 UP osc snx11000-OST0001-osc-ffff88041dfd7c00 fa5ccc52-57a8-e85d-21bf-62a2292eaab7 5
 12 ST mgc MGC10.10.84.202@o2ib 48f1967d-e484-f070-dbd2-e190e1f7d19a 1
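
The leftover entry (device 12, state ST) can also be picked out of the listing with a script. The following is only a minimal sketch: it assumes the six-column "lctl dl" layout shown above (index, state, type, name, UUID, refcount) and treats any mgc device whose state is not UP as suspect. The names ObdDevice, parse_lctl_dl and find_stale_devices are hypothetical helpers and are not part of Lustre.

```python
from typing import List, NamedTuple


class ObdDevice(NamedTuple):
    index: int
    state: str
    dev_type: str
    name: str
    uuid: str
    refcount: int


def parse_lctl_dl(text: str) -> List[ObdDevice]:
    """Parse 'lctl dl' output, assuming the six-column layout seen in this ticket."""
    devices = []
    for line in text.splitlines():
        fields = line.split()
        # Skip anything that is not a device row (prompts, blank lines, etc.).
        if len(fields) != 6 or not fields[0].isdigit() or not fields[5].isdigit():
            continue
        idx, state, dev_type, name, uuid, refs = fields
        devices.append(ObdDevice(int(idx), state, dev_type, name, uuid, int(refs)))
    return devices


def find_stale_devices(devices: List[ObdDevice], dev_type: str = "mgc") -> List[ObdDevice]:
    """Return devices of the given type whose state is not UP (assumption: not UP means suspect)."""
    return [d for d in devices if d.dev_type == dev_type and d.state != "UP"]


# Abbreviated copy of the listing from this ticket.
SAMPLE = """\
  0 UP mgc MGC10.10.84.39@o2ib 10d641b6-8b1c-9e11-5561-2600bf3be157 5
  3 UP mdc snx11000-MDT0000-mdc-ffff88041dfd7c00 fa5ccc52-57a8-e85d-21bf-62a2292eaab7 5
 12 ST mgc MGC10.10.84.202@o2ib 48f1967d-e484-f070-dbd2-e190e1f7d19a 1
"""

if __name__ == "__main__":
    for dev in find_stale_devices(parse_lctl_dl(SAMPLE)):
        print(f"suspect device {dev.index}: {dev.name} (state {dev.state}, refcount {dev.refcount})")
```

On a live client the text would come from running /sbin/lctl dl rather than from the embedded sample.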

After the initial failed mount attempt, the IB issues were fixed, but subsequent mount attempts still failed, with mount.lustre reporting "File exists" (-17). Here are the log messages from a client:

console-20131016t19:2013-10-16T19:57:15.046983-05:00 c0-0c2s0n3 LustreError: 3867:0:(genops.c:320:class_newdev()) Device MGC10.10.84.202@o2ib already exists at 12, won't add
console-20131016t19:2013-10-16T19:57:15.047438-05:00 c0-0c2s0n3 LustreError: 3867:0:(obd_config.c:374:class_attach()) Cannot create device MGC10.10.84.202@o2ib of type mgc : -17
console-20131016t19:2013-10-16T19:57:15.047721-05:00 c0-0c2s0n3 LustreError: 3867:0:(obd_mount.c:196:lustre_start_simple()) MGC10.10.84.202@o2ib attach error -17
console-20131016t19:2013-10-16T19:57:15.047727-05:00 c0-0c2s0n3 LustreError: 3867:0:(obd_mount.c:1267:lustre_fill_super()) Unable to mount  (-17)
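
The numeric codes in these logs are negated errno values: -5 corresponds to the "Input/output error" reported by the first mount attempt, and -17 to the "File exists" reported by the later ones. A small sketch for decoding them, assuming a Linux client; describe_rc is a hypothetical helper name:

```python
import errno
import os


def describe_rc(rc: int) -> str:
    """Translate a negated errno value from a Lustre log line into its symbolic name and message."""
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{rc}: {name} ({os.strerror(code)})"


if __name__ == "__main__":
    # The two return codes that appear in this ticket.
    for rc in (-5, -17):
        print(describe_rc(rc))
```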

This error persisted on the client until it was rebooted. It appears that the initial mount failure left behind a stale MGC device entry (device 12, in state ST in the "lctl dl" output above), which caused later mounts of this filesystem to fail.



 Comments   
Comment by parinay v kondekar (Inactive) [ 04/Sep/14 ]

Filing a separate ticket as per Peter's comment on LU-4943 (https://jira.hpdd.intel.com/browse/LU-4943?focusedCommentId=91123&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-91123).

Comment by parinay v kondekar (Inactive) [ 04/Sep/14 ]

Xyratex-bug-id: MRP-1524

http://review.whamcloud.com/#/c/10569/

Comment by Cliff White (Inactive) [ 09/Sep/14 ]

I will monitor this issue. There was a test failure; I am looking into a re-run.

Comment by Cliff White (Inactive) [ 19/Sep/14 ]

The patch has failed our testing. Is it possible for you to address the issue?

Comment by Ryan Haasken [ 15/Oct/14 ]

I think this bug really is the same as LU-4943. The patch for LU-4943 has been iterated upon and now takes the same approach as Parinay's patch. The patch for LU-4943 (http://review.whamcloud.com/#/c/10129/14) has landed now, and I have tested that the landed patch resolves this issue. This bug should be closed.

Parinay's patch (http://review.whamcloud.com/#/c/10569/) is almost the same as the above landed patch, so that one can be abandoned.

Generated at Sat Feb 10 01:52:44 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.