[LU-9838] target registration mount fails with -108 but then succeeds if retried immediately

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Affects Version/s: Lustre 2.10.0
    • Environment: Lustre: Build Version: 2.10.0_15_gbaa1ce2 from b2_10

    Description

      Occasionally, when we try to register a target (i.e. mount it for the first time), it fails with:

      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 15f-b: testfs-OST0001: cannot register this server with the MGS: rc = -108. Is the MGS running?
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 29162:0:(obd_mount_server.c:1866:server_fill_super()) Unable to start targets: -108
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 29162:0:(obd_mount_server.c:1576:server_put_super()) no obd testfs-OST0001
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 29162:0:(obd_mount_server.c:135:server_deregister_mount()) testfs-OST0001 not registered
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: Evicted from MGS (at 10.14.83.68@tcp) after server handle changed from 0xddf3a8acc8d5136b to 0xddf3a8acc8d51555
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: MGC10.14.83.68@tcp: Connection restored to 10.14.83.68@tcp (at 10.14.83.68@tcp)
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: server umount testfs-OST0001 complete
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 29162:0:(obd_mount.c:1505:lustre_fill_super()) Unable to mount  (-108)
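
      For reference, the registration in question is just an ordinary first-time target mount. A minimal sketch of the kind of commands involved is below; the device, index, and mount-point names are hypothetical and not taken from this report (which uses osd-zfs targets), while the MGS NID is the one from the log above:

      # Format the target once, pointing it at the MGS (hypothetical device name).
      mkfs.lustre --ost --fsname=testfs --index=1 --mgsnode=10.14.83.68@tcp /dev/sdb

      # The first mount of the target registers it with the MGS; this is the
      # step that intermittently fails with -108.
      mkdir -p /mnt/ost1
      mount -t lustre /dev/sdb /mnt/ost1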
      
      
      

      But if we then try the exact same mount command a second time, immediately following the failure, it succeeds:

      Aug  5 03:35:51 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: testfs-OST0001: new disk, initializing
      Aug  5 03:35:51 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: srv-testfs-OST0001: No data found on store. Initialize space
      Aug  5 03:35:51 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: testfs-OST0001: Imperative Recovery not enabled, recovery window 300-900
      Aug  5 03:35:51 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 30649:0:(osd_oi.c:503:osd_oid()) testfs-OST0001-osd: unsupported quota oid: 0x16
      
      

      The MGS's syslog around the failure time shows:

      Aug  5 03:35:50 lotus-58vm15.lotus.hpdd.lab.intel.com kernel: Lustre: MGS: Connection restored to 7d1fb120-bb6f-8171-c97c-aa44f5e9c3db (at 10.14.83.71@tcp)
      
      

      where 10.14.83.71 is lotus-58vm18.

      I've set this issue to Critical because, while a persistent and curious human user might simply try a second time and discover that it works, it wreaks havoc on automated systems like IML and Pacemaker, which expect commands that should succeed to actually succeed.
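
      As an illustration of the kind of workaround an automated system ends up needing, here is a rough retry wrapper around the mount; the retry count, sleep, and device/mount-point names are illustrative only, and this is not IML's or Pacemaker's actual logic:

      #!/bin/sh
      # Illustrative only: retry the registration mount, since a second attempt
      # made immediately after a -108 failure typically succeeds.
      dev=/dev/sdb          # hypothetical target device
      mnt=/mnt/ost1         # hypothetical mount point

      for attempt in 1 2 3; do
          mount -t lustre "$dev" "$mnt" && exit 0
          rc=$?
          echo "mount attempt $attempt failed with rc=$rc, retrying" >&2
          sleep 2
      done

      echo "giving up after 3 attempts" >&2
      exit "$rc"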

      Attachments

        1. mds-debug.log
          64 kB
        2. mds-debug.log.bz2
          8.54 MB
        3. oss-debug.log
          101 kB
        4. oss-debug.log.bz2
          7.10 MB


          Activity


            adilger Andreas Dilger added a comment -

            Close old issue related to 2.10 servers. Should be reproduced on a modern system to confirm whether the issue still exists.

            javed javed shaikh (Inactive) added a comment -

            Lustre versions on the server and clients have been upgraded; the above issue is no longer relevant to us.

            bhoagland Brad Hoagland (Inactive) added a comment -

            Working around this for IML 4.0 GA.

            brian Brian Murrell (Inactive) added a comment -

            Is this something that is reasonable to fix on b2_10?
            green Oleg Drokin added a comment -

            So the way it unfolds is this:

            00000100:00000040:1.0:1503245974.332685:0:1172:0:(lustre_net.h:2457:ptlrpc_rqphase_move()) @@@ move req "New" -> "Rpc"  req@ffff88001d2c2400 x1576258375780640/t0(0) o250->MGC10.14.80.248@tcp@10.14.80.248@tcp:26/25 lens 520/544 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1
            00000100:00000040:1.0:1503245974.332731:0:1172:0:(niobuf.c:896:ptl_send_rpc()) @@@ send flg=0  req@ffff88001d2c2400 x1576258375780640/t0(0) o250->MGC10.14.80.248@tcp@10.14.80.248@tcp:26/25 lens 520/544 e 0 to 0 dl 1503246010 ref 2 fl Rpc:N/0/ffffffff rc 0/-1
            00000100:00000200:1.0:1503245974.332898:0:1168:0:(events.c:57:request_out_callback()) @@@ type 5, status 0  req@ffff88001d2c2400 x1576258375780640/t0(0) o250->MGC10.14.80.248@tcp@10.14.80.248@tcp:26/25 lens 520/544 e 0 to 0 dl 1503246010 ref 2 fl Rpc:N/0/ffffffff rc 0/-1
            10000000:01000000:0.0:1503245974.355790:0:19104:0:(mgc_request.c:2139:mgc_process_log()) MGC10.14.80.248@tcp: configuration from log 'testfs-client' succeeded (0).
            00000100:00000040:0.0:1503245974.355897:0:19104:0:(lustre_net.h:2457:ptlrpc_rqphase_move()) @@@ move req "New" -> "Rpc"  req@ffff8800449dea00 x1576258375780672/t0(0) o253->MGC10.14.80.248@tcp@10.14.80.248@tcp:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1
            00000100:00000200:0.0:1503245974.355926:0:19104:0:(client.c:1181:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8800449dea00 x1576258375780672/t0(0) o253->MGC10.14.80.248@tcp@10.14.80.248@tcp:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
            00000100:00000400:0.0:1503246010.355433:0:1172:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1503245974/real 1503245974]  req@ffff88001d2c2400 x1576258375780640/t0(0) o250->MGC10.14.80.248@tcp@10.14.80.248@tcp:26/25 lens 520/544 e 0 to 1 dl 1503246010 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            

            So, at least in this instance, it looks like we tried to connect (o250) to the MGS, but it did not respond to us. Yet we already had a cached config log from somewhere, so we went ahead and sent the registration (o253), and because MGC imports are obd_no_recov until they are actually connected, we bail right out instead of waiting for the import to become active first.

            This whole strange situation with the MGC, where it is not recoverable and has odd no-resend policies, has bitten us again; it's probably time to take a good hard look at it.
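
            (From the admin side, one check that at least rules out plain network trouble before the registration mount is an LNet-level ping of the MGS. A small hedged sketch follows; the NID is the one from the debug log above and should be substituted for the cluster in question, and note that this only proves LNet reachability, not that the MGS responded to the o250 connect:)

            # Verify the MGS NID answers over LNet before attempting registration.
            lctl ping 10.14.80.248@tcp || echo "MGS NID not reachable over LNet" >&2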

            brian Brian Murrell (Inactive) added a comment - - edited

            That's strange.

            In any case, I tried to attach two more logs but ran into DCO-7458, so they are on the FTP site in /uploads/LU-9838/ in files mds-debug.log.bz2 and /tmp/oss-debug.log.bz2.

            green Oleg Drokin added a comment -

            Well, this time it failed with -114 ("Operation already in progress"), not -108 ("Cannot send after transport endpoint shutdown"), so I am not sure whether this is the same problem or not.

            00000020:00000001:0.0:1503141275.035516:0:16570:0:(obd_mount_server.c:1753:osd_start()) Process entered
            00000020:01000004:0.0:1503141275.035517:0:16570:0:(obd_mount_server.c:1757:osd_start()) Attempting to start testfs-OST0000, type=osd-zfs, lsifl=200002, mountfl=0
            00000020:01000004:0.0:1503141275.035522:0:16570:0:(obd_mount_server.c:1776:osd_start()) testfs-OST0000-osd already started
            00000020:00000001:0.0:1503141275.035523:0:16570:0:(obd_mount_server.c:1783:osd_start()) Process leaving (rc=18446744073709551502 : -114 : ffffffffffffff8e)
            00000020:00020000:0.0:1503141275.035525:0:16570:0:(obd_mount_server.c:1832:server_fill_super()) Unable to start osd on zfs_pool_scsi0QEMU_QEMU_HARDDISK_disk3/testfs-OST0000: -114
            00000020:00000001:0.0:1503141275.044922:0:16570:0:(obd_mount.c:657:lustre_put_lsi()) Process entered
            00000020:01000004:0.0:1503141275.044924:0:16570:0:(obd_mount.c:661:lustre_put_lsi()) put ffff88003d7e1000 1
            

            This was even before any talking to the MGS, so I guess it is something else.
            It sounds to me like the filesystem was still mounted when you tried to mount it again?
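
            (A quick way to confirm whether the target was in fact still mounted at that point, assuming standard tooling on the node:)

            # List any Lustre filesystems currently mounted on this node.
            mount -t lustre
            # or equivalently:
            grep lustre /proc/mounts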


            brian Brian Murrell (Inactive) added a comment -

            Debug logs attached: mds-debug.log and oss-debug.log.
            joe.grund Joe Grund added a comment -

            Hitting intermittent failures in IML, being tracked under: https://github.com/intel-hpdd/intel-manager-for-lustre/issues/107

            green Oleg Drokin added a comment - - edited

            Basically, I want you to do this before the registration, but after the modules are loaded, on both the MGS and the target:

            lctl set_param debug=-1
            lctl set_param debug_mb=200
            
            

            Then do your registration attempt and, immediately after that, on both nodes:

            lctl dk >/tmp/debug.log
            
            

            If there was a failure, collect those two files and make them available.
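
            Putting the steps above together, a small wrapper of the sort that could be run on each node around the registration attempt might look like the following; the output path and the commented-out mount line are placeholders, while the lctl invocations are the ones requested above:

            #!/bin/sh
            # Run on both the MGS node and the target node, after the Lustre
            # modules are loaded but before the registration attempt.
            lctl set_param debug=-1
            lctl set_param debug_mb=200

            # ... perform the registration mount attempt here, on the target node only ...
            # mount -t lustre <target-device> <mount-point>

            # Immediately afterwards, dump the kernel debug buffer on each node.
            lctl dk > /tmp/debug.log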


            People

              Assignee: wc-triage WC Triage
              Reporter: brian Brian Murrell (Inactive)
              Votes: 1
              Watchers: 8
