  Lustre / LU-9838

target registration mount fails with -108 but then succeeds if retried immediately

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Affects Version/s: Lustre 2.10.0
    • Environment: Lustre: Build Version: 2.10.0_15_gbaa1ce2 from b2_10
    • Severity: 3

    Description

      Occasionally, when we try to register a target (i.e. mount it for the first time), it will fail with:

      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 15f-b: testfs-OST0001: cannot register this server with the MGS: rc = -108. Is the MGS running?
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 29162:0:(obd_mount_server.c:1866:server_fill_super()) Unable to start targets: -108
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 29162:0:(obd_mount_server.c:1576:server_put_super()) no obd testfs-OST0001
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 29162:0:(obd_mount_server.c:135:server_deregister_mount()) testfs-OST0001 not registered
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: Evicted from MGS (at 10.14.83.68@tcp) after server handle changed from 0xddf3a8acc8d5136b to 0xddf3a8acc8d51555
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: MGC10.14.83.68@tcp: Connection restored to 10.14.83.68@tcp (at 10.14.83.68@tcp)
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: server umount testfs-OST0001 complete
      Aug  5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 29162:0:(obd_mount.c:1505:lustre_fill_super()) Unable to mount  (-108)
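
      For reference, the registration here is simply the first mount of a newly formatted target. A minimal sketch of the sequence, assuming a ZFS-backed OST as in the debug logs (the pool/dataset and mount-point names are illustrative and the zpool is assumed to already exist; the MGS NID is the one from the log above):

      mkfs.lustre --ost --backfstype=zfs --fsname=testfs --index=1 \
          --mgsnode=10.14.83.68@tcp zfs_pool/testfs-OST0001
      # the first mount is what registers the target with the MGS; this is the step failing with -108
      mount -t lustre zfs_pool/testfs-OST0001 /mnt/testfs-OST0001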
      
      
      

      But then if we try the exact same mount command a second time, immediately following the failure, it succeeds:

      Aug  5 03:35:51 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: testfs-OST0001: new disk, initializing
      Aug  5 03:35:51 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: srv-testfs-OST0001: No data found on store. Initialize space
      Aug  5 03:35:51 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: testfs-OST0001: Imperative Recovery not enabled, recovery window 300-900
      Aug  5 03:35:51 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 30649:0:(osd_oi.c:503:osd_oid()) testfs-OST0001-osd: unsupported quota oid: 0x16
      
      

      The MGS's syslog around the time of the failure showed:

      Aug  5 03:35:50 lotus-58vm15.lotus.hpdd.lab.intel.com kernel: Lustre: MGS: Connection restored to 7d1fb120-bb6f-8171-c97c-aa44f5e9c3db (at 10.14.83.71@tcp)
      
      

      where 10.14.83.71 is lotus-58vm18.

      I've set this issue to Critical because, while it may be merely disconcerting for a human user, who might be persistent and curious enough to try a second time and discover that it works, it wreaks havoc on automated systems such as IML and Pacemaker, which expect commands that should succeed to succeed.
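
      As a stop-gap for such automation, a retry wrapper of roughly the following shape can mask the failure; this is only a sketch (the device, mount point, and retry count are assumptions), and it papers over the underlying problem rather than fixing it:

      #!/bin/sh
      # Sketch only: retry the registration mount a few times, since a second
      # attempt immediately after the -108 failure has been observed to succeed.
      DEV=zfs_pool/testfs-OST0001      # assumed target device
      MNT=/mnt/testfs-OST0001          # assumed mount point
      for attempt in 1 2 3; do
          mount -t lustre "$DEV" "$MNT" && exit 0
          echo "mount attempt $attempt failed; retrying" >&2
          sleep 2
      done
      exit 1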

      Attachments

        1. mds-debug.log
          64 kB
        2. mds-debug.log.bz2
          8.54 MB
        3. oss-debug.log
          101 kB
        4. oss-debug.log.bz2
          7.10 MB


          Activity

             Brian Murrell (Inactive) added a comment - edited

            That's strange.

            In any case, I tried to attach two more logs but ran into DCO-7458 so they are on the ftp site in /uploads/LU-9838/ in files mds-debug.log.bz2 and /tmp/oss-debug.log.bz2.

             Oleg Drokin added a comment -

             Well, this time it failed with -114 ("operation already in progress") rather than -108 ("Cannot send after transport endpoint shutdown"), so I am not sure whether this is the same problem or not.

            00000020:00000001:0.0:1503141275.035516:0:16570:0:(obd_mount_server.c:1753:osd_start()) Process entered
            00000020:01000004:0.0:1503141275.035517:0:16570:0:(obd_mount_server.c:1757:osd_start()) Attempting to start testfs-OST0000, type=osd-zfs, lsifl=200002, mountfl=0
            00000020:01000004:0.0:1503141275.035522:0:16570:0:(obd_mount_server.c:1776:osd_start()) testfs-OST0000-osd already started
            00000020:00000001:0.0:1503141275.035523:0:16570:0:(obd_mount_server.c:1783:osd_start()) Process leaving (rc=18446744073709551502 : -114 : ffffffffffffff8e)
            00000020:00020000:0.0:1503141275.035525:0:16570:0:(obd_mount_server.c:1832:server_fill_super()) Unable to start osd on zfs_pool_scsi0QEMU_QEMU_HARDDISK_disk3/testfs-OST0000: -114
            00000020:00000001:0.0:1503141275.044922:0:16570:0:(obd_mount.c:657:lustre_put_lsi()) Process entered
            00000020:01000004:0.0:1503141275.044924:0:16570:0:(obd_mount.c:661:lustre_put_lsi()) put ffff88003d7e1000 1
            

             This was even before any communication with the MGS, so I guess it is something else.
             It sounds to me like the filesystem was already mounted when you tried to mount it again?
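
             (A quick way to check that on the OSS before retrying, assuming the standard utilities, would be something like:)

             mount -t lustre    # list currently mounted Lustre targets
             lctl dl            # list configured obd devices; an already-started testfs-OST0000-osd would show up here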


             Brian Murrell (Inactive) added a comment -

             Debug logs attached: mds-debug.log and oss-debug.log.
             Joe Grund added a comment -

            Hitting intermittent failures in IML, being tracked under: https://github.com/intel-hpdd/intel-manager-for-lustre/issues/107

             Oleg Drokin added a comment - edited

             Basically, I want you to do this before the registration but after the modules are loaded, on both the MGS and the target:

            lctl set_param debug=-1
            lctl set_param debug_mb=200
            
            

            then do your registration attempt and immediately after that on both nodes:

            lctl dk >/tmp/debug.log
            
            

             If there was a failure, collect those two files and make them available.
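
             (Putting those steps together: a rough sketch of a capture wrapper for the target node follows; the device, mount point, and output path are assumptions, and on the MGS side only the two set_param lines beforehand and an "lctl dk" dump after the failed attempt are needed.)

             #!/bin/sh
             # Sketch only: enable full debugging, attempt the registration mount,
             # and dump the kernel debug buffer if the mount fails.
             lctl set_param debug=-1
             lctl set_param debug_mb=200
             if ! mount -t lustre zfs_pool/testfs-OST0001 /mnt/testfs-OST0001; then
                 lctl dk >/tmp/oss-debug.log
             fi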

             Brian Murrell (Inactive) added a comment - edited

             green: Since it can take a while to hit this bug, and so that we don't iterate over this multiple times until I get it right and collect enough of the correct debug information, could I bother you to list here the commands I need to start the debugging you are looking for, and then to collect it when the bug hits?  Thanks much.

             Oleg Drokin added a comment -

             I know it's only hitting occasionally, but we probably still need you to reproduce this with all debug enabled and dump debug logs on both ends once it hits.
            Do you think this is possible?


            People

              Assignee: WC Triage
              Reporter: Brian Murrell (Inactive)
              Votes: 1
              Watchers: 8
