Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Critical
- Affects Version: Lustre 2.10.0
- Environment: Lustre: Build Version: 2.10.0_15_gbaa1ce2 from b2_10
- Severity: 3
Description
Occasionally, when we try to register a target (i.e. mount it for the first time), the mount fails with:
Aug 5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 15f-b: testfs-OST0001: cannot register this server with the MGS: rc = -108. Is the MGS running?
Aug 5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 29162:0:(obd_mount_server.c:1866:server_fill_super()) Unable to start targets: -108
Aug 5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 29162:0:(obd_mount_server.c:1576:server_put_super()) no obd testfs-OST0001
Aug 5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 29162:0:(obd_mount_server.c:135:server_deregister_mount()) testfs-OST0001 not registered
Aug 5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: Evicted from MGS (at 10.14.83.68@tcp) after server handle changed from 0xddf3a8acc8d5136b to 0xddf3a8acc8d51555
Aug 5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: MGC10.14.83.68@tcp: Connection restored to 10.14.83.68@tcp (at 10.14.83.68@tcp)
Aug 5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: server umount testfs-OST0001 complete
Aug 5 03:35:50 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 29162:0:(obd_mount.c:1505:lustre_fill_super()) Unable to mount (-108)
But if we then run the exact same mount command a second time, immediately following the failure, it succeeds:
Aug 5 03:35:51 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: testfs-OST0001: new disk, initializing
Aug 5 03:35:51 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: srv-testfs-OST0001: No data found on store. Initialize space
Aug 5 03:35:51 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: Lustre: testfs-OST0001: Imperative Recovery not enabled, recovery window 300-900
Aug 5 03:35:51 lotus-58vm18.lotus.hpdd.lab.intel.com kernel: LustreError: 30649:0:(osd_oi.c:503:osd_oid()) testfs-OST0001-osd: unsupported quota oid: 0x16
The MGS's syslog around the time of the failure shows:
Aug 5 03:35:50 lotus-58vm15.lotus.hpdd.lab.intel.com kernel: Lustre: MGS: Connection restored to 7d1fb120-bb6f-8171-c97c-aa44f5e9c3db (at 10.14.83.71@tcp)
where 10.14.83.71 is lotus-58vm18.
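For anyone triaging similar logs: the rc value in the LustreError lines is a negated errno, and on Linux errno 108 is ESHUTDOWN. A minimal sketch for pulling the code out of a log line and decoding it follows. The helper name decode_rc is hypothetical, and the decoded name depends on the platform's errno table, so it should be run on Linux.

```python
import errno
import os
import re

# Matches the "rc = -108" style value found in LustreError lines.
RC_PATTERN = re.compile(r"rc = (-?\d+)")


def decode_rc(log_line):
    """Return (rc, errno name, description) for the first 'rc = N' found, else None.

    Hypothetical helper for illustration; errno numbering is platform-specific.
    """
    match = RC_PATTERN.search(log_line)
    if match is None:
        return None
    rc = int(match.group(1))
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return rc, name, os.strerror(code)


line = ("LustreError: 15f-b: testfs-OST0001: cannot register this server "
        "with the MGS: rc = -108. Is the MGS running?")
print(decode_rc(line))
```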
I've set this issue to Critical because, while the failure may be merely disconcerting for a human user who is persistent and curious enough to try a second time and discover that it works, it wreaks havoc on automated systems like IML and Pacemaker, which expect commands that should succeed to succeed.
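Until the underlying race is understood, automation that drives the mount could tolerate the transient failure by retrying. The following is only a sketch of such a workaround, not a fix and not how IML or Pacemaker actually behave. The function name mount_with_retry, the attempt count, and the delay are assumptions chosen for illustration.

```python
import subprocess
import time


def mount_with_retry(device, mount_point, attempts=3, delay_s=1.0,
                     run=subprocess.run):
    """Run `mount -t lustre <device> <mount_point>`, retrying on failure.

    Returns the 1-based number of the attempt that succeeded and raises
    RuntimeError if every attempt fails. `run` is injectable so the retry
    logic can be exercised without a real Lustre target.
    """
    cmd = ["mount", "-t", "lustre", device, mount_point]
    last = None
    for attempt in range(1, attempts + 1):
        last = run(cmd, capture_output=True, text=True)
        if last.returncode == 0:
            return attempt
        if attempt < attempts:
            # Brief pause before retrying; in this report the second
            # attempt succeeded about one second after the first failed.
            time.sleep(delay_s)
    raise RuntimeError(
        f"mount failed after {attempts} attempts: {last.stderr.strip()}"
    )
```

A caller would invoke it with its own device and mount point, for example mount_with_retry("/dev/sdb", "/mnt/testfs-OST0001"), where both paths are hypothetical. Retrying hides the symptom from the caller, so logging each failed attempt is advisable if this pattern is used.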
Issue Links
- is related to LU-4966 handle server registration errors gracefully (Open)