[LU-15935] MDT mount fails with "duplicate generation for client export" during failover Created: 12/Jun/22 Updated: 20/May/23 Resolved: 02/Nov/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0, Lustre 2.12.8, Lustre 2.12.9 |
| Fix Version/s: | Lustre 2.16.0, Lustre 2.15.3 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Andreas Dilger | Assignee: | Etienne Aujames |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Mounting the MDT during recovery immediately fails with:
(tgt_lastrcvd.c:1575:tgt_clients_data_init()) fs2-MDT000f: duplicate export for client generation 17
(obd_config.c:559:class_setup()) setup fs2-MDT000f failed (-114)
(obd_config.c:1836:class_config_llog_handler()) MGC10.11.12.142@o2ib: cfg command failed: rc = -114
Lustre: cmd=cf003 0:fs2-MDT000f 1:fs2-MDT000f_UUID 2:15 3:fs2-MDT000f-mdtlov 4:f
15c-8: MGC10.11.12.142@o2ib: The configuration from log 'fs2-MDT000f' failed (-114).
This may be the result of communication errors between this node and the MGS, a bad
configuration, or other errors. See the syslog for more information.
(obd_mount_server.c:1408:server_start_targets()) failed to start server fs2-MDT000f: -114
(obd_mount_server.c:2052:server_fill_super()) Unable to start targets: -114
There appear to be duplicate client recovery records in the MDT's last_rcvd file that prevent it from creating all of the client exports in tgt_clients_data_init(). Despite the later messages, this has nothing to do with the MGS. Removing the last_rcvd file allowed the MDT to be mounted, as mounting with "-o abort_recov" likely would have done. |
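To make the failure mode in the description concrete, here is a minimal Python sketch of it. It is not Lustre code: `ClientSlot`, `find_duplicate_generations`, and `strict_init` are hypothetical names, and the slot layout is an assumption chosen only to illustrate how one repeated generation number can abort the whole setup. The only values taken from the ticket are generation 17 and the return code -114, which is `-EALREADY` on Linux.

```python
from collections import defaultdict
from dataclasses import dataclass

EALREADY = 114  # Linux errno value; matches the -114 seen in the log


@dataclass(frozen=True)
class ClientSlot:
    """Hypothetical stand-in for one client recovery record in last_rcvd."""
    index: int       # slot position in the file (assumed layout)
    uuid: str        # client UUID (made-up values below)
    generation: int  # per-client generation number


def find_duplicate_generations(slots):
    """Return {generation: [slot indexes]} for generations used more than once."""
    by_gen = defaultdict(list)
    for slot in slots:
        by_gen[slot.generation].append(slot.index)
    return {gen: idxs for gen, idxs in by_gen.items() if len(idxs) > 1}


def strict_init(slots):
    """Mimic the reported behaviour: fail on the first duplicate generation."""
    seen = set()
    for slot in slots:
        if slot.generation in seen:
            return -EALREADY  # the whole setup fails, as in the ticket
        seen.add(slot.generation)
    return 0


# Illustrative data only; the UUIDs are invented.
slots = [
    ClientSlot(0, "client-a", 16),
    ClientSlot(1, "client-b", 17),
    ClientSlot(2, "client-c", 17),  # same generation as slot 1
]

print(find_duplicate_generations(slots))  # {17: [1, 2]}
print(strict_init(slots))                 # -114
```

In this sketch a single bad record is enough to make `strict_init` return -114, which mirrors why the mount failed even though every other client record was usable.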
| Comments |
| Comment by Andreas Dilger [ 12/Jun/22 ] |
|
Rather than failing the mount in this case, it makes sense to evict the client(s) that have the duplicate slots and continue the mount. There is no other course of action at this point anyway, and evicting a single client is better than aborting recovery (or deleting the last_rcvd file) and evicting all of the clients. That would allow the system to be mounted without manual intervention to finish the recovery. |
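The policy proposed in this comment can be sketched as follows. This is an illustration only, not the landed patch: `ClientSlot` and `init_with_eviction` are hypothetical names, and keeping the first record for each generation while evicting later ones is an assumption, since the comment does not say which of the duplicate slots should survive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientSlot:
    """Hypothetical stand-in for one client recovery record in last_rcvd."""
    index: int       # slot position in the file (assumed layout)
    uuid: str        # client UUID (made-up values below)
    generation: int  # per-client generation number


def init_with_eviction(slots):
    """Keep the first slot per generation and evict the rest.

    Returns (kept, evicted) instead of an error code, so the caller can
    continue the mount and recover every non-duplicate client.
    """
    kept, evicted = [], []
    seen = set()
    for slot in slots:
        if slot.generation in seen:
            evicted.append(slot)  # only this client loses its recovery state
            continue
        seen.add(slot.generation)
        kept.append(slot)
    return kept, evicted


# Illustrative data only; the UUIDs are invented.
slots = [
    ClientSlot(0, "client-a", 16),
    ClientSlot(1, "client-b", 17),
    ClientSlot(2, "client-c", 17),  # duplicate generation
    ClientSlot(3, "client-d", 18),
]

kept, evicted = init_with_eviction(slots)
for slot in evicted:
    # Report enough detail to investigate the duplicate afterwards.
    print(f"evicting slot {slot.index} ({slot.uuid}): "
          f"duplicate generation {slot.generation}")
print([s.uuid for s in kept])
```

With this input, three of the four clients keep their recovery state and one is evicted, compared with all four losing it if recovery is aborted or last_rcvd is removed.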
| Comment by Etienne Aujames [ 18/Jul/22 ] |
|
Hi Andreas, we have run into this at CEA with 8000+ clients. I do not have direct access to this machine, but CEA will try to get some logs. Can you elaborate on how it happens and how we can reproduce this issue? |
| Comment by Andreas Dilger [ 18/Jul/22 ] |
|
Etienne, sorry, I don't know why this happens. I think it would be reasonable to either ignore this error or at most evict the client that has the duplicate export, possibly printing more details about the client NID and related info to see why it happens. |
| Comment by Etienne Aujames [ 27/Jul/22 ] |
|
I successfully reproduced an issue that could lead to "duplicate generation..." with abort_recovery: Here are the steps to reproduce it:
|
| Comment by Etienne Aujames [ 27/Jul/22 ] |
|
This issue is likely to occur with unstable Lustre client nodes or with network issues while doing failover/failback on an MDT. |
| Comment by Gerrit Updater [ 29/Jul/22 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/48082 |
| Comment by Gerrit Updater [ 31/Aug/22 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/48400 |
| Comment by Gerrit Updater [ 02/Nov/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48082/ |
| Comment by Peter Jones [ 02/Nov/22 ] |
|
Landed for 2.16 |
| Comment by Gerrit Updater [ 14/Dec/22 ] |
|
"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49398 |
| Comment by Gerrit Updater [ 14/Dec/22 ] |
|
"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49399 |
| Comment by Gerrit Updater [ 14/Dec/22 ] |
|
"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49400 |
| Comment by Gerrit Updater [ 03/Jan/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49398/ |
| Comment by Gerrit Updater [ 19/Apr/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49399/ |
| Comment by Gerrit Updater [ 19/Apr/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49400/ |