Lustre / LU-3458

OST not able to register at MGS with predefined index.

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.0.0, Lustre 2.1.0, Lustre 2.2.0, Lustre 2.3.0, Lustre 2.4.0, Lustre 1.8.x (1.8.0 - 1.8.5), Lustre 2.5.0
    • Labels: None
    • Environment: Any Lustre from 1.6.0 with mountconf and an OST prepared with the --index option.
    • Severity: 3
    • 8645

    Description

      The client may lose the reply to a target registration request while the MGS believes the reply was delivered and marks the target index as used. Since the client never received the reply, it restarts registration from the beginning after reconnecting, and the MGS then responds that the index is already in use.

      [ 2619.730706] Lustre: MGC172.18.1.2@tcp: Reactivating import
      [ 2626.816706] Lustre: 56551:0:(client.c:1819:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1370729827/real 1370729827]  req@ffff8806014d5c00 x1437314393833474/t0(0) o253->MGC172.18.1.2@tcp@172.18.1.2@tcp:26/25 lens 4736/4736 e 0 to 1 dl 1370729834 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      [ 2626.844904] LustreError: 166-1: MGC172.18.1.2@tcp: Connection to MGS (at 172.18.1.2@tcp) was lost; in progress operations using this service will fail
      [ 2626.896533] Lustre: MGC172.18.1.2@tcp: Reactivating import
      [ 2626.902111] Lustre: MGC172.18.1.2@tcp: Connection restored to MGS (at 172.18.1.2@tcp)
      [ 2629.380926] Lustre: MGC172.18.1.2@tcp: Reactivating import
      [ 2632.367220] LustreError: 15f-b: Communication to the MGS return error -98. Is the MGS running?
      [ 2632.376077] LustreError: 58337:0:(obd_mount.c:1834:server_fill_super()) Unable to start targets: -98
      

      The attached logs describe the bug in detail (log1 is from the MGS side, log2 from the OSS side; the initial registration xid is x1437620344193026).

      The bug hits because the MGS does not save a reply for the target register command and assumes the client always receives it. It was originally hit on the Xyratex b_neo_stable branch (mostly a 2.1 codebase), but a quick look shows the bug exists in 2.4 as well.
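The failure mode can be sketched as a toy simulation (this is not Lustre code; all names here are invented for illustration — the real path involves the MGC resend logic on the client and the target-registration handler on the MGS):

```python
# Toy model of the race described above: the MGS marks the index as used
# and replies, but never saves the reply for resend reconstruction.
EADDRINUSE = 98  # reported as -98 in the OSS log

class ToyMgs:
    def __init__(self):
        self.used_indices = set()

    def handle_target_reg(self, index):
        if index in self.used_indices:
            return -EADDRINUSE       # "index already used"
        self.used_indices.add(index)
        return 0                      # this reply may be lost on the wire

mgs = ToyMgs()
first = mgs.handle_target_reg(2)      # reply lost -> client never sees 0
retry = mgs.handle_target_reg(2)      # client restarts registration
print(first, retry)                   # 0 -98: the OST is now wedged
```

The second call returning -98 with no way forward is exactly the `server_fill_super()) Unable to start targets: -98` seen in the log above.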

      Attachments

        1. log1
          6.34 MB
        2. log2
          5.78 MB

        Issue Links

          Activity


            adilger Andreas Dilger added a comment - I filed LU-17716, which could address this issue. I propose "tunefs.lustre --replace" to allow updating the OST config/label so that it does not think it needs to register with the MGS again.

            shadow Alexey Lyashkov added a comment - One more problem in this area: we have a single last_rcvd per disk storage, so the MGS and MDT either need to share a single last_rcvd file, or we need to rename the file to be able to run several recovery services on a single disk partition.
            pjones Peter Jones added a comment - ok. Thanks Alexey!

            shadow Alexey Lyashkov added a comment - Not currently; I am more interested in fixing LNet issues right now, and a workaround exists. But I would be happy to discuss generic recovery for the MGS, as it is a long-standing task.

            adilger Andreas Dilger added a comment - So, to clarify, I'm not against fixing the MGS/MGC recovery code, but this needs to be done carefully to avoid the problem of complex recovery on the MGS. It should only be done for OSTs connecting during initial registration.

            Shadow, are you planning on working on this problem?

            adilger Andreas Dilger added a comment - There is no longer dynamic index mapping in Lustre 2.4+ because this was never used by real systems where the administrator wants to know which OST index is on a particular node.

            I agree with Alex that last_rcvd would handle the reply reconstruction, but it also means the MGS would need to "recover" if all clients are in the last_rcvd file (which would be bad). The clients should NOT be added to the last_rcvd, only the new servers. Ideally, there should be some way for the OST to remove itself from the last_rcvd file after it has registered and gotten a reply?

            One danger (and the reason this is an error in the first place) is that you don't want multiple OSTs accidentally trying to claim that they are the same OST index. That would cause serious filesystem corruption. That means there should be some way to determine the OST is the right one (e.g. RPC XID) before sending the reconstructed reply. This wouldn't work if both the MGS and OSS rebooted, but is probably better than today.

            I was also wondering if there should also be some way to add a newly-formatted OST to replace an old OST with a "--replace" option to mkfs.lustre, which removes the "LDD_F_VIRGIN" flag and allows it to take over the old OST slot. This isn't directly related, but a similar problem. That would mean at least the administrator has to understand what is happening before adding the OST in an existing index.
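The fix direction in the comment above — remember who claimed each index and reconstruct the reply on a matching resend — can be sketched as follows. This is a hedged illustration, not Lustre code: the identity key (client NID plus RPC XID) and all names are assumptions for the sketch.

```python
# Sketch: on a resend carrying the same (nid, xid) that originally claimed
# the index, reconstruct the successful reply instead of returning -EADDRINUSE.
EADDRINUSE = 98

class ToyMgs:
    def __init__(self):
        self.index_owner = {}            # index -> (nid, xid)

    def handle_target_reg(self, nid, xid, index):
        owner = self.index_owner.get(index)
        if owner is None:
            self.index_owner[index] = (nid, xid)
            return 0
        if owner == (nid, xid):
            return 0                      # resend of the committed request:
                                          # reconstruct the original reply
        return -EADDRINUSE                # a different OST claims the index

mgs = ToyMgs()
assert mgs.handle_target_reg("oss1@tcp", 0x1001, 2) == 0   # reply lost
assert mgs.handle_target_reg("oss1@tcp", 0x1001, 2) == 0   # resend succeeds
assert mgs.handle_target_reg("oss2@tcp", 0x2002, 2) == -EADDRINUSE
```

As the comment notes, an in-memory XID check like this does not survive both the MGS and OSS rebooting; making it robust would require persisting the record, e.g. in a last_rcvd-style file.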

            shadow Alexey Lyashkov added a comment - Looks like we need to implement the long-standing task of simplifying recovery for the MGS. add_target would send a transaction number to the target, and the request would be resent to the MGS if the MGS crashes. If the request is resent but was already committed on the MGS, the reply is reconstructed from the on-disk state. But this may be too hard with dynamic index allocation from the same NID, since the MGS doesn't know about the mapping, and it produces a bitmap leak (though it looks like that bug also exists today).

            bzzz Alex Zhuravlev added a comment - I think the MGS could be able to reconstruct the reply using last_rcvd, but that doesn't cover the case when the MGS crashes and the reply was lost. I guess in the latter case the only way is to remove the new profile with writeconf.
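A rough sketch of the last_rcvd idea from the comment above: per-client slots persisted to disk so a restarted server can replay the committed reply. The field names loosely mimic the spirit of a last_rcvd client record but are invented here, and JSON stands in for the real binary, transactional on-disk format.

```python
# Sketch: persist (last_xid, last_result) per client so a lost reply can be
# reconstructed even across a server restart.
import json, os, tempfile

class ToyLastRcvd:
    def __init__(self, path):
        self.path = path
        self.slots = {}
        if os.path.exists(path):
            with open(path) as f:
                self.slots = json.load(f)

    def record(self, uuid, xid, result):
        self.slots[uuid] = {"last_xid": xid, "last_result": result}
        with open(self.path, "w") as f:
            json.dump(self.slots, f)      # real code writes transactionally

    def reconstruct(self, uuid, xid):
        slot = self.slots.get(uuid)
        if slot and slot["last_xid"] == xid:
            return slot["last_result"]    # replay the committed reply
        return None                       # unseen request: process normally

path = os.path.join(tempfile.mkdtemp(), "last_rcvd")
before = ToyLastRcvd(path)
before.record("OST0002-uuid", 0x1001, 0)  # reply committed, then lost
after = ToyLastRcvd(path)                  # server restarts, reloads the file
assert after.reconstruct("OST0002-uuid", 0x1001) == 0
```

As the comment says, this only helps when the commit made it to disk; if the MGS crashes before that, the reply is simply gone, and writeconf remains the fallback.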

            shadow Alexey Lyashkov added a comment - Xyratex-bug MRP-1111

            People

              Assignee: wc-triage WC Triage
              Reporter: shadow Alexey Lyashkov
              Votes: 1
              Watchers: 9

              Dates

                Created:
                Updated: