[LU-2977] network connection rejected due to consumer defined fatal error Created: 17/Mar/13  Updated: 12/Jan/19  Resolved: 12/Jan/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.3
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Shuichi Ihara (Inactive)
Assignee: Liang Zhen (Inactive)
Resolution: Not a Bug
Votes: 0
Labels: None
Environment:

CentOS6.3 Lustre-2.1.3


Attachments: Text File lustre-errors.txt    
Severity: 3
Rank (Obsolete): 7259

 Description   

When the client mounts the Lustre file system, we see the following errors. What does "consumer defined fatal error" mean, and why is the connection rejected?

Mar 17 08:42:52 r3169 kernel: Lustre: Lustre: Build Version: RC2--PRISTINE-2.6.32-279.19.1.el6.x86_64
Mar 17 08:42:52 r3169 kernel: Lustre: Added LNI 10.9.55.1@o2ib3 [8/64/0/180]
Mar 17 08:42:53 r3169 kernel: Lustre: Lustre OSC module (ffffffffa0e9b880).
Mar 17 08:42:53 r3169 kernel: Lustre: Lustre LOV module (ffffffffa0f2dce0).
Mar 17 08:42:53 r3169 kernel: Lustre: Lustre client module (ffffffffa1019020).
Mar 17 08:42:53 r3169 kernel: Lustre: MGC10.9.103.1@o2ib3: Reactivating import
Mar 17 08:42:53 r3169 kernel: LustreError: 929:0:(o2iblnd_cb.c:2569:kiblnd_rejected()) 10.9.102.38@o2ib3 rejected: consumer defined fatal error
Mar 17 08:42:53 r3169 kernel: Lustre: 3305:0:(client.c:1817:ptlrpc_expire_one_request()) @@@ Request  sent has failed due to network error: [sent 1363470173/real 1363470173]  req@ffff88062749b400 x1429702099075137/t0(0) o8->images-OST002e-osc-ffff880865647000@10.9.102.38@o2ib3:28/4 lens 368/512 e 0 to 1 dl 1363470178 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 17 08:43:43 r3169 kernel: LustreError: 11-0: an error occurred while communicating with 10.9.102.37@o2ib3. The ost_connect operation failed with -19
Mar 17 08:44:08 r3169 kernel: LustreError: 927:0:(o2iblnd_cb.c:2569:kiblnd_rejected()) 10.9.102.38@o2ib3 rejected: consumer defined fatal error
Mar 17 08:44:08 r3169 kernel: Lustre: 3305:0:(client.c:1817:ptlrpc_expire_one_request()) @@@ Request  sent has failed due to network error: [sent 1363470248/real 1363470248]  req@ffff881064484800 x1429702099075267/t0(0) o8->images-OST002e-osc-ffff880865647000@10.9.102.38@o2ib3:28/4 lens 368/512 e 0 to 1 dl 1363470259 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 17 08:44:33 r3169 kernel: LustreError: 11-0: an error occurred while communicating with 10.9.102.37@o2ib3. The ost_connect operation failed with -19
Mar 17 08:44:58 r3169 kernel: LustreError: 935:0:(o2iblnd_cb.c:2569:kiblnd_rejected()) 10.9.102.38@o2ib3 rejected: consumer defined fatal error
Mar 17 08:44:58 r3169 kernel: Lustre: 3305:0:(client.c:1817:ptlrpc_expire_one_request()) @@@ Request  sent has failed due to network error: [sent 1363470298/real 1363470298]  req@ffff88100a3e6800 x1429702099078280/t0(0) o8->images-OST002e-osc-ffff880865647000@10.9.102.38@o2ib3:28/4 lens 368/512 e 0 to 1 dl 1363470314 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 17 08:45:23 r3169 kernel: LustreError: 11-0: an error occurred while communicating with 10.9.102.37@o2ib3. The ost_connect operation failed with -19
Mar 17 08:45:48 r3169 kernel: LustreError: 935:0:(o2iblnd_cb.c:2569:kiblnd_rejected()) 10.9.102.38@o2ib3 rejected: consumer defined fatal error
Mar 17 08:45:48 r3169 kernel: Lustre: 3305:0:(client.c:1817:ptlrpc_expire_one_request()) @@@ Request  sent has failed due to network error: [sent 1363470348/real 1363470348]  req@ffff88100a019000 x1429702099078408/t0(0) o8->images-OST002e-osc-ffff880865647000@10.9.102.38@o2ib3:28/4 lens 368/512 e 0 to 1 dl 1363470369 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 17 08:46:13 r3169 kernel: LustreError: 11-0: an error occurred while communicating with 10.9.102.37@o2ib3. The ost_connect operation failed with -19
Mar 17 08:46:38 r3169 kernel: LustreError: 935:0:(o2iblnd_cb.c:2569:kiblnd_rejected()) 10.9.102.38@o2ib3 rejected: consumer defined fatal error
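To see at a glance which peer is rejecting connections and how often, the excerpt above can be tallied with a short script. This is only a sketch: the regular expression is an assumption derived from the kiblnd_rejected() lines shown here, and `tally_rejections` is a hypothetical helper name, not part of any Lustre tooling.

```python
import re
from collections import Counter

# Assumed pattern, based only on the kiblnd_rejected() lines in this ticket.
REJECT_RE = re.compile(r"kiblnd_rejected\(\)\) (\S+) rejected: (.+)$")

def tally_rejections(lines):
    """Count (peer NID, reason) pairs among o2iblnd reject messages."""
    counts = Counter()
    for line in lines:
        match = REJECT_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2).strip())] += 1
    return counts

# Two lines copied from the excerpt above: one reject, one unrelated error.
sample = [
    "Mar 17 08:42:53 r3169 kernel: LustreError: 929:0:(o2iblnd_cb.c:2569:kiblnd_rejected()) 10.9.102.38@o2ib3 rejected: consumer defined fatal error",
    "Mar 17 08:43:43 r3169 kernel: LustreError: 11-0: an error occurred while communicating with 10.9.102.37@o2ib3. The ost_connect operation failed with -19",
]

for (nid, reason), n in tally_rejections(sample).items():
    print(f"{nid}: {n} x {reason}")
```

In the log above, every reject comes from 10.9.102.38@o2ib3, while the ost_connect failures are reported against 10.9.102.37@o2ib3.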


 Comments   
Comment by Peter Jones [ 17/Mar/13 ]

Liang

Could you please advise on this one?

Thanks

Peter

Comment by Isaac Huang (Inactive) [ 20/Mar/13 ]

That's a weird error: the client doesn't seem to recognize the magic number in the reject messages. Were there any errors on 10.9.102.38@o2ib3?

Comment by Cory Spitz [ 07/Jun/17 ]

From lustre-discuss@lists.lustre.org 4/25/2017:

Regarding:
> LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib rejected: consumer defined fatal error

Andreas Dilger noted:

This means that the LND didn't connect at startup time, but I don't know what the cause is.
The error that generates this message is IB_CM_REJ_CONSUMER_DEFINED, but I don't know enough about IB to tell you what that means.

Doug Oucharek responded:

That specific message appears when the “magic” u32 field at the start of a message does not match what we expect. We do check whether the message was sent in a different endianness from ours, so when you see this error we assume that the message has been corrupted or that the sender is using an invalid magic value. I don’t believe this value has changed in the history of the LND, so this is more likely corruption of some sort.
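The check Doug describes can be illustrated roughly as follows. This is a sketch of the general technique (compare the leading u32 against the expected magic in native and byte-swapped form), not the o2iblnd code itself. `EXPECTED_MAGIC` is a placeholder value chosen for illustration, and `classify_magic` and `swab32` are hypothetical helper names.

```python
import struct

# Placeholder for illustration only; NOT the real o2iblnd magic value.
EXPECTED_MAGIC = 0x12345678

def swab32(value):
    """Return the 32-bit value with its byte order reversed."""
    return struct.unpack("<I", struct.pack(">I", value))[0]

def classify_magic(msg: bytes) -> str:
    """Classify a message by the u32 'magic' field at its start.

    'native'  : magic matches in our byte order
    'swapped' : magic matches after byte-swapping (peer has other endianness)
    'reject'  : neither matches, so the message is corrupt or the magic is invalid
    """
    if len(msg) < 4:
        return "reject"
    (magic,) = struct.unpack_from("=I", msg, 0)  # read in native byte order
    if magic == EXPECTED_MAGIC:
        return "native"
    if magic == swab32(EXPECTED_MAGIC):
        return "swapped"
    return "reject"

print(classify_magic(struct.pack("=I", EXPECTED_MAGIC)))
print(classify_magic(struct.pack("=I", swab32(EXPECTED_MAGIC))))
print(classify_magic(b"\x00\x00\x00\x00"))
```

Only the third case corresponds to the situation in this ticket: the field matches in neither byte order, which is why corruption is suspected rather than an endianness mismatch.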

Comment by Cory Spitz [ 07/Jun/17 ]

FYI, regarding the lustre-discuss conversation: the cause was determined to be a failing IB subnet manager. Although sminfo reported good health, a more formal check of the manager showed that it was faulty.

Generated at Sat Feb 10 01:29:54 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.