[LU-1809] Clients unable to mount (-108) Created: 31/Aug/12  Updated: 19/Oct/12  Resolved: 19/Oct/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.8
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Kit Westneat (Inactive) Assignee: Keith Mannthey (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Attachments: Text File mds-2-1.log    
Severity: 3
Rank (Obsolete): 6342

 Description   

NOAA hit a problem that looks a lot like LU-441. The clients were unable to mount the filesystem for a while after rebooting.

Here's the client's syslog:
Aug 30 15:00:09 s1 kernel: Lustre: 2524:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1410486709891636 sent from MGC10.179.16.120@o2ib to NID 10.179.16.121@o2ib 0s ago has failed due to network error (35s prior to deadline).
Aug 30 15:00:09 s1 kernel: req@ffff8805fc06e400 x1410486709891636/t0 o250->MGS@MGC10.179.16.120@o2ib_1:26/25 lens 368/584 e 0 to 1 dl 1346338844 ref 1 fl Rpc:N/0/0 rc 0/0
Aug 30 15:00:09 s1 kernel: LustreError: 112398:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff88041c05e800 x1410486709891637/t0 o501->MGS@MGC10.179.16.120@o2ib_1:26/25 lens 264/432 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
Aug 30 15:00:09 s1 kernel: LustreError: 15c-8: MGC10.179.16.120@o2ib: The configuration from log 'lfs2-client' failed (-108). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Aug 30 15:00:09 s1 kernel: LustreError: 112398:0:(llite_lib.c:1095:ll_fill_super()) Unable to process log: -108
Aug 30 15:00:09 s1 kernel: Lustre: client lfs2-client(ffff88041c2ea400) umount complete
Aug 30 15:00:09 s1 kernel: LustreError: 112398:0:(obd_mount.c:2065:lustre_fill_super()) Unable to mount (-108)

MDS logs to come.



 Comments   
Comment by Isaac Huang (Inactive) [ 31/Aug/12 ]

Likely this is a dup of LU-630. The message below says that an outgoing message sent 0 seconds ago failed with a network error:
2524:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1410486709891636 sent from MGC10.179.16.120@o2ib to NID 10.179.16.121@o2ib 0s ago has failed due to network error (35s prior to deadline).

This looks like a local error, i.e. the message did not go out on the wire. Please:
1. On all clients and servers: options ko2iblnd peer_timeout=0
2. On some clients where mount failed: echo +neterror > /proc/sys/lnet/printk
This must be done after each client reboot.

If this is a dup of LU-630, then step 1 should fix it. If it still persists, step 2 would allow more debug data to go to syslog; /proc/sys/lnet/peers would also provide useful data in this case.
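
The triage reasoning in this comment (a request that fails with a network error 0s after being sent probably never left the node) can be sketched as a small log filter. This is only an illustrative sketch: the function name find_local_send_failures and the max_age_s threshold are hypothetical, and the pattern is based solely on the ptlrpc_expire_one_request() wording quoted in this ticket, so other Lustre versions may word the message differently.

```python
import re

# Pattern built from the message wording quoted in this ticket (an assumption
# for any other Lustre version).
EXPIRE_RE = re.compile(
    r"Request\s+(?P<xid>x\d+)\s+sent from\s+(?P<src>\S+)\s+to NID\s+(?P<nid>\S+)\s+"
    r"(?P<age>\d+)s ago has failed due to network error"
)


def find_local_send_failures(log_text, max_age_s=0):
    """Return (xid, nid, age) for requests that failed within max_age_s of being sent."""
    # Syslog excerpts are often wrapped across lines, so collapse whitespace first.
    flat = " ".join(log_text.split())
    hits = []
    for m in EXPIRE_RE.finditer(flat):
        age = int(m.group("age"))
        if age <= max_age_s:
            hits.append((m.group("xid"), m.group("nid"), age))
    return hits


sample = """Aug 30 15:00:09 s1 kernel: Lustre:
2524:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request
x1410486709891636 sent from MGC10.179.16.120@o2ib to NID
10.179.16.121@o2ib 0s ago has failed due to network error (35s prior to
deadline)."""

print(find_local_send_failures(sample))
# prints [('x1410486709891636', '10.179.16.121@o2ib', 0)]
```

A hit only suggests a local send failure; it does not confirm LU-630. The neterror debug output and /proc/sys/lnet/peers mentioned above remain the real evidence.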

Comment by Kit Westneat (Inactive) [ 05/Sep/12 ]

It looks like the patch is fairly simple, can we get it landed on b1_8?

In the meantime I will communicate the workaround to the customer. I think it is pretty rare though.

Thanks,
Kit

Comment by Kit Westneat (Inactive) [ 06/Sep/12 ]

Hi Isaac,

What are the implications of peer_timeout=0? That is to say, what exactly does it do?

Also, does it have to be set on all the servers and clients, or can it be set on just the servers or just the clients?

Thanks,
Kit

Comment by Isaac Huang (Inactive) [ 06/Sep/12 ]

peer_timeout=0 disables a feature that should only be turned on for routers - it was a bug to be able to enable it anywhere but the routers. In other words, peer_timeout=0 fixes it without any code changes.

The feature does not work on clients and servers and will cause messages to be dropped, so "peer_timeout=0" must be set on all clients and servers.
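
Because the option has to be present on every client and server, it may help to audit each node's module configuration. The sketch below is a hypothetical helper (ko2iblnd_peer_timeout is not part of Lustre) that scans modprobe configuration text for the "options ko2iblnd peer_timeout=..." line quoted earlier. It assumes the last occurrence is the effective one, and it only inspects configuration text, not the value in the currently loaded module.

```python
def ko2iblnd_peer_timeout(modprobe_text):
    """Return the last peer_timeout value configured for ko2iblnd, or None if unset."""
    value = None
    for raw in modprobe_text.splitlines():
        # Drop trailing comments, then split the line into whitespace-separated fields.
        fields = raw.split("#", 1)[0].split()
        if len(fields) < 3 or fields[0] != "options" or fields[1] != "ko2iblnd":
            continue
        for opt in fields[2:]:
            key, sep, val = opt.partition("=")
            if key == "peer_timeout" and sep:
                # Assumption: a later setting overrides an earlier one.
                value = int(val)
    return value


config = """
# LNet settings (example text, not taken from the affected site)
options lnet networks=o2ib0
options ko2iblnd peer_timeout=0
"""

print(ko2iblnd_peer_timeout(config) == 0)
# prints True
```

A return value of None means no explicit setting was found in the given text, in which case the module default applies.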

Comment by Kit Westneat (Inactive) [ 06/Sep/12 ]

Is it OK to set "peer_timeout=0" on the clients before the servers? Or does it need to be set at the same time everywhere?

Comment by Isaac Huang (Inactive) [ 06/Sep/12 ]

There's no requirement on order. You can do it in any order that's most convenient.

Comment by Kit Westneat (Inactive) [ 25/Sep/12 ]

Could we get this landed to b1_8? It appears to have fixed the issue.

Comment by Isaac Huang (Inactive) [ 17/Oct/12 ]

I likely missed some notifications when JIRA was upgraded a while back. I agree that it's a simple patch that fixes a class of problems hard to diagnose when they manifest themselves at upper layers. I'd defer to Peter whether to land it to b1_8.

Comment by Peter Jones [ 17/Oct/12 ]

Thanks Isaac. Keith can you please backport the patch from master to b1_8?

Comment by Isaac Huang (Inactive) [ 17/Oct/12 ]

Quite likely the patch would apply to b1_8 without any changes; just ignore whitespace changes with patch --ignore-whitespace.

Comment by Keith Mannthey (Inactive) [ 17/Oct/12 ]

I was able to cherry-pick the patch from LU-630 into b1_8.

http://review.whamcloud.com/4287 is the b1_8 patch.

Comment by Peter Jones [ 19/Oct/12 ]

duplicate of LU-630
