[LU-9861] Client not reconnecting to OST Created: 10/Aug/17  Updated: 18/Sep/17  Resolved: 15/Aug/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Ned Bass
Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed
Votes: 0
Labels: llnl
Environment:

lustre-2.8.0_9.chaos


Issue Links:
Duplicate
duplicates LU-7434 lost bulk leads to a hang Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

An OSS evicted a client on Aug 1 during a planned network outage.

 

[Tue Aug 1 16:43:54 2017] Lustre: lsh-OST0005: haven't heard from client d40a30fc-ef66-94ff-e318-77d2c23e45f8 (at 192.168.137.212@o2ib27) in 227 seconds. I think it's dead, and I am evicting it. exp ffff881d1aaa8c00, cur 1501631035 expire 1501630885 last 1501630808
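The numbers in the eviction message can be cross-checked with a short script. The following is a minimal sketch, not a Lustre tool: the regular expression and the helper name parse_eviction are hypothetical and written only for this one message format. It shows that the reported 227 seconds equals cur minus last, and that cur is 150 seconds past expire.

```python
import re

# Eviction message copied from the OSS log above (timestamp prefix dropped).
EVICTION_LINE = (
    "Lustre: lsh-OST0005: haven't heard from client "
    "d40a30fc-ef66-94ff-e318-77d2c23e45f8 (at 192.168.137.212@o2ib27) "
    "in 227 seconds. I think it's dead, and I am evicting it. "
    "exp ffff881d1aaa8c00, cur 1501631035 expire 1501630885 last 1501630808"
)

# Hypothetical pattern for this single message format (an assumption,
# not something taken from the Lustre source).
EVICTION_RE = re.compile(
    r"haven't heard from client (?P<uuid>\S+) \(at (?P<nid>\S+)\) "
    r"in (?P<silent>\d+) seconds\..*"
    r"cur (?P<cur>\d+) expire (?P<expire>\d+) last (?P<last>\d+)"
)


def parse_eviction(line):
    """Return the client identity and epoch fields, or None if no match."""
    match = EVICTION_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    for key in ("silent", "cur", "expire", "last"):
        info[key] = int(info[key])
    return info


info = parse_eviction(EVICTION_LINE)
print("client NID:", info["nid"])
print("cur - last:", info["cur"] - info["last"])
print("cur - expire:", info["cur"] - info["expire"])
```

The cur, expire and last values are plain epoch seconds, so the same subtraction works on any eviction line of this shape.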

Two days later the client had still not reconnected, although both sides could lctl ping each other. The client logged the following on its console:

[Thu Aug  3 12:03:52 2017] LustreError: 11704:0:(import.c:336:ptlrpc_invalidate_import()) lsh-OST0005_UUID: rc = -110 waiting for callback (1 != 0)
[Thu Aug  3 12:03:52 2017] LustreError: 11704:0:(import.c:336:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
[Thu Aug  3 12:03:52 2017] LustreError: 11704:0:(import.c:362:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff880168fc3800 x1574573141267560/t0(0) o3->lsh-OST0005-osc-ffff88203c63f800@172.19.3.22@o2ib600:6/4 lens 488/432 e 0 to 0 dl 1501630582 ref 2 fl Unregistering:ES/0/ffffffff rc -5/-1
[Thu Aug  3 12:03:52 2017] LustreError: 11704:0:(import.c:362:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
[Thu Aug  3 12:03:52 2017] LustreError: 11704:0:(import.c:378:ptlrpc_invalidate_import()) lsh-OST0005_UUID: RPCs in "Unregistering" phase found (1). Network is sluggish? Waiting them to error out.
[Thu Aug  3 12:03:52 2017] LustreError: 11704:0:(import.c:378:ptlrpc_invalidate_import()) Skipped 5 previous similar messages
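To see how stale the stuck request is, its deadline (the dl field) can be compared with the eviction time from the OSS message earlier in this description. This is a minimal sketch under assumptions: the regular expression and the name parse_request are hypothetical, and treating dl as the request deadline in epoch seconds is an interpretation of the log format rather than something stated in this ticket.

```python
import re
from datetime import datetime, timezone

# Request dump copied from the "still on sending list" client message above.
REQ_LINE = (
    "@@@ still on sending list  req@ffff880168fc3800 "
    "x1574573141267560/t0(0) "
    "o3->lsh-OST0005-osc-ffff88203c63f800@172.19.3.22@o2ib600:6/4 "
    "lens 488/432 e 0 to 0 dl 1501630582 ref 2 "
    "fl Unregistering:ES/0/ffffffff rc -5/-1"
)

# "cur" value from the OSS eviction message in the description.
EVICTION_CUR = 1501631035

# Hypothetical pattern for this one dump format (an assumption).
REQ_RE = re.compile(
    r"req@(?P<req>[0-9a-f]+) x(?P<xid>\d+)/t\d+\(\d+\) "
    r"o(?P<opc>\d+)->(?P<target>\S+) .* dl (?P<dl>\d+) "
    r"ref \d+ fl (?P<phase>\w+):"
)


def parse_request(line):
    """Extract xid, opcode, deadline and phase, or None if no match."""
    match = REQ_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    for key in ("xid", "opc", "dl"):
        info[key] = int(info[key])
    return info


req = parse_request(REQ_LINE)
deadline_utc = datetime.fromtimestamp(req["dl"], tz=timezone.utc)
print("xid:", req["xid"], "opcode:", req["opc"], "phase:", req["phase"])
print("deadline (UTC):", deadline_utc.isoformat())
print("seconds between deadline and eviction:", EVICTION_CUR - req["dl"])
```

Read this way, the request's deadline fell 453 seconds before the OSS evicted the client, yet the request was still in the Unregistering phase two days later.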

This seems quite similar to LU-8511, which was closed as a duplicate of LU-7434. That issue had two associated patches, but only https://review.whamcloud.com/#/c/18934/ was landed to 2.8 FE, whereas https://review.whamcloud.com/#/c/19953/ was not.



 Comments   
Comment by Peter Jones [ 11/Aug/17 ]

Jinshan

In your opinion, could the described behaviour be due to the following patch being missing from 2.8 FE? https://git.hpdd.intel.com/?p=fs/lustre-release.git;a=commit;h=ac5044566b97c7f6881bed817c2ed9752a0c6d63 If not, what is your alternative theory?

Peter

Comment by Jinshan Xiong (Inactive) [ 14/Aug/17 ]

Yes, I agree it looks like the symptom of LU-7434 and patch 19953 should be able to fix the problem.

Comment by Peter Jones [ 15/Aug/17 ]

The mentioned fix has been ported, reviewed and landed to the 2.8 FE branch so closing this ticket for now. We can reopen if this same issue is hit again with a release including this change.
