[LU-13303] (import.c:361:ptlrpc_invalidate_import()) nbp1-OST0016_UUID: rc = -110 waiting for callback (1 != 0) Created: 27/Feb/20  Updated: 06/Jan/21  Resolved: 06/Jan/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Oleg Drokin
Resolution: Incomplete Votes: 0
Labels: None
Environment:

client=2.12.3
server=2.12.3


Attachments: File ptlrpc_invalidate_import.out.gz    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

During umount, the client gets stuck invalidating the import.

Attaching the full client-side debug log; no errors are reported on the server side.

 00000100:00080000:0.0:1582834254.602927:0:43821:0:(pinger.c:412:ptlrpc_pinger_del_import()) removing pingable import 69fa9d68-b663-7a88-9df3-0f9930f72f9d->nbp1-OST0016_UUID
00000100:00080000:0.0:1582834254.603390:0:43821:0:(import.c:157:ptlrpc_deactivate_import_nolock()) setting import nbp1-OST0016_UUID INVALID
00000100:00020000:0.0:1582834356.986737:0:43821:0:(import.c:361:ptlrpc_invalidate_import()) nbp1-OST0016_UUID: rc = -110 waiting for callback (1 != 0)
00000100:00020000:0.0:1582834356.999173:0:43821:0:(import.c:387:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff8ea5e5192940 x1657225699614176/t0(0) o8->nbp1-OST0016-osc-ffff8eb45292f000@10.151.26.119@o2ib:28/4 lens 520/544 e 0 to 1 dl 1582832131 ref 2 fl UnregRPC:EXN/0/ffffffff rc -5/-1
00000100:00020000:0.0:1582834357.024877:0:43821:0:(import.c:401:ptlrpc_invalidate_import()) nbp1-OST0016_UUID: Unregistering RPCs found (1). Network is sluggish? Waiting them to error out.
00000100:00020000:0.0:1582834459.418739:0:43821:0:(import.c:361:ptlrpc_invalidate_import()) nbp1-OST0016_UUID: rc = -110 waiting for callback (1 != 0)
00000100:00020000:0.0:1582834459.431183:0:43821:0:(import.c:387:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff8ea5e5192940 x1657225699614176/t0(0) o8->nbp1-OST0016-osc-ffff8eb45292


 Comments   
Comment by Oleg Drokin [ 28/Feb/20 ]

This is not so much a Lustre problem as a network/network-driver problem of sorts. Lustre asks the network to deregister a buffer in memory so that it can no longer accept data; the network/driver takes its time doing that, and Lustre eventually times out.

Our options at that point are to keep waiting, or to leak the buffer and exit. We could probably add a module parameter to do the leaking, but that feels like overkill; you are probably better off adding the line "Unregistering RPCs found (\d+). Network is sluggish? Waiting them to error out" to your health scripts and, if it triggers, just killing the node.
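
For reference, a minimal health-check sketch along those lines, assuming the console message reaches the kernel ring buffer (dmesg); the script, exit codes, and the action taken on a match are placeholders for a site-specific policy, not part of Lustre:

#!/usr/bin/env python3
# Hypothetical health-check sketch: scan dmesg for the "Unregistering RPCs
# found" message emitted by ptlrpc_invalidate_import() (import.c:401 above).
# The alert action and exit codes are placeholders for site-specific policy.
import re
import subprocess
import sys

# Pattern copied from the console message shown in the log excerpt.
PATTERN = re.compile(
    r"Unregistering RPCs found \(\d+\)\. Network is sluggish\? "
    r"Waiting them to error out"
)

def stuck_unregistration_seen() -> bool:
    """Return True if the sluggish-network message appears in dmesg."""
    out = subprocess.run(["dmesg"], capture_output=True,
                         text=True, check=False).stdout
    return bool(PATTERN.search(out))

if __name__ == "__main__":
    if stuck_unregistration_seen():
        # Site policy decides what happens here: alert, fence, or reboot.
        print("stuck ptlrpc unregistration detected", file=sys.stderr)
        sys.exit(1)
    sys.exit(0)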

Comment by Mahmoud Hanafi [ 28/Feb/20 ]

"network/driver" is that lnet/ko2ib? It keeps trying forever and never cleans up. I would think it should just give up at some point.

 

Comment by Mahmoud Hanafi [ 06/Jan/21 ]

This can be closed.
