[LU-13303] (import.c:361:ptlrpc_invalidate_import()) nbp1-OST0016_UUID: rc = -110 waiting for callback (1 != 0) Created: 27/Feb/20 Updated: 06/Jan/21 Resolved: 06/Jan/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Mahmoud Hanafi | Assignee: | Oleg Drokin |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Environment: |
client=2.12.3 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
During umount, the client gets stuck invalidating an import.
Attaching the full debug log from the client side. No errors are reported on the server side.
00000100:00080000:0.0:1582834254.602927:0:43821:0:(pinger.c:412:ptlrpc_pinger_del_import()) removing pingable import 69fa9d68-b663-7a88-9df3-0f9930f72f9d->nbp1-OST0016_UUID
00000100:00080000:0.0:1582834254.603390:0:43821:0:(import.c:157:ptlrpc_deactivate_import_nolock()) setting import nbp1-OST0016_UUID INVALID
00000100:00020000:0.0:1582834356.986737:0:43821:0:(import.c:361:ptlrpc_invalidate_import()) nbp1-OST0016_UUID: rc = -110 waiting for callback (1 != 0)
00000100:00020000:0.0:1582834356.999173:0:43821:0:(import.c:387:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff8ea5e5192940 x1657225699614176/t0(0) o8->nbp1-OST0016-osc-ffff8eb45292f000@10.151.26.119@o2ib:28/4 lens 520/544 e 0 to 1 dl 1582832131 ref 2 fl UnregRPC:EXN/0/ffffffff rc -5/-1
00000100:00020000:0.0:1582834357.024877:0:43821:0:(import.c:401:ptlrpc_invalidate_import()) nbp1-OST0016_UUID: Unregistering RPCs found (1). Network is sluggish? Waiting them to error out.
00000100:00020000:0.0:1582834459.418739:0:43821:0:(import.c:361:ptlrpc_invalidate_import()) nbp1-OST0016_UUID: rc = -110 waiting for callback (1 != 0)
00000100:00020000:0.0:1582834459.431183:0:43821:0:(import.c:387:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff8ea5e5192940 x1657225699614176/t0(0) o8->nbp1-OST0016-osc-ffff8eb45292 |
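The timestamps in the excerpt show the "waiting for callback" timeout recurring roughly every 102 seconds. A minimal sketch for measuring that interval from a saved debug log follows. The field layout in `LINE_RE` is inferred from the lines above rather than taken from a format specification, and `parse_line` and `timeout_gaps` are hypothetical helper names.

```python
import re
from decimal import Decimal

# Assumed layout, inferred from the excerpt in this ticket:
# subsystem:mask:cpu:timestamp:n:pid:n:(file:line:function()) message
LINE_RE = re.compile(
    r"^[0-9a-f]{8}:[0-9a-f]{8}:[\d.]+:(?P<ts>\d+\.\d+):\d+:(?P<pid>\d+):\d+:"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\) (?P<msg>.*)$"
)


def parse_line(line):
    """Split one debug-log line into fields, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        # Decimal keeps the microsecond digits exact when subtracting.
        "ts": Decimal(m.group("ts")),
        "pid": int(m.group("pid")),
        "func": m.group("func"),
        "msg": m.group("msg"),
    }


def timeout_gaps(lines, needle="waiting for callback"):
    """Return the time in seconds between consecutive messages containing needle."""
    records = (parse_line(line) for line in lines)
    stamps = [rec["ts"] for rec in records if rec and needle in rec["msg"]]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]
```

Feeding the function the lines of the attached log, for example `timeout_gaps(open(path))` with `path` pointing at a local copy, gives one gap per repeated timeout, which makes it easy to see whether the wait is still looping.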
| Comments |
| Comment by Oleg Drokin [ 28/Feb/20 ] |
|
This is not so much a Lustre problem as a network or network-driver problem of sorts. Lustre asks the network to deregister a buffer in memory so that it can no longer accept data; the network/driver takes its time doing that, and eventually Lustre times out. Our options at that point are to keep waiting, or to leak the buffer and exit. We could probably add a module parameter to do the leaking, but I feel that's overkill; you are probably better off adding the pattern "Unregistering RPCs found (\d+). Network is sluggish? Waiting them to error out" to your health scripts and, if it triggers, just killing the node. |
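The health-script check suggested above could look roughly like the following sketch. The regular expression is the pattern from the comment with the literal dot, parentheses and question mark escaped; `find_stuck_imports` is a hypothetical helper name, and how the log lines are collected (console log, dmesg, a debug dump) and what happens to the node afterwards are left to the site's own tooling.

```python
import re

# Pattern from the comment above, with regex metacharacters escaped so that
# "(1).", "sluggish?" and so on match literally.
SLUGGISH_RE = re.compile(
    r"Unregistering RPCs found \((\d+)\)\. Network is sluggish\? "
    r"Waiting them to error out"
)


def find_stuck_imports(lines):
    """Return (target, rpc_count) for every log line that matches the pattern."""
    hits = []
    for line in lines:
        m = SLUGGISH_RE.search(line)
        if m is None:
            continue
        # In the log above the import target precedes the message,
        # as in "nbp1-OST0016_UUID: Unregistering RPCs found (1). ...".
        words = line[:m.start()].split()
        target = words[-1].rstrip(":") if words else "unknown"
        hits.append((target, int(m.group(1))))
    return hits
```

A wrapper script would treat a non-empty result as an unhealthy node and apply whatever action the site prefers, such as draining or rebooting it.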
| Comment by Mahmoud Hanafi [ 28/Feb/20 ] |
|
By "network/driver", do you mean lnet/ko2ib? It keeps retrying forever and never cleans up. I would think it should just give up at some point.
|
| Comment by Mahmoud Hanafi [ 06/Jan/21 ] |
|
This can be closed. |