[LU-13763] ptlrpc_invalidate_import()) lsrza-OST0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out. Created: 08/Jul/20 Updated: 26/Nov/20 Resolved: 12/Sep/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.4 |
| Fix Version/s: | Lustre 2.14.0, Lustre 2.12.6 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Olaf Faaland | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
TOSS 3.6-3 / RH78 |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
Console log messages following this pattern, repeatedly, for several days:

LustreError: 67801:0:(import.c:361:ptlrpc_invalidate_import()) lsrza-OST0000_UUID: rc = -110 waiting for callback (1 != 0)
LustreError: 67801:0:(import.c:387:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff8c3eb65f7500 x1669124751850560/t0(0) o4->lsrza-OST0000-osc-ffff8c44c608a000@172.21.3.5@o2ib700:6/4 lens 488/448 e 2 to 0 dl 1592847228 ref 1 fl Interpret:E/0/ffffffff rc -5/-1
LustreError: 67801:0:(import.c:401:ptlrpc_invalidate_import()) lsrza-OST0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.

Note that the number in the parentheses is 0. This refers to imp->imp_unregistering, an atomic variable that looks like it is intended to track the number of RPC buffers we are waiting for the underlying network to unregister, so that we know no data will be lost. But there is still one RPC on the sending list, so why is imp_unregistering 0? |
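The rc values in these messages are negated Linux errno codes. As a reading aid, here is a minimal sketch that maps the two codes seen above (-110 and -5) to their symbolic names. The helpers `rc_name` and `print_rc` are hypothetical and not part of Lustre, and the numbers assume the generic Linux errno numbering.

```c
#include <stdio.h>

/* Hypothetical helper (not part of Lustre): map a negated errno, as
 * printed in the console messages, to its symbolic name.
 * Assumes the generic Linux errno numbering. */
const char *rc_name(int rc)
{
	switch (-rc) {
	case 5:
		return "EIO";        /* I/O error */
	case 110:
		return "ETIMEDOUT";  /* timed out */
	default:
		return "unknown";    /* not covered by this sketch */
	}
}

/* Print an rc value together with its symbolic name. */
void print_rc(int rc)
{
	printf("rc = %d (%s)\n", rc, rc_name(rc));
}
```

Read this way, "rc = -110 waiting for callback" is a timeout while waiting, and the request still on the sending list carries an I/O error status.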
| Comments |
| Comment by Olaf Faaland [ 08/Jul/20 ] |
|
Similar to |
| Comment by Olaf Faaland [ 09/Jul/20 ] |
|
For my tracking, my internal issue is TOSS4833 |
| Comment by Peter Jones [ 09/Jul/20 ] |
|
Mike, could you please advise? Thanks, Peter |
| Comment by Olaf Faaland [ 14/Jul/20 ] |
|
The node where I saw this is still in this state. While looking into
[root@rzgenie28:~]# lctl get_param -n osc.*OST0000*.cur_grant_bytes
18446744073707847680 |
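The value printed by lctl above is just below 2^64, which suggests an unsigned counter that has wrapped below zero. A minimal sketch of how to read such a value as a signed quantity follows. The helper names `as_signed64` and `print_counter` are hypothetical, not Lustre functions.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: return the signed value that a 64-bit unsigned
 * counter represents in two's complement, without relying on an
 * implementation-defined unsigned-to-signed conversion. */
int64_t as_signed64(uint64_t v)
{
	if (v <= (uint64_t)INT64_MAX)
		return (int64_t)v;
	/* UINT64_MAX - v is (distance below zero) - 1 and fits in int64_t. */
	return -(int64_t)(UINT64_MAX - v) - 1;
}

/* Print a counter both ways, e.g. for a value copied from lctl output. */
void print_counter(uint64_t v)
{
	printf("%" PRIu64 " unsigned = %" PRId64 " signed\n",
	       v, as_signed64(v));
}
```

Calling as_signed64(UINT64_C(18446744073707847680)) gives -1703936, so the counter reads as 1703936 bytes (1.625 MiB) below zero.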
| Comment by Olaf Faaland [ 14/Jul/20 ] |
|
I've attached console and debug logs. |
| Comment by Mikhail Pershin [ 14/Jul/20 ] |
|
Thanks, Olaf. I am checking logs right now. |
| Comment by Olaf Faaland [ 14/Jul/20 ] |
|
In osc_init_grant(), cl_avail_grant will underflow if the amounts subtracted below exceed ocd->ocd_grant:

	cli->cl_avail_grant = ocd->ocd_grant;
	if (cli->cl_import->imp_state != LUSTRE_IMP_EVICTED) {
		cli->cl_avail_grant -= cli->cl_reserved_grant;
		if (OCD_HAS_FLAG(ocd, GRANT_PARAM))
			cli->cl_avail_grant -= cli->cl_dirty_grant;
		else
			cli->cl_avail_grant -=
				cli->cl_dirty_pages << PAGE_SHIFT;
	}
I don't know if there's something to prevent that underflow elsewhere. |
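To illustrate the concern, here is a small standalone sketch of the subtraction pattern in osc_init_grant() quoted above, next to one possible guard that clamps at zero. The struct and function names are hypothetical stand-ins, the guard is an assumption for illustration rather than the change uploaded for review, and it models only the reserved and dirty-grant subtractions.

```c
/* Hypothetical stand-in for the client_obd fields used in the excerpt. */
struct grant_model {
	unsigned long reserved; /* models cl_reserved_grant */
	unsigned long dirty;    /* models cl_dirty_grant */
};

/* Mirrors the quoted pattern: plain unsigned subtraction, which wraps
 * around when the subtracted amounts exceed the server-supplied grant. */
unsigned long avail_unchecked(unsigned long server_grant,
			      const struct grant_model *m)
{
	unsigned long avail = server_grant;

	avail -= m->reserved;
	avail -= m->dirty;
	return avail;
}

/* Subtract b from a, stopping at zero instead of wrapping. */
unsigned long sub_clamped(unsigned long a, unsigned long b)
{
	return a > b ? a - b : 0;
}

/* One possible guard (an assumption, not the landed patch): never let
 * the available grant drop below zero. */
unsigned long avail_clamped(unsigned long server_grant,
			    const struct grant_model *m)
{
	unsigned long avail = sub_clamped(server_grant, m->reserved);

	return sub_clamped(avail, m->dirty);
}
```

With made-up numbers, a server grant of 1 MiB and 2 MiB of dirty grant, avail_unchecked() wraps to a value 1 MiB short of the unsigned maximum plus one, while avail_clamped() returns 0. When the server grant covers the local consumption, both functions agree.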
| Comment by Gerrit Updater [ 15/Jul/20 ] |
|
Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39380 |
| Comment by Mikhail Pershin [ 15/Jul/20 ] |
|
Olaf, I've made a patch to prevent that underflow. It is worth doing in any case, because ocd_grant is received from the server, so it shouldn't be blindly trusted to always be greater than the locally consumed grants. E.g. in conjunction with |
| Comment by Olaf Faaland [ 15/Jul/20 ] |
|
Mikhail, I have the node in this state drained. Should I keep it that way in case you want me to gather information from it, or should I go ahead and crash it and put it back into service? I won't be able to send you the crash dump, but I could extract information from the dump for you, potentially (although my crash skills are not great). Thanks |
| Comment by Olaf Faaland [ 20/Jul/20 ] |
|
Hi Mikhail, |
| Comment by Mikhail Pershin [ 20/Jul/20 ] |
|
Olaf, I have no good idea about what to get from that node right now, so you can bounce it. I don't see how the grant underflow could occur due to this particular situation: for that, the server would have to consider that the client has fewer grants than it really has. I'd say that such a situation could occur due to the DIO grants problem and then cause a grant underflow later. Maybe other scenarios exist. As for imp_unregistering being 0, I think that is OK if the request is on the sending list: it is still waiting for a reply, so it is not unregistered yet and the import counter is still 0. |
| Comment by Olaf Faaland [ 26/Aug/20 ] |
|
I rebased the patch as it was too old to be retested. |
| Comment by Olaf Faaland [ 01/Sep/20 ] |
|
The patch passes testing now. I posted a comment listing all the failures. It never failed a test twice, every test that failed passed when that test group was re-tested, and the failures didn't look to me like grant accounting problems. |
| Comment by Gerrit Updater [ 04/Sep/20 ] |
|
Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39827 |
| Comment by Gerrit Updater [ 12/Sep/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39827/ |
| Comment by Peter Jones [ 12/Sep/20 ] |
|
Landed for 2.14 |
| Comment by Gerrit Updater [ 15/Sep/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39380/ |