[LU-13763] ptlrpc_invalidate_import()) lsrza-OST0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out. Created: 08/Jul/20  Updated: 26/Nov/20  Resolved: 12/Sep/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.4
Fix Version/s: Lustre 2.14.0, Lustre 2.12.6

Type: Bug Priority: Minor
Reporter: Olaf Faaland Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

TOSS 3.6-3 / RH78
in-kernel OFED
3.10.0-1127.8.2.1chaos.ch6.x86_64
lustre-2.12.4_6.chaos-1.ch6.x86_64


Attachments: File console.rzgenie28     File console.rzgenie28-20200619.gz     File console.rzgenie28-20200705.gz     File dk.rzgenie28.1594702860    
Issue Links:
Duplicate
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Console log messages following this pattern, repeatedly, for several days:

LustreError: 67801:0:(import.c:361:ptlrpc_invalidate_import()) lsrza-OST0000_UUID: rc = -110 waiting for callback (1 != 0)
LustreError: 67801:0:(import.c:387:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff8c3eb65f7500 x1669124751850560/t0(0) o4->lsrza-OST0000-osc-ffff8c44c608a000@172.21.3.5@o2ib700:6/4 lens 488/448 e 2 to 0 dl 1592847228 ref 1 fl Interpret:E/0/ffffffff rc -5/-1
LustreError: 67801:0:(import.c:401:ptlrpc_invalidate_import()) lsrza-OST0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.

Note that the number in parentheses is 0. This refers to imp->imp_unregistering, an atomic variable that appears intended to track the number of RPC buffers we are waiting for the underlying network to unregister, so that we know no data will be lost. But there is still one RPC on the sending list, so why is imp_unregistering 0?



 Comments   
Comment by Olaf Faaland [ 08/Jul/20 ]

Similar to LU-13303 but in this case imp_unregistering is 0.

Comment by Olaf Faaland [ 09/Jul/20 ]

For my tracking, my internal issue is TOSS4833

Comment by Peter Jones [ 09/Jul/20 ]

Mike

Could you please advise

Thanks

Peter

Comment by Olaf Faaland [ 14/Jul/20 ]

The node where I saw this is still in this state.

While looking into LU-13766, I found that the node with this symptom also has a cur_grant_bytes that is weirdly large; I wonder if it underflowed. I don't know if this is related or not, but it seems an unlikely coincidence, so here it is.

[root@rzgenie28:~]# lctl get_param -n osc.*OST0000*.cur_grant_bytes
18446744073707847680
Comment by Olaf Faaland [ 14/Jul/20 ]

I've attached console and debug logs.

Comment by Mikhail Pershin [ 14/Jul/20 ]

Thanks, Olaf. I am checking logs right now.

Comment by Olaf Faaland [ 14/Jul/20 ]

In osc_init_grant(), cl_avail_grant will underflow if
cl_reserved_grant + (cl_dirty_grant OR cl_dirty_pages<<PAGE_SHIFT) > ocd_grant

        cli->cl_avail_grant = ocd->ocd_grant;
        if (cli->cl_import->imp_state != LUSTRE_IMP_EVICTED) {
                cli->cl_avail_grant -= cli->cl_reserved_grant;
                if (OCD_HAS_FLAG(ocd, GRANT_PARAM))
                        cli->cl_avail_grant -= cli->cl_dirty_grant;
                else
                        cli->cl_avail_grant -=
                                        cli->cl_dirty_pages << PAGE_SHIFT;
        }

I don't know if there's something to prevent that underflow elsewhere.

Comment by Gerrit Updater [ 15/Jul/20 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39380
Subject: LU-13763 osc: don't allow negative grants
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: d707c6ee3a926b06ebb0c2648cb5f1e5a1aaf2d7

Comment by Mikhail Pershin [ 15/Jul/20 ]

Olaf, I've made a patch to prevent that underflow. It is worth doing in any case, because ocd_grant is received from the server and shouldn't be blindly trusted to always be greater than the locally consumed grants.

In conjunction with LU-12687, for example, that looks like a real case.

Comment by Olaf Faaland [ 15/Jul/20 ]

Mikhail,

I have the node in this state drained. Should I keep it that way in case you want me to gather information from it, or should I go ahead and crash it and put it back into service? I won't be able to send you the crash dump, but I could extract information from the dump for you, potentially (although my crash skills are not great).

Thanks

Comment by Olaf Faaland [ 20/Jul/20 ]

Hi Mikhail,
Can I bounce that node?
Do you have any thoughts on whether the stuck import was really related to the grant underflow?
Do you have any idea why imp_unregistering was 0 when there was still one RPC on the sending list?
Thanks!

Comment by Mikhail Pershin [ 20/Jul/20 ]

Olaf, I have no good idea about what to collect from that node right now, so you can bounce it. I don't see how the grant underflow could be caused by this particular situation: for that, the server would have to think the client has fewer grants than it actually holds. I'd say such a situation could arise from the DIO grants problem and then cause the grant underflow later. Maybe other scenarios exist. As for imp_unregistering being 0, I think that is OK if the request is on the sending list: it is still waiting for a reply, so it is not being unregistered yet and the import counter is still 0.

Comment by Olaf Faaland [ 26/Aug/20 ]

I rebased the patch as it was too old to be retested.

Comment by Olaf Faaland [ 01/Sep/20 ]

The patch passes testing now. I posted a comment listing all the failures. No test failed twice: every test that failed passed when its test group was re-tested, and the failures didn't look to me like grant accounting problems.

Comment by Gerrit Updater [ 04/Sep/20 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39827
Subject: LU-13763 osc: don't allow negative grants
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bd9497c1690b9c4d2358a287277d1d67864d6735

Comment by Gerrit Updater [ 12/Sep/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39827/
Subject: LU-13763 osc: don't allow negative grants
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e05ccafd6ee214895d01efbb13a3757e3625a859

Comment by Peter Jones [ 12/Sep/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 15/Sep/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39380/
Subject: LU-13763 osc: don't allow negative grants
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: f96aa90548f062e95d2ef4c9ea978ba0e08aae19
