[LU-7434] lost bulk leads to a hang Created: 17/Nov/15  Updated: 04/Mar/20  Resolved: 17/Jun/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: Vitaly Fertman Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: llnlfixready, patch

Issue Links:
Duplicate
is duplicated by LU-9861 Client not reconnecting to OST Resolved
Related
is related to LU-2522 conf-sanity test_23b timed out during... Resolved
is related to LU-8062 recovery-small test_115b: @@@@@@ FAI... Resolved
is related to LU-8511 mdc stuck in EVICTED state Resolved
is related to LU-8067 sanityn test_31a: read dd: reading `/... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

When request_out_callback() and reply_in_callback() run in reverse order, the RPC is put into the UNREGISTERING state, which waits for both the RPC MD and the bulk MD to be unlinked, although only the RPC MD unlink has been called so far. If the bulk is then lost, the RPC hangs, because even expired_set does not check requests in the UNREGISTERING state.

The same applies to a write if the server returns an error.
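
The hang described above can be illustrated with a toy model. This is a minimal sketch, not the ptlrpc code: apart from the UNREGISTERING state name, every identifier here (ToyRequest, poll, expire, simulate, the include_unregistering flag) is hypothetical, and the flag only shows why a skipped state never times out; it is not claimed to be what the patch does.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Phase(Enum):
    UNREGISTERING = auto()  # waiting for both MD unlink events
    TIMED_OUT = auto()
    DONE = auto()


@dataclass
class ToyRequest:
    phase: Phase
    deadline: int
    reply_md_unlinked: bool = False
    bulk_md_unlinked: bool = False


def poll(req):
    # The request leaves UNREGISTERING only once BOTH MDs are unlinked.
    if (req.phase is Phase.UNREGISTERING
            and req.reply_md_unlinked and req.bulk_md_unlinked):
        req.phase = Phase.DONE


def expire(req, now, include_unregistering):
    # Toy timeout pass; optionally skips requests in UNREGISTERING.
    if now < req.deadline:
        return
    if req.phase is Phase.UNREGISTERING and not include_unregistering:
        return  # skipped: the deadline is never applied to this request
    if req.phase is not Phase.DONE:
        req.phase = Phase.TIMED_OUT


def simulate(bulk_lost, include_unregistering, ticks=10):
    # Start where the description leaves off: reply MD already unlinked.
    req = ToyRequest(phase=Phase.UNREGISTERING, deadline=3,
                     reply_md_unlinked=True)
    for now in range(ticks):
        if now == 1 and not bulk_lost:
            req.bulk_md_unlinked = True  # bulk unlink event arrives
        poll(req)
        expire(req, now, include_unregistering)
        if req.phase is not Phase.UNREGISTERING:
            break
    return req.phase
```

In this model, simulate(bulk_lost=True, include_unregistering=False) is still in UNREGISTERING after every tick, which corresponds to the hang; with the bulk event delivered the request completes normally.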



Comments
Comment by Gerrit Updater [ 17/Nov/15 ]

Vitaly Fertman (vitaly.fertman@seagate.com) uploaded a new patch: http://review.whamcloud.com/17221
Subject: LU-7434 ptlrpc: lost bulk leads to a hang
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4bba2cb84d3d2f87775c2e91e6c434dec4ed7c4c

Comment by Gerrit Updater [ 15/Mar/16 ]

Vitaly Fertman (vitaly.fertman@seagate.com) uploaded a new patch: http://review.whamcloud.com/18934
Subject: LU-7434 ptlrpc: Early Reply vs Reply MDunlink
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 01342f96b28f662531643818406ccaac7af45d7d

Comment by Gerrit Updater [ 21/Apr/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17221/
Subject: LU-7434 ptlrpc: lost bulk leads to a hang
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 55f8520817a31dabf19fe0a8ac2492b85d039c38

Comment by Bob Glossman (Inactive) [ 22/Apr/16 ]

another on master:
https://testing.hpdd.intel.com/test_sets/092f085a-0825-11e6-9e5d-5254006e85c2

Comment by Bob Glossman (Inactive) [ 22/Apr/16 ]

another on master:
https://testing.hpdd.intel.com/test_sets/ed7bb092-08a9-11e6-9b34-5254006e85c2

This test failure included the fix from http://review.whamcloud.com/17221, so it's still happening.

Comment by Bob Glossman (Inactive) [ 23/Apr/16 ]

another on master:
https://testing.hpdd.intel.com/test_sets/23681c36-090e-11e6-9e5d-5254006e85c2

Comment by Gerrit Updater [ 26/Apr/16 ]

Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: http://review.whamcloud.com/19778
Subject: Revert "LU-7434 ptlrpc: lost bulk leads to a hang"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0edd5b2a7f3e782c106f1b8f9f685966ee75fc01

Comment by Gerrit Updater [ 26/Apr/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19778/
Subject: Revert "LU-7434 ptlrpc: lost bulk leads to a hang"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 024312bd98cafc2e10ff851144c59ca45b2d2792

Comment by Andreas Dilger [ 26/Apr/16 ]

When resubmitting the http://review.whamcloud.com/17221 patch, please include an additional testing request in the patch commit message for the failing test to ensure that it is not still failing intermittently. The test failed in about 1/4 of recent test runs, so if the patch can pass 8 recovery-small test runs in a row it should be good:

Test-Parameters: testlist=recovery-small,recovery-small,recovery-small,recovery-small,recovery-small,recovery-small
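
As a rough check on the reasoning above, the chance that a still-broken patch passes every run by luck can be computed directly. This is a sketch that assumes each run fails independently with probability 1/4; the helper name chance_all_pass is hypothetical.

```python
from fractions import Fraction


def chance_all_pass(fail_rate, runs):
    # Probability that `runs` independent runs all pass.
    return (1 - fail_rate) ** runs


fail_rate = Fraction(1, 4)  # "about 1/4" of recent runs, per the comment
for runs in (6, 8):
    p = chance_all_pass(fail_rate, runs)
    print(f"{runs} runs: {p} (about {float(p):.3f})")
```

Under that assumption, eight clean runs in a row would happen by chance about 10% of the time, and six about 18% of the time.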

Comment by Gerrit Updater [ 28/Apr/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18934/
Subject: LU-7434 ptlrpc: Early Reply vs Reply MDunlink
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 14ba724a62b55b27f2a9c9611e15879207d78b43

Comment by Vitaly Fertman [ 03/May/16 ]

https://jira.hpdd.intel.com/browse/LU-8062?focusedCommentId=150173&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-150173

Comment by Gerrit Updater [ 03/May/16 ]

Chris Horn (hornc@cray.com) uploaded a new patch: http://review.whamcloud.com/19953
Subject: LU-7434 ptlrpc: lost bulk leads to a hang
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a5a865961aae0c62bbf802de5303bb1e4a6b5957

Comment by Cory Spitz [ 13/Jun/16 ]

Can we land this in time for 2.9.0? Is there something more to do to pave the way?

Comment by Gerrit Updater [ 16/Jun/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19953/
Subject: LU-7434 ptlrpc: lost bulk leads to a hang
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ac5044566b97c7f6881bed817c2ed9752a0c6d63
