[LU-9828] LBUG ASSERTION( desc->bd_nob_transferred == 0 ) failed: Created: 04/Aug/17  Updated: 09/Mar/18  Resolved: 28/Aug/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.10.1, Lustre 2.11.0

Type: Bug
Priority: Major
Reporter: Minh Diep
Assignee: Amir Shehata (Inactive)
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-10799 ASSERTION(desc->bd_nob_transferred == 0) Open
Severity: 3

 Description   

One of the clients crashed due to the following LBUG.

LustreError: 11818:0:(events.c:201:client_bulk_callback()) event type 2, status -103, desc ffff880827971600
LustreError: 11840:0:(niobuf.c:329:ptlrpc_register_bulk()) ASSERTION( desc->bd_nob_transferred == 0 ) failed:
LustreError: 11818:0:(events.c:201:client_bulk_callback()) event type 2, status -103, desc ffff880d40623400
Lustre: yshare1-OST0023-osc-ffff882049a1c800: Connection to yshare1-OST0023 (at 172.28.8.204@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 21 previous similar messages
LNet: 11818:0:(o2iblnd_cb.c:1364:kiblnd_reconnect_peer()) Abort reconnection of 172.28.8.204@o2ib1: connected
LNet: 11818:0:(o2iblnd_cb.c:1364:kiblnd_reconnect_peer()) Skipped 1 previous similar message
LustreError: 11840:0:(niobuf.c:329:ptlrpc_register_bulk()) LBUG
Pid: 11840, comm: ptlrpcd_01_01

Call Trace:
 [<ffffffffa0967895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0967e97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0cae07c>] ptlrpc_register_bulk+0xfc/0x9c0 [ptlrpc]
 [<ffffffffa0985c74>] ? cfs_percpt_unlock+0x24/0xb0 [libcfs]
 [<ffffffffa0a1b7b4>] ? LNetMDUnlink+0xd4/0x160 [lnet]
 [<ffffffffa0cb5c64>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
 [<ffffffffa0caf5af>] ptl_send_rpc+0x1af/0xea0 [ptlrpc]
 [<ffffffffa0ce6804>] ? sptlrpc_req_refresh_ctx+0x154/0x910 [ptlrpc]
 [<ffffffffa0ca90b2>] ptlrpc_check_set+0x1462/0x1bf0 [ptlrpc]
 [<ffffffffa0cd6d83>] ptlrpcd_check+0x3d3/0x610 [ptlrpc]
 [<ffffffffa0cd7232>] ptlrpcd+0x272/0x4f0 [ptlrpc]
 [<ffffffff8106c500>] ? default_wake_function+0x0/0x20
 [<ffffffffa0cd6fc0>] ? ptlrpcd+0x0/0x4f0 [ptlrpc]
 [<ffffffff810a640e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a6370>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
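
As a reading aid for the log above: the status -103 reported by client_bulk_callback() is a negated errno value, and on Linux errno 103 is ECONNABORTED, which fits the lost connection to yshare1-OST0023. A minimal user-space sketch for decoding such values follows. toy_status_str() is a hypothetical helper written for illustration; it is not a Lustre or libcfs function.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Lustre): map a kernel-style negative
 * status such as -103 to the C library's description of that errno.
 * Positive values are accepted as-is.
 */
static const char *toy_status_str(int status)
{
	return strerror(status < 0 ? -status : status);
}
```

On a Linux host, toy_status_str(-103) returns the C library's message for ECONNABORTED.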

 



 Comments   
Comment by Gerrit Updater [ 11/Aug/17 ]

Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/28491
Subject: LU-9828 ptlrpc: Do not assert when bd_nob_transferred != 0
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 88b65811680989d7dbc7620716ed1fa5d9388d28

Comment by Oleg Drokin [ 11/Aug/17 ]

I just hit this on my testbed as well.

Comment by Gerrit Updater [ 28/Aug/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28491/
Subject: LU-9828 ptlrpc: Do not assert when bd_nob_transferred != 0
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e6490ea6cf0b793c0b47f17ac5a5fa3a2a136e0d

Comment by Peter Jones [ 28/Aug/17 ]

Landed for 2.11

Comment by Gerrit Updater [ 28/Aug/17 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/28759
Subject: LU-9828 ptlrpc: Do not assert when bd_nob_transferred != 0
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 7289ae9b0767ac65323bad97471b15f735154024

Comment by Gerrit Updater [ 14/Sep/17 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28759/
Subject: LU-9828 ptlrpc: Do not assert when bd_nob_transferred != 0
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 39a275578e5d77d14f5b50b3c2a3fc924081e03c

Comment by Andriy Skulysh [ 05/Dec/17 ]

The assertion failure can happen only during a resend vs. reply race. It is better to skip the reply and restore the assertion. I'll commit the patch.
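
The idea of skipping the late reply so the assertion can stay might be sketched as the toy model below. This is not the ptlrpc code and not the content of any patch on this ticket: struct toy_bulk_desc, struct toy_bulk_event, the generation counter, and all toy_* functions are assumptions made up for illustration. In particular, the ticket does not say how the real patch recognizes the reply to skip; a generation counter is just one hypothetical way to do it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a bulk descriptor; NOT the real struct ptlrpc_bulk_desc. */
struct toy_bulk_desc {
	size_t nob_transferred;  /* plays the role of bd_nob_transferred */
	unsigned int generation; /* hypothetical: identifies the current send */
};

/* Toy completion event, tagged with the send it belongs to. */
struct toy_bulk_event {
	unsigned int generation;
	size_t nob;
};

/* Registration keeps the invariant from the ticket as an assertion. */
static void toy_register_bulk(struct toy_bulk_desc *desc)
{
	assert(desc->nob_transferred == 0);
}

/* Deciding to resend starts a new generation with a clean byte count. */
static void toy_prepare_resend(struct toy_bulk_desc *desc)
{
	desc->generation++;
	desc->nob_transferred = 0;
}

/*
 * Completion handler. An event that belongs to an earlier send is skipped,
 * so it cannot dirty the descriptor between toy_prepare_resend() and
 * toy_register_bulk(). Returns true if the event was applied.
 */
static bool toy_bulk_callback(struct toy_bulk_desc *desc,
			      const struct toy_bulk_event *ev)
{
	if (ev->generation != desc->generation)
		return false; /* stale reply from before the resend */

	desc->nob_transferred = ev->nob;
	return true;
}
```

In this model, a reply to the first send that arrives after toy_prepare_resend() is dropped, so the following toy_register_bulk() still sees a zero byte count. Without the generation check, the same late reply would set nob_transferred and trip the assertion, which is the race described above.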

Comment by Gerrit Updater [ 05/Dec/17 ]

Andriy Skulysh (c17819@cray.com) uploaded a new patch: https://review.whamcloud.com/30368
Subject: LU-9828 ptlrpc: ASSERTION(desc->bd_nob_transferred == 0)
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d4d5658f3cdad115c06a998e7fa91a2bd89e33dd

Comment by Cory Spitz [ 09/Mar/18 ]

This issue is marked RESOLVED, yet https://review.whamcloud.com/#/c/30368 is still linked here. Should we open a new ticket, or should this issue be reopened?

Comment by Peter Jones [ 09/Mar/18 ]

Cory

A new ticket linked to this one, please. It causes no end of confusion when patches are tagged onto long-closed tickets.

Peter

Comment by Andriy Skulysh [ 09/Mar/18 ]

Opened LU-10799

Comment by Peter Jones [ 09/Mar/18 ]

Thanks, askulysh. For future reference, we can just update the commit message without losing positive testing and reviews, so making these corrections does not require abandoning patches.
