[LU-11647] niobuf.c:330:ptlrpc_register_bulk()) ASSERTION( desc->bd_md_count == 0 ) failed:
Created: 09/Nov/18  Updated: 19/Mar/19  Resolved: 15/Feb/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.10.5
Fix Version/s: Lustre 2.13.0, Lustre 2.10.7, Lustre 2.12.1

Type: Bug
Priority: Major
Reporter: Mahmoud Hanafi
Assignee: Hongchao Zhang
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-8573 IOR: niobuf.c:319:ptlrpc_register_bul... Resolved
is duplicated by LU-11692 lustre kernel panic - (niobuf.c:330:... Resolved
Related
Severity: 2

 Description   

Clients are crashing with the LBUG shown in the console log below.

Is this a duplicate of LU-8573? If so, we will need a backport to 2.10.5.

[1541688893.800175] Lustre: 84667:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1541688893/real 1541688893]  req@ffff88062aa230c0 x1616517040059696/t0(0) o37->nbp8-MDT0000-mdc-ffff88080b0df800@10.151.27.60@o2ib:23/10 lens 568/440 e 0 to 1 dl 1541689235 ref 2 fl Rpc:ReX/0/ffffffff rc -11/-1
[1541689099.557409] LustreError: 3447:0:(niobuf.c:330:ptlrpc_register_bulk()) ASSERTION( desc->bd_md_count == 0 ) failed: 
[1541689099.569408] LustreError: 3447:0:(niobuf.c:330:ptlrpc_register_bulk()) LBUG
[1541689099.581408]  [<ffffffff8101bf34>] try_stack_unwind+0x194/0x1b0
[1541689099.589408]  [<ffffffff8101ad54>] dump_trace+0x64/0x3b0
[1541689099.593408]  [<ffffffff81027b02>] save_stack_trace_tsk+0x22/0x40
[1541689099.601407]  [<ffffffffa095c70d>] libcfs_call_trace+0x7d/0xa0 [libcfs]
[1541689099.609407]  [<ffffffffa095c7a5>] lbug_with_loc+0x45/0x90 [libcfs]
[1541689099.613407]  [<ffffffffa0ac84b9>] ptlrpc_register_bulk+0x7a9/0x970 [ptlrpc]
[1541689099.621407]  [<ffffffffa0ac8fe5>] ptl_send_rpc+0x225/0xdf0 [ptlrpc]
[1541689099.629406]  [<ffffffffa0ac328e>] ptlrpc_check_set.part.23+0x178e/0x1d60 [ptlrpc]
[1541689099.637406]  [<ffffffffa0ac38af>] ptlrpc_check_set+0x4f/0xd0 [ptlrpc]
[1541689099.645406]  [<ffffffffa0ac3b3a>] ptlrpc_set_wait+0x20a/0x890 [ptlrpc]
[1541689099.649406]  [<ffffffffa0ac423d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
[1541689099.657406]  [<ffffffffa084fdcc>] mdc_getpage+0x1bc/0x620 [mdc]
[1541689099.665405]  [<ffffffffa085034b>] mdc_read_page_remote+0x11b/0x5e0 [mdc]
[1541689099.673405]  [<ffffffff811a482f>] do_read_cache_page+0xff/0x1b0
[1541689099.677405]  [<ffffffff811a48f9>] read_cache_page+0x19/0x20
[1541689099.685405]  [<ffffffffa084daba>] mdc_read_page+0x1aa/0x9e0 [mdc]
[1541689099.689404]  [<ffffffffa0a57e63>] lmv_read_page+0x1a3/0x510 [lmv]
[1541689099.697404]  [<ffffffffa0ccf35b>] ll_get_dir_page+0xbb/0x330 [lustre]
[1541689099.705404]  [<ffffffffa0ccf704>] ll_dir_read+0x94/0x2e0 [lustre]
[1541689099.709404]  [<ffffffffa0ccfa58>] ll_iterate+0x108/0x520 [lustre]
[1541689099.717404]  [<ffffffff812342b0>] iterate_dir+0xa0/0x120
[1541689099.721403]  [<ffffffff812346f3>] SyS_getdents+0x83/0xf0
[1541689099.729403]  [<ffffffff81651e43>] entry_SYSCALL_64_fastpath+0x1e/0xca
[1541689099.733403]  [<ffffffffffffffff>] 0xffffffffffffffff
[1541689099.741403] Kernel panic - not syncing: LBUG
[1541689099.745403] CPU: 1 PID: 3447 Comm: csh Tainted: G           OE   NX 4.4.143-94.47.1.20180815-nasa #1
[1541689099.753402] Hardware name: SGI.COM C1104-RP7/X9DRW-3LN4F+/X9DRW-3TF+, BIOS 3.00 09/12/2013
[1541689099.765402]  0000000000000000 ffff88036753f6d0 ffffffff8134907c ffffffffa0979e4b
[1541689099.773402]  ffff8810440f1008 ffff88036753f748 ffffffff811a111a ffffffff00000008
[1541689099.777402]  ffff88036753f758 ffff88036753f6f8 ffffffff810fcea5 0000000000000282
[1541689099.785401] Call Trace:
[1541689099.789401]  [<ffffffff8101bf34>] try_stack_unwind+0x194/0x1b0
[1541689099.797401]  [<ffffffff8101ad54>] dump_trace+0x64/0x3b0
[1541689099.801401]  [<ffffffff8101bf9d>] show_trace_log_lvl+0x4d/0x60
[1541689099.809401]  [<ffffffff8101b18a>] show_stack_log_lvl+0xea/0x170
[1541689099.813400]  [<ffffffff8101bff5>] show_stack+0x25/0x50
[1541689099.821400]  [<ffffffff8134907c>] dump_stack+0x63/0x87
[1541689099.825400]  [<ffffffff811a111a>] panic+0xd2/0x232
[1541689099.829400]  [<ffffffffa095c7ee>] lbug_with_loc+0x8e/0x90 [libcfs]
[1541689099.837400]  [<ffffffffa0ac84b9>] ptlrpc_register_bulk+0x7a9/0x970 [ptlrpc]
[1541689099.845399]  [<ffffffffa0ac8fe5>] ptl_send_rpc+0x225/0xdf0 [ptlrpc]
[1541689099.853399]  [<ffffffffa0ac328e>] ptlrpc_check_set.part.23+0x178e/0x1d60 [ptlrpc]
[1541689099.861399]  [<ffffffffa0ac38af>] ptlrpc_check_set+0x4f/0xd0 [ptlrpc]
[1541689099.865399]  [<ffffffffa0ac3b3a>] ptlrpc_set_wait+0x20a/0x890 [ptlrpc]
[1541689099.873398]  [<ffffffffa0ac423d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
[1541689099.881398]  [<ffffffffa084fdcc>] mdc_getpage+0x1bc/0x620 [mdc]
[1541689099.885398]  [<ffffffffa085034b>] mdc_read_page_remote+0x11b/0x5e0 [mdc]
[1541689099.893398]  [<ffffffff811a482f>] do_read_cache_page+0xff/0x1b0
[1541689099.901398]  [<ffffffff811a48f9>] read_cache_page+0x19/0x20
[1541689099.905397]  [<ffffffffa084daba>] mdc_read_page+0x1aa/0x9e0 [mdc]
[1541689099.913397]  [<ffffffffa0a57e63>] lmv_read_page+0x1a3/0x510 [lmv]
[1541689099.917397]  [<ffffffffa0ccf35b>] ll_get_dir_page+0xbb/0x330 [lustre]
[1541689099.925397]  [<ffffffffa0ccf704>] ll_dir_read+0x94/0x2e0 [lustre]
[1541689099.933396]  [<ffffffffa0ccfa58>] ll_iterate+0x108/0x520 [lustre]
[1541689099.937396]  [<ffffffff812342b0>] iterate_dir+0xa0/0x120
[1541689099.945396]  [<ffffffff812346f3>] SyS_getdents+0x83/0xf0
[1541689099.949396]  [<ffffffff81651e43>] entry_SYSCALL_64_fastpath+0x1e/0xca
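
For readers unfamiliar with the assertion, the following is a toy model only, not Lustre code. It assumes, based on the trace (ptlrpc_check_set -> ptl_send_rpc -> ptlrpc_register_bulk) and the preceding network error, that a request carrying a bulk descriptor was sent again while the descriptor was still registered. The names BulkDesc, register_bulk, unregister_bulk, send and resend_survives are hypothetical and exist only for this illustration; md_count merely plays the role of desc->bd_md_count.

```python
# Toy model only: every name here is hypothetical, not the Lustre ptlrpc API.


class BulkDesc:
    """Stands in for a bulk descriptor that counts attached memory descriptors."""

    def __init__(self):
        self.md_count = 0  # plays the role of desc->bd_md_count


def register_bulk(desc, nr_mds=1):
    # Same precondition as the failed ASSERTION: nothing may still be attached.
    if desc.md_count != 0:
        raise AssertionError("ASSERTION( desc->bd_md_count == 0 ) failed")
    desc.md_count = nr_mds


def unregister_bulk(desc):
    # Detach everything so the descriptor can be registered again.
    desc.md_count = 0


def send(desc, unregister_first):
    """Model one send (or resend) of a request that carries a bulk descriptor."""
    if unregister_first:
        unregister_bulk(desc)
    register_bulk(desc)


def resend_survives(unregister_first):
    """Return True if a second send does not trip the assertion."""
    desc = BulkDesc()
    send(desc, unregister_first)      # first send registers the bulk
    try:
        send(desc, unregister_first)  # assumed resend after a network error
    except AssertionError:
        return False                  # the model's equivalent of the LBUG
    return True
```

In this model, resend_survives(False) trips the assertion on the second send, while resend_survives(True) does not. Whether the real race follows exactly this sequence is an assumption; the patches linked in the comments are the authoritative description of the fix.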


 Comments   
Comment by Peter Jones [ 09/Nov/18 ]

Hongchao

Does this seem related to LU-8573 to you?

Peter

Comment by Hongchao Zhang [ 09/Nov/18 ]

Yes, it should be a duplicate of LU-8573.

Comment by Mahmoud Hanafi [ 09/Nov/18 ]

Can we get a backport to 2.10.5, please?

Comment by Peter Jones [ 09/Nov/18 ]

Mahmoud

Usually we would want to wait until the fix has been finalized (i.e., landed on master) before backporting. Is this issue disruptive enough that you would want to run the risk of the fix changing due to testing/review feedback?

Peter

Comment by Jay Lan (Inactive) [ 09/Nov/18 ]

As of this morning, patch set #6 has not passed all autotests.

Comment by Andreas Dilger [ 26/Nov/18 ]

Patch v9 is looking promising, though it is still undergoing review.

Comment by Andreas Dilger [ 01/Dec/18 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: http://review.whamcloud.com/33167
Subject: LU-11647 ptlrpc: race with reply_in_callback
Project: fs/lustre-release
Branch: master
Current Patch Set: 10
Commit: 29d2c7ad100098497631c2ce172dc0e03accde60

Comment by Andreas Dilger [ 01/Dec/18 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/22378
Subject: LU-11647 ptlrpc: always unregister bulk
Project: fs/lustre-release
Branch: master
Current Patch Set: 5
Commit: e34a4cf031a2b83259cee8e05c2f646b5652b6a9

Comment by Gerrit Updater [ 06/Dec/18 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33798
Subject: LU-11647 ptlrpc: always unregister bulk
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: bd41c38752dda7e843c1bfb405f2214a31f74366

Comment by Gerrit Updater [ 16/Jan/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/22378/
Subject: LU-11647 ptlrpc: always unregister bulk
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 21c53b18a1bc0e36d2ecd1fb731f0dc6403902ee

Comment by Peter Jones [ 16/Jan/19 ]

Landed for 2.13

Comment by Jay Lan (Inactive) [ 25/Jan/19 ]

Can I assume the work at #33798 is also done, since #22378 has been merged?

Comment by Peter Jones [ 27/Jan/19 ]

Jay

Yes, I would expect that fix to land on b2_10 this coming week.

Peter

Comment by Gerrit Updater [ 15/Feb/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33798/
Subject: LU-11647 ptlrpc: always unregister bulk
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 1a251473427d5568f0a973964a2e2c7a288e1547

Comment by Gerrit Updater [ 25/Feb/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34305
Subject: LU-11647 ptlrpc: always unregister bulk
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: ab054fa2bd1ffe7c6c22d01b570211b5661ff59d

Comment by Gerrit Updater [ 19/Mar/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34305/
Subject: LU-11647 ptlrpc: always unregister bulk
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: ca5c659097595daa10eae0c20a6d94294524a2d4
