[LU-12531] LustreError: 3438:0:(pers.c:49:ptlrpc_fill_bulk_md()) ASSERTION( mdidx < desc->bd_md_max_brw ) failed Created: 10/Jul/19  Updated: 22/Apr/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.5
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Jacek Tomaka
Assignee: WC Triage
Resolution: Unresolved
Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I was wondering if you would be interested in the following crashes we see:

2019-06-11T14:12:48+08:00 nanny1351 kernel: Lustre: pn-OST0012-osc-ffff881f9cca3000: Connection restored to 172.16.0.16@tcp (at 172.16.0.16@tcp)
2019-06-11T14:12:50+08:00 nanny1351 kernel: LustreError: 3438:0:(pers.c:49:ptlrpc_fill_bulk_md()) ASSERTION( mdidx < desc->bd_md_max_brw ) failed:
2019-06-11T14:12:50+08:00 nanny1351 kernel: LustreError: 3438:0:(pers.c:49:ptlrpc_fill_bulk_md()) LBUG
2019-06-11T14:12:50+08:00 nanny1351 kernel: Pid: 3438, comm: ptlrpcd_00_55
2019-06-11T14:12:50+08:00 nanny1351 kernel: #012Call Trace:
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc03d67ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc03d683c>] lbug_with_loc+0x4c/0xb0 [libcfs]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc0ac6cee>] ptlrpc_fill_bulk_md+0xde/0x150 [ptlrpc]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc0a9d35f>] ptlrpc_register_bulk+0x2ff/0x9d0 [ptlrpc]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc0a9e446>] ptl_send_rpc+0x256/0xe50 [ptlrpc]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc0ad2923>] ? sptlrpc_req_refresh_ctx+0x153/0x910 [ptlrpc]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffff810ca2ae>] ? account_entity_dequeue+0xae/0xd0
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc0610529>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc0a93508>] ptlrpc_send_new_req+0x468/0xa60 [ptlrpc]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffff810c818e>] ? vtime_account_idle+0xe/0x50
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc0a96738>] ptlrpc_check_set.part.23+0x878/0x1d90 [ptlrpc]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc0a97cab>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc0ac4a4b>] ptlrpcd_check+0x4db/0x5c0 [ptlrpc]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc0ac4deb>] ptlrpcd+0x2bb/0x560 [ptlrpc]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffff810c4820>] ? default_wake_function+0x0/0x20
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc0ac4b30>] ? ptlrpcd+0x0/0x560 [ptlrpc]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffff810b099f>] kthread+0xcf/0xe0
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffff810c818e>] ? vtime_account_idle+0xe/0x50
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffff810b08d0>] ? kthread+0x0/0xe0
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffff810b08d0>] ? kthread+0x0/0xe0
2019-06-11T14:12:50+08:00 nanny1351 kernel:
2019-06-11T14:12:50+08:00 nanny1351 kernel: Kernel panic - not syncing: LBUG

Interestingly, many nodes suffered this crash within 15 minutes (or perhaps these were two separate events?):

2019-06-11T14:03:38+08:00 nanny1343 kernel: LustreError: 3386:0:(pers.c:49:ptlrpc_fill_bulk_md()) LBUG
2019-06-11T14:03:38+08:00 nanny1349 kernel: LustreError: 3708:0:(pers.c:49:ptlrpc_fill_bulk_md()) LBUG
2019-06-11T14:03:38+08:00 nanny1347 kernel: LustreError: 3591:0:(pers.c:49:ptlrpc_fill_bulk_md()) LBUG
2019-06-11T14:12:50+08:00 nanny1351 kernel: LustreError: 3438:0:(pers.c:49:ptlrpc_fill_bulk_md()) LBUG
2019-06-11T14:15:30+08:00 nanny1331 kernel: LustreError: 3566:0:(pers.c:49:ptlrpc_fill_bulk_md()) LBUG
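
One way to look at the "two separate events" question is to cluster the LBUG timestamps listed above. The following is only a minimal sketch: the function names and the 5-minute gap threshold are hypothetical choices made for illustration, not part of any Lustre tooling, and a different threshold would group the crashes differently.

```python
from datetime import datetime, timedelta

# The five LBUG lines quoted above, copied verbatim.
LOG_LINES = [
    "2019-06-11T14:03:38+08:00 nanny1343 kernel: LustreError: 3386:0:(pers.c:49:ptlrpc_fill_bulk_md()) LBUG",
    "2019-06-11T14:03:38+08:00 nanny1349 kernel: LustreError: 3708:0:(pers.c:49:ptlrpc_fill_bulk_md()) LBUG",
    "2019-06-11T14:03:38+08:00 nanny1347 kernel: LustreError: 3591:0:(pers.c:49:ptlrpc_fill_bulk_md()) LBUG",
    "2019-06-11T14:12:50+08:00 nanny1351 kernel: LustreError: 3438:0:(pers.c:49:ptlrpc_fill_bulk_md()) LBUG",
    "2019-06-11T14:15:30+08:00 nanny1331 kernel: LustreError: 3566:0:(pers.c:49:ptlrpc_fill_bulk_md()) LBUG",
]


def parse(line):
    """Return (timestamp, hostname) taken from the first two fields of a syslog line."""
    stamp, host, _rest = line.split(" ", 2)
    return datetime.fromisoformat(stamp), host


def group_events(lines, gap=timedelta(minutes=5)):
    """Group crashes whose timestamps are no more than `gap` apart.

    The 5-minute default is an arbitrary assumption for this sketch.
    """
    groups = []
    for when, host in sorted(parse(line) for line in lines):
        if groups and when - groups[-1][-1][0] <= gap:
            groups[-1].append((when, host))
        else:
            groups.append([(when, host)])
    return groups


if __name__ == "__main__":
    for number, group in enumerate(group_events(LOG_LINES), start=1):
        hosts = ", ".join(host for _, host in group)
        print(f"event {number}: {group[0][0].time()} to {group[-1][0].time()}: {hosts}")
```

With this threshold the three crashes at 14:03:38 fall into one group and the crashes at 14:12:50 and 14:15:30 into a second, which fits the two-events reading but does not prove it.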

This is Lustre 2.10.1 running on Linux nanny347 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz, 272 hardware threads (68 cores x 4 hyperthreads).



 Comments   
Comment by Andreas Dilger [ 22/Apr/20 ]

Li Xi (lixi@ddn.com) uploaded a new patch: http://review.whamcloud.com/18825
Subject: LU-12531 osc: use enough LNET MDs for changing BRWs
Project: fs/lustre-release
Branch: master
Current Patch Set: 2
Commit: 989bcda993edccc539e8bc189249db23da28c71d
