Details
Bug
Resolution: Unresolved
Minor
None
Lustre 2.12.5
None
3
9223372036854775807
Description
I was wondering if you would be interested in the following crashes we see:
2019-06-11T14:12:48+08:00 nanny1351 kernel: Lustre: pn-OST0012-osc-ffff881f9cca3000: Connection restored to 172.16.0.16@tcp (at 172.16.0.16@tcp)
2019-06-11T14:12:50+08:00 nanny1351 kernel: LustreError: 3438:0:(pers.c:49:ptlrpc_fill_bulk_md()) ASSERTION( mdidx < desc->bd_md_max_brw ) failed:
2019-06-11T14:12:50+08:00 nanny1351 kernel: LustreError: 3438:0:(pers.c:49:ptlrpc_fill_bulk_md()) LBUG
2019-06-11T14:12:50+08:00 nanny1351 kernel: Pid: 3438, comm: ptlrpcd_00_55
2019-06-11T14:12:50+08:00 nanny1351 kernel: #012Call Trace:
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc03d67ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc03d683c>] lbug_with_loc+0x4c/0xb0 [libcfs]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc0ac6cee>] ptlrpc_fill_bulk_md+0xde/0x150 [ptlrpc]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc0a9d35f>] ptlrpc_register_bulk+0x2ff/0x9d0 [ptlrpc]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc0a9e446>] ptl_send_rpc+0x256/0xe50 [ptlrpc]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc0ad2923>] ? sptlrpc_req_refresh_ctx+0x153/0x910 [ptlrpc]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffff810ca2ae>] ? account_entity_dequeue+0xae/0xd0
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc0610529>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc0a93508>] ptlrpc_send_new_req+0x468/0xa60 [ptlrpc]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffff810c818e>] ? vtime_account_idle+0xe/0x50
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc0a96738>] ptlrpc_check_set.part.23+0x878/0x1d90 [ptlrpc]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc0a97cab>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc0ac4a4b>] ptlrpcd_check+0x4db/0x5c0 [ptlrpc]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc0ac4deb>] ptlrpcd+0x2bb/0x560 [ptlrpc]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffff810c4820>] ? default_wake_function+0x0/0x20
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffffc0ac4b30>] ? ptlrpcd+0x0/0x560 [ptlrpc]
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffff810b099f>] kthread+0xcf/0xe0
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffff810c818e>] ? vtime_account_idle+0xe/0x50
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffff810b08d0>] ? kthread+0x0/0xe0
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
2019-06-11T14:12:50+08:00 nanny1351 kernel: [<ffffffff810b08d0>] ? kthread+0x0/0xe0
2019-06-11T14:12:50+08:00 nanny1351 kernel:
2019-06-11T14:12:50+08:00 nanny1351 kernel: Kernel panic - not syncing: LBUG
Interestingly, many nodes suffered this crash within 15 minutes of each other (or perhaps these were two separate events):
2019-06-11T14:03:38+08:00 nanny1343 kernel: LustreError: 3386:0:(pers.c:49:ptlrpc_fill_bulk_md()) LBUG
2019-06-11T14:03:38+08:00 nanny1349 kernel: LustreError: 3708:0:(pers.c:49:ptlrpc_fill_bulk_md()) LBUG
2019-06-11T14:03:38+08:00 nanny1347 kernel: LustreError: 3591:0:(pers.c:49:ptlrpc_fill_bulk_md()) LBUG
2019-06-11T14:12:50+08:00 nanny1351 kernel: LustreError: 3438:0:(pers.c:49:ptlrpc_fill_bulk_md()) LBUG
2019-06-11T14:15:30+08:00 nanny1331 kernel: LustreError: 3566:0:(pers.c:49:ptlrpc_fill_bulk_md()) LBUG
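To help judge whether these were one event or several, the LBUG lines above can be grouped by timestamp. The following is a minimal sketch, not part of the original report: the function names `parse_lbugs` and `cluster` and the 60-second grouping window are arbitrary choices made for illustration.

```python
import re
from datetime import datetime, timedelta

# LBUG lines copied from the report above.
LOG = """\
2019-06-11T14:03:38+08:00 nanny1343 kernel: LustreError: 3386:0:(pers.c:49:ptlrpc_fill_bulk_md()) LBUG
2019-06-11T14:03:38+08:00 nanny1349 kernel: LustreError: 3708:0:(pers.c:49:ptlrpc_fill_bulk_md()) LBUG
2019-06-11T14:03:38+08:00 nanny1347 kernel: LustreError: 3591:0:(pers.c:49:ptlrpc_fill_bulk_md()) LBUG
2019-06-11T14:12:50+08:00 nanny1351 kernel: LustreError: 3438:0:(pers.c:49:ptlrpc_fill_bulk_md()) LBUG
2019-06-11T14:15:30+08:00 nanny1331 kernel: LustreError: 3566:0:(pers.c:49:ptlrpc_fill_bulk_md()) LBUG
"""

# Timestamp, hostname, then any LustreError message that ends in "LBUG".
LBUG_RE = re.compile(r"^(\S+) (\S+) kernel: LustreError: .*\bLBUG$")


def parse_lbugs(text):
    """Return a sorted list of (timestamp, hostname) for each LBUG line."""
    hits = []
    for line in text.splitlines():
        m = LBUG_RE.match(line.strip())
        if m:
            hits.append((datetime.fromisoformat(m.group(1)), m.group(2)))
    return sorted(hits)


def cluster(hits, gap=timedelta(seconds=60)):
    """Group hits that occur within `gap` of the previous hit (assumed window)."""
    groups = []
    for ts, host in hits:
        if groups and ts - groups[-1][-1][0] <= gap:
            groups[-1].append((ts, host))
        else:
            groups.append([(ts, host)])
    return groups


hits = parse_lbugs(LOG)
groups = cluster(hits)
span = hits[-1][0] - hits[0][0]

for g in groups:
    print(g[0][0].isoformat(), [host for _, host in g])
print("total span:", span)
```

With a 60-second window, the three crashes at 14:03:38 fall into one group and the two later crashes each stand alone. A wider window would merge them, so the grouping depends on the chosen threshold and does not settle the question by itself.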
This is Lustre 2.10.1 running on Linux nanny347 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
CPU: Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz, 272 hardware threads (68 cores x 4 hyperthreads).
Li Xi (lixi@ddn.com) uploaded a new patch: http://review.whamcloud.com/18825
Subject: LU-12531 osc: use enough LNET MDs for changing BRWs
Project: fs/lustre-release
Branch: master
Current Patch Set: 2
Commit: 989bcda993edccc539e8bc189249db23da28c71d
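
The ticket does not spell out the failure mode. One possible reading of the assertion `mdidx < desc->bd_md_max_brw`, together with the patch subject, is that a bulk descriptor was sized for fewer LNET MDs than the transfer later needed once the BRW size changed. The toy model below illustrates only that arithmetic, under that assumption. It is not Lustre code: `MAX_IOV_PER_MD`, `mds_needed`, `fill_bulk_md`, `register_bulk`, and the page counts are all hypothetical names and values chosen for the example.

```python
import math

# Hypothetical value for illustration: how many fragments (pages)
# a single MD is assumed to describe.
MAX_IOV_PER_MD = 256


def mds_needed(iov_count, max_iov_per_md=MAX_IOV_PER_MD):
    """Number of MDs required to cover `iov_count` fragments."""
    return math.ceil(iov_count / max_iov_per_md)


def fill_bulk_md(mdidx, md_max_brw):
    """Toy stand-in for the reported check: mdidx must be < md_max_brw.

    Returns the index of the first fragment covered by MD number `mdidx`.
    """
    if not mdidx < md_max_brw:
        raise AssertionError(f"mdidx {mdidx} >= md_max_brw {md_max_brw}")
    return mdidx * MAX_IOV_PER_MD


def register_bulk(iov_count, md_max_brw):
    """Fill one MD per chunk of fragments; fails if too few MDs were reserved."""
    return [fill_bulk_md(i, md_max_brw) for i in range(mds_needed(iov_count))]


# Descriptor reserved 4 MDs and the transfer needs 4: fine.
print(register_bulk(iov_count=1024, md_max_brw=4))

# Descriptor reserved 1 MD (sized for 256 fragments), but the transfer
# now carries 1024 fragments: the check fails on the second MD.
try:
    register_bulk(iov_count=1024, md_max_brw=1)
except AssertionError as exc:
    print("assertion failed:", exc)
```

In this model the check trips whenever the reserved MD count is smaller than `mds_needed(iov_count)`, which is consistent with a fix that reserves enough MDs for the largest BRW size that may be negotiated. Whether that is what the patch actually does should be confirmed against the change itself.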