Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Fix Version/s: None
- Labels: None
- Environment: Lustre 2.12.6 + patches
- Severity: 3
Description
On a client, several robinhood instances were started, all reading and clearing changelogs with the same changelog user ("cl1").
Error -22 (-EINVAL) is returned on the client and on the server:
mdd_changelog_clear()) scratch2-MDD0002: Failure to clear the changelog for user 1: -22
This seems normal: one robinhood instance is trying to clear changelog records that have already been cleared by another instance.
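For illustration, here is a minimal sketch of the kind of endrec check that would produce this -EINVAL. The field names mirror the changelog user record (cur_id, cur_endrec), but the control flow is my simplification, not the actual mdd_changelog_clear() code:

#include <errno.h>

/* Hypothetical simplification: a clear request that does not advance
 * the user's recorded endrec is rejected, which is what the second
 * robinhood instance runs into after the first one has already
 * cleared up to the same record. */
struct changelog_user {
	unsigned int cur_id;	/* changelog user id ("cl1" -> 1) */
	long long cur_endrec;	/* highest record already cleared */
};

static int changelog_clear_check(struct changelog_user *u,
				 unsigned int id, long long endrec)
{
	if (u->cur_id != id)
		return -ENOENT;		/* no such user */
	if (endrec <= u->cur_endrec)
		return -EINVAL;		/* already cleared: error -22 */
	u->cur_endrec = endrec;		/* record the new endrec */
	return 0;
}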
After a while, the MDS crashes with the following backtrace:
[125289.545832] LustreError: 142168:0:(llog_osd.c:905:llog_osd_next_block()) ASSERTION( llog_osd_exist(loghandle) ) failed:
[125289.556790] LustreError: 142168:0:(llog_osd.c:905:llog_osd_next_block()) LBUG
[125289.564017] Pid: 142168, comm: mdt01_107 3.10.0-1062.18.1.el7.x86_64 #1 SMP Tue Mar 17 23:49:17 UTC 2020
[125289.564017] Call Trace:
[125289.564026] [<ffffffffc0ae57cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[125289.570676] [<ffffffffc0ae587c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[125289.576977] [<ffffffffc0f82158>] llog_osd_next_block+0xb28/0xbc0 [obdclass]
[125289.584162] [<ffffffffc0f75088>] llog_process_thread+0x338/0x1a10 [obdclass]
[125289.591417] [<ffffffffc0f7681c>] llog_process_or_fork+0xbc/0x450 [obdclass]
[125289.598584] [<ffffffffc0f7bbf9>] llog_cat_process_cb+0x239/0x250 [obdclass]
[125289.605751] [<ffffffffc0f755af>] llog_process_thread+0x85f/0x1a10 [obdclass]
[125289.613006] [<ffffffffc0f7681c>] llog_process_or_fork+0xbc/0x450 [obdclass]
[125289.620173] [<ffffffffc0f78581>] llog_cat_process_or_fork+0x1e1/0x360 [obdclass]
[125289.627776] [<ffffffffc0f7872e>] llog_cat_process+0x2e/0x30 [obdclass]
[125289.634509] [<ffffffffc197ca34>] llog_changelog_cancel.isra.16+0x54/0x1c0 [mdd]
[125289.642023] [<ffffffffc197eb60>] mdd_changelog_llog_cancel+0xd0/0x270 [mdd]
[125289.649190] [<ffffffffc1981c13>] mdd_changelog_clear+0x503/0x690 [mdd]
[125289.655920] [<ffffffffc1984d03>] mdd_iocontrol+0x163/0x540 [mdd]
[125289.662130] [<ffffffffc18063ec>] mdt_iocontrol+0x5ec/0xb00 [mdt]
[125289.668355] [<ffffffffc1806d84>] mdt_set_info+0x484/0x490 [mdt]
[125289.674479] [<ffffffffc1392f5a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[125289.681535] [<ffffffffc13378cb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[125289.689336] [<ffffffffc133b234>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[125289.695749] [<ffffffffa6ac6321>] kthread+0xd1/0xe0
[125289.700751] [<ffffffffa718dd1d>] ret_from_fork_nospec_begin+0x7/0x21
[125289.707314] [<ffffffffffffffff>] 0xffffffffffffffff
[125289.712416] Kernel panic - not syncing: LBUG
[125289.716774] CPU: 18 PID: 142168 Comm: mdt01_107 Kdump: loaded Tainted: G OE ------------ T 3.10.0-1062.18.1.el7.x86_64 #1
[125289.728760] Hardware name: Bull SAS BullSequana X430-E5 2U-1N/X11DPi-NT, BIOS 3.3 02/24/2020
[125289.737279] Call Trace:
[125289.739826] [<ffffffffa717b416>] dump_stack+0x19/0x1b
[125289.745046] [<ffffffffa7174a0b>] panic+0xe8/0x21f
[125289.749931] [<ffffffffc0ae58cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[125289.756211] [<ffffffffc0f82158>] llog_osd_next_block+0xb28/0xbc0 [obdclass]
[125289.763349] [<ffffffffc0aebd68>] ? libcfs_debug_vmsg2+0x6d8/0xb30 [libcfs]
[125289.770410] [<ffffffffc0f75088>] llog_process_thread+0x338/0x1a10 [obdclass]
[125289.777630] [<ffffffffc1980160>] ? mdd_obd_set_info_async+0x440/0x440 [mdd]
[125289.784769] [<ffffffffc0f7681c>] llog_process_or_fork+0xbc/0x450 [obdclass]
[125289.791912] [<ffffffffc0f7bbf9>] llog_cat_process_cb+0x239/0x250 [obdclass]
[125289.799052] [<ffffffffc0f755af>] llog_process_thread+0x85f/0x1a10 [obdclass]
[125289.806283] [<ffffffffc0f7b9c0>] ? llog_cat_cancel_records+0x3d0/0x3d0 [obdclass]
[125289.813940] [<ffffffffc0f7681c>] llog_process_or_fork+0xbc/0x450 [obdclass]
[125289.821083] [<ffffffffc0f7b9c0>] ? llog_cat_cancel_records+0x3d0/0x3d0 [obdclass]
[125289.828745] [<ffffffffc0f78581>] llog_cat_process_or_fork+0x1e1/0x360 [obdclass]
[125289.836313] [<ffffffffc1980160>] ? mdd_obd_set_info_async+0x440/0x440 [mdd]
[125289.843452] [<ffffffffc0f7872e>] llog_cat_process+0x2e/0x30 [obdclass]
[125289.850153] [<ffffffffc197ca34>] llog_changelog_cancel.isra.16+0x54/0x1c0 [mdd]
[125289.857631] [<ffffffffa6b07632>] ? ktime_get+0x52/0xe0
[125289.862944] [<ffffffffc197eb60>] mdd_changelog_llog_cancel+0xd0/0x270 [mdd]
[125289.870080] [<ffffffffc1981c13>] mdd_changelog_clear+0x503/0x690 [mdd]
[125289.876777] [<ffffffffc1984d03>] mdd_iocontrol+0x163/0x540 [mdd]
[125289.882975] [<ffffffffc0fc08c3>] ? lu_context_init+0xd3/0x1f0 [obdclass]
[125289.889853] [<ffffffffc18063ec>] mdt_iocontrol+0x5ec/0xb00 [mdt]
[125289.896039] [<ffffffffc1806d84>] mdt_set_info+0x484/0x490 [mdt]
[125289.902155] [<ffffffffc1392f5a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[125289.909141] [<ffffffffc136c6a1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[125289.916794] [<ffffffffc0ae5bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[125289.923948] [<ffffffffc13378cb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[125289.931713] [<ffffffffc13346e5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[125289.938590] [<ffffffffa6ad3a33>] ? __wake_up+0x13/0x20
[125289.943924] [<ffffffffc133b234>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[125289.950272] [<ffffffffa7180d92>] ? __schedule+0x402/0x840
[125289.955870] [<ffffffffc133a700>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[125289.963349] [<ffffffffa6ac6321>] kthread+0xd1/0xe0
[125289.968315] [<ffffffffa6ac6250>] ? insert_kthread_work+0x40/0x40
[125289.974497] [<ffffffffa718dd1d>] ret_from_fork_nospec_begin+0x7/0x21
[125289.981023] [<ffffffffa6ac6250>] ? insert_kthread_work+0x40/0x40
We were able to extract the dk log from the crashdump:
00000004:00000080:25.0:1619726994.236714:0:142143:0:(mdt_handler.c:6728:mdt_iocontrol()) handling ioctl cmd 0x424066b3
00000004:00000080:25.0:1619726994.236723:0:142143:0:(mdd_device.c:1780:mdd_changelog_clear()) scratch2-MDD0002: Purge request: id=1, endrec=255008878
00000040:00080000:25.0:1619726994.236727:0:142143:0:(llog.c:655:llog_process_thread()) index: 1, lh_last_idx: 1 synced_idx: 1 lgh_last_idx: 1
00000040:00080000:25.0:1619726994.236728:0:142143:0:(llog_cat.c:794:llog_cat_process_common()) processing log [0xb2:0x1:0x0]:0 at index 1 of catalog [0x6:0xa:0x0]
00000004:00000080:25.0:1619726994.236732:0:142143:0:(mdd_device.c:1753:mdd_changelog_clear_cb()) Rewriting changelog user 1 endrec to 255008878
00000040:00080000:25.0:1619726994.236738:0:142143:0:(llog.c:705:llog_process_thread()) stop processing plain 0xb2:1:0 index 2 count 2
00000040:00080000:25.0:1619726994.236739:0:142143:0:(llog.c:705:llog_process_thread()) stop processing catalog 0x6:10:0 index 2 count 2
00000004:00000080:25.0:1619726994.236740:0:142143:0:(mdd_device.c:1818:mdd_changelog_clear()) scratch2-MDD0002: Purging changelog entries up to 255008878
00000040:00080000:25.0:1619726994.236753:0:142143:0:(llog.c:655:llog_process_thread()) index: 4033, lh_last_idx: 4325 synced_idx: 0 lgh_last_idx: 4325
00000040:00080000:25.0:1619726994.236754:0:142143:0:(llog_cat.c:794:llog_cat_process_common()) processing log [0x1c1e:0x1:0x0]:0 at index 4033 of catalog [0x5:0xa:0x0]
00000004:00000080:18.0:1619726994.237619:0:142168:0:(mdt_handler.c:6728:mdt_iocontrol()) handling ioctl cmd 0x424066b3
00000004:00000080:18.0:1619726994.237630:0:142168:0:(mdd_device.c:1780:mdd_changelog_clear()) scratch2-MDD0002: Purge request: id=1, endrec=255008878
00000040:00080000:18.0:1619726994.237637:0:142168:0:(llog.c:655:llog_process_thread()) index: 1, lh_last_idx: 1 synced_idx: 1 lgh_last_idx: 1
00000040:00080000:18.0:1619726994.237638:0:142168:0:(llog_cat.c:794:llog_cat_process_common()) processing log [0xb2:0x1:0x0]:0 at index 1 of catalog [0x6:0xa:0x0]
00000004:00000080:18.0:1619726994.237644:0:142168:0:(mdd_device.c:1753:mdd_changelog_clear_cb()) Rewriting changelog user 1 endrec to 255008878
00000040:00080000:18.0:1619726994.237650:0:142168:0:(llog.c:705:llog_process_thread()) stop processing plain 0xb2:1:0 index 2 count 2
00000040:00080000:18.0:1619726994.237651:0:142168:0:(llog.c:705:llog_process_thread()) stop processing catalog 0x6:10:0 index 2 count 2
00000004:00000080:18.0:1619726994.237652:0:142168:0:(mdd_device.c:1818:mdd_changelog_clear()) scratch2-MDD0002: Purging changelog entries up to 255008878
00000040:00080000:18.0:1619726994.237666:0:142168:0:(llog.c:655:llog_process_thread()) index: 4033, lh_last_idx: 4325 synced_idx: 0 lgh_last_idx: 4325
00000040:00080000:18.0:1619726994.237666:0:142168:0:(llog_cat.c:794:llog_cat_process_common()) processing log [0x1c1e:0x1:0x0]:0 at index 4033 of catalog [0x5:0xa:0x0]
00000040:00080000:25.0:1619726994.237839:0:142143:0:(llog_cat.c:1123:llog_cat_set_first_idx()) catlog [0x5:0xa:0x0] first idx 4033, last_idx 4325
00000040:00080000:25.0:1619726994.237843:0:142143:0:(llog_cat.c:1162:llog_cat_cleanup()) cancel plain log [0x1c1e:0x1:0x0] at index 4033 of catalog [0x5:0xa:0x0]
00000040:00080000:25.0:1619726994.237844:0:142143:0:(llog.c:705:llog_process_thread()) stop processing plain 0x1c1e:1:0 index 64767 count 1
00000040:00040000:18.0:1619726994.237845:0:142168:0:(llog_osd.c:905:llog_osd_next_block()) ASSERTION( llog_osd_exist(loghandle) ) failed:
00000040:00080000:25.0:1619726994.237845:0:142143:0:(llog.c:655:llog_process_thread()) index: 4034, lh_last_idx: 4325 synced_idx: 0 lgh_last_idx: 4325
00000040:00080000:25.0:1619726994.237846:0:142143:0:(llog_cat.c:794:llog_cat_process_common()) processing log [0x1c1f:0x1:0x0]:0 at index 4034 of catalog [0x5:0xa:0x0]
00000040:00080000:25.0:1619726994.238798:0:142143:0:(llog.c:705:llog_process_thread()) stop processing plain 0x1c1f:1:0 index 530 count 62869
00000040:00080000:25.0:1619726994.238799:0:142143:0:(llog.c:705:llog_process_thread()) stop processing catalog 0x5:10:0 index 4034 count 293
00000040:00040000:18.0:1619726994.248804:0:142168:0:(llog_osd.c:905:llog_osd_next_block()) LBUG
It seems that one mdd_changelog_clear() instance deletes a plain llog while the other instance is trying to retrieve the next record from the same plain llog. This matches the dk log above: thread 142143 cancels plain log [0x1c1e:0x1:0x0] at 1619726994.237843, and thread 142168 hits the llog_osd_exist() assertion on that llog at 1619726994.237845.
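To make the suspected interleaving concrete, here is an abstract model of the race in plain pthreads. This is not Lustre code: a bool stands in for llog_osd_exist() and assert() stands in for LASSERT/LBUG.

/* Abstract model of the suspected race (build with: cc race.c -pthread).
 * Thread A plays the llog_cat_cleanup() side that cancels the plain
 * llog; thread B plays the llog_osd_next_block() side that asserts the
 * llog still exists.  The unsynchronized flag is the point: nothing
 * orders the deletion against the reader. */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

struct plain_llog {
	bool exists;		/* stands in for llog_osd_exist() */
};

static struct plain_llog log_obj = { .exists = true };

static void *clear_instance_a(void *arg)
{
	/* Instance A finishes its purge and cancels the plain llog. */
	log_obj.exists = false;
	return NULL;
}

static void *clear_instance_b(void *arg)
{
	/* Instance B is still walking the same plain llog and reads
	 * the next block, asserting existence first. */
	usleep(1000);
	assert(log_obj.exists);		/* fires: the LBUG equivalent */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, clear_instance_a, NULL);
	pthread_create(&b, NULL, clear_instance_b, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}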
I have tried, without success, to reproduce the issue on VMs with several concurrent "lfs changelog_clear" and "lfs changelog" instances.
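For reference, this is a sketch of the kind of reader/clearer instance those attempts correspond to, written against liblustreapi; the MDT name, batching, and error handling are illustrative, not taken from robinhood.

/* One reader/clearer instance (build with: cc repro.c -llustreapi).
 * Running several copies concurrently with the same changelog user
 * ("cl1") mimics the reported workload. */
#include <stdio.h>
#include <lustre/lustreapi.h>

int main(int argc, char **argv)
{
	const char *mdt = argc > 1 ? argv[1] : "scratch2-MDT0002";
	struct changelog_rec *rec;
	long long last = 0;
	void *ctx;
	int rc;

	/* Read all pending records, remembering the highest index. */
	rc = llapi_changelog_start(&ctx, 0, mdt, 0);
	if (rc < 0) {
		fprintf(stderr, "changelog_start: %d\n", rc);
		return 1;
	}
	while ((rc = llapi_changelog_recv(ctx, &rec)) == 0) {
		last = rec->cr_index;
		llapi_changelog_free(&rec);
	}
	llapi_changelog_fini(&ctx);

	/* Racy step: another instance may already have cleared up to
	 * (or beyond) 'last', which yields the -EINVAL seen above. */
	if (last > 0) {
		rc = llapi_changelog_clear(mdt, "cl1", last);
		if (rc < 0)
			fprintf(stderr, "changelog_clear: %d\n", rc);
	}
	return 0;
}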
I have backported https://review.whamcloud.com/43572/ ("LU-14606 llog: hide ENOENT for cancelling record") because this patch seems to handle "llog cancel" for changelogs more cleanly (llog_changelog_cancel_cb() --> RETURN(LLOG_DEL_RECORD)). But I do not think it will resolve the issue.
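My reading of that patch, paraphrased in its in-tree context rather than quoted from the diff: the cancel callback no longer cancels the record itself but returns LLOG_DEL_RECORD, so the llog layer performs the deletion and can treat a record that already disappeared (-ENOENT) as success in one place.

/* Paraphrase of the LU-14606 approach, not the literal patch. */
static int llog_changelog_cancel_cb(const struct lu_env *env,
				    struct llog_handle *llh,
				    struct llog_rec_hdr *hdr, void *data)
{
	struct llog_changelog_rec *rec = (struct llog_changelog_rec *)hdr;
	long long endrec = *(long long *)data;

	if (rec->cr.cr_index > endrec)
		/* records are ordered; all older ones are handled */
		return -LLOG_EEMPTY;

	/* Instead of cancelling here (and having to interpret -ENOENT
	 * itself), ask the llog layer to delete the record; the layer
	 * can then hide an ENOENT caused by a concurrent removal. */
	return LLOG_DEL_RECORD;
}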