Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Fix Version/s: None
- Labels: None
- Environment: Lustre 2.12.6 + patches
- Severity: 3
Description
On a client, several robinhood instances were started, all reading and clearing changelogs with the same changelog user ("cl1").
Error -22 (-EINVAL) is returned on the client and on the server:
mdd_changelog_clear()) scratch2-MDD0002: Failure to clear the changelog for user 1: -22
This seems normal: one robinhood instance is trying to clear changelog records that have already been cleared by another instance.
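For illustration, here is a minimal sketch of the kind of endrec check that would produce this -EINVAL. The field names mirror the changelog user record (cur_id, cur_endrec), but the control flow is my simplification, not the actual mdd_changelog_clear() code:

#include <errno.h>

/* Hypothetical simplification: a clear request that does not advance
 * the user's recorded endrec is rejected, which is what the second
 * robinhood instance runs into after the first one has already
 * cleared up to the same record. */
struct changelog_user {
	unsigned int cur_id;	/* changelog user id ("cl1" -> 1) */
	long long cur_endrec;	/* highest record already cleared */
};

static int changelog_clear_check(struct changelog_user *u,
				 unsigned int id, long long endrec)
{
	if (u->cur_id != id)
		return -ENOENT;		/* no such user */
	if (endrec <= u->cur_endrec)
		return -EINVAL;		/* already cleared: error -22 */
	u->cur_endrec = endrec;		/* record the new endrec */
	return 0;
}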
After a while, the MDS crashes with the following backtrace:
[125289.545832] LustreError: 142168:0:(llog_osd.c:905:llog_osd_next_block()) ASSERTION( llog_osd_exist(loghandle) ) failed:
[125289.556790] LustreError: 142168:0:(llog_osd.c:905:llog_osd_next_block()) LBUG
[125289.564017] Pid: 142168, comm: mdt01_107 3.10.0-1062.18.1.el7.x86_64 #1 SMP Tue Mar 17 23:49:17 UTC 2020
[125289.564017] Call Trace:
[125289.564026] [<ffffffffc0ae57cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[125289.570676] [<ffffffffc0ae587c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[125289.576977] [<ffffffffc0f82158>] llog_osd_next_block+0xb28/0xbc0 [obdclass]
[125289.584162] [<ffffffffc0f75088>] llog_process_thread+0x338/0x1a10 [obdclass]
[125289.591417] [<ffffffffc0f7681c>] llog_process_or_fork+0xbc/0x450 [obdclass]
[125289.598584] [<ffffffffc0f7bbf9>] llog_cat_process_cb+0x239/0x250 [obdclass]
[125289.605751] [<ffffffffc0f755af>] llog_process_thread+0x85f/0x1a10 [obdclass]
[125289.613006] [<ffffffffc0f7681c>] llog_process_or_fork+0xbc/0x450 [obdclass]
[125289.620173] [<ffffffffc0f78581>] llog_cat_process_or_fork+0x1e1/0x360 [obdclass]
[125289.627776] [<ffffffffc0f7872e>] llog_cat_process+0x2e/0x30 [obdclass]
[125289.634509] [<ffffffffc197ca34>] llog_changelog_cancel.isra.16+0x54/0x1c0 [mdd]
[125289.642023] [<ffffffffc197eb60>] mdd_changelog_llog_cancel+0xd0/0x270 [mdd]
[125289.649190] [<ffffffffc1981c13>] mdd_changelog_clear+0x503/0x690 [mdd]
[125289.655920] [<ffffffffc1984d03>] mdd_iocontrol+0x163/0x540 [mdd]
[125289.662130] [<ffffffffc18063ec>] mdt_iocontrol+0x5ec/0xb00 [mdt]
[125289.668355] [<ffffffffc1806d84>] mdt_set_info+0x484/0x490 [mdt]
[125289.674479] [<ffffffffc1392f5a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[125289.681535] [<ffffffffc13378cb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[125289.689336] [<ffffffffc133b234>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[125289.695749] [<ffffffffa6ac6321>] kthread+0xd1/0xe0
[125289.700751] [<ffffffffa718dd1d>] ret_from_fork_nospec_begin+0x7/0x21
[125289.707314] [<ffffffffffffffff>] 0xffffffffffffffff
[125289.712416] Kernel panic - not syncing: LBUG
[125289.716774] CPU: 18 PID: 142168 Comm: mdt01_107 Kdump: loaded Tainted: G OE ------------ T 3.10.0-1062.18.1.el7.x86_64 #1
[125289.728760] Hardware name: Bull SAS BullSequana X430-E5 2U-1N/X11DPi-NT, BIOS 3.3 02/24/2020
[125289.737279] Call Trace:
[125289.739826] [<ffffffffa717b416>] dump_stack+0x19/0x1b
[125289.745046] [<ffffffffa7174a0b>] panic+0xe8/0x21f
[125289.749931] [<ffffffffc0ae58cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[125289.756211] [<ffffffffc0f82158>] llog_osd_next_block+0xb28/0xbc0 [obdclass]
[125289.763349] [<ffffffffc0aebd68>] ? libcfs_debug_vmsg2+0x6d8/0xb30 [libcfs]
[125289.770410] [<ffffffffc0f75088>] llog_process_thread+0x338/0x1a10 [obdclass]
[125289.777630] [<ffffffffc1980160>] ? mdd_obd_set_info_async+0x440/0x440 [mdd]
[125289.784769] [<ffffffffc0f7681c>] llog_process_or_fork+0xbc/0x450 [obdclass]
[125289.791912] [<ffffffffc0f7bbf9>] llog_cat_process_cb+0x239/0x250 [obdclass]
[125289.799052] [<ffffffffc0f755af>] llog_process_thread+0x85f/0x1a10 [obdclass]
[125289.806283] [<ffffffffc0f7b9c0>] ? llog_cat_cancel_records+0x3d0/0x3d0 [obdclass]
[125289.813940] [<ffffffffc0f7681c>] llog_process_or_fork+0xbc/0x450 [obdclass]
[125289.821083] [<ffffffffc0f7b9c0>] ? llog_cat_cancel_records+0x3d0/0x3d0 [obdclass]
[125289.828745] [<ffffffffc0f78581>] llog_cat_process_or_fork+0x1e1/0x360 [obdclass]
[125289.836313] [<ffffffffc1980160>] ? mdd_obd_set_info_async+0x440/0x440 [mdd]
[125289.843452] [<ffffffffc0f7872e>] llog_cat_process+0x2e/0x30 [obdclass]
[125289.850153] [<ffffffffc197ca34>] llog_changelog_cancel.isra.16+0x54/0x1c0 [mdd]
[125289.857631] [<ffffffffa6b07632>] ? ktime_get+0x52/0xe0
[125289.862944] [<ffffffffc197eb60>] mdd_changelog_llog_cancel+0xd0/0x270 [mdd]
[125289.870080] [<ffffffffc1981c13>] mdd_changelog_clear+0x503/0x690 [mdd]
[125289.876777] [<ffffffffc1984d03>] mdd_iocontrol+0x163/0x540 [mdd]
[125289.882975] [<ffffffffc0fc08c3>] ? lu_context_init+0xd3/0x1f0 [obdclass]
[125289.889853] [<ffffffffc18063ec>] mdt_iocontrol+0x5ec/0xb00 [mdt]
[125289.896039] [<ffffffffc1806d84>] mdt_set_info+0x484/0x490 [mdt]
[125289.902155] [<ffffffffc1392f5a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[125289.909141] [<ffffffffc136c6a1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[125289.916794] [<ffffffffc0ae5bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[125289.923948] [<ffffffffc13378cb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[125289.931713] [<ffffffffc13346e5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[125289.938590] [<ffffffffa6ad3a33>] ? __wake_up+0x13/0x20
[125289.943924] [<ffffffffc133b234>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[125289.950272] [<ffffffffa7180d92>] ? __schedule+0x402/0x840
[125289.955870] [<ffffffffc133a700>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[125289.963349] [<ffffffffa6ac6321>] kthread+0xd1/0xe0
[125289.968315] [<ffffffffa6ac6250>] ? insert_kthread_work+0x40/0x40
[125289.974497] [<ffffffffa718dd1d>] ret_from_fork_nospec_begin+0x7/0x21
[125289.981023] [<ffffffffa6ac6250>] ? insert_kthread_work+0x40/0x40
We were able to extract the dk log from the crashdump:
00000004:00000080:25.0:1619726994.236714:0:142143:0:(mdt_handler.c:6728:mdt_iocontrol()) handling ioctl cmd 0x424066b3
00000004:00000080:25.0:1619726994.236723:0:142143:0:(mdd_device.c:1780:mdd_changelog_clear()) scratch2-MDD0002: Purge request: id=1, endrec=255008878
00000040:00080000:25.0:1619726994.236727:0:142143:0:(llog.c:655:llog_process_thread()) index: 1, lh_last_idx: 1 synced_idx: 1 lgh_last_idx: 1
00000040:00080000:25.0:1619726994.236728:0:142143:0:(llog_cat.c:794:llog_cat_process_common()) processing log [0xb2:0x1:0x0]:0 at index 1 of catalog [0x6:0xa:0x0]
00000004:00000080:25.0:1619726994.236732:0:142143:0:(mdd_device.c:1753:mdd_changelog_clear_cb()) Rewriting changelog user 1 endrec to 255008878
00000040:00080000:25.0:1619726994.236738:0:142143:0:(llog.c:705:llog_process_thread()) stop processing plain 0xb2:1:0 index 2 count 2
00000040:00080000:25.0:1619726994.236739:0:142143:0:(llog.c:705:llog_process_thread()) stop processing catalog 0x6:10:0 index 2 count 2
00000004:00000080:25.0:1619726994.236740:0:142143:0:(mdd_device.c:1818:mdd_changelog_clear()) scratch2-MDD0002: Purging changelog entries up to 255008878
00000040:00080000:25.0:1619726994.236753:0:142143:0:(llog.c:655:llog_process_thread()) index: 4033, lh_last_idx: 4325 synced_idx: 0 lgh_last_idx: 4325
00000040:00080000:25.0:1619726994.236754:0:142143:0:(llog_cat.c:794:llog_cat_process_common()) processing log [0x1c1e:0x1:0x0]:0 at index 4033 of catalog [0x5:0xa:0x0]
00000004:00000080:18.0:1619726994.237619:0:142168:0:(mdt_handler.c:6728:mdt_iocontrol()) handling ioctl cmd 0x424066b3
00000004:00000080:18.0:1619726994.237630:0:142168:0:(mdd_device.c:1780:mdd_changelog_clear()) scratch2-MDD0002: Purge request: id=1, endrec=255008878
00000040:00080000:18.0:1619726994.237637:0:142168:0:(llog.c:655:llog_process_thread()) index: 1, lh_last_idx: 1 synced_idx: 1 lgh_last_idx: 1
00000040:00080000:18.0:1619726994.237638:0:142168:0:(llog_cat.c:794:llog_cat_process_common()) processing log [0xb2:0x1:0x0]:0 at index 1 of catalog [0x6:0xa:0x0]
00000004:00000080:18.0:1619726994.237644:0:142168:0:(mdd_device.c:1753:mdd_changelog_clear_cb()) Rewriting changelog user 1 endrec to 255008878
00000040:00080000:18.0:1619726994.237650:0:142168:0:(llog.c:705:llog_process_thread()) stop processing plain 0xb2:1:0 index 2 count 2
00000040:00080000:18.0:1619726994.237651:0:142168:0:(llog.c:705:llog_process_thread()) stop processing catalog 0x6:10:0 index 2 count 2
00000004:00000080:18.0:1619726994.237652:0:142168:0:(mdd_device.c:1818:mdd_changelog_clear()) scratch2-MDD0002: Purging changelog entries up to 255008878
00000040:00080000:18.0:1619726994.237666:0:142168:0:(llog.c:655:llog_process_thread()) index: 4033, lh_last_idx: 4325 synced_idx: 0 lgh_last_idx: 4325
00000040:00080000:18.0:1619726994.237666:0:142168:0:(llog_cat.c:794:llog_cat_process_common()) processing log [0x1c1e:0x1:0x0]:0 at index 4033 of catalog [0x5:0xa:0x0]
00000040:00080000:25.0:1619726994.237839:0:142143:0:(llog_cat.c:1123:llog_cat_set_first_idx()) catlog [0x5:0xa:0x0] first idx 4033, last_idx 4325
00000040:00080000:25.0:1619726994.237843:0:142143:0:(llog_cat.c:1162:llog_cat_cleanup()) cancel plain log [0x1c1e:0x1:0x0] at index 4033 of catalog [0x5:0xa:0x0]
00000040:00080000:25.0:1619726994.237844:0:142143:0:(llog.c:705:llog_process_thread()) stop processing plain 0x1c1e:1:0 index 64767 count 1
00000040:00040000:18.0:1619726994.237845:0:142168:0:(llog_osd.c:905:llog_osd_next_block()) ASSERTION( llog_osd_exist(loghandle) ) failed:
00000040:00080000:25.0:1619726994.237845:0:142143:0:(llog.c:655:llog_process_thread()) index: 4034, lh_last_idx: 4325 synced_idx: 0 lgh_last_idx: 4325
00000040:00080000:25.0:1619726994.237846:0:142143:0:(llog_cat.c:794:llog_cat_process_common()) processing log [0x1c1f:0x1:0x0]:0 at index 4034 of catalog [0x5:0xa:0x0]
00000040:00080000:25.0:1619726994.238798:0:142143:0:(llog.c:705:llog_process_thread()) stop processing plain 0x1c1f:1:0 index 530 count 62869
00000040:00080000:25.0:1619726994.238799:0:142143:0:(llog.c:705:llog_process_thread()) stop processing catalog 0x5:10:0 index 4034 count 293
00000040:00040000:18.0:1619726994.248804:0:142168:0:(llog_osd.c:905:llog_osd_next_block()) LBUG
It seems that one mdd_changelog_clear() instance deletes a plain llog while the other instance is trying to retrieve the next record from the same plain llog. This matches the dk log above: thread 142143 cancels plain log [0x1c1e:0x1:0x0] at 1619726994.237843, and thread 142168 hits the llog_osd_exist() assertion on that llog at 1619726994.237845.
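To make the suspected interleaving concrete, here is an abstract model of the race in plain pthreads. This is not Lustre code: a bool stands in for llog_osd_exist() and assert() stands in for LASSERT/LBUG.

/* Abstract model of the suspected race (build with: cc race.c -pthread).
 * Thread A plays the llog_cat_cleanup() side that cancels the plain
 * llog; thread B plays the llog_osd_next_block() side that asserts the
 * llog still exists.  The unsynchronized flag is the point: nothing
 * orders the deletion against the reader. */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

struct plain_llog {
	bool exists;		/* stands in for llog_osd_exist() */
};

static struct plain_llog log_obj = { .exists = true };

static void *clear_instance_a(void *arg)
{
	/* Instance A finishes its purge and cancels the plain llog. */
	log_obj.exists = false;
	return NULL;
}

static void *clear_instance_b(void *arg)
{
	/* Instance B is still walking the same plain llog and reads
	 * the next block, asserting existence first. */
	usleep(1000);
	assert(log_obj.exists);		/* fires: the LBUG equivalent */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, clear_instance_a, NULL);
	pthread_create(&b, NULL, clear_instance_b, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}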
I have tried, without success, to reproduce the issue on VMs with several concurrent "lfs changelog_clear" and "lfs changelog" instances.
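For reference, this is a sketch of the kind of reader/clearer instance those attempts correspond to, written against liblustreapi; the MDT name, batching, and error handling are illustrative, not taken from robinhood.

/* One reader/clearer instance (build with: cc repro.c -llustreapi).
 * Running several copies concurrently with the same changelog user
 * ("cl1") mimics the reported workload. */
#include <stdio.h>
#include <lustre/lustreapi.h>

int main(int argc, char **argv)
{
	const char *mdt = argc > 1 ? argv[1] : "scratch2-MDT0002";
	struct changelog_rec *rec;
	long long last = 0;
	void *ctx;
	int rc;

	/* Read all pending records, remembering the highest index. */
	rc = llapi_changelog_start(&ctx, 0, mdt, 0);
	if (rc < 0) {
		fprintf(stderr, "changelog_start: %d\n", rc);
		return 1;
	}
	while ((rc = llapi_changelog_recv(ctx, &rec)) == 0) {
		last = rec->cr_index;
		llapi_changelog_free(&rec);
	}
	llapi_changelog_fini(&ctx);

	/* Racy step: another instance may already have cleared up to
	 * (or beyond) 'last', which yields the -EINVAL seen above. */
	if (last > 0) {
		rc = llapi_changelog_clear(mdt, "cl1", last);
		if (rc < 0)
			fprintf(stderr, "changelog_clear: %d\n", rc);
	}
	return 0;
}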
I have backported https://review.whamcloud.com/43572/ ("LU-14606 llog: hide ENOENT for cancelling record") because this patch seems to handle "llog cancel" for changelogs more cleanly (llog_changelog_cancel_cb() --> RETURN(LLOG_DEL_RECORD)). But I do not think it will resolve the issue.
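My reading of that patch, paraphrased in its in-tree context rather than quoted from the diff: the cancel callback no longer cancels the record itself but returns LLOG_DEL_RECORD, so the llog layer performs the deletion and can treat a record that already disappeared (-ENOENT) as success in one place.

/* Paraphrase of the LU-14606 approach, not the literal patch. */
static int llog_changelog_cancel_cb(const struct lu_env *env,
				    struct llog_handle *llh,
				    struct llog_rec_hdr *hdr, void *data)
{
	struct llog_changelog_rec *rec = (struct llog_changelog_rec *)hdr;
	long long endrec = *(long long *)data;

	if (rec->cr.cr_index > endrec)
		/* records are ordered; all older ones are handled */
		return -LLOG_EEMPTY;

	/* Instead of cancelling here (and having to interpret -ENOENT
	 * itself), ask the llog layer to delete the record; the layer
	 * can then hide an ENOENT caused by a concurrent removal. */
	return LLOG_DEL_RECORD;
}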