Details
Bug
Resolution: Fixed
Blocker
Lustre 2.16.0
None
3
Description
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>
This issue relates to the following test suite run:
https://testing.whamcloud.com/test_sets/d37f6322-70a2-4899-833b-a09de308b500
test_119f failed with the following error:
[ 4762.163878] Lustre: DEBUG MARKER: == sanity test 119f: dio vs dio race ===================== 15:47:47 (1720194467)
[ 4777.465102] BUG: scheduling while atomic: dd/456442/0x00000002
[ 4777.465166] CPU: 1 PID: 456442 Comm: dd 5.14.0-362.24.1.el9_3.x86_64 #1
[ 4777.465173] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 4777.465178] Call Trace:
[ 4777.465194] dump_stack_lvl+0x34/0x48
[ 4777.465250] __schedule_bug.cold+0x47/0x53
[ 4777.465266] schedule_debug.constprop.0+0xc5/0x100
[ 4777.465289] __schedule+0x48/0x550
[ 4777.465319] schedule+0x2d/0x70
[ 4777.465321] schedule_timeout+0x11f/0x160
[ 4777.465331] __wait_for_common+0x93/0x1d0
[ 4777.465334] ? __pfx_schedule_timeout+0x10/0x10
[ 4777.465336] ? __pfx_ll_dio_user_copy_helper+0x10/0x10 [obdclass]
[ 4777.465657] wait_for_completion_killable+0x20/0x40
[ 4777.465660] __kthread_create_on_node+0xe2/0x170
[ 4777.465677] kthread_create_on_node+0x49/0x70
[ 4777.465680] ll_dio_user_copy+0x8c/0x100 [obdclass]
[ 4777.465734] osc_build_rpc+0x14a/0x1440 [osc]
[ 4777.465841] osc_send_write_rpc+0x396/0x470 [osc]
[ 4777.465861] osc_check_rpcs+0x11b/0x430 [osc]
[ 4777.465880] osc_cache_writeback_range+0xf84/0x1020 [osc]
[ 4777.465904] osc_io_fsync_start+0x85/0x360 [osc]
[ 4777.465922] cl_io_start+0x61/0x130 [obdclass]
[ 4777.466088] lov_io_call.constprop.0+0x73/0x160 [lov]
[ 4777.466178] lov_io_start+0xc1/0x180 [lov]
[ 4777.466190] cl_io_start+0x61/0x130 [obdclass]
[ 4777.466244] cl_io_loop+0x99/0x220 [obdclass]
[ 4777.466350] cl_sync_file_range+0x298/0x360 [lustre]
[ 4777.466552] ll_writepages+0x195/0x220 [lustre]
[ 4777.466589] do_writepages+0xcf/0x1d0
[ 4777.466688] filemap_fdatawrite_wbc+0x66/0x90
[ 4777.466696] __filemap_fdatawrite_range+0x54/0x80
[ 4777.466699] filemap_write_and_wait_range+0x41/0xb0
[ 4777.466701] ll_fsync+0x78/0x570 [lustre]
[ 4777.466761] do_syscall_64+0x5c/0x90
Strangely, the same thread also hits an LASSERT a fraction of a second later:
[ 4777.479708] LustreError: 456442:0:(osc_request.c:2804:osc_build_rpc()) ASSERTION( (!((((( gfp_t)(0x400u|0x800u)) | (( gfp_t)0x40u))) != ((( gfp_t)0x20u)|(( gfp_t)0x200u)|(( gfp_t)0x800u))) || (!(((preempt_count() & (((1UL << (4))-1) << (((0 + 8) + 8) + 4))) | (preempt_count() & (((1UL << (4))-1) << ((0 + 8) + 8))) | (preempt_count() & (((1UL << (8))-1) << (0 + 8))))))) ) failed:
[ 4777.479713] LustreError: 456442:0:(osc_request.c:2804:osc_build_rpc()) LBUG
[ 4777.479724] Kernel panic - not syncing: LBUG in interrupt.
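To make the expanded assertion easier to read, here is a minimal Python sketch that evaluates the same expression for a given preempt_count() value. The shifts and widths are copied from the masks in the assertion text. The field names (preempt, softirq, hardirq, nmi) and the helper names are assumptions for illustration, following the usual kernel preempt_count layout; they are not taken from the Lustre source. The value 0x2 is the one printed in the "scheduling while atomic" line.

```python
# Bit fields of preempt_count(), as (shift, width).
# softirq/hardirq/nmi come from the masks in the assertion text;
# the low 8-bit "preempt" field and all names are assumptions.
FIELDS = {
    "preempt": (0, 8),
    "softirq": (0 + 8, 8),
    "hardirq": ((0 + 8) + 8, 4),
    "nmi": (((0 + 8) + 8) + 4, 4),
}

# The two flag constants compared in the assertion.
ALLOC_FLAGS = 0x400 | 0x800 | 0x40    # left-hand side
ATOMIC_FLAGS = 0x20 | 0x200 | 0x800   # right-hand side


def decode_preempt_count(value):
    """Split a preempt_count() value into its bit fields."""
    return {
        name: (value >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


def interrupt_bits(value):
    """Bits tested by the second half of the assertion."""
    mask = 0
    for name, (shift, width) in FIELDS.items():
        if name != "preempt":
            mask |= ((1 << width) - 1) << shift
    return value & mask


def assertion_holds(preempt_count):
    """(!(ALLOC_FLAGS != ATOMIC_FLAGS)) || (!interrupt_bits)."""
    return (not (ALLOC_FLAGS != ATOMIC_FLAGS)) or (not interrupt_bits(preempt_count))


if __name__ == "__main__":
    print(decode_preempt_count(0x2))
    print(assertion_holds(0x2))
```

Under these assumptions, a count of 0x2 has only the low preempt field set, so the expression would evaluate true for that value. Since the two flag constants differ, the assertion can only fail when one of the softirq, hardirq, or nmi fields is nonzero, which suggests the count differed from 0x2 by the time the assertion ran. The log does not show the count at that point.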
Test session details:
clients: https://build.whamcloud.com/job/lustre-reviews/105920 - 5.14.0-362.24.1.el9_3.x86_64
servers: https://build.whamcloud.com/job/lustre-reviews/105920 - 4.18.0-513.24.1.el8_lustre.x86_64
I didn't see any other similar recent crashes. The patch under test didn't change anything related to CLIO, so I don't expect that it caused this issue. It may just be a low-frequency race condition.
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_119f - trevis-58vm4 crashed during sanity test_119f