Details
Bug
Resolution: Unresolved
Critical
Lustre 2.17.0
Description
About two master landings ago, when the first bits of the recent clio/dio batch started to come in, sanity-pcc test 40 began crashing prominently in ll_release_user_pages. It looks like this:
First crash: https://testing.whamcloud.com/test_sets/dd2f3162-c2dc-40c7-a02a-d1bad78c95e1
Most recent crash at the time of filing this ticket: https://testing.whamcloud.com/test_sets/98d8457e-5aeb-41d7-976a-a713ecb4ddf1
[33457.292181] Lustre: DEBUG MARKER: dd if=/mnt/lustre/d40.sanity-pcc/f40.sanity-pcc of=/dev/null bs=1M count=1 iflag=direct
[33457.400022] BUG: kernel NULL pointer dereference, address: 0000000000000000
[33457.402182] #PF: supervisor read access in kernel mode
[33457.403315] #PF: error_code(0x0000) - not-present page
[33457.404463] PGD 0 P4D 0
[33457.405240] Oops: 0000 [#1] PREEMPT SMP PTI
[33457.417216] CPU: 1 PID: 1698 Comm: dd Kdump: loaded Tainted: G OE n 6.4.0-150600.23.50-default #1 SLE15-SP6 32013eadc71d652cb07a599d8a722b9604994156
[33457.435656] Hardware name: Red Hat KVM, BIOS 1.16.0-4.module+el8.8.0+1454+0b2cbfb8 04/01/2014
[33457.437334] RIP: 0010:ll_release_user_pages+0x15/0x100 [obdclass]
[33457.438973] Code: 6d e5 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 83 fe 00 41 55 49 89 fd 41 54 55 53 74 5b 7e 4b <48> 8b 07 48 85 c0 74 43 48 8d 6f 08 8d 56 ff 4c 8d 64 d5 00 eb 12
[33457.442598] RSP: 0018:ffffb6ea42cdf510 EFLAGS: 00010202
[33457.443518] RAX: 0000000000000000 RBX: ffff89e303f33000 RCX: ffff0a00ffffff04
[33457.444607] RDX: 0000000000000001 RSI: 000000006ea42ce0 RDI: 0000000000000000
[33457.445870] RBP: ffff89e303f33000 R08: 0000000000000000 R09: 0000000000000151
[33457.447596] R10: ffffb6ea42cdf560 R11: 0a2e676e696e6961 R12: ffff89e30307aab8
[33457.449301] R13: 0000000000000000 R14: ffff89e303f33000 R15: ffff89e314aa8090
[33457.451116] FS: 00007ff372aa3740(0000) GS:ffff89e3bcd00000(0000) knlGS:0000000000000000
[33457.452927] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[33457.454459] CR2: 0000000000000000 CR3: 0000000008572002 CR4: 0000000000060ee0
[33457.456278] Call Trace:
[33457.457114] <TASK>
[33457.457825] cl_sub_dio_end+0x226/0x490 [obdclass 236ee5bdfa9d6196309bc0286afb34df863838c8]
[33457.460100] ? __pfx_cl_sub_dio_end+0x10/0x10 [obdclass 236ee5bdfa9d6196309bc0286afb34df863838c8]
[33457.462521] __cl_sync_io_note+0x224/0x330 [obdclass 236ee5bdfa9d6196309bc0286afb34df863838c8]
[33457.464029] ll_direct_IO+0xa3a/0xdd0 [lustre e0f2add258d3842e2f4f396fc167c80b2708be3b]
[33457.465457] ? atime_needs_update+0xa3/0x110
[33457.466166] ? touch_atime+0x34/0x150
[33457.466813] generic_file_read_iter+0x87/0x120
[33457.467613] vvp_io_read_start+0x6c2/0x8a0 [lustre e0f2add258d3842e2f4f396fc167c80b2708be3b]
[33457.468944] cl_io_start+0x70/0x140 [obdclass 236ee5bdfa9d6196309bc0286afb34df863838c8]
[33457.470261] cl_io_loop+0x9e/0x230 [obdclass 236ee5bdfa9d6196309bc0286afb34df863838c8]
[33457.471523] ? ll_cl_add+0x95/0x100 [lustre e0f2add258d3842e2f4f396fc167c80b2708be3b]
[33457.472749] ll_file_io_generic+0xa20/0x10a0 [lustre e0f2add258d3842e2f4f396fc167c80b2708be3b]
[33457.474068] do_file_read_iter+0xd2c/0x1050 [lustre e0f2add258d3842e2f4f396fc167c80b2708be3b]
[33457.475356] __kernel_read+0xf0/0x280
[33457.475982] pcc_attach_data_archive+0x432/0xb70 [lustre e0f2add258d3842e2f4f396fc167c80b2708be3b]
[33457.477314] pcc_readonly_attach+0x4c0/0xd90 [lustre e0f2add258d3842e2f4f396fc167c80b2708be3b]
[33457.478601] ? pcc_readonly_attach_sync+0x1d3/0x2c0 [lustre e0f2add258d3842e2f4f396fc167c80b2708be3b]
[33457.479956] pcc_readonly_attach_sync+0x1d3/0x2c0 [lustre e0f2add258d3842e2f4f396fc167c80b2708be3b]
[33457.481281] pcc_file_open+0x9c4/0x1040 [lustre e0f2add258d3842e2f4f396fc167c80b2708be3b]
[33457.482503] ll_atomic_open+0x985/0x9e0 [lustre e0f2add258d3842e2f4f396fc167c80b2708be3b]
[33457.483726] ? __d_lookup+0x72/0xb0
[33457.484295] path_openat+0x644/0x1050
[33457.484909] do_filp_open+0xc5/0x140
[33457.485531] ? kmem_cache_alloc+0x18a/0x340
[33457.486587] ? getname_flags+0x46/0x1e0
[33457.487635] ? do_sys_openat2+0x248/0x320
[33457.488522] do_sys_openat2+0x248/0x320
[33457.489562] do_sys_open+0x57/0x80
[33457.490500] do_syscall_64+0x5b/0x80
[33457.491287] ? __count_memcg_events+0x46/0x90
[33457.492327] ? count_memcg_event_mm+0x3d/0x60
[33457.493494] ? handle_mm_fault+0x196/0x2f0
[33457.494150] ? do_user_addr_fault+0x267/0x890
[33457.495103] ? exc_page_fault+0x69/0x150
[33457.496143] entry_SYSCALL_64_after_hwframe+0x7c/0xe6
[33457.497461] RIP: 0033:0x7ff37292017e
[33457.498090] Code: 83 e2 40 75 4f 89 f0 f7 d0 a9 00 00 41 00 74 44 80 3d b5 d8 0e 00 00 74 68 89 da 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 8e 00 00 00 48 8b 54 24 28 64 48 2b 14 25
I guess all the hits I saw come from SLES15 SP6, btw, hence why it's not showing up on regular reviews?