Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Affects Version/s: Lustre 2.18.0, Lustre 2.15.8
Description
[202289.946894] Lustre: k6zwlb4v-MDT0000-mdc-ffff00040770e800: Force grant RPC slot (3 current) to proc with flag: 208840.
[202289.948916] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[202289.950469] Mem abort info:
[202289.950989]   ESR = 0x96000005
[202289.953243]   EC = 0x25: DABT (current EL), IL = 32 bits
[202289.954210]   SET = 0, FnV = 0
[202289.954782]   EA = 0, S1PTW = 0
[202289.955364] Data abort info:
[202289.955899]   ISV = 0, ISS = 0x00000005
[202289.956600]   CM = 0, WnR = 0
[202289.957153] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000056337000
[202289.958292] [0000000000000000] pgd=000000004a0fa003, p4d=000000004a0fa003, pud=0000000000000000
[202289.959819] Internal error: Oops: 0000000096000005 [#1] SMP
[202289.960810] Modules linked in: af_packet_diag udp_diag tcp_diag inet_diag ip6table_filter osp(OE) lod(OE) mdt(OE) mdd(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(POE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) nf_log_ipv4 nf_log_common xt_LOG xt_limit iptable_filter xt_mark iptable_mangle bpfilter lnet(OE) crc32_generic libcfs(OE) sunrpc vfat fat dm_mirror dm_region_hash dm_log dm_mod ghash_ce sha2_ce sha256_arm64 sha1_ce zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) binfmt_misc zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) ena ptp pps_core
[202289.969945] CPU: 1 PID: 2 Comm: kthreadd Kdump: loaded Tainted: P OE 5.10.251-248.983.amzn2.aarch64 #1
[202289.971778] Hardware name: Amazon EC2 c6gn.large/, BIOS 1.0 11/1/2018
[202289.972921] pstate: 60c00005 (nZCv daif +PAN +UAO -TCO BTYPE=--)
[202289.973995] pc : kthread_should_stop+0x18/0x40
[202289.974796] lr : wait_woken+0x74/0x84
[202289.975462] sp : ffff80000a8bb160
[202289.976070] x29: ffff80000a8bb160 x28: ffff00040c87dd10
[202289.977018] x27: ffff80000a8bb1d8 x26: 000000003b9aca00
[202289.977968] x25: ffff8000018b79f0 x24: ffffffffffffffff
[202289.978915] x23: ffff0003c0260000 x22: 0000000000000402
[202289.979863] x21: 0000000000000000 x20: ffff00040c87dd00
[202289.980810] x19: ffff80000a8bb1c0 x18: 0000000000000030
[202289.981761] x17: 0000000000000000 x16: 0000000000000000
[202289.982708] x15: 0000000000000000 x14: 0000000000000000
[202289.983656] x13: 000041ed00000000 x12: 0000000000000000
[202289.984602] x11: 0101010101010101 x10: 0000000000000d30
[202289.985550] x9 : ffff800008103a34 x8 : ffff0003c0260d90
[202289.986498] x7 : 00000000ebc0de03 x6 : 00000000ebc0de01
[202289.987445] x5 : 00000000410fd0c0 x4 : ffff000407b30d98
[202289.988394] x3 : 0000000000000000 x2 : 00000000000008fc
[202289.989343] x1 : 0000000000208840 x0 : 0000000000000000
[202289.990291] Call trace:
[202289.990754]  kthread_should_stop+0x18/0x40
[202289.991546]  ptlrpc_set_wait+0x1d0/0x66c [ptlrpc]
[202289.992440]  ptlrpc_queue_wait+0xa4/0x370 [ptlrpc]
[202289.993313]  mdc_close+0x224/0xe64 [mdc]
[202289.994029]  lmv_close+0x1a8/0x480 [lmv]
[202289.994757]  ll_close_inode_openhandle+0x418/0xcdc [lustre]
[202289.995762]  ll_md_real_close+0xa4/0x280 [lustre]
[202289.996621]  ll_clear_inode+0x1a0/0x7d8 [lustre]
[202289.997465]  ll_delete_inode+0x70/0x260 [lustre]
[202289.998294]  evict+0xdc/0x240
[202289.998838]  iput_final+0x8c/0x1c0
[202289.999459]  iput+0x10c/0x128
[202290.000006]  dentry_unlink_inode+0xc8/0x150
[202290.000758]  __dentry_kill+0xec/0x21c
[202290.001424]  shrink_dentry_list+0xa8/0x138
[202290.002155]  prune_dcache_sb+0x64/0x94
[202290.002827]  super_cache_scan+0x128/0x1a4
[202290.003541]  do_shrink_slab+0x194/0x394
[202290.004225]  shrink_slab+0xbc/0x13c
[202290.004853]  shrink_node_memcgs+0x1d4/0x230
[202290.005598]  shrink_node+0x150/0x5e0
[202290.006240]  shrink_zones+0x98/0x220
[202290.006880]  do_try_to_free_pages+0xac/0x2e0
[202290.007638]  try_to_free_pages+0x120/0x25c
[202290.008370]  __alloc_pages_slowpath.constprop.0+0x420/0x8a0
[202290.009352]  __alloc_pages_nodemask+0x2b4/0x308
[202290.010157]  alloc_pages_current+0x8c/0x13c
[202290.010901]  __vmalloc_area_node+0x104/0x280
[202290.011662]  __vmalloc_node_range+0x80/0xe4
[202290.012409]  alloc_thread_stack_node+0xc4/0x128
[202290.013214]  dup_task_struct+0x54/0x29c
[202290.013899]  copy_process+0x1d0/0x11b4
[202290.014570]  kernel_clone+0x94/0x380
[202290.015213]  kernel_thread+0x6c/0x94
[202290.015855]  kthreadd+0x178/0x350
[202290.016456] Code: d5384100 b9403401 36a800a1 f943f400 (f9400000)
[202290.017528] SMP: stopping secondary CPUs
[202290.019499] Starting crashdump kernel...
[202290.020203] Bye!
When the kernel runs out of memory and tries to reclaim the inode cache, evicting a Lustre inode can require closing an open MDS handle, which prepares and sends a ptlrpc request. Here the reclaim is happening in the context of kthreadd (PID 2), which was allocating a stack for a new kernel thread.
In ptlrpc_set_wait(), if the reply has not yet been received, the thread calls wait_woken(), which in turn calls kthread_should_stop(). On this 5.10 kernel, kthread_should_stop() locates the thread's struct kthread through current->set_child_tid, but for kthreadd itself set_child_tid is NULL, so the read is a NULL pointer dereference.
	/* block until ready or timeout occurs */
	do {
		if (ptlrpc_check_set(NULL, set))
			break;

		if (allow) {
			siginitsetinv(&newset, allow);
			sigprocmask(SIG_BLOCK, &newset, &oldset);
		}
		remaining = wait_woken(&wait, state, remaining);
		if (allow) {
			if (signal_pending(current))
				remaining = -EINTR;
			sigprocmask(SIG_SETMASK, &oldset, NULL);
		}
	} while (remaining > 0);
This issue is similar to LU-18826, where the same wait_woken() function is called from obd_get_mod_rpc_slot() in the context of kthreadd.
Proposed Fix
A possible way to fix this issue is to use wait_event_idle_timeout() instead of wait_woken() in the unlikely but specific situation where the waiting thread is kthreadd.
Also, in this specific situation only one mdc_close request is in the request set, so ptlrpc_check_set() does not need to call ptlrpc_send_new_req(); the change therefore does not re-trigger the issue fixed by LU-15808.
	/*
	 * wait until all complete, interrupted, or an in-flight
	 * req times out
	 */
	CDEBUG(D_RPCTRACE, "set %p going to sleep for %lld seconds\n",
	       set, timeout);

+	/*
+	 * kthreadd (PID 2) has set_child_tid == NULL.
+	 * wait_woken() calls kthread_should_stop(), which dereferences
+	 * set_child_tid and causes a NULL pointer crash.
+	 *
+	 * Use wait_event_idle_timeout for kthreadd instead. This is
+	 * safe because kthreadd only reaches here via the memory
+	 * reclaim shrinker path with a single already-sent close
+	 * RPC, so ptlrpc_check_set just checks completion flags
+	 * and does not block.
+	 */
+	if (unlikely((current->flags & PF_KTHREAD) &&
+		     !current->set_child_tid)) {
+		rc = wait_event_idle_timeout(
+			set->set_waitq,
+			ptlrpc_check_set(NULL, set),
+			remaining);
+		if (rc == 0) {
+			rc = -ETIMEDOUT;
+			ptlrpc_expired_set(set);
+		} else {
+			rc = 0;
+		}
+		goto check_completion;
+	}
+
	add_wait_queue(&set->set_waitq, &wait);

	if ((timeout == 0 && !signal_pending(current)) ||
	    set->set_allow_intr) {
		state = TASK_INTERRUPTIBLE;
		allow = LUSTRE_FATAL_SIGS;
	}

	/* block until ready or timeout occurs */
	do {
		if (ptlrpc_check_set(NULL, set))
			break;

		if (allow) {
			siginitsetinv(&newset, allow);
			sigprocmask(SIG_BLOCK, &newset, &oldset);
		}
		remaining = wait_woken(&wait, state, remaining);
		if (allow) {
			if (signal_pending(current))
				remaining = -EINTR;
			sigprocmask(SIG_SETMASK, &oldset, NULL);
		}
	} while (remaining > 0);

	/*
	 * wait_woken* returns the result from schedule_timeout() which
	 * is always a positive number, or 0 on timeout.
	 */
	if (remaining == 0) {
		rc = -ETIMEDOUT;
		ptlrpc_expired_set(set);
	} else if (remaining < 0) {
		rc = -EINTR;
		ptlrpc_interrupted_set(set);
	}
	remove_wait_queue(&set->set_waitq, &wait);

+check_completion:
	/*
	 * -EINTR => all requests have been flagged rq_intr so next
	 * check completes.
	 * -ETIMEDOUT => someone timed out.  When all reqs have
	 * timed out, signals are enabled allowing completion with
	 * EINTR.
	 * I don't really care if we go once more round the loop in
	 * the error cases -eeb.
	 */