
Details

Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.18.0, Lustre 2.15.8
Labels:
None

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

[202289.946894] Lustre: k6zwlb4v-MDT0000-mdc-ffff00040770e800: Force grant RPC slot (3 current) to proc with flag: 208840.
[202289.948916] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[202289.950469] Mem abort info:
[202289.950989]   ESR = 0x96000005
[202289.953243]   EC = 0x25: DABT (current EL), IL = 32 bits
[202289.954210]   SET = 0, FnV = 0
[202289.954782]   EA = 0, S1PTW = 0
[202289.955364] Data abort info:
[202289.955899]   ISV = 0, ISS = 0x00000005
[202289.956600]   CM = 0, WnR = 0
[202289.957153] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000056337000
[202289.958292] [0000000000000000] pgd=000000004a0fa003, p4d=000000004a0fa003, pud=0000000000000000
[202289.959819] Internal error: Oops: 0000000096000005 [#1] SMP
[202289.960810] Modules linked in: af_packet_diag udp_diag tcp_diag inet_diag ip6table_filter osp(OE) lod(OE) mdt(OE) mdd(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(POE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) nf_log_ipv4 nf_log_common xt_LOG xt_limit iptable_filter xt_mark iptable_mangle bpfilter lnet(OE) crc32_generic libcfs(OE) sunrpc vfat fat dm_mirror dm_region_hash dm_log dm_mod ghash_ce sha2_ce sha256_arm64 sha1_ce zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) binfmt_misc zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) ena ptp pps_core
[202289.969945] CPU: 1 PID: 2 Comm: kthreadd Kdump: loaded Tainted: P           OE     5.10.251-248.983.amzn2.aarch64 #1
[202289.971778] Hardware name: Amazon EC2 c6gn.large/, BIOS 1.0 11/1/2018
[202289.972921] pstate: 60c00005 (nZCv daif +PAN +UAO -TCO BTYPE=--)
[202289.973995] pc : kthread_should_stop+0x18/0x40
[202289.974796] lr : wait_woken+0x74/0x84
[202289.975462] sp : ffff80000a8bb160
[202289.976070] x29: ffff80000a8bb160 x28: ffff00040c87dd10 
[202289.977018] x27: ffff80000a8bb1d8 x26: 000000003b9aca00 
[202289.977968] x25: ffff8000018b79f0 x24: ffffffffffffffff 
[202289.978915] x23: ffff0003c0260000 x22: 0000000000000402 
[202289.979863] x21: 0000000000000000 x20: ffff00040c87dd00 
[202289.980810] x19: ffff80000a8bb1c0 x18: 0000000000000030 
[202289.981761] x17: 0000000000000000 x16: 0000000000000000 
[202289.982708] x15: 0000000000000000 x14: 0000000000000000 
[202289.983656] x13: 000041ed00000000 x12: 0000000000000000 
[202289.984602] x11: 0101010101010101 x10: 0000000000000d30 
[202289.985550] x9 : ffff800008103a34 x8 : ffff0003c0260d90 
[202289.986498] x7 : 00000000ebc0de03 x6 : 00000000ebc0de01 
[202289.987445] x5 : 00000000410fd0c0 x4 : ffff000407b30d98 
[202289.988394] x3 : 0000000000000000 x2 : 00000000000008fc 
[202289.989343] x1 : 0000000000208840 x0 : 0000000000000000 
[202289.990291] Call trace:
[202289.990754]  kthread_should_stop+0x18/0x40
[202289.991546]  ptlrpc_set_wait+0x1d0/0x66c [ptlrpc]
[202289.992440]  ptlrpc_queue_wait+0xa4/0x370 [ptlrpc]
[202289.993313]  mdc_close+0x224/0xe64 [mdc]
[202289.994029]  lmv_close+0x1a8/0x480 [lmv]
[202289.994757]  ll_close_inode_openhandle+0x418/0xcdc [lustre]
[202289.995762]  ll_md_real_close+0xa4/0x280 [lustre]
[202289.996621]  ll_clear_inode+0x1a0/0x7d8 [lustre]
[202289.997465]  ll_delete_inode+0x70/0x260 [lustre]
[202289.998294]  evict+0xdc/0x240
[202289.998838]  iput_final+0x8c/0x1c0
[202289.999459]  iput+0x10c/0x128
[202290.000006]  dentry_unlink_inode+0xc8/0x150
[202290.000758]  __dentry_kill+0xec/0x21c
[202290.001424]  shrink_dentry_list+0xa8/0x138
[202290.002155]  prune_dcache_sb+0x64/0x94
[202290.002827]  super_cache_scan+0x128/0x1a4
[202290.003541]  do_shrink_slab+0x194/0x394
[202290.004225]  shrink_slab+0xbc/0x13c
[202290.004853]  shrink_node_memcgs+0x1d4/0x230
[202290.005598]  shrink_node+0x150/0x5e0
[202290.006240]  shrink_zones+0x98/0x220
[202290.006880]  do_try_to_free_pages+0xac/0x2e0
[202290.007638]  try_to_free_pages+0x120/0x25c
[202290.008370]  __alloc_pages_slowpath.constprop.0+0x420/0x8a0
[202290.009352]  __alloc_pages_nodemask+0x2b4/0x308
[202290.010157]  alloc_pages_current+0x8c/0x13c
[202290.010901]  __vmalloc_area_node+0x104/0x280
[202290.011662]  __vmalloc_node_range+0x80/0xe4
[202290.012409]  alloc_thread_stack_node+0xc4/0x128
[202290.013214]  dup_task_struct+0x54/0x29c
[202290.013899]  copy_process+0x1d0/0x11b4
[202290.014570]  kernel_clone+0x94/0x380
[202290.015213]  kernel_thread+0x6c/0x94
[202290.015855]  kthreadd+0x178/0x350
[202290.016456] Code: d5384100 b9403401 36a800a1 f943f400 (f9400000) 
[202290.017528] SMP: stopping secondary CPUs
[202290.019499] Starting crashdump kernel...
[202290.020203] Bye!

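When the kernel runs out of memory, direct reclaim shrinks the dentry and inode caches. Evicting a Lustre inode closes its open handle, which prepares and sends a ptlrpc close request (mdc_close) and waits for the reply. In this trace the reclaim runs in the context of kthreadd (PID 2) while it allocates a stack for a new kernel thread.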

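In ptlrpc_set_wait, if the reply has not been received yet, the thread calls wait_woken, which calls kthread_should_stop. kthread_should_stop reads current->set_child_tid to locate the thread's struct kthread, but for kthreadd set_child_tid is NULL, so the read faults at virtual address 0.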

/* block until ready or timeout occurs */
do {
    if (ptlrpc_check_set(NULL, set))
        break;
    if (allow) {
        siginitsetinv(&newset, allow);
        sigprocmask(SIG_BLOCK, &newset, &oldset);
    }
    remaining = wait_woken(&wait, state, remaining);
    if (allow) {
        if (signal_pending(current))
            remaining = -EINTR;
        sigprocmask(SIG_SETMASK, &oldset, NULL);
    }
} while (remaining > 0);
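The NULL dereference described above can be modeled in user space. This is a simplified sketch, assuming the 5.10-era behavior where the kernel finds a thread's struct kthread through task->set_child_tid. The struct definitions are stand-ins for the kernel's, the bit index is illustrative, and task_can_check_should_stop() and checked_kthread_should_stop() are hypothetical helpers, not kernel or Lustre APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PF_KTHREAD              0x00200000UL /* "I am a kernel thread" task flag */
#define KTHREAD_SHOULD_STOP_BIT 1            /* stand-in bit index for this model */

/* Stand-in for the kernel's struct kthread; only the flags word matters here. */
struct kthread {
	unsigned long flags;
};

/* Stand-in for struct task_struct with the two fields involved in the oops. */
struct task {
	unsigned long flags;
	void *set_child_tid; /* kernel threads keep their struct kthread pointer here */
};

/* Hypothetical helper: true only when set_child_tid can be dereferenced. */
static bool task_can_check_should_stop(const struct task *t)
{
	return (t->flags & PF_KTHREAD) && t->set_child_tid != NULL;
}

/*
 * Model of kthread_should_stop(): reads the flags word through
 * set_child_tid. The assert marks the spot where the real kernel
 * faults when the pointer is NULL, as it is for kthreadd.
 */
static bool model_kthread_should_stop(const struct task *t)
{
	const struct kthread *k = t->set_child_tid;

	assert(k != NULL);
	return (k->flags >> KTHREAD_SHOULD_STOP_BIT) & 1UL;
}

/* Hypothetical guarded variant: never dereferences a NULL set_child_tid. */
static bool checked_kthread_should_stop(const struct task *t)
{
	if (!task_can_check_should_stop(t))
		return false;
	return model_kthread_should_stop(t);
}
```

A task with PF_KTHREAD set and a NULL set_child_tid, which is the kthreadd case in this report, is rejected by the guard instead of reaching the dereference.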

This issue is similar to ~~LU-18826~~, where the same wait_woken function was called from obd_get_mod_rpc_slot in the context of kthreadd.

Proposed Fix

A possible fix is to use wait_event_idle_timeout rather than wait_woken in the unlikely case where the calling thread is kthreadd.

In this specific situation the request set contains only the single mdc_close request, which has already been sent, so ptlrpc_check_set has no need to call ptlrpc_send_new_req. The change therefore should not trigger the issue fixed by ~~LU-15808~~.

      /*
       * wait until all complete, interrupted, or an in-flight
       * req times out
       */
      CDEBUG(D_RPCTRACE, "set %p going to sleep for %lld seconds\n",
             set, timeout);
  
  +    /*
  +    * kthreadd (PID 2) has set_child_tid == NULL.
  +    * wait_woken() calls kthread_should_stop(), which dereferences
  +    * set_child_tid and causes a NULL pointer dereference.
  +    *
  +    * Use wait_event_idle_timeout for kthreadd instead. This is
  +    * safe because kthreadd only reaches here via the memory
  +    * reclaim shrinker path with a single already-sent close
  +    * RPC, so ptlrpc_check_set just checks completion flags
  +    * and does not block.
  +    */
  +   if (unlikely((current->flags & PF_KTHREAD) &&
  +          !current->set_child_tid)) {
  +     rc = wait_event_idle_timeout(
  +       set->set_waitq,
  +       ptlrpc_check_set(NULL, set),
  +       remaining);
  +     if (rc == 0) {
  +       rc = -ETIMEDOUT;
  +       ptlrpc_expired_set(set);
  +     } else {
  +       rc = 0;
  +     }
  +     goto check_completion;
  +   }
  +
      add_wait_queue(&set->set_waitq, &wait);
      if ((timeout == 0 && !signal_pending(current)) ||
          set->set_allow_intr) {
        state = TASK_INTERRUPTIBLE;
        allow = LUSTRE_FATAL_SIGS;
      }
      /* block until ready or timeout occurs */
      do {
        if (ptlrpc_check_set(NULL, set))
          break;
        if (allow) {
          siginitsetinv(&newset, allow);
          sigprocmask(SIG_BLOCK, &newset, &oldset);
        }
        remaining = wait_woken(&wait, state, remaining);
        if (allow) {
          if (signal_pending(current))
            remaining = -EINTR;
          sigprocmask(SIG_SETMASK, &oldset, NULL);
        }
      } while (remaining > 0);
      /*
       * wait_woken* returns the result from schedule_timeout() which
       * is always a positive number, or 0 on timeout.
       */
      if (remaining == 0) {
        rc = -ETIMEDOUT;
        ptlrpc_expired_set(set);
      } else if (remaining < 0) {
        rc = -EINTR;
        ptlrpc_interrupted_set(set);
      }
      remove_wait_queue(&set->set_waitq, &wait);  

+ check_completion: 
      /*
       * -EINTR => all requests have been flagged rq_intr so next
       * check completes.
       * -ETIMEDOUT => someone timed out.  When all reqs have
       * timed out, signals are enabled allowing completion with
       * EINTR.
       * I don't really care if we go once more round the loop in
       * the error cases -eeb.
       */
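The two decisions the patch makes, when to take the alternate wait path and how to turn the wait result into rc, can be checked in isolation. This is a minimal user-space sketch: needs_kthreadd_wait_path() and wait_result_to_rc() are hypothetical helpers that mirror the patch's logic and do not exist in Lustre. It assumes the documented behavior of wait_event_idle_timeout, which returns 0 when the condition is still false after the timeout and a positive value otherwise.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define PF_KTHREAD 0x00200000UL /* "I am a kernel thread" task flag */

/*
 * Hypothetical helper mirroring the guard in the proposed patch:
 * a kernel thread whose set_child_tid is NULL must not call wait_woken().
 */
static bool needs_kthreadd_wait_path(unsigned long task_flags,
				     const void *set_child_tid)
{
	return (task_flags & PF_KTHREAD) && set_child_tid == NULL;
}

/*
 * Hypothetical helper mirroring how the patch maps the return value of
 * wait_event_idle_timeout(): 0 means the timeout expired with the
 * condition still false, any positive value means the set completed.
 */
static int wait_result_to_rc(long wait_ret)
{
	assert(wait_ret >= 0); /* the idle (uninterruptible) wait never returns < 0 */
	return wait_ret == 0 ? -ETIMEDOUT : 0;
}
```

With the task flags value 0x208840 from the log line above and a NULL set_child_tid, the guard selects the alternate path. Because the wait is uninterruptible, there is no -EINTR case to map, which is why the patch can skip ptlrpc_interrupted_set on this path.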

Kernel panic due to null pointer from ptlrpc_set_wait
