  1. Lustre
  2. LU-20153

Kernel panic due to null pointer from ptlrpc_set_wait


Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.18.0, Lustre 2.15.8
    • None
    • 3

    Description

      [202289.946894] Lustre: k6zwlb4v-MDT0000-mdc-ffff00040770e800: Force grant RPC slot (3 current) to proc with flag: 208840.
      [202289.948916] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
      [202289.950469] Mem abort info:
      [202289.950989]   ESR = 0x96000005
      [202289.953243]   EC = 0x25: DABT (current EL), IL = 32 bits
      [202289.954210]   SET = 0, FnV = 0
      [202289.954782]   EA = 0, S1PTW = 0
      [202289.955364] Data abort info:
      [202289.955899]   ISV = 0, ISS = 0x00000005
      [202289.956600]   CM = 0, WnR = 0
      [202289.957153] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000056337000
      [202289.958292] [0000000000000000] pgd=000000004a0fa003, p4d=000000004a0fa003, pud=0000000000000000
      [202289.959819] Internal error: Oops: 0000000096000005 [#1] SMP
      [202289.960810] Modules linked in: af_packet_diag udp_diag tcp_diag inet_diag ip6table_filter osp(OE) lod(OE) mdt(OE) mdd(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(POE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) nf_log_ipv4 nf_log_common xt_LOG xt_limit iptable_filter xt_mark iptable_mangle bpfilter lnet(OE) crc32_generic libcfs(OE) sunrpc vfat fat dm_mirror dm_region_hash dm_log dm_mod ghash_ce sha2_ce sha256_arm64 sha1_ce zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) binfmt_misc zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) ena ptp pps_core
      [202289.969945] CPU: 1 PID: 2 Comm: kthreadd Kdump: loaded Tainted: P           OE     5.10.251-248.983.amzn2.aarch64 #1
      [202289.971778] Hardware name: Amazon EC2 c6gn.large/, BIOS 1.0 11/1/2018
      [202289.972921] pstate: 60c00005 (nZCv daif +PAN +UAO -TCO BTYPE=--)
      [202289.973995] pc : kthread_should_stop+0x18/0x40
      [202289.974796] lr : wait_woken+0x74/0x84
      [202289.975462] sp : ffff80000a8bb160
      [202289.976070] x29: ffff80000a8bb160 x28: ffff00040c87dd10 
      [202289.977018] x27: ffff80000a8bb1d8 x26: 000000003b9aca00 
      [202289.977968] x25: ffff8000018b79f0 x24: ffffffffffffffff 
      [202289.978915] x23: ffff0003c0260000 x22: 0000000000000402 
      [202289.979863] x21: 0000000000000000 x20: ffff00040c87dd00 
      [202289.980810] x19: ffff80000a8bb1c0 x18: 0000000000000030 
      [202289.981761] x17: 0000000000000000 x16: 0000000000000000 
      [202289.982708] x15: 0000000000000000 x14: 0000000000000000 
      [202289.983656] x13: 000041ed00000000 x12: 0000000000000000 
      [202289.984602] x11: 0101010101010101 x10: 0000000000000d30 
      [202289.985550] x9 : ffff800008103a34 x8 : ffff0003c0260d90 
      [202289.986498] x7 : 00000000ebc0de03 x6 : 00000000ebc0de01 
      [202289.987445] x5 : 00000000410fd0c0 x4 : ffff000407b30d98 
      [202289.988394] x3 : 0000000000000000 x2 : 00000000000008fc 
      [202289.989343] x1 : 0000000000208840 x0 : 0000000000000000 
      [202289.990291] Call trace:
      [202289.990754]  kthread_should_stop+0x18/0x40
      [202289.991546]  ptlrpc_set_wait+0x1d0/0x66c [ptlrpc]
      [202289.992440]  ptlrpc_queue_wait+0xa4/0x370 [ptlrpc]
      [202289.993313]  mdc_close+0x224/0xe64 [mdc]
      [202289.994029]  lmv_close+0x1a8/0x480 [lmv]
      [202289.994757]  ll_close_inode_openhandle+0x418/0xcdc [lustre]
      [202289.995762]  ll_md_real_close+0xa4/0x280 [lustre]
      [202289.996621]  ll_clear_inode+0x1a0/0x7d8 [lustre]
      [202289.997465]  ll_delete_inode+0x70/0x260 [lustre]
      [202289.998294]  evict+0xdc/0x240
      [202289.998838]  iput_final+0x8c/0x1c0
      [202289.999459]  iput+0x10c/0x128
      [202290.000006]  dentry_unlink_inode+0xc8/0x150
      [202290.000758]  __dentry_kill+0xec/0x21c
      [202290.001424]  shrink_dentry_list+0xa8/0x138
      [202290.002155]  prune_dcache_sb+0x64/0x94
      [202290.002827]  super_cache_scan+0x128/0x1a4
      [202290.003541]  do_shrink_slab+0x194/0x394
      [202290.004225]  shrink_slab+0xbc/0x13c
      [202290.004853]  shrink_node_memcgs+0x1d4/0x230
      [202290.005598]  shrink_node+0x150/0x5e0
      [202290.006240]  shrink_zones+0x98/0x220
      [202290.006880]  do_try_to_free_pages+0xac/0x2e0
      [202290.007638]  try_to_free_pages+0x120/0x25c
      [202290.008370]  __alloc_pages_slowpath.constprop.0+0x420/0x8a0
      [202290.009352]  __alloc_pages_nodemask+0x2b4/0x308
      [202290.010157]  alloc_pages_current+0x8c/0x13c
      [202290.010901]  __vmalloc_area_node+0x104/0x280
      [202290.011662]  __vmalloc_node_range+0x80/0xe4
      [202290.012409]  alloc_thread_stack_node+0xc4/0x128
      [202290.013214]  dup_task_struct+0x54/0x29c
      [202290.013899]  copy_process+0x1d0/0x11b4
      [202290.014570]  kernel_clone+0x94/0x380
      [202290.015213]  kernel_thread+0x6c/0x94
      [202290.015855]  kthreadd+0x178/0x350
      [202290.016456] Code: d5384100 b9403401 36a800a1 f943f400 (f9400000) 
      [202290.017528] SMP: stopping secondary CPUs
      [202290.019499] Starting crashdump kernel...
      [202290.020203] Bye!

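      When the kernel runs out of memory, direct reclaim shrinks the dentry and inode caches. Evicting a Lustre inode that still has an open handle requires a close RPC (mdc_close), so the reclaiming thread prepares a ptlrpc request and waits for the reply. In the trace above the reclaiming thread is kthreadd, which entered reclaim while allocating a stack for a new thread.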

      In ptlrpc_set_wait(), if the reply has not been received yet, the thread calls wait_woken(). For a kernel thread, wait_woken() calls kthread_should_stop(), which reads current->set_child_tid, but for kthreadd set_child_tid is NULL.

      /* block until ready or timeout occurs */
      do {
          if (ptlrpc_check_set(NULL, set))
              break;
          if (allow) {
              siginitsetinv(&newset, allow);
              sigprocmask(SIG_BLOCK, &newset, &oldset);
          }
          remaining = wait_woken(&wait, state, remaining);
          if (allow) {
              if (signal_pending(current))
                  remaining = -EINTR;
              sigprocmask(SIG_SETMASK, &oldset, NULL);
          }
      } while (remaining > 0);
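The failing condition can be modelled in user space. The sketch below is not kernel code: struct model_task, struct model_kthread and the model_* helpers are hypothetical stand-ins, and it assumes (as the analysis above describes for this 5.10 kernel) that the per-kthread bookkeeping pointer is kept in set_child_tid. The flags value 0x208840 used in the usage note is the one printed in the log above; it has PF_KTHREAD (0x00200000) set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PF_KTHREAD 0x00200000u /* kernel-thread bit in task flags */

/* Hypothetical stand-in for the kernel's per-kthread bookkeeping. */
struct model_kthread {
	unsigned long flags;
};

/* Hypothetical stand-in for the two task_struct fields involved. */
struct model_task {
	unsigned int flags;
	void *set_child_tid; /* assumed to hold the struct kthread pointer */
};

/* Bit position is an assumption of this model, not taken from the report. */
#define MODEL_SHOULD_STOP_BIT 1

/*
 * Model of kthread_should_stop(): dereferences set_child_tid.
 * The assert stands in for the NULL-pointer oops seen in the trace.
 */
bool model_kthread_should_stop(const struct model_task *t)
{
	const struct model_kthread *k = t->set_child_tid;

	assert(k != NULL);
	return (k->flags >> MODEL_SHOULD_STOP_BIT) & 1UL;
}

/* Guard from the proposed patch: true when wait_woken() must be avoided. */
bool model_must_avoid_wait_woken(const struct model_task *t)
{
	return (t->flags & PF_KTHREAD) && t->set_child_tid == NULL;
}
```

With flags = 0x208840 and set_child_tid = NULL (the kthreadd case), model_must_avoid_wait_woken() returns true and calling model_kthread_should_stop() would trip the assert. With a non-NULL pointer, or with PF_KTHREAD clear, the guard returns false.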

      This issue is similar to LU-18826, where the same wait_woken() function is called from obd_get_mod_rpc_slot() in kthreadd context.

       

      Proposed Fix

      A possible fix is to use wait_event_idle_timeout() rather than wait_woken() in the unlikely case that the calling thread is kthreadd.

      Also, in this specific situation the request set contains only a single mdc_close request, so ptlrpc_check_set() has no need to call ptlrpc_send_new_req(). The change should therefore not trigger the issue fixed by LU-15808.

            /*
             * wait until all complete, interrupted, or an in-flight
             * req times out
             */
            CDEBUG(D_RPCTRACE, "set %p going to sleep for %lld seconds\n",
                   set, timeout);
        
        +    /*
        +    * kthreadd (PID 2) has set_child_tid == NULL.
        +    * wait_woken() calls kthread_should_stop(), which
        +    * dereferences set_child_tid and causes a NULL deref crash.
        +    *
        +    * Use wait_event_idle_timeout for kthreadd instead. This is
        +    * safe because kthreadd only reaches here via the memory
        +    * reclaim shrinker path with a single already-sent close
        +    * RPC, so ptlrpc_check_set just checks completion flags
        +    * and does not block.
        +    */
        +   if (unlikely((current->flags & PF_KTHREAD) &&
        +          !current->set_child_tid)) {
        +     rc = wait_event_idle_timeout(
        +       set->set_waitq,
        +       ptlrpc_check_set(NULL, set),
        +       remaining);
        +     if (rc == 0) {
        +       rc = -ETIMEDOUT;
        +       ptlrpc_expired_set(set);
        +     } else {
        +       rc = 0;
        +     }
        +     goto check_completion;
        +   }
        +
            add_wait_queue(&set->set_waitq, &wait);
            if ((timeout == 0 && !signal_pending(current)) ||
                set->set_allow_intr) {
              state = TASK_INTERRUPTIBLE;
              allow = LUSTRE_FATAL_SIGS;
            }
            /* block until ready or timeout occurs */
            do {
              if (ptlrpc_check_set(NULL, set))
                break;
              if (allow) {
                siginitsetinv(&newset, allow);
                sigprocmask(SIG_BLOCK, &newset, &oldset);
              }
              remaining = wait_woken(&wait, state, remaining);
              if (allow) {
                if (signal_pending(current))
                  remaining = -EINTR;
                sigprocmask(SIG_SETMASK, &oldset, NULL);
              }
            } while (remaining > 0);
            /*
             * wait_woken* returns the result from schedule_timeout() which
             * is always a positive number, or 0 on timeout.
             */
            if (remaining == 0) {
              rc = -ETIMEDOUT;
              ptlrpc_expired_set(set);
            } else if (remaining < 0) {
              rc = -EINTR;
              ptlrpc_interrupted_set(set);
            }
            remove_wait_queue(&set->set_waitq, &wait);  
      
      + check_completion: 
            /*
             * -EINTR => all requests have been flagged rq_intr so next
             * check completes.
             * -ETIMEDOUT => someone timed out.  When all reqs have
             * timed out, signals are enabled allowing completion with
             * EINTR.
             * I don't really care if we go once more round the loop in
             * the error cases -eeb.
             */


          People

            wc-triage WC Triage
            nhaowang Hao Wang