
LU-6973: Null pointer access during MDT server umount if orph_cleanup_sc has not finished

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.3
    • Component/s: None
    • Environment: MDT Server crash
    • Severity: 3

    Description

      Null pointer access during MDT server umount if the orph_cleanup_sc thread has not finished.

      crash> bt
      PID: 31563  TASK: ffff880879176ab0  CPU: 1   COMMAND: "orph_cleanup_sc"
       #0 [ffff880547d9b810] machine_kexec at ffffffff8103b71b
       #1 [ffff880547d9b870] crash_kexec at ffffffff810c9942
       #2 [ffff880547d9b940] oops_end at ffffffff8152f070
       #3 [ffff880547d9b970] no_context at ffffffff8104c80b
       #4 [ffff880547d9b9c0] __bad_area_nosemaphore at ffffffff8104ca95
       #5 [ffff880547d9ba10] bad_area_nosemaphore at ffffffff8104cb63
       #6 [ffff880547d9ba20] __do_page_fault at ffffffff8104d25c
       #7 [ffff880547d9bb40] do_page_fault at ffffffff81530fbe
       #8 [ffff880547d9bb70] page_fault at ffffffff8152e375
          [exception RIP: fld_server_lookup+97]
          RIP: ffffffffa0a50b31  RSP: ffff880547d9bc20  RFLAGS: 00010286
          RAX: ffff8810515df4c0  RBX: 00000002122fc000  RCX: ffff8810426c5078
          RDX: ffff880e6896b400  RSI: ffffffffa0a56b00  RDI: ffff881040aff840
          RBP: ffff880547d9bc70   R8: 0000000015fbb1dc   R9: 0000000000000000
          R10: 092c8d41cd51a9c5  R11: 0000000000000041  R12: 0000000000000000
          R13: ffff8810515df4c0  R14: ffff8810426c5078  R15: ffff8810426c4000
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #9 [ffff880547d9bc78] osd_fld_lookup at ffffffffa0d53b8a [osd_ldiskfs]
      #10 [ffff880547d9bca8] osd_remote_fid at ffffffffa0d54320 [osd_ldiskfs]
      #11 [ffff880547d9bcf8] osd_it_ea_rec at ffffffffa0d6539e [osd_ldiskfs]
      #12 [ffff880547d9be38] lod_it_rec at ffffffffa0eec331 [lod]
      #13 [ffff880547d9be48] __mdd_orphan_cleanup at ffffffffa0f55050 [mdd]
      #14 [ffff880547d9bee8] kthread at ffffffff8109e71e
      #15 [ffff880547d9bf48] kernel_thread at ffffffff8100c20a
      crash>
      

      This crash occurs because ss_server_fld is NULL:

      crash> p *(*((struct osd_device *)0xffff8801f827a000).od_dt_dev.dd_lu_dev.ld_site).ld_seq_site
      $9 = {
        ss_lu = 0xffff8801f827a150,
        ss_node_id = 0,
        ss_server_fld = 0x0,     <---------- HERE
        ss_client_fld = 0x0,
        ss_server_seq = 0x0,
        ss_control_seq = 0x0,
        ss_control_exp = 0x0,
        ss_client_seq = 0x0
      }
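
      To make the failure concrete, here is a minimal stand-alone C model of the state in the dump above (the struct and function names only mimic the Lustre ones; this is not the real source): the lookup path dereferences the FLD handle immediately, so a NULL ss_server_fld faults just as at fld_server_lookup+97 in the backtrace.

      #include <stdio.h>

      /* Model only, not the real Lustre types: the seq site's server-side
       * FLD handle has already been freed/cleared by umount, so the
       * lookup path is handed a NULL pointer. */
      struct lu_server_fld_model { const char *lsf_name; };
      struct seq_site_model      { struct lu_server_fld_model *ss_server_fld; };

      static int fld_server_lookup_model(struct lu_server_fld_model *fld)
      {
              /* the real fld_server_lookup() dereferences 'fld' right
               * away; with ss_server_fld == NULL this is the faulting
               * access */
              printf("looking up via FLD '%s'\n", fld->lsf_name); /* NULL deref */
              return 0;
      }

      int main(void)
      {
              struct seq_site_model ss = { .ss_server_fld = NULL }; /* as crashed */
              return fld_server_lookup_model(ss.ss_server_fld);     /* SIGSEGV */
      }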
      
      2228 int osd_fld_lookup(const struct lu_env *env, struct osd_device *osd,
      2229                    obd_seq seq, struct lu_seq_range *range)
      2230 {
      ....
      2248
      2249         LASSERT(ss != NULL);
      2250         fld_range_set_any(range);
      2251         rc = fld_server_lookup(env, ss->ss_server_fld, seq, range);    <--- under some conditions ss->ss_server_fld can be NULL
      2252         if (rc != 0) {
      2253                 CERROR("%s: cannot find FLD range for "LPX64": rc = %d\n",
      2254                        osd_name(osd), seq, rc);
      "lustre/osd-ldiskfs/osd_handler.c" 6005 lines --37%--                                                                      2258,0-1      37%
      

      Is it possible to stop the orph_cleanup_sc process at the beginning of the MDT umount sequence to prevent this issue?
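
      For illustration, the generic kernel pattern such a change would follow (every identifier below except the kthread_*/msleep/IS_ERR APIs is hypothetical, not the actual mdd code): run the cleanup as a stoppable kthread, have umount stop it before the seq site is torn down, and have the thread re-check the stop flag between entries.

      #include <linux/kthread.h>
      #include <linux/delay.h>
      #include <linux/err.h>

      /* Hypothetical sketch of "stop orph_cleanup_sc at the beginning of
       * umount": the mdd_* and orph_* names are made up for illustration. */
      static struct task_struct *orph_task;

      static int orph_cleanup_main(void *data)
      {
              while (!kthread_should_stop()) {
                      /* handle one orphan entry, then re-check the stop
                       * flag so umount can never race us into a torn-down
                       * seq site (ss_server_fld == NULL) */
                      msleep(100);    /* placeholder for one cleanup step */
              }
              return 0;
      }

      static int mdd_start_orph_cleanup(void)
      {
              orph_task = kthread_run(orph_cleanup_main, NULL,
                                      "orph_cleanup_sc");
              return IS_ERR(orph_task) ? PTR_ERR(orph_task) : 0;
      }

      static void mdd_umount_prepare(void)
      {
              /* first step of MDT umount: blocks until the thread sees
               * the stop flag and exits, before the FLD is freed */
              if (!IS_ERR_OR_NULL(orph_task))
                      kthread_stop(orph_task);
      }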

          Activity

            pjones Peter Jones added a comment -

            Antoine

            The LU-5249 fix has landed for a 2.5.x maintenance release, so you can either rebase onto a more current release or apply the patch to your current baseline.

            Peter


            apercher Antoine Percher added a comment -

            Hello Bruno,
            Thanks for your answer; the patch looks good. Is it possible to have this patch landed ASAP?
            I will also ask to have this fix included in the next Lustre T100 release.
            You can tag this issue as a dup of LU-5249.
            bfaccini Bruno Faccini (Inactive) added a comment - - edited

            Hello Antoine,
            My first reading of your bug report makes me think that this could be a dup of LU-5249.
            Thus, even though the associated b2_5 patch (http://review.whamcloud.com/#/c/13579) has not landed yet, you may want to give it a try.


            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: apercher Antoine Percher
              Votes: 0
              Watchers: 3
