Lustre / LU-2485

NULL pointer dereference in lustre_swab_lov_user_md_common

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.4.0
    • Lustre 2.4.0, Lustre 2.1.4
    • 3
    • 5830

    Description

      We repeatedly hit this problem on our Grove-Production MDS today:

      BUG: unable to handle kernel NULL pointer dereference at 000000000000001c       
      IP: [<ffffffffa08bcdb7>] lustre_swab_lov_user_md_common+0x27/0x4e0 [ptlrpc]
      
      crash> bt                                                                       
      PID: 738    TASK: ffff881778c9caa0  CPU: 14  COMMAND: "mdt00_006"               
       #0 [ffff88175b907370] machine_kexec at ffffffff8103216b                        
       #1 [ffff88175b9073d0] crash_kexec at ffffffff810b8d12                          
       #2 [ffff88175b9074a0] oops_end at ffffffff814f2c00                             
       #3 [ffff88175b9074d0] no_context at ffffffff810423fb                           
       #4 [ffff88175b907520] __bad_area_nosemaphore at ffffffff81042685               
       #5 [ffff88175b907570] bad_area_nosemaphore at ffffffff81042753                 
       #6 [ffff88175b907580] __do_page_fault at ffffffff81042e0d                      
       #7 [ffff88175b9076a0] do_page_fault at ffffffff814f4bde                        
       #8 [ffff88175b9076d0] page_fault at ffffffff814f1f95                           
          [exception RIP: lustre_swab_lov_user_md_common+39]                          
          RIP: ffffffffa08bcdb7  RSP: ffff88175b907780  RFLAGS: 00010246              
          RAX: 0000000000000001  RBX: 0000000000000000  RCX: 0000000000000000         
          RDX: ffffffffa090961a  RSI: 0000000000000000  RDI: 0000000000000000         
          RBP: ffff88175b907790   R8: ffff88175b937000   R9: ffff88175b8910d0         
          R10: 0000000000000001  R11: 00000000fffffff3  R12: ffff8817ec176000         
          R13: ffff88175c222468  R14: ffffc9013311e208  R15: ffff8817ec176000         
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018                              
       #9 [ffff88175b907798] lustre_swab_lov_user_md_v3 at ffffffffa08bd2ad [ptlrpc]  
      #10 [ffff88175b9077b8] lod_qos_prep_create at ffffffffa0b6bf77 [lod]            
      #11 [ffff88175b907858] lod_declare_striped_object at ffffffffa0b66c7b [lod]     
      #12 [ffff88175b9078b8] lod_declare_xattr_set at ffffffffa0b67b9d [lod]          
      #13 [ffff88175b907918] mdd_create_data at ffffffffa0bf4c00 [mdd]                
      #14 [ffff88175b907978] mdt_finish_open at ffffffffa0c794f8 [mdt]                
      #15 [ffff88175b907a08] mdt_open_by_fid_lock at ffffffffa0c7a5a7 [mdt]           
      #16 [ffff88175b907a78] mdt_reint_open at ffffffffa0c7ac5f [mdt]                 
      #17 [ffff88175b907b58] mdt_reint_rec at ffffffffa0c66a21 [mdt]                  
      #18 [ffff88175b907b78] mdt_reint_internal at ffffffffa0c601b3 [mdt]             
      #19 [ffff88175b907bb8] mdt_intent_reint at ffffffffa0c6077d [mdt]               
      #20 [ffff88175b907c08] mdt_intent_policy at ffffffffa0c5c38e [mdt]              
      #21 [ffff88175b907c48] ldlm_lock_enqueue at ffffffffa0872b91 [ptlrpc]           
      #22 [ffff88175b907ca8] ldlm_handle_enqueue0 at ffffffffa089a837 [ptlrpc]        
      #23 [ffff88175b907d18] mdt_enqueue at ffffffffa0c5bf16 [mdt]                    
      #24 [ffff88175b907d38] mdt_handle_common at ffffffffa0c4fdd2 [mdt]              
      #25 [ffff88175b907d88] mdt_regular_handle at ffffffffa0c50cd5 [mdt]             
      #26 [ffff88175b907d98] ptlrpc_server_handle_request at ffffffffa08ca8fc [ptlrpc]
      #27 [ffff88175b907e98] ptlrpc_main at ffffffffa08cbeec [ptlrpc]                 
      #28 [ffff88175b907f48] kernel_thread at ffffffff8100c14a 
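
      The faulting address in the oops (0x1c) is small enough to be read as a member offset applied to a NULL base pointer, which is consistent with RDI and RBX both being zero in the register dump. The sketch below only illustrates that arithmetic. The struct `demo_lov_user_md` is a hypothetical stand-in modeled on a v1 striping header, not copied from the crashed build, and the helpers `demo_fault_address` and `demo_symbol_base` are invented for this example. Under that assumed layout, offset 0x1c falls on the stripe-count field, but the actual field accessed would need to be confirmed against the disassembly of the running module.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * ASSUMPTION: illustrative layout modeled on a v1 striping header.
 * Field names and sizes are not taken from the crashed build.
 */
struct demo_lov_user_md {
	uint32_t lmm_magic;         /* offset 0x00 */
	uint32_t lmm_pattern;       /* offset 0x04 */
	uint64_t lmm_object_id;     /* offset 0x08 */
	uint64_t lmm_object_seq;    /* offset 0x10 */
	uint32_t lmm_stripe_size;   /* offset 0x18 */
	uint16_t lmm_stripe_count;  /* offset 0x1c */
	uint16_t lmm_stripe_offset; /* offset 0x1e */
} __attribute__((packed));

/* Compile-time check that the assumed layout puts a member at 0x1c. */
static_assert(offsetof(struct demo_lov_user_md, lmm_stripe_count) == 0x1c,
	      "assumed layout places lmm_stripe_count at 0x1c");

/* Address the CPU touches when a member at `offset` is read through `base`. */
static inline uintptr_t demo_fault_address(uintptr_t base, size_t offset)
{
	return base + offset;
}

/* Start address of a function, given the reported RIP and its "+offset". */
static inline uint64_t demo_symbol_base(uint64_t rip, uint64_t offset)
{
	return rip - offset;
}
```

      With a NULL base, `demo_fault_address(0, offsetof(struct demo_lov_user_md, lmm_stripe_count))` evaluates to 0x1c, matching the address in the BUG line. Likewise, `demo_symbol_base(0xffffffffa08bcdb7, 39)` evaluates to 0xffffffffa08bcd90, which ties the `+39` in the exception RIP line to the `+0x27` in the IP line (39 decimal is 0x27).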
      

      Recovery was manually aborted, which cleared up the issue:

      lctl --device 5 abort_recovery
      

      Before the manual intervention, the node had been crashing repeatedly after recovery for about 12 hours.

      Attachments

        1. test1.c
          3 kB
          Christopher Morrone

          People

            Assignee: bzzz Alex Zhuravlev
            Reporter: prakash Prakash Surya (Inactive)