[LU-2485] NULL pointer dereference in lustre_swab_lov_user_md_common Created: 12/Dec/12  Updated: 22/Dec/12  Resolved: 22/Dec/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.1.4
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Prakash Surya (Inactive) Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: LB, sequoia

Attachments: File test1.c    
Severity: 3
Rank (Obsolete): 5830

 Description   

We repeatedly hit this problem on our Grove-Production MDS today:

BUG: unable to handle kernel NULL pointer dereference at 000000000000001c       
IP: [<ffffffffa08bcdb7>] lustre_swab_lov_user_md_common+0x27/0x4e0 [ptlrpc]
crash> bt                                                                       
PID: 738    TASK: ffff881778c9caa0  CPU: 14  COMMAND: "mdt00_006"               
 #0 [ffff88175b907370] machine_kexec at ffffffff8103216b                        
 #1 [ffff88175b9073d0] crash_kexec at ffffffff810b8d12                          
 #2 [ffff88175b9074a0] oops_end at ffffffff814f2c00                             
 #3 [ffff88175b9074d0] no_context at ffffffff810423fb                           
 #4 [ffff88175b907520] __bad_area_nosemaphore at ffffffff81042685               
 #5 [ffff88175b907570] bad_area_nosemaphore at ffffffff81042753                 
 #6 [ffff88175b907580] __do_page_fault at ffffffff81042e0d                      
 #7 [ffff88175b9076a0] do_page_fault at ffffffff814f4bde                        
 #8 [ffff88175b9076d0] page_fault at ffffffff814f1f95                           
    [exception RIP: lustre_swab_lov_user_md_common+39]                          
    RIP: ffffffffa08bcdb7  RSP: ffff88175b907780  RFLAGS: 00010246              
    RAX: 0000000000000001  RBX: 0000000000000000  RCX: 0000000000000000         
    RDX: ffffffffa090961a  RSI: 0000000000000000  RDI: 0000000000000000         
    RBP: ffff88175b907790   R8: ffff88175b937000   R9: ffff88175b8910d0         
    R10: 0000000000000001  R11: 00000000fffffff3  R12: ffff8817ec176000         
    R13: ffff88175c222468  R14: ffffc9013311e208  R15: ffff8817ec176000         
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018                              
 #9 [ffff88175b907798] lustre_swab_lov_user_md_v3 at ffffffffa08bd2ad [ptlrpc]  
#10 [ffff88175b9077b8] lod_qos_prep_create at ffffffffa0b6bf77 [lod]            
#11 [ffff88175b907858] lod_declare_striped_object at ffffffffa0b66c7b [lod]     
#12 [ffff88175b9078b8] lod_declare_xattr_set at ffffffffa0b67b9d [lod]          
#13 [ffff88175b907918] mdd_create_data at ffffffffa0bf4c00 [mdd]                
#14 [ffff88175b907978] mdt_finish_open at ffffffffa0c794f8 [mdt]                
#15 [ffff88175b907a08] mdt_open_by_fid_lock at ffffffffa0c7a5a7 [mdt]           
#16 [ffff88175b907a78] mdt_reint_open at ffffffffa0c7ac5f [mdt]                 
#17 [ffff88175b907b58] mdt_reint_rec at ffffffffa0c66a21 [mdt]                  
#18 [ffff88175b907b78] mdt_reint_internal at ffffffffa0c601b3 [mdt]             
#19 [ffff88175b907bb8] mdt_intent_reint at ffffffffa0c6077d [mdt]               
#20 [ffff88175b907c08] mdt_intent_policy at ffffffffa0c5c38e [mdt]              
#21 [ffff88175b907c48] ldlm_lock_enqueue at ffffffffa0872b91 [ptlrpc]           
#22 [ffff88175b907ca8] ldlm_handle_enqueue0 at ffffffffa089a837 [ptlrpc]        
#23 [ffff88175b907d18] mdt_enqueue at ffffffffa0c5bf16 [mdt]                    
#24 [ffff88175b907d38] mdt_handle_common at ffffffffa0c4fdd2 [mdt]              
#25 [ffff88175b907d88] mdt_regular_handle at ffffffffa0c50cd5 [mdt]             
#26 [ffff88175b907d98] ptlrpc_server_handle_request at ffffffffa08ca8fc [ptlrpc]
#27 [ffff88175b907e98] ptlrpc_main at ffffffffa08cbeec [ptlrpc]                 
#28 [ffff88175b907f48] kernel_thread at ffffffff8100c14a 

Recovery was manually aborted, which cleared up the issue:

lctl --device 5 abort_recovery

Prior to that manual intervention, the node had crashed repeatedly after each recovery attempt for about 12 hours.
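
The faulting address is consistent with reading a field near the head of struct lov_user_md through a NULL pointer: with the v1 field order, lmm_stripe_count sits at offset 0x1c. In other words, the swab routine appears to have been handed a layout buffer that was never validated. Below is a minimal C sketch of this failure class and the kind of guard a fix would need; the struct is abbreviated and the lum_sketch_* names are hypothetical stand-ins, not the actual Lustre code.

#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Abbreviated on-wire layout, following the struct lov_user_md_v1
 * field order: lmm_stripe_count lands at offset 0x1c, matching the
 * faulting address above (NULL base + 0x1c). */
struct lum_sketch {
    uint32_t lmm_magic;         /* 0x00 */
    uint32_t lmm_pattern;       /* 0x04 */
    uint64_t lmm_object_id;     /* 0x08 */
    uint64_t lmm_object_seq;    /* 0x10 */
    uint32_t lmm_stripe_size;   /* 0x18 */
    uint16_t lmm_stripe_count;  /* 0x1c  <-- faulting offset */
    uint16_t lmm_stripe_offset; /* 0x1e */
};

/* Hypothetical swab helper: byte-swaps every field in place. */
static void lum_sketch_swab(struct lum_sketch *lum)
{
    lum->lmm_magic         = __builtin_bswap32(lum->lmm_magic);
    lum->lmm_pattern       = __builtin_bswap32(lum->lmm_pattern);
    lum->lmm_object_id     = __builtin_bswap64(lum->lmm_object_id);
    lum->lmm_object_seq    = __builtin_bswap64(lum->lmm_object_seq);
    lum->lmm_stripe_size   = __builtin_bswap32(lum->lmm_stripe_size);
    lum->lmm_stripe_count  = __builtin_bswap16(lum->lmm_stripe_count);
    lum->lmm_stripe_offset = __builtin_bswap16(lum->lmm_stripe_offset);
}

/* The class of check the caller needs: a client-supplied layout may
 * be absent or short, so validate before dereferencing anything. */
static int lum_sketch_prep(struct lum_sketch *lum, size_t buf_len)
{
    if (lum == NULL || buf_len < sizeof(*lum))
        return -EINVAL;   /* reject instead of crashing in the swab */
    lum_sketch_swab(lum);
    return 0;
}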



 Comments   
Comment by Christopher Morrone [ 12/Dec/12 ]

I probably caused this with the attached test1.c file. It was an early pass at testing setting striping through xattrs. The test program isn't correct in places, but I am attaching it anyway, since it is likely what triggered the MDS crash.

The client was a ppc64 node with 64 KB pages; the MDS is a standard x86_64 node.
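
For readers without the attachment, here is a guess at the shape of such a test: a minimal sketch that creates an empty file and sets its striping through the user-visible lustre.lov xattr. struct lov_user_md_v1 and LOV_USER_MAGIC_V1 come from <lustre/lustre_user.h>; the striping parameters are illustrative only, and this is not the actual test1.c.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>
#include <unistd.h>
#include <lustre/lustre_user.h> /* struct lov_user_md_v1, LOV_USER_MAGIC_V1 */

int main(int argc, char **argv)
{
    struct lov_user_md_v1 lum;
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <new file on a lustre mount>\n", argv[0]);
        return 1;
    }

    /* Striping can only be set while the file has no objects yet, so
     * create it empty and set the layout before writing any data. */
    fd = open(argv[1], O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    memset(&lum, 0, sizeof(lum));
    lum.lmm_magic         = LOV_USER_MAGIC_V1;
    lum.lmm_pattern       = 0;        /* default RAID0 pattern */
    lum.lmm_stripe_size   = 1 << 20;  /* 1 MiB stripes */
    lum.lmm_stripe_count  = 2;        /* stripe across two OSTs */
    lum.lmm_stripe_offset = -1;       /* wraps to 0xffff: MDS picks the OSTs */

    if (fsetxattr(fd, "lustre.lov", &lum, sizeof(lum), 0) < 0)
        perror("fsetxattr(lustre.lov)");

    close(fd);
    return 0;
}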

Comment by Alex Zhuravlev [ 12/Dec/12 ]

Please try with http://review.whamcloud.com/4814

Comment by Andreas Dilger [ 12/Dec/12 ]

lol, I was just about to ask whether this was caused by Chris's testing, having only just read his previous comment.

Fortunately, this only appears to be hit with big-endian clients, so getting a fix into 2.1.4 and 2.4.0, and deployed at LLNL, should cover most of the users. Otherwise, it would have meant that we couldn't safely use the fsetxattr() code on 2.4 at all, since it would crash 2.1.x MDSes.
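
The endianness dependence is worth spelling out: the layout arrives in the client's native byte order, and the MDS only takes the swab path when the magic reads back byte-swapped, so a little-endian client never exercises this code on an x86_64 MDS. A simplified sketch of that detection follows; the magic value is from lustre_user.h, and lum_needs_swab is a hypothetical stand-in for the checks around the lustre_swab_lov_user_md* helpers.

#include <stdint.h>

#define LOV_USER_MAGIC_V1 0x0BD10BD0

/* Decide whether a client-supplied layout needs byte-swapping. On an
 * x86_64 MDS, only a big-endian (e.g. ppc64) client produces the
 * swabbed magic, which is why only such clients reach the crashing
 * path. */
static int lum_needs_swab(const void *buf)
{
    uint32_t magic = *(const uint32_t *)buf;

    if (magic == LOV_USER_MAGIC_V1)
        return 0;                                   /* already host order */
    if (magic == __builtin_bswap32(LOV_USER_MAGIC_V1))
        return 1;                                   /* big-endian sender */
    return -1;                                      /* unrecognized layout */
}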

Comment by Prakash Surya (Inactive) [ 12/Dec/12 ]

Alex, I've pulled that in.

Comment by Peter Jones [ 22/Dec/12 ]

Landed for 2.4
