[LU-2251] MDS crash will running mdtest Created: 30/Oct/12  Updated: 19/Apr/13  Resolved: 27/Nov/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: James A Simmons Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None
Environment:

DDN 9950 sotrage backend with 4 OSS and a MDS and MGS server. Client are a Cray XE cage.


Attachments: File mds.log    
Severity: 3
Rank (Obsolete): 5350

 Description   

While testing for stability with mdtest the MDS experienced a Oops and I had to restart the file system which recovered successfully. Will attach the oops.



 Comments   
Comment by Alex Zhuravlev [ 30/Oct/12 ]

please try with http://review.whamcloud.com/#change,4390

Comment by James A Simmons [ 30/Oct/12 ]

Appears to have worked. Thanks. I will report if the problem shows again.

Comment by James A Simmons [ 31/Oct/12 ]

I was wrong. I get a very similar but slightly different oops.

Oct 31 08:38:31 barry-mds1 kernel: [ 1249.764785] LustreError: 9696:0:(osp_dev.c:337:osp_statfs()) ASSERTION( sfs->os_fprecreated <= 20000 ) failed
:
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.764789] LustreError: 9696:0:(osp_dev.c:337:osp_statfs()) LBUG
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.764791] Pid: 9696, comm: mdt01_497
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.764793] [<ffffffffa04fcf17>] lbug_with_loc+0x47/0xb0 [libcfs]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.764795]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.764796] Call Trace:
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.764811] [<ffffffffa095c7da>] osp_statfs+0x28a/0x2a0 [osp]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.764846] [<ffffffffa0777e8e>] lod_statfs_and_check+0x8e/0x4a0 [lod]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.764854] [<ffffffffa04fc905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.764856] [<ffffffffa077a430>] lod_alloc_rr.clone.2+0x500/0xd90 [lod]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.764868] [<ffffffffa04fcf17>] lbug_with_loc+0x47/0xb0 [libcfs]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.764870] [<ffffffffa077c04c>] lod_qos_prep_create+0xe5c/0x1848 [lod]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.764886] [<ffffffffa095c7da>] osp_statfs+0x28a/0x2a0 [osp]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.764922] [<ffffffffa0777e8e>] lod_statfs_and_check+0x8e/0x4a0 [lod]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.764924] [<ffffffffa0e6c692>] ? osd_declare_inode_qid+0x1a2/0x270 [osd_ldiskfs]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.764933] [<ffffffffa077a430>] lod_alloc_rr.clone.2+0x500/0xd90 [lod]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.764935] [<ffffffffa0776c7b>] lod_declare_striped_object+0x14b/0x920 [lod]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.764943] [<ffffffffa077c04c>] lod_qos_prep_create+0xe5c/0x1848 [lod]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.764945] [<ffffffffa0777758>] lod_declare_object_create+0x308/0x4f0 [lod]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.764974] [<ffffffffa0d3eccf>] mdd_declare_object_create_internal+0xaf/0x1d0 [mdd]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.764991] [<ffffffffa0e6c692>] ? osd_declare_inode_qid+0x1a2/0x270 [osd_ldiskfs]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.764994] [<ffffffffa0d4fc9a>] mdd_create+0x39a/0x1550 [mdd]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765002] [<ffffffffa0776c7b>] lod_declare_striped_object+0x14b/0x920 [lod]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765012] [<ffffffffa0777758>] lod_declare_object_create+0x308/0x4f0 [lod]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765034] [<ffffffffa0d3eccf>] mdd_declare_object_create_internal+0xaf/0x1d0 [mdd]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765048] [<ffffffffa0d4fc9a>] mdd_create+0x39a/0x1550 [mdd]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765056] [<ffffffffa0efc519>] mdt_reint_open+0x1079/0x1860 [mdt]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765068] [<ffffffffa0d55bee>] ? md_ucred+0x1e/0x60 [mdd]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765083] [<ffffffffa0ec7765>] ? mdt_ucred+0x15/0x20 [mdt]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765085] [<ffffffffa0efc519>] mdt_reint_open+0x1079/0x1860 [mdt]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765095] [<ffffffffa0d55bee>] ? md_ucred+0x1e/0x60 [mdd]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765097] [<ffffffffa0ee73e1>] mdt_reint_rec+0x41/0xe0 [mdt]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765108] [<ffffffffa0ec7765>] ? mdt_ucred+0x15/0x20 [mdt]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765111] [<ffffffffa0ee0b73>] mdt_reint_internal+0x4f3/0x7f0 [mdt]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765122] [<ffffffffa0ee73e1>] mdt_reint_rec+0x41/0xe0 [mdt]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765125] [<ffffffffa0ee113d>] mdt_intent_reint+0x1ed/0x500 [mdt]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765136] [<ffffffffa0ee0b73>] mdt_reint_internal+0x4f3/0x7f0 [mdt]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765138] [<ffffffffa0edcb0e>] mdt_intent_policy+0x38e/0x770 [mdt]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765279] [<ffffffffa07c3b21>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765281] [<ffffffffa07c3b21>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765330] [<ffffffffa07eb767>] ldlm_handle_enqueue0+0x4e7/0x1000 [ptlrpc]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765343] [<ffffffffa0edd046>] mdt_enqueue+0x46/0x130 [mdt]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765356] [<ffffffffa0ed0322>] mdt_handle_common+0x932/0x1740 [mdt]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765364] [<ffffffffa07eb767>] ldlm_handle_enqueue0+0x4e7/0x1000 [ptlrpc]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765370] [<ffffffffa0ed1205>] mdt_regular_handle+0x15/0x20 [mdt]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765378] [<ffffffffa0edd046>] mdt_enqueue+0x46/0x130 [mdt]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765392] [<ffffffffa0ed0322>] mdt_handle_common+0x932/0x1740 [mdt]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765405] [<ffffffffa0ed1205>] mdt_regular_handle+0x15/0x20 [mdt]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765412] [<ffffffffa081b5dc>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765424] [<ffffffffa04fd65e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765440] [<ffffffffa050f17f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765454] [<ffffffffa081b5dc>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765466] [<ffffffffa04fd65e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765479] [<ffffffffa08129a9>] ? ptlrpc_wait_event+0xa9/0x2a0 [ptlrpc]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765483] [<ffffffffa050f17f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765486] [<ffffffff810602b0>] ? default_wake_function+0x0/0x20
Oct 31 08:38:31 barry-mds1 kernel: [ 1249.765522] [<ffffffffa08129a9>] ? ptlrpc_wait_event+0xa9/0x2a0 [ptlrpc]

Comment by Alex Zhuravlev [ 07/Nov/12 ]

James, were there timeouts/disconnects on MDS just before the assertion ?

Comment by Alex Zhuravlev [ 08/Nov/12 ]

hmm, clearly the assertion is invalid: we initiate pre-creation before existing pool is exhausted, so the final pool can easily be OST_MAX_PRECERATE + number of objects from the previous pre-creation.

Comment by Alex Zhuravlev [ 08/Nov/12 ]

http://review.whamcloud.com/#change,4495

Comment by James A Simmons [ 08/Nov/12 ]

No timeouts or disconnects happened before the assertion. I found you can only reproduce this problem when you have a newly formatted file system and the first thing you do is run mdtest on it. The exact command used was mdtest -n 1000 -F -i 5 -d /lustre/$MOUNT/scratch/$USER/md_test over 3 nodes. If I ran IOR first then the problem didn't happen. Will apply your patch.

Comment by Christopher Morrone [ 12/Nov/12 ]

We hit the same assertion on a test filesystem running 2.3.54-2chaos.

LustreError: 6847:0:(osp_dev.c:337:osp_statfs()) ASSERTION( sfs->os_fprecreated <= 20000 ) failed: 
LustreError: 6847:0:(osp_dev.c:337:osp_statfs()) LBUG
PID: 6847   TASK: ffff88061df67500  CPU: 9   COMMAND: "mdt03_008"             
 #0 [ffff8805f8c7b518] machine_kexec at ffffffff8103216b                         
 #1 [ffff8805f8c7b578] crash_kexec at ffffffff810b8d12                           
 #2 [ffff8805f8c7b648] panic at ffffffff814eea99                                 
 #3 [ffff8805f8c7b6c8] lbug_with_loc at ffffffffa05c1fcb [libcfs]                
 #4 [ffff8805f8c7b6e8] osp_statfs at ffffffffa0eb97da [osp]                      
 #5 [ffff8805f8c7b728] lod_statfs_and_check at ffffffffa0d6de8e [lod]            
 #6 [ffff8805f8c7b758] lod_alloc_rr.clone.2 at ffffffffa0d70430 [lod]            
 #7 [ffff8805f8c7b808] lod_qos_prep_create at ffffffffa0d7204c [lod]             
 #8 [ffff8805f8c7b8a8] lod_declare_striped_object at ffffffffa0d6cc7b [lod]      
 #9 [ffff8805f8c7b908] lod_declare_object_create at ffffffffa0d6d758 [lod]       
#10 [ffff8805f8c7b958] mdd_declare_object_create_internal at ffffffffa0dafccf [mdd]
#11 [ffff8805f8c7b9a8] mdd_create at ffffffffa0dc0c9a [mdd]                      
#12 [ffff8805f8c7ba78] mdt_reint_open at ffffffffa0e44529 [mdt]                  
#13 [ffff8805f8c7bb58] mdt_reint_rec at ffffffffa0e2f401 [mdt]                   
#14 [ffff8805f8c7bb78] mdt_reint_internal at ffffffffa0e28b93 [mdt]              
#15 [ffff8805f8c7bbb8] mdt_intent_reint at ffffffffa0e2915d [mdt]                
#16 [ffff8805f8c7bc08] mdt_intent_policy at ffffffffa0e24b2e [mdt]               
#17 [ffff8805f8c7bc48] ldlm_lock_enqueue at ffffffffa0897b21 [ptlrpc]            
#18 [ffff8805f8c7bca8] ldlm_handle_enqueue0 at ffffffffa08bf967 [ptlrpc]         
#19 [ffff8805f8c7bd18] mdt_enqueue at ffffffffa0e25066 [mdt]                     
#20 [ffff8805f8c7bd38] mdt_handle_common at ffffffffa0e18322 [mdt]               
#21 [ffff8805f8c7bd88] mdt_regular_handle at ffffffffa0e19225 [mdt]              
#22 [ffff8805f8c7bd98] ptlrpc_server_handle_request at ffffffffa08ef8ac [ptlrpc]
#23 [ffff8805f8c7be98] ptlrpc_main at ffffffffa08f0e9c [ptlrpc]                  
#24 [ffff8805f8c7bf48] kernel_thread at ffffffff8100c14a   

I will drop the assertion from our tree with change 4495.

Comment by James A Simmons [ 21/Nov/12 ]

Chris their was discuss on how to properly fix this bug so the patch has been updated. I tested but could you test this new version at http://review.whamcloud.com/#change,4495. Thanks

Comment by Christopher Morrone [ 21/Nov/12 ]

I have updated to the revision 2 of 4495.

Comment by Alex Zhuravlev [ 27/Nov/12 ]

landed on master

Generated at Sat Feb 10 01:23:38 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.