[LU-2251] MDS crash will running mdtest Created: 30/Oct/12 Updated: 19/Apr/13 Resolved: 27/Nov/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | James A Simmons | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
DDN 9950 sotrage backend with 4 OSS and a MDS and MGS server. Client are a Cray XE cage. |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 5350 |
| Description |
|
While testing for stability with mdtest the MDS experienced a Oops and I had to restart the file system which recovered successfully. Will attach the oops. |
| Comments |
| Comment by Alex Zhuravlev [ 30/Oct/12 ] |
|
please try with http://review.whamcloud.com/#change,4390 |
| Comment by James A Simmons [ 30/Oct/12 ] |
|
Appears to have worked. Thanks. I will report if the problem shows again. |
| Comment by James A Simmons [ 31/Oct/12 ] |
|
I was wrong. I get a very similar but slightly different oops. Oct 31 08:38:31 barry-mds1 kernel: [ 1249.764785] LustreError: 9696:0:(osp_dev.c:337:osp_statfs()) ASSERTION( sfs->os_fprecreated <= 20000 ) failed |
| Comment by Alex Zhuravlev [ 07/Nov/12 ] |
|
James, were there timeouts/disconnects on MDS just before the assertion ? |
| Comment by Alex Zhuravlev [ 08/Nov/12 ] |
|
hmm, clearly the assertion is invalid: we initiate pre-creation before existing pool is exhausted, so the final pool can easily be OST_MAX_PRECERATE + number of objects from the previous pre-creation. |
| Comment by Alex Zhuravlev [ 08/Nov/12 ] |
| Comment by James A Simmons [ 08/Nov/12 ] |
|
No timeouts or disconnects happened before the assertion. I found you can only reproduce this problem when you have a newly formatted file system and the first thing you do is run mdtest on it. The exact command used was mdtest -n 1000 -F -i 5 -d /lustre/$MOUNT/scratch/$USER/md_test over 3 nodes. If I ran IOR first then the problem didn't happen. Will apply your patch. |
| Comment by Christopher Morrone [ 12/Nov/12 ] |
|
We hit the same assertion on a test filesystem running 2.3.54-2chaos. LustreError: 6847:0:(osp_dev.c:337:osp_statfs()) ASSERTION( sfs->os_fprecreated <= 20000 ) failed: LustreError: 6847:0:(osp_dev.c:337:osp_statfs()) LBUG PID: 6847 TASK: ffff88061df67500 CPU: 9 COMMAND: "mdt03_008" #0 [ffff8805f8c7b518] machine_kexec at ffffffff8103216b #1 [ffff8805f8c7b578] crash_kexec at ffffffff810b8d12 #2 [ffff8805f8c7b648] panic at ffffffff814eea99 #3 [ffff8805f8c7b6c8] lbug_with_loc at ffffffffa05c1fcb [libcfs] #4 [ffff8805f8c7b6e8] osp_statfs at ffffffffa0eb97da [osp] #5 [ffff8805f8c7b728] lod_statfs_and_check at ffffffffa0d6de8e [lod] #6 [ffff8805f8c7b758] lod_alloc_rr.clone.2 at ffffffffa0d70430 [lod] #7 [ffff8805f8c7b808] lod_qos_prep_create at ffffffffa0d7204c [lod] #8 [ffff8805f8c7b8a8] lod_declare_striped_object at ffffffffa0d6cc7b [lod] #9 [ffff8805f8c7b908] lod_declare_object_create at ffffffffa0d6d758 [lod] #10 [ffff8805f8c7b958] mdd_declare_object_create_internal at ffffffffa0dafccf [mdd] #11 [ffff8805f8c7b9a8] mdd_create at ffffffffa0dc0c9a [mdd] #12 [ffff8805f8c7ba78] mdt_reint_open at ffffffffa0e44529 [mdt] #13 [ffff8805f8c7bb58] mdt_reint_rec at ffffffffa0e2f401 [mdt] #14 [ffff8805f8c7bb78] mdt_reint_internal at ffffffffa0e28b93 [mdt] #15 [ffff8805f8c7bbb8] mdt_intent_reint at ffffffffa0e2915d [mdt] #16 [ffff8805f8c7bc08] mdt_intent_policy at ffffffffa0e24b2e [mdt] #17 [ffff8805f8c7bc48] ldlm_lock_enqueue at ffffffffa0897b21 [ptlrpc] #18 [ffff8805f8c7bca8] ldlm_handle_enqueue0 at ffffffffa08bf967 [ptlrpc] #19 [ffff8805f8c7bd18] mdt_enqueue at ffffffffa0e25066 [mdt] #20 [ffff8805f8c7bd38] mdt_handle_common at ffffffffa0e18322 [mdt] #21 [ffff8805f8c7bd88] mdt_regular_handle at ffffffffa0e19225 [mdt] #22 [ffff8805f8c7bd98] ptlrpc_server_handle_request at ffffffffa08ef8ac [ptlrpc] #23 [ffff8805f8c7be98] ptlrpc_main at ffffffffa08f0e9c [ptlrpc] #24 [ffff8805f8c7bf48] kernel_thread at ffffffff8100c14a I will drop the assertion from our tree with change 4495. |
| Comment by James A Simmons [ 21/Nov/12 ] |
|
Chris their was discuss on how to properly fix this bug so the patch has been updated. I tested but could you test this new version at http://review.whamcloud.com/#change,4495. Thanks |
| Comment by Christopher Morrone [ 21/Nov/12 ] |
|
I have updated to the revision 2 of 4495. |
| Comment by Alex Zhuravlev [ 27/Nov/12 ] |
|
landed on master |