[LU-1976] SWL - mds hard crash Created: 18/Sep/12 Updated: 03/Dec/12 Resolved: 22/Sep/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0 |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Cliff White (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 4425 |
| Description |
|
Console [hyperion-rst6] log at 2012-09-18 18:00:00 PDT. The MDS failed to dump a stack, but did dump a vmcore. At the same time, one client also dumped a vmcore. Both dumps are on brent under ~cliffw/hyperion/ |
| Comments |
| Comment by Cliff White (Inactive) [ 18/Sep/12 ] |
|
Kernel backtrace, MDS:
bt
PID: 4898 TASK: ffff88012df62080 CPU: 7 COMMAND: "mdt03_008"
#0 [ffff880121e932b0] machine_kexec at ffffffff8103281b
#1 [ffff880121e93310] crash_kexec at ffffffff810ba792
#2 [ffff880121e933e0] oops_end at ffffffff81501700
#3 [ffff880121e93410] die at ffffffff8100f26b
#4 [ffff880121e93440] do_trap at ffffffff81500ff4
#5 [ffff880121e934a0] do_invalid_op at ffffffff8100ce35
#6 [ffff880121e93540] invalid_op at ffffffff8100bedb
[exception RIP: add_dirent_to_buf+1216]
RIP: ffffffffa0db21d0 RSP: ffff880121e935f0 RFLAGS: 00010246
RAX: ffff880181248000 RBX: ffff880268654078 RCX: 00000000000014b5
RDX: ffff880268654098 RSI: 0000000000000046 RDI: ffff88016ff93c00
RBP: ffff880121e936b0 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff880130e16ad0
R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000020
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffff880121e936b8] ldiskfs_add_entry at ffffffffa0db5c5d [ldiskfs]
#8 [ffff880121e93758] __osd_ea_add_rec at ffffffffa1001689 [osd_ldiskfs]
#9 [ffff880121e937a8] osd_index_ea_insert at ffffffffa100dbeb [osd_ldiskfs]
#10 [ffff880121e93838] __mdd_index_insert_only at ffffffffa0eee977 [mdd]
#11 [ffff880121e93898] __mdd_index_insert at ffffffffa0eef9b1 [mdd]
#12 [ffff880121e938e8] mdd_create at ffffffffa0ef55e3 [mdd]
#13 [ffff880121e93a28] cml_create at ffffffffa06a4637 [cmm]
#14 [ffff880121e93a78] mdt_reint_open at ffffffffa0f8bb9f [mdt]
#15 [ffff880121e93b48] mdt_reint_rec at ffffffffa0f75151 [mdt]
#16 [ffff880121e93b68] mdt_reint_internal at ffffffffa0f6e9aa [mdt]
#17 [ffff880121e93bb8] mdt_intent_reint at ffffffffa0f6ef7d [mdt]
#18 [ffff880121e93c08] mdt_intent_policy at ffffffffa0f6b191 [mdt]
#19 [ffff880121e93c48] ldlm_lock_enqueue at ffffffffa0859881 [ptlrpc]
#20 [ffff880121e93ca8] ldlm_handle_enqueue0 at ffffffffa08819bf [ptlrpc]
#21 [ffff880121e93d18] mdt_enqueue at ffffffffa0f6b506 [mdt]
#22 [ffff880121e93d38] mdt_handle_common at ffffffffa0f62802 [mdt]
#23 [ffff880121e93d88] mdt_regular_handle at ffffffffa0f636f5 [mdt]
#24 [ffff880121e93d98] ptlrpc_server_handle_request at ffffffffa08b199d [ptlrpc]
#25 [ffff880121e93e98] ptlrpc_main at ffffffffa08b2f89 [ptlrpc]
#26 [ffff880121e93f48] kernel_thread at ffffffff8100c14a
|
| Comment by Peter Jones [ 19/Sep/12 ] |
|
Bobijam, could you please look into this one? Thanks, Peter |
| Comment by Peter Jones [ 19/Sep/12 ] |
|
Fanyong, Oleg suggested that this might be a good ticket for you to look into. Thanks, Peter |
| Comment by nasf (Inactive) [ 20/Sep/12 ] |
|
One possible cause of this failure: when multiple threads try to recycle empty OI leaves concurrently, the race among those threads may corrupt some OI index node(s). The situation becomes more complex when other threads are splitting index node(s) in parallel. The corrupted OI index node(s) may in turn cause memory corruption, which may explain some of the strange failures seen recently. I am making a patch to fix this issue. The OI-related code is really fragile. |
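To illustrate the kind of race described above, here is a minimal toy sketch in Python. It is not the Lustre OI code and not the patch for this ticket: `ToyIndex`, `recycle_if_empty`, `insert`, and `run_demo` are hypothetical names invented for illustration. It only shows the general idea that the "is this leaf empty?" check and the unlink must happen under the same lock, so that a leaf is recycled at most once and never while another thread is adding to it.

```python
import threading


class ToyIndex:
    """Toy stand-in for an index with recyclable leaves (hypothetical, not Lustre code)."""

    def __init__(self, n_leaves):
        self.lock = threading.Lock()
        self.leaves = {i: set() for i in range(n_leaves)}  # leaf id -> entries
        self.free_list = []  # ids of recycled leaves

    def recycle_if_empty(self, leaf_id):
        # The emptiness check and the unlink happen under one lock, so only
        # one thread can recycle a given leaf, and never while it has entries.
        with self.lock:
            entries = self.leaves.get(leaf_id)
            if entries is None or entries:
                return False  # already recycled, or not empty
            del self.leaves[leaf_id]
            self.free_list.append(leaf_id)
            return True

    def insert(self, leaf_id, key):
        with self.lock:
            entries = self.leaves.get(leaf_id)
            if entries is None:
                return False  # leaf was recycled; the caller must look it up again
            entries.add(key)
            return True


def run_demo(n_leaves=64, n_threads=8):
    """Let several threads compete to recycle every leaf of one index."""
    index = ToyIndex(n_leaves)
    index.insert(0, "keep")  # leaf 0 is non-empty and must survive
    wins = []
    wins_lock = threading.Lock()

    def worker():
        local = [i for i in range(n_leaves) if index.recycle_if_empty(i)]
        with wins_lock:
            wins.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return index, wins
```

If the check and the unlink were done without a common lock, two threads could both see the same leaf as empty and both put it on the free list, or one could recycle a leaf that another thread had just inserted into. Those are the kinds of inconsistency the comment above suspects, in a much simplified form.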
| Comment by nasf (Inactive) [ 20/Sep/12 ] |
|
This is the patch: |
| Comment by Peter Jones [ 22/Sep/12 ] |
|
Landed for 2.3 and 2.4 |
| Comment by Bob Glossman (Inactive) [ 03/Dec/12 ] |
|
Backport to b2_1