[LU-1976] SWL - mds hard crash Created: 18/Sep/12  Updated: 03/Dec/12  Resolved: 22/Sep/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: Lustre 2.3.0, Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Cliff White (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 4425

 Description   

Console [hyperion-rst6] log at 2012-09-18 18:00:00 PDT.
2012-09-18 18:04:56 Lustre: lustre-MDT0000: haven't heard from client 3beba6a9-a86c-e3b3-e02d-311fe4e1c5ec (at 192.168.118.135@o2ib1) in 227 seconds. I think it's dead, and I am evicting it. exp ffff880288182400, cur 1348016696 expire 1348016546 last 1348016469
2012-09-18 18:17:21 BUG: unable to handle kernel paging request at 000000008a5e6591
2012-09-18 18:17:21 IP: [<ffffffffa0855018>] unlock_res_and_lock+0x18/0x40 [ptlrpc]
2012-09-18 18:17:21 PGD 0
2012-09-18 18:17:21 BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
2012-09-18 18:17:21 IP: [<ffffffff81043b49>] no_context+0x99/0x260

The MDS fails to dump a stack, but it does dump a vmcore. At the same time, one client also dumped a vmcore. Both dumps are on brent under ~cliffw/hyperion/



 Comments   
Comment by Cliff White (Inactive) [ 18/Sep/12 ]

Kernel backtrace, MDS:

 bt
PID: 4898   TASK: ffff88012df62080  CPU: 7   COMMAND: "mdt03_008"
 #0 [ffff880121e932b0] machine_kexec at ffffffff8103281b
 #1 [ffff880121e93310] crash_kexec at ffffffff810ba792
 #2 [ffff880121e933e0] oops_end at ffffffff81501700
 #3 [ffff880121e93410] die at ffffffff8100f26b
 #4 [ffff880121e93440] do_trap at ffffffff81500ff4
 #5 [ffff880121e934a0] do_invalid_op at ffffffff8100ce35
 #6 [ffff880121e93540] invalid_op at ffffffff8100bedb
    [exception RIP: add_dirent_to_buf+1216]
    RIP: ffffffffa0db21d0  RSP: ffff880121e935f0  RFLAGS: 00010246
    RAX: ffff880181248000  RBX: ffff880268654078  RCX: 00000000000014b5
    RDX: ffff880268654098  RSI: 0000000000000046  RDI: ffff88016ff93c00
    RBP: ffff880121e936b0   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000001  R11: 0000000000000000  R12: ffff880130e16ad0
    R13: 0000000000000000  R14: 0000000000000004  R15: 0000000000000020
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff880121e936b8] ldiskfs_add_entry at ffffffffa0db5c5d [ldiskfs]
 #8 [ffff880121e93758] __osd_ea_add_rec at ffffffffa1001689 [osd_ldiskfs]
 #9 [ffff880121e937a8] osd_index_ea_insert at ffffffffa100dbeb [osd_ldiskfs]
#10 [ffff880121e93838] __mdd_index_insert_only at ffffffffa0eee977 [mdd]
#11 [ffff880121e93898] __mdd_index_insert at ffffffffa0eef9b1 [mdd]
#12 [ffff880121e938e8] mdd_create at ffffffffa0ef55e3 [mdd]
#13 [ffff880121e93a28] cml_create at ffffffffa06a4637 [cmm]
#14 [ffff880121e93a78] mdt_reint_open at ffffffffa0f8bb9f [mdt]
#15 [ffff880121e93b48] mdt_reint_rec at ffffffffa0f75151 [mdt]
#16 [ffff880121e93b68] mdt_reint_internal at ffffffffa0f6e9aa [mdt]
#17 [ffff880121e93bb8] mdt_intent_reint at ffffffffa0f6ef7d [mdt]
#18 [ffff880121e93c08] mdt_intent_policy at ffffffffa0f6b191 [mdt]
#19 [ffff880121e93c48] ldlm_lock_enqueue at ffffffffa0859881 [ptlrpc]
#20 [ffff880121e93ca8] ldlm_handle_enqueue0 at ffffffffa08819bf [ptlrpc]
#21 [ffff880121e93d18] mdt_enqueue at ffffffffa0f6b506 [mdt]
#22 [ffff880121e93d38] mdt_handle_common at ffffffffa0f62802 [mdt]
#23 [ffff880121e93d88] mdt_regular_handle at ffffffffa0f636f5 [mdt]
#24 [ffff880121e93d98] ptlrpc_server_handle_request at ffffffffa08b199d [ptlrpc]
#25 [ffff880121e93e98] ptlrpc_main at ffffffffa08b2f89 [ptlrpc]
#26 [ffff880121e93f48] kernel_thread at ffffffff8100c14a
Comment by Peter Jones [ 19/Sep/12 ]

Bobijam

Could you please look into this one?

Thanks

Peter

Comment by Peter Jones [ 19/Sep/12 ]

Fanyong

Oleg suggested that this might be a good ticket for you to look into

Thanks

Peter

Comment by nasf (Inactive) [ 20/Sep/12 ]

One possible cause of this failure: when multiple threads try to recycle empty OI leaves concurrently, the race among those threads can corrupt some OI index node(s). The situation becomes even more complex when index node splits are running in parallel. Corrupted OI index node(s) can then lead to memory corruption, which may explain some of the strange failures seen recently. A rough illustration of the race is sketched below.

I am preparing a patch to fix this issue.

The OI-related code is really a minefield.
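
The following is a minimal sketch of the kind of serialization described above, not the actual Lustre/osd-ldiskfs patch. It assumes a simplified flat index; all names (oi_index, oi_leaf, oi_recycle_empty_leaf, oi_split_insert) are hypothetical and only illustrate why the "recycle an empty leaf" path and the "split/insert" path must hold the same lock, so one thread cannot free a block another thread is still rewiring.

/*
 * Illustrative sketch only -- NOT the actual patch in change 4061.
 * Hypothetical types/functions; compiles standalone with pthreads.
 */
#include <pthread.h>
#include <stdlib.h>

struct oi_leaf {
        struct oi_leaf  *ol_next;       /* sibling leaf in the index   */
        unsigned int     ol_nrecords;   /* live records in this leaf   */
};

struct oi_index {
        pthread_rwlock_t oi_lock;       /* serializes structural changes */
        struct oi_leaf  *oi_leaves;     /* simplified: flat leaf list    */
};

/* Unlink and free a leaf that has become empty.  Holding the write lock
 * prevents a concurrent split (or another recycler) from following a
 * pointer into the block we are about to free. */
static void oi_recycle_empty_leaf(struct oi_index *oi, struct oi_leaf *leaf)
{
        pthread_rwlock_wrlock(&oi->oi_lock);
        if (leaf->ol_nrecords == 0) {           /* recheck under the lock */
                struct oi_leaf **pp = &oi->oi_leaves;

                while (*pp != NULL && *pp != leaf)
                        pp = &(*pp)->ol_next;
                if (*pp == leaf) {
                        *pp = leaf->ol_next;    /* unlink */
                        free(leaf);
                }
        }
        pthread_rwlock_unlock(&oi->oi_lock);
}

/* A node split must take the same lock, so it never links in or traverses
 * a leaf that a concurrent recycler is freeing. */
static void oi_split_insert(struct oi_index *oi, struct oi_leaf *new_leaf)
{
        pthread_rwlock_wrlock(&oi->oi_lock);
        new_leaf->ol_next = oi->oi_leaves;
        oi->oi_leaves = new_leaf;
        pthread_rwlock_unlock(&oi->oi_lock);
}

Without that common lock, the empty-leaf recycler and the splitter race exactly as described above, and the index ends up referencing freed memory, which matches the kind of crash seen in the backtrace.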

Comment by nasf (Inactive) [ 20/Sep/12 ]

This is the patch:

http://review.whamcloud.com/#change,4061

Comment by Peter Jones [ 22/Sep/12 ]

Landed for 2.3 and 2.4

Comment by Bob Glossman (Inactive) [ 03/Dec/12 ]

Backport to b2_1:
http://review.whamcloud.com/#change,4735
