[LU-6996] osd_ea_lookup_rec assertion Created: 12/Aug/15 Updated: 16/Jan/16 Resolved: 06/Oct/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Matt Ezell | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
2.5.3-2.6.32_431.29.2.el6.atlas.x86_64_g57d5785 |
||
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This morning a production MDS hit an assertion:

<0>[2551157.740086] LustreError: 14993:0:(osd_handler.c:4071:osd_ea_lookup_rec()) ASSERTION( dir->i_op != ((void *)0) && dir->i_op->lookup != ((void *)0) ) failed:
<0>[2551157.756253] LustreError: 14993:0:(osd_handler.c:4071:osd_ea_lookup_rec()) LBUG
<4>[2551157.764766] Pid: 14993, comm: mdt01_094
<4>[2551157.769360]
<4>[2551157.769361] Call Trace:
<4>[2551157.774374] [<ffffffffa0409895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4>[2551157.782474] [<ffffffffa0409e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4>[2551157.789707] [<ffffffffa0ca8fcf>] osd_index_ea_lookup+0x6ff/0x8a0 [osd_ldiskfs]
<4>[2551157.798308] [<ffffffffa0d0dde0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
<4>[2551157.805733] [<ffffffffa088c7c0>] ? lod_index_lookup+0x0/0x30 [lod]
<4>[2551157.813056] [<ffffffffa088c7e5>] lod_index_lookup+0x25/0x30 [lod]
<4>[2551157.820291] [<ffffffffa0dd0daa>] __mdd_lookup+0x24a/0x440 [mdd]
<4>[2551157.827325] [<ffffffffa0dd1599>] mdd_lookup+0x39/0xe0 [mdd]
<4>[2551157.833977] [<ffffffffa0d3bee5>] ? mdt_name+0x35/0xc0 [mdt]
<4>[2551157.840629] [<ffffffffa0d44b09>] mdt_reint_open+0xb69/0x21a0 [mdt]
<4>[2551157.847959] [<ffffffffa0426376>] ? upcall_cache_get_entry+0x296/0x880 [libcfs]
<4>[2551157.856570] [<ffffffffa05c7a80>] ? lu_ucred+0x20/0x30 [obdclass]
<4>[2551157.863705] [<ffffffffa0d2d481>] mdt_reint_rec+0x41/0xe0 [mdt]
<4>[2551157.870643] [<ffffffffa0d12ed3>] mdt_reint_internal+0x4c3/0x780 [mdt]
<4>[2551157.878254] [<ffffffffa0d1345e>] mdt_intent_reint+0x1ee/0x410 [mdt]
<4>[2551157.885669] [<ffffffffa0d10c3e>] mdt_intent_policy+0x3ae/0x770 [mdt]
<4>[2551157.893212] [<ffffffffa06e42e5>] ldlm_lock_enqueue+0x135/0x980 [ptlrpc]
<4>[2551157.901044] [<ffffffffa070de2b>] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
<4>[2551157.909336] [<ffffffffa0d11106>] mdt_enqueue+0x46/0xe0 [mdt]
<4>[2551157.916083] [<ffffffffa0d15ada>] mdt_handle_common+0x52a/0x1470 [mdt]
<4>[2551157.923701] [<ffffffffa0d52595>] mds_regular_handle+0x15/0x20 [mdt]
<4>[2551157.931144] [<ffffffffa073cf25>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
<4>[2551157.940128] [<ffffffffa040a4ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
<4>[2551157.947452] [<ffffffffa041b7c5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
<4>[2551157.955380] [<ffffffffa07358f9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
<4>[2551157.963287] [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
<4>[2551157.970142] [<ffffffffa073f6ed>] ptlrpc_main+0xaed/0x1930 [ptlrpc]
<4>[2551157.977487] [<ffffffffa073ec00>] ? ptlrpc_main+0x0/0x1930 [ptlrpc]
<4>[2551157.984809] [<ffffffff8109abf6>] kthread+0x96/0xa0
<4>[2551157.990580] [<ffffffff8100c20a>] child_rip+0xa/0x20
<4>[2551157.998367] [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4>[2551158.004229] [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>[2551158.010272]
<0>[2551158.012746] Kernel panic - not syncing: LBUG |
| Comments |
| Comment by Matt Ezell [ 12/Aug/15 ] |
|
We did get a crash dump. The code surrounding the assertion is:

lustre/osd-ldiskfs/osd_handler.c:

static int osd_ea_lookup_rec(const struct lu_env *env, struct osd_object *obj,
                             struct dt_rec *rec, const struct dt_key *key)
{
        struct inode *dir = obj->oo_inode;
        ...
        LASSERT(dir->i_op != NULL && dir->i_op->lookup != NULL);

I loaded this in crash to try to see if it's i_op or lookup that's NULL.

crash> xbt|grep obj
xbt: cannot find debuginfo for module 'libcfs'
obj = ffff8815aca40b40
pobj = ffff881bd96f23c0
object_locked = 0
crash> mod -s osd_ldiskfs
MODULE NAME SIZE OBJECT FILE
ffffffffa0ceaee0 osd_ldiskfs 454984 /lib/modules/2.6.32-431.29.2.el6.atlas.x86_64/extra/kernel/fs/lustre/osd_ldiskfs.ko
crash> osd_object.oo_inode ffff8815aca40b40
oo_inode = 0xffff882b54a72a80
crash> inode.i_op 0xffff882b54a72a80
i_op = 0xffffffffa0bf0c40
crash> inode_operations.lookup 0xffffffffa0bf0c40
lookup = 0xffffffffa0bd31f0
crash> dis 0xffffffffa0bd31f0
0xffffffffa0bd31f0 <ldiskfs_lookup>: push %rbp
Both look non-NULL and valid. |
| Comment by Jian Yu [ 12/Aug/15 ] |
|
Hi Matt, There might be a race. Could you please dump the stack traces for all tasks? Thank you. |
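For reference, assuming the vmcore is already loaded in crash(8), the per-task stack traces can be gathered with the standard commands below; the output file names are only an illustrative choice:

crash> bt -a > bt_active.txt
crash> foreach bt > bt_all.txt
crash> ps > ps.txt

'bt -a' covers the active task on each CPU, 'foreach bt' walks every task in the system (including idle ones), and 'ps' provides the task list for mapping PIDs to command names.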
| Comment by Peter Jones [ 13/Aug/15 ] |
|
Alex, could you please look into this issue? Thanks, Peter |
| Comment by Matt Ezell [ 13/Aug/15 ] |
|
Attached is the output of 'bt -a' and 'ps' from crash. Let me know if you need the backtrace from any idle PIDs. |
| Comment by Jian Yu [ 13/Aug/15 ] |
|
Thank you, Matt. Alex will look into the stack traces and ask you for more logs if needed. |
| Comment by Gerrit Updater [ 19/Aug/15 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16026 |
| Comment by nasf (Inactive) [ 27/Aug/15 ] |
|
On the server side, an RPC service thread may cache an OI mapping on its stack. Such a cached OI mapping becomes invalid if another RPC service thread removes the object concurrently. If the first thread then uses the cached OI mapping and finds an inode that has been unlinked and reused by another object, but for which no LMA has been generated yet, the original osd_check_lma() would wrongly regard it as the expected local object. This race is one possible reason for the failure in this ticket. However, it is only a possible reason; with the given stack traces and logs we cannot say that the patch will fix the failure completely. Please apply the patch and see what happens. |
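To make the race concrete, here is a small, self-contained C model of the re-validation idea described above. It is an illustration only, not the code from the patches; the names fid, fake_inode, and oi_mapping_still_valid are invented stand-ins for the real osd structures and helpers:

#include <stdbool.h>
#include <string.h>

/* Invented stand-in for struct lu_fid. */
struct fid {
        unsigned long long seq;
        unsigned int       oid;
        unsigned int       ver;
};

/* Invented stand-in for an inode together with its LMA xattr, if any. */
struct fake_inode {
        bool       lma_present;  /* has the LMA xattr been written yet?  */
        struct fid lma_fid;      /* FID recorded in the LMA when present */
};

/*
 * Decide whether the inode reached through a cached OI mapping is really
 * the object we were looking for.  If the inode carries no LMA yet (it may
 * be a freshly reused inode whose LMA has not been generated), or its LMA
 * FID differs from the FID we expect, the cached mapping must be treated
 * as stale and the OI lookup redone instead of trusting the inode.
 */
static bool oi_mapping_still_valid(const struct fake_inode *inode,
                                   const struct fid *expected)
{
        if (!inode->lma_present)
                return false;  /* reused inode without LMA: do not trust it */

        return memcmp(&inode->lma_fid, expected, sizeof(*expected)) == 0;
}

In the real code the "redo" path means discarding the cached mapping and consulting the OI table again rather than simply returning false; the point of the model is only that "no LMA yet" can no longer be taken to mean "this is the expected local object".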
| Comment by Gerrit Updater [ 31/Aug/15 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16137 |
| Comment by Gerrit Updater [ 01/Sep/15 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16157 |
| Comment by Jian Yu [ 03/Sep/15 ] |
|
Hi Alex, Nasf is working on the patches that handle the stale OI mapping cache, but he is unsure of the root cause of the original issue in this ticket. Could you please give some more suggestions here? |
| Comment by Gerrit Updater [ 06/Oct/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16157/ |
| Comment by Peter Jones [ 06/Oct/15 ] |
|
Landed for 2.8 |