[LU-6996] osd_ea_lookup_rec assertion Created: 12/Aug/15  Updated: 16/Jan/16  Resolved: 06/Oct/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Major
Reporter: Matt Ezell Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None
Environment:

2.5.3-2.6.32_431.29.2.el6.atlas.x86_64_g57d5785


Attachments: Text File LU-6996-bt-a.txt     Text File LU-6996-ps.txt    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This morning a production MDS hit an assertion:

<0>[2551157.740086] LustreError: 14993:0:(osd_handler.c:4071:osd_ea_lookup_rec()) ASSERTION( dir->i_op != ((void *)0) && dir->i_op->lookup != ((void *)0) ) failed: 
<0>[2551157.756253] LustreError: 14993:0:(osd_handler.c:4071:osd_ea_lookup_rec()) LBUG
<4>[2551157.764766] Pid: 14993, comm: mdt01_094
<4>[2551157.769360] 
<4>[2551157.769361] Call Trace:
<4>[2551157.774374]  [<ffffffffa0409895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4>[2551157.782474]  [<ffffffffa0409e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4>[2551157.789707]  [<ffffffffa0ca8fcf>] osd_index_ea_lookup+0x6ff/0x8a0 [osd_ldiskfs]
<4>[2551157.798308]  [<ffffffffa0d0dde0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
<4>[2551157.805733]  [<ffffffffa088c7c0>] ? lod_index_lookup+0x0/0x30 [lod]
<4>[2551157.813056]  [<ffffffffa088c7e5>] lod_index_lookup+0x25/0x30 [lod]
<4>[2551157.820291]  [<ffffffffa0dd0daa>] __mdd_lookup+0x24a/0x440 [mdd]
<4>[2551157.827325]  [<ffffffffa0dd1599>] mdd_lookup+0x39/0xe0 [mdd]
<4>[2551157.833977]  [<ffffffffa0d3bee5>] ? mdt_name+0x35/0xc0 [mdt]
<4>[2551157.840629]  [<ffffffffa0d44b09>] mdt_reint_open+0xb69/0x21a0 [mdt]
<4>[2551157.847959]  [<ffffffffa0426376>] ? upcall_cache_get_entry+0x296/0x880 [libcfs]
<4>[2551157.856570]  [<ffffffffa05c7a80>] ? lu_ucred+0x20/0x30 [obdclass]
<4>[2551157.863705]  [<ffffffffa0d2d481>] mdt_reint_rec+0x41/0xe0 [mdt]
<4>[2551157.870643]  [<ffffffffa0d12ed3>] mdt_reint_internal+0x4c3/0x780 [mdt]
<4>[2551157.878254]  [<ffffffffa0d1345e>] mdt_intent_reint+0x1ee/0x410 [mdt]
<4>[2551157.885669]  [<ffffffffa0d10c3e>] mdt_intent_policy+0x3ae/0x770 [mdt]
<4>[2551157.893212]  [<ffffffffa06e42e5>] ldlm_lock_enqueue+0x135/0x980 [ptlrpc]
<4>[2551157.901044]  [<ffffffffa070de2b>] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
<4>[2551157.909336]  [<ffffffffa0d11106>] mdt_enqueue+0x46/0xe0 [mdt]
<4>[2551157.916083]  [<ffffffffa0d15ada>] mdt_handle_common+0x52a/0x1470 [mdt]
<4>[2551157.923701]  [<ffffffffa0d52595>] mds_regular_handle+0x15/0x20 [mdt]
<4>[2551157.931144]  [<ffffffffa073cf25>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
<4>[2551157.940128]  [<ffffffffa040a4ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
<4>[2551157.947452]  [<ffffffffa041b7c5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
<4>[2551157.955380]  [<ffffffffa07358f9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
<4>[2551157.963287]  [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
<4>[2551157.970142]  [<ffffffffa073f6ed>] ptlrpc_main+0xaed/0x1930 [ptlrpc]
<4>[2551157.977487]  [<ffffffffa073ec00>] ? ptlrpc_main+0x0/0x1930 [ptlrpc]
<4>[2551157.984809]  [<ffffffff8109abf6>] kthread+0x96/0xa0
<4>[2551157.990580]  [<ffffffff8100c20a>] child_rip+0xa/0x20
<4>[2551157.998367]  [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4>[2551158.004229]  [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>[2551158.010272] 
<0>[2551158.012746] Kernel panic - not syncing: LBUG


 Comments   
Comment by Matt Ezell [ 12/Aug/15 ]

We did get a crash dump. The code surrounding the assertion is:

lustre/osd-ldiskfs/osd_handler.c
static int osd_ea_lookup_rec(const struct lu_env *env, struct osd_object *obj,
                             struct dt_rec *rec, const struct dt_key *key)
{
        struct inode               *dir    = obj->oo_inode;
...
        LASSERT(dir->i_op != NULL && dir->i_op->lookup != NULL);

I loaded this in crash to try to see if it's i_op or lookup that's NULL.

crash> xbt|grep obj
xbt: cannot find debuginfo for module 'libcfs'
        obj = ffff8815aca40b40
        pobj = ffff881bd96f23c0
        object_locked = 0
crash> mod -s osd_ldiskfs
     MODULE       NAME                        SIZE  OBJECT FILE
ffffffffa0ceaee0  osd_ldiskfs               454984  /lib/modules/2.6.32-431.29.2.el6.atlas.x86_64/extra/kernel/fs/lustre/osd_ldiskfs.ko 
crash> osd_object.oo_inode ffff8815aca40b40
  oo_inode = 0xffff882b54a72a80
crash> inode.i_op 0xffff882b54a72a80
  i_op = 0xffffffffa0bf0c40
crash> inode_operations.lookup 0xffffffffa0bf0c40
  lookup = 0xffffffffa0bd31f0
crash> dis 0xffffffffa0bd31f0
0xffffffffa0bd31f0 <ldiskfs_lookup>:    push   %rbp

Both look non-NULL and valid.

Comment by Jian Yu [ 12/Aug/15 ]

Hi Matt,

There might be a race. Could you please dump the stack traces for all tasks? Thank you.

Comment by Peter Jones [ 13/Aug/15 ]

Alex

Could you please look into this issue?

Thanks

Peter

Comment by Matt Ezell [ 13/Aug/15 ]

The output of 'bt -a' and 'ps' from crash. Let me know if you need the backtrace from any idle PIDs.

Comment by Jian Yu [ 13/Aug/15 ]

Thank you, Matt. Alex would look into the stack traces and ask you for help to get more logs if needed.

Comment by Gerrit Updater [ 19/Aug/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16026
Subject: LU-6996 osd-ldiskfs: handle stale OI mapping cache
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: 4bb351a4e4c76c8538532dcbfb7829dfea35aed0

Comment by nasf (Inactive) [ 27/Aug/15 ]

On server side, the RPC service thread may cache one OI mapping on its stack, such OI mapping will become invalid if some other (RPC service thread) removed the object by race. If the RPC service thread uses the cached OI mapping and finds the inode that has been unlinked and reused by other object with no LMA generated yet, then the original osd_check_lma() would regard it as the expect local object by wrong. Such case is one of the reason for the failure in this ticket.

Anyway, it is just possible reason, with the given stack/logs, we cannot say that the patch will fix the failure completely. Please apply the patch and see what will happen.

Comment by Gerrit Updater [ 31/Aug/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16137
Subject: LU-6996 osd: test b2_5 base
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: f78965727e7a86786cdb9fe7a15389721be1b1bd

Comment by Gerrit Updater [ 01/Sep/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16157
Subject: LU-6996 osd-ldiskfs: handle stale OI mapping cache
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ea2bb60f6b85e53ea43ba240ef1c7e3ef809595c

Comment by Jian Yu [ 03/Sep/15 ]

Hi Alex,

Nasf is working on the patches handling stale OI mapping cache but he was unsure of the root cause of the original issue in this ticket. Could you please give some more suggestions here?

Comment by Gerrit Updater [ 06/Oct/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16157/
Subject: LU-6996 osd-ldiskfs: handle stale OI mapping cache
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7aaa680b7f22e7dfaac8af38b78d89164a94e842

Comment by Peter Jones [ 06/Oct/15 ]

Landed for 2.8

Generated at Sat Feb 10 02:05:05 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.