[LU-6996] osd_ea_lookup_rec assertion - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.8.0
Affects Version/s: Lustre 2.5.3
Labels:
None
Environment:
2.5.3-2.6.32_431.29.2.el6.atlas.x86_64_g57d5785

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

This morning a production MDS hit an assertion:

<0>[2551157.740086] LustreError: 14993:0:(osd_handler.c:4071:osd_ea_lookup_rec()) ASSERTION( dir->i_op != ((void *)0) && dir->i_op->lookup != ((void *)0) ) failed: 
<0>[2551157.756253] LustreError: 14993:0:(osd_handler.c:4071:osd_ea_lookup_rec()) LBUG
<4>[2551157.764766] Pid: 14993, comm: mdt01_094
<4>[2551157.769360] 
<4>[2551157.769361] Call Trace:
<4>[2551157.774374]  [<ffffffffa0409895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4>[2551157.782474]  [<ffffffffa0409e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4>[2551157.789707]  [<ffffffffa0ca8fcf>] osd_index_ea_lookup+0x6ff/0x8a0 [osd_ldiskfs]
<4>[2551157.798308]  [<ffffffffa0d0dde0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
<4>[2551157.805733]  [<ffffffffa088c7c0>] ? lod_index_lookup+0x0/0x30 [lod]
<4>[2551157.813056]  [<ffffffffa088c7e5>] lod_index_lookup+0x25/0x30 [lod]
<4>[2551157.820291]  [<ffffffffa0dd0daa>] __mdd_lookup+0x24a/0x440 [mdd]
<4>[2551157.827325]  [<ffffffffa0dd1599>] mdd_lookup+0x39/0xe0 [mdd]
<4>[2551157.833977]  [<ffffffffa0d3bee5>] ? mdt_name+0x35/0xc0 [mdt]
<4>[2551157.840629]  [<ffffffffa0d44b09>] mdt_reint_open+0xb69/0x21a0 [mdt]
<4>[2551157.847959]  [<ffffffffa0426376>] ? upcall_cache_get_entry+0x296/0x880 [libcfs]
<4>[2551157.856570]  [<ffffffffa05c7a80>] ? lu_ucred+0x20/0x30 [obdclass]
<4>[2551157.863705]  [<ffffffffa0d2d481>] mdt_reint_rec+0x41/0xe0 [mdt]
<4>[2551157.870643]  [<ffffffffa0d12ed3>] mdt_reint_internal+0x4c3/0x780 [mdt]
<4>[2551157.878254]  [<ffffffffa0d1345e>] mdt_intent_reint+0x1ee/0x410 [mdt]
<4>[2551157.885669]  [<ffffffffa0d10c3e>] mdt_intent_policy+0x3ae/0x770 [mdt]
<4>[2551157.893212]  [<ffffffffa06e42e5>] ldlm_lock_enqueue+0x135/0x980 [ptlrpc]
<4>[2551157.901044]  [<ffffffffa070de2b>] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
<4>[2551157.909336]  [<ffffffffa0d11106>] mdt_enqueue+0x46/0xe0 [mdt]
<4>[2551157.916083]  [<ffffffffa0d15ada>] mdt_handle_common+0x52a/0x1470 [mdt]
<4>[2551157.923701]  [<ffffffffa0d52595>] mds_regular_handle+0x15/0x20 [mdt]
<4>[2551157.931144]  [<ffffffffa073cf25>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
<4>[2551157.940128]  [<ffffffffa040a4ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
<4>[2551157.947452]  [<ffffffffa041b7c5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
<4>[2551157.955380]  [<ffffffffa07358f9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
<4>[2551157.963287]  [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
<4>[2551157.970142]  [<ffffffffa073f6ed>] ptlrpc_main+0xaed/0x1930 [ptlrpc]
<4>[2551157.977487]  [<ffffffffa073ec00>] ? ptlrpc_main+0x0/0x1930 [ptlrpc]
<4>[2551157.984809]  [<ffffffff8109abf6>] kthread+0x96/0xa0
<4>[2551157.990580]  [<ffffffff8100c20a>] child_rip+0xa/0x20
<4>[2551157.998367]  [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4>[2551158.004229]  [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>[2551158.010272] 
<0>[2551158.012746] Kernel panic - not syncing: LBUG

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

LU-6996-bt-a.txt
13 kB
13/Aug/15 1:28 PM
LU-6996-ps.txt
371 kB
13/Aug/15 1:28 PM

Activity

[LU-6996] osd_ea_lookup_rec assertion

Jian Yu added a comment - 03/Sep/15 1:28 AM

Hi Alex,

Nasf is working on the patches handling stale OI mapping cache but he was unsure of the root cause of the original issue in this ticket. Could you please give some more suggestions here?

Jian Yu added a comment - 03/Sep/15 1:28 AM Hi Alex, Nasf is working on the patches handling stale OI mapping cache but he was unsure of the root cause of the original issue in this ticket. Could you please give some more suggestions here?

Gerrit Updater added a comment - 01/Sep/15 5:11 AM

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16157
Subject: ~~LU-6996~~ osd-ldiskfs: handle stale OI mapping cache
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ea2bb60f6b85e53ea43ba240ef1c7e3ef809595c

Gerrit Updater added a comment - 01/Sep/15 5:11 AM Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16157 Subject: LU-6996 osd-ldiskfs: handle stale OI mapping cache Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: ea2bb60f6b85e53ea43ba240ef1c7e3ef809595c

Gerrit Updater added a comment - 31/Aug/15 4:27 AM

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16137
Subject: ~~LU-6996~~ osd: test b2_5 base
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: f78965727e7a86786cdb9fe7a15389721be1b1bd

Gerrit Updater added a comment - 31/Aug/15 4:27 AM Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16137 Subject: LU-6996 osd: test b2_5 base Project: fs/lustre-release Branch: b2_5 Current Patch Set: 1 Commit: f78965727e7a86786cdb9fe7a15389721be1b1bd

nasf (Inactive) added a comment - 27/Aug/15 6:27 AM

On server side, the RPC service thread may cache one OI mapping on its stack, such OI mapping will become invalid if some other (RPC service thread) removed the object by race. If the RPC service thread uses the cached OI mapping and finds the inode that has been unlinked and reused by other object with no LMA generated yet, then the original osd_check_lma() would regard it as the expect local object by wrong. Such case is one of the reason for the failure in this ticket.

Anyway, it is just possible reason, with the given stack/logs, we cannot say that the patch will fix the failure completely. Please apply the patch and see what will happen.

nasf (Inactive) added a comment - 27/Aug/15 6:27 AM On server side, the RPC service thread may cache one OI mapping on its stack, such OI mapping will become invalid if some other (RPC service thread) removed the object by race. If the RPC service thread uses the cached OI mapping and finds the inode that has been unlinked and reused by other object with no LMA generated yet, then the original osd_check_lma() would regard it as the expect local object by wrong. Such case is one of the reason for the failure in this ticket. Anyway, it is just possible reason, with the given stack/logs, we cannot say that the patch will fix the failure completely. Please apply the patch and see what will happen.

Gerrit Updater added a comment - 19/Aug/15 5:39 PM

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16026
Subject: ~~LU-6996~~ osd-ldiskfs: handle stale OI mapping cache
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: 4bb351a4e4c76c8538532dcbfb7829dfea35aed0

Gerrit Updater added a comment - 19/Aug/15 5:39 PM Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16026 Subject: LU-6996 osd-ldiskfs: handle stale OI mapping cache Project: fs/lustre-release Branch: b2_5 Current Patch Set: 1 Commit: 4bb351a4e4c76c8538532dcbfb7829dfea35aed0

Jian Yu added a comment - 13/Aug/15 3:51 PM

Thank you, Matt. Alex would look into the stack traces and ask you for help to get more logs if needed.

Jian Yu added a comment - 13/Aug/15 3:51 PM Thank you, Matt. Alex would look into the stack traces and ask you for help to get more logs if needed.

Matt Ezell added a comment - 13/Aug/15 1:28 PM

The output of 'bt -a' and 'ps' from crash. Let me know if you need the backtrace from any idle PIDs.

Matt Ezell added a comment - 13/Aug/15 1:28 PM The output of 'bt -a' and 'ps' from crash. Let me know if you need the backtrace from any idle PIDs.

Peter Jones added a comment - 13/Aug/15 1:23 PM

Alex

Could you please look into this issue?

Thanks

Peter

Peter Jones added a comment - 13/Aug/15 1:23 PM Alex Could you please look into this issue? Thanks Peter

Jian Yu added a comment - 12/Aug/15 10:26 PM

Hi Matt,

There might be a race. Could you please dump the stack traces for all tasks? Thank you.

Jian Yu added a comment - 12/Aug/15 10:26 PM Hi Matt, There might be a race. Could you please dump the stack traces for all tasks? Thank you.

Matt Ezell added a comment - 12/Aug/15 8:45 PM

We did get a crash dump. The code surrounding the assertion is:

lustre/osd-ldiskfs/osd_handler.c

static int osd_ea_lookup_rec(const struct lu_env *env, struct osd_object *obj,
                             struct dt_rec *rec, const struct dt_key *key)
{
        struct inode               *dir    = obj->oo_inode;
...
        LASSERT(dir->i_op != NULL && dir->i_op->lookup != NULL);

I loaded this in crash to try to see if it's i_op or lookup that's NULL.

crash> xbt|grep obj
xbt: cannot find debuginfo for module 'libcfs'
        obj = ffff8815aca40b40
        pobj = ffff881bd96f23c0
        object_locked = 0
crash> mod -s osd_ldiskfs
     MODULE       NAME                        SIZE  OBJECT FILE
ffffffffa0ceaee0  osd_ldiskfs               454984  /lib/modules/2.6.32-431.29.2.el6.atlas.x86_64/extra/kernel/fs/lustre/osd_ldiskfs.ko 
crash> osd_object.oo_inode ffff8815aca40b40
  oo_inode = 0xffff882b54a72a80
crash> inode.i_op 0xffff882b54a72a80
  i_op = 0xffffffffa0bf0c40
crash> inode_operations.lookup 0xffffffffa0bf0c40
  lookup = 0xffffffffa0bd31f0
crash> dis 0xffffffffa0bd31f0
0xffffffffa0bd31f0 <ldiskfs_lookup>:    push   %rbp

Both look non-NULL and valid.

Matt Ezell added a comment - 12/Aug/15 8:45 PM We did get a crash dump. The code surrounding the assertion is: lustre/osd-ldiskfs/osd_handler.c static int osd_ea_lookup_rec( const struct lu_env *env, struct osd_object *obj, struct dt_rec *rec, const struct dt_key *key) { struct inode *dir = obj->oo_inode; ... LASSERT(dir->i_op != NULL && dir->i_op->lookup != NULL); I loaded this in crash to try to see if it's i_op or lookup that's NULL. crash> xbt|grep obj xbt: cannot find debuginfo for module 'libcfs' obj = ffff8815aca40b40 pobj = ffff881bd96f23c0 object_locked = 0 crash> mod -s osd_ldiskfs MODULE NAME SIZE OBJECT FILE ffffffffa0ceaee0 osd_ldiskfs 454984 /lib/modules/2.6.32-431.29.2.el6.atlas.x86_64/extra/kernel/fs/lustre/osd_ldiskfs.ko crash> osd_object.oo_inode ffff8815aca40b40 oo_inode = 0xffff882b54a72a80 crash> inode.i_op 0xffff882b54a72a80 i_op = 0xffffffffa0bf0c40 crash> inode_operations.lookup 0xffffffffa0bf0c40 lookup = 0xffffffffa0bd31f0 crash> dis 0xffffffffa0bd31f0 0xffffffffa0bd31f0 <ldiskfs_lookup>: push %rbp Both look non-NULL and valid.

People

Assignee:: Alex Zhuravlev

Reporter:: Matt Ezell

Votes:: 0 Vote for this issue

Watchers:: 11 Start watching this issue

Dates

Created:: 12/Aug/15 8:39 PM

Updated:: 16/Jan/16 6:04 AM

Resolved:: 06/Oct/15 2:22 PM