Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.5.3
    • Labels: None
    • Environment: 2.5.3-2.6.32_431.29.2.el6.atlas.x86_64_g57d5785
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      This morning a production MDS hit an assertion:

      <0>[2551157.740086] LustreError: 14993:0:(osd_handler.c:4071:osd_ea_lookup_rec()) ASSERTION( dir->i_op != ((void *)0) && dir->i_op->lookup != ((void *)0) ) failed: 
      <0>[2551157.756253] LustreError: 14993:0:(osd_handler.c:4071:osd_ea_lookup_rec()) LBUG
      <4>[2551157.764766] Pid: 14993, comm: mdt01_094
      <4>[2551157.769360] 
      <4>[2551157.769361] Call Trace:
      <4>[2551157.774374]  [<ffffffffa0409895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4>[2551157.782474]  [<ffffffffa0409e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4>[2551157.789707]  [<ffffffffa0ca8fcf>] osd_index_ea_lookup+0x6ff/0x8a0 [osd_ldiskfs]
      <4>[2551157.798308]  [<ffffffffa0d0dde0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      <4>[2551157.805733]  [<ffffffffa088c7c0>] ? lod_index_lookup+0x0/0x30 [lod]
      <4>[2551157.813056]  [<ffffffffa088c7e5>] lod_index_lookup+0x25/0x30 [lod]
      <4>[2551157.820291]  [<ffffffffa0dd0daa>] __mdd_lookup+0x24a/0x440 [mdd]
      <4>[2551157.827325]  [<ffffffffa0dd1599>] mdd_lookup+0x39/0xe0 [mdd]
      <4>[2551157.833977]  [<ffffffffa0d3bee5>] ? mdt_name+0x35/0xc0 [mdt]
      <4>[2551157.840629]  [<ffffffffa0d44b09>] mdt_reint_open+0xb69/0x21a0 [mdt]
      <4>[2551157.847959]  [<ffffffffa0426376>] ? upcall_cache_get_entry+0x296/0x880 [libcfs]
      <4>[2551157.856570]  [<ffffffffa05c7a80>] ? lu_ucred+0x20/0x30 [obdclass]
      <4>[2551157.863705]  [<ffffffffa0d2d481>] mdt_reint_rec+0x41/0xe0 [mdt]
      <4>[2551157.870643]  [<ffffffffa0d12ed3>] mdt_reint_internal+0x4c3/0x780 [mdt]
      <4>[2551157.878254]  [<ffffffffa0d1345e>] mdt_intent_reint+0x1ee/0x410 [mdt]
      <4>[2551157.885669]  [<ffffffffa0d10c3e>] mdt_intent_policy+0x3ae/0x770 [mdt]
      <4>[2551157.893212]  [<ffffffffa06e42e5>] ldlm_lock_enqueue+0x135/0x980 [ptlrpc]
      <4>[2551157.901044]  [<ffffffffa070de2b>] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
      <4>[2551157.909336]  [<ffffffffa0d11106>] mdt_enqueue+0x46/0xe0 [mdt]
      <4>[2551157.916083]  [<ffffffffa0d15ada>] mdt_handle_common+0x52a/0x1470 [mdt]
      <4>[2551157.923701]  [<ffffffffa0d52595>] mds_regular_handle+0x15/0x20 [mdt]
      <4>[2551157.931144]  [<ffffffffa073cf25>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
      <4>[2551157.940128]  [<ffffffffa040a4ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      <4>[2551157.947452]  [<ffffffffa041b7c5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
      <4>[2551157.955380]  [<ffffffffa07358f9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
      <4>[2551157.963287]  [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      <4>[2551157.970142]  [<ffffffffa073f6ed>] ptlrpc_main+0xaed/0x1930 [ptlrpc]
      <4>[2551157.977487]  [<ffffffffa073ec00>] ? ptlrpc_main+0x0/0x1930 [ptlrpc]
      <4>[2551157.984809]  [<ffffffff8109abf6>] kthread+0x96/0xa0
      <4>[2551157.990580]  [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4>[2551157.998367]  [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4>[2551158.004229]  [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>[2551158.010272] 
      <0>[2551158.012746] Kernel panic - not syncing: LBUG
      

      Attachments

      Activity

      [LU-6996] osd_ea_lookup_rec assertion
          pjones Peter Jones added a comment -

          Landed for 2.8

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16157/
          Subject: LU-6996 osd-ldiskfs: handle stale OI mapping cache
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 7aaa680b7f22e7dfaac8af38b78d89164a94e842

          yujian Jian Yu added a comment -

          Hi Alex,

          Nasf is working on patches to handle the stale OI mapping cache, but he is unsure of the root cause of the original issue in this ticket. Could you please give some more suggestions here?

          gerrit Gerrit Updater added a comment -

          Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16157
          Subject: LU-6996 osd-ldiskfs: handle stale OI mapping cache
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: ea2bb60f6b85e53ea43ba240ef1c7e3ef809595c

          gerrit Gerrit Updater added a comment -

          Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16137
          Subject: LU-6996 osd: test b2_5 base
          Project: fs/lustre-release
          Branch: b2_5
          Current Patch Set: 1
          Commit: f78965727e7a86786cdb9fe7a15389721be1b1bd

          yong.fan nasf (Inactive) added a comment -

          On the server side, an RPC service thread may cache an OI mapping on its stack. That cached mapping becomes invalid if another RPC service thread removes the object in the meantime. If the first thread then uses the stale mapping and finds an inode that has been unlinked and reused by another object whose LMA has not been generated yet, the original osd_check_lma() would wrongly accept it as the expected local object. This is one possible cause of the failure in this ticket.

          That said, it is only a possible cause; from the given stack traces and logs we cannot say the patch will fix the failure completely. Please apply the patch and see what happens.

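          The stale-mapping race nasf describes can be sketched in plain C. This is an illustrative model only, not Lustre code: the `fid`, `inode_sim`, and `oi_cache` structs and the two `check_lma_*` functions below are simplified stand-ins for the real FID/LMA/OI machinery, invented here to show why a recycled inode with no LMA yet could pass the old check.

          ```c
          /*
           * Illustrative sketch only -- NOT actual Lustre code.  Models the
           * stale OI-mapping race: a service thread caches a FID->inode
           * mapping on its stack, the object is unlinked and its inode
           * number is reused before the new object's LMA is written, and a
           * lookup through the stale mapping must not accept the recycled
           * inode as the original object.
           */
          #include <stdbool.h>
          #include <stdio.h>

          struct fid { unsigned long long seq, oid; }; /* simplified FID    */
          struct inode_sim {                           /* simulated inode   */
              bool has_lma;                            /* LMA written yet?  */
              struct fid lma_fid;                      /* FID in the LMA    */
          };
          /* One cached OI mapping held on a thread's stack. */
          struct oi_cache { struct fid fid; unsigned long long ino; };

          static bool fid_eq(const struct fid *a, const struct fid *b)
          {
              return a->seq == b->seq && a->oid == b->oid;
          }

          /*
           * Old behaviour: an inode with no LMA is assumed to be the object
           * we looked up -- so a recycled inode whose LMA has not been
           * written yet is mistaken for the original object.
           */
          static bool check_lma_old(const struct inode_sim *inode,
                                    const struct fid *fid)
          {
              return !inode->has_lma || fid_eq(&inode->lma_fid, fid);
          }

          /*
           * Fixed behaviour (per the patch description): a missing LMA on a
           * cached mapping is suspect, so re-verify against the
           * authoritative OI table before trusting it.
           */
          static bool check_lma_fixed(const struct inode_sim *inode,
                                      const struct fid *fid,
                                      unsigned long long cached_ino,
                                      unsigned long long oi_ino)
          {
              if (inode->has_lma)
                  return fid_eq(&inode->lma_fid, fid);
              return cached_ino == oi_ino; /* stale cache => mismatch */
          }

          int main(void)
          {
              struct fid fid_a = { .seq = 0x200000401ULL, .oid = 0x1234 };

              /* Thread caches the mapping fid_a -> ino 42 on its stack. */
              struct oi_cache cache = { .fid = fid_a, .ino = 42 };

              /* Meanwhile the object is removed and ino 42 is reused by a
               * new object whose LMA has not been written yet. */
              struct inode_sim recycled = { .has_lma = false };

              /* The authoritative OI table no longer maps fid_a to 42. */
              unsigned long long oi_ino = 0; /* 0 = no mapping */

              printf("old check accepts recycled inode: %s\n",
                     check_lma_old(&recycled, &cache.fid) ? "yes" : "no");
              printf("fixed check accepts recycled inode: %s\n",
                     check_lma_fixed(&recycled, &cache.fid,
                                     cache.ino, oi_ino) ? "yes" : "no");
              return 0;
          }
          ```

          Run as written, the old check prints "yes" (it accepts the recycled inode) while the fixed check prints "no", which is the behavioural difference the patch is after.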
          Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16026
          Subject: LU-6996 osd-ldiskfs: handle stale OI mapping cache
          Project: fs/lustre-release
          Branch: b2_5
          Current Patch Set: 1
          Commit: 4bb351a4e4c76c8538532dcbfb7829dfea35aed0

          yujian Jian Yu added a comment -

          Thank you, Matt. Alex will look into the stack traces and ask you for help getting more logs if needed.

          ezell Matt Ezell added a comment -

          The output of 'bt -a' and 'ps' from crash. Let me know if you need the backtrace from any idle PIDs.

          pjones Peter Jones added a comment -

          Alex

          Could you please look into this issue?

          Thanks

          Peter

          People

            Assignee: bzzz Alex Zhuravlev
            Reporter: ezell Matt Ezell
            Votes: 0
            Watchers: 11
