[LU-12741] crash in osd_object_delete at end of sanity Created: 10/Sep/19  Updated: 24/Sep/20  Resolved: 14/Dec/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.3
Fix Version/s: Lustre 2.14.0, Lustre 2.12.4

Type: Bug
Priority: Major
Reporter: Oleg Drokin
Assignee: Mikhail Pershin
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-11204 mdt_reint_unlink->lu_object_put() crash Resolved
Related
is related to LU-8992 osd_object_release() LBUG Resolved
is related to LU-13980 Kernel panic on OST after removing fi... Open
Severity: 3

 Description   

It looks like something broke in master/b2_12 relatively recently.

Typical crash:

[14473.601088] Lustre: DEBUG MARKER: == sanity test complete, duration 5433 sec =========================================================== 03:14:06 (1567926846)
[14492.366277] BUG: unable to handle kernel NULL pointer dereference at 0000000000000c80
[14492.370164] IP: [<ffffffffa0b8e854>] osd_object_delete+0x1f4/0x2a0 [osd_ldiskfs]
[14492.372007] PGD 0 
[14492.372787] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
[14492.373666] Modules linked in: dm_flakey dm_mod lustre(OE) mdt(OE) mdd(OE) mdc(OE) obdecho(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lfsck(OE) mgc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) brd ext4 loop zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) jbd2 mbcache crc_t10dif crct10dif_generic crct10dif_common virtio_console pcspkr i2c_piix4 virtio_balloon binfmt_misc ip_tables rpcsec_gss_krb5 ata_generic pata_acpi drm_kms_helper ttm drm ata_piix drm_panel_orientation_quirks libata floppy virtio_blk serio_raw i2c_core [last unloaded: mdt]
[14492.388104] CPU: 6 PID: 8302 Comm: ldlm_cn03_003 Kdump: loaded Tainted: P        W  OE  ------------   3.10.0-7.6-debug #2
[14492.389885] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[14492.390901] task: ffff88003c6209c0 ti: ffff88004139c000 task.ti: ffff88004139c000
[14492.392652] RIP: 0010:[<ffffffffa0b8e854>]  [<ffffffffa0b8e854>] osd_object_delete+0x1f4/0x2a0 [osd_ldiskfs]
[14492.394473] RSP: 0018:ffff88004139fa30  EFLAGS: 00010246
[14492.395341] RAX: 0000000000000000 RBX: 0000000000000c80 RCX: 0000000000000000
[14492.396537] RDX: ffff88006430ae00 RSI: ffffffffa0be0160 RDI: ffff880082322b40
[14492.397291] RBP: ffff88004139fa60 R08: 0000000000000000 R09: d8c8000000000000
[14492.398095] R10: ffff8800bb75e000 R11: ffff8800bb75e7c8 R12: 0000000000000000
[14492.398993] R13: ffff880115978e00 R14: ffff880082322b40 R15: 0000000000000000
[14492.399675] FS:  0000000000000000(0000) GS:ffff880139780000(0000) knlGS:0000000000000000
[14492.401123] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[14492.402167] CR2: 0000000000000c80 CR3: 0000000130b34000 CR4: 00000000000006e0
[14492.422237] Call Trace:
[14492.423115]  [<ffffffffa036d0a5>] lu_object_free.isra.31+0x65/0x170 [obdclass]
[14492.424986]  [<ffffffffa0370e42>] lu_object_put+0xc2/0x3c0 [obdclass]
[14492.426016]  [<ffffffffa0d7fea0>] ? mdt_punch_hpreq_fini+0x10/0x10 [mdt]
[14492.427067]  [<ffffffffa0d7ff51>] ldlm_dom_discard_cp_ast+0xb1/0x2b0 [mdt]
[14492.428115]  [<ffffffffa05f25c6>] ldlm_work_cp_ast_lock+0xa6/0x1d0 [ptlrpc]
[14492.429196]  [<ffffffffa06393b0>] ptlrpc_set_wait+0x70/0x790 [ptlrpc]
[14492.430253]  [<ffffffffa062fe6d>] ? ptlrpc_prep_set+0x5d/0x290 [ptlrpc]
[14492.431457]  [<ffffffffa0350279>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[14492.432481]  [<ffffffff810b5a70>] ? __init_waitqueue_head+0x20/0x30
[14492.433523]  [<ffffffffa062ff07>] ? ptlrpc_prep_set+0xf7/0x290 [ptlrpc]
[14492.434526]  [<ffffffffa05f7e15>] ldlm_run_ast_work+0xd5/0x380 [ptlrpc]
[14492.435592]  [<ffffffffa05f927f>] __ldlm_reprocess_all+0xff/0x340 [ptlrpc]
[14492.436653]  [<ffffffffa05f94d0>] ldlm_reprocess_all+0x10/0x20 [ptlrpc]
[14492.437708]  [<ffffffffa06219b4>] ldlm_handle_convert0+0x2f4/0x450 [ptlrpc]
[14492.438727]  [<ffffffffa0621ffb>] ldlm_cancel_handler+0x29b/0x590 [ptlrpc]
[14492.439812]  [<ffffffffa06524b6>] ptlrpc_server_handle_request+0x256/0xad0 [ptlrpc]
[14492.441715]  [<ffffffffa06564a1>] ptlrpc_main+0xb91/0x2110 [ptlrpc]
[14492.442734]  [<ffffffff810c32ed>] ? finish_task_switch+0x5d/0x1b0
[14492.443744]  [<ffffffff817b6cd0>] ? __schedule+0x410/0xa00
[14492.445222]  [<ffffffffa0655910>] ? ptlrpc_register_service+0xfb0/0xfb0 [ptlrpc]
[14492.447031]  [<ffffffff810b4ed4>] kthread+0xe4/0xf0
[14492.447974]  [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140
[14492.448960]  [<ffffffff817c4c77>] ret_from_fork_nospec_begin+0x21/0x21
[14492.449957]  [<ffffffff810b4df0>] ? kthread_create_on_node+0x140/0x140
[14492.450934] Code: e0 03 00 0f 1f 40 00 4d 85 ed 74 c0 4c 89 f7 48 c7 c6 60 01 be a0 e8 cc e7 7d ff 49 89 c4 44 89 f8 31 c9 49 8d 9c 24 80 0c 00 00 <49> 89 84 24 80 0c 00 00 4c 89 ee 4c 89 f7 48 89 da e8 86 16 f0 

The crash is always in the DOM discard path (ldlm_dom_discard_cp_ast).
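
The register dump in the oops already pins down the faulting access, and a few lines of Python make the arithmetic explicit. This is a reading aid, not part of the original report: the `parse_regs` helper is hypothetical, and the register values are copied from the oops. The bytes around the `<49>` marker in the Code: line decode to `lea 0xc80(%r12),%rbx` followed by `mov %rax,0xc80(%r12)`, so the fault is a store to offset 0xc80 from whatever pointer sits in R12.

```python
import re

# Register lines copied from the oops (timestamps dropped).
OOPS = """
RAX: 0000000000000000 RBX: 0000000000000c80 RCX: 0000000000000000
RBP: ffff88004139fa60 R08: 0000000000000000 R09: d8c8000000000000
R10: ffff8800bb75e000 R11: ffff8800bb75e7c8 R12: 0000000000000000
CR2: 0000000000000c80 CR3: 0000000130b34000 CR4: 00000000000006e0
"""

def parse_regs(text):
    """Return {register name: integer value} for every 'NAME: hex' pair."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}

regs = parse_regs(OOPS)
OFFSET = 0xc80  # displacement encoded in the faulting instruction

# lea 0xc80(%r12),%rbx  -> RBX holds the computed address
# mov %rax,0xc80(%r12)  -> the store that faulted; CR2 holds the fault address
base = regs["R12"]
print(f"base pointer (R12)  = {base:#x}")
print(f"fault address (CR2) = {regs['CR2']:#x}")
print("NULL base + offset matches CR2:",
      base == 0 and base + OFFSET == regs["CR2"] == regs["RBX"])
```

With R12 equal to zero, the store targets address 0xc80, which is exactly CR2, and the error code 0002 marks it as a write. That is consistent with a NULL structure pointer whose field at offset 0xc80 is being written; which structure that is cannot be read off the dump alone.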

First master-next occurrence: githash f8c100f.

List of patches:

f8c100f LU-12575 build: add ibutils2 for MOFED build
0ecb29f LU-12560 tests: Use full path for test-groups
4f4f90b LU-12400 ptlrpc: Sun RPC changes for RCU locking
5a19817 LU-12527 utils: Make lustre_user.h c++-legal
f0ba5de LU-12472 tests: update sanity-krb5.sh
adfa543 LU-4315 doc: split lctl get_param and set_param man pages
a57ede6 LU-12355 ldiskfs: Remove old map blocks support
87f6b68 LU-12405 lnet: Oracle OFED extensions default to on
04cdd15 LU-12343 osc: Fix dom handling in weight_ast
3d1920a LU-8066 mdt: migrate procfs files to sysfs
bc02a4e LU-12075 mdt: commit migrate transaction with locks held
27cd9fd LU-10070 test: llapi_layout_test enhancements
2490ed4 LU-11617 mdc: fix possible deadlock in chlg_open()
860dbcb LU-12559 ptlrpc: Hold imp lock for idle reconnect
ce3ccbd LU-6142 tests: Fix style issues for write_disjoint.c
6012e3e LU-6142 tests: Fix style issues for write_append_truncate.c
16792c9 LU-6142 tests: Fix style issues for lp_utils.c
ac153a9 LU-10094 mdc: dir page ldp_hash_end mistakenly adjusted
b598d82 LU-12523 ptlrpc: Don't get jobid in body_v2
b2f2bfc LU-6202 utils: remove obsolete l_ioctl2() wrapper
42fdd2f LU-12440 lnet: Misleading error from lnet_is_health_check
aec7b1a LU-12439 lnet: Convert noisy timeout error to cdebug
93419c4 LU-11023 quota: remove quota pool ID

First b2_12-next occurrence: githash 8ec5896. List of patches:

3674d393d5 LU-12608 kernel: kernel update RHEL7.6 [3.10.0-957.27.2.el7]
fe03ca414f LU-11761 fld: let's caller to retry FLD_QUERY
316310cbb4 LU-12387 tests: Validate l_tunedisk max_sectors_kb tuning
0edb0a6951 LU-8130 libcfs: don't include rhashtable if unavailable
9ac11632fb LU-12660 kernel: kernel update SLES12 SP4 [4.12.14-95.29.1]
3a35d97aee LU-12539 build: pass --with-o2ib when building deb packages
e2ac8c3269 LU-10094 mdc: dir page ldp_hash_end mistakenly adjusted
cbb6d8c8ef LU-12586 lov: Correct write_intent end for trunc
a1e888dcbc LU-10756 ptlrpc: change IMPORT_SET_* macros into real functions
bbf40d8c71 LU-11537 osp: avoid nested transaction
61c93c46c4 LU-12343 osc: Fix dom handling in weight_ast

I guess LU-12343 is the common factor here?
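
The "common factor" reasoning can be checked mechanically by intersecting the ticket IDs of the two patch lists. The sketch below is only an illustration: the `tickets` helper is hypothetical, and the IDs are copied from the lists above with hashes and subjects dropped for brevity.

```python
import re

# Ticket IDs copied from the two patch lists above, in the order listed.
MASTER_NEXT = """
LU-12575 LU-12560 LU-12400 LU-12527 LU-12472 LU-4315 LU-12355 LU-12405
LU-12343 LU-8066 LU-12075 LU-10070 LU-11617 LU-12559 LU-6142 LU-6142
LU-6142 LU-10094 LU-12523 LU-6202 LU-12440 LU-12439 LU-11023
"""
B2_12_NEXT = """
LU-12608 LU-11761 LU-12387 LU-8130 LU-12660 LU-12539 LU-10094 LU-12586
LU-10756 LU-11537 LU-12343
"""

def tickets(text):
    """Return the set of distinct LU-<number> ticket IDs found in text."""
    return set(re.findall(r"LU-\d+", text))

# Tickets that landed in both the master-next and b2_12-next batches.
common = sorted(tickets(MASTER_NEXT) & tickets(B2_12_NEXT))
print(common)
```

On these lists the intersection has two entries, LU-10094 (an mdc directory page fix) and LU-12343. Of the two, only LU-12343 touches DOM handling, which matches the crash path.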



 Comments   
Comment by Patrick Farrell (Inactive) [ 10/Sep/19 ]

Oleg,

Can you provide some links to your crash/test failure stuff from your rig?

Comment by Patrick Farrell (Inactive) [ 10/Sep/19 ]

Hmm, Oleg, this really looks like it should be linked to https://jira.whamcloud.com/browse/LU-11359 rather than LU-12343.

How sure are you it started with LU-12343...?

Comment by Mikhail Pershin [ 10/Sep/19 ]

This may be related to LU-11359, but this crash started to show up only recently, so something was changed by one of the recent patches. It is probably the LU-12343 patch, because it landed on both b2_12 and master when this issue appeared. I don't think the LU-12343 patch does anything wrong in itself, but it may be the trigger for some missed race in lu_object_put(). I will also re-check LU-11359.

Comment by Patrick Farrell (Inactive) [ 10/Sep/19 ]

Mike,

Do you want to take this one, then?  I was going to look at it but I'm happy to let you do it.

Comment by Oleg Drokin [ 10/Sep/19 ]

pfarrell, as you can see, LU-12343 is the common patch between the patch set where this failure first appeared on master and the one where it first appeared on b2_12 (otherwise mostly different sets of patches).

Also it's probably not a pure coincidence that your patch deals with DOM AST logic and the failure is in a DOM AST path.

Comment by Peter Jones [ 11/Sep/19 ]

Mike is looking into this

Comment by Peter Jones [ 18/Sep/19 ]

Seems to have stopped occurring on master. Likely a duplicate of LU-11204

Comment by Alex Zhuravlev [ 07/Nov/19 ]

Actually, this still happens on fresh master:
Lustre: DEBUG MARKER: == sanity test complete, duration 4910 sec =========================================================== 21:34:43 (1573144483)
LustreError: 23081:0:(osd_handler.c:2142:osd_object_delete()) 000000008c85a7cf after refill
LustreError: 23081:0:(osd_handler.c:2143:osd_object_delete()) LBUG
lbug_with_loc+0x79/0x80 [libcfs]
? osd_object_delete+0x319/0x320 [osd_ldiskfs]
? lu_object_free.isra.0+0x44/0x140 [obdclass]
? lu_object_put+0x230/0x370 [obdclass]
? mdt_punch_hpreq_fini+0x10/0x10 [mdt]
? ldlm_dom_discard_cp_ast+0xe0/0x240 [mdt]
? ldlm_work_cp_ast_lock+0xec/0x1d0 [ptlrpc]
? ptlrpc_set_wait+0x4a/0x760 [ptlrpc]
? is_module_address+0xc/0x20
? static_obj+0x31/0x50
? __lockdep_init_map+0x45/0x180
? __raw_spin_lock_init+0x28/0x50
? ldlm_run_ast_work+0xbf/0x3b0 [ptlrpc]

This is because info is NULL at that point; calling lu_env_refill() helps.
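
A toy model may help picture the failure described here. It is only a sketch in Python, not Lustre code: `ToyContext`, `register_key`, and `refill` are hypothetical names, and the premise that the per-thread slot was empty because its key was registered after the context was set up is an assumption made for illustration, not something the ticket states. Only the observation itself (info is NULL, a refill fixes it) comes from the comment above.

```python
# Toy model: a per-thread context whose slots are filled from a key registry.
REGISTRY = {}  # key name -> factory producing that key's per-thread data

def register_key(name, factory):
    """Register a key; contexts created earlier do NOT see it until refilled."""
    REGISTRY[name] = factory

class ToyContext:
    def __init__(self):
        self.slots = {}
        self.refill()  # fill slots for the keys known right now

    def refill(self):
        """Allocate any slot whose key was registered after the last fill."""
        for name, factory in REGISTRY.items():
            if name not in self.slots:
                self.slots[name] = factory()

    def get(self, name):
        # None plays the role of a NULL info pointer.
        return self.slots.get(name)

ctx = ToyContext()                                 # context is set up first
register_key("osd_info", lambda: {"scratch": 0})   # key appears later (assumed)

stale = ctx.get("osd_info")  # None: dereferencing it would be the crash
ctx.refill()                 # refresh the context before handling a request
fresh = ctx.get("osd_info")
print(stale, fresh)
```

In this model the lookup returns nothing until `refill` runs, after which the slot is populated. The patch that later landed, "ptlrpc: do lu_env_refill for new request", appears from its subject line to follow the same idea: refresh the environment when a new request is handled instead of relying on what was filled in earlier.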

Comment by Gerrit Updater [ 08/Nov/19 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36714
Subject: LU-12741 ptlrpc: do lu_env_refill for new request
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fb60e3076e1a5769e8051b2b268a7797e4f2dab5

Comment by Gerrit Updater [ 14/Dec/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36714/
Subject: LU-12741 ptlrpc: do lu_env_refill for new request
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3f304b75d24aea0075415affa0c0bef004ef012c

Comment by Mikhail Pershin [ 14/Dec/19 ]

patch landed

Comment by Mikhail Pershin [ 14/Dec/19 ]

Reopening; this may still require backporting.

Comment by Peter Jones [ 14/Dec/19 ]

Landed for 2.14. Backporting will be tracked separately.

Comment by Gerrit Updater [ 16/Dec/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37036
Subject: LU-12741 ptlrpc: do lu_env_refill for new request
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: f6041a258b595d3c20bcd6a6994a1c284d094413

Comment by Gerrit Updater [ 20/Dec/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37036/
Subject: LU-12741 ptlrpc: do lu_env_refill for new request
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 0d79ba70e44f44d0334253f8c39725ba3d3f36e7
