[LU-13254] crash at lu_object_find() in mdt_lvbo_fill() Created: 15/Feb/20  Updated: 25/Feb/20  Resolved: 25/Feb/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Major
Reporter: Mikhail Pershin Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

A few recent runs in Oleg's testing showed this trace:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000026
IP: [<ffffffffa033793d>] lu_object_find+0xd/0x20 [obdclass]
PGD 0
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) crc32_generic libcfs(OE) dm_flakey dm_mod crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 pcspkr squashfs i2c_piix4 i2c_core binfmt_misc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi serio_raw ata_piix libata
CPU: 0 PID: 8810 Comm: mdt_out00_001 Kdump: loaded Tainted: P           OE  ------------   3.10.0-7.7-debug #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800a5f94780 ti: ffff8800aa2f0000 task.ti: ffff8800aa2f0000
RIP: 0010:[<ffffffffa033793d>]  [<ffffffffa033793d>] lu_object_find+0xd/0x20 [obdclass]
RSP: 0018:ffff8800aa2f3b58  EFLAGS: 00010246
RAX: 0000000000000006 RBX: ffff8800a5f86448 RCX: 0000000000000000
RDX: ffff8800a5f86448 RSI: ffff8800c05cf000 RDI: ffff88009c38a400
RBP: ffff8800aa2f3b58 R08: ffff880106431000 R09: ffff8800b5ee8080
R10: ffff8800a5f86000 R11: ffff8800aa2f3876 R12: ffff88009c38a400
R13: ffff8800c05cf000 R14: ffff8800ac23fe70 R15: ffff88009c38a400
FS:  0000000000000000(0000) GS:ffff88011e200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000026 CR3: 00000000a2856000 CR4: 00000000000006f0
Call Trace:
 [<ffffffffa0cd7e9b>] mdt_object_find+0x4b/0x170 [mdt]
 [<ffffffffa0d10dc0>] mdt_lvbo_fill+0x530/0xa80 [mdt]
 [<ffffffffa05f1f5d>] ldlm_handle_enqueue0+0x5cd/0x15f0 [ptlrpc]
 [<ffffffffa061ba50>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
 [<ffffffffa067a292>] tgt_enqueue+0x62/0x210 [ptlrpc]
 [<ffffffffa0682f55>] tgt_request_handle+0x965/0x1620 [ptlrpc]
 [<ffffffffa020bdde>] ? libcfs_nid2str_r+0xfe/0x130 [lnet]
 [<ffffffffa0625f60>] ptlrpc_server_handle_request+0x250/0xb10 [ptlrpc]
 [<ffffffff810c6941>] ? __wake_up_common_lock+0x91/0xc0
 [<ffffffff810c6250>] ? sched_feat_set+0xf0/0xf0
 [<ffffffffa062a1c0>] ptlrpc_main+0xcb0/0x1cb0 [ptlrpc]
 [<ffffffff810c665d>] ? finish_task_switch+0x5d/0x1b0
 [<ffffffffa0629510>] ? ptlrpc_register_service+0xff0/0xff0 [ptlrpc]
 [<ffffffff810b8254>] kthread+0xe4/0xf0
 [<ffffffff810b8170>] ? kthread_create_on_node+0x140/0x140
 [<ffffffff817e5ddd>] ret_from_fork_nospec_begin+0x7/0x21
 [<ffffffff810b8170>] ? kthread_create_on_node+0x140/0x140

The problem is caused by incorrectly initialized mdt_thread_info values, mti_mdt in particular. Interestingly, mdt_lvbo_fill() needs none of them; it uses only a couple of fields as temporary storage for a FID and a data buffer.
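
To illustrate the failure mode described above, here is a minimal user-space sketch, not the actual Lustre code. All names (demo_device, demo_thread_info, demo_slot, demo_fill_stale, demo_fill_explicit) are hypothetical stand-ins: ti_dev models mti_mdt, and ti_fid and ti_buf model the scratch FID and data buffer. It assumes one info slot is reused by every request served on a thread, and contrasts a path that trusts the leftover device pointer with one that takes the device explicitly and uses the slot only as scratch space.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; these are NOT the real Lustre structures. */
struct demo_device {
	int index;
};

struct demo_fid {
	unsigned long long seq;
	unsigned int oid;
};

struct demo_thread_info {
	struct demo_device *ti_dev;      /* models mti_mdt: valid only after init */
	struct demo_fid     ti_fid;      /* scratch FID storage */
	char                ti_buf[64];  /* scratch data buffer */
};

/* One slot reused by every request handled on the same thread. */
static struct demo_thread_info demo_slot;

/* Regular request paths set the device pointer before using the slot. */
static void demo_info_init(struct demo_thread_info *info,
			   struct demo_device *dev)
{
	info->ti_dev = dev;
}

/*
 * Unsafe pattern: this path never called demo_info_init(), so ti_dev is
 * whatever the previous request left behind (or NULL on a fresh slot).
 */
static int demo_fill_stale(struct demo_thread_info *info)
{
	return info->ti_dev ? info->ti_dev->index : -1;
}

/*
 * Safer pattern: the device is passed in explicitly and the slot is
 * touched only as temporary storage for the FID and the buffer.
 */
static int demo_fill_explicit(struct demo_device *dev,
			      struct demo_thread_info *info,
			      unsigned long long seq)
{
	assert(dev != NULL);
	info->ti_fid.seq = seq;
	info->ti_fid.oid = 1;
	memset(info->ti_buf, 0, sizeof(info->ti_buf));
	return dev->index;
}
```

In this model, a request meant for one device that goes through demo_fill_stale() silently gets the index of whichever device the previous request initialized, while demo_fill_explicit() always answers for the device it was given.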



 Comments   
Comment by Gerrit Updater [ 15/Feb/20 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37592
Subject: LU-13254 mdt: clear mti_mdt in mdt_thread_info_fini()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9309567ede34573cb2408cd85064dc4b6de22a27

Comment by Mikhail Pershin [ 15/Feb/20 ]

The patch clears mti_mdt in mdt_thread_info_fini(), so it may expose new NULL mdt issues in other places that use an uninitialized thread info the way mdt_lvbo_fill() did. I didn't find any, but they are still possible.
Interestingly, such issues can go unnoticed or unexplained for a long time, because a stale mti_mdt usually points to the same MDT, or to a different MDT in a DNE configuration, and either way it is a valid MDT, so there is no immediate error or crash. Meanwhile, data from the wrong MDT is used and causes odd, hard-to-explain effects. I think long-standing issues like the one below are also a result of using parameters from the wrong MDT:

LustreError: 8097:0:(mdt_lvb.c:163:mdt_lvbo_fill()) lustre-MDT0000: expected 944 actual 416.
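
The trade-off described in this comment can be modeled with a small user-space sketch. This is an illustration under stated assumptions, not the merged patch: sketch_mdt, sketch_info, sketch_init, sketch_fini and sketch_use are hypothetical names, and the mdt field stands in for mti_mdt. Clearing the pointer at fini time turns a silent use of a stale (but valid) MDT into something a check can detect; kernel code without such a check would dereference NULL instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real Lustre structures. */
struct sketch_mdt {
	int id;
};

struct sketch_info {
	struct sketch_mdt *mdt;  /* models mti_mdt */
};

/* Per-request setup: bind the info to the MDT serving this request. */
static void sketch_init(struct sketch_info *info, struct sketch_mdt *mdt)
{
	assert(mdt != NULL);
	info->mdt = mdt;
}

/* Per-request teardown: drop the pointer so it cannot leak into the next use. */
static void sketch_fini(struct sketch_info *info)
{
	info->mdt = NULL;
}

/*
 * A consumer that may run without a preceding sketch_init().
 * Returns the MDT id, or -1 if the info was not initialized.
 */
static int sketch_use(const struct sketch_info *info)
{
	if (info->mdt == NULL)
		return -1;
	return info->mdt->id;
}
```

Without the reset in sketch_fini(), a later sketch_use() with no init would keep returning the id of the previous request's MDT, which matches the "valid but wrong MDT" behavior described above.
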
Comment by Gerrit Updater [ 25/Feb/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37592/
Subject: LU-13254 mdt: clear mti_mdt in mdt_thread_info_fini()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4b3748ae6f8859ee56a142bdf03b8006e888b868

Comment by Peter Jones [ 25/Feb/20 ]

Landed for 2.14
