[LU-10601] lov_stripe_lock()) ASSERTION( md->lsm_lock_owner == 0 ) failed Created: 04/Feb/18  Updated: 30/May/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hit this on current master-next, some sort of use after free it appears.

[119243.473752] Lustre: DEBUG MARKER: == racer test 1: racer on clients: centos-90.localnet DURATION=2700 ================================== 09:38:17 (1517668697)
[119292.154075] Lustre: lfs: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x200000401:0xfa:0x0], use llapi_layout_get_by_path()
[119298.778891] Lustre: DEBUG MARKER: racer test_1: @@@@@@ FAIL: generate lss conf (mds1)
[119363.272972] LustreError: 2970:0:(mdt_lvb.c:402:mdt_lvbo_fill()) lustre-MDT0000: small buffer size 448 for EA 472 (max_mdsize 472): rc = -34
[121710.001204] LustreError: 6319:0:(lov_obd.c:1308:lov_stripe_lock()) ASSERTION( md->lsm_lock_owner == 0 ) failed: 
[121710.002273] Slab corruption (Tainted: P           OE  ------------  ): kmalloc-192 start=ffff8800b7853140, len=192
[121710.002278] 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6d 6b 6b 6b 6b 6b  kkkkkkkkkkmkkkkk
[121710.002280] 010: 0e 00 00 00 6b 6b 6b 6b c0 04 ef 9e 02 88 ff ff  ....kkkk........
[121710.020795] LustreError: 6319:0:(lov_obd.c:1308:lov_stripe_lock()) LBUG
[121710.021476] Pid: 6319, comm: cat
[121710.022363] 
Call Trace:
[121710.023546]  [<ffffffffa020e7ce>] libcfs_call_trace+0x4e/0x60 [libcfs]
[121710.024252]  [<ffffffffa020e85c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[121710.024971]  [<ffffffffa0870c79>] lov_stripe_lock+0xa9/0xb0 [lov]
[121710.025674]  [<ffffffffa0871283>] lov_attr_get_raid0+0x103/0x1f0 [lov]
[121710.026383]  [<ffffffffa1478c04>] ? ll_inode_size_lock+0x24/0x60 [lustre]
[121710.027106]  [<ffffffffa087429e>] lov_attr_get_composite+0x15e/0x3d0 [lov]
[121710.027838]  [<ffffffffa087103e>] lov_attr_get+0x1e/0x60 [lov]
[121710.028705]  [<ffffffffa0354a28>] cl_object_attr_get+0x78/0x150 [obdclass]
[121710.029720]  [<ffffffffa1463c7f>] ll_merge_attr+0x15f/0x420 [lustre]
[121710.030566]  [<ffffffffa14654e2>] ll_file_write_iter+0x4b2/0x540 [lustre]
[121710.031423]  [<ffffffffa146564b>] ll_file_aio_write+0xdb/0x180 [lustre]
[121710.032751]  [<ffffffffa1465794>] ll_file_write+0xa4/0x170 [lustre]
[121710.033601]  [<ffffffff811ed36d>] vfs_write+0xbd/0x1e0
[121710.034398]  [<ffffffff811ede34>] SyS_write+0x84/0xf0
[121710.035191]  [<ffffffff8170fc49>] system_call_fastpath+0x16/0x1b
[121710.036001] 
[121710.036584] Kernel panic - not syncing: LBUG

Crashdump on onyx-68 in 192.168.123.190-2018-02-03-10:1*



 Comments   
Comment by Oleg Drokin [ 05/Feb/18 ]

I guess a somewhat related crash, also in current master-next:

[167021.604141] Lustre: DEBUG MARKER: racer test_1: @@@@@@ FAIL: generate lss conf (mds1)
[167032.628234] Lustre: lfs: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x200000401:0x165:0x0], use llapi_layout_get_by_path()
[167189.832873] 2[2699]: segfault at 8 ip 00007f4c554a8728 sp 00007fff5cb268b0 error 4 in ld-2.17.so[7f4c5549d000+20000]
[168793.524664] 7[25061]: segfault at 0 ip           (null) sp 00007ffd859bb338 error 14 in 7[400000+6000]
[169230.995638] 17[5237]: segfault at 8 ip 00007f5d2f872728 sp 00007ffe51ae4160 error 4 in ld-2.17.so[7f5d2f867000+20000]
[169542.140240] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
[169542.143263] IP: [<ffffffffa047d266>] lov_attr_get_composite+0x126/0x3d0 [lov]
[169542.146519] PGD 2c9d66067 PUD 2dce99067 PMD 0 
[169542.147640] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[169542.149186] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_zfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate jbd2 syscopyarea sysfillrect sysimgblt ttm ata_generic drm_kms_helper pata_acpi drm i2c_piix4 ata_piix virtio_balloon pcspkr serio_raw i2c_core virtio_blk virtio_console libata floppy nfsd ip_tables rpcsec_gss_krb5 [last unloaded: libcfs]
[169542.168711] CPU: 11 PID: 20181 Comm: cat Tainted: P           OE  ------------   3.10.0-debug #2
[169542.169825] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[169542.170393] task: ffff8802fd8f8b40 ti: ffff880094500000 task.ti: ffff880094500000
[169542.171505] RIP: 0010:[<ffffffffa047d266>]  [<ffffffffa047d266>] lov_attr_get_composite+0x126/0x3d0 [lov]
[169542.172686] RSP: 0018:ffff880094503c20  EFLAGS: 00010246
[169542.173210] RAX: 0000000000000240 RBX: 0000000000000000 RCX: 0000000000000003
[169542.174288] RDX: ffff8800a162df48 RSI: ffff88029c0e5e80 RDI: ffff88008e602f68
[169542.175349] RBP: ffff880094503c70 R08: 0000000000000004 R09: 0000000000000001
[169542.176408] R10: 0000000000000000 R11: 0000000000000246 R12: ffff88029c0e5e80
[169542.177494] R13: 0000000000000000 R14: ffff8800a162df48 R15: aaaaaaaaaaaaaaab
[169542.178581] FS:  00007fca37e29740(0000) GS:ffff88033e560000(0000) knlGS:0000000000000000
[169542.180166] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[169542.181020] CR2: 0000000000000004 CR3: 00000002fc329000 CR4: 00000000000006e0
[169542.183136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[169542.184749] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[169542.185986] Stack:
[169542.186711]  0000000000000000 0000000000000000 ffff88008e602f68 0000000000000000
[169542.187793]  00000000ec696e4b ffff88029c0e5e80 ffff8802ae257f48 ffff88008e602f68
[169542.188798]  ffff8800a162df48 ffff88025752f828 ffff880094503c80 ffffffffa047a03e
[169542.189768] Call Trace:
[169542.207741]  [<ffffffffa047a03e>] lov_attr_get+0x1e/0x60 [lov]
[169542.208495]  [<ffffffffa05c7a28>] cl_object_attr_get+0x78/0x150 [obdclass]
[169542.209243]  [<ffffffffa1450c7f>] ll_merge_attr+0x15f/0x420 [lustre]
[169542.209945]  [<ffffffffa14524e2>] ll_file_write_iter+0x4b2/0x540 [lustre]
[169542.210896]  [<ffffffffa145264b>] ll_file_aio_write+0xdb/0x180 [lustre]
[169542.211610]  [<ffffffffa1452794>] ll_file_write+0xa4/0x170 [lustre]
[169542.212320]  [<ffffffff811ed36d>] vfs_write+0xbd/0x1e0
[169542.212967]  [<ffffffff811ede34>] SyS_write+0x84/0xf0
[169542.213651]  [<ffffffff8170fc49>] system_call_fastpath+0x16/0x1b
[169542.214415] Code: 48 89 d8 48 c7 45 c8 00 00 00 00 48 29 c8 48 c1 f8 06 49 0f af c7 48 39 cb 41 89 c5 0f 82 95 02 00 00 41 39 c0 0f 86 28 02 00 00 <f6> 43 04 01 74 8c 49 8b 84 24 00 01 00 00 49 63 d5 48 8b 44 d0 
[169542.217540] RIP  [<ffffffffa047d266>] lov_attr_get_composite+0x126/0x3d0 [lov]
Comment by Alex Zhuravlev [ 31/Oct/22 ]

not exactly the same backtrace, but very similar, just hit on master:

[  610.918971] Lustre: DEBUG MARKER: == sanity-flr test 200a: racing IO, mirror extend and resync ========================================================== 04:27:05 (1667190425)
..
[  658.602971] LustreError: 36720:0:(lov_obd.c:1219:lov_stripe_unlock()) ASSERTION( md->lsm_lock_owner == get_current()->pid ) failed: 
[  658.603285] LustreError: 36720:0:(lov_obd.c:1219:lov_stripe_unlock()) LBUG
[  658.603408] Pid: 36720, comm: cat 4.18.0 #2 SMP Sun Oct 23 17:58:04 UTC 2022
[  658.603536] Call Trace TBD:
[  658.603599] [<0>] libcfs_call_trace+0x67/0x90 [libcfs]
[  658.603694] [<0>] lbug_with_loc+0x3e/0x80 [libcfs]
[  658.603791] [<0>] lov_stripe_unlock+0x54/0x60 [lov]
[  658.603886] [<0>] lov_attr_get_raid0+0x58/0x80 [lov]
[  658.603980] [<0>] lov_attr_get_composite+0xdd/0x400 [lov]
[  658.604112] [<0>] cl_object_attr_get+0x63/0x160 [obdclass]
[  658.604228] [<0>] ll_file_read_iter+0x274/0x930 [lustre]
[  658.604323] [<0>] new_sync_read+0xf7/0x120
[  658.604390] [<0>] vfs_read+0xa1/0x150
[  658.604457] [<0>] ksys_read+0x3d/0xa0
Generated at Sat Feb 10 02:36:32 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.