[LU-16263] crash in sanity 273b osc_page_delete LBUG Created: 25/Oct/22  Updated: 13/Sep/23  Resolved: 21/Mar/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
Related
is related to LU-12752 osc_page.c:osc_page_delete() ASSERTIO... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This crash first appeared in the master-next build of Oct 10 and has recurred regularly since then, though not very often.

[338005.158491] Lustre: DEBUG MARKER: == sanity test 273b: DoM: race writeback and object destroy ===================== 18:58:27 (1666652307)
[338005.392174] LustreError: 17734:0:(osc_cache.c:2484:osc_teardown_async_page()) extent ffff8800400af8e8@{[0 -> 255/1023], [2|0|-|cache|wi|ffff8800badddb28], [1073152|256|+|-|ffff8802276bfc00|1024|          (null)]} trunc at 0.
[338005.394627] LustreError: 17734:0:(osc_cache.c:2484:osc_teardown_async_page()) ### extent: ffff8800400af8e8 ns: lustre-OST0003-osc-ffff8803228cca88 lock: ffff8802276bfc00/0xd5e49483fedfcc95 lrc: 2/0,0 mode: PW/PW res: [0x51a7a:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->1048575) gid 0 flags: 0x800020000000000 nid: local remote: 0xd5e49483fedfcc9c expref: -99 pid: 17734 timeout: 0 lvb_type: 1
[338005.397689] LustreError: 17734:0:(osc_page.c:174:osc_page_delete()) page@ffff8802b0a34e18[2 ffff880309e4ec80 5 1           (null)]
[338005.399302] LustreError: 17734:0:(osc_page.c:174:osc_page_delete()) vmpage @ffffea0005eb7f80 2fffff0000083d 3:0 ffff8802b0a34e18 256 lru
[338005.400969] LustreError: 17734:0:(osc_page.c:174:osc_page_delete()) osc-page@ffff8802b0a34e78 0: 1< 2 + - > 2< 0 0 4096 0x0 0x40420 |           (null) ffff88028f868710 ffff8800badddb28 > 3< 0 0 > 4< 0 0 8 97349632 - | - - + - > 5< - - + - | 0 - | 256 - ->
[338005.403820] LustreError: 17734:0:(osc_page.c:174:osc_page_delete()) end page@ffff8802b0a34e18
[338005.405381] LustreError: 17734:0:(osc_page.c:174:osc_page_delete()) Trying to teardown failed: -16
[338005.406554] LustreError: 17734:0:(osc_page.c:175:osc_page_delete()) ASSERTION( 0 ) failed: 
[338005.407645] LustreError: 17734:0:(osc_page.c:175:osc_page_delete()) LBUG
[338005.408234] Pid: 17734, comm: multiop 3.10.0-7.9-debug #2 SMP Tue Feb 1 18:17:58 EST 2022
[338005.409142] Call Trace:
[338005.409630] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
[338005.410130] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
[338005.410607] [<0>] osc_page_delete+0x48d/0x500 [osc]
[338005.411107] [<0>] __cl_page_delete+0x82/0x320 [obdclass]
[338005.411593] [<0>] cl_page_delete+0x33/0x110 [obdclass]
[338005.412078] [<0>] ll_invalidatepage+0x7f/0x170 [lustre]
[338005.412568] [<0>] do_invalidatepage_range+0x71/0x80
[338005.413050] [<0>] truncate_inode_page+0x77/0x80
[338005.413652] [<0>] truncate_inode_pages_range+0x1ea/0x7d0
[338005.414215] [<0>] truncate_inode_pages_final+0x4c/0x60
[338005.414791] [<0>] ll_truncate_inode_pages_final+0x21/0xe0 [lustre]
[338005.415385] [<0>] ll_delete_inode+0x38/0x150 [lustre]
[338005.415929] [<0>] evict+0xaf/0x180
[338005.416481] [<0>] iput+0xf5/0x180
[338005.416977] [<0>] __dentry_kill+0x148/0x1b0
[338005.417523] [<0>] dput+0xca/0x1c0
[338005.418054] [<0>] __fput+0x1a0/0x240
[338005.418624] [<0>] ____fput+0xe/0x10
[338005.419155] [<0>] task_work_run+0xb5/0xf0
[338005.419584] [<0>] do_notify_resume+0x92/0xb0
[338005.420091] [<0>] int_signal+0x12/0x17
[338005.420650] [<0>] 0xfffffffffffffffe
[338005.421218] Kernel panic - not syncing: LBUG
[338005.421744] CPU: 1 PID: 17734 Comm: multiop Kdump: loaded Tainted: P        W  OE  ------------   3.10.0-7.9-debug #2
[338005.422211] Hardware name: Red Hat KVM, BIOS 1.15.0-1.module_el8.6.0+1087+b42c8331 04/01/2014
[338005.422211] Call Trace:
[338005.422211]  [<ffffffff817d93f8>] dump_stack+0x19/0x1b
[338005.422211]  [<ffffffff817d24d5>] panic+0xe8/0x20d
[338005.422211]  [<ffffffff817e324e>] ? _raw_spin_unlock+0xe/0x20
[338005.422211]  [<ffffffffa022c56b>] lbug_with_loc+0x9b/0xa0 [libcfs]
[338005.422211]  [<ffffffffa08d22dd>] osc_page_delete+0x48d/0x500 [osc]
[338005.422211]  [<ffffffffa03e2612>] __cl_page_delete+0x82/0x320 [obdclass]
[338005.422211]  [<ffffffffa03e28e3>] cl_page_delete+0x33/0x110 [obdclass]
[338005.422211]  [<ffffffffa171b30f>] ll_invalidatepage+0x7f/0x170 [lustre]
[338005.422211]  [<ffffffff811c5b41>] do_invalidatepage_range+0x71/0x80
[338005.422211]  [<ffffffff811c5be7>] truncate_inode_page+0x77/0x80
[338005.422211]  [<ffffffff811c5e1a>] truncate_inode_pages_range+0x1ea/0x7d0
[338005.422211]  [<ffffffff811c646c>] truncate_inode_pages_final+0x4c/0x60
[338005.422211]  [<ffffffffa16f5121>] ll_truncate_inode_pages_final+0x21/0xe0 [lustre]
[338005.422211]  [<ffffffffa16f5648>] ll_delete_inode+0x38/0x150 [lustre]
[338005.422211]  [<ffffffff81264a2f>] evict+0xaf/0x180
[338005.422211]  [<ffffffff81264e65>] iput+0xf5/0x180
[338005.422211]  [<ffffffff8125ff68>] __dentry_kill+0x148/0x1b0
[338005.422211]  [<ffffffff8126072a>] dput+0xca/0x1c0
[338005.422211]  [<ffffffff81248000>] __fput+0x1a0/0x240
[338005.422211]  [<ffffffff8124817e>] ____fput+0xe/0x10
[338005.422211]  [<ffffffff810b69b5>] task_work_run+0xb5/0xf0
[338005.422211]  [<ffffffff8102ccb2>] do_notify_resume+0x92/0xb0
[338005.422211]  [<ffffffff817ee363>] int_signal+0x12/0x17 

It is always test 273b.
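For readers decoding the log: the "Trying to teardown failed: -16" line carries a negative errno. A minimal sketch, assuming the usual Linux errno numbering, that maps such a return code to its symbolic name (decode_kernel_rc is a hypothetical helper, not part of Lustre):

```python
import errno
import os

def decode_kernel_rc(rc: int) -> str:
    """Map a negative kernel-style return code to its symbolic errno name."""
    # errno.errorcode maps positive errno numbers to names such as "EBUSY".
    return errno.errorcode.get(-rc, f"unknown errno {-rc}")

# The value reported by osc_page_delete() in the log above.
print(decode_kernel_rc(-16), "-", os.strerror(16))
```

On Linux this reports EBUSY for -16, which suggests the page could not be torn down because something still considered it in use.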

https://knox.linuxhacker.ru/crashdb_ui_external.py.cgi?newid=67744 tracks all such crashes

The list of patches in that master-next release:

a99fcef712 LU-15721 llite: only statfs for projid if PROJINHERIT set
4f3e709b69 LU-16219 tests: syntax error fix
c90e7da475 LU-16198 tests: increase margin for sanity/33hh
fe12cf7af6 LU-16200 tests: test_32[f,g]: specify blocksize explicitly
e0a8f3f60d LU-16180 ptlrpc: reduce lock contention in ptlrpc_free_committed
873276be1c LU-16076 utils: enhance 'lfs check' command
9f0b2c2e23 LU-16044 osd: discard pagecache in truncate's declaration
90307c2721 LU-15451 sec: retry ro mount if read-only flag set
b871286a4e LU-15619 osc: Remove oap lock
b918031786 LU-15014 osc: Fix possible null pointer
98c70df982 LU-13364 utils: fix bad output for lnetctl import --show
1c095e3a80 LU-14165 utils: llog_reader: display changleog_user records
b89c2797eb LU-16139 statahead: avoid to block ptlrpcd interpret context
ff59bd3a4a LU-6142 obdclass: change some foo0() to __foo()
0572fc241a LU-10391 lnet: support IPv6 in lnet_inet_enumerate()
5f5523beb1 LU-16002 ptlrpc: reduce pinger eviction time
a1dfff48db LU-16211 o2iblnd: Avoid NULL md deref
3bb1366fe8 LU-16046 ldlm: group lock fix
f79107cf9a LU-16046 revert: "LU-9964 llite: prevent mulitple group locks" 

I tried excluding LU-15619, LU-15014 and LU-16139, but that did not help, and the remaining patches do not appear to be related.



 Comments   
Comment by Oleg Drokin [ 25/Oct/22 ]

This seems to be related to the now-closed LU-12752, since it hits the same assertion in the same test?

 

Alex also reports hitting this on his systems from time to time.

Comment by Alex Zhuravlev [ 25/Oct/22 ]

I usually hit this in racer using ZFS:

[ 1393.723944] LustreError: 202459:0:(osc_cache.c:2484:osc_teardown_async_page()) extent 0000000076978463@{[0 -> 255/255], [2|0|-|cache|wi|0000000017bdddf3], [1703936|176|+|-|000000008bd8d628|256|00000000e98e7788]} trunc at 80.
[ 1393.724211] LustreError: 202459:0:(osc_cache.c:2484:osc_teardown_async_page()) ### extent: 0000000076978463 ns: lustre-OST0001-osc-ffff96b339e42000 lock: 000000008bd8d628/0xacdee9107b049393 lrc: 3/0,0 mode: PW/PW res: [0x2c0000400:0x5aa:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 262144->18446744073709551615) gid 0 flags: 0x800020000020000 nid: local remote: 0xacdee9107b0493a8 expref: -99 pid: 202304 timeout: 0 lvb_type: 1
[ 1393.724384] LustreError: 202459:0:(osc_page.c:174:osc_page_delete()) page@00000000def37c21[2 000000004f3fbbc0 5 1 00000000e98e7788]
[ 1393.724384] 
[ 1393.724456] LustreError: 202459:0:(osc_page.c:174:osc_page_delete()) vmpage @0000000057d3a516 4100000001039 3:0 ffff96b32bd52eb8 80 lru
[ 1393.724456] 
[ 1393.724529] LustreError: 202459:0:(osc_page.c:174:osc_page_delete()) osc-page@0000000059260e88 80: 1< 2 + - > 2< 327680 0 4096 0x0 0x40420 | 00000000e98e7788 00000000f9a6b8a1 0000000017bdddf3 > 3< 0 0 > 4< 0 0 8 69206016 - | - - + - > 5< - - + - | 0 - | 256 - ->
[ 1393.724529] 
[ 1393.724647] LustreError: 202459:0:(osc_page.c:174:osc_page_delete()) end page@00000000def37c21
[ 1393.724647] 
[ 1393.724710] LustreError: 202459:0:(osc_page.c:174:osc_page_delete()) Trying to teardown failed: -16
[ 1393.724761] LustreError: 202459:0:(osc_page.c:175:osc_page_delete()) ASSERTION( 0 ) failed: 
[ 1393.724811] LustreError: 202459:0:(osc_page.c:175:osc_page_delete()) LBUG
[ 1393.724903] Pid: 202459, comm: rm 4.18.0 #2 SMP Sun Oct 23 17:58:04 UTC 2022
[ 1393.724955] Call Trace TBD:
[ 1393.724995] [<0>] libcfs_call_trace+0x67/0x90 [libcfs]
[ 1393.725156] [<0>] lbug_with_loc+0x3e/0x80 [libcfs]
[ 1393.725266] [<0>] osc_page_delete+0x4b4/0x4c0 [osc]
[ 1393.725396] [<0>] __cl_page_delete+0x7c/0x2f0 [obdclass]
[ 1393.725518] [<0>] cl_page_delete+0x25/0xe0 [obdclass]
[ 1393.725645] [<0>] ll_invalidatepage+0x95/0x180 [lustre]
[ 1393.725749] [<0>] truncate_cleanup_page+0x6a/0xb0
[ 1393.725844] [<0>] truncate_inode_pages_range+0x1c2/0x7a0
[ 1393.725951] [<0>] ll_truncate_inode_pages_final+0x13/0xe0 [lustre]
[ 1393.726084] [<0>] ll_delete_inode+0x33/0x140 [lustre]
[ 1393.726185] [<0>] evict+0xbc/0x180
[ 1393.726260] [<0>] do_unlinkat+0x22c/0x2c0
[ 1393.726335] [<0>] do_syscall_64+0x43/0x120
[ 1393.726409] [<0>] entry_SYSCALL_64_after_hwframe+0x65/0xca
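Both reports print an extent range followed by "trunc at N". The sketch below pulls those numbers out of the osc_teardown_async_page() lines so the two crashes can be compared. Reading the three bracketed numbers as start, end and maximum end page index is an assumption based on the log layout, and parse_teardown_line and index_in_extent are hypothetical helpers, not Lustre code.

```python
import re

# Hypothetical pattern for the "extent ...@{[start -> end/max_end], ...} trunc at N." line.
EXTENT_RE = re.compile(
    r"extent \S+@\{\[(?P<start>\d+) -> (?P<end>\d+)/(?P<max_end>\d+)\].*"
    r"trunc at (?P<index>\d+)\."
)

def parse_teardown_line(line: str) -> dict:
    """Return the extent range and truncation index found in one log line."""
    m = EXTENT_RE.search(line)
    if m is None:
        raise ValueError("not an osc_teardown_async_page extent line")
    return {name: int(value) for name, value in m.groupdict().items()}

def index_in_extent(info: dict) -> bool:
    """True if the truncated page index lies inside [start, end]."""
    return info["start"] <= info["index"] <= info["end"]

# Extent portions copied from the two logs in this ticket (prefixes dropped).
SANITY_LINE = ("extent ffff8800400af8e8@{[0 -> 255/1023], "
               "[2|0|-|cache|wi|ffff8800badddb28], "
               "[1073152|256|+|-|ffff8802276bfc00|1024|          (null)]} trunc at 0.")
RACER_LINE = ("extent 0000000076978463@{[0 -> 255/255], "
              "[2|0|-|cache|wi|0000000017bdddf3], "
              "[1703936|176|+|-|000000008bd8d628|256|00000000e98e7788]} trunc at 80.")

for line in (SANITY_LINE, RACER_LINE):
    info = parse_teardown_line(line)
    print(info, "inside extent:", index_in_extent(info))
```

Under that reading, both crashes involve a page whose index falls inside an extent that is still cached when the inode is evicted.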

Comment by Etienne Aujames [ 26/Oct/22 ]

Could this be related to LU-16044 (osd: discard pagecache in truncate's declaration)?

Comment by Alex Zhuravlev [ 26/Oct/22 ]

Could this be related to the LU-16044 (LU-16044 osd: discard pagecache in truncate's declaration) ?

I don't see how - LU-16044 is about the pagecache on the OSS, which is not touched by the client directly.

Comment by Xing Huang [ 18/Feb/23 ]

"Bobi Jam <bobijam@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/#/c/fs/lustre-release/+/50005/
Subject: LU-16263 lov: continue fsync on other OST objs even on -ENOENT
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3acbd90b075a710ed274d611fee8530e7f11e6ea

Comment by Gerrit Updater [ 21/Mar/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50005/
Subject: LU-16263 lov: continue fsync on other OST objs even on -ENOENT
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 927b5cd49c3369d533d7f8dc5c8df497aaf33b6e
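The patch subject describes the idea as continuing fsync on the other OST objects even when one of them returns -ENOENT. The following is only an illustrative Python sketch of that control flow, not the actual lov code: fsync_all_objects and fsync_one are hypothetical names, and the handling of errors other than -ENOENT is an assumption.

```python
import errno

def fsync_all_objects(objects, fsync_one):
    """Sync every sub-object; -ENOENT from one object does not stop the rest.

    fsync_one(obj) returns 0 on success or a negative errno.
    Returns 0, or the first error other than -ENOENT (assumed policy).
    """
    first_error = 0
    for obj in objects:
        rc = fsync_one(obj)
        if rc == -errno.ENOENT:
            # Assumed meaning: this object is already gone, so skip it
            # and keep flushing the remaining objects.
            continue
        if rc < 0 and first_error == 0:
            first_error = rc
    return first_error

# Usage: the middle object has been destroyed, the others still get synced.
results = {"obj0": 0, "obj1": -errno.ENOENT, "obj2": 0}
visited = []

def fake_fsync(obj):
    visited.append(obj)
    return results[obj]

rc = fsync_all_objects(["obj0", "obj1", "obj2"], fake_fsync)
print(rc, visited)
```

The point of the sketch is that stopping at the first -ENOENT would leave dirty pages on the later objects unflushed, which is consistent with the busy extent seen at inode eviction in the logs above.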

Comment by Peter Jones [ 21/Mar/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:25:27 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.