[LU-12752] osc_page.c:osc_page_delete() ASSERTION( 0 ) failed Created: 11/Sep/19 Updated: 16/Feb/23 Resolved: 16/Mar/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Stephane Thiell | Assignee: | Vladimir Saveliev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7.6 2.12.0+patch including async_discard for DoM. Servers: 2.12.2_116+3 patches |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
A specific type of jobs (which seems to be Spark+Python) keeps triggering the following LBUG on different clients: /var/log/nodes/sh-110-01.log:Sep 9 12:09:32 sh-110-01 kernel: LustreError: 123514:0:(osc_page.c:193:osc_page_delete()) LBUG /var/log/nodes/sh-110-04.log:Sep 9 08:53:09 sh-110-04 kernel: LustreError: 89537:0:(osc_page.c:193:osc_page_delete()) LBUG /var/log/nodes/sh-110-04.log:Sep 9 12:47:29 sh-110-04 kernel: LustreError: 53455:0:(osc_page.c:193:osc_page_delete()) LBUG /var/log/nodes/sh-110-04.log:Sep 9 12:47:29 sh-110-04 kernel: Kernel panic - not syncing: LBUG /var/log/nodes/sh-110-04.log:Sep 9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:193:osc_page_delete()) LBUG /var/log/nodes/sh-110-04.log:Sep 10 11:48:43 sh-110-04 kernel: LustreError: 104908:0:(osc_page.c:193:osc_page_delete()) LBUG /var/log/nodes/sh-110-04.log:Sep 10 12:29:58 sh-110-04 kernel: LustreError: 19176:0:(osc_page.c:193:osc_page_delete()) LBUG /var/log/nodes/sh-110-04.log:Sep 10 12:45:09 sh-110-04 kernel: LustreError: 19196:0:(osc_page.c:193:osc_page_delete()) LBUG /var/log/nodes/sh-110-04.log:Sep 10 13:58:56 sh-110-04 kernel: LustreError: 14313:0:(osc_page.c:193:osc_page_delete()) LBUG /var/log/nodes/sh-110-04.log:Sep 10 15:01:22 sh-110-04 kernel: LustreError: 11772:0:(osc_page.c:193:osc_page_delete()) LBUG /var/log/nodes/sh-110-04.log:Sep 10 16:59:34 sh-110-04 kernel: LustreError: 10094:0:(osc_page.c:193:osc_page_delete()) LBUG /var/log/nodes/sh-110-04.log:Sep 10 18:29:22 sh-110-04 kernel: LustreError: 25649:0:(osc_page.c:193:osc_page_delete()) LBUG Sep 9 14:08:38 sh-110-04 kernel: perf: interrupt took too long (4932 > 4930), lowering kernel.perf_event_max_sample_rate to 40000
Sep 9 14:10:01 sh-110-04 CROND[24303]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Sep 9 14:20:01 sh-110-04 CROND[25059]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Sep 9 14:30:01 sh-110-04 CROND[25808]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Sep 9 14:32:01 sh-110-04 kernel: LNetError: 8091:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5)
Sep 9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_cache.c:2575:osc_teardown_async_page()) extent ffff94ab962a3c30@{[32 -> 102/4095], [2|0|-|cache|wi|ffff9498052c8f00], [315392|71|+|-|ffff94c319791b00|4096| (null)]} trunc at 32.
Sep 9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_cache.c:2575:osc_teardown_async_page()) ### extent: ffff94ab962a3c30 ns: fir-OST0028-osc-ffff9496d9c3e000 lock: ffff94c319791b00/0x6233ab8351d87f8e lrc: 2/0,0 mode: PW/PW res: [0xa00000401:0x2db265b:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 131072->16777215) flags: 0x20000000000 nid: local remote: 0xcd88a3c067e066be expref: -99 pid: 26040 timeout: 0 lvb_type: 1
Sep 9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:192:osc_page_delete()) page@ffff94b6232a7600[2 ffff949c67e4db30 4 1 (null)]
Sep 9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:192:osc_page_delete()) vvp-page@ffff94b6232a7650(0:0) vm@ffffe46433a31e00 6fffff0000087d 3:0 ffff94b6232a7600 32 lru
Sep 9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:192:osc_page_delete()) lov-page@ffff94b6232a7690, comp index: 10000, gen: 7
Sep 9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:192:osc_page_delete()) osc-page@ffff94b6232a76c8 32: 1< 0x845fed 2 0 + - > 2< 131072 0 4096 0x0 0x420 | (null) ffff94ad27592540 ffff9498052c8f00 > 3< 0 0 0 > 4< 0 0 8 210399232 - | - - + - > 5< - - + - | 0 - | 71 - ->
Sep 9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:192:osc_page_delete()) end page@ffff94b6232a7600
Sep 9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:192:osc_page_delete()) Trying to teardown failed: -16
Sep 9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:193:osc_page_delete()) ASSERTION( 0 ) failed:
Sep 9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:193:osc_page_delete()) LBUG
Sep 9 14:36:54 sh-110-04 kernel: Pid: 26083, comm: migratefs 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019
Sep 9 14:36:54 sh-110-04 kernel: Call Trace:
Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffc0b9c7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffc0b9c87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffc0dd08bf>] osc_page_delete+0x48f/0x500 [osc]
Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffc0eea770>] cl_page_delete0+0x80/0x220 [obdclass]
Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffc0eea943>] cl_page_delete+0x33/0x110 [obdclass]
Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffc120ff3f>] ll_invalidatepage+0x7f/0x170 [lustre]
Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffbd3c7ddd>] do_invalidatepage_range+0x7d/0x90
Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffbd3c7e87>] truncate_inode_page+0x77/0x80
Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffbd3c80ba>] truncate_inode_pages_range+0x1ea/0x700
Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffbd3c863f>] truncate_inode_pages_final+0x4f/0x60
Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffc11f656f>] ll_delete_inode+0x4f/0x230 [lustre]
Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffbd45ff24>] evict+0xb4/0x180
Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffbd46082c>] iput+0xfc/0x190
Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffbd4548de>] do_unlinkat+0x1ae/0x2d0
Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffbd45595b>] SyS_unlinkat+0x1b/0x40
Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffbd976ddb>] system_call_fastpath+0x22/0x27
Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Sep 9 14:40:11 sh-110-04 kernel: Initializing cgroup subsys cpuset
We didn't find any other LBUG so we think this is related to a specific access pattern. This comes from several jobs but similar ones. The logs do mention fir so it's against our 2.12.2_116+3 scratch filesystem. We use DoM and PFL. We've found other similar tickets, some of them are old and marked as resolved:
Any idea? Do you think it's worth trying to upgrade our clients from 2.12.0+patches (including async_discard) to 2.12.2_116+3 like our servers? Thanks! |
| Comments |
| Comment by Patrick Farrell (Inactive) [ 12/Sep/19 ] |
|
Stephane, It sounds like you can reproduce this pretty easily. Given that, I'd suggest upgrading one client and trying it again. If the problem still occurs, then given how easy it is to reproduce, we should try to reproduce it with debug and then gather a vmcore with debug enabled and upload that. Specifically, I'd say turn on rpctrace, dlmtrace, and cache. (Probably OK to leave debug_mb at the default.) |
| Comment by Alex Zhuravlev [ 16/Nov/19 ] |
|
hit in racer:
Lustre: 4849:0:(llite_lib.c:2867:ll_dirty_page_discard_warn()) lustre: dirty page discard: 192.168.122.122@tcp:/lustre/fid: [0x200000403:0x6381:0x0]/ may get corrupted (rc -108)
LustreError: 11031:0:(osc_cache.c:2561:osc_teardown_async_page()) extent 00000000a1f47168@{[240 -> 511/1023], [2|0|+|cache|wiuY|000000004d3c9cb0], [1138688|272|+|-|0000000042a7c322|1024| (null)]} trunc at 240.
LustreError: 11031:0:(osc_cache.c:2561:osc_teardown_async_page()) ### extent: 00000000a1f47168 ns: lustre-OST0001-osc-ffff88015bc17000 lock: 0000000042a7c322/0x1853c4aa31a8466 lrc: 2/0,0 mode: PW/PW res: [0x4e5:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 2097152->18446744073709551615) flags: 0x800020000000000 nid: local remote: 0x1853c4aa31a8497 expref: -99 pid: 6809 timeout: 0 lvb_type: 1
LustreError: 11031:0:(osc_page.c:188:osc_page_delete()) page@00000000985442eb[2 000000006a06e30a 4 1 (null)]
LustreError: 11031:0:(osc_page.c:188:osc_page_delete()) vvp-page@00000000046d3436(0:0) vm@000000006118fc81 3480000001039 3:0 ffff8800d2397e00 240 lru
LustreError: 11031:0:(osc_page.c:188:osc_page_delete()) lov-page@00000000e665113b, gen: 3
LustreError: 11031:0:(osc_page.c:188:osc_page_delete()) osc-page@0000000058e45de8 240: 1< 0x845fed 258 0 + - > 2< 983040 0 4096 0x0 0x520 | (null) 000000003782a010 000000004d3c9cb0 > 3< 0 0 0 > 4< 0 0 8 274124800 - | + - + - > 5< + - + - | 0 - | 272 - +>
LustreError: 11031:0:(osc_page.c:188:osc_page_delete()) end page@00000000985442eb
LustreError: 11031:0:(osc_page.c:188:osc_page_delete()) Trying to teardown failed: -16
LustreError: 11031:0:(osc_page.c:190:osc_page_delete()) ASSERTION( 0 ) failed:
LustreError: 11031:0:(osc_page.c:190:osc_page_delete()) LBUG
Pid: 11031, comm: cat 4.18.0 #31 SMP Thu Oct 24 23:40:39 MSK 2019
Call Trace:
dump_stack+0x85/0xc0
___might_sleep.cold+0xac/0xbc
? kmem_cache_alloc_trace+0x1b2/0x2c0
? libcfs_debug_dumplog_internal+0x60/0x60 [libcfs]
? __kthread_create_on_node+0x53/0x160
? kthread_create_on_node+0x34/0x40
? libcfs_debug_dumplog+0x8d/0x150 [libcfs]
? _raw_spin_unlock+0x1f/0x30
? wake_up_q+0x60/0x60
? lbug_with_loc+0x51/0x80 [libcfs]
? osc_page_delete+0x4ab/0x4b0 [osc]
? cl_page_delete0+0x6b/0x1e0 [obdclass]
? cl_page_delete+0x26/0xf0 [obdclass]
? ll_invalidatepage+0x8d/0x170 [lustre]
? truncate_cleanup_page+0x68/0xb0
? truncate_inode_pages_range+0x1b0/0x800
? ll_delete_inode+0x3e/0x220 [lustre]
? evict+0xbc/0x180
? __dentry_kill+0xcb/0x170
? dentry_kill+0x4b/0x1b0
? dput+0x17/0x2b0
? dput+0x266/0x2b0
? __fput+0x104/0x200
? task_work_run+0x7d/0xb0
? exit_to_usermode_loop+0xa3/0xb0
|
| Comment by Alex Zhuravlev [ 07/May/20 ] |
|
again in racer,
PID: 32147 TASK: ffff8800db844000 CPU: 0 COMMAND: "cat"
#0 [ffff88011633bae0] panic at ffffffff810af9a3
/home/lustre/linux-4.18.0-32.el8/kernel/panic.c: 265
#1 [ffff88011633bb70] osc_page_delete at ffffffffa07843fe [osc]
/home/lustre/master-mine/lustre/osc/osc_page.c: 184
#2 [ffff88011633bbb0] cl_page_delete0 at ffffffffa01b5b8b [obdclass]
/home/lustre/master-mine/lustre/obdclass/cl_page.c: 801
#3 [ffff88011633bbd0] cl_page_delete at ffffffffa01b6606 [obdclass]
/home/lustre/master-mine/libcfs/include/libcfs/libcfs_debug.h: 153
#4 [ffff88011633bbe8] ll_invalidatepage at ffffffffa0eae23d [lustre]
/home/lustre/master-mine/lustre/llite/rw26.c: 100
#5 [ffff88011633bc00] truncate_cleanup_page at ffffffff8119a8c8
/home/lustre/linux-4.18.0-32.el8/./include/linux/compiler.h: 188
#6 [ffff88011633bc18] truncate_inode_pages_range at ffffffff8119b2d0
/home/lustre/linux-4.18.0-32.el8/./include/linux/pagevec.h: 66
#7 [ffff88011633bde8] ll_delete_inode at ffffffffa0e8efce [lustre]
/home/lustre/master-mine/lustre/llite/llite_lib.c: 2265
#8 [ffff88011633be08] evict at ffffffff81218dac
/home/lustre/linux-4.18.0-32.el8/fs/inode.c: 563
#9 [ffff88011633be28] __dentry_kill at ffffffff8121499b
/home/lustre/linux-4.18.0-32.el8/fs/dcache.c: 569
#10 [ffff88011633be48] dentry_kill at ffffffff81214a8b
/home/lustre/linux-4.18.0-32.el8/fs/dcache.c: 686
#11 [ffff88011633be70] dput at ffffffff81214e56
/home/lustre/linux-4.18.0-32.el8/fs/dcache.c: 846
#12 [ffff88011633be90] __fput at ffffffff811fc934
/home/lustre/linux-4.18.0-32.el8/fs/file_table.c: 255
#13 [ffff88011633bed0] task_work_run at ffffffff810cec8d
/home/lustre/linux-4.18.0-32.el8/kernel/task_work.c: 115
#14 [ffff88011633bf08] exit_to_usermode_loop at ffffffff81001db3
/home/lustre/linux-4.18.0-32.el8/./include/linux/tracehook.h: 193
#15 [ffff88011633bf38] do_syscall_64 at ffffffff8100265a
/home/lustre/linux-4.18.0-32.el8/arch/x86/entry/common.c: 211
#16 [ffff88011633bf50] entry_SYSCALL_64_after_hwframe at ffffffff818000ae
/home/lustre/linux-4.18.0-32.el8/arch/x86/entry/entry_64.S: 242
RIP: 00007f00025ab2ab RSP: 00007ffed735b348 RFLAGS: 00000202
RAX: 0000000000000000 RBX: 00007f000267b780 RCX: 00007f00025ab2ab
RDX: 00007f0002676740 RSI: 0000000000000001 RDI: 0000000000000001
RBP: 0000000000000000 R8: 00007f0002681580 R9: 0000000000000001
R10: 00007f0002681580 R11: 0000000000000202 R12: 00007f0002677340
R13: 0000000000000002 R14: 00007f000267f448 R15: 00007f000267bd80
ORIG_RAX: 0000000000000003 CS: 0033 SS: 002b
|
| Comment by Andreas Dilger [ 02/Feb/21 ] |
|
Hit this a few times in racer with patch https://review.whamcloud.com/13669 " |
| Comment by Gerrit Updater [ 01/Mar/21 ] |
|
Vladimir Saveliev (c17830@cray.com) uploaded a new patch: https://review.whamcloud.com/41797 |
| Comment by Vladimir Saveliev [ 01/Mar/21 ] |
Such LBUG may occur as result of race between write RPC and close of unlinked file in case of PFL files. See https://review.whamcloud.com/41797 for more details. |
| Comment by Gerrit Updater [ 16/Mar/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41797/ |
| Comment by Peter Jones [ 16/Mar/21 ] |
|
Landed for 2.15 |
| Comment by Alex Zhuravlev [ 24/Jun/22 ] |
LustreError: 276562:0:(osc_page.c:183:osc_page_delete()) ASSERTION( 0 ) failed: in sanity / 273b
Trace:
PID: 276562 TASK: ffff98be20881d80 CPU: 0 COMMAND: "multiop"
#0 [ffff98bef366fad0] panic at ffffffffb90af786
/tmp/kernel/kernel/panic.c: 299
#1 [ffff98bef366fb60] osc_page_delete at ffffffffc08bf614 [osc]
/home/lustre/master-mine/lustre/osc/osc_page.c: 183
#2 [ffff98bef366fba0] cl_page_delete0 at ffffffffc02cf56a [obdclass]
/home/lustre/master-mine/lustre/obdclass/cl_page.c: 842
#3 [ffff98bef366fbc0] cl_page_delete at ffffffffc02cfe65 [obdclass]
/home/lustre/master-mine/libcfs/include/libcfs/libcfs_debug.h: 155
#4 [ffff98bef366fbd8] ll_invalidatepage at ffffffffc193f095 [lustre]
/home/lustre/master-mine/lustre/llite/rw26.c: 99
#5 [ffff98bef366fbf8] truncate_cleanup_page at ffffffffb9160e8a
/tmp/kernel/./include/linux/compiler.h: 276
#6 [ffff98bef366fc10] truncate_inode_pages_range at ffffffffb9161892
/tmp/kernel/./include/linux/pagevec.h: 68
#7 [ffff98bef366fdd8] ll_truncate_inode_pages_final at ffffffffc191bf63 [lustre]
/home/lustre/master-mine/lustre/llite/llite_lib.c: 2767
#8 [ffff98bef366fdf8] ll_delete_inode at ffffffffc191c3f3 [lustre]
/home/lustre/master-mine/lustre/llite/llite_lib.c: 2851
#9 [ffff98bef366fe08] evict at ffffffffb91e1e6c
/tmp/kernel/fs/inode.c: 569
is this a different issue? |