[LU-12752] osc_page.c:osc_page_delete() ASSERTION( 0 ) failed Created: 11/Sep/19  Updated: 16/Feb/23  Resolved: 16/Mar/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Critical
Reporter: Stephane Thiell Assignee: Vladimir Saveliev
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS 7.6 2.12.0+patch including async_discard for DoM. Servers: 2.12.2_116+3 patches


Issue Links:
Related
is related to LU-16263 crash in sanity 273b osc_page_delete... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

A specific type of jobs (which seems to be Spark+Python) keeps triggering the following LBUG on different clients:

/var/log/nodes/sh-110-01.log:Sep  9 12:09:32 sh-110-01 kernel: LustreError: 123514:0:(osc_page.c:193:osc_page_delete()) LBUG
/var/log/nodes/sh-110-04.log:Sep  9 08:53:09 sh-110-04 kernel: LustreError: 89537:0:(osc_page.c:193:osc_page_delete()) LBUG
/var/log/nodes/sh-110-04.log:Sep  9 12:47:29 sh-110-04 kernel: LustreError: 53455:0:(osc_page.c:193:osc_page_delete()) LBUG
/var/log/nodes/sh-110-04.log:Sep  9 12:47:29 sh-110-04 kernel: Kernel panic - not syncing: LBUG
/var/log/nodes/sh-110-04.log:Sep  9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:193:osc_page_delete()) LBUG
/var/log/nodes/sh-110-04.log:Sep 10 11:48:43 sh-110-04 kernel: LustreError: 104908:0:(osc_page.c:193:osc_page_delete()) LBUG
/var/log/nodes/sh-110-04.log:Sep 10 12:29:58 sh-110-04 kernel: LustreError: 19176:0:(osc_page.c:193:osc_page_delete()) LBUG
/var/log/nodes/sh-110-04.log:Sep 10 12:45:09 sh-110-04 kernel: LustreError: 19196:0:(osc_page.c:193:osc_page_delete()) LBUG
/var/log/nodes/sh-110-04.log:Sep 10 13:58:56 sh-110-04 kernel: LustreError: 14313:0:(osc_page.c:193:osc_page_delete()) LBUG
/var/log/nodes/sh-110-04.log:Sep 10 15:01:22 sh-110-04 kernel: LustreError: 11772:0:(osc_page.c:193:osc_page_delete()) LBUG
/var/log/nodes/sh-110-04.log:Sep 10 16:59:34 sh-110-04 kernel: LustreError: 10094:0:(osc_page.c:193:osc_page_delete()) LBUG
/var/log/nodes/sh-110-04.log:Sep 10 18:29:22 sh-110-04 kernel: LustreError: 25649:0:(osc_page.c:193:osc_page_delete()) LBUG
Sep  9 14:08:38 sh-110-04 kernel: perf: interrupt took too long (4932 > 4930), lowering kernel.perf_event_max_sample_rate to 40000
Sep  9 14:10:01 sh-110-04 CROND[24303]: (root) CMD (/usr/lib64/sa/sa1 1 1)      
Sep  9 14:20:01 sh-110-04 CROND[25059]: (root) CMD (/usr/lib64/sa/sa1 1 1)      
Sep  9 14:30:01 sh-110-04 CROND[25808]: (root) CMD (/usr/lib64/sa/sa1 1 1)      
Sep  9 14:32:01 sh-110-04 kernel: LNetError: 8091:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5)
Sep  9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_cache.c:2575:osc_teardown_async_page()) extent ffff94ab962a3c30@{[32 -> 102/4095], [2|0|-|cache|wi|ffff9498052c8f00], [315392|71|+|-|ffff94c319791b00|4096|          (null)]} trunc at 32.
Sep  9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_cache.c:2575:osc_teardown_async_page()) ### extent: ffff94ab962a3c30 ns: fir-OST0028-osc-ffff9496d9c3e000 lock: ffff94c319791b00/0x6233ab8351d87f8e lrc: 2/0,0 mode: PW/PW res: [0xa00000401:0x2db265b:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 131072->16777215) flags: 0x20000000000 nid: local remote: 0xcd88a3c067e066be expref: -99 pid: 26040 timeout: 0 lvb_type: 1
Sep  9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:192:osc_page_delete()) page@ffff94b6232a7600[2 ffff949c67e4db30 4 1           (null)]
Sep  9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:192:osc_page_delete()) vvp-page@ffff94b6232a7650(0:0) vm@ffffe46433a31e00 6fffff0000087d 3:0 ffff94b6232a7600 32 lru
Sep  9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:192:osc_page_delete()) lov-page@ffff94b6232a7690, comp index: 10000, gen: 7
Sep  9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:192:osc_page_delete()) osc-page@ffff94b6232a76c8 32: 1< 0x845fed 2 0 + - > 2< 131072 0 4096 0x0 0x420 |           (null) ffff94ad27592540 ffff9498052c8f00 > 3< 0 0 0 > 4< 0 0 8 210399232 - | - - + - > 5< - - + - | 0 - | 71 - ->
Sep  9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:192:osc_page_delete()) end page@ffff94b6232a7600
Sep  9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:192:osc_page_delete()) Trying to teardown failed: -16
Sep  9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:193:osc_page_delete()) ASSERTION( 0 ) failed: 
Sep  9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:193:osc_page_delete()) LBUG
Sep  9 14:36:54 sh-110-04 kernel: Pid: 26083, comm: migratefs 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019
Sep  9 14:36:54 sh-110-04 kernel: Call Trace:                                   
Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffc0b9c7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffc0b9c87c>] lbug_with_loc+0x4c/0xa0 [libcfs] 
Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffc0dd08bf>] osc_page_delete+0x48f/0x500 [osc]
Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffc0eea770>] cl_page_delete0+0x80/0x220 [obdclass]
Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffc0eea943>] cl_page_delete+0x33/0x110 [obdclass]
Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffc120ff3f>] ll_invalidatepage+0x7f/0x170 [lustre]
Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffbd3c7ddd>] do_invalidatepage_range+0x7d/0x90
Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffbd3c7e87>] truncate_inode_page+0x77/0x80 
Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffbd3c80ba>] truncate_inode_pages_range+0x1ea/0x700
Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffbd3c863f>] truncate_inode_pages_final+0x4f/0x60
Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffc11f656f>] ll_delete_inode+0x4f/0x230 [lustre]
Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffbd45ff24>] evict+0xb4/0x180         
Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffbd46082c>] iput+0xfc/0x190          
Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffbd4548de>] do_unlinkat+0x1ae/0x2d0  
Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffbd45595b>] SyS_unlinkat+0x1b/0x40   
Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffbd976ddb>] system_call_fastpath+0x22/0x27
Sep  9 14:36:54 sh-110-04 kernel: [<ffffffffffffffff>] 0xffffffffffffffff       
Sep  9 14:40:11 sh-110-04 kernel: Initializing cgroup subsys cpuset             

We didn't find any other LBUG so we think this is related to a specific access pattern. This comes from several jobs but similar ones.

The logs do mention fir so it's against our 2.12.2_116+3 scratch filesystem. We use DoM and PFL.

We've found other similar tickets, some of them are old and marked as resolved:

LU-2720, LU-2557, LU-8388
or more recent LU-12525 but I checked and we don't have the patch that was reverted as it was only on 2.13+ I think.

Any idea? Do you think it's worth trying to upgrade our clients from 2.12.0+patches (including async_discard) to 2.12.2_116+3 like our servers?

Thanks!
Stephane



 Comments   
Comment by Patrick Farrell (Inactive) [ 12/Sep/19 ]

Stephane,

It sounds like you can reproduce this pretty easily.  Given that, I'd suggest upgrading one client and trying it again.  If the problem still occurs, then given how easy it is to reproduce, we should try to reproduce it with debug and then gather a vmcore with debug enabled and upload that.

Specifically, I'd say turn on rpctrace, dlmtrace, and cache.  (Probably OK to leave debug_mb at the default.)

Comment by Alex Zhuravlev [ 16/Nov/19 ]

hit in racer:

Lustre: 4849:0:(llite_lib.c:2867:ll_dirty_page_discard_warn()) lustre: dirty page discard: 192.168.122.122@tcp:/lustre/fid: [0x200000403:0x6381:0x0]/ may get corrupted (rc -108)
LustreError: 11031:0:(osc_cache.c:2561:osc_teardown_async_page()) extent 00000000a1f47168@{[240 -> 511/1023], [2|0|+|cache|wiuY|000000004d3c9cb0], [1138688|272|+|-|0000000042a7c322|1024|          (null)]} trunc at 240.
LustreError: 11031:0:(osc_cache.c:2561:osc_teardown_async_page()) ### extent: 00000000a1f47168 ns: lustre-OST0001-osc-ffff88015bc17000 lock: 0000000042a7c322/0x1853c4aa31a8466 lrc: 2/0,0 mode: PW/PW res: [0x4e5:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 2097152->18446744073709551615) flags: 0x800020000000000 nid: local remote: 0x1853c4aa31a8497 expref: -99 pid: 6809 timeout: 0 lvb_type: 1
LustreError: 11031:0:(osc_page.c:188:osc_page_delete()) page@00000000985442eb[2 000000006a06e30a 4 1           (null)]

LustreError: 11031:0:(osc_page.c:188:osc_page_delete()) vvp-page@00000000046d3436(0:0) vm@000000006118fc81 3480000001039 3:0 ffff8800d2397e00 240 lru

LustreError: 11031:0:(osc_page.c:188:osc_page_delete()) lov-page@00000000e665113b, gen: 3

LustreError: 11031:0:(osc_page.c:188:osc_page_delete()) osc-page@0000000058e45de8 240: 1< 0x845fed 258 0 + - > 2< 983040 0 4096 0x0 0x520 |           (null) 000000003782a010 000000004d3c9cb0 > 3< 0 0 0 > 4< 0 0 8 274124800 - | + - + - > 5< + - + - | 0 - | 272 - +>

LustreError: 11031:0:(osc_page.c:188:osc_page_delete()) end page@00000000985442eb
LustreError: 11031:0:(osc_page.c:188:osc_page_delete()) Trying to teardown failed: -16
LustreError: 11031:0:(osc_page.c:190:osc_page_delete()) ASSERTION( 0 ) failed: 
LustreError: 11031:0:(osc_page.c:190:osc_page_delete()) LBUG
Pid: 11031, comm: cat 4.18.0 #31 SMP Thu Oct 24 23:40:39 MSK 2019
Call Trace:
 dump_stack+0x85/0xc0
 ___might_sleep.cold+0xac/0xbc
 ? kmem_cache_alloc_trace+0x1b2/0x2c0
 ? libcfs_debug_dumplog_internal+0x60/0x60 [libcfs]
 ? __kthread_create_on_node+0x53/0x160
 ? kthread_create_on_node+0x34/0x40
 ? libcfs_debug_dumplog+0x8d/0x150 [libcfs]
 ? _raw_spin_unlock+0x1f/0x30
 ? wake_up_q+0x60/0x60
 ? lbug_with_loc+0x51/0x80 [libcfs]
 ? osc_page_delete+0x4ab/0x4b0 [osc]
 ? cl_page_delete0+0x6b/0x1e0 [obdclass]
 ? cl_page_delete+0x26/0xf0 [obdclass]
 ? ll_invalidatepage+0x8d/0x170 [lustre]
 ? truncate_cleanup_page+0x68/0xb0
 ? truncate_inode_pages_range+0x1b0/0x800
 ? ll_delete_inode+0x3e/0x220 [lustre]
 ? evict+0xbc/0x180
 ? __dentry_kill+0xcb/0x170
 ? dentry_kill+0x4b/0x1b0
 ? dput+0x17/0x2b0
 ? dput+0x266/0x2b0
 ? __fput+0x104/0x200
 ? task_work_run+0x7d/0xb0
 ? exit_to_usermode_loop+0xa3/0xb0
Comment by Alex Zhuravlev [ 07/May/20 ]

again in racer,

PID: 32147  TASK: ffff8800db844000  CPU: 0   COMMAND: "cat"
 #0 [ffff88011633bae0] panic at ffffffff810af9a3
    /home/lustre/linux-4.18.0-32.el8/kernel/panic.c: 265
 #1 [ffff88011633bb70] osc_page_delete at ffffffffa07843fe [osc]
    /home/lustre/master-mine/lustre/osc/osc_page.c: 184
 #2 [ffff88011633bbb0] cl_page_delete0 at ffffffffa01b5b8b [obdclass]
    /home/lustre/master-mine/lustre/obdclass/cl_page.c: 801
 #3 [ffff88011633bbd0] cl_page_delete at ffffffffa01b6606 [obdclass]
    /home/lustre/master-mine/libcfs/include/libcfs/libcfs_debug.h: 153
 #4 [ffff88011633bbe8] ll_invalidatepage at ffffffffa0eae23d [lustre]
    /home/lustre/master-mine/lustre/llite/rw26.c: 100
 #5 [ffff88011633bc00] truncate_cleanup_page at ffffffff8119a8c8
    /home/lustre/linux-4.18.0-32.el8/./include/linux/compiler.h: 188
 #6 [ffff88011633bc18] truncate_inode_pages_range at ffffffff8119b2d0
    /home/lustre/linux-4.18.0-32.el8/./include/linux/pagevec.h: 66
 #7 [ffff88011633bde8] ll_delete_inode at ffffffffa0e8efce [lustre]
    /home/lustre/master-mine/lustre/llite/llite_lib.c: 2265
 #8 [ffff88011633be08] evict at ffffffff81218dac
    /home/lustre/linux-4.18.0-32.el8/fs/inode.c: 563
 #9 [ffff88011633be28] __dentry_kill at ffffffff8121499b
    /home/lustre/linux-4.18.0-32.el8/fs/dcache.c: 569
#10 [ffff88011633be48] dentry_kill at ffffffff81214a8b
    /home/lustre/linux-4.18.0-32.el8/fs/dcache.c: 686
#11 [ffff88011633be70] dput at ffffffff81214e56
    /home/lustre/linux-4.18.0-32.el8/fs/dcache.c: 846
#12 [ffff88011633be90] __fput at ffffffff811fc934
    /home/lustre/linux-4.18.0-32.el8/fs/file_table.c: 255
#13 [ffff88011633bed0] task_work_run at ffffffff810cec8d
    /home/lustre/linux-4.18.0-32.el8/kernel/task_work.c: 115
#14 [ffff88011633bf08] exit_to_usermode_loop at ffffffff81001db3
    /home/lustre/linux-4.18.0-32.el8/./include/linux/tracehook.h: 193
#15 [ffff88011633bf38] do_syscall_64 at ffffffff8100265a
    /home/lustre/linux-4.18.0-32.el8/arch/x86/entry/common.c: 211
#16 [ffff88011633bf50] entry_SYSCALL_64_after_hwframe at ffffffff818000ae
    /home/lustre/linux-4.18.0-32.el8/arch/x86/entry/entry_64.S: 242
    RIP: 00007f00025ab2ab  RSP: 00007ffed735b348  RFLAGS: 00000202
    RAX: 0000000000000000  RBX: 00007f000267b780  RCX: 00007f00025ab2ab
    RDX: 00007f0002676740  RSI: 0000000000000001  RDI: 0000000000000001
    RBP: 0000000000000000   R8: 00007f0002681580   R9: 0000000000000001
    R10: 00007f0002681580  R11: 0000000000000202  R12: 00007f0002677340
    R13: 0000000000000002  R14: 00007f000267f448  R15: 00007f000267bd80
    ORIG_RAX: 0000000000000003  CS: 0033  SS: 002b
Comment by Andreas Dilger [ 02/Feb/21 ]

Hit this a few times in racer with patch https://review.whamcloud.com/13669 "LU-7073 tests: Add file migration to racer" with the same "trunc at 16" message. e.g.:
https://testing-archive.whamcloud.com/gerrit-janitor/13887/testresults/racer-special7-ldiskfs-centos7_x86_64-centos7_x86_64/

Comment by Gerrit Updater [ 01/Mar/21 ]

Vladimir Saveliev (c17830@cray.com) uploaded a new patch: https://review.whamcloud.com/41797
Subject: LU-12752 mdt: mdt_commitrw_write() - return 0 for dying object
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2fbc18994e48c3cf562d25f1e9d1975a26418e1f

Comment by Vladimir Saveliev [ 01/Mar/21 ]

LustreError: 11031:0:(osc_page.c:188:osc_page_delete()) Trying to teardown failed: -16
LustreError: 11031:0:(osc_page.c:190:osc_page_delete()) ASSERTION( 0 ) failed:
LustreError: 11031:0:(osc_page.c:190:osc_page_delete()) LBUG

Such LBUG may occur as result of race between write RPC and close of unlinked file in case of PFL files. See https://review.whamcloud.com/41797 for more details.

Comment by Gerrit Updater [ 16/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41797/
Subject: LU-12752 mdt: commitrw_write() - check dying object under lock
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d48a0ebb5a8d5d49684325434b503e8aab085397

Comment by Peter Jones [ 16/Mar/21 ]

Landed for 2.15

Comment by Alex Zhuravlev [ 24/Jun/22 ]
LustreError: 276562:0:(osc_page.c:183:osc_page_delete()) ASSERTION( 0 ) failed: in sanity / 273b

Trace:

PID: 276562  TASK: ffff98be20881d80  CPU: 0   COMMAND: "multiop"
 #0 [ffff98bef366fad0] panic at ffffffffb90af786
    /tmp/kernel/kernel/panic.c: 299
 #1 [ffff98bef366fb60] osc_page_delete at ffffffffc08bf614 [osc]
    /home/lustre/master-mine/lustre/osc/osc_page.c: 183
 #2 [ffff98bef366fba0] cl_page_delete0 at ffffffffc02cf56a [obdclass]
    /home/lustre/master-mine/lustre/obdclass/cl_page.c: 842
 #3 [ffff98bef366fbc0] cl_page_delete at ffffffffc02cfe65 [obdclass]
    /home/lustre/master-mine/libcfs/include/libcfs/libcfs_debug.h: 155
 #4 [ffff98bef366fbd8] ll_invalidatepage at ffffffffc193f095 [lustre]
    /home/lustre/master-mine/lustre/llite/rw26.c: 99
 #5 [ffff98bef366fbf8] truncate_cleanup_page at ffffffffb9160e8a
    /tmp/kernel/./include/linux/compiler.h: 276
 #6 [ffff98bef366fc10] truncate_inode_pages_range at ffffffffb9161892
    /tmp/kernel/./include/linux/pagevec.h: 68
 #7 [ffff98bef366fdd8] ll_truncate_inode_pages_final at ffffffffc191bf63 [lustre]
    /home/lustre/master-mine/lustre/llite/llite_lib.c: 2767
 #8 [ffff98bef366fdf8] ll_delete_inode at ffffffffc191c3f3 [lustre]
    /home/lustre/master-mine/lustre/llite/llite_lib.c: 2851
 #9 [ffff98bef366fe08] evict at ffffffffb91e1e6c
    /tmp/kernel/fs/inode.c: 569

is this a different issue?

Generated at Sat Feb 10 02:55:20 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.