Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Lustre 2.12.0
-
None
-
CentOS 7.6 2.12.0+patch including async_discard for DoM. Servers: 2.12.2_116+3 patches
-
3
-
9223372036854775807
Description
A specific type of jobs (which seems to be Spark+Python) keeps triggering the following LBUG on different clients:
/var/log/nodes/sh-110-01.log:Sep 9 12:09:32 sh-110-01 kernel: LustreError: 123514:0:(osc_page.c:193:osc_page_delete()) LBUG /var/log/nodes/sh-110-04.log:Sep 9 08:53:09 sh-110-04 kernel: LustreError: 89537:0:(osc_page.c:193:osc_page_delete()) LBUG /var/log/nodes/sh-110-04.log:Sep 9 12:47:29 sh-110-04 kernel: LustreError: 53455:0:(osc_page.c:193:osc_page_delete()) LBUG /var/log/nodes/sh-110-04.log:Sep 9 12:47:29 sh-110-04 kernel: Kernel panic - not syncing: LBUG /var/log/nodes/sh-110-04.log:Sep 9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:193:osc_page_delete()) LBUG /var/log/nodes/sh-110-04.log:Sep 10 11:48:43 sh-110-04 kernel: LustreError: 104908:0:(osc_page.c:193:osc_page_delete()) LBUG /var/log/nodes/sh-110-04.log:Sep 10 12:29:58 sh-110-04 kernel: LustreError: 19176:0:(osc_page.c:193:osc_page_delete()) LBUG /var/log/nodes/sh-110-04.log:Sep 10 12:45:09 sh-110-04 kernel: LustreError: 19196:0:(osc_page.c:193:osc_page_delete()) LBUG /var/log/nodes/sh-110-04.log:Sep 10 13:58:56 sh-110-04 kernel: LustreError: 14313:0:(osc_page.c:193:osc_page_delete()) LBUG /var/log/nodes/sh-110-04.log:Sep 10 15:01:22 sh-110-04 kernel: LustreError: 11772:0:(osc_page.c:193:osc_page_delete()) LBUG /var/log/nodes/sh-110-04.log:Sep 10 16:59:34 sh-110-04 kernel: LustreError: 10094:0:(osc_page.c:193:osc_page_delete()) LBUG /var/log/nodes/sh-110-04.log:Sep 10 18:29:22 sh-110-04 kernel: LustreError: 25649:0:(osc_page.c:193:osc_page_delete()) LBUG
Sep 9 14:08:38 sh-110-04 kernel: perf: interrupt took too long (4932 > 4930), lowering kernel.perf_event_max_sample_rate to 40000 Sep 9 14:10:01 sh-110-04 CROND[24303]: (root) CMD (/usr/lib64/sa/sa1 1 1) Sep 9 14:20:01 sh-110-04 CROND[25059]: (root) CMD (/usr/lib64/sa/sa1 1 1) Sep 9 14:30:01 sh-110-04 CROND[25808]: (root) CMD (/usr/lib64/sa/sa1 1 1) Sep 9 14:32:01 sh-110-04 kernel: LNetError: 8091:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 5) Sep 9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_cache.c:2575:osc_teardown_async_page()) extent ffff94ab962a3c30@{[32 -> 102/4095], [2|0|-|cache|wi|ffff9498052c8f00], [315392|71|+|-|ffff94c319791b00|4096| (null)]} trunc at 32. Sep 9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_cache.c:2575:osc_teardown_async_page()) ### extent: ffff94ab962a3c30 ns: fir-OST0028-osc-ffff9496d9c3e000 lock: ffff94c319791b00/0x6233ab8351d87f8e lrc: 2/0,0 mode: PW/PW res: [0xa00000401:0x2db265b:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 131072->16777215) flags: 0x20000000000 nid: local remote: 0xcd88a3c067e066be expref: -99 pid: 26040 timeout: 0 lvb_type: 1 Sep 9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:192:osc_page_delete()) page@ffff94b6232a7600[2 ffff949c67e4db30 4 1 (null)] Sep 9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:192:osc_page_delete()) vvp-page@ffff94b6232a7650(0:0) vm@ffffe46433a31e00 6fffff0000087d 3:0 ffff94b6232a7600 32 lru Sep 9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:192:osc_page_delete()) lov-page@ffff94b6232a7690, comp index: 10000, gen: 7 Sep 9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:192:osc_page_delete()) osc-page@ffff94b6232a76c8 32: 1< 0x845fed 2 0 + - > 2< 131072 0 4096 0x0 0x420 | (null) ffff94ad27592540 ffff9498052c8f00 > 3< 0 0 0 > 4< 0 0 8 210399232 - | - - + - > 5< - - + - | 0 - | 71 - -> Sep 9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:192:osc_page_delete()) end page@ffff94b6232a7600 Sep 9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:192:osc_page_delete()) Trying to teardown failed: -16 Sep 9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:193:osc_page_delete()) ASSERTION( 0 ) failed: Sep 9 14:36:54 sh-110-04 kernel: LustreError: 26083:0:(osc_page.c:193:osc_page_delete()) LBUG Sep 9 14:36:54 sh-110-04 kernel: Pid: 26083, comm: migratefs 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 Sep 9 14:36:54 sh-110-04 kernel: Call Trace: Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffc0b9c7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs] Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffc0b9c87c>] lbug_with_loc+0x4c/0xa0 [libcfs] Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffc0dd08bf>] osc_page_delete+0x48f/0x500 [osc] Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffc0eea770>] cl_page_delete0+0x80/0x220 [obdclass] Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffc0eea943>] cl_page_delete+0x33/0x110 [obdclass] Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffc120ff3f>] ll_invalidatepage+0x7f/0x170 [lustre] Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffbd3c7ddd>] do_invalidatepage_range+0x7d/0x90 Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffbd3c7e87>] truncate_inode_page+0x77/0x80 Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffbd3c80ba>] truncate_inode_pages_range+0x1ea/0x700 Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffbd3c863f>] truncate_inode_pages_final+0x4f/0x60 Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffc11f656f>] ll_delete_inode+0x4f/0x230 [lustre] Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffbd45ff24>] evict+0xb4/0x180 Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffbd46082c>] iput+0xfc/0x190 Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffbd4548de>] do_unlinkat+0x1ae/0x2d0 Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffbd45595b>] SyS_unlinkat+0x1b/0x40 Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffbd976ddb>] system_call_fastpath+0x22/0x27 Sep 9 14:36:54 sh-110-04 kernel: [<ffffffffffffffff>] 0xffffffffffffffff Sep 9 14:40:11 sh-110-04 kernel: Initializing cgroup subsys cpuset
We didn't find any other LBUG so we think this is related to a specific access pattern. This comes from several jobs but similar ones.
The logs do mention fir so it's against our 2.12.2_116+3 scratch filesystem. We use DoM and PFL.
We've found other similar tickets, some of them are old and marked as resolved:
LU-2720, LU-2557, LU-8388
or more recent LU-12525 but I checked and we don't have the patch that was reverted as it was only on 2.13+ I think.
Any idea? Do you think it's worth trying to upgrade our clients from 2.12.0+patches (including async_discard) to 2.12.2_116+3 like our servers?
Thanks!
Stephane