Details
-
Bug
-
Resolution: Cannot Reproduce
-
Major
-
None
-
Lustre 2.1.2
-
None
-
Client RHEL-6.2 2.6.32-220.23.1.el6.x86_64
-
3
-
4210
Description
LBUG on production cluster while node was running a user job.
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) page@ffff88106917b5c0[1 ffff8806726e8f48:1263 ^(null)_ffff88106917b500 4 0 1 (null) (null) 0x1]
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) page@ffff88106917b500[1 ffff880770b2c4c8:1263 ^ffff88106917b5c0_(null) 4 0 1 (null) (null) 0x0]
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) vvp-page@ffff8809d88c6be0(1:0:0) vm@ffffea0034f6ce20 c0000000000001 3:0 0 1263 lru
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) lov-page@ffff8809f22e93a8
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) osc-page@ffff881067d54bc8: 1< 0x845fed 1 0 - - + > 2< 5173248 0 4096 0x0 0x8 | (null) ffff8806d7f24688 ffff88016958f700 ffffffffa07ff>
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) end page@ffff88106917b5c0
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) Trying to teardown failed: -16
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:426:osc_page_delete()) ASSERTION(0) failed
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:426:osc_page_delete()) LBUG
Aug 8 16:00:02 sand-4-52 kernel: Pid: 64082, comm: calculate_propa
Aug 8 16:00:02 sand-4-52 kernel:
Aug 8 16:00:02 sand-4-52 kernel: Call Trace:
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0446855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0446e95>] lbug_with_loc+0x75/0xe0 [libcfs]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0451d86>] libcfs_assertion_failed+0x66/0x70 [libcfs]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa07f8df6>] osc_page_delete+0x236/0x240 [osc]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0553d1e>] cl_page_delete0+0xce/0x400 [obdclass]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa055128e>] ? cl_env_get+0x19e/0x350 [obdclass]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0550bf6>] ? cl_env_peek+0x36/0x110 [obdclass]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa055408d>] cl_page_delete+0x3d/0xf0 [obdclass]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0560bde>] ? cl_io_is_going+0xe/0x20 [obdclass]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa08e791b>] ll_releasepage+0x10b/0x150 [lustre]
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81168ff0>] ? mem_cgroup_uncharge_cache_page+0x10/0x20
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff811100b0>] try_to_release_page+0x30/0x60
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112a4f1>] shrink_page_list.clone.0+0x4f1/0x5c0
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112a8bb>] shrink_inactive_list+0x2fb/0x740
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112b5cf>] shrink_zone+0x38f/0x520
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112c374>] zone_reclaim+0x354/0x410
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112cfc0>] ? isolate_pages_global+0x0/0x350
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81122874>] get_page_from_freelist+0x694/0x820
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81123af1>] __alloc_pages_nodemask+0x111/0x940
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8116a728>] ? __mem_cgroup_try_charge+0x78/0x420
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff811586ca>] alloc_pages_vma+0x9a/0x150
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81172015>] do_huge_pmd_anonymous_page+0x145/0x370
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8113c79a>] handle_mm_fault+0x25a/0x2b0
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81042c29>] __do_page_fault+0x139/0x480
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8100988e>] ? __switch_to+0x26e/0x320
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff814ed250>] ? thread_return+0x4e/0x76e
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff814f2c8e>] do_page_fault+0x3e/0xa0
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff814f0045>] page_fault+0x25/0x30
Aug 8 16:00:03 sand-4-52 kernel:
Aug 8 16:00:03 sand-4-52 kernel: LustreError: dumping log to /tmp/lustre-log.1344438003.64082
Attachments
Activity
Hi Oleg, please file a new bug for this because it seems like a new issue.
From the stacktrace and crashdump, an extent was not truncated by osc_cache_truncate_start() somehow.
Hm, I think I just hit this issue on my burn-in cluster as well:
[56349.656614] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800805c6f08[3 ffff88000bf11f80:256 ^(null)_ffff880014be0f08 0 0 1 (null) (null) 0x0] [56349.657354] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff880014be0f08[2 ffff88002a8ecf98:256 ^ffff8800805c6f08_(null) 0 0 1 (null) (null) 0x0] [56349.658145] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88006e65eb60(0:0:0) vm@ffffea00021b5a58 2000000000083c 2:0 ffff8800805c6f08 256 lru [56349.658890] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff88002510ee00 [56349.659360] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff880062170f08: 1< 0x845fed 258 0 + - > 2< 1048576 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - -> [56349.660491] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff8800805c6f08 [56349.660996] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp. [56349.661284] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff880014be1f08[3 ffff88000bf11f80:257 ^(null)_ffff88001333ef08 0 0 1 (null) (null) 0x0] [56349.662042] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff88001333ef08[2 ffff88002a8ecf98:257 ^ffff880014be1f08_(null) 0 0 1 (null) (null) 0x0] [56349.662802] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88006e65ebc8(0:0:0) vm@ffffea0001cef2d0 2000000000083c 2:0 ffff880014be1f08 257 lru [56349.663548] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff88002510e620 [56349.664016] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff88002b782f08: 1< 0x845fed 258 0 + - > 2< 1052672 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - -> [56349.665336] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff880014be1f08 [56349.665805] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp. [56349.666679] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff880093a26f08[3 ffff88000bf11f80:258 ^(null)_ffff88003c596f08 0 0 1 (null) (null) 0x0] [56349.667419] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff88003c596f08[2 ffff88002a8ecf98:258 ^ffff880093a26f08_(null) 0 0 1 (null) (null) 0x0] [56349.668166] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88006e65ec30(0:0:0) vm@ffffea0002285560 2000000000083c 2:0 ffff880093a26f08 258 lru [56349.668917] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff88002510ece0 [56349.669389] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff88005bf32f08: 1< 0x845fed 258 0 + - > 2< 1056768 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - -> [56349.671402] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff880093a26f08 [56349.671883] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp. [56349.672186] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800304a3f08[3 ffff88000bf11f80:259 ^(null)_ffff880032388f08 0 0 1 (null) (null) 0x0] [56349.672953] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff880032388f08[2 ffff88002a8ecf98:259 ^ffff8800304a3f08_(null) 0 0 1 (null) (null) 0x0] [56349.673667] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88006e65ec98(0:0:0) vm@ffffea0001d40e80 2000000000083c 2:0 ffff8800304a3f08 259 lru [56349.674403] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff8800740957d0 [56349.674898] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff8800922cff08: 1< 0x845fed 258 0 + - > 2< 1060864 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - -> [56349.676000] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff8800304a3f08 [56349.676500] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp. [56349.676872] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800304a2f08[3 ffff88000bf11f80:260 ^(null)_ffff8800639baf08 0 0 1 (null) (null) 0x0] [56349.677605] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800639baf08[2 ffff88002a8ecf98:260 ^ffff8800304a2f08_(null) 0 0 1 (null) (null) 0x0] [56349.678374] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88006e65ed00(0:0:0) vm@ffffea0002114048 2000000000083c 2:0 ffff8800304a2f08 260 lru [56349.679110] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff880074095c98 [56349.679560] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff880081cf0f08: 1< 0x845fed 258 0 + - > 2< 1064960 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - -> [56349.680665] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff8800304a2f08 [56349.681134] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp. [56349.681428] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800639bbf08[3 ffff88000bf11f80:261 ^(null)_ffff8800223a9f08 0 0 1 (null) (null) 0x0] [56349.682160] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800223a9f08[2 ffff88002a8ecf98:261 ^ffff8800639bbf08_(null) 0 0 1 (null) (null) 0x0] [56349.682893] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88006e65ed68(0:0:0) vm@ffffea000218f5a0 2000000000083c 2:0 ffff8800639bbf08 261 lru [56349.683625] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff880074095ce0 [56349.684092] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff88001309df08: 1< 0x845fed 258 0 + - > 2< 1069056 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - -> [56349.685177] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff8800639bbf08 [56349.685623] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp. [56349.685921] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800223a8f08[3 ffff88000bf11f80:262 ^(null)_ffff8800248a3f08 0 0 1 (null) (null) 0x0] [56349.686682] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800248a3f08[2 ffff88002a8ecf98:262 ^ffff8800223a8f08_(null) 0 0 1 (null) (null) 0x0] [56349.687427] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88002d0ee0d0(0:0:0) vm@ffffea0001f0b178 2000000000083c 2:0 ffff8800223a8f08 262 lru [56349.688184] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff880074095ae8 [56349.688665] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff88008e200f08: 1< 0x845fed 258 0 + - > 2< 1073152 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - -> [56349.689748] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff8800223a8f08 [56349.690214] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp. [56349.690501] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff880075c2ef08[3 ffff88000bf11f80:263 ^(null)_ffff880033ee2f08 0 0 1 (null) (null) 0x0] [56349.691237] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff880033ee2f08[2 ffff88002a8ecf98:263 ^ffff880075c2ef08_(null) 0 0 1 (null) (null) 0x0] [56349.691967] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88002d0ee4e0(0:0:0) vm@ffffea00020fec90 2000000000083c 2:0 ffff880075c2ef08 263 lru [56349.692706] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff880074095aa0 [56349.693172] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff880093ce0f08: 1< 0x845fed 258 0 + - > 2< 1077248 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - -> [56349.694248] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff880075c2ef08 [56349.694717] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp. [56349.695010] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff880033ee3f08[3 ffff88000bf11f80:264 ^(null)_ffff880061765f08 0 0 1 (null) (null) 0x0] [56349.696293] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff880061765f08[2 ffff88002a8ecf98:264 ^ffff880033ee3f08_(null) 0 0 1 (null) (null) 0x0] [56349.697945] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88002d0eea90(0:0:0) vm@ffffea0001f481f8 2000000000083c 2:0 ffff880033ee3f08 264 lru [56349.698697] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff880074095740 [56349.699176] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff8800794ccf08: 1< 0x845fed 258 0 + - > 2< 1081344 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - -> [56349.700246] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff880033ee3f08 [56349.700732] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp. [56349.701026] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800393f6f08[3 ffff88000bf11f80:265 ^(null)_ffff88008b6e9f08 0 0 1 (null) (null) 0x0] [56349.701758] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff88008b6e9f08[2 ffff88002a8ecf98:265 ^ffff8800393f6f08_(null) 0 0 1 (null) (null) 0x0] [56349.702492] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88002d0eec30(0:0:0) vm@ffffea000277cc78 2000000000083c 2:0 ffff8800393f6f08 265 lru [56349.703850] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff880074095110 [56349.704321] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff88006b098f08: 1< 0x845fed 258 0 + - > 2< 1085440 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - -> [56349.705396] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff8800393f6f08 [56349.706003] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp. [56349.758028] LustreError: 20323:0:(osc_cache.c:2370:osc_teardown_async_page()) extent ffff88006e1f1f38@{[256 -> 506/511], [2|0|-|cache|wi|ffff880078590eb0], [1028096|251|+|-|ffff88002b60ced8|256|(null)]} trunc at 256. [56349.759172] LustreError: 20323:0:(osc_page.c:431:osc_page_delete()) page@ffff8800805c6f08[2 ffff88000bf11f80:256 ^(null)_ffff880014be0f08 4 0 1 (null) (null) 0x0] [56349.759981] LustreError: 20323:0:(osc_page.c:431:osc_page_delete()) page@ffff880014be0f08[2 ffff88002a8ecf98:256 ^ffff8800805c6f08_(null) 4 0 1 (null) (null) 0x0] [56349.760931] LustreError: 20323:0:(osc_page.c:431:osc_page_delete()) vvp-page@ffff88006e65eb60(0:0:0) vm@ffffea00021b5a58 20000000000035 3:0 0 256 lru [56349.761537] LustreError: 20323:0:(osc_page.c:431:osc_page_delete()) lov-page@ffff88002510ee00 [56349.762131] LustreError: 20323:0:(osc_page.c:431:osc_page_delete()) osc-page@ffff880062170f08: 1< 0x845fed 258 0 + - > 2< 1048576 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 0 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - -> [56349.763222] LustreError: 20323:0:(osc_page.c:431:osc_page_delete()) end page@ffff8800805c6f08 [56349.763780] LustreError: 20323:0:(osc_page.c:431:osc_page_delete()) Trying to teardown failed: -16 [56349.764587] LustreError: 20323:0:(osc_page.c:432:osc_page_delete()) ASSERTION( 0 ) failed: [56349.765113] LustreError: 20323:0:(osc_page.c:432:osc_page_delete()) LBUG [56349.765406] Pid: 20323, comm: cp [56349.765648] [56349.765648] Call Trace: [56349.766258] [<ffffffffa0405915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] [56349.766719] [<ffffffffa0405f27>] lbug_with_loc+0x47/0xb0 [libcfs] [56349.767070] [<ffffffffa09bd5dd>] osc_page_delete+0x32d/0x330 [osc] [56349.767381] [<ffffffffa05d6b15>] cl_page_delete0+0xc5/0x510 [obdclass] [56349.767713] [<ffffffffa05d6fa2>] cl_page_delete+0x42/0x120 [obdclass] [56349.768192] [<ffffffffa0e4025d>] ll_invalidatepage+0x8d/0x180 [lustre] [56349.768623] [<ffffffff81129ee8>] do_invalidatepage+0x28/0x30 [56349.768950] [<ffffffff8112a102>] truncate_inode_page+0xa2/0xc0 [56349.769216] [<ffffffff8112a42a>] truncate_inode_pages_range+0x15a/0x460 [56349.769515] [<ffffffff8112a745>] truncate_inode_pages+0x15/0x20 [56349.769877] [<ffffffff8112a797>] truncate_pagecache+0x47/0x70 [56349.770297] [<ffffffff8112a7d9>] truncate_setsize+0x19/0x20 [56349.770657] [<ffffffff8112a81e>] vmtruncate+0x3e/0x70 [56349.771010] [<ffffffffa0e53a53>] vvp_io_setattr_end+0x63/0x80 [lustre] [56349.771349] [<ffffffffa05dec60>] cl_io_end+0x60/0x150 [obdclass] [56349.771769] [<ffffffffa05e3d82>] cl_io_loop+0xc2/0x1b0 [obdclass] [56349.772084] [<ffffffffa0e4c448>] cl_setattr_ost+0x208/0x2d0 [lustre] [56349.772389] [<ffffffffa0e1c122>] ll_setattr_raw+0x792/0xfb0 [lustre] [56349.772706] [<ffffffffa0e1c99b>] ll_setattr+0x5b/0xf0 [lustre] [56349.772987] [<ffffffff81197688>] notify_change+0x168/0x340 [56349.773252] [<ffffffff81179af4>] do_truncate+0x64/0xa0 [56349.773530] [<ffffffff8118c309>] do_filp_open+0x829/0xd90 [56349.773854] [<ffffffff8118827b>] ? getname+0x3b/0x250 [56349.774277] [<ffffffff81044f4e>] ? kernel_map_pages+0xfe/0x110 [56349.774708] [<ffffffff8119861b>] ? alloc_fd+0xab/0x160 [56349.775166] [<ffffffff81178839>] do_sys_open+0x69/0x140 [56349.775569] [<ffffffff81178950>] sys_open+0x20/0x30 [56349.775968] [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b [56349.776371] [56349.777741] Kernel panic - not syncing: LBUG
vmcore and modules are in /exports/crashdumps/192.168.10.218-2012-12-23-08:15:29
In today's Xyratex/Cray conference call, it was decided to go forward with backing out the LU-1030 patches from our release branch, pending further work on them.
More comments on Xyratex bug. Given the difficulty/complexity of the change, I would happily accept guidance here. (i.e. wait for you-all or suggestions on how to proceed)
Iurii Golovach added a comment - 06/Dec/12 4:34 AM
During implementation of extents on OSC level, two cases fell between cracks:
- evict;
- invalidate page.
Like a result extent flags for these cases weren't set and we have a race.
Since adding these cases is quite complex task we recommend to do next:
1) revert changes from next patches:
http://review.whamcloud.com/#change,2270
http://review.whamcloud.com/#change,2009
http://review.whamcloud.com/#change,2460
2) wait until Jinshan finish his work on LU-1723 and then use changes with correct extents support.
Please let us know if you need patch against 2.3 to reverse these patches.
Cory Spitz added a comment - 06/Dec/12 2:42 PM
Hi, that's a pretty big change. How many bug fixes would we loose? I'm not sure that the tradeoff for this bug is worth it (yet). Can you guys help w/LU-1723 or are you just waiting for him?
With a crash dump with debug tracing turned on, we get a lot more useful information.
Alexey Lyashkov added a comment - 30/Nov/12 12:27 AM
In fact that is bug in new osc extent code landed into 2.3 :/ both code path 1) ldlm lock cancel -> page discard .... and 2) mkwrite /race/ -> invalidate page forget to find a extent and set correct state for it. looks osc extent author forget about forced page flush from a page cache with mkwrite race and client eviction. for first case that is more or less clear - cl_lock_discard_pages foget to find an extent and set truncate state, because use a old API and directly kill a pages. that is may be fixed via replace cl_lock_discard_pages [with a] osc_extent_truncate_start or something similar. but for second case it's not clear for me. [It] looks [like] we need [to] completely change code near of mkwrite and page invalidation. [Otherwise] (may be dirty hack) - we change osc_page_delete to ignore error in osc_teardown_async_page for some 'force' cases or (better) correctly abort an IO request for that page if page discard need (page may be in readahead as was in 790798-osc_page_delete-assert0-sp1-1120-fulldbg.tar.bz2).
Here is an xz compressed "dk" log from a crash dump that is an example of the issue. The two processes 617 and 619 are at issue. There is a timing hole of some sort. I will append Shadow's and my analysis shortly.
I don't see an obvious way to add attachments, so I'll x-clip and paste some of the log info.
NOTE: I used a script to edit the time into a format compatible with the console log.
This is the "lctl dk" output.
2012-11-14T16-33-50.279302:3591:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1352932423/real 0] req@ffff88041df49c00 x1418650257981646/t0(0) o400->lustre1-OST0003-osc-ffff88061d741400@53@gni:28/4 lens 224/224 e 0 to 1 dl 1352932430 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
2012-11-14T16-33-50.279320:3591:(import.c:167:ptlrpc_set_import_discon()) lustre1-OST0003-osc-ffff88061d741400: Connection to lustre1-OST0003 (at 53@gni) was lost; in progress operations using this service will wait for recovery to complete
2012-11-14T16-33-50.279328:3591:(import.c:177:ptlrpc_set_import_discon()) ffff880815628000 lustre1-OST0003_UUID: changing import state from FULL to DISCONN
2012-11-14T16-33-50.279331:3591:(import.c:415:ptlrpc_fail_import()) lustre1-OST0003_UUID: waking up pinger2012-11-14T16-34-06.389737:3567:(import.c:1137:ptlrpc_connect_interpret()) ffff880815628000 lustre1-OST0003_UUID: changing import state from CONNECTING to DISCONN
2012-11-14T16-34-06.389739:3567:(import.c:1183:ptlrpc_connect_interpret()) recovery of lustre1-OST0003_UUID on 53@gni failed (-110 (ETIMEDOUT))
2012-11-14T16-34-47.095297:3564:(gnilnd_conn.c:2152:kgnilnd_reaper_dgram_check()) GNILND_DGRAM_REQ datagram to 42@gni timed out @ 63s dgram 0xffff880814862288 state GNILND_DGRAM_POSTED conn 0xffff8808150134002012-11-14T16-37-35.377578:3567:(import.c:815:ptlrpc_connect_interpret()) lustre1-MDT0000-mdc-ffff88061d741400: connect to target with instance 4
2012-11-14T16-37-35.377583:3567:(import.c:965:ptlrpc_connect_interpret()) @@@ lustre1-MDT0000-mdc-ffff88061d741400: evicting (reconnect/recover flags not set: 4) req@ffff88041dd58000 x1418650257981665/t0(0) o38->lustre1-MDT0000-mdc-ffff88061d741400@52@gni:12/10 lens 400/264 e 0 to 0 dl 1352932681 ref 1 fl Interpret:RN/0/0 rc 0/0
2012-11-14T16-37-35.377589:3567:(import.c:968:ptlrpc_connect_interpret()) ffff8808153cb000 lustre1-MDT0000_UUID: changing import state from CONNECTING to EVICTED2012-11-14T16-37-42.798033:3567:(lcommon_misc.c:77:cl_init_ea_size()) updating max_mdsize/max_cookiesize: 120/128
2012-11-14T16-37-42.798035:3567:(import.c:965:ptlrpc_connect_interpret()) @@@ lustre1-OST0003-osc-ffff88061d741400: evicting (reconnect/recover flags not set: 4) req@ffff88081df80c00 x1418650257981674/t0(0) o8->lustre1-OST0003-osc-ffff88061d741400@53@gni:28/4 lens 400/264 e 0 to 0 dl 1352932683 ref 1 fl Interpret:RN/0/0 rc 0/0
2012-11-14T16-37-42.798040:3567:(import.c:968:ptlrpc_connect_interpret()) ffff880815628000 lustre1-OST0003_UUID: changing import state from CONNECTING to EVICTED
2012-11-14T16-37-42.798043:3567:(import.c:1324:ptlrpc_import_recovery_state_machine()) 167-0: This client was evicted by lustre1-OST0003; in progress operations using this service will fail.
2012-11-14T16-37-42.798044:3567:(import.c:1328:ptlrpc_import_recovery_state_machine()) evicted from lustre1-OST0003_UUID@53@gni; invalidating
The connection recovers, according to logs:
2012-11-14T16-37-42.798209:3867:(import.c:1394:ptlrpc_import_recovery_state_machine()) ffff8808155b6000 lustre1-OST0001_UUID: changing import state from RECOVER to FULL
2012-11-14T16-37-42.798211:3867:(lov_obd.c:431:lov_set_osc_active()) Marking OSC lustre1-OST0001_UUID active
2012-11-14T16-37-42.798213:3867:(lcommon_misc.c:77:cl_init_ea_size()) updating max_mdsize/max_cookiesize: 144/128
2012-11-14T16-37-42.798216:3867:(import.c:1402:ptlrpc_import_recovery_state_machine()) lustre1-OST0001-osc-ffff88061d741400: Connection restored to lustre1-OST0001 (at 53@gni)
Then about 400,000 copies of this:
2012-11-14T16-41-29.357283:3941:(mdc_request.c:836:mdc_close()) @@@ matched open req@ffff88081542e800 x1418650258369942/t17185315789(17185315789) o101->lustre1-MDT0000-mdc-ffff88061d741400@52@gni:12/10 lens 560/600 e 0 to 0 dl 1352932906 ref 2 fl Complete:RP/4/0 rc 301/301
followed 0.2 seconds with the LBUG:
2012-11-14T17-22-00.230715:2496:(osc_cache.c:2367:osc_teardown_async_page()) extent ffff88060ba76480@
Unknown macro: {[13 -> 13/255], [2|0|-|cache|wi|ffff8802054a9648], [4096|1|+|-|ffff8802172f9318|256|(null)]}trunc at 13.
2012-11-14T17-22-00.247868:2496:(osc_page.c:411:osc_page_delete()) page@ffff880403fcbcc0[2 ffff8802164aa948:13 ^(null)_ffff880403fcb840 4 0 1 (null) (null) 0x1]
2012-11-14T17-22-00.260922:2496:(osc_page.c:411:osc_page_delete()) page@ffff880403fcb840[1 ffff880207a317d8:13 ^ffff880403fcbcc0_(null) 4 0 1 (null) (null) 0x0]
2012-11-14T17-22-00.273976:2496:(osc_page.c:411:osc_page_delete()) vvp-page@ffff880401be3320(1:1:0) vm@ffffea000e052f28 6000000000000eb 9:0 0 13 lru
2012-11-14T17-22-00.285984:2496:(osc_page.c:411:osc_page_delete()) lov-page@ffff8803ea750798
2012-11-14T17-22-00.293128:2496:(osc_page.c:411:osc_page_delete()) osc-page@ffff880415ce6660: 1< 0x845fed 258 0 + - > 2< 53248 0 4096 0x0 0x520 | (null) ffff8808155c65f8 ffff8802054a9648 > 3< + ffff8801f01cb040 0 0 0 > 4< 0 0 8 34373632 - | - - + - > 5< - - + - | 0 - | 16 - ->
2012-11-14T17-22-00.316352:2496:(osc_page.c:411:osc_page_delete()) end page@ffff880403fcbcc0
2012-11-14T17-22-00.323489:2496:(osc_page.c:411:osc_page_delete()) Trying to teardown failed: -16
2012-11-14T17-22-00.331062:2496:(osc_page.c:412:osc_page_delete()) ASSERTION( 0 ) failed:
2012-11-14T17-22-00.338026:2496:(osc_page.c:412:osc_page_delete()) LBUG
2012-11-14T17-22-00.412653:2496:(linux-debug.c:143:libcfs_run_upcall()) Invoked LNET upcall /usr/lib/lustre/lnet_upcall LBUG,/usr/src/packages/BUILD/cray-lustre-2.3/lustre/osc/osc_page.c,osc_page_delete,412
OK. Thank you. It just looks a lot alike.
I will keep reading code.
2012-11-06T14:57:10.872333-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_cache.c:2367:osc_teardown_async_page()) extent ffff88060eedfe58@
{[23 -> 23/255], [2|0|-|cache|wi|ffff88020e18f8c8], [4096|1|+|-|ffff8801f9ed9c18|256| (null)]} trunc at 23.
2012-11-06T14:57:10.872389-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) page@ffff88080e8c8bc0[2 ffff88020cefad08:23 ^ (null)_ffff88080e8c8b00 4 0 1 (null) (null) 0x0]
2012-11-06T14:57:10.902445-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) page@ffff88080e8c8b00[1 ffff88020d8c1f58:23 ^ffff88080e8c8bc0_ (null) 4 0 1 (null) (null) 0x0]
2012-11-06T14:57:10.902494-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) vvp-page@ffff88080e8cf5a0(0:0:0) vm@ffffea001c6a99b8 e00000000000063 7:0 0 23 lru
2012-11-06T14:57:10.902508-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) lov-page@ffff8808050a7888
2012-11-06T14:57:10.958005-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) osc-page@ffff88080d78ed20: 1< 0x845fed 258 0 + - > 2< 94208 0 4096 0x0 0x520 | (null) ffff880804350700 ffff88020e18f8c8 > 3< + ffff8801f8f577b0 0 0 0 > 4< 0 0 8 33824768 - | - - + - > 5< - - + - | 0 - | 1 - ->
2012-11-06T14:57:10.958049-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) end page@ffff88080e8c8bc0
2012-11-06T14:57:10.983504-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) Trying to teardown failed: -16
2012-11-06T14:57:10.983536-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:412:osc_page_delete()) ASSERTION( 0 ) failed:
2012-11-06T14:57:10.983549-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:412:osc_page_delete()) LBUG
2012-11-06T14:57:10.983570-06:00 c0-0c1s6n0 Pid: 5270, comm: fsx-linux
2012-11-06T14:57:10.983582-06:00 c0-0c1s6n0 Call Trace:
2012-11-06T14:57:10.983605-06:00 c0-0c1s6n0 [<ffffffff810063b1>] try_stack_unwind+0x161/0x1a0
2012-11-06T14:57:11.009114-06:00 c0-0c1s6n0 [<ffffffff81004bf9>] dump_trace+0x89/0x440
2012-11-06T14:57:11.009138-06:00 c0-0c1s6n0 [<ffffffffa014e887>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
2012-11-06T14:57:11.009161-06:00 c0-0c1s6n0 [<ffffffffa014ede7>] lbug_with_loc+0x47/0xc0 [libcfs]
2012-11-06T14:57:11.009173-06:00 c0-0c1s6n0 [<ffffffffa0671d41>] osc_page_delete+0x2d1/0x2e0 [osc]
2012-11-06T14:57:11.034700-06:00 c0-0c1s6n0 [<ffffffffa02b0095>] cl_page_delete0+0xd5/0x4e0 [obdclass]
2012-11-06T14:57:11.034737-06:00 c0-0c1s6n0 [<ffffffffa02b04e2>] cl_page_delete+0x42/0x120 [obdclass]
2012-11-06T14:57:11.034762-06:00 c0-0c1s6n0 [<ffffffffa07f2e2d>] ll_invalidatepage+0x8d/0x170 [lustre]
2012-11-06T14:57:11.034774-06:00 c0-0c1s6n0 [<ffffffffa07ea290>] ll_page_mkwrite+0x7c0/0x840 [lustre]
2012-11-06T14:57:11.034820-06:00 c0-0c1s6n0 [<ffffffff81107cb7>] __do_fault+0xe7/0x570
2012-11-06T14:57:11.034833-06:00 c0-0c1s6n0 [<ffffffff811081e4>] handle_pte_fault+0xa4/0xcd0
2012-11-06T14:57:11.060561-06:00 c0-0c1s6n0 [<ffffffff81108fbe>] handle_mm_fault+0x1ae/0x240
2012-11-06T14:57:11.060588-06:00 c0-0c1s6n0 [<ffffffff81025471>] do_page_fault+0x191/0x410
2012-11-06T14:57:11.060600-06:00 c0-0c1s6n0 [<ffffffff81301b5f>] page_fault+0x1f/0x30
2012-11-06T14:57:11.060647-06:00 c0-0c1s6n0 [<00000000200422b3>] 0x200422b3
2012-11-06T14:57:11.060660-06:00 c0-0c1s6n0 Kernel panic - not syncing: LBUG
this is a different issue from what have described by Oleg
Since this was filed against 2.1.2 and the problem we are trying to deal with only occurs on the 2.3 source base, we have opened a new bug report:
LU-2557. It shows exactly the symptom Oleg displayed above, which is the same as my comment from November:http://jira.whamcloud.com/browse/LU-1723?focusedCommentId=47889&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-47889