Details
-
Bug
-
Resolution: Cannot Reproduce
-
Major
-
None
-
Lustre 2.1.2
-
None
-
Client RHEL-6.2 2.6.32-220.23.1.el6.x86_64
-
3
-
4210
Description
LBUG on production cluster while node was running a user job.
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) page@ffff88106917b5c0[1 ffff8806726e8f48:1263 ^(null)_ffff88106917b500 4 0 1 (null) (null) 0x1]
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) page@ffff88106917b500[1 ffff880770b2c4c8:1263 ^ffff88106917b5c0_(null) 4 0 1 (null) (null) 0x0]
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) vvp-page@ffff8809d88c6be0(1:0:0) vm@ffffea0034f6ce20 c0000000000001 3:0 0 1263 lru
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) lov-page@ffff8809f22e93a8
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) osc-page@ffff881067d54bc8: 1< 0x845fed 1 0 - - + > 2< 5173248 0 4096 0x0 0x8 | (null) ffff8806d7f24688 ffff88016958f700 ffffffffa07ff>
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) end page@ffff88106917b5c0
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:425:osc_page_delete()) Trying to teardown failed: -16
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:426:osc_page_delete()) ASSERTION(0) failed
Aug 8 16:00:02 sand-4-52 kernel: LustreError: 64082:0:(osc_page.c:426:osc_page_delete()) LBUG
Aug 8 16:00:02 sand-4-52 kernel: Pid: 64082, comm: calculate_propa
Aug 8 16:00:02 sand-4-52 kernel:
Aug 8 16:00:02 sand-4-52 kernel: Call Trace:
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0446855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0446e95>] lbug_with_loc+0x75/0xe0 [libcfs]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0451d86>] libcfs_assertion_failed+0x66/0x70 [libcfs]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa07f8df6>] osc_page_delete+0x236/0x240 [osc]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0553d1e>] cl_page_delete0+0xce/0x400 [obdclass]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa055128e>] ? cl_env_get+0x19e/0x350 [obdclass]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0550bf6>] ? cl_env_peek+0x36/0x110 [obdclass]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa055408d>] cl_page_delete+0x3d/0xf0 [obdclass]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa0560bde>] ? cl_io_is_going+0xe/0x20 [obdclass]
Aug 8 16:00:02 sand-4-52 kernel: [<ffffffffa08e791b>] ll_releasepage+0x10b/0x150 [lustre]
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81168ff0>] ? mem_cgroup_uncharge_cache_page+0x10/0x20
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff811100b0>] try_to_release_page+0x30/0x60
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112a4f1>] shrink_page_list.clone.0+0x4f1/0x5c0
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112a8bb>] shrink_inactive_list+0x2fb/0x740
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112b5cf>] shrink_zone+0x38f/0x520
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112c374>] zone_reclaim+0x354/0x410
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8112cfc0>] ? isolate_pages_global+0x0/0x350
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81122874>] get_page_from_freelist+0x694/0x820
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81123af1>] __alloc_pages_nodemask+0x111/0x940
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8116a728>] ? __mem_cgroup_try_charge+0x78/0x420
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff811586ca>] alloc_pages_vma+0x9a/0x150
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81172015>] do_huge_pmd_anonymous_page+0x145/0x370
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8113c79a>] handle_mm_fault+0x25a/0x2b0
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff81042c29>] __do_page_fault+0x139/0x480
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff8100988e>] ? __switch_to+0x26e/0x320
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff814ed250>] ? thread_return+0x4e/0x76e
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff814f2c8e>] do_page_fault+0x3e/0xa0
Aug 8 16:00:03 sand-4-52 kernel: [<ffffffff814f0045>] page_fault+0x25/0x30
Aug 8 16:00:03 sand-4-52 kernel:
Aug 8 16:00:03 sand-4-52 kernel: LustreError: dumping log to /tmp/lustre-log.1344438003.64082
Attachments
Activity
Since this was filed against 2.1.2 and the problem we are trying to deal with only occurs on the 2.3 source base, we have opened a new bug report: LU-2557. It shows exactly the symptom Oleg displayed above, which is the same as my comment from November:
Hi Oleg, please file a new bug for this because it seems like a new issue.
From the stacktrace and crashdump, an extent was not truncated by osc_cache_truncate_start() somehow.
Hm, I think I just hit this issue on my burn-in cluster as well:
[56349.656614] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800805c6f08[3 ffff88000bf11f80:256 ^(null)_ffff880014be0f08 0 0 1 (null) (null) 0x0] [56349.657354] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff880014be0f08[2 ffff88002a8ecf98:256 ^ffff8800805c6f08_(null) 0 0 1 (null) (null) 0x0] [56349.658145] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88006e65eb60(0:0:0) vm@ffffea00021b5a58 2000000000083c 2:0 ffff8800805c6f08 256 lru [56349.658890] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff88002510ee00 [56349.659360] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff880062170f08: 1< 0x845fed 258 0 + - > 2< 1048576 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - -> [56349.660491] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff8800805c6f08 [56349.660996] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp. [56349.661284] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff880014be1f08[3 ffff88000bf11f80:257 ^(null)_ffff88001333ef08 0 0 1 (null) (null) 0x0] [56349.662042] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff88001333ef08[2 ffff88002a8ecf98:257 ^ffff880014be1f08_(null) 0 0 1 (null) (null) 0x0] [56349.662802] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88006e65ebc8(0:0:0) vm@ffffea0001cef2d0 2000000000083c 2:0 ffff880014be1f08 257 lru [56349.663548] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff88002510e620 [56349.664016] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff88002b782f08: 1< 0x845fed 258 0 + - > 2< 1052672 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - -> [56349.665336] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff880014be1f08 [56349.665805] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp. [56349.666679] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff880093a26f08[3 ffff88000bf11f80:258 ^(null)_ffff88003c596f08 0 0 1 (null) (null) 0x0] [56349.667419] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff88003c596f08[2 ffff88002a8ecf98:258 ^ffff880093a26f08_(null) 0 0 1 (null) (null) 0x0] [56349.668166] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88006e65ec30(0:0:0) vm@ffffea0002285560 2000000000083c 2:0 ffff880093a26f08 258 lru [56349.668917] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff88002510ece0 [56349.669389] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff88005bf32f08: 1< 0x845fed 258 0 + - > 2< 1056768 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - -> [56349.671402] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff880093a26f08 [56349.671883] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp. [56349.672186] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800304a3f08[3 ffff88000bf11f80:259 ^(null)_ffff880032388f08 0 0 1 (null) (null) 0x0] [56349.672953] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff880032388f08[2 ffff88002a8ecf98:259 ^ffff8800304a3f08_(null) 0 0 1 (null) (null) 0x0] [56349.673667] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88006e65ec98(0:0:0) vm@ffffea0001d40e80 2000000000083c 2:0 ffff8800304a3f08 259 lru [56349.674403] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff8800740957d0 [56349.674898] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff8800922cff08: 1< 0x845fed 258 0 + - > 2< 1060864 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - -> [56349.676000] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff8800304a3f08 [56349.676500] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp. [56349.676872] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800304a2f08[3 ffff88000bf11f80:260 ^(null)_ffff8800639baf08 0 0 1 (null) (null) 0x0] [56349.677605] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800639baf08[2 ffff88002a8ecf98:260 ^ffff8800304a2f08_(null) 0 0 1 (null) (null) 0x0] [56349.678374] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88006e65ed00(0:0:0) vm@ffffea0002114048 2000000000083c 2:0 ffff8800304a2f08 260 lru [56349.679110] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff880074095c98 [56349.679560] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff880081cf0f08: 1< 0x845fed 258 0 + - > 2< 1064960 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - -> [56349.680665] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff8800304a2f08 [56349.681134] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp. [56349.681428] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800639bbf08[3 ffff88000bf11f80:261 ^(null)_ffff8800223a9f08 0 0 1 (null) (null) 0x0] [56349.682160] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800223a9f08[2 ffff88002a8ecf98:261 ^ffff8800639bbf08_(null) 0 0 1 (null) (null) 0x0] [56349.682893] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88006e65ed68(0:0:0) vm@ffffea000218f5a0 2000000000083c 2:0 ffff8800639bbf08 261 lru [56349.683625] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff880074095ce0 [56349.684092] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff88001309df08: 1< 0x845fed 258 0 + - > 2< 1069056 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - -> [56349.685177] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff8800639bbf08 [56349.685623] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp. [56349.685921] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800223a8f08[3 ffff88000bf11f80:262 ^(null)_ffff8800248a3f08 0 0 1 (null) (null) 0x0] [56349.686682] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800248a3f08[2 ffff88002a8ecf98:262 ^ffff8800223a8f08_(null) 0 0 1 (null) (null) 0x0] [56349.687427] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88002d0ee0d0(0:0:0) vm@ffffea0001f0b178 2000000000083c 2:0 ffff8800223a8f08 262 lru [56349.688184] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff880074095ae8 [56349.688665] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff88008e200f08: 1< 0x845fed 258 0 + - > 2< 1073152 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - -> [56349.689748] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff8800223a8f08 [56349.690214] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp. [56349.690501] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff880075c2ef08[3 ffff88000bf11f80:263 ^(null)_ffff880033ee2f08 0 0 1 (null) (null) 0x0] [56349.691237] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff880033ee2f08[2 ffff88002a8ecf98:263 ^ffff880075c2ef08_(null) 0 0 1 (null) (null) 0x0] [56349.691967] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88002d0ee4e0(0:0:0) vm@ffffea00020fec90 2000000000083c 2:0 ffff880075c2ef08 263 lru [56349.692706] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff880074095aa0 [56349.693172] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff880093ce0f08: 1< 0x845fed 258 0 + - > 2< 1077248 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - -> [56349.694248] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff880075c2ef08 [56349.694717] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp. [56349.695010] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff880033ee3f08[3 ffff88000bf11f80:264 ^(null)_ffff880061765f08 0 0 1 (null) (null) 0x0] [56349.696293] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff880061765f08[2 ffff88002a8ecf98:264 ^ffff880033ee3f08_(null) 0 0 1 (null) (null) 0x0] [56349.697945] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88002d0eea90(0:0:0) vm@ffffea0001f481f8 2000000000083c 2:0 ffff880033ee3f08 264 lru [56349.698697] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff880074095740 [56349.699176] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff8800794ccf08: 1< 0x845fed 258 0 + - > 2< 1081344 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - -> [56349.700246] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff880033ee3f08 [56349.700732] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp. [56349.701026] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff8800393f6f08[3 ffff88000bf11f80:265 ^(null)_ffff88008b6e9f08 0 0 1 (null) (null) 0x0] [56349.701758] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) page@ffff88008b6e9f08[2 ffff88002a8ecf98:265 ^ffff8800393f6f08_(null) 0 0 1 (null) (null) 0x0] [56349.702492] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) vvp-page@ffff88002d0eec30(0:0:0) vm@ffffea000277cc78 2000000000083c 2:0 ffff8800393f6f08 265 lru [56349.703850] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) lov-page@ffff880074095110 [56349.704321] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) osc-page@ffff88006b098f08: 1< 0x845fed 258 0 + - > 2< 1085440 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 1 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - -> [56349.705396] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) end page@ffff8800393f6f08 [56349.706003] LustreError: 20323:0:(osc_io.c:364:trunc_check_cb()) exists 0/cp. [56349.758028] LustreError: 20323:0:(osc_cache.c:2370:osc_teardown_async_page()) extent ffff88006e1f1f38@{[256 -> 506/511], [2|0|-|cache|wi|ffff880078590eb0], [1028096|251|+|-|ffff88002b60ced8|256|(null)]} trunc at 256. [56349.759172] LustreError: 20323:0:(osc_page.c:431:osc_page_delete()) page@ffff8800805c6f08[2 ffff88000bf11f80:256 ^(null)_ffff880014be0f08 4 0 1 (null) (null) 0x0] [56349.759981] LustreError: 20323:0:(osc_page.c:431:osc_page_delete()) page@ffff880014be0f08[2 ffff88002a8ecf98:256 ^ffff8800805c6f08_(null) 4 0 1 (null) (null) 0x0] [56349.760931] LustreError: 20323:0:(osc_page.c:431:osc_page_delete()) vvp-page@ffff88006e65eb60(0:0:0) vm@ffffea00021b5a58 20000000000035 3:0 0 256 lru [56349.761537] LustreError: 20323:0:(osc_page.c:431:osc_page_delete()) lov-page@ffff88002510ee00 [56349.762131] LustreError: 20323:0:(osc_page.c:431:osc_page_delete()) osc-page@ffff880062170f08: 1< 0x845fed 258 0 + - > 2< 1048576 0 4096 0x0 0x520 | (null) ffff88007d410cf0 ffff880078590eb0 > 3< + ffff88006fc50140 0 0 0 > 4< 0 0 8 20480 - | - - + - > 5< - - + - | 0 - | 251 - -> [56349.763222] LustreError: 20323:0:(osc_page.c:431:osc_page_delete()) end page@ffff8800805c6f08 [56349.763780] LustreError: 20323:0:(osc_page.c:431:osc_page_delete()) Trying to teardown failed: -16 [56349.764587] LustreError: 20323:0:(osc_page.c:432:osc_page_delete()) ASSERTION( 0 ) failed: [56349.765113] LustreError: 20323:0:(osc_page.c:432:osc_page_delete()) LBUG [56349.765406] Pid: 20323, comm: cp [56349.765648] [56349.765648] Call Trace: [56349.766258] [<ffffffffa0405915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] [56349.766719] [<ffffffffa0405f27>] lbug_with_loc+0x47/0xb0 [libcfs] [56349.767070] [<ffffffffa09bd5dd>] osc_page_delete+0x32d/0x330 [osc] [56349.767381] [<ffffffffa05d6b15>] cl_page_delete0+0xc5/0x510 [obdclass] [56349.767713] [<ffffffffa05d6fa2>] cl_page_delete+0x42/0x120 [obdclass] [56349.768192] [<ffffffffa0e4025d>] ll_invalidatepage+0x8d/0x180 [lustre] [56349.768623] [<ffffffff81129ee8>] do_invalidatepage+0x28/0x30 [56349.768950] [<ffffffff8112a102>] truncate_inode_page+0xa2/0xc0 [56349.769216] [<ffffffff8112a42a>] truncate_inode_pages_range+0x15a/0x460 [56349.769515] [<ffffffff8112a745>] truncate_inode_pages+0x15/0x20 [56349.769877] [<ffffffff8112a797>] truncate_pagecache+0x47/0x70 [56349.770297] [<ffffffff8112a7d9>] truncate_setsize+0x19/0x20 [56349.770657] [<ffffffff8112a81e>] vmtruncate+0x3e/0x70 [56349.771010] [<ffffffffa0e53a53>] vvp_io_setattr_end+0x63/0x80 [lustre] [56349.771349] [<ffffffffa05dec60>] cl_io_end+0x60/0x150 [obdclass] [56349.771769] [<ffffffffa05e3d82>] cl_io_loop+0xc2/0x1b0 [obdclass] [56349.772084] [<ffffffffa0e4c448>] cl_setattr_ost+0x208/0x2d0 [lustre] [56349.772389] [<ffffffffa0e1c122>] ll_setattr_raw+0x792/0xfb0 [lustre] [56349.772706] [<ffffffffa0e1c99b>] ll_setattr+0x5b/0xf0 [lustre] [56349.772987] [<ffffffff81197688>] notify_change+0x168/0x340 [56349.773252] [<ffffffff81179af4>] do_truncate+0x64/0xa0 [56349.773530] [<ffffffff8118c309>] do_filp_open+0x829/0xd90 [56349.773854] [<ffffffff8118827b>] ? getname+0x3b/0x250 [56349.774277] [<ffffffff81044f4e>] ? kernel_map_pages+0xfe/0x110 [56349.774708] [<ffffffff8119861b>] ? alloc_fd+0xab/0x160 [56349.775166] [<ffffffff81178839>] do_sys_open+0x69/0x140 [56349.775569] [<ffffffff81178950>] sys_open+0x20/0x30 [56349.775968] [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b [56349.776371] [56349.777741] Kernel panic - not syncing: LBUG
vmcore and modules are in /exports/crashdumps/192.168.10.218-2012-12-23-08:15:29
In today's Xyratex/Cray conference call, it was decided to go forward with backing out the LU-1030 patches from our release branch, pending further work on them.
More comments on Xyratex bug. Given the difficulty/complexity of the change, I would happily accept guidance here. (i.e. wait for you-all or suggestions on how to proceed)
Iurii Golovach added a comment - 06/Dec/12 4:34 AM
During implementation of extents on OSC level, two cases fell between cracks:
- evict;
- invalidate page.
Like a result extent flags for these cases weren't set and we have a race.
Since adding these cases is quite complex task we recommend to do next:
1) revert changes from next patches:
http://review.whamcloud.com/#change,2270
http://review.whamcloud.com/#change,2009
http://review.whamcloud.com/#change,2460
2) wait until Jinshan finish his work on LU-1723 and then use changes with correct extents support.
Please let us know if you need patch against 2.3 to reverse these patches.
Cory Spitz added a comment - 06/Dec/12 2:42 PM
Hi, that's a pretty big change. How many bug fixes would we loose? I'm not sure that the tradeoff for this bug is worth it (yet). Can you guys help w/LU-1723 or are you just waiting for him?
With a crash dump with debug tracing turned on, we get a lot more useful information.
Alexey Lyashkov added a comment - 30/Nov/12 12:27 AM
In fact that is bug in new osc extent code landed into 2.3 :/ both code path 1) ldlm lock cancel -> page discard .... and 2) mkwrite /race/ -> invalidate page forget to find a extent and set correct state for it. looks osc extent author forget about forced page flush from a page cache with mkwrite race and client eviction. for first case that is more or less clear - cl_lock_discard_pages foget to find an extent and set truncate state, because use a old API and directly kill a pages. that is may be fixed via replace cl_lock_discard_pages [with a] osc_extent_truncate_start or something similar. but for second case it's not clear for me. [It] looks [like] we need [to] completely change code near of mkwrite and page invalidation. [Otherwise] (may be dirty hack) - we change osc_page_delete to ignore error in osc_teardown_async_page for some 'force' cases or (better) correctly abort an IO request for that page if page discard need (page may be in readahead as was in 790798-osc_page_delete-assert0-sp1-1120-fulldbg.tar.bz2).
Here is an xz compressed "dk" log from a crash dump that is an example of the issue. The two processes 617 and 619 are at issue. There is a timing hole of some sort. I will append Shadow's and my analysis shortly.
I don't see an obvious way to add attachments, so I'll x-clip and paste some of the log info.
NOTE: I used a script to edit the time into a format compatible with the console log.
This is the "lctl dk" output.
2012-11-14T16-33-50.279302:3591:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1352932423/real 0] req@ffff88041df49c00 x1418650257981646/t0(0) o400->lustre1-OST0003-osc-ffff88061d741400@53@gni:28/4 lens 224/224 e 0 to 1 dl 1352932430 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
2012-11-14T16-33-50.279320:3591:(import.c:167:ptlrpc_set_import_discon()) lustre1-OST0003-osc-ffff88061d741400: Connection to lustre1-OST0003 (at 53@gni) was lost; in progress operations using this service will wait for recovery to complete
2012-11-14T16-33-50.279328:3591:(import.c:177:ptlrpc_set_import_discon()) ffff880815628000 lustre1-OST0003_UUID: changing import state from FULL to DISCONN
2012-11-14T16-33-50.279331:3591:(import.c:415:ptlrpc_fail_import()) lustre1-OST0003_UUID: waking up pinger2012-11-14T16-34-06.389737:3567:(import.c:1137:ptlrpc_connect_interpret()) ffff880815628000 lustre1-OST0003_UUID: changing import state from CONNECTING to DISCONN
2012-11-14T16-34-06.389739:3567:(import.c:1183:ptlrpc_connect_interpret()) recovery of lustre1-OST0003_UUID on 53@gni failed (-110 (ETIMEDOUT))
2012-11-14T16-34-47.095297:3564:(gnilnd_conn.c:2152:kgnilnd_reaper_dgram_check()) GNILND_DGRAM_REQ datagram to 42@gni timed out @ 63s dgram 0xffff880814862288 state GNILND_DGRAM_POSTED conn 0xffff8808150134002012-11-14T16-37-35.377578:3567:(import.c:815:ptlrpc_connect_interpret()) lustre1-MDT0000-mdc-ffff88061d741400: connect to target with instance 4
2012-11-14T16-37-35.377583:3567:(import.c:965:ptlrpc_connect_interpret()) @@@ lustre1-MDT0000-mdc-ffff88061d741400: evicting (reconnect/recover flags not set: 4) req@ffff88041dd58000 x1418650257981665/t0(0) o38->lustre1-MDT0000-mdc-ffff88061d741400@52@gni:12/10 lens 400/264 e 0 to 0 dl 1352932681 ref 1 fl Interpret:RN/0/0 rc 0/0
2012-11-14T16-37-35.377589:3567:(import.c:968:ptlrpc_connect_interpret()) ffff8808153cb000 lustre1-MDT0000_UUID: changing import state from CONNECTING to EVICTED2012-11-14T16-37-42.798033:3567:(lcommon_misc.c:77:cl_init_ea_size()) updating max_mdsize/max_cookiesize: 120/128
2012-11-14T16-37-42.798035:3567:(import.c:965:ptlrpc_connect_interpret()) @@@ lustre1-OST0003-osc-ffff88061d741400: evicting (reconnect/recover flags not set: 4) req@ffff88081df80c00 x1418650257981674/t0(0) o8->lustre1-OST0003-osc-ffff88061d741400@53@gni:28/4 lens 400/264 e 0 to 0 dl 1352932683 ref 1 fl Interpret:RN/0/0 rc 0/0
2012-11-14T16-37-42.798040:3567:(import.c:968:ptlrpc_connect_interpret()) ffff880815628000 lustre1-OST0003_UUID: changing import state from CONNECTING to EVICTED
2012-11-14T16-37-42.798043:3567:(import.c:1324:ptlrpc_import_recovery_state_machine()) 167-0: This client was evicted by lustre1-OST0003; in progress operations using this service will fail.
2012-11-14T16-37-42.798044:3567:(import.c:1328:ptlrpc_import_recovery_state_machine()) evicted from lustre1-OST0003_UUID@53@gni; invalidating
The connection recovers, according to logs:
2012-11-14T16-37-42.798209:3867:(import.c:1394:ptlrpc_import_recovery_state_machine()) ffff8808155b6000 lustre1-OST0001_UUID: changing import state from RECOVER to FULL
2012-11-14T16-37-42.798211:3867:(lov_obd.c:431:lov_set_osc_active()) Marking OSC lustre1-OST0001_UUID active
2012-11-14T16-37-42.798213:3867:(lcommon_misc.c:77:cl_init_ea_size()) updating max_mdsize/max_cookiesize: 144/128
2012-11-14T16-37-42.798216:3867:(import.c:1402:ptlrpc_import_recovery_state_machine()) lustre1-OST0001-osc-ffff88061d741400: Connection restored to lustre1-OST0001 (at 53@gni)
Then about 400,000 copies of this:
2012-11-14T16-41-29.357283:3941:(mdc_request.c:836:mdc_close()) @@@ matched open req@ffff88081542e800 x1418650258369942/t17185315789(17185315789) o101->lustre1-MDT0000-mdc-ffff88061d741400@52@gni:12/10 lens 560/600 e 0 to 0 dl 1352932906 ref 2 fl Complete:RP/4/0 rc 301/301
followed 0.2 seconds with the LBUG:
2012-11-14T17-22-00.230715:2496:(osc_cache.c:2367:osc_teardown_async_page()) extent ffff88060ba76480@
Unknown macro: {[13 -> 13/255], [2|0|-|cache|wi|ffff8802054a9648], [4096|1|+|-|ffff8802172f9318|256|(null)]}trunc at 13.
2012-11-14T17-22-00.247868:2496:(osc_page.c:411:osc_page_delete()) page@ffff880403fcbcc0[2 ffff8802164aa948:13 ^(null)_ffff880403fcb840 4 0 1 (null) (null) 0x1]
2012-11-14T17-22-00.260922:2496:(osc_page.c:411:osc_page_delete()) page@ffff880403fcb840[1 ffff880207a317d8:13 ^ffff880403fcbcc0_(null) 4 0 1 (null) (null) 0x0]
2012-11-14T17-22-00.273976:2496:(osc_page.c:411:osc_page_delete()) vvp-page@ffff880401be3320(1:1:0) vm@ffffea000e052f28 6000000000000eb 9:0 0 13 lru
2012-11-14T17-22-00.285984:2496:(osc_page.c:411:osc_page_delete()) lov-page@ffff8803ea750798
2012-11-14T17-22-00.293128:2496:(osc_page.c:411:osc_page_delete()) osc-page@ffff880415ce6660: 1< 0x845fed 258 0 + - > 2< 53248 0 4096 0x0 0x520 | (null) ffff8808155c65f8 ffff8802054a9648 > 3< + ffff8801f01cb040 0 0 0 > 4< 0 0 8 34373632 - | - - + - > 5< - - + - | 0 - | 16 - ->
2012-11-14T17-22-00.316352:2496:(osc_page.c:411:osc_page_delete()) end page@ffff880403fcbcc0
2012-11-14T17-22-00.323489:2496:(osc_page.c:411:osc_page_delete()) Trying to teardown failed: -16
2012-11-14T17-22-00.331062:2496:(osc_page.c:412:osc_page_delete()) ASSERTION( 0 ) failed:
2012-11-14T17-22-00.338026:2496:(osc_page.c:412:osc_page_delete()) LBUG
2012-11-14T17-22-00.412653:2496:(linux-debug.c:143:libcfs_run_upcall()) Invoked LNET upcall /usr/lib/lustre/lnet_upcall LBUG,/usr/src/packages/BUILD/cray-lustre-2.3/lustre/osc/osc_page.c,osc_page_delete,412
Close old ticket.