[LU-1280] kernel BUG at .../lustre-2.1.0/lustre/lvfs/fsfilt-ldiskfs.c:978 Created: 03/Apr/12 Updated: 22/May/12 Resolved: 22/May/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0 |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.1.2 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Ned Bass | Assignee: | Jian Yu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
kernel 2.6.32-220.7.1.3chaos.ch5.x86_64 |
||
| Severity: | 3 |
| Rank (Obsolete): | 4617 |
| Description |
|
We hit this assertion in ldiskfs_ext_new_extent_cb() on a production OSS. 978 LASSERTF(i == path->p_depth ||
979 EXT_GENERATION(base) != path[0].p_generation,
980 "base vs path extent depth:%d != %d, generation:%lu == %lu\n",
981 i, path->p_depth, EXT_GENERATION(base), path[0].p_generation);
Here is the stack trace. We didn't capture the message from LASSERTF, and the crash dump failed. 2012-04-02 16:06:58 kernel BUG at /builddir/build/BUILD/lustre-2.1.0/lustre/lvfs/fsfilt-ldiskfs.c:978! 2012-04-02 16:06:58 invalid opcode: 0000 [#1] SMP 2012-04-02 16:06:58 last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map 2012-04-02 16:06:58 CPU 4 2012-04-02 16:06:58 Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) exportfs ost(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) m 2012-04-02 16:06:58 2012-04-02 16:06:58 Pid: 8224, comm: ll_ost_io_146 Not tainted 2.6.32-220.7.1.3chaos.ch5.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8 2012-04-02 16:06:58 RIP: 0010:[<ffffffffa0aab4c1>] [<ffffffffa0aab4c1>] ldiskfs_ext_new_extent_cb+0x5a1/0x660 [fsfilt_ldiskfs] 2012-04-02 16:06:59 RSP: 0018:ffff880270add5d0 EFLAGS: 00010246 2012-04-02 16:06:59 RAX: 0000000000000038 RBX: ffff880270add730 RCX: 0000000000000008 2012-04-02 16:06:59 RDX: 0000000000000001 RSI: ffff8804cd11aec0 RDI: ffff88012cf7a5c0 2012-04-02 16:06:59 RBP: ffff880270add670 R08: ffff88012cf7a4c8 R09: ffff8804cd11aec0 2012-04-02 16:06:59 R10: ffff88023b5ec000 R11: 0000000000000fff R12: ffff880270add6b0 2012-04-02 16:06:59 R13: ffff88012cf7a5c0 R14: ffff8804cd11aec0 R15: ffff88012cf7a5c0 2012-04-02 16:06:59 FS: 00002aaaab06eb20(0000) GS:ffff88034ac00000(0000) knlGS:0000000000000000 2012-04-02 16:06:59 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b 2012-04-02 16:06:59 CR2: 00002aaaac9ce000 CR3: 0000000001a85000 CR4: 00000000000006e0 2012-04-02 16:06:59 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 2012-04-02 16:06:59 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 2012-04-02 16:06:59 Process ll_ost_io_146 (pid: 8224, threadinfo ffff880270adc000, task ffff880270adb580) 2012-04-02 16:06:59 Stack: 2012-04-02 16:06:59 ffff880270add620 ffffffff81273646 0000000000000282 ffff88012cf7a5b0 2012-04-02 16:06:59 <0> ffff88012cf7a4c8 000000000000216c ffff8804cd11aec0 ffff88012cf7a4c8 2012-04-02 16:06:59 <0> 0000000000000000 ffff88012cf7a5c0 ffff880270add670 ffffffff81278c08 2012-04-02 16:06:59 Call Trace: 2012-04-02 16:06:59 [<ffffffff81273646>] ? rwsem_wake+0x76/0x170 2012-04-02 16:06:59 [<ffffffff81278c08>] ? call_rwsem_wake+0x18/0x30 2012-04-02 16:06:59 [<ffffffffa04d639f>] ldiskfs_ext_walk_space+0x14f/0x340 [ldiskfs] 2012-04-02 16:06:59 [<ffffffffa0aaaf20>] ? ldiskfs_ext_new_extent_cb+0x0/0x660 [fsfilt_ldiskfs] 2012-04-02 16:06:59 [<ffffffffa0aaac83>] fsfilt_map_nblocks+0xa3/0xd0 [fsfilt_ldiskfs] 2012-04-02 16:06:59 [<ffffffffa0aaadc3>] fsfilt_ldiskfs_map_ext_inode_pages+0x113/0x1e0 [fsfilt_ldiskfs] 2012-04-02 16:06:59 [<ffffffff811285e7>] ? ____pagevec_lru_add+0x167/0x180 2012-04-02 16:06:59 [<ffffffffa0aaaf15>] fsfilt_ldiskfs_map_inode_pages+0x85/0x90 [fsfilt_ldiskfs] 2012-04-02 16:06:59 [<ffffffff81111778>] ? add_to_page_cache_lru+0x68/0x80 2012-04-02 16:06:59 [<ffffffffa0ae62d9>] filter_do_bio+0xec9/0x14a0 [obdfilter] 2012-04-02 16:06:59 [<ffffffff81112388>] ? find_or_create_page+0x68/0xb0 2012-04-02 16:06:59 [<ffffffff81012b59>] ? read_tsc+0x9/0x20 2012-04-02 16:06:59 [<ffffffffa0ade8bb>] filter_preprw_write+0x12db/0x1870 [obdfilter] 2012-04-02 16:06:59 [<ffffffffa040bc21>] ? lnet_ni_send+0x51/0xe0 [lnet] 2012-04-02 16:06:59 [<ffffffffa06d9193>] ? null_alloc_rs+0x173/0x2b0 [ptlrpc] 2012-04-02 16:06:59 [<ffffffffa0adf9f8>] filter_preprw+0x68/0x80 [obdfilter] 2012-04-02 16:06:59 [<ffffffffa0a7d33e>] obd_preprw+0x11e/0x2f0 [ost] 2012-04-02 16:06:59 [<ffffffffa0a84827>] ost_brw_write+0x807/0x1420 [ost] 2012-04-02 16:06:59 [<ffffffffa06a6544>] ? lustre_msg_get_version+0x54/0x90 [ptlrpc] 2012-04-02 16:06:59 [<ffffffffa06a65f8>] ? lustre_msg_check_version+0x78/0x90 [ptlrpc] 2012-04-02 16:06:59 [<ffffffffa0a880e8>] ost_handle+0x2608/0x3950 [ost] 2012-04-02 16:06:59 [<ffffffffa06a6944>] ? lustre_msg_get_transno+0x54/0x90 [ptlrpc] 2012-04-02 16:06:59 [<ffffffffa06b27a1>] ptlrpc_main+0xcd1/0x1690 [ptlrpc] 2012-04-02 16:06:59 [<ffffffffa06b1ad0>] ? ptlrpc_main+0x0/0x1690 [ptlrpc] 2012-04-02 16:06:59 [<ffffffff8100c14a>] child_rip+0xa/0x20 2012-04-02 16:06:59 [<ffffffffa06b1ad0>] ? ptlrpc_main+0x0/0x1690 [ptlrpc] 2012-04-02 16:06:59 [<ffffffffa06b1ad0>] ? ptlrpc_main+0x0/0x1690 [ptlrpc] 2012-04-02 16:06:59 [<ffffffff8100c140>] ? child_rip+0x0/0x20 |
| Comments |
| Comment by Oleg Drokin [ 03/Apr/12 ] |
|
Do you have this patch from |
| Comment by Ned Bass [ 03/Apr/12 ] |
|
Hi Oleg, Yes we do have that patch. |
| Comment by Alex Zhuravlev [ 03/Apr/12 ] |
|
I'm sorry - this is my fault, I left this LASSERTF for debugging purposes to make sure the race really happens ldiskfs_ext_walk_space() takes the semaphore protecting the tree, finds an extent for required offset, makes the solution is to remove LASSERTF(). sorry again. |
| Comment by Alex Zhuravlev [ 03/Apr/12 ] |
|
btw, is it possible to find what LASSERTF() did print ? |
| Comment by Ned Bass [ 03/Apr/12 ] |
|
Good to know there's an easy explanation. I did look for the LASSERTF message in the console log and on our syslog server, but somehow it seems to have gotten lost. |
| Comment by Peter Jones [ 03/Apr/12 ] |
|
Yu Jian Could you please make the necessary correction here? Thanks Peter |
| Comment by Jian Yu [ 05/Apr/12 ] |
|
Patch for master branch is in http://review.whamcloud.com/2452. |
| Comment by Build Master (Inactive) [ 29/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 29/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 29/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 29/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 29/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 29/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 29/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 29/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 29/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 29/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 29/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 29/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 29/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 29/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 29/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 29/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 29/Apr/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Jian Yu [ 03/May/12 ] |
|
Patch for b2_1 branch: http://review.whamcloud.com/2639 |
| Comment by Ned Bass [ 03/May/12 ] |
|
Actually it appears that we were hitting the EXT_ASSERT on lustre/lvfs/fsfilt_ext3.c:977 976 i = EXT_DEPTH(base); 977 EXT_ASSERT(path[i].p_hdr); This explains why we never saw the LASSERTF message. We hit this again after applying http://review.whamcloud.com/2639. |
| Comment by Alex Zhuravlev [ 15/May/12 ] |
|
Ned, thanks for the report. this assertion is invalid as well (since new locking was introduced in ext4_ext_walk_space()) |
| Comment by Jian Yu [ 17/May/12 ] |
Patch for master branch: http://review.whamcloud.com/2827 |
| Comment by Peter Jones [ 22/May/12 ] |
|
Landed for 2.1.2 and 2.3 |