
Client LBUG in ll_file_write after filesystem expansion

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.3.0
    • Affects Version/s: Lustre 2.2.0, Lustre 2.3.0
    • Environment: SL5 on clients and servers; mix of 2.1.0, 2.1.1 and 2.2.0 clients; 2.2.0 on all servers
    • 4
    • 4593

    Description

      After adding 24 OSTs to the file system we get client LBUGs and crashes on Lustre 2.2.0. We expanded the file system by adding new resources, and the new OSTs were seen by the clients properly; however, we now get dozens of crashes every day. The trace looks like this:

      May 18 15:18:36 <user.notice> n3-1-13.local Pid: 9127, comm: dtf3d_qdot.out
      May 18 15:18:36 <user.notice> n3-1-13.local Call Trace:
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff8870c5f1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff8870ca28>] lbug_with_loc+0x48/0x90 [libcfs]
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff88816a60>] cl_page_assume+0xa0/0x190 [obdclass]
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff88c53198>] ll_prepare_write+0x98/0x150 [lustre]
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff88c6749b>] ll_write_begin+0xdb/0x150 [lustre]
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff8000fe46>] generic_file_buffered_write+0x14b/0x6a9
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff80016741>] __generic_file_aio_write_nolock+0x369/0x3b6
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff800c9ab4>] __generic_file_write_nolock+0x8f/0xa8
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff800a34ad>] autoremove_wake_function+0x0/0x2e
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff8881ad4d>] cl_enqueue_try+0x23d/0x2f0 [obdclass]
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff80063af9>] mutex_lock+0xd/0x1d
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff800c9b15>] generic_file_writev+0x48/0xa3
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff88c7772d>] vvp_io_write_start+0xfd/0x1b0 [lustre]
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff8881d810>] cl_io_start+0x90/0xf0 [obdclass]
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff888204d8>] cl_io_loop+0x88/0x130 [obdclass]
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff88c3124d>] ll_file_io_generic+0x44d/0x4a0 [lustre]
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff88c31425>] ll_file_writev+0x185/0x1f0 [lustre]
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff88c3aa71>] ll_file_write+0x121/0x190 [lustre]
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff80016b49>] vfs_write+0xce/0x174
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff80017412>] sys_write+0x45/0x6e
      May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff8005d28d>] tracesys+0xd5/0xe0
      May 18 15:18:36 <user.notice> n3-1-13.local Kernel panic - not syncing: LBUG

      The problem is hard to reproduce even though we know which binaries triggered it. For now it looks like the problem disappears after a client reboot, but a subsequent crash might simply not have happened yet. We don't have a crashkernel dump yet. There is nothing suspicious in the server logs.

      Attachments

        Activity

          [LU-1421] Client LBUG in ll_file_write after filesystem expansion
          pjones Peter Jones added a comment -

          Jinshan

          Could you please look into this one?

          Thanks

          Peter


           prakash Prakash Surya (Inactive) added a comment -

           Jinshan, here is a dump of the backtraces for all processes running on the node at the time of the ASSERT.

           Let me know if any other information would be helpful to get out of the crash dump.

           m.magrys Marek Magrys added a comment -

           Yes, as I mentioned, this bug doesn't concern 2.1.1; that's the version to which we are rolling back.

           prakash Prakash Surya (Inactive) added a comment -

           We seem to hit this running a specific test on our Grove/Sequoia filesystem. We don't see it running the same test on our 2.1.1-based test system. Perhaps it is a regression in 2.2?

           behlendorf Brian Behlendorf added a comment -

           In our case this was with 2.6.32-220.17.1.2chaos.ch5.x86_64, which is RHEL 6.2 plus a few patches.
          m.magrys Marek Magrys added a comment -

           The kernel version is 2.6.18-308.4.1, the latest from SL5. As soon as we are able to reproduce the problem I'll post the information. We dedicated some hardware resources just to reproducing the problem with the codes that possibly cause the LBUG, but we haven't been lucky yet.


           jay Jinshan Xiong (Inactive) added a comment -

           The page was in a strange state: the top page was in the owned state while the sub-page was in the cached state, and the vmpage was not locked at all. It would be helpful to get backtraces of all tasks in the system when the crash happens; the exact kernel version is also needed in case this is related to the kernel.
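           The all-task backtraces requested above are typically pulled from a kdump vmcore with the `crash` utility. A sketch of such a session — the vmlinux and vmcore paths below are illustrative placeholders, not paths from this report:

```shell
# Open the dump with the matching debuginfo kernel (paths are examples).
crash /usr/lib/debug/lib/modules/2.6.18-308.4.1.el5/vmlinux /var/crash/vmcore

# Then, at the crash> prompt:
#   sys          - confirm the exact kernel version captured in the dump
#   foreach bt   - dump a backtrace of every task in the system
#   bt <pid>     - backtrace of a single task, e.g. the LBUGed writer
```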

           jay Jinshan Xiong (Inactive) added a comment -

           Hi Brian, what kernel were you running? EL5 too?

           behlendorf Brian Behlendorf added a comment -

           We also observed this issue last night in the latest Orion branch when we started our stress testing on Grove. Lustre version: lustre-orion-2.2.49.57-45chaos.

          LustreError: 33961:0:(cl_page.c:1031:cl_page_assume()) page@ffff880c749b5680[4 ffff880ffdf34448:1024 ^(null)_ffff880c749b55c0 1 0 1 ffff880e6a836928 (null) 0x0]
          LustreError: 33961:0:(cl_page.c:1031:cl_page_assume()) page@ffff880c749b55c0[1 ffff881009b04b98:1024 ^ffff880c749b5680_(null) 0 0 1 (null) (null) 0x0]
          LustreError: 33961:0:(cl_page.c:1031:cl_page_assume()) vvp-page@ffff880a8fb0ee10(0:0:0) vm@ffffea0038170fd8 c000000000086d 4:0 ffff880c749b5680 1024 lru
          LustreError: 33961:0:(cl_page.c:1031:cl_page_assume()) lov-page@ffff880c6b4c84c8
          LustreError: 33961:0:(cl_page.c:1031:cl_page_assume()) osc-page@ffff880c749b76a0: 1< 0x845fed 2 0 - - - > 2< 4194304 0 4096 0x0 0x400 | (null) ffff88081ef18798 ffff880c749bee80 ffffffffa0a21d20 ffff880c749b76a0 > 3< - ffff880ed1618040 0 0 0 > 4< 0 0 8 39878656 - | - - - - > 5< - - - - | 0 - - | 0 - ->
          LustreError: 33961:0:(cl_page.c:1031:cl_page_assume()) end page@ffff880c749b5680
          LustreError: 33961:0:(cl_page.c:1031:cl_page_assume()) pg->cp_owner == NULL
          LustreError: 33961:0:(cl_page.c:1031:cl_page_assume()) ASSERTION( 0 ) failed:
          LustreError: 33961:0:(cl_page.c:1031:cl_page_assume()) LBUG
           May 18 04:27:58 grove395 kernel: Pid: 33961, comm: miranda_io
           Call Trace:
           [<ffffffffa0329975>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
           [<ffffffffa0329ea7>] lbug_with_loc+0x47/0xb0 [libcfs]
           [<ffffffffa06ba970>] cl_page_own0+0x0/0x2c0 [obdclass]
           [<ffffffffa0b1f6b6>] ll_prepare_write+0x86/0x170 [lustre]
           [<ffffffffa0b338a8>] ll_write_begin+0x88/0x160 [lustre]
           [<ffffffff8111151e>] generic_file_buffered_write+0x10e/0x2a0
           [<ffffffff810707d7>] ? current_fs_time+0x27/0x30
           [<ffffffff81112e70>] __generic_file_aio_write+0x250/0x480
          [<ffffffffa06b5735>] ? cl_env_info+0x15/0x20 [obdclass]
          [<ffffffffa06b5735>] ? cl_env_info+0x15/0x20 [obdclass]
          [<ffffffff8111310f>] generic_file_aio_write+0x6f/0xe0
          [<ffffffffa0b44460>] vvp_io_write_start+0xb0/0x1e0 [lustre]
          [<ffffffffa06c24b2>] cl_io_start+0x72/0xf0 [obdclass]
          [<ffffffffa06c5e44>] cl_io_loop+0xd4/0x160 [obdclass]
          [<ffffffffa0afea6e>] ll_file_io_generic+0x3be/0x4f0 [lustre]
          [<ffffffffa0aff38c>] ll_file_aio_write+0x14c/0x290 [lustre]
          [<ffffffffa0aff643>] ll_file_write+0x173/0x270 [lustre]
          [<ffffffff81177888>] vfs_write+0xb8/0x1a0
          [<ffffffff81178291>] sys_write+0x51/0x90
          [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b


          People

            jay Jinshan Xiong (Inactive)
            m.magrys Marek Magrys
             Votes: 1
             Watchers: 7
