Details
Bug
Resolution: Fixed
Blocker
Lustre 2.2.0, Lustre 2.3.0
SL5 on clients and servers, mix of 2.1.0, 2.1.1 and 2.2.0 clients, 2.2.0 on all servers
4
4593
Description
After adding 24 OSTs to the file system we get client LBUGs and crashes on Lustre 2.2.0. We expanded the file system by adding new resources, and the clients saw the new OSTs correctly, but we now get dozens of crashes every day. The trace looks like this:
May 18 15:18:36 <user.notice> n3-1-13.local Pid[]: 9127, comm: dtf3d_qdot.out
May 18 15:18:36 <user.notice> n3-1-13.local []:
May 18 15:18:36 <user.notice> n3-1-13.local Call[]: Trace:
May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff8870c5f1>]: libcfs_debug_dumpstack+0x51/0x60 [libcfs]
May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff8870ca28>]: lbug_with_loc+0x48/0x90 [libcfs]
May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff88816a60>]: cl_page_assume+0xa0/0x190 [obdclass]
May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff88c53198>]: ll_prepare_write+0x98/0x150 [lustre]
May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff88c6749b>]: ll_write_begin+0xdb/0x150 [lustre]
May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff8000fe46>]: generic_file_buffered_write+0x14b/0x6a9
May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff80016741>]: __generic_file_aio_write_nolock+0x369/0x3b6
May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff800c9ab4>]: __generic_file_write_nolock+0x8f/0xa8
May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff800a34ad>]: autoremove_wake_function+0x0/0x2e
May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff8881ad4d>]: cl_enqueue_try+0x23d/0x2f0 [obdclass]
May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff80063af9>]: mutex_lock+0xd/0x1d
May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff800c9b15>]: generic_file_writev+0x48/0xa3
May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff88c7772d>]: vvp_io_write_start+0xfd/0x1b0 [lustre]
May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff8881d810>]: cl_io_start+0x90/0xf0 [obdclass]
May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff888204d8>]: cl_io_loop+0x88/0x130 [obdclass]
May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff88c3124d>]: ll_file_io_generic+0x44d/0x4a0 [lustre]
May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff88c31425>]: ll_file_writev+0x185/0x1f0 [lustre]
May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff88c3aa71>]: ll_file_write+0x121/0x190 [lustre]
May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff80016b49>]: vfs_write+0xce/0x174
May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff80017412>]: sys_write+0x45/0x6e
May 18 15:18:36 <user.notice> n3-1-13.local [<ffffffff8005d28d>]: tracesys+0xd5/0xe0
May 18 15:18:36 <user.notice> n3-1-13.local []:
May 18 15:18:36 <user.notice> n3-1-13.local Kernel[]: panic - not syncing: LBUG
May 18 15:18:36 <user.notice> n3-1-13.local []:
The problem is hard to reproduce even though we know which binaries caused it. So far the problem seems to disappear after a client reboot, although it may simply be that the next crash has not happened yet. We don't have a crashkernel dump yet. There is nothing suspicious in the server logs.
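With dozens of crashes a day and no crash dump, it may help to check whether every crash goes through the same call path. The following is only a minimal sketch, assuming the syslog lines have the same shape as the trace above; the names `parse_frames`, `trace_signature` and `frames_per_module` are made up for illustration and are not part of any Lustre tool.

```python
import re
from collections import Counter

# Matches stack frame lines such as:
#   ... [<ffffffff88816a60>]: cl_page_assume+0xa0/0x190 [obdclass]
# The trailing "[module]" is absent for symbols in the core kernel.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]:\s+"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(lines):
    """Return (function, module) pairs for every stack frame line."""
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m:
            # Frames without a module suffix are labelled "kernel" here
            # (an arbitrary label chosen for this sketch).
            frames.append((m.group("func"), m.group("module") or "kernel"))
    return frames


def trace_signature(lines, depth=5):
    """Top `depth` function names, usable as a key to group crashes."""
    return tuple(func for func, _ in parse_frames(lines)[:depth])


def frames_per_module(lines):
    """Count how many frames of a trace belong to each module."""
    return Counter(module for _, module in parse_frames(lines))
```

Grouping each day's crashes by `trace_signature` would show whether they all pass through `cl_page_assume` or whether more than one distinct LBUG is involved.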