Details
Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version: None
Affects Version: Lustre 2.7.0
Labels: None
Severity: 3
Rank: 15385
Description
A compilebench run on Lola, where we inject 0.1% message drops between clients and servers, hung like this:
Aug 20 20:29:41 lola-24 kernel: python S 0000000000000001 0 53350 53330 0x00000080
Aug 20 20:29:41 lola-24 kernel: ffff8807c033dbd8 0000000000000082 0000000000000000 ffff8807b52a4a98
Aug 20 20:29:41 lola-24 kernel: ffff8807c033db78 ffffffffa0af7f8f ffff8807c033db78 ffff8807ef4d7cf8
Aug 20 20:29:41 lola-24 kernel: ffff880804e15098 ffff8807c033dfd8 000000000000fbc8 ffff880804e15098
Aug 20 20:29:41 lola-24 kernel: Call Trace:
Aug 20 20:29:41 lola-24 kernel: [<ffffffffa0af7f8f>] ? lov_sublock_unlock+0x5f/0x140 [lov]
Aug 20 20:29:41 lola-24 kernel: [<ffffffffa05ed643>] cl_lock_state_wait+0x1d3/0x320 [obdclass]
Aug 20 20:29:41 lola-24 kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
Aug 20 20:29:41 lola-24 kernel: [<ffffffffa05ede0b>] cl_enqueue_locked+0x15b/0x1f0 [obdclass]
Aug 20 20:29:41 lola-24 kernel: [<ffffffffa05ee97e>] cl_lock_request+0x7e/0x270 [obdclass]
Aug 20 20:29:41 lola-24 kernel: [<ffffffffa05f3934>] cl_io_lock+0x3c4/0x560 [obdclass]
Aug 20 20:29:41 lola-24 kernel: [<ffffffffa05f3b72>] cl_io_loop+0xa2/0x1b0 [obdclass]
Aug 20 20:29:41 lola-24 kernel: [<ffffffffa0b7e1c2>] ll_file_io_generic+0x412/0x8f0 [lustre]
Aug 20 20:29:41 lola-24 kernel: [<ffffffffa05e3ca9>] ? cl_env_get+0x29/0x350 [obdclass]
Aug 20 20:29:41 lola-24 kernel: [<ffffffffa0b7eee3>] ll_file_aio_write+0x133/0x2b0 [lustre]
Aug 20 20:29:41 lola-24 kernel: [<ffffffffa0b7f1b9>] ll_file_write+0x159/0x290 [lustre]
Aug 20 20:29:41 lola-24 kernel: [<ffffffff811892e8>] vfs_write+0xb8/0x1a0
Aug 20 20:29:41 lola-24 kernel: [<ffffffff81189cb1>] sys_write+0x51/0x90
Aug 20 20:29:41 lola-24 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Attachment compilebench_hang-lola-24.log contains the complete stack dump. This is likely to be difficult to reproduce.
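For anyone attempting a reproduction: the report does not say how the 0.1% drops were injected on Lola, but on releases that ship LNet's fault-injection framework (lctl net_drop_add, Lustre 2.8 and later) a comparable drop rate between clients and servers could be approximated with something like the following sketch; the '*@o2ib' NID patterns are placeholders for the site's actual networks.

  # Drop roughly 1 in 1000 (0.1%) of LNet messages matching the
  # source/destination NID patterns (placeholders; adjust to the site).
  lctl net_drop_add -s '*@o2ib' -d '*@o2ib' -r 1000

  # List the active drop rules, and remove them all when finished.
  lctl net_drop_list
  lctl net_drop_del -a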