Details
-
Bug
-
Resolution: Fixed
-
Major
-
None
-
Lustre 2.1.3
-
3
-
9966
Description
Hi,
At a customer site, a compute node running Lustre crashes with the following error:
2013-08-23 17:08:55 LustreError: 3049:0:(lib-move.c:185:lnet_match_md()) ASSERTION( me == md->md_me ) failed: 2013-08-23 17:08:55 LustreError: 3049:0:(lib-move.c:185:lnet_match_md()) LBUG 2013-08-23 17:08:55 Pid: 3049, comm: kiblnd_sd_11 2013-08-23 17:08:55 2013-08-23 17:08:55 Call Trace: 2013-08-23 17:08:55 [<ffffffffa04c47f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] 2013-08-23 17:08:55 [<ffffffffa04c4e07>] lbug_with_loc+0x47/0xb0 [libcfs] 2013-08-23 17:08:55 [<ffffffffa053c2bd>] lnet_match_md+0x35d/0x3a0 [lnet] 2013-08-23 17:08:55 [<ffffffffa053a8dd>] ? lnet_ni_recv+0xad/0x2f0 [lnet] 2013-08-23 17:08:55 [<ffffffffa0541482>] lnet_parse+0xac2/0x1b80 [lnet] 2013-08-23 17:08:55 [<ffffffff81042f33>] ? enqueue_task+0x43/0x90 2013-08-23 17:08:55 [<ffffffffa08d177b>] kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd] 2013-08-23 17:08:55 [<ffffffff8103d349>] ? __wake_up_common+0x59/0x90 2013-08-23 17:08:55 [<ffffffffa08d2560>] kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd] 2013-08-23 17:08:55 [<ffffffff81042cf3>] ? __wake_up+0x53/0x70 2013-08-23 17:08:55 [<ffffffffa08d2732>] kiblnd_complete+0x62/0xe0 [ko2iblnd] 2013-08-23 17:08:55 [<ffffffffa08d2ae9>] kiblnd_scheduler+0x339/0x7a0 [ko2iblnd] 2013-08-23 17:08:55 [<ffffffff81048df0>] ? default_wake_function+0x0/0x20 2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd] 2013-08-23 17:08:55 [<ffffffff8100412a>] child_rip+0xa/0x20 2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd] 2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd] 2013-08-23 17:08:55 [<ffffffff81004120>] ? child_rip+0x0/0x20 2013-08-23 17:08:55 2013-08-23 17:08:55 Kernel panic - not syncing: LBUG 2013-08-23 17:08:55 Pid: 3049, comm: kiblnd_sd_11 Tainted: G --------------- H 2.6.32-279.5.2.bl6.Bull.36.x86_64 #1 2013-08-23 17:08:55 Call Trace: 2013-08-23 17:08:55 [<ffffffff81495fe3>] ? panic+0xa0/0x168 2013-08-23 17:08:55 [<ffffffffa04c4e5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs] 2013-08-23 17:08:55 [<ffffffffa053c2bd>] ? lnet_match_md+0x35d/0x3a0 [lnet] 2013-08-23 17:08:55 [<ffffffffa053a8dd>] ? lnet_ni_recv+0xad/0x2f0 [lnet] 2013-08-23 17:08:55 [<ffffffffa0541482>] ? lnet_parse+0xac2/0x1b80 [lnet] 2013-08-23 17:08:55 [<ffffffff81042f33>] ? enqueue_task+0x43/0x90 2013-08-23 17:08:55 [<ffffffffa08d177b>] ? kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd] 2013-08-23 17:08:55 [<ffffffff8103d349>] ? __wake_up_common+0x59/0x90 2013-08-23 17:08:55 [<ffffffffa08d2560>] ? kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd] 2013-08-23 17:08:55 [<ffffffff81042cf3>] ? __wake_up+0x53/0x70 2013-08-23 17:08:55 [<ffffffffa08d2732>] ? kiblnd_complete+0x62/0xe0 [ko2iblnd] 2013-08-23 17:08:55 [<ffffffffa08d2ae9>] ? kiblnd_scheduler+0x339/0x7a0 [ko2iblnd] 2013-08-23 17:08:55 [<ffffffff81048df0>] ? default_wake_function+0x0/0x20 2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd] 2013-08-23 17:08:55 [<ffffffff8100412a>] ? child_rip+0xa/0x20 2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd] 2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd] 2013-08-23 17:08:55 [<ffffffff81004120>] ? child_rip+0x0/0x20
The messages visible in the syslog 2 hours before the failed assertion are:
2013-08-23 15:26:09 LustreError: 11-0: an error occurred while communicating with 10.200.200.66@o2ib. The obd_ping operation failed with -107 2013-08-23 15:26:09 Lustre: scratch-OST0018-osc-ffff88087a659000: Connection to scratch-OST0018 (at 10.200.200.66@o2ib) was lost; in progress operations using this service will wait for recovery to complete 2013-08-23 15:26:09 LustreError: 167-0: This client was evicted by scratch-OST0018; in progress operations using this service will fail. 2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) lock@ffff880e871c7d98[2 3 0 1 1 00000000] R(1):[4, 18446744073709551615]@[0x100180000:0x2e225e:0x0] { 2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) lovsub@ffff880c2c27c8a0: [0 ffff880f31e5f268 R(1):[4, 18446744073709551615]@[0x2006bc5f0:0x1ac23:0x0]] 2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) osc@ffff88104e603d78: ffff881072b1f240 40120002 0x4d3a65b709264355 3 ffff881012f1ae88 size: 26325 mtime: 1377264244 atime: 1377264247 ctime: 1377264244 blocks: 56 2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) } lock@ffff880e871c7d98 2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) dlmlock returned -5 2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) lock@ffff880e871c7078[2 2 0 1 1 00000000] R(1):[2, 3]@[0x100180000:0x2e225e:0x0] { 2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) lovsub@ffff880c2c27c920: [0 ffff8801fca3e4d8 R(1):[0, 18446744073709551615]@[0x2006bc5f0:0x1ac23:0x0]] 2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) osc@ffff88104e603368: ffff88077ea89d80 40120002 0x4d3a65b709264363 2 (null) size: 26325 mtime: 1377264244 atime: 1377264247 ctime: 1377264244 blocks: 56 2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) } lock@ffff880e871c7078 2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) dlmlock returned -5 2013-08-23 15:26:09 LustreError: 24191:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace scratch-OST0018-osc-ffff88087a659000 resource refcount nonzero (2) after lock cleanup; forcing cleanup. 2013-08-23 15:26:09 LustreError: 24191:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff880e973c7480 (3023454/0/0/0) (rc: 2) 2013-08-23 15:26:09 Lustre: scratch-OST0018-osc-ffff88087a659000: Connection restored to scratch-OST0018 (at 10.200.200.66@o2ib) 2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) lock@ffff880e871c7698[2 4 0 2 0 00000000] R(1):[4, 18446744073709551615]@[0x2006bc5f0:0x1ac23:0x0] { 2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) vvp@ffff880dafcd7d30: 2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) lov@ffff880f31e5f268: 1 2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) 0 0: --- 2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) 2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) } lock@ffff880e871c7698 2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) unuse return -5
After investigation, no error was seen on the Infiniband network by the time of the Lustre node crash.
Thanks,
Sebastien.
Attachments
Issue Links
- is related to
-
LU-4330 LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback ... ) failed
- Reopened