Lustre / LU-3848

Compute node crashes due to error in Lustre : ASSERTION( me == md->md_me ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • None
    • Affects Version/s: Lustre 2.1.3
    • 3
    • 9966

    Description

      Hi,

      At a customer site, a compute node running Lustre crashes with the following error:

      2013-08-23 17:08:55 LustreError: 3049:0:(lib-move.c:185:lnet_match_md()) ASSERTION( me == md->md_me ) failed:
      2013-08-23 17:08:55 LustreError: 3049:0:(lib-move.c:185:lnet_match_md()) LBUG
      2013-08-23 17:08:55 Pid: 3049, comm: kiblnd_sd_11
      2013-08-23 17:08:55
      2013-08-23 17:08:55 Call Trace:
      2013-08-23 17:08:55 [<ffffffffa04c47f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2013-08-23 17:08:55 [<ffffffffa04c4e07>] lbug_with_loc+0x47/0xb0 [libcfs]
      2013-08-23 17:08:55 [<ffffffffa053c2bd>] lnet_match_md+0x35d/0x3a0 [lnet]
      2013-08-23 17:08:55 [<ffffffffa053a8dd>] ? lnet_ni_recv+0xad/0x2f0 [lnet]
      2013-08-23 17:08:55 [<ffffffffa0541482>] lnet_parse+0xac2/0x1b80 [lnet]
      2013-08-23 17:08:55 [<ffffffff81042f33>] ? enqueue_task+0x43/0x90
      2013-08-23 17:08:55 [<ffffffffa08d177b>] kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffff8103d349>] ? __wake_up_common+0x59/0x90
      2013-08-23 17:08:55 [<ffffffffa08d2560>] kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffff81042cf3>] ? __wake_up+0x53/0x70
      2013-08-23 17:08:55 [<ffffffffa08d2732>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffffa08d2ae9>] kiblnd_scheduler+0x339/0x7a0 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffff81048df0>] ? default_wake_function+0x0/0x20
      2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffff8100412a>] child_rip+0xa/0x20
      2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffff81004120>] ? child_rip+0x0/0x20
      2013-08-23 17:08:55
      2013-08-23 17:08:55 Kernel panic - not syncing: LBUG
      2013-08-23 17:08:55 Pid: 3049, comm: kiblnd_sd_11 Tainted: G --------------- H 2.6.32-279.5.2.bl6.Bull.36.x86_64 #1
      2013-08-23 17:08:55 Call Trace:
      2013-08-23 17:08:55 [<ffffffff81495fe3>] ? panic+0xa0/0x168
      2013-08-23 17:08:55 [<ffffffffa04c4e5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      2013-08-23 17:08:55 [<ffffffffa053c2bd>] ? lnet_match_md+0x35d/0x3a0 [lnet]
      2013-08-23 17:08:55 [<ffffffffa053a8dd>] ? lnet_ni_recv+0xad/0x2f0 [lnet]
      2013-08-23 17:08:55 [<ffffffffa0541482>] ? lnet_parse+0xac2/0x1b80 [lnet]
      2013-08-23 17:08:55 [<ffffffff81042f33>] ? enqueue_task+0x43/0x90
      2013-08-23 17:08:55 [<ffffffffa08d177b>] ? kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffff8103d349>] ? __wake_up_common+0x59/0x90
      2013-08-23 17:08:55 [<ffffffffa08d2560>] ? kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffff81042cf3>] ? __wake_up+0x53/0x70
      2013-08-23 17:08:55 [<ffffffffa08d2732>] ? kiblnd_complete+0x62/0xe0 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffffa08d2ae9>] ? kiblnd_scheduler+0x339/0x7a0 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffff81048df0>] ? default_wake_function+0x0/0x20
      2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffff8100412a>] ? child_rip+0xa/0x20
      2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffff81004120>] ? child_rip+0x0/0x20
      

      The following messages appeared in the syslog about two hours before the failed assertion:

      2013-08-23 15:26:09 LustreError: 11-0: an error occurred while communicating with 10.200.200.66@o2ib. The obd_ping operation failed with -107
      2013-08-23 15:26:09 Lustre: scratch-OST0018-osc-ffff88087a659000: Connection to scratch-OST0018 (at 10.200.200.66@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      2013-08-23 15:26:09 LustreError: 167-0: This client was evicted by scratch-OST0018; in progress operations using this service will fail.
      2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) lock@ffff880e871c7d98[2 3 0 1 1 00000000] R(1):[4, 18446744073709551615]@[0x100180000:0x2e225e:0x0] {
      2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) lovsub@ffff880c2c27c8a0: [0 ffff880f31e5f268 R(1):[4, 18446744073709551615]@[0x2006bc5f0:0x1ac23:0x0]]
      2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) osc@ffff88104e603d78: ffff881072b1f240 40120002 0x4d3a65b709264355 3 ffff881012f1ae88 size: 26325 mtime: 1377264244 atime: 1377264247 ctime: 1377264244 blocks: 56
      2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) } lock@ffff880e871c7d98
      2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) dlmlock returned -5
      2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) lock@ffff880e871c7078[2 2 0 1 1 00000000] R(1):[2, 3]@[0x100180000:0x2e225e:0x0] {
      2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) lovsub@ffff880c2c27c920: [0 ffff8801fca3e4d8 R(1):[0, 18446744073709551615]@[0x2006bc5f0:0x1ac23:0x0]]
      2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) osc@ffff88104e603368: ffff88077ea89d80 40120002 0x4d3a65b709264363 2 (null) size: 26325 mtime: 1377264244 atime: 1377264247 ctime: 1377264244 blocks: 56
      2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) } lock@ffff880e871c7078
      2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) dlmlock returned -5
      2013-08-23 15:26:09 LustreError: 24191:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace scratch-OST0018-osc-ffff88087a659000 resource refcount nonzero (2) after lock cleanup; forcing cleanup.
      2013-08-23 15:26:09 LustreError: 24191:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff880e973c7480 (3023454/0/0/0) (rc: 2)
      2013-08-23 15:26:09 Lustre: scratch-OST0018-osc-ffff88087a659000: Connection restored to scratch-OST0018 (at 10.200.200.66@o2ib)
      2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) lock@ffff880e871c7698[2 4 0 2 0 00000000] R(1):[4, 18446744073709551615]@[0x2006bc5f0:0x1ac23:0x0] {
      2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) vvp@ffff880dafcd7d30:
      2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) lov@ffff880f31e5f268: 1
      2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) 0 0: ---
      2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try())
      2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) } lock@ffff880e871c7698
      2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) unuse return -5
      

      After investigation, no error was seen on the InfiniBand network at the time of the Lustre node crash.

      Thanks,
      Sebastien.
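      For context, the failed assertion checks the cross-link between an LNet match entry (ME) and its memory descriptor (MD). Below is a minimal userspace model of that invariant; the struct and field names mirror the LNet ones implied by the assertion text, but this is a sketch, not the actual lnet source:

      ```c
      /* Userspace model of the ME<->MD cross-link that lnet_match_md()
       * asserts on the receive path: ASSERTION( me == md->md_me ).
       * Names are assumptions based on the assertion text, not lnet code. */
      #include <assert.h>
      #include <stdio.h>

      struct lnet_md;

      struct lnet_me {                 /* match entry */
          struct lnet_md *me_md;       /* MD attached to this ME */
      };

      struct lnet_md {                 /* memory descriptor */
          struct lnet_me *md_me;       /* back-pointer to the owning ME */
      };

      /* Attaching an MD to an ME must set both links consistently. */
      static void md_attach(struct lnet_me *me, struct lnet_md *md)
      {
          me->me_md = md;
          md->md_me = me;
      }

      int main(void)
      {
          struct lnet_me me = { 0 };
          struct lnet_md md = { 0 };

          md_attach(&me, &md);

          /* The invariant the LBUG trips on when md_me is corrupted: */
          assert(me.me_md == &md && md.md_me == &me);
          printf("invariant holds\n");

          /* A stray write through shared slab memory (the suspected
           * corruption) would break the back-pointer, e.g. setting
           * md.md_me to (void *)-1, and the next incoming message
           * matched against this MD would hit the assertion. */
          return 0;
      }
      ```

      The crash reported here means the back-pointer no longer matched, i.e. the MD memory was modified behind LNet's back.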

      Attachments

        Issue Links

          Activity

            [LU-3848] Compute node crashes due to error in Lustre : ASSERTION( me == md->md_me ) failed

            bfaccini Bruno Faccini (Inactive) added a comment -

            Follow-on work (master patch, for any other version, ...) is now tracked in LU-4330.
            pjones Peter Jones added a comment -

            Great! Thanks for letting us know, Hugo.
            hugo_meiland Hugo Meiland added a comment -

            As an update: the fix has now been installed for a couple of weeks and no more of these LBUGs have been seen, so the fix looks good.

            bfaccini Bruno Faccini (Inactive) added a comment -

            After successful auto-tests, I also ran extensive local testing of http://review.whamcloud.com/8819/ and can confirm it works as expected, with no ME/MD leaks in the slabs.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Hugo,
            Thanks again for your help on this!
            I just had a look at the recent crash-dump you uploaded in LU3848-tcn23-vmcore-vmlinux-systemmap-weak-updates.zip, and I can confirm it is the same scenario; the only small difference is that the md_me field corruption is -1 instead of -3 as before.
            hugo_meiland Hugo Meiland added a comment -

            Hi Bruno and Liang,
            @Bruno: Thanks for the patch; I've asked Sebastien to prepare it for me to test at the client site. I do have plenty more crashes available right now, but I'm not sure whether they would be a huge help at this time.
            @Liang: Your patch has been integrated into our tree, but just after our upgrade to 2.1.6, so I'm not running it yet. It will be in the next Lustre package.
            I'll try to give you feedback on these as soon as possible.

            liang Liang Zhen (Inactive) added a comment -

            Hi Hugo, have you already applied the patch I mentioned previously? http://review.whamcloud.com/#/c/8511/ It fixes a potential memory corruption issue, although I'm not sure whether it is the cause of your issue here.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Hugo, I will review the new crash-dump you uploaded to verify we still encounter the same kind of corruption.

            BTW, I have pushed a patch that should help to better identify the root cause of the ME/MD corruption in the slabs, for both LU-4330 and this ticket; it is available at http://review.whamcloud.com/8819/. It has already been successfully exposed to autotests, and you may want to give it a try. Maybe Sebastien could also validate it and try to integrate it into the Bull distro?

            Since this patch moves LNET ME and small-MD allocation out of the generic <size-128> kmem_cache, I strongly expect that you will no longer encounter the LU-3848/LU-4330 LBUGs (though you may instead see new issues in other kernel code using <size-128> …). But if not, and you are able to reproduce the LBUGs, then I will need to investigate further.
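            The isolation idea behind that patch can be sketched in userspace: give MEs/MDs their own pool instead of the shared size-128 slab, and red-zone each object so a stray write from a neighbouring user is caught at free time. This is a simplified model of the technique only; the real patch would use the kernel's kmem_cache API, and the pool layout below is an assumption:

            ```c
            /* Userspace sketch: a dedicated, red-zoned pool for 128-byte
             * objects, modelling how isolating MEs/MDs from the generic
             * <size-128> cache helps pinpoint who corrupts them. */
            #include <assert.h>
            #include <stddef.h>
            #include <stdint.h>
            #include <stdio.h>
            #include <stdlib.h>
            #include <string.h>

            #define REDZONE 0xDEADBEEFCAFEF00DULL

            struct pool_obj {
                uint64_t guard_lo;            /* red zone before payload */
                unsigned char payload[128];   /* the size class MEs/MDs used */
                uint64_t guard_hi;            /* red zone after payload */
            };

            static void *pool_alloc(void)
            {
                struct pool_obj *o = malloc(sizeof(*o));
                assert(o != NULL);
                o->guard_lo = o->guard_hi = REDZONE;
                memset(o->payload, 0, sizeof(o->payload));
                return o->payload;
            }

            /* Returns 0 if the object is intact, -1 if something
             * scribbled over its red zones. */
            static int pool_free(void *p)
            {
                struct pool_obj *o = (struct pool_obj *)((char *)p -
                                     offsetof(struct pool_obj, payload));
                int corrupt = (o->guard_lo != REDZONE ||
                               o->guard_hi != REDZONE);
                free(o);
                return corrupt ? -1 : 0;
            }

            int main(void)
            {
                void *md = pool_alloc();
                assert(pool_free(md) == 0);       /* clean object passes */

                void *md2 = pool_alloc();
                memset((char *)md2 - 8, 0xff, 8); /* simulate a stray write */
                assert(pool_free(md2) == -1);     /* corruption is detected */
                printf("red-zone check works\n");
                return 0;
            }
            ```

            With a dedicated cache, either the LBUGs stop (the corrupter was another size-128 user) or the corruption follows the MEs/MDs into their own pool, narrowing the search to LNet itself.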

            hugo_meiland Hugo Meiland added a comment -

            Bruno, Liang: this weekend I was able to obtain kernel coredumps with this error again (after a stable period of 3-4 weeks). As mentioned earlier, the Lustre version has been upgraded to 2.1.6. The uploaded file is LU3848-tcn23-vmcore-vmlinux-systemmap-weak-updates.zip; please let me know if you need additional files.

            liang Liang Zhen (Inactive) added a comment -

            BTW, I just posted a patch to fix a potential memory corruption: http://review.whamcloud.com/#/c/8511/

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: sebastien.buisson Sebastien Buisson (Inactive)
              Votes: 0
              Watchers: 10
