[LU-2106] 2.3<->2.2/2.1 interop: LBUG: ASSERTION( lock != ((void *)0) ) failed Created: 08/Oct/12  Updated: 17/Apr/17  Resolved: 17/Apr/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0, Lustre 2.3.0, Lustre 2.1.3, Lustre 2.5.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Maloo Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Related
is related to LU-3647 HSM _not only_ small fixes and to do ... Closed
Severity: 3
Rank (Obsolete): 4394

 Description   

This issue was created by maloo for yujian <yujian@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/6fac1700-0e30-11e2-91a3-52540035b04c.

Lustre Client Build: http://build.whamcloud.com/job/lustre-b2_3/28
Lustre Server Build: http://build.whamcloud.com/job/lustre-b2_2/17
Distro/Arch: RHEL6.3/x86_64

The sub-test test_41b failed with the following error:

Starting mds1: -o user_xattr,acl -o nomgs,force  /dev/lvm-MDS/P1 /mnt/mds1
CMD: client-29vm7 mkdir -p /mnt/mds1; mount -t lustre -o user_xattr,acl -o nomgs,force  		                   /dev/lvm-MDS/P1 /mnt/mds1
test failed to respond and timed out

Info required for matching: conf-sanity 41b

The console log on the MDS showed:

11:22:49:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre -o user_xattr,acl -o nomgs,force  		                   /dev/lvm-MDS/P1 /mnt/mds1
11:22:50:LustreError: 166-1: MGC10.10.4.178@tcp: Connection to service MGS via nid 0@lo was lost; in progress operations using this service will fail.
11:22:50:Lustre: 3618:0:(ldlm_lib.c:633:target_handle_reconnect()) MGS: 4172abe8-fa59-d3f9-325d-4acb9d3d67d0 reconnecting
11:22:50:LustreError: 3618:0:(obd_class.h:521:obd_set_info_async()) obd_set_info_async: dev 0 no operation
11:22:50:LustreError: 3619:0:(ldlm_lock.c:818:ldlm_lock_decref_and_cancel()) ASSERTION( lock != ((void *)0) ) failed: 
11:22:50:LustreError: 3619:0:(ldlm_lock.c:818:ldlm_lock_decref_and_cancel()) LBUG
11:22:50:Pid: 3619, comm: ll_mgs_02
11:22:50:
11:22:50:Call Trace:
11:22:50: [<ffffffffa04cb835>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
11:22:50: [<ffffffffa04cbd67>] lbug_with_loc+0x47/0xb0 [libcfs]
11:22:50: [<ffffffffa071d2b1>] ldlm_lock_decref_and_cancel+0x111/0x120 [ptlrpc]
11:22:50: [<ffffffffa0ae95ab>] mgs_completion_ast_config+0xfb/0x110 [mgs]
11:22:50: [<ffffffffa0734540>] ldlm_cli_enqueue_local+0x1f0/0x4d0 [ptlrpc]
11:22:50: [<ffffffffa0ae94b0>] ? mgs_completion_ast_config+0x0/0x110 [mgs]
11:22:50: [<ffffffffa0733670>] ? ldlm_blocking_ast+0x0/0x130 [ptlrpc]
11:22:50: [<ffffffffa0ae92ac>] mgs_revoke_lock+0x13c/0x230 [mgs]
11:22:50: [<ffffffffa0733670>] ? ldlm_blocking_ast+0x0/0x130 [ptlrpc]
11:22:50: [<ffffffffa0ae94b0>] ? mgs_completion_ast_config+0x0/0x110 [mgs]
11:22:50: [<ffffffffa04d54f1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
11:22:50: [<ffffffffa0aebc8f>] mgs_handle+0xf0f/0x1820 [mgs]
11:22:50: [<ffffffffa0763011>] ptlrpc_server_handle_request+0x3c1/0xcb0 [ptlrpc]
11:22:50: [<ffffffffa04cc3ee>] ? cfs_timer_arm+0xe/0x10 [libcfs]
11:22:50: [<ffffffffa04d6e19>] ? lc_watchdog_touch+0x79/0x110 [libcfs]
11:22:50: [<ffffffffa075d0e2>] ? ptlrpc_wait_event+0xb2/0x2c0 [ptlrpc]
11:22:50: [<ffffffff8105e7f0>] ? default_wake_function+0x0/0x20
11:22:50: [<ffffffffa076401f>] ptlrpc_main+0x71f/0x1210 [ptlrpc]
11:22:50: [<ffffffffa0763900>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
11:22:50: [<ffffffff8100c14a>] child_rip+0xa/0x20
11:22:50: [<ffffffffa0763900>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
11:22:50: [<ffffffffa0763900>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
11:22:50: [<ffffffff8100c140>] ? child_rip+0x0/0x20
11:22:50:
11:22:50:Kernel panic - not syncing: LBUG


 Comments   
Comment by Jian Yu [ 15/Oct/12 ]

Lustre Client Build: http://build.whamcloud.com/job/lustre-b2_1/121
Lustre Server Build: http://build.whamcloud.com/job/lustre-b2_3/36

After running parallel-scale, the following LBUG occurred on the MDS:

20:08:32:Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale parallel-scale.sh test complete, duration 11313 sec == 20:08:24 \(1350270504\)
20:08:32:Lustre: DEBUG MARKER: == parallel-scale parallel-scale.sh test complete, duration 11313 sec == 20:08:24 (1350270504)
20:08:32:Lustre: DEBUG MARKER: lctl get_param mdd.lustre-MDT*.quota_type
20:08:32:Lustre: DEBUG MARKER: lctl conf_param lustre-MDT*.mdd.quota_type=3
20:08:32:Lustre: Modifying parameter lustre-MDT0000.mdd.quota_type in log lustre-MDT0000
20:08:32:Lustre: Skipped 15 previous similar messages
20:08:32:LustreError: 31909:0:(ldlm_lock.c:836:ldlm_lock_decref_and_cancel()) ASSERTION( lock != ((void *)0) ) failed: 
20:08:32:LustreError: 31909:0:(ldlm_lock.c:836:ldlm_lock_decref_and_cancel()) LBUG
20:08:32:Pid: 31909, comm: lctl
20:08:32:
20:08:32:Call Trace:
20:08:32: [<ffffffffa04de905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
20:08:33: [<ffffffffa04def17>] lbug_with_loc+0x47/0xb0 [libcfs]
20:08:33: [<ffffffffa07961cc>] ldlm_lock_decref_and_cancel+0x14c/0x150 [ptlrpc]
20:08:33: [<ffffffffa0ba9685>] mgs_completion_ast_config+0x135/0x140 [mgs]
20:08:33: [<ffffffffa07b1936>] ldlm_cli_enqueue_local+0x1e6/0x560 [ptlrpc]
20:08:33: [<ffffffffa0ba9550>] ? mgs_completion_ast_config+0x0/0x140 [mgs]
20:08:33: [<ffffffffa07b0910>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
20:08:33: [<ffffffffa0ba85bf>] mgs_revoke_lock+0x12f/0x290 [mgs]
20:08:33: [<ffffffffa07b0910>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
20:08:33: [<ffffffffa0ba9550>] ? mgs_completion_ast_config+0x0/0x140 [mgs]
20:08:33: [<ffffffff8127d1f0>] ? sprintf+0x40/0x50
20:08:33: [<ffffffffa0bc0f3b>] mgs_setparam+0x7cb/0x1040 [mgs]
20:08:33: [<ffffffffa0bacf07>] mgs_iocontrol+0x987/0xb70 [mgs]
20:08:33: [<ffffffffa062d55f>] class_handle_ioctl+0x130f/0x1ee0 [obdclass]
20:08:33: [<ffffffff8113ff34>] ? handle_mm_fault+0x1e4/0x2b0
20:08:33: [<ffffffffa06192ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
20:08:33: [<ffffffff8118dff2>] vfs_ioctl+0x22/0xa0
20:08:33: [<ffffffff8118e194>] do_vfs_ioctl+0x84/0x580
20:08:33: [<ffffffff8118e711>] sys_ioctl+0x81/0xa0
20:08:33: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
20:08:33:
20:08:33:Kernel panic - not syncing: LBUG

Maloo report: https://maloo.whamcloud.com/test_sets/cb48e3c6-16ab-11e2-962d-52540035b04c

This LBUG was also reported in LU-1259.

Comment by Sarah Liu [ 14/Aug/13 ]

Also hit the same issue when running interop testing between a 2.4.0 server and a 2.5 client:

server: 2.4.0
client: lustre-master tag 2.4.90 build#1610

Lustre: DEBUG MARKER: == sanity-hsm test 3: Check file dirtyness when opening for write == 12:18:08 (1376507888)
Lustre: DEBUG MARKER: == sanity-hsm test 11: Import a file == 12:18:09 (1376507889)
Lustre: DEBUG MARKER: sanity-hsm test_11: @@@@@@ FAIL: import failed
Lustre: DEBUG MARKER: == sanity-hsm test 20: Release is not permitted == 12:18:11 (1376507891)
LustreError: 27684:0:(ldlm_lock.c:967:ldlm_lock_decref_and_cancel()) ASSERTION( lock != ((void *)0) ) failed: 
LustreError: 27684:0:(ldlm_lock.c:967:ldlm_lock_decref_and_cancel()) LBUG

Pid: 27684, comm: lfs

Call Trace:
 [<ffffffffa035d895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa035de97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa07a0a6a>] ldlm_lock_decref_and_cancel+0x14a/0x150 [ptlrpc]
 [<ffffffffa10bd74c>] ll_lease_open+0xa1c/0x1010 [lustre]
 [<ffffffffa10aef00>] ? ll_md_blocking_lease_ast+0x0/0x1b0 [lustre]
 [<ffffffffa10bddad>] ll_hsm_release+0x6d/0x350 [lustre]
 [<ffffffffa109f150>] ? ll_dir_open+0x0/0xf0 [lustre]
 [<ffffffffa10a95de>] ll_dir_ioctl+0x3a5e/0x5db0 [lustre]
 [<ffffffffa0368a08>] ? libcfs_log_return+0x28/0x40 [libcfs]
 [<ffffffffa109f150>] ? ll_dir_open+0x0/0xf0 [lustre]
 [<ffffffffa109f150>] ? ll_dir_open+0x0/0xf0 [lustre]
 [<ffffffff8117e37f>] ? __dentry_open+0x23f/0x360
 [<ffffffff8121ce4f>] ? security_inode_permission+0x1f/0x30
 [<ffffffff8117e5b4>] ? nameidata_to_filp+0x54/0x70
 [<ffffffff81192fea>] ? do_filp_open+0x6ea/0xdc0
 [<ffffffff8104759c>] ? __do_page_fault+0x1ec/0x480
 [<ffffffff81195062>] vfs_ioctl+0x22/0xa0
 [<ffffffff81149330>] ? unmap_region+0x110/0x130
 [<ffffffff81195204>] do_vfs_ioctl+0x84/0x580
 [<ffffffff81195781>] sys_ioctl+0x81/0xa0
 [<ffffffff810dc5b5>] ? __audit_syscall_exit+0x265/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Kernel panic - not syncing: LBUG
Pid: 27684, comm: lfs Not tainted 2.6.32-358.14.1.el6.x86_64 #1
Call Trace:
 [<ffffffff8150d668>] ? panic+0xa7/0x16f
 [<ffffffffa035deeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa07a0a6a>] ? ldlm_lock_decref_and_cancel+0x14a/0x150 [ptlrpc]
 [<ffffffffa10bd74c>] ? ll_lease_open+0xa1c/0x1010 [lustre]
 [<ffffffffa10aef00>] ? ll_md_blocking_lease_ast+0x0/0x1b0 [lustre]
 [<ffffffffa10bddad>] ? ll_hsm_release+0x6d/0x350 [lustre]
 [<ffffffffa109f150>] ? ll_dir_open+0x0/0xf0 [lustre]
 [<ffffffffa10a95de>] ? ll_dir_ioctl+0x3a5e/0x5db0 [lustre]
 [<ffffffffa0368a08>] ? libcfs_log_return+0x28/0x40 [libcfs]
 [<ffffffffa109f150>] ? ll_dir_open+0x0/0xf0 [lustre]
 [<ffffffffa109f150>] ? ll_dir_open+0x0/0xf0 [lustre]
 [<ffffffff8117e37f>] ? __dentry_open+0x23f/0x360
 [<ffffffff8121ce4f>] ? security_inode_permission+0x1f/0x30
 [<ffffffff8117e5b4>] ? nameidata_to_filp+0x54/0x70
 [<ffffffff81192fea>] ? do_filp_open+0x6ea/0xdc0
 [<ffffffff8104759c>] ? __do_page_fault+0x1ec/0x480
 [<ffffffff81195062>] ? vfs_ioctl+0x22/0xa0
 [<ffffffff81149330>] ? unmap_region+0x110/0x130
 [<ffffffff81195204>] ? do_vfs_ioctl+0x84/0x580
 [<ffffffff81195781>] ? sys_ioctl+0x81/0xa0
 [<ffffffff810dc5b5>] ? __audit_syscall_exit+0x265/0x290
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Comment by Jinshan Xiong (Inactive) [ 14/Aug/13 ]

Hi Sarah, this is not the same issue. Can you please file a new ticket?

Comment by Jinshan Xiong (Inactive) [ 15/Aug/13 ]

Patch http://review.whamcloud.com/7346 fixes the problem seen by Sarah.

Comment by Andreas Dilger [ 17/Apr/17 ]

Close as duplicate of LU-3647.

Generated at Sat Feb 10 01:22:25 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.