[LU-2911] After upgrading from 1.8.9 to master, hit ASSERTION( seq != ((void *)0) ) failed Created: 05/Mar/13 Updated: 21/Mar/13 Resolved: 21/Mar/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Sarah Liu | Assignee: | Emoly Liu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | LB |
| Environment: |
Before the upgrade, the servers were running 1.8.9 on RHEL5; two clients were running 1.8.9 on RHEL6 and one client was running 1.8.9 on RHEL5. After the upgrade, both servers and clients are running lustre-master build #1280 on RHEL6. |
||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 7004 |
| Description |
|
After upgrading the system from 1.8.9 to 2.4, one of the clients hit an LBUG:

Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: ===== Check the directories/files in the OST pool
Lustre: DEBUG MARKER: ===== Pass
Lustre: DEBUG MARKER: ===== Check Lustre quotas usage/limits
Lustre: DEBUG MARKER: cancel_lru_locks mdc start
Lustre: DEBUG MARKER: cancel_lru_locks osc start
Lustre: DEBUG MARKER: cancel_lru_locks mdc start
Lustre: DEBUG MARKER: cancel_lru_locks osc start
Lustre: DEBUG MARKER: cancel_lru_locks mdc start
Lustre: DEBUG MARKER: cancel_lru_locks osc start
Lustre: DEBUG MARKER: cancel_lru_locks mdc start
Lustre: DEBUG MARKER: cancel_lru_locks osc start
Lustre: DEBUG MARKER: cancel_lru_locks mdc start
Lustre: DEBUG MARKER: cancel_lru_locks osc start
Lustre: DEBUG MARKER: cancel_lru_locks mdc start
Lustre: DEBUG MARKER: cancel_lru_locks osc start
Lustre: DEBUG MARKER: ===== Verify the data
LustreError: 9327:0:(fid_request.c:329:seq_client_alloc_fid()) ASSERTION( seq != ((void *)0) ) failed:
LustreError: 9327:0:(fid_request.c:329:seq_client_alloc_fid()) LBUG
Pid: 9327, comm: mkdir
Call Trace:
 [<ffffffffa0371895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0371e97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa080bea9>] seq_client_alloc_fid+0x379/0x440 [fid]
 [<ffffffffa03822e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa082470b>] mdc_fid_alloc+0xbb/0xf0 [mdc]
 [<ffffffffa0832b1c>] mdc_create+0xcc/0x780 [mdc]
 [<ffffffffa09c487b>] ll_new_node+0x19b/0x6a0 [lustre]
 [<ffffffffa09c50a7>] ll_mkdir+0x97/0x1f0 [lustre]
 [<ffffffff8120db8f>] ? security_inode_permission+0x1f/0x30
 [<ffffffff811833f7>] vfs_mkdir+0xa7/0x100
 [<ffffffff8118656e>] sys_mkdirat+0xfe/0x120
 [<ffffffff810d3d27>] ? audit_syscall_entry+0x1d7/0x200
 [<ffffffff811865a8>] sys_mkdir+0x18/0x20
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Kernel panic - not syncing: LBUG
Pid: 9327, comm: mkdir Not tainted 2.6.32-279.19.1.el6.x86_64 #1
Call Trace:
 [<ffffffff814e9541>] ? panic+0xa0/0x168
 [<ffffffffa0371eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa080bea9>] ? seq_client_alloc_fid+0x379/0x440 [fid]
 [<ffffffffa03822e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa082470b>] ? mdc_fid_alloc+0xbb/0xf0 [mdc]
 [<ffffffffa0832b1c>] ? mdc_create+0xcc/0x780 [mdc]
 [<ffffffffa09c487b>] ? ll_new_node+0x19b/0x6a0 [lustre]
 [<ffffffffa09c50a7>] ? ll_mkdir+0x97/0x1f0 [lustre]
 [<ffffffff8120db8f>] ? security_inode_permission+0x1f/0x30
 [<ffffffff811833f7>] ? vfs_mkdir+0xa7/0x100
 [<ffffffff8118656e>] ? sys_mkdirat+0xfe/0x120
 [<ffffffff810d3d27>] ? audit_syscall_entry+0x1d7/0x200
 [<ffffffff811865a8>] ? sys_mkdir+0x18/0x20
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
|
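For readers unfamiliar with the FID allocation path, here is a minimal, self-contained sketch (plain userspace C, not the actual Lustre kernel source) of the condition the assertion checks. The function and struct names mirror the stack trace above, but the struct fields and function bodies are simplified stand-ins; only the NULL client-sequence case that triggers the LBUG is modelled.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct lu_fid {
        uint64_t f_seq;   /* sequence number */
        uint32_t f_oid;   /* object id within the sequence */
        uint32_t f_ver;   /* version */
};

/* Hypothetical stand-in for the client sequence state kept per mdc/osc
 * device; the real structure has more fields and locking. */
struct lu_client_seq {
        uint64_t lcs_next_seq;
        uint32_t lcs_next_oid;
};

/* Stand-in for the kernel LASSERT(): in the real code a failed assertion
 * prints the "ASSERTION( ... ) failed" message, calls LBUG() and the node
 * panics, exactly as in the console log above. */
#define LASSERT(cond)                                                     \
        do {                                                              \
                if (!(cond)) {                                            \
                        fprintf(stderr, "ASSERTION( %s ) failed\nLBUG\n", \
                                #cond);                                   \
                        abort();                                          \
                }                                                         \
        } while (0)

/* Modelled on fid_request.c:329 from the trace: the sequence manager
 * handed in by mdc_fid_alloc() must not be NULL. */
static int seq_client_alloc_fid(struct lu_client_seq *seq, struct lu_fid *fid)
{
        LASSERT(seq != NULL);
        LASSERT(fid != NULL);

        fid->f_seq = seq->lcs_next_seq;
        fid->f_oid = seq->lcs_next_oid++;
        fid->f_ver = 0;
        return 0;
}

int main(void)
{
        struct lu_fid fid;
        /* The mdc's client sequence was never set up (or was torn down),
         * so mkdir reaches the allocator with a NULL pointer. */
        struct lu_client_seq *mdc_seq = NULL;

        /* mkdir -> ll_mkdir -> ll_new_node -> mdc_create ->
         * mdc_fid_alloc -> seq_client_alloc_fid */
        return seq_client_alloc_fid(mdc_seq, &fid);
}
```

As the later comments discuss, client_fid_init() is where this client sequence is expected to be set up at connect time, which is why the investigation below focuses on whether that initialization survived the upgrade and recovery.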
| Comments |
| Comment by Andreas Dilger [ 05/Mar/13 ] |
|
Sarah, what test is being run here? Is this conf-sanity.sh test_32 or something else? |
| Comment by Sarah Liu [ 05/Mar/13 ] |
|
Andreas, it's not conf-sanity; it occurred while extracting a kernel tarball. |
| Comment by Emoly Liu [ 11/Mar/13 ] |
|
I did this upgrade test on my three local VMs, following the wiki page http://wiki.whamcloud.com/display/ENG/Upgrade+and+Downgrade+Testing
This LBUG happened in step 6, after I created a second directory to extract the same kernel tarball into.

[root@centos6-1 lustre]# ls
kernel1
[root@centos6-1 lustre]# mkdir kernel2

Message from syslogd@localhost at Mar 11 12:47:37 ...
 kernel:LustreError: 3380:0:(fid_request.c:329:seq_client_alloc_fid()) ASSERTION( seq != ((void *)0) ) failed:

Message from syslogd@localhost at Mar 11 12:47:37 ...
 kernel:LustreError: 3380:0:(fid_request.c:329:seq_client_alloc_fid()) LBUG

Message from syslogd@localhost at Mar 11 12:47:37 ...
 kernel:Kernel panic - not syncing: LBUG
|
| Comment by Emoly Liu [ 13/Mar/13 ] |
|
I want to add some logging to see whether seq is allocated correctly in client_fid_init(), but this upgrade process is a little complicated on my 3 VMs going from centos5 to centos6. I also tried using two VMs instead, one for master and the other for b1_8, sharing the Lustre devices. During that test I used tunefs.lustre to reset mgsnode, which erased all config logs and params; probably because of this operation, I failed to reproduce. Next, I will try two VMs with the same IP. If that fails again, I will create a test-only patch and reproduce it on the Toro nodes. By the way, this LBUG still happens in our new build. |
| Comment by Emoly Liu [ 14/Mar/13 ] |
|
Fortunately, I can reproduce this LBUG on two VMs with the same IP address. /var/log/messages showed that seq was allocated while the upgraded MDT was loading:

Mar 14 19:42:55 centos5-3 kernel: LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=off. Opts:
Mar 14 19:42:55 centos5-3 kernel: Lustre: 4799:0:(mgs_llog.c:238:mgs_fsdb_handler()) MDT using 1.8 OSC name scheme
Mar 14 19:42:55 centos5-3 kernel: Lustre: 4800:0:(obd_config.c:1428:class_config_llog_handler()) For 1.8 interoperability, rename obd type from mds to mdt
Mar 14 19:42:55 centos5-3 kernel: Lustre: lustre-MDT0000: used disk, loading
Mar 14 19:42:55 centos5-3 kernel: client_fid_init: cli-cli-lustre-OST0000-osc: Allocated sequence [0x0]
Mar 14 19:42:55 centos5-3 kernel: client_fid_init: cli-cli-lustre-OST0001-osc: Allocated sequence [0x0]
Mar 14 19:42:55 centos5-3 kernel: LustreError: 11-0: lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
Mar 14 19:42:55 centos5-3 kernel: LDISKFS-fs (loop1): mounted filesystem with ordered data mode. quota=off. Opts:
Mar 14 19:42:56 centos5-3 kernel: LDISKFS-fs (loop2): mounted filesystem with ordered data mode. quota=off. Opts:
Mar 14 19:42:56 centos5-3 kernel: Lustre: lustre-MDT0000: Will be in recovery for at least 5:00, or until 1 client reconnects
Mar 14 19:42:56 centos5-3 kernel: Lustre: lustre-MDT0000: Denying connection for new client a1ea11db-09a3-2aca-f450-b0505ca2899c (at 0@lo), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 5:00
Mar 14 19:42:56 centos5-3 kernel: LustreError: 11-0: lustre-MDT0000-mdc-ffff88003d6bb800: Communicating with 0@lo, operation mds_connect failed with -16.
Mar 14 19:43:00 centos5-3 kernel: Lustre: 4440:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1363261375/real 1363261375] req@ffff8800118fac00 x1429483048402984/t0(0) o8->lustre-OST0001-osc@0@lo:28/4 lens 400/544 e 0 to 1 dl 1363261380 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 14 19:43:21 centos5-3 kernel: Lustre: lustre-OST0000: Will be in recovery for at least 5:00, or until 1 client reconnects
Mar 14 19:43:21 centos5-3 kernel: Lustre: lustre-OST0000: Denying connection for new client lustre-mdtlov_UUID (at 0@lo), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 5:00
Mar 14 19:43:21 centos5-3 kernel: LustreError: 11-0: lustre-OST0000-osc: Communicating with 0@lo, operation ost_connect failed with -16.
Mar 14 19:43:46 centos5-3 kernel: Lustre: lustre-OST0000: Denying connection for new client lustre-mdtlov_UUID (at 0@lo), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 4:34
Mar 14 19:43:46 centos5-3 kernel: Lustre: Skipped 2 previous similar messages
Mar 14 19:43:46 centos5-3 kernel: LustreError: 11-0: lustre-OST0000-osc: Communicating with 0@lo, operation ost_connect failed with -16.
...
Mar 14 19:46:16 centos5-3 kernel: LustreError: Skipped 5 previous similar messages
Mar 14 19:46:32 centos5-3 kernel: INFO: task tgt_recov:4809 blocked for more than 120 seconds.
Mar 14 19:46:32 centos5-3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 14 19:46:32 centos5-3 kernel: tgt_recov D 0000000000000000 0 4809 2 0x00000080
Mar 14 19:46:32 centos5-3 kernel: ffff880011b27e00 0000000000000046 0000000000000000 0000000000000001
Mar 14 19:46:32 centos5-3 kernel: ffff880011b27d70 ffffffff8102a259 ffff880011b27d80 ffffffff8104e048
Mar 14 19:46:32 centos5-3 kernel: ffff880011b25af8 ffff880011b27fd8 000000000000fb88 ffff880011b25af8
Mar 14 19:46:32 centos5-3 kernel: Call Trace:
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8102a259>] ? native_smp_send_reschedule+0x49/0x60
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8104e048>] ? resched_task+0x68/0x80
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa08410c0>] ? check_for_clients+0x0/0x70 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0842a1d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0840ec0>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa084970e>] target_recovery_thread+0x58e/0x1970 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Mar 14 19:46:32 centos5-3 kernel: INFO: task tgt_recov:4853 blocked for more than 120 seconds.
Mar 14 19:46:32 centos5-3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 14 19:46:32 centos5-3 kernel: tgt_recov D 0000000000000001 0 4853 2 0x00000080
Mar 14 19:46:32 centos5-3 kernel: ffff880010245e00 0000000000000046 0000000000000001 0000000000000000
Mar 14 19:46:32 centos5-3 kernel: ffff880010245d70 ffffffff8102a259 ffff880010245d80 ffffffff8104e048
Mar 14 19:46:32 centos5-3 kernel: ffff88001023d058 ffff880010245fd8 000000000000fb88 ffff88001023d058
Mar 14 19:46:32 centos5-3 kernel: Call Trace:
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8102a259>] ? native_smp_send_reschedule+0x49/0x60
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8104e048>] ? resched_task+0x68/0x80
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff81090c7e>] ? prepare_to_wait+0x4e/0x80
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa08410c0>] ? check_for_clients+0x0/0x70 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0842a1d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0840ec0>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa084970e>] target_recovery_thread+0x58e/0x1970 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Mar 14 19:46:32 centos5-3 kernel: INFO: task tgt_recov:4873 blocked for more than 120 seconds.
Mar 14 19:46:32 centos5-3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 14 19:46:32 centos5-3 kernel: tgt_recov D 0000000000000000 0 4873 2 0x00000080
Mar 14 19:46:32 centos5-3 kernel: ffff88000fbcbe00 0000000000000046 0000000000000000 0000000000000000
Mar 14 19:46:32 centos5-3 kernel: ffff88000fbcbd70 ffffffff8102a259 ffff88000fbcbd80 ffffffff8104e048
Mar 14 19:46:32 centos5-3 kernel: ffff88001024b098 ffff88000fbcbfd8 000000000000fb88 ffff88001024b098
Mar 14 19:46:32 centos5-3 kernel: Call Trace:
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8102a259>] ? native_smp_send_reschedule+0x49/0x60
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8104e048>] ? resched_task+0x68/0x80
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa08410c0>] ? check_for_clients+0x0/0x70 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0842a1d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0840ec0>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa084970e>] target_recovery_thread+0x58e/0x1970 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Mar 14 19:46:41 centos5-3 kernel: Lustre: lustre-OST0000: Denying connection for new client lustre-mdtlov_UUID (at 0@lo), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 1:39
Mar 14 19:46:41 centos5-3 kernel: Lustre: Skipped 5 previous similar messages
Mar 14 19:47:31 centos5-3 kernel: LustreError: 11-0: lustre-OST0000-osc: Communicating with 0@lo, operation ost_connect failed with -16.
Mar 14 19:47:31 centos5-3 kernel: LustreError: Skipped 8 previous similar messages

After the five-minute recovery window expired, the servers evicted the stale clients and exports and deleted orphan objects from 0x0:11018 to 11233. Then, when mkdir was run, seq was null:

Mar 14 19:47:56 centos5-3 kernel: Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
Mar 14 19:47:56 centos5-3 kernel: Lustre: lustre-MDT0000: disconnecting 1 stale clients
Mar 14 19:47:56 centos5-3 kernel: Lustre: lustre-MDT0000: Recovery over after 5:00, of 1 clients 0 recovered and 1 was evicted.
Mar 14 19:47:56 centos5-3 kernel: Lustre: lustre-OST0000: Denying connection for new client lustre-mdtlov_UUID (at 0@lo), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 0:24
Mar 14 19:47:56 centos5-3 kernel: Lustre: Skipped 8 previous similar messages
Mar 14 19:48:21 centos5-3 kernel: Lustre: lustre-OST0000: recovery is timed out, evict stale exports
Mar 14 19:48:21 centos5-3 kernel: Lustre: lustre-OST0000: disconnecting 1 stale clients
Mar 14 19:48:21 centos5-3 kernel: Lustre: lustre-OST0001: Recovery over after 5:00, of 1 clients 0 recovered and 1 was evicted.
Mar 14 19:48:21 centos5-3 kernel: Lustre: lustre-OST0001: deleting orphan objects from 0x0:11018 to 11233
Mar 14 19:48:21 centos5-3 kernel: LustreError: 4833:0:(ldlm_resource.c:1161:ldlm_resource_get()) lvbo_init failed for resource 11169: rc -2
Mar 14 19:48:21 centos5-3 kernel: LustreError: 4833:0:(ldlm_resource.c:1161:ldlm_resource_get()) lvbo_init failed for resource 11168: rc -2
Mar 14 19:48:46 centos5-3 kernel: Lustre: Layout lock feature supported.
Mar 14 19:48:46 centos5-3 kernel: Lustre: Mounted lustre-client
Mar 14 19:49:11 centos5-3 kernel: seq_client_alloc_fid: seq is null

I am looking at the dk logs and will update if I find anything suspicious. |
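For illustration only, here is a hedged sketch of what a non-fatal guard at the failing call site could look like, given the observation above that the client sequence appears never to have been (re)established after the eviction. This is not the change tracked in the patch referenced in the next comment; the stand-in types, the helper name, and the -ENODEV return value are all assumptions made for the example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct lu_fid { uint64_t f_seq; uint32_t f_oid; uint32_t f_ver; };
struct lu_client_seq { uint64_t lcs_next_seq; uint32_t lcs_next_oid; };

/* Hypothetical variant: fail the metadata operation with an error
 * instead of LBUG()ing when the client sequence is missing, e.g.
 * because the import was evicted before the sequence setup completed. */
static int seq_client_alloc_fid_checked(struct lu_client_seq *seq,
                                        struct lu_fid *fid)
{
        if (seq == NULL || fid == NULL)
                return -ENODEV;

        fid->f_seq = seq->lcs_next_seq;
        fid->f_oid = seq->lcs_next_oid++;
        fid->f_ver = 0;
        return 0;
}

int main(void)
{
        struct lu_fid fid;
        int rc = seq_client_alloc_fid_checked(NULL, &fid);

        /* With such a guard, the mkdir would fail with an error
         * (here -ENODEV) rather than panicking the node. */
        printf("alloc rc = %d\n", rc);
        return 0;
}
```

Whether to fail the operation, retry after reconnect, or re-run the sequence setup is a separate design decision; the sketch only shows that the failure need not be fatal.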
| Comment by Emoly Liu [ 15/Mar/13 ] |
|
Patch tracked at http://review.whamcloud.com/5733 |
| Comment by Emoly Liu [ 18/Mar/13 ] |
|
Andreas,
This LBUG was probably caused by |
| Comment by Peter Jones [ 21/Mar/13 ] |
|
Landed for 2.4 |