[LU-2911] After upgrading from 1.8.9 to master, hit ASSERTION( seq != ((void *)0) ) failed Created: 05/Mar/13  Updated: 21/Mar/13  Resolved: 21/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Sarah Liu Assignee: Emoly Liu
Resolution: Fixed Votes: 0
Labels: LB
Environment:

Before the upgrade, the servers are running 1.8.9 on RHEL5; two clients are running 1.8.9 on RHEL6; one client is running 1.8.9 on RHEL5.

After the upgrade, both servers and clients are running lustre-master build #1280 on RHEL6.


Issue Links:
Duplicate
is duplicated by LU-2958 LBUG triggered in seq_client_alloc_fi... Closed
Severity: 3
Rank (Obsolete): 7004

 Description   

After upgrading the system from 1.8.9 to 2.4, one of the clients hit an LBUG:

Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: ===== Check the directories/files in the OST pool
Lustre: DEBUG MARKER: ===== Pass
Lustre: DEBUG MARKER: ===== Check Lustre quotas usage/limits
Lustre: DEBUG MARKER: cancel_lru_locks mdc start
Lustre: DEBUG MARKER: cancel_lru_locks osc start
Lustre: DEBUG MARKER: cancel_lru_locks mdc start
Lustre: DEBUG MARKER: cancel_lru_locks osc start
Lustre: DEBUG MARKER: cancel_lru_locks mdc start
Lustre: DEBUG MARKER: cancel_lru_locks osc start
Lustre: DEBUG MARKER: cancel_lru_locks mdc start
Lustre: DEBUG MARKER: cancel_lru_locks osc start
Lustre: DEBUG MARKER: cancel_lru_locks mdc start
Lustre: DEBUG MARKER: cancel_lru_locks osc start
Lustre: DEBUG MARKER: cancel_lru_locks mdc start
Lustre: DEBUG MARKER: cancel_lru_locks osc start
Lustre: DEBUG MARKER: ===== Verify the data
LustreError: 9327:0:(fid_request.c:329:seq_client_alloc_fid()) ASSERTION( seq != ((void *)0) ) failed: 
LustreError: 9327:0:(fid_request.c:329:seq_client_alloc_fid()) LBUG
Pid: 9327, comm: mkdir

Call Trace:
 [<ffffffffa0371895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0371e97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa080bea9>] seq_client_alloc_fid+0x379/0x440 [fid]
 [<ffffffffa03822e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa082470b>] mdc_fid_alloc+0xbb/0xf0 [mdc]
 [<ffffffffa0832b1c>] mdc_create+0xcc/0x780 [mdc]
 [<ffffffffa09c487b>] ll_new_node+0x19b/0x6a0 [lustre]
 [<ffffffffa09c50a7>] ll_mkdir+0x97/0x1f0 [lustre]
 [<ffffffff8120db8f>] ? security_inode_permission+0x1f/0x30
 [<ffffffff811833f7>] vfs_mkdir+0xa7/0x100
 [<ffffffff8118656e>] sys_mkdirat+0xfe/0x120
 [<ffffffff810d3d27>] ? audit_syscall_entry+0x1d7/0x200
 [<ffffffff811865a8>] sys_mkdir+0x18/0x20
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Kernel panic - not syncing: LBUG
Pid: 9327, comm: mkdir Not tainted 2.6.32-279.19.1.el6.x86_64 #1
Call Trace:
 [<ffffffff814e9541>] ? panic+0xa0/0x168
 [<ffffffffa0371eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa080bea9>] ? seq_client_alloc_fid+0x379/0x440 [fid]
 [<ffffffffa03822e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa082470b>] ? mdc_fid_alloc+0xbb/0xf0 [mdc]
 [<ffffffffa0832b1c>] ? mdc_create+0xcc/0x780 [mdc]
 [<ffffffffa09c487b>] ? ll_new_node+0x19b/0x6a0 [lustre]
 [<ffffffffa09c50a7>] ? ll_mkdir+0x97/0x1f0 [lustre]
 [<ffffffff8120db8f>] ? security_inode_permission+0x1f/0x30
 [<ffffffff811833f7>] ? vfs_mkdir+0xa7/0x100
 [<ffffffff8118656e>] ? sys_mkdirat+0xfe/0x120
 [<ffffffff810d3d27>] ? audit_syscall_entry+0x1d7/0x200
 [<ffffffff811865a8>] ? sys_mkdir+0x18/0x20
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu


 Comments   
Comment by Andreas Dilger [ 05/Mar/13 ]

Sarah, what test is being run here? Is this conf-sanity.sh test_32 or something else?

Comment by Sarah Liu [ 05/Mar/13 ]

Andreas,

Not conf-sanity; it occurred while extracting a kernel tarball.

Comment by Emoly Liu [ 11/Mar/13 ]

I ran this upgrade test on my 3 local VMs, following the wiki page http://wiki.whamcloud.com/display/ENG/Upgrade+and+Downgrade+Testing

Clean upgrade test steps (a rough shell sketch follows the list):

1. setup and start Lustre 1.8.x filesystem
2. extract a kernel tarball into a Lustre test directory
3. shutdown the entire Lustre filesystem
4. upgrade all Lustre servers and clients at once to Lustre 2.x
5. start the entire Lustre filesystem
6. extract the same kernel tarball into a new Lustre test directory
7. compare the old and new test directories (no difference)
8. run auster test suite
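
For reference, a minimal shell sketch of steps 1-2 and 5-7 (the mount point, MGS NID, and tarball path below are placeholders, not the exact ones used in this test):

# steps 1-2: on the 1.8.x client, extract the tarball into the test filesystem
mount -t lustre mgs@tcp0:/lustre /mnt/lustre
mkdir /mnt/lustre/kernel1
tar -C /mnt/lustre/kernel1 -xjf /root/linux-2.6.32.tar.bz2
# steps 3-4: unmount client and servers, install the 2.x RPMs everywhere, reboot
# steps 5-7: remount servers and client, extract the same tarball again, compare
mkdir /mnt/lustre/kernel2
tar -C /mnt/lustre/kernel2 -xjf /root/linux-2.6.32.tar.bz2
diff -r /mnt/lustre/kernel1 /mnt/lustre/kernel2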

This LBUG happened in step 6, when I created a second directory to extract the same kernel tarball into.

[root@centos6-1 lustre]# ls
kernel1
[root@centos6-1 lustre]# mkdir kernel2

Message from syslogd@localhost at Mar 11 12:47:37 ...
 kernel:LustreError: 3380:0:(fid_request.c:329:seq_client_alloc_fid()) ASSERTION( seq != ((void *)0) ) failed: 

Message from syslogd@localhost at Mar 11 12:47:37 ...
 kernel:LustreError: 3380:0:(fid_request.c:329:seq_client_alloc_fid()) LBUG

Message from syslogd@localhost at Mar 11 12:47:37 ...
 kernel:Kernel panic - not syncing: LBUG
Comment by Emoly Liu [ 13/Mar/13 ]

I want to add some logging to see whether seq is allocated correctly in client_fid_init(), but this upgrade process from centos5 to centos6 is a little complicated on my 3 VMs.

I also tried using two VMs instead, one for master and the other for b18, sharing the Lustre devices. During the test, I used tunefs.lustre to reset the mgsnode, which erased all config logs and params. Probably because of this operation, I failed to reproduce the issue.
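
For context, the kind of command used to reset the mgsnode looks roughly like the following (the device path and NID are placeholders, not the ones from this test):

# hypothetical example of resetting the MGS NID on an existing target
tunefs.lustre --erase-params --mgsnode=192.168.10.1@tcp --writeconf /dev/sdb
# --erase-params drops the stored parameters, and --writeconf causes the
# configuration logs to be regenerated at the next mount, which is why the
# old config logs and params are lost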

Next, I will try two VMs with the same IP. If that fails again, I will create a test-only patch and reproduce it on Toro nodes.

Btw, this LBUG still happens in our new build.

Comment by Emoly Liu [ 14/Mar/13 ]

Fortunately, I was able to reproduce this LBUG on two VMs with the same IP address.

/var/log/messages shows that seq was allocated while the upgraded MDT was loading.

Mar 14 19:42:55 centos5-3 kernel: LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=off. Opts: 
Mar 14 19:42:55 centos5-3 kernel: Lustre: 4799:0:(mgs_llog.c:238:mgs_fsdb_handler()) MDT using 1.8 OSC name scheme
Mar 14 19:42:55 centos5-3 kernel: Lustre: 4800:0:(obd_config.c:1428:class_config_llog_handler()) For 1.8 interoperability, rename obd type from mds to mdt
Mar 14 19:42:55 centos5-3 kernel: Lustre: lustre-MDT0000: used disk, loading

Mar 14 19:42:55 centos5-3 kernel: client_fid_init: cli-cli-lustre-OST0000-osc: Allocated sequence [0x0]
Mar 14 19:42:55 centos5-3 kernel: client_fid_init: cli-cli-lustre-OST0001-osc: Allocated sequence [0x0]

Mar 14 19:42:55 centos5-3 kernel: LustreError: 11-0: lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
Mar 14 19:42:55 centos5-3 kernel: LDISKFS-fs (loop1): mounted filesystem with ordered data mode. quota=off. Opts: 
Mar 14 19:42:56 centos5-3 kernel: LDISKFS-fs (loop2): mounted filesystem with ordered data mode. quota=off. Opts: 
Mar 14 19:42:56 centos5-3 kernel: Lustre: lustre-MDT0000: Will be in recovery for at least 5:00, or until 1 client reconnects
Mar 14 19:42:56 centos5-3 kernel: Lustre: lustre-MDT0000: Denying connection for new client a1ea11db-09a3-2aca-f450-b0505ca2899c (at 0@lo), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 5:00
Mar 14 19:42:56 centos5-3 kernel: LustreError: 11-0: lustre-MDT0000-mdc-ffff88003d6bb800: Communicating with 0@lo, operation mds_connect failed with -16.
Mar 14 19:43:00 centos5-3 kernel: Lustre: 4440:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1363261375/real 1363261375]  req@ffff8800118fac00 x1429483048402984/t0(0) o8->lustre-OST0001-osc@0@lo:28/4 lens 400/544 e 0 to 1 dl 1363261380 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 14 19:43:21 centos5-3 kernel: Lustre: lustre-OST0000: Will be in recovery for at least 5:00, or until 1 client reconnects
Mar 14 19:43:21 centos5-3 kernel: Lustre: lustre-OST0000: Denying connection for new client lustre-mdtlov_UUID (at 0@lo), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 5:00
Mar 14 19:43:21 centos5-3 kernel: LustreError: 11-0: lustre-OST0000-osc: Communicating with 0@lo, operation ost_connect failed with -16.
Mar 14 19:43:46 centos5-3 kernel: Lustre: lustre-OST0000: Denying connection for new client lustre-mdtlov_UUID (at 0@lo), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 4:34
Mar 14 19:43:46 centos5-3 kernel: Lustre: Skipped 2 previous similar messages
Mar 14 19:43:46 centos5-3 kernel: LustreError: 11-0: lustre-OST0000-osc: Communicating with 0@lo, operation ost_connect failed with -16.
...
Mar 14 19:46:16 centos5-3 kernel: LustreError: Skipped 5 previous similar messages
Mar 14 19:46:32 centos5-3 kernel: INFO: task tgt_recov:4809 blocked for more than 120 seconds.
Mar 14 19:46:32 centos5-3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 14 19:46:32 centos5-3 kernel: tgt_recov     D 0000000000000000     0  4809      2 0x00000080
Mar 14 19:46:32 centos5-3 kernel: ffff880011b27e00 0000000000000046 0000000000000000 0000000000000001
Mar 14 19:46:32 centos5-3 kernel: ffff880011b27d70 ffffffff8102a259 ffff880011b27d80 ffffffff8104e048
Mar 14 19:46:32 centos5-3 kernel: ffff880011b25af8 ffff880011b27fd8 000000000000fb88 ffff880011b25af8
Mar 14 19:46:32 centos5-3 kernel: Call Trace:
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8102a259>] ? native_smp_send_reschedule+0x49/0x60
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8104e048>] ? resched_task+0x68/0x80
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa08410c0>] ? check_for_clients+0x0/0x70 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0842a1d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0840ec0>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa084970e>] target_recovery_thread+0x58e/0x1970 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Mar 14 19:46:32 centos5-3 kernel: INFO: task tgt_recov:4853 blocked for more than 120 seconds.
Mar 14 19:46:32 centos5-3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 14 19:46:32 centos5-3 kernel: tgt_recov     D 0000000000000001     0  4853      2 0x00000080
Mar 14 19:46:32 centos5-3 kernel: ffff880010245e00 0000000000000046 0000000000000001 0000000000000000
Mar 14 19:46:32 centos5-3 kernel: ffff880010245d70 ffffffff8102a259 ffff880010245d80 ffffffff8104e048
Mar 14 19:46:32 centos5-3 kernel: ffff88001023d058 ffff880010245fd8 000000000000fb88 ffff88001023d058
Mar 14 19:46:32 centos5-3 kernel: Call Trace:
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8102a259>] ? native_smp_send_reschedule+0x49/0x60
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8104e048>] ? resched_task+0x68/0x80
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff81090c7e>] ? prepare_to_wait+0x4e/0x80
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa08410c0>] ? check_for_clients+0x0/0x70 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0842a1d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0840ec0>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa084970e>] target_recovery_thread+0x58e/0x1970 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Mar 14 19:46:32 centos5-3 kernel: INFO: task tgt_recov:4873 blocked for more than 120 seconds.
Mar 14 19:46:32 centos5-3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 14 19:46:32 centos5-3 kernel: tgt_recov     D 0000000000000000     0  4873      2 0x00000080
Mar 14 19:46:32 centos5-3 kernel: ffff88000fbcbe00 0000000000000046 0000000000000000 0000000000000000
Mar 14 19:46:32 centos5-3 kernel: ffff88000fbcbd70 ffffffff8102a259 ffff88000fbcbd80 ffffffff8104e048
Mar 14 19:46:32 centos5-3 kernel: ffff88001024b098 ffff88000fbcbfd8 000000000000fb88 ffff88001024b098
Mar 14 19:46:32 centos5-3 kernel: Call Trace:
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8102a259>] ? native_smp_send_reschedule+0x49/0x60
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8104e048>] ? resched_task+0x68/0x80
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa08410c0>] ? check_for_clients+0x0/0x70 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0842a1d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0840ec0>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa084970e>] target_recovery_thread+0x58e/0x1970 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Mar 14 19:46:41 centos5-3 kernel: Lustre: lustre-OST0000: Denying connection for new client lustre-mdtlov_UUID (at 0@lo), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 1:39
Mar 14 19:46:41 centos5-3 kernel: Lustre: Skipped 5 previous similar messages
Mar 14 19:47:31 centos5-3 kernel: LustreError: 11-0: lustre-OST0000-osc: Communicating with 0@lo, operation ost_connect failed with -16.
Mar 14 19:47:31 centos5-3 kernel: LustreError: Skipped 8 previous similar messages

After the five-minute recovery window, the servers evicted the stale clients and exports and deleted the orphan objects from 0x0:11018 to 11233. Then, when mkdir was run, seq was null (the server-side recovery state during this window can be checked as in the sketch after this log).

Mar 14 19:47:56 centos5-3 kernel: Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
Mar 14 19:47:56 centos5-3 kernel: Lustre: lustre-MDT0000: disconnecting 1 stale clients
Mar 14 19:47:56 centos5-3 kernel: Lustre: lustre-MDT0000: Recovery over after 5:00, of 1 clients 0 recovered and 1 was evicted.
Mar 14 19:47:56 centos5-3 kernel: Lustre: lustre-OST0000: Denying connection for new client lustre-mdtlov_UUID (at 0@lo), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 0:24
Mar 14 19:47:56 centos5-3 kernel: Lustre: Skipped 8 previous similar messages
Mar 14 19:48:21 centos5-3 kernel: Lustre: lustre-OST0000: recovery is timed out, evict stale exports
Mar 14 19:48:21 centos5-3 kernel: Lustre: lustre-OST0000: disconnecting 1 stale clients
Mar 14 19:48:21 centos5-3 kernel: Lustre: lustre-OST0001: Recovery over after 5:00, of 1 clients 0 recovered and 1 was evicted.
Mar 14 19:48:21 centos5-3 kernel: Lustre: lustre-OST0001: deleting orphan objects from 0x0:11018 to 11233
Mar 14 19:48:21 centos5-3 kernel: LustreError: 4833:0:(ldlm_resource.c:1161:ldlm_resource_get()) lvbo_init failed for resource 11169: rc -2
Mar 14 19:48:21 centos5-3 kernel: LustreError: 4833:0:(ldlm_resource.c:1161:ldlm_resource_get()) lvbo_init failed for resource 11168: rc -2
Mar 14 19:48:46 centos5-3 kernel: Lustre: Layout lock feature supported.
Mar 14 19:48:46 centos5-3 kernel: Lustre: Mounted lustre-client
Mar 14 19:49:11 centos5-3 kernel: seq_client_alloc_fid: seq is null
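
For the record, the recovery state on the servers during this window can be watched with lctl on the 2.x nodes (parameter paths differ on 1.8):

lctl get_param mdt.*.recovery_status
lctl get_param obdfilter.*.recovery_status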

I am looking at the dk logs and will update if I find anything suspicious.
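
The dk logs referred to here are the Lustre kernel debug buffer; a minimal way to capture them on the affected node, assuming default settings:

# dump the in-memory Lustre debug buffer to a file
lctl dk /tmp/lustre-debug.log
# on an LBUG the buffer is normally also auto-dumped under /tmp/lustre-log.*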

Comment by Emoly Liu [ 15/Mar/13 ]

Patch is tracked at http://review.whamcloud.com/5733

Comment by Emoly Liu [ 18/Mar/13 ]

Andreas,

Unfortunately, we cannot go back and fix all of the clients that are going to work with 2.4 servers, so we need a server-side fix for this.
I'm not against fixing old clients anyway (in case there is a new release), but something must have landed on master recently that is causing this problem, and it needs to be fixed on the server side.

This LBUG was probably caused by LU-1445 (http://review.whamcloud.com/4787), which removed the obsolete obd_fid_init/fini calls in llite. So I reverted that part to keep the code compatible with b18.

Comment by Peter Jones [ 21/Mar/13 ]

Landed for 2.4
