
After upgrading from 1.8.9 to master, hit ASSERTION( seq != ((void *)0) ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version: Lustre 2.4.0
    • Fix Version: Lustre 2.4.0
    • Environment: Before the upgrade, servers are running 1.8.9 RHEL5; two clients are running 1.8.9 RHEL6; one client is running 1.8.9 RHEL5. After the upgrade, both servers and clients are running lustre-master build #1280 RHEL6.
    • Severity: 3
    • 7004

    Description

      After upgrading the system from 1.8.9 to 2.4, one of the clients hit an LBUG:

      Lustre: DEBUG MARKER: Using TIMEOUT=20
      Lustre: DEBUG MARKER: ===== Check the directories/files in the OST pool
      Lustre: DEBUG MARKER: ===== Pass
      Lustre: DEBUG MARKER: ===== Check Lustre quotas usage/limits
      Lustre: DEBUG MARKER: cancel_lru_locks mdc start
      Lustre: DEBUG MARKER: cancel_lru_locks osc start
      Lustre: DEBUG MARKER: cancel_lru_locks mdc start
      Lustre: DEBUG MARKER: cancel_lru_locks osc start
      Lustre: DEBUG MARKER: cancel_lru_locks mdc start
      Lustre: DEBUG MARKER: cancel_lru_locks osc start
      Lustre: DEBUG MARKER: cancel_lru_locks mdc start
      Lustre: DEBUG MARKER: cancel_lru_locks osc start
      Lustre: DEBUG MARKER: cancel_lru_locks mdc start
      Lustre: DEBUG MARKER: cancel_lru_locks osc start
      Lustre: DEBUG MARKER: cancel_lru_locks mdc start
      Lustre: DEBUG MARKER: cancel_lru_locks osc start
      Lustre: DEBUG MARKER: ===== Verify the data
      LustreError: 9327:0:(fid_request.c:329:seq_client_alloc_fid()) ASSERTION( seq != ((void *)0) ) failed: 
      LustreError: 9327:0:(fid_request.c:329:seq_client_alloc_fid()) LBUG
      Pid: 9327, comm: mkdir
      
      Call Trace:
       [<ffffffffa0371895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa0371e97>] lbug_with_loc+0x47/0xb0 [libcfs]
       [<ffffffffa080bea9>] seq_client_alloc_fid+0x379/0x440 [fid]
       [<ffffffffa03822e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffffa082470b>] mdc_fid_alloc+0xbb/0xf0 [mdc]
       [<ffffffffa0832b1c>] mdc_create+0xcc/0x780 [mdc]
       [<ffffffffa09c487b>] ll_new_node+0x19b/0x6a0 [lustre]
       [<ffffffffa09c50a7>] ll_mkdir+0x97/0x1f0 [lustre]
       [<ffffffff8120db8f>] ? security_inode_permission+0x1f/0x30
       [<ffffffff811833f7>] vfs_mkdir+0xa7/0x100
       [<ffffffff8118656e>] sys_mkdirat+0xfe/0x120
       [<ffffffff810d3d27>] ? audit_syscall_entry+0x1d7/0x200
       [<ffffffff811865a8>] sys_mkdir+0x18/0x20
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      
      Kernel panic - not syncing: LBUG
      Pid: 9327, comm: mkdir Not tainted 2.6.32-279.19.1.el6.x86_64 #1
      Call Trace:
       [<ffffffff814e9541>] ? panic+0xa0/0x168
       [<ffffffffa0371eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
       [<ffffffffa080bea9>] ? seq_client_alloc_fid+0x379/0x440 [fid]
       [<ffffffffa03822e1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffffa082470b>] ? mdc_fid_alloc+0xbb/0xf0 [mdc]
       [<ffffffffa0832b1c>] ? mdc_create+0xcc/0x780 [mdc]
       [<ffffffffa09c487b>] ? ll_new_node+0x19b/0x6a0 [lustre]
       [<ffffffffa09c50a7>] ? ll_mkdir+0x97/0x1f0 [lustre]
       [<ffffffff8120db8f>] ? security_inode_permission+0x1f/0x30
       [<ffffffff811833f7>] ? vfs_mkdir+0xa7/0x100
       [<ffffffff8118656e>] ? sys_mkdirat+0xfe/0x120
       [<ffffffff810d3d27>] ? audit_syscall_entry+0x1d7/0x200
       [<ffffffff811865a8>] ? sys_mkdir+0x18/0x20
       [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
      Initializing cgroup subsys cpuset
      Initializing cgroup subsys cpu
      

      Attachments

        Issue Links

          Activity

            [LU-2911] After upgrading from 1.8.9 to master, hit ASSERTION( seq != ((void *)0) ) failed
            pjones Peter Jones added a comment -

            Landed for 2.4

            emoly.liu Emoly Liu added a comment -

            Andreas,

            Unfortunately, we cannot go back and fix all of the clients that will be working with 2.4 servers, so we need a server-side fix for this. I'm not against fixing old clients as well (in case there is a new release), but something landed to master recently must be causing this problem, and the server side needs to be fixed.

            This LBUG was probably caused by LU-1445 http://review.whamcloud.com/4787, which removed the obsolete obd_fid_init/fini calls from llite. So I reverted that part to keep the code compatible with b1_8.

            emoly.liu Emoly Liu added a comment - patch tracking at http://review.whamcloud.com/5733
            emoly.liu Emoly Liu added a comment -

            Fortunately, I can reproduce this LBUG on two VMs with the same IP address.

            The /var/log/messages file showed that seq was allocated while the MDT was loading during the upgrade.

            Mar 14 19:42:55 centos5-3 kernel: LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=off. Opts: 
            Mar 14 19:42:55 centos5-3 kernel: Lustre: 4799:0:(mgs_llog.c:238:mgs_fsdb_handler()) MDT using 1.8 OSC name scheme
            Mar 14 19:42:55 centos5-3 kernel: Lustre: 4800:0:(obd_config.c:1428:class_config_llog_handler()) For 1.8 interoperability, rename obd type from mds to mdt
            Mar 14 19:42:55 centos5-3 kernel: Lustre: lustre-MDT0000: used disk, loading
            
            Mar 14 19:42:55 centos5-3 kernel: client_fid_init: cli-cli-lustre-OST0000-osc: Allocated sequence [0x0]
            Mar 14 19:42:55 centos5-3 kernel: client_fid_init: cli-cli-lustre-OST0001-osc: Allocated sequence [0x0]
            
            Mar 14 19:42:55 centos5-3 kernel: LustreError: 11-0: lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
            Mar 14 19:42:55 centos5-3 kernel: LDISKFS-fs (loop1): mounted filesystem with ordered data mode. quota=off. Opts: 
            Mar 14 19:42:56 centos5-3 kernel: LDISKFS-fs (loop2): mounted filesystem with ordered data mode. quota=off. Opts: 
            Mar 14 19:42:56 centos5-3 kernel: Lustre: lustre-MDT0000: Will be in recovery for at least 5:00, or until 1 client reconnects
            Mar 14 19:42:56 centos5-3 kernel: Lustre: lustre-MDT0000: Denying connection for new client a1ea11db-09a3-2aca-f450-b0505ca2899c (at 0@lo), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 5:00
            Mar 14 19:42:56 centos5-3 kernel: LustreError: 11-0: lustre-MDT0000-mdc-ffff88003d6bb800: Communicating with 0@lo, operation mds_connect failed with -16.
            Mar 14 19:43:00 centos5-3 kernel: Lustre: 4440:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1363261375/real 1363261375]  req@ffff8800118fac00 x1429483048402984/t0(0) o8->lustre-OST0001-osc@0@lo:28/4 lens 400/544 e 0 to 1 dl 1363261380 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Mar 14 19:43:21 centos5-3 kernel: Lustre: lustre-OST0000: Will be in recovery for at least 5:00, or until 1 client reconnects
            Mar 14 19:43:21 centos5-3 kernel: Lustre: lustre-OST0000: Denying connection for new client lustre-mdtlov_UUID (at 0@lo), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 5:00
            Mar 14 19:43:21 centos5-3 kernel: LustreError: 11-0: lustre-OST0000-osc: Communicating with 0@lo, operation ost_connect failed with -16.
            Mar 14 19:43:46 centos5-3 kernel: Lustre: lustre-OST0000: Denying connection for new client lustre-mdtlov_UUID (at 0@lo), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 4:34
            Mar 14 19:43:46 centos5-3 kernel: Lustre: Skipped 2 previous similar messages
            Mar 14 19:43:46 centos5-3 kernel: LustreError: 11-0: lustre-OST0000-osc: Communicating with 0@lo, operation ost_connect failed with -16.
            ...
            Mar 14 19:46:16 centos5-3 kernel: LustreError: Skipped 5 previous similar messages
            Mar 14 19:46:32 centos5-3 kernel: INFO: task tgt_recov:4809 blocked for more than 120 seconds.
            Mar 14 19:46:32 centos5-3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            Mar 14 19:46:32 centos5-3 kernel: tgt_recov     D 0000000000000000     0  4809      2 0x00000080
            Mar 14 19:46:32 centos5-3 kernel: ffff880011b27e00 0000000000000046 0000000000000000 0000000000000001
            Mar 14 19:46:32 centos5-3 kernel: ffff880011b27d70 ffffffff8102a259 ffff880011b27d80 ffffffff8104e048
            Mar 14 19:46:32 centos5-3 kernel: ffff880011b25af8 ffff880011b27fd8 000000000000fb88 ffff880011b25af8
            Mar 14 19:46:32 centos5-3 kernel: Call Trace:
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8102a259>] ? native_smp_send_reschedule+0x49/0x60
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8104e048>] ? resched_task+0x68/0x80
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa08410c0>] ? check_for_clients+0x0/0x70 [ptlrpc]
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0842a1d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0840ec0>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa084970e>] target_recovery_thread+0x58e/0x1970 [ptlrpc]
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
            Mar 14 19:46:32 centos5-3 kernel: INFO: task tgt_recov:4853 blocked for more than 120 seconds.
            Mar 14 19:46:32 centos5-3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            Mar 14 19:46:32 centos5-3 kernel: tgt_recov     D 0000000000000001     0  4853      2 0x00000080
            Mar 14 19:46:32 centos5-3 kernel: ffff880010245e00 0000000000000046 0000000000000001 0000000000000000
            Mar 14 19:46:32 centos5-3 kernel: ffff880010245d70 ffffffff8102a259 ffff880010245d80 ffffffff8104e048
            Mar 14 19:46:32 centos5-3 kernel: ffff88001023d058 ffff880010245fd8 000000000000fb88 ffff88001023d058
            Mar 14 19:46:32 centos5-3 kernel: Call Trace:
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8102a259>] ? native_smp_send_reschedule+0x49/0x60
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8104e048>] ? resched_task+0x68/0x80
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffff81090c7e>] ? prepare_to_wait+0x4e/0x80
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa08410c0>] ? check_for_clients+0x0/0x70 [ptlrpc]
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0842a1d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0840ec0>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa084970e>] target_recovery_thread+0x58e/0x1970 [ptlrpc]
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
            Mar 14 19:46:32 centos5-3 kernel: INFO: task tgt_recov:4873 blocked for more than 120 seconds.
            Mar 14 19:46:32 centos5-3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            Mar 14 19:46:32 centos5-3 kernel: tgt_recov     D 0000000000000000     0  4873      2 0x00000080
            Mar 14 19:46:32 centos5-3 kernel: ffff88000fbcbe00 0000000000000046 0000000000000000 0000000000000000
            Mar 14 19:46:32 centos5-3 kernel: ffff88000fbcbd70 ffffffff8102a259 ffff88000fbcbd80 ffffffff8104e048
            Mar 14 19:46:32 centos5-3 kernel: ffff88001024b098 ffff88000fbcbfd8 000000000000fb88 ffff88001024b098
            Mar 14 19:46:32 centos5-3 kernel: Call Trace:
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8102a259>] ? native_smp_send_reschedule+0x49/0x60
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8104e048>] ? resched_task+0x68/0x80
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa08410c0>] ? check_for_clients+0x0/0x70 [ptlrpc]
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0842a1d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0840ec0>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa084970e>] target_recovery_thread+0x58e/0x1970 [ptlrpc]
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffffa0849180>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
            Mar 14 19:46:32 centos5-3 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
            Mar 14 19:46:41 centos5-3 kernel: Lustre: lustre-OST0000: Denying connection for new client lustre-mdtlov_UUID (at 0@lo), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 1:39
            Mar 14 19:46:41 centos5-3 kernel: Lustre: Skipped 5 previous similar messages
            Mar 14 19:47:31 centos5-3 kernel: LustreError: 11-0: lustre-OST0000-osc: Communicating with 0@lo, operation ost_connect failed with -16.
            Mar 14 19:47:31 centos5-3 kernel: LustreError: Skipped 8 previous similar messages
            

            After the five-minute recovery window expired, the servers evicted the stale clients and exports and deleted orphan objects from 0x0:11018 to 11233. Then, when mkdir was run, seq was NULL.

            Mar 14 19:47:56 centos5-3 kernel: Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
            Mar 14 19:47:56 centos5-3 kernel: Lustre: lustre-MDT0000: disconnecting 1 stale clients
            Mar 14 19:47:56 centos5-3 kernel: Lustre: lustre-MDT0000: Recovery over after 5:00, of 1 clients 0 recovered and 1 was evicted.
            Mar 14 19:47:56 centos5-3 kernel: Lustre: lustre-OST0000: Denying connection for new client lustre-mdtlov_UUID (at 0@lo), waiting for all 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 0:24
            Mar 14 19:47:56 centos5-3 kernel: Lustre: Skipped 8 previous similar messages
            Mar 14 19:48:21 centos5-3 kernel: Lustre: lustre-OST0000: recovery is timed out, evict stale exports
            Mar 14 19:48:21 centos5-3 kernel: Lustre: lustre-OST0000: disconnecting 1 stale clients
            Mar 14 19:48:21 centos5-3 kernel: Lustre: lustre-OST0001: Recovery over after 5:00, of 1 clients 0 recovered and 1 was evicted.
            Mar 14 19:48:21 centos5-3 kernel: Lustre: lustre-OST0001: deleting orphan objects from 0x0:11018 to 11233
            Mar 14 19:48:21 centos5-3 kernel: LustreError: 4833:0:(ldlm_resource.c:1161:ldlm_resource_get()) lvbo_init failed for resource 11169: rc -2
            Mar 14 19:48:21 centos5-3 kernel: LustreError: 4833:0:(ldlm_resource.c:1161:ldlm_resource_get()) lvbo_init failed for resource 11168: rc -2
            Mar 14 19:48:46 centos5-3 kernel: Lustre: Layout lock feature supported.
            Mar 14 19:48:46 centos5-3 kernel: Lustre: Mounted lustre-client
            Mar 14 19:49:11 centos5-3 kernel: seq_client_alloc_fid: seq is null
            

            I am looking at the dk logs and will update if I find anything suspicious.

            emoly.liu Emoly Liu added a comment -

            I want to add some logs to see whether seq is allocated correctly in client_fid_init(), but this upgrade process is a little complicated on my 3 VMs (CentOS 5 to CentOS 6).

            I also tried using two VMs instead, one for master and the other for b1_8, sharing the Lustre devices. During the test I used tunefs.lustre to reset mgsnode, which erased all config logs and params. Probably due to this operation, I failed to reproduce.

            Next, I will try two VMs with the same IP. If that fails again, I will create a test-only patch and reproduce the problem on Toro nodes.

            By the way, this LBUG still happens in our new build.

            emoly.liu Emoly Liu added a comment -

            I did this upgrade test on my local 3 VMs, following the wiki page http://wiki.whamcloud.com/display/ENG/Upgrade+and+Downgrade+Testing

            Clean upgrade test steps:

            1. Set up and start a Lustre 1.8.x filesystem
            2. Extract a kernel tarball into a Lustre test directory
            3. Shut down the entire Lustre filesystem
            4. Upgrade all Lustre servers and clients at once to Lustre 2.x
            5. Start the entire Lustre filesystem
            6. Extract the same kernel tarball into a new Lustre test directory
            7. Compare the old and new test directories (there should be no difference)
            8. Run the auster test suite
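
            The client-side portion of those steps could be scripted roughly as below. MNT and TARBALL are placeholder paths, and the server shutdown/upgrade/restart in steps 3-5 is only noted in comments because it is cluster-specific.

```shell
#!/bin/sh
# Sketch of the clean-upgrade test above; MNT and TARBALL are
# placeholders, and server-side steps are not scripted here.
set -eu

MNT=${MNT:-/mnt/lustre}               # hypothetical Lustre client mountpoint
TARBALL=${TARBALL:-/tmp/linux.tar.gz} # hypothetical kernel tarball

# Bail out gracefully when run without a prepared environment.
[ -f "$TARBALL" ] || { echo "no tarball at $TARBALL; set TARBALL and MNT first"; exit 0; }

# Steps 1-2: with the 1.8.x filesystem mounted, extract the tarball.
mkdir -p "$MNT/kernel1"
tar -xzf "$TARBALL" -C "$MNT/kernel1"

# Steps 3-5: shut down the filesystem, upgrade every server and client
# to Lustre 2.x at once, then restart (cluster-specific; not scripted).

# Step 6: extract the same tarball into a new directory.
mkdir "$MNT/kernel2"    # the mkdir at this point is where the LBUG fired
tar -xzf "$TARBALL" -C "$MNT/kernel2"

# Step 7: compare the trees; diff prints nothing when they match.
diff -r "$MNT/kernel1" "$MNT/kernel2" && echo "trees match"

# Step 8: run the auster test suite from the lustre tests directory.
# ./auster -v sanity
```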

            This LBUG happened in step 6, after I created the second directory to extract the same kernel tarball into.

            [root@centos6-1 lustre]# ls
            kernel1
            [root@centos6-1 lustre]# mkdir kernel2
            
            Message from syslogd@localhost at Mar 11 12:47:37 ...
             kernel:LustreError: 3380:0:(fid_request.c:329:seq_client_alloc_fid()) ASSERTION( seq != ((void *)0) ) failed: 
            
            Message from syslogd@localhost at Mar 11 12:47:37 ...
             kernel:LustreError: 3380:0:(fid_request.c:329:seq_client_alloc_fid()) LBUG
            
            Message from syslogd@localhost at Mar 11 12:47:37 ...
             kernel:Kernel panic - not syncing: LBUG
            
            sarah Sarah Liu added a comment -

            Andreas,

            Not conf-sanity; it occurred while extracting a kernel tarball.


            adilger Andreas Dilger added a comment -

            Sarah, what test is being run here? Is this conf-sanity.sh test_32 or something else?

            People

              emoly.liu Emoly Liu
              sarah Sarah Liu
              Votes: 0
              Watchers: 8
