Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.1.0
-
None
-
Lustre Tag: v2_1_0_0_RC1
Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/274/
e2fsprogs Build: http://newbuild.whamcloud.com/job/e2fsprogs-master/42/
Distro/Arch: RHEL6/x86_64(in-kernel OFED, kernel version: 2.6.32-131.6.1.el6.x86_64)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD
MGS/MDS Nodes: fat-amd-1-ib(active), fat-amd-2-ib(passive)
\ /
1 combined MGS/MDT
OSS Nodes: fat-amd-3-ib(active), fat-amd-4-ib(active)
\ /
OST1 (active in fat-amd-3-ib)
OST2 (active in fat-amd-4-ib)
OST3 (active in fat-amd-3-ib)
OST4 (active in fat-amd-4-ib)
OST5 (active in fat-amd-3-ib)
OST6 (active in fat-amd-4-ib)
Client Nodes: client-[1,2,4,5,12,13,15],fat-intel-4
Network Addresses:
fat-amd-1-ib: 192.168.4.132
fat-amd-2-ib: 192.168.4.133
fat-amd-3-ib: 192.168.4.134
fat-amd-4-ib: 192.168.4.135
client-1-ib: 192.168.4.1
client-2-ib: 192.168.4.2
client-4-ib: 192.168.4.4
client-5-ib: 192.168.4.5
client-12-ib: 192.168.4.12
client-13-ib: 192.168.4.13
client-15-ib: 192.168.4.15
fat-intel-4-ib: 192.168.4.131
Lustre Tag: v2_1_0_0_RC1 Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/274/ e2fsprogs Build: http://newbuild.whamcloud.com/job/e2fsprogs-master/42/ Distro/Arch: RHEL6/x86_64(in-kernel OFED, kernel version: 2.6.32-131.6.1.el6.x86_64) ENABLE_QUOTA=yes FAILURE_MODE=HARD MGS/MDS Nodes: fat-amd-1-ib(active), fat-amd-2-ib(passive) \ / 1 combined MGS/MDT OSS Nodes: fat-amd-3-ib(active), fat-amd-4-ib(active) \ / OST1 (active in fat-amd-3-ib) OST2 (active in fat-amd-4-ib) OST3 (active in fat-amd-3-ib) OST4 (active in fat-amd-4-ib) OST5 (active in fat-amd-3-ib) OST6 (active in fat-amd-4-ib) Client Nodes: client-[1,2,4,5,12,13,15],fat-intel-4 Network Addresses: fat-amd-1-ib: 192.168.4.132 fat-amd-2-ib: 192.168.4.133 fat-amd-3-ib: 192.168.4.134 fat-amd-4-ib: 192.168.4.135 client-1-ib: 192.168.4.1 client-2-ib: 192.168.4.2 client-4-ib: 192.168.4.4 client-5-ib: 192.168.4.5 client-12-ib: 192.168.4.12 client-13-ib: 192.168.4.13 client-15-ib: 192.168.4.15 fat-intel-4-ib: 192.168.4.131
-
3
-
4905
Description
recovery-double-scale test failed as follows:
==== START === test 5: failover OST, then 2 clients ==== <~snip~> Started client load: dd on client-13-ib Started client load: tar on client-1-ib ==== Checking the clients loads AFTER failovers -- ERRORS_OK=yes Done checking / re-Starting client loads. PASS ==== START === test 6: failover OST, then MDS ========== ==== Checking the clients loads BEFORE failover -- failure NOT OK Client load failed on node client-1-ib, rc=1 2011-08-30 09:37:48 Terminating clients loads ... Server failover period: 300 seconds Exited after: 0 seconds Status: FAIL: rc=4
Console log on client node client-1-ib showed that:
Lustre: DEBUG MARKER: Started client load: tar on client-1-ib^M Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failovers -- ERRORS_OK=yes^M LustreError: 2344:0:(llite_lib.c:1586:ll_update_inode()) lsm mismatch for inode 144115238860488711^M LustreError: 2344:0:(llite_lib.c:1587:ll_update_inode()) lli_smd:^M LustreError: 2344:0:(debug.c:71:dump_lsm()) lsm ffff8803168da740, objid 0x7, maxbytes 0x0, magic 0x0BD10BD0, stripe_size 1048576, stripe_count 1 pool ^M LustreError: 11-0: an error occurred while communicating with 192.168.4.134@o2ib. The ost_connect operation failed with -16^M LustreError: 2344:0:(llite_lib.c:1589:ll_update_inode()) lsm:^M LustreError: 2344:0:(debug.c:71:dump_lsm()) lsm ffff8803168da3c0, objid 0x7, maxbytes 0xfffffffffff, magic 0x0BD10BD0, stripe_size 1048576, stripe_count 1 pool ^M LustreError: 2344:0:(llite_lib.c:1591:ll_update_inode()) LBUG^M Pid: 2344, comm: tar^M ^M Call Trace:^M [<ffffffffa02f4855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]^M [<ffffffffa02f4e95>] lbug_with_loc+0x75/0xe0 [libcfs]^M [<ffffffffa072d018>] ll_update_inode+0xd48/0xea0 [lustre]^M [<ffffffff8118d6d4>] ? ifind+0x74/0xd0^M [<ffffffffa07410f0>] ? ll_test_inode+0x0/0x1a0 [lustre]^M [<ffffffffa0741b9b>] ll_iget+0x21b/0x280 [lustre]^M [<ffffffffa02f59ae>] ? cfs_free+0xe/0x10 [libcfs]^M [<ffffffffa07d6de8>] ? lmv_intent_lookup+0x718/0x1190 [lmv]^M [<ffffffffa0741380>] ? ll_md_blocking_ast+0x0/0x600 [lustre]^M [<ffffffffa072d6f9>] ll_prep_inode+0x589/0xb50 [lustre]^M [<ffffffffa0742a05>] ll_lookup_it_finish+0x85/0xa80 [lustre]^M [<ffffffffa02f59ae>] ? cfs_free+0xe/0x10 [libcfs]^M [<ffffffffa0727581>] ? ll_finish_md_op_data+0x2a1/0x4a0 [lustre]^M [<ffffffffa07252b8>] ? ll_prep_md_op_data+0x148/0x410 [lustre]^M [<ffffffffa0743835>] ll_lookup_it+0x435/0xd60 [lustre]^M [<ffffffffa0741380>] ? ll_md_blocking_ast+0x0/0x600 [lustre]^M [<ffffffffa07441e8>] ll_lookup_nd+0x88/0x440 [lustre]^M^@ [<ffffffff8118ac3e>] ? d_alloc+0x13e/0x1b0^M [<ffffffff811806cb>] do_lookup+0x18b/0x220^M [<ffffffff81180cc9>] __link_path_walk+0x569/0x820^M [<ffffffffa06fc000>] ? return_if_equal+0x0/0x30 [lustre]^M [<ffffffff8118164a>] path_walk+0x6a/0xe0^M [<ffffffff8118181b>] do_path_lookup+0x5b/0xa0^M [<ffffffff811819a7>] user_path_at+0x57/0xa0^M [<ffffffff810dbe25>] ? call_rcu_sched+0x15/0x20^M [<ffffffff810dbe3e>] ? call_rcu+0xe/0x10^M [<ffffffff81262a05>] ? _atomic_dec_and_lock+0x55/0x80^M [<ffffffff811910e0>] ? mntput_no_expire+0x30/0x110^M [<ffffffff811740d1>] ? __fput+0x1a1/0x210^M [<ffffffff811a16f7>] do_utimes+0xc7/0xf0^M [<ffffffff810d1ad2>] ? audit_syscall_entry+0x272/0x2a0^M [<ffffffff811a1822>] sys_utimensat+0x32/0x90^M [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b^M ^M Kernel panic - not syncing: LBUG^M Pid: 2344, comm: tar Tainted: G ---------------- T 2.6.32-131.6.1.el6.x86_64 #1^M Call Trace:^M [<ffffffff814da518>] ? panic+0x78/0x143^M [<ffffffffa02f4eeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]^M [<ffffffffa072d018>] ? ll_update_inode+0xd48/0xea0 [lustre]^M [<ffffffff8118d6d4>] ? ifind+0x74/0xd0^M [<ffffffffa07410f0>] ? ll_test_inode+0x0/0x1a0 [lustre]^M [<ffffffffa0741b9b>] ? ll_iget+0x21b/0x280 [lustre]^M [<ffffffffa02f59ae>] ? cfs_free+0xe/0x10 [libcfs]^M [<ffffffffa07d6de8>] ? lmv_intent_lookup+0x718/0x1190 [lmv]^M [<ffffffffa0741380>] ? ll_md_blocking_ast+0x0/0x600 [lustre]^M [<ffffffffa072d6f9>] ? ll_prep_inode+0x589/0xb50 [lustre]^M [<ffffffffa0742a05>] ? ll_lookup_it_finish+0x85/0xa80 [lustre]^M [<ffffffffa02f59ae>] ? cfs_free+0xe/0x10 [libcfs]^M [<ffffffffa0727581>] ? ll_finish_md_op_data+0x2a1/0x4a0 [lustre]^M [<ffffffffa07252b8>] ? ll_prep_md_op_data+0x148/0x410 [lustre]^M [<ffffffffa0743835>] ? ll_lookup_it+0x435/0xd60 [lustre]^M [<ffffffffa0741380>] ? ll_md_blocking_ast+0x0/0x600 [lustre]^M [<ffffffffa07441e8>] ? ll_lookup_nd+0x88/0x440 [lustre]^M [<ffffffff8118ac3e>] ? d_alloc+0x13e/0x1b0^M [<ffffffff811806cb>] ? do_lookup+0x18b/0x220^M [<ffffffff81180cc9>] ? __link_path_walk+0x569/0x820^M [<ffffffffa06fc000>] ? return_if_equal+0x0/0x30 [lustre]^M [<ffffffff8118164a>] ? path_walk+0x6a/0xe0^M [<ffffffff8118181b>] ? do_path_lookup+0x5b/0xa0^M [<ffffffff811819a7>] ? user_path_at+0x57/0xa0^M [<ffffffff810dbe25>] ? call_rcu_sched+0x15/0x20^M [<ffffffff810dbe3e>] ? call_rcu+0xe/0x10^M [<ffffffff81262a05>] ? _atomic_dec_and_lock+0x55/0x80^M [<ffffffff811910e0>] ? mntput_no_expire+0x30/0x110^M [<ffffffff811740d1>] ? __fput+0x1a1/0x210^M [<ffffffff811a16f7>] ? do_utimes+0xc7/0xf0^M [<ffffffff810d1ad2>] ? audit_syscall_entry+0x272/0x2a0^M [<ffffffff811a1822>] ? sys_utimensat+0x32/0x90^M [<ffffffff8100b172>] ? system_call_fastpath+0x16/0x1b^M
Syslog on OSS node fat-amd-3-ib showed that:
Aug 30 09:31:50 fat-amd-3-ib kernel: Lustre: DEBUG MARKER: Started client load: tar on client-1-ib Aug 30 09:31:50 fat-amd-3-ib xinetd[2179]: EXIT: shell status=0 pid=4039 duration=0(sec)Aug 30 09:31:53 fat-amd-3-ib xinetd[2179]: START: shell pid=4063 from=::ffff:192.168.4.131 Aug 30 09:31:53 fat-amd-3-ib rshd[4064]: root@fat-intel-4-ib.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /us r/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" sh -c "/usr/sbin/lctl mark ==== Checking the clients loads AFTER failovers -- ERRORS_OK=yes");echo XXRETCODE:$?' Aug 30 09:31:53 fat-amd-3-ib kernel: Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failovers -- ERRORS_OK=yes Aug 30 09:31:53 fat-amd-3-ib xinetd[2179]: EXIT: shell status=0 pid=4063 duration=0(sec) Aug 30 09:31:53 fat-amd-3-ib kernel: LustreError: 2620:0:(ldlm_lib.c:910:target_handle_connect()) lustre-OST0005: denying connection for new client 192.168.4.1@o2ib (1e1bdc24-235d-ddd3-8241-31c1b1eae3e1): 2 clients in recovery for 75s Aug 30 09:31:53 fat-amd-3-ib kernel: LustreError: 2620:0:(ldlm_lib.c:910:target_handle_connect()) Skipped 59 previous similar messages Aug 30 09:31:54 fat-amd-3-ib xinetd[2179]: START: shell pid=4087 from=::ffff:192.168.4.131 Aug 30 09:31:54 fat-amd-3-ib rshd[4088]: root@fat-intel-4-ib.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /us r/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" sh -c "/usr/sbin/lctl mark Done checking \/ re-Starting client loads. PASS");echo XXRETCODE:$?'Aug 30 09:31:54 fat-amd-3-ib kernel: Lustre: DEBUG MARKER: Done checking / re-Starting client loads. PASS Aug 30 09:31:54 fat-amd-3-ib xinetd[2179]: EXIT: shell status=0 pid=4087 duration=0(sec) Aug 30 09:32:12 fat-amd-3-ib kernel: INFO: task tgt_recov:3840 blocked for more than 120 seconds. Aug 30 09:32:12 fat-amd-3-ib kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Aug 30 09:32:12 fat-amd-3-ib kernel: tgt_recov D 0000000000000009 0 3840 2 0x00000080Aug 30 09:32:12 fat-amd-3-ib kernel: ffff8802178c3d30 0000000000000046 ffff8802178c3d30 ffffffffa05bbef6Aug 30 09:32:12 fat-amd-3-ib kernel: 0000000000000000 0000000000000000 ffffffffa06557a0 0000000000000007 Aug 30 09:32:12 fat-amd-3-ib kernel: ffff880217cca638 ffff8802178c3fd8 000000000000f598 ffff880217cca638 Aug 30 09:32:12 fat-amd-3-ib kernel: Call Trace: Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05bbef6>] ? check_for_next_transno+0x2d6/0x4f0 [ptlrpc] Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05bbc20>] ? check_for_next_transno+0x0/0x4f0 [ptlrpc] Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05bd765>] target_recovery_overseer+0x85/0x200 [ptlrpc]Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05ba780>] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc]Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffff8108e160>] ? autoremove_wake_function+0x0/0x40Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05be364>] target_recovery_thread+0x4a4/0xf50 [ptlrpc] Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05bdec0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc] Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffff8100c1ca>] child_rip+0xa/0x20 Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05bdec0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc] Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05bdec0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc] Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffff8100c1c0>] ? child_rip+0x0/0x20 Aug 30 09:32:38 fat-amd-3-ib kernel: Lustre: lustre-OST0000: haven't heard from client 1e1bdc24-235d-ddd3-8241-31c1b1eae3e1 (at 192.168.4.1@o2ib) in 48 seconds. I think it's de ad, and I am evicting it. exp ffff88031783e000, cur 1314721958 expire 1314721928 last 1314721910 Aug 30 09:32:38 fat-amd-3-ib kernel: Lustre: Skipped 4 previous similar messages Aug 30 09:33:08 fat-amd-3-ib kernel: Lustre: 3840:0:(ldlm_lib.c:1569:target_recovery_overseer()) recovery is timed out, evict stale exports Aug 30 09:33:08 fat-amd-3-ib kernel: Lustre: 3840:0:(ldlm_lib.c:1569:target_recovery_overseer()) recovery is timed out, evict stale exports Aug 30 09:33:38 fat-amd-3-ib kernel: Lustre: 2657:0:(ldlm_lib.c:877:target_handle_connect()) lustre-OST0005: connection from 53ce78e4-dbf9-09af-3e1f-ba4d62ada8c7@192.168.4.13@o2ib recovering/t0 exp (null) cur 1314722018 last 0 Aug 30 09:33:38 fat-amd-3-ib kernel: Lustre: 2657:0:(ldlm_lib.c:877:target_handle_connect()) Skipped 137 previous similar messages Aug 30 09:36:08 fat-amd-3-ib kernel: Lustre: 3840:0:(ldlm_lib.c:1569:target_recovery_overseer()) recovery is timed out, evict stale exports Aug 30 09:36:08 fat-amd-3-ib kernel: Lustre: 3840:0:(ldlm_lib.c:1569:target_recovery_overseer()) recovery is timed out, evict stale exports
Maloo report: https://maloo.whamcloud.com/test_sets/a4d4597c-d387-11e0-8d02-52540025f9af
Please find console, syslog, dmesg and debug logs in the attached recovery-double-scale.1314722275.log.tar.bz2.
Attachments
Issue Links
- Trackbacks
-
Lustre 2.1.0 release testing tracker Lustre 2.1.0 RC0 Tag: v2100RC0 Created Date: 20110820 The difference between RC0 and RC1 is only a date change in lustre/ChangeLog. Lustre 2.1....