[LU-650] recovery-double-scale: (llite_lib.c:1591:ll_update_inode()) LBUG Created: 31/Aug/11 Updated: 10/Sep/11 Resolved: 10/Sep/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0 |
| Fix Version/s: | Lustre 2.1.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Jian Yu | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre Tag: v2_1_0_0_RC1 MGS/MDS Nodes: fat-amd-1-ib(active), fat-amd-2-ib(passive) OSS Nodes: fat-amd-3-ib(active), fat-amd-4-ib(active) Client Nodes: client-[1,2,4,5,12,13,15],fat-intel-4 Network Addresses: |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4905 |
| Description |
|
recovery-double-scale test failed as follows: ==== START === test 5: failover OST, then 2 clients ==== <~snip~> Started client load: dd on client-13-ib Started client load: tar on client-1-ib ==== Checking the clients loads AFTER failovers -- ERRORS_OK=yes Done checking / re-Starting client loads. PASS ==== START === test 6: failover OST, then MDS ========== ==== Checking the clients loads BEFORE failover -- failure NOT OK Client load failed on node client-1-ib, rc=1 2011-08-30 09:37:48 Terminating clients loads ... Server failover period: 300 seconds Exited after: 0 seconds Status: FAIL: rc=4 Console log on client node client-1-ib showed that: Lustre: DEBUG MARKER: Started client load: tar on client-1-ib^M Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failovers -- ERRORS_OK=yes^M LustreError: 2344:0:(llite_lib.c:1586:ll_update_inode()) lsm mismatch for inode 144115238860488711^M LustreError: 2344:0:(llite_lib.c:1587:ll_update_inode()) lli_smd:^M LustreError: 2344:0:(debug.c:71:dump_lsm()) lsm ffff8803168da740, objid 0x7, maxbytes 0x0, magic 0x0BD10BD0, stripe_size 1048576, stripe_count 1 pool ^M LustreError: 11-0: an error occurred while communicating with 192.168.4.134@o2ib. The ost_connect operation failed with -16^M LustreError: 2344:0:(llite_lib.c:1589:ll_update_inode()) lsm:^M LustreError: 2344:0:(debug.c:71:dump_lsm()) lsm ffff8803168da3c0, objid 0x7, maxbytes 0xfffffffffff, magic 0x0BD10BD0, stripe_size 1048576, stripe_count 1 pool ^M LustreError: 2344:0:(llite_lib.c:1591:ll_update_inode()) LBUG^M Pid: 2344, comm: tar^M ^M Call Trace:^M [<ffffffffa02f4855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]^M [<ffffffffa02f4e95>] lbug_with_loc+0x75/0xe0 [libcfs]^M [<ffffffffa072d018>] ll_update_inode+0xd48/0xea0 [lustre]^M [<ffffffff8118d6d4>] ? ifind+0x74/0xd0^M [<ffffffffa07410f0>] ? ll_test_inode+0x0/0x1a0 [lustre]^M [<ffffffffa0741b9b>] ll_iget+0x21b/0x280 [lustre]^M [<ffffffffa02f59ae>] ? cfs_free+0xe/0x10 [libcfs]^M [<ffffffffa07d6de8>] ? lmv_intent_lookup+0x718/0x1190 [lmv]^M [<ffffffffa0741380>] ? ll_md_blocking_ast+0x0/0x600 [lustre]^M [<ffffffffa072d6f9>] ll_prep_inode+0x589/0xb50 [lustre]^M [<ffffffffa0742a05>] ll_lookup_it_finish+0x85/0xa80 [lustre]^M [<ffffffffa02f59ae>] ? cfs_free+0xe/0x10 [libcfs]^M [<ffffffffa0727581>] ? ll_finish_md_op_data+0x2a1/0x4a0 [lustre]^M [<ffffffffa07252b8>] ? ll_prep_md_op_data+0x148/0x410 [lustre]^M [<ffffffffa0743835>] ll_lookup_it+0x435/0xd60 [lustre]^M [<ffffffffa0741380>] ? ll_md_blocking_ast+0x0/0x600 [lustre]^M [<ffffffffa07441e8>] ll_lookup_nd+0x88/0x440 [lustre]^M^@ [<ffffffff8118ac3e>] ? d_alloc+0x13e/0x1b0^M [<ffffffff811806cb>] do_lookup+0x18b/0x220^M [<ffffffff81180cc9>] __link_path_walk+0x569/0x820^M [<ffffffffa06fc000>] ? return_if_equal+0x0/0x30 [lustre]^M [<ffffffff8118164a>] path_walk+0x6a/0xe0^M [<ffffffff8118181b>] do_path_lookup+0x5b/0xa0^M [<ffffffff811819a7>] user_path_at+0x57/0xa0^M [<ffffffff810dbe25>] ? call_rcu_sched+0x15/0x20^M [<ffffffff810dbe3e>] ? call_rcu+0xe/0x10^M [<ffffffff81262a05>] ? _atomic_dec_and_lock+0x55/0x80^M [<ffffffff811910e0>] ? mntput_no_expire+0x30/0x110^M [<ffffffff811740d1>] ? __fput+0x1a1/0x210^M [<ffffffff811a16f7>] do_utimes+0xc7/0xf0^M [<ffffffff810d1ad2>] ? audit_syscall_entry+0x272/0x2a0^M [<ffffffff811a1822>] sys_utimensat+0x32/0x90^M [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b^M ^M Kernel panic - not syncing: LBUG^M Pid: 2344, comm: tar Tainted: G ---------------- T 2.6.32-131.6.1.el6.x86_64 #1^M Call Trace:^M [<ffffffff814da518>] ? panic+0x78/0x143^M [<ffffffffa02f4eeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]^M [<ffffffffa072d018>] ? ll_update_inode+0xd48/0xea0 [lustre]^M [<ffffffff8118d6d4>] ? ifind+0x74/0xd0^M [<ffffffffa07410f0>] ? ll_test_inode+0x0/0x1a0 [lustre]^M [<ffffffffa0741b9b>] ? ll_iget+0x21b/0x280 [lustre]^M [<ffffffffa02f59ae>] ? cfs_free+0xe/0x10 [libcfs]^M [<ffffffffa07d6de8>] ? lmv_intent_lookup+0x718/0x1190 [lmv]^M [<ffffffffa0741380>] ? ll_md_blocking_ast+0x0/0x600 [lustre]^M [<ffffffffa072d6f9>] ? ll_prep_inode+0x589/0xb50 [lustre]^M [<ffffffffa0742a05>] ? ll_lookup_it_finish+0x85/0xa80 [lustre]^M [<ffffffffa02f59ae>] ? cfs_free+0xe/0x10 [libcfs]^M [<ffffffffa0727581>] ? ll_finish_md_op_data+0x2a1/0x4a0 [lustre]^M [<ffffffffa07252b8>] ? ll_prep_md_op_data+0x148/0x410 [lustre]^M [<ffffffffa0743835>] ? ll_lookup_it+0x435/0xd60 [lustre]^M [<ffffffffa0741380>] ? ll_md_blocking_ast+0x0/0x600 [lustre]^M [<ffffffffa07441e8>] ? ll_lookup_nd+0x88/0x440 [lustre]^M [<ffffffff8118ac3e>] ? d_alloc+0x13e/0x1b0^M [<ffffffff811806cb>] ? do_lookup+0x18b/0x220^M [<ffffffff81180cc9>] ? __link_path_walk+0x569/0x820^M [<ffffffffa06fc000>] ? return_if_equal+0x0/0x30 [lustre]^M [<ffffffff8118164a>] ? path_walk+0x6a/0xe0^M [<ffffffff8118181b>] ? do_path_lookup+0x5b/0xa0^M [<ffffffff811819a7>] ? user_path_at+0x57/0xa0^M [<ffffffff810dbe25>] ? call_rcu_sched+0x15/0x20^M [<ffffffff810dbe3e>] ? call_rcu+0xe/0x10^M [<ffffffff81262a05>] ? _atomic_dec_and_lock+0x55/0x80^M [<ffffffff811910e0>] ? mntput_no_expire+0x30/0x110^M [<ffffffff811740d1>] ? __fput+0x1a1/0x210^M [<ffffffff811a16f7>] ? do_utimes+0xc7/0xf0^M [<ffffffff810d1ad2>] ? audit_syscall_entry+0x272/0x2a0^M [<ffffffff811a1822>] ? sys_utimensat+0x32/0x90^M [<ffffffff8100b172>] ? system_call_fastpath+0x16/0x1b^M Syslog on OSS node fat-amd-3-ib showed that: Aug 30 09:31:50 fat-amd-3-ib kernel: Lustre: DEBUG MARKER: Started client load: tar on client-1-ib Aug 30 09:31:50 fat-amd-3-ib xinetd[2179]: EXIT: shell status=0 pid=4039 duration=0(sec)Aug 30 09:31:53 fat-amd-3-ib xinetd[2179]: START: shell pid=4063 from=::ffff:192.168.4.131 Aug 30 09:31:53 fat-amd-3-ib rshd[4064]: root@fat-intel-4-ib.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /us r/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" sh -c "/usr/sbin/lctl mark ==== Checking the clients loads AFTER failovers -- ERRORS_OK=yes");echo XXRETCODE:$?' Aug 30 09:31:53 fat-amd-3-ib kernel: Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failovers -- ERRORS_OK=yes Aug 30 09:31:53 fat-amd-3-ib xinetd[2179]: EXIT: shell status=0 pid=4063 duration=0(sec) Aug 30 09:31:53 fat-amd-3-ib kernel: LustreError: 2620:0:(ldlm_lib.c:910:target_handle_connect()) lustre-OST0005: denying connection for new client 192.168.4.1@o2ib (1e1bdc24-235d-ddd3-8241-31c1b1eae3e1): 2 clients in recovery for 75s Aug 30 09:31:53 fat-amd-3-ib kernel: LustreError: 2620:0:(ldlm_lib.c:910:target_handle_connect()) Skipped 59 previous similar messages Aug 30 09:31:54 fat-amd-3-ib xinetd[2179]: START: shell pid=4087 from=::ffff:192.168.4.131 Aug 30 09:31:54 fat-amd-3-ib rshd[4088]: root@fat-intel-4-ib.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /us r/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" sh -c "/usr/sbin/lctl mark Done checking \/ re-Starting client loads. PASS");echo XXRETCODE:$?'Aug 30 09:31:54 fat-amd-3-ib kernel: Lustre: DEBUG MARKER: Done checking / re-Starting client loads. PASS Aug 30 09:31:54 fat-amd-3-ib xinetd[2179]: EXIT: shell status=0 pid=4087 duration=0(sec) Aug 30 09:32:12 fat-amd-3-ib kernel: INFO: task tgt_recov:3840 blocked for more than 120 seconds. Aug 30 09:32:12 fat-amd-3-ib kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Aug 30 09:32:12 fat-amd-3-ib kernel: tgt_recov D 0000000000000009 0 3840 2 0x00000080Aug 30 09:32:12 fat-amd-3-ib kernel: ffff8802178c3d30 0000000000000046 ffff8802178c3d30 ffffffffa05bbef6Aug 30 09:32:12 fat-amd-3-ib kernel: 0000000000000000 0000000000000000 ffffffffa06557a0 0000000000000007 Aug 30 09:32:12 fat-amd-3-ib kernel: ffff880217cca638 ffff8802178c3fd8 000000000000f598 ffff880217cca638 Aug 30 09:32:12 fat-amd-3-ib kernel: Call Trace: Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05bbef6>] ? check_for_next_transno+0x2d6/0x4f0 [ptlrpc] Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05bbc20>] ? check_for_next_transno+0x0/0x4f0 [ptlrpc] Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05bd765>] target_recovery_overseer+0x85/0x200 [ptlrpc]Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05ba780>] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc]Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffff8108e160>] ? autoremove_wake_function+0x0/0x40Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05be364>] target_recovery_thread+0x4a4/0xf50 [ptlrpc] Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05bdec0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc] Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffff8100c1ca>] child_rip+0xa/0x20 Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05bdec0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc] Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05bdec0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc] Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffff8100c1c0>] ? child_rip+0x0/0x20 Aug 30 09:32:38 fat-amd-3-ib kernel: Lustre: lustre-OST0000: haven't heard from client 1e1bdc24-235d-ddd3-8241-31c1b1eae3e1 (at 192.168.4.1@o2ib) in 48 seconds. I think it's de ad, and I am evicting it. exp ffff88031783e000, cur 1314721958 expire 1314721928 last 1314721910 Aug 30 09:32:38 fat-amd-3-ib kernel: Lustre: Skipped 4 previous similar messages Aug 30 09:33:08 fat-amd-3-ib kernel: Lustre: 3840:0:(ldlm_lib.c:1569:target_recovery_overseer()) recovery is timed out, evict stale exports Aug 30 09:33:08 fat-amd-3-ib kernel: Lustre: 3840:0:(ldlm_lib.c:1569:target_recovery_overseer()) recovery is timed out, evict stale exports Aug 30 09:33:38 fat-amd-3-ib kernel: Lustre: 2657:0:(ldlm_lib.c:877:target_handle_connect()) lustre-OST0005: connection from 53ce78e4-dbf9-09af-3e1f-ba4d62ada8c7@192.168.4.13@o2ib recovering/t0 exp (null) cur 1314722018 last 0 Aug 30 09:33:38 fat-amd-3-ib kernel: Lustre: 2657:0:(ldlm_lib.c:877:target_handle_connect()) Skipped 137 previous similar messages Aug 30 09:36:08 fat-amd-3-ib kernel: Lustre: 3840:0:(ldlm_lib.c:1569:target_recovery_overseer()) recovery is timed out, evict stale exports Aug 30 09:36:08 fat-amd-3-ib kernel: Lustre: 3840:0:(ldlm_lib.c:1569:target_recovery_overseer()) recovery is timed out, evict stale exports Maloo report: https://maloo.whamcloud.com/test_sets/a4d4597c-d387-11e0-8d02-52540025f9af Please find console, syslog, dmesg and debug logs in the attached recovery-double-scale.1314722275.log.tar.bz2. |
| Comments |
| Comment by Peter Jones [ 31/Aug/11 ] |
|
Bobi Could you look into this one please? Thanks Peter |
| Comment by Oleg Drokin [ 31/Aug/11 ] |
|
I think this could be related to |
| Comment by Zhenyu Xu [ 01/Sep/11 ] |
|
Just wondering why in the first place the inode got 0 maxbytes in its lli->lli_smd, it should at least be LUSTRE_STRIPE_MAXBYTES as lov_alloc_memmd() alloc the lov_stripe_md. LustreError: 2344:0:(llite_lib.c:1587:ll_update_inode()) lli_smd: |
| Comment by Peter Jones [ 01/Sep/11 ] |
|
Consider as blocker for 2.1 |
| Comment by Zhenyu Xu [ 02/Sep/11 ] |
|
patch tracking at http://review.whamcloud.com/1327 |
| Comment by Jessica A. Popp (Inactive) [ 08/Sep/11 ] |
|
Bobi - I checked on the status and Build #2070 failed - but the error didn't report back correctly to Gerrit so I asked in test tools chat room and the failure was a false negative. Robert added the build back to the queue, but he didn't have a build number for me. If it works correctly this time the results should post back to your 1327 Gerrit tracker. But I just wanted to let you know there is a new job waiting in the build queue. The queue is pretty long - so it might not have even run by the time you start your day. Hopefully the build works this time. -Jessica |
| Comment by Build Master (Inactive) [ 10/Sep/11 ] |
|
Integrated in Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
|
| Comment by Build Master (Inactive) [ 10/Sep/11 ] |
|
Integrated in Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
|
| Comment by Build Master (Inactive) [ 10/Sep/11 ] |
|
Integrated in Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
|
| Comment by Build Master (Inactive) [ 10/Sep/11 ] |
|
Integrated in Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
|
| Comment by Build Master (Inactive) [ 10/Sep/11 ] |
|
Integrated in Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
|
| Comment by Build Master (Inactive) [ 10/Sep/11 ] |
|
Integrated in Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
|
| Comment by Build Master (Inactive) [ 10/Sep/11 ] |
|
Integrated in Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
|
| Comment by Build Master (Inactive) [ 10/Sep/11 ] |
|
Integrated in Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
|
| Comment by Build Master (Inactive) [ 10/Sep/11 ] |
|
Integrated in Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
|
| Comment by Build Master (Inactive) [ 10/Sep/11 ] |
|
Integrated in Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
|
| Comment by Build Master (Inactive) [ 10/Sep/11 ] |
|
Integrated in Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
|
| Comment by Build Master (Inactive) [ 10/Sep/11 ] |
|
Integrated in Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
|
| Comment by Build Master (Inactive) [ 10/Sep/11 ] |
|
Integrated in Oleg Drokin : 303f41aa0ff07a43a31e397a3d163a6b8b6e4864
|