Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-650

recovery-double-scale: (llite_lib.c:1591:ll_update_inode()) LBUG

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.1.0
    • Lustre 2.1.0
    • None
    • 3
    • 4905

    Description

      recovery-double-scale test failed as follows:

      ==== START === test 5: failover OST, then 2 clients ====
      <~snip~>
      Started client load: dd on client-13-ib
      Started client load: tar on client-1-ib
      ==== Checking the clients loads AFTER  failovers -- ERRORS_OK=yes
      Done checking / re-Starting client loads. PASS
      
      ==== START === test 6: failover OST, then MDS ========== 
      ==== Checking the clients loads BEFORE failover -- failure NOT OK
      Client load failed on node client-1-ib, rc=1
      2011-08-30 09:37:48 Terminating clients loads ...
      
      Server failover period: 300 seconds
      Exited after:           0 seconds
      Status: FAIL: rc=4
      

      Console log on client node client-1-ib showed that:

      Lustre: DEBUG MARKER: Started client load: tar on client-1-ib^M
      Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failovers -- ERRORS_OK=yes^M
      LustreError: 2344:0:(llite_lib.c:1586:ll_update_inode()) lsm mismatch for inode 144115238860488711^M
      LustreError: 2344:0:(llite_lib.c:1587:ll_update_inode()) lli_smd:^M
      LustreError: 2344:0:(debug.c:71:dump_lsm()) lsm ffff8803168da740, objid 0x7, maxbytes 0x0, magic 0x0BD10BD0, stripe_size 1048576, stripe_count 1 pool ^M
      LustreError: 11-0: an error occurred while communicating with 192.168.4.134@o2ib. The ost_connect operation failed with -16^M
      LustreError: 2344:0:(llite_lib.c:1589:ll_update_inode()) lsm:^M
      LustreError: 2344:0:(debug.c:71:dump_lsm()) lsm ffff8803168da3c0, objid 0x7, maxbytes 0xfffffffffff, magic 0x0BD10BD0, stripe_size 1048576, stripe_count 1 pool ^M
      LustreError: 2344:0:(llite_lib.c:1591:ll_update_inode()) LBUG^M
      Pid: 2344, comm: tar^M
      ^M
      Call Trace:^M
       [<ffffffffa02f4855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]^M
       [<ffffffffa02f4e95>] lbug_with_loc+0x75/0xe0 [libcfs]^M
       [<ffffffffa072d018>] ll_update_inode+0xd48/0xea0 [lustre]^M
       [<ffffffff8118d6d4>] ? ifind+0x74/0xd0^M
       [<ffffffffa07410f0>] ? ll_test_inode+0x0/0x1a0 [lustre]^M
       [<ffffffffa0741b9b>] ll_iget+0x21b/0x280 [lustre]^M
       [<ffffffffa02f59ae>] ? cfs_free+0xe/0x10 [libcfs]^M
       [<ffffffffa07d6de8>] ? lmv_intent_lookup+0x718/0x1190 [lmv]^M
       [<ffffffffa0741380>] ? ll_md_blocking_ast+0x0/0x600 [lustre]^M
       [<ffffffffa072d6f9>] ll_prep_inode+0x589/0xb50 [lustre]^M
       [<ffffffffa0742a05>] ll_lookup_it_finish+0x85/0xa80 [lustre]^M
       [<ffffffffa02f59ae>] ? cfs_free+0xe/0x10 [libcfs]^M
       [<ffffffffa0727581>] ? ll_finish_md_op_data+0x2a1/0x4a0 [lustre]^M
       [<ffffffffa07252b8>] ? ll_prep_md_op_data+0x148/0x410 [lustre]^M
       [<ffffffffa0743835>] ll_lookup_it+0x435/0xd60 [lustre]^M
       [<ffffffffa0741380>] ? ll_md_blocking_ast+0x0/0x600 [lustre]^M
       [<ffffffffa07441e8>] ll_lookup_nd+0x88/0x440 [lustre]^M^@
       [<ffffffff8118ac3e>] ? d_alloc+0x13e/0x1b0^M
       [<ffffffff811806cb>] do_lookup+0x18b/0x220^M
       [<ffffffff81180cc9>] __link_path_walk+0x569/0x820^M
       [<ffffffffa06fc000>] ? return_if_equal+0x0/0x30 [lustre]^M
       [<ffffffff8118164a>] path_walk+0x6a/0xe0^M
       [<ffffffff8118181b>] do_path_lookup+0x5b/0xa0^M
       [<ffffffff811819a7>] user_path_at+0x57/0xa0^M
       [<ffffffff810dbe25>] ? call_rcu_sched+0x15/0x20^M
       [<ffffffff810dbe3e>] ? call_rcu+0xe/0x10^M
       [<ffffffff81262a05>] ? _atomic_dec_and_lock+0x55/0x80^M
       [<ffffffff811910e0>] ? mntput_no_expire+0x30/0x110^M
       [<ffffffff811740d1>] ? __fput+0x1a1/0x210^M
       [<ffffffff811a16f7>] do_utimes+0xc7/0xf0^M
       [<ffffffff810d1ad2>] ? audit_syscall_entry+0x272/0x2a0^M
       [<ffffffff811a1822>] sys_utimensat+0x32/0x90^M
       [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b^M
      ^M
      Kernel panic - not syncing: LBUG^M
      Pid: 2344, comm: tar Tainted: G           ---------------- T 2.6.32-131.6.1.el6.x86_64 #1^M
      Call Trace:^M
       [<ffffffff814da518>] ? panic+0x78/0x143^M
       [<ffffffffa02f4eeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]^M
       [<ffffffffa072d018>] ? ll_update_inode+0xd48/0xea0 [lustre]^M
       [<ffffffff8118d6d4>] ? ifind+0x74/0xd0^M
       [<ffffffffa07410f0>] ? ll_test_inode+0x0/0x1a0 [lustre]^M
       [<ffffffffa0741b9b>] ? ll_iget+0x21b/0x280 [lustre]^M
       [<ffffffffa02f59ae>] ? cfs_free+0xe/0x10 [libcfs]^M
       [<ffffffffa07d6de8>] ? lmv_intent_lookup+0x718/0x1190 [lmv]^M
       [<ffffffffa0741380>] ? ll_md_blocking_ast+0x0/0x600 [lustre]^M
       [<ffffffffa072d6f9>] ? ll_prep_inode+0x589/0xb50 [lustre]^M
       [<ffffffffa0742a05>] ? ll_lookup_it_finish+0x85/0xa80 [lustre]^M
       [<ffffffffa02f59ae>] ? cfs_free+0xe/0x10 [libcfs]^M
       [<ffffffffa0727581>] ? ll_finish_md_op_data+0x2a1/0x4a0 [lustre]^M
       [<ffffffffa07252b8>] ? ll_prep_md_op_data+0x148/0x410 [lustre]^M
       [<ffffffffa0743835>] ? ll_lookup_it+0x435/0xd60 [lustre]^M
       [<ffffffffa0741380>] ? ll_md_blocking_ast+0x0/0x600 [lustre]^M
       [<ffffffffa07441e8>] ? ll_lookup_nd+0x88/0x440 [lustre]^M
       [<ffffffff8118ac3e>] ? d_alloc+0x13e/0x1b0^M
       [<ffffffff811806cb>] ? do_lookup+0x18b/0x220^M
       [<ffffffff81180cc9>] ? __link_path_walk+0x569/0x820^M
       [<ffffffffa06fc000>] ? return_if_equal+0x0/0x30 [lustre]^M
       [<ffffffff8118164a>] ? path_walk+0x6a/0xe0^M
       [<ffffffff8118181b>] ? do_path_lookup+0x5b/0xa0^M
       [<ffffffff811819a7>] ? user_path_at+0x57/0xa0^M
       [<ffffffff810dbe25>] ? call_rcu_sched+0x15/0x20^M
       [<ffffffff810dbe3e>] ? call_rcu+0xe/0x10^M
       [<ffffffff81262a05>] ? _atomic_dec_and_lock+0x55/0x80^M
       [<ffffffff811910e0>] ? mntput_no_expire+0x30/0x110^M
       [<ffffffff811740d1>] ? __fput+0x1a1/0x210^M
       [<ffffffff811a16f7>] ? do_utimes+0xc7/0xf0^M
       [<ffffffff810d1ad2>] ? audit_syscall_entry+0x272/0x2a0^M
       [<ffffffff811a1822>] ? sys_utimensat+0x32/0x90^M
       [<ffffffff8100b172>] ? system_call_fastpath+0x16/0x1b^M
      

      Syslog on OSS node fat-amd-3-ib showed that:

      Aug 30 09:31:50 fat-amd-3-ib kernel: Lustre: DEBUG MARKER: Started client load: tar on client-1-ib
      Aug 30 09:31:50 fat-amd-3-ib xinetd[2179]: EXIT: shell status=0 pid=4039 duration=0(sec)Aug 30 09:31:53 fat-amd-3-ib xinetd[2179]: START: shell pid=4063 from=::ffff:192.168.4.131
      Aug 30 09:31:53 fat-amd-3-ib rshd[4064]: root@fat-intel-4-ib.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /us
      r/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre"  sh -c "/usr/sbin/lctl mark ==== Checking the clients loads AFTER  failovers -- ERRORS_OK=yes");echo XXRETCODE:$?'
      Aug 30 09:31:53 fat-amd-3-ib kernel: Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failovers -- ERRORS_OK=yes
      Aug 30 09:31:53 fat-amd-3-ib xinetd[2179]: EXIT: shell status=0 pid=4063 duration=0(sec)
      Aug 30 09:31:53 fat-amd-3-ib kernel: LustreError: 2620:0:(ldlm_lib.c:910:target_handle_connect()) lustre-OST0005: denying connection for new client 192.168.4.1@o2ib (1e1bdc24-235d-ddd3-8241-31c1b1eae3e1): 2 clients in recovery for 75s
      Aug 30 09:31:53 fat-amd-3-ib kernel: LustreError: 2620:0:(ldlm_lib.c:910:target_handle_connect()) Skipped 59 previous similar messages
      Aug 30 09:31:54 fat-amd-3-ib xinetd[2179]: START: shell pid=4087 from=::ffff:192.168.4.131
      Aug 30 09:31:54 fat-amd-3-ib rshd[4088]: root@fat-intel-4-ib.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /us
      r/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre"  sh -c "/usr/sbin/lctl mark Done checking \/ re-Starting client loads. PASS");echo XXRETCODE:$?'Aug 30 09:31:54 fat-amd-3-ib kernel: Lustre: DEBUG MARKER: Done checking / re-Starting client loads. PASS
      Aug 30 09:31:54 fat-amd-3-ib xinetd[2179]: EXIT: shell status=0 pid=4087 duration=0(sec)
      Aug 30 09:32:12 fat-amd-3-ib kernel: INFO: task tgt_recov:3840 blocked for more than 120 seconds.
      Aug 30 09:32:12 fat-amd-3-ib kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Aug 30 09:32:12 fat-amd-3-ib kernel: tgt_recov     D 0000000000000009     0  3840      2 0x00000080Aug 30 09:32:12 fat-amd-3-ib kernel: ffff8802178c3d30 0000000000000046 ffff8802178c3d30 ffffffffa05bbef6Aug 30 09:32:12 fat-amd-3-ib kernel: 0000000000000000 0000000000000000 ffffffffa06557a0 0000000000000007
      Aug 30 09:32:12 fat-amd-3-ib kernel: ffff880217cca638 ffff8802178c3fd8 000000000000f598 ffff880217cca638
      Aug 30 09:32:12 fat-amd-3-ib kernel: Call Trace:
      Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05bbef6>] ? check_for_next_transno+0x2d6/0x4f0 [ptlrpc]
      Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05bbc20>] ? check_for_next_transno+0x0/0x4f0 [ptlrpc]
      Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05bd765>] target_recovery_overseer+0x85/0x200 [ptlrpc]Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05ba780>] ? exp_req_replay_healthy+0x0/0x30 [ptlrpc]Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffff8108e160>] ? autoremove_wake_function+0x0/0x40Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05be364>] target_recovery_thread+0x4a4/0xf50 [ptlrpc]
      Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05bdec0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
      Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffff8100c1ca>] child_rip+0xa/0x20
      Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05bdec0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
      Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffffa05bdec0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
      Aug 30 09:32:12 fat-amd-3-ib kernel: [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
      Aug 30 09:32:38 fat-amd-3-ib kernel: Lustre: lustre-OST0000: haven't heard from client 1e1bdc24-235d-ddd3-8241-31c1b1eae3e1 (at 192.168.4.1@o2ib) in 48 seconds. I think it's de
      ad, and I am evicting it. exp ffff88031783e000, cur 1314721958 expire 1314721928 last 1314721910
      Aug 30 09:32:38 fat-amd-3-ib kernel: Lustre: Skipped 4 previous similar messages
      Aug 30 09:33:08 fat-amd-3-ib kernel: Lustre: 3840:0:(ldlm_lib.c:1569:target_recovery_overseer()) recovery is timed out, evict stale exports
      Aug 30 09:33:08 fat-amd-3-ib kernel: Lustre: 3840:0:(ldlm_lib.c:1569:target_recovery_overseer()) recovery is timed out, evict stale exports
      Aug 30 09:33:38 fat-amd-3-ib kernel: Lustre: 2657:0:(ldlm_lib.c:877:target_handle_connect()) lustre-OST0005: connection from 53ce78e4-dbf9-09af-3e1f-ba4d62ada8c7@192.168.4.13@o2ib recovering/t0 exp (null) cur 1314722018 last 0
      Aug 30 09:33:38 fat-amd-3-ib kernel: Lustre: 2657:0:(ldlm_lib.c:877:target_handle_connect()) Skipped 137 previous similar messages
      Aug 30 09:36:08 fat-amd-3-ib kernel: Lustre: 3840:0:(ldlm_lib.c:1569:target_recovery_overseer()) recovery is timed out, evict stale exports
      Aug 30 09:36:08 fat-amd-3-ib kernel: Lustre: 3840:0:(ldlm_lib.c:1569:target_recovery_overseer()) recovery is timed out, evict stale exports
      

      Maloo report: https://maloo.whamcloud.com/test_sets/a4d4597c-d387-11e0-8d02-52540025f9af

      Please find console, syslog, dmesg and debug logs in the attached recovery-double-scale.1314722275.log.tar.bz2.

      Attachments

        Issue Links

          Activity

            People

              bobijam Zhenyu Xu
              yujian Jian Yu
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: