recovery-mds-scale (FLAVOR=OSS): tar: Wrote only 4096 of 7168 bytes

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Blocker
    • None
    • Affects Version/s: Lustre 2.1.1, Lustre 2.1.3
    • None
    • 3
    • 3993

    Description

      recovery-mds-scale, run with FLAVOR=OSS, failed as follows:

      ==== Checking the clients loads AFTER  failover -- failure NOT OK
      ost3 has failed over 1 times, and counting...
      sleeping 582 seconds ... 
      tar: etc/selinux/targeted/modules/active/modules/sandbox.pp: Wrote only 4096 of 7168 bytes
      tar: Exiting with failure status due to previous errors
      Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
      client-1-ib
      Client load failed on node client-1-ib
      
      client client-1-ib load stdout and debug files :
                    /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
                    /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib.debug
      

      /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib:

      tar: etc/selinux/targeted/modules/active/modules/sandbox.pp: Wrote only 4096 of 7168 bytes
      tar: Exiting with failure status due to previous errors
      

      /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib.debug:

      <~snip~>
      2012-02-18 22:30:41: tar run starting
      + mkdir -p /mnt/lustre/d0.tar-client-1-ib
      + cd /mnt/lustre/d0.tar-client-1-ib
      + wait 7567
      + do_tar
      + tar cf - /etc
      + tar xf -
      + tee /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
      tar: Removing leading `/' from member names
      + return 2
      + RC=2
      ++ grep 'exit delayed from previous errors' /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
      + PREV_ERRORS=
      + true
      + '[' 2 -ne 0 -a '' -a '' ']'
      + '[' 2 -eq 0 ']'
      ++ date '+%F %H:%M:%S'
      + echoerr '2012-02-18 22:37:10: tar failed'
      + echo '2012-02-18 22:37:10: tar failed'
      2012-02-18 22:37:10: tar failed
      <~snip~>
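The result check visible in the trace above can be sketched roughly as follows. This is a reconstruction from the xtrace output only, not the actual run_tar.sh source: `classify_tar_run` and its `tolerate` argument are hypothetical names (the trace shows that operand only as an empty string), and resetting the return code when all three conditions hold is an assumption, since the trace only shows the condition evaluating to false.

```shell
#!/bin/bash
# Hypothetical reconstruction of the check seen in the xtrace.
# $1 = tar return code, $2 = tar output log,
# $3 = stand-in for the operand the trace shows as '' (real name unknown).
classify_tar_run() {
    local rc=$1 log=$2 tolerate=$3
    local prev_errors
    # Same pattern as in the trace; grep exits non-zero on no match.
    prev_errors=$(grep 'exit delayed from previous errors' "$log") || true
    if [ "$rc" -ne 0 ] && [ -n "$tolerate" ] && [ -n "$prev_errors" ]; then
        rc=0    # assumption: tolerated earlier errors do not fail the run
    fi
    if [ "$rc" -eq 0 ]; then
        echo "tar ok"
    else
        echo "tar failed"
    fi
}

# Replay the situation from this report: RC=2 and the logged tar messages.
log=$(mktemp)
cat > "$log" <<'EOF'
tar: etc/selinux/targeted/modules/active/modules/sandbox.pp: Wrote only 4096 of 7168 bytes
tar: Exiting with failure status due to previous errors
EOF
classify_tar_run 2 "$log" ""
rm -f "$log"
```

Note that the grep pattern does not match the message tar actually logged here ("Exiting with failure status due to previous errors"), which is why PREV_ERRORS is empty in the trace.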
      

      The syslog on client node client-1-ib showed:

      Feb 18 22:34:54 client-1 kernel: INFO: task flush-lustre-1:3510 blocked for more than 120 seconds.
      Feb 18 22:34:54 client-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Feb 18 22:34:54 client-1 kernel: flush-lustre- D 0000000000000000     0  3510      2 0x00000080
      Feb 18 22:34:54 client-1 kernel: ffff8801f70e99a0 0000000000000046 ffff8801f70e9920 ffffffffa0942434
      Feb 18 22:34:54 client-1 kernel: 0000000000000000 ffff880331d24980 ffff8801f70e9930 0000000000000000
      Feb 18 22:34:54 client-1 kernel: ffff88027d12b0b8 ffff8801f70e9fd8 000000000000f4e8 ffff88027d12b0b8
      Feb 18 22:34:54 client-1 kernel: Call Trace:
      Feb 18 22:34:54 client-1 kernel: [<ffffffffa0942434>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
      Feb 18 22:34:54 client-1 kernel: [<ffffffff8109b809>] ? ktime_get_ts+0xa9/0xe0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81110b10>] ? sync_page+0x0/0x50
      Feb 18 22:34:54 client-1 kernel: [<ffffffff814ed1c3>] io_schedule+0x73/0xc0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81110b4d>] sync_page+0x3d/0x50
      Feb 18 22:34:54 client-1 kernel: [<ffffffff814eda2a>] __wait_on_bit_lock+0x5a/0xc0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81110ae7>] __lock_page+0x67/0x70
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81090c30>] ? wake_bit_function+0x0/0x50
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81124c97>] ? __writepage+0x17/0x40
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811261f2>] write_cache_pages+0x392/0x4a0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81052600>] ? __dequeue_entity+0x30/0x50
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81124c80>] ? __writepage+0x0/0x40
      Feb 18 22:34:54 client-1 kernel: [<ffffffff8126a5c9>] ? cpumask_next_and+0x29/0x50
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81054754>] ? find_busiest_group+0x244/0xb20
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81126324>] generic_writepages+0x24/0x30
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81126351>] do_writepages+0x21/0x40
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811a046d>] writeback_single_inode+0xdd/0x2c0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811a08ae>] writeback_sb_inodes+0xce/0x180
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811a0a0b>] writeback_inodes_wb+0xab/0x1b0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811a0dab>] wb_writeback+0x29b/0x3f0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff814eca20>] ? thread_return+0x4e/0x77e
      Feb 18 22:34:54 client-1 kernel: [<ffffffff8107cc02>] ? del_timer_sync+0x22/0x30
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811a1099>] wb_do_writeback+0x199/0x240
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811a11a3>] bdi_writeback_task+0x63/0x1b0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81090ab7>] ? bit_waitqueue+0x17/0xd0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81134d40>] ? bdi_start_fn+0x0/0x100
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81134dc6>] bdi_start_fn+0x86/0x100
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81134d40>] ? bdi_start_fn+0x0/0x100
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81090886>] kthread+0x96/0xa0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
      Feb 18 22:34:54 client-1 kernel: [<ffffffff810907f0>] ? kthread+0x0/0xa0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
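When triaging logs like the one above, it can help to list just the hung-task reports. The following is a minimal sketch; `hung_tasks` is a hypothetical helper, not part of the Lustre test framework, and it assumes the kernel message format shown in this syslog excerpt.

```shell
#!/bin/bash
# Hypothetical helper: read syslog lines on stdin and print
# "<task>:<pid> <seconds>" for every hung-task report.
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than \([0-9][0-9]*\) seconds\.$/\1:\2 \3/p'
}

# Example using the first line of the report above.
printf '%s\n' \
    'Feb 18 22:34:54 client-1 kernel: INFO: task flush-lustre-1:3510 blocked for more than 120 seconds.' \
    'Feb 18 22:34:54 client-1 kernel: Call Trace:' | hung_tasks
# prints: flush-lustre-1:3510 120
```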
      

      Maloo report: https://maloo.whamcloud.com/test_sets/f3b4fe94-5af9-11e1-8801-5254004bbbd3

      Please refer to the attached recovery-oss-scale.1329633991.log.tar.bz2 for more logs.

      This appears to be the same issue as LU-874.

          Activity

            [LU-1121] recovery-mds-scale (FLAVOR=OSS): tar: Wrote only 4096 of 7168 bytes
            simmonsja James A Simmons made changes -
            Resolution New: Won't Fix [ 2 ]
            Status Original: Open [ 1 ] New: Closed [ 6 ]

            simmonsja James A Simmons added a comment - Really old blocker for unsupported version
            yujian Jian Yu made changes -
            Affects Version/s New: Lustre 2.1.3 [ 10141 ]
            yujian Jian Yu added a comment -

            Lustre Tag: v2_1_3_RC1
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/113/
            Distro/Arch: RHEL6.3/x86_64 (kernel version: 2.6.32-279.2.1.el6)
            Network: IB (in-kernel OFED)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD

            The original issue occurred again while running the recovery-mds-scale failover_ost test:

            tar: etc/selinux/targeted/modules/active/modules/rhgb.pp: Wrote only 4096 of 7680 bytes
            tar: Exiting with failure status due to previous errors
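To compare the short writes across runs, the tar messages can be reduced to numbers. This is a minimal sketch; `short_write_gap` is a hypothetical helper, and it assumes file names without spaces.

```shell
#!/bin/bash
# Hypothetical helper: read tar stderr lines on stdin and print
# "<file> <written> <expected> <missing>" for each short write.
short_write_gap() {
    sed -n 's/^tar: \(.*\): Wrote only \([0-9][0-9]*\) of \([0-9][0-9]*\) bytes$/\1 \2 \3/p' |
    while read -r file written expected; do
        echo "$file $written $expected $((expected - written))"
    done
}

# The two messages reported in this ticket.
printf '%s\n' \
    'tar: etc/selinux/targeted/modules/active/modules/sandbox.pp: Wrote only 4096 of 7168 bytes' \
    'tar: etc/selinux/targeted/modules/active/modules/rhgb.pp: Wrote only 4096 of 7680 bytes' |
    short_write_gap
# prints:
# etc/selinux/targeted/modules/active/modules/sandbox.pp 4096 7168 3072
# etc/selinux/targeted/modules/active/modules/rhgb.pp 4096 7680 3584
```

In both occurrences exactly 4096 bytes were written, which matches the 4 KiB page size on x86_64. That is consistent with the write stopping after the first page, although nothing in this ticket confirms the cause.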
            

            https://maloo.whamcloud.com/test_sets/dc54205c-e534-11e1-ae4e-52540035b04c

            After setting PTLDEBUG=-1 and DEBUG_SIZE=200 to reproduce the issue and gather more logs, I hit LU-463 again:
            https://maloo.whamcloud.com/test_sets/b18a1330-e5ad-11e1-ae4e-52540035b04c

            green Oleg Drokin made changes -
            Link New: This issue is related to LU-1129 [ LU-1129 ]
            green Oleg Drokin added a comment -

            The second crash is now tracked under LU-1129

            pjones Peter Jones made changes -
            Assignee Original: Jinshan Xiong [ jay ] New: Oleg Drokin [ green ]
            pjones Peter Jones made changes -
            Priority Original: Major [ 3 ] New: Blocker [ 1 ]
            yujian Jian Yu added a comment -

            I'm disabling panic_on_lbug to get a debug log...

            After disabling panic_on_lbug, I could not reproduce the above LBUG and the original issue of this ticket, but kept hitting the known issue: LU-463.

            yujian Jian Yu added a comment -

            I'm disabling panic_on_lbug to get a debug log...


            People

              Assignee: green Oleg Drokin
              Reporter: yujian Jian Yu
              Votes: 0
              Watchers: 5