
[LU-1121] recovery-mds-scale (FLAVOR=OSS): tar: Wrote only 4096 of 7168 bytes

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.1, Lustre 2.1.3
    • Labels: None
    • Severity: 3
    • Rank: 3993

    Description

      While running recovery-mds-scale with FLAVOR=OSS, it failed as follows:

      ==== Checking the clients loads AFTER  failover -- failure NOT OK
      ost3 has failed over 1 times, and counting...
      sleeping 582 seconds ... 
      tar: etc/selinux/targeted/modules/active/modules/sandbox.pp: Wrote only 4096 of 7168 bytes
      tar: Exiting with failure status due to previous errors
      Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
      client-1-ib
      Client load failed on node client-1-ib
      
      client client-1-ib load stdout and debug files :
                    /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
                    /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib.debug
      

      /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib:

      tar: etc/selinux/targeted/modules/active/modules/sandbox.pp: Wrote only 4096 of 7168 bytes
      tar: Exiting with failure status due to previous errors
      

      /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib.debug:

      <~snip~>
      2012-02-18 22:30:41: tar run starting
      + mkdir -p /mnt/lustre/d0.tar-client-1-ib
      + cd /mnt/lustre/d0.tar-client-1-ib
      + wait 7567
      + do_tar
      + tar cf - /etc
      + tar xf -
      + tee /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
      tar: Removing leading `/' from member names
      + return 2
      + RC=2
      ++ grep 'exit delayed from previous errors' /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
      + PREV_ERRORS=
      + true
      + '[' 2 -ne 0 -a '' -a '' ']'
      + '[' 2 -eq 0 ']'
      ++ date '+%F %H:%M:%S'
      + echoerr '2012-02-18 22:37:10: tar failed'
      + echo '2012-02-18 22:37:10: tar failed'
      2012-02-18 22:37:10: tar failed
      <~snip~>
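
      The xtrace above corresponds to roughly the following logic in run_tar.sh. This is a reconstruction from the trace, not the actual script source; the variable names ($LOG, $END_RUN) are assumptions:

      do_tar() {
          # archive /etc and immediately unpack it into the test directory,
          # capturing tar's output for the error checks below
          tar cf - /etc | tar xf - 2>&1 | tee $LOG
          return ${PIPESTATUS[1]}    # exit status of "tar xf -"
      }

      do_tar
      RC=$?
      # GNU tar prints "exit delayed from previous errors" when its non-zero
      # exit status comes from warnings only; that case can be tolerated,
      # while any other non-zero status is treated as a failure
      PREV_ERRORS=$(grep 'exit delayed from previous errors' $LOG)
      if [ $RC -ne 0 -a "$END_RUN" -a "$PREV_ERRORS" ]; then
          RC=0
      fi
      if [ $RC -eq 0 ]; then
          echoerr "$(date '+%F %H:%M:%S'): tar succeeded"
      else
          echoerr "$(date '+%F %H:%M:%S'): tar failed"
      fi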
      

      Syslog on client node client-1-ib showed that:

      Feb 18 22:34:54 client-1 kernel: INFO: task flush-lustre-1:3510 blocked for more than 120 seconds.
      Feb 18 22:34:54 client-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Feb 18 22:34:54 client-1 kernel: flush-lustre- D 0000000000000000     0  3510      2 0x00000080
      Feb 18 22:34:54 client-1 kernel: ffff8801f70e99a0 0000000000000046 ffff8801f70e9920 ffffffffa0942434
      Feb 18 22:34:54 client-1 kernel: 0000000000000000 ffff880331d24980 ffff8801f70e9930 0000000000000000
      Feb 18 22:34:54 client-1 kernel: ffff88027d12b0b8 ffff8801f70e9fd8 000000000000f4e8 ffff88027d12b0b8
      Feb 18 22:34:54 client-1 kernel: Call Trace:
      Feb 18 22:34:54 client-1 kernel: [<ffffffffa0942434>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
      Feb 18 22:34:54 client-1 kernel: [<ffffffff8109b809>] ? ktime_get_ts+0xa9/0xe0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81110b10>] ? sync_page+0x0/0x50
      Feb 18 22:34:54 client-1 kernel: [<ffffffff814ed1c3>] io_schedule+0x73/0xc0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81110b4d>] sync_page+0x3d/0x50
      Feb 18 22:34:54 client-1 kernel: [<ffffffff814eda2a>] __wait_on_bit_lock+0x5a/0xc0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81110ae7>] __lock_page+0x67/0x70
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81090c30>] ? wake_bit_function+0x0/0x50
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81124c97>] ? __writepage+0x17/0x40
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811261f2>] write_cache_pages+0x392/0x4a0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81052600>] ? __dequeue_entity+0x30/0x50
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81124c80>] ? __writepage+0x0/0x40
      Feb 18 22:34:54 client-1 kernel: [<ffffffff8126a5c9>] ? cpumask_next_and+0x29/0x50
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81054754>] ? find_busiest_group+0x244/0xb20
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81126324>] generic_writepages+0x24/0x30
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81126351>] do_writepages+0x21/0x40
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811a046d>] writeback_single_inode+0xdd/0x2c0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811a08ae>] writeback_sb_inodes+0xce/0x180
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811a0a0b>] writeback_inodes_wb+0xab/0x1b0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811a0dab>] wb_writeback+0x29b/0x3f0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff814eca20>] ? thread_return+0x4e/0x77e
      Feb 18 22:34:54 client-1 kernel: [<ffffffff8107cc02>] ? del_timer_sync+0x22/0x30
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811a1099>] wb_do_writeback+0x199/0x240
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811a11a3>] bdi_writeback_task+0x63/0x1b0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81090ab7>] ? bit_waitqueue+0x17/0xd0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81134d40>] ? bdi_start_fn+0x0/0x100
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81134dc6>] bdi_start_fn+0x86/0x100
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81134d40>] ? bdi_start_fn+0x0/0x100
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81090886>] kthread+0x96/0xa0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
      Feb 18 22:34:54 client-1 kernel: [<ffffffff810907f0>] ? kthread+0x0/0xa0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
      

      Maloo report: https://maloo.whamcloud.com/test_sets/f3b4fe94-5af9-11e1-8801-5254004bbbd3

      Please refer to the attached recovery-oss-scale.1329633991.log.tar.bz2 for more logs.

      This appears to be the same issue as LU-874.

    Attachments

    Issue Links

    Activity

            simmonsja James A Simmons added a comment -

            Really old blocker for unsupported version
            yujian Jian Yu added a comment -

            Lustre Tag: v2_1_3_RC1
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/113/
            Distro/Arch: RHEL6.3/x86_64 (kernel version: 2.6.32-279.2.1.el6)
            Network: IB (in-kernel OFED)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD

            The original issue occurred again while running the recovery-mds-scale failover_ost test:

            tar: etc/selinux/targeted/modules/active/modules/rhgb.pp: Wrote only 4096 of 7680 bytes
            tar: Exiting with failure status due to previous errors
            

            https://maloo.whamcloud.com/test_sets/dc54205c-e534-11e1-ae4e-52540035b04c

            After setting PTLDEBUG=-1 and DEBUG_SIZE=200 to reproduce the issue and gather more logs, I hit LU-463 again:
            https://maloo.whamcloud.com/test_sets/b18a1330-e5ad-11e1-ae4e-52540035b04c
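
            For context, PTLDEBUG and DEBUG_SIZE are Lustre test-framework environment variables. One plausible way of passing them to the test (an assumed invocation sketch; the actual command line is not recorded in this ticket):

            export PTLDEBUG=-1      # enable all Lustre debug flags on the test nodes
            export DEBUG_SIZE=200   # debug buffer size, in MB
            FLAVOR=OSS bash lustre/tests/recovery-mds-scale.sh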

            green Oleg Drokin added a comment -

            The second crash is now tracked under LU-1129

            yujian Jian Yu added a comment -

            > I'm disabling panic_on_lbug to get debug log...

            After disabling panic_on_lbug, I could not reproduce the above LBUG and the original issue of this ticket, but kept hitting the known issue: LU-463.

            yujian Jian Yu added a comment -

            I'm disabling panic_on_lbug to get debug log...

            yujian Jian Yu added a comment -

            After I set the debug level and size as follows on all of the nodes:

            debug=-1
            subsystem_debug=0xffb7e3ff
            debug_mb=200
            debug=-trace
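
            These values would presumably be applied on every node via lctl set_param, along the following lines (the commands are an assumption; the ticket records only the resulting values):

            lctl set_param debug=-1                     # enable all debug flags
            lctl set_param subsystem_debug=0xffb7e3ff   # restrict to selected subsystems
            lctl set_param debug_mb=200                 # 200 MB debug buffer
            lctl set_param debug=-trace                 # then drop the very chatty trace flag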
            

            Running recovery-mds-scale with FLAVOR=OSS hit the following LBUG on OSS client-18-ib:

            LustreError: 5227:0:(filter.c:4141:filter_destroy())  lustre-OST0003: can not find olg of group 0
            LustreError: 5227:0:(filter.c:3683:filter_handle_precreate()) ASSERTION(diff >= 0) failed: lustre-OST0003: 163 - 225 = -62
            LustreError: 5227:0:(filter.c:3683:filter_handle_precreate()) LBUG
            Pid: 5227, comm: tgt_recov

            Call Trace:
             [<ffffffffa0636855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
             [<ffffffffa0636e95>] lbug_with_loc+0x75/0xe0 [libcfs]
             [<ffffffffa090ec4d>] filter_create+0x160d/0x1640 [obdfilter]
             [<ffffffffa0919b11>] filter_preprw_write+0x6c1/0x1f10 [obdfilter]
             [<ffffffffa083e87b>] ? _debug_req+0x51b/0x660 [ptlrpc]
             [<ffffffffa091c258>] filter_preprw+0x68/0x90 [obdfilter]
             [<ffffffffa07c110e>] obd_preprw+0x11e/0x420 [ost]
             [<ffffffffa07cb18c>] ost_brw_write+0x98c/0x1870 [ost]
             [<ffffffffa06416a1>] ? libcfs_debug_vmsg2+0x4d1/0xb50 [libcfs]
             [<ffffffffa083b264>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
             [<ffffffffa07cfce5>] ost_handle+0x3325/0x4b90 [ost]
             [<ffffffffa0b16b16>] ? vvp_session_key_init+0xd6/0x1d0 [lustre]
             [<ffffffffa07cc9c0>] ? ost_handle+0x0/0x4b90 [ost]
             [<ffffffffa0800cf6>] handle_recovery_req+0x1f6/0x330 [ptlrpc]
             [<ffffffffa0801257>] target_recovery_thread+0x3a7/0xf50 [ptlrpc]
             [<ffffffffa0800eb0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
             [<ffffffff8100c14a>] child_rip+0xa/0x20
             [<ffffffffa0800eb0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
             [<ffffffffa0800eb0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
             [<ffffffff8100c140>] ? child_rip+0x0/0x20

            Kernel panic - not syncing: LBUG
            Pid: 5227, comm: tgt_recov Not tainted 2.6.32-220.el6_lustre.g4554b65.x86_64 #1
            Call Trace:
             [<ffffffff814ec661>] ? panic+0x78/0x143
             [<ffffffffa0636eeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]
             [<ffffffffa090ec4d>] ? filter_create+0x160d/0x1640 [obdfilter]
             [<ffffffffa0919b11>] ? filter_preprw_write+0x6c1/0x1f10 [obdfilter]
             [<ffffffffa083e87b>] ? _debug_req+0x51b/0x660 [ptlrpc]
             [<ffffffffa091c258>] ? filter_preprw+0x68/0x90 [obdfilter]
             [<ffffffffa07c110e>] ? obd_preprw+0x11e/0x420 [ost]
             [<ffffffffa07cb18c>] ? ost_brw_write+0x98c/0x1870 [ost]
             [<ffffffffa06416a1>] ? libcfs_debug_vmsg2+0x4d1/0xb50 [libcfs]
             [<ffffffffa083b264>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
             [<ffffffffa07cfce5>] ? ost_handle+0x3325/0x4b90 [ost]
             [<ffffffffa0b16b16>] ? vvp_session_key_init+0xd6/0x1d0 [lustre]
             [<ffffffffa07cc9c0>] ? ost_handle+0x0/0x4b90 [ost]
             [<ffffffffa0800cf6>] ? handle_recovery_req+0x1f6/0x330 [ptlrpc]
             [<ffffffffa0801257>] ? target_recovery_thread+0x3a7/0xf50 [ptlrpc]
             [<ffffffffa0800eb0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
             [<ffffffff8100c14a>] ? child_rip+0xa/0x20
             [<ffffffffa0800eb0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
             [<ffffffffa0800eb0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
             [<ffffffff8100c140>] ? child_rip+0x0/0x20
            

            Please refer to /scratch/logs/2.1.1/recovery-oss-scale.1329896304.log.tar.bz2 on the brent node for more logs.


            jay Jinshan Xiong (Inactive) added a comment -

            > Could you please suggest what extra debug code I should add to gather more logs? Thanks.

            On the client side, I always do the following:

            lctl set_param debug=-1
            lctl set_param debug=-trace

            This way I can get a reasonably complete log without needing an oversized debug buffer (debug=-1 enables every flag, then debug=-trace drops the very verbose trace flag). Just a hint.

            green Oleg Drokin added a comment -

            Jinshan, please note the abnormally high rate of pings on all OSTs after 22:36; it's unrelated to the bug at hand, but it still looks strange.

            YuJian, I imagine having vfstrace enabled would have been helpful, as would a somewhat larger debug buffer.
            The kms tracking is done under the D_INODE flag, though we usually don't run with it and I don't know how much it will impact testing.
            If you are looking to do extended testing, +vfstrace +inode +ha look like sensible options to me, along with debug_mb=1024 if you have the RAM. See the sketch below.
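
            Concretely, that suggestion could be applied on each node roughly as follows (a sketch using standard lctl set_param syntax):

            lctl set_param debug=+vfstrace   # VFS entry/exit tracing
            lctl set_param debug=+inode      # inode handling, including kms tracking (D_INODE)
            lctl set_param debug=+ha         # recovery/failover events
            lctl set_param debug_mb=1024     # larger debug buffer, if the node has the RAM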

            yujian Jian Yu added a comment -

            Thanks Oleg and Jinshan for the analysis. I'm going to reproduce this issue and gather more logs. The current debug settings are:

            [root@client-18-ib ~]# lctl get_param debug
            debug=
            super warning dlmtrace error emerg ha rpctrace vfstrace config console
            [root@client-18-ib ~]# lctl get_param subsystem_debug
            subsystem_debug=
            undefined mdc mds osc ost class log llite rpc filter echo ldlm lov lquota lmv sec gss mgc mgs fid fld
            [root@client-18-ib ~]# lctl get_param debug_mb
            debug_mb=48
            

            Could you please suggest what extra debug code I should add to gather more logs? Thanks.


            jay Jinshan Xiong (Inactive) added a comment -

            In my humble opinion, the stack trace is not related to the issue here, nor is it related to LU-874. The real issue is a short write: 7168 bytes were supposed to be written, but only 4096 bytes actually were. I didn't find any clue in the logs. I suspect there is an issue with kms handling.

            The reason there was no activity on client-1 after 22:36 is that the test loop exited due to the tar failure.
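
            For what it's worth, tar prints "Wrote only X of Y bytes" whenever write(2) returns fewer bytes than requested, whatever the underlying cause. A trivial way to see the same message locally, entirely unrelated to this Lustre bug, is to extract onto a nearly full tmpfs (illustration only; exact output may vary by tar version):

            mkdir -p /mnt/tiny && mount -t tmpfs -o size=4k tmpfs /mnt/tiny
            dd if=/dev/zero of=/tmp/seven-k bs=1k count=7   # a 7168-byte file
            tar cf - -C /tmp seven-k | tar xf - -C /mnt/tiny
            # expected, roughly:
            # tar: seven-k: Wrote only 4096 of 7168 bytes
            # tar: Exiting with failure status due to previous errors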


            People

              Assignee: green Oleg Drokin
              Reporter: yujian Jian Yu
              Votes: 0
              Watchers: 5