[LU-1121] recovery-mds-scale (FLAVOR=OSS): tar: Wrote only 4096 of 7168 bytes
Created: 19/Feb/12  Updated: 14/Aug/16  Resolved: 14/Aug/16 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.1, Lustre 2.1.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Jian Yu | Assignee: | Oleg Drokin |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre Tag: v2_1_1_0_RC4
MGS/MDS Nodes: client-8-ib
OSS Nodes: client-18-ib (active), client-19-ib (active)
Client Nodes: client-[1,4,17],fat-amd-2,fat-intel-2
Network Addresses: |
| Attachments: | recovery-oss-scale.1329633991.log.tar.bz2 |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 3993 |
| Description |
|
While running recovery-mds-scale with FLAVOR=OSS, it failed as follows:

==== Checking the clients loads AFTER failover -- failure NOT OK
ost3 has failed over 1 times, and counting...
sleeping 582 seconds ...
tar: etc/selinux/targeted/modules/active/modules/sandbox.pp: Wrote only 4096 of 7168 bytes
tar: Exiting with failure status due to previous errors
Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
client-1-ib
Client load failed on node client-1-ib
client client-1-ib load stdout and debug files :
/tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
/tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib.debug
/tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib:
tar: etc/selinux/targeted/modules/active/modules/sandbox.pp: Wrote only 4096 of 7168 bytes
tar: Exiting with failure status due to previous errors

/tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib.debug:
<~snip~>
2012-02-18 22:30:41: tar run starting
+ mkdir -p /mnt/lustre/d0.tar-client-1-ib
+ cd /mnt/lustre/d0.tar-client-1-ib
+ wait 7567
+ do_tar
+ tar cf - /etc
+ tar xf -
+ tee /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
tar: Removing leading `/' from member names
+ return 2
+ RC=2
++ grep 'exit delayed from previous errors' /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
+ PREV_ERRORS=
+ true
+ '[' 2 -ne 0 -a '' -a '' ']'
+ '[' 2 -eq 0 ']'
++ date '+%F %H:%M:%S'
+ echoerr '2012-02-18 22:37:10: tar failed'
+ echo '2012-02-18 22:37:10: tar failed'
2012-02-18 22:37:10: tar failed
<~snip~>

Syslog on client node client-1-ib showed:

Feb 18 22:34:54 client-1 kernel: INFO: task flush-lustre-1:3510 blocked for more than 120 seconds.
Feb 18 22:34:54 client-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 18 22:34:54 client-1 kernel: flush-lustre- D 0000000000000000 0 3510 2 0x00000080
Feb 18 22:34:54 client-1 kernel: ffff8801f70e99a0 0000000000000046 ffff8801f70e9920 ffffffffa0942434
Feb 18 22:34:54 client-1 kernel: 0000000000000000 ffff880331d24980 ffff8801f70e9930 0000000000000000
Feb 18 22:34:54 client-1 kernel: ffff88027d12b0b8 ffff8801f70e9fd8 000000000000f4e8 ffff88027d12b0b8
Feb 18 22:34:54 client-1 kernel: Call Trace:
Feb 18 22:34:54 client-1 kernel: [<ffffffffa0942434>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
Feb 18 22:34:54 client-1 kernel: [<ffffffff8109b809>] ? ktime_get_ts+0xa9/0xe0
Feb 18 22:34:54 client-1 kernel: [<ffffffff81110b10>] ? sync_page+0x0/0x50
Feb 18 22:34:54 client-1 kernel: [<ffffffff814ed1c3>] io_schedule+0x73/0xc0
Feb 18 22:34:54 client-1 kernel: [<ffffffff81110b4d>] sync_page+0x3d/0x50
Feb 18 22:34:54 client-1 kernel: [<ffffffff814eda2a>] __wait_on_bit_lock+0x5a/0xc0
Feb 18 22:34:54 client-1 kernel: [<ffffffff81110ae7>] __lock_page+0x67/0x70
Feb 18 22:34:54 client-1 kernel: [<ffffffff81090c30>] ? wake_bit_function+0x0/0x50
Feb 18 22:34:54 client-1 kernel: [<ffffffff81124c97>] ? __writepage+0x17/0x40
Feb 18 22:34:54 client-1 kernel: [<ffffffff811261f2>] write_cache_pages+0x392/0x4a0
Feb 18 22:34:54 client-1 kernel: [<ffffffff81052600>] ? __dequeue_entity+0x30/0x50
Feb 18 22:34:54 client-1 kernel: [<ffffffff81124c80>] ? __writepage+0x0/0x40
Feb 18 22:34:54 client-1 kernel: [<ffffffff8126a5c9>] ? cpumask_next_and+0x29/0x50
Feb 18 22:34:54 client-1 kernel: [<ffffffff81054754>] ? find_busiest_group+0x244/0xb20
Feb 18 22:34:54 client-1 kernel: [<ffffffff81126324>] generic_writepages+0x24/0x30
Feb 18 22:34:54 client-1 kernel: [<ffffffff81126351>] do_writepages+0x21/0x40
Feb 18 22:34:54 client-1 kernel: [<ffffffff811a046d>] writeback_single_inode+0xdd/0x2c0
Feb 18 22:34:54 client-1 kernel: [<ffffffff811a08ae>] writeback_sb_inodes+0xce/0x180
Feb 18 22:34:54 client-1 kernel: [<ffffffff811a0a0b>] writeback_inodes_wb+0xab/0x1b0
Feb 18 22:34:54 client-1 kernel: [<ffffffff811a0dab>] wb_writeback+0x29b/0x3f0
Feb 18 22:34:54 client-1 kernel: [<ffffffff814eca20>] ? thread_return+0x4e/0x77e
Feb 18 22:34:54 client-1 kernel: [<ffffffff8107cc02>] ? del_timer_sync+0x22/0x30
Feb 18 22:34:54 client-1 kernel: [<ffffffff811a1099>] wb_do_writeback+0x199/0x240
Feb 18 22:34:54 client-1 kernel: [<ffffffff811a11a3>] bdi_writeback_task+0x63/0x1b0
Feb 18 22:34:54 client-1 kernel: [<ffffffff81090ab7>] ? bit_waitqueue+0x17/0xd0
Feb 18 22:34:54 client-1 kernel: [<ffffffff81134d40>] ? bdi_start_fn+0x0/0x100
Feb 18 22:34:54 client-1 kernel: [<ffffffff81134dc6>] bdi_start_fn+0x86/0x100
Feb 18 22:34:54 client-1 kernel: [<ffffffff81134d40>] ? bdi_start_fn+0x0/0x100
Feb 18 22:34:54 client-1 kernel: [<ffffffff81090886>] kthread+0x96/0xa0
Feb 18 22:34:54 client-1 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Feb 18 22:34:54 client-1 kernel: [<ffffffff810907f0>] ? kthread+0x0/0xa0
Feb 18 22:34:54 client-1 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20

Maloo report: https://maloo.whamcloud.com/test_sets/f3b4fe94-5af9-11e1-8801-5254004bbbd3

Please refer to the attached recovery-oss-scale.1329633991.log.tar.bz2 for more logs.

It seems this is issue |
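For context, the client load that failed above boils down to a tar copy loop. The following is a simplified sketch reconstructed from the "+" trace lines in the debug file; the variable names and loop structure are illustrative, not the verbatim run_tar.sh:

    #!/bin/bash
    # Simplified sketch of the tar client load (reconstructed, not verbatim).
    LOG=/tmp/recovery-mds-scale.log_run_tar.sh-$(hostname)
    END_RUN_FILE=${END_RUN_FILE:-/home/yujian/test_logs/end_run_file}

    mkdir -p /mnt/lustre/d0.tar-$(hostname)
    cd /mnt/lustre/d0.tar-$(hostname) || exit 1
    while [ ! -e "$END_RUN_FILE" ]; do
        # Copy /etc into the Lustre mount; a short write makes tar exit 2.
        tar cf - /etc 2>&1 | tar xf - 2>&1 | tee "$LOG"
        RC=${PIPESTATUS[1]}                     # exit status of "tar xf -"
        # "exit delayed from previous errors" is tolerated as transient;
        # any other non-zero status ends the whole run.
        PREV_ERRORS=$(grep 'exit delayed from previous errors' "$LOG")
        if [ "$RC" -ne 0 ] && [ -z "$PREV_ERRORS" ]; then
            echo "$(date '+%F %H:%M:%S'): tar failed"
            touch "$END_RUN_FILE"               # signal the harness to stop
            exit "$RC"
        fi
    done

The key detail is that tar exits with status 2 on any short write, and only failures whose log contains "exit delayed from previous errors" are tolerated; everything else creates the END_RUN_FILE, which stops the client loads on all nodes.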
| Comments |
| Comment by Jian Yu [ 19/Feb/12 ] |
|
Hi Jay, |
| Comment by Oleg Drokin [ 19/Feb/12 ] |
|
Looking at the logs, I am not all that sure it's related to LU-874 after all. I don't see the affected filename ever being written to, so possibly the logs are too short to capture the problem, or we simply don't have the necessary debug level enabled to see where the write failed; the tar output itself is not timestamped, which makes it hard to look around that time. Additionally, the last part of the client-1 log is pretty strange. It appears that after 22:36, where the last cd into the d0.tar... directory occurs (after the failure was already registered), a period of very frequent pings to the OSTs begins for some reason, and no other activity happens. |
| Comment by Jinshan Xiong (Inactive) [ 20/Feb/12 ] |
|
In my humble opinion, the stack trace is not related to the issue here, nor is it related to

The reason there was no activity on client-1 after 22:36 is that the test loop exited due to the tar failure. |
| Comment by Jian Yu [ 20/Feb/12 ] |
|
Thanks Oleg and Jinshan for the analysis. I'm going to reproduce this issue and gather more logs. The current debug level is:

[root@client-18-ib ~]# lctl get_param debug
debug=
super warning dlmtrace error emerg ha rpctrace vfstrace config console
[root@client-18-ib ~]# lctl get_param subsystem_debug
subsystem_debug=
undefined mdc mds osc ost class log llite rpc filter echo ldlm lov lquota lmv sec gss mgc mgs fid fld
[root@client-18-ib ~]# lctl get_param debug_mb
debug_mb=48

Could you please suggest what extra debugging I should enable to gather more logs? Thanks. |
| Comment by Oleg Drokin [ 20/Feb/12 ] |
|
Jinshan, please note the abnormally high rate of pings on all OSTs after 22:36; it's unrelated to the bug at hand, but still looks strange.

YuJian, I imagine having vfstrace enabled would have been helpful, along with a somewhat larger debug buffer. |
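For reference, the suggestions above map to runtime lctl settings along these lines (a minimal sketch; the 200 MB buffer size is illustrative, not a value Oleg specified):

    # Add VFS tracing to the current debug mask on each client.
    lctl set_param debug=+vfstrace
    # Grow the in-kernel debug buffer (the default shown above was 48 MB).
    lctl set_param debug_mb=200
    # Verify the new settings.
    lctl get_param debug debug_mb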
| Comment by Jinshan Xiong (Inactive) [ 20/Feb/12 ] |
On the client side, I always do the following:

lctl set_param debug=-1

This way I can get a reasonable log without needing an oversized debug buffer. Just a hint.
| Comment by Jian Yu [ 22/Feb/12 ] |
|
After I set the debug level and size as follows on all of the nodes:

debug=-1
subsystem_debug=0xffb7e3ff
debug_mb=200
debug=-trace

running recovery-mds-scale with FLAVOR=OSS hit the following LBUG on OSS client-18-ib:

LustreError: 5227:0:(filter.c:4141:filter_destroy()) lustre-OST0003: can not find olg of group 0
LustreError: 5227:0:(filter.c:3683:filter_handle_precreate()) ASSERTION(diff >= 0) failed: lustre-OST0003: 163 - 225 = -62
LustreError: 5227:0:(filter.c:3683:filter_handle_precreate()) LBUG
Pid: 5227, comm: tgt_recov

Call Trace:
[<ffffffffa0636855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa0636e95>] lbug_with_loc+0x75/0xe0 [libcfs]
[<ffffffffa090ec4d>] filter_create+0x160d/0x1640 [obdfilter]
[<ffffffffa0919b11>] filter_preprw_write+0x6c1/0x1f10 [obdfilter]
[<ffffffffa083e87b>] ? _debug_req+0x51b/0x660 [ptlrpc]
[<ffffffffa091c258>] filter_preprw+0x68/0x90 [obdfilter]
[<ffffffffa07c110e>] obd_preprw+0x11e/0x420 [ost]
[<ffffffffa07cb18c>] ost_brw_write+0x98c/0x1870 [ost]
[<ffffffffa06416a1>] ? libcfs_debug_vmsg2+0x4d1/0xb50 [libcfs]
[<ffffffffa083b264>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
[<ffffffffa07cfce5>] ost_handle+0x3325/0x4b90 [ost]
[<ffffffffa0b16b16>] ? vvp_session_key_init+0xd6/0x1d0 [lustre]
[<ffffffffa07cc9c0>] ? ost_handle+0x0/0x4b90 [ost]
[<ffffffffa0800cf6>] handle_recovery_req+0x1f6/0x330 [ptlrpc]
[<ffffffffa0801257>] target_recovery_thread+0x3a7/0xf50 [ptlrpc]
[<ffffffffa0800eb0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffffa0800eb0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
[<ffffffffa0800eb0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
[<ffffffff8100c140>] ? child_rip+0x0/0x20

Kernel panic - not syncing: LBUG
Pid: 5227, comm: tgt_recov Not tainted 2.6.32-220.el6_lustre.g4554b65.x86_64 #1
Call Trace:
[<ffffffff814ec661>] ? panic+0x78/0x143
[<ffffffffa0636eeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]
[<ffffffffa090ec4d>] ? filter_create+0x160d/0x1640 [obdfilter]
[<ffffffffa0919b11>] ? filter_preprw_write+0x6c1/0x1f10 [obdfilter]
[<ffffffffa083e87b>] ? _debug_req+0x51b/0x660 [ptlrpc]
[<ffffffffa091c258>] ? filter_preprw+0x68/0x90 [obdfilter]
[<ffffffffa07c110e>] ? obd_preprw+0x11e/0x420 [ost]
[<ffffffffa07cb18c>] ? ost_brw_write+0x98c/0x1870 [ost]
[<ffffffffa06416a1>] ? libcfs_debug_vmsg2+0x4d1/0xb50 [libcfs]
[<ffffffffa083b264>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
[<ffffffffa07cfce5>] ? ost_handle+0x3325/0x4b90 [ost]
[<ffffffffa0b16b16>] ? vvp_session_key_init+0xd6/0x1d0 [lustre]
[<ffffffffa07cc9c0>] ? ost_handle+0x0/0x4b90 [ost]
[<ffffffffa0800cf6>] ? handle_recovery_req+0x1f6/0x330 [ptlrpc]
[<ffffffffa0801257>] ? target_recovery_thread+0x3a7/0xf50 [ptlrpc]
[<ffffffffa0800eb0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
[<ffffffff8100c14a>] ? child_rip+0xa/0x20
[<ffffffffa0800eb0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
[<ffffffffa0800eb0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
[<ffffffff8100c140>] ? child_rip+0x0/0x20

Please refer to /scratch/logs/2.1.1/recovery-oss-scale.1329896304.log.tar.bz2 on the brent node for more logs. |
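For the record, one way to push those settings to every node at once (pdsh and the node list are assumptions drawn from the Environment field above; any remote-shell loop works the same way):

    # Full debug mask minus the very noisy trace flag, 200 MB buffer.
    pdsh -w client-[1,4,8,17-19]-ib,fat-amd-2,fat-intel-2 \
        'lctl set_param debug=-1 subsystem_debug=0xffb7e3ff debug_mb=200 && lctl set_param debug=-trace'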
| Comment by Jian Yu [ 22/Feb/12 ] |
|
I'm disabling panic_on_lbug to get the debug log... |
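A minimal sketch of what this involves (panic_on_lbug is the standard libcfs parameter; the dump path is illustrative):

    # Keep the node alive after an LBUG instead of panicking.
    lctl set_param panic_on_lbug=0
    # Once the LBUG fires, dump the in-kernel debug buffer to a file.
    lctl dk /tmp/lustre-debug-$(date +%s).log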
| Comment by Jian Yu [ 22/Feb/12 ] |
After disabling panic_on_lbug, I could not reproduce either the above LBUG or the original issue of this ticket, but kept hitting the known issue: |
| Comment by Oleg Drokin [ 23/Feb/12 ] |
|
The second crash is now tracked under |
| Comment by Jian Yu [ 14/Aug/12 ] |
|
Lustre Tag: v2_1_3_RC1

The original issue occurred again while running the recovery-mds-scale failover_ost test:

tar: etc/selinux/targeted/modules/active/modules/rhgb.pp: Wrote only 4096 of 7680 bytes
tar: Exiting with failure status due to previous errors

https://maloo.whamcloud.com/test_sets/dc54205c-e534-11e1-ae4e-52540035b04c

After setting PTLDEBUG=-1 and DEBUG_SIZE=200 to reproduce the issue and gather more logs, I hit |
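As background, PTLDEBUG and DEBUG_SIZE are Lustre test-framework variables that set the debug mask and buffer size on all nodes before a run; a sketch of the kind of invocation this implies (the exact wrapper script differs per setup):

    # Run from lustre/tests; the values mirror the comment above.
    FLAVOR=OSS PTLDEBUG=-1 DEBUG_SIZE=200 sh recovery-mds-scale.sh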
| Comment by James A Simmons [ 14/Aug/16 ] |
|
Really old blocker for an unsupported version. |