recovery-mds-scale (FLAVOR=OSS): tar: Wrote only 4096 of 7168 bytes

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.1, Lustre 2.1.3
    • Components: None
    • Severity: 3
    • Rank (Obsolete): 3993

    Description

      While running recovery-mds-scale with FLAVOR=OSS, the test failed as follows:

      ==== Checking the clients loads AFTER  failover -- failure NOT OK
      ost3 has failed over 1 times, and counting...
      sleeping 582 seconds ... 
      tar: etc/selinux/targeted/modules/active/modules/sandbox.pp: Wrote only 4096 of 7168 bytes
      tar: Exiting with failure status due to previous errors
      Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
      client-1-ib
      Client load failed on node client-1-ib
      
      client client-1-ib load stdout and debug files :
                    /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
                    /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib.debug
      

      /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib:

      tar: etc/selinux/targeted/modules/active/modules/sandbox.pp: Wrote only 4096 of 7168 bytes
      tar: Exiting with failure status due to previous errors
      

      /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib.debug:

      <~snip~>
      2012-02-18 22:30:41: tar run starting
      + mkdir -p /mnt/lustre/d0.tar-client-1-ib
      + cd /mnt/lustre/d0.tar-client-1-ib
      + wait 7567
      + do_tar
      + tar cf - /etc
      + tar xf -
      + tee /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
      tar: Removing leading `/' from member names
      + return 2
      + RC=2
      ++ grep 'exit delayed from previous errors' /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
      + PREV_ERRORS=
      + true
      + '[' 2 -ne 0 -a '' -a '' ']'
      + '[' 2 -eq 0 ']'
      ++ date '+%F %H:%M:%S'
      + echoerr '2012-02-18 22:37:10: tar failed'
      + echo '2012-02-18 22:37:10: tar failed'
      2012-02-18 22:37:10: tar failed
      <~snip~>
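
      For context, the client load captured in the trace above boils down to the following pipeline (a minimal sketch reconstructed from the shell trace; the stderr redirection is assumed, and the real run_tar.sh wraps this in a loop with additional logging and error handling):

      mkdir -p /mnt/lustre/d0.tar-client-1-ib
      cd /mnt/lustre/d0.tar-client-1-ib
      # archive /etc and unpack it into the Lustre test directory,
      # capturing tar's messages for the later error check
      tar cf - /etc | tar xf - 2>&1 | tee /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib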
      

      Syslog on client node client-1-ib showed the following:

      Feb 18 22:34:54 client-1 kernel: INFO: task flush-lustre-1:3510 blocked for more than 120 seconds.
      Feb 18 22:34:54 client-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Feb 18 22:34:54 client-1 kernel: flush-lustre- D 0000000000000000     0  3510      2 0x00000080
      Feb 18 22:34:54 client-1 kernel: ffff8801f70e99a0 0000000000000046 ffff8801f70e9920 ffffffffa0942434
      Feb 18 22:34:54 client-1 kernel: 0000000000000000 ffff880331d24980 ffff8801f70e9930 0000000000000000
      Feb 18 22:34:54 client-1 kernel: ffff88027d12b0b8 ffff8801f70e9fd8 000000000000f4e8 ffff88027d12b0b8
      Feb 18 22:34:54 client-1 kernel: Call Trace:
      Feb 18 22:34:54 client-1 kernel: [<ffffffffa0942434>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
      Feb 18 22:34:54 client-1 kernel: [<ffffffff8109b809>] ? ktime_get_ts+0xa9/0xe0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81110b10>] ? sync_page+0x0/0x50
      Feb 18 22:34:54 client-1 kernel: [<ffffffff814ed1c3>] io_schedule+0x73/0xc0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81110b4d>] sync_page+0x3d/0x50
      Feb 18 22:34:54 client-1 kernel: [<ffffffff814eda2a>] __wait_on_bit_lock+0x5a/0xc0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81110ae7>] __lock_page+0x67/0x70
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81090c30>] ? wake_bit_function+0x0/0x50
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81124c97>] ? __writepage+0x17/0x40
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811261f2>] write_cache_pages+0x392/0x4a0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81052600>] ? __dequeue_entity+0x30/0x50
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81124c80>] ? __writepage+0x0/0x40
      Feb 18 22:34:54 client-1 kernel: [<ffffffff8126a5c9>] ? cpumask_next_and+0x29/0x50
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81054754>] ? find_busiest_group+0x244/0xb20
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81126324>] generic_writepages+0x24/0x30
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81126351>] do_writepages+0x21/0x40
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811a046d>] writeback_single_inode+0xdd/0x2c0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811a08ae>] writeback_sb_inodes+0xce/0x180
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811a0a0b>] writeback_inodes_wb+0xab/0x1b0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811a0dab>] wb_writeback+0x29b/0x3f0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff814eca20>] ? thread_return+0x4e/0x77e
      Feb 18 22:34:54 client-1 kernel: [<ffffffff8107cc02>] ? del_timer_sync+0x22/0x30
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811a1099>] wb_do_writeback+0x199/0x240
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811a11a3>] bdi_writeback_task+0x63/0x1b0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81090ab7>] ? bit_waitqueue+0x17/0xd0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81134d40>] ? bdi_start_fn+0x0/0x100
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81134dc6>] bdi_start_fn+0x86/0x100
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81134d40>] ? bdi_start_fn+0x0/0x100
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81090886>] kthread+0x96/0xa0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
      Feb 18 22:34:54 client-1 kernel: [<ffffffff810907f0>] ? kthread+0x0/0xa0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
      

      Maloo report: https://maloo.whamcloud.com/test_sets/f3b4fe94-5af9-11e1-8801-5254004bbbd3

      Please refer to the attached recovery-oss-scale.1329633991.log.tar.bz2 for more logs.

      This looks like the same issue as LU-874.

      Attachments

        Issue Links

          Activity

            [LU-1121] recovery-mds-scale (FLAVOR=OSS): tar: Wrote only 4096 of 7168 bytes
            yujian Jian Yu added a comment -

            I'm disabling panic_on_lbug to get a debug log...

            After disabling panic_on_lbug, I could not reproduce the above LBUG or the original issue of this ticket, but kept hitting the known issue LU-463.
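
            For reference, that toggle plus a debug dump amounts to roughly the following on the affected node (a sketch assuming the standard libcfs tunable names; the dump path is arbitrary):

            # keep the node running on an LBUG so the debug log can still be collected
            lctl set_param panic_on_lbug=0
            # dump the kernel debug buffer to a file for later analysis
            lctl dk /tmp/lustre-debug.$(hostname).log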

            yujian Jian Yu added a comment -

            I'm disabling panic_on_lbug to get a debug log...

            yujian Jian Yu added a comment -

            I set the debug level and buffer size as follows on all of the nodes:

            debug=-1
            subsystem_debug=0xffb7e3ff
            debug_mb=200
            debug=-trace
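
            Applied with lctl, those settings would look roughly like this on each node (a sketch assuming the standard lctl parameter names; the values are the ones listed above):

            lctl set_param debug=-1                    # enable all debug flags first
            lctl set_param subsystem_debug=0xffb7e3ff  # select the subsystems of interest
            lctl set_param debug_mb=200                # grow the debug buffer to 200 MB
            lctl set_param debug=-trace                # then drop the very noisy trace flag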
            

            Running recovery-mds-scale with FLAVOR=OSS then hit the following LBUG on OSS client-18-ib:

            LustreError: 5227:0:(filter.c:4141:filter_destroy())  lustre-OST0003: can not find olg of group 0^M
            LustreError: 5227:0:(filter.c:3683:filter_handle_precreate()) ASSERTION(diff >= 0) failed: lustre-OST0003: 163 - 225 = -62^M
            LustreError: 5227:0:(filter.c:3683:filter_handle_precreate()) LBUG^M
            Pid: 5227, comm: tgt_recov^M
            ^M
            Call Trace:^M
             [<ffffffffa0636855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]^M
             [<ffffffffa0636e95>] lbug_with_loc+0x75/0xe0 [libcfs]^M
             [<ffffffffa090ec4d>] filter_create+0x160d/0x1640 [obdfilter]^M
             [<ffffffffa0919b11>] filter_preprw_write+0x6c1/0x1f10 [obdfilter]^M
             [<ffffffffa083e87b>] ? _debug_req+0x51b/0x660 [ptlrpc]^M
             [<ffffffffa091c258>] filter_preprw+0x68/0x90 [obdfilter]^M
             [<ffffffffa07c110e>] obd_preprw+0x11e/0x420 [ost]^M
             [<ffffffffa07cb18c>] ost_brw_write+0x98c/0x1870 [ost]^M
             [<ffffffffa06416a1>] ? libcfs_debug_vmsg2+0x4d1/0xb50 [libcfs]^M
             [<ffffffffa083b264>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]^M
             [<ffffffffa07cfce5>] ost_handle+0x3325/0x4b90 [ost]^M
             [<ffffffffa0b16b16>] ? vvp_session_key_init+0xd6/0x1d0 [lustre]^M
             [<ffffffffa07cc9c0>] ? ost_handle+0x0/0x4b90 [ost]^M
             [<ffffffffa0800cf6>] handle_recovery_req+0x1f6/0x330 [ptlrpc]^M
             [<ffffffffa0801257>] target_recovery_thread+0x3a7/0xf50 [ptlrpc]^M
             [<ffffffffa0800eb0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]^M
             [<ffffffff8100c14a>] child_rip+0xa/0x20^M
             [<ffffffffa0800eb0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]^M
             [<ffffffffa0800eb0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]^M
             [<ffffffff8100c140>] ? child_rip+0x0/0x20^M
            ^M
            Kernel panic - not syncing: LBUG^M
            Pid: 5227, comm: tgt_recov Not tainted 2.6.32-220.el6_lustre.g4554b65.x86_64 #1^M
            Call Trace:^M
             [<ffffffff814ec661>] ? panic+0x78/0x143^M
             [<ffffffffa0636eeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]^M
             [<ffffffffa090ec4d>] ? filter_create+0x160d/0x1640 [obdfilter]^M
             [<ffffffffa0919b11>] ? filter_preprw_write+0x6c1/0x1f10 [obdfilter]^M
             [<ffffffffa083e87b>] ? _debug_req+0x51b/0x660 [ptlrpc]^M
             [<ffffffffa091c258>] ? filter_preprw+0x68/0x90 [obdfilter]^M
             [<ffffffffa07c110e>] ? obd_preprw+0x11e/0x420 [ost]^M
             [<ffffffffa07cb18c>] ? ost_brw_write+0x98c/0x1870 [ost]^M
             [<ffffffffa06416a1>] ? libcfs_debug_vmsg2+0x4d1/0xb50 [libcfs]^M
             [<ffffffffa083b264>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]^M
             [<ffffffffa07cfce5>] ? ost_handle+0x3325/0x4b90 [ost]^M
             [<ffffffffa0b16b16>] ? vvp_session_key_init+0xd6/0x1d0 [lustre]^M
             [<ffffffffa07cc9c0>] ? ost_handle+0x0/0x4b90 [ost]^M
             [<ffffffffa0800cf6>] ? handle_recovery_req+0x1f6/0x330 [ptlrpc]^M
             [<ffffffffa0801257>] ? target_recovery_thread+0x3a7/0xf50 [ptlrpc]^M
             [<ffffffffa0800eb0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]^M
             [<ffffffff8100c14a>] ? child_rip+0xa/0x20^M
             [<ffffffffa0800eb0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]^M
             [<ffffffffa0800eb0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]^M
             [<ffffffff8100c140>] ? child_rip+0x0/0x20^M
            

            Please refer to /scratch/logs/2.1.1/recovery-oss-scale.1329896304.log.tar.bz2 on the brent node for more logs.


            jay Jinshan Xiong (Inactive) added a comment -

            Could you please suggest what extra debug code I should add to gather more logs? Thanks.

            On the client side, I always do the following:

            lctl set_param debug=-1
            lctl set_param debug=-trace

            In this way, I can get a reasonable log without needing a huge debug buffer. Just a hint.

            green Oleg Drokin added a comment -

            Jinshan, please note the abnormally high rate of pings on all OSTs after 22:36; it's unrelated to the bug at hand, but still looks strange.

            YuJian, I imagine having vfstrace enabled would have been helpful, as would a somewhat larger debug buffer.
            The kms tracking is done under the D_INODE flag, though we usually don't run with it and I don't know how much it will impact testing.
            If you are looking to do extended testing, +vfstrace +inode +ha look like sensible options to me. Also 1024 in debug_mb if you have the RAM.
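
            Concretely, that suggestion would amount to something like the following on the test nodes (a sketch, assuming the standard lctl tunables):

            lctl set_param debug=+vfstrace   # VFS entry/exit tracing
            lctl set_param debug=+inode      # inode/kms tracking
            lctl set_param debug=+ha         # recovery/failover events
            lctl set_param debug_mb=1024     # only if the node has RAM to spare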

            yujian Jian Yu added a comment -

            Thanks Oleg and Jinshan for the analysis. I'm going to reproduce this issue and gather more logs. The current debug settings are:

            [root@client-18-ib ~]# lctl get_param debug
            debug=
            super warning dlmtrace error emerg ha rpctrace vfstrace config console
            [root@client-18-ib ~]# lctl get_param subsystem_debug
            subsystem_debug=
            undefined mdc mds osc ost class log llite rpc filter echo ldlm lov lquota lmv sec gss mgc mgs fid fld
            [root@client-18-ib ~]# lctl get_param debug_mb
            debug_mb=48
            

            Could you please suggest what extra debug code I should add to gather more logs? Thanks.


            jay Jinshan Xiong (Inactive) added a comment -

            In my humble opinion, the stack trace is not related to the issue here, nor is it related to LU-874. The real issue is a short write: 7168 bytes were supposed to be written but only 4096 bytes actually were. I didn't find any clue in the logs. I suspect there is an issue with kms handling.

            The reason there was no activity on client-1 after 22:36 is that the test loop exited due to the tar failure.

            green Oleg Drokin added a comment -

            Looking at the logs, I am not all that sure it's related to LU-874 after all.

            I don't see the affected filename ever being written to, so possibly the logs are too short to show the problem, or we just don't have the necessary debug level enabled to see where the write failed so that we could look around that time; the tar output itself is not timestamped.

            Additionally, the last part of the client-1 log is pretty strange. It appears that after 22:36, where the last cd into the d0.tar... dir happens (after the failure was already registered), a period of very frequent pings to the OSTs begins for some reason, and no other activity happens.
            It continues for ~561 seconds; I am not really sure what that part is about.

            yujian Jian Yu added a comment -

            Hi Jay,
            Could you please take a look to see whether this is a duplicate of LU-874 or not? Thanks.


            People

              Assignee: green Oleg Drokin
              Reporter: yujian Jian Yu
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved: