
[LU-1121] recovery-mds-scale (FLAVOR=OSS): tar: Wrote only 4096 of 7168 bytes

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.1, Lustre 2.1.3
    • Labels: None
    • Severity: 3
    • Rank: 3993

    Description

      While running recovery-mds-scale with FLAVOR=OSS, it failed as follows:

      ==== Checking the clients loads AFTER  failover -- failure NOT OK
      ost3 has failed over 1 times, and counting...
      sleeping 582 seconds ... 
      tar: etc/selinux/targeted/modules/active/modules/sandbox.pp: Wrote only 4096 of 7168 bytes
      tar: Exiting with failure status due to previous errors
      Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
      client-1-ib
      Client load failed on node client-1-ib
      
      client client-1-ib load stdout and debug files :
                    /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
                    /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib.debug
      

      /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib:

      tar: etc/selinux/targeted/modules/active/modules/sandbox.pp: Wrote only 4096 of 7168 bytes
      tar: Exiting with failure status due to previous errors
      

      /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib.debug:

      <~snip~>
      2012-02-18 22:30:41: tar run starting
      + mkdir -p /mnt/lustre/d0.tar-client-1-ib
      + cd /mnt/lustre/d0.tar-client-1-ib
      + wait 7567
      + do_tar
      + tar cf - /etc
      + tar xf -
      + tee /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
      tar: Removing leading `/' from member names
      + return 2
      + RC=2
      ++ grep 'exit delayed from previous errors' /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
      + PREV_ERRORS=
      + true
      + '[' 2 -ne 0 -a '' -a '' ']'
      + '[' 2 -eq 0 ']'
      ++ date '+%F %H:%M:%S'
      + echoerr '2012-02-18 22:37:10: tar failed'
      + echo '2012-02-18 22:37:10: tar failed'
      2012-02-18 22:37:10: tar failed
      <~snip~>
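
      The xtrace above corresponds to roughly the following logic in run_tar.sh. This is a reconstruction from the trace, not the actual script source; the variable names ($LOG, $END_RUN) are assumptions:

      do_tar() {
          # archive /etc and immediately unpack it into the test directory,
          # capturing tar's output for the error checks below
          tar cf - /etc | tar xf - 2>&1 | tee $LOG
          return ${PIPESTATUS[1]}    # exit status of "tar xf -"
      }

      do_tar
      RC=$?
      # GNU tar prints "exit delayed from previous errors" when its non-zero
      # exit status comes from warnings only; that case can be tolerated,
      # while any other non-zero status is treated as a failure
      PREV_ERRORS=$(grep 'exit delayed from previous errors' $LOG)
      if [ $RC -ne 0 -a "$END_RUN" -a "$PREV_ERRORS" ]; then
          RC=0
      fi
      if [ $RC -eq 0 ]; then
          echoerr "$(date '+%F %H:%M:%S'): tar succeeded"
      else
          echoerr "$(date '+%F %H:%M:%S'): tar failed"
      fi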
      

      Syslog on client node client-1-ib showed that:

      Feb 18 22:34:54 client-1 kernel: INFO: task flush-lustre-1:3510 blocked for more than 120 seconds.
      Feb 18 22:34:54 client-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Feb 18 22:34:54 client-1 kernel: flush-lustre- D 0000000000000000     0  3510      2 0x00000080
      Feb 18 22:34:54 client-1 kernel: ffff8801f70e99a0 0000000000000046 ffff8801f70e9920 ffffffffa0942434
      Feb 18 22:34:54 client-1 kernel: 0000000000000000 ffff880331d24980 ffff8801f70e9930 0000000000000000
      Feb 18 22:34:54 client-1 kernel: ffff88027d12b0b8 ffff8801f70e9fd8 000000000000f4e8 ffff88027d12b0b8
      Feb 18 22:34:54 client-1 kernel: Call Trace:
      Feb 18 22:34:54 client-1 kernel: [<ffffffffa0942434>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
      Feb 18 22:34:54 client-1 kernel: [<ffffffff8109b809>] ? ktime_get_ts+0xa9/0xe0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81110b10>] ? sync_page+0x0/0x50
      Feb 18 22:34:54 client-1 kernel: [<ffffffff814ed1c3>] io_schedule+0x73/0xc0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81110b4d>] sync_page+0x3d/0x50
      Feb 18 22:34:54 client-1 kernel: [<ffffffff814eda2a>] __wait_on_bit_lock+0x5a/0xc0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81110ae7>] __lock_page+0x67/0x70
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81090c30>] ? wake_bit_function+0x0/0x50
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81124c97>] ? __writepage+0x17/0x40
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811261f2>] write_cache_pages+0x392/0x4a0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81052600>] ? __dequeue_entity+0x30/0x50
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81124c80>] ? __writepage+0x0/0x40
      Feb 18 22:34:54 client-1 kernel: [<ffffffff8126a5c9>] ? cpumask_next_and+0x29/0x50
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81054754>] ? find_busiest_group+0x244/0xb20
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81126324>] generic_writepages+0x24/0x30
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81126351>] do_writepages+0x21/0x40
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811a046d>] writeback_single_inode+0xdd/0x2c0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811a08ae>] writeback_sb_inodes+0xce/0x180
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811a0a0b>] writeback_inodes_wb+0xab/0x1b0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811a0dab>] wb_writeback+0x29b/0x3f0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff814eca20>] ? thread_return+0x4e/0x77e
      Feb 18 22:34:54 client-1 kernel: [<ffffffff8107cc02>] ? del_timer_sync+0x22/0x30
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811a1099>] wb_do_writeback+0x199/0x240
      Feb 18 22:34:54 client-1 kernel: [<ffffffff811a11a3>] bdi_writeback_task+0x63/0x1b0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81090ab7>] ? bit_waitqueue+0x17/0xd0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81134d40>] ? bdi_start_fn+0x0/0x100
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81134dc6>] bdi_start_fn+0x86/0x100
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81134d40>] ? bdi_start_fn+0x0/0x100
      Feb 18 22:34:54 client-1 kernel: [<ffffffff81090886>] kthread+0x96/0xa0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
      Feb 18 22:34:54 client-1 kernel: [<ffffffff810907f0>] ? kthread+0x0/0xa0
      Feb 18 22:34:54 client-1 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
      

      Maloo report: https://maloo.whamcloud.com/test_sets/f3b4fe94-5af9-11e1-8801-5254004bbbd3

      Please refer to the attached recovery-oss-scale.1329633991.log.tar.bz2 for more logs.

      This appears to be the same issue as LU-874.

    Attachments

    Issue Links

    Activity

            simmonsja James A Simmons added a comment -

            Really old blocker for unsupported version
            yujian Jian Yu added a comment -

            Lustre Tag: v2_1_3_RC1
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/113/
            Distro/Arch: RHEL6.3/x86_64 (kernel version: 2.6.32-279.2.1.el6)
            Network: IB (in-kernel OFED)
            ENABLE_QUOTA=yes
            FAILURE_MODE=HARD

            The original issue occurred again while running the recovery-mds-scale failover_ost test:

            tar: etc/selinux/targeted/modules/active/modules/rhgb.pp: Wrote only 4096 of 7680 bytes
            tar: Exiting with failure status due to previous errors
            

            https://maloo.whamcloud.com/test_sets/dc54205c-e534-11e1-ae4e-52540035b04c

            After setting PTLDEBUG=-1 and DEBUG_SIZE=200 to reproduce the issue and gather more logs, I hit LU-463 again:
            https://maloo.whamcloud.com/test_sets/b18a1330-e5ad-11e1-ae4e-52540035b04c
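
            For context, PTLDEBUG and DEBUG_SIZE are Lustre test-framework environment variables. One plausible way of passing them to the test (an assumed invocation sketch; the actual command line is not recorded in this ticket):

            export PTLDEBUG=-1      # enable all Lustre debug flags on the test nodes
            export DEBUG_SIZE=200   # debug buffer size, in MB
            FLAVOR=OSS bash lustre/tests/recovery-mds-scale.sh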

            green Oleg Drokin added a comment -

            The second crash is now tracked under LU-1129

            yujian Jian Yu added a comment -

            > I'm disabling panic_on_lbug to get debug log...

            After disabling panic_on_lbug, I could not reproduce the above LBUG and the original issue of this ticket, but kept hitting the known issue: LU-463.

            yujian Jian Yu added a comment -

            I'm disabling panic_on_lbug to get debug log...

            yujian Jian Yu added a comment -

            After I set the debug level and size as follows on all of the nodes:

            debug=-1
            subsystem_debug=0xffb7e3ff
            debug_mb=200
            debug=-trace
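
            These values would presumably be applied on every node via lctl set_param, along the following lines (the commands are an assumption; the ticket records only the resulting values):

            lctl set_param debug=-1                     # enable all debug flags
            lctl set_param subsystem_debug=0xffb7e3ff   # restrict to selected subsystems
            lctl set_param debug_mb=200                 # 200 MB debug buffer
            lctl set_param debug=-trace                 # then drop the very chatty trace flag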
            

            Running recovery-mds-scale with FLAVOR=OSS hit the following LBUG on OSS client-18-ib:

            LustreError: 5227:0:(filter.c:4141:filter_destroy())  lustre-OST0003: can not find olg of group 0
            LustreError: 5227:0:(filter.c:3683:filter_handle_precreate()) ASSERTION(diff >= 0) failed: lustre-OST0003: 163 - 225 = -62
            LustreError: 5227:0:(filter.c:3683:filter_handle_precreate()) LBUG
            Pid: 5227, comm: tgt_recov

            Call Trace:
             [<ffffffffa0636855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
             [<ffffffffa0636e95>] lbug_with_loc+0x75/0xe0 [libcfs]
             [<ffffffffa090ec4d>] filter_create+0x160d/0x1640 [obdfilter]
             [<ffffffffa0919b11>] filter_preprw_write+0x6c1/0x1f10 [obdfilter]
             [<ffffffffa083e87b>] ? _debug_req+0x51b/0x660 [ptlrpc]
             [<ffffffffa091c258>] filter_preprw+0x68/0x90 [obdfilter]
             [<ffffffffa07c110e>] obd_preprw+0x11e/0x420 [ost]
             [<ffffffffa07cb18c>] ost_brw_write+0x98c/0x1870 [ost]
             [<ffffffffa06416a1>] ? libcfs_debug_vmsg2+0x4d1/0xb50 [libcfs]
             [<ffffffffa083b264>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
             [<ffffffffa07cfce5>] ost_handle+0x3325/0x4b90 [ost]
             [<ffffffffa0b16b16>] ? vvp_session_key_init+0xd6/0x1d0 [lustre]
             [<ffffffffa07cc9c0>] ? ost_handle+0x0/0x4b90 [ost]
             [<ffffffffa0800cf6>] handle_recovery_req+0x1f6/0x330 [ptlrpc]
             [<ffffffffa0801257>] target_recovery_thread+0x3a7/0xf50 [ptlrpc]
             [<ffffffffa0800eb0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
             [<ffffffff8100c14a>] child_rip+0xa/0x20
             [<ffffffffa0800eb0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
             [<ffffffffa0800eb0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
             [<ffffffff8100c140>] ? child_rip+0x0/0x20

            Kernel panic - not syncing: LBUG
            Pid: 5227, comm: tgt_recov Not tainted 2.6.32-220.el6_lustre.g4554b65.x86_64 #1
            Call Trace:
             [<ffffffff814ec661>] ? panic+0x78/0x143
             [<ffffffffa0636eeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]
             [<ffffffffa090ec4d>] ? filter_create+0x160d/0x1640 [obdfilter]
             [<ffffffffa0919b11>] ? filter_preprw_write+0x6c1/0x1f10 [obdfilter]
             [<ffffffffa083e87b>] ? _debug_req+0x51b/0x660 [ptlrpc]
             [<ffffffffa091c258>] ? filter_preprw+0x68/0x90 [obdfilter]
             [<ffffffffa07c110e>] ? obd_preprw+0x11e/0x420 [ost]
             [<ffffffffa07cb18c>] ? ost_brw_write+0x98c/0x1870 [ost]
             [<ffffffffa06416a1>] ? libcfs_debug_vmsg2+0x4d1/0xb50 [libcfs]
             [<ffffffffa083b264>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
             [<ffffffffa07cfce5>] ? ost_handle+0x3325/0x4b90 [ost]
             [<ffffffffa0b16b16>] ? vvp_session_key_init+0xd6/0x1d0 [lustre]
             [<ffffffffa07cc9c0>] ? ost_handle+0x0/0x4b90 [ost]
             [<ffffffffa0800cf6>] ? handle_recovery_req+0x1f6/0x330 [ptlrpc]
             [<ffffffffa0801257>] ? target_recovery_thread+0x3a7/0xf50 [ptlrpc]
             [<ffffffffa0800eb0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
             [<ffffffff8100c14a>] ? child_rip+0xa/0x20
             [<ffffffffa0800eb0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
             [<ffffffffa0800eb0>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
             [<ffffffff8100c140>] ? child_rip+0x0/0x20
            

            Please refer to /scratch/logs/2.1.1/recovery-oss-scale.1329896304.log.tar.bz2 on the brent node for more logs.


            jay Jinshan Xiong (Inactive) added a comment -

            > Could you please suggest what extra debug code I should add to gather more logs? Thanks.

            On the client side, I always do the following:

            lctl set_param debug=-1
            lctl set_param debug=-trace

            This way I can get a reasonably complete log without needing an oversized debug buffer (debug=-1 enables every flag, then debug=-trace drops the very verbose trace flag). Just a hint.

            green Oleg Drokin added a comment -

            Jinshan, please note the abnormally high rate of pings on all OSTs after 22:36; it's unrelated to the bug at hand, but it still looks strange.

            YuJian, I imagine having vfstrace enabled would have been helpful, as would a somewhat larger debug buffer.
            The kms tracking is done under the D_INODE flag, though we usually don't run with it and I don't know how much it will impact testing.
            If you are looking to do extended testing, +vfstrace +inode +ha look like sensible options to me, along with debug_mb=1024 if you have the RAM. See the sketch below.
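
            Concretely, that suggestion could be applied on each node roughly as follows (a sketch using standard lctl set_param syntax):

            lctl set_param debug=+vfstrace   # VFS entry/exit tracing
            lctl set_param debug=+inode      # inode handling, including kms tracking (D_INODE)
            lctl set_param debug=+ha         # recovery/failover events
            lctl set_param debug_mb=1024     # larger debug buffer, if the node has the RAM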

            yujian Jian Yu added a comment -

            Thanks Oleg and Jinshan for the analysis. I'm going to reproduce this issue and gather more logs. The current debug settings are:

            [root@client-18-ib ~]# lctl get_param debug
            debug=
            super warning dlmtrace error emerg ha rpctrace vfstrace config console
            [root@client-18-ib ~]# lctl get_param subsystem_debug
            subsystem_debug=
            undefined mdc mds osc ost class log llite rpc filter echo ldlm lov lquota lmv sec gss mgc mgs fid fld
            [root@client-18-ib ~]# lctl get_param debug_mb
            debug_mb=48
            

            Could you please suggest what extra debug code I should add to gather more logs? Thanks.


            jay Jinshan Xiong (Inactive) added a comment -

            In my humble opinion, the stack trace is not related to the issue here, nor is it related to LU-874. The real issue is a short write: 7168 bytes were supposed to be written, but only 4096 bytes actually were. I didn't find any clue in the logs. I suspect there is an issue with kms handling.

            The reason there was no activity on client-1 after 22:36 is that the test loop exited due to the tar failure.
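
            For what it's worth, tar prints "Wrote only X of Y bytes" whenever write(2) returns fewer bytes than requested, whatever the underlying cause. A trivial way to see the same message locally, entirely unrelated to this Lustre bug, is to extract onto a nearly full tmpfs (illustration only; exact output may vary by tar version):

            mkdir -p /mnt/tiny && mount -t tmpfs -o size=4k tmpfs /mnt/tiny
            dd if=/dev/zero of=/tmp/seven-k bs=1k count=7   # a 7168-byte file
            tar cf - -C /tmp seven-k | tar xf - -C /mnt/tiny
            # expected, roughly:
            # tar: seven-k: Wrote only 4096 of 7168 bytes
            # tar: Exiting with failure status due to previous errors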


            People

              Assignee: green Oleg Drokin
              Reporter: yujian Jian Yu
              Votes: 0
              Watchers: 5