Details
- Bug
- Resolution: Won't Fix
- Blocker
- None
- Lustre 2.1.1, Lustre 2.1.3
- None
- Lustre Tag: v2_1_1_0_RC4
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/44/
e2fsprogs Build: http://build.whamcloud.com/job/e2fsprogs-master/217/
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32-220.el6)
Network: IB (in-kernel OFED)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD
FLAVOR=OSS
MGS/MDS Nodes: client-8-ib
OSS Nodes: client-18-ib(active), client-19-ib(active)
                         \ /
                         OST1 (active in client-18-ib)
                         OST2 (active in client-19-ib)
                         OST3 (active in client-18-ib)
                         OST4 (active in client-19-ib)
                         OST5 (active in client-18-ib)
                         OST6 (active in client-19-ib)
           client-9-ib(OST7)
Client Nodes: client-[1,4,17],fat-amd-2,fat-intel-2
Network Addresses:
client-1-ib: 192.168.4.1
client-4-ib: 192.168.4.4
client-8-ib: 192.168.4.8
client-9-ib: 192.168.4.9
client-17-ib: 192.168.4.17
client-18-ib: 192.168.4.18
client-19-ib: 192.168.4.19
fat-amd-2-ib: 192.168.4.133
fat-intel-2-ib: 192.168.4.129
- 3
- 3993
Description
While running recovery-mds-scale with FLAVOR=OSS, it failed as follows:
==== Checking the clients loads AFTER failover -- failure NOT OK
ost3 has failed over 1 times, and counting...
sleeping 582 seconds ...
tar: etc/selinux/targeted/modules/active/modules/sandbox.pp: Wrote only 4096 of 7168 bytes
tar: Exiting with failure status due to previous errors
Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
client-1-ib
Client load failed on node client-1-ib
client client-1-ib load stdout and debug files :
/tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
/tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib.debug
/tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib:
tar: etc/selinux/targeted/modules/active/modules/sandbox.pp: Wrote only 4096 of 7168 bytes
tar: Exiting with failure status due to previous errors
/tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib.debug:
<~snip~>
2012-02-18 22:30:41: tar run starting
+ mkdir -p /mnt/lustre/d0.tar-client-1-ib
+ cd /mnt/lustre/d0.tar-client-1-ib
+ wait 7567
+ do_tar
+ tar cf - /etc
+ tar xf -
+ tee /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
tar: Removing leading `/' from member names
+ return 2
+ RC=2
++ grep 'exit delayed from previous errors' /tmp/recovery-mds-scale.log_run_tar.sh-client-1-ib
+ PREV_ERRORS=
+ true
+ '[' 2 -ne 0 -a '' -a '' ']'
+ '[' 2 -eq 0 ']'
++ date '+%F %H:%M:%S'
+ echoerr '2012-02-18 22:37:10: tar failed'
+ echo '2012-02-18 22:37:10: tar failed'
2012-02-18 22:37:10: tar failed
<~snip~>
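The two tests at the end of the trace can be sketched as a small shell function. This is a reconstruction from the xtrace output only, not the actual run_tar.sh source. The function name classify_tar_result and the parameter name errors_ok are hypothetical: the trace shows two empty operands in the -a test but names only PREV_ERRORS. Resetting the status to 0 when the first test passes is also an assumption.

```shell
#!/bin/bash
# Sketch only: mirrors the two '[' tests visible in the xtrace.
# Assumption: the first test exists to downgrade tolerated errors.

# classify_tar_result RC ERRORS_OK PREV_ERRORS
#   RC           exit status of the tar pipeline
#   ERRORS_OK    non-empty if errors may be tolerated (hypothetical name)
#   PREV_ERRORS  grep match for 'exit delayed from previous errors'
# Prints "ok" or "failed" and returns 0 or 1 accordingly.
classify_tar_result() {
    local rc="$1" errors_ok="$2" prev_errors="$3"

    # Corresponds to: '[' 2 -ne 0 -a '' -a '' ']'
    if [ "$rc" -ne 0 ] && [ -n "$errors_ok" ] && [ -n "$prev_errors" ]; then
        rc=0    # assumed: tolerated errors are treated as success
    fi

    # Corresponds to: '[' 2 -eq 0 ']'
    if [ "$rc" -eq 0 ]; then
        echo ok
        return 0
    fi
    echo failed
    return 1
}

# Values from this report: RC=2, both strings empty.
classify_tar_result 2 "" "" || true
```

With the values from this run the function prints failed, which is consistent with the "tar failed" message in the debug log: the short write made tar return 2, and no tolerated-error condition applied.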
Syslog on client node client-1-ib showed the following hung task:
Feb 18 22:34:54 client-1 kernel: INFO: task flush-lustre-1:3510 blocked for more than 120 seconds.
Feb 18 22:34:54 client-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 18 22:34:54 client-1 kernel: flush-lustre- D 0000000000000000 0 3510 2 0x00000080
Feb 18 22:34:54 client-1 kernel: ffff8801f70e99a0 0000000000000046 ffff8801f70e9920 ffffffffa0942434
Feb 18 22:34:54 client-1 kernel: 0000000000000000 ffff880331d24980 ffff8801f70e9930 0000000000000000
Feb 18 22:34:54 client-1 kernel: ffff88027d12b0b8 ffff8801f70e9fd8 000000000000f4e8 ffff88027d12b0b8
Feb 18 22:34:54 client-1 kernel: Call Trace:
Feb 18 22:34:54 client-1 kernel: [<ffffffffa0942434>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
Feb 18 22:34:54 client-1 kernel: [<ffffffff8109b809>] ? ktime_get_ts+0xa9/0xe0
Feb 18 22:34:54 client-1 kernel: [<ffffffff81110b10>] ? sync_page+0x0/0x50
Feb 18 22:34:54 client-1 kernel: [<ffffffff814ed1c3>] io_schedule+0x73/0xc0
Feb 18 22:34:54 client-1 kernel: [<ffffffff81110b4d>] sync_page+0x3d/0x50
Feb 18 22:34:54 client-1 kernel: [<ffffffff814eda2a>] __wait_on_bit_lock+0x5a/0xc0
Feb 18 22:34:54 client-1 kernel: [<ffffffff81110ae7>] __lock_page+0x67/0x70
Feb 18 22:34:54 client-1 kernel: [<ffffffff81090c30>] ? wake_bit_function+0x0/0x50
Feb 18 22:34:54 client-1 kernel: [<ffffffff81124c97>] ? __writepage+0x17/0x40
Feb 18 22:34:54 client-1 kernel: [<ffffffff811261f2>] write_cache_pages+0x392/0x4a0
Feb 18 22:34:54 client-1 kernel: [<ffffffff81052600>] ? __dequeue_entity+0x30/0x50
Feb 18 22:34:54 client-1 kernel: [<ffffffff81124c80>] ? __writepage+0x0/0x40
Feb 18 22:34:54 client-1 kernel: [<ffffffff8126a5c9>] ? cpumask_next_and+0x29/0x50
Feb 18 22:34:54 client-1 kernel: [<ffffffff81054754>] ? find_busiest_group+0x244/0xb20
Feb 18 22:34:54 client-1 kernel: [<ffffffff81126324>] generic_writepages+0x24/0x30
Feb 18 22:34:54 client-1 kernel: [<ffffffff81126351>] do_writepages+0x21/0x40
Feb 18 22:34:54 client-1 kernel: [<ffffffff811a046d>] writeback_single_inode+0xdd/0x2c0
Feb 18 22:34:54 client-1 kernel: [<ffffffff811a08ae>] writeback_sb_inodes+0xce/0x180
Feb 18 22:34:54 client-1 kernel: [<ffffffff811a0a0b>] writeback_inodes_wb+0xab/0x1b0
Feb 18 22:34:54 client-1 kernel: [<ffffffff811a0dab>] wb_writeback+0x29b/0x3f0
Feb 18 22:34:54 client-1 kernel: [<ffffffff814eca20>] ? thread_return+0x4e/0x77e
Feb 18 22:34:54 client-1 kernel: [<ffffffff8107cc02>] ? del_timer_sync+0x22/0x30
Feb 18 22:34:54 client-1 kernel: [<ffffffff811a1099>] wb_do_writeback+0x199/0x240
Feb 18 22:34:54 client-1 kernel: [<ffffffff811a11a3>] bdi_writeback_task+0x63/0x1b0
Feb 18 22:34:54 client-1 kernel: [<ffffffff81090ab7>] ? bit_waitqueue+0x17/0xd0
Feb 18 22:34:54 client-1 kernel: [<ffffffff81134d40>] ? bdi_start_fn+0x0/0x100
Feb 18 22:34:54 client-1 kernel: [<ffffffff81134dc6>] bdi_start_fn+0x86/0x100
Feb 18 22:34:54 client-1 kernel: [<ffffffff81134d40>] ? bdi_start_fn+0x0/0x100
Feb 18 22:34:54 client-1 kernel: [<ffffffff81090886>] kthread+0x96/0xa0
Feb 18 22:34:54 client-1 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Feb 18 22:34:54 client-1 kernel: [<ffffffff810907f0>] ? kthread+0x0/0xa0
Feb 18 22:34:54 client-1 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
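To locate reports like this one in a longer syslog, a small filter can help. The following is a minimal sketch that assumes the kernel message format seen in this report ("INFO: task NAME:PID blocked for more than N seconds."); the function name list_hung_tasks is hypothetical.

```shell
#!/bin/bash
# Sketch: extract hung-task reports from syslog text read on stdin.
# Assumes the message format seen in this report:
#   INFO: task <name>:<pid> blocked for more than <N> seconds.
# Prints one line per report: "<name> <pid> <N>".
list_hung_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than \([0-9][0-9]*\) seconds.*/\1 \2 \3/p'
}

# Example using the first line of the syslog excerpt from this report.
echo 'Feb 18 22:34:54 client-1 kernel: INFO: task flush-lustre-1:3510 blocked for more than 120 seconds.' | list_hung_tasks
```

For the line from this report the filter prints "flush-lustre-1 3510 120". Lines that do not match the pattern, such as the call trace frames, are dropped.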
Maloo report: https://maloo.whamcloud.com/test_sets/f3b4fe94-5af9-11e1-8801-5254004bbbd3
Please refer to the attached recovery-oss-scale.1329633991.log.tar.bz2 for more logs.
This appears to be the same issue as LU-874.
Attachments
Issue Links
- is related to
  - LU-1129 filter_handle_precreate()) ASSERTION(diff >= 0) failed (Resolved)
- Trackbacks
  - Lustre 2.1.1 release testing tracker Lustre 2.1.1 RC4 Tag: v2110RC4 Build:
  - Lustre 2.1.3 release testing tracker Lustre 2.1.3 RC1 Tag: v213RC1 Build: