Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.1.0
-
None
-
Lustre Tag: v2_1_0_0_RC1
Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/274/
e2fsprogs Build: http://newbuild.whamcloud.com/job/e2fsprogs-master/42/
Distro/Arch: RHEL6/x86_64(in-kernel OFED, kernel version: 2.6.32-131.6.1.el6.x86_64)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD
FLAVOR=OSS
MGS/MDS Nodes: fat-amd-1-ib
OSS Nodes: fat-amd-3-ib(active), fat-amd-4-ib(active)
\ /
OST1 (active in fat-amd-3-ib)
OST2 (active in fat-amd-4-ib)
OST3 (active in fat-amd-3-ib)
OST4 (active in fat-amd-4-ib)
OST5 (active in fat-amd-3-ib)
OST6 (active in fat-amd-4-ib)
fat-amd-2-ib(OST7)
Client Nodes: client-[1,2,4,5,12,13,15],fat-intel-4
Network Addresses:
fat-amd-1-ib: 192.168.4.132
fat-amd-2-ib: 192.168.4.133
fat-amd-3-ib: 192.168.4.134
fat-amd-4-ib: 192.168.4.135
client-1-ib: 192.168.4.1
client-2-ib: 192.168.4.2
client-4-ib: 192.168.4.4
client-5-ib: 192.168.4.5
client-12-ib: 192.168.4.12
client-13-ib: 192.168.4.13
client-15-ib: 192.168.4.15
fat-intel-4-ib: 192.168.4.131
-
3
-
4903
Description
While running recovery-mds-scale with FLAVOR=OSS, the test failed as follows:
<~snip~>
Client load failed during failover. Exiting
Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
client-15-ib
Client load failed on node client-15-ib
client client-15-ib load stdout and debug files :
    /tmp/recovery-mds-scale.log_run_dbench.sh-client-15-ib
    /tmp/recovery-mds-scale.log_run_dbench.sh-client-15-ib.debug
2011-08-31 01:45:08 Terminating clients loads ...
Duration: 43200
Server failover period: 600 seconds
Exited after: 0 seconds
Number of failovers before exit:
mds1: 0 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
ost5: 0 times
ost6: 0 times
ost7: 0 times
Status: FAIL: rc=1
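The summary above reports zero failovers for every target, which means the client load failed before the first scheduled failover completed. As a minimal sketch for checking this across many runs, the counts can be pulled out of the summary text with a small parser. The function name `failover_counts` is hypothetical and is not part of the Lustre test framework; the pattern is based only on the output shown in this report.

```python
import re


def failover_counts(summary):
    """Return {target: count} from a recovery-mds-scale exit summary.

    Assumes the 'Number of failovers before exit:' block has the same
    shape as in this report ('<target>: <n> times'), whether the text
    is on one line or split across several.
    """
    # Keep only the text between the block header and the Status line.
    _, _, tail = summary.partition("Number of failovers before exit:")
    tail = tail.split("Status:")[0]
    return {name: int(count)
            for name, count in re.findall(r"(\w+): (\d+) times", tail)}
```

A run in which every value in the returned mapping is 0, as here, failed before any server failover was counted.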
/tmp/recovery-mds-scale.log_run_dbench.sh-client-15-ib:
<~snip~>
2 5122 0.00 MB/sec execute 40 sec latency 144211.791 ms
2 5122 0.00 MB/sec execute 41 sec latency 145211.965 ms
2 5122 0.00 MB/sec execute 42 sec latency 146212.143 ms
[5748] write failed on handle 11108 (Cannot send after transport endpoint shutdown)
Child failed with status 1
/tmp/recovery-mds-scale.log_run_dbench.sh-client-15-ib.debug:
2011-08-31 01:40:59: dbench run starting
+ mkdir -p /mnt/lustre/d0.dbench-client-15-ib
+ load_pid=3602
+ wait 3602
+ rundbench -D /mnt/lustre/d0.dbench-client-15-ib 2
touch: missing file operand
Try `touch --help' for more information.
+ '[' 1 -eq 0 ']'
++ date '+%F %H:%M:%S'
+ echoerr '2011-08-31 01:43:43: dbench failed'
+ echo '2011-08-31 01:43:43: dbench failed'
2011-08-31 01:43:43: dbench failed
Syslog on client node client-15-ib showed:
Aug 31 01:42:51 client-15 kernel: Lustre: 2534:0:(client.c:2530:ptlrpc_replay_interpret()) @@@ Version mismatch during replay
Aug 31 01:42:51 client-15 kernel: req@ffff880308397400 x1378646783695240/t908(908) o-1->lustre-OST0000_UUID@192.168.4.135@o2ib:6/4 lens 512/400 e 1 to 0 dl 1314780217 ref 2 fl Interpret:R/ffffffff/ffffffff rc -75/-1
Aug 31 01:42:55 client-15 kernel: Lustre: 2534:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1378646783702455 sent from lustre-OST0002-osc-ffff880326e33c00 to NID 192.168.4.134@o2ib has failed due to network error: [sent 1314780175] [real_sent 1314780175] [current 1314780175] [deadline 26s] [delay -26s] req@ffff8803118e7000 x1378646783702455/t0(0) o-1->lustre-OST0002_UUID@192.168.4.134@o2ib:28/4 lens 368/512 e 0 to 1 dl 1314780201 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Aug 31 01:42:55 client-15 kernel: Lustre: 2534:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Aug 31 01:43:10 client-15 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.4.135@o2ib. The ost_connect operation failed with -19
Aug 31 01:43:10 client-15 kernel: LustreError: Skipped 6 previous similar messages
Aug 31 01:43:15 client-15 kernel: Lustre: 2535:0:(import.c:526:import_select_connection()) lustre-OST0002-osc-ffff880326e33c00: tried all connections, increasing latency to 21s
Aug 31 01:43:15 client-15 kernel: Lustre: 2535:0:(import.c:526:import_select_connection()) Skipped 6 previous similar messages
Aug 31 01:43:32 client-15 kernel: Lustre: lustre-OST0002-osc-ffff880326e33c00: Connection restored to service lustre-OST0002 using nid 192.168.4.135@o2ib.
Aug 31 01:43:36 client-15 kernel: INFO: task flush-lustre-1:2690 blocked for more than 120 seconds.
Aug 31 01:43:36 client-15 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 31 01:43:36 client-15 kernel: flush-lustre- D 0000000000000002 0 2690 2 0x00000080
Aug 31 01:43:36 client-15 kernel: ffff88030e35d9a0 0000000000000046 0000000000000000 ffffffffa0303424
Aug 31 01:43:36 client-15 kernel: 0000000000000000 ffff88031f3fdc00 ffff88030e35d930 00000001002932fc
Aug 31 01:43:36 client-15 kernel: ffff880325e47078 ffff88030e35dfd8 000000000000f598 ffff880325e47078
Aug 31 01:43:36 client-15 kernel: Call Trace:
Aug 31 01:43:36 client-15 kernel: [<ffffffffa0303424>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
Aug 31 01:43:36 client-15 kernel: [<ffffffff8110d320>] ? sync_page+0x0/0x50
Aug 31 01:43:36 client-15 kernel: [<ffffffff814db3c3>] io_schedule+0x73/0xc0
Aug 31 01:43:36 client-15 kernel: [<ffffffff8110d35d>] sync_page+0x3d/0x50
Aug 31 01:43:36 client-15 kernel: [<ffffffff814dbada>] __wait_on_bit_lock+0x5a/0xc0
Aug 31 01:43:36 client-15 kernel: [<ffffffff8110d2f7>] __lock_page+0x67/0x70
Aug 31 01:43:36 client-15 kernel: [<ffffffff8108e140>] ? wake_bit_function+0x0/0x50
Aug 31 01:43:36 client-15 kernel: [<ffffffff81120d17>] ? __writepage+0x17/0x40
Aug 31 01:43:36 client-15 kernel: [<ffffffff81122272>] write_cache_pages+0x392/0x4a0
Aug 31 01:43:36 client-15 kernel: [<ffffffff81120d00>] ? __writepage+0x0/0x40
Aug 31 01:43:36 client-15 kernel: [<ffffffff811223a4>] generic_writepages+0x24/0x30
Aug 31 01:43:36 client-15 kernel: [<ffffffff811223d1>] do_writepages+0x21/0x40
Aug 31 01:43:36 client-15 kernel: [<ffffffff8119bbdd>] writeback_single_inode+0xdd/0x2c0
Aug 31 01:43:36 client-15 kernel: [<ffffffff8119bfde>] writeback_sb_inodes+0xce/0x180
Aug 31 01:43:36 client-15 kernel: [<ffffffff8119c13b>] writeback_inodes_wb+0xab/0x1b0
Aug 31 01:43:36 client-15 kernel: [<ffffffff8119c4db>] wb_writeback+0x29b/0x3f0
Aug 31 01:43:36 client-15 kernel: [<ffffffff814dac27>] ? thread_return+0x4e/0x777
Aug 31 01:43:36 client-15 kernel: [<ffffffff8107a1a2>] ? del_timer_sync+0x22/0x30
Aug 31 01:43:36 client-15 kernel: [<ffffffff8119c7c9>] wb_do_writeback+0x199/0x240
Aug 31 01:43:36 client-15 kernel: [<ffffffff8119c8d3>] bdi_writeback_task+0x63/0x1b0
Aug 31 01:43:36 client-15 kernel: [<ffffffff8108dfc7>] ? bit_waitqueue+0x17/0xd0
Aug 31 01:43:36 client-15 kernel: [<ffffffff81130bd0>] ? bdi_start_fn+0x0/0x100
Aug 31 01:43:36 client-15 kernel: [<ffffffff81130c56>] bdi_start_fn+0x86/0x100
Aug 31 01:43:36 client-15 kernel: [<ffffffff81130bd0>] ? bdi_start_fn+0x0/0x100
Aug 31 01:43:36 client-15 kernel: [<ffffffff8108dd96>] kthread+0x96/0xa0
Aug 31 01:43:36 client-15 kernel: [<ffffffff8100c1ca>] child_rip+0xa/0x20
Aug 31 01:43:36 client-15 kernel: [<ffffffff8108dd00>] ? kthread+0x0/0xa0
Aug 31 01:43:36 client-15 kernel: [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
Aug 31 01:43:43 client-15 kernel: Lustre: 2534:0:(import.c:1160:completed_replay_interpret()) lustre-OST0000-osc-ffff880326e33c00: version recovery fails, reconnecting
Aug 31 01:43:43 client-15 kernel: LustreError: 167-0: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
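The sequence in the syslog above (version mismatch during replay, a hung writeback task, failed version recovery, then eviction by lustre-OST0000) is easier to compare between clients when reduced to a short timeline. The following is a rough sketch, not a Lustre tool: the function name `recovery_events` and the label strings are hypothetical, and the marker substrings are taken only from the messages quoted in this report, so other message variants will be missed.

```python
import re

# Record prefix as it appears in this report, e.g.
# "Aug 31 01:43:43 client-15 kernel: ". Matching on the prefix rather than
# on newlines also handles logs whose records were pasted onto one line.
RECORD_START = re.compile(
    r"(?P<ts>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: ")

# Substrings copied from the messages in this report (an assumption,
# not an exhaustive list of Lustre recovery messages).
MARKERS = {
    "Version mismatch during replay": "replay version mismatch",
    "blocked for more than": "hung task",
    "version recovery fails": "version recovery failed",
    "This client was evicted": "client evicted",
}


def recovery_events(text):
    """Return [(timestamp, label)] for records containing a known marker."""
    matches = list(RECORD_START.finditer(text))
    events = []
    for i, match in enumerate(matches):
        # A record body runs until the next record prefix (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end]
        for needle, label in MARKERS.items():
            if needle in body:
                events.append((match.group("ts"), label))
    return events
```

Applied to the client-15 syslog excerpt, this would list the mismatch at 01:42:51, the hung task at 01:43:36, and the failed version recovery and eviction at 01:43:43.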
Maloo report: https://maloo.whamcloud.com/test_sets/e68e0d04-d3b4-11e0-8d02-52540025f9af
Please refer to the attached recovery-mds-scale.1314780314.log.tar.bz2 for more logs.