Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.1.0, Lustre 2.3.0, Lustre 2.4.0, Lustre 2.1.1, Lustre 2.1.4
-
Lustre Tag: v2_1_0_0_RC1
Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/274/
e2fsprogs Build: http://newbuild.whamcloud.com/job/e2fsprogs-master/42/
Distro/Arch: RHEL6/x86_64(in-kernel OFED, kernel version: 2.6.32-131.6.1.el6.x86_64)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD
FLAVOR=MDS
MGS/MDS Nodes: fat-amd-1-ib(active), fat-amd-2-ib(passive)
\ /
1 combined MGS/MDT
OSS Nodes: fat-amd-3-ib(active), fat-amd-4-ib(active)
\ /
OST1 (lustre-OST0000, active in fat-amd-3-ib)
OST2 (lustre-OST0001, active in fat-amd-4-ib)
OST3 (lustre-OST0002, active in fat-amd-3-ib)
OST4 (lustre-OST0003, active in fat-amd-4-ib)
OST5 (lustre-OST0004, active in fat-amd-3-ib)
OST6 (lustre-OST0005, active in fat-amd-4-ib)
Client Nodes: client-[1,2,4,5,12,13,15],fat-intel-4
Network Addresses:
fat-amd-1-ib: 192.168.4.132
fat-amd-2-ib: 192.168.4.133
fat-amd-3-ib: 192.168.4.134
fat-amd-4-ib: 192.168.4.135
client-1-ib: 192.168.4.1
client-2-ib: 192.168.4.2
client-4-ib: 192.168.4.4
client-5-ib: 192.168.4.5
client-12-ib: 192.168.4.12
client-13-ib: 192.168.4.13
client-15-ib: 192.168.4.15
fat-intel-4-ib: 192.168.4.131
Lustre Tag: v2_1_0_0_RC1 Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/274/ e2fsprogs Build: http://newbuild.whamcloud.com/job/e2fsprogs-master/42/ Distro/Arch: RHEL6/x86_64(in-kernel OFED, kernel version: 2.6.32-131.6.1.el6.x86_64) ENABLE_QUOTA=yes FAILURE_MODE=HARD FLAVOR=MDS MGS/MDS Nodes: fat-amd-1-ib(active), fat-amd-2-ib(passive) \ / 1 combined MGS/MDT OSS Nodes: fat-amd-3-ib(active), fat-amd-4-ib(active) \ / OST1 (lustre-OST0000, active in fat-amd-3-ib) OST2 (lustre-OST0001, active in fat-amd-4-ib) OST3 (lustre-OST0002, active in fat-amd-3-ib) OST4 (lustre-OST0003, active in fat-amd-4-ib) OST5 (lustre-OST0004, active in fat-amd-3-ib) OST6 (lustre-OST0005, active in fat-amd-4-ib) Client Nodes: client-[1,2,4,5,12,13,15],fat-intel-4 Network Addresses: fat-amd-1-ib: 192.168.4.132 fat-amd-2-ib: 192.168.4.133 fat-amd-3-ib: 192.168.4.134 fat-amd-4-ib: 192.168.4.135 client-1-ib: 192.168.4.1 client-2-ib: 192.168.4.2 client-4-ib: 192.168.4.4 client-5-ib: 192.168.4.5 client-12-ib: 192.168.4.12 client-13-ib: 192.168.4.13 client-15-ib: 192.168.4.15 fat-intel-4-ib: 192.168.4.131
-
3
-
4239
Description
After running for 8 hours (MDS failed over 51 times), recovery-mds-scale test failed as follows:
Starting mds1: -o user_xattr,acl /dev/disk/by-id/scsi-1IET_00010001 /mnt/mds1 fat-amd-1-ib: debug=0x33f0404 fat-amd-1-ib: subsystem_debug=0xffb7e3ff fat-amd-1-ib: debug_mb=48 Started lustre-MDT0000 ==== Checking the clients loads AFTER failover -- failure NOT OK Client load failed on node client-12-ib, rc=1 Client load failed during failover. Exiting Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file client-12-ib client-2-ib Client load failed on node client-12-ib client client-12-ib load stdout and debug files : /tmp/recovery-mds-scale.log_run_dd.sh-client-12-ib /tmp/recovery-mds-scale.log_run_dd.sh-client-12-ib.debug 2011-09-01 08:15:46 Terminating clients loads ... Duration: 43200 Server failover period: 600 seconds Exited after: 30428 seconds Number of failovers before exit: mds1: 51 times ost1: 0 times ost2: 0 times ost3: 0 times ost4: 0 times ost5: 0 times ost6: 0 times Status: FAIL: rc=1
/tmp/recovery-mds-scale.log_run_dd.sh-client-12-ib.debug:
2011-09-01 08:07:51: dd succeeded + cd /tmp + rm -rf /mnt/lustre/d0.dd-client-12-ib ++ date '+%F %H:%M:%S' + echoerr '2011-09-01 08:07:53: dd run finished' + echo '2011-09-01 08:07:53: dd run finished' 2011-09-01 08:07:53: dd run finished + '[' '!' -e /home/yujian/test_logs/end_run_file ']' + true ++ date '+%F %H:%M:%S' + echoerr '2011-09-01 08:07:53: dd run starting' + echo '2011-09-01 08:07:53: dd run starting' 2011-09-01 08:07:53: dd run starting + mkdir -p /mnt/lustre/d0.dd-client-12-ib + cd /mnt/lustre/d0.dd-client-12-ib + load_pid=26714 + wait 26714 + dd bs=4k count=1000000 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-client-12-ib/dd-file dd: writing `/mnt/lustre/d0.dd-client-12-ib/dd-file': No space left on device 805559+0 records in 805558+0 records out + '[' 1 -eq 0 ']' ++ date '+%F %H:%M:%S' + echoerr '2011-09-01 08:09:10: dd failed' + echo '2011-09-01 08:09:10: dd failed' 2011-09-01 08:09:10: dd failed
Syslog on client-12-ib showed:
Sep 1 08:09:10 client-12 kernel: Lustre: DEBUG MARKER: Checking clients are in FULL state before doing next failover Sep 1 08:09:10 client-12 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.4.135@o2ib. The ost_write operation failed with -28 Sep 1 08:09:10 client-12 kernel: LustreError: 26714:0:(vvp_io.c:1001:vvp_io_commit_write()) Write page 805558 of inode ffff880312852a38 failed -28
The dd error and syslog on client-12-ib showed that "No space left on device" on the OSS node fat-amd-4 (192.168.4.135), however, df on that node showed that:
[root@fat-amd-4-ib ~]# lctl dl 0 UP mgc MGC192.168.4.132@o2ib f176795b-1295-66f4-d018-e6c09ba5b112 5 1 UP ost OSS OSS_uuid 3 2 UP obdfilter lustre-OST0001 lustre-OST0001_UUID 21 3 UP obdfilter lustre-OST0003 lustre-OST0003_UUID 21 4 UP obdfilter lustre-OST0005 lustre-OST0005_UUID 21 [root@fat-amd-4-ib ~]# df Filesystem 1K-blocks Used Available Use% Mounted on /dev/sda1 20642428 2223180 17370672 12% / tmpfs 8165072 0 8165072 0% /dev/shm /dev/sdk 15939896 6847620 8292276 46% /mnt/ost2 /dev/sdg 15939896 460468 14679428 4% /mnt/ost4 /dev/sdh 15939896 463104 14676792 4% /mnt/ost6
"lfs df" and "lfs df -i" on the client nodes also showed that there were free enough spaces and inodes on the OSTs. What's more, for each load (dd, tar, dbench, iozone) running on different client node, the load dir would always be removed after a successful run and before a new run (please refer to the recovery-mds-scale.log_run_
{dd,tar,dbench,iozone}.sh-*.debug files), so there should not be ENOSPC error.
Syslog on fat-amd-4-ib showed that:
Sep 1 08:00:05 fat-amd-4-ib kernel: Pid: 4286, comm: ll_ost_io_127 Sep 1 08:00:05 fat-amd-4-ib kernel: Sep 1 08:00:05 fat-amd-4-ib kernel: Call Trace: Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffff81247e8f>] ? __generic_unplug_device+0x1f/0x40 Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffff814dc995>] schedule_timeout+0x215/0x2e0 Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffff814dc613>] wait_for_common+0x123/0x180 Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffff8105dc60>] ? default_wake_function+0x0/0x20 Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffff81249a5d>] ? submit_bio+0x8d/0x120 Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffff814dc72d>] wait_for_completion+0x1d/0x20 Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffff8124cedd>] __blkdev_issue_flush+0xad/0xe0 Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffff8124cf26>] blkdev_issue_flush+0x16/0x20 Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa041f8ab>] ldiskfs_sync_file+0x17b/0x250 [ldiskfs] Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa0a15865>] filter_sync+0x285/0x3e0 [obdfilter] Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffff8115a340>] ? cache_alloc_refill+0x1c0/0x240 Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa04fe98c>] ? lprocfs_counter_add+0x12c/0x196 [lvfs] Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa09cfe0c>] ost_blocking_ast+0x58c/0xe40 [ost] Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa064db73>] ldlm_cancel_callback+0x63/0xf0 [ptlrpc] Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa064dc59>] ldlm_lock_cancel+0x59/0x190 [ptlrpc] Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa066fe36>] ldlm_request_cancel+0x256/0x420 [ptlrpc] Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa09d9720>] ost_handle+0x3d60/0x4b90 [ost] Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa049b6b1>] ? libcfs_debug_vmsg2+0x4d1/0xb50 [libcfs] Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa0694104>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc] Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa069684c>] ? lustre_msg_get_status+0x3c/0xa0 [ptlrpc] Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa06a4c7e>] ptlrpc_main+0xb8e/0x1900 [ptlrpc] Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa06a40f0>] ? ptlrpc_main+0x0/0x1900 [ptlrpc] Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffff8100c1ca>] child_rip+0xa/0x20 Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa06a40f0>] ? ptlrpc_main+0x0/0x1900 [ptlrpc] Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa06a40f0>] ? ptlrpc_main+0x0/0x1900 [ptlrpc] Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffff8100c1c0>] ? child_rip+0x0/0x20 Sep 1 08:00:05 fat-amd-4-ib kernel: Sep 1 08:00:05 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889205.4286 Sep 1 08:00:05 fat-amd-4-ib kernel: Lustre: Service thread pid 4210 was inactive for 156.04s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. Sep 1 08:00:05 fat-amd-4-ib kernel: Lustre: Skipped 2 previous similar messages Sep 1 08:00:05 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889205.4270 Sep 1 08:00:05 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889205.4269 Sep 1 08:00:06 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889206.4237 Sep 1 08:00:06 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889206.4267 Sep 1 08:00:06 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889206.4283 Sep 1 08:00:12 fat-amd-4-ib kernel: Lustre: Service thread pid 4218 was inactive for 156.00s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. Sep 1 08:00:12 fat-amd-4-ib kernel: Lustre: Skipped 12 previous similar messages Sep 1 08:00:12 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889212.4218 Sep 1 08:00:15 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889215.4285 Sep 1 08:00:18 fat-amd-4-ib kernel: Lustre: Service thread pid 4201 was inactive for 156.00s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. Sep 1 08:00:18 fat-amd-4-ib kernel: Lustre: Skipped 2 previous similar messages Sep 1 08:00:18 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889218.4201 Sep 1 08:00:25 fat-amd-4-ib kernel: Lustre: Service thread pid 4204 completed after 177.20s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). Sep 1 08:00:25 fat-amd-4-ib kernel: Lustre: Skipped 27 previous similar messages Sep 1 08:00:35 fat-amd-4-ib kernel: Lustre: Service thread pid 4249 was inactive for 156.00s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. Sep 1 08:00:35 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889235.4249 Sep 1 08:00:36 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889236.4275 Sep 1 08:00:43 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889243.4222 Sep 1 08:00:48 fat-amd-4-ib kernel: Lustre: Service thread pid 4226 completed after 200.23s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Maloo report: https://maloo.whamcloud.com/test_sets/b4bac0f4-d523-11e0-8d02-52540025f9af
Since the size of all of the log files is large, I could not upload them to this ticket. Please find /scratch/logs/2.1.0/recovery-mds-scale.1314890151.log.tar.bz2 on node brent.whamcloud.com.
Attachments
Issue Links
- is related to
-
LU-1824 Test failure on test suite obdfilter-survey, subtest test_2a
- Resolved
- Trackbacks
-
Lustre 2.1.0 release testing tracker Lustre 2.1.0 RC0 Tag: v2100RC0 Created Date: 20110820 The difference between RC0 and RC1 is only a date change in lustre/ChangeLog. Lustre 2.1....
-
Lustre 2.1.1 release testing tracker Lustre 2.1.1 RC2 Tag: v2110RC2 Build: