[LU-657] recovery-mds-scale (FLAVOR=MDS): client load failed Created: 02/Sep/11 Updated: 09/Apr/13 Resolved: 15/Jan/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0, Lustre 2.3.0, Lustre 2.4.0, Lustre 2.1.1, Lustre 2.1.4 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Jian Yu | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | MB | ||
| Environment: |
Lustre Tag: v2_1_0_0_RC1
MGS/MDS Nodes: fat-amd-1-ib (active), fat-amd-2-ib (passive)
OSS Nodes: fat-amd-3-ib (active), fat-amd-4-ib (active)
Client Nodes: client-[1,2,4,5,12,13,15], fat-intel-4
Network Addresses: |
||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4239 |
| Description |
|
After running for 8 hours (the MDS failed over 51 times), the recovery-mds-scale test failed as follows:
Starting mds1: -o user_xattr,acl /dev/disk/by-id/scsi-1IET_00010001 /mnt/mds1
fat-amd-1-ib: debug=0x33f0404
fat-amd-1-ib: subsystem_debug=0xffb7e3ff
fat-amd-1-ib: debug_mb=48
Started lustre-MDT0000
==== Checking the clients loads AFTER failover -- failure NOT OK
Client load failed on node client-12-ib, rc=1
Client load failed during failover. Exiting
Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
client-12-ib
client-2-ib
Client load failed on node client-12-ib
client client-12-ib load stdout and debug files :
/tmp/recovery-mds-scale.log_run_dd.sh-client-12-ib
/tmp/recovery-mds-scale.log_run_dd.sh-client-12-ib.debug
2011-09-01 08:15:46 Terminating clients loads ...
Duration: 43200
Server failover period: 600 seconds
Exited after: 30428 seconds
Number of failovers before exit:
mds1: 51 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
ost5: 0 times
ost6: 0 times
Status: FAIL: rc=1
/tmp/recovery-mds-scale.log_run_dd.sh-client-12-ib.debug:
2011-09-01 08:07:51: dd succeeded
+ cd /tmp
+ rm -rf /mnt/lustre/d0.dd-client-12-ib
++ date '+%F %H:%M:%S'
+ echoerr '2011-09-01 08:07:53: dd run finished'
+ echo '2011-09-01 08:07:53: dd run finished'
2011-09-01 08:07:53: dd run finished
+ '[' '!' -e /home/yujian/test_logs/end_run_file ']'
+ true
++ date '+%F %H:%M:%S'
+ echoerr '2011-09-01 08:07:53: dd run starting'
+ echo '2011-09-01 08:07:53: dd run starting'
2011-09-01 08:07:53: dd run starting
+ mkdir -p /mnt/lustre/d0.dd-client-12-ib
+ cd /mnt/lustre/d0.dd-client-12-ib
+ load_pid=26714
+ wait 26714
+ dd bs=4k count=1000000 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-client-12-ib/dd-file
dd: writing `/mnt/lustre/d0.dd-client-12-ib/dd-file': No space left on device
805559+0 records in
805558+0 records out
+ '[' 1 -eq 0 ']'
++ date '+%F %H:%M:%S'
+ echoerr '2011-09-01 08:09:10: dd failed'
+ echo '2011-09-01 08:09:10: dd failed'
2011-09-01 08:09:10: dd failed

Syslog on client-12-ib showed:
Sep 1 08:09:10 client-12 kernel: Lustre: DEBUG MARKER: Checking clients are in FULL state before doing next failover
Sep 1 08:09:10 client-12 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.4.135@o2ib. The ost_write operation failed with -28
Sep 1 08:09:10 client-12 kernel: LustreError: 26714:0:(vvp_io.c:1001:vvp_io_commit_write()) Write page 805558 of inode ffff880312852a38 failed -28

The dd error and the syslog on client-12-ib showed "No space left on device" on the OSS node fat-amd-4 (192.168.4.135); however, df on that node showed:
[root@fat-amd-4-ib ~]# lctl dl
0 UP mgc MGC192.168.4.132@o2ib f176795b-1295-66f4-d018-e6c09ba5b112 5
1 UP ost OSS OSS_uuid 3
2 UP obdfilter lustre-OST0001 lustre-OST0001_UUID 21
3 UP obdfilter lustre-OST0003 lustre-OST0003_UUID 21
4 UP obdfilter lustre-OST0005 lustre-OST0005_UUID 21
[root@fat-amd-4-ib ~]# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 20642428 2223180 17370672 12% /
tmpfs 8165072 0 8165072 0% /dev/shm
/dev/sdk 15939896 6847620 8292276 46% /mnt/ost2
/dev/sdg 15939896 460468 14679428 4% /mnt/ost4
/dev/sdh 15939896 463104 14676792 4% /mnt/ost6

"lfs df" and "lfs df -i" on the client nodes also showed that there was enough free space and enough free inodes on the OSTs. What's more, for each load (dd, tar, dbench, iozone) running on the different client nodes, the load dir is always removed after a successful run and before a new run (please refer to the recovery-mds-scale.log_run_{dd,tar,dbench,iozone}.sh-*.debug files), so there should not be an ENOSPC error.

Syslog on fat-amd-4-ib showed:
Sep 1 08:00:05 fat-amd-4-ib kernel: Pid: 4286, comm: ll_ost_io_127
Sep 1 08:00:05 fat-amd-4-ib kernel:
Sep 1 08:00:05 fat-amd-4-ib kernel: Call Trace:
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffff81247e8f>] ? __generic_unplug_device+0x1f/0x40
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffff814dc995>] schedule_timeout+0x215/0x2e0
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffff814dc613>] wait_for_common+0x123/0x180
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffff8105dc60>] ? default_wake_function+0x0/0x20
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffff81249a5d>] ? submit_bio+0x8d/0x120
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffff814dc72d>] wait_for_completion+0x1d/0x20
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffff8124cedd>] __blkdev_issue_flush+0xad/0xe0
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffff8124cf26>] blkdev_issue_flush+0x16/0x20
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa041f8ab>] ldiskfs_sync_file+0x17b/0x250 [ldiskfs]
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa0a15865>] filter_sync+0x285/0x3e0 [obdfilter]
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffff8115a340>] ? cache_alloc_refill+0x1c0/0x240
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa04fe98c>] ? lprocfs_counter_add+0x12c/0x196 [lvfs]
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa09cfe0c>] ost_blocking_ast+0x58c/0xe40 [ost]
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa064db73>] ldlm_cancel_callback+0x63/0xf0 [ptlrpc]
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa064dc59>] ldlm_lock_cancel+0x59/0x190 [ptlrpc]
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa066fe36>] ldlm_request_cancel+0x256/0x420 [ptlrpc]
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa09d9720>] ost_handle+0x3d60/0x4b90 [ost]
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa049b6b1>] ? libcfs_debug_vmsg2+0x4d1/0xb50 [libcfs]
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa0694104>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa069684c>] ? lustre_msg_get_status+0x3c/0xa0 [ptlrpc]
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa06a4c7e>] ptlrpc_main+0xb8e/0x1900 [ptlrpc]
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa06a40f0>] ? ptlrpc_main+0x0/0x1900 [ptlrpc]
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffff8100c1ca>] child_rip+0xa/0x20
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa06a40f0>] ? ptlrpc_main+0x0/0x1900 [ptlrpc]
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa06a40f0>] ? ptlrpc_main+0x0/0x1900 [ptlrpc]
Sep 1 08:00:05 fat-amd-4-ib kernel: [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
Sep 1 08:00:05 fat-amd-4-ib kernel:
Sep 1 08:00:05 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889205.4286
Sep 1 08:00:05 fat-amd-4-ib kernel: Lustre: Service thread pid 4210 was inactive for 156.04s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
Sep 1 08:00:05 fat-amd-4-ib kernel: Lustre: Skipped 2 previous similar messages
Sep 1 08:00:05 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889205.4270
Sep 1 08:00:05 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889205.4269
Sep 1 08:00:06 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889206.4237
Sep 1 08:00:06 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889206.4267
Sep 1 08:00:06 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889206.4283
Sep 1 08:00:12 fat-amd-4-ib kernel: Lustre: Service thread pid 4218 was inactive for 156.00s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
Sep 1 08:00:12 fat-amd-4-ib kernel: Lustre: Skipped 12 previous similar messages
Sep 1 08:00:12 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889212.4218
Sep 1 08:00:15 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889215.4285
Sep 1 08:00:18 fat-amd-4-ib kernel: Lustre: Service thread pid 4201 was inactive for 156.00s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
Sep 1 08:00:18 fat-amd-4-ib kernel: Lustre: Skipped 2 previous similar messages
Sep 1 08:00:18 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889218.4201
Sep 1 08:00:25 fat-amd-4-ib kernel: Lustre: Service thread pid 4204 completed after 177.20s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Sep 1 08:00:25 fat-amd-4-ib kernel: Lustre: Skipped 27 previous similar messages
Sep 1 08:00:35 fat-amd-4-ib kernel: Lustre: Service thread pid 4249 was inactive for 156.00s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
Sep 1 08:00:35 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889235.4249
Sep 1 08:00:36 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889236.4275
Sep 1 08:00:43 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889243.4222
Sep 1 08:00:48 fat-amd-4-ib kernel: Lustre: Service thread pid 4226 completed after 200.23s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).

Maloo report: https://maloo.whamcloud.com/test_sets/b4bac0f4-d523-11e0-8d02-52540025f9af
Since the size of all of the log files is large, I could not upload them to this ticket. Please find /scratch/logs/2.1.0/recovery-mds-scale.1314890151.log.tar.bz2 on node brent.whamcloud.com. |
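The dd client load in the debug log above boils down to a simple loop: write a large file, and on success remove the load directory before the next pass. The sketch below is inferred from the xtrace only (the real run_dd.sh ships with the Lustre test suite; the defaults and helper names here are assumptions for illustration):

```shell
#!/bin/sh
# Sketch of the dd client-load loop, inferred from the xtrace above.
# END_RUN_FILE/TESTDIR defaults and the "run" guard are assumptions.
END_RUN_FILE=${END_RUN_FILE:-/home/yujian/test_logs/end_run_file}
TESTDIR=${TESTDIR:-/mnt/lustre/d0.dd-$(hostname)}
DD_COUNT=${DD_COUNT:-1000000}    # 1000000 x 4k blocks, ~3.8G per pass

echoerr() { echo "$(date '+%F %H:%M:%S'): $*"; }

one_pass() {
    mkdir -p "$TESTDIR" && cd "$TESTDIR" || return 1
    dd bs=4k count="$DD_COUNT" status=noxfer if=/dev/zero \
        of="$TESTDIR/dd-file" || return 1
    # The load dir is removed after every successful pass, so repeated
    # passes should not accumulate space -- hence the ENOSPC was unexpected.
    cd /tmp && rm -rf "$TESTDIR"
}

main() {
    while [ ! -e "$END_RUN_FILE" ]; do
        echoerr "dd run starting"
        if one_pass; then
            echoerr "dd run finished"
        else
            echoerr "dd failed"
            exit 1
        fi
    done
}

# Run the loop only when invoked with "run", so sourcing is side-effect free.
if [ "${1:-}" = "run" ]; then main; fi
```

The point of the sketch is the cleanup step: each pass deletes its file, so sustained ENOSPC should only happen if the deleted space is not actually reusable yet, which is the direction the later comments investigate.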
| Comments |
| Comment by Jian Yu [ 13/Sep/11 ] |
|
Lustre Branch: master
Lustre configuration:
MGS/MDS Node: fat-amd-1-ib
OSS Nodes: fat-amd-3-ib(active), fat-amd-4-ib(active)
OST1 (active in fat-amd-3-ib)
OST2 (active in fat-amd-4-ib)
OST3 (active in fat-amd-3-ib)
OST4 (active in fat-amd-4-ib)
OST5 (active in fat-amd-3-ib)
OST6 (active in fat-amd-4-ib)
fat-amd-2-ib(OST7)
Client Nodes: client-[1,2,4,5,12,13,15],fat-intel-4
After the OSS failed over 9 times, recovery-mds-scale (FLAVOR=OSS) failed as follows:
==== Checking the clients loads AFTER failover -- failure NOT OK
Client load failed on node client-12-ib, rc=1
Client load failed during failover. Exiting
Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
client-12-ib
Client load failed on node client-12-ib
client client-12-ib load stdout and debug files :
/tmp/recovery-mds-scale.log_run_dd.sh-client-12-ib
/tmp/recovery-mds-scale.log_run_dd.sh-client-12-ib.debug
2011-09-13 03:17:20 Terminating clients loads ...
Duration: 7200
Server failover period: 600 seconds
Exited after: 6452 seconds
Number of failovers before exit:
mds1: 0 times
ost1: 2 times
ost2: 2 times
ost3: 0 times
ost4: 1 times
ost5: 0 times
ost6: 2 times
ost7: 2 times
Status: FAIL: rc=1
/tmp/recovery-mds-scale.log_run_dd.sh-client-12-ib.debug:
2011-09-13 03:12:09: dd succeeded
+ cd /tmp
+ rm -rf /mnt/lustre/d0.dd-client-12-ib
++ date '+%F %H:%M:%S'
+ echoerr '2011-09-13 03:12:13: dd run finished'
+ echo '2011-09-13 03:12:13: dd run finished'
2011-09-13 03:12:13: dd run finished
+ '[' '!' -e /home/yujian/test_logs/end_run_file ']'
+ true
++ date '+%F %H:%M:%S'
+ echoerr '2011-09-13 03:12:13: dd run starting'
+ echo '2011-09-13 03:12:13: dd run starting'
2011-09-13 03:12:13: dd run starting
+ mkdir -p /mnt/lustre/d0.dd-client-12-ib
+ cd /mnt/lustre/d0.dd-client-12-ib
+ load_pid=8828
+ wait 8828
+ dd bs=4k count=1000000 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-client-12-ib/dd-file
dd: writing `/mnt/lustre/d0.dd-client-12-ib/dd-file': No space left on device
654964+0 records in
654963+0 records out
+ '[' 1 -eq 0 ']'
++ date '+%F %H:%M:%S'
+ echoerr '2011-09-13 03:13:21: dd failed'
+ echo '2011-09-13 03:13:21: dd failed'
2011-09-13 03:13:21: dd failed

Syslog on client-12-ib showed that:
Sep 13 03:13:12 client-12 kernel: Lustre: 2186:0:(client.c:1773:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1315908792/real 1315908792] req@ffff8800ae432800 x1379823408037596/t0(0) o8->lustre-OST0002-osc-ffff880315e7c400@192.168.4.134@o2ib:28/4 lens 368/512 e 0 to 1 dl 1315908818 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 13 03:13:12 client-12 kernel: Lustre: 2186:0:(client.c:1773:ptlrpc_expire_one_request()) Skipped 135 previous similar messages
Sep 13 03:13:21 client-12 kernel: LustreError: 8828:0:(vvp_io.c:1001:vvp_io_commit_write()) Write page 654963 of inode ffff88028e371bf8 failed -28

[root@client-12-ib ~]# lfs df -h
UUID bytes Used Available Use% Mounted on
lustre-MDT0000_UUID 7.2G 435.4M 6.2G 6% /mnt/lustre[MDT:0]
lustre-OST0000_UUID 15.2G 440.1M 14.0G 3% /mnt/lustre[OST:0]
lustre-OST0001_UUID 15.2G 440.1M 14.0G 3% /mnt/lustre[OST:1]
lustre-OST0002_UUID 15.2G 440.1M 14.0G 3% /mnt/lustre[OST:2]
lustre-OST0003_UUID 15.2G 440.1M 14.0G 3% /mnt/lustre[OST:3]
lustre-OST0004_UUID 15.2G 440.1M 14.0G 3% /mnt/lustre[OST:4]
lustre-OST0005_UUID 15.2G 440.1M 14.0G 3% /mnt/lustre[OST:5]
lustre-OST0006_UUID 15.2G 6.6G 7.9G 45% /mnt/lustre[OST:6]
filesystem summary: 106.4G 9.1G 91.9G 9% /mnt/lustre
[root@client-12-ib ~]# lfs df -i
UUID Inodes IUsed IFree IUse% Mounted on
lustre-MDT0000_UUID 5000040 63 4999977 0% /mnt/lustre[MDT:0]
lustre-OST0000_UUID 236160 89 236071 0% /mnt/lustre[OST:0]
lustre-OST0001_UUID 236160 94 236066 0% /mnt/lustre[OST:1]
lustre-OST0002_UUID 236160 94 236066 0% /mnt/lustre[OST:2]
lustre-OST0003_UUID 236160 88 236072 0% /mnt/lustre[OST:3]
lustre-OST0004_UUID 236160 87 236073 0% /mnt/lustre[OST:4]
lustre-OST0005_UUID 236160 89 236071 0% /mnt/lustre[OST:5]
lustre-OST0006_UUID 236160 100 236060 0% /mnt/lustre[OST:6]
filesystem summary: 5000040 63 4999977 0% /mnt/lustre
[root@fat-amd-2-ib ~]# lctl dl
0 UP mgc MGC192.168.4.132@o2ib 87b4991e-f76f-5234-3bfb-544ca6cfbb6f 5
1 UP ost OSS OSS_uuid 3
2 UP obdfilter lustre-OST0006 lustre-OST0006_UUID 19
[root@fat-amd-2-ib ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 20G 1.8G 17G 10% /
tmpfs 7.8G 0 7.8G 0% /dev/shm
/dev/sde 16G 6.6G 7.9G 46% /mnt/ost7
[root@fat-amd-2-ib ~]# df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda1 1310720 42667 1268053 4% /
tmpfs 2041268 1 2041267 1% /dev/shm
/dev/sde 236160 100 236060 1% /mnt/ost7

Syslog on fat-amd-2-ib showed that:
Sep 13 03:12:59 fat-amd-2-ib kernel: INFO: task jbd2/sde-8:2487 blocked for more than 120 seconds.
Sep 13 03:12:59 fat-amd-2-ib kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 13 03:12:59 fat-amd-2-ib kernel: jbd2/sde-8 D 0000000000000004 0 2487 2 0x00000080
Sep 13 03:12:59 fat-amd-2-ib kernel: ffff88011a5a3c10 0000000000000046 0000000000000000 ffff88011ac4cc70
Sep 13 03:12:59 fat-amd-2-ib kernel: ffff88011a5a3b80 ffffffff81012979 ffff88011a5a3bc0 ffffffff81098d09
Sep 13 03:12:59 fat-amd-2-ib kernel: ffff880117e00678 ffff88011a5a3fd8 000000000000f598 ffff880117e00678
Sep 13 03:12:59 fat-amd-2-ib kernel: Call Trace:
Sep 13 03:12:59 fat-amd-2-ib kernel: [<ffffffff81012979>] ? read_tsc+0x9/0x20
Sep 13 03:12:59 fat-amd-2-ib kernel: [<ffffffff81098d09>] ? ktime_get_ts+0xa9/0xe0
Sep 13 03:12:59 fat-amd-2-ib kernel: [<ffffffff81098d09>] ? ktime_get_ts+0xa9/0xe0
Sep 13 03:12:59 fat-amd-2-ib kernel: [<ffffffff811a3de0>] ? sync_buffer+0x0/0x50
Sep 13 03:12:59 fat-amd-2-ib kernel: [<ffffffff814dc403>] io_schedule+0x73/0xc0
Sep 13 03:12:59 fat-amd-2-ib kernel: [<ffffffff811a3e20>] sync_buffer+0x40/0x50
Sep 13 03:12:59 fat-amd-2-ib kernel: [<ffffffff814dcc6f>] __wait_on_bit+0x5f/0x90
Sep 13 03:12:59 fat-amd-2-ib kernel: [<ffffffff811a3de0>] ? sync_buffer+0x0/0x50
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffff814dcd18>] out_of_line_wait_on_bit+0x78/0x90
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffff8108e1a0>] ? wake_bit_function+0x0/0x50
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffff811a3dd6>] __wait_on_buffer+0x26/0x30
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffffa0922971>] jbd2_journal_commit_transaction+0x11c1/0x1530 [jbd2]
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffff810096e0>] ? __switch_to+0xd0/0x320
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffff8107a17b>] ? try_to_del_timer_sync+0x7b/0xe0
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffffa0927b48>] kjournald2+0xb8/0x220 [jbd2]
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffff8108e160>] ? autoremove_wake_function+0x0/0x40
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffffa0927a90>] ? kjournald2+0x0/0x220 [jbd2]
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffff8108ddf6>] kthread+0x96/0xa0
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffff8100c1ca>] child_rip+0xa/0x20
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffff8108dd60>] ? kthread+0x0/0xa0
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
Sep 13 03:13:00 fat-amd-2-ib kernel: INFO: task ll_ost_io_89:2736 blocked for more than 120 seconds.
Sep 13 03:13:00 fat-amd-2-ib kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 13 03:13:00 fat-amd-2-ib kernel: ll_ost_io_89 D 0000000000000004 0 2736 2 0x00000080
Sep 13 03:13:00 fat-amd-2-ib kernel: ffff88041158b790 0000000000000046 0000000000000000 ffff880217fa8aa0
Sep 13 03:13:00 fat-amd-2-ib kernel: ffff88041158b710 ffffffff81247e8f 0000000000000001 0000000100237e3a
Sep 13 03:13:00 fat-amd-2-ib kernel: ffff880411589a78 ffff88041158bfd8 000000000000f598 ffff880411589a78
Sep 13 03:13:00 fat-amd-2-ib kernel: Call Trace:
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffff81247e8f>] ? __generic_unplug_device+0x1f/0x40
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffff814dc995>] schedule_timeout+0x215/0x2e0
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffff814dc613>] wait_for_common+0x123/0x180
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffff8105dc60>] ? default_wake_function+0x0/0x20
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffff81249a5d>] ? submit_bio+0x8d/0x120
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffff814dc72d>] wait_for_completion+0x1d/0x20
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffff8124cedd>] __blkdev_issue_flush+0xad/0xe0
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffff8124cf26>] blkdev_issue_flush+0x16/0x20
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffffa094a8ab>] ldiskfs_sync_file+0x17b/0x250 [ldiskfs]
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffffa0a1c8f5>] filter_sync+0x285/0x3e0 [obdfilter]
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffffa05b56d6>] ? _ldlm_lock_debug+0x446/0x680 [ptlrpc]
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffffa046998c>] ? lprocfs_counter_add+0x12c/0x196 [lvfs]
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffffa09d6e0c>] ost_blocking_ast+0x58c/0xe40 [ost]
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffffa05b8b73>] ldlm_cancel_callback+0x63/0xf0 [ptlrpc]
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffffa05b8c59>] ldlm_lock_cancel+0x59/0x190 [ptlrpc]
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffffa05dae36>] ldlm_request_cancel+0x256/0x420 [ptlrpc]
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffffa09e0720>] ost_handle+0x3d60/0x4b90 [ost]
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffffa04066b1>] ? libcfs_debug_vmsg2+0x4d1/0xb50 [libcfs]
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffffa05fee94>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffffa06015fc>] ? lustre_msg_get_status+0x3c/0xa0 [ptlrpc]
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffffa060fb4e>] ptlrpc_main+0xb8e/0x1900 [ptlrpc]
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffffa060efc0>] ? ptlrpc_main+0x0/0x1900 [ptlrpc]
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffff8100c1ca>] child_rip+0xa/0x20
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffffa060efc0>] ? ptlrpc_main+0x0/0x1900 [ptlrpc]
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffffa060efc0>] ? ptlrpc_main+0x0/0x1900 [ptlrpc]
Sep 13 03:13:00 fat-amd-2-ib kernel: [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
<~snip~>

Maloo report: https://maloo.whamcloud.com/test_sets/64ba1fae-ddfd-11e0-9909-52540025f9af
Please find /scratch/logs/2.1.0/recovery-oss-scale.1315909046.log.tar.bz2 on node brent.whamcloud.com for more logs. |
| Comment by Jian Yu [ 19/Sep/11 ] |
|
Lustre Tag: v2_1_0_0_RC2
After running about 7 hours (the MDS failed over 29 times), the recovery-mds-scale test failed with the same issue:
/tmp/recovery-mds-scale.log_run_dd.sh-client-12-ib.debug:
2011-09-19 10:16:49: dd succeeded
+ cd /tmp
+ rm -rf /mnt/lustre/d0.dd-client-12-ib
++ date '+%F %H:%M:%S'
+ echoerr '2011-09-19 10:16:51: dd run finished'
+ echo '2011-09-19 10:16:51: dd run finished'
2011-09-19 10:16:51: dd run finished
+ '[' '!' -e /home/yujian/test_logs/end_run_file ']'
+ true
++ date '+%F %H:%M:%S'
+ echoerr '2011-09-19 10:16:51: dd run starting'
+ echo '2011-09-19 10:16:51: dd run starting'
2011-09-19 10:16:51: dd run starting
+ mkdir -p /mnt/lustre/d0.dd-client-12-ib
+ cd /mnt/lustre/d0.dd-client-12-ib
+ load_pid=17460
+ wait 17460
+ dd bs=4k count=1000000 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-client-12-ib/dd-file
dd: writing `/mnt/lustre/d0.dd-client-12-ib/dd-file': No space left on device
589960+0 records in
589959+0 records out
+ '[' 1 -eq 0 ']'
++ date '+%F %H:%M:%S'
+ echoerr '2011-09-19 10:24:03: dd failed'
+ echo '2011-09-19 10:24:03: dd failed'
2011-09-19 10:24:03: dd failed

Syslog on the client node client-12-ib:
Sep 19 10:17:48 client-12 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.4.134@o2ib. The ost_write operation failed with -28
Sep 19 10:17:48 client-12 kernel: LustreError: 17460:0:(vvp_io.c:1001:vvp_io_commit_write()) Write page 589959 of inode ffff88029f2b7cb8 failed -28

Please refer to the attached recovery-mds-scale.1316453521.log.tar.bz2 for more logs. |
| Comment by Jian Yu [ 13/Feb/12 ] |
|
Lustre Tag: v2_1_1_0_RC2 The same issue occurred: https://maloo.whamcloud.com/test_sets/8384ecca-55b2-11e1-9aa8-5254004bbbd3 |
| Comment by Jian Yu [ 13/Jun/12 ] |
|
Another instance: https://maloo.whamcloud.com/test_sets/92648b78-b3d7-11e1-8808-52540035b04c |
| Comment by Jian Yu [ 13/Jun/12 ] |
|
Another instance: https://maloo.whamcloud.com/test_sets/f09a5f58-b3bc-11e1-a2dd-52540035b04c |
| Comment by Peter Jones [ 18/Jul/12 ] |
|
Hongchao, could you please look into this one? Thanks. Peter |
| Comment by Hongchao Zhang [ 23/Jul/12 ] |
|
There was no obvious error in the logs, and obd_osfs was updated by fsfilt_statfs before the -ENOSPC was returned. The OST could actually have been full at the time if the loads running on the different clients were all using the same OST. |
| Comment by Hongchao Zhang [ 23/Jul/12 ] |
|
A possible patch is tracked at http://review.whamcloud.com/#change,3446 |
| Comment by Hongchao Zhang [ 26/Jul/12 ] |
|
This issue occurs after the client receives OBD_FL_NO_USRQUOTA or OBD_FL_NO_GRPQUOTA. Was the recovery-mds-scale test run by a non-root user? |
| Comment by Jian Yu [ 01/Aug/12 ] |
The test was run by root user. |
| Comment by Peter Jones [ 09/Aug/12 ] |
|
The proposed patch has been landed for 2.3. Closing this ticket for now but we should reopen it if it transpires that the issue fixed was not the one reported and this problem still occurs on the next tag. |
| Comment by Alex Zhuravlev [ 16/Aug/12 ] |
|
Just got this on master:
Lustre: DEBUG MARKER: == sanity test 49: Change max_pages_per_rpc won't break osc extent =================================== 07:44:43 (1345088683)
filter_io.c:786: CDEBUG(D_INODE, "retry after commit pending journals"); |
| Comment by Hongchao Zhang [ 16/Aug/12 ] |
|
there is a bug in the merged patch |
| Comment by Hongchao Zhang [ 16/Aug/12 ] |
|
Thanks, bzzz! Sorry, there is indeed a bug in the merged patch; a fix will be pushed soon. |
| Comment by Hongchao Zhang [ 16/Aug/12 ] |
|
the fix patch is tracked at http://review.whamcloud.com/#change,3692 |
| Comment by Jian Yu [ 20/Aug/12 ] |
|
Lustre Branch: b2_1 Another failure instance: https://maloo.whamcloud.com/test_sets/6533657c-ea21-11e1-b794-52540035b04c |
| Comment by Peter Jones [ 23/Aug/12 ] |
|
Landed for 2.3 and 2.4 |
| Comment by Jian Yu [ 05/Sep/12 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/12 The issue still exists: |
| Comment by Hongchao Zhang [ 06/Sep/12 ] |
|
Hi YuJian, |
| Comment by Jian Yu [ 06/Sep/12 ] |
Hi Chris, could you please set the following PTLDEBUG value in autotest_config.sh to run the hard failover tests?
PTLDEBUG="vfstrace rpctrace dlmtrace neterror ha config ioctl super quota"
I could not reproduce this issue in manual test runs. |
| Comment by Chris Gearing (Inactive) [ 06/Sep/12 ] |
|
Added to b2_3 failover tests |
| Comment by Jian Yu [ 09/Sep/12 ] |
Thanks Chris. Hi Hongchao, |
| Comment by Hongchao Zhang [ 11/Sep/12 ] |
|
the autotest has been queued to verify whether this issue is fixed or not |
| Comment by Jian Yu [ 13/Sep/12 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/17 https://maloo.whamcloud.com/test_sets/22a91122-fd70-11e1-afe5-52540035b04c The issue still occurred. |
| Comment by Hongchao Zhang [ 13/Sep/12 ] |
|
the debug patch is at http://review.whamcloud.com/#change,3979 Hi Chris, |
| Comment by Chris Gearing (Inactive) [ 13/Sep/12 ] |
|
You are able to do this using the Test-Requirements described on the wiki. You will need something like this:
Test-Requirements: fortestonly envdefinitions="SLOW=yes" clientcount=4 osscount=2 mdscount=2 austeroptions=-R failover=true useiscsi=true testgroup=failover
I should probably introduce a class that allows something like that to be done more easily; this example is on the wiki page. |
| Comment by Jian Yu [ 13/Sep/12 ] |
|
Hi Chris and Hongchao, |
| Comment by Jian Yu [ 15/Sep/12 ] |
|
Hi Hongchao, |
| Comment by Jian Yu [ 16/Sep/12 ] |
Chris, could you please take a look at the above change? The recovery-mds-scale test was somehow not run. |
| Comment by Chris Gearing (Inactive) [ 17/Sep/12 ] |
|
There is a bug in autotest, which I have tried to fix on the fly; you might want to try it again. The bug was that austeroptions was defined as nil and so had no type, meaning the parser did not know what to do with the option. I fixed this by defaulting austeroptions to "-v", so it now has a type. |
| Comment by Chris Gearing (Inactive) [ 17/Sep/12 ] |
|
I've restarted this, but don't expect it to work: I've not actually been successful with the long set of Test-Requirements. In fact, it's possible I've broken stuff by trying to get it to work. |
| Comment by Hongchao Zhang [ 17/Sep/12 ] |
|
The recovery-mds-scale test ran. Thanks, Chris and Yujian! From the output, the OST really did run out of space. Hi Chris, what is the disk size of "/dev/dm-1" on fat-intel-3vm4 (the OSS node)? Is it 2G? (I just checked it, but it started to |
| Comment by Chris Gearing (Inactive) [ 20/Sep/12 ] |
|
We really need to make this info easier to access; we have a ticket about that. I went and found the original logs, and the values are:
MDSSIZE=2097152
I don't specify the stripe count. The config was this:
09:00:57:cat /root/autotest_config.sh |
| Comment by Hongchao Zhang [ 21/Sep/12 ] |
|
This explains the issue: the default stripe count is 1 and OSTSIZE is 4G, and the -ENOSPC occurred right around the 3.5G mark. So this is the normal case, and it can be fixed by modifying recovery-mds-scale to increase the default stripe count or enlarge the OST size. |
| Comment by Hongchao Zhang [ 21/Sep/12 ] |
|
the patch is tracked at http://review.whamcloud.com/#change,4064 |
| Comment by Jian Yu [ 21/Sep/12 ] |
I just checked the run_dd.sh script and found that it always creates/deletes a file of 4000000k:
dd bs=4k count=1000000 status=noxfer if=/dev/zero of=$TESTDIR/dd-file 1>$LOG &
An OSTSIZE of 4089446k is not enough for a 4000000k file with stripe count 0 (i.e. the filesystem default of a single stripe). I just updated http://review.whamcloud.com/#change,4002 with "ostsizegb=6" to see whether the recovery-mds-scale test will hit the out-of-space error or not. |
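The space math above can be sanity-checked with a few lines of shell. This is a hypothetical pre-flight check, not part of the test suite; with a single stripe the whole dd file must fit on one OST, and the ~10% headroom figure is an assumption to account for filesystem metadata and the journal:

```shell
#!/bin/sh
# Hypothetical pre-flight check for the dd load: with one stripe, the
# whole file lands on a single OST, so that OST must hold it all.
DD_BS_KB=4
DD_COUNT=1000000               # run_dd.sh writes bs=4k count=1000000
OSTSIZE_KB=4089446             # OST size reported in this run

need_kb=$((DD_BS_KB * DD_COUNT))        # 4000000k required
limit_kb=$((OSTSIZE_KB * 9 / 10))       # keep ~10% for metadata/journal
echo "dd needs ${need_kb}k, usable OST space is about ${limit_kb}k"
if [ "$need_kb" -ge "$limit_kb" ]; then
    echo "ENOSPC likely: the workload nearly fills one OST"
fi
```

With the numbers from this ticket the check fires (4000000k needed vs roughly 3680501k usable), which matches the conclusion that the ENOSPC here was simply a too-small OST rather than a recovery bug.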
| Comment by Peter Jones [ 22/Sep/12 ] |
|
Dropping priority as this is a test only issue. Once the test correction is finalized we should land it to master and we may also land it to b2_3 if we have another RC |
| Comment by Jian Yu [ 22/Sep/12 ] |
Hi Chris, |
| Comment by Jian Yu [ 25/Sep/12 ] |
|
Lustre Version: v2_3_0_RC1 recovery-mds-scale test failed with out of space issue again: |
| Comment by Hongchao Zhang [ 25/Sep/12 ] |
|
What is the OSTSIZE for this test? The dd stopped at about 1.7G. |
| Comment by Jian Yu [ 25/Sep/12 ] |
From the above Maloo report, we can enter "go to session" and find the autotest configuration info in lustre-initialization-1:
export OSTSIZE=2097152 |
| Comment by Hongchao Zhang [ 25/Sep/12 ] |
|
Oh, thanks! OSTSIZE is 2G, so it is the same situation: insufficient disk space on the OST. |
| Comment by Jian Yu [ 06/Nov/12 ] |
|
Marking this ticket as a blocker since it is blocking the recovery-*-scale tests. |
| Comment by Hongchao Zhang [ 06/Nov/12 ] |
|
How about adapting the data size used by 'dd' to the actual free space in Lustre, then? I will try to create a patch for it. |
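That approach could look something like the sketch below. It is an illustration only, not the actual patch: the variable names and the 20% headroom are assumptions. The idea is to take the free space of the least-free OST from `lfs df` (with a single stripe, that OST is the binding constraint) and derive the dd block count from it:

```shell
#!/bin/sh
# Sketch: size the dd load from the actual free space, as suggested above.
# Assumption: with the default stripe count of 1, the file must fit on the
# OST with the least free space; keep 20% headroom for grants/metadata.
MOUNT=${MOUNT:-/mnt/lustre}
BLK_KB=4

# Smallest per-OST "Available" value, in KB (lfs df reports 1K blocks;
# OST rows match /OST/, while the header, MDT row and summary do not).
free_kb=$(lfs df "$MOUNT" | awk '/OST/ { if (min == "" || $4 < min) min = $4 } END { print min }')
count=$(( free_kb * 8 / 10 / BLK_KB ))

echo "writing ${count} blocks of ${BLK_KB}k (smallest OST free: ${free_kb}k)"
dd bs=${BLK_KB}k count="$count" status=noxfer if=/dev/zero of="$MOUNT/dd-file"
```

This sizes each pass to what the filesystem can actually hold, so the load keeps exercising failover instead of dying on an expected ENOSPC.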
| Comment by Hongchao Zhang [ 12/Nov/12 ] |
|
Status update: this should be a TT ticket. The patch is under development and test; I will attach it soon. |
| Comment by Hongchao Zhang [ 16/Nov/12 ] |
|
the patch is tracked at http://review.whamcloud.com/#change,4599 |
| Comment by Hongchao Zhang [ 10/Dec/12 ] |
|
While testing the patch in a loop (with small OSTs), I found it will fail with -ENOSPC after several loops, and it seems to be related to grant... By the way, that is not the same issue as this ticket. |
| Comment by Hongchao Zhang [ 16/Dec/12 ] |
|
There is a bug in osd_trans_stop: it uses thandle->th_sync after "dt_txn_hook_stop", which could change the value.
int ofd_txn_stop_cb(const struct lu_env *env, struct thandle *txn,
/* if can't add callback, do sync write */
return ofd_last_rcvd_update(info, txn);
Then the transactions won't be committed, and that caused the issue (the client can't use just-deleted disk space, |
| Comment by Mikhail Pershin [ 17/Dec/12 ] |
|
The transaction will be committed in any case; or did you mean "won't be forced to commit"? |
| Comment by Hongchao Zhang [ 20/Dec/12 ] |
|
Yes, thanks, the patch has been updated accordingly. |
| Comment by Jodi Levi (Inactive) [ 15/Jan/13 ] |
|
Patch landed. Closing ticket. Please reopen if more work is needed. |