Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 2.10.7, Lustre 2.12.0, Lustre 2.12.3, Lustre 2.12.4, Lustre 2.14.0, Lustre 2.15.3, Lustre 2.15.6
- Severity: 3
Description
recovery-mds-scale test_failover_ost fails with 'test_failover_ost returned 1'
Looking at the client test_log from https://testing.whamcloud.com/test_sets/e36f9e0c-fea5-11e8-b837-52540065bddc , we see that there were several successful OST failovers with one failure:
Found the END_RUN_FILE file: /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/shared_dir/end_run_file
trevis-25vm8.trevis.whamcloud.com
Client load failed on node trevis-25vm8.trevis.whamcloud.com:
/autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/recovery-mds-scale.test_failover_ost.run__stdout.trevis-25vm8.trevis.whamcloud.com.log
/autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/recovery-mds-scale.test_failover_ost.run__debug.trevis-25vm8.trevis.whamcloud.com.log
2018-12-11 23:22:47 Terminating clients loads ...
Duration: 86400
Server failover period: 1200 seconds
Exited after: 21768 seconds
Number of failovers before exit:
mds1: 0 times
ost1: 3 times
ost2: 1 times
ost3: 6 times
ost4: 1 times
ost5: 6 times
ost6: 0 times
ost7: 2 times
Status: FAIL: rc=1
CMD: trevis-25vm7,trevis-25vm8 test -f /tmp/client-load.pid && { kill -s TERM \$(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; }
trevis-25vm8: sh: line 1: kill: (11606) - No such process
trevis-25vm7: sh: line 1: kill: (18301) - No such process
Dumping lctl log to /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/recovery-mds-scale.test_failover_ost.*.1544570568.log
CMD: trevis-25vm10,trevis-25vm11,trevis-25vm12,trevis-25vm8.trevis.whamcloud.com,trevis-25vm9 /usr/sbin/lctl dk > /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/recovery-mds-scale.test_failover_ost.debug_log.\$(hostname -s).1544570568.log;
dmesg > /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/recovery-mds-scale.test_failover_ost.dmesg.\$(hostname -s).1544570568.log
trevis-25vm9: invalid parameter 'dump_kernel'
trevis-25vm9: open(dump_kernel) failed: No such file or directory
trevis-25vm12: invalid parameter 'dump_kernel'
trevis-25vm12: open(dump_kernel) failed: No such file or directory
test_failover_ost returned 1
FAIL failover_ost (22821s)
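For context on the mechanism: the failover driver keeps cycling server failovers until the configured duration expires or a failed client load drops its hostname into the shared END_RUN_FILE, which is what terminated this run early (21768 of 86400 seconds). Below is a minimal sketch of that end-run check, with variable names taken from the log above; it is a hypothetical reconstruction for illustration, not the actual recovery-mds-scale code.

    #!/bin/bash
    # Hypothetical sketch of the failover driver's end-run check; this is
    # not the real test-framework loop, only its visible behavior.
    SHARED_DIRECTORY=${SHARED_DIRECTORY:-/tmp/shared}
    END_RUN_FILE=${END_RUN_FILE:-$SHARED_DIRECTORY/end_run_file}
    FAILOVER_PERIOD=${FAILOVER_PERIOD:-1200}   # "Server failover period: 1200 seconds"
    DURATION=${DURATION:-86400}                # "Duration: 86400"

    start=$SECONDS
    while (( SECONDS - start < DURATION )); do
        # A failed client load writes its hostname into END_RUN_FILE;
        # seeing the file tells the driver to stop and report FAIL (rc=1).
        if [ -e "$END_RUN_FILE" ]; then
            echo "Found the END_RUN_FILE file: $END_RUN_FILE"
            echo "Client load failed on node $(cat "$END_RUN_FILE")"
            exit 1
        fi
        # ... trigger the next MDS/OST failover here (elided) ...
        sleep "$FAILOVER_PERIOD"
    done
    exit 0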
Looking at the logs from Client 3 (vm8), we can see some issues with tar. From the run_tar_debug log, we see the client load exit with a non-zero return code:
2018-12-11 22:58:05: tar run starting
+ mkdir -p /mnt/lustre/d0.tar-trevis-25vm8.trevis.whamcloud.com
+ cd /mnt/lustre/d0.tar-trevis-25vm8.trevis.whamcloud.com
+ sync
++ du -s /etc
++ awk '{print $1}'
+ USAGE=34864
+ /usr/sbin/lctl set_param 'llite.*.lazystatfs=0'
+ df /mnt/lustre/d0.tar-trevis-25vm8.trevis.whamcloud.com
+ sleep 2
++ df /mnt/lustre/d0.tar-trevis-25vm8.trevis.whamcloud.com
++ awk '/:/ { print $4 }'
+ FREE_SPACE=9359360
+ AVAIL=4211712
+ '[' 4211712 -lt 34864 ']'
+ do_tar
+ tar cf - /etc
+ tar xf -
tar: Removing leading `/' from member names
+ return 2
+ RC=2
++ grep 'exit delayed from previous errors' /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/recovery-mds-scale.test_failover_ost.run_tar_stdout.trevis-25vm8.log
+ PREV_ERRORS=
+ true
+ '[' 2 -ne 0 -a '' -a '' ']'
+ '[' 2 -eq 0 ']'
++ date '+%F %H:%M:%S'
+ echoerr '2018-12-11 23:17:05: tar failed'
+ echo '2018-12-11 23:17:05: tar failed'
2018-12-11 23:17:05: tar failed
+ '[' -z '' ']'
++ hostname
+ echo trevis-25vm8.trevis.whamcloud.com
+ '[' ']'
+ '[' '!' -e /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/shared_dir/end_run_file ']'
++ date '+%F %H:%M:%S'
+ echoerr '2018-12-11 23:17:05: tar run exiting'
+ echo '2018-12-11 23:17:05: tar run exiting'
2018-12-11 23:17:05: tar run exiting
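Reading the trace: the load loops tar'ing /etc into the Lustre mount and only forgives a non-zero exit when errors are explicitly allowed and tar merely reported 'exit delayed from previous errors' (source files changed while being read). Here RC=2 with neither condition met, so the load declared failure. The following is a hedged reconstruction of that decision logic; the real script is run_tar.sh in the Lustre test suite, and this sketch simplifies its paths and helpers.

    #!/bin/bash
    # Simplified reconstruction of the run_tar client-load logic visible in
    # the trace above; paths and defaults here are placeholders.
    TESTDIR=${TESTDIR:-/mnt/lustre/d0.tar-$(hostname)}
    STDOUT_LOG=${STDOUT_LOG:-/tmp/run_tar_stdout.log}
    END_RUN_FILE=${END_RUN_FILE:-/tmp/shared/end_run_file}
    ERRORS_OK=${ERRORS_OK:-}    # empty in this run, per the trace

    do_tar() {
        # Stream /etc through a pipe into the test directory; tar's exit
        # status propagates, so an EIO during OST failover shows up as RC != 0.
        tar cf - /etc 2>/dev/null | tar xf - >"$STDOUT_LOG" 2>&1
    }

    mkdir -p "$TESTDIR" && cd "$TESTDIR" || exit 1
    # Run until the shared end-run file appears (another load failed) or
    # our own tar pass fails hard.
    while [ ! -e "$END_RUN_FILE" ]; do
        do_tar
        RC=$?
        PREV_ERRORS=$(grep 'exit delayed from previous errors' "$STDOUT_LOG")
        # Matches the trace's '[' 2 -ne 0 -a '' -a '' ']': only forgive a
        # non-zero RC when errors are allowed AND tar merely delayed its
        # exit because files changed under it.
        [ $RC -ne 0 ] && [ -n "$ERRORS_OK" ] && [ -n "$PREV_ERRORS" ] && RC=0
        if [ $RC -ne 0 ]; then
            echo "$(date '+%F %H:%M:%S'): tar failed" >&2
            hostname >"$END_RUN_FILE"   # signal the failover driver to stop
            exit $RC
        fi
    done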
From the run_tar_stdout log, we see the write errors behind that return code:
tar: etc/mke2fs.conf: Cannot write: Input/output error
tar: Exiting with failure status due to previous errors
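That 'Cannot write: Input/output error' is the fatal signature: the write got EIO, presumably while the client's OST was mid-failover, and tar exited with status 2, which the tolerance check above does not forgive. When triaging a batch of these logs, something along the following lines can separate the tolerated signature from the hard one (a hypothetical helper, not part of the test suite):

    # Hypothetical triage helper for run_tar_stdout logs; classify_tar_log
    # is illustrative only and does not exist in the Lustre test suite.
    classify_tar_log() {
        local log=$1
        if grep -q 'Cannot write: Input/output error' "$log"; then
            echo "hard failure: write got EIO (client lost an OST during failover?)"
        elif grep -q 'exit delayed from previous errors' "$log"; then
            echo "tolerated: source files changed while tar was reading them"
        else
            echo "no known tar error signature in $log"
        fi
    }

    # Example:
    # classify_tar_log recovery-mds-scale.test_failover_ost.run_tar_stdout.trevis-25vm8.log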
Issue Links
- is related to
  - LU-5158 Failure on test suite recovery-mds-scale test_failover_ost (Resolved)
- is related to
  - LU-12224 recovery-mds-scale test failover_mds fails with 'test_failover_mds returned 1' (Open)
  - LU-12858 recovery-mds-scale test failover_ost fails due to dd failure “dd: closing output file ‘/mnt/lustre/*/dd-file’: Input/output error” (Open)