Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Labels: None
- Affects Versions: Lustre 2.10.7, Lustre 2.12.0, Lustre 2.12.3, Lustre 2.12.4, Lustre 2.14.0, Lustre 2.15.3, Lustre 2.15.6, Lustre 2.15.7, Lustre 2.17.0
- Severity: 3
Description
recovery-mds-scale test_failover_ost fails with 'test_failover_ost returned 1'
Looking at the client test_log from https://testing.whamcloud.com/test_sets/e36f9e0c-fea5-11e8-b837-52540065bddc , we see that there were several successful OST failovers followed by one failure:
Found the END_RUN_FILE file: /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/shared_dir/end_run_file
trevis-25vm8.trevis.whamcloud.com
Client load failed on node trevis-25vm8.trevis.whamcloud.com:
/autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/recovery-mds-scale.test_failover_ost.run__stdout.trevis-25vm8.trevis.whamcloud.com.log
/autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/recovery-mds-scale.test_failover_ost.run__debug.trevis-25vm8.trevis.whamcloud.com.log
2018-12-11 23:22:47 Terminating clients loads ...
Duration: 86400
Server failover period: 1200 seconds
Exited after: 21768 seconds
Number of failovers before exit:
mds1: 0 times
ost1: 3 times
ost2: 1 times
ost3: 6 times
ost4: 1 times
ost5: 6 times
ost6: 0 times
ost7: 2 times
Status: FAIL: rc=1
CMD: trevis-25vm7,trevis-25vm8 test -f /tmp/client-load.pid &&
{ kill -s TERM \$(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; }
trevis-25vm8: sh: line 1: kill: (11606) - No such process
trevis-25vm7: sh: line 1: kill: (18301) - No such process
Dumping lctl log to /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/recovery-mds-scale.test_failover_ost.*.1544570568.log
CMD: trevis-25vm10,trevis-25vm11,trevis-25vm12,trevis-25vm8.trevis.whamcloud.com,trevis-25vm9 /usr/sbin/lctl dk > /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/recovery-mds-scale.test_failover_ost.debug_log.\$(hostname -s).1544570568.log;
dmesg > /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/recovery-mds-scale.test_failover_ost.dmesg.\$(hostname -s).1544570568.log
trevis-25vm9: invalid parameter 'dump_kernel'
trevis-25vm9: open(dump_kernel) failed: No such file or directory
trevis-25vm12: invalid parameter 'dump_kernel'
trevis-25vm12: open(dump_kernel) failed: No such file or directory
test_failover_ost returned 1
FAIL failover_ost (22821s)
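The "kill: No such process" messages from vm7 and vm8 above are expected when a client load has already exited on its own: the CMD line kills whatever PID is recorded in /tmp/client-load.pid, but by then the load script is gone. A minimal sketch of that termination step, assuming the same PID-file convention as the CMD line (`stop_client_load` is a hypothetical helper name, not the actual test-framework function):

```shell
# Sketch of the harness's client-load termination step.
# stop_client_load is a hypothetical name for illustration only.
stop_client_load() {
    pidfile=${1:-/tmp/client-load.pid}
    test -f "$pidfile" || return 0      # no load running: nothing to do
    pid=$(cat "$pidfile")
    # If the load script already exited on its own (as vm7/vm8 did after
    # "tar failed"), the PID no longer exists and kill reports
    # "No such process" -- the harmless message seen in the log above.
    kill -s TERM "$pid" 2>/dev/null || echo "process $pid already gone"
    rm -f "$pidfile"
}
```

So the kill errors are a symptom, not the cause: the loads on vm7/vm8 had already terminated before the harness tried to stop them.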
Looking at the logs from Client 3 (vm8), we can see some issues with tar. From the run_tar_debug log, we see the tar client load return a nonzero exit code:
2018-12-11 22:58:05: tar run starting
+ mkdir -p /mnt/lustre/d0.tar-trevis-25vm8.trevis.whamcloud.com
+ cd /mnt/lustre/d0.tar-trevis-25vm8.trevis.whamcloud.com
+ sync
++ du -s /etc
++ awk '{print $1}'
+ USAGE=34864
+ /usr/sbin/lctl set_param 'llite.*.lazystatfs=0'
+ df /mnt/lustre/d0.tar-trevis-25vm8.trevis.whamcloud.com
+ sleep 2
++ df /mnt/lustre/d0.tar-trevis-25vm8.trevis.whamcloud.com
++ awk '/:/ { print $4 }'
+ FREE_SPACE=9359360
+ AVAIL=4211712
+ '[' 4211712 -lt 34864 ']'
+ do_tar
+ tar cf - /etc
+ tar xf -
tar: Removing leading `/' from member names
+ return 2
+ RC=2
++ grep 'exit delayed from previous errors' /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/recovery-mds-scale.test_failover_ost.run_tar_stdout.trevis-25vm8.log
+ PREV_ERRORS=
+ true
+ '[' 2 -ne 0 -a '' -a '' ']'
+ '[' 2 -eq 0 ']'
++ date '+%F %H:%M:%S'
+ echoerr '2018-12-11 23:17:05: tar failed'
+ echo '2018-12-11 23:17:05: tar failed'
2018-12-11 23:17:05: tar failed
+ '[' -z '' ']'
++ hostname
+ echo trevis-25vm8.trevis.whamcloud.com
+ '[' ']'
+ '[' '!' -e /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/shared_dir/end_run_file ']'
++ date '+%F %H:%M:%S'
+ echoerr '2018-12-11 23:17:05: tar run exiting'
+ echo '2018-12-11 23:17:05: tar run exiting'
2018-12-11 23:17:05: tar run exiting
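The `+x` trace above follows a loop of roughly this shape. This is a paraphrase reconstructed from the trace, not the actual run_tar.sh script: `do_tar` and `run_iteration` are illustrative names, and the free-space check seen before `do_tar` in the trace is omitted:

```shell
# Paraphrase of one run_tar iteration, reconstructed from the trace above.
do_tar() {
    # Stream a copy of /etc through tar into the Lustre test directory;
    # the exit status of the extracting tar becomes the iteration's RC.
    tar cf - /etc | tar xf -
}

run_iteration() {
    LOG=$1                      # path to the run_tar_stdout log
    do_tar
    RC=$?
    # tar exits 2 on fatal errors.  The harness tolerates the soft
    # "exit delayed from previous errors" case if that message already
    # appears in the stdout log; in the trace both extra test operands
    # were empty, so RC=2 was treated as a hard failure.
    PREV_ERRORS=$(grep 'exit delayed from previous errors' "$LOG")
    if [ "$RC" -ne 0 ] && [ -n "$PREV_ERRORS" ]; then
        RC=0
    fi
    if [ "$RC" -ne 0 ]; then
        echo "$(date '+%F %H:%M:%S'): tar failed"
    fi
    return "$RC"
}
```

This matches the trace: `+ return 2` from the tar pipeline, an empty `PREV_ERRORS`, the failed three-operand test, and then the "tar failed" message that ends the load.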
From the run_tar_stdout log, we see the write errors behind that exit code:
tar: etc/mke2fs.conf: Cannot write: Input/output error
tar: Exiting with failure status due to previous errors
Attachments
Issue Links
- is related to
  - LU-5158 Failure on test suite recovery-mds-scale test_failover_ost (Resolved)
  - LU-12224 recovery-mds-scale test failover_mds fails with 'test_failover_mds returned 1' (Open)
  - LU-12858 recovery-mds-scale test failover_ost fails due to dd failure “dd: closing output file ‘/mnt/lustre/*/dd-file’: Input/output error” (Open)