[LU-11791] recovery-mds-scale test failover_ost fails with 'test_failover_ost returned 1' Created: 17/Dec/18  Updated: 30/Jan/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.10.7, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.4, Lustre 2.15.3
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Alex Deiter
Resolution: Unresolved Votes: 0
Labels: failover

Issue Links:
Related
is related to LU-12224 recovery-mds-scale test failover_mds ... Open
is related to LU-12858 recovery-mds-scale test failover_ost ... Open
is related to LU-5158 Failure on test suite recovery-mds-sc... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

recovery-mds-scale test_failover_ost fails with 'test_failover_ost returned 1'

Looking at the client test_log from https://testing.whamcloud.com/test_sets/e36f9e0c-fea5-11e8-b837-52540065bddc, we see that there were several successful OST failovers and one client load failure; the harness's failure-detection flow is sketched after the excerpt below.

Found the END_RUN_FILE file: /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/shared_dir/end_run_file
trevis-25vm8.trevis.whamcloud.com
Client load  failed on node trevis-25vm8.trevis.whamcloud.com:
/autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/recovery-mds-scale.test_failover_ost.run__stdout.trevis-25vm8.trevis.whamcloud.com.log
/autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/recovery-mds-scale.test_failover_ost.run__debug.trevis-25vm8.trevis.whamcloud.com.log
2018-12-11 23:22:47 Terminating clients loads ...
Duration:               86400
Server failover period: 1200 seconds
Exited after:           21768 seconds
Number of failovers before exit:
mds1: 0 times
ost1: 3 times
ost2: 1 times
ost3: 6 times
ost4: 1 times
ost5: 6 times
ost6: 0 times
ost7: 2 times
Status: FAIL: rc=1
CMD: trevis-25vm7,trevis-25vm8 test -f /tmp/client-load.pid &&
        { kill -s TERM \$(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; }
trevis-25vm8: sh: line 1: kill: (11606) - No such process
trevis-25vm7: sh: line 1: kill: (18301) - No such process
Dumping lctl log to /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/recovery-mds-scale.test_failover_ost.*.1544570568.log
CMD: trevis-25vm10,trevis-25vm11,trevis-25vm12,trevis-25vm8.trevis.whamcloud.com,trevis-25vm9 /usr/sbin/lctl dk > /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/recovery-mds-scale.test_failover_ost.debug_log.\$(hostname -s).1544570568.log;
         dmesg > /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/recovery-mds-scale.test_failover_ost.dmesg.\$(hostname -s).1544570568.log
trevis-25vm9: invalid parameter 'dump_kernel'
trevis-25vm9: open(dump_kernel) failed: No such file or directory
trevis-25vm12: invalid parameter 'dump_kernel'
trevis-25vm12: open(dump_kernel) failed: No such file or directory
test_failover_ost returned 1
FAIL failover_ost (22821s)
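
For context, the failure-detection flow implied by this excerpt works roughly as follows: each client-load script writes its hostname into the shared END_RUN_FILE when it gives up, and the main failover loop stops the run as soon as that file appears, then kills the remaining loads via their pid files. The bash sketch below is only an illustration of that flow, not the actual recovery-mds-scale.sh / test-framework.sh source; names such as fail_random_ost, terminate_client_loads, DURATION and SERVERS_FAILOVER_PERIOD are assumptions modelled on the log output, and do_nodes is assumed to behave like the test-framework helper of the same name.

# Hedged sketch of the harness loop implied by the log above (not the real source).
END_RUN_FILE=$SHARED_DIR/end_run_file      # shared across all client nodes
LOAD_PID_FILE=/tmp/client-load.pid

failover_loop() {
    local elapsed=0
    while [ "$elapsed" -lt "$DURATION" ]; do
        fail_random_ost                    # placeholder for the OST failover step
        sleep "$SERVERS_FAILOVER_PERIOD"
        elapsed=$((elapsed + SERVERS_FAILOVER_PERIOD))

        # A client load that died has written its hostname into END_RUN_FILE,
        # which is what "Found the END_RUN_FILE file: ..." reports above.
        if [ -f "$END_RUN_FILE" ]; then
            echo "Client load failed on node $(cat "$END_RUN_FILE")"
            terminate_client_loads
            return 1
        fi
    done
    return 0
}

terminate_client_loads() {
    # Mirrors the CMD in the log: kill each load via its pid file on every client.
    # "kill: (...) - No such process" just means the load had already exited.
    do_nodes "$CLIENTS" "test -f $LOAD_PID_FILE &&
        { kill -s TERM \$(cat $LOAD_PID_FILE); rm -f $LOAD_PID_FILE; }"
}

With that in mind, the interesting question is why the load on trevis-25vm8 wrote to END_RUN_FILE in the first place, which is what the tar logs below show.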

Looking at the logs from Client 3 (vm8), we can see issues with tar. From the run_tar_debug log, we see the client load exit with a non-zero return code (RC=2); its error-handling logic is reconstructed in the sketch after this trace.

2018-12-11 22:58:05: tar run starting
+ mkdir -p /mnt/lustre/d0.tar-trevis-25vm8.trevis.whamcloud.com
+ cd /mnt/lustre/d0.tar-trevis-25vm8.trevis.whamcloud.com
+ sync
++ du -s /etc
++ awk '{print $1}'
+ USAGE=34864
+ /usr/sbin/lctl set_param 'llite.*.lazystatfs=0'
+ df /mnt/lustre/d0.tar-trevis-25vm8.trevis.whamcloud.com
+ sleep 2
++ df /mnt/lustre/d0.tar-trevis-25vm8.trevis.whamcloud.com
++ awk '/:/ { print $4 }'
+ FREE_SPACE=9359360
+ AVAIL=4211712
+ '[' 4211712 -lt 34864 ']'
+ do_tar
+ tar cf - /etc
+ tar xf -
tar: Removing leading `/' from member names
+ return 2
+ RC=2
++ grep 'exit delayed from previous errors' /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/recovery-mds-scale.test_failover_ost.run_tar_stdout.trevis-25vm8.log
+ PREV_ERRORS=
+ true
+ '[' 2 -ne 0 -a '' -a '' ']'
+ '[' 2 -eq 0 ']'
++ date '+%F %H:%M:%S'
+ echoerr '2018-12-11 23:17:05: tar failed'
+ echo '2018-12-11 23:17:05: tar failed'
2018-12-11 23:17:05: tar failed
+ '[' -z '' ']'
++ hostname
+ echo trevis-25vm8.trevis.whamcloud.com
+ '[' ']'
+ '[' '!' -e /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/shared_dir/end_run_file ']'
++ date '+%F %H:%M:%S'
+ echoerr '2018-12-11 23:17:05: tar run exiting'
+ echo '2018-12-11 23:17:05: tar run exiting'
2018-12-11 23:17:05: tar run exiting
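
Piecing the trace together, the load's error handling looks roughly like the sketch below. This is a reconstruction, not the actual run_tar.sh source: the log location, the ERRORS_OK name for the second (empty) variable in the [ 2 -ne 0 -a '' -a '' ] test, and the exact loop structure are assumptions; only RC, PREV_ERRORS, do_tar and the END_RUN_FILE signalling are taken directly from the trace.

# Hedged reconstruction of the client load's error handling (not the real run_tar.sh).
END_RUN_FILE=$SHARED_DIR/end_run_file                          # same shared file as above
TESTDIR=/mnt/lustre/d0.tar-$(hostname)
STDOUT_LOG=$TESTLOG_PREFIX.run_tar_stdout.$(hostname -s).log   # assumed log location

do_tar() {
    # Stream a copy of /etc into the Lustre test directory; the exit status of
    # the extracting tar is what ends up in RC (2 in the trace above).
    tar cf - /etc | tar xf - >>"$STDOUT_LOG" 2>&1
    return ${PIPESTATUS[1]}
}

mkdir -p "$TESTDIR" && cd "$TESTDIR" || exit 1

while [ ! -e "$END_RUN_FILE" ]; do
    # The real load also checks available space with df before each pass
    # (the FREE_SPACE/AVAIL lines in the trace); omitted here for brevity.
    do_tar
    RC=$?

    # "exit delayed from previous errors" is GNU tar's wording when a run only
    # hit warnings (e.g. "Removing leading `/'"); such runs are tolerated.
    PREV_ERRORS=$(grep 'exit delayed from previous errors' "$STDOUT_LOG")

    if [ $RC -ne 0 -a "$ERRORS_OK" -a "$PREV_ERRORS" ]; then
        echo "$(date '+%F %H:%M:%S'): tar errors tolerated (warnings only)"
    elif [ $RC -eq 0 ]; then
        echo "$(date '+%F %H:%M:%S'): tar succeeded"
        rm -rf "$TESTDIR"/*
    else
        # The branch taken here: RC=2 and the stdout log contains "Exiting with
        # failure status due to previous errors" instead of the tolerated message,
        # so the load signals failure by writing its hostname to END_RUN_FILE.
        echo "$(date '+%F %H:%M:%S'): tar failed"
        [ -e "$END_RUN_FILE" ] || hostname >> "$END_RUN_FILE"
        break
    fi
done
echo "$(date '+%F %H:%M:%S'): tar run exiting"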

From the run_tar_stdout log, we see the write errors (I/O errors) that caused tar to exit with a failure status

tar: etc/mke2fs.conf: Cannot write: Input/output error
tar: Exiting with failure status due to previous errors


 Comments   
Comment by James Nunez (Inactive) [ 13/Mar/19 ]

I see a similar failure for 2.10.7 RC1 failover testing with logs at https://testing.whamcloud.com/test_sets/fefc3968-43fc-11e9-9720-52540065bddc .

The errors in the client (vm4) run_tar_stdout are

tar: etc/yum.repos.d/lustre-e2fsprogs.repo: Cannot close: Input/output error
tar: etc/rsyncd.conf: Cannot write: No such file or directory
tar: Exiting with failure status due to previous errors
Comment by James Nunez (Inactive) [ 04/Feb/20 ]

We see similar failures with 2.12.4 RHEL8 client failover testing at https://testing.whamcloud.com/test_sets/673f11d2-4378-11ea-bffa-52540065bddc with the following in run_tar_stdout

tar: etc/lvm/profile/cache-smq.profile: Cannot utime: Input/output error
tar: etc/lvm/profile/cache-smq.profile: Cannot close: Input/output error
tar: Exiting with failure status due to previous errors