[LU-10479] recovery-mds-scale test failover_mds fails with 'test_failover_mds returned 4' Created: 09/Jan/18 Updated: 14/Mar/19 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.1, Lustre 2.11.0, Lustre 2.10.2, Lustre 2.10.3, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.10.6, Lustre 2.10.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
recovery-mds-scale is failing in test_failover_mds. From the test_log on the client, we can see that the clients exited immediately and no failovers took place:

Client load failed on node trevis-10vm4, rc=1
2018-01-08 21:34:46 Terminating clients loads ...
Duration:               86400
Server failover period: 1200 seconds
Exited after:           0 seconds
Number of failovers before exit:
mds1: 0 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
ost5: 0 times
ost6: 0 times
ost7: 0 times
Status: FAIL: rc=4

The client loads/jobs are terminated because tar fails due to insufficient (possibly no) free space. From the run_tar debug log, strangely, the reported free space is blank:

++ du -s /etc
++ awk '{print $1}'
+ USAGE=30784
+ /usr/sbin/lctl set_param 'llite.*.lazystatfs=0'
+ df /mnt/lustre/d0.tar-trevis-10vm4.trevis.hpdd.intel.com
+ sleep 2
++ df /mnt/lustre/d0.tar-trevis-10vm4.trevis.hpdd.intel.com
++ awk '/:/ { print $4 }'
+ FREE_SPACE=
+ AVAIL=0
+ '[' 0 -lt 30784 ']'
+ echoerr 'no enough free disk space: need 30784, avail 0'
+ echo 'no enough free disk space: need 30784, avail 0'
no enough free disk space: need 30784, avail 0
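
For context, here is a minimal sketch of the check the trace above is performing (an approximation, not the actual run_tar.sh; TESTDIR and the empty-to-zero fallback for AVAIL are assumptions made for illustration). It shows how an empty df result turns into "avail 0":

# Approximate reconstruction of the free-space check traced above; not the
# actual run_tar.sh.  TESTDIR and the empty->0 fallback are assumptions.
TESTDIR=${TESTDIR:-/mnt/lustre/d0.tar-$(hostname)}

# Space tar needs: size of /etc in KB (USAGE=30784 in the trace).
USAGE=$(du -s /etc | awk '{print $1}')

# The trace parses df output for the line containing ':' (the
# "nid@tcp:/fsname" device of a Lustre mount) and takes field 4 (Avail).
FREE_SPACE=$(df $TESTDIR | awk '/:/ { print $4 }')

# If df prints nothing usable for that mount -- for example the statfs data
# is momentarily unavailable, or a long device name makes df wrap the output
# so the ':' line has no field 4 -- FREE_SPACE is empty and is treated as 0,
# which is exactly the "need 30784, avail 0" failure in the log.
AVAIL=${FREE_SPACE:-0}

if [ "$AVAIL" -lt "$USAGE" ]; then
    echo "no enough free disk space: need $USAGE, avail $AVAIL" >&2
    exit 1
fi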
There is nothing obviously wrong in the console and dmesg logs. So far, this failure has only been seen on b2_10, but across several builds. Here are the b2_10 build numbers and links to logs for some of the failures: |
| Comments |
| Comment by Sarah Liu [ 13/Mar/18 ] |
|
Also hit this on 2.11 (tag 2.10.59), SLES12SP3 failover testing: https://testing.hpdd.intel.com/test_sets/bc2c657a-26cc-11e8-b74b-52540065bddc |
| Comment by Sarah Liu [ 17/May/18 ] |
|
+1 on b2_10 https://testing.hpdd.intel.com/test_sets/b757638a-58e8-11e8-b303-52540065bddc |
| Comment by James Nunez (Inactive) [ 15/Aug/18 ] |
|
Note: we still see this issue since moving to RHEL 6.10: https://testing.whamcloud.com/test_sets/a6e0f10c-a081-11e8-8ee3-52540065bddc

Also, recovery-mds-scale may not be cleaning up after itself very well, since recovery-random-scale test fail_client_mds and recovery-double-scale test pairwise_fail fail quickly:

trevis-12vm4: 19732
trevis-12vm3: 20187
Found the END_RUN_FILE file: /autotest/trevis/2018-08-14/lustre-b2_10-el7-x86_64-vs-lustre-b2_10-el6_10-x86_64--failover--1_24_1__135___df1701ec-c51e-4131-bb8d-62fac9f7b291/shared_dir/end_run_file
trevis-12vm4.trevis.whamcloud.com
Client load failed on node trevis-12vm4.trevis.whamcloud.com:
/autotest/trevis/2018-08-14/lustre-b2_10-el7-x86_64-vs-lustre-b2_10-el6_10-x86_64--failover--1_24_1__135___df1701ec-c51e-4131-bb8d-62fac9f7b291/recovery-random-scale.test_fail_client_mds.run__stdout.trevis-12vm4.trevis.whamcloud.com.log
/autotest/trevis/2018-08-14/lustre-b2_10-el7-x86_64-vs-lustre-b2_10-el6_10-x86_64--failover--1_24_1__135___df1701ec-c51e-4131-bb8d-62fac9f7b291/recovery-random-scale.test_fail_client_mds.run__debug.trevis-12vm4.trevis.whamcloud.com.log
2018-08-15 00:21:08 Terminating clients loads ...
Duration: 86400
Server failover period: 1200 seconds
Exited after: 0 seconds |
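
To illustrate the suspected stale-state issue (a hypothetical sketch, not the actual recovery-*-scale code; the end_run_file path is taken from the log above, and SHARED_DIRECTORY is assumed to be provided by the test harness): if a previous test leaves end_run_file behind in the shared directory, the next test's client loads find it and exit immediately, which matches the "Exited after: 0 seconds" result above. A defensive cleanup before starting new loads could look like:

# Hypothetical illustration, not the actual recovery-*-scale code.  The
# end_run_file path comes from the log above; SHARED_DIRECTORY is assumed to
# be exported by the test harness.
END_RUN_FILE=${END_RUN_FILE:-$SHARED_DIRECTORY/end_run_file}

# Client loads stop once this file appears, so a stale copy left by an
# earlier test makes the next test's loads exit right away
# ("Exited after: 0 seconds").  A defensive cleanup before starting loads:
if [ -f "$END_RUN_FILE" ]; then
    echo "removing stale $END_RUN_FILE left by a previous test run"
    rm -f "$END_RUN_FILE"
fi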