[LU-7612] recovery-small tests 110a, 110b, 110c, 110d, 110e, 110f fail with 'lfs mkdir failed' Created: 27/Dec/15  Updated: 22/Jul/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: dne
Environment:

autotest review-dne-part-1


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

recovery-small test_110a, test_110b, test_110c, test_110d, test_110e, and test_110f fail with

'lfs mkdir failed' 

From the test_log, we see:

== recovery-small test 110a: create remote directory: drop client req == 18:23:45 (1451154225)
CMD: shadow-20vm8 lctl set_param fail_loc=0x123
fail_loc=0x123
CMD: shadow-20vm5.shadow.whamcloud.com /usr/bin/lfs mkdir -i 1 -c2 /mnt/lustre/d110a.recovery-small/remote_dir
error on LL_IOC_LMV_SETSTRIPE '/mnt/lustre/d110a.recovery-small/remote_dir' (3): Bad address
error: mkdir: create stripe dir '/mnt/lustre/d110a.recovery-small/remote_dir' failed
CMD: shadow-20vm8 lctl set_param fail_loc=0
fail_loc=0
 recovery-small test_110a: @@@@@@ FAIL: lfs mkdir failed 

There’s nothing interesting in the console logs on any of the nodes.

Tests 110g, 110h, 110i, and 110j also fail whenever the other 110 tests fail, with a similar error message in the test logs:

error on LL_IOC_LMV_SETSTRIPE '/mnt/lustre/d110a.recovery-small/remote_dir' (3): Bad address
error: mkdir: create stripe dir '/mnt/lustre/d110a.recovery-small/remote_dir' failed
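
When triaging many test logs for this failure, it may help to extract the LL_IOC_LMV_SETSTRIPE error lines automatically and tally them by error string. The following is a minimal sketch, not part of the Lustre test framework: the function names and the sample log are illustrative assumptions, and the regular expression is based only on the error-line format quoted in this ticket.

```python
import re
from collections import Counter

# Matches the error line format quoted in this ticket:
#   error on LL_IOC_LMV_SETSTRIPE '<path>' (<n>): <message>
# The meaning of the number in parentheses is not stated in the ticket,
# so it is matched but not interpreted.
SETSTRIPE_ERR = re.compile(
    r"error on LL_IOC_LMV_SETSTRIPE '(?P<path>[^']+)' \(\d+\): (?P<msg>.+)"
)

# Sample input copied from the test_log excerpt above.
SAMPLE_LOG = """\
CMD: shadow-20vm8 lctl set_param fail_loc=0x123
fail_loc=0x123
error on LL_IOC_LMV_SETSTRIPE '/mnt/lustre/d110a.recovery-small/remote_dir' (3): Bad address
error: mkdir: create stripe dir '/mnt/lustre/d110a.recovery-small/remote_dir' failed
"""


def find_setstripe_errors(log_text):
    """Return a list of (path, message) tuples, one per matching error line."""
    hits = []
    for line in log_text.splitlines():
        match = SETSTRIPE_ERR.search(line)
        if match:
            hits.append((match.group("path"), match.group("msg").strip()))
    return hits


def count_by_message(log_text):
    """Count matching error lines grouped by their error string."""
    return Counter(msg for _path, msg in find_setstripe_errors(log_text))


if __name__ == "__main__":
    for path, msg in find_setstripe_errors(SAMPLE_LOG):
        print(f"{msg}: {path}")
```

Grouping by error string is a deliberate choice here, so that failures reported with different messages for the same ioctl show up as separate buckets.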

These tests have been failing since October 11, 2015. Failed test logs are at:
2015-10-11 14:59:16 - https://testing.hpdd.intel.com/test_sets/c4f4aa1e-7030-11e5-b705-5254006e85c2
2015-10-11 23:09:20 - https://testing.hpdd.intel.com/test_sets/dfa189e2-7074-11e5-b705-5254006e85c2
2015-10-15 01:46:20 - https://testing.hpdd.intel.com/test_sets/368665d8-72e7-11e5-8fc1-5254006e85c2
2015-10-19 02:17:13 - https://testing.hpdd.intel.com/test_sets/d8c8fb86-7610-11e5-b71e-5254006e85c2
2015-10-29 00:59:05 - https://testing.hpdd.intel.com/test_sets/af686e96-7dec-11e5-9c23-5254006e85c2
2015-11-05 00:18:14 - https://testing.hpdd.intel.com/test_sets/c3aa8b6e-835a-11e5-8da0-5254006e85c2
2015-11-21 22:57:07 - https://testing.hpdd.intel.com/test_sets/4fa87c80-90ac-11e5-aaf3-5254006e85c2
2015-11-30 19:17:14 - https://testing.hpdd.intel.com/test_sets/e9a04846-979f-11e5-b72a-5254006e85c2
2015-12-23 04:36:19 - https://testing.hpdd.intel.com/test_sets/bdb1bf9e-a938-11e5-8531-5254006e85c2
2015-12-23 07:40:39 - https://testing.hpdd.intel.com/test_sets/b10e4b98-a953-11e5-b0df-5254006e85c2
2015-12-26 17:51:37 - https://testing.hpdd.intel.com/test_sets/b958c70a-ac01-11e5-aa1f-5254006e85c2



 Comments   
Comment by James Nunez (Inactive) [ 27/Dec/15 ]

A few other test suites in the same sessions listed above also fail with the LL_IOC_LMV_SETSTRIPE 'Bad address' error:
sanityn 2f, 11, 13, 14, 14a, 14c, 14d, 21, 25b, 31a, 31b, 37, 43i, 45i, 47a, 47b, 47c, 47d, 47e, 47f, 47g, 70a
sanity 1, 2, 3, 5, 6g, 7a, 7b, 9, 10, 11, 13, 15, 17a, 17b, 17d, 17e, 17f, 17g, 17i, 17k, 17m, 17n, 21, 22, and many others

Looking at the patches that hit this problem, most of the failures come from patch #16785 for ticket LU-2533.

Comment by Di Wang [ 29/Dec/15 ]

I just checked these failures; it seems they are all from patches http://review.whamcloud.com/#/c/16785/ and http://review.whamcloud.com/#/c/16969 (the problem is already fixed in the most recent patch).

This is probably a problem with these patches rather than with master, so shall we close this ticket?

Comment by James Nunez (Inactive) [ 30/Dec/15 ]

Di - I agree that most of these failures are due to patches 16785 and 16969, but there are two cases in which these tests failed with this error under other patches:
LU-7318, patch 16889 - https://testing.hpdd.intel.com/test_sets/af686e96-7dec-11e5-9c23-5254006e85c2
LU-7490, patch 17199 - https://testing.hpdd.intel.com/test_sets/b10e4b98-a953-11e5-b0df-5254006e85c2

What do you think about these failures?

Comment by James Nunez (Inactive) [ 30/Dec/15 ]

LU-7318, patch 16889 - https://testing.hpdd.intel.com/test_sets/af686e96-7dec-11e5-9c23-5254006e85c2
Tests 110a – 110f fail with: error on LL_IOC_LMV_SETSTRIPE '/mnt/lustre/d110a.recovery-small/remote_dir' (3): Not a directory

LU-7490, patch 17199
https://testing.hpdd.intel.com/test_sets/b10e4b98-a953-11e5-b0df-5254006e85c2
https://testing.hpdd.intel.com/test_sets/b63a926a-ae04-11e5-8114-5254006e85c2
Tests 110a – 110j fail with: error on LL_IOC_LMV_SETSTRIPE '/mnt/lustre/d110a.recovery-small/remote_dir' (3): Stale file handle
https://testing.hpdd.intel.com/test_sets/f052ddd8-aec0-11e5-9134-5254006e85c2
Test 110a failed with: error on LL_IOC_LMV_SETSTRIPE '/mnt/lustre/d110a.recovery-small/remote_dir' (3): Stale file handle

And the most recent results from LU-7039, patch 16969
https://testing.hpdd.intel.com/test_sets/3994348a-aebf-11e5-8114-5254006e85c2
Test 110a failed with: error on LL_IOC_LMV_SETSTRIPE '/mnt/lustre/d110a.recovery-small/remote_dir' (3): Stale file handle
https://testing.hpdd.intel.com/test_sets/1dd743e0-ae06-11e5-8114-5254006e85c2
Tests 110a – 110j fail with: error on LL_IOC_LMV_SETSTRIPE '/mnt/lustre/d110a.recovery-small/remote_dir' (3): Stale file handle

Comment by Di Wang [ 04/Jan/16 ]

It seems the newest run already passes the test, so I guess the updated patch already fixed the problem. Also, patch 17199 is based on patch 16969.

Comment by Mikhail Pershin [ 22/Jul/18 ]

This issue still happens from time to time, about 11 times this year.

The latest one:
https://testing.whamcloud.com/sub_tests/3c8d7b98-3d69-11e8-b45c-52540065bddc

Generated at Sat Feb 10 02:10:24 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.