[LU-9438] sanity-lfsck test_17: (1.2) f1 (wrong) size should be 1048576, but got
Created: 02/May/17  Updated: 19/Dec/17  Resolved: 19/Dec/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Minor
Reporter: James Casper Assignee: Bob Glossman (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

trevis-35, full, SLES12 clients
EL7, master branch, v2.9.56.11, b3565


Issue Links:
Related
is related to LU-7802 set_param lru_size fails with 'error:... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

https://testing.hpdd.intel.com/test_sessions/20ddc92f-b9fe-482d-ac1b-1602a513c824

From test_log:

CMD: trevis-35vm7 /usr/sbin/lctl set_param fail_val=0 fail_loc=0x1614
fail_val=0
fail_loc=0x1614
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00496945 s, 211 MB/s
total: 1 open/close in 0.00 seconds: 479.18 ops/second
error: set_param: setting /sys/fs/lustre/ldlm/namespaces/lustre-MDT0000-mdc-ffff88002c485000/lru_size=clear: Invalid argument
ldlm.namespaces.lustre-MDT0000-mdc-ffff88002c485000.lock_unused_count=2
CMD: trevis-35vm7 /usr/sbin/lctl set_param fail_loc=0 fail_val=0
fail_loc=0
fail_val=0
/mnt/lustre/d17.sanity-lfsck/f0 and /mnt/lustre/d17.sanity-lfsck/guard use the same OST-objects
/mnt/lustre/d17.sanity-lfsck/f1 and /mnt/lustre/d17.sanity-lfsck/guard use the same OST-objects
ls: cannot access '/mnt/lustre/d17.sanity-lfsck/f1': Input/output error
/usr/lib64/lustre/tests/sanity-lfsck.sh: line 1906: [: -eq: unary operator expected
 sanity-lfsck test_17: @@@@@@ FAIL: (1.2) f1 (wrong) size should be 1048576, but got  
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4931:error()
  = /usr/lib64/lustre/tests/sanity-lfsck.sh:1907:test_17()
  = /usr/lib64/lustre/tests/test-framework.sh:5207:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5246:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:5093:run_test()
  = /usr/lib64/lustre/tests/sanity-lfsck.sh:1940:main()
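The blank size in the FAIL line, together with the bash "[: -eq: unary operator expected" message at sanity-lfsck.sh line 1906, suggests the size read back for f1 was empty after the ls call returned EIO. A minimal illustrative sketch of that failure mode (the variable name and commands are hypothetical, not the test's actual code):

# Illustrative only: if the ls fails (here with EIO), $size stays empty, the
# unquoted test degenerates to "[ -eq 1048576 ]", which bash rejects with
# "[: -eq: unary operator expected", and the error message prints a blank size.
size=$(ls -l /mnt/lustre/d17.sanity-lfsck/f1 2>/dev/null | awk '{ print $5 }')
[ $size -eq 1048576 ] ||
	echo "(1.2) f1 (wrong) size should be 1048576, but got $size"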


 Comments   
Comment by James Nunez (Inactive) [ 05/May/17 ]

sanity-lfsck test 17 started failing with this error on April 13, 2017. In all cases of this test failure, the client is SLES11SP* or SLES12SP*. There are no failures of this test with this error for RHEL/CentOS clients.

The logs for the earliest failures are at:
https://testing.hpdd.intel.com/test_sets/400cd9f8-2113-11e7-8920-5254006e85c2
https://testing.hpdd.intel.com/test_sets/c3b83dfe-212e-11e7-9073-5254006e85c2
https://testing.hpdd.intel.com/test_sets/c1bc4fd4-2130-11e7-9de9-5254006e85c2

Comment by Peter Jones [ 08/May/17 ]

Bob

Could you please look into this one?

Thanks

Peter

Comment by Bob Glossman (Inactive) [ 08/May/17 ]

Looking at the specific code in the test, it appears that if the file $DIR/$tdir/f1 doesn't exist, it would produce precisely the effects captured in the failure logs.

I'm unable to determine why this file would exist when running on RHEL but not on SLES.

The failure output might make more sense if the test included conditional checks that detect and correctly report the absence of files like f0 and f1, rather than assuming that an ls on those files will produce parseable output on stdout with a size in it; see the sketch below.
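A rough sketch of that kind of defensive check, using the error() helper from test-framework.sh seen in the trace above (the exact variables and messages are illustrative, not the test's current code):

# Report a missing or unreadable file explicitly instead of feeding an
# empty string into "[ ... -eq ... ]".
f=$DIR/$tdir/f1
[ -e "$f" ] || error "(1.2) $f does not exist"
size=$(stat -c '%s' "$f" 2>/dev/null)
[ -n "$size" ] || error "(1.2) cannot stat $f (I/O error?)"
[ "$size" -eq 1048576 ] ||
	error "(1.2) f1 (wrong) size should be 1048576, but got $size"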

Comment by Andreas Dilger [ 12/May/17 ]

Well, if f1 doesn't exist it would return ENOENT instead of EIO, so the problem isn't that the file is missing. John pointed out the earlier error message:

error: set_param: setting /sys/fs/lustre/ldlm/namespaces/lustre-MDT0000-mdc-ffff88002c485000/lru_size=clear: Invalid argument
ldlm.namespaces.lustre-MDT0000-mdc-ffff88002c485000.lock_unused_count=2

So it isn't clear why the "clear" failed to cancel the locks. That could easily be attributed to a change in /proc handling for SLES12 and needs to be investigated.
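For context, the sequence in the log is the usual client LRU flush; a sketch of the same steps (the $LCTL variable and wildcard parameter path follow test-framework.sh conventions, and the exact helper internals may differ):

# Flush the client-side DLM LRU for the MDT namespace, then read back the
# count of unused locks; a successful clear should leave it at 0.
$LCTL set_param ldlm.namespaces.*-MDT0000-mdc-*.lru_size=clear
$LCTL get_param ldlm.namespaces.*-MDT0000-mdc-*.lock_unused_count
# In the failing run the set_param returned EINVAL and lock_unused_count
# stayed at 2, i.e. the unused locks were not cancelled.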

Comment by Bob Glossman (Inactive) [ 23/May/17 ]

No progress so far; low priority. This most likely seems to be a test-only problem.

I can make the 'clear' operation fail, but the test passes anyway, so I don't think those two effects are directly related.
I'm not close to a root cause for either.

Comment by Bob Glossman (Inactive) [ 19/Dec/17 ]

No instances of this failure have been seen since June 2017.
