[LU-13017] sanity test 56ra fails with ' '/usr/bin/lfs find -size 0 -type f /mnt/lustre/d56ra.sanity' should send 12 glimpse RPCs to OST' Created: 26/Nov/19 Updated: 04/Mar/23 Resolved: 16/Apr/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0, Lustre 2.12.4 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | James Nunez (Inactive) | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
sanity test_56ra fails with "'/usr/bin/lfs find -size 0 -type f /mnt/lustre/d56ra.sanity' should send 12 glimpse RPCs to OST". We've seen this error twice for 2.12.3.31: https://testing.whamcloud.com/test_sets/93cddba8-0d76-11ea-98f1-52540065bddc and https://testing.whamcloud.com/test_sets/3d7c7eea-0d93-11ea-8e77-52540065bddc.

== sanity test 56ra: check lfs find -size -lazy works for data on OSTs =============================== 10:36:52 (1574332612)
Before: 6056 After: 6069 3
osc.lustre-OST0000-osc-ffff8c52a0a6c000.stats=
snapshot_time 1574332613.130677340 secs.nsecs
req_waittime 3629 samples [usec] 405 4908409 125708950 217095123014804
req_active 3630 samples [reqs] 1 16 10917 64837
ldlm_glimpse_enqueue 453 samples [reqs] 1 1 453 453
ldlm_extent_enqueue 518 samples [reqs] 1 1 518 518
read_bytes 10 samples [bytes] 0 1048576 4198400 4380984082432
write_bytes 729 samples [bytes] 4 1048576 465385120 468655951772936
ost_setattr 30 samples [usec] 493 2458360 5470309 7460871612619
ost_read 10 samples [usec] 770 112506 334235 27947789183
ost_write 729 samples [usec] 814 4301470 93802565 175458110296977
ost_connect 29 samples [usec] 655 12195 57083 265453679
ost_disconnect 17 samples [usec] 1029 3311 31912 67557322
ost_punch 511 samples [usec] 503 954851 1266293 911962877985
ost_statfs 24 samples [usec] 405 1646 17727 14510469
ost_sync 4 samples [usec] 471514 1920497 3598304 4653850793070
ost_quotactl 16 samples [usec] 414 838 8192 4340976
ldlm_cancel 952 samples [usec] 533 39037 11498758 185050546552
obd_ping 336 samples [usec] 760 4908409 8456427 28320522807435
osc.lustre-OST0001-osc-ffff8c52a0a6c000.stats=
…
obd_ping 364 samples [usec] 735 1157343 2386059 1579049113871
osc.lustre-OST0006-osc-ffff8c52a0a6c000.stats=
snapshot_time 1574332613.130931617 secs.nsecs
req_waittime 2363 samples [usec] 368 10241895 31650509 113679647081289
req_active 2364 samples [reqs] 1 15 6902 40110
ldlm_glimpse_enqueue 928 samples [reqs] 1 1 928 928
ldlm_extent_enqueue 14 samples [reqs] 1 1 14 14
read_bytes 2 samples [bytes] 0 0 0 0
write_bytes 19 samples [bytes] 1 1048576 12837273 13249571414835
ost_setattr 31 samples [usec] 459 10241895 14509301 109946640240861
ost_read 2 samples [usec] 1823 2294 4117 8585765
ost_write 19 samples [usec] 814 543145 1723935 429765970407
ost_connect 27 samples [usec] 450 12275 51577 247601705
ost_disconnect 14 samples [usec] 999 41957 63004 1800765226
ost_punch 9 samples [usec] 622 73011 79728 5336508958
ost_statfs 24 samples [usec] 368 2161 14597 12673415
ost_quotactl 16 samples [usec] 388 462 6778 2878618
ldlm_cancel 927 samples [usec] 411 38371 11153645 173370727119
obd_ping 352 samples [usec] 759 1625436 2970385 3116552824601
sanity test_56ra: @@@@@@ FAIL: '/usr/bin/lfs find -size 0 -type f /mnt/lustre/d56ra.sanity' should send 12 glimpse RPCs to OST

We've seen an error message that may be related on Ubuntu clients, "'/usr/bin/lfs find -size 5 -type f /mnt/lustre/d56ra.sanity' should send 14 glimpse RPCs to OST": https://testing.whamcloud.com/test_sets/138a655c-0d5c-11ea-9487-52540065bddc for patch https://review.whamcloud.com/36405/.
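The check that fails here compares the ldlm_glimpse_enqueue counters in the client's osc.*.stats before and after each "lfs find" run. A minimal sketch of that bookkeeping outside the sanity.sh framework (the lctl parameter and counter names come from the output above; the count_glimpses helper is only an illustration, not the test's actual code):

    #!/bin/bash
    # Sum the ldlm_glimpse_enqueue sample counts across all OSC devices.
    count_glimpses() {
        lctl get_param -n osc.*.stats 2>/dev/null |
            awk '/^ldlm_glimpse_enqueue/ { sum += $2 } END { print sum + 0 }'
    }

    before=$(count_glimpses)
    /usr/bin/lfs find -size 0 -type f /mnt/lustre/d56ra.sanity > /dev/null
    after=$(count_glimpses)
    echo "Before: $before After: $after Glimpses: $((after - before))"
    # test_56ra expects one glimpse per file with data on an OST (12 here);
    # statahead/AGL traffic racing with "lfs find" inflates this count.
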
From the suite_log:

== sanity test 56ra: check lfs find -size -lazy works for data on OSTs =============================== 17:50:11 (1574445011)
Before: 4861 After: 4873 3
osc.lustre-OST0000-osc-ffff9a29368bb000.stats=
snapshot_time 1574445012.373146631 secs.nsecs
req_waittime 2983 samples [usec] 355 2937458 64237903 59715964412813
req_active 2983 samples [reqs] 1 13 11657 81779
ldlm_glimpse_enqueue 277 samples [reqs] 1 1 277 277
ldlm_extent_enqueue 515 samples [reqs] 1 1 515 515
read_bytes 2 samples [bytes] 4096 4096 8192 33554432
write_bytes 569 samples [bytes] 3 4194304 498256074 1247211946247316
ost_setattr 33 samples [usec] 363 608888 739594 375188097832
ost_read 2 samples [usec] 455 466 921 424181
ost_write 569 samples [usec] 554 2937458 49108763 59029078413847
ost_connect 13 samples [usec] 464 6183 24328 96647710
ost_disconnect 9 samples [usec] 755 1623 11362 14999198
ost_punch 511 samples [usec] 355 43985 346442 4026093074
ost_statfs 12 samples [usec] 357 1074 5859 3259503
ldlm_cancel 774 samples [usec] 390 63347 12515513 274346845595
obd_ping 267 samples [usec] 502 117510 544749 18808100599
osc.lustre-OST0001-osc-ffff9a29368bb000.stats=
…
obd_ping 265 samples [usec] 472 4485131 4896837 20121333347081
mdc.lustre-MDT0000-mdc-ffff9a29368bb000.stats=
snapshot_time 1574445012.413685221 secs.nsecs
req_waittime 938787 samples [usec] 305 1810374 538301391 13512208660843
req_active 938787 samples [reqs] 1 9 961715 1042401
ldlm_glimpse_enqueue 120 samples [reqs] 1 1 120 120
ldlm_ibits_enqueue 429407 samples [reqs] 1 1 429407 429407
mds_getattr 124 samples [usec] 337 1111 47863 19161519
mds_getattr_lock 5898 samples [usec] 470 7696 3084370 1885670342
mds_close 116757 samples [usec] 305 892872 46303939 1088509049017
mds_readpage 352 samples [usec] 734 33119 1049972 8758729676
mds_statfs 14 samples [usec] 355 866 6123 2898905
mds_sync 6 samples [usec] 1153 40361 46512 1636587052
mds_getxattr 28 samples [usec] 339 2024 26071 30757607
mds_hsm_state_set 8 samples [usec] 1150 1358 10021 12585899
ldlm_cancel 4370 samples [usec] 331 105790 2428541 42683139111
obd_ping 204 samples [usec] 446 3724 188084 212073474
seq_query 1 samples [usec] 472 472 472 222784
Before: 4873 After: 4885 3
Before: 4885 After: 4900 3
sanity test_56ra: @@@@@@ FAIL: '/usr/bin/lfs find -size 5 -type f /mnt/lustre/d56ra.sanity' should send 14 glimpse RPCs to OST

We've also seen this test fail in interop testing. There is nothing interesting in the node console logs. |
| Comments |
| Comment by Peter Jones [ 26/Nov/19 ] |
|
Gu Zheng, could you please advise? Thanks, Peter |
| Comment by Gu Zheng (Inactive) [ 27/Nov/19 ] |
|
Sure, Peter |
| Comment by Andreas Dilger [ 27/Nov/19 ] |
|
I don't think the patch https://review.whamcloud.com/36405/ " |
| Comment by Gu Zheng (Inactive) [ 28/Nov/19 ] |
|
Yeah, agreed with Andreas. Patch https://review.whamcloud.com/36405/ (
| Comment by Andreas Dilger [ 16/Dec/19 ] |
|
Gu, any update on this ticket? |
| Comment by Jian Yu [ 21/Dec/19 ] |
|
This failure is affecting patch review testing on the Lustre b2_12 branch. |
| Comment by Andreas Dilger [ 22/Mar/20 ] |
|
Interestingly, of the 31 failures of sanity.sh test_56ra in the past 4 weeks, 29 are on b2_12 (about a 30% failure rate on that branch), and the one failure on master is an interop test between master and b2_12. On master the patch skips the test for MDS < 2.12.58, but on b2_12 it doesn't. However, the presence of the test on b2_12 implies that the LSOM functionality was backported and should be working. |
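For reference, the master-only guard mentioned above is the usual sanity.sh version check. A sketch of that pattern, assuming the standard test-framework helpers (skip, version_code) and the MDS version variable; the exact wording in the landed patch may differ:

    # Skip the LSOM-dependent check on servers older than 2.12.58,
    # the cutoff mentioned in this comment.
    [ "$MDS1_VERSION" -lt "$(version_code 2.12.58)" ] &&
        skip "MDS < 2.12.58 does not support lfs find --lazy size checks"
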
| Comment by Gerrit Updater [ 22/Mar/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38021 |
| Comment by Andreas Dilger [ 23/Mar/20 ] |
|
It looks like the first failure of test_56ra on b2_12 with the "'/usr/bin/lfs find -size 5 -type f /mnt/lustre/d56ra.sanity' should send 12 glimpse RPCs to OST" message was on 2019-11-21, which is the same day that patch https://review.whamcloud.com/36553 " |
| Comment by Andreas Dilger [ 08/Apr/20 ] |
|
Patch 38021 has recently landed to b2_12, and most/all of the failures on master are due to interop tests. If this problem is no longer seen on master/b2_12 in a week then this can be closed. |
| Comment by Neil Brown [ 31/Jul/20 ] |
|
I know this is "resolved" so maybe no-one will see this, but I'll report anyway. I got this failure reliably in my testing.

The test is sensitive to timing and performance. I build my kernels with lots of debugging options enabled, and there is a significant performance penalty - but I like the benefits. If it takes more than 1 second from the start of one "lfs find" to the start of the next, failures are quite likely.

While the "lfs find" is calling stat() on each file, a background statahead thread is doing a similar thing. If "find" gets in before statahead, the statahead will skip the RPC. If the previous find did a stat() within the last 1 second, statahead will again skip the RPC. But if the previous stat() was more than 1 second ago, and the current find hasn't got to the file yet, the statahead will do an RPC. Then the find will do its own RPC, resulting in extra RPCs.

It seems strange that a stat() call will do an RPC even if there was a very recent statahead RPC. What is the point of statahead in that case?
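A quick way to check this theory is to take statahead out of the picture and see whether the glimpse count settles at the expected value. A sketch using the standard llite statahead_max tunable (the saved/restored value handling is only illustrative):

    # Disable statahead on the client, re-run the find, then restore the
    # old window size so later tests are unaffected.
    old=$(lctl get_param -n llite.*.statahead_max | head -n 1)
    lctl set_param llite.*.statahead_max=0
    /usr/bin/lfs find -size 0 -type f /mnt/lustre/d56ra.sanity > /dev/null
    lctl set_param llite.*.statahead_max="$old"
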
|
| Comment by Andreas Dilger [ 31/Jul/20 ] |
|
Seems like a bug in statahead? |
| Comment by Neil Brown [ 04/Aug/20 ] |
|
Maybe it's a bug... It appears to be a problem when the stat() system call happens after the statahead, and before the statahead receives a reply. During that time the lock that was created and requested is not attached to the (inode's) resource, so it isn't found when the stat() attempts to find a lock, so instead it creates a new lock. As both locks are PR they don't conflict, and you get two locks on the one resource, both holding size/timestamp info.

If the stat() happens before the statahead, the whole operation is under a semaphore, so the statahead aborts as it cannot get the lock. If the stat() happens after the statahead gets a reply, the lock will be attached to the resource, and the stat() will find and re-use it.

Maybe we could add a "pending" list to the ldlm_resource, and add locks there that have been requested from the server but haven't had a reply yet. That would allow us to check whether a statahead was pending. I tried adding a 'statahead-pending' flag to the inode, but the lock grant doesn't get back to the osc layer - it just sits in the ldlm waiting to be requested again. |
| Comment by Neil Brown [ 13/Aug/20 ] |
|
It is actually the 'agl' thread within the statahead module that is causing the problem. I think the best way forward is to disable agl while the test is running. This will have the added advantage of testing the "disable agl" code, which is currently buggy.
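If the fix does go the "disable AGL" route, the client already exposes a switch for it. A sketch of toggling it around the check, assuming the llite.*.statahead_agl tunable (shown only to illustrate the approach, not the actual patch):

    # Turn off asynchronous glimpse locks for the duration of the find,
    # then restore the saved setting.
    saved=$(lctl get_param -n llite.*.statahead_agl | head -n 1)
    lctl set_param llite.*.statahead_agl=0
    /usr/bin/lfs find -size 0 -type f /mnt/lustre/d56ra.sanity > /dev/null
    lctl set_param llite.*.statahead_agl="$saved"
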
|
| Comment by Gerrit Updater [ 13/Aug/20 ] |
|
Neil Brown (neilb@suse.de) uploaded a new patch: https://review.whamcloud.com/39667 |
| Comment by Gerrit Updater [ 12/Oct/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39667/ |