[LU-2902] sanity test_156: NOT IN CACHE: before: , after: Created: 04/Mar/13 Updated: 21/Jul/14 Resolved: 07/Aug/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.1, Lustre 2.5.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Keith Mannthey (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6990 |
| Description |
|
This issue was created by maloo for Oleg Drokin <green@whamcloud.com>. This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/406900a6-84d3-11e2-9ab1-52540035b04c. The sub-test test_156 failed with the following error:
This seems to have an astounding 21% failure rate and nobody filed a ticket for it yet. Info required for matching: sanity 156 |
| Comments |
| Comment by Keith Mannthey (Inactive) [ 05/Mar/13 ] |
|
This is a dup of LU-2009. I looked into this a bit when I was looking at sanity test 180 failures. When it fails, sometimes the data isn't in the cache and sometimes it might have a greater than expected cache read value. Oleg, can you take a look at the test and weigh in on the validity of the expected behavior? What about if we are in a low-memory situation? I fear this test may be an "under ideal conditions" sort of test. |
| Comment by Keith Mannthey (Inactive) [ 07/Mar/13 ] |
|
OK, from looking at the error logs there are several things that can go wrong:

NOT IN CACHE: before: 16741, after: 16741
NOT IN CACHE: before: , after:
IN CACHE: before: 16749, after: 16752
NOT IN CACHE: before: 16749, after: 16755

In general it seems like: Also, I think these issues affect all the tests that use roc_hit. I am just going to close lu-2009 and work this one. |
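For reference, roc_hit just sums the cache_hit counters the OSTs expose through /proc; the sketch below is roughly what the test-framework helper does, but the exact parameter globs and field positions vary between versions, so treat the names as illustrative rather than the real implementation.

# Sketch only: sum cache_hit across all OSTs, approximately what roc_hit() does.
roc_hit_sketch() {
    do_nodes $(comma_list $(osts_nodes)) \
        "lctl get_param -n obdfilter.*.stats osd-*.*.stats 2>/dev/null" |
        awk '{ for (i = 1; i < NF; i++) if ($i == "cache_hit") sum += $(i + 1) }
             END { printf "%d\n", sum }'
}

Any of the failure modes above (counter missing, counter not advancing, counter advancing too much) shows up to the caller only as unexpected arithmetic on this one number.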
| Comment by Keith Mannthey (Inactive) [ 07/Mar/13 ] |
|
I am hoping to get a little more debug info from the patch below: http://review.whamcloud.com/5648. It is untested so it might need another try. The goal is to run sanity test_156 a bunch of times and log all the available stats (not just cache_hits) to get better resolution on what is happening on the system. Even if it works it may need to be run a few times. |
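In shell terms the debug patch amounts to something like the sketch below; this is illustrative only (the real change is the Gerrit patch above), and $TMP plus the parameter globs are assumptions.

# Sketch: capture every stats file on the OSTs, not just cache_hit, so the
# full counter set ends up in the test log for before/after comparison.
dump_ost_stats() {
    local tag=$1    # e.g. "before" or "after"
    do_nodes $(comma_list $(osts_nodes)) \
        "lctl get_param osd-*.*.stats obdfilter.*.stats" | tee $TMP/ost_stats.$tag
}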
| Comment by Keith Mannthey (Inactive) [ 11/Mar/13 ] |
|
My initial debug patch wasn't quite right and got lost in the test-logs mess last week. I updated the patch. My local VMs freed up enough today that I am running the debug test locally as well. |
| Comment by Keith Mannthey (Inactive) [ 12/Mar/13 ] |
|
So far I have run 5000+ iterations of test 156 in my local VMs. Still no sign of the errors; nothing else runs before my loop, but I am down to about 80 MB of free memory in the VM. It has been slowly decreasing this morning (at 10 am or so I had about 150 MB free). I am going to let it crash out and see what happens. |
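For the record, the local reproduction loop is along these lines; the path and the ONLY= convention are the usual test-framework ones, and the exact invocation may differ per setup.

# Sketch of the local reproduction loop: run sanity test 156 repeatedly and
# stop on the first failure.
cd /usr/lib64/lustre/tests
i=0
while ONLY=156 sh sanity.sh; do
    i=$((i + 1))
    echo "iteration $i passed"
done
echo "failed after $i clean iterations"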
| Comment by Keith Mannthey (Inactive) [ 12/Mar/13 ] |
|
Well, nothing really to report. I added some memhog load to take the box near OOM, but this has been running for a few hours now without any change in the tests. I will let the memory pressure continue overnight, but if I don't capture anything overnight I will set my VMs to other work. I may need to look at landing a patch for master and let the autotest test review pool debug the issue some more. |
| Comment by Keith Mannthey (Inactive) [ 13/Mar/13 ] |
|
Somewhere around 5500 total iterations, with some external memory pressure applied, we encountered some errors. I don't know if the memory pressure matters or not. In my debug patch I print out the entire proc entry being used. The first section:

Turn off the read cache and turn on the write cache
Read again; it should be satisfied from the cache.

snapshot_time 1363145753.405525 secs.usecs
read_bytes 96380 samples [bytes] 4096 12288 789536768
write_bytes 21503 samples [bytes] 13 1048576 348218279
get_info 85877 samples [reqs]
set_info_async 5359 samples [reqs]
connect 2 samples [reqs]
statfs 27705 samples [reqs]
create 672 samples [reqs]
destroy 21425 samples [reqs]
setattr 5364 samples [reqs]
punch 5361 samples [reqs]
sync 21424 samples [reqs]
preprw 117883 samples [reqs]
commitrw 117883 samples [reqs]
ping 2819 samples [reqs]

snapshot_time 1363145753.407124 secs.usecs
get_page 117883 samples [usec] 0 1056 298323 11433093
cache_access 144567 samples [pages] 1 1 144567
cache_hit 80316 samples [pages] 1 1 80316
cache_miss 64251 samples [pages] 1 1 64251

snapshot_time 1363145771.340504 secs.usecs
read_bytes 96382 samples [bytes] 4096 12288 789553152
write_bytes 21503 samples [bytes] 13 1048576 348218279
get_info 85877 samples [reqs]
set_info_async 5359 samples [reqs]
connect 2 samples [reqs]
statfs 27709 samples [reqs]
create 672 samples [reqs]
destroy 21425 samples [reqs]
setattr 5364 samples [reqs]
punch 5361 samples [reqs]
sync 21424 samples [reqs]
preprw 117885 samples [reqs]
commitrw 117885 samples [reqs]
ping 2822 samples [reqs]

snapshot_time 1363145771.341000 secs.usecs
get_page 117885 samples [usec] 0 1056 298331 11433157
cache_access 144570 samples [pages] 1 1 144570
cache_hit 80316 samples [pages] 1 1 80316
cache_miss 64254 samples [pages] 1 1 64254

sanity test_156: @@@@@@ FAIL: NOT IN CACHE: before: 80316, after: 80316

Notes: Time has moved forward between the two samples. Let's look at the test in this specific area.

log "Turn off the read cache and turn on the write cache"
set_cache read off <========== Isn't it safe to assume all cache can be purged after this call? All read cache data gone. No write cache.
set_cache writethrough on <===== Perhaps we should turn on the writethrough cache first? A single read knocks it out of the writethrough cache?
log "Read again; it should be satisfied from the cache." <============== WHY? I know it happens a lot but I don't see why.
BEFORE=`roc_hit`
cancel_lru_locks osc
cat $file >/dev/null
AFTER=`roc_hit`
if ! let "AFTER - BEFORE == CPAGES"; then
error "NOT IN CACHE: before: $BEFORE, after: $AFTER"
else
log "cache hits:: before: $BEFORE, after: $AFTER"
fi
log "Read again; it should not be satisfied from the cache." <======== No state change but it should not be in the cache?
BEFORE=$AFTER
cancel_lru_locks osc
cat $file >/dev/null
AFTER=`roc_hit`
if ! let "AFTER - BEFORE == 0"; then
error "IN CACHE: before: $BEFORE, after: $AFTER"
else
log "cache hits:: before: $BEFORE, after: $AFTER"
fi
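For context, set_cache just flips the OSS cache tunables on every OST; the sketch below is from memory of what the sanity.sh helper does, so the exact parameter names are assumptions.

# Sketch: enable/disable an OSS cache, e.g. "set_cache read off" or
# "set_cache writethrough on".
set_cache_sketch() {
    local name=$1 mode=$2 enable=0
    [ "$mode" = "on" ] && enable=1
    do_nodes $(comma_list $(osts_nodes)) \
        "lctl set_param obdfilter.*.${name}_cache_enable=$enable"
}

Nothing in that call actively purges or populates the cache; it only changes what future IO does, which is part of why the assertions above are timing sensitive.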
|
| Comment by Keith Mannthey (Inactive) [ 13/Mar/13 ] |
|
This is the next spot:

Turn on the read cache and turn off the write cache
Write data and read it back
It should not be satisfied from the cache.
3+0 records in
3+0 records out
12288 bytes (12 kB) copied, 0.0102791 s, 1.2 MB/s

snapshot_time 1363149708.491279 secs.usecs
read_bytes 96672 samples [bytes] 4096 12288 791928832
write_bytes 21567 samples [bytes] 13 1048576 349004711
get_info 86134 samples [reqs]
set_info_async 5376 samples [reqs]
connect 2 samples [reqs]
statfs 28513 samples [reqs]
create 674 samples [reqs]
destroy 21485 samples [reqs]
setattr 5381 samples [reqs]
punch 5382 samples [reqs]
sync 21488 samples [reqs]
preprw 118239 samples [reqs]
commitrw 118239 samples [reqs]
ping 3475 samples [reqs]

snapshot_time 1363149708.491619 secs.usecs
get_page 118239 samples [usec] 0 1056 299142 11440980
cache_access 145005 samples [pages] 1 1 145005
cache_hit 80550 samples [pages] 1 1 80550
cache_miss 64455 samples [pages] 1 1 64455

snapshot_time 1363149709.28799 secs.usecs
read_bytes 96674 samples [bytes] 4096 12288 791945216
write_bytes 21568 samples [bytes] 13 1048576 349016999
get_info 86136 samples [reqs]
set_info_async 5376 samples [reqs]
connect 2 samples [reqs]
statfs 28513 samples [reqs]
create 674 samples [reqs]
destroy 21485 samples [reqs]
setattr 5381 samples [reqs]
punch 5382 samples [reqs]
sync 21489 samples [reqs]
preprw 118242 samples [reqs]
commitrw 118242 samples [reqs]
ping 3475 samples [reqs]

snapshot_time 1363149709.29176 secs.usecs
get_page 118242 samples [usec] 0 1056 299147 11441005
cache_access 145008 samples [pages] 1 1 145008
cache_hit 80550 samples [pages] 1 1 80550
cache_miss 64458 samples [pages] 1 1 64458

cache hits:: before: 80550, after: 80550
Read again; it should be satisfied from the cache.

snapshot_time 1363149740.756061 secs.usecs
read_bytes 96674 samples [bytes] 4096 12288 791945216
write_bytes 21568 samples [bytes] 13 1048576 349016999
get_info 86136 samples [reqs]
set_info_async 5376 samples [reqs]
connect 2 samples [reqs]
statfs 28520 samples [reqs]
create 674 samples [reqs]
destroy 21486 samples [reqs]
setattr 5381 samples [reqs]
punch 5382 samples [reqs]
sync 21489 samples [reqs]
preprw 118242 samples [reqs]
commitrw 118242 samples [reqs]
ping 3481 samples [reqs]

snapshot_time 1363149740.757198 secs.usecs
get_page 118242 samples [usec] 0 1056 299147 11441005
cache_access 145008 samples [pages] 1 1 145008
cache_hit 80550 samples [pages] 1 1 80550
cache_miss 64458 samples [pages] 1 1 64458

snapshot_time 1363149747.316476 secs.usecs
read_bytes 96676 samples [bytes] 4096 12288 791961600
write_bytes 21568 samples [bytes] 13 1048576 349016999
get_info 86137 samples [reqs]
set_info_async 5376 samples [reqs]
connect 2 samples [reqs]
statfs 28521 samples [reqs]
create 674 samples [reqs]
destroy 21486 samples [reqs]
setattr 5381 samples [reqs]
punch 5382 samples [reqs]
sync 21489 samples [reqs]
preprw 118244 samples [reqs]
commitrw 118244 samples [reqs]
ping 3482 samples [reqs]

snapshot_time 1363149747.316992 secs.usecs
get_page 118244 samples [usec] 0 1056 299148 11441006
cache_access 145011 samples [pages] 1 1 145011
cache_hit 80550 samples [pages] 1 1 80550
cache_miss 64461 samples [pages] 1 1 64461

sanity test_156: @@@@@@ FAIL: NOT IN CACHE: before: 80550, after: 80550 <==== still a cache_miss, same as above.

In the first part of this section, note that the write_bytes sample count increased by 1 (21567 to 21568) after the dd completed. It seems we were still writing when we set BEFORE. The test:

log "Turn on the read cache and turn off the write cache"
set_cache read on
set_cache writethrough off
log "Write data and read it back"
log "It should not be satisfied from the cache."
rm -f $file
dd if=/dev/urandom of=$file bs=4k count=$CPAGES || error "dd failed"
BEFORE=`roc_hit`
cancel_lru_locks osc
cat $file >/dev/null
AFTER=`roc_hit`
if ! let "AFTER - BEFORE == 0"; then
error_ignore 20762 "IN CACHE: before: $BEFORE, after: $AFTER" <==== So in general this test is non deterministic.
else
log "cache hits:: before: $BEFORE, after: $AFTER"
fi
log "Read again; it should be satisfied from the cache." <==== If it didn't make it into the cache before why should it be there now?
BEFORE=`roc_hit`
cancel_lru_locks osc
cat $file >/dev/null
AFTER=`roc_hit`
if ! let "AFTER - BEFORE == CPAGES"; then
error "NOT IN CACHE: before: $BEFORE, after: $AFTER"
else
log "cache hits:: before: $BEFORE, after: $AFTER"
fi
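One way to close the race noted above (write still in flight when BEFORE is sampled) would be to make sure the dd has fully reached the OST before reading the counters; the lines below are a sketch of that idea, not the landed fix.

# Sketch: flush the write and drop client locks before taking the BEFORE
# sample, so write_bytes cannot move between the two roc_hit calls.
rm -f $file
dd if=/dev/urandom of=$file bs=4k count=$CPAGES || error "dd failed"
sync                        # make sure the dirty pages have been written out
cancel_lru_locks osc        # drop cached client pages/locks first
BEFORE=`roc_hit`
cat $file >/dev/null
AFTER=`roc_hit`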
|
| Comment by Keith Mannthey (Inactive) [ 13/Mar/13 ] |
|
And finally this fun item, a non-value:

Turn on read and write cache
Write data and read it back.
Read should be satisfied from the cache.
3+0 records in
3+0 records out
12288 bytes (12 kB) copied, 0.0131985 s, 931 kB/s

snapshot_time 1363156626.318377 secs.usecs
read_bytes 96978 samples [bytes] 4096 12288 794435584
write_bytes 21635 samples [bytes] 13 1048576 349840295
get_info 86398 samples [reqs]
set_info_async 5401 samples [reqs]
connect 2 samples [reqs]
statfs 29921 samples [reqs]
create 676 samples [reqs]
destroy 21553 samples [reqs]
setattr 5406 samples [reqs]
punch 5407 samples [reqs]
sync 21555 samples [reqs]
preprw 118613 samples [reqs]
commitrw 118613 samples [reqs]
ping 4680 samples [reqs]

snapshot_time 1363156626.319297 secs.usecs
get_page 118613 samples [usec] 0 1056 299906 11448652
cache_access 145464 samples [pages] 1 1 145464
cache_hit 80805 samples [pages] 1 1 80805
cache_miss 64659 samples [pages] 1 1 64659

ost: Connection timed out during banner exchange
pdsh@client: ost: ssh exited with exit code 255

snapshot_time 1363156656.728803 secs.usecs
read_bytes 96980 samples [bytes] 4096 12288 794451968
write_bytes 21635 samples [bytes] 13 1048576 349840295
get_info 86399 samples [reqs]
set_info_async 5401 samples [reqs]
connect 2 samples [reqs]
statfs 29928 samples [reqs]
create 676 samples [reqs]
destroy 21553 samples [reqs]
setattr 5406 samples [reqs]
punch 5407 samples [reqs]
sync 21556 samples [reqs]
preprw 118615 samples [reqs]
commitrw 118615 samples [reqs]
ping 4685 samples [reqs]

snapshot_time 1363156656.729848 secs.usecs
get_page 118615 samples [usec] 0 1056 299908 11448656
cache_access 145467 samples [pages] 1 1 145467
cache_hit 80805 samples [pages] 1 1 80805
cache_miss 64662 samples [pages] 1 1 64662

sanity test_156: @@@@@@ FAIL: NOT IN CACHE: before: , after: 80805

In my case I checked the logs, and the OOM killer killed ssh at that point. In total, ssh was killed twice on the OST, so it is no big surprise that this is messed up. For some reason, in the middle of the night, the OOM killer on the OST came out to play. It killed ssh a few times and then it killed memhog on the OST. The client never entered OOM.

A little about the timing and the client / OST states. Both stresses were started by March 12 18:00; at that point the client stress had been running 6+ hours already.
On the OST, first sign of the tage message:
The first FAIL seen:
The last FAIL seen:
Last tage message, but memhog is still running on the client:
On the OST, memhog was killed:
There are no more errors or tage/OOM messages seen after that; it runs clean for the next 10 hours.

I am led to believe some sort of memory pressure on the OST is key to this issue. I am starting some memory-pressure-only testing; I am going to back off the memhog a bit since I agree OOM is not what we want to see. |
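Since the empty "before:" value comes from the remote stats query dying (ssh killed by the OOM killer here), the caller could at least fail with a clearer message; a minimal sketch of such a guard, purely for illustration and not what was landed:

# Sketch: detect an empty roc_hit result instead of feeding it to "let".
BEFORE=`roc_hit`
if [ -z "$BEFORE" ]; then
    error "roc_hit returned nothing; OST stats could not be read"
fi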
| Comment by Keith Mannthey (Inactive) [ 14/Mar/13 ] |
|
OK, a quick summary of last night's testing: OST memory pressure is key. The OST did not suffer any OOM. It did have "cannot allocate a tage" for a few hours, and during those hours, and only those few hours (memory pressure was constant), did we get failures. The spots:

Turn on read and write cache
Write data and read it back.
Read should be satisfied from the cache.
3+0 records in
3+0 records out
12288 bytes (12 kB) copied, 0.00463795 s, 2.6 MB/s
cache hits:: before: 107160, after: 107163
Read again; it should be satisfied from the cache.
sanity test_156: @@@@@@ FAIL: NOT IN CACHE: before: 107163, after: 107163

Perhaps even the first read was a race to not be evicted. A few of these:

Turn on the read cache and turn off the write cache
Write data and read it back
It should not be satisfied from the cache.
3+0 records in
3+0 records out
12288 bytes (12 kB) copied, 0.00447962 s, 2.7 MB/s
cache hits:: before: 107190, after: 107190
Read again; it should be satisfied from the cache.
sanity test_156: @@@@@@ FAIL: NOT IN CACHE: before: 107190, after: 107190

A few of these:

cache hits:: before: 107250, after: 107253
Read again; it should be satisfied from the cache.
cache hits:: before: 107253, after: 107256
Turn off the read cache and turn on the write cache
Read again; it should be satisfied from the cache.
sanity test_156: @@@@@@ FAIL: NOT IN CACHE: before: 107256, after: 107256

Lots of the next one:

cache hits:: before: 107310, after: 107313
Turn off the read cache and turn on the write cache
Read again; it should be satisfied from the cache.
sanity test_156: @@@@@@ FAIL: NOT IN CACHE: before: 107313, after: 107313

These are the spots where I can see errors. I think it is important to note they all represent the points farthest from a write when checking whether something is in the cache. If we write and then right away read it back, we seem to find it, but if we wait a bit it might be gone; as I said, it is the longer-duration spots. The whole test duration seems to be in the 17-25 second range, which means each chunk can take seconds. Perhaps there is a way the test can be self-aware and know the cache is under too much pressure to run this test? Is there a good way to look at the cache sizes or see cache evictions? Also, the very serious and more common "NOT IN CACHE: before: , after: " needs a debug patch for master. I will clean up my dump-everything patch for roc_hit and push it. I would like to know if we are reading some part of the data from the proc calls or not; we only know there is not a cache_hit value to read. I don't think there is widespread shell killing going on in autotest (the only way I reproduced the issue). |
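The "self aware" idea could look something like the sketch below: check free memory on the OSS and skip the cache assertions when it is low. The MemFree parsing and the 100 MB threshold are assumptions for illustration, not a tested heuristic.

# Sketch: skip the cache-hit checks when the OSS is short on memory, since
# page-cache residency is then not predictable.
oss_free_kb=$(do_node $(facet_active_host ost1) cat /proc/meminfo |
    awk '/MemFree/ { print $(NF - 1) }')
if [ "${oss_free_kb:-0}" -lt 102400 ]; then
    skip "OSS under memory pressure; cache-hit checks would be unreliable"
    return 0
fi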
| Comment by Keith Mannthey (Inactive) [ 14/Mar/13 ] |
|
I changed the debug patch for landing and am going to try to land it, in hopes of learning more about the errors seen: http://review.whamcloud.com/5648. More debug may be needed, but I would like to start here. |
| Comment by Oleg Drokin [ 15/Mar/13 ] |
|
Could this be related to LU-2735 too? Can you reproduce with that patch? Is it just memory pressure playing tricks, dropping stuff from the cache when we don't expect it to? |
| Comment by Keith Mannthey (Inactive) [ 15/Mar/13 ] |
|
This is the same issue as LU-2705; all the other tests that use the cache hit/miss stats (anything that uses roc_hit) have this same issue. At least the read cache is the Linux page cache (that is what the manual says), which means that under memory pressure all bets are off as to whether you will stay in the cache for any amount of time. I might be able to use the patch for LU-2735 carefully enough to pass 156 via the special file_loc flag that prevents removal from the LRU list. It is hard to say, as there is a lot of write cache involved. I do think that patch could address the dropped-from-cache issue, but it won't address the double hit or the "just nothing is there" parts of this LU. The "NOT IN CACHE: before: , after: " error is the most common and least understood in autotest. As mentioned, I am going to attempt to make at least 156 memory-state aware, and if we see too much memory pressure we skip the test or consider it safe to allow to fail. |
| Comment by Andreas Dilger [ 18/Mar/13 ] |
|
Both |
| Comment by Andreas Dilger [ 18/Mar/13 ] |
|
I think it makes sense to close this bug as a duplicate of |
| Comment by Jodi Levi (Inactive) [ 19/Mar/13 ] |
|
Duplicate of |
| Comment by Keith Mannthey (Inactive) [ 19/Mar/13 ] |
|
It is not safe to close this issue, as it represents the 3 issues seen by "roc_hit" users. The only one there is a handle on is the early evictions due to memory pressure. There may be other improper evictions, but there is not enough data from the autotest runs to say what has happened. It is for sure not covered by any other LU. Debug data from autotest is needed. Perhaps we want to split this up into a few tickets, but they all come down to roc_hit having trouble and cache behavior not being deterministic (hopefully it is just due to memory pressure, but there is no way to tell at this point; the tests need to be improved). I am actively working on this issue as a blocker. If this is meant to close down the HB blocker and punt this to a minor issue, please close again and I will stop working on the issue. |
| Comment by Keith Mannthey (Inactive) [ 19/Mar/13 ] |
|
I spent a bit more time looking and working with the no-info case and I found this:

[root@ost lustre-OST0000]# /usr/sbin/lctl get_param -n osd-*.lustre-OST*.stats obdfilter.lustre-OST*.stats
snapshot_time 1363657474.555425 secs.usecs
get_page 1 samples [usec] 1 1 1 1
snapshot_time 1363657474.555530 secs.usecs
write_bytes 1 samples [bytes] 18 18 18
get_info 9 samples [reqs]
set_info_async 1 samples [reqs]
connect 3 samples [reqs]
disconnect 1 samples [reqs]
statfs 243 samples [reqs]
create 2 samples [reqs]
destroy 3 samples [reqs]
sync 2 samples [reqs]
preprw 1 samples [reqs]
commitrw 1 samples [reqs]
ping 225 samples [reqs]

This was pretty much a fresh mount with no IO. Did a little IO, a sync, and a read:

[root@ost lustre-OST0000]# /usr/sbin/lctl get_param -n osd-*.lustre-OST*.stats obdfilter.lustre-OST*.stats
snapshot_time 1363657550.849448 secs.usecs
get_page 44 samples [usec] 0 2 25 29
cache_access 1 samples [pages] 1 1 1
cache_hit 1 samples [pages] 1 1 1
snapshot_time 1363657550.849551 secs.usecs
read_bytes 1 samples [bytes] 4096 4096 4096
write_bytes 43 samples [bytes] 18 1048576 42510794
get_info 16 samples [reqs]
set_info_async 1 samples [reqs]
connect 3 samples [reqs]
disconnect 1 samples [reqs]
statfs 258 samples [reqs]
create 2 samples [reqs]
destroy 6 samples [reqs]
sync 3 samples [reqs]
preprw 44 samples [reqs]
commitrw 44 samples [reqs]
ping 236 samples [reqs]

It appears we don't ever display "0" for some proc values. The proc interface should be static so tools can be built around it. I am looking into why "0" was not displayed. You can see that "cache_miss" (which should report 0) is not displayed at this time on my system. |
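Since a counter with zero samples is simply absent from the stats file, any tool parsing it has to supply the zero itself; a small illustrative helper (the names here are mine, not test-framework code):

# Sketch: read one counter from a saved stats dump, defaulting to 0 when the
# line is missing (as cache_miss is in the fresh-mount output above).
get_counter() {
    local name=$1 file=$2
    awk -v n="$name" '$1 == n { print $2; found = 1 }
        END { if (!found) print 0 }' "$file"
}
# Example: get_counter cache_miss /tmp/ost.stats   -> prints 0 if the line is absent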
| Comment by Keith Mannthey (Inactive) [ 19/Mar/13 ] |
|
Well, I have been looking into the lproc subsystem for a bit. I see this in lustre/obdclass/lprocfs_status.c (it seems to be the correct spot for the values in question):

static int lprocfs_stats_seq_show(struct seq_file *p, void *v)
{
        struct lprocfs_stats *stats = p->private;
        struct lprocfs_counter *cntr = v;
        struct lprocfs_counter ret;
        struct lprocfs_counter_header *header;
        int entry_size;
        int idx;
        int rc = 0;

        if (cntr == &(stats->ls_percpu[0])->lp_cntr[0]) {
                struct timeval now;
                cfs_gettimeofday(&now);
                rc = seq_printf(p, "%-25s %lu.%lu secs.usecs\n",
                                "snapshot_time", now.tv_sec, now.tv_usec);
                if (rc < 0)
                        return rc;
        }
        entry_size = sizeof(*cntr);
        if (stats->ls_flags & LPROCFS_STATS_FLAG_IRQ_SAFE)
                entry_size += sizeof(__s64);
        idx = ((void *)cntr - (void *)&(stats->ls_percpu[0])->lp_cntr[0]) /
              entry_size;
        header = &stats->ls_cnt_header[idx];
        lprocfs_stats_collect(stats, idx, &ret);

        if (ret.lc_count == 0)   <====== Can someone please explain this? It is a source of the no reads from roc_hit
                goto out;
        ...
Git blame says I need to mail Bobi Jam:

ca461f0f (Bobi Jam 2013-01-19 00:54:32 +0800 1542)      if (ret.lc_count == 0)
ca461f0f (Bobi Jam 2013-01-19 00:54:32 +0800 1543)              goto out;

But he just did the tabs. |
| Comment by Keith Mannthey (Inactive) [ 19/Mar/13 ] |
|
Well, the real code has been around for a long time:

d62360b7 (nathan 2008-09-22 22:20:42 +0000 1505)        if (ret.lc_count == 0)
d62360b7 (nathan 2008-09-22 22:20:42 +0000 1506)                goto out;

Any input is appreciated. I will play with removing those lines tomorrow. |
| Comment by Andreas Dilger [ 19/Mar/13 ] |
|
I think the intent is to not print all of the stats for operations that are not relevant for a particular device (e.g. "rename" for OSTs, or "read" for MDTs). One would think that by the time test_156 rolls around there would have been read operations on the device? |
| Comment by Keith Mannthey (Inactive) [ 19/Mar/13 ] |
|
It would be nice if we could land http://review.whamcloud.com/5648 so we can see the /proc state when we are in autotest and getting these errors. I agree it "seems" like things should be well and good before we get here, but perhaps there is some reset or failover that clears the stats. I will wait for info from autotest, as the issue will hopefully be clearer then. For the larger issue, it could save memory/space to only register values that are relevant to a given lu object type, but I am sure that is out of the scope of this LU. |
| Comment by Jodi Levi (Inactive) [ 26/Mar/13 ] |
|
Reducing in priority until this occurs again. |
| Comment by Keith Mannthey (Inactive) [ 27/Mar/13 ] |
|
I thought I would update. The issue has not reoccurred since the debug patch was landed (test 151 or 156; I spent some time looking in Maloo). There are failures on OpenSFS with old code, but nothing in the current master space. I am all for leaving the debug patch in (Andreas considered it OK for a longer-term stay) and we will see what happens. |
| Comment by Sebastien Buisson (Inactive) [ 29/Mar/13 ] |
|
It seems we hit two new occurrences of this problem on master yesterday: |
| Comment by Keith Mannthey (Inactive) [ 29/Mar/13 ] |
|
Thanks for the links: https://maloo.whamcloud.com/test_sets/016cbfcc-9816-11e2-879d-52540035b04c This is the "after" call, with some spacing added for readability:

snapshot_time 1364501362.939531 secs.usecs
read_bytes 11 samples [bytes] 4096 1048576 1429504
write_bytes 6 samples [bytes] 1910 1048576 1910466
get_info 165 samples [reqs]
connect 1 samples [reqs]
disconnect 1 samples [reqs]
statfs 84 samples [reqs]
create 2 samples [reqs]
destroy 75 samples [reqs]
setattr 1 samples [reqs]
punch 2 samples [reqs]
sync 7 samples [reqs]
preprw 17 samples [reqs]
commitrw 17 samples [reqs]
ping 95 samples [reqs]

snapshot_time 1364501362.939706 secs.usecs
read_bytes 770 samples [bytes] 4096 1048576 805322752
write_bytes 770 samples [bytes] 3013 1048576 805321669
get_info 164 samples [reqs]
connect 1 samples [reqs]
disconnect 1 samples [reqs]
statfs 84 samples [reqs]
create 2 samples [reqs]
destroy 77 samples [reqs]
punch 5 samples [reqs]
sync 6 samples [reqs]
preprw 1540 samples [reqs]
commitrw 1540 samples [reqs]
ping 83 samples [reqs]

snapshot_time 1364501362.939814 secs.usecs
read_bytes 3 samples [bytes] 8192 8192 24576
write_bytes 2 samples [bytes] 6096 6096 12192
get_info 157 samples [reqs]
connect 1 samples [reqs]
disconnect 1 samples [reqs]
statfs 84 samples [reqs]
create 2 samples [reqs]
destroy 74 samples [reqs]
punch 1 samples [reqs]
sync 3 samples [reqs]
preprw 5 samples [reqs]
commitrw 5 samples [reqs]
ping 96 samples [reqs]

snapshot_time 1364501362.939862 secs.usecs
read_bytes 2 samples [bytes] 8192 8192 16384
write_bytes 3 samples [bytes] 1916 6096 12108
get_info 153 samples [reqs]
connect 1 samples [reqs]
disconnect 1 samples [reqs]
statfs 84 samples [reqs]
create 2 samples [reqs]
destroy 75 samples [reqs]
punch 1 samples [reqs]
sync 4 samples [reqs]
preprw 5 samples [reqs]
commitrw 5 samples [reqs]
ping 95 samples [reqs]

snapshot_time 1364501362.939991 secs.usecs
read_bytes 2 samples [bytes] 8192 8192 16384
write_bytes 2 samples [bytes] 777 6096 6873
get_info 156 samples [reqs]
connect 1 samples [reqs]
disconnect 1 samples [reqs]
statfs 84 samples [reqs]
create 2 samples [reqs]
destroy 76 samples [reqs]
setattr 2 samples [reqs]
punch 3 samples [reqs]
sync 3 samples [reqs]
preprw 4 samples [reqs]
commitrw 4 samples [reqs]
ping 94 samples [reqs]

snapshot_time 1364501362.940079 secs.usecs
read_bytes 2 samples [bytes] 8192 8192 16384
write_bytes 2 samples [bytes] 1916 6096 8012
get_info 154 samples [reqs]
connect 1 samples [reqs]
disconnect 1 samples [reqs]
statfs 84 samples [reqs]
create 2 samples [reqs]
destroy 76 samples [reqs]
setattr 1 samples [reqs]
punch 1 samples [reqs]
sync 2 samples [reqs]
preprw 4 samples [reqs]
commitrw 4 samples [reqs]
ping 95 samples [reqs]

snapshot_time 1364501362.940243 secs.usecs
read_bytes 2 samples [bytes] 4096 12288 16384
write_bytes 2 samples [bytes] 12288 50400 62688
get_info 150 samples [reqs]
connect 1 samples [reqs]
disconnect 1 samples [reqs]
statfs 84 samples [reqs]
create 2 samples [reqs]
destroy 73 samples [reqs]
setattr 1 samples [reqs]
sync 2 samples [reqs]
preprw 4 samples [reqs]
commitrw 4 samples [reqs]
ping 97 samples [reqs]

There are 7 snapshot_time entries (7 OSTs) and they are all obdfilter.lustre-OST*.stats results. There has been some activity on the filesystem. There is no sign of the stats from "osd-*.lustre-OST*.stats". There are zero cache accesses to report on all 7 OSTs; either the cache is broken or it is just not reporting correctly. I noticed in the dmesg of the OST that there are really bad things happening in test 132.
https://maloo.whamcloud.com/test_logs/2613697e-9817-11e2-879d-52540035b04c/download

It looks like there is some communication breakdown and the OSTs remount a few times? There are bad messages that don't appear in other runs of the test. Perhaps the system is in some bad state after test 132. Part of the OST dmesg:

Lustre: DEBUG MARKER: test -b /dev/lvm-OSS/P5
Lustre: DEBUG MARKER: mkdir -p /mnt/ost5; mount -t lustre /dev/lvm-OSS/P5 /mnt/ost5
LDISKFS-fs (dm-4): mounted filesystem with ordered data mode. quota=on. Opts:
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
LustreError: 137-5: UUID 'lustre-OST0005_UUID' is not available for connect (no target)
LustreError: Skipped 1 previous similar message
LustreError: 4490:0:(ldlm_resource.c:1161:ldlm_resource_get()) lvbo_init failed for resource 663: rc -2
LustreError: 4490:0:(ldlm_resource.c:1161:ldlm_resource_get()) Skipped 184 previous similar messages

There is a lot more in the dmesg, but all in all it does not look good. |
| Comment by Keith Mannthey (Inactive) [ 03/Apr/13 ] |
|
I opened |
| Comment by Sarah Liu [ 08/Apr/13 ] |
|
Also seen this issue after upgrade from 1.8.9 to 2.4 and then add one new MDT: https://maloo.whamcloud.com/test_sets/a02cc9b2-9ec5-11e2-975f-52540035b04c |
| Comment by Keith Mannthey (Inactive) [ 09/Apr/13 ] |
|
As a summary, there is no sign of the /proc entries we are looking for. The kernel is not reporting values in this area. I have not been able to reproduce this outside of autotest. Other Lustre things in /proc seem to be working, so I am guessing that for some reason this chunk failed to initialize, rather than a whole lproc collapse? I am starting to work on a debug patch for Lustre to help identify what might be happening. |
| Comment by Keith Mannthey (Inactive) [ 10/Apr/13 ] |
|
I have started a debug run of sorts here: http://review.whamcloud.com/6006 |
| Comment by John Hammond [ 13/May/13 ] |
|
The cases with missing values are probably occurrences of |
| Comment by Sarah Liu [ 15/May/13 ] |
|
hit this issue when running interop between 2.3.0 client and 2.4-tag-2.3.65: |
| Comment by Keith Mannthey (Inactive) [ 16/May/13 ] |
|
I did a quick search in Maloo today. I didn't see any failures in the last 2 days. I will check again next week. |
| Comment by Keith Mannthey (Inactive) [ 22/May/13 ] |
|
OK one week into the |
| Comment by Keith Mannthey (Inactive) [ 30/May/13 ] |
|
There are still no errors to report with the "roc_hit" tests (sanity 151 and 156). It is pretty clear to me, and I would like to move to close these issues. Perhaps http://review.whamcloud.com/5648 should be reverted? It is not hugely invasive, but it does not need to be permanent debug code in the tree. |
| Comment by Keith Mannthey (Inactive) [ 06/Jun/13 ] |
|
OK, I decided it would be best to have a decent check in roc_hit to make sure this does not happen again. Please see: It is a patch that checks to make sure we are pulling information from the correct number of OSTs. It also reverts the v1 debug patch, as it is not needed. |
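A sketch of the kind of check described here, verifying that the expected number of OSTs actually report a cache_hit counter before any arithmetic is done (illustrative only; the landed patch may differ in detail):

# Sketch: count how many OSTs expose a cache_hit line and compare with
# $OSTCOUNT before trusting roc_hit values.
check_roc_hit_sources() {
    local reporting
    reporting=$(do_nodes $(comma_list $(osts_nodes)) \
        "lctl get_param -n osd-*.*.stats 2>/dev/null" | grep -c cache_hit)
    [ "$reporting" -eq "$OSTCOUNT" ] ||
        error "only $reporting of $OSTCOUNT OSTs report a cache_hit counter"
}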
| Comment by Keith Mannthey (Inactive) [ 07/Aug/13 ] |
|
The improved test patch has been landed. There is now a check to make sure all the lproc stats are present and working as expected. Please reopen if more work is needed. |