[LU-2902] sanity test_156: NOT IN CACHE: before: , after:
Created: 04/Mar/13  Updated: 21/Jul/14  Resolved: 07/Aug/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.1, Lustre 2.5.0

Type: Bug Priority: Major
Reporter: Maloo Assignee: Keith Mannthey (Inactive)
Resolution: Fixed Votes: 0
Labels: HB

Issue Links:
Duplicate
duplicates LU-2848 Failure on test suite sanity test_151... Resolved
is duplicated by LU-3094 Sanity test_132: @@@@@@ FAIL: some gl... Resolved
Related
is related to LU-2735 sanity.sh test_151: NOT IN CACHE: bef... Resolved
is related to LU-2848 Failure on test suite sanity test_151... Resolved
is related to LU-2979 sanity 133a: proc counter for mkdir o... Closed
is related to LU-5375 Failure on test suite sanity test_151... Resolved
Severity: 3
Rank (Obsolete): 6990

 Description   

This issue was created by maloo for Oleg Drokin <green@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/406900a6-84d3-11e2-9ab1-52540035b04c.

The sub-test test_156 failed with the following error:

NOT IN CACHE: before: 16741, after: 16741

This seems to have an astounding 21% failure rate and nobody has filed a ticket for it yet.
It might be related to the older LU-2009, which it seems was never really investigated.

Info required for matching: sanity 156



 Comments   
Comment by Keith Mannthey (Inactive) [ 05/Mar/13 ]

This is a dup of LU-2009 (a minor priority bug). Test 156 makes assumptions about cache hits and misses. It seems to me that if some part of the system is under memory pressure, this test may simply fail.

I looked into this a bit when I was looking at the sanity test 180 failures. When it fails, sometimes the data isn't in the cache and sometimes there is a greater than expected cache read value.

Oleg, can you take a look at the test and weigh in on the validity of the expected behavior? What about if we are in a low-memory situation? I fear this may be an "under ideal conditions" sort of test.

Comment by Keith Mannthey (Inactive) [ 07/Mar/13 ]

OK, from looking at the error logs there are several things that can go wrong.

NOT IN CACHE: before: 16741, after: 16741
This is where we expected a +3 read value. Simply put, it was not in the cache when we thought it should have been. It seems that Lustre under memory pressure may have dropped a page for external reasons. Perhaps there are more stats that can tell us a clearer story.

NOT IN CACHE: before: , after:
This seems quite bad: we call roc_hit (a wrapper for the obdfilter.lustre-OST*.stats data) and in some cases we just don't get any value back from this subfunction. That seems quite wrong.

IN CACHE: before: 16749, after: 16752
We didn't expect it to be in the cache, but it was.

NOT IN CACHE: before: 16749, after: 16755
It seems there were 2 successful reads on the system. A single read is +3 and this is +6, so the data may have been re-read.

In general it seems like:
1. Sometimes we can't read the values at all. This is something broken.
2. Sometimes there is an extra read in the accounting. I wonder if there are other interactions or some fail/retry logic going on here.
3. Sometimes we think it should be in the cache but it is not. I would think memory / external pressure could cause this.

Also, I think these issues affect all the tests that use roc_hit. I know LU-2848 and LU-2735 look a lot alike (they are both from sanity test_151).

I am just going to close LU-2009 and work this one.
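
For reference, roc_hit essentially sums the cache_hit sample counts reported by the OSTs. A minimal sketch of that kind of helper (run directly on the OSS; this is not the exact test-framework.sh code) looks like:

    # sum the cache_hit sample counts across all OST stats files
    lctl get_param -n obdfilter.*OST*.stats |
        awk '/cache_hit/ {sum += $2} END {print sum}'

Note that if no cache_hit line exists at all, the awk prints an empty string, which would produce exactly the "NOT IN CACHE: before: , after:" shape of failure.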

Comment by Keith Mannthey (Inactive) [ 07/Mar/13 ]

I am hoping to get a little more debug info from the below patch.

http://review.whamcloud.com/5648

It is untested, so it might need another try. The goal is to run sanity test_156 a bunch of times and log all the stats available (not just cache_hit) to get better resolution on what is happening on the system. Even if it works, it may need to be run a few times.
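
The core of the debug output is just dumping both stats files on the OSTs around the reads, along the lines of (a sketch of the approach, not the patch itself):

    # run on the OSS: capture every osd/obdfilter counter, not just cache_hit
    lctl get_param osd-*.lustre-OST*.stats obdfilter.lustre-OST*.stats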

Comment by Keith Mannthey (Inactive) [ 11/Mar/13 ]

My initial debug patch wasn't quite right and got lost in the test logs mess last week. I updated the patch.

My local VMs freed up enough today that I am running the debug test locally as well.

Comment by Keith Mannthey (Inactive) [ 12/Mar/13 ]

So far I have run 5000+ iterations of 156 in my local VMs. Still no sign of the errors; nothing else is running besides my loop, but I am down to about 80 MB of free memory in the VM. It has been slowly decreasing this morning (at 10 am or so I had about 150 MB free). I am going to let it crash out and see what happens.
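
The loop itself is nothing fancy; roughly (paths and iteration count are just what my VMs use):

    # re-run only sub-test 156 until something fails
    cd /usr/lib64/lustre/tests
    for i in $(seq 1 10000); do
        ONLY=156 sh sanity.sh || break
    done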

Comment by Keith Mannthey (Inactive) [ 12/Mar/13 ]

Well, nothing really to report. I added some memhog load to take the box near OOM, but this has been running for a few hours now without any change in the tests. I will let the memory pressure continue overnight, but if I don't capture anything overnight I will set my VMs to other work.
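
For the record, the memory pressure is just memhog (the one that ships with numactl) looping over a large allocation, something along the lines of:

    # keep ~700 MB allocated and re-touched so the page cache stays under pressure
    memhog -r1000 700m

The size is tuned by hand to sit near, but not over, the OOM line on these VMs.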

I may need to look at landing a patch for Master and let the autotest test review pool debug the issue some more.

Comment by Keith Mannthey (Inactive) [ 13/Mar/13 ]

Somewhere around 5500 total iterations, with some external memory pressure added, we encountered some errors. I don't know if the memory pressure matters or not.

In my debug patch I print out the entire proc entry being used.

The first section:

Turn off the read cache and turn on the write cache
Read again; it should be satisfied from the cache. 

snapshot_time 1363145753.405525 secs.usecs read_bytes 96380 samples [bytes] 4096 12288 789536768 write_bytes 21503 samples [bytes] 13 1048576 348218279 get_info 85877 samples [reqs] set_info_async 5359 samples [reqs] connect 2 samples [reqs] statfs 27705 samples [reqs] create 672 samples [reqs] destroy 21425 samples [reqs] setattr 5364 samples [reqs] punch 5361 samples [reqs] sync 21424 samples [reqs] preprw 117883 samples [reqs] commitrw 117883 samples [reqs] ping 2819 samples [reqs] snapshot_time 1363145753.407124 secs.usecs get_page 117883 samples [usec] 0 1056 298323 11433093 cache_access 144567 samples [pages] 1 1 144567 cache_hit 80316 samples [pages] 1 1 80316 cache_miss 64251 samples [pages] 1 1 64251


snapshot_time 1363145771.340504 secs.usecs read_bytes 96382 samples [bytes] 4096 12288 789553152 write_bytes 21503 samples [bytes] 13 1048576 348218279 get_info 85877 samples [reqs] set_info_async 5359 samples [reqs] connect 2 samples [reqs] statfs 27709 samples [reqs] create 672 samples [reqs] destroy 21425 samples [reqs] setattr 5364 samples [reqs] punch 5361 samples [reqs] sync 21424 samples [reqs] preprw 117885 samples [reqs] commitrw 117885 samples [reqs] ping 2822 samples [reqs] snapshot_time 1363145771.341000 secs.usecs get_page 117885 samples [usec] 0 1056 298331 11433157 cache_access 144570 samples [pages] 1 1 144570 cache_hit 80316 samples [pages] 1 1 80316 cache_miss 64254 samples [pages] 1 1 64254

 sanity test_156: @@@@@@ FAIL: NOT IN CACHE: before: 80316, after: 80316

Notes:
Why do we expect to turn off the read cache then read it from the cache again? <== more on this a little later.

Time has moved forward.
read_bytes + 2 (same as others)
statfs + 4
preprw + 2
commitrw + 2
ping + 3
get_page + 2
cache_access + 3 (it saw the read)
cache_miss + 3 (but it was not in the cache)
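
Those deltas are just per-counter differences between the two snapshots. A quick way to compute them, assuming one OST's stats are saved one counter per line into before.txt and after.txt (hypothetical file names), is:

    # print every counter whose sample count changed between the two snapshots
    awk 'NR==FNR {a[$1]=$2; next}
         $1 != "snapshot_time" && $2 != a[$1] {printf "%-15s +%d\n", $1, $2 - a[$1]}' \
        before.txt after.txt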

Let's look at the test code for this specific area.

    log "Turn off the read cache and turn on the write cache"
    set_cache read off  <==========  Isn't it safe to assume all cached data can be purged after this call?  All read cache data gone. No write cache.
    set_cache writethrough on  <===== Perhaps we should turn on writethrough first? A single read knocks it out of the writethrough cache?

    log "Read again; it should be satisfied from the cache."  <============== WHY?  i know it happens alot but I don't see why.  
    BEFORE=`roc_hit`
    cancel_lru_locks osc
    cat $file >/dev/null
    AFTER=`roc_hit`
    if ! let "AFTER - BEFORE == CPAGES"; then
        error "NOT IN CACHE: before: $BEFORE, after: $AFTER"
    else
        log "cache hits:: before: $BEFORE, after: $AFTER"
    fi

    log "Read again; it should not be satisfied from the cache."   <========  No state change but it should not be in the cache? 
    BEFORE=$AFTER
    cancel_lru_locks osc
    cat $file >/dev/null
    AFTER=`roc_hit`
    if ! let "AFTER - BEFORE == 0"; then
        error "IN CACHE: before: $BEFORE, after: $AFTER"
    else
        log "cache hits:: before: $BEFORE, after: $AFTER"
    fi
Comment by Keith Mannthey (Inactive) [ 13/Mar/13 ]

This is the next spot:

Turn on the read cache and turn off the write cache
Write data and read it back
It should not be satisfied from the cache.
3+0 records in
3+0 records out

12288 bytes (12 kB) copied, 0.0102791 s, 1.2 MB/s
snapshot_time 1363149708.491279 secs.usecs read_bytes 96672 samples [bytes] 4096 12288 791928832 write_bytes 21567 samples [bytes] 13 1048576 349004711 get_info 86134 samples [reqs] set_info_async 5376 samples [reqs] connect 2 samples [reqs] statfs 28513 samples [reqs] create 674 samples [reqs] destroy 21485 samples [reqs] setattr 5381 samples [reqs] punch 5382 samples [reqs] sync 21488 samples [reqs] preprw 118239 samples [reqs] commitrw 118239 samples [reqs] ping 3475 samples [reqs] snapshot_time 1363149708.491619 secs.usecs get_page 118239 samples [usec] 0 1056 299142 11440980 cache_access 145005 samples [pages] 1 1 145005 cache_hit 80550 samples [pages] 1 1 80550 cache_miss 64455 samples [pages] 1 1 64455

snapshot_time 1363149709.28799 secs.usecs read_bytes 96674 samples [bytes] 4096 12288 791945216 write_bytes 21568 samples [bytes] 13 1048576 349016999 get_info 86136 samples [reqs] set_info_async 5376 samples [reqs] connect 2 samples [reqs] statfs 28513 samples [reqs] create 674 samples [reqs] destroy 21485 samples [reqs] setattr 5381 samples [reqs] punch 5382 samples [reqs] sync 21489 samples [reqs] preprw 118242 samples [reqs] commitrw 118242 samples [reqs] ping 3475 samples [reqs] snapshot_time 1363149709.29176 secs.usecs get_page 118242 samples [usec] 0 1056 299147 11441005 cache_access 145008 samples [pages] 1 1 145008 cache_hit 80550 samples [pages] 1 1 80550 cache_miss 64458 samples [pages] 1 1 64458

cache hits:: before: 80550, after: 80550   
Read again; it should be satisfied from the cache.

snapshot_time 1363149740.756061 secs.usecs read_bytes 96674 samples [bytes] 4096 12288 791945216 write_bytes 21568 samples [bytes] 13 1048576 349016999 get_info 86136 samples [reqs] set_info_async 5376 samples [reqs] connect 2 samples [reqs] statfs 28520 samples [reqs] create 674 samples [reqs] destroy 21486 samples [reqs] setattr 5381 samples [reqs] punch 5382 samples [reqs] sync 21489 samples [reqs] preprw 118242 samples [reqs] commitrw 118242 samples [reqs] ping 3481 samples [reqs] snapshot_time 1363149740.757198 secs.usecs get_page 118242 samples [usec] 0 1056 299147 11441005 cache_access 145008 samples [pages] 1 1 145008 cache_hit 80550 samples [pages] 1 1 80550 cache_miss 64458 samples [pages] 1 1 64458

snapshot_time 1363149747.316476 secs.usecs read_bytes 96676 samples [bytes] 4096 12288 791961600 write_bytes 21568 samples [bytes] 13 1048576 349016999 get_info 86137 samples [reqs] set_info_async 5376 samples [reqs] connect 2 samples [reqs] statfs 28521 samples [reqs] create 674 samples [reqs] destroy 21486 samples [reqs] setattr 5381 samples [reqs] punch 5382 samples [reqs] sync 21489 samples [reqs] preprw 118244 samples [reqs] commitrw 118244 samples [reqs] ping 3482 samples [reqs] snapshot_time 1363149747.316992 secs.usecs get_page 118244 samples [usec] 0 1056 299148 11441006 cache_access 145011 samples [pages] 1 1 145011 cache_hit 80550 samples [pages] 1 1 80550 cache_miss 64461 samples [pages] 1 1 64461

 sanity test_156: @@@@@@ FAIL: NOT IN CACHE: before: 80550, after: 80550  <==== still a cache_miss same as above. 

In the first part of this section, note that the write_bytes sample count increased by 1 after the dd completed. It seems we were still writing when we set BEFORE.

The test

  log "Turn on the read cache and turn off the write cache"
    set_cache read on
    set_cache writethrough off

    log "Write data and read it back"
    log "It should not be satisfied from the cache."
    rm -f $file
    dd if=/dev/urandom of=$file bs=4k count=$CPAGES || error "dd failed"
    BEFORE=`roc_hit`
    cancel_lru_locks osc
    cat $file >/dev/null
    AFTER=`roc_hit`
    if ! let "AFTER - BEFORE == 0"; then  
        error_ignore 20762 "IN CACHE: before: $BEFORE, after: $AFTER"  <==== So in general this test is non deterministic. 
    else
        log "cache hits:: before: $BEFORE, after: $AFTER"
    fi

    log "Read again; it should be satisfied from the cache."   <==== If it didn't make it into the cache before why should it be there now? 
    BEFORE=`roc_hit`
    cancel_lru_locks osc
    cat $file >/dev/null
    AFTER=`roc_hit`
    if ! let "AFTER - BEFORE == CPAGES"; then
        error "NOT IN CACHE: before: $BEFORE, after: $AFTER"
    else
        log "cache hits:: before: $BEFORE, after: $AFTER"
    fi
Comment by Keith Mannthey (Inactive) [ 13/Mar/13 ]

And finally, this fun item: a missing value.

Turn on read and write cache
Write data and read it back.
Read should be satisfied from the cache.
3+0 records in
3+0 records out
12288 bytes (12 kB) copied, 0.0131985 s, 931 kB/s
snapshot_time 1363156626.318377 secs.usecs read_bytes 96978 samples [bytes] 4096 12288 794435584 write_bytes 21635 samples [bytes] 13 1048576 349840295 get_info 86398 samples [reqs] set_info_async 5401 samples [reqs] connect 2 samples [reqs] statfs 29921 samples [reqs] create 676 samples [reqs] destroy 21553 samples [reqs] setattr 5406 samples [reqs] punch 5407 samples [reqs] sync 21555 samples [reqs] preprw 118613 samples [reqs] commitrw 118613 samples [reqs] ping 4680 samples [reqs] snapshot_time 1363156626.319297 secs.usecs get_page 118613 samples [usec] 0 1056 299906 11448652 cache_access 145464 samples [pages] 1 1 145464 cache_hit 80805 samples [pages] 1 1 80805 cache_miss 64659 samples [pages] 1 1 64659
ost: Connection timed out during banner exchange^M
pdsh@client: ost: ssh exited with exit code 255
snapshot_time 1363156656.728803 secs.usecs read_bytes 96980 samples [bytes] 4096 12288 794451968 write_bytes 21635 samples [bytes] 13 1048576 349840295 get_info 86399 samples [reqs] set_info_async 5401 samples [reqs] connect 2 samples [reqs] statfs 29928 samples [reqs] create 676 samples [reqs] destroy 21553 samples [reqs] setattr 5406 samples [reqs] punch 5407 samples [reqs] sync 21556 samples [reqs] preprw 118615 samples [reqs] commitrw 118615 samples [reqs] ping 4685 samples [reqs] snapshot_time 1363156656.729848 secs.usecs get_page 118615 samples [usec] 0 1056 299908 11448656 cache_access 145467 samples [pages] 1 1 145467 cache_hit 80805 samples [pages] 1 1 80805 cache_miss 64662 samples [pages] 1 1 64662
 sanity test_156: @@@@@@ FAIL: NOT IN CACHE: before: , after: 80805

In my case I checked the logs, and the OOM killer killed ssh at that point. In total, ssh was killed twice on the OST, so it is no big surprise that this is messed up. For some reason, in the middle of the night, the OOM killer on the OST came out to play: it killed ssh a few times and then killed memhog on the OST. The client never entered OOM.

A little about the timing and the client / OST states.

Both stress loads had started by March 12 18:00; the client stress had already been running for 6+ hours.

On the OST, the first sign of the tage message (these continue until memhog is killed a bit later on):
Mar 12 19:41:30 ost kernel: cannot allocate a tage

The first FAIL seen:
On the Client Mar 12 19:41:26 client kernel: Lustre: DEBUG MARKER: sanity test_156: @@@@@@ FAIL: NOT IN CACHE: before: 80088, after: 80088
Mar 12 19:41:30 client kernel: cannot allocate a tage (2)

The last FAIL seen:
Mar 13 01:08:43 client kernel: Lustre: DEBUG MARKER: sanity test_156: @@@@@@ FAIL: NOT IN CACHE: before: 81057, after: 81057
Mar 13 01:08:43 client kernel: cannot allocate a tage (1)

The last tage message (memhog is still running on the client):
Mar 13 01:12:06 client kernel: cannot allocate a tage (309)

On the OST memhog was killed:
Mar 13 01:19:47 ost kernel: Killed process 28806, UID 0, (memhog) total-vm:1746940kB, anon-rss:735716kB, file-rss:36kB

There are no more errors or tage/OOM messages seen. It runs clean for the next 10 hours.

I am led to believe some sort of memory pressure on the OST is key to this issue. I am starting some memory-pressure-only testing; I am going to back off the memhog a bit, since I agree OOM is not what we want to see.

Comment by Keith Mannthey (Inactive) [ 14/Mar/13 ]

OK, a quick summary of last night's testing: OST memory pressure is key. The OST did not suffer any OOM, but it did log "cannot allocate a tage" for a few hours, and during those hours, and only those hours (the memory pressure was constant), did we get failures.

The Spots:

Turn on read and write cache
Write data and read it back.
Read should be satisfied from the cache.
3+0 records in
3+0 records out
12288 bytes (12 kB) copied, 0.00463795 s, 2.6 MB/s
cache hits:: before: 107160, after: 107163
Read again; it should be satisfied from the cache.
sanity test_156: @@@@@@ FAIL: NOT IN CACHE: before: 107163, after: 107163

Perhaps even the first read was a race to not be evicted.

A few of these.

Turn on the read cache and turn off the write cache
Write data and read it back
It should not be satisfied from the cache.
3+0 records in
3+0 records out
12288 bytes (12 kB) copied, 0.00447962 s, 2.7 MB/s
cache hits:: before: 107190, after: 107190
Read again; it should be satisfied from the cache.
sanity test_156: @@@@@@ FAIL: NOT IN CACHE: before: 107190, after: 107190

A few of these

cache hits:: before: 107250, after: 107253
Read again; it should be satisfied from the cache.
cache hits:: before: 107253, after: 107256
Turn off the read cache and turn on the write cache
Read again; it should be satisfied from the cache.
sanity test_156: @@@@@@ FAIL: NOT IN CACHE: before: 107256, after: 107256

Lots of the next one.

cache hits:: before: 107310, after: 107313
Turn off the read cache and turn on the write cache
Read again; it should be satisfied from the cache.
sanity test_156: @@@@@@ FAIL: NOT IN CACHE: before: 107313, after: 107313

These are the spots where I can see errors. I think it is important to note that they all represent the points farthest from a write to the check of whether something is in cache. If we write and then read right away we seem to find it, but if we wait a while it might be gone; like I said, it is the longer-duration spots. The whole test duration seems to be in the 17-25 second range, which means each chunk can take several seconds.

Perhaps there is a way the test can be self-aware and know the cache is under too much pressure to run? Is there a good way to look at the cache sizes or see cache evictions?

Also, the very serious and more common "NOT IN CACHE: before: , after: " needs a debug patch for master. I will clean up my dump-everything patch for roc_hit and push it. I would like to know whether we are reading some part of the data from the proc calls or not; right now we only know there is no cache_hit value to read. I don't think there is widespread shell killing going on in autotest (the only way I reproduced the issue).

Comment by Keith Mannthey (Inactive) [ 14/Mar/13 ]

I changed the debug patch for landing and am going to try to land it, in hopes of learning more about the errors seen.

http://review.whamcloud.com/5648

More debug may be needed but I would like to start here.

Comment by Oleg Drokin [ 15/Mar/13 ]

Could this be related to LU-2735 too? Can you reproduce with that patch? Is it just memory pressure playing tricks, dropping stuff from cache when we don't expect it to?

Comment by Keith Mannthey (Inactive) [ 15/Mar/13 ]

This is the same issue as LU-2735, and all the other tests that use the cache hit/miss stats (anything that uses roc_hit) hit this same issue. At least the read cache is the Linux page cache (that is what the manual says), which means that under memory pressure all bets are off as to whether data will stay in the cache for any amount of time.

I might be able to use the patch for LU-2735 carefully enough to pass 156, via the special file_loc flag that prevents removal from the LRU list. It is hard to say, as there is a lot of write cache involved. I do think that patch could address the dropped-from-cache issue, but it won't address the double-hit or the no-value-at-all parts of this LU.

The "NOT IN CACHE: before: , after: " error is the most common and least understood in Autotest.

As mentioned, I am going to attempt to make at least 156 memory-state aware: if we see too much memory pressure, we skip the test or consider it safe to allow to fail.
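
Roughly what I have in mind is a guard along these lines (a sketch with a hypothetical helper name and an arbitrary threshold, not landed code):

    # skip the cache-hit checks when the OSS is low on free memory
    check_oss_mem() {
        local free_kb=$(do_facet ost1 cat /proc/meminfo |
            awk '/MemFree/ {print $(NF-1)}')
        if [ "$free_kb" -lt 102400 ]; then        # under ~100 MB free
            skip "OST under memory pressure (${free_kb} kB free)"
            return 1
        fi
        return 0
    }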

Comment by Andreas Dilger [ 18/Mar/13 ]

Both LU-2848 and LU-2735 are also failing from roc_hit() results.

Comment by Andreas Dilger [ 18/Mar/13 ]

I think it makes sense to close this bug as a duplicate of LU-2848, or vice versa.

Comment by Jodi Levi (Inactive) [ 19/Mar/13 ]

Duplicate of LU-2848

Comment by Keith Mannthey (Inactive) [ 19/Mar/13 ]

It is not safe to close this issue as it represents the 3 issues seen by "roc_hit" users.

The only one we have a handle on is the early evictions due to memory pressure. There may be other improper evictions, but there is not enough data from the autotest runs to say what has happened.

For sure, the case

NOT IN CACHE: before: 16749, after: 16755

(it seems there were 2 successful reads on the system; a single read is +3 and this is +6, so the data may have been re-read) is not covered by any other LU. Debug data from autotest is needed.

Perhaps we want to split this up into a few tickets, but they all come down to roc_hit having trouble and cache behavior not being deterministic (hopefully it is just due to memory pressure, but there is no way to tell at this point; the tests need to be improved).

I am actively working on this issue as a blocker. If this is meant to close down the HB blocker and punt this to a minor issue, please close it again and I will stop working on it.

Comment by Keith Mannthey (Inactive) [ 19/Mar/13 ]

I spent a bit more time looking and working with the no-value case, and I found this:

[root@ost lustre-OST0000]# /usr/sbin/lctl get_param -n osd-*.lustre-OST*.stats obdfilter.lustre-OST*.stats
snapshot_time             1363657474.555425 secs.usecs
get_page                  1 samples [usec] 1 1 1 1
snapshot_time             1363657474.555530 secs.usecs
write_bytes               1 samples [bytes] 18 18 18
get_info                  9 samples [reqs]
set_info_async            1 samples [reqs]
connect                   3 samples [reqs]
disconnect                1 samples [reqs]
statfs                    243 samples [reqs]
create                    2 samples [reqs]
destroy                   3 samples [reqs]
sync                      2 samples [reqs]
preprw                    1 samples [reqs]
commitrw                  1 samples [reqs]
ping                      225 samples [reqs]

This was pretty much a fresh mount with no IO.

Did a little I/O, a sync, and a read.

[root@ost lustre-OST0000]# /usr/sbin/lctl get_param -n osd-*.lustre-OST*.stats obdfilter.lustre-OST*.stats
snapshot_time             1363657550.849448 secs.usecs
get_page                  44 samples [usec] 0 2 25 29
cache_access              1 samples [pages] 1 1 1
cache_hit                 1 samples [pages] 1 1 1
snapshot_time             1363657550.849551 secs.usecs
read_bytes                1 samples [bytes] 4096 4096 4096
write_bytes               43 samples [bytes] 18 1048576 42510794
get_info                  16 samples [reqs]
set_info_async            1 samples [reqs]
connect                   3 samples [reqs]
disconnect                1 samples [reqs]
statfs                    258 samples [reqs]
create                    2 samples [reqs]
destroy                   6 samples [reqs]
sync                      3 samples [reqs]
preprw                    44 samples [reqs]
commitrw                  44 samples [reqs]
ping                      236 samples [reqs]

It appears we never display "0" for some proc values. The proc interface should be static so that tools can be built around it. I am looking into why "0" is not displayed. You can see that "cache_miss" (which should report 0) is not displayed at this time on my system.
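
A quick way to see which counters a stats file actually exposes (run on the OSS; any counter that has never been incremented, like cache_miss here, simply has no line to read):

    # list the counter names currently present in the obdfilter stats
    lctl get_param -n obdfilter.lustre-OST*.stats | awk '{print $1}' | sort -u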

Comment by Keith Mannthey (Inactive) [ 19/Mar/13 ]

Well I have been looking into the lproc subsystem for a bit.

I see in lustre/obdclass/lprocfs_status.c (this seems to be the correct spot for the values in question)

static int lprocfs_stats_seq_show(struct seq_file *p, void *v)
{
        struct lprocfs_stats            *stats  = p->private;
        struct lprocfs_counter          *cntr   = v;
        struct lprocfs_counter          ret;
        struct lprocfs_counter_header   *header;
        int                             entry_size;
        int                             idx;
        int                             rc      = 0;

        if (cntr == &(stats->ls_percpu[0])->lp_cntr[0]) {
                struct timeval now;
                cfs_gettimeofday(&now);
                rc = seq_printf(p, "%-25s %lu.%lu secs.usecs\n",
                                "snapshot_time", now.tv_sec, now.tv_usec);
                if (rc < 0)
                        return rc;
        }
        entry_size = sizeof(*cntr);
        if (stats->ls_flags & LPROCFS_STATS_FLAG_IRQ_SAFE)
                entry_size += sizeof(__s64);
        idx = ((void *)cntr - (void *)&(stats->ls_percpu[0])->lp_cntr[0]) /
                entry_size;

        header = &stats->ls_cnt_header[idx];
        lprocfs_stats_collect(stats, idx, &ret);

        if (ret.lc_count == 0)    <======  Can someone please explain this?  It is one source of roc_hit getting no value back
                goto out;
...

Git blame says I need to mail BobiJam.

ca461f0f (Bobi Jam               2013-01-19 00:54:32 +0800 1542)        if (ret.lc_count == 0)
ca461f0f (Bobi Jam               2013-01-19 00:54:32 +0800 1543)                goto out;

But that commit just re-indented the lines.

Comment by Keith Mannthey (Inactive) [ 19/Mar/13 ]

Well the real code has been around for a long time.

d62360b7 (nathan                 2008-09-22 22:20:42 +0000 1505)        if (ret.lc_count == 0)
d62360b7 (nathan                 2008-09-22 22:20:42 +0000 1506)                goto out;

Any input is appreciated. I will play with removing those lines tomorrow.

Comment by Andreas Dilger [ 19/Mar/13 ]

I think the intent is to not print stats for operations that are not relevant for a particular device (e.g. "rename" for OSTs, or "read" for MDTs). One would think that by the time test_156 rolls around there would have been read operations on the device?

Comment by Keith Mannthey (Inactive) [ 19/Mar/13 ]

It would be nice if we could land http://review.whamcloud.com/5648 so we can see the /proc state when we are in autotest and getting these errors.

I agree it "seeems" like things should be well and good before we get here but perhaps there is some reset or fail over that resets could fail stats. I will wait for info from autotest as the issue will hopefully be clearer. For the larger issue it could save memory/space and only registrar values that are relevant to a given lu object type, but I am sure that is out of the scope of this LU.

Comment by Jodi Levi (Inactive) [ 26/Mar/13 ]

Reducing in priority until this occurs again.
We should review this in the future to see whether it is still occurring.

Comment by Keith Mannthey (Inactive) [ 27/Mar/13 ]

I thought I would give an update. The issue has not failed (test 151 or 156) since the debug patch was landed; I spent some time looking in Maloo. There are failures on OpenSFS systems with old code, but nothing in the current master space.

I am all for leaving the debug patch in (Andreas considered it OK to leave in for the longer term) and we will see what happens.

Comment by Sebastien Buisson (Inactive) [ 29/Mar/13 ]

It seems we hit two new occurrences of this problem on master yesterday:
https://maloo.whamcloud.com/test_sets/016cbfcc-9816-11e2-879d-52540035b04c
https://maloo.whamcloud.com/test_sets/ab3b7452-97f4-11e2-a652-52540035b04c

Comment by Keith Mannthey (Inactive) [ 29/Mar/13 ]

Thanks for the links:

https://maloo.whamcloud.com/test_sets/016cbfcc-9816-11e2-879d-52540035b04c
sanity test_151 AND test_156: @@@@@@ FAIL: NOT IN CACHE: before: , after:

This is the "after" call with some spaces added for readability.

snapshot_time 1364501362.939531 secs.usecs read_bytes 11 samples [bytes] 4096 1048576 1429504 write_bytes 6 samples [bytes] 1910 1048576 1910466 get_info 165 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 84 samples [reqs] create 2 samples [reqs] destroy 75 samples [reqs] setattr 1 samples [reqs] punch 2 samples [reqs] sync 7 samples [reqs] preprw 17 samples [reqs] commitrw 17 samples [reqs] ping 95 samples [reqs]
snapshot_time 1364501362.939706 secs.usecs read_bytes 770 samples [bytes] 4096 1048576 805322752 write_bytes 770 samples [bytes] 3013 1048576 805321669 get_info 164 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 84 samples [reqs] create 2 samples [reqs] destroy 77 samples [reqs] punch 5 samples [reqs] sync 6 samples [reqs] preprw 1540 samples [reqs] commitrw 1540 samples [reqs] ping 83 samples [reqs]
snapshot_time 1364501362.939814 secs.usecs read_bytes 3 samples [bytes] 8192 8192 24576 write_bytes 2 samples [bytes] 6096 6096 12192 get_info 157 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 84 samples [reqs] create 2 samples [reqs] destroy 74 samples [reqs] punch 1 samples [reqs] sync 3 samples [reqs] preprw 5 samples [reqs] commitrw 5 samples [reqs] ping 96 samples [reqs]
snapshot_time 1364501362.939862 secs.usecs read_bytes 2 samples [bytes] 8192 8192 16384 write_bytes 3 samples [bytes] 1916 6096 12108 get_info 153 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 84 samples [reqs] create 2 samples [reqs] destroy 75 samples [reqs] punch 1 samples [reqs] sync 4 samples [reqs] preprw 5 samples [reqs] commitrw 5 samples [reqs] ping 95 samples [reqs]
snapshot_time 1364501362.939991 secs.usecs read_bytes 2 samples [bytes] 8192 8192 16384 write_bytes 2 samples [bytes] 777 6096 6873 get_info 156 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 84 samples [reqs] create 2 samples [reqs] destroy 76 samples [reqs] setattr 2 samples [reqs] punch 3 samples [reqs] sync 3 samples [reqs] preprw 4 samples [reqs] commitrw 4 samples [reqs] ping 94 samples [reqs]
snapshot_time 1364501362.940079 secs.usecs read_bytes 2 samples [bytes] 8192 8192 16384 write_bytes 2 samples [bytes] 1916 6096 8012 get_info 154 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 84 samples [reqs] create 2 samples [reqs] destroy 76 samples [reqs] setattr 1 samples [reqs] punch 1 samples [reqs] sync 2 samples [reqs] preprw 4 samples [reqs] commitrw 4 samples [reqs] ping 95 samples [reqs]
snapshot_time 1364501362.940243 secs.usecs read_bytes 2 samples [bytes] 4096 12288 16384 write_bytes 2 samples [bytes] 12288 50400 62688 get_info 150 samples [reqs] connect 1 samples [reqs] disconnect 1 samples [reqs] statfs 84 samples [reqs] create 2 samples [reqs] destroy 73 samples [reqs] setattr 1 samples [reqs] sync 2 samples [reqs] preprw 4 samples [reqs] commitrw 4 samples [reqs] ping 97 samples [reqs]

There are 7 snapshot_time (7 OSTs) entries and they are all obdfilter.lustre-OST*.stats results. There has been some activity on the filesystem.

There is no sign of the stats from "osd-*.lustre-OST*.stats".

There are zero cache accesses to report on all 7 OSTs; either the cache is broken or it is just not reporting correctly.

I noticed in the dmesg of the OST that there are really bad things happening in test 132. https://maloo.whamcloud.com/test_logs/2613697e-9817-11e2-879d-52540035b04c/download

It looks like there is some communication breakdown and the OSTs remount a few times?

There are bad messages that don't appear in other runs of the test. Perhaps the system is in some bad state after test 132.

Part of the OST dmesg:

Lustre: DEBUG MARKER: test -b /dev/lvm-OSS/P5
Lustre: DEBUG MARKER: mkdir -p /mnt/ost5; mount -t lustre   		                   /dev/lvm-OSS/P5 /mnt/ost5
LDISKFS-fs (dm-4): mounted filesystem with ordered data mode. quota=on. Opts: 
Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
LustreError: 137-5: UUID 'lustre-OST0005_UUID' is not available for connect (no target)
LustreError: Skipped 1 previous similar message
LustreError: 4490:0:(ldlm_resource.c:1161:ldlm_resource_get()) lvbo_init failed for resource 663: rc -2
LustreError: 4490:0:(ldlm_resource.c:1161:ldlm_resource_get()) Skipped 184 previous similar messages

There is a lot more in the dmesg, but all in all it does not look good.

Comment by Keith Mannthey (Inactive) [ 03/Apr/13 ]

I opened LU-3094 to track the 132 issues.

Comment by Sarah Liu [ 08/Apr/13 ]

Also seen this issue after upgrading from 1.8.9 to 2.4 and then adding one new MDT:

https://maloo.whamcloud.com/test_sets/a02cc9b2-9ec5-11e2-975f-52540035b04c

Comment by Keith Mannthey (Inactive) [ 09/Apr/13 ]

LU-3094 does not seem to be the root issue for this problem.

As a summary, there is no sign of the /proc values we are looking for. The kernel is not reporting values in this area.

I have not been able to reproduce this outside of autotest. Other Lustre entries in /proc seem to be working, so I am guessing that for some reason this chunk failed to initialize, rather than a wholesale lproc collapse.

I am starting to work on a debug patch for Lustre to help identify what might be happening.

Comment by Keith Mannthey (Inactive) [ 10/Apr/13 ]

I have started a debug run of sorts here: http://review.whamcloud.com/6006

Comment by John Hammond [ 13/May/13 ]

The cases with missing values are probably occurrences of LU-2979.

Comment by Sarah Liu [ 15/May/13 ]

hit this issue when running interop between 2.3.0 client and 2.4-tag-2.3.65:
https://maloo.whamcloud.com/test_sets/e89c3aac-bbee-11e2-b013-52540035b04c

Comment by Keith Mannthey (Inactive) [ 16/May/13 ]

I did a quick search in Maloo today. I didn't see any failures in the last 2 days. I will check again next week.

Comment by Keith Mannthey (Inactive) [ 22/May/13 ]

OK, one week after the LU-2979 patch landed, there are no failures of sanity 151 or 156. This is good news.

Comment by Keith Mannthey (Inactive) [ 30/May/13 ]

There are still no errors to report with the "roc_hit" tests (sanity 151 and 156). It is pretty clear to me that LU-2979 is the root cause of the errors that were seen: the lproc entries were just not being exposed to userspace, and there is no sign of any cache issues in autotest.

I would like to move to close these issues.

Perhaps http://review.whamcloud.com/5648 should be reverted? It is not hugely invasive, but it does not need to be permanent debug code in the tree.

Comment by Keith Mannthey (Inactive) [ 06/Jun/13 ]

OK, I decided it would be best to have a decent check in roc_hit to make sure this does not happen again.

Please see:
http://review.whamcloud.com/6564

It is a patch that checks to make sure we are pulling information from the correct number of OSTs. It also reverts the v1 debug patch, as it is no longer needed.
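
The gist of the check is to count how many OSTs actually report the counter before trusting the sums; roughly (a sketch of the idea, not the exact code in change 6564):

    # fail loudly if not every OST reports a cache_hit counter
    check_roc_hit_stats() {
        local got=$(do_nodes $(comma_list $(osts_nodes)) \
            "$LCTL get_param -n obdfilter.*OST*.stats" | grep -c cache_hit)
        [ "$got" -eq "$OSTCOUNT" ] ||
            error "only $got of $OSTCOUNT OSTs report cache_hit stats"
    }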

Comment by Keith Mannthey (Inactive) [ 07/Aug/13 ]

The improved test patch has been landed. There is now a check to make sure all the lproc stats are present and working as expected.

Please reopen if more work is needed.
