[LU-277] Test failure on test suite replay-single Created: 04/May/11  Updated: 13/Jun/11  Resolved: 13/Jun/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Niu Yawei (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 10277

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/7a33c9b8-71c7-11e0-80b5-52540025f9af.

This one looks like LU-184, which was fixed a while ago. I found a similar issue on replay-single and commented on LU-184 to check whether it is the same problem. Here is the link to that comment:
http://jira.whamcloud.com/browse/LU-184?focusedCommentId=12227&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12227



 Comments   
Comment by Peter Jones [ 04/May/11 ]

Niu

As you previously worked on LU-184, could you please comment?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 05/May/11 ]

The failure is caused by open replays from client-6-ib and client-21-ib. These two clients should not be involved in this test (our intention is to test client-23-ib); however, many open replay requests were still held on the other two clients. These open replays participated in the recovery and resulted in the test failure at the end.

I guess these open replays come from previous tests, maybe runracer. (Does runracer run before replay_single? Sarah, please correct me if I'm wrong.) I will look into the runracer test to see if there is anything wrong in the script.

How can the open replay fail? One possible reason occurs to me: we often set the MDS read-only in replay_single tests, so an open_create is sometimes not committed to disk. In test_20b, client-23-ib was evicted by the MDS, so some open_create replays from this client are lost, and the open replays of the same files from the other two clients fail with ENOENT.

Comment by Niu Yawei (Inactive) [ 08/May/11 ]

As shown on the maloo system, runracer ran just before this replay-single test: the replay-single test started on 2011-04-28 11:25:09 UTC, and a runracer run started on 2011-04-28 11:21:52 UTC and lasted 197 seconds, which puts its end at 11:25:09 UTC.

So I strongly suspect that something is wrong in the runracer script, which left some racer test threads not properly terminated. (As I mentioned in my previous comment, since the replay-single test often runs replay barrier, these unexpected racer test threads could result in recovery failure.)

The following line of runracer confused me:
running=$(do_nodes $clients "ps uax | grep $RDIR " | egrep -v "(acceptance|grep|pdsh|bash)" || true)
I don't see why 'acceptance' should be matched; it might be the culprit.

Hi Sarah,
Since I don't have any reserved nodes, could you help me run the following test?

  • Run runracer with auster (on two or three clients, for instance client-23-ib, client-21-ib and client-6-ib).
  • While the test is running, get the output of 'do_nodes $clients "ps uax | grep $RDIR "'
  • While the test is running, get the output of 'do_nodes $clients "ps uax | grep $RDIR " | egrep -v "(acceptance|grep|pdsh|bash)"'
    I want to see whether the criteria for finding racer test threads are still valid for auster. Thanks.
Comment by Niu Yawei (Inactive) [ 10/May/11 ]

Well, I finally got 3 nodes on Toro today, and after several runs of runracer, I found that pdsh often returns errors like:

pdsh@client-16-ib: client-17: read: protocol failure: Connection reset by peer
or
pdsh@client-16-ib: client-17: rcmd: xpoll (setting up stderr): Interrupted system call

So when such an error happens, the script will incorrectly conclude that no racer threads are running, because of the following check:
running=$(do_nodes $clients "ps uax | grep $RDIR " | egrep -v "(acceptance|grep|pdsh|bash)" || true)

I will try to come up with a patch to deal with the pdsh errors in the script.
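
To illustrate why the check above can be fooled, here is a minimal sketch. It is not the runracer script and not the eventual patch: fake_do_nodes_fail and fake_do_nodes_ok are hypothetical stand-ins for do_nodes, and the sketch assumes the remote-command failure is visible as a non-zero exit status, which for real pdsh depends on how it is invoked. With the original pattern, the exit status of the left side of the pipe is discarded and "|| true" hides the rest, so a failed query yields the same empty string as "nothing running". Capturing the output and the status before filtering lets the script tell the two cases apart.

```shell
#!/bin/bash
# Hypothetical stand-in for do_nodes when the connection drops:
# prints a pdsh-style error on stderr, produces no stdout, and fails.
fake_do_nodes_fail() {
    echo "pdsh@client-16-ib: client-17: read: protocol failure" >&2
    return 1
}

# Hypothetical stand-in for a healthy query with one racer thread alive.
fake_do_nodes_ok() {
    echo "client-17: root 1234 /mnt/lustre/racer/file_create.sh"
}

# Same shape as the check quoted above: a failed query and an idle
# cluster both leave $running empty.
check_original() {
    local running
    running=$("$1" 2>/dev/null | grep -Ev "(acceptance|grep|pdsh|bash)" || true)
    if [ -z "$running" ]; then echo "idle"; else echo "busy"; fi
}

# Sketch of a stricter check: test the query's own exit status first,
# and only then apply the filter to its output.
check_checked() {
    local out running
    if ! out=$("$1" 2>/dev/null); then
        echo "unknown"   # query failed, so nothing can be concluded
        return 0
    fi
    running=$(echo "$out" | grep -Ev "(acceptance|grep|pdsh|bash)" || true)
    if [ -z "$running" ]; then echo "idle"; else echo "busy"; fi
}

check_original fake_do_nodes_fail   # reports idle although the query failed
check_checked  fake_do_nodes_fail   # reports unknown instead
```

A caller could retry on "unknown" rather than assume that all racer threads have exited.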

Comment by Niu Yawei (Inactive) [ 11/May/11 ]

I tried many times to reproduce this bug by running "runracer + replay-single 20" over three clients, but never hit it, so I can't confirm whether the open replays come from runracer.

Hi Sarah, how often did you encounter this bug? If there isn't any reproducer and it has only been seen a few times, I suggest we put the investigation on hold until it becomes a real issue. Thanks.

Comment by Peter Jones [ 13/Jun/11 ]

Reopen if it recurs

Generated at Sat Feb 10 01:05:28 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.