[LU-277] Test failure on test suite replay-single Created: 04/May/11 Updated: 13/Jun/11 Resolved: 13/Jun/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 10277 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>. This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/7a33c9b8-71c7-11e0-80b5-52540025f9af. This one looks like LU-184, which has been fixed for a while. Actually I found a similar issue on replay-single and commented on |
| Comments |
| Comment by Peter Jones [ 04/May/11 ] |
|
Niu, as you previously worked on LU-184, could you please comment? Thanks, Peter |
| Comment by Niu Yawei (Inactive) [ 05/May/11 ] |
|
The failure is caused by open replays from client-6-ib and client-21-ib. These two clients should not be involved in this test (our intention is to test client-23-ib). However, many open replay requests were still held on the other two clients; these open replays participated in the recovery and resulted in the test failure at the end. I guess these open replays come from previous tests, maybe runracer. (Does runracer run before replay_single? Sarah, please correct me if I'm wrong.) I will look into the runracer test to see if there is anything wrong in the script. How can the open replay fail? One possible reason occurs to me: we often set the MDS read-only in replay_single tests, so an open_create is sometimes not committed to disk. In test_20b, client-23-ib was evicted by the MDS, so some open_create replays from this client will be lost, and the open replays of the same file from the other two clients will fail with ENOENT. |
| Comment by Niu Yawei (Inactive) [ 08/May/11 ] |
|
As shown on the maloo system, runracer ran just before this replay-single test: the replay-single test started at 2011-04-28 11:25:09 UTC, and a runracer run started at 2011-04-28 11:21:52 UTC and lasted 197 seconds. So I strongly suspect that something is wrong in the runracer script that caused some racer test threads not to be terminated properly. (As I mentioned in my previous comment, since the replay-single test often uses a replay barrier, these unexpected racer test threads could result in a recovery failure.) The following line of runracer confused me: Hi, Sarah
|
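The timing argument above can be checked with a little arithmetic. This is only a sketch: it assumes the 197 seconds is the wall-clock duration measured from the reported runracer start time, and `to_seconds` is a hypothetical helper, not part of the test framework.

```shell
# Convert HH:MM:SS to seconds since midnight (hypothetical helper).
to_seconds() {
    echo "$1" | {
        IFS=: read -r h m s
        # Strip one leading zero so values like "09" are not read as octal.
        echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
    }
}

racer_start=$(to_seconds 11:21:52)    # runracer start time from maloo
racer_end=$(( racer_start + 197 ))    # assumed: duration counted from the start
replay_start=$(to_seconds 11:25:09)   # replay-single start time from maloo

gap=$(( replay_start - racer_end ))
echo "gap between runracer end and replay-single start: ${gap}s"
```

Under that assumption the gap is 0 seconds, so any racer thread that outlived the script would still be active when replay-single began.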
| Comment by Niu Yawei (Inactive) [ 10/May/11 ] |
|
Well, I finally got 3 nodes on Toro today, and after several runs of runracer, I found that pdsh often returns errors like: pdsh@client-16-ib: client-17: read: protocol failure: Connection reset by peer. When such an error happens, the script incorrectly concludes that no racer threads are running, based on the following check: I will try to come up with a patch to deal with the pdsh errors in the script. |
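One possible way to avoid that misreading is sketched below. It is not the actual runracer check (which is not shown in this ticket) and not the eventual patch; `classify_racer_check` is a hypothetical helper. The idea is to treat any pdsh-level error line (pdsh prefixes its own diagnostics with `pdsh@<host>:`) as "unknown" instead of letting a failed query look like "no racer threads running".

```shell
# classify_racer_check OUTPUT   (hypothetical helper)
# OUTPUT is the combined stdout/stderr of a pdsh query for racer processes.
# Prints:
#   unknown  if pdsh itself reported an error, so the result cannot be trusted
#   running  if any remote output came back
#   idle     only when the query succeeded and returned nothing
classify_racer_check() {
    output=$1
    if printf '%s\n' "$output" | grep -q '^pdsh@'; then
        echo unknown
    elif [ -n "$output" ]; then
        echo running
    else
        echo idle
    fi
}

# Example with the error message reported above:
classify_racer_check 'pdsh@client-16-ib: client-17: read: protocol failure: Connection reset by peer'
```

A caller could retry the query while the result is `unknown`, and only proceed to the next test once it sees `idle`.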
| Comment by Niu Yawei (Inactive) [ 11/May/11 ] |
|
I tried many times to reproduce this bug by running "runracer + replay-single 20" over three clients, but never hit it, so I can't be sure whether the open replays come from runracer. Hi Sarah, how often did you encounter this bug? If there isn't a reproducer and it has only been seen a few times, I suggest we put the investigation on hold until it becomes a real issue. Thanks. |
| Comment by Peter Jones [ 13/Jun/11 ] |
|
Reopen if it reoccurs |