[LU-1705] Test failure on test suite racer Created: 03/Aug/12 Updated: 17/Sep/12 Resolved: 17/Sep/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Oleg Drokin |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 6356 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/7871a0f4-dbc1-11e1-81e3-52540035b04c.

I reran the test on OFED build, after racer completed, it seems the system cannot be cleaned up, following are logs from client. The test actually passed in the manual run

Stopping clients: client-4.lab.whamcloud.com,client-5-ib /mnt/lustre2 (opts:)
Stopping /mnt/mds1 (opts:-f) on client-3-ib
Stopping /mnt/ost1 (opts:-f) on fat-amd-4-ib
Stopping /mnt/ost2 (opts:-f) on fat-amd-4-ib
Stopping /mnt/ost3 (opts:-f) on fat-amd-4-ib
Stopping /mnt/ost4 (opts:-f) on fat-amd-4-ib
Stopping /mnt/ost5 (opts:-f) on fat-amd-4-ib
Stopping /mnt/ost6 (opts:-f) on fat-amd-4-ib
waited 0 for 3 ST mdc lustre-MDT0000-mdc-ffff8802def78400 b4256893-f3ba-91be-7ec9-98afa80c02b6 2
12 ST mdc lustre-MDT0000-mdc-ffff880310502400 a341bb6a-cd83-afc4-0651-511944cfcef6 2
waited 2 for 3 ST mdc lustre-MDT0000-mdc-ffff8802def78400 b4256893-f3ba-91be-7ec9-98afa80c02b6 2
12 ST mdc lustre-MDT0000-mdc-ffff880310502400 a341bb6a-cd83-afc4-0651-511944cfcef6 2
waited 6 for 3 ST mdc lustre-MDT0000-mdc-ffff8802def78400 b4256893-f3ba-91be-7ec9-98afa80c02b6 2
12 ST mdc lustre-MDT0000-mdc-ffff880310502400 a341bb6a-cd83-afc4-0651-511944cfcef6 2
waited 14 for 3 ST mdc lustre-MDT0000-mdc-ffff8802def78400 b4256893-f3ba-91be-7ec9-98afa80c02b6 2
12 ST mdc lustre-MDT0000-mdc-ffff880310502400 a341bb6a-cd83-afc4-0651-511944cfcef6 2
waited 30 for 3 ST mdc lustre-MDT0000-mdc-ffff8802def78400 b4256893-f3ba-91be-7ec9-98afa80c02b6 2
12 ST mdc lustre-MDT0000-mdc-ffff880310502400 a341bb6a-cd83-afc4-0651-511944cfcef6 2
waited 62 for 3 ST mdc lustre-MDT0000-mdc-ffff8802def78400 b4256893-f3ba-91be-7ec9-98afa80c02b6 2
12 ST mdc lustre-MDT0000-mdc-ffff880310502400 a341bb6a-cd83-afc4-0651-511944cfcef6 2
waited 126 for 3 ST mdc lustre-MDT0000-mdc-ffff8802def78400 b4256893-f3ba-91be-7ec9-98afa80c02b6 2
12 ST mdc lustre-MDT0000-mdc-ffff880310502400 a341bb6a-cd83-afc4-0651-511944cfcef6 2 |
| Comments |
| Comment by Peter Jones [ 07/Aug/12 ] |
|
Bob Could you please look into this one? Thanks Peter |
| Comment by Bob Glossman (Inactive) [ 07/Aug/12 ] |
|
The immediate cause of the failure is outstanding references on one of the OSTs at the time of unmount, after the racer test has completed OK and is trying to clean up and shut down. From racer.test_1.dmesg.client-26vm8.log:
Lustre: DEBUG MARKER: umount -d -f /mnt/ost6
followed by one of these every 30s:
Lustre: Mount still busy with 10 refs after 30 secs.
So somehow there are still outstanding references to lustre-OST0005 even though all clients have already been unmounted. I'm having a bit of trouble identifying what the references are. It's possible that there is some kind of ref count leak and the count is off. I may need some more expert help to pin this down. |
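A minimal sketch of how the dmesg symptom described above could be detected automatically. It assumes the message format quoted in the comment ("Mount still busy with N refs after M secs"). The function names are hypothetical and are not part of the Lustre test framework or Maloo.

```python
import re

# Matches the message quoted from the dmesg log in the comment above.
BUSY_RE = re.compile(r"Mount still busy with (\d+) refs after (\d+) secs")


def busy_mount_reports(dmesg_text):
    """Return a list of (refs, secs) tuples, one per 'Mount still busy' message."""
    return [(int(m.group(1)), int(m.group(2)))
            for m in BUSY_RE.finditer(dmesg_text)]


def refs_look_stuck(reports):
    """Heuristic (an assumption, not a Lustre rule): with at least two
    reports, a ref count that never decreases points at a leaked
    reference rather than a slow drain."""
    if len(reports) < 2:
        return False
    counts = [refs for refs, _ in reports]
    return all(later >= earlier for earlier, later in zip(counts, counts[1:]))
```

Running `busy_mount_reports` over each server's dmesg log after a test run would surface runs like this one, where the test is marked as passed but the unmount never completes.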
| Comment by Oleg Drokin [ 07/Aug/12 ] |
|
So after looking at the logs, the first question I have is: How was this failure originally flagged? The test itself is marked green and by looking at it from outside nothing seems out of the ordinary. Was this a lucky manual find? Could we have more of those happening undetected? Now to the problem at hand: it does seem to be a reference leak. All OST threads are idle, so none of them could hold a legitimate reference. |
| Comment by Peter Jones [ 07/Aug/12 ] |
|
Based on the testing for the last tag it looks like it was hit as part of this run https://maloo.whamcloud.com/test_sessions/3e75b8be-dbc0-11e1-81e3-52540035b04c Does this tell you what you need to know? |
| Comment by Oleg Drokin [ 08/Aug/12 ] |
|
Well, it's the 2.2.92 tag. Was this ever observable before? Is this repeatable now? |
| Comment by Peter Jones [ 08/Aug/12 ] |
|
Hmmm. There certainly are strange results for racer. Drill into the results on https://maloo.whamcloud.com/reports/show_pass_rate_report?in_last_n_days=28&source_code_branch=d776127c-4096-11e1-9cbd-5254004bbbd3 . It seems that many tests are reported as failing and yet when you drill in they passed. What needs to happen for racer to run more consistently? |
| Comment by Oleg Drokin [ 09/Aug/12 ] |
|
Indeed. |
| Comment by Oleg Drokin [ 09/Aug/12 ] |
|
Ah, also wanted to add that it's not happening every time, but it does happen from time to time. |
| Comment by Sarah Liu [ 15/Aug/12 ] |
|
another instance: https://maloo.whamcloud.com/test_sets/46f73324-e4f2-11e1-9681-52540035b04c |
| Comment by Peter Jones [ 17/Aug/12 ] |
|
Oleg is looking into this one |
| Comment by Peter Jones [ 22/Aug/12 ] |
|
|
| Comment by Sarah Liu [ 23/Aug/12 ] |
|
This seems to be another failure instance, found in tag-2.2.93 testing. Server: RHEL6, Client: SLES11. https://maloo.whamcloud.com/test_sets/e5842c24-ec3e-11e1-ba25-52540035b04c |
| Comment by Sarah Liu [ 31/Aug/12 ] |
|
another instance https://maloo.whamcloud.com/test_sets/0dfb8d56-f2c4-11e1-807d-52540035b04c |
| Comment by Sarah Liu [ 04/Sep/12 ] |
|
another failure: https://maloo.whamcloud.com/test_sets/061f9d34-f251-11e1-9def-52540035b04c |
| Comment by Jian Yu [ 17/Sep/12 ] |
|
The issue occurs regularly on the latest b2_3 builds. |
| Comment by Peter Jones [ 17/Sep/12 ] |
|
Closing as duplicate of |