[LU-1705] Test failure on test suite racer Created: 03/Aug/12  Updated: 17/Sep/12  Resolved: 17/Sep/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Oleg Drokin
Resolution: Duplicate Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 6356

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/7871a0f4-dbc1-11e1-81e3-52540035b04c.

I reran the test on the OFED build. After racer completed, it seems the system could not be cleaned up; the following are logs from the client. The test itself actually passed in the manual run.

Stopping clients: client-4.lab.whamcloud.com,client-5-ib /mnt/lustre2 (opts:)
Stopping /mnt/mds1 (opts:-f) on client-3-ib
Stopping /mnt/ost1 (opts:-f) on fat-amd-4-ib
Stopping /mnt/ost2 (opts:-f) on fat-amd-4-ib
Stopping /mnt/ost3 (opts:-f) on fat-amd-4-ib
Stopping /mnt/ost4 (opts:-f) on fat-amd-4-ib
Stopping /mnt/ost5 (opts:-f) on fat-amd-4-ib
Stopping /mnt/ost6 (opts:-f) on fat-amd-4-ib
waited 0 for  3 ST mdc lustre-MDT0000-mdc-ffff8802def78400 b4256893-f3ba-91be-7ec9-98afa80c02b6 2
 12 ST mdc lustre-MDT0000-mdc-ffff880310502400 a341bb6a-cd83-afc4-0651-511944cfcef6 2
waited 2 for  3 ST mdc lustre-MDT0000-mdc-ffff8802def78400 b4256893-f3ba-91be-7ec9-98afa80c02b6 2
 12 ST mdc lustre-MDT0000-mdc-ffff880310502400 a341bb6a-cd83-afc4-0651-511944cfcef6 2
waited 6 for  3 ST mdc lustre-MDT0000-mdc-ffff8802def78400 b4256893-f3ba-91be-7ec9-98afa80c02b6 2
 12 ST mdc lustre-MDT0000-mdc-ffff880310502400 a341bb6a-cd83-afc4-0651-511944cfcef6 2
waited 14 for  3 ST mdc lustre-MDT0000-mdc-ffff8802def78400 b4256893-f3ba-91be-7ec9-98afa80c02b6 2
 12 ST mdc lustre-MDT0000-mdc-ffff880310502400 a341bb6a-cd83-afc4-0651-511944cfcef6 2
waited 30 for  3 ST mdc lustre-MDT0000-mdc-ffff8802def78400 b4256893-f3ba-91be-7ec9-98afa80c02b6 2
 12 ST mdc lustre-MDT0000-mdc-ffff880310502400 a341bb6a-cd83-afc4-0651-511944cfcef6 2
waited 62 for  3 ST mdc lustre-MDT0000-mdc-ffff8802def78400 b4256893-f3ba-91be-7ec9-98afa80c02b6 2
 12 ST mdc lustre-MDT0000-mdc-ffff880310502400 a341bb6a-cd83-afc4-0651-511944cfcef6 2
waited 126 for  3 ST mdc lustre-MDT0000-mdc-ffff8802def78400 b4256893-f3ba-91be-7ec9-98afa80c02b6 2
 12 ST mdc lustre-MDT0000-mdc-ffff880310502400 a341bb6a-cd83-afc4-0651-511944cfcef6 2
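The waits above roughly double on each pass (0, 2, 6, 14, 30, 62, 126 seconds): the client-side cleanup keeps re-checking the remaining devices with a sleep that doubles every round. As an illustration of that backoff pattern only (the actual cleanup lives in the shell test framework; devices_still_stopping() below is a hypothetical stand-in), a minimal C sketch:

/* Illustration only: a minimal C sketch of the doubling-backoff poll that
 * produces the "waited 0/2/6/14/30/62/126" lines above.  The real cleanup
 * is shell-based; devices_still_stopping() is a hypothetical stand-in for
 * "are any client devices still shutting down?". */
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static bool devices_still_stopping(void)
{
        return true;    /* in the failed run the mdc devices never go away */
}

int main(void)
{
        int waited = 0;           /* cumulative seconds waited so far */
        int interval = 2;         /* doubles each round: 2, 4, 8, 16, ... */
        const int max_wait = 300; /* give up eventually */

        while (devices_still_stopping() && waited < max_wait) {
                printf("waited %d, devices still stopping\n", waited);
                sleep(interval);
                waited += interval;
                interval *= 2;  /* yields waited = 0, 2, 6, 14, 30, 62, 126 */
        }
        return 0;
}

In the failed run the devices never go away, so the cleanup keeps waiting with ever longer intervals.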


 Comments   
Comment by Peter Jones [ 07/Aug/12 ]

Bob

Could you please look into this one?

Thanks

Peter

Comment by Bob Glossman (Inactive) [ 07/Aug/12 ]

The immediate cause of the failure is outstanding references on one of the OSTs at the time of unmount, after the racer test has completed OK and is trying to clean up and shut down. From racer.test_1.dmesg.client-26vm8.log:

Lustre: DEBUG MARKER: umount -d -f /mnt/ost6
LustreError: 5851:0:(obd_mount.c:257:server_put_mount()) lustre-OST0005: mount busy, vfscount=10!

followed by one of these every 30s:

Lustre: Mount still busy with 10 refs after 30 secs.
...
Lustre: Mount still busy with 10 refs after 60 secs.

So somehow there are still outstanding references to lustre-OST0005 even though all clients have already been unmounted.

I'm having a bit of trouble identifying what the references are. It's possible that there is some kind of ref count leak and the count is off. I may need some more expert help to pin this down.
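For reference, the messages quoted above describe unmount waiting for outstanding references on the target before the mount can be torn down. The following is not the actual obd_mount.c code, just a simplified user-space sketch of that wait-and-report pattern; mount_refs, maybe_release_ref() and server_wait_for_refs() are hypothetical names used only for illustration:

/* Simplified user-space sketch (not the actual obd_mount.c code) of the
 * wait-for-references pattern behind the "Mount still busy with N refs
 * after M secs" messages.  All names below are hypothetical. */
#include <stdio.h>
#include <unistd.h>

static int mount_refs = 3;      /* pretend per-target reference count */

/* Simulates whatever normally drops a reference; with a leaked reference
 * one holder never calls this, and the wait loop below never finishes. */
static void maybe_release_ref(int seconds_elapsed)
{
        if (seconds_elapsed % 10 == 0 && mount_refs > 0)
                mount_refs--;
}

static void server_wait_for_refs(const char *target)
{
        int waited = 0;

        while (mount_refs > 0) {
                sleep(1);
                waited++;
                maybe_release_ref(waited);
                if (mount_refs > 0 && waited % 30 == 0)
                        printf("Lustre: Mount still busy with %d refs after %d secs.\n",
                               mount_refs, waited);
        }
        printf("%s: all references dropped after %d secs, unmount can proceed\n",
               target, waited);
}

int main(void)
{
        server_wait_for_refs("lustre-OST0005");
        return 0;
}

If a reference is leaked, one holder never releases it, the count never reaches zero, and the "still busy" message repeats every 30 seconds indefinitely, which matches what the logs show.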

Comment by Oleg Drokin [ 07/Aug/12 ]

So after looking at the logs, the first question I have is:

How was this failure originally flagged? The test itself is marked green and by looking at it from outside nothing seems out of the ordinary. Was this a lucky manual find? Could we have more of those happening undetected?

Now, to the problem at hand: it does seem to be a reference leak; all OST threads are idle, so none of them could hold a legitimate reference.
But to see when it was introduced, we first need to better understand how this instance was caught.

Comment by Peter Jones [ 07/Aug/12 ]

Based on the testing for the last tag it looks like it was hit as part of this run https://maloo.whamcloud.com/test_sessions/3e75b8be-dbc0-11e1-81e3-52540035b04c

Does this tell you what you need to know?

Comment by Oleg Drokin [ 08/Aug/12 ]

Well, it's the 2.2.92 tag.

Was this ever observable before? Is this repeatable now?

Comment by Peter Jones [ 08/Aug/12 ]

Hmmm. There certainly are strange results for racer. Drill into the results on https://maloo.whamcloud.com/reports/show_pass_rate_report?in_last_n_days=28&source_code_branch=d776127c-4096-11e1-9cbd-5254004bbbd3 . It seems that many tests are reported as failing and yet when you drill in they passed. What needs to happen for racer to run more consistently?

Comment by Oleg Drokin [ 09/Aug/12 ]

Indeed.
I see this has been going on since at least July 28th, which is the first occurrence I saw:
https://maloo.whamcloud.com/sub_tests/acb34e2a-d96a-11e1-befd-52540035b04c

Comment by Oleg Drokin [ 09/Aug/12 ]

Ah, I also wanted to add that it's not happening every time, but it does happen from time to time.

Comment by Sarah Liu [ 15/Aug/12 ]

another instance: https://maloo.whamcloud.com/test_sets/46f73324-e4f2-11e1-9681-52540035b04c

Comment by Peter Jones [ 17/Aug/12 ]

Oleg is looking into this one

Comment by Peter Jones [ 22/Aug/12 ]

LU-1772 prevents us from seeing whether this issue still hits on the latest tag.

Comment by Sarah Liu [ 23/Aug/12 ]

This seems to be another failure instance, found in tag-2.2.93 testing. Server: RHEL6, Client: SLES11.

https://maloo.whamcloud.com/test_sets/e5842c24-ec3e-11e1-ba25-52540035b04c

Comment by Sarah Liu [ 31/Aug/12 ]

another instance https://maloo.whamcloud.com/test_sets/0dfb8d56-f2c4-11e1-807d-52540035b04c

Comment by Sarah Liu [ 04/Sep/12 ]

another failure: https://maloo.whamcloud.com/test_sets/061f9d34-f251-11e1-9def-52540035b04c

Comment by Jian Yu [ 17/Sep/12 ]

The issue occurs regularly on the latest b2_3 builds.
More instances are in LU-1908.

Comment by Peter Jones [ 17/Sep/12 ]

Closing as duplicate of LU-1908
