[LU-5244] conf-sanity test_32b: osp_sync_thread()) ASSERTION( count < 10 ) Created: 23/Jun/14 Updated: 30/Jun/14 Resolved: 30/Jun/14 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.6.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Nathaniel Clark |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 14626 |
| Description |
|
This issue was created by maloo for wangdi <di.wang@intel.com>. This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/1c06a92c-fa14-11e3-883f-52540035b04c. The sub-test test_32b failed with the following error:
Info required for matching: conf-sanity 32b |
| Comments |
| Comment by Jodi Levi (Inactive) [ 23/Jun/14 ] |
|
Nathaniel, |
| Comment by Andreas Dilger [ 23/Jun/14 ] |
|
This is a bad LASSERT(). I can't see any reason why "10" is a magic number before which the remote RPCs need to be completed? If we hit this on a test system, we will definitely hit this on some customer system when the MDS is busy, or the network is overloaded.

        /* wait till all the requests are completed */
        count = 0;
        while (d->opd_syn_rpc_in_progress > 0) {
                osp_sync_process_committed(&env, d);

                lwi = LWI_TIMEOUT(cfs_time_seconds(5), NULL, NULL);
                rc = l_wait_event(d->opd_syn_waitq,
                                  d->opd_syn_rpc_in_progress == 0,
                                  &lwi);
                if (rc == -ETIMEDOUT)
                        count++;
                LASSERTF(count < 10, "%s: %d %d %sempty\n",
                         d->opd_obd->obd_name, d->opd_syn_rpc_in_progress,
                         d->opd_syn_rpc_in_flight,
                         list_empty(&d->opd_syn_committed_there) ? "" : "!");
        }

There needs to be proper error handling here, either just to continue looping, or to break out and return an error.

This was landed as commit 08f093ce2c799faf7a580f53850ecb13d2b71603:

    LU-2701 osp: wake up sync thread
    osp_sync_process_committed() to wake up sync thread when it
    is requested to stop (e.g. umount) and there is no pending
    work left. the patch adds a sanity check to ensure this
    process is not taking too long.

"sanity check" != LASSERT()... |
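The "break out and return an error" option mentioned above can be sketched as a standalone userspace simulation. This is not the Lustre code or the patch that landed: `struct sync_dev`, `wait_for_drain()`, `sync_drain()` and the `drain_per_wait` knob are hypothetical names invented for illustration, and the timed wait is faked rather than calling `l_wait_event()`. The point is only that exceeding the timeout budget becomes a logged error return instead of a crash.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for the OSP device state; not the Lustre struct. */
struct sync_dev {
	int rpc_in_progress;	/* RPCs that still have to complete */
	int drain_per_wait;	/* simulation knob: RPCs finishing per wait */
};

/*
 * Simulated timed wait (assumption: replaces l_wait_event() with a timeout).
 * Returns 0 once nothing is in progress, -ETIMEDOUT otherwise.
 */
static int wait_for_drain(struct sync_dev *d)
{
	d->rpc_in_progress -= d->drain_per_wait;
	if (d->rpc_in_progress < 0)
		d->rpc_in_progress = 0;
	return d->rpc_in_progress == 0 ? 0 : -ETIMEDOUT;
}

/*
 * Wait for outstanding RPCs. When the number of timed-out waits reaches
 * max_timeouts, report the state and return an error to the caller
 * instead of asserting.
 */
static int sync_drain(struct sync_dev *d, int max_timeouts)
{
	int count = 0;

	/* A NULL device is a programming error, unlike a slow network. */
	assert(d != NULL);

	while (d->rpc_in_progress > 0) {
		int rc = wait_for_drain(d);

		if (rc == -ETIMEDOUT && ++count >= max_timeouts) {
			fprintf(stderr, "sync_drain: %d RPCs still in progress "
				"after %d timeouts\n",
				d->rpc_in_progress, count);
			return -ETIMEDOUT;
		}
	}
	return 0;
}
```

A caller would check the return value of `sync_drain(&dev, 10)` and decide whether to retry, keep waiting, or fail the shutdown path. The other option from the comment, simply continuing to loop, would amount to dropping the `max_timeouts` check and only logging.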
| Comment by Andreas Dilger [ 24/Jun/14 ] |
|
I've bumped this to be a blocker, since it is causing very regular test failures in review-dne-part-1. |
| Comment by Alex Zhuravlev [ 24/Jun/14 ] |
|
The idea was that at umount we invalidate the import, and this should cause in-flight RPCs to abort quickly. I'm not very familiar with LNet internals and am not sure the abort is prompt in all cases. I think it makes sense to see what's going on and why the RPCs weren't aborted in time. |
| Comment by Nathaniel Clark [ 24/Jun/14 ] |
| Comment by Andreas Dilger [ 24/Jun/14 ] |
|
The patch avoids the crash, but so far there isn't any explanation about why this started failing so seriously. |
| Comment by Andreas Dilger [ 25/Jun/14 ] |
|
Unfortunately, both Also, reverting the patch that is the root of these problems may fix both issues at once. |
| Comment by Jodi Levi (Inactive) [ 30/Jun/14 ] |
|
Duplicate of |