[LU-5244] conf-sanity test_32b: osp_sync_thread()) ASSERTION( count < 10 ) Created: 23/Jun/14  Updated: 30/Jun/14  Resolved: 30/Jun/14

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Nathaniel Clark
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Related
is related to LU-5249 conf-sanity test_32a: NULL pointer in... Resolved
is related to LU-5188 nbp6-OST002f-osc-MDT0000: invalid set... Resolved
Severity: 3
Rank (Obsolete): 14626

 Description   

This issue was created by maloo for wangdi <di.wang@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/1c06a92c-fa14-11e3-883f-52540035b04c.

The sub-test test_32b failed with the following error:

test failed to respond and timed out

04:48:35:LustreError: 985:0:(osp_sync.c:994:osp_sync_thread()) ASSERTION( count < 10 ) failed: t32fs-OST0000-osc-MDT0000: 5 5 empty
04:48:35:LustreError: 985:0:(osp_sync.c:994:osp_sync_thread()) LBUG
04:48:35:Pid: 985, comm: osp-syn-0-0
04:48:35:
04:48:35:Call Trace:
04:48:35: [<ffffffffa0742895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
04:48:35: [<ffffffffa0742e97>] lbug_with_loc+0x47/0xb0 [libcfs]
04:48:35: [<ffffffffa16ca112>] osp_sync_thread+0x6c2/0x7d0 [osp]
04:48:35: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
04:48:35: [<ffffffffa16c9a50>] ? osp_sync_thread+0x0/0x7d0 [osp]
04:48:35: [<ffffffff8109ab56>] kthread+0x96/0xa0
04:48:35: [<ffffffff8100c20a>] child_rip+0xa/0x20
04:48:35: [<ffffffff8109aac0>] ? kthread+0x0/0xa0
04:48:35: [<ffffffff8100c200>] ? child_rip+0x0/0x20
04:48:35:
04:48:35:Kernel panic - not syncing: LBUG
04:48:35:Pid: 985, comm: osp-syn-0-0 Tainted: G W --------------- 2.6.32-431.17.1.el6_lustre.g0eed638.x86_64 #1
04:48:35:Call Trace:
04:48:35: [<ffffffff8152795f>] ? panic+0xa7/0x16f
04:48:35: [<ffffffffa0742eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
04:48:35: [<ffffffffa16ca112>] ? osp_sync_thread+0x6c2/0x7d0 [osp]
04:48:35: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
04:48:35: [<ffffffffa16c9a50>] ? osp_sync_thread+0x0/0x7d0 [osp]
04:48:35: [<ffffffff8109ab56>] ? kthread+0x96/0xa0
04:48:35: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
04:48:35: [<ffffffff8109aac0>] ? kthread+0x0/0xa0
04:48:35: [<ffffffff8100c200>] ? child_rip+0x0/0x20
04:48:35:Initializing cgroup subsys cpuset

Info required for matching: conf-sanity 32b



 Comments   
Comment by Jodi Levi (Inactive) [ 23/Jun/14 ]

Nathaniel,
Could you have a look at this one?
Thank you!

Comment by Andreas Dilger [ 23/Jun/14 ]

This is a bad LASSERT(). I can't see any reason why "10" is a magic number before which the remote RPCs need to be completed? If we hit this on a test system, we will definitely hit this on some customer system when the MDS is busy, or the network is overloaded.

        /* wait till all the requests are completed */
        count = 0;
        while (d->opd_syn_rpc_in_progress > 0) {
                osp_sync_process_committed(&env, d);

                lwi = LWI_TIMEOUT(cfs_time_seconds(5), NULL, NULL);
                rc = l_wait_event(d->opd_syn_waitq,
                                  d->opd_syn_rpc_in_progress == 0,
                                  &lwi);
                if (rc == -ETIMEDOUT)
                        count++;
                LASSERTF(count < 10, "%s: %d %d %sempty\n",
                         d->opd_obd->obd_name, d->opd_syn_rpc_in_progress,
                         d->opd_syn_rpc_in_flight,
                         list_empty(&d->opd_syn_committed_there) ? "" : "!");

        }

There needs to be proper error handling here, either just to continue looping, or to break out and return an error.

This was landed as commit 08f093ce2c799faf7a580f53850ecb13d2b71603:

    LU-2701 osp: wake up sync thread
    
    osp_sync_process_committed() to wake up sync thread when it
    is requested to stop (e.g. umount) and there is no pending
    work left. the patch adds a sanity check to ensure this
    process is not taking too long.

"sanity check" != LASSERT()...

Comment by Andreas Dilger [ 24/Jun/14 ]

I've bumped this to be a blocker, since it is causing very regular test failures in review-dne-part-1.

Comment by Alex Zhuravlev [ 24/Jun/14 ]

the idea was that at umount we invalidate the import and this should cause RPCs in-flight to abort quickly. I'm not very familiar with lnet internals and not sure the abort is very promptly in all the cases. I think it makes sense to see what's going on and why the RPCs weren't aborted in time.

Comment by Nathaniel Clark [ 24/Jun/14 ]

http://review.whamcloud.com/10805

Comment by Andreas Dilger [ 24/Jun/14 ]

The patch avoids the crash, but so far there isn't any explanation about why this started failing so seriously.

Comment by Andreas Dilger [ 25/Jun/14 ]

Unfortunately, both LU-5244 and LU-5249 are causing so many test failures that it may not be possible for them to land independently, so re triggering them may not be enough. Instead, basing one patch on the other would allow the second to pass, then it could be landed, then the first one rebased and landed.

Also, reverting the patch that is the root of these problems may fix both issues at once.

Comment by Jodi Levi (Inactive) [ 30/Jun/14 ]

Duplicate of LU-5188

Generated at Sat Feb 10 01:49:46 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.