Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-5244

conf-sanity test_32b: osp_sync_thread()) ASSERTION( count < 10 )

Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • Lustre 2.6.0
    • Lustre 2.6.0
    • None
    • 3
    • 14626

    Description

      This issue was created by maloo for wangdi <di.wang@intel.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/1c06a92c-fa14-11e3-883f-52540035b04c.

      The sub-test test_32b failed with the following error:

      test failed to respond and timed out

      04:48:35:LustreError: 985:0:(osp_sync.c:994:osp_sync_thread()) ASSERTION( count < 10 ) failed: t32fs-OST0000-osc-MDT0000: 5 5 empty
      04:48:35:LustreError: 985:0:(osp_sync.c:994:osp_sync_thread()) LBUG
      04:48:35:Pid: 985, comm: osp-syn-0-0
      04:48:35:
      04:48:35:Call Trace:
      04:48:35: [<ffffffffa0742895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      04:48:35: [<ffffffffa0742e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      04:48:35: [<ffffffffa16ca112>] osp_sync_thread+0x6c2/0x7d0 [osp]
      04:48:35: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      04:48:35: [<ffffffffa16c9a50>] ? osp_sync_thread+0x0/0x7d0 [osp]
      04:48:35: [<ffffffff8109ab56>] kthread+0x96/0xa0
      04:48:35: [<ffffffff8100c20a>] child_rip+0xa/0x20
      04:48:35: [<ffffffff8109aac0>] ? kthread+0x0/0xa0
      04:48:35: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      04:48:35:
      04:48:35:Kernel panic - not syncing: LBUG
      04:48:35:Pid: 985, comm: osp-syn-0-0 Tainted: G W --------------- 2.6.32-431.17.1.el6_lustre.g0eed638.x86_64 #1
      04:48:35:Call Trace:
      04:48:35: [<ffffffff8152795f>] ? panic+0xa7/0x16f
      04:48:35: [<ffffffffa0742eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      04:48:35: [<ffffffffa16ca112>] ? osp_sync_thread+0x6c2/0x7d0 [osp]
      04:48:35: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      04:48:35: [<ffffffffa16c9a50>] ? osp_sync_thread+0x0/0x7d0 [osp]
      04:48:35: [<ffffffff8109ab56>] ? kthread+0x96/0xa0
      04:48:35: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      04:48:35: [<ffffffff8109aac0>] ? kthread+0x0/0xa0
      04:48:35: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      04:48:35:Initializing cgroup subsys cpuset

      Info required for matching: conf-sanity 32b

      Attachments

        Issue Links

          Activity

            [LU-5244] conf-sanity test_32b: osp_sync_thread()) ASSERTION( count < 10 )

            Duplicate of LU-5188

            jlevi Jodi Levi (Inactive) added a comment - Duplicate of LU-5188

            Unfortunately, both LU-5244 and LU-5249 are causing so many test failures that it may not be possible for them to land independently, so re triggering them may not be enough. Instead, basing one patch on the other would allow the second to pass, then it could be landed, then the first one rebased and landed.

            Also, reverting the patch that is the root of these problems may fix both issues at once.

            adilger Andreas Dilger added a comment - Unfortunately, both LU-5244 and LU-5249 are causing so many test failures that it may not be possible for them to land independently, so re triggering them may not be enough. Instead, basing one patch on the other would allow the second to pass, then it could be landed, then the first one rebased and landed. Also, reverting the patch that is the root of these problems may fix both issues at once.

            The patch avoids the crash, but so far there isn't any explanation about why this started failing so seriously.

            adilger Andreas Dilger added a comment - The patch avoids the crash, but so far there isn't any explanation about why this started failing so seriously.
            utopiabound Nathaniel Clark added a comment - http://review.whamcloud.com/10805

            the idea was that at umount we invalidate the import and this should cause RPCs in-flight to abort quickly. I'm not very familiar with lnet internals and not sure the abort is very promptly in all the cases. I think it makes sense to see what's going on and why the RPCs weren't aborted in time.

            bzzz Alex Zhuravlev added a comment - the idea was that at umount we invalidate the import and this should cause RPCs in-flight to abort quickly. I'm not very familiar with lnet internals and not sure the abort is very promptly in all the cases. I think it makes sense to see what's going on and why the RPCs weren't aborted in time.

            I've bumped this to be a blocker, since it is causing very regular test failures in review-dne-part-1.

            adilger Andreas Dilger added a comment - I've bumped this to be a blocker, since it is causing very regular test failures in review-dne-part-1.

            This is a bad LASSERT(). I can't see any reason why "10" is a magic number before which the remote RPCs need to be completed? If we hit this on a test system, we will definitely hit this on some customer system when the MDS is busy, or the network is overloaded.

                    /* wait till all the requests are completed */
                    count = 0;
                    while (d->opd_syn_rpc_in_progress > 0) {
                            osp_sync_process_committed(&env, d);
            
                            lwi = LWI_TIMEOUT(cfs_time_seconds(5), NULL, NULL);
                            rc = l_wait_event(d->opd_syn_waitq,
                                              d->opd_syn_rpc_in_progress == 0,
                                              &lwi);
                            if (rc == -ETIMEDOUT)
                                    count++;
                            LASSERTF(count < 10, "%s: %d %d %sempty\n",
                                     d->opd_obd->obd_name, d->opd_syn_rpc_in_progress,
                                     d->opd_syn_rpc_in_flight,
                                     list_empty(&d->opd_syn_committed_there) ? "" : "!");
            
                    }
            

            There needs to be proper error handling here, either just to continue looping, or to break out and return an error.

            This was landed as commit 08f093ce2c799faf7a580f53850ecb13d2b71603:

                LU-2701 osp: wake up sync thread
                
                osp_sync_process_committed() to wake up sync thread when it
                is requested to stop (e.g. umount) and there is no pending
                work left. the patch adds a sanity check to ensure this
                process is not taking too long.
            

            "sanity check" != LASSERT()...

            adilger Andreas Dilger added a comment - This is a bad LASSERT(). I can't see any reason why "10" is a magic number before which the remote RPCs need to be completed? If we hit this on a test system, we will definitely hit this on some customer system when the MDS is busy, or the network is overloaded. /* wait till all the requests are completed */ count = 0; while (d->opd_syn_rpc_in_progress > 0) { osp_sync_process_committed(&env, d); lwi = LWI_TIMEOUT(cfs_time_seconds(5), NULL, NULL); rc = l_wait_event(d->opd_syn_waitq, d->opd_syn_rpc_in_progress == 0, &lwi); if (rc == -ETIMEDOUT) count++; LASSERTF(count < 10, "%s: %d %d %sempty\n" , d->opd_obd->obd_name, d->opd_syn_rpc_in_progress, d->opd_syn_rpc_in_flight, list_empty(&d->opd_syn_committed_there) ? "" : " !"); } There needs to be proper error handling here, either just to continue looping, or to break out and return an error. This was landed as commit 08f093ce2c799faf7a580f53850ecb13d2b71603: LU-2701 osp: wake up sync thread osp_sync_process_committed() to wake up sync thread when it is requested to stop (e.g. umount) and there is no pending work left. the patch adds a sanity check to ensure this process is not taking too long. "sanity check" != LASSERT()...

            Nathaniel,
            Could you have a look at this one?
            Thank you!

            jlevi Jodi Levi (Inactive) added a comment - Nathaniel, Could you have a look at this one? Thank you!

            People

              utopiabound Nathaniel Clark
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: