conf-sanity test_32b: osp_sync_thread()) ASSERTION( count < 10 )

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0
    • Affects Version/s: Lustre 2.6.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 14626

    Description

      This issue was created by maloo for wangdi <di.wang@intel.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/1c06a92c-fa14-11e3-883f-52540035b04c.

      The sub-test test_32b failed with the following error:

      test failed to respond and timed out

      04:48:35:LustreError: 985:0:(osp_sync.c:994:osp_sync_thread()) ASSERTION( count < 10 ) failed: t32fs-OST0000-osc-MDT0000: 5 5 empty
      04:48:35:LustreError: 985:0:(osp_sync.c:994:osp_sync_thread()) LBUG
      04:48:35:Pid: 985, comm: osp-syn-0-0
      04:48:35:
      04:48:35:Call Trace:
      04:48:35: [<ffffffffa0742895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      04:48:35: [<ffffffffa0742e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      04:48:35: [<ffffffffa16ca112>] osp_sync_thread+0x6c2/0x7d0 [osp]
      04:48:35: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      04:48:35: [<ffffffffa16c9a50>] ? osp_sync_thread+0x0/0x7d0 [osp]
      04:48:35: [<ffffffff8109ab56>] kthread+0x96/0xa0
      04:48:35: [<ffffffff8100c20a>] child_rip+0xa/0x20
      04:48:35: [<ffffffff8109aac0>] ? kthread+0x0/0xa0
      04:48:35: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      04:48:35:
      04:48:35:Kernel panic - not syncing: LBUG
      04:48:35:Pid: 985, comm: osp-syn-0-0 Tainted: G W --------------- 2.6.32-431.17.1.el6_lustre.g0eed638.x86_64 #1
      04:48:35:Call Trace:
      04:48:35: [<ffffffff8152795f>] ? panic+0xa7/0x16f
      04:48:35: [<ffffffffa0742eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      04:48:35: [<ffffffffa16ca112>] ? osp_sync_thread+0x6c2/0x7d0 [osp]
      04:48:35: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      04:48:35: [<ffffffffa16c9a50>] ? osp_sync_thread+0x0/0x7d0 [osp]
      04:48:35: [<ffffffff8109ab56>] ? kthread+0x96/0xa0
      04:48:35: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      04:48:35: [<ffffffff8109aac0>] ? kthread+0x0/0xa0
      04:48:35: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      04:48:35:Initializing cgroup subsys cpuset

      Info required for matching: conf-sanity 32b
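
      When triaging console logs like the one above, it can help to pull the source location and message out of the LustreError lines automatically. The following is a minimal sketch: the regular expression is inferred only from the two LustreError lines quoted in this ticket and is an assumption, not a documented Lustre log grammar, and `parse_lustre_error` is a hypothetical helper name.

```python
import re

# Pattern inferred from the LustreError lines quoted in this ticket.
# It is an assumption, not an official description of the log format.
LUSTRE_ERROR = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"(?P<msg>.*)"
)


def parse_lustre_error(line):
    """Return pid, source file, line number, function and message, or None."""
    m = LUSTRE_ERROR.search(line)
    if m is None:
        return None
    info = m.groupdict()
    info["pid"] = int(info["pid"])
    info["line"] = int(info["line"])
    return info


sample = (
    "04:48:35:LustreError: 985:0:(osp_sync.c:994:osp_sync_thread()) "
    "ASSERTION( count < 10 ) failed: t32fs-OST0000-osc-MDT0000: 5 5 empty"
)
parsed = parse_lustre_error(sample)
print(parsed["file"], parsed["line"], parsed["func"])
```

      Lines that are not LustreError records, such as the call trace entries, return None and can be skipped.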

          Activity

            [LU-5244] conf-sanity test_32b: osp_sync_thread()) ASSERTION( count < 10 )
            jlevi Jodi Levi (Inactive) made changes -
            Resolution New: Duplicate [ 3 ]
            Status Original: In Progress [ 3 ] New: Closed [ 6 ]

            jlevi Jodi Levi (Inactive) added a comment - Duplicate of LU-5188
            di.wang Di Wang made changes -
            Link New: This issue is related to LU-5188 [ LU-5188 ]

            adilger Andreas Dilger added a comment - Unfortunately, both LU-5244 and LU-5249 are causing so many test failures that it may not be possible for their fixes to land independently, so retriggering them may not be enough. Instead, basing one patch on the other would allow the second to pass testing and be landed, after which the first one could be rebased and landed. Also, reverting the patch that is the root of these problems may fix both issues at once.
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-5249 [ LU-5249 ]
            utopiabound Nathaniel Clark made changes -
            Status Original: Open [ 1 ] New: In Progress [ 3 ]

            adilger Andreas Dilger added a comment - The patch avoids the crash, but so far there isn't any explanation about why this started failing so seriously.
            utopiabound Nathaniel Clark added a comment - http://review.whamcloud.com/10805

            bzzz Alex Zhuravlev added a comment - The idea was that at umount we invalidate the import, which should cause in-flight RPCs to abort quickly. I'm not very familiar with LNet internals and am not sure the abort is prompt in all cases. I think it makes sense to see what's going on and why the RPCs weren't aborted in time.
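
            The failure mode described in this comment can be illustrated with a toy model. This is only a sketch under assumptions: it supposes the sync thread counts timed-out waits while RPCs remain outstanding and gives up once the count reaches 10, which is inferred from the `count < 10` assertion text and is not the code in osp_sync.c. The function name and both parameters are hypothetical.

```python
def drain_in_flight(rpcs_in_flight, completions_per_wait, max_timeouts=10):
    """Model a shutdown loop that waits for in-flight RPCs to finish.

    rpcs_in_flight: RPCs still outstanding when shutdown starts.
    completions_per_wait: RPCs that complete or abort during each wait.
    max_timeouts: timed-out waits tolerated before giving up; 10 mirrors
        the `count < 10` in the assertion, everything else is hypothetical.
    Returns the number of timed-out waits.
    """
    count = 0
    while rpcs_in_flight > 0:
        # One wait interval passes; some RPCs may complete or be aborted.
        rpcs_in_flight = max(0, rpcs_in_flight - completions_per_wait)
        if rpcs_in_flight > 0:
            count += 1  # the wait timed out with RPCs still outstanding
        if count >= max_timeouts:
            raise AssertionError(f"{rpcs_in_flight} RPCs still in flight")
    return count


# Prompt abort: invalidating the import clears every RPC in the first wait.
print(drain_in_flight(5, 5))
# Slow abort: one RPC finishes per wait, still within the limit.
print(drain_in_flight(5, 1))
```

            If no RPC ever completes (`completions_per_wait=0`), the model raises after ten timed-out waits, which corresponds to the situation the comment asks to investigate: RPCs that were not aborted in time.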

            adilger Andreas Dilger added a comment - I've bumped this to be a blocker, since it is causing very regular test failures in review-dne-part-1.

            People

              Assignee: utopiabound Nathaniel Clark
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 8
