[LU-4458] Interop 2.5.0<->2.6 failure on test suite recovery-small test_9

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.7.0, Lustre 2.5.4
    • Affects Version/s: Lustre 2.6.0, Lustre 2.7.0
    • Environment: server: lustre-master build # 1823 RHEL6 ldiskfs
      client: 2.5.0
    • 3
    • 12221

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/3ba4558e-77f3-11e3-a6a3-52540035b04c.

      The sub-test test_9 failed with the following error:

      test failed to respond and timed out

      A process in D state (uninterruptible sleep) was found on the OST:

      22:51:46:Lustre: DEBUG MARKER: == recovery-small test 9: pause bulk on OST (bug 1420) == 22:48:54 (1389077334)
      22:51:47:Lustre: DEBUG MARKER: lctl set_param fail_loc=0x214
      22:51:47:LustreError: 2046:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id 214 sleeping for 20000000ms
      22:51:47:INFO: task ll_ost_io00_002:2046 blocked for more than 120 seconds.
      22:51:47:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      22:51:47:ll_ost_io00_0 D 0000000000000001     0  2046      2 0x00000080
      22:51:47: ffff8802f8759a60 0000000000000046 ffff8802f8759ac0 0000000016734040
      22:57:44: ffffffffa0566ab0 ffff880316255389 0000004e359ea090 ffffffffa053c044
      22:57:44: ffff8803167345f8 ffff8802f8759fd8 000000000000fb88 ffff8803167345f8
      22:57:44:Call Trace:
      22:57:44: [<ffffffff8150f3f2>] schedule_timeout+0x192/0x2e0
      22:57:44: [<ffffffff810811e0>] ? process_timeout+0x0/0x10
      22:57:45: [<ffffffffa0520d0f>] __cfs_fail_timeout_set+0xcf/0x150 [libcfs]
      22:57:45: [<ffffffffa0eaaec9>] cfs_fail_timeout_set.clone.2+0x29/0x30 [ptlrpc]
      22:57:45: [<ffffffffa0eae94b>] tgt_brw_write+0x34b/0x1550 [ptlrpc]
      22:57:45: [<ffffffffa0525921>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      22:57:45: [<ffffffffa0eb0fea>] tgt_handle_request0+0x2ea/0x1490 [ptlrpc]
      22:57:45: [<ffffffffa0525921>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      22:57:45: [<ffffffffa0eb25ca>] tgt_request_handle+0x43a/0x980 [ptlrpc]
      22:57:45: [<ffffffffa0e65725>] ptlrpc_main+0xd25/0x1970 [ptlrpc]
      22:57:45: [<ffffffffa0e64a00>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
      22:57:46: [<ffffffff81096a36>] kthread+0x96/0xa0
      22:57:46: [<ffffffff8100c0ca>] child_rip+0xa/0x20
      22:57:46: [<ffffffff810969a0>] ? kthread+0x0/0xa0
      22:57:46: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      22:57:46:INFO: task ll_ost_io00_002:2046 blocked for more than 120 seconds.
      22:57:46:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      22:57:46:ll_ost_io00_0 D 0000000000000001     0  2046      2 0x00000080
      22:57:46: ffff8802f8759a60 0000000000000046 ffff8802f8759ac0 0000000016734040
      22:57:46: ffffffffa0566ab0 ffff880316255389 0000004e359ea090 ffffffffa053c044
      22:57:46: ffff8803167345f8 ffff8802f8759fd8 000000000000fb88 ffff8803167345f8
      22:57:46:Call Trace:
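
To put the numbers in the log above side by side, here is a minimal Python sketch that pulls the injected sleep duration and the hung-task threshold out of two lines copied from this console log. The helper name `summarize` and the regular expressions are illustrative assumptions, not part of any Lustre tooling, and the sketch only does unit arithmetic on the logged values; it makes no claim about why the sleep is that long.

```python
import re

# Two lines copied verbatim from the OST console log quoted in this ticket.
LOG = """\
22:51:47:LustreError: 2046:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id 214 sleeping for 20000000ms
22:51:47:INFO: task ll_ost_io00_002:2046 blocked for more than 120 seconds.
"""

# Hypothetical patterns written for this example only.
SLEEP_RE = re.compile(r"cfs_fail_timeout id (\w+) sleeping for (\d+)ms")
HUNG_RE = re.compile(r"task (\S+):(\d+) blocked for more than (\d+) seconds")


def summarize(log):
    """Extract the fail_loc sleep and the hung-task report from a log excerpt."""
    sleep = SLEEP_RE.search(log)
    hung = HUNG_RE.search(log)
    return {
        "fail_id": sleep.group(1),
        "sleep_seconds": int(sleep.group(2)) / 1000,  # logged in milliseconds
        "task": hung.group(1),
        "pid": int(hung.group(2)),
        "watchdog_seconds": int(hung.group(3)),
    }


summary = summarize(LOG)
print(summary)
```

With these inputs the logged sleep is 20000 seconds, well above the 120-second hung-task threshold, which is consistent with the ll_ost_io thread being reported as blocked and the test timing out.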
      

          Activity

            jlevi Jodi Levi (Inactive) added a comment - Patch landed to Master.

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13205/
            Subject: LU-4458 test: take fail_val into account for pause_bulk()
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set:
            Commit: ae75822728635169a1e23c5d71df3ad122dff0a3

            gerrit Gerrit Updater added a comment - Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/13205
            Subject: LU-4458 test: take fail_val into account for pause_bulk()
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: c5ae6409e1d830e8ae4445468f2f9def654c8108

            adilger Andreas Dilger added a comment - This test is still failing on average once or twice a day:
            https://testing.hpdd.intel.com/test_sets/654abbae-6b53-11e4-88ff-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/4b1792be-6b87-11e4-be53-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/40bc9864-6baf-11e4-88ff-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/7c70a83e-6c07-11e4-909d-5254006e85c2
            If these failures are a different bug, then this one should be closed and a new one opened.

            tappro Mikhail Pershin added a comment - The LU-793 patch to b2_5 should fix this issue

            tappro Mikhail Pershin added a comment - The pause_bulk() was changed in 2.6 and now this affects compatibility.

            jlevi Jodi Levi (Inactive) added a comment - Mike, Can you please have a look at this one? Thank you!

            People

              Assignee: tappro Mikhail Pershin
              Reporter: maloo Maloo