LU-5023: sanity-lfsck test_12: OST lfsck_stop hung

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.7.0
    • None
    • None
    • 3
    • 13903

    Description

      This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/4d306b04-d1ed-11e3-a9d1-52540035b04c.

      Searching in Maloo, I see 8 similar failures reported starting 5/1/2014, so I suspect something happened then that started causing these failures. I don't think it is specific to the mod under test.

      The sub-test test_12 failed with the following error:

      test failed to respond and timed out

      Info required for matching: sanity-lfsck 12

      Attachments

      Issue Links

      Activity

            pjones Peter Jones added a comment -

            Landed for 2.7


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/10276/
            Subject: LU-5023 tests: check FID seq properly for sanity-lfsck t_11b
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 89a53691076ba45ef415de511ed840a0e3cbdd2c

            yong.fan nasf (Inactive) added a comment -

            > https://maloo.whamcloud.com/test_sets/2eada7a0-d4c7-11e3-808a-52540035b04c
            > https://maloo.whamcloud.com/test_sets/465c7ec4-d33e-11e3-a102-52540035b04c
            > https://maloo.whamcloud.com/test_sets/542c478e-d75b-11e3-9d16-52540035b04c

            In the above failure cases, test_11b failed because of improper FID seq checking and error handling. I have updated patch 10276 to resolve that.

            bogl Bob Glossman (Inactive) added a comment -

            more:
            https://maloo.whamcloud.com/test_sets/2eada7a0-d4c7-11e3-808a-52540035b04c
            https://maloo.whamcloud.com/test_sets/465c7ec4-d33e-11e3-a102-52540035b04c
            https://maloo.whamcloud.com/test_sets/542c478e-d75b-11e3-9d16-52540035b04c

            yong.fan nasf (Inactive) added a comment -

            The debug patch for more logs:

            http://review.whamcloud.com/10276

            yong.fan nasf (Inactive) added a comment -

            We cannot say that there is a deadlock on the OST. The real scenario is probably as follows:

            1) The LFSCK thread on the OST is doing something but is blocked.
            2) At that time, the first lfsck_stop RPC from the MDT comes in; the RPC handler (on the OST) holds the lfsck::li_mutex to signal the LFSCK thread, then waits until the LFSCK thread has stopped.
            3) Because the LFSCK thread is blocked on some unknown event, it cannot respond to the stop signal, so the first stop RPC is blocked there.
            4) The subsequent lfsck_stop RPC handlers are then blocked on the lfsck::li_mutex held by the first lfsck_stop RPC handler.

            So we need to find out what the LFSCK thread was doing at that time. Unfortunately, the current logs are not enough to show that. I will make a debug patch to collect more logs.
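            To make the blocking chain described above concrete, here is a minimal user-space sketch of the same pattern. It is not the lfsck code: the names lfsck_thread and lfsck_stop_handler are only illustrative, li_mutex stands in for lfsck::li_mutex, and synchronisation around scan_running is deliberately simplified.

            /*
             * Sketch of the hang described in the comment above:
             *  - the scan thread is stuck waiting for some unrelated event;
             *  - the first "stop" handler takes li_mutex and then waits for the
             *    scan thread to exit;
             *  - every later "stop" handler blocks in pthread_mutex_lock() behind it.
             * The program intentionally hangs, mirroring the stuck OST service threads.
             */
            #include <pthread.h>
            #include <stdbool.h>
            #include <stdio.h>
            #include <unistd.h>

            static pthread_mutex_t li_mutex = PTHREAD_MUTEX_INITIALIZER; /* like lfsck::li_mutex */
            static pthread_mutex_t ev_lock  = PTHREAD_MUTEX_INITIALIZER;
            static pthread_cond_t  ev_cond  = PTHREAD_COND_INITIALIZER;
            static bool unknown_event = false; /* the event the scan thread is blocked on */
            static bool scan_running  = true;

            /* 1) The LFSCK scan thread: blocked on an event that is never signalled. */
            static void *lfsck_thread(void *arg)
            {
                    pthread_mutex_lock(&ev_lock);
                    while (!unknown_event)
                            pthread_cond_wait(&ev_cond, &ev_lock); /* stuck here */
                    pthread_mutex_unlock(&ev_lock);
                    scan_running = false;
                    return NULL;
            }

            /* 2)-4) A stop handler: takes li_mutex, then waits for the scan thread.
             * While it waits, any further stop handler blocks on li_mutex. */
            static void *lfsck_stop_handler(void *arg)
            {
                    long id = (long)arg;

                    printf("stop handler %ld: waiting for li_mutex\n", id);
                    pthread_mutex_lock(&li_mutex);
                    printf("stop handler %ld: got li_mutex, waiting for scan thread\n", id);
                    while (scan_running)
                            sleep(1); /* waits forever: the scan thread never exits */
                    pthread_mutex_unlock(&li_mutex);
                    return NULL;
            }

            int main(void)
            {
                    pthread_t scan, stop1, stop2;

                    pthread_create(&scan, NULL, lfsck_thread, NULL);
                    pthread_create(&stop1, NULL, lfsck_stop_handler, (void *)1L);
                    sleep(1);
                    pthread_create(&stop2, NULL, lfsck_stop_handler, (void *)2L);
                    pthread_join(stop1, NULL); /* never returns */
                    pthread_join(stop2, NULL);
                    pthread_join(scan, NULL);
                    return 0;
            }

            Built with "gcc -pthread", stop handler 1 acquires the mutex and polls forever, while stop handler 2 never gets past pthread_mutex_lock(); that matches the __mutex_lock_slowpath -> lfsck_stop frames in the ll_ost00_004 stack trace quoted later in this thread.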

            adilger Andreas Dilger added a comment -

            LFSCK patches that landed around that time:

            commit a8c9796fddaeda0572a64cedb471b2eead78c506
            CommitDate: Tue Apr 29 17:22:17 2014 +0000
                LU-4941 lfsck: check LOV EA header properly
                Reviewed-on: http://review.whamcloud.com/10045
            
            commit 63e7b15a1baa1aae03dfdc965e0777cf754ff29a
            CommitDate: Wed Apr 30 03:41:14 2014 +0000
                LU-4895 lfsck: not create object to repair dangling by default    
                Reviewed-on: http://review.whamcloud.com/9989
            
            commit 1dbba329174e6c7f7712f01fc4e44c44400fbc92
            CommitDate: Wed Apr 30 03:42:00 2014 +0000
                LU-4556 tests: speed up sanity-lfsck and sanity-scrub tests
                Reviewed-on: http://review.whamcloud.com/9704
            

            Not sure if any of these are the cause, but they are likely candidates. Jodi thought that this hadn't been hit since the initial group of failures, so it is possible that a later patch fixed it.
            green Oleg Drokin added a comment -

            There appears to be some deadlock on the OST:

            01:01:32:INFO: task ll_ost00_004:12219 blocked for more than 120 seconds.
            01:01:32:      Not tainted 2.6.32-431.11.2.el6_lustre.g324fa81.x86_64 #1
            01:01:32:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            01:01:32:ll_ost00_004  D 0000000000000000     0 12219      2 0x00000080
            01:01:32: ffff880079891bb0 0000000000000046 0000000000000000 ffff88006c70facd
            01:01:32: ffffffffa08c3b3d ffffffffa08c3b3b ffff880079891cc0 ffff88006c710000
            01:01:32: ffff88007d7fe5f8 ffff880079891fd8 000000000000fbc8 ffff88007d7fe5f8
            01:01:32:Call Trace:
            01:01:32: [<ffffffff81529b0e>] __mutex_lock_slowpath+0x13e/0x180
            01:01:32: [<ffffffffa0864143>] ? null_alloc_rs+0xf3/0x390 [ptlrpc]
            01:01:32: [<ffffffff815299ab>] mutex_lock+0x2b/0x50
            01:01:32: [<ffffffffa0dd9cff>] lfsck_stop+0x12f/0x410 [lfsck]
            01:01:32: [<ffffffffa0827505>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
            01:01:32: [<ffffffffa084f086>] ? __req_capsule_get+0x166/0x6e0 [ptlrpc]
            01:01:32: [<ffffffffa0dde214>] lfsck_in_notify+0x444/0x5f0 [lfsck]
            01:01:32: [<ffffffffa0882852>] tgt_handle_lfsck_notify+0x62/0x150 [ptlrpc]
            01:01:32: [<ffffffffa088a93c>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
            01:01:32: [<ffffffffa0839b3a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
            01:01:32: [<ffffffffa0838e20>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
            01:01:32: [<ffffffff8109aee6>] kthread+0x96/0xa0
            01:01:33: [<ffffffff8100c20a>] child_rip+0xa/0x20
            01:01:33: [<ffffffff8109ae50>] ? kthread+0x0/0xa0
            01:01:33: [<ffffffff8100c200>] ? child_rip+0x0/0x20
            

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: