Details

    • Bug
    • Resolution: Duplicate
    • Major
    • Lustre 2.8.0, Lustre 2.5.5
    • Lustre 2.7.0, Lustre 2.8.0
    • 3
    • 17092

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite runs:
      https://testing.hpdd.intel.com/test_sets/febe4ef4-9ef2-11e4-a23e-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/e817c76c-9dca-11e4-858d-5254006e85c2

      The sub-test test_56a failed with the following error:

      test failed to respond and timed out
      

      Please provide additional information about the failure here.

      Info required for matching: sanity 56a

      Attachments

        Issue Links

          Activity

            [LU-6133] sanity test_56a: timeout
            yujian Jian Yu added a comment -

            More failure instances on master branch:
            https://testing.hpdd.intel.com/test_sets/4233a110-ce15-11e4-8b81-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/53903488-cd47-11e4-8795-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/f437f480-cb5c-11e4-8352-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/0ac8da4a-c7ff-11e4-be50-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/ee95d0fc-c80e-11e4-92dc-5254006e85c2
            bogl Bob Glossman (Inactive) added a comment -

            Another one seen on master:
            https://testing.hpdd.intel.com/test_sets/4233a110-ce15-11e4-8b81-5254006e85c2

            isaac Isaac Huang (Inactive) added a comment -

            There was one report of a very similar panic for 0.6.3-1.1, without running Lustre:
            https://groups.google.com/a/zfsonlinux.org/forum/#!msg/zfs-discuss/6YaxCnuVfzM/3icF7asNL5AJ

            #12 [ffff880028207f50] nmi at ffffffff81534ee0
                [exception RIP: _spin_lock_irq+40]
                RIP: ffffffff81534678  RSP: ffff88021a9a75d0  RFLAGS: 00000002
                RAX: 0000000000000000  RBX: ffff880193353b00  RCX: 0000000000000014
                RDX: 00000000000073e6  RSI: ffff88021a9a7640  RDI: ffff880193353b08
                RBP: ffff88021a9a75d0   R8: 00d67e80287a31ef   R9: ffff88020a91a000
                R10: ffff88021a9a7660  R11: ffff88020b2a7550  R12: ffff8801c68b4ae0
                R13: ffff88021a9a7640  R14: ffff880193353b08  R15: ffff88021a9a7968
                ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
            --- <NMI exception stack> ---
            #13 [ffff88021a9a75d0] _spin_lock_irq at ffffffff81534678
            #14 [ffff88021a9a75d8] rwsem_down_failed_common at ffffffff8153436c
            #15 [ffff88021a9a7638] rwsem_down_read_failed at ffffffff81534556
            #16 [ffff88021a9a7678] call_rwsem_down_read_failed at ffffffff8129fb44
            #17 [ffff88021a9a76e0] zap_get_leaf_byblk at ffffffffa0374b70 [zfs]
            #18 [ffff88021a9a7750] zap_deref_leaf at ffffffffa0374dba [zfs]
            #19 [ffff88021a9a7790] fzap_cursor_retrieve at ffffffffa0375cbc [zfs]
            #20 [ffff88021a9a7820] zap_cursor_retrieve at ffffffffa037aceb [zfs]
            #21 [ffff88021a9a78c0] zfs_purgedir at ffffffffa038302b [zfs]
            #22 [ffff88021a9a7af0] zfs_unlinked_drain at ffffffffa03832de [zfs]
            

            behlendorf Brian Behlendorf added a comment -

            Since we're treating 0.6.3-x as a stable release, there really wasn't much change between the tags: just a few build fixes for newer kernels, changes to a few default values we've been running forever, and two critical bug fixes. Nothing that should have caused a regression. Here's the full commit log between 0.6.3-1.1 and 0.6.3-1.2:

            $ git log --oneline spl-0.6.3-1.1..spl-0.6.3-1.2
            099c670 Tag spl-0.6.3-1.2
            8cd930b mutex: force serialization on mutex_exit() to fix races
            17e5cf3 Linux 3.12 compat: shrinker semantics
            23fcd1e Linux 3.16 compat: smp_mb__after_clear_bit()
            f322c6a Linux 3.17 compat: remove wait_on_bit action function
            5c8a0ad Set spl_kmem_cache_slab_limit=16384 to default
            bf56579 Set spl_kmem_cache_reclaim=0 to default

            $ git log --oneline zfs-0.6.3-1.1..zfs-0.6.3-1.2
            7b13354 Tag zfs-0.6.3-1.2
            c3db49a Fix typo in %post scriptlet lines
            73fd1f4 Don't perform ACL-to-mode translation on empty ACL
            a85804e Undirty freed spill blocks.
            5b29265 Export symbols for ZIL interface
            107077a Remove checks for mandatory locks
            c69e727 Linux 3.12 compat: shrinker semantics


            adilger Andreas Dilger added a comment -

            Brian, any thoughts on this? It seems that moving to 0.6.3-1.2 has introduced a regression relative to 0.6.3-1.1, but it isn't clear whether it is better to revert to -1.1 (and reintroduce the issues that update resolved) or to stick with -1.2.

            Nathaniel, do you have a solid reproducer for this locally (e.g. run sanity 10x and it will always fail once)? Is it possible to git bisect the -1.2 update and isolate this to a specific patch within that update?
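
            For reference, a local repro loop and a bisect between the packaged tags could look roughly like the following. This is only a sketch: it assumes the usual ONLY= convention for running a single sanity subtest from lustre/tests, and the spl/zfs tag names quoted in the commit log above.

            # rerun only sanity test 56a several times to estimate the failure rate
            $ for i in $(seq 1 10); do ONLY=56a sh lustre/tests/sanity.sh || break; done

            # bisect the zfs tree between the two packaged tags (same idea for spl)
            $ cd zfs
            $ git bisect start zfs-0.6.3-1.2 zfs-0.6.3-1.1
            # rebuild and reinstall the modules, rerun the loop above, then mark the result
            $ git bisect good    # or: git bisect bad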


            isaac Isaac Huang (Inactive) added a comment -

            utopiabound: 0.6.3-1.2 did fix a txg_sync hang that was one of the causes of LU-5242; I was able to reproduce it a few times with 0.6.3-1.1. In addition, here's a report of a very similar issue with spl/zfs 0.6.3-1.1: http://git.net/zfs-discuss/msg19682.html


            utopiabound Nathaniel Clark added a comment -

            Reverting to spl/zfs 0.6.3-1.1 may be the right answer, since 0.6.3-1.2 doesn't seem to have solved any problems (i.e. LU-5242) and has introduced this one.

            utopiabound Nathaniel Clark added a comment - edited

            This issue does not appear with spl/zfs 0.6.3-1.1 or before.

            It also appears that there is a slow/stuck txg_sync thread on the MDS, e.g.:

            21:26:43:txg_sync      D 0000000000000000     0 15966      2 0x00000080
            21:26:43: ffff88006e225b90 0000000000000046 00000000ffffffff 0000094536c0ce7a
            21:26:43: 0000000000000000 ffff8800379f81a0 0000000000113874 ffffffffab454110
            21:26:43: ffff8800721885f8 ffff88006e225fd8 000000000000fbc8 ffff8800721885f8
            21:26:43:Call Trace:
            21:26:43: [<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0
            21:26:43: [<ffffffff81529e83>] io_schedule+0x73/0xc0
            21:26:43: [<ffffffffa0143596>] cv_wait_common+0xa6/0x120 [spl]
            21:26:43: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
            21:26:43: [<ffffffffa0143628>] __cv_wait_io+0x18/0x20 [spl]
            21:26:43: [<ffffffffa028d81b>] zio_wait+0xfb/0x1c0 [zfs]
            21:26:43: [<ffffffffa0220993>] dsl_pool_sync+0xb3/0x3f0 [zfs]
            21:26:43: [<ffffffffa02398bf>] spa_sync+0x40f/0xa70 [zfs]
            21:26:43: [<ffffffffa0243771>] ? spa_txg_history_set+0xc1/0xf0 [zfs]
            21:26:43: [<ffffffffa0246c7d>] txg_sync_thread+0x30d/0x520 [zfs]
            21:26:43: [<ffffffff810591a9>] ? set_user_nice+0xc9/0x130
            21:26:43: [<ffffffffa0246970>] ? txg_sync_thread+0x0/0x520 [zfs]
            21:26:43: [<ffffffffa013ec3f>] thread_generic_wrapper+0x5f/0x70 [spl]
            21:26:43: [<ffffffffa013ebe0>] ? thread_generic_wrapper+0x0/0x70 [spl]
            21:26:43: [<ffffffff8109abf6>] kthread+0x96/0xa0
            21:26:43: [<ffffffff8100c20a>] child_rip+0xa/0x20
            21:26:43: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
            21:26:43: [<ffffffff8100c200>] ? child_rip+0x0/0x20
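
            A quick way to confirm this kind of hang on a live MDS, rather than waiting for a console stack dump, is to look for threads stuck in uninterruptible sleep and dump the blocked-task stacks via SysRq. A minimal sketch, assuming root access and SysRq enabled on the MDS node:

            # list threads stuck in D (uninterruptible) state, e.g. txg_sync
            $ ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'

            # dump stacks of all blocked tasks to the kernel log (SysRq 'w')
            $ echo w > /proc/sysrq-trigger
            $ dmesg | tail -n 200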
            
            pjones Peter Jones added a comment -

            OK, thanks. So it is probably just something quite rare, and the timing was a coincidence.


            isaac Isaac Huang (Inactive) added a comment -

            There was a report of the same problem for zfs-0.6.3-1.1:
            http://git.net/zfs-discuss/msg19682.html

            So it's unlikely to have been caused by the recent upgrade to 0.6.3-1.2.


            People

              Assignee: utopiabound Nathaniel Clark
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 11
