Details

    • Bug
    • Resolution: Duplicate
    • Major
    • Lustre 2.8.0, Lustre 2.5.5
    • Lustre 2.7.0, Lustre 2.8.0
    • 3
    • 17092

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run:
      https://testing.hpdd.intel.com/test_sets/febe4ef4-9ef2-11e4-a23e-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/e817c76c-9dca-11e4-858d-5254006e85c2

      The sub-test test_56a failed with the following error:

      test failed to respond and timed out
      

      Please provide additional information about the failure here.

      Info required for matching: sanity 56a
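
      For reference, the failing subtest can usually be run in isolation from a lustre/tests checkout (a hedged sketch; the exact invocation depends on the local test configuration):

      # assuming the test filesystem is already formatted and mounted (e.g. via llmount.sh)
      cd lustre/tests
      ONLY=56a sh sanity.sh
      # loop a few times to catch an intermittent hang
      for i in $(seq 10); do ONLY=56a sh sanity.sh || break; done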

      Attachments

        Issue Links

          Activity

            [LU-6133] sanity test_56a: timeout

            behlendorf Brian Behlendorf added a comment -

            Since we're treating 0.6.3-x as a stable release there really wasn't much change between the tags: just a few build fixes for newer kernels, changes to a few default values we've been running forever, and two critical bug fixes. Nothing that should have caused a regression. Here's the full commit log between 0.6.3-1.1 and 0.6.3-1.2:

            $ git log --oneline spl-0.6.3-1.1..spl-0.6.3-1.2
            099c670 Tag spl-0.6.3-1.2
            8cd930b mutex: force serialization on mutex_exit() to fix races
            17e5cf3 Linux 3.12 compat: shrinker semantics
            23fcd1e Linux 3.16 compat: smp_mb__after_clear_bit()
            f322c6a Linux 3.17 compat: remove wait_on_bit action function
            5c8a0ad Set spl_kmem_cache_slab_limit=16384 to default
            bf56579 Set spl_kmem_cache_reclaim=0 to default

            $ git log --oneline zfs-0.6.3-1.1..zfs-0.6.3-1.2
            7b13354 Tag zfs-0.6.3-1.2
            c3db49a Fix typo in %post scriptlet lines
            73fd1f4 Don't perform ACL-to-mode translation on empty ACL
            a85804e Undirty freed spill blocks.
            5b29265 Export symbols for ZIL interface
            107077a Remove checks for mandatory locks
            c69e727 Linux 3.12 compat: shrinker semantics


            adilger Andreas Dilger added a comment -

            Brian, any thoughts on this? It seems that moving to 0.6.3-1.2 has introduced a regression vs 0.6.3-1.1, but it isn't clear whether it is better to revert to -1.1 (and get back the issues that were resolved by that update) or stick with -1.2.

            Nathaniel, do you have a solid reproducer for this locally (e.g. run sanity 10x and it will always fail once)? Is it possible to Git bisect the -1.2 update and isolate this to a specific patch within that update?
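
            A minimal sketch of such a bisect (assuming a hypothetical run_sanity_56a.sh wrapper that rebuilds and reinstalls the modules, runs sanity test_56a several times, and exits non-zero on a hang):

            $ cd spl
            $ git bisect start spl-0.6.3-1.2 spl-0.6.3-1.1   # bad tag, good tag
            $ git bisect run ../run_sanity_56a.sh
            # the zfs tree can be bisected the same way between zfs-0.6.3-1.1 and zfs-0.6.3-1.2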


            isaac Isaac Huang (Inactive) added a comment -

            utopiabound: 0.6.3-1.2 did fix a txg_sync hang that was one of the causes of LU-5242; I was able to reproduce that hang a few times with 0.6.3-1.1. In addition, here's a report of a very similar issue with spl/zfs 0.6.3-1.1: http://git.net/zfs-discuss/msg19682.html


            utopiabound Nathaniel Clark added a comment -

            Reverting to spl/zfs 0.6.3-1.1 may be the right answer, since 0.6.3-1.2 doesn't seem to have solved any problems (i.e. LU-5242) and has introduced this one.

            utopiabound Nathaniel Clark added a comment - edited

            This issue does not appear with spl/zfs 0.6.3-1.1 or before.

            It also appears that there is a slow/stuck txg_sync thread on the MDS:
            e.g.

            21:26:43:txg_sync      D 0000000000000000     0 15966      2 0x00000080
            21:26:43: ffff88006e225b90 0000000000000046 00000000ffffffff 0000094536c0ce7a
            21:26:43: 0000000000000000 ffff8800379f81a0 0000000000113874 ffffffffab454110
            21:26:43: ffff8800721885f8 ffff88006e225fd8 000000000000fbc8 ffff8800721885f8
            21:26:43:Call Trace:
            21:26:43: [<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0
            21:26:43: [<ffffffff81529e83>] io_schedule+0x73/0xc0
            21:26:43: [<ffffffffa0143596>] cv_wait_common+0xa6/0x120 [spl]
            21:26:43: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
            21:26:43: [<ffffffffa0143628>] __cv_wait_io+0x18/0x20 [spl]
            21:26:43: [<ffffffffa028d81b>] zio_wait+0xfb/0x1c0 [zfs]
            21:26:43: [<ffffffffa0220993>] dsl_pool_sync+0xb3/0x3f0 [zfs]
            21:26:43: [<ffffffffa02398bf>] spa_sync+0x40f/0xa70 [zfs]
            21:26:43: [<ffffffffa0243771>] ? spa_txg_history_set+0xc1/0xf0 [zfs]
            21:26:43: [<ffffffffa0246c7d>] txg_sync_thread+0x30d/0x520 [zfs]
            21:26:43: [<ffffffff810591a9>] ? set_user_nice+0xc9/0x130
            21:26:43: [<ffffffffa0246970>] ? txg_sync_thread+0x0/0x520 [zfs]
            21:26:43: [<ffffffffa013ec3f>] thread_generic_wrapper+0x5f/0x70 [spl]
            21:26:43: [<ffffffffa013ebe0>] ? thread_generic_wrapper+0x0/0x70 [spl]
            21:26:43: [<ffffffff8109abf6>] kthread+0x96/0xa0
            21:26:43: [<ffffffff8100c20a>] child_rip+0xa/0x20
            21:26:43: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
            21:26:43: [<ffffffff8100c200>] ? child_rip+0x0/0x20
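
            Stacks like the one above can typically be gathered on a live MDS with standard kernel facilities (a hedged sketch; thread and log names assumed):

            # dump all blocked (D-state) tasks, including txg_sync, to the kernel log
            echo w > /proc/sysrq-trigger
            dmesg | grep -A 25 txg_sync
            # or read the current kernel stack of the txg_sync thread directly
            cat /proc/$(pgrep -x txg_sync)/stack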
            
            pjones Peter Jones added a comment -

            Ok, thanks. So it is probably just something quite rare, and the timing was a coincidence.


            isaac Isaac Huang (Inactive) added a comment -

            There was a report of the same problem for zfs-0.6.3-1.1:
            http://git.net/zfs-discuss/msg19682.html

            So it's unlikely to have been caused by the recent upgrade to 0.6.3-1.2.

            pjones Peter Jones added a comment -

            My question was the opposite - I wondered if the LU-6105 upgrade had introduced this problem.


            utopiabound Nathaniel Clark added a comment -

            This has not been fixed by LU-6105 (moving to 0.6.3-1.2):

            recent master instance:
            https://testing.hpdd.intel.com/test_sets/cc90d598-b3a4-11e4-94dd-5254006e85c2

            pjones Peter Jones added a comment -

            Nathaniel

            Has this issue only appeared since the upgrade to the latest ZFS maintenance release in LU-6105?

            Peter

            jhammond John Hammond added a comment -

            There's a crash dump at shadow:/export/scratch/dumps/shadow-16vm4.shadow.whamcloud.com/10.1.4.188-2015-01-17-21:27:55/vmcore

            crash> bt -l 16053
            PID: 16053  TASK: ffff88007d051540  CPU: 1   COMMAND: "mdt00_000"
             #0 [ffff880002307e90] crash_nmi_callback at ffffffff8102fee6
                /usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/arch/x86/include/asm/paravirt.h: 115
             #1 [ffff880002307ea0] notifier_call_chain at ffffffff8152f2f5
                /usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/kernel/notifier.c: 95
             #2 [ffff880002307ee0] atomic_notifier_call_chain at ffffffff8152f35a
                /usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/kernel/notifier.c: 192
             #3 [ffff880002307ef0] notify_die at ffffffff810a11de
                /usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/kernel/notifier.c: 573
             #4 [ffff880002307f20] do_nmi at ffffffff8152cfbb
                /usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/arch/x86/kernel/traps.c: 503
             #5 [ffff880002307f50] nmi at ffffffff8152c880
                /usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/arch/x86_64/kernel/entry.S
                [exception RIP: _spin_lock_irq+37]
                RIP: ffffffff8152c015  RSP: ffff88006f9636e8  RFLAGS: 00000002
                RAX: 0000000000000000  RBX: ffff8800731c3600  RCX: 0000000000000000
                RDX: 00000000000000a4  RSI: ffff88006f963758  RDI: ffff8800731c3608
                RBP: ffff88006f9636e8   R8: 00ff7b4f2f8f252e   R9: 0000000000000000
                R10: ffff88006b973800  R11: 0000000000000800  R12: ffff88007d051540
                R13: ffff88006f963758  R14: ffff8800731c3608  R15: ffff88006f9638f8
                ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
            --- <NMI exception stack> ---
             #6 [ffff88006f9636e8] _spin_lock_irq at ffffffff8152c015
                /usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/arch/x86/include/asm/spinlock.h: 127
             #7 [ffff88006f9636f0] rwsem_down_failed_common at ffffffff8152bd0c
                /usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/lib/rwsem.c: 161
             #8 [ffff88006f963750] rwsem_down_read_failed at ffffffff8152bef6
                /usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/lib/rwsem.c: 200
             #9 [ffff88006f963790] call_rwsem_down_read_failed at ffffffff8128fbb4
                /usr/src/debug///////////////////////////////////////////////////////////////////////////////////////////////////kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/arch/x86/lib/rwsem_64.S: 49
            #10 [ffff88006f9637f8] zap_get_leaf_byblk at ffffffffa0256b18 [zfs]
            #11 [ffff88006f963878] zap_deref_leaf at ffffffffa0256dca [zfs]
            #12 [ffff88006f9638b8] fzap_lookup at ffffffffa0257430 [zfs]
            #13 [ffff88006f963938] zap_lookup_norm at ffffffffa025ca21 [zfs]
            #14 [ffff88006f9639a8] zap_lookup at ffffffffa025cb63 [zfs]
            #15 [ffff88006f9639d8] osd_dir_lookup at ffffffffa0e9acdd [osd_zfs]
            #16 [ffff88006f963a38] lod_index_lookup at ffffffffa10fb755 [lod]
            #17 [ffff88006f963a48] __mdd_lookup at ffffffffa1165450 [mdd]
            #18 [ffff88006f963aa8] mdd_lookup at ffffffffa1165e91 [mdd]
            #19 [ffff88006f963af8] mdt_getattr_name_lock at ffffffffa10206df [mdt]
            #20 [ffff88006f963b98] mdt_intent_getattr at ffffffffa1021b72 [mdt]
            #21 [ffff88006f963be8] mdt_intent_policy at ffffffffa100ccf4 [mdt]
            #22 [ffff88006f963c58] ldlm_lock_enqueue at ffffffffa093e4f9 [ptlrpc]
            #23 [ffff88006f963cb8] ldlm_handle_enqueue0 at ffffffffa096a4bb [ptlrpc]
            #24 [ffff88006f963d28] tgt_enqueue at ffffffffa09eb1c2 [ptlrpc]
            #25 [ffff88006f963d48] tgt_request_handle at ffffffffa09ebdae [ptlrpc]
            #26 [ffff88006f963da8] ptlrpc_main at ffffffffa099b8a1 [ptlrpc]
            #27 [ffff88006f963ee8] kthread at ffffffff8109abf6
                /usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/kernel/kthread.c: 78
            #28 [ffff88006f963f48] kernel_thread at ffffffff8100c20a
                /usr/src/debug///////////////////////////////////////////////////////////////////////////////////////////////////kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/arch/x86/kernel/entry_64.S: 1235
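
            For reference, a vmcore like the one above can usually be examined with crash(1) against the matching debuginfo vmlinux (paths illustrative; they depend on the installed kernel-debuginfo package):

            # open the dump
            crash /usr/lib/debug/lib/modules/2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/vmlinux \
                  /export/scratch/dumps/shadow-16vm4.shadow.whamcloud.com/10.1.4.188-2015-01-17-21:27:55/vmcore
            # backtraces of all uninterruptible (D-state) tasks
            crash> foreach UN bt
            # the mdt00_000 thread shown above, with source lines
            crash> bt -l 16053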
            

            People

              utopiabound Nathaniel Clark
              maloo Maloo
              Votes: 0
              Watchers: 11

              Dates

                Created:
                Updated:
                Resolved: