[LU-6133] sanity test_56a: timeout Created: 19/Jan/15  Updated: 07/Apr/17  Resolved: 07/Apr/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.8.0
Fix Version/s: Lustre 2.8.0, Lustre 2.5.5

Type: Bug Priority: Major
Reporter: Maloo Assignee: Nathaniel Clark
Resolution: Duplicate Votes: 0
Labels: zfs

Issue Links:
Duplicate
duplicates LU-6477 Update ZFS/SPL version to 0.6.4.1 Resolved
Related
is related to LU-6105 Update ZFS/SPL version to 0.6.3-1.2 Resolved
Severity: 3
Rank (Obsolete): 17092

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run:
https://testing.hpdd.intel.com/test_sets/febe4ef4-9ef2-11e4-a23e-5254006e85c2
https://testing.hpdd.intel.com/test_sets/e817c76c-9dca-11e4-858d-5254006e85c2

The sub-test test_56a failed with the following error:

test failed to respond and timed out

Please provide additional information about the failure here.

Info required for matching: sanity 56a



 Comments   
Comment by Andreas Dilger [ 19/Jan/15 ]

It looks like the MDS rebooted some time between test_54e and test_56a, according to the MDS syslog. There is no indication of why that happened without more console logs from the MDS.

Comment by John Hammond [ 22/Jan/15 ]

There's a crash dump at shadow:/export/scratch/dumps/shadow-16vm4.shadow.whamcloud.com/10.1.4.188-2015-01-17-21:27:55/vmcore

crash> bt -l 16053
PID: 16053  TASK: ffff88007d051540  CPU: 1   COMMAND: "mdt00_000"
 #0 [ffff880002307e90] crash_nmi_callback at ffffffff8102fee6
    /usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/arch/x86/include/asm/paravirt.h: 115
 #1 [ffff880002307ea0] notifier_call_chain at ffffffff8152f2f5
    /usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/kernel/notifier.c: 95
 #2 [ffff880002307ee0] atomic_notifier_call_chain at ffffffff8152f35a
    /usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/kernel/notifier.c: 192
 #3 [ffff880002307ef0] notify_die at ffffffff810a11de
    /usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/kernel/notifier.c: 573
 #4 [ffff880002307f20] do_nmi at ffffffff8152cfbb
    /usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/arch/x86/kernel/traps.c: 503
 #5 [ffff880002307f50] nmi at ffffffff8152c880
    /usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/arch/x86_64/kernel/entry.S
    [exception RIP: _spin_lock_irq+37]
    RIP: ffffffff8152c015  RSP: ffff88006f9636e8  RFLAGS: 00000002
    RAX: 0000000000000000  RBX: ffff8800731c3600  RCX: 0000000000000000
    RDX: 00000000000000a4  RSI: ffff88006f963758  RDI: ffff8800731c3608
    RBP: ffff88006f9636e8   R8: 00ff7b4f2f8f252e   R9: 0000000000000000
    R10: ffff88006b973800  R11: 0000000000000800  R12: ffff88007d051540
    R13: ffff88006f963758  R14: ffff8800731c3608  R15: ffff88006f9638f8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88006f9636e8] _spin_lock_irq at ffffffff8152c015
    /usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/arch/x86/include/asm/spinlock.h: 127
 #7 [ffff88006f9636f0] rwsem_down_failed_common at ffffffff8152bd0c
    /usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/lib/rwsem.c: 161
 #8 [ffff88006f963750] rwsem_down_read_failed at ffffffff8152bef6
    /usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/lib/rwsem.c: 200
 #9 [ffff88006f963790] call_rwsem_down_read_failed at ffffffff8128fbb4
    /usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/arch/x86/lib/rwsem_64.S: 49
#10 [ffff88006f9637f8] zap_get_leaf_byblk at ffffffffa0256b18 [zfs]
#11 [ffff88006f963878] zap_deref_leaf at ffffffffa0256dca [zfs]
#12 [ffff88006f9638b8] fzap_lookup at ffffffffa0257430 [zfs]
#13 [ffff88006f963938] zap_lookup_norm at ffffffffa025ca21 [zfs]
#14 [ffff88006f9639a8] zap_lookup at ffffffffa025cb63 [zfs]
#15 [ffff88006f9639d8] osd_dir_lookup at ffffffffa0e9acdd [osd_zfs]
#16 [ffff88006f963a38] lod_index_lookup at ffffffffa10fb755 [lod]
#17 [ffff88006f963a48] __mdd_lookup at ffffffffa1165450 [mdd]
#18 [ffff88006f963aa8] mdd_lookup at ffffffffa1165e91 [mdd]
#19 [ffff88006f963af8] mdt_getattr_name_lock at ffffffffa10206df [mdt]
#20 [ffff88006f963b98] mdt_intent_getattr at ffffffffa1021b72 [mdt]
#21 [ffff88006f963be8] mdt_intent_policy at ffffffffa100ccf4 [mdt]
#22 [ffff88006f963c58] ldlm_lock_enqueue at ffffffffa093e4f9 [ptlrpc]
#23 [ffff88006f963cb8] ldlm_handle_enqueue0 at ffffffffa096a4bb [ptlrpc]
#24 [ffff88006f963d28] tgt_enqueue at ffffffffa09eb1c2 [ptlrpc]
#25 [ffff88006f963d48] tgt_request_handle at ffffffffa09ebdae [ptlrpc]
#26 [ffff88006f963da8] ptlrpc_main at ffffffffa099b8a1 [ptlrpc]
#27 [ffff88006f963ee8] kthread at ffffffff8109abf6
    /usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/kernel/kthread.c: 78
#28 [ffff88006f963f48] kernel_thread at ffffffff8100c20a
    /usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/arch/x86/kernel/entry_64.S: 1235
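For anyone re-opening that dump, a minimal crash(8) session along the lines below should reproduce the backtrace and show which other tasks are blocked (sketch only; the vmlinux debuginfo path is an assumption, only the vmcore path comes from the comment above):

$ # debuginfo path is an assumption; the vmcore path is the one quoted above
$ crash /usr/lib/debug/lib/modules/2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/vmlinux \
        /export/scratch/dumps/shadow-16vm4.shadow.whamcloud.com/10.1.4.188-2015-01-17-21:27:55/vmcore
crash> ps | grep UN      # tasks in uninterruptible (D) sleep
crash> foreach bt        # backtraces for every task, to spot who else is in the ZAP rwsem path
crash> bt -l 16053       # the mdt00_000 thread above, with file:line detail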
Comment by Peter Jones [ 16/Feb/15 ]

Nathaniel

Has this issue only appeared since the upgrade to the latest ZFS maintenance release in LU-6105?

Peter

Comment by Nathaniel Clark [ 17/Feb/15 ]

This has not been fixed by LU-6105 (moving to 0.6.3-1.2):

recent master instance:
https://testing.hpdd.intel.com/test_sets/cc90d598-b3a4-11e4-94dd-5254006e85c2

Comment by Peter Jones [ 18/Feb/15 ]

My question was the opposite - I wondered if the LU-6105 upgrade had introduced this problem....

Comment by Isaac Huang (Inactive) [ 19/Feb/15 ]

There was a report of the same problem for zfs-0.6.3-1.1:
http://git.net/zfs-discuss/msg19682.html

So it's unlikely caused by the recent upgrade to 0.6.3-1.2.

Comment by Peter Jones [ 19/Feb/15 ]

Ok, thanks. So it is probably just something quite rare, and the timing was a coincidence.

Comment by Nathaniel Clark [ 23/Feb/15 ]

This issue does not appear with spl/zfs 0.6.3-1.1 or before.

It also appears that there is a slow/stuck txg_sync thread on the MDS, e.g.:

21:26:43:txg_sync      D 0000000000000000     0 15966      2 0x00000080
21:26:43: ffff88006e225b90 0000000000000046 00000000ffffffff 0000094536c0ce7a
21:26:43: 0000000000000000 ffff8800379f81a0 0000000000113874 ffffffffab454110
21:26:43: ffff8800721885f8 ffff88006e225fd8 000000000000fbc8 ffff8800721885f8
21:26:43:Call Trace:
21:26:43: [<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0
21:26:43: [<ffffffff81529e83>] io_schedule+0x73/0xc0
21:26:43: [<ffffffffa0143596>] cv_wait_common+0xa6/0x120 [spl]
21:26:43: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
21:26:43: [<ffffffffa0143628>] __cv_wait_io+0x18/0x20 [spl]
21:26:43: [<ffffffffa028d81b>] zio_wait+0xfb/0x1c0 [zfs]
21:26:43: [<ffffffffa0220993>] dsl_pool_sync+0xb3/0x3f0 [zfs]
21:26:43: [<ffffffffa02398bf>] spa_sync+0x40f/0xa70 [zfs]
21:26:43: [<ffffffffa0243771>] ? spa_txg_history_set+0xc1/0xf0 [zfs]
21:26:43: [<ffffffffa0246c7d>] txg_sync_thread+0x30d/0x520 [zfs]
21:26:43: [<ffffffff810591a9>] ? set_user_nice+0xc9/0x130
21:26:43: [<ffffffffa0246970>] ? txg_sync_thread+0x0/0x520 [zfs]
21:26:43: [<ffffffffa013ec3f>] thread_generic_wrapper+0x5f/0x70 [spl]
21:26:43: [<ffffffffa013ebe0>] ? thread_generic_wrapper+0x0/0x70 [spl]
21:26:43: [<ffffffff8109abf6>] kthread+0x96/0xa0
21:26:43: [<ffffffff8100c20a>] child_rip+0xa/0x20
21:26:43: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
21:26:43: [<ffffffff8100c200>] ? child_rip+0x0/0x20
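If the hang reproduces, a quick check on whether the sync thread is merely slow or fully wedged (sketch only; <pool> is a placeholder for the MDT pool name, and the txgs kstat is only populated when zfs_txg_history is non-zero) would be:

$ cat /sys/module/zfs/parameters/zfs_txg_history   # 0 means per-txg history is disabled
$ cat /proc/spl/kstat/zfs/<pool>/txgs              # open/quiesce/sync times for recent txgs
$ echo w > /proc/sysrq-trigger                     # dump all blocked (D-state) stacks to the console log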
Comment by Nathaniel Clark [ 23/Feb/15 ]

Reverting to spl/zfs 0.6.3-1.1 may be the right answer, since the -1.2 update doesn't seem to have solved any problems (i.e. LU-5242) and has induced this one.

Comment by Isaac Huang (Inactive) [ 23/Feb/15 ]

utopiabound 0.6.3-1.2 did fix a txg_sync hang that was one of the causes of LU-5242; I was able to reproduce it a few times with 0.6.3-1.1. In addition, here's a report of a very similar issue with spl/zfs 0.6.3-1.1: http://git.net/zfs-discuss/msg19682.html

Comment by Andreas Dilger [ 23/Feb/15 ]

Brian, any thoughts on this? It seems that moving to 0.6.3-1.2 has introduced a regression vs 0.6.3-1.1, but it isn't clear whether it is better to revert to -1.1 (and reintroduce the issues resolved by that update) or stick with -1.2?

Nathaniel, do you have a solid reproducer for this locally (e.g. run sanity 10x and it will always fail once)? Is it possible to git bisect the -1.2 update and isolate this to a specific patch within that update?
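A rough sketch of that bisect, assuming the spl tree is tried first (the reproducer line is a placeholder for whatever local test run is used):

$ git bisect start spl-0.6.3-1.2 spl-0.6.3-1.1   # bad tag, then good tag
$ # rebuild spl/zfs and Lustre at this commit, then run the reproducer (e.g. sanity 56a in a loop)
$ git bisect good                                # or: git bisect bad, depending on the result
$ git bisect reset                               # once the offending commit is isolated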

Comment by Brian Behlendorf [ 23/Feb/15 ]

Since we're treating 0.6.3-x as a stable release there really wasn't much change between the tags. Just a few build fixes for newer kernels, changes to a few default values we've been running forever, and two critical bug fixes. Nothing which should have caused a regression; here's the full commit log between 0.6.3-1.1 and 0.6.3-1.2.

$ git log --oneline spl-0.6.3-1.1..spl-0.6.3-1.2
099c670 Tag spl-0.6.3-1.2
8cd930b mutex: force serialization on mutex_exit() to fix races
17e5cf3 Linux 3.12 compat: shrinker semantics
23fcd1e Linux 3.16 compat: smp_mb__after_clear_bit()
f322c6a Linux 3.17 compat: remove wait_on_bit action function
5c8a0ad Set spl_kmem_cache_slab_limit=16384 to default
bf56579 Set spl_kmem_cache_reclaim=0 to default

$ git log --oneline zfs-0.6.3-1.1..zfs-0.6.3-1.2
7b13354 Tag zfs-0.6.3-1.2
c3db49a Fix typo in %post scriptlet lines
73fd1f4 Don't perform ACL-to-mode translation on empty ACL
a85804e Undirty freed spill blocks.
5b29265 Export symbols for ZIL interface
107077a Remove checks for mandatory locks
c69e727 Linux 3.12 compat: shrinker semantics

Comment by Isaac Huang (Inactive) [ 24/Feb/15 ]

There was one report of a very similar panic for 0.6.3-1.1, without running Lustre:
https://groups.google.com/a/zfsonlinux.org/forum/#!msg/zfs-discuss/6YaxCnuVfzM/3icF7asNL5AJ

#12 [ffff880028207f50] nmi at ffffffff81534ee0
    [exception RIP: _spin_lock_irq+40]
    RIP: ffffffff81534678  RSP: ffff88021a9a75d0  RFLAGS: 00000002
    RAX: 0000000000000000  RBX: ffff880193353b00  RCX: 0000000000000014
    RDX: 00000000000073e6  RSI: ffff88021a9a7640  RDI: ffff880193353b08
    RBP: ffff88021a9a75d0   R8: 00d67e80287a31ef   R9: ffff88020a91a000
    R10: ffff88021a9a7660  R11: ffff88020b2a7550  R12: ffff8801c68b4ae0
    R13: ffff88021a9a7640  R14: ffff880193353b08  R15: ffff88021a9a7968
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#13 [ffff88021a9a75d0] _spin_lock_irq at ffffffff81534678
#14 [ffff88021a9a75d8] rwsem_down_failed_common at ffffffff8153436c
#15 [ffff88021a9a7638] rwsem_down_read_failed at ffffffff81534556
#16 [ffff88021a9a7678] call_rwsem_down_read_failed at ffffffff8129fb44
#17 [ffff88021a9a76e0] zap_get_leaf_byblk at ffffffffa0374b70 [zfs]
#18 [ffff88021a9a7750] zap_deref_leaf at ffffffffa0374dba [zfs]
#19 [ffff88021a9a7790] fzap_cursor_retrieve at ffffffffa0375cbc [zfs]
#20 [ffff88021a9a7820] zap_cursor_retrieve at ffffffffa037aceb [zfs]
#21 [ffff88021a9a78c0] zfs_purgedir at ffffffffa038302b [zfs]
#22 [ffff88021a9a7af0] zfs_unlinked_drain at ffffffffa03832de [zfs]
Comment by Bob Glossman (Inactive) [ 19/Mar/15 ]

Another instance seen on master:
https://testing.hpdd.intel.com/test_sets/4233a110-ce15-11e4-8b81-5254006e85c2

Comment by Jian Yu [ 31/Mar/15 ]

More failure instances on master branch:
https://testing.hpdd.intel.com/test_sets/4233a110-ce15-11e4-8b81-5254006e85c2
https://testing.hpdd.intel.com/test_sets/53903488-cd47-11e4-8795-5254006e85c2
https://testing.hpdd.intel.com/test_sets/f437f480-cb5c-11e4-8352-5254006e85c2
https://testing.hpdd.intel.com/test_sets/0ac8da4a-c7ff-11e4-be50-5254006e85c2
https://testing.hpdd.intel.com/test_sets/ee95d0fc-c80e-11e4-92dc-5254006e85c2
