[LU-6133] sanity test_56a: timeout Created: 19/Jan/15 Updated: 07/Apr/17 Resolved: 07/Apr/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0, Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0, Lustre 2.5.5 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Nathaniel Clark |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | zfs | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 17092 |
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run:

The sub-test test_56a failed with the following error:

test failed to respond and timed out

Please provide additional information about the failure here.

Info required for matching: sanity 56a |
| Comments |
| Comment by Andreas Dilger [ 19/Jan/15 ] |
|
It looks like the MDS rebooted some time after test_54e and test_56a according to the MDS syslog. No indication why that happened without more console logs from the MDS. |
| Comment by John Hammond [ 22/Jan/15 ] |
|
There's a crash dump at shadow:/export/scratch/dumps/shadow-16vm4.shadow.whamcloud.com/10.1.4.188-2015-01-17-21:27:55/vmcore

crash> bt -l 16053
PID: 16053 TASK: ffff88007d051540 CPU: 1 COMMAND: "mdt00_000"
#0 [ffff880002307e90] crash_nmi_callback at ffffffff8102fee6
/usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/arch/x86/include/asm/paravirt.h: 115
#1 [ffff880002307ea0] notifier_call_chain at ffffffff8152f2f5
/usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/kernel/notifier.c: 95
#2 [ffff880002307ee0] atomic_notifier_call_chain at ffffffff8152f35a
/usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/kernel/notifier.c: 192
#3 [ffff880002307ef0] notify_die at ffffffff810a11de
/usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/kernel/notifier.c: 573
#4 [ffff880002307f20] do_nmi at ffffffff8152cfbb
/usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/arch/x86/kernel/traps.c: 503
#5 [ffff880002307f50] nmi at ffffffff8152c880
/usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/arch/x86_64/kernel/entry.S
[exception RIP: _spin_lock_irq+37]
RIP: ffffffff8152c015 RSP: ffff88006f9636e8 RFLAGS: 00000002
RAX: 0000000000000000 RBX: ffff8800731c3600 RCX: 0000000000000000
RDX: 00000000000000a4 RSI: ffff88006f963758 RDI: ffff8800731c3608
RBP: ffff88006f9636e8 R8: 00ff7b4f2f8f252e R9: 0000000000000000
R10: ffff88006b973800 R11: 0000000000000800 R12: ffff88007d051540
R13: ffff88006f963758 R14: ffff8800731c3608 R15: ffff88006f9638f8
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#6 [ffff88006f9636e8] _spin_lock_irq at ffffffff8152c015
/usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/arch/x86/include/asm/spinlock.h: 127
#7 [ffff88006f9636f0] rwsem_down_failed_common at ffffffff8152bd0c
/usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/lib/rwsem.c: 161
#8 [ffff88006f963750] rwsem_down_read_failed at ffffffff8152bef6
/usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/lib/rwsem.c: 200
#9 [ffff88006f963790] call_rwsem_down_read_failed at ffffffff8128fbb4
/usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/arch/x86/lib/rwsem_64.S: 49
#10 [ffff88006f9637f8] zap_get_leaf_byblk at ffffffffa0256b18 [zfs]
#11 [ffff88006f963878] zap_deref_leaf at ffffffffa0256dca [zfs]
#12 [ffff88006f9638b8] fzap_lookup at ffffffffa0257430 [zfs]
#13 [ffff88006f963938] zap_lookup_norm at ffffffffa025ca21 [zfs]
#14 [ffff88006f9639a8] zap_lookup at ffffffffa025cb63 [zfs]
#15 [ffff88006f9639d8] osd_dir_lookup at ffffffffa0e9acdd [osd_zfs]
#16 [ffff88006f963a38] lod_index_lookup at ffffffffa10fb755 [lod]
#17 [ffff88006f963a48] __mdd_lookup at ffffffffa1165450 [mdd]
#18 [ffff88006f963aa8] mdd_lookup at ffffffffa1165e91 [mdd]
#19 [ffff88006f963af8] mdt_getattr_name_lock at ffffffffa10206df [mdt]
#20 [ffff88006f963b98] mdt_intent_getattr at ffffffffa1021b72 [mdt]
#21 [ffff88006f963be8] mdt_intent_policy at ffffffffa100ccf4 [mdt]
#22 [ffff88006f963c58] ldlm_lock_enqueue at ffffffffa093e4f9 [ptlrpc]
#23 [ffff88006f963cb8] ldlm_handle_enqueue0 at ffffffffa096a4bb [ptlrpc]
#24 [ffff88006f963d28] tgt_enqueue at ffffffffa09eb1c2 [ptlrpc]
#25 [ffff88006f963d48] tgt_request_handle at ffffffffa09ebdae [ptlrpc]
#26 [ffff88006f963da8] ptlrpc_main at ffffffffa099b8a1 [ptlrpc]
#27 [ffff88006f963ee8] kthread at ffffffff8109abf6
/usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/kernel/kthread.c: 78
#28 [ffff88006f963f48] kernel_thread at ffffffff8100c20a
/usr/src/debug/kernel-2.6.32-431.29.2.el6/linux-2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64/arch/x86/kernel/entry_64.S: 1235
|
| Comment by Peter Jones [ 16/Feb/15 ] |
|
Nathaniel Has this issue only appeared since the upgrade to the latest ZFS maintenance release in Peter |
| Comment by Nathaniel Clark [ 17/Feb/15 ] |
|
This has not been fixed by recent master instance: |
| Comment by Peter Jones [ 18/Feb/15 ] |
|
My question was the opposite - I wondered if the |
| Comment by Isaac Huang (Inactive) [ 19/Feb/15 ] |
|
There was a report of the same problem for zfs-0.6.3-1.1: So it's unlikely caused by the recent upgrade to 0.6.3-1.2. |
| Comment by Peter Jones [ 19/Feb/15 ] |
|
OK, thanks. So it is probably just something quite rare, and the timing was a coincidence. |
| Comment by Nathaniel Clark [ 23/Feb/15 ] |
|
This issue does not appear with spl/zfs 0.6.3-1.1 or before. It also appears that there is a slow/stuck txg_sync thread on MDS:

21:26:43:txg_sync      D 0000000000000000     0 15966      2 0x00000080
21:26:43: ffff88006e225b90 0000000000000046 00000000ffffffff 0000094536c0ce7a
21:26:43: 0000000000000000 ffff8800379f81a0 0000000000113874 ffffffffab454110
21:26:43: ffff8800721885f8 ffff88006e225fd8 000000000000fbc8 ffff8800721885f8
21:26:43:Call Trace:
21:26:43: [<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0
21:26:43: [<ffffffff81529e83>] io_schedule+0x73/0xc0
21:26:43: [<ffffffffa0143596>] cv_wait_common+0xa6/0x120 [spl]
21:26:43: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
21:26:43: [<ffffffffa0143628>] __cv_wait_io+0x18/0x20 [spl]
21:26:43: [<ffffffffa028d81b>] zio_wait+0xfb/0x1c0 [zfs]
21:26:43: [<ffffffffa0220993>] dsl_pool_sync+0xb3/0x3f0 [zfs]
21:26:43: [<ffffffffa02398bf>] spa_sync+0x40f/0xa70 [zfs]
21:26:43: [<ffffffffa0243771>] ? spa_txg_history_set+0xc1/0xf0 [zfs]
21:26:43: [<ffffffffa0246c7d>] txg_sync_thread+0x30d/0x520 [zfs]
21:26:43: [<ffffffff810591a9>] ? set_user_nice+0xc9/0x130
21:26:43: [<ffffffffa0246970>] ? txg_sync_thread+0x0/0x520 [zfs]
21:26:43: [<ffffffffa013ec3f>] thread_generic_wrapper+0x5f/0x70 [spl]
21:26:43: [<ffffffffa013ebe0>] ? thread_generic_wrapper+0x0/0x70 [spl]
21:26:43: [<ffffffff8109abf6>] kthread+0x96/0xa0
21:26:43: [<ffffffff8100c20a>] child_rip+0xa/0x20
21:26:43: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
21:26:43: [<ffffffff8100c200>] ? child_rip+0x0/0x20 |
| Comment by Nathaniel Clark [ 23/Feb/15 ] |
|
Reverting to spl/zfs 0.6.3-1.1 may be the right answer since it doesn't seem to have solved any problems (ie |
| Comment by Isaac Huang (Inactive) [ 23/Feb/15 ] |
|
utopiabound, 0.6.3-1.2 did fix a txg_sync hang that was one of the causes for |
| Comment by Andreas Dilger [ 23/Feb/15 ] |
|
Brian, any thoughts on this? It seems that moving to 0.6.3-1.2 has introduced a regression vs 0.6.3-1.1, but it isn't clear whether it is better to revert to -1.1 (and get back the issues resolved by that update) or stick with -1.2.

Nathaniel, do you have a solid reproducer for this locally (e.g. run sanity 10x and it will always fail once)? Is it possible to git bisect the -1.2 update and isolate this to a specific patch within that update? |
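The bisect suggested above could be driven with `git bisect run`. The sketch below is purely illustrative: it builds a throwaway repository (the tag names mirror the ticket, but the commits are fake), and a `grep` for a `BUG` marker stands in for the real reproducer of running sanity 10x and checking whether test_56a hangs.

```shell
#!/bin/sh
# Illustrative bisect between two tags; the repo and its contents are
# stand-ins, not the actual zfs tree.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email ci@example.com
git config user.name ci

echo ok > mod.c
git add mod.c
git commit -qm 'known good base'
git tag zfs-0.6.3-1.1                 # last known good tag

for i in 1 2 3 4 5; do
    if [ "$i" -eq 3 ]; then           # pretend the 3rd change regressed
        echo BUG >> mod.c
    fi
    echo "tweak $i" >> mod.c
    git commit -qam "tweak $i"
done
git tag zfs-0.6.3-1.2                 # first known bad tag

git bisect start zfs-0.6.3-1.2 zfs-0.6.3-1.1 >/dev/null
# Exit 0 = good, 1-124 = bad; a real run would launch the sanity suite here.
git bisect run sh -c '! grep -q BUG mod.c' >/dev/null

bad_subject=$(git show -s --format=%s refs/bisect/bad)
echo "first bad commit: $bad_subject"
git bisect reset >/dev/null 2>&1
```

With a reliable reproducer in place of the `grep`, the same loop narrows the -1.2 update to a single commit automatically.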
| Comment by Brian Behlendorf [ 23/Feb/15 ] |
|
Since we're treating 0.6.3-x as a stable release there really wasn't much change between the tags: just a few build fixes for newer kernels, changes to a few default values we've been running forever, and two critical bug fixes. Nothing which should have caused a regression. Here's the full commit log between 0.6.3-1.1 and 0.6.3-1.2:

$ git log --oneline spl-0.6.3-1.1..spl-0.6.3-1.2
$ git log --oneline zfs-0.6.3-1.1..zfs-0.6.3-1.2 |
| Comment by Isaac Huang (Inactive) [ 24/Feb/15 ] |
|
There was one report of a very similar panic for 0.6.3-1.1, without running Lustre:

#12 [ffff880028207f50] nmi at ffffffff81534ee0
[exception RIP: _spin_lock_irq+40]
RIP: ffffffff81534678 RSP: ffff88021a9a75d0 RFLAGS: 00000002
RAX: 0000000000000000 RBX: ffff880193353b00 RCX: 0000000000000014
RDX: 00000000000073e6 RSI: ffff88021a9a7640 RDI: ffff880193353b08
RBP: ffff88021a9a75d0 R8: 00d67e80287a31ef R9: ffff88020a91a000
R10: ffff88021a9a7660 R11: ffff88020b2a7550 R12: ffff8801c68b4ae0
R13: ffff88021a9a7640 R14: ffff880193353b08 R15: ffff88021a9a7968
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#13 [ffff88021a9a75d0] _spin_lock_irq at ffffffff81534678
#14 [ffff88021a9a75d8] rwsem_down_failed_common at ffffffff8153436c
#15 [ffff88021a9a7638] rwsem_down_read_failed at ffffffff81534556
#16 [ffff88021a9a7678] call_rwsem_down_read_failed at ffffffff8129fb44
#17 [ffff88021a9a76e0] zap_get_leaf_byblk at ffffffffa0374b70 [zfs]
#18 [ffff88021a9a7750] zap_deref_leaf at ffffffffa0374dba [zfs]
#19 [ffff88021a9a7790] fzap_cursor_retrieve at ffffffffa0375cbc [zfs]
#20 [ffff88021a9a7820] zap_cursor_retrieve at ffffffffa037aceb [zfs]
#21 [ffff88021a9a78c0] zfs_purgedir at ffffffffa038302b [zfs]
#22 [ffff88021a9a7af0] zfs_unlinked_drain at ffffffffa03832de [zfs]
|
| Comment by Bob Glossman (Inactive) [ 19/Mar/15 ] |
|
another seen on master: |
| Comment by Jian Yu [ 31/Mar/15 ] |
|
More failure instances on master branch: |