[LU-8490] ZFS osd sleeping in nonatomic context. Created: 10/Aug/16  Updated: 10/Aug/16  Resolved: 10/Aug/16

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: Alex Zhuravlev
Resolution: Duplicate Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Testing the debug kernel in Maloo, ZFS seems barely operational: it sleeps in atomic context and then dies, apparently from a spinlock deadlock or similar (a soft lockup that escalates to a panic due to our kernel config).

05:18:30:[  590.971378] BUG: sleeping function called from invalid context at kernel/mutex.c:104
05:18:30:[  590.973218] in_atomic(): 1, irqs_disabled(): 0, pid: 32539, name: mdt00_002
05:18:30:[  590.974751] CPU: 0 PID: 32539 Comm: mdt00_002 Tainted: P        W  OE  ------------   3.10.0-327.22.2.el7_lustre.x86_64 #1
05:18:30:[  590.976645] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
05:18:30:[  590.978113]  ffff88004c77bf58 00000000d641fc8e ffff880039193860 ffffffff8164bed6
05:18:30:[  590.979760]  ffff880039193870 ffffffff810b5639 ffff880039193888 ffffffff81651220
05:18:30:[  590.981415]  ffff88004c77be88 ffff8800391938b0 ffffffffa03e0d80 ffff880025cf5f70
05:18:30:[  590.983057] Call Trace:
05:18:30:[  590.984316]  [<ffffffff8164bed6>] dump_stack+0x19/0x1b
05:18:30:[  590.985764]  [<ffffffff810b5639>] __might_sleep+0xd9/0x100
05:18:30:[  590.987246]  [<ffffffff81651220>] mutex_lock+0x20/0x40
05:18:30:[  590.988729]  [<ffffffffa03e0d80>] sa_spill_rele+0x20/0xb0 [zfs]
05:18:30:[  590.990223]  [<ffffffffa0fcf39f>] osd_object_sa_dirty_rele+0xaf/0x110 [osd_zfs]
05:18:30:[  590.991867]  [<ffffffffa0fc7d20>] osd_trans_stop+0x2a0/0x530 [osd_zfs]
05:18:30:[  590.993471]  [<ffffffffa0e2da69>] top_trans_stop+0x99/0x8f0 [ptlrpc]
05:18:30:[  590.995068]  [<ffffffffa121cbda>] ? lod_attr_set+0xaa/0x920 [lod]
05:18:30:[  590.996606]  [<ffffffffa1202219>] lod_trans_stop+0x259/0x340 [lod]
05:18:30:[  590.998149]  [<ffffffffa1284ffd>] ? mdd_attr_set_internal+0x11d/0x2a0 [mdd]
05:18:30:[  590.999737]  [<ffffffffa128fa5a>] mdd_trans_stop+0x1a/0x1c [mdd]
05:18:30:[  591.001275]  [<ffffffffa127d85c>] mdd_create+0x104c/0x12b0 [mdd]
05:18:30:[  591.002815]  [<ffffffffa1154f19>] mdt_md_create+0x849/0xba0 [mdt]
05:18:30:[  591.004373]  [<ffffffffa0bac561>] ? lprocfs_job_stats_log+0xd1/0x600 [obdclass]
05:18:30:[  591.006010]  [<ffffffffa11553db>] mdt_reint_create+0x16b/0x350 [mdt]
05:18:30:[  591.007596]  [<ffffffffa11568e0>] mdt_reint_rec+0x80/0x210 [mdt]
05:18:30:[  591.009238]  [<ffffffffa1139e02>] mdt_reint_internal+0x582/0x970 [mdt]
05:18:30:[  591.010814]  [<ffffffffa1144b67>] mdt_reint+0x67/0x140 [mdt]
05:18:30:[  591.012344]  [<ffffffffa0e1a7e5>] tgt_request_handle+0x925/0x1330 [ptlrpc]
05:18:30:[  591.013948]  [<ffffffffa0dc824e>] ptlrpc_server_handle_request+0x22e/0xaa0 [ptlrpc]
05:18:30:[  591.015621]  [<ffffffffa0dc6aee>] ? ptlrpc_wait_event+0xae/0x350 [ptlrpc]
05:18:30:[  591.017218]  [<ffffffff810bcc92>] ? default_wake_function+0x12/0x20
05:18:30:[  591.018769]  [<ffffffff810b2cd8>] ? __wake_up_common+0x58/0x90
05:18:30:[  591.020298]  [<ffffffffa0dcc018>] ptlrpc_main+0xa58/0x1db0 [ptlrpc]
05:18:30:[  591.021868]  [<ffffffffa0dcb5c0>] ? ptlrpc_register_service+0xe60/0xe60 [ptlrpc]
05:18:30:[  591.023511]  [<ffffffff810a8a24>] kthread+0xe4/0xf0
05:18:30:[  591.024956]  [<ffffffff810a8940>] ? kthread_create_on_node+0x140/0x140
05:18:30:[  591.026521]  [<ffffffff8165d3d8>] ret_from_fork+0x58/0x90
05:18:30:[  591.027984]  [<ffffffff810a8940>] ? kthread_create_on_node+0x140/0x140
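The trace shows `mutex_lock()` being reached while `in_atomic()` is true, which is exactly what `CONFIG_DEBUG_ATOMIC_SLEEP` / `__might_sleep()` flags. Per the call chain, `osd_trans_stop()` calls `osd_object_sa_dirty_rele()`, which ends up in ZFS's `sa_spill_rele()` taking a mutex while the caller still holds a spinlock. A minimal sketch of the general pattern (simplified illustration only, not the actual osd_zfs code; lock names are made up):

```c
/* Sketch of "sleeping function called from invalid context":
 * a spinlock is held, so we are in atomic context, yet a call
 * chain reaches a sleeping primitive such as mutex_lock().
 * With CONFIG_DEBUG_ATOMIC_SLEEP, __might_sleep() prints the
 * BUG message seen in the log above. */

static DEFINE_SPINLOCK(obj_lock);   /* hypothetical locks for illustration */
static DEFINE_MUTEX(sa_lock);

static void buggy_release(void)
{
	spin_lock(&obj_lock);       /* enter atomic context */
	mutex_lock(&sa_lock);       /* BUG: mutex_lock() may sleep */
	mutex_unlock(&sa_lock);
	spin_unlock(&obj_lock);
}
```

The usual fix is to restructure the code so the sleeping call happens outside the spinlocked region (e.g. drop the spinlock first, or defer the release), which is presumably what the LU-8449 patch referenced below does.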

Examples:
https://testing.hpdd.intel.com/test_sessions/35c51d2c-5e25-11e6-b2e2-5254006e85c2
https://testing.hpdd.intel.com/test_sessions/d4afd7c4-5e48-11e6-b5b1-5254006e85c2



 Comments   
Comment by Alex Zhuravlev [ 10/Aug/16 ]

This will be fixed by the patch for LU-8449.

Generated at Sat Feb 10 02:18:00 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.