[LU-7811] BUG: sleeping function called from invalid context at mm/slub.c:941 Created: 24/Feb/16  Updated: 22/Jun/16  Resolved: 25/Feb/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.9.0

Type: Bug
Priority: Minor
Reporter: Ruth Klundt (Inactive)
Assignee: Kit Westneat
Resolution: Duplicate
Votes: 0
Labels: llnl
Environment:

test hardware running zfs backed mds with Lustre: Build Version: v2_7_1_0-g9e7c0cf-CHANGED-3.10.0-327.0.0.1chaos.ch6.x86_64


Issue Links:
Related: is related to LU-6409 (sleeping while atomic in nodemap_destroy), Resolved
Severity: 3

 Description   

We don't expect to use a ZFS-backed MDS in production anytime soon; this is an experimental setup, but I thought it best to report the issue anyway.

[79475.811789] Lustre: Lustre: Build Version: v2_7_1_0-g9e7c0cf-CHANGED-3.10.0-327.0.0.1chaos.ch6.x86_64

[79488.834992] BUG: sleeping function called from invalid context at mm/slub.c:941
[79488.842318] in_atomic(): 1, irqs_disabled(): 0, pid: 14196, name: mdt00_002
[79488.849337] CPU: 2 PID: 14196 Comm: mdt00_002 Tainted: P OE ------------ 3.10.0-327.0.0.1chaos.ch6.x86_64 #1
[79488.860286] Hardware name: Supermicro X8DTH-i/6/iF/6F/X8DTH, BIOS 2.0a 09/29/2010
[79488.868023] ffff880bfd740070 00000000b04a4301 ffff880bfadcb8d0 ffffffff8164b729
[79488.875658] ffff880bfadcb8e0 ffffffff810b5bc9 ffff880bfadcb920 ffffffff811ca233
[79488.883156] ffffffffa0c618aa ffff880bfd740070 0000000000000001 ffff880c1c006300
[79488.890748] Call Trace:
[79488.893260] [<ffffffff8164b729>] dump_stack+0x19/0x1b
[79488.898458] [<ffffffff810b5bc9>] __might_sleep+0xd9/0x100
[79488.903996] [<ffffffff811ca233>] __kmalloc+0x63/0x270
[79488.909205] [<ffffffffa0c618aa>] ? cfs_hash_buckets_realloc+0x37a/0x660 [libcfs]
[79488.916703] [<ffffffffa0c618aa>] cfs_hash_buckets_realloc+0x37a/0x660 [libcfs]
[79488.924081] [<ffffffffa0c6264f>] cfs_hash_rehash_worker+0x9f/0x510 [libcfs]
[79488.931203] [<ffffffffa0c5f5f7>] ? cfs_hash_bd_add_locked+0x57/0x80 [libcfs]
[79488.938365] [<ffffffffa0c62ba4>] cfs_hash_rehash+0xe4/0x1a0 [libcfs]
[79488.944827] [<ffffffffa0c62f52>] cfs_hash_find_or_add+0x172/0x1b0 [libcfs]
[79488.951805] [<ffffffffa0c62fa7>] cfs_hash_add_unique+0x17/0x30 [libcfs]
[79488.958634] [<ffffffffa1028967>] nm_member_add+0xd7/0x160 [ptlrpc]
[79488.965019] [<ffffffffa1023b37>] nodemap_add_member+0x37/0x70 [ptlrpc]
[79488.971720] [<ffffffffa12c1193>] mdt_obd_connect+0x3b3/0x760 [mdt]
[79488.978061] [<ffffffffa0c5a287>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[79488.984765] [<ffffffffa0f7746f>] target_handle_connect+0x11ef/0x2bf0 [ptlrpc]
[79488.992021] [<ffffffff81647f9a>] ? __slab_free+0x10e/0x277
[79488.997628] [<ffffffff810c1928>] ? __enqueue_entity+0x78/0x80
[79489.003493] [<ffffffff810c7f57>] ? enqueue_entity+0x237/0x890
[79489.009407] [<ffffffffa0ff3a00>] ? nrs_request_removed+0x30/0x120 [ptlrpc]
[79489.016451] [<ffffffffa1013c37>] tgt_request_handle+0x367/0xfd0 [ptlrpc]
[79489.023277] [<ffffffff810dd80e>] ? getnstimeofday64+0xe/0x30
[79489.029134] [<ffffffffa0fbe81b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[79489.036864] [<ffffffffa0c5bea8>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[79489.043753] [<ffffffffa0fbb8eb>] ? ptlrpc_wait_event+0xab/0x340 [ptlrpc]
[79489.050601] [<ffffffff810b33d8>] ? __wake_up_common+0x58/0x90
[79489.056534] [<ffffffffa0fc2140>] ptlrpc_main+0xc00/0x1f50 [ptlrpc]
[79489.062864] [<ffffffff81015588>] ? __switch_to+0xf8/0x4d0
[79489.068452] [<ffffffffa0fc1540>] ? ptlrpc_register_service+0x1070/0x1070 [ptlrpc]
[79489.076056] [<ffffffff810a997f>] kthread+0xcf/0xe0
[79489.080968] [<ffffffff810a98b0>] ? kthread_create_on_node+0x140/0x140
[79489.087525] [<ffffffff8165c618>] ret_from_fork+0x58/0x90
[79489.092982] [<ffffffff810a98b0>] ? kthread_create_on_node+0x140/0x140
[79489.099560] BUG: scheduling while atomic: mdt00_002/14196/0x10000002
[79489.105969] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) ko2iblnd(OE) lnet(OE) sha512_generic crypto_null libcfs(OE) xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm sd_mod crc_t10dif crct10dif_generic sg mlx4_en vxlan mlx4_ib ip6_udp_tunnel udp_tunnel ib_sa ib_mad iw_cxgb4 iw_cm ib_core ib_addr ipmi_devintf intel_powerclamp coretemp kvm crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel mgag200 syscopyarea sysfillrect aesni_intel sysimgblt ttm lrw gf128mul glue_helper drm_kms_helper iTCO_wdt ata_generic ablk_helper pata_acpi
[79489.180080] iTCO_vendor_support cryptd drm mpt2sas mlx4_core pcspkr serio_raw cxgb4 ata_piix lpc_ich ipmi_si raid_class libata ioatdma mfd_core i2c_i801 scsi_transport_sas ipmi_msghandler i7core_edac edac_core shpchp acpi_cpufreq binfmt_misc zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate ip_tables nfsv4 dns_resolver nfsv3 nfs fscache lockd grace nfs_acl sunrpc broadcom tg3 bnx2 igb dca i2c_algo_bit i2c_core e1000e ptp pps_core e1000
[79489.222352] CPU: 2 PID: 14196 Comm: mdt00_002 Tainted: P OE ------------ 3.10.0-327.0.0.1chaos.ch6.x86_64 #1
[79489.233354] Hardware name: Supermicro X8DTH-i/6/iF/6F/X8DTH, BIOS 2.0a 09/29/2010
[79489.241128] ffff880bfadcbfd8 00000000b04a4301 ffff880bfadcb848 ffffffff8164b729
[79489.248654] ffff880bfadcb858 ffffffff81645026 ffff880bfadcb8b8 ffffffff81650f78
[79489.256196] ffff880bfadcbfd8 ffff880c0c462280 ffff880bfadcbfd8 ffff880bfadcbfd8
[79489.263737] Call Trace:
[79489.266218] [<ffffffff8164b729>] dump_stack+0x19/0x1b
[79489.271385] [<ffffffff81645026>] __schedule_bug+0x4d/0x5b
[79489.276901] [<ffffffff81650f78>] __schedule+0x808/0x940
[79489.282248] [<ffffffff810ba4e6>] __cond_resched+0x26/0x30
[79489.287764] [<ffffffff8165137a>] _cond_resched+0x3a/0x50
[79489.293196] [<ffffffff811ca238>] __kmalloc+0x68/0x270
[79489.298379] [<ffffffffa0c618aa>] ? cfs_hash_buckets_realloc+0x37a/0x660 [libcfs]
[79489.305902] [<ffffffffa0c618aa>] cfs_hash_buckets_realloc+0x37a/0x660 [libcfs]
[79489.313255] [<ffffffffa0c6264f>] cfs_hash_rehash_worker+0x9f/0x510 [libcfs]
[79489.320342] [<ffffffffa0c5f5f7>] ? cfs_hash_bd_add_locked+0x57/0x80 [libcfs]
[79489.327521] [<ffffffffa0c62ba4>] cfs_hash_rehash+0xe4/0x1a0 [libcfs]
[79489.334012] [<ffffffffa0c62f52>] cfs_hash_find_or_add+0x172/0x1b0 [libcfs]
[79489.341012] [<ffffffffa0c62fa7>] cfs_hash_add_unique+0x17/0x30 [libcfs]
[79489.347801] [<ffffffffa1028967>] nm_member_add+0xd7/0x160 [ptlrpc]
[79489.354153] [<ffffffffa1023b37>] nodemap_add_member+0x37/0x70 [ptlrpc]
[79489.360817] [<ffffffffa12c1193>] mdt_obd_connect+0x3b3/0x760 [mdt]
[79489.367126] [<ffffffffa0c5a287>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[79489.373814] [<ffffffffa0f7746f>] target_handle_connect+0x11ef/0x2bf0 [ptlrpc]
[79489.381062] [<ffffffff81647f9a>] ? __slab_free+0x10e/0x277
[79489.386661] [<ffffffff810c1928>] ? __enqueue_entity+0x78/0x80
[79489.392525] [<ffffffff810c7f57>] ? enqueue_entity+0x237/0x890
[79489.398438] [<ffffffffa0ff3a00>] ? nrs_request_removed+0x30/0x120 [ptlrpc]
[79489.405485] [<ffffffffa1013c37>] tgt_request_handle+0x367/0xfd0 [ptlrpc]
[79489.412308] [<ffffffff810dd80e>] ? getnstimeofday64+0xe/0x30
[79489.418128] [<ffffffffa0fbe81b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[79489.425832] [<ffffffffa0c5bea8>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[79489.432690] [<ffffffffa0fbb8eb>] ? ptlrpc_wait_event+0xab/0x340 [ptlrpc]
[79489.439512] [<ffffffff810b33d8>] ? __wake_up_common+0x58/0x90
[79489.445444] [<ffffffffa0fc2140>] ptlrpc_main+0xc00/0x1f50 [ptlrpc]
[79489.451775] [<ffffffff81015588>] ? __switch_to+0xf8/0x4d0
[79489.457365] [<ffffffffa0fc1540>] ? ptlrpc_register_service+0x1070/0x1070 [ptlrpc]
[79489.464992] [<ffffffff810a997f>] kthread+0xcf/0xe0
[79489.469921] [<ffffffff810a98b0>] ? kthread_create_on_node+0x140/0x140
[79489.476504] [<ffffffff8165c618>] ret_from_fork+0x58/0x90
[79489.481957] [<ffffffff810a98b0>] ? kthread_create_on_node+0x140/0x140



 Comments   
Comment by Ruth Klundt (Inactive) [ 24/Feb/16 ]

Maybe related to nodemap issues.

The bug occurred during startup, before any clients had connected.

Comment by Ruth Klundt (Inactive) [ 24/Feb/16 ]

Also occurs with Lustre: Build Version: v2_8_0_0_RC2--PRISTINE-3.10.0-327.0.0.1chaos.ch6.x86_64, though not during startup. There it appears right after mounting the first client, with a different trace:
[ 520.092911] BUG: sleeping function called from invalid context at mm/slub.c:941
[ 520.100232] in_atomic(): 1, irqs_disabled(): 0, pid: 9072, name: mdt00_003
[ 520.107161] CPU: 1 PID: 9072 Comm: mdt00_003 Tainted: P OE ------------ 3.10.0-327.0.0.1chaos.ch6.x86_64 #1
[ 520.118079] Hardware name: Supermicro X8DTH-i/6/iF/6F/X8DTH, BIOS 2.0a 09/29/2010
[ 520.125876] ffff880bfde879a0 0000000061159716 ffff880bfde878a8 ffffffff8164b729
[ 520.133557] ffff880bfde878b8 ffffffff810b5bc9 ffff880bfde87900 ffffffff811c8a6a
[ 520.141151] ffff880627803a00 ffffffff8109d2bf ffff880bfde879a0 ffff880bfde879b8
[ 520.148781] Call Trace:
[ 520.151293] [<ffffffff8164b729>] dump_stack+0x19/0x1b
[ 520.156493] [<ffffffff810b5bc9>] __might_sleep+0xd9/0x100
[ 520.162041] [<ffffffff811c8a6a>] kmem_cache_alloc_trace+0x4a/0x240
[ 520.168369] [<ffffffff8109d2bf>] ? call_usermodehelper_setup+0x3f/0xa0
[ 520.175042] [<ffffffff8109d2bf>] call_usermodehelper_setup+0x3f/0xa0
[ 520.181532] [<ffffffff8109d621>] call_usermodehelper+0x31/0x60
[ 520.187530] [<ffffffffa12c16eb>] mdt_identity_do_upcall+0xfb/0x480 [mdt]
[ 520.194392] [<ffffffffa0c6bd77>] ? cfs_hash_bd_lookup_intent+0x57/0x160 [libcfs]
[ 520.201928] [<ffffffff811c0030>] ? SYSC_mbind+0x440/0x6e0
[ 520.207522] [<ffffffffa0d93eff>] upcall_cache_get_entry+0x2af/0x8e0 [obdclass]
[ 520.214943] [<ffffffffa0f77b20>] ? lustre_msg_buf_v2+0x1b0/0x1b0 [ptlrpc]
[ 520.221891] [<ffffffffa12c1f17>] mdt_identity_get+0x17/0x50 [mdt]
[ 520.228136] [<ffffffffa12a34fb>] old_init_ucred_common+0xcb/0x290 [mdt]
[ 520.234907] [<ffffffffa12a54c6>] mdt_init_ucred_intent_getattr+0x1d6/0x260 [mdt]
[ 520.242459] [<ffffffffa129c735>] mdt_intent_getattr+0xc5/0x470 [mdt]
[ 520.248971] [<ffffffffa129fc2c>] mdt_intent_policy+0x5bc/0xbb0 [mdt]
[ 520.255500] [<ffffffffa0f2c0d7>] ldlm_lock_enqueue+0x387/0x970 [ptlrpc]
[ 520.262295] [<ffffffffa0f549e2>] ldlm_handle_enqueue0+0x772/0x16b0 [ptlrpc]
[ 520.269459] [<ffffffffa0f7c030>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
[ 520.277053] [<ffffffffa0fd41b2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 520.283259] [<ffffffffa0fd85d5>] tgt_request_handle+0x915/0x1320 [ptlrpc]
[ 520.290241] [<ffffffffa0f850cb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[ 520.297961] [<ffffffffa0c68758>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[ 520.304830] [<ffffffffa0f82c9b>] ? ptlrpc_wait_event+0xab/0x350 [ptlrpc]
[ 520.311650] [<ffffffff810bd452>] ? default_wake_function+0x12/0x20
[ 520.317952] [<ffffffff810b33d8>] ? __wake_up_common+0x58/0x90
[ 520.323864] [<ffffffffa0f89170>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
[ 520.330217] [<ffffffffa0f886e0>] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
[ 520.337648] [<ffffffff810a997f>] kthread+0xcf/0xe0
[ 520.342563] [<ffffffff810a98b0>] ? kthread_create_on_node+0x140/0x140
[ 520.349128] [<ffffffff8165c618>] ret_from_fork+0x58/0x90
[ 520.354564] [<ffffffff810a98b0>] ? kthread_create_on_node+0x140/0x140

Comment by Peter Jones [ 25/Feb/16 ]

Kit

Could you please advise on this one?

Thanks

Peter

Comment by Kit Westneat [ 25/Feb/16 ]

The second trace is a dupe of LU-6447. I think that the bug causing the first trace was addressed in LU-6409, but I'm not sure. Is it possible to see the list of patches carried in that chaos version? Or at least the nodemap related ones?

Comment by Ruth Klundt (Inactive) [ 25/Feb/16 ]

Hi Kit,

The first trace is from the lustre fe release at commit 9e7c0cf, built against the chaos kernel, so there are no extra patches, unless you mean kernel patches. That lustre tree does not have the patch from LU-6409. I will apply it and try that version again. Note, however, that this trace is in nodemap_add_member, so I suspect that even if the patch makes this bug disappear, there may be another code path lurking here that can sleep under a lock.

The second one is from git://git.hpdd.intel.com/fs/lustre-release.git at commit 98ace9c, tag: v2_8_0_RC2. The LU-6447 patch hasn't been committed yet, as far as I can see. Sorry for the confusion. Please ignore the second trace for the purposes of this bug.

Ruth

Comment by Ruth Klundt (Inactive) [ 25/Feb/16 ]

It looks like the first tree is missing several nodemap patches. I think I'll stick with v2_8_0_RC2.

Comment by Kit Westneat [ 25/Feb/16 ]

Hi Ruth,

I'm having a hard time finding a commit tagged v2_7_1_0, 9e7c0cf, or g9e7c0cf, but if it's basically 2.7, then I think it's fixed in http://review.whamcloud.com/#/c/14254/.

The LU-6409 patch itself doesn't address the stack trace you posted, but fixing LU-6409 revealed a lot of other locking issues, which were then fixed in gerrit change 14254, including the one behind that stack trace. You can see Oleg got a similar trace:
https://jira.hpdd.intel.com/browse/LU-6409?focusedCommentId=117248&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-117248

Kit

Comment by Ruth Klundt (Inactive) [ 25/Feb/16 ]

OK, thanks. Go ahead and close as a duplicate.

Comment by John Fuchs-Chesney (Inactive) [ 25/Feb/16 ]

Thanks Ruth.
~ jfc.
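
The first trace shows `__kmalloc` being reached from `cfs_hash_rehash` while `in_atomic()` is 1, that is, an allocation that may sleep was attempted in a context that must not sleep. One common way to avoid this class of bug is to allocate before entering the atomic section and only swap pointers inside it. The following is a minimal userspace sketch of that pattern, not the libcfs code and not the change made in gerrit 14254; `struct table`, `table_init`, `table_grow`, and `table_fini` are hypothetical names, and a C11 atomic flag stands in for a kernel spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical table for illustration; NOT the libcfs cfs_hash structure. */
struct table {
    atomic_bool locked;   /* stands in for a kernel spinlock */
    size_t nbuckets;
    int *buckets;
};

/* Busy-wait lock: code between lock and unlock must never block. */
static void table_lock(struct table *t)
{
    while (atomic_exchange_explicit(&t->locked, true, memory_order_acquire))
        ;
}

static void table_unlock(struct table *t)
{
    atomic_store_explicit(&t->locked, false, memory_order_release);
}

int table_init(struct table *t, size_t nbuckets)
{
    atomic_init(&t->locked, false);
    t->buckets = calloc(nbuckets, sizeof(*t->buckets));
    if (t->buckets == NULL)
        return -1;
    t->nbuckets = nbuckets;
    return 0;
}

/* Grow the bucket array without allocating or freeing under the lock. */
int table_grow(struct table *t, size_t new_nbuckets)
{
    int *fresh, *old;

    assert(t != NULL && new_nbuckets > 0);

    /* Step 1: allocate while no lock is held; this call may block. */
    fresh = calloc(new_nbuckets, sizeof(*fresh));
    if (fresh == NULL)
        return -1;

    /* Step 2: inside the atomic section, only copy and swap pointers. */
    table_lock(t);
    if (new_nbuckets <= t->nbuckets) {
        /* Table is already large enough; drop the unused allocation. */
        table_unlock(t);
        free(fresh);
        return 0;
    }
    memcpy(fresh, t->buckets, t->nbuckets * sizeof(*fresh));
    old = t->buckets;
    t->buckets = fresh;
    t->nbuckets = new_nbuckets;
    table_unlock(t);

    /* Step 3: release the old array after the lock has been dropped. */
    free(old);
    return 0;
}

void table_fini(struct table *t)
{
    free(t->buckets);
    t->buckets = NULL;
    t->nbuckets = 0;
}
```

A caller would run `table_init(&t, 4)`, later `table_grow(&t, 8)`, and finally `table_fini(&t)`. In kernel code the other usual option is to keep the allocation inside the atomic section but use a non-sleeping allocation mode such as `GFP_ATOMIC`, at the cost of a higher chance of allocation failure.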

Generated at Sat Feb 10 02:12:07 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.