[LU-10511] Stack overflow (?) on ost osd ldiskfs write path? Created: 14/Jan/18  Updated: 14/Jan/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

From time to time I am getting crashes like this:

[48156.988750] Lustre: lustre-OST0001: Imperative Recovery not enabled, recovery window 60-180
[48163.913945] Lustre: lustre-OST0000: Connection restored to 192.168.10.219@tcp (at 0@lo)
[48163.949542] BUG: sleeping function called from invalid context at /home/green/git/lustre-release/ldiskfs/ext4_jbd2.c:259
[48163.951038] in_atomic(): 1, irqs_disabled(): 0, pid: 4792, name: ll_ost00_002
[48163.951463] Lustre: Mounted lustre-client
[48163.953482] INFO: lockdep is turned off.
[48163.954365] CPU: 1 PID: 4792 Comm: ll_ost00_002 Tainted: G        W  OE  ------------   3.10.0-debug #1
[48163.956101] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[48163.957005]  ffff88004b92e6c0 000000000f8a8783 ffff880019e6f840 ffffffff816fd400
[48163.958581]  ffff880019e6f858 ffffffff810b0109 ffff88002f521f88 ffff880019e6f8b8
[48163.960111]  ffffffffa0a63308 ffff88002f521f88 0000000000000002 000000000f8a8783
[48163.961584] Call Trace:
[48163.962274]  [<ffffffff816fd400>] dump_stack+0x19/0x1b
[48163.963004]  [<ffffffff810b0109>] __might_sleep+0xe9/0x110
[48163.963787]  [<ffffffffa0a63308>] __ldiskfs_handle_dirty_metadata+0x38/0x230 [ldiskfs]
[48163.965210]  [<ffffffff810a4055>] ? wake_up_bit+0x25/0x30
[48163.965968]  [<ffffffffa0a77bb2>] ldiskfs_getblk+0x142/0x210 [ldiskfs]
[48163.966850]  [<ffffffffa0a77ca7>] ldiskfs_bread+0x27/0xe0 [ldiskfs]
[48163.967650]  [<ffffffffa0b14531>] osd_ldiskfs_write_record+0x181/0x3d0 [osd_ldiskfs]
[48163.968765]  [<ffffffff810e3244>] ? lockdep_init_map+0xc4/0x600
[48163.969398]  [<ffffffffa0b148c0>] osd_write+0x140/0x5b0 [osd_ldiskfs]
[48163.970006]  [<ffffffffa03bfd09>] dt_record_write+0x39/0x120 [obdclass]
[48163.970690]  [<ffffffffa063fb37>] tgt_client_data_write.isra.18+0x167/0x180 [ptlrpc]
[48163.971859]  [<ffffffffa06431d3>] tgt_client_data_update+0x393/0x5d0 [ptlrpc]
[48163.972595]  [<ffffffffa064382b>] tgt_client_new+0x41b/0x610 [ptlrpc]
[48163.973432]  [<ffffffffa0db6ff3>] ofd_obd_connect+0x3a3/0x4c0 [ofd]
[48163.974487]  [<ffffffffa05ad028>] target_handle_connect+0x1118/0x29e0 [ptlrpc]
[48163.976234]  [<ffffffffa065275a>] tgt_request_handle+0x40a/0x13e0 [ptlrpc]
[48163.977196]  [<ffffffffa05f7c21>] ptlrpc_server_handle_request+0x261/0xaf0 [ptlrpc]
[48163.978766]  [<ffffffffa05fb9d8>] ptlrpc_main+0xa58/0x1df0 [ptlrpc]
[48163.979522]  [<ffffffff81706487>] ? _raw_spin_unlock_irq+0x27/0x50
[48163.980246]  [<ffffffffa05faf80>] ? ptlrpc_register_service+0xeb0/0xeb0 [ptlrpc]
[48163.981587]  [<ffffffff810a2eda>] kthread+0xea/0xf0
[48163.982154]  [<ffffffff810a2df0>] ? kthread_create_on_node+0x140/0x140
[48163.983448]  [<ffffffff8170fbd8>] ret_from_fork+0x58/0x90
[48163.984044]  [<ffffffff810a2df0>] ? kthread_create_on_node+0x140/0x140
[48163.984926] LNetError: 4792:0:(lib-lnet.h:479:lnet_msg_alloc()) ASSERTION( !(((current_thread_info()->preempt_count) & ((((1UL << (10))-1) << ((0 + 8) + 8)) | (((1UL << (8))-1) << (0 + 8)) | (((1UL << (1))-1) << (((0 + 8) + 8) + 10))))) || (((sizeof(*msg))) <= (2 << 12) && (((((( gfp_t)0x10u) | (( gfp_t)0x40u)))) & ((( gfp_t)0x20u)))) != 0 ) failed:
[48163.988383] LNetError: 4792:0:(lib-lnet.h:479:lnet_msg_alloc()) LBUG
[48163.989081] Kernel panic - not syncing: LBUG in interrupt.

[48163.990700] CPU: 1 PID: 4792 Comm: ll_ost00_002 Tainted: G        W  OE  ------------   3.10.0-debug #1
[48163.992415] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[48163.993321]  ffffffffa01d81fe 000000000f8a8783 ffff880019e6f9f8 ffffffff816fd400
[48163.995039]  ffff880019e6fa78 ffffffff816f8c74 0000000000000008 ffff880019e6fa88
[48163.996763]  ffff880019e6fa28 000000000f8a8783 0000000000000010 0000000000000001
[48163.998433] Call Trace:
[48163.999220]  [<ffffffff816fd400>] dump_stack+0x19/0x1b
[48164.000084]  [<ffffffff816f8c74>] panic+0xd8/0x1e7
[48164.000910]  [<ffffffffa01b8882>] lbug_with_loc+0x72/0xb0 [libcfs]
[48164.001591]  [<ffffffffa030408c>] LNetPut+0x6bc/0x7a0 [lnet]
[48164.002227]  [<ffffffffa05e32c6>] ptl_send_buf+0x146/0x530 [ptlrpc]
[48164.002921]  [<ffffffffa0606a37>] ? at_measured+0x1c7/0x380 [ptlrpc]
[48164.003661]  [<ffffffffa05e6711>] ptlrpc_send_reply+0x2c1/0x890 [ptlrpc]
[48164.004379]  [<ffffffffa05a60b1>] target_send_reply_msg+0x91/0x180 [ptlrpc]
[48164.005048]  [<ffffffffa05b0736>] target_send_reply+0x326/0x750 [ptlrpc]
[48164.005788]  [<ffffffffa05ed597>] ? lustre_msg_set_last_committed+0x27/0xa0 [ptlrpc]
[48164.007091]  [<ffffffffa06528e7>] tgt_request_handle+0x597/0x13e0 [ptlrpc]
[48164.007971]  [<ffffffffa05f7c21>] ptlrpc_server_handle_request+0x261/0xaf0 [ptlrpc]
[48164.009154]  [<ffffffffa05fb9d8>] ptlrpc_main+0xa58/0x1df0 [ptlrpc]
[48164.009735]  [<ffffffff81706487>] ? _raw_spin_unlock_irq+0x27/0x50
[48164.010363]  [<ffffffffa05faf80>] ? ptlrpc_register_service+0xeb0/0xeb0 [ptlrpc]
[48164.011738]  [<ffffffff810a2eda>] kthread+0xea/0xf0
[48164.012328]  [<ffffffff810a2df0>] ? kthread_create_on_node+0x140/0x140
[48164.013853]  [<ffffffff8170fbd8>] ret_from_fork+0x58/0x90
[48164.014732]  [<ffffffff810a2df0>] ? kthread_create_on_node+0x140/0x140

I think this is a sign of a stack overflow sometime earlier that corrupted our task's kernel data, making it believe all sorts of incorrect things, such as that we are in IRQ/atomic context.
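
To make the failed lnet_msg_alloc() assertion easier to read, here is a minimal sketch that rebuilds the mask from the shifts and widths printed in the expanded macro. The numeric values come straight from the log; the field names (softirq, hardirq, NMI) and the GFP flag names in the comments are my assumptions based on the usual preempt_count layout of kernels of this era, and the helper names are made up for illustration.

```python
# Decode the expanded assertion printed by lnet_msg_alloc() in the log above.
# Widths and shifts are copied from the macro expansion; the field names are
# assumptions (typical 3.10-era preempt_count layout), not taken from the log.

PREEMPT_BITS = 8    # low bits: plain preempt-disable nesting (assumed name)
SOFTIRQ_BITS = 8
HARDIRQ_BITS = 10
NMI_BITS = 1

SOFTIRQ_SHIFT = 0 + PREEMPT_BITS              # (0 + 8)
HARDIRQ_SHIFT = SOFTIRQ_SHIFT + SOFTIRQ_BITS  # ((0 + 8) + 8)
NMI_SHIFT = HARDIRQ_SHIFT + HARDIRQ_BITS      # (((0 + 8) + 8) + 10)


def field_mask(bits, shift):
    """Same shape as the (((1UL << bits) - 1) << shift) terms in the log."""
    return ((1 << bits) - 1) << shift


SOFTIRQ_MASK = field_mask(SOFTIRQ_BITS, SOFTIRQ_SHIFT)
HARDIRQ_MASK = field_mask(HARDIRQ_BITS, HARDIRQ_SHIFT)
NMI_MASK = field_mask(NMI_BITS, NMI_SHIFT)
IRQ_CONTEXT_MASK = HARDIRQ_MASK | SOFTIRQ_MASK | NMI_MASK

# GFP values as printed in the assertion. Likely __GFP_WAIT | __GFP_IO tested
# against __GFP_HIGH on this kernel, but those names are an assumption.
ALLOC_FLAGS = 0x10 | 0x40
ATOMIC_FLAG = 0x20
MAX_ATOMIC_SIZE = 2 << 12  # 8192 bytes


def alloc_assertion_holds(preempt_count, size):
    """Hypothetical helper: evaluate the logged assertion for given inputs."""
    in_irq_context = (preempt_count & IRQ_CONTEXT_MASK) != 0
    small_atomic_alloc = size <= MAX_ATOMIC_SIZE and (ALLOC_FLAGS & ATOMIC_FLAG) != 0
    return (not in_irq_context) or small_atomic_alloc


if __name__ == "__main__":
    print(hex(IRQ_CONTEXT_MASK))
    print(alloc_assertion_holds(0x00000000, 128))  # clean process context
    print(alloc_assertion_holds(0x00000001, 128))  # only preempt-disable bits set
    print(alloc_assertion_holds(0x00000100, 128))  # a softirq-field bit set
```

If this decoding is right, the second half of the assertion can never be true for these flags (0x50 & 0x20 is 0), so the check reduces to "none of the softirq/hardirq/NMI bits are set in preempt_count". A count with only the low preempt-disable bits set would not trip it, which is at least consistent with preempt_count holding garbage rather than the thread merely holding a spinlock.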

