[LU-7017] Client stack overflow in racer Created: 18/Aug/15  Updated: 14/Dec/21  Resolved: 14/Dec/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I hit a client stack overflow today while running racer on current master with migrate testing enabled:

<3>[79285.911482] BUG: sleeping function called from invalid context at block/cfq-iosched.c:3712
<3>[79285.911985] in_atomic(): 1, irqs_disabled(): 0, pid: 25130, name: flush-lustre-2
<4>[79285.912452] Pid: 25130, comm: flush-lustre-2 Tainted: P           ---------------    2.6.32-rhe6.6-debug #1
<1>[79285.913739] BUG: unable to handle kernel paging request at fffffffbdfb6c580
<1>[79285.913865] IP: [<ffffffff8105d07c>] update_curr+0x14c/0x200
<4>[79285.913865] PGD 1a27067 PUD 0 
<4>[79285.913865] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
<4>[79285.913865] last sysfs file: /sys/devices/system/cpu/possible
<4>[79285.913865] CPU 5 
<4>[79285.913865] Modules linked in: lustre ofd osp lod ost mdt mdd mgs osd_ldiskfs ldiskfs lquota lfsck obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass ksocklnd lnet libcfs zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl zlib_deflate exportfs jbd sha512_generic sha256_generic ext4 jbd2 mbcache virtio_console virtio_balloon i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache auth_rpcgss nfs_acl sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
<4>[79285.913865] 
<4>[79285.913865] Pid: 25130, comm: flush-lustre-2 Tainted: P           ---------------    2.6.32-rhe6.6-debug #1 Red Hat KVM
<4>[79285.913865] RIP: 0010:[<ffffffff8105d07c>]  [<ffffffff8105d07c>] update_curr+0x14c/0x200
<4>[79285.913865] RSP: 0018:ffff880006343c38  EFLAGS: 00010086
<4>[79285.913865] RAX: ffff88003a2be100 RBX: ffffffff8bbf9060 RCX: ffff8800bbbf2f30
<4>[79285.913865] RDX: 0000000000018710 RSI: 0000000000000000 RDI: ffff88003a2be138
<4>[79285.913865] RBP: ffff880006343c68 R08: 0000000000000000 R09: 0000000000000000
<4>[79285.913865] R10: 0000000000000000 R11: 0000000000000400 R12: 0000000000000001
<4>[79285.913865] R13: 000000000022b038 R14: 0000000000000000 R15: ffff880006355b00
<4>[79285.913865] FS:  0000000000000000(0000) GS:ffff880006340000(0000) knlGS:0000000000000000
<4>[79285.913865] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>[79285.913865] CR2: fffffffbdfb6c580 CR3: 00000000b741d000 CR4: 00000000000006e0
<4>[79285.913865] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[79285.913865] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[79285.913865] Process flush-lustre-2 (pid: 25130, threadinfo ffff88008ef8e000, task ffff88003a2be100)
<4>[79285.913865] Stack:
<4>[79285.913865]  ffff880006343c48 ffffffff810149a9 ffff8800b18f40b8 ffff880006355b78
<4>[79285.913865] <d> 0000000000000003 0000000000000005 ffff880006343cb8 ffffffff8105e0db
<4>[79285.913865] <d> ffff880006343c98 ffffffff810a4608 ffff880006355b00 ffff8800b18f4080
<4>[79285.913865] Call Trace:
<4>[79285.913865]  <IRQ> 
<4>[79285.913865]  [<ffffffff810149a9>] ? sched_clock+0x9/0x10
<4>[79285.913865]  [<ffffffff8105e0db>] enqueue_task_fair+0x5b/0x510
<4>[79285.913865]  [<ffffffff810a4608>] ? sched_clock_cpu+0xb8/0x110
<4>[79285.913865]  [<ffffffff8105c8f6>] enqueue_task+0x66/0x80
<4>[79285.913865]  [<ffffffff8105c933>] activate_task+0x23/0x30
<4>[79285.913865]  [<ffffffff81061440>] try_to_wake_up+0x1f0/0x3e0
<4>[79285.913865]  [<ffffffff81061642>] default_wake_function+0x12/0x20
<4>[79285.913865]  [<ffffffff8109d2e6>] autoremove_wake_function+0x16/0x40
<4>[79285.913865]  [<ffffffff81057049>] __wake_up_common+0x59/0x90
<4>[79285.913865]  [<ffffffff8105b128>] __wake_up+0x48/0x70
<4>[79285.913865]  [<ffffffff810af6d0>] ? tick_sched_timer+0x0/0xc0
<4>[79285.913865]  [<ffffffff81073ad7>] printk_tick+0x47/0x50
<4>[79285.913865]  [<ffffffff81085c3d>] update_process_times+0x4d/0x90
<4>[79285.913865]  [<ffffffff810af736>] tick_sched_timer+0x66/0xc0
<4>[79285.913865]  [<ffffffff810a19cd>] __run_hrtimer+0x8d/0x1a0
<4>[79285.913865]  [<ffffffff810a971f>] ? ktime_get_update_offsets+0x4f/0xd0
<4>[79285.913865]  [<ffffffff810a1d36>] hrtimer_interrupt+0xe6/0x260
<4>[79285.913865]  [<ffffffff81132cb2>] ? drain_pages+0x42/0xa0
<4>[79285.913865]  [<ffffffff810339ed>] local_apic_timer_interrupt+0x3d/0x70
<4>[79285.913865]  [<ffffffff81529225>] smp_apic_timer_interrupt+0x45/0x60
<4>[79285.913865]  [<ffffffff8100bbd3>] apic_timer_interrupt+0x13/0x20
<4>[79285.913865]  <EOI> 
<4>[79285.913865]  [<ffffffff81074106>] ? vprintk+0x336/0x5a0
<4>[79285.913865]  [<ffffffff8151dde2>] printk+0x41/0x47
<4>[79285.913865]  [<ffffffff8151dc1e>] dump_stack+0x62/0x76
<4>[79285.913865]  [<ffffffff8105e91a>] __might_sleep+0xda/0x100
<4>[79285.913865]  [<ffffffff81286733>] cfq_set_request+0x433/0x580
<4>[79285.913865]  [<ffffffff81125c55>] ? mempool_alloc_slab+0x15/0x20
<4>[79285.913865]  [<ffffffff81041df8>] ? pvclock_clocksource_read+0x58/0xd0
<4>[79285.913865]  [<ffffffff81267eab>] elv_set_request+0x1b/0x30
<4>[79285.913865]  [<ffffffff8126fca2>] get_request+0x2f2/0x3b0
<4>[79285.913865]  [<ffffffff8126fd8a>] get_request_wait+0x2a/0x1d0
<4>[79285.913865]  [<ffffffff812697ae>] ? elv_merge+0x18e/0x1d0
<4>[79285.913865]  [<ffffffff8126ffc9>] blk_queue_bio+0x99/0x610
<4>[79285.913865]  [<ffffffff8126f085>] generic_make_request+0x2f5/0x640
<4>[79285.913865]  [<ffffffff81125df3>] ? mempool_alloc+0x63/0x160
<4>[79285.913865]  [<ffffffff8109d2d0>] ? autoremove_wake_function+0x0/0x40
<4>[79285.913865]  [<ffffffff8126f440>] submit_bio+0x70/0x120
<4>[79285.913865]  [<ffffffff8115f374>] swap_writepage+0x94/0xe0
<4>[79285.913865]  [<ffffffff8113cb66>] pageout.clone.2+0x136/0x320
<4>[79285.913865]  [<ffffffff8113d1cf>] shrink_page_list.clone.3+0x47f/0x6b0
<4>[79285.913865]  [<ffffffff8117e3db>] ? mem_cgroup_lru_del_list+0x2b/0xb0
<4>[79285.913865]  [<ffffffff8113d687>] ? isolate_lru_pages.clone.0+0xd7/0x170
<4>[79285.913865]  [<ffffffff8113ddcb>] shrink_inactive_list+0x33b/0x810
<4>[79285.913865]  [<ffffffff8152235e>] ? _write_unlock+0xe/0x10
<4>[79285.913865]  [<ffffffffa0d4f4f4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
<4>[79285.913865]  [<ffffffff8113e616>] shrink_mem_cgroup_zone+0x376/0x520
<4>[79285.913865]  [<ffffffff811800ed>] ? mem_cgroup_iter+0xfd/0x280
<4>[79285.913865]  [<ffffffff8113e83a>] shrink_zone+0x7a/0x180
<4>[79285.913865]  [<ffffffff8113ea55>] do_try_to_free_pages+0x115/0x610
<4>[79285.913865]  [<ffffffff81130e9c>] ? get_page_from_freelist+0x15c/0x880
<4>[79285.913865]  [<ffffffff8113f122>] try_to_free_pages+0x92/0x120
<4>[79285.913865]  [<ffffffff81133469>] __alloc_pages_nodemask+0x509/0x970
<4>[79285.913865]  [<ffffffff81173072>] kmem_getpages+0x62/0x170
<4>[79285.913865]  [<ffffffff81175aca>] fallback_alloc+0x1ba/0x270
<4>[79285.913865]  [<ffffffff81175377>] ? cache_grow+0x4d7/0x520
<4>[79285.913865]  [<ffffffff811757b8>] ____cache_alloc_node+0xa8/0x200
<4>[79285.913865]  [<ffffffff811760c3>] kmem_cache_alloc_trace+0x1c3/0x250
<4>[79285.913865]  [<ffffffffa0db6f46>] ? lnet_parse+0x2d6/0xd20 [lnet]
<4>[79285.913865]  [<ffffffffa0db6f46>] lnet_parse+0x2d6/0xd20 [lnet]
<4>[79285.913865]  [<ffffffffa0db810b>] lolnd_send+0x2b/0xa0 [lnet]
<4>[79285.913865]  [<ffffffffa0db06ab>] lnet_ni_send+0x4b/0xf0 [lnet]
<4>[79285.913865]  [<ffffffffa0db4d23>] lnet_send+0x883/0xba0 [lnet]
<4>[79285.913865]  [<ffffffffa0db5b0c>] LNetPut+0x2fc/0x810 [lnet]
<4>[79285.913865]  [<ffffffffa14ec0f0>] ptl_send_buf+0x1e0/0x540 [ptlrpc]
<4>[79285.913865]  [<ffffffff81040e8c>] ? kvm_clock_read+0x1c/0x20
<4>[79285.913865]  [<ffffffffa14ef7dd>] ptl_send_rpc+0x64d/0xde0 [ptlrpc]
<4>[79285.913865]  [<ffffffffa14e557b>] ptlrpc_send_new_req+0x4db/0x850 [ptlrpc]
<4>[79285.913865]  [<ffffffffa14e92e6>] ptlrpc_set_wait+0x676/0x9d0 [ptlrpc]
<4>[79285.913865]  [<ffffffffa0e2076c>] ? lustre_get_jobid+0xcc/0x380 [obdclass]
<4>[79285.913865]  [<ffffffffa14f4fd5>] ? lustre_msg_set_jobid+0xf5/0x130 [ptlrpc]
<4>[79285.913865]  [<ffffffffa14e96c1>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
<4>[79285.913865]  [<ffffffffa14c1f7e>] ldlm_cli_enqueue+0x37e/0x870 [ptlrpc]
<4>[79285.913865]  [<ffffffffa14c70d0>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
<4>[79285.913865]  [<ffffffffa0b5add0>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
<4>[79285.913865]  [<ffffffffa0449d7a>] mdc_enqueue+0x29a/0x18d0 [mdc]
<4>[79285.913865]  [<ffffffffa040cf1b>] lmv_enqueue+0x2bb/0x610 [lmv]
<4>[79285.913865]  [<ffffffffa0d4ac31>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
<4>[79285.913865]  [<ffffffffa0b2e7bf>] ll_layout_refresh_locked+0x33f/0xe10 [lustre]
<4>[79285.913865]  [<ffffffffa0b5add0>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
<4>[79285.913865]  [<ffffffffa14c70d0>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
<4>[79285.913865]  [<ffffffffa0b2f439>] ll_layout_refresh+0x1a9/0x310 [lustre]
<4>[79285.913865]  [<ffffffffa0b794af>] vvp_io_init+0x39f/0x480 [lustre]
<4>[79285.913865]  [<ffffffffa0e673f8>] cl_io_init0+0x88/0x150 [obdclass]
<4>[79285.913865]  [<ffffffffa0e6a9a4>] cl_io_init+0x64/0xe0 [obdclass]
<4>[79285.913865]  [<ffffffffa0b28c82>] cl_sync_file_range+0x112/0x2f0 [lustre]
<4>[79285.913865]  [<ffffffffa0b4fe02>] ll_writepages+0xa2/0x240 [lustre]
<4>[79285.913865]  [<ffffffff81138a44>] do_writepages+0x24/0x40
<4>[79285.913865]  [<ffffffff811bc5ec>] writeback_single_inode+0xdc/0x2a0
<4>[79285.913865]  [<ffffffff811bca32>] writeback_sb_inodes+0xc2/0x180
<4>[79285.913865]  [<ffffffff811bcb6b>] writeback_inodes_wb+0x7b/0x1a0
<4>[79285.913865]  [<ffffffff811bcf7b>] wb_writeback+0x2eb/0x410
<4>[79285.913865]  [<ffffffff81522574>] ? _spin_lock_irqsave+0x24/0x30
<4>[79285.913865]  [<ffffffff810866a2>] ? del_timer_sync+0x22/0x30
<4>[79285.913865]  [<ffffffff811bd139>] wb_do_writeback+0x99/0x250
<4>[79285.913865]  [<ffffffff811bd353>] bdi_writeback_task+0x63/0x1b0
<4>[79285.913865]  [<ffffffff8109d157>] ? bit_waitqueue+0x17/0xd0
<4>[79285.913865]  [<ffffffff811479a0>] ? bdi_start_fn+0x0/0x100
<4>[79285.913865]  [<ffffffff81147a26>] bdi_start_fn+0x86/0x100
<4>[79285.913865]  [<ffffffff811479a0>] ? bdi_start_fn+0x0/0x100
<4>[79285.913865]  [<ffffffff8109ce4e>] kthread+0x9e/0xc0
<4>[79285.913865]  [<ffffffff8100c24a>] child_rip+0xa/0x20
<4>[79285.913865]  [<ffffffff8109cdb0>] ? kthread+0x0/0xc0
<4>[79285.913865]  [<ffffffff8100c240>] ? child_rip+0x0/0x20
<4>[79285.913865] Code: 9e 00 45 85 e4 74 3b 48 8b 50 08 8b 5a 18 48 8b 90 60 09 00 00 48 8b 4a 50 48 85 c9 74 24 48 63 db 66 0f 1f 44 00 00 48 8b 51 20 <48> 03 14 dd 80 42 ba 81 4c 01 2a 48 8b 89 98 00 00 00 48 85 c9 
<1>[79285.913865] RIP  [<ffffffff8105d07c>] update_curr+0x14c/0x200
<4>[79285.913865]  RSP <ffff880006343c38>
<4>[79285.913865] CR2: fffffffbdfb6c580
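The report does not explain why this trace points to a stack overflow. The nesting depth of the Call Trace, read from the bottom up, makes the case. The flush-lustre-2 writeback thread enters ll_writepages and refreshes the layout through an LDLM enqueue. It sends that RPC over the LNet loopback driver, so lolnd_send calls the receive-side lnet_parse on the same stack. It then allocates memory, drops into direct reclaim and swap I/O in the block layer, and is finally hit by a timer interrupt. The Python sketch below is a hypothetical helper, not part of Lustre or the kernel. It pulls the frames out of the Call Trace section, skips the speculative `?` entries, and collapses consecutive frames by module so the nesting is easier to see. It assumes the 2.6.32-era x86_64 trace format shown above and should be given only the lines between `Call Trace:` and `Code:`, because the RIP lines also contain `[<address>] function+...` and would otherwise be counted. The sample input is a verbatim excerpt of this trace.

```python
import re
from collections import Counter

# One Call Trace entry in the 2.6.32-era x86_64 oops format:
#   [<address>] [? ]function+0xoffset/0xsize [module]
# A leading "?" marks a frame found by scanning the stack that the
# unwinder could not confirm, so it is treated as unreliable here.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]{16}>\]\s+"
    r"(?P<guess>\?\s+)?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_trace(lines):
    """Return (function, module, reliable) tuples, innermost frame first."""
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m:
            # No "[module]" suffix means the code is built into the kernel image.
            module = m.group("module") or "kernel"
            frames.append((m.group("func"), module, m.group("guess") is None))
    return frames


def summarize(frames):
    """Count reliable frames per module and collapse runs of the same module.

    The collapsed chain is ordered outermost (thread entry) first, which is
    the order in which the calls were actually made.
    """
    reliable = [(func, module) for func, module, ok in frames if ok]
    per_module = Counter(module for _, module in reliable)
    chain = []
    for _, module in reversed(reliable):
        if not chain or chain[-1] != module:
            chain.append(module)
    return len(reliable), per_module, chain


# Verbatim excerpt of the Call Trace above (innermost frame first).
TRACE = """\
<4>[79285.913865]  [<ffffffff811760c3>] kmem_cache_alloc_trace+0x1c3/0x250
<4>[79285.913865]  [<ffffffffa0db6f46>] ? lnet_parse+0x2d6/0xd20 [lnet]
<4>[79285.913865]  [<ffffffffa0db6f46>] lnet_parse+0x2d6/0xd20 [lnet]
<4>[79285.913865]  [<ffffffffa0db810b>] lolnd_send+0x2b/0xa0 [lnet]
<4>[79285.913865]  [<ffffffffa0db06ab>] lnet_ni_send+0x4b/0xf0 [lnet]
<4>[79285.913865]  [<ffffffffa0db4d23>] lnet_send+0x883/0xba0 [lnet]
<4>[79285.913865]  [<ffffffffa0db5b0c>] LNetPut+0x2fc/0x810 [lnet]
<4>[79285.913865]  [<ffffffffa14ec0f0>] ptl_send_buf+0x1e0/0x540 [ptlrpc]
<4>[79285.913865]  [<ffffffff81040e8c>] ? kvm_clock_read+0x1c/0x20
<4>[79285.913865]  [<ffffffffa14ef7dd>] ptl_send_rpc+0x64d/0xde0 [ptlrpc]
"""

if __name__ == "__main__":
    count, per_module, chain = summarize(parse_trace(TRACE.splitlines()))
    print(count, dict(per_module))  # prints: 8 {'kernel': 1, 'lnet': 5, 'ptlrpc': 2}
    print(" -> ".join(chain))       # prints: ptlrpc -> lnet -> kernel
```

On the full Call Trace section, the collapsed chain comes out as kernel -> lustre -> obdclass -> lustre -> lmv -> mdc -> ptlrpc -> lnet -> kernel. That is the writeback thread descending through the Lustre client stack into LNet and then back into the kernel's allocator, reclaim, and block layers, all on one kernel stack.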

Generated at Sat Feb 10 02:05:17 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.