[LU-2859] Stack overflow issues on x86_64 clients in memory reclaim during writes Created: 25/Feb/13  Updated: 30/May/13  Resolved: 12/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Prakash Surya (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: MB, sequoia, topsequoia

Issue Links:
Related
is related to LU-3188 IOR fails due to client stack overrun Resolved
is related to LU-3416 Hangs on write in osc_enter_cache() Resolved
Severity: 3
Rank (Obsolete): 6925

 Description   

I was trying to run some IOR tests using a 2.3.58-based client and ran into what appear to be stack overflow issues. The test is an IOR write using 256 clients, with 16 tasks per client:

$ salloc -N256 -n$((256*16)) -pltest
$ srun -- ior -a POSIX -Cegw -F -o /p/lstest/surya1/lu2139-ior/%d -t 1m -b 8G -i 5

The clients are all x86_64 with the default 8k kernel stack size, which is important to note. Some info on an example crash I've seen is below.

The console shows this:

LustreError: 7408:0:(osc_dev.c:129:osc_key_init()) ASSERTION( (!(CFS_ALLOC_IO != CFS_ALLOC_ATOMIC) || (!(((current_thread_info()->preempt_count) & ((((1UL << (10))-1) << ((0 + 8) + 8)) | (((1UL << (8))-1) << (0 + 8)) | (((1UL << (1))-1) << (((0 + 8) + 8) + 10))))))) ) failed: 
BUG: unable to handle kernel paging request at 00000001811be3e0
IP: [<ffffffff81052994>] update_curr+0x144/0x1f0
PGD 1024273067 PUD 0 
Thread overran stack, or stack corrupted
Oops: 0000 [#1] SMP 
last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
CPU 8 
Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) dm_round_robin scsi_dh_rdac sg sd_mod crc_t10dif zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate acpi_cpufreq freq_table mperf ib_srp scsi_transport_srp scsi_tgt ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_multipath dm_mod vhost_net macvtap macvlan tun kvm wmi sb_edac edac_core i2c_i801 i2c_core ahci iTCO_wdt iTCO_vendor_support ioatdma ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc igb dca mlx4_en mlx4_core [last unloaded: cpufreq_ondemand]

Pid: 7408, comm: ior Tainted: P           ----------------   2.6.32-220.23.1.2chaos.ch5.x86_64 #1 appro gb812x-cn/S2600JF
...

The stack reported by crash:

crash> bt
PID: 7408   TASK: ffff881028578aa0  CPU: 8   COMMAND: "ior"
 #0 [ffff88085c4039a0] machine_kexec at ffffffff8103216b
 #1 [ffff88085c403a00] crash_kexec at ffffffff810b8d12
 #2 [ffff88085c403ad0] oops_end at ffffffff814f2c00
 #3 [ffff88085c403b00] no_context at ffffffff810423fb
 #4 [ffff88085c403b50] __bad_area_nosemaphore at ffffffff81042685
 #5 [ffff88085c403ba0] bad_area_nosemaphore at ffffffff81042753
 #6 [ffff88085c403bb0] __do_page_fault at ffffffff81042e0d
 #7 [ffff88085c403cd0] do_page_fault at ffffffff814f4bde
 #8 [ffff88085c403d00] page_fault at ffffffff814f1f95
    [exception RIP: update_curr+324]
    RIP: ffffffff81052994  RSP: ffff88085c403db8  RFLAGS: 00010086
    RAX: ffff881028578aa0  RBX: 000000003feb8e00  RCX: ffff88082ff111c0
    RDX: 0000000000018b88  RSI: 0000000000000000  RDI: ffff881028578ad8
    RBP: ffff88085c403de8   R8: ffffffff8160b665   R9: 000000000000001a
    R10: 0000000000000010  R11: 000000000000001a  R12: ffff88085c416028
    R13: 0000000001a5b3e7  R14: 0000004135410580  R15: ffff881028578aa0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff88085c403df0] task_tick_fair at ffffffff81052f4b
#10 [ffff88085c403e20] scheduler_tick at ffffffff81056791
#11 [ffff88085c403e60] update_process_times at ffffffff8107c7e2
#12 [ffff88085c403e90] tick_sched_timer at ffffffff810a0976
#13 [ffff88085c403ec0] __run_hrtimer at ffffffff8109580e
#14 [ffff88085c403f10] hrtimer_interrupt at ffffffff81095bb6
#15 [ffff88085c403f90] smp_apic_timer_interrupt at ffffffff814f754b
#16 [ffff88085c403fb0] apic_timer_interrupt at ffffffff8100bc13
--- <IRQ stack> ---
#17 [ffff881022f56178] apic_timer_interrupt at ffffffff8100bc13
    [exception RIP: _spin_unlock_irqrestore+23]
    RIP: ffffffff814f1ad7  RSP: ffff881022f56228  RFLAGS: 00000246
    RAX: 0000000000000002  RBX: ffff881022f56228  RCX: 0000000000001b78
    RDX: ffff880809274400  RSI: 0000000000000246  RDI: 0000000000000246
    RBP: ffffffff8100bc0e   R8: 0000000000000000   R9: ffffffff8163a940
    R10: 0000000000000001  R11: 0000000000000000  R12: ffff880809274400
    R13: ffffffffa0ab8878  R14: 0000000014208bc1  R15: ffff880840021b48
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
#18 [ffff881022f56230] cfs_trace_unlock_tcd at ffffffffa034336c [libcfs]
#19 [ffff881022f56250] libcfs_debug_vmsg2 at ffffffffa0354038 [libcfs]
#20 [ffff881022f563b0] libcfs_debug_msg at ffffffffa03545f1 [libcfs]
#21 [ffff881022f56410] osc_key_init at ffffffffa0a90e17 [osc]
#22 [ffff881022f56430] keys_fill at ffffffffa06d8c3f [obdclass]
#23 [ffff881022f56480] lu_context_init at ffffffffa06dcc6b [obdclass]
#24 [ffff881022f564b0] lu_env_init at ffffffffa06dce3e [obdclass]
#25 [ffff881022f564d0] cl_env_new at ffffffffa06e40bd [obdclass]
#26 [ffff881022f56520] cl_env_get at ffffffffa06e4acb [obdclass]
#27 [ffff881022f56560] lov_sub_get at ffffffffa0b3777d [lov]
#28 [ffff881022f565c0] lov_page_subio at ffffffffa0b37c5d [lov]
#29 [ffff881022f565f0] lov_page_own at ffffffffa0b311af [lov]
#30 [ffff881022f56620] cl_page_own0 at ffffffffa06e8b5b [obdclass]
#31 [ffff881022f56670] cl_page_own_try at ffffffffa06e8db3 [obdclass]
#32 [ffff881022f56680] discard_pagevec at ffffffffa0a926a9 [osc]
#33 [ffff881022f566d0] osc_lru_shrink at ffffffffa0a935d9 [osc]
#34 [ffff881022f567c0] osc_lru_del at ffffffffa0a94aa6 [osc]
#35 [ffff881022f56840] osc_page_delete at ffffffffa0a951b4 [osc]
#36 [ffff881022f56870] cl_page_delete0 at ffffffffa06e99e5 [obdclass]
#37 [ffff881022f568b0] cl_page_delete at ffffffffa06e9e62 [obdclass]
#38 [ffff881022f568d0] ll_releasepage at ffffffffa0bfe41b [lustre]
#39 [ffff881022f56920] try_to_release_page at ffffffff81110070
#40 [ffff881022f56930] shrink_page_list.clone.0 at ffffffff8112a501
#41 [ffff881022f56a60] shrink_inactive_list at ffffffff8112a8cb
#42 [ffff881022f56c10] shrink_zone at ffffffff8112b5df
#43 [ffff881022f56cc0] zone_reclaim at ffffffff8112c384
#44 [ffff881022f56da0] get_page_from_freelist at ffffffff81122834
#45 [ffff881022f56ec0] __alloc_pages_nodemask at ffffffff81123ab1
#46 [ffff881022f56fe0] kmem_getpages at ffffffff8115e2f2
#47 [ffff881022f57010] cache_grow at ffffffff8115e95f
#48 [ffff881022f57080] cache_alloc_refill at ffffffff8115ebb2
#49 [ffff881022f570f0] __kmalloc at ffffffff8115f8d9
#50 [ffff881022f57140] cfs_alloc at ffffffffa0344c40 [libcfs]
#51 [ffff881022f57170] ptlrpc_request_alloc_internal at ffffffffa085d407 [ptlrpc]
#52 [ffff881022f571a0] ptlrpc_request_alloc_pool at ffffffffa085d66e [ptlrpc]
#53 [ffff881022f571b0] osc_brw_prep_request at ffffffffa0a8451b [osc]
#54 [ffff881022f57290] osc_build_rpc at ffffffffa0a8a513 [osc]
#55 [ffff881022f573f0] osc_io_unplug0 at ffffffffa0aa642d [osc]
#56 [ffff881022f57500] osc_io_unplug at ffffffffa0aa7ce1 [osc]
#57 [ffff881022f57510] osc_enter_cache at ffffffffa0aa8473 [osc]
#58 [ffff881022f576a0] osc_queue_async_io at ffffffffa0aae916 [osc]
#59 [ffff881022f57830] osc_page_cache_add at ffffffffa0a94fc9 [osc]
#60 [ffff881022f57870] cl_page_cache_add at ffffffffa06e61d7 [obdclass]
#61 [ffff881022f57910] lov_page_cache_add at ffffffffa0b31325 [lov]
#62 [ffff881022f57940] cl_page_cache_add at ffffffffa06e61d7 [obdclass]
#63 [ffff881022f579e0] vvp_io_commit_write at ffffffffa0c1161d [lustre]
#64 [ffff881022f57a50] cl_io_commit_write at ffffffffa06f5b1d [obdclass]
#65 [ffff881022f57aa0] ll_commit_write at ffffffffa0be68be [lustre]
#66 [ffff881022f57b00] ll_write_end at ffffffffa0bfe4e0 [lustre]
#67 [ffff881022f57b30] generic_file_buffered_write at ffffffff81111684
#68 [ffff881022f57c00] __generic_file_aio_write at ffffffff81112f70
#69 [ffff881022f57cc0] generic_file_aio_write at ffffffff8111320f
#70 [ffff881022f57d10] vvp_io_write_start at ffffffffa0c11f3c [lustre]
#71 [ffff881022f57d50] cl_io_start at ffffffffa06f244a [obdclass]
#72 [ffff881022f57d80] cl_io_loop at ffffffffa06f6d54 [obdclass]
#73 [ffff881022f57db0] ll_file_io_generic at ffffffffa0bbda7b [lustre]
#74 [ffff881022f57e20] ll_file_aio_write at ffffffffa0bbdce2 [lustre]
#75 [ffff881022f57e80] ll_file_write at ffffffffa0bbeaac [lustre]
#76 [ffff881022f57ef0] vfs_write at ffffffff81177b98
#77 [ffff881022f57f30] sys_write at ffffffff811785a1
#78 [ffff881022f57f80] system_call_fastpath at ffffffff8100b0f2
    RIP: 00002aaaab24748d  RSP: 00007fffffffcb50  RFLAGS: 00010246
    RAX: 0000000000000001  RBX: ffffffff8100b0f2  RCX: 0000081f512ba39b
    RDX: 0000000000100000  RSI: 0000000000821000  RDI: 0000000000000009
    RBP: 0000000000100000   R8: 00000000007ff070   R9: 0000000080000000
    R10: 0000000000101008  R11: 0000000000000293  R12: 00000000007ff070
    R13: 0000000000821000  R14: 0000000004900000  R15: 0000000000000000
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

Doing the math, this stack shows 7504 bytes (roughly 7.3k) of used space: 0xffff881022f57f80 - 0xffff881022f56230 = 7504, which is awfully close to the 8k limit.

Now let's look at the thread_info structure for that thread:

crash> p *(struct task_struct *)0xffff881028578aa0
...
  stack = 0xffff881022f56000,
...
crash> p *(struct thread_info *)0xffff881022f56000 
$5 = {
  task = 0xffffffff814f4cfa, 
  exec_domain = 0xffff881022f56028, 
  flags = 2167540686, 
  status = 4294967295, 
  cpu = 1072401920, 
  preempt_count = 34824, 
  addr_limit = {
    seg = 0
  }, 
  restart_block = {
    fn = 0xffff881022f56088, 
    {
      {
        arg0 = 18446744071582132352, 
        arg1 = 18446612201620201608, 
        arg2 = 52, 
        arg3 = 21474836533
      }, 
      futex = {
        uaddr = 0xffffffff81322880, 
        val = 586506376, 
        flags = 4294936592, 
        bitset = 52, 
        time = 21474836533, 
        uaddr2 = 0x100000000000006
      }, 
      nanosleep = {
        index = -2127419264, 
        rmtp = 0xffff881022f56088, 
        compat_rmtp = 0x34, 
        expires = 21474836533
      }, 
      poll = {
        ufds = 0xffffffff81322880, 
        nfds = 586506376, 
        has_timeout = -30704, 
        tv_sec = 52, 
        tv_nsec = 21474836533
      }
    }
  }, 
  sysenter_return = 0xffff881022f560f8, 
  uaccess_err = -2119203296
}

This structure is definitely trashed and contains bogus values, most likely as a result of overrunning the default 8k stack on x86_64.
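
For context on why the corruption looks like this: on a 2.6.32-era x86_64 kernel, thread_info sits at the lowest address of the task's kernel stack, and the stack grows down toward it. A rough sketch of the layout (paraphrased from include/linux/sched.h, not copied verbatim from this exact kernel):

/*
 * thread_info occupies the bottom of the per-task kernel stack; with an
 * 8k THREAD_SIZE the stack pointer starts ~8k above it and grows down,
 * so an overrun lands directly on these fields.
 */
union thread_union {
        struct thread_info thread_info;    /* lowest address */
        unsigned long      stack[THREAD_SIZE / sizeof(long)];
};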



 Comments   
Comment by Prakash Surya (Inactive) [ 25/Feb/13 ]

Also, I've installed a temporary kernel with 16K stacks on the client and have not been able to reproduce the issue (same Lustre version, 2.3.58-15chaos1surya1).
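
For anyone wanting to try the same workaround: on this generation of x86_64 kernels the stack size is derived from THREAD_ORDER, so a 16K-stack kernel amounts to roughly the change below. This is only a sketch against a 2.6.32-era tree (arch/x86/include/asm/page_64_types.h); double-check the exact file for the kernel being built.

/* Default is THREAD_ORDER 1, i.e. two 4k pages = 8k per-task stack.
 * Bumping it to 2 gives the 16k stacks used for this test. */
#define THREAD_ORDER    2
#define THREAD_SIZE     (PAGE_SIZE << THREAD_ORDER)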

Comment by Jinshan Xiong (Inactive) [ 25/Feb/13 ]

Hi Prakash, can you please try this patch: http://review.whamcloud.com/5526?

Comment by Prakash Surya (Inactive) [ 25/Feb/13 ]

I pulled that in and it's passed my initial testing. The second osc_io_unplug call was introduced because of LU-2576 though, so I'm hesitant to pull it into an official release just yet. Is it safe to collapse the two osc_io_unplug calls into a single one as the patch does?

Comment by Jinshan Xiong (Inactive) [ 25/Feb/13 ]

It should be pretty safe.

Comment by Peter Jones [ 12/Mar/13 ]

Landed for 2.4

Comment by Niu Yawei (Inactive) [ 14/Mar/13 ]

It's not safe to collapse the two osc_io_unplug() calls into one, as the comment in the code describes: the first call makes sure the current object is on the ready list, and the second makes sure a flush is always triggered even if the current object has no dirty pages. Also, the async unplug can't guarantee the dirty pages have been flushed by the time the call returns. What about changing it to:

while (...) {
        osc_io_unplug_async(osc);
        osc_io_unplug_async(NULL);
}
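
To illustrate the trade-off being discussed, here is a minimal sketch (not the osc code; names are hypothetical) of the general async-deferral pattern using a plain Linux workqueue: deferring the flush to another context keeps the deep RPC-build path off the writer's already-deep stack, but by the time the call returns nothing guarantees the dirty pages have actually been flushed.

#include <linux/workqueue.h>

/* Runs on a worker thread with a fresh stack, not the writer's. */
static void unplug_work_fn(struct work_struct *work)
{
        /* build and send the bulk RPCs here */
}

static DECLARE_WORK(unplug_work, unplug_work_fn);

static void unplug_async(void)
{
        /* Queue the flush and return immediately; the caller's stack
         * stays shallow, but the dirty data is not yet on the wire. */
        schedule_work(&unplug_work);
}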

Comment by Prakash Surya (Inactive) [ 15/Mar/13 ]

I was hesitant to pull the two calls into a single one. Jinshan, can you comment on this?

Comment by Jinshan Xiong (Inactive) [ 15/Mar/13 ]

It's enough to call osc_io_unplug_async() once.
