Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-1243

Lustre 2.1.1 client stack overrun

    XMLWordPrintable

Details

    • 3
    • 9741

    Description

      During I/O testing we hit a client stack overrun on a Lustre 2.1.1 client. We have the LU-969 stack reduction patch in our local branch. The stacktrace looks very similar to the one reported for LU-969.

      2012-03-18 06:05:23 BUG: unable to handle kernel paging request at 000000032cbc7960
      2012-03-18 06:05:23 IP: [<ffffffff81052814>] update_curr+0x144/0x1f0
      2012-03-18 06:05:23 PGD 417be5067 PUD 0
      2012-03-18 06:05:23 Thread overran stack, or stack corrupted
      2012-03-18 06:05:23 Oops: 0000 [#1] SMP
      2012-03-18 06:05:23 last sysfs file: /sys/devices/system/cpu/cpu31/online
      2012-03-18 06:05:23 CPU 6
      2012-03-18 06:05:23 Modules linked in: xt_owner nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack lmv(U) mgc(U)
      lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) acpi_cpufreq freq_table mperf
      ko2iblnd(U) lnet(U) libcfs(U) ipt_LOG xt_multiport iptable_filter ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad
      rdma_cm ib_cm iw_cm ib_addr ib_sa dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm uinput sg
      sd_mod crc_t10dif isci libsas scsi_transport_sas ahci microcode sb_edac edac_core iTCO_wdt iTCO_vendor_support i2c_i801
      i2c_core ib_qib(U) ib_mad ib_core wmi ioatdma ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc igb dca [last unloaded:
      cpufreq_ondemand]
      2012-03-18 06:05:23
      2012-03-18 06:05:23 Pid: 68872, comm: lmp Not tainted 2.6.32-220.7.1.3chaos.ch5.x86_64 #1 appro appro-512x/S2600JF
      2012-03-18 06:05:23 RIP: 0010:[<ffffffff81052814>]  [<ffffffff81052814>] update_curr+0x144/0x1f0
      2012-03-18 06:05:23 RSP: 0018:ffff8800366c3db8  EFLAGS: 00010086
      2012-03-18 06:05:23 RAX: ffff8808334c9580 RBX: 00000000755fa0c8 RCX: ffff880437f111c0
      2012-03-18 06:05:23 RDX: 0000000000018b48 RSI: 0000000000000000 RDI: ffff8808334c95b8
      2012-03-18 06:05:23 RBP: ffff8800366c3de8 R08: ffffffff8160b665 R09: 0000000000000000
      2012-03-18 06:05:23 R10: 0000000000000010 R11: 0000000000000000 R12: ffff8800366d5fe8
      2012-03-18 06:05:23 R13: 00000000000f3c4d R14: 000081fe1182f6c5 R15: ffff8808334c9580
      2012-03-18 06:05:23 FS:  00002aaaacfd0800(0000) GS:ffff8800366c0000(0000) knlGS:0000000000000000
      2012-03-18 06:05:23 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      2012-03-18 06:05:23 CR2: 000000032cbc7960 CR3: 0000000326499000 CR4: 00000000000406e0
      2012-03-18 06:05:23 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      2012-03-18 06:05:23 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      2012-03-18 06:05:23 Process lmp (pid: 68872, threadinfo ffff8807755fa000, task ffff8808334c9580)
      2012-03-18 06:05:23 Stack:
      2012-03-18 06:05:23  ffff8800366c3dc8 ffffffff81013783 ffff8808334c95b8 ffff8800366d5fe8
      2012-03-18 06:05:23 <0> 0000000000000000 0000000000000000 ffff8800366c3e18 ffffffff81052e2b
      2012-03-18 06:05:23 <0> ffff8800366d5f80 0000000000000006 0000000000015f80 0000000000000006
      2012-03-18 06:05:23 Call Trace:
      2012-03-18 06:05:23  <IRQ>
      2012-03-18 06:05:23  [<ffffffff81013783>] ? native_sched_clock+0x13/0x80
      2012-03-18 06:05:23  [<ffffffff81052e2b>] task_tick_fair+0xdb/0x160
      2012-03-18 06:05:23  [<ffffffff81056891>] scheduler_tick+0xc1/0x260
      2012-03-18 06:05:23  [<ffffffff810a0e70>] ? tick_sched_timer+0x0/0xc0
      2012-03-18 06:05:23  [<ffffffff8107c512>] update_process_times+0x52/0x70
      2012-03-18 06:05:23  [<ffffffff810a0ed6>] tick_sched_timer+0x66/0xc0
      2012-03-18 06:05:23  [<ffffffff8109555e>] __run_hrtimer+0x8e/0x1a0
      2012-03-18 06:05:23  [<ffffffff81012b59>] ? read_tsc+0x9/0x20
      2012-03-18 06:05:23  [<ffffffff81095906>] hrtimer_interrupt+0xe6/0x250
      2012-03-18 06:05:23  [<ffffffff814f6f0b>] smp_apic_timer_interrupt+0x6b/0x9b
      2012-03-18 06:05:23  [<ffffffff8100bc13>] apic_timer_interrupt+0x13/0x20
      2012-03-18 06:05:23  <EOI>
      2012-03-18 06:05:23  [<ffffffff8112adbe>] ? shrink_inactive_list+0x2de/0x740
      2012-03-18 06:05:23  [<ffffffff8112ad72>] ? shrink_inactive_list+0x292/0x740
      2012-03-18 06:05:23  [<ffffffff8112baef>] shrink_zone+0x38f/0x520
      2012-03-18 06:05:23  [<ffffffff8112c894>] zone_reclaim+0x354/0x410
      2012-03-18 06:05:23  [<ffffffff8112d4e0>] ? isolate_pages_global+0x0/0x350
      2012-03-18 06:05:23  [<ffffffff81122d94>] get_page_from_freelist+0x694/0x820
      2012-03-18 06:05:23  [<ffffffffa0475b6f>] ? cfs_hash_bd_from_key+0x3f/0xc0 [libcfs]
      2012-03-18 06:05:23  [<ffffffffa05d4232>] ? lu_object_put+0x92/0x200 [obdclass]
      2012-03-18 06:05:23  [<ffffffff811337c9>] ? zone_statistics+0x99/0xc0
      2012-03-18 06:05:23  [<ffffffff81124011>] __alloc_pages_nodemask+0x111/0x940
      2012-03-18 06:05:23  [<ffffffff8115e592>] kmem_getpages+0x62/0x170
      2012-03-18 06:05:23  [<ffffffff8115ebff>] cache_grow+0x2cf/0x320
      2012-03-18 06:05:23  [<ffffffff8115ee52>] cache_alloc_refill+0x202/0x240
      2012-03-18 06:05:23  [<ffffffffa0469863>] ? cfs_alloc+0x63/0x90 [libcfs]
      2012-03-18 06:05:23  [<ffffffff8115fb79>] __kmalloc+0x1a9/0x220
      2012-03-18 06:05:23  [<ffffffffa0469863>] cfs_alloc+0x63/0x90 [libcfs]
      2012-03-18 06:05:23  [<ffffffffa0702237>] ptlrpc_request_alloc_internal+0x167/0x360 [ptlrpc]
      2012-03-18 06:05:23  [<ffffffffa070243e>] ptlrpc_request_alloc_pool+0xe/0x10 [ptlrpc]
      2012-03-18 06:05:23  [<ffffffffa08d684b>] osc_brw_prep_request+0x1ab/0xcc0 [osc]
      2012-03-18 06:05:23  [<ffffffffa08e7fbb>] ? osc_req_attr_set+0xfb/0x250 [osc]
      2012-03-18 06:05:23  [<ffffffffa09f7638>] ? ccc_req_attr_set+0x78/0x150 [lustre]
      2012-03-18 06:05:23  [<ffffffffa05e4e81>] ? cl_req_attr_set+0xd1/0x1a0 [obdclass]
      2012-03-18 06:05:23  [<ffffffffa05e476c>] ? cl_req_prep+0x8c/0x130 [obdclass]
      2012-03-18 06:05:23  [<ffffffffa08d821d>] osc_send_oap_rpc+0xebd/0x1690 [osc]
      2012-03-18 06:05:23  [<ffffffffa08d17a6>] ? loi_list_maint+0xa6/0x130 [osc]
      2012-03-18 06:05:23  [<ffffffffa08d8c9e>] osc_check_rpcs+0x2ae/0x3b0 [osc]
      2012-03-18 06:05:23  [<ffffffffa08e86df>] osc_io_submit+0x1df/0x4a0 [osc]
      2012-03-18 06:05:23  [<ffffffffa05e4938>] cl_io_submit_rw+0x78/0x130 [obdclass]
      2012-03-18 06:05:23  [<ffffffffa094160f>] lov_io_submit+0x2bf/0x9d0 [lov]
      2012-03-18 06:05:23  [<ffffffffa05e4938>] cl_io_submit_rw+0x78/0x130 [obdclass]
      2012-03-18 06:05:23  [<ffffffffa09ddf15>] ll_writepage+0x205/0x3b0 [lustre]
      2012-03-18 06:05:23  [<ffffffff8112a34b>] pageout.clone.1+0x12b/0x300
      2012-03-18 06:05:23  [<ffffffff81129a01>] ? __remove_mapping+0x61/0x160
      2012-03-18 06:05:23  [<ffffffff8112a8d2>] shrink_page_list.clone.0+0x3b2/0x5c0
      2012-03-18 06:05:23  [<ffffffff811280bc>] ? release_pages+0x6c/0x250
      2012-03-18 06:05:23  [<ffffffff8116804d>] ? mem_cgroup_del_lru_list+0x2d/0xc0
      2012-03-18 06:05:23  [<ffffffff81168169>] ? mem_cgroup_del_lru+0x39/0x40
      2012-03-18 06:05:23  [<ffffffff811689ae>] ? mem_cgroup_isolate_pages+0xee/0x1d0
      2012-03-18 06:05:23  [<ffffffff8112addb>] shrink_inactive_list+0x2fb/0x740
      2012-03-18 06:05:23  [<ffffffff8112baef>] shrink_zone+0x38f/0x520
      2012-03-18 06:05:23  [<ffffffff811a0e4d>] ? bdi_queue_work+0x7d/0x110
      2012-03-18 06:05:23  [<ffffffff8112bd7e>] do_try_to_free_pages+0xfe/0x520
      2012-03-18 06:05:23  [<ffffffff8112c2ec>] try_to_free_mem_cgroup_pages+0x8c/0x90
      2012-03-18 06:05:23  [<ffffffff811688c0>] ? mem_cgroup_isolate_pages+0x0/0x1d0
      2012-03-18 06:05:23  [<ffffffff8116a450>] mem_cgroup_hierarchical_reclaim+0x2f0/0x460
      2012-03-18 06:05:23  [<ffffffff8116b992>] __mem_cgroup_try_charge+0x202/0x460
      2012-03-18 06:05:23  [<ffffffff811241e7>] ? __alloc_pages_nodemask+0x2e7/0x940
      2012-03-18 06:05:23  [<ffffffff8116c4e7>] mem_cgroup_charge_common+0x87/0xd0
      2012-03-18 06:05:23  [<ffffffff8116c658>] mem_cgroup_cache_charge+0x128/0x140
      2012-03-18 06:05:23  [<ffffffff8111161a>] add_to_page_cache_locked+0x4a/0x140
      2012-03-18 06:05:23  [<ffffffff8111173c>] add_to_page_cache_lru+0x2c/0x80
      2012-03-18 06:05:23  [<ffffffff811122f9>] grab_cache_page_write_begin+0x99/0xc0
      2012-03-18 06:05:23  [<ffffffffa09efa98>] ll_write_begin+0x58/0x170 [lustre]
      2012-03-18 06:05:23  [<ffffffff81111b7e>] generic_file_buffered_write+0x10e/0x2a0
      2012-03-18 06:05:23  [<ffffffff81070637>] ? current_fs_time+0x27/0x30
      2012-03-18 06:05:23  [<ffffffff811134d0>] __generic_file_aio_write+0x250/0x480
      2012-03-18 06:05:23  [<ffffffffa05d8d95>] ? cl_env_info+0x15/0x20 [obdclass]
      2012-03-18 06:05:23  [<ffffffff8111376f>] generic_file_aio_write+0x6f/0xe0
      2012-03-18 06:05:23  [<ffffffffa09fdf9e>] vvp_io_write_start+0x9e/0x1e0 [lustre]
      2012-03-18 06:05:23  [<ffffffffa05e4a5a>] cl_io_start+0x6a/0x100 [obdclass]
      2012-03-18 06:05:23  [<ffffffffa05e8094>] cl_io_loop+0xd4/0x160 [obdclass]
      2012-03-18 06:05:23  [<ffffffffa09ba43b>] ll_file_io_generic+0x3bb/0x4b0 [lustre]
      2012-03-18 06:05:23  [<ffffffffa09bf4ec>] ll_file_aio_write+0x13c/0x280 [lustre]
      2012-03-18 06:05:23  [<ffffffffa09bf79c>] ll_file_write+0x16c/0x290 [lustre]
      2012-03-18 06:05:23  [<ffffffff81177ae8>] vfs_write+0xb8/0x1a0
      2012-03-18 06:05:23  [<ffffffff811784f1>] sys_write+0x51/0x90
      2012-03-18 06:05:23  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
      2012-03-18 06:05:23 Code: 00 8b 15 cc 36 a4 00 85 d2 74 34 48 8b 50 08 8b 5a 18 48 8b 90 10 09 00 00 48 8b 4a 50 48 85
      c9 74 1d 48 63 db 66 90 48 8b 51 20 <48> 03 14 dd 20 73 bf 81 4c 01 2a 48 8b 49 78 48 85 c9 75 e8 48
      2012-03-18 06:05:23 RIP  [<ffffffff81052814>] update_curr+0x144/0x1f0
      2012-03-18 06:05:23  RSP <ffff8800366c3db8>
      2012-03-18 06:05:23 CR2: 000000032cbc7960
      

      Attachments

        Activity

          People

            hongchao.zhang Hongchao Zhang
            nedbass Ned Bass (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: