[LU-1474] Test failure on test suite parallel-scale-nfsv3, subtest test_iorssf Created: 05/Jun/12  Updated: 13/Aug/12  Resolved: 13/Aug/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.2, Lustre 2.1.3
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Lustre Tag: v2_1_2_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/87/
Distro/Arch: RHEL6.2/x86_64 (kernel version: 2.6.32-220.17.1.el6)
Network: TCP (1GigE)
ENABLE_QUOTA=yes


Severity: 3
Rank (Obsolete): 4172

Description

This issue was created by maloo for yujian <yujian@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/c496b4f4-aecd-11e1-b0f7-52540035b04c.

The sub-test test_iorssf failed with the following error:

test failed to respond and timed out

Info required for matching: parallel-scale-nfsv3 iorssf

Console log on the node serving as MDS, Lustre client, and NFS server showed:

17:34:14:BUG: unable to handle kernel paging request at fffffffb8a512d10
17:34:14:IP: [<ffffffff81052994>] update_curr+0x144/0x1f0
17:34:14:PGD 1a87067 PUD 0 
17:34:14:Thread overran stack, or stack corrupted
17:34:14:Oops: 0000 [#1] SMP 
17:34:14:last sysfs file: /sys/module/nfsd/initstate
17:34:14:CPU 0 
17:34:14:Modules linked in: lustre(U) obdfilter(U) ost(U) osd_ldiskfs(U) cmm(U) fsfilt_ldiskfs(U) mdt(U) mdd(U) mds(U) mgs(U) ldiskfs(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) lquota(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) nfs fscache jbd2 nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: lnet_selftest]
17:34:14:
17:34:14:Pid: 16667, comm: nfsd Not tainted 2.6.32-220.17.1.el6_lustre.g636ddbf.x86_64 #1 Red Hat KVM
17:34:15:RIP: 0010:[<ffffffff81052994>]  [<ffffffff81052994>] update_curr+0x144/0x1f0
17:34:15:RSP: 0018:ffff880002203db8  EFLAGS: 00010086
17:34:15:RAX: ffff880071a40b40 RBX: ffffffff811237e6 RCX: ffff88007faa60c0
17:34:15:RDX: 0000000000018b48 RSI: ffff880071a97538 RDI: ffff880071a40b78
17:34:15:RBP: ffff880002203de8 R08: ffffffff8160b6a5 R09: 0000000000000000
17:34:15:R10: 0000000000000010 R11: 0000000000000000 R12: ffff880002215fe8
17:34:15:R13: 00000000000f469a R14: 000054a5051b5d6b R15: ffff880071a40b40
17:34:15:FS:  0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
17:34:15:CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
17:34:15:CR2: fffffffb8a512d10 CR3: 0000000037697000 CR4: 00000000000006f0
17:34:15:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
17:34:15:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
17:34:15:Process nfsd (pid: 16667, threadinfo ffff880068a62000, task ffff880071a40b40)
17:34:15:Stack:
17:34:15: ffff880002203dc8 0000000000000000 ffff880071a40b78 ffff880002215fe8
17:34:15:<0> 0000000000000000 0000000000000000 ffff880002203e18 ffffffff81052f4b
17:34:17:<0> ffff880002215f80 0000000000000000 0000000000015f80 0000000000000000
17:34:17:Call Trace:
17:34:17: <IRQ> 
17:34:17: [<ffffffff81052f4b>] task_tick_fair+0xdb/0x160
17:34:17:BUG: unable to handle kernel NULL pointer dereference at 00000000000009e9
17:34:17:IP: [<ffffffff8100f5ad>] print_context_stack+0xad/0x140
17:34:17:PGD 0 
17:34:17:Thread overran stack, or stack corrupted
17:34:17:Oops: 0000 [#2] SMP 
17:34:17:last sysfs file: /sys/module/nfsd/initstate
17:34:17:CPU 0 
17:34:17:Modules linked in: lustre(U) obdfilter(U) ost(U) osd_ldiskfs(U) cmm(U) fsfilt_ldiskfs(U) mdt(U) mdd(U) mds(U) mgs(U) ldiskfs(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) lquota(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) nfs fscache jbd2 nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: lnet_selftest]
17:34:17:
17:34:18:Pid: 16667, comm: nfsd Not tainted 2.6.32-220.17.1.el6_lustre.g636ddbf.x86_64 #1 Red Hat KVM
17:34:18:RIP: 0010:[<ffffffff8100f5ad>]  [<ffffffff8100f5ad>] print_context_stack+0xad/0x140
17:34:18:RSP: 0018:ffff8800022038c8  EFLAGS: 00010002
17:34:18:RAX: 0000000000000001 RBX: ffff880002203df0 RCX: 0000000000000000
17:34:18:RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000046
17:34:18:RBP: ffff880002203928 R08: ffffffff81c00740 R09: 0000000000000000
17:34:18:R10: 0000000000000002 R11: 0000000000000001 R12: ffff880002203e18
17:34:18:R13: ffff880068a62000 R14: ffffffff81600460 R15: ffff880002203fc0
17:34:18:FS:  0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
17:34:18:CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
17:34:18:CR2: 00000000000009e9 CR3: 0000000037697000 CR4: 00000000000006f0
17:34:18:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
17:34:18:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
17:34:18:Process nfsd (pid: 16667, threadinfo ffff880068a62000, task ffff880071a40b40)
17:34:18:Stack:
17:34:18: ffff880000000018 ffff880068a63ff8 ffff880002203e18 ffff880002201fc0
17:34:19:<0> ffffffff81779900 ffffffff81052f4b ffffffff817c60ba ffff880002203db8
17:34:19:<0> 000000000000cbe0 ffffffff81600460 ffffffff81779900 ffff880002203fc0
17:34:19:Call Trace:
17:34:19: <IRQ> 
17:34:19: [<ffffffff81052f4b>] ? task_tick_fair+0xdb/0x160

Console log on the NFS client showed:

17:37:09:INFO: task IOR:7271 blocked for more than 120 seconds.
17:37:09:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
17:37:09:IOR           D 0000000000000000     0  7271   7268 0x00000080
17:37:09: ffff880079e21c88 0000000000000082 ffff880077682078 ffff880002215f80
17:37:09: ffff880079e21c48 ffffffff81053190 ffff880079e21c28 ffffffff8106196b
17:37:09: ffff8800750b3b38 ffff880079e21fd8 000000000000f4e8 ffff8800750b3b38
17:37:09:Call Trace:
17:37:09: [<ffffffff81053190>] ? check_preempt_wakeup+0x1c0/0x260
17:37:09: [<ffffffff8106196b>] ? enqueue_task_fair+0xfb/0x100
17:37:09: [<ffffffff8104da7c>] ? check_preempt_curr+0x7c/0x90
17:37:09: [<ffffffff814ee6fe>] __mutex_lock_slowpath+0x13e/0x180
17:37:09: [<ffffffff814ee59b>] mutex_lock+0x2b/0x50
17:37:09: [<ffffffff81113239>] generic_file_aio_write+0x59/0xe0
17:37:09: [<ffffffffa03b4ffe>] nfs_file_write+0xde/0x1f0 [nfs]
17:37:09: [<ffffffff8117661a>] do_sync_write+0xfa/0x140
17:37:09: [<ffffffff81090d30>] ? autoremove_wake_function+0x0/0x40
17:37:09: [<ffffffff8120c646>] ? security_file_permission+0x16/0x20
17:37:09: [<ffffffff81176918>] vfs_write+0xb8/0x1a0
17:37:11: [<ffffffff810d4932>] ? audit_syscall_entry+0x272/0x2a0
17:37:11: [<ffffffff81177321>] sys_write+0x51/0x90
17:37:11: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
17:38:47:nfs: server client-23vm3 not responding, still trying
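
The "blocked for more than 120 seconds" warning above comes from the kernel's hung-task detector, whose knob the log message itself names (`/proc/sys/kernel/hung_task_timeout_secs`). A minimal sketch for inspecting it while chasing hangs like this one (writing the value requires root; the 240-second figure is only an illustrative example):

```shell
#!/bin/sh
# Inspect the kernel's hung-task detector threshold (the knob named in the
# log message above); writing 0 disables the warning entirely.
TIMEOUT_FILE=/proc/sys/kernel/hung_task_timeout_secs

if [ -r "$TIMEOUT_FILE" ]; then
    echo "current timeout: $(cat "$TIMEOUT_FILE")s"
    # To raise the threshold while debugging (example value, needs root):
    # echo 240 > "$TIMEOUT_FILE"
else
    echo "hung-task detector not built into this kernel"
fi
```

Note that disabling or raising the threshold only silences the warning; it does not resolve the underlying stall in the NFS write path.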


Comments
Comment by Wojciech Turek (Inactive) [ 27/Jul/12 ]

I have hit the same kernel bug a few times on production compute nodes (a 700-node cluster), but I cannot reproduce it on demand.

BUG: unable to handle kernel paging request at fffffffb81d4d0e0
IP: [<ffffffff81052994>] update_curr+0x144/0x1f0
PGD 1a87067 PUD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:03:00.0/infiniband/mlx4_0/ports/1/link_layer
CPU 14
Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) nfsd exportfs acpi_cpufreq freq_table mperf rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_mad(U) ib_core(U) mlx4_core(U) ext4 mbcache jbd2 dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm wmi microcode dcdbas sg sb_edac edac_core i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma ipv6 sd_mod crc_t10dif ahci igb dca nfs lockd fscache nfs_acl auth_rpcgss sunrpc [last unloaded: scsi_wait_scan]

Pid: 62164, comm: calculate_propa Tainted: G W ---------------- 2.6.32-220.23.1.el6.x86_64 #1 Dell Inc. PowerEdge C6220/0WTH3T
RIP: 0010:[<ffffffff81052994>] [<ffffffff81052994>] update_curr+0x144/0x1f0
RSP: 0018:ffff88089c4c3db8 EFLAGS: 00010086
RAX: ffff881027480040 RBX: ffffffff8002ac60 RCX: ffff88086ff111c0
RDX: 0000000000018b48 RSI: 0000000000000000 RDI: ffff881027480078
RBP: ffff88089c4c3de8 R08: ffffffff8160b685 R09: 0000000000000000
R10: 0000000000000010 R11: 0000000000000000 R12: ffff88089c4d5fe8
R13: 00000000000f39dd R14: 0001ef9177be813f R15: ffff881027480040
FS: 00007f0c4f098700(0000) GS:ffff88089c4c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffffb81d4d0e0 CR3: 0000001027514000 CR4: 00000000000406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process calculate_propa (pid: 62164, threadinfo ffff881027528000, task ffff881027480040)
Stack:
ffff88089c4c3dc8 ffffffff81013743 ffff881027480078 ffff88089c4d5fe8
<0> 0000000000000000 0000000000000000 ffff88089c4c3e18 ffffffff81052f4b
<0> ffff88089c4d5f80 000000000000000e 0000000000015f80 000000000000000e
Call Trace:
<IRQ>
[<ffffffff81013743>] ? native_sched_clock+0x13/0x80
[<ffffffff81052f4b>] task_tick_fair+0xdb/0x160
[<ffffffff81056791>] scheduler_tick+0xc1/0x260
[<ffffffff810a0dc0>] ? tick_sched_timer+0x0/0xc0
[<ffffffff8107c482>] update_process_times+0x52/0x70
[<ffffffff810a0e26>] tick_sched_timer+0x66/0xc0
[<ffffffff810954ae>] __run_hrtimer+0x8e/0x1a0
[<ffffffff81012b59>] ? read_tsc+0x9/0x20
[<ffffffff81095856>] hrtimer_interrupt+0xe6/0x250
[<ffffffff814f55fb>] smp_apic_timer_interrupt+0x6b/0x9b
[<ffffffff8100bc13>] apic_timer_interrupt+0x13/0x20
<EOI>
[<ffffffffa009c0e6>] ? lprocfs_counter_sub+0x86/0x100 [lvfs]
[<ffffffffa055c258>] cl_page_free+0xe8/0x4c0 [obdclass]
[<ffffffffa08fe8dd>] ? vvp_page_delete+0x6d/0x100 [lustre]
[<ffffffffa055c89a>] cl_page_put+0x26a/0x440 [obdclass]
[<ffffffffa055b08d>] ? cl_page_delete+0x3d/0xf0 [obdclass]
[<ffffffffa0567bde>] ? cl_io_is_going+0xe/0x20 [obdclass]
[<ffffffffa08ee8e7>] ll_releasepage+0xd7/0x150 [lustre]
[<ffffffff811100b0>] try_to_release_page+0x30/0x60
[<ffffffff8112a4f1>] shrink_page_list.clone.0+0x4f1/0x5c0
[<ffffffff8112a8bb>] shrink_inactive_list+0x2fb/0x740
[<ffffffff8112b5cf>] shrink_zone+0x38f/0x520
[<ffffffff8112c374>] zone_reclaim+0x354/0x410
[<ffffffff8112cfc0>] ? isolate_pages_global+0x0/0x350
[<ffffffff81122874>] get_page_from_freelist+0x694/0x820
[<ffffffff8112aab2>] ? shrink_inactive_list+0x4f2/0x740
[<ffffffff81123af1>] __alloc_pages_nodemask+0x111/0x940
[<ffffffffa044e87e>] ? cfs_mem_cache_free+0xe/0x10 [libcfs]
[<ffffffffa044e87e>] ? cfs_mem_cache_free+0xe/0x10 [libcfs]
[<ffffffffa055c292>] ? cl_page_free+0x122/0x4c0 [obdclass]
[<ffffffff8115e152>] kmem_getpages+0x62/0x170
[<ffffffff8115e7bf>] cache_grow+0x2cf/0x320
[<ffffffff8115eada>] ____cache_alloc_node+0x8a/0x160
[<ffffffffa044ea13>] ? cfs_alloc+0x63/0x90 [libcfs]
[<ffffffff8115f719>] __kmalloc+0x189/0x220
[<ffffffffa044ea13>] cfs_alloc+0x63/0x90 [libcfs]
[<ffffffffa0636b4a>] ptlrpc_prep_bulk_imp+0x7a/0x350 [ptlrpc]
[<ffffffffa064313c>] ? lustre_msg_set_timeout+0x9c/0x110 [ptlrpc]
[<ffffffffa07ee3df>] osc_brw_prep_request+0x7cf/0x1030 [osc]
[<ffffffffa08039eb>] ? osc_req_attr_set+0xfb/0x2a0 [osc]
[<ffffffffa08f88c8>] ? ccc_req_attr_set+0x78/0x150 [lustre]
[<ffffffffa0564bfc>] ? cl_req_prep+0x8c/0x190 [obdclass]
[<ffffffffa07efdd5>] osc_send_oap_rpc+0x1195/0x1c20 [osc]
[<ffffffffa0840e97>] ? lov_merge_lvb_kms+0x127/0x2b0 [lov]
[<ffffffffa07e1ff1>] ? osc_consume_write_grant+0x81/0x160 [osc]
[<ffffffffa07f0b3e>] osc_check_rpcs+0x2de/0x470 [osc]
[<ffffffffa07b2317>] ? osc_quota_chkdq+0x47/0x3a0 [lquota]
[<ffffffffa0557a58>] ? cl_object_attr_get+0x88/0x1b0 [obdclass]
[<ffffffffa07e7143>] ? on_list+0x43/0x50 [osc]
[<ffffffffa07f16e3>] osc_queue_async_io+0x3c3/0x8f0 [osc]
[<ffffffffa0555d59>] ? cl_env_hops_keycmp+0x19/0x70 [obdclass]
[<ffffffffa07ff63f>] osc_page_cache_add+0xcf/0x200 [osc]
[<ffffffffa0558e68>] cl_page_invoke+0xb8/0x160 [obdclass]
[<ffffffffa0559e78>] cl_page_cache_add+0x58/0x240 [obdclass]
[<ffffffffa08eebd3>] ? ll_set_page_dirty+0x13/0x90 [lustre]
[<ffffffffa08ba8f6>] ? vvp_write_pending+0x56/0x150 [lustre]
[<ffffffffa08ffce3>] vvp_io_commit_write+0x343/0x5a0 [lustre]
[<ffffffffa045e3f2>] ? cfs_hash_lookup+0x82/0xa0 [libcfs]
[<ffffffffa0567d6f>] cl_io_commit_write+0xaf/0x1e0 [obdclass]
[<ffffffffa0558119>] ? cl_env_get+0x29/0x350 [obdclass]
[<ffffffffa08d70bd>] ll_commit_write+0xed/0x300 [lustre]
[<ffffffff812760cd>] ? copy_user_generic_string+0x2d/0x40
[<ffffffffa08ee990>] ll_write_end+0x30/0x60 [lustre]
[<ffffffff811116c4>] generic_file_buffered_write+0x174/0x2a0
[<ffffffff810708c7>] ? current_fs_time+0x27/0x30
[<ffffffff81112fb0>] __generic_file_aio_write+0x250/0x480
[<ffffffff8111324f>] generic_file_aio_write+0x6f/0xe0
[<ffffffffa09005f1>] vvp_io_write_start+0xa1/0x270 [lustre]
[<ffffffffa0564548>] cl_io_start+0x68/0x170 [obdclass]
[<ffffffffa05690c0>] cl_io_loop+0x110/0x1c0 [obdclass]
[<ffffffffa045e3f2>] ? cfs_hash_lookup+0x82/0xa0 [libcfs]
[<ffffffffa08a799b>] ll_file_io_generic+0x44b/0x580 [lustre]
[<ffffffffa045c4e4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
[<ffffffffa0558119>] ? cl_env_get+0x29/0x350 [obdclass]
[<ffffffffa08a7c0f>] ll_file_aio_write+0x13f/0x310 [lustre]
[<ffffffffa055828e>] ? cl_env_get+0x19e/0x350 [obdclass]
[<ffffffffa08ae2d1>] ll_file_write+0x171/0x310 [lustre]
[<ffffffff81176a38>] vfs_write+0xb8/0x1a0
[<ffffffff81177441>] sys_write+0x51/0x90
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Code: 00 8b 15 4c 35 a4 00 85 d2 74 34 48 8b 50 08 8b 5a 18 48 8b 90 10 09 00 00 48 8b 4a 50 48 85 c9 74 1d 48 63 db 66 90 48 8b 51 20 <48> 03 14 dd e0 6d bf 81 4c 01 2a 48 8b 49 78 48 85 c9 75 e8 48
RIP [<ffffffff81052994>] update_curr+0x144/0x1f0
RSP <ffff88089c4c3db8>
CR2: fffffffb81d4d0e0
--[ end trace 4e12596e8ccf7080 ]--
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 62164, comm: calculate_propa Tainted: G D W ---------------- 2.6.32-220.23.1.el6.x86_64 #1
Call Trace:
<IRQ> [<ffffffff814ecb34>] ? panic+0x78/0x143
[<ffffffff814f0ce2>] ? oops_end+0xf2/0x100
[<ffffffff810423fb>] ? no_context+0xfb/0x260
[<ffffffff8147cfee>] ? tcp_transmit_skb+0x3fe/0x7b0
[<ffffffff81042685>] ? __bad_area_nosemaphore+0x125/0x1e0
[<ffffffff81053455>] ? select_idle_sibling+0x95/0x150
[<ffffffff81042753>] ? bad_area_nosemaphore+0x13/0x20
[<ffffffff81042e0d>] ? __do_page_fault+0x31d/0x480
[<ffffffff8104d734>] ? scale_rt_power+0x24/0x80
[<ffffffff8104d83e>] ? update_group_power+0xae/0x110
[<ffffffff8126ab69>] ? cpumask_next_and+0x29/0x50
[<ffffffff810551d4>] ? find_busiest_group+0x244/0x9f0
[<ffffffff814f2c8e>] ? do_page_fault+0x3e/0xa0
[<ffffffff814f0045>] ? page_fault+0x25/0x30
[<ffffffff81052994>] ? update_curr+0x144/0x1f0
[<ffffffff81052958>] ? update_curr+0x108/0x1f0
[<ffffffff81013743>] ? native_sched_clock+0x13/0x80
[<ffffffff81052f4b>] ? task_tick_fair+0xdb/0x160
[<ffffffff81056791>] ? scheduler_tick+0xc1/0x260
[<ffffffff810a0dc0>] ? tick_sched_timer+0x0/0xc0
[<ffffffff8107c482>] ? update_process_times+0x52/0x70
[<ffffffff810a0e26>] ? tick_sched_timer+0x66/0xc0
[<ffffffff810954ae>] ? __run_hrtimer+0x8e/0x1a0
[<ffffffff81012b59>] ? read_tsc+0x9/0x20
[<ffffffff81095856>] ? hrtimer_interrupt+0xe6/0x250
[<ffffffff814f55fb>] ? smp_apic_timer_interrupt+0x6b/0x9b
[<ffffffff8100bc13>] ? apic_timer_interrupt+0x13/0x20
<EOI> [<ffffffffa009c0e6>] ? lprocfs_counter_sub+0x86/0x100 [lvfs]
[<ffffffffa055c258>] ? cl_page_free+0xe8/0x4c0 [obdclass]
[<ffffffffa08fe8dd>] ? vvp_page_delete+0x6d/0x100 [lustre]
[<ffffffffa055c89a>] ? cl_page_put+0x26a/0x440 [obdclass]
[<ffffffffa055b08d>] ? cl_page_delete+0x3d/0xf0 [obdclass]
[<ffffffffa0567bde>] ? cl_io_is_going+0xe/0x20 [obdclass]
[<ffffffffa08ee8e7>] ? ll_releasepage+0xd7/0x150 [lustre]
[<ffffffff811100b0>] ? try_to_release_page+0x30/0x60
[<ffffffff8112a4f1>] ? shrink_page_list.clone.0+0x4f1/0x5c0
[<ffffffff8112a8bb>] ? shrink_inactive_list+0x2fb/0x740
[<ffffffff8112b5cf>] ? shrink_zone+0x38f/0x520
[<ffffffff8112c374>] ? zone_reclaim+0x354/0x410
[<ffffffff8112cfc0>] ? isolate_pages_global+0x0/0x350
[<ffffffff81122874>] ? get_page_from_freelist+0x694/0x820
[<ffffffff8112aab2>] ? shrink_inactive_list+0x4f2/0x740
[<ffffffff81123af1>] ? __alloc_pages_nodemask+0x111/0x940
[<ffffffffa044e87e>] ? cfs_mem_cache_free+0xe/0x10 [libcfs]
[<ffffffffa044e87e>] ? cfs_mem_cache_free+0xe/0x10 [libcfs]
[<ffffffffa055c292>] ? cl_page_free+0x122/0x4c0 [obdclass]
[<ffffffff8115e152>] ? kmem_getpages+0x62/0x170
[<ffffffff8115e7bf>] ? cache_grow+0x2cf/0x320
[<ffffffff8115eada>] ? ____cache_alloc_node+0x8a/0x160
[<ffffffffa044ea13>] ? cfs_alloc+0x63/0x90 [libcfs]
[<ffffffff8115f719>] ? __kmalloc+0x189/0x220
[<ffffffffa044ea13>] ? cfs_alloc+0x63/0x90 [libcfs]
[<ffffffffa0636b4a>] ? ptlrpc_prep_bulk_imp+0x7a/0x350 [ptlrpc]
[<ffffffffa064313c>] ? lustre_msg_set_timeout+0x9c/0x110 [ptlrpc]
[<ffffffffa07ee3df>] ? osc_brw_prep_request+0x7cf/0x1030 [osc]
[<ffffffffa08039eb>] ? osc_req_attr_set+0xfb/0x2a0 [osc]
[<ffffffffa08f88c8>] ? ccc_req_attr_set+0x78/0x150 [lustre]
[<ffffffffa0564bfc>] ? cl_req_prep+0x8c/0x190 [obdclass]
[<ffffffffa07efdd5>] ? osc_send_oap_rpc+0x1195/0x1c20 [osc]
[<ffffffffa0840e97>] ? lov_merge_lvb_kms+0x127/0x2b0 [lov]
[<ffffffffa07e1ff1>] ? osc_consume_write_grant+0x81/0x160 [osc]
[<ffffffffa07f0b3e>] ? osc_check_rpcs+0x2de/0x470 [osc]
[<ffffffffa07b2317>] ? osc_quota_chkdq+0x47/0x3a0 [lquota]
[<ffffffffa0557a58>] ? cl_object_attr_get+0x88/0x1b0 [obdclass]
[<ffffffffa07e7143>] ? on_list+0x43/0x50 [osc]
[<ffffffffa07f16e3>] ? osc_queue_async_io+0x3c3/0x8f0 [osc]
[<ffffffffa0555d59>] ? cl_env_hops_keycmp+0x19/0x70 [obdclass]
[<ffffffffa07ff63f>] ? osc_page_cache_add+0xcf/0x200 [osc]
[<ffffffffa0558e68>] ? cl_page_invoke+0xb8/0x160 [obdclass]
[<ffffffffa0559e78>] ? cl_page_cache_add+0x58/0x240 [obdclass]
[<ffffffffa08eebd3>] ? ll_set_page_dirty+0x13/0x90 [lustre]
[<ffffffffa08ba8f6>] ? vvp_write_pending+0x56/0x150 [lustre]
[<ffffffffa08ffce3>] ? vvp_io_commit_write+0x343/0x5a0 [lustre]
[<ffffffffa045e3f2>] ? cfs_hash_lookup+0x82/0xa0 [libcfs]
[<ffffffffa0567d6f>] ? cl_io_commit_write+0xaf/0x1e0 [obdclass]
[<ffffffffa0558119>] ? cl_env_get+0x29/0x350 [obdclass]
[<ffffffffa08d70bd>] ? ll_commit_write+0xed/0x300 [lustre]
[<ffffffff812760cd>] ? copy_user_generic_string+0x2d/0x40
[<ffffffffa08ee990>] ? ll_write_end+0x30/0x60 [lustre]
[<ffffffff811116c4>] ? generic_file_buffered_write+0x174/0x2a0
[<ffffffff810708c7>] ? current_fs_time+0x27/0x30
[<ffffffff81112fb0>] ? __generic_file_aio_write+0x250/0x480
[<ffffffff8111324f>] ? generic_file_aio_write+0x6f/0xe0
[<ffffffffa09005f1>] ? vvp_io_write_start+0xa1/0x270 [lustre]
[<ffffffffa0564548>] ? cl_io_start+0x68/0x170 [obdclass]
[<ffffffffa05690c0>] ? cl_io_loop+0x110/0x1c0 [obdclass]
[<ffffffffa045e3f2>] ? cfs_hash_lookup+0x82/0xa0 [libcfs]
[<ffffffffa08a799b>] ? ll_file_io_generic+0x44b/0x580 [lustre]
[<ffffffffa045c4e4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
[<ffffffffa0558119>] ? cl_env_get+0x29/0x350 [obdclass]
[<ffffffffa08a7c0f>] ? ll_file_aio_write+0x13f/0x310 [lustre]
[<ffffffffa055828e>] ? cl_env_get+0x19e/0x350 [obdclass]
[<ffffffffa08ae2d1>] ? ll_file_write+0x171/0x310 [lustre]
[<ffffffff81176a38>] ? vfs_write+0xb8/0x1a0
[<ffffffff81177441>] ? sys_write+0x51/0x90
[<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
[-- MARK -- Thu Jul 26 00:00:00 2012]

Comment by Jian Yu [ 13/Aug/12 ]

Lustre Tag: v2_1_3_RC1
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/113/
Distro/Arch: RHEL6.3/x86_64 + RHEL6.3/i686 (Server + Client)
Network: TCP (1GigE)

parallel-scale-nfsv3 test iorssf failed as follows:

IOR-2.10.3: MPI Coordinated Test of Parallel I/O

Run began: Fri Aug 10 20:35:34 2012
Command line used: /usr/bin/IOR -a POSIX -b 5g -o /mnt/lustre/d0.ior.ssf/iorData -t 2m -v -w -r -i 1 -T 30 -k
Machine: Linux client-10vm1.lab.whamcloud.com
Start time skew across all tasks: 0.00 sec
Path: /mnt/lustre/d0.ior.ssf
FS: 47.6 GiB   Used FS: 4.5%   Inodes: 3.0 Mi   Used Inodes: 0.0%
Participating tasks: 4

Summary:
	api                = POSIX
	test filename      = /mnt/lustre/d0.ior.ssf/iorData
	access             = single-shared-file
	pattern            = segmented (1 segment)
	ordering in a file = sequential offsets
	ordering inter file= no tasks offsets
	clients            = 4 (2 per node)
	repetitions        = 1
	xfersize           = 2 MiB
	blocksize          = 5 GiB
	aggregate filesize = 20 GiB

Commencing write performance test.
Fri Aug 10 20:35:34 2012

[client-10vm1.lab.whamcloud.com:08636] 3 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[client-10vm1.lab.whamcloud.com:08636] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
** error **
ERROR in aiori-POSIX.c (line 256): transfer failed.
ERROR: Input/output error
** exiting **
** error **
ERROR in aiori-POSIX.c (line 256): transfer failed.
ERROR: Input/output error
** exiting **

https://maloo.whamcloud.com/test_sets/bd8f71d6-e38a-11e1-93f8-52540035b04c
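
For reference, the failing run can be re-driven with the same IOR invocation recorded under "Command line used" above. A hedged reproduction sketch (the IOR path, mount point, and task count come from the log; launching via `mpirun -np 4` is an assumption about how the test harness starts the tasks, so adjust for your cluster):

```shell
#!/bin/sh
# Re-drive the single-shared-file (SSF) IOR workload from the report.
# Flag meanings (from the "Command line used" line above):
#   -a POSIX  POSIX I/O backend      -b 5g   per-task block size
#   -t 2m     transfer size          -w -r   write phase, then read phase
#   -i 1     one repetition          -T 30   cap each test at 30 minutes
#   -k        keep the data file after the run
command -v mpirun >/dev/null 2>&1 || { echo "mpirun not found; skipping"; exit 0; }
[ -x /usr/bin/IOR ]                || { echo "IOR not installed; skipping"; exit 0; }
[ -d /mnt/lustre/d0.ior.ssf ]      || { echo "Lustre mount not present; skipping"; exit 0; }

# Four tasks, matching "Participating tasks: 4" in the log.
mpirun -np 4 /usr/bin/IOR -a POSIX -b 5g -o /mnt/lustre/d0.ior.ssf/iorData \
    -t 2m -v -w -r -i 1 -T 30 -k
```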

Console log on the node serving as MDS, Lustre client, and NFS server showed:

20:50:08:Modules linked in: lustre(U) obdfilter(U) ost(U) osd_ldiskfs(U) cmm(U) fsfilt_ldiskfs(U) mdt(U) mdd(U) mds(U) mgs(U) ldiskfs(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) lquota(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) nfs fscache jbd2 nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: lnet_selftest]
20:50:08:Pid: 13603, comm: nfsd Tainted: G        W  ---------------    2.6.32-279.2.1.el6_lustre.x86_64 #1
20:50:08:Call Trace:
20:50:08: [<ffffffff81057df6>] ? __schedule_bug+0x66/0x70
20:50:08:BUG: unable to handle kernel NULL pointer dereference at 0000000000000a08
20:50:08:IP: [<ffffffff8100f5ad>] print_context_stack+0xad/0x140
20:50:08:PGD 0 
20:50:08:Thread overran stack, or stack corrupted
20:50:08:Oops: 0000 [#1] SMP 
20:50:08:last sysfs file: /sys/module/nfsd/initstate
20:50:08:CPU 0 
20:50:09:Modules linked in: lustre(U) obdfilter(U) ost(U) osd_ldiskfs(U) cmm(U) fsfilt_ldiskfs(U) mdt(U) mdd(U) mds(U) mgs(U) ldiskfs(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) lquota(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) nfs fscache jbd2 nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: lnet_selftest]
20:50:09:
20:50:09:Pid: 13603, comm: nfsd Tainted: G        W  ---------------    2.6.32-279.2.1.el6_lustre.x86_64 #1 Red Hat KVM
20:50:09:RIP: 0010:[<ffffffff8100f5ad>]  [<ffffffff8100f5ad>] print_context_stack+0xad/0x140
20:50:09:RSP: 0018:ffff8800547da2e0  EFLAGS: 00010206
20:50:09:RAX: 0000000000000020 RBX: ffff8800547da468 RCX: 0000000000000000
20:50:09:RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000246
20:50:09:RBP: ffff8800547da340 R08: ffffffff81c019c0 R09: 0000000000000000
20:50:09:R10: 0000000000000001 R11: 0000000000000007 R12: ffff8800547da3e0
20:50:10:R13: ffff8800547da000 R14: ffffffff81600460 R15: 0000000000000000
20:50:10:FS:  0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
20:50:10:CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
20:50:10:CR2: 0000000000000a08 CR3: 000000007d071000 CR4: 00000000000006f0
20:50:10:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
20:50:10:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
20:50:10:Process nfsd (pid: 13603, threadinfo ffff8800547da000, task ffff880037d98ae0)
20:50:10:Stack:
20:50:10: ffff8800547da2f0 ffff8800547dbff8 ffff8800547da3e0 ffffffffffffe000
20:50:10:<d> ffffffff817b7290 ffffffff81057df6 0000000000000010 ffff8800547da448
20:50:10:<d> 000000000000cbe0 ffffffff81600460 ffffffff817b7290 ffff880002203fc0
20:50:10:Call Trace:
20:50:10: [<ffffffff81057df6>] ? __schedule_bug+0x66/0x70
20:50:10: [<ffffffff8100e520>] dump_trace+0x190/0x3b0
20:50:10: [<ffffffff8100f315>] show_trace_log_lvl+0x55/0x70
20:50:10: [<ffffffff8100f345>] show_trace+0x15/0x20
20:50:10: [<ffffffff814fd4d3>] dump_stack+0x6f/0x76
20:50:10: [<ffffffff810ac68a>] ? print_modules+0x5a/0xf0
20:50:10: [<ffffffff81057df6>] __schedule_bug+0x66/0x70
20:50:11: [<ffffffff814fe288>] thread_return+0x646/0x76e
20:50:11: [<ffffffffa089ee4b>] ? cfs_hash_add_unique+0x1b/0x40 [libcfs]
20:50:11: [<ffffffff810623da>] __cond_resched+0x2a/0x40
20:50:11: [<ffffffff814fe650>] _cond_resched+0x30/0x40
20:50:11: [<ffffffff8112d69f>] shrink_page_list.clone.0+0xdf/0x660
20:50:11: [<ffffffff8112df3c>] shrink_inactive_list+0x31c/0x7d0
20:50:11: [<ffffffffa089c0b7>] ? cfs_hash_bd_lookup_intent+0x37/0x120 [libcfs]
20:50:11: [<ffffffffa089b4f4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
20:50:11: [<ffffffffa089ed0d>] ? cfs_hash_find_or_add+0x9d/0x190 [libcfs]
20:50:11: [<ffffffff8112ecbf>] shrink_zone+0x38f/0x520
20:50:11: [<ffffffff8109cd49>] ? ktime_get_ts+0xa9/0xe0
20:50:11: [<ffffffff8112ef4e>] do_try_to_free_pages+0xfe/0x520
20:50:11: [<ffffffff8111835f>] ? zone_watermark_ok+0x1f/0x30
20:50:11: [<ffffffff8112f55d>] try_to_free_pages+0x9d/0x130
20:50:11: [<ffffffff81135f26>] ? next_online_pgdat+0x26/0x50
20:50:11: [<ffffffff811306b0>] ? isolate_pages_global+0x0/0x350
20:50:11: [<ffffffff8112735d>] __alloc_pages_nodemask+0x40d/0x940
20:50:11: [<ffffffff81161e92>] kmem_getpages+0x62/0x170
20:50:11: [<ffffffff81162aaa>] fallback_alloc+0x1ba/0x270
20:50:12: [<ffffffff811624ff>] ? cache_grow+0x2cf/0x320
20:50:12: [<ffffffff81162829>] ____cache_alloc_node+0x99/0x160
20:50:12: [<ffffffffa088da13>] ? cfs_alloc+0x63/0x90 [libcfs]
20:50:12: [<ffffffff81163459>] __kmalloc+0x189/0x220
20:50:12: [<ffffffffa088da13>] cfs_alloc+0x63/0x90 [libcfs]
20:50:12: [<ffffffffa03fad7a>] ptlrpc_prep_bulk_imp+0x7a/0x350 [ptlrpc]
20:50:12: [<ffffffffa040736c>] ? lustre_msg_set_timeout+0x9c/0x110 [ptlrpc]
20:50:12: [<ffffffffa05483df>] osc_brw_prep_request+0x7cf/0x1030 [osc]
20:50:12: [<ffffffffa055d88b>] ? osc_req_attr_set+0xfb/0x2a0 [osc]
20:50:12: [<ffffffffa08388c8>] ? ccc_req_attr_set+0x78/0x150 [lustre]
20:50:12: [<ffffffffa0b8b6cc>] ? cl_req_prep+0x8c/0x190 [obdclass]
20:50:12: [<ffffffffa0549dd5>] osc_send_oap_rpc+0x1195/0x1c20 [osc]
20:50:12: [<ffffffff814fdc90>] ? thread_return+0x4e/0x76e
20:50:12: [<ffffffffa054ab3e>] osc_check_rpcs+0x2de/0x470 [osc]
20:50:12: [<ffffffffa0b7e4f8>] ? cl_object_attr_get+0x88/0x1b0 [obdclass]
20:50:12: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
20:50:12: [<ffffffffa054b6e3>] osc_queue_async_io+0x3c3/0x8f0 [osc]
20:50:13: [<ffffffff81039632>] ? pvclock_clocksource_read+0x12/0xd0
20:50:13: [<ffffffff8109caaa>] ? do_gettimeofday+0x1a/0x50
20:50:13: [<ffffffffa055963f>] osc_page_cache_add+0xcf/0x200 [osc]
20:50:13: [<ffffffffa08976a1>] ? libcfs_debug_vmsg2+0x4d1/0xb50 [libcfs]
20:50:13: [<ffffffffa0b7f908>] cl_page_invoke+0xb8/0x160 [obdclass]
20:50:13: [<ffffffffa05b0afe>] ? lov_page_stripe+0x3e/0x150 [lov]
20:50:13: [<ffffffffa0b80918>] cl_page_cache_add+0x58/0x240 [obdclass]
20:50:13: [<ffffffffa089c15d>] ? cfs_hash_bd_lookup_intent+0xdd/0x120 [libcfs]
20:50:13: [<ffffffffa083fcc5>] vvp_io_commit_write+0x325/0x580 [lustre]
20:50:13: [<ffffffffa089d402>] ? cfs_hash_lookup+0x82/0xa0 [libcfs]
20:50:13: [<ffffffffa0b8e83f>] cl_io_commit_write+0xaf/0x1e0 [obdclass]
20:50:13: [<ffffffffa0b7ebb9>] ? cl_env_get+0x29/0x350 [obdclass]
20:50:13: [<ffffffffa08170ad>] ll_commit_write+0xed/0x300 [lustre]
20:50:13: [<ffffffffa082e990>] ll_write_end+0x30/0x60 [lustre]
20:50:13: [<ffffffff81114c4a>] generic_file_buffered_write+0x18a/0x2e0
20:50:13: [<ffffffff810724c7>] ? current_fs_time+0x27/0x30
20:50:13: [<ffffffff81116580>] __generic_file_aio_write+0x250/0x480
20:50:13: [<ffffffff8111681f>] generic_file_aio_write+0x6f/0xe0
20:50:14: [<ffffffffa08405d1>] vvp_io_write_start+0xa1/0x270 [lustre]
20:50:14: [<ffffffffa0b8b018>] cl_io_start+0x68/0x170 [obdclass]
20:50:14: [<ffffffffa0b8fb90>] cl_io_loop+0x110/0x1c0 [obdclass]
20:50:14: [<ffffffffa07e798b>] ll_file_io_generic+0x44b/0x580 [lustre]
20:50:14: [<ffffffffa089ee4b>] ? cfs_hash_add_unique+0x1b/0x40 [libcfs]
20:50:14: [<ffffffffa0b7ed2e>] ? cl_env_get+0x19e/0x350 [obdclass]
20:50:14: [<ffffffffa07e7bff>] ll_file_aio_write+0x13f/0x310 [lustre]
20:50:14: [<ffffffffa050ccc0>] ? lmv_clear_open_replay_data+0x120/0x520 [lmv]
20:50:14: [<ffffffffa07e7ac0>] ? ll_file_aio_write+0x0/0x310 [lustre]
20:50:14: [<ffffffff8117ad5b>] do_sync_readv_writev+0xfb/0x140
20:50:14: [<ffffffffa0813991>] ? ll_readahead_init+0x21/0x40 [lustre]
20:50:14: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
20:50:14: [<ffffffffa03f6b10>] ? ptlrpc_req_finished+0x10/0x20 [ptlrpc]
20:50:14: [<ffffffff81213286>] ? security_file_permission+0x16/0x20
20:50:14: [<ffffffff8117be06>] do_readv_writev+0xd6/0x1f0
20:50:14: [<ffffffff8117bf66>] vfs_writev+0x46/0x60
20:50:14: [<ffffffffa034d3f5>] nfsd_vfs_write+0x105/0x430 [nfsd]
20:50:15: [<ffffffff81178c62>] ? dentry_open+0x52/0xc0
20:50:15: [<ffffffffa034fb0b>] ? nfsd_open+0x1db/0x230 [nfsd]
20:50:15: [<ffffffffa034ff27>] nfsd_write+0xe7/0x100 [nfsd]
20:50:15: [<ffffffffa035783f>] nfsd3_proc_write+0xaf/0x140 [nfsd]
20:50:15: [<ffffffffa034843e>] nfsd_dispatch+0xfe/0x240 [nfsd]
20:50:15: [<ffffffffa02b25d4>] svc_process_common+0x344/0x640 [sunrpc]
20:50:15: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
20:50:15: [<ffffffffa02b2c10>] svc_process+0x110/0x160 [sunrpc]
20:50:15: [<ffffffffa0348b62>] nfsd+0xc2/0x160 [nfsd]
20:50:15: [<ffffffffa0348aa0>] ? nfsd+0x0/0x160 [nfsd]
20:50:15: [<ffffffff81091d66>] kthread+0x96/0xa0
20:50:15: [<ffffffff8100c14a>] child_rip+0xa/0x20
20:50:15: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
20:50:15: [<ffffffff8100c140>] ? child_rip+0x0/0x20
20:50:15:Code: f4 07 00 85 c0 74 2c 49 8d 44 24 08 48 39 c3 74 7d 31 d2 48 8b 75 c8 48 8b 7d c0 41 ff 56 10 48 81 7d c8 c1 ae 00 81 49 8b 45 00 <8b> 90 e8 09 00 00 74 1b 48 83 c3 08 4d 85 ff 75 92 4c 39 eb 76 
20:50:15:RIP  [<ffffffff8100f5ad>] print_context_stack+0xad/0x140
20:50:15: RSP <ffff8800547da2e0>
20:50:15:CR2: 0000000000000a08
Comment by Jian Yu [ 13/Aug/12 ]

RHEL6.3/x86_64 (2.1.3 RC1 Server + 1.8.8-wc1 Client):
https://maloo.whamcloud.com/test_sets/2211112e-e42a-11e1-b6d3-52540035b04c

Comment by Jian Yu [ 13/Aug/12 ]

RHEL6.3/x86_64 (2.1.3 Servers + 2.1.2 Clients):
https://maloo.whamcloud.com/test_sets/7f3dc2ee-e3ed-11e1-b6d3-52540035b04c

Comment by Peter Jones [ 13/Aug/12 ]

Oleg believes this to be a duplicate of LU-969. Please reopen if this problem recurs with that fix in place.
