|
Unfortunately, Maloo still does not report any logs from nodes other than the NFS client.
To be fully self-contained, these runs must include logs from the Lustre client that exports NFS, and also from the MDS and OST.
As things stand now, we can see from the NFS client that the NFS server went missing somewhere during the run, but I have no way of knowing why.
We need the logs from 10.10.4.15 to find out.
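For reference, a minimal sketch of what would help, assuming the commands are run on 10.10.4.15 and on the MDS and OST nodes (file names here are just examples):

lctl dk /tmp/lustre-debug.$(hostname).log    # dump the Lustre kernel debug buffer to a file
dmesg > /tmp/dmesg.$(hostname).log           # save the kernel/console log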
Thanks.
|
|
connectathon on RHEL5/x86_64 has a similar issue: during the lock test, the NFS server does not respond.
Console log from the NFS client:
-----------------
Lustre: DEBUG MARKER: == parallel-scale test connectathon: connectathon ==================================================== 22:38:38 (1308893918)
Lustre: DEBUG MARKER: ./runtests -N 2 -b -f /mnt/lustre/d0.connectathon
Lustre: DEBUG MARKER: ./runtests -N 2 -g -f /mnt/lustre/d0.connectathon
Lustre: DEBUG MARKER: ./runtests -N 2 -s -f /mnt/lustre/d0.connectathon
Lustre: DEBUG MARKER: ./runtests -N 2 -l -f /mnt/lustre/d0.connectathon
lockd: server 10.10.4.15 not responding, timed out
lockd: server 10.10.4.15 not responding, timed out
lockd: server 10.10.4.15 not responding, timed out
lockd: server 10.10.4.15 not responding, timed out
lockd: server 10.10.4.15 not responding, timed out
Attached are the logs from the Lustre client, MDS, and OST.
|
|
Hm, nothing interesting in the dmesg logs.
Is it possible to get a backtrace from the Lustre client?
It is interesting that I have not seen any problems like this; I guess I need to try with a remote node.
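In case it helps, one way to grab the backtraces (a sketch only, assuming SysRq is available or can be enabled on that kernel) is to dump all task stacks on the Lustre client / NFS server and save the result:

echo 1 > /proc/sys/kernel/sysrq    # make sure SysRq is enabled
echo t > /proc/sysrq-trigger       # dump stack traces of all tasks to the kernel log
dmesg > /tmp/nfs-server-backtraces.log

The hung nfsd/lockd threads should then show up in that output.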
|
|
I will post the backtrace tomorrow.
|
|
Here is the backtrace of the Lustre client (NFS server).
|
|
Got the same error on 2.0.65 RHEL6-x86_64/i686:
Lustre: DEBUG MARKER: == parallel-scale test ior: ior == 13:52:39 (1311195159)
nfs: server 10.10.4.15 not responding, still trying
nfs: server 10.10.4.15 not responding, still trying
nfs: server 10.10.4.15 not responding, still trying
nfs: server 10.10.4.15 not responding, still trying
INFO: task IOR:9286 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
IOR D c306be18 0 9286 9284 0x00000080
c31db030 00000082 00000002 c306be18 c3e03b54 00000000 c04b63d5 c0459a2e
11768000 f09ac740 0000300f 5952697c 0000300f c0ae0120 c0ae0120 c31db2d8
c0ae0120 c0adbb54 c0ae0120 c31db2d8 c3e03b54 e47e41ec 03230e0c c31db030
Call Trace:
[<c04b63d5>] ? rcu_process_callbacks+0x35/0x40
[<c0459a2e>] ? __do_softirq+0xae/0x1a0
[<c0459c85>] ? irq_exit+0x35/0x70
[<c04279f3>] ? smp_apic_timer_interrupt+0x53/0x90
[<c08239c5>] ? apic_timer_interrupt+0x31/0x38
[<c044007b>] ? print_cfs_rq+0x56b/0xc90
[<c08222d8>] ? __mutex_lock_slowpath+0xd8/0x140
[<c08221dd>] ? mutex_lock+0x1d/0x40
[<c04de7a9>] ? generic_file_aio_write+0x49/0xc0
[<f86dfc9c>] ? nfs_file_write+0xcc/0x1d0 [nfs]
[<c04631e5>] ? run_timer_softirq+0x35/0x2c0
[<c05277f5>] ? do_sync_write+0xd5/0x120
[<c0473cb0>] ? autoremove_wake_function+0x0/0x40
[<c0459c85>] ? irq_exit+0x35/0x70
[<c04279f3>] ? smp_apic_timer_interrupt+0x53/0x90
[<c0459a2e>] ? __do_softirq+0xae/0x1a0
[<c08239c5>] ? apic_timer_interrupt+0x31/0x38
[<c059cf4c>] ? security_file_permission+0xc/0x10
[<c05279c6>] ? rw_verify_area+0x66/0xe0
[<c0527720>] ? do_sync_write+0x0/0x120
[<c0527ae0>] ? vfs_write+0xa0/0x190
[<c04adb3c>] ? audit_syscall_entry+0x21c/0x240
[<c0528561>] ? sys_write+0x41/0x70
[<c0409adf>] ? sysenter_do_call+0x12/0x28
INFO: task IOR:9286 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
IOR D c306be18 0 9286 9284 0x00000080
c31db030 00000082 00000002 c306be18 c3e03b54 00000000 c04b63d5 c0459a2e
11768000 f09ac740 0000300f 5952697c 0000300f c0ae0120 c0ae0120 c31db2d8
c0ae0120 c0adbb54 c0ae0120 c31db2d8 c3e03b54 e47e41ec 03230e0c c31db030
Call Trace:
[<c04b63d5>] ? rcu_process_callbacks+0x35/0x40
[<c0459a2e>] ? __do_softirq+0xae/0x1a0
[<c0459c85>] ? irq_exit+0x35/0x70
[<c04279f3>] ? smp_apic_timer_interrupt+0x53/0x90
[<c08239c5>] ? apic_timer_interrupt+0x31/0x38
[<c044007b>] ? print_cfs_rq+0x56b/0xc90
[<c08222d8>] ? __mutex_lock_slowpath+0xd8/0x140
[<c08221dd>] ? mutex_lock+0x1d/0x40
[<c04de7a9>] ? generic_file_aio_write+0x49/0xc0
[<f86dfc9c>] ? nfs_file_write+0xcc/0x1d0 [nfs]
[<c04631e5>] ? run_timer_softirq+0x35/0x2c0
[<c05277f5>] ? do_sync_write+0xd5/0x120
[<c0473cb0>] ? autoremove_wake_function+0x0/0x40
[<c0459c85>] ? irq_exit+0x35/0x70
[<c04279f3>] ? smp_apic_timer_interrupt+0x53/0x90
[<c0459a2e>] ? __do_softirq+0xae/0x1a0
[<c08239c5>] ? apic_timer_interrupt+0x31/0x38
[<c059cf4c>] ? security_file_permission+0xc/0x10
[<c05279c6>] ? rw_verify_area+0x66/0xe0
[<c0527720>] ? do_sync_write+0x0/0x120
[<c0527ae0>] ? vfs_write+0xa0/0x190
[<c04adb3c>] ? audit_syscall_entry+0x21c/0x240
[<c0528561>] ? sys_write+0x41/0x70
[<c0409adf>] ? sysenter_do_call+0x12/0x28
INFO: task IOR:9286 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
IOR D c306be18 0 9286 9284 0x00000080
c31db030 00000082 00000002 c306be18 c3e03b54 00000000 c04b63d5 c0459a2e
11768000 f09ac740 0000300f 5952697c 0000300f c0ae0120 c0ae0120 c31db2d8
c0ae0120 c0adbb54 c0ae0120 c31db2d8 c3e03b54 e47e41ec 03230e0c c31db030
Call Trace:
[<c04b63d5>] ? rcu_process_callbacks+0x35/0x40
[<c0459a2e>] ? __do_softirq+0xae/0x1a0
[<c0459c85>] ? irq_exit+0x35/0x70
[<c04279f3>] ? smp_apic_timer_interrupt+0x53/0x90
[<c08239c5>] ? apic_timer_interrupt+0x31/0x38
[<c044007b>] ? print_cfs_rq+0x56b/0xc90
[<c08222d8>] ? __mutex_lock_slowpath+0xd8/0x140
[<c08221dd>] ? mutex_lock+0x1d/0x40
[<c04de7a9>] ? generic_file_aio_write+0x49/0xc0
[<f86dfc9c>] ? nfs_file_write+0xcc/0x1d0 [nfs]
[<c04631e5>] ? run_timer_softirq+0x35/0x2c0
[<c05277f5>] ? do_sync_write+0xd5/0x120
[<c0473cb0>] ? autoremove_wake_function+0x0/0x40
[<c0459c85>] ? irq_exit+0x35/0x70
[<c04279f3>] ? smp_apic_timer_interrupt+0x53/0x90
[<c0459a2e>] ? __do_softirq+0xae/0x1a0
[<c08239c5>] ? apic_timer_interrupt+0x31/0x38
[<c059cf4c>] ? security_file_permission+0xc/0x10
[<c05279c6>] ? rw_verify_area+0x66/0xe0
[<c0527720>] ? do_sync_write+0x0/0x120
[<c0527ae0>] ? vfs_write+0xa0/0x190
[<c04adb3c>] ? audit_syscall_entry+0x21c/0x240
[<c0528561>] ? sys_write+0x41/0x70
[<c0409adf>] ? sysenter_do_call+0x12/0x28
|
|
Marking as a blocker because this seems to fail regularly in NFS testing.
|
|
Hm, can you try the latest build from http://review.whamcloud.com/#change,911 and see if IOR still hangs?
Also, a backtrace from the NFS server (Lustre client) would be useful; when the NFS server is reported as not responding, the problem is somewhere on that end.
|
|
With this patch, it passed on RHEL6 NFSv3: https://maloo.whamcloud.com/test_sets/b0149d4a-b6f6-11e0-8bdf-52540025f9af
With a RHEL6-x86_64 server and a RHEL6-i686 Lustre client and NFS client, I got the following error.
NFS client console log:
-------------------------
Lustre: DEBUG MARKER: -----============= acceptance-small: parallel-scale ============---- Mon Jul 25 15:43:16 PDT 2011
Lustre: DEBUG MARKER: only running test ior
Lustre: DEBUG MARKER: excepting tests: parallel_grouplock
Lustre: DEBUG MARKER: == parallel-scale test ior: ior == 15:43:18 (1311633798)
nfs: server 10.10.4.15 not responding, still trying
nfs: server 10.10.4.15 not responding, still trying
INFO: task IOR:3628 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
IOR D e8c07e18 0 3628 3627 0x00000080
c2ca6ab0 00000082 00000002 e8c07e18 c3e03b54 00000000 c04b63d5 00000a3e
00000000 c2c18740 000001e7 a04f6bc3 000001e7 c0ae0120 c0ae0120 c2ca6d58
c0ae0120 c0adbb54 c0ae0120 c2ca6d58 c3e03b54 e958147c 001b6c91 c2ca6ab0
Call Trace:
[<c04b63d5>] ? rcu_process_callbacks+0x35/0x40
[<c0459c85>] ? irq_exit+0x35/0x70
[<c04279f3>] ? smp_apic_timer_interrupt+0x53/0x90
[<c08239c5>] ? apic_timer_interrupt+0x31/0x38
[<c08222d8>] ? __mutex_lock_slowpath+0xd8/0x140
[<c08221dd>] ? mutex_lock+0x1d/0x40
[<c04de7a9>] ? generic_file_aio_write+0x49/0xc0
[<f86dfc9c>] ? nfs_file_write+0xcc/0x1d0 [nfs]
[<c05277f5>] ? do_sync_write+0xd5/0x120
[<c0473cb0>] ? autoremove_wake_function+0x0/0x40
[<c0459a2e>] ? __do_softirq+0xae/0x1a0
[<c059cf4c>] ? security_file_permission+0xc/0x10
[<c05279c6>] ? rw_verify_area+0x66/0xe0
[<c0527720>] ? do_sync_write+0x0/0x120
[<c0527ae0>] ? vfs_write+0xa0/0x190
[<c04adb3c>] ? audit_syscall_entry+0x21c/0x240
[<c0528561>] ? sys_write+0x41/0x70
[<c0409adf>] ? sysenter_do_call+0x12/0x28
nfs: server 10.10.4.15 not responding, still trying
NFS server console log; sorry, I cannot get the backtrace since the following messages keep showing up:
---------------------------------------
Call Trace:
[<c04ed81e>] ? __alloc_pages_nodemask+0x67e/0x800
[<c0442233>] ? find_busiest_group+0x1d3/0x930
[<c051981c>] ? cache_alloc_refill+0x2bc/0x510
[<c0519dd2>] ? __kmalloc+0x142/0x180
[<c0775dff>] ? __alloc_skb+0x4f/0x140
[<c0775dff>] ? __alloc_skb+0x4f/0x140
[<c07760eb>] ? __netdev_alloc_skb+0x1b/0x40
[<f8115942>] ? igb_alloc_rx_buffers_adv+0x2c2/0x320 [igb]
[<c04b5171>] ? move_native_irq+0x11/0x50
[<f8115f16>] ? igb_poll+0x576/0xd20 [igb]
[<f8117dbe>] ? igb_msix_ring+0x4e/0x70 [igb]
[<c04b1d75>] ? handle_IRQ_event+0x45/0x140
[<c04b5171>] ? move_native_irq+0x11/0x50
[<c040a030>] ? common_interrupt+0x30/0x38
[<c0781b5e>] ? net_rx_action+0xde/0x280
[<c0459a0a>] ? __do_softirq+0x8a/0x1a0
[<c05f846c>] ? __const_udelay+0x2c/0x30
[<c04b5171>] ? move_native_irq+0x11/0x50
[<f811db2f>] ? igb_release_swfw_sync_82575+0xf/0x40 [igb]
[<c0459b5d>] ? do_softirq+0x3d/0x50
[<c0459cb5>] ? irq_exit+0x65/0x70
[<c040b1c0>] ? do_IRQ+0x50/0xc0
[<c04f9aa0>] ? vmstat_update+0x0/0x30
[<c040a030>] ? common_interrupt+0x30/0x38
[<c04f9aa0>] ? vmstat_update+0x0/0x30
[<c04f9ab0>] ? vmstat_update+0x10/0x30
[<c046f58b>] ? worker_thread+0x11b/0x230
[<c0473cb0>] ? autoremove_wake_function+0x0/0x40
[<c046f470>] ? worker_thread+0x0/0x230
[<c0473a74>] ? kthread+0x74/0x80
[<c0473a00>] ? kthread+0x0/0x80
[<c040a03f>] ? kernel_thread_helper+0x7/0x10
Mem-Info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
CPU 2: hi: 0, btch: 1 usd: 0
CPU 3: hi: 0, btch: 1 usd: 0
Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 0
CPU 1: hi: 186, btch: 31 usd: 0
CPU 2: hi: 186, btch: 31 usd: 0
CPU 3: hi: 186, btch: 31 usd: 0
HighMem per-cpu:
CPU 0: hi: 186, btch: 31 usd: 0
CPU 1: hi: 186, btch: 31 usd: 0
CPU 2: hi: 186, btch: 31 usd: 0
CPU 3: hi: 186, btch: 31 usd: 0
active_anon:2077 inactive_anon:354 isolated_anon:0
active_file:2444 inactive_file:756607 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
free:2146797 slab_reclaimable:3746 slab_unreclaimable:127322
mapped:2502 shmem:46 pagetables:316 bounce:0
DMA free:3472kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15792kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:64kB slab_unreclaimable:3772kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 863 12159 12159
Normal free:1400kB min:3724kB low:4652kB high:5584kB active_anon:0kB inactive_anon:0kB active_file:184kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:883912kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:14920kB slab_unreclaimable:505516kB kernel_stack:1488kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:334 all_unreclaimable? yes
lowmem_reserve[]: 0 0 90370 90370
HighMem free:8582316kB min:512kB low:12700kB high:24888kB active_anon:8308kB inactive_anon:1416kB active_file:9592kB inactive_file:3026428kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:11567412kB mlocked:0kB dirty:0kB writeback:0kB mapped:10004kB shmem:184kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:1264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 6*4kB 3*8kB 2*16kB 4*32kB 3*64kB 6*128kB 1*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3472kB
Normal: 60*4kB 7*8kB 1*16kB 0*32kB 1*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1400kB
HighMem: 1139*4kB 342*8kB 189*16kB 215*32kB 180*64kB 77*128kB 46*256kB 30*512kB 15*1024kB 5*2048kB 2073*4096kB = 8582316kB
759078 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 14565368kB
Total swap = 14565368kB
3145712 pages RAM
2918914 pages HighMem
62162 pages reserved
804322 pages shared
151475 pages non-shared
events/2: page allocation failure. order:0, mode:0x20
Pid: 21, comm: events/2 Tainted: G ---------------- T 2.6.32-131.2.1.el6.i686 #1
connectathon on RHEL5 still has this problem; please see the attached trace.
|
|
Well, this is an out-of-memory condition on i686.
Can you please test on x86_64 instead and see if that works?
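If the i686 setup still needs to be covered, it would also help to watch lowmem on the NFS server while the test runs, since the allocation failures above show the Normal zone exhausted by unreclaimable slab. A rough sketch:

grep -i low /proc/meminfo          # LowTotal/LowFree on 32-bit highmem kernels
cat /proc/buddyinfo                # free pages per order in each zone
slabtop -o -s c | head -20         # largest slab caches by size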
|
|
It works on RHEL6-x86_64 and fails on RHEL5-x86_64 (the connectathon test, not ior).
|
|
What does the failure look like on RHEL5 x86_64? Is there a Maloo run link?
Can you please run RHEL5 and RHEL6 several times each and see how consistently they fail?
Thanks.
|
|
No, there is no Maloo link. It looks like this; for the trace information, please see the attached.
Lustre: DEBUG MARKER: == parallel-scale test connectathon: connectathon ==================================================== 15:16:44 (1311718604)
Lustre: DEBUG MARKER: ./runtests -N 2 -b -f /mnt/lustre/d0.connectathon
Lustre: DEBUG MARKER: ./runtests -N 2 -g -f /mnt/lustre/d0.connectathon
Lustre: DEBUG MARKER: ./runtests -N 2 -s -f /mnt/lustre/d0.connectathon
Lustre: DEBUG MARKER: ./runtests -N 2 -l -f /mnt/lustre/d0.connectathon
lockd: server 10.10.4.15 not responding, timed out
lockd: server 10.10.4.15 not responding, timed out
lockd: server 10.10.4.15 not responding, timed out
lockd: server 10.10.4.15 not responding, timed out
I ran connectathon on RHEL5 four times and only one run passed: https://maloo.whamcloud.com/test_sets/4d895790-b7e3-11e0-8bdf-52540025f9af
I also ran ior on RHEL6 twice today against this patch; both runs failed.
|
|
Hm, the stack trace points at locking problems, which is quite strange; I think those are all fixed, and I have not seen them in quite a while.
Are such connectathon failures happening only with NFSv3 and never with NFSv4?
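If it is easy to check, mounting the export explicitly with each protocol version should isolate it; a sketch only, with the mount point and export path standing in for whatever the test config actually uses:

mount -t nfs -o vers=3 10.10.4.15:/mnt/lustre /mnt/nfs    # NFSv3: locking goes through lockd
mount -t nfs4 10.10.4.15:/mnt/lustre /mnt/nfs             # NFSv4: locking handled in-protocol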
|
Are such connectathon failures happening only with NFSv3 and never with NFSv4?
From the records, only NFSv3 has this problem, never NFSv4.
|
|
Can you please try http://review.whamcloud.com/1174? This seems to be working well for me now.
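If it saves a step, the change can also be fetched straight from Gerrit; the project name and patchset number below are placeholders, use whatever the review page shows:

git fetch http://review.whamcloud.com/<project> refs/changes/74/1174/<patchset>
git checkout FETCH_HEAD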
|
|
I ran connectathon on RHEL5 three times and all runs passed, while ior on RHEL6 still has this problem. Please see the attached trace from the NFS server.
|
|
Interesting, I did my testing on RHEL6 and met no problems, though I concentrated on locking only.
Anyway, the problem seen in the stack trace is a different one.
Can you please file a separate bug for it? This bug is already overloaded with various failures, and it's about time we bring some order to it.
|
|
I've created a new issue, LU-567; please check it.
|
|
Integrated in lustre-master » x86_64,client,el5,ofa #242
LU-451 Workaround for lockd (nfsv3) locking
Oleg Drokin : 71535dbbb89774eecb2f83eff80b19515e14da70
Files :
|
|
Integrated in lustre-master » x86_64,server,el5,inkernel #242
LU-451 Workaround for lockd (nfsv3) locking
Oleg Drokin : 71535dbbb89774eecb2f83eff80b19515e14da70
Files :
|
|
Integrated in lustre-master » x86_64,client,sles11,inkernel #242
LU-451 Workaround for lockd (nfsv3) locking
Oleg Drokin : 71535dbbb89774eecb2f83eff80b19515e14da70
Files :
|
|
Integrated in lustre-master » i686,server,el6,inkernel #242
LU-451 Workaround for lockd (nfsv3) locking
Oleg Drokin : 71535dbbb89774eecb2f83eff80b19515e14da70
Files :
|
|
Integrated in lustre-master » x86_64,server,el6,inkernel #242
LU-451 Workaround for lockd (nfsv3) locking
Oleg Drokin : 71535dbbb89774eecb2f83eff80b19515e14da70
Files :
|
|
Integrated in lustre-master » x86_64,server,el5,ofa #242
LU-451 Workaround for lockd (nfsv3) locking
Oleg Drokin : 71535dbbb89774eecb2f83eff80b19515e14da70
Files :
|
|
Integrated in lustre-master » x86_64,client,el5,inkernel #242
LU-451 Workaround for lockd (nfsv3) locking
Oleg Drokin : 71535dbbb89774eecb2f83eff80b19515e14da70
Files :
|
|
Integrated in lustre-master » x86_64,client,el6,inkernel #242
LU-451 Workaround for lockd (nfsv3) locking
Oleg Drokin : 71535dbbb89774eecb2f83eff80b19515e14da70
Files :
|
|
Integrated in lustre-master » i686,client,el6,inkernel #242
LU-451 Workaround for lockd (nfsv3) locking
Oleg Drokin : 71535dbbb89774eecb2f83eff80b19515e14da70
Files :
|
|
Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #242
LU-451 Workaround for lockd (nfsv3) locking
Oleg Drokin : 71535dbbb89774eecb2f83eff80b19515e14da70
Files :
|
|
Integrated in lustre-master » i686,client,el5,inkernel #242
LU-451 Workaround for lockd (nfsv3) locking
Oleg Drokin : 71535dbbb89774eecb2f83eff80b19515e14da70
Files :
|
|
Integrated in lustre-master » i686,server,el5,ofa #242
LU-451 Workaround for lockd (nfsv3) locking
Oleg Drokin : 71535dbbb89774eecb2f83eff80b19515e14da70
Files :
|
|
Integrated in lustre-master » i686,client,el5,ofa #242
LU-451 Workaround for lockd (nfsv3) locking
Oleg Drokin : 71535dbbb89774eecb2f83eff80b19515e14da70
Files :
|
|
Integrated in lustre-master » i686,server,el5,inkernel #242
LU-451 Workaround for lockd (nfsv3) locking
Oleg Drokin : 71535dbbb89774eecb2f83eff80b19515e14da70
Files :
|
|
Landed for 2.1
|