[LU-849] NFS server not responding when running parallel-scale test_iorfpp Created: 15/Nov/11  Updated: 16/Jan/12  Resolved: 16/Jan/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Sarah Liu Assignee: Lai Siyao
Resolution: Duplicate Votes: 0
Labels: None
Environment:

server and client: RHEL6-x86_64 build https://newbuild.whamcloud.com/job/lustre-master/353/


Attachments: File debug     File debug-1     File nfs-server-dmesg     File nfs-server-trace    
Severity: 3
Rank (Obsolete): 6524

 Description   

While running parallel-scale test_iorfpp over NFSv3, the client reported an "nfs server not responding" error:

Lustre: DEBUG MARKER: ----============= acceptance-small: parallel-scale ============---- Tue Nov 15 11:54:31 PST 2011
Lustre: DEBUG MARKER: excepting tests: parallel_grouplock
Lustre: DEBUG MARKER: == parallel-scale test compilebench: compilebench == 11:54:33 (1321386873)
Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
Lustre: DEBUG MARKER: == parallel-scale test metabench: metabench == 12:32:28 (1321389148)
Lustre: DEBUG MARKER: == parallel-scale test simul: simul == 12:33:57 (1321389237)
Lustre: DEBUG MARKER: SKIP: parallel-scale test_simul skipped for NFSCLIENT mode
Lustre: DEBUG MARKER: == parallel-scale test mdtestssf: mdtestssf == 12:33:58 (1321389238)
Lustre: DEBUG MARKER: SKIP: parallel-scale test_mdtestssf skipped for NFSCLIENT mode
Lustre: DEBUG MARKER: == parallel-scale test mdtestfpp: mdtestfpp == 12:33:58 (1321389238)
Lustre: DEBUG MARKER: SKIP: parallel-scale test_mdtestfpp skipped for NFSCLIENT mode
Lustre: DEBUG MARKER: == parallel-scale test connectathon: connectathon == 12:33:59 (1321389239)
Lustre: DEBUG MARKER: ./runtests -N 2 -b -f /mnt/lustre/d0.connectathon
Lustre: DEBUG MARKER: ./runtests -N 2 -g -f /mnt/lustre/d0.connectathon
Lustre: DEBUG MARKER: ./runtests -N 2 -s -f /mnt/lustre/d0.connectathon
Lustre: DEBUG MARKER: ./runtests -N 2 -l -f /mnt/lustre/d0.connectathon
Lustre: DEBUG MARKER: == parallel-scale test iorssf: iorssf == 12:35:05 (1321389305)
Lustre: DEBUG MARKER: == parallel-scale test iorfpp: iorfpp == 12:37:56 (1321389476)
nfs: server 10.10.4.15 not responding, still trying
nfs: server 10.10.4.15 not responding, still trying



 Comments   
Comment by Sarah Liu [ 15/Nov/11 ]

NFSv4 has this issue too.

Comment by Sarah Liu [ 16/Nov/11 ]

The attached files are the logs from the Lustre client (NFS server).

Comment by Peter Jones [ 16/Nov/11 ]

Lai

Could you please comment on this one?

Thanks

Peter

Comment by Johann Lombardi (Inactive) [ 24/Nov/11 ]

Hmm, several nfsd threads seem to be stuck in splice_read (somewhere in cl_page_list_disown(), although it is not clear whether the stack trace is reliable).
Sarah, could you please rerun the test and collect lustre debug logs?

Comment by Sarah Liu [ 27/Nov/11 ]

Sure, I will keep you updated.

Comment by Sarah Liu [ 28/Nov/11 ]

Debug log of the Lustre client (NFS server).

Comment by Johann Lombardi (Inactive) [ 28/Nov/11 ]

Ah, this is with the default debug mask. Could you please collect one with the debug mask set to -1? Thanks in advance.
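For reference, a full-mask Lustre debug log is typically collected with lctl on the node being debugged. A rough sketch (run as root on the Lustre client acting as NFS server; the dump path is arbitrary):

```shell
# Enable all debug flags, clear the ring buffer, reproduce, then dump.
lctl set_param debug=-1          # -1 = full debug mask
lctl clear                       # empty the kernel debug buffer
# ... reproduce the NFS hang here ...
lctl dk /tmp/lustre-debug.log    # dump the debug buffer to a file
```

Note the debug buffer is a ring, so dump promptly after reproducing or early messages may be overwritten.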

Comment by Sarah Liu [ 29/Nov/11 ]

debug -1 log

Comment by Oleg Drokin [ 03/Jan/12 ]

Lai, Jinshan, we need to look into this, as it potentially indicates that recent clio changes introduced deadlocks in the sendfile path.

Comment by Lai Siyao [ 03/Jan/12 ]

nfsd uses the splice read/write interface on newer kernels. I ran some splice tests (from LTP), and they passed. I'll do more testing later.

Comment by Lai Siyao [ 09/Jan/12 ]

In my local test, I found it is a stack overflow as well. I'll try decreasing the stack size a bit to verify.

Comment by Lai Siyao [ 10/Jan/12 ]

http://jira.whamcloud.com/browse/LU-861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=26225#comment-26225

According to Jinshan's comment on LU-861, two fixes may help; this work will continue after those land.

Comment by Peter Jones [ 16/Jan/12 ]

Let's track this under LU-861 and open a new ticket if the problems still exist after those fixes have landed.

Generated at Sat Feb 10 01:11:00 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.