[LU-4717] (rw.c:128:ll_cl_init()) husk1: [0x280000f70:0x11c59:0x0] no active IO, please file a ticket. Created: 05/Mar/14  Updated: 05/Apr/17  Resolved: 08/May/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.6.0

Type: Bug
Priority: Critical
Reporter: Ann Koehler (Inactive)
Assignee: Zhenyu Xu
Resolution: Fixed
Votes: 0
Labels: None
Environment:

Master on SLES11 SP3


Attachments: File console.c0-0c0s10n2    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 12967

 Description   

Filing a ticket as instructed. The log file for a client is filled with stack traces from the following error. All of the stack traces are identical.

LustreError: 11692:0:(rw.c:128:ll_cl_init()) husk1: [0x280000f70:0x11c59:0x0] no active IO, please file a ticket.
 Pid: 11692, comm: ksh_so_hack.bin
 Trace:
 [<ffffffff81005eb9>] try_stack_unwind+0x169/0x1b0
 [<ffffffff81004919>] dump_trace+0x89/0x450
 [<ffffffffa02158d7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
 [<ffffffffa07f33ae>] ll_cl_init+0x21e/0x320 [lustre]
 [<ffffffffa07f34f8>] ll_readpage+0x48/0x1b0 [lustre]
 [<ffffffff81106418>] __do_page_cache_readahead+0x1e8/0x260
 [<ffffffff81106538>] force_page_cache_readahead+0x78/0xa0
 [<ffffffff810ff30d>] sys_fadvise64_64+0xdd/0x230
 [<ffffffff810ff46e>] sys_fadvise64+0xe/0x10
 [<ffffffff8145376b>] system_call_fastpath+0x16/0x1b
 [<00002aaaaaac11bd>] 0x2aaaaaac11bd
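
The trace above reaches ll_readpage() from sys_fadvise64() through force_page_cache_readahead(), which is the path the kernel takes for a POSIX_FADV_WILLNEED hint. The following is a minimal sketch of a user-space call that exercises that kernel path. It assumes (the ticket does not confirm this) that the application issued a WILLNEED hint on a file in the Lustre mount; the helper name `advise_willneed` and the temporary stand-in file are hypothetical, and on a non-Lustre file system the call simply succeeds.

```python
import os
import tempfile


def advise_willneed(path: str, offset: int = 0, length: int = 0) -> int:
    """Issue a POSIX_FADV_WILLNEED hint on `path` and return the file size.

    A length of 0 asks the kernel to consider everything from `offset`
    to the end of the file.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # On Linux this enters the fadvise64 system call seen in the trace.
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
        return os.fstat(fd).st_size
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Hypothetical stand-in file; point this at a file on the Lustre
    # client mount to exercise the Lustre readpage path instead.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * 65536)
    try:
        print(advise_willneed(tmp.name))
    finally:
        os.unlink(tmp.name)
```

Running this in a loop under I/O load is only a guess at a reproducer; the ticket itself reports the error from IOSTRESS and from sanity test 54c.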

We also see these messages an hour before those above (in case they are related):

LustreError: 4943:0:(mdc_request.c:1580:mdc_read_page()) husk1-MDT0000-mdc-ffff88044bc53800: read cache page: [0x280000f14:0x4:0x0] at 4753935872275117037: rc -5

LustreError: 5003:0:(mdc_request.c:1580:mdc_read_page()) husk1-MDT0000-mdc-ffff88044bc53800: read cache page: [0x280000f14:0x7:0x0] at 4753935872275117037: rc -5

LustreError: 5984:0:(mdc_request.c:1580:mdc_read_page()) husk1-MDT0000-mdc-ffff88044bc53800: read cache page: [0x280000f14:0x17fe9:0x0] at 6497832999440693922: rc -5

Attaching log file. Dump is available if you want it.



 Comments   
Comment by Peter Jones [ 13/Mar/14 ]

Bobijam

Could you please look into this one?

Thanks

Peter

Comment by Ann Koehler (Inactive) [ 13/Mar/14 ]

I've uploaded the kdumps of 2 nodes that exhibited this bug to:

ftp.whamcloud.com:/uploads/LU-4717/LU4717_no_active_io.tgz

I'm not sure how much they will help. The processes issuing the errors had terminated by the time the dumps were taken, but I'm passing them along in case there might be cached data structures with useful info.

Comment by Zhenyu Xu [ 14/Mar/14 ]

please try patch http://review.whamcloud.com/9658

Comment by Ann Koehler (Inactive) [ 17/Mar/14 ]

The patch is scheduled for testing this week. Will let you know the results when available.

Comment by Jodi Levi (Inactive) [ 24/Mar/14 ]

Is the testing of change 9658 still in progress, or have you gotten results yet?

Comment by Ann Koehler (Inactive) [ 24/Mar/14 ]

Testing is still in progress.

Comment by Mark Mansk [ 25/Mar/14 ]

Finished testing this afternoon.

It looks as though this patch fixed the issue: after running IOSTRESS for 5 hours I haven't seen it. Previously it occurred within an hour of running.

Comment by Peter Jones [ 25/Mar/14 ]

Thanks Mark!

Comment by Jinshan Xiong (Inactive) [ 25/Mar/14 ]

Is it really necessary for Cray to have fadvise() support? I would like to return an error value in this case so that fadvise() will actually be disabled.

Comment by Cory Spitz [ 26/Mar/14 ]

One way or another this bug should get cleaned up. If fadvise() won't be supported in CLIO then we should update the Ops Manual with a discussion about that in the API section. But wouldn't it be better in the long run to actually use input from fadvise() to make good decisions about what Linux should do with the page cache, even if CLIO can't (currently) make better use of the advice?

Comment by Jinshan Xiong (Inactive) [ 26/Mar/14 ]

The major problem with fadvise() is that it doesn't have a callback into the file system, so Lustre can only provide limited support.

However, Lustre can easily support POSIX_FADV_WILLNEED, and I believe this is the most frequently used option for fadvise(). We can just check whether a lock already exists on the client side and, if so, read ahead pages as requested by fadvise(). How does this sound?
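
The proposal in this comment can be illustrated with a toy model, not Lustre code. The sketch assumes that "a lock already exists" means a single cached lock covers the whole requested byte range, which the comment does not specify; `Extent`, `covers`, and `willneed_readahead` are hypothetical names, and the model only handles a positive length.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Extent:
    start: int  # first byte covered, inclusive
    end: int    # last byte covered, inclusive


def covers(lock: Extent, start: int, end: int) -> bool:
    """True if `lock` covers every byte in [start, end]."""
    return lock.start <= start and end <= lock.end


def willneed_readahead(cached_locks: Iterable[Extent],
                       offset: int, length: int) -> Optional[Extent]:
    """Return the range to read ahead, or None to ignore the hint.

    Assumption: readahead is attempted only when one lock already held
    on the client covers the whole request, so the hint never has to
    acquire a new lock.
    """
    if length <= 0:
        return None
    start, end = offset, offset + length - 1
    for lock in cached_locks:
        if covers(lock, start, end):
            return Extent(start, end)
    return None
```

For example, with one cached lock on bytes 0 to 4095, a hint for offset 0 and length 4096 returns that range, while a hint starting at offset 4096 returns None and is ignored.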

Comment by Cory Spitz [ 26/Mar/14 ]

That sounds ok to me. I guess we'll have to wait for more use and exposure to see what the application writers will want. We can track those needs in new tickets.

Comment by Mark Mansk [ 08/May/14 ]

We've hit this error again, repeatedly, running Sanity - test 54c this time against LU-3321 built into our 2.6 branch. Patch http://review.whamcloud.com/9658 was not yet in our 2.6 build.

We had not seen this error until recently. Are there changes that are bringing it to light more often?
Test 54c fails when attempting to mount the loop device it created, with the following in dmesg:
Buffer I/O error on device loop3, logical block 0
lost page write due to I/O error on loop3

In the log file:
mount: wrong fs type, bad option, bad superblock on /tmp/dal/loop54c,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so

Comment by Jodi Levi (Inactive) [ 08/May/14 ]

Patch landed to Master. Please reopen the ticket if more work is needed.

Generated at Sat Feb 10 01:45:13 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.