[LU-4717] (rw.c:128:ll_cl_init()) husk1: [0x280000f70:0x11c59:0x0] no active IO, please file a ticket. Created: 05/Mar/14 Updated: 05/Apr/17 Resolved: 08/May/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.6.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Ann Koehler (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: | Master on SLES11 SP3 |
| Severity: | 3 | ||||
| Rank (Obsolete): | 12967 | ||||
| Description |
|
Filing a ticket as instructed. The log file for a client is filled with stack traces from the following error. All stack traces are the same.

LustreError: 11692:0:(rw.c:128:ll_cl_init()) husk1: [0x280000f70:0x11c59:0x0] no active IO, please file a ticket.
Pid: 11692, comm: ksh_so_hack.bin
Trace:
[<ffffffff81005eb9>] try_stack_unwind+0x169/0x1b0
[<ffffffff81004919>] dump_trace+0x89/0x450
[<ffffffffa02158d7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
[<ffffffffa07f33ae>] ll_cl_init+0x21e/0x320 [lustre]
[<ffffffffa07f34f8>] ll_readpage+0x48/0x1b0 [lustre]
[<ffffffff81106418>] __do_page_cache_readahead+0x1e8/0x260
[<ffffffff81106538>] force_page_cache_readahead+0x78/0xa0
[<ffffffff810ff30d>] sys_fadvise64_64+0xdd/0x230
[<ffffffff810ff46e>] sys_fadvise64+0xe/0x10
[<ffffffff8145376b>] system_call_fastpath+0x16/0x1b
[<00002aaaaaac11bd>] 0x2aaaaaac11bd

We also saw these messages an hour before those above (in case they are related):

LustreError: 4943:0:(mdc_request.c:1580:mdc_read_page()) husk1-MDT0000-mdc-ffff88044bc53800: read cache page: [0x280000f14:0x4:0x0] at 4753935872275117037: rc -5
LustreError: 5003:0:(mdc_request.c:1580:mdc_read_page()) husk1-MDT0000-mdc-ffff88044bc53800: read cache page: [0x280000f14:0x7:0x0] at 4753935872275117037: rc -5
LustreError: 5984:0:(mdc_request.c:1580:mdc_read_page()) husk1-MDT0000-mdc-ffff88044bc53800: read cache page: [0x280000f14:0x17fe9:0x0] at 6497832999440693922: rc -5

Attaching log file. A dump is available if you want it. |
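
For anyone trying to reproduce this from user space: the trace above enters through sys_fadvise64 and reaches ll_readpage via force_page_cache_readahead, so the error is triggered by an fadvise() readahead request rather than by a normal read. The sketch below issues such a request from Python. It is only a guess at what ksh_so_hack.bin does; the ticket does not say which advice value or byte range the application passed, so POSIX_FADV_WILLNEED over the whole file is an assumption, and `advise_willneed` is a hypothetical helper name.

```python
import os


def advise_willneed(path: str) -> int:
    """Ask the kernel to prefetch the whole file; return the file size in bytes.

    Assumption: the failing application used POSIX_FADV_WILLNEED. The ticket
    only shows that fadvise led to a forced page cache readahead.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        # offset=0 with len=0 covers everything from the start to end of file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
        return size
    finally:
        os.close(fd)
```

Calling `advise_willneed()` on a file in a Lustre mount and then checking the client log for "no active IO" would be one way to confirm whether a given build is affected.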
| Comments |
| Comment by Peter Jones [ 13/Mar/14 ] |
|
Bobijam, could you please look into this one? Thanks, Peter |
| Comment by Ann Koehler (Inactive) [ 13/Mar/14 ] |
|
I've uploaded the kdumps of 2 nodes that exhibited this bug to: ftp.whamcloud.com:/uploads/ I'm not sure how much they will help. The processes issuing the errors had terminated by the time the dumps were taken, but I'm passing them along in case there might be cached data structures with useful info. |
| Comment by Zhenyu Xu [ 14/Mar/14 ] |
|
Please try patch http://review.whamcloud.com/9658 |
| Comment by Ann Koehler (Inactive) [ 17/Mar/14 ] |
|
The patch is scheduled for testing this week. Will let you know the results when available. |
| Comment by Jodi Levi (Inactive) [ 24/Mar/14 ] |
|
Is the testing of change 9658 still in progress, or have you gotten results yet? |
| Comment by Ann Koehler (Inactive) [ 24/Mar/14 ] |
|
Testing is still in progress. |
| Comment by Mark Mansk [ 25/Mar/14 ] |
|
Finished testing this afternoon. It looks as though this patch fixed the issue: after running IOSTRESS for 5 hours I haven't seen it. Previously it occurred within an hour of running. |
| Comment by Peter Jones [ 25/Mar/14 ] |
|
Thanks Mark! |
| Comment by Jinshan Xiong (Inactive) [ 25/Mar/14 ] |
|
Is it really necessary for Cray to have fadvise() support? I would like to return an error value in this case so that fadvise() is effectively disabled. |
| Comment by Cory Spitz [ 26/Mar/14 ] |
|
One way or another this bug should get cleaned up. If fadvise() won't be supported in CLIO, then we should update the Ops Manual with a discussion about that in the API section. But wouldn't it be better in the long run to actually use input from fadvise() to make good decisions about what Linux should do with the page cache, even if CLIO can't (currently) make better use of the advice? |
| Comment by Jinshan Xiong (Inactive) [ 26/Mar/14 ] |
|
The major problem with fadvise() is that it doesn't have a callback into the file system, so Lustre can only provide limited support. However, Lustre can easily support POSIX_FADV_WILLNEED, and I believe this is the most frequently used option for fadvise(). We can just check whether a lock already exists on the client side and, if so, read ahead the pages requested by fadvise(). How does this sound? |
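
To make the proposal concrete, here is a toy model of the check described above: serve a WILLNEED request only when the client already holds a lock covering the start of the range, and otherwise do nothing. This is not Lustre code. `Extent` and `readahead_range` are hypothetical names, and clamping the request to the end of the cached lock is an assumption of this sketch, not something stated in the comment.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Extent:
    """Inclusive byte range [start, end]; a stand-in for a lock or request extent."""
    start: int
    end: int


def readahead_range(request: Extent, cached_locks: Iterable[Extent]) -> Optional[Extent]:
    """Return the part of the request that an existing lock covers, or None.

    Hypothetical policy: never take a new lock for fadvise(); only read ahead
    under a lock the client already holds.
    """
    for lock in cached_locks:
        if lock.start <= request.start <= lock.end:
            # Assumption: clamp to the lock instead of extending it.
            return Extent(request.start, min(request.end, lock.end))
    # No cached lock covers the start, so skip the hint (fadvise is advisory).
    return None
```

With a single cached lock over bytes 0 to 99, a request for bytes 50 to 150 would be trimmed to 50 to 99, and a request for bytes 200 to 300 would be skipped.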
| Comment by Cory Spitz [ 26/Mar/14 ] |
|
That sounds ok to me. I guess we'll have to wait for more use and exposure to see what the application writers will want. We can track those needs in new tickets. |
| Comment by Mark Mansk [ 08/May/14 ] |
|
We've hit this error again, repeatedly, running Sanity - test 54c this time against We had not seen this error until recently; are there changes that are bringing it to light more often? In log file: |
| Comment by Jodi Levi (Inactive) [ 08/May/14 ] |
|
Patch landed to Master. Please reopen the ticket if more work is needed. |