
(rw.c:128:ll_cl_init()) husk1: [0x280000f70:0x11c59:0x0] no active IO, please file a ticket.

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.6.0
    • Lustre 2.6.0
    • None
    • Master on SLES11 SP3
    • 3
    • 12967

    Description

      Filing a ticket as instructed. The log file for a client is filled with stack traces from the following error. All of the stack traces are identical.

      LustreError: 11692:0:(rw.c:128:ll_cl_init()) husk1: [0x280000f70:0x11c59:0x0] no active IO, please file a ticket.
       Pid: 11692, comm: ksh_so_hack.bin
       Trace:
       [<ffffffff81005eb9>] try_stack_unwind+0x169/0x1b0
       [<ffffffff81004919>] dump_trace+0x89/0x450
       [<ffffffffa02158d7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
       [<ffffffffa07f33ae>] ll_cl_init+0x21e/0x320 [lustre]
       [<ffffffffa07f34f8>] ll_readpage+0x48/0x1b0 [lustre]
       [<ffffffff81106418>] __do_page_cache_readahead+0x1e8/0x260
       [<ffffffff81106538>] force_page_cache_readahead+0x78/0xa0
       [<ffffffff810ff30d>] sys_fadvise64_64+0xdd/0x230
       [<ffffffff810ff46e>] sys_fadvise64+0xe/0x10
       [<ffffffff8145376b>] system_call_fastpath+0x16/0x1b
       [<00002aaaaaac11bd>] 0x2aaaaaac11bd
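
      The trace enters Lustre through sys_fadvise64 and force_page_cache_readahead, which is the path the Linux kernel takes for POSIX_FADV_WILLNEED. As a rough illustration only, the sketch below shows a user-space call that issues the same system call. It assumes Python on Linux; the helper name advise_willneed is hypothetical, and this is not the actual ksh_so_hack.bin workload.

```python
import os


def advise_willneed(path, offset=0, length=0):
    """Hint that [offset, offset + length) of `path` will be read soon.

    A length of 0 means "from offset to the end of the file".
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # Maps to the fadvise64 system call; raises OSError if the kernel
        # rejects the hint, and returns None on success.
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
```

      Calling advise_willneed() on a file in the affected mount would be one way to try to exercise the readahead path shown in the trace, though whether it reproduces the error depends on the client build.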
      

      We also see these messages from an hour before those above (in case they are related):

      LustreError: 4943:0:(mdc_request.c:1580:mdc_read_page()) husk1-MDT0000-mdc-ffff88044bc53800: read cache page: [0x280000f14:0x4:0x0] at 4753935872275117037: rc -5
      
      LustreError: 5003:0:(mdc_request.c:1580:mdc_read_page()) husk1-MDT0000-mdc-ffff88044bc53800: read cache page: [0x280000f14:0x7:0x0] at 4753935872275117037: rc -5
      
      LustreError: 5984:0:(mdc_request.c:1580:mdc_read_page()) husk1-MDT0000-mdc-ffff88044bc53800: read cache page: [0x280000f14:0x17fe9:0x0] at 6497832999440693922: rc -5
      

      Attaching log file. Dump is available if you want it.

          Activity

            [LU-4717] (rw.c:128:ll_cl_init()) husk1: [0x280000f70:0x11c59:0x0] no active IO, please file a ticket.

            jlevi Jodi Levi (Inactive) added a comment - Patch landed to Master. Please reopen the ticket if more work is needed.
            mmansk Mark Mansk added a comment -

            We've hit this error again, repeatedly, this time running sanity test 54c against LU-3321 built into our 2.6 branch. Patch http://review.whamcloud.com/9658 was not yet in our 2.6 build.

            We had not seen this error until recently. Are there recent changes that make it more likely to surface?
            Test 54c fails when attempting to mount the loop device it created, with the following in dmesg:
            Buffer I/O error on device loop3, logical block 0
            lost page write due to I/O error on loop3

            In the log file:
            mount: wrong fs type, bad option, bad superblock on /tmp/dal/loop54c,
            missing codepage or helper program, or other error
            In some cases useful info is found in syslog - try
            dmesg | tail or so

            spitzcor Cory Spitz added a comment -

            That sounds ok to me. I guess we'll have to wait for more use and exposure to see what the application writers will want. We can track those needs in new tickets.


            jay Jinshan Xiong (Inactive) added a comment -

            The major problem with fadvise() is that it doesn't have a callback into the file system, so Lustre can only provide limited support.

            However, Lustre can easily support POSIX_FADV_WILLNEED, and I believe this is the most frequently used fadvise() option. We can just check whether a lock already exists on the client side and, if so, read ahead the pages requested by fadvise(). How does this sound?
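
            The proposal above (read ahead only where the client already holds a lock) can be illustrated with a toy model. This is a sketch under stated assumptions, not Lustre code: the function name pages_to_prefetch, the representation of locks as half-open byte ranges, and the 4096-byte page size are all hypothetical stand-ins for a real lock lookup.

```python
from typing import List, Tuple

PAGE_SIZE = 4096  # assumed page size, for illustration only


def pages_to_prefetch(lock_extents: List[Tuple[int, int]],
                      offset: int, length: int,
                      page_size: int = PAGE_SIZE) -> List[int]:
    """Return indices of requested pages fully covered by a held lock.

    `lock_extents` holds half-open byte ranges [start, end) that the
    client is assumed to have locked already. Pages outside every
    extent are skipped rather than triggering a new lock request.
    """
    if length <= 0:
        return []
    first = offset // page_size
    last = (offset + length - 1) // page_size
    covered = []
    for index in range(first, last + 1):
        start, end = index * page_size, (index + 1) * page_size
        if any(lo <= start and end <= hi for lo, hi in lock_extents):
            covered.append(index)
    return covered
```

            With a lock on bytes [4096, 12288) and a WILLNEED request for the first 16384 bytes, only pages 1 and 2 would be read ahead in this model; the rest of the request would be ignored.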
            spitzcor Cory Spitz added a comment -

            One way or another, this bug should get cleaned up. If fadvise() won't be supported in CLIO, then we should update the Ops Manual with a discussion of that in the API section. But wouldn't it be better in the long run to actually use the input from fadvise() to make good decisions about what Linux should do with the page cache, even if CLIO can't (currently) make better use of the advice?


            jay Jinshan Xiong (Inactive) added a comment -

            Is it really necessary for Cray to have fadvise() support? I would like to return an error value in this case so that fadvise() is effectively disabled.
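
            If fadvise() were to return an error as suggested above, applications could still treat the hint as optional. The sketch below shows one way a caller might do that, assuming Python on Linux; the helper name try_willneed is hypothetical, and which errno a Lustre client would return is not stated in this ticket.

```python
import os


def try_willneed(fd: int, offset: int = 0, length: int = 0) -> bool:
    """Issue POSIX_FADV_WILLNEED; report failure instead of raising."""
    try:
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
    except OSError:
        # The hint is advisory, so a rejected call does not affect the
        # correctness of later reads; the caller can simply carry on.
        return False
    return True
```

            A caller that ignores a False result simply loses the prefetch and falls back to ordinary reads.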
            pjones Peter Jones added a comment -

            Thanks Mark!

            mmansk Mark Mansk added a comment -

            Finished testing this afternoon.

            It looks as though this patch fixed the issue: after running IOSTRESS for 5 hours, I haven't seen it. Previously it occurred within an hour of running.


            amk Ann Koehler (Inactive) added a comment - Testing is still in progress.

            jlevi Jodi Levi (Inactive) added a comment - Is the testing of Change 9658 still in progress, or have you gotten results yet?

            People

              bobijam Zhenyu Xu
              amk Ann Koehler (Inactive)
              Votes: 0
              Watchers: 9
