[LU-5263] ll_read_ahead_pages() ASSERTION( page_idx > ria->ria_stoff )

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.7.0, Lustre 2.5.3
    • Lustre 2.1.6
    • None
    • RHEL 6 w/ kernel 2.6.32_220.23.1
    • 3
    • 14689

    Description

      One Lustre client frequently crashes with an LBUG on ASSERTION( page_idx > ria->ria_stoff ) (13 crashes in the past 3 months).

      This Lustre client acts as an NFS server and exports Lustre to a web server through nfs-ganesha.

      ----8< ----
      [24937.600920] Lustre: DEBUG MARKER: Thu Jun 5 20:00:01 2014
      [24937.600921]
      [24950.667750] LustreError: 4126:0:(rw.c:698:ll_read_ahead_pages()) ASSERTION( page_idx > ria->ria_stoff ) failed: Invalid page_idx 234497rs 234497 re 300287 ro 234751 rl 256 rp 1
      [24950.683642] LustreError: 4126:0:(rw.c:698:ll_read_ahead_pages()) LBUG
      [24950.690154] Pid: 4126, comm: ganesha.nfsd
      [24950.695572]
      [24950.695573] Call Trace:
      [24950.702337] [<ffffffffa05697f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [24950.710703] [<ffffffffa0569e07>] lbug_with_loc+0x47/0xb0 [libcfs]
      [24950.718317] [<ffffffffa0bb511f>] ll_readahead+0x10cf/0x1100 [lustre]
      [24950.726131] [<ffffffffa0bdc805>] vvp_io_read_page+0x305/0x360 [lustre]
      [24950.734159] [<ffffffffa068eb4d>] cl_io_read_page+0x8d/0x170 [obdclass]
      [24950.742108] [<ffffffffa0682c19>] ? cl_page_assume+0xf9/0x2d0 [obdclass]
      [24950.750166] [<ffffffffa0bb5746>] ll_readpage+0x96/0x200 [lustre]
      [24950.757661] [<ffffffff810ff88c>] generic_file_aio_read+0x1fc/0x700
      [24950.765360] [<ffffffff810816ff>] ? up+0x2f/0x50
      [24950.771433] [<ffffffffa0bdce1b>] vvp_io_read_start+0x13b/0x3e0 [lustre]
      [24950.779575] [<ffffffffa068cb4a>] cl_io_start+0x6a/0x140 [obdclass]
      [24950.787244] [<ffffffffa0690e2c>] cl_io_loop+0xcc/0x190 [obdclass]
      [24950.794848] [<ffffffffa0b8d097>] ll_file_io_generic+0x3a7/0x560 [lustre]
      [24950.803059] [<ffffffffa0b8d389>] ll_file_aio_read+0x139/0x2c0 [lustre]
      [24950.811092] [<ffffffffa0b8d849>] ll_file_read+0x169/0x2a0 [lustre]
      [24950.818784] [<ffffffff81164525>] vfs_read+0xb5/0x1a0
      [24950.825275] [<ffffffff81164852>] sys_pread64+0x82/0xa0
      [24950.831917] [<ffffffff810030f2>] system_call_fastpath+0x16/0x1b
      [24950.839353]
      [24950.842659] Kernel panic - not syncing: LBUG
      [24950.847986] Pid: 4126, comm: ganesha.nfsd Not tainted 2.6.32-220.23.1.bl6.Bull.28.10.x86_64 #1
      [24950.858073] Call Trace:
      [24950.861907] [<ffffffff814851a0>] ? panic+0x78/0x143
      [24950.868308] [<ffffffffa0569e5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      [24950.876096] [<ffffffffa0bb511f>] ? ll_readahead+0x10cf/0x1100 [lustre]
      [24950.884141] [<ffffffffa0bdc805>] ? vvp_io_read_page+0x305/0x360 [lustre]
      [24950.892371] [<ffffffffa068eb4d>] ? cl_io_read_page+0x8d/0x170 [obdclass]
      [24950.900562] [<ffffffffa0682c19>] ? cl_page_assume+0xf9/0x2d0 [obdclass]
      [24950.908676] [<ffffffffa0bb5746>] ? ll_readpage+0x96/0x200 [lustre]
      [24950.916357] [<ffffffff810ff88c>] ? generic_file_aio_read+0x1fc/0x700
      [24950.924221] [<ffffffff810816ff>] ? up+0x2f/0x50
      [24950.930283] [<ffffffffa0bdce1b>] ? vvp_io_read_start+0x13b/0x3e0 [lustre]
      [24950.938586] [<ffffffffa068cb4a>] ? cl_io_start+0x6a/0x140 [obdclass]
      [24950.946448] [<ffffffffa0690e2c>] ? cl_io_loop+0xcc/0x190 [obdclass]
      [24950.954218] [<ffffffffa0b8d097>] ? ll_file_io_generic+0x3a7/0x560 [lustre]
      [24950.962601] [<ffffffffa0b8d389>] ? ll_file_aio_read+0x139/0x2c0 [lustre]
      [24950.970813] [<ffffffffa0b8d849>] ? ll_file_read+0x169/0x2a0 [lustre]
      [24950.978649] [<ffffffff81164525>] ? vfs_read+0xb5/0x1a0
      [24950.985288] [<ffffffff81164852>] ? sys_pread64+0x82/0xa0
      [24950.992092] [<ffffffff810030f2>] ? system_call_fastpath+0x16/0x1b
      ----8< ----
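
      The numbers in the assertion message can be pulled out and compared directly. The sketch below is only an illustration: the regular expression and the helper name `parse_assertion` are hypothetical, and reading `ro` as `ria->ria_stoff` is an assumption based on the field labels, not something stated in this ticket.

```python
import re

# Assertion line copied from the console log above.
LINE = (
    "LustreError: 4126:0:(rw.c:698:ll_read_ahead_pages()) "
    "ASSERTION( page_idx > ria->ria_stoff ) failed: "
    "Invalid page_idx 234497rs 234497 re 300287 ro 234751 rl 256 rp 1"
)

# Hypothetical pattern: the log prints no space between page_idx and "rs".
PATTERN = re.compile(
    r"Invalid page_idx (\d+)\s*rs (\d+) re (\d+) ro (\d+) rl (\d+) rp (\d+)"
)


def parse_assertion(line):
    """Return the numeric fields of the assertion message, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    keys = ("page_idx", "rs", "re", "ro", "rl", "rp")
    return dict(zip(keys, (int(v) for v in match.groups())))


fields = parse_assertion(LINE)

# Assumption: "ro" is ria->ria_stoff, the right-hand side of the assertion.
holds = fields["page_idx"] > fields["ro"]
print(fields)
print("page_idx > ro:", holds)
```

      Under that reading, the requested page index lies below the stride offset, which is exactly the condition the assertion rejects.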

      We asked the customer to enable read-ahead tracing in the debug log (lctl set_param debug=+reada).

      The debug log is available in the attached support bundle (from crash 2014-06-05-20:01:29).

      Read-ahead settings:
      ----8< ----
      # lctl get_param llite.*.max_read_ahead*
      llite.store1-ffff88120ee93c00.max_read_ahead_mb=40
      llite.store1-ffff88120ee93c00.max_read_ahead_per_file_mb=40
      llite.store1-ffff88120ee93c00.max_read_ahead_whole_mb=2
      ----8< ----
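
      Since the assertion message reports its values in pages, it can help to express the read-ahead limits in the same unit. This is a minimal sketch under two assumptions that the ticket does not confirm: a 4 KiB page size (typical for x86_64) and `_mb` meaning MiB. The helper name `settings_in_pages` is hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as is typical on x86_64

# Parameter values copied from the lctl get_param output above.
SETTINGS = """\
llite.store1-ffff88120ee93c00.max_read_ahead_mb=40
llite.store1-ffff88120ee93c00.max_read_ahead_per_file_mb=40
llite.store1-ffff88120ee93c00.max_read_ahead_whole_mb=2
"""


def settings_in_pages(text, page_size=PAGE_SIZE):
    """Map each short parameter name to its limit expressed in pages."""
    pages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        short_name = name.rsplit(".", 1)[-1]  # drop the "llite.<fsname>-<id>." prefix
        # Assumption: the value is in MiB.
        pages[short_name] = int(value) * 1024 * 1024 // page_size
    return pages


limits = settings_in_pages(SETTINGS)
for name, count in limits.items():
    print(f"{name}: {count} pages")
```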

      We asked the customer to check the web server logs to see which files were being accessed at the time of each crash. It is never the same file.

      It looks like LU-4192.

          Activity

            yujian Jian Yu added a comment -

            Here is the back-ported patch for Lustre b2_5 branch: http://review.whamcloud.com/11455

            pjones Peter Jones added a comment -

            Landed for 2.7. Will track landing to maintenance releases separately

            bobijam Zhenyu Xu added a comment -

            patch for master is tracked at http://review.whamcloud.com/#/c/11181/

            pjones Peter Jones added a comment -

            Bobi

            It's great to hear that this patch has held up so well in testing. Our usual practice before Bull deploy things into production is to ensure that we have at least two reviews and Oleg has signed off on it. We would only land it to the b2_1 branch if/when we create a 2.1.7 release.

            Regards

            Peter

            bobijam Zhenyu Xu added a comment -

            I think yes, and I'll try to make it land to b2_1 branch as well.


            bruno.travouillon Bruno Travouillon (Inactive) added a comment -

            We have had no issues with the Lustre fs for the last two weeks.

            bobijam, can we safely add this patch to our 2.1.6 branch?

            bruno.travouillon Bruno Travouillon (Inactive) added a comment -

            Hi bobijam,

            The patch has been in testing since this morning. We should be able to give you feedback within the next two weeks.

            Thanks,

            Bruno
            bobijam Zhenyu Xu added a comment -

            would you please try this patch http://review.whamcloud.com/10914 ?

            pjones Peter Jones added a comment -

            Bobijam

            Could you please advise on this issue?

            Thanks

            Peter


            People

              bobijam Zhenyu Xu
              bruno.travouillon Bruno Travouillon (Inactive)