Details

    • Bug
    • Resolution: Won't Fix
    • Minor
    • None
    • Lustre 1.8.x (1.8.0 - 1.8.5)
    • None
    • 3
    • 24,376
    • 9764

    Description

      Kernel OOPS on a client

      <1>[1740431.005226] BUG: unable to handle kernel NULL pointer dereference at (null)
      <1>[1740431.008694] IP: [<ffffffffa0a9dfda>] ll_readahead+0x4a/0x1d20 [lustre]
      <4>[1740431.008694] PGD 2cffabf5067 PUD 2cffabf4067 PMD 0
      <0>[1740431.008694] Oops: 0000 1 SMP
      <0>[1740431.008694] last sysfs file: /sys/devices/system/node/node63/meminfo
      [archkdbcommon]kdb> -bt

      Stack traceback for pid 839153
      0xffff8acffbd72600 000000 000000 0 000 R 0xffff8acffbd72c90 *magic.exe
      [<ffffffffa0a9dfda>] ll_readahead+0x4a/0x1d20 [lustre]
      [<ffffffffa0aa0ebb>] ll_readpage+0x120b/0x1760 [lustre]
      [<ffffffff810b5cf1>] filemap_fault+0x241/0x3d0
      [<ffffffffa0aaebd7>] ll_fault+0x3b7/0xdd0 [lustre]
      [<ffffffff810ceae7>] __do_fault+0x57/0x520
      [<ffffffff810d3459>] handle_mm_fault+0x199/0x430
      [<ffffffff8139dcaf>] do_page_fault+0x1bf/0x3e0
      [<ffffffff8139b5cf>] page_fault+0x1f/0x30
      [<0000000000472d64>] 0x472d64

      Attachments

        Issue Links

          Activity

            [LU-1878] NULL pointer in ll_readahead()

            Close old ticket.

            adilger Andreas Dilger added a comment - Close old ticket.

            http://review.whamcloud.com/4026

            this is updated version accordingly to Jinshan Xiong's inspection.

            vs Vladimir V. Saveliev added a comment - http://review.whamcloud.com/4026 this is updated version accordingly to Jinshan Xiong's inspection.
            pjones Peter Jones added a comment -

            Excellent - thanks Vladimir! Don't worry about the inspectors - we'll organize that.

            Bob

            Could you please take care of landing this patch and check in particular its applicability to the master branch

            Thanks

            Peter

            pjones Peter Jones added a comment - Excellent - thanks Vladimir! Don't worry about the inspectors - we'll organize that. Bob Could you please take care of landing this patch and check in particular its applicability to the master branch Thanks Peter

            the patch is in gerrit (http://review.whamcloud.com/3927), however, I
            am not sure about inspectors, please advise.

            vs Vladimir V. Saveliev added a comment - the patch is in gerrit ( http://review.whamcloud.com/3927 ), however, I am not sure about inspectors, please advise.
            vs Vladimir V. Saveliev added a comment - http://review.whamcloud.com/3927
            pjones Peter Jones added a comment -

            Great! It sounds like you are on the right track. Once the patch is in gerrit then you should post the URL to the changeset as a comment in the JIRA ticket. Thanks!

            pjones Peter Jones added a comment - Great! It sounds like you are on the right track. Once the patch is in gerrit then you should post the URL to the changeset as a comment in the JIRA ticket. Thanks!

            Hello, Peter

            That is what I was planning to do next. From http://wiki.whamcloud.com/display/PUB/Submitting+Changes
            I realised that JIRA ticket needs to be created first. So, I created
            LU-1878.

            I hope that my patch matches
            2. Test and commit your patch locally using acceptance-small.sh and
            3. The patch follows the Requirements for patch submission:
            and I was going to do 4. The patch has been uploaded to Gerrit. On
            that step I guess that I am to git push the patch somewhere.

            If I went wrong way or there is simple way, please get someone to help
            me.

            TIA
            Best regards,
            Vladimir

            vs Vladimir V. Saveliev added a comment - Hello, Peter That is what I was planning to do next. From http://wiki.whamcloud.com/display/PUB/Submitting+Changes I realised that JIRA ticket needs to be created first. So, I created LU-1878 . I hope that my patch matches 2. Test and commit your patch locally using acceptance-small.sh and 3. The patch follows the Requirements for patch submission: and I was going to do 4. The patch has been uploaded to Gerrit. On that step I guess that I am to git push the patch somewhere. If I went wrong way or there is simple way, please get someone to help me. TIA Best regards, Vladimir
            pjones Peter Jones added a comment -

            Thanks for the patch Vladimir. Could you please upload it into gerrit so that we can review and test it? There are instructions on the whamcloud wiki or I can get an engineer to step you through the process if that helps.

            pjones Peter Jones added a comment - Thanks for the patch Vladimir. Could you please upload it into gerrit so that we can review and test it? There are instructions on the whamcloud wiki or I can get an engineer to step you through the process if that helps.

            llap_shrink_cache_internal(): avoid shrinking of pages which are in use

            vs Vladimir V. Saveliev added a comment - llap_shrink_cache_internal(): avoid shrinking of pages which are in use

            The BUG happened because ll_readhead() was called from ll_readpage() with
            mapping == NULL, so the line

            inode = mapping->host
            caused the oops.

            At the begining of ll_readpage() page->mapping was not NULL, because the line
            struct inode *inode = page->mapping->host;
            worked fine. Therefore, page->mapping changed to NULL while running ll_readpage().

            At first sight it looks like the page kept locked all the time at ll_readpage().
            However, the following race seems to be possible:

            Process 1:
            ll_readpage() called via handle_mm_fault()>__do_fault()>ll_fault()->filemap_fault():
            starts with locked page P and calls ll_issue_page_read() which queues the page for i/o.
            Once the page is queued it can be involved into group i/o triggered by another thread.

            Process 2:
            obd_trigger_group_io()
            triggers i/o for group of queued pages, including the page P. The P gets unlocked and uptodate as result of that.

            Process 3:
            shrink_slab()>llap_shrink_cache()>llap_shrink_cache_internal()
            kswapd shrinks slabs and calls ll_shrink_cache() (registered lustre's slab shrinker, which tries to shrink lustre pages.
            llap_shrink_cache_internal() locks the page P and calls ll_truncate_complete_page() which removes the page P from page cache (page->mapping becomes NULL).
            llap_shrink_cache_internal() avoids to ll_truncate_complete_page() dirty pages, but it does not care of busy pages: the ones with increase reference count.

            Process 1 continues with page->mapping == NULL and calls ll_readahead() passing NULL as 3rd argument:
            ll_readpage()
            ...
            /* We have just requested the actual page we want, see if we can tack

            • on some readahead to that page's RPC before it is sent. */
              if (ll_i2sbi(inode)->ll_ra_info.ra_max_pages_per_file)
              ll_readahead(&fd->fd_ras, exp, page->mapping, oig,
              fd->fd_flags);

            ll_readahead()
            {
            ...
            inode = mapping->host; <<<== NULL pointer dereference happens here, as long as mapping is NULL
            ...

            Dump of stacktraces provided by HLRN (https://bugzilla.lustre.org/attachment.cgi?id=33148) contains number of processes:

            • handling pagefaults of lustre pages looking like:
              0xffff8acffbd72600 839153 838417 1 343 R 0xffff8acffbd72c90 *magic.exe
              [<ffffffffa0a9dfda>] ll_readahead+0x4a/0x1d20 [lustre]
              [<ffffffffa0aa0ebb>] ll_readpage+0x120b/0x1760 [lustre]
              [<ffffffff810b5cf1>] filemap_fault+0x241/0x3d0
              [<ffffffffa0aaebd7>] ll_fault+0x3b7/0xdd0 [lustre]
              [<ffffffff810ceae7>] __do_fault+0x57/0x520
              [<ffffffff810d3459>] handle_mm_fault+0x199/0x430
              [<ffffffff8139dcaf>] do_page_fault+0x1bf/0x3e0
              [<ffffffff8139b5cf>] page_fault+0x1f/0x30
            • processes which are entered lustre shrink code trying to allocate memory and kswapd's shrinking slabs with typical trace like:
              0xffff881072408680 7239 2 0 965 D 0xffff881072408d10 kswapd56
              [<ffffffff81399094>] thread_return+0x0/0x34c
              [<ffffffff810cf589>] mapping_sleep+0x9/0x10
              [<ffffffff8139986a>] __wait_on_bit_lock+0x4a/0xa0
              [<ffffffff81399939>] out_of_line_wait_on_bit_lock+0x79/0xa0
              [<ffffffff810d2309>] unmap_mapping_range+0x149/0x310
              [<ffffffffa0aae6c7>] ll_teardown_mmaps+0x67/0x1c0 [lustre]
              [<ffffffffa0aa2072>] llap_shrink_cache+0x5a2/0xa40 [lustre]
              [<ffffffffa0a83763>] ll_shrink_cache+0x33/0x70 [lustre]
              [<ffffffff810c3f8c>] shrink_slab+0x12c/0x190
              [<ffffffff810c49d8>] balance_pgdat+0x4c8/0x7b0
              [<ffffffff810c4de9>] kswapd+0x129/0x180
              [<ffffffff81064366>] kthread+0x96/0xa0

            which shows that the race described above is possible.
            The proposed fix is to avoid to ll_truncate_complete_page() pages having extra reference counters.

            vs Vladimir V. Saveliev added a comment - The BUG happened because ll_readhead() was called from ll_readpage() with mapping == NULL, so the line inode = mapping->host caused the oops. At the begining of ll_readpage() page->mapping was not NULL, because the line struct inode *inode = page->mapping->host; worked fine. Therefore, page->mapping changed to NULL while running ll_readpage(). At first sight it looks like the page kept locked all the time at ll_readpage(). However, the following race seems to be possible: Process 1: ll_readpage() called via handle_mm_fault() >__do_fault() >ll_fault()->filemap_fault(): starts with locked page P and calls ll_issue_page_read() which queues the page for i/o. Once the page is queued it can be involved into group i/o triggered by another thread. Process 2: obd_trigger_group_io() triggers i/o for group of queued pages, including the page P. The P gets unlocked and uptodate as result of that. Process 3: shrink_slab() >llap_shrink_cache() >llap_shrink_cache_internal() kswapd shrinks slabs and calls ll_shrink_cache() (registered lustre's slab shrinker, which tries to shrink lustre pages. llap_shrink_cache_internal() locks the page P and calls ll_truncate_complete_page() which removes the page P from page cache (page->mapping becomes NULL). llap_shrink_cache_internal() avoids to ll_truncate_complete_page() dirty pages, but it does not care of busy pages: the ones with increase reference count. Process 1 continues with page->mapping == NULL and calls ll_readahead() passing NULL as 3rd argument: ll_readpage() ... /* We have just requested the actual page we want, see if we can tack on some readahead to that page's RPC before it is sent. */ if (ll_i2sbi(inode)->ll_ra_info.ra_max_pages_per_file) ll_readahead(&fd->fd_ras, exp, page->mapping, oig, fd->fd_flags); ll_readahead() { ... inode = mapping->host; <<<== NULL pointer dereference happens here, as long as mapping is NULL ... Dump of stacktraces provided by HLRN ( https://bugzilla.lustre.org/attachment.cgi?id=33148 ) contains number of processes: handling pagefaults of lustre pages looking like: 0xffff8acffbd72600 839153 838417 1 343 R 0xffff8acffbd72c90 *magic.exe [<ffffffffa0a9dfda>] ll_readahead+0x4a/0x1d20 [lustre] [<ffffffffa0aa0ebb>] ll_readpage+0x120b/0x1760 [lustre] [<ffffffff810b5cf1>] filemap_fault+0x241/0x3d0 [<ffffffffa0aaebd7>] ll_fault+0x3b7/0xdd0 [lustre] [<ffffffff810ceae7>] __do_fault+0x57/0x520 [<ffffffff810d3459>] handle_mm_fault+0x199/0x430 [<ffffffff8139dcaf>] do_page_fault+0x1bf/0x3e0 [<ffffffff8139b5cf>] page_fault+0x1f/0x30 processes which are entered lustre shrink code trying to allocate memory and kswapd's shrinking slabs with typical trace like: 0xffff881072408680 7239 2 0 965 D 0xffff881072408d10 kswapd56 [<ffffffff81399094>] thread_return+0x0/0x34c [<ffffffff810cf589>] mapping_sleep+0x9/0x10 [<ffffffff8139986a>] __wait_on_bit_lock+0x4a/0xa0 [<ffffffff81399939>] out_of_line_wait_on_bit_lock+0x79/0xa0 [<ffffffff810d2309>] unmap_mapping_range+0x149/0x310 [<ffffffffa0aae6c7>] ll_teardown_mmaps+0x67/0x1c0 [lustre] [<ffffffffa0aa2072>] llap_shrink_cache+0x5a2/0xa40 [lustre] [<ffffffffa0a83763>] ll_shrink_cache+0x33/0x70 [lustre] [<ffffffff810c3f8c>] shrink_slab+0x12c/0x190 [<ffffffff810c49d8>] balance_pgdat+0x4c8/0x7b0 [<ffffffff810c4de9>] kswapd+0x129/0x180 [<ffffffff81064366>] kthread+0x96/0xa0 which shows that the race described above is possible. The proposed fix is to avoid to ll_truncate_complete_page() pages having extra reference counters.

            People

              bogl Bob Glossman (Inactive)
              vs Vladimir V. Saveliev
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: