Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-13018

NULL pointer dereference in osd_attr_get while consuming changelogs and reading changelog_size

Details

    • 9223372036854775807

    Description

      We monitor changelog_size on our MDTs, fetching it every 5 minutes under normal operation.  We also consume changelogs constantly via Starfish.  We occasionally see the following BUG:

      Lustre: 29422:0:(mdd_device.c:1577:mdd_changelog_clear()) Skipped 22 previous similar messages
      BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
      IP: [<ffffffffc14d20a2>] osd_attr_get+0x62/0x340 [osd_zfs]
      PGD 0 
      Oops: 0000 [#1] SMP 
      CPU: 4 PID: 21384 Comm: snmpd Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-1062.1.1.1chaos.ch6.x86_64 #1
      Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
      task: ffff9b1a2ab2b150 ti: ffff9b170ece0000 task.ti: ffff9b170ece0000
      RIP: 0010:[<ffffffffc14d20a2>]  [<ffffffffc14d20a2>] osd_attr_get+0x62/0x340 [osd_zfs]
      Call Trace:
       [<ffffffffc12f73b5>] llog_size+0x35/0xe0 [obdclass]
       [<ffffffffc13006fe>] ? llog_cat_id2handle+0x30e/0x5b0 [obdclass]
       [<ffffffffb9232fb8>] ? kmem_cache_alloc+0x48/0x240
       [<ffffffffc1300b9c>] llog_cat_size_cb+0x1fc/0x3d0 [obdclass]
       [<ffffffffc12fb9db>] llog_process_thread+0x87b/0x1470 [obdclass]
       [<ffffffffb9232c89>] ? ___slab_alloc+0x209/0x4f0
       [<ffffffffc13009a0>] ? llog_cat_id2handle+0x5b0/0x5b0 [obdclass]
       [<ffffffffc12fc68c>] llog_process_or_fork+0xbc/0x450 [obdclass]
       [<ffffffffc13009a0>] ? llog_cat_id2handle+0x5b0/0x5b0 [obdclass]
       [<ffffffffc1300f09>] llog_cat_process_or_fork+0x199/0x2a0 [obdclass]
       [<ffffffffc1301098>] llog_cat_size+0x58/0x80 [obdclass]
       [<ffffffffc19d7a32>] mdd_changelog_size_ctxt+0x92/0x320 [mdd]
       [<ffffffffc19d7d35>] mdd_changelog_size_seq_show+0x75/0xe0 [mdd]
       [<ffffffffb9284d18>] seq_read+0x138/0x460
       [<ffffffffb92d94b0>] proc_reg_read+0x40/0x80
       [<ffffffffb925a71c>] vfs_read+0xbc/0x1c0
       [<ffffffffb925b68f>] SyS_read+0x7f/0xf0
       [<ffffffffb97ba11e>] system_call_fastpath+0x25/0x2a 

      Attachments

        Issue Links

          Activity

            [LU-13018] NULL pointer dereference in osd_attr_get while consuming changelogs and reading changelog_size

            Thanks, Mikhail.  You can close this issue as dupe.

            ofaaland Olaf Faaland added a comment - Thanks, Mikhail.  You can close this issue as dupe.

            Patch was added under LU-10198, here is the link: https://review.whamcloud.com/#/c/37367/

            tappro Mikhail Pershin added a comment - Patch was added under LU-10198 , here is the link: https://review.whamcloud.com/#/c/37367/

            this issue has the same reason as LU-10198, I will prepate patch soon

            tappro Mikhail Pershin added a comment - this issue has the same reason as LU-10198 , I will prepate patch soon
            ofaaland Olaf Faaland added a comment -

            For my reference: My local issue is TOSS4656

            ofaaland Olaf Faaland added a comment - For my reference: My local issue is TOSS4656

            Hi Mike,

            I know of the concept but don't know the different kernel config options to search for.  Here's one - let me know what others to look for.

            [root@zinc2:~]# grep SLUB_DEBUG /boot/config-3.10.0-1062.1.1.1chaos.ch6.x86_64
            CONFIG_SLUB_DEBUG=y
            # CONFIG_SLUB_DEBUG_ON is not set

            I confirmed that all of /sys/kernel/slab/*/poison are "0", which I believe means no poisoning is being done for that memory.  If I'm wrong, let me know.

            The kernel is a patched version of RHEL 7.7 kernel, 3.10.0-1062.1.1.1chaos.ch6.x86_64.

             

            ofaaland Olaf Faaland added a comment - Hi Mike, I know of the concept but don't know the different kernel config options to search for.  Here's one - let me know what others to look for. [root@zinc2:~]# grep SLUB_DEBUG /boot/config-3.10.0-1062.1.1.1chaos.ch6.x86_64 CONFIG_SLUB_DEBUG=y # CONFIG_SLUB_DEBUG_ON is not set I confirmed that all of /sys/kernel/slab/*/poison are "0", which I believe means no poisoning is being done for that memory.  If I'm wrong, let me know. The kernel is a patched version of RHEL 7.7 kernel, 3.10.0-1062.1.1.1chaos.ch6.x86_64.  

            Olaf, speaking of possible use-after-free for lu_object, there is NULL pointer and I wonder why it is not poisoned, can you check kernel config options for SLAB_DEBUG or any other poisoning things? Are they turned off? I mean that if allocation debug is being used then this case is unlikely use-after-free

            tappro Mikhail Pershin added a comment - Olaf, speaking of possible use-after-free for lu_object , there is NULL pointer and I wonder why it is not poisoned, can you check kernel config options for SLAB_DEBUG or any other poisoning things? Are they turned off? I mean that if allocation debug is being used then this case is unlikely use-after-free
            ofaaland Olaf Faaland added a comment -

            Hi Mike,

            Thanks for looking.  I haven't been able to reproduce it so far.  It has not been frequent in production, but has happened multiple times.

             

            ofaaland Olaf Faaland added a comment - Hi Mike, Thanks for looking.  I haven't been able to reproduce it so far.  It has not been frequent in production, but has happened multiple times.  
            tappro Mikhail Pershin added a comment - - edited

            As I understand this offset 0x1c is for loh_attr field in lo_header while checking dt_object_exists(). That means we have a race here between llog deletion and polling its size. I will check related code.

            Olaf, can this be reproduced easily or this is quite rare situation?

            tappro Mikhail Pershin added a comment - - edited As I understand this offset 0x1c is for loh_attr field in lo_header while checking dt_object_exists() . That means we have a race here between llog deletion and polling its size. I will check related code. Olaf, can this be reproduced easily or this is quite rare situation?

            People

              tappro Mikhail Pershin
              ofaaland Olaf Faaland
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: