[LU-16013] SLES15 SP4 client BUG: kernel NULL pointer dereference

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • Lustre 2.15.1
    • None
    • SLES15 SP4 client
    • 3

    Description

      While testing the SLES15 SP4 client support patch https://review.whamcloud.com/47924 with kernel 5.14.21-150400.22.1 on the Lustre b2_15 branch, sanity test 0d hung.

      Console log on client:

      BUG: kernel NULL pointer dereference, address: 0000000000000000
      #PF: supervisor instruction fetch in kernel mode
      #PF: error_code(0x0010) - not-present page
      PGD 0 P4D 0
      Oops: 0010 [#1] PREEMPT SMP PTI
      CPU: 0 PID: 21801 Comm: tee Kdump: loaded Tainted: G           OE     N 5.14.21-150400.22-default #1 SLE15-SP4 0b6a6578ade2de5c4a0b916095dff44f76ef1704
      Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      RIP: 0010:0x0
      Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
      RSP: 0018:ffffa052c33b3938 EFLAGS: 00010002
      RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
      RDX: 0000000000000001 RSI: ffff8dd956c55298 RDI: ffffe011c04cc500
      RBP: ffff8dd954416f90 R08: 000000000000041d R09: 0000000000000bf3
      R10: ffffa052c33b3940 R11: 0000000000000000 R12: ffff8dd956c55298
      R13: 0000000000000000 R14: ffffe011c04cc500 R15: 0000000000000000
      FS:  00007f1eaeb68740(0000) GS:ffff8dd9ffc00000(0000) knlGS:0000000000000000 
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffffffffffffffd6 CR3: 0000000021102001 CR4: 00000000001706f0
      Call Trace:
       <TASK>
       vvp_set_pagevec_dirty+0x171/0x3e0 [lustre 0967379a0b23e963dac3d44d0227623bfa058caa]
       write_commit_callback+0x5f/0x1a0 [lustre 0967379a0b23e963dac3d44d0227623bfa058caa]
       osc_io_commit_async+0x226/0x530 [osc 0cd30f43a98bab30cdcc8c80790581cd345e8072]
       ? vvp_set_pagevec_dirty+0x3e0/0x3e0 [lustre 0967379a0b23e963dac3d44d0227623bfa058caa]
       ? vvp_set_pagevec_dirty+0x3e0/0x3e0 [lustre 0967379a0b23e963dac3d44d0227623bfa058caa]
       cl_io_commit_async+0x8b/0x160 [obdclass 627f410ec5b64ecc7835c12c3881f5ffa2886afa]
       lov_io_commit_async+0x101/0x5a0 [lov 85fdc8bde1ce6ed86b2cd3053a19a75843ff306a]
       ? vvp_set_pagevec_dirty+0x3e0/0x3e0 [lustre 0967379a0b23e963dac3d44d0227623bfa058caa]
       ? vvp_set_pagevec_dirty+0x3e0/0x3e0 [lustre 0967379a0b23e963dac3d44d0227623bfa058caa]
       cl_io_commit_async+0x8b/0x160 [obdclass 627f410ec5b64ecc7835c12c3881f5ffa2886afa]
       vvp_io_write_commit+0x151/0x5f0 [lustre 0967379a0b23e963dac3d44d0227623bfa058caa]
       vvp_io_write_start+0x8c4/0xc60 [lustre 0967379a0b23e963dac3d44d0227623bfa058caa]
       cl_io_start+0x6c/0x130 [obdclass 627f410ec5b64ecc7835c12c3881f5ffa2886afa]
       cl_io_loop+0x9a/0x200 [obdclass 627f410ec5b64ecc7835c12c3881f5ffa2886afa]
       ll_file_io_generic+0x423/0xc90 [lustre 0967379a0b23e963dac3d44d0227623bfa058caa]
       ll_file_write_iter+0x3f2/0x7b0 [lustre 0967379a0b23e963dac3d44d0227623bfa058caa]
       new_sync_write+0x11f/0x1b0
       vfs_write+0x21c/0x280
       ksys_write+0xa1/0xe0
       do_syscall_64+0x5b/0x80
       ? ksys_write+0x50/0xe0 
       ? do_syscall_64+0x67/0x80
       ? do_sys_open+0x57/0x80
       ? syscall_exit_to_user_mode+0x18/0x40
       ? do_syscall_64+0x67/0x80
       ? exc_page_fault+0x67/0x150
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f1eae65db13
      

      https://testing.whamcloud.com/test_sets/227e0686-d7d7-4e4a-9d0b-cdaf599ee00e

          Activity

            pjones Peter Jones added a comment -

            Fix included in latest LU-15959 patch

            yujian Jian Yu added a comment -

            After falling back to __set_page_dirty_nobuffers() when account_page_dirtied() is not found, sanity test 0d passed. I'm creating the patch.

            yujian Jian Yu added a comment -

            In kernel 5.14.21-150400.22, kallsyms_lookup_name is defined but account_page_dirtied is not exported:

            mm/page-writeback.c
            /*
             * Helper function for set_page_dirty family.
             *
             * Caller must hold lock_page_memcg().
             *
             * NOTE: This relies on being atomic wrt interrupts.
             */
            static void account_page_dirtied(struct page *page,
                            struct address_space *mapping)
            {
                    struct inode *inode = mapping->host;
            
                    trace_writeback_dirty_page(page, mapping);
            
                    if (mapping_can_writeback(mapping)) {
                            struct bdi_writeback *wb;
            
                            inode_attach_wb(inode, page);
                            wb = inode_to_wb(inode);
            
                            __inc_lruvec_page_state(page, NR_FILE_DIRTY);
                            __inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
                            __inc_node_page_state(page, NR_DIRTIED);
                            inc_wb_stat(wb, WB_RECLAIMABLE);
                            inc_wb_stat(wb, WB_DIRTIED);
                            task_io_account_write(PAGE_SIZE);
                            current->nr_dirtied++;
                            __this_cpu_inc(bdp_ratelimits);
            
                            mem_cgroup_track_foreign_dirty(page, wb);
                    }
            }
            
            yujian Jian Yu added a comment -

            Thank you for the advice, Patrick. The failure occurred in an autotest run. Let me reproduce it in a manual run and debug the code.

            paf0186 Patrick Farrell added a comment - See comment here for possible thoughts: https://review.whamcloud.com/#/c/45927/6/lustre/llite/vvp_dev.c@292

            Jian,

            Are you able to look at the dump and extract the line of code where the NULL pointer dereference occurred? Also, does that SLES kernel result in HAVE_KALLSYMS_LOOKUP_NAME being defined or not?

            There's nothing obviously wrong with the patch, but I don't immediately know what code is running there either.

            yujian Jian Yu added a comment -

            FYI, sanity test 0d passed on RHEL 9.0 client with kernel 5.14.0-70.17.1.el9_0.x86_64:

            == sanity test 0d: check export proc ======================================================================================= 16:08:16 (1657840096)
            mgc.MGC192.168.0.166@tcp.import=
            import:
                name: MGC192.168.0.166@tcp
                target: MGS
                state: FULL
                connect_flags: [ version, barrier, adaptive_timeouts, full20, imp_recov, bulk_mbits, second_flags, reply_mbits ]
                connect_data:
                   flags: 0xa000011001002020
                   instance: 0
                   target_version: 2.15.0.0
                import_flags: [ pingable, connect_tried ]
                connection:
                   failover_nids: [ 192.168.0.166@tcp ]
                   current_connection: 192.168.0.166@tcp
                   connection_attempts: 1
                   generation: 1
                   in-progress_invalidations: 0
                   idle: 5 sec
            CMD: vm86 /usr/sbin/lctl get_param -N mgs.MGS.exports.*
            CMD: vm86 /usr/sbin/lctl get_param -n mgs.MGS.exports.0@lo.uuid
            CMD: vm86 /usr/sbin/lctl get_param -n mgs.MGS.exports.192.168.0.153@tcp.uuid
            CMD: vm86 /usr/sbin/lctl get_param mgs.MGS.exports.192.168.0.153@tcp.export
            mgs.MGS.exports.192.168.0.153@tcp.export=
            eb8166ca-446d-425a-8c5e-04d8d30b3c77:
                name: MGS
                client: 192.168.0.153@tcp
                connect_flags: [ version, barrier, adaptive_timeouts, full20, imp_recov, bulk_mbits, second_flags, reply_mbits ]
                connect_data:
                   flags: 0xa000011001002020
                   instance: 0
                   target_version: 2.15.0.0
                export_flags: [  ]
            CMD: vm90 /usr/sbin/lctl get_param -n version 2>/dev/null
            PASS 0d (3s)
            
            yujian Jian Yu added a comment -

            Hi Patrick,
            Could you please advise? Is this related to the changes in https://review.whamcloud.com/45927 ("LU-15220 llite: Compat for set_pagevec_dirty")?


            People

              paf0186 Patrick Farrell
              yujian Jian Yu