Trace cpu data (tcd) not initialized correctly

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • Lustre 2.17.0
    • Lustre 2.15.0
    • None
    • 3

    Description

      Running modprobe causes the following crash (backtrace from the core dump):

      PID: 283791   TASK: ffff96ad444e5640  CPU: 1    COMMAND: "modprobe"
       #0 [ffffbb30822cf960] machine_kexec at ffffffff94a77167
       #1 [ffffbb30822cf9b8] __crash_kexec at ffffffff94bea45a
       #2 [ffffbb30822cfa78] crash_kexec at ffffffff94beb6e8
       #3 [ffffbb30822cfa80] oops_end at ffffffff94a2e4db
       #4 [ffffbb30822cfaa0] page_fault_oops at ffffffff94a88d0b
       #5 [ffffbb30822cfaf8] exc_page_fault at ffffffff95654602
       #6 [ffffbb30822cfb20] asm_exc_page_fault at ffffffff95800bc2
          [exception RIP: cfs_trace_lock_tcd+9]
          RIP: ffffffffc0ba3a69  RSP: ffffbb30822cfbd0  RFLAGS: 00010282
          RAX: 0000000000000080  RBX: 0000000000000080  RCX: 0000000000000000
          RDX: 0000000080000000  RSI: 0000000000000000  RDI: 0000000000000080
          RBP: ffffbb30822cfd00   R8: ffffffffc0c51850   R9: 0000000000000010
          R10: ffffbb30822cfd20  R11: 0000000000000000  R12: ffffffffc0c9bb80
          R13: ffffffffc0c51851  R14: 0000000000000000  R15: 0000000000000080
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #7 [ffffbb30822cfbd8] libcfs_debug_msg at ffffffffc0ba49a6 [libcfs]
       #8 [ffffbb30822cfd18] _MODULE_INIT_START_lnet at ffffffffc0ccf073 [lnet]
       #9 [ffffbb30822cfd28] do_one_initcall at ffffffff94a01374
      #10 [ffffbb30822cfd90] do_init_module at ffffffff94be57bc
      #11 [ffffbb30822cfdb0] __do_sys_finit_module at ffffffff94be73ce
      #12 [ffffbb30822cfe70] do_syscall_64 at ffffffff9565018c
      #13 [ffffbb30822cff50] entry_SYSCALL_64_after_hwframe at ffffffff958000ea
          RIP: 00007fd8bee3ee5d  RSP: 00007ffdc1206ce8  RFLAGS: 00000246
          RAX: ffffffffffffffda  RBX: 0000555564dfdfc0  RCX: 00007fd8bee3ee5d
          RDX: 0000000000000000  RSI: 0000555564dfe850  RDI: 0000000000000000
          RBP: 0000000000040000   R8: 0000000000000000   R9: 0000000000000004
          R10: 0000000000000000  R11: 0000000000000246  R12: 0000555564dfe850
          R13: 0000555564dfdf60  R14: 0000555564dfdfc0  R15: 0000555564dfe890
          ORIG_RAX: 0000000000000139  CS: 0033  SS: 002b

       

      An additional dis of _MODULE_INIT_START_lnet shows the call to libcfs_setup:

      0xffffffffc0ccf073 <_MODULE_INIT_START_lnet+115>:       call   0xffffffffc0ba3090 <libcfs_setup>

          Activity

            [LU-17639] Trace cpu data (tcd) not initialized correctly
            pjones Peter Jones added a comment -

            As per discussion on the LWG call today, moving tickets that do not appear to be essential to fix version 2.17. If the fix lands before code freeze we will update the fix version to reflect that but we want to focus on activities on the critical path. Please speak up if you think that this issue definitely needs to be fixed before we could issue a 2.16 release.

            fsehr Frank Sehr added a comment - - edited

            It is happening on lustre master clients during exa testing.

            xxx crashed during conf-sanity test_76a.

            As far as I am aware, it happens sporadically. It may be a problem with module dependencies during startup.

            If you look at the tcd parameter (RDI: 0000000000000080), it seems the memory is not initialized at all.
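
One way to read the small fault address is as an index into a per-CPU array whose base pointer was never set. The sketch below is only an illustration under stated assumptions: the 128-byte element size, the CPU limit, and the name sketch_tcd_addr are hypothetical and are not taken from the libcfs source. It shows that a NULL base indexed with CPU 1 would give 0x80, which would be consistent with RDI and "CPU: 1" in the backtrace, but it does not prove that this is what happened.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical values for illustration only; not taken from libcfs. */
#define SKETCH_ELEM_SIZE 128u   /* assumed size of one per-CPU element */
#define SKETCH_NR_CPUS   64u    /* assumed upper bound on CPU ids */

/*
 * Compute the address of the per-CPU element for `cpu` in an array that
 * starts at `base`. Plain integer arithmetic is used so that nothing is
 * dereferenced, which keeps the sketch safe to run with base == 0.
 */
static uintptr_t sketch_tcd_addr(uintptr_t base, unsigned int cpu)
{
    assert(cpu < SKETCH_NR_CPUS);
    return base + (uintptr_t)cpu * SKETCH_ELEM_SIZE;
}
```

With these assumed values, sketch_tcd_addr(0, 1) evaluates to 0x80, while CPU 0 would give 0x0.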

            simmonsja James A Simmons added a comment - - edited

            It looks like someone is trying to use some debugging before the debug buffer is set up. Do you have a reproducer?
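
The ordering problem described here can be sketched in user-space C. This is a minimal illustration, not the libcfs implementation: every name below (sketch_trace_buf, sketch_trace_setup, sketch_debug_msg) is hypothetical. The logging function checks that the buffer has been set up and drops the message otherwise, instead of touching unset memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a debug buffer; not the libcfs structure. */
struct sketch_trace_buf {
    char   data[256];
    size_t used;
};

static struct sketch_trace_buf  sketch_storage;
static struct sketch_trace_buf *sketch_buf;      /* NULL until setup runs */

/* Set up the buffer; must run before any message can be recorded. */
static int sketch_trace_setup(void)
{
    sketch_storage.used = 0;
    sketch_buf = &sketch_storage;
    return 0;
}

/*
 * Record a message. Returns true if the message was stored, false if it
 * was dropped because setup has not run yet or the buffer is full.
 */
static bool sketch_debug_msg(const char *msg)
{
    size_t len;

    assert(msg != NULL);
    if (sketch_buf == NULL)
        return false;            /* called too early: drop, do not fault */

    len = strlen(msg);
    if (len > sizeof(sketch_buf->data) - sketch_buf->used)
        return false;            /* not enough room left */

    memcpy(sketch_buf->data + sketch_buf->used, msg, len);
    sketch_buf->used += len;
    return true;
}
```

In this sketch a call made before sketch_trace_setup() is dropped rather than crashing. Whether dropping early messages or reordering the setup call is the right fix for the real code is a separate design question.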

            fsehr Frank Sehr added a comment -

            This problem seems to be only lustre_master related and seems to be caused by the new libcfs_setup in lnet_init.


            People

              simmonsja James A Simmons
              fsehr Frank Sehr