MDS panic under load - lu_context_key_get

Details

    • Bug
    • Resolution: Duplicate
    • Critical

    Description

      When running a very recent copy of master (from a few days ago) under heavy load on real hardware, we hit MDS panics relatively easily.

      Here's the basic crash signature:

      [16840.816521] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
      [16840.827221] IP: [<ffffffffc0bae433>] lu_context_key_get+0x13/0x30 [obdclass]
      [16840.837127] PGD 0
      [16840.841095] Oops: 0000 [#1] SMP
      [.....]
      [16841.012065] CPU: 18 PID: 145031 Comm: ldlm_cn01_053 Tainted: G OE ------------ 3.10.0-693.21.1.x3.1.9.x86_64 #1
      [16841.026546] Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0021.032120170601 03/21/2017
      [16841.040373] task: ffff880e3b296eb0 ti: ffff881ee0e44000 task.ti: ffff881ee0e44000
      [16841.050648] RIP: 0010:[<ffffffffc0bae433>] [<ffffffffc0bae433>] lu_context_key_get+0x13/0x30 [obdclass]
      [16841.063235] RSP: 0018:ffff881ee0e47a28 EFLAGS: 00010246
      [16841.071023] RAX: 0000000000000014 RBX: ffff881dc15a2f40 RCX: ffff881ee0e47aac
      [16841.080843] RDX: ffff881e50b05930 RSI: ffffffffc1325c40 RDI: 0000000000000000
      [16841.090656] RBP: ffff881ee0e47a70 R08: ffff880fef20c000 R09: 0000000000000130
      [16841.100468] R10: 0000000000000000 R11: ffff881e50b05800 R12: ffff881ee0e47aac
      [16841.110290] R13: ffff881e50b05930 R14: ffff881dc15a2f40 R15: 0000000000000000
      [16841.120113] FS: 0000000000000000(0000) GS:ffff88203df80000(0000) knlGS:0000000000000000
      [16841.130986] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [16841.139190] CR2: 0000000000000010 CR3: 0000000fdc790000 CR4: 00000000003607e0
      [16841.148961] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [16841.158694] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [16841.168389] Call Trace:
      [16841.172820] [<ffffffffc12e7ad4>] ? mdt_lvbo_fill+0x74/0xa80 [mdt]
      [16841.181467] [<ffffffffc0dc6852>] ldlm_server_completion_ast+0x242/0x9e0 [ptlrpc]
      [16841.191573] [<ffffffffc0dc6610>] ? ldlm_server_blocking_ast+0xa40/0xa40 [ptlrpc]
      [16841.201642] [<ffffffffc0d98748>] ldlm_work_cp_ast_lock+0xa8/0x1d0 [ptlrpc]
      [16841.211100] [<ffffffffc0de062a>] ptlrpc_set_wait+0x7a/0x8d0 [ptlrpc]
      [16841.219944] [<ffffffffc09ba2b8>] ? cfs_hash_bd_from_key+0x38/0xb0 [libcfs]
      [16841.229338] [<ffffffff811e4d1d>] ? kmem_cache_alloc_node_trace+0x11d/0x210
      [16841.238708] [<ffffffffc0b90e19>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [16841.248345] [<ffffffffc0d986a0>] ? ldlm_work_gl_ast_lock+0x3a0/0x3a0 [ptlrpc]
      [16841.258047] [<ffffffffc0dd6e80>] ? ptlrpc_prep_set+0xc0/0x260 [ptlrpc]
      [16841.267028] [<ffffffffc0d9e245>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
      [16841.275975] [<ffffffffc0d9f7a9>] __ldlm_reprocess_all+0x129/0x380 [ptlrpc]
      [16841.285292] [<ffffffffc0d9fa10>] ldlm_reprocess_all+0x10/0x20 [ptlrpc]
      [16841.294199] [<ffffffffc0dc3d3e>] ldlm_request_cancel+0x14e/0x740 [ptlrpc]
      [16841.303307] [<ffffffffc0dc8ada>] ldlm_handle_cancel+0xba/0x250 [ptlrpc]
      [16841.312233] [<ffffffffc0dc8dc8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
      [16841.321356] [<ffffffffc0df9ccb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [16841.331372] [<ffffffffc0df6b55>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
      [16841.340401] [<ffffffffc0dfd5c4>] ptlrpc_main+0xaf4/0x1fa0 [ptlrpc]
      [16841.348851] [<ffffffffc0dfcad0>] ? ptlrpc_register_service+0xf70/0xf70 [ptlrpc]
      [16841.358516] [<ffffffff810b4031>] kthread+0xd1/0xe0
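For anyone triaging from the log alone, the fields that make up the crash signature can be pulled out mechanically. The following is a minimal Python sketch, not part of any Lustre or kernel tooling: `parse_oops` and `looks_like_null_first_arg` are hypothetical helper names, and the excerpt is copied from the oops above. The final check relies on the x86_64 System V calling convention (first pointer argument in RDI) and assumes RDI was not overwritten before the fault; it is a hint for triage, not a confirmed diagnosis.

```python
import re

# Excerpt copied from the oops in this report.
OOPS = """\
[16840.816521] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[16840.827221] IP: [<ffffffffc0bae433>] lu_context_key_get+0x13/0x30 [obdclass]
[16841.080843] RDX: ffff881e50b05930 RSI: ffffffffc1325c40 RDI: 0000000000000000
[16841.139190] CR2: 0000000000000010 CR3: 0000000fdc790000 CR4: 00000000003607e0
"""


def parse_oops(text):
    """Extract the fault address, faulting symbol, and register values."""
    info = {"regs": {}}

    m = re.search(r"NULL pointer dereference at ([0-9a-f]+)", text)
    if m:
        info["fault_addr"] = int(m.group(1), 16)

    # "IP: [<address>] symbol+offset/length [module]"
    m = re.search(
        r"IP: \[<[0-9a-f]+>\] (\w+)\+(0x[0-9a-f]+)/(0x[0-9a-f]+) \[(\w+)\]", text
    )
    if m:
        info["symbol"] = m.group(1)
        info["offset"] = int(m.group(2), 16)
        info["module"] = m.group(4)

    # Register dumps look like "RDX: <16 hex digits> RSI: <16 hex digits> ...".
    for name, value in re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text):
        info["regs"][name] = int(value, 16)

    return info


def looks_like_null_first_arg(info):
    """Heuristic: RDI is zero and the fault address is a small offset from NULL."""
    return info["regs"].get("RDI") == 0 and info.get("fault_addr", 1 << 63) < 4096


if __name__ == "__main__":
    info = parse_oops(OOPS)
    print(info["module"], info["symbol"], hex(info["offset"]), hex(info["fault_addr"]))
    print("NULL first argument suspected:", looks_like_null_first_arg(info))
```

Comparing the `symbol` and `fault_addr` fields across reports is one way to check whether two oopses share a signature, though matching signatures alone do not prove a common root cause.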

      I think there's a good chance this is related to https://jira.whamcloud.com/browse/LU-11483, but I haven't done detailed triage yet.

      If someone from WC wants to take a look, I can make the vmcore available. (Someone from Cray will take a detailed look eventually, but we haven't had the chance yet.)

          Activity

            [LU-11629] MDS panic under load - lu_context_key_get
            pjones Peter Jones added a comment -

            Ah sorry, having now seen that extra context I understand. Thanks for pointing that out!


            paf Patrick Farrell (Inactive) added a comment -

            I'm not skeptical of what Oleg did, I think there's some confusion here, and it would be good to get Oleg to weigh in.

            Essentially, exactly the same problem exists in two places in the code. One crashes with the signature given in LU-11483, one is the signature reported here in LU-11629. It's simply that the same fix has to be applied in two places. I believe that's what Oleg was indicating by marking this as a duplicate of LU-11483.
            pjones Peter Jones added a comment -

            Patrick is skeptical that this is a duplicate of LU-11483, so reopening until some testing has been run to prove or disprove this theory.

            jmiller Justin Miller (Inactive) added a comment -

            Testing mail delivery for @paf

            paf Patrick Farrell (Inactive) added a comment -

            OK, thanks!

            adilger Andreas Dilger added a comment -

            Please follow up in LU-11483.
            green Oleg Drokin added a comment -

            I believe they are the same


            People

              wc-triage WC Triage
              paf Patrick Farrell (Inactive)