Lustre · LU-11629

MDS panic under load - lu_context_key_get


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical

    Description

      When running a very recent copy of master (a few days old) under heavy load on real hardware, we hit MDS panics relatively easily.

      Here's the basic crash signature:

       [16840.816521] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
      [16840.827221] IP: [<ffffffffc0bae433>] lu_context_key_get+0x13/0x30 [obdclass]
      [16840.837127] PGD 0
      [16840.841095] Oops: 0000 [#1] SMP
      [.....]
      [16841.012065] CPU: 18 PID: 145031 Comm: ldlm_cn01_053 Tainted: G OE ------------ 3.10.0-693.21.1.x3.1.9.x86_64 #1
      [16841.026546] Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0021.032120170601 03/21/2017
      [16841.040373] task: ffff880e3b296eb0 ti: ffff881ee0e44000 task.ti: ffff881ee0e44000
      [16841.050648] RIP: 0010:[<ffffffffc0bae433>] [<ffffffffc0bae433>] lu_context_key_get+0x13/0x30 [obdclass]
      [16841.063235] RSP: 0018:ffff881ee0e47a28 EFLAGS: 00010246
      [16841.071023] RAX: 0000000000000014 RBX: ffff881dc15a2f40 RCX: ffff881ee0e47aac
      [16841.080843] RDX: ffff881e50b05930 RSI: ffffffffc1325c40 RDI: 0000000000000000
      [16841.090656] RBP: ffff881ee0e47a70 R08: ffff880fef20c000 R09: 0000000000000130
      [16841.100468] R10: 0000000000000000 R11: ffff881e50b05800 R12: ffff881ee0e47aac
      [16841.110290] R13: ffff881e50b05930 R14: ffff881dc15a2f40 R15: 0000000000000000
      [16841.120113] FS: 0000000000000000(0000) GS:ffff88203df80000(0000) knlGS:0000000000000000
      [16841.130986] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [16841.139190] CR2: 0000000000000010 CR3: 0000000fdc790000 CR4: 00000000003607e0
      [16841.148961] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [16841.158694] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [16841.168389] Call Trace:
      [16841.172820] [<ffffffffc12e7ad4>] ? mdt_lvbo_fill+0x74/0xa80 [mdt]
      [16841.181467] [<ffffffffc0dc6852>] ldlm_server_completion_ast+0x242/0x9e0 [ptlrpc]
      [16841.191573] [<ffffffffc0dc6610>] ? ldlm_server_blocking_ast+0xa40/0xa40 [ptlrpc]
      [16841.201642] [<ffffffffc0d98748>] ldlm_work_cp_ast_lock+0xa8/0x1d0 [ptlrpc]
      [16841.211100] [<ffffffffc0de062a>] ptlrpc_set_wait+0x7a/0x8d0 [ptlrpc]
      [16841.219944] [<ffffffffc09ba2b8>] ? cfs_hash_bd_from_key+0x38/0xb0 [libcfs]
      [16841.229338] [<ffffffff811e4d1d>] ? kmem_cache_alloc_node_trace+0x11d/0x210
      [16841.238708] [<ffffffffc0b90e19>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [16841.248345] [<ffffffffc0d986a0>] ? ldlm_work_gl_ast_lock+0x3a0/0x3a0 [ptlrpc]
      [16841.258047] [<ffffffffc0dd6e80>] ? ptlrpc_prep_set+0xc0/0x260 [ptlrpc]
      [16841.267028] [<ffffffffc0d9e245>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
      [16841.275975] [<ffffffffc0d9f7a9>] __ldlm_reprocess_all+0x129/0x380 [ptlrpc]
      [16841.285292] [<ffffffffc0d9fa10>] ldlm_reprocess_all+0x10/0x20 [ptlrpc]
      [16841.294199] [<ffffffffc0dc3d3e>] ldlm_request_cancel+0x14e/0x740 [ptlrpc]
      [16841.303307] [<ffffffffc0dc8ada>] ldlm_handle_cancel+0xba/0x250 [ptlrpc]
      [16841.312233] [<ffffffffc0dc8dc8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
      [16841.321356] [<ffffffffc0df9ccb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [16841.331372] [<ffffffffc0df6b55>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
      [16841.340401] [<ffffffffc0dfd5c4>] ptlrpc_main+0xaf4/0x1fa0 [ptlrpc]
      [16841.348851] [<ffffffffc0dfcad0>] ? ptlrpc_register_service+0xf70/0xf70 [ptlrpc]
      [16841.358516] [<ffffffff810b4031>] kthread+0xd1/0xe0
      I think there's a good chance this is related to https://jira.whamcloud.com/browse/LU-11483, but I haven't done detailed triage yet.

       

      If someone from WC wants to take a look, I can make the vmcore available.  (Someone from Cray will take a detailed look eventually, but we haven't had the chance yet.)

      Attachments

      Issue Links

      Activity

      People

        Assignee: wc-triage (WC Triage)
        Reporter: paf (Patrick Farrell)
        Votes: 0
        Watchers: 5

      Dates

        Created:
        Updated:
        Resolved: