Lustre / LU-13927

MDS crash when increasing max_rpcs_in_flight to 256

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • None
    • Affects Version: Lustre 2.12.5
    • Environment: RHEL7 server nodes running 2.12.5 LTS.
    • 3

    Description

      When max_rpcs_in_flight was set to 256, the MDS crashed with the following backtrace.

      [3072807.665012] LustreError: 106301:0:(ldlm_lockd.c:1543:ldlm_handle_convert0()) Skipped 6 previous similar messages

      [3072920.767949] LustreError: 107784:0:(ldlm_lockd.c:1543:ldlm_handle_convert0()) ### convert on canceled lock! ns: mdt-storm-MDT0000_UUID lock: ffff8fbfd69a2400/0x8f43eb98e65eb06e lrc: 3/0,0 mode: PR/PR res: [0x20000560c:0x9f09:0x0].0x0 bits 0x58/0x0 rrc: 4 type: IBT flags: 0x54a01400010020 nid: 10.134.129.9@tcp55 remote: 0xc1b65128fa6df589 expref: 31059 pid: 154261 timeout: 3080537 lvb_type: 0

      [3072920.805945] LustreError: 107784:0:(ldlm_lockd.c:1543:ldlm_handle_convert0()) Skipped 4 previous similar messages

      [3072929.398817] LustreError: 106301:0:(ldlm_lock.c:1106:ldlm_grant_lock_with_skiplist()) ASSERTION( ldlm_is_granted(lock) ) failed:

      [3072929.412226] LustreError: 106301:0:(ldlm_lock.c:1106:ldlm_grant_lock_with_skiplist()) LBUG

      [3072929.421404] Pid: 106301, comm: ldlm_cn00_002 3.10.0-1127.13.1.el7.x86_64 #1 SMP Fri Jun 12 14:34:17 EDT 2020

      [3072929.432225] Call Trace:

      [3072929.435691]  [<ffffffffc282a7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]

      [3072929.443252]  [<ffffffffc282a87c>] lbug_with_loc+0x4c/0xa0 [libcfs]

      [3072929.450458]  [<ffffffffc164fa87>] ldlm_grant_lock_with_skiplist+0x607/0x750 [ptlrpc]

      [3072929.459259]  [<ffffffffc1682d0a>] ldlm_inodebits_drop+0xaa/0x170 [ptlrpc]

      [3072929.467092]  [<ffffffffc167b3fb>] ldlm_handle_convert0+0x2db/0x460 [ptlrpc]

      [3072929.475080]  [<ffffffffc167bacb>] ldlm_cancel_handler+0x29b/0x590 [ptlrpc]

      [3072929.482957]  [<ffffffffc16ae48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]

      [3072929.491613]  [<ffffffffc16b1df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]

      [3072929.498873]  [<ffffffff930c6691>] kthread+0xd1/0xe0

      [3072929.504710]  [<ffffffff93792d1d>] ret_from_fork_nospec_begin+0x7/0x21

      [3072929.512100]  [<ffffffffffffffff>] 0xffffffffffffffff

      [3072929.518025] Kernel panic - not syncing: LBUG

      [3072929.523194] CPU: 1 PID: 106301 Comm: ldlm_cn00_002 Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-1127.13.1.el7.x86_64 #1

      [3072929.536964] Hardware name: Dell Inc. PowerEdge R640/0RGP26, BIOS 2.3.10 08/15/2019

      [3072929.545412] Call Trace:

      [3072929.548751]  [<ffffffff9377ffa5>] dump_stack+0x19/0x1b

      [3072929.554758]  [<ffffffff93779541>] panic+0xe8/0x21f

      [3072929.560410]  [<ffffffffc282a8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]

      [3072929.567463]  [<ffffffffc164fa87>] ldlm_grant_lock_with_skiplist+0x607/0x750 [ptlrpc]

      [3072929.576066]  [<ffffffffc1682d0a>] ldlm_inodebits_drop+0xaa/0x170 [ptlrpc]

      [3072929.583705]  [<ffffffffc167b3fb>] ldlm_handle_convert0+0x2db/0x460 [ptlrpc]

      [3072929.591502]  [<ffffffffc167bacb>] ldlm_cancel_handler+0x29b/0x590 [ptlrpc]

      [3072929.599199]  [<ffffffffc16ae48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]

      [3072929.607671]  [<ffffffffc16ab2a5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]

      [3072929.615245]  [<ffffffff930d3dc3>] ? __wake_up+0x13/0x20

      [3072929.621272]  [<ffffffffc16b1df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]

      [3072929.628307]  [<ffffffff93785942>] ? __schedule+0x402/0x840
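
For triage, the frames in the trace above can be pulled out mechanically to compare against other LBUG reports. What follows is a minimal Python sketch, assuming the RHEL 7 kernel frame format seen in this log (`[timestamp]  [<address>] function+offset/size [module]`). The names `FRAME_RE` and `parse_frames` are hypothetical helpers for illustration and are not part of Lustre or any crash-analysis tool.

```python
import re

# Matches kernel stack frame lines such as:
#   [3072929.450458]  [<ffffffffc164fa87>] ldlm_grant_lock_with_skiplist+0x607/0x750 [ptlrpc]
# A leading "?" before the symbol marks a frame the unwinder considers unreliable.
FRAME_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<unreliable>\?\s+)?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(log_text):
    """Return one dict per stack frame found in log_text, in log order."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME_RE.search(line)
        if m is None:
            continue  # not a symbolized frame (message line, bare address, ...)
        frames.append({
            "func": m.group("func"),
            # Frames without a [module] suffix belong to the core kernel.
            "module": m.group("module") or "kernel",
            "reliable": m.group("unreliable") is None,
        })
    return frames


# A few lines copied from the trace in this report.
SAMPLE = """\
[3072929.450458]  [<ffffffffc164fa87>] ldlm_grant_lock_with_skiplist+0x607/0x750 [ptlrpc]
[3072929.459259]  [<ffffffffc1682d0a>] ldlm_inodebits_drop+0xaa/0x170 [ptlrpc]
[3072929.467092]  [<ffffffffc167b3fb>] ldlm_handle_convert0+0x2db/0x460 [ptlrpc]
[3072929.607671]  [<ffffffffc16ab2a5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[3072929.498873]  [<ffffffff930c6691>] kthread+0xd1/0xe0
[3072929.512100]  [<ffffffffffffffff>] 0xffffffffffffffff
"""

if __name__ == "__main__":
    for frame in parse_frames(SAMPLE):
        marker = " " if frame["reliable"] else "?"
        print(f'{marker} {frame["module"]:8s} {frame["func"]}')
```

Filtering on `reliable` leaves the chain that actually led to the assertion: `ldlm_handle_convert0` calls `ldlm_inodebits_drop`, which calls `ldlm_grant_lock_with_skiplist`, where the LBUG fires.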

            People

              Assignee: Mikhail Pershin (tappro)
              Reporter: James A Simmons (simmonsja)