  1. Lustre
  2. LU-11761

Blocked MDT mount and high CPU usage from lodXXXX_recYYYY threads

Details

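    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.3
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Environment: CentOS 7.6, kernel 3.10.0-957.1.3.el7_lustre.x86_64, Lustre 2.12.0 RC2
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807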

    Description

      Another issue found while testing 2.12.0 RC2: the MDT mounts never seem to complete, and the following threads each take 100% CPU:

        PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
      20953 root      20   0       0      0      0 R 100.0  0.0  27:00.33 lod0002_rec0001
      20954 root      20   0       0      0      0 R 100.0  0.0  27:00.34 lod0002_rec0003
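To pick such threads out of a longer `top` capture, a filter along these lines may help. It is a minimal sketch, not part of the original report: the function name `busy_recovery_threads` and the 90% threshold are hypothetical, the column positions assume the default `top` layout shown above, and treating the `XXXX`/`YYYY` parts of the thread name as four hex digits is an assumption.

```python
import re

# Thread-name pattern from the issue title: lodXXXX_recYYYY.
# Assumption: XXXX and YYYY are four lowercase hex digits each.
LOD_REC = re.compile(r"^lod[0-9a-f]{4}_rec[0-9a-f]{4}$")


def busy_recovery_threads(top_rows, min_cpu=90.0):
    """Return (pid, cpu, command) for matching threads at or above min_cpu."""
    hits = []
    for row in top_rows:
        fields = row.split()
        # Skip the header, blank lines, and anything without 12 columns.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        pid, cpu, command = int(fields[0]), float(fields[8]), fields[11]
        if LOD_REC.match(command) and cpu >= min_cpu:
            hits.append((pid, cpu, command))
    return hits


rows = [
    "PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND",
    "20953 root 20 0 0 0 0 R 100.0 0.0 27:00.33 lod0002_rec0001",
    "20954 root 20 0 0 0 0 R 100.0 0.0 27:00.34 lod0002_rec0003",
]
print(busy_recovery_threads(rows))
```

With the two rows from this report, both threads are returned; raising `min_cpu` above 100 returns an empty list.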
      

      This is on fir-md1-s1 that handles MDT0 and MDT2 on this test system.

      The sysrq-t output shows:

      Dec 11 09:50:13 fir-md1-s1 kernel: lod0002_rec0001 R  running task        0 20953      2 0x00000080
      Dec 11 09:50:13 fir-md1-s1 kernel: Call Trace:
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d3cc5d>] ? keys_fini+0x2d/0x1d0 [obdclass]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d3ce2b>] lu_context_fini+0x2b/0xa0 [obdclass]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d3d0da>] lu_env_init+0x1a/0x30 [obdclass]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0f19b68>] ptlrpc_set_wait+0x7d8/0x8d0 [ptlrpc]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d515e5>] ? lustre_get_jobid+0x185/0x2e0 [obdclass]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d09f3c>] ? obd_get_request_slot+0x3c/0x280 [obdclass]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0f19ce3>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc088e334>] fld_client_rpc+0x104/0x540 [fld]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0892f5f>] fld_server_lookup+0x15f/0x320 [fld]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc1684587>] lod_fld_lookup+0x327/0x510 [lod]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc16997dd>] lod_object_init+0x7d/0x3c0 [lod]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d3dfd5>] lu_object_alloc+0xe5/0x320 [obdclass]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d3e2e6>] lu_object_find_at+0x76/0x280 [obdclass]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d3f78d>] dt_locate_at+0x1d/0xb0 [obdclass]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d02b4c>] llog_osd_open+0xfc/0xf30 [obdclass]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d3e789>] ? lu_object_put+0x279/0x3d0 [obdclass]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0ceff20>] llog_open+0x140/0x3d0 [obdclass]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc16bdeed>] lod_sub_prep_llog+0x14d/0x783 [lod]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc16837ab>] lod_sub_recovery_thread+0x1cb/0xc80 [lod]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc16835e0>] ? lod_obd_get_info+0x9d0/0x9d0 [lod]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffa7ac1c31>] kthread+0xd1/0xe0
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffa7ac1b60>] ? insert_kthread_work+0x40/0x40
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffa8174c24>] ret_from_fork_nospec_begin+0xe/0x21
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffa7ac1b60>] ? insert_kthread_work+0x40/0x40
      

      and

      Dec 11 09:44:24 fir-md1-s1 kernel: lod0002_rec0003 R  running task        0 20954      2 0x00000080
      Dec 11 09:44:24 fir-md1-s1 kernel: Call Trace:
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d3cfa3>] ? lu_context_init+0xd3/0x1f0 [obdclass]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d3ceba>] ? lu_env_fini+0x1a/0x30 [obdclass]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0f19b68>] ? ptlrpc_set_wait+0x7d8/0x8d0 [ptlrpc]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d515e5>] ? lustre_get_jobid+0x185/0x2e0 [obdclass]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d09f3c>] ? obd_get_request_slot+0x3c/0x280 [obdclass]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0f19ce3>] ? ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc088e421>] ? fld_client_rpc+0x1f1/0x540 [fld]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0892f5f>] ? fld_server_lookup+0x15f/0x320 [fld]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc1684587>] ? lod_fld_lookup+0x327/0x510 [lod]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc16997dd>] ? lod_object_init+0x7d/0x3c0 [lod]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d3dfd5>] ? lu_object_alloc+0xe5/0x320 [obdclass]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d3e2e6>] ? lu_object_find_at+0x76/0x280 [obdclass]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d3f78d>] ? dt_locate_at+0x1d/0xb0 [obdclass]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d02b4c>] ? llog_osd_open+0xfc/0xf30 [obdclass]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d3e789>] ? lu_object_put+0x279/0x3d0 [obdclass]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0ceff20>] ? llog_open+0x140/0x3d0 [obdclass]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc16bdeed>] ? lod_sub_prep_llog+0x14d/0x783 [lod]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc16837ab>] ? lod_sub_recovery_thread+0x1cb/0xc80 [lod]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc16835e0>] ? lod_obd_get_info+0x9d0/0x9d0 [lod]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffa7ac1c31>] ? kthread+0xd1/0xe0
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffa7ac1b60>] ? insert_kthread_work+0x40/0x40
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffa8174c24>] ? ret_from_fork_nospec_begin+0xe/0x21
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffa7ac1b60>] ? insert_kthread_work+0x40/0x40
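Both traces pass through the same functions, from `lod_sub_recovery_thread` down to `ptlrpc_queue_wait`. A rough way to check this mechanically is sketched below. The helper names `frames` and `common_symbols` are hypothetical, the regular expression only assumes the frame layout visible in the traces above, and the sample lines are a shortened subset of those traces.

```python
import re

# One frame of a sysrq-t call trace, for example:
#   [<ffffffffc088e334>] fld_client_rpc+0x104/0x540 [fld]
# A leading "?" marks a frame the kernel could not verify as part of the chain.
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unsure>\?\s+)?(?P<sym>[\w.]+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<mod>\w+)\])?"
)


def frames(log_lines):
    """Return (symbol, module, reliable) tuples, most recent call first."""
    out = []
    for line in log_lines:
        m = FRAME.search(line)
        if m:
            # Frames without a [module] suffix belong to the core kernel.
            out.append((m.group("sym"), m.group("mod") or "kernel",
                        m.group("unsure") is None))
    return out


def common_symbols(trace_a, trace_b):
    """Symbols present in both traces, in trace_a order."""
    seen_b = {sym for sym, _, _ in frames(trace_b)}
    return [sym for sym, _, _ in frames(trace_a) if sym in seen_b]


trace_a = [
    "kernel:  [<ffffffffc0d3ce2b>] lu_context_fini+0x2b/0xa0 [obdclass]",
    "kernel:  [<ffffffffc0f19ce3>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]",
    "kernel:  [<ffffffffc088e334>] fld_client_rpc+0x104/0x540 [fld]",
    "kernel:  [<ffffffffc16837ab>] lod_sub_recovery_thread+0x1cb/0xc80 [lod]",
    "kernel:  [<ffffffffa7ac1c31>] kthread+0xd1/0xe0",
]
trace_b = [
    "kernel:  [<ffffffffc0d3cfa3>] ? lu_context_init+0xd3/0x1f0 [obdclass]",
    "kernel:  [<ffffffffc0f19ce3>] ? ptlrpc_queue_wait+0x83/0x230 [ptlrpc]",
    "kernel:  [<ffffffffc088e421>] ? fld_client_rpc+0x1f1/0x540 [fld]",
    "kernel:  [<ffffffffc16837ab>] ? lod_sub_recovery_thread+0x1cb/0xc80 [lod]",
    "kernel:  [<ffffffffa7ac1c31>] ? kthread+0xd1/0xe0",
]
print(common_symbols(trace_a, trace_b))
```

On this subset the shared symbols are `ptlrpc_queue_wait`, `fld_client_rpc`, `lod_sub_recovery_thread`, and `kthread`; only the innermost frame differs between the two threads.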
      

      Mount commands stay stuck, even when using -o abort_recov.

      I took a crash dump just in case you're interested.

      I believe it's a regression from earlier 2.11.x versions...

      HTH,
      Stephane

      Attachments

        1. fir-md1-s1_2.12.2_119_SRCC.log (29 kB, Stephane Thiell)
        2. fir-md1-s2_2.12.2_119_SRCC.log (21 kB, Stephane Thiell)

            People

              Assignee: Hongchao Zhang (hongchao.zhang)
              Reporter: Stephane Thiell (sthiell)
              Votes: 0
              Watchers: 7
