Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Affects Version/s: Lustre 2.12.0
- None
- Environment: CentOS 7.6, 3.10.0-957.1.3.el7_lustre.x86_64, Lustre 2.12.0 RC2
- Severity: 3
Description
Another issue hit while testing 2.12.0 RC2... the MDT mounts never seem to complete, and the following threads take 100% CPU:
  PID USER  PR NI VIRT RES SHR S  %CPU %MEM    TIME+ COMMAND
20953 root  20  0    0   0   0 R 100.0  0.0 27:00.33 lod0002_rec0001
20954 root  20  0    0   0   0 R 100.0  0.0 27:00.34 lod0002_rec0003
This is on fir-md1-s1, which handles MDT0 and MDT2 on this test system.
sysrq-t shows:
Dec 11 09:50:13 fir-md1-s1 kernel: lod0002_rec0001 R running task 0 20953 2 0x00000080
Dec 11 09:50:13 fir-md1-s1 kernel: Call Trace:
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0d3cc5d>] ? keys_fini+0x2d/0x1d0 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0d3ce2b>] lu_context_fini+0x2b/0xa0 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0d3d0da>] lu_env_init+0x1a/0x30 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0f19b68>] ptlrpc_set_wait+0x7d8/0x8d0 [ptlrpc]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0d515e5>] ? lustre_get_jobid+0x185/0x2e0 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0d09f3c>] ? obd_get_request_slot+0x3c/0x280 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0f19ce3>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc088e334>] fld_client_rpc+0x104/0x540 [fld]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0892f5f>] fld_server_lookup+0x15f/0x320 [fld]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc1684587>] lod_fld_lookup+0x327/0x510 [lod]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc16997dd>] lod_object_init+0x7d/0x3c0 [lod]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0d3dfd5>] lu_object_alloc+0xe5/0x320 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0d3e2e6>] lu_object_find_at+0x76/0x280 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0d3f78d>] dt_locate_at+0x1d/0xb0 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0d02b4c>] llog_osd_open+0xfc/0xf30 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0d3e789>] ? lu_object_put+0x279/0x3d0 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc0ceff20>] llog_open+0x140/0x3d0 [obdclass]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc16bdeed>] lod_sub_prep_llog+0x14d/0x783 [lod]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc16837ab>] lod_sub_recovery_thread+0x1cb/0xc80 [lod]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffc16835e0>] ? lod_obd_get_info+0x9d0/0x9d0 [lod]
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffa7ac1c31>] kthread+0xd1/0xe0
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffa7ac1b60>] ? insert_kthread_work+0x40/0x40
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffa8174c24>] ret_from_fork_nospec_begin+0xe/0x21
Dec 11 09:50:13 fir-md1-s1 kernel: [<ffffffffa7ac1b60>] ? insert_kthread_work+0x40/0x40
and
Dec 11 09:44:24 fir-md1-s1 kernel: lod0002_rec0003 R running task 0 20954 2 0x00000080
Dec 11 09:44:24 fir-md1-s1 kernel: Call Trace:
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0d3cfa3>] ? lu_context_init+0xd3/0x1f0 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0d3ceba>] ? lu_env_fini+0x1a/0x30 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0f19b68>] ? ptlrpc_set_wait+0x7d8/0x8d0 [ptlrpc]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0d515e5>] ? lustre_get_jobid+0x185/0x2e0 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0d09f3c>] ? obd_get_request_slot+0x3c/0x280 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0f19ce3>] ? ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc088e421>] ? fld_client_rpc+0x1f1/0x540 [fld]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0892f5f>] ? fld_server_lookup+0x15f/0x320 [fld]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc1684587>] ? lod_fld_lookup+0x327/0x510 [lod]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc16997dd>] ? lod_object_init+0x7d/0x3c0 [lod]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0d3dfd5>] ? lu_object_alloc+0xe5/0x320 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0d3e2e6>] ? lu_object_find_at+0x76/0x280 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0d3f78d>] ? dt_locate_at+0x1d/0xb0 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0d02b4c>] ? llog_osd_open+0xfc/0xf30 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0d3e789>] ? lu_object_put+0x279/0x3d0 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc0ceff20>] ? llog_open+0x140/0x3d0 [obdclass]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc16bdeed>] ? lod_sub_prep_llog+0x14d/0x783 [lod]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc16837ab>] ? lod_sub_recovery_thread+0x1cb/0xc80 [lod]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffc16835e0>] ? lod_obd_get_info+0x9d0/0x9d0 [lod]
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffa7ac1c31>] ? kthread+0xd1/0xe0
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffa7ac1b60>] ? insert_kthread_work+0x40/0x40
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffa8174c24>] ? ret_from_fork_nospec_begin+0xe/0x21
Dec 11 09:44:24 fir-md1-s1 kernel: [<ffffffffa7ac1b60>] ? insert_kthread_work+0x40/0x40
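Judging from these traces, the lod*_rec* threads appear to be re-issuing the same failing FLD lookup over and over without ever sleeping. The snippet below is a minimal user-space C sketch of that retry-without-backoff pattern, not Lustre source; fake_fld_lookup(), the sequence value, and the retry cap are made-up stand-ins for the lod_fld_lookup()/fld_client_rpc() path seen above.

/*
 * Illustrative user-space C only -- NOT Lustre source.  It models the
 * shape suggested by the traces: a recovery thread keeps re-issuing a
 * lookup that keeps failing and retries immediately, so it never sleeps
 * and pins one CPU core.
 */
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in lookup that never succeeds (peer unreachable). */
static int fake_fld_lookup(unsigned long long seq)
{
    (void)seq;
    return -EAGAIN;
}

int main(void)
{
    unsigned long long seq = 0x200000401ULL;   /* arbitrary example sequence */
    unsigned long attempts = 0;
    int rc;

    /*
     * The problematic shape: retry on failure with no sleep, no backoff
     * and no abort check.  The real thread has no cap at all; the cap
     * here only keeps the demo finite.
     */
    do {
        rc = fake_fld_lookup(seq);
        attempts++;
    } while (rc == -EAGAIN && attempts < 100000000UL);

    printf("gave up after %lu attempts, rc=%d\n", attempts, rc);
    return 0;
}

Built with a plain "cc" invocation, this briefly pegs one core before the demo cap stops it, which matches the 100% CPU seen in the top output above.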
Mount commands are stuck, even when using -o abort_recov.
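For comparison, here is a hedged sketch (again illustrative user-space C, not the actual Lustre fix) of how such a retry loop is usually made interruptible: re-check a stop flag on every failed attempt and back off between retries, so that an unmount or abort_recov request can break it. abort_recovery, fake_fld_lookup() and recovery_loop() are hypothetical names.

/*
 * Hedged sketch, not the actual Lustre fix.  All names are hypothetical.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static volatile bool abort_recovery;    /* would be set by umount/abort_recov */

/* Stand-in lookup: fails a few times, then succeeds. */
static int fake_fld_lookup(unsigned long long seq)
{
    static int calls;
    (void)seq;
    return ++calls < 5 ? -EAGAIN : 0;
}

static int recovery_loop(unsigned long long seq)
{
    int rc;

    for (;;) {
        rc = fake_fld_lookup(seq);
        if (rc != -EAGAIN)
            return rc;                  /* success or hard error */
        if (abort_recovery)
            return -EINTR;              /* let abort_recov/umount win */
        usleep(100 * 1000);             /* back off instead of busy-looping */
    }
}

int main(void)
{
    printf("recovery_loop() returned %d\n", recovery_loop(0x200000401ULL));
    return 0;
}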
I took a crash dump just in case you're interested.
I believe it's a regression from earlier 2.11.x versions...
HTH,
Stephane
Issue Links
- is duplicated by
  - LU-10401 sanity test_133g: timeout during MDT mount (Resolved)
- is related to
  - LU-12360 Can't restart filesystem (2.12) even with abort_recov (Reopened)
  - LU-13468 Timeout-ed FLD_QUERY rpc leads to client operation failure. (Resolved)
- is related to
  - LU-5871 Do not return -EAGAIN in lod_object_init (Resolved)