Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Affects Version: Lustre 2.10.4
- Environment: lustre-2.10.4_1.chaos-1.ch6.x86_64 servers, RHEL 7.5, DNE1 file system
- Severity: 3
Description
The servers were restarted and appeared to recover normally. For a short time they seemed to be handling the same heavy workload that was running before they were powered off, then they started logging the "system was overloaded" message. The kernel then reported several stacks like this:
INFO: task ll_ost00_007:108440 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ll_ost00_007 D ffff8ba4dc35bf40 0 108440 2 0x00000080
Call Trace:
[<ffffffffaad38919>] schedule_preempt_disabled+0x39/0x90
[<ffffffffaad3654f>] __mutex_lock_slowpath+0x10f/0x250
[<ffffffffaad357f2>] mutex_lock+0x32/0x42
[<ffffffffc1669afb>] ofd_create_hdl+0xdcb/0x2090 [ofd]
[<ffffffffc1322007>] ? lustre_msg_add_version+0x27/0xa0 [ptlrpc]
[<ffffffffc132235f>] ? lustre_pack_reply_v2+0x14f/0x290 [ptlrpc]
[<ffffffffc1322691>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
[<ffffffffc138653a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[<ffffffffc132db5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[<ffffffffc132b26b>] ? ptlrpc_wait_event+0xab/0x350 [ptlrpc]
[<ffffffffaa6d5c32>] ? default_wake_function+0x12/0x20
[<ffffffffaa6cb01b>] ? __wake_up_common+0x5b/0x90
[<ffffffffc1331c70>] ptlrpc_main+0xae0/0x1e90 [ptlrpc]
[<ffffffffc1331190>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[<ffffffffaa6c0ad1>] kthread+0xd1/0xe0
[<ffffffffaa6c0a00>] ? insert_kthread_work+0x40/0x40
[<ffffffffaad44837>] ret_from_fork_nospec_begin+0x21/0x21
[<ffffffffaa6c0a00>] ? insert_kthread_work+0x40/0x40
Lustre then began reporting:
LustreError: 108448:0:(ofd_dev.c:1627:ofd_create_hdl()) lquake-OST0003:[27917288460] destroys_in_progress already cleared
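The stack above shows the OST thread stuck in `mutex_lock()` inside `ofd_create_hdl()`: one thread holds the mutex (here, while orphan destruction is in progress) long enough that waiters sit in `__mutex_lock_slowpath` and trip the 120-second hung-task watchdog. A minimal userspace sketch of that blocking pattern, using pthreads rather than kernel mutexes (all names here are invented for illustration, not Lustre code):

```c
/* Sketch of the contention pattern from the trace: a long-running
 * holder of a mutex forces a second thread to block in the lock's
 * slow path, analogous to ll_ost00_007 waiting in mutex_lock(). */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t destroy_lock = PTHREAD_MUTEX_INITIALIZER;

static void *long_holder(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&destroy_lock);   /* e.g. orphan destroy in progress */
    sleep(2);                            /* simulate slow work under the lock */
    pthread_mutex_unlock(&destroy_lock);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, long_holder, NULL);
    usleep(100 * 1000);                  /* let the holder take the lock first */

    /* This caller now blocks until the holder releases the mutex;
     * in the kernel the analogous wait is uninterruptible ("D" state),
     * which is what the hung-task watchdog reports after 120 seconds. */
    pthread_mutex_lock(&destroy_lock);
    puts("lock acquired after holder released it");
    pthread_mutex_unlock(&destroy_lock);

    pthread_join(t, NULL);
    return 0;
}
```

In the kernel case the watchdog threshold is `/proc/sys/kernel/hung_task_timeout_secs`, as the log message itself notes; the message is a symptom of the long hold time, not the fault itself.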
Attachments
Issue Links
- is related to LU-11399: use separate locks for orphan destroy and objects re-create at OFD (Open)