Details
- Bug
- Resolution: Fixed
- Critical
- Lustre 2.0.0
- None
- 2
- 24,438
- 4793
Description
BULL reports in Bugzilla that there are possible deadlock issues on the MDS with jbd2 (just runaway transactions?):
At CEA, they have encountered several occurrences of the same scenario in which all Lustre activity
hangs. Each time they live-debug the problem, they end up on the MDS node, where all Lustre
operations appear to be frozen.
As a consequence, the MDS has to be rebooted and the Lustre layer restarted on it with recovery.
The MDS threads which appear to be strongly involved in the frozen situation have the following
stack traces, taken from one of the forced crash-dumps:
==================================
About 234 tasks share the same stack:
Pid: 5250 mdt_rdpg_143
schedule()
start_this_handle()
jbd2_journal_start()
ldiskfs_journal_start_sb()
osd_trans_start()
mdd_trans_start()
cml_close()
One is with:
Pid: 4990 mdt_395
schedule()
jbd2_log_wait_commit()
jbd2_journal_stop()
__ldiskfs_journal_stop()
osd_trans_stop()
mdd_trans_stop()
mdd_attr_set()
cml_attr_set()
And another with:
Pid: 4534 "jbd2/sdd-8"
schedule()
jbd2_journal_commit_transaction()
kjournald2()
kthread()
kernel_thread()
==================================
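To make the "N tasks with the same stack" grouping reproducible, here is a minimal Python sketch that groups tasks by identical call chain. It assumes the traces have already been parsed into (pid, name, frames) tuples. That input shape and the helper name group_stacks are hypothetical, not the output format of any crash tool. Only the three stacks quoted above are used as data.

```python
from collections import defaultdict

def group_stacks(tasks):
    """Group (pid, name, frames) tuples by identical call chain.

    Returns a list of (frames, [(pid, name), ...]) pairs, largest group first.
    """
    groups = defaultdict(list)
    for pid, name, frames in tasks:
        groups[tuple(frames)].append((pid, name))
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)

# The three stacks from the report, innermost frame first.
WAIT_START = ["schedule", "start_this_handle", "jbd2_journal_start",
              "ldiskfs_journal_start_sb", "osd_trans_start",
              "mdd_trans_start", "cml_close"]
WAIT_COMMIT = ["schedule", "jbd2_log_wait_commit", "jbd2_journal_stop",
               "__ldiskfs_journal_stop", "osd_trans_stop",
               "mdd_trans_stop", "mdd_attr_set", "cml_attr_set"]
COMMIT = ["schedule", "jbd2_journal_commit_transaction", "kjournald2",
          "kthread", "kernel_thread"]

tasks = [
    (5250, "mdt_rdpg_143", WAIT_START),
    (4990, "mdt_395", WAIT_COMMIT),
    (4534, "jbd2/sdd-8", COMMIT),
]

for frames, members in group_stacks(tasks):
    # Print the group size and the frame just above schedule().
    print(len(members), "task(s) blocked in", frames[1])
```

Fed with every task from the crash dump, the largest group would be the threads blocked in start_this_handle(), which makes the few tasks with a different stack easy to spot.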
Analysis of the crash dump shows that the task hung in jbd2_journal_commit_transaction() has been in
this state for a very long time.
This problem looks like bug 16667, but unfortunately that fix is not applicable 'as is' because it
dates back to 1.6. Here there seems to be a race or deadlock in the Lustre/JBD2 layers.
As a workaround, the customer deactivated the ChangeLog feature, and since then the problem has not
recurred. Sadly, ChangeLogs are required by HSM, so this workaround cannot last...
Can you see the reason for this deadlock?
I should point out that this bug is critical, as it blocks normal cluster operation (i.e. with HSM).
Bruno, can you post/attach the full set of stack traces for this lockup?