Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 1.8.8
- Labels: None
- Environment: SLES kernel on Debian
- Severity: 3
- Rank (Obsolete): 5276
Description
Sanger have been running into an issue where one of their applications appears to deadlock OSTs. The application does lots of small I/O and seems to create and delete a large number of files. It also saturates the network, so there are a lot of bulk I/O errors. It looks like the quota and jbd code paths are getting into some kind of deadlock. I'm uploading the full logs, but there is a lot of:
Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264411] [<ffffffff8139ba25>] rwsem_down_failed_common+0x95/0x1e0
Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264418] [<ffffffff8139bb8f>] rwsem_down_write_failed+0x1f/0x30
Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264425] [<ffffffff811e8db3>] call_rwsem_down_write_failed+0x13/0x20
Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264431] [<ffffffff8139ad8c>] down_write+0x1c/0x20
Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264438] [<ffffffff8114fd3f>] dquot_initialize+0x8f/0x1c0
Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264453] [<ffffffffa098fff0>] ldiskfs_unlink+0x130/0x270 [ldiskfs]
Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264484] [<ffffffffa0a18a58>] filter_vfs_unlink+0x2f8/0x500 [obdfilter]
Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264499] [<ffffffffa0a2c412>] filter_destroy+0x1572/0x1b90 [obdfilter]
Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264512] [<ffffffffa09e4436>] ost_handle+0x2f36/0x5ef0 [ost]
Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264539] [<ffffffffa06fb040>] ptlrpc_main+0x1bc0/0x22f0 [ptlrpc]
Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264574] [<ffffffff81003eba>] child_rip+0xa/0x20
Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264577]
and
Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266346] Call Trace:
Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266366] [<ffffffffa0956006>] start_this_handle+0x356/0x450 [jbd2]
Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266388] [<ffffffffa09562e0>] jbd2_journal_start+0xa0/0xe0 [jbd2]
Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266398] [<ffffffffa095632e>] jbd2_journal_force_commit+0xe/0x30 [jbd2]
Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266415] [<ffffffffa0995ce1>] ldiskfs_force_commit+0xb1/0xe0 [ldiskfs]
Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266444] [<ffffffffa0a1fab0>] filter_sync+0x80/0x600 [obdfilter]
Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266457] [<ffffffffa09e039f>] ost_blocking_ast+0x29f/0xa30 [ost]
Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266485] [<ffffffffa06a36d6>] ldlm_cancel_callback+0x56/0xe0 [ptlrpc]
Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266504] [<ffffffffa06a37ac>] ldlm_lock_cancel+0x4c/0x190 [ptlrpc]
Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266528] [<ffffffffa06c3dcf>] ldlm_request_cancel+0x13f/0x380 [ptlrpc]
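To make the suspected cycle concrete: the first trace is stuck in down_write() inside dquot_initialize() during an unlink, while the second is stuck in jbd2_journal_start() waiting on the journal. If a thread that holds an open journal handle (which blocks the commit the second trace is waiting for) also needs the quota semaphore, you get a classic ABBA inversion. Below is a minimal userspace sketch of that pattern; it is my own illustration, not Lustre code, and quota_sem/journal_mutex are stand-in names for dqptr_sem and an open jbd2 handle:

/*
 * Userspace sketch of the suspected ABBA inversion between the quota
 * rwsem and the journal. Build with: cc -o abba abba.c -lpthread
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t quota_sem = PTHREAD_RWLOCK_INITIALIZER;   /* stand-in for dqptr_sem */
static pthread_mutex_t journal_mutex = PTHREAD_MUTEX_INITIALIZER; /* stand-in for an open jbd2 handle */

/* Analogous to filter_destroy -> ldiskfs_unlink -> dquot_initialize */
static void *unlink_path(void *arg)
{
    pthread_rwlock_wrlock(&quota_sem);   /* like down_write(&dqptr_sem) */
    sleep(1);                            /* widen the race window */
    pthread_mutex_lock(&journal_mutex);  /* like jbd2_journal_start(): blocked behind the handle */
    pthread_mutex_unlock(&journal_mutex);
    pthread_rwlock_unlock(&quota_sem);
    return NULL;
}

/* Analogous to a thread holding a journal handle that then needs the quota lock */
static void *quota_path(void *arg)
{
    pthread_mutex_lock(&journal_mutex);  /* open journal handle */
    sleep(1);
    pthread_rwlock_wrlock(&quota_sem);   /* blocks forever: the cycle is complete */
    pthread_rwlock_unlock(&quota_sem);
    pthread_mutex_unlock(&journal_mutex);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, unlink_path, NULL);
    pthread_create(&b, NULL, quota_path, NULL);
    pthread_join(a, NULL);               /* never returns: both threads are deadlocked */
    pthread_join(b, NULL);
    return 0;
}

Run it and both threads hang on their second acquisition, which is essentially what the two traces above look like from the kernel side.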
I asked them to turn down the OSS thread count to try to reduce contention on the disks and network, but that didn't seem to help. Let me know if there are any other logs you need.