Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 1.8.8
- Labels: None
- Environment: SLES kernel on Debian
- Severity: 3
- Rank (Obsolete): 5276
Description
Sanger have been running into an issue where one of their applications appears to deadlock OSTs. The application does lots of small I/O and seems to create and delete a large number of files. It also saturates the network, so there are a lot of bulk I/O errors. It looks like the quota and jbd code paths are getting into some kind of deadlock. I'm uploading the full logs, but there is a lot of:
Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264411] [<ffffffff8139ba25>] rwsem_down_failed_common+0x95/0x1e0
Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264418] [<ffffffff8139bb8f>] rwsem_down_write_failed+0x1f/0x30
Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264425] [<ffffffff811e8db3>] call_rwsem_down_write_failed+0x13/0x20
Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264431] [<ffffffff8139ad8c>] down_write+0x1c/0x20
Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264438] [<ffffffff8114fd3f>] dquot_initialize+0x8f/0x1c0
Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264453] [<ffffffffa098fff0>] ldiskfs_unlink+0x130/0x270 [ldiskfs]
Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264484] [<ffffffffa0a18a58>] filter_vfs_unlink+0x2f8/0x500 [obdfilter]
Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264499] [<ffffffffa0a2c412>] filter_destroy+0x1572/0x1b90 [obdfilter]
Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264512] [<ffffffffa09e4436>] ost_handle+0x2f36/0x5ef0 [ost]
Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264539] [<ffffffffa06fb040>] ptlrpc_main+0x1bc0/0x22f0 [ptlrpc]
Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264574] [<ffffffff81003eba>] child_rip+0xa/0x20
Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264577]
and
Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266346] Call Trace:
Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266366] [<ffffffffa0956006>] start_this_handle+0x356/0x450 [jbd2]
Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266388] [<ffffffffa09562e0>] jbd2_journal_start+0xa0/0xe0 [jbd2]
Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266398] [<ffffffffa095632e>] jbd2_journal_force_commit+0xe/0x30 [jbd2]
Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266415] [<ffffffffa0995ce1>] ldiskfs_force_commit+0xb1/0xe0 [ldiskfs]
Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266444] [<ffffffffa0a1fab0>] filter_sync+0x80/0x600 [obdfilter]
Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266457] [<ffffffffa09e039f>] ost_blocking_ast+0x29f/0xa30 [ost]
Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266485] [<ffffffffa06a36d6>] ldlm_cancel_callback+0x56/0xe0 [ptlrpc]
Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266504] [<ffffffffa06a37ac>] ldlm_lock_cancel+0x4c/0x190 [ptlrpc]
Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266528] [<ffffffffa06c3dcf>] ldlm_request_cancel+0x13f/0x380 [ptlrpc]
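To make the suspected cycle concrete: the first trace is stuck in down_write() inside dquot_initialize() during an unlink, while the second is stuck in jbd2_journal_start() waiting on the journal. If a thread that holds an open journal handle (which blocks the commit the second trace is waiting for) also needs the quota semaphore, you get a classic ABBA inversion. Below is a minimal userspace sketch of that pattern; it is my own illustration, not Lustre code, and quota_sem/journal_mutex are stand-in names for dqptr_sem and an open jbd2 handle:

/*
 * Userspace sketch of the suspected ABBA inversion between the quota
 * rwsem and the journal. Build with: cc -o abba abba.c -lpthread
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t quota_sem = PTHREAD_RWLOCK_INITIALIZER;   /* stand-in for dqptr_sem */
static pthread_mutex_t journal_mutex = PTHREAD_MUTEX_INITIALIZER; /* stand-in for an open jbd2 handle */

/* Analogous to filter_destroy -> ldiskfs_unlink -> dquot_initialize */
static void *unlink_path(void *arg)
{
    pthread_rwlock_wrlock(&quota_sem);   /* like down_write(&dqptr_sem) */
    sleep(1);                            /* widen the race window */
    pthread_mutex_lock(&journal_mutex);  /* like jbd2_journal_start(): blocked behind the handle */
    pthread_mutex_unlock(&journal_mutex);
    pthread_rwlock_unlock(&quota_sem);
    return NULL;
}

/* Analogous to a thread holding a journal handle that then needs the quota lock */
static void *quota_path(void *arg)
{
    pthread_mutex_lock(&journal_mutex);  /* open journal handle */
    sleep(1);
    pthread_rwlock_wrlock(&quota_sem);   /* blocks forever: the cycle is complete */
    pthread_rwlock_unlock(&quota_sem);
    pthread_mutex_unlock(&journal_mutex);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, unlink_path, NULL);
    pthread_create(&b, NULL, quota_path, NULL);
    pthread_join(a, NULL);               /* never returns: both threads are deadlocked */
    pthread_join(b, NULL);
    return 0;
}

Run it and both threads hang on their second acquisition, which is essentially what the two traces above look like from the kernel side.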
I asked them to turn down the OSS thread count to try to reduce contention on the disks and network, but that didn't seem to help. Let me know if there are any other logs you need.