Lustre / LU-2218

lots of small IO causes OST deadlock


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 1.8.8
    • Component/s: None
    • Environment: SLES kernel on debian
    • Severity: 3
    • Rank (Obsolete): 5276

    Description

      Sanger have been running into an issue where one of their applications seems to deadlock OSTs. The application does lots of small IO and appears to create and delete a large number of files. It also seems to saturate the network, so there are a lot of bulk IO errors. It looks like the quota and jbd2 code paths are getting into some kind of deadlock (a rough sketch of the IO pattern is included at the end of this description). I'm uploading the full logs, but they contain many stack traces like the following:

      Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264411] [<ffffffff8139ba25>] rwsem_down_failed_common+0x95/0x1e0
      Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264418] [<ffffffff8139bb8f>] rwsem_down_write_failed+0x1f/0x30
      Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264425] [<ffffffff811e8db3>] call_rwsem_down_write_failed+0x13/0x20
      Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264431] [<ffffffff8139ad8c>] down_write+0x1c/0x20
      Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264438] [<ffffffff8114fd3f>] dquot_initialize+0x8f/0x1c0
      Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264453] [<ffffffffa098fff0>] ldiskfs_unlink+0x130/0x270 [ldiskfs]
      Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264484] [<ffffffffa0a18a58>] filter_vfs_unlink+0x2f8/0x500 [obdfilter]
      Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264499] [<ffffffffa0a2c412>] filter_destroy+0x1572/0x1b90 [obdfilter]
      Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264512] [<ffffffffa09e4436>] ost_handle+0x2f36/0x5ef0 [ost]
      Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264539] [<ffffffffa06fb040>] ptlrpc_main+0x1bc0/0x22f0 [ptlrpc]
      Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264574] [<ffffffff81003eba>] child_rip+0xa/0x20
      Oct 21 11:29:40 lus08-oss2 kernel: [ 1456.264577]

      and

      Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266346] Call Trace:
      Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266366] [<ffffffffa0956006>] start_this_handle+0x356/0x450 [jbd2]
      Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266388] [<ffffffffa09562e0>] jbd2_journal_start+0xa0/0xe0 [jbd2]
      Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266398] [<ffffffffa095632e>] jbd2_journal_force_commit+0xe/0x30 [jbd2]
      Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266415] [<ffffffffa0995ce1>] ldiskfs_force_commit+0xb1/0xe0 [ldiskfs]
      Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266444] [<ffffffffa0a1fab0>] filter_sync+0x80/0x600 [obdfilter]
      Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266457] [<ffffffffa09e039f>] ost_blocking_ast+0x29f/0xa30 [ost]
      Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266485] [<ffffffffa06a36d6>] ldlm_cancel_callback+0x56/0xe0 [ptlrpc]
      Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266504] [<ffffffffa06a37ac>] ldlm_lock_cancel+0x4c/0x190 [ptlrpc]
      Oct 21 12:02:13 lus08-oss2 kernel: [ 3406.266528] [<ffffffffa06c3dcf>] ldlm_request_cancel+0x13f/0x380 [ptlrpc]
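
      Just to spell out what I mean by "some kind of deadlock": the unlink path above is stuck waiting for a quota rwsem, while the lock-cancel path is stuck waiting on the journal. Below is a purely illustrative user-space sketch of that general lock-ordering pattern; the lock names are made up for the analogy and this is not the actual kernel code.

      /*
       * Purely illustrative, user-space analogy only -- not the kernel code.
       * Thread A takes "journal" then waits for "quota"; thread B takes
       * "quota" then waits for "journal".  Both hang forever (AB-BA deadlock).
       * Build with: gcc -o abba abba.c -lpthread
       */
      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      static pthread_mutex_t journal = PTHREAD_MUTEX_INITIALIZER;
      static pthread_mutex_t quota   = PTHREAD_MUTEX_INITIALIZER;

      static void *unlink_path(void *arg)
      {
          /* stands in for filter_destroy -> ldiskfs_unlink -> dquot_initialize */
          pthread_mutex_lock(&journal);
          sleep(1);                       /* give the other thread time to run */
          printf("unlink path: waiting for quota lock\n");
          pthread_mutex_lock(&quota);     /* never succeeds */
          pthread_mutex_unlock(&quota);
          pthread_mutex_unlock(&journal);
          return NULL;
      }

      static void *cancel_path(void *arg)
      {
          /* stands in for ost_blocking_ast -> filter_sync -> journal commit */
          pthread_mutex_lock(&quota);
          sleep(1);
          printf("cancel path: waiting for journal\n");
          pthread_mutex_lock(&journal);   /* never succeeds */
          pthread_mutex_unlock(&journal);
          pthread_mutex_unlock(&quota);
          return NULL;
      }

      int main(void)
      {
          pthread_t a, b;

          pthread_create(&a, NULL, unlink_path, NULL);
          pthread_create(&b, NULL, cancel_path, NULL);
          pthread_join(a, NULL);          /* hangs here */
          pthread_join(b, NULL);
          return 0;
      }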

      I asked them to reduce the number of OSS threads to try to cut contention on the disks and network, but that didn't seem to help. Let me know if there are any other logs you need.
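
      In case it helps with reproduction, here is a rough sketch of the IO pattern as I understand it: many small writes plus constant create/delete churn. The file count, IO size, and target directory below are my guesses, not taken from the application itself, and it would need to run as many parallel copies across clients to saturate the network the way they report.

      /*
       * Hypothetical reproducer sketch -- NFILES, IOSIZE, and the target
       * directory are guesses, not what the application actually does.
       * Build with: gcc -o smallio smallio.c
       */
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>

      #define NFILES 100000
      #define IOSIZE 4096               /* "lots of small IO" */

      int main(int argc, char **argv)
      {
          const char *dir = argc > 1 ? argv[1] : ".";
          char path[4096], buf[IOSIZE];
          int i, fd;

          memset(buf, 'x', sizeof(buf));
          for (i = 0; i < NFILES; i++) {
              /* create, write one small chunk, then delete immediately */
              snprintf(path, sizeof(path), "%s/smallio.%d.%d", dir, getpid(), i);
              fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
              if (fd < 0) {
                  perror("open");
                  exit(1);
              }
              if (write(fd, buf, sizeof(buf)) < 0)
                  perror("write");
              close(fd);
              if (unlink(path) < 0)
                  perror("unlink");
          }
          return 0;
      }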

      Attachments

        Activity

          People

            Assignee: Niu Yawei (Inactive)
            Reporter: Kit Westneat (Inactive)
            Votes: 0
            Watchers: 7
