Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.4.0
    • Lustre 2.4.0
    • 2.3.56-2chaos (github.com/chaos/lustre)
    • 3
    • 5630

    Description

      We seem to be getting hangs in osc_extent_wait() under the fsync at the end of ior's write phase.

      2012-11-20 16:15:08.093629 {DefaultControlEventListener} [mmcs]{657}.0.0: sysiod        S 00000fff8c266bd0     0  4037   3078 0x00000000
      2012-11-20 16:15:08.093693 {DefaultControlEventListener} [mmcs]{657}.0.0: Call Trace:
      2012-11-20 16:15:08.093744 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc2b10] [c0000003e1fc2bc0] 0xc0000003e1fc2bc0 (unreliable)
      2012-11-20 16:15:08.093812 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc2ce0] [c000000000008de0] .__switch_to+0xc4/0x100
      2012-11-20 16:15:08.093885 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc2d70] [c00000000042b0e0] .schedule+0x858/0x9c0
      2012-11-20 16:15:08.093956 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3020] [8000000000aa20a0] .cfs_waitq_wait+0x10/0x30 [libcfs]
      2012-11-20 16:15:08.094010 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3090] [80000000046d93f0] .osc_extent_wait+0x620/0x7d0 [osc]
      2012-11-20 16:15:08.094062 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3290] [80000000046d9c80] .osc_cache_wait_range+0x6e0/0xcd0 [osc]
      2012-11-20 16:15:08.094127 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc34d0] [80000000046c4b00] .osc_io_fsync_end+0xf0/0x100 [osc]
      2012-11-20 16:15:08.094192 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3570] [800000000249962c] .cl_io_end+0xcc/0x2b0 [obdclass]
      2012-11-20 16:15:08.094257 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3620] [8000000005234d00] .lov_io_end_wrapper+0x180/0x1f0 [lov]
      2012-11-20 16:15:08.094325 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc36c0] [8000000005234e3c] .lov_io_fsync_end+0xcc/0x250 [lov]
      2012-11-20 16:15:08.094389 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3760] [800000000249962c] .cl_io_end+0xcc/0x2b0 [obdclass]
      2012-11-20 16:15:08.094446 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3810] [80000000024a2e14] .cl_io_loop+0x1a4/0x2c0 [obdclass]
      2012-11-20 16:15:08.094498 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc38c0] [8000000006a1332c] .cl_sync_file_range+0x3ac/0x660 [lustre]
      2012-11-20 16:15:08.094550 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc39b0] [8000000006a4d4b4] .ll_writepages+0xb4/0x280 [lustre]
      2012-11-20 16:15:08.094602 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3a60] [c00000000009d534] .do_writepages+0x44/0x74
      2012-11-20 16:15:08.094653 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3ad0] [c000000000095710] .__filemap_fdatawrite_range+0x80/0xa8
      2012-11-20 16:15:08.094705 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3be0] [c000000000095778] .filemap_write_and_wait_range+0x40/0x8c
      2012-11-20 16:15:08.094757 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3c80] [c0000000000fc054] .vfs_fsync_range+0x70/0x104
      2012-11-20 16:15:08.096720 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3d30] [c0000000000fc18c] .do_fsync+0x3c/0x6c
      2012-11-20 16:15:08.096787 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3dc0] [c0000000000fc1fc] .SyS_fsync+0x18/0x28
      2012-11-20 16:15:08.096846 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3e30] [c000000000000580] syscall_exit+0x0/0x2c
      

      The console message that seems to go with this event is:

      LustreError: 4037:0:(osc_cache.c:893:osc_extent_wait()) extent c0000002d415ade8@{[8176 -> 8191/8191], [2|0|+|cache|wiuY|c000000383eb0480], [1048576|16|+|-|c0000002d05f01e0|16|(null)]} lsfull-OST00d1-osc-c0000003cbcda000: wait ext to 0 timedout, recovery in progress?
      

      Attached will be a log file,RA0-ID-J01.log, showing all processes on the node. Note that I logged into the node and ran "sync", which also hung, so you will likely see a second backtrace from that in there. That was not part of the original problem.

      Attachments

        1. RA0-ID-J01.log
          1.10 MB
          Christopher Morrone

        Activity

          [LU-2367] Hang in osc_extent_wait under fsync

          The fix has been landed, so it can be closed. Thanks!

          prakash Prakash Surya (Inactive) added a comment - The fix has been landed, so it can be closed. Thanks!

          Hi Jodi, the patch has to be landed before we can close this ticket.

          jay Jinshan Xiong (Inactive) added a comment - Hi Jodi, the patch has to be landed before we can close this ticket.

          Based on the results, can this ticket be closed?

          jlevi Jodi Levi (Inactive) added a comment - Based on the results, can this ticket be closed?

          To update, we've pull in 4659, did some testing on Sequoia, and have reason to believe that patch fixes this issue.

          prakash Prakash Surya (Inactive) added a comment - To update, we've pull in 4659 , did some testing on Sequoia, and have reason to believe that patch fixes this issue.

          Ok, I've pulled 4659 into our 2.3.56-llnl branch.

          morrone Christopher Morrone (Inactive) added a comment - Ok, I've pulled 4659 into our 2.3.56-llnl branch.
          jay Jinshan Xiong (Inactive) added a comment - patch is at: http://review.whamcloud.com/4659

          We already have that patch since we're based on 2.3.56.

          morrone Christopher Morrone (Inactive) added a comment - We already have that patch since we're based on 2.3.56.

          duplicate of LU-2286. Please apply fix 300526627be80c16be05dd1abd9a05b563dc0763

          jay Jinshan Xiong (Inactive) added a comment - duplicate of LU-2286 . Please apply fix 300526627be80c16be05dd1abd9a05b563dc0763

          People

            jay Jinshan Xiong (Inactive)
            morrone Christopher Morrone (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: