Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-2367

Hang in osc_extent_wait under fsync

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.4.0
    • Lustre 2.4.0
    • 2.3.56-2chaos (github.com/chaos/lustre)
    • 3
    • 5630

    Description

      We seem to be getting hangs in osc_extent_wait() under the fsync at the end of ior's write phase.

      2012-11-20 16:15:08.093629 {DefaultControlEventListener} [mmcs]{657}.0.0: sysiod        S 00000fff8c266bd0     0  4037   3078 0x00000000
      2012-11-20 16:15:08.093693 {DefaultControlEventListener} [mmcs]{657}.0.0: Call Trace:
      2012-11-20 16:15:08.093744 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc2b10] [c0000003e1fc2bc0] 0xc0000003e1fc2bc0 (unreliable)
      2012-11-20 16:15:08.093812 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc2ce0] [c000000000008de0] .__switch_to+0xc4/0x100
      2012-11-20 16:15:08.093885 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc2d70] [c00000000042b0e0] .schedule+0x858/0x9c0
      2012-11-20 16:15:08.093956 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3020] [8000000000aa20a0] .cfs_waitq_wait+0x10/0x30 [libcfs]
      2012-11-20 16:15:08.094010 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3090] [80000000046d93f0] .osc_extent_wait+0x620/0x7d0 [osc]
      2012-11-20 16:15:08.094062 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3290] [80000000046d9c80] .osc_cache_wait_range+0x6e0/0xcd0 [osc]
      2012-11-20 16:15:08.094127 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc34d0] [80000000046c4b00] .osc_io_fsync_end+0xf0/0x100 [osc]
      2012-11-20 16:15:08.094192 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3570] [800000000249962c] .cl_io_end+0xcc/0x2b0 [obdclass]
      2012-11-20 16:15:08.094257 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3620] [8000000005234d00] .lov_io_end_wrapper+0x180/0x1f0 [lov]
      2012-11-20 16:15:08.094325 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc36c0] [8000000005234e3c] .lov_io_fsync_end+0xcc/0x250 [lov]
      2012-11-20 16:15:08.094389 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3760] [800000000249962c] .cl_io_end+0xcc/0x2b0 [obdclass]
      2012-11-20 16:15:08.094446 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3810] [80000000024a2e14] .cl_io_loop+0x1a4/0x2c0 [obdclass]
      2012-11-20 16:15:08.094498 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc38c0] [8000000006a1332c] .cl_sync_file_range+0x3ac/0x660 [lustre]
      2012-11-20 16:15:08.094550 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc39b0] [8000000006a4d4b4] .ll_writepages+0xb4/0x280 [lustre]
      2012-11-20 16:15:08.094602 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3a60] [c00000000009d534] .do_writepages+0x44/0x74
      2012-11-20 16:15:08.094653 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3ad0] [c000000000095710] .__filemap_fdatawrite_range+0x80/0xa8
      2012-11-20 16:15:08.094705 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3be0] [c000000000095778] .filemap_write_and_wait_range+0x40/0x8c
      2012-11-20 16:15:08.094757 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3c80] [c0000000000fc054] .vfs_fsync_range+0x70/0x104
      2012-11-20 16:15:08.096720 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3d30] [c0000000000fc18c] .do_fsync+0x3c/0x6c
      2012-11-20 16:15:08.096787 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3dc0] [c0000000000fc1fc] .SyS_fsync+0x18/0x28
      2012-11-20 16:15:08.096846 {DefaultControlEventListener} [mmcs]{657}.0.0: [c0000003e1fc3e30] [c000000000000580] syscall_exit+0x0/0x2c
      

      The console message that seems to go with this event is:

      LustreError: 4037:0:(osc_cache.c:893:osc_extent_wait()) extent c0000002d415ade8@{[8176 -> 8191/8191], [2|0|+|cache|wiuY|c000000383eb0480], [1048576|16|+|-|c0000002d05f01e0|16|(null)]} lsfull-OST00d1-osc-c0000003cbcda000: wait ext to 0 timedout, recovery in progress?
      

      Attached will be a log file,RA0-ID-J01.log, showing all processes on the node. Note that I logged into the node and ran "sync", which also hung, so you will likely see a second backtrace from that in there. That was not part of the original problem.

      Attachments

        1. RA0-ID-J01.log
          1.10 MB
          Christopher Morrone

        Activity

          People

            jay Jinshan Xiong (Inactive)
            morrone Christopher Morrone (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: