Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor

    Description

      There are two interesting stack traces:

      PID: 6269     TASK: ffff9c4a9c2243c0  CPU: 1    COMMAND: "ll_ost_io00_001"
       #0 [ffff9c4a9c2eba50] __schedule at ffffffff9b6f6bd6
          /tmp/kernel/kernel/sched/core.c: 3755
       #1 [ffff9c4a9c2ebaa8] schedule at ffffffff9b6f7170
          /tmp/kernel/kernel/sched/core.c: 4602
       #2 [ffff9c4a9c2ebac0] schedule_preempt_disabled at ffffffff9b6f75bf
          /tmp/kernel/kernel/sched/core.c: 4661
       #3 [ffff9c4a9c2ebac8] rwsem_down_read_slowpath at ffffffff9b6facc0
          /tmp/kernel/kernel/locking/rwsem.c: 1088
       #4 [ffff9c4a9c2ebb50] down_read_nested at ffffffff9b12bb71
          /tmp/kernel/./include/linux/err.h: 36
       #5 [ffff9c4a9c2ebb68] osd_read_lock at ffffffffc0b3b643 [osd_ldiskfs]
          /home/lustre/master-mine/lustre/osd-ldiskfs/osd_handler.c: 2809
       #6 [ffff9c4a9c2ebb90] ofd_preprw at ffffffffc0ec7093 [ofd]
          /home/lustre/master-mine/lustre/ofd/ofd_internal.h: 207
       #7 [ffff9c4a9c2ebc38] tgt_brw_read at ffffffffc05af2c9 [ptlrpc]
          /home/lustre/master-mine/libcfs/include/libcfs/libcfs_debug.h: 126
       #8 [ffff9c4a9c2ebda8] tgt_request_handle at ffffffffc05ad03c [ptlrpc]
          /home/lustre/master-mine/lustre/include/lu_target.h: 638
       #9 [ffff9c4a9c2ebe20] ptlrpc_main at ffffffffc04fbb53 [ptlrpc]
          /home/lustre/master-mine/lustre/include/lustre_net.h: 2413
      #10 [ffff9c4a9c2ebf10] kthread at ffffffff9b10383e
          /tmp/kernel/kernel/kthread.c: 354
      #11 [ffff9c4a9c2ebf50] ret_from_fork at ffffffff9b8001c4
          /tmp/kernel/arch/x86/entry/entry_64.S: 328
      
      PID: 11853    TASK: ffff9c4af1f94c40  CPU: 0    COMMAND: "ll_ost_io00_011"
       #0 [ffff9c4af13bfa58] __schedule at ffffffff9b6f6bd6
          /tmp/kernel/kernel/sched/core.c: 3755
       #1 [ffff9c4af13bfab0] schedule at ffffffff9b6f7170
          /tmp/kernel/kernel/sched/core.c: 4602
       #2 [ffff9c4af13bfac8] io_schedule at ffffffff9b6f769d
          /tmp/kernel/./arch/x86/include/asm/current.h: 15
       #3 [ffff9c4af13bfad8] __lock_page at ffffffff9b1d405d
          /tmp/kernel/./arch/x86/include/asm/current.h: 15
       #4 [ffff9c4af13bfb68] pagecache_get_page at ffffffff9b1d5207
          /tmp/kernel/./include/linux/pagemap.h: 480
       #5 [ffff9c4af13bfba8] ldiskfs_block_zero_page_range at ffffffffc0abeedc [ldiskfs]
          /home/lustre/master-mine/ldiskfs/inode.c: 4043
       #6 [ffff9c4af13bfc00] ldiskfs_truncate at ffffffffc0ac4b6c [ldiskfs]
          /home/lustre/master-mine/ldiskfs/inode.c: 4169
       #7 [ffff9c4af13bfc40] osd_execute_truncate at ffffffffc0b63f78 [osd_ldiskfs]
          /home/lustre/linux-4.18.0-477.15.1.el8_8/include/linux/fs.h: 792
       #8 [ffff9c4af13bfc78] osd_punch at ffffffffc0b64249 [osd_ldiskfs]
          /home/lustre/master-mine/libcfs/include/libcfs/libcfs_debug.h: 126
       #9 [ffff9c4af13bfcb8] ofd_object_punch at ffffffffc0ec2421 [ofd]
          /home/lustre/master-mine/lustre/ofd/ofd_objects.c: 986
      #10 [ffff9c4af13bfd30] ofd_punch_hdl at ffffffffc0eaa09c [ofd]
          /home/lustre/master-mine/lustre/ofd/ofd_dev.c: 2131
      #11 [ffff9c4af13bfda8] tgt_request_handle at ffffffffc05ad03c [ptlrpc]
          /home/lustre/master-mine/lustre/include/lu_target.h: 638
      #12 [ffff9c4af13bfe20] ptlrpc_main at ffffffffc04fbb53 [ptlrpc]
          /home/lustre/master-mine/lustre/include/lustre_net.h: 2413
      #13 [ffff9c4af13bff10] kthread at ffffffff9b10383e
          /tmp/kernel/kernel/kthread.c: 354
      #14 [ffff9c4af13bff50] ret_from_fork at ffffffff9b8001c4
          /tmp/kernel/arch/x86/entry/entry_64.S: 328
      

      It is not clear how it is possible to have an overlapping truncate and read.


          Activity

            [LU-18607] a deadlock in sanityn/16k

            Andreas, my test was added to catch a client-side problem, while the backtraces point at the OFD side.
            Most likely this is a regression from the OFD external truncate. A similar bug hit Cray with a group lock.

            crash> bt ffff8b078a744740
            PID: 437687 TASK: ffff8b078a744740 CPU: 30 COMMAND: "ll_ost_io07_060"
            #0 [ffff9997255ef818] __schedule at ffffffff8df4e1d4
            #1 [ffff9997255ef878] schedule at ffffffff8df4e648
            #2 [ffff9997255ef888] rwsem_down_read_slowpath at ffffffff8df511d0
            #3 [ffff9997255ef920] osd_trunc_lock at ffffffffc1f2bd0c [osd_ldiskfs]
            #4 [ffff9997255ef950] osd_declare_write_commit at ffffffffc1f2cf5e [osd_ldiskfs]
            #5 [ffff9997255efa00] ofd_commitrw_write at ffffffffc1d36fda [ofd]
            #6 [ffff9997255efaa0] ofd_commitrw at ffffffffc1d3c581 [ofd]
            #7 [ffff9997255efb58] obd_commitrw at ffffffffc172759c [ptlrpc]
            #8 [ffff9997255efbd0] tgt_brw_write at ffffffffc172fde7 [ptlrpc]
            #9 [ffff9997255efd50] tgt_request_handle at ffffffffc1731453 [ptlrpc]
            #10 [ffff9997255efdd0] ptlrpc_server_handle_request at ffffffffc16dd883 [ptlrpc]
            #11 [ffff9997255efe38] ptlrpc_main at ffffffffc16df2f6 [ptlrpc]
            #12 [ffff9997255eff10] kthread at ffffffff8d7043a6
            #13 [ffff9997255eff50] ret_from_fork at ffffffff8e00023f
            
            PID: 437638 TASK: ffff8b0da9858000 CPU: 31 COMMAND: "ll_ost_io07_011"
            #0 [ffff99972544f630] __schedule at ffffffff8df4e1d4
            #1 [ffff99972544f690] schedule at ffffffff8df4e648
            #2 [ffff99972544f6a0] io_schedule at ffffffff8df4ea62
            #3 [ffff99972544f6b0] __lock_page at ffffffff8d860651
            #4 [ffff99972544f740] mpage_prepare_extent_to_map at ffffffffc1e6abf5 [ldiskfs]
            #5 [ffff99972544f818] ldiskfs_writepages at ffffffffc1e6fc34 [ldiskfs]
            #6 [ffff99972544f958] do_writepages at ffffffff8d86d291
            #7 [ffff99972544f9d0] __filemap_fdatawrite_range at ffffffff8d8646ae
            #8 [ffff99972544fa60] filemap_write_and_wait_range at ffffffff8d8647d0
            #9 [ffff99972544fa88] osd_execute_truncate at ffffffffc1f2d46f [osd_ldiskfs]
            #10 [ffff99972544fac0] osd_process_truncates at ffffffffc1f2dbbd [osd_ldiskfs]
            #11 [ffff99972544fb10] osd_trans_stop at ffffffffc1f10123 [osd_ldiskfs]
            #12 [ffff99972544fbe8] ofd_object_punch at ffffffffc1d3388b [ofd]
            #13 [ffff99972544fcd0] ofd_punch_hdl at ffffffffc1d1fb0c [ofd]
            #14 [ffff99972544fd50] tgt_request_handle at ffffffffc1731453 [ptlrpc]
            #15 [ffff99972544fdd0] ptlrpc_server_handle_request at ffffffffc16dd883 [ptlrpc]
            #16 [ffff99972544fe38] ptlrpc_main at ffffffffc16df2f6 [ptlrpc]
            #17 [ffff99972544ff10] kthread at ffffffff8d7043a6
            #18 [ffff99972544ff50] ret_from_fork at ffffffff8e00023f
            
            The first thread holds the page lock on the page with index=13 and attempts to take trunc_lock.
            The second thread holds trunc_lock and attempts to take the page lock, i.e. a classic ABBA lock-ordering deadlock.
            
            shadow Alexey Lyashkov added a comment

            shadow, it looks like this test was added in your patch https://review.whamcloud.com/53550 ("LU-17364 llite: don't use stale page"). Do you know whether this failure is related to the original problem addressed by your patch (i.e. a regression), or is it some other new problem that is caught by the test case?

            adilger Andreas Dilger added a comment

            People

              wc-triage WC Triage
              bzzz Alex Zhuravlev
              Votes: 0
              Watchers: 3
