
Deadlock when truncating file during lfs migrate

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.4.2
    • 3
    • 13336

    Description

      If a process truncates a file while "lfs migrate" is migrating it, the lfs migrate process and the truncating process deadlock.

      Neither process ever finishes unless it is killed, and watchdog messages report that the processes have made no progress for the last XXX seconds.

      Here is a reproducer:

      [root@lustre24cli ~]# cat reproducer.sh
      #!/bin/sh
      
      FS=/test
      FILE=${FS}/file
      
      rm -f ${FILE}
      # Create a file on OST 1 of size 512M
      lfs setstripe -o 1 -c 1 ${FILE}
      dd if=/dev/zero of=${FILE} bs=1M count=512
      
      echo 3 > /proc/sys/vm/drop_caches
      
      # Launch a blocking migrate to OST 0, then 2 seconds later reopen the
      # file for writing (dd truncates its output file on open by default)
      lfs migrate -i 0 --block ${FILE} &
      sleep 2
      dd if=/dev/zero of=${FILE} bs=1M count=512 
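
      The truncate in this reproducer is implicit: by default dd truncates its output file when it opens it, which is why the dd stack goes through do_truncate from do_filp_open. The following is a minimal local sketch of that dd behaviour on an ordinary temporary file, with no Lustre involved. The helper name size_after_dd is hypothetical, and the report does not establish whether conv=notrunc avoids the hang on an affected client.

```shell
#!/bin/sh
# Local illustration only: ordinary temp file, no Lustre involved.
# size_after_dd is a hypothetical helper, not part of the reproducer.

size_after_dd() {
    # $1: optional extra dd operand, e.g. conv=notrunc (may be empty)
    f=$(mktemp)
    # Start with a 4 KiB file.
    dd if=/dev/zero of="$f" bs=1024 count=4 2>/dev/null
    # Rewrite only the first 1 KiB; $1 is left unquoted on purpose so an
    # empty value expands to no operand at all.
    dd if=/dev/zero of="$f" bs=1024 count=1 $1 2>/dev/null
    # Print the resulting size in bytes.
    wc -c < "$f" | tr -d ' '
    rm -f "$f"
}

echo "default:      $(size_after_dd "")"             # 1024: file was truncated on open
echo "conv=notrunc: $(size_after_dd conv=notrunc)"   # 4096: original length kept
```

      In the reproducer, the second dd uses the default mode, so its open() of the file triggers the truncate that collides with the running migration.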
      

      Once the last dd tries to open the file, the lfs and dd processes both hang forever with the following stacks:

      lfs stack:

      [<ffffffff8128e864>] call_rwsem_down_read_failed+0x14/0x30
      [<ffffffffa08d98dd>] ll_file_io_generic+0x29d/0x600 [lustre]
      [<ffffffffa08d9d7f>] ll_file_aio_read+0x13f/0x2c0 [lustre]
      [<ffffffffa08da61c>] ll_file_read+0x16c/0x2a0 [lustre]
      [<ffffffff811896b5>] vfs_read+0xb5/0x1a0
      [<ffffffff811897f1>] sys_read+0x51/0x90
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      dd stack:

      [<ffffffffa03436fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
      [<ffffffffa04779fa>] cl_lock_state_wait+0x1aa/0x320 [obdclass]
      [<ffffffffa04781eb>] cl_enqueue_locked+0x15b/0x1f0 [obdclass]
      [<ffffffffa0478d6e>] cl_lock_request+0x7e/0x270 [obdclass]
      [<ffffffffa047e00c>] cl_io_lock+0x3cc/0x560 [obdclass]
      [<ffffffffa047e242>] cl_io_loop+0xa2/0x1b0 [obdclass]
      [<ffffffffa092a8c8>] cl_setattr_ost+0x208/0x2c0 [lustre]
      [<ffffffffa08f8a0e>] ll_setattr_raw+0x9ce/0x1000 [lustre]
      [<ffffffffa08f909b>] ll_setattr+0x5b/0xf0 [lustre]
      [<ffffffff811a7348>] notify_change+0x168/0x340
      [<ffffffff81187074>] do_truncate+0x64/0xa0
      [<ffffffff8119bcc1>] do_filp_open+0x861/0xd20
      [<ffffffff81185d39>] do_sys_open+0x69/0x140
      [<ffffffff81185e50>] sys_open+0x20/0x30
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
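
      Stacks like the two above can be gathered from a hung client with a small helper. This is a sketch under assumptions: the report does not say how the stacks were collected, and the sketch assumes a Linux kernel that exposes /proc/<pid>/stack, a root shell, and the pgrep utility. The function name dump_stacks is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the kernel stack of every process whose
# name exactly matches one of the arguments.
# Assumes /proc/<pid>/stack exists (kernel built with stack tracing)
# and that the caller is root; otherwise a note is printed instead.

dump_stacks() {
    for name in "$@"; do
        # pgrep -x matches the process name exactly.
        for pid in $(pgrep -x "$name"); do
            echo "== $name (pid $pid) =="
            cat "/proc/$pid/stack" 2>/dev/null ||
                echo "(cannot read /proc/$pid/stack)"
        done
    done
}

# Usage after running the reproducer:
dump_stacks lfs dd
```

      Running it a few seconds apart and seeing identical stacks each time is a reasonable sign that the processes are stuck rather than slow.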
      

          Activity

            pjones Peter Jones made changes -
            Link New: This issue is related to LDEV-262 [ LDEV-262 ]
            riauxjb Jean-Baptiste Riaux (Inactive) added a comment - Backport to b2_7_fe http://review.whamcloud.com/#/c/21513/
            riauxjb Jean-Baptiste Riaux (Inactive) made changes -
            Labels New: cea
            pjones Peter Jones made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-6785 [ LU-6785 ]

            adilger Andreas Dilger added a comment - Closing LU-4840 since the patches here are landed to fix the problems described here, except http://review.whamcloud.com/13669 to enable racer with object migrate. LU-7073 has been opened to track the new racer + object migrate failure.
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-7073 [ LU-7073 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-6903 [ LU-6903 ]

            sebastien.buisson Sebastien Buisson (Inactive) added a comment -

            Hi,

            Now that patch at http://review.whamcloud.com/10013 has been merged into master, is it possible to have a backport to b2_5?
            Indeed, the original issue was met on one of our customer clusters running Lustre 2.4, that was updated afterwards in Lustre 2.5.

            Thanks,
            Sebastien.

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/10013/
            Subject: LU-4840 lfs: Use file lease to implement migration
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 85bd36cc69563d7a79e3ed34f8fadb4ed1a72b7c

            People

              bobijam Zhenyu Xu
              patrick.valentin Patrick Valentin (Inactive)