  Lustre / LU-3949

umount is lazy and takes hours


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.0, Lustre 2.1.4
    • Environment: Lustre 2.1.4-5chaos on client, Lustre 2.1.4-5chaos on ldiskfs servers, Lustre 2.4.0-15chaos on zfs servers
    • Severity: 3
    • Rank: 10483

    Description

      With Lustre 2.1.4-5chaos, we are finding that clients are not honoring umount correctly. The sysadmins are using the normal "umount" command with no additional options, and it returns relatively quickly.

      Linux no longer has a record of the mount in /proc/mounts after the command returns, and the mount point (/p/lscratchrza) appears to be empty. However, the directory clearly still has a reference and cannot be removed:

      # rzzeus26 /p > ls -la lscratchrza
      total 0
      drwxr-xr-x 2 root root  40 Aug 13 11:24 .
      drwxr-xr-x 4 root root 140 Aug 13 11:24 ..
      # rzzeus26 /p > rmdir lscratchrza
      rmdir: failed to remove `lscratchrza': Device or resource busy
      

      When we look in /proc/fs/lustre it is clear that most, if not all, objects for this filesystem are still present in llite, osc, mdc, ldlm/namespace, etc.
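      For reference, the leftover state can be enumerated with a quick scan of /proc/fs/lustre. This is a hedged sketch, assuming the procfs layout described above; "lustre_leftovers" is a hypothetical helper name, not a Lustre utility:

```shell
# Sketch: list /proc/fs/lustre entries that still reference a given
# filesystem name after umount has returned. "lustre_leftovers" is a
# hypothetical helper, not part of Lustre.
lustre_leftovers() {
    fsname=$1
    for d in llite osc mdc "ldlm/namespaces"; do
        for e in /proc/fs/lustre/$d/*"$fsname"*; do
            # an unmatched glob leaves the literal pattern; skip it
            [ -e "$e" ] && echo "$e"
        done
    done
}
```

      Running something like "lustre_leftovers lscratchrza" after a clean unmount should print nothing; here it would show the surviving llite/osc/mdc/namespace objects.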

      The sysadmins issued the "umount /p/lscratchrza" command at around 9:42am, but this message did not appear on one of the nodes until over five hours later:

      2013-09-13 15:18:11 Lustre: Unmounted lsa-client
      

      So there appear to be at least two problems here:

      • umount is taking far too long
      • umount for Lustre is not blocking until the unmount is complete (it exhibits "lazy" umount behavior)
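      Since umount returns before the teardown actually finishes, the only completion signal available on the client seems to be the point at which the mount point directory becomes removable. A minimal polling sketch of that workaround (the "wait_rmdir" name is made up for illustration):

```shell
# Sketch: poll until a mount point directory can actually be rmdir'd,
# i.e. until the kernel has dropped its last reference to it.
# "wait_rmdir" is a hypothetical helper, not Lustre code.
wait_rmdir() {
    dir=$1
    timeout=${2:-300}    # seconds to wait before giving up
    waited=0
    while ! rmdir "$dir" 2>/dev/null; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

      In the incident described here, a call like "wait_rmdir /p/lscratchrza 21600" would have had to block for the roughly five and a half hours the teardown took.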

      I should note that this Lustre client node mounts two Lustre filesystems, and only one was being unmounted. I don't know whether it is significant yet, but the servers we were trying to umount are running Lustre 2.1.4-5chaos with ldiskfs, while the servers for the other filesystem are running Lustre 2.4.0-15chaos with zfs.

      I was not able to speed up the umount process by running the sync command or "echo 3 > /proc/sys/vm/drop_caches".

      I ran "foreach bt" under crash, but I don't see any processes obviously stuck sleeping in umount-related call paths.

      Real user applications are running on the client nodes while the umounts are going on. "lsof" does not list any usage under /p/lscratchrza (the filesystem that we are trying to unmount).
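      For a second opinion independent of lsof, one can scan /proc/*/fd directly for open files under the mount point. This sketch ("find_refs" is a hypothetical name) prints the /proc entry of any visible process holding an open fd under a given path prefix; entries belonging to other users are silently skipped unless run as root:

```shell
# Sketch: report /proc entries of processes with an open fd under a
# given path prefix. "find_refs" is a hypothetical helper used here
# only to cross-check lsof output.
find_refs() {
    prefix=$1
    for fd in /proc/[0-9]*/fd/*; do
        # unreadable fd dirs (other users' processes) are skipped
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            "$prefix"|"$prefix"/*) echo "${fd%%/fd/*}" ;;
        esac
    done | sort -u
}
```

      An empty result from "find_refs /p/lscratchrza" would be consistent with the lsof output above: no userspace process is pinning the filesystem.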


          People

            Assignee: Hongchao Zhang (hongchao.zhang)
            Reporter: Christopher Morrone (Inactive) (morrone)
            Votes: 0
            Watchers: 7
