Lustre / LU-18738

lfs hang when non-target file system is disconnected

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.17.0
    • Affects Version: Lustre 2.15.6
    • Environment: kernel-4.18.0-553.34.1.1toss.t4.x86_64
      zfs-2.2.7_1llnl-2.t4.x86_64
      lustre-2.15.6_4.llnl-2.t4.x86_64
    • Severity: 3

    Description

      We have many nodes that mount multiple Lustre file systems. When a user issues an lfs command that targets a particular file system, for example lfs path2fid, the code enters get_root_path_slow() and issues a statx() call against each mounted Lustre file system until the target file system is reached. If the client is disconnected from MDT0000 of one of the file systems visited before the target, that statx() hangs and the target file system is never reached.

      [root@rzslic9:~]# strace -ftT lfs fid2path /p/lustre1 [0x24006b523:0x5ecf:0x0]
      ...
      11:37:09 read(3, "latime,vers=3,rsize=65536,wsize="..., 1024) = 1024 <0.000050>
      11:37:09 read(3, "ize=65536,wsize=65536,namlen=255"..., 1024) = 1024 <0.000030>
      11:37:09 read(3, "e=65536,namlen=255,hard,proto=tc"..., 1024) = 1024 <0.000033>
      11:37:09 read(3, "=255,hard,proto=tcp,timeo=600,re"..., 1024) = 1024 <0.000035>
      11:37:09 read(3, "255,hard,proto=tcp,timeo=600,ret"..., 1024) = 1024 <0.000044>
      11:37:09 statx(AT_FDCWD, "/p/czlustre1", AT_STATX_SYNC_AS_STAT, 0, {stx_mask=0, stx_attributes=0, stx_mode=S_IFDIR|0755, stx_size=1074176, ...}) = 0 <0.000049>
      11:37:09 statx(AT_FDCWD, "/p/czlustre2", AT_STATX_SYNC_AS_STAT, 0, {stx_mask=0, stx_attributes=0, stx_mode=S_IFDIR|0755, stx_size=1057280, ...}) = 0 <0.000023>
      11:37:09 statx(AT_FDCWD, "/p/czlustre3", AT_STATX_SYNC_AS_STAT, 0, {stx_mask=0, stx_attributes=0, stx_mode=S_IFDIR|0755, stx_size=1073664, ...}) = 0 <0.000018>
      11:37:09 statx(AT_FDCWD, "/p/czlustre4", AT_STATX_SYNC_AS_STAT, 0, 
      <<HUNG HERE>>

      Another affected command is lfs df <mountpoint>. The stack at the point of the hang is:

      (gdb) bt
      #0  0x00001555539afedf in statx () from /lib64/libc.so.6
      #1  0x00001555550ec3ef in get_file_dev (path=<optimized out>, dev=0x7fffffff8b38) at liblustreapi.c:1222
      #2  0x00001555550f44c8 in get_root_path_slow (want=want@entry=3, fsname=fsname@entry=0x7fffffffbd60 "", 
          outfd=outfd@entry=0x0, path=path@entry=0x7fffffff9d60 "/p/lustre1", index=index@entry=-1, 
          dev=dev@entry=0x0, nid=0x0) at liblustreapi.c:1357
      #3  0x00001555550f4aac in get_root_path (want=3, fsname=0x7fffffffbd60 "", outfd=0x0, 
          path=0x7fffffff9d60 "/p/lustre1", index=-1, dev=0x0, nid=0x0) at liblustreapi.c:1444
      #4  0x00001555550f4c69 in llapi_search_mounts (pathname=pathname@entry=0x7fffffffad60 "/p/lustre1", 
          index=index@entry=0, mntdir=mntdir@entry=0x7fffffff9d60 "/p/lustre1", 
          fsname=fsname@entry=0x7fffffffbd60 "") at liblustreapi.c:1487
      #5  0x000000000040e152 in lfs_df (argc=<optimized out>, argv=0x7fffffffcec0) at lfs.c:7269
      #6  0x0000155555105511 in Parser_execarg (argc=argc@entry=2, argv=argv@entry=0x7fffffffcec0, 
          cmds=cmds@entry=0x62d6e0 <cmdlist>) at util/parser.c:118
      #7  0x0000000000404e6c in main (argc=3, argv=0x7fffffffceb8) at lfs.c:12737
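
The lookup described above can in principle be done from the mount table alone, without touching any mount point. The following is a minimal sketch of that idea, not the merged patch: it assumes the caller passes an absolute, already-canonical path, and the names find_lustre_mount() and path_is_under() are hypothetical helpers, not liblustreapi functions. It relies only on the standard getmntent(3) interface.

```c
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if `path` is the mount point `dir` itself or lies below it. */
static int path_is_under(const char *path, const char *dir)
{
	size_t len = strlen(dir);

	if (len == 0 || strncmp(path, dir, len) != 0)
		return 0;
	/* "/" contains everything; otherwise require a component boundary */
	return len == 1 || path[len] == '\0' || path[len] == '/';
}

/*
 * Hypothetical helper: scan the mount table file `mtab` (for example
 * /proc/mounts) and copy into `mntdir` the longest Lustre mount point
 * that textually contains `path`. No stat()/statx() is issued, so a
 * disconnected file system elsewhere in the table cannot block the scan.
 * Returns 0 on success, -1 if no Lustre mount matches or on error.
 */
int find_lustre_mount(const char *mtab, const char *path,
		      char *mntdir, size_t size)
{
	struct mntent *ent;
	size_t best = 0;
	int found = -1;
	FILE *fp = setmntent(mtab, "r");

	if (fp == NULL)
		return -1;

	while ((ent = getmntent(fp)) != NULL) {
		size_t len = strlen(ent->mnt_dir);

		if (strcmp(ent->mnt_type, "lustre") != 0)
			continue;
		if (!path_is_under(path, ent->mnt_dir))
			continue;
		/* prefer the deepest matching mount point that fits */
		if (len >= best && len < size) {
			strcpy(mntdir, ent->mnt_dir);
			best = len;
			found = 0;
		}
	}
	endmntent(fp);
	return found;
}
```

A caller would invoke it as find_lustre_mount("/proc/mounts", "/p/lustre1", buf, sizeof(buf)). Because the match is purely textual, symlinks and relative paths would have to be resolved by the caller first, which is one reason a device-number comparison may still be needed as a fallback.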
      

        Activity

          Gerrit Updater (gerrit) added a comment:

          "Olaf Faaland <faaland1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/58321
          Subject: LU-18738 utils: avoid statx() of root of mounted FS
          Project: fs/lustre-release
          Branch: b2_15
          Current Patch Set: 1
          Commit: 342d738245897f8e10079b27495ebe91d1b0ba69

          Peter Jones (pjones) added a comment:

          Merged for 2.17

          Gerrit Updater (gerrit) added a comment:

          "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/58135/
          Subject: LU-18738 utils: avoid statx() of root of mounted FS
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 2da8542e7069af71566a5d36d53fdc840a63228a

          Olaf Faaland (ofaaland) added a comment:

          For my reference, my local issue is TOSS6459

          Olaf Faaland (ofaaland) added a comment (edited):

          Note that get_root_path_slow() has this comment before the call to get_file_dev(), which in turn calls statx():

          /* ignore unaccessible filesystem */
          if (get_file_dev(mnt.mnt_dir, &devmnt))

          It's possible that statx() was not supposed to hang under these circumstances and was instead supposed to return an error. If so, then my change is probably wrong, and I can investigate that more.
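
To illustrate the distinction raised in this comment between a call that returns an error and a call that hangs, here is a minimal sketch of a bounded-time probe. It is an illustration only and is not what either patch does: probe_mount() is a hypothetical helper that runs stat() in a child process and gives up after a timeout. Whether the SIGKILL takes effect promptly depends on how the kernel is blocked, which this sketch cannot guarantee.

```c
#include <signal.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: stat `path` in a child process so the caller
 * never blocks for longer than roughly `timeout_ms`.
 * Returns 0 if stat() succeeded, 1 if it returned an error,
 * 2 if it did not finish in time, -1 if fork()/waitpid() failed.
 */
int probe_mount(const char *path, int timeout_ms)
{
	struct timespec tick = { 0, 10 * 1000 * 1000 };	/* 10 ms */
	int status = 0;
	int waited = 0;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0) {
		struct stat st;

		/* child: this is the call that may block indefinitely */
		_exit(stat(path, &st) == 0 ? 0 : 1);
	}

	for (;;) {
		pid_t r = waitpid(pid, &status, WNOHANG);

		if (r == pid)
			return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
		if (r < 0)
			return -1;
		if (waited >= timeout_ms)
			break;
		nanosleep(&tick, NULL);
		waited += 10;
	}

	/* timed out: ask the kernel to kill the child, but do not wait for it */
	kill(pid, SIGKILL);
	waitpid(pid, &status, WNOHANG);
	return 2;
}
```

With such a probe, an "ignore unaccessible filesystem" branch could treat return values 1 and 2 alike and skip the entry. The costs are an extra fork per mount point and a possible lingering child process, so avoiding the call altogether is likely the simpler route.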

          Gerrit Updater (gerrit) added a comment:

          "Olaf Faaland <faaland1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/58135
          Subject: LU-18738 utils: avoid statx() of root of mounted FS
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: cc94b5857c39ee7f7a50bcbeb41233d15d6f9f96

          People

            Assignee: Olaf Faaland (ofaaland)
            Reporter: Olaf Faaland (ofaaland)