[LU-18738] lfs hang when non-target file system is disconnected

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.17.0
    • Affects Version/s: Lustre 2.15.6
    • Environment:
      kernel-4.18.0-553.34.1.1toss.t4.x86_64
      zfs-2.2.7_1llnl-2.t4.x86_64
      lustre-2.15.6_4.llnl-2.t4.x86_64
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      We have many nodes that mount multiple Lustre file systems.  When a user issues an lfs command that targets a particular file system (for example, lfs path2fid), the code enters get_root_path_slow() and issues a statx() call against each mounted Lustre file system in turn until it reaches the target file system.  If the client is disconnected from MDT0000 of one of the file systems checked before the target, that statx() hangs and the command never reaches the target file system.

      [root@rzslic9:~]# strace -ftT lfs fid2path /p/lustre1 [0x24006b523:0x5ecf:0x0]
      ...
      11:37:09 read(3, "latime,vers=3,rsize=65536,wsize="..., 1024) = 1024 <0.000050>
      11:37:09 read(3, "ize=65536,wsize=65536,namlen=255"..., 1024) = 1024 <0.000030>
      11:37:09 read(3, "e=65536,namlen=255,hard,proto=tc"..., 1024) = 1024 <0.000033>
      11:37:09 read(3, "=255,hard,proto=tcp,timeo=600,re"..., 1024) = 1024 <0.000035>
      11:37:09 read(3, "255,hard,proto=tcp,timeo=600,ret"..., 1024) = 1024 <0.000044>
      11:37:09 statx(AT_FDCWD, "/p/czlustre1", AT_STATX_SYNC_AS_STAT, 0, {stx_mask=0, stx_attributes=0, stx_mode=S_IFDIR|0755, stx_size=1074176, ...}) = 0 <0.000049>
      11:37:09 statx(AT_FDCWD, "/p/czlustre2", AT_STATX_SYNC_AS_STAT, 0, {stx_mask=0, stx_attributes=0, stx_mode=S_IFDIR|0755, stx_size=1057280, ...}) = 0 <0.000023>
      11:37:09 statx(AT_FDCWD, "/p/czlustre3", AT_STATX_SYNC_AS_STAT, 0, {stx_mask=0, stx_attributes=0, stx_mode=S_IFDIR|0755, stx_size=1073664, ...}) = 0 <0.000018>
      11:37:09 statx(AT_FDCWD, "/p/czlustre4", AT_STATX_SYNC_AS_STAT, 0, 
      <<HUNG HERE>>
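
      The trace shows statx() being issued against mounts that cannot contain the requested path. One way to avoid touching unrelated mounts is to compare the path against each mount point as a string first, and only stat the mount that matches. The sketch below illustrates that idea only; path_under_mount() is a hypothetical helper, not a liblustreapi function, and this is not necessarily the change that went into Lustre 2.17.0. It also assumes the path is absolute and already canonical (no symlinks or ".." components), and it only helps when the caller supplies a path rather than a file system name.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not part of liblustreapi): returns true when
 * `path` is the mount point `mnt` itself or lies beneath it.
 * Uses string comparison only, so it never blocks on a dead mount. */
bool path_under_mount(const char *path, const char *mnt)
{
	size_t len = strlen(mnt);

	if (len == 0)
		return false;

	/* Ignore trailing slashes on the mount point, but keep "/" as is. */
	while (len > 1 && mnt[len - 1] == '/')
		len--;

	if (strncmp(path, mnt, len) != 0)
		return false;

	/* The root mount "/" contains every absolute path. */
	if (len == 1 && mnt[0] == '/')
		return true;

	/* Require a component boundary so that "/p/lustre10"
	 * is not treated as being under "/p/lustre1". */
	return path[len] == '\0' || path[len] == '/';
}
```

      With a check like this, a walk over the mount table for /p/lustre1 would skip /p/czlustre1 through /p/czlustre4 without issuing any system call against them.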

      Another affected command is lfs df <mountpoint>. The stack is:

      (gdb) bt
      #0  0x00001555539afedf in statx () from /lib64/libc.so.6
      #1  0x00001555550ec3ef in get_file_dev (path=<optimized out>, dev=0x7fffffff8b38) at liblustreapi.c:1222
      #2  0x00001555550f44c8 in get_root_path_slow (want=want@entry=3, fsname=fsname@entry=0x7fffffffbd60 "", 
          outfd=outfd@entry=0x0, path=path@entry=0x7fffffff9d60 "/p/lustre1", index=index@entry=-1, 
          dev=dev@entry=0x0, nid=0x0) at liblustreapi.c:1357
      #3  0x00001555550f4aac in get_root_path (want=3, fsname=0x7fffffffbd60 "", outfd=0x0, 
          path=0x7fffffff9d60 "/p/lustre1", index=-1, dev=0x0, nid=0x0) at liblustreapi.c:1444
      #4  0x00001555550f4c69 in llapi_search_mounts (pathname=pathname@entry=0x7fffffffad60 "/p/lustre1", 
          index=index@entry=0, mntdir=mntdir@entry=0x7fffffff9d60 "/p/lustre1", 
          fsname=fsname@entry=0x7fffffffbd60 "") at liblustreapi.c:1487
      #5  0x000000000040e152 in lfs_df (argc=<optimized out>, argv=0x7fffffffcec0) at lfs.c:7269
      #6  0x0000155555105511 in Parser_execarg (argc=argc@entry=2, argv=argv@entry=0x7fffffffcec0, 
          cmds=cmds@entry=0x62d6e0 <cmdlist>) at util/parser.c:118
      #7  0x0000000000404e6c in main (argc=3, argv=0x7fffffffceb8) at lfs.c:12737
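
      Because the hang occurs inside a blocking system call (frame #0), a caller that cannot avoid probing a mount can at least bound the wait by doing the probe in a child process. The following is a minimal sketch of that workaround under stated assumptions: probe_path_with_timeout() is a hypothetical name, it uses plain stat() rather than the statx() call seen in the backtrace, and a child blocked in an uninterruptible wait may linger after SIGKILL until the mount recovers. It is not how liblustreapi handles this.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper (not part of liblustreapi).
 * Returns 0 if stat(path) succeeded within timeout_ms, 1 if it failed,
 * 2 if it did not finish in time, and -1 on an internal error. */
int probe_path_with_timeout(const char *path, int timeout_ms)
{
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: the only process that risks blocking on a dead mount. */
		struct stat st;

		_exit(stat(path, &st) == 0 ? 0 : 1);
	}

	/* Parent: poll for the child in 10 ms steps until the deadline. */
	const struct timespec step = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };

	for (int waited = 0; ; waited += 10) {
		int status;
		pid_t r = waitpid(pid, &status, WNOHANG);

		if (r == pid)
			return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
		if (r < 0 && errno != EINTR)
			return -1;
		if (waited >= timeout_ms)
			break;
		nanosleep(&step, NULL);
	}

	/* Timed out: ask the child to die, but do not block waiting for it. */
	kill(pid, SIGKILL);
	waitpid(pid, NULL, WNOHANG);
	return 2;
}
```

      A caller could treat a return value of 2 as "mount unresponsive, skip it" and continue with the next entry in the mount table.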
      

          People

            Assignee: Olaf Faaland (ofaaland)
            Reporter: Olaf Faaland (ofaaland)
            Votes: 0
            Watchers: 4