LU-14465: unlink fails if the metadata migration is running behind


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version: Lustre 2.14.0
    • Severity: 3

    Description

      Created 6.4M files, and a directory restripe was automatically triggered for unknown reasons (already filed as a separate ticket, LU-14464).

      [root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=16  mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 10000 -F -v -d /ai400x/testdir/ -C
      
      [root@ec01 ~]#  lfs getdirstripe /ai400x/testdir/test-dir.0-0/mdtest_tree.0/
      lmv_stripe_count: 4 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64,migrating
      mdtidx		 FID[seq:oid:ver]
           0		 [0x200000d46:0x1f339:0x0]		
           2		 [0x2c0000c06:0xa6d6:0x0]		
           1		 [0x300000c07:0xa7a1:0x0]		
           3		 [0x340000c07:0xa9d1:0x0]
      

      Anyway, when all files are removed while the metadata migration is still running behind, unlink() fails for some files with errno 2 (No such file or directory), as if they had already been removed.
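
      The inconsistency can be stated independently of mdtest: unlink() returns ENOENT for an entry that a later lookup still finds. A minimal sketch of the check, using the path from the failing run below for illustration only:

      # Sketch of the observed inconsistency; not part of the original run.
      F=/ai400x/testdir/test-dir.0-0/mdtest_tree.0/file.mdtest.550.4111
      rm "$F"     # fails: "No such file or directory" (errno 2)
      ls "$F"     # ...yet the entry is still visible afterwards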

      [root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=16  mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 10000 -F -v -d /ai400x/testdir/ -r
      salloc: Granted job allocation 7262
      V-1: Entering PrintTimestamp...
      -- started at 02/23/2021 10:50:03 --
      
      mdtest-3.3.0+dev was launched with 640 total task(s) on 40 node(s)
      Command line used: /work/tools/bin/mdtest '-n' '10000' '-F' '-v' '-d' '/ai400x/testdir/' '-r'
      V-1: Rank   0 Line   239 Entering parse_dirpath on /ai400x/testdir/...
      V-1: Rank   0 Line  1398 Entering valid_tests...
      V-1: Rank   0 Line  2015 api                     : (null)
      V-1: Rank   0 Line  2016 barriers                : True
      V-1: Rank   0 Line  2017 collective_creates      : False
      V-1: Rank   0 Line  2018 create_only             : False
      V-1: Rank   0 Line  2019 dirpath(s):
      V-1: Rank   0 Line  2021 	/ai400x/testdir/
      V-1: Rank   0 Line  2023 dirs_only               : False
      V-1: Rank   0 Line  2024 read_bytes              : 0
      V-1: Rank   0 Line  2025 read_only               : False
      V-1: Rank   0 Line  2026 first                   : 1
      V-1: Rank   0 Line  2027 files_only              : True
      V-1: Rank   0 Line  2031 iterations              : 1
      V-1: Rank   0 Line  2032 items_per_dir           : 0
      V-1: Rank   0 Line  2033 last                    : 0
      V-1: Rank   0 Line  2034 leaf_only               : False
      V-1: Rank   0 Line  2035 items                   : 10000
      V-1: Rank   0 Line  2036 nstride                 : 0
      V-1: Rank   0 Line  2037 pre_delay               : 0
      V-1: Rank   0 Line  2038 remove_only             : False
      V-1: Rank   0 Line  2039 random_seed             : 0
      V-1: Rank   0 Line  2040 stride                  : 1
      V-1: Rank   0 Line  2041 shared_file             : False
      V-1: Rank   0 Line  2042 time_unique_dir_overhead: False
      V-1: Rank   0 Line  2043 stone_wall_timer_seconds: 0
      V-1: Rank   0 Line  2044 stat_only               : False
      V-1: Rank   0 Line  2045 unique_dir_per_task     : False
      V-1: Rank   0 Line  2046 write_bytes             : 0
      V-1: Rank   0 Line  2047 sync_file               : False
      V-1: Rank   0 Line  2048 call_sync               : False
      V-1: Rank   0 Line  2049 depth                   : 0
      V-1: Rank   0 Line  2050 make_node               : 0
      V-1: Rank   0 Line  1490 Entering show_file_system_size on /ai400x/testdir
      Path: /ai400x/testdir
      FS: 52.4 TiB   Used FS: 0.0%   Inodes: 316.8 Mi   Used Inodes: 1.9%
      
      Nodemap: 1111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
      640 tasks, 6400000 files
      V-1: Rank   0 Line  2238 
      V-1: Rank   0 Line  2239    Operation               Duration              Rate
      V-1: Rank   0 Line  2240    ---------               --------              ----
      V-1: Rank   0 Line  1648 main: * iteration 1 *
      V-1: Rank   0 Line   481 Entering create_remove_items on /ai400x/testdir/test-dir.0-0/mdtest_tree.0, currDepth = 0...
      V-1: Rank   0 Line   412 Entering create_remove_items_helper on /ai400x/testdir/test-dir.0-0/mdtest_tree.0
      ior WARNING: [RANK 550]: unlink() of file "/ai400x/testdir/test-dir.0-0/mdtest_tree.0/file.mdtest.550.4111" failed
      , errno 2, No such file or directory 
      V-1: Rank   0 Line  1223   File creation     :          0.000 sec,          0.000 ops/sec
      V-1: Rank   0 Line  1227   File stat         :          0.000 sec,          0.000 ops/sec
      V-1: Rank   0 Line  1228   File read         :          0.000 sec,          0.000 ops/sec
      V-1: Rank   0 Line  1229   File removal      :       1230.781 sec,       5199.949 ops/sec
      V-1: Rank   0 Line  1573 Entering create_remove_directory_tree on /ai400x/testdir/test-dir.0-0, currDepth = 0...
      V-1: Rank   0 Line  1573 Entering create_remove_directory_tree on /ai400x/testdir/test-dir.0-0/mdtest_tree.0/, currDepth = 1...
      V-1: Entering PrintTimestamp...
      02/23/2021 11:10:33: Process 0: FAILED in create_remove_directory_tree, Unable to remove directory /ai400x/testdir/test-dir.0-0/mdtest_tree.0/: Directory not empty
      --------------------------------------------------------------------------
      MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 CREATE FROM 0
      with errorcode 1.
      
      NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
      You may or may not see output from other processes, depending on
      exactly when Open MPI kills them.
      --------------------------------------------------------------------------
      salloc: Relinquishing job allocation 7262
      

      The file that mdtest claimed was already removed had in fact not been removed yet:

      # ls /ai400x/testdir/test-dir.0-0/mdtest_tree.0/file.mdtest.550.4111 
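
      Until the underlying race is fixed, one possible client-side workaround is to wait for the migration to complete before removing the tree; the "migrating" flag disappears from lmv_hash_type in the lfs getdirstripe output shown above once it is done. A sketch, assuming the same directory as above:

      # Workaround sketch, not a fix: poll until the directory's
      # lmv_hash_type no longer carries the "migrating" flag, then remove.
      D=/ai400x/testdir/test-dir.0-0/mdtest_tree.0
      while lfs getdirstripe "$D" | grep -q migrating; do
          sleep 5
      done
      rm -rf "$D"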
      


People

              Assignee: wc-triage (WC Triage)
              Reporter: sihara (Shuichi Ihara)
