Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
Lustre 2.14.0
-
None
-
3
-
9223372036854775807
Description
Created 6.4M files and restripe automatically triggered unknown reasons. (this is already filed as a separate ticket LU-14464)
[root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=16 mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 10000 -F -v -d /ai400x/testdir/ -C
[root@ec01 ~]# lfs getdirstripe /ai400x/testdir/test-dir.0-0/mdtest_tree.0/ lmv_stripe_count: 4 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64,migrating mdtidx FID[seq:oid:ver] 0 [0x200000d46:0x1f339:0x0] 2 [0x2c0000c06:0xa6d6:0x0] 1 [0x300000c07:0xa7a1:0x0] 3 [0x340000c07:0xa9d1:0x0]
Anyway, when it removes all files during metadata migration process is running behind, unlink operations fails for some files due to files were already removed.
[root@ec01 ~]# salloc -p 40n -N 40 --ntasks-per-node=16 mpirun -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --bind-to core:overload-allowed --allow-run-as-root /work/tools/bin/mdtest -n 10000 -F -v -d /ai400x/testdir/ -r salloc: Granted job allocation 7262 V-1: Entering PrintTimestamp... -- started at 02/23/2021 10:50:03 -- mdtest-3.3.0+dev was launched with 640 total task(s) on 40 node(s) Command line used: /work/tools/bin/mdtest '-n' '10000' '-F' '-v' '-d' '/ai400x/testdir/' '-r' V-1: Rank 0 Line 239 Entering parse_dirpath on /ai400x/testdir/... V-1: Rank 0 Line 1398 Entering valid_tests... V-1: Rank 0 Line 2015 api : (null) V-1: Rank 0 Line 2016 barriers : True V-1: Rank 0 Line 2017 collective_creates : False V-1: Rank 0 Line 2018 create_only : False V-1: Rank 0 Line 2019 dirpath(s): V-1: Rank 0 Line 2021 /ai400x/testdir/ V-1: Rank 0 Line 2023 dirs_only : False V-1: Rank 0 Line 2024 read_bytes : 0 V-1: Rank 0 Line 2025 read_only : False V-1: Rank 0 Line 2026 first : 1 V-1: Rank 0 Line 2027 files_only : True V-1: Rank 0 Line 2031 iterations : 1 V-1: Rank 0 Line 2032 items_per_dir : 0 V-1: Rank 0 Line 2033 last : 0 V-1: Rank 0 Line 2034 leaf_only : False V-1: Rank 0 Line 2035 items : 10000 V-1: Rank 0 Line 2036 nstride : 0 V-1: Rank 0 Line 2037 pre_delay : 0 V-1: Rank 0 Line 2038 remove_only : False V-1: Rank 0 Line 2039 random_seed : 0 V-1: Rank 0 Line 2040 stride : 1 V-1: Rank 0 Line 2041 shared_file : False V-1: Rank 0 Line 2042 time_unique_dir_overhead: False V-1: Rank 0 Line 2043 stone_wall_timer_seconds: 0 V-1: Rank 0 Line 2044 stat_only : False V-1: Rank 0 Line 2045 unique_dir_per_task : False V-1: Rank 0 Line 2046 write_bytes : 0 V-1: Rank 0 Line 2047 sync_file : False V-1: Rank 0 Line 2048 call_sync : False V-1: Rank 0 Line 2049 depth : 0 V-1: Rank 0 Line 2050 make_node : 0 V-1: Rank 0 Line 1490 Entering show_file_system_size on /ai400x/testdir Path: /ai400x/testdir FS: 52.4 TiB Used FS: 0.0% Inodes: 316.8 Mi Used Inodes: 1.9% Nodemap: 1111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 640 tasks, 6400000 files V-1: Rank 0 Line 2238 V-1: Rank 0 Line 2239 Operation Duration Rate V-1: Rank 0 Line 2240 --------- -------- ---- V-1: Rank 0 Line 1648 main: * iteration 1 * V-1: Rank 0 Line 481 Entering create_remove_items on /ai400x/testdir/test-dir.0-0/mdtest_tree.0, currDepth = 0... V-1: Rank 0 Line 412 Entering create_remove_items_helper on /ai400x/testdir/test-dir.0-0/mdtest_tree.0 ior WARNING: [RANK 550]: unlink() of file "/ai400x/testdir/test-dir.0-0/mdtest_tree.0/file.mdtest.550.4111" failed , errno 2, No such file or directory V-1: Rank 0 Line 1223 File creation : 0.000 sec, 0.000 ops/sec V-1: Rank 0 Line 1227 File stat : 0.000 sec, 0.000 ops/sec V-1: Rank 0 Line 1228 File read : 0.000 sec, 0.000 ops/sec V-1: Rank 0 Line 1229 File removal : 1230.781 sec, 5199.949 ops/sec V-1: Rank 0 Line 1573 Entering create_remove_directory_tree on /ai400x/testdir/test-dir.0-0, currDepth = 0... V-1: Rank 0 Line 1573 Entering create_remove_directory_tree on /ai400x/testdir/test-dir.0-0/mdtest_tree.0/, currDepth = 1... V-1: Entering PrintTimestamp... 02/23/2021 11:10:33: Process 0: FAILED in create_remove_directory_tree, Unable to remove directory /ai400x/testdir/test-dir.0-0/mdtest_tree.0/: Directory not empty -------------------------------------------------------------------------- MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 CREATE FROM 0 with errorcode 1. NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them. -------------------------------------------------------------------------- salloc: Relinquishing job allocation 7262
what mdtest claimed alraedy removed file, was not revmoed yet.
# ls /ai400x/testdir/test-dir.0-0/mdtest_tree.0/file.mdtest.550.4111