Details
Bug
Resolution: Fixed
Critical
Lustre 2.8.0
pre-2.8 clients with DNE2 directories which contain 1 million or more files.
3
Description
In my testing of DNE2 I'm seeing problems when creating 1 million+ files per directory. After clearing out the debug logs, I see the problem is only on the client side. When running an application I see:
command line used: /lustre/sultan/stf008/scratch/jsimmons/mdtest -I 100000 -i 5 -d /lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds_md_test/shared_1000k_10
Path: /lustre/sultan/stf008/scratch/jsimmons/dne2_4_mds_md_test
FS: 21.8 TiB Used FS: 0.2% Inodes: 58.7 Mi Used Inodes: 4.6%
10 tasks, 1000000 files/directories
aprun: Apid 3172: Caught signal Window changed, sending to application
08/03/2015 10:34:45: Process 0(nid00028): FAILED in create_remove_directory_tree, Unable to remove directory: No such file or directory
Rank 0 [Mon Aug 3 10:34:45 2015] [c0-0c0s1n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
_pmiu_daemon(SIGCHLD): [NID 00028] [c0-0c0s1n2] [Mon Aug 3 10:34:45 2015] PE RANK 0 exit signal Aborted
aprun: Apid 3172: Caught signal Interrupt, sending to application
_pmiu_daemon(SIGCHLD): [NID 00012] [c0-0c0s6n0] [Mon Aug 3 10:50:50 2015] PE RANK 7 exit signal Interrupt
_pmiu_daemon(SIGCHLD): [NID 00018] [c0-0c0s6n2] [Mon Aug 3 10:50:50 2015] PE RANK 9 exit signal Interrupt
_pmiu_daemon(SIGCHLD): [NID 00013] [c0-0c0s6n1] [Mon Aug 3 10:50:50 2015] PE RANK 8 exit signal Interrupt
After the test failed, any attempt to remove the files created by these tests fails. When I attempt to remove the files, I see the following errors in dmesg:
LustreError: 5430:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2
LustreError: 5451:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2
LustreError: 5451:0:(llite_lib.c:2286:ll_prep_inode()) Skipped 7 previous similar messages
LustreError: 5451:0:(llite_lib.c:2286:ll_prep_inode()) new_inode -fatal: rc -2
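For reference, kernel-side return codes such as the `rc -2` above are negated errno values, so -2 is ENOENT, which matches the "No such file or directory" failure reported by mdtest. A minimal Python sketch for decoding these lines follows; the helper name `decode_rc` and the regular expression are illustrative only and are not part of Lustre or mdtest.

```python
import errno
import os
import re

# Hypothetical pattern: matches the trailing "rc <number>" in a LustreError line.
RC_PATTERN = re.compile(r"\brc (-?\d+)")


def decode_rc(line):
    """Return (rc, errno name, message) for a log line, or None if no rc is found."""
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    rc = int(match.group(1))
    code = abs(rc)  # kernel code reports errors as negated errno values
    return rc, errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


line = ("LustreError: 5430:0:(llite_lib.c:2286:ll_prep_inode()) "
        "new_inode -fatal: rc -2")
print(decode_rc(line))
```

The message text returned by `os.strerror` depends on the platform's C library, so only the symbolic name (ENOENT) should be relied on when comparing across systems.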