Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Lustre 2.6.0, Lustre 2.7.0, Lustre 2.5.3, Lustre 2.5.4
-
OpenSFS Cluster running Lustre master build #1914
Combined MGS/MDS, one OSS with two OSTs and one client.
-
3
-
12884
Description
I’ve created a brand new file system on a freshly installed system on the OpenSFS cluster. I run dd a few times and everything looks fine. On the OSS, I run
# lctl set_param fail_loc=0x1610
fail_loc=0x1610
# lctl get_param fail_loc
fail_loc=5648
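The value is set in hex but read back in decimal, so the two outputs above refer to the same fail_loc. As a quick sanity check, the conversion can be sketched in plain shell. The helper name fail_loc_to_decimal is hypothetical and is not part of lctl.

```sh
# Hypothetical helper (not an lctl command): print the decimal form of a
# hex fail_loc value so it can be compared with `lctl get_param` output.
fail_loc_to_decimal() {
    printf '%d\n' "$1"
}

fail_loc_to_decimal 0x1610   # prints 5648
```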
fail_loc=0x1610 (OBD_FAIL_LFSCK_DANGLING) is supposed to create files with dangling references. Then I run dd on the client:
# dd if=/dev/urandom of=/lustre/scratch/a_3 count=1 bs=64k
1+0 records in
1+0 records out
65536 bytes (66 kB) copied, 0.0157425 s, 4.2 MB/s
and get no errors for the first 50 or so files written. After that, every dd command produces the following error:
# dd if=/dev/urandom of=/lustre/scratch/m_502 count=1 bs=64k
dd: writing `/lustre/scratch/m_502': Cannot allocate memory
1+0 records in
0+0 records out
0 bytes (0 B) copied, 0.292437 s, 0.0 kB/s
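To pin down how many writes succeed before the error appears, the repeated dd runs could be scripted. The following is only a sketch: the function name write_until_failure, the file-name prefix, and the upper bound are assumptions for illustration, and it was not part of the original reproduction.

```sh
# Hypothetical helper: write up to $3 files of 64 KiB each into directory $1,
# named ${2}_0, ${2}_1, ..., and stop at the first dd failure.
# Prints the number of files written successfully.
write_until_failure() {
    dir=$1
    prefix=$2
    max=$3
    i=0
    while [ "$i" -lt "$max" ]; do
        # Same dd invocation as in the report; its status output is discarded.
        if ! dd if=/dev/urandom of="$dir/${prefix}_$i" count=1 bs=64k 2>/dev/null; then
            break
        fi
        i=$((i + 1))
    done
    echo "$i"
}
```

For example, `write_until_failure /lustre/scratch m 100` would report how many files were written before dd first failed, or 100 if none failed.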
I run LFSCK on the MDS:
# lctl lfsck_start -M scratch-MDT0000 -A --reset --type layout
Started LFSCK on the device scratch-MDT0000: layout.
# lctl get_param -n mdd.scratch-MDT0000.lfsck_layout
and see that some number of dangling references were repaired. Up to this point, all of this is expected behavior.
The problem happens when I try to turn dangling references off. On the OST, I run “lctl set_param fail_loc=0” and get “fail_loc=0” returned. I then run dd on the client and still get the same “Cannot allocate memory” error as above, and running LFSCK again finds and corrects dangling references. I’m told that files could still be created with dangling references due to preallocation, but that this should stop after 32 or so files.
After writing about 30 files, the dd command on the client froze and the OST crashed. On the OST console, I see:
Message from syslogd@c11-ib at Feb 27 20:42:26 ...
kernel:LustreError: 2082:0:(ldlm_lib.c:1311:target_destroy_export()) ASSERTION( atomic_read(&exp->exp_cb_count) == 0 ) failed: value: 1
Message from syslogd@c11-ib at Feb 27 20:42:26 ...
kernel:LustreError: 2082:0:(ldlm_lib.c:1311:target_destroy_export()) LBUG
The OST came back on-line after a few minutes. I’ve repeated this twice on two different clean file systems.