[LU-274] Client delayed file status (cache meta-data) causing job failures - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.0.0, Lustre 2.1.0, Lustre 1.8.6
Affects Version/s: Lustre 2.0.0, Lustre 2.1.0, Lustre 1.8.6
Labels:
None
Environment:
Servers are running 1.8.5.54 and 1.8.5.55; both also have a patch to increase the number of MDS threads.
OS version CentOS 5.5
Clients experiencing the problem are running 1.8.5 release from Oracle.

Severity:
3
Rank (Obsolete):
5021

Description

A key user at the site runs a weather forecasting model. During its post-processing, which gathers information from Lustre, he noticed that his job would sometimes fail or get errors from tar because of zero-length files. Upon further investigation, it turned out that the problem is visible with a simple "ls -l" command. The first run showed files with zero length. Repeating the "ls -l" after a short period of time (10-15 seconds, though it varies) showed the correct size.
The basic sequence the user runs is:
1. Base data is in directory from completed run
2. Execute a mppcombine program that takes this data and creates a set of joined output data
3. Create a list of output files from the mppcombine run to feed to tar, using "ls -l" on this newly created data
4. Take that list and begin tarring up the data
5. If you see "xt5-widow2-mpp-test.o406112:tar: atmos_scalar.nc: file changed as we read it" bingo you hit the problem.
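
The sequence above suggests a simple guard that a job script could run between steps 3 and 4. The following is only a sketch, not part of the attached test case: the function names zero_length_files and tar_outputs are hypothetical, and it assumes that a zero-length output file is never legitimate for this workload. It does not fix the underlying client behavior; it only refuses to build an archive from files the client currently reports as empty.

```python
import os
import tarfile


def zero_length_files(paths):
    """Return the paths whose size, as reported by stat, is 0 bytes."""
    return [p for p in paths if os.stat(p).st_size == 0]


def tar_outputs(paths, archive_path):
    """Archive the given files, refusing to proceed if any reports zero length.

    Assumption: an empty output file always indicates the problem described
    in this report, never a valid result.
    """
    empty = zero_length_files(paths)
    if empty:
        raise RuntimeError("zero-length files reported: " + ", ".join(empty))
    with tarfile.open(archive_path, "w") as tar:
        for p in paths:
            # Store each file under its base name only.
            tar.add(p, arcname=os.path.basename(p))
    return archive_path
```

A job script could catch the RuntimeError, wait, and retry the listing instead of producing an archive that cannot be trusted.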

Needless to say, this gives the impression of data corruption, since the tar file can't be trusted. In reality the data is there but not apparent to tar. If you place a "sync" command between steps 2 and 3, you appear to be able to avoid the problem. However, this is not acceptable, because we can't be expected to tell all users that they need to issue a sync before certain operations or they may not get all their data.
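
For illustration only, the sync workaround described above might look like this in a Python job script. The helper name sizes_after_sync is hypothetical. The sketch assumes a Unix system, where os.sync() issues the same sync system call as the "sync" command; whether that is sufficient on an affected client is only what this report observes, not something the sketch guarantees.

```python
import os


def sizes_after_sync(directory):
    """Flush pending writes with os.sync(), then return {name: size}.

    Only regular files directly inside the directory are listed.
    """
    os.sync()  # equivalent in effect to running the "sync" command
    sizes = {}
    for entry in os.scandir(directory):
        if entry.is_file():
            sizes[entry.name] = entry.stat().st_size
    return sizes
```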

An example of the ls -l output

>> ls -ltr
.
.
rw------ 1 XX.XXXX user 130991 2011-04-12 20:57 time_FS.o564793
rw------ 1 XX.XXXX user 130988 2011-04-12 21:28 time_FS.o564868
rw------ 1 XX.XXXX user 0 2011-04-12 22:01 time_FS.o564927
rw------ 1 XX.XXXX user 0 2011-04-12 22:36 time_FS.o565015

Wait an undetermined amount of time and repeat:

>> ls -ltr
.
.
rw------ 1 XX.XXXX user 130991 2011-04-12 20:57 time_FS.o564793
rw------ 1 XX.XXXX user 130988 2011-04-12 21:28 time_FS.o564868
rw------ 1 XX.XXXX user 130991 2011-04-12 22:01 time_FS.o564927
rw------ 1 XX.XXXX user 130956 2011-04-12 22:36 time_FS.o565015

Data size is now correct.
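
To put a number on the "undetermined amount of time", a small polling helper can record how long a client keeps reporting zero length for a file. This is a sketch under stated assumptions: the function name wait_for_nonzero_size and its timeout and interval parameters are hypothetical, and it only measures what the local client reports through stat, which is exactly the value in question here.

```python
import os
import time


def wait_for_nonzero_size(path, timeout=30.0, interval=1.0):
    """Poll stat on path until it reports a nonzero size.

    Returns (size, elapsed_seconds). Raises TimeoutError if the file is
    still reported as empty once the timeout has passed.
    """
    start = time.monotonic()
    while True:
        size = os.stat(path).st_size
        elapsed = time.monotonic() - start
        if size > 0:
            return size, elapsed
        if elapsed >= timeout:
            raise TimeoutError(f"{path} still reports zero length after {elapsed:.1f}s")
        time.sleep(interval)
```

Logging the elapsed value for each output file right after step 2 would show whether the delay is consistently in the 10-15 second range seen above.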

I have made a couple of attempts at a local test using Lustre 1.8.4 and so far have been unable to reproduce the problem, but I am not convinced this is conclusive. I will be running more tests.

In the Oracle Bugzilla, bug 24501 is basically the same problem, and a search of their database suggests that bug 24458 might also be the same issue.

I will attach a test case that has the basic commands used. The actual test data is available but is extremely large, and I would need to be pointed to an FTP site for that.

Attachments

mppnccombine.c (40 kB, 05/May/11 7:37 AM, Steven Woods)
tar_fail.tar (7.22 MB, 04/May/11 7:06 AM, Steven Woods)

Issue Links

is duplicated by

LU-2347 lustre client sometimes reports wrong file size

Closed

Trackbacks

Lustre 1.8.x known issues tracker While testing against Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla https://bugzilla.lustre.org/. In order to move away from relying on Bugzilla, we would create a JIRA

People

Assignee: Niu Yawei (Inactive)

Reporter: Steven Woods

Votes: 0

Watchers: 12

Dates

Created: 04/May/11 7:06 AM

Updated: 06/Nov/13 5:57 PM

Resolved: 30/May/11 12:06 AM