[LU-1586] no free catalog slots for log Created: 29/Jun/12  Updated: 01/Mar/17  Resolved: 01/Mar/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.1
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Ned Bass Assignee: Bob Glossman (Inactive)
Resolution: Duplicate Votes: 0
Labels: llnl
Environment:

https://github.com/chaos/lustre/commits/2.1.1-15chaos


Issue Links:
Duplicate
duplicates LU-7340 ChangeLogs catalog full condition sho... Resolved
Related
is related to LU-9055 MDS crash due to changelog being full Open
is related to LU-1570 llog_cat.c:428:llog_cat_process_flags... Resolved
Severity: 3
Rank (Obsolete): 4003

 Description   

It seems the MDT catalog file may be damaged on our test filesystem. We were doing recovery testing with the patch for LU-1352. Sometime after power-cycling the MDS and letting it go through recovery, clients started getting EFAULT errors writing to Lustre. These failures are accompanied by the following console errors on the MDS.

Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_cat.c:81:llog_cat_new_log()) no free catalog slots for log...
Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_cat.c:81:llog_cat_new_log()) Skipped 3 previous similar messages
Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_obd.c:454:llog_obd_origin_add()) write one catalog record failed: -28
Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_obd.c:454:llog_obd_origin_add()) Skipped 3 previous similar messages
Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(mdd_object.c:1330:mdd_changelog_data_store()) changelog failed: rc=-28 op17 t[0x200de60af:0x17913:0x0]

I mentioned this in LU-1570, but I figured a new ticket was needed.



 Comments   
Comment by Bob Glossman (Inactive) [ 29/Jun/12 ]

This problem almost certainly shares an underlying cause with LU-1570; both come down to damage in the catalog. I don't plan to close this as a duplicate until we complete our analysis.

Ned,
You say this happened while you were testing a fix from LU-1352. As far as I know that backport to b2_1 has not yet been fully tested and accepted. Kind of suspicious.

Comment by Ned Bass [ 29/Jun/12 ]

Indeed, these are the types of issues we're hoping to shake out before we even consider putting that patch in production. I should mention that the "catlog 0x27500007 crosses index zero" error message from LU-1570 began appearing in the MDS logs before we installed the LU-1352 patch.

Comment by Andreas Dilger [ 30/Jun/12 ]

If this is a production system, it would be possible to delete the CATALOGS file on the MDS.

That said, is there any reason you are aware of that would cause a lot of llog records to be produced but not deleted? Creating and deleting a lot of files, but having OSTs offline could do this.

It would be useful to get a listing of the OBJECTS directory on the MDS, just to see how many llog files there actually are. The catalog should be able to reference ~64000 log files.
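
For reference, the CATALOGS removal mentioned above would amount to something like this on an ldiskfs-backed MDT (device and mount paths are assumptions, and the MDT must be unmounted from Lustre first):

# Destructive, sketch only: remove the llog catalog file with the MDT offline;
# it is recreated on the next mount, at the cost of orphaning the old llogs.
mount -t ldiskfs /dev/sda /mnt/mdt-ldiskfs
rm /mnt/mdt-ldiskfs/CATALOGS
umount /mnt/mdt-ldiskfs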

Comment by Ned Bass [ 30/Jun/12 ]

is there any reason you are aware of that would cause a lot of llog records to be produced but not deleted? Creating and deleting a lot of files, but having OSTs offline could do this.

It's very possible that's what happened. The OSTs were down overnight while our I/O SWL tests were running on the clients. That may have included mdtest, which does create lots of files. I thought the MDS was down too, but I'd have to go back over the logs to reconstruct what really happened.

It would be useful to get a listing of the OBJECTS directory on the MDS, just to see how many llog files there actually are. The catalog should be able to reference ~64000 log files.

# zwicky-mds2 > ls OBJECTS/ | wc -l
65428

Comment by Ned Bass [ 12/Jul/12 ]

Creating and deleting a lot of files, but having OSTs offline could do this.

I still don't really understand this failure mode. Shouldn't the llogs be replayed and removed when the OSTs reconnect to the MDS?

Also, we found that removing the CATALOGS file didn't resolve this. We even went so far as to clear the OBJECTS and CONFIGS directories and issue a writeconf, but were unable to resuscitate the MDS.

Lustre: MGS: Logs for fs lc2 were removed by user request.  All servers must be restarted in order to regenerate the logs.
Lustre: Setting parameter lc2-MDT0000-mdtlov.lov.stripecount in log lc2-MDT0000
Lustre: Setting parameter lc2-clilov.lov.stripecount in log lc2-client
LustreError: 5569:0:(genops.c:304:class_newdev()) Device lc2-MDT0000-mdtlov already exists, won't add
LustreError: 5569:0:(obd_config.c:327:class_attach()) Cannot create device lc2-MDT0000-mdtlov of type lov : -17
LustreError: 5569:0:(obd_config.c:1363:class_config_llog_handler()) Err -17 on cfg command:
Lustre:    cmd=cf001 0:lc2-MDT0000-mdtlov  1:lov  2:lc2-MDT0000-mdtlov_UUID  
LustreError: 15c-8: MGC10.1.1.212@o2ib9: The configuration from log 'lc2-MDT0000' failed (-17). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 5513:0:(obd_mount.c:1192:server_start_targets()) failed to start server lc2-MDT0000: -17
LustreError: 5513:0:(obd_mount.c:1723:server_fill_super()) Unable to start targets: -17
LustreError: 5513:0:(obd_mount.c:1512:server_put_super()) no obd lc2-MDT0000

We needed the filesystem for other testing, so we gave up and reformatted it. However, we're concerned that this failure may occur in production and we don't have a recovery process. So we'd really like to understand better what happened and how to fix it.

Comment by Andreas Dilger [ 12/Jul/12 ]

-17 = EEXIST, so I would suspect it is complaining about a file in CONFIGS, but you reported that was cleared out as well.

You are correct that bringing the OSTs back online should cause the OST recovery logs to be cleaned up. Along with the message in LU-1570, it seems there is something at your site that is consuming more llogs than normal. Each llog can store up to 64k unlink records for recovery, and a catalog can hold up to 64k llogs PER OST, though a new llog file is started on each boot. That means 4B unlinks, or 64k reboots, or combinations thereof, per OST before the catalog wraps back to zero (per LU-1570).

The logs should be deleted sequentially after the MDT->OST orphan recovery is completed when the OST reconnects, freeing up their slot in the catalog file. It is possible that something was broken in this code in 2.x and it hasn't been noticed until now, since it would take a long time to see the symptoms.

A simple test to reproduce this would be to create & delete files in a loop (~1M) on a specific OST (using "mkdir $LUSTRE/test_ostN; lfs setstripe -i N $LUSTRE/test_ostN; createmany -o $LUSTRE/test_ostN/f 1000000", where "N" is some OST number) and see if the number of llog files in the MDT OBJECTS/ directory is increasing steadily over time (beyond 2 or 3 files per OST). I don't recall specifically, but it may need an unmount and remount of the OST for the llog files to be cleaned up.

Failing that test, try creating a large number of files (~1M) in $LUSTRE/test_ostN, and then unmount OST N and delete all the files. This should succeed without error, but there will be many llog entries stored in the llog file. The llog files should be cleaned when this OST is mounted again.
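
For reference, a minimal driver for the first test might look like the sketch below; the mount point, OST index, batch sizes, and the way OBJECTS/ is inspected on the MDS are all assumptions (createmany/unlinkmany come from the Lustre test suite):

#!/bin/bash
# Rough sketch of the create/delete test described above; adjust to taste.
LUSTRE=/mnt/lustre       # client mount point (assumption)
N=0                      # OST index to target (assumption)

mkdir -p $LUSTRE/test_ost$N
lfs setstripe -i $N $LUSTRE/test_ost$N

# Create and delete ~1M files on OST N in batches; between batches, count the
# llog files on the MDS (e.g. "ls OBJECTS/ | wc -l" with the MDT mounted as
# ldiskfs, as earlier in this ticket). A steady climb beyond 2-3 files per OST
# would point at the leak discussed above.
for batch in $(seq 1 10); do
    createmany -o $LUSTRE/test_ost$N/f 100000
    unlinkmany $LUSTRE/test_ost$N/f 100000
done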

Comment by Ned Bass [ 12/Jul/12 ]

-17 = EEXIST, so I would suspect it is complaining about a file in CONFIGS, but you reported that was cleared out as well.

Even more strangely, the EEXIST errors persisted after reformatting the MDT. To test whether something on disk was to blame, we formatted a loopback device with the same filesystem name, and still got EEXIST mounting the MDT. Formatting the loopback device with a different filesystem name worked as expected. Rebooting the MDS node cleared the problem (I suspect reloading the module stack would have as well). There must have been some in-memory state left over from the original filesystem (we double-checked in /proc/mounts that the old MDT was not mounted).

In any case, this seems like something to track in a separate (but possibly related) issue.

it seems there is something at your site that is consuming more llogs than normal.

By any chance, do you suppose running Robinhood could be a factor? I just learned one of our admins was evaluating Robinhood on this filesystem, but my understanding is that it just reads the changelogs, so I don't suspect a connection.

Thanks for the test suggestions. We'll give them a try when the filesystem is free again.

Comment by Andreas Dilger [ 13/Jul/12 ]

Yes, the changelogs could definitely be a factor. Once there is a registered changelog user, the changelogs are kept on disk until they are consumed. That ensures that if, e.g., Robinhood crashes or has some other problem for a day or four, it won't have to do a full scan just to recover its state.

However, if the ChangeLog user is not unregistered, the changelogs will be kept until they run out of space. I suspect that is the root cause here, and should be investigated further. This bug should be CC'd to Jinshan and Aurelien Degremont, who are working on HSM these days.
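
For reference, checking which changelog consumers are registered (and how far each one is behind the current index) is a one-liner; the MDT name below is an example and the exact output layout varies between versions:

# Shows "current index" for the MDT plus one line per registered consumer
# (cl1, cl2, ...) with the last record that consumer has cleared.
lctl get_param mdd.lc2-MDT0000.changelog_users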

Cheers, Andreas

Comment by Peter Jones [ 23/Jul/12 ]

Adding those involved with HSM for comment

Comment by Ned Bass [ 20/Feb/13 ]

It seems like lots of bad things can happen if the changelog catalog is allowed to become full: LU-2843, LU-2844, LU-2845. Besides these crashes, the MDS service fails to start due to EINVAL errors from mdd_changelog_llog_init(), and the only way I've found to recover is to manually delete the changelog_catalog file.

I'm interested in adding safety mechanisms to prevent this situation. Perhaps the MDS could automatically unregister changelog users or set the changelog mask to zero based on a tunable threshold of unprocessed records. Does anyone have other ideas for how to handle this more gracefully?
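
For what it's worth, a userspace watchdog along those lines might look roughly like the sketch below; the threshold, MDT name, and the parsing of the changelog_users output are all assumptions, and the deregister step is commented out because it discards the consumer's bookmark:

#!/bin/bash
# Sketch of a watchdog for lagging changelog consumers; not a real implementation.
MDT=lc2-MDT0000          # MDT device name (assumption)
THRESHOLD=1000000        # maximum tolerated backlog of unprocessed records (assumption)

USERS=$(lctl get_param -n mdd.$MDT.changelog_users)
CURRENT=$(echo "$USERS" | awk '/current index/ {print $NF}')

# Remaining lines list each registered user (cl1, cl2, ...) and the last
# record it has cleared; the exact layout varies between Lustre versions.
echo "$USERS" | awk '$1 ~ /^cl[0-9]+$/ {print $1, $2}' | while read ID IDX; do
    BEHIND=$((CURRENT - IDX))
    if [ "$BEHIND" -gt "$THRESHOLD" ]; then
        echo "changelog user $ID is $BEHIND records behind on $MDT"
        # Uncomment to drop the laggard and let its records be released:
        # lctl --device $MDT changelog_deregister $ID
    fi
done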

Comment by Andreas Dilger [ 21/Feb/13 ]

Ned, I agree this should be handled more gracefully. I think it is preferable to unregister the oldest consumer as the catalog approaches full, which should cause old records to be released (need to check this). That is IMHO better than setting the mask to zero and no longer recording new events.

In both cases the consumer will have to do some scanning to find new changes. However, in the first case, it is more likely that the old consumer is no longer in use and no harm is done, while in the second case even a well-behaved consumer is punished.

On a related note, do you know how many files were created before the catalog was full? In theory about 4B Changelog entries should be possible (approx 64000^2), but this might be reduced by some small factor if there are multiple records per file (e.g. create + setattr).

Comment by Ned Bass [ 21/Feb/13 ]

It only took about 1.3 million changelog entries to fill the catalog. My test case was something like

MDSDEV1=/dev/sda llmount.sh
lctl --device lustre-MDT0000 changelog_register
while createmany -m /mnt/lustre/%d 1000 ; do
    unlinkmany /mnt/lustre/%d 1000
done

and it made it through about 670 iterations before failing.

Comment by Ned Bass [ 21/Feb/13 ]

Sorry, I was filling the device, not the changelog catalog. I specified MDSDEV1=/dev/sda thinking it would use the whole device, but I also need to set MDSSIZE. So it will take days, not minutes, to hit this limit, making it less worrisome but still something that should be addressed.

The reason I'm now picking this thread up again is that we have plans to enable changelogs on our production systems for use by Robinhood. We're concerned about being exposed to the problems under discussion here if Robinhood goes down for an extended period.
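
(For completeness, the corrected invocation would be something along these lines; MDSSIZE is read from the environment by the Lustre test framework, and the value and units here are only illustrative.)

# Illustrative only: give llmount.sh an explicit, larger MDT size so the
# changelog catalog, rather than the device, becomes the first limit hit.
MDSSIZE=2000000 MDSDEV1=/dev/sda llmount.sh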

Comment by Aurelien Degremont (Inactive) [ 22/Feb/13 ]

FYI, we had Robinhood set up on a filesystem with 100 million inodes, and an MDS RPC rate between 1k/s and 30k/s at peak. Robinhood was stopped for days and there were millions of changelog records to be consumed. It also took days to close the gap, but the MDS was very, very far from being full (the MDS size was 2 TB). I think we did not consume even 1% of the device.
Do not worry.

Comment by Ned Bass [ 22/Feb/13 ]

Aurelien, we're concerned about filling the changelog catalog, not the device. That actually happened on our test system when Robinhood was down and I was testing metadata performance (hence this Jira issue). It's far less likely on a production system with non-pathological workloads, but not outside the realm of possibility.

Comment by Kilian Cavalotti [ 09/Jan/15 ]

As a matter of fact, it happened to us on a production filesystem. I wouldn't say the workload is non-pathological, though.

Anyway, we noticed at some point that a metadata operation such as "chown" could lead to ENOSPC:

# chown userA /scratch/users/userA
chown: changing ownership of `/scratch/users/userA/': No space left on device

The related MDS messages are:

LustreError: 8130:0:(llog_cat.c:82:llog_cat_new_log()) no free catalog slots for log...
LustreError: 8130:0:(mdd_dir.c:783:mdd_changelog_ns_store()) changelog failed: rc=-28, op1 test c[0x20000b197:0x108d0:0x0] p[0x200002efb:0x155d5:0x0]

Any tip on how to solve this? Would consuming (or clearing) the changelogs be sufficient?
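
In case it helps others who land here: consuming and clearing the pending records looks roughly like the lines below (the MDT name and consumer ID are examples, and whether this alone recovers a catalog that is already full is exactly the open question in this ticket):

# Peek at the pending records for this MDT, then tell it that consumer cl1 no
# longer needs them ("0" means up to the current last record), which lets the
# plain llog files behind the catalog be released.
lfs changelog lc2-MDT0000 | tail
lfs changelog_clear lc2-MDT0000 cl1 0

# If the consumer (e.g. an abandoned Robinhood instance) is gone for good,
# deregistering it releases its records as well:
lctl --device lc2-MDT0000 changelog_deregister cl1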

Comment by Andreas Dilger [ 01/Mar/17 ]

Closing as a duplicate of LU-7340.
