[LU-1586] no free catalog slots for log Created: 29/Jun/12 Updated: 01/Mar/17 Resolved: 01/Mar/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Ned Bass | Assignee: | Bob Glossman (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | llnl |
| Environment: | |
| Issue Links: | |
|
| Severity: | 3 |
| Rank (Obsolete): | 4003 |
| Description |
|
It seems the MDT catalog file may be damaged on our test filesystem. We were doing recovery testing with the patch for

Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_cat.c:81:llog_cat_new_log()) no free catalog slots for log...
Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_cat.c:81:llog_cat_new_log()) Skipped 3 previous similar messages
Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_obd.c:454:llog_obd_origin_add()) write one catalog record failed: -28
Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_obd.c:454:llog_obd_origin_add()) Skipped 3 previous similar messages
Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(mdd_object.c:1330:mdd_changelog_data_store()) changelog failed: rc=-28 op17 t[0x200de60af:0x17913:0x0]

I mentioned this in |
| Comments |
| Comment by Bob Glossman (Inactive) [ 29/Jun/12 ] |
|
This problem almost certainly has a common underlying cause with Ned, |
| Comment by Ned Bass [ 29/Jun/12 ] |
|
Indeed, these are the types of issues we're hoping to shake out before we even consider putting that patch in production. I should mention that the "catlog 0x27500007 crosses index zero" error message from |
| Comment by Andreas Dilger [ 30/Jun/12 ] |
|
If this isn't a production system, it would be possible to delete the CATALOGS file on the MDS. That said, is there any reason you are aware of that would cause a lot of llog records to be produced but not deleted? Creating and deleting a lot of files while the OSTs are offline could do this. It would be useful to get a listing of the OBJECTS directory on the MDS, just to see how many llog files there actually are. The catalog should be able to reference ~64000 log files. |
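For reference, a rough sketch of what that inspection could look like on an ldiskfs-backed MDT; the device path /dev/mdtdev is a placeholder, and removing CATALOGS should only be considered on a test filesystem since it discards the catalog and any unprocessed llog references:

    # Rough count of llog files in OBJECTS/ while the MDT is running (read-only)
    debugfs -c -R 'ls /OBJECTS' /dev/mdtdev | wc -w

    # With the MDT stopped, the backing device can be mounted as ldiskfs
    mount -t ldiskfs /dev/mdtdev /mnt/mdt-ldiskfs
    ls /mnt/mdt-ldiskfs/OBJECTS | wc -l
    rm /mnt/mdt-ldiskfs/CATALOGS        # test filesystems only; recreated on next use
    umount /mnt/mdt-ldiskfs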
| Comment by Ned Bass [ 30/Jun/12 ] |
It's very possible that's what happened. The OSTs were down overnight while our I/O SWL tests were running on the clients. That may have included mdtest which does create lots of files. I thought the MDS was down too, but I'd have to go back over the logs to reconstruct what really happened.
# zwicky-mds2 > ls OBJECTS/ | wc -l |
| Comment by Ned Bass [ 12/Jul/12 ] |
I still don't really understand this failure mode. Shouldn't the llogs be replayed and removed when the OSTs reconnect to the MDS? Also, we found that removing the CATALOGS file didn't resolve this. We even went so far as to clear the OBJECTS and CONFIGS directories and issue a writeconf, but were unable to resuscitate the MDS.

Lustre: MGS: Logs for fs lc2 were removed by user request. All servers must be restarted in order to regenerate the logs.
Lustre: Setting parameter lc2-MDT0000-mdtlov.lov.stripecount in log lc2-MDT0000
Lustre: Setting parameter lc2-clilov.lov.stripecount in log lc2-client
LustreError: 5569:0:(genops.c:304:class_newdev()) Device lc2-MDT0000-mdtlov already exists, won't add
LustreError: 5569:0:(obd_config.c:327:class_attach()) Cannot create device lc2-MDT0000-mdtlov of type lov : -17
LustreError: 5569:0:(obd_config.c:1363:class_config_llog_handler()) Err -17 on cfg command:
Lustre: cmd=cf001 0:lc2-MDT0000-mdtlov 1:lov 2:lc2-MDT0000-mdtlov_UUID
LustreError: 15c-8: MGC10.1.1.212@o2ib9: The configuration from log 'lc2-MDT0000' failed (-17). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 5513:0:(obd_mount.c:1192:server_start_targets()) failed to start server lc2-MDT0000: -17
LustreError: 5513:0:(obd_mount.c:1723:server_fill_super()) Unable to start targets: -17
LustreError: 5513:0:(obd_mount.c:1512:server_put_super()) no obd lc2-MDT0000

We needed the filesystem for other testing, so we gave up and reformatted it. However, we're concerned that this failure may occur in production and we don't have a recovery process. So we'd really like to understand better what happened and how to fix it. |
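For completeness, the writeconf sequence referred to above normally looks something like the sketch below (device paths are placeholders, and the MGS is assumed co-located with the MDT); this is the standard procedure rather than anything specific to this failure:

    # Unmount everything: clients first, then OSTs, then the MDT
    umount /mnt/lustre          # on each client
    umount /mnt/ost0            # on each OSS, for each OST
    umount /mnt/mdt             # on the MDS

    # Erase and regenerate the configuration logs
    tunefs.lustre --writeconf /dev/mdtdev       # on the MDS
    tunefs.lustre --writeconf /dev/ostdev       # on each OSS, for each OST

    # Remount in order: MDT (and MGS) first, then OSTs, then clients
    mount -t lustre /dev/mdtdev /mnt/mdt
    mount -t lustre /dev/ostdev /mnt/ost0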
| Comment by Andreas Dilger [ 12/Jul/12 ] |
|
-17 = EEXIST, so I would suspect it is complaining about a file in CONFIGS, but you reported that was cleared out as well.

You are correct that bringing the OSTs back online should cause the OST recovery logs to be cleaned up. Along with the message in

The logs should be deleted sequentially after the MDT->OST orphan recovery is completed when the OST reconnects, freeing up their slot in the catalog file. It is possible that something was broken in this code in 2.x and it hasn't been noticed until now, since it would take a long time to see the symptoms.

A simple test to reproduce this would be to create & delete files in a loop (~1M) on a specific OST (using "mkdir $LUSTRE/test_ostN; lfs setstripe -i N $LUSTRE/test_ostN; createmany -o $LUSTRE/test_ostN/f 1000000", where "N" is some OST number) and see if the number of llog files in the MDT OBJECTS/ directory is increasing steadily over time (beyond 2 or 3 files per OST). I don't recall specifically, but it may need an unmount and remount of the OST for the llog files to be cleaned up.

Failing that test, try creating a large number of files (~1M) in $LUSTRE/test_ostN, and then unmount OST N and delete all the files. This should succeed without error, but there will be many llog entries stored in the llog file. The llog files should be cleaned when this OST is mounted again. |
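A sketch of that reproduction loop as a script, assuming a client mount at $LUSTRE, the createmany/unlinkmany helpers from lustre-tests, and OST index 0; the counts are illustrative:

    #!/bin/bash
    LUSTRE=/mnt/lustre
    N=0                                  # OST index to target

    mkdir -p $LUSTRE/test_ost$N
    lfs setstripe -i $N $LUSTRE/test_ost$N

    # Create and delete objects on OST N in batches, then check whether the
    # number of llog files in the MDT's OBJECTS/ directory keeps growing
    # (e.g. with the debugfs one-liner shown earlier in this ticket)
    for i in $(seq 1 1000); do
        createmany -o $LUSTRE/test_ost$N/f 1000
        unlinkmany $LUSTRE/test_ost$N/f 1000
    done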
| Comment by Ned Bass [ 12/Jul/12 ] |
Even more strangely, the EEXIST errors persisted after reformatting the MDT. To test whether something on disk was to blame, we formatted a loopback device with the same filesystem name, and still got EEXIST mounting the MDT. Formatting the loopback device with a different filesystem name worked as expected. Rebooting the MDS node cleared the problem (I suspect reloading the module stack would have as well). There must have been some in-memory state left over from the original filesystem (we double-checked in /proc/mounts that the old MDT was not mounted). In any case, this seems like something to track in a separate (but possibly related) issue.
By any chance, do you suppose running Robinhood could be a factor? I just learned one of our admins was evaluating Robinhood on this filesystem, but my understanding is that it just reads the Changelogs, so I don't suspect a connection. Thanks for the test suggestions. We'll give them a try when the filesystem is free again. |
| Comment by Andreas Dilger [ 13/Jul/12 ] |
|
Yes, the changelogs could definitely be a factor. Once there is a registered changelog user, the changelogs are kept on disk until they are consumed. That ensures that if, e.g., Robinhood crashes or has some other problem for a day or four, it won't have to do a full scan just to recover the state again. However, if the ChangeLog user is not unregistered, the changelogs will be kept until they run out of space. I suspect that is the root cause here, and it should be investigated further. This bug should be CC'd to Jinshan and Aurelien Degremont, who are working on HSM these days. Cheers, Andreas |
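A brief illustration of that lifecycle with the standard tools (the MDT name is an example, and the returned user id will be something like cl1):

    # Registering a consumer pins changelog records on disk until that user clears them
    lctl --device lustre-MDT0000 changelog_register

    # List registered consumers and how far each has read relative to the current index
    lctl get_param mdd.lustre-MDT0000.changelog_users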
| Comment by Peter Jones [ 23/Jul/12 ] |
|
Adding those involved with HSM for comment |
| Comment by Ned Bass [ 20/Feb/13 ] |
|
It seems like lots of bad things can happen if the changelog catalog is allowed to become full. I'm interested in adding safety mechanisms to prevent this situation. Perhaps the MDS could automatically unregister changelog users or set the changelog mask to zero based on a tunable threshold of unprocessed records. Does anyone have other ideas for how to handle this more gracefully? |
| Comment by Andreas Dilger [ 21/Feb/13 ] |
|
Ned, I agree this should be handled more gracefully. I think it is preferable to unregister the oldest consumer as the catalog approaches full, which should cause old records to be released (need to check this). That is IMHO better than setting the mask to zero and no longer recording new events. In both cases the consumer will have to do some scanning to find new changes. However, in the first case, it is more likely that the old consumer is no longer in use and no harm is done, while in the second case even a well-behaved consumer is punished.

On a related note, do you know how many files were created before the catalog was full? In theory about 4B Changelog entries should be possible (approx 64000^2), but this might be reduced by some small factor if there are multiple records per file (e.g. create + setattr). |
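Until something like that exists on the MDS itself, a crude userspace approximation is possible: periodically compare each consumer's index against the current changelog index and deregister anyone lagging too far behind. A sketch only, assuming a single MDT named lustre-MDT0000 and that the changelog_users output contains a "current index:" line plus one "clN <index>" line per consumer (the exact format varies by version); the threshold is arbitrary:

    #!/bin/bash
    MDT=lustre-MDT0000
    THRESHOLD=1000000    # max records a consumer may lag before being dropped (arbitrary)

    users=$(lctl get_param -n mdd.$MDT.changelog_users)
    current=$(echo "$users" | awk '/current index/ {print $NF}')

    # Walk the per-consumer lines (ids look like cl1, cl2, ...)
    echo "$users" | awk '$1 ~ /^cl[0-9]+$/ {print $1, $2}' | while read id index; do
        lag=$((current - index))
        if [ "$lag" -gt "$THRESHOLD" ]; then
            echo "changelog user $id lags by $lag records, deregistering"
            lctl --device $MDT changelog_deregister $id
        fi
    done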
| Comment by Ned Bass [ 21/Feb/13 ] |
|
It only took about 1.3 million changelog entries to fill the catalog. My test case was something like

    MDSDEV1=/dev/sda llmount.sh
    lctl --device lustre-MDT0000 changelog_register
    while createmany -m /mnt/lustre/%d 1000 ; do
        unlinkmany /mnt/lustre/%d 1000
    done

and it made it through about 670 iterations before failing. |
| Comment by Ned Bass [ 21/Feb/13 ] |
|
Sorry, I was filling the device not the changelog catalog. I specified MDSDEV1=/dev/sda thinking it would use the whole device, but I also need to set MDSSIZE. So it will take days not minutes to hit this limit, making it less worrisome but still something that should be addressed. The reason I'm now picking this thread up again is that we have plans to enable changelogs on our production systems for use by Robinhood. We're concerned about being exposed to the problems under discussion here if Robinhood goes down for an extended period. |
| Comment by Aurelien Degremont (Inactive) [ 22/Feb/13 ] |
|
FYI, we had Robinhood set up on a filesystem with 100 million inodes and an MDS RPC rate between 1k/s and 30k/s at peak. Robinhood was stopped for days, leaving millions of changelog records to be consumed. It also took days to close the gap, but the MDS was very, very far from being filled (MDS size was 2 TB); I think we did not consume even 1% of this device. |
| Comment by Ned Bass [ 22/Feb/13 ] |
|
Aurelien, we're concerned about filling the changelog catalog, not the device. We actually had that happen on our test system when Robinhood was down and I was testing metadata performance (hence this Jira issue). It's far less likely on a production system with non-pathological workloads, but not outside the realm of possibility. |
| Comment by Kilian Cavalotti [ 09/Jan/15 ] |
|
As a matter of fact, it happened to us on a production filesystem. I wouldn't say the workload is non-pathological, though. Anyway, we noticed at some point that an MD operation such as "chown" could lead to ENOSPC:

# chown userA /scratch/users/userA
chown: changing ownership of `/scratch/users/userA/': No space left on device

The related MDS messages are:

LustreError: 8130:0:(llog_cat.c:82:llog_cat_new_log()) no free catalog slots for log...
LustreError: 8130:0:(mdd_dir.c:783:mdd_changelog_ns_store()) changelog failed: rc=-28, op1 test c[0x20000b197:0x108d0:0x0] p[0x200002efb:0x155d5:0x0]

Any tip on how to solve this? Would consuming (or clearing) the changelogs be sufficient? |
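For what it's worth, the cleanup that question implies would look roughly like the sketch below, assuming the lagging consumer turns out to be cl1 on lustre-MDT0000 (both are placeholders); whether this alone frees the catalog slots on the affected version is exactly what this ticket leaves open:

    # See which changelog users are registered and where their read pointers are
    lctl get_param mdd.lustre-MDT0000.changelog_users

    # Acknowledge everything up to the current last record for user cl1 (endrec 0)
    lfs changelog_clear lustre-MDT0000 cl1 0

    # If cl1 belongs to an abandoned consumer, deregister it so records stop accumulating
    lctl --device lustre-MDT0000 changelog_deregister cl1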
| Comment by Andreas Dilger [ 01/Mar/17 ] |
|
Close as a duplicate of |