
    Description

      It seems the MDT catalog file may be damaged on our test filesystem. We were doing recovery testing with the patch for LU-1352. Sometime after power-cycling the MDS and letting it go through recovery, clients started getting EFAULT writing to lustre. These failures are accompanied by the following console errors on the MDS.

      Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_cat.c:81:llog_cat_new_log()) no free catalog slots for log...
      Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_cat.c:81:llog_cat_new_log()) Skipped 3 previous similar messages
      Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_obd.c:454:llog_obd_origin_add()) write one catalog record failed: -28
      Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_obd.c:454:llog_obd_origin_add()) Skipped 3 previous similar messages
      Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(mdd_object.c:1330:mdd_changelog_data_store()) changelog failed: rc=-28 op17 t[0x200de60af:0x17913:0x0]
      

      I mentioned this in LU-1570, but I figured a new ticket was needed.

          Activity

            [LU-1586] no free catalog slots for log
            adilger Andreas Dilger made changes -
            Resolution New: Duplicate [ 3 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]

            adilger Andreas Dilger added a comment -

            Close as a duplicate of LU-7340.
            adilger Andreas Dilger made changes -
            Link New: This issue duplicates LU-7340 [ LU-7340 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-9055 [ LU-9055 ]
            pjones Peter Jones made changes -
            End date New: 09/Jan/15
            Start date New: 29/Jun/12
            kilian Kilian Cavalotti added a comment - edited

            As a matter of fact, it happened to us on a production filesystem. I wouldn't say the workload is non-pathological, though.

            Anyway, we noticed at some point that an MD operation such as "chown" could lead to ENOSPC:

            # chown userA /scratch/users/userA
            chown: changing ownership of `/scratch/users/userA/': No space left on device
            

            The related MDS messages are:

            LustreError: 8130:0:(llog_cat.c:82:llog_cat_new_log()) no free catalog slots for log...
            LustreError: 8130:0:(mdd_dir.c:783:mdd_changelog_ns_store()) changelog failed: rc=-28, op1 test c[0x20000b197:0x108d0:0x0] p[0x200002efb:0x155d5:0x0]
            

            Any tip on how to solve this? Would consuming (or clearing) the changelogs be sufficient?
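
            (For what it's worth, a minimal sketch of what consuming or clearing the changelogs would look like, assuming the standard changelog tools; the MDT name lustre-MDT0000 and the user id cl1 below are placeholders, not values taken from this system.)

            # list registered changelog users and how far each one has read
            lctl get_param mdd.lustre-MDT0000.changelog_users

            # tell the MDT that user cl1 no longer needs old records
            # (an endrec of 0 should mean everything up to the current last record)
            lfs changelog_clear lustre-MDT0000 cl1 0

            # if the consumer is gone for good, deregister it so its records stop piling up
            lctl --device lustre-MDT0000 changelog_deregister cl1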

            morrone Christopher Morrone (Inactive) made changes -
            Labels New: llnl

            nedbass Ned Bass (Inactive) added a comment -

            Aurelien, we're concerned about filling the changelog catalog, not the device. We actually had that happen on our test system when Robinhood was down and I was testing metadata performance (hence this Jira issue). It's far less likely on a production system with non-pathological workloads, but not outside the realm of possibility.


            adegremont Aurelien Degremont (Inactive) added a comment -

            FYI, we had Robinhood set up on a filesystem with 100 million inodes and an MDS RPC rate between 1k/s and 30k/s at peak. Robinhood was stopped for days, leaving millions of changelog records to be consumed. It also took days to close the gap, but the MDS was very, very far from being full (the MDT size was 2 TB). I think we did not consume even 1% of the device.
            Do not worry.


            nedbass Ned Bass (Inactive) added a comment -

            Sorry, I was filling the device, not the changelog catalog. I specified MDSDEV1=/dev/sda thinking it would use the whole device, but I also need to set MDSSIZE. So it will take days, not minutes, to hit this limit, making it less worrisome but still something that should be addressed.

            The reason I'm now picking this thread up again is that we have plans to enable changelogs on our production systems for use by Robinhood. We're concerned about being exposed to the problems under discussion here if Robinhood goes down for an extended period.
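
            (For context, a rough sketch of the kind of test-config override involved; the variable names follow the lustre/tests framework conventions, and the size value here is purely illustrative.)

            # lustre/tests overrides: back the MDT with a whole device...
            export MDSDEV1=/dev/sda
            # ...but the formatted size is still taken from MDSSIZE (in KB, I believe),
            # so it has to be raised explicitly as well
            export MDSSIZE=2000000000    # ~2 TB, illustrative only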


            People

              Assignee: bogl Bob Glossman (Inactive)
              Reporter: nedbass Ned Bass (Inactive)
              Votes: 0
              Watchers: 17
