[LU-8527] Lustre 2.8 server crashed on bring up during large scale test shot Created: 23/Aug/16 Updated: 18/Oct/16 Resolved: 08/Oct/16 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | James A Simmons | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL6.7 servers running latest lustre 2.8 using latest lustre 2.8 clients. |
||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
In our first attempt to test lustre 2.8 servers at scale the MDS crashed at bring up. The backtrace is: <2>[ 8011.527071] kernel BUG at fs/jbd2/transaction.c:1028! |
| Comments |
| Comment by James A Simmons [ 23/Aug/16 ] |
|
Small detail. This crashed happened during an IOR run. Also it doesn't seem repeatable. |
| Comment by James A Simmons [ 23/Aug/16 ] |
|
Here is the dmesg from the crash |
| Comment by Jian Yu [ 23/Aug/16 ] |
|
Hi Bobi, I saw the similar stack trace reported in Thank you. |
| Comment by Zhenyu Xu [ 23/Aug/16 ] |
|
is this FE 2.8? If yes, I think http://review.whamcloud.com/19732 would be a promising patch. |
| Comment by Zhenyu Xu [ 23/Aug/16 ] |
|
port of it at http://review.whamcloud.com/22058 |
| Comment by James A Simmons [ 23/Aug/16 ] |
|
Okay I'm building it now. Will let you know the results. |
| Comment by James A Simmons [ 23/Aug/16 ] |
|
Nope patch 22058 doesn't fix it for us |
| Comment by Alex Zhuravlev [ 24/Aug/16 ] |
|
LOD itself doesn't declare credits, but it ask OSD to declare operations. is it possible to load osd-ldiskfs module with ldiskfs_track_declares_assert=1 and attach dmesg output starting a bit earlier (hopefully we'll hit our assert with additional information). thanks in advance. |
| Comment by Dustin Leverman [ 25/Aug/16 ] |
|
I cut the log down to show times from 7:11PM (recovery Aug 23 19:05:18 atlas1-mds1.ccs.ornl.gov gedi;lustre-mgmt1.ccs.ornl.gov;test_lustre kernel: Kernel logging (proc) stopped. |
| Comment by Jian Yu [ 25/Aug/16 ] |
|
Thank you Dustin for the logs. Hi Alex, Could you please look into the logs? And although patch http://review.whamcloud.com/22058 doesn't fix the issue in this ticket, do you think it is still a better-to-have patch for Lustre b2_8_fe servers? |
| Comment by Andreas Dilger [ 26/Aug/16 ] |
|
It seems possible that this is related to creating or deleting a wide-striped file? I found |
| Comment by James A Simmons [ 01/Sep/16 ] |
|
the first crash happened while running a large stripe single shared file with IOR. After the first crash this bug came back at random times. |
| Comment by Alex Zhuravlev [ 02/Sep/16 ] |
|
please, tell the latest commit for the branch you're using. |
| Comment by Jian Yu [ 02/Sep/16 ] |
|
Hi Alex, It's Lustre b2_8_fe branch with commit 71badf2 on the tip. |
| Comment by James A Simmons [ 07/Sep/16 ] |
|
Any updates? |
| Comment by Alex Zhuravlev [ 08/Sep/16 ] |
|
i'm trying to come up with a debugging patch. |
| Comment by Alex Zhuravlev [ 14/Sep/16 ] |
|
Lustre: 18235:0:(osd_handler.c:1272:osd_trans_dump_creds()) attr_set: 2/2/1, xattr_set: 1011/92/0 the transaction declared at least 2+92+43356+52+17152+9+2+2=60667 credits, while consumed 1+1+1=3 |
| Comment by Andreas Dilger [ 14/Sep/16 ] |
|
My first guess would be to take a look at the ldiskfs xattr inode (wide striping) patch. It may be that it is declaring too many credits in some cases. |
| Comment by nasf (Inactive) [ 23/Sep/16 ] |
The "xattr_set" seems abnormal, under which case the oti_declare_ops[xattr_set] can be so much larger than oti_declare_ops_cred[xattr_set] ? |
| Comment by nasf (Inactive) [ 26/Sep/16 ] |
|
It is suspected that the concurrent multiple llog append operations, such as unlink large striped (large EA) files concurrently, that can fill up a llog file in very short time. Because the llog may be shared (at the same time) by many concurrent modifications, then under some extreme cases, some unlink large striped file operation may have not declared to create llog file, but during the real llog append for recording the striped OST-objects, it finds that the current llog is filled up, and it needs to create new llog file, but because it does not declare related credits, then the lower layer may hit trouble. I am not sure whether it is just such case or not. But since it is reproducible on customer site, I will make a patch for verification. |
| Comment by Gerrit Updater [ 26/Sep/16 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/22726 |
| Comment by Alex Zhuravlev [ 26/Sep/16 ] |
|
Fan Yong, there was no problem with write credits in the customer's case: we do declare additional create: Bar.java if (!IS_ERR_OR_NULL(next)) { if (!llog_exist(next)) { if (dt_object_remote(cathandle->lgh_obj)) { } else { rc = llog_declare_create(env, next, th); llog_declare_write_rec(env, cathandle, &lirec->lid_hdr, -1, th); if you think it's possible to hit a race where we miss to declare additional llog (say, the next was just created), then it makes sense to add an assertion and hit it. |
| Comment by nasf (Inactive) [ 26/Sep/16 ] |
|
The case may be that when the unlink RPC service thread declares, the next llog exists already, then it will not declare the credit for create new llog file, but after the unlink transaction started, related llog file may has been filled up by others (and plus itself). Then it will try to create new llog file that will use the credits declared for subsequent write, then cause some transaction trouble.
I am not sure whether it is the same as the original trouble. According to our current implementation, there are some nested transactions, means one transaction during other transaction declare, that may cause the OSD layer credits information are not correct. That is another issue to be fixed. |
| Comment by Alex Zhuravlev [ 26/Sep/16 ] |
|
check llog_cat_current_log().. when the current is full, then the next is taken and set to NULL. in any case, there is no point to declare yet another "next". we just need to make sure that we've declared at least one. for example, if we find "next" existing, then we can repeat from the beginning. |
| Comment by nasf (Inactive) [ 26/Sep/16 ] |
|
As you can see: int llog_cat_declare_add_rec(const struct lu_env *env,
struct llog_handle *cathandle,
struct llog_rec_hdr *rec, struct thandle *th)
{
...
if (!llog_exist(cathandle->u.chd.chd_current_log)) {
...
next = cathandle->u.chd.chd_next_log;
if (next) {
(1)==> if (!llog_exist(next)) {
/* declare create */
...
}
static struct llog_handle *llog_cat_current_log(struct llog_handle *cathandle,
struct thandle *th)
{
...
loghandle = cathandle->u.chd.chd_next_log;
cathandle->u.chd.chd_current_log = loghandle;
(2)==> cathandle->u.chd.chd_next_log = NULL;
...
}
During above checking, we do NOT take "cathandle->lgh_lock", then it is possible (at least in theory) that the thread1 has already moved to the line (1), then another thread2 sets the "chd_current_log" as "chd_next_log" and set "chd_next_log" as NULL at line (2), and then create related llog object via llog_cat_new_log(). So the thread1 will find the "next" exist, then will not declare to create the "next". Such case should exists, otherwise the line (1) check is redundant, and should be replaced as "LASSERT(!llog_exist(next))". |
| Comment by Alex Zhuravlev [ 26/Sep/16 ] |
|
yes, this is possible, but we declare few creations usually because of striping (those are independed from llog point of view). I think it's unlikely to hit this race often while most of time new llog declarations are not used. |
| Comment by nasf (Inactive) [ 26/Sep/16 ] |
|
Yes, it is unlikely. One important reason is that we seldom test large striped file during our daily Maloo tests, that may hides some corner cases when llog handle switching. In this case, large striping is the most different factor compared with our daily testing. So I suspect such point. As for the write itself optimistic, I have checked related code. The "offset" is the most important parameter to indicate how to optimise. But for llog case, such parameter for write is either "0" (for modify header, including bitmap) or "-1" (for append record). But for these two cases, the OSD layer does very limited optimisation as to I cannot find out suspected points. |
| Comment by Alex Zhuravlev [ 26/Sep/16 ] |
|
if create was missing a declaration or credits, then we'd get a warning or assertion (when enabled). I just checked with skipped declaration and low credits.. |
| Comment by nasf (Inactive) [ 26/Sep/16 ] |
|
Another suspected point: "osd_thandle::ot_credits" is defined as "unsigned short". Is such two-bytes large enough for the credits declaration for large striping operations? Possible overflow? |
| Comment by Alex Zhuravlev [ 26/Sep/16 ] |
|
I was looking on this as well. this can potentially explain our assertion, but not BUG() in JBD2.. |
| Comment by Gerrit Updater [ 28/Sep/16 ] |
|
Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/22782 |
| Comment by Gerrit Updater [ 08/Oct/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22782/ |
| Comment by Peter Jones [ 08/Oct/16 ] |
|
Landed for 2.9 |
| Comment by James A Simmons [ 18/Oct/16 ] |
|
We were able to reproduce the problem at smaller scale and after we applied the |