LU-7138: LBUG: (osd_handler.c:1017:osd_trans_start()) ASSERTION( get_current()->journal_info == ((void *)0) ) failed:

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Severity: 1

    Description

This evening we hit this LBUG on the MDT in our production file system. The file system is currently down, as we hit the same bug every time we attempt to bring the MDT back, as soon as recovery finishes.

      <0>LustreError: 722:0:(osd_handler.c:1017:osd_trans_start()) ASSERTION( get_current()->journal_info == ((void *)0) ) failed:
      <0>LustreError: 722:0:(osd_handler.c:1017:osd_trans_start()) LBUG
      <4>Pid: 722, comm: mdt01_017
      <4>
      <4>Call Trace:
      <4> [<ffffffffa065f895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa065fe97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa17df24d>] osd_trans_start+0x25d/0x660 [osd_ldiskfs]
      <4> [<ffffffffa09b9b4a>] llog_osd_destroy+0x42a/0xd40 [obdclass]
      <4> [<ffffffffa09b2edc>] llog_cat_new_log+0x1ec/0x710 [obdclass]
      <4> [<ffffffffa09b350a>] llog_cat_add_rec+0x10a/0x450 [obdclass]
      <4> [<ffffffffa09ab1e9>] llog_add+0x89/0x1c0 [obdclass]
      <4> [<ffffffffa17f1976>] ? osd_attr_set+0x166/0x460 [osd_ldiskfs]
      <4> [<ffffffffa0d914e2>] mdd_changelog_store+0x122/0x290 [mdd]
      <4> [<ffffffffa0da4d0c>] mdd_changelog_data_store+0x16c/0x320 [mdd]
      <4> [<ffffffffa0dad9b3>] mdd_attr_set+0x12f3/0x1730 [mdd]
      <4> [<ffffffffa088a551>] mdt_reint_setattr+0xf81/0x13a0 [mdt]
      <4> [<ffffffffa087be1c>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
      <4> [<ffffffffa08801dd>] mdt_reint_rec+0x5d/0x200 [mdt]
      <4> [<ffffffffa086423b>] mdt_reint_internal+0x4cb/0x7a0 [mdt]
      <4> [<ffffffffa08649ab>] mdt_reint+0x6b/0x120 [mdt]
      <4> [<ffffffffa0c6f56e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
      <4> [<ffffffffa0c1f5a1>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
      <4> [<ffffffff8106c4f0>] ? pick_next_task_fair+0xd0/0x130
      <4> [<ffffffffa0c1e760>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
      <4> [<ffffffff8109e66e>] kthread+0x9e/0xc0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      

The stack trace doesn't quite seem to be the same as the one in LU-6634 (which in any case has no suggested fix).

Activity
            green Oleg Drokin added a comment -

Can you also check that all the files in the LOGS/ dir are root-owned? What are the sizes of the 10 most recently written (based on mtime) ones?
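
            (For reference, a minimal sketch of such a check, assuming the ldiskfs mount point /lustre/lustre03/mdt seen in the transcripts below:)

            # any files in LOGS/ not owned by root
            find /lustre/lustre03/mdt/LOGS -maxdepth 1 ! -user root -ls
            # sizes of the 10 most recently modified files (by mtime, newest first)
            ls -lt /lustre/lustre03/mdt/LOGS/ | head -n 11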


ferner Frederik Ferner (Inactive) added a comment - edited

quota shows no quota for any user on the MDT when mounted as ldiskfs, and yes, I can still write a file:

            [bnh65367@cs04r-sc-mds03-01 ~]$ sudo dd if=/dev/zero of=/lustre/lustre03/mdt/LOGS/aaa bs=1024k count=1
            1+0 records in
            1+0 records out
            1048576 bytes (1.0 MB) copied, 0.000997444 s, 1.1 GB/s
            [bnh65367@cs04r-sc-mds03-01 ~]$
            

            (just to be sure, I've also checked the output of df -hl and df -hil for all OSS nodes and all OSTs have space as well.)

            green Oleg Drokin added a comment -

If you mount it as ldiskfs, you should be able to see quota info with the standard quota tools, I imagine, e.g. quota.

Also, can you write files in there? E.g. can you do dd if=/dev/zero of=/mnt/mdt-mountpoint/LOGS/aaa bs=1024k count=1?
(Remove /mnt/mdt-mountpoint/LOGS/aaa afterwards.)
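
            (For reference, a sketch of the quota check on the ldiskfs-mounted MDT, using the standard Linux quota tools; the mount point would be the one from the transcripts elsewhere in this ticket:)

            # per-user limits for root, if quota is active on the mount
            quota -u root
            # usage and limits for all users on filesystems mounted with quota options
            repquota -a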

            green Oleg Drokin added a comment -

            Aha, reading into llog_osd_write_rec I see this nice piece:

                    /* if it's the last idx in log file, then return -ENOSPC */
                    if (loghandle->lgh_last_idx >= LLOG_BITMAP_SIZE(llh) - 1)
                            RETURN(-ENOSPC);
            

I wonder if this is what you are hitting... Hmmm...
In fact, looking at the backtrace, we are coming from llog_cat_add_rec, where we have:

                    rc = llog_write_rec(env, loghandle, rec, reccookie, LLOG_NEXT_IDX, th);
                    if (rc < 0)
                            CDEBUG_LIMIT(rc == -ENOSPC ? D_HA : D_ERROR,
                                         "llog_write_rec %d: lh=%p\n", rc, loghandle);
                    up_write(&loghandle->lgh_lock);
                    if (rc == -ENOSPC) {
                            /* try to use next log */
                            loghandle = llog_cat_current_log(cathandle, th);
                            LASSERT(!IS_ERR(loghandle));
                            /* new llog can be created concurrently */
                            if (!llog_exist(loghandle)) {
                                    rc = llog_cat_new_log(env, cathandle, loghandle, th);
                                    if (rc < 0) {
                                            up_write(&loghandle->lgh_lock);
                                            RETURN(rc);
                                    }
                            }
            

So we caught -ENOSPC and went on to create a new llog, but got another -ENOSPC there as well. At that point the llog should already be the new one, so it cannot simply be the end of the previous llog being full; the writes appear to be genuinely failing.
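
            (For reference: on the ldiskfs-mounted MDT, an individual on-disk llog file can be dumped with the llog_reader utility shipped with the Lustre tools, to see how many of its record slots are in use; the path argument below is a placeholder:)

            # print the llog header and records of one on-disk llog file
            llog_reader <path-to-llog-file-on-ldiskfs-mount>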


ferner Frederik Ferner (Inactive) added a comment -

I have checked inodes as well:

            [bnh65367@cs04r-sc-mds03-01 127.0.0.1-2015-09-10-22:05:04]$ df -hi
            Filesystem           Inodes IUsed IFree IUse% Mounted on
            /dev/mapper/vg_lustre03-mdt
                                   484M  128M  357M   27% /lustre/lustre03/mdt
            /dev/mapper/vg_lustre03-mgs
                                   125K   218  125K    1% /lustre/mgs
            

And I don't think we have any quota limits set at all. How would I check this without having the file system up? (For example, on the ldiskfs-mounted MDT?)

            green Oleg Drokin added a comment -

            "lop_write_rec error -28" means you are out of free space.
            Can ou check that you have enough inodes too? (df -i). Is there a quota set for the root user by any chance?

            green Oleg Drokin added a comment - "lop_write_rec error -28" means you are out of free space. Can ou check that you have enough inodes too? (df -i). Is there a quota set for the root user by any chance?

ferner Frederik Ferner (Inactive) added a comment - edited

After updating Lustre and mounting the MDT again, I got this on the console:

            Lustre: lustre03-MDT0000: Denying connection for new client lustre03-MDT0000-lwp-OST0001_UUID (at 10.144.144.31@o2ib), waiting for all 354 known clients (139 recovered, 170 in progress, and 0 evicted) to recove0
            Lustre: Skipped 19 previous similar messages
            Lustre: Skipped 3 previous similar messages
            Lustre: lustre03-MDT0000: Client 211c78af-7674-5b7e-5f34-3d38b509f421 (at 10.144.148.31@o2ib) reconnecting, waiting for 354 clients in recovery for 2:44
            Lustre: Skipped 1895 previous similar messages
            LustreError: 10692:0:(ldlm_lib.c:1748:check_for_next_transno()) lustre03-MDT0000: waking for gap in transno, VBR is OFF (skip: 370152455116, ql: 37, comp: 317, conn: 354, next: 370152455496, last_committed: 370)
            LustreError: 10692:0:(llog.c:726:llog_write_rec()) lop_write_rec error -28
            LustreError: 10692:0:(llog.c:726:llog_write_rec()) Skipped 13 previous similar messages
            LustreError: 10692:0:(osd_handler.c:1017:osd_trans_start()) ASSERTION( get_current()->journal_info == ((void *)0) ) failed: 
            LustreError: 10692:0:(osd_handler.c:1017:osd_trans_start()) LBUG
            

I'm also not having much luck with the vmcore; it appears I cannot find the right "NAMELIST" (vmlinux? Do I need the kernel-debuginfo RPM for the kernel we've been running?), so I'm thinking of uploading it somewhere in the hope that you could look at it.

            vmcore file now at ftp://ftpanon.diamond.ac.uk/LU-7138/vmcore-LU_7138

I need to get some sleep now, but would appreciate it if this could still be looked at. I'll be back in approximately 5 hours, and users will start getting very impatient shortly after that, so an idea of how to get this file system back up (sanely) would be really good at that point.

            green Oleg Drokin added a comment -

You can use the crash tool to load the vmcore, then use mod -S to load the Lustre module symbol information, and also the xbt crash module to see local variable contents.
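
            For reference, a rough outline of such a crash session (paths are illustrative; the vmlinux with debug symbols normally comes from the matching kernel-debuginfo RPM):

            # load the dump against the matching debug vmlinux
            crash /usr/lib/debug/lib/modules/<kernel-version>/vmlinux vmcore
            # inside crash: load module symbols, searching the given directory
            crash> mod -S /lib/modules/<kernel-version>/extra
            # backtrace of the panicking task
            crash> bt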


ferner Frederik Ferner (Inactive) added a comment -

Ok, I've started the build for a Lustre version with the debug patch.

I have a vmcore from the initial LBUG kernel panic; is this potentially useful? (At the moment I'm not sure how I would go about extracting information from it, though.)

green Oleg Drokin added a comment - edited

If you apply a patch like the one below, it won't really fix anything, but at least we'll know where the error comes from and what it was (if you have a crashdump from the crash, there's a chance you can extract information from it too).

            diff --git a/lustre/obdclass/llog.c b/lustre/obdclass/llog.c
            index c34e179..680dae1 100644
            --- a/lustre/obdclass/llog.c
            +++ b/lustre/obdclass/llog.c
            @@ -702,12 +702,16 @@ int llog_write_rec(const struct lu_env *env, struct llog_handle *handle,
             	ENTRY;
             
             	rc = llog_handle2ops(handle, &lop);
            -	if (rc)
            +	if (rc) {
            +		CERROR("llog_handle2ops error %d\n", rc);
             		RETURN(rc);
            +	}
             
             	LASSERT(lop);
            -	if (lop->lop_write_rec == NULL)
            +	if (lop->lop_write_rec == NULL) {
            +		CERROR("lop->lop_write_rec == NULL");
             		RETURN(-EOPNOTSUPP);
            +	}
             
             	buflen = rec->lrh_len;
             	LASSERT(cfs_size_round(buflen) == buflen);
            @@ -718,6 +722,8 @@ int llog_write_rec(const struct lu_env *env, struct llog_handle *handle,
             	rc = lop->lop_write_rec(env, handle, rec, logcookies, idx, th);
             	if (!raised)
             		cfs_cap_lower(CFS_CAP_SYS_RESOURCE);
            +	if (rc)
            +		CERROR("lop_write_rec error %d\n", rc);
             	RETURN(rc);
             }
             
            

The patch is only needed on the MDS, btw.


ferner Frederik Ferner (Inactive) added a comment - edited

There's nothing obvious in syslog. The full contents of /var/log/messages before the crash/reboot are these:

            Sep  7 08:52:49 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x2000115a3:0x662:0x0] != self_fid [0x200010dc1:0xd167:0x0]
            Sep  7 11:25:06 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x200010dbe:0x9e03:0x0] != self_fid [0x200010dbe:0x9e09:0x0]
            Sep  7 12:03:51 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x200011586:0x283f:0x0] != self_fid [0x200011582:0x41a3:0x0]
            Sep  7 13:14:15 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x200010dab:0x198d5:0x0] != self_fid [0x200011602:0x1c22e:0x0]
            Sep  7 13:30:42 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x200011580:0x9afc:0x0] != self_fid [0x200010da9:0x16251:0x0]
            Sep  7 14:10:07 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x20001157b:0xf7a6:0x0] != self_fid [0x200011581:0x152fe:0x0]
            Sep  7 15:47:13 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x2000115b2:0x4cab:0x0] != self_fid [0x2000115b1:0x4cab:0x0]
            Sep  7 18:17:46 cs04r-sc-mds03-01 kernel: Lustre: MGS: Client d6e80b71-1eca-2fca-7854-d8ad02c40814 (at 172.23.116.33@tcp) reconnecting
            Sep  7 18:17:46 cs04r-sc-mds03-01 kernel: Lustre: Skipped 1 previous similar message
            Sep  7 18:18:46 cs04r-sc-mds03-01 kernel: Lustre: MGS: Client d6e80b71-1eca-2fca-7854-d8ad02c40814 (at 172.23.116.33@tcp) reconnecting
            Sep  8 11:05:29 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000: Client a1185e16-56ed-6244-ef01-edb7c59bbde2 (at 172.23.146.38@tcp) reconnecting
            Sep  8 11:05:29 cs04r-sc-mds03-01 kernel: Lustre: Skipped 2 previous similar messages
            Sep  8 11:06:11 cs04r-sc-mds03-01 kernel: Lustre: MGS: Client 847ec63e-7411-2e53-d492-aca9e17e6c84 (at 172.23.146.38@tcp) reconnecting
            Sep  8 11:06:11 cs04r-sc-mds03-01 kernel: Lustre: Skipped 1 previous similar message
            Sep  8 11:06:25 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000: Client a1185e16-56ed-6244-ef01-edb7c59bbde2 (at 172.23.146.38@tcp) reconnecting
            Sep  8 11:07:19 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000: Client a1185e16-56ed-6244-ef01-edb7c59bbde2 (at 172.23.146.38@tcp) reconnecting
            Sep  8 11:07:19 cs04r-sc-mds03-01 kernel: Lustre: Skipped 1 previous similar message
            Sep  8 11:11:41 cs04r-sc-mds03-01 kernel: Lustre: MGS: Client 847ec63e-7411-2e53-d492-aca9e17e6c84 (at 172.23.146.38@tcp) reconnecting
            Sep  8 11:17:00 cs04r-sc-mds03-01 kernel: Lustre: MGS: Client 847ec63e-7411-2e53-d492-aca9e17e6c84 (at 172.23.146.38@tcp) reconnecting
            Sep  8 16:15:16 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x2000115a4:0xa9e9:0x0] != self_fid [0x200011626:0x1d353:0x0]
            Sep  8 17:23:12 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000: Client a1185e16-56ed-6244-ef01-edb7c59bbde2 (at 172.23.146.38@tcp) reconnecting
            Sep  8 17:23:27 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000: Client a1185e16-56ed-6244-ef01-edb7c59bbde2 (at 172.23.146.38@tcp) reconnecting
            Sep  8 18:33:55 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x200010d9e:0x196cf:0x0] != self_fid [0x20001157c:0x10445:0x0]
            Sep  8 20:37:36 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x20001163d:0x29e3:0x0] != self_fid [0x200010db6:0x1c6f6:0x0]
            Sep  9 00:32:12 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x200011640:0x407d:0x0] != self_fid [0x20001157f:0x1411a:0x0]
            Sep  9 01:07:50 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000: Client a1185e16-56ed-6244-ef01-edb7c59bbde2 (at 172.23.146.38@tcp) reconnecting
            Sep  9 02:36:05 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x2000115bf:0x1b19:0x0] != self_fid [0x200011570:0x1d5c:0x0]
            Sep 10 10:21:18 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x200011645:0x8b5e:0x0] != self_fid [0x20001163a:0xcf49:0x0]
            Sep 10 10:29:03 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x20001163d:0x1dc2d:0x0] != self_fid [0x20001163d:0x1dc37:0x0]
            Sep 10 16:48:08 cs04r-sc-mds03-01 kernel: Lustre: lustre03-MDT0000-osd: FID [0x200011668:0x739b:0x0] != self_fid [0x20001167c:0x1501f:0x0]
            

            Mounting the MDT as ldiskfs works and there is lots of space free:

            [bnh65367@cs04r-sc-mds03-01 127.0.0.1-2015-09-10-22:05:04]$ df -h
            /dev/mapper/vg_lustre03-mgs
                                  469M  2.7M  441M   1% /lustre/mgs
            /dev/mapper/vg_lustre03-mdt
                                  1.5T  103G  1.3T   8% /lustre/lustre03/mdt
            [bnh65367@cs04r-sc-mds03-01 127.0.0.1-2015-09-10-22:05:04]$ df -hi
            Filesystem           Inodes IUsed IFree IUse% Mounted on
            /dev/mapper/vg_lustre03-mdt
                                   484M  128M  357M   27% /lustre/lustre03/mdt
            /dev/mapper/vg_lustre03-mgs
                                   125K   218  125K    1% /lustre/mgs
            
            

Is there another option to get the file system back online without the "at your own risk" patch? I'd rather not find out that half-written logs are completely trashing our file system...


People

    Assignee: green Oleg Drokin
    Reporter: ferner Frederik Ferner (Inactive)
    Votes: 0
    Watchers: 9