[LU-3410] mgc_copy_llog()) Failed to copy remote log routed1-OST00b3 (-2) encountered during bring up of a 2.4 file system Created: 28/May/13 Updated: 11/Apr/14 Resolved: 11/Apr/14 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | James A Simmons | Assignee: | Mikhail Pershin |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Environment: | RHEL6.4 running 2.4.0-RC2 |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 8427 |
| Description |
|
At our first attempt to mount the file system we experienced the following assertion on about 40 OSS. The assertion we hit is as follows: May 28 10:36:43 widow-oss11c4 kernel: [ 2743.137026] LDISKFS-fs (dm-12): mounted filesystem with ordered data mode. quota=on. Opts: |
| Comments |
| Comment by Andreas Dilger [ 28/May/13 ] |
|
James, did the same problem happen on a second restart, or did it work on the next try? |
| Comment by Andreas Dilger [ 28/May/13 ] |
|
James, what is the origin of this filesystem? Is this a newly formatted filesystem with 2.4.0 or is it upgraded from some previous version of Lustre? |
| Comment by Andreas Dilger [ 28/May/13 ] |
|
Mike rewrote the MGC llog backup code to use the OSD API recently in http://review.whamcloud.com/5049 (patch is ready to land, but missed "feature freeze" window), so any work in this area should take that into account. I'm not sure it will solve this problem, but it doesn't make sense to work on the old code when it is being replaced. |
| Comment by Oleg Drokin [ 29/May/13 ] |
|
This only happened once, second restart worked. |
| Comment by Mikhail Pershin [ 30/May/13 ] |
|
mgs_fs_setup() may rewrite obd_lvfs_ctxt with new values, and nothing prevents that from happening between a push/pop pair. That could be the reason for this bug. Note that mgs_fs_setup() is called from the obd_mount.c code to set a new bottom fs and can be called multiple times when there are several OSTs on one node. However, this is old code and I am not sure why it happens now. |
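The suspected sequence can be illustrated with a small user-space sketch. Everything in it is a hypothetical stand-in (the `*_stub` types and functions, `run_pair`), not the Lustre implementation; it only assumes that the check at pop time compares the thread's current pwd with the pwd stored in the shared context, as the assertion text quoted later in this ticket suggests. If the shared context is overwritten between the push and the pop, that comparison fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are the real Lustre or kernel types. */
struct dentry_stub { int id; };

struct ctxt_stub {
    struct dentry_stub *pwd;   /* directory a push switches into */
};

struct task_stub {
    struct dentry_stub *pwd;   /* working directory of the running thread */
};

/* Switch the task into the context's directory, remembering the old one. */
static void push_ctxt_stub(struct task_stub *t, struct dentry_stub **saved,
                           const struct ctxt_stub *ctx)
{
    *saved = t->pwd;
    t->pwd = ctx->pwd;
}

/* Restore the old directory. Returns true when the comparison modelled on
 * the ticket's assertion (task pwd == context pwd) would have passed. */
static bool pop_ctxt_stub(struct task_stub *t, struct dentry_stub *saved,
                          const struct ctxt_stub *ctx)
{
    bool ok = (t->pwd == ctx->pwd);
    t->pwd = saved;
    return ok;
}

/* Run one push/pop pair, optionally replacing the shared context's pwd in
 * between, which is what an unprotected repeated setup call could do. */
bool run_pair(bool overwrite_between)
{
    struct dentry_stub home = { 0 }, fs_a = { 1 }, fs_b = { 2 };
    struct ctxt_stub shared = { .pwd = &fs_a };
    struct task_stub task = { .pwd = &home };
    struct dentry_stub *saved = NULL;

    push_ctxt_stub(&task, &saved, &shared);
    if (overwrite_between)
        shared.pwd = &fs_b;    /* shared context rewritten mid-pair */
    bool ok = pop_ctxt_stub(&task, saved, &shared);

    assert(task.pwd == &home); /* the task is always restored */
    return ok;
}
```

Calling `run_pair(false)` models the undisturbed case and `run_pair(true)` the overwritten one. In real code the overwrite would come from another thread, so a fix would need either a lock around the pair or a per-caller copy of the context; which of these applies to Lustre is not stated in this ticket.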
| Comment by Alexey Lyashkov [ 09/Aug/13 ] |
|
Mike, it looks like you are wrong (or it's a second bug with the same assert). ..
00002000:00040000:9.0:1375891841.290003:0:1448:0:(lvfs_linux.c:175:pop_ctxt()) ASSERTION( cfs_fs_pwd(current->fs) == new_ctx->pwd ) failed: ffff88076b4d68c0 != ffff88069df8fd80
..
crash> bt
PID: 1448 TASK: ffff880817656040 CPU: 9 COMMAND: "ldlm_elt"
#0 [ffff880774e7dba8] machine_kexec at ffffffff810310db
#1 [ffff880774e7dc08] crash_kexec at ffffffff810b6332
#2 [ffff880774e7dcd8] panic at ffffffff814d684f
#3 [ffff880774e7dd58] lbug_with_loc at ffffffffa0380ecb [libcfs]
#4 [ffff880774e7dd78] pop_ctxt at ffffffffa0427b77 [lvfs]
#5 [ffff880774e7ddc8] filter_client_del at ffffffffa0ed5637 [obdfilter]
#6 [ffff880774e7de78] filter_disconnect at ffffffffa0ed7820 [obdfilter]
#7 [ffff880774e7dea8] class_fail_export at ffffffffa0648365 [obdclass]
#8 [ffff880774e7dec8] expired_lock_main at ffffffffa07b81b1 [ptlrpc]
#9 [ffff880774e7df48] kernel_thread at ffffffff8100c1ca
But the fs in the task has the correct value:
crash> p *((struct task_struct *)0xffff880817656040)->fs
$119 = {
users = 481,
lock = {
raw_lock = {
lock = 16777216
}
},
umask = 0,
in_exec = 0,
root = {
mnt = 0xffff880817b090c0,
dentry = 0xffff880818e4fa40
},
pwd = {
mnt = 0xffff8808174b6180,
dentry = 0xffff88069df8fd80
}
}
As you can see, fs has the correct value and the assert should not have been hit. There are additional bugs in that area. |
| Comment by Andreas Dilger [ 08/Jan/14 ] |
|
Mike, I also see "13a-8: Failed to get MGS log routed1-OST00b3 and no local copy." in many test logs. This is a misleading error message and should probably be removed, or at least quieted in the common case when a new filesystem is first mounted. |
| Comment by Emoly Liu [ 26/Feb/14 ] |
|
I filed a new ticket So, can we close this one? |
| Comment by Mikhail Pershin [ 11/Apr/14 ] |
|
closing in favor of |