[LU-5107] MDS oops during mount with latest lustre 2.5.1 snapshot Created: 27/May/14  Updated: 30/May/14  Resolved: 30/May/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.1
Fix Version/s: Lustre 2.5.2

Type: Bug Priority: Blocker
Reporter: James A Simmons Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: None
Environment:

MDS server


Issue Links:
Related
is related to LU-4413 Test failure on test suite conf-sanit... Resolved
Severity: 3
Rank (Obsolete): 14091

 Description   

With the latest 2.5.1 snapshot, when I attempt to bring up a file system I see the following bug on the MDS during the MDT mount. Because of this I can't currently mount a 2.5 file system for testing.

May 27 16:55:19 tick-dne-mds1 kernel: [ 546.512335] LustreError: 13869:0:(osp_dev.c:864:osp_prepare_fid_client()) ASSERTION( osp->opd_obd->u.cli.cl_seq != ((void *)0) ) failed:
May 27 16:55:19 tick-dne-mds1 kernel: [ 546.548335] LustreError: 13869:0:(osp_dev.c:864:osp_prepare_fid_client()) LBUG
May 27 16:55:19 tick-dne-mds1 kernel: [ 546.569232] Pid: 13869, comm: ptlrpcd_rcv
May 27 16:55:19 tick-dne-mds1 kernel: [ 546.579503]
May 27 16:55:19 tick-dne-mds1 kernel: [ 546.579503] Call Trace:
May 27 16:55:19 tick-dne-mds1 kernel: [ 546.598249] [<ffffffffa05f3895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
May 27 16:55:19 tick-dne-mds1 kernel: [ 546.618857] [<ffffffffa05f3e97>] lbug_with_loc+0x47/0xb0 [libcfs]
May 27 16:55:19 tick-dne-mds1 kernel: [ 546.638835] [<ffffffffa108ee34>] osp_import_event+0x3d4/0x410 [osp]
May 27 16:55:19 tick-dne-mds1 kernel: [ 546.659079] [<ffffffffa09207cc>] ptlrpc_activate_import+0x12c/0x270 [ptlrpc]
May 27 16:55:19 tick-dne-mds1 kernel: [ 546.688094] [<ffffffffa0923502>] ptlrpc_connect_interpret+0x1912/0x2160 [ptlrpc]
May 27 16:55:19 tick-dne-mds1 kernel: [ 546.708910] [<ffffffffa08f894c>] ptlrpc_check_set+0x2bc/0x1b50 [ptlrpc]
May 27 16:55:19 tick-dne-mds1 kernel: [ 546.729238] [<ffffffffa0924cab>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
May 27 16:55:20 tick-dne-mds1 kernel: [ 546.749421] [<ffffffffa09251cb>] ptlrpcd+0x20b/0x370 [ptlrpc]
May 27 16:55:20 tick-dne-mds1 kernel: [ 546.769286] [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
May 27 16:55:20 tick-dne-mds1 kernel: [ 546.789361] [<ffffffffa0924fc0>] ? ptlrpcd+0x0/0x370 [ptlrpc]
May 27 16:55:20 tick-dne-mds1 kernel: [ 546.809414] [<ffffffff8109ab56>] kthread+0x96/0xa0
May 27 16:55:20 tick-dne-mds1 kernel: [ 546.828800] [<ffffffff8100c20a>] child_rip+0xa/0x20
May 27 16:55:20 tick-dne-mds1 kernel: [ 546.848257] [<ffffffff8109aac0>] ? kthread+0x0/0xa0
May 27 16:55:20 tick-dne-mds1 kernel: [ 546.859345] [<ffffffff8100c200>] ? child_rip+0x0/0x20
May 27 16:55:20 tick-dne-mds1 kernel: [ 546.879015]



 Comments   
Comment by James Nunez (Inactive) [ 27/May/14 ]

Di,

Would you please comment on this ticket?

Thank you,
James

Comment by Di Wang [ 27/May/14 ]

James,

Did you set up Lustre with a single MDT or DNE? Are there any other console error messages? Could you tell me which build you are using? Is it a newly formatted FS? Do you have the dump log for this LBUG?

Thank you.
WangDi

Comment by James A Simmons [ 28/May/14 ]

I tried a build with a few extra patches. Then I tried the tip of b2_5 and hit the same problem. Yes, it is a DNE setup with 3 MDS servers. When I encountered this error I was using an already formatted 2.5 file system. I later reformatted to make sure that was not the issue, but the MDS oops was still there. I found that reverting LU-4413 appears to make the problem go away.
I don't think reverting that patch is the solution. I have placed the dmesg log and vmcore dump at ftp.whamcloud.com/uploads/LU-5107.

Comment by Andreas Dilger [ 28/May/14 ]

James, there are two patches on LU-4413:

Which one did you revert to fix the problem?

Comment by James A Simmons [ 28/May/14 ]

I reverted patch 8997.

Comment by Di Wang [ 28/May/14 ]

Hmm, there are some problems with 8997 when it is ported to 2.5. Since we do not need the OSP (for MDT) to allocate FIDs, osp_prepare_fid_client(d) needs to be moved after the if (d->opd_connect_mdt) check in osp_import_event.

diff --git a/lustre/osp/osp_dev.c b/lustre/osp/osp_dev.c
index a4a2f90..15f2ec0 100644
--- a/lustre/osp/osp_dev.c
+++ b/lustre/osp/osp_dev.c
@@ -1053,15 +1053,16 @@ static int osp_import_event(struct obd_device *obd, struct obd_import *imp,
        case IMP_EVENT_ACTIVE:
                d->opd_imp_active = 1;
 
-               if (osp_prepare_fid_client(d) != 0)
-                       break;
-
                if (d->opd_got_disconnected)
                        d->opd_new_connection = 1;
                d->opd_imp_connected = 1;
                d->opd_imp_seen_connected = 1;
                if (d->opd_connect_mdt)
                        break;
+
+               if (osp_prepare_fid_client(d) != 0)
+                       break;
+
                wake_up(&d->opd_pre_waitq);
                __osp_sync_check_for_work(d);
                CDEBUG(D_HA, "got connected\n");

This will probably fix the problem. I will cook a patch.
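
To make the effect of the reordering in the diff above easier to follow, here is a minimal user-space sketch. It is not Lustre code: `struct osp_model` and the `model_*` functions are hypothetical stand-ins, only the fields named in the diff are modelled, and the failed assertion is recorded in a flag instead of triggering an LBUG. It also assumes, as the comment above suggests, that an OSP connecting to another MDT has no `cl_seq` set up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the OSP device state. */
struct osp_model {
	int   opd_connect_mdt;      /* 1 if this OSP connects to another MDT */
	int   opd_imp_active;
	int   opd_imp_connected;
	int   opd_got_disconnected;
	int   opd_new_connection;
	void *cl_seq;               /* stands in for opd_obd->u.cli.cl_seq */
	int   lbug_hit;             /* set where the real code would LBUG */
	int   woke_precreate;       /* set where the real code wakes opd_pre_waitq */
};

/* Stand-in for osp_prepare_fid_client(): the real function asserts that
 * cl_seq is non-NULL; here the failed assertion is only recorded. */
int model_prepare_fid_client(struct osp_model *d)
{
	if (d->cl_seq == NULL) {
		d->lbug_hit = 1;
		return -1;
	}
	return 0;
}

/* IMP_EVENT_ACTIVE handling with the ordering before the fix. */
void model_import_active_old(struct osp_model *d)
{
	d->opd_imp_active = 1;
	if (model_prepare_fid_client(d) != 0)
		return;
	if (d->opd_got_disconnected)
		d->opd_new_connection = 1;
	d->opd_imp_connected = 1;
	if (d->opd_connect_mdt)
		return;
	d->woke_precreate = 1;
}

/* IMP_EVENT_ACTIVE handling with the ordering after the fix: the FID
 * client is prepared only once the MDT-to-MDT case has returned. */
void model_import_active_new(struct osp_model *d)
{
	d->opd_imp_active = 1;
	if (d->opd_got_disconnected)
		d->opd_new_connection = 1;
	d->opd_imp_connected = 1;
	if (d->opd_connect_mdt)
		return;
	if (model_prepare_fid_client(d) != 0)
		return;
	d->woke_precreate = 1;
}

/* Self-check for the MDT-to-MDT case (cl_seq left NULL). */
void model_check_mdt_case(void)
{
	struct osp_model before = { .opd_connect_mdt = 1 };
	struct osp_model after  = { .opd_connect_mdt = 1 };

	model_import_active_old(&before);
	model_import_active_new(&after);

	assert(before.lbug_hit == 1);
	assert(after.lbug_hit == 0 && after.opd_imp_connected == 1);
}
```

In this model, calling `model_check_mdt_case()` passes: the old ordering reaches the assertion for an MDT-connected OSP, while the new ordering marks the import connected and returns before the FID client is touched.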

Comment by Di Wang [ 28/May/14 ]

http://review.whamcloud.com/10476

Comment by James A Simmons [ 29/May/14 ]

The patch appears to have resolved the issue. Thank you.

Comment by Andreas Dilger [ 29/May/14 ]

Problem was caused by backport of patch http://review.whamcloud.com/9875 to b2_5.

Comment by Peter Jones [ 30/May/14 ]

Landed for 2.5.2. As I understand it, this issue only affected b2_5, so the fix is not needed on other branches.

Generated at Sat Feb 10 01:48:35 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.