[LU-3302] ll_fill_super() Unable to process log: -2 Created: 09/May/13  Updated: 17/May/13  Resolved: 10/May/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Ned Bass Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: LB
Environment: PPC client
Attachments: lustre.log.vesta-mds1.1368040434.gz, lustre.log.vulcanlac1.1368040616.gz
Issue Links:
Related
is related to LU-3294 osp_sync_llog_init(): ASSERTION( lgh ... Resolved
Severity: 3
Rank (Obsolete): 8173

 Description   

We updated a client to 2.3.64-4chaos and tried to mount a 2.3.63-6chaos server. The mount fails with

LustreError: 15c-8: MGC172.20.20.201@o2ib500: The configuration from log 'fsv-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 14351:0:(llite_lib.c:1043:ll_fill_super()) Unable to process log: -2
Lustre: Unmounted fsv-client
LustreError: 14351:0:(obd_mount.c:1265:lustre_fill_super()) Unable to mount  (-2)
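
The -2 in these messages is a negated errno value (ENOENT, as noted in a later comment). As a quick sanity check, the following minimal Python sketch decodes such a return code with the standard library. It is not part of Lustre, and the helper name `describe_rc` is hypothetical.

```python
import errno
import os

def describe_rc(rc):
    """Decode a negative kernel-style return code into (name, message)."""
    code = -rc
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)

# The rc reported by ll_fill_super() and lustre_fill_super() above.
print(describe_rc(-2))
```

The first element of the result is the symbolic errno name; the message text comes from the local C library and may vary by platform.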

Using git bisect I found the mount failure was introduced with this patch:

http://review.whamcloud.com/#change,5820

LU-2684 fid: unify ostid and FID

The critical questions at this point are:

  • Can we solve this problem by updating both server and client to 2.3.64-4chaos?
  • Can we safely upgrade the server, or does the above patch introduce on-disk format incompatibilities?
  • Will we be able to safely revert the server to 2.3.63 in case we find problems, or will it write new objects in an incompatible format?

LLNL-bug-id: TOSS-2060



 Comments   
Comment by Ned Bass [ 09/May/13 ]

As an editorial comment, while we understand that interoperability issues are inevitable in a pre-release branch, we wish such changes would be advertised more prominently. Clear statements about compatibility between tags would really help us plan our update process. At a minimum, patches that introduce incompatibilities should say so clearly in the commit message.

Comment by Ned Bass [ 09/May/13 ]

Di, can you advise us on this? Thanks

Comment by Andreas Dilger [ 09/May/13 ]

Ned, can you please attach a -1 debug log from the 2.3.64 client, and ideally also one from the MGS?

I agree that the LU-2684 change was problematic, and it was intended to only change the network protocol between clients and OSTs when running DNE. The LU-2888 patch http://review.whamcloud.com/6044 (which was already included in 2.3.64) should have fixed the LLOG handling, so I'm not sure what the exact cause of your problem is. AFAIK, the current master code interoperates with 2.1.5 and 2.3.0 properly, but there might be something specific with your setup that is causing grief.

Comment by Ned Bass [ 09/May/13 ]

Andreas, yes I'll grab the logs.

Note the above error was a 2.3.64 client talking to a 2.3.63 server. Do you mean that patch 6044 fixed LLOG handling on the client, or is it needed on the server as well?

Comment by Ned Bass [ 09/May/13 ]

Attaching -1 debug logs for client and MDS. Note these were not captured from the same mount attempt.

The NID of the client is 172.20.16.10@o2ib500.

Comment by Ned Bass [ 09/May/13 ]

I did notice that the MGS got ENOENT while handling opcodes LLOG_ORIGIN_HANDLE_CREATE and LLOG_ORIGIN_HANDLE_READ_HEADER:

20000000:01000000:6.0:1368040430.786604:0:18265:0:(mgs_handler.c:757:mgs_handle()) @@@ MGS fail to handle opc = 501: rc = -2
  req@ffff881019ecb050 x1434494006460492/t0(0) o501->2e89e428-68d9-71a1-75f0-147bc1963566@172.20.16.10@o2ib500:0/0 lens 296/0 e 0 to 0 dl 1368040491 ref 1 fl Interpret:/0/ffffffff rc 0/-1
...
20000000:01000000:6.0:1368040430.788063:0:18265:0:(mgs_handler.c:757:mgs_handle()) @@@ MGS fail to handle opc = 503: rc = -2
  req@ffff881019f14850 x1434494006460504/t0(0) o503->2e89e428-68d9-71a1-75f0-147bc1963566@172.20.16.10@o2ib500:0/0 lens 272/0 e 0 to 0 dl 1368040491 ref 1 fl Interpret:/0/ffffffff rc 0/-1
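
To pull these failures out of a larger debug log, a small filter along the following lines may help. This is only a sketch: the function name `failed_opcodes` is hypothetical, and pairing 501 and 503 with the two opcode names in the order they are listed above is an assumption based on these two log lines.

```python
import re

# Opcode names as listed in the comment above; the number-to-name pairing
# is an assumption based on the order of the two log lines.
OPCODE_NAMES = {
    501: "LLOG_ORIGIN_HANDLE_CREATE",
    503: "LLOG_ORIGIN_HANDLE_READ_HEADER",
}

FAIL_RE = re.compile(r"MGS fail to handle opc = (\d+): rc = (-?\d+)")

def failed_opcodes(log_text):
    """Return (opcode, name, rc) for each MGS handler failure in log_text."""
    results = []
    for match in FAIL_RE.finditer(log_text):
        opc, rc = int(match.group(1)), int(match.group(2))
        results.append((opc, OPCODE_NAMES.get(opc, "unknown"), rc))
    return results

sample = (
    "(mgs_handler.c:757:mgs_handle()) @@@ MGS fail to handle opc = 501: rc = -2\n"
    "(mgs_handler.c:757:mgs_handle()) @@@ MGS fail to handle opc = 503: rc = -2\n"
)
for entry in failed_opcodes(sample):
    print(entry)
```
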
Comment by Andreas Dilger [ 09/May/13 ]

The 6044/LU-2888 patch fixed the handling on the server, but the original problem patch from LU-2684 wasn't in 2.3.63, so it shouldn't be relevant.

Comment by Peter Jones [ 09/May/13 ]

Di

Could you please comment on this?

Thanks

Peter

Comment by Andreas Dilger [ 09/May/13 ]

Ned, is this a PPC client? It would be useful to include this information in the "Environment" section when filing a bug.

Comment by Ned Bass [ 09/May/13 ]

Yes it is. Sorry for the omission.

Comment by Andreas Dilger [ 09/May/13 ]

John, it was mentioned to me that you have already found some endian issues with the FID-on-OST code. Could you please point out where they are? They might be the source of the problem being seen here, since we didn't see any interoperability problems with our x86_64 clients.

Comment by John Hammond [ 09/May/13 ]

Possibly. Please see LU-3294.

Ned, it would be interesting to know what happens when you create a new 2.3.65 FS on ppc, unmount, and then remount it.

Comment by Ned Bass [ 09/May/13 ]

John, okay, we're getting a test environment set up where I should be able to do that test.

Comment by Ned Bass [ 09/May/13 ]

Haven't tried 2.3.65 yet, but initial testing suggests updating the server to 2.3.64 lets the mount succeed. Here's what I did:

1. Tried to mount a 2.3.62 server from a 2.3.64 PPC client. Fails with "ll_fill_super() Unable to process log: -2"
2. Updated the server to 2.3.64. Mount from 2.3.64 PPC client succeeds.

Comment by Di Wang [ 09/May/13 ]

Ned, I just checked the debug log. It seems the client gets the correct log ID after the swab.

Here is the client log:

00000040:00000001:5.0:1368040600.989913:5152:8187:0:(llog_swab.c:86:lustre_swab_llogd_body()) Process entered
00000040:00001000:5.0:1368040600.989914:5328:8187:0:(llog_swab.c:53:print_llogd_body()) llogd body: c000000f50e9a100
00000040:00001000:5.0:1368040600.989915:5328:8187:0:(llog_swab.c:55:print_llogd_body())         lgd_logid.lgl_oi: 0x6400000000000000:16777216
00000040:00001000:5.0:1368040600.989915:5328:8187:0:(llog_swab.c:56:print_llogd_body())         lgd_logid.lgl_ogen: 0x0
00000040:00001000:5.0:1368040600.989916:5328:8187:0:(llog_swab.c:57:print_llogd_body())         lgd_ctxt_idx: 0x0
00000040:00001000:5.0:1368040600.989917:5328:8187:0:(llog_swab.c:58:print_llogd_body())         lgd_llh_flags: 0x0
00000040:00001000:5.0:1368040600.989917:5328:8187:0:(llog_swab.c:59:print_llogd_body())         lgd_index: 0x0
00000040:00001000:5.0:1368040600.989918:5328:8187:0:(llog_swab.c:60:print_llogd_body())         lgd_saved_index: 0x0
00000040:00001000:5.0:1368040600.989918:5328:8187:0:(llog_swab.c:61:print_llogd_body())         lgd_len: 0x0
00000040:00001000:5.0:1368040600.989919:5328:8187:0:(llog_swab.c:62:print_llogd_body())         lgd_cur_offset: 0x0
00000040:00001000:5.0:1368040600.989920:5328:8187:0:(llog_swab.c:53:print_llogd_body()) llogd body: c000000f50e9a100
00000040:00001000:5.0:1368040600.989920:5328:8187:0:(llog_swab.c:55:print_llogd_body())         lgd_logid.lgl_oi: 0x64:1
00000040:00001000:5.0:1368040600.989921:5328:8187:0:(llog_swab.c:56:print_llogd_body())         lgd_logid.lgl_ogen: 0x0
00000040:00001000:5.0:1368040600.989921:5328:8187:0:(llog_swab.c:57:print_llogd_body())         lgd_ctxt_idx: 0x0
00000040:00001000:5.0:1368040600.989922:5328:8187:0:(llog_swab.c:58:print_llogd_body())         lgd_llh_flags: 0x0
00000040:00001000:5.0:1368040600.989923:5328:8187:0:(llog_swab.c:59:print_llogd_body())         lgd_index: 0x0
00000040:00001000:5.0:1368040600.989923:5328:8187:0:(llog_swab.c:60:print_llogd_body())         lgd_saved_index: 0x0
00000040:00001000:5.0:1368040600.989924:5328:8187:0:(llog_swab.c:61:print_llogd_body())         lgd_len: 0x0
00000040:00001000:5.0:1368040600.989924:5328:8187:0:(llog_swab.c:62:print_llogd_body())         lgd_cur_offset: 0x0
00000040:00000001:5.0:1368040600.989925:5152:8187:0:(llog_swab.c:97:lustre_swab_llogd_body()) Process leaving

But somehow the server cannot find the log object by this ID. Unfortunately, I cannot find the corresponding MGS handling information in the MDS debug log. Could you please redo the test and update the debug log?

In the meantime, I do see some problems in the logid swab (John also pointed one out in LU-3294). I will cook up a patch now.
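
The two lgd_logid.lgl_oi values in the client log above (before and after lustre_swab_llogd_body()) are consistent with a plain byte-order reversal. The sketch below checks that arithmetic in Python. It does not reproduce the Lustre function, the helper names `swab64` and `swab32` are hypothetical, and treating the second field as 32 bits wide is an assumption inferred from the printed values, not something the ticket confirms.

```python
import struct

def swab64(value):
    """Reverse the byte order of an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack(">Q", value))[0]

def swab32(value):
    """Reverse the byte order of an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack(">I", value))[0]

# Values printed before the swab in the client log above.
# Assumption: the second field is swabbed as a 32-bit quantity.
first_before, second_before = 0x6400000000000000, 16777216

print(hex(swab64(first_before)), swab32(second_before))  # prints: 0x64 1
```

The result matches the 0x64:1 printed after the swab, which supports the observation that the client ends up with the expected log ID.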

Comment by Di Wang [ 09/May/13 ]

http://review.whamcloud.com/#change,6305

Comment by John Hammond [ 09/May/13 ]

Ned, would you confirm that these are x86_64 servers and ppc/ppc64 clients? In that case it's unlikely that you're affected by LU-3294, since that issue is probably limited to big-endian (BE) servers.

Comment by Ned Bass [ 09/May/13 ]

Yes, these are x86_64 servers and ppc64 clients.

Also, if it is an unfixed swabbing bug, I would expect the mount to also fail with 2.3.64 servers.

Comment by Jodi Levi (Inactive) [ 10/May/13 ]

Now that this patch has landed, can we get confirmation that this is fixed?
Thank you!

Comment by Ned Bass [ 10/May/13 ]

With the patch, a 2.3.64 PPC client can mount from a 2.3.63 server. So this appears to be fixed. Thanks

Comment by Jodi Levi (Inactive) [ 10/May/13 ]

Based on the latest comments, this patch has landed and fixed the issue. Closing ticket.

Generated at Sat Feb 10 01:32:43 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.