[LU-3294] osp_sync_llog_init(): ASSERTION( lgh != ((void *)0) ) failed Created: 08/May/13  Updated: 11/Jun/13  Resolved: 11/Jun/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Minor
Reporter: Li Wei (Inactive) Assignee: John Hammond
Resolution: Fixed Votes: 0
Labels: endianness, sparc

Issue Links:
Related
is related to LU-3302 ll_fill_super() Unable to process log... Resolved
Severity: 3
Rank (Obsolete): 8157

 Description   

When starting an MDT on a SPARC MDS, this assertion failure occurred:

Lustre: lustre-MDT0000: used disk, loading
Lustre: lustre-OST0000-osc-MDT0000: Init llog for 0 - catid 0x2:0:0
LustreError: 11309:0:(osp_sync.c:963:osp_sync_llog_init()) ASSERTION( lgh != ((void *)0) ) failed:
LustreError: 11309:0:(osp_sync.c:963:osp_sync_llog_init()) LBUG
Pid: 11309, comm: llog_process_th

Call Trace:

Kernel panic - not syncing: LBUG
Call Trace:
 [0000000010181194] lbug_with_loc+0x94/0xc0 [libcfs]
 [0000000010dc28c8] osp_sync_llog_init+0xa28/0xc00 [osp]
 [0000000010dc6d78] osp_sync_init+0x1f8/0xbe0 [osp]
 [0000000010daf51c] osp_device_alloc+0x4d7c/0x5c40 [osp]
 [000000001033a500] class_setup+0x6e0/0xf00 [obdclass]
 [000000001033da58] class_process_config+0x1738/0x5180 [obdclass]
[...]

According to the "catid" printed, I guess the FID of the log must be [1:2:0]. The problem is in the definition of ost_id:

struct ost_id { 
        union {
                struct ostid {
                        __u64   oi_id;
                        __u64   oi_seq;
                } oi;
                struct lu_fid oi_fid;
        };      
};

When fid_to_logid() assigns a 64-bit sequence number to oi_seq, which 32 bits land in f_oid and which in f_ver depends on the endianness of the MDS. On the SPARC MDS, FID_SEQ_LLOG ends up in f_ver, so ostid_id() returns 0 even though the log ID as a whole is nonzero. Together, these caused osp_sync_llog_init() to neither open nor re-create the log.



 Comments   
Comment by John Hammond [ 08/May/13 ]

After 725f3f8e it looks like we should use logid_id(&osi->osi_cid.lci_logid) instead of ostid_id(&osi->osi_cid.lci_logid.lgl_oi) in osp_sync_llog_init() and similarly elsewhere. Di, can you comment?

Comment by Di Wang [ 09/May/13 ]

John, yes, you are right. It should be logid_id() instead of ostid_id(). Sigh, I had thought all of these had been reverted to logid_id() in LU-2888.

Comment by John Hammond [ 10/May/13 ]

The proposed change for osp_sync_llog_init() has been rolled into http://review.whamcloud.com/#change,6305.

However, there are still some spots where ostid_id() is being applied to lgl_oi, and similar combinations. This only affects big-endian servers and so is not an issue for common setups (including LLNL's x86_64 servers with ppc64 clients).

If I understand this code correctly, it may be useful to add assertions or tracing to check that:

In logid_id() the seq is FID_SEQ_LLOG or FID_SEQ_LLOG_NAME.

Same for logid_set_id().

In ostid_id() the seq is not FID_SEQ_LLOG or FID_SEQ_LLOG_NAME.

Same for ostid_set_id().

A double swab of an ost_id/llog_logid, for various valid seqs and oids (big and small), is the identity.

This may require fixing up POSTID, DOSTID, ... or replacing them with PLOGID(), DLOGID(), ...

Comment by John Hammond [ 13/May/13 ]

The proposed change to osp_sync_llog_init() was landed to master as part of http://review.whamcloud.com/6305.

Comment by John Hammond [ 11/Jun/13 ]

Patch landed.
