[LU-3278] osd_internal.h:747:osd_fid2oi()) ASSERTION( osd->od_oi_table != ((void *)0) && osd->od_oi_count >= 1 ) failed Created: 05/May/13 Updated: 07/Nov/19 Resolved: 07/May/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Oleg Drokin | Assignee: | nasf (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 8115 | ||||||||
| Description |
|
Running recovery-small.sh hit this very fast after test start.

[ 838.403817] Lustre: DEBUG MARKER: == replay-single test 2b: touch == 00:56:23 (1367729783)
[ 839.177827] LustreError: 6825:0:(osd_handler.c:1119:osd_ro()) *** setting lustre-MDT0000 read-only ***
[ 839.178349] Turning device loop0 (0x700000) read-only
[ 839.231722] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
[ 839.261830] Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
[ 839.589175] Lustre: Failing over lustre-MDT0000
[ 839.716258] LustreError: 6547:0:(osd_internal.h:747:osd_fid2oi()) ASSERTION( osd->od_oi_table != ((void *)0) && osd->od_oi_count >= 1 ) failed:
[ 839.716797] LustreError: 6547:0:(osd_internal.h:747:osd_fid2oi()) LBUG
[ 839.717086] Pid: 6547, comm: ll_mgs_0002
[ 839.717336]
[ 839.717337] Call Trace:
[ 839.717752] [<ffffffffa04018a5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[ 839.718105] [<ffffffffa0401ea7>] lbug_with_loc+0x47/0xb0 [libcfs]
[ 839.718362] [<ffffffffa0c51911>] __osd_oi_lookup+0x2f1/0x3e0 [osd_ldiskfs]
[ 839.718654] [<ffffffffa0c51aa6>] osd_oi_lookup+0xa6/0x150 [osd_ldiskfs]
[ 839.718941] [<ffffffffa0c49d60>] osd_object_init+0x5a0/0xcb0 [osd_ldiskfs]
[ 839.719261] [<ffffffffa05ad3ed>] lu_object_alloc+0xcd/0x300 [obdclass]
[ 839.719551] [<ffffffffa05ad769>] ? htable_lookup+0x119/0x1c0 [obdclass]
[ 839.719817] [<ffffffffa05adf55>] lu_object_find_at+0x205/0x360 [obdclass]
[ 839.720074] [<ffffffff814fc8bc>] ? __mutex_lock_slowpath+0x21c/0x2c0
[ 839.720342] [<ffffffffa05b04ca>] dt_locate_at+0x3a/0x140 [obdclass]
[ 839.720602] [<ffffffffa058ec2f>] llog_osd_open+0x14f/0xbf0 [obdclass]
[ 839.720895] [<ffffffffa055a28d>] llog_open+0xbd/0x2d0 [obdclass]
[ 839.721206] [<ffffffffa07501a2>] llog_origin_handle_read_header+0x162/0x5e0 [ptlrpc]
[ 839.721671] [<ffffffffa0b7b064>] mgs_handle+0xad4/0x11d0 [mgs]
[ 839.721955] [<ffffffffa0412361>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[ 839.722262] [<ffffffffa0747898>] ptlrpc_server_handle_request+0x3a8/0xc70 [ptlrpc]
[ 839.722742] [<ffffffffa04025ee>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[ 839.723028] [<ffffffffa0413e9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
[ 839.723349] [<ffffffffa073efe1>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[ 839.723656] [<ffffffff8105ad10>] ? default_wake_function+0x0/0x20
[ 839.723961] [<ffffffffa0748ba5>] ptlrpc_main+0xa45/0x1650 [ptlrpc]
[ 839.724266] [<ffffffffa0748160>] ? ptlrpc_main+0x0/0x1650 [ptlrpc]
[ 839.724545] [<ffffffff8100c10a>] child_rip+0xa/0x20
[ 839.724845] [<ffffffffa0748160>] ? ptlrpc_main+0x0/0x1650 [ptlrpc]
[ 839.725143] [<ffffffffa0748160>] ? ptlrpc_main+0x0/0x1650 [ptlrpc]
[ 839.725426] [<ffffffff8100c100>] ? child_rip+0x0/0x20
[ 839.725683]
[ 839.741236] Kernel panic - not syncing: LBUG

I have a crashdump in /exports/crashdumps/192.168.10.221-2013-05-05-00\:56\:27 |
| Comments |
| Comment by nasf (Inactive) [ 06/May/13 ] |
|
The direct reason for the failure is that the OI files may not be ready yet, or may already be closed, when the LLOG is processed. But the real issue is that LLOG handling should not trigger an OI file lookup.

int osd_oi_lookup(struct osd_thread_info *info, struct osd_device *osd,
                  const struct lu_fid *fid, struct osd_inode_id *id,
                  bool check_fld)
{
        if (unlikely(fid_is_last_id(fid)))
                return osd_obj_spec_lookup(info, osd, fid, id);

===>    if ((check_fld && fid_is_on_ost(info, osd, fid)) || fid_is_llog(fid))
                return osd_obj_map_lookup(info, osd, fid, id);

        if (fid_is_fs_root(fid)) {
                osd_id_gen(id, osd_sb(osd)->s_root->d_inode->i_ino,
                           osd_sb(osd)->s_root->d_inode->i_generation);
                return 0;
        }

        if (unlikely(fid_is_acct(fid)))
                return osd_acct_obj_lookup(info, osd, fid, id);

        if (!osd->od_igif_inoi && fid_is_igif(fid)) {
                osd_id_gen(id, lu_igif_ino(fid), lu_igif_gen(fid));
                return 0;
        }

        return __osd_oi_lookup(info, osd, fid, id);
}

That means "fid_is_llog(fid)" returned false in this case.

static inline int fid_is_llog(const struct lu_fid *fid)
{
        /* file with OID == 1 is not llog but contains last oid */
        return fid_seq_is_llog(fid_seq(fid)) && fid_oid(fid) > 1;
}

But the old "fid_is_llog(fid)" looked like this:

static inline int fid_is_llog(const struct lu_fid *fid)
{
        return fid_seq_is_llog(fid_seq(fid));
}

I am not sure whether the failure is related to the above code change. I have added debug information in the patch (http://review.whamcloud.com/#change,6267) to catch the "@fid" when the failure is reproduced. |
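To make the difference between the two "fid_is_llog()" versions concrete, here is a minimal user-space sketch. It is not the Lustre code: "struct lu_fid" is a simplified stand-in, the value of FID_SEQ_LLOG is an assumption (1, as the "/seq-1-lastid" name suggests), and "pick_path()" is a hypothetical helper that models only the llog branch of osd_oi_lookup(). Whether this change is actually involved in the failure is not established here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for struct lu_fid (illustration only). */
struct lu_fid {
	uint64_t f_seq;
	uint32_t f_oid;
	uint32_t f_ver;
};

/* Assumption: the llog sequence number is 1. */
#define FID_SEQ_LLOG 1ULL

static inline bool fid_seq_is_llog(uint64_t seq)
{
	return seq == FID_SEQ_LLOG;
}

/* Old check: every FID in the llog sequence is treated as an llog. */
static inline bool fid_is_llog_old(const struct lu_fid *fid)
{
	return fid_seq_is_llog(fid->f_seq);
}

/* New check: OID 1 (the last-id file) is no longer treated as an llog. */
static inline bool fid_is_llog_new(const struct lu_fid *fid)
{
	return fid_seq_is_llog(fid->f_seq) && fid->f_oid > 1;
}

enum lookup_path {
	PATH_OBJ_MAP,	/* osd_obj_map_lookup(): no OI table needed */
	PATH_OI		/* falls through towards __osd_oi_lookup() */
};

/* Hypothetical helper: models only the llog branch of osd_oi_lookup(). */
static inline enum lookup_path pick_path(bool is_llog)
{
	return is_llog ? PATH_OBJ_MAP : PATH_OI;
}
```

With these definitions, a FID of {seq 1, oid 1} takes PATH_OBJ_MAP under the old check and PATH_OI under the new one, while {seq 1, oid 2} takes PATH_OBJ_MAP under both. In this model, only the PATH_OI case depends on the OI table being present.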
| Comment by Mikhail Pershin [ 06/May/13 ] |
|
The file with OID == 1 is not an llog but /seq-1-lastid, which stores the last generated FID OID for llogs. It doesn't need osd_obj_map_xxx; it is just an ordinary file in the root directory. But you said above that it can be used before the OI is initialized, and that is probably the reason, though I am not sure why llog is allowed to work over an OSD that is not ready. Perhaps the OI should be set up earlier? |
| Comment by nasf (Inactive) [ 07/May/13 ] |
|
After further study of the log and the code, I think the failure is caused by the OI files being closed before the MGS service thread has stopped. So the solution is NOT to set up the OI earlier; instead, we should clean up the OI later. It is another instance of |
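The ordering problem described in this comment can be illustrated with a small user-space model. This is only a sketch under stated assumptions: "struct osd_model" and every function below are hypothetical names, not Lustre APIs. Only the condition in "oi_usable()" is taken from the failed assertion in osd_fid2oi().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified model of the device state. */
struct osd_model {
	void *oi_table;		/* stands in for od_oi_table */
	int oi_count;		/* stands in for od_oi_count */
	bool mgs_running;	/* a service thread may still issue lookups */
};

/* Same condition as the assertion that failed in osd_fid2oi(). */
static inline bool oi_usable(const struct osd_model *osd)
{
	return osd->oi_table != NULL && osd->oi_count >= 1;
}

/* Models closing the OI files during device cleanup. */
static inline void oi_close(struct osd_model *osd)
{
	osd->oi_table = NULL;
	osd->oi_count = 0;
}

/* Models stopping the service thread. */
static inline void mgs_stop(struct osd_model *osd)
{
	osd->mgs_running = false;
}

/*
 * A lookup can only arrive while the service thread is running.
 * Returns false when such a lookup would trip the assertion.
 */
static inline bool lookup_is_safe(const struct osd_model *osd)
{
	return !osd->mgs_running || oi_usable(osd);
}
```

In this model, calling oi_close() while mgs_running is still true leaves a window in which lookup_is_safe() returns false, which corresponds to the LBUG in the trace. Calling mgs_stop() first and oi_close() afterwards never opens that window.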
| Comment by nasf (Inactive) [ 07/May/13 ] |
|
Another failure instance of |
| Comment by Alex Zhuravlev [ 07/Nov/19 ] |
|
Not sure whether to file a new ticket, but I've hit this locally a few times already:

Lustre: DEBUG MARKER: == replay-dual test 23b: c1 rmdir d1, M1 drop reply and fail M0/M1, c2 mkdir d1 ====================== 20:38:18 (1573141098)