[LU-8662] osd_fid_lookup()) ASSERTION( tid->oii_ino == id->oii_ino && tid->oii_gen == id->oii_gen ) failed: OI mapping changed(2): Created: 02/Oct/16 Updated: 14/Mar/17 Resolved: 14/Mar/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Oleg Drokin <green@whamcloud.com>. This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/de8e19f6-88ee-11e6-ad53-5254006e85c2.

The sub-test test_0a failed with the following error:

test failed to respond and timed out

Now, this run is with the dm_flakey patch, which might have introduced some unexpected failure behavior, but even so it highlights an assertion that, as it turns out, can fire, so we had better look at this in more detail. The assertion was introduced by Fan Yong's patch, commit cecde8bdb4913fd4405d425b0bf3aead03181e9d. The MDS crashed with:

15:13:56:[ 1124.877616] LustreError: 167-0: lustre-MDT0000-lwp-MDT0002: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
15:13:56:[ 1124.908734] LDISKFS-fs error (device dm-6): mb_free_blocks:1453: group 4, block 68664:freeing already freed block (bit 1560); block bitmap corrupt.
15:13:56:[ 1124.914666] Aborting journal on device dm-6-8.
15:13:56:[ 1124.978308] LDISKFS-fs (dm-6): Remounting filesystem read-only
15:13:56:[ 1124.982226] LDISKFS-fs warning (device dm-6): ldiskfs_mb_generate_buddy:761: group 4: 15957 blocks in bitmap, 15959 in bb, 16445 in gd, 1 pa's block bitmap corrupt
15:13:56:[ 1124.987219] LDISKFS-fs error (device dm-6) in osd_trans_stop:1834: IO failure
15:13:56:[ 1124.990016] LustreError: 6145:0:(osd_handler.c:1837:osd_trans_stop()) lustre-MDT0000: failed to stop transaction: rc = -5
15:13:56:[ 1124.992521] Lustre: lustre-MDT0000: Recovery over after 0:04, of 5 clients 5 recovered and 0 were evicted.
15:13:56:[ 1125.992942] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
15:13:56:[ 1126.229054] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
15:13:56:[ 1126.241517] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
15:13:56:[ 1126.476588] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
15:13:56:[ 1127.807784] format at osd_handler.c:1182:osd_fid_lookup doesn't end in newline
15:13:56:[ 1127.812857] LustreError: 800:0:(osd_handler.c:1182:osd_fid_lookup()) ASSERTION( tid->oii_ino == id->oii_ino && tid->oii_gen == id->oii_gen ) failed: OI mapping changed(2): 139/537564674 => 139/1570612830
15:13:56:[ 1127.815844] LustreError: 800:0:(osd_handler.c:1182:osd_fid_lookup()) LBUG
15:13:56:[ 1127.819571] Pid: 800, comm: mdt00_001
15:13:56:[ 1127.821424] Call Trace:
15:13:56:[ 1127.824770] [<ffffffffa06987d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
15:13:56:[ 1127.826840] [<ffffffffa0698d75>] lbug_with_loc+0x45/0xc0 [libcfs]
15:13:56:[ 1127.828862] [<ffffffffa0c250bb>] osd_fid_lookup+0x175b/0x1800 [osd_ldiskfs]
15:13:56:[ 1127.830930] [<ffffffffa0c251b5>] osd_object_init+0x55/0xf0 [osd_ldiskfs]
15:13:56:[ 1127.833042] [<ffffffffa07d68cf>] lu_object_alloc+0xdf/0x310 [obdclass]
15:13:56:[ 1127.835107] [<ffffffffa07d6ccc>] lu_object_find_try+0x16c/0x2b0 [obdclass]
15:13:56:[ 1127.837173] [<ffffffffa07d6ebc>] lu_object_find_at+0xac/0xe0 [obdclass]
15:13:56:[ 1127.839279] [<ffffffffa0e9bb65>] ? lod_index_lookup+0x25/0x30 [lod]
15:13:56:[ 1127.841285] [<ffffffffa0ef5f4f>] ? __mdd_lookup.isra.17+0x26f/0x450 [mdd]
15:13:56:[ 1127.843369] [<ffffffffa07d6f06>] lu_object_find+0x16/0x20 [obdclass]
15:13:56:[ 1127.845403] [<ffffffffa0dbff5b>] mdt_object_find+0x4b/0x170 [mdt]
15:13:56:[ 1127.847443] [<ffffffffa0dc3ab6>] mdt_getattr_name_lock+0x746/0x1900 [mdt]
15:13:56:[ 1127.849494] [<ffffffffa0dc4f20>] mdt_intent_getattr+0x2b0/0x480 [mdt]
15:13:56:[ 1127.851464] [<ffffffffa0dc8b3c>] mdt_intent_policy+0x5bc/0xbb0 [mdt]
15:13:56:[ 1127.853443] [<ffffffffa09ae1e7>] ldlm_lock_enqueue+0x387/0x970 [ptlrpc]
15:13:56:[ 1127.855381] [<ffffffffa09d7363>] ldlm_handle_enqueue0+0x9c3/0x1680 [ptlrpc]
15:13:56:[ 1127.857316] [<ffffffffa09fef40>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
15:13:56:[ 1127.859274] [<ffffffffa0a57bf2>] tgt_enqueue+0x62/0x210 [ptlrpc]
15:13:56:[ 1127.861113] [<ffffffffa0a5c055>] tgt_request_handle+0x915/0x1320 [ptlrpc]
15:13:56:[ 1127.862983] [<ffffffffa0a07fdb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
15:13:56:[ 1127.864882] [<ffffffffa0a05b98>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
15:13:56:[ 1127.866693] [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
15:13:56:[ 1127.868447] [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
15:13:56:[ 1127.870236] [<ffffffffa0a0c090>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
15:13:56:[ 1127.871995] [<ffffffffa0a0b5f0>] ? ptlrpc_main+0x0/0x1de0 [ptlrpc]
15:13:56:[ 1127.873738] [<ffffffff810a5b8f>] kthread+0xcf/0xe0
15:13:56:[ 1127.875343] [<ffffffff810a5ac0>] ? kthread+0x0/0xe0
15:13:56:[ 1127.876914] [<ffffffff81646b98>] ret_from_fork+0x58/0x90
15:13:56:[ 1127.878478] [<ffffffff810a5ac0>] ? kthread+0x0/0xe0
15:13:56:[ 1127.883734] Kernel panic - not syncing: LBUG

Info required for matching: replay-single 0a |
| Comments |
| Comment by Jian Yu [ 10/Oct/16 ] |
|
Hi Nasf, Could you please look at this issue? Thank you. |
| Comment by Gerrit Updater [ 18/Oct/16 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/23220 |
| Comment by Jian Yu [ 20/Oct/16 ] |
|
Hi Nasf, could you please investigate whether or not this is a regression introduced by the patch above? Thank you. |
| Comment by nasf (Inactive) [ 26/Oct/16 ] |
|
The logs show that MDT0 and MDT2 on 35vm7 had finished the LFSCK, while MDT1 and MDT3 on 35vm3 were in the second-stage scanning and trying to notify 35vm7 of something, possibly sending it the LFSCK-completion message. But that message was blocked there, and there are no detailed logs for it (they may have been overwritten). In any case, all of these operations had already passed the FID lookup phase, so the new failure is not a regression. |
| Comment by Jian Yu [ 26/Oct/16 ] |
|
Thank you, Nasf. I just created |
| Comment by Gerrit Updater [ 14/Mar/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/23220/ |