[LU-8662] osd_fid_lookup()) ASSERTION( tid->oii_ino == id->oii_ino && tid->oii_gen == id->oii_gen ) failed: OI mapping changed(2): Created: 02/Oct/16  Updated: 14/Mar/17  Resolved: 14/Mar/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Oleg Drokin <green@whamcloud.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/de8e19f6-88ee-11e6-ad53-5254006e85c2.

The sub-test test_0a failed with the following error:

test failed to respond and timed out

Now, this is with the dm_flakey patch, which might have introduced some unexpected failure behavior, but even so it highlights an assert that can happen, as it turns out, so we had better look at this in more detail.

The assertion was introduced by Fan Yong's patch, commit cecde8bdb4913fd4405d425b0bf3aead03181e9d, for LU-8218.
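To make the check concrete, here is a minimal user-space sketch (not the actual Lustre code; the struct layout, the simulated table, and helper names such as oi_lookup() and check_oi_mapping() are hypothetical, and only the oii_ino/oii_gen field names mirror the assertion) of the kind of double-lookup consistency check the assertion performs: the OI index maps a FID to an (inode number, generation) pair, and if a re-lookup of the same FID during object initialization returns a different pair, the mapping has changed underneath us and the code LBUGs.

/*
 * Illustrative sketch only -- not the actual Lustre code.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct oi_id {
	uint64_t oii_ino;   /* inode number the FID currently maps to */
	uint32_t oii_gen;   /* inode generation */
};

/* Simulated OI entry; in the real code this is an on-disk index. */
static struct oi_id oi_table_entry = { .oii_ino = 139, .oii_gen = 537564674 };

/* Hypothetical stand-in for looking a FID up in the OI table. */
static void oi_lookup(struct oi_id *id)
{
	*id = oi_table_entry;
}

/* Re-check the cached mapping against a fresh lookup, like the assertion. */
static void check_oi_mapping(const struct oi_id *cached)
{
	struct oi_id now;

	oi_lookup(&now);
	if (now.oii_ino != cached->oii_ino || now.oii_gen != cached->oii_gen) {
		fprintf(stderr, "OI mapping changed: %llu/%u => %llu/%u\n",
			(unsigned long long)cached->oii_ino, cached->oii_gen,
			(unsigned long long)now.oii_ino, now.oii_gen);
		abort();    /* analogous to LBUG() / kernel panic */
	}
}

int main(void)
{
	struct oi_id cached;

	oi_lookup(&cached);                     /* first lookup, cached */
	oi_table_entry.oii_gen = 1570612830;    /* mapping changes under us */
	check_oi_mapping(&cached);              /* fires, as in the log below */
	return 0;
}

In the crash below, the two pairs were 139/537564674 and 139/1570612830, i.e. the same inode number with a different generation.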

MDS crashed with:

15:13:56:[ 1124.877616] LustreError: 167-0: lustre-MDT0000-lwp-MDT0002: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
15:13:56:[ 1124.908734] LDISKFS-fs error (device dm-6): mb_free_blocks:1453: group 4, block 68664:freeing already freed block (bit 1560); block bitmap corrupt.
15:13:56:[ 1124.914666] Aborting journal on device dm-6-8.
15:13:56:[ 1124.978308] LDISKFS-fs (dm-6): Remounting filesystem read-only
15:13:56:[ 1124.982226] LDISKFS-fs warning (device dm-6): ldiskfs_mb_generate_buddy:761: group 4: 15957 blocks in bitmap, 15959 in bb, 16445 in gd, 1 pa's block bitmap corrupt
15:13:56:[ 1124.987219] LDISKFS-fs error (device dm-6) in osd_trans_stop:1834: IO failure
15:13:56:[ 1124.990016] LustreError: 6145:0:(osd_handler.c:1837:osd_trans_stop()) lustre-MDT0000: failed to stop transaction: rc = -5
15:13:56:[ 1124.992521] Lustre: lustre-MDT0000: Recovery over after 0:04, of 5 clients 5 recovered and 0 were evicted.
15:13:56:[ 1125.992942] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
15:13:56:[ 1126.229054] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
15:13:56:[ 1126.241517] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
15:13:56:[ 1126.476588] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
15:13:56:[ 1127.807784] format at osd_handler.c:1182:osd_fid_lookup doesn't end in newline
15:13:56:[ 1127.812857] LustreError: 800:0:(osd_handler.c:1182:osd_fid_lookup()) ASSERTION( tid->oii_ino == id->oii_ino && tid->oii_gen == id->oii_gen ) failed: OI mapping changed(2): 139/537564674 => 139/1570612830
15:13:56:[ 1127.815844] LustreError: 800:0:(osd_handler.c:1182:osd_fid_lookup()) LBUG
15:13:56:[ 1127.819571] Pid: 800, comm: mdt00_001
15:13:56:[ 1127.821424] 
15:13:56:[ 1127.821424] Call Trace:
15:13:56:[ 1127.824770]  [<ffffffffa06987d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
15:13:56:[ 1127.826840]  [<ffffffffa0698d75>] lbug_with_loc+0x45/0xc0 [libcfs]
15:13:56:[ 1127.828862]  [<ffffffffa0c250bb>] osd_fid_lookup+0x175b/0x1800 [osd_ldiskfs]
15:13:56:[ 1127.830930]  [<ffffffffa0c251b5>] osd_object_init+0x55/0xf0 [osd_ldiskfs]
15:13:56:[ 1127.833042]  [<ffffffffa07d68cf>] lu_object_alloc+0xdf/0x310 [obdclass]
15:13:56:[ 1127.835107]  [<ffffffffa07d6ccc>] lu_object_find_try+0x16c/0x2b0 [obdclass]
15:13:56:[ 1127.837173]  [<ffffffffa07d6ebc>] lu_object_find_at+0xac/0xe0 [obdclass]
15:13:56:[ 1127.839279]  [<ffffffffa0e9bb65>] ? lod_index_lookup+0x25/0x30 [lod]
15:13:56:[ 1127.841285]  [<ffffffffa0ef5f4f>] ? __mdd_lookup.isra.17+0x26f/0x450 [mdd]
15:13:56:[ 1127.843369]  [<ffffffffa07d6f06>] lu_object_find+0x16/0x20 [obdclass]
15:13:56:[ 1127.845403]  [<ffffffffa0dbff5b>] mdt_object_find+0x4b/0x170 [mdt]
15:13:56:[ 1127.847443]  [<ffffffffa0dc3ab6>] mdt_getattr_name_lock+0x746/0x1900 [mdt]
15:13:56:[ 1127.849494]  [<ffffffffa0dc4f20>] mdt_intent_getattr+0x2b0/0x480 [mdt]
15:13:56:[ 1127.851464]  [<ffffffffa0dc8b3c>] mdt_intent_policy+0x5bc/0xbb0 [mdt]
15:13:56:[ 1127.853443]  [<ffffffffa09ae1e7>] ldlm_lock_enqueue+0x387/0x970 [ptlrpc]
15:13:56:[ 1127.855381]  [<ffffffffa09d7363>] ldlm_handle_enqueue0+0x9c3/0x1680 [ptlrpc]
15:13:56:[ 1127.857316]  [<ffffffffa09fef40>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
15:13:56:[ 1127.859274]  [<ffffffffa0a57bf2>] tgt_enqueue+0x62/0x210 [ptlrpc]
15:13:56:[ 1127.861113]  [<ffffffffa0a5c055>] tgt_request_handle+0x915/0x1320 [ptlrpc]
15:13:56:[ 1127.862983]  [<ffffffffa0a07fdb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
15:13:56:[ 1127.864882]  [<ffffffffa0a05b98>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
15:13:56:[ 1127.866693]  [<ffffffff810b8952>] ? default_wake_function+0x12/0x20
15:13:56:[ 1127.868447]  [<ffffffff810af0b8>] ? __wake_up_common+0x58/0x90
15:13:56:[ 1127.870236]  [<ffffffffa0a0c090>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
15:13:56:[ 1127.871995]  [<ffffffffa0a0b5f0>] ? ptlrpc_main+0x0/0x1de0 [ptlrpc]
15:13:56:[ 1127.873738]  [<ffffffff810a5b8f>] kthread+0xcf/0xe0
15:13:56:[ 1127.875343]  [<ffffffff810a5ac0>] ? kthread+0x0/0xe0
15:13:56:[ 1127.876914]  [<ffffffff81646b98>] ret_from_fork+0x58/0x90
15:13:56:[ 1127.878478]  [<ffffffff810a5ac0>] ? kthread+0x0/0xe0
15:13:56:[ 1127.879973] 
15:13:56:[ 1127.883734] Kernel panic - not syncing: LBUG

Info required for matching: replay-single 0a



 Comments   
Comment by Jian Yu [ 10/Oct/16 ]

Hi Nasf,

Could you please look at this issue? Thank you.

Comment by Gerrit Updater [ 18/Oct/16 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/23220
Subject: LU-8662 osd-ldiskfs: check OI mapping update
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b2d9f82903e34afb2c1d8f6744a90dd14a053a3a

Comment by Jian Yu [ 20/Oct/16 ]

Hi Nasf,
Patch set 19 of http://review.whamcloud.com/7200 was based on the above patch and the replay-single test failure was resolved. However, sanity-lfsck failed:
https://testing.hpdd.intel.com/test_sets/7e295098-96fa-11e6-a763-5254006e85c2

Could you please investigate whether it's a regression introduced by the above patch? Thank you.

Comment by nasf (Inactive) [ 26/Oct/16 ]

The logs show that MDT0 and MDT2 on 35vm7 had finished the LFSCK, while MDT1 and MDT3 on 35vm3 were in the second-stage scanning and trying to notify 35vm7, possibly to report LFSCK completion. That notification was blocked there, and there are no detailed logs for it (they may have been overwritten). Anyway, all of these operations had already passed the FID lookup phase, so the new failure is not a regression.

Comment by Jian Yu [ 26/Oct/16 ]

Thank you, Nasf. I just created LU-8760 to track the issue and restarted the testing to see if it's reproducible.

Comment by Gerrit Updater [ 14/Mar/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/23220/
Subject: LU-8662 osd-ldiskfs: check OI mapping update
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cd38d6cb92c468372d90e627c79e0c747713839d
