[LU-9498] osp_precreate_get_fid()) ASSERTION( osp_fid_diff(&d->opd_pre_used_fid, &d->opd_pre_last_created_fid) < 0 ) failed: next fid [0x680000402:0x25de031:0x0] last created fid [0x680000402:0x25de031:0x0] Created: 12/May/17  Updated: 13/Jun/17  Resolved: 13/Jun/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Critical
Reporter: Cliff White (Inactive) Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: soak
Environment:

Soak stress cluster


Attachments: Text File vmcore-dmesg.txt    
Issue Links:
Related
is related to LU-8367 delete orphan phase isn't stated for ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After multiple failovers, soak-11 delayed for a very long time completing recovery

May 12 15:51:41 soak-11 kernel: Lustre: 3632:0:(ldlm_lib.c:2049:target_recovery_overseer()) soaked-MDT0003 recovery is aborted by hard timeout
May 12 15:51:41 soak-11 kernel: Lustre: 3632:0:(ldlm_lib.c:2059:target_recovery_overseer()) recovery is aborted, evict exports in recovery
May 12 15:51:41 soak-11 kernel: Lustre: soaked-MDT0003: Recovery over after 20:46, of 32 clients 28 recovered and 4 were evicted.

Shorty after that, another MDS had a hard crash:

May 12 16:23:21 soak-10 kernel: LNet: 3452:0:(o2iblnd_cb.c:3198:kiblnd_check_conns()) Timed out tx for 192.168.1.104@o2ib10: 4 seconds
May 12 16:23:21 soak-10 kernel: LNet: 3452:0:(o2iblnd_cb.c:3198:kiblnd_check_conns()) Skipped 3 previous similar messages
May 12 16:24:19 soak-10 kernel: LustreError: 11-0: soaked-OST0008-osc-MDT0002: operation ost_statfs to node 192.168.1.105@o2ib10 failed: rc = -19
May 12 16:24:19 soak-10 kernel: LustreError: Skipped 149 previous similar messages
May 12 16:24:19 soak-10 kernel: Lustre: soaked-OST0008-osc-MDT0002: Connection to soaked-OST0008 (at 192.168.1.105@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
May 12 16:24:19 soak-10 kernel: Lustre: Skipped 3 previous similar messages
May 12 16:24:35 soak-10 chronyd[1295]: Selected source 132.163.4.101
May 12 16:24:35 soak-10 chronyd[1295]: System clock wrong by -2.455743 seconds, adjustment started
May 12 16:25:02 soak-10 kernel: Lustre: soaked-MDT0002: Connection restored to 192.168.1.104@o2ib10 (at 192.168.1.104@o2ib10)
May 12 16:25:02 soak-10 kernel: Lustre: Skipped 7 previous similar messages
May 12 16:25:03 soak-10 kernel: LustreError: 11-0: soaked-OST0014-osc-MDT0002: operation ost_statfs to node 192.168.1.105@o2ib10 failed: rc = -107
May 12 16:25:03 soak-10 kernel: Lustre: soaked-OST0014-osc-MDT0002: Connection to soaked-OST0014 (at 192.168.1.105@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
May 12 16:25:08 soak-10 kernel: Lustre: 3462:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1494606301/real 1494606301]  req@ffff8807c5ff2700 x1567221038829200/t0(0) o8->soaked-OST0008-osc-MDT0002@192.168.1.104@o2ib10:28/4 lens 520/544 e 0 to 1 dl 1494606308 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
May 12 16:25:08 soak-10 kernel: Lustre: 3462:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
May 12 16:25:26 soak-10 kernel: LustreError: 11-0: soaked-OST0002-osc-MDT0002: operation obd_ping to node 192.168.1.105@o2ib10 failed: rc = -107
May 12 16:25:26 soak-10 kernel: Lustre: soaked-OST0002-osc-MDT0002: Connection to soaked-OST0002 (at 192.168.1.105@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
May 12 16:25:35 soak-10 kernel: Lustre: soaked-OST000e-osc-MDT0002: Connection to soaked-OST000e (at 192.168.1.105@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
May 12 16:25:38 soak-10 kernel: LustreError: 3960:0:(osp_precreate.c:1489:osp_precreate_get_fid()) ASSERTION( osp_fid_diff(&d->opd_pre_used_fid, &d->opd_pre_last_created_fid) < 0 ) failed: next fid [0x680000402:0x25de031:0x0] last created fid [0x680000402:0x25de031:0x0]
May 12 16:25:38 soak-10 kernel: LustreError: 3960:0:(osp_precreate.c:1489:osp_precreate_get_fid()) LBUG
May 12 16:25:38 soak-10 kernel: Pid: 3960, comm: mdt01_017
May 12 16:25:38 soak-10 kernel: #012Call Trace:
May 12 16:25:38 soak-10 kernel: [<ffffffffa0c697ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
May 12 16:25:38 soak-10 kernel: [<ffffffffa0c6987c>] lbug_with_loc+0x4c/0xb0 [libcfs]
May 12 16:25:38 soak-10 kernel: [<ffffffffa153a7cf>] osp_precreate_get_fid+0x30f/0x380 [osp]
May 12 16:25:38 soak-10 kernel: [<ffffffffa152dd5a>] osp_object_assign_fid+0x6a/0x120 [osp]
May 12 16:25:38 soak-10 kernel: [<ffffffffa15300a8>] osp_object_create+0x588/0xad0 [osp]
May 12 16:25:38 soak-10 kernel: [<ffffffffa1280104>] ? osd_idc_find_and_init+0x24/0x100 [osd_ldiskfs]
May 12 16:25:38 soak-10 kernel: [<ffffffffa148ac45>] lod_sub_object_create+0x1f5/0x480 [lod]
May 12 16:25:38 soak-10 kernel: [<ffffffff811de76a>] ? kmem_cache_alloc+0x1ba/0x1e0
May 12 16:25:38 soak-10 kernel: [<ffffffffa147fe56>] lod_striping_create+0x186/0x2c0 [lod]
May 12 16:25:38 soak-10 kernel: [<ffffffffa148026d>] lod_object_create+0x24d/0x2c0 [lod]
May 12 16:25:38 soak-10 kernel: [<ffffffffa14ecc25>] mdd_object_create_internal+0xb5/0x280 [mdd]
May 12 16:25:38 soak-10 kernel: [<ffffffffa14d85b5>] mdd_object_create+0x75/0xb80 [mdd]
May 12 16:25:38 soak-10 kernel: [<ffffffffa14de818>] ? mdd_declare_create+0x578/0xe20 [mdd]
May 12 16:25:38 soak-10 kernel: [<ffffffffa14e262e>] mdd_create+0xd1e/0x1320 [mdd]
May 12 16:25:38 soak-10 kernel: [<ffffffffa13e0a12>] mdt_reint_open+0x2192/0x31a0 [mdt]
May 12 16:25:38 soak-10 kernel: [<ffffffffa0dabda9>] ? upcall_cache_get_entry+0x3e9/0x8e0 [obdclass]
May 12 16:25:38 soak-10 kernel: [<ffffffffa13c6493>] ? ucred_set_jobid+0x53/0x70 [mdt]
May 12 16:25:38 soak-10 kernel: [<ffffffffa13d5c80>] mdt_reint_rec+0x80/0x210 [mdt]
May 12 16:25:38 soak-10 kernel: [<ffffffffa13b7d62>] mdt_reint_internal+0x5b2/0x980 [mdt]
May 12 16:25:38 soak-10 kernel: [<ffffffffa13b8292>] mdt_intent_reint+0x162/0x430 [mdt]
May 12 16:25:38 soak-10 kernel: [<ffffffffa13c2ffe>] mdt_intent_policy+0x43e/0xc30 [mdt]
May 12 16:25:38 soak-10 kernel: [<ffffffffa0f79e3f>] ? ldlm_resource_get+0x9f/0xa30 [ptlrpc]
May 12 16:25:38 soak-10 kernel: [<ffffffffa0f73167>] ldlm_lock_enqueue+0x387/0x970 [ptlrpc]
May 12 16:25:38 soak-10 kernel: [<ffffffffa0f9cab3>] ldlm_handle_enqueue0+0x9c3/0x1680 [ptlrpc]
May 12 16:25:38 soak-10 kernel: [<ffffffffa0fc4cb0>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
May 12 16:25:38 soak-10 kernel: [<ffffffffa10219a2>] tgt_enqueue+0x62/0x210 [ptlrpc]
May 12 16:25:38 soak-10 kernel: [<ffffffffa1024f85>] tgt_request_handle+0x915/0x1320 [ptlrpc]
May 12 16:25:38 soak-10 kernel: [<ffffffffa0fce023>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
May 12 16:25:38 soak-10 kernel: [<ffffffffa0fcbbb8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
May 12 16:25:38 soak-10 kernel: [<ffffffff810c54d2>] ? default_wake_function+0x12/0x20
May 12 16:25:38 soak-10 kernel: [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90
May 12 16:25:38 soak-10 kernel: [<ffffffffa0fd2030>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
May 12 16:25:38 soak-10 kernel: [<ffffffffa0fd1590>] ? ptlrpc_main+0x0/0x1dd0 [ptlrpc]
May 12 16:25:38 soak-10 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
May 12 16:25:38 soak-10 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0
May 12 16:25:38 soak-10 kernel: [<ffffffff81697318>] ret_from_fork+0x58/0x90
May 12 16:25:38 soak-10 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0
May 12 16:25:38 soak-10 kernel:
May 12 16:25:38 soak-10 kernel: Kernel panic - not syncing: LBUG

Within the same time interval, the OSS referenced above had the following errors:

May 12 16:23:49 soak-5 kernel: LustreError: 168-f: soaked-OST0008: BAD WRITE CHECKSUM: from 12345-192.168.1.122@o2ib100 via 192.168.1.115@o2ib10 inode [0x28002562b:0x504c:0x0] object 0x500000401:39665560 extent [0-1048575]: client csum 14298c12, server csum e5738ecd
May 12 16:23:50 soak-5 kernel: LustreError: 168-f: soaked-OST0008: BAD WRITE CHECKSUM: from 12345-192.168.1.122@o2ib100 via 192.168.1.115@o2ib10 inode [0x28002562b:0x504c:0x0] object 0x500000401:39665560 extent [1048576-2097151]: client csum 14298c12, server csum 3111bed7
May 12 16:23:51 soak-5 kernel: LustreError: 168-f: soaked-OST0008: BAD WRITE CHECKSUM: from 12345-192.168.1.122@o2ib100 via 192.168.1.115@o2ib10 inode [0x28002562b:0x504c:0x0] object 0x500000401:39665560 extent [5242880-6291455]: client csum 14298c12, server csum 512bbd0f
May 12 16:23:51 soak-5 kernel: LustreError: Skipped 3 previous similar messages
May 12 16:23:55 soak-5 kernel: Lustre: soaked-OST0008: Recovery over after 1:07, of 33 clients 33 recovered and 0 were evicted.
May 12 16:23:55 soak-5 kernel: Lustre: soaked-OST0008: deleting orphan objects from 0x500000400:42837891 to 0x500000400:42843249
May 12 16:23:55 soak-5 kernel: Lustre: soaked-OST0008: deleting orphan objects from 0x0:77925254 to 0x0:77933012
May 12 16:23:55 soak-5 kernel: Lustre: soaked-OST0008: deleting orphan objects from 0x500000402:40333428 to 0x500000402:40339153
May 12 16:23:55 soak-5 kernel: Lustre: soaked-OST0008: deleting orphan objects from 0x500000401:39666149 to 0x500000401:39671009
...
May 12 16:24:06 soak-5 kernel: LustreError: 168-f: soaked-OST000e: BAD WRITE CHECKSUM: from 12345-192.168.1.122@o2ib100 via 192.168.1.115@o2ib10 inode [0x28002562b:0x504c:0x0] object 0x680000402:39701914 extent [3145728-4194303]: client csum 14298c12, server csum f4fa78ed
May 12 16:24:06 soak-5 kernel: LustreError: Skipped 4 previous similar messages
May 12 16:24:09 soak-5 kernel: Lustre: soaked-OST0014: Recovery over after 1:12, of 33 clients 33 recovered and 0 were evicted.
May 12 16:24:09 soak-5 kernel: Lustre: soaked-OST0014: deleting orphan objects from 0x800000401:42992204 to 0x800000401:43000337
May 12 16:24:09 soak-5 kernel: Lustre: soaked-OST0014: deleting orphan objects from 0x800000402:39709601 to 0x800000402:39721217
May 12 16:24:09 soak-5 kernel: Lustre: soaked-OST0014: deleting orphan objects from 0x800000400:40548481 to 0x800000400:40557441
May 12 16:24:09 soak-5 kernel: Lustre: soaked-OST0014: deleting orphan objects from 0x0:77823923 to 0x0:77830564
May 12 16:24:10 soak-5 kernel: Lustre: soaked-OST0002: Recovery already passed deadline 2:05. If you do not want to wait more, please abort the recovery by force.
May 12 16:24:11 soak-5 kernel: Lustre: soaked-OST0002: Recovery over after 1:06, of 33 clients 33 recovered and 0 were evicted.
May 12 16:24:11 soak-5 kernel: Lustre: soaked-OST0002: deleting orphan objects from 0x380000401:42934599 to 0x380000401:42940465
May 12 16:24:11 soak-5 kernel: Lustre: soaked-OST0002: deleting orphan objects from 0x380000400:39672936 to 0x380000400:39681745
May 12 16:24:11 soak-5 kernel: Lustre: soaked-OST0002: deleting orphan objects from 0x380000402:40655106 to 0x380000402:40668241
May 12 16:24:11 soak-5 kernel: Lustre: soaked-OST0002: deleting orphan objects from 0x0:77950953 to 0x0:77957748
May 12 16:24:12 soak-5 kernel: Lustre: soaked-OST000e: deleting orphan objects from 0x680000401:42880681 to 0x680000401:42893057
May 12 16:24:12 soak-5 kernel: Lustre: soaked-OST000e: deleting orphan objects from 0x680000400:40580561 to 0x680000400:40588049
May 12 16:24:12 soak-5 kernel: Lustre: soaked-OST000e: deleting orphan objects from 0x680000402:39702634 to 0x680000402:39706673
May 12 16:24:13 soak-5 kernel: Lustre: soaked-OST000e: deleting orphan objects from 0x0:77672171 to 0x0:77680036
May 12 16:24:15 soak-5 kernel: LustreError: 15735:0:(lustre_dlm.h:1372:ldlm_res_lvbo_update()) delayed lvb init failed (rc -2)


 Comments   
Comment by Cliff White (Inactive) [ 12/May/17 ]

A crash dump is available on soak-10 in /var/crash.
vmcore-dmesg is attached.

Comment by Cliff White (Inactive) [ 12/May/17 ]

Build was lustre-master 3577, version=2.9.57_30_g57cc12f

Comment by Peter Jones [ 26/May/17 ]

Alex

Could you please assist with this one?

Thanks

Peter

Comment by Alex Zhuravlev [ 30/May/17 ]

I need to update LU-8367 for master as I did for b_neo_stable where this problem first appeared.

Comment by Alex Zhuravlev [ 30/May/17 ]

I think we can close the ticket.

Comment by Gerrit Updater [ 06/Jun/17 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/27447
Subject: LU-9498 osp: revert till a fix
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0d84e9559c53aa8d8826c7a1b018f6e134028d9f

Comment by Gerrit Updater [ 13/Jun/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27447/
Subject: LU-9498 osp: revert till a fix
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9230561f268f9c3d7f84ac7824d7a1d3769a3dfe

Comment by Peter Jones [ 13/Jun/17 ]

Landed for 2.10

Generated at Sat Feb 10 02:26:43 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.