Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9498

osp_precreate_get_fid()) ASSERTION( osp_fid_diff(&d->opd_pre_used_fid, &d->opd_pre_last_created_fid) < 0 ) failed: next fid [0x680000402:0x25de031:0x0] last created fid [0x680000402:0x25de031:0x0]

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Lustre 2.10.0
    • Fix Version/s: Lustre 2.10.0
    • Labels:
    • Environment:
      Soak stress cluster
    • Severity:
      3
    • Rank (Obsolete):
      9223372036854775807

      Description

      After multiple failovers, soak-11 delayed for a very long time completing recovery

      May 12 15:51:41 soak-11 kernel: Lustre: 3632:0:(ldlm_lib.c:2049:target_recovery_overseer()) soaked-MDT0003 recovery is aborted by hard timeout
      May 12 15:51:41 soak-11 kernel: Lustre: 3632:0:(ldlm_lib.c:2059:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      May 12 15:51:41 soak-11 kernel: Lustre: soaked-MDT0003: Recovery over after 20:46, of 32 clients 28 recovered and 4 were evicted.
      

      Shorty after that, another MDS had a hard crash:

      May 12 16:23:21 soak-10 kernel: LNet: 3452:0:(o2iblnd_cb.c:3198:kiblnd_check_conns()) Timed out tx for 192.168.1.104@o2ib10: 4 seconds
      May 12 16:23:21 soak-10 kernel: LNet: 3452:0:(o2iblnd_cb.c:3198:kiblnd_check_conns()) Skipped 3 previous similar messages
      May 12 16:24:19 soak-10 kernel: LustreError: 11-0: soaked-OST0008-osc-MDT0002: operation ost_statfs to node 192.168.1.105@o2ib10 failed: rc = -19
      May 12 16:24:19 soak-10 kernel: LustreError: Skipped 149 previous similar messages
      May 12 16:24:19 soak-10 kernel: Lustre: soaked-OST0008-osc-MDT0002: Connection to soaked-OST0008 (at 192.168.1.105@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
      May 12 16:24:19 soak-10 kernel: Lustre: Skipped 3 previous similar messages
      May 12 16:24:35 soak-10 chronyd[1295]: Selected source 132.163.4.101
      May 12 16:24:35 soak-10 chronyd[1295]: System clock wrong by -2.455743 seconds, adjustment started
      May 12 16:25:02 soak-10 kernel: Lustre: soaked-MDT0002: Connection restored to 192.168.1.104@o2ib10 (at 192.168.1.104@o2ib10)
      May 12 16:25:02 soak-10 kernel: Lustre: Skipped 7 previous similar messages
      May 12 16:25:03 soak-10 kernel: LustreError: 11-0: soaked-OST0014-osc-MDT0002: operation ost_statfs to node 192.168.1.105@o2ib10 failed: rc = -107
      May 12 16:25:03 soak-10 kernel: Lustre: soaked-OST0014-osc-MDT0002: Connection to soaked-OST0014 (at 192.168.1.105@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
      May 12 16:25:08 soak-10 kernel: Lustre: 3462:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1494606301/real 1494606301]  req@ffff8807c5ff2700 x1567221038829200/t0(0) o8->soaked-OST0008-osc-MDT0002@192.168.1.104@o2ib10:28/4 lens 520/544 e 0 to 1 dl 1494606308 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      May 12 16:25:08 soak-10 kernel: Lustre: 3462:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
      May 12 16:25:26 soak-10 kernel: LustreError: 11-0: soaked-OST0002-osc-MDT0002: operation obd_ping to node 192.168.1.105@o2ib10 failed: rc = -107
      May 12 16:25:26 soak-10 kernel: Lustre: soaked-OST0002-osc-MDT0002: Connection to soaked-OST0002 (at 192.168.1.105@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
      May 12 16:25:35 soak-10 kernel: Lustre: soaked-OST000e-osc-MDT0002: Connection to soaked-OST000e (at 192.168.1.105@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
      May 12 16:25:38 soak-10 kernel: LustreError: 3960:0:(osp_precreate.c:1489:osp_precreate_get_fid()) ASSERTION( osp_fid_diff(&d->opd_pre_used_fid, &d->opd_pre_last_created_fid) < 0 ) failed: next fid [0x680000402:0x25de031:0x0] last created fid [0x680000402:0x25de031:0x0]
      May 12 16:25:38 soak-10 kernel: LustreError: 3960:0:(osp_precreate.c:1489:osp_precreate_get_fid()) LBUG
      May 12 16:25:38 soak-10 kernel: Pid: 3960, comm: mdt01_017
      May 12 16:25:38 soak-10 kernel: #012Call Trace:
      May 12 16:25:38 soak-10 kernel: [<ffffffffa0c697ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa0c6987c>] lbug_with_loc+0x4c/0xb0 [libcfs]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa153a7cf>] osp_precreate_get_fid+0x30f/0x380 [osp]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa152dd5a>] osp_object_assign_fid+0x6a/0x120 [osp]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa15300a8>] osp_object_create+0x588/0xad0 [osp]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa1280104>] ? osd_idc_find_and_init+0x24/0x100 [osd_ldiskfs]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa148ac45>] lod_sub_object_create+0x1f5/0x480 [lod]
      May 12 16:25:38 soak-10 kernel: [<ffffffff811de76a>] ? kmem_cache_alloc+0x1ba/0x1e0
      May 12 16:25:38 soak-10 kernel: [<ffffffffa147fe56>] lod_striping_create+0x186/0x2c0 [lod]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa148026d>] lod_object_create+0x24d/0x2c0 [lod]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa14ecc25>] mdd_object_create_internal+0xb5/0x280 [mdd]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa14d85b5>] mdd_object_create+0x75/0xb80 [mdd]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa14de818>] ? mdd_declare_create+0x578/0xe20 [mdd]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa14e262e>] mdd_create+0xd1e/0x1320 [mdd]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa13e0a12>] mdt_reint_open+0x2192/0x31a0 [mdt]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa0dabda9>] ? upcall_cache_get_entry+0x3e9/0x8e0 [obdclass]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa13c6493>] ? ucred_set_jobid+0x53/0x70 [mdt]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa13d5c80>] mdt_reint_rec+0x80/0x210 [mdt]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa13b7d62>] mdt_reint_internal+0x5b2/0x980 [mdt]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa13b8292>] mdt_intent_reint+0x162/0x430 [mdt]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa13c2ffe>] mdt_intent_policy+0x43e/0xc30 [mdt]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa0f79e3f>] ? ldlm_resource_get+0x9f/0xa30 [ptlrpc]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa0f73167>] ldlm_lock_enqueue+0x387/0x970 [ptlrpc]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa0f9cab3>] ldlm_handle_enqueue0+0x9c3/0x1680 [ptlrpc]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa0fc4cb0>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa10219a2>] tgt_enqueue+0x62/0x210 [ptlrpc]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa1024f85>] tgt_request_handle+0x915/0x1320 [ptlrpc]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa0fce023>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa0fcbbb8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      May 12 16:25:38 soak-10 kernel: [<ffffffff810c54d2>] ? default_wake_function+0x12/0x20
      May 12 16:25:38 soak-10 kernel: [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90
      May 12 16:25:38 soak-10 kernel: [<ffffffffa0fd2030>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
      May 12 16:25:38 soak-10 kernel: [<ffffffffa0fd1590>] ? ptlrpc_main+0x0/0x1dd0 [ptlrpc]
      May 12 16:25:38 soak-10 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0
      May 12 16:25:38 soak-10 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0
      May 12 16:25:38 soak-10 kernel: [<ffffffff81697318>] ret_from_fork+0x58/0x90
      May 12 16:25:38 soak-10 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0
      May 12 16:25:38 soak-10 kernel:
      May 12 16:25:38 soak-10 kernel: Kernel panic - not syncing: LBUG
      

      Within the same time interval, the OSS referenced above had the following errors:

      May 12 16:23:49 soak-5 kernel: LustreError: 168-f: soaked-OST0008: BAD WRITE CHECKSUM: from 12345-192.168.1.122@o2ib100 via 192.168.1.115@o2ib10 inode [0x28002562b:0x504c:0x0] object 0x500000401:39665560 extent [0-1048575]: client csum 14298c12, server csum e5738ecd
      May 12 16:23:50 soak-5 kernel: LustreError: 168-f: soaked-OST0008: BAD WRITE CHECKSUM: from 12345-192.168.1.122@o2ib100 via 192.168.1.115@o2ib10 inode [0x28002562b:0x504c:0x0] object 0x500000401:39665560 extent [1048576-2097151]: client csum 14298c12, server csum 3111bed7
      May 12 16:23:51 soak-5 kernel: LustreError: 168-f: soaked-OST0008: BAD WRITE CHECKSUM: from 12345-192.168.1.122@o2ib100 via 192.168.1.115@o2ib10 inode [0x28002562b:0x504c:0x0] object 0x500000401:39665560 extent [5242880-6291455]: client csum 14298c12, server csum 512bbd0f
      May 12 16:23:51 soak-5 kernel: LustreError: Skipped 3 previous similar messages
      May 12 16:23:55 soak-5 kernel: Lustre: soaked-OST0008: Recovery over after 1:07, of 33 clients 33 recovered and 0 were evicted.
      May 12 16:23:55 soak-5 kernel: Lustre: soaked-OST0008: deleting orphan objects from 0x500000400:42837891 to 0x500000400:42843249
      May 12 16:23:55 soak-5 kernel: Lustre: soaked-OST0008: deleting orphan objects from 0x0:77925254 to 0x0:77933012
      May 12 16:23:55 soak-5 kernel: Lustre: soaked-OST0008: deleting orphan objects from 0x500000402:40333428 to 0x500000402:40339153
      May 12 16:23:55 soak-5 kernel: Lustre: soaked-OST0008: deleting orphan objects from 0x500000401:39666149 to 0x500000401:39671009
      ...
      May 12 16:24:06 soak-5 kernel: LustreError: 168-f: soaked-OST000e: BAD WRITE CHECKSUM: from 12345-192.168.1.122@o2ib100 via 192.168.1.115@o2ib10 inode [0x28002562b:0x504c:0x0] object 0x680000402:39701914 extent [3145728-4194303]: client csum 14298c12, server csum f4fa78ed
      May 12 16:24:06 soak-5 kernel: LustreError: Skipped 4 previous similar messages
      May 12 16:24:09 soak-5 kernel: Lustre: soaked-OST0014: Recovery over after 1:12, of 33 clients 33 recovered and 0 were evicted.
      May 12 16:24:09 soak-5 kernel: Lustre: soaked-OST0014: deleting orphan objects from 0x800000401:42992204 to 0x800000401:43000337
      May 12 16:24:09 soak-5 kernel: Lustre: soaked-OST0014: deleting orphan objects from 0x800000402:39709601 to 0x800000402:39721217
      May 12 16:24:09 soak-5 kernel: Lustre: soaked-OST0014: deleting orphan objects from 0x800000400:40548481 to 0x800000400:40557441
      May 12 16:24:09 soak-5 kernel: Lustre: soaked-OST0014: deleting orphan objects from 0x0:77823923 to 0x0:77830564
      May 12 16:24:10 soak-5 kernel: Lustre: soaked-OST0002: Recovery already passed deadline 2:05. If you do not want to wait more, please abort the recovery by force.
      May 12 16:24:11 soak-5 kernel: Lustre: soaked-OST0002: Recovery over after 1:06, of 33 clients 33 recovered and 0 were evicted.
      May 12 16:24:11 soak-5 kernel: Lustre: soaked-OST0002: deleting orphan objects from 0x380000401:42934599 to 0x380000401:42940465
      May 12 16:24:11 soak-5 kernel: Lustre: soaked-OST0002: deleting orphan objects from 0x380000400:39672936 to 0x380000400:39681745
      May 12 16:24:11 soak-5 kernel: Lustre: soaked-OST0002: deleting orphan objects from 0x380000402:40655106 to 0x380000402:40668241
      May 12 16:24:11 soak-5 kernel: Lustre: soaked-OST0002: deleting orphan objects from 0x0:77950953 to 0x0:77957748
      May 12 16:24:12 soak-5 kernel: Lustre: soaked-OST000e: deleting orphan objects from 0x680000401:42880681 to 0x680000401:42893057
      May 12 16:24:12 soak-5 kernel: Lustre: soaked-OST000e: deleting orphan objects from 0x680000400:40580561 to 0x680000400:40588049
      May 12 16:24:12 soak-5 kernel: Lustre: soaked-OST000e: deleting orphan objects from 0x680000402:39702634 to 0x680000402:39706673
      May 12 16:24:13 soak-5 kernel: Lustre: soaked-OST000e: deleting orphan objects from 0x0:77672171 to 0x0:77680036
      May 12 16:24:15 soak-5 kernel: LustreError: 15735:0:(lustre_dlm.h:1372:ldlm_res_lvbo_update()) delayed lvb init failed (rc -2)
      

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                bzzz Alex Zhuravlev
                Reporter:
                cliffw Cliff White (Inactive)
              • Votes:
                0 Vote for this issue
                Watchers:
                6 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: