Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Lustre 2.10.0
-
Soak stress cluster
-
3
-
9223372036854775807
Description
After multiple failovers, soak-11 delayed for a very long time completing recovery
May 12 15:51:41 soak-11 kernel: Lustre: 3632:0:(ldlm_lib.c:2049:target_recovery_overseer()) soaked-MDT0003 recovery is aborted by hard timeout May 12 15:51:41 soak-11 kernel: Lustre: 3632:0:(ldlm_lib.c:2059:target_recovery_overseer()) recovery is aborted, evict exports in recovery May 12 15:51:41 soak-11 kernel: Lustre: soaked-MDT0003: Recovery over after 20:46, of 32 clients 28 recovered and 4 were evicted.
Shorty after that, another MDS had a hard crash:
May 12 16:23:21 soak-10 kernel: LNet: 3452:0:(o2iblnd_cb.c:3198:kiblnd_check_conns()) Timed out tx for 192.168.1.104@o2ib10: 4 seconds May 12 16:23:21 soak-10 kernel: LNet: 3452:0:(o2iblnd_cb.c:3198:kiblnd_check_conns()) Skipped 3 previous similar messages May 12 16:24:19 soak-10 kernel: LustreError: 11-0: soaked-OST0008-osc-MDT0002: operation ost_statfs to node 192.168.1.105@o2ib10 failed: rc = -19 May 12 16:24:19 soak-10 kernel: LustreError: Skipped 149 previous similar messages May 12 16:24:19 soak-10 kernel: Lustre: soaked-OST0008-osc-MDT0002: Connection to soaked-OST0008 (at 192.168.1.105@o2ib10) was lost; in progress operations using this service will wait for recovery to complete May 12 16:24:19 soak-10 kernel: Lustre: Skipped 3 previous similar messages May 12 16:24:35 soak-10 chronyd[1295]: Selected source 132.163.4.101 May 12 16:24:35 soak-10 chronyd[1295]: System clock wrong by -2.455743 seconds, adjustment started May 12 16:25:02 soak-10 kernel: Lustre: soaked-MDT0002: Connection restored to 192.168.1.104@o2ib10 (at 192.168.1.104@o2ib10) May 12 16:25:02 soak-10 kernel: Lustre: Skipped 7 previous similar messages May 12 16:25:03 soak-10 kernel: LustreError: 11-0: soaked-OST0014-osc-MDT0002: operation ost_statfs to node 192.168.1.105@o2ib10 failed: rc = -107 May 12 16:25:03 soak-10 kernel: Lustre: soaked-OST0014-osc-MDT0002: Connection to soaked-OST0014 (at 192.168.1.105@o2ib10) was lost; in progress operations using this service will wait for recovery to complete May 12 16:25:08 soak-10 kernel: Lustre: 3462:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1494606301/real 1494606301] req@ffff8807c5ff2700 x1567221038829200/t0(0) o8->soaked-OST0008-osc-MDT0002@192.168.1.104@o2ib10:28/4 lens 520/544 e 0 to 1 dl 1494606308 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 May 12 16:25:08 soak-10 kernel: Lustre: 3462:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages May 12 16:25:26 soak-10 kernel: LustreError: 11-0: soaked-OST0002-osc-MDT0002: operation obd_ping to node 192.168.1.105@o2ib10 failed: rc = -107 May 12 16:25:26 soak-10 kernel: Lustre: soaked-OST0002-osc-MDT0002: Connection to soaked-OST0002 (at 192.168.1.105@o2ib10) was lost; in progress operations using this service will wait for recovery to complete May 12 16:25:35 soak-10 kernel: Lustre: soaked-OST000e-osc-MDT0002: Connection to soaked-OST000e (at 192.168.1.105@o2ib10) was lost; in progress operations using this service will wait for recovery to complete May 12 16:25:38 soak-10 kernel: LustreError: 3960:0:(osp_precreate.c:1489:osp_precreate_get_fid()) ASSERTION( osp_fid_diff(&d->opd_pre_used_fid, &d->opd_pre_last_created_fid) < 0 ) failed: next fid [0x680000402:0x25de031:0x0] last created fid [0x680000402:0x25de031:0x0] May 12 16:25:38 soak-10 kernel: LustreError: 3960:0:(osp_precreate.c:1489:osp_precreate_get_fid()) LBUG May 12 16:25:38 soak-10 kernel: Pid: 3960, comm: mdt01_017 May 12 16:25:38 soak-10 kernel: #012Call Trace: May 12 16:25:38 soak-10 kernel: [<ffffffffa0c697ee>] libcfs_call_trace+0x4e/0x60 [libcfs] May 12 16:25:38 soak-10 kernel: [<ffffffffa0c6987c>] lbug_with_loc+0x4c/0xb0 [libcfs] May 12 16:25:38 soak-10 kernel: [<ffffffffa153a7cf>] osp_precreate_get_fid+0x30f/0x380 [osp] May 12 16:25:38 soak-10 kernel: [<ffffffffa152dd5a>] osp_object_assign_fid+0x6a/0x120 [osp] May 12 16:25:38 soak-10 kernel: [<ffffffffa15300a8>] osp_object_create+0x588/0xad0 [osp] May 12 16:25:38 soak-10 kernel: [<ffffffffa1280104>] ? osd_idc_find_and_init+0x24/0x100 [osd_ldiskfs] May 12 16:25:38 soak-10 kernel: [<ffffffffa148ac45>] lod_sub_object_create+0x1f5/0x480 [lod] May 12 16:25:38 soak-10 kernel: [<ffffffff811de76a>] ? kmem_cache_alloc+0x1ba/0x1e0 May 12 16:25:38 soak-10 kernel: [<ffffffffa147fe56>] lod_striping_create+0x186/0x2c0 [lod] May 12 16:25:38 soak-10 kernel: [<ffffffffa148026d>] lod_object_create+0x24d/0x2c0 [lod] May 12 16:25:38 soak-10 kernel: [<ffffffffa14ecc25>] mdd_object_create_internal+0xb5/0x280 [mdd] May 12 16:25:38 soak-10 kernel: [<ffffffffa14d85b5>] mdd_object_create+0x75/0xb80 [mdd] May 12 16:25:38 soak-10 kernel: [<ffffffffa14de818>] ? mdd_declare_create+0x578/0xe20 [mdd] May 12 16:25:38 soak-10 kernel: [<ffffffffa14e262e>] mdd_create+0xd1e/0x1320 [mdd] May 12 16:25:38 soak-10 kernel: [<ffffffffa13e0a12>] mdt_reint_open+0x2192/0x31a0 [mdt] May 12 16:25:38 soak-10 kernel: [<ffffffffa0dabda9>] ? upcall_cache_get_entry+0x3e9/0x8e0 [obdclass] May 12 16:25:38 soak-10 kernel: [<ffffffffa13c6493>] ? ucred_set_jobid+0x53/0x70 [mdt] May 12 16:25:38 soak-10 kernel: [<ffffffffa13d5c80>] mdt_reint_rec+0x80/0x210 [mdt] May 12 16:25:38 soak-10 kernel: [<ffffffffa13b7d62>] mdt_reint_internal+0x5b2/0x980 [mdt] May 12 16:25:38 soak-10 kernel: [<ffffffffa13b8292>] mdt_intent_reint+0x162/0x430 [mdt] May 12 16:25:38 soak-10 kernel: [<ffffffffa13c2ffe>] mdt_intent_policy+0x43e/0xc30 [mdt] May 12 16:25:38 soak-10 kernel: [<ffffffffa0f79e3f>] ? ldlm_resource_get+0x9f/0xa30 [ptlrpc] May 12 16:25:38 soak-10 kernel: [<ffffffffa0f73167>] ldlm_lock_enqueue+0x387/0x970 [ptlrpc] May 12 16:25:38 soak-10 kernel: [<ffffffffa0f9cab3>] ldlm_handle_enqueue0+0x9c3/0x1680 [ptlrpc] May 12 16:25:38 soak-10 kernel: [<ffffffffa0fc4cb0>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc] May 12 16:25:38 soak-10 kernel: [<ffffffffa10219a2>] tgt_enqueue+0x62/0x210 [ptlrpc] May 12 16:25:38 soak-10 kernel: [<ffffffffa1024f85>] tgt_request_handle+0x915/0x1320 [ptlrpc] May 12 16:25:38 soak-10 kernel: [<ffffffffa0fce023>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc] May 12 16:25:38 soak-10 kernel: [<ffffffffa0fcbbb8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] May 12 16:25:38 soak-10 kernel: [<ffffffff810c54d2>] ? default_wake_function+0x12/0x20 May 12 16:25:38 soak-10 kernel: [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90 May 12 16:25:38 soak-10 kernel: [<ffffffffa0fd2030>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc] May 12 16:25:38 soak-10 kernel: [<ffffffffa0fd1590>] ? ptlrpc_main+0x0/0x1dd0 [ptlrpc] May 12 16:25:38 soak-10 kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0 May 12 16:25:38 soak-10 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0 May 12 16:25:38 soak-10 kernel: [<ffffffff81697318>] ret_from_fork+0x58/0x90 May 12 16:25:38 soak-10 kernel: [<ffffffff810b0980>] ? kthread+0x0/0xe0 May 12 16:25:38 soak-10 kernel: May 12 16:25:38 soak-10 kernel: Kernel panic - not syncing: LBUG
Within the same time interval, the OSS referenced above had the following errors:
May 12 16:23:49 soak-5 kernel: LustreError: 168-f: soaked-OST0008: BAD WRITE CHECKSUM: from 12345-192.168.1.122@o2ib100 via 192.168.1.115@o2ib10 inode [0x28002562b:0x504c:0x0] object 0x500000401:39665560 extent [0-1048575]: client csum 14298c12, server csum e5738ecd
May 12 16:23:50 soak-5 kernel: LustreError: 168-f: soaked-OST0008: BAD WRITE CHECKSUM: from 12345-192.168.1.122@o2ib100 via 192.168.1.115@o2ib10 inode [0x28002562b:0x504c:0x0] object 0x500000401:39665560 extent [1048576-2097151]: client csum 14298c12, server csum 3111bed7
May 12 16:23:51 soak-5 kernel: LustreError: 168-f: soaked-OST0008: BAD WRITE CHECKSUM: from 12345-192.168.1.122@o2ib100 via 192.168.1.115@o2ib10 inode [0x28002562b:0x504c:0x0] object 0x500000401:39665560 extent [5242880-6291455]: client csum 14298c12, server csum 512bbd0f
May 12 16:23:51 soak-5 kernel: LustreError: Skipped 3 previous similar messages
May 12 16:23:55 soak-5 kernel: Lustre: soaked-OST0008: Recovery over after 1:07, of 33 clients 33 recovered and 0 were evicted.
May 12 16:23:55 soak-5 kernel: Lustre: soaked-OST0008: deleting orphan objects from 0x500000400:42837891 to 0x500000400:42843249
May 12 16:23:55 soak-5 kernel: Lustre: soaked-OST0008: deleting orphan objects from 0x0:77925254 to 0x0:77933012
May 12 16:23:55 soak-5 kernel: Lustre: soaked-OST0008: deleting orphan objects from 0x500000402:40333428 to 0x500000402:40339153
May 12 16:23:55 soak-5 kernel: Lustre: soaked-OST0008: deleting orphan objects from 0x500000401:39666149 to 0x500000401:39671009
...
May 12 16:24:06 soak-5 kernel: LustreError: 168-f: soaked-OST000e: BAD WRITE CHECKSUM: from 12345-192.168.1.122@o2ib100 via 192.168.1.115@o2ib10 inode [0x28002562b:0x504c:0x0] object 0x680000402:39701914 extent [3145728-4194303]: client csum 14298c12, server csum f4fa78ed
May 12 16:24:06 soak-5 kernel: LustreError: Skipped 4 previous similar messages
May 12 16:24:09 soak-5 kernel: Lustre: soaked-OST0014: Recovery over after 1:12, of 33 clients 33 recovered and 0 were evicted.
May 12 16:24:09 soak-5 kernel: Lustre: soaked-OST0014: deleting orphan objects from 0x800000401:42992204 to 0x800000401:43000337
May 12 16:24:09 soak-5 kernel: Lustre: soaked-OST0014: deleting orphan objects from 0x800000402:39709601 to 0x800000402:39721217
May 12 16:24:09 soak-5 kernel: Lustre: soaked-OST0014: deleting orphan objects from 0x800000400:40548481 to 0x800000400:40557441
May 12 16:24:09 soak-5 kernel: Lustre: soaked-OST0014: deleting orphan objects from 0x0:77823923 to 0x0:77830564
May 12 16:24:10 soak-5 kernel: Lustre: soaked-OST0002: Recovery already passed deadline 2:05. If you do not want to wait more, please abort the recovery by force.
May 12 16:24:11 soak-5 kernel: Lustre: soaked-OST0002: Recovery over after 1:06, of 33 clients 33 recovered and 0 were evicted.
May 12 16:24:11 soak-5 kernel: Lustre: soaked-OST0002: deleting orphan objects from 0x380000401:42934599 to 0x380000401:42940465
May 12 16:24:11 soak-5 kernel: Lustre: soaked-OST0002: deleting orphan objects from 0x380000400:39672936 to 0x380000400:39681745
May 12 16:24:11 soak-5 kernel: Lustre: soaked-OST0002: deleting orphan objects from 0x380000402:40655106 to 0x380000402:40668241
May 12 16:24:11 soak-5 kernel: Lustre: soaked-OST0002: deleting orphan objects from 0x0:77950953 to 0x0:77957748
May 12 16:24:12 soak-5 kernel: Lustre: soaked-OST000e: deleting orphan objects from 0x680000401:42880681 to 0x680000401:42893057
May 12 16:24:12 soak-5 kernel: Lustre: soaked-OST000e: deleting orphan objects from 0x680000400:40580561 to 0x680000400:40588049
May 12 16:24:12 soak-5 kernel: Lustre: soaked-OST000e: deleting orphan objects from 0x680000402:39702634 to 0x680000402:39706673
May 12 16:24:13 soak-5 kernel: Lustre: soaked-OST000e: deleting orphan objects from 0x0:77672171 to 0x0:77680036
May 12 16:24:15 soak-5 kernel: LustreError: 15735:0:(lustre_dlm.h:1372:ldlm_res_lvbo_update()) delayed lvb init failed (rc -2)
Attachments
Issue Links
- is related to
-
LU-8367 delete orphan phase isn't stated for multistriped file
- Resolved