Details
-
Bug
-
Resolution: Duplicate
-
Minor
-
None
-
Lustre 2.12.0
-
lustre-master-ib build #120 lustre version: 2.11.54_54_ga7debf8
-
3
-
9223372036854775807
Description
MDS2 hit LBUG after recovery completed
soak-10 console
[ 485.435828] Lustre: soaked-MDT0002: Recovery over after 0:44, of 28 clients 28 recovered and 0 were evicted. Aug 29 03:10:01 soak-10 TIME: Time stamp for console [ 3453.242721] LNetError: 4307:0:(o2iblnd_cb.c:3321:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds [ 3453.255641] LNetError: 4307:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out RDMA with 192.168.1.109@o2ib (28): c: 6, oc: 0, rc: 8 [ 3453.272156] Lustre: 4335:0:(client.c:2126:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1535514904/real 1535514909] re q@ffff9bf8f3dfe000 x1610100967659856/t0(0) o400->soaked-MDT0001-osp-MDT0002@192.168.1.109@o2ib:24/4 lens 224/224 e 0 to 1 dl 1535514911 ref 1 fl Rpc:eXN/0/ffffff ff rc 0/-1 [ 3453.309301] Lustre: soaked-MDT0001-osp-MDT0002: Connection to soaked-MDT0001 (at 192.168.1.109@o2ib) was lost; in progress operations using this service will wait for recovery to complete [ 3454.242963] LNet: 4307:0:(o2iblnd_cb.c:3367:kiblnd_check_conns()) Timed out tx for 192.168.1.109@o2ib: 1 seconds [ 3651.408451] Lustre: soaked-MDT0002: haven't heard from client soaked-MDT0001-mdtlov_UUID (at 192.168.1.109@o2ib) in 227 seconds. I think it's dead, and I am e victing it. exp ffff9bf8fbbc7400, cur 1535515108 expire 1535514958 last 1535514881 Aug 29 04:01:01 soak-10 TIME: Time stamp for console [ 7026.178219] LNet: Service thread pid 6456 was inactive for 200.53s. The thread might be hung, or it might only be slow and will resume later. Dumping the stac k trace for debugging purposes: [ 7026.200999] Pid: 6456, comm: mdt01_016 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Fri Aug 17 20:37:05 UTC 2018 [ 7026.213853] Call Trace: [ 7026.218428] [<ffffffffc1904ab8>] osp_precreate_reserve+0x2d8/0x800 [osp] [ 7026.227906] [<ffffffffc18f9bb9>] osp_declare_create+0x199/0x5b0 [osp] [ 7026.236955] [<ffffffffc1844e8f>] lod_sub_declare_create+0xdf/0x210 [lod] [ 7026.246232] [<ffffffffc183e62e>] lod_qos_declare_object_on+0xbe/0x3a0 [lod] [ 7026.255759] [<ffffffffc184b40f>] lod_alloc_rr.constprop.18+0xc16/0xf63 [lod] [ 7026.265270] [<ffffffffc1844326>] lod_qos_prep_create+0x1416/0x17a0 [lod] [ 7026.274312] [<ffffffffc18448c5>] lod_prepare_create+0x215/0x2e0 [lod] [ 7026.283060] [<ffffffffc18367ee>] lod_declare_striped_create+0x1ee/0x980 [lod] [ 7026.292518] [<ffffffffc183aec4>] lod_declare_create+0x204/0x590 [lod] [ 7026.301124] [<ffffffffc18ad892>] mdd_declare_create_object_internal+0xe2/0x2f0 [mdd] [ 7026.311217] [<ffffffffc189d1c8>] mdd_declare_create+0x48/0xc10 [mdd] [ 7026.319652] [<ffffffffc18a15e9>] mdd_create+0x929/0x13f0 [mdd] [ 7026.327460] [<ffffffffc174ce37>] mdt_reint_open+0x2117/0x3160 [mdt] [ 7026.335764] [<ffffffffc1740ce3>] mdt_reint_rec+0x83/0x210 [mdt] [ 7026.343611] [<ffffffffc17201d2>] mdt_reint_internal+0x6b2/0xa80 [mdt] [ 7026.352048] [<ffffffffc172c6c2>] mdt_intent_open+0x82/0x350 [mdt] [ 7026.360063] [<ffffffffc172a768>] mdt_intent_policy+0x2f8/0xd10 [mdt] [ 7026.368365] [<ffffffffc12c8e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc] [ 7026.376957] [<ffffffffc12f1483>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc] [ 7026.385918] [<ffffffffc1377932>] tgt_enqueue+0x62/0x210 [ptlrpc] [ 7026.393851] [<ffffffffc138027a>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [ 7026.402592] [<ffffffffc132340b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [ 7026.412194] [<ffffffffc1326c44>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc] [ 7026.420225] [<ffffffffb98bb621>] kthread+0xd1/0xe0 [ 7026.426693] [<ffffffffb9f205f7>] ret_from_fork_nospec_end+0x0/0x39 [ 7026.434746] [<ffffffffffffffff>] 0xffffffffffffffff [ 7026.441364] LustreError: dumping log to /tmp/lustre-log.1535518483.6456 [ 7081.097862] INFO: task mdt00_008:4632 blocked for more than 120 seconds. [ 7081.107689] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 7081.118799] mdt00_008 D ffff9bfd2cb7eeb0 0 4632 2 0x00000080 [ 7081.128951] Call Trace: [ 7081.133752] [<ffffffffb98c8204>] ? __wake_up+0x44/0x50 [ 7081.141630] [<ffffffffb9f14029>] schedule+0x29/0x70 [ 7081.149050] [<ffffffffb9f15915>] rwsem_down_write_failed+0x225/0x3a0 [ 7081.158119] [<ffffffffb9b691bd>] ? list_del+0xd/0x30 [ 7081.165473] [<ffffffffb9b5b587>] call_rwsem_down_write_failed+0x17/0x30 [ 7081.174637] [<ffffffffb9f1319d>] down_write+0x2d/0x3d [ 7081.181924] [<ffffffffc1840f77>] lod_qos_statfs_update+0x97/0x2b0 [lod] [ 7081.190933] [<ffffffffc184307a>] lod_qos_prep_create+0x16a/0x17a0 [lod] [ 7081.199838] [<ffffffffc1503d70>] ? qsd_op_begin+0xb0/0x4d0 [lquota] [ 7081.208310] [<ffffffffc161fa98>] ? osd_declare_qid+0x1d8/0x480 [osd_ldiskfs] [ 7081.217533] [<ffffffffc18448c5>] lod_prepare_create+0x215/0x2e0 [lod] [ 7081.226064] [<ffffffffc18367ee>] lod_declare_striped_create+0x1ee/0x980 [lod] [ 7081.235279] [<ffffffffc1844e8f>] ? lod_sub_declare_create+0xdf/0x210 [lod] [ 7081.244170] [<ffffffffc183aec4>] lod_declare_create+0x204/0x590 [lod] [ 7081.252490] [<ffffffffc182a013>] ? lod_striping_from_default+0x93/0x5b0 [lod] [ 7081.261638] [<ffffffffc10dd969>] ? lu_context_refill+0x19/0x50 [obdclass] [ 7081.270296] [<ffffffffc18ad892>] mdd_declare_create_object_internal+0xe2/0x2f0 [mdd] [ 7081.279987] [<ffffffffc189d1c8>] mdd_declare_create+0x48/0xc10 [mdd] [ 7081.288085] [<ffffffffc18a15e9>] mdd_create+0x929/0x13f0 [mdd] [ 7081.295636] [<ffffffffc174ce37>] mdt_reint_open+0x2117/0x3160 [mdt] [ 7081.303618] [<ffffffffc10f2278>] ? upcall_cache_get_entry+0x218/0x8b0 [obdclass] [ 7081.312849] [<ffffffffc1740ce3>] mdt_reint_rec+0x83/0x210 [mdt] [ 7081.320409] [<ffffffffc17201d2>] mdt_reint_internal+0x6b2/0xa80 [mdt] [ 7081.328539] [<ffffffffc172c6c2>] mdt_intent_open+0x82/0x350 [mdt] [ 7081.336260] [<ffffffffc10bc6f9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass] [ 7081.345041] [<ffffffffc172a768>] mdt_intent_policy+0x2f8/0xd10 [mdt] [ 7081.353013] [<ffffffffc172c640>] ? mdt_intent_fixup_resent+0x220/0x220 [mdt] [ 7081.361816] [<ffffffffc12c8e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc] [ 7081.370088] [<ffffffffc0fa26ee>] ? cfs_hash_add+0xbe/0x1a0 [libcfs] [ 7081.377989] [<ffffffffc12f1483>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc] [ 7081.386676] [<ffffffffc13192d0>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc] [ 7081.395867] [<ffffffffc1377932>] tgt_enqueue+0x62/0x210 [ptlrpc] [ 7081.403500] [<ffffffffc138027a>] tgt_request_handle+0xaea/0x1580 [ptlrpc] [ 7081.412018] [<ffffffffc1358291>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [ 7081.421288] [<ffffffffc0f93bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [ 7081.430002] [<ffffffffc132340b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [ 7081.439396] [<ffffffffc1320295>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [ 7081.447788] [<ffffffffb98cf682>] ? default_wake_function+0x12/0x20 [ 7081.455596] [<ffffffffb98c52ab>] ? __wake_up_common+0x5b/0x90 [ 7081.462945] [<ffffffffc1326c44>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc] [ 7081.470780] [<ffffffffc1326130>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc] [ 7081.479853] [<ffffffffb98bb621>] kthread+0xd1/0xe0 [ 7081.486103] [<ffffffffb98bb550>] ? insert_kthread_work+0x40/0x40 [ 7081.493716] [<ffffffffb9f205f7>] ret_from_fork_nospec_begin+0x21/0x21 [ 7081.501800] [<ffffffffb98bb550>] ? insert_kthread_work+0x40/0x40 [ 7091.714953] Pid: 4632, comm: mdt00_008 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Fri Aug 17 20:37:05 UTC 2018 ... [ 7225.649157] LNet: Service thread pid 6456 completed after 400.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). [ 7225.649191] LNet: Service thread pid 4632 completed after 334.69s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). [ 7247.049160] LustreError: 11-0: soaked-OST0002-osc-MDT0002: operation ost_create to node 192.168.1.107@o2ib failed: rc = -19 [ 7247.063727] LustreError: 4525:0:(osp_precreate.c:647:osp_precreate_send()) soaked-OST0002-osc-MDT0002: can't precreate: rc = -19 [ 7247.080342] LustreError: 4525:0:(osp_precreate.c:1294:osp_precreate_thread()) soaked-OST0002-osc-MDT0002: cannot precreate objects: rc = -19 Aug 29 05:01:01 soak-10 TIME: Time stamp for console [ 8225.327563] LNetError: 4307:0:(lib-msg.c:594:lnet_incr_hstats()) LBUG [ 8225.336875] Pid: 4307, comm: kiblnd_connd 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Fri Aug 17 20:37:05 UTC 2018 [ 8225.349882] Call Trace: [ 8225.354452] [<ffffffffc0f937cc>] libcfs_call_trace+0x8c/0xc0 [libcfs] [ 8225.363505] [<ffffffffc0f9387c>] lbug_with_loc+0x4c/0xa0 [libcfs] [ 8225.372109] [<ffffffffc100d61a>] lnet_finalize+0xb7a/0xe00 [lnet] [ 8225.380637] [<ffffffffc0bd15cd>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd] [ 8225.389573] [<ffffffffc0bd18e6>] kiblnd_txlist_done+0x46/0x60 [ko2iblnd] [ 8225.398487] [<ffffffffc0bd2360>] kiblnd_abort_txs+0xf0/0x250 [ko2iblnd] [ 8225.407115] [<ffffffffc0bd6825>] kiblnd_finalise_conn+0x75/0x120 [ko2iblnd] [ 8225.416096] [<ffffffffc0bdb591>] kiblnd_connd+0x251/0xa00 [ko2iblnd] [ 8225.424360] [<ffffffffb98bb621>] kthread+0xd1/0xe0 [ 8225.430835] [<ffffffffb9f205f7>] ret_from_fork_nospec_end+0x0/0x39 [ 8225.438857] [<ffffffffffffffff>] 0xffffffffffffffff [ 8225.445439] Kernel panic - not syncing: LBUG [ 8225.451189] CPU: 21 PID: 4307 Comm: kiblnd_connd Tainted: P OE ------------ 3.10.0-862.9.1.el7_lustre.x86_64 #1 [ 8225.464746] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013 [ 8225.478323] Call Trace: [ 8225.482095] [<ffffffffb9f0e84e>] dump_stack+0x19/0x1b [ 8225.488889] [<ffffffffb9f08b50>] panic+0xe8/0x21f [ 8225.495277] [<ffffffffc0f938cb>] lbug_with_loc+0x9b/0xa0 [libcfs] [ 8225.503226] [<ffffffffc100d61a>] lnet_finalize+0xb7a/0xe00 [lnet] [ 8225.511156] [<ffffffffc0bd15cd>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd] [ 8225.519586] [<ffffffffc0bd18e6>] kiblnd_txlist_done+0x46/0x60 [ko2iblnd] [ 8225.528203] [<ffffffffc0bd2360>] kiblnd_abort_txs+0xf0/0x250 [ko2iblnd] [ 8225.536723] [<ffffffffc0bd6825>] kiblnd_finalise_conn+0x75/0x120 [ko2iblnd] [ 8225.545635] [<ffffffffc0bdb591>] kiblnd_connd+0x251/0xa00 [ko2iblnd] [ 8225.553845] [<ffffffffb98cf670>] ? wake_up_state+0x20/0x20 [ 8225.561085] [<ffffffffc0bdb340>] ? kiblnd_check_conns+0x870/0x870 [ko2iblnd] [ 8225.570069] [<ffffffffb98bb621>] kthread+0xd1/0xe0 [ 8225.576523] [<ffffffffb98bb550>] ? insert_kthread_work+0x40/0x40 [ 8225.584338] [<ffffffffb9f205f7>] ret_from_fork_nospec_begin+0x21/0x21 [ 8225.592644] [<ffffffffb98bb550>] ? insert_kthread_work+0x40/0x40 [ 8225.600518] Kernel Offset: 0x38800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Attachments
Issue Links
- duplicates
-
LU-11271 LNet Health: o2iblnd, conditionally set health status
- Resolved