Details
-
Bug
-
Resolution: Duplicate
-
Blocker
-
None
-
Lustre 2.15.0
-
None
-
lustre-master-ib tag-2.14.50_115_gf55fdff
-
3
-
9223372036854775807
Description
Move soak to lustre-master-ib build 537 on 3/2, the system was reformatted. it runs fro about 5 hours and 1 mds node hit following LBUG
[ 373.349621] Lustre: soaked-MDT0000: Imperative Recovery not enabled, recovery window 300-900 [ 374.778336] Lustre: soaked-MDT0000: Will be in recovery for at least 5:00, or until 26 clients reconnect [ 381.623999] Lustre: soaked-MDT0000: Recovery over after 0:07, of 26 clients 26 recovered and 0 were evicted. [ 488.543482] LustreError: 4036:0:(llog_cat.c:603:llog_cat_add_rec()) llog_write_rec -116: lh=ffff96e6b6ba2200 [ 488.554466] LustreError: 4036:0:(update_trans.c:1051:top_trans_stop()) soaked-MDT0003-osp-MDT0000: write updates failed: rc = -116 [ 488.826212] LustreError: 3934:0:(llog_cat.c:752:llog_cat_cancel_arr_rec()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 llog-records: rc = -116 [ 488.840617] LustreError: 3934:0:(llog_cat.c:788:llog_cat_cancel_records()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 of 1 llog-records: rc = -116 [ 489.331008] LustreError: 4026:0:(llog_cat.c:603:llog_cat_add_rec()) llog_write_rec -116: lh=ffff96e6b6ba2200 [ 751.036178] LustreError: 3851:0:(llog_osd.c:622:llog_osd_write_rec()) LBUG [ 751.043861] Pid: 3851, comm: mdt00_002 3.10.0-1160.6.1.el7_lustre.x86_64 #1 SMP Mon Feb 22 19:55:21 UTC 2021 [ 751.054838] Call Trace: [ 751.057614] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs] [ 751.063361] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs] [ 751.068762] [<0>] llog_osd_write_rec+0x15f1/0x1630 [obdclass] [ 751.075201] [<0>] llog_write_rec+0x273/0x550 [obdclass] [ 751.081067] [<0>] llog_cat_add_rec+0x220/0x880 [obdclass] [ 751.087111] [<0>] llog_add+0x162/0x1d0 [obdclass] [ 751.092446] [<0>] sub_updates_write+0x2e8/0xdd9 [ptlrpc] [ 751.098437] [<0>] top_trans_stop+0x84f/0xbf0 [ptlrpc] [ 751.104089] [<0>] lod_trans_stop+0x25c/0x340 [lod] [ 751.109473] [<0>] mdd_trans_stop+0x2e/0x174 [mdd] [ 751.114735] [<0>] mdd_create+0x1441/0x1820 [mdd] [ 751.119913] [<0>] mdo_create+0x46/0x48 [mdt] [ 751.124695] [<0>] mdt_create+0x9c5/0xd00 [mdt] [ 751.129670] [<0>] mdt_reint_create+0x310/0x3c0 [mdt] [ 751.135235] [<0>] mdt_reint_rec+0x83/0x210 [mdt] [ 751.140408] [<0>] mdt_reint_internal+0x720/0xaf0 [mdt] [ 751.146164] [<0>] mdt_reint+0x67/0x140 [mdt] [ 751.150984] [<0>] tgt_request_handle+0x7ea/0x1750 [ptlrpc] [ 751.157146] [<0>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc] [ 751.164194] [<0>] ptlrpc_main+0xb3c/0x14e0 [ptlrpc] [ 751.169653] [<0>] kthread+0xd1/0xe0 [ 751.173564] [<0>] ret_from_fork_nospec_begin+0x21/0x21 [ 751.179324] [<0>] 0xfffffffffffffffe [ 751.183358] LustreError: dumping log to /tmp/lustre-log.1614665912.3851 [ 855.168034] LustreError: 4276:0:(llog_cat.c:603:llog_cat_add_rec()) llog_write_rec -116: lh=ffff96e6b6ba0600 [ 855.179033] LustreError: 4276:0:(update_trans.c:1051:top_trans_stop()) soaked-MDT0002-osp-MDT0000: write updates failed: rc = -116 [ 914.209786] LustreError: 4347:0:(llog_cat.c:603:llog_cat_add_rec()) llog_write_rec -116: lh=ffff96e6b6ba0600 [ 914.220789] LustreError: 4347:0:(update_trans.c:1051:top_trans_stop()) soaked-MDT0002-osp-MDT0000: write updates failed: rc = -116 [ 951.201186] Lustre: mdt00_004: service thread pid 4024 was inactive for 200.368 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: [ 951.201189] Pid: 4026, comm: mdt00_005 3.10.0-1160.6.1.el7_lustre.x86_64 #1 SMP Mon Feb 22 19:55:21 UTC 2021 [ 951.201193] Call Trace: [ 951.201195] Lustre: mdt00_002: service thread pid 3851 was inactive for 200.368 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. [ 951.201206] Lustre: Skipped 1 previous similar message [ 951.201291] [<0>] ldlm_completion_ast+0x777/0x9d0 [ptlrpc] [ 951.201336] [<0>] ldlm_cli_enqueue_fini+0x9e0/0xe50 [ptlrpc] [ 951.201380] [<0>] ldlm_cli_enqueue+0x436/0x9b0 [ptlrpc] [ 951.201401] [<0>] osp_md_object_lock+0x165/0x2d0 [osp] [ 951.201425] [<0>] lod_object_lock+0xde/0x760 [lod] [ 951.201444] [<0>] mdd_object_lock+0x30/0xd0 [mdd] [ 951.201467] [<0>] mdt_remote_object_lock_try+0x1cf/0x500 [mdt] [ 951.201491] [<0>] mdt_object_lock_internal+0x18c/0x360 [mdt] [ 951.201507] [<0>] mdt_object_lock+0x20/0x30 [mdt] [ 951.201527] [<0>] mdt_create+0x6a6/0xd00 [mdt] [ 951.201545] [<0>] mdt_reint_create+0x310/0x3c0 [mdt] [ 951.201564] [<0>] mdt_reint_rec+0x83/0x210 [mdt] [ 951.201585] [<0>] mdt_reint_internal+0x720/0xaf0 [mdt] [ 951.201602] [<0>] mdt_reint+0x67/0x140 [mdt] [ 951.201679] [<0>] tgt_request_handle+0x7ea/0x1750 [ptlrpc] [ 951.201734] [<0>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc] [ 951.201788] [<0>] ptlrpc_main+0xb3c/0x14e0 [ptlrpc] [ 951.201793] [<0>] kthread+0xd1/0xe0 [ 951.201798] [<0>] ret_from_fork_nospec_begin+0x21/0x21 [ 951.201825] [<0>] 0xfffffffffffffffe [ 951.201830] Pid: 4107, comm: mdt01_009 3.10.0-1160.6.1.el7_lustre.x86_64 #1 SMP Mon Feb 22 19:55:21 UTC 2021 [ 951.201831] Call Trace: [ 951.513260] [<0>] ldlm_completion_ast+0x777/0x9d0 [ptlrpc] [ 951.519429] [<0>] ldlm_cli_enqueue_fini+0x9e0/0xe50 [ptlrpc] [ 951.525777] [<0>] ldlm_cli_enqueue+0x436/0x9b0 [ptlrpc] [ 951.531620] [<0>] osp_md_object_lock+0x165/0x2d0 [osp] [ 951.537372] [<0>] lod_object_lock+0xde/0x760 [lod] [ 951.542732] [<0>] mdd_object_lock+0x30/0xd0 [mdd] [ 951.548003] [<0>] mdt_remote_object_lock_try+0x1cf/0x500 [mdt] [ 951.554532] [<0>] mdt_object_lock_internal+0x18c/0x360 [mdt] [ 951.560862] [<0>] mdt_object_lock+0x20/0x30 [mdt] [ 951.566126] [<0>] mdt_create+0x6a6/0xd00 [mdt] [ 951.571099] [<0>] mdt_reint_create+0x310/0x3c0 [mdt] [ 951.576654] [<0>] mdt_reint_rec+0x83/0x210 [mdt] [ 951.581821] [<0>] mdt_reint_internal+0x720/0xaf0 [mdt] [ 951.587569] [<0>] mdt_reint+0x67/0x140 [mdt] [ 951.592378] [<0>] tgt_request_handle+0x7ea/0x1750 [ptlrpc] [ 951.598543] [<0>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc] [ 951.605583] [<0>] ptlrpc_main+0xb3c/0x14e0 [ptlrpc] [ 951.611039] [<0>] kthread+0xd1/0xe0 [ 951.614936] [<0>] ret_from_fork_nospec_begin+0x21/0x21 [ 951.620681] [<0>] 0xfffffffffffffffe [ 961.223773] INFO: task mdt00_002:3851 blocked for more than 120 seconds. [ 961.231298] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 961.240053] mdt00_002 D ffff96e8fc45e300 0 3851 2 0x00000080 [ 961.247966] Call Trace: [ 961.250732] [<ffffffffb86daf40>] ? wake_up_state+0x20/0x20 [ 961.256987] [<ffffffffb8d87229>] schedule+0x29/0x70 [ 961.262572] [<ffffffffc0bed9fd>] lbug_with_loc+0x7d/0xa0 [libcfs] [ 961.269543] [<ffffffffc0d15691>] llog_osd_write_rec+0x15f1/0x1630 [obdclass] [ 961.277540] [<ffffffffc0d03463>] llog_write_rec+0x273/0x550 [obdclass] [ 961.284986] [<ffffffffc0d08b70>] llog_cat_add_rec+0x220/0x880 [obdclass] [ 961.292618] [<ffffffffc0d005c2>] llog_add+0x162/0x1d0 [obdclass] [ 961.299470] [<ffffffffc0d2ef19>] ? lprocfs_counter_add+0xf9/0x160 [obdclass] [ 961.307547] [<ffffffffc10e15f5>] sub_updates_write+0x2e8/0xdd9 [ptlrpc] [ 961.315098] [<ffffffffc17265f5>] ? osp_md_attr_set+0xe5/0x1c0 [osp] [ 961.322283] [<ffffffffc10e12fc>] ? prepare_writing_updates.isra.11+0x139/0x14a [ptlrpc] [ 961.331408] [<ffffffffc10b3f0f>] top_trans_stop+0x84f/0xbf0 [ptlrpc] [ 961.338635] [<ffffffffc1633f93>] ? lod_attr_set+0x243/0x9d0 [lod] [ 961.345563] [<ffffffffc164e05e>] ? lod_sub_ref_add+0x1be/0x450 [lod] [ 961.352776] [<ffffffffc160ea2c>] lod_trans_stop+0x25c/0x340 [lod] [ 961.359735] [<ffffffffc16cd7be>] mdd_trans_stop+0x2e/0x174 [mdd] [ 961.366548] [<ffffffffc16afb91>] mdd_create+0x1441/0x1820 [mdd] [ 961.373306] [<ffffffffc151a893>] ? mdt_version_save+0x63/0xf0 [mdt] [ 961.380431] [<ffffffffc1565810>] mdo_create+0x46/0x48 [mdt] [ 961.386799] [<ffffffffc1524665>] mdt_create+0x9c5/0xd00 [mdt] [ 961.393383] [<ffffffffc1067ab1>] ? req_capsule_get_size+0x31/0x70 [ptlrpc] [ 961.401186] [<ffffffffc1524cb0>] mdt_reint_create+0x310/0x3c0 [mdt] [ 961.408288] [<ffffffffc1528de3>] mdt_reint_rec+0x83/0x210 [mdt] [ 961.415022] [<ffffffffc1500bb0>] mdt_reint_internal+0x720/0xaf0 [mdt] [ 961.422356] [<ffffffffc150c747>] mdt_reint+0x67/0x140 [mdt] [ 961.428741] [<ffffffffc10a2eda>] tgt_request_handle+0x7ea/0x1750 [ptlrpc] [ 961.436454] [<ffffffffc1084281>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [ 961.444944] [<ffffffffc0bea02e>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [ 961.452877] [<ffffffffc104d8d6>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc] [ 961.461433] [<ffffffffb86c6d10>] ? wake_up_atomic_t+0x30/0x30 [ 961.467978] [<ffffffffc105242c>] ptlrpc_main+0xb3c/0x14e0 [ptlrpc] [ 961.475051] [<ffffffffc10518f0>] ? ptlrpc_register_service+0xf90/0xf90 [ptlrpc] [ 961.483347] [<ffffffffb86c5c21>] kthread+0xd1/0xe0 [ 961.488803] [<ffffffffb86c5b50>] ? insert_kthread_work+0x40/0x40 [ 961.495612] [<ffffffffb8d94df7>] ret_from_fork_nospec_begin+0x21/0x21 [ 961.502904] [<ffffffffb86c5b50>] ? insert_kthread_work+0x40/0x40 ...... [ 1652.388142] LustreError: 4349:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1614666513, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-soaked-MDT0000_UUID lock: ffff96e8e2faaf40/0x9d008a4e71cca5ef lrc: 3/ 1,0 mode: --/PR res: [0x200000408:0x4:0x0].0x0 bits 0x13/0x0 rrc: 23 type: IBT gid 0 flags: 0x40210000000000 nid: local remote: 0x0 expref: - 99 pid: 4349 timeout: 0 lvb_type: 0 [ 1653.117989] Lustre: 4053:0:(service.c:1434:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/5), not sending early reply req@ff ff96ed4d3b8d80 x1693073188179200/t8590852393(0) o36->dad5519f-3c27-4aba-baad-7ec78af8ef07@192.168.1.140@o2ib:414/0 lens 552/448 e 1 to 0 dl 1 614666819 ref 2 fl Interpret:/0/0 rc 0/0 job:'' [ 1653.150120] Lustre: 4053:0:(service.c:1434:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message [ 1659.403233] Lustre: soaked-MDT0000: Client dad5519f-3c27-4aba-baad-7ec78af8ef07 (at 192.168.1.140@o2ib) reconnecting [ 1662.749484] LustreError: 4161:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1614666524, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-soaked-MDT0000_UUID lock: ffff96e8e7908480/0x9d008a4e71cd5398 lrc: 3/ 1,0 mode: --/PR res: [0x200000408:0x4:0x0].0x0 bits 0x13/0x0 rrc: 23 type: IBT gid 0 flags: 0x40210000000000 nid: local remote: 0x0 expref: - 99 pid: 4161 timeout: 0 lvb_type: 0 [ 1715.578839] ptlrpc_watchdog_fire: 4 callbacks suppressed [ 1715.578854] Lustre: mdt01_011: service thread pid 4276 was inactive for 657.231 seconds. Watchdog stack traces are limited to 3 per 300 se conds, skipping this one. [ 1715.578857] Lustre: Skipped 2 previous similar messages [ 1715.606956] Lustre: mdt01_004: service thread pid 4022 was inactive for 657.012 seconds. The thread might be hung, or it might only be slo w and will resume later. Dumping the stack trace for debugging purposes: [ 1715.627844] Pid: 4022, comm: mdt01_004 3.10.0-1160.6.1.el7_lustre.x86_64 #1 SMP Mon Feb 22 19:55:21 UTC 2021 [ 1715.638826] Call Trace: [ 1715.641695] [<0>] ldlm_completion_ast+0x635/0x9d0 [ptlrpc] [ 1715.647872] [<0>] ldlm_cli_enqueue_local+0x25c/0x850 [ptlrpc] [ 1715.654376] [<0>] mdt_object_local_lock+0x51f/0xb50 [mdt] [ 1715.660423] [<0>] mdt_object_lock_internal+0x70/0x360 [mdt] [ 1715.666687] [<0>] mdt_getattr_name_lock+0xa7a/0x1ef0 [mdt] [ 1715.672831] [<0>] mdt_intent_getattr+0x2b5/0x480 [mdt] [ 1715.678594] [<0>] mdt_intent_opc+0x1ba/0xb40 [mdt] [ 1715.683959] [<0>] mdt_intent_policy+0x1a1/0x360 [mdt] [ 1715.689661] [<0>] ldlm_lock_enqueue+0x3c5/0xa90 [ptlrpc] [ 1715.695653] [<0>] ldlm_handle_enqueue0+0xa66/0x1620 [ptlrpc] [ 1715.702065] [<0>] tgt_enqueue+0x62/0x210 [ptlrpc] [ 1715.707385] [<0>] tgt_request_handle+0x7ea/0x1750 [ptlrpc] [ 1715.713576] [<0>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc] [ 1715.720624] [<0>] ptlrpc_main+0xb3c/0x14e0 [ptlrpc] [ 1715.726099] [<0>] kthread+0xd1/0xe0 [ 1715.730023] [<0>] ret_from_fork_nospec_begin+0x21/0x21 [ 1715.735803] [<0>] 0xfffffffffffffffe
Then after system reboot, it hit hardware issue
[ 17.065581] sd 0:0:1:3: [sdj] Attached SCSI disk [ 17.070444] sd 0:0:0:2: [sde] Attached SCSI disk [ 17.072246] sd 0:0:1:2: [sdi] 6986547200 512-byte logical blocks: (3.57 TB/3.25 TiB) [ 17.107708] sd 0:0:1:2: [sdi] Write Protect is off [ 17.125880] sd 0:0:1:2: [sdi] Write cache: enabled, read cache: enabled, supports DPO and FUA [ 17.209669] Dev sdi: unable to read RDB block 0 [ 17.209687] sdi: unable to read partition table [ 17.310369] sd 0:0:1:2: [sdi] Attached SCSI disk [ 17.400189] sd 1:0:0:0: [sda] Write Protect is off [ 17.405605] sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 17.436028] sda: sda1 sda2 [ 17.439935] sd 1:0:0:0: [sda] Attached SCSI disk [ 17.577436] sd 0:0:1:1: [sdh] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s [ 17.587772] sd 0:0:1:1: [sdh] tag#1 Sense Key : Hardware Error [current] [ 17.595372] sd 0:0:1:1: [sdh] tag#1 <<vendor>>ASC=0x84 ASCQ=0x0 [ 17.602089] sd 0:0:1:1: [sdh] tag#1 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 00 08 00 00 [ 17.612011] blk_update_request: critical target error, dev sdh, sector 0 [ 17.619501] Buffer I/O error on dev sdh, logical block 0, async page read [ 18.127431] sd 0:0:1:1: [sdh] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s [ 18.137759] sd 0:0:1:1: [sdh] tag#0 Sense Key : Hardware Error [current] [ 18.145349] sd 0:0:1:1: [sdh] tag#0 <<vendor>>ASC=0x84 ASCQ=0x0 [ 18.152071] sd 0:0:1:1: [sdh] tag#0 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 00 08 00 00 [ 18.161978] blk_update_request: critical target error, dev sdh, sector 0 [ 18.169468] Buffer I/O error on dev sdh, logical block 0, async page read [ 18.677551] sd 0:0:1:1: [sdh] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s [ 18.687850] sd 0:0:1:1: [sdh] tag#0 Sense Key : Hardware Error [current] [ 18.695435] sd 0:0:1:1: [sdh] tag#0 <<vendor>>ASC=0x84 ASCQ=0x0 [ 18.702154] sd 0:0:1:1: [sdh] tag#0 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 00 08 00 00 [ 18.712076] blk_update_request: critical target error, dev sdh, sector 0 [ 18.719565] Buffer I/O error on dev sdh, logical block 0, async page read [ 19.244098] sd 0:0:1:1: [sdh] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s