[LU-10871] MDS hit LBUG: LustreError: 2566:0:(osd_handler.c:3304:osd_destroy()) ASSERTION( !lu_object_is_dying(dt->do_lu.lo_header) ) failed Created: 02/Apr/18 Updated: 30/Aug/23 Resolved: 30/Aug/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0, Lustre 2.14.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Sarah Liu | Assignee: | Hongchao Zhang |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | soak |
| Environment: | 2.11.0 RC3 |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
After soak had been running for about 7 hours, the MDS hit an LBUG:
Apr 1 23:58:08 soak-11 multipathd: sdp: mark as failed
Apr 1 23:58:08 soak-11 multipathd: 360080e50001fedb80000015952012962: Entering recovery mode: max_retries=300
Apr 1 23:58:08 soak-11 multipathd: 360080e50001fedb80000015952012962: remaining active paths: 0
Apr 1 23:58:09 soak-11 kernel: LustreError: 2566:0:(osd_handler.c:3853:osd_ref_del()) soaked-MDT0003: nlink == 0 on [0x2c000d70d:0x389:0x0], maybe an upgraded file? (LU-3915)
Apr 1 23:58:09 soak-11 kernel: LustreError: 2566:0:(osd_handler.c:3304:osd_destroy()) ASSERTION( !lu_object_is_dying(dt->do_lu.lo_header) ) failed:
Apr 1 23:58:09 soak-11 kernel: LustreError: 2566:0:(osd_handler.c:3304:osd_destroy()) LBUG
Apr 1 23:58:09 soak-11 kernel: Pid: 2566, comm: mdt_out00_004
Apr 1 23:58:09 soak-11 kernel: #012Call Trace:
Apr 1 23:58:09 soak-11 kernel: [<ffffffffc0dc47ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
Apr 1 23:58:09 soak-11 kernel: [<ffffffffc0dc483c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Apr 1 23:58:09 soak-11 kernel: [<ffffffffc142a0b0>] osd_destroy+0x4a0/0x760 [osd_ldiskfs]
Apr 1 23:58:09 soak-11 kernel: [<ffffffffc1429237>] ? osd_ref_del+0x2c7/0x6a0 [osd_ldiskfs]
Apr 1 23:58:09 soak-11 kernel: [<ffffffffc1428589>] ? osd_attr_set+0x199/0xb80 [osd_ldiskfs]
Apr 1 23:58:09 soak-11 kernel: [<ffffffff816b3232>] ? down_write+0x12/0x3d
Apr 1 23:58:09 soak-11 kernel: [<ffffffffc11c9141>] out_obj_destroy+0x101/0x2c0 [ptlrpc]
Apr 1 23:58:09 soak-11 kernel: [<ffffffffc11c93b0>] out_tx_destroy_exec+0x20/0x190 [ptlrpc]
Apr 1 23:58:09 soak-11 kernel: [<ffffffffc11c3e91>] out_tx_end+0xe1/0x5c0 [ptlrpc]
Apr 1 23:58:09 soak-11 kernel: [<ffffffffc11c7a72>] out_handle+0x1442/0x1bb0 [ptlrpc]
Apr 1 23:58:09 soak-11 kernel: [<ffffffffc11568a2>] ? lustre_msg_get_opc+0x22/0xf0 [ptlrpc]
Apr 1 23:58:09 soak-11 kernel: [<ffffffffc11be0a9>] ? tgt_request_preprocess.isra.28+0x299/0x7a0 [ptlrpc]
Apr 1 23:58:09 soak-11 kernel: [<ffffffffc11beeda>] tgt_request_handle+0x92a/0x13b0 [ptlrpc]
Apr 1 23:58:09 soak-11 kernel: [<ffffffffc1164813>] ptlrpc_server_handle_request+0x253/0xab0 [ptlrpc]
Apr 1 23:58:09 soak-11 kernel: [<ffffffffc11616c8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Apr 1 23:58:09 soak-11 kernel: [<ffffffff810c7c82>] ? default_wake_function+0x12/0x20
Apr 1 23:58:09 soak-11 kernel: [<ffffffff810bdc4b>] ? __wake_up_common+0x5b/0x90
Apr 1 23:58:10 soak-11 kernel: [<ffffffffc1167fc2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Apr 1 23:58:10 soak-11 kernel: [<ffffffffc1167530>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
Apr 1 23:58:10 soak-11 kernel: [<ffffffff810b4031>] kthread+0xd1/0xe0
Apr 1 23:58:10 soak-11 kernel: [<ffffffff810b3f60>] ? kthread+0x0/0xe0
Apr 1 23:58:10 soak-11 kernel: [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
Apr 1 23:58:10 soak-11 kernel: [<ffffffff810b3f60>] ? kthread+0x0/0xe0
Apr 1 23:58:10 soak-11 kernel:
Apr 1 23:58:10 soak-11 kernel: Kernel panic - not syncing: LBUG
Apr 2 00:02:02 soak-11 systemd: Starting Stop Read-Ahead Data Collection...
Apr 2 00:02:02 soak-11 rsyslogd: action 'action 0' resumed (module 'builtin:omfwd') [v8.24.0 try http://www.rsyslog.com/e/2359 ]
Apr 2 00:02:02 soak-11 rsyslogd: action 'action 0' resumed (module 'builtin:omfwd') [v8.24.0 try http://www.rsyslog.com/e/2359 ]
Apr 2 00:02:02 soak-11 systemd: Started Stop Read-Ahead Data Collection.
Apr 2 00:03:37 soak-11 chronyd[1280]: Source 64.6.144.6 replaced with 69.10.161.7
Apr 2 00:10:01 soak-11 systemd: Created slice User Slice of root.
Apr 2 00:10:01 soak-11 systemd: Starting User Slice of root.
Apr 2 00:10:01 soak-11 systemd: Started Session 1 of user root.
Apr 2 00:10:01 soak-11 systemd: Starting Session 1 of user root.
Apr 2 00:10:01 soak-11 CROND[2234]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Apr 2 00:10:01 soak-11 systemd: Removed slice User Slice of root.
Apr 2 00:10:01 soak-11 systemd: Stopping User Slice of root.
Apr 2 00:15:54 soak-11 systemd: Starting Cleanup of Temporary Directories...
Apr 2 00:15:54 soak-11 systemd: Started Cleanup of Temporary Directories.
|
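For reference, below is a minimal standalone C sketch of the condition that trips here. It is not the actual Lustre source: the structure layouts, the loh_dying field, and the helper bodies are assumptions; only the names in the assertion text ("lu_object_is_dying(dt->do_lu.lo_header)" in osd_destroy()) are taken from the log above. The point is simply that the destroy path refuses to run on an object whose lu_object header is already marked as dying, and the OUT service thread (mdt_out00_004, pid 2566) hit exactly that assertion.

/* Simplified sketch of the failing check; not the Lustre implementation. */
#include <assert.h>
#include <stdbool.h>

struct lu_object_header {
	bool loh_dying;                 /* assumed stand-in for the real "dying" flag bits */
};

struct lu_object {
	struct lu_object_header *lo_header;
};

struct dt_object {
	struct lu_object do_lu;
};

/* assumed helper mirroring what lu_object_is_dying() reports */
static bool lu_object_is_dying(const struct lu_object_header *h)
{
	return h->loh_dying;
}

/* sketch of the destroy entry point: it must not see a dying object */
static int osd_destroy_sketch(struct dt_object *dt)
{
	/* This is the condition reported as failing in the LBUG above:
	 * the object handed to the destroy path was already marked dying. */
	assert(!lu_object_is_dying(dt->do_lu.lo_header));

	/* ... actual object/inode destruction would follow here ... */
	return 0;
}

int main(void)
{
	struct lu_object_header hdr = { .loh_dying = false };
	struct dt_object obj = { .do_lu = { .lo_header = &hdr } };

	osd_destroy_sketch(&obj);       /* fine: object is not dying */

	hdr.loh_dying = true;           /* the same call would now abort, */
	osd_destroy_sketch(&obj);       /* analogous to the LBUG above    */
	return 0;
}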
| Comments |
| Comment by Cliff White (Inactive) [ 02/Apr/18 ] |
|
Core dump is available on soak at /scratch/dumps |
| Comment by Andreas Dilger [ 02/Apr/18 ] |
|
The |
| Comment by Peter Jones [ 03/Apr/18 ] |
|
Hongchao, can you please investigate? Thanks, Peter |
| Comment by Hongchao Zhang [ 04/Apr/18 ] |
|
Hi Cliff, is the core dump at /scratch/dumps on onyx.hpdd.intel.com? I can't find it in that directory. |
| Comment by Cliff White (Inactive) [ 04/Apr/18 ] |
|
The dump is on spirit (spirit.hpdd.intel.com). |
| Comment by Hongchao Zhang [ 08/Apr/18 ] |
|
Hi Cliff, |
| Comment by Sarah Liu [ 18/Feb/20 ] |
|
Hit the problem again on the master branch:
[ 2575.647060] Lustre: Skipped 17 previous similar messages
[ 2575.694765] LustreError: dumping log to /tmp/lustre-log.1581765979.5236
[ 2576.220180] LNet: 4683:0:(o2iblnd_cb.c:3393:kiblnd_check_conns()) Timed out tx for 192.168.1.110@o2ib: 0 seconds
[ 2576.231562] LNet: 4683:0:(o2iblnd_cb.c:3393:kiblnd_check_conns()) Skipped 6 previous similar messages
[ 2576.241901] LNetError: 4683:0:(lib-msg.c:479:lnet_handle_local_failure()) ni 192.168.1.108@o2ib added to recovery queue. Health = 900
[ 2606.654791] LustreError: 6202:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1581765710, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-soaked-MDT0000_UUID lock: ffff88f8c632d580/0xa21a133420fc92f4 lrc: 3/1,0 mode: --/PR res: [0x20000040e:0x3:0x0].0x0 bits 0x13/0x0 rrc: 13 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 6202 timeout: 0 lvb_type: 0
[ 2606.698855] LustreError: 6202:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) Skipped 3 previous similar messages
[ 2643.197397] LNetError: 4683:0:(lib-msg.c:479:lnet_handle_local_failure()) ni 192.168.1.108@o2ib added to recovery queue. Health = 900
[ 2643.210833] LNetError: 4683:0:(lib-msg.c:479:lnet_handle_local_failure()) Skipped 1 previous similar message
[ 2767.192219] Lustre: MGS: Connection restored to 5c9f140e-4423-4 (at 192.168.1.110@o2ib)
[ 2767.201198] Lustre: Skipped 1 previous similar message
[ 2769.137827] Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.110@o2ib, removing former export from same NID
[ 2769.150490] Lustre: Skipped 1 previous similar message
[ 2853.947102] Lustre: soaked-MDT0002-osp-MDT0000: Connection restored to 192.168.1.110@o2ib (at 192.168.1.110@o2ib)
[ 2853.958590] Lustre: Skipped 2 previous similar messages
[ 2870.647167] Lustre: 4940:0:(service.c:1440:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/5), not sending early reply req@ffff88f8af2fda00 x1658449052245760/t0(0) o36->775f92d9-5437-4@192.168.1.127@o2ib:509/0 lens 528/2888 e 24 to 0 dl 1581766279 ref 2 fl Interpret:/0/0 rc 0/0 job:''
[ 2871.318872] Lustre: 4858:0:(service.c:1440:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/5), not sending early reply req@ffff88f4963b2d00 x1658449307111936/t0(0) o36->4d2dbe7c-0a77-4@192.168.1.126@o2ib:509/0 lens 528/2888 e 24 to 0 dl 1581766279 ref 2 fl Interpret:/0/0 rc 0/0 job:''
[ 2871.348299] Lustre: 4858:0:(service.c:1440:ptlrpc_at_send_early_reply()) Skipped 2 previous similar messages
[ 2877.572936] Lustre: soaked-MDT0000: Client 4d2dbe7c-0a77-4 (at 192.168.1.126@o2ib) reconnecting
[ 3153.983138] Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.110@o2ib, removing former export from same NID
[ 3153.995796] Lustre: Skipped 1 previous similar message
[ 3153.996927] Lustre: soaked-MDT0000: Connection restored to soaked-MDT0001-mdtlov_UUID (at 192.168.1.109@o2ib)
[ 3153.996931] Lustre: Skipped 4 previous similar messages
[ 3154.093652] LustreError: 5720:0:(osd_handler.c:4169:osd_ref_del()) soaked-MDT0000: nlink == 0 on [0x200011d4e:0x9eaf:0x0], maybe an upgraded file? (LU-3915)
[ 3154.109319] LustreError: 5720:0:(osd_handler.c:3592:osd_destroy()) ASSERTION( !lu_object_is_dying(dt->do_lu.lo_header) ) failed:
[ 3154.122342] LustreError: 5720:0:(osd_handler.c:3592:osd_destroy()) LBUG
[ 3154.129735] Pid: 5720, comm: mdt_out01_006 3.10.0-1062.9.1.el7_lustre.x86_64 #1 SMP Wed Feb 12 06:45:58 UTC 2020
[ 3154.141104] Call Trace:
[ 3154.143850] [<ffffffffc0b0ffac>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 3154.151193] [<ffffffffc0b1005c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 3154.158133] [<ffffffffc1224f80>] osd_destroy+0x4a0/0x760 [osd_ldiskfs]
[ 3154.165558] [<ffffffffc0f92321>] out_obj_destroy+0x101/0x2c0 [ptlrpc]
[ 3154.172976] [<ffffffffc0f92590>] out_tx_destroy_exec+0x20/0x190 [ptlrpc]
[ 3154.180640] [<ffffffffc0f8c991>] out_tx_end+0xe1/0x5c0 [ptlrpc]
[ 3154.187425] [<ffffffffc0f90c52>] out_handle+0x1442/0x1bb0 [ptlrpc]
[ 3154.194490] [<ffffffffc0f8972a>] tgt_request_handle+0x95a/0x1610 [ptlrpc]
[ 3154.202224] [<ffffffffc0f2b0f6>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[ 3154.210846] [<ffffffffc0f2f4f4>] ptlrpc_main+0xbb4/0x1550 [ptlrpc]
[ 3154.217898] [<ffffffffa58c61f1>] kthread+0xd1/0xe0
[ 3154.223385] [<ffffffffa5f8dd37>] ret_from_fork_nospec_end+0x0/0x39
[ 3154.230407] [<ffffffffffffffff>] 0xffffffffffffffff
[ 3154.235988] Kernel panic - not syncing: LBUG
[ 3154.240763] CPU: 30 PID: 5720 Comm: mdt_out01_006 Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.9.1.el7_lustre.x86_64 #1
[ 3154.254834] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
[ 3154.267355] Call Trace:
[ 3154.270096] [<ffffffffa5f7ac23>] dump_stack+0x19/0x1b
[ 3154.275834] [<ffffffffa5f74967>] panic+0xe8/0x21f
[ 3154.281187] [<ffffffffc0b100ab>] lbug_with_loc+0x9b/0xa0 [libcfs]
[ 3154.288092] [<ffffffffc1224f80>] osd_destroy+0x4a0/0x760 [osd_ldiskfs]
[ 3154.295481] [<ffffffffa5f80caa>] ? _cond_resched+0x3a/0x50
[ 3154.301699] [<ffffffffa5f7fd42>] ? down_write+0x12/0x3d
[ 3154.307674] [<ffffffffc0f92321>] out_obj_destroy+0x101/0x2c0 [ptlrpc]
[ 3154.314999] [<ffffffffc0f92590>] out_tx_destroy_exec+0x20/0x190 [ptlrpc]
[ 3154.322612] [<ffffffffc0f8c991>] out_tx_end+0xe1/0x5c0 [ptlrpc]
[ 3154.329353] [<ffffffffc0f90c52>] out_handle+0x1442/0x1bb0 [ptlrpc]
[ 3154.336372] [<ffffffffa596ad65>] ? tracing_is_on+0x15/0x30
[ 3154.342627] [<ffffffffc0f8972a>] tgt_request_handle+0x95a/0x1610 [ptlrpc]
[ 3154.350305] [<ffffffffc0af700e>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[ 3154.358212] [<ffffffffc0f2b0f6>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[ 3154.366791] [<ffffffffc0f2787b>] ? ptlrpc_wait_event+0x12b/0x4f0 [ptlrpc]
[ 3154.374495] [<ffffffffa58d3360>] ? task_rq_unlock+0x20/0x20
[ 3154.380825] [<ffffffffa58d3903>] ? __wake_up+0x13/0x20
[ 3154.386689] [<ffffffffc0f2f4f4>] ptlrpc_main+0xbb4/0x1550 [ptlrpc]
[ 3154.393738] [<ffffffffc0f2e940>] ? ptlrpc_register_service+0xf90/0xf90 [ptlrpc]
[ 3154.401985] [<ffffffffa58c61f1>] kthread+0xd1/0xe0
[ 3154.407429] [<ffffffffa58c6120>] ? insert_kthread_work+0x40/0x40
[ 3154.414231] [<ffffffffa5f8dd37>] ret_from_fork_nospec_begin+0x21/0x21
[ 3154.421517] [<ffffffffa58c6120>] ? insert_kthread_work+0x40/0x40
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
|
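In both dumps the same service thread (pid 2566 in the first, pid 5720 in the second) logs the osd_ref_del() "nlink == 0 ... maybe an upgraded file? (LU-3915)" message immediately before the osd_destroy() assertion fires. As context only, here is a rough standalone C sketch, an assumption rather than the Lustre source, of the kind of guard that would produce that warning: the link-count decrement notices the count is already zero, logs, and returns instead of underflowing, while the destroy step of the same OUT transaction still runs afterwards.

/* Hypothetical illustration of the warning path; not the Lustre implementation. */
#include <stdio.h>

struct fake_inode {
	unsigned int i_nlink;           /* illustrative stand-in for the backing inode's link count */
};

/* hypothetical helper modeling the osd_ref_del() warning seen in the logs */
static int osd_ref_del_sketch(const char *dev, struct fake_inode *inode,
			      const char *fid_str)
{
	if (inode->i_nlink == 0) {
		/* wording mirrors the LustreError line in the dumps above */
		fprintf(stderr, "%s: nlink == 0 on %s, maybe an upgraded file? (LU-3915)\n",
			dev, fid_str);
		return 0;       /* refuse to underflow; only warn */
	}
	inode->i_nlink--;
	return 0;
}

int main(void)
{
	struct fake_inode ino = { .i_nlink = 0 };

	/* reproduces, in spirit, the warning from the MDT log */
	osd_ref_del_sketch("soaked-MDT0000", &ino, "[0x200011d4e:0x9eaf:0x0]");
	return 0;
}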
| Comment by Sarah Liu [ 08/Jul/20 ] |
|
Hit this again on lustre-master-ib build #437. |