[LU-10390] MGS crashes in ldlm_reprocess_queue() when stopping Created: 14/Dec/17 Updated: 02/Aug/21 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Stephane Thiell | Assignee: | Emoly Liu |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | None | ||
| Environment: |
CentOS 7.4 |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
Never seen before but already twice with the latest 2.10.2 version: the MGS is crashing when stopping:

[77225.855547] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
[77225.864304] IP: [<ffffffffc0ba48ce>] ldlm_process_plain_lock+0x6e/0xb30 [ptlrpc]
[77225.872614] PGD 0
[77225.874864] Oops: 0000 [#1] SMP
[77225.878480] Modules linked in: mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) vfat fat uas usb_storage mpt2sas mptctl mptbase dell_rbu rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core sb_edac edac_core intel_powerclamp dm_service_time coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper iTCO_wdt ablk_helper iTCO_vendor_support cryptd dm_round_robin pcspkr mxm_wmi dcdbas sg ipmi_si ipmi_devintf ipmi_msghandler mei_me mei lpc_ich shpchp acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd grace dm_multipath sunrpc dm_mod ip_tables ext4 mbcache
[77225.957975] jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mlx4_core drm tg3 ahci libahci crct10dif_pclmul mpt3sas crct10dif_common raid_class ptp crc32c_intel libata megaraid_sas i2c_core devlink scsi_transport_sas pps_core
[77225.986851] CPU: 23 PID: 27105 Comm: ldlm_bl_14 Tainted: G OE ------------ 3.10.0-693.2.2.el7_lustre.pl1.x86_64 #1
[77225.999661] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.3.4 11/08/2016
[77226.008010] task: ffff88103323cf10 ti: ffff881012ab8000 task.ti: ffff881012ab8000
[77226.016359] RIP: 0010:[<ffffffffc0ba48ce>] [<ffffffffc0ba48ce>] ldlm_process_plain_lock+0x6e/0xb30 [ptlrpc]
[77226.027355] RSP: 0018:ffff881012abbbe0 EFLAGS: 00010287
[77226.033280] RAX: 0000000000000000 RBX: ffff881011f7d400 RCX: ffff881012abbc7c
[77226.041240] RDX: 0000000000000002 RSI: ffff881012abbc80 RDI: ffff881011f7d400
[77226.049201] RBP: ffff881012abbc58 R08: ffff881012abbcd0 R09: ffff88103d0d7880
[77226.057162] R10: ffff881011f7d400 R11: 7fffffffffffffff R12: ffff880168287540
[77226.065123] R13: 0000000000000002 R14: ffff881012abbcd0 R15: ffff881011f7d460
[77226.073085] FS: 0000000000000000(0000) GS:ffff88203c8c0000(0000) knlGS:0000000000000000
[77226.082111] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[77226.088521] CR2: 000000000000001c CR3: 00000000019f2000 CR4: 00000000001407e0
[77226.096482] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[77226.104443] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[77226.112403] Stack:
[77226.114642] ffff881012abbc7c ffff881012abbcd0 ffff881012abbc80 0000000000000000
[77226.122930] ffff880168287520 0000001000000001 ffff880100000010 ffff881012abbc18
[77226.131219] ffff881012abbc18 00000000bd734aeb 0000000000000002 ffff880168287540
[77226.139507] Call Trace:
[77226.142252] [<ffffffffc0ba4860>] ? ldlm_errno2error+0x60/0x60 [ptlrpc]
[77226.149649] [<ffffffffc0b8f9db>] ldlm_reprocess_queue+0x13b/0x2a0 [ptlrpc]
[77226.157434] [<ffffffffc0b9057d>] __ldlm_reprocess_all+0x14d/0x3a0 [ptlrpc]
[77226.165220] [<ffffffffc0b90b30>] ldlm_reprocess_res+0x20/0x30 [ptlrpc]
[77226.172611] [<ffffffffc0866bef>] cfs_hash_for_each_relax+0x21f/0x400 [libcfs]
[77226.180687] [<ffffffffc0b90b10>] ? ldlm_lock_downgrade+0x320/0x320 [ptlrpc]
[77226.188571] [<ffffffffc0b90b10>] ? ldlm_lock_downgrade+0x320/0x320 [ptlrpc]
[77226.196441] [<ffffffffc0869d95>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
[77226.204518] [<ffffffffc0b90b7c>] ldlm_reprocess_recovery_done+0x3c/0x110 [ptlrpc]
[77226.212983] [<ffffffffc0b917bc>] ldlm_export_cancel_locks+0x11c/0x130 [ptlrpc]
[77226.221162] [<ffffffffc0bbada8>] ldlm_bl_thread_main+0x4c8/0x700 [ptlrpc]
[77226.228836] [<ffffffff816a8fad>] ? __schedule+0x39d/0x8b0
[77226.234977] [<ffffffffc0bba8e0>] ? ldlm_handle_bl_callback+0x410/0x410 [ptlrpc]
[77226.243232] [<ffffffff810b098f>] kthread+0xcf/0xe0
[77226.248672] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[77226.255472] [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
[77226.261494] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[77226.268292] Code: 89 45 a0 74 0d f6 05 b3 ac cd ff 01 0f 85 34 06 00 00 8b 83 98 00 00 00 39 83 9c 00 00 00 89 45 b8 0f 84 57 09 00 00 48 8b 45 a0 <8b> 40 1c 85 c0 0f 84 7a 09 00 00 48 8b 4d a0 48 89 c8 48 83 c0
[77226.289871] RIP [<ffffffffc0ba48ce>] ldlm_process_plain_lock+0x6e/0xb30 [ptlrpc]
[77226.298248] RSP <ffff881012abbbe0>
[77226.302135] CR2: 000000000000001c

Best, |
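As an editorial aside (not from the ticket): the oops already pinpoints the fault. In the "Code:" line above, the faulting instruction is the bracketed `<8b> 40 1c`, i.e. `mov 0x1c(%rax),%eax`, and the register dump shows RAX = 0, which matches CR2 = 0x1c: the kernel is reading a 32-bit field at offset 0x1c of a NULL struct pointer inside ldlm_process_plain_lock(). A minimal sketch of that decode, using only the bytes and register values quoted in the oops (the decoding logic handles just this one disp8 addressing form):

```python
# Decode the faulting instruction from the oops "Code:" line.
# The kernel brackets the byte at the faulting RIP with <..>.
code = ("0f 84 57 09 00 00 48 8b 45 a0 <8b> 40 1c 85 c0 0f 84 7a 09 00 00")
rax = 0x0   # from the RAX: register line in the oops
cr2 = 0x1c  # faulting address from the CR2: line

toks = code.split()
i = next(n for n, t in enumerate(toks) if t.startswith("<"))
opcode = int(toks[i].strip("<>"), 16)
modrm = int(toks[i + 1], 16)
disp8 = int(toks[i + 2], 16)

assert opcode == 0x8B   # mov r32, r/m32
assert modrm == 0x40    # mod=01 (disp8), reg=eax, rm=rax
fault_addr = rax + disp8
print(hex(fault_addr))  # → 0x1c, matching CR2
```

In other words, some ldlm lock/resource structure reached the plain-lock reprocessing path as NULL while the namespace was being torn down.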
| Comments |
| Comment by Peter Jones [ 15/Dec/17 ] |
|
Thanks Stephane |
| Comment by Stephane Thiell [ 16/Feb/18 ] |
|
Crash dump for this issue, with Lustre 2.10.2 on all targets.
vmcore-oak-md1-s1-2017-12-14-11:47:46.gz: https://stanford.box.com/s/unzo08zycbafl5t62cwtkubl7hsr3y1n
vmlinux-3.10.0-693.2.2.el7_lustre.pl1.x86_64.gz: https://stanford.box.com/s/r6u6xjzmsgys2kzcq26562bqo6p45xcp
|
| Comment by Malcolm Haak - NCI (Inactive) [ 01/Mar/18 ] |
|
Hi, Quick question, are your clients at a matching version of lustre? Or are they running an earlier version? (Even 2.10.1?) |
| Comment by Stephane Thiell [ 01/Mar/18 ] |
|
Hi Malcolm, At the time of this issue, we had a mix of 2.10.0, 2.10.1 and 2.10.2 clients. I couldn't reproduce this problem with an MGS running 2.10.3 though.
|
| Comment by Malcolm Haak - NCI (Inactive) [ 01/Mar/18 ] |
|
Thank you so much. I was going to git bisect this, but if 2.10.3 fixes it I'll move on to testing that. Do you still have a mix of clients? I had an idea about which patches I was going to pick on for my testing. |
| Comment by Stephane Thiell [ 01/Mar/18 ] |
|
Sure. And yes, with the servers running 2.10.3 we had a mix of 2.10.x (x = 0,1,2,3) clients. I say "had" because we're not running 2.10.3 anymore as we downgraded all servers to 2.10.2 for other reasons. But I do confirm that we couldn't reproduce the MGS panic when unmounting the MGS when the servers were running version 2.10.3, and we did it several times. That reminds me that we haven't tried to stop the MGS now that we're back in 2.10.2. Just saw
|
| Comment by Malcolm Haak - NCI (Inactive) [ 01/Mar/18 ] |
|
Thanks for that info also. It at least points me in the direction I need for some testing. Thanks again |
| Comment by Emoly Liu [ 05/Mar/18 ] |
|
Thanks to Stephane for the information and to Malcolm for the testing! |
| Comment by Malcolm Haak - NCI (Inactive) [ 05/Mar/18 ] |
|
My testing was limited. I installed 2.10.1, 2.10.2 and 2.10.3. None of them had sane output in /proc/fs/lustre/ldlm/namespaces/MGS/pool/stats. At this point I decided something is horridly wrong, since nobody answered my questions in I'm going to start with 2.9.1 and see if it's also wacky in 2.9, and then when I figure out which point release it all went bad I'll start doing a git bisect. This has all come to a head for NCI because this non-granting or half-granting of locks is starting to really cause issues with clients.
LDLM is broken, and it looks like it's still broken in 2.10.3, despite your lack of shutdown crashes post upgrade. I couldn't get my test environment to crash, for the record, but it's tiny so I don't think it's a valid test. I do think having a look at the crash dumps will be enlightening. EDIT: Other issue is related to |
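Malcolm's sanity check on the pool stats file can be automated by flagging counters that are stuck at zero. The sketch below is purely illustrative: the sample text is hypothetical, and the real layout of a Lustre `stats` procfile (typically `name count samples [unit] ...` per line) may differ, so treat both the format and the counter names as assumptions rather than actual MGS output:

```python
def insane_pool_stats(text):
    """Return counter names whose first value is zero.

    Assumes each stats line looks like 'name value ...'; lines whose
    second field is not a plain integer (e.g. the snapshot_time header)
    are skipped.
    """
    bad = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1].isdigit() and int(parts[1]) == 0:
            bad.append(parts[0])
    return bad

# Hypothetical sample in the usual Lustre stats shape (not real output).
sample = """\
snapshot_time         1521203565.123456 secs.usecs
granted               0 samples [locks]
grant_rate            0 samples [locks/s]
cancel_rate           0 samples [locks/s]
"""
print(insane_pool_stats(sample))  # → ['granted', 'grant_rate', 'cancel_rate']
```

On a live MGS one would feed it the contents of /proc/fs/lustre/ldlm/namespaces/MGS/pool/stats (or `lctl get_param ldlm.namespaces.MGS.pool.stats`).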
| Comment by Stephane Thiell [ 16/Mar/18 ] |
|
Hello, This issue just happened in 2.10.3 when we tried to unmount the MGS. In this case, the server didn't crash but we saw a lot of soft lockup/CPU stuck stack traces on the MGS like this one:

[83347.066299] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [ldlm_bl_11:16013]
[83347.066322] Modules linked in: mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) ldiskfs(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) vfat fat uas usb_storage mpt2sas mptctl mptbase rpcsec_gss_krb5 dell_rbu nfsv4 dns_resolver nfs fscache ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi dm_service_time kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt dcdbas iTCO_vendor_support ipmi_si ipmi_devintf mxm_wmi dm_round_robin pcspkr sg ipmi_msghandler acpi_power_meter wmi mei_me mei shpchp lpc_ich nfsd auth_rpcgss dm_multipath dm_mod nfs_acl lockd grace sunrpc ip_tables ext4 mbcache
[83347.066330] jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm tg3 ahci crct10dif_pclmul crct10dif_common mlx4_core mpt3sas drm libahci crc32c_intel ptp raid_class libata megaraid_sas devlink i2c_core scsi_transport_sas pps_core
[83347.066331] CPU: 5 PID: 16013 Comm: ldlm_bl_11 Tainted: G OEL ------------ 3.10.0-693.2.2.el7_lustre.pl2.x86_64 #1
[83347.066332] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.6.0 10/26/2017
[83347.066333] task: ffff88203ad88000 ti: ffff88201c804000 task.ti: ffff88201c804000
[83347.066335] RIP: 0010:[<ffffffff810fa326>] [<ffffffff810fa326>] native_queued_spin_lock_slowpath+0x116/0x1e0
[83347.066335] RSP: 0018:ffff88201c807b70 EFLAGS: 00000246
[83347.066336] RAX: 0000000000000000 RBX: 000000013c796cc0 RCX: 0000000000290000
[83347.066337] RDX: ffff88103d017880 RSI: 0000000000810001 RDI: ffff88102137981c
[83347.066337] RBP: ffff88201c807b70 R08: ffff88203c697880 R09: 0000000000000000
[83347.066338] R10: ffff88202b216a00 R11: 0000000000000000 R12: ffff88201c807c58
[83347.066338] R13: 0000000000000001 R14: ffff88201c807b28 R15: ffffffff81322c35
[83347.066339] FS: 0000000000000000(0000) GS:ffff88203c680000(0000) knlGS:0000000000000000
[83347.066340] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[83347.066340] CR2: 00007fd7ddcd8090 CR3: 00000000019f2000 CR4: 00000000001407e0
[83347.066341] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[83347.066341] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[83347.066341] Stack:
[83347.066343] ffff88201c807b80 ffffffff8169e61f ffff88201c807b90 ffffffff816abb70
[83347.066344] ffff88201c807bd0 ffffffffc0b72198 0000000000000000 ffff88202b216a00
[83347.066345] ffff88201c807c18 ffff88202b216a60 ffff88202b216a00 ffff8810175b8000
[83347.066345] Call Trace:
[83347.066347] [<ffffffff8169e61f>] queued_spin_lock_slowpath+0xb/0xf
[83347.066349] [<ffffffff816abb70>] _raw_spin_lock+0x20/0x30
[83347.066367] [<ffffffffc0b72198>] ldlm_handle_conflict_lock+0xd8/0x330 [ptlrpc]
[83347.066388] [<ffffffffc0b86755>] ldlm_process_plain_lock+0x435/0xb30 [ptlrpc]
[83347.066407] [<ffffffffc0b86320>] ? ldlm_errno2error+0x60/0x60 [ptlrpc]
[83347.066425] [<ffffffffc0b7199b>] ldlm_reprocess_queue+0x13b/0x2a0 [ptlrpc]
[83347.066443] [<ffffffffc0b7253d>] __ldlm_reprocess_all+0x14d/0x3a0 [ptlrpc]
[83347.066460] [<ffffffffc0b72af0>] ldlm_reprocess_res+0x20/0x30 [ptlrpc]
[83347.066466] [<ffffffffc0847bef>] cfs_hash_for_each_relax+0x21f/0x400 [libcfs]
[83347.066483] [<ffffffffc0b72ad0>] ? ldlm_lock_downgrade+0x320/0x320 [ptlrpc]
[83347.066501] [<ffffffffc0b72ad0>] ? ldlm_lock_downgrade+0x320/0x320 [ptlrpc]
[83347.066506] [<ffffffffc084ad95>] cfs_hash_for_each_nolock+0x75/0x1c0 [libcfs]
[83347.066526] [<ffffffffc0b72b3c>] ldlm_reprocess_recovery_done+0x3c/0x110 [ptlrpc]
[83347.066544] [<ffffffffc0b737cc>] ldlm_export_cancel_locks+0x11c/0x130 [ptlrpc]
[83347.066566] [<ffffffffc0b9c9a8>] ldlm_bl_thread_main+0x4c8/0x700 [ptlrpc]
[83347.066567] [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
[83347.066588] [<ffffffffc0b9c4e0>] ? ldlm_handle_bl_callback+0x410/0x410 [ptlrpc]
[83347.066589] [<ffffffff810b098f>] kthread+0xcf/0xe0
[83347.066590] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[83347.066592] [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
[83347.066593] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40

After a few minutes, I decided to take a crash dump. It is available upon request. Attaching the associated vmcore-dmesg.txt as vmcore-dmesg-oak-md1-s1-2018-03-16-06-47-23.txt

Prior to un-mounting the MGS, I also noticed tons of 'lock timed out' errors on clients for the MGS (at 10.0.2.51@o2ib5):

[716108.759689] LustreError: 1392:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1521202950, 300s ago), entering recovery for MGS@MGC10.0.2.51@o2ib5_0 ns: MGC10.0.2.51@o2ib5 lock: ffff8803e8b51800/0x509cd0f71fde08cc lrc: 4/1,0 mode: --/CR res: [0x6b616f:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x2e0a3ddb63e79ade expref: -99 pid: 1392 timeout: 0 lvb_type: 0
[716108.760126] LustreError: 5988:0:(ldlm_resource.c:1100:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x6b616f:0x2:0x0].0x0 (ffff880402be2600) refcount nonzero (2) after lock cleanup; forcing cleanup.
[716108.760128] LustreError: 5988:0:(ldlm_resource.c:1100:ldlm_resource_complain()) Skipped 1 previous similar message
[716108.760130] LustreError: 5988:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x6b616f:0x2:0x0].0x0 (ffff880402be2600) refcount = 3
[716108.760132] LustreError: 5988:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
[716108.760137] LustreError: 5988:0:(ldlm_resource.c:1705:ldlm_resource_dump()) ### ### ns: MGC10.0.2.51@o2ib5 lock: ffff8803e8b51800/0x509cd0f71fde08cc lrc: 4/1,0 mode: --/CR res: [0x6b616f:0x2:0x0].0x0 rrc: 4 type: PLN flags: 0x1106400000000 nid: local remote: 0x2e0a3ddb63e79ade expref: -99 pid: 1392 timeout: 0 lvb_type: 0
[716108.760138] LustreError: 5988:0:(ldlm_resource.c:1705:ldlm_resource_dump()) Skipped 1 previous similar message
[716108.778087] LustreError: 1392:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
[716415.536505] LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 10.0.2.51@o2ib5) was lost; in progress operations using this service will fail
[716415.539195] LustreError: Skipped 1 previous similar message
[716415.540823] LustreError: 5999:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x6b616f:0x2:0x0].0x0 (ffff8800a2fb46c0) refcount = 2
[716415.543552] LustreError: 5999:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
[716415.544914] Lustre: MGC10.0.2.51@o2ib5: Connection restored to MGC10.0.2.51@o2ib5_0 (at 10.0.2.51@o2ib5)
[716415.546536] Lustre: Skipped 1 previous similar message
[716723.374375] LustreError: 1392:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1521203565, 300s ago), entering recovery for MGS@MGC10.0.2.51@o2ib5_0 ns: MGC10.0.2.51@o2ib5 lock: ffff8802383c6c00/0x509cd0f71fde2a67 lrc: 4/1,0 mode: --/CR res: [0x6b616f:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x2e0a3ddb63e8a325 expref: -99 pid: 1392 timeout: 0 lvb_type: 0
[716723.380318] LustreError: 1392:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
[716723.381110] LustreError: 6003:0:(ldlm_resource.c:1100:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x6b616f:0x2:0x0].0x0 (ffff8803aafc4900) refcount nonzero (2) after lock cleanup; forcing cleanup.
[716723.381112] LustreError: 6003:0:(ldlm_resource.c:1100:ldlm_resource_complain()) Skipped 1 previous similar message
[716723.381115] LustreError: 6003:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x6b616f:0x2:0x0].0x0 (ffff8803aafc4900) refcount = 3
[716723.381117] LustreError: 6003:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
[716723.381123] LustreError: 6003:0:(ldlm_resource.c:1705:ldlm_resource_dump()) ### ### ns: MGC10.0.2.51@o2ib5 lock: ffff8802383c6c00/0x509cd0f71fde2a67 lrc: 4/1,0 mode: --/CR res: [0x6b616f:0x2:0x0].0x0 rrc: 4 type: PLN flags: 0x1106400000000 nid: local remote: 0x2e0a3ddb63e8a325 expref: -99 pid: 1392 timeout: 0 lvb_type: 0
[716723.381124] LustreError: 6003:0:(ldlm_resource.c:1705:ldlm_resource_dump()) Skipped 1 previous similar message
[717033.536366] LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 10.0.2.51@o2ib5) was lost; in progress operations using this service will fail
[717033.539206] LustreError: Skipped 1 previous similar message
[717033.541034] Lustre: MGC10.0.2.51@o2ib5: Connection restored to MGC10.0.2.51@o2ib5_0 (at 10.0.2.51@o2ib5)
[717033.542492] Lustre: Skipped 1 previous similar message
[717340.609290] LustreError: 1392:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1521204182, 300s ago), entering recovery for MGS@MGC10.0.2.51@o2ib5_0 ns: MGC10.0.2.51@o2ib5 lock: ffff8803fd41c000/0x509cd0f71fde4c02 lrc: 4/1,0 mode: --/CR res: [0x6b616f:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x2e0a3ddb63e9adcd expref: -99 pid: 1392 timeout: 0 lvb_type: 0
[717340.615144] LustreError: 1392:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
[717340.617285] LustreError: 6017:0:(ldlm_resource.c:1100:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x6b616f:0x2:0x0].0x0 (ffff88039bd87540) refcount nonzero (1) after lock cleanup; forcing cleanup.
[717340.619996] LustreError: 6017:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x6b616f:0x2:0x0].0x0 (ffff88039bd87540) refcount = 2
[717340.621862] LustreError: 6017:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
[717340.623052] LustreError: 6017:0:(ldlm_resource.c:1705:ldlm_resource_dump()) ### ### ns: ?? lock: ffff8803fd41c000/0x509cd0f71fde4c02 lrc: 4/1,0 mode: --/CR res: ?? rrc=?? type: ??? flags: 0x1106400000000 nid: local remote: 0x2e0a3ddb63e9adcd expref: -99 pid: 1392 timeout: 0 lvb_type: 0
[717648.431191] LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 10.0.2.51@o2ib5) was lost; in progress operations using this service will fail
[717648.433742] LustreError: Skipped 1 previous similar message
[717648.434877] LustreError: 6028:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x6b616f:0x2:0x0].0x0 (ffff8803be8ce000) refcount = 2
[717648.436695] LustreError: 6028:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
[717648.438034] Lustre: MGC10.0.2.51@o2ib5: Connection restored to MGC10.0.2.51@o2ib5_0 (at 10.0.2.51@o2ib5)
[717648.439395] Lustre: Skipped 1 previous similar message

Thanks! |
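When sifting through many of these client-side messages, a small parser for the ldlm lock-dump fields helps group them by resource and remote handle. This sketch is written against the exact message format quoted above; no fields beyond those shown are assumed:

```python
import re

# Matches the ldlm lock-dump fields as printed in the messages above, e.g.
# "lock: ffff8803e8b51800/0x509cd0f71fde08cc ... mode: --/CR
#  res: [0x6b616f:0x2:0x0].0x0 ... remote: 0x2e0a3ddb63e79ade"
LOCK_RE = re.compile(
    r"lock: (?P<addr>\w+)/(?P<handle>0x[0-9a-f]+).*?"
    r"mode: (?P<mode>[\w-]+/[\w-]+).*?"
    r"res: (?P<res>\[[^\]]+\]\.0x[0-9a-f]+).*?"
    r"remote: (?P<remote>0x[0-9a-f]+)"
)

def parse_lock(line):
    """Extract lock fields from one ldlm dump line, or None if absent."""
    m = LOCK_RE.search(line)
    return m.groupdict() if m else None

# One of the actual messages from this comment:
msg = ("LustreError: 1392:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) "
       "### lock timed out (enqueued at 1521202950, 300s ago), entering recovery "
       "for MGS@MGC10.0.2.51@o2ib5_0 ns: MGC10.0.2.51@o2ib5 "
       "lock: ffff8803e8b51800/0x509cd0f71fde08cc lrc: 4/1,0 mode: --/CR "
       "res: [0x6b616f:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 "
       "nid: local remote: 0x2e0a3ddb63e79ade expref: -99 pid: 1392")
info = parse_lock(msg)
print(info["res"], info["mode"], info["remote"])
```

Feeding an entire dmesg through this and counting by `res`/`remote` quickly shows that the same ungranted CR lock on the MGS config resource is timing out over and over.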
| Comment by Emoly Liu [ 21/Mar/18 ] |
|
I'm checking the patch of Thanks, |
| Comment by Stephane Thiell [ 21/Mar/18 ] |
|
Hi Emoly, We're using ldiskfs, not ZFS. The crash occurs almost every time I stop the MGS, usually after at least a few days of production. But because it's a system in production, my actions are limited and I cannot test much right now. We're almost constantly redeploying clients for rolling updates, and the MGS has to be available. Still, I'll see what I can do. Also, what kind of information would help you (e.g. gathered just before unmounting)? Perhaps the output of /proc/fs/lustre/ldlm/namespaces/MGS/pool/stats as mentioned by Malcolm?
Also, as I said, I often see these 'lock timed out' errors on the clients (see my previous comment), which are related to the MGS. Thanks!! Stephane |
| Comment by Li Xi (Inactive) [ 23/May/18 ] |
|
I got a similar issue when running replay-dual test 26 on Lustre 2.10.4.
[ 4853.863850] Lustre: DEBUG MARKER: == replay-dual test 26: dbench and tar with mds failover ============================================= 19:10:39 (1527073839) |
| Comment by javed shaikh (Inactive) [ 23/Oct/20 ] |
|
Just for info: NCI is no longer following this ticket. OK to close from our perspective. |
| Comment by Malcolm Haak (Inactive) [ 29/Jul/21 ] |
|
Actually, it's not. We're having this issue again. |
| Comment by Emoly Liu [ 29/Jul/21 ] |
|
Could you please provide the Lustre version you're using this time, and any information about the testing environment and the operations that hit this issue? Thanks. |
| Comment by Malcolm Haak (Inactive) [ 30/Jul/21 ] |
|
Lustre 2.10.8 Server. Currently we see this issue on the MGS when unmounting. We are unmounting the MGS to deal with the fact that it has become unstable and is throwing the following errors:

Jul 26 11:47:54 g4-mds01 kernel: LNet: 40488:0:(o2iblnd_cb.c:3202:kiblnd_check_conns()) Timed out tx for 10.141.1.87@o2ib4: 18 seconds
Jul 26 11:48:36 g4-mds01 kernel: LustreError: 166-1: MGC10.141.1.41@o2ib4: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Jul 26 11:48:36 g4-mds01 kernel: LustreError: 208265:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1627263816, 300s ago), entering recovery for MGS@MGC10.141.1.41@o2ib4_0 ns: MGC10.141.1.41@o2ib4 lock: ffff9f7c79309c00/0xa65add00b40c1243 lrc: 4/1,0 mode: --/CR res: [0x346174616467:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0xa65add00b40c124a expref: -99 pid: 208265 timeout: 0 lvb_type: 0
Jul 26 11:48:36 g4-mds01 kernel: LustreError: 219839:0:(ldlm_resource.c:1101:ldlm_resource_complain()) MGC10.141.1.41@o2ib4: namespace resource [0x346174616467:0x2:0x0].0x0 (ffff9f745c79e3c0) refcount nonzero (2) after lock cleanup; forcing cleanup.
Jul 26 11:48:36 g4-mds01 kernel: LustreError: 219839:0:(ldlm_resource.c:1683:ldlm_resource_dump()) --- Resource: [0x346174616467:0x2:0x0].0x0 (ffff9f745c79e3c0) refcount = 3
Jul 26 11:48:36 g4-mds01 kernel: LustreError: 219839:0:(ldlm_resource.c:1704:ldlm_resource_dump()) Waiting locks:
Jul 26 11:48:36 g4-mds01 kernel: LustreError: 219839:0:(ldlm_resource.c:1706:ldlm_resource_dump()) ### ### ns: MGC10.141.1.41@o2ib4 lock: ffff9f7c79309c00/0xa65add00b40c1243 lrc: 4/1,0 mode: --/CR res: [0x346174616467:0x2:0x0].0x0 rrc: 4 type: PLN flags: 0x1106400000000 nid: local remote: 0xa65add00b40c124a expref: -99 pid: 208265 timeout: 0 lvb_type: 0

The workaround for the above issue is to umount/remount the MGS. Currently unmounting the MGS leads to a livelocked machine. |
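An editorial note that helps correlate these reports: the MGS packs the filesystem name into the 64-bit config-lock resource ID as little-endian ASCII (cf. mgc_fsname2resid() in the Lustre tree), so the resource IDs quoted in this ticket decode back to fsnames. A small illustration using the two IDs that appear above:

```python
def resid_to_fsname(resid):
    """Decode an MGS config-lock resource ID back to the fsname.

    Assumes the ID is the fsname packed as little-endian ASCII, as
    mgc_fsname2resid() does in the Lustre source.
    """
    n = (resid.bit_length() + 7) // 8
    return resid.to_bytes(n, "little").decode("ascii")

print(resid_to_fsname(0x346174616467))  # → gdata4 (this 2.10.8 report)
print(resid_to_fsname(0x6b616f))        # → oak (the original 2.10.2 report)
```

In both crashes it is therefore the per-filesystem MGS config lock, not some unrelated resource, that is stuck in `--/CR` mode when the MGS is stopped.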
| Comment by Emoly Liu [ 02/Aug/21 ] |
|
Could you please collect some Lustre logs on the MGS by the following steps and upload them here?
Thanks. |