[LU-17510] Client hung on ll_file_open - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.16.0, Lustre 2.15.5
Affects Version/s: Lustre 2.15.4
Labels: None
Environment:
Rocky 8.9 client:
- Lustre 2.15.4
- Kernel 4.18.0-513.11.1.el8_9.x86_64

vs. Lustre 2.12.6 server

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

Hi,

We have a Rocky 8.9 / Lustre 2.15.4 client which has trouble running a particular large single-node MPI application, when its input/output files are stored on a Lustre 2.12.6 filesystem. We didn't see this when the client was running Rocky 8.8 / Lustre 2.12.9.

The application hangs for a while shortly after startup, eventually terminating with an error. The application messages imply a failure during a Fortran OPEN or READ statement.

I see multiple messages such as the following in the client syslog:-

Feb  7 14:10:18 xxxx kernel: watchdog: BUG: soft lockup - CPU#93 stuck for 22s! [vasp_std:1029118]
Feb  7 14:10:18 xxxx kernel: Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) lov(OE) fld(OE) osc(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) libcfs(OE) 8021q garp mrp stp llc rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_umad intel_rapl_msr intel_rapl_common rdma_cm ib_ipoib iw_cm xfs amd64_edac_mod ib_cm edac_mce_amd amd_energy libcrc32c ipmi_ssif kvm dell_smbios wmi_bmof dell_wmi_descriptor irqbypass crct10dif_pclmul crc32_pclmul dcdbas ghash_clmulni_intel mlx5_ib rapl pcspkr ib_uverbs ib_core ccp sp5100_tco k10temp acpi_ipmi i2c_piix4 ipmi_si ptdma ipmi_devintf wmi ipmi_msghandler acpi_power_meter acpi_cpufreq ext4 mbcache jbd2 sd_mod t10_pi sg mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_shmem_helper mlx5_core ahci drm crc32c_intel libahci libata mlxfw pci_hyperv_intf tls tg3 psample dm_mirror dm_region_hash
Feb  7 14:10:18 xxxx kernel: dm_log dm_mod fuse
Feb  7 14:10:18 xxxx kernel: CPU: 93 PID: 1029118 Comm: vasp_std Kdump: loaded Tainted: G           OEL   --------- -  - 4.18.0-513.11.1.el8_9.x86_64 #1
Feb  7 14:10:18 xxxx kernel: Hardware name: Dell Inc. PowerEdge C6525/0978PJ, BIOS 2.12.4 07/26/2023
Feb  7 14:10:18 xxxx kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20
Feb  7 14:10:18 xxxx kernel: Code: c0 e9 33 09 00 00 b8 01 00 00 00 e9 29 09 00 00 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 c6 07 00 0f 1f 40 00 48 89 f7 57 9d <0f> 1f 44 00 00 e9 05 09 00 00 0f 1f 44 00 00 0f 1f 44 00 00 8b 07
Feb  7 14:10:18 xxxx kernel: RSP: 0018:ffffb26cec28fa70 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
Feb  7 14:10:18 xxxx kernel: RAX: 00000000feec4859 RBX: ffffa0efcc96fa60 RCX: dead000000000200
Feb  7 14:10:18 xxxx kernel: RDX: ffffb26cee9d37f8 RSI: 0000000000000202 RDI: 0000000000000202
Feb  7 14:10:18 xxxx kernel: RBP: 00000000feec4859 R08: ffffb26cee9d37f8 R09: 0000000000032940
Feb  7 14:10:18 xxxx kernel: R10: 000013cd01a8e0f8 R11: 0000000000000002 R12: 0000000000000202
Feb  7 14:10:18 xxxx kernel: R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
Feb  7 14:10:18 xxxx kernel: FS:  00007f3be4341940(0000) GS:ffffa106dfd40000(0000) knlGS:0000000000000000
Feb  7 14:10:18 xxxx kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb  7 14:10:18 xxxx kernel: CR2: 00000000005c2183 CR3: 00000028eddec000 CR4: 0000000000350ee0
Feb  7 14:10:18 xxxx kernel: Call Trace:
Feb  7 14:10:18 xxxx kernel: <IRQ>
Feb  7 14:10:18 xxxx kernel: ? watchdog_timer_fn.cold.10+0x46/0x9e
Feb  7 14:10:18 xxxx kernel: ? watchdog+0x30/0x30
Feb  7 14:10:18 xxxx kernel: ? __hrtimer_run_queues+0x101/0x280
Feb  7 14:10:18 xxxx kernel: ? hrtimer_interrupt+0x100/0x220
Feb  7 14:10:18 xxxx kernel: ? sched_clock+0x5/0x10
Feb  7 14:10:18 xxxx kernel: ? smp_apic_timer_interrupt+0x6a/0x130
Feb  7 14:10:18 xxxx kernel: ? apic_timer_interrupt+0xf/0x20
Feb  7 14:10:18 xxxx kernel: </IRQ>
Feb  7 14:10:18 xxxx kernel: ? _raw_spin_unlock_irqrestore+0x11/0x20
Feb  7 14:10:18 xxxx kernel: __wake_up_common_lock+0x89/0xc0
Feb  7 14:10:18 xxxx kernel: mdc_close+0x2ba/0x970 [mdc]
Feb  7 14:10:18 xxxx kernel: lmv_close+0x11d/0x2c0 [lmv]
Feb  7 14:10:18 xxxx kernel: ll_close_inode_openhandle+0x361/0xe20 [lustre]
Feb  7 14:10:18 xxxx kernel: ll_release_openhandle+0x2f8/0x400 [lustre]
Feb  7 14:10:18 xxxx kernel: ll_file_open+0x6c0/0xd40 [lustre]
Feb  7 14:10:18 xxxx kernel: ? ll_intent_file_open+0x960/0x960 [lustre]
Feb  7 14:10:18 xxxx kernel: do_dentry_open+0x143/0x3a0
Feb  7 14:10:18 xxxx kernel: path_openat+0x55b/0x1580
Feb  7 14:10:18 xxxx kernel: ? filemap_map_pages+0x271/0x410
Feb  7 14:10:18 xxxx kernel: ? alloc_set_pte+0xb8/0x3e0
Feb  7 14:10:18 xxxx kernel: do_filp_open+0x93/0x100
Feb  7 14:10:18 xxxx kernel: ? getname_flags+0x4a/0x1e0
Feb  7 14:10:18 xxxx kernel: ? __check_object_size+0xac/0x173
Feb  7 14:10:18 xxxx kernel: ? __alloc_fd+0x44/0x150
Feb  7 14:10:18 xxxx kernel: do_sys_openat2+0x211/0x2b0
Feb  7 14:10:18 xxxx kernel: do_sys_open+0x4b/0x80
Feb  7 14:10:18 xxxx kernel: do_syscall_64+0x5b/0x1b0
Feb  7 14:10:18 xxxx kernel: entry_SYSCALL_64_after_hwframe+0x61/0xc6
Feb  7 14:10:18 xxxx kernel: RIP: 0033:0x7f3be10e72a6
Feb  7 14:10:18 xxxx kernel: Code: 89 54 24 08 e8 9b f4 ff ff 8b 74 24 0c 48 8b 3c 24 41 89 c0 44 8b 54 24 08 b8 01 01 00 00 89 f2 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 77 30 44 89 c7 89 44 24 08 e8 c6 f4 ff ff 8b 44
Feb  7 14:10:18 xxxx kernel: RSP: 002b:00007ffdae754ef0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
Feb  7 14:10:18 xxxx kernel: RAX: ffffffffffffffda RBX: 0000000000080002 RCX: 00007f3be10e72a6
Feb  7 14:10:18 xxxx kernel: RDX: 0000000000080002 RSI: 0000000009dc3b50 RDI: 00000000ffffff9c
Feb  7 14:10:18 xxxx kernel: RBP: 0000000009dc3b50 R08: 0000000000000000 R09: 000000000942206c
Feb  7 14:10:18 xxxx kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00007ffdae755120
Feb  7 14:10:18 xxxx kernel: R13: 00007ffdae755650 R14: 0000000000080000 R15: 0000000000000000

Any ideas, please?


Issue Links

is related to:

LU-14741 Close RPC might get stuck behind normal RPCs waiting for slot (Resolved)
LU-15947 Spinlock contention during wake_up_all() in obd_put_mod_rpc_slot() (Resolved)
LU-17197 Performance regression with "LU-15947 obdclass: improve precision of wakeups for mod_rpcs" (Resolved)

Activity


Mark Dixon added a comment - 06/Mar/24 3:12 PM

Thanks all!

I'm trying to figure out what to do now and was hoping someone could offer advice. Am I playing with fire if I cherry pick onto 2.15.4? Or do I need to wait for a release?

This set appears to apply cleanly, apart from a tweak to LU-15947 because LU-16231 had already landed:

5243630b09 LU-15947 obdclass: improve precision of wakeups for mod_rpcs
91a3726f31 LU-16633 obdclass: fix rpc slot leakage
b5fde4d6c0 LU-17197 obdclass: preserve fairness when waiting for rpc slot
(plus Neil's patch for LU-17510)
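For anyone trying the same backport, the mechanics could be sketched roughly as below. This is only an illustration of applying a commit series in order, not an answer to whether the backport is safe. The function name `apply_series` and the branch name are hypothetical, and the sketch stops at the first conflict rather than trying to resolve it.

```shell
# Sketch: create a work branch from a base ref and cherry-pick commits in order.
# Usage: apply_series <base-ref> <new-branch> <commit>...
# apply_series is a hypothetical helper name, not part of git or Lustre.
apply_series() {
    base=$1; branch=$2; shift 2
    # Start a fresh branch at the base release.
    git checkout -q -b "$branch" "$base" || return 1
    for commit in "$@"; do
        # -x records the original commit ID in the new commit message.
        if ! git cherry-pick -x "$commit" >/dev/null; then
            echo "conflict applying $commit; resolve it, then run: git cherry-pick --continue" >&2
            return 1
        fi
    done
}
```

Assuming the release tag is named 2.15.4 in the local clone, the set above would be applied with `apply_series 2.15.4 lu17510-backport 5243630b09 91a3726f31 b5fde4d6c0`. Given the tweak needed for LU-15947, the function would be expected to stop at that commit for manual resolution. Neil's patch for LU-17510 would have to be fetched from Gerrit separately.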


Shaun Tancheff added a comment - 05/Mar/24 2:27 PM

That is good news. I think the preference is to continue with Neil's patch.


Mark Dixon added a comment - 05/Mar/24 10:31 AM

Apologies, I misread things as I'm still getting to grips with Gerrit. In fact, cherry-picking either Shaun's or Neil's patchset on top of 2.15.61 fixes things, allowing the application to start and run normally.


Andreas Dilger added a comment - 04/Mar/24 4:22 PM

"I can confirm that adding the above patch ... fixes the problem"

Could you please clarify which patch you are referring to, now that there are two involved?

"the client shows a couple of other issues on our system"

They should be filed as separate issues against 2.16.0.


Mark Dixon added a comment - 04/Mar/24 4:02 PM

Hi all,

I've collected the kernel debug log as Andreas asked, but I don't know whether it's still helpful. Should I post it?

Happily, I can confirm that adding the above patch on top of 2.15.58_150_gb5fde4d or 2.15.61 fixes the problem. Unfortunately, LU-15947 + LU-17197 + this new patch won't apply cleanly onto 2.15.4. Is it possible/sensible to attempt that?

Re: revalidate FID messages, sometimes I'd Ctrl-C the application myself a while after it had hung, so it's good to know that this was the cause and that I can discount them.

Ignoring the original problem for a second, 2.15.61 on the client shows a couple of other issues on our system that I cannot see in Jira (ls provoking a "Wrong buffer for field 'batch_update_reply'" message, and a kernel panic on umount). Should I be logging these as issues against 2.15.61, or ignoring them because it isn't a "proper" release?

Thanks again,

Mark


Neil Brown added a comment - 04/Mar/24 3:10 AM

Above patch addresses precisely the problem observed. I think it is very likely to fix the problem.

If that could be tested instead, I would appreciate it.


Gerrit Updater added a comment - 04/Mar/24 2:29 AM

"Neil Brown <neilb@suse.de>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54259
Subject: LU-17510 obdclass: fix wake up when queuing close request.
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d430ddbb2414b26da69070fa35c265c9be2f497c


Andreas Dilger added a comment - 04/Mar/24 2:05 AM

bodgerer, if you are able to reproduce this issue, does Shaun's patch above fix the problem you are seeing?


Shaun Tancheff added a comment - 02/Mar/24 5:51 AM

Also related to LU-13993 originally for 2.12 and LU-15947 for 2.15.
I have restored:
https://review.whamcloud.com/c/fs/lustre-release/+/47634
and pushed it with "LU-15947 obdclass: improve precision of wakeups for mod_rpcs" removed.

We have seen a deadlock that prevented us from continuing with "LU-15947 obdclass: improve precision of wakeups for mod_rpcs"; however, we have not had a similar deadlock on master before now.


Andreas Dilger added a comment - 01/Mar/24 8:54 PM

Hi Mark, thanks for tracking this down.

Based on your original stack trace, it looks like there is some kind of error during the open, and then it gets hung trying to do the cleanup:

int ll_file_open(struct inode *inode, struct file *file)
{
        :
        :
        mutex_lock(&lli->lli_och_mutex);
        if (*och_p) { /* Open handle is present */
                if (it_disposition(it, DISP_OPEN_OPEN)) {
                        /* Well, there's extra open request that we do not need,
                         * let's close it somehow. This will decref request. */
                        rc = it_open_error(DISP_OPEN_OPEN, it);
                        if (rc) {
                                mutex_unlock(&lli->lli_och_mutex);
                                GOTO(out_openerr, rc);
                        }

                        ll_release_openhandle(file_dentry(file), it);
                }

The "ll_revalidate_fid() error: rc -4" message would indicate that the open thread was killed/interrupted by a signal (e.g. CTRL-C, "kill -N", or maybe SIGALRM or similar). Are you able to attach a full kernel debug log to the ticket for analysis? By default it will contain the filenames, process names/PIDs, and IP addresses of the nodes, but those could be redacted if necessary, as I see you've already done in the log messages above (as long as it is done in some consistent manner that still allows identifying the nodes involved). Something like:

client# lctl set_param debug=all debug_mb=1024
client# lctl clear
client# [ run reproducer until hung ]
client# lctl dk /tmp/debug-lu17510.txt

then compress the debug-lu17510.txt file and attach here.
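If redaction is needed before attaching, one way to keep it consistent is to map each distinct address to a stable placeholder, so the same node always gets the same label. The following is a minimal sketch under stated assumptions: `redact_ips` is a hypothetical helper name, the `IPn` placeholder format is made up for illustration, and it only handles dotted IPv4 addresses (filenames and process names would need separate treatment).

```shell
# Sketch: replace each distinct IPv4 address on stdin with a stable
# placeholder (IP1, IP2, ...), so repeated addresses stay identifiable.
# redact_ips is a hypothetical helper, not a Lustre tool.
redact_ips() {
    awk '
    {
        out = ""
        rest = $0
        # Consume the line left to right, one address at a time.
        while (match(rest, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/)) {
            ip = substr(rest, RSTART, RLENGTH)
            # Assign the next placeholder the first time an address is seen.
            if (!(ip in map)) map[ip] = "IP" (++n)
            out = out substr(rest, 1, RSTART - 1) map[ip]
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }'
}
```

It would be run as a filter over the dumped log, for example `redact_ips < /tmp/debug-lu17510.txt > /tmp/debug-lu17510.redacted.txt` (the output filename here is arbitrary). Because the mapping lives only for one invocation, all files that need to be cross-referenced should be passed through a single run.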

I wasn't directly involved in the referenced changes, but I've CC'd the parties involved and hopefully they can take a look at this issue. Strangely, the above-mentioned patches should avoid hung threads, but clearly something is amiss, or your application is doing something unusual (e.g. using SIGALRM to kill itself if it is waiting on an open) that isn't done in our normal testing and usage, which has exercised this codepath many millions (probably billions) of times since those patches were landed.


Mark Dixon added a comment - 01/Mar/24 5:12 PM - edited

Hi Andreas,

Apologies for the delay, it took a while to get a suitably old client stood up to do as you suggested. The TL;DR version is that our application is broken by LU-14741, subsequently fixed by LU-15947, but broken again by LU-17197.

Detail:

Using CentOS 8.0 with kernel 4.18.0-80.11.2.el8_0.x86_64 and doing a git bisect 2.12.9..2.15.4, this problem first shows up in 2.14.52_32_ga4e1567 with commit:

a4e1567d67 LU-14741 obdclass: Wake up entire queue of requests on close completion
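For reference, a bisect like this can be driven automatically with `git bisect run`. The sketch below is a generic illustration rather than the exact procedure used here: `bisect_first_bad` is a hypothetical helper name, and the test command (which in this case would have to build the client, mount the filesystem and run the application) is site-specific and left to the caller.

```shell
# Sketch: print the first commit in <good>..<bad> for which a test command fails.
# Usage: bisect_first_bad <good-ref> <bad-ref> <test-command> [args...]
# The test command must exit 0 for a good commit, 125 to skip an untestable
# commit, and any other code from 1 to 127 for a bad commit.
# bisect_first_bad is a hypothetical helper name.
bisect_first_bad() {
    good=$1; bad=$2; shift 2
    git bisect start "$bad" "$good" >/dev/null || return 1
    git bisect run "$@" >/dev/null
    # Once the search has converged, refs/bisect/bad names the first bad commit.
    first_bad=$(git rev-parse --verify -q refs/bisect/bad)
    # Return the working tree to where it was before bisecting.
    git bisect reset >/dev/null 2>&1
    [ -n "$first_bad" ] && printf '%s\n' "$first_bad"
}
```

With a hypothetical wrapper script `./build-and-test.sh` that returns non-zero when the application hangs, the search over the range above would be `bisect_first_bad 2.12.9 2.15.4 ./build-and-test.sh`, assuming the release tags carry those names in the local clone.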

Checking to see if this was fixed in 2.15 development and just hadn't made it into a release yet, I found that the problem is fixed in 2.15.52_106_g5243630, with our application running normally at commit:

5243630b09 LU-15947 obdclass: improve precision of wakeups for mod_rpcs

However, our application doesn't work with 2.15.61. It seems the above commit created a 25% hit in an mdtest benchmark, which resulted in a subsequent fix in 2.15.58_150_gb5fde4d with commit:

b5fde4d6c0 LU-17197 obdclass: preserve fairness when waiting for rpc slot

Our application is broken by LU-17197. It hangs quite early on at startup, and the following messages appear in syslog, sometimes along with an eviction:

Mar 1 17:00:50 xxxxx kernel: LustreError: 19383:0:(file.c:5373:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200019b0a:0x2:0x0] error: rc = -4
Mar 1 17:00:50 xxxxx kernel: LustreError: 19383:0:(file.c:5373:ll_inode_revalidate_fini()) Skipped 7 previous similar messages
Mar 1 17:01:15 xxxxx kernel: LustreError: 11-0: lustre-MDT0000-mdc-ffff9cd9dcdf4800: operation ldlm_enqueue to node 172.18.16.22@o2ib failed: rc = -107
Mar 1 17:01:15 xxxxx kernel: Lustre: lustre-MDT0000-mdc-ffff9cd9dcdf4800: Connection to lustre-MDT0000 (at 172.18.16.22@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Mar 1 17:01:15 xxxxx kernel: LustreError: 167-0: lustre-MDT0000-mdc-ffff9cd9dcdf4800: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
Mar 1 17:01:15 xxxxx kernel: LustreError: 18854:0:(file.c:5373:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200019b0a:0x2:0x0] error: rc = -5
Mar 1 17:01:15 xxxxx kernel: LustreError: 18854:0:(file.c:5373:ll_inode_revalidate_fini()) Skipped 120 previous similar messages
Mar 1 17:01:15 xxxxx kernel: LustreError: 19023:0:(file.c:246:ll_close_inode_openhandle()) lustre-clilmv-ffff9cd9dcdf4800: inode [0x200019b0a:0x2:0x0] mdc close failed: rc = -108
Mar 1 17:01:15 xxxxx kernel: Lustre: lustre-MDT0000-mdc-ffff9cd9dcdf4800: Connection restored to 172.18.16.22@o2ib (at 172.18.16.22@o2ib)

Can you advise, please?

Thanks,

Mark


People

Assignee: Shaun Tancheff
Reporter: Mark Dixon
Votes: 0
Watchers: 8

Dates

Created: 07/Feb/24 2:52 PM
Updated: 26/Nov/24 3:00 AM
Resolved: 23/Mar/24 3:44 PM