Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0, Lustre 2.15.5
    • Affects Version/s: Lustre 2.15.4
    • Labels: None
    • Environment: Rocky 8.9 client:
      - Lustre 2.15.4
      - Kernel 4.18.0-513.11.1.el8_9.x86_64

      vs. Lustre 2.12.6 server
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      Hi,

      We have a Rocky 8.9 / Lustre 2.15.4 client which has trouble running a particular large single-node MPI application, when its input/output files are stored on a Lustre 2.12.6 filesystem. We didn't see this when the client was running Rocky 8.8 / Lustre 2.12.9.

      The application hangs for a while shortly after startup, eventually terminating with an error. The application messages imply a failure during a Fortran OPEN or READ statement.

      I see multiple messages such as the following in the client syslog:-

      Feb  7 14:10:18 xxxx kernel: watchdog: BUG: soft lockup - CPU#93 stuck for 22s! [vasp_std:1029118]
      Feb  7 14:10:18 xxxx kernel: Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) lov(OE) fld(OE) osc(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) libcfs(OE) 8021q garp mrp stp llc rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_umad intel_rapl_msr intel_rapl_common rdma_cm ib_ipoib iw_cm xfs amd64_edac_mod ib_cm edac_mce_amd amd_energy libcrc32c ipmi_ssif kvm dell_smbios wmi_bmof dell_wmi_descriptor irqbypass crct10dif_pclmul crc32_pclmul dcdbas ghash_clmulni_intel mlx5_ib rapl pcspkr ib_uverbs ib_core ccp sp5100_tco k10temp acpi_ipmi i2c_piix4 ipmi_si ptdma ipmi_devintf wmi ipmi_msghandler acpi_power_meter acpi_cpufreq ext4 mbcache jbd2 sd_mod t10_pi sg mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_shmem_helper mlx5_core ahci drm crc32c_intel libahci libata mlxfw pci_hyperv_intf tls tg3 psample dm_mirror dm_region_hash
      Feb  7 14:10:18 xxxx kernel: dm_log dm_mod fuse
      Feb  7 14:10:18 xxxx kernel: CPU: 93 PID: 1029118 Comm: vasp_std Kdump: loaded Tainted: G           OEL   --------- -  - 4.18.0-513.11.1.el8_9.x86_64 #1
      Feb  7 14:10:18 xxxx kernel: Hardware name: Dell Inc. PowerEdge C6525/0978PJ, BIOS 2.12.4 07/26/2023
      Feb  7 14:10:18 xxxx kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20
      Feb  7 14:10:18 xxxx kernel: Code: c0 e9 33 09 00 00 b8 01 00 00 00 e9 29 09 00 00 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 c6 07 00 0f 1f 40 00 48 89 f7 57 9d <0f> 1f 44 00 00 e9 05 09 00 00 0f 1f 44 00 00 0f 1f 44 00 00 8b 07
      Feb  7 14:10:18 xxxx kernel: RSP: 0018:ffffb26cec28fa70 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
      Feb  7 14:10:18 xxxx kernel: RAX: 00000000feec4859 RBX: ffffa0efcc96fa60 RCX: dead000000000200
      Feb  7 14:10:18 xxxx kernel: RDX: ffffb26cee9d37f8 RSI: 0000000000000202 RDI: 0000000000000202
      Feb  7 14:10:18 xxxx kernel: RBP: 00000000feec4859 R08: ffffb26cee9d37f8 R09: 0000000000032940
      Feb  7 14:10:18 xxxx kernel: R10: 000013cd01a8e0f8 R11: 0000000000000002 R12: 0000000000000202
      Feb  7 14:10:18 xxxx kernel: R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
      Feb  7 14:10:18 xxxx kernel: FS:  00007f3be4341940(0000) GS:ffffa106dfd40000(0000) knlGS:0000000000000000
      Feb  7 14:10:18 xxxx kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      Feb  7 14:10:18 xxxx kernel: CR2: 00000000005c2183 CR3: 00000028eddec000 CR4: 0000000000350ee0
      Feb  7 14:10:18 xxxx kernel: Call Trace:
      Feb  7 14:10:18 xxxx kernel: <IRQ>
      Feb  7 14:10:18 xxxx kernel: ? watchdog_timer_fn.cold.10+0x46/0x9e
      Feb  7 14:10:18 xxxx kernel: ? watchdog+0x30/0x30
      Feb  7 14:10:18 xxxx kernel: ? __hrtimer_run_queues+0x101/0x280
      Feb  7 14:10:18 xxxx kernel: ? hrtimer_interrupt+0x100/0x220
      Feb  7 14:10:18 xxxx kernel: ? sched_clock+0x5/0x10
      Feb  7 14:10:18 xxxx kernel: ? smp_apic_timer_interrupt+0x6a/0x130
      Feb  7 14:10:18 xxxx kernel: ? apic_timer_interrupt+0xf/0x20
      Feb  7 14:10:18 xxxx kernel: </IRQ>
      Feb  7 14:10:18 xxxx kernel: ? _raw_spin_unlock_irqrestore+0x11/0x20
      Feb  7 14:10:18 xxxx kernel: __wake_up_common_lock+0x89/0xc0
      Feb  7 14:10:18 xxxx kernel: mdc_close+0x2ba/0x970 [mdc]
      Feb  7 14:10:18 xxxx kernel: lmv_close+0x11d/0x2c0 [lmv]
      Feb  7 14:10:18 xxxx kernel: ll_close_inode_openhandle+0x361/0xe20 [lustre]
      Feb  7 14:10:18 xxxx kernel: ll_release_openhandle+0x2f8/0x400 [lustre]
      Feb  7 14:10:18 xxxx kernel: ll_file_open+0x6c0/0xd40 [lustre]
      Feb  7 14:10:18 xxxx kernel: ? ll_intent_file_open+0x960/0x960 [lustre]
      Feb  7 14:10:18 xxxx kernel: do_dentry_open+0x143/0x3a0
      Feb  7 14:10:18 xxxx kernel: path_openat+0x55b/0x1580
      Feb  7 14:10:18 xxxx kernel: ? filemap_map_pages+0x271/0x410
      Feb  7 14:10:18 xxxx kernel: ? alloc_set_pte+0xb8/0x3e0
      Feb  7 14:10:18 xxxx kernel: do_filp_open+0x93/0x100
      Feb  7 14:10:18 xxxx kernel: ? getname_flags+0x4a/0x1e0
      Feb  7 14:10:18 xxxx kernel: ? __check_object_size+0xac/0x173
      Feb  7 14:10:18 xxxx kernel: ? __alloc_fd+0x44/0x150
      Feb  7 14:10:18 xxxx kernel: do_sys_openat2+0x211/0x2b0
      Feb  7 14:10:18 xxxx kernel: do_sys_open+0x4b/0x80
      Feb  7 14:10:18 xxxx kernel: do_syscall_64+0x5b/0x1b0
      Feb  7 14:10:18 xxxx kernel: entry_SYSCALL_64_after_hwframe+0x61/0xc6
      Feb  7 14:10:18 xxxx kernel: RIP: 0033:0x7f3be10e72a6
      Feb  7 14:10:18 xxxx kernel: Code: 89 54 24 08 e8 9b f4 ff ff 8b 74 24 0c 48 8b 3c 24 41 89 c0 44 8b 54 24 08 b8 01 01 00 00 89 f2 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 77 30 44 89 c7 89 44 24 08 e8 c6 f4 ff ff 8b 44
      Feb  7 14:10:18 xxxx kernel: RSP: 002b:00007ffdae754ef0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
      Feb  7 14:10:18 xxxx kernel: RAX: ffffffffffffffda RBX: 0000000000080002 RCX: 00007f3be10e72a6
      Feb  7 14:10:18 xxxx kernel: RDX: 0000000000080002 RSI: 0000000009dc3b50 RDI: 00000000ffffff9c
      Feb  7 14:10:18 xxxx kernel: RBP: 0000000009dc3b50 R08: 0000000000000000 R09: 000000000942206c
      Feb  7 14:10:18 xxxx kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00007ffdae755120
      Feb  7 14:10:18 xxxx kernel: R13: 00007ffdae755650 R14: 0000000000080000 R15: 0000000000000000
      

      Any ideas, please?

          Activity

            [LU-17510] Client hung on ll_file_open
            bodgerer Mark Dixon added a comment -

            Thanks all!

            I'm trying to figure out what to do now and was hoping someone could offer advice. Am I playing with fire if I cherry pick onto 2.15.4? Or do I need to wait for a release?

            This set appears to apply cleanly, apart from a tweak to LU-15947 because LU-16231 had already landed:

            5243630b09 LU-15947 obdclass: improve precision of wakeups for mod_rpcs
            91a3726f31 LU-16633 obdclass: fix rpc slot leakage
            b5fde4d6c0 LU-17197 obdclass: preserve fairness when waiting for rpc slot
            (plus Neil's patch for LU-17510)
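
For readers following along, the backport described above might be scripted roughly as below. This is only a sketch and not an endorsement of running a patched 2.15.4 in production. The commit IDs are the ones quoted in this comment; the base tag name `2.15.4`, the branch name, the Gerrit fetch URL, and the use of patch set 1 of change 54259 are assumptions, and the helper names `run` and `backport_lu17510` are hypothetical. As noted above, the LU-15947 commit needs a manual tweak, so the loop stops at the first conflict.

```shell
#!/bin/sh
# Sketch only. With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# backport_lu17510 [BASE_TAG] [BRANCH]
# Both defaults are assumptions, not values confirmed in the ticket.
backport_lu17510() {
    base=${1:-2.15.4}
    branch=${2:-b2_15_4-lu17510}
    # Assumed anonymous Gerrit fetch URL; override with GERRIT_URL if it differs.
    gerrit=${GERRIT_URL:-https://review.whamcloud.com/fs/lustre-release}

    run git checkout -b "$branch" "$base" || return 1

    # Commit IDs as listed in the comment, in the order given there.
    for commit in 5243630b09 91a3726f31 b5fde4d6c0; do
        if ! run git cherry-pick -x "$commit"; then
            echo "cherry-pick of $commit stopped: resolve by hand, then run 'git cherry-pick --continue'" >&2
            return 1
        fi
    done

    # Neil's LU-17510 change (Gerrit change 54259). Patch set 1 is assumed
    # here; a later patch set may be the one that actually landed.
    run git fetch "$gerrit" refs/changes/59/54259/1 || return 1
    run git cherry-pick -x FETCH_HEAD
}
```

Running `DRY_RUN=1 backport_lu17510` first prints the git commands without touching the working tree, which is a cheap way to check the plan before applying it.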
            

            stancheff Shaun Tancheff added a comment -

            That is good news. I think the preference is to continue with Neil's patch.

            bodgerer Mark Dixon added a comment -

            Apologies, I misread, as I'm still getting to grips with Gerrit. In fact, cherry-picking either Shaun's or Neil's patchset on top of 2.15.61 fixes things, allowing the application to start and run normally.


            adilger Andreas Dilger added a comment -

            "I can confirm that adding the above patch ... fixes the problem"

            Could you please clarify which patch you are referring to, now that there are two involved?

            "the client show a couple of other issues on our system"

            They should be filed as separate issues against 2.16.0.

            bodgerer Mark Dixon added a comment -

            Hi all,

            I've collected the kernel debug log as Andreas asked, but I don't know whether it's still helpful. Should I post it?

            Happily, I can confirm that adding the above patch on top of 2.15.58_150_gb5fde4d or 2.15.61 fixes the problem. Unfortunately, LU-15947+LU-17197+this new patch won't apply cleanly onto 2.15.4 - is it possible/sensible to attempt that?

            Re: revalidate FID messages, sometimes I'd Ctrl-C the application myself a while after it had hung, so it's good to know that this was the cause and that I can discount them.

            Ignoring the original problem for a second, 2.15.61 on the client show a couple of other issues on our system that I cannot see in Jira (ls provoking a "Wrong buffer for field 'batch_update_reply'" message, and kernel panic on umount). Should I be logging these as issues against 2.15.61, or ignoring them because it isn't a "proper" release?

            Thanks again,

            Mark

            neilb Neil Brown added a comment -

            Above patch addresses precisely the problem observed.  I think it is very likely to fix the problem.

            If that could be tested instead, I would appreciate it.

             

            gerrit Gerrit Updater added a comment -

            "Neil Brown <neilb@suse.de>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54259
            Subject: LU-17510 obdclass: fix wake up when queuing close request.
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: d430ddbb2414b26da69070fa35c265c9be2f497c

            adilger Andreas Dilger added a comment -

            bodgerer, if you are able to reproduce this issue, does Shaun's patch above fix the problem you are seeing?

            stancheff Shaun Tancheff added a comment -

            Also related to LU-13993 (originally for 2.12) and LU-15947 (for 2.15).
            I have restored:
            https://review.whamcloud.com/c/fs/lustre-release/+/47634
            and pushed it with "LU-15947 obdclass: improve precision of wakeups for mod_rpcs" removed.

            We have seen a deadlock that prevented us from continuing with "LU-15947 obdclass: improve precision of wakeups for mod_rpcs"; however, we have not had a similar deadlock on master before now.

            adilger Andreas Dilger added a comment -

            Hi Mark, thanks for tracking this down.

            Based on your original stack trace, it looks like there is some kind of error during the open, and then it gets hung trying to do the cleanup:

            int ll_file_open(struct inode *inode, struct file *file)
            {
                    :
                    :
                    mutex_lock(&lli->lli_och_mutex);
                    if (*och_p) { /* Open handle is present */
                            if (it_disposition(it, DISP_OPEN_OPEN)) {
                                    /* Well, there's extra open request that we do not need,
                                     * let's close it somehow. This will decref request. */
                                    rc = it_open_error(DISP_OPEN_OPEN, it);
                                    if (rc) {
                                            mutex_unlock(&lli->lli_och_mutex);
                                            GOTO(out_openerr, rc);
                                    }

                                    ll_release_openhandle(file_dentry(file), it);
                            }

            The "ll_revalidate_fid() error: rc -4" message would indicate that the open thread was killed/interrupted by a signal (e.g. CTRL-C, "kill -N", or maybe sigalarm() or similar)? Are you able to attach a full kernel debug log to the ticket for analysis? By default it will contain the filenames, process names/PIDs, and IP addresses of the nodes, but those could be redacted if necessary, as I see you've already done in the log messages above (if done in some consistent manner that still allows identifying the nodes involved). Something like:

            client# lctl set_param debug=all debug_mb=1024
            client# lctl clear
            client# [ run reproducer until hung ]
            client# lctl dk /tmp/debug-lu17510.txt

            then compress the debug-lu17510.txt file and attach here.

            I wasn't directly involved in the referenced changes, but I've CC'd the parties involved and hopefully they can take a look at this issue. Strangely, the above-mentioned patches should avoid hung threads, but clearly something is amiss, or your application is doing something unusual (e.g. sigalarm to kill itself if it is waiting on an open) that isn't done in our normal testing and usage, which has exercised this codepath many millions (probably billions) of times since those patches were landed.
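
The debug-log steps Andreas lists above could be wrapped in two small shell helpers, sketched below. The `lctl` invocations and the output path are taken directly from his comment; the helper names (`run`, `debug_start`, `debug_dump`), the `DRY_RUN` switch, and the choice of `gzip` for the "compress" step are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only. With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# debug_start: enable all debug flags with a 1024 MB buffer, then clear
# whatever is already in the kernel debug buffer.
debug_start() {
    run lctl set_param debug=all debug_mb=1024 &&
    run lctl clear
}

# debug_dump [FILE]: dump the kernel debug buffer to FILE and compress it.
# gzip is an assumption; the comment only says "compress".
debug_dump() {
    out=${1:-/tmp/debug-lu17510.txt}
    run lctl dk "$out" &&
    run gzip -f "$out"
}
```

A typical session on the client would be `debug_start`, then running the reproducer until it hangs, then `debug_dump` (both as root). The resulting `.gz` file is what would be attached to the ticket, after any consistent redaction.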
            bodgerer Mark Dixon added a comment - - edited

            Hi Andreas,

            Apologies for the delay, it took a while to get a suitably old client stood up to do as you suggested. The TL;DR version is that our application is broken by LU-14741, subsequently fixed by LU-15947, but broken again by LU-17197.

            Detail:

            Using CentOS 8.0 with kernel 4.18.0-80.11.2.el8_0.x86_64 and doing a git bisect 2.12.9..2.15.4, this problem first shows up in 2.14.52_32_ga4e1567 with commit:

               a4e1567d67 LU-14741 obdclass: Wake up entire queue of requests on close completion

            Checking to see if this was fixed in 2.15 development and just hadn't made it into a release yet, I found that the problem is fixed in 2.15.52_106_g5243630, with our application running normally at commit:

               5243630b09 LU-15947 obdclass: improve precision of wakeups for mod_rpcs

            However, our application doesn't work with 2.15.61. It seems the above commit created a 25% hit in an mdtest benchmark, which resulted in a subsequent fix in 2.15.58_150_gb5fde4d with commit:

               b5fde4d6c0 LU-17197 obdclass: preserve fairness when waiting for rpc slot

            Our application is broken by LU-17197. It hangs quite early on at startup and the following messages appear in syslog, sometimes along with an eviction:

               Mar  1 17:00:50 xxxxx kernel: LustreError: 19383:0:(file.c:5373:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200019b0a:0x2:0x0] error: rc = -4
               Mar  1 17:00:50 xxxxx kernel: LustreError: 19383:0:(file.c:5373:ll_inode_revalidate_fini()) Skipped 7 previous similar messages
               Mar  1 17:01:15 xxxxx kernel: LustreError: 11-0: lustre-MDT0000-mdc-ffff9cd9dcdf4800: operation ldlm_enqueue to node 172.18.16.22@o2ib failed: rc = -107
               Mar  1 17:01:15 xxxxx kernel: Lustre: lustre-MDT0000-mdc-ffff9cd9dcdf4800: Connection to lustre-MDT0000 (at 172.18.16.22@o2ib) was lost; in progress operations using this service will wait for recovery to complete
               Mar  1 17:01:15 xxxxx kernel: LustreError: 167-0: lustre-MDT0000-mdc-ffff9cd9dcdf4800: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
               Mar  1 17:01:15 xxxxx kernel: LustreError: 18854:0:(file.c:5373:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200019b0a:0x2:0x0] error: rc = -5
               Mar  1 17:01:15 xxxxx kernel: LustreError: 18854:0:(file.c:5373:ll_inode_revalidate_fini()) Skipped 120 previous similar messages
               Mar  1 17:01:15 xxxxx kernel: LustreError: 19023:0:(file.c:246:ll_close_inode_openhandle()) lustre-clilmv-ffff9cd9dcdf4800: inode [0x200019b0a:0x2:0x0] mdc close failed: rc = -108
               Mar  1 17:01:15 xxxxx kernel: Lustre: lustre-MDT0000-mdc-ffff9cd9dcdf4800: Connection restored to 172.18.16.22@o2ib (at 172.18.16.22@o2ib)

             

            Can you advise, please?

            Thanks,

            Mark
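
The bisection described in this comment can be driven with a couple of thin wrappers around `git bisect`, sketched below. The good/bad endpoints `2.12.9` and `2.15.4` come from the comment, but whether those exact strings exist as tag names in the repository is an assumption, and the helper names (`run`, `bisect_start`, `bisect_mark`) and the `DRY_RUN` switch are hypothetical. Building, installing, and testing the client at each step is left to the operator.

```shell
#!/bin/sh
# Sketch only. With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# bisect_start [BAD] [GOOD]: begin bisecting between a known-bad and a
# known-good revision (defaults follow the range quoted in the comment).
bisect_start() {
    bad=${1:-2.15.4}
    good=${2:-2.12.9}
    run git bisect start "$bad" "$good"
}

# bisect_mark hung|ok: record the outcome of testing the build that git
# currently has checked out.
bisect_mark() {
    case $1 in
        hung) run git bisect bad ;;
        ok)   run git bisect good ;;
        *)    echo "usage: bisect_mark hung|ok" >&2; return 2 ;;
    esac
}
```

After `bisect_start`, each round is: build and install the client from the checked-out tree, run the application, then call `bisect_mark hung` or `bisect_mark ok`. Once git names the first bad commit, `git bisect reset` returns the tree to where it started.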


            People

              Assignee: stancheff Shaun Tancheff
              Reporter: bodgerer Mark Dixon
              Votes: 0
              Watchers: 8
