Details

    • Bug
    • Resolution: Cannot Reproduce
    • Critical
    • Lustre 2.9.0
    • None
    • lola
      build: 2.7.62-40-gebda41d, ebda41d8de7956f19fd27f86208c668e43c6957c + patches
    • 3
    • 9223372036854775807

    Description

      Error occurred during soak testing of build '20151112' (see: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20151112). DNE is enabled. OST had been formatted using zfs, MDTs using ldiskfs as backend.

      Approximately 20% of the jobs executed during soak testing fail with the following typical error shown below. The error selected for the eviction (8 jobs affected) reads as:

      lola-2.log:Nov 13 01:18:11 lola-2 kernel: LustreError: 0:0:(ldlm_lockd.c:342:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.1.133@o2ib100  ns: filter-soaked-OST0000_UUID lock: ffff8801cb9ded00/0x4823861a3a714ead lrc: 3/0,0 mode: PR/PR res: [0x2d2261b:0x0:0x0].0x0 rrc: 5 type: EXT [105534627840->18446744073709551615] (req 105534627840->105556344831) flags: 0x60000000030020 nid: 192.168.1.133@o2ib100 remote: 0x1b74614c7d18de8f expref: 20 pid: 9256 timeout: 4311080944 lvb_type: 0
      
      lola-31.log:Nov 13 01:18:12 lola-31 kernel: LustreError: 11-0: soaked-OST0000-osc-ffff8808657d2800: op
      eration ost_read to node 192.168.1.102@o2ib10 failed: rc = -107
      lola-31.log:Nov 13 01:18:12 lola-31 kernel: Lustre: soaked-OST0000-osc-ffff8808657d2800: Connection to
       soaked-OST0000 (at 192.168.1.102@o2ib10) was lost; in progress operations using this service will wai
      t for recovery to complete
      lola-31.log:Nov 13 01:18:12 lola-31 kernel: Lustre: Skipped 1 previous similar message
      lola-31.log:Nov 13 01:18:12 lola-31 kernel: LustreError: 167-0: soaked-OST0000-osc-ffff8808657d2800: This client was evicted by soaked-OST0000; in progress operations using this service will fail.
      lola-31.log:Nov 13 01:18:22 lola-31 kernel: Lustre: soaked-OST0000-osc-ffff8808657d2800: Connection restored to 192.168.1.102@o2ib10 (at 192.168.1.102@o2ib10)
      lola-33.log:Nov 13 01:18:12 lola-33 kernel: LustreError: 11-0: soaked-OST0000-osc-ffff88033a596800: operation ldlm_enqueue to node 192.168.1.102@o2ib10 failed: rc = -107
      lola-33.log:Nov 13 01:18:12 lola-33 kernel: Lustre: soaked-OST0000-osc-ffff88033a596800: Connection to soaked-OST0000 (at 192.168.1.102@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
      lola-33.log:Nov 13 01:18:12 lola-33 kernel: Lustre: Skipped 1 previous similar message
      lola-33.log:Nov 13 01:18:12 lola-33 kernel: LustreError: 167-0: soaked-OST0000-osc-ffff88033a596800: This client was evicted by soaked-OST0000; in progress operations using this service will fail.
      lola-33.log:Nov 13 01:18:12 lola-33 kernel: Lustre: 32171:0:(llite_lib.c:2628:ll_dirty_page_discard_warn()) soaked: dirty page discard: 192.168.1.108@o2ib10:192.168.1.109@o2ib10:/soaked/fid: [0x24006e1d4:0xd58d:0x0]/ may get corrupted (rc -108)
      lola-33.log:Nov 13 01:19:01 lola-33 kernel: Lustre: soaked-OST0000-osc-ffff88033a596800: Connection restored to 192.168.1.102@o2ib10 (at 192.168.1.102@o2ib10)
      

      No other messages, kernel debug logs were written on other nodes. No recovery went on, nor any error was injected during the lifetime of the jobs.

      The syslog, console and kernel debug logs of OSS node (lola-2), soak clients involved and slurm job log files had been attached to the ticket.

      Attachments

        1. console-lola-2.log.bz2
          89 kB
        2. console-lola-31.log.bz2
          39 kB
        3. console-lola-33.log.bz2
          23 kB
        4. jobs.tar
          1.28 MB
        5. lola-2-lustre-log-job-crash.log.bz2
          3.13 MB
        6. lola-31-lustre-log-job-crash.log.bz2
          0.3 kB
        7. lola-33-lustre-log-job-crash.log.bz2
          3.41 MB
        8. messages-lola-2.log.bz2
          522 kB
        9. messages-lola-31.log.bz2
          457 kB
        10. messages-lola-33.log.bz2
          448 kB

        Activity

          People

            bfaccini Bruno Faccini (Inactive)
            heckes Frank Heckes (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: