Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • None
    • Lustre 2.9.0
    • lola
      build: tip of master, commit 0f37c051158a399f7b00536eeec27f5dbdd54168
    • 3
    • 9223372036854775807

    Description

      error happened during soaktesting of build '20160727' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160727)
      OSTs formatted with zfs, MDSs formatted with ldiskfs
      DNE is enabled, HSM/robinhood enabled and integrated
      4 MDSs with 1 MDT / MDS
      6 OSSs with 4 OSTs / OSS
      Server nodes configured in active-active HA configuration

      Sequence of events:

      • 2016-07-28 08:48:37 - Soak session started
      • 2016-07-28 08:50:34 - First LNet time out:
         
        Jul 28 08:50:43 lola-5 kernel: LNetError: 9448:0:(o2iblnd_cb.c:3114:kiblnd_check_txs_locked()) Timed out tx: active_txs, 3 seconds
        Jul 28 08:50:43 lola-5 kernel: LNetError: 9448:0:(o2iblnd_cb.c:3177:kiblnd_check_conns()) Timed out RDMA with 192.168.1.108@o2ib10 (62): c: 0, oc: 0, rc: 8
        Jul 28 08:50:43 lola-5 kernel: Lustre: Skipped 4 previous similar messages
        Jul 28 08:51:03 lola-5 kernel: BUG: soft lockup - CPU#1 stuck for 67s! [ll_ost_io00_006:28605]
        Jul 28 08:51:03 lola-5 kernel: BUG: soft lockup - CPU#2 stuck for 67s! [ll_ost_io00_048:28758]
        

        (see also the attached file abrt-kernel-oops.tar.bz2; in total, 1545 event records of this form
        had been written by the time the node crashed)

      • 2016-07-28 08:51:03 - First occurrence of the error below. These errors flooded the console after some time. (see console log after entry 'Jul 28 08:45:01 lola-5 TIME: Time stamp for console')
        Jul 28 08:51:03 lola-5 kernel: Pid: 28758, comm: ll_ost_io00_048 Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ
        Jul 28 08:51:03 lola-5 kernel: RIP: 0010:[<ffffffff8129e8af>]  [<ffffffff8129e8af>] __write_lock_failed+0xf/0x20
        Jul 28 08:51:03 lola-5 kernel: RSP: 0018:ffff8803c8e2b918  EFLAGS: 00000287
        Jul 28 08:51:03 lola-5 kernel: RAX: 0000000000000000 RBX: ffff8803c8e2b920 RCX: 0000000000000000
        Jul 28 08:51:03 lola-5 kernel: RDX: ffff88044e415a00 RSI: ffff880335d78400 RDI: ffff8803fc143dd8
        Jul 28 08:51:03 lola-5 kernel: RBP: ffffffff8100bc0e R08: 0000000000000000 R09: 0000000000000000
        Jul 28 08:51:03 lola-5 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00000008fa9fcc28
        Jul 28 08:51:03 lola-5 kernel: R13: 0000000200000008 R14: ffff8803bac2b0b8 R15: ffffffff810674be
        Jul 28 08:51:03 lola-5 kernel: FS:  0000000000000000(0000) GS:ffff880038640000(0000) knlGS:0000000000000000
        Jul 28 08:51:03 lola-5 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
        Jul 28 08:51:03 lola-5 kernel: CR2: 00007f88b9e46000 CR3: 0000000001a8d000 CR4: 00000000000407e0
        Jul 28 08:51:03 lola-5 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        Jul 28 08:51:03 lola-5 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Jul 28 08:51:03 lola-5 kernel: Process ll_ost_io00_048 (pid: 28758, threadinfo ffff8803c8e28000, task ffff8803cb648040)
        
      • 2016-07-28 16:02 - oom-killer started (see entry in console) and last mtime update
        for the collectl data file:
        -rw-r--r-- 1 root root  738427536 Jul 28 16:02 lola-5-20160728-021116.raw.gz
        
      • Node neither accessible via ssh nor console. Node rebooted. No crash dump file was written. (Parameter set_param panic_on_lbug=1 was set).
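
The console excerpt in the sequence above can be summarized mechanically. The following is a minimal sketch, not part of the original report: it counts soft lockups per service thread and LNet timeout errors. The regular expressions are assumptions fitted only to the five log lines quoted above, and the names `SAMPLE`, `SOFT_LOCKUP`, `LNET_TIMEOUT` and `summarize` are hypothetical.

```python
import re
from collections import Counter

# Lines copied verbatim from the console excerpt in the description.
SAMPLE = """\
Jul 28 08:50:43 lola-5 kernel: LNetError: 9448:0:(o2iblnd_cb.c:3114:kiblnd_check_txs_locked()) Timed out tx: active_txs, 3 seconds
Jul 28 08:50:43 lola-5 kernel: LNetError: 9448:0:(o2iblnd_cb.c:3177:kiblnd_check_conns()) Timed out RDMA with 192.168.1.108@o2ib10 (62): c: 0, oc: 0, rc: 8
Jul 28 08:50:43 lola-5 kernel: Lustre: Skipped 4 previous similar messages
Jul 28 08:51:03 lola-5 kernel: BUG: soft lockup - CPU#1 stuck for 67s! [ll_ost_io00_006:28605]
Jul 28 08:51:03 lola-5 kernel: BUG: soft lockup - CPU#2 stuck for 67s! [ll_ost_io00_048:28758]
"""

# Assumed message shapes, derived only from the sample lines above.
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)
LNET_TIMEOUT = re.compile(r"LNetError: .*Timed out")


def summarize(lines):
    """Return (soft lockups per thread name, number of LNet timeout lines)."""
    lockups = Counter()
    lnet_timeouts = 0
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            lockups[match.group("comm")] += 1
        elif LNET_TIMEOUT.search(line):
            lnet_timeouts += 1
    return lockups, lnet_timeouts


lockups, lnet_timeouts = summarize(SAMPLE.splitlines())
for comm, count in sorted(lockups.items()):
    print(f"{comm}: {count} soft lockup(s)")
print(f"LNet timeout lines: {lnet_timeouts}")
```

Run against the full decompressed console or messages log instead of `SAMPLE`, the same function would show which `ll_ost_io` threads were stuck most often, provided the messages there follow the same shape.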

      Attached files:
      message, console, and debug message (written between Jul 28, 08:48 - 16:02), abrt-kernel-oops.tar.bz2 (content for single event)
      collectl memory and slab counters.

      We'll try to trigger a crashdump on the node which will be affected next and increase debug mask. Current debug files don't contain slab information, as far as I could see.

      Attachments

        1. abrt-kernel-oops.tar.bz2
          3 kB
        2. all-lustre-log.tar.bz2
          759 kB
        3. allocation-per-slab.tar.bz2
          1.81 MB
        4. console-lola-5.log.bz2
          698 kB
        5. lola-2-leak_finder.output.bz2
          221 kB
        6. lola-2-lustre-log.1470213950.128013.bz2
          0.3 kB
        7. lola-6.timeouts.txt
          0.2 kB
        8. lola-7.errors.txt
          619 kB
        9. memory-counter-lola-5-20160728-021116.dat.bz2
          80 kB
        10. messages-lola-5.log.bz2
          582 kB
        11. slab-details-counter-lola-5-20160728-021116.dat.bz2
          2.75 MB
        12. slab-sorted-alloaction.dat.bz2
          5 kB

          Activity

            [LU-8449] OSS crash with oom-killer started
            pjones Peter Jones added a comment -

            Fixed by reverting LU-7899

            bzzz Alex Zhuravlev added a comment -

            Hmm... would you mind trying the refreshed version, please? Please attach the logs.

            cliffw Cliff White (Inactive) added a comment -

            Yes, with the patch above

            bzzz Alex Zhuravlev added a comment -

            Cliff, with the patch above?

            cliffw Cliff White (Inactive) added a comment -

            System did not oom-kill; however, all OSSs are unusable due to massive timeouts.

            cliffw Cliff White (Inactive) added a comment -

            Will continue with the run to see if it oom-kills

            cliffw Cliff White (Inactive) added a comment -

            We have not seen an oom-killer; however, the immediate soft lockups continue.
            This tends to freeze the node over time. Full console from timeouts attached.

            gerrit Gerrit Updater added a comment -

            Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/21827
            Subject: LU-8449 osd: retake locks as planned
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7a6836c0e49a3065ef876b6fa41e233502adecd6

            cliffw Cliff White (Inactive) added a comment -

            Set kernel.panic = 60 on all OSS nodes. Re-ran test. Errors are attached

            jhammond John Hammond added a comment -

            This is likely introduced by:

            commit 6cd79ab5860c59c2a640a9e8ca4ee86eec050b43
            Author: Alex Zhuravlev <alexey.zhuravlev@intel.com>
            Date:   Fri Mar 25 12:21:16 2016 +0300
            
                LU-7899 osd: batch EA updates
                
                during file creation we set number of EAs: LMA, VBR, LinkEA, LOVEA, ACLs.
                calling into SA to refill spill again and again is expensive. thus it
                makes sense to postpone this to osd_trans_stop() where all changed EAs
                has been already collected in a temporary buffer.
                
                Change-Id: I8f02a287b96615c3aa550d63ffd9dd3da51b39ee
                Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
                Reviewed-on: http://review.whamcloud.com/19143
                Tested-by: Jenkins
                Tested-by: Maloo <hpdd-maloo@intel.com>
                Reviewed-by: Lai Siyao <lai.siyao@intel.com>
                Reviewed-by: Bobi Jam <bobijam@hotmail.com>
                Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
            

            Alex, can you comment here?
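
The batching idea described in the LU-7899 commit message above (record every changed EA during the transaction, then write them all once at transaction stop) can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the Lustre osd code: the `Transaction` class, its `set_ea` and `stop` methods, and the `backend_write` callback are hypothetical names chosen for illustration.

```python
class Transaction:
    """Toy transaction that defers EA writes until stop() is called."""

    def __init__(self, backend_write):
        # Hypothetical stand-in for the expensive per-update storage call.
        self._backend_write = backend_write
        # Temporary buffer: EA name -> most recent value.
        self._pending = {}

    def set_ea(self, name, value):
        # Only record the change; a later set of the same EA overwrites it.
        self._pending[name] = value

    def stop(self):
        # Flush every collected EA with a single backend call.
        if self._pending:
            self._backend_write(dict(self._pending))
            self._pending.clear()


calls = []                      # records each backend call for inspection
tx = Transaction(calls.append)
for ea_name in ("LMA", "VBR", "LinkEA", "LOVEA", "ACL"):
    tx.set_ea(ea_name, b"placeholder")
tx.stop()
print(f"backend calls: {len(calls)}, EAs written: {sorted(calls[0])}")
```

The sketch shows only the deferral pattern; it does not model locking or what happens between `set_ea` and `stop`.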


            People

              bzzz Alex Zhuravlev
              heckes Frank Heckes (Inactive)