Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.9.0
    • Environment: lola
      build: tip of master, commit 0f37c051158a399f7b00536eeec27f5dbdd54168
    • Severity: 3
    • 9223372036854775807

    Description

      The error occurred during soak testing of build '20160727' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160727).
      OSTs formatted with zfs, MDSs formatted with ldiskfs
      DNE is enabled, HSM/robinhood enabled and integrated
      4 MDSs with 1 MDT / MDS
      6 OSSs with 4 OSTs / OSS
      Server nodes configured in active-active HA configuration

      Sequence of events:

      • 2016-07-28 08:48:37 - Soak session started
      • 2016-07-28 08:50:34 - First LNet time out:
         
        Jul 28 08:50:43 lola-5 kernel: LNetError: 9448:0:(o2iblnd_cb.c:3114:kiblnd_check_txs_locked()) Timed out tx: active_txs, 3 seconds
        Jul 28 08:50:43 lola-5 kernel: LNetError: 9448:0:(o2iblnd_cb.c:3177:kiblnd_check_conns()) Timed out RDMA with 192.168.1.108@o2ib10 (62): c: 0, oc: 0, rc: 8
        Jul 28 08:50:43 lola-5 kernel: Lustre: Skipped 4 previous similar messages
        Jul 28 08:51:03 lola-5 kernel: BUG: soft lockup - CPU#1 stuck for 67s! [ll_ost_io00_006:28605]
        Jul 28 08:51:03 lola-5 kernel: BUG: soft lockup - CPU#2 stuck for 67s! [ll_ost_io00_048:28758]
        

        (see also attached file abrt-kernel-oops.tar.bz2; in total 1545 event records of this form
        were written before the node crashed)

      • 2016-07-28 08:51:03 - First occurrence of the error below. These errors flooded the console after some time (see console log after entry 'Jul 28 08:45:01 lola-5 TIME: Time stamp for console').
        Jul 28 08:51:03 lola-5 kernel: Pid: 28758, comm: ll_ost_io00_048 Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1 Intel Corporation S2600GZ ........../S2600GZ
        Jul 28 08:51:03 lola-5 kernel: RIP: 0010:[<ffffffff8129e8af>]  [<ffffffff8129e8af>] __write_lock_failed+0xf/0x20
        Jul 28 08:51:03 lola-5 kernel: RSP: 0018:ffff8803c8e2b918  EFLAGS: 00000287
        Jul 28 08:51:03 lola-5 kernel: RAX: 0000000000000000 RBX: ffff8803c8e2b920 RCX: 0000000000000000
        Jul 28 08:51:03 lola-5 kernel: RDX: ffff88044e415a00 RSI: ffff880335d78400 RDI: ffff8803fc143dd8
        Jul 28 08:51:03 lola-5 kernel: RBP: ffffffff8100bc0e R08: 0000000000000000 R09: 0000000000000000
        Jul 28 08:51:03 lola-5 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00000008fa9fcc28
        Jul 28 08:51:03 lola-5 kernel: R13: 0000000200000008 R14: ffff8803bac2b0b8 R15: ffffffff810674be
        Jul 28 08:51:03 lola-5 kernel: FS:  0000000000000000(0000) GS:ffff880038640000(0000) knlGS:0000000000000000
        Jul 28 08:51:03 lola-5 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
        Jul 28 08:51:03 lola-5 kernel: CR2: 00007f88b9e46000 CR3: 0000000001a8d000 CR4: 00000000000407e0
        Jul 28 08:51:03 lola-5 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        Jul 28 08:51:03 lola-5 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Jul 28 08:51:03 lola-5 kernel: Process ll_ost_io00_048 (pid: 28758, threadinfo ffff8803c8e28000, task ffff8803cb648040)
        
      • 2016-07-28 16:02 - oom-killer started (see entry in console); last mtime update
        of the collectl data file:
        -rw-r--r-- 1 root root  738427536 Jul 28 16:02 lola-5-20160728-021116.raw.gz
        
      • Node was accessible neither via ssh nor via the console. Node was rebooted. No crash dump file was written (the parameter panic_on_lbug=1 was set via set_param).
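
        Since no vmcore was captured even though panic_on_lbug=1 was set, here is a minimal sketch for verifying that crash dumps are actually armed on the OSS nodes. These are generic RHEL 6 kdump/sysctl checks, not commands taken from this ticket:

        # Confirm kdump is running and a crash kernel is reserved (illustrative checks):
        service kdump status
        grep crashkernel /proc/cmdline
        # Ensure an LBUG/oops panics the node so kdump can capture a vmcore:
        lctl set_param panic_on_lbug=1
        sysctl -w kernel.panic_on_oops=1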

      Attached files:
      messages, console, and debug logs (written between Jul 28 08:48 and 16:02), abrt-kernel-oops.tar.bz2 (content for a single event),
      collectl memory and slab counters.

      We'll try to trigger a crash dump on the next affected node and increase the debug mask. The current debug files don't contain slab information, as far as I could see.
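
      For reference, a minimal sketch of increasing the Lustre debug mask and buffer on a server node. The specific values are assumptions; a later comment mentions a "(512M, -1)" debug log, which corresponds roughly to these settings:

      # Illustrative only: enable all debug flags and a 512 MB debug buffer
      lctl set_param debug=-1
      lctl set_param debug_mb=512
      # After the problem reproduces, dump the kernel debug buffer to a file
      lctl dk /tmp/lustre-debug.log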

      Attachments

        1. abrt-kernel-oops.tar.bz2
          3 kB
        2. all-lustre-log.tar.bz2
          759 kB
        3. allocation-per-slab.tar.bz2
          1.81 MB
        4. console-lola-5.log.bz2
          698 kB
        5. lola-2-leak_finder.output.bz2
          221 kB
        6. lola-2-lustre-log.1470213950.128013.bz2
          0.3 kB
        7. lola-6.timeouts.txt
          0.2 kB
        8. lola-7.errors.txt
          619 kB
        9. memory-counter-lola-5-20160728-021116.dat.bz2
          80 kB
        10. messages-lola-5.log.bz2
          582 kB
        11. slab-details-counter-lola-5-20160728-021116.dat.bz2
          2.75 MB
        12. slab-sorted-alloaction.dat.bz2
          5 kB

        Issue Links

          Activity

            [LU-8449] OSS crash with oom-killer started
            jhammond John Hammond added a comment -

            What is kernel.panic set to here? If it's set to 0 then please try sysctl -w kernel.panic=60 to see if we can get the full crash dump message on the console.
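
            A minimal sketch of the suggested check (the commands are standard sysctl usage; any output is hypothetical):

            # Inspect the current panic behaviour on the OSS; 0 means the node hangs after a panic instead of rebooting:
            sysctl kernel.panic
            # Suggested setting: reboot 60 seconds after a panic so the console can capture the full trace first:
            sysctl -w kernel.panic=60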

            jhammond John Hammond added a comment -

            Do we have stack traces for the soft lockups?
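
            If they weren't captured automatically, a minimal sketch for collecting them manually the next time the lockups appear (standard sysrq usage, not taken from this ticket):

            # Dump backtraces of all active CPUs to the console/dmesg:
            echo 1 > /proc/sys/kernel/sysrq
            echo l > /proc/sysrq-trigger
            # 't' dumps all tasks, which is more verbose but includes the blocked ll_ost_io threads:
            echo t > /proc/sysrq-trigger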

            Build 3046 is 4af430de697249d07bcb7fe3401a6caa4415fe31. There have been 56 commits on master from 4af430de697249d07bcb7fe3401a6caa4415fe31 to 0f37c051158a399f7b00536eeec27f5dbdd54168. How long does this take to reproduce?
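
            A hedged sketch of how the commit range between the two build hashes can be enumerated, assuming a local clone of the lustre-release master branch:

            # Count and list the commits between the two builds:
            git rev-list --count 4af430de697249d07bcb7fe3401a6caa4415fe31..0f37c051158a399f7b00536eeec27f5dbdd54168
            git log --oneline 4af430de697249d07bcb7fe3401a6caa4415fe31..0f37c051158a399f7b00536eeec27f5dbdd54168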

            cliffw Cliff White (Inactive) added a comment - - edited

            Downgraded OSS nodes to build lustre-master 3406 - no oom-killer. Run stopped when MDS failed to complete recovery (existing bug)
            Upgrading OSS nodes to build 3409 Revision: 9d8469694738288fcd246a1e9f22c47918a32c58
            Soak restart Thu Aug 4 19:38:27 PDT 2016
            First soft lockup: Aug 4 19:58:26 lola-7 kernel: BUG: soft lockup - CPU#3 stuck for 67s! [ll_ost_io00_031:20580]


            heckes Frank Heckes (Inactive) added a comment -

            I attached debug logs (and the output of leak_finder.pl, most likely not so important) from a debug log (512M, -1) of an OSS node (lola-2) that crashed with the same error. The test session was started at 2016-08-03 01:39 and the node crashed very soon with:

            Aug  3 01:43:22 lola-2 kernel: LNetError: 9516:0:(o2iblnd_cb.c:3114:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
            Aug  3 01:43:22 lola-2 kernel: LNetError: 9516:0:(o2iblnd_cb.c:3177:kiblnd_check_conns()) Timed out RDMA with 192.168.1.114@o2ib10 (51): c: 0, oc: 0, rc: 7
            Aug  3 01:43:22 lola-2 kernel: LustreError: 9516:0:(events.c:449:server_bulk_callback()) event type 3, status -103, desc ffff8803fd339400
            Aug  3 01:43:22 lola-2 kernel: LustreError: 9516:0:(events.c:449:server_bulk_callback()) event type 5, status -103, desc ffff88032a550e00
            Aug  3 01:43:59 lola-2 kernel: BUG: soft lockup - CPU#0 stuck for 67s! [ll_ost_io00_028:22505]
            Aug  3 01:43:59 lola-2 kernel: Modules linked in:
            

            (Btw. 192.168.1.114 is the LNet router. There's no message in the router's syslog or console log that can be correlated with this.)


            doug Doug Oucharek (Inactive) added a comment -

            If CON RACEs were involved, we should be seeing some rejection logs from o2iblnd with the reason CON RACE.
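
            A quick, illustrative way to check the attached logs for such rejections; the exact message text depends on the o2iblnd version, so treat the search pattern as an assumption:

            # Search the attached (bzip2-compressed) logs for o2iblnd rejection/race messages:
            bzcat console-lola-5.log.bz2 messages-lola-5.log.bz2 | grep -iE "reject|race"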


            adilger Andreas Dilger added a comment -

            Could this be related to "LU-7646 lnet: Stop Infinite CON RACE Condition", where the connection fails and reconnecting consumes all the memory?


            heckes Frank Heckes (Inactive) added a comment -

            The top 10 slab consumers read as follows:

            #Date Time SlabName ObjInUse ObjInUseB ObjAll ObjAllB SlabInUse SlabInUseB SlabAll SlabAllB SlabChg SlabPct
            slab-details/filp.dat:20160728 16:01:40 filp 86528715 22151351040 86528715 22151351040 5768581 23628107776 5768581 23628107776 19533824 0
            slab-details/size-131072.dat:20160728 16:01:40 size-131072 8696 1139802112 8696 1139802112 8696 1139802112 8696 1139802112 0 0
            slab-details/dentry.dat:20160728 16:01:40 dentry 2671440 512916480 2671440 512916480 133572 547110912 133572 547110912 26107904 5
            slab-details/zio_buf_16384.dat:20160728 16:01:40 zio_buf_16384 18261 299188224 18261 299188224 18261 299188224 18261 299188224 0 0
            slab-details/size-1048576.dat:20160728 16:01:40 size-1048576 256 268435456 256 268435456 256 268435456 256 268435456 0 0
            slab-details/task_struct.dat:20160728 16:01:40 task_struct 43410 115991520 43410 115991520 14470 118538240 14470 118538240 57344 0
            slab-details/zio_cache.dat:20160728 16:01:40 zio_cache 31401 34917912 94507 105091784 13501 110600192 13501 110600192 0 0
            slab-details/size-262144.dat:20160728 16:01:40 size-262144 395 103546880 395 103546880 395 103546880 395 103546880 0 0
            slab-details/biovec-256.dat:20160728 16:01:40 biovec-256 23042 94380032 23042 94380032 23042 94380032 23042 94380032 0 0
            slab-details/size-1024.dat:20160728 16:01:40 size-1024 61913 63398912 61956 63442944 15489 63442944 15489 63442944 -4096 0
            

            See also the attached files slab-sorted-alloaction.dat and allocation-per-slab.tar.bz2.
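
            The table above appears to come from the collectl slab-details files; a roughly equivalent live snapshot can be taken directly on the node, for example (sketch only, standard /proc/slabinfo layout assumed, run as root):

            # Top 10 slab caches by total allocated bytes (num_objs * objsize):
            awk 'NR > 2 { printf "%-24s %15d\n", $1, $3 * $4 }' /proc/slabinfo | sort -k2 -rn | head -10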


            People

              bzzz Alex Zhuravlev
              heckes Frank Heckes (Inactive)
              Votes: 0
              Watchers: 9

              Dates

                Created:
                Updated:
                Resolved: