LU-7538: file.c:3891:ll_layout_lock_set()) LBUG


    Description

      The error occurred during soak testing of master via build '20151209' (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20151209). DNE is enabled. The MDTs had been formatted using ldiskfs, the OSTs using zfs. The MDSes are configured in an active-active HA configuration.

      During normal operation (no faults injected), two Lustre client nodes hit the LBUG shown below:

      • lola-26 (192.168.1.126) – Dec 9 21:41:59
      • lola-27 (192.168.1.127) – Dec 9 21:41:40
        Dec  9 21:41:40 lola-27 kernel: LustreError: 3786:0:(file.c:3891:ll_layout_lock_set()) ASSERTION( ldlm_has_layout(lock) ) failed:
        Dec  9 21:41:40 lola-27 kernel: LustreError: 3786:0:(file.c:3891:ll_layout_lock_set()) LBUG
        Dec  9 21:41:40 lola-27 kernel: Pid: 3786, comm: flush-lustre-1
        Dec  9 21:41:40 lola-27 kernel: 
        Dec  9 21:41:40 lola-27 kernel: Call Trace:
        Dec  9 21:41:40 lola-27 kernel: [<ffffffffa045f875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
        Dec  9 21:41:40 lola-27 kernel: [<ffffffffa045fe77>] lbug_with_loc+0x47/0xb0 [libcfs]
        Dec  9 21:41:40 lola-27 kernel: [<ffffffffa0a04b89>] ll_layout_lock_set+0xa9/0x1360 [lustre]
        Dec  9 21:41:40 lola-27 kernel: [<ffffffffa0a03b5a>] ? ll_take_md_lock+0xfa/0x4b0 [lustre]
        Dec  9 21:41:40 lola-27 kernel: [<ffffffffa0a08fc1>] ll_layout_refresh_locked+0xe1/0xe00 [lustre]
        Dec  9 21:41:40 lola-27 kernel: [<ffffffffa058b7f1>] ? cl_io_slice_add+0xc1/0x190 [obdclass]
        Dec  9 21:41:40 lola-27 kernel: [<ffffffffa0a37c20>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
        Dec  9 21:41:40 lola-27 kernel: [<ffffffffa072f350>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
        Dec  9 21:41:40 lola-27 kernel: [<ffffffffa0470aa7>] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
        Dec  9 21:41:40 lola-27 kernel: [<ffffffffa0a09e79>] ll_layout_refresh+0x199/0x300 [lustre]
        Dec  9 21:41:40 lola-27 kernel: [<ffffffffa058b7f1>] ? cl_io_slice_add+0xc1/0x190 [obdclass]
        Dec  9 21:41:40 lola-27 kernel: [<ffffffffa0a56c8f>] vvp_io_init+0x39f/0x480 [lustre]
        Dec  9 21:41:40 lola-27 kernel: [<ffffffffa047377a>] ? cfs_hash_find_or_add+0x9a/0x190 [libcfs]
        Dec  9 21:41:40 lola-27 kernel: [<ffffffffa058a3a8>] cl_io_init0+0x88/0x150 [obdclass]
        Dec  9 21:41:40 lola-27 kernel: [<ffffffffa058d4a4>] cl_io_init+0x64/0xe0 [obdclass]
        Dec  9 21:41:40 lola-27 kernel: [<ffffffffa0a04022>] cl_sync_file_range+0x112/0x2f0 [lustre]
        Dec  9 21:41:40 lola-27 kernel: [<ffffffffa0a2cd7c>] ll_writepages+0x9c/0x220 [lustre]
        Dec  9 21:41:40 lola-27 kernel: [<ffffffff81139871>] do_writepages+0x21/0x40
        Dec  9 21:41:40 lola-27 kernel: [<ffffffff811bb19d>] writeback_single_inode+0xdd/0x290
        Dec  9 21:41:40 lola-27 kernel: [<ffffffff811bb59d>] writeback_sb_inodes+0xbd/0x170
        Dec  9 21:41:40 lola-27 kernel: [<ffffffff811bb6fb>] writeback_inodes_wb+0xab/0x1b0
        Dec  9 21:41:40 lola-27 kernel: [<ffffffff811bbaf3>] wb_writeback+0x2f3/0x410
        Dec  9 21:41:40 lola-27 kernel: [<ffffffff810880b2>] ? del_timer_sync+0x22/0x30
        Dec  9 21:41:40 lola-27 kernel: [<ffffffff811bbdb5>] wb_do_writeback+0x1a5/0x240
        Dec  9 21:41:40 lola-27 kernel: [<ffffffff811bbeb3>] bdi_writeback_task+0x63/0x1b0
        Dec  9 21:41:40 lola-27 kernel: [<ffffffff8109eaa7>] ? bit_waitqueue+0x17/0xd0
        Dec  9 21:41:40 lola-27 kernel: [<ffffffff81148620>] ? bdi_start_fn+0x0/0x100
        Dec  9 21:41:40 lola-27 kernel: [<ffffffff811486a6>] bdi_start_fn+0x86/0x100
        Dec  9 21:41:40 lola-27 kernel: [<ffffffff81148620>] ? bdi_start_fn+0x0/0x100
        Dec  9 21:41:40 lola-27 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0
        Dec  9 21:41:40 lola-27 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
        Dec  9 21:41:40 lola-27 kernel: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
        Dec  9 21:41:40 lola-27 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
        Dec  9 21:41:40 lola-27 kernel: 
        Dec  9 21:41:40 lola-27 kernel: Kernel panic - not syncing: LBUG
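
        For context, the assertion that fired guards the point where the client installs a freshly granted layout lock: ll_layout_lock_set() resolves the lock handle it was given and insists that the resulting LDLM lock actually carries layout coverage. A minimal sketch of that check, based on the ldlm_has_layout() helper as it appears in the Lustre source of that era (simplified, not the verbatim file.c code):

        /* ldlm_has_layout() is true only for an inodebits (IBITS) lock
         * whose policy data carries the LAYOUT bit. */
        static inline int ldlm_has_layout(struct ldlm_lock *lock)
        {
                return lock->l_resource->lr_type == LDLM_IBITS &&
                       lock->l_policy_data.l_inodebits.bits & MDS_INODELOCK_LAYOUT;
        }

        /* Roughly what ll_layout_lock_set() does with the handle it is
         * passed (simplified sketch): */
        static void layout_lock_set_sketch(struct lustre_handle *lockh)
        {
                struct ldlm_lock *lock;

                lock = ldlm_handle2lock(lockh);  /* handle -> lock, takes a ref */
                LASSERT(lock != NULL);
                LASSERT(ldlm_has_layout(lock));  /* <-- the ASSERTION that LBUGged */
                /* ... install the layout, then drop the reference ... */
                LDLM_LOCK_PUT(lock);
        }

        The LBUG therefore means the handle resolved to a granted lock that does not cover the layout bit, which ll_layout_lock_set() treats as an impossible state.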
        

        The errors temporally correlate with errors on the OSS nodes (lola-2, lola-3) of the form:

        lola-2.log:Dec  9 21:41:48 lola-2 kernel: LustreError: 28806:0:(ldlm_lockd.c:689:ldlm_handle_ast_error()) ### client (nid 192.168.1.126@o2ib100) failed to reply to blocking AST (req status 0 rc -11), evict it ns: filter-soaked-OST0004_UUID lock: ffff880377d872c0/0xef6ba6a3129d2917 lrc: 4/0,0 mode: PR/PR res: [0x500000406:0xfa062d:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000010020 nid: 192.168.1.126@o2ib100 remote: 0x6044879e61dc398 expref: 33311 pid: 27253 timeout: 4297781214 lvb_type: 1
        

        Further related messages can be found in the attached messages files for both OSS nodes.
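
        For reference, "rc -11" in the message above is -EAGAIN: the blocking AST (the server's callback asking the lock holder to cancel a conflicting lock) got no usable reply, so the server evicts that client's export. A simplified, illustrative sketch of that decision (hypothetical helper and function names, not the verbatim ldlm_handle_ast_error() code):

        /* Illustrative only: how the server reacts when a blocking AST
         * fails. rc is the status of the blocking-AST RPC sent to the
         * lock holder; -EAGAIN (-11) means the client never produced a
         * usable reply. */
        static void blocking_ast_error_sketch(struct ldlm_lock *lock, int rc)
        {
                if (rc == -EAGAIN || rc == -ETIMEDOUT) {
                        /* Evict the unresponsive client so its locks can
                         * be reclaimed and other clients are not blocked.
                         * Eviction invalidates every lock that client
                         * holds, layout locks included. */
                        evict_export(lock->l_export);  /* hypothetical helper */
                }
        }

        The ticket only establishes a temporal correlation between these evictions and the client LBUGs; no causal direction is claimed here.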

      Attached files:

      • lola-26, lola-27 – messages, console, and vmcore-dmesg.txt files
      • lola-2, lola-3 – messages and console files

      Attachments

        1. messages-lola-3.log.bz2 (196 kB)
        2. messages-lola-27.log.bz2 (206 kB)
        3. messages-lola-26.log.bz2 (207 kB)
        4. messages-lola-2.log.bz2 (184 kB)
        5. lola-27-vmcore-dmesg.txt.bz2 (18 kB)
        6. lola-26-vmcore-dmesg.txt.bz2 (18 kB)
        7. console-lola-3.log.bz2 (39 kB)
        8. console-lola-27.log.bz2 (33 kB)
        9. console-lola-26.log.bz2 (33 kB)
        10. console-lola-2.log.bz2 (41 kB)

            People

              Assignee: Zhenyu Xu (bobijam)
              Reporter: Frank Heckes (heckes) (Inactive)