Lustre / LU-7720

osd_object.c:925:osd_attr_set()) ASSERTION( dt_object_exists(dt)

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • None
    • None
    • Environment: lola
      build: master branch, 2.7.65-38-g607f691 ; 607f6919ea67b101796630d4b55649a12ea0e859
    • Severity: 3

    Description

      The error happened during soak testing of build '20160126' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160126). DNE is enabled.
      MDTs had been formatted with ldiskfs, OSTs with zfs.
      No faults were injected during the soak test. Only application load and execution of lfsck were imposed on the test cluster.

      Sequence of events:

      • Jan 27 05:44:56 - Started lfsck - command on primary MDS (lola-8):
        lctl lfsck_start -M soaked-MDT0000 -s 1000 -t all -A 
        
      • Jan 27 05:49 - OSS node lola-5 hit several LBUGs of the form:
        Jan 27 05:49:11 lola-5 kernel: LustreError: 17617:0:(osd_object.c:925:osd_attr_set()) LBUG
        Jan 27 05:49:11 lola-5 kernel: Pid: 17617, comm: ll_ost_out03_00
        Jan 27 05:49:11 lola-5 kernel: 
        Jan 27 05:49:11 lola-5 kernel: Call Trace:
        Jan 27 05:49:11 lola-5 kernel: [<ffffffffa05c7875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
        Jan 27 05:49:11 lola-5 kernel: [<ffffffffa05c7e77>] lbug_with_loc+0x47/0xb0 [libcfs]
        Jan 27 05:49:11 lola-5 kernel: [<ffffffffa0b27af5>] osd_attr_set+0xdd5/0xe40 [osd_zfs]
        Jan 27 05:49:11 lola-5 kernel: [<ffffffffa0710795>] ? keys_fill+0xd5/0x1b0 [obdclass]
        Jan 27 05:49:11 lola-5 kernel: [<ffffffffa02da916>] ? spl_kmem_alloc+0x96/0x1a0 [spl]
        Jan 27 05:49:11 lola-5 kernel: [<ffffffffa09b4033>] out_tx_attr_set_exec+0xa3/0x480 [ptlrpc]
        Jan 27 05:49:11 lola-5 kernel: [<ffffffffa09aa49a>] out_tx_end+0xda/0x5c0 [ptlrpc]
        Jan 27 05:49:11 lola-5 kernel: [<ffffffffa09b0364>] out_handle+0x11c4/0x19a0 [ptlrpc]
        Jan 27 05:49:11 lola-5 kernel: [<ffffffff8152b83e>] ? mutex_lock+0x1e/0x50
        Jan 27 05:49:12 lola-5 kernel: [<ffffffffa099f6fa>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
        
      • Jan 27 08:30 - lola-5 crashed with the oom-killer, most likely as a consequence of the LBUG; by that time more than 600 ost_* threads were blocked.

      Attached files:

      • messages, console logs of lola-5
      • debug log files: lustre-log.1453902551.22690 lustre-log.1453902552.17617

      Attachments

        Activity


          heckes Frank Heckes (Inactive) added a comment -
          I stopped the upload of the log files as the event has already been identified as a duplicate.
          yong.fan nasf (Inactive) added a comment - edited -
          Another failure instance of LU-5565. We need to enhance the patch http://review.whamcloud.com/#/c/12608/ for the ZFS case.

          yong.fan nasf (Inactive) added a comment -
          The LBUG() is another failure instance of LU-5565. In fact, the related trouble has already been fixed by the patch http://review.whamcloud.com/#/c/12608/. But as mentioned by John in LU-5565, that patch only covered the ldiskfs case; the ZFS case needs to be fixed as well.
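          As an illustration of the kind of change implied above (a hypothetical sketch, not the actual content of http://review.whamcloud.com/#/c/12608/): in the osd-zfs attr_set path, the assertion that the object exists would be replaced by a graceful error return, so that an OUT-driven attribute update racing with an object destroy (e.g. during lfsck) fails that single update instead of LBUG()ing the whole OSS. The function name example_osd_attr_set and its reduced argument list are assumptions made for the sketch; dt_object_exists() and the dt_object/lu_attr/thandle types are from the Lustre tree.

          #include <dt_object.h>  /* Lustre: struct dt_object, dt_object_exists(), struct lu_attr, struct thandle */

          /*
           * Hypothetical sketch only -- not the real osd-zfs osd_attr_set() and not
           * the patch referenced above.  It shows the idea: fail the update
           * gracefully when the object has already disappeared, instead of asserting.
           */
          static int example_osd_attr_set(const struct lu_env *env,
                                          struct dt_object *dt,
                                          const struct lu_attr *la,
                                          struct thandle *handle)
          {
                  /* Previously: LASSERT(dt_object_exists(dt)); -> LBUG on a stale object */
                  if (!dt_object_exists(dt))
                          return -ENOENT; /* out_tx_attr_set_exec() reports the failure to the caller */

                  /* ... the normal attribute update would follow here ... */
                  return 0;
          }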

          heckes Frank Heckes (Inactive) added a comment -
          This might be related to LU-7662.

          heckes Frank Heckes (Inactive) added a comment -
          collectl counters for the oom-killer event can be provided on demand. They show no exhaustion of any slab, and no single process or thread consuming all memory could be identified from the process counters. Nevertheless, all memory resources were exhausted in the end.

          People

            Assignee: wc-triage WC Triage
            Reporter: heckes Frank Heckes (Inactive)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved: