[LU-7720] osd_object.c:925:osd_attr_set()) ASSERTION( dt_object_exists(dt) Created: 28/Jan/16  Updated: 28/Jan/16  Resolved: 28/Jan/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Frank Heckes (Inactive) Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: soak
Environment:

lola
build: master branch, 2.7.65-38-g607f691 ; 607f6919ea67b101796630d4b55649a12ea0e859


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The error happened during soak testing of build '20160126' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160126). DNE is enabled.
MDTs were formatted with ldiskfs, OSTs with ZFS.
No faults were injected during the soak test; the only load on the test cluster was the application workload and the lfsck run.

Sequence of events:

  • Jan 27 05:44:56 - Started lfsck with the following command on the primary MDS (lola-8):
    lctl lfsck_start -M soaked-MDT0000 -s 1000 -t all -A 
    
  • Jan 27 05:49 - OSS node lola-5 hit several LBUGs of the form:
    Jan 27 05:49:11 lola-5 kernel: LustreError: 17617:0:(osd_object.c:925:osd_attr_set()) LBUG
    Jan 27 05:49:11 lola-5 kernel: Pid: 17617, comm: ll_ost_out03_00
    Jan 27 05:49:11 lola-5 kernel: 
    Jan 27 05:49:11 lola-5 kernel: Call Trace:
    Jan 27 05:49:11 lola-5 kernel: [<ffffffffa05c7875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
    Jan 27 05:49:11 lola-5 kernel: [<ffffffffa05c7e77>] lbug_with_loc+0x47/0xb0 [libcfs]
    Jan 27 05:49:11 lola-5 kernel: [<ffffffffa0b27af5>] osd_attr_set+0xdd5/0xe40 [osd_zfs]
    Jan 27 05:49:11 lola-5 kernel: [<ffffffffa0710795>] ? keys_fill+0xd5/0x1b0 [obdclass]
    Jan 27 05:49:11 lola-5 kernel: [<ffffffffa02da916>] ? spl_kmem_alloc+0x96/0x1a0 [spl]
    Jan 27 05:49:11 lola-5 kernel: [<ffffffffa09b4033>] out_tx_attr_set_exec+0xa3/0x480 [ptlrpc]
    Jan 27 05:49:11 lola-5 kernel: [<ffffffffa09aa49a>] out_tx_end+0xda/0x5c0 [ptlrpc]
    Jan 27 05:49:11 lola-5 kernel: [<ffffffffa09b0364>] out_handle+0x11c4/0x19a0 [ptlrpc]
    Jan 27 05:49:11 lola-5 kernel: [<ffffffff8152b83e>] ? mutex_lock+0x1e/0x50
    Jan 27 05:49:12 lola-5 kernel: [<ffffffffa099f6fa>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
    
  • Jan 27 08:30 - lola-5 crashed after the oom-killer was invoked, most likely as an eventual consequence of the LBUGs; more than 600 ost_* threads were blocked.
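
For triaging the attached messages file, the LBUG lines of the form quoted above can be pulled out and grouped by assertion site. The following is only a minimal sketch: the regular expression is derived from the single log line shown in this ticket, and the helper names (find_lbugs, count_by_site) are hypothetical and not part of any Lustre tooling.

```python
import re
from collections import Counter

# Assumption: every LBUG line looks like the one quoted in this ticket:
#   <syslog timestamp> <host> kernel: LustreError: <pid>:0:(<file>:<line>:<func>()) LBUG
LBUG_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"LustreError:\s+(?P<pid>\d+):\d+:"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s+LBUG\s*$"
)


def find_lbugs(lines):
    """Return one dict per LBUG line found in an iterable of syslog lines."""
    hits = []
    for raw in lines:
        match = LBUG_RE.match(raw.strip())
        if match:
            hits.append(match.groupdict())
    return hits


def count_by_site(hits):
    """Count LBUG hits per (host, source file, line number, function)."""
    return Counter(
        (h["host"], h["file"], int(h["line"]), h["func"]) for h in hits
    )


# Sample input: two lines copied from the trace in this ticket.
sample = [
    "Jan 27 05:49:11 lola-5 kernel: LustreError: 17617:0:(osd_object.c:925:osd_attr_set()) LBUG",
    "Jan 27 05:49:11 lola-5 kernel: Pid: 17617, comm: ll_ost_out03_00",
]
hits = find_lbugs(sample)
for site, count in count_by_site(hits).items():
    print(site, count)
```

To run it against a real log, pass an open file object (for example, the messages file of lola-5) to find_lbugs instead of the sample list.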

Attached files:

  • messages, console logs of lola-5
  • debug log files: lustre-log.1453902551.22690 lustre-log.1453902552.17617


 Comments   
Comment by Frank Heckes (Inactive) [ 28/Jan/16 ]

collectl counters for the oom-killer event can be provided on demand. They show
no exhaustion by any slab, and the process counters do not identify any process or thread consuming all memory.
Nevertheless, all memory resources were consumed in the end.

Comment by Frank Heckes (Inactive) [ 28/Jan/16 ]

This might be related to LU-7662

Comment by nasf (Inactive) [ 28/Jan/16 ]

The LBUG() is another failure instance of LU-5565. In fact, the related trouble has already been fixed by the patch http://review.whamcloud.com/#/c/12608/. But as John mentioned in LU-5565, we only fixed the ldiskfs case; we need to fix the ZFS case as well.

Comment by nasf (Inactive) [ 28/Jan/16 ]

Another failure instance of LU-5565. We need to enhance the patch http://review.whamcloud.com/#/c/12608/ for the ZFS case.

Comment by Frank Heckes (Inactive) [ 28/Jan/16 ]

I stopped the upload of the log files, as the event has already been identified as a duplicate.
