[LU-10105]  kernel:Kernel panic - not syncing: LBUG Created: 09/Oct/17  Updated: 25/May/20  Resolved: 19/Apr/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.1
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Shankar Vellaichamy Assignee: Brad Hoagland (Inactive)
Resolution: Won't Do Votes: 0
Labels: None
Environment:

RHEL 7.4 plus 3rd party kernel modules


Attachments: Zip Archive ds-agent-diagnostic.zip    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When attempting to create a *new* file from a Lustre client node, the kernel throws the following panic and the client node reboots. After the reboot, the file that was being created is present with 0 bytes, and data can be written to it. Creating new directories works, and existing files can be updated; only the creation of new files from the Lustre client node triggers the panic.

Message from syslogd@10-64-7-142 at Oct 9 16:19:18 ...
kernel:LustreError: 4337:0:(dcache.c:188:ll_d_init()) ASSERTION( de->d_op == &ll_d_ops ) failed:

Message from syslogd@10-64-7-142 at Oct 9 16:19:18 ...
kernel:LustreError: 4337:0:(dcache.c:188:ll_d_init()) LBUG

Message from syslogd@10-64-7-142 at Oct 9 16:19:19 ...
kernel:Kernel panic - not syncing: LBUG



 Comments   
Comment by Shankar Vellaichamy [ 09/Oct/17 ]

Message from syslogd@10-64-7-142 at Oct 9 16:51:16 ...
kernel:LustreError: 7760:0:(dcache.c:188:ll_d_init()) ASSERTION( de->d_op == &ll_d_ops ) failed:

Message from syslogd@10-64-7-142 at Oct 9 16:51:16 ...
kernel:LustreError: 7760:0:(dcache.c:188:ll_d_init()) LBUG
Oct 9 16:51:16 10-64-7-142 kernel: LustreError: 7760:0:(dcache.c:188:ll_d_init()) ASSERTION( de->d_op == &ll_d_ops ) failed:
Oct 9 16:51:16 10-64-7-142 kernel: LustreError: 7760:0:(dcache.c:188:ll_d_init()) LBUG
Oct 9 16:51:16 10-64-7-142 kernel: Pid: 7760, comm: touch
Oct 9 16:51:16 10-64-7-142 kernel: Call Trace:
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffffc06957ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffffc069583c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffffc0ba0d2e>] ll_d_init+0x2de/0x420 [lustre]
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffffc09df53d>] ? __req_capsule_get+0x15d/0x700 [ptlrpc]
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffffc0be3ba8>] ll_splice_alias+0x1b8/0x320 [lustre]
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffffc0be3d93>] ll_lookup_it_finish+0x83/0x1090 [lustre]
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffffc0b81787>] ? lmv_intent_lock+0xe37/0x1b50 [lmv]
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffff812cb380>] ? security_sid_to_context+0x10/0x20
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffff812bb875>] ? selinux_dentry_init_security+0xa5/0x110
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffffc0be2e80>] ? ll_md_blocking_ast+0x0/0x730 [lustre]
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffffc0be563e>] ll_lookup_it+0x89e/0xee0 [lustre]
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffff812ca70d>] ? context_struct_compute_av+0x34d/0x470
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffffc0be5db7>] ll_atomic_open+0x137/0x12d0 [lustre]
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffff816a318f>] ? avc_compute_av+0x1a3/0x1b5
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffffc0407138>] rfs_atomic_open+0x148/0x380 [redirfs]
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffff812100fd>] do_last+0xa4d/0x12c0
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffff81210a32>] path_openat+0xc2/0x490
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffff8118295b>] ? unlock_page+0x2b/0x30
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffff81212fcb>] do_filp_open+0x4b/0xb0
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffff8111ea0c>] ? audit_alloc_name+0x9c/0x160
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffff8111f6fd>] ? __audit_getname+0x3d/0xb0
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffff81220249>] ? __alloc_fd+0xa9/0x130
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffff811ffc13>] do_sys_open+0xf3/0x1f0
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffff811ffd2e>] SyS_open+0x1e/0x20
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffffc0421c34>] gsch_open_hook_fn+0x124/0x140 [gsch]
Oct 9 16:51:16 10-64-7-142 kernel: [<ffffffff816b5009>] system_call_fastpath+0x16/0x1b

Message from syslogd@10-64-7-142 at Oct 9 16:51:16 ...
kernel:Kernel panic - not syncing: LBUG
Oct 9 16:51:16 10-64-7-142 kernel:
Oct 9 16:51:16 10-64-7-142 kernel: Kernel panic - not syncing: LBUG

Comment by John Hammond [ 09/Oct/17 ]

Could you describe the client configuration used here? In particular, how are gsch and redirfs used?

Comment by Shankar Vellaichamy [ 10/Oct/17 ]

Hi John, After stopping ds_agent with /etc/init.d/ds_agent stop, new file creation succeeds. Please refer to the attached ds_agent diagnostic package, generated by /opt/ds_agent/dsa_control -d. Could you please help make the creation of new files work from the Lustre client with ds_agent active and running?

ds-agent-diagnostic.zip

Comment by Brad Hoagland (Inactive) [ 10/Oct/17 ]

Hi Shankar,

Do you have a support agreement for this system with Intel?

Regards,

Brad

Comment by Shankar Vellaichamy [ 10/Oct/17 ]

Hi Brad, We have just started with Lustre and don't have a support agreement with Intel for this system.

Comment by Aurelien Degremont (Inactive) [ 25/May/20 ]

For the record, I've faced the same crashes, which ended up being caused by the same software. ds_agent is the Trend Micro Deep Security Agent. It is part of Trend Micro Deep Security and includes redirfs, a file system layer that wraps accesses to the real file system, and gsch, a hooking module. Here is the LBUG stack trace:

crash> bt
PID: 19992  TASK: ffff9df833f73150  CPU: 1   COMMAND: "touch"
 #0 [ffff9df7395b3728] machine_kexec at ffffffffa6466044
 #1 [ffff9df7395b3788] __crash_kexec at ffffffffa6522ee2
 #2 [ffff9df7395b3858] panic at ffffffffa6b7952c
 #3 [ffff9df7395b38d8] lbug_with_loc at ffffffffc03028cb [libcfs]
 #4 [ffff9df7395b38f8] ll_d_init at ffffffffc07b9d2e [lustre]
 #5 [ffff9df7395b3938] ll_splice_alias at ffffffffc07fdab8 [lustre]
 #6 [ffff9df7395b3980] ll_lookup_it_finish at ffffffffc07fdca3 [lustre]
 #7 [ffff9df7395b3a48] ll_lookup_it at ffffffffc07ff576 [lustre]
 #8 [ffff9df7395b3b08] ll_atomic_open at ffffffffc07ffcf7 [lustre]
 #9 [ffff9df7395b3bc8] rfs_atomic_open at ffffffffc075622b [redirfs]
#10 [ffff9df7395b3c70] do_last at ffffffffa665d803
#11 [ffff9df7395b3d20] path_openat at ffffffffa665e1bd
#12 [ffff9df7395b3db8] do_filp_open at ffffffffa666040d
#13 [ffff9df7395b3e90] do_sys_open at ffffffffa664bfe4
#14 [ffff9df7395b3ef0] sys_open at ffffffffa664c0fe
#15 [ffff9df7395b3f00] gsch_open_hook_fn at ffffffffc088a4fa [gsch]
#16 [ffff9df7395b3f50] system_call_fastpath at ffffffffa6b92ed2
    RIP: 00007fb6a78d3760  RSP: 00007ffee9c8e958  RFLAGS: 00010202
    RAX: 0000000000000002  RBX: 00007ffee9c8ecf8  RCX: 0000000000000037
    RDX: 00000000000001b6  RSI: 0000000000000941  RDI: 00007ffee9c9056f
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 00007ffee9c8dea0  R11: 0000000000000246  R12: 00007ffee9c9056f
    R13: 00007fb6a7bab2a0  R14: 0000000000000001  R15: 0000000000000000
    ORIG_RAX: 0000000000000002  CS: 0033  SS: 002b

(See gsch_open_hook_fn in frame #15, where the call chain starts, and rfs_atomic_open in frame #9.)

This hook, or redirfs, replaces the dentry's struct dentry_operations with its own copy, in which some callbacks are swapped for its own. As a result de->d_op no longer equals &ll_d_ops, and the LASSERT fails.

crash> p &ll_d_ops
$7 = (const struct dentry_operations *) 0xffffffffc0824200 <ll_d_ops>

...

crash> struct dentry_operations 0xffff9df83256e040
struct dentry_operations {
  d_revalidate = 0xffffffffc07538e0,
  d_weak_revalidate = 0x0,
  d_hash = 0x0,
  d_compare = 0xffffffffc07b9770 <ll_dcompare>,
  d_delete = 0xffffffffc07b94e0 <ll_ddelete>,
  d_release = 0xffffffffc0753b30,
  d_prune = 0x0,
  d_iput = 0xffffffffc07536a0,
  d_dname = 0x0,
  d_automount = 0x0,
  {
    d_manage = 0x0,
    __UNIQUE_ID_rh_kabi_hide17 = {
      d_manage = 0x0
    },
    {<No data fields>}
  }
}

ffffffffc07538e0 (t) rfs_d_revalidate [redirfs]
ffffffffc0753b30 (t) rfs_d_release [redirfs]
ffffffffc07536a0 (t) rfs_d_iput [redirfs]
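
The failure mode described above can be modeled in user space. This is a minimal sketch, not Lustre or redirfs code: the struct layout is heavily simplified, the callback bodies are stubs, and ll_revalidate_stub(), ll_release_stub(), interpose() and identity_check_passes() are hypothetical names introduced for illustration. It assumes the interposer copies the original operations table and overrides a few entries, which is what the dump above suggests.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct dentry_operations. */
struct dentry_operations {
    int  (*d_revalidate)(void *dentry);
    int  (*d_compare)(const char *a, const char *b);
    void (*d_release)(void *dentry);
};

/* Simplified stand-in for struct dentry: only the ops pointer matters here. */
struct dentry {
    const struct dentry_operations *d_op;
};

/* Stub callbacks standing in for the file system's own (hypothetical bodies). */
static int  ll_revalidate_stub(void *d) { (void)d; return 1; }
static int  ll_dcompare(const char *a, const char *b) { (void)a; (void)b; return 0; }
static void ll_release_stub(void *d) { (void)d; }

/* The one table the file system expects every one of its dentries to use. */
static const struct dentry_operations ll_d_ops = {
    .d_revalidate = ll_revalidate_stub,
    .d_compare    = ll_dcompare,
    .d_release    = ll_release_stub,
};

/* Stub callbacks standing in for the interposing module's wrappers. */
static int  rfs_d_revalidate(void *d) { (void)d; return 1; }
static void rfs_d_release(void *d) { (void)d; }

/* Private copy of the ops table owned by the interposer. */
static struct dentry_operations rfs_ops_copy;

/* Copy the original table, override some callbacks, repoint the dentry. */
static void interpose(struct dentry *de)
{
    rfs_ops_copy = *de->d_op;
    rfs_ops_copy.d_revalidate = rfs_d_revalidate;
    rfs_ops_copy.d_release    = rfs_d_release;
    de->d_op = &rfs_ops_copy;
}

/* Same shape as the failing check: pointer identity, not content equality. */
static bool identity_check_passes(const struct dentry *de)
{
    return de->d_op == &ll_d_ops;
}
```

For a dentry initialized with d_op = &ll_d_ops, identity_check_passes() is true before interpose() and false afterwards, even though d_compare still points at ll_dcompare. The check compares table addresses, so any wrapper that substitutes its own table trips it regardless of which callbacks it actually changed.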

 

This kind of low-level hooking is definitely a bad design in my opinion, but I was wondering whether this LASSERT is really useful, and whether removing it would be enough to make this software work.
