[LU-10105] kernel:Kernel panic - not syncing: LBUG Created: 09/Oct/17 Updated: 25/May/20 Resolved: 19/Apr/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Shankar Vellaichamy | Assignee: | Brad Hoagland (Inactive) |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL 7.4 plus 3rd party kernel modules |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
When attempting to create a *new* file from a Lustre client node, the client throws the following panic and reboots. After the reboot, the file whose creation was attempted is present with 0 bytes and can be written to. Creating new directories works fine, and existing files can be updated; only the creation of new files triggers the panic and the reboot. Message from syslogd@10-64-7-142 at Oct 9 16:19:18 ... Message from syslogd@10-64-7-142 at Oct 9 16:19:18 ... Message from syslogd@10-64-7-142 at Oct 9 16:19:19 ... |
| Comments |
| Comment by Shankar Vellaichamy [ 09/Oct/17 ] |
|
Message from syslogd@10-64-7-142 at Oct 9 16:51:16 ... Message from syslogd@10-64-7-142 at Oct 9 16:51:16 ... Message from syslogd@10-64-7-142 at Oct 9 16:51:16 ... |
| Comment by John Hammond [ 09/Oct/17 ] |
|
Could you describe the client configuration used here? In particular, how are gsch and redirfs used? |
| Comment by Shankar Vellaichamy [ 10/Oct/17 ] |
|
Hi John, After stopping ds_agent using /etc/init.d/ds_agent stop, new file creations are successful. Please refer to the attached ds_agent diagnostic package, generated by /opt/ds_agent/dsa_control -d. Could you please help make the creation of new files work from the Lustre client with ds_agent active and running? |
| Comment by Brad Hoagland (Inactive) [ 10/Oct/17 ] |
|
Hi Shankar, Do you have a support agreement for this system with Intel? Regards, Brad |
| Comment by Shankar Vellaichamy [ 10/Oct/17 ] |
|
Hi Brad, We have just started with Lustre and don't have a support agreement with Intel for this system. |
| Comment by Aurelien Degremont (Inactive) [ 25/May/20 ] |
|
For the record, I've faced the same crashes, which ended up being caused by the same software. ds_agent is the Trend Micro Deep Security Agent. It is part of Trend Deep Security, which includes redirfs, a file system layer wrapping real filesystem accesses, and a hooking module, gsch. Here is the LBUG stack trace:
crash> bt
PID: 19992 TASK: ffff9df833f73150 CPU: 1 COMMAND: "touch"
#0 [ffff9df7395b3728] machine_kexec at ffffffffa6466044
#1 [ffff9df7395b3788] __crash_kexec at ffffffffa6522ee2
#2 [ffff9df7395b3858] panic at ffffffffa6b7952c
#3 [ffff9df7395b38d8] lbug_with_loc at ffffffffc03028cb [libcfs]
#4 [ffff9df7395b38f8] ll_d_init at ffffffffc07b9d2e [lustre]
#5 [ffff9df7395b3938] ll_splice_alias at ffffffffc07fdab8 [lustre]
#6 [ffff9df7395b3980] ll_lookup_it_finish at ffffffffc07fdca3 [lustre]
#7 [ffff9df7395b3a48] ll_lookup_it at ffffffffc07ff576 [lustre]
#8 [ffff9df7395b3b08] ll_atomic_open at ffffffffc07ffcf7 [lustre]
#9 [ffff9df7395b3bc8] rfs_atomic_open at ffffffffc075622b [redirfs]
#10 [ffff9df7395b3c70] do_last at ffffffffa665d803
#11 [ffff9df7395b3d20] path_openat at ffffffffa665e1bd
#12 [ffff9df7395b3db8] do_filp_open at ffffffffa666040d
#13 [ffff9df7395b3e90] do_sys_open at ffffffffa664bfe4
#14 [ffff9df7395b3ef0] sys_open at ffffffffa664c0fe
#15 [ffff9df7395b3f00] gsch_open_hook_fn at ffffffffc088a4fa [gsch]
#16 [ffff9df7395b3f50] system_call_fastpath at ffffffffa6b92ed2
RIP: 00007fb6a78d3760 RSP: 00007ffee9c8e958 RFLAGS: 00010202
RAX: 0000000000000002 RBX: 00007ffee9c8ecf8 RCX: 0000000000000037
RDX: 00000000000001b6 RSI: 0000000000000941 RDI: 00007ffee9c9056f
RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000000
R10: 00007ffee9c8dea0 R11: 0000000000000246 R12: 00007ffee9c9056f
R13: 00007fb6a7bab2a0 R14: 0000000000000001 R15: 0000000000000000
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b
(Note 'gsch' at the start of the call chain, in frame #15, and 'redirfs' in frame #9.) This hook, or redirfs, replaces the dentry's struct dentry_operations with its own copy in which some callbacks are swapped for its own, which makes the LASSERT fail.
crash> p &ll_d_ops
$7 = (const struct dentry_operations *) 0xffffffffc0824200 <ll_d_ops>
...
crash> struct dentry_operations 0xffff9df83256e040
struct dentry_operations {
d_revalidate = 0xffffffffc07538e0,
d_weak_revalidate = 0x0,
d_hash = 0x0,
d_compare = 0xffffffffc07b9770 <ll_dcompare>,
d_delete = 0xffffffffc07b94e0 <ll_ddelete>,
d_release = 0xffffffffc0753b30,
d_prune = 0x0,
d_iput = 0xffffffffc07536a0,
d_dname = 0x0,
d_automount = 0x0,
{
d_manage = 0x0,
__UNIQUE_ID_rh_kabi_hide17 = {
d_manage = 0x0
},
{<No data fields>}
}
}
ffffffffc07538e0 (t) rfs_d_revalidate [redirfs]
ffffffffc0753b30 (t) rfs_d_release [redirfs]
ffffffffc07536a0 (t) rfs_d_iput [redirfs]
This kind of low-level hooking is definitely a bad design in my opinion, but I was wondering whether this LASSERT is really useful and whether removing it would be enough to make this software work.
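To illustrate the failure mode described above, here is a minimal user-space sketch in C. It is not Lustre or redirfs code. The struct definitions are simplified stand-ins for the kernel types, the callback names ll_revalidate, ll_release, rfs_revalidate and rfs_release and the helpers wrap_dentry_ops and dentry_has_lustre_ops are hypothetical, and it assumes the LASSERT is an identity check of the dentry's ops pointer against &ll_d_ops, which is what the crash output suggests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for kernel types; NOT the real kernel definitions. */
struct dentry;

struct dentry_operations {
    int  (*d_revalidate)(struct dentry *, unsigned int);
    int  (*d_delete)(const struct dentry *);
    void (*d_release)(struct dentry *);
};

struct dentry {
    const struct dentry_operations *d_op;
};

/* Hypothetical file-system callbacks (names chosen for illustration). */
int  ll_revalidate(struct dentry *d, unsigned int flags) { (void)d; (void)flags; return 1; }
int  ll_ddelete(const struct dentry *d) { (void)d; return 0; }
void ll_release(struct dentry *d) { (void)d; }

/* The file system's single, static ops table. */
const struct dentry_operations ll_d_ops = {
    .d_revalidate = ll_revalidate,
    .d_delete     = ll_ddelete,
    .d_release    = ll_release,
};

/* Hypothetical wrapper callbacks that forward to the original ones. */
int rfs_revalidate(struct dentry *d, unsigned int flags)
{
    return ll_d_ops.d_revalidate(d, flags);
}

void rfs_release(struct dentry *d)
{
    ll_d_ops.d_release(d);
}

/*
 * Mimic the hooking layer: copy the ops table into caller-provided storage,
 * replace some callbacks, and point the dentry at the copy.
 */
void wrap_dentry_ops(struct dentry *d, struct dentry_operations *copy)
{
    assert(d != NULL && copy != NULL && d->d_op != NULL);
    *copy = *d->d_op;                     /* untouched callbacks are kept */
    copy->d_revalidate = rfs_revalidate;  /* hooked */
    copy->d_release    = rfs_release;     /* hooked */
    d->d_op = copy;                       /* pointer no longer equals &ll_d_ops */
}

/* Assumed shape of the check: pointer identity, not callback equality. */
bool dentry_has_lustre_ops(const struct dentry *d)
{
    return d->d_op == &ll_d_ops;
}
```

Calling dentry_has_lustre_ops() on a dentry initialised with &ll_d_ops returns true. After wrap_dentry_ops() it returns false, even though d_delete still points at ll_ddelete, which mirrors the mixed table shown in the crash output. Whether relaxing such a check is safe depends on what else in the file system relies on the original table, which this sketch does not model.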
|