Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.5.3
-
RHEL 6 Bull kernel 2.6.32-642.4.2.el6.Bull.100.x86_64
Lustre build based on 2.5.3.90
-
3
-
9223372036854775807
Description
One Lustre client crashed right after a recovery with the following messages:
2016-10-29 16:00:01 Lustre: DEBUG MARKER: Sat Oct 29 16:00:01 2016 2016-10-29 16:00:01 2016-10-29 16:01:07 Lustre: 17766:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1477749640/real 1477749640] req@ffff8811e1b9a400 x1547883906151520/t0(0) o101->scratch3-MDT0000-mdc-ffff88205ec91400@JO.BOO.AL.IL@o2ib2:12/10 lens 632/1136 e 0 to 1 dl 1477749667 ref 2 fl Rpc:XP/0/ffffffff rc 0/-1 2016-10-29 16:01:07 Lustre: 17766:0:(client.c:1942:ptlrpc_expire_one_request()) Skipped 12 previous similar messages 2016-10-29 16:01:07 Lustre: scratch3-MDT0000-mdc-ffff88205ec91400: Connection to scratch3-MDT0000 (at JO.BOO.AL.IL@o2ib2) was lost; in progress operations using this service will wait for recovery to complet e 2016-10-29 16:01:07 Lustre: Skipped 12 previous similar messages 2016-10-29 16:01:07 Lustre: scratch3-MDT0000-mdc-ffff88205ec91400: Connection restored to scratch3-MDT0000 (at JO.BOO.AL.IL@o2ib2) 2016-10-29 16:01:07 Lustre: Skipped 12 previous similar messages 2016-10-29 16:01:07 LustreError: 17766:0:(namei.c:816:ll_create_node()) ASSERTION( list_empty(&inode->i_dentry) ) failed: 2016-10-29 16:01:07 LustreError: 17766:0:(namei.c:816:ll_create_node()) LBUG
The stack is as follows:
PID: 17766 TASK: ffff8819ab5b0ab0 CPU: 6 COMMAND: "rsync" #0 [ffff88189c697b20] machine_kexec at ffffffff8103ff4b #1 [ffff88189c697b80] crash_kexec at ffffffff810cfce2 #2 [ffff88189c697c50] panic at ffffffff81546ce9 #3 [ffff88189c697cd0] lbug_with_loc at ffffffffa067beeb [libcfs] #4 [ffff88189c697cf0] ll_create_nd at ffffffffa0dbf854 [lustre] #5 [ffff88189c697d70] vfs_create at ffffffff811a7946 #6 [ffff88189c697db0] do_filp_open at ffffffff811ab75e #7 [ffff88189c697f20] do_sys_open at ffffffff81194e87 #8 [ffff88189c697f70] sys_open at ffffffff81194f90 #9 [ffff88189c697f80] system_call_fastpath at ffffffff8100b0d2 RIP: 0000003c88adb480 RSP: 00007fffd85b6bc8 RFLAGS: 00010246 RAX: 0000000000000002 RBX: ffffffff8100b0d2 RCX: 0000000000000001 RDX: 0000000000000180 RSI: 00000000000000c2 RDI: 00007fffd85b7d70 RBP: 00007fffd85b7d9b R8: 7a672e7761722e30 R9: 00000000ffffffff R10: 0000000000000001 R11: 0000000000000246 R12: ffffffff81194f90 R13: ffff88189c697f78 R14: 00007fffd85b7d9c R15: 0000000000000000 ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b
Even though the intent is to create the inode with a dentry in the filesystem namespace (/somepath/.tmpfile), the inode structure field i_dentry is populated with one dentry pointing at /.lustre/fid/[0x298cd542a:0x3b3d:0x0].
Strangely enough, even though the crash occured in ll_create_nd() -> ll_create_it() -> ll_create_node, before the call to d_instantiate() in ll_create_nd() -> ll_create_it(), the fid actually really points back to the requested location:
# lfs fid2path /mountpoint [0x298cd542a:0x3b3d:0x0] /somepath/.tmpfile
I can't upload the crashdump due to site restriction, but will be happy to give more informations as you request them.
Hi Bruno,
Here is the list of patches we apply to our servers kernel:
0003-Applied-Bull_lmigr-x86_64.el6.0.7.patch-for-RHEL-2.6.patch
Patch which delays migration of pages between numa nodes
0006-Applied-Bull_311740-GENERIC_ALLOCATOR-4.patch
Include generic memory allocator for x86 architecture.
0008-Applied-Bull_Dynamic_Power_Limitation_2.6.32-573.patch
CState manipulation interface through /sys
0011-Applied-Bull_mce75xx_2.6.32-504.diff.patch
Intel Xeon 7500 series corrected memory error driver
0012-Applied-Bull_improve-x86_64.csum.patch.patch
Performance improvement on TCP/IP checksumming
0017-Applied-raid5-mmp-unplug-dev-rhel6.Bull.3.patch
Forces sync reads on MD devices' disks for MMP compatibility
0018-lustre-blkdev_tunables-2.6-rhel6.patch.patch
Some block device parameters value modifications
0021-lustre-bh_lru_size_config.patch.patch
Buffer head per-cpu LRU size increased from 8 to 16
0022-quota-replace-dqptr-sem.patch
Replaces dqptr semaphore by spin locks (lustre server kernel patch)
0029-Applied-proc_pid_limits.patch.patch
Ease use of writes to /proc/<pid>/limits file
0030-Applied-patch-Bull_oomprotect.diff.patch
Patch to protect processes from root to be killed by oom
0055-quota-avoid-dqget-calls.patch
(lustre server kernel patch)
0058-fix-CONFIG_XEN.patch
Some modifications to have kvm work when XEN is not enabled
0059-ocfs2-getname.patch
fix getname/putname changes in ocfs2
0090-kernel_H-2.6.32-573.1.1-config-x86_64.patch
0091-kernel_H-2.6.32-573.1.1-config-x86_64.debug.patch
Some modifications to the default RHEL kernel configuration.
For the lustre code, this is based on 2.5.3.90 (commit 35bb8577c83f787451cd9842a07ac670cb6f4d3f), with some patches that I will send you by mail.