[LU-5928] sanity test_133f timeout: lprocfs_exp_nodemap_seq_show() oops Created: 17/Nov/14 Updated: 20/Jan/15 Resolved: 05/Jan/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Sarah Liu | Assignee: | Dmitry Eremin (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Before the upgrade, the system was running 2.6.0. |
||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 16556 |
| Description |
|
MDS console:
Lustre: DEBUG MARKER: == sanity test 133f: Check for LBUGs/Oopses/unreadable files in /proc == 14:30:46 (1416263446)
general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:ff/0000:ff:13.6/irq
CPU 25
Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) mgc(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) jbd2 lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic sha256_generic crc32c_intel libcfs(U) nfsd exportfs autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 iTCO_wdt iTCO_vendor_support microcode sg sb_edac edac_core i2c_i801 lpc_ich mfd_core ioatdma mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core igb dca i2c_algo_bit i2c_core ptp pps_core shpchp ext3 jbd mbcache sd_mod crc_t10dif isci libsas scsi_transport_sas ahci wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: llog_test]
Pid: 68277, comm: cat Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1 Intel Corporation S2600GZ/S2600GZ
RIP: 0010:[<ffffffff8128af39>] [<ffffffff8128af39>] strnlen+0x9/0x40
RSP: 0018:ffff880823671ca8 EFLAGS: 00010286
RAX: ffffffff817b5db4 RBX: ffff880820cda000 RCX: 0000000000000002
RDX: 5f6b636568635f63 RSI: ffffffffffffffff RDI: 5f6b636568635f63
RBP: ffff880823671ca8 R08: 0000000000000073 R09: 0000000000000000
R10: 0000000000000002 R11: 0000000000000246 R12: ffff880820cd9000
R13: 5f6b636568635f63 R14: 00000000ffffffff R15: 0000000000000000
FS: 00007fd2dee1c700(0000) GS:ffff88044e520000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000228a038 CR3: 000000082201c000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process cat (pid: 68277, threadinfo ffff880823670000, task ffff88081eb74aa0)
Stack:
 ffff880823671ce8 ffffffff8128c1f0 0000000000000000 ffff880820cd9000
<d> ffffffffa0603a59 ffffffffa0603a57 ffff880823671db8 ffff880820cda000
<d> ffff880823671d88 ffffffff8128d638 0000000000000004 0000000affffffff
Call Trace:
 [<ffffffff8128c1f0>] string+0x40/0x100
 [<ffffffff8128d638>] vsnprintf+0x218/0x5e0
 [<ffffffff811f3cc0>] ? proc_reg_open+0x0/0x160
 [<ffffffff812284cf>] ? security_inode_permission+0x1f/0x30
 [<ffffffff811ae952>] seq_vprintf+0x32/0x60
 [<ffffffff811ae9bd>] seq_printf+0x3d/0x40
 [<ffffffff811aeac0>] ? seq_read+0x0/0x400
 [<ffffffffa05dc5e7>] lprocfs_exp_nodemap_seq_show+0x27/0x40 [obdclass]
 [<ffffffff811aebb2>] seq_read+0xf2/0x400
 [<ffffffff811f42fe>] proc_reg_read+0x7e/0xc0
 [<ffffffff81189a75>] vfs_read+0xb5/0x1a0
 [<ffffffff81189bb1>] sys_read+0x51/0x90
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Code: 66 90 48 83 c2 01 80 3a 00 75 f7 48 89 d0 48 29 f8 c9 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 f6 48 89 e5 74 2e <80> 3f 00 74 29 48 83 ee 01 48 89 f8 eb 12 66 0f 1f 84 00 00 00
RIP [<ffffffff8128af39>] strnlen+0x9/0x40
 RSP <ffff880823671ca8>
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Linux version 2.6.32-431.29.2.el6_lustre.x86_64 (jenkins@onyx-7-sdf1-el6-x8664.onyx.hpdd.intel.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-11) (GCC) ) #1 SMP Thu Nov 6 12:02:29 PST 2014
Command line: ro root=UUID=c301d149-2cdb-40cc-aaad-0c2a8780bdf7 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD console=tty0 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM console=ttyS0,115200 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off disable_cpu_apicid=0 memmap=exactmap memmap=574K@4K memmap=133550K@49726K elfcorehdr=183276K memmap=4K$0K memmap=62K$578K memmap=128K$896K memmap=42200K$3067876K memmap=992K#3110076K memmap=488K#3111068K memmap=568K#3111556K memmap=516K#3112124K memmap=294912K$3112960K memmap=4K$4173824K memmap=4K$4174948K memmap=16K$4174960K memmap=4K$4175872K memmap=6016K$4188288K |
| Comments |
| Comment by Jodi Levi (Inactive) [ 18/Nov/14 ] |
|
Dmitry, |
| Comment by Andreas Dilger [ 02/Dec/14 ] |
|
Dmitry, |
| Comment by James A Simmons [ 03/Dec/14 ] |
|
Tested the patch from |
| Comment by Dmitry Eremin (Inactive) [ 04/Dec/14 ] |
|
This issue was not fixed yet (the referenced patch had not landed). James, thanks for testing; it looks like this proves it. |
| Comment by Dmitry Eremin (Inactive) [ 10/Dec/14 ] |
|
After testing with the 2.6.91 tag I was not able to reproduce this issue any more. So, the patch from |
| Comment by John Hammond [ 17/Dec/14 ] |
|
I see this on 2.6.91-30-g12bfc80 with CONFIG_DEBUG_SLAB=y. It reproduces easily on a single-node setup by running sanity with test_132 skipped. (Note that test_132 unmounts and remounts everything.) I got a crash dump:
[ 1299.647397] Lustre: DEBUG MARKER: == sanity test 133f: Check for LBUGs/Oopses/unreadable files in /proc == 08:40:04 (1418827204)
[ 1300.544545] general protection fault: 0000 [#1] SMP
...
[ 1300.545013]
[ 1300.545013] Pid: 610, comm: cat Not tainted 2.6.32-431.29.2.el6.lustre.x86_64 #1 Bochs\ Bochs
[ 1300.545013] RIP: 0010:[<ffffffff812aa369>] [<ffffffff812aa369>] strnlen+0x9/0x40
[ 1300.545013] RSP: 0018:ffff88008706fca8 EFLAGS: 00010286
[ 1300.545013] RAX: ffffffff817bdc45 RBX: ffff880094fa1000 RCX: 0000000000000002
[ 1300.545013] RDX: 6b6b6b6b6b6b6b6b RSI: ffffffffffffffff RDI: 6b6b6b6b6b6b6b6b
[ 1300.545013] RBP: ffff88008706fca8 R08: 0000000000000073 R09: ffff880094fa0000
[ 1300.545013] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880094fa0000
[ 1300.545013] R13: 6b6b6b6b6b6b6b6b R14: 00000000ffffffff R15: 0000000000000000
[ 1300.545013] FS: 00007fa56dce3700(0000) GS:ffff88002c000000(0000) knlGS:0000000000000000
[ 1300.545013] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1300.545013] CR2: 0000000002624038 CR3: 000000003734c000 CR4: 00000000000006f0
[ 1300.545013] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1300.545013] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1300.545013] Process cat (pid: 610, threadinfo ffff88008706e000, task ffff88009db146c0)
[ 1300.545013] Stack:
[ 1300.545013]  ffff88008706fce8 ffffffff812ab620 0000000000000000 ffff880094fa0000
[ 1300.545013] <d> ffffffffa07073b2 ffffffffa07073b0 ffff88008706fdb8 ffff880094fa1000
[ 1300.545013] <d> ffff88008706fd88 ffffffff812aca68 0000000000000004 0000000affffffff
[ 1300.545013] Call Trace:
[ 1300.545013]  [<ffffffff812ab620>] string+0x40/0x100
[ 1300.545013]  [<ffffffff812aca68>] vsnprintf+0x218/0x5e0
[ 1300.545013]  [<ffffffff811c912f>] ? seq_read+0x3f/0x420
[ 1300.545013]  [<ffffffff8120f630>] ? proc_reg_open+0x0/0x160
[ 1300.545013]  [<ffffffff811c912f>] ? seq_read+0x3f/0x420
[ 1300.545013]  [<ffffffff811c8f72>] seq_vprintf+0x32/0x60
[ 1300.545013]  [<ffffffff811c8fdd>] seq_printf+0x3d/0x40
[ 1300.545013]  [<ffffffff81186e32>] ? kmem_cache_alloc_trace+0x142/0x2b0
[ 1300.545013]  [<ffffffffa06ecb9a>] lprocfs_exp_nodemap_seq_show+0x2a/0x40 [obdclass]
[ 1300.545013]  [<ffffffff811c91e8>] seq_read+0xf8/0x420
[ 1300.545013]  [<ffffffff811c90f0>] ? seq_read+0x0/0x420
[ 1300.545013]  [<ffffffff8120fc75>] proc_reg_read+0x85/0xc0
[ 1300.545013]  [<ffffffff811a3655>] vfs_read+0xb5/0x1a0
[ 1300.545013]  [<ffffffff81555e68>] ? lockdep_sys_exit_thunk+0x35/0x67
[ 1300.545013]  [<ffffffff811a3791>] sys_read+0x51/0x90
[ 1300.545013]  [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[ 1300.545013] Code: 66 90 48 83 c2 01 80 3a 00 75 f7 48 89 d0 48 29 f8 c9 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 f6 48 89 e5 74 2e <80> 3f 00 74 29 48 83 ee 01 48 89 f8 eb 12 66 0f 1f 84 00 00 00
[ 1300.545013] RIP [<ffffffff812aa369>] strnlen+0x9/0x40
[ 1300.545013]  RSP <ffff88008706fca8>

crash> files
PID: 610  TASK: ffff88009db146c0  CPU: 0  COMMAND: "cat"
ROOT: /    CWD: /root/lustre-release
 FD       FILE            DENTRY           INODE       TYPE PATH
  0 ffff88011afac2f8 ffff880112b1e4d0 ffff880112b5f048 CHR  /dev/pts/0
  1 ffff88009ae25960 ffff88011d602188 ffff88011dc72180 CHR  /dev/null
  2 ffff88009ae25960 ffff88011d602188 ffff88011dc72180 CHR  /dev/null
  3 ffff880087604bf0 ffff880112bb1930 ffff880112bb7580 DIR  /root/lustre-release
  4 ffff88009ae25588 ffff880112bb1930 ffff880112bb7580 DIR  /root/lustre-release
  5 ffff8801179556d0 ffff88008ebf9818 ffff8800d57e90b8 REG  /proc/fs/lustre/obdfilter/lustre-OST0000/exports/0@lo/nodemap

In lprocfs_exp_nodemap_seq_show() the export was poisoned (already freed). It's interesting to note that in lprocfs_exp_setup() the nodemap file uses the export as its private data, whereas the other files use the nid_stats:
entry = lprocfs_add_simple(new_stat->nid_proc, "nodemap",
                           exp, &lprocfs_exp_nodemap_fops);
Probably we shouldn't be doing this. Can't the export be changed while the nid_stats stay the same? |
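As an aside, a minimal user-space C model may make the hazard described above clearer: the "nodemap" entry captures the raw export pointer, so a disconnect/reconnect can free the export while the proc file is still readable, whereas an entry keyed off the longer-lived nid_stats object can re-resolve (and pin) whichever export is current at read time. Every identifier below (fake_export, fake_nid_stat, show_safe, ...) is hypothetical; this is a sketch of the pattern, not Lustre code.

/* Simplified user-space model of the lifetime problem above; not Lustre code. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct fake_export {                 /* stands in for struct obd_export */
	char nodemap_name[32];
	int  refcount;
};

struct fake_nid_stat {               /* stands in for the per-NID stats object */
	struct fake_export *exp;     /* updated when the client reconnects */
};

/* Unsafe pattern: the proc entry stored the export pointer itself, so it can
 * keep pointing at freed (slab-poisoned) memory after a reconnect. */
static const char *show_unsafe(struct fake_export *private_exp)
{
	return private_exp->nodemap_name;
}

/* Safer pattern: the proc entry stores the nid_stat and resolves (and pins)
 * the current export only while formatting the output. */
static void show_safe(struct fake_nid_stat *stat, char *buf, size_t len)
{
	struct fake_export *exp = stat->exp;

	if (exp == NULL) {
		snprintf(buf, len, "<no export>");
		return;
	}
	exp->refcount++;              /* pin while we format */
	snprintf(buf, len, "%s", exp->nodemap_name);
	exp->refcount--;
}

int main(void)
{
	struct fake_nid_stat stat = { 0 };
	struct fake_export *exp = calloc(1, sizeof(*exp));
	char buf[64];

	strcpy(exp->nodemap_name, "default");
	stat.exp = exp;

	/* Client disconnects and reconnects: the old export is freed, but a
	 * proc entry that captured "exp" directly would still point at it. */
	stat.exp = NULL;
	free(exp);

	show_safe(&stat, buf, sizeof(buf));
	printf("read via nid_stat after reconnect: %s\n", buf);
	(void)show_unsafe;            /* calling it here would be a use-after-free */
	return 0;
}

This only illustrates why handing the export itself to lprocfs_add_simple() is fragile; what http://review.whamcloud.com/13175 actually changes is defined by the patch, not by this sketch.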
| Comment by Andreas Dilger [ 23/Dec/14 ] |
|
Found this on a few sanity.sh tests in maloo:
05:12:19:Process cat (pid: 15865, threadinfo ffff88002f774000, task ffff880071243540)
05:12:19:Call Trace:
05:12:19: [<ffffffff8128c1f0>] string+0x40/0x100
05:12:19: [<ffffffff8128d638>] vsnprintf+0x218/0x5e0
05:12:19: [<ffffffff811ae952>] seq_vprintf+0x32/0x60
05:12:19: [<ffffffff811ae9bd>] seq_printf+0x3d/0x40
05:12:19: [<ffffffffa0777117>] lprocfs_exp_nodemap_seq_show+0x27/0x40 [obdclass]
05:12:19: [<ffffffff811aebb2>] seq_read+0xf2/0x400
05:12:19: [<ffffffff811f42fe>] proc_reg_read+0x7e/0xc0
05:12:19: [<ffffffff81189a75>] vfs_read+0xb5/0x1a0
05:12:19: [<ffffffff81189bb1>] sys_read+0x51/0x90 |
| Comment by Gerrit Updater [ 23/Dec/14 ] |
|
Dmitry Eremin (dmitry.eremin@intel.com) uploaded a new patch: http://review.whamcloud.com/13175 |
| Comment by James A Simmons [ 23/Dec/14 ] |
|
Thank you, Dmitry, for fixing this. I was seeing this in testing and it was a showstopper for me. |
| Comment by Dmitry Eremin (Inactive) [ 23/Dec/14 ] |
|
Thanks James. Does this patch solve the issue for you? |
| Comment by James A Simmons [ 23/Dec/14 ] |
|
Yes, it helps with 133f, but test 133g always breaks my system. That is related to a different problem, which I just tracked down to another patch. |
| Comment by nasf (Inactive) [ 24/Dec/14 ] |
|
Another failure instance: |
| Comment by Gerrit Updater [ 04/Jan/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13175/ |
| Comment by Sarah Liu [ 20/Jan/15 ] |
|
Didn't hit this issue with build #2810, which includes the fix. |