[LU-5928] sanity test_133f timeout: lprocfs_exp_nodemap_seq_show() oops Created: 17/Nov/14  Updated: 20/Jan/15  Resolved: 05/Jan/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Major
Reporter: Sarah Liu Assignee: Dmitry Eremin (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Before upgrade: system is 2.6.0
After upgrade: MDS and OSS are lustre-master build #2733; clients are 2.6.0


Issue Links:
Related
is related to LU-5963: wrong hashing in LU-4647 (Resolved)
Severity: 3
Rank (Obsolete): 16556

 Description   

After a rolling upgrade of the OSS and MDS from 2.6.0 to master (clients still on 2.6.0), the MDS oopses when running sanity subtest 133f.

MDS console:

Lustre: DEBUG MARKER: == sanity test 133f: Check for LBUGs/Oopses/unreadable files in /proc == 14:30:46 (1416263446)
general protection fault: 0000 [#1] SMP 
last sysfs file: /sys/devices/pci0000:ff/0000:ff:13.6/irq
CPU 25 
Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) mgc(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) jbd2 lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic sha256_generic crc32c_intel libcfs(U) nfsd exportfs autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 iTCO_wdt iTCO_vendor_support microcode sg sb_edac edac_core i2c_i801 lpc_ich mfd_core ioatdma mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core igb dca i2c_algo_bit i2c_core ptp pps_core shpchp ext3 jbd mbcache sd_mod crc_t10dif isci libsas scsi_transport_sas ahci wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: llog_test]

Pid: 68277, comm: cat Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1 Intel Corporation S2600GZ/S2600GZ
RIP: 0010:[<ffffffff8128af39>]  [<ffffffff8128af39>] strnlen+0x9/0x40
RSP: 0018:ffff880823671ca8  EFLAGS: 00010286
RAX: ffffffff817b5db4 RBX: ffff880820cda000 RCX: 0000000000000002
RDX: 5f6b636568635f63 RSI: ffffffffffffffff RDI: 5f6b636568635f63
RBP: ffff880823671ca8 R08: 0000000000000073 R09: 0000000000000000
R10: 0000000000000002 R11: 0000000000000246 R12: ffff880820cd9000
R13: 5f6b636568635f63 R14: 00000000ffffffff R15: 0000000000000000
FS:  00007fd2dee1c700(0000) GS:ffff88044e520000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000228a038 CR3: 000000082201c000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process cat (pid: 68277, threadinfo ffff880823670000, task ffff88081eb74aa0)
Stack:
 ffff880823671ce8 ffffffff8128c1f0 0000000000000000 ffff880820cd9000
<d> ffffffffa0603a59 ffffffffa0603a57 ffff880823671db8 ffff880820cda000
<d> ffff880823671d88 ffffffff8128d638 0000000000000004 0000000affffffff
Call Trace:
 [<ffffffff8128c1f0>] string+0x40/0x100
 [<ffffffff8128d638>] vsnprintf+0x218/0x5e0
 [<ffffffff811f3cc0>] ? proc_reg_open+0x0/0x160
 [<ffffffff812284cf>] ? security_inode_permission+0x1f/0x30
 [<ffffffff811ae952>] seq_vprintf+0x32/0x60
 [<ffffffff811ae9bd>] seq_printf+0x3d/0x40
 [<ffffffff811aeac0>] ? seq_read+0x0/0x400
 [<ffffffffa05dc5e7>] lprocfs_exp_nodemap_seq_show+0x27/0x40 [obdclass]
 [<ffffffff811aebb2>] seq_read+0xf2/0x400
 [<ffffffff811f42fe>] proc_reg_read+0x7e/0xc0
 [<ffffffff81189a75>] vfs_read+0xb5/0x1a0
 [<ffffffff81189bb1>] sys_read+0x51/0x90
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Code: 66 90 48 83 c2 01 80 3a 00 75 f7 48 89 d0 48 29 f8 c9 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 f6 48 89 e5 74 2e <80> 3f 00 74 29 48 83 ee 01 48 89 f8 eb 12 66 0f 1f 84 00 00 00 
RIP  [<ffffffff8128af39>] strnlen+0x9/0x40
 RSP <ffff880823671ca8>
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Linux version 2.6.32-431.29.2.el6_lustre.x86_64 (jenkins@onyx-7-sdf1-el6-x8664.onyx.hpdd.intel.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-11) (GCC) ) #1 SMP Thu Nov 6 12:02:29 PST 2014
Command line: ro root=UUID=c301d149-2cdb-40cc-aaad-0c2a8780bdf7 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD console=tty0 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM console=ttyS0,115200 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off disable_cpu_apicid=0 memmap=exactmap memmap=574K@4K memmap=133550K@49726K elfcorehdr=183276K memmap=4K$0K memmap=62K$578K memmap=128K$896K memmap=42200K$3067876K memmap=992K#3110076K memmap=488K#3111068K memmap=568K#3111556K memmap=516K#3112124K memmap=294912K$3112960K memmap=4K$4173824K memmap=4K$4174948K memmap=16K$4174960K memmap=4K$4175872K memmap=6016K$4188288K


 Comments   
Comment by Jodi Levi (Inactive) [ 18/Nov/14 ]

Dmitry,
Could you please comment on this one?
Thank you!

Comment by Andreas Dilger [ 02/Dec/14 ]

Dmitry,
Could you please check if this strnlen() call has been fixed yet?

Comment by James A Simmons [ 03/Dec/14 ]

Tested the patch from LU-5963 and it appears to stop the MDS from crashing.

Comment by Dmitry Eremin (Inactive) [ 04/Dec/14 ]

This issue had not been fixed yet (prior to the patch from LU-5963). This oops can only happen when accessing an invalid export structure (already freed, for example). According to the commits in this build, the relevant functionality was not touched. The patch from LU-5963 should fix this issue.

James, thanks for testing. It looks like this proves it.

Comment by Dmitry Eremin (Inactive) [ 10/Dec/14 ]

After testing with the 2.6.91 tag I was no longer able to reproduce this issue. So the patch from LU-5963 solves this issue as well.

Comment by John Hammond [ 17/Dec/14 ]

I see this on 2.6.91-30-g12bfc80 with CONFIG_DEBUG_SLAB=y. It reproduces easily on a single-node setup by running sanity with test_132 skipped. (Note that test_132 unmounts and remounts everything.) I got a crash dump:

[ 1299.647397] Lustre: DEBUG MARKER: == sanity test 133f: Check for LBUGs/Oopses/unreadable files in /proc == 08:40:04 (1418827204)
[ 1300.544545] general protection fault: 0000 [#1] SMP
...
[ 1300.545013]
[ 1300.545013] Pid: 610, comm: cat Not tainted 2.6.32-431.29.2.el6.lustre.x86_64 #1 Bochs Bochs
[ 1300.545013] RIP: 0010:[<ffffffff812aa369>]  [<ffffffff812aa369>] strnlen+0x9/0x40
[ 1300.545013] RSP: 0018:ffff88008706fca8  EFLAGS: 00010286
[ 1300.545013] RAX: ffffffff817bdc45 RBX: ffff880094fa1000 RCX: 0000000000000002
[ 1300.545013] RDX: 6b6b6b6b6b6b6b6b RSI: ffffffffffffffff RDI: 6b6b6b6b6b6b6b6b
[ 1300.545013] RBP: ffff88008706fca8 R08: 0000000000000073 R09: ffff880094fa0000
[ 1300.545013] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880094fa0000
[ 1300.545013] R13: 6b6b6b6b6b6b6b6b R14: 00000000ffffffff R15: 0000000000000000
[ 1300.545013] FS:  00007fa56dce3700(0000) GS:ffff88002c000000(0000) knlGS:0000000000000000
[ 1300.545013] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1300.545013] CR2: 0000000002624038 CR3: 000000003734c000 CR4: 00000000000006f0
[ 1300.545013] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1300.545013] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1300.545013] Process cat (pid: 610, threadinfo ffff88008706e000, task ffff88009db146c0)
[ 1300.545013] Stack:
[ 1300.545013]  ffff88008706fce8 ffffffff812ab620 0000000000000000 ffff880094fa0000
[ 1300.545013] <d> ffffffffa07073b2 ffffffffa07073b0 ffff88008706fdb8 ffff880094fa1000
[ 1300.545013] <d> ffff88008706fd88 ffffffff812aca68 0000000000000004 0000000affffffff
[ 1300.545013] Call Trace:
[ 1300.545013]  [<ffffffff812ab620>] string+0x40/0x100
[ 1300.545013]  [<ffffffff812aca68>] vsnprintf+0x218/0x5e0
[ 1300.545013]  [<ffffffff811c912f>] ? seq_read+0x3f/0x420
[ 1300.545013]  [<ffffffff8120f630>] ? proc_reg_open+0x0/0x160
[ 1300.545013]  [<ffffffff811c912f>] ? seq_read+0x3f/0x420
[ 1300.545013]  [<ffffffff811c8f72>] seq_vprintf+0x32/0x60
[ 1300.545013]  [<ffffffff811c8fdd>] seq_printf+0x3d/0x40
[ 1300.545013]  [<ffffffff81186e32>] ? kmem_cache_alloc_trace+0x142/0x2b0
[ 1300.545013]  [<ffffffffa06ecb9a>] lprocfs_exp_nodemap_seq_show+0x2a/0x40 [obdclass]
[ 1300.545013]  [<ffffffff811c91e8>] seq_read+0xf8/0x420
[ 1300.545013]  [<ffffffff811c90f0>] ? seq_read+0x0/0x420
[ 1300.545013]  [<ffffffff8120fc75>] proc_reg_read+0x85/0xc0
[ 1300.545013]  [<ffffffff811a3655>] vfs_read+0xb5/0x1a0
[ 1300.545013]  [<ffffffff81555e68>] ? lockdep_sys_exit_thunk+0x35/0x67
[ 1300.545013]  [<ffffffff811a3791>] sys_read+0x51/0x90
[ 1300.545013]  [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[ 1300.545013] Code: 66 90 48 83 c2 01 80 3a 00 75 f7 48 89 d0 48 29 f8 c9 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 f6 48 89 e5 74 2e <80> 3f 00 74 29 48 83 ee 01 48 89 f8 eb 12 66 0f 1f 84 00 00 00
[ 1300.545013] RIP  [<ffffffff812aa369>] strnlen+0x9/0x40
[ 1300.545013]  RSP <ffff88008706fca8>
crash> files
PID: 610    TASK: ffff88009db146c0  CPU: 0   COMMAND: "cat"
ROOT: /    CWD: /root/lustre-release
 FD       FILE            DENTRY           INODE       TYPE PATH
  0 ffff88011afac2f8 ffff880112b1e4d0 ffff880112b5f048 CHR  /dev/pts/0
  1 ffff88009ae25960 ffff88011d602188 ffff88011dc72180 CHR  /dev/null
  2 ffff88009ae25960 ffff88011d602188 ffff88011dc72180 CHR  /dev/null
  3 ffff880087604bf0 ffff880112bb1930 ffff880112bb7580 DIR  /root/lustre-release
  4 ffff88009ae25588 ffff880112bb1930 ffff880112bb7580 DIR  /root/lustre-release
  5 ffff8801179556d0 ffff88008ebf9818 ffff8800d57e90b8 REG  /proc/fs/lustre/obdfilter/lustre-OST0000/exports/0@lo/nodemap

In lprocfs_exp_nodemap_seq_show() the export was poisoned (already freed).

It's interesting to note that in lprocfs_exp_setup() the nodemap file uses the export as its private data, whereas the other files use the nid_stats.

        entry = lprocfs_add_simple(new_stat->nid_proc, "nodemap",
                                   exp, &lprocfs_exp_nodemap_fops);

Probably we shouldn't be doing this. Can't the export be changed while the nid_stats stay the same?

Comment by Andreas Dilger [ 23/Dec/14 ]

Found this on a few sanity.sh tests in maloo:
https://testing.hpdd.intel.com/test_sets/04b81a90-8a73-11e4-bd67-5254006e85c2
https://testing.hpdd.intel.com/test_sets/f111bae8-8a4d-11e4-9b09-5254006e85c2
https://testing.hpdd.intel.com/test_sets/9ab98254-8a7d-11e4-bd67-5254006e85c2

05:12:19:Process cat (pid: 15865, threadinfo ffff88002f774000, task ffff880071243540)
05:12:19:Call Trace:
05:12:19: [<ffffffff8128c1f0>] string+0x40/0x100
05:12:19: [<ffffffff8128d638>] vsnprintf+0x218/0x5e0
05:12:19: [<ffffffff811ae952>] seq_vprintf+0x32/0x60
05:12:19: [<ffffffff811ae9bd>] seq_printf+0x3d/0x40
05:12:19: [<ffffffffa0777117>] lprocfs_exp_nodemap_seq_show+0x27/0x40 [obdclass]
05:12:19: [<ffffffff811aebb2>] seq_read+0xf2/0x400
05:12:19: [<ffffffff811f42fe>] proc_reg_read+0x7e/0xc0
05:12:19: [<ffffffff81189a75>] vfs_read+0xb5/0x1a0
05:12:19: [<ffffffff81189bb1>] sys_read+0x51/0x90
Comment by Gerrit Updater [ 23/Dec/14 ]

Dmitry Eremin (dmitry.eremin@intel.com) uploaded a new patch: http://review.whamcloud.com/13175
Subject: LU-5928 obdclass: fix export of nodemap to procfs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 209a7aadd4d6457d3a4a801248a40624d1ebb7dd

Comment by James A Simmons [ 23/Dec/14 ]

Thank you Dmitry for fixing this. I was seeing this in testing and it was a show stopper for me.

Comment by Dmitry Eremin (Inactive) [ 23/Dec/14 ]

Thanks James. Does this patch solve the issue for you?

Comment by James A Simmons [ 23/Dec/14 ]

Yes, it helps with 133f, but test 133g always breaks my system. That is a different problem, which I just tracked down to another patch.

Comment by nasf (Inactive) [ 24/Dec/14 ]

Another failure instance:
https://testing.hpdd.intel.com/test_sets/9a3310c8-8acf-11e4-bd67-5254006e85c2

Comment by Gerrit Updater [ 04/Jan/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13175/
Subject: LU-5928 obdclass: fix export of nodemap to procfs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d68d56c00e7788a445f94b6e099f7f8b9e259bae

Comment by Sarah Liu [ 20/Jan/15 ]

Didn't hit this issue with build #2810, which includes the fix.

Generated at Sat Feb 10 01:55:43 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.