[LU-1863] Test failure with MDS spontaneous rebooting (test suite sanity, subtest test_32n) Created: 08/Sep/12 Updated: 31/Jul/13 Resolved: 31/Jul/13
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Zhenyu Xu |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | USE_OFD, zfs |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4270 |
| Description |
|
This issue was created by maloo for Minh Diep <mdiep@whamcloud.com>. This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/f3243a50-f9b2-11e1-b8d8-52540035b04c. The sub-test test_32n failed with the following error:
The console on the OSS shows it rebooted for no apparent reason, and conman failed to capture the issue.
Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 32n: open d32n/symlink->tmp/symlink->lustre-root ====================================== 16:04:09 (1346972649)
Info required for matching: sanity 32n |
| Comments |
| Comment by Peter Jones [ 09/Sep/12 ] |
|
Not just limited to OFD runs: https://maloo.whamcloud.com/test_sets/957f03fe-fa6f-11e1-887d-52540035b04c |
| Comment by Peter Jones [ 09/Sep/12 ] |
|
Bobijam, could you please look into this failure? Thanks, Peter |
| Comment by Zhenyu Xu [ 09/Sep/12 ] |
|
Something happened on the MDS that took it out of service; there is no log from the MDS, though. I'll try to reproduce it on the Toro nodes. |
| Comment by Jian Yu [ 10/Sep/12 ] |
|
Hi Bobi, FYI, I also found some failures in the latest b2_3 sessions related to the MDS out-of-service issue. More failures are reported in |
| Comment by Andreas Dilger [ 10/Sep/12 ] |
|
I think the conf-sanity.sh test_45 failure reported above should be filed in a separate bug. I think it is just a failure in that test, and does not contain any report of the MDS spontaneous rebooting. |
| Comment by Jian Yu [ 11/Sep/12 ] |
My fault. I saw "Connection to lustre-MDT0000 (at 10.10.4.182@tcp) was lost;" on the client node and thought that the MDS was out of service, but did not realize that the MDS had been unmounted beforehand.
| Comment by Jian Yu [ 12/Sep/12 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/17 Still failed: https://maloo.whamcloud.com/test_sets/d8e2ddcc-fc82-11e1-a4a6-52540035b04c |
| Comment by Jian Yu [ 12/Sep/12 ] |
|
Lustre Server Build: http://build.whamcloud.com/job/lustre-b2_3/17
sanity test 32n also hung: https://maloo.whamcloud.com/test_sets/47eebb26-fbf9-11e1-a4a6-52540035b04c
However, this time I found the following messages in the console log of the MDS (client-20):
00:15:57:Lustre: DEBUG MARKER: == sanity test 32n: open d32n/symlink->tmp/symlink->lustre-root ======== 00:15:55 (1347347755)
00:15:57:------------[ cut here ]------------
00:15:57:WARNING: at kernel/sched_fair.c:132 load_balance_next_fair+0x6a/0x80() (Not tainted)
00:15:57:Hardware name: X8DTT
00:17:18:Modules linked in: nfs fscache cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) ... |
| Comment by Jian Yu [ 13/Sep/12 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/17
Only running sanity test 32 passed: https://maloo.whamcloud.com/test_sets/6b92dd2a-fd9b-11e1-afe5-52540035b04c
Running sanity from the start hung at test 32g: https://maloo.whamcloud.com/test_sets/a52af1ca-fda1-11e1-afe5-52540035b04c
Console log on the MDS (fat-intel-2):
Lustre: DEBUG MARKER: == sanity test 32g: stat d32g/symlink->tmp/symlink->lustre-subdir/2 == 05:39:13 (1347539953)
BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
IP: [<ffffffff8150057e>] _spin_lock+0xe/0x30
PGD 0
Oops: 0002 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu23/cache/index2/shared_cpu_map
CPU 18
Modules linked in: nfs fscache cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) jbd2 lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic libcfs(U) nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa mlx4_ib ib_mad ib_core mlx4_en mlx4_core e1000e microcode serio_raw i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-279.5.1.el6_lustre.g634f764.x86_64 #1 Supermicro X8DTT-H/X8DTT-H
RIP: 0010:[<ffffffff8150057e>] [<ffffffff8150057e>] _spin_lock+0xe/0x30
RSP: 0018:ffff88034acc3dd0 EFLAGS: 00010002
RAX: 0000000000010000 RBX: 000000000000f14c RCX: ffff880336b68800
RDX: 0040000000000080 RSI: ffff8806304dc080 RDI: 0000000000000040
RBP: ffff88034acc3dd0 R08: 0000000000000000 R09: ffff880337eefe68
R10: 000002936cfd7c69 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000286 R14: ffff88033fcd0340 R15: 000000000000e140
FS: 0000000000000000(0000) GS:ffff88034acc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000000040 CR3: 00000003303fc000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000
DR6: 00000000ffff0ff0 DR7: 0000000000000400 |
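An oops RIP like the one above can be mapped to a source line with addr2line once the matching kernel-debuginfo package is installed. A minimal sketch (the vmlinux path follows the usual RHEL 6 debuginfo layout and is an assumption, not taken from this ticket):

```shell
# Resolve the oops RIP (_spin_lock+0xe) to a file:line, assuming the
# kernel-debuginfo RPM for this exact kernel build is installed.
VMLINUX="/usr/lib/debug/lib/modules/2.6.32-279.5.1.el6_lustre.g634f764.x86_64/vmlinux"
if [ -r "$VMLINUX" ]; then
    addr2line -e "$VMLINUX" ffffffff8150057e
else
    echo "kernel-debuginfo not installed; cannot resolve address"
fi
```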
| Comment by Jian Yu [ 14/Sep/12 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/19 (with fix for sanity tests passed: https://maloo.whamcloud.com/test_sets/19ee1aae-fe7a-11e1-a707-52540035b04c |
| Comment by Peter Jones [ 15/Sep/12 ] |
|
Lowering priority, as this now only occurs with USE_OFD=yes |
| Comment by Jian Yu [ 15/Sep/12 ] |
|
Hi Minh,
While running sanity tests on Lustre build http://build.whamcloud.com/job/lustre-b2_3/19 with the following variables:
USE_OFD=yes
OSTFSTYPE=zfs
LOAD_MODULES_REMOTE=true
I hit the following module loading issue:
fat-intel-4: FATAL: Error inserting osd_zfs (/lib/modules/2.6.32-279.5.1.el6_lustre.x86_64/updates/kernel/fs/lustre/osd_zfs.ko): Unknown symbol in module, or unknown parameter (see dmesg)
Dmesg on fat-intel-4:
Lustre: Lustre: Build Version: 2.2.96--PRISTINE-2.6.32-279.5.1.el6_lustre.x86_64
LNet: Added LNI 10.10.4.131@tcp [8/256/0/180]
LNet: Accept secure, port 988
Lustre: Echo OBD driver; http://www.lustre.org/
osd_zfs: Unknown symbol zap_cursor_serialize
osd_zfs: Unknown symbol zap_remove
osd_zfs: Unknown symbol dmu_tx_hold_write
osd_zfs: Unknown symbol zfs_attr_table
osd_zfs: Unknown symbol nvpair_value_nvlist
osd_zfs: Unknown symbol sa_spill_rele
osd_zfs: Unknown symbol zap_cursor_advance
osd_zfs: Unknown symbol kmem_alloc_debug
osd_zfs: Unknown symbol dmu_read
osd_zfs: Unknown symbol nvlist_add_byte_array
osd_zfs: Unknown symbol zap_cursor_init_serialized
......
Have you hit the above issue before? |
| Comment by Li Wei (Inactive) [ 15/Sep/12 ] |
|
Yu Jian, those symbols should be provided by spl/zfs. Unlike ldiskfs, the spl/zfs source trees are not included in Lustre; Jenkins pulls and builds them when building Lustre and generates spl/zfs RPMs. I'd check whether the spl-modules and zfs-modules packages are installed in this case. (Historically, loadjenkinsbuild did not automatically install the spl/zfs RPMs that came with requested Lustre builds.) |
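The missing-package diagnosis above can be checked with a couple of commands on the affected node. A sketch (the exact RPM names vary by spl/zfs version, so treat the name pattern as an assumption):

```shell
# Verify that the spl/zfs kernel-module RPMs matching the Lustre build
# are installed; osd_zfs.ko cannot resolve symbols such as zap_remove
# without them.
if command -v rpm > /dev/null 2>&1; then
    rpm -qa | grep -E '^(spl|zfs)' || echo "no spl/zfs RPMs installed"
else
    echo "rpm not available on this host"
fi
```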
| Comment by Jian Yu [ 19/Sep/12 ] |
|
Thanks, Li Wei. After installing the spl-modules and zfs-modules packages, the sanity tests proceeded and completed on Lustre b2_3 build #19 with USE_OFD=yes. The issue in this ticket did not occur. |
| Comment by Nathaniel Clark [ 31/Jul/13 ] |
|
This is a duplicate of |