[LU-1863] Test failure with MDS spontaneous rebooting (test suite sanity, subtest test_32n) Created: 08/Sep/12  Updated: 31/Jul/13  Resolved: 31/Jul/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Zhenyu Xu
Resolution: Duplicate Votes: 0
Labels: USE_OFD, zfs

Issue Links:
Duplicate
duplicates LU-1926 Reboots during test runs Resolved
Severity: 3
Rank (Obsolete): 4270

 Description   

This issue was created by maloo for Minh Diep <mdiep@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/f3243a50-f9b2-11e1-b8d8-52540035b04c.

The sub-test test_32n failed with the following error:

test failed to respond and timed out

The console log on the OSS shows that it rebooted for no apparent reason, and ConMan failed to capture the cause:

Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 32n: open d32n\/symlink-\>tmp\/symlink-\>lustre-root ====================================== 16:04:09 (1346972649)^M
Lustre: DEBUG MARKER: == sanity test 32n: open d32n/symlink->tmp/symlink->lustre-root ====================================== 16:04:09 (1346972649)^M
Lustre: 3770:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1346972670/real 1346972670] req@ffff880071728c00 x1412399426961989/t0(0) o400->MGC10.10.4.222@tcp@10.10.4.222@tcp:26/25 lens 224/224 e 0 to 1 dl 1346972677 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1^M
LustreError: 166-1: MGC10.10.4.222@tcp: Connection to MGS (at 10.10.4.222@tcp) was lost; in progress operations using this service will fail^M
Lustre: 3768:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1346972702/real 1346972702] req@ffff880078a46400 x1412399426961991/t0(0) o250->MGC10.10.4.222@tcp@10.10.4.222@tcp:26/25 lens 400/544 e 0 to 1 dl 1346972713 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1^M
Lustre: 3768:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 1 previous similar message^M
Lustre: 3768:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1346972727/real 1346972729] req@ffff880074d4bc00 x1412399426961992/t0(0) o250->MGC10.10.4.222@tcp@10.10.4.222@tcp:26/25 lens 400/544 e 0 to 1 dl 1346972743 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1^M
Lustre: 3768:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1346972777/real 1346972777] req@ffff8800791f2400 x1412399426961994/t0(0) o250->MGC10.10.4.222@tcp@10.10.4.222@tcp:26/25 lens 400/544 e 0 to 1 dl 1346972803 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1^M
Lustre: 3768:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 1 previous similar message^M
Lustre: lustre-OST0000: haven't heard from client lustre-MDT0000-mdtlov_UUID (at 10.10.4.222@tcp) in 235 seconds. I think it's dead, and I am evicting it. exp ffff88007178e400, cur 1346972885 expire 1346972735 last 1346972650^M
Lustre: lustre-OST0002: haven't heard from client lustre-MDT0000-mdtlov_UUID (at 10.10.4.222@tcp) in 235 seconds. I think it's dead, and I am evicting it. exp ffff880061d38400, cur 1346972885 expire 1346972735 last 1346972650^M
Lustre: 3768:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1346972877/real 1346972880] req@ffff8800791f2c00 x1412399426961996/t0(0) o250->MGC10.10.4.222@tcp@10.10.4.222@tcp:26/25 lens 400/544 e 0 to 1 dl 1346972913 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1^M
Lustre: 3768:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 1 previous similar message^M
Lustre: 3768:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1346973002/real 1346973005] req@ffff880074050800 x1412399426961999/t0(0) o250->MGC10.10.4.222@tcp@10.10.4.222@tcp:26/25 lens 400/544 e 0 to 1 dl 1346973053 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1^M
Lustre: 3768:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 2 previous similar messages^M
Lustre: 3768:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1346973302/real 1346973305] req@ffff880072541c00 x1412399426962003/t0(0) o250->MGC10.10.4.222@tcp@10.10.4.222@tcp:26/25 lens 400/544 e 0 to 1 dl 1346973357 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1^M
Lustre: 3768:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 3 previous similar messages^M
Lustre: 3768:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1346973827/real 1346973830] req@ffff88007171f800 x1412399426962010/t0(0) o250->MGC10.10.4.222@tcp@10.10.4.222@tcp:26/25 lens 400/544 e 0 to 1 dl 1346973882 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1^M
Lustre: 3768:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 6 previous similar messages^M
Lustre: 3768:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1346974502/real 1346974505] req@ffff880078a46800 x1412399426962019/t0(0) o250->MGC10.10.4.222@tcp@10.10.4.222@tcp:26/25 lens 400/544 e 0 to 1 dl 1346974557 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1^M
Lustre: 3768:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 8 previous similar messages^M
Lustre: 3768:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1346975177/real 1346975180] req@ffff88005ecfac00 x1412399426962028/t0(0) o250->MGC10.10.4.222@tcp@10.10.4.222@tcp:26/25 lens 400/544 e 0 to 1 dl 1346975232 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1^M
Lustre: 3768:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 8 previous similar messages^M
Lustre: 3768:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1346975852/real 1346975855] req@ffff880078a46800 x1412399426962037/t0(0) o250->MGC10.10.4.222@tcp@10.10.4.222@tcp:26/25 lens 400/544 e 0 to 1 dl 1346975907 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1^M
Lustre: 3768:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 8 previous similar messages^M
^M
<ConMan> Console [client-19vm4] disconnected from <client-19:6003> at 09-06 17:05.^M
^M
<ConMan> Console [client-19vm4] connected to <client-19:6003> at 09-06 17:05.^M
^MPress any key to continue.^M
^MPress any key to continue.^M
^MPress any key to continue.^M
^MPress any key to continue.^M
^MPress any key to continue.^M
[[H[[J^M

Info required for matching: sanity 32n
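
For context, a minimal sketch (an assumption based only on the test name shown in the log above, not taken from sanity.sh) of the symlink chain that test_32n exercises: a symlink under the Lustre mount points into /tmp, a symlink in /tmp points back at the Lustre root, and the test opens a file through both links. Paths and names below are hypothetical.

MOUNT=${MOUNT:-/mnt/lustre}                 # assumed Lustre client mount point
mkdir -p $MOUNT/d32n /tmp/d32n.tmp
ln -s /tmp/d32n.tmp $MOUNT/d32n/symlink     # d32n/symlink -> tmp
ln -s $MOUNT /tmp/d32n.tmp/symlink          # tmp/symlink -> lustre-root
touch $MOUNT/testfile
cat $MOUNT/d32n/symlink/symlink/testfile    # open through the full chain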



 Comments   
Comment by Peter Jones [ 09/Sep/12 ]

Not just limited to OFD runs

https://maloo.whamcloud.com/test_sets/957f03fe-fa6f-11e1-887d-52540035b04c
https://maloo.whamcloud.com/test_sets/a41657fe-fa70-11e1-a03c-52540035b04c
https://maloo.whamcloud.com/test_sets/75760d88-fa50-11e1-887d-52540035b04c

Comment by Peter Jones [ 09/Sep/12 ]

Bobijam

Could you please look into this failure?

Thanks

Peter

Comment by Zhenyu Xu [ 09/Sep/12 ]

Something happened on the MDS that took it out of service; there is no log from the MDS though. I'll try to reproduce it on the Toro nodes.

Comment by Jian Yu [ 10/Sep/12 ]

Hi Bobi,

FYI, I also found some failures in the latest b2_3 sessions related to the MDS going out of service:
conf-sanity test 45: https://maloo.whamcloud.com/test_sets/aaac51d6-fa70-11e1-887d-52540035b04c
parallel-scale-nfsv3: https://maloo.whamcloud.com/test_sets/9fe57158-fa73-11e1-887d-52540035b04c

More failures are reported in TT-851.

Comment by Andreas Dilger [ 10/Sep/12 ]

I think the conf-sanity.sh test_45 failure reported above should be filed in a separate bug. I think it is just a failure in that test, and does not contain any report of the MDS spontaneous rebooting.

Comment by Jian Yu [ 11/Sep/12 ]

I think the conf-sanity.sh test_45 failure reported above should be filed in a separate bug. I think it is just a failure in that test, and does not contain any report of the MDS spontaneous rebooting.

My fault. I saw "Connection to lustre-MDT0000 (at 10.10.4.182@tcp) was lost;" on the client node and thought that the MDS was out of service, but did not realize that the MDS had been unmounted beforehand. LU-1882 was created.

Comment by Jian Yu [ 12/Sep/12 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/17

Still failed: https://maloo.whamcloud.com/test_sets/d8e2ddcc-fc82-11e1-a4a6-52540035b04c

Comment by Jian Yu [ 12/Sep/12 ]

Lustre Server Build: http://build.whamcloud.com/job/lustre-b2_3/17
Lustre Client Build: http://build.whamcloud.com/job/lustre-b2_1/121

sanity test 32n also hung: https://maloo.whamcloud.com/test_sets/47eebb26-fbf9-11e1-a4a6-52540035b04c

However, this time I found the following messages in the console log of the MDS (client-20):

00:15:57:Lustre: DEBUG MARKER: == sanity test 32n: open d32n/symlink->tmp/symlink->lustre-root ======== 00:15:55 (1347347755)
00:15:57:------------[ cut here ]------------
00:15:57:WARNING: at kernel/sched_fair.c:132 load_balance_next_fair+0x6a/0x80() (Not tainted)
00:15:57:Hardware name: X8DTT
00:17:18:Modules linked in: nfs fscache cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) (T)

Comment by Jian Yu [ 13/Sep/12 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/17

Running only sanity test 32 passed: https://maloo.whamcloud.com/test_sets/6b92dd2a-fd9b-11e1-afe5-52540035b04c

Running sanity from the start hung at test 32g: https://maloo.whamcloud.com/test_sets/a52af1ca-fda1-11e1-afe5-52540035b04c
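
(For reference, a minimal sketch of how these two kinds of runs are typically invoked, assuming the standard ONLY mechanism in the Lustre test framework; the exact invocation used in this session is not recorded in this ticket. The console log from the hung run follows below.)

cd lustre/tests
ONLY="32" sh sanity.sh    # run only the test_32* subtests (the run that passed)
sh sanity.sh              # run sanity from the start (the run that hung at 32g)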

Console log on MDS (fat-intel-2):

Lustre: DEBUG MARKER: == sanity test 32g: stat d32g/symlink->tmp/symlink->lustre-subdir/2 == 05:39:13 (1347539953)
BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
IP: [<ffffffff8150057e>] _spin_lock+0xe/0x30
PGD 0 
Oops: 0002 [#1] SMP 
last sysfs file: /sys/devices/system/cpu/cpu23/cache/index2/shared_cpu_map
CPU 18 
Modules linked in: nfs fscache cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) jbd2 lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic libcfs(U) nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa mlx4_ib ib_mad ib_core mlx4_en mlx4_core e1000e microcode serio_raw i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support ioatdma dca i7core_edac edac_core shpchp ext3 jbd mbcache sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 0, comm: swapper Not tainted 2.6.32-279.5.1.el6_lustre.g634f764.x86_64 #1 Supermicro X8DTT-H/X8DTT-H
RIP: 0010:[<ffffffff8150057e>]  [<ffffffff8150057e>] _spin_lock+0xe/0x30
RSP: 0018:ffff88034acc3dd0  EFLAGS: 00010002
RAX: 0000000000010000 RBX: 000000000000f14c RCX: ffff880336b68800
RDX: 0040000000000080 RSI: ffff8806304dc080 RDI: 0000000000000040
RBP: ffff88034acc3dd0 R08: 0000000000000000 R09: ffff880337eefe68
R10: 000002936cfd7c69 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000286 R14: ffff88033fcd0340 R15: 000000000000e140
FS:  0000000000000000(0000) GS:ffff88034acc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000040 CR3: 00000003303fc000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Comment by Jian Yu [ 14/Sep/12 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/19 (with fix for LU-1881)

sanity tests passed: https://maloo.whamcloud.com/test_sets/19ee1aae-fe7a-11e1-a707-52540035b04c

Comment by Peter Jones [ 15/Sep/12 ]

Lowering priority as this now only occurs with USE_OFD=yes

Comment by Jian Yu [ 15/Sep/12 ]

Hi Minh,

While running sanity tests on Lustre build http://build.whamcloud.com/job/lustre-b2_3/19 with the following variables:

USE_OFD=yes
OSTFSTYPE=zfs
LOAD_MODULES_REMOTE=true

I hit the following module loading issue:

fat-intel-4: FATAL: Error inserting osd_zfs (/lib/modules/2.6.32-279.5.1.el6_lustre.x86_64/updates/kernel/fs/lustre/osd_zfs.ko): Unknown symbol in module, or unknown parameter (see dmesg)

Dmesg on fat-intel-4:

Lustre: Lustre: Build Version: 2.2.96--PRISTINE-2.6.32-279.5.1.el6_lustre.x86_64
LNet: Added LNI 10.10.4.131@tcp [8/256/0/180]
LNet: Accept secure, port 988
Lustre: Echo OBD driver; http://www.lustre.org/
osd_zfs: Unknown symbol zap_cursor_serialize
osd_zfs: Unknown symbol zap_remove
osd_zfs: Unknown symbol dmu_tx_hold_write
osd_zfs: Unknown symbol zfs_attr_table
osd_zfs: Unknown symbol nvpair_value_nvlist
osd_zfs: Unknown symbol sa_spill_rele
osd_zfs: Unknown symbol zap_cursor_advance
osd_zfs: Unknown symbol kmem_alloc_debug
osd_zfs: Unknown symbol dmu_read
osd_zfs: Unknown symbol nvlist_add_byte_array
osd_zfs: Unknown symbol zap_cursor_init_serialized
......

Did you hit the above issue before?
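
(For reference, a minimal sketch of how a sanity run with the variables above might be launched, assuming the standard Lustre test-framework invocation; the exact command used in this session is not recorded here.)

cd lustre/tests
USE_OFD=yes OSTFSTYPE=zfs LOAD_MODULES_REMOTE=true sh sanity.sh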

Comment by Li Wei (Inactive) [ 15/Sep/12 ]

Yu Jian,

Those symbols should be provided by spl/zfs. Unlike LDiskFS, spl/zfs source trees are not included in lustre. Jenkins pulls and builds them when building lustre, and generates spl/zfs RPMs. I'd check if spl-modules and zfs-modules are installed in this case. (Historically, loadjenkinsbuild did not automatically install spl/zfs RPMs that came with requested lustre builds.)
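
(A minimal sketch of that check, using the package names mentioned above; exact RPM names can vary between builds.)

rpm -qa | grep -E '^(spl|zfs)'   # spl-modules and zfs-modules should be listed
modprobe zfs                     # the ZFS stack has to load before osd_zfs
modprobe osd_zfs                 # should now resolve the zap_*/dmu_* symbols
dmesg | tail                     # any remaining "Unknown symbol" errors appear here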

Comment by Jian Yu [ 19/Sep/12 ]

Thanks Li Wei.

After installing the spl-modules and zfs-modules packages, the sanity tests proceeded and completed on Lustre b2_3 build #19 with USE_OFD=yes:
https://maloo.whamcloud.com/test_sets/ff130926-0241-11e2-ab94-52540035b04c

The issue in this ticket did not occur.

Comment by Nathaniel Clark [ 31/Jul/13 ]

This is a duplicate of LU-1926
