Lustre / LU-2648

8 timeout issues in a test run and incomplete logs.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.4.0
    • None
    • None
    • patch push via git
    • 3
    • 6028

    Description

      https://maloo.whamcloud.com/test_sessions/bd713280-5760-11e2-8b17-52540035b04c

      A patch was submitted, and then 8 tests had timeout issues during the run. In mmp the timeout was listed as a 100% failure rate, but the test queue did not share this observation.

      Chris G. indicated this is likely a one-off issue, but I would like a place to track the issues.

      For the timeout errors I checked, there were no logs for the failed tests. It is really hard to say what happened with the test run, but whatever it was, it was not good. One of the servers could have been blocked or reset a few times; it is not clear.

      Attachments

        Issue Links

          Activity

            [LU-2648] 8 timeout issues in a test run and incomplete logs.

            Patches have landed. The code was reworked to avoid this issue, so it is safe to close this issue.

            keith Keith Mannthey (Inactive) added a comment

            Yeah, I saw that yesterday. Let's see about LU-2673 and then go back to change 4815.

            keith Keith Mannthey (Inactive) added a comment
            jhammond John Hammond added a comment -

            Thanks Keith. Also please see my comment at the bottom of http://review.whamcloud.com/#change,4815. I think the memory corruption I caused there is the culprit for this LU.


            http://jira.whamcloud.com/browse/LU-2673 seems to be related. I am looking into John's patch.

            keith Keith Mannthey (Inactive) added a comment

            So the MDS panics all over the place:

            It will take a bit to categorize them all, but

            05:33:49:BUG: unable to handle kernel paging request at 0000000100000000
            05:33:49:IP: [<ffffffffa04f7220>] cfs_hash_for_each_tight+0xd0/0x290 [libcfs]
            05:33:49:PGD 0 
            05:33:50:Oops: 0000 [#1] SMP 
            05:33:50:last sysfs file: /sys/fs/ldiskfs/dm-0/max_dir_size
            05:33:50:CPU 0 
            05:33:50:Modules linked in: osp(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) mdd(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic libcfs(U) ldiskfs(U) jbd2 nfsd exportfs autofs4 nfs lockd fscache nfs_acl auth_rpcgss sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: llog_test]
            05:33:50:
            05:33:50:Pid: 16691, comm: umount Not tainted 2.6.32-279.14.1.el6_lustre.g639f1fc.x86_64 #1 Red Hat KVM
            05:33:50:RIP: 0010:[<ffffffffa04f7220>]  [<ffffffffa04f7220>] cfs_hash_for_each_tight+0xd0/0x290 [libcfs]
            05:33:50:RSP: 0018:ffff88005b35db58  EFLAGS: 00010206
            05:33:50:RAX: ffffc900008d6074 RBX: ffff880079592300 RCX: 000000000000000b
            05:33:50:RDX: 0000000100000000 RSI: ffff88005b35db68 RDI: ffff880079592300
            05:33:50:RBP: ffff88005b35dba8 R08: 20737365636f7250 R09: 0a64657265746e65
            05:33:50:R10: 20737365636f7250 R11: 0a64657265746e65 R12: 0000000000000000
            05:33:50:R13: 0000000000000000 R14: ffff88005b35dc18 R15: ffffffffa04f3e60
            05:33:50:FS:  00007f549ce83740(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
            05:33:50:CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
            05:33:50:CR2: 0000000100000000 CR3: 000000005c456000 CR4: 00000000000006f0
            05:33:50:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            05:33:50:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            05:33:51:Process umount (pid: 16691, threadinfo ffff88005b35c000, task ffff88005c940080)
            05:33:51:Stack:
            05:33:51: 0000000000000004 ffff88005b35dbc4 ffffc900008d3000 0000000000000306
            05:33:51:<d> ffff88005b35db78 ffff8800799815c0 ffff88005c8b7200 ffff88005620c038
            05:33:52:<d> ffff88005b35dc18 ffff88005620c388 ffff88005b35dbc8 ffffffffa04f7436
            05:33:52:Call Trace:
            05:33:52: [<ffffffffa04f7436>] cfs_hash_is_empty+0x26/0x30 [libcfs]
            05:33:52: [<ffffffffa0da9a4e>] mgs_device_fini+0x20e/0x5a0 [mgs]
            05:33:52: [<ffffffffa0653e47>] class_cleanup+0x577/0xda0 [obdclass]
            05:33:52: [<ffffffffa0629446>] ? class_name2dev+0x56/0xe0 [obdclass]
            05:33:52: [<ffffffffa0655715>] class_process_config+0x10a5/0x1c60 [obdclass]
            05:33:52: [<ffffffffa04e9e68>] ? libcfs_log_return+0x28/0x40 [libcfs]
            05:33:52: [<ffffffffa064efa1>] ? lustre_cfg_new+0x391/0x7e0 [obdclass]
            05:33:52: [<ffffffffa0656449>] class_manual_cleanup+0x179/0x6f0 [obdclass]
            05:33:52: [<ffffffffa0629446>] ? class_name2dev+0x56/0xe0 [obdclass]
            05:33:52: [<ffffffffa066384d>] server_put_super+0x46d/0x1170 [obdclass]
            05:33:52: [<ffffffff8117d84b>] generic_shutdown_super+0x5b/0xe0
            05:33:52: [<ffffffff8117d936>] kill_anon_super+0x16/0x60
            05:33:52: [<ffffffffa06582a6>] lustre_kill_super+0x36/0x60 [obdclass]
            05:33:52: [<ffffffff8117e9b0>] deactivate_super+0x70/0x90
            05:33:52: [<ffffffff8119a9ef>] mntput_no_expire+0xbf/0x110
            05:33:52: [<ffffffff8119b48b>] sys_umount+0x7b/0x3a0
            05:33:53: [<ffffffff810d6d42>] ? audit_syscall_entry+0x272/0x2a0
            05:33:53: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
            05:33:53:Code: df ff 10 48 85 c0 0f 84 16 01 00 00 48 8b 10 48 85 d2 75 17 e9 f2 00 00 00 66 90 41 83 c5 01 4d 85 f6 4c 89 f2 0f 84 e0 00 00 00 <4c> 8b 32 49 83 c4 01 48 8b 4d b8 48 8d 75 c0 48 89 df 41 ff d7 
            05:33:53:RIP  [<ffffffffa04f7220>] cfs_hash_for_each_tight+0xd0/0x290 [libcfs]
            05:33:53: RSP <ffff88005b35db58>
            05:33:53:CR2: 0000000100000000
            

            and

            06:05:15:Lustre: Failing over lustre-MDT0000
            06:05:15:general protection fault: 0000 [#1] SMP 
            06:05:15:last sysfs file: /sys/devices/system/cpu/possible
            06:05:15:CPU 0 
            06:05:15:Modules linked in: osp(U) mdt(U) lod(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) mdd(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic libcfs(U) ldiskfs(U) jbd2 nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
            06:05:15:
            06:05:15:Pid: 15200, comm: umount Not tainted 2.6.32-279.14.1.el6_lustre.g639f1fc.x86_64 #1 Red Hat KVM
            06:05:15:RIP: 0010:[<ffffffffa082366e>]  [<ffffffffa082366e>] lustre_msg_get_status+0xe/0xb0 [ptlrpc]
            06:05:15:RSP: 0018:ffff88005a09b758  EFLAGS: 00010292
            06:05:15:RAX: ffff88005a09b9a8 RBX: ffff8800554237c9 RCX: 0000000004000000
            06:05:15:RDX: 0000000000000001 RSI: ffffffffa08b0f40 RDI: 68ffff8800570b80
            06:05:15:RBP: ffff88005a09b768 R08: 00000000ffffff0a R09: 00000000fffffffe
            06:05:15:R10: 0000000000000000 R11: 000000000000010e R12: 0000000000000001
            06:05:15:R13: 78ffff8800554238 R14: ffff8800554237c9 R15: ffffffffffffffff
            06:05:15:FS:  00007fafa2cf1740(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
            06:05:15:CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
            06:05:15:CR2: 00007ffde7ad9000 CR3: 000000005b0fa000 CR4: 00000000000006f0
            06:05:15:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            06:05:15:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            

            These seem to be the most common.
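            One more observation on the first oops, offered as a guess rather than a conclusion: the repeated values in R08-R11 are ASCII text, not kernel pointers, which would fit the memory-corruption theory around change 4815. They can be decoded with standard tools (x86_64 is little-endian, so the byte order has to be reversed):

            # R08/R10 and R09/R11 from the oops above, hex-decoded and byte-reversed:
            echo '20737365636f7250 0a64657265746e65' | xxd -r -p | rev
            # -> "Process " / "entered": which looks like the libcfs
            #    "Process entered\n" trace string, i.e. debug-log text sitting
            #    where the data cfs_hash_for_each_tight was walking should be.

            If that reading is right, the structure being iterated at umount time had already been overwritten with debug output.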

            keith Keith Mannthey (Inactive) added a comment

            Well, I realized that the MDS console contains panics that are not in the syslog.

            Paging request errors all over the place on the MDS.

            Chris, can you assign this back to a LU? I will take ownership of it.

            keith Keith Mannthey (Inactive) added a comment
            keith Keith Mannthey (Inactive) added a comment - edited

            https://maloo.whamcloud.com/test_sessions/6eea9e98-5ba3-11e2-aa66-52540035b04c

            A few observations from Sanity 900.

            Jan  9 23:36:45 client-29vm3 rshd[6101]: root@client-29vm6.lab.whamcloud.com as root: cmd='/usr/sbin/lctl mark "umount -d -f /mnt/mds1";echo XXRETCODE:$?'
            Jan  9 23:36:45 client-29vm3 kernel: Lustre: DEBUG MARKER: umount -d -f /mnt/mds1
            Jan  9 23:36:45 client-29vm3 xinetd[1600]: EXIT: shell status=0 pid=6100 duration=0(sec)
            Jan  9 23:36:45 client-29vm3 xinetd[1600]: START: shell pid=6123 from=::ffff:10.10.4.177
            Jan  9 23:36:45 client-29vm3 rshd[6124]: root@client-29vm6.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" sh -c "umount -d -f /mnt/mds1");echo XXRETCODE:$?'
            Jan  9 23:38:58 client-29vm3 kernel: imklog 5.8.10, log source = /proc/kmsg started.
            Jan  9 23:38:58 client-29vm3 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1222" x-info="http://www.rsyslog.com"] start
            Jan  9 23:38:58 client-29vm3 kernel: Initializing cgroup subsys cpuset
            Jan  9 23:38:58 client-29vm3 kernel: Initializing cgroup subsys cpu
            Jan  9 23:38:58 client-29vm3 kernel: Linux version 2.6.32-279.14.1.el6_lustre.g639f1fc.x86_64 (jenkins@builder-1-sde1-el6-x8664.lab.whamcloud.com) (gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC) ) #1 SMP Wed Jan 2 11:43:03 PST 2013
            Jan  9 23:38:58 client-29vm3 kernel: Command line: ro root=UUID=301a532c-28f4-4437-98
            

            It looks like the last thing it tried to do was umount the MDS.
            It then reboots.
            This may or may not be a part of the tests; I don't know.

            When the MDS reboots it talks to NTP, and maybe not in a good way.

            Jan  9 23:39:01 client-29vm3 rpc.mountd[1573]: Version 1.2.3 starting
            Jan  9 23:39:01 client-29vm3 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
            Jan  9 23:39:01 client-29vm3 kernel: NFSD: starting 90-second grace period
            Jan  9 23:39:02 client-29vm3 ntpd[1696]: ntpd 4.2.4p8@1.1612-o Tue Nov 29 00:09:12 UTC 2011 (1)
            Jan  9 23:39:02 client-29vm3 xinetd[1689]: xinetd Version 2.3.14 started with libwrap loadavg labeled-networking options compiled in.
            Jan  9 23:39:02 client-29vm3 xinetd[1689]: Started working: 4 available services
            Jan  9 23:39:02 client-29vm3 ntpd[1697]: precision = 0.189 usec
            Jan  9 23:39:02 client-29vm3 ntpd[1697]: Listening on interface #0 wildcard, 0.0.0.0#123 Disabled
            Jan  9 23:39:02 client-29vm3 ntpd[1697]: Listening on interface #1 wildcard, ::#123 Disabled
            Jan  9 23:39:02 client-29vm3 ntpd[1697]: Listening on interface #2 lo, ::1#123 Enabled
            Jan  9 23:39:02 client-29vm3 ntpd[1697]: Listening on interface #3 eth0, fe80::216:3eff:fe2d:bfd3#123 Enabled
            Jan  9 23:39:02 client-29vm3 ntpd[1697]: Listening on interface #4 lo, 127.0.0.1#123 Enabled
            Jan  9 23:39:02 client-29vm3 ntpd[1697]: Listening on interface #5 eth0, 10.10.4.174#123 Enabled
            Jan  9 23:39:02 client-29vm3 ntpd[1697]: Listening on routing socket on fd #22 for interface updates
            Jan  9 23:39:02 client-29vm3 ntpd[1697]: kernel time sync status 2040
            Jan  9 23:39:02 client-29vm3 ntpd[1697]: frequency initialized 0.093 PPM from /var/lib/ntp/ntp.drift
            Jan  9 23:43:20 client-29vm3 ntpd[1697]: synchronized to 10.10.0.1, stratum 2
            Jan  9 23:43:20 client-29vm3 ntpd[1697]: time reset +0.183343 s
            Jan  9 23:43:20 client-29vm3 ntpd[1697]: kernel time sync status change 2001
            Jan  9 23:54:46 client-29vm3 ntpd[1697]: synchronized to 10.10.0.1, stratum 2
            Jan 10 00:37:30 client-29vm3 xinetd[1689]: START: shell pid=2018 from=::ffff:10.10.4.177
            Jan 10 00:37:30 client-29vm3 rshd[2019]: autotest@client-29vm6.lab.whamcloud.com as root: cmd='/home/autotest/.autotest/dynamic_bash/70095054347540+1357807045.03184'
            Jan 10 00:37:30 client-29vm3 kernel: Slow work thread pool: Starting up
            Jan 10 00:37:30 client-29vm3 kernel: Slow work thread pool: Ready
            Jan 10 00:37:30 client-29vm3 kernel: FS-Cache: Loaded
            Jan 10 00:37:30 client-29vm3 kernel: Registering the id_resolver key type
            Jan 10 00:37:30 client-29vm3 kernel: FS-Cache: Netfs 'nfs' registered for caching
            Jan 10 00:37:31 client-29vm3 kernel: SysRq : Show State
            

            Right after the NTP reset (NTP indicated it only changed the time by +0.183343 s), autotest started the Show State log collection process.

            From the client's point of view:

            Jan  9 23:36:45 client-29vm6 xinetd[5665]: EXIT: shell status=0 pid=22796 duration=0(sec)
            Jan  9 23:36:45 client-29vm6 xinetd[5665]: START: shell pid=22851 from=::ffff:10.10.4.177
            Jan  9 23:36:45 client-29vm6 rshd[22852]: root@client-29vm6.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre"  MGSFSTYPE=ldiskfs MDSFSTYPE=ldiskfs OSTFSTYPE=ldiskfs FSTYPE=ldiskfs sh -c "running=\$(grep -c /mnt/lustre2' ' /proc/mounts);#012if [ \$running -ne 0 ] ; then#012echo Stopping client \$(hostname) /mnt/lustre2 opts:;#012lsof /mnt/lustre2 || need_kill=no;#012if [ x != x -a x\$need_kill != xno ]; then#012    pids=\$(lsof -t /mnt/lustre2 | sort -u);#012    if [ -n \"\$pids\" ]; then#012             kill -9 \$pids;#012    fi#012fi;#012while umount  /mnt/lustre2 2>&1 | grep -q busy; do#012    echo /mnt/lustre2 is still busy, wait one second && sleep 1;#012done;#012fi");echo XXRETCODE:$?'
            Jan  9 23:36:45 client-29vm6 xinetd[5665]: EXIT: shell status=0 pid=22851 duration=0(sec)
            Jan 10 00:37:04 client-29vm6 xinetd[5665]: START: shell pid=23142 from=::ffff:10.10.4.177
            Jan 10 00:37:04 client-29vm6 rshd[23144]: autotest@client-29vm6.lab.whamcloud.com as root: cmd='/home/autotest/.autotest/dynamic_bash/70095054347540+1357807023.34444'
            Jan 10 00:37:05 client-29vm6 kernel: SysRq : Show State
            

            The shell exit status says the duration was 0 seconds. There is some unaccounted-for time on the client.

            From the OST:

            Jan  9 23:37:05 client-29vm4 kernel: LustreError: 166-1: MGC10.10.4.174@tcp: Connection to MGS (at 10.10.4.174@tcp) was lost; in progress operations using this service will fail
            Jan  9 23:37:11 client-29vm4 kernel: Lustre: 4954:0:(client.c:1836:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1357803425/real 1357803425]  req@ffff88005aae9400 x1423754754290676/t0(0) o250->MGC10.10.4.174@tcp@10.10.4.174@tcp:26/25 lens 400/544 e 0 to 1 dl 1357803431 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Jan  9 23:37:11 client-29vm4 kernel: Lustre: 4954:0:(client.c:1836:ptlrpc_expire_one_request()) Skipped 21 previous similar messages
            Jan  9 23:37:46 client-29vm4 kernel: Lustre: 4954:0:(client.c:1836:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1357803450/real 1357803450]  req@ffff880066117400 x1423754754290685/t0(0) o250->MGC10.10.4.174@tcp@10.10.4.174@tcp:26/25 lens 400/544 e 0 to 1 dl 1357803466 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Jan  9 23:37:46 client-29vm4 kernel: Lustre: 4954:0:(client.c:1836:ptlrpc_expire_one_request()) Skipped 15 previous similar messages
            Jan  9 23:38:55 client-29vm4 kernel: Lustre: 4954:0:(client.c:1836:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1357803510/real 1357803510]  req@ffff88005aae9400 x1423754754290702/t0(0) o38->lustre-MDT0000-osp-OST0000@10.10.4.174@tcp:12/10 lens 400/544 e 0 to 1 dl 1357803535 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Jan  9 23:38:55 client-29vm4 kernel: Lustre: 4954:0:(client.c:1836:ptlrpc_expire_one_request()) Skipped 16 previous similar messages
            Jan  9 23:41:05 client-29vm4 kernel: Lustre: 4954:0:(client.c:1836:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1357803640/real 1357803640]  req@ffff880043443c00 x1423754754290792/t0(0) o250->MGC10.10.4.174@tcp@10.10.4.174@tcp:26/25 lens 400/544 e 0 to 1 dl 1357803665 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Jan  9 23:41:05 client-29vm4 kernel: Lustre: 4954:0:(client.c:1836:ptlrpc_expire_one_request()) Skipped 89 previous similar messages
            Jan  9 23:45:25 client-29vm4 kernel: Lustre: 4954:0:(client.c:1836:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1357803925/real 1357803925]  req@ffff880058dc7400 x1423754754291056/t0(0) o250->MGC10.10.4.174@tcp@10.10.4.174@tcp:26/25 lens 400/544 e 0 to 1 dl 1357803950 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Jan  9 23:45:25 client-29vm4 kernel: Lustre: 4954:0:(client.c:1836:ptlrpc_expire_one_request()) Skipped 263 previous similar messages
            Jan  9 23:54:00 client-29vm4 kernel: Lustre: 4954:0:(client.c:1836:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1357804440/real 1357804440]  req@ffff880073443800 x1423754754291561/t0(0) o250->MGC10.10.4.174@tcp@10.10.4.174@tcp:26/25 lens 400/544 e 0 to 1 dl 1357804465 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Jan  9 23:54:00 client-29vm4 kernel: Lustre: 4954:0:(client.c:1836:ptlrpc_expire_one_request()) Skipped 504 previous similar messages
            Jan 10 00:04:05 client-29vm4 kernel: Lustre: 4954:0:(client.c:1836:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1357805045/real 1357805045]  req@ffff8800715cd400 x1423754754292134/t0(0) o250->MGC10.10.4.174@tcp@10.10.4.174@tcp:26/25 lens 400/544 e 0 to 1 dl 1357805070 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Jan 10 00:04:05 client-29vm4 kernel: Lustre: 4954:0:(client.c:1836:ptlrpc_expire_one_request()) Skipped 572 previous similar messages
            Jan 10 00:14:10 client-29vm4 kernel: Lustre: 4954:0:(client.c:1836:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1357805650/real 1357805650]  req@ffff8800715cd000 x1423754754292695/t0(0) o250->MGC10.10.4.174@tcp@10.10.4.174@tcp:26/25 lens 400/544 e 0 to 1 dl 1357805675 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Jan 10 00:14:10 client-29vm4 kernel: Lustre: 4954:0:(client.c:1836:ptlrpc_expire_one_request()) Skipped 560 previous similar messages
            Jan 10 00:24:25 client-29vm4 kernel: Lustre: 4954:0:(client.c:1836:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1357806240/real 1357806240]  req@ffff88005312c000 x1423754754293272/t0(0) o250->MGC10.10.4.174@tcp@10.10.4.174@tcp:26/25 lens 400/544 e 0 to 1 dl 1357806265 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Jan 10 00:24:25 client-29vm4 kernel: Lustre: 4954:0:(client.c:1836:ptlrpc_expire_one_request()) Skipped 576 previous similar messages
            Jan 10 00:34:30 client-29vm4 kernel: Lustre: 4954:0:(client.c:1836:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1357806845/real 1357806845]  req@ffff88007cab9c00 x1423754754293840/t0(0) o250->MGC10.10.4.174@tcp@10.10.4.174@tcp:26/25 lens 400/544 e 0 to 1 dl 1357806870 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Jan 10 00:34:30 client-29vm4 kernel: Lustre: 4954:0:(client.c:1836:ptlrpc_expire_one_request()) Skipped 567 previous similar messages
            Jan 10 00:37:17 client-29vm4 xinetd[1610]: START: shell pid=557 from=::ffff:10.10.4.177
            Jan 10 00:37:18 client-29vm4 rshd[558]: autotest@client-29vm6.lab.whamcloud.com as root: cmd='/home/autotest/.autotest/dynamic_bash/70095054347540+1357807032.72548'
            Jan 10 00:37:24 client-29vm4 kernel: SysRq : Show State
            

            So this is even more confusing.

            At about the same time, all the servers show an unaccounted-for gap in time, from about
            Jan 9 23:54:00 to Jan 10 00:04:05.

            Could NTP be responsible?
            Are all the VMs getting blocked for the same period?
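
            One way to check is to scan each node's syslog for silent windows and compare them across the VMs. A rough sketch (gawk's mktime, the hard-coded year, the 5-minute threshold, and the /var/log/messages path are my assumptions, not anything from the test framework):

            # Report silent gaps of more than 5 minutes in a syslog file.
            # Assumes the default "Mon DD HH:MM:SS" prefix; syslog lines carry
            # no year, so it is passed in explicitly.
            awk -v year=2013 '
            {
                m = index("JanFebMarAprMayJunJulAugSepOctNovDec", $1);
                if (m == 0) next;                        # not a syslog-stamped line
                mon = (m + 2) / 3;
                ts  = mktime(year " " mon " " $2 " " substr($3,1,2) " " substr($3,4,2) " " substr($3,7,2));
                if (prev && ts - prev > 300)
                    printf "gap of %d seconds ending at %s %s %s\n", ts - prev, $1, $2, $3;
                prev = ts;
            }' /var/log/messages

            If every VM shows a silent window over the same wall-clock range, that would point at the guests being blocked rather than at ntpd stepping the clock.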

            From replay-single (note there is no weird time issue here), on the MDS:

            Jan 10 02:13:09 client-29vm3 rshd[28692]: root@client-29vm6.lab.whamcloud.com as root: cmd='/usr/sbin/lctl mark "grep -c /mnt/mds1' ' /proc/mounts";echo XXRETCODE:$?'
            Jan 10 02:13:09 client-29vm3 kernel: Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts
            Jan 10 02:13:09 client-29vm3 xinetd[1603]: EXIT: shell status=0 pid=28691 duration=0(sec)
            Jan 10 02:13:09 client-29vm3 xinetd[1603]: START: shell pid=28714 from=::ffff:10.10.4.177
            Jan 10 02:13:09 client-29vm3 rshd[28715]: root@client-29vm6.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" sh -c "grep -c /mnt/mds1' ' /proc/mounts");echo XXRETCODE:$?'
            Jan 10 02:15:08 client-29vm3 kernel: imklog 5.8.10, log source = /proc/kmsg started.
            Jan 10 02:15:08 client-29vm3 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1209" x-info="http://www.rsyslog.com"] start
            Jan 10 02:15:08 client-29vm3 kernel: Initializing cgroup subsys cpuset
            Jan 10 02:15:08 client-29vm3 kernel: Initializing cgroup subsys cpu
            Jan 10 02:15:08 client-29vm3 kernel: Linux version 2.6.32-279.14.1.e
            

            It looks like we grepped /proc and the box rebooted?

            Insanity test 2:

            Jan 10 08:31:57 client-29vm3 rshd[8528]: root@client-29vm6.lab.whamcloud.com as root: cmd='/usr/sbin/lctl mark "umount -d /mnt/mds1";echo XXRETCODE:$?'
            Jan 10 08:33:57 client-29vm3 kernel: imklog 5.8.10, log source = /proc/kmsg started.
            Jan 10 08:33:57 client-29vm3 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1223" x-info="http://www.rsyslog.com"] start
            Jan 10 08:33:57 client-29vm3 kernel: Initializing cgroup subsys cpuset
            Jan 10 08:33:57 client-29vm3 kernel: Initializing cgroup subsys cpu
            Jan 10 08:33:57 client-29vm3 kernel: Linux version 2.6.32-279.14.1.el6_lustre.g639f1fc.x86_64 (jenkins@builder-1-sde1-el6-x8664.lab.whamcloud.com) (gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC) ) #1 SMP Wed Jan 2 11:43:03 PST 2013
            Jan 10 08:33:57 client-29vm3 kernel: Command line: ro root=UUID=e4f3f0d2-2922-468e-b20b-0096084a9409 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD console=ttyS0,115200 crashkernel=auto SYSFONT=latarcyrheb-sun16  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM
            Jan 10 08:33:57 client-29vm3 kernel: KERNEL supported cpus:
            Jan 10 08:33:57 client-29vm3 kernel:  Intel GenuineIntel
            

            MDS rebooted after umount -d /mnt/mds1.

            It looks like the MDS mount is unstable? There is no serial console log from the reboots, is there?

            This looks very similar to this: https://maloo.whamcloud.com/test_sessions/6e837946-4809-11e2-8cdc-52540035b04c

            Which is a recent repro of: http://jira.whamcloud.com/browse/LU-2008

            keith Keith Mannthey (Inactive) added a comment

            I have a 2nd observation of this issue today:

            https://maloo.whamcloud.com/test_sessions/15ac4d02-59d4-11e2-9fee-52540035b04c

            It is with the same patch set, so I am starting to think it is related to the change (the change is /proc related). The MDS is rebooting, and this is captured in the MDS log.

            
            

            Jan 8 09:43:08 client-30vm7 rshd[6679]: autotest@client-30vm2.lab.whamcloud.com as root: cmd='/home/autotest/.autotest/dynamic_bash/70095034882780+1357666987.85688'
            Jan 8 09:43:08 client-30vm7 xinetd[5687]: EXIT: shell status=0 pid=6678 duration=0(sec)
            Jan 8 09:43:08 client-30vm7 xinetd[5687]: START: shell pid=6701 from=::ffff:10.10.4.189
            Jan 8 09:43:13 client-30vm7 rshd[6702]: autotest@client-30vm2.lab.whamcloud.com as root: cmd='/home/autotest/.autotest/dynamic_bash/70095034882780+1357666988.20111'
            Jan 8 09:43:13 client-30vm7 xinetd[5687]: EXIT: shell status=0 pid=6701 duration=5(sec)
            Jan 8 09:43:14 client-30vm7 xinetd[5687]: START: shell pid=6728 from=::ffff:10.10.4.189
            Jan 8 09:43:14 client-30vm7 rshd[6729]: autotest@client-30vm2.lab.whamcloud.com as root: cmd='/home/autotest/.autotest/dynamic_bash/70095054367360+1357666994.04923'
            Jan 8 09:43:14 client-30vm7 init: serial (ttyS0) main process (5790) killed by TERM signal
            Jan 8 09:43:14 client-30vm7 init: tty (/dev/tty1) main process (5792) killed by TERM signal
            Jan 8 09:43:14 client-30vm7 init: tty (/dev/tty2) main process (5794) killed by TERM signal
            Jan 8 09:43:14 client-30vm7 init: tty (/dev/tty3) main process (5796) killed by TERM signal
            Jan 8 09:43:14 client-30vm7 init: tty (/dev/tty4) main process (5798) killed by TERM signal
            Jan 8 09:43:14 client-30vm7 init: tty (/dev/tty5) main process (5800) killed by TERM signal
            Jan 8 09:43:14 client-30vm7 init: tty (/dev/tty6) main process (5802) killed by TERM signal
            Jan 8 09:43:14 client-30vm7 xinetd[5687]: EXIT: shell status=0 pid=6728 duration=0(sec)
            Jan 8 09:43:14 client-30vm7 xinetd[5687]: Exiting...
            Jan 8 09:43:24 client-30vm7 acpid: exiting
            Jan 8 09:43:24 client-30vm7 ntpd[5695]: ntpd exiting on signal 15
            Jan 8 09:43:24 client-30vm7 init: Disconnected from system bus
            Jan 8 09:43:24 client-30vm7 rpcbind: rpcbind terminating on signal. Restart with "rpcbind -w"
            Jan 8 09:43:24 client-30vm7 auditd[5826]: The audit daemon is exiting.
            Jan 8 09:43:24 client-30vm7 kernel: type=1305 audit(1357667004.705:45464): audit_pid=0 old=5826 auid=4294967295 ses=4294967295 res=1
            Jan 8 09:43:24 client-30vm7 kernel: type=1305 audit(1357667004.802:45465): audit_enabled=0 old=1 auid=4294967295 ses=4294967295 res=1
            Jan 8 09:43:24 client-30vm7 kernel: Kernel logging (proc) stopped.
            Jan 8 09:43:24 client-30vm7 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1134" x-info="http://www.rsyslog.com"] exiting on signal 15.
            Jan 8 09:43:54 client-30vm7 kernel: imklog 5.8.10, log source = /proc/kmsg started.
            Jan 8 09:43:54 client-30vm7 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1136" x-info="http://www.rsyslog.com"] start
            Jan 8 09:43:54 client-30vm7 kernel: Initializing cgroup subsys cpuset
            Jan 8 09:43:54 client-30vm7 kernel: Initializing cgroup subsys cpu
            Jan 8 09:43:54 client-30vm7 kernel: Linux version 2.6.32-279.14.1.el6_lustre.g639f1fc.x86_64 (jenkins@builder-1-sde1-el6-x8664.lab.whamcloud.com) (gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC) ) #1 SMP Wed Jan 2 11:43:03 PST 2013
            Jan 8 09:43:54 client-30vm7 kernel: Command line: ro root=UUID=c70bfa8c-fd02-4a3f-98eb-2dddc1270604 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD console=ttyS0,115200 crashkernel=auto SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM
            Jan 8 09:43:54 client-30vm7 kernel: KERNEL supported cpus:
            Jan 8 09:43:54 client-30vm7 kernel: Intel GenuineIntel
            Jan 8 09:43:54 client-30vm7 kernel: AMD AuthenticAMD
            Jan 8 09:43:54 client-30vm7 kernel: Centaur CentaurHauls
            Jan 8 09:43:54 client-30vm7 kernel: Disabled fast string operations
            Jan 8 09:43:54 client-30vm7 kernel: BIOS-provided physical RAM map:
            Jan 8 09:43:54 client-30vm7 kernel: BIOS-e820: 0000000000000000 - 000000000009a400 (usable)
            Jan 8 09:43:54 client-30vm7 kernel: BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)

            We can see from the log that this is an orderly shutdown of the system with an unknown cause. It does not seem to be a panic or error of some kind, but a deliberate decision to reboot the MDS.

            This 2nd failure has logs.

            I don't see the kernel being changed, so I suspect this was not an upgrade process. Is a reboot of the MDS expected during the course of a run like this?

            Is there a good way to sort out what caused this mds to reboot?
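
            For reference, a couple of generic checks on the MDS might narrow down what initiated an orderly reboot like this; these are standard EL6 tools and default paths, nothing autotest-specific:

            # Shutdown/reboot records from the wtmp accounting file:
            last -x -f /var/log/wtmp shutdown reboot | head

            # Shutdown/runlevel messages logged just before the restart:
            grep -E 'shutdown|reboot|runlevel|init:' /var/log/messages | tail -50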

            keith Keith Mannthey (Inactive) added a comment

            People

              Assignee: keith Keith Mannthey (Inactive)
              Reporter: keith Keith Mannthey (Inactive)
              Votes: 0
              Watchers: 2

              Dates

                Created:
                Updated:
                Resolved: