Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Fix Version/s: Lustre 2.4.0
- Labels: None
- Severity: 3
- Rank (Obsolete): 7398
Description
Running conf-sanity.sh in a loop, I hit this crash:
[ 6625.425296] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 6625.425755] last sysfs file: /sys/devices/system/cpu/possible
[ 6625.426029] CPU 1
[ 6625.426068] Modules linked in: ptlrpc(-) obdclass lvfs ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 mbcache jbd2 virtio_balloon virtio_console i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache nfs_acl auth_rpcgss sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: fld]
[ 6625.429023]
[ 6625.429231] Pid: 4991, comm: ll_ping Not tainted 2.6.32-debug #6 Bochs Bochs
[ 6625.429247] RIP: 0010:[<ffffffff8104d2e6>] [<ffffffff8104d2e6>] __wake_up_common+0x56/0x90
[ 6625.429247] RSP: 0018:ffff88008f9e5de0 EFLAGS: 00010082
[ 6625.429247] RAX: ffffffffa0d963fa RBX: ffff880098e26f68 RCX: 0000000000000000
[ 6625.429247] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffffffa0d963fa
[ 6625.429247] RBP: ffff88008f9e5e20 R08: 0000000000000000 R09: 000000000000005c
[ 6625.429247] R10: 0000000000000001 R11: 0000000000000000 R12: c284e8fffb9889d0
[ 6625.429247] R13: 00000000ffffb502 R14: 0000000000000000 R15: 0000000000000000
[ 6625.429247] FS: 00007f7026720700(0000) GS:ffff880006240000(0000) knlGS:0000000000000000
[ 6625.429247] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 6625.429247] CR2: 000000000089ee20 CR3: 00000000905c0000 CR4: 00000000000006e0
[ 6625.429247] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6625.429247] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 6625.429247] Process ll_ping (pid: 4991, threadinfo ffff88008f9e4000, task ffff8800af3ea5c0)
[ 6625.429247] Stack:
[ 6625.429247] ffffffffffffffff 0000000300000001 0000000000000001 ffff880098e26f68
[ 6625.429247] <d> 0000000000000282 0000000000000003 0000000000000001 0000000000000000
[ 6625.429247] <d> ffff88008f9e5e60 ffffffff81051f68 ffff88008f9e5e40 0000000000000001
[ 6625.429247] Call Trace:
[ 6625.429247] [<ffffffff81051f68>] __wake_up+0x48/0x70
[ 6625.429247] [<ffffffffa0acf7fa>] cfs_waitq_signal+0x1a/0x20 [libcfs]
[ 6625.429247] [<ffffffffa0d5f73f>] ptlrpc_pinger_main+0x5cf/0x7f0 [ptlrpc]
[ 6625.429247] [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[ 6625.429247] [<ffffffffa0d5f170>] ? ptlrpc_pinger_main+0x0/0x7f0 [ptlrpc]
[ 6625.429247] [<ffffffff8100c14a>] child_rip+0xa/0x20
[ 6625.429247] [<ffffffffa0d5f170>] ? ptlrpc_pinger_main+0x0/0x7f0 [ptlrpc]
[ 6625.429247] [<ffffffffa0d5f170>] ? ptlrpc_pinger_main+0x0/0x7f0 [ptlrpc]
[ 6625.429247] [<ffffffff8100c140>] ? child_rip+0x0/0x20
[ 6625.429247] Code: e8 18 48 39 c7 4c 8b 60 18 74 3d 49 83 ec 18 eb 0b 0f 1f 40 00 4c 89 e0 4c 8d 62 e8 44 8b 28 4c 89 f1 44 89 fa 8b 75 cc 48 89 c7 <ff> 50 10 85 c0 74 0c 41 83 e5 01 74 06 83 6d c8 01 74 0a 4c 39
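For context on reading the trace: the faulting thread is ll_ping (the ptlrpc pinger), and it died inside __wake_up_common while walking a wait queue (note the garbage list pointer in R12, then the indirect call at the faulting <ff> 50 10 instruction), reached via cfs_waitq_signal from ptlrpc_pinger_main. That is consistent with the pinger signaling a wait queue embedded in an object that another thread freed concurrently, e.g. during the mount/umount cycling that conf-sanity test 22 performs. Purely as an illustration (all names below are hypothetical, not taken from the Lustre source), here is a minimal userspace sketch of that race pattern and of the lock-plus-liveness-check that closes it:

/*
 * Userspace analogue of the suspected race: a long-lived "pinger"
 * thread signals a condition variable embedded in a per-client
 * object while a teardown path may free that object concurrently.
 * Guarding both the signal and the unpublish with one lock, plus a
 * "live" flag, closes the window in which a waker would otherwise
 * walk freed memory (the kernel equivalent of this GPF).
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct client {                      /* stand-in for the object owning the waitq */
        pthread_cond_t waitq;        /* analogue of a kernel wait_queue_head_t   */
        bool           live;         /* cleared by teardown before the free      */
};

static struct client *the_client;
static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;

/* Pinger loop: signal the client's waitq once per "ping" interval. */
static void *pinger_main(void *arg)
{
        for (int i = 0; i < 5; i++) {
                pthread_mutex_lock(&registry_lock);
                struct client *cl = the_client;
                /* The UNSAFE version would signal cl->waitq without
                 * checking liveness; if teardown freed cl in between,
                 * that is exactly a use-after-free in the wakeup path. */
                if (cl != NULL && cl->live)
                        pthread_cond_signal(&cl->waitq);
                pthread_mutex_unlock(&registry_lock);
                usleep(1000);
        }
        return NULL;
}

/* Teardown: unpublish the client under the same lock, then free. */
static void client_destroy(struct client *cl)
{
        pthread_mutex_lock(&registry_lock);
        cl->live = false;
        the_client = NULL;           /* pinger can no longer find it */
        pthread_mutex_unlock(&registry_lock);
        /* Safe to free: no pinger iteration still holds a pointer. */
        pthread_cond_destroy(&cl->waitq);
        free(cl);
}

int main(void)
{
        struct client *cl = calloc(1, sizeof(*cl));
        pthread_cond_init(&cl->waitq, NULL);
        cl->live = true;
        the_client = cl;

        pthread_t pinger;
        pthread_create(&pinger, NULL, pinger_main, NULL);
        usleep(2500);                /* let a few "pings" happen      */
        client_destroy(cl);          /* races against the pinger loop */
        pthread_join(pinger, NULL);
        puts("teardown completed without touching freed memory");
        return 0;
}

The kernel-side equivalent is to make teardown unpublish the wait queue's owner under the same lock the pinger holds while signaling, so no wakeup can run against freed memory; whether that is the exact shape of the fix here is for the patch review below.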
The test output had this:
== conf-sanity test 22: start a client before osts (should return errs) == 03:20:55 (1364282455)
start mds service on centos6-14.localnet
Starting mds1: -o loop /tmp/lustre-mdt1 /mnt/mds1
Started lustre-MDT0000
Client mount with ost in logs, but none running
start ost1 service on centos6-14.localnet
Starting ost1: -o loop /tmp/lustre-ost1 /mnt/ost1
Started lustre-OST0000
error: list_param: /proc/{fs,sys}/{lnet,lustre}/osc/lustre-OST0000-osc-MDT0000/ost_server_uuid: Found no match
error: list_param: /proc/{fs,sys}/{lnet,lustre}/osc/lustre-OST0000-osc-MDT0000/ost_server_uuid: Found no match
Stopping client centos6-14.localnet /mnt/lustre (opts:)
PASS
Client mount with a running ost
start ost1 service on centos6-14.localnet
Starting ost1: -o loop /tmp/lustre-ost1 /mnt/ost1
Started lustre-OST0000
mount lustre on /mnt/lustre.....
Starting client: centos6-14.localnet: -o user_xattr,flock centos6-14.localnet@tcp:/lustre /mnt/lustre
centos6-14.localnet: osc.lustre-OST0000-osc-MDT0000.ost_server_uuid in FULL state after 14 sec
centos6-14.localnet: osc.lustre-OST0000-osc-MDT0000.ost_server_uuid in FULL state after 0 sec
setup single mount lustre success
PASS
umount lustre on /mnt/lustre.....
Stopping client centos6-14.localnet /mnt/lustre (opts:)
stop ost1 service on centos6-14.localnet
Stopping /mnt/ost1 (opts:-f) on centos6-14.localnet
waited 0 for 10 ST ost OSS OSS_uuid 0
Crashdump and modules are in /exports/crashdumps/192.168.10.224-2013-03-26-03\:21\:55/
I agree with the investigation. I'd like to propose a patch: http://review.whamcloud.com/6040