Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.3.0
-
None
-
3
-
4258
Description
Running with debug kernel, REFORMAT=yes SLOW=yes sh sanity.sh on latest master,
test 116 locks up reliably (3 out of 3 runs so far):
[ 1402.025523] Lustre: DEBUG MARKER: == sanity test 116: stripe QOS: free space balance ===================== 21:16:44 (1347326204) [ 1458.081053] LNet: Service thread pid 1041 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: [ 1458.083593] Pid: 1041, comm: mdt01_003 [ 1458.084187] [ 1458.084187] Call Trace: [ 1458.084851] [<ffffffffa0b0ed42>] ? mdt_handle_common+0x922/0x1740 [mdt] [ 1458.085891] [<ffffffffa0b0fc35>] mdt_regular_handle+0x15/0x20 [mdt] [ 1458.086909] [<ffffffffa048586f>] ptlrpc_server_handle_request+0x44f/0xee0 [ptlrpc] [ 1458.088097] [<ffffffffa0c9666e>] ? cfs_timer_arm+0xe/0x10 [libcfs] [ 1458.089098] [<ffffffffa047e711>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] [ 1458.090171] [<ffffffff81051f73>] ? __wake_up+0x53/0x70 [ 1458.091005] [<ffffffffa04883de>] ptlrpc_main+0xaee/0x1800 [ptlrpc] [ 1458.091991] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1458.092988] [<ffffffff8100c14a>] child_rip+0xa/0x20 [ 1458.093775] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1458.094750] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1458.095723] [<ffffffff8100c140>] ? child_rip+0x0/0x20 [ 1458.096542] [ 1458.096780] LustreError: dumping log to /tmp/lustre-log.1347326260.1041 [ 1500.096007] BUG: soft lockup - CPU#4 stuck for 67s! [mdt01_003:1041] [ 1500.096998] Modules linked in: lustre obdfilter ost cmm mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mds mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs ext2 exportfs jbd sha512_generic sha256_generic sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs] [ 1500.097004] CPU 4 [ 1500.097004] Modules linked in: lustre obdfilter ost cmm mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mds mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs ext2 exportfs jbd sha512_generic sha256_generic sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs] [ 1500.097004] [ 1500.097004] Pid: 1041, comm: mdt01_003 Not tainted 2.6.32-debug #3 Bochs Bochs [ 1500.097004] RIP: 0010:[<ffffffff8127db02>] [<ffffffff8127db02>] memmove+0x42/0x1a0 [ 1500.097004] RSP: 0018:ffff8802056dd498 EFLAGS: 00010282 [ 1500.097004] RAX: ffff880231b7c03c RBX: ffff8802056dd4e0 RCX: 00000000000000ee [ 1500.097004] RDX: fffffffffff7bfec RSI: ffff880231bfffe8 RDI: ffff880231bffffc [ 1500.097004] RBP: ffffffff8100bc0e R08: 0000000000000000 R09: 0000000000000000 [ 1500.097004] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88021d674004 [ 1500.097004] R13: ffff880231b7c028 R14: ffff880231b7c000 R15: 0000000000000002 [ 1500.097004] FS: 00007effccfcf700(0000) GS:ffff880028300000(0000) knlGS:0000000000000000 [ 1500.097004] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 1500.097004] CR2: ffff880231c00000 CR3: 000000025a3ed000 CR4: 00000000000006e0 [ 1500.097004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1500.097004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 1500.097004] Process mdt01_003 (pid: 1041, threadinfo ffff8802056dc000, task ffff88024430a380) [ 1500.097004] Stack: [ 1500.097004] ffffffffa0aab3a8 00000000000000ee 00000000000000ee ffff880294f9df58 [ 1500.097004] <d> ffff8802056dd640 ffff8802056dd608 ffff8802056dd6e8 0000000000000fd8 [ 1500.097004] <d> ffff88021d674000 ffff8802056dd500 ffffffffa0aab430 ffff88024430a380 [ 1500.097004] Call Trace: [ 1500.097004] [<ffffffffa0aab3a8>] ? iam_insert_key+0x68/0xb0 [osd_ldiskfs] [ 1500.097004] [<ffffffffa0aab430>] ? iam_insert_key_lock+0x40/0x50 [osd_ldiskfs] [ 1500.097004] [<ffffffffa0aae7ed>] ? iam_lfix_split+0x12d/0x150 [osd_ldiskfs] [ 1500.097004] [<ffffffffa0aadc8d>] ? iam_it_rec_insert+0x20d/0x300 [osd_ldiskfs] [ 1500.097004] [<ffffffffa0aade21>] ? iam_insert+0xa1/0xb0 [osd_ldiskfs] [ 1500.097004] [<ffffffffa0aa9467>] ? osd_oi_insert+0x1e7/0x5b0 [osd_ldiskfs] [ 1500.097004] [<ffffffffa0a9cef5>] ? __osd_oi_insert+0x145/0x1e0 [osd_ldiskfs] [ 1500.097004] [<ffffffffa0aa1d48>] ? osd_object_ea_create+0x1d8/0x460 [osd_ldiskfs] [ 1500.097004] [<ffffffffa09721dc>] ? mdd_object_create_internal+0x13c/0x2a0 [mdd] [ 1500.097004] [<ffffffffa0992aba>] ? mdd_create+0x16ba/0x20c0 [mdd] [ 1500.097004] [<ffffffffa0a9fd7f>] ? osd_xattr_get+0x9f/0x360 [osd_ldiskfs] [ 1500.097004] [<ffffffffa0bb3557>] ? cml_create+0x97/0x250 [cmm] [ 1500.097004] [<ffffffffa0b25d0f>] ? mdt_version_get_save+0x8f/0xd0 [mdt] [ 1500.097004] [<ffffffffa0b398bf>] ? mdt_reint_open+0x108f/0x18a0 [mdt] [ 1500.097004] [<ffffffffa099860e>] ? md_ucred+0x1e/0x60 [mdd] [ 1500.097004] [<ffffffffa0b071c5>] ? mdt_ucred+0x15/0x20 [mdt] [ 1500.097004] [<ffffffffa0b23081>] ? mdt_reint_rec+0x41/0xe0 [mdt] [ 1500.097004] [<ffffffffa0b1c42a>] ? mdt_reint_internal+0x50a/0x810 [mdt] [ 1500.097004] [<ffffffffa0b1c9fd>] ? mdt_intent_reint+0x1ed/0x500 [mdt] [ 1500.097004] [<ffffffffa0b19041>] ? mdt_intent_policy+0x371/0x6a0 [mdt] [ 1500.097004] [<ffffffffa042fb9a>] ? ldlm_lock_enqueue+0x2ea/0x890 [ptlrpc] [ 1500.097004] [<ffffffffa045744f>] ? ldlm_handle_enqueue0+0x48f/0xf70 [ptlrpc] [ 1500.097004] [<ffffffffa0b18ad6>] ? mdt_enqueue+0x46/0x130 [mdt] [ 1500.097004] [<ffffffffa0b0ed42>] ? mdt_handle_common+0x922/0x1740 [mdt] [ 1500.097004] [<ffffffffa0b0fc35>] ? mdt_regular_handle+0x15/0x20 [mdt] [ 1500.097004] [<ffffffffa048586f>] ? ptlrpc_server_handle_request+0x44f/0xee0 [ptlrpc] [ 1500.097004] [<ffffffffa0c9666e>] ? cfs_timer_arm+0xe/0x10 [libcfs] [ 1500.097004] [<ffffffffa047e711>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] [ 1500.097004] [<ffffffff81051f73>] ? __wake_up+0x53/0x70 [ 1500.097004] [<ffffffffa04883de>] ? ptlrpc_main+0xaee/0x1800 [ptlrpc] [ 1500.097004] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1500.097004] [<ffffffff8100c14a>] ? child_rip+0xa/0x20 [ 1500.097004] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1500.097004] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1500.097004] [<ffffffff8100c140>] ? child_rip+0x0/0x20 [ 1500.097004] Code: d0 49 39 f8 0f 8f 9f 00 00 00 48 81 fa a8 02 00 00 72 05 40 38 fe 74 41 48 83 ea 20 48 83 ea 20 4c 8b 1e 4c 8b 56 08 4c 8b 4e 10 <4c> 8b 46 18 48 8d 76 20 4c 89 1f 4c 89 57 08 4c 89 4f 10 4c 89 ... [ 1560.492304] INFO: task jbd2/loop0-8:32349 blocked for more than 120 seconds. [ 1560.493247] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1560.494302] jbd2/loop0-8 D 0000000000000002 5152 32349 2 0x00000000 [ 1560.495213] ffff88026046dd10 0000000000000046 00000000000167c0 00000000000167c0 [ 1560.496212] ffff880028310960 00000000000167c0 00000000000167c0 0000000000000286 [ 1560.497246] ffff8802481ce738 ffff88026046dfd8 000000000000fba8 ffff8802481ce738 [ 1560.498252] Call Trace: [ 1560.498574] [<ffffffff8109004e>] ? prepare_to_wait+0x4e/0x80 [ 1560.499316] [<ffffffffa0076afd>] jbd2_journal_commit_transaction+0x19d/0x16e0 [jbd2] [ 1560.500335] [<ffffffff81009310>] ? __switch_to+0xd0/0x320 [ 1560.501048] [<ffffffff814fc4ae>] ? _spin_unlock_irq+0xe/0x20 [ 1560.501784] [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40 [ 1560.502618] [<ffffffffa007d627>] kjournald2+0xb7/0x210 [jbd2] [ 1560.503379] [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40 [ 1560.504209] [<ffffffffa007d570>] ? kjournald2+0x0/0x210 [jbd2] [ 1560.504993] [<ffffffff8108fa16>] kthread+0x96/0xa0 [ 1560.505631] [<ffffffff8100c14a>] child_rip+0xa/0x20 [ 1560.506274] [<ffffffff8108f980>] ? kthread+0x0/0xa0 [ 1560.506919] [<ffffffff8100c140>] ? child_rip+0x0/0x20
On first run I also hit a disk corruption in that same test before soft lockup:
[ 4158.782961] Lustre: DEBUG MARKER: == sanity test 116: stripe QOS: free space balance ===================== 20:44:10 (1347324250) [ 4178.796623] LDISKFS-fs error (device loop2): file system corruption: inode #8 logical block 3108 mapped to 7268 (size 1) [ 4178.798364] Aborting journal on device loop2-8.
Also the ost2 fs state changed:
/dev/loop2 64Z 64Z 159M 100% /mnt/ost2
After a reboot I did fsck on mdt fs and found this:
[root@rhel6 tests]# e2fsck -f -n /tmp/lustre-mdt1 e2fsck 1.41.90.wc1 (18-Mar-2011) Pass 1: Checking inodes, blocks, and sizes Inode 33445 is a zero-length directory. Clear? no Inode 33445, i_size is 4096, should be 0. Fix? no Pass 2: Checking directory structure Pass 3: Checking directory connectivity Unconnected directory inode 33445 (...) Connect to /lost+found? no Pass 4: Checking reference counts Unattached inode 33445 Connect to /lost+found? no Pass 5: Checking group summary information lustre-MDT0000: ********** WARNING: Filesystem still has errors **********
Attachments
Issue Links
- is duplicated by
-
LU-1906 performance-sanity subtest test_3: Oops: RIP: __find_get_block_slow+0x87/0x130
- Resolved
-
LU-1909 performance-sanity subtest test_3: Oops: RIP: __do_softirq+0x73/0x1e0
- Resolved
-
LU-1925 kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
- Resolved
-
LU-1927 large-scale subtest test_3a: Oops: RIP: _spin_lock+0xe/0x30
- Resolved
-
LU-1928 parallel-scale-nfsv3 subtest test_compilebench: Oops: RIP: put_page+0x9/0x40
- Resolved
-
LU-1929 performance-sanity subtest test_3: list_add corruption
- Resolved
-
LU-1968 1.8.8<->2.3 Test failure on test suite parallel-scale-nfsv4, subtest test_compilebench
- Resolved