[LU-1881] sanity test 116 soft lockup Created: 10/Sep/12 Updated: 03/Dec/12 Resolved: 19/Sep/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0 |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Oleg Drokin | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||||||||||||||
| Rank (Obsolete): | 4258 | ||||||||||||||||||||||||||||||||
| Description |
|
Running with debug kernel, REFORMAT=yes SLOW=yes sh sanity.sh on latest master, [ 1402.025523] Lustre: DEBUG MARKER: == sanity test 116: stripe QOS: free space balance ===================== 21:16:44 (1347326204) [ 1458.081053] LNet: Service thread pid 1041 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: [ 1458.083593] Pid: 1041, comm: mdt01_003 [ 1458.084187] [ 1458.084187] Call Trace: [ 1458.084851] [<ffffffffa0b0ed42>] ? mdt_handle_common+0x922/0x1740 [mdt] [ 1458.085891] [<ffffffffa0b0fc35>] mdt_regular_handle+0x15/0x20 [mdt] [ 1458.086909] [<ffffffffa048586f>] ptlrpc_server_handle_request+0x44f/0xee0 [ptlrpc] [ 1458.088097] [<ffffffffa0c9666e>] ? cfs_timer_arm+0xe/0x10 [libcfs] [ 1458.089098] [<ffffffffa047e711>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] [ 1458.090171] [<ffffffff81051f73>] ? __wake_up+0x53/0x70 [ 1458.091005] [<ffffffffa04883de>] ptlrpc_main+0xaee/0x1800 [ptlrpc] [ 1458.091991] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1458.092988] [<ffffffff8100c14a>] child_rip+0xa/0x20 [ 1458.093775] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1458.094750] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1458.095723] [<ffffffff8100c140>] ? child_rip+0x0/0x20 [ 1458.096542] [ 1458.096780] LustreError: dumping log to /tmp/lustre-log.1347326260.1041 [ 1500.096007] BUG: soft lockup - CPU#4 stuck for 67s! [mdt01_003:1041] [ 1500.096998] Modules linked in: lustre obdfilter ost cmm mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mds mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs ext2 exportfs jbd sha512_generic sha256_generic sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs] [ 1500.097004] CPU 4 [ 1500.097004] Modules linked in: lustre obdfilter ost cmm mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mds mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs ext2 exportfs jbd sha512_generic sha256_generic sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs] [ 1500.097004] [ 1500.097004] Pid: 1041, comm: mdt01_003 Not tainted 2.6.32-debug #3 Bochs Bochs [ 1500.097004] RIP: 0010:[<ffffffff8127db02>] [<ffffffff8127db02>] memmove+0x42/0x1a0 [ 1500.097004] RSP: 0018:ffff8802056dd498 EFLAGS: 00010282 [ 1500.097004] RAX: ffff880231b7c03c RBX: ffff8802056dd4e0 RCX: 00000000000000ee [ 1500.097004] RDX: fffffffffff7bfec RSI: ffff880231bfffe8 RDI: ffff880231bffffc [ 1500.097004] RBP: ffffffff8100bc0e R08: 0000000000000000 R09: 0000000000000000 [ 1500.097004] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88021d674004 [ 1500.097004] R13: ffff880231b7c028 R14: ffff880231b7c000 R15: 0000000000000002 [ 1500.097004] FS: 00007effccfcf700(0000) GS:ffff880028300000(0000) knlGS:0000000000000000 [ 1500.097004] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 1500.097004] CR2: ffff880231c00000 CR3: 000000025a3ed000 CR4: 00000000000006e0 [ 1500.097004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1500.097004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 1500.097004] Process mdt01_003 (pid: 1041, threadinfo ffff8802056dc000, task ffff88024430a380) [ 1500.097004] Stack: [ 1500.097004] ffffffffa0aab3a8 00000000000000ee 00000000000000ee ffff880294f9df58 [ 1500.097004] <d> ffff8802056dd640 ffff8802056dd608 ffff8802056dd6e8 0000000000000fd8 [ 1500.097004] <d> ffff88021d674000 ffff8802056dd500 ffffffffa0aab430 ffff88024430a380 [ 1500.097004] Call Trace: [ 1500.097004] [<ffffffffa0aab3a8>] ? iam_insert_key+0x68/0xb0 [osd_ldiskfs] [ 1500.097004] [<ffffffffa0aab430>] ? iam_insert_key_lock+0x40/0x50 [osd_ldiskfs] [ 1500.097004] [<ffffffffa0aae7ed>] ? iam_lfix_split+0x12d/0x150 [osd_ldiskfs] [ 1500.097004] [<ffffffffa0aadc8d>] ? iam_it_rec_insert+0x20d/0x300 [osd_ldiskfs] [ 1500.097004] [<ffffffffa0aade21>] ? iam_insert+0xa1/0xb0 [osd_ldiskfs] [ 1500.097004] [<ffffffffa0aa9467>] ? osd_oi_insert+0x1e7/0x5b0 [osd_ldiskfs] [ 1500.097004] [<ffffffffa0a9cef5>] ? __osd_oi_insert+0x145/0x1e0 [osd_ldiskfs] [ 1500.097004] [<ffffffffa0aa1d48>] ? osd_object_ea_create+0x1d8/0x460 [osd_ldiskfs] [ 1500.097004] [<ffffffffa09721dc>] ? mdd_object_create_internal+0x13c/0x2a0 [mdd] [ 1500.097004] [<ffffffffa0992aba>] ? mdd_create+0x16ba/0x20c0 [mdd] [ 1500.097004] [<ffffffffa0a9fd7f>] ? osd_xattr_get+0x9f/0x360 [osd_ldiskfs] [ 1500.097004] [<ffffffffa0bb3557>] ? cml_create+0x97/0x250 [cmm] [ 1500.097004] [<ffffffffa0b25d0f>] ? mdt_version_get_save+0x8f/0xd0 [mdt] [ 1500.097004] [<ffffffffa0b398bf>] ? mdt_reint_open+0x108f/0x18a0 [mdt] [ 1500.097004] [<ffffffffa099860e>] ? md_ucred+0x1e/0x60 [mdd] [ 1500.097004] [<ffffffffa0b071c5>] ? mdt_ucred+0x15/0x20 [mdt] [ 1500.097004] [<ffffffffa0b23081>] ? mdt_reint_rec+0x41/0xe0 [mdt] [ 1500.097004] [<ffffffffa0b1c42a>] ? mdt_reint_internal+0x50a/0x810 [mdt] [ 1500.097004] [<ffffffffa0b1c9fd>] ? mdt_intent_reint+0x1ed/0x500 [mdt] [ 1500.097004] [<ffffffffa0b19041>] ? mdt_intent_policy+0x371/0x6a0 [mdt] [ 1500.097004] [<ffffffffa042fb9a>] ? ldlm_lock_enqueue+0x2ea/0x890 [ptlrpc] [ 1500.097004] [<ffffffffa045744f>] ? ldlm_handle_enqueue0+0x48f/0xf70 [ptlrpc] [ 1500.097004] [<ffffffffa0b18ad6>] ? mdt_enqueue+0x46/0x130 [mdt] [ 1500.097004] [<ffffffffa0b0ed42>] ? mdt_handle_common+0x922/0x1740 [mdt] [ 1500.097004] [<ffffffffa0b0fc35>] ? mdt_regular_handle+0x15/0x20 [mdt] [ 1500.097004] [<ffffffffa048586f>] ? ptlrpc_server_handle_request+0x44f/0xee0 [ptlrpc] [ 1500.097004] [<ffffffffa0c9666e>] ? cfs_timer_arm+0xe/0x10 [libcfs] [ 1500.097004] [<ffffffffa047e711>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] [ 1500.097004] [<ffffffff81051f73>] ? __wake_up+0x53/0x70 [ 1500.097004] [<ffffffffa04883de>] ? ptlrpc_main+0xaee/0x1800 [ptlrpc] [ 1500.097004] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1500.097004] [<ffffffff8100c14a>] ? child_rip+0xa/0x20 [ 1500.097004] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1500.097004] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1500.097004] [<ffffffff8100c140>] ? child_rip+0x0/0x20 [ 1500.097004] Code: d0 49 39 f8 0f 8f 9f 00 00 00 48 81 fa a8 02 00 00 72 05 40 38 fe 74 41 48 83 ea 20 48 83 ea 20 4c 8b 1e 4c 8b 56 08 4c 8b 4e 10 <4c> 8b 46 18 48 8d 76 20 4c 89 1f 4c 89 57 08 4c 89 4f 10 4c 89 ... [ 1560.492304] INFO: task jbd2/loop0-8:32349 blocked for more than 120 seconds. [ 1560.493247] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1560.494302] jbd2/loop0-8 D 0000000000000002 5152 32349 2 0x00000000 [ 1560.495213] ffff88026046dd10 0000000000000046 00000000000167c0 00000000000167c0 [ 1560.496212] ffff880028310960 00000000000167c0 00000000000167c0 0000000000000286 [ 1560.497246] ffff8802481ce738 ffff88026046dfd8 000000000000fba8 ffff8802481ce738 [ 1560.498252] Call Trace: [ 1560.498574] [<ffffffff8109004e>] ? prepare_to_wait+0x4e/0x80 [ 1560.499316] [<ffffffffa0076afd>] jbd2_journal_commit_transaction+0x19d/0x16e0 [jbd2] [ 1560.500335] [<ffffffff81009310>] ? __switch_to+0xd0/0x320 [ 1560.501048] [<ffffffff814fc4ae>] ? _spin_unlock_irq+0xe/0x20 [ 1560.501784] [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40 [ 1560.502618] [<ffffffffa007d627>] kjournald2+0xb7/0x210 [jbd2] [ 1560.503379] [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40 [ 1560.504209] [<ffffffffa007d570>] ? kjournald2+0x0/0x210 [jbd2] [ 1560.504993] [<ffffffff8108fa16>] kthread+0x96/0xa0 [ 1560.505631] [<ffffffff8100c14a>] child_rip+0xa/0x20 [ 1560.506274] [<ffffffff8108f980>] ? kthread+0x0/0xa0 [ 1560.506919] [<ffffffff8100c140>] ? child_rip+0x0/0x20 On first run I also hit a disk corruption in that same test before soft lockup: [ 4158.782961] Lustre: DEBUG MARKER: == sanity test 116: stripe QOS: free space balance ===================== 20:44:10 (1347324250) [ 4178.796623] LDISKFS-fs error (device loop2): file system corruption: inode #8 logical block 3108 mapped to 7268 (size 1) [ 4178.798364] Aborting journal on device loop2-8. Also the ost2 fs state changed: After a reboot I did fsck on mdt fs and found this: [root@rhel6 tests]# e2fsck -f -n /tmp/lustre-mdt1 e2fsck 1.41.90.wc1 (18-Mar-2011) Pass 1: Checking inodes, blocks, and sizes Inode 33445 is a zero-length directory. Clear? no Inode 33445, i_size is 4096, should be 0. Fix? no Pass 2: Checking directory structure Pass 3: Checking directory connectivity Unconnected directory inode 33445 (...) Connect to /lost+found? no Pass 4: Checking reference counts Unattached inode 33445 Connect to /lost+found? no Pass 5: Checking group summary information lustre-MDT0000: ********** WARNING: Filesystem still has errors ********** |
| Comments |
| Comment by Oleg Drokin [ 10/Sep/12 ] |
|
Another bunch of traces from a third run: [ 1411.405290] LNet: Service thread pid 3559 was inactive for 62.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: [ 1411.407862] Pid: 3559, comm: mdt01_003 [ 1411.408447] [ 1411.408447] Call Trace: [ 1411.409101] [<ffffffffa0b0ed42>] ? mdt_handle_common+0x922/0x1740 [mdt] [ 1411.410122] [<ffffffffa0b0fc35>] mdt_regular_handle+0x15/0x20 [mdt] [ 1411.411112] [<ffffffffa048586f>] ptlrpc_server_handle_request+0x44f/0xee0 [ptlrpc] [ 1411.412295] [<ffffffffa0c9666e>] ? cfs_timer_arm+0xe/0x10 [libcfs] [ 1411.413300] [<ffffffffa047e711>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] [ 1411.414331] [<ffffffff81051f73>] ? __wake_up+0x53/0x70 [ 1411.415154] [<ffffffffa04883de>] ptlrpc_main+0xaee/0x1800 [ptlrpc] [ 1411.416136] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1411.417208] [<ffffffff8100c14a>] child_rip+0xa/0x20 [ 1411.418228] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1411.419448] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1411.420654] [<ffffffff8100c140>] ? child_rip+0x0/0x20 [ 1411.421667] [ 1411.421987] LustreError: dumping log to /tmp/lustre-log.1347328442.3559 [ 1420.533547] LNet: Service thread pid 2359 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: [ 1420.536821] Pid: 2359, comm: ll_ost_io01_001 [ 1420.537654] [ 1420.537655] Call Trace: [ 1420.538463] [<ffffffff81125463>] ? __alloc_pages_nodemask+0x123/0x9e0 [ 1420.539739] [<ffffffff8109004e>] ? prepare_to_wait+0x4e/0x80 [ 1420.540909] [<ffffffffa0076335>] do_get_write_access+0x2b5/0x550 [jbd2] [ 1420.542246] [<ffffffff8108fda0>] ? wake_bit_function+0x0/0x50 [ 1420.543397] [<ffffffffa0076751>] jbd2_journal_get_write_access+0x31/0x50 [jbd2] [ 1420.544875] [<ffffffffa09f54b8>] __ldiskfs_journal_get_write_access+0x38/0x80 [ldiskfs] [ 1420.546465] [<ffffffffa0a09442>] ldiskfs_mb_mark_diskspace_used+0xf2/0x300 [ldiskfs] [ 1420.548030] [<ffffffffa0a10e2f>] ldiskfs_mb_new_blocks+0x2af/0x5b0 [ldiskfs] [ 1420.549478] [<ffffffffa09f753e>] ? ldiskfs_ext_find_extent+0x2ce/0x330 [ldiskfs] [ 1420.550966] [<ffffffffa0a7a1da>] ldiskfs_ext_new_extent_cb+0x59a/0x6d0 [fsfilt_ldiskfs] [ 1420.552845] [<ffffffffa09f76ef>] ldiskfs_ext_walk_space+0x14f/0x340 [ldiskfs] [ 1420.554682] [<ffffffffa0a79c40>] ? ldiskfs_ext_new_extent_cb+0x0/0x6d0 [fsfilt_ldiskfs] [ 1420.556445] [<ffffffffa0a79968>] fsfilt_map_nblocks+0xd8/0x100 [fsfilt_ldiskfs] [ 1420.557642] [<ffffffffa0a79aa3>] fsfilt_ldiskfs_map_ext_inode_pages+0x113/0x220 [fsfilt_ldiskfs] [ 1420.559032] [<ffffffff814fa75e>] ? mutex_unlock+0xe/0x10 [ 1420.559858] [<ffffffffa0a79c35>] fsfilt_ldiskfs_map_inode_pages+0x85/0x90 [fsfilt_ldiskfs] [ 1420.561187] [<ffffffffa05cae3b>] filter_alloc_iobuf+0x8fb/0x11f0 [obdfilter] [ 1420.562304] [<ffffffffa05cc9ec>] filter_commitrw_write+0x12bc/0x2eb8 [obdfilter] [ 1420.563443] [<ffffffff8116145a>] ? cache_alloc_debugcheck_after+0x14a/0x210 [ 1420.564548] [<ffffffff8116175e>] ? cache_free_debugcheck+0x1be/0x360 [ 1420.565574] [<ffffffff81160af6>] ? kfree_debugcheck+0x16/0x40 [ 1420.566476] [<ffffffff8116175e>] ? cache_free_debugcheck+0x1be/0x360 [ 1420.567469] [<ffffffffa05bfea5>] filter_commitrw+0x285/0x2b0 [obdfilter] [ 1420.568538] [<ffffffff814fc4fe>] ? _spin_unlock+0xe/0x10 [ 1420.569390] [<ffffffffa0be3bc8>] obd_commitrw+0x128/0x3d0 [ost] [ 1420.570316] [<ffffffffa0beb1e9>] ost_brw_write+0xd29/0x1610 [ost] [ 1420.571276] [<ffffffff8127c326>] ? vsnprintf+0x2b6/0x5f0 [ 1420.572146] [<ffffffffa0437fa0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc] [ 1420.573215] [<ffffffffa0bf0c26>] ost_handle+0x3096/0x4320 [ost] [ 1420.574175] [<ffffffffa0ca23f4>] ? libcfs_id2str+0x74/0xb0 [libcfs] [ 1420.575154] [<ffffffffa048586f>] ptlrpc_server_handle_request+0x44f/0xee0 [ptlrpc] [ 1420.576324] [<ffffffffa0c9666e>] ? cfs_timer_arm+0xe/0x10 [libcfs] [ 1420.577314] [<ffffffffa047e711>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] [ 1420.578359] [<ffffffff81051f73>] ? __wake_up+0x53/0x70 [ 1420.579165] [<ffffffffa04883de>] ptlrpc_main+0xaee/0x1800 [ptlrpc] [ 1420.580147] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1420.581128] [<ffffffff8100c14a>] child_rip+0xa/0x20 [ 1420.581942] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1420.582895] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1420.583854] [<ffffffff8100c140>] ? child_rip+0x0/0x20 [ 1420.584658] [ 1420.584901] LustreError: dumping log to /tmp/lustre-log.1347328451.2359 [ 1420.586271] Pid: 6961, comm: ll_ost_io00_005 [ 1420.586940] [ 1420.586940] Call Trace: [ 1420.587546] [<ffffffff8108fd4f>] ? wake_up_bit+0x2f/0x40 [ 1420.588372] [<ffffffff8109004e>] ? prepare_to_wait+0x4e/0x80 [ 1420.589294] [<ffffffffa0076335>] do_get_write_access+0x2b5/0x550 [jbd2] [ 1420.590334] [<ffffffff8108fda0>] ? wake_bit_function+0x0/0x50 [ 1420.591233] [<ffffffffa09ffee6>] ? ldiskfs_mark_iloc_dirty+0x376/0x5d0 [ldiskfs] [ 1420.592390] [<ffffffffa0076751>] jbd2_journal_get_write_access+0x31/0x50 [jbd2] [ 1420.593581] [<ffffffffa09f54b8>] __ldiskfs_journal_get_write_access+0x38/0x80 [ldiskfs] [ 1420.594845] [<ffffffffa0a09442>] ldiskfs_mb_mark_diskspace_used+0xf2/0x300 [ldiskfs] [ 1420.596048] [<ffffffff814fc4fe>] ? _spin_unlock+0xe/0x10 [ 1420.596887] [<ffffffffa0a10e2f>] ldiskfs_mb_new_blocks+0x2af/0x5b0 [ldiskfs] [ 1420.598071] [<ffffffffa09f753e>] ? ldiskfs_ext_find_extent+0x2ce/0x330 [ldiskfs] [ 1420.599223] [<ffffffffa0a7a1da>] ldiskfs_ext_new_extent_cb+0x59a/0x6d0 [fsfilt_ldiskfs] [ 1420.600481] [<ffffffffa09f76ef>] ldiskfs_ext_walk_space+0x14f/0x340 [ldiskfs] [ 1420.601621] [<ffffffffa0a79c40>] ? ldiskfs_ext_new_extent_cb+0x0/0x6d0 [fsfilt_ldiskfs] [ 1420.602892] [<ffffffffa0a79968>] fsfilt_map_nblocks+0xd8/0x100 [fsfilt_ldiskfs] [ 1420.604051] [<ffffffffa0a79aa3>] fsfilt_ldiskfs_map_ext_inode_pages+0x113/0x220 [fsfilt_ldiskfs] [ 1420.605478] [<ffffffff814fa75e>] ? mutex_unlock+0xe/0x10 [ 1420.606305] [<ffffffffa0a79c35>] fsfilt_ldiskfs_map_inode_pages+0x85/0x90 [fsfilt_ldiskfs] [ 1420.607590] [<ffffffffa05cae3b>] filter_alloc_iobuf+0x8fb/0x11f0 [obdfilter] [ 1420.608693] [<ffffffffa05cc9ec>] filter_commitrw_write+0x12bc/0x2eb8 [obdfilter] [ 1420.609845] [<ffffffff8123d76c>] ? crypto_create_tfm+0x3c/0xe0 [ 1420.610749] [<ffffffff8116145a>] ? cache_alloc_debugcheck_after+0x14a/0x210 [ 1420.611831] [<ffffffff8116175e>] ? cache_free_debugcheck+0x1be/0x360 [ 1420.612856] [<ffffffff81160af6>] ? kfree_debugcheck+0x16/0x40 [ 1420.613800] [<ffffffff8116175e>] ? cache_free_debugcheck+0x1be/0x360 [ 1420.614783] [<ffffffffa05bfea5>] filter_commitrw+0x285/0x2b0 [obdfilter] [ 1420.615837] [<ffffffff814fc4fe>] ? _spin_unlock+0xe/0x10 [ 1420.616704] [<ffffffffa0be3bc8>] obd_commitrw+0x128/0x3d0 [ost] [ 1420.617658] [<ffffffffa0beb1e9>] ost_brw_write+0xd29/0x1610 [ost] [ 1420.618611] [<ffffffff8127c326>] ? vsnprintf+0x2b6/0x5f0 [ 1420.619467] [<ffffffffa0437fa0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc] [ 1420.620536] [<ffffffffa0bf0c26>] ost_handle+0x3096/0x4320 [ost] [ 1420.621512] [<ffffffffa0ca23f4>] ? libcfs_id2str+0x74/0xb0 [libcfs] [ 1420.622518] [<ffffffffa048586f>] ptlrpc_server_handle_request+0x44f/0xee0 [ptlrpc] [ 1420.623691] [<ffffffffa0c9666e>] ? cfs_timer_arm+0xe/0x10 [libcfs] [ 1420.624702] [<ffffffffa047e711>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] [ 1420.625752] [<ffffffff81051f73>] ? __wake_up+0x53/0x70 [ 1420.626577] [<ffffffffa04883de>] ptlrpc_main+0xaee/0x1800 [ptlrpc] [ 1420.627560] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1420.628584] [<ffffffff8100c14a>] child_rip+0xa/0x20 [ 1420.629406] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1420.630398] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1420.631363] [<ffffffff8100c140>] ? child_rip+0x0/0x20 [ 1420.632186] [ 1420.632421] Pid: 3346, comm: ll_ost_io00_003 [ 1420.633119] [ 1420.633119] Call Trace: [ 1420.633737] [<ffffffffa09f527b>] ? __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs] [ 1420.635001] [<ffffffffa09ffee6>] ? ldiskfs_mark_iloc_dirty+0x376/0x5d0 [ldiskfs] [ 1420.636144] [<ffffffff814fabd8>] __mutex_lock_slowpath+0x128/0x2c0 [ 1420.637132] [<ffffffff814fada1>] mutex_lock+0x31/0x50 [ 1420.637963] [<ffffffffa0a08f0b>] ldiskfs_mb_initialize_context+0x17b/0x1f0 [ldiskfs] [ 1420.639154] [<ffffffffa0a10d09>] ldiskfs_mb_new_blocks+0x189/0x5b0 [ldiskfs] [ 1420.640253] [<ffffffffa09f753e>] ? ldiskfs_ext_find_extent+0x2ce/0x330 [ldiskfs] [ 1420.641422] [<ffffffffa0a7a1da>] ldiskfs_ext_new_extent_cb+0x59a/0x6d0 [fsfilt_ldiskfs] [ 1420.642677] [<ffffffffa09f76ef>] ldiskfs_ext_walk_space+0x14f/0x340 [ldiskfs] [ 1420.643787] [<ffffffffa0a79c40>] ? ldiskfs_ext_new_extent_cb+0x0/0x6d0 [fsfilt_ldiskfs] [ 1420.645080] [<ffffffffa0a79968>] fsfilt_map_nblocks+0xd8/0x100 [fsfilt_ldiskfs] [ 1420.646250] [<ffffffffa0a79aa3>] fsfilt_ldiskfs_map_ext_inode_pages+0x113/0x220 [fsfilt_ldiskfs] [ 1420.647617] [<ffffffff814fa75e>] ? mutex_unlock+0xe/0x10 [ 1420.648465] [<ffffffffa0a79c35>] fsfilt_ldiskfs_map_inode_pages+0x85/0x90 [f sfilt_ldiskfs] [ 1420.649777] [<ffffffffa05cae3b>] filter_alloc_iobuf+0x8fb/0x11f0 [obdfilter] [ 1420.650881] [<ffffffffa05cc9ec>] filter_commitrw_write+0x12bc/0x2eb8 [obdfilter] [ 1420.652058] [<ffffffff8123d76c>] ? crypto_create_tfm+0x3c/0xe0 [ 1420.652994] [<ffffffff8116145a>] ? cache_alloc_debugcheck_after+0x14a/0x210 [ 1420.654112] [<ffffffff8116175e>] ? cache_free_debugcheck+0x1be/0x360 [ 1420.655108] [<ffffffff81160af6>] ? kfree_debugcheck+0x16/0x40 [ 1420.656015] [<ffffffff8116175e>] ? cache_free_debugcheck+0x1be/0x360 [ 1420.657029] [<ffffffffa05bfea5>] filter_commitrw+0x285/0x2b0 [obdfilter] [ 1420.658070] [<ffffffff814fc4fe>] ? _spin_unlock+0xe/0x10 [ 1420.658894] [<ffffffffa0be3bc8>] obd_commitrw+0x128/0x3d0 [ost] [ 1420.659819] [<ffffffffa0beb1e9>] ost_brw_write+0xd29/0x1610 [ost] [ 1420.660789] [<ffffffff8127c326>] ? vsnprintf+0x2b6/0x5f0 [ 1420.661670] [<ffffffffa0437fa0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc] [ 1420.662739] [<ffffffffa0bf0c26>] ost_handle+0x3096/0x4320 [ost] [ 1420.663682] [<ffffffffa0ca23f4>] ? libcfs_id2str+0x74/0xb0 [libcfs] [ 1420.664700] [<ffffffffa048586f>] ptlrpc_server_handle_request+0x44f/0xee0 [ptlrpc] [ 1420.665900] [<ffffffffa0c9666e>] ? cfs_timer_arm+0xe/0x10 [libcfs] [ 1420.666879] [<ffffffffa047e711>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] [ 1420.667936] [<ffffffff81051f73>] ? __wake_up+0x53/0x70 [ 1420.668771] [<ffffffffa04883de>] ptlrpc_main+0xaee/0x1800 [ptlrpc] [ 1420.669792] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1420.670767] [<ffffffff8100c14a>] child_rip+0xa/0x20 [ 1420.671562] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1420.672569] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1420.673537] [<ffffffff8100c140>] ? child_rip+0x0/0x20 [ 1420.674325] [ 1420.674562] Pid: 2355, comm: ll_ost_io00_000 [ 1420.675240] [ 1420.675240] Call Trace: [ 1420.675856] [<ffffffffa0076335>] do_get_write_access+0x2b5/0x550 [jbd2] [ 1420.676914] [<ffffffff8108fda0>] ? wake_bit_function+0x0/0x50 [ 1420.677841] [<ffffffffa09ffee6>] ? ldiskfs_mark_iloc_dirty+0x376/0x5d0 [ldiskfs] [ 1420.679019] [<ffffffffa0076751>] jbd2_journal_get_write_access+0x31/0x50 [jbd2] [ 1420.680163] [<ffffffffa09f54b8>] __ldiskfs_journal_get_write_access+0x38/0x80 [ldiskfs] [ 1420.681444] [<ffffffffa0a09442>] ldiskfs_mb_mark_diskspace_used+0xf2/0x300 [ldiskfs] [ 1420.682678] [<ffffffff814fc4fe>] ? _spin_unlock+0xe/0x10 [ 1420.683532] [<ffffffffa0a10e2f>] ldiskfs_mb_new_blocks+0x2af/0x5b0 [ldiskfs] [ 1420.684649] [<ffffffffa09f753e>] ? ldiskfs_ext_find_extent+0x2ce/0x330 [ldiskfs] [ 1420.685832] [<ffffffffa0a7a1da>] ldiskfs_ext_new_extent_cb+0x59a/0x6d0 [fsfilt_ldiskfs] [ 1420.687092] [<ffffffffa09f76ef>] ldiskfs_ext_walk_space+0x14f/0x340 [ldiskfs] [ 1420.688206] [<ffffffffa0a79c40>] ? ldiskfs_ext_new_extent_cb+0x0/0x6d0 [fsfilt_ldiskfs] [ 1420.689484] [<ffffffffa0a79968>] fsfilt_map_nblocks+0xd8/0x100 [fsfilt_ldiskfs] [ 1420.690618] [<ffffffffa0a79aa3>] fsfilt_ldiskfs_map_ext_inode_pages+0x113/0x220 [fsfilt_ldiskfs] [ 1420.691987] [<ffffffff814fa75e>] ? mutex_unlock+0xe/0x10 [ 1420.692818] [<ffffffffa0a79c35>] fsfilt_ldiskfs_map_inode_pages+0x85/0x90 [fsfilt_ldiskfs] [ 1420.694142] [<ffffffffa05cae3b>] filter_alloc_iobuf+0x8fb/0x11f0 [obdfilter] [ 1420.695230] [<ffffffffa05cc9ec>] filter_commitrw_write+0x12bc/0x2eb8 [obdfilter] [ 1420.696373] [<ffffffff8123d76c>] ? crypto_create_tfm+0x3c/0xe0 [ 1420.697323] [<ffffffff8116145a>] ? cache_alloc_debugcheck_after+0x14a/0x210 [ 1420.698425] [<ffffffff8116175e>] ? cache_free_debugcheck+0x1be/0x360 [ 1420.699417] [<ffffffff81160af6>] ? kfree_debugcheck+0x16/0x40 [ 1420.700320] [<ffffffff8116175e>] ? cache_free_debugcheck+0x1be/0x360 [ 1420.701364] [<ffffffffa05bfea5>] filter_commitrw+0x285/0x2b0 [obdfilter] [ 1420.702431] [<ffffffff814fc4fe>] ? _spin_unlock+0xe/0x10 [ 1420.703289] [<ffffffffa0be3bc8>] obd_commitrw+0x128/0x3d0 [ost] [ 1420.704216] [<ffffffffa0beb1e9>] ost_brw_write+0xd29/0x1610 [ost] [ 1420.705192] [<ffffffff8127c326>] ? vsnprintf+0x2b6/0x5f0 [ 1420.706063] [<ffffffffa0437fa0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc] [ 1420.707114] [<ffffffffa0bf0c26>] ost_handle+0x3096/0x4320 [ost] [ 1420.708055] [<ffffffffa0ca23f4>] ? libcfs_id2str+0x74/0xb0 [libcfs] [ 1420.709082] [<ffffffffa048586f>] ptlrpc_server_handle_request+0x44f/0xee0 [ptlrpc] [ 1420.710272] [<ffffffffa0c9666e>] ? cfs_timer_arm+0xe/0x10 [libcfs] [ 1420.711248] [<ffffffffa047e711>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] [ 1420.712314] [<ffffffff81051f73>] ? __wake_up+0x53/0x70 [ 1420.713165] [<ffffffffa04883de>] ptlrpc_main+0xaee/0x1800 [ptlrpc] [ 1420.714148] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1420.715099] [<ffffffff8100c14a>] child_rip+0xa/0x20 [ 1420.715871] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1420.716895] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1420.717902] [<ffffffff8100c140>] ? child_rip+0x0/0x20 [ 1420.718689] [ 1420.718942] LNet: Service thread pid 2357 was inactive for 40.18s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. ... [ 1512.104006] BUG: soft lockup - CPU#5 stuck for 67s! [mdt01_003:3559] [ 1512.104819] Modules linked in: lustre obdfilter ost cmm mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mds mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs ext2 exportfs jbd sha512_generic sha256_generic sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs] [ 1512.105255] CPU 5 [ 1512.105255] Modules linked in: lustre obdfilter ost cmm mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mds mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs ext2 exportfs jbd sha512_generic sha256_generic sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs] [ 1512.105255] [ 1512.105255] Pid: 3559, comm: mdt01_003 Not tainted 2.6.32-debug #3 Bochs Bochs [ 1512.105255] RIP: 0010:[<ffffffff8127db02>] [<ffffffff8127db02>] memmove+0x42/0x1a0 [ 1512.105255] RSP: 0018:ffff8802160d5498 EFLAGS: 00010282 [ 1512.105255] RAX: ffff8801eab1d03c RBX: ffff8802160d54e0 RCX: 00000000000000ee [ 1512.105255] RDX: fffffffffff5cfec RSI: ffff8801eabbffe8 RDI: ffff8801eabbfffc [ 1512.105255] RBP: ffffffff8100bc0e R08: 0000000000000000 R09: 0000000000000000 [ 1512.105255] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88021f70b004 [ 1512.105255] R13: ffff8801eab1d028 R14: ffff8801eab1d000 R15: 0000000000000002 [ 1512.105255] FS: 00007f784cf7b700(0000) GS:ffff880028340000(0000) knlGS:0000000000000000 [ 1512.105255] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 1512.105255] CR2: ffff8801eabc0000 CR3: 0000000001a25000 CR4: 00000000000006e0 [ 1512.105255] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1512.105255] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 1512.105255] Process mdt01_003 (pid: 3559, threadinfo ffff8802160d4000, task ffff88025cbf41c0) [ 1512.105255] Stack: [ 1512.105255] ffffffffa0aab3a8 00000000000000ee 00000000000000ee ffff88022cc66f58 [ 1512.105255] <d> ffff8802160d5640 ffff8802160d5608 ffff8802160d56e8 0000000000000fd8 [ 1512.105255] <d> ffff88021f70b000 ffff8802160d5500 ffffffffa0aab430 ffff88025cbf41c0 [ 1512.105255] Call Trace: [ 1512.105255] [<ffffffffa0aab3a8>] ? iam_insert_key+0x68/0xb0 [osd_ldiskfs] [ 1512.105255] [<ffffffffa0aab430>] ? iam_insert_key_lock+0x40/0x50 [osd_ldiskfs] [ 1512.105255] [<ffffffffa0aae7ed>] ? iam_lfix_split+0x12d/0x150 [osd_ldiskfs] [ 1512.105255] [<ffffffffa0aadc8d>] ? iam_it_rec_insert+0x20d/0x300 [osd_ldiskfs] [ 1512.105255] [<ffffffffa0aade21>] ? iam_insert+0xa1/0xb0 [osd_ldiskfs] [ 1512.105255] [<ffffffffa0aa9467>] ? osd_oi_insert+0x1e7/0x5b0 [osd_ldiskfs] [ 1512.105255] [<ffffffffa0a9cef5>] ? __osd_oi_insert+0x145/0x1e0 [osd_ldiskfs] [ 1512.105255] [<ffffffffa0aa1d48>] ? osd_object_ea_create+0x1d8/0x460 [osd_ldiskfs] [ 1512.105255] [<ffffffffa09721dc>] ? mdd_object_create_internal+0x13c/0x2a0 [mdd] [ 1512.105255] [<ffffffffa0992aba>] ? mdd_create+0x16ba/0x20c0 [mdd] [ 1512.105255] [<ffffffffa0a9fd7f>] ? osd_xattr_get+0x9f/0x360 [osd_ldiskfs] [ 1512.105255] [<ffffffffa0bb3557>] ? cml_create+0x97/0x250 [cmm] [ 1512.105255] [<ffffffffa0b25d0f>] ? mdt_version_get_save+0x8f/0xd0 [mdt] [ 1512.105255] [<ffffffffa0b398bf>] ? mdt_reint_open+0x108f/0x18a0 [mdt] [ 1512.105255] [<ffffffffa099860e>] ? md_ucred+0x1e/0x60 [mdd] [ 1512.105255] [<ffffffffa0b071c5>] ? mdt_ucred+0x15/0x20 [mdt] [ 1512.105255] [<ffffffffa0b23081>] ? mdt_reint_rec+0x41/0xe0 [mdt] [ 1512.105255] [<ffffffffa0b1c42a>] ? mdt_reint_internal+0x50a/0x810 [mdt] [ 1512.105255] [<ffffffffa0b1c9fd>] ? mdt_intent_reint+0x1ed/0x500 [mdt] [ 1512.105255] [<ffffffffa0b19041>] ? mdt_intent_policy+0x371/0x6a0 [mdt] [ 1512.105255] [<ffffffffa042fb9a>] ? ldlm_lock_enqueue+0x2ea/0x890 [ptlrpc] [ 1512.105255] [<ffffffffa045744f>] ? ldlm_handle_enqueue0+0x48f/0xf70 [ptlrpc] [ 1512.105255] [<ffffffffa0b18ad6>] ? mdt_enqueue+0x46/0x130 [mdt] [ 1512.105255] [<ffffffffa0b0ed42>] ? mdt_handle_common+0x922/0x1740 [mdt] [ 1512.105255] [<ffffffffa0b0fc35>] ? mdt_regular_handle+0x15/0x20 [mdt] [ 1512.105255] [<ffffffffa048586f>] ? ptlrpc_server_handle_request+0x44f/0xee0 [ptlrpc] [ 1512.105255] [<ffffffffa0c9666e>] ? cfs_timer_arm+0xe/0x10 [libcfs] [ 1512.105255] [<ffffffffa047e711>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] [ 1512.105255] [<ffffffff81051f73>] ? __wake_up+0x53/0x70 [ 1512.105255] [<ffffffffa04883de>] ? ptlrpc_main+0xaee/0x1800 [ptlrpc] [ 1512.105255] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1512.105255] [<ffffffff8100c14a>] ? child_rip+0xa/0x20 [ 1512.105255] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1512.105255] [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] [ 1512.105255] [<ffffffff8100c140>] ? child_rip+0x0/0x20 [ 1512.105255] Code: d0 49 39 f8 0f 8f 9f 00 00 00 48 81 fa a8 02 00 00 72 05 40 38 fe 74 41 48 83 ea 20 48 83 ea 20 4c 8b 1e 4c 8b 56 08 4c 8b 4e 10 <4c> 8b 46 18 48 8d 76 20 4c 89 1f 4c 89 57 08 4c 89 4f 10 4c 89 [ 1560.488304] INFO: task jbd2/loop0-8:2178 blocked for more than 120 seconds. [ 1560.489230] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1560.490216] jbd2/loop0-8 D 0000000000000003 5264 2178 2 0x00000000[ 1560.491119] ffff88026227fd10 0000000000000046 00000000000167c0 0000000000016 7c0 [ 1560.492112] ffff880028290960 00000000000167c0 00000000000167c0 0000000000000 286[ 1560.493137] ffff88028fb0e8f8 ffff88026227ffd8 000000000000fba8 ffff88028fb0e 8f8 [ 1560.494147] Call Trace: [ 1560.494464] [<ffffffff8109004e>] ? prepare_to_wait+0x4e/0x80 [ 1560.495198] [<ffffffffa0076afd>] jbd2_journal_commit_transaction+0x19d/0x16e 0 [jbd2] [ 1560.496185] [<ffffffff81009310>] ? __switch_to+0xd0/0x320 [ 1560.496910] [<ffffffff814fc4ae>] ? _spin_unlock_irq+0xe/0x20 [ 1560.497643] [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40 [ 1560.498462] [<ffffffffa007d627>] kjournald2+0xb7/0x210 [jbd2] [ 1560.499200] [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40 [ 1560.500019] [<ffffffffa007d570>] ? kjournald2+0x0/0x210 [jbd2] [ 1560.500797] [<ffffffff8108fa16>] kthread+0x96/0xa0[ 1560.501445] [<ffffffff8100c14a>] child_rip+0xa/0x20 [ 1560.502094] [<ffffffff8108f980>] ? kthread+0x0/0xa0 [ 1560.502718] [<ffffffff8100c140>] ? child_rip+0x0/0x20 [ 1560.503387] INFO: task jbd2/loop1-8:2336 blocked for more than 120 seconds. [ 1560.504285] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1560.505289] jbd2/loop1-8 D 0000000000000000 5280 2336 2 0x00000000 [ 1560.506192] ffff8802467c9c10 0000000000000046 0000000000000000 ffff8802467c9bd4 [ 1560.507179] ffff8802467c9b80 ffff88029fc24100 ffff8800283567c0 0000000000000000 [ 1560.508166] ffff880260cea778 ffff8802467c9fd8 000000000000fba8 ffff880260cea778 [ 1560.509185] Call Trace: [ 1560.509521] [<ffffffff811b03a0>] ? sync_buffer+0x0/0x50 [ 1560.510202] [<ffffffff814f9a33>] io_schedule+0x73/0xc0 [ 1560.510858] [<ffffffff811b03e3>] sync_buffer+0x43/0x50 [ 1560.511520] [<ffffffff814fa3ef>] __wait_on_bit+0x5f/0x90 [ 1560.512214] [<ffffffff811b03a0>] ? sync_buffer+0x0/0x50 [ 1560.512924] [<ffffffff814fa498>] out_of_line_wait_on_bit+0x78/0x90 [ 1560.513723] [<ffffffff8108fda0>] ? wake_bit_function+0x0/0x50 [ 1560.514462] [<ffffffff811b0396>] __wait_on_buffer+0x26/0x30 [ 1560.515185] [<ffffffffa0077459>] jbd2_journal_commit_transaction+0xaf9/0x16e0 [jbd2] [ 1560.516177] [<ffffffff81009310>] ? __switch_to+0xd0/0x320 [ 1560.516901] [<ffffffff8107c65b>] ? try_to_del_timer_sync+0x7b/0xe0 [ 1560.517732] [<ffffffffa007d627>] kjournald2+0xb7/0x210 [jbd2] [ 1560.518470] [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40 [ 1560.519298] [<ffffffffa007d570>] ? kjournald2+0x0/0x210 [jbd2] [ 1560.520059] [<ffffffff8108fa16>] kthread+0x96/0xa0 [ 1560.520705] [<ffffffff8100c14a>] child_rip+0xa/0x20 [ 1560.521359] [<ffffffff8108f980>] ? kthread+0x0/0xa0 [ 1560.522003] [<ffffffff8100c140>] ? child_rip+0x0/0x20 ... |
| Comment by Oleg Drokin [ 10/Sep/12 ] |
|
Ok, after some more testing, doing ONLY=116 passes. REFORMAT=yes ONLY="115 116" sh sanity.sh seems to be triggering this issue 100% for me. (just had another lockup + ost disk corruption). [ 1294.835204] LDISKFS-fs error (device loop1): ldiskfs_init_block_bitmap: Checksum bad for group 1
[ 1294.836610] Aborting journal on device loop1-8.
My test system configuration: kvm with 8 cpu cores, 10G RAM. |
| Comment by Oleg Drokin [ 11/Sep/12 ] |
|
The real problem seems to be test 115, if I exclude it, all subsequent tests seem to be fine, but if I leave test 115 in, tests after it start to fail and as I exclude them next ones start to fail in a very similar manner. I have stack guard on btw and it does not trigger. the only trigger I see is around test 3 about "rm" using the most stack yet with about 2.5k still remaining, though I agree some stack traces do look quite big. |
| Comment by nasf (Inactive) [ 11/Sep/12 ] |
|
The reason is that when shrink the OI index node to recycle idle leaf for the last entry, it makes the index node to be empty, then causes the subsequent IAM lookup/insert ops to access invalid space. The solution is that keep the last entry for the idle leaf in the OI index node, it can be reused directly when next new node added. This is the patch for that: |
| Comment by Oleg Drokin [ 11/Sep/12 ] |
|
Thanks. [root@rhel6 ~]# e2fsck -f -n /tmp/lustre-mdt1 e2fsck 1.41.90.wc1 (18-Mar-2011) Pass 1: Checking inodes, blocks, and sizes Inode 8297 is a zero-length directory. Clear? no Inode 8297, i_size is 4096, should be 0. Fix? no Pass 2: Checking directory structure Pass 3: Checking directory connectivity Unconnected directory inode 8297 (...) Connect to /lost+found? no Pass 4: Checking reference counts Unattached inode 8297 Connect to /lost+found? no Pass 5: Checking group summary information lustre-MDT0000: ********** WARNING: Filesystem still has errors ********** |
| Comment by nasf (Inactive) [ 12/Sep/12 ] |
|
I cannot reproduce the e2fsck failure by myself. But according to the error message, it looks like some object was removed from the parent directory, but the object itself was not destroyed. One possible reason for that is related with partly unlink, because we do not declare enough credit for the unlink transaction, which may needs more credit for recycling idle OI leaf. I have updated the patch http://review.whamcloud.com/#change,3931 to fix that. |
| Comment by Oleg Drokin [ 12/Sep/12 ] |
|
That apparently did not help. |
| Comment by Oleg Drokin [ 12/Sep/12 ] |
|
Reduced the failure to test 51b: REFORMAT=yes ONLY=51b sh sanity.sh == sanity test 51b: mkdir .../t-0 --- .../t-70000 ====================== 23:18:05 (1347506285)
- created 10000 (time 1347506301.81 total 16.47 last 16.47)
- created 20000 (time 1347506318.40 total 33.05 last 16.58)
- created 30000 (time 1347506340.06 total 54.71 last 21.66)
mkdir(/mnt/lustre/d51b/t-32335) error: No space left on device
total: 32335 creates in 59.82 seconds: 540.54 creates/second
sanity test_51b: @@@@@@ FAIL: test_51b failed with 28
Trace dump:
= /home/green/git/lustre-release/lustre/tests/test-framework.sh:3640:error_noexit()
= /home/green/git/lustre-release/lustre/tests/test-framework.sh:3662:error()
= /home/green/git/lustre-release/lustre/tests/test-framework.sh:3898:run_one()
= /home/green/git/lustre-release/lustre/tests/test-framework.sh:3928:run_one_logged()
= /home/green/git/lustre-release/lustre/tests/test-framework.sh:3750:run_test()
= sanity.sh:3150:main()
Dumping lctl log to /tmp/test_logs/1347506256/sanity.test_51b.*.1347506345.log
Dumping logs only on local client.
FAIL 51b (60s)
...........................................................................................................................resend_count is set to 4 4 4 4
...........resend_count is set to 10 10 10 10
.................................................................................................== sanity sanity.sh test complete, duration 90 sec == 23:19:06 (1347506346)
sanity.sh: FAIL: test_51b test_51b failed with 28
Stopping clients: rhel6.localnet /mnt/lustre (opts:-f)
Stopping client rhel6.localnet /mnt/lustre opts:-f
Stopping clients: rhel6.localnet /mnt/lustre2 (opts:-f)
Stopping /mnt/mds1 (opts:-f) on rhel6.localnet
Stopping /mnt/ost1 (opts:-f) on rhel6.localnet
Stopping /mnt/ost2 (opts:-f) on rhel6.localnet
waited 0 for 5 ST ost OSS OSS_uuid 0
osd_ldiskfs 337297 0
fsfilt_ldiskfs 86602 0
ldiskfs 422919 2 osd_ldiskfs,fsfilt_ldiskfs
mdd 419671 3 cmm,mdt,osd_ldiskfs
lquota 253915 5 obdfilter,osd_ldiskfs,mdd
fid 70216 5 mdt,osd_ldiskfs,mdd,obdecho,mdc
obdclass 1160202 47 lustre,obdfilter,ost,cmm,mdt,osd_ldiskfs,fsfilt_ldiskfs,mdd,mds,mgs,lquota,obdecho,mgc,lov,osc,mdc,lmv,fid,fld,ptlrpc
lvfs 38111 22 lustre,obdfilter,ost,cmm,mdt,osd_ldiskfs,fsfilt_ldiskfs,mdd,mds,mgs,lquota,obdecho,mgc,lov,osc,mdc,lmv,fid,fld,ptlrpc,obdclass
libcfs 490662 24 lustre,obdfilter,ost,cmm,mdt,osd_ldiskfs,fsfilt_ldiskfs,mdd,mds,mgs,lquota,obdecho,mgc,lov,osc,mdc,lmv,fid,fld,ptlrpc,obdclass,modules unloaded.
[root@rhel6 tests]#
[root@rhel6 tests]# e2fsck -n -f /tmp/lustre-mdt1
e2fsck 1.41.90.wc1 (18-Mar-2011)
Pass 1: Checking inodes, blocks, and sizes
Inode 33212 is a zero-length directory. Clear? no
Inode 33212, i_size is 4096, should be 0. Fix? no
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Unconnected directory inode 33212 (...)
Connect to /lost+found? no
Pass 4: Checking reference counts
Unattached inode 33212
Connect to /lost+found? no
Pass 5: Checking group summary information
lustre-MDT0000: ********** WARNING: Filesystem still has errors **********
lustre-MDT0000: 107/100000 files (3.7% non-contiguous), 17256/50000 blocks
|
| Comment by nasf (Inactive) [ 13/Sep/12 ] |
|
Your failure is caused by partial create because of not enough space. It occurred in mdd_object_initialize() when mkdir, the inode for the dir was allocated, and "." was inserted, but failed to insert ".." because of not enough space. Under such case, our rollback mechanism did not cleanup the environment clearly. Then left the inode there with non-zero i_nlink and non-zero size, but without added into parent directory. This is the patch for fixing: |
| Comment by Peter Jones [ 19/Sep/12 ] |
|
Landed for 2.3 and 2.4 |
| Comment by Bob Glossman (Inactive) [ 03/Dec/12 ] |
|
back port to b2_1 |