Lustre / LU-1881

sanity test 116 soft lockup


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.3.0, Lustre 2.4.0
    • Affects Version/s: Lustre 2.3.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 4258

Description

      Running with a debug kernel, REFORMAT=yes SLOW=yes sh sanity.sh on latest master,
      test 116 locks up reliably (3 out of 3 runs so far):

      [ 1402.025523] Lustre: DEBUG MARKER: == sanity test 116: stripe QOS: free space balance ===================== 21:16:44 (1347326204)
      [ 1458.081053] LNet: Service thread pid 1041 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [ 1458.083593] Pid: 1041, comm: mdt01_003
      [ 1458.084187]
      [ 1458.084187] Call Trace:
      [ 1458.084851]  [<ffffffffa0b0ed42>] ? mdt_handle_common+0x922/0x1740 [mdt]
      [ 1458.085891]  [<ffffffffa0b0fc35>] mdt_regular_handle+0x15/0x20 [mdt]
      [ 1458.086909]  [<ffffffffa048586f>] ptlrpc_server_handle_request+0x44f/0xee0 [ptlrpc]
      [ 1458.088097]  [<ffffffffa0c9666e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      [ 1458.089098]  [<ffffffffa047e711>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
      [ 1458.090171]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
      [ 1458.091005]  [<ffffffffa04883de>] ptlrpc_main+0xaee/0x1800 [ptlrpc]
      [ 1458.091991]  [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
      [ 1458.092988]  [<ffffffff8100c14a>] child_rip+0xa/0x20
      [ 1458.093775]  [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
      [ 1458.094750]  [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
      [ 1458.095723]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      [ 1458.096542]
      [ 1458.096780] LustreError: dumping log to /tmp/lustre-log.1347326260.1041
      [ 1500.096007] BUG: soft lockup - CPU#4 stuck for 67s! [mdt01_003:1041]
      [ 1500.096998] Modules linked in: lustre obdfilter ost cmm mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mds mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs ext2 exportfs jbd sha512_generic sha256_generic sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
      [ 1500.097004] CPU 4
      [ 1500.097004] Modules linked in: lustre obdfilter ost cmm mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mds mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs ext2 exportfs jbd sha512_generic sha256_generic sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
      [ 1500.097004]
      [ 1500.097004] Pid: 1041, comm: mdt01_003 Not tainted 2.6.32-debug #3 Bochs Bochs
      [ 1500.097004] RIP: 0010:[<ffffffff8127db02>]  [<ffffffff8127db02>] memmove+0x42/0x1a0
      [ 1500.097004] RSP: 0018:ffff8802056dd498  EFLAGS: 00010282
      [ 1500.097004] RAX: ffff880231b7c03c RBX: ffff8802056dd4e0 RCX: 00000000000000ee
      [ 1500.097004] RDX: fffffffffff7bfec RSI: ffff880231bfffe8 RDI: ffff880231bffffc
      [ 1500.097004] RBP: ffffffff8100bc0e R08: 0000000000000000 R09: 0000000000000000
      [ 1500.097004] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88021d674004
      [ 1500.097004] R13: ffff880231b7c028 R14: ffff880231b7c000 R15: 0000000000000002
      [ 1500.097004] FS:  00007effccfcf700(0000) GS:ffff880028300000(0000) knlGS:0000000000000000
      [ 1500.097004] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [ 1500.097004] CR2: ffff880231c00000 CR3: 000000025a3ed000 CR4: 00000000000006e0
      [ 1500.097004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1500.097004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [ 1500.097004] Process mdt01_003 (pid: 1041, threadinfo ffff8802056dc000, task ffff88024430a380)
      [ 1500.097004] Stack:
      [ 1500.097004]  ffffffffa0aab3a8 00000000000000ee 00000000000000ee ffff880294f9df58
      [ 1500.097004] <d> ffff8802056dd640 ffff8802056dd608 ffff8802056dd6e8 0000000000000fd8
      [ 1500.097004] <d> ffff88021d674000 ffff8802056dd500 ffffffffa0aab430 ffff88024430a380
      [ 1500.097004] Call Trace:
      [ 1500.097004]  [<ffffffffa0aab3a8>] ? iam_insert_key+0x68/0xb0 [osd_ldiskfs]
      [ 1500.097004]  [<ffffffffa0aab430>] ? iam_insert_key_lock+0x40/0x50 [osd_ldiskfs]
      [ 1500.097004]  [<ffffffffa0aae7ed>] ? iam_lfix_split+0x12d/0x150 [osd_ldiskfs]
      [ 1500.097004]  [<ffffffffa0aadc8d>] ? iam_it_rec_insert+0x20d/0x300 [osd_ldiskfs]
      [ 1500.097004]  [<ffffffffa0aade21>] ? iam_insert+0xa1/0xb0 [osd_ldiskfs]
      [ 1500.097004]  [<ffffffffa0aa9467>] ? osd_oi_insert+0x1e7/0x5b0 [osd_ldiskfs]
      [ 1500.097004]  [<ffffffffa0a9cef5>] ? __osd_oi_insert+0x145/0x1e0 [osd_ldiskfs]
      [ 1500.097004]  [<ffffffffa0aa1d48>] ? osd_object_ea_create+0x1d8/0x460 [osd_ldiskfs]
      [ 1500.097004]  [<ffffffffa09721dc>] ? mdd_object_create_internal+0x13c/0x2a0 [mdd]
      [ 1500.097004]  [<ffffffffa0992aba>] ? mdd_create+0x16ba/0x20c0 [mdd]
      [ 1500.097004]  [<ffffffffa0a9fd7f>] ? osd_xattr_get+0x9f/0x360 [osd_ldiskfs]
      [ 1500.097004]  [<ffffffffa0bb3557>] ? cml_create+0x97/0x250 [cmm]
      [ 1500.097004]  [<ffffffffa0b25d0f>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
      [ 1500.097004]  [<ffffffffa0b398bf>] ? mdt_reint_open+0x108f/0x18a0 [mdt]
      [ 1500.097004]  [<ffffffffa099860e>] ? md_ucred+0x1e/0x60 [mdd]
      [ 1500.097004]  [<ffffffffa0b071c5>] ? mdt_ucred+0x15/0x20 [mdt]
      [ 1500.097004]  [<ffffffffa0b23081>] ? mdt_reint_rec+0x41/0xe0 [mdt]
      [ 1500.097004]  [<ffffffffa0b1c42a>] ? mdt_reint_internal+0x50a/0x810 [mdt]
      [ 1500.097004]  [<ffffffffa0b1c9fd>] ? mdt_intent_reint+0x1ed/0x500 [mdt]
      [ 1500.097004]  [<ffffffffa0b19041>] ? mdt_intent_policy+0x371/0x6a0 [mdt]
      [ 1500.097004]  [<ffffffffa042fb9a>] ? ldlm_lock_enqueue+0x2ea/0x890 [ptlrpc]
      [ 1500.097004]  [<ffffffffa045744f>] ? ldlm_handle_enqueue0+0x48f/0xf70 [ptlrpc]
      [ 1500.097004]  [<ffffffffa0b18ad6>] ? mdt_enqueue+0x46/0x130 [mdt]
      [ 1500.097004]  [<ffffffffa0b0ed42>] ? mdt_handle_common+0x922/0x1740 [mdt]
      [ 1500.097004]  [<ffffffffa0b0fc35>] ? mdt_regular_handle+0x15/0x20 [mdt]
      [ 1500.097004]  [<ffffffffa048586f>] ? ptlrpc_server_handle_request+0x44f/0xee0 [ptlrpc]
      [ 1500.097004]  [<ffffffffa0c9666e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      [ 1500.097004]  [<ffffffffa047e711>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
      [ 1500.097004]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
      [ 1500.097004]  [<ffffffffa04883de>] ? ptlrpc_main+0xaee/0x1800 [ptlrpc]
      [ 1500.097004]  [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
      [ 1500.097004]  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
      [ 1500.097004]  [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
      [ 1500.097004]  [<ffffffffa04878f0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
      [ 1500.097004]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      [ 1500.097004] Code: d0 49 39 f8 0f 8f 9f 00 00 00 48 81 fa a8 02 00 00 72 05 40 38 fe 74 41 48 83 ea 20 48 83 ea 20 4c 8b 1e 4c 8b 56 08 4c 8b 4e 10 <4c> 8b 46 18 48 8d 76 20 4c 89 1f 4c 89 57 08 4c 89 4f 10 4c 89
      ...
      [ 1560.492304] INFO: task jbd2/loop0-8:32349 blocked for more than 120 seconds.
      [ 1560.493247] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1560.494302] jbd2/loop0-8  D 0000000000000002  5152 32349      2 0x00000000
      [ 1560.495213]  ffff88026046dd10 0000000000000046 00000000000167c0 00000000000167c0
      [ 1560.496212]  ffff880028310960 00000000000167c0 00000000000167c0 0000000000000286
      [ 1560.497246]  ffff8802481ce738 ffff88026046dfd8 000000000000fba8 ffff8802481ce738
      [ 1560.498252] Call Trace:
      [ 1560.498574]  [<ffffffff8109004e>] ? prepare_to_wait+0x4e/0x80
      [ 1560.499316]  [<ffffffffa0076afd>] jbd2_journal_commit_transaction+0x19d/0x16e0 [jbd2]
      [ 1560.500335]  [<ffffffff81009310>] ? __switch_to+0xd0/0x320
      [ 1560.501048]  [<ffffffff814fc4ae>] ? _spin_unlock_irq+0xe/0x20
      [ 1560.501784]  [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40
      [ 1560.502618]  [<ffffffffa007d627>] kjournald2+0xb7/0x210 [jbd2]
      [ 1560.503379]  [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40
      [ 1560.504209]  [<ffffffffa007d570>] ? kjournald2+0x0/0x210 [jbd2]
      [ 1560.504993]  [<ffffffff8108fa16>] kthread+0x96/0xa0
      [ 1560.505631]  [<ffffffff8100c14a>] child_rip+0xa/0x20
      [ 1560.506274]  [<ffffffff8108f980>] ? kthread+0x0/0xa0
      [ 1560.506919]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      

      On the first run I also hit on-disk corruption in that same test, before the soft lockup:

      [ 4158.782961] Lustre: DEBUG MARKER: == sanity test 116: stripe QOS: free space balance ===================== 20:44:10 (1347324250)
      [ 4178.796623] LDISKFS-fs error (device loop2): file system corruption: inode #8 logical block 3108 mapped to 7268 (size 1)
      [ 4178.798364] Aborting journal on device loop2-8.
      

      The ost2 filesystem state also changed; df now reports:
      /dev/loop2 64Z 64Z 159M 100% /mnt/ost2

      After a reboot I ran fsck on the MDT filesystem and found this:

      [root@rhel6 tests]# e2fsck -f -n /tmp/lustre-mdt1
      e2fsck 1.41.90.wc1 (18-Mar-2011)
      Pass 1: Checking inodes, blocks, and sizes
      Inode 33445 is a zero-length directory.  Clear? no
      
      Inode 33445, i_size is 4096, should be 0.  Fix? no
      
      Pass 2: Checking directory structure
      Pass 3: Checking directory connectivity
      Unconnected directory inode 33445 (...)
      Connect to /lost+found? no
      
      Pass 4: Checking reference counts
      Unattached inode 33445
      Connect to /lost+found? no
      
      Pass 5: Checking group summary information
      
      lustre-MDT0000: ********** WARNING: Filesystem still has errors **********
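
      For reference, the reproduction and post-mortem check above boil down to roughly the
      following shell sketch; ONLY=116 is an assumption used to narrow the run to this one
      test, and the /tmp/lustre-mdt1 image path matches the loopback setup used here:

      # from lustre/tests on latest master, running under a debug kernel
      cd lustre/tests

      # reformat the test filesystems and run the (SLOW) sanity suite;
      # ONLY=116 (assumed) restricts the run to the failing test
      REFORMAT=yes SLOW=yes ONLY=116 sh sanity.sh

      # after rebooting the hung node, read-only check of the MDT backing image
      # (-f forces a full check, -n answers "no" to every fix)
      e2fsck -f -n /tmp/lustre-mdt1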
      


People

    Assignee: yong.fan nasf (Inactive)
    Reporter: green Oleg Drokin
    Votes: 0
    Watchers: 6
