LU-3331: Crash during mount/umount during stress testing

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: Lustre 2.4.0
    • Labels: None
    • Environment: Single 2GB VM on my local machine.
    • Severity: 3
    • Rank (Obsolete): 8227

    Description

      I am doing some testing with Racer in a 2 GB VM. I am really hitting a non-racer issue related to mount/umount.

      I run mem-hog just so that it stresses the system without OOMing it. Every 300 seconds or so autotest unmounts and remounts the Lustre filesystem as part of its teardown/setup phase for running the tests.

      I am hitting this:

      Pid: 11143, comm: mount.lustre Not tainted 2.6.32.may03master #1 innotek GmbH VirtualBox
      RIP: 0010:[<ffffffffa07c5498>]  [<ffffffffa07c5498>] lprocfs_srch+0x48/0x80 [obdclass]
      RSP: 0018:ffff88006c837a58  EFLAGS: 00010286
      RAX: ffffffffa084dac0 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffff88006c837b68 RDI: ffffffffa084dac0
      RBP: ffff88006c837a78 R08: 00000000fffffffe R09: 0000000000000000
      R10: 000000000000000f R11: 000000000000000f R12: ffff88006c837b68
      R13: ffffffffffffff8e R14: 0000000000000000 R15: ffff88006c837b68
      FS:  00007f27ac959700(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: ffffffffffffffde CR3: 0000000013efd000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process mount.lustre (pid: 11143, threadinfo ffff88006c836000, task ffff88006c835500)
      Stack:
       ffffffffffffff8e ffff88006c837b68 ffffffffffffff8e 0000000000000000
      <d> ffff88006c837ac8 ffffffffa07c73e4 ffff88006c837a98 ffffffff81281356
      <d> 0000000000000004 0000000affffffff ffff880012804000 0000000000000006
      Call Trace:
       [<ffffffffa07c73e4>] lprocfs_register+0x34/0x100 [obdclass]
       [<ffffffff81281356>] ? vsnprintf+0x336/0x5e0
       [<ffffffffa0f0db7e>] lprocfs_register_mountpoint+0x12e/0xb20 [lustre]
       [<ffffffffa0efd3b6>] client_common_fill_super+0x1a6/0x5280 [lustre]
       [<ffffffff81281640>] ? sprintf+0x40/0x50
       [<ffffffffa0f03204>] ll_fill_super+0xd74/0x1500 [lustre]
       [<ffffffffa07f257d>] lustre_fill_super+0x3dd/0x530 [obdclass]
       [<ffffffffa07f21a0>] ? lustre_fill_super+0x0/0x530 [obdclass]
       [<ffffffff811841df>] get_sb_nodev+0x5f/0xa0
       [<ffffffffa07e9de5>] lustre_get_sb+0x25/0x30 [obdclass]
       [<ffffffff8118381b>] vfs_kern_mount+0x7b/0x1b0
       [<ffffffff811839c2>] do_kern_mount+0x52/0x130
       [<ffffffff811a3c72>] do_mount+0x2d2/0x8d0
       [<ffffffff811a4300>] sys_mount+0x90/0xe0
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      

      There are a few other messages in the logs right before or during the event:

      SLAB: cache with size 40 has lost its name
      SLAB: cache with size 192 has lost its name
      SLAB: cache with size 1216 has lost its name
      LustreError: 10029:0:(lprocfs_status.c:489:lprocfs_register())  Lproc: Attempting to register llite more than once 
      SLAB: cache with size 256 has lost its name
      SLAB: cache with size 40 has lost its name
      SLAB: cache with size 192 has lost its name
      SLAB: cache with size 1216 has lost its name
      SLAB: cache with size 256 has lost its name
      

      The SLAB messages indicate that a module unload did not go as planned.

      The "Attempting to register llite more than once" speaks for itself.

      I am actively working this issue; I just wanted to document it.

      Attachments

        1. obdclass.dmp
          7 kB
          Keith Mannthey
        2. single-lu-3331
          109 kB
          Keith Mannthey

        Activity

          jhammond John Hammond added a comment -

          All patches landed to master.


          With both patches applied I seem to be hitting a different issue now. I would say both lproc patches are helping. Another day or two and I will know more for sure, but I will be keeping them in my kernels.

          keith Keith Mannthey (Inactive) added a comment -

          That patch addresses the line of investigation I was on. I will add 6420 to 5161 and continue testing.

          keith Keith Mannthey (Inactive) added a comment -
          jhammond John Hammond added a comment -

          This must be a failure to remove /proc/fs/lustre/llite when init_lustre_lite() fails.

          m:lustre-release# errno 114
          114 EALREADY Operation already in progress
          m:lustre-release# printf '%x\n' -114
          ffffffffffffff8e

          Please see http://review.whamcloud.com/6420. This patch should go along with 5161.


          lprocfs_register objdump attached. Sorry, no line numbers.

          keith Keith Mannthey (Inactive) added a comment -

          This is still with 5161 in place. I have not hit the lprocfs_srch crash since I started carrying the patch.

          I will update more in a bit.

          keith Keith Mannthey (Inactive) added a comment -
          jhammond John Hammond added a comment -

          Keith, is this with or without 5161? In any case I suspect this is from init_lustre_lite().

          Could you post the disassembly of lprocfs_register along with the values of proc_lustre_fs_root, *proc_lustre_fs_root, proc_lustre_root, and *proc_lustre_root?


          I moved lustre to current master today and ran another test. The error below is what I am hitting right now. It takes a good part of the day to trigger, but it has been the cause of the last 3 or 4 crashes. I have a full crash dump that I am looking into, and I am working on a patch for this error.

          Lustre: lustre-OST0000: new disk, initializing
          Lustre: srv-lustre-OST0000: No data found on store. Initialize space
          Lustre: Skipped 1 previous similar message
          LDISKFS-fs (loop2): mounted filesystem with ordered data mode. quota=on. Opts: 
          LDISKFS-fs (loop2): mounted filesystem with ordered data mode. quota=on. Opts: 
          Lustre: lustre-OST0001: new disk, initializing
          BUG: unable to handle kernel paging request at ffffffffffffffde
          IP: [<ffffffffa04b43f7>] lprocfs_register+0x37/0x230 [obdclass]
          PGD 1a87067 PUD 1a88067 PMD 0 
          Oops: 0000 [#1] SMP 
          last sysfs file: /sys/devices/system/cpu/possible
          CPU 0 
          Modules linked in: lustre(U) ofd(U) osp(U) lod(U) ost(U) mdt(U) osd_ldiskfs(U) fsfilt_ldiskfs(U) ldiskfs(U) mdd(U) mgs(U) lquota(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) exportfs jbd ksocklnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 ppdev parport_pc parport microcode sg snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc i2c_piix4 i2c_core e1000 ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom ahci pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: lvfs]
          
          Pid: 11086, comm: mount.lustre Not tainted 2.6.32.may03master #1 innotek GmbH VirtualBox
          RIP: 0010:[<ffffffffa04b43f7>]  [<ffffffffa04b43f7>] lprocfs_register+0x37/0x230 [obdclass]
          RSP: 0018:ffff88002f285a88  EFLAGS: 00010286
          RAX: ffffffffa053a000 RBX: ffff88002f285b68 RCX: 0000000000000000
          RDX: 0000000000000000 RSI: ffffffffffffff8e RDI: ffffffffa053a000
          RBP: ffff88002f285ac8 R08: 00000000fffffffe R09: 0000000000000000
          R10: 000000000000000f R11: 000000000000000f R12: ffffffffffffff8e
          R13: 0000000000000000 R14: ffffffffffffff8e R15: 0000000000000000
          FS:  00007f140ead5700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          CR2: ffffffffffffffde CR3: 000000007d7d5000 CR4: 00000000000006f0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
          Process mount.lustre (pid: 11086, threadinfo ffff88002f284000, task ffff8800135f2ae0)
          Stack:
           ffff88002f285a98 ffffffff81281356 0000000000000004 ffff88003db60000
          <d> 0000000000000006 ffff88007aca8da0 ffffffffffffff8e ffff88002f285b68
          <d> ffff88002f285c28 ffffffffa17c4b8e ffff880018be714e ffff88002f285b08
          Call Trace:
           [<ffffffff81281356>] ? vsnprintf+0x336/0x5e0
           [<ffffffffa17c4b8e>] lprocfs_register_mountpoint+0x12e/0xb20 [lustre]
           [<ffffffffa17b43c6>] client_common_fill_super+0x1a6/0x5280 [lustre]
           [<ffffffff81281640>] ? sprintf+0x40/0x50
           [<ffffffffa17ba214>] ll_fill_super+0xd74/0x1500 [lustre]
           [<ffffffffa04df97d>] lustre_fill_super+0x3dd/0x530 [obdclass]
           [<ffffffffa04df5a0>] ? lustre_fill_super+0x0/0x530 [obdclass]
           [<ffffffff811841df>] get_sb_nodev+0x5f/0xa0
           [<ffffffffa04d71e5>] lustre_get_sb+0x25/0x30 [obdclass]
           [<ffffffff8118381b>] vfs_kern_mount+0x7b/0x1b0
           [<ffffffff811839c2>] do_kern_mount+0x52/0x130
           [<ffffffff811a3c72>] do_mount+0x2d2/0x8d0
           [<ffffffff811a4300>] sys_mount+0x90/0xe0
           [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
          Code: 53 48 83 ec 18 0f 1f 44 00 00 49 89 f4 48 89 fb 48 c7 c7 00 a0 53 a0 49 89 d5 49 89 cf e8 82 b2 05 e1 4d 85 e4 0f 84 09 01 00 00 <4d> 8b 74 24 50 4d 85 f6 75 1c e9 fa 00 00 00 66 2e 0f 1f 84 00 
          RIP  [<ffffffffa04b43f7>] lprocfs_register+0x37/0x230 [obdclass]
           RSP <ffff88002f285a88>
          CR2: ffffffffffffffde
          
          keith Keith Mannthey (Inactive) added a comment -

          Well, after quite a while of testing and dealing with another error, this cropped up.

          I still have 5161 in place and it is helping.

          There is still some lproc issue out there:

          ...
          LustreError: 22360:0:(obd_mount.c:201:lustre_start_simple()) Skipped 2 previous similar messages
          LustreError: 22360:0:(obd_mount_server.c:1225:server_start_targets()) failed to start OSS: -12
          LustreError: 22360:0:(obd_mount_server.c:1225:server_start_targets()) Skipped 2 previous similar messages
          LustreError: 22360:0:(obd_mount_server.c:1699:server_fill_super()) Unable to start targets: -12
          LustreError: 22360:0:(obd_mount_server.c:1699:server_fill_super()) Skipped 2 previous similar messages
          LustreError: 22360:0:(obd_mount_server.c:844:lustre_disconnect_lwp()) lustre-MDT0000-lwp-OST0000: Can't end config log lustre-client.
          LustreError: 22360:0:(obd_mount_server.c:844:lustre_disconnect_lwp()) Skipped 2 previous similar messages
          LustreError: 22360:0:(obd_mount_server.c:1426:server_put_super()) lustre-OST0000: failed to disconnect lwp. (rc=-2)
          LustreError: 22360:0:(obd_mount_server.c:1426:server_put_super()) Skipped 2 previous similar messages
          LustreError: 22360:0:(obd_mount_server.c:1456:server_put_super()) no obd lustre-OST0000
          LustreError: 22360:0:(obd_mount_server.c:1456:server_put_super()) Skipped 2 previous similar messages
          LustreError: 22360:0:(obd_mount_server.c:135:server_deregister_mount()) lustre-OST0000 not registered
          LustreError: 22360:0:(obd_mount_server.c:135:server_deregister_mount()) Skipped 2 previous similar messages
          LustreError: 22360:0:(obd_mount.c:1267:lustre_fill_super()) Unable to mount  (-12)
          LustreError: 22360:0:(obd_mount.c:1267:lustre_fill_super()) Skipped 2 previous similar messages
          BUG: unable to handle kernel paging request at ffffffffffffffde
          IP: [<ffffffffa0cc03f7>] lprocfs_register+0x37/0x230 [obdclass]
          PGD 1a87067 PUD 1a88067 PMD 0 
          Oops: 0000 [#1] SMP 
          last sysfs file: /sys/devices/system/cpu/possible
          CPU 0 
          Modules linked in: lustre(U) ofd(U) osp(U) lod(U) ost(U) mdt(U) osd_ldiskfs(U) fsfilt_ldiskfs(U) ldiskfs(U) mdd(U) mgs(U) lquota(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) exportfs jbd ksocklnd(U) lnet(U) sha256_generic sha512_generic libcfs(U) ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 ppdev parport_pc parport microcode i2c_piix4 i2c_core sg snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000 ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom ahci pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: lvfs]
          
          Pid: 26036, comm: mount.lustre Not tainted 2.6.32.may03master #1 innotek GmbH VirtualBox
          RIP: 0010:[<ffffffffa0cc03f7>]  [<ffffffffa0cc03f7>] lprocfs_register+0x37/0x230 [obdclass]
          RSP: 0000:ffff880077b31a88  EFLAGS: 00010286
          RAX: ffffffffa0d46000 RBX: ffff880077b31b68 RCX: 0000000000000000
          RDX: 0000000000000000 RSI: ffffffffffffff8e RDI: ffffffffa0d46000
          RBP: ffff880077b31ac8 R08: 00000000fffffffe R09: 0000000000000000
          R10: 000000000000000f R11: 000000000000000f R12: ffffffffffffff8e
          R13: 0000000000000000 R14: ffffffffffffff8e R15: 0000000000000000
          FS:  00007ffe2003d700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          CR2: ffffffffffffffde CR3: 000000003159f000 CR4: 00000000000006f0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
          Process mount.lustre (pid: 26036, threadinfo ffff880077b30000, task ffff880036f7d500)
          Stack:
           ffff880077b31a98 ffffffff81281356 0000000000000004 ffff880000070000
          <d> 0000000000000006 ffff880078121f80 ffffffffffffff8e ffff880077b31b68
          <d> ffff880077b31c28 ffffffffa05d8b9e ffff88007608088e ffff880077b31b08
          Call Trace:
           [<ffffffff81281356>] ? vsnprintf+0x336/0x5e0
           [<ffffffffa05d8b9e>] lprocfs_register_mountpoint+0x12e/0xb20 [lustre]
           [<ffffffffa05c83d6>] client_common_fill_super+0x1a6/0x5280 [lustre]
           [<ffffffff81281640>] ? sprintf+0x40/0x50
           [<ffffffffa05ce224>] ll_fill_super+0xd74/0x1500 [lustre]
           [<ffffffffa0ceb97d>] lustre_fill_super+0x3dd/0x530 [obdclass]
           [<ffffffffa0ceb5a0>] ? lustre_fill_super+0x0/0x530 [obdclass]
           [<ffffffff811841df>] get_sb_nodev+0x5f/0xa0
           [<ffffffffa0ce31e5>] lustre_get_sb+0x25/0x30 [obdclass]
           [<ffffffff8118381b>] vfs_kern_mount+0x7b/0x1b0
           [<ffffffff811839c2>] do_kern_mount+0x52/0x130
           [<ffffffff811a3c72>] do_mount+0x2d2/0x8d0
           [<ffffffff811a4300>] sys_mount+0x90/0xe0
           [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
          
          keith Keith Mannthey (Inactive) added a comment -

          Testing update: http://review.whamcloud.com/#change,5161 helped. I didn't hit the lproc crash today, and I ran for a lot longer than it usually takes. The lproc crash was taking 1-2 hours; this run lasted 5 or 6. I am kicking it off again with this patch in place and will update again if I hit the lproc issue.

          keith Keith Mannthey (Inactive) added a comment -

          People

            Assignee: wc-triage WC Triage
            Reporter: keith Keith Mannthey (Inactive)
            Votes: 0
            Watchers: 4
