Lustre / LU-1328

Failing customer's file creation test


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.1
    • Component/s: None
    • Severity: 2
    • Rank: 6412

    Description

      A customer is running a script that calls a Java program (see attachments). Two clients panicked.

      2012-04-13 19:21:49 +0000 [26759.401564] ------------[ cut here ]------------
      2012-04-13 19:21:49 +0000 [26759.424756] WARNING: at fs/libfs.c:363 simple_setattr+0x99/0xb0()
      2012-04-13 19:21:49 +0000 [26759.455860] Hardware name: ProLiant BL460c G7
      2012-04-13 19:21:49 +0000 [26759.477542] Modules linked in: lmv mgc lustre lquota lov osc mdc fid fld ksocklnd ptlrpc obdclass lnet lvfs libcfs ppdev 8021q garp bridge stp llc nfsd nfs lockd nfs_acl auth_rpcgss sunrpc netconsole configfs dm_crypt dm_mod crc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iptable_filter be2net ip_tables x_tables hpilo joydev hpwdt rtc_cmos psmouse rtc_core lp rtc_lib evdev parport mac_hid loop serio_raw tcp_scalable fuse virtio_blk virtio virtio_ring xenfs ext4 mbcache jbd2 xfs usbhid exportfs raid1 mptspi mptsas mptscsih mptbase mpt2sas raid_class arcmsr aic94xx libsas libata scsi_transport_sas aic7xxx aic79xx scsi_transport_spi megaraid_sas cciss sd_mod sg hpsa scsi_mod uhci_hcd ehci_hcd [last unloaded: libcfs]
      2012-04-13 19:21:49 +0000 [26759.830142] Pid: 11122, comm: java Not tainted 2.6.38.2-ts4 #11
      2012-04-13 19:21:49 +0000 [26759.860423] Call Trace:
      2012-04-13 19:21:49 +0000 [26759.872545] [<ffffffff8105d10f>] ? warn_slowpath_common+0x7f/0xc0
      2012-04-13 19:21:49 +0000 [26759.903121] [<ffffffff8105d16a>] ? warn_slowpath_null+0x1a/0x20
      2012-04-13 19:21:49 +0000 [26759.933790] [<ffffffff8115bea9>] ? simple_setattr+0x99/0xb0
      2012-04-13 19:21:50 +0000 [26759.962234] [<ffffffff8115be10>] ? simple_setattr+0x0/0xb0
      2012-04-13 19:21:50 +0000 [26759.990161] [<ffffffffa1200af6>] ? ll_md_setattr+0x3e6/0x840 [lustre]
      2012-04-13 19:21:50 +0000 [26760.024040] [<ffffffffa12011b4>] ? ll_setattr_raw+0x264/0xe40 [lustre]
      2012-04-13 19:21:50 +0000 [26760.056779] [<ffffffffa0ba3126>] ? cfs_hash_del+0xa6/0x1d0 [libcfs]
      2012-04-13 19:21:50 +0000 [26760.089867] [<ffffffffa1201ded>] ? ll_setattr+0x5d/0x100 [lustre]
      2012-04-13 19:21:50 +0000 [26760.120906] [<ffffffff81153761>] ? notify_change+0x161/0x2c0
      2012-04-13 19:21:50 +0000 [26760.150011] [<ffffffff811383d1>] ? do_truncate+0x61/0x90
      2012-04-13 19:21:50 +0000 [26760.176623] [<ffffffff8113862d>] ? sys_ftruncate+0xdd/0xf0
      2012-04-13 19:21:50 +0000 [26760.205178] [<ffffffff8100bfc2>] ? system_call_fastpath+0x16/0x1b
      2012-04-13 19:21:50 +0000 [26760.235981] ---[ end trace 402d40ca74c5ea86 ]---
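
      The trace ends in sys_ftruncate -> do_truncate -> notify_change -> ll_setattr, so the warning fires during a truncate on the Lustre client. As a rough illustration of that kind of workload (this is a hedged sketch, not the attached Reproduce.java; the TruncateLoop class name, path, and sizes are assumptions), a Java loop like the following drives the same ftruncate path, since RandomAccessFile.setLength() issues ftruncate(2):

      import java.io.File;
      import java.io.IOException;
      import java.io.RandomAccessFile;

      // Hypothetical reproducer sketch -- not the attached Reproduce.java.
      // RandomAccessFile.setLength() issues ftruncate(2), which on a Lustre
      // client runs sys_ftruncate -> do_truncate -> notify_change ->
      // ll_setattr, the path shown in the client trace above.
      public class TruncateLoop {
          public static void main(String[] args) throws IOException {
              // Assumed mount point; pass the real Lustre path as argv[0].
              String path = args.length > 0 ? args[0] : "/mnt/lustre/truncate-test";
              for (int i = 0; i < 10000; i++) {
                  File f = new File(path + "." + i);
                  try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
                      raf.write(new byte[4096]); // create the file with some data
                      raf.setLength(0);          // ftruncate(fd, 0)
                      raf.setLength(1 << 20);    // extend via ftruncate
                  }
                  f.delete();
              }
          }
      }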

      On one OSS, about an hour before the client crash, I saw this:

      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: Lustre: Service thread pid 16292 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: Pid: 16292, comm: ll_ost_io_254
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel:
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: Call Trace:
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa026f1a9>] ? LNetGet+0x3e9/0x830 [lnet]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa02699c5>] ? LNetMDBind+0x135/0x490 [lnet]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff8104fdef>] ? lock_timer_base+0x2b/0x4f
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff8104febd>] ? try_to_del_timer_sync+0xaa/0xb7
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff814aa778>] schedule_timeout+0x1c6/0x1ee
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff810500ce>] ? process_timeout+0x0/0x10
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa0221351>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03b417f>] target_bulk_io+0x3af/0xa40 [ptlrpc]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa0221910>] ? cfs_alloc+0x30/0x60 [libcfs]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81038bc6>] ? default_wake_function+0x0/0x14
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa038752f>] ost_brw_write+0x12af/0x2040 [ost]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03ef7cc>] ? lustre_msg_set_timeout+0x9c/0x110 [ptlrpc]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff810c6edb>] ? free_hot_page+0x3f/0x44
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff810c7081>] ? __free_pages+0x5a/0x70
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03b19c0>] ? target_bulk_timeout+0x0/0x100 [ptlrpc]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa038e104>] ost_handle+0x2604/0x57e0 [ost]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03ea29e>] ? lustre_msg_get_opc+0x8e/0xf0 [ptlrpc]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03f735c>] ptlrpc_server_handle_request+0x4ec/0xfc0 [ptlrpc]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81037e95>] ? enqueue_task+0x7c/0x8b
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa02212ae>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa022deb0>] ? lc_watchdog_touch+0x70/0x150 [libcfs]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa022dfd7>] ? lc_watchdog_disable+0x47/0x120 [libcfs]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03fa714>] ? ptlrpc_wait_event+0xa4/0x2d0 [ptlrpc]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81030fda>] ? __wake_up+0x48/0x55
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03fae7f>] ptlrpc_main+0x53f/0x1670 [ptlrpc]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81003ada>] child_rip+0xa/0x20
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03fa940>] ? ptlrpc_main+0x0/0x1670 [ptlrpc]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81003ad0>] ? child_rip+0x0/0x20

      I don't know whether this is related.

      Attached files:

      Reproduce.java: Java program used when reproducing the problem
      reproduce.sh: shell script that calls Reproduce.java in a loop
      messages-mds: /var/log/messages from the MDS
      messages-oss-1: /var/log/messages from the first OSS
      messages-oss-2: /var/log/messages from the second OSS
      usrs*: client output captured via netconsole

      Attachments

        1. usrs400.netconsole (3 kB)
        2. usrs399.netconsole (3 kB)
        3. usrs398.netconsole (3 kB)
        4. usrs397.netconsole (3 kB)
        5. usrs396.netconsole (4 kB)
        6. usrs395.netconsole (3 kB)
        7. usrs394.netconsole (3 kB)
        8. usrs393.netconsole (2 kB)
        9. usrs393.messages (20 kB)
        10. usrs392.netconsole (3 kB)
        11. usrs391.netconsole (3 kB)
        12. usrs390.netconsole (3 kB)
        13. usrs389.netconsole (3 kB)
        14. usrs388.netconsole (3 kB)
        15. staterrs.tar.gz (2.12 MB)
        16. reproduce.sh (0.7 kB)
        17. Reproduce.java (1 kB)
        18. messages-oss-2 (67 kB)
        19. messages-oss-1 (43 kB)
        20. messages-mds (177 kB)
        21. messages.mds (108 kB)
        22. enoent-20120524.tar.bz2 (5.64 MB)
        23. enoent-20120523.tar.gz (9.64 MB)
        24. debug.tar.gz (9.42 MB)
        25. 20120606-netconsole.tbz2 (8 kB)


          People

            Assignee: Lai Siyao (laisiyao)
            Reporter: Roger Spellman (rspellman) (Inactive)
            Votes: 0
            Watchers: 3
