
Failing customer's file creation test

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.1
    • Labels: None
    • Severity: 2
    • Rank (Obsolete): 6412

    Description

      The customer is running a script that calls a Java program (see attachments). Two clients panicked.

      2012-04-13 19:21:49 +0000 [26759.401564] -----------[ cut here ]-----------
      2012-04-13 19:21:49 +0000 [26759.424756] WARNING: at fs/libfs.c:363 simple_setattr+0x99/0xb0()
      2012-04-13 19:21:49 +0000 [26759.455860] Hardware name: ProLiant BL460c G7
      2012-04-13 19:21:49 +0000 [26759.477542] Modules linked in: lmv mgc lustre lquota lov osc mdc fid fld ksocklnd ptlrpc obdclass lnet lvfs libcfs ppdev 8021q garp bridge stp llc nfsd nfs lockd nfs_acl auth_rpcgss sunrpc netconsole configfs dm_crypt dm_mod crc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iptable_filter be2net ip_tables x_tables hpilo joydev hpwdt rtc_cmos psmouse rtc_core lp rtc_lib evdev parport mac_hid loop serio_raw tcp_scalable fuse virtio_blk virtio virtio_ring xenfs ext4 mbcache jbd2 xfs usbhid exportfs raid1 mptspi mptsas mptscsih mptbase mpt2sas raid_class arcmsr aic94xx libsas libata scsi_transport_sas aic7xxx aic79xx scsi_transport_spi megaraid_sas cciss sd_mod sg hpsa scsi_mod uhci_hcd ehci_hcd [last unloaded: libcfs]
      2012-04-13 19:21:49 +0000 [26759.830142] Pid: 11122, comm: java Not tainted 2.6.38.2-ts4 #11
      2012-04-13 19:21:49 +0000 [26759.860423] Call Trace:
      2012-04-13 19:21:49 +0000 [26759.872545] [<ffffffff8105d10f>] ? warn_slowpath_common+0x7f/0xc0
      2012-04-13 19:21:49 +0000 [26759.903121] [<ffffffff8105d16a>] ? warn_slowpath_null+0x1a/0x20
      2012-04-13 19:21:49 +0000 [26759.933790] [<ffffffff8115bea9>] ? simple_setattr+0x99/0xb0
      2012-04-13 19:21:50 +0000 [26759.962234] [<ffffffff8115be10>] ? simple_setattr+0x0/0xb0
      2012-04-13 19:21:50 +0000 [26759.990161] [<ffffffffa1200af6>] ? ll_md_setattr+0x3e6/0x840 [lustre]
      2012-04-13 19:21:50 +0000 [26760.024040] [<ffffffffa12011b4>] ? ll_setattr_raw+0x264/0xe40 [lustre]
      2012-04-13 19:21:50 +0000 [26760.056779] [<ffffffffa0ba3126>] ? cfs_hash_del+0xa6/0x1d0 [libcfs]
      2012-04-13 19:21:50 +0000 [26760.089867] [<ffffffffa1201ded>] ? ll_setattr+0x5d/0x100 [lustre]
      2012-04-13 19:21:50 +0000 [26760.120906] [<ffffffff81153761>] ? notify_change+0x161/0x2c0
      2012-04-13 19:21:50 +0000 [26760.150011] [<ffffffff811383d1>] ? do_truncate+0x61/0x90
      2012-04-13 19:21:50 +0000 [26760.176623] [<ffffffff8113862d>] ? sys_ftruncate+0xdd/0xf0
      2012-04-13 19:21:50 +0000 [26760.205178] [<ffffffff8100bfc2>] ? system_call_fastpath+0x16/0x1b
      2012-04-13 19:21:50 +0000 [26760.235981] --[ end trace 402d40ca74c5ea86 ]--

      On one OSS, about an hour before the client crash, I saw this:

      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: Lustre: Service thread pid 16292 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: Pid: 16292, comm: ll_ost_io_254
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel:
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: Call Trace:
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa026f1a9>] ? LNetGet+0x3e9/0x830 [lnet]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa02699c5>] ? LNetMDBind+0x135/0x490 [lnet]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff8104fdef>] ? lock_timer_base+0x2b/0x4f
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff8104febd>] ? try_to_del_timer_sync+0xaa/0xb7
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff814aa778>] schedule_timeout+0x1c6/0x1ee
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff810500ce>] ? process_timeout+0x0/0x10
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa0221351>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03b417f>] target_bulk_io+0x3af/0xa40 [ptlrpc]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa0221910>] ? cfs_alloc+0x30/0x60 [libcfs]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81038bc6>] ? default_wake_function+0x0/0x14
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa038752f>] ost_brw_write+0x12af/0x2040 [ost]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03ef7cc>] ? lustre_msg_set_timeout+0x9c/0x110 [ptlrpc]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff810c6edb>] ? free_hot_page+0x3f/0x44
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff810c7081>] ? __free_pages+0x5a/0x70
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03b19c0>] ? target_bulk_timeout+0x0/0x100 [ptlrpc]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa038e104>] ost_handle+0x2604/0x57e0 [ost]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03ea29e>] ? lustre_msg_get_opc+0x8e/0xf0 [ptlrpc]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03f735c>] ptlrpc_server_handle_request+0x4ec/0xfc0 [ptlrpc]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81037e95>] ? enqueue_task+0x7c/0x8b
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa02212ae>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa022deb0>] ? lc_watchdog_touch+0x70/0x150 [libcfs]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa022dfd7>] ? lc_watchdog_disable+0x47/0x120 [libcfs]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03fa714>] ? ptlrpc_wait_event+0xa4/0x2d0 [ptlrpc]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81030fda>] ? __wake_up+0x48/0x55
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03fae7f>] ptlrpc_main+0x53f/0x1670 [ptlrpc]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81003ada>] child_rip+0xa/0x20
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03fa940>] ? ptlrpc_main+0x0/0x1670 [ptlrpc]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81003ad0>] ? child_rip+0x0/0x20

      I don't know whether it is related.

      Attached files:

      Reproduce.java: Java program used when reproducing the problem
      reproduce.sh: Shell script that calls Reproduce.java in a loop
      messages-mds: /var/log/messages from the MDS
      messages-oss-1: /var/log/messages from the first OSS
      messages-oss-2: /var/log/messages from the second OSS
      usrs*: Client output captured via netconsole.

      Attachments

        1. 20120606-netconsole.tbz2
          8 kB
        2. debug.tar.gz
          9.42 MB
        3. enoent-20120523.tar.gz
          9.64 MB
        4. enoent-20120524.tar.bz2
          5.64 MB
        5. messages.mds
          108 kB
        6. messages-mds
          177 kB
        7. messages-oss-1
          43 kB
        8. messages-oss-2
          67 kB
        9. Reproduce.java
          1 kB
        10. reproduce.sh
          0.7 kB
        11. staterrs.tar.gz
          2.12 MB
        12. usrs388.netconsole
          3 kB
        13. usrs389.netconsole
          3 kB
        14. usrs390.netconsole
          3 kB
        15. usrs391.netconsole
          3 kB
        16. usrs392.netconsole
          3 kB
        17. usrs393.messages
          20 kB
        18. usrs393.netconsole
          2 kB
        19. usrs394.netconsole
          3 kB
        20. usrs395.netconsole
          3 kB
        21. usrs396.netconsole
          4 kB
        22. usrs397.netconsole
          3 kB
        23. usrs398.netconsole
          3 kB
        24. usrs399.netconsole
          3 kB
        25. usrs400.netconsole
          3 kB

        Activity

          [LU-1328] Failing customer's file creation test
          laisiyao Lai Siyao added a comment -

          Yes, it's better to enable the debug trace on both the client and the server. Upon error, run `lctl dk /tmp/create.log` and then check the debug trace in that log file. The debug buffer only holds recent traces, which is why this command should be run immediately after the error occurs; otherwise the older debug traces are dropped.

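          For illustration, a minimal sketch (a hypothetical wrapper, not the attached reproduce.sh) of dumping the debug buffer the moment the test fails, before the relevant traces are overwritten:

          # Hypothetical wrapper: run the reproducer and, if it exits non-zero,
          # immediately dump the Lustre kernel debug buffer to a file.
          if ! sh ./reproduce.sh; then
              lctl dk /tmp/create.log    # dump the current debug buffer
          fi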

          rspellman Roger Spellman (Inactive) added a comment -

          What command is run to view the debug data?

          rspellman Roger Spellman (Inactive) added a comment -

          > I have no way to debug without debug logs and messages; it would be really useful to dump the debug log upon error on both the client and MDS. It's best to enable the 'info' and 'vfstrace' debug traces; you can check with `lctl get_param debug`, and if the output doesn't contain 'info' or 'vfstrace', you can enable them with `lctl set_param debug=+"info vfstrace"`.

          I will ask the customer to enable these. Should these be enabled on both the client and the servers?

          rspellman Roger Spellman (Inactive) added a comment -

          This is the messages file from the MDS.
          laisiyao Lai Siyao added a comment -

          I have no way to debug without debug logs and messages; it would be really useful to dump the debug log upon error on both the client and MDS. It's best to enable the 'info' and 'vfstrace' debug traces; you can check with `lctl get_param debug`, and if the output doesn't contain 'info' or 'vfstrace', you can enable them with `lctl set_param debug=+"info vfstrace"`.

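          As a sketch of that check (run on both the client and the MDS; the grep-based guard is an illustrative addition, not part of the comment above):

          # Show the current debug mask; 'info' and 'vfstrace' must be present.
          lctl get_param debug
          # Add the flags only if 'vfstrace' is not already in the mask.
          lctl get_param debug | grep -qw vfstrace || lctl set_param debug=+"info vfstrace"
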
          laisiyao Lai Siyao added a comment -

          Is there any error message on the MDS server?


          rspellman Roger Spellman (Inactive) added a comment -

          We applied the patch. The customer reports the following:

          $ lfs find . -type f | sh -c 'while read file; do stat $file & done >/dev/null'
          stat: cannot stat `./20120507-091222/d3/3/270.bin': No such file or directory
          stat: cannot stat `./20120507-091222/d5/32/681.bin': No such file or directory
          stat: cannot stat `./20120507-091222/d5/32/1525.bin': No such file or directory
          stat: cannot stat `./20120507-091222/d12/0/284.bin': No such file or directory
          stat: cannot stat `./20120507-091222/d12/0/1037.bin': No such file or directory
          stat: cannot stat `./20120507-091222/d12/0/917.bin': No such file or directory
          stat: cannot stat `./20120507-091222/d11/32/1807.bin': No such file or directory
          stat: cannot stat `./20120507-091222/d10/6/2587.bin': No such file or directory
          stat: cannot stat `./20120507-091222/d10/6/891.bin': No such file or directory
          stat: cannot stat `./20120507-091222/d10/6/750.bin': No such file or directory
          stat: cannot stat `./20120507-091222/d7/34/151.bin': No such file or directory

          Here are some additional details:

          This is a single client with 24 cores running the test.

          'ls -l' on each file succeeds. If I run the stats sequentially they all succeed; they only fail when I launch the stat processes in the background, so that they run in parallel.

          Each time I run the test, I get different files that can't be found.

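          For reference, a minimal sketch of the same parallel-stat check, restructured (hypothetically; the customer's exact pipeline is quoted above) so that all background stats are waited on and only the failures are kept. The /tmp/stat-errors.log path is an arbitrary choice:

          # Stat every regular file concurrently; record ENOENT failures in a log.
          : > /tmp/stat-errors.log
          lfs find . -type f | {
              while read -r f; do
                  stat "$f" >/dev/null 2>>/tmp/stat-errors.log &
              done
              wait    # wait for all background stats started in this subshell
          }
          grep -c 'No such file' /tmp/stat-errors.log    # failures in this run
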
          laisiyao Lai Siyao added a comment -

          Yes, the patch affects the Lustre MDS server only, and it only touches one file.


          rspellman Roger Spellman (Inactive) added a comment -

          Please verify that this only affects the Lustre server, not the client.

          rspellman Roger Spellman (Inactive) added a comment -

          Please verify that the patch only touches one file.
          laisiyao Lai Siyao added a comment -

          I'm testing with 2.6.38.6-26.


          People

            laisiyao Lai Siyao
            rspellman Roger Spellman (Inactive)
            Votes: 0
            Watchers: 3
