Failing customer's file creation test

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.1.1
    • None
    • 2
    • 6412

    Description

      The customer is running a script that calls a Java program (see attachments). Two clients panicked.

      2012-04-13 19:21:49 +0000 [26759.401564] -----------[ cut here ]-----------
      2012-04-13 19:21:49 +0000 [26759.424756] WARNING: at fs/libfs.c:363 simple_setattr+0x99/0xb0()
      2012-04-13 19:21:49 +0000 [26759.455860] Hardware name: ProLiant BL460c G7
      2012-04-13 19:21:49 +0000 [26759.477542] Modules linked in: lmv mgc lustre lquota lov osc mdc fid fld ksocklnd ptlrpc obdclass lnet lvfs libcfs ppdev 8021q garp bridge stp llc nfsd nfs lockd nfs_acl auth_rpcgss sunrpc netconsole configfs dm_crypt dm_mod crc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iptable_filter be2net ip_tables x_tables hpilo joydev hpwdt rtc_cmos psmouse rtc_core lp rtc_lib evdev parport mac_hid loop serio_raw tcp_scalable fuse virtio_blk virtio virtio_ring xenfs ext4 mbcache jbd2 xfs usbhid exportfs raid1 mptspi mptsas mptscsih mptbase mpt2sas raid_class arcmsr aic94xx libsas libata scsi_transport_sas aic7xxx aic79xx scsi_transport_spi megaraid_sas cciss sd_mod sg hpsa scsi_mod uhci_hcd ehci_hcd [last unloaded: libcfs]
      2012-04-13 19:21:49 +0000 [26759.830142] Pid: 11122, comm: java Not tainted 2.6.38.2-ts4 #11
      2012-04-13 19:21:49 +0000 [26759.860423] Call Trace:
      2012-04-13 19:21:49 +0000 [26759.872545] [<ffffffff8105d10f>] ? warn_slowpath_common+0x7f/0xc0
      2012-04-13 19:21:49 +0000 [26759.903121] [<ffffffff8105d16a>] ? warn_slowpath_null+0x1a/0x20
      2012-04-13 19:21:49 +0000 [26759.933790] [<ffffffff8115bea9>] ? simple_setattr+0x99/0xb0
      2012-04-13 19:21:50 +0000 [26759.962234] [<ffffffff8115be10>] ? simple_setattr+0x0/0xb0
      2012-04-13 19:21:50 +0000 [26759.990161] [<ffffffffa1200af6>] ? ll_md_setattr+0x3e6/0x840 [lustre]
      2012-04-13 19:21:50 +0000 [26760.024040] [<ffffffffa12011b4>] ? ll_setattr_raw+0x264/0xe40 [lustre]
      2012-04-13 19:21:50 +0000 [26760.056779] [<ffffffffa0ba3126>] ? cfs_hash_del+0xa6/0x1d0 [libcfs]
      2012-04-13 19:21:50 +0000 [26760.089867] [<ffffffffa1201ded>] ? ll_setattr+0x5d/0x100 [lustre]
      2012-04-13 19:21:50 +0000 [26760.120906] [<ffffffff81153761>] ? notify_change+0x161/0x2c0
      2012-04-13 19:21:50 +0000 [26760.150011] [<ffffffff811383d1>] ? do_truncate+0x61/0x90
      2012-04-13 19:21:50 +0000 [26760.176623] [<ffffffff8113862d>] ? sys_ftruncate+0xdd/0xf0
      2012-04-13 19:21:50 +0000 [26760.205178] [<ffffffff8100bfc2>] ? system_call_fastpath+0x16/0x1b
      2012-04-13 19:21:50 +0000 [26760.235981] --[ end trace 402d40ca74c5ea86 ]--

      On one OSS, about an hour before the client crash, I saw this:

      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: Lustre: Service thread pid 16292 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: Pid: 16292, comm: ll_ost_io_254
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel:
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: Call Trace:
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa026f1a9>] ? LNetGet+0x3e9/0x830 [lnet]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa02699c5>] ? LNetMDBind+0x135/0x490 [lnet]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff8104fdef>] ? lock_timer_base+0x2b/0x4f
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff8104febd>] ? try_to_del_timer_sync+0xaa/0xb7
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff814aa778>] schedule_timeout+0x1c6/0x1ee
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff810500ce>] ? process_timeout+0x0/0x10
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa0221351>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03b417f>] target_bulk_io+0x3af/0xa40 [ptlrpc]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa0221910>] ? cfs_alloc+0x30/0x60 [libcfs]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81038bc6>] ? default_wake_function+0x0/0x14
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa038752f>] ost_brw_write+0x12af/0x2040 [ost]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03ef7cc>] ? lustre_msg_set_timeout+0x9c/0x110 [ptlrpc]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff810c6edb>] ? free_hot_page+0x3f/0x44
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff810c7081>] ? __free_pages+0x5a/0x70
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03b19c0>] ? target_bulk_timeout+0x0/0x100 [ptlrpc]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa038e104>] ost_handle+0x2604/0x57e0 [ost]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03ea29e>] ? lustre_msg_get_opc+0x8e/0xf0 [ptlrpc]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03f735c>] ptlrpc_server_handle_request+0x4ec/0xfc0 [ptlrpc]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81037e95>] ? enqueue_task+0x7c/0x8b
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa02212ae>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa022deb0>] ? lc_watchdog_touch+0x70/0x150 [libcfs]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa022dfd7>] ? lc_watchdog_disable+0x47/0x120 [libcfs]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03fa714>] ? ptlrpc_wait_event+0xa4/0x2d0 [ptlrpc]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81030fda>] ? __wake_up+0x48/0x55
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03fae7f>] ptlrpc_main+0x53f/0x1670 [ptlrpc]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81003ada>] child_rip+0xa/0x20
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffffa03fa940>] ? ptlrpc_main+0x0/0x1670 [ptlrpc]
      Apr 13 18:00:53 ts-xxxxxxxx-04 kernel: [<ffffffff81003ad0>] ? child_rip+0x0/0x20

      I don't know whether it is related.
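To put a number on "about an hour": a small sketch that computes the gap between the OSS watchdog message and the first client warning, using the two timestamps quoted above. It assumes both hosts log in the same timezone; the OSS syslog line carries neither a year nor an offset, so 2012 and UTC are assumptions here.

```python
from datetime import datetime, timezone

# Timestamps copied from the logs above.
# Assumption: the OSS syslog line is from 2012 and in UTC (it states neither).
oss_stall = datetime.strptime(
    "2012 Apr 13 18:00:53", "%Y %b %d %H:%M:%S"
).replace(tzinfo=timezone.utc)

# The client netconsole line carries an explicit +0000 offset.
client_warning = datetime.strptime(
    "2012-04-13 19:21:49 +0000", "%Y-%m-%d %H:%M:%S %z"
)

gap = client_warning - oss_stall
print(gap)  # 1:20:56
```

If the servers and clients turn out to log in different timezones, the gap changes accordingly.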

      Attached files:

      Reproduce.java: Java program used when reproducing the problem
      reproduce.sh: Shell script that calls Reproduce.java in a loop
      messages-mds: /var/log/messages from the MDS
      messages-oss-1: /var/log/messages from the first OSS
      messages-oss-2: /var/log/messages from the second OSS
      usrs*: Client output via netconsole

      Attachments

        1. 20120606-netconsole.tbz2 (8 kB)
        2. debug.tar.gz (9.42 MB)
        3. enoent-20120523.tar.gz (9.64 MB)
        4. enoent-20120524.tar.bz2 (5.64 MB)
        5. messages.mds (108 kB)
        6. messages-mds (177 kB)
        7. messages-oss-1 (43 kB)
        8. messages-oss-2 (67 kB)
        9. Reproduce.java (1 kB)
        10. reproduce.sh (0.7 kB)
        11. staterrs.tar.gz (2.12 MB)
        12. usrs388.netconsole (3 kB)
        13. usrs389.netconsole (3 kB)
        14. usrs390.netconsole (3 kB)
        15. usrs391.netconsole (3 kB)
        16. usrs392.netconsole (3 kB)
        17. usrs393.messages (20 kB)
        18. usrs393.netconsole (2 kB)
        19. usrs394.netconsole (3 kB)
        20. usrs395.netconsole (3 kB)
        21. usrs396.netconsole (4 kB)
        22. usrs397.netconsole (3 kB)
        23. usrs398.netconsole (3 kB)
        24. usrs399.netconsole (3 kB)
        25. usrs400.netconsole (3 kB)

        Activity

          [LU-1328] Failing customer's file creation test
          laisiyao Lai Siyao added a comment -

          This is just a kernel warning, which is a bit excessive, and it has nothing to do with real I/O performance. Lustre uses the kernel-exported function simple_setattr(), which was originally intended for simple filesystems (which don't implement truncate), so it gives a warning, but there is no real problem.


          rspellman Roger Spellman (Inactive) added a comment -

          Customer reports:

          We have been running with the latest patch for several days with no problems other than the old slowpath warning:

          [609631.154485] -----------[ cut here ]-----------
          [609631.154499] WARNING: at fs/libfs.c:363 simple_setattr+0x99/0xb0()
          [609631.154502] Hardware name: ProLiant BL460c G7
          [609631.154504] Modules linked in: lmv mgc lustre lov osc mdc fid fld ksocklnd ptlrpc obdclass lnet lvfs libcfs parport_pc ppdev 8021q garp bridge stp llc nfsd nfs lockd nfs_acl auth_rpcgss sunrpc netconsole configfs dm_crypt dm_mod crc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iptable_filter ip_tables x_tables parport be2net psmouse evdev hpilo rtc_cmos joydev hpwdt serio_raw mac_hid loop rtc_core rtc_lib tcp_scalable fuse virtio_blk virtio virtio_ring xenfs ext4 mbcache jbd2 usbhid xfs exportfs raid1 mptspi mptsas mptscsih mptbase mpt2sas raid_class arcmsr aic94xx libsas libata scsi_transport_sas aic7xxx aic79xx scsi_transport_spi megaraid_sas cciss sd_mod sg hpsa scsi_mod ehci_hcd uhci_hcd [last unloaded: libcfs]
          [609631.154578] Pid: 19765, comm: java Not tainted 2.6.38.2-ts4 #11
          [609631.154580] Call Trace:
          [609631.154588] [<ffffffff8105d10f>] ? warn_slowpath_common+0x7f/0xc0
          [609631.154592] [<ffffffff8105d16a>] ? warn_slowpath_null+0x1a/0x20
          [609631.154596] [<ffffffff8115bea9>] ? simple_setattr+0x99/0xb0
          [609631.154632] [<ffffffffa10f2cd6>] ? ll_md_setattr+0x3e6/0x840 [lustre]
          [609631.154652] [<ffffffffa10f3394>] ? ll_setattr_raw+0x264/0xe40 [lustre]
          [609631.154672] [<ffffffffa10f3fcd>] ? ll_setattr+0x5d/0x100 [lustre]
          [609631.154677] [<ffffffff81153761>] ? notify_change+0x161/0x2c0
          [609631.154682] [<ffffffff811383d1>] ? do_truncate+0x61/0x90
          [609631.154687] [<ffffffff811beeec>] ? security_inode_permission+0x1c/0x30
          [609631.154692] [<ffffffff81144868>] ? finish_open+0x138/0x1b0
          [609631.154696] [<ffffffff81146003>] ? do_last+0x83/0x360
          [609631.154699] [<ffffffff81148706>] ? do_filp_open+0x3d6/0x830
          [609631.154704] [<ffffffff8110bc27>] ? handle_mm_fault+0x157/0x250
          [609631.154708] [<ffffffff8115493a>] ? alloc_fd+0x10a/0x150
          [609631.154713] [<ffffffff811373a9>] ? do_sys_open+0x69/0x110
          [609631.154717] [<ffffffff81137490>] ? sys_open+0x20/0x30
          [609631.154722] [<ffffffff8100bfc2>] ? system_call_fastpath+0x16/0x1b
          [609631.154725] --[ end trace e84ad085cd1d9abc ]--

          Question: Is lowering the number of OST threads, as documented in the Lustre manual, the best way to address this?

          laisiyao Lai Siyao added a comment -

          As Jinshan pointed out, this looks like the same issue as LU-1421; the fix is at http://review.whamcloud.com/#change,3027. Roger, could you apply this patch and try again?

          laisiyao Lai Siyao added a comment -

          Jinshan, could you help check why this ASSERT is not true on the 2.6.38 kernel?


          rspellman Roger Spellman (Inactive) added a comment -

          Customer reports:
          It appears that this patch has fixed the original bug, as the researcher was not able to replicate the problem. However, he was able to panic nine of the clients with an LBUG.

          See attachment for the netconsole output. The following is the output from one:

          2012-06-06 11:35:51 +0000 [4026925.936063] Lustre: MGC192.168.185.35@tcp: Reactivating import
          2012-06-06 11:35:52 +0000 [4026926.153312] Lustre: Mounted xxxx-client
          2012-06-06 16:08:32 +0000 [4043242.176640] LustreError: 8473:0:(cl_page.c:1026:cl_page_assume()) page@ffff880f6107ad80[4 ffff880d5f11e498:0 ^ (null)_ffff880f6107ae40 1 0 1 ffff88103efcc028 (null) 0x0]
          2012-06-06 16:08:32 +0000 [4043242.261214] LustreError: 8473:0:(cl_page.c:1026:cl_page_assume()) page@ffff880f6107ae40[1 ffff880fafbffda8:0 ^ffff880f6107ad80_ (null) 0 0 1 (null) (null) 0x0]
          2012-06-06 16:08:32 +0000 [4043242.346779] LustreError: 8473:0:(cl_page.c:1026:cl_page_assume()) vvp-page@ffff880dbfdb9640(0:0:0) vm@ffffea0033482180 60000000000086d 4:0 ffff880f6107ad80 0 lru
          2012-06-06 16:08:32 +0000 [4043242.419949] LustreError: 8473:0:(cl_page.c:1026:cl_page_assume()) lov-page@ffff880dbfdbdcf0
          2012-06-06 16:08:32 +0000 [4043242.462875] LustreError: 8473:0:(cl_page.c:1026:cl_page_assume()) osc-page@ffff880f610795a8: 1< 0x845fed 2 0 - - - > 2< 0 0 51 0x0 0x400 | (null) ffff880e893b1fb8 ffff88176f4b3e80 ffffffffa1088200 ffff880f610795a8 > 3< - ffff880967a58000 0 0 0 > 4< 0 0 48 532799488 - | - - + - > 5< - - - - | 0 - - | 0 - ->
          2012-06-06 16:08:32 +0000 [4043242.604818] LustreError: 8473:0:(cl_page.c:1026:cl_page_assume()) end page@ffff880f6107ad80
          2012-06-06 16:08:32 +0000 [4043242.647677] LustreError: 8473:0:(cl_page.c:1026:cl_page_assume()) pg->cp_owner == NULL
          2012-06-06 16:08:32 +0000 [4043242.688726] LustreError: 8473:0:(cl_page.c:1026:cl_page_assume()) ASSERTION( 0 ) failed:
          2012-06-06 16:08:32 +0000 [4043242.731216] LustreError: 8473:0:(cl_page.c:1026:cl_page_assume()) LBUG
          2012-06-06 16:08:32 +0000 [4043242.764574] Pid: 8473, comm: java
          2012-06-06 16:08:32 +0000 [4043242.782718]
          2012-06-06 16:08:32 +0000 [4043242.782720] Call Trace:
          2012-06-06 16:08:32 +0000 [4043242.803973] [<ffffffffa0b77865>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
          2012-06-06 16:08:32 +0000 [4043242.841227] [<ffffffffa0b77d97>] lbug_with_loc+0x47/0xc0 [libcfs]
          2012-06-06 16:08:32 +0000 [4043242.872993] [<ffffffffa0d61b50>] cl_page_own0+0x0/0x2c0 [obdclass]
          2012-06-06 16:08:32 +0000 [4043242.906060] [<ffffffffa1196726>] ll_prepare_write+0x86/0x170 [lustre]
          2012-06-06 16:08:33 +0000 [4043242.940052] [<ffffffffa11aa8e8>] ll_write_begin+0x88/0x160 [lustre]
          2012-06-06 16:08:33 +0000 [4043242.973265] [<ffffffffa11a51cb>] ? ll_getxattr+0xfb/0x440 [lustre]
          2012-06-06 16:08:33 +0000 [4043243.005911] [<ffffffff810e75fe>] generic_file_buffered_write+0xfe/0x250
          2012-06-06 16:08:33 +0000 [4043243.039902] [<ffffffff810ea1a0>] __generic_file_aio_write+0x230/0x470
          2012-06-06 16:08:33 +0000 [4043243.073917] [<ffffffff810ea442>] generic_file_aio_write+0x62/0xd0
          2012-06-06 16:08:33 +0000 [4043243.105326] [<ffffffffa11bb800>] vvp_io_write_start+0xb0/0x1e0 [lustre]
          2012-06-06 16:08:33 +0000 [4043243.139984] [<ffffffffa0d6a1d2>] cl_io_start+0x72/0x100 [obdclass]
          2012-06-06 16:08:33 +0000 [4043243.172847] [<ffffffffa0d6d774>] cl_io_loop+0xd4/0x160 [obdclass]
          2012-06-06 16:08:33 +0000 [4043243.203854] [<ffffffffa1175d3e>] ll_file_io_generic+0x3be/0x4f0 [lustre]
          2012-06-06 16:08:33 +0000 [4043243.238796] [<ffffffffa1175fa0>] ll_file_aio_write+0x130/0x1f0 [lustre]
          2012-06-06 16:08:33 +0000 [4043243.272727] [<ffffffffa117634c>] ll_file_write+0x14c/0x250 [lustre]
          2012-06-06 16:08:33 +0000 [4043243.306367] [<ffffffff811390a8>] vfs_write+0xc8/0x190
          2012-06-06 16:08:33 +0000 [4043243.332757] [<ffffffff811398d1>] sys_write+0x51/0x90
          2012-06-06 16:08:33 +0000 [4043243.359311] [<ffffffff8100bfc2>] system_call_fastpath+0x16/0x1b
          2012-06-06 16:08:33 +0000 [4043243.389972]
          2012-06-06 16:08:33 +0000 [4043243.399097] Kernel panic - not syncing: LBUG
          2012-06-06 16:08:33 +0000 [4043243.421020] Pid: 8473, comm: java Tainted: G W 2.6.38.2-ts4 #11
          2012-06-06 16:08:33 +0000 [4043243.455850] Call Trace:
          2012-06-06 16:08:33 +0000 [4043243.468809] [<ffffffff8145a373>] ? panic+0x91/0x19c
          2012-06-06 16:08:33 +0000 [4043243.494377] [<ffffffffa0b77dfb>] ? lbug_with_loc+0xab/0xc0 [libcfs]
          2012-06-06 16:08:33 +0000 [4043243.527753] [<ffffffffa0d61b50>] ? cl_page_own0+0x0/0x2c0 [obdclass]
          2012-06-06 16:08:33 +0000 [4043243.561570] [<ffffffffa1196726>] ? ll_prepare_write+0x86/0x170 [lustre]
          2012-06-06 16:08:33 +0000 [4043243.595489] [<ffffffffa11aa8e8>] ? ll_write_begin+0x88/0x160 [lustre]
          2012-06-06 16:08:33 +0000 [4043243.629850] [<ffffffffa11a51cb>] ? ll_getxattr+0xfb/0x440 [lustre]
          2012-06-06 16:08:33 +0000 [4043243.661785] [<ffffffff810e75fe>] ? generic_file_buffered_write+0xfe/0x250
          2012-06-06 16:08:33 +0000 [4043243.697719] [<ffffffff810ea1a0>] ? __generic_file_aio_write+0x230/0x470
          2012-06-06 16:08:33 +0000 [4043243.734746] [<ffffffff810ea442>] ? generic_file_aio_write+0x62/0xd0
          2012-06-06 16:08:33 +0000 [4043243.767215] [<ffffffffa11bb800>] ? vvp_io_write_start+0xb0/0x1e0 [lustre]
          2012-06-06 16:08:33 +0000 [4043243.803208] [<ffffffffa0d6a1d2>] ? cl_io_start+0x72/0x100 [obdclass]
          2012-06-06 16:08:33 +0000 [4043243.836585] [<ffffffffa0d6d774>] ? cl_io_loop+0xd4/0x160 [obdclass]
          2012-06-06 16:08:33 +0000 [4043243.870706] [<ffffffffa1175d3e>] ? ll_file_io_generic+0x3be/0x4f0 [lustre]
          2012-06-06 16:08:33 +0000 [4043243.907105] [<ffffffffa1175fa0>] ? ll_file_aio_write+0x130/0x1f0 [lustre]
          2012-06-06 16:08:34 +0000 [4043243.942842] [<ffffffffa117634c>] ? ll_file_write+0x14c/0x250 [lustre]
          2012-06-06 16:08:34 +0000 [4043243.977049] [<ffffffff811390a8>] ? vfs_write+0xc8/0x190
          2012-06-06 16:08:34 +0000 [4043244.004041] [<ffffffff811398d1>] ? sys_write+0x51/0x90
          2012-06-06 16:08:34 +0000 [4043244.031568] [<ffffffff8100bfc2>] ? system_call_fastpath+0x16/0x1b
          2012-06-06 16:08:34 +0000 [4043244.063216] -----------[ cut here ]-----------
          2012-06-06 16:08:34 +0000 [4043244.087602] WARNING: at arch/x86/kernel/smp.c:118 native_smp_send_reschedule+0x54/0x60()
          2012-06-06 16:08:34 +0000 [4043244.129258] Hardware name: ProLiant BL460c G7
          2012-06-06 16:08:34 +0000 [4043244.151613] Modules linked in: lmv mgc lustre lov osc mdc fid fld ksocklnd ptlrpc obdclass lnet lvfs libcfs parport_pc ppdev 8021q garp bridge stp llc nfsd netconsole configfs nfs lockd nfs_acl auth_rpcgss sunrpc dm_crypt dm_mod crc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iptable_filter ip_tables x_tables hpwdt psmouse be2net joydev hpilo evdev mac_hid serio_raw rtc_cmos rtc_core rtc_lib parport loop tcp_scalable fuse virtio_blk virtio virtio_ring xenfs ext4 mbcache jbd2 xfs exportfs raid1 usbhid mptspi mptsas mptscsih mptbase mpt2sas raid_class arcmsr aic94xx libsas libata scsi_transport_sas aic7xxx aic79xx scsi_transport_spi megaraid_sas cciss sg sd_mod hpsa ehci_hcd scsi_mod uhci_hcd [last unloaded: libcfs]
          2012-06-06 16:08:34 +0000 [4043244.501423] Pid: 15289, comm: ptlrpcd_6 Tainted: G W 2.6.38.2-ts4 #11
          2012-06-06 16:08:34 +0000 [4043244.540816] Call Trace:
          2012-06-06 16:08:34 +0000 [4043244.553799] [<ffffffff8105d10f>] ? warn_slowpath_common+0x7f/0xc0
          2012-06-06 16:08:34 +0000 [4043244.585974] [<ffffffff8105d16a>] ? warn_slowpath_null+0x1a/0x20
          2012-06-06 16:08:34 +0000 [4043244.616810] [<ffffffff81029c24>] ? native_smp_send_reschedule+0x54/0x60
          2012-06-06 16:08:34 +0000 [4043244.651789] [<ffffffff81045f06>] ? resched_task+0x76/0x90
          2012-06-06 16:08:34 +0000 [4043244.679720] [<ffffffff81056465>] ? check_preempt_wakeup+0x1b5/0x280
          2012-06-06 16:08:34 +0000 [4043244.714069] [<ffffffff81045fc4>] ? check_preempt_curr+0x84/0xa0
          2012-06-06 16:08:34 +0000 [4043244.745025] [<ffffffff81055ebb>] ? try_to_wake_up+0x7b/0x410
          2012-06-06 16:08:34 +0000 [4043244.775300] [<ffffffff81056262>] ? default_wake_function+0x12/0x20
          2012-06-06 16:08:34 +0000 [4043244.808168] [<ffffffff8107e046>] ? autoremove_wake_function+0x16/0x40
          2012-06-06 16:08:34 +0000 [4043244.841761] [<ffffffff810455c9>] ? __wake_up_common+0x59/0x90
          2012-06-06 16:08:34 +0000 [4043244.873316] [<ffffffff8104d998>] ? __wake_up+0x48/0x70
          2012-06-06 16:08:34 +0000 [4043244.900189] [<ffffffffa0b7834a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
          2012-06-06 16:08:35 +0000 [4043244.935008] [<ffffffffa1012227>] ? ksocknal_queue_tx_locked+0x277/0x540 [ksocklnd]
          2012-06-06 16:08:35 +0000 [4043244.973972] [<ffffffffa100d033>] ? ksocknal_find_conn_locked+0xa3/0x230 [ksocklnd]
          2012-06-06 16:08:35 +0000 [4043245.013276] [<ffffffffa101263b>] ? ksocknal_launch_packet+0x14b/0x350 [ksocklnd]
          2012-06-06 16:08:35 +0000 [4043245.053190] [<ffffffffa10129be>] ? ksocknal_send+0x17e/0x410 [ksocklnd]
          2012-06-06 16:08:35 +0000 [4043245.087409] [<ffffffffa0cce12b>] ? lnet_ni_send+0x4b/0x100 [lnet]
          2012-06-06 16:08:35 +0000 [4043245.119743] [<ffffffffa0cd294b>] ? lnet_send+0x20b/0xa30 [lnet]
          2012-06-06 16:08:35 +0000 [4043245.150531] [<ffffffffa0cce530>] ? lnet_prep_send+0x50/0xb0 [lnet]
          2012-06-06 16:08:35 +0000 [4043245.183883] [<ffffffffa0cd3a49>] ? LNetPut+0x2a9/0x670 [lnet]
          2012-06-06 16:08:35 +0000 [4043245.214255] [<ffffffffa0e9e4da>] ? ptl_send_buf+0x18a/0x440 [ptlrpc]
          2012-06-06 16:08:35 +0000 [4043245.247512] [<ffffffffa0ea08a0>] ? ptl_send_rpc+0x4e0/0xb10 [ptlrpc]
          2012-06-06 16:08:35 +0000 [4043245.280878] [<ffffffff8104bb3a>] ? finish_task_switch+0x4a/0x100
          2012-06-06 16:08:35 +0000 [4043245.311703] [<ffffffffa0e97695>] ? ptlrpc_send_new_req+0x3e5/0x720 [ptlrpc]
          2012-06-06 16:08:35 +0000 [4043245.348610] [<ffffffff8145d2bf>] ? _raw_spin_lock_irqsave+0x2f/0x40
          2012-06-06 16:08:35 +0000 [4043245.381490] [<ffffffffa0e9a9a0>] ? ptlrpc_check_set+0x340/0x1750 [ptlrpc]
          2012-06-06 16:08:35 +0000 [4043245.417485] [<ffffffff8106ca3a>] ? del_timer_sync+0x3a/0x60
          2012-06-06 16:08:35 +0000 [4043245.447321] [<ffffffffa0ec4dcb>] ? ptlrpcd_check+0x52b/0x550 [ptlrpc]
          2012-06-06 16:08:35 +0000 [4043245.480734] [<ffffffffa0ec50ab>] ? ptlrpcd+0x2bb/0x360 [ptlrpc]
          2012-06-06 16:08:35 +0000 [4043245.514762] [<ffffffff81056250>] ? default_wake_function+0x0/0x20
          2012-06-06 16:08:35 +0000 [4043245.546436] [<ffffffff8100cde4>] ? kernel_thread_helper+0x4/0x10
          2012-06-06 16:08:35 +0000 [4043245.578542] [<ffffffffa0ec4df0>] ? ptlrpcd+0x0/0x360 [ptlrpc]
          2012-06-06 16:08:35 +0000 [4043245.609487] [<ffffffff8100cde0>] ? kernel_thread_helper+0x0/0x10
          2012-06-06 16:08:35 +0000 [4043245.640837] --[ end trace 70a7f3071bb3c3f8 ]--
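When comparing the panics from several clients, it can help to reduce each netconsole dump to its call-trace frames. A minimal sketch, assuming the frame format seen in the logs above (`[<address>] ? symbol+0xoffset/0xsize [module]`); the `Frame` type, `parse_frames` function and regex are illustrative names, not part of any Lustre tooling. In kernel stack dumps, a `?` before the symbol marks a frame the unwinder could not confirm.

```python
import re
from typing import List, NamedTuple


class Frame(NamedTuple):
    reliable: bool  # False when the kernel printed "?" before the symbol
    symbol: str
    offset: int
    size: int
    module: str     # "" for symbols in the core kernel


# Matches e.g. "[<ffffffffa1196726>] ? ll_prepare_write+0x86/0x170 [lustre]"
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<q>\?\s+)?"
    r"(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<mod>\w+)\])?"
)


def parse_frames(lines: List[str]) -> List[Frame]:
    """Return one Frame per log line that contains a call-trace entry."""
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m:
            frames.append(Frame(
                reliable=m.group("q") is None,
                symbol=m.group("sym"),
                offset=int(m.group("off"), 16),
                size=int(m.group("size"), 16),
                module=m.group("mod") or "",
            ))
    return frames


# Three lines copied from the LBUG trace above.
sample = [
    "2012-06-06 16:08:32 +0000 [4043242.906060] [<ffffffffa1196726>] ll_prepare_write+0x86/0x170 [lustre]",
    "2012-06-06 16:08:33 +0000 [4043242.973265] [<ffffffffa11a51cb>] ? ll_getxattr+0xfb/0x440 [lustre]",
    "2012-06-06 16:08:33 +0000 [4043243.306367] [<ffffffff811390a8>] vfs_write+0xc8/0x190",
]
for frame in parse_frames(sample):
    print(frame)
```

Filtering on `reliable` and comparing the resulting symbol lists across the usrs*.netconsole files would show whether all nine clients hit the same path.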

          pjones Peter Jones added a comment -

          As I understand it, this code is being tracked for landing under LU-506


          rspellman Roger Spellman (Inactive) added a comment -

          My error. I copied the wrong modules to the target system. The modules load fine now.
          laisiyao Lai Siyao added a comment -

          I did the same as you, but all looks well.

          Also, the call to lut_boot_epoch_update() in ldlm_lib.c is inside #ifdef HAVE_SERVER_SUPPORT, so that code won't be compiled and won't cause the unknown symbol problem.

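One way to check the claim above on a given tree is to scan the source for calls that fall outside the guard. The following is a minimal sketch only: the sample file is a hypothetical stand-in for lustre/ldlm/ldlm_lib.c (its real contents are not shown in this ticket), and the awk scan assumes a single, non-nested #ifdef HAVE_SERVER_SUPPORT ... #endif region.

```shell
#!/bin/sh
# Hypothetical excerpt standing in for lustre/ldlm/ldlm_lib.c; the function
# names other than lut_boot_epoch_update() are made up for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
void client_only_path(void)
{
}
#ifdef HAVE_SERVER_SUPPORT
void server_only_path(void)
{
        lut_boot_epoch_update(lut);
}
#endif
EOF

# Track one non-nested guard region and report, per call site, whether the
# call is inside it. Nested or #else branches are NOT handled by this sketch.
report=$(awk '
    /^#ifdef HAVE_SERVER_SUPPORT/ { guarded = 1; next }
    /^#endif/                     { guarded = 0; next }
    /lut_boot_epoch_update\(/     { print NR ": " (guarded ? "guarded" : "UNGUARDED") }
' "$sample")
rm -f "$sample"

echo "$report"
```

To try this against a real checkout, point awk at the actual source file instead of the temporary sample. Any line reported as UNGUARDED would be compiled into a client-only build and could produce an unresolved symbol.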

          rspellman Roger Spellman (Inactive) added a comment -

          The commands I ran to get the build are:

          git clone git://git.whamcloud.com/fs/lustre-release.git # svn checkout url
          cd lustre-release
          git fetch http://review.whamcloud.com/p/fs/lustre-release refs/changes/65/1865/21 && git checkout FETCH_HEAD

          rspellman Roger Spellman (Inactive) added a comment -

          The problem still persists:

          May 31 12:05:21 compute-01-32 kernel: Lustre: Lustre: Build Version: 2.2.54-g9567e22-CHANGED-2.6.32-220.el6.x86_64
          May 31 12:05:21 compute-01-32 kernel: ptlrpc: Unknown symbol lut_boot_epoch_update
          May 31 12:05:21 compute-01-32 kernel: ptlrpc: Unknown symbol lut_mod_exit
          May 31 12:05:21 compute-01-32 kernel: ptlrpc: Unknown symbol lut_mod_init

          I think I know the cause of the problem.

          The file lustre/ptlrpc/Makefile.in contains the line

          @SERVER_TRUE@ptlrpc_objs += target.o

          After running:

          ./configure --with-kernel=/usr/src/kernels/2.6.32-220.el6.x86_64 \
          --with-linux=/usr/src/kernels/2.6.32-220.el6.x86_64 \
          --disable-liblustre \
          --without-sysio \
          --disable-server

          the generated file lustre/ptlrpc/Makefile contains

          #ptlrpc_objs += target.o

          The file lustre/ldlm/ldlm_lib.c calls lut_boot_epoch_update(), which is defined in target.c. Because ldlm_lib.c is compiled but target.c is not, the symbol is left undefined when the ptlrpc module loads.
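The substitution described in this comment can be reproduced in isolation. The sketch below only imitates what configure does with an Automake-style conditional, where the @SERVER_TRUE@ placeholder becomes empty when the condition holds and becomes a comment marker when it does not. The render helper is hypothetical and is not part of the Lustre build system.

```shell
#!/bin/sh
# render: hypothetical helper that mimics configure's handling of the
# @SERVER_TRUE@ placeholder. $1 is "yes" when server support is enabled.
render() {
    if [ "$1" = "yes" ]; then
        sed 's/@SERVER_TRUE@//'      # condition true: placeholder vanishes
    else
        sed 's/@SERVER_TRUE@/#/'     # condition false: line is commented out
    fi
}

# The line quoted from lustre/ptlrpc/Makefile.in in the comment above.
line='@SERVER_TRUE@ptlrpc_objs += target.o'

enabled=$(printf '%s\n' "$line" | render yes)
disabled=$(printf '%s\n' "$line" | render no)

echo "server enabled:  $enabled"
echo "server disabled: $disabled"
```

With --disable-server the rendered line matches the commented-out entry quoted from the generated Makefile, so target.o is never added to ptlrpc_objs. Any reference to a symbol defined in target.c that survives in a compiled file would then be unresolved when the module loads.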

          People

            laisiyao Lai Siyao
            rspellman Roger Spellman (Inactive)
            Votes: 0
            Watchers: 3
