
[LU-8460] (osc_cache.c:658:osc_extent_find()) ASSERTION( (max_pages & ~chunk_mask) == 0 ) failed: LBUG

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.9.0

    Description

      As part of CORAL testing I was trying to use ior to write/read with 16MB RPCs. As soon as the test started, it appeared that all the Lustre clients hit an LBUG, crashed, and rebooted.

      [263032.826900] LustreError: 54616:0:(mgc_request.c:1536:mgc_apply_recover_logs()) mgc: cannot find uuid by nid 192.168.1.4@o2ib
      [263032.839625] Lustre: 54616:0:(mgc_request.c:1756:mgc_process_recover_nodemap_log()) MGC192.168.1.5@o2ib: error processing recovery log lsdraid-cliir: rc = -2
      [263032.940768] Lustre: Mounted lsdraid-client
      [270650.452658] LustreError: 64198:0:(osc_cache.c:658:osc_extent_find()) ASSERTION( (max_pages & ~chunk_mask) == 0 ) failed: 
      [270650.465098] LustreError: 64198:0:(osc_cache.c:658:osc_extent_find()) LBUG
      [270650.472830] Pid: 64198, comm: ior
      [270650.476666] 
      Call Trace:
      [270650.482816]  [<ffffffffa08267d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
      [270650.492278]  [<ffffffffa0826d75>] lbug_with_loc+0x45/0xc0 [libcfs]
      [270650.502128]  [<ffffffffa0dc066d>] osc_extent_find+0x2ec/0x1a98 [osc]
      [270650.510720]  [<ffffffff81189b79>] ? zone_statistics+0x89/0xa0
      [270650.518827]  [<ffffffffa0dc2354>] ? osc_enter_cache+0x53b/0xe7a [osc]
      [270650.527638]  [<ffffffff810b5f8d>] ? ttwu_do_activate.constprop.84+0x5d/0x70
      [270650.537084]  [<ffffffff81173317>] ? __alloc_pages_nodemask+0x197/0xba0
      [270650.545977]  [<ffffffff81285e38>] ? security_capable+0x18/0x20
      [270650.554180]  [<ffffffffa0db8705>] osc_queue_async_io+0x6a5/0x1870 [osc]
      [270650.563226]  [<ffffffffa0d2aa70>] ? write_commit_callback+0x0/0x50 [lustre]
      [270650.572635]  [<ffffffffa0da55e3>] osc_page_cache_add+0x43/0x140 [osc]
      [270650.581481]  [<ffffffffa0dad903>] osc_io_commit_async+0x173/0x440 [osc]
      [270650.590561]  [<ffffffffa0d2aa70>] ? write_commit_callback+0x0/0x50 [lustre]
      [270650.600035]  [<ffffffffa0959317>] cl_io_commit_async+0x77/0x140 [obdclass]
      [270650.609317]  [<ffffffffa0c8654c>] lov_io_commit_async+0x2dc/0x4a0 [lov]
      [270650.618397]  [<ffffffffa0d2aa70>] ? write_commit_callback+0x0/0x50 [lustre]
      [270650.627864]  [<ffffffffa0d2aa70>] ? write_commit_callback+0x0/0x50 [lustre]
      [270650.637324]  [<ffffffffa0959317>] cl_io_commit_async+0x77/0x140 [obdclass]
      [270650.646690]  [<ffffffffa0d2d97a>] vvp_io_write_commit+0x17a/0x8d0 [lustre]
      [270650.655374]  [<ffffffffa0d1b807>] ll_write_end+0xc7/0x400 [lustre]
      [270650.663281]  [<ffffffff8116a214>] generic_file_buffered_write+0x184/0x290
      [270650.671879]  [<ffffffff8116b6d5>] __generic_file_aio_write+0x1d5/0x3e0
      [270650.680178]  [<ffffffffa0d2e355>] vvp_io_write_start+0x285/0x650 [lustre]
      [270650.688744]  [<ffffffffa0958925>] cl_io_start+0x65/0x130 [obdclass]
      [270650.696904]  [<ffffffffa095acc5>] cl_io_loop+0xa5/0x190 [obdclass]
      [270650.704945]  [<ffffffffa0cdbbef>] ll_file_io_generic+0x67f/0xaa0 [lustre]
      [270650.713656]  [<ffffffff811d4256>] ? mem_cgroup_update_page_stat+0x16/0x50
      [270650.722373]  [<ffffffffa0cdc2dd>] ll_file_aio_write+0x12d/0x1f0 [lustre]
      [270650.731000]  [<ffffffffa0cdc46e>] ll_file_write+0xce/0x1e0 [lustre]
      [270650.739126]  [<ffffffff811dec1d>] vfs_write+0xbd/0x1e0
      [270650.745994]  [<ffffffff811df6bf>] SyS_write+0x7f/0xe0
      [270650.752766]  [<ffffffff81646889>] system_call_fastpath+0x16/0x1b
      [270650.760571] 
      [270650.763508] Kernel panic - not syncing: LBUG
      [270650.769875] CPU: 17 PID: 64198 Comm: ior Tainted: G          IOE  ------------   3.10.0-327.22.2.el7.x86_64 #1
      [270650.782689] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
      [270650.796038]  ffffffffa0843def 0000000067048a74 ffff88103fd9f560 ffffffff816360fc
      [270650.806102]  ffff88103fd9f5e0 ffffffff8162f977 ffffffff00000008 ffff88103fd9f5f0
      [270650.816170]  ffff88103fd9f590 0000000067048a74 ffffffffa0dc63b0 0000000000000000
      [270650.826189] Call Trace:
      [270650.830566]  [<ffffffff816360fc>] dump_stack+0x19/0x1b
      [270650.837897]  [<ffffffff8162f977>] panic+0xd8/0x1e7
      [270650.844870]  [<ffffffffa0826ddb>] lbug_with_loc+0xab/0xc0 [libcfs]
      [270650.853345]  [<ffffffffa0dc066d>] osc_extent_find+0x2ec/0x1a98 [osc]
      [270650.862000]  [<ffffffff81189b79>] ? zone_statistics+0x89/0xa0
      [270650.869976]  [<ffffffffa0dc2354>] ? osc_enter_cache+0x53b/0xe7a [osc]
      [270650.878736]  [<ffffffff810b5f8d>] ? ttwu_do_activate.constprop.84+0x5d/0x70
      [270650.888101]  [<ffffffff81173317>] ? __alloc_pages_nodemask+0x197/0xba0
      [270650.896965]  [<ffffffff81285e38>] ? security_capable+0x18/0x20
      [270650.905069]  [<ffffffffa0db8705>] osc_queue_async_io+0x6a5/0x1870 [osc]
      [270650.914081]  [<ffffffffa0d2aa70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
      [270650.923740]  [<ffffffffa0da55e3>] osc_page_cache_add+0x43/0x140 [osc]
      [270650.932528]  [<ffffffffa0dad903>] osc_io_commit_async+0x173/0x440 [osc]
      [270650.941515]  [<ffffffffa0d2aa70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
      [270650.951198]  [<ffffffffa0959317>] cl_io_commit_async+0x77/0x140 [obdclass]
      [270650.960478]  [<ffffffffa0c8654c>] lov_io_commit_async+0x2dc/0x4a0 [lov]
      [270650.969481]  [<ffffffffa0d2aa70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
      [270650.979160]  [<ffffffffa0d2aa70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
      [270650.989192]  [<ffffffffa0959317>] cl_io_commit_async+0x77/0x140 [obdclass]
      [270650.998797]  [<ffffffffa0d2d97a>] vvp_io_write_commit+0x17a/0x8d0 [lustre]
      [270651.008434]  [<ffffffffa0d1b807>] ll_write_end+0xc7/0x400 [lustre]
      [270651.017239]  [<ffffffff8116a214>] generic_file_buffered_write+0x184/0x290
      [270651.026725]  [<ffffffff8116b6d5>] __generic_file_aio_write+0x1d5/0x3e0
      [270651.035938]  [<ffffffffa0d2e355>] vvp_io_write_start+0x285/0x650 [lustre]
      [270651.045484]  [<ffffffffa0958925>] cl_io_start+0x65/0x130 [obdclass]
      [270651.054443]  [<ffffffffa095acc5>] cl_io_loop+0xa5/0x190 [obdclass]
      [270651.063300]  [<ffffffffa0cdbbef>] ll_file_io_generic+0x67f/0xaa0 [lustre]
      [270651.072803]  [<ffffffff811d4256>] ? mem_cgroup_update_page_stat+0x16/0x50
      [270651.082339]  [<ffffffffa0cdc2dd>] ll_file_aio_write+0x12d/0x1f0 [lustre]
      [270651.091766]  [<ffffffffa0cdc46e>] ll_file_write+0xce/0x1e0 [lustre]
      [270651.100727]  [<ffffffff811dec1d>] vfs_write+0xbd/0x1e0
      [270651.108378]  [<ffffffff811df6bf>] SyS_write+0x7f/0xe0
      [270651.115918]  [<ffffffff81646889>] system_call_fastpath+0x16/0x1b 
      

      Here is what IOR saw:

      IOR-3.0.1: MPI Coordinated Test of Parallel I/O
      
      Began: Mon Aug  1 18:28:46 2016
      Command line used: /home/johnsali/wolf-3/ior/src/ior -i 5 -v -b 4096M -t 16m
      Machine: Linux wolf-6.wolf.hpdd.intel.com
      Start time skew across all tasks: 0.00 sec
      
      Test 0 started: Mon Aug  1 18:28:46 2016
      Path: /mnt/lustre
      FS: 13.9 TiB   Used FS: 0.0%   Inodes: 4.2 Mi   Used Inodes: 0.0%
      Participating tasks: 3
      Summary:
      	api                = POSIX
      	test filename      = testFile
      	access             = single-shared-file
      	pattern            = segmented (1 segment)
      	ordering in a file = sequential offsets
      	ordering inter file= no tasks offsets
      	clients            = 3 (1 per node)
      	repetitions        = 5
      	xfersize           = 16 MiB
      	blocksize          = 4 GiB
      	aggregate filesize = 12 GiB
      
      access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
      ------    ---------  ---------- ---------  --------   --------   --------   --------   ----
      Commencing write performance test: Mon Aug  1 18:28:46 2016
      
      Message from syslogd@wolf-6 at Aug  1 18:28:48 ...
       kernel:LustreError: 64198:0:(osc_cache.c:658:osc_extent_find()) ASSERTION( (max_pages & ~chunk_mask) == 0 ) failed: 
      
      Message from syslogd@wolf-6 at Aug  1 18:28:48 ...
       kernel:LustreError: 64198:0:(osc_cache.c:658:osc_extent_find()) LBUG
      

      Attempts to change the max_pages_per_rpc have not been successful:

      [root@wolf-8 ~]#  lctl set_param osc.*.max_pages_per_rpc=1024
      error: set_param: setting /proc/fs/lustre/osc/lsdraid-OST0000-osc-ffff88104f0c1000/max_pages_per_rpc=1024: Numerical result out of range 
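
      A likely explanation, consistent with the brw_size discussion further down in this ticket, is that the client refuses a max_pages_per_rpc larger than the bulk RPC size the OST advertised at connect time, and the OSS was still at its default brw_size. A minimal sketch of that bound check (hypothetical names, not the Lustre source):

      #include <errno.h>

      #define PAGE_SIZE_BYTES 4096UL   /* 4KB pages assumed */

      /* Reject an RPC size the server cannot handle; mirrors the
       * "Numerical result out of range" (-ERANGE) error seen above. */
      static int set_max_pages_per_rpc(unsigned long requested_pages,
                                       unsigned long server_brw_size_bytes)
      {
              if (requested_pages * PAGE_SIZE_BYTES > server_brw_size_bytes)
                      return -ERANGE;
              /* ... otherwise apply the new value ... */
              return 0;
      }

      With the OSS still at the 1MB default, set_max_pages_per_rpc(1024, 1UL << 20) returns -ERANGE (1024 pages is 4MB), while the default of 256 pages (1MB) is accepted.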
      

      System crash files can be found at (all clients hit the LBUG/crash/reboot at the same time):
      /scratch/dumps/wolf-6.wolf.hpdd.intel.com/10.8.1.6-2016-08-01-18:29:12
      /scratch/dumps/wolf-7.wolf.hpdd.intel.com/10.8.1.7-2016-08-01-18:29:10
      /scratch/dumps/wolf-8.wolf.hpdd.intel.com/10.8.1.8-2016-08-01-18:29:11

        Activity


          jsalians_intel John Salinas (Inactive) added a comment -

          Currently this bug is blocking our progress: https://jira.hpdd.intel.com/browse/LU-8498

          jay Jinshan Xiong (Inactive) added a comment -

          jsalians_intel, can you please verify that patch 21825 actually fixes the problem?

          gerrit Gerrit Updater added a comment -

          Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/21825
          Subject: LU-8460 osc: chunk size should not bigger than RPC size
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: bde1f813f387f8e15492a864e8c3a78eb0eb3ff4

          jay Jinshan Xiong (Inactive) added a comment -

          chunk_bits of 24 means a 16MB block size on the OST side, but max_pages_per_rpc was unfortunately set to 256.
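
          For reference, the failing check can be reproduced outside the kernel with those numbers. A minimal sketch (not the Lustre source; values taken from the improved assertion output reported below):

          #include <stdio.h>

          int main(void)
          {
                  unsigned int page_shift = 12;                    /* 4KB pages */
                  unsigned int chunk_bits = 24;                    /* 16MB chunks on the OST */
                  unsigned long chunk_pages = 1UL << (chunk_bits - page_shift);  /* 4096 pages */
                  unsigned long chunk_mask = ~(chunk_pages - 1);   /* 0xfffffffffffff000 */
                  unsigned long max_pages = 256;                   /* max_pages_per_rpc (1MB) */

                  /* The assertion requires max_pages to be a multiple of the chunk size. */
                  printf("max_pages & ~chunk_mask = %#lx\n", max_pages & ~chunk_mask);
                  /* Prints 0x100, i.e. non-zero: a 1MB RPC cannot cover a 16MB chunk,
                   * so (max_pages & ~chunk_mask) == 0 fails and the client LBUGs. */
                  return 0;
          }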
          jsalians_intel John Salinas (Inactive) added a comment - - edited

          With the patch I see:

          kernel:LustreError: 137570:0:(osc_cache.c:661:osc_extent_find()) ASSERTION( (max_pages & ~chunk_mask) == 0 ) failed: max_pages: 0x100 chunk_bits: 24 chunk_mask: 0xfffffffffffff000
          kernel:LustreError: 137570:0:(osc_cache.c:661:osc_extent_find()) LBUG
          

          This is with default settings (no 16MB RPCs set):

          client [root@wolf-6 ~]# cat /proc/fs/lustre/osc/lsdraid-OST0000-osc-ffff88084eccc800/max_pages_per_rpc
          256 
          

          OSS:

          [root@wolf-3 kernel]# lctl get_param obdfilter.lsdraid-OST0000.brw_size
          obdfilter.lsdraid-OST0000.brw_size=1
           [root@wolf-4 ~]# lctl get_param obdfilter.lsdraid-OST0000.brw_size
          obdfilter.lsdraid-OST0000.brw_size=1
          
          [413569.867436] Lustre: Mounted lsdraid-client
          [414679.351734] LustreError: 137570:0:(osc_cache.c:661:osc_extent_find()) ASSERTION( (max_pages & ~chunk_mask) == 0 ) failed: max_pages: 0x100 chunk_bits: 24 chunk_mask: 0xfffffffffffff000
          [414679.373266] LustreError: 137570:0:(osc_cache.c:661:osc_extent_find()) LBUG
          [414679.382571] Pid: 137570, comm: mdtest
          [414679.388269] 
          Call Trace:
          [414679.395587]  [<ffffffffa071d7d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
          [414679.404872]  [<ffffffffa071dd75>] lbug_with_loc+0x45/0xc0 [libcfs]
          [414679.413187]  [<ffffffffa0e31677>] osc_extent_find+0x2f6/0x1aa2 [osc]
          [414679.421671]  [<ffffffffa0e3335e>] ? osc_enter_cache+0x53b/0xe7a [osc]
          [414679.430159]  [<ffffffff81173317>] ? __alloc_pages_nodemask+0x197/0xba0
          [414679.438756]  [<ffffffffa0e1780b>] ? osc_lru_alloc+0x3b/0x390 [osc]
          [414679.446884]  [<ffffffff81285e38>] ? security_capable+0x18/0x20
          [414679.454653]  [<ffffffffa0e29705>] osc_queue_async_io+0x6a5/0x1870 [osc]
          [414679.463245]  [<ffffffffa0cf9c8c>] ? lov_merge_lvb_kms+0x12c/0x450 [lov]
          [414679.471818]  [<ffffffffa0e17b60>] ? osc_page_init+0x0/0x210 [osc]
          [414679.479775]  [<ffffffffa0d9ba70>] ? write_commit_callback+0x0/0x50 [lustre]
          [414679.488686]  [<ffffffffa0e165e3>] osc_page_cache_add+0x43/0x140 [osc]
          [414679.496969]  [<ffffffffa0e1e903>] osc_io_commit_async+0x173/0x440 [osc]
          [414679.505470]  [<ffffffffa0d9ba70>] ? write_commit_callback+0x0/0x50 [lustre]
          [414679.514355]  [<ffffffffa0a65557>] cl_io_commit_async+0x77/0x140 [obdclass]
          [414679.523097]  [<ffffffffa0cf754c>] lov_io_commit_async+0x2dc/0x4a0 [lov]
          [414679.531506]  [<ffffffffa0d9ba70>] ? write_commit_callback+0x0/0x50 [lustre]
          [414679.540302]  [<ffffffffa0d9ba70>] ? write_commit_callback+0x0/0x50 [lustre]
          [414679.549061]  [<ffffffffa0a65557>] cl_io_commit_async+0x77/0x140 [obdclass]
          [414679.557714]  [<ffffffffa0d9e97a>] vvp_io_write_commit+0x17a/0x8d0 [lustre]
          [414679.566324]  [<ffffffffa0d9f39f>] vvp_io_write_start+0x2cf/0x650 [lustre]
          [414679.574846]  [<ffffffffa0a64b65>] cl_io_start+0x65/0x130 [obdclass]
          [414679.582796]  [<ffffffffa0a66f05>] cl_io_loop+0xa5/0x190 [obdclass]
          [414679.590648]  [<ffffffffa0d4cbef>] ll_file_io_generic+0x67f/0xaa0 [lustre]
          [414679.599179]  [<ffffffffa0d4d2dd>] ll_file_aio_write+0x12d/0x1f0 [lustre]
          [414679.607620]  [<ffffffffa0d4d46e>] ll_file_write+0xce/0x1e0 [lustre]
          [414679.615561]  [<ffffffff811dec1d>] vfs_write+0xbd/0x1e0
          [414679.622248]  [<ffffffff811df6bf>] SyS_write+0x7f/0xe0
          [414679.628843]  [<ffffffff81646889>] system_call_fastpath+0x16/0x1b
          [414679.636498] 
          [414679.639249] Kernel panic - not syncing: LBUG
          [414679.644949] CPU: 15 PID: 137570 Comm: mdtest Tainted: G          IOE  ------------   3.10.0-327.22.2.el7.x86_64 #1
          [414679.657442] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
          [414679.670035]  ffffffffa073adef 00000000a1cd2686 ffff88103ee6f6f8 ffffffff816360fc
          [414679.679336]  ffff88103ee6f778 ffffffff8162f977 ffffffff00000008 ffff88103ee6f788
          [414679.688629]  ffff88103ee6f728 00000000a1cd2686 ffffffffa0e373b0 0000000000000000
          [414679.697931] Call Trace:
          [414679.701624]  [<ffffffff816360fc>] dump_stack+0x19/0x1b
          [414679.708306]  [<ffffffff8162f977>] panic+0xd8/0x1e7
          [414679.714580]  [<ffffffffa071dddb>] lbug_with_loc+0xab/0xc0 [libcfs]
          [414679.722387]  [<ffffffffa0e31677>] osc_extent_find+0x2f6/0x1aa2 [osc]
          [414679.730387]  [<ffffffffa0e3335e>] ? osc_enter_cache+0x53b/0xe7a [osc]
          [414679.738450]  [<ffffffff81173317>] ? __alloc_pages_nodemask+0x197/0xba0
          [414679.746598]  [<ffffffffa0e1780b>] ? osc_lru_alloc+0x3b/0x390 [osc]
          [414679.754346]  [<ffffffff81285e38>] ? security_capable+0x18/0x20
          [414679.761704]  [<ffffffffa0e29705>] osc_queue_async_io+0x6a5/0x1870 [osc]
          [414679.769940]  [<ffffffffa0cf9c8c>] ? lov_merge_lvb_kms+0x12c/0x450 [lov]
          [414679.778188]  [<ffffffffa0e17b60>] ? osc_lru_alloc+0x390/0x390 [osc]
          [414679.786036]  [<ffffffffa0d9ba70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
          [414679.794960]  [<ffffffffa0e165e3>] osc_page_cache_add+0x43/0x140 [osc]
          [414679.803203]  [<ffffffffa0e1e903>] osc_io_commit_async+0x173/0x440 [osc]
          [414679.811621]  [<ffffffffa0d9ba70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
          [414679.820732]  [<ffffffffa0a65557>] cl_io_commit_async+0x77/0x140 [obdclass]
          [414679.829440]  [<ffffffffa0cf754c>] lov_io_commit_async+0x2dc/0x4a0 [lov]
          [414679.837867]  [<ffffffffa0d9ba70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
          [414679.846976]  [<ffffffffa0d9ba70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
          [414679.856076]  [<ffffffffa0a65557>] cl_io_commit_async+0x77/0x140 [obdclass]
          [414679.864791]  [<ffffffffa0d9e97a>] vvp_io_write_commit+0x17a/0x8d0 [lustre]
          [414679.873498]  [<ffffffffa0d9f39f>] vvp_io_write_start+0x2cf/0x650 [lustre]
          [414679.882106]  [<ffffffffa0a64b65>] cl_io_start+0x65/0x130 [obdclass]
          [414679.890136]  [<ffffffffa0a66f05>] cl_io_loop+0xa5/0x190 [obdclass]
          [414679.898055]  [<ffffffffa0d4cbef>] ll_file_io_generic+0x67f/0xaa0 [lustre]
          [414679.906650]  [<ffffffffa0d4d2dd>] ll_file_aio_write+0x12d/0x1f0 [lustre]
          [414679.915158]  [<ffffffffa0d4d46e>] ll_file_write+0xce/0x1e0 [lustre]
          [414679.923158]  [<ffffffff811dec1d>] vfs_write+0xbd/0x1e0
          [414679.929946]  [<ffffffff811df6bf>] SyS_write+0x7f/0xe0
          [414679.936584]  [<ffffffff81646889>] system_call_fastpath+0x16/0x1b
          
          /scratch/dumps/wolf-6.wolf.hpdd.intel.com/10.8.1.6-2016-08-08-17:12:27/
          

          jay Jinshan Xiong (Inactive) added a comment -

          The issue is that brw_size is 16M on the server side but max_pages_per_rpc is set to 1024 (i.e. 4M) on the client side.

          There is a defect in how the chunk size is handled on the client side; it should be set to the maximum of the size reported from the OST and max_pages_per_rpc.
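
          For illustration only (not the actual patch, and the identifiers below are hypothetical): the constraint named in the patch subject, "chunk size should not [be] bigger than RPC size", amounts to clamping the client's chunk_bits so that one RPC always covers a whole number of chunks:

          #define PAGE_SHIFT_BITS 12   /* 4KB pages assumed */

          static unsigned int client_chunk_bits(unsigned int ost_block_bits,
                                                unsigned long max_pages_per_rpc)
          {
                  unsigned int rpc_bits = PAGE_SHIFT_BITS;

                  /* bits needed to cover one full RPC of max_pages_per_rpc pages */
                  while ((1UL << (rpc_bits - PAGE_SHIFT_BITS)) < max_pages_per_rpc)
                          rpc_bits++;

                  /* never let the chunk grow beyond what one RPC can carry, otherwise
                   * (max_pages & ~chunk_mask) != 0 and the osc_extent_find() LASSERT fires */
                  return ost_block_bits < rpc_bits ? ost_block_bits : rpc_bits;
          }

          With the values from this ticket (ost_block_bits = 24, max_pages_per_rpc = 1024), this clamps the chunk to 4MB and the assertion holds; the real fix in http://review.whamcloud.com/21825 may differ in detail.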
          pjones Peter Jones added a comment -

          Bobijam

          Could you please assist with this issue?

          John

          Were you using Andreas's diagnostic patch when you were running?

          Peter


          adilger Andreas Dilger added a comment -

          John, could you please reproduce the problem with the above patch applied, so that we can see why the assertion is failing.

          jsalians_intel John Salinas (Inactive) added a comment -

          Today I have hit this several times while trying to reproduce a different hang. In this case I do NOT have 16MB blocks set. It is a completely vanilla run, and as soon as mdtest starts I hit this:

          ]# mpirun -np 4 -machinefile hosts /home/johnsali/wolf-3/mdtest-1.8.4/mdtest -i 2 -I 2 -z 2 -u -w 1 -d /mnt/lustre/
          -- started at 08/03/2016 21:55:07 --

          mdtest-1.8.3 was launched with 4 total task(s) on 4 nodes
          Command line used: /home/johnsali/wolf-3/mdtest-1.8.4/mdtest -i 2 -I 2 -z 2 -u -w 1 -d /mnt/lustre/
          Path: /mnt/lustre
          FS: 13.9 TiB Used FS: 0.0% Inodes: 55.8 Mi Used Inodes: 89.5%
          4 tasks, 24 files/directories

          Message from syslogd@wolf-6 at Aug 3 21:55:08 ...
          kernel:LustreError: 6182:0:(osc_cache.c:658:osc_extent_find()) ASSERTION( (max_pages & ~chunk_mask) == 0 ) failed:

          Message from syslogd@wolf-6 at Aug 3 21:55:08 ...
          kernel:LustreError: 6182:0:(osc_cache.c:658:osc_extent_find()) LBUG

          jsalians_intel John Salinas (Inactive) added a comment -

          This specific hit was on master; it appears we have the same issue on both 2.9 and master. We are not using 2.8 because it does not have the large 16MB RPC patches we are looking to use for Lustre streaming.

          It appears that setting this on the clients (lctl set_param osc.*.max_pages_per_rpc=1024) failed because we had not set this on the OSS nodes (lctl set_param obdfilter.lsdraid-OST0000.brw_size=16). The default brw_size was "1", which is what it was when we hit this issue. Last night we set both of these, were able to do a couple of IOR runs, and saw 16MB RPCs being used:

          OSS0
          [root@wolf-3 ~]# cat /proc/fs/lustre/obdfilter/lsdraid-OST0000/brw_stats
          snapshot_time: 1470125909.249936 (secs.usecs)

          read | write
          pages per bulk r/w rpcs % cum % | rpcs % cum %
          1: 5 0 0 | 597115 50 50
          2: 2 0 0 | 2 0 50
          4: 0 0 0 | 0 0 50
          8: 0 0 0 | 0 0 50
          16: 0 0 0 | 0 0 50
          32: 0 0 0 | 0 0 50
          64: 0 0 0 | 0 0 50
          128: 0 0 0 | 0 0 50
          256: 43 0 0 | 82 0 50
          512: 2 0 0 | 153548 12 62
          1K: 2 0 0 | 49 0 62
          2K: 39 0 0 | 141 0 62
          4K: 369260 99 100 | 443071 37 100

          read | write
          discontiguous pages rpcs % cum % | rpcs % cum %
          0: 369353 100 100 | 597117 50 50
          1: 0 0 100 | 0 0 50
          2: 0 0 100 | 0 0 50
          3: 0 0 100 | 0 0 50
          4: 0 0 100 | 0 0 50
          5: 0 0 100 | 0 0 50
          6: 0 0 100 | 0 0 50
          7: 0 0 100 | 0 0 50
          8: 0 0 100 | 0 0 50
          9: 0 0 100 | 0 0 50
          10: 0 0 100 | 0 0 50
          11: 0 0 100 | 0 0 50
          12: 0 0 100 | 0 0 50
          13: 0 0 100 | 0 0 50
          14: 0 0 100 | 0 0 50
          15: 0 0 100 | 0 0 50
          16: 0 0 100 | 0 0 50
          17: 0 0 100 | 0 0 50
          18: 0 0 100 | 0 0 50
          19: 0 0 100 | 0 0 50
          20: 0 0 100 | 0 0 50
          21: 0 0 100 | 0 0 50
          22: 0 0 100 | 0 0 50
          23: 0 0 100 | 0 0 50
          24: 0 0 100 | 0 0 50
          25: 0 0 100 | 0 0 50
          26: 0 0 100 | 0 0 50
          27: 0 0 100 | 0 0 50
          28: 0 0 100 | 0 0 50
          29: 0 0 100 | 0 0 50
          30: 0 0 100 | 0 0 50
          31: 0 0 100 | 596891 49 100

          read | write
          disk I/Os in flight ios % cum % | ios % cum %
          1: 13706 3 3 | 644001 53 53
          2: 32744 8 12 | 204951 17 71
          3: 26076 7 19 | 65415 5 76
          4: 41077 11 30 | 9928 0 77
          5: 62304 16 47 | 6889 0 77
          6: 71872 19 67 | 5011 0 78
          7: 63965 17 84 | 3906 0 78
          8: 41954 11 95 | 3087 0 78
          9: 15655 4 100 | 2566 0 79
          10: 0 0 100 | 2222 0 79
          11: 0 0 100 | 2485 0 79
          12: 0 0 100 | 3936 0 79
          13: 0 0 100 | 4464 0 80
          14: 0 0 100 | 4682 0 80
          15: 0 0 100 | 3413 0 80
          16: 0 0 100 | 2469 0 81
          17: 0 0 100 | 2331 0 81
          18: 0 0 100 | 2830 0 81
          19: 0 0 100 | 7537 0 82
          20: 0 0 100 | 57506 4 87
          21: 0 0 100 | 123813 10 97
          22: 0 0 100 | 21147 1 99
          23: 0 0 100 | 4920 0 99
          24: 0 0 100 | 4496 0 99
          25: 0 0 100 | 3 0 100

          read | write
          I/O time (1/1000s) ios % cum % | ios % cum %
          1: 4 0 0 | 0 0 0
          2: 0 0 0 | 0 0 0
          4: 0 0 0 | 0 0 0
          8: 0 0 0 | 0 0 0
          16: 0 0 0 | 0 0 0
          32: 9997 2 2 | 0 0 0
          64: 136162 36 39 | 0 0 0
          128: 172673 46 86 | 0 0 0
          256: 49096 13 99 | 0 0 0
          512: 1317 0 99 | 0 0 0
          1K: 22 0 99 | 0 0 0
          2K: 21 0 100 | 0 0 0

          read | write
          disk I/O size ios % cum % | ios % cum %
          4: 0 0 0 | 25 0 0
          8: 0 0 0 | 45 0 0
          16: 0 0 0 | 0 0 0
          32: 0 0 0 | 1 0 0
          64: 0 0 0 | 1 0 0
          128: 0 0 0 | 597016 50 50
          256: 0 0 0 | 18 0 50
          512: 0 0 0 | 9 0 50
          1K: 1 0 0 | 0 0 50
          2K: 0 0 0 | 0 0 50
          4K: 1 0 0 | 0 0 50
          8K: 2 0 0 | 2 0 50
          16K: 0 0 0 | 0 0 50
          32K: 0 0 0 | 0 0 50
          64K: 0 0 0 | 0 0 50
          128K: 0 0 0 | 0 0 50
          256K: 0 0 0 | 0 0 50
          512K: 0 0 0 | 0 0 50
          1M: 43 0 0 | 82 0 50
          2M: 2 0 0 | 153548 12 62
          4M: 2 0 0 | 49 0 62
          8M: 39 0 0 | 141 0 62
          16M: 369260 99 100 | 443071 37 100

          From one client:
          [root@wolf-6 mdisolation_step2_draid_testing]# lctl get_param osc.*.rpc_stats
          osc.lsdraid-OST0000-osc-ffff88104e13d800.rpc_stats=
          snapshot_time: 1470098591.939087 (secs.usecs)
          read RPCs in flight: 0
          write RPCs in flight: 1
          pending write pages: 0
          pending read pages: 0

          read write
          pages per rpc rpcs % cum % | rpcs % cum %
          1: 0 0 0 | 0 0 0
          2: 0 0 0 | 0 0 0
          4: 0 0 0 | 0 0 0
          8: 0 0 0 | 0 0 0
          16: 0 0 0 | 0 0 0
          32: 0 0 0 | 0 0 0
          64: 0 0 0 | 0 0 0
          128: 0 0 0 | 0 0 0
          256: 17 0 0 | 10 0 0
          512: 2 0 0 | 2 0 0
          1024: 2 0 0 | 5 0 0
          2048: 13 0 0 | 53 0 0
          4096: 116941 99 100 | 146350 99 100

          read write
          rpcs in flight rpcs % cum % | rpcs % cum %
          0: 0 0 0 | 0 0 0
          1: 14281 12 12 | 66644 45 45
          2: 31530 26 39 | 25379 17 62
          3: 71163 60 99 | 3173 2 65
          4: 1 0 100 | 4759 3 68
          5: 0 0 100 | 2206 1 69
          6: 0 0 100 | 1957 1 71
          7: 0 0 100 | 3053 2 73
          8: 0 0 100 | 39249 26 100

          read write
          offset rpcs % cum % | rpcs % cum %
          0: 20 0 0 | 40 0 0
          1: 0 0 0 | 0 0 0
          2: 0 0 0 | 0 0 0
          4: 0 0 0 | 0 0 0
          8: 0 0 0 | 0 0 0
          16: 0 0 0 | 0 0 0
          32: 0 0 0 | 0 0 0
          64: 0 0 0 | 0 0 0
          128: 0 0 0 | 0 0 0
          256: 5 0 0 | 4 0 0
          512: 8 0 0 | 0 0 0
          1024: 4 0 0 | 0 0 0
          2048: 6 0 0 | 3 0 0
          4096: 19 0 0 | 41 0 0
          8192: 40 0 0 | 82 0 0
          16384: 80 0 0 | 164 0 0
          32768: 159 0 0 | 328 0 0
          65536: 320 0 0 | 656 0 0
          131072: 637 0 1 | 1311 0 1
          262144: 1274 1 2 | 2621 1 3
          524288: 2553 2 4 | 5241 3 7
          1048576: 5107 4 8 | 8413 5 12
          2097152: 10211 8 17 | 16784 11 24
          4194304: 18374 15 33 | 22058 15 39
          8388608: 36778 31 64 | 43061 29 68
          16777216: 41380 35 100 | 45613 31 100

          However, now it appears mdtest is hanging – we will have to find some time to collect data on that.

          adilger Andreas Dilger added a comment - - edited

          What version of Lustre are you running?

          Also, based on your last comments, do you have 16MB RPCs enabled at all? The stock Lustre code should handle up to a 4MB RPC size out of the box, so it is strange that this is failing when essentially just doing 16MB-sized writes.

          If you are able to repeat this, could you please provide the steps used for setup (e.g. the current value of obdfilter.*.brw_size on the OSS, osc.*.max_pages_per_rpc and other relevant settings on the client) and improve the LASSERT() in osc_extent_find() with the patch http://review.whamcloud.com/21637 to help debug this problem.

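
          For context on the newer comments above: the diagnostic patch referenced here presumably upgrades the bare LASSERT() to an LASSERTF() that prints the offending values, which would explain the "max_pages: 0x100 chunk_bits: 24 chunk_mask: 0xfffffffffffff000" output John reports in the later comments. A sketch of that kind of change (an approximation, not the actual contents of patch 21637):

          /* osc_extent_find(): report the mismatching values instead of a bare LBUG */
          LASSERTF((max_pages & ~chunk_mask) == 0,
                   "max_pages: %#lx chunk_bits: %u chunk_mask: %#lx\n",
                   max_pages, chunk_bits, chunk_mask);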

          People

            Assignee: bobijam Zhenyu Xu
            Reporter: jsalians_intel John Salinas (Inactive)
            Votes: 0
            Watchers: 6