LU-8460

(osc_cache.c:658:osc_extent_find()) ASSERTION( (max_pages & ~chunk_mask) == 0 ) failed: LBUG

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.9.0
    • Severity: 3

    Description

      As part of CORAL testing I was trying to use ior to write/read with 16MB RPCs. As soon as the test started, all of the Lustre clients appeared to hit an LBUG, crash, and reboot.

      [263032.826900] LustreError: 54616:0:(mgc_request.c:1536:mgc_apply_recover_logs()) mgc: cannot find uuid by nid 192.168.1.4@o2ib
      [263032.839625] Lustre: 54616:0:(mgc_request.c:1756:mgc_process_recover_nodemap_log()) MGC192.168.1.5@o2ib: error processing recovery log lsdraid-cliir: rc = -2
      [263032.940768] Lustre: Mounted lsdraid-client
      [270650.452658] LustreError: 64198:0:(osc_cache.c:658:osc_extent_find()) ASSERTION( (max_pages & ~chunk_mask) == 0 ) failed: 
      [270650.465098] LustreError: 64198:0:(osc_cache.c:658:osc_extent_find()) LBUG
      [270650.472830] Pid: 64198, comm: ior
      [270650.476666] 
      Call Trace:
      [270650.482816]  [<ffffffffa08267d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
      [270650.492278]  [<ffffffffa0826d75>] lbug_with_loc+0x45/0xc0 [libcfs]
      [270650.502128]  [<ffffffffa0dc066d>] osc_extent_find+0x2ec/0x1a98 [osc]
      [270650.510720]  [<ffffffff81189b79>] ? zone_statistics+0x89/0xa0
      [270650.518827]  [<ffffffffa0dc2354>] ? osc_enter_cache+0x53b/0xe7a [osc]
      [270650.527638]  [<ffffffff810b5f8d>] ? ttwu_do_activate.constprop.84+0x5d/0x70
      [270650.537084]  [<ffffffff81173317>] ? __alloc_pages_nodemask+0x197/0xba0
      [270650.545977]  [<ffffffff81285e38>] ? security_capable+0x18/0x20
      [270650.554180]  [<ffffffffa0db8705>] osc_queue_async_io+0x6a5/0x1870 [osc]
      [270650.563226]  [<ffffffffa0d2aa70>] ? write_commit_callback+0x0/0x50 [lustre]
      [270650.572635]  [<ffffffffa0da55e3>] osc_page_cache_add+0x43/0x140 [osc]
      [270650.581481]  [<ffffffffa0dad903>] osc_io_commit_async+0x173/0x440 [osc]
      [270650.590561]  [<ffffffffa0d2aa70>] ? write_commit_callback+0x0/0x50 [lustre]
      [270650.600035]  [<ffffffffa0959317>] cl_io_commit_async+0x77/0x140 [obdclass]
      [270650.609317]  [<ffffffffa0c8654c>] lov_io_commit_async+0x2dc/0x4a0 [lov]
      [270650.618397]  [<ffffffffa0d2aa70>] ? write_commit_callback+0x0/0x50 [lustre]
      [270650.627864]  [<ffffffffa0d2aa70>] ? write_commit_callback+0x0/0x50 [lustre]
      [270650.637324]  [<ffffffffa0959317>] cl_io_commit_async+0x77/0x140 [obdclass]
      [270650.646690]  [<ffffffffa0d2d97a>] vvp_io_write_commit+0x17a/0x8d0 [lustre]
      [270650.655374]  [<ffffffffa0d1b807>] ll_write_end+0xc7/0x400 [lustre]
      [270650.663281]  [<ffffffff8116a214>] generic_file_buffered_write+0x184/0x290
      [270650.671879]  [<ffffffff8116b6d5>] __generic_file_aio_write+0x1d5/0x3e0
      [270650.680178]  [<ffffffffa0d2e355>] vvp_io_write_start+0x285/0x650 [lustre]
      [270650.688744]  [<ffffffffa0958925>] cl_io_start+0x65/0x130 [obdclass]
      [270650.696904]  [<ffffffffa095acc5>] cl_io_loop+0xa5/0x190 [obdclass]
      [270650.704945]  [<ffffffffa0cdbbef>] ll_file_io_generic+0x67f/0xaa0 [lustre]
      [270650.713656]  [<ffffffff811d4256>] ? mem_cgroup_update_page_stat+0x16/0x50
      [270650.722373]  [<ffffffffa0cdc2dd>] ll_file_aio_write+0x12d/0x1f0 [lustre]
      [270650.731000]  [<ffffffffa0cdc46e>] ll_file_write+0xce/0x1e0 [lustre]
      [270650.739126]  [<ffffffff811dec1d>] vfs_write+0xbd/0x1e0
      [270650.745994]  [<ffffffff811df6bf>] SyS_write+0x7f/0xe0
      [270650.752766]  [<ffffffff81646889>] system_call_fastpath+0x16/0x1b
      [270650.760571] 
      [270650.763508] Kernel panic - not syncing: LBUG
      [270650.769875] CPU: 17 PID: 64198 Comm: ior Tainted: G          IOE  ------------   3.10.0-327.22.2.el7.x86_64 #1
      [270650.782689] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
      [270650.796038]  ffffffffa0843def 0000000067048a74 ffff88103fd9f560 ffffffff816360fc
      [270650.806102]  ffff88103fd9f5e0 ffffffff8162f977 ffffffff00000008 ffff88103fd9f5f0
      [270650.816170]  ffff88103fd9f590 0000000067048a74 ffffffffa0dc63b0 0000000000000000
      [270650.826189] Call Trace:
      [270650.830566]  [<ffffffff816360fc>] dump_stack+0x19/0x1b
      [270650.837897]  [<ffffffff8162f977>] panic+0xd8/0x1e7
      [270650.844870]  [<ffffffffa0826ddb>] lbug_with_loc+0xab/0xc0 [libcfs]
      [270650.853345]  [<ffffffffa0dc066d>] osc_extent_find+0x2ec/0x1a98 [osc]
      [270650.862000]  [<ffffffff81189b79>] ? zone_statistics+0x89/0xa0
      [270650.869976]  [<ffffffffa0dc2354>] ? osc_enter_cache+0x53b/0xe7a [osc]
      [270650.878736]  [<ffffffff810b5f8d>] ? ttwu_do_activate.constprop.84+0x5d/0x70
      [270650.888101]  [<ffffffff81173317>] ? __alloc_pages_nodemask+0x197/0xba0
      [270650.896965]  [<ffffffff81285e38>] ? security_capable+0x18/0x20
      [270650.905069]  [<ffffffffa0db8705>] osc_queue_async_io+0x6a5/0x1870 [osc]
      [270650.914081]  [<ffffffffa0d2aa70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
      [270650.923740]  [<ffffffffa0da55e3>] osc_page_cache_add+0x43/0x140 [osc]
      [270650.932528]  [<ffffffffa0dad903>] osc_io_commit_async+0x173/0x440 [osc]
      [270650.941515]  [<ffffffffa0d2aa70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
      [270650.951198]  [<ffffffffa0959317>] cl_io_commit_async+0x77/0x140 [obdclass]
      [270650.960478]  [<ffffffffa0c8654c>] lov_io_commit_async+0x2dc/0x4a0 [lov]
      [270650.969481]  [<ffffffffa0d2aa70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
      [270650.979160]  [<ffffffffa0d2aa70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
      [270650.989192]  [<ffffffffa0959317>] cl_io_commit_async+0x77/0x140 [obdclass]
      [270650.998797]  [<ffffffffa0d2d97a>] vvp_io_write_commit+0x17a/0x8d0 [lustre]
      [270651.008434]  [<ffffffffa0d1b807>] ll_write_end+0xc7/0x400 [lustre]
      [270651.017239]  [<ffffffff8116a214>] generic_file_buffered_write+0x184/0x290
      [270651.026725]  [<ffffffff8116b6d5>] __generic_file_aio_write+0x1d5/0x3e0
      [270651.035938]  [<ffffffffa0d2e355>] vvp_io_write_start+0x285/0x650 [lustre]
      [270651.045484]  [<ffffffffa0958925>] cl_io_start+0x65/0x130 [obdclass]
      [270651.054443]  [<ffffffffa095acc5>] cl_io_loop+0xa5/0x190 [obdclass]
      [270651.063300]  [<ffffffffa0cdbbef>] ll_file_io_generic+0x67f/0xaa0 [lustre]
      [270651.072803]  [<ffffffff811d4256>] ? mem_cgroup_update_page_stat+0x16/0x50
      [270651.082339]  [<ffffffffa0cdc2dd>] ll_file_aio_write+0x12d/0x1f0 [lustre]
      [270651.091766]  [<ffffffffa0cdc46e>] ll_file_write+0xce/0x1e0 [lustre]
      [270651.100727]  [<ffffffff811dec1d>] vfs_write+0xbd/0x1e0
      [270651.108378]  [<ffffffff811df6bf>] SyS_write+0x7f/0xe0
      [270651.115918]  [<ffffffff81646889>] system_call_fastpath+0x16/0x1b 
      

      Here is what IOR saw:

      IOR-3.0.1: MPI Coordinated Test of Parallel I/O
      
      Began: Mon Aug  1 18:28:46 2016
      Command line used: /home/johnsali/wolf-3/ior/src/ior -i 5 -v -b 4096M -t 16m
      Machine: Linux wolf-6.wolf.hpdd.intel.com
      Start time skew across all tasks: 0.00 sec
      
      Test 0 started: Mon Aug  1 18:28:46 2016
      Path: /mnt/lustre
      FS: 13.9 TiB   Used FS: 0.0%   Inodes: 4.2 Mi   Used Inodes: 0.0%
      Participating tasks: 3
      Summary:
      	api                = POSIX
      	test filename      = testFile
      	access             = single-shared-file
      	pattern            = segmented (1 segment)
      	ordering in a file = sequential offsets
      	ordering inter file= no tasks offsets
      	clients            = 3 (1 per node)
      	repetitions        = 5
      	xfersize           = 16 MiB
      	blocksize          = 4 GiB
      	aggregate filesize = 12 GiB
      
      access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
      ------    ---------  ---------- ---------  --------   --------   --------   --------   ----
      Commencing write performance test: Mon Aug  1 18:28:46 2016
      
      Message from syslogd@wolf-6 at Aug  1 18:28:48 ...
       kernel:LustreError: 64198:0:(osc_cache.c:658:osc_extent_find()) ASSERTION( (max_pages & ~chunk_mask) == 0 ) failed: 
      
      Message from syslogd@wolf-6 at Aug  1 18:28:48 ...
       kernel:LustreError: 64198:0:(osc_cache.c:658:osc_extent_find()) LBUG
      

      Attempts to change the max_pages_per_rpc have not been successful:

      [root@wolf-8 ~]#  lctl set_param osc.*.max_pages_per_rpc=1024
      error: set_param: setting /proc/fs/lustre/osc/lsdraid-OST0000-osc-ffff88104f0c1000/max_pages_per_rpc=1024: Numerical result out of range 
      

      System crash files can be found at (all clients hit the LBUG/crash/reboot at the same time):
      /scratch/dumps/wolf-6.wolf.hpdd.intel.com/10.8.1.6-2016-08-01-18:29:12
      /scratch/dumps/wolf-7.wolf.hpdd.intel.com/10.8.1.7-2016-08-01-18:29:10
      /scratch/dumps/wolf-8.wolf.hpdd.intel.com/10.8.1.8-2016-08-01-18:29:11

      Attachments

        Activity

          [LU-8460] (osc_cache.c:658:osc_extent_find()) ASSERTION( (max_pages & ~chunk_mask) == 0 ) failed: LBUG

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21825/
          Subject: LU-8460 osc: max_pages_per_rpc should be chunk size aligned
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 13834f5aeef42d3c358574ac59475c0758dce300
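
          As a stand-alone illustration of the rule in the patch subject (a minimal sketch, not the code from change 21825): the client's max_pages_per_rpc must be a whole multiple of the chunk size implied by the server's block size, so a too-small value is rounded up instead of being left to trip the LASSERT in osc_extent_find().

          /* Illustrative user-space sketch only -- not the patch itself. */
          #include <stdio.h>

          #define PAGE_SHIFT 12   /* 4 KiB pages, as on the x86_64 clients above */

          /* Round max_pages up to the next multiple of the chunk size in pages. */
          static unsigned long align_max_pages(unsigned long max_pages, int chunk_bits)
          {
                  unsigned long chunk_pages = 1UL << (chunk_bits - PAGE_SHIFT);

                  return (max_pages + chunk_pages - 1) & ~(chunk_pages - 1);
          }

          int main(void)
          {
                  int chunk_bits = 24;            /* 16 MiB OST blocks, as in this ticket */
                  unsigned long max_pages = 256;  /* 1 MiB RPC at 4 KiB pages */

                  /* Prints 4096, i.e. a 16 MiB RPC, which satisfies
                   * (max_pages & ~chunk_mask) == 0. */
                  printf("aligned max_pages_per_rpc: %lu\n",
                         align_max_pages(max_pages, chunk_bits));
                  return 0;
          }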

          adilger Andreas Dilger added a comment -

          It seems that it would be possible to test this with a ZFS OST with recordsize=1024k and then manually setting brw_size=512K on an unpatched OST, and verifying that the client no longer crashes. Similarly, testing that it isn't possible to set brw_size smaller than recordsize on a patched OST would be useful, including changing recordsize after mount.

          jsalians_intel John Salinas (Inactive) added a comment -

          Currently this bug is blocking our progress: https://jira.hpdd.intel.com/browse/LU-8498

          jay Jinshan Xiong (Inactive) added a comment -

          jsalians_intel can you please verify that patch 21825 actually fixes the problem?

          gerrit Gerrit Updater added a comment -

          Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/21825
          Subject: LU-8460 osc: chunk size should not bigger than RPC size
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: bde1f813f387f8e15492a864e8c3a78eb0eb3ff4

          jay Jinshan Xiong (Inactive) added a comment -

          chunk_bits of 24 means a 16MB block size on the OST side, while max_pages_per_rpc was unfortunately set to 256 (1MB) on the client.
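
          To make that concrete, here is a minimal sketch of the arithmetic behind the failed assertion, using the values printed in the report below (chunk_bits 24, max_pages 0x100, chunk_mask 0xfffffffffffff000); it is not the Lustre source, only the check it performs.

          /* Minimal sketch of the failed LASSERT's arithmetic; values match the
           * report below. Not the Lustre source, only the check it makes. */
          #include <assert.h>
          #include <stdio.h>

          #define PAGE_SHIFT 12

          int main(void)
          {
                  int chunk_bits = 24;                                            /* 16 MiB OST block size   */
                  unsigned long chunk_pages = 1UL << (chunk_bits - PAGE_SHIFT);   /* 4096 pages              */
                  unsigned long chunk_mask  = ~(chunk_pages - 1);                 /* 0xfffffffffffff000      */
                  unsigned long max_pages   = 256;                                /* 1 MiB max_pages_per_rpc */

                  /* 256 pages (1 MiB) is not a whole number of 4096-page (16 MiB)
                   * chunks, so the low bits are non-zero and the assertion fires. */
                  printf("max_pages & ~chunk_mask = 0x%lx\n", max_pages & ~chunk_mask);
                  assert((max_pages & ~chunk_mask) == 0);   /* this is the check that LBUGs */
                  return 0;
          }
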
          jsalians_intel John Salinas (Inactive) added a comment - edited

          With the patch I see:

          kernel:LustreError: 137570:0:(osc_cache.c:661:osc_extent_find()) ASSERTION( (max_pages & ~chunk_mask) == 0 ) failed: max_pages: 0x100 chunk_bits: 24 chunk_mask: 0xfffffffffffff000
          kernel:LustreError: 137570:0:(osc_cache.c:661:osc_extent_find()) LBUG
          

          This is with default settings (no 16MB RPCs set):

          client [root@wolf-6 ~]# cat /proc/fs/lustre/osc/lsdraid-OST0000-osc-ffff88084eccc800/max_pages_per_rpc
          256 
          

          OSS:

          [root@wolf-3 kernel]# lctl get_param obdfilter.lsdraid-OST0000.brw_size
          obdfilter.lsdraid-OST0000.brw_size=1
           [root@wolf-4 ~]# lctl get_param obdfilter.lsdraid-OST0000.brw_size
          obdfilter.lsdraid-OST0000.brw_size=1
          
          [413569.867436] Lustre: Mounted lsdraid-client
          [414679.351734] LustreError: 137570:0:(osc_cache.c:661:osc_extent_find()) ASSERTION( (max_pages & ~chunk_mask) == 0 ) failed: max_pages: 0x100 chunk_bits: 24 chunk_mask: 0xfffffffffffff000
          [414679.373266] LustreError: 137570:0:(osc_cache.c:661:osc_extent_find()) LBUG
          [414679.382571] Pid: 137570, comm: mdtest
          [414679.388269] 
          Call Trace:
          [414679.395587]  [<ffffffffa071d7d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
          [414679.404872]  [<ffffffffa071dd75>] lbug_with_loc+0x45/0xc0 [libcfs]
          [414679.413187]  [<ffffffffa0e31677>] osc_extent_find+0x2f6/0x1aa2 [osc]
          [414679.421671]  [<ffffffffa0e3335e>] ? osc_enter_cache+0x53b/0xe7a [osc]
          [414679.430159]  [<ffffffff81173317>] ? __alloc_pages_nodemask+0x197/0xba0
          [414679.438756]  [<ffffffffa0e1780b>] ? osc_lru_alloc+0x3b/0x390 [osc]
          [414679.446884]  [<ffffffff81285e38>] ? security_capable+0x18/0x20
          [414679.454653]  [<ffffffffa0e29705>] osc_queue_async_io+0x6a5/0x1870 [osc]
          [414679.463245]  [<ffffffffa0cf9c8c>] ? lov_merge_lvb_kms+0x12c/0x450 [lov]
          [414679.471818]  [<ffffffffa0e17b60>] ? osc_page_init+0x0/0x210 [osc]
          [414679.479775]  [<ffffffffa0d9ba70>] ? write_commit_callback+0x0/0x50 [lustre]
          [414679.488686]  [<ffffffffa0e165e3>] osc_page_cache_add+0x43/0x140 [osc]
          [414679.496969]  [<ffffffffa0e1e903>] osc_io_commit_async+0x173/0x440 [osc]
          [414679.505470]  [<ffffffffa0d9ba70>] ? write_commit_callback+0x0/0x50 [lustre]
          [414679.514355]  [<ffffffffa0a65557>] cl_io_commit_async+0x77/0x140 [obdclass]
          [414679.523097]  [<ffffffffa0cf754c>] lov_io_commit_async+0x2dc/0x4a0 [lov]
          [414679.531506]  [<ffffffffa0d9ba70>] ? write_commit_callback+0x0/0x50 [lustre]
          [414679.540302]  [<ffffffffa0d9ba70>] ? write_commit_callback+0x0/0x50 [lustre]
          [414679.549061]  [<ffffffffa0a65557>] cl_io_commit_async+0x77/0x140 [obdclass]
          [414679.557714]  [<ffffffffa0d9e97a>] vvp_io_write_commit+0x17a/0x8d0 [lustre]
          [414679.566324]  [<ffffffffa0d9f39f>] vvp_io_write_start+0x2cf/0x650 [lustre]
          [414679.574846]  [<ffffffffa0a64b65>] cl_io_start+0x65/0x130 [obdclass]
          [414679.582796]  [<ffffffffa0a66f05>] cl_io_loop+0xa5/0x190 [obdclass]
          [414679.590648]  [<ffffffffa0d4cbef>] ll_file_io_generic+0x67f/0xaa0 [lustre]
          [414679.599179]  [<ffffffffa0d4d2dd>] ll_file_aio_write+0x12d/0x1f0 [lustre]
          [414679.607620]  [<ffffffffa0d4d46e>] ll_file_write+0xce/0x1e0 [lustre]
          [414679.615561]  [<ffffffff811dec1d>] vfs_write+0xbd/0x1e0
          [414679.622248]  [<ffffffff811df6bf>] SyS_write+0x7f/0xe0
          [414679.628843]  [<ffffffff81646889>] system_call_fastpath+0x16/0x1b
          [414679.636498] 
          [414679.639249] Kernel panic - not syncing: LBUG
          [414679.644949] CPU: 15 PID: 137570 Comm: mdtest Tainted: G          IOE  ------------   3.10.0-327.22.2.el7.x86_64 #1
          [414679.657442] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
          [414679.670035]  ffffffffa073adef 00000000a1cd2686 ffff88103ee6f6f8 ffffffff816360fc
          [414679.679336]  ffff88103ee6f778 ffffffff8162f977 ffffffff00000008 ffff88103ee6f788
          [414679.688629]  ffff88103ee6f728 00000000a1cd2686 ffffffffa0e373b0 0000000000000000
          [414679.697931] Call Trace:
          [414679.701624]  [<ffffffff816360fc>] dump_stack+0x19/0x1b
          [414679.708306]  [<ffffffff8162f977>] panic+0xd8/0x1e7
          [414679.714580]  [<ffffffffa071dddb>] lbug_with_loc+0xab/0xc0 [libcfs]
          [414679.722387]  [<ffffffffa0e31677>] osc_extent_find+0x2f6/0x1aa2 [osc]
          [414679.730387]  [<ffffffffa0e3335e>] ? osc_enter_cache+0x53b/0xe7a [osc]
          [414679.738450]  [<ffffffff81173317>] ? __alloc_pages_nodemask+0x197/0xba0
          [414679.746598]  [<ffffffffa0e1780b>] ? osc_lru_alloc+0x3b/0x390 [osc]
          [414679.754346]  [<ffffffff81285e38>] ? security_capable+0x18/0x20
          [414679.761704]  [<ffffffffa0e29705>] osc_queue_async_io+0x6a5/0x1870 [osc]
          [414679.769940]  [<ffffffffa0cf9c8c>] ? lov_merge_lvb_kms+0x12c/0x450 [lov]
          [414679.778188]  [<ffffffffa0e17b60>] ? osc_lru_alloc+0x390/0x390 [osc]
          [414679.786036]  [<ffffffffa0d9ba70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
          [414679.794960]  [<ffffffffa0e165e3>] osc_page_cache_add+0x43/0x140 [osc]
          [414679.803203]  [<ffffffffa0e1e903>] osc_io_commit_async+0x173/0x440 [osc]
          [414679.811621]  [<ffffffffa0d9ba70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
          [414679.820732]  [<ffffffffa0a65557>] cl_io_commit_async+0x77/0x140 [obdclass]
          [414679.829440]  [<ffffffffa0cf754c>] lov_io_commit_async+0x2dc/0x4a0 [lov]
          [414679.837867]  [<ffffffffa0d9ba70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
          [414679.846976]  [<ffffffffa0d9ba70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
          [414679.856076]  [<ffffffffa0a65557>] cl_io_commit_async+0x77/0x140 [obdclass]
          [414679.864791]  [<ffffffffa0d9e97a>] vvp_io_write_commit+0x17a/0x8d0 [lustre]
          [414679.873498]  [<ffffffffa0d9f39f>] vvp_io_write_start+0x2cf/0x650 [lustre]
          [414679.882106]  [<ffffffffa0a64b65>] cl_io_start+0x65/0x130 [obdclass]
          [414679.890136]  [<ffffffffa0a66f05>] cl_io_loop+0xa5/0x190 [obdclass]
          [414679.898055]  [<ffffffffa0d4cbef>] ll_file_io_generic+0x67f/0xaa0 [lustre]
          [414679.906650]  [<ffffffffa0d4d2dd>] ll_file_aio_write+0x12d/0x1f0 [lustre]
          [414679.915158]  [<ffffffffa0d4d46e>] ll_file_write+0xce/0x1e0 [lustre]
          [414679.923158]  [<ffffffff811dec1d>] vfs_write+0xbd/0x1e0
          [414679.929946]  [<ffffffff811df6bf>] SyS_write+0x7f/0xe0
          [414679.936584]  [<ffffffff81646889>] system_call_fastpath+0x16/0x1b
          
          /scratch/dumps/wolf-6.wolf.hpdd.intel.com/10.8.1.6-2016-08-08-17:12:27/
          

          jay Jinshan Xiong (Inactive) added a comment -

          The issue is that brw_size is 16M on the server side but max_pages_per_rpc is set to 1024 (i.e. 4M) on the client side.

          There is a defect in how the chunk size is handled on the client side: it should be set to the maximum of the size reported by the OST and max_pages_per_rpc.
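
          One way to read that suggestion, as a stand-alone sketch with hypothetical names (not Lustre code and not the final patch): take whichever of the two sizes is larger and keep the result a whole number of chunks.

          /* Hypothetical sketch of the reconciliation described above: the effective
           * RPC size is the larger of the OST-reported block size and the client's
           * requested max_pages_per_rpc, rounded to whole chunks. Not Lustre code. */
          #include <stdio.h>

          #define PAGE_SHIFT 12

          static unsigned long effective_rpc_pages(unsigned long ost_block_pages,
                                                   unsigned long requested_pages)
          {
                  unsigned long pages = requested_pages > ost_block_pages ?
                                        requested_pages : ost_block_pages;

                  /* Round up so the result stays a multiple of the chunk size
                   * (assumes ost_block_pages is a power of two, as here). */
                  return (pages + ost_block_pages - 1) & ~(ost_block_pages - 1);
          }

          int main(void)
          {
                  unsigned long ost_block_pages = 1UL << (24 - PAGE_SHIFT);  /* 16 MiB blocks          */
                  unsigned long requested_pages = 1024;                      /* client asked for 4 MiB */

                  /* Prints 4096 pages, i.e. the 16 MiB the server-side chunking needs. */
                  printf("effective max_pages_per_rpc = %lu pages\n",
                         effective_rpc_pages(ost_block_pages, requested_pages));
                  return 0;
          }
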
          pjones Peter Jones added a comment -

          Bobijam

          Could you please assist with this issue?

          John

          Were you using Andreas's diagnostic patch when you were running?

          Peter


          adilger Andreas Dilger added a comment -

          John, could you please reproduce the problem with the above patch applied, so that we can see why the assertion is failing.

          jsalians_intel John Salinas (Inactive) added a comment -

          Today I have hit this several times while trying to reproduce a different hang. In this case I do NOT have 16MB blocks set. It is a completely vanilla run, and as soon as mdtest starts I hit this:

          ]# mpirun -np 4 -machinefile hosts /home/johnsali/wolf-3/mdtest-1.8.4/mdtest -i 2 -I 2 -z 2 -u -w 1 -d /mnt/lustre/
          -- started at 08/03/2016 21:55:07 --

          mdtest-1.8.3 was launched with 4 total task(s) on 4 nodes
          Command line used: /home/johnsali/wolf-3/mdtest-1.8.4/mdtest -i 2 -I 2 -z 2 -u -w 1 -d /mnt/lustre/
          Path: /mnt/lustre
          FS: 13.9 TiB Used FS: 0.0% Inodes: 55.8 Mi Used Inodes: 89.5%
          4 tasks, 24 files/directories

          Message from syslogd@wolf-6 at Aug 3 21:55:08 ...
          kernel:LustreError: 6182:0:(osc_cache.c:658:osc_extent_find()) ASSERTION( (max_pages & ~chunk_mask) == 0 ) failed:
          Message from syslogd@wolf-6 at Aug 3 21:55:08 ...
          kernel:LustreError: 6182:0:(osc_cache.c:658:osc_extent_find()) LBUG


          People

            Assignee: bobijam Zhenyu Xu
            Reporter: jsalians_intel John Salinas (Inactive)
            Votes: 0
            Watchers: 6

            Dates

              Created:
              Updated:
              Resolved: