Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-8460

(osc_cache.c:658:osc_extent_find()) ASSERTION( (max_pages & ~chunk_mask) == 0 ) failed: LBUG

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.9.0
    • None
    • 3
    • 9223372036854775807

    Description

      As part of CORAL testing I was trying to use ior to to write/read some 16MB RPCs. As soon as the test started it appeared all the Lustre clients took an LBUG, crashed, and rebooted.

      [263032.826900] LustreError: 54616:0:(mgc_request.c:1536:mgc_apply_recover_logs()) mgc: cannot find uuid by nid 192.168.1.4@o2ib
      [263032.839625] Lustre: 54616:0:(mgc_request.c:1756:mgc_process_recover_nodemap_log()) MGC192.168.1.5@o2ib: error processing recovery log lsdraid-cliir: rc = -2
      [263032.940768] Lustre: Mounted lsdraid-client
      [270650.452658] LustreError: 64198:0:(osc_cache.c:658:osc_extent_find()) ASSERTION( (max_pages & ~chunk_mask) == 0 ) failed: 
      [270650.465098] LustreError: 64198:0:(osc_cache.c:658:osc_extent_find()) LBUG
      [270650.472830] Pid: 64198, comm: ior
      [270650.476666] 
      Call Trace:
      [270650.482816]  [<ffffffffa08267d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
      [270650.492278]  [<ffffffffa0826d75>] lbug_with_loc+0x45/0xc0 [libcfs]
      [270650.502128]  [<ffffffffa0dc066d>] osc_extent_find+0x2ec/0x1a98 [osc]
      [270650.510720]  [<ffffffff81189b79>] ? zone_statistics+0x89/0xa0
      [270650.518827]  [<ffffffffa0dc2354>] ? osc_enter_cache+0x53b/0xe7a [osc]
      [270650.527638]  [<ffffffff810b5f8d>] ? ttwu_do_activate.constprop.84+0x5d/0x70
      [270650.537084]  [<ffffffff81173317>] ? __alloc_pages_nodemask+0x197/0xba0
      [270650.545977]  [<ffffffff81285e38>] ? security_capable+0x18/0x20
      [270650.554180]  [<ffffffffa0db8705>] osc_queue_async_io+0x6a5/0x1870 [osc]
      [270650.563226]  [<ffffffffa0d2aa70>] ? write_commit_callback+0x0/0x50 [lustre]
      [270650.572635]  [<ffffffffa0da55e3>] osc_page_cache_add+0x43/0x140 [osc]
      [270650.581481]  [<ffffffffa0dad903>] osc_io_commit_async+0x173/0x440 [osc]
      [270650.590561]  [<ffffffffa0d2aa70>] ? write_commit_callback+0x0/0x50 [lustre]
      [270650.600035]  [<ffffffffa0959317>] cl_io_commit_async+0x77/0x140 [obdclass]
      [270650.609317]  [<ffffffffa0c8654c>] lov_io_commit_async+0x2dc/0x4a0 [lov]
      [270650.618397]  [<ffffffffa0d2aa70>] ? write_commit_callback+0x0/0x50 [lustre]
      [270650.627864]  [<ffffffffa0d2aa70>] ? write_commit_callback+0x0/0x50 [lustre]
      [270650.637324]  [<ffffffffa0959317>] cl_io_commit_async+0x77/0x140 [obdclass]
      [270650.646690]  [<ffffffffa0d2d97a>] vvp_io_write_commit+0x17a/0x8d0 [lustre]
      [270650.655374]  [<ffffffffa0d1b807>] ll_write_end+0xc7/0x400 [lustre]
      [270650.663281]  [<ffffffff8116a214>] generic_file_buffered_write+0x184/0x290
      [270650.671879]  [<ffffffff8116b6d5>] __generic_file_aio_write+0x1d5/0x3e0
      [270650.680178]  [<ffffffffa0d2e355>] vvp_io_write_start+0x285/0x650 [lustre]
      [270650.688744]  [<ffffffffa0958925>] cl_io_start+0x65/0x130 [obdclass]
      [270650.696904]  [<ffffffffa095acc5>] cl_io_loop+0xa5/0x190 [obdclass]
      [270650.704945]  [<ffffffffa0cdbbef>] ll_file_io_generic+0x67f/0xaa0 [lustre]
      [270650.713656]  [<ffffffff811d4256>] ? mem_cgroup_update_page_stat+0x16/0x50
      [270650.722373]  [<ffffffffa0cdc2dd>] ll_file_aio_write+0x12d/0x1f0 [lustre]
      [270650.731000]  [<ffffffffa0cdc46e>] ll_file_write+0xce/0x1e0 [lustre]
      [270650.739126]  [<ffffffff811dec1d>] vfs_write+0xbd/0x1e0
      [270650.745994]  [<ffffffff811df6bf>] SyS_write+0x7f/0xe0
      [270650.752766]  [<ffffffff81646889>] system_call_fastpath+0x16/0x1b
      [270650.760571] 
      [270650.763508] Kernel panic - not syncing: LBUG
      [270650.769875] CPU: 17 PID: 64198 Comm: ior Tainted: G          IOE  ------------   3.10.0-327.22.2.el7.x86_64 #1
      [270650.782689] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
      [270650.796038]  ffffffffa0843def 0000000067048a74 ffff88103fd9f560 ffffffff816360fc
      [270650.806102]  ffff88103fd9f5e0 ffffffff8162f977 ffffffff00000008 ffff88103fd9f5f0
      [270650.816170]  ffff88103fd9f590 0000000067048a74 ffffffffa0dc63b0 0000000000000000
      [270650.826189] Call Trace:
      [270650.830566]  [<ffffffff816360fc>] dump_stack+0x19/0x1b
      [270650.837897]  [<ffffffff8162f977>] panic+0xd8/0x1e7
      [270650.844870]  [<ffffffffa0826ddb>] lbug_with_loc+0xab/0xc0 [libcfs]
      [270650.853345]  [<ffffffffa0dc066d>] osc_extent_find+0x2ec/0x1a98 [osc]
      [270650.862000]  [<ffffffff81189b79>] ? zone_statistics+0x89/0xa0
      [270650.869976]  [<ffffffffa0dc2354>] ? osc_enter_cache+0x53b/0xe7a [osc]
      [270650.878736]  [<ffffffff810b5f8d>] ? ttwu_do_activate.constprop.84+0x5d/0x70
      [270650.888101]  [<ffffffff81173317>] ? __alloc_pages_nodemask+0x197/0xba0
      [270650.896965]  [<ffffffff81285e38>] ? security_capable+0x18/0x20
      [270650.905069]  [<ffffffffa0db8705>] osc_queue_async_io+0x6a5/0x1870 [osc]
      [270650.914081]  [<ffffffffa0d2aa70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
      [270650.923740]  [<ffffffffa0da55e3>] osc_page_cache_add+0x43/0x140 [osc]
      [270650.932528]  [<ffffffffa0dad903>] osc_io_commit_async+0x173/0x440 [osc]
      [270650.941515]  [<ffffffffa0d2aa70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
      [270650.951198]  [<ffffffffa0959317>] cl_io_commit_async+0x77/0x140 [obdclass]
      [270650.960478]  [<ffffffffa0c8654c>] lov_io_commit_async+0x2dc/0x4a0 [lov]
      [270650.969481]  [<ffffffffa0d2aa70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
      [270650.979160]  [<ffffffffa0d2aa70>] ? mkwrite_commit_callback+0x20/0x20 [lustre]
      [270650.989192]  [<ffffffffa0959317>] cl_io_commit_async+0x77/0x140 [obdclass]
      [270650.998797]  [<ffffffffa0d2d97a>] vvp_io_write_commit+0x17a/0x8d0 [lustre]
      [270651.008434]  [<ffffffffa0d1b807>] ll_write_end+0xc7/0x400 [lustre]
      [270651.017239]  [<ffffffff8116a214>] generic_file_buffered_write+0x184/0x290
      [270651.026725]  [<ffffffff8116b6d5>] __generic_file_aio_write+0x1d5/0x3e0
      [270651.035938]  [<ffffffffa0d2e355>] vvp_io_write_start+0x285/0x650 [lustre]
      [270651.045484]  [<ffffffffa0958925>] cl_io_start+0x65/0x130 [obdclass]
      [270651.054443]  [<ffffffffa095acc5>] cl_io_loop+0xa5/0x190 [obdclass]
      [270651.063300]  [<ffffffffa0cdbbef>] ll_file_io_generic+0x67f/0xaa0 [lustre]
      [270651.072803]  [<ffffffff811d4256>] ? mem_cgroup_update_page_stat+0x16/0x50
      [270651.082339]  [<ffffffffa0cdc2dd>] ll_file_aio_write+0x12d/0x1f0 [lustre]
      [270651.091766]  [<ffffffffa0cdc46e>] ll_file_write+0xce/0x1e0 [lustre]
      [270651.100727]  [<ffffffff811dec1d>] vfs_write+0xbd/0x1e0
      [270651.108378]  [<ffffffff811df6bf>] SyS_write+0x7f/0xe0
      [270651.115918]  [<ffffffff81646889>] system_call_fastpath+0x16/0x1b 
      

      Here is what IOR saw:

      IOR-3.0.1: MPI Coordinated Test of Parallel I/O
      
      Began: Mon Aug  1 18:28:46 2016
      Command line used: /home/johnsali/wolf-3/ior/src/ior -i 5 -v -b 4096M -t 16m
      Machine: Linux wolf-6.wolf.hpdd.intel.com
      Start time skew across all tasks: 0.00 sec
      
      Test 0 started: Mon Aug  1 18:28:46 2016
      Path: /mnt/lustre
      FS: 13.9 TiB   Used FS: 0.0%   Inodes: 4.2 Mi   Used Inodes: 0.0%
      Participating tasks: 3
      Summary:
      	api                = POSIX
      	test filename      = testFile
      	access             = single-shared-file
      	pattern            = segmented (1 segment)
      	ordering in a file = sequential offsets
      	ordering inter file= no tasks offsets
      	clients            = 3 (1 per node)
      	repetitions        = 5
      	xfersize           = 16 MiB
      	blocksize          = 4 GiB
      	aggregate filesize = 12 GiB
      
      access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
      ------    ---------  ---------- ---------  --------   --------   --------   --------   ----
      Commencing write performance test: Mon Aug  1 18:28:46 2016
      
      Message from syslogd@wolf-6 at Aug  1 18:28:48 ...
       kernel:LustreError: 64198:0:(osc_cache.c:658:osc_extent_find()) ASSERTION( (max_pages & ~chunk_mask) == 0 ) failed: 
      
      Message from syslogd@wolf-6 at Aug  1 18:28:48 ...
       kernel:LustreError: 64198:0:(osc_cache.c:658:osc_extent_find()) LBUG
      

      Attempts to change the max_pages_per_rpc have not been successful:

      [root@wolf-8 ~]#  lctl set_param osc.*.max_pages_per_rpc=1024
      error: set_param: setting /proc/fs/lustre/osc/lsdraid-OST0000-osc-ffff88104f0c1000/max_pages_per_rpc=1024: Numerical result out of range 
      

      System crash files can be found at (all clients hit the LBUG/crash/reboot at the same time):
      /scratch/dumps/wolf-6.wolf.hpdd.intel.com/10.8.1.6-2016-08-01-18:29:12
      /scratch/dumps/wolf-7.wolf.hpdd.intel.com/10.8.1.7-2016-08-01-18:29:10
      /scratch/dumps/wolf-8.wolf.hpdd.intel.com/10.8.1.8-2016-08-01-18:29:11

      Attachments

        Activity

          People

            bobijam Zhenyu Xu
            jsalians_intel John Salinas (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: