  Lustre / LU-6271

(osc_cache.c:3150:discard_cb()) ASSERTION( (!(page->cp_type == CPT_CACHEABLE) || (!PageDirty(cl_page_vmpage(page)))) ) failed:

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version: Lustre 2.8.0
    • Affects Version: Lustre 2.7.0
    • Hyperon
    • 3
    • 17581

    Description

      While performing failover testing:

      LustreError: 167-0: lustre-OST0000-osc-ffff88083b339000: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
      Feb 23 09:22:48 iwc260 kernel: LustreError: 51585:0:(ldlm_resource.c:777:ldlm_resource_complain()) lustre-OST0000-osc-ffff88083b339000: namespace resource [0x2d097bb:0x0:0x0].0 (ffff88106a992e40) refcount nonzero (1) after lock cleanup; forcing cleanup.
      Feb 23 09:22:48 iwc260 kernel: LustreError: 51585:0:(ldlm_resource.c:1374:ldlm_resource_dump()) --- Resource: [0x2d097bb:0x0:0x0].0 (ffff88106a992e40) refcount = 3
      Feb 23 09:22:48 iwc260 kernel: LustreError: 51585:0:(ldlm_resource.c:1377:ldlm_resource_dump()) Granted locks (in reverse order):
      Feb 23 09:22:48 iwc260 kernel: LustreError: 51585:0:(ldlm_resource.c:1380:ldlm_resource_dump()) ### ### ns: lustre-OST0000-osc-ffff88083b339000 lock: ffff88106a60d540/0xa1f402a8812988f4 lrc: 3/0,1 mode: PW/PW res: [0x2d097bb:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->8191) flags: 0x126400020000 nid: local remote: 0x72a9f4c80e66e07b expref: -99 pid: 51570 timeout: 0 lvb_type: 1
      Feb 23 09:22:48 iwc260 kernel: Lustre: lustre-OST0000-osc-ffff88083b339000: Connection restored to lustre-OST0000 (at 192.168.120.13@o2ib)
      Feb 23 09:22:49 iwc260 kernel: LustreError: 90252:0:(osc_cache.c:3150:discard_cb()) ASSERTION( (!(page->cp_type == CPT_CACHEABLE) || (!PageDirty(cl_page_vmpage(page)))) ) failed:
      Feb 23 09:22:49 iwc260 kernel: LustreError: 90252:0:(osc_cache.c:3150:discard_cb()) ASSERTION( (!(page->cp_type == CPT_CACHEABLE) || (!PageDirty(cl_page_vmpage(page)))) ) failed:
      Feb 23 09:22:49 iwc260 kernel: LustreError: 90252:0:(osc_cache.c:3150:discard_cb()) LBUG
      Feb 23 09:22:49 iwc260 kernel: LustreError: 90252:0:(osc_cache.c:3150:discard_cb()) LBUG
      Feb 23 09:22:49 iwc260 kernel: Pid: 90252, comm: ldlm_bl_36
      Feb 23 09:22:49 iwc260 kernel:
      Feb 23 09:22:49 iwc260 kernel: Call Trace:
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0435895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0435e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0b2ba56>] discard_cb+0x156/0x190 [osc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0b2bdcc>] osc_page_gang_lookup+0x1ac/0x330 [osc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0b2b900>] ? discard_cb+0x0/0x190 [osc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0b2c094>] osc_lock_discard_pages+0x144/0x240 [osc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa04461c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0b2b900>] ? discard_cb+0x0/0x190 [osc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0b2298b>] osc_lock_flush+0x8b/0x260 [osc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0b22e08>] osc_ldlm_blocking_ast+0x2a8/0x3c0 [osc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0761a6c>] ldlm_cancel_callback+0x6c/0x170 [ptlrpc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa077450a>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0779120>] ldlm_cli_cancel+0x60/0x360 [ptlrpc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa0b22c3b>] osc_ldlm_blocking_ast+0xdb/0x3c0 [osc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa04461c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa077cb60>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa077d0c1>] ldlm_bl_thread_main+0x291/0x3f0 [ptlrpc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      Feb 23 09:22:49 iwc260 kernel: [<ffffffffa077ce30>] ? ldlm_bl_thread_main+0x0/0x3f0 [ptlrpc]
      Feb 23 09:22:49 iwc260 kernel: [<ffffffff8109abf6>] kthread+0x96/0xa0
      Feb 23 09:22:49 iwc260 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
      Feb 23 09:22:49 iwc260 kernel: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      Feb 23 09:22:49 iwc260 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
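
      The failing check is the expansion of the libcfs ergo() implication macro, i.e. "if this page is CPT_CACHEABLE, it must not still be dirty by the time discard_cb() sees it." A minimal sketch of the invariant, assuming the standard ergo() definition from libcfs rather than quoting the exact osc_cache.c source:

      /* ergo(a, b) is libcfs shorthand for logical implication: a => b */
      #define ergo(a, b) (!(a) || (b))

      /* discard_cb() runs while the lock is being cancelled; every cacheable
       * page under the lock should already have been written back or
       * truncated, so a page that is still dirty here trips the LBUG. */
      LASSERT(ergo(page->cp_type == CPT_CACHEABLE,
                   !PageDirty(cl_page_vmpage(page))));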
      

      Attachments

        1. iwc189.lbug.log.txt.gz
          0.2 kB
        2. iwc260.log.gz
          4.81 MB
        3. l-23.LU-6271.txt.gz
          0.2 kB
        4. lustre-log.iwc37.txt
          329 kB
        5. r2i1n4.messages.gz
          96 kB
        6. server_messages.tar.gz
          924 kB


          Activity

            [LU-6271] (osc_cache.c:3150:discard_cb()) ASSERTION( (!(page->cp_type == CPT_CACHEABLE) || (!PageDirty(cl_page_vmpage(page)))) ) failed:
            jay Jinshan Xiong (Inactive) added a comment - edited

            Hi Dongyang, the problem may be fixed in patch set 7, but feel free to use patch 14915 on your own site.


            lidongyang Li Dongyang (Inactive) added a comment -

            With patch set 6 of 16456, the client won't crash with the gplck reproducer and a single eviction.
            However, the client still crashes if you:
            1. run the gplck reproducer and evict the client once;
            2. after the client reconnects, don't kill the gplck, but run another instance of gplck with a different gid;
            3. evict the client again; the client then crashes with the same stack trace: ASSERTION( !ext->oe_hp ) failed.

            This makes me think that the problem should be fixed in the ldlm layer, as in http://review.whamcloud.com/#/c/14915/, not in the osc layer.
            Thanks
            Dongyang
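
            For context, the gplck reproducer referenced above is essentially a writer that holds a Lustre group lock while it keeps dirtying pages. A minimal sketch along those lines, assuming the standard LL_IOC_GROUP_LOCK/LL_IOC_GROUP_UNLOCK ioctls from <lustre/lustre_user.h> (the actual gplck tool is not attached to this ticket, so the code below is illustrative, not the real reproducer):

            #include <fcntl.h>
            #include <stdio.h>
            #include <stdlib.h>
            #include <string.h>
            #include <sys/ioctl.h>
            #include <unistd.h>
            #include <lustre/lustre_user.h>

            int main(int argc, char **argv)
            {
            	/* group lock id; step 2 above runs a second copy with a different gid */
            	int gid = argc > 2 ? atoi(argv[2]) : 1;
            	static char buf[1048576];
            	int fd;

            	if (argc < 2) {
            		fprintf(stderr, "usage: %s <file> [gid]\n", argv[0]);
            		return 1;
            	}
            	fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
            	if (fd < 0 || ioctl(fd, LL_IOC_GROUP_LOCK, gid) < 0) {
            		perror("open/LL_IOC_GROUP_LOCK");
            		return 1;
            	}
            	memset(buf, 0, sizeof(buf));
            	/* keep dirtying pages under the group lock until evicted or killed */
            	while (write(fd, buf, sizeof(buf)) > 0)
            		;
            	ioctl(fd, LL_IOC_GROUP_UNLOCK, gid);
            	close(fd);
            	return 0;
            }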


            jay Jinshan Xiong (Inactive) added a comment -

            Thanks. Please check if the latest patch can fix the problem.


            lidongyang Li Dongyang (Inactive) added a comment -

            Hi Jinshan,
            See my comment earlier on 28/Jul/15 8:39 PM.


            jay Jinshan Xiong (Inactive) added a comment -

            Hi Dongyang, please upload your test cases.


            lidongyang Li Dongyang (Inactive) added a comment -

            OK, I've been running the scripts with patch set 5 of 16456 for some time, and general I/O seems fine.
            But when I tried the group lock, the client crashed immediately after eviction:

            <0>LustreError: 1697:0:(osc_cache.c:2907:osc_cache_writeback_range()) ASSERTION( !ext->oe_hp ) failed: 
            <0>LustreError: 1697:0:(osc_cache.c:2907:osc_cache_writeback_range()) LBUG
            <4>Pid: 1697, comm: ldlm_bl_11
            <4>
            <4>Call Trace:
            <4> [<ffffffffa021d875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            <4> [<ffffffffa021de77>] lbug_with_loc+0x47/0xb0 [libcfs]
            <4> [<ffffffffa0922085>] osc_cache_writeback_range+0x1275/0x1280 [osc]
            <4> [<ffffffffa090b545>] osc_lock_flush+0x175/0x260 [osc]
            <4> [<ffffffffa090b8d8>] osc_ldlm_blocking_ast+0x2a8/0x3c0 [osc]
            <4> [<ffffffffa050ba57>] ldlm_cancel_callback+0x87/0x280 [ptlrpc]
            <4> [<ffffffff81060530>] ? __dequeue_entity+0x30/0x50
            <4> [<ffffffff8100969d>] ? __switch_to+0x7d/0x340
            <4> [<ffffffffa051e84a>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
            <4> [<ffffffffa05234bc>] ldlm_cli_cancel+0x9c/0x3e0 [ptlrpc]
            <4> [<ffffffffa090b70b>] osc_ldlm_blocking_ast+0xdb/0x3c0 [osc]
            <4> [<ffffffff810672c2>] ? default_wake_function+0x12/0x20
            <4> [<ffffffffa0527400>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
            <4> [<ffffffffa0527ee4>] ldlm_bl_thread_main+0x484/0x700 [ptlrpc]
            <4> [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
            <4> [<ffffffffa0527a60>] ? ldlm_bl_thread_main+0x0/0x700 [ptlrpc]
            <4> [<ffffffff810a101e>] kthread+0x9e/0xc0
            <4> [<ffffffff8100c28a>] child_rip+0xa/0x20
            <4> [<ffffffff810a0f80>] ? kthread+0x0/0xc0
            <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
            <4>
            <0>Kernel panic - not syncing: LBUG
            <4>Pid: 1697, comm: ldlm_bl_11 Not tainted 2.6.32-573.3.1.el6.x86_64 #1
            <4>Call Trace:
            <4> [<ffffffff81537c54>] ? panic+0xa7/0x16f
            <4> [<ffffffffa021decb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
            <4> [<ffffffffa0922085>] ? osc_cache_writeback_range+0x1275/0x1280 [osc]
            <4> [<ffffffffa090b545>] ? osc_lock_flush+0x175/0x260 [osc]
            <4> [<ffffffffa090b8d8>] ? osc_ldlm_blocking_ast+0x2a8/0x3c0 [osc]
            <4> [<ffffffffa050ba57>] ? ldlm_cancel_callback+0x87/0x280 [ptlrpc]
            <4> [<ffffffff81060530>] ? __dequeue_entity+0x30/0x50
            <4> [<ffffffff8100969d>] ? __switch_to+0x7d/0x340
            <4> [<ffffffffa051e84a>] ? ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
            <4> [<ffffffffa05234bc>] ? ldlm_cli_cancel+0x9c/0x3e0 [ptlrpc]
            <4> [<ffffffffa090b70b>] ? osc_ldlm_blocking_ast+0xdb/0x3c0 [osc]
            <4> [<ffffffff810672c2>] ? default_wake_function+0x12/0x20
            <4> [<ffffffffa0527400>] ? ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
            <4> [<ffffffffa0527ee4>] ? ldlm_bl_thread_main+0x484/0x700 [ptlrpc]
            <4> [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
            <4> [<ffffffffa0527a60>] ? ldlm_bl_thread_main+0x0/0x700 [ptlrpc]
            <4> [<ffffffff810a101e>] ? kthread+0x9e/0xc0
            <4> [<ffffffff8100c28a>] ? child_rip+0xa/0x20
            <4> [<ffffffff810a0f80>] ? kthread+0x0/0xc0
            <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
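
            For reference, the oe_hp flag named in the assertion above is a one-bit field on the osc extent structure; a sketch of the relevant field, abridged from struct osc_extent in the osc layer (other fields omitted, layout illustrative):

            struct osc_extent {
            	/* ... other fields omitted ... */
            	unsigned int oe_hp:1;	/* extent is queued for high-priority
            				 * writeback, e.g. on behalf of a
            				 * blocking lock callback */
            };

            The assertion says an extent must no longer be marked high-priority by the time osc_cache_writeback_range() tears it down during lock cancellation.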
            

            jay Jinshan Xiong (Inactive) added a comment -

            Please check the new patch and see if it can fix the problem.


            jay Jinshan Xiong (Inactive) added a comment -

            I did a very similar thing with iozone, but the processes wrote to different files; let me try your reproducer.

            lidongyang Li Dongyang (Inactive) added a comment - edited

            Hi Jinshan,
            Hardware shouldn't matter, as I'm reproducing this on my VMs.
            Here are the scripts I used.
            On the client, I'm using IOR; when an IOR process dies because of the eviction, the script just brings up new ones:

            #!/bin/sh
            # Keep 20 IOR writers running; respawn them as they die from evictions.
            while true
            do
            	# grep '[I]OR' avoids counting the grep process itself
            	num=$(ps aux | grep -c '[I]OR')
            	while [ "$num" -lt 20 ]
            	do
            		/root/IOR/src/C/IOR -b 8g -w -e -E -t 1m -v -k -o /mnt/testfile &
            		num=$(ps aux | grep -c '[I]OR')
            	done
            	sleep 1
            done
            

            This keeps 20 of them running at the same time.

            On the OSS:

            #!/bin/sh
            # Evict the client from every OST, every 10 seconds.
            while true
            do
            	for ost in /proc/fs/lustre/obdfilter/*
            	do
            		# <your client nid> is a placeholder for the client's NID
            		uuid=$(cat "$ost/exports/<your client nid>/uuid" 2>/dev/null)
            		# quoting matters: [ -n $unquoted ] is always true when empty
            		if [ -n "$uuid" ]
            		then
            			echo "$uuid" > "$ost/evict_client"
            		fi
            	done
            	sleep 10
            done
            

            As you can see, I'm evicting the client over and over, using 10 seconds as the interval; again, you can change this.

            Thanks
            Dongyang


            jay Jinshan Xiong (Inactive) added a comment -

            Hi Dongyang,

            It certainly lasted longer on my VM node.

            Please upload your reproducer and I will restart working on this issue once I get access to real hardware.

            Thanks,


            lidongyang Li Dongyang (Inactive) added a comment -

            Hi Jinshan,
            I'm using master, which has 14989 landed already, plus patch set 3 of 16456.
            The client crashed 5 minutes after I started the reproducer:
            Sep 18 13:29:00 client kernel: LustreError: 16824:0:(ldlm_resource.c:887:ldlm_resource_complain()) testfs-OST0000-osc-ffff8800375c4000: namespace resource [0x303:0x0:0x0].0 (ffff88003d2abcc0) refcount nonzero (1) after lock cleanup; forcing cleanup.
            Sep 18 13:29:00 client kernel: LustreError: 16824:0:(ldlm_resource.c:1502:ldlm_resource_dump()) --- Resource: [0x303:0x0:0x0].0 (ffff88003d2abcc0) refcount = 3
            Sep 18 13:29:00 client kernel: LustreError: 16824:0:(ldlm_resource.c:1505:ldlm_resource_dump()) Granted locks (in reverse order):
            Sep 18 13:29:00 client kernel: LustreError: 16824:0:(ldlm_resource.c:1508:ldlm_resource_dump()) ### ### ns: testfs-OST0000-osc-ffff8800375c4000 lock: ffff88000b69d380/0x2104da78ff029aa9 lrc: 8/0,1 mode: PW/PW res: [0x303:0x0:0x0].0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x126400000000 nid: local remote: 0x59a587f3f1d1d697 expref: -99 pid: 16588 timeout: 0 lvb_type: 1
            Sep 18 13:29:00 client kernel: LustreError: 16824:0:(ldlm_resource.c:1502:ldlm_resource_dump()) --- Resource: [0x303:0x0:0x0].0 (ffff88003d2abcc0) refcount = 2
            Sep 18 13:29:00 client kernel: LustreError: 16824:0:(ldlm_resource.c:1505:ldlm_resource_dump()) Granted locks (in reverse order):
            Sep 18 13:29:01 client kernel: LustreError: 13392:0:(osc_cache.c:3134:discard_cb()) page@ffff88003d5a7600[3 ffff88003a513b38 1 0 1 ffff88003a5d6b78]
            Sep 18 13:29:01 client kernel:
            Sep 18 13:29:01 client kernel: LustreError: 13392:0:(osc_cache.c:3134:discard_cb()) vvp-page@ffff88003d5a7650(0:0) vm@ffffea00004439e0 2000000000087d 3:0 ffff88003d5a7600 25088 lru
            Sep 18 13:29:01 client kernel:
            Sep 18 13:29:01 client kernel: LustreError: 13392:0:(osc_cache.c:3134:discard_cb()) lov-page@ffff88003d5a7690, raid0
            Sep 18 13:29:01 client kernel:
            Sep 18 13:29:01 client kernel: LustreError: 13392:0:(osc_cache.c:3134:discard_cb()) osc-page@ffff88003d5a76f8 25088: 1< 0x845fed 258 0 + - > 2< 102760448 0 4096 0x0 0x520 | (null) ffff88003c479500 ffff88003a517e20 > 3< 1 0 0 > 4< 0 9 8 18446744073642434560 + | + - + - > 5< + - + - | 0 - | 5635 - ->
            Sep 18 13:29:01 client kernel:
            Sep 18 13:29:01 client kernel: LustreError: 13392:0:(osc_cache.c:3134:discard_cb()) end page@ffff88003d5a7600
            Sep 18 13:29:01 client kernel:
            Sep 18 13:29:01 client kernel: LustreError: 13392:0:(osc_cache.c:3134:discard_cb()) discard dirty page?
            Sep 18 13:29:01 client kernel: LustreError: 13392:0:(osc_cache.c:2447:osc_teardown_async_page()) extent ffff880039a18720@{[25088 -> 25341/25343], [2|0|+|locking|wi|ffff88003a517e20], [1040384|254|+|-|ffff88000b69d580|256|ffff8800372b3520]} trunc at 25088.
            Sep 18 13:29:01 client kernel: LustreError: 13392:0:(osc_cache.c:2447:osc_teardown_async_page()) ### extent: ffff880039a18720
            Sep 18 13:29:01 client kernel: ns: testfs-OST0000-osc-ffff8800375c4000 lock: ffff88000b69d580/0x2104da78ff029c0e lrc: 41/0,3 mode: PW/PW res: [0x303:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 102760448->103809023) flags: 0x20000000000 nid: local remote: 0x59a587f3f1d1d6ac expref: -99 pid: 16683 timeout: 0 lvb_type: 1
            Sep 18 13:29:01 client kernel: LustreError: 13392:0:(osc_page.c:307:osc_page_delete()) page@ffff88003d5a7600[3 ffff88003a513b38 4 0 1 (null)]
            Sep 18 13:29:01 client kernel:
            Sep 18 13:29:01 client kernel: LustreError: 13392:0:(osc_page.c:307:osc_page_delete()) vvp-page@ffff88003d5a7650(0:0) vm@ffffea00004439e0 2000000000087d 3:0 ffff88003d5a7600 25088 lru
            Sep 18 13:29:01 client kernel:
            Sep 18 13:29:01 client kernel: LustreError: 13392:0:(osc_page.c:307:osc_page_delete()) lov-page@ffff88003d5a7690, raid0
            Sep 18 13:29:01 client kernel:
            Sep 18 13:29:01 client kernel: LustreError: 13392:0:(osc_page.c:307:osc_page_delete()) osc-page@ffff88003d5a76f8 25088: 1< 0x845fed 258 0 + - > 2< 102760448 0 4096 0x0 0x520 | (null) ffff88003c479500 ffff88003a517e20 > 3< 0 0 0 > 4< 0 9 8 18446744073642434560 + | + - + - > 5< + - + - | 0 - | 5635 - ->
            Sep 18 13:29:01 client kernel:
            Sep 18 13:29:01 client kernel: LustreError: 13392:0:(osc_page.c:307:osc_page_delete()) end page@ffff88003d5a7600
            Sep 18 13:29:01 client kernel:
            Sep 18 13:29:01 client kernel: LustreError: 13392:0:(osc_page.c:307:osc_page_delete()) Trying to teardown failed: -16
            Sep 18 13:29:01 client kernel: LustreError: 13392:0:(osc_page.c:308:osc_page_delete()) ASSERTION( 0 ) failed:

            Message from syslogd@client at Sep 18 13:29:01 ...
            kernel:LustreError: 13392:0:(osc_page.c:308:osc_page_delete()) ASSERTION( 0 ) failed:
            Sep 18 13:29:01 client kernel: LustreError: 13392:0:(osc_page.c:308:osc_page_delete()) LBUG
            Sep 18 13:29:01 client kernel: Pid: 13392, comm: ldlm_bl_00
            Sep 18 13:29:01 client kernel:
            Sep 18 13:29:01 client kernel: Call Trace:
            Sep 18 13:29:01 client kernel: [<ffffffffa10ad875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            Sep 18 13:29:01 client kernel: [<ffffffffa10ade77>] lbug_with_loc+0x47/0xb0 [libcfs]
            Sep 18 13:29:01 client kernel: [<ffffffffa178bf9e>] osc_page_delete+0x46e/0x4e0 [osc]
            Sep 18 13:29:01 client kernel: [<ffffffffa12019cd>] cl_page_delete0+0x7d/0x210 [obdclass]
            Sep 18 13:29:01 client kernel: [<ffffffffa1201b9d>] cl_page_delete+0x3d/0x110 [obdclass]
            Sep 18 13:29:01 client kernel: [<ffffffffa16b6d2d>] ll_invalidatepage+0x8d/0x160 [lustre]
            Sep 18 13:29:01 client kernel: [<ffffffffa16c5d85>] vvp_page_discard+0xc5/0x160 [lustre]
            Sep 18 13:29:01 client kernel: [<ffffffffa11fffc8>] cl_page_invoid+0x68/0x160 [obdclass]
            Sep 18 13:29:01 client kernel: [<ffffffffa12000d3>] cl_page_discard+0x13/0x20 [obdclass]
            Sep 18 13:29:01 client kernel: [<ffffffffa1797158>] discard_cb+0x88/0x1e0 [osc]
            Sep 18 13:29:01 client kernel: [<ffffffffa1796f4e>] osc_page_gang_lookup+0x1ae/0x330 [osc]
            Sep 18 13:29:01 client kernel: [<ffffffffa17970d0>] ? discard_cb+0x0/0x1e0 [osc]
            Sep 18 13:29:01 client kernel: [<ffffffffa17973f4>] osc_lock_discard_pages+0x144/0x240 [osc]
            Sep 18 13:29:01 client kernel: [<ffffffffa17970d0>] ? discard_cb+0x0/0x1e0 [osc]
            Sep 18 13:29:01 client kernel: [<ffffffffa178e45b>] osc_lock_flush+0x8b/0x260 [osc]
            Sep 18 13:29:01 client kernel: [<ffffffffa178e8d8>] osc_ldlm_blocking_ast+0x2a8/0x3c0 [osc]
            Sep 18 13:29:01 client kernel: [<ffffffffa138ea57>] ldlm_cancel_callback+0x87/0x280 [ptlrpc]
            Sep 18 13:29:01 client kernel: [<ffffffff81060530>] ? __dequeue_entity+0x30/0x50
            Sep 18 13:29:01 client kernel: [<ffffffff8100969d>] ? __switch_to+0x7d/0x340
            Sep 18 13:29:01 client kernel: [<ffffffffa13a184a>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
            Sep 18 13:29:01 client kernel: [<ffffffffa13a6480>] ldlm_cli_cancel+0x60/0x360 [ptlrpc]
            Sep 18 13:29:01 client kernel: [<ffffffffa178e70b>] osc_ldlm_blocking_ast+0xdb/0x3c0 [osc]
            Sep 18 13:29:01 client kernel: [<ffffffffa13aa380>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
            Sep 18 13:29:01 client kernel: [<ffffffffa13aae64>] ldlm_bl_thread_main+0x484/0x700 [ptlrpc]
            Sep 18 13:29:01 client kernel: [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
            Sep 18 13:29:01 client kernel: [<ffffffffa13aa9e0>] ? ldlm_bl_thread_main+0x0/0x700 [ptlrpc]
            Sep 18 13:29:01 client kernel: [<ffffffff810a101e>] kthread+0x9e/0xc0
            Sep 18 13:29:01 client kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20


            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 25

              Dates

                Created:
                Updated:
                Resolved: