Lustre / LU-2652

lov_io.c:222:lov_sub_get()) ASSERTION( stripe < lio->lis_stripe_count ) failed

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Affects Version/s: Lustre 2.4.0
    • Fix Version/s: Lustre 2.4.0
    • 3
    • 6194

    Description

      Had this crash running sanityn test 12:

      [45366.990436] Lustre: DEBUG MARKER: == sanityn test 12: test lock ordering (link, stat, unlink) ============= 10:01:13 (1358607673)
      [45367.146837] Lustre: DEBUG MARKER: start dir: /mnt/lustre/lockdir=144115205255725085 file: /mnt/lustre/lockdir/lockfile=144115205255725083
      [45445.727378] LustreError: 20512:0:(lov_io.c:222:lov_sub_get()) ASSERTION( stripe < lio->lis_stripe_count ) failed: 
      [45445.728276] LustreError: 20512:0:(lov_io.c:222:lov_sub_get()) LBUG
      [45445.728751] Pid: 20512, comm: statmany
      [45445.729089] 
      [45445.729090] Call Trace:
      [45445.729696]  [<ffffffffa0a7a915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [45445.730184]  [<ffffffffa0a7af17>] lbug_with_loc+0x47/0xb0 [libcfs]
      [45445.730673]  [<ffffffffa14041ba>] lov_sub_get+0x4aa/0x690 [lov]
      [45445.731129]  [<ffffffffa1400132>] lov_sublock_env_get+0xd2/0x140 [lov]
      [45445.731612]  [<ffffffffa1401841>] lov_sublock_alloc+0xf1/0x450 [lov]
      [45445.732086]  [<ffffffffa1402a2c>] lov_lock_init_raid0+0x3ec/0xe50 [lov]
      [45445.732632]  [<ffffffffa13fa2ae>] lov_lock_init+0x1e/0x60 [lov]
      [45445.733135]  [<ffffffffa0bec52c>] cl_lock_hold_mutex+0x34c/0x660 [obdclass]
      [45445.733652]  [<ffffffffa0bec9a2>] cl_lock_request+0x62/0x270 [obdclass]
      [45445.734143]  [<ffffffffa090b1f9>] cl_glimpse_lock+0x179/0x480 [lustre]
      [45445.734635]  [<ffffffffa090ba65>] cl_glimpse_size0+0x1a5/0x1d0 [lustre]
      [45445.735142]  [<ffffffffa08c5298>] ll_inode_revalidate_it+0x198/0x1c0 [lustre]
      [45445.735748]  [<ffffffffa08c5309>] ll_getattr_it+0x49/0x170 [lustre]
      [45445.736230]  [<ffffffffa08c5467>] ll_getattr+0x37/0x40 [lustre]
      [45445.736719]  [<ffffffff81214fa3>] ? security_inode_getattr+0x23/0x30
      [45445.737183]  [<ffffffff81180891>] vfs_getattr+0x51/0x80
      [45445.737613]  [<ffffffff810385d8>] ? pvclock_clocksource_read+0x58/0xd0
      [45445.738070]  [<ffffffff81180920>] vfs_fstatat+0x60/0x80
      [45445.738508]  [<ffffffff81180a6b>] vfs_stat+0x1b/0x20
      [45445.738919]  [<ffffffff81180a94>] sys_newstat+0x24/0x50
      [45445.739332]  [<ffffffff8109b05a>] ? do_gettimeofday+0x1a/0x50
      [45445.740967]  [<ffffffff8100c5b5>] ? math_state_restore+0x45/0x60
      [45445.741415]  [<ffffffff814fb72e>] ? do_device_not_available+0xe/0x10
      [45445.741875]  [<ffffffff8100befb>] ? device_not_available+0x1b/0x20
      [45445.742317]  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
      [45445.742763] 
      [45445.744348] Kernel panic - not syncing: LBUG
      

      Crashdump is in /exports/crashdumps/192.168.10.217-2013-01-19-10\:02\:36
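      For context, the assertion fires in lov_sub_get() when an IO's stripe index no longer fits the stripe count the IO holds, which is what a concurrent layout change can produce. The following standalone C program is a minimal conceptual sketch of that check; the struct names and fields here are illustrative stand-ins, not Lustre's actual types:

      ```c
      #include <assert.h>
      #include <stdio.h>

      /* Illustrative stand-ins for the lov_io / layout state; these are
       * not Lustre's real structures. */
      struct layout { int stripe_count; };
      struct lov_io { int lis_stripe_count; };  /* snapshot taken at IO setup */

      static void io_init(struct lov_io *lio, const struct layout *lo)
      {
              lio->lis_stripe_count = lo->stripe_count;
      }

      /* Mimics the LASSERT(stripe < lio->lis_stripe_count) in lov_sub_get(). */
      static int sub_get_ok(const struct lov_io *lio, int stripe)
      {
              return stripe < lio->lis_stripe_count;
      }

      int main(void)
      {
              struct layout lo = { .stripe_count = 4 };
              struct lov_io lio;

              io_init(&lio, &lo);
              assert(sub_get_ok(&lio, 3));  /* fine: stripe 3 of 4 */

              /* If a concurrent layout change shrinks the stripe count while
               * a stripe index computed against the old layout is still in
               * flight, the check fails; in the kernel this is the LBUG
               * reported in this ticket. */
              lio.lis_stripe_count = 1;
              if (!sub_get_ok(&lio, 3))
                      printf("stale stripe index 3 >= stripe count 1: would LBUG\n");
              return 0;
      }
      ```

      In the real code the assertion panics the node (LBUG), so the stale-index case above is the one that must never be reachable.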

      Attachments

      Issue Links

      Activity

            [LU-2652] lov_io.c:222:lov_sub_get()) ASSERTION( stripe < lio->lis_stripe_count ) failed
            bfaccini Bruno Faccini (Inactive) added a comment - - edited

            TGCC site hits the same issue/LBUG running an MPI-IO app on 2.1.3 (LU-3524), i.e. the original/title "(lov_io.c:212:lov_sub_get()) ASSERTION( stripe < lio->lis_stripe_count ) failed" LBUG for this ticket. Based on this, can we be certain that this particular problem is layout-related?

            jay Jinshan Xiong (Inactive) added a comment -

            Duplicate of LU-2766, the same issue with a different symptom.
            jhammond John Hammond added a comment -

            Unfortunately this is still present in 2.4. Oleg's description matches up very well with what I'm seeing on 2.4 and 2.5, so reopening seems best. Based on stack traces and the shared reproducer, I assume this is really the same issue as LU-2766.


            adilger Andreas Dilger added a comment -

            Do you think this is the same bug as was previously being hit here, or a different bug with the same symptom (this wasn't hit with racer, and was apparently fixed by landing the patch)? Reopening this bug (which is marked a blocker) that was fixed and landed for 2.4.0 complicates the tracking process.

            It might be better to close this one again and open a separate bug related to racer + layout swap. While the symptom (LASSERT()) is serious, the stress of swapping the layout of two files in a loop is very unlikely to be seen in real life, so I won't rank this as a blocker.
            jhammond John Hammond added a comment -

            Saw this running racer today on 2.4.51-3-g9f5eea8. Reproduces much more readily if you do:

            # llmount.sh
            # cd /mnt/lustre
            # touch 0 1
            # while true; do
                lfs swap_layouts $((RANDOM % 2)) $((RANDOM % 2))
            done &
            # while true; do
                lfs swap_layouts $((RANDOM % 2)) $((RANDOM % 2))
            done &
            

            The same is true of LU-2766.
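            Collected into one script, the reproducer above might look like the following sketch; the mount point, file names, and iteration bound are illustrative, and it assumes a Lustre client already set up (e.g. via llmount.sh):

            ```shell
            #!/bin/bash
            # Sketch of the swap_layouts stress reproducer; assumes a Lustre
            # client mounted at /mnt/lustre (loop bound is illustrative).
            if command -v lfs >/dev/null 2>&1; then
                cd /mnt/lustre || exit 1
                touch 0 1
                swap_loop() {
                    local i=0
                    while [ "$i" -lt 10000 ]; do
                        # Each argument randomly picks one of the two files
                        # (possibly the same one), matching the transcript,
                        # so the two background loops race on the layouts.
                        lfs swap_layouts $((RANDOM % 2)) $((RANDOM % 2))
                        i=$((i + 1))
                    done
                }
                swap_loop &
                swap_loop &
                wait
            else
                echo "lfs not found; run this on a Lustre client"
            fi
            ```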

            green Oleg Drokin added a comment -

            Patches landed, issue no longer visible

            jay Jinshan Xiong (Inactive) added a comment - - edited

            Patches are at: http://review.whamcloud.com/5157 , http://review.whamcloud.com/5158 , http://review.whamcloud.com/5159 . Oleg is verifying them.

            jay Jinshan Xiong (Inactive) added a comment -

            I've found the root cause of the problem; I need time to cook a patch.

            jay Jinshan Xiong (Inactive) added a comment -
            [  756.853670] LustreError: 25456:0:(lov_object.c:595:lov_layout_change()) ASSERTION( list_empty(&hdr->coh_locks) ) failed: 
            [  756.854210] LustreError: 25456:0:(lov_object.c:595:lov_layout_change()) LBUG
            [  756.854498] Pid: 25456, comm: ll_sa_25403
            [  756.854724] 
            [  756.854725] Call Trace:
            [  756.855090]  [<ffffffffa0449915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            [  756.855355]  [<ffffffffa0449f17>] lbug_with_loc+0x47/0xb0 [libcfs]
            [  756.855617]  [<ffffffffa0a711af>] lov_layout_change+0x2df/0x350 [lov]
            [  756.855874]  [<ffffffffa0a714b7>] lov_conf_set+0x297/0x6a0 [lov]
            [  756.856124]  [<ffffffffa045a401>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
            [  756.856384]  [<ffffffffa04629eb>] ? cfs_hash_add_unique+0x1b/0x40 [libcfs]
            [  756.856807]  [<ffffffffa061ad08>] cl_conf_set+0x58/0x100 [obdclass]
            [  756.857073]  [<ffffffffa0dd71c8>] ll_layout_conf+0x78/0x250 [lustre]
            [  756.857334]  [<ffffffffa0de62ea>] ll_layout_lock_set+0x40a/0xd30 [lustre]
            [  756.857599]  [<ffffffffa0947320>] ? lmv_set_lock_data+0x100/0x480 [lmv]
            [  756.857862]  [<ffffffffa0de73ef>] ll_layout_refresh+0x7df/0xd80 [lustre]
            [  756.858128]  [<ffffffffa0e0a560>] ? ll_md_blocking_ast+0x0/0x720 [lustre]
            [  756.858411]  [<ffffffffa0765370>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc]
            [  756.858685]  [<ffffffffa0e2d2ed>] vvp_io_fini+0xfd/0x1b0 [lustre]
            [  756.858939]  [<ffffffffa0a7b7ad>] ? lov_io_fini+0x16d/0x3d0 [lov]
            [  756.859202]  [<ffffffffa062c807>] cl_io_fini+0x77/0x260 [obdclass]
            [  756.859460]  [<ffffffffa0e24985>] cl_glimpse_size0+0xc5/0x1d0 [lustre]
            [  756.859725]  [<ffffffffa0e1cde2>] ll_agl_trigger+0xd2/0x350 [lustre]
            [  756.859983]  [<ffffffffa0e233a7>] ll_statahead_thread+0x2d7/0xf40 [lustre]
            [  756.860245]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
            [  756.860501]  [<ffffffffa0e230d0>] ? ll_statahead_thread+0x0/0xf40 [lustre]
            [  756.860776]  [<ffffffff8100c14a>] child_rip+0xa/0x20
            [  756.861012]  [<ffffffffa0e230d0>] ? ll_statahead_thread+0x0/0xf40 [lustre]
            [  756.861279]  [<ffffffffa0e230d0>] ? ll_statahead_thread+0x0/0xf40 [lustre]
            [  756.861537]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
            [  756.861769] 
            [  756.885413] Kernel panic - not syncing: LBUG
            
            green Oleg Drokin added a comment -

            Sometimes, instead of the crash, I am getting this sort of locked-up thread:

            .111182] Lustre: DEBUG MARKER: == sanityn test 12: test lock ordering (link, sta
            .378607] Lustre: DEBUG MARKER: start dir: /mnt/lustre/lockdir=144115205255725085
            .812526] LNet: Service thread pid 12451 was inactive for 40.00s. The thread migh
            .813279] Pid: 12451, comm: mdt00_001
            .813563] 
            .813564] Call Trace:
            .813979]  [<ffffffff814fb054>] ? _spin_lock_irqsave+0x24/0x30
            .814234]  [<ffffffff814f8ad1>] schedule_timeout+0x191/0x2e0
            .814507]  [<ffffffff8107bcd0>] ? process_timeout+0x0/0x10
            .814762]  [<ffffffffa0e00771>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
            .815052]  [<ffffffffa053184d>] ldlm_completion_ast+0x4dd/0x950 [ptlrpc]
            .815338]  [<ffffffffa052d360>] ? ldlm_expired_completion_wait+0x0/0x260 [ptlrpc]
            .815791]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
            .816062]  [<ffffffffa0530f98>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
            .816351]  [<ffffffffa0531370>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc]
            .816692]  [<ffffffffa09dc830>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            .816980]  [<ffffffffa09dedc0>] mdt_object_lock0+0x360/0xb10 [mdt]
            .817278]  [<ffffffffa09dc830>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            .817584]  [<ffffffffa0531370>] ? ldlm_completion_ast+0x0/0x950 [ptlrpc]
            .817878]  [<ffffffffa09df614>] mdt_object_lock+0x14/0x20 [mdt]
            .818158]  [<ffffffffa09fef3a>] mdt_reint_link+0x3fa/0x910 [mdt]
            .818459]  [<ffffffffa0559c5e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
            .818918]  [<ffffffffa09faa91>] mdt_reint_rec+0x41/0xe0 [mdt]
            .819194]  [<ffffffffa09f4133>] mdt_reint_internal+0x4f3/0x7e0 [mdt]
            .819495]  [<ffffffffa09f4464>] mdt_reint+0x44/0xe0 [mdt]
            .819764]  [<ffffffffa09e4782>] mdt_handle_common+0x8e2/0x1680 [mdt]
            .820052]  [<ffffffffa0a19ed5>] mds_regular_handle+0x15/0x20 [mdt]
            .820352]  [<ffffffffa0568483>] ptlrpc_server_handle_request+0x463/0xe60 [ptlrpc]
            .820872]  [<ffffffffa0e0065e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
            .821172]  [<ffffffffa0561221>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
            .821474]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
            .821758]  [<ffffffffa056af8d>] ptlrpc_main+0xb3d/0x18c0 [ptlrpc]
            .822055]  [<ffffffffa056a450>] ? ptlrpc_main+0x0/0x18c0 [ptlrpc]
            .822333]  [<ffffffff8100c14a>] child_rip+0xa/0x20
            .822608]  [<ffffffffa056a450>] ? ptlrpc_main+0x0/0x18c0 [ptlrpc]
            .822904]  [<ffffffffa056a450>] ? ptlrpc_main+0x0/0x18c0 [ptlrpc]
            .823181]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
            .823438] 
            .. (this thread actually finished after 100 seconds) ...
            .864250] INFO: task statmany:14503 blocked for more than 120 seconds.
            .865404] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this messag
            .867239] statmany      D 0000000000000000  2608 14503  14484 0x00000000
            .868449]  ffff880055affb10 0000000000000086 ffff880055affad8 ffff880055affad4
            .869752]  ffff880055affaf8 ffff8800bcc23080 ffff8800063167c0 0000000000000000
            .870198]  ffff8800aa03e8b8 ffff880055afffd8 000000000000fba8 ffff8800aa03e8b8
            .870647] Call Trace:
            .870842]  [<ffffffff814fabbd>] rwsem_down_failed_common+0x8d/0x1d0
            .871102]  [<ffffffff81054aaa>] ? enqueue_task_fair+0x14a/0x4e0
            .871356]  [<ffffffff81096d6a>] ? sched_clock_cpu+0x6a/0x110
            .871610]  [<ffffffff814fad56>] rwsem_down_read_failed+0x26/0x30
            .871867]  [<ffffffff8127c104>] call_rwsem_down_read_failed+0x14/0x30
            .872156]  [<ffffffff814f9ec7>] ? down_read+0x37/0x40
            .872474]  [<ffffffffa0717cba>] lov_lsm_addref+0x2a/0x130 [lov]
            .872796]  [<ffffffffa0717dea>] lov_lsm_get+0x2a/0x40 [lov]
            .873063]  [<ffffffffa0b28c95>] ccc_inode_lsm_get+0x15/0x20 [lustre]
            .873332]  [<ffffffffa0addb83>] ll_merge_lvb+0x53/0x5e0 [lustre]
            .873600]  [<ffffffff814f919e>] ? mutex_unlock+0xe/0x10
            .873863]  [<ffffffffa0f6dd0a>] ? cl_lock_mutex_put+0x7a/0x90 [obdclass]
            .874149]  [<ffffffffa0b283a8>] cl_glimpse_lock+0x328/0x480 [lustre]
            .874445]  [<ffffffffa0b28a65>] cl_glimpse_size0+0x1a5/0x1d0 [lustre]
            .874735]  [<ffffffffa0ae2298>] ll_inode_revalidate_it+0x198/0x1c0 [lustre]
            .875036]  [<ffffffffa0ae2309>] ll_getattr_it+0x49/0x170 [lustre]
            .875322]  [<ffffffffa0ae2467>] ll_getattr+0x37/0x40 [lustre]
            .875598]  [<ffffffff81214fa3>] ? security_inode_getattr+0x23/0x30
            .875875]  [<ffffffff81180891>] vfs_getattr+0x51/0x80
            .876154]  [<ffffffff810385d8>] ? pvclock_clocksource_read+0x58/0xd0
            .876439]  [<ffffffff81180920>] vfs_fstatat+0x60/0x80
            .876696]  [<ffffffff81180a6b>] vfs_stat+0x1b/0x20
            .876947]  [<ffffffff81180a94>] sys_newstat+0x24/0x50
            .877203]  [<ffffffff8109b05a>] ? do_gettimeofday+0x1a/0x50
            .877473]  [<ffffffff8100c5b5>] ? math_state_restore+0x45/0x60
            .877751]  [<ffffffff814fb72e>] ? do_device_not_available+0xe/0x10
            .878001]  [<ffffffff8100befb>] ? device_not_available+0x1b/0x20
            .878277]  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
            .878552] INFO: task ll_imp_inval:14515 blocked for more than 120 seconds.
            .878838] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this messag
            .879297] ll_imp_inval  D 0000000000000003  4640 14515      2 0x00000000
            .879607]  ffff8800b5f31980 0000000000000046 0000000000000000 ffffffff810385d8
            .880107]  ffff8800ffffffff 00075e5906b4fe40 ffff8800ffffffff ffff8800b5f31980
            .880590]  ffff8800ac106af8 ffff8800b5f31fd8 000000000000fba8 ffff8800ac106af8
            .881064] Call Trace:
            .881268]  [<ffffffff810385d8>] ? pvclock_clocksource_read+0x58/0xd0
            .881548]  [<ffffffff814fabbd>] rwsem_down_failed_common+0x8d/0x1d0
            .881807]  [<ffffffff814fad23>] rwsem_down_write_failed+0x23/0x30
            .882054]  [<ffffffff8127c133>] call_rwsem_down_write_failed+0x13/0x20
            .882339]  [<ffffffff814f9e85>] ? down_write+0x45/0x50
            .882605]  [<ffffffffa0716e44>] lov_conf_lock+0x34/0xc0 [lov]
            .882877]  [<ffffffffa0717268>] lov_conf_set+0x48/0x6a0 [lov]
            .883163]  [<ffffffffa0f65aad>] ? cl_env_nested_get+0x5d/0xc0 [obdclass]
            .883465]  [<ffffffffa0e189eb>] ? cfs_hash_add_unique+0x1b/0x40 [libcfs]
            .883771]  [<ffffffffa0f63d08>] cl_conf_set+0x58/0x100 [obdclass]
            .884076]  [<ffffffffa0adb1c8>] ll_layout_conf+0x78/0x250 [lustre]
            .884365]  [<ffffffffa0b0e947>] ll_md_blocking_ast+0x3e7/0x720 [lustre]
            .884655]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
            .884936]  [<ffffffffa051106c>] ldlm_cancel_callback+0x6c/0x1a0 [ptlrpc]
            .885243]  [<ffffffffa052afda>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
            .885561]  [<ffffffffa052f8cb>] ldlm_cli_cancel+0x5b/0x360 [ptlrpc]
            .885848]  [<ffffffffa0519c1d>] cleanup_resource+0x12d/0x300 [ptlrpc]
            .886128]  [<ffffffffa0519e1f>] ldlm_resource_clean+0x2f/0x60 [ptlrpc]
            .886993]  [<ffffffffa0e167af>] cfs_hash_for_each_relax+0x17f/0x360 [libcfs]
            .887434]  [<ffffffffa0519df0>] ? ldlm_resource_clean+0x0/0x60 [ptlrpc]
            .887717]  [<ffffffffa0519df0>] ? ldlm_resource_clean+0x0/0x60 [ptlrpc]
            .887997]  [<ffffffffa0e1817f>] cfs_hash_for_each_nolock+0x7f/0x1c0 [libcfs]
            .888463]  [<ffffffffa051685e>] ldlm_namespace_cleanup+0x2e/0xc0 [ptlrpc]
            .888800]  [<ffffffffa04bbc81>] mdc_import_event+0x1c1/0xa30 [mdc]
            .889143]  [<ffffffffa057993f>] ptlrpc_invalidate_import+0x29f/0x8c0 [ptlrpc]
            .889596]  [<ffffffffa0e10401>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
            .889870]  [<ffffffffa0579f60>] ? ptlrpc_invalidate_import_thread+0x0/0x2e0 [ptlr
            .890341]  [<ffffffffa0579faf>] ptlrpc_invalidate_import_thread+0x4f/0x2e0 [ptlrp
            .890807]  [<ffffffff8100c14a>] child_rip+0xa/0x20
            .891077]  [<ffffffffa0579f60>] ? ptlrpc_invalidate_import_thread+0x0/0x2e0 [ptlr
            .892918]  [<ffffffffa0579f60>] ? ptlrpc_invalidate_import_thread+0x0/0x2e0 [ptlr
            .893398]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
            (these two threads never complete)
            

            People

              jay Jinshan Xiong (Inactive)
              green Oleg Drokin
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved: