[LU-10481] parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client: Hung Created: 10/Jan/18  Updated: 24/Jul/18  Resolved: 24/Jul/18

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Saurabh Tandan (Inactive) Assignee: Saurabh Tandan (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Data on MDT enabled
2.10.56 both clients and servers
2 OSTs 1 mdt
2 clients
Onyx


Issue Links:
Related
is related to LU-10254 version_code() error in racer/file_cr... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Result at - https://testing.hpdd.intel.com/test_sets/c7e2bd8c-fd56-11e7-a6ad-52540065bddc

== parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 21:38:30 (1515533910)
[357743.218692] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ======================================= 21:38:30 (1515533910)
Running /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds. CTRL-C to exit
./dir_remote.sh: line 11: RANDOM % MDTCOUNT: division by 0 (error token is "MDTCOUNT")
./dir_remote.sh: line 11: RANDOM % MDTCOUNT: division by 0 (error token is "MDTCOUNT")
./dir_migrate.sh: line 11: RANDOM % MDTCOUNT: division by 0 (error token is "MDTCOUNT")
./dir_migrate.sh: line 11: RANDOM % MDTCOUNT: division by 0 (error token is "MDTCOUNT")
./dir_migrate.sh: line 11: RANDOM % MDTCOUNT: division by 0 (error token is "MDTCOUNT")
./dir_remote.sh: line 11: RANDOM % MDTCOUNT: division by 0 (error token is "MDTCOUNT")
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]
  Trace dump:
  = ./file_create.sh:10:main()
parallel-scale-nfsv3: FAIL: test-framework exiting on error
  Trace dump:
  = ./file_create.sh:10:main()
parallel-scale-nfsv3: FAIL: test-framework exiting on error
  Trace dump:
  = ./file_create.sh:10:main()
parallel-scale-nfsv3: FAIL: test-framework exiting on error
[357960.620111] INFO: task chown:10956 blocked for more than 120 seconds.
[357960.626483] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[357960.632950] chown           D ffff88000e112750     0 10956  21869 0x00000080
[357960.635468]  ffff880018727df0 0000000000000086 ffff88000f5d0fd0 ffff880018727fd8
[357960.637915]  ffff880018727fd8 ffff880018727fd8 ffff88000f5d0fd0 ffff88000e112748
[357960.640417]  ffff88000e11274c ffff88000f5d0fd0 00000000ffffffff ffff88000e112750
[357960.642902] Call Trace:
[357960.644892]  [<ffffffff816aa4b9>] schedule_preempt_disabled+0x29/0x70
[357960.647213]  [<ffffffff816a83e7>] __mutex_lock_slowpath+0xc7/0x1d0
[357960.649494]  [<ffffffff81211aad>] ? putname+0x3d/0x60
[357960.651733]  [<ffffffff816a77ff>] mutex_lock+0x1f/0x2f
[357960.653938]  [<ffffffff811fdffb>] chown_common+0xeb/0x1d0
[357960.656153]  [<ffffffff816b0091>] ? __do_page_fault+0x171/0x450
[357960.658395]  [<ffffffff811ff5ef>] SyS_fchownat+0xcf/0x120
[357960.660597]  [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
[357960.662816] INFO: task setfattr:10965 blocked for more than 120 seconds.
[357960.665076] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[357960.667408] setfattr        D ffff88000e115f20     0 10965  21855 0x00000080
[357960.669690]  ffff88003637fd50 0000000000000086 ffff88006ff15ee0 ffff88003637ffd8
[357960.672058]  ffff88003637ffd8 ffff88003637ffd8 ffff88006ff15ee0 ffff88000e115f18
[357960.674383]  ffff88000e115f1c ffff88006ff15ee0 00000000ffffffff ffff88000e115f20
[357960.676681] Call Trace:
[357960.678530]  [<ffffffff816aa4b9>] schedule_preempt_disabled+0x29/0x70
[357960.680677]  [<ffffffff816a83e7>] __mutex_lock_slowpath+0xc7/0x1d0
[357960.682778]  [<ffffffff816a77ff>] mutex_lock+0x1f/0x2f
[357960.684763]  [<ffffffff81227a0f>] vfs_removexattr+0x5f/0x130
[357960.686749]  [<ffffffff81227b35>] removexattr+0x55/0x80
[357960.688670]  [<ffffffff81211aad>] ? putname+0x3d/0x60
[357960.690559]  [<ffffffff81212cc2>] ? user_path_at_empty+0x72/0xc0
[357960.692497]  [<ffffffff81203a58>] ? __sb_start_write+0x58/0x110
[357960.694401]  [<ffffffff816b0091>] ? __do_page_fault+0x171/0x450
[357960.696284]  [<ffffffff8122292c>] ? mnt_want_write+0x2c/0x50
[357960.698151]  [<ffffffff81228924>] SyS_removexattr+0x94/0xd0
[357960.699982]  [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
[357960.701822] INFO: task mv:10981 blocked for more than 120 seconds.
[357960.703658] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[357960.705615] mv              D ffff88000e112750     0 10981  21845 0x00000080
[357960.707528]  ffff880039c03c20 0000000000000082 ffff88001456af70 ffff880039c03fd8
[357960.709465]  ffff880039c03fd8 ffff880039c03fd8 ffff88001456af70 ffff88000e112748
[357960.711361]  ffff88000e11274c ffff88001456af70 00000000ffffffff ffff88000e112750
[357960.713237] Call Trace:
[357960.714640]  [<ffffffff816aa4b9>] schedule_preempt_disabled+0x29/0x70
[357960.716340]  [<ffffffff816a83e7>] __mutex_lock_slowpath+0xc7/0x1d0
[357960.717978]  [<ffffffff816a77ff>] mutex_lock+0x1f/0x2f
[357960.719522]  [<ffffffff816a13e3>] lookup_slow+0x33/0xa7
[357960.721017]  [<ffffffff8120f0db>] path_lookupat+0x77b/0x7b0
[357960.722522]  [<ffffffff811df60d>] ? kmem_cache_alloc+0x16d/0x1e0
[357960.724031]  [<ffffffff81211b1f>] ? getname_flags+0x4f/0x1a0
[357960.725493]  [<ffffffff8120f13b>] filename_lookup+0x2b/0xc0
[357960.726928]  [<ffffffff81212cb7>] user_path_at_empty+0x67/0xc0
[357960.728375]  [<ffffffff81114b22>] ? from_kgid_munged+0x12/0x20
[357960.729801]  [<ffffffff8120676f>] ? cp_new_stat+0x14f/0x180
[357960.731242]  [<ffffffff81212d21>] user_path_at+0x11/0x20
[357960.732635]  [<ffffffff81206263>] vfs_fstatat+0x63/0xc0
[357960.734032]  [<ffffffff81206831>] SYSC_newlstat+0x31/0x60
[357960.735403]  [<ffffffff8111f5d6>] ? __audit_syscall_exit+0x1e6/0x280
[357960.736846]  [<ffffffff81206abe>] SyS_newlstat+0xe/0x10
[357960.738208]  [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
[357960.739627] INFO: task ln:10982 blocked for more than 120 seconds.
[357960.741074] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[357960.742661] ln              D ffff88005dd73398     0 10982  21846 0x00000080
[357960.744356]  ffff880078c7bd70 0000000000000082 ffff880014568fd0 ffff880078c7bfd8
[357960.746111]  ffff880078c7bfd8 ffff880078c7bfd8 ffff880014568fd0 ffff88005dd73390
[357960.747734]  ffff88005dd73394 ffff880014568fd0 00000000ffffffff ffff88005dd73398
[357960.749349] Call Trace:
[357960.750511]  [<ffffffff816aa4b9>] schedule_preempt_disabled+0x29/0x70
[357960.751983]  [<ffffffff816a83e7>] __mutex_lock_slowpath+0xc7/0x1d0
[357960.753668]  [<ffffffff816a77ff>] mutex_lock+0x1f/0x2f
[357960.755112]  [<ffffffff8120f255>] filename_create+0x85/0x180
[357960.756515]  [<ffffffff811df60d>] ? kmem_cache_alloc+0x16d/0x1e0
[357960.757937]  [<ffffffff81211b1f>] ? getname_flags+0x4f/0x1a0
[357960.759339]  [<ffffffff81211b94>] ? getname_flags+0xc4/0x1a0
[357960.760717]  [<ffffffff81211e41>] user_path_create+0x41/0x60
[357960.762100]  [<ffffffff8121350e>] SyS_linkat+0xae/0x320
[357960.763455]  [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
[357960.764866] INFO: task mkdir:10986 blocked for more than 120 seconds.
[357960.766323] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[357960.767896] mkdir           D ffff88005dd73398     0 10986  21843 0x00000080
[357960.769474]  ffff880013bfbd90 0000000000000082 ffff880006408000 ffff880013bfbfd8
[357960.771124]  ffff880013bfbfd8 ffff880013bfbfd8 ffff880006408000 ffff88005dd73390
[357960.772745]  ffff88005dd73394 ffff880006408000 00000000ffffffff ffff88005dd73398
[357960.774391] Call Trace:
[357960.775583]  [<ffffffff816aa4b9>] schedule_preempt_disabled+0x29/0x70
[357960.777084]  [<ffffffff816a83e7>] __mutex_lock_slowpath+0xc7/0x1d0
[357960.778551]  [<ffffffff816a77ff>] mutex_lock+0x1f/0x2f
[357960.779933]  [<ffffffff8120f255>] filename_create+0x85/0x180
[357960.781387]  [<ffffffff81211b1f>] ? getname_flags+0x4f/0x1a0
[357960.782982]  [<ffffffff81211b94>] ? getname_flags+0xc4/0x1a0
[357960.784401]  [<ffffffff81211e41>] user_path_create+0x41/0x60
[357960.785803]  [<ffffffff812131d8>] SyS_mkdirat+0x48/0x100
[357960.787165]  [<ffffffff812132a9>] SyS_mkdir+0x19/0x20
[357960.788503]  [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
[357960.789920] INFO: task mkdir:10988 blocked for more than 120 seconds.
[357960.791370] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[357960.792928] mkdir           D ffff88005dd73398     0 10988  21861 0x00000080
[357960.794496]  ffff8800094bbd90 0000000000000082 ffff88006ff16eb0 ffff8800094bbfd8
[357960.796144]  ffff8800094bbfd8 ffff8800094bbfd8 ffff88006ff16eb0 ffff88005dd73390
[357960.797790]  ffff88005dd73394 ffff88006ff16eb0 00000000ffffffff ffff88005dd73398
[357960.799440] Call Trace:
[357960.800624]  [<ffffffff816aa4b9>] schedule_preempt_disabled+0x29/0x70
[357960.802134]  [<ffffffff816a83e7>] __mutex_lock_slowpath+0xc7/0x1d0
[357960.803629]  [<ffffffff816a77ff>] mutex_lock+0x1f/0x2f
[357960.805043]  [<ffffffff8120f255>] filename_create+0x85/0x180
[357960.806486]  [<ffffffff81211b1f>] ? getname_flags+0x4f/0x1a0
[357960.807928]  [<ffffffff81211b94>] ? getname_flags+0xc4/0x1a0
[357960.809375]  [<ffffffff81211e41>] user_path_create+0x41/0x60
[357960.810860]  [<ffffffff812131d8>] SyS_mkdirat+0x48/0x100
[357960.812242]  [<ffffffff812132a9>] SyS_mkdir+0x19/0x20
[357960.813581]  [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
[357985.696288] nfs: server onyx-73 not responding, still trying
[357985.700798] nfs: server onyx-73 not responding, still trying
[357997.728270] nfs: server onyx-73 not responding, still trying
[357997.732791] nfs: server onyx-73 not responding, still trying
[358021.792299] nfs: server onyx-73 not responding, still trying
[358021.796371] nfs: server onyx-73 not responding, still trying
racer cleanup
./file_exec.sh: line 1: kill: (21927) - No such process
./file_exec.sh: line 1: kill: (21934) - No such process
./file_exec.sh: line 1: kill: (21914) - No such process
sleeping 5 sec ...
file_create.sh: no process found
dir_create.sh: no process found
file_rm.sh: no process found
file_rename.sh: no process found
file_link.sh: no process found
file_symlink.sh: no process found
file_list.sh: no process found
file_concat.sh: no process found
file_exec.sh: no process found
file_chown.sh: no process found
file_chmod.sh: no process found
file_mknod.sh: no process found
file_truncate.sh: no process found
file_delxattr.sh: no process found
file_getxattr.sh: no process found
file_setxattr.sh: no process found
dir_remote.sh: no process found
dir_migrate.sh: no process found
there should be NO racer processes:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
[358069.984316] nfs: server onyx-73 not responding, still trying
[358250.464082] nfs: server onyx-73 not responding, still trying
[358250.469403] nfs: server onyx-73 not responding, still trying


 Comments   
Comment by James Nunez (Inactive) [ 24/Jul/18 ]

This issue will be fixed with the patch for LU-10254; https://review.whamcloud.com/#/c/30307/ .

Generated at Sat Feb 10 02:35:29 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.