CANNBot Simulator V2 Reference

Published: 2026/6/30 0:27:23

Simulator V2 Reference

Free download link: cannbot-skills. CANNBot is a family of agents that improve development efficiency for CANN, and this repository provides reusable Skills modules for them. Project address: https://gitcode.com/cann/cannbot-skills

Read this file when the question is specifically about how simulator execution works now. Do not use it as a replacement for kernel-authoring or general architecture docs.

Goal

Capture the current simulator execution path so future work does not rely on removed or stale `easyasc/simulator/` assumptions.

1. Current default

The repository's simulator path is now the V2 runtime.

Current behavior:

- `OpExec(..., simulator=True)` enables simulator execution
- `OpExec(..., simulator="v2")` is an accepted spelling for the same path
- `OpExec(..., simulator="legacy")` is still accepted by `OpExec`, but it does not select a separate old runtime anymore; it still routes to V2
- `KernelBase.run_sim()` always calls `_run_sim_v2()`

Practical rule:

- do not document or debug a separate `easyasc/simulator/` runtime as if it were still active

2. How a kernel becomes a V2 program

The simulator build entry lives in `easyasc/kernelbase/kernelbase.py`.

The selection order is:

- custom builder via `kernel._simulator_v2_program_builder`
- prebuilt program via `kernel._simulator_v2_program`
- auto analysis and auto bridge selection

Auto bridge selection:

- if the instruction stream contains control-flow, topology queries, `call_micro`, `VarList`, or cross-lane sync helpers, V2 uses `easyasc/simulator_v2/compat/control_flow_bridge.py`
- otherwise V2 uses the narrow linear bridge in `easyasc/simulator_v2/compat/kernel_bridge.py`

Important difference:

- `control_flow_bridge.py` preserves loops/conditionals and defers resolution to the runtime
- `kernel_bridge.py` only covers a narrower linear lowered-instruction subset
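The selection order in section 2 can be sketched as a small priority check. This is only an illustration under stated assumptions: `FakeKernel`, `CONTROL_FLOW_MARKERS`, the `features` field, and the returned labels are hypothetical names invented for the sketch, and only the two `_simulator_v2_*` attribute names come from the text above. The real logic in `kernelbase.py` may be organized differently.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

# Hypothetical feature tags standing in for "control-flow, topology queries,
# call_micro, VarList, or cross-lane sync helpers" found in an instruction stream.
CONTROL_FLOW_MARKERS = {
    "control_flow",
    "topology_query",
    "call_micro",
    "VarList",
    "cross_lane_sync",
}


@dataclass
class FakeKernel:
    """Stand-in for a kernel object; not the real KernelBase."""

    features: frozenset = frozenset()
    _simulator_v2_program_builder: Optional[Callable[["FakeKernel"], Any]] = None
    _simulator_v2_program: Any = None


def select_program_source(kernel: FakeKernel) -> str:
    """Return a label for the path that would produce the V2 program."""
    # 1. A custom builder wins over everything else.
    if kernel._simulator_v2_program_builder is not None:
        return "custom_builder"
    # 2. A prebuilt program is used next.
    if kernel._simulator_v2_program is not None:
        return "prebuilt_program"
    # 3. Auto selection: any control-flow-like feature needs the wider bridge.
    if kernel.features & CONTROL_FLOW_MARKERS:
        return "control_flow_bridge"
    return "kernel_bridge"


if __name__ == "__main__":
    print(select_program_source(FakeKernel(features=frozenset({"call_micro"}))))
```

When a repro behaves differently than expected, checking which of these branches was taken is usually the first thing worth confirming, because the two bridges accept different instruction subsets.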
3. Runtime stack

The runtime is split across these layers:

- parent coordinator: `easyasc/simulator_v2/runtime/global_runtime.py`
- core process wrapper: `easyasc/simulator_v2/runtime/core_process.py`
- per-core runtime: `easyasc/simulator_v2/runtime/core_runtime.py`
- lane-level control interpreter: `easyasc/simulator_v2/runtime/control_actor.py`
- pipe worker threads: `easyasc/simulator_v2/runtime/pipe_worker.py`
- pipe executors: `easyasc/simulator_v2/ops/`

Execution shape:

- one parent `GlobalRuntime`
- one child `CoreProcess` per simulated core
- inside each active core, one `ControlActor` per active lane
- inside each lane, one threaded `PipeWorker` per logical pipe

Launch rule:

- start simulator repros from a real `.py` file, not from `stdin` entry points such as `python - <<'PY'` or piped scripts
- V2 uses multiprocessing during startup, and Python spawn must be able to re-import `__main__` from a real filesystem path; `stdin` entry points appear as `<stdin>` and break child startup
- when the launcher lives outside the repo root, include the repo root in `PYTHONPATH` so child processes can import local modules consistently
- safe pattern: `PYTHONPATH=/abs/path/to/repo python /tmp/repro.py`

Completion / shutdown facts:

- pipe workers already stop through mailbox sentinels; the thread layer does not need a special end instruction
- parent / child completion now uses a one-shot status channel that the parent polls while joining
- `GlobalRuntime.run()` uses one global execution deadline across all active cores, not a full timeout budget per core in sequence

4. Planning and activation

Core and lane activation are resolved by:

- `easyasc/simulator_v2/config.py`
- `easyasc/simulator_v2/runtime/execution_plan.py`
- `easyasc/simulator_v2/helpers.py`

Key facts:

- default core count follows the active device family (`950 -> 32`, `b3 -> 20`)
- V2 can skip inactive lanes when a program only uses a subset of cube/vec lanes
- collective ops (`allcube_*`, `allvec_*`) affect lane-activation planning
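The "one global execution deadline" rule from the completion facts in section 3 can be illustrated with a generic join loop. This sketch is an assumption-laden stand-in: it uses `threading.Thread` objects instead of the real child core processes, and `join_with_global_deadline` is a hypothetical helper, not a function from `global_runtime.py`. The point it shows is that each join receives only the time remaining until a shared deadline, so the total wait does not grow with the number of cores.

```python
import threading
import time


def join_with_global_deadline(workers, timeout_s):
    """Join every worker against one shared deadline.

    Returns the names of workers that are still running when the deadline
    has passed. Each join only gets the time left until the deadline, so
    N stuck workers cost roughly timeout_s in total, not N * timeout_s.
    """
    deadline = time.monotonic() + timeout_s
    unfinished = []
    for worker in workers:
        remaining = max(0.0, deadline - time.monotonic())
        worker.join(remaining)
        if worker.is_alive():
            unfinished.append(worker.name)
    return unfinished


if __name__ == "__main__":
    gate = threading.Event()
    stuck = [threading.Thread(target=gate.wait, name=f"core{i}") for i in range(4)]
    for t in stuck:
        t.start()
    print(join_with_global_deadline(stuck, 0.2))  # all four names, after about 0.2 s
    gate.set()  # release the workers so the script can exit
    for t in stuck:
        t.join()
```

A per-core timeout in sequence would instead wait up to the full budget for every stuck core, which is the behavior the text says the runtime no longer has.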
5. Memory and tensor state

Shared tensor setup lives in:

- `easyasc/simulator_v2/memory/shared_tensor.py`
- `easyasc/simulator_v2/memory/shared_tensor_store.py`
- `easyasc/simulator_v2/memory/tensor_view.py`
- `easyasc/simulator_v2/memory/workspace.py`
- `easyasc/simulator_v2/memory/local_memory.py`

Important facts:

- `OpExec` clones input tensors into `GMTensor.data`
- V2 copies that payload into the shared runtime tensor store before execution
- after execution, V2 copies runtime tensors back into the bound `GMTensor.data`
- workspaces and local buffers are represented as shared-tensor specs in program metadata
- child-core local tensors now go through a bank-aware allocator (`UB0/UB1/L1/L0A/L0B/L0C`); over-capacity local allocations fail before pipe execution starts
- runtime-created local slice snapshots must treat a root local tensor's `SharedTensorSpec.storage_offset` as allocator bookkeeping in bytes rather than as an extra in-storage element offset; only nested local views should re-apply a parent `storage_offset` when `control_actor.py` materializes a dynamic slice
- simulator-side GM `atomic_add` / `atomic_max` / `atomic_min` now serialize their read-modify-write sections through a shared store-wide atomic lock so cross-core atomic writebacks do not lose updates under contention

Regression note:

- `testcases/simulator/memory/test_simulator_v2_slice_tensor.py` covers the sliced-UB vec-mul case where several prefix UB allocations push the sliced root tensor onto a non-zero local bank offset before runtime snapshotting
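The bank-aware allocation idea from section 5 can be sketched with a simple bump allocator. Everything concrete here is an assumption made for illustration: the `BankAllocator` class, the 32-byte alignment, and the capacity used in the demo are hypothetical and do not reflect real hardware sizes or the code in `local_memory.py`. What the sketch does show is the two properties the text describes: offsets are tracked in bytes per bank, and an over-capacity request fails at allocation time, before any execution would start.

```python
class BankAllocator:
    """Toy bump allocator with one byte cursor per local memory bank."""

    def __init__(self, capacities):
        # capacities: mapping of bank name -> size in bytes (hypothetical values)
        self._capacities = dict(capacities)
        self._used = {name: 0 for name in self._capacities}

    def allocate(self, bank, nbytes, align=32):
        """Reserve nbytes in a bank and return the byte offset of the block."""
        if bank not in self._capacities:
            raise KeyError(f"unknown bank: {bank}")
        # Round the cursor up to the next multiple of the alignment.
        offset = -(-self._used[bank] // align) * align
        end = offset + nbytes
        if end > self._capacities[bank]:
            raise MemoryError(
                f"{bank}: need {end} bytes, capacity is {self._capacities[bank]}"
            )
        self._used[bank] = end
        return offset


if __name__ == "__main__":
    alloc = BankAllocator({"UB0": 1024})
    print(alloc.allocate("UB0", 100))  # first block starts at byte 0
    print(alloc.allocate("UB0", 100))  # second block starts at an aligned, non-zero byte offset
```

The second allocation is the situation the regression note describes: earlier allocations push a later root tensor to a non-zero byte offset, and that offset is allocator bookkeeping rather than an element offset inside the tensor's own storage.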
6. Sync and control

The main sync/control pieces are:

- intra-core sync: `easyasc/simulator_v2/sync/intra_core_sync.py`
- collective sync: `easyasc/simulator_v2/sync/collective_sync.py`
- lane-local flags: `easyasc/simulator_v2/sync/local_flags.py`
- lane-local events: `easyasc/simulator_v2/sync/local_events.py`
- worker mailboxes: `easyasc/simulator_v2/sync/mailbox.py`

Important fact:

- collective sync state is process-shared at runtime; `GlobalRuntime` snapshots the parent `CollectiveSync` and each child core reloads that shared state instead of creating a private per-process coordinator
- lane-local `barrier(pipe=...)` currently has special runtime behavior only for `barrier(ALL)`; non-`ALL` barriers are preserved as control instructions but act as no-ops in the V2 runtime main loop
- practical consequence for kernel debugging: `bar_v()` / `bar_mte2()` / other single-pipe barriers do not serialize cross-pipe edges such as `V -> MTE2` on the simulator path; when a repro needs a simulator-visible local drain across pipe domains, use `bar_all()`
- `setflag` / `waitflag` still use the phase-based `LocalFlagTable`, but local `SEvent` / `DEvent` no longer do: V2 now models them with a per-lane flag bank keyed by `(src_pipe, dst_pipe, flag_id)` and a bool value per flag
- `create_sevent` allocates one `flag_id` from the lane-local pool for its `(src_pipe, dst_pipe)` pair; `create_devent` allocates two consecutive ids from that same pair-local pool
- `SEvent.set()` sets its single flag to `1` and errors if it is already `1`; `SEvent.wait()` blocks until that flag becomes `1`, then clears it back to `0`
- `DEvent` keeps two independent bool flags plus separate `set_count` / `wait_count` cursors: the producer-side `set` path alternates `flag0, flag1, flag0, ...`, and the consumer-side `wait` path alternates on its own cursor over the same two flags
- `event_setall` is modeled as repeated `set()` calls on the same event object rather than as a special bulk primitive; for `DEvent` that usually means setting both flags in rotation order, while `SEvent.setall()` will replay `set()` twice and therefore errors on the second call if the single flag is still set
- `event_release` is modeled as repeated `wait()` calls: `SEvent.release()` performs one wait, while `DEvent.release()` performs one wait and then performs a second wait only when a second outstanding token is already pending on the other rotated flag
- practical consequence for trace/timing work: local event blocking must now be reasoned about per real `flag_id`, not per `event_name`
- regression coverage: `testcases/simulator/bridge/test_simulator_v2_control_flow.py`

When debugging a hang:

- inspect the original failing lane error first
- then inspect the sync state / timeout diagnostic
- do not assume the timeout itself is the root cause

When a child core raises an exception:

- `GlobalRuntime.run()` now raises the combined per-core traceback text directly
- do not rely on a generic parent-side wrapper message; the actionable failure should already be in the thrown exception string
- pipe-worker instruction failures now print an immediate `stderr` log with `lane/pipe/opname/error`, control-side `wait_*` paths poll worker failures while waiting, and `CoreRuntime.join()` prefers surfacing the more actionable worker/task failure over a secondary sync-timeout symptom when multiple lane actors fail
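The per-lane flag model for `SEvent` and `DEvent` described in section 6 can be approximated in a single-threaded toy. The class names `FlagBank`, `SEventModel`, and `DEventModel` are hypothetical, and two simplifications are assumptions of this sketch rather than documented behavior: `try_wait()` returns `False` instead of blocking, and setting an already-set `DEvent` flag raises the same error as for `SEvent`.

```python
class FlagBank:
    """Lane-local bool flags keyed by (src_pipe, dst_pipe, flag_id)."""

    def __init__(self):
        self._flags = {}
        self._next_id = {}  # per (src_pipe, dst_pipe) pair id pool

    def allocate(self, src_pipe, dst_pipe, count):
        """Reserve `count` consecutive flag ids for one pipe pair."""
        pair = (src_pipe, dst_pipe)
        first = self._next_id.get(pair, 0)
        self._next_id[pair] = first + count
        keys = [(src_pipe, dst_pipe, first + i) for i in range(count)]
        for key in keys:
            self._flags[key] = False
        return keys

    def set(self, key):
        if self._flags[key]:
            raise RuntimeError(f"flag {key} is already set")
        self._flags[key] = True

    def try_wait(self, key):
        """Non-blocking stand-in for wait(): consume the flag if it is set."""
        if not self._flags[key]:
            return False
        self._flags[key] = False
        return True


class SEventModel:
    """Single-flag event: one id from the pair-local pool."""

    def __init__(self, bank, src_pipe, dst_pipe):
        self.bank = bank
        (self.key,) = bank.allocate(src_pipe, dst_pipe, 1)

    def set(self):
        self.bank.set(self.key)

    def try_wait(self):
        return self.bank.try_wait(self.key)


class DEventModel:
    """Two-flag event with independent set and wait cursors."""

    def __init__(self, bank, src_pipe, dst_pipe):
        self.bank = bank
        self.keys = bank.allocate(src_pipe, dst_pipe, 2)
        self.set_count = 0
        self.wait_count = 0

    def set(self):
        # Producer side rotates flag0, flag1, flag0, ...
        self.bank.set(self.keys[self.set_count % 2])
        self.set_count += 1

    def try_wait(self):
        # Consumer side rotates over the same two flags on its own cursor.
        consumed = self.bank.try_wait(self.keys[self.wait_count % 2])
        if consumed:
            self.wait_count += 1
        return consumed
```

Reading the model this way makes the trace consequence concrete: two events that share a `(src_pipe, dst_pipe)` pair still block on different `flag_id` values, so a stall has to be attributed to a specific key rather than to an event name.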
7. Trace path

Trace recording lives in:

- `easyasc/simulator_v2/trace/recorder.py`
- `easyasc/simulator_v2/trace/merge.py`
- `easyasc/simulator_v2/trace/chrome.py`
- a5 cycle-model profile and estimators: `easyasc/simulator_v2/timing/`

Runtime flow:

- each core records its own events
- parent runtime merges them after execution
- `dump_chrome_trace(...)` exports Chrome/Perfetto-style JSON
- runtime event timestamps originate from `time.monotonic()`
- exported Chrome traces normalize those timestamps into a per-run relative axis instead of replacing them with event-order indices
- exported `dur` now reflects measured task/wait spans when the runtime recorded them; zero-duration control markers still use a tiny fallback width only to stay visible in viewers
- sync-heavy kernels may now emit explicit `sync` trace events for wait/ready phases in addition to pipe execution events
- on a5 (`device_type` 950), the runtime can now switch trace timing to a cycle-model domain driven by the JSON profile under `timing/`; in that mode `easyasc_time_domain` is exported as `cycle` in the trace payload and task args include the modeling breakdown
- current a5 cycle-model defaults treat one ordinary V-pipe instruction as `2` cycles
- for `call_micro` / `vf()` timing, register <-> UB shuffle instructions are counted as `0` cycles: `micro_ub2reg`, `micro_reg2ub`, `micro_ub2regcont`, `micro_reg2ubcont`
- in cycle-model mode, direct control-side waits (`event_wait`, `wait_vec`, `wait_cube`, collective waits) now advance the control actor's cycle cursor, but `event_set` no longer acts as a lane-global block for later unrelated pipe dispatch; its ready time is derived from the completed source pipe, and unrelated pipes can start as soon as their own event dependencies are satisfied
- lane-local `event_wait` / `event_release` can now be lowered into the destination pipe worker queue, so the blocking happens on that pipe thread instead of only on the control actor; `event_set` / `event_setall` intentionally stay control-side because their position in the instruction stream still defines autosync lifetime boundaries
- trace export now consults `globvars.trace_event` (default `False`): when disabled, all sync-style trace markers are omitted from dispatch, pipe, and sync tracks, including lane-local `event_*`, local flag waits, intra-core handoff ops such as `wait_vec` / `cube_ready`, and collective `all*` sync ops; tests or debugging sessions that need those markers must enable the flag explicitly before running the simulator
- when optimizing from the trace view, keep `globvars.trace_event` at its default `False` unless the specific goal is to inspect sync/event behavior; turning it on adds sync markers that are useful for debugging but can distract from the steady-state scheduling picture you usually want for optimization work
- when optimizing cycle count from a trace, use the trace makespan as the objective: the cycle at which the last timed event finishes (`max(ts + dur)` over events with `ph` equal to `X`). Do not optimize for the sum of all timed durations or total activated cycles; those overcount parallel overlap and can rank kernels differently from the real end-to-end completion time
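The makespan objective from section 7 is easy to compute from an exported trace. The sketch below assumes the events are available as a list of dictionaries with the standard Chrome trace fields `ph`, `ts`, and `dur`; the helper names and the sample events are hypothetical, and how the list is obtained from the file written by `dump_chrome_trace(...)` is left out because the payload layout is not described here.

```python
def trace_makespan(events):
    """Finish time of the last complete ("X") event: max(ts + dur)."""
    ends = [e["ts"] + e.get("dur", 0) for e in events if e.get("ph") == "X"]
    return max(ends, default=0)


def summed_duration(events):
    """Sum of all complete-event durations; overcounts parallel overlap."""
    return sum(e.get("dur", 0) for e in events if e.get("ph") == "X")


if __name__ == "__main__":
    # Two pipes that overlap for most of their runtime, plus one instant marker.
    sample = [
        {"name": "mte2_load", "ph": "X", "ts": 0, "dur": 10},
        {"name": "v_compute", "ph": "X", "ts": 2, "dur": 10},
        {"name": "marker", "ph": "i", "ts": 50},
    ]
    print(trace_makespan(sample))   # 12
    print(summed_duration(sample))  # 20
```

In the sample, the summed duration is 20 while the kernel actually finishes at 12, which is why the sum can rank two schedules differently from their real end-to-end completion time.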
8. Vec and micro execution

Key implementation files:

- vec runtime entry: `easyasc/simulator_v2/ops/vec/v.py`
- vec legacy-layout helper: `easyasc/simulator_v2/ops/vec/_legacy_vpipe.py`
- vec MTE2 path: `easyasc/simulator_v2/ops/vec/mte2.py`
- vec MTE3 path: `easyasc/simulator_v2/ops/vec/mte3.py`
- micro runtime: `easyasc/simulator_v2/ops/micro/runtime.py`
- pipe dispatch: `easyasc/simulator_v2/ops/dispatch.py`

Important fact:

- several vec operations still reuse the legacy layout executor through `ops/vec/_legacy_vpipe.py`, but they run inside the V2 runtime
- when `gm_to_ub_pad` or `l0c_to_gm_nz2nd` reports a source/destination view that is too small on an a2 workspace-mediated tail path, first inspect whether the workspace view was cropped in the column dimension; those bridge ops infer row-stride from the parent GM shape, so a cropped workspace column span can fail even when the logical tail math is correct
- all UB burst copy ops (`gm_to_ub_pad` in `ops/vec/mte2.py`, `ub_to_gm_pad` and `ub_to_l1_nz` in `ops/vec/mte3.py`) use `_linear_view_from_pointer` so that column-sliced UB views (`ub[:, 0:valid_n]` with `valid_n < buffer_cols`) round-trip through the underlying storage; any new burst-style op must mirror this pattern or it will falsely raise "view is too small" when the destination is non-contiguous
- regression coverage: `testcases/simulator/datamove/test_gm_to_ub_pad_column_slice.py`

Scalar-semantics reminder:

- `control_flow_bridge.py` preserves `Var` arithmetic as runtime scalar ops such as `var_add`, `var_mul`, and `var_div`
- `control_actor.py` and `ops/micro/runtime.py` must preserve float `Var` semantics for those ops; do not silently coerce float scalar expressions to int on the runtime path
- practical symptom of a broken float-scalar path: raw cube/UB data looks correct, but a later `vf()` stage that multiplies by a computed scale suddenly collapses to `0`
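The float-scalar symptom from the scalar-semantics reminder in section 8 can be reproduced with a toy evaluator. This is a sketch under assumptions: `eval_scalar` and the `coerce_to_int` switch are hypothetical, and mapping `var_div` to true division is chosen only to make the failure visible; the real runtime's division rules are not specified in the text.

```python
import operator

# Toy dispatch table for the scalar ops named in the text.
# Assumption: var_div behaves like true division for float operands.
SCALAR_OPS = {
    "var_add": operator.add,
    "var_mul": operator.mul,
    "var_div": operator.truediv,
}


def eval_scalar(op, lhs, rhs, coerce_to_int=False):
    """Evaluate one scalar op; coerce_to_int mimics the broken runtime path."""
    result = SCALAR_OPS[op](lhs, rhs)
    return int(result) if coerce_to_int else result


if __name__ == "__main__":
    data = [8.0, 16.0]

    good_scale = eval_scalar("var_div", 1.0, 8.0)
    bad_scale = eval_scalar("var_div", 1.0, 8.0, coerce_to_int=True)

    print([x * good_scale for x in data])  # [1.0, 2.0]
    print([x * bad_scale for x in data])   # [0.0, 0.0]
```

The inputs are identical in both runs, which matches the described symptom: the raw data looks correct and only the stage that multiplies by the computed scale collapses to zero.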
9. Best first files for simulator debugging

- `easyasc/kernelbase/kernelbase.py`
- `easyasc/simulator_v2/compat/control_flow_bridge.py`
- `easyasc/simulator_v2/compat/kernel_bridge.py`
- `easyasc/simulator_v2/runtime/control_actor.py`
- `easyasc/simulator_v2/runtime/task_memory_validator.py`
  - pre-dispatch memory-range checks now cover shared-tensor helpers, all current cube-pipe tensor ops, vec datamoves, V-pipe tensor ops including packed `compare` / `select`, repeat-layout vec instructions, `sort32`, `mergesort*`, `gather`, `scatter`, task-level micro shared-tensor ops, and `call_micro` dry-run validation
- `easyasc/simulator_v2/runtime/pipe_worker.py`
- `easyasc/simulator_v2/runtime/global_runtime.py`
- `testcases/simulator/`

Authoring statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
