Node types

TongFlow’s nodes fall into six groups. The Add and Modality nodes hold materials on the canvas; everything else operates on them.

The authoritative list lives in config/tongflow.abi.json in the tongflow repo — this page reflects the current state.

Add (7 nodes)

Add nodes drop a new material onto the canvas. Pick them from the Smart Island Add toolbar in Create mode:

Node	Icon	What it does
`addTextNode`	Type	Type text directly into the node body
`addImageNode`	Image	Upload a file, capture from camera, or draw a sketch — outputs one image
`addAudioNode`	Music	Upload an audio file or record from the mic
`addVideoNode`	Video	Upload a video file or record from the camera
`addFileNode`	FileText	Upload a document (PDF / DOCX / TXT / MD)
`addLinkNode`	Link	Paste a URL — fetches the page content into text
`addModelNode`	Box	Upload a 3D model file (GLB / GLTF)

There are seven Add types, not eleven. Earlier docs counted “Add image” and “Record image (from camera)” as separate nodes; they’re modes inside the same addImageNode.

Transform

Each transform takes one input modality and produces another. Each one is wired to a backend model or an external LLM.

Text transforms

Node slot	Description	Backend
`gen-text`	Generate or rewrite text from a prompt	OpenRouter / Gemini / OpenAI / DeepSeek (configurable)
`combine-text`	Merge multiple text inputs into one	Local
`split-text`	Split a long passage into chunks	Local
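To make the two local text transforms concrete, here is a minimal sketch of what splitting and merging could look like. It is an illustration only: the function names, the `max_chars` and `separator` parameters, and the whitespace-based chunking rule are all assumptions, not TongFlow's actual implementation.

```python
def split_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text into chunks of at most max_chars, breaking on whitespace.

    Hypothetical stand-in for a `split-text` slot. A single word longer
    than max_chars is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def combine_text(parts: list[str], separator: str = "\n\n") -> str:
    """Merge several text inputs into one, skipping empty parts.

    Hypothetical stand-in for a `combine-text` slot.
    """
    return separator.join(p.strip() for p in parts if p.strip())
```

Breaking on whitespace keeps words intact, which is usually what a downstream text or speech node wants.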

Image transforms

Node slot	Description	Backend model
`image-gen-text`	Text → image	Z-Image
`image-gen`	Image → edited image (full-frame)	Z-Image
`image-gen-model`	Model-conditioned image generation	Configurable
`image-edit`	Inpaint / instruction-driven edit	FLUX.2 Klein 9B
`image-fusion`	Multi-reference blend	FLUX.2 Klein 9B
`image-describe`	Image → text (caption / Q&A)	Gemma 4 (multimodal)
`image-upscale`	Upscale image	SeedVR2

Video transforms

Node slot	Description	Backend
`gen-video`, `text-gen-video`	Text → video	LTX-2
`image-gen-video`	Image → video	LTX-2
`image-image-gen-video`	First + last frame → video (interpolation)	LTX-2
`video-image-gen-video-mix`, `wan-animate-mix`	Image + video → video with character swap / scene mix	WAN Animate
`video-image-gen-video-move`, `video-image-move-animal`	Motion transfer (subject from one, motion from another)	WAN Animate (move variant)
`audio-image-gen-video`	Audio + image → talking-head / animated portrait	LTX-2 / WAN
`video-describe`, `video-gen-text`	Video → text (summary / caption)	Gemma 4
`video-upscale`	Upscale video	SeedVR2
`get-first-frame`, `get-last-frame`	Extract a single frame	Local (FFmpeg)
`subtitle_remove`	Remove burned-in subtitles	Backend
`remove_watermark`	Remove watermark	Backend

Audio transforms

Node slot	Description	Backend
`gen-music`	Text → music	ACE-Step
`text-gen-speech-preset`	TTS with a preset voice	Qwen3
`text-gen-speech-clone`	TTS with a reference voice (clone)	Qwen3
`text-gen-speech-instruct`	TTS with style instructions	Qwen3
`text-audio-gen-speech`	TTS using both text and a reference audio	Qwen3
`transcribe`, `transcribe-timestamp`	Audio / video → text (with optional timestamps)	Qwen3
`denoise_audio`	Noise reduction	Backend
`separate_speaker`	Speaker diarization	Backend
`separate_audio_track`, `separate-video-audio`	Demux audio from video	Local (FFmpeg)
`convert_voice`	Voice / timbre replacement	Qwen3

Other transforms

Node slot	Description
`parse-document`	Document → text
`link`	URL → text
Image → 3D (in pipeline)	Image → 3D model

Combine

Combine nodes take multiple inputs and produce one output.

Node slot	Inputs	Output
`image-fusion`	N images	One blended image
`audio-video-lip-sync`	Audio + video	Audio-driven lip sync (InfiniteTalk)
`speech-video-gen-video`	Video + text (target dialogue)	Text-driven lip dub (LTX LipDub)
Other lip-sync related	Audio + image / text / image+video	Video generation or composite
`speech-image-video-gen-video`	Speech + image + video	Composite video
`speech-text-gen-video`	Speech + text	Video
`convert_voice` (combine flavor)	Text + reference audio	Speech in the cloned voice
`combine-text`	N text nodes	One merged text

Helpers

Node slot	Description
`concat-videos`	Join multiple clips end-to-end
`merge-video-audio`	Mux audio + video into one file
`split-video`	Cut by scene boundaries (scene detection)
`separate-video-audio`	Demux into separate tracks
`extract-audio`	Pull audio track from a video
`split-text`	Break long text into chunks
`combine-text`	Merge text segments
`drop-video`	Filter / drop clips by rule
`arrange-group`	Group and arrange clips/text for batch downstream

How types are checked

Connection validation is driven by the ABI. When you drag an output handle to an input handle, the system checks that the modality and shape match — if you try to feed a video into an input that wants text, the edge won’t connect. The generated TypeScript types in src/generated/abi/index.ts keep the canvas and the workflow exporter honest at compile time.
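The check described above can be sketched roughly as follows. This is a simplified model, not the canvas code: the `Modality` enum, the `Handle` class, and the reading of "shape" as single value versus list are assumptions made for illustration. The real rules come from the ABI.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"


@dataclass(frozen=True)
class Handle:
    """One input or output handle on a node (hypothetical model)."""
    node: str
    modality: Modality
    is_list: bool = False  # assumed meaning of "shape": single value vs. list


def can_connect(output: Handle, target: Handle) -> bool:
    """Allow an edge only when both modality and shape match."""
    return output.modality == target.modality and output.is_list == target.is_list


video_out = Handle("addVideoNode", Modality.VIDEO)
text_out = Handle("addTextNode", Modality.TEXT)
text_in = Handle("gen-text", Modality.TEXT)

print(can_connect(video_out, text_in))  # video into a text input is rejected
print(can_connect(text_out, text_in))   # text into a text input is accepted
```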

Adding your own node

If a transform you need isn’t listed, you can plug it in. See docs/feature-registry.md and docs/plugins.md in the tongflow repo. The flow is:

1. Add the slot definition to config/tongflow.abi.json.
2. Regenerate types: pnpm gen:abi.
3. Implement the plugin under plugins/ with the @node_slot decorator and matching Pydantic models.
4. Bump the Python SDK pin, publish, redeploy Modal.
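As a rough picture of the plugin step, the sketch below registers a toy slot. Everything in it is hypothetical: the `node_slot` decorator is a local stand-in whose signature may differ from the real one in the TongFlow SDK, the `reverse-text` slot does not exist, and plain dataclasses stand in for the Pydantic models so the example runs without dependencies. Treat docs/plugins.md as the reference for the real API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical registry and decorator. The real @node_slot comes from the
# TongFlow SDK; this stand-in only mimics the idea of registering by slot name.
SLOT_REGISTRY: dict[str, Callable] = {}


def node_slot(name: str):
    def register(fn: Callable) -> Callable:
        SLOT_REGISTRY[name] = fn
        return fn
    return register


# Dataclasses stand in for the Pydantic input/output models here.
@dataclass
class ReverseTextInput:
    text: str


@dataclass
class ReverseTextOutput:
    text: str


@node_slot("reverse-text")  # hypothetical slot name; would need a matching ABI entry
def reverse_text(payload: ReverseTextInput) -> ReverseTextOutput:
    """Toy transform: text in, reversed text out."""
    return ReverseTextOutput(text=payload.text[::-1])


result = SLOT_REGISTRY["reverse-text"](ReverseTextInput(text="abc"))
print(result.text)
```

The point of the pattern is that the slot name in the decorator and the input/output models line up with the slot definition added to the ABI in the first step.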

Smart Island — how to surface these nodes from the dock
Workflow studio — connecting nodes and running
AI capabilities — the named backend models