Node types

TongFlow’s nodes fall into six groups. The Add and Modality nodes hold materials on the canvas; everything else operates on them.

The authoritative list lives in config/tongflow.abi.json in the tongflow repo — this page reflects v0.1.0.

Add (7 nodes)

Add nodes drop a new material onto the canvas. You pick them from the Smart Island Add toolbar in Create mode:

| Node | Icon | What it does |
| --- | --- | --- |
| addTextNode | Type | Type text directly into the node body |
| addImageNode | Image | Upload a file, capture from camera, or draw a sketch; outputs one image |
| addAudioNode | Music | Upload an audio file or record from the mic |
| addVideoNode | Video | Upload a video file or record from the camera |
| addFileNode | FileText | Upload a document (PDF / DOCX / TXT / MD) |
| addLinkNode | Link | Paste a URL; fetches the page content into text |
| addModelNode | Box | Upload a 3D model file (GLB / GLTF) |

There are seven Add types, not eleven. Earlier docs counted “Add image” and “Record image (from camera)” as separate nodes; they’re modes inside the same addImageNode.

Transform

Each transform takes one input modality and produces another. Transforms are wired to backend models or external LLMs, and a few run locally.

Text transforms

| Node slot | Description | Backend |
| --- | --- | --- |
| gen-text | Generate or rewrite text from a prompt | OpenRouter / Gemini / OpenAI / DeepSeek (configurable) |
| combine-text | Merge multiple text inputs into one | Local |
| split-text | Split a long passage into chunks | Local |

Image transforms

| Node slot | Description | Backend model |
| --- | --- | --- |
| image-gen-text | Text → image | Z-Image |
| image-gen | Image → edited image (full-frame) | Z-Image |
| image-gen-model | Model-conditioned image generation | Configurable |
| image-edit | Inpaint / instruction-driven edit | FLUX.2 Klein 9B |
| image-fusion | Multi-reference blend | FLUX.2 Klein 9B |
| image-describe | Image → text (caption / Q&A) | Gemma 4 (multimodal) |
| image-upscale | Upscale image | SeedVR2 |

Video transforms

| Node slot | Description | Backend |
| --- | --- | --- |
| gen-video, text-gen-video | Text → video | LTX-2 |
| image-gen-video | Image → video | LTX-2 |
| image-image-gen-video | First + last frame → video (interpolation) | LTX-2 |
| video-image-gen-video-mix, wan-animate-mix | Image + video → video with character swap / scene mix | WAN Animate |
| video-image-gen-video-move, video-image-move-animal | Motion transfer (subject from one, motion from another) | WAN Animate (move variant) |
| audio-image-gen-video | Audio + image → talking-head / animated portrait | LTX-2 / WAN |
| video-describe, video-gen-text | Video → text (summary / caption) | Gemma 4 |
| video-upscale | Upscale video | SeedVR2 |
| get-first-frame, get-last-frame | Extract a single frame | Local (FFmpeg) |
| subtitle_remove | Remove burned-in subtitles | Backend |
| remove_watermark | Remove watermark | Backend |

Audio transforms

| Node slot | Description | Backend |
| --- | --- | --- |
| gen-music | Text → music | ACE-Step |
| text-gen-speech-preset | TTS with a preset voice | Qwen3 |
| text-gen-speech-clone | TTS with a reference voice (clone) | Qwen3 |
| text-gen-speech-instruct | TTS with style instructions | Qwen3 |
| text-audio-gen-speech | TTS using both text and a reference audio | Qwen3 |
| transcribe, transcribe-timestamp | Audio / video → text (with optional timestamps) | Qwen3 |
| denoise_audio | Noise reduction | Backend |
| separate_speaker | Speaker diarization | Backend |
| separate_audio_track, separate-video-audio | Demux audio from video | Local (FFmpeg) |
| convert_voice | Voice / timbre replacement | Qwen3 |

Cross-modal bridges

| Node slot | Description |
| --- | --- |
| parse-document | Document → text |
| link | URL → text |
| Image → 3D (in pipeline) | Image → 3D model |
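
Because each transform maps one modality to another, a chain of transforms only makes sense when every step's output matches the next step's input. The Python sketch below illustrates that idea with a few slots from the tables above. The SLOT_IO mapping and the chain_output helper are hypothetical names for illustration, not part of the TongFlow ABI or SDK.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup: slot name -> (input modality, output modality).
# The pairs are taken from the transform tables; the structure is an assumption.
SLOT_IO: Dict[str, Tuple[str, str]] = {
    "image-gen-text": ("text", "image"),
    "image-gen-video": ("image", "video"),
    "image-describe": ("image", "text"),
    "video-describe": ("video", "text"),
}


def chain_output(start: str, slots: List[str]) -> str:
    """Walk a chain of slots and return the final modality.

    Raises ValueError as soon as a slot receives a modality it does not accept.
    """
    current = start
    for slot in slots:
        wanted, produced = SLOT_IO[slot]
        if wanted != current:
            raise ValueError(f"{slot} expects {wanted}, got {current}")
        current = produced
    return current


# text -> image -> video -> text
print(chain_output("text", ["image-gen-text", "image-gen-video", "video-describe"]))
```

Swapping the last two slots in that chain would raise a ValueError, because video-describe would receive an image.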

Combine

Combine nodes take multiple inputs and produce one output.

| Node slot | Inputs | Output |
| --- | --- | --- |
| image-fusion | N images | One blended image |
| speech-video-gen-video, lip-sync variants | Audio + video → video / Audio + image → video / Audio + text → video / Audio + image + video → video | Lip-synced video |
| speech-image-video-gen-video | Speech + image + video | Composite video |
| speech-text-gen-video | Speech + text | Video |
| convert_voice (combine flavor) | Text + reference audio → speech | Cloned voice |
| combine-text | N text nodes | One merged text |
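
A combine node can only produce its output once every input it expects is connected. The sketch below, using two rows from the table above, reports which input modalities are still missing. The COMBINE_INPUTS mapping and the missing_inputs helper are hypothetical and not part of the TongFlow ABI.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical table of required inputs per combine slot (from the rows above).
COMBINE_INPUTS: Dict[str, List[str]] = {
    "speech-image-video-gen-video": ["speech", "image", "video"],
    "speech-text-gen-video": ["speech", "text"],
}


def missing_inputs(slot: str, connected: List[str]) -> List[str]:
    """Return the input modalities the slot still needs, sorted by name."""
    # Counter subtraction keeps only positive counts, i.e. unmet requirements.
    remaining = Counter(COMBINE_INPUTS[slot]) - Counter(connected)
    return sorted(remaining.elements())


print(missing_inputs("speech-image-video-gen-video", ["speech", "video"]))
print(missing_inputs("speech-text-gen-video", ["text", "speech"]))
```

An empty list means the node has everything it needs.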

Helpers

| Node slot | Description |
| --- | --- |
| concat-videos | Join multiple clips end-to-end |
| merge-video-audio | Mux audio + video into one file |
| split-video | Cut by scene boundaries (scene detection) |
| separate-video-audio | Demux into separate tracks |
| extract-audio | Pull audio track from a video |
| split-text | Break long text into chunks |
| combine-text | Merge text segments |
| drop-video | Filter / drop clips by rule |
| arrange-group | Group and arrange clips/text for batch downstream |

How types are checked

Connection validation is driven by the ABI. When you drag an output handle to an input handle, the system checks that the modality and shape match — if you try to feed a video into an input that wants text, the edge won’t connect. The generated TypeScript types in src/generated/abi/index.ts keep the canvas and the workflow exporter honest at compile time.
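
As a rough illustration of the check described above, the following Python sketch models each handle as a modality plus a shape and allows an edge only when both match. The Port class, the shape values, and can_connect are hypothetical names for illustration; the real rules come from the ABI and the generated TypeScript types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    """Hypothetical handle description: what it carries and in what shape."""
    modality: str           # e.g. "text", "image", "video", "audio"
    shape: str = "single"   # assumed values: "single" or "list"


def can_connect(output: Port, input_: Port) -> bool:
    """An edge is allowed only when modality and shape both match."""
    return output.modality == input_.modality and output.shape == input_.shape


text_in = Port("text")
print(can_connect(Port("video"), text_in))  # video into a text input is rejected
print(can_connect(Port("text"), text_in))   # text into a text input is accepted
```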

Adding your own node

If a transform you need isn’t listed, you can plug it in. See docs/feature-registry.md and docs/plugins.md in the tongflow repo. The flow is:

  1. Add the slot definition to config/tongflow.abi.json.
  2. Regenerate types: pnpm gen:abi.
  3. Implement the plugin under plugins/ with the @node_slot decorator and matching Pydantic models.
  4. Bump the Python SDK pin, publish, redeploy Modal.
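
To give a feel for step 3, here is a minimal, self-contained sketch of the decorator-plus-typed-models pattern. It does not use the TongFlow SDK: the node_slot decorator defined here is a local stand-in whose signature is an assumption, dataclasses stand in for the Pydantic models, and the reverse-text slot is invented for the example. Check docs/plugins.md for the real decorator arguments and model base classes.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Local stand-in registry. The real SDK keeps its own; this one is hypothetical.
SLOT_REGISTRY: Dict[str, Callable[[Any], Any]] = {}


def node_slot(name: str) -> Callable:
    """Stand-in for the SDK's @node_slot; the real signature may differ."""
    def register(fn: Callable[[Any], Any]) -> Callable[[Any], Any]:
        if name in SLOT_REGISTRY:
            raise ValueError(f"slot {name!r} is already registered")
        SLOT_REGISTRY[name] = fn
        return fn
    return register


# In a real plugin these would be Pydantic models matching the ABI slot.
@dataclass
class ReverseTextInput:
    text: str


@dataclass
class ReverseTextOutput:
    text: str


@node_slot("reverse-text")  # hypothetical slot name, not in the ABI
def reverse_text(inp: ReverseTextInput) -> ReverseTextOutput:
    """Toy transform: text in, reversed text out."""
    return ReverseTextOutput(text=inp.text[::-1])


result = SLOT_REGISTRY["reverse-text"](ReverseTextInput(text="abc"))
print(result.text)
```

The point of the pattern is that the slot name in the decorator, the input model, and the output model all line up with the entry added to config/tongflow.abi.json in step 1.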