TongFlow is an open-source, multi-modal GenAI workflow studio. It turns every AI model into a node on an infinite canvas, so you can chain text, image, video, audio, document, and 3D models together — like snapping together building blocks — to produce things no single model could make on its own.

What it is, in 30 seconds

Drop material onto the canvas, pick the next step, and the connection is made for you — then run. No parameter panels, no manual wiring. That’s the whole idea. A few things people have already built with it:

Basic — type text, generate images, then blend them into a single picture.
Talking-head video — topic → script → speech, plus a character description → image, fused into a lip-synced avatar video.
Music video — generate lyrics + a song + characters + scenes + a storyboard, then assemble a full MV.

See the demos and GIFs on GitHub for the workflows behind each result.

Why we built it

AI capabilities are exploding — text-to-image, image-to-video, text-to-speech, lip sync, super-resolution — but they live in separate tools. Each has its own interface, its own parameters, and moving a file from one into the next is manual work. The moment your idea spans more than one modality, the friction wins.

TongFlow’s answer is a single canvas: every capability is a node, every node speaks a typed contract, and connecting them is automatic. The hard part becomes the creative part again.

The core idea

Every model, as one mental model

Think of any AI model in terms of what it turns into what. An LLM is text → text. A diffusion model is text → image. Text-to-speech is text → audio, speech recognition is audio → text, and a 3D generator is image → 3D. Every capability is the same kind of thing, a modality transform, so TongFlow wraps each one as a node with typed inputs and outputs. The entire, ever-growing zoo of AI models collapses into one consistent mental model. When a new model lands, it’s just another node; nothing about how you work changes.
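To make the "modality transform" idea concrete, here is a minimal sketch of a node as a typed signature. It is illustrative only: `Modality`, `Node`, and `can_chain` are hypothetical names chosen for this example, not TongFlow's actual types.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    VIDEO = "video"
    AUDIO = "audio"
    DOCUMENT = "document"
    MODEL_3D = "3d"


@dataclass(frozen=True)
class Node:
    """A capability described only by what it turns into what (hypothetical)."""
    name: str
    accepts: Modality
    produces: Modality


# The examples from the paragraph above, expressed as typed nodes.
LLM = Node("llm", Modality.TEXT, Modality.TEXT)
DIFFUSION = Node("diffusion", Modality.TEXT, Modality.IMAGE)
TTS = Node("text_to_speech", Modality.TEXT, Modality.AUDIO)
ASR = Node("speech_recognition", Modality.AUDIO, Modality.TEXT)
IMAGE_TO_3D = Node("image_to_3d", Modality.IMAGE, Modality.MODEL_3D)


def can_chain(*nodes: Node) -> bool:
    """Return True if each node's output modality matches the next node's input."""
    return all(a.produces == b.accepts for a, b in zip(nodes, nodes[1:]))
```

Under this sketch, `can_chain(ASR, LLM, DIFFUSION, IMAGE_TO_3D)` holds (audio → text → text → image → 3D), while `can_chain(TTS, DIFFUSION)` does not, because speech output is audio and the diffusion node expects text.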

Every modality, not just generation

Image, video, audio, text, document, and 3D — the formats people actually ship on the web — are all first-class. And it’s not only generation: you can edit, understand, upscale, transcribe, and convert across them. Text becomes an image; the image animates into a video; the video gets lip-synced to generated speech; a document or a URL becomes text you feed into the next step. Whatever you bring in, there’s a path to whatever you want out.

Low barrier

No CFG scales, no sampler settings, no seeds buried in parameter panels — and no manual node wiring. You work with three verbs: Add material, Transform it, Combine the results. Drop something on the canvas, pick the next step, and the connection is made for you. Install the desktop app and you’re creating within minutes — no ML background required.
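One plausible way to read "pick the next step, and the connection is made for you" is type-directed wiring: the canvas offers only the steps that accept the material's modality, and picking one records the edge. The sketch below assumes that reading; `Step`, `Canvas`, and the step names are hypothetical and say nothing about how TongFlow implements it internally.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Step:
    name: str
    accepts: str   # input modality, e.g. "text"
    produces: str  # output modality, e.g. "image"


@dataclass
class Canvas:
    # Each edge is (source item name, step name); hypothetical representation.
    edges: list[tuple[str, str]] = field(default_factory=list)

    def next_steps(self, modality: str, catalog: list[Step]) -> list[Step]:
        """Offer only the steps whose input matches the material's modality."""
        return [step for step in catalog if step.accepts == modality]

    def apply(self, item: str, modality: str, step: Step) -> str:
        """Record the connection automatically and return the output modality."""
        if step.accepts != modality:
            raise TypeError(f"{step.name} expects {step.accepts}, got {modality}")
        self.edges.append((item, step.name))
        return step.produces


CATALOG = [
    Step("image_generation", "text", "image"),
    Step("text_to_speech", "text", "audio"),
    Step("image_to_video", "image", "video"),
]
```

With this catalog, dropping a text prompt on a `Canvas` offers `image_generation` and `text_to_speech`; applying the first yields an image, which in turn offers `image_to_video`.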

High ceiling

Because every node composes with every other, simple parts chain into ambitious results. Lyrics → a song → characters → scenes → a storyboard → a finished music video, all on one canvas. The interface stays easy, but the combination space is enormous: orchestrate the models freely and you make things that are genuinely your own — not one tool’s single canned output. The floor is low and the ceiling keeps rising as the model ecosystem grows.

Open ecosystem

TongFlow’s core stays deliberately small. Every capability node is defined by a contract, and at least one official plugin implements it — so it works out of the box — while anyone else can publish an alternative. API providers, GPU hosts, CPU services: any platform can package its own plugin the same way, and the best implementation of each capability can come from whoever does it best. The core stays small; the ecosystem stays open.
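The "one contract, many implementations" arrangement can be pictured as a registry keyed by capability. This is a rough sketch under that assumption; the `register` decorator, the plugin names, and the toy `text_rewrite` implementations are all hypothetical and are not part of the tongflow SDK.

```python
from collections import defaultdict
from typing import Callable

# capability name -> {plugin name -> implementation}; hypothetical structure.
_REGISTRY: dict[str, dict[str, Callable[[str], str]]] = defaultdict(dict)


def register(capability: str, plugin: str):
    """Decorator that files an implementation under a capability."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        _REGISTRY[capability][plugin] = fn
        return fn
    return decorator


@register("text_rewrite", plugin="official")
def official_rewrite(text: str) -> str:
    # Stand-in for an official plugin; it only trims whitespace.
    return text.strip()


@register("text_rewrite", plugin="community-upper")
def community_rewrite(text: str) -> str:
    # Stand-in for an alternative published by someone else.
    return text.strip().upper()


def run(capability: str, payload: str, plugin: str = "official") -> str:
    """Dispatch to the chosen implementation of a capability."""
    implementations = _REGISTRY.get(capability, {})
    if plugin not in implementations:
        raise KeyError(f"no plugin {plugin!r} for capability {capability!r}")
    return implementations[plugin](payload)
```

Callers name the capability, not the provider, so swapping `plugin="official"` for `plugin="community-upper"` changes who does the work without changing the calling code.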

The capability map

The whole interface reduces to four groups: the three verbs (Add, Transform, Combine) plus a set of Helpers.

Add — bring material onto the canvas: text, image, photo, sketch, audio, recording, video, document, URL, or 3D model.
Transform — convert between modalities: text rewrite; image generation / editing / understanding / upscaling; text-to-video / image-to-video / first-last-frame / video understanding; music generation; text-to-speech (including voice cloning); speech recognition.
Combine — merge results: image fusion, lip sync, character swap, motion transfer, text merge.
Helpers — production glue: concatenate clips, mux audio + video, split by shots, extract tracks, chunk long text, and more.

Nodes marked ✅ in the README work out of the box with an official plugin; nodes marked ⬜ already appear on the canvas, but their official plugins are still planned.

Up and running in five minutes

Install the desktop app — macOS (Apple Silicon / Intel) and Windows builds are on the Releases page. On first open, the canvas is preloaded with an example workflow.
Install plugins — the app ships with none. Open the plugin manager and install what you need; new plugins are usable immediately, no restart.
Configure credentials — open Settings and add the environment variables your plugins need (e.g. OPENAI_API_KEY, or MODAL_TOKEN_* for GPU plugins). Values are stored locally and take effect without a restart.
Run the example — node by node, or switch to Execute Mode and run the whole graph in one click.
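Running a whole graph in one click implies running each node only after the nodes it depends on. A common way to do that is a topological sort, sketched below with Python's standard `graphlib`. Whether TongFlow's Execute Mode schedules nodes this way is an assumption; the graph mirrors the talking-head example above, and `run_graph` and `describe` are hypothetical names.

```python
from graphlib import TopologicalSorter
from typing import Callable

# node -> the nodes it depends on (the talking-head workflow, simplified).
GRAPH: dict[str, set[str]] = {
    "script": {"topic"},
    "speech": {"script"},
    "portrait": {"character_description"},
    "lip_sync": {"speech", "portrait"},
}


def run_graph(
    graph: dict[str, set[str]],
    run_node: Callable[[str, list[str]], str],
) -> dict[str, str]:
    """Run every node once, after all of its dependencies have produced output."""
    results: dict[str, str] = {}
    for node in TopologicalSorter(graph).static_order():
        # Sort dependency names so the input order is deterministic.
        inputs = [results[dep] for dep in sorted(graph.get(node, ()))]
        results[node] = run_node(node, inputs)
    return results


def describe(node: str, inputs: list[str]) -> str:
    """Stand-in for real model calls: records which inputs each node received."""
    return f"{node}({', '.join(inputs)})"


results = run_graph(GRAPH, describe)
```

In this sketch `lip_sync` always runs last, because every other node is one of its direct or indirect dependencies.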

Start for free — and that’s not a trial gimmick. The official GPU/CPU plugins run on Modal, which gives every new account up to $30/month of free GPU compute on real H100/A100-class hardware. That’s enough to generate images, animate videos, synthesize speech and music, and run an entire multi-step pipeline — without owning a GPU or paying a cent. You can take a workflow from idea to finished result at zero cost, every month. Bring your own keys or scale up only when you’re ready.

For developers: the plugin architecture

Every runnable node is backed by a contract — the ABI (config/tongflow.abi.json) — that defines what capabilities exist and what each one’s input and output look like, independent of who implements it.

A plugin is a small Python package that picks one or more ABI slots and supplies the implementation, annotated against the ABI-generated types via the tongflow Python SDK.
Compile-time contracts. TypeScript types and Python Pydantic models are generated from the ABI, and checking code against them is the entire gate: typos and shape mismatches are caught by tsc / pyright, with no runtime validation overhead.
Backend-neutral. The SDK never depends on Modal. Any platform — an API provider, a GPU host, a CPU service — can publish its own plugin the same way.
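As a rough illustration of a statically checked contract, the sketch below describes one slot as a pair of typed records plus a `typing.Protocol`, and a plugin as any callable matching it. It deliberately uses standard-library dataclasses as a stand-in for the ABI-generated Pydantic models and does not use the tongflow SDK; the slot name, field names, and `fake_tts` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class TextToSpeechInput:
    text: str
    voice: str = "default"


@dataclass(frozen=True)
class TextToSpeechOutput:
    audio_path: str


class TextToSpeech(Protocol):
    """The slot: what goes in and what comes out, independent of who implements it."""

    def __call__(self, request: TextToSpeechInput) -> TextToSpeechOutput: ...


def fake_tts(request: TextToSpeechInput) -> TextToSpeechOutput:
    """Placeholder implementation; a real plugin would call a model here."""
    filename = f"{request.voice}-{len(request.text)}.wav"
    return TextToSpeechOutput(audio_path=f"/tmp/{filename}")


# A static checker such as pyright compares fake_tts against the Protocol here,
# so a wrong parameter or return type is reported before anything runs.
plugin: TextToSpeech = fake_tts
```

With types like these, a typo such as `request.txt` or a function returning a bare string is flagged by the type checker rather than discovered at run time, which is the property the compile-time contract is after.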

The official catalog already spans API plugins (OpenAI, Gemini, OpenRouter) and GPU/CPU plugins (Z-Image, FLUX.2 Klein 9B, LTX, InfiniteTalk, Wan-Animate, SeedVR2, Whisper, Qwen3, ACE-Step, and more). The full development flow lives in docs/PLUGINS.md.

Who it’s for

Creators — build talking-head videos, music videos, and short-form pipelines without stitching tools by hand.
Developers & platforms — package your own model as a plugin and plug it into the ecosystem.
Enterprises — deploy on local GPUs, build custom nodes, and integrate private models.

Open source, license, and community

TongFlow is dual-licensed: AGPL-3.0 (free for individuals, research, and open-source projects) and a commercial license for closed-source or SaaS use. For business inquiries, reach us at [email protected].

If the project is useful to you, a star on GitHub helps a lot. Come build with us on Discord, or jump straight in with the hosted studio at app.tongflow.com.

Expand your imagination, stretch your ideas — give it a try.
