From Text to Visuals: Using Generative AI to Create Graphics for Scientific Articles

ComfyUI · Python · OpenAI API · DepthFlow · Baseten · Truss

Scientific articles live or die by their visuals. The right graphic can make complex research accessible and compelling; with the wrong one, or none at all, most readers are already gone. This project started with a simple question: could generative AI create visuals that actually do justice to scientific content, at scale, automatically?

The spark came from a visit to the team at Proem.ai, whose mission is to make scientific research more widely accessible. During the visit I was asked how I'd approach creating visuals for scientific papers — and I couldn't stop thinking about it on the way home. I envisioned something like an Instagram for scientific articles: a personalised, aesthetically driven feed where every paper has a visual identity that fits its subject.

the idea

The challenge wasn't just technical — it was aesthetic. AI-generated imagery has a reputation for being either photorealistic in an uncanny way, or generic and forgettable. Neither works for scientific communication, where the visual needs to elevate the text without overshadowing it.

I defined four principles early on: the visuals had to be aesthetically compelling, they had to serve the science rather than distract from it, the entire pipeline had to be automated (with thousands of papers published daily, manual creation is impractical), and it had to be fast enough to feel responsive.

For the visual style I drew from Vector Art, Art Deco, Flat Design, Minimalism, and Bauhaus — design languages that prioritise clarity and striking composition. This wasn't just an aesthetic preference. Constraining the style through the diffusion prompt consistently produced more coherent, purposeful images than leaving it open-ended. Style as a parameter turned out to be as important as the subject matter prompt itself.

Rather than full video, I landed on looping 2.5D animations using DepthFlow. The technique works by generating a depth map from the still image — estimating the distance of each pixel from the camera — and then using that depth information to simulate parallax motion, as if a camera is slowly moving through a three-dimensional scene. The result is subtle but effective: engaging without being distracting, and computationally far cheaper than generating actual video.
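
To make the mechanism concrete, here is a minimal sketch of depth-based parallax in plain NumPy. This is the idea, not DepthFlow's actual implementation: shift each pixel horizontally in proportion to its depth, and drive those shifts with a small oscillating camera offset so the sequence loops seamlessly. Filenames are placeholders, and the sketch assumes brighter depth values mean nearer pixels.

```python
# Minimal sketch of depth-based parallax (the idea behind DepthFlow,
# not its actual implementation). Assumes brighter depth = nearer.
import numpy as np
import imageio.v3 as iio

def parallax_frame(image, depth, offset):
    """Shift each pixel horizontally by offset * depth to fake camera motion."""
    h, w = depth.shape
    xs = np.arange(w)
    out = np.empty_like(image)
    for y in range(h):
        # Near pixels (large depth values) move further than far ones.
        shifted = np.clip(xs + (offset * depth[y]).astype(int), 0, w - 1)
        out[y] = image[y, shifted]
    return out

image = iio.imread("still.png")               # the generated still
depth = iio.imread("depth.png").astype(np.float32)
if depth.ndim == 3:                           # collapse RGB depth maps
    depth = depth.mean(axis=2)
depth /= depth.max()                          # normalise to 0..1

# A sinusoidal offset returns to its starting point, so the clip loops cleanly.
frames = [parallax_frame(image, depth, 8 * np.sin(2 * np.pi * t / 60))
          for t in range(60)]
iio.imwrite("loop.mp4", np.stack(frames), fps=30)
```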

the ComfyUI workflow

The pipeline I built in ComfyUI works in four steps. A link to a scientific article goes in. GripTape handles the web scraping and passes the content to the OpenAI API, which extracts the core subject and writes an image prompt constrained to the defined visual style. A diffusion model then generates the image, and DepthFlow transforms the still into a looping 2.5D video with parallax motion.
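
A rough sketch of the first two steps, with requests and BeautifulSoup standing in here for GripTape's scraping, and the model name and style text as illustrative placeholders:

```python
# Steps one and two: fetch the article, then have the LLM distil it
# into a style-constrained image prompt. requests + BeautifulSoup
# stand in for GripTape; the model and style text are placeholders.
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

STYLE = ("flat vector art with Art Deco and Bauhaus influences, "
         "bold shapes, limited palette, no text, no photorealism")

def article_to_prompt(url: str) -> str:
    html = requests.get(url, timeout=30).text
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)[:8000]
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Extract the core subject of this paper and write "
                        f"a one-sentence image prompt in this style: {STYLE}"},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content.strip()
```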

One thing worth noting about the prompt engineering: getting the LLM to produce prompts that reliably hit the aesthetic target required experimentation. Defining what the image should not be — through negative prompts — played a role, but the bigger factor turned out to be model selection. AlbedoBase_XL consistently produced images that matched the intended visual language far better than more general-purpose models. It has a natural affinity for the kind of stylised, illustrative aesthetic I was after, which meant the prompt work could focus on what to depict rather than fighting the model's tendencies.
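
For illustration, here is how that positive/negative prompt pairing looks through Hugging Face diffusers rather than ComfyUI (AlbedoBase_XL is an SDXL-family checkpoint, so it loads with the SDXL pipeline; the checkpoint filename and both prompts are placeholders):

```python
# Illustrative sketch of the generation step via diffusers instead of
# ComfyUI, showing how the style prompt and negative prompt combine.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_single_file(
    "albedobaseXL_v21.safetensors",  # AlbedoBase_XL is an SDXL checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt=("flat vector Art Deco illustration of a protein folding "
            "pathway, bold geometric shapes, limited palette"),
    negative_prompt=("photorealistic, photograph, 3d render, text, "
                     "watermark, blurry"),
    num_inference_steps=30,
).images[0]
image.save("still.png")
```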

The whole thing ran locally on my M2 MacBook, with the heavier compute offloaded through API calls.

deploying as a web app

A local ComfyUI workflow is just a JSON file — it runs fine on your own machine but can't easily be shared or scaled. To turn it into something anyone could use, I needed to package the entire pipeline as a deployable cloud service.

I used Baseten and their Truss framework for this. The process involves exporting your ComfyUI workflow in API format (a slightly different JSON structure from the standard save format), then defining a config.yaml that specifies Docker build commands to install ComfyUI itself, all custom nodes, and the model weights at container build time. The workflow JSON is then served via an API endpoint that accepts templatised inputs (prompts, images) and returns outputs as base64-encoded results.
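
A trimmed sketch of the config.yaml idea is below; the node repositories, model URL, and resource choices are illustrative, not the exact production config.

```yaml
# Sketch of a Truss config.yaml for this kind of deployment; the repo
# paths, model URL and accelerator choice are illustrative placeholders.
model_name: sciviz-comfyui
python_version: py310
resources:
  accelerator: A10G
  use_gpu: true
build_commands:
  - git clone https://github.com/comfyanonymous/ComfyUI /app/ComfyUI
  - pip install -r /app/ComfyUI/requirements.txt
  # custom nodes cloned into custom_nodes, weights fetched at build time
  - wget -q -O /app/ComfyUI/models/checkpoints/albedobase_xl.safetensors <model-url>
requirements:
  - openai
  - requests
```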

This also meant moving the orchestration logic — the web scraping, the OpenAI calls, the DepthFlow post-processing — out of ComfyUI and into Python, so it could all run as part of the same cloud service rather than relying on my local machine. I also switched image generation from Leonardo.AI to my own diffusion model running on Baseten, which gave me more control over the output style and removed a dependency on an external API.
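
In Truss terms that orchestration lives in a model.py whose Model class with load() and predict() hooks is Truss's serving interface; the imported helpers below are hypothetical stand-ins for the scraping, prompting, diffusion, and DepthFlow steps described above.

```python
# Shape of the Truss model.py hosting the orchestration. Model/load/
# predict follow Truss's interface; the imported helpers are
# hypothetical stand-ins for the logic moved out of ComfyUI.
import base64

# hypothetical module collecting the pipeline steps described above
from sciviz.pipeline import article_to_prompt, run_comfy_workflow, depthflow_loop


class Model:
    def __init__(self, **kwargs):
        self._ready = False

    def load(self):
        # Runs once per replica: start ComfyUI and warm the model weights.
        self._ready = True

    def predict(self, model_input: dict) -> dict:
        prompt = article_to_prompt(model_input["article_url"])  # scrape + LLM
        still = run_comfy_workflow(prompt)                      # diffusion
        video = depthflow_loop(still)                           # 2.5D parallax
        return {"video_b64": base64.b64encode(video).decode()}
```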

The resulting web app, SciViz, ran on Streamlit and was publicly accessible for a period. It's now a completed experiment rather than an active product — but taking a ComfyUI prototype all the way to a containerised, cloud-deployed ML service was a genuinely useful exercise in what it actually takes to ship AI tooling rather than just build it locally.

closing thoughts

What this project confirmed is that the interesting challenge in generative AI isn't the generation itself — it's the design thinking around what to generate and why, and the engineering work required to make it run reliably outside your own laptop. Both of those are still very much human jobs.

Code available on GitHub →
