Translated by AI
【Blender x Python】How My Custom Tool Became a 'Copyright-Clean AI Data Forge' and the Limits of Synthetic Data (Part 3)
Introduction
Part 1 is here: Part 2 is here:
In Parts 1 and 2, I described how, with help from generative AI (Gemini), I built a Blender pipeline that fully automatically mass-produces 2D images at "specified angles" and in "multiple art styles" from my original 3D models (FBX).
In this final installment, I want to step away from the technical details and talk about a "major truth" I realized while actually operating the finished tool, and the survival strategy for creators in the AI era that has emerged from it.
Why did a simple personal efficiency tool become an "alchemy furnace for data that the current AI community craves most"? And what are the "limits of synthetic data" that only became clear after trying to auto-generate everything?
1. Transforming from an Efficiency Tool to a "Data Alchemy Furnace"
When the tool was complete, I loaded my original robot FBX and ran the batch process.
In no time, the output folder filled with a huge number of images.
For one model, 8 directions × 13 art styles = 104 images.
Each image file was tagged through its name (which doubles as metadata), for example "front_anime_style" or "angle_pixel_art."
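To make the arithmetic and the naming scheme concrete, here is a minimal sketch of how such a tagged file list could be enumerated. It is not the actual tool: only "front", "anime_style" and "pixel_art" come from the text above, while the other labels, the `build_manifest` helper, and the `{angle}_{style}.png` pattern are assumptions for illustration.

```python
from itertools import product

# Hypothetical labels: only "front", "anime_style" and "pixel_art" appear in
# the article; the rest are placeholders to reach 8 angles and 13 styles.
ANGLES = ["front", "front_right", "right", "back_right",
          "back", "back_left", "left", "front_left"]
STYLES = ["anime_style", "pixel_art"] + [f"style_{i:02d}" for i in range(3, 14)]


def build_manifest(model_name, angles=ANGLES, styles=STYLES):
    """Return one record per (angle, style) pair, with both tags in the file name."""
    return [
        {"file": f"{model_name}/{angle}_{style}.png", "angle": angle, "style": style}
        for angle, style in product(angles, styles)
    ]


manifest = build_manifest("robot")
print(len(manifest))         # number of images for one model
print(manifest[0]["file"])   # first generated file name
```

Because the angle and style are stored both in the file name and as separate fields, a training script could filter the set (for example, all "front" views) without parsing image content.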
Looking at that folder, I realized something from the perspective of a systems engineer.
"Isn't this exactly the 'structured clean dataset' that researchers and companies in image generative AI (ControlNet, LoRA) are desperate for?"
In current AI development, images scraped indiscriminately from the internet are full of noise, and their copyright status is often gray. The images this tool produced, by contrast, were a body of "highly training-efficient data" with strictly controlled angles and art styles.
2. The Value of "100% Copyright Purity" and the Reality of Data Sales
The greatest strength of this dataset is twofold: the original 3D model is "handmade by me," and the tool itself is built using only standard Blender features and my own scripts, so no third-party add-on terms (such as commercial use restrictions) are involved.
It can truly be called "clean data with 100% copyright purity": not a millimeter of anyone else's rights is mixed in, so it carries none of the risk of unauthorized training.
If I were to sell this clean data for AI training (as test marketing), what would the right business architecture be?
After some thought experiments, I concluded that "subscriptions" and "low-resolution budget versions" are bad moves in AI data sales, and that the optimal solution is "preparing a single highest-quality full-set ZIP and selling it as a one-time purchase under usage-specific licenses (terms)."
For example, the price could be a few thousand yen for personal research use and "hundreds of thousands of yen (a high price)" for incorporation into corporate or commercial AI.
Why price the commercial license so high for the same data, and deliberately separate the tiers by terms? There are three clear reasons (objectives).
- ① Corporations buy "compliance," not just data.
  Reputable companies pay for expensive commercial licenses not because they want the images themselves, but because they are buying "assurance (proof)" of zero legal risk: "we will never be sued for incorporating this into our commercial AI."
- ② To create a "basis for calculating damages" in case of unauthorized training (the greatest defensive measure, in the author's opinion).
  Under current law, even if images found on the internet are used for training without permission, it is extremely difficult to prove the "amount of damage" when claiming damages.
  However, if you publicly define "Commercial Training License: 500,000 yen" in the market in advance, then when a company is found to have engaged in unauthorized training or misuse, you have a solid calculation basis for your claim: "Since we normally license this data for 500,000 yen, the damage amount is 500,000 yen (+ penalty for unauthorized use)."
- ③ To control where the data is used for training via "sales terms (contracts)."
  By global standards, current Japanese copyright law is very permissive toward AI training. As a result, there are many cases where "copyright" alone cannot realistically stop unauthorized training on images published on the internet.
  However, if buyers must agree to "Terms of Use (License Agreement)" on a sales platform before purchasing, you can bind them with a "contract between the parties (sales contract)" that takes precedence over copyright law.
  This allows fine-grained control, such as "must not be used for AI training for specific purposes (e.g., specific genres or competitor services)," with a clear legal basis in "breach of contract (non-performance of obligation)" rather than copyright infringement.
I also considered mixing in trap images for detection, but screening can only block so much. Instead, "defining market prices and terms of use yourself and putting them on the storefront" is the most realistic and shrewd legal countermeasure a creator can take.
3. Experiment: Procedural City Generation and the Wall of Blender
"If I can create this much data with my own models, wouldn't it become the ultimate data factory if I could also auto-generate the models themselves using programs (random numbers)?"
Thinking this, I further leveraged Python and Blender to create "tools that randomly generate parts like buildings, trees, and rocks" and "a tool that arranges them randomly to build a 'vast city'."
Since these assets are generated procedurally from mathematical formulas and rules, the copyright in the generated output belongs to me, the developer.
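The idea of arranging random parts into a city can be sketched in plain Python, without Blender. This is only an illustration under assumptions: the `generate_city` function, the grid layout, and every parameter (`grid_size`, `cell`, `fill_ratio`, the height range) are hypothetical and are not taken from the actual tool.

```python
import random

PART_KINDS = ("building", "tree", "rock")  # part types named in the article


def generate_city(seed, grid_size=10, cell=4.0, fill_ratio=0.7):
    """Place random parts on a square grid and return a list of placement dicts."""
    rng = random.Random(seed)  # private RNG: the same seed reproduces the same city
    placements = []
    for gx in range(grid_size):
        for gy in range(grid_size):
            if rng.random() > fill_ratio:
                continue  # leave this cell empty (a street or plaza)
            kind = rng.choice(PART_KINDS)
            placements.append({
                "kind": kind,
                # Small jitter keeps the layout from looking perfectly regular.
                "x": gx * cell + rng.uniform(-0.5, 0.5),
                "y": gy * cell + rng.uniform(-0.5, 0.5),
                "rotation_deg": rng.choice((0, 90, 180, 270)),
                "height": round(rng.uniform(3.0, 30.0), 2) if kind == "building" else 1.0,
            })
    return placements


city = generate_city(seed=42)
print(len(city), "parts placed")
```

Keeping the seed alongside each rendered image would let any city be regenerated exactly, which is useful when a dataset buyer asks how a given image was produced.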

The flow of random city generation
Then, I fed the completed "random city" into the image conversion tool built in this series.

Pixel art style random city
The result was a "pipeline that generates clean background data without limit," but here I hit a massive wall: a limitation of Blender itself.
When I applied Freestyle (outline extraction) to a vast city model, the computational cost exploded and rendering could not finish in a realistic timeframe. A single city model is a file of about 10-20 MB (750,000 to 2.3 million polygons), and when I tried to render it, the process did not finish even after an hour.
The cause is that Blender performs Freestyle outline extraction on the CPU (and on only one core at that) rather than on the GPU.
As a side note, my main machine's CPU is a "Ryzen 9 3900XT," so a newer, higher-end CPU might be faster.
At the time, I lacked the expertise to come up with a good fix, so I had to compromise and limit the city's image conversion to "no outlines (Freestyle off)."
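The "Freestyle off" compromise could be automated with a polygon budget, roughly as below. This is a sketch, not the author's script: `configure_outlines` and the `max_polygons` threshold are hypothetical, and the only Blender-specific name used is the scene setting `render.use_freestyle`. A stand-in object replaces the real scene so the example runs outside Blender.

```python
from types import SimpleNamespace


def configure_outlines(scene, polygon_count, max_polygons=200_000):
    """Enable Freestyle only when the scene is under a polygon budget.

    `max_polygons` is a hypothetical threshold, not a measured value.
    """
    use_outlines = polygon_count <= max_polygons
    scene.render.use_freestyle = use_outlines  # Blender's per-scene Freestyle switch
    return use_outlines


# Stand-in object so the sketch runs outside Blender; inside Blender,
# pass bpy.context.scene instead.
fake_scene = SimpleNamespace(render=SimpleNamespace(use_freestyle=True))
print(configure_outlines(fake_scene, polygon_count=750_000))  # city-sized model
print(configure_outlines(fake_scene, polygon_count=40_000))   # single small model
```

Inside Blender, the polygon count could be obtained by summing `len(obj.data.polygons)` over the scene's mesh objects before calling the function.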
I would like to cover the gritty implementation details of this city generation tool in a separate spin-off article. As a possible solution, I am currently testing "high-speed line art extraction using the compositor (2D edge detection)." The results are not yet satisfactory, but I have a feeling that the processing speed, at least, can be kept within a practical timeframe.
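To show why 2D edge detection scales better than geometry-based outlines, here is a minimal pure-Python Sobel filter run on a toy depth image. The article does not say which compositor nodes or filters are being tested, so Sobel is assumed here as one common edge detector, and `sobel_edges` and its `threshold` are hypothetical names. The cost depends only on the number of pixels, not on the polygon count of the city.

```python
def sobel_edges(image, threshold=1.0):
    """Return a binary edge map (1 = line pixel) for a 2D grid of numbers.

    Border pixels stay 0 because the 3x3 kernel does not fit there.
    """
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Horizontal and vertical Sobel gradients over the 3x3 neighbourhood.
            gx = (image[y - 1][x + 1] + 2 * image[y][x + 1] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y][x - 1] - image[y + 1][x - 1])
            gy = (image[y + 1][x - 1] + 2 * image[y + 1][x] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y - 1][x] - image[y - 1][x + 1])
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[y][x] = 1
    return edges


# Toy "depth pass": a near object (1.0) on the left, far background (5.0) on the right.
depth = [[1.0, 1.0, 1.0, 5.0, 5.0, 5.0] for _ in range(5)]
for row in sobel_edges(depth):
    print("".join("#" if v else "." for v in row))
```

The edge appears only where depth jumps between the object and the background, which is the same kind of silhouette line that outline extraction is meant to draw.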
4. Core Insight: "Synthetic Data" Cannot Surpass the Real Thing
By connecting procedural generation to the auto-shooting pipeline, I obtained an environment that could churn out data without limit, as long as my PC's HDD capacity allowed. However, looking at the massive number of output images, I felt a decisive sense of unease.
What I realized was this: "These randomly generated models can never beat the 'real thing' that a human created with full effort and intent. In the end, they are just fakes (dummies)."
Groups of buildings and rocks arranged by random numbers look like a "city" at first glance. However, they critically lack the aesthetic intent (soul) that human creators put into their work, such as "the design of back alleys with a lived-in feel" or "ingenious silhouettes (composition) that catch the eye."
In the current AI industry, research is progressing on using "Synthetic Data" created by AI and algorithms to train AI. However, having mass-produced synthetic data myself, I became convinced of the following.
The tens of thousands of images generated by this pipeline are very effective for "increasing the volume (padding out)" of AI training data, but they do not contribute to "improving the quality (breaking through the limits) of AI's expressive power."
No matter how many fakes you feed it, an AI only learns the "patterns of fakes." To create the highest quality AI, "real data" that humans have poured their heart and soul into is essential after all.
5. Conclusion: Creator Survival Strategy in the AI Era
Recent AI research papers have also reported that "if a model is retrained only on its own generated output or on random data, it will eventually collapse."
After automating everything, it was, paradoxically, the "value of humans" that stood out.
I believe that the value of "primary creative works" of high quality created by creators with intent will not decrease in the AI era, but rather, their scarcity value will be redefined as something one-of-a-kind more than ever before.
Can we creators do nothing but tremble at unauthorized AI training and hide our works?
I don't think so.
Define "high-quality data that is completely clean in terms of copyright" with your own hands, set fair licenses and prices, and release it into the market. By doing so, you can create a "market price (going rate) for clean data" and urge companies to thoroughly implement compliance.
In the end, my gritty tool development as a lone systems engineer made me realize the possibility of such a "survival strategy and armament for creators in a new era."
It has been a long read, but thank you for following along with this tool development chronicle spanning Part 1, Part 2, and the Final Part!