Creating Videos from Sequential Images

Introduction
I've recently been into IoT and have been making various things. I also purchased a 3D printer, so the foundation for creating hardware is finally in place.

This time, I wanted to capture photos and audio from that IoT device and turn them into a video, so I'd like to write a memo about implementing the video generation part in an iOS app.
Creating a Video from Sequential Images
First is the method for creating a video from sequential images. The overall procedure is to create the video here, load it again separately, and finally composite it with audio.
I have also uploaded an Xcode project to GitHub where you can try out the code explained in this article.
Now, let's start with the video generation.
Concept Clarification
Here is a summary of the concepts (classes) that appear during video generation.
| Concept | Description |
|---|---|
| AVAssetWriter | A class that writes out video files frame by frame. |
| AVAssetWriterInput | Defines information for the "track" (video or audio) to be written. |
| AVAssetWriterInputPixelBufferAdaptor | An adapter that feeds CVPixelBuffers to the video track in a format suitable for writing and provides conveniences such as attaching presentation timestamps. |
| CMTime | A struct that manages frame time. Combined with the fps, it determines each frame's exact timestamp. |
The general processing flow is as follows:
- Create `AVAssetWriter` / `AVAssetWriterInput`.
- Create `AVAssetWriterInputPixelBufferAdaptor`.
- Start a session with `AVAssetWriter`.
- Convert each `UIImage` to a `CVPixelBuffer`.
- Add the converted buffers one by one to the `AVAssetWriterInputPixelBufferAdaptor`.
- Once all images are added, execute the finish processing for `AVAssetWriterInput`.
- End the `AVAssetWriter` session and export.
The process follows this flow. What needs to be done is relatively simple.
【About CVPixelBuffer】
The CV in this class name stands for Core Video. It is explained in the documentation as follows:
A Core Video pixel buffer is an image buffer that holds pixels in main memory. Applications generating frames, compressing or decompressing video, or using Core Image can all make use of Core Video pixel buffers.
Since we are trying to create a video this time, we need to convert to this buffer.
It's not that long, so first, I'll introduce the full code.
```swift
import UIKit
import AVFoundation

/// Class to create video from sequential images
class ImageToVideoExporter {

    /// Creates a video from sequential images
    /// - Parameters:
    ///   - images: Array of UIImages to be made into a video (in frame order)
    ///   - fps: Frame rate
    ///   - completion: Completion callback
    func createVideo(from images: [UIImage], fps: Int32, completion: @escaping (Bool) -> Void) {
        guard let firstImage = images.first else {
            print("Image array is empty.")
            completion(false)
            return
        }
        // Generate a unique temporary filename
        let fileName = UUID().uuidString
        let outputURL = FileManager.default.temporaryDirectory.appendingPathComponent(fileName)
        // Get video size from the first image
        let size = firstImage.size
        // Initialize AVAssetWriter
        do {
            let writer = try AVAssetWriter(outputURL: outputURL, fileType: .mov)
            // Basic video settings (codec, size, etc.)
            let settings: [String: Any] = [
                AVVideoCodecKey: AVVideoCodecType.h264,
                AVVideoWidthKey: size.width,
                AVVideoHeightKey: size.height,
            ]
            // AVAssetWriterInput: definition of the video track
            let writerInput = AVAssetWriterInput(mediaType: .video, outputSettings: settings)
            writerInput.expectsMediaDataInRealTime = false
            // Pixel buffer adapter (converts UIImage into video frames)
            let sourceBufferAttributes: [String: Any] = [
                kCVPixelBufferPixelFormatTypeKey as String: Int(kCVPixelFormatType_32ARGB),
            ]
            let adaptor = AVAssetWriterInputPixelBufferAdaptor(
                assetWriterInput: writerInput,
                sourcePixelBufferAttributes: sourceBufferAttributes
            )
            // Add input to the writer
            writer.add(writerInput)
            // Start writing
            writer.startWriting()
            writer.startSession(atSourceTime: .zero)

            // Helper function to generate a pixel buffer
            func pixelBuffer(from image: UIImage, size: CGSize) -> CVPixelBuffer? {
                var pixelBuffer: CVPixelBuffer?
                let options: [String: Any] = [
                    kCVPixelBufferCGImageCompatibilityKey as String: true,
                    kCVPixelBufferCGBitmapContextCompatibilityKey as String: true,
                ]
                let status = CVPixelBufferCreate(
                    kCFAllocatorDefault,
                    Int(size.width),
                    Int(size.height),
                    kCVPixelFormatType_32ARGB,
                    options as CFDictionary,
                    &pixelBuffer
                )
                guard status == kCVReturnSuccess, let buffer = pixelBuffer else { return nil }
                CVPixelBufferLockBaseAddress(buffer, [])
                let rgbColorSpace = CGColorSpaceCreateDeviceRGB()
                guard let context = CGContext(
                    data: CVPixelBufferGetBaseAddress(buffer),
                    width: Int(size.width),
                    height: Int(size.height),
                    bitsPerComponent: 8,
                    bytesPerRow: CVPixelBufferGetBytesPerRow(buffer),
                    space: rgbColorSpace,
                    bitmapInfo: CGImageAlphaInfo.noneSkipFirst.rawValue
                )
                else {
                    CVPixelBufferUnlockBaseAddress(buffer, [])
                    return nil
                }
                // Flip the UIKit coordinate system (to prevent upside-down rendering)
                context.translateBy(x: 0, y: size.height)
                context.scaleBy(x: 1.0, y: -1.0)
                UIGraphicsPushContext(context)
                image.draw(in: CGRect(origin: .zero, size: size))
                UIGraphicsPopContext()
                CVPixelBufferUnlockBaseAddress(buffer, [])
                return buffer
            }

            // Writing queue
            let mediaInputQueue = DispatchQueue(label: "mediaInputQueue")
            // Per-frame writing process
            writerInput.requestMediaDataWhenReady(on: mediaInputQueue) {
                var frameCount: Int64 = 0
                for image in images {
                    while !writerInput.isReadyForMoreMediaData {
                        Thread.sleep(forTimeInterval: 0.01)
                    }
                    guard let buffer = pixelBuffer(from: image, size: size) else {
                        print("Failed to create a pixel buffer.")
                        continue
                    }
                    let frameTime = CMTime(value: frameCount, timescale: fps)
                    adaptor.append(buffer, withPresentationTime: frameTime)
                    frameCount += 1
                }
                writerInput.markAsFinished()
                writer.finishWriting {
                    guard self.saveVideoToDocuments(from: outputURL) != nil else {
                        completion(false)
                        return
                    }
                    completion(writer.status == .completed)
                }
            }
        }
        catch {
            print("Failed to initialize AVAssetWriter \(error)")
            completion(false)
        }
    }

    func saveVideoToDocuments(from tempURL: URL) -> URL? {
        let fileManager = FileManager.default
        let documentsURL = fileManager.urls(for: .documentDirectory, in: .userDomainMask).first!
        // Make the filename unique with a date
        let fileName = "video_\(Date().timeIntervalSince1970).mov"
        let destinationURL = documentsURL.appendingPathComponent(fileName)
        do {
            try fileManager.copyItem(at: tempURL, to: destinationURL)
            print("Saved a temp video to persistent folder. [\(destinationURL)]")
            return destinationURL
        }
        catch {
            print("Failed to save a temp video to persistent folder.")
            return nil
        }
    }

    func fetchSavedVideos() -> [URL] {
        let fileManager = FileManager.default
        let documentsURL = fileManager.urls(for: .documentDirectory, in: .userDomainMask).first!
        do {
            let files = try fileManager.contentsOfDirectory(at: documentsURL, includingPropertiesForKeys: nil)
            return files.filter { $0.pathExtension == "mov" }
        }
        catch {
            print("Failed to get video list. \(error)")
            return []
        }
    }
}
```
The most important part of this class is the createVideo method. The rest is for file saving and other processes, so I will skip the explanation for those.
The createVideo Method for Adding Sequential Images to Assets
Let's take a look at the createVideo method right away.
Here is the signature:
```swift
func createVideo(from images: [UIImage], fps: Int32, completion: @escaping (Bool) -> Void)
```
It receives a list of sequential images, the FPS for the video to be generated, and a function to notify completion. FPS is crucial information for determining the video's duration.
The size of the first image in the array is determined as the video size.
```swift
// Get video size from the first image
let size = firstImage.size
```
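As a concrete example of how the fps and the frame count together determine the video's length (the numbers below are hypothetical):

```swift
// Each frame i is later stamped CMTime(value: i, timescale: fps),
// so N frames at a given fps produce a video of N / fps seconds.
let frameCount = 150          // e.g. 150 sequential images
let fps = 30                  // frame rate passed to createVideo
let durationSeconds = Double(frameCount) / Double(fps)
print(durationSeconds)        // 5.0
```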
Creating Video with AVAssetWriter
The main class is AVAssetWriter. As the name suggests, it creates video assets. At initialization, you specify the path to export to and the file type. In this case, we use the .mov format.
```swift
let writer = try AVAssetWriter(outputURL: outputURL, fileType: .mov)
```
Creating a Video Track
I'll skip the detailed explanation, but a video file packages an encoded video stream and an encoded audio stream together in a format called a "container." The key point is that video and audio are stored separately.
And each is represented here as a "track." Let's look at the part where we create this "video track."
```swift
// Basic video settings (codec, size, etc.)
let settings: [String: Any] = [
    AVVideoCodecKey: AVVideoCodecType.h264,
    AVVideoWidthKey: size.width,
    AVVideoHeightKey: size.height,
]
// AVAssetWriterInput: definition of the video track
let writerInput = AVAssetWriterInput(mediaType: .video, outputSettings: settings)
writerInput.expectsMediaDataInRealTime = false
// Pixel buffer adapter (converts UIImage into video frames)
let sourceBufferAttributes: [String: Any] = [
    kCVPixelBufferPixelFormatTypeKey as String: Int(kCVPixelFormatType_32ARGB),
]
let adaptor = AVAssetWriterInputPixelBufferAdaptor(
    assetWriterInput: writerInput,
    sourcePixelBufferAttributes: sourceBufferAttributes
)
// Add input to the writer
writer.add(writerInput)
// Start writing
writer.startWriting()
writer.startSession(atSourceTime: .zero)
```
Video and audio have specifications called "codecs" that define how to represent them. More specifically, it is "technology for compressing (encoding) audio and video data and for restoring (decoding) the compressed data."
This is determined in the settings dictionary at the top. The AVAssetWriterInput is then responsible for the track itself.
I'll discuss the adapter later. Once the instance for track creation is generated, it is added to the AVAssetWriter. In other words, we have now added a video track.
After adding it, writing starts with startWriting.
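For reference, the settings dictionary is also where you would pick a different codec or bitrate. A hedged variant, not part of the original project, using standard AVFoundation keys:

```swift
import AVFoundation

// Sketch: alternative output settings (hypothetical values).
// AVVideoCompressionPropertiesKey nests encoder options such as the
// average bitrate; lower bitrates trade quality for smaller files.
let size = CGSize(width: 1280, height: 720)
let hevcSettings: [String: Any] = [
    AVVideoCodecKey: AVVideoCodecType.hevc,
    AVVideoWidthKey: size.width,
    AVVideoHeightKey: size.height,
    AVVideoCompressionPropertiesKey: [
        AVVideoAverageBitRateKey: 4_000_000  // ~4 Mbps
    ],
]
```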
Writing Process
The actual writing process is performed here:
```swift
// Per-frame writing process
writerInput.requestMediaDataWhenReady(on: mediaInputQueue) {
    var frameCount: Int64 = 0
    for image in images {
        while !writerInput.isReadyForMoreMediaData {
            Thread.sleep(forTimeInterval: 0.01)
        }
        guard let buffer = pixelBuffer(from: image, size: size) else {
            print("Failed to create a pixel buffer.")
            continue
        }
        let frameTime = CMTime(value: frameCount, timescale: fps)
        adaptor.append(buffer, withPresentationTime: frameTime)
        frameCount += 1
    }
    writerInput.markAsFinished()
    writer.finishWriting {
        guard self.saveVideoToDocuments(from: outputURL) != nil else {
            completion(false)
            return
        }
        completion(writer.status == .completed)
    }
}
```
When the writing is ready with the requestMediaDataWhenReady method, we write the sequential images.
What is being done here is roughly as follows:
- Convert `UIImage` to `CVPixelBuffer`
- Determine the target image's time within the video using Core Media's `CMTime`
- Append the above two to the adapter
Once all sequential images are added, the markAsFinished method is called to notify the end. Then, all data is actually written out with finishWriting, and the caller is notified of the completion via the callback called after it finishes.
The saveVideoToDocuments method is a custom implementation that simply moves the file from the temporary folder to the Documents folder.
Conversion to Pixel Buffer and Writing
Regarding the adapter and the conversion to pixel buffers mentioned in the previous section, images come in various formats, and conversion may be necessary depending on the object being handled. In this case, it is necessary to convert the UIImage representation into a CVPixelBuffer.
This is achieved by the pixelBuffer function defined as an internal function.
```swift
func pixelBuffer(from image: UIImage, size: CGSize) -> CVPixelBuffer? {
    var pixelBuffer: CVPixelBuffer?
    let options: [String: Any] = [
        kCVPixelBufferCGImageCompatibilityKey as String: true,
        kCVPixelBufferCGBitmapContextCompatibilityKey as String: true,
    ]
    let status = CVPixelBufferCreate(
        kCFAllocatorDefault,
        Int(size.width),
        Int(size.height),
        kCVPixelFormatType_32ARGB,
        options as CFDictionary,
        &pixelBuffer
    )
    guard status == kCVReturnSuccess, let buffer = pixelBuffer else { return nil }
    CVPixelBufferLockBaseAddress(buffer, [])
    let rgbColorSpace = CGColorSpaceCreateDeviceRGB()
    guard let context = CGContext(
        data: CVPixelBufferGetBaseAddress(buffer),
        width: Int(size.width),
        height: Int(size.height),
        bitsPerComponent: 8,
        bytesPerRow: CVPixelBufferGetBytesPerRow(buffer),
        space: rgbColorSpace,
        bitmapInfo: CGImageAlphaInfo.noneSkipFirst.rawValue
    )
    else {
        CVPixelBufferUnlockBaseAddress(buffer, [])
        return nil
    }
    // Flip the UIKit coordinate system (to prevent upside-down rendering)
    context.translateBy(x: 0, y: size.height)
    context.scaleBy(x: 1.0, y: -1.0)
    UIGraphicsPushContext(context)
    image.draw(in: CGRect(origin: .zero, size: size))
    UIGraphicsPopContext()
    CVPixelBufferUnlockBaseAddress(buffer, [])
    return buffer
}
```
What is being done here is creating a CVPixelBuffer of the same size as the UIImage passed as an argument, locking the address of the allocated buffer, and then creating a CGContext by specifying the parameters required for a CVPixelBuffer.
Finally, it writes to the buffer using UIImage.draw.
The process is completed by returning the buffer after writing is finished.
The Role of the Adapter
Now, about the adapter mentioned earlier. Here is an explanation from Gemini about what it does:
The primary role of `AVAssetWriterInputPixelBufferAdaptor` is to receive `CVPixelBuffer`s as frame data and pass them to the associated `AVAssetWriterInput` (the video track input) with efficient and accurate timestamps. This adapter is necessary primarily for the following two reasons.

A. Bridging Data Formats

`AVAssetWriterInput` is the object that manages input to a video track, but it requires raw frame data in the form of a `CMSampleBuffer` or `CVPixelBuffer`. However, this code is attempting to create a video from an array of `UIImage` objects. `UIImage` is a UIKit image object and is not the media frame format (specifically, an efficient pixel buffer) required for video writing. By using the adapter (`AVAssetWriterInputPixelBufferAdaptor`), the `CVPixelBuffer` converted from a `UIImage` by the helper function (`pixelBuffer(from:size:)`) can be provided in a form that `AVAssetWriterInput` can handle directly (the pixel buffer format).

B. Efficient Frame Management and Queuing

`AVAssetWriterInputPixelBufferAdaptor` provides utilities for smoothly supplying frame data to `AVAssetWriterInput`, especially when the video frame source is not real-time (as in this code, where the video is generated from an existing array of `UIImage` objects). The standard `append(_:)` method of `AVAssetWriterInput` requires a `CMSampleBuffer`, but the adapter's `append(_:withPresentationTime:)` method lets you supply frame data at accurate timings by combining the pixel buffer (`CVPixelBuffer`) with the presentation time (`CMTime`). This meets the low-level requirements of video writing while saving the effort of manually constructing complex `CMSampleBuffer` objects.

In summary, the adapter is an essential component for bridging `UIImage`s into the "pixel buffer" format used for video frames and for streaming them into the writing process at accurate timings.
In short, the UIImage must be converted to a CVPixelBuffer and then into the data needed for video generation, and AVAssetWriterInputPixelBufferAdaptor is what performs that final step efficiently.
If you don't need audio, this will create a playable video.
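For reference, calling the class above might look like this (a sketch; `capturedImages` is a stand-in for whatever image array your app produces):

```swift
// Usage sketch, assuming an array of UIImages from your own capture pipeline.
let exporter = ImageToVideoExporter()
exporter.createVideo(from: capturedImages, fps: 30) { success in
    // The callback fires on the writer's queue; hop to the main queue before touching UI.
    print(success ? "Video created" : "Video creation failed")
}
```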
Creating a Video with Audio
Next, I will explain how to add an audio track to the video track created from the sequential images above. For the audio, I assume that .wav or .mp3 files are loaded and integrated.
Since the part that creates a video from sequential images is the same, I will focus the explanation on adding the audio track.
Several more classes are involved in audio compositing. Summarizing them beforehand, they are as follows:
| Concept | Description |
|---|---|
| AVMutableComposition | A class that composites video and audio. |
| AVMutableCompositionTrack | An editable "track" (video or audio) within the composition. |
| AVURLAsset | Not directly related to video creation, but since we are using a temporarily saved video this time, we use this class to load the saved video. |
The following shows a flow where the video is created first, and then the audio is added in its callback.
```swift
/// Class to generate a video with audio from sequential images
class ImageToVideoWithAudioExporter {

    func createVideoWithAudio(from images: [UIImage],
                              audioURL: URL,
                              fps: Int32,
                              completion: @escaping (Bool, URL?) -> Void) {
        // Video creation. The process is almost the same as the one described previously.
        // (tempOnlyVideoURL is a temporary file URL prepared beforehand.)
        // Add audio once the video creation is complete.
        createVideo(from: images, fps: fps, outputURL: tempOnlyVideoURL) { success in
            guard success else {
                DispatchQueue.main.async { completion(false, nil) }
                return
            }
            Task {
                // The process here will be described later to keep it readable
            }
        }
    }
}
```
Let's look at the processing inside the callback.
```swift
// Get the video since it is saved temporarily
let videoAsset = AVURLAsset(url: tempOnlyVideoURL)
// Assume that the audio is retrieved from something previously saved
let audioAsset = AVURLAsset(url: audioURL)
do {
    // Load durations with async/await
    let videoLength = try await videoAsset.load(.duration)
    let audioLength = try await audioAsset.load(.duration)
    let duration = CMTimeMinimum(videoLength, audioLength)
    let mixComposition = AVMutableComposition()
    // Video track (modern async loading)
    let videoTracks = try await videoAsset.loadTracks(withMediaType: .video)
    guard let videoTrack = videoTracks.first else {
        DispatchQueue.main.async { completion(false, nil) }
        return
    }
    let videoCompositionTrack = mixComposition.addMutableTrack(withMediaType: .video,
                                                               preferredTrackID: kCMPersistentTrackID_Invalid)
    do {
        try videoCompositionTrack?.insertTimeRange(
            CMTimeRange(start: .zero, duration: videoLength),
            of: videoTrack,
            at: .zero
        )
    }
    catch {
        print("Failed to insert video track. \(error)")
        DispatchQueue.main.async { completion(false, nil) }
        return
    }
    // Audio track
    if let audioTrack = try await audioAsset.loadTracks(withMediaType: .audio).first {
        let audioCompositionTrack = mixComposition.addMutableTrack(withMediaType: .audio,
                                                                   preferredTrackID: kCMPersistentTrackID_Invalid)
        do {
            try audioCompositionTrack?.insertTimeRange(
                CMTimeRange(start: .zero, end: duration),
                of: audioTrack,
                at: .zero
            )
        }
        catch {
            print("Failed to insert audio track. \(error)")
        }
        // If the audio is shorter, add silent padding
        if audioLength < videoLength {
            let silenceDuration = videoLength - audioLength
            let silenceTrack = mixComposition.addMutableTrack(withMediaType: .audio,
                                                              preferredTrackID: kCMPersistentTrackID_Invalid)
            // Add an empty audio track (silence)
            silenceTrack?.insertEmptyTimeRange(CMTimeRange(start: audioLength, duration: silenceDuration))
        }
    }
    // Step 3. Exporting
    try? FileManager.default.removeItem(at: resultURL)
    guard let exporter = AVAssetExportSession(asset: mixComposition, presetName: AVAssetExportPresetHighestQuality) else {
        DispatchQueue.main.async { completion(false, nil) }
        return
    }
    do {
        try await exporter.export(to: resultURL, as: .mov)
        for await state in exporter.states(updateInterval: 0.2) {
            switch state {
            case .exporting(let progress):
                print("In progress \(progress.fractionCompleted)")
            case .pending, .waiting:
                print("Pending or waiting to export")
            default:
                break
            }
        }
        guard let outputURL = self.saveVideoToDocuments(from: resultURL) else {
            completion(false, nil)
            return
        }
        completion(true, outputURL)
    }
    catch {
        print("Failed to export video file. \(error)")
        completion(false, nil)
    }
}
catch {
    print("Failed to load durations: \(error)")
    DispatchQueue.main.async { completion(false, nil) }
    return
}
```
As mentioned earlier, a video consists of two parts: a "video track" and an "audio track." By combining them into one, it can be played in video players and other software.
Therefore, we retrieve the data for the video track and the audio track and use AVMutableComposition to combine them into one.
Adding a Video Track
First, we add the video track. We use the addMutableTrack method of the AVMutableComposition class to add a video track.
```swift
// Retrieving the video track
let videoTracks = try await videoAsset.loadTracks(withMediaType: .video)
guard let videoTrack = videoTracks.first else {
    DispatchQueue.main.async { completion(false, nil) }
    return
}
let videoCompositionTrack = mixComposition.addMutableTrack(withMediaType: .video,
                                                           preferredTrackID: kCMPersistentTrackID_Invalid)
do {
    try videoCompositionTrack?.insertTimeRange(
        CMTimeRange(start: .zero, duration: videoLength),
        of: videoTrack,
        at: .zero
    )
}
catch {
    print("Failed to insert video track. \(error)")
    DispatchQueue.main.async { completion(false, nil) }
    return
}
```
As you can see from the name "Mutable," it is an editable class. If you have used video editing software before, you might have experience adding and editing multiple video and audio tracks, and this class allows you to add multiple tracks in a similar way.
Therefore, it is important to note that you need to specify "which time range" of the track you are adding. In the code above, insertTimeRange handles this.
Since there is only one video track this time, we are adding the entire duration of the video starting from the 0-second position.
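As a hypothetical variation (not part of the project), narrowing the range would place only a slice of the source track into the composition, for example seconds 1 to 3, positioned at the start. This would go inside the same do/catch as above:

```swift
// Sketch: insert a 2-second slice starting at the 1-second mark of the
// source track (timescale 600 is the conventional video timescale).
let sliceStart = CMTime(seconds: 1.0, preferredTimescale: 600)
let sliceDuration = CMTime(seconds: 2.0, preferredTimescale: 600)
try videoCompositionTrack?.insertTimeRange(
    CMTimeRange(start: sliceStart, duration: sliceDuration),
    of: videoTrack,
    at: .zero
)
```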
Adding an Audio Track
Next, we add the audio track. The method for adding it is the same as for the video track.
A unique point in this implementation is that we add silent padding if the audio is shorter than the video.
```swift
let audioCompositionTrack = mixComposition.addMutableTrack(withMediaType: .audio,
                                                           preferredTrackID: kCMPersistentTrackID_Invalid)
do {
    try audioCompositionTrack?.insertTimeRange(
        CMTimeRange(start: .zero, end: duration),
        of: audioTrack,
        at: .zero
    )
}
catch {
    print("Failed to insert audio track. \(error)")
}
// If the audio is shorter, add silent padding
if audioLength < videoLength {
    let silenceDuration = videoLength - audioLength
    let silenceTrack = mixComposition.addMutableTrack(withMediaType: .audio,
                                                      preferredTrackID: kCMPersistentTrackID_Invalid)
    // Add an empty audio track (silence)
    silenceTrack?.insertEmptyTimeRange(CMTimeRange(start: audioLength, duration: silenceDuration))
}
```
As mentioned earlier, this is exactly where we have added "multiple audio tracks."
Exporting the Video File
Finally, we export it as a video file.
```swift
try? FileManager.default.removeItem(at: resultURL)
guard let exporter = AVAssetExportSession(asset: mixComposition, presetName: AVAssetExportPresetHighestQuality) else {
    DispatchQueue.main.async { completion(false, nil) }
    return
}
do {
    try await exporter.export(to: resultURL, as: .mov)
    for await state in exporter.states(updateInterval: 0.2) {
        switch state {
        case .exporting(let progress):
            print("In progress \(progress.fractionCompleted)")
        case .pending, .waiting:
            print("Pending or waiting to export")
        default:
            break
        }
    }
    guard let outputURL = self.saveVideoToDocuments(from: resultURL) else {
        completion(false, nil)
        return
    }
    completion(true, outputURL)
}
catch {
    print("Failed to export video file. \(error)")
    completion(false, nil)
}
```
We use the AVAssetExportSession class for exporting.
You can use exporter.states to check the progress and display it in the UI if necessary.
```swift
for await state in exporter.states(updateInterval: 0.2) {
    switch state {
    case .exporting(let progress):
        print("In progress \(progress.fractionCompleted)")
    case .pending, .waiting:
        print("Pending or waiting to export")
    default:
        break
    }
}
```
When this loop ends, it means the writing is finished, so in the subsequent processing, the callback is called to notify the caller of the completion.
completion(true, outputURL)
This completes the process for creating a video with audio.
Conclusion
When creating XR experiences, there are many times when you want to record things like the participant's perspective.
If you're using Unity, there are recording assets available, but they are often expensive and can be difficult to use casually.
Since this implementation is in Swift, it's limited to iOS or visionOS, but if the platform is restricted, it's possible to build a simple recording function yourself.
Of course, assets are more feature-rich in terms of supported platforms and export file formats compared to building it from scratch, but conversely, there are many cases where you don't need that much functionality.
In such cases, keeping track of these foundational technologies as an option can be very useful.