From RGB to Lab: Addressing Color Artifacts in AI Image Compositing


Introduction

While background replacement is a staple of image editing, achieving production-grade results remains a significant challenge for developers. Many existing tools work like “black boxes,” which means we have little control over the balance between quality and speed needed for a real application. I ran into these difficulties while building VividFlow. The project is mainly focused on Image-to-Video generation, but it also provides a feature for users to swap backgrounds using AI prompts.

To make the system more reliable across different types of images, I ended up focusing on three technical areas that made a significant difference in my results:

  • A Three-Tier Fallback Strategy: I found that orchestrating BiRefNet, U²-Net, and traditional gradients ensures the system always produces a usable mask, even if the primary model fails.
  • Correction in Lab Color Space: Moving the process to Lab space helped me remove the “yellow halo” artifacts that often appear when blending images in standard RGB space.
  • Specific Logic for Cartoon Art: I added a dedicated pipeline to detect and preserve the sharp outlines and flat colors that are unique to illustrations.

These are the approaches that worked for me when I deployed the app on HuggingFace Spaces. In this article, I want to share the logic and some of the math behind these choices, and how they helped the system handle the messy variety of real-world images more consistently.

1. The Problem with RGB: Why Backgrounds Leave a Trace

Standard RGB alpha blending tends to leave a stubborn visual mess in background replacement. When you blend a portrait shot against a colored wall into a new background, the edge pixels usually hold onto some of that original color. This is most obvious when the original and new backgrounds have contrasting colors, like swapping a warm yellow wall for a cool blue sky. You often end up with an unnatural yellowish tint that immediately gives away the fact that the image is a composite. This is why even when your segmentation mask is pixel-perfect, the final composite still looks obviously fake — the color contamination betrays the edit.

The issue is rooted in how RGB blending works. Standard alpha compositing treats each color channel independently, calculating weighted averages without considering how humans actually perceive color. To see this problem concretely, consider the example visualized in Figure 1 below. Take a dark hair pixel (RGB 80, 60, 40) captured against a yellow wall (RGB 200, 180, 120). During the photo shoot, light from that wall reflects onto the hair edges, so the captured edge pixel is already a near-even mix of hair and wall light, a muddy average around RGB (140, 120, 80). A standard 50% RGB blend with the new blue background simply averages each channel and carries that yellow-tinted value into the composite—exactly the yellowish tint problem we want to eliminate. Instead of a clean transition, this contamination breaks the illusion of natural integration.

Figure 1. RGB versus Lab space blending comparison at pixel level.

As demonstrated in the figure above, the middle panel shows how RGB blending produces a muddy result that retains the yellowish tint from the original wall. The rightmost panel reveals the solution: switching to Lab color space before the final blend allows surgical removal of this contamination. Lab space separates lightness (L channel) from chroma (a and b channels), enabling targeted corrections of color casts without disturbing the luminance that defines object edges. The corrected result (RGB 75, 55, 35) achieves natural hair darkness while eliminating yellow influence through vector operations in the ab plane, a mathematical process I’ll detail in Section 4.
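To make the per-channel behavior concrete, here is a minimal numpy sketch of standard alpha compositing at a contaminated edge pixel. The hair and wall values are the illustrative ones from Figure 1; the blue background value is an assumption I added for demonstration.

```python
import numpy as np

# Illustrative pixel values: hair and wall come from Figure 1, the blue
# background is an assumed value for demonstration.
true_hair = np.array([80, 60, 40], dtype=np.float32)
yellow_wall = np.array([200, 180, 120], dtype=np.float32)
new_blue_bg = np.array([100, 150, 220], dtype=np.float32)

# At the soft mask edge the captured pixel is already a mix of hair and
# reflected wall light -- roughly (140, 120, 80).
captured_edge = 0.5 * true_hair + 0.5 * yellow_wall

# Standard alpha compositing averages each channel independently, so the
# yellow cast baked into the captured pixel is carried into the composite.
alpha = 0.5
rgb_blend = alpha * captured_edge + (1.0 - alpha) * new_blue_bg
print(captured_edge, rgb_blend)
```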

2. System Architecture: Orchestrating the Workflow

The background replacement pipeline orchestrates several specialized components in a carefully designed sequence that prioritizes both robustness and efficiency. The architecture ensures that even when individual models encounter challenging scenarios, the system gracefully degrades to alternative approaches while maintaining output quality without wasting GPU resources.

Following the architecture diagram, the pipeline executes through six distinct stages:

Image Preparation: The system resizes and normalizes input images to a maximum dimension of 1024 pixels, ensuring compatibility with diffusion model architectures while maintaining aspect ratio.

Semantic Analysis: An OpenCLIP vision encoder analyzes the image to detect subject type (person, animal, object, nature, or building) and measures color temperature characteristics (warm versus cool tones).

Prompt Enhancement: Based on the semantic analysis, the system augments the user’s original prompt with contextually appropriate lighting descriptors (golden hour, soft diffused, bright daylight) and atmospheric qualities (professional, natural, elegant, cozy).

Background Generation: Stable Diffusion XL synthesizes a new background scene using the enhanced prompt, configured with a DPM-Solver++ scheduler running for twenty-five inference steps at guidance scale 7.5 (a minimal configuration sketch follows this stage list).

Robust Mask Generation: The system attempts three progressively simpler approaches to extract the foreground. BiRefNet provides high-quality semantic segmentation as the first choice. When BiRefNet produces insufficient results, U²-Net through rembg offers reliable general-purpose extraction. Traditional gradient-based methods serve as the final fallback, guaranteeing mask production regardless of input complexity.

Perceptual Color Blending: The fusion stage operates in Lab color space to enable precise removal of background color contamination through chroma vector deprojection. Adaptive suppression strength scales with each pixel’s color similarity to the original background. Multi-scale edge refinement produces natural transitions around fine details, and the result is composited back to standard color space with proper gamma correction.
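As referenced in the background-generation stage above, here is a minimal configuration sketch of that stage using the standard diffusers API; the model identifier and prompt text are illustrative rather than the exact production code.

```python
import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler

# Load SDXL and switch to a DPM-Solver++ scheduler, as described in stage 4.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, algorithm_type="dpmsolver++"
)

# An enhanced prompt of the kind produced by stage 3 (illustrative).
enhanced_prompt = "a cozy living room, soft diffused lighting, professional photo"

background = pipe(
    prompt=enhanced_prompt,
    num_inference_steps=25,  # twenty-five inference steps
    guidance_scale=7.5,
    height=1024,
    width=1024,
).images[0]
```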

3. The Three-Tier Mask Strategy: Quality Meets Reliability

In background replacement, the mask quality is the ceiling: your final image can never look better than the mask it’s built on. However, relying on just one segmentation model is a recipe for failure when dealing with real-world variety. I found that a three-tier fallback strategy, sketched in code after the list below, was the best way to ensure every user gets a usable result, regardless of the image type.

  1. BiRefNet (The Quality Leader): This is the primary choice for complex scenes. If you look at the left panel of the comparison image, notice how cleanly it handles the individual curly hair strands. It uses a bilateral architecture that balances high-level semantic understanding with fine-grained detail. In my experience, it’s the only model that consistently avoids the “choppy” look around flyaway hair.
  2. U²-Net via rembg (The Balanced Fallback): When BiRefNet struggles (often with cartoons or very small subjects), the system automatically switches to U²-Net. Looking at the middle panel, the hair edges are a bit “fuzzier” and less detailed than BiRefNet, but the overall body shape is still very accurate. I added custom alpha stretching and morphological refinements to this stage to help keep extremities like hands and feet from being accidentally clipped.
  3. Traditional Gradients (The “Never Fail” Safety Net): As a final resort, I use Sobel and Laplacian operators to find edges based on pixel intensity. The right panel shows the result: it’s much simpler and misses the fine hair textures, but it is guaranteed to complete without a model error. To make this look professional, I apply a guided filter using the original image as a signal, which helps smooth out noise while keeping the structural edges sharp.
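Here is the fallback logic sketched as code. The rembg and OpenCV calls are real library APIs, while birefnet_mask() is a hypothetical helper standing in for the BiRefNet wrapper; the coverage check and the Sobel-based tier are simplified stand-ins for the production versions.

```python
import cv2
import numpy as np
from rembg import remove  # U²-Net based extraction


def mask_is_usable(mask: np.ndarray) -> bool:
    # Assumed quality check: the mask must cover a non-trivial area.
    return mask is not None and mask.mean() > 10


def gradient_fallback_mask(rgb: np.ndarray) -> np.ndarray:
    """Tier 3: a simple intensity-gradient mask that never fails."""
    gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    mag = cv2.normalize(cv2.magnitude(gx, gy), None, 0, 255, cv2.NORM_MINMAX)
    _, mask = cv2.threshold(mag.astype(np.uint8), 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Close small gaps so the mask covers the subject body, not just its edges.
    return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, np.ones((7, 7), np.uint8))


def extract_mask(rgb: np.ndarray) -> np.ndarray:
    # Tier 1: BiRefNet (hypothetical wrapper around the model).
    try:
        mask = birefnet_mask(rgb)
        if mask_is_usable(mask):
            return mask
    except Exception:
        pass
    # Tier 2: U²-Net via rembg; for ndarray input, remove() returns RGBA.
    try:
        mask = np.asarray(remove(rgb))[:, :, 3]
        if mask_is_usable(mask):
            return mask
    except Exception:
        pass
    # Tier 3: traditional gradients -- guaranteed to produce something.
    return gradient_fallback_mask(rgb)
```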

4. Perceptual Color Space Operations for Targeted Contamination Removal

The solution to RGB blending’s color contamination problem lies in choosing a color space where luminance and chromaticity separate cleanly. Lab color space, standardized by the CIE (2004), provides exactly this property through its three-channel structure: the L channel encodes lightness on a 0–100 scale, while the a and b channels represent color opponent dimensions spanning green-to-red and blue-to-yellow respectively. Unlike RGB where all three channels couple together during blending operations, Lab allows surgical manipulation of color information without disturbing the brightness values that define object boundaries.

The mathematical correction operates through vector projection in the ab chromatic plane. To understand this operation geometrically, consider Figure 3 below, which visualizes the process in two-dimensional ab space. When an edge pixel exhibits contamination from a yellow background, its measured chroma vector C represents the pixel’s color coordinates (a, b) in the ab plane, pointing partially toward the yellow direction. In the diagram, the contaminated pixel appears as a red arrow with coordinates (a = 12, b = 28), while the background’s yellow chroma vector B appears as an orange arrow pointing toward (a = 5, b = 45). The key insight is that the portion of C that aligns with B represents unwanted background influence, while the perpendicular portion represents the subject’s true color.

Figure 3. Vector projection in Lab ab chromatic plane removing yellow background contamination.

As illustrated in the figure above, the system removes contamination by projecting C onto the normalized background direction and subtracting this projection. Mathematically, the corrected chroma vector becomes:

\[\mathbf{C}' = \mathbf{C} - (\mathbf{C} \cdot \hat{\mathbf{B}}) \hat{\mathbf{B}}\]

where C · B̂ denotes the dot product that measures how much of C lies along the normalized background direction B̂. The yellow dashed line in Figure 3 represents this projection component, showing the contamination magnitude of 15 units along the background direction. The purple dashed arrow demonstrates the subtraction operation that yields the corrected green arrow C′ = (a = 4, b = 8). This corrected chroma exhibits a substantially reduced yellow component (from b = 28 down to b = 8) while maintaining the original red-green balance (a remains near its original value). The operation performs precisely what visual inspection suggests is needed: it removes only the color component parallel to the background direction while preserving the perpendicular components that encode the subject’s inherent coloration.

Critically, this correction happens exclusively in the chromatic dimensions while the L channel remains untouched throughout the operation. This preservation of luminance maintains the edge structure that viewers perceive as natural boundaries between foreground and background elements. Converting the corrected Lab values back to RGB space produces the final pixel color that integrates cleanly with the new background without visible contamination artifacts.
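As a concrete illustration, here is a minimal numpy and scikit-image sketch of the deprojection, assuming Lab arrays shaped (H, W, 3) as produced by skimage.color.rgb2lab; the background chroma vector and the sample patch are illustrative values rather than the app’s actual code.

```python
import numpy as np
from skimage.color import rgb2lab, lab2rgb


def deproject_background_chroma(lab: np.ndarray, bg_ab: np.ndarray) -> np.ndarray:
    """Remove the chroma component parallel to the background direction."""
    out = lab.copy()
    ab = lab[..., 1:3]                              # per-pixel chroma vectors C
    b_hat = bg_ab / (np.linalg.norm(bg_ab) + 1e-8)  # normalized background B-hat
    proj = ab @ b_hat                               # C . B-hat (contamination along B)
    out[..., 1:3] = ab - proj[..., None] * b_hat    # C' = C - (C . B-hat) B-hat
    # The L channel (out[..., 0]) is never touched, so edge luminance survives.
    return out


# Illustrative usage: a small yellow-contaminated patch against a wall whose
# chroma points roughly toward (a = 5, b = 45).
patch = np.full((4, 4, 3), [140, 120, 80], dtype=np.float64) / 255.0
lab = rgb2lab(patch)
corrected_rgb = lab2rgb(deproject_background_chroma(lab, np.array([5.0, 45.0])))
```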

5. Adaptive Correction Strength Through Color Distance Metrics

Simply removing all background color from edges risks overcorrection: edges can become artificially gray or desaturated, losing natural warmth. To prevent this, I implemented adaptive strength modulation based on how contaminated each pixel actually is, using the ΔE color distance metric:

\[\Delta E = \sqrt{(\Delta L)^2 + (\Delta a)^2 + (\Delta b)^2}\]

where ΔE below 1 is imperceptible while values above 5 indicate clearly distinguishable colors. Pixels with ΔE below 18 from the background are classified as contaminated candidates for correction.

The correction strength follows an inverse relationship, with pixels very close to the background color receiving strong correction while distant pixels get gentle treatment:

\[S = 0.85 \times \max\left(0, 1 - \frac{\Delta E}{18}\right)\]

This formula ensures strength gracefully tapers to zero as ΔE approaches the threshold, avoiding sharp discontinuities.
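To show how the two formulas combine, the following sketch applies the ΔE gate and scales the deprojection from Section 4 by S per pixel; the threshold of 18 and the 0.85 ceiling follow the formulas above, while scaling the removed projection component by S is one plausible way to apply the strength, not necessarily the exact production logic.

```python
import numpy as np


def adaptive_strength(lab: np.ndarray, bg_lab: np.ndarray,
                      threshold: float = 18.0,
                      max_strength: float = 0.85) -> np.ndarray:
    """Per-pixel S = 0.85 * max(0, 1 - dE / 18), with dE as Euclidean distance in Lab."""
    delta_e = np.linalg.norm(lab - bg_lab, axis=-1)
    return max_strength * np.clip(1.0 - delta_e / threshold, 0.0, 1.0)


def adaptive_spill_correction(lab: np.ndarray, bg_lab: np.ndarray) -> np.ndarray:
    out = lab.copy()
    ab = lab[..., 1:3]
    b_hat = bg_lab[1:3] / (np.linalg.norm(bg_lab[1:3]) + 1e-8)
    proj = ab @ b_hat                          # contamination along the background direction
    strength = adaptive_strength(lab, bg_lab)  # near 0.85 close to the background, 0 far away
    out[..., 1:3] = ab - (strength * proj)[..., None] * b_hat
    return out
```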

Figure 4 illustrates this through a zoomed comparison of hair edges against different backgrounds. The left panel shows the original image with yellow wall contamination visible along the hair boundary. The middle panel reveals how standard RGB blending preserves a yellowish rim that immediately betrays the composite as artificial. The right panel shows the Lab-based correction eliminating the color spill while maintaining natural hair texture: the edge now integrates cleanly with the blue background because contamination is targeted precisely at the mask boundary without affecting legitimate subject color.

Figure 4. Hair edge comparison: Original (left), RGB blend (middle), Lab adaptive correction (right).

6. Cartoon-Specific Enhancement for Line Art Preservation

Cartoon and line-art images present unique challenges for generic segmentation models trained on photographic data. Unlike natural photographs with gradual transitions, cartoon characters feature sharp black outlines and flat color fills. Standard deep learning segmentation often misclassifies black outlines as background while giving insufficient coverage to solid fill areas, creating visible gaps in composites.

I developed an automatic detection pipeline that activates when the system identifies line-art characteristics through three features: edge density (the ratio of Canny edge pixels), color simplicity (the number of unique colors relative to image area), and dark pixel prevalence (the fraction of pixels with luminance below 50). When all three thresholds are met, specialized enhancement routines trigger.
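As a sketch, the detection heuristics might look like the following; the threshold values here are illustrative assumptions rather than the exact ones used in the deployed app.

```python
import cv2
import numpy as np


def looks_like_line_art(rgb: np.ndarray) -> bool:
    gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
    area = gray.shape[0] * gray.shape[1]

    # Feature 1: edge density -- ratio of Canny edge pixels to total pixels.
    edge_density = np.count_nonzero(cv2.Canny(gray, 100, 200)) / area

    # Feature 2: color simplicity -- unique colors relative to image area.
    unique_colors = np.unique(rgb.reshape(-1, 3), axis=0).shape[0]
    color_simplicity = unique_colors / area

    # Feature 3: dark pixel prevalence -- fraction with luminance below 50.
    dark_ratio = np.count_nonzero(gray < 50) / area

    # Threshold values are illustrative, not the deployed ones.
    return edge_density > 0.05 and color_simplicity < 0.05 and dark_ratio > 0.02
```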

Figure 5 below shows the enhancement pipeline through four stages. The first panel displays the original cartoon dog with its characteristic black outlines and flat colors. The second panel shows the enhanced mask, notice the complete white silhouette capturing the entire character. The third panel reveals Canny edge detection identifying sharp outlines. The fourth panel highlights dark regions (luminance < 50) that mark the black lines defining the character’s form.

Figure 5. Cartoon enhancement pipeline: Original, enhanced mask, edge detection, black outline regions.

The enhancement process in the figure above operates in two stages. First, black outline protection scans for dark pixels (luminance < 80), dilates them slightly, and sets their mask alpha to 255 (full opacity), ensuring black lines are never lost. Second, internal fill enhancement identifies high-confidence regions (alpha > 160), applies morphological closing to connect separated parts, then boosts medium-confidence pixels within this zone to minimum alpha of 220, eliminating gaps in flat-colored areas.
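Sketched in OpenCV, these two stages might look like the following; the luminance and alpha thresholds follow the description above, while the kernel sizes and the lower bound for “medium confidence” are assumptions.

```python
import cv2
import numpy as np


def enhance_cartoon_mask(rgb: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """alpha is a uint8 mask in [0, 255] from the segmentation stage."""
    out = alpha.copy()
    gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)

    # Stage 1: black outline protection. Dark pixels (luminance < 80) are
    # dilated slightly and forced to full opacity so line work is never lost.
    outlines = cv2.dilate((gray < 80).astype(np.uint8) * 255,
                          np.ones((3, 3), np.uint8), iterations=1)
    out[outlines > 0] = 255

    # Stage 2: internal fill enhancement. Close high-confidence regions
    # (alpha > 160) to connect separated parts, then boost medium-confidence
    # pixels inside that zone to a minimum alpha of 220.
    confident = (out > 160).astype(np.uint8) * 255
    closed = cv2.morphologyEx(confident, cv2.MORPH_CLOSE,
                              np.ones((7, 7), np.uint8))
    medium_floor = 100  # assumed lower bound for "medium confidence"
    boost = (closed > 0) & (out >= medium_floor) & (out < 220)
    out[boost] = 220
    return out
```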

This specialized handling preserved mask coverage across anime characters, comic illustrations, and line drawings during development. Without it, generic models produce masks that are technically correct for photographic content but fail to preserve the sharp outlines and solid fills that define cartoon imagery.

Conclusion: Engineering Decisions Over Model Selection

Building this background replacement system reinforced a core principle: production-quality AI applications require thoughtful orchestration of multiple techniques rather than relying on a single “best” model. The three-tier mask generation strategy ensures robustness across diverse inputs, Lab color space operations eliminate perceptual artifacts that RGB blending inherently produces, and cartoon-specific enhancements preserve artistic integrity for non-photographic content. Together, these design decisions create a system that handles real-world diversity while maintaining transparency about how corrections are applied—critical for developers integrating AI into their applications.

Several directions for future enhancement emerge from this work. Implementing guided filter refinement as standard post-processing could further smooth mask edges while preserving structural boundaries. The cartoon detection heuristics currently use fixed thresholds but could benefit from a lightweight classifier trained on labeled examples. The adaptive spill suppression currently uses linear falloff, but smooth step or double smooth step curves might provide more natural transitions. Finally, extending the system to handle video input would require temporal consistency mechanisms to prevent flickering between frames.

Acknowledgments

This work builds upon the open-source contributions of BiRefNet, U²-Net, Stable Diffusion XL, and OpenCLIP. Special thanks to the HuggingFace team for providing the ZeroGPU infrastructure that enabled this deployment.


References & Further Reading

Color Science Foundations

  • CIE. (2004). Colorimetry (3rd ed.). CIE Publication 15:2004. International Commission on Illumination.
  • Sharma, G., Wu, W., & Dalal, E. N. (2005). The CIEDE2000 color-difference formula: Implementation notes, supplementary test data, and mathematical observations. Color Research & Application, 30(1), 21-30.

Deep Learning Segmentation

  • Zheng, P., Gao, D., Fan, D.-P., Liu, L., Laaksonen, J., Ouyang, W., & Sebe, N. (2024). Bilateral reference for high-resolution dichotomous image segmentation. arXiv preprint arXiv:2401.03407.
  • Qin, X., Zhang, Z., Huang, C., Dehghan, M., Zaiane, O. R., & Jagersand, M. (2020). U²-Net: Going deeper with nested U-structure for salient object detection. Pattern Recognition, 106, 107404.

Image Compositing & Color Spaces

  • Lucas, B. D. (1984). Color image compositing in multiple color spaces. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Core Infrastructure

  • Rombach, R., et al. (2022). High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10684-10695.
  • Radford, A., et al. (2021). Learning transferable visual models from natural language supervision. Proceedings of the International Conference on Machine Learning, 8748-8763.

Image Attribution

  • All figures in this article were generated using Gemini Nano Banana and Python code.
