OTR: Synthesizing Overlay Text Dataset for Text Removal
Abstract
Text removal is a crucial task in computer vision with applications such as privacy preservation, image editing, and media reuse. While existing research has primarily focused on scene text removal in natural images, limitations in current datasets hinder both out-of-domain generalization and accurate evaluation. In particular, widely used benchmarks such as SCUT-EnsText suffer from ground-truth artifacts due to manual editing, overly simplistic text backgrounds, and evaluation metrics that do not capture the quality of generated results. To address these issues, we introduce an approach to synthesizing a text removal benchmark applicable to domains beyond scene text. Our dataset features text rendered on complex backgrounds using object-aware placement and vision-language-model-generated content, ensuring clean ground truth and challenging text removal scenarios. We demonstrate through extensive evaluations that training on our dataset significantly improves performance and generalization on existing benchmarks.
Type
Publication
33rd ACM International Conference on Multimedia (ACMMM 2025) - Datasets Track

Authors
Research Scientist
Jan is a research scientist at CyberAgent, where he works on artificial intelligence and computer vision
with a focus on image generation and editing. He received his PhD in Information Science and Technology
from the University of Tokyo, where his research centered on image generation. Prior to that, he received
his Master’s degree in Creative Informatics from the University of Tokyo, and his Bachelor’s degree
in Computer and Information Science from the Czech Technical University in Prague.
Born and raised in the Czech Republic, he currently works in Japan.