Scene text erasing, the task of removing text from natural scene images, has been gaining attention in recent years. The main motivation is to conceal private information that can appear in images, such as license plate numbers and house nameplates. In this work, we propose a method that approaches scene text erasing as a general inpainting task. In contrast to previous methods, which require paired training data (original images containing text and the same images with the text removed), our method does not need corresponding image pairs for training. We use a separately trained scene text detector and an inpainting network. The scene text detector predicts segmentation maps of text instances, which are then used as masks for the inpainting network. The inpainting network, trained on a large-scale image dataset, fills in the masked-out regions of an input image and generates a final image in which the original text is no longer present. The results show that our method successfully removes text and fills in the resulting holes to produce natural-looking images.
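
The two-stage pipeline described above (detect text, then inpaint the detected regions) can be illustrated with a minimal sketch. All names here (`erase_text`, `toy_detector`, `toy_inpaint`) are hypothetical, and both stages are deliberately simple stand-ins: a brightness threshold replaces the trained scene text detector, and a neighbour-averaging fill replaces the inpainting network. The sketch is only meant to show how the segmentation mask connects two independently built components, not how either network works.

```python
from typing import Callable, List

Image = List[List[float]]  # grayscale image, pixel values in [0, 1]
Mask = List[List[bool]]    # True where a pixel is predicted to be text


def erase_text(image: Image,
               detect_text: Callable[[Image], Mask],
               inpaint: Callable[[Image, Mask], Image]) -> Image:
    """Compose the two stages: the detector's mask is the inpainter's hole."""
    mask = detect_text(image)
    return inpaint(image, mask)


def toy_detector(image: Image, threshold: float = 0.9) -> Mask:
    """Stand-in detector (assumption): treats very bright pixels as text."""
    return [[px > threshold for px in row] for row in image]


def toy_inpaint(image: Image, mask: Mask, iterations: int = 50) -> Image:
    """Stand-in inpainter (assumption): fills holes by repeatedly averaging
    the 4-connected neighbours of every masked pixel."""
    h, w = len(image), len(image[0])
    # Blank out the masked pixels first so the original text cannot leak through.
    out = [[0.0 if mask[y][x] else image[y][x] for x in range(w)]
           for y in range(h)]
    for _ in range(iterations):
        nxt = [row[:] for row in out]
        for y in range(h):
            for x in range(w):
                if not mask[y][x]:
                    continue  # pixels outside the mask are never modified
                nbrs = [out[ny][nx]
                        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                        if 0 <= ny < h and 0 <= nx < w]
                nxt[y][x] = sum(nbrs) / len(nbrs)
        out = nxt
    return out


# Toy input: uniform grey background with a short bright "text" stroke.
image = [[0.2] * 5 for _ in range(5)]
for x in (1, 2, 3):
    image[2][x] = 1.0

result = erase_text(image, toy_detector, toy_inpaint)
```

Because `erase_text` depends only on the mask interface, either stage can be swapped out or trained separately, which is the property that removes the need for paired text and text-free training images.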