How to replace text in a PDF with C#?

For simple text replacement, use the iTextSharp library. The code below replaces one string with another. Note that it works on the raw page content, so it only handles simple text stored there as one contiguous string and may not work in all cases.

    // Requires the iTextSharp (5.x) package and these namespaces:
    // using System.IO;
    // using iTextSharp.text.pdf;
    void VerySimpleReplaceText(string OrigFile, string ResultFile, string origText, string replaceText)
    {
        using (PdfReader reader = new PdfReader(OrigFile))
        {
            // iTextSharp page numbers start at 1.
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                // Decode the page's content stream, replace the text, and write it back.
                byte[] contentBytes = reader.GetPageContent(i);
                string contentString = PdfEncodings.ConvertToString(contentBytes, PdfObject.TEXT_PDFDOCENCODING);
                contentString = contentString.Replace(origText, replaceText);
                reader.SetPageContent(i, PdfEncodings.ConvertToBytes(contentString, PdfObject.TEXT_PDFDOCENCODING));
            }
            // Closing the stamper writes the modified document to ResultFile.
            new PdfStamper(reader, new FileStream(ResultFile, FileMode.Create, FileAccess.Write)).Close();
        }
    }

As stated in a similar thread, there is no really easy way to do this. An easier route seems to be to start from a DocX file, use the DocX library (which makes swapping words easy), and then convert the DocX to PDF (using the PDFCreator printer or similar).

Alternatively, use PDFsharp/MigraDoc to create new documents.


This thread is dead, but I'm posting my solution for other lost souls who might face this problem in the future. Unfortunately, my company doesn't allow posting code online, so I'll just describe the solution :).

Basically, what you have to do is use PDFsharp and modify this sample to replace text in the content stream. Keep in mind that a piece of text may be split across several parenthesized strings (convert the stream to a string to see what the format looks like).

Then, with code similar to this sample, traverse the source PDF page by page and modify each page by looking for PdfContent items inside PdfReference items and replacing the text in each content item's stream.
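To illustrate the point about text being split across several parenthesized strings, here is a minimal, library-free sketch. The content stream fragment is made up for the example, and `JoinTjFragments` is a hypothetical helper, not part of PDFsharp or iTextSharp. It assumes the text is shown with a `TJ` array of plain literal strings and ignores escaped parentheses, hex strings, and font-specific encodings.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class ContentStreamDemo
{
    // Hypothetical helper: for each "[ ... ] TJ" array, concatenate the literal
    // strings in parentheses and drop the kerning numbers between them.
    // Simplification: escaped parentheses and hex strings are not handled.
    public static string JoinTjFragments(string content)
    {
        var result = new StringBuilder();
        foreach (Match array in Regex.Matches(content, @"\[(.*?)\]\s*TJ", RegexOptions.Singleline))
        {
            foreach (Match literal in Regex.Matches(array.Groups[1].Value, @"\(([^()]*)\)"))
            {
                result.Append(literal.Groups[1].Value);
            }
            result.Append('\n'); // one line of output per TJ array
        }
        return result.ToString();
    }

    public static void Main()
    {
        // Made-up content stream fragment: "Hello World" stored as three pieces.
        string content = "BT /F1 12 Tf 72 700 Td [(Hel) -20 (lo W) 15 (orld)] TJ ET";

        // A plain search on the raw stream misses the text...
        Console.WriteLine(content.Contains("Hello World"));                  // False
        // ...while joining the fragments first finds it.
        Console.WriteLine(JoinTjFragments(content).Contains("Hello World")); // True
    }
}
```

This only shows how to find the text. To actually replace it, you would still need to rewrite the matching `TJ` array in the stream, which is why a plain `Replace` on the stream string is not enough for such documents.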


The 'problem' with PDF documents is that they are inherently not suitable for editing, especially ones without fields. The best thing is to step back, look at your process, and see whether there is a way to replace the text before the PDF is generated. Obviously, you may not always have this freedom.

Even if you are able to replace text, be aware that the text following the replaced text will not reflow automatically. If you are fine with that, there are still very few solutions that allow you to replace text.

I know that you are looking for an open-source solution, so I feel reluctant to offer you a commercial one. We offer one called PDFKit.NET. It allows you to extract all content on a page as so-called shapes (text, images, curves, etc.). See the method Page.CreateShapes in the type reference. You can then programmatically navigate and edit this structure of shapes and write it back to a PDF.

Here it is: http://www.tallcomponents.com/pdfkit

Disclosure: I am the founder of TallComponents, vendor of this component

Tags: C#, PDF