Remove metadata from a PDF file

I worked on removing metadata from PDF files created with GemBox, and in this post I will share the C# solution I came up with. I read up on the PDF file format and discovered that everything in it is organized as objects, including the metadata. So I wrote a solution that simply removes the two objects that contain the metadata in a PDF file created by GemBox. The solution makes a lot of assumptions, but it works for this narrow use case.

Explanation of the solution

A PDF file consists of numbered objects. GemBox puts the info in object 2 and the metadata in object 4. I'm going to remove objects 2 and 4 from the PDF file, so that no metadata remains.

Actually, if you want to do this properly, you should look in object 1 to see which object contains the metadata, and in the trailer which object contains the info. But to keep the code simple, and since I'm only cleaning GemBox PDF files, it seems easiest to hardcode which objects I want to remove.

Here's what the first 5 objects look like in a GemBox PDF file. PDF is a binary format, though, so a few bytes that cannot be rendered as text have been stripped from the example below:

%PDF-1.4
%
1 0 obj
<</Type/Catalog/Pages 3 0 R/Lang(sv-SE)/Metadata 4 0 R/Outlines 5 0 R>>
endobj
2 0 obj
<</CreationDate(D:20250721161200+02'00')/Creator(Microsoft Office Word)/Producer(GemBox.Document 2025.11 for .NET Standard 2.0)/Author(þÿDaniel Jonsson)/LastSavedBy(þÿDaniel Jonsson)/RevisionNumber(þÿ3)/ModDate(D:20250723120800+02'00')>>
endobj
3 0 obj
<</Type/Pages/Kids[8 0 R 9 0 R]/Count 2/MediaBox[0 0 595.32 841.92]>>
endobj
4 0 obj
<</Length 1255/Type/Metadata/Subtype/XML>>stream
<?xpacket begin="Ä»ż" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="3.1-701">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="" xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
<pdf:Producer>GemBox.Document 2025.11 for .NET Standard 2.0</pdf:Producer>
</rdf:Description>
<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:creator><rdf:Seq><rdf:li>Daniel Jonsson</rdf:li></rdf:Seq></dc:creator>
</rdf:Description>
<rdf:Description rdf:about="" xmlns:xmp="http://ns.adobe.com/xap/1.0/">
<xmp:ModifyDate>2025-07-23T12:08:00+02:00</xmp:ModifyDate>
<xmp:CreateDate>2025-07-21T16:12:00+02:00</xmp:CreateDate>
<xmp:MetadataDate>2025-07-23T12:08:00+02:00</xmp:MetadataDate>
<xmp:CreatorTool>Microsoft Office Word</xmp:CreatorTool>
</rdf:Description>
<rdf:Description rdf:about="" xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
<xmpMM:DocumentID>uuid:a5ecb23e-6382-48f3-900b-aebaef9af65a</xmpMM:DocumentID>
<xmpMM:InstanceID>uuid:a5ecb23e-6382-48f3-900b-aebaef9af65a</xmpMM:InstanceID>
<xmpMM:RenditionClass>default</xmpMM:RenditionClass>
<xmpMM:VersionID>1</xmpMM:VersionID>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>

endstream
endobj
5 0 obj
<</Last 6 0 R/First 7 0 R/Count 6>>
endobj

Note where object 1 says "/Metadata 4 0 R". This indicates that the metadata is in object 4. As I mentioned, my implementation doesn't look at that, but to do it properly, you should read the object number from there.
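As a rough illustration of reading that number instead of hardcoding it, here is a minimal sketch. The class and method names are made up for this example, and it assumes the key is followed directly by an indirect reference of the form "N G R". It also searches whatever bytes you hand it, so in practice you would pass in only the bytes of object 1 (or of the trailer, for "/Info").

```csharp
using System;
using System.Buffers.Text;

public static class PdfReferenceSketch
{
    // Hypothetical helper: returns N from "<key> N G R" inside `bytes`,
    // or -1 if the key is missing or is not followed by a number.
    public static int FindReferencedObject(ReadOnlySpan<byte> bytes, ReadOnlySpan<byte> key)
    {
        var keyIndex = bytes.IndexOf(key);
        if (keyIndex < 0)
        {
            return -1;
        }

        // Skip the key itself and any spaces before the object number.
        var position = keyIndex + key.Length;
        while (position < bytes.Length && bytes[position] == (byte)' ')
        {
            position++;
        }

        // Utf8Parser reads the leading digits and stops at the first non-digit.
        return Utf8Parser.TryParse(bytes[position..], out int objectNumber, out _)
            ? objectNumber
            : -1;
    }
}
```

Called with the bytes of object 1 above and the key "/Metadata"u8, this should give 4.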

The idea is that the result when we're done will look like this:

%PDF-1.4
%
1 0 obj
<</Type/Catalog/Pages 3 0 R/Lang(sv-SE)/Metadata 4 0 R/Outlines 5 0 R>>
endobj
3 0 obj
<</Type/Pages/Kids[8 0 R 9 0 R]/Count 2/MediaBox[0 0 595.32 841.92]>>
endobj
5 0 obj
<</Last 6 0 R/First 7 0 R/Count 6>>
endobj

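In other words, we simply remove objects 2 and 4. Object 1 still says "/Metadata 4 0 R", but that reference now points at an object that is no longer in the file.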

However, you also need to consider the end of the PDF file, which looks like this:

xref
0 26
0000000000 65535 f 
0000000015 00000 n 
0000000102 00000 n 
0000000384 00000 n 
0000000469 00000 n 
0000001799 00000 n 
0000001850 00000 n 
0000001938 00000 n 
0000002035 00000 n 
0000002186 00000 n 
0000002314 00000 n 
0000002470 00000 n 
0000002568 00000 n 
0000003886 00000 n 
0000005712 00000 n 
0000005873 00000 n 
0000006017 00000 n 
0000012614 00000 n 
0000012778 00000 n 
0000012937 00000 n 
0000013443 00000 n 
0000013778 00000 n 
0000014331 00000 n 
0000014645 00000 n 
0000035192 00000 n 
0000066164 00000 n 
trailer
<</Root 1 0 R/Info 2 0 R/ID[<3EB2ECA58263F348900BAEBAEF9AF65A><3EB2ECA58263F348900BAEBAEF9AF65A>]/Size 26>>
startxref
66201
%%EOF

"%%EOF" signals where the file ends. But if the PDF has been modified by another program, there may be more content after the first EOF. A PDF file can contain what is essentially a version history, where later sections are appended to the file and override earlier objects. This is how the "exiftool" tool makes its changes.

"exiftool" removes the metadata by appending an update after the EOF that changes the links in object 1 and the trailer, which point to where the metadata and info are located. So "exiftool" doesn't actually remove anything from the file. It just adds an update that says the links to the metadata and info are no longer valid. You can still open the file yourself and see all the previous content, with the metadata still right there.

Since I'm not modifying the PDF in a separate program after GemBox has created it, I ignored the possibility of more content after the EOF. This is again to keep the code simple. But if you were to do this more properly and generically, this is something you would need to take into account.

Also note that "trailer" contains "/Info 2 0 R". So if you wanted to do this more generically, you would need to read the number here to know which object contains the info.

When reading a PDF file, you start from the end. There you will find "startxref", followed by a number. This number is the byte offset at which "xref" begins. "xref" is the cross-reference table: a registry of all objects in the file and the byte offset at which each one starts. In the example above, xref begins at byte 66201.
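A small sketch of that first step, reading the number after the last "startxref". The names are made up for this example. Unlike my implementation further down, it skips whatever line break follows the keyword instead of assuming exactly one "\n".

```csharp
using System;
using System.Buffers.Text;

public static class StartXrefSketch
{
    // Hypothetical helper: returns the byte offset written after the last
    // "startxref" keyword, or -1 if the keyword or the number is missing.
    public static int ReadStartXref(ReadOnlySpan<byte> pdf)
    {
        ReadOnlySpan<byte> keyword = "startxref"u8;
        var keywordIndex = pdf.LastIndexOf(keyword);
        if (keywordIndex < 0)
        {
            return -1;
        }

        // Skip the line break (and any stray spaces) after the keyword.
        var position = keywordIndex + keyword.Length;
        while (position < pdf.Length && IsWhitespace(pdf[position]))
        {
            position++;
        }

        // Read the digits; parsing stops at the line break before "%%EOF".
        return Utf8Parser.TryParse(pdf[position..], out int offset, out _) ? offset : -1;
    }

    static bool IsWhitespace(byte b) => b == ' ' || b == '\n' || b == '\r' || b == '\t';
}
```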

One more note: if the PDF file has been run through another program, there may be multiple startxref and multiple xref sections. They then form a chain, where the trailer of a newer registry points back to the older one through a "/Prev" entry. But as I mentioned, I'm keeping the code simple and assuming there's only one startxref and one xref.
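If you want to guard that assumption, one cheap heuristic is to count the keywords before doing anything else. This is only a sketch with made-up names. It can also be fooled, because a binary stream could contain the same bytes by coincidence, so treat a count above one as "don't use the simple approach" rather than as proof of an update.

```csharp
using System;

public static class IncrementalUpdateSketch
{
    // Counts non-overlapping occurrences of `needle` in `haystack`.
    public static int CountOccurrences(ReadOnlySpan<byte> haystack, ReadOnlySpan<byte> needle)
    {
        var count = 0;
        while (true)
        {
            var index = haystack.IndexOf(needle);
            if (index < 0)
            {
                return count;
            }

            count++;
            // Continue searching after the match that was just found.
            haystack = haystack[(index + needle.Length)..];
        }
    }

    // True if the file appears to have exactly one startxref and one %%EOF.
    public static bool LooksLikeSingleRevision(ReadOnlySpan<byte> pdf) =>
        CountOccurrences(pdf, "startxref"u8) == 1 && CountOccurrences(pdf, "%%EOF"u8) == 1;
}
```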

In this example, we find xref at byte 66201. The line after it, "0 26", says that the registry starts at object 0 and has 26 rows. Row 0 is a special entry that is always free, so object 1 starts at byte 15, object 2 at byte 102, object 3 at byte 384, and so on.
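To make the row layout concrete, here is a sketch that reads such a registry into a list. The type and method names are invented for the example. It assumes what the GemBox file above shows: "\n" line endings, a single "first count" header line, and rows of exactly 20 bytes (10-digit offset, space, 5-digit generation, space, "n" or "f", space, line break). It does no error handling.

```csharp
using System;
using System.Buffers.Text;
using System.Collections.Generic;

public readonly record struct XrefEntry(int ObjectNumber, int Offset, int Generation, bool InUse);

public static class XrefTableSketch
{
    const int RowLength = 20;

    public static List<XrefEntry> ReadXrefTable(ReadOnlySpan<byte> pdf, int xrefOffset)
    {
        var entries = new List<XrefEntry>();

        // Skip the "xref\n" keyword line.
        var position = xrefOffset + "xref\n".Length;

        // Read the header line "first count", e.g. "0 26".
        var headerEnd = pdf[position..].IndexOf((byte)'\n') + position;
        var header = pdf[position..headerEnd];
        var space = header.IndexOf((byte)' ');
        Utf8Parser.TryParse(header[..space], out int first, out _);
        Utf8Parser.TryParse(header[(space + 1)..], out int count, out _);
        position = headerEnd + 1;

        for (var i = 0; i < count; i++)
        {
            var row = pdf.Slice(position + i * RowLength, RowLength);
            Utf8Parser.TryParse(row[..10], out int offset, out _);        // bytes 0-9
            Utf8Parser.TryParse(row[11..16], out int generation, out _);  // bytes 11-15
            var inUse = row[17] == (byte)'n';                             // byte 17 is "n" or "f"
            entries.Add(new XrefEntry(first + i, offset, generation, inUse));
        }

        return entries;
    }
}
```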

Since I'm removing objects 2 and 4 from the file, I need to mark in the registry that these two objects are free. And I also need to update the addresses of all subsequent objects in the registry, so their addresses are correct.

The result will look something like this:

xref
0 26
0000000000 65535 f 
0000000015 00000 n 
0000000000 00001 f 
0000000102 00000 n 
0000000000 00001 f 
0000000187 00000 n 
0000000238 00000 n 
0000000326 00000 n 
0000000423 00000 n 
0000000574 00000 n 
0000000702 00000 n 
0000000858 00000 n 
0000000956 00000 n 
0000002274 00000 n 
0000004100 00000 n 
0000004261 00000 n 
0000004405 00000 n 
0000011002 00000 n 
0000011166 00000 n 
0000011325 00000 n 
0000011831 00000 n 
0000012166 00000 n 
0000012719 00000 n 
0000013033 00000 n 
0000033580 00000 n 
0000064552 00000 n 
trailer
<</Root 1 0 R/Info 2 0 R/ID[<3EB2ECA58263F348900BAEBAEF9AF65A><3EB2ECA58263F348900BAEBAEF9AF65A>]/Size 26>>
startxref
64589
%%EOF

What I need to do, as we see above, is:

  1. Mark objects 2 and 4 as free by changing them to "0000000000 00001 f ".
  2. Update the address of object 3 to "address of object 3 minus the length of object 2".
  3. Update the address of object 5 and all subsequent objects to "address of object n minus the length of objects 2 and 4".
  4. Update the startxref address to "startxref minus the length of objects 2 and 4".
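
The arithmetic in these steps can be sketched on its own, separate from the byte handling. The names below are made up for the example, and it assumes that the objects are stored in the file in ascending object-number order, as in the GemBox file, so that every object after a removed one shifts by the total length removed so far.

```csharp
using System;
using System.Collections.Generic;

public static class XrefShiftSketch
{
    // offsets[n] is the old byte offset of object n (row 0 is the special free entry).
    // removedLengths maps each removed object number to its length in bytes.
    // Returns the new registry rows, without the trailing line break.
    public static string[] BuildRows(int[] offsets, IReadOnlyDictionary<int, int> removedLengths)
    {
        var rows = new string[offsets.Length];
        rows[0] = "0000000000 65535 f ";

        var shift = 0; // total number of bytes removed before the current object
        for (var n = 1; n < offsets.Length; n++)
        {
            if (removedLengths.TryGetValue(n, out var length))
            {
                // Steps 1: a removed object becomes a free row.
                rows[n] = "0000000000 00001 f ";
                shift += length;
            }
            else
            {
                // Steps 2 and 3: a kept object moves up by what was removed before it.
                rows[n] = $"{offsets[n] - shift:D10} 00000 n ";
            }
        }

        return rows;
    }
}
```

With the first rows of the file above, object 2 is 384 - 102 = 282 bytes and object 4 is 1799 - 469 = 1330 bytes, so object 3 moves from 384 to 102 and object 5 from 1799 to 187.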

Implementation

My implementation is as follows:

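// Requires .NET 8 or later (int.Parse on UTF-8 bytes) and C# 11 or later (u8 string literals).
using System;
using System.Buffers;
using System.IO;
using System.Text;

public static class PdfHelper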
{
    public static byte[] CleanPdf(MemoryStream pdfMemoryStream)
    {
        // Create a span of the content in the memory stream. Using a span allows efficient addressing of content in
        // the underlying byte array.
        var span = pdfMemoryStream.GetBuffer().AsSpan(0, (int)pdfMemoryStream.Length);

        // This is where we will save the result.
        var result = new ArrayBufferWriter<byte>();

        // The beginning of different objects that we will search for.
        var object2SearchPattern = "2 0 obj\n"u8;
        var object3SearchPattern = "3 0 obj\n"u8;
        var object4SearchPattern = "4 0 obj\n"u8;
        var object5SearchPattern = "5 0 obj\n"u8;

        // The addresses of the objects.
        var indexObject2 = span.IndexOf(object2SearchPattern);
        var indexObject3 = span[indexObject2..].IndexOf(object3SearchPattern) + indexObject2;
        var indexObject4 = span[indexObject3..].IndexOf(object4SearchPattern) + indexObject3;
        var indexObject5 = span[indexObject4..].IndexOf(object5SearchPattern) + indexObject4;

        // The lengths of objects 2 and 4. We will need to know this when we later update the startxref and
        // xref addresses.
        var lengthObject2 = indexObject3 - indexObject2;
        var lengthObject4 = indexObject5 - indexObject4;

        // Search backwards in the PDF file for startxref. This assumes there is only one startxref.
        var startxrefSearchPattern = "startxref"u8;
        var indexStartxref = span.LastIndexOf(startxrefSearchPattern);

        // Search backwards in the PDF file for EOF.
        var eofSearchPattern = "%%EOF"u8;
        var indexEof = span.LastIndexOf(eofSearchPattern);

        // Extract the address to xref. It's between "startxref\n" (10 bytes) and "%%EOF".
        var indexXrefSpan = span[(indexStartxref + 10)..indexEof];
        var indexXref = int.Parse(indexXrefSpan);

        // Now we know where xref is. So we skip "xref\n0 " (7 bytes) and read to the next "\n". Then we get the number
        // of objects (rows) that exist in the xref registry.
        var indexStartNumberOfObjects = indexXref + 7;
        var indexEndNumberOfObjects = span[indexStartNumberOfObjects..].IndexOf("\n"u8) + indexStartNumberOfObjects;
        var numberOfObjectsSpan = span[indexStartNumberOfObjects..indexEndNumberOfObjects];
        var numberOfObjects = int.Parse(numberOfObjectsSpan);

        // The end of the number of objects is a "\n" character before the first registry row begins. So we add 1 here to
        // get the beginning of object 0 in the registry.
        var indexFirstOffset = indexEndNumberOfObjects + 1;

        // Copy everything from the beginning of the file to the beginning of object 2.
        result.Write(span[..indexObject2]);
        // Skip object 2, and copy object 3.
        result.Write(span[indexObject3..indexObject4]);
        // Skip object 4, and copy everything from object 5 to the first row in the xref registry.
        result.Write(span[indexObject5..indexFirstOffset]);

        // Copy the first two rows from the registry. Each row is 20 bytes.
        result.Write(span[indexFirstOffset..(indexFirstOffset + 40)]);
        // Mark the 3rd row (object 2) as free.
        result.Write("0000000000 00001 f \n"u8);
        // Update the address to object 3, which is minus the length of object 2.
        result.Write(RecalculateNumber(span[(indexFirstOffset + 60)..(indexFirstOffset + 70)], -lengthObject2));
        // Write the rest of the row for object 3.
        result.Write(" 00000 n \n"u8);
        // Mark the 5th row (object 4) as free.
        result.Write("0000000000 00001 f \n"u8);
        // Update the addresses to the rest of the objects, which is minus the length of objects 2 and 4.
        for (var objectIndex = 5; objectIndex < numberOfObjects; ++objectIndex)
        {
            var indexOffset = indexFirstOffset + objectIndex * 20;
            result.Write(RecalculateNumber(span[indexOffset..(indexOffset + 10)], -lengthObject2 - lengthObject4));
            result.Write(" 00000 n \n"u8);
        }
        // Copy the rest (the trailer) up to, but not including, startxref.
        result.Write(span[(indexFirstOffset + numberOfObjects * 20)..indexStartxref]);
        // Write a new startxref with an updated address to where xref begins.
        result.Write("startxref\n"u8);
        result.Write(Encoding.ASCII.GetBytes($"{indexXref - lengthObject2 - lengthObject4}"));
        // Write the end of the file.
        result.Write("\n%%EOF"u8);

        return result.WrittenSpan.ToArray();
    }

    // Parses a 10-digit registry address, adds the (negative) difference, and formats the result as 10 digits again.
    static byte[] RecalculateNumber(ReadOnlySpan<byte> span, int difference)
    {
        var newNum = int.Parse(span) + difference;
        return Encoding.ASCII.GetBytes(newNum.ToString("D10"));
    }
}
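
One way to sanity-check the output is to confirm that each address in the new registry really lands on the start of the object it claims to point at. Below is a minimal sketch of such a check. The class and method names are invented for the example, and it assumes generation 0 for every object, as in the GemBox file.

```csharp
using System;
using System.Text;

public static class XrefCheckSketch
{
    // True if the bytes at `offset` start with "<objectNumber> 0 obj".
    public static bool PointsAtObject(ReadOnlySpan<byte> pdf, int offset, int objectNumber)
    {
        if (offset < 0 || offset > pdf.Length)
        {
            return false;
        }

        var expected = Encoding.ASCII.GetBytes($"{objectNumber} 0 obj");
        return pdf[offset..].StartsWith(expected);
    }
}
```

For the example file, PointsAtObject(cleanedBytes, 102, 3) should be true after cleaning, since object 3 now starts where object 2 used to.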

I haven't worked with PDF files at this level before, nor have I used spans in C# before. Furthermore, I wrote the code by hand without letting AI generate any of it. So when I was done with my implementation, had tried it out, and had fixed a few small things, I was quite happy that it worked and that the PDF still rendered correctly.