python – Merge PDF files

Introduction

I used to use PDFSam to merge PDF files when submitting my claims, which consist of many receipts and a claim application form, all in PDF format. Since I know Python, however, an easier and free way to merge PDFs is to use the PyPDF2 module. Credit goes to the PyPDF2 team for hiding the complicated PDF processing from Python scripters.

Python Modules

PyPDF2

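This is the main module for processing PDF files in a simple way. The main classes are PdfFileMerger and PdfFileReader. I have also imported PdfReadError to catch the exception raised when there is a problem opening a PDF file. These are the names used by the PyPDF2 1.x releases; PyPDF2 2.0 and later renamed the classes to PdfMerger and PdfReader and moved PdfReadError to PyPDF2.errors.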

I create a merger object with PdfFileMerger, append each PDF file to it, and finally write the result to a fresh PDF file that contains the merged content of all the PDF files specified.

PdfFileReader is used to make sure each file specified by the user is indeed a PDF file and not something else.
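
As a side note, a cheaper pre-check is possible with the standard library alone. The sketch below is not what the script does; it only looks at the first five bytes of the file and assumes the file begins with the usual PDF signature. The helper name looks_like_pdf is made up for illustration. A truncated or corrupt file would still pass this test, so letting PyPDF2 parse the file remains the more thorough validation.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file starts with the signature b'%PDF-' (hypothetical helper)."""
    try:
        with open(path, "rb") as handle:
            return handle.read(5) == b"%PDF-"
    except OSError:
        # Covers missing files, directories and permission problems
        return False


# Demonstration with two throwaway files and one missing file
with tempfile.TemporaryDirectory() as tmp:
    fake_pdf = os.path.join(tmp, "fake.pdf")
    text_file = os.path.join(tmp, "notes.txt")
    with open(fake_pdf, "wb") as f:
        f.write(b"%PDF-1.4\n")
    with open(text_file, "wb") as f:
        f.write(b"just text")
    results = (
        looks_like_pdf(fake_pdf),
        looks_like_pdf(text_file),
        looks_like_pdf(os.path.join(tmp, "missing.pdf")),
    )

print(results)  # (True, False, False)
```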

Argparse

The argparse module is for specifying command-line arguments. The script has two options:

  1. -f / --file: accepts a list of PDF file names. This option is mandatory, as the script needs to know which PDF files to merge.
  2. -o / --outfile: accepts the filename of the merged PDF file. If not specified, the default filename, merged.pdf, is used.
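
To see how the two options behave, here is a small sketch that builds an equivalent parser and feeds it an explicit argument list instead of the real command line. The build_parser helper and the sample file names are my own for illustration; the option names and dest values match the script in the Source code section. Unlike the script, this sketch lets argparse supply the default output name through default=.

```python
from argparse import ArgumentParser


def build_parser():
    """Build a parser with the -f and -o options (hypothetical helper)."""
    parser = ArgumentParser(description="Merge PDF files")
    # nargs='+' collects one or more file names into a list
    parser.add_argument("-f", "--file", nargs="+", dest="user_inputs",
                        required=True, help="pdf files to merge")
    # default= supplies merged.pdf when -o is omitted
    parser.add_argument("-o", "--outfile", dest="output_filename",
                        default="merged.pdf", help="name of the merged pdf")
    return parser


# Passing a list to parse_args() avoids touching sys.argv
args = build_parser().parse_args(["-f", "form.pdf", "receipt1.pdf", "receipt2.pdf"])
print(args.user_inputs)      # ['form.pdf', 'receipt1.pdf', 'receipt2.pdf']
print(args.output_filename)  # merged.pdf
```

On the real command line the same input would look like: python merge.py -f form.pdf receipt1.pdf receipt2.pdf (the script name merge.py is an assumption).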

The argparse module is ideal for command-line Python scripts, as it needs less code than the getopt module.

For an argument that accepts a collection of items, you need to specify the nargs option when adding it to the parser. Valid nargs values include:

  • nargs='+' – 1 or more arguments
  • nargs='?' – 0 or 1 argument
  • nargs='*' – 0 or more arguments
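
The difference between the three values is easiest to see side by side. In this sketch the option names --plus, --maybe and --star are invented purely for the demonstration, as are the const and default values given to --maybe.

```python
from argparse import ArgumentParser

parser = ArgumentParser()
# '+' needs at least one value and always produces a list
parser.add_argument("--plus", nargs="+")
# '?' takes zero or one value: const is used when the flag appears without a value,
# default is used when the flag does not appear at all
parser.add_argument("--maybe", nargs="?", const="flag-only", default="absent")
# '*' takes any number of values, including none, and produces a list
parser.add_argument("--star", nargs="*")

ns = parser.parse_args(["--plus", "a.pdf", "b.pdf", "--maybe", "--star"])
print(ns.plus)   # ['a.pdf', 'b.pdf']
print(ns.maybe)  # flag-only
print(ns.star)   # []
```

Calling the parser with --plus and no value at all makes argparse print an error and exit, which is why '+' suits the mandatory list of input files.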

If you know regex, the symbols used in nargs will look familiar. If you have not learned regular expressions yet, now is a good time to start, as regex is a must-know topic for programming. I recommend taking The Complete Regular Expressions (Regex) Course For Beginners.

Source code

"""
Merging PDFs into one big PDF has never been so easy and free, thanks to the creators of PyPDF2
"""
from PyPDF2 import PdfFileMerger, PdfFileReader
from PyPDF2.utils import PdfReadError
from argparse import ArgumentParser


def is_pdf(pdf_path):
    try:
        with open(pdf_path, "rb") as pdf_file:
            PdfFileReader(pdf_file)
        return True
    except (PdfReadError, OSError):
        # Possible exceptions:
        # PdfReadError - raised when there is a problem opening a PDF file.
        # OSError - raised when a non-PDF file such as a .txt file is opened;
        #           PyPDF2 throws OSError: [Errno 22] Invalid argument.
        #           FileNotFoundError is a subclass of OSError, so a file
        #           name that does not exist is caught here as well.
        return False


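def merge_pdfs(list_of_pdf, merged_filename="merged.pdf"):
    merger = PdfFileMerger()
    for pdf in list_of_pdf:
        if is_pdf(pdf):
            merger.append(pdf)
        else:
            # Tell the user instead of dropping the file silently
            print("Skipping {}: not a readable PDF file".format(pdf))
    try:
        merger.write(merged_filename)
    except (PdfReadError, AttributeError) as e:
        print(e)
    finally:
        # Release the file handles held by the merger
        merger.close()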


if __name__ == '__main__':
    parser = ArgumentParser()
    # -f / --file accepts a list of arguments; nargs='+' means it accepts 1 or more
    parser.add_argument('-f', '--file', nargs='+', dest="user_inputs", help='pdf files to merge', required=True)
    # default= supplies merged.pdf when -o / --outfile is not given
    parser.add_argument('-o', '--outfile', dest='output_filename', default='merged.pdf',
                        help="filename of the merged pdf, default is merged.pdf if not specified")
    args = parser.parse_args()
    merge_pdfs(args.user_inputs, args.output_filename)
