2.1 Content Extraction using PyPDF2
`PyPDF2` is a Python library for reading and manipulating PDF files. It can extract information from PDF documents, merge multiple PDFs, split PDFs, encrypt and decrypt PDFs, and more.
The snippet below illustrates how to extract text from a PDF file. It returns the extracted content as a list of strings, one per page.
    import PyPDF2

    def extract(file_path):
        # Open the PDF at the given file path
        pdf_reader = PyPDF2.PdfReader(file_path)
        # Count the number of pages to be processed
        num_pages = len(pdf_reader.pages)
        # Create an empty list to store the text of each page
        # (alternatively, start with full_text = '' and build one string)
        full_text = []
        # Extract the text page by page
        for i in range(num_pages):
            page = pdf_reader.pages[i]
            full_text.append(page.extract_text())
        return full_text

    file_path = "Path of file to extract the data"
    extracted_data = extract(file_path)

- `import PyPDF2`: This line makes the PyPDF2 library available to the rest of the code.
- `def extract(file_path):` This line defines a function named `extract` that takes one argument, `file_path`.
- `pdf_reader = PyPDF2.PdfReader(file_path)`: This line creates an instance of the `PdfReader` class from the PyPDF2 library, which is used for reading PDF files. It takes `file_path` as input, indicating the location of the PDF file to be processed.
- `num_pages = len(pdf_reader.pages)`: This line determines the total number of pages in the PDF document by taking the length of the `pages` attribute of the `pdf_reader` object.
- `full_text = []`: This creates an empty list called `full_text`, which will store the extracted text from each page of the PDF.
- `for i in range(num_pages):` This starts a loop that iterates over each page in the PDF document.
- `page = pdf_reader.pages[i]`: Within the loop, this line selects the `i`-th page for processing.
- `full_text.append(page.extract_text())`: This extracts the text content from the selected page and appends it to the `full_text` list. The `extract_text()` method is used to obtain the text from the page.
- After processing all pages, the loop completes, and the function `extract` returns the `full_text` list, which now contains the extracted text from all pages of the PDF.
- `file_path = "Path of file to extract the data"`: This assigns the file path (as a string) to the variable `file_path` to specify the location of the PDF file to be processed.
- `extracted_data = extract(file_path)`: This calls the `extract` function, passing `file_path` as an argument, and assigns the returned value (the extracted text) to the variable `extracted_data`.
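The comment on `full_text` hints at a single-string alternative to the list. One way to get there without changing `extract` is to join the returned list afterwards. The following is a minimal sketch; the helper name `pages_to_text` and the blank-line separator are assumptions made for illustration, not part of PyPDF2.

```python
def pages_to_text(pages, separator="\n\n"):
    """Join per-page strings into one string.

    `pages` is a list such as the one returned by `extract`.
    Empty or missing entries (pages with no extractable text)
    are skipped so they do not produce stray separators.
    """
    return separator.join(page for page in pages if page)
```

With the earlier code in place, `pages_to_text(extracted_data)` would give the whole document as one string, while the original list stays available for per-page processing.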
In summary, this code defines a function `extract` that uses the PyPDF2 library to read the PDF file specified by `file_path`, extracts the text content from each page, and returns the extracted text as a list with one string per page. The `file_path` variable is then passed to `extract` to pull the text from a specific PDF file.
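Pages that contain only scanned images typically yield an empty string from `extract_text()`, so it can help to keep page numbers alongside the text and drop the empty pages. The sketch below is one possible variation, not the method used above; the function name `extract_nonempty_pages` is hypothetical, and the function accepts any object that exposes `pages` whose items have an `extract_text()` method, such as a `PyPDF2.PdfReader`.

```python
def extract_nonempty_pages(reader):
    """Return (page_number, text) pairs for pages that yield text.

    `reader` is assumed to expose `.pages`, where each page has an
    `extract_text()` method (for example, a PyPDF2.PdfReader).
    """
    results = []
    # Number pages from 1 so the result matches how readers count pages
    for page_number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""  # treat a missing result as empty
        if text.strip():  # skip pages with only whitespace
            results.append((page_number, text))
    return results
```

A call would look like `extract_nonempty_pages(PyPDF2.PdfReader(file_path))`, which returns a list of tuples instead of a plain list of strings.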