2.1 Content Extraction using PyPDF2
PyPDF2 is a Python library for reading and manipulating PDF files. It can extract information from PDF documents, merge multiple PDFs, split PDFs, encrypt and decrypt PDFs, and more.
The following snippet is an illustrative example of extracting text from a PDF file and collecting it as a list of strings, one per page.
import PyPDF2

def extract(file_path):
    # Open the PDF file at the given file path
    pdf_reader = PyPDF2.PdfReader(file_path)
    # Count the number of pages to be processed
    num_pages = len(pdf_reader.pages)
    # Create an empty list (or an empty string) to store the data
    full_text = []  # full_text = ''
    # Loop over the pages; the text is extracted one page at a time
    for i in range(num_pages):
        page = pdf_reader.pages[i]
        full_text.append(page.extract_text())
    return full_text

file_path = "Path of file to extract the data"
extracted_data = extract(file_path)
- def extract(file_path): This line defines a function named extract that takes one argument, file_path.
- pdf_reader = PyPDF2.PdfReader(file_path): This line creates an instance of the PdfReader class from the PyPDF2 library, which is used for reading PDF files. It takes file_path as input, indicating the location of the PDF file to be processed.
- num_pages = len(pdf_reader.pages): This line determines the total number of pages in the PDF document by accessing the pages attribute of the pdf_reader object.
- full_text = []: This creates an empty list called full_text, which will be used to store the extracted text from each page of the PDF.
- for i in range(num_pages): This starts a loop that will iterate over each page in the PDF document.
- page = pdf_reader.pages[i]: Within the loop, this line selects the i-th page for processing.
- full_text.append(page.extract_text()): This extracts the text content from the selected page and appends it to the full_text list. The extract_text() method is used to obtain the text from the page.
- After processing all pages, the loop completes, and the function extract returns the full_text list, which now contains the extracted text from all pages of the PDF.
- file_path = "Path of file to extract the data": This assigns the file path (in string format) to the variable file_path to specify the location of the PDF file to be processed.
- extracted_data = extract(file_path): This calls the extract function, passing file_path as an argument, and assigns the returned value (the extracted text) to the variable extracted_data.
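
The loop described above can also be written by iterating over the pages directly instead of indexing them by number. The following is a minimal sketch under one assumption: each page object exposes extract_text(), as in the walkthrough. The names StubPage and extract_from_pages are hypothetical and used only for illustration; the stub stands in for a real PyPDF2 page so that the example runs without a PDF file.

```python
class StubPage:
    """Hypothetical stand-in for a PyPDF2 page; not part of the library."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        # Mimics the method called on each page in the walkthrough.
        return self._text


def extract_from_pages(pages):
    """Collect the text of each page, in order, into a list."""
    # Iterating directly avoids the range(num_pages) index bookkeeping.
    return [page.extract_text() for page in pages]


pages = [StubPage("first page"), StubPage("second page")]
print(extract_from_pages(pages))
```

With a real file, passing PyPDF2.PdfReader(file_path).pages to this helper should give the same result as the extract function above.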
In summary, this code defines a function extract that uses the PyPDF2 library to read the PDF file specified by file_path, extracts the text content from each page, and returns the extracted text as a list. The file_path variable is then used to extract data from a specific PDF file.
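
Because extract returns a list with one entry per page, a single string of content requires one more step. The sketch below shows one possible way to join the entries. The helper name join_page_texts and the newline separator are assumptions made for illustration, as is the choice to skip empty entries (a page with no extractable text, such as a scanned image, may yield an empty string).

```python
def join_page_texts(page_texts, separator="\n"):
    """Join per-page text into one string, skipping empty entries."""
    # Dropping empty entries is an illustrative choice, not a requirement.
    return separator.join(text for text in page_texts if text)


# Sample data standing in for the list returned by extract(file_path).
extracted_data = ["Page one text.", "", "Page three text."]
print(join_page_texts(extracted_data))
```

Keeping the list and joining it only when needed preserves the page boundaries, which can be useful if later processing works page by page.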