2.3 Universal Content Extraction using subprocess
This method uses Python's standard `subprocess` module to call an external converter that turns DOC files into either PDF or DOCX. Afterward, you can extract the content from the converted file with a regular PDF or DOCX reader. Because the legacy DOC format is converted first, the same extraction code works for several kinds of files, which is useful for projects that pull data from many documents. With `subprocess`, you can run the converter from your script, check whether it succeeded, and handle any failure before moving on to extraction.
The snippet below is an illustrative example of converting a DOC file to a PDF file and then extracting the content from the generated PDF. The extracted text is returned as a list of strings, one per page.
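import os
import subprocess

import PyPDF2


def the_extract(in_file_path):
    # setting output path; base is assumed to be the input file name without its extension
    base = os.path.splitext(os.path.basename(in_file_path))[0]
    output_pdf_path = f'/folder_path/{base}.pdf'
    # setting the conversion command (-o tells unoconv where to write the PDF)
    conversion_command = f'unoconv -f pdf -o "{output_pdf_path}" "{in_file_path}"'
    try:
        # running the conversion command
        subprocess.run(conversion_command, shell=True, check=True)
    except subprocess.CalledProcessError as e:
        print(f"Conversion failed with error: {e}")
        return []  # nothing to extract if the conversion failed
    # fetching the converted file for extraction
    file_pa = 'Give your file path'  # replace with the path of the converted PDF (output_pdf_path above)
    pdf_reader = PyPDF2.PdfReader(file_pa)
    num_pages = len(pdf_reader.pages)
    full_text = []
    for i in range(num_pages):
        page = pdf_reader.pages[i]
        full_text.append(page.extract_text())
    return full_text


# placeholder argument: replace with the DOC file you want to convert
the_extract('Give your input file path')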
- Setting Output Path:
  - It creates a string `output_pdf_path` containing a file path for the output PDF. This path is constructed using the value of a variable `base`.
- Setting Conversion Command:
  - It creates a string `conversion_command`, which is a command-line instruction for converting a file to PDF using the `unoconv` tool. The input file path (`in_file_path`) is used in this command.
- Conversion Attempt:
  - It attempts to run the conversion command using the `subprocess.run` function. This executes a shell command, which in this case is the `unoconv` command for PDF conversion.
- Error Handling:
  - If an error occurs during the conversion (indicated by `subprocess.CalledProcessError`), it catches the error and prints a message indicating the conversion failure along with the error message.
- Fetching and Extracting Content from PDF:
  - It attempts to open a PDF file located at the specified `file_pa` path using `PyPDF2.PdfReader`. You should replace 'Give your file path' with the actual file path.
  - It retrieves the number of pages in the PDF and initializes an empty list `full_text` to store the extracted text.
  - It then iterates through each page in the PDF, extracts the text content using `page.extract_text()`, and appends it to the `full_text` list.
- Returning Extracted Text:
  - Finally, it returns the list `full_text` containing the extracted text from all the pages.
- Function Call:
  - The function `the_extract` is called at the end, but `file_pa` needs to be properly defined before running the function.
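The conversion and return steps above can also be sketched as two small helpers. This is only a sketch under a few assumptions: the helper names `build_conversion_command` and `join_pages` and the example paths are hypothetical, the output file is assumed to be named after the input file, and your installed version of `unoconv` is assumed to accept `docx` as a target format as well as `pdf`. The first helper builds the command as a list of arguments, so no shell quoting is needed; the second joins the per-page text into a single string.

```python
import os


def build_conversion_command(in_file_path, output_dir, target_format="pdf"):
    """Return (argument list, expected output path) for an unoconv call.

    Hypothetical helper: the output file name is assumed to be the input
    file name with the target format as its extension.
    """
    if target_format not in ("pdf", "docx"):
        raise ValueError(f"unsupported target format: {target_format}")
    base = os.path.splitext(os.path.basename(in_file_path))[0]
    output_path = os.path.join(output_dir, f"{base}.{target_format}")
    # A list of arguments is passed to the program as-is, so paths that
    # contain spaces do not need extra quoting.
    command = ["unoconv", "-f", target_format, "-o", output_path, in_file_path]
    return command, output_path


def join_pages(full_text):
    """Join per-page strings into one string, skipping empty pages."""
    return "\n".join(text.strip() for text in full_text if text and text.strip())


# Example paths are placeholders, not real files.
command, output_path = build_conversion_command("/data/report 2023.doc", "/tmp/out")
print(command)
print(join_pages(["Page one. ", "", None, "Page two."]))
```

The returned list can be passed to `subprocess.run(command, check=True)` without `shell=True`, and `output_path` is then the file to hand to the PDF reader.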
Please note that for this code to work, you need to have the unoconv tool installed on your system and properly configured (it relies on LibreOffice to do the conversion), and the PyPDF2 package must be installed as well. Additionally, you should replace 'Give your file path' with the actual path of the PDF you want to extract content from.
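Since the code depends on an external tool, it can help to check that the tool is available before trying to convert anything. The following is a minimal sketch using `shutil.which` from the standard library; the helper name `find_converter` is hypothetical, and the check only confirms that an executable with that name is on your PATH, not that it is configured correctly.

```python
import shutil


def find_converter(tool_name="unoconv"):
    """Return the full path of the tool if it is on PATH, otherwise None."""
    return shutil.which(tool_name)


converter_path = find_converter()
if converter_path is None:
    print("unoconv was not found on PATH; install it before converting.")
else:
    print(f"unoconv found at {converter_path}")
```

Running this check once at the start of a script gives a clearer message than a failed conversion later on.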