2.2 Content Extraction using python-docx
python-docx is a Python library for reading and manipulating .docx files. It can extract information from existing documents and create or modify them.
The following snippet is an illustrative example: it reads a DOCX file and returns the extracted content as a list of strings.
from docx import Document
from docx.oxml.ns import qn


def extract(file_path):
    # Open the .docx file located at file_path
    doc = Document(file_path)
    full_text = []
    # Append the text of each body paragraph
    for paragraph in doc.paragraphs:
        full_text.append(paragraph.text)
    # Append the text of each table cell
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                full_text.append(cell.text)
    # Append the header and footer text of each section
    for section in doc.sections:
        for header in section.header.paragraphs:
            full_text.append(header.text)
        for footer in section.footer.paragraphs:
            full_text.append(footer.text)
    # Text boxes: python-docx has no high-level API for them (inline shapes
    # expose no text), so read the w:txbxContent elements from the XML.
    # Word may store the same box twice (drawing plus legacy fallback).
    for textbox in doc.element.body.iter(qn("w:txbxContent")):
        for p in textbox.iter(qn("w:p")):
            full_text.append("".join(t.text or "" for t in p.iter(qn("w:t"))))
    return full_text


file_path = "Path of file to extract the data"
extracted_data = extract(file_path)
def extract(file_path):
  Defines a function named extract that takes a single argument, file_path.
doc = Document(file_path)
  Creates a document object, doc, by opening the file located at file_path. The file must be in a format that the Document class can read, that is, a .docx file.
full_text = []
  Initializes an empty list called full_text, which stores the extracted content.
for paragraph in doc.paragraphs:
  Starts a loop over each paragraph in the document body.
full_text.append(paragraph.text)
  Appends the text of each paragraph to the full_text list.
for table in doc.tables:
  Starts another loop, this time over each table in the document.
for row in table.rows:
  Iterates over each row of the current table.
for cell in row.cells:
  Iterates over each cell of the current row.
full_text.append(cell.text)
  Appends the text of each cell to the full_text list.
- The next part of the code extracts header and footer text, section by section.
for section in doc.sections:
  Iterates over each section in the document.
for header in section.header.paragraphs:
  For each section, iterates over the paragraphs in its header.
full_text.append(header.text)
  Appends the text of each header paragraph to the full_text list.
for footer in section.footer.paragraphs:
  Similarly, for each section, iterates over the paragraphs in its footer.
full_text.append(footer.text)
  Appends the text of each footer paragraph to the full_text list.
- The last extraction step targets text boxes.
  python-docx has no high-level API for text-box content. doc.inline_shapes covers inline pictures, charts, and similar objects only: shape.type is a WD_INLINE_SHAPE member (defined in docx.enum.shape, not docx.enum.text, and without a TEXT_BOX value), and an inline shape has no text_frame attribute. Text-box text is stored in w:txbxContent elements of the document XML, so the extraction has to walk those elements (reachable through doc.element.body) and append the text of each paragraph inside them to the full_text list.
- Finally, the function returns its result and is called.
return full_text
  Sends back the list containing all the extracted text.
file_path = "Path of file to extract the data"
  Assigns the path of the file you want to extract data from to the variable file_path.
extracted_data = extract(file_path)
  Calls the extract function with the specified file_path and assigns the returned list of extracted data to the variable extracted_data.
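Since extract returns a list, a single string has to be built in a separate step. A minimal sketch, assuming one newline between chunks is the desired layout and that empty entries (for example, blank paragraphs) should be dropped. The helper name to_text and the sample list are hypothetical.

```python
def to_text(chunks):
    """Join extracted chunks into one string, dropping blank entries."""
    return "\n".join(chunk.strip() for chunk in chunks if chunk and chunk.strip())


# Stand-in for the list returned by extract(file_path).
extracted_data = ["Title", "", "   ", "First paragraph ", "Cell A1"]
content = to_text(extracted_data)
print(content)
```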