2.2 Content Extraction using python-docx
`python-docx` is a Python library for reading and manipulating .docx files. It can extract information from existing .docx documents, create new ones, and modify them.
The following snippet illustrates how to extract data from a DOCX file and collect the extracted content into a list of strings.
    from docx import Document
    from docx.oxml.ns import qn


    def extract(file_path):
        # Open the document located at file_path
        doc = Document(file_path)
        full_text = []
        # Append the text of each body paragraph
        for paragraph in doc.paragraphs:
            full_text.append(paragraph.text)
        # Append the text of each table cell
        for table in doc.tables:
            for row in table.rows:
                for cell in row.cells:
                    full_text.append(cell.text)
        # Append the header and footer text of each section
        for section in doc.sections:
            for header in section.header.paragraphs:
                full_text.append(header.text)
            for footer in section.footer.paragraphs:
                full_text.append(footer.text)
        # Text boxes: doc.inline_shapes only covers pictures, charts and similar
        # objects (there is no TEXT_BOX shape type and no text_frame), so as a
        # workaround the text-box paragraphs are read from the underlying XML.
        for textbox in doc.element.body.iter(qn("w:txbxContent")):
            for paragraph in textbox.iter(qn("w:p")):
                textbox_text = "".join(t.text or "" for t in paragraph.iter(qn("w:t")))
                full_text.append(textbox_text)
        return full_text


    # Replace the placeholder with the path to a real .docx file
    file_path = "Path of file to extract the data"
    extracted_data = extract(file_path)
- `def extract(file_path):`: This line defines a function named `extract` that takes a single argument, `file_path`.
- `doc = Document(file_path)`: This line creates a document object, `doc`, by opening the file located at `file_path`. It assumes that the file is in a format that the `Document` class (imported from the `docx` package) can read, that is, a .docx file.
- `full_text = []`: This initializes an empty list called `full_text`, which will be used to store the extracted content.
- `for paragraph in doc.paragraphs:`: This starts a loop that iterates over each paragraph in the body of the document.
- `full_text.append(paragraph.text)`: For each paragraph, the text content is extracted and appended to the `full_text` list.
- `for table in doc.tables:`: This starts another loop that iterates over each table in the document.
- `for row in table.rows:`: For each table, it iterates over each row.
- `for cell in row.cells:`: For each row, it iterates over each cell.
- `full_text.append(cell.text)`: The text content of each cell is extracted and appended to the `full_text` list.
- The next part of the code extracts content from the headers and footers of each document section.
- `for section in doc.sections:`: This loop iterates over each section in the document.
- `for header in section.header.paragraphs:`: For each section, it iterates over the paragraphs in the header.
- `full_text.append(header.text)`: The text content of each header paragraph is appended to the `full_text` list.
- `for footer in section.footer.paragraphs:`: Similarly, for each section, it iterates over the paragraphs in the footer.
- `full_text.append(footer.text)`: The text content of each footer paragraph is appended to the `full_text` list.
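- Text boxes: the final loop is meant to collect the text held in text boxes. python-docx does not expose text boxes through `doc.inline_shapes`: that collection covers inline pictures, charts and similar objects, the `WD_INLINE_SHAPE` enumeration lives in `docx.enum.shape` and has no `TEXT_BOX` member, and inline shapes have no `text_frame`. Text-box content therefore has to be read from the `w:txbxContent` elements in the document's underlying XML, with the text of each paragraph found there appended to the `full_text` list. Word may store a text box twice (a drawing plus a legacy fallback), so the same text can appear more than once.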
- Finally, `return full_text` sends back the list containing all the extracted text.
- `file_path = "Path of file to extract the data"`: This line assigns the path of the file you want to extract data from to the variable `file_path`.
- `extracted_data = extract(file_path)`: This line calls the `extract` function with the specified `file_path` and assigns the returned list of extracted data to the variable `extracted_data`.