I am using Camelot for table extraction in PDF documents, which generally works well for my needs. However, I've encountered a recurring issue where the first and last rows of tables cause problems during the extraction process, primarily due to their alignment. These rows often differ in format from the rest of the table, affecting the consistency and accuracy of the extracted data. Currently, Camelot does not seem to offer a direct way to exclude specific rows based on their characteristics or alignment.
This feature would be incredibly beneficial for scenarios where table headers or footers consistently deviate in style or alignment from the main table body, leading to extraction inaccuracies. A parameter or method to specify rows to ignore (by index or pattern recognition) during extraction could significantly improve the utility and flexibility of Camelot for users facing similar challenges.
Is there an existing solution or workaround to address this issue, or could this functionality be considered for future updates?
Here are the details.
Page 1
Page 2 (a long table that spans pages 2, 3, and 4 in some PDFs)
As you can see, the first and last rows cause Camelot to produce 11 columns for this DataFrame when there are actually only 10. My PDFs sometimes contain such footers (the last row of the table in the PDF) and an extra first row above the header, neither of which I want extracted; my actual header comes after that first row.
I have already tried tuning line_tol, joint_tol, split_text, line_scale, shift_text, etc. This works for smaller misalignments, as in the first screenshot (page 1), but it fails for the second screenshot.
Here is my table-appending function, which builds a single result_df for long tables:
```python
import pandas as pd


def append_tables_to_dataframe(tables):
    """Concatenate the Camelot tables of one long table into a single DataFrame."""
    try:
        df_list = []
        for i, table in enumerate(tables):
            # Only keep tables that have at least 10 columns
            if table.shape[1] >= 10:
                # Handle header detection for the first table
                if i == 0:
                    # Find rows whose first cell contains "Date of Transaction"
                    # (the pattern tolerates stray whitespace between letters)
                    date_index = table.df[table.df.iloc[:, 0].str.contains(r"\bD\s*a\s*t\s*e\s*o\s*f\s*T\s*r\s*a\s*n\s*s\s*a\s*c\s*t\s*i\s*o\s*n\b", case=False, regex=True)].index
                    if not date_index.empty:
                        print("Got the row having header ...")
                        # header_index = date_index[0]
                # Reset column labels to 0..n-1 so the tables line up in concat
                table.df.columns = range(len(table.df.columns))
                df_list.append(table.df)
        # Concatenate all tables in the DataFrame list
        result_df = pd.concat(df_list, ignore_index=True)
        return result_df
    except Exception as e:
        print("Error in result_df creation:", e)
        return pd.DataFrame()  # Return an empty DataFrame in case of an error
```
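Until something like this exists in Camelot itself, one possible workaround is to trim the rows after extraction. The sketch below is only an illustration under stated assumptions: `trim_to_body` and its `drop_last` flag are hypothetical names, not part of Camelot. It assumes the real header row can be found with the same "Date of Transaction" pattern used above, and that the extra 11th column holds text only in the unwanted first and last rows, so it is empty once those rows are gone.

```python
import pandas as pd

# Same whitespace-tolerant pattern as in the function above.
HEADER_PATTERN = r"\b" + r"\s*".join("DateofTransaction") + r"\b"


def trim_to_body(df, header_pattern=HEADER_PATTERN, drop_last=False):
    """Hypothetical helper: keep the header row and everything below it,
    optionally drop the last row, then remove columns that are left empty."""
    first_col = df.iloc[:, 0].astype(str)
    hits = first_col.str.contains(header_pattern, case=False, regex=True)
    if hits.any():
        start = hits.to_numpy().argmax()  # position of the first matching row
        df = df.iloc[start:]
    if drop_last and len(df) > 1:
        df = df.iloc[:-1]
    # Assumption: columns created only by the misaligned rows are now empty.
    empty = df.apply(lambda col: col.astype(str).str.strip().eq("").all())
    df = df.loc[:, ~empty].reset_index(drop=True)
    df.columns = range(df.shape[1])
    return df


# Made-up data: a banner row and a footer row force a fourth column.
raw = pd.DataFrame([
    ["", "", "", "Account summary"],
    ["Date of Transaction", "Description", "Amount", ""],
    ["01-01-2024", "Deposit", "100.00", ""],
    ["02-01-2024", "Withdrawal", "40.00", ""],
    ["", "", "", "Page 1 of 3"],
])
body = trim_to_body(raw, drop_last=True)
print(body)
```

This only helps when the unwanted rows do not also shift the text of the body rows into the wrong columns; if they do, the extraction area itself has to be restricted.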
Is there any way to search for the text that appears in those rows, or use anything else, to exclude the first and last rows during extraction so that Camelot focuses on the main table (the part containing the data we are interested in)?
I hope this makes sense. If not, let me know and I will gladly explain further. If you are able to add this to Camelot, it would make the library even more powerful.
Thanks
> Is there any way we can search the appearing text or anything else to exclude that first row and last while extraction so that Camelot focuses on the main table (containing data we are interested in)?

Have you tried setting table regions?
Or table areas?
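To make that suggestion concrete, here is a minimal sketch of restricting extraction to the body of the table with Camelot's `table_areas` argument to `camelot.read_pdf`. Each area is a string `"x1,y1,x2,y2"`, where (x1, y1) is the top-left and (x2, y2) the bottom-right corner in PDF coordinates (origin at the bottom-left of the page). The helpers `body_area` and `read_body`, the file name, and all coordinates are hypothetical placeholders; the real values have to be measured on the actual PDF.

```python
def body_area(table_bbox, skip_top=0.0, skip_bottom=0.0):
    """Hypothetical helper: shrink a table bounding box vertically and
    format it as a Camelot area string.

    table_bbox is (x1, y1, x2, y2) in PDF points, with (x1, y1) the top-left
    and (x2, y2) the bottom-right corner, so y1 > y2.
    """
    x1, y1, x2, y2 = table_bbox
    top = y1 - skip_top        # move the top edge down past the first row
    bottom = y2 + skip_bottom  # move the bottom edge up past the footer row
    if top <= bottom:
        raise ValueError("nothing left of the table after trimming")
    return f"{x1:g},{top:g},{x2:g},{bottom:g}"


def read_body(pdf_path, pages, area, flavor="lattice"):
    """Run Camelot on the given area only (requires Camelot to be installed)."""
    import camelot  # imported lazily so body_area works without Camelot
    return camelot.read_pdf(pdf_path, pages=pages, flavor=flavor, table_areas=[area])


# Placeholder numbers: a table spanning y=750 (top) to y=60 (bottom),
# with a 25 pt first row and a 20 pt footer row to skip.
area = body_area((30, 750, 565, 60), skip_top=25, skip_bottom=20)
print(area)  # 30,725,565,80
# tables = read_body("statement.pdf", pages="2", area=area)
```

Because the same area applies to every page passed in `pages`, a table whose position differs between the first page and the continuation pages would probably need one call per page with its own area. `table_regions` takes strings in the same format, but it tells Camelot where to look for tables rather than fixing their exact boundaries.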