This project extracts data from PDF files and converts it to CSV format. It uses Amazon Textract, AWS Lambda, and Amazon S3 to automate the extraction process.
The goal is to save the extracted data as CSV so that it is easy to analyze and process further. Extraction runs asynchronously through Amazon Textract, which detects text and structured data such as tables and form fields in scanned documents.
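
As a rough illustration of the asynchronous flow, the sketch below starts a Textract job for a PDF stored in S3 and then polls until the results are ready. It assumes the table-oriented `start_document_analysis` / `get_document_analysis` pair from the Textract API; the project itself may use text detection instead. The function names, the polling interval, and the attempt limit are hypothetical choices made for this example.

```python
import time


def start_table_extraction(textract, bucket, key):
    """Start an asynchronous Textract analysis job and return its job ID.

    `textract` is a Textract client, e.g. boto3.client("textract").
    """
    response = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES"],  # assumption: tables are the data of interest
    )
    return response["JobId"]


def collect_blocks(textract, job_id, delay_seconds=5.0, max_attempts=60):
    """Poll a Textract job until it finishes, then return all result blocks."""
    for _ in range(max_attempts):
        page = textract.get_document_analysis(JobId=job_id)
        status = page["JobStatus"]
        if status == "IN_PROGRESS":
            time.sleep(delay_seconds)
            continue
        if status != "SUCCEEDED":
            raise RuntimeError(f"Textract job {job_id} ended with status {status}")
        # Results are paginated; follow NextToken until every page is read.
        blocks = list(page.get("Blocks", []))
        while "NextToken" in page:
            page = textract.get_document_analysis(
                JobId=job_id, NextToken=page["NextToken"]
            )
            blocks.extend(page.get("Blocks", []))
        return blocks
    raise TimeoutError(f"Textract job {job_id} did not finish in time")
```

Polling keeps the example short. For long documents, a common alternative is to pass a `NotificationChannel` when starting the job so that Textract publishes a completion message to an SNS topic instead.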
- Upload PDF Files: Place the PDF files in the designated location.
- Extraction Process: Extraction starts automatically once the files are in place.
- CSV Output: The extracted data is saved as one CSV file per PDF file.
- Accessing Results: Retrieve the CSV files for further analysis and use.
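
The steps above could be wired together roughly as follows. This is a minimal sketch, assuming the designated location is an S3 bucket whose object-created notifications invoke a Lambda function. The names `pdf_objects_from_event`, `csv_key_for`, and `handler` are hypothetical, as is the choice to write each CSV next to its source PDF.

```python
import posixpath
from urllib.parse import unquote_plus


def pdf_objects_from_event(event):
    """Return (bucket, key) pairs for every PDF named in an S3 event."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.lower().endswith(".pdf"):
            objects.append((bucket, key))
    return objects


def csv_key_for(pdf_key):
    """Derive the output key by swapping the file extension for .csv."""
    return posixpath.splitext(pdf_key)[0] + ".csv"


def handler(event, context):
    """Lambda entry point: start one Textract job per uploaded PDF."""
    import boto3  # imported lazily so the helpers above work without the AWS SDK

    textract = boto3.client("textract")
    job_ids = []
    for bucket, key in pdf_objects_from_event(event):
        response = textract.start_document_analysis(
            DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
            FeatureTypes=["TABLES"],  # assumption: tables are the data of interest
        )
        job_ids.append(response["JobId"])
    return {"started_jobs": job_ids}
```

Because the handler ignores keys that do not end in `.pdf`, writing the CSV output back to the same bucket would not start another extraction in this sketch.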
- AWS account with permissions to use Textract, Lambda, and S3.
- AWS CLI installed and configured.
- Python installed for deploying AWS Lambda functions.
- Amazon Textract: Service for extracting text and data from scanned documents.
- AWS Lambda: Serverless compute service that runs code without managing servers.
- Amazon S3: Scalable object storage service for storing data.
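
To show how Textract output might become CSV, here is a minimal sketch that turns the `Blocks` list from a table analysis result into CSV text. It relies on the documented block structure (`TABLE` blocks pointing to `CELL` blocks, which point to `WORD` blocks through `CHILD` relationships). The function names are hypothetical, and writing all tables of a document into a single CSV is an assumption made for this example.

```python
import csv
import io


def cell_text(cell, by_id):
    """Join the words that belong to one table cell."""
    words = []
    for rel in cell.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id.get(child_id, {})
            if child.get("BlockType") == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def table_rows(table, by_id):
    """Rebuild a table as a list of rows from its CELL blocks."""
    cells = {}
    for rel in table.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for cell_id in rel["Ids"]:
            cell = by_id.get(cell_id, {})
            if cell.get("BlockType") == "CELL":
                position = (cell["RowIndex"], cell["ColumnIndex"])
                cells[position] = cell_text(cell, by_id)
    if not cells:
        return []
    n_rows = max(row for row, _ in cells)
    n_cols = max(col for _, col in cells)
    # Textract row and column indexes start at 1; missing cells become "".
    return [
        [cells.get((row, col), "") for col in range(1, n_cols + 1)]
        for row in range(1, n_rows + 1)
    ]


def blocks_to_csv(blocks):
    """Write every table found in the blocks to one CSV string."""
    by_id = {block["Id"]: block for block in blocks}
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for block in blocks:
        if block["BlockType"] == "TABLE":
            writer.writerows(table_rows(block, by_id))
    return buffer.getvalue()
```

The resulting string could then be stored in S3, for example with the S3 client's `put_object` call, so that each PDF ends up with a matching CSV file.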
Contributions are welcome! If you encounter issues or have suggestions, please open an issue on the GitHub repository.