Building a Finance AI Chatbot — Struggling with Web Data Extraction

Hello everyone, I’m currently working on an AI project where we’re building a chatbot (similar to ChatGPT) designed to answer finance-related questions. Before the model can respond intelligently, we first need to gather and structure the relevant financial data and documents (PDFs, reports, policy summaries, etc.) from various websites, mainly those of insurers and financial institutions.

At this stage, I’m using Power Automate to automate part of the data extraction process. However, this approach is becoming increasingly difficult and limiting, especially when it comes to:

Navigating websites with multiple levels of pages

Downloading files like PDFs or Excel documents

Extracting structured and unstructured data from dynamic websites

I’m wondering:

What tools or methods would you recommend to efficiently extract all types of content (text, files, tables, etc.) from websites?

Has anyone here worked on a similar project involving web scraping for finance or AI knowledge bases?

Thanks for your help!


Did you explore Thunderbit or ParseHub?
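If a code-based route is an option, the multi-level navigation and file download parts of the question can also be handled with a small crawler. Below is a minimal sketch in Python using only the standard library. It takes the HTML of one page and separates links to documents (PDF and Excel files) from same-site pages worth visiting next. The names `LinkCollector`, `classify_links`, and `DOCUMENT_EXTENSIONS`, as well as the sample HTML and the `insurer.example.com` URL, are made up for illustration. Fetching the pages and saving the files are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# File types mentioned in the question; extend as needed (assumption).
DOCUMENT_EXTENSIONS = (".pdf", ".xls", ".xlsx")


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def classify_links(html, base_url):
    """Split a page's links into document URLs and same-site page URLs."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    documents, pages = [], []
    for href in collector.hrefs:
        url = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if parsed.path.lower().endswith(DOCUMENT_EXTENSIONS):
            documents.append(url)  # candidate file to download
        elif parsed.netloc == base_host:
            pages.append(url)  # candidate page to crawl next
    return documents, pages


# Hypothetical page content standing in for a real insurer site.
sample_html = """
<a href="/reports/annual-2023.pdf">Annual report</a>
<a href="rates.xlsx">Rates</a>
<a href="/policies/">Policies</a>
<a href="https://other.example.org/">Partner</a>
<a href="mailto:info@example.com">Contact</a>
"""
documents, pages = classify_links(
    sample_html, "https://insurer.example.com/docs/index.html"
)
print(documents)
print(pages)
```

Calling `classify_links` again on each URL in `pages` (with a set of already visited URLs and a depth limit) gives the multi-level crawl. Note that this only sees links present in the raw HTML. For dynamic sites that build their content with JavaScript, a browser automation tool such as Playwright or Selenium would probably be needed to render the page first.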
