Learning Timeline
Key Insights
Model Selection Based on Task
Reserve large multimodal models (such as Gemini) for complex visual tasks, and use smaller, cheaper models for structured data extraction, such as Form 4 extraction, to keep operating costs down.
Smart Routing Techniques
Don't process every file the same way. Classifying each file early in the pipeline (the router step) lets you apply a different prompt and model to each document type, which makes the pipeline more efficient and more accurate.
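One way to picture the router is a small lookup table that maps each document type to its own model and prompt. The sketch below is illustrative only: the `Route` dataclass, the `pick_route` helper, the model names, and the prompt strings are all placeholders, not values from the original pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str   # hypothetical model identifier
    prompt: str  # hypothetical task prompt


# Placeholder names: swap in the models and prompts you actually use.
ROUTES = {
    "SEC filing": Route("small-text-model", "Extract the Form 4 fields."),
    "Contract": Route("standard-text-model", "Summarize the contract recursively."),
    "City Infrastructure Image": Route("large-vision-model", "Describe the infrastructure shown."),
}


def pick_route(document_type: str) -> Route:
    """Return the model/prompt pair configured for a classified document type."""
    try:
        return ROUTES[document_type]
    except KeyError:
        # Fail loudly instead of silently falling back to an expensive model.
        raise ValueError(f"No route configured for {document_type!r}") from None
```

Keeping the mapping in one place means that adding a document type, or moving one to a cheaper model, is a one-line change.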
Prompts
Document Classification Logic
Target: DSPy Signature / LLM
Given these images (the first few pages of a document), identify the type. Options: [SEC filing, Contract, City Infrastructure Image]. Return only the document type.
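Because the prompt asks the model to return only the document type, it can help to check the reply against the allowed options before routing on it. The following is a minimal plain-Python sketch of that check; `build_prompt` and `parse_document_type` are hypothetical helper names, and in a DSPy program this constraint would more likely be declared on the signature's output field than handled by hand.

```python
DOCUMENT_TYPES = ["SEC filing", "Contract", "City Infrastructure Image"]


def build_prompt(options=DOCUMENT_TYPES) -> str:
    """Assemble the classification prompt from the list of allowed types."""
    return (
        "Given these images (the first few pages of a document), identify the type. "
        f"Options: [{', '.join(options)}]. Return only the document type."
    )


def parse_document_type(reply: str, options=DOCUMENT_TYPES) -> str:
    """Map a raw model reply onto one of the allowed labels, or raise."""
    # Tolerate surrounding whitespace, quotes, a trailing period, and case differences.
    cleaned = reply.strip().strip("\"'.").casefold()
    for option in options:
        if cleaned == option.casefold():
            return option
    raise ValueError(f"Unexpected document type: {reply!r}")
```

Rejecting unexpected replies at this point keeps a malformed classification from silently sending a document down the wrong branch.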
Step by Step
Building an Automated File Classification Pipeline with DSPy
- Import the DSPy and attachments libraries to handle various media input types uniformly.
- Configure API keys for two kinds of model: a standard text LLM and a vision model (such as Gemini) for image recognition.
- Use the 'classify_file' function to send documents (PDFs or images) to the DSPy program.
- Extract the first three pages or the first few images from the source document to serve as input fields.
- Define 'Document Type' as the signature output to determine if the file is an SEC filing, a contract, or an infrastructure image.
- Use 'if-else' or 'switch' logic based on the classification results to determine the routing for the subsequent process.
- For SEC filings: run the 'form4 extraction' function with a smaller model to reduce costs.
- For contracts: call the 'recursive summarization' function to summarize the entire document and detect document boundaries.
- For infrastructure images: send the file to a vision model for deeper visual interpretation.
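The steps above can be sketched in plain Python with the model calls stubbed out. This is a rough outline under stated assumptions, not the original implementation: the `classifier` argument stands in for the DSPy program, pages are represented as strings, and the three handler functions are hypothetical stubs named after the steps in the list.

```python
from typing import Callable, Dict, Sequence, Tuple

Classifier = Callable[[Sequence[str]], str]
Handler = Callable[[Sequence[str]], str]


def classify_file(pages: Sequence[str], classifier: Classifier, max_pages: int = 3) -> str:
    """Classify a document from its first few pages only."""
    return classifier(list(pages[:max_pages]))


def route_document(
    pages: Sequence[str], classifier: Classifier, handlers: Dict[str, Handler]
) -> Tuple[str, str]:
    """Classify the document, then pass the full document to the matching handler."""
    doc_type = classify_file(pages, classifier)
    if doc_type not in handlers:
        raise ValueError(f"No handler for document type {doc_type!r}")
    return doc_type, handlers[doc_type](pages)


# Hypothetical stubs: each would call a different model in a real pipeline.
def extract_form4(pages: Sequence[str]) -> str:
    return f"form4 extraction on {len(pages)} pages"


def summarize_contract(pages: Sequence[str]) -> str:
    return f"recursive summary of {len(pages)} pages"


def interpret_image(pages: Sequence[str]) -> str:
    return f"visual interpretation of {len(pages)} images"


HANDLERS: Dict[str, Handler] = {
    "SEC filing": extract_form4,
    "Contract": summarize_contract,
    "City Infrastructure Image": interpret_image,
}
```

Only the classifier sees the truncated input; the selected handler still receives the whole document, which matters for the contract branch because it summarizes the entire file.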