Docs
Note: Do not move or rename any files during installation, setup, or usage. The scripts rely on the original file names and locations.
prerequisites
gemini api key
You can get a free Gemini API key here. The code defaults to the Gemini 2.0 Flash model for its higher free-tier limits.
Gemini 2.0 Flash should be suitable for analyzing 50-100 cases/day depending on the length of the case decisions. See model options here and rate limits here.
installation
# Clone the repo
git clone https://github.com/shriyanyamali/Lextract.git
# Change into the project directory
cd Lextract
# Install the required packages
pip install -r requirements.txt
setup
1. Remove .gitkeep files:
# macOS / Linux
rm json/.gitkeep data/extracted_batches/.gitkeep data/extracted_sections/.gitkeep
# PowerShell
Remove-Item json/.gitkeep, data/extracted_batches/.gitkeep, data/extracted_sections/.gitkeep -Force
# Command Prompt
del json\.gitkeep data\extracted_batches\.gitkeep data\extracted_sections\.gitkeep
2. Go to competition-cases.ec.europa.eu/search and export the Merger and Antitrust cases you want to process.
3. Rename the exported Excel file to cases.xlsx and move it into the data directory.
4. Open the run_pipeline.py script. On line 33, follow the instructions and set CHUNKS_SIZE equal to 79, 80, or both.
5. Set the GEMINI_API_KEY Environment Variable:
# macOS / Linux
export GEMINI_API_KEY="your-api-key-here"
# PowerShell
$env:GEMINI_API_KEY="your-api-key-here"
# Command Prompt
set GEMINI_API_KEY=your-api-key-here
6. Run the pipeline: python run_pipeline.py
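Before running the pipeline, it can help to confirm that the setup steps above actually took effect. The following is a minimal sketch of such a check, not part of Lextract itself: the names GITKEEP_DIRS, remove_gitkeep, and preflight are hypothetical, and the only paths and variables it relies on are the ones named in the steps above (the three .gitkeep locations, data/cases.xlsx, and GEMINI_API_KEY). It does not check CHUNKS_SIZE, since that is set inside run_pipeline.py.

```python
import os
from pathlib import Path

# Directories named in step 1; each ships with a placeholder .gitkeep file.
GITKEEP_DIRS = ("json", "data/extracted_batches", "data/extracted_sections")


def remove_gitkeep(root="."):
    """Delete the .gitkeep placeholders from step 1; return the paths removed."""
    removed = []
    for directory in GITKEEP_DIRS:
        placeholder = Path(root) / directory / ".gitkeep"
        if placeholder.is_file():
            placeholder.unlink()
            removed.append(placeholder)
    return removed


def preflight(root=".", env=None):
    """Return a list of setup problems; an empty list means nothing was found."""
    env = os.environ if env is None else env
    problems = []
    # Step 1: no .gitkeep files should remain.
    for directory in GITKEEP_DIRS:
        if (Path(root) / directory / ".gitkeep").exists():
            problems.append(f"{directory}/.gitkeep is still present")
    # Step 3: the exported spreadsheet must be at data/cases.xlsx.
    if not (Path(root) / "data" / "cases.xlsx").is_file():
        problems.append("data/cases.xlsx not found")
    # Step 5: the API key must be set (and non-empty) in the environment.
    if not env.get("GEMINI_API_KEY"):
        problems.append("GEMINI_API_KEY is not set")
    return problems


if __name__ == "__main__":
    # Run from the project root (the Lextract directory).
    remove_gitkeep()
    issues = preflight()
    for issue in issues:
        print("problem:", issue)
    if not issues:
        print("ready: python run_pipeline.py")
```

Because it uses pathlib rather than rm, Remove-Item, or del, the same script should behave the same way on macOS, Linux, and Windows. It should be saved outside the directories the pipeline reads from so that it does not interfere with the original file layout.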
testing
Run all tests: pytest -q
Run all tests with coverage report: pytest --cov=scripts --cov=tests
make coverage # Run tests with coverage report
make format # Auto-format code
make lint # Lint code
make clean # Remove __pycache__ and test artifacts
For more, see the GitHub README.