Docs

Note: Do not move or rename any files during installation, setup, or use. The scripts rely on the original file names and locations.

prerequisites

gemini api key

You can get a free Gemini API key here. The code defaults to the Gemini 2.0 Flash model for its higher free-tier limits.

Gemini 2.0 Flash should be suitable for analyzing 50-100 cases/day depending on the length of the case decisions. See model options here and rate limits here.

installation

# Clone the repo

git clone https://github.com/shriyanyamali/Lextract.git

# Change into the project directory

cd Lextract

# Install the required packages

pip install -r requirements.txt

setup

1. Remove .gitkeep files:

# macOS / Linux

rm json/.gitkeep data/extracted_batches/.gitkeep data/extracted_sections/.gitkeep

# PowerShell

Remove-Item json/.gitkeep, data/extracted_batches/.gitkeep, data/extracted_sections/.gitkeep -Force

# Command Prompt

del json\.gitkeep data\extracted_batches\.gitkeep data\extracted_sections\.gitkeep

2. Go to competition-cases.ec.europa.eu/search and export the Merger and Antitrust cases you want to process.

3. Rename the exported Excel file to cases.xlsx and move it into the data directory.

4. Open the run_pipeline.py script. On line 33, follow the instructions and set CHUNKS_SIZE equal to 79, 80, or both.

5. Set the GEMINI_API_KEY environment variable:

# macOS / Linux

export GEMINI_API_KEY="your-api-key-here"

# PowerShell

$env:GEMINI_API_KEY="your-api-key-here"

# Command Prompt

set GEMINI_API_KEY=your-api-key-here

6. Run the pipeline: python run_pipeline.py
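
Before running the pipeline, it can help to confirm that the setup steps above were completed. The following is a minimal sketch, not part of the Lextract repository: the helper names remove_gitkeep_files and preflight_problems are hypothetical, and the only facts it relies on are the paths and the environment variable name given in the steps above. It does not check the CHUNKS_SIZE setting from step 4.

```python
import os
from pathlib import Path

# Paths and variable name taken from the setup steps, relative to the project root.
GITKEEP_FILES = (
    "json/.gitkeep",
    "data/extracted_batches/.gitkeep",
    "data/extracted_sections/.gitkeep",
)
CASES_FILE = "data/cases.xlsx"
API_KEY_VAR = "GEMINI_API_KEY"


def remove_gitkeep_files(root="."):
    """Hypothetical helper: delete the .gitkeep placeholders (step 1).

    Works the same on macOS, Linux, and Windows. Returns the relative
    paths that were actually removed.
    """
    removed = []
    for relative in GITKEEP_FILES:
        path = Path(root) / relative
        if path.is_file():
            path.unlink()
            removed.append(relative)
    return removed


def preflight_problems(root=".", env=None):
    """Hypothetical helper: list setup problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    # Step 1: the placeholder files should be gone.
    for relative in GITKEEP_FILES:
        if (Path(root) / relative).exists():
            problems.append(f"{relative} has not been removed")
    # Step 3: the exported spreadsheet should be in the data directory.
    if not (Path(root) / CASES_FILE).is_file():
        problems.append(f"{CASES_FILE} is missing")
    # Step 5: the API key should be set to a non-empty value.
    if not env.get(API_KEY_VAR):
        problems.append(f"{API_KEY_VAR} is not set")
    return problems
```

Run from the project directory, remove_gitkeep_files() performs step 1 on any platform, and preflight_problems() returns a list of anything still missing. An empty list suggests python run_pipeline.py is ready to run.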

testing

Run all tests: pytest -q

Run all tests with coverage report: pytest --cov=scripts --cov=tests

Make targets are also available:

make test # Run all tests
make coverage # Run tests with coverage report
make format # Auto-format code
make lint # Lint code
make clean # Remove __pycache__ and test artifacts

For more, see the GitHub README.