Lextract: An Automated Market Definition Extractor
Lextract is a Python pipeline that automatically extracts relevant market definitions from the European Commission’s merger and antitrust case decisions. Designed for researchers and competition law experts, it makes extracting market definitions from many cases at once fast and scalable.
what it does
Lextract extracts relevant market definitions from decisions scraped from the European Commission’s case search portal. It uses pandas, PyPDF2, openpyxl, and Google Gemini to automate the extraction, making it practical to analyze large volumes of case data.
how it works
The pipeline has six steps: 1) Scrape decision PDF links and the corresponding metadata. 2) Extract text from the PDFs and exclude irrelevant cases. 3) Use Google Gemini to extract the market definitions section. 4) Isolate each individual market definition and save it as a JSON file. 5) Remove markdown fences from the JSON output. 6) Combine the JSON files into one file.
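Steps 5 and 6 can be illustrated with a minimal sketch that uses only the Python standard library. This is not Lextract's actual code: the function names `strip_markdown_fences` and `combine_json_files`, the directory layout, and the assumption that each per-case file holds either a JSON list or a single JSON object are all hypothetical and chosen for illustration.

```python
import json
import re
from pathlib import Path

# Matches one surrounding markdown fence (for example ```json ... ```) and captures its body.
FENCE_RE = re.compile(r"^\s*```[a-zA-Z0-9_-]*\s*\n(.*?)\n?\s*```\s*$", re.DOTALL)


def strip_markdown_fences(raw: str) -> str:
    """Return the text inside a surrounding markdown fence, or the trimmed text if unfenced."""
    match = FENCE_RE.match(raw)
    return match.group(1).strip() if match else raw.strip()


def combine_json_files(input_dir: Path, output_file: Path) -> list:
    """Merge every *.json file in input_dir into one list and write it to output_file.

    Assumption: each file holds a JSON list of records or a single JSON object.
    """
    combined = []
    for path in sorted(input_dir.glob("*.json")):
        text = strip_markdown_fences(path.read_text(encoding="utf-8"))
        data = json.loads(text)
        combined.extend(data if isinstance(data, list) else [data])
    output_file.write_text(
        json.dumps(combined, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return combined
```

In this sketch, writing the combined file outside `input_dir` avoids picking it up as an input on a later run.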
research applications
This tool is intended to support research in competition law, antitrust policy, mergers, and economic regulation. Outputs can be filtered, extended, applied, or repurposed to fit a wide range of empirical and legal research projects.
get started
Lextract is easy to run on your machine. All you need is Git, Python, a Gemini API key, and a list of cases. Follow the documentation to get started.
example output
[
  {
    "case_number": "M.1234",
    "year": "2019",
    "policy_area": "Merger",
    "link": "https://ec.europa.eu/decision/path.pdf",
    "topic": "Scope of phone markets",
    "text": "In the scope of the phone markets was..."
  },
  {
    "case_number": "M.2345",
    "year": "2023",
    "policy_area": "Merger",
    "link": "https://ec.europa.eu/decision/path.pdf",
    "topic": "Relevant product markets of...",
    "text": "The relevant product markets are the..."
  }
]
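As a rough illustration of how output in this shape might be filtered for a research project, the sketch below loads the records and selects them by policy area and year. The helper names and the file name `combined.json` are hypothetical. The field names come from the example output, where `year` is stored as a string, so it is converted to an integer before comparison.

```python
import json
from collections import defaultdict


def load_definitions(path):
    """Load a list of market definition records from a JSON file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def filter_definitions(records, policy_area=None, min_year=None):
    """Keep records matching the policy area and dated min_year or later."""
    selected = []
    for rec in records:
        if policy_area is not None and rec.get("policy_area") != policy_area:
            continue
        # "year" is a string in the example output, so convert before comparing.
        if min_year is not None and int(rec.get("year", 0)) < min_year:
            continue
        selected.append(rec)
    return selected


def group_topics_by_case(records):
    """Map each case number to the list of market definition topics found in it."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["case_number"]].append(rec["topic"])
    return dict(grouped)
```

For example, `filter_definitions(load_definitions("combined.json"), policy_area="Merger", min_year=2020)` would keep only merger decisions from 2020 onward, assuming the combined output was saved under that name.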
usage example
This code was used to create the database for JurisMercatus, a market definition database with semantic search capabilities.
license
Lextract and this website are licensed under the AGPL-3.0 License. View the full license.