Lextract: An Automated Market Definition Extractor

Lextract is a Python pipeline that automatically extracts relevant market definitions from the European Commission’s merger and antitrust case decisions. Designed for researchers and competition law experts, it makes extracting market definitions from many cases at once quick and scalable.

what it does

Lextract extracts relevant market definitions from decisions scraped from the European Commission’s case search portal. It uses pandas, PyPDF2, openpyxl, and Google Gemini to automate the process, making it practical to analyze large volumes of case data.

how it works

The pipeline has six steps: 1) Scrape decision PDF links and the corresponding metadata. 2) Extract text from the PDFs and exclude irrelevant cases. 3) Use Google Gemini to extract the market definitions section. 4) Isolate each individual market definition and save it as a JSON file. 5) Remove markdown fences from the JSON output. 6) Combine the JSON files into one file.
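As an illustration of steps 5 and 6, here is a minimal sketch using only the Python standard library. It is not Lextract's actual implementation: the function names `strip_markdown_fences` and `combine_json_files`, the directory layout, and the assumption that each file holds either one record or a list of records are all hypothetical.

```python
import json
import re
from pathlib import Path

# Matches text wrapped in a Markdown code fence, e.g. ```json ... ```
# (an assumption about how the model output is wrapped).
_FENCE_RE = re.compile(r"^\s*```[a-zA-Z0-9_-]*\s*\n(.*?)\n?\s*```\s*$", re.DOTALL)


def strip_markdown_fences(raw: str) -> str:
    """Return the text inside a surrounding Markdown fence, or the text itself."""
    match = _FENCE_RE.match(raw)
    return match.group(1).strip() if match else raw.strip()


def combine_json_files(input_dir: Path, output_path: Path) -> list:
    """Merge every *.json file in input_dir into one list and write it out."""
    combined = []
    for path in sorted(input_dir.glob("*.json")):
        cleaned = strip_markdown_fences(path.read_text(encoding="utf-8"))
        records = json.loads(cleaned)
        # Accept either a single record or a list of records per file.
        combined.extend(records if isinstance(records, list) else [records])
    output_path.write_text(
        json.dumps(combined, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return combined
```

Writing the combined file outside `input_dir` keeps it from being picked up as an input on a later run.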

research applications

This tool is intended to support research in competition law, antitrust policy, mergers, and economic regulation. Outputs can be filtered, extended, applied, or repurposed to fit a wide range of empirical and legal research projects.

get started

Lextract is easy to run on your machine. All you need is Git, Python, a Gemini API key, and a list of cases. Follow the documentation to get started.

example output

[
  {
    "case_number": "M.1234",
    "year": "2019",
    "policy_area": "Merger",
    "link": "https://ec.europa.eu/decision/path.pdf",
    "topic": "Scope of phone markets",
    "text": "In the scope of the phone markets was..."
  },
  {
    "case_number": "M.2345",
    "year": "2023",
    "policy_area": "Merger",
    "link": "https://ec.europa.eu/decision/path.pdf",
    "topic": "Relevant product markets of...",
    "text": "The relevant product markets are the..."
  }
]
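Records in this shape can be filtered with a few lines of Python. The sketch below is illustrative only: `filter_records` is a hypothetical helper, not part of Lextract, and it assumes the field names shown in the example above, with `year` stored as a string.

```python
import json


def filter_records(records, policy_area=None, year=None):
    """Keep records matching the given policy area and/or year (both optional)."""
    return [
        r for r in records
        if (policy_area is None or r.get("policy_area") == policy_area)
        # "year" is a string in the example output, so compare as strings.
        and (year is None or r.get("year") == str(year))
    ]


# Trimmed copy of the example output above.
sample = json.loads("""
[
  {"case_number": "M.1234", "year": "2019", "policy_area": "Merger"},
  {"case_number": "M.2345", "year": "2023", "policy_area": "Merger"}
]
""")

recent = filter_records(sample, policy_area="Merger", year=2023)
print([r["case_number"] for r in recent])  # prints ['M.2345']
```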

usage example

This code was used to create the database for JurisMercatus, a market definition database with semantic search capabilities.

license

Lextract and this website are licensed under the AGPL-3.0 License. View the full license.