PIF Compiler - Comsoguard API

Project Overview

PIF Compiler (branded as Comsoguard API) is a cosmetic safety assessment tool that compiles Product Information Files (PIFs) for cosmetic products, as required by EU Regulation (EC) No 1223/2009. It aggregates toxicological, regulatory, and chemical data from multiple external sources to evaluate ingredient safety (DAP calculations, SED/MoS computations, NOAEL extraction).

The code is written in Python 3.12. Variable names, comments, and some log messages are in Italian.

Tech Stack

Framework: FastAPI + Uvicorn
Package manager: uv (with pyproject.toml + uv.lock)
Data models: Pydantic v2
Databases: MongoDB (substance data cache via pymongo) + PostgreSQL (product presets, search logs, compilers via psycopg2)
Web scraping: Playwright (PDF generation, browser automation), BeautifulSoup4 (HTML parsing)
External APIs: PubChem (pubchempy + pubchemprops), COSING (EU Search API), ECHA (chem.echa.europa.eu)
Logging: Python logging with rotating file handlers (logs/debug.log, logs/error.log)

Project Structure

src/pif_compiler/
├── main.py                      # FastAPI app, middleware, exception handlers, routers
├── __init__.py
├── api/
│   └── routes/
│       ├── api_echa.py          # ECHA endpoints (single + batch search)
│       ├── api_cosing.py        # COSING endpoints (single + batch search)
│       └── common.py            # PDF generation, PubChem, CIR search endpoints
├── classes/
│   └── models.py                # Pydantic models: Ingredient, DapInfo, CosingInfo,
│                                #   ToxIndicator, Toxicity, Esposition, RetentionFactors
├── functions/
│   ├── common_func.py           # PDF generation with Playwright
│   ├── common_log.py            # Centralized logging configuration
│   └── db_utils.py              # MongoDB + PostgreSQL connection helpers
└── services/
    ├── srv_echa.py              # ECHA scraping, HTML parsing, toxicology extraction,
    │                            #   orchestrator (validate -> check cache -> fetch -> store)
    ├── srv_cosing.py            # COSING API search + data cleaning
    ├── srv_pubchem.py           # PubChem property extraction (DAP data)
    └── srv_cir.py               # CIR (Cosmetic Ingredient Review) database search

Other directories

data/ - Input data files (input.json with sample INCI/CAS/percentage lists), old CSV data
logs/ - Rotating log files (debug.log, error.log) - auto-generated
pdfs/ - Generated PDF files from ECHA dossier pages
marimo/ - Debug/test notebooks, not part of the main application; ignore this folder

Architecture & Data Flow

Core workflow (per ingredient)

Input: CAS number (and optionally INCI name + percentage)
COSING (srv_cosing.py): Search EU cosmetic ingredients database for regulatory restrictions, INCI match, annex references
ECHA (srv_echa.py): Search substance -> get dossier -> parse HTML index -> extract toxicological data (NOAEL, LD50, LOAEL) from acute & repeated dose toxicity pages
PubChem (srv_pubchem.py): Get molecular weight, XLogP, TPSA, melting point, dissociation constants
DAP calculation (DapInfo model): Dermal Absorption Percentage estimated from molecular properties (MW > 500 Da, LogP, TPSA > 120 Å², etc.)
Toxicity ranking (Toxicity model): Best toxicological indicator selection with priority (NOAEL > LOAEL > LD50) and safety factors
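The last two steps can be sketched in plain Python. This is a minimal illustration, not the project's `DapInfo` or `Toxicity` code: the function names, the LogP bounds, the 10%/50% outcomes, the "any one criterion is enough" rule, and the tie-break on the lowest value are all assumptions. Only the MW > 500 and TPSA > 120 thresholds and the NOAEL > LOAEL > LD50 priority come from the description above. Safety factors are left out because their values are not stated here.

```python
from dataclasses import dataclass
from typing import Optional

# Priority taken from the workflow description: NOAEL > LOAEL > LD50.
PRIORITY = ("NOAEL", "LOAEL", "LD50")


@dataclass
class Indicator:
    kind: str     # "NOAEL", "LOAEL" or "LD50"
    value: float  # mg/kg bw/day (unit assumed for this sketch)


def estimate_dap(mw: float, logp: float, tpsa: float) -> float:
    """Return an ASSUMED dermal absorption fraction (0.10 or 0.50)."""
    poorly_absorbed = (
        mw > 500                    # threshold from the text
        or tpsa > 120               # threshold from the text
        or logp <= -1 or logp >= 4  # ASSUMED bounds, not from the text
    )
    return 0.10 if poorly_absorbed else 0.50  # ASSUMED outcomes


def best_indicator(indicators: list[Indicator]) -> Optional[Indicator]:
    """Pick the highest-priority indicator; ties go to the lowest value."""
    candidates = [i for i in indicators if i.kind in PRIORITY]
    if not candidates:
        return None
    return min(candidates, key=lambda i: (PRIORITY.index(i.kind), i.value))
```

Sorting on the tuple `(priority index, value)` keeps the selection rule in one place: a NOAEL always beats a LOAEL, and among several NOAELs the most conservative (lowest) one is chosen.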

Caching strategy

ECHA results are cached in MongoDB (toxinfo.substance_index collection) keyed by substance.rmlCas
The orchestrator checks local cache before making external requests
Search history is logged to PostgreSQL (logs.search_history table)
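The cache-first lookup can be sketched as follows. This is a hedged illustration of the validate -> check cache -> fetch -> store flow, not the code in srv_echa.py: a plain `dict` stands in for the MongoDB collection, `fetch` stands in for the ECHA scraping call, and `is_valid_cas` and `get_substance` are hypothetical names. The check-digit rule is the standard CAS one; whether the project validates it this way is an assumption.

```python
import re
from typing import Callable, Optional

CAS_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def is_valid_cas(cas: str) -> bool:
    """Check the CAS format and its check digit."""
    match = CAS_PATTERN.fullmatch(cas)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def get_substance(
    cas: str,
    cache: dict[str, dict],
    fetch: Callable[[str], Optional[dict]],
) -> Optional[dict]:
    """validate -> check cache -> fetch -> store -> return."""
    if not is_valid_cas(cas):
        raise ValueError(f"Invalid CAS number: {cas!r}")
    if cas in cache:         # cache hit: no external request is made
        return cache[cas]
    result = fetch(cas)      # cache miss: call the external source
    if result is not None:
        cache[cas] = result  # store so the next lookup is local
    return result
```

Passing `fetch` in as a parameter keeps the sketch testable without network access; in the real service the cache lookup would be a query on `substance.rmlCas`.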

API Endpoints

All routes are under /api/v1:

| Method | Path | Description |
|--------|------|-------------|
| POST | `/echa/search` | Single ECHA substance search by CAS |
| POST | `/echa/batch-search` | Batch ECHA search for multiple CAS numbers |
| POST | `/cosing/search` | COSING search (by name, CAS, EC, or ID) |
| POST | `/cosing/batch-search` | Batch COSING search |
| POST | `/common/pubchem` | PubChem property lookup by CAS |
| POST | `/common/generate-pdf` | Generate PDF from URL via Playwright |
| GET | `/common/download-pdf/{name}` | Download a generated PDF |
| POST | `/common/cir-search` | CIR ingredient text search |
| GET | `/health`, `/ping` | Health check endpoints |

Docs available at /docs (Swagger) and /redoc.
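As a rough illustration of calling one of these routes, the sketch below builds (without sending) a POST request for `/echa/search` using only the standard library. The base URL assumes a local server on port 8000, and the `cas` field name in the JSON body is a guess; the actual request schema is defined in the route files and shown at /docs.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/api/v1"  # ASSUMED local development server


def build_echa_search(cas: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to /echa/search."""
    body = json.dumps({"cas": cas}).encode("utf-8")  # ASSUMED field name
    return urllib.request.Request(
        f"{BASE_URL}/echa/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the API running, `urllib.request.urlopen(build_echa_search("50-00-0"))` would send the request.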

Environment Variables

Configured via .env file (loaded with python-dotenv):

ADMIN_USER - MongoDB admin username
ADMIN_PASSWORD - MongoDB admin password
MONGO_HOST - MongoDB host
MONGO_PORT - MongoDB port
DATABASE_URL - PostgreSQL connection string
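One plausible way to turn the MongoDB variables into a connection string is sketched below. The helper name `mongo_uri` and the fallback to port 27017 (MongoDB's default) are assumptions; db_utils.py may assemble the connection differently. Credentials are percent-encoded so that characters such as `@` or `/` in a password do not break the URI.

```python
import os
from typing import Mapping
from urllib.parse import quote_plus


def mongo_uri(env: Mapping[str, str] = os.environ) -> str:
    """Build a MongoDB URI from the variables listed above."""
    user = quote_plus(env["ADMIN_USER"])
    password = quote_plus(env["ADMIN_PASSWORD"])
    host = env["MONGO_HOST"]
    port = env.get("MONGO_PORT", "27017")  # ASSUMED fallback
    return f"mongodb://{user}:{password}@{host}:{port}/"
```

Accepting the environment as a parameter makes the helper easy to exercise with a plain dictionary instead of the real process environment.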

Development

Setup

uv sync              # Install dependencies
playwright install   # Install browser binaries for PDF generation

Running the API

uv run python -m pif_compiler.main
# or
uv run uvicorn pif_compiler.main:app --reload --host 0.0.0.0 --port 8000

Key conventions

Services in services/ handle external API calls and data extraction
Models in classes/models.py use Pydantic @model_validator and @classmethod builders for construction from raw API data
The orchestrator pattern (see srv_echa.py) handles: validate input -> check local cache -> fetch from external -> store locally -> return
All modules use the shared logger from common_log.get_logger()
API routes define Pydantic request/response models inline in each route file
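The shared-logger convention could look roughly like this. It is a sketch only: the real `common_log.get_logger()` signature, message format, and rotation settings are not shown in this document, so the `log_dir` parameter, `maxBytes`, and `backupCount` values here are assumptions. Only the two file names (debug.log, error.log) and the use of rotating handlers come from the text.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def get_logger(name: str, log_dir: str = "logs") -> logging.Logger:
    """Return a logger writing to rotating debug.log and error.log files."""
    logger = logging.getLogger(name)
    if logger.handlers:  # already configured: avoid duplicate handlers
        return logger
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger.setLevel(logging.DEBUG)
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    targets = (("debug.log", logging.DEBUG), ("error.log", logging.ERROR))
    for filename, level in targets:
        handler = RotatingFileHandler(
            Path(log_dir) / filename,
            maxBytes=1_000_000,  # ASSUMED rotation size
            backupCount=3,       # ASSUMED number of kept backups
            encoding="utf-8",
        )
        handler.setLevel(level)  # error.log only receives ERROR and above
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger
```

A module would then call `logger = get_logger(__name__)` once at import time; repeated calls with the same name return the already configured logger.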

Important domain concepts

CAS number: Chemical Abstracts Service identifier (e.g., "50-00-0")
INCI: International Nomenclature of Cosmetic Ingredients
NOAEL: No Observed Adverse Effect Level (preferred toxicity indicator)
LOAEL: Lowest Observed Adverse Effect Level
LD50: Median Lethal Dose (the dose that is lethal to 50% of test animals)
DAP: Dermal Absorption Percentage
SED: Systemic Exposure Dose
MoS: Margin of Safety
PIF: Product Information File (EU cosmetic regulation requirement)
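To show how these quantities relate, here is a small worked sketch using the formulas commonly applied in cosmetic safety assessment: SED = daily exposure × concentration × dermal absorption, and MoS = NOAEL / SED. The function and parameter names are hypothetical, and the project's own computation (including any safety factors or unit conversions) may differ.

```python
def systemic_exposure_dose(
    daily_exposure: float,     # product applied, mg/kg bw/day
    concentration_pct: float,  # ingredient concentration in the product, %
    dap_pct: float,            # dermal absorption percentage, %
) -> float:
    """SED in mg/kg bw/day."""
    return daily_exposure * (concentration_pct / 100) * (dap_pct / 100)


def margin_of_safety(noael: float, sed: float) -> float:
    """MoS = NOAEL / SED (both in mg/kg bw/day)."""
    if sed <= 0:
        raise ValueError("SED must be positive")
    return noael / sed
```

For example, a product applied at 100 mg/kg bw/day containing 2% of an ingredient with a DAP of 50% gives an SED of 1 mg/kg bw/day; with a NOAEL of 150 mg/kg bw/day the MoS is 150. A MoS of at least 100 is the threshold usually taken as acceptable.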
