cosmoguard-bd/CLAUDE.md
2026-02-08 19:29:22 +01:00

PIF Compiler - Cosmoguard API

Project Overview

PIF Compiler (branded as Cosmoguard API) is a cosmetic safety assessment tool that compiles Product Information Files (PIF) for cosmetic products, as required by EU regulations. It aggregates toxicological, regulatory, and chemical data from multiple external sources to evaluate ingredient safety (DAP calculations, SED/MoS computations, NOAEL extraction).

The code is written in Python 3.12. Variable names, comments, and some log messages are in Italian.

Tech Stack

  • Framework: FastAPI + Uvicorn
  • Package manager: uv (with pyproject.toml + uv.lock)
  • Data models: Pydantic v2
  • Databases: MongoDB (substance data cache via pymongo) + PostgreSQL (product presets, search logs, compilers via psycopg2)
  • Web scraping: Playwright (PDF generation, browser automation), BeautifulSoup4 (HTML parsing)
  • External APIs: PubChem (pubchempy + pubchemprops), COSING (EU Search API), ECHA (chem.echa.europa.eu)
  • Logging: Python logging with rotating file handlers (logs/debug.log, logs/error.log)

Project Structure

src/pif_compiler/
├── main.py                      # FastAPI app, middleware, exception handlers, routers
├── __init__.py
├── api/
│   └── routes/
│       ├── api_echa.py          # ECHA endpoints (single + batch search)
│       ├── api_cosing.py        # COSING endpoints (single + batch search)
│       ├── api_ingredients.py   # Ingredient search by CAS + list all ingested
│       ├── api_esposition.py    # Esposition preset creation + list all presets
│       └── common.py            # PDF generation, PubChem, CIR search endpoints
├── classes/
│   ├── __init__.py              # Re-exports all models from models.py and main_cls.py
│   ├── models.py                # Pydantic models: Ingredient, DapInfo, CosingInfo,
│   │                            #   ToxIndicator, Toxicity, Esposition, RetentionFactors, StatoOrdine
│   └── main_cls.py              # Orchestrator classes: Order (raw input layer),
│                                #   Project (processed layer), IngredientInput
├── functions/
│   ├── common_func.py           # PDF generation with Playwright
│   ├── common_log.py            # Centralized logging configuration
│   └── db_utils.py              # MongoDB + PostgreSQL connection helpers
└── services/
    ├── srv_echa.py              # ECHA scraping, HTML parsing, toxicology extraction,
    │                            #   orchestrator (validate -> check cache -> fetch -> store)
    ├── srv_cosing.py            # COSING API search + data cleaning
    ├── srv_pubchem.py           # PubChem property extraction (DAP data)
    └── srv_cir.py               # CIR (Cosmetic Ingredient Review) database search

Other directories

  • data/ - Input data files (input.json with sample INCI/CAS/percentage lists), DB schema reference (db_schema.sql), old CSV data
  • logs/ - Rotating log files (debug.log, error.log) - auto-generated
  • pdfs/ - Generated PDF files from ECHA dossier pages
  • streamlit/ - Streamlit UI pages (ingredients_page.py, exposition_page.py)
  • marimo/ - Ignore this folder. Debug/test notebooks, not part of the main application

Architecture & Data Flow

Core workflow (per ingredient)

  1. Input: CAS number (and optionally INCI name + percentage)
  2. COSING (srv_cosing.py): Search EU cosmetic ingredients database for regulatory restrictions, INCI match, annex references
  3. ECHA (srv_echa.py): Search substance -> get dossier -> parse HTML index -> extract toxicological data (NOAEL, LD50, LOAEL) from acute & repeated dose toxicity pages
  4. PubChem (srv_pubchem.py): Get molecular weight, XLogP, TPSA, melting point, dissociation constants
  5. DAP calculation (DapInfo model): Dermal Absorption Percentage based on molecular properties (MW > 500, LogP, TPSA > 120, etc.)
  6. Toxicity ranking (Toxicity model): Best toxicological indicator selection with priority (NOAEL > LOAEL > LD50) and safety factors
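
The indicator priority in step 6 can be sketched as follows. This is a minimal illustration, not the actual Toxicity model: the Indicator class, the best_indicator helper, and the tie-break rule (lowest value wins among indicators of the same kind) are assumptions, and safety factors are left out because their values are not stated here.

```python
from dataclasses import dataclass

# Lower rank = preferred indicator, following the priority NOAEL > LOAEL > LD50.
PRIORITY = {"NOAEL": 0, "LOAEL": 1, "LD50": 2}


@dataclass(frozen=True)
class Indicator:
    """Hypothetical stand-in for a toxicological indicator."""
    kind: str     # "NOAEL", "LOAEL" or "LD50"
    value: float  # assumed to be in mg/kg bw/day


def best_indicator(indicators):
    """Return the preferred indicator, or None if none is usable."""
    usable = [i for i in indicators if i.kind in PRIORITY]
    if not usable:
        return None
    # Highest-priority kind first; among equals, the lowest value
    # (assumption: the most conservative value is preferred).
    return min(usable, key=lambda i: (PRIORITY[i.kind], i.value))
```

With a NOAEL and an LD50 available, the NOAEL is returned regardless of the numeric values.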

Caching strategy

  • ECHA results are cached in MongoDB (toxinfo.substance_index collection) keyed by substance.rmlCas
  • Ingredients are cached in MongoDB (toxinfo.ingredients collection) keyed by cas, with PostgreSQL ingredienti table as index (stores mongo_id + enrichment flags dap, cosing, tox)
  • Orders are cached in MongoDB (toxinfo.orders collection) keyed by uuid_ordine
  • Projects are cached in MongoDB (toxinfo.projects collection) keyed by uuid_progetto
  • The orchestrator checks local cache before making external requests
  • Ingredient.get_or_create(cas) looks up the PostgreSQL index, then the MongoDB cache. It returns the cached document if it is at most 365 days old; otherwise it re-scrapes
  • Search history is logged to PostgreSQL (logs.search_history table)
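
The 365-day freshness rule can be sketched as below. A plain dict stands in for the PostgreSQL index plus the MongoDB document store, and the scrape callable stands in for the external lookups; the function signature and the created_at field name are assumptions, not the real Ingredient API.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=365)  # freshness window described above


def is_fresh(created_at, now):
    """True if the cached entry is at most 365 days old."""
    return now - created_at <= MAX_AGE


def get_or_create(cas, cache, scrape, now=None):
    """Return a fresh cached entry, or scrape and store a new one.

    cache  -- dict standing in for PostgreSQL index + MongoDB (assumption)
    scrape -- callable fetching fresh data for a CAS number (assumption)
    """
    now = now or datetime.now(timezone.utc)
    entry = cache.get(cas)
    if entry is not None and is_fresh(entry["created_at"], now):
        return entry
    # Missing or stale: fetch again and overwrite the cached entry.
    entry = {"cas": cas, "data": scrape(cas), "created_at": now}
    cache[cas] = entry
    return entry
```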

Order / Project architecture

  • Order (main_cls.py): Raw input layer. Receives JSON with client, compiler, product type, ingredients list (CAS + percentage). Cleans CAS numbers (strips \n, splits by ;). Saves to MongoDB orders collection. Registers client/compiler in PostgreSQL.
  • Project (main_cls.py): Processed layer. Created from an Order via Project.from_order(). Holds enriched Ingredient objects, percentages mapping (CAS -> %), and Esposition preset. process_ingredients() calls Ingredient.get_or_create() for each CAS. Saves to MongoDB projects collection.
  • Orders and projects are decoupled: an order can update an older project.
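
The CAS cleaning step in Order (strip \n, split by ;) might look like the sketch below. The function name clean_cas is hypothetical, and dropping empty fragments is an assumption.

```python
def clean_cas(raw):
    """Split a raw CAS field on ';' and strip newlines and whitespace."""
    parts = raw.replace("\n", "").split(";")
    # Drop empty fragments left by trailing separators (assumption).
    return [p.strip() for p in parts if p.strip()]
```

For example, a cell containing "50-00-0;\n7732-18-5\n" yields two separate CAS numbers.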

PostgreSQL schema (see data/db_schema.sql)

  • clienti - Customers (id_cliente, nome_cliente)
  • compilatori - PIF compilers/assessors (id_compilatore, nome_compilatore)
  • ordini - Orders linking a client + compiler to a project (uuid_ordine, uuid_progetto, data_ordine, stato_ordine). FK to clienti, compilatori, stati_ordini
  • stati_ordini - Order status lookup table (id_stato, nome_stato). Values mapped to StatoOrdine IntEnum in models.py
  • ingredienti - Ingredient registry keyed by CAS. Tracks enrichment status via boolean flags (dap, cosing, tox) and links to MongoDB document (mongo_id)
  • inci - INCI name to CAS mapping. FK to ingredienti(cas)
  • progetti - Projects linked to an order (mongo_id -> ordini.uuid_progetto) and a product type preset (preset_tipo_prodotto -> tipi_prodotti)
  • ingredients_lineage - Many-to-many join between progetti and ingredienti, tracking which ingredients belong to which project
  • tipi_prodotti - Product type presets with exposure parameters (preset_name, tipo_prodotto, luogo_applicazione, exposure routes, sup_esposta, freq_applicazione, qta_giornaliera, ritenzione). Maps to the Esposition Pydantic model
  • logs.search_history - Search audit log (cas_ricercato, target, esito)

API Endpoints

All routes are under /api/v1:

Method  Path                          Description
POST    /echa/search                  Single ECHA substance search by CAS
POST    /echa/batch-search            Batch ECHA search for multiple CAS numbers
POST    /cosing/search                COSING search (by name, CAS, EC, or ID)
POST    /cosing/batch-search          Batch COSING search
POST    /ingredients/search           Get full ingredient by CAS (cached or scraped)
GET     /ingredients/list             List all ingested ingredients from PostgreSQL
POST    /esposition/create            Create a new esposition preset
GET     /esposition/presets           List all esposition presets
POST    /common/pubchem               PubChem property lookup by CAS
POST    /common/generate-pdf          Generate PDF from URL via Playwright
GET     /common/download-pdf/{name}   Download a generated PDF
POST    /common/cir-search            CIR ingredient text search
GET     /health, /ping                Health check endpoints

Docs available at /docs (Swagger) and /redoc.
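
A request to the ingredient search endpoint could be built as in the sketch below, using only the standard library. The JSON field name "cas" is an assumption, since the request models are defined inline in the route files and are not reproduced here; check /docs for the real schema. The base URL assumes the API is running locally on port 8000.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000/api/v1"  # local dev address


def build_ingredient_search(cas):
    """Build (but do not send) a POST request for /ingredients/search."""
    body = json.dumps({"cas": cas}).encode("utf-8")  # field name is an assumption
    return Request(
        f"{BASE_URL}/ingredients/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the API running, the request can be sent with urllib.request.urlopen(build_ingredient_search("50-00-0")).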

Environment Variables

Configured via .env file (loaded with python-dotenv):

  • ADMIN_USER - MongoDB admin username
  • ADMIN_PASSWORD - MongoDB admin password
  • MONGO_HOST - MongoDB host
  • MONGO_PORT - MongoDB port
  • DATABASE_URL - PostgreSQL connection string
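
As an illustration of how the MongoDB variables fit together, the sketch below assembles a connection URI from them. The mongo_uri helper and the exact URI layout are assumptions; the real code in db_utils.py may build the connection differently.

```python
import os
from urllib.parse import quote_plus


def mongo_uri(env=os.environ):
    """Build a MongoDB URI from the variables listed above.

    Raises KeyError if a variable is missing.
    """
    # Credentials are percent-encoded so characters like '@' or ':' are safe.
    user = quote_plus(env["ADMIN_USER"])
    password = quote_plus(env["ADMIN_PASSWORD"])
    host = env["MONGO_HOST"]
    port = int(env["MONGO_PORT"])
    return f"mongodb://{user}:{password}@{host}:{port}/"
```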

Development

Setup

uv sync              # Install dependencies
playwright install   # Install browser binaries for PDF generation

Running the API

uv run python -m pif_compiler.main
# or
uv run uvicorn pif_compiler.main:app --reload --host 0.0.0.0 --port 8000

Key conventions

  • Services in services/ handle external API calls and data extraction
  • Models in classes/models.py use Pydantic @model_validator and @classmethod builders for construction from raw API data
  • Orchestrator classes in classes/main_cls.py handle Order (raw input) and Project (processed) layers
  • The orchestrator pattern (see srv_echa.py) handles: validate input -> check local cache -> fetch from external -> store locally -> return
  • Ingredient.ingredient_builder(cas) calls scraping functions directly (pubchem_dap, cosing_entry, orchestrator)
  • Ingredient.save() upserts to both MongoDB and PostgreSQL; Ingredient.from_cas() retrieves via PostgreSQL index -> MongoDB
  • Ingredient.get_or_create(cas) is the main entry point: checks cache freshness (365 days), scrapes if needed
  • All modules use the shared logger from common_log.get_logger()
  • API routes define Pydantic request/response models inline in each route file
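
The orchestrator pattern (validate -> check cache -> fetch -> store -> return) can be sketched as below. The dict cache and the fetch callable are stand-ins for MongoDB and the external scraper, and validating via the CAS check digit is an assumption about what "validate input" involves in srv_echa.py.

```python
import re

CAS_RE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas):
    """Check the CAS format and its check digit."""
    m = CAS_RE.match(cas)
    if not m:
        return False
    # Weight digits 1, 2, 3, ... from right to left, excluding the check digit.
    digits = (m.group(1) + m.group(2))[::-1]
    checksum = sum(int(d) * w for w, d in enumerate(digits, start=1)) % 10
    return checksum == int(m.group(3))


def orchestrate(cas, cache, fetch):
    """Validate, then serve from cache or fetch and store."""
    if not is_valid_cas(cas):   # 1. validate input
        raise ValueError(f"invalid CAS number: {cas!r}")
    if cas in cache:            # 2. check local cache
        return cache[cas]
    result = fetch(cas)         # 3. fetch from external source
    cache[cas] = result         # 4. store locally
    return result               # 5. return
```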

db_utils.py functions

  • db_connect(db_name, collection_name) - MongoDB collection accessor
  • postgres_connect() - PostgreSQL connection
  • upsert_ingrediente(cas, mongo_id, dap, cosing, tox) - Upsert ingredient in PostgreSQL
  • get_ingrediente_by_cas(cas) - Get ingredient row by CAS
  • get_all_ingredienti() - List all ingredients from PostgreSQL
  • upsert_cliente(nome_cliente) - Upsert client, returns id_cliente
  • upsert_compilatore(nome_compilatore) - Upsert compiler, returns id_compilatore
  • aggiorna_stato_ordine(id_ordine, nuovo_stato) - Update order status
  • log_ricerche(cas, target, esito) - Log search history

Streamlit UI

  • streamlit/ingredients_page.py - Ingredient search by CAS + result display + inventory of ingested ingredients
  • streamlit/exposition_page.py - Esposition preset creation form + list of existing presets
  • Both pages call the FastAPI endpoints via requests (API must be running on localhost:8000)
  • Run with: streamlit run streamlit/<page>.py

Important domain concepts

  • CAS number: Chemical Abstracts Service identifier (e.g., "50-00-0")
  • INCI: International Nomenclature of Cosmetic Ingredients
  • NOAEL: No Observed Adverse Effect Level (preferred toxicity indicator)
  • LOAEL: Lowest Observed Adverse Effect Level
  • LD50: Median Lethal Dose (dose lethal to 50% of the test population)
  • DAP: Dermal Absorption Percentage
  • SED: Systemic Exposure Dosage
  • MoS: Margin of Safety
  • PIF: Product Information File (EU cosmetic regulation requirement)
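
How SED and MoS relate to the other concepts can be shown with the standard formulas: SED is the amount applied per kg of body weight, scaled by ingredient concentration and DAP, and MoS is NOAEL divided by SED (a MoS of at least 100 is the usual acceptance threshold). The sketch below is illustrative only: function and parameter names are hypothetical, and the 60 kg default body weight is an assumed convention, not a value taken from this project.

```python
def sed(daily_amount_g, retention, conc_pct, dap_pct, body_weight_kg=60.0):
    """Systemic exposure in mg/kg bw/day.

    daily_amount_g -- product applied per day, in grams
    retention      -- retention factor (1.0 for leave-on products)
    conc_pct       -- ingredient concentration in the product, in %
    dap_pct        -- dermal absorption percentage, in %
    """
    applied_mg_per_kg = daily_amount_g * 1000.0 * retention / body_weight_kg
    return applied_mg_per_kg * (conc_pct / 100.0) * (dap_pct / 100.0)


def mos(noael, sed_value):
    """Margin of Safety: NOAEL (mg/kg bw/day) divided by SED."""
    return noael / sed_value


def is_safe(noael, sed_value, threshold=100.0):
    """True if the MoS meets the usual acceptance threshold."""
    return mos(noael, sed_value) >= threshold
```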