cosmoguard-bd/CLAUDE.md
2026-02-08 14:31:50 +01:00

# PIF Compiler - Cosmoguard API
## Project Overview
**PIF Compiler** (branded as **Cosmoguard API**) is a cosmetic safety assessment tool that compiles the Product Information File (PIF) required for cosmetic products under EU regulations. It aggregates toxicological, regulatory, and chemical data from multiple external sources to evaluate ingredient safety (DAP calculations, SED/MoS computations, NOAEL extraction).
The codebase uses **Italian** for variable names, comments, and some log messages. Code is written in **Python 3.12**.
## Tech Stack
- **Framework**: FastAPI + Uvicorn
- **Package manager**: uv (with `pyproject.toml` + `uv.lock`)
- **Data models**: Pydantic v2
- **Databases**: MongoDB (substance data cache via `pymongo`) + PostgreSQL (product presets, search logs, compilers via `psycopg2`)
- **Web scraping**: Playwright (PDF generation, browser automation), BeautifulSoup4 (HTML parsing)
- **External APIs**: PubChem (`pubchempy` + `pubchemprops`), COSING (EU Search API), ECHA (chem.echa.europa.eu)
- **Logging**: Python `logging` with rotating file handlers (`logs/debug.log`, `logs/error.log`)
## Project Structure
```
src/pif_compiler/
├── main.py                  # FastAPI app, middleware, exception handlers, routers
├── __init__.py
├── api/
│   └── routes/
│       ├── api_echa.py      # ECHA endpoints (single + batch search)
│       ├── api_cosing.py    # COSING endpoints (single + batch search)
│       └── common.py        # PDF generation, PubChem, CIR search endpoints
├── classes/
│   └── models.py            # Pydantic models: Ingredient, DapInfo, CosingInfo,
│                            # ToxIndicator, Toxicity, Esposition, RetentionFactors
├── functions/
│   ├── common_func.py       # PDF generation with Playwright
│   ├── common_log.py        # Centralized logging configuration
│   └── db_utils.py          # MongoDB + PostgreSQL connection helpers
└── services/
    ├── srv_echa.py          # ECHA scraping, HTML parsing, toxicology extraction,
    │                        # orchestrator (validate -> check cache -> fetch -> store)
    ├── srv_cosing.py        # COSING API search + data cleaning
    ├── srv_pubchem.py       # PubChem property extraction (DAP data)
    └── srv_cir.py           # CIR (Cosmetic Ingredient Review) database search
```
### Other directories
- `data/` - Input data files (`input.json` with sample INCI/CAS/percentage lists), old CSV data
- `logs/` - Auto-generated rotating log files (`debug.log`, `error.log`)
- `pdfs/` - Generated PDF files from ECHA dossier pages
- `marimo/` - **Ignore this folder.** Debug/test notebooks, not part of the main application
## Architecture & Data Flow
### Core workflow (per ingredient)
1. **Input**: CAS number (and optionally INCI name + percentage)
2. **COSING** (`srv_cosing.py`): Search EU cosmetic ingredients database for regulatory restrictions, INCI match, annex references
3. **ECHA** (`srv_echa.py`): Search substance -> get dossier -> parse HTML index -> extract toxicological data (NOAEL, LD50, LOAEL) from acute & repeated dose toxicity pages
4. **PubChem** (`srv_pubchem.py`): Get molecular weight, XLogP, TPSA, melting point, dissociation constants
5. **DAP calculation** (`DapInfo` model): Dermal Absorption Percentage based on molecular properties (MW > 500, LogP, TPSA > 120, etc.)
6. **Toxicity ranking** (`Toxicity` model): Best toxicological indicator selection with priority (NOAEL > LOAEL > LD50) and safety factors
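
The indicator priority in step 6 can be sketched as follows. This is a minimal illustration, not the project's `Toxicity` model: `Indicator` and `pick_best` are hypothetical names, the unit is assumed to be mg/kg bw/day, and breaking ties by the lowest dose is an assumption made here for conservatism. Safety factors are left out because their values are not documented in this file.

```python
from dataclasses import dataclass
from typing import Optional

# Lower number = higher priority, following NOAEL > LOAEL > LD50.
PRIORITY = {"NOAEL": 0, "LOAEL": 1, "LD50": 2}


@dataclass(frozen=True)
class Indicator:
    """Hypothetical stand-in for the project's ToxIndicator model."""

    kind: str     # "NOAEL", "LOAEL" or "LD50"
    value: float  # dose, assumed to be in mg/kg bw/day


def pick_best(indicators: list[Indicator]) -> Optional[Indicator]:
    """Return the highest-priority indicator, or None if none is usable.

    Ties within the same kind go to the lowest dose (an assumption).
    """
    known = [i for i in indicators if i.kind in PRIORITY]
    if not known:
        return None
    return min(known, key=lambda i: (PRIORITY[i.kind], i.value))


candidates = [
    Indicator("LD50", 2000.0),
    Indicator("LOAEL", 50.0),
    Indicator("NOAEL", 100.0),
]
best = pick_best(candidates)  # the NOAEL entry wins over LOAEL and LD50
```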
### Caching strategy
- ECHA results are cached in MongoDB (`toxinfo.substance_index` collection) keyed by `substance.rmlCas`
- The orchestrator checks local cache before making external requests
- Search history is logged to PostgreSQL (`logs.search_history` table)
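
A rough sketch of the cache-first flow, assuming a plain `dict` in place of the MongoDB collection and a caller-supplied `fetch` function in place of the ECHA scraper. `is_valid_cas` and `orchestrate` are hypothetical names; the real orchestrator in `srv_echa.py` may validate and store differently. The check-digit rule is the standard CAS one: weight the digits 1, 2, 3, ... from the right and compare the sum modulo 10 with the last digit.

```python
import re
from typing import Callable

CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas: str) -> bool:
    """Check the CAS format and its check digit."""
    match = CAS_PATTERN.match(cas.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weights run 1, 2, 3, ... starting from the rightmost digit.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def orchestrate(
    cas: str,
    cache: dict[str, dict],
    fetch: Callable[[str], dict],
) -> dict:
    """validate input -> check local cache -> fetch from external -> store -> return."""
    cas = cas.strip()
    if not is_valid_cas(cas):
        raise ValueError(f"Invalid CAS number: {cas!r}")
    cached = cache.get(cas)
    if cached is not None:
        return cached          # cache hit: no external request
    result = fetch(cas)        # cache miss: call the external source
    cache[cas] = result        # store locally for the next lookup
    return result
```

Passing the cache and the fetcher as arguments keeps the flow testable without a database or network access.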
## API Endpoints
All routes are under `/api/v1`:

| Method | Path | Description |
|--------|------|-------------|
| POST | `/echa/search` | Single ECHA substance search by CAS |
| POST | `/echa/batch-search` | Batch ECHA search for multiple CAS numbers |
| POST | `/cosing/search` | COSING search (by name, CAS, EC, or ID) |
| POST | `/cosing/batch-search` | Batch COSING search |
| POST | `/common/pubchem` | PubChem property lookup by CAS |
| POST | `/common/generate-pdf` | Generate PDF from URL via Playwright |
| GET | `/common/download-pdf/{name}` | Download a generated PDF |
| POST | `/common/cir-search` | CIR ingredient text search |
| GET | `/health`, `/ping` | Health check endpoints |

Interactive docs are available at `/docs` (Swagger UI) and `/redoc` (ReDoc).
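
As an illustration, a request to one of the POST endpoints might be built like this with the standard library. The base URL assumes the local development server on port 8000, `build_search_request` is a hypothetical helper, and the `cas` field in the payload is an assumption: the actual request schema is defined in the route files and shown in `/docs`.

```python
import json
import urllib.request

# Assumption: local dev server started on port 8000.
BASE_URL = "http://localhost:8000/api/v1"


def build_search_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for an endpoint under /api/v1."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumption: the request model exposes a "cas" field.
request = build_search_request("/echa/search", {"cas": "50-00-0"})

# Sending it requires a running server:
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(json.load(response))
```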
## Environment Variables
Configured via `.env` file (loaded with `python-dotenv`):
- `ADMIN_USER` - MongoDB admin username
- `ADMIN_PASSWORD` - MongoDB admin password
- `MONGO_HOST` - MongoDB host
- `MONGO_PORT` - MongoDB port
- `DATABASE_URL` - PostgreSQL connection string
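
A minimal sketch of how the MongoDB variables could be combined into a connection URI. `build_mongo_uri` is a hypothetical helper, not necessarily what `db_utils.py` does; the credentials are percent-encoded so that characters such as `@` or `/` in a password do not break the URI.

```python
import os
from collections.abc import Mapping
from urllib.parse import quote_plus


def build_mongo_uri(env: Mapping[str, str] = os.environ) -> str:
    """Assemble a MongoDB URI from the documented variables.

    Raises KeyError if a variable is missing, so misconfiguration fails early.
    """
    user = quote_plus(env["ADMIN_USER"])
    password = quote_plus(env["ADMIN_PASSWORD"])
    host = env["MONGO_HOST"]
    port = int(env["MONGO_PORT"])  # also validates that the port is numeric
    return f"mongodb://{user}:{password}@{host}:{port}/"


# Example with explicit values instead of the real environment:
example_env = {
    "ADMIN_USER": "admin",
    "ADMIN_PASSWORD": "p@ss/word",
    "MONGO_HOST": "localhost",
    "MONGO_PORT": "27017",
}
uri = build_mongo_uri(example_env)
```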
## Development
### Setup
```bash
uv sync # Install dependencies
playwright install # Install browser binaries for PDF generation
```
### Running the API
```bash
uv run python -m pif_compiler.main
# or
uv run uvicorn pif_compiler.main:app --reload --host 0.0.0.0 --port 8000
```
### Key conventions
- Services in `services/` handle external API calls and data extraction
- Models in `classes/models.py` use Pydantic `@model_validator` and `@classmethod` builders for construction from raw API data
- The `orchestrator` pattern (see `srv_echa.py`) handles: validate input -> check local cache -> fetch from external -> store locally -> return
- All modules use the shared logger from `common_log.get_logger()`
- API routes define Pydantic request/response models inline in each route file
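
For orientation, a shared logger with rotating file handlers could look roughly like the sketch below. The signature, the `log_dir` parameter, the rotation size, the backup count, and the format string are all assumptions; the actual `common_log.get_logger()` may differ.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def get_logger(name: str = "pif_compiler", log_dir: str = "logs") -> logging.Logger:
    """Return a logger writing to debug.log (all levels) and error.log (errors only)."""
    logger = logging.getLogger(name)
    if logger.handlers:
        # Already configured: avoid attaching duplicate handlers.
        return logger

    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger.setLevel(logging.DEBUG)
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

    for filename, level in (("debug.log", logging.DEBUG), ("error.log", logging.ERROR)):
        handler = RotatingFileHandler(
            Path(log_dir) / filename,
            maxBytes=1_000_000,  # assumed rotation threshold
            backupCount=3,       # assumed number of rotated files kept
            encoding="utf-8",
        )
        handler.setLevel(level)
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

Calling `get_logger()` repeatedly with the same name returns the same configured instance, which is what lets every module share one logger.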
### Important domain concepts
- **CAS number**: Chemical Abstracts Service identifier (e.g., "50-00-0")
- **INCI**: International Nomenclature of Cosmetic Ingredients
- **NOAEL**: No Observed Adverse Effect Level (preferred toxicity indicator)
- **LOAEL**: Lowest Observed Adverse Effect Level
- **LD50**: Median lethal dose (the dose that kills 50% of the test population)
- **DAP**: Dermal Absorption Percentage
- **SED**: Systemic Exposure Dosage
- **MoS**: Margin of Safety
- **PIF**: Product Information File (EU cosmetic regulation requirement)
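
To show how SED, DAP, and MoS relate, here is a small sketch using the formulas as they are commonly given in SCCS guidance: SED = daily exposure × concentration × dermal absorption, and MoS = NOAEL / SED. The function names and the input numbers are illustrative assumptions, not values or code from this project.

```python
def systemic_exposure_dose(
    exposure_mg_kg_day: float,  # product applied per kg body weight per day
    concentration_pct: float,   # ingredient concentration in the product (%)
    dap_pct: float,             # dermal absorption percentage (%)
) -> float:
    """SED in mg/kg bw/day."""
    return exposure_mg_kg_day * (concentration_pct / 100) * (dap_pct / 100)


def margin_of_safety(noael_mg_kg_day: float, sed_mg_kg_day: float) -> float:
    """MoS = NOAEL / SED; a value of at least 100 is the usual acceptance threshold."""
    if sed_mg_kg_day <= 0:
        raise ValueError("SED must be positive to compute a margin of safety")
    return noael_mg_kg_day / sed_mg_kg_day


# Illustrative inputs: 100 mg/kg bw/day of product, 1% ingredient, 50% absorption.
sed = systemic_exposure_dose(100.0, 1.0, 50.0)
mos = margin_of_safety(100.0, sed)  # NOAEL of 100 mg/kg bw/day
```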