# PIF Compiler - Comsoguard API

## Project Overview
PIF Compiler (branded as Comsoguard API) is a cosmetic safety assessment tool that compiles Product Information Files (PIF) for cosmetic products, as required by EU regulations. It aggregates toxicological, regulatory, and chemical data from multiple external sources to evaluate ingredient safety (DAP calculations, SED/MoS computations, NOAEL extraction).
The primary language is Italian (variable names, comments, some log messages). Code is written in Python 3.12.
## Tech Stack

- Framework: FastAPI + Uvicorn
- Package manager: uv (with `pyproject.toml` + `uv.lock`)
- Data models: Pydantic v2
- Databases: MongoDB (substance data cache via `pymongo`) + PostgreSQL (product presets, search logs, compilers via `psycopg2`)
- Web scraping: Playwright (PDF generation, browser automation), BeautifulSoup4 (HTML parsing)
- External APIs: PubChem (`pubchempy` + `pubchemprops`), COSING (EU Search API), ECHA (chem.echa.europa.eu)
- Logging: Python `logging` with rotating file handlers (`logs/debug.log`, `logs/error.log`)
## Project Structure

```
src/pif_compiler/
├── main.py                    # FastAPI app, middleware, exception handlers, routers
├── __init__.py
├── api/
│   └── routes/
│       ├── api_echa.py        # ECHA endpoints (single + batch search)
│       ├── api_cosing.py      # COSING endpoints (single + batch search)
│       ├── api_ingredients.py # Ingredient search by CAS + list all ingested + add tox indicator + clients CRUD
│       ├── api_esposition.py  # Esposition preset CRUD (create, list, delete)
│       ├── api_orders.py      # Order creation, retry, manual pipeline trigger, Excel/PDF export
│       └── common.py          # PDF generation, PubChem, CIR search endpoints
├── classes/
│   ├── __init__.py            # Re-exports all models from models.py and main_workflow.py
│   ├── models.py              # Pydantic models: Ingredient, DapInfo, CosingInfo,
│   │                          #   ToxIndicator, Toxicity, Esposition, RetentionFactors, StatoOrdine
│   └── main_workflow.py       # Order/Project workflow: Order (DB + raw JSON layer),
│                              #   Project (enriched layer), ProjectIngredient,
│                              #   orchestrator functions (receive_order, process_order_pipeline,
│                              #   retry_order, trigger_pipeline)
├── functions/
│   ├── common_func.py         # PDF generation with Playwright, tox + COSING source PDF batch generation, COSING PDF download, ZIP creation
│   ├── common_log.py          # Centralized logging configuration
│   ├── db_utils.py            # MongoDB + PostgreSQL connection helpers
│   └── excel_export.py        # Excel export (4 sheets: Anagrafica, Esposizione, SED, MoS)
└── services/
    ├── srv_echa.py            # ECHA scraping, HTML parsing, toxicology extraction,
    │                          #   orchestrator (validate -> check cache -> fetch -> store)
    ├── srv_cosing.py          # COSING API search + data cleaning
    ├── srv_pubchem.py         # PubChem property extraction (DAP data)
    └── srv_cir.py             # CIR (Cosmetic Ingredient Review) database search
```
### Other directories

- `data/` - Input data files (`input.json` with sample INCI/CAS/percentage lists), DB schema reference (`db_schema.sql`), old CSV data
- `logs/` - Rotating log files (debug.log, error.log) - auto-generated
- `pdfs/` - Generated PDF files from ECHA dossier pages
- `streamlit/` - Streamlit UI pages (`ingredients_page.py`, `exposition_page.py`, `order_page.py`, `orders_page.py`)
- `scripts/` - Utility scripts (`create_mock_order.py` - inserts a test order with 4 ingredients)
- `marimo/` - Ignore this folder. Debug/test notebooks, not part of the main application
## Architecture & Data Flow

### Core workflow (per ingredient)

- Input: CAS number (and optionally INCI name + percentage)
- COSING (`srv_cosing.py`): Search the EU cosmetic ingredients database for regulatory restrictions, INCI match, annex references
- ECHA (`srv_echa.py`): Search substance -> get dossier -> parse HTML index -> extract toxicological data (NOAEL, LD50, LOAEL) from acute & repeated dose toxicity pages
- PubChem (`srv_pubchem.py`): Get molecular weight, XLogP, TPSA, melting point, dissociation constants
- DAP calculation (`DapInfo` model): Dermal Absorption Percentage based on molecular properties (MW > 500, LogP, TPSA > 120, etc.)
- Toxicity ranking (`Toxicity` model): Best toxicological indicator selection with priority (NOAEL > LOAEL > LD50) and safety factors
### Caching strategy

- ECHA results are cached in MongoDB (`toxinfo.substance_index` collection) keyed by `substance.rmlCas`
- Ingredients are cached in MongoDB (`toxinfo.ingredients` collection) keyed by `cas`, with the PostgreSQL `ingredienti` table as index (stores `mongo_id` + enrichment flags `dap`, `cosing`, `tox`)
- Orders are cached in MongoDB (`toxinfo.orders` collection) keyed by `uuid_ordine`
- Projects are cached in MongoDB (`toxinfo.projects` collection) keyed by `uuid_progetto`
- The orchestrator checks the local cache before making external requests
- `Ingredient.get_or_create(cas)` checks PostgreSQL -> MongoDB cache, returns the cached document if not older than 365 days, otherwise re-scrapes
- Search history is logged to PostgreSQL (`logs.search_history` table)
### Order / Project workflow (`main_workflow.py`)

Order processing uses a background pipeline with state-machine tracking via `StatoOrdine`:
```
POST /orders/create → receive_order() → BackgroundTasks → process_order_pipeline()
        │                                                        │
        ▼                                                        ▼
Save raw JSON to MongoDB                     Order.pick_next() (oldest with stato=1)
+ create ordini record (stato=1)                                 │
+ return id_ordine immediately                                   ▼
                                             order.validate_anagrafica() → stato=2
                                                                 │
                                                                 ▼
                                             Project.from_order() → stato=3
                                             (loads Esposition preset, parses ingredients)
                                                                 │
                                                                 ▼
                                             project.process_ingredients() → Ingredient.get_or_create()
                                             (skip if skip_tox=True or CAS empty)
                                                                 │
                                                                 ▼
                                             stato=5 (ARRICCHITO)
                                                                 │
                                                                 ▼
                                             project.save() → MongoDB + progetti table + ingredients_lineage
```

On error → stato=9 (ERRORE) + note with the error message.
- Order (`main_workflow.py`): Pydantic model with DB table attributes + raw JSON from MongoDB. The `pick_next()` classmethod picks the oldest pending order (FIFO). `validate_anagrafica()` upserts the client in the `clienti` table. `update_stato()` is the reusable state-transition method.
- Project (`main_workflow.py`): Created from an Order via `Project.from_order()`. Holds the `Esposition` preset (loaded by name from the DB) and a list of `ProjectIngredient` with enriched `Ingredient` objects. `process_ingredients()` calls `Ingredient.get_or_create()` for each CAS. `save()` dumps to the MongoDB `projects` collection, creates a `progetti` entry, and populates `ingredients_lineage`.
- ProjectIngredient: Helper model with cas, inci, percentage, is_colorante, skip_tox, and an optional `Ingredient` object.
- Retry: `retry_order(id_ordine)` resets an ERRORE order back to RICEVUTO for reprocessing.
- Manual trigger: `trigger_pipeline()` launches the pipeline on demand for any pending order.
- The pipeline is on-demand only (no periodic polling). Each API call to `/orders/create` or `/orders/retry` triggers one background execution.
### Excel export (`excel_export.py`)

`export_project_excel(project, output_path)` generates a 4-sheet Excel file:

- Anagrafica - Client info (nome, prodotto, preset) + ingredient table (INCI, CAS, %, colorante, skip_tox)
- Esposizione - All esposition parameters + computed fields via Excel formulas (`=B12*B13`, `=B15*1000/B5`)
- SED - SED calculation per ingredient. Formula: `=(C{r}/100)*Esposizione!$B$12*Esposizione!$B$13/Esposizione!$B$5*1000`. COSING restrictions highlighted in red.
- MoS - 14 columns (Nome, %, SED, DAP, SED con DAP, Indicatore, Valore, Fattore, MoS, Fonte, Info DAP, Restrizioni, Altre Restrizioni, Note). MoS formula: `=IF(AND(E{r}>0,H{r}>0),G{r}/(E{r}*H{r}),"")`. Includes a legend row.

Called via the `Project.export_excel()` method, exposed at `GET /orders/export/{id_ordine}`.
### Source PDF generation (`common_func.py`)

- `generate_project_source_pdfs(project)` - for each ingredient, generates two types of source PDFs:
  - Tox best_case PDF: downloads the ECHA dossier page of `best_case` via Playwright. Naming: `CAS_source.pdf`, where source is the `ToxIndicator.source` attribute (e.g., `56-81-5_repeated_dose_toxicity.pdf`)
  - COSING PDF: downloads the official COSING regulation PDF via the EU API for each `CosingInfo` with a `reference` attribute. Naming: `CAS_cosing.pdf`
- `cosing_download(ref_no)` - downloads the COSING regulation PDF from `api.tech.ec.europa.eu` by reference number. Returns PDF bytes or an error string
- `create_sources_zip(pdf_paths, zip_path)` - bundles all source PDFs into a ZIP archive
- Exposed at `GET /orders/export-sources/{id_ordine}` - returns the ZIP as a FileResponse
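A minimal sketch of the ZIP bundling step (the real `create_sources_zip` may differ in details such as compression settings):

```python
# Sketch of bundling generated source PDFs into a single ZIP archive,
# as create_sources_zip does (actual implementation may differ).
import zipfile
from pathlib import Path

def create_sources_zip(pdf_paths: list[Path], zip_path: Path) -> Path:
    """Write each PDF into the archive under its bare file name."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for pdf in pdf_paths:
            zf.write(pdf, arcname=pdf.name)  # e.g. 56-81-5_cosing.pdf
    return zip_path

# Usage (assuming the PDFs already exist on disk):
# create_sources_zip([Path("pdfs/56-81-5_cosing.pdf")], Path("pdfs/sources.zip"))
```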
## PostgreSQL schema (see `data/db_schema.sql`)

- `clienti` - Customers (`id_cliente`, `nome_cliente`)
- `compilatori` - PIF compilers/assessors (`id_compilatore`, `nome_compilatore`)
- `ordini` - Orders linking a client + compiler to a project (`uuid_ordine`, `uuid_progetto`, `data_ordine`, `stato_ordine`). FK to `clienti`, `compilatori`, `stati_ordini`
- `stati_ordini` - Order status lookup table (`id_stato`, `nome_stato`). Values mapped to the `StatoOrdine` IntEnum in `models.py`
- `ingredienti` - Ingredient registry keyed by CAS. Tracks enrichment status via boolean flags (`dap`, `cosing`, `tox`) and links to the MongoDB document (`mongo_id`)
- `inci` - INCI name to CAS mapping. FK to `ingredienti` (`cas`)
- `progetti` - Projects linked to an order (`mongo_id` -> `ordini.uuid_progetto`) and a product type preset (`preset_tipo_prodotto` -> `tipi_prodotti`)
- `ingredients_lineage` - Many-to-many join between `progetti` and `ingredienti`, tracking which ingredients belong to which project
- `tipi_prodotti` - Product type presets with exposure parameters (`preset_name`, `tipo_prodotto`, `luogo_applicazione`, exposure routes, `sup_esposta`, `freq_applicazione`, `qta_giornaliera`, `ritenzione`). Maps to the `Esposition` Pydantic model
- `logs.search_history` - Search audit log (`cas_ricercato`, `target`, `esito`)
## API Endpoints

All routes are under `/api/v1`:
| Method | Path | Description |
|---|---|---|
| POST | `/echa/search` | Single ECHA substance search by CAS |
| POST | `/echa/batch-search` | Batch ECHA search for multiple CAS numbers |
| POST | `/cosing/search` | COSING search (by name, CAS, EC, or ID) |
| POST | `/cosing/batch-search` | Batch COSING search |
| POST | `/ingredients/search` | Get full ingredient by CAS (cached or scraped, `force` param to bypass cache) |
| POST | `/ingredients/add-tox-indicator` | Add custom ToxIndicator to an ingredient |
| GET | `/ingredients/list` | List all ingested ingredients from PostgreSQL |
| GET | `/ingredients/clients` | List all registered clients |
| POST | `/ingredients/clients` | Create or retrieve a client |
| POST | `/esposition/create` | Create a new esposition preset |
| DELETE | `/esposition/delete/{preset_name}` | Delete an esposition preset by name |
| GET | `/esposition/presets` | List all esposition presets |
| POST | `/orders/create` | Create order + start background processing |
| POST | `/orders/retry/{id_ordine}` | Retry a failed order (ERRORE → RICEVUTO) |
| POST | `/orders/trigger-pipeline` | Manually trigger pipeline for the next pending order |
| GET | `/orders/export/{id_ordine}` | Download Excel export for a completed order |
| GET | `/orders/export-sources/{id_ordine}` | Download ZIP of tox + COSING source PDFs for an order |
| GET | `/orders/list` | List all orders with client/compiler/status info |
| GET | `/orders/detail/{id_ordine}` | Full order detail with ingredients from MongoDB |
| DELETE | `/orders/{id_ordine}` | Delete order and all related data (PostgreSQL + MongoDB) |
| POST | `/common/pubchem` | PubChem property lookup by CAS |
| POST | `/common/generate-pdf` | Generate PDF from URL via Playwright |
| GET | `/common/download-pdf/{name}` | Download a generated PDF |
| POST | `/common/cir-search` | CIR ingredient text search |
| GET | `/health`, `/ping` | Health check endpoints |
Interactive docs are available at `/docs` (Swagger) and `/redoc`.
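A client-side call might look like the following. The payload field name (`cas`) is an assumption based on the route descriptions above, not a verified request schema:

```python
# Example client call against a locally running API (localhost:8000).
# The "cas" payload field is an assumed name, not a verified schema.
import requests

BASE = "http://localhost:8000/api/v1"

def search_ingredient(cas: str) -> dict:
    """Full ingredient lookup by CAS (served from cache when fresh)."""
    resp = requests.post(f"{BASE}/ingredients/search", json={"cas": cas}, timeout=60)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(search_ingredient("50-00-0"))
```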
## Environment Variables

Configured via a `.env` file (loaded with `python-dotenv`):

- `ADMIN_USER` - MongoDB admin username
- `ADMIN_PASSWORD` - MongoDB admin password
- `MONGO_HOST` - MongoDB host
- `MONGO_PORT` - MongoDB port
- `DATABASE_URL` - PostgreSQL connection string
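A sample `.env` with placeholder values (variable names from the list above; all values are illustrative and must be replaced):

```shell
# .env - placeholder values only, replace with real credentials
ADMIN_USER=admin
ADMIN_PASSWORD=change-me
MONGO_HOST=localhost
MONGO_PORT=27017
DATABASE_URL=postgresql://user:password@localhost:5432/pif
```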
## Development

### Setup

```shell
uv sync              # Install dependencies
playwright install   # Install browser binaries for PDF generation
```
### Running the API

```shell
uv run python -m pif_compiler.main
# or
uv run uvicorn pif_compiler.main:app --reload --host 0.0.0.0 --port 8000
```
### Key conventions

- Services in `services/` handle external API calls and data extraction
- Models in `classes/models.py` use Pydantic `@model_validator` and `@classmethod` builders for construction from raw API data
- Workflow classes in `classes/main_workflow.py` handle the Order (DB + raw JSON) and Project (enriched) layers
- Order processing runs as a FastAPI `BackgroundTasks` callback (on-demand, not polled)
- The `orchestrator` pattern (see `srv_echa.py`) handles: validate input -> check local cache -> fetch from external -> store locally -> return
- `Ingredient.ingredient_builder(cas)` calls the scraping functions directly (`pubchem_dap`, `cosing_entry`, `orchestrator`)
- `Ingredient.save()` upserts to both MongoDB and PostgreSQL; `Ingredient.from_cas()` retrieves via the PostgreSQL index -> MongoDB
- `Ingredient.get_or_create(cas, force=False)` is the main entry point: checks cache freshness (365 days), scrapes if needed. `force=True` bypasses the cache entirely and re-scrapes
- All modules use the shared logger from `common_log.get_logger()`
- API routes define Pydantic request/response models inline in each route file
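The "`@model_validator` + `@classmethod` builder" convention can be sketched as below. The model name, fields, and raw API key names are illustrative, not the real `models.py` classes:

```python
# Sketch of the builder convention from classes/models.py: a @classmethod
# constructs the model from a raw API record, and a @model_validator
# normalizes fields afterwards. Names and keys here are illustrative.
from pydantic import BaseModel, model_validator

class CosingEntry(BaseModel):
    inci: str
    cas: str = ""

    @model_validator(mode="after")
    def normalize(self) -> "CosingEntry":
        # Normalize the INCI name the way COSING lists it (upper-case).
        self.inci = self.inci.strip().upper()
        return self

    @classmethod
    def from_api(cls, raw: dict) -> "CosingEntry":
        """Build from a raw COSING API record (key names assumed)."""
        return cls(inci=raw.get("inciName", ""), cas=raw.get("casNumber", ""))

print(CosingEntry.from_api({"inciName": " aqua ", "casNumber": "7732-18-5"}))
```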
## `db_utils.py` functions

Core:
- `db_connect(db_name, collection_name)` - MongoDB collection accessor
- `postgres_connect()` - PostgreSQL connection

Ingredients:
- `upsert_ingrediente(cas, mongo_id, dap, cosing, tox)` - Upsert ingredient in PostgreSQL
- `get_ingrediente_by_cas(cas)` - Get ingredient row by CAS
- `get_ingrediente_id_by_cas(cas)` - Get PostgreSQL ID by CAS (for the lineage FK)
- `get_all_ingredienti()` - List all ingredients from PostgreSQL

Clients / Compilers:
- `upsert_cliente(nome_cliente)` - Upsert client, returns `id_cliente`
- `upsert_compilatore(nome_compilatore)` - Upsert compiler, returns `id_compilatore`
- `get_all_clienti()` - List all clients from PostgreSQL

Orders:
- `insert_ordine(uuid_ordine, id_cliente)` - Insert new order, returns `id_ordine`
- `get_ordine_by_id(id_ordine)` - Get full order row
- `get_oldest_pending_order()` - Get oldest order with stato=RICEVUTO
- `aggiorna_stato_ordine(id_ordine, nuovo_stato)` - Update order status
- `update_ordine_cliente(id_ordine, id_cliente)` - Set client on order
- `update_ordine_progetto(id_ordine, uuid_progetto)` - Set project UUID on order
- `update_ordine_note(id_ordine, note)` - Set note on order
- `reset_ordine_per_retry(id_ordine)` - Reset ERRORE order to RICEVUTO
- `get_all_ordini()` - List all orders with JOINs to clienti/compilatori/stati_ordini
- `delete_ordine(id_ordine)` - Delete order + related data (lineage, progetti, MongoDB docs)

Projects:
- `get_preset_id_by_name(preset_name)` - Get preset FK by name
- `insert_progetto(mongo_id, id_preset)` - Insert project, returns `id`
- `insert_ingredient_lineage(id_progetto, id_ingrediente)` - Insert project-ingredient join

Logging:
- `log_ricerche(cas, target, esito)` - Log search history
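The upsert helpers above likely follow an `INSERT ... ON CONFLICT` pattern. A hedged sketch for `upsert_ingrediente`, using the column names from the schema section; the exact SQL and connection handling in the real `db_utils.py` may differ:

```python
# Sketch of the upsert pattern behind upsert_ingrediente. Column names follow
# the ingredienti schema above; the exact SQL in db_utils.py is an assumption.
UPSERT_INGREDIENTE_SQL = """
    INSERT INTO ingredienti (cas, mongo_id, dap, cosing, tox)
    VALUES (%s, %s, %s, %s, %s)
    ON CONFLICT (cas) DO UPDATE
    SET mongo_id = EXCLUDED.mongo_id,
        dap = EXCLUDED.dap,
        cosing = EXCLUDED.cosing,
        tox = EXCLUDED.tox
"""

def upsert_ingrediente(conn, cas: str, mongo_id: str,
                       dap: bool = False, cosing: bool = False,
                       tox: bool = False) -> None:
    """Insert or update the PostgreSQL index row for an ingredient."""
    with conn.cursor() as cur:
        cur.execute(UPSERT_INGREDIENTE_SQL, (cas, mongo_id, dap, cosing, tox))
    conn.commit()
```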
## Streamlit UI

- `streamlit/ingredients_page.py` - Ingredient search by CAS + result display + inventory of ingested ingredients
- `streamlit/exposition_page.py` - Esposition preset creation form + list of existing presets
- `streamlit/order_page.py` - Order creation form (client dropdown, preset selection, ingredient data_editor with CAS/INCI/percentage, AQUA auto-detection, validation, submit with background processing)
- `streamlit/orders_page.py` - Order management: list with filters (date, client, status), detail view with ingredients, actions (refresh, retry, Excel download, PDF sources ZIP, delete with confirmation), notes/log display
- All pages call the FastAPI endpoints via `requests` (the API must be running on `localhost:8000`)
- Run with: `streamlit run streamlit/<page>.py`
## Important domain concepts
- CAS number: Chemical Abstracts Service identifier (e.g., "50-00-0")
- INCI: International Nomenclature of Cosmetic Ingredients
- NOAEL: No Observed Adverse Effect Level (preferred toxicity indicator)
- LOAEL: Lowest Observed Adverse Effect Level
- LD50: Lethal Dose 50%
- DAP: Dermal Absorption Percentage
- SED: Systemic Exposure Dosage
- MoS: Margin of Safety
- PIF: Product Information File (EU cosmetic regulation requirement)