added api entrypoint

adish-rmr 2025-11-15 16:02:37 +01:00
parent 4cabf6fa11
commit 135bea642c
19 changed files with 305 additions and 212 deletions

194
claude.md

@ -1,194 +0,0 @@
# PIF Compiler - Project Context
## Overview
Application to generate **Product Information Files (PIF)** for cosmetic products. This is a regulatory document required for cosmetics safety assessment.
## Development Environment
- **Platform**: Windows
- **Package Manager**: [uv](https://github.com/astral-sh/uv) - Fast Python package installer and resolver
- **Python Version**: 3.12+
## Tech Stack
- **Backend**: Python 3.12+
- **Frontend**: Streamlit
- **Database**: MongoDB (primary), potential relational DB (not yet implemented)
- **Package Manager**: uv
- **Build System**: hatchling
## Common Commands
```bash
# Install dependencies
uv sync
# Add a new dependency
uv add <package-name>
# Run the application
uv run pif-compiler
# Activate virtual environment (if needed)
.venv\Scripts\activate # Windows
```
## Project Structure
```
pif_compiler/
├── src/pif_compiler/
│   ├── classes/                  # Data models & type definitions
│   │   ├── pif_class.py          # Main PIF data model
│   │   ├── classes.py            # Supporting classes (Ingredient, ExpositionInfo, SedTable, ProdCompany)
│   │   └── types_enum.py         # Enums for cosmetic types, physical forms, exposure routes
│   │
│   └── functions/                # Core functionality modules
│       ├── scraper_cosing.py     # COSING database scraper (EU cosmetic ingredients)
│       ├── mongo_functions.py    # MongoDB connection & queries
│       ├── html_to_pdf.py        # PDF generation with Playwright
│       ├── echaFind.py           # ECHA dossier search
│       ├── echaProcess.py        # ECHA data extraction & processing
│       ├── pubchem.py            # PubChem API for chemical properties
│       ├── find.py               # Unified search interface (QUACKO/ECHA)
│       └── pdf_extraction.py     # PDF processing utilities
└── data/
    ├── pif_schema.json           # JSON schema for PIF structure
    └── input.json                # Example input data format
```
## Core Functionality
### 1. Data Models ([classes/](src/pif_compiler/classes/))
#### PIF Class ([pif_class.py](src/pif_compiler/classes/pif_class.py:10))
Main data model containing:
- Product information (name, type, CNCP, company)
- Ingredient list with quantities
- Exposure information
- Safety evaluation data (SED table, warnings)
- Metadata (creation date)
#### Supporting Classes ([classes.py](src/pif_compiler/classes/classes.py))
- **Ingredient**: INCI name, CAS number, quantity, toxicity values (SED, NOAEL, MOS), PubChem data
- **ExpositionInfo**: Application details, exposure routes, calculated daily exposure
- **SedTable**: Safety evaluation data table
- **ProdCompany**: Production company information
#### Type Enumerations ([types_enum.py](src/pif_compiler/classes/types_enum.py))
Bilingual (EN/IT) enums for:
- **CosmeticType**: 100+ product types (foundations, lipsticks, skincare, etc.)
- **PhysicalForm**: Liquid, semi-solid, solid, aerosol, hybrid forms
- **NormalUser**: Adult/Child
- **PlaceApplication**: Face, etc.
- **RoutesExposure**: Dermal, Ocular, Oral
- **NanoRoutes**: Same as above for nanomaterials
### 2. External Data Sources
#### COSING Database ([scraper_cosing.py](src/pif_compiler/functions/scraper_cosing.py))
EU cosmetic ingredients database
- Search by INCI name, CAS number, or EC number
- Extract: substance ID, CAS/EC numbers, restrictions, SCCS opinions
- Handle "identified ingredients" recursively
- Functions: `cosing_search()`, `clean_cosing()`, `parse_cas_numbers()`
#### ECHA Database ([echaFind.py](src/pif_compiler/functions/echaFind.py), [echaProcess.py](src/pif_compiler/functions/echaProcess.py))
European Chemicals Agency dossiers
- **Search**: Find dossiers by CAS/substance name ([echaFind.py:44](src/pif_compiler/functions/echaFind.py:44))
- **Extract**: Toxicity data (NOAEL, LD50) from HTML pages
- **Process**: Convert HTML → Markdown → JSON → DataFrame
- **Scraping Types**: RepeatedDose (NOAEL), AcuteToxicity (LD50)
- **Local caching**: DuckDB in-memory for scraped data
- Functions: `search_dossier()`, `echaExtract()`, `echa_noael_ld50()`
#### PubChem ([pubchem.py](src/pif_compiler/functions/pubchem.py))
Chemical properties for DAP calculation
- Properties: LogP, Molecular Weight, TPSA, Melting Point, pH, Dissociation Constants
- Uses `pubchempy` + custom certificate handling
- Function: `pubchem_dap(cas)`
#### QUACKO/Find Module ([find.py](src/pif_compiler/functions/find.py))
Unified search interface for ECHA
- Search by CAS, EC, or substance name
- Extract multiple sections (ToxSummary, AcuteToxicity, RepeatedDose, GeneticToxicity, physical properties)
- Support for local HTML files
- Functions: `search_dossier()`, `get_section_links_from_index()`
### 3. Database Layer
#### MongoDB ([mongo_functions.py](src/pif_compiler/functions/mongo_functions.py))
- Database: `toxinfo`
- Collection: `toxinfo` (ingredient data from COSING/ECHA)
- Functions:
  - `connect(user, password, database)` - MongoDB Atlas connection
  - `value_search(db, value, mode)` - Search by INCI, CAS, EC, chemical name
### 4. PDF Generation ([html_to_pdf.py](src/pif_compiler/functions/html_to_pdf.py), [pdf_extraction.py](src/pif_compiler/functions/pdf_extraction.py))
- **Playwright-based**: Headless browser for HTML → PDF
- **Dynamic headers**: Inject substance info, ECHA logos
- **Cleanup**: Remove empty sections, fix page breaks
- **Batch processing**: `search_generate_pdfs()` for multiple pages
- Output: Structured folders by CAS/EC/RML ID
## Data Flow
1. **Input**: Product formulation (INCI names, quantities)
2. **Enrichment**:
   - Search COSING for ingredient info
   - Query MongoDB for cached data
   - Fetch PubChem for chemical properties
   - Extract ECHA toxicity data (NOAEL/LD50)
3. **Calculation**:
   - SED (Systemic Exposure Dose)
   - MOS (Margin of Safety)
   - Daily exposure values
4. **Output**: PIF document (likely PDF/HTML format)
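The calculation step can be illustrated with a minimal sketch. It assumes the formulas from the SCCS Notes of Guidance (SED = A × C/100 × DAp/100 and MOS = NOAEL / SED); the function names and the sample numbers are hypothetical and are not taken from this codebase.

```python
def systemic_exposure_dose(daily_exposure_mg_kg: float,
                           concentration_pct: float,
                           dermal_absorption_pct: float) -> float:
    """Return the SED in mg/kg bw/day.

    daily_exposure_mg_kg: product exposure per kg body weight per day (A)
    concentration_pct: ingredient concentration in the product, in % (C)
    dermal_absorption_pct: dermal absorption percentage, in % (DAp)
    """
    return daily_exposure_mg_kg * (concentration_pct / 100) * (dermal_absorption_pct / 100)


def margin_of_safety(noael_mg_kg: float, sed_mg_kg: float) -> float:
    """Return MOS = NOAEL / SED (both in mg/kg bw/day)."""
    if sed_mg_kg <= 0:
        raise ValueError("SED must be positive to compute a margin of safety")
    return noael_mg_kg / sed_mg_kg


# Hypothetical ingredient: 2 % in a product applied at 100 mg/kg bw/day,
# 50 % dermal absorption, NOAEL of 150 mg/kg bw/day.
sed = systemic_exposure_dose(100.0, 2.0, 50.0)
mos = margin_of_safety(150.0, sed)
print(f"SED = {sed:.3f} mg/kg bw/day, MOS = {mos:.0f}")
```

A MOS of at least 100 is the threshold generally applied in SCCS-style assessments; whether this project applies the same cut-off is not stated here.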
## Key Dependencies
- `streamlit` - Frontend
- `pydantic` - Data validation
- `pymongo` - MongoDB client
- `requests` - HTTP requests
- `beautifulsoup4` - HTML parsing
- `playwright` - PDF generation
- `pubchempy` - PubChem API
- `pandas` - Data processing
- `duckdb` - Local caching
## Important Notes
### CAS Number Handling
- CAS numbers can contain special separators (`/`, `;`, `,`, `--`)
- Parser handles parenthetical info and invalid values
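As a rough sketch of the handling described above, the snippet below strips parenthetical notes, splits on the listed separators, and keeps only values that pass the CAS check-digit rule. The names `is_valid_cas` and `split_cas_numbers` are hypothetical; this is not the project's `parse_cas_numbers()` implementation.

```python
import re

# CAS format: 2-7 digits, 2 digits, 1 check digit
_CAS_PATTERN = re.compile(r"\d{2,7}-\d{2}-\d")


def is_valid_cas(cas: str) -> bool:
    """Check the CAS format and its check digit."""
    if not _CAS_PATTERN.fullmatch(cas):
        return False
    digits = cas.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    # Weights 1, 2, 3, ... are applied from the rightmost body digit leftwards
    total = sum(int(d) * weight for weight, d in enumerate(reversed(body), start=1))
    return total % 10 == check


def split_cas_numbers(raw: str) -> list[str]:
    """Split a raw CAS field into valid CAS numbers, dropping everything else."""
    without_parens = re.sub(r"\([^)]*\)", " ", raw)
    # Split on "--" first so the single hyphens inside a CAS number survive
    candidates = re.split(r"--|[/;,]", without_parens)
    return [c.strip() for c in candidates if is_valid_cas(c.strip())]


parsed = split_cas_numbers("50-00-0/7732-18-5; 64-17-5 (ethanol), -- , n/a")
print(parsed)
```

Validating the check digit is a design choice of this sketch: it discards placeholders such as `n/a` without needing a list of known invalid values.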
### ECHA Scraping
- **Logging**: All operations logged to `echa.log`
- **Dossier Status**: Active preferred, falls back to Inactive
- **Scraping Modes**:
  - `local_search=True`: Check local cache first
  - `local_only=True`: Only use cached data
- **Multi-substance**: `echaExtract_multi()` for batch processing
- **Filtering**: Can filter by route (oral/dermal/inhalation) and dose descriptor
### Bilingual Support
- Enums support EN/IT via `TranslatedEnum.get_translation(lang)`
- Italian used as primary language in comments
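One way such a bilingual enum could be built is sketched below. Only the name `TranslatedEnum` and the method `get_translation(lang)` come from the note above; the tuple layout, the member names, and the Italian labels are illustrative assumptions.

```python
from enum import Enum


class TranslatedEnum(Enum):
    """Enum whose value is an (english, italian) tuple (assumed layout)."""

    def get_translation(self, lang: str) -> str:
        index = {"en": 0, "it": 1}.get(lang.lower())
        if index is None:
            raise ValueError(f"Unsupported language: {lang!r}")
        return self.value[index]


class RoutesExposure(TranslatedEnum):
    # Labels are illustrative, not copied from types_enum.py
    DERMAL = ("Dermal", "Cutanea")
    OCULAR = ("Ocular", "Oculare")
    ORAL = ("Oral", "Orale")


print(RoutesExposure.DERMAL.get_translation("en"))
print(RoutesExposure.DERMAL.get_translation("it"))
```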
### Regulatory Context
- SCCS: Scientific Committee on Consumer Safety
- CNCP: Cosmetic Products Notification Portal (the official EU acronym is CPNP)
- NOAEL: No Observed Adverse Effect Level
- SED: Systemic Exposure Dose
- MOS: Margin of Safety
- DAP: Dermal Absorption Percentage
## TODO/Future Work
- Relational DB implementation (mentioned but not present)
- Streamlit UI (referenced but code not in current files)
- Main entry point (`pif-compiler` script in pyproject.toml)
- LLM approximation for exposure values (mentioned in [classes.py:55-60](src/pif_compiler/classes/classes.py:55))
## Development Notes
- Project appears to consolidate previously separate codebases
- Heavy use of external APIs (rate limiting may apply)
- Certificate handling needed for PubChem API
- MongoDB credentials required for database access

@ -11,6 +11,7 @@ dependencies = [
    "beautifulsoup4>=4.14.2",
    "dotenv>=0.9.9",
    "duckdb>=1.4.1",
    "fastapi>=0.121.2",
    "marimo>=0.16.5",
    "markdown-to-json>=2.1.2",
    "markdownify>=1.2.0",
@ -25,6 +26,7 @@ dependencies = [
    "python-dotenv>=1.2.1",
    "requests>=2.32.5",
    "streamlit>=1.50.0",
    "uvicorn>=0.35.0",
    "weasyprint>=66.0",
]

@ -0,0 +1,102 @@
from fastapi import APIRouter, HTTPException, status
from pydantic import BaseModel, Field
from typing import Optional, Dict, Any

from pif_compiler.services.srv_echa import orchestrator
from pif_compiler.services.common_log import get_logger

logger = get_logger()

router = APIRouter()


class EchaRequest(BaseModel):
    cas: str = Field(..., description="CAS number of the substance to search for")

    class Config:
        json_schema_extra = {
            "example": {
                "cas": "50-00-0"
            }
        }


class EchaResponse(BaseModel):
    success: bool
    cas: str
    data: Optional[Dict[str, Any]] = None
    error: Optional[str] = None


@router.post("/echa/search", response_model=EchaResponse, tags=["ECHA"])
async def search_echa_substance(request: EchaRequest):
    """
    Search for substance information in ECHA database.

    This endpoint orchestrates the full ECHA data extraction process:
    1. Validates the CAS number
    2. Checks local MongoDB cache
    3. If not found locally, fetches from ECHA:
        - Substance information
        - Dossier information
        - Toxicological information
        - Acute toxicity data
        - Repeated dose toxicity data
    4. Caches the result locally

    Args:
        request: EchaRequest containing the CAS number

    Returns:
        EchaResponse with the substance data or error information
    """
    logger.info(f"API request received for CAS: {request.cas}")

    try:
        result = orchestrator(request.cas)

        if result is None:
            logger.warning(f"No data found for CAS: {request.cas}")
            return EchaResponse(
                success=False,
                cas=request.cas,
                data=None,
                error="No data found for the provided CAS number. The CAS may be invalid or not registered in ECHA."
            )

        # Remove MongoDB _id field if present (it's not JSON serializable)
        if "_id" in result:
            del result["_id"]

        logger.info(f"Successfully retrieved data for CAS: {request.cas}")
        return EchaResponse(
            success=True,
            cas=request.cas,
            data=result,
            error=None
        )

    except Exception as e:
        logger.error(f"Error processing request for CAS {request.cas}: {str(e)}", exc_info=True)
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=f"Internal error while processing CAS {request.cas}: {str(e)}"
        )


@router.get("/echa/health", tags=["ECHA"])
async def echa_health_check():
    """
    Health check endpoint for ECHA service.

    Returns the status of the ECHA service components.
    """
    return {
        "status": "healthy",
        "service": "echa-orchestrator",
        "components": {
            "api": "operational",
            "scraper": "operational",
            "parser": "operational"
        }
    }
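
Note for API consumers: when no data is found, the search endpoint above returns HTTP 200 with `success=False` rather than an error status, so clients need to inspect the envelope. The sketch below shows one way to do that using only the field names of `EchaResponse`; `EchaLookupError`, `unwrap_echa_response`, and the sample payloads are hypothetical.

```python
from typing import Any, Dict


class EchaLookupError(Exception):
    """Raised when the response envelope reports a failed lookup (hypothetical)."""


def unwrap_echa_response(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return the `data` field of an EchaResponse-shaped dict or raise on failure."""
    if not payload.get("success"):
        reason = payload.get("error") or "unknown error"
        raise EchaLookupError(f"CAS {payload.get('cas')}: {reason}")
    return payload.get("data") or {}


# Sample payloads shaped like EchaResponse; the contents of `data` are invented.
found = {"success": True, "cas": "50-00-0", "data": {"name": "example"}, "error": None}
missing = {"success": False, "cas": "00-00-0", "data": None, "error": "No data found"}

print(unwrap_echa_response(found))
try:
    unwrap_echa_response(missing)
except EchaLookupError as exc:
    print(f"Lookup failed: {exc}")
```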


@ -1,18 +0,0 @@
from playwright.sync_api import sync_playwright


def generate_pdf(url, pdf_path):
    with sync_playwright() as p:
        # Launch a browser (can be 'chromium', 'firefox', or 'webkit')
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Navigate to the given URL
        page.goto(url)

        # Generate the PDF
        page.pdf(path=pdf_path, format="A4")

        # Close the browser
        browser.close()

    # Message text: "PDF saved successfully to: <path>"
    print(f"PDF salvato con successo in: {pdf_path}")

172
src/pif_compiler/main.py Normal file

@ -0,0 +1,172 @@
from fastapi import FastAPI, Request, status
from fastapi.encoders import jsonable_encoder
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
from fastapi.exceptions import RequestValidationError
from contextlib import asynccontextmanager
import time

from pif_compiler.services.common_log import get_logger

# Import the application routers
from pif_compiler.api.routes import api_echa

# Logging configuration
logger = get_logger()


# Lifespan events (startup/shutdown) - the modern alternative to @app.on_event
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    logger.info("🚀 Starting up API...")
    # Initialize here:
    # - Database connections

    yield  # The app runs here

    # Shutdown
    logger.info("👋 Shutting down API...")
    # Close connections and clean up here:
    # - Close DB connections
    # - Save cache
    # - Release resources


# Initialize FastAPI
app = FastAPI(
    title="Comsoguard API",
    description="Central API for Comsoguard services",
    version="0.0.1",
    docs_url="/docs",
    redoc_url="/redoc",
    openapi_url="/openapi.json",
    lifespan=lifespan
)

# ==================== MIDDLEWARE ====================

# 1. CORS - configure according to your needs.
# Wildcard origins combined with credentials are very permissive;
# restrict allow_origins before exposing this API publicly.
app.add_middleware(
    CORSMiddleware,
    allow_origins=[
        "*"
    ],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


# 2. Logging middleware - logs every request
@app.middleware("http")
async def log_requests(request: Request, call_next):
    # perf_counter is monotonic, so it is safer than time.time() for durations
    start_time = time.perf_counter()

    # request.client can be None (e.g. behind some proxies or in tests)
    client_host = request.client.host if request.client else "unknown"

    # Log the incoming request
    logger.info(
        f"📥 {request.method} {request.url.path} - "
        f"Client: {client_host}"
    )

    # Process the request
    response = await call_next(request)

    # Compute the processing time
    process_time = time.perf_counter() - start_time

    # Log the response
    logger.info(
        f"📤 {request.method} {request.url.path} - "
        f"Status: {response.status_code} - "
        f"Time: {process_time:.3f}s"
    )

    return response


# ==================== EXCEPTION HANDLERS ====================

# Handler for validation errors (422)
@app.exception_handler(RequestValidationError)
async def validation_exception_handler(request: Request, exc: RequestValidationError):
    logger.warning(f"❌ Validation error on {request.url.path}: {exc.errors()}")
    return JSONResponse(
        status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
        # jsonable_encoder guards against non-JSON-serializable values
        # (e.g. a bytes body) in the error details
        content=jsonable_encoder({
            "error": "Validation Error",
            "detail": exc.errors(),
            "body": exc.body
        })
    )


# Handler for generic errors
@app.exception_handler(Exception)
async def general_exception_handler(request: Request, exc: Exception):
    logger.error(f"💥 Unhandled exception on {request.url.path}: {str(exc)}", exc_info=True)
    return JSONResponse(
        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
        content={
            "error": "Internal Server Error",
            # Message text: "An unexpected error occurred. Please try again later."
            "message": "Si è verificato un errore imprevisto. Riprova più tardi."
        }
    )


# ==================== ROUTERS ====================

# Include the application routers here
app.include_router(
    api_echa.router,
    prefix="/api/v1",
    tags=["ECHA"]
)


# ==================== ROOT ENDPOINTS ====================

@app.get("/", tags=["Root"])
async def root():
    """
    Welcome endpoint - shows basic API info
    """
    return {
        "message": "Welcome to Comsoguard API",
        "version": "0.0.1",
        "docs": "/docs",
        "redoc": "/redoc"
    }


@app.get("/health", tags=["Health"])
async def health_check():
    """
    Health check endpoint - useful for monitoring and load balancers
    """
    return {
        "status": "healthy",
        "service": "comsoguard-api",
        "version": "0.0.1"
    }


@app.get("/ping", tags=["Health"])
async def ping():
    """
    Ping endpoint - very fast response to verify that the API is up
    """
    return {"ping": "pong"}


# ==================== RUN SERVER ====================

if __name__ == "__main__":
    import uvicorn

    uvicorn.run(
        "pif_compiler.main:app",
        host="0.0.0.0",
        port=8000,
        reload=True,  # Auto-reload on code changes during development
        log_level="info"
    )

28
uv.lock

@ -17,6 +17,15 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/aa/f3/0b6ced594e51cc95d8c1fc1640d3623770d01e4969d29c0bd09945fafefa/altair-5.5.0-py3-none-any.whl", hash = "sha256:91a310b926508d560fe0148d02a194f38b824122641ef528113d029fcd129f8c", size = 731200 },
]
[[package]]
name = "annotated-doc"
version = "0.0.4"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/57/ba/046ceea27344560984e26a590f90bc7f4a75b06701f653222458922b558c/annotated_doc-0.0.4.tar.gz", hash = "sha256:fbcda96e87e9c92ad167c2e53839e57503ecfda18804ea28102353485033faa4", size = 7288 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/1e/d3/26bf1008eb3d2daa8ef4cacc7f3bfdc11818d111f7e2d0201bc6e3b49d45/annotated_doc-0.0.4-py3-none-any.whl", hash = "sha256:571ac1dc6991c450b25a9c2d84a3705e2ae7a53467b5d111c24fa8baabbed320", size = 5303 },
]
[[package]]
name = "annotated-types"
version = "0.7.0"
@ -400,6 +409,21 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/30/79/4f544d73fcc0513b71296cb3ebb28a227d22e80dec27204977039b9fa875/duckdb-1.4.1-cp313-cp313-win_amd64.whl", hash = "sha256:280fd663dacdd12bb3c3bf41f3e5b2e5b95e00b88120afabb8b8befa5f335c6f", size = 12336460 },
]
[[package]]
name = "fastapi"
version = "0.121.2"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "annotated-doc" },
{ name = "pydantic" },
{ name = "starlette" },
{ name = "typing-extensions" },
]
sdist = { url = "https://files.pythonhosted.org/packages/fb/48/f08f264da34cf160db82c62ffb335e838b1fc16cbcc905f474c7d4c815db/fastapi-0.121.2.tar.gz", hash = "sha256:ca8e932b2b823ec1721c641e3669472c855ad9564a2854c9899d904c2848b8b9", size = 342944 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/eb/23/dfb161e91db7c92727db505dc72a384ee79681fe0603f706f9f9f52c2901/fastapi-0.121.2-py3-none-any.whl", hash = "sha256:f2d80b49a86a846b70cc3a03eb5ea6ad2939298bf6a7fe377aa9cd3dd079d358", size = 109201 },
]
[[package]]
name = "fonttools"
version = "4.60.1"
@ -937,6 +961,7 @@ dependencies = [
{ name = "beautifulsoup4" },
{ name = "dotenv" },
{ name = "duckdb" },
{ name = "fastapi" },
{ name = "marimo" },
{ name = "markdown-to-json" },
{ name = "markdownify" },
@ -951,6 +976,7 @@ dependencies = [
{ name = "python-dotenv" },
{ name = "requests" },
{ name = "streamlit" },
{ name = "uvicorn" },
{ name = "weasyprint" },
]
@ -959,6 +985,7 @@ requires-dist = [
{ name = "beautifulsoup4", specifier = ">=4.14.2" },
{ name = "dotenv", specifier = ">=0.9.9" },
{ name = "duckdb", specifier = ">=1.4.1" },
{ name = "fastapi", specifier = ">=0.121.2" },
{ name = "marimo", specifier = ">=0.16.5" },
{ name = "markdown-to-json", specifier = ">=2.1.2" },
{ name = "markdownify", specifier = ">=1.2.0" },
@ -973,6 +1000,7 @@ requires-dist = [
{ name = "python-dotenv", specifier = ">=1.2.1" },
{ name = "requests", specifier = ">=2.32.5" },
{ name = "streamlit", specifier = ">=1.50.0" },
{ name = "uvicorn", specifier = ">=0.35.0" },
{ name = "weasyprint", specifier = ">=66.0" },
]