# PIF Compiler - Project Context
## Overview
Application to generate **Product Information Files (PIF)** for cosmetic products. This is a regulatory document required for cosmetics safety assessment.
## Development Environment
- **Platform**: Windows
- **Package Manager**: [uv](https://github.com/astral-sh/uv) - Fast Python package installer and resolver
- **Python Version**: 3.12+
## Tech Stack
- **Backend**: Python 3.12+
- **Frontend**: Streamlit
- **Database**: MongoDB (primary), potential relational DB (not yet implemented)
- **Package Manager**: uv
- **Build System**: hatchling
## Common Commands
```bash
# Install dependencies
uv sync
# Add a new dependency
uv add <package-name>
# Run the application
uv run pif-compiler
# Activate virtual environment (if needed)
.venv\Scripts\activate # Windows
```
## Project Structure
```
pif_compiler/
├── src/pif_compiler/
│   ├── classes/                  # Data models & type definitions
│   │   ├── pif_class.py          # Main PIF data model
│   │   ├── classes.py            # Supporting classes (Ingredient, ExpositionInfo, SedTable, ProdCompany)
│   │   └── types_enum.py         # Enums for cosmetic types, physical forms, exposure routes
│   │
│   └── functions/                # Core functionality modules
│       ├── scraper_cosing.py     # COSING database scraper (EU cosmetic ingredients)
│       ├── mongo_functions.py    # MongoDB connection & queries
│       ├── html_to_pdf.py        # PDF generation with Playwright
│       ├── echaFind.py           # ECHA dossier search
│       ├── echaProcess.py        # ECHA data extraction & processing
│       ├── pubchem.py            # PubChem API for chemical properties
│       ├── find.py               # Unified search interface (QUACKO/ECHA)
│       └── pdf_extraction.py     # PDF processing utilities
└── data/
    ├── pif_schema.json           # JSON schema for PIF structure
    └── input.json                # Example input data format
```
## Core Functionality
### 1. Data Models ([classes/](src/pif_compiler/classes/))
#### PIF Class ([pif_class.py](src/pif_compiler/classes/pif_class.py:10))
Main data model containing:
- Product information (name, type, CNCP, company)
- Ingredient list with quantities
- Exposure information
- Safety evaluation data (SED table, warnings)
- Metadata (creation date)
#### Supporting Classes ([classes.py](src/pif_compiler/classes/classes.py))
- **Ingredient**: INCI name, CAS number, quantity, toxicity values (SED, NOAEL, MOS), PubChem data
- **ExpositionInfo**: Application details, exposure routes, calculated daily exposure
- **SedTable**: Safety evaluation data table
- **ProdCompany**: Production company information
#### Type Enumerations ([types_enum.py](src/pif_compiler/classes/types_enum.py))
Bilingual (EN/IT) enums for:
- **CosmeticType**: 100+ product types (foundations, lipsticks, skincare, etc.)
- **PhysicalForm**: Liquid, semi-solid, solid, aerosol, hybrid forms
- **NormalUser**: Adult/Child
- **PlaceApplication**: Face, etc.
- **RoutesExposure**: Dermal, Ocular, Oral
- **NanoRoutes**: Same as above for nanomaterials
### 2. External Data Sources
#### COSING Database ([scraper_cosing.py](src/pif_compiler/functions/scraper_cosing.py))
EU cosmetic ingredients database
- Search by INCI name, CAS number, or EC number
- Extract: substance ID, CAS/EC numbers, restrictions, SCCS opinions
- Handle "identified ingredients" recursively
- Functions: `cosing_search()`, `clean_cosing()`, `parse_cas_numbers()`
#### ECHA Database ([echaFind.py](src/pif_compiler/functions/echaFind.py), [echaProcess.py](src/pif_compiler/functions/echaProcess.py))
European Chemicals Agency dossiers
- **Search**: Find dossiers by CAS/substance name ([echaFind.py:44](src/pif_compiler/functions/echaFind.py:44))
- **Extract**: Toxicity data (NOAEL, LD50) from HTML pages
- **Process**: Convert HTML → Markdown → JSON → DataFrame
- **Scraping Types**: RepeatedDose (NOAEL), AcuteToxicity (LD50)
- **Local caching**: DuckDB in-memory for scraped data
- Functions: `search_dossier()`, `echaExtract()`, `echa_noael_ld50()`
#### PubChem ([pubchem.py](src/pif_compiler/functions/pubchem.py))
Chemical properties for DAP calculation
- Properties: LogP, Molecular Weight, TPSA, Melting Point, pH, Dissociation Constants
- Uses `pubchempy` + custom certificate handling
- Function: `pubchem_dap(cas)`
#### QUACKO/Find Module ([find.py](src/pif_compiler/functions/find.py))
Unified search interface for ECHA
- Search by CAS, EC, or substance name
- Extract multiple sections (ToxSummary, AcuteToxicity, RepeatedDose, GeneticToxicity, physical properties)
- Support for local HTML files
- Functions: `search_dossier()`, `get_section_links_from_index()`
### 3. Database Layer
#### MongoDB ([mongo_functions.py](src/pif_compiler/functions/mongo_functions.py))
- Database: `toxinfo`
- Collection: `toxinfo` (ingredient data from COSING/ECHA)
- Functions:
  - `connect(user, password, database)` - MongoDB Atlas connection
  - `value_search(db, value, mode)` - Search by INCI, CAS, EC, or chemical name
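A mode-based lookup like `value_search` can be thought of as mapping each search mode to a document field and building a MongoDB filter from it. The sketch below is only an illustration of that idea: `build_filter`, `MODE_TO_FIELD`, and every field name in it are hypothetical and are not taken from `mongo_functions.py`.

```python
import re

# Hypothetical mapping from search mode to document field. The real field
# names in the `toxinfo` collection may differ.
MODE_TO_FIELD = {
    "inci": "inciName",
    "cas": "casNo",
    "ec": "ecNo",
    "name": "chemicalName",
}


def build_filter(value: str, mode: str) -> dict:
    """Build a case-insensitive exact-match filter for one search mode."""
    try:
        field = MODE_TO_FIELD[mode.lower()]
    except KeyError:
        raise ValueError(f"Unknown search mode: {mode!r}") from None
    # re.escape keeps characters such as parentheses in chemical names
    # from being interpreted as regex syntax.
    return {field: {"$regex": f"^{re.escape(value)}$", "$options": "i"}}
```

With a `pymongo` database handle, the resulting filter could be passed to something like `db["toxinfo"].find_one(build_filter("AQUA", "inci"))`.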
### 4. PDF Generation ([html_to_pdf.py](src/pif_compiler/functions/html_to_pdf.py), [pdf_extraction.py](src/pif_compiler/functions/pdf_extraction.py))
- **Playwright-based**: Headless browser for HTML → PDF
- **Dynamic headers**: Inject substance info, ECHA logos
- **Cleanup**: Remove empty sections, fix page breaks
- **Batch processing**: `search_generate_pdfs()` for multiple pages
- Output: Structured folders by CAS/EC/RML ID
## Data Flow
1. **Input**: Product formulation (INCI names, quantities)
2. **Enrichment**:
   - Search COSING for ingredient info
   - Query MongoDB for cached data
   - Fetch PubChem for chemical properties
   - Extract ECHA toxicity data (NOAEL/LD50)
3. **Calculation**:
   - SED (Systemic Exposure Dose)
   - MOS (Margin of Safety)
   - Daily exposure values
4. **Output**: PIF document (likely PDF/HTML format)
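The calculation step can be sketched as follows. This is a minimal illustration that assumes the standard formulas from the SCCS Notes of Guidance (SED = daily exposure per kg body weight × concentration × dermal absorption, and MOS = NOAEL / SED). The function and parameter names are hypothetical and do not come from the project code.

```python
def systemic_exposure_dose(
    daily_exposure_mg_kg_bw: float,
    concentration_pct: float,
    dermal_absorption_pct: float,
) -> float:
    """Return the SED in mg/kg bw/day.

    daily_exposure_mg_kg_bw: product applied per day, per kg body weight
    concentration_pct: ingredient concentration in the product (0-100)
    dermal_absorption_pct: dermal absorption percentage, DAP (0-100)
    """
    return (
        daily_exposure_mg_kg_bw
        * (concentration_pct / 100)
        * (dermal_absorption_pct / 100)
    )


def margin_of_safety(noael_mg_kg_bw: float, sed_mg_kg_bw: float) -> float:
    """Return the MOS as the ratio NOAEL / SED."""
    if sed_mg_kg_bw <= 0:
        raise ValueError("SED must be positive to compute a margin of safety")
    return noael_mg_kg_bw / sed_mg_kg_bw


# Illustrative numbers only, not real product data.
sed = systemic_exposure_dose(100.0, concentration_pct=2.0, dermal_absorption_pct=50.0)
mos = margin_of_safety(noael_mg_kg_bw=100.0, sed_mg_kg_bw=sed)
```

A MOS of at least 100 is the conventional acceptance threshold in the SCCS guidance, so a check such as `mos >= 100` is a natural follow-up to this calculation.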
## Key Dependencies
- `streamlit` - Frontend
- `pydantic` - Data validation
- `pymongo` - MongoDB client
- `requests` - HTTP requests
- `beautifulsoup4` - HTML parsing
- `playwright` - PDF generation
- `pubchempy` - PubChem API
- `pandas` - Data processing
- `duckdb` - Local caching
## Important Notes
### CAS Number Handling
- CAS numbers can contain special separators (`/`, `;`, `,`, `--`)
- Parser handles parenthetical info and invalid values
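As an illustration of what such parsing involves, the sketch below splits a raw CAS field on the separators listed above, drops parenthetical notes, and keeps only entries that pass the standard CAS check-digit test. It is not the project's `parse_cas_numbers()`; the helper names `split_cas_field` and `is_valid_cas` are hypothetical.

```python
import re

# A CAS Registry Number has 2-7 digits, 2 digits, and 1 check digit.
_CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas: str) -> bool:
    """Check the format and the check digit of a single CAS number."""
    match = _CAS_PATTERN.match(cas)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight digits 1, 2, 3, ... starting from the rightmost one.
    checksum = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return checksum % 10 == int(match.group(3))


def split_cas_field(raw: str) -> list[str]:
    """Extract the valid CAS numbers from a multi-valued raw field."""
    without_parens = re.sub(r"\([^)]*\)", "", raw)
    # "--" must be tried before single-character separators; single
    # hyphens inside a CAS number are left untouched.
    candidates = re.split(r"--|[/;,]", without_parens)
    return [c.strip() for c in candidates if is_valid_cas(c.strip())]
```

For example, `split_cas_field("56-81-5 / 7732-18-5 (water); -")` keeps the two well-formed numbers and discards the stray `-`.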
### ECHA Scraping
- **Logging**: All operations logged to `echa.log`
- **Dossier Status**: Active preferred, falls back to Inactive
- **Scraping Modes**:
  - `local_search=True`: Check local cache first
  - `local_only=True`: Only use cached data
- **Multi-substance**: `echaExtract_multi()` for batch processing
- **Filtering**: Can filter by route (oral/dermal/inhalation) and dose descriptor
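The interaction between the two scraping-mode flags can be pictured with the sketch below. It is an assumption about the intended semantics, not the actual ECHA code: `lookup_substance`, the dict-based cache, and the `fetch_remote` callback are all hypothetical, and returning `None` on a `local_only` cache miss is a guess.

```python
from typing import Callable, Optional


def lookup_substance(
    cas: str,
    cache: dict[str, dict],
    fetch_remote: Callable[[str], dict],
    local_search: bool = True,
    local_only: bool = False,
) -> Optional[dict]:
    """Resolve a substance from the cache and/or a remote scrape."""
    # Consult the cache whenever either flag asks for local data.
    if (local_search or local_only) and cas in cache:
        return cache[cas]
    # local_only forbids network access, so a cache miss yields nothing.
    if local_only:
        return None
    # Otherwise scrape remotely and remember the result for next time.
    result = fetch_remote(cas)
    cache[cas] = result
    return result
```

Passing the scraper as a callback keeps the sketch free of network code and makes the cache behavior easy to exercise with a stub.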
### Bilingual Support
- Enums support EN/IT via `TranslatedEnum.get_translation(lang)`
- Italian is the primary language in code comments
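One way a translated enum of this kind could work is shown below. This is a hypothetical reimplementation for illustration only: the real `TranslatedEnum` in `types_enum.py` may store its translations differently, and the Italian labels here are assumed rather than copied from the project.

```python
from enum import Enum


class TranslatedEnum(Enum):
    """Enum whose value is an (English, Italian) label pair."""

    def get_translation(self, lang: str) -> str:
        labels = {"en": self.value[0], "it": self.value[1]}
        try:
            return labels[lang.lower()]
        except KeyError:
            raise ValueError(f"Unsupported language: {lang!r}") from None


class RoutesExposure(TranslatedEnum):
    # Italian labels are illustrative assumptions.
    DERMAL = ("Dermal", "Dermica")
    OCULAR = ("Ocular", "Oculare")
    ORAL = ("Oral", "Orale")
```

Calling `RoutesExposure.DERMAL.get_translation("it")` then returns the Italian label, while an unsupported language code raises `ValueError` instead of silently falling back.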
### Regulatory Context
- SCCS: Scientific Committee on Consumer Safety
- CNCP: Cosmetic Products Notification Portal (the official EU acronym is CPNP)
- NOAEL: No Observed Adverse Effect Level
- SED: Systemic Exposure Dose
- MOS: Margin of Safety
- DAP: Dermal Absorption Percentage
## TODO/Future Work
- Relational DB implementation (mentioned but not present)
- Streamlit UI (referenced but code not in current files)
- Main entry point (`pif-compiler` script in pyproject.toml)
- LLM approximation for exposure values (mentioned in [classes.py:55-60](src/pif_compiler/classes/classes.py:55))
## Development Notes
- Project appears to consolidate previously separate codebases
- Heavy use of external APIs (rate limiting may apply)
- Certificate handling needed for PubChem API
- MongoDB credentials required for database access