# PIF Compiler - Project Context

## Overview

Application to generate **Product Information Files (PIF)** for cosmetic products. This is a regulatory document required for cosmetics safety assessment.

## Development Environment

- **Platform**: Windows
- **Package Manager**: [uv](https://github.com/astral-sh/uv) - Fast Python package installer and resolver
- **Python Version**: 3.12+

## Tech Stack

- **Backend**: Python 3.12+
- **Frontend**: Streamlit
- **Database**: MongoDB (primary), potential relational DB (not yet implemented)
- **Package Manager**: uv
- **Build System**: hatchling

## Common Commands

```bash
# Install dependencies
uv sync

# Add a new dependency
uv add

# Run the application
uv run pif-compiler

# Activate virtual environment (if needed)
.venv\Scripts\activate  # Windows
```

## Project Structure

```
pif_compiler/
├── src/pif_compiler/
│   ├── classes/                 # Data models & type definitions
│   │   ├── pif_class.py         # Main PIF data model
│   │   ├── classes.py           # Supporting classes (Ingredient, ExpositionInfo, SedTable, ProdCompany)
│   │   └── types_enum.py        # Enums for cosmetic types, physical forms, exposure routes
│   │
│   └── functions/               # Core functionality modules
│       ├── scraper_cosing.py    # COSING database scraper (EU cosmetic ingredients)
│       ├── mongo_functions.py   # MongoDB connection & queries
│       ├── html_to_pdf.py       # PDF generation with Playwright
│       ├── echaFind.py          # ECHA dossier search
│       ├── echaProcess.py       # ECHA data extraction & processing
│       ├── pubchem.py           # PubChem API for chemical properties
│       ├── find.py              # Unified search interface (QUACKO/ECHA)
│       └── pdf_extraction.py    # PDF processing utilities
│
└── data/
    ├── pif_schema.json          # JSON schema for PIF structure
    └── input.json               # Example input data format
```

## Core Functionality

### 1. Data Models ([classes/](src/pif_compiler/classes/))

#### PIF Class ([pif_class.py](src/pif_compiler/classes/pif_class.py:10))

Main data model containing:

- Product information (name, type, CNCP, company)
- Ingredient list with quantities
- Exposure information
- Safety evaluation data (SED table, warnings)
- Metadata (creation date)

#### Supporting Classes ([classes.py](src/pif_compiler/classes/classes.py))

- **Ingredient**: INCI name, CAS number, quantity, toxicity values (SED, NOAEL, MOS), PubChem data
- **ExpositionInfo**: Application details, exposure routes, calculated daily exposure
- **SedTable**: Safety evaluation data table
- **ProdCompany**: Production company information

#### Type Enumerations ([types_enum.py](src/pif_compiler/classes/types_enum.py))

Bilingual (EN/IT) enums for:

- **CosmeticType**: 100+ product types (foundations, lipsticks, skincare, etc.)
- **PhysicalForm**: Liquid, semi-solid, solid, aerosol, hybrid forms
- **NormalUser**: Adult/Child
- **PlaceApplication**: Face, etc.
- **RoutesExposure**: Dermal, Ocular, Oral
- **NanoRoutes**: Same as above for nanomaterials

### 2. External Data Sources

#### COSING Database ([scraper_cosing.py](src/pif_compiler/functions/scraper_cosing.py))

EU cosmetic ingredients database

- Search by INCI name, CAS number, or EC number
- Extract: substance ID, CAS/EC numbers, restrictions, SCCS opinions
- Handle "identified ingredients" recursively
- Functions: `cosing_search()`, `clean_cosing()`, `parse_cas_numbers()`

#### ECHA Database ([echaFind.py](src/pif_compiler/functions/echaFind.py), [echaProcess.py](src/pif_compiler/functions/echaProcess.py))

European Chemicals Agency dossiers

- **Search**: Find dossiers by CAS/substance name ([echaFind.py:44](src/pif_compiler/functions/echaFind.py:44))
- **Extract**: Toxicity data (NOAEL, LD50) from HTML pages
- **Process**: Convert HTML → Markdown → JSON → DataFrame
- **Scraping Types**: RepeatedDose (NOAEL), AcuteToxicity (LD50)
- **Local caching**: DuckDB in-memory for scraped data
- Functions: `search_dossier()`, `echaExtract()`, `echa_noael_ld50()`

#### PubChem ([pubchem.py](src/pif_compiler/functions/pubchem.py))

Chemical properties for DAP calculation

- Properties: LogP, Molecular Weight, TPSA, Melting Point, pH, Dissociation Constants
- Uses `pubchempy` + custom certificate handling
- Function: `pubchem_dap(cas)`

#### QUACKO/Find Module ([find.py](src/pif_compiler/functions/find.py))

Unified search interface for ECHA

- Search by CAS, EC, or substance name
- Extract multiple sections (ToxSummary, AcuteToxicity, RepeatedDose, GeneticToxicity, physical properties)
- Support for local HTML files
- Functions: `search_dossier()`, `get_section_links_from_index()`

### 3. Database Layer

#### MongoDB ([mongo_functions.py](src/pif_compiler/functions/mongo_functions.py))

- Database: `toxinfo`
- Collection: `toxinfo` (ingredient data from COSING/ECHA)
- Functions:
  - `connect(user, password, database)` - MongoDB Atlas connection
  - `value_search(db, value, mode)` - Search by INCI, CAS, EC, chemical name

### 4. PDF Generation ([html_to_pdf.py](src/pif_compiler/functions/html_to_pdf.py), [pdf_extraction.py](src/pif_compiler/functions/pdf_extraction.py))

- **Playwright-based**: Headless browser for HTML → PDF
- **Dynamic headers**: Inject substance info, ECHA logos
- **Cleanup**: Remove empty sections, fix page breaks
- **Batch processing**: `search_generate_pdfs()` for multiple pages
- Output: Structured folders by CAS/EC/RML ID

## Data Flow

1. **Input**: Product formulation (INCI names, quantities)
2. **Enrichment**:
   - Search COSING for ingredient info
   - Query MongoDB for cached data
   - Fetch PubChem for chemical properties
   - Extract ECHA toxicity data (NOAEL/LD50)
3. **Calculation**:
   - SED (Systemic Exposure Dose)
   - MOS (Margin of Safety)
   - Daily exposure values
4. **Output**: PIF document (likely PDF/HTML format)

## Key Dependencies

- `streamlit` - Frontend
- `pydantic` - Data validation
- `pymongo` - MongoDB client
- `requests` - HTTP requests
- `beautifulsoup4` - HTML parsing
- `playwright` - PDF generation
- `pubchempy` - PubChem API
- `pandas` - Data processing
- `duckdb` - Local caching

## Important Notes

### CAS Number Handling

- CAS numbers can contain special separators (`/`, `;`, `,`, `--`)
- Parser handles parenthetical info and invalid values

### ECHA Scraping

- **Logging**: All operations logged to `echa.log`
- **Dossier Status**: Active preferred, falls back to Inactive
- **Scraping Modes**:
  - `local_search=True`: Check local cache first
  - `local_only=True`: Only use cached data
- **Multi-substance**: `echaExtract_multi()` for batch processing
- **Filtering**: Can filter by route (oral/dermal/inhalation) and dose descriptor

### Bilingual Support

- Enums support EN/IT via `TranslatedEnum.get_translation(lang)`
- Italian used as primary language in comments

### Regulatory Context

- SCCS: Scientific Committee on Consumer Safety
- CNCP: Cosmetic Notification Portal
- NOAEL: No Observed Adverse Effect Level
- SED: Systemic Exposure Dose
- MOS: Margin of Safety
- DAP: Dermal Absorption Percentage

## TODO/Future Work

- Relational DB implementation (mentioned but not present)
- Streamlit UI (referenced but code not in current files)
- Main entry point (`pif-compiler` script in pyproject.toml)
- LLM approximation for exposure values (mentioned in [classes.py:55-60](src/pif_compiler/classes/classes.py:55))

## Development Notes

- Project appears to consolidate previously separate codebases
- Heavy use of external APIs (rate limiting may apply)
- Certificate handling needed for PubChem API
- MongoDB credentials required for database access
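
## Example: Splitting a CAS Field (illustrative sketch)

The CAS Number Handling notes above describe fields that mix several CAS numbers, separators, and parenthetical remarks. The following is a minimal sketch of how such a field could be split and filtered. It is not the project's `parse_cas_numbers()`: the names `split_cas_field` and `has_valid_check_digit` are hypothetical, and treating "invalid values" as "fails the standard CAS check digit" is an assumption about what the real parser rejects.

```python
import re

# Separators listed in the notes: "/", ";", "," and "--".
# "--" is tried first so the single hyphens inside a CAS number are left alone.
_SEPARATORS = re.compile(r"\s*(?:--|[/;,])\s*")
_PARENTHETICAL = re.compile(r"\([^)]*\)")
_CAS_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def has_valid_check_digit(cas: str) -> bool:
    """Return True if `cas` has the CAS format and a correct check digit."""
    match = _CAS_PATTERN.fullmatch(cas)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... from right to left and sum them;
    # the last digit of the sum must equal the check digit.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def split_cas_field(raw: str) -> list[str]:
    """Split a raw CAS field and keep only well-formed CAS numbers (hypothetical helper)."""
    without_notes = _PARENTHETICAL.sub(" ", raw)
    candidates = (part.strip() for part in _SEPARATORS.split(without_notes))
    return [c for c in candidates if has_valid_check_digit(c)]


print(split_cas_field("7732-18-5 (water) / 56-81-5; n/a, 64-17-5"))
```

Dropping invalid entries silently keeps the sketch short; a real pipeline might prefer to log rejected values so that data-entry errors in the source database remain visible.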