PIF Compiler - Project Context
Overview
An application that generates Product Information Files (PIFs) for cosmetic products. A PIF is the regulatory document that holds a cosmetic product's safety assessment and supporting data.
Development Environment
- Platform: Windows
- Package Manager: uv - Fast Python package installer and resolver
- Python Version: 3.12+
Tech Stack
- Backend: Python 3.12+
- Frontend: Streamlit
- Database: MongoDB (primary), potential relational DB (not yet implemented)
- Package Manager: uv
- Build System: hatchling
Common Commands
# Install dependencies
uv sync
# Add a new dependency
uv add <package-name>
# Run the application
uv run pif-compiler
# Activate virtual environment (if needed)
.venv\Scripts\activate # Windows
Project Structure
pif_compiler/
├── src/pif_compiler/
│   ├── classes/                 # Data models & type definitions
│   │   ├── pif_class.py         # Main PIF data model
│   │   ├── classes.py           # Supporting classes (Ingredient, ExpositionInfo, SedTable, ProdCompany)
│   │   └── types_enum.py        # Enums for cosmetic types, physical forms, exposure routes
│   │
│   └── functions/               # Core functionality modules
│       ├── scraper_cosing.py    # COSING database scraper (EU cosmetic ingredients)
│       ├── mongo_functions.py   # MongoDB connection & queries
│       ├── html_to_pdf.py       # PDF generation with Playwright
│       ├── echaFind.py          # ECHA dossier search
│       ├── echaProcess.py       # ECHA data extraction & processing
│       ├── pubchem.py           # PubChem API for chemical properties
│       ├── find.py              # Unified search interface (QUACKO/ECHA)
│       └── pdf_extraction.py    # PDF processing utilities
│
└── data/
    ├── pif_schema.json          # JSON schema for PIF structure
    └── input.json               # Example input data format
Core Functionality
1. Data Models (classes/)
PIF Class (pif_class.py)
Main data model containing:
- Product information (name, type, CNCP, company)
- Ingredient list with quantities
- Exposure information
- Safety evaluation data (SED table, warnings)
- Metadata (creation date)
Supporting Classes (classes.py)
- Ingredient: INCI name, CAS number, quantity, toxicity values (SED, NOAEL, MOS), PubChem data
- ExpositionInfo: Application details, exposure routes, calculated daily exposure
- SedTable: Safety evaluation data table
- ProdCompany: Production company information
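As a rough illustration of how an ingredient record might be modelled, the sketch below uses standard-library dataclasses instead of the project's pydantic models so that it runs without extra dependencies. Every field name here is an assumption inferred from the descriptions above, not the actual definition in classes.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ingredient:
    """Hypothetical shape of an ingredient record; field names are assumed."""
    inci_name: str
    cas: Optional[str]
    quantity_pct: float              # concentration in the formula, % w/w
    noael: Optional[float] = None    # mg/kg bw/day
    sed: Optional[float] = None      # mg/kg bw/day
    mos: Optional[float] = None      # dimensionless ratio

    def __post_init__(self) -> None:
        # Reject concentrations that cannot be a percentage of the formula.
        if not 0 <= self.quantity_pct <= 100:
            raise ValueError(f"quantity_pct out of range: {self.quantity_pct}")


def total_quantity(ingredients: list[Ingredient]) -> float:
    """Sum the declared concentrations; a complete formula should be near 100."""
    return sum(i.quantity_pct for i in ingredients)


formula = [
    Ingredient("AQUA", "7732-18-5", 85.0),
    Ingredient("GLYCERIN", "56-81-5", 15.0),
]
print(total_quantity(formula))
```

Validating in `__post_init__` mirrors what a pydantic validator would do: bad input fails at construction time rather than during the later SED/MOS calculation.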
Type Enumerations (types_enum.py)
Bilingual (EN/IT) enums for:
- CosmeticType: 100+ product types (foundations, lipsticks, skincare, etc.)
- PhysicalForm: Liquid, semi-solid, solid, aerosol, hybrid forms
- NormalUser: Adult/Child
- PlaceApplication: Face, etc.
- RoutesExposure: Dermal, Ocular, Oral
- NanoRoutes: Same as above for nanomaterials
2. External Data Sources
COSING Database (scraper_cosing.py)
EU cosmetic ingredients database
- Search by INCI name, CAS number, or EC number
- Extract: substance ID, CAS/EC numbers, restrictions, SCCS opinions
- Handle "identified ingredients" recursively
- Functions: cosing_search(), clean_cosing(), parse_cas_numbers()
ECHA Database (echaFind.py, echaProcess.py)
European Chemicals Agency dossiers
- Search: Find dossiers by CAS/substance name (echaFind.py:44)
- Extract: Toxicity data (NOAEL, LD50) from HTML pages
- Process: Convert HTML → Markdown → JSON → DataFrame
- Scraping Types: RepeatedDose (NOAEL), AcuteToxicity (LD50)
- Local caching: DuckDB in-memory for scraped data
- Functions: search_dossier(), echaExtract(), echa_noael_ld50()
PubChem (pubchem.py)
Chemical properties for DAP calculation
- Properties: LogP, Molecular Weight, TPSA, Melting Point, pH, Dissociation Constants
- Uses pubchempy + custom certificate handling
- Function: pubchem_dap(cas)
QUACKO/Find Module (find.py)
Unified search interface for ECHA
- Search by CAS, EC, or substance name
- Extract multiple sections (ToxSummary, AcuteToxicity, RepeatedDose, GeneticToxicity, physical properties)
- Support for local HTML files
- Functions: search_dossier(), get_section_links_from_index()
3. Database Layer
MongoDB (mongo_functions.py)
- Database: toxinfo
- Collection: toxinfo (ingredient data from COSING/ECHA)
- Functions:
  - connect(user, password, database) - MongoDB Atlas connection
  - value_search(db, value, mode) - Search by INCI, CAS, EC, chemical name
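One way a mode-based lookup like value_search() could translate its arguments into a MongoDB filter is sketched below. The mode names and document field names (`inciName`, `casNo`, `ecNo`, `chemicalName`) are hypothetical; the real schema of the toxinfo collection may differ. The sketch only builds the filter dictionary, so it runs without a database connection.

```python
import re

# Hypothetical mapping from search mode to document field; the actual
# field names in the toxinfo collection are not documented here.
MODE_TO_FIELD = {
    "inci": "inciName",
    "cas": "casNo",
    "ec": "ecNo",
    "name": "chemicalName",
}


def build_filter(value: str, mode: str) -> dict:
    """Return a MongoDB filter for a case-insensitive exact match on one field."""
    try:
        field = MODE_TO_FIELD[mode.lower()]
    except KeyError:
        raise ValueError(f"unknown search mode: {mode!r}") from None
    # Escape the value so characters such as '-' or '.' are matched literally.
    pattern = f"^{re.escape(value.strip())}$"
    return {field: {"$regex": pattern, "$options": "i"}}


print(build_filter("Glycerin", "inci"))
```

With pymongo, the resulting dictionary could be passed to `collection.find_one(...)`. Keeping the filter construction separate from the query makes the mode handling easy to test offline.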
4. PDF Generation (html_to_pdf.py, pdf_extraction.py)
- Playwright-based: Headless browser for HTML → PDF
- Dynamic headers: Inject substance info, ECHA logos
- Cleanup: Remove empty sections, fix page breaks
- Batch processing: search_generate_pdfs() for multiple pages
- Output: Structured folders by CAS/EC/RML ID
Data Flow
- Input: Product formulation (INCI names, quantities)
- Enrichment:
  - Search COSING for ingredient info
  - Query MongoDB for cached data
  - Fetch PubChem for chemical properties
  - Extract ECHA toxicity data (NOAEL/LD50)
- Calculation:
  - SED (Systemic Exposure Dose)
  - MOS (Margin of Safety)
  - Daily exposure values
- Output: PIF document (likely PDF/HTML format)
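The calculation step can be sketched with the standard formulas from the SCCS Notes of Guidance: SED = A × C/100 × DAp/100, where A is the relative daily exposure to the product (mg/kg bw/day), C the ingredient concentration (%) and DAp the dermal absorption (%), and MOS = NOAEL / SED. The function names and example numbers below are illustrative assumptions, not taken from the project code.

```python
def systemic_exposure_dose(
    daily_exposure_mg_per_kg: float,
    concentration_pct: float,
    dap_pct: float,
) -> float:
    """SED in mg/kg bw/day from relative daily exposure, concentration and DAP."""
    return daily_exposure_mg_per_kg * (concentration_pct / 100) * (dap_pct / 100)


def margin_of_safety(noael_mg_per_kg: float, sed_mg_per_kg: float) -> float:
    """MOS as the ratio NOAEL / SED; both inputs in mg/kg bw/day."""
    if sed_mg_per_kg <= 0:
        raise ValueError("SED must be positive to compute a margin of safety")
    return noael_mg_per_kg / sed_mg_per_kg


# Illustrative inputs only: 100 mg/kg bw/day product exposure,
# 2 % ingredient concentration, 50 % dermal absorption, NOAEL 200 mg/kg bw/day.
sed = systemic_exposure_dose(100.0, 2.0, 50.0)
mos = margin_of_safety(200.0, sed)
print(f"SED = {sed:.3f} mg/kg bw/day, MOS = {mos:.0f}")
```

A MOS of at least 100 is the threshold conventionally treated as acceptable in SCCS safety evaluations, so a check such as `mos >= 100` would be a natural place to raise a warning in the PIF.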
Key Dependencies
- streamlit - Frontend
- pydantic - Data validation
- pymongo - MongoDB client
- requests - HTTP requests
- beautifulsoup4 - HTML parsing
- playwright - PDF generation
- pubchempy - PubChem API
- pandas - Data processing
- duckdb - Local caching
Important Notes
CAS Number Handling
- CAS numbers can contain special separators (/, ;, ,, --)
- Parser handles parenthetical info and invalid values
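A minimal sketch of this kind of parsing is shown below. It is not the project's parse_cas_numbers(); the function names are hypothetical, and it assumes that parenthetical text is annotation to discard and that "invalid values" can be detected with the standard CAS check digit (the last digit equals the weighted sum of the preceding digits, counted from the right, modulo 10).

```python
import re

# A CAS Registry Number has the form NNNNNNN-NN-N (2 to 7 leading digits).
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def cas_checksum_ok(cas: str) -> bool:
    """Validate the CAS check digit against the weighted digit sum."""
    digits = cas.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    total = sum(int(d) * weight for weight, d in enumerate(reversed(body), start=1))
    return total % 10 == check


def parse_cas(raw: str) -> list[str]:
    """Extract unique, checksum-valid CAS numbers from a messy field."""
    # Assumption: parenthetical text is annotation, so it is dropped first.
    cleaned = re.sub(r"\([^)]*\)", " ", raw)
    result: list[str] = []
    for match in CAS_PATTERN.finditer(cleaned):
        cas = "-".join(match.groups())
        if cas not in result and cas_checksum_ok(cas):
            result.append(cas)
    return result


print(parse_cas("56-81-5 / 7732-18-5 (water); 56-81-4,, 56-81-5--50-00-0"))
```

Matching the CAS pattern directly, rather than splitting on each separator, sidesteps the ambiguity between the `--` separator and the hyphens inside a CAS number.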
ECHA Scraping
- Logging: All operations logged to echa.log
- Dossier Status: Active preferred, falls back to Inactive
- Scraping Modes:
  - local_search=True: Check local cache first
  - local_only=True: Only use cached data
- Multi-substance: echaExtract_multi() for batch processing
- Filtering: Can filter by route (oral/dermal/inhalation) and dose descriptor
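The interaction between the two cache flags and the Active/Inactive preference can be sketched as follows. This is an assumption about the intended semantics, not the code in echaProcess.py: a plain dict stands in for the DuckDB cache, and `fetch_remote`, `get_dossier_data` and `pick_dossier` are hypothetical names.

```python
from typing import Callable, Optional


def pick_dossier(dossiers: list[dict]) -> Optional[dict]:
    """Prefer the first Active dossier, otherwise fall back to an Inactive one."""
    for status in ("Active", "Inactive"):
        for dossier in dossiers:
            if dossier.get("status") == status:
                return dossier
    return None


def get_dossier_data(
    cas: str,
    cache: dict[str, dict],
    fetch_remote: Callable[[str], dict],
    local_search: bool = True,
    local_only: bool = False,
) -> Optional[dict]:
    """Resolve data for one substance, honouring the two cache flags."""
    if (local_search or local_only) and cas in cache:
        return cache[cas]
    if local_only:
        return None  # never touch the network in this mode
    data = fetch_remote(cas)
    cache[cas] = data  # store the result for later lookups
    return data


# Stand-in for the real scraper so the example runs offline.
def fake_fetch(cas: str) -> dict:
    return {"cas": cas, "source": "remote"}


cache: dict[str, dict] = {}
print(get_dossier_data("56-81-5", cache, fake_fetch, local_only=True))
print(get_dossier_data("56-81-5", cache, fake_fetch))
```

Passing the fetch function in as a parameter keeps the cache logic testable without hitting ECHA, which matters given that rate limiting may apply.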
Bilingual Support
- Enums support EN/IT via TranslatedEnum.get_translation(lang)
- Italian used as primary language in comments
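A bilingual enum of this kind could be built as sketched below. The layout, with each member's value as an (English, Italian) pair, is an assumption; only the names TranslatedEnum, get_translation(lang) and RoutesExposure come from the project, and the Italian labels are illustrative.

```python
from enum import Enum


class TranslatedEnum(Enum):
    """Base enum whose member values are (english, italian) label pairs."""

    def get_translation(self, lang: str) -> str:
        labels = {"en": self.value[0], "it": self.value[1]}
        try:
            return labels[lang.lower()]
        except KeyError:
            raise ValueError(f"unsupported language: {lang!r}") from None


class RoutesExposure(TranslatedEnum):
    # Labels are illustrative; the real enum may use different wording.
    DERMAL = ("Dermal", "Dermica")
    OCULAR = ("Ocular", "Oculare")
    ORAL = ("Oral", "Orale")


for route in RoutesExposure:
    print(route.name, route.get_translation("en"), route.get_translation("it"))
```

Putting the lookup in a shared base class means each concrete enum only declares its label pairs, which keeps 100+ CosmeticType entries manageable.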
Regulatory Context
- SCCS: Scientific Committee on Consumer Safety
- CNCP: Cosmetic Products Notification Portal (the EU portal, officially abbreviated CPNP)
- NOAEL: No Observed Adverse Effect Level
- SED: Systemic Exposure Dose
- MOS: Margin of Safety
- DAP: Dermal Absorption Percentage
TODO/Future Work
- Relational DB implementation (mentioned but not present)
- Streamlit UI (referenced but code not in current files)
- Main entry point (pif-compiler script in pyproject.toml)
- LLM approximation for exposure values (mentioned in classes.py:55-60)
Development Notes
- Project appears to consolidate previously separate codebases
- Heavy use of external APIs (rate limiting may apply)
- Certificate handling needed for PubChem API
- MongoDB credentials required for database access