# PIF Compiler - Project Context

## Overview
Application to generate **Product Information Files (PIF)** for cosmetic products. A PIF is a regulatory document required for cosmetics safety assessment.

## Development Environment
- **Platform**: Windows
- **Package Manager**: [uv](https://github.com/astral-sh/uv) - Fast Python package installer and resolver
- **Python Version**: 3.12+

## Tech Stack
- **Backend**: Python 3.12+
- **Frontend**: Streamlit
- **Database**: MongoDB (primary), potential relational DB (not yet implemented)
- **Package Manager**: uv
- **Build System**: hatchling

## Common Commands
```bash
# Install dependencies
uv sync

# Add a new dependency
uv add <package-name>

# Run the application
uv run pif-compiler

# Activate virtual environment (if needed)
.venv\Scripts\activate  # Windows
```

## Project Structure

```
pif_compiler/
├── src/pif_compiler/
│   ├── classes/                 # Data models & type definitions
│   │   ├── pif_class.py         # Main PIF data model
│   │   ├── classes.py           # Supporting classes (Ingredient, ExpositionInfo, SedTable, ProdCompany)
│   │   └── types_enum.py        # Enums for cosmetic types, physical forms, exposure routes
│   │
│   └── functions/               # Core functionality modules
│       ├── scraper_cosing.py    # COSING database scraper (EU cosmetic ingredients)
│       ├── mongo_functions.py   # MongoDB connection & queries
│       ├── html_to_pdf.py       # PDF generation with Playwright
│       ├── echaFind.py          # ECHA dossier search
│       ├── echaProcess.py       # ECHA data extraction & processing
│       ├── pubchem.py           # PubChem API for chemical properties
│       ├── find.py              # Unified search interface (QUACKO/ECHA)
│       └── pdf_extraction.py    # PDF processing utilities
│
└── data/
    ├── pif_schema.json          # JSON schema for PIF structure
    └── input.json               # Example input data format
```

## Core Functionality

### 1. Data Models ([classes/](src/pif_compiler/classes/))

#### PIF Class ([pif_class.py](src/pif_compiler/classes/pif_class.py:10))
Main data model containing:
- Product information (name, type, CNCP, company)
- Ingredient list with quantities
- Exposure information
- Safety evaluation data (SED table, warnings)
- Metadata (creation date)

#### Supporting Classes ([classes.py](src/pif_compiler/classes/classes.py))
- **Ingredient**: INCI name, CAS number, quantity, toxicity values (SED, NOAEL, MOS), PubChem data
- **ExpositionInfo**: Application details, exposure routes, calculated daily exposure
- **SedTable**: Safety evaluation data table
- **ProdCompany**: Production company information

#### Type Enumerations ([types_enum.py](src/pif_compiler/classes/types_enum.py))
Bilingual (EN/IT) enums for:
- **CosmeticType**: 100+ product types (foundations, lipsticks, skincare, etc.)
- **PhysicalForm**: Liquid, semi-solid, solid, aerosol, hybrid forms
- **NormalUser**: Adult/Child
- **PlaceApplication**: Face, etc.
- **RoutesExposure**: Dermal, Ocular, Oral
- **NanoRoutes**: Same as above for nanomaterials

### 2. External Data Sources

#### COSING Database ([scraper_cosing.py](src/pif_compiler/functions/scraper_cosing.py))
EU cosmetic ingredients database
- Search by INCI name, CAS number, or EC number
- Extract: substance ID, CAS/EC numbers, restrictions, SCCS opinions
- Handle "identified ingredients" recursively
- Functions: `cosing_search()`, `clean_cosing()`, `parse_cas_numbers()`

#### ECHA Database ([echaFind.py](src/pif_compiler/functions/echaFind.py), [echaProcess.py](src/pif_compiler/functions/echaProcess.py))
European Chemicals Agency dossiers
- **Search**: Find dossiers by CAS/substance name ([echaFind.py:44](src/pif_compiler/functions/echaFind.py:44))
- **Extract**: Toxicity data (NOAEL, LD50) from HTML pages
- **Process**: Convert HTML → Markdown → JSON → DataFrame
- **Scraping Types**: RepeatedDose (NOAEL), AcuteToxicity (LD50)
- **Local caching**: DuckDB in-memory for scraped data
- Functions: `search_dossier()`, `echaExtract()`, `echa_noael_ld50()`

#### PubChem ([pubchem.py](src/pif_compiler/functions/pubchem.py))
Chemical properties for DAP calculation
- Properties: LogP, Molecular Weight, TPSA, Melting Point, pH, Dissociation Constants
- Uses `pubchempy` + custom certificate handling
- Function: `pubchem_dap(cas)`

#### QUACKO/Find Module ([find.py](src/pif_compiler/functions/find.py))
Unified search interface for ECHA
- Search by CAS, EC, or substance name
- Extract multiple sections (ToxSummary, AcuteToxicity, RepeatedDose, GeneticToxicity, physical properties)
- Support for local HTML files
- Functions: `search_dossier()`, `get_section_links_from_index()`

### 3. Database Layer

#### MongoDB ([mongo_functions.py](src/pif_compiler/functions/mongo_functions.py))
- Database: `toxinfo`
- Collection: `toxinfo` (ingredient data from COSING/ECHA)
- Functions:
  - `connect(user, password, database)` - MongoDB Atlas connection
  - `value_search(db, value, mode)` - Search by INCI, CAS, EC, chemical name
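
The `mode` argument of `value_search()` presumably selects which document field is matched. A minimal sketch of such a mapping follows. The field names (`inci`, `cas`, `ec`, `chemical_name`) and the helper `build_filter` are assumptions made for illustration and are not taken from `mongo_functions.py`.

```python
# Hypothetical mapping from search mode to document field; the real
# field names in the `toxinfo` collection may differ.
MODE_TO_FIELD = {
    "inci": "inci",
    "cas": "cas",
    "ec": "ec",
    "name": "chemical_name",
}


def build_filter(value: str, mode: str) -> dict:
    """Build a MongoDB equality filter for the given search mode."""
    try:
        field = MODE_TO_FIELD[mode.lower()]
    except KeyError:
        raise ValueError(f"unsupported search mode: {mode!r}") from None
    return {field: value.strip()}


# With a pymongo database handle `db`, the filter would be used as:
#     db["toxinfo"].find_one(build_filter("56-81-5", "cas"))
print(build_filter(" 56-81-5 ", "CAS"))
```

Keeping the filter construction separate from the database call makes the mode handling testable without a live MongoDB Atlas connection.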

### 4. PDF Generation ([html_to_pdf.py](src/pif_compiler/functions/html_to_pdf.py), [pdf_extraction.py](src/pif_compiler/functions/pdf_extraction.py))
- **Playwright-based**: Headless browser for HTML → PDF
- **Dynamic headers**: Inject substance info, ECHA logos
- **Cleanup**: Remove empty sections, fix page breaks
- **Batch processing**: `search_generate_pdfs()` for multiple pages
- Output: Structured folders by CAS/EC/RML ID

## Data Flow

1. **Input**: Product formulation (INCI names, quantities)
2. **Enrichment**:
   - Search COSING for ingredient info
   - Query MongoDB for cached data
   - Fetch PubChem for chemical properties
   - Extract ECHA toxicity data (NOAEL/LD50)
3. **Calculation**:
   - SED (Systemic Exposure Dose)
   - MOS (Margin of Safety)
   - Daily exposure values
4. **Output**: PIF document (likely PDF/HTML format)
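
The calculation step can be illustrated with a short sketch. The source does not spell out the formulas, so this assumes the convention commonly used in SCCS guidance: SED = A × C/100 × DAp/100 (A is the daily exposure in mg/kg bw/day, C the concentration in percent, DAp the dermal absorption in percent) and MOS = NOAEL / SED. The function names are hypothetical and the numbers are illustrative only.

```python
def systemic_exposure_dose(daily_exposure_mg_per_kg: float,
                           concentration_pct: float,
                           dermal_absorption_pct: float) -> float:
    """SED in mg/kg bw/day, assuming SED = A x C/100 x DAp/100."""
    return (daily_exposure_mg_per_kg
            * concentration_pct / 100.0
            * dermal_absorption_pct / 100.0)


def margin_of_safety(noael_mg_per_kg: float, sed_mg_per_kg: float) -> float:
    """MOS as the ratio NOAEL / SED (dimensionless)."""
    if sed_mg_per_kg <= 0:
        raise ValueError("SED must be positive")
    return noael_mg_per_kg / sed_mg_per_kg


# Illustrative numbers only, not taken from any dossier.
sed = systemic_exposure_dose(
    daily_exposure_mg_per_kg=100.0,
    concentration_pct=2.0,
    dermal_absorption_pct=50.0,
)
mos = margin_of_safety(noael_mg_per_kg=200.0, sed_mg_per_kg=sed)
print(f"SED = {sed:.3f} mg/kg bw/day, MOS = {mos:.0f}")
```

Under this convention a MOS of at least 100 is generally treated as acceptable, which is why the NOAEL from ECHA and the DAP derived from PubChem properties both feed into this step.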

## Key Dependencies
- `streamlit` - Frontend
- `pydantic` - Data validation
- `pymongo` - MongoDB client
- `requests` - HTTP requests
- `beautifulsoup4` - HTML parsing
- `playwright` - PDF generation
- `pubchempy` - PubChem API
- `pandas` - Data processing
- `duckdb` - Local caching

## Important Notes

### CAS Number Handling
- CAS numbers can contain special separators (`/`, `;`, `,`, `--`)
- Parser handles parenthetical info and invalid values
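
A minimal sketch of this kind of parsing is shown below. It is not the project's `parse_cas_numbers()`; the helper names `is_valid_cas` and `split_cas_numbers` are hypothetical, and validating with the CAS check digit is an assumption about what "invalid values" means.

```python
import re

# CAS format: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas: str) -> bool:
    """Check the format and the CAS check digit."""
    match = CAS_PATTERN.match(cas)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weights increase from right to left, starting at 1.
    checksum = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return checksum % 10 == int(match.group(3))


def split_cas_numbers(raw: str) -> list[str]:
    """Split a raw CAS field into a list of valid CAS numbers."""
    # Drop parenthetical remarks such as "(glycerin)".
    cleaned = re.sub(r"\([^)]*\)", " ", raw)
    # Split on "--" as a unit so it is not confused with the single
    # hyphens inside a CAS number, then on "/", ";" and ",".
    parts = re.split(r"--|[/;,]", cleaned)
    return [p.strip() for p in parts if is_valid_cas(p.strip())]


print(split_cas_numbers("56-81-5 (glycerin) / 7732-18-5; n/a, 64-17-5--50-00-0"))
# → ['56-81-5', '7732-18-5', '64-17-5', '50-00-0']
```

Placeholder values such as `n/a` fail the format check and are dropped silently; a stricter variant could log them instead.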

### ECHA Scraping
- **Logging**: All operations logged to `echa.log`
- **Dossier Status**: Active preferred, falls back to Inactive
- **Scraping Modes**:
  - `local_search=True`: Check local cache first
  - `local_only=True`: Only use cached data
- **Multi-substance**: `echaExtract_multi()` for batch processing
- **Filtering**: Can filter by route (oral/dermal/inhalation) and dose descriptor
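
The interaction of the two scraping-mode flags can be sketched as follows. This is an illustration under assumptions, not the code in `echaProcess.py`: a plain dict stands in for the DuckDB cache, and `lookup_dossier` and `fake_fetch` are hypothetical names.

```python
from typing import Callable, Optional


def lookup_dossier(
    cas: str,
    cache: dict[str, dict],
    fetch_remote: Callable[[str], Optional[dict]],
    local_search: bool = True,
    local_only: bool = False,
) -> Optional[dict]:
    """Return dossier data for a CAS number, honouring the cache flags."""
    # local_only implies a cache lookup even if local_search is False.
    if (local_search or local_only) and cas in cache:
        return cache[cas]
    if local_only:
        return None  # never touch the network in this mode
    result = fetch_remote(cas)
    if result is not None:
        cache[cas] = result  # store for later local lookups
    return result


# Stand-in for the real scraper: records which CAS numbers were requested.
requested: list[str] = []


def fake_fetch(cas: str) -> Optional[dict]:
    requested.append(cas)
    return {"cas": cas, "noael": None}


cache: dict[str, dict] = {}
lookup_dossier("56-81-5", cache, fake_fetch)  # cache miss, fetches remotely
lookup_dossier("56-81-5", cache, fake_fetch)  # served from the cache
print(lookup_dossier("64-17-5", cache, fake_fetch, local_only=True))  # → None
print(requested)  # → ['56-81-5']
```

Checking the cache before fetching keeps repeated runs from hitting ECHA again for the same substance, which matters given the rate limits on external services.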

### Bilingual Support
- Enums support EN/IT via `TranslatedEnum.get_translation(lang)`
- Italian is the primary language in code comments
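
One way such a base class could work is sketched below. The tuple layout of the member values, the language codes, and the Italian labels are assumptions for illustration; the actual definitions in `types_enum.py` may differ.

```python
from enum import Enum


class TranslatedEnum(Enum):
    """Enum whose values are assumed to be (english, italian) label pairs."""

    def get_translation(self, lang: str) -> str:
        translations = {"en": self.value[0], "it": self.value[1]}
        try:
            return translations[lang.lower()]
        except KeyError:
            raise ValueError(f"unsupported language: {lang!r}") from None


class RoutesExposure(TranslatedEnum):
    # Illustrative members; labels are not copied from the project.
    DERMAL = ("Dermal", "Dermale")
    OCULAR = ("Ocular", "Oculare")
    ORAL = ("Oral", "Orale")


print(RoutesExposure.DERMAL.get_translation("it"))  # → Dermale
```

Putting the lookup on a shared base class means every enum (`CosmeticType`, `PhysicalForm`, and so on) gets the same translation behavior without repeating it.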

### Regulatory Context
- SCCS: Scientific Committee on Consumer Safety
- CNCP: Cosmetic Products Notification Portal (the EU portal, officially abbreviated CPNP)
- NOAEL: No Observed Adverse Effect Level
- SED: Systemic Exposure Dose
- MOS: Margin of Safety
- DAP: Dermal Absorption Percentage

## TODO/Future Work
- Relational DB implementation (mentioned but not present)
- Streamlit UI (referenced but code not in current files)
- Main entry point (`pif-compiler` script in pyproject.toml)
- LLM approximation for exposure values (mentioned in [classes.py:55-60](src/pif_compiler/classes/classes.py:55))

## Development Notes
- Project appears to consolidate previously separate codebases
- Heavy use of external APIs (rate limiting may apply)
- Certificate handling needed for PubChem API
- MongoDB credentials required for database access