# PIF Compiler - Project Context

## Overview
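An application that generates **Product Information Files (PIFs)** for cosmetic products. A PIF is the regulatory dossier, including the cosmetic product safety report, that Regulation (EC) No 1223/2009 requires for every cosmetic product placed on the EU market.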

## Development Environment
- **Platform**: Windows
- **Package Manager**: [uv](https://github.com/astral-sh/uv) - Fast Python package installer and resolver
- **Python Version**: 3.12+

## Tech Stack
- **Backend**: Python 3.12+
- **Frontend**: Streamlit
- **Database**: MongoDB (primary), potential relational DB (not yet implemented)
- **Package Manager**: uv
- **Build System**: hatchling

## Common Commands
```bash
# Install dependencies
uv sync

# Add a new dependency
uv add <package-name>

# Run the application
uv run pif-compiler

# Activate virtual environment (if needed)
.venv\Scripts\activate           # Windows (cmd / PowerShell)
source .venv/Scripts/activate    # Windows (Git Bash)
```

## Project Structure

```
pif_compiler/
├── src/pif_compiler/
│   ├── classes/           # Data models & type definitions
│   │   ├── pif_class.py   # Main PIF data model
│   │   ├── classes.py     # Supporting classes (Ingredient, ExpositionInfo, SedTable, ProdCompany)
│   │   └── types_enum.py  # Enums for cosmetic types, physical forms, exposure routes
│   │
│   └── functions/         # Core functionality modules
│       ├── scraper_cosing.py    # COSING database scraper (EU cosmetic ingredients)
│       ├── mongo_functions.py   # MongoDB connection & queries
│       ├── html_to_pdf.py       # PDF generation with Playwright
│       ├── echaFind.py          # ECHA dossier search
│       ├── echaProcess.py       # ECHA data extraction & processing
│       ├── pubchem.py           # PubChem API for chemical properties
│       ├── find.py              # Unified search interface (QUACKO/ECHA)
│       └── pdf_extraction.py    # PDF processing utilities
│
└── data/
    ├── pif_schema.json    # JSON schema for PIF structure
    └── input.json         # Example input data format
```

## Core Functionality

### 1. Data Models ([classes/](src/pif_compiler/classes/))

#### PIF Class ([pif_class.py](src/pif_compiler/classes/pif_class.py:10))
Main data model containing:
- Product information (name, type, CNCP, company)
- Ingredient list with quantities
- Exposure information
- Safety evaluation data (SED table, warnings)
- Metadata (creation date)

#### Supporting Classes ([classes.py](src/pif_compiler/classes/classes.py))
- **Ingredient**: INCI name, CAS number, quantity, toxicity values (SED, NOAEL, MOS), PubChem data
- **ExpositionInfo**: Application details, exposure routes, calculated daily exposure
- **SedTable**: Safety evaluation data table
- **ProdCompany**: Production company information

#### Type Enumerations ([types_enum.py](src/pif_compiler/classes/types_enum.py))
Bilingual (EN/IT) enums for:
- **CosmeticType**: 100+ product types (foundations, lipsticks, skincare, etc.)
- **PhysicalForm**: Liquid, semi-solid, solid, aerosol, hybrid forms
- **NormalUser**: Adult/Child
- **PlaceApplication**: Face, etc.
- **RoutesExposure**: Dermal, Ocular, Oral
- **NanoRoutes**: Same routes as **RoutesExposure**, applied to nanomaterials

### 2. External Data Sources

#### COSING Database ([scraper_cosing.py](src/pif_compiler/functions/scraper_cosing.py))
EU cosmetic ingredients database
- Search by INCI name, CAS number, or EC number
- Extract: substance ID, CAS/EC numbers, restrictions, SCCS opinions
- Handle "identified ingredients" recursively
- Functions: `cosing_search()`, `clean_cosing()`, `parse_cas_numbers()`

#### ECHA Database ([echaFind.py](src/pif_compiler/functions/echaFind.py), [echaProcess.py](src/pif_compiler/functions/echaProcess.py))
European Chemicals Agency dossiers
- **Search**: Find dossiers by CAS/substance name ([echaFind.py:44](src/pif_compiler/functions/echaFind.py:44))
- **Extract**: Toxicity data (NOAEL, LD50) from HTML pages
- **Process**: Convert HTML → Markdown → JSON → DataFrame
- **Scraping Types**: RepeatedDose (NOAEL), AcuteToxicity (LD50)
- **Local caching**: DuckDB in-memory for scraped data
- Functions: `search_dossier()`, `echaExtract()`, `echa_noael_ld50()`

#### PubChem ([pubchem.py](src/pif_compiler/functions/pubchem.py))
Chemical properties for DAP calculation
- Properties: LogP, Molecular Weight, TPSA, Melting Point, pH, Dissociation Constants
- Uses `pubchempy` + custom certificate handling
- Function: `pubchem_dap(cas)`

#### QUACKO/Find Module ([find.py](src/pif_compiler/functions/find.py))
Unified search interface for ECHA
- Search by CAS, EC, or substance name
- Extract multiple sections (ToxSummary, AcuteToxicity, RepeatedDose, GeneticToxicity, physical properties)
- Support for local HTML files
- Functions: `search_dossier()`, `get_section_links_from_index()`

### 3. Database Layer

#### MongoDB ([mongo_functions.py](src/pif_compiler/functions/mongo_functions.py))
- Database: `toxinfo`
- Collection: `toxinfo` (ingredient data from COSING/ECHA)
- Functions:
  - `connect(user, password, database)` - MongoDB Atlas connection
  - `value_search(db, value, mode)` - Search by INCI, CAS, EC, chemical name
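A mode-based lookup like `value_search` might translate the mode into a MongoDB filter roughly as sketched below. This is only an illustration: `build_filter`, `MODE_TO_FIELD`, and every field name in it are hypothetical, and the real documents in `toxinfo` may be shaped differently.

```python
import re

# Hypothetical mapping from search mode to document field; the real field
# names in the `toxinfo` collection may differ.
MODE_TO_FIELD = {
    "inci": "inciName",
    "cas": "casNo",
    "ec": "ecNo",
    "name": "chemicalName",
}


def build_filter(value: str, mode: str) -> dict:
    """Build a case-insensitive exact-match filter for the given mode."""
    try:
        field = MODE_TO_FIELD[mode.lower()]
    except KeyError:
        raise ValueError(f"Unknown search mode: {mode!r}") from None
    # Escape the value so characters such as "(" or "." are matched literally.
    pattern = f"^{re.escape(value.strip())}$"
    return {field: {"$regex": pattern, "$options": "i"}}
```

With `pymongo`, the resulting dictionary could be passed to `collection.find_one(build_filter("64-17-5", "cas"))`.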

### 4. PDF Generation ([html_to_pdf.py](src/pif_compiler/functions/html_to_pdf.py), [pdf_extraction.py](src/pif_compiler/functions/pdf_extraction.py))
- **Playwright-based**: Headless browser for HTML → PDF
- **Dynamic headers**: Inject substance info, ECHA logos
- **Cleanup**: Remove empty sections, fix page breaks
- **Batch processing**: `search_generate_pdfs()` for multiple pages
- Output: Structured folders by CAS/EC/RML ID

## Data Flow

1. **Input**: Product formulation (INCI names, quantities)
2. **Enrichment**:
   - Search COSING for ingredient info
   - Query MongoDB for cached data
   - Fetch PubChem for chemical properties
   - Extract ECHA toxicity data (NOAEL/LD50)
3. **Calculation**:
   - SED (Systemic Exposure Dose)
   - MOS (Margin of Safety)
   - Daily exposure values
4. **Output**: PIF document (likely PDF/HTML format)
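
The calculation step can be sketched as follows, assuming the formulas commonly used in SCCS guidance: SED = A × (C / 100) × (DAP / 100), where A is the daily product exposure per kg of body weight, and MOS = NOAEL / SED. The `IngredientExposure` container and its field names are hypothetical and are not taken from the project's classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngredientExposure:
    """Hypothetical container; field names are illustrative, not the project's."""

    daily_product_exposure: float  # mg of product per kg body weight per day
    concentration_pct: float       # ingredient concentration in the product, %
    dap_pct: float                 # dermal absorption percentage (DAP), %
    noael: float                   # mg per kg body weight per day


def systemic_exposure_dose(exp: IngredientExposure) -> float:
    """SED in mg/kg bw/day = A x (C / 100) x (DAP / 100)."""
    return exp.daily_product_exposure * exp.concentration_pct / 100 * exp.dap_pct / 100


def margin_of_safety(exp: IngredientExposure) -> float:
    """MOS = NOAEL / SED; undefined when there is no systemic exposure."""
    sed = systemic_exposure_dose(exp)
    if sed <= 0:
        raise ValueError("SED must be positive to compute a margin of safety")
    return exp.noael / sed
```

By convention, a MOS of at least 100 is generally taken as the acceptance threshold, so a caller would typically compare the result against that value.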

## Key Dependencies
- `streamlit` - Frontend
- `pydantic` - Data validation
- `pymongo` - MongoDB client
- `requests` - HTTP requests
- `beautifulsoup4` - HTML parsing
- `playwright` - PDF generation
- `pubchempy` - PubChem API
- `pandas` - Data processing
- `duckdb` - Local caching

## Important Notes

### CAS Number Handling
- CAS numbers can contain special separators (`/`, `;`, `,`, `--`)
- Parser handles parenthetical info and invalid values
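
A minimal sketch of this kind of parsing, using only the standard library. It is not the project's `parse_cas_numbers()`: the name `split_cas_numbers` and the exact cleaning rules are assumptions made for illustration. Validation relies on the standard CAS check digit (digits weighted 1, 2, 3, ... from right to left, summed modulo 10).

```python
import re

# Separators named in the notes above: "/", ";", "," and "--".
_SEPARATORS = re.compile(r"--|[/;,]")
_PARENTHETICAL = re.compile(r"\([^)]*\)")
_CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas: str) -> bool:
    """Check the CAS format and its trailing check digit."""
    match = _CAS_PATTERN.match(cas)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight digits 1, 2, 3, ... from right to left; sum mod 10 is the check digit.
    total = sum(int(d) * weight for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def split_cas_numbers(raw: str) -> list[str]:
    """Split a raw CAS field into valid, de-duplicated CAS numbers."""
    cleaned = _PARENTHETICAL.sub("", raw)  # drop parenthetical notes
    result: list[str] = []
    for part in _SEPARATORS.split(cleaned):
        candidate = part.strip()
        if is_valid_cas(candidate) and candidate not in result:
            result.append(candidate)
    return result
```

For example, `split_cas_numbers("7732-18-5 (water) / 64-17-5; n/a")` keeps the two valid numbers and discards the rest.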

### ECHA Scraping
- **Logging**: All operations logged to `echa.log`
- **Dossier Status**: Active preferred, falls back to Inactive
- **Scraping Modes**:
  - `local_search=True`: Check local cache first
  - `local_only=True`: Only use cached data
- **Multi-substance**: `echaExtract_multi()` for batch processing
- **Filtering**: Can filter by route (oral/dermal/inhalation) and dose descriptor
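
The cache modes and the dossier-status fallback described above might behave roughly like the sketch below. A plain dictionary stands in for the DuckDB cache, and `fetch_substance`, `pick_dossier`, and the `status` key are hypothetical names, not the project's API.

```python
from typing import Callable, Optional

Record = dict[str, object]


def fetch_substance(
    cas: str,
    cache: dict[str, Record],
    scrape: Callable[[str], Optional[Record]],
    local_search: bool = True,
    local_only: bool = False,
) -> Optional[Record]:
    """Return cached data when allowed, otherwise scrape and cache the result."""
    if (local_search or local_only) and cas in cache:
        return cache[cas]
    if local_only:
        return None  # never touch the network in this mode
    record = scrape(cas)
    if record is not None:
        cache[cas] = record
    return record


def pick_dossier(dossiers: list[Record]) -> Optional[Record]:
    """Prefer the first Active dossier; fall back to the first Inactive one."""
    for status in ("Active", "Inactive"):
        for dossier in dossiers:
            if dossier.get("status") == status:
                return dossier
    return None
```

Passing the scraper in as a callable keeps the lookup logic testable without network access.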

### Bilingual Support
- Enums support EN/IT via `TranslatedEnum.get_translation(lang)`
- Code comments are written mainly in Italian
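
One way a bilingual enum with a `get_translation(lang)` method could be built is sketched here. The storage layout (an English/Italian tuple per member) and the Italian labels are assumptions for illustration; the project's `TranslatedEnum` may be implemented differently.

```python
from enum import Enum


class TranslatedEnum(Enum):
    """Each member's value is an (English, Italian) label pair (assumed layout)."""

    def get_translation(self, lang: str) -> str:
        english, italian = self.value
        labels = {"en": english, "it": italian}
        try:
            return labels[lang.lower()]
        except KeyError:
            raise ValueError(f"Unsupported language: {lang!r}") from None


class RoutesExposure(TranslatedEnum):
    # Italian labels are illustrative translations, not copied from the codebase.
    DERMAL = ("Dermal", "Dermica")
    OCULAR = ("Ocular", "Oculare")
    ORAL = ("Oral", "Orale")
```

With this layout, `RoutesExposure.DERMAL.get_translation("it")` yields the Italian label, and an unsupported language code raises `ValueError`.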

### Regulatory Context
- SCCS: Scientific Committee on Consumer Safety
- CNCP: Cosmetic Products Notification Portal (the EU portal's official abbreviation is CPNP)
- NOAEL: No Observed Adverse Effect Level
- SED: Systemic Exposure Dose
- MOS: Margin of Safety
- DAP: Dermal Absorption Percentage

## TODO/Future Work
- Relational DB implementation (mentioned but not present)
- Streamlit UI (referenced but code not in current files)
- Main entry point (`pif-compiler` script in pyproject.toml)
- LLM approximation for exposure values (mentioned in [classes.py:55-60](src/pif_compiler/classes/classes.py:55))

## Development Notes
- Project appears to consolidate previously separate codebases
- Heavy use of external APIs (rate limiting may apply)
- Certificate handling needed for PubChem API
- MongoDB credentials required for database access