PIF Compiler - Project Context

Overview

Application that generates Product Information Files (PIFs) for cosmetic products. A PIF is the regulatory dossier required for each cosmetic product placed on the EU market; it includes the product's safety assessment.

Development Environment

  • Platform: Windows
  • Package Manager: uv - Fast Python package installer and resolver
  • Python Version: 3.12+

Tech Stack

  • Backend: Python 3.12+
  • Frontend: Streamlit
  • Database: MongoDB (primary), potential relational DB (not yet implemented)
  • Package Manager: uv
  • Build System: hatchling

Common Commands

# Install dependencies
uv sync

# Add a new dependency
uv add <package-name>

# Run the application
uv run pif-compiler

# Activate virtual environment (if needed)
.venv\Scripts\activate  # Windows

Project Structure

pif_compiler/
├── src/pif_compiler/
│   ├── classes/           # Data models & type definitions
│   │   ├── pif_class.py   # Main PIF data model
│   │   ├── classes.py     # Supporting classes (Ingredient, ExpositionInfo, SedTable, ProdCompany)
│   │   └── types_enum.py  # Enums for cosmetic types, physical forms, exposure routes
│   │
│   └── functions/         # Core functionality modules
│       ├── scraper_cosing.py    # COSING database scraper (EU cosmetic ingredients)
│       ├── mongo_functions.py   # MongoDB connection & queries
│       ├── html_to_pdf.py       # PDF generation with Playwright
│       ├── echaFind.py          # ECHA dossier search
│       ├── echaProcess.py       # ECHA data extraction & processing
│       ├── pubchem.py           # PubChem API for chemical properties
│       ├── find.py              # Unified search interface (QUACKO/ECHA)
│       └── pdf_extraction.py    # PDF processing utilities
│
└── data/
    ├── pif_schema.json    # JSON schema for PIF structure
    └── input.json         # Example input data format

Core Functionality

1. Data Models (classes/)

PIF Class (pif_class.py)

Main data model containing:

  • Product information (name, type, CNCP, company)
  • Ingredient list with quantities
  • Exposure information
  • Safety evaluation data (SED table, warnings)
  • Metadata (creation date)

Supporting Classes (classes.py)

  • Ingredient: INCI name, CAS number, quantity, toxicity values (SED, NOAEL, MOS), PubChem data
  • ExpositionInfo: Application details, exposure routes, calculated daily exposure
  • SedTable: Safety evaluation data table
  • ProdCompany: Production company information
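
A minimal sketch of what one of these supporting models might look like. The project lists pydantic for validation; a stdlib dataclass is used here only to keep the example dependency-free. All field names and the range check are assumptions for illustration, not the actual contents of classes.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ingredient:
    """Hypothetical stand-in for the Ingredient model in classes.py."""

    inci_name: str
    cas: Optional[str] = None
    quantity_pct: float = 0.0      # concentration in the finished product, % w/w
    noael: Optional[float] = None  # mg/kg bw/day, if a value was found
    sed: Optional[float] = None    # mg/kg bw/day, filled in by the calculation step
    mos: Optional[float] = None    # unitless ratio NOAEL / SED

    def __post_init__(self) -> None:
        # Basic validation that pydantic would express as a field constraint.
        if not 0.0 <= self.quantity_pct <= 100.0:
            raise ValueError(f"quantity_pct out of range: {self.quantity_pct}")


# Example record; the toxicity fields stay empty until enrichment runs.
glycerin = Ingredient(inci_name="GLYCERIN", cas="56-81-5", quantity_pct=5.0)
```

Keeping the toxicity fields optional lets a record exist before the enrichment and calculation steps have run.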

Type Enumerations (types_enum.py)

Bilingual (EN/IT) enums for:

  • CosmeticType: 100+ product types (foundations, lipsticks, skincare, etc.)
  • PhysicalForm: Liquid, semi-solid, solid, aerosol, hybrid forms
  • NormalUser: Adult/Child
  • PlaceApplication: Face, etc.
  • RoutesExposure: Dermal, Ocular, Oral
  • NanoRoutes: Same routes as RoutesExposure, applied to nanomaterials

2. External Data Sources

COSING Database (scraper_cosing.py)

EU cosmetic ingredients database

  • Search by INCI name, CAS number, or EC number
  • Extract: substance ID, CAS/EC numbers, restrictions, SCCS opinions
  • Handle "identified ingredients" recursively
  • Functions: cosing_search(), clean_cosing(), parse_cas_numbers()

ECHA Database (echaFind.py, echaProcess.py)

European Chemicals Agency dossiers

  • Search: Find dossiers by CAS/substance name (echaFind.py:44)
  • Extract: Toxicity data (NOAEL, LD50) from HTML pages
  • Process: Convert HTML → Markdown → JSON → DataFrame
  • Scraping Types: RepeatedDose (NOAEL), AcuteToxicity (LD50)
  • Local caching: In-memory DuckDB database for scraped data
  • Functions: search_dossier(), echaExtract(), echa_noael_ld50()

PubChem (pubchem.py)

Chemical properties used for the DAP (Dermal Absorption Percentage) calculation

  • Properties: LogP, Molecular Weight, TPSA, Melting Point, pH, Dissociation Constants
  • Uses pubchempy + custom certificate handling
  • Function: pubchem_dap(cas)

QUACKO/Find Module (find.py)

Unified search interface for ECHA

  • Search by CAS, EC, or substance name
  • Extract multiple sections (ToxSummary, AcuteToxicity, RepeatedDose, GeneticToxicity, physical properties)
  • Support for local HTML files
  • Functions: search_dossier(), get_section_links_from_index()

3. Database Layer

MongoDB (mongo_functions.py)

  • Database: toxinfo
  • Collection: toxinfo (ingredient data from COSING/ECHA)
  • Functions:
    • connect(user, password, database) - MongoDB Atlas connection
    • value_search(db, value, mode) - Search by INCI, CAS, EC, chemical name
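
A rough sketch of the two pieces these functions need: an Atlas connection string and a query filter chosen by search mode. The helper names, the cluster host, and the document field names below are all hypothetical; the real schema of the toxinfo collection is not described here. The sketch only builds the URI and filter, so it runs without pymongo or a live cluster.

```python
from urllib.parse import quote_plus

# Hypothetical mapping from search mode to document field; the real
# field names in the toxinfo collection may differ.
_MODE_TO_FIELD = {
    "inci": "inciName",
    "cas": "casNo",
    "ec": "ecNo",
    "name": "chemicalName",
}


def build_atlas_uri(user: str, password: str, host: str, database: str = "toxinfo") -> str:
    """Build a mongodb+srv URI, percent-encoding the credentials."""
    return f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"


def build_filter(value: str, mode: str) -> dict:
    """Return a MongoDB filter document for an exact match in the given mode."""
    try:
        field = _MODE_TO_FIELD[mode.lower()]
    except KeyError:
        raise ValueError(f"Unknown search mode: {mode!r}") from None
    return {field: value}
```

With pymongo installed, the URI would be passed to MongoClient and the filter to the collection's find_one() or find(). Percent-encoding matters because passwords containing characters such as @ or : would otherwise break URI parsing.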

4. PDF Generation (html_to_pdf.py, pdf_extraction.py)

  • Playwright-based: Headless browser for HTML → PDF
  • Dynamic headers: Inject substance info, ECHA logos
  • Cleanup: Remove empty sections, fix page breaks
  • Batch processing: search_generate_pdfs() for multiple pages
  • Output: Structured folders by CAS/EC/RML ID

Data Flow

  1. Input: Product formulation (INCI names, quantities)
  2. Enrichment:
    • Search COSING for ingredient info
    • Query MongoDB for cached data
    • Fetch PubChem for chemical properties
    • Extract ECHA toxicity data (NOAEL/LD50)
  3. Calculation:
    • SED (Systemic Exposure Dose)
    • MOS (Margin of Safety)
    • Daily exposure values
  4. Output: PIF document (likely PDF/HTML format)
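
The calculation step can be illustrated with the standard relations from the SCCS Notes of Guidance: SED = A × C/100 × DAp/100, where A is the daily exposure to the product per kg body weight, C the ingredient concentration in percent, and DAp the dermal absorption in percent; MOS = NOAEL / SED. The function and parameter names below are hypothetical, and the project's own calculation may differ in detail.

```python
def systemic_exposure_dose(daily_exposure_mg_kg: float,
                           concentration_pct: float,
                           dermal_absorption_pct: float) -> float:
    """SED in mg/kg bw/day from product exposure, concentration and absorption."""
    return (daily_exposure_mg_kg
            * (concentration_pct / 100.0)
            * (dermal_absorption_pct / 100.0))


def margin_of_safety(noael_mg_kg: float, sed_mg_kg: float) -> float:
    """MOS as the unitless ratio NOAEL / SED."""
    if sed_mg_kg <= 0:
        raise ValueError("SED must be positive")
    return noael_mg_kg / sed_mg_kg


# Illustrative numbers only, not taken from any real dossier.
sed = systemic_exposure_dose(daily_exposure_mg_kg=100.0,
                             concentration_pct=2.0,
                             dermal_absorption_pct=50.0)
mos = margin_of_safety(noael_mg_kg=200.0, sed_mg_kg=sed)
```

SCCS guidance generally treats a MOS of at least 100 as acceptable, so a check against that threshold is the natural place for the PIF's warnings to be raised.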

Key Dependencies

  • streamlit - Frontend
  • pydantic - Data validation
  • pymongo - MongoDB client
  • requests - HTTP requests
  • beautifulsoup4 - HTML parsing
  • playwright - PDF generation
  • pubchempy - PubChem API
  • pandas - Data processing
  • duckdb - Local caching

Important Notes

CAS Number Handling

  • A CAS field can contain several numbers joined by special separators (/, ;, ,, --)
  • Parser handles parenthetical info and invalid values
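
A minimal sketch of this kind of parsing, assuming the separators join several CAS numbers in one field. It is not the project's parse_cas_numbers(); the helper names are hypothetical. Invalid values are filtered with the standard CAS check-digit rule: the digits before the check digit are weighted 1, 2, 3, ... from right to left, and the sum modulo 10 must equal the check digit.

```python
import re

# Separators named in the note above: "/", ";", "," and "--".
_SEPARATORS = re.compile(r"\s*(?:--|[/;,])\s*")
_PARENTHETICAL = re.compile(r"\([^)]*\)")
_CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas: str) -> bool:
    """Check the CAS format and its check digit."""
    match = _CAS_PATTERN.match(cas)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight digits 1, 2, 3, ... from right to left.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def split_cas_numbers(raw: str) -> list[str]:
    """Drop parenthetical notes, split on separators, keep valid CAS numbers."""
    cleaned = _PARENTHETICAL.sub("", raw)
    candidates = [part.strip() for part in _SEPARATORS.split(cleaned)]
    return [c for c in candidates if is_valid_cas(c)]
```

For example, split_cas_numbers("56-81-5 / 7732-18-5 (water); -") keeps the two valid numbers and drops the stray dash.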

ECHA Scraping

  • Logging: All operations logged to echa.log
  • Dossier Status: Active preferred, falls back to Inactive
  • Scraping Modes:
    • local_search=True: Check local cache first
    • local_only=True: Only use cached data
  • Multi-substance: echaExtract_multi() for batch processing
  • Filtering: Can filter by route (oral/dermal/inhalation) and dose descriptor
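
The two scraping modes can be sketched as a cache-first lookup. This is an assumption about how the flags interact, not the project's code: fetch_toxicity and its scrape argument are hypothetical, and a plain dict stands in for the DuckDB cache.

```python
from typing import Callable, Optional


def fetch_toxicity(cas: str,
                   scrape: Callable[[str], dict],
                   cache: dict[str, dict],
                   local_search: bool = True,
                   local_only: bool = False) -> Optional[dict]:
    """Return toxicity data for a CAS number, honoring the cache flags."""
    # Both flags imply a cache lookup before anything else.
    if (local_search or local_only) and cas in cache:
        return cache[cas]
    # local_only forbids going to the network on a cache miss.
    if local_only:
        return None
    result = scrape(cas)
    cache[cas] = result  # store for later lookups
    return result
```

Passing the scraper as a callable keeps the lookup logic testable without network access.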

Bilingual Support

  • Enums support EN/IT via TranslatedEnum.get_translation(lang)
  • Italian is the primary language in code comments
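
One way such a bilingual enum could be built is sketched below. Only the names TranslatedEnum, get_translation(lang) and RoutesExposure come from this document; storing each value as an (English, Italian) pair and using "en"/"it" as language codes are assumptions.

```python
from enum import Enum


class TranslatedEnum(Enum):
    """Base enum whose member values are (English, Italian) pairs."""

    def get_translation(self, lang: str) -> str:
        english, italian = self.value
        translations = {"en": english, "it": italian}
        try:
            return translations[lang.lower()]
        except KeyError:
            raise ValueError(f"Unsupported language: {lang!r}") from None


class RoutesExposure(TranslatedEnum):
    DERMAL = ("Dermal", "Dermale")
    OCULAR = ("Ocular", "Oculare")
    ORAL = ("Oral", "Orale")


label_it = RoutesExposure.DERMAL.get_translation("it")
label_en = RoutesExposure.ORAL.get_translation("en")
```

Keeping both labels on the member means the UI and the generated PIF can switch language without a separate lookup table.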

Regulatory Context

  • SCCS: Scientific Committee on Consumer Safety
  • CNCP: Cosmetic Products Notification Portal (the EU portal's official abbreviation is CPNP)
  • NOAEL: No Observed Adverse Effect Level
  • SED: Systemic Exposure Dose
  • MOS: Margin of Safety
  • DAP: Dermal Absorption Percentage

TODO/Future Work

  • Relational DB implementation (mentioned but not present)
  • Streamlit UI (referenced but code not in current files)
  • Main entry point (pif-compiler script in pyproject.toml)
  • LLM approximation for exposure values (mentioned in classes.py:55-60)

Development Notes

  • Project appears to consolidate previously separate codebases
  • Heavy use of external APIs (rate limiting may apply)
  • Certificate handling needed for PubChem API
  • MongoDB credentials required for database access