adish-rmr 497dba7aab first commit: checkpoint per multi-device collab

2025-10-21 14:22:27 +02:00

7.6 KiB


PIF Compiler - Project Context

Overview

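Application that generates Product Information Files (PIF) for cosmetic products. A PIF is the regulatory dossier required for each cosmetic product placed on the EU market (Regulation (EC) No 1223/2009), and it includes the product's safety assessment.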

Development Environment

Platform: Windows
Package Manager: uv - Fast Python package installer and resolver
Python Version: 3.12+

Tech Stack

Backend: Python 3.12+
Frontend: Streamlit
Database: MongoDB (primary), potential relational DB (not yet implemented)
Package Manager: uv
Build System: hatchling

Common Commands

# Install dependencies
uv sync

# Add a new dependency
uv add <package-name>

# Run the application
uv run pif-compiler

# Activate virtual environment (if needed)
.venv\Scripts\activate  # Windows

Project Structure

pif_compiler/
├── src/pif_compiler/
│   ├── classes/           # Data models & type definitions
│   │   ├── pif_class.py   # Main PIF data model
│   │   ├── classes.py     # Supporting classes (Ingredient, ExpositionInfo, SedTable, ProdCompany)
│   │   └── types_enum.py  # Enums for cosmetic types, physical forms, exposure routes
│   │
│   └── functions/         # Core functionality modules
│       ├── scraper_cosing.py    # COSING database scraper (EU cosmetic ingredients)
│       ├── mongo_functions.py   # MongoDB connection & queries
│       ├── html_to_pdf.py       # PDF generation with Playwright
│       ├── echaFind.py          # ECHA dossier search
│       ├── echaProcess.py       # ECHA data extraction & processing
│       ├── pubchem.py           # PubChem API for chemical properties
│       ├── find.py              # Unified search interface (QUACKO/ECHA)
│       └── pdf_extraction.py    # PDF processing utilities
│
└── data/
    ├── pif_schema.json    # JSON schema for PIF structure
    └── input.json         # Example input data format

Core Functionality

1. Data Models (classes/)

PIF Class (pif_class.py)

Main data model containing:

Product information (name, type, CNCP, company)
Ingredient list with quantities
Exposure information
Safety evaluation data (SED table, warnings)
Metadata (creation date)

Supporting Classes (classes.py)

Ingredient: INCI name, CAS number, quantity, toxicity values (SED, NOAEL, MOS), PubChem data
ExpositionInfo: Application details, exposure routes, calculated daily exposure
SedTable: Safety evaluation data table
ProdCompany: Production company information

Type Enumerations (types_enum.py)

Bilingual (EN/IT) enums for:

CosmeticType: 100+ product types (foundations, lipsticks, skincare, etc.)
PhysicalForm: Liquid, semi-solid, solid, aerosol, hybrid forms
NormalUser: Adult/Child
PlaceApplication: Face, etc.
RoutesExposure: Dermal, Ocular, Oral
NanoRoutes: Same routes as RoutesExposure, applied to nanomaterials

2. External Data Sources

COSING Database (scraper_cosing.py)

EU cosmetic ingredients database

Search by INCI name, CAS number, or EC number
Extract: substance ID, CAS/EC numbers, restrictions, SCCS opinions
Handle "identified ingredients" recursively
Functions: cosing_search(), clean_cosing(), parse_cas_numbers()

ECHA Database (echaFind.py, echaProcess.py)

European Chemicals Agency dossiers

Search: Find dossiers by CAS/substance name (echaFind.py:44)
Extract: Toxicity data (NOAEL, LD50) from HTML pages
Process: Convert HTML → Markdown → JSON → DataFrame
Scraping Types: RepeatedDose (NOAEL), AcuteToxicity (LD50)
Local caching: DuckDB in-memory for scraped data
Functions: search_dossier(), echaExtract(), echa_noael_ld50()

PubChem (pubchem.py)

Chemical properties for DAP calculation

Properties: LogP, Molecular Weight, TPSA, Melting Point, pH, Dissociation Constants
Uses pubchempy + custom certificate handling
Function: pubchem_dap(cas)
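
How the project derives DAP from these properties is not documented here. As a point of reference only, the sketch below encodes the default rule commonly cited from the SCCS Notes of Guidance: assume 50% dermal absorption when no data are available, or 10% when the molecular weight exceeds 500 Da and LogP is below -1 or above 4. The function name is hypothetical and this is not necessarily what pubchem_dap() does.

```python
def default_dermal_absorption_pct(molecular_weight: float, log_p: float) -> float:
    """Hypothetical helper: default DAP (%) when no measured value exists.

    Assumption: SCCS-style default rule, not the project's own logic.
    """
    # Large molecules that are very hydrophilic or very lipophilic
    # are assumed to penetrate the skin poorly.
    if molecular_weight > 500 and (log_p < -1 or log_p > 4):
        return 10.0
    # Conservative default for everything else.
    return 50.0


if __name__ == "__main__":
    # Illustrative inputs only.
    print(default_dermal_absorption_pct(molecular_weight=620.0, log_p=5.2))
    print(default_dermal_absorption_pct(molecular_weight=92.09, log_p=-1.76))
```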

QUACKO/Find Module (find.py)

Unified search interface for ECHA

Search by CAS, EC, or substance name
Extract multiple sections (ToxSummary, AcuteToxicity, RepeatedDose, GeneticToxicity, physical properties)
Support for local HTML files
Functions: search_dossier(), get_section_links_from_index()

3. Database Layer

MongoDB (mongo_functions.py)

Database: toxinfo
Collection: toxinfo (ingredient data from COSING/ECHA)
Functions:
- connect(user, password, database) - MongoDB Atlas connection
- value_search(db, value, mode) - Search by INCI, CAS, EC, chemical name

4. PDF Generation (html_to_pdf.py, pdf_extraction.py)

Playwright-based: Headless browser for HTML → PDF
Dynamic headers: Inject substance info, ECHA logos
Cleanup: Remove empty sections, fix page breaks
Batch processing: search_generate_pdfs() for multiple pages
Output: Structured folders by CAS/EC/RML ID

Data Flow

Input: Product formulation (INCI names, quantities)
Enrichment:
- Search COSING for ingredient info
- Query MongoDB for cached data
- Fetch PubChem for chemical properties
- Extract ECHA toxicity data (NOAEL/LD50)
Calculation:
- SED (Systemic Exposure Dose)
- MOS (Margin of Safety)
- Daily exposure values
Output: PIF document (likely PDF/HTML format)
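
The calculation step can be illustrated with the formulas usually given in the SCCS Notes of Guidance: SED = daily exposure (mg/kg bw/day) x concentration (%) / 100 x DAP (%) / 100, and MOS = NOAEL / SED. This is a minimal sketch under that assumption; the function names are hypothetical and the project's own implementation may differ.

```python
def systemic_exposure_dose(daily_exposure_mg_per_kg: float,
                           concentration_pct: float,
                           dermal_absorption_pct: float) -> float:
    """SED in mg/kg bw/day for one ingredient (assumed SCCS-style formula)."""
    return (daily_exposure_mg_per_kg
            * (concentration_pct / 100)
            * (dermal_absorption_pct / 100))


def margin_of_safety(noael_mg_per_kg: float, sed_mg_per_kg: float) -> float:
    """MOS = NOAEL / SED; both values in mg/kg bw/day."""
    if sed_mg_per_kg <= 0:
        raise ValueError("SED must be positive to compute a margin of safety")
    return noael_mg_per_kg / sed_mg_per_kg


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the project.
    sed = systemic_exposure_dose(daily_exposure_mg_per_kg=100.0,
                                 concentration_pct=2.0,
                                 dermal_absorption_pct=50.0)
    print(sed, margin_of_safety(noael_mg_per_kg=150.0, sed_mg_per_kg=sed))
```

A MOS of at least 100 is the threshold conventionally used to consider an ingredient safe at the given concentration.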

Key Dependencies

streamlit - Frontend
pydantic - Data validation
pymongo - MongoDB client
requests - HTTP requests
beautifulsoup4 - HTML parsing
playwright - PDF generation
pubchempy - PubChem API
pandas - Data processing
duckdb - Local caching

Important Notes

CAS Number Handling

CAS numbers can contain special separators: "/", ";", "," and "--"
Parser handles parenthetical info and invalid values
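
A minimal sketch of this kind of parsing, assuming the parser splits on the separators listed above, drops parenthetical notes, and validates each candidate with the standard CAS check digit. The helper split_cas_numbers is hypothetical and is not the project's parse_cas_numbers().

```python
import re

# Separators named in this section: "/", ";", "," and "--".
_SEPARATORS = re.compile(r"--|[/;,]")
_PARENTHETICAL = re.compile(r"\([^)]*\)")
_CAS_SHAPE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(candidate: str) -> bool:
    """Check the CAS number shape and its check digit."""
    match = _CAS_SHAPE.match(candidate)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weights run 1, 2, 3, ... starting from the rightmost digit
    # before the check digit; the sum modulo 10 is the check digit.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def split_cas_numbers(raw: str) -> list[str]:
    """Hypothetical helper: return the valid CAS numbers in a raw field."""
    without_notes = _PARENTHETICAL.sub("", raw)
    parts = (part.strip() for part in _SEPARATORS.split(without_notes))
    return [part for part in parts if is_valid_cas(part)]
```

Matching "--" before the single-character separators keeps the single hyphens inside a CAS number intact while still splitting on the double hyphen.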

ECHA Scraping

Logging: All operations logged to echa.log
Dossier Status: Active preferred, falls back to Inactive
Scraping Modes:
- local_search=True: Check local cache first
- local_only=True: Only use cached data
Multi-substance: echaExtract_multi() for batch processing
Filtering: Can filter by route (oral/dermal/inhalation) and dose descriptor
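
The interaction of the two cache flags and the Active/Inactive fallback can be sketched as follows. This reflects one reading of the notes above; the function names, the dictionary-based cache, and the "status" key are hypothetical stand-ins for the project's DuckDB cache and dossier records.

```python
from typing import Callable, Optional


def lookup_substance(cas: str,
                     cache: dict,
                     scrape: Callable[[str], Optional[dict]],
                     *,
                     local_search: bool = True,
                     local_only: bool = False) -> Optional[dict]:
    """Resolve a substance record from the cache and/or a live scrape."""
    if local_search or local_only:
        cached = cache.get(cas)
        if cached is not None:
            return cached
    if local_only:
        # Never hit the network in local-only mode.
        return None
    record = scrape(cas)
    if record is not None:
        cache[cas] = record  # remember the result for later lookups
    return record


def pick_dossier(dossiers: list) -> Optional[dict]:
    """Prefer an Active dossier, fall back to an Inactive one."""
    for status in ("Active", "Inactive"):
        for dossier in dossiers:
            if dossier.get("status") == status:
                return dossier
    return None
```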

Bilingual Support

Enums support EN/IT via TranslatedEnum.get_translation(lang)
Italian is the primary language of the code comments
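
One way such a bilingual enum could be built is sketched below, assuming each member stores an (English, Italian) pair as its value. Only the names TranslatedEnum, get_translation and RoutesExposure come from this document; the storage scheme and the Italian labels are illustrative assumptions.

```python
from enum import Enum


class TranslatedEnum(Enum):
    """Sketch: each member's value is an (English, Italian) pair."""

    def get_translation(self, lang: str) -> str:
        english, italian = self.value
        translations = {"en": english, "it": italian}
        try:
            return translations[lang.lower()]
        except KeyError:
            raise ValueError(f"Unsupported language: {lang!r}") from None


class RoutesExposure(TranslatedEnum):
    # Illustrative labels; the project's actual members may differ.
    DERMAL = ("Dermal", "Dermale")
    OCULAR = ("Ocular", "Oculare")
    ORAL = ("Oral", "Orale")


if __name__ == "__main__":
    print(RoutesExposure.DERMAL.get_translation("it"))
```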

Regulatory Context

SCCS: Scientific Committee on Consumer Safety
CPNP: Cosmetic Products Notification Portal (written as CNCP elsewhere in this project)
NOAEL: No Observed Adverse Effect Level
SED: Systemic Exposure Dose
MOS: Margin of Safety
DAP: Dermal Absorption Percentage

TODO/Future Work

Relational DB implementation (mentioned but not present)
Streamlit UI (referenced but code not in current files)
Main entry point (pif-compiler script in pyproject.toml)
LLM approximation for exposure values (mentioned in classes.py:55-60)

Development Notes

Project appears to consolidate previously separate codebases
Heavy use of external APIs (rate limiting may apply)
Certificate handling needed for PubChem API
MongoDB credentials required for database access
