refactoring and cleaning

This commit is contained in:
adish-rmr 2026-02-08 14:31:50 +01:00
parent 2b39ba6324
commit e02aca560c
58 changed files with 1439 additions and 8610 deletions

4
.gitignore vendored

@ -205,3 +205,7 @@ cython_debug/
marimo/_static/
marimo/_lsp/
__marimo__/
# other
pdfs/

126
CLAUDE.md Normal file

@ -0,0 +1,126 @@
# PIF Compiler - Comsoguard API
## Project Overview
**PIF Compiler** (branded as **Comsoguard API**) is a cosmetic safety assessment tool that compiles Product Information Files (PIF) for cosmetic products, as required by EU regulations. It aggregates toxicological, regulatory, and chemical data from multiple external sources to evaluate ingredient safety (DAP calculations, SED/MoS computations, NOAEL extraction).
The primary language is **Italian** (variable names, comments, some log messages). Code is written in **Python 3.12**.
## Tech Stack
- **Framework**: FastAPI + Uvicorn
- **Package manager**: uv (with `pyproject.toml` + `uv.lock`)
- **Data models**: Pydantic v2
- **Databases**: MongoDB (substance data cache via `pymongo`) + PostgreSQL (product presets, search logs, compilers via `psycopg2`)
- **Web scraping**: Playwright (PDF generation, browser automation), BeautifulSoup4 (HTML parsing)
- **External APIs**: PubChem (`pubchempy` + `pubchemprops`), COSING (EU Search API), ECHA (chem.echa.europa.eu)
- **Logging**: Python `logging` with rotating file handlers (`logs/debug.log`, `logs/error.log`)
## Project Structure
```
src/pif_compiler/
├── main.py # FastAPI app, middleware, exception handlers, routers
├── __init__.py
├── api/
│ └── routes/
│ ├── api_echa.py # ECHA endpoints (single + batch search)
│ ├── api_cosing.py # COSING endpoints (single + batch search)
│ └── common.py # PDF generation, PubChem, CIR search endpoints
├── classes/
│ └── models.py # Pydantic models: Ingredient, DapInfo, CosingInfo,
│ # ToxIndicator, Toxicity, Esposition, RetentionFactors
├── functions/
│ ├── common_func.py # PDF generation with Playwright
│ ├── common_log.py # Centralized logging configuration
│ └── db_utils.py # MongoDB + PostgreSQL connection helpers
└── services/
├── srv_echa.py # ECHA scraping, HTML parsing, toxicology extraction,
│ # orchestrator (validate -> check cache -> fetch -> store)
├── srv_cosing.py # COSING API search + data cleaning
├── srv_pubchem.py # PubChem property extraction (DAP data)
└── srv_cir.py # CIR (Cosmetic Ingredient Review) database search
```
### Other directories
- `data/` - Input data files (`input.json` with sample INCI/CAS/percentage lists), old CSV data
- `logs/` - Rotating log files (debug.log, error.log) - auto-generated
- `pdfs/` - Generated PDF files from ECHA dossier pages
- `marimo/` - **Ignore this folder.** Debug/test notebooks, not part of the main application
## Architecture & Data Flow
### Core workflow (per ingredient)
1. **Input**: CAS number (and optionally INCI name + percentage)
2. **COSING** (`srv_cosing.py`): Search EU cosmetic ingredients database for regulatory restrictions, INCI match, annex references
3. **ECHA** (`srv_echa.py`): Search substance -> get dossier -> parse HTML index -> extract toxicological data (NOAEL, LD50, LOAEL) from acute & repeated dose toxicity pages
4. **PubChem** (`srv_pubchem.py`): Get molecular weight, XLogP, TPSA, melting point, dissociation constants
5. **DAP calculation** (`DapInfo` model): Dermal Absorption Percentage based on molecular properties (MW > 500, LogP, TPSA > 120, etc.)
6. **Toxicity ranking** (`Toxicity` model): Best toxicological indicator selection with priority (NOAEL > LOAEL > LD50) and safety factors
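The last two steps can be sketched as follows. The thresholds and return values here are illustrative assumptions for the rules summarized above, not the exact logic in `DapInfo` and `Toxicity`:

```python
from typing import Optional

def estimate_dap(mw: float, log_p: Optional[float], tpsa: Optional[float]) -> float:
    """Illustrative Dermal Absorption Percentage heuristic.

    Mirrors the kind of rules mentioned above (MW > 500, extreme LogP,
    TPSA > 120 suggest low dermal absorption); the exact cutoffs and the
    10%/50% defaults are assumptions.
    """
    low_absorption = (
        mw > 500
        or (log_p is not None and (log_p < -1 or log_p > 4))
        or (tpsa is not None and tpsa > 120)
    )
    return 10.0 if low_absorption else 50.0  # percent

def best_indicator(indicators: dict[str, float]) -> tuple[str, float]:
    """Pick the preferred toxicity indicator: NOAEL > LOAEL > LD50."""
    for name in ("NOAEL", "LOAEL", "LD50"):
        if name in indicators:
            return name, indicators[name]
    raise ValueError("no usable toxicity indicator")
```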
### Caching strategy
- ECHA results are cached in MongoDB (`toxinfo.substance_index` collection) keyed by `substance.rmlCas`
- The orchestrator checks local cache before making external requests
- Search history is logged to PostgreSQL (`logs.search_history` table)
## API Endpoints
All routes are under `/api/v1`:
| Method | Path | Description |
|--------|------|-------------|
| POST | `/echa/search` | Single ECHA substance search by CAS |
| POST | `/echa/batch-search` | Batch ECHA search for multiple CAS numbers |
| POST | `/cosing/search` | COSING search (by name, CAS, EC, or ID) |
| POST | `/cosing/batch-search` | Batch COSING search |
| POST | `/common/pubchem` | PubChem property lookup by CAS |
| POST | `/common/generate-pdf` | Generate PDF from URL via Playwright |
| GET | `/common/download-pdf/{name}` | Download a generated PDF |
| POST | `/common/cir-search` | CIR ingredient text search |
| GET | `/health`, `/ping` | Health check endpoints |
Docs available at `/docs` (Swagger) and `/redoc`.
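A minimal Python client for the single-search endpoint. The `{"cas": ...}` payload and the response shape are assumptions; the authoritative request/response models are defined inline in `api/routes/api_echa.py`:

```python
import requests

def search_echa(cas: str, base_url: str = "http://localhost:8000") -> dict:
    """POST a single-substance ECHA search and return the JSON body.

    The payload shape is illustrative; check the route's Pydantic model.
    """
    resp = requests.post(
        f"{base_url}/api/v1/echa/search", json={"cas": cas}, timeout=60
    )
    resp.raise_for_status()
    return resp.json()
```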
## Environment Variables
Configured via `.env` file (loaded with `python-dotenv`):
- `ADMIN_USER` - MongoDB admin username
- `ADMIN_PASSWORD` - MongoDB admin password
- `MONGO_HOST` - MongoDB host
- `MONGO_PORT` - MongoDB port
- `DATABASE_URL` - PostgreSQL connection string
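These feed the connection helpers in `functions/db_utils.py`; a minimal sketch of how the MongoDB variables might be assembled (the URI shape is an assumption, the real code may differ):

```python
# The app loads these with python-dotenv at startup, so they are
# available via os.environ by the time the connection helpers run.

def build_mongo_uri(env) -> str:
    """Assemble a MongoDB URI from the variables above (illustrative)."""
    return (
        f"mongodb://{env['ADMIN_USER']}:{env['ADMIN_PASSWORD']}"
        f"@{env.get('MONGO_HOST', 'localhost')}:{env.get('MONGO_PORT', '27017')}"
    )
```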
## Development
### Setup
```bash
uv sync # Install dependencies
playwright install # Install browser binaries for PDF generation
```
### Running the API
```bash
uv run python -m pif_compiler.main
# or
uv run uvicorn pif_compiler.main:app --reload --host 0.0.0.0 --port 8000
```
### Key conventions
- Services in `services/` handle external API calls and data extraction
- Models in `classes/models.py` use Pydantic `@model_validator` and `@classmethod` builders for construction from raw API data
- The `orchestrator` pattern (see `srv_echa.py`) handles: validate input -> check local cache -> fetch from external -> store locally -> return
- All modules use the shared logger from `common_log.get_logger()`
- API routes define Pydantic request/response models inline in each route file
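The model-building convention looks roughly like this. Field names, the row keys, and the validation rule are illustrative; see `classes/models.py` for the real models:

```python
from pydantic import BaseModel, model_validator

class ToxIndicator(BaseModel):  # hypothetical shape, for illustration
    kind: str    # "NOAEL" | "LOAEL" | "LD50"
    value: float  # mg/kg bw/day

    @model_validator(mode="after")
    def check_positive(self):
        # Pydantic v2 post-validation hook, per the convention above
        if self.value <= 0:
            raise ValueError("toxicity value must be positive")
        return self

    @classmethod
    def from_echa_row(cls, row: dict) -> "ToxIndicator":
        """Builder from a raw scraped row (keys are assumed)."""
        return cls(kind=row["Endpoint"].upper(), value=float(row["Effect level"]))
```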
### Important domain concepts
- **CAS number**: Chemical Abstracts Service identifier (e.g., "50-00-0")
- **INCI**: International Nomenclature of Cosmetic Ingredients
- **NOAEL**: No Observed Adverse Effect Level (preferred toxicity indicator)
- **LOAEL**: Lowest Observed Adverse Effect Level
- **LD50**: Lethal Dose 50%
- **DAP**: Dermal Absorption Percentage
- **SED**: Systemic Exposure Dosage
- **MoS**: Margin of Safety
- **PIF**: Product Information File (EU cosmetic regulation requirement)
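The last three concepts combine via the standard SCCS-style formulas; this is a sketch with conventional defaults, not necessarily the project's exact implementation:

```python
def compute_sed(daily_amount_mg: float, concentration_pct: float,
                dap_pct: float, body_weight_kg: float = 60.0) -> float:
    """Systemic Exposure Dosage in mg/kg bw/day.

    SED = applied amount x ingredient concentration x dermal absorption,
    normalized by body weight (60 kg is the conventional default).
    """
    return (daily_amount_mg * concentration_pct / 100 * dap_pct / 100) / body_weight_kg

def compute_mos(noael: float, sed: float) -> float:
    """Margin of Safety = NOAEL / SED; MoS >= 100 is the usual acceptance threshold."""
    return noael / sed
```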

@ -1,57 +0,0 @@
{
"$schema": "http://json-schema.org/draft-04/schema#",
"type": "object",
"properties": {
"INCI": {
"type": "array",
"items": [
{
"type": "string"
},
{
"type": "string"
}
]
},
"CAS": {
"type": "array",
"items": [
{
"type": "array",
"items": [
{
"type": "string"
},
{
"type": "string"
}
]
},
{
"type": "array",
"items": [
{
"type": "string"
}
]
}
]
},
"percentage": {
"type": "array",
"items": [
{
"type": "number"
},
{
"type": "number"
}
]
}
},
"required": [
"INCI",
"CAS",
"percentage"
]
}


@ -1,28 +0,0 @@
# ECHA Scraping Log Readme
The log file is used during scraping to keep track of the substances that have been extracted.
**Columns:**
- **casNo**: the CAS number of the substance.
- **substanceId**: the identifier of the substance in the COSING database.
- **inciName**: the INCI name of the substance.
- **scraping_AcuteToxicity**: status of the scrape of the *Acute Toxicity* page (LD50, LC50 values, etc.).
- **scraping_RepeatedDose**: status of the scrape of the *Repeated Dose* page (NOAEL, DNEL values, etc.).
- **timestamp**: when the record was logged.
**Possible values for scraping_AcuteToxicity and scraping_RepeatedDose:**
1. **no_lead_dossiers**: no active or inactive lead dossiers exist for the substance.
2. **successful_scrape**: data extracted successfully from the page.
3. **no_data_found**: a lead dossier was found, but the page does not exist or contains no data.
4. **error**: errors of various kinds.
---
I spent 20-30 minutes manually confirming the *no_data_found* and *no_lead_dossiers* results: I spot-checked that the dossiers really did not exist or that the pages really contained no data.
The first full scrape had a bug, which I later fixed; the fix allowed another 700 substances to be extracted. I do not know whether other similar bugs remain.
---
At the moment there are **68 rows in the log marked as errors.** I am still investigating, but in most cases these are errors caused by missing data on the pages: in practice, many of them are simply *no_data_found* rows mislabelled as *error*.


@ -1,38 +0,0 @@
{
"general_info": {
"exec_date": "2021-07-01",
"company": "Company Name",
"product_name": "Product Name",
"type": "pif",
"ph_form": "fisical state",
"CPNP": "CPNP number",
"prod_company": {"Company Name": "Company Name", "Address": "Company Address", "Country": "Country"}
},
"formula_table": "df_json",
"normal_user": ["italiano", "english"],
"exposition": {
"type": "type",
"place_application": "place",
"routes_exposure": "routes",
"secondary_routes": "secondary routes",
"nano_exposure": "nano exposure",
"surface_area": "surface area",
"frequency": "frequency",
"est_daily_amount": "est daily amount",
"rel_daily_amount": "rel daily amount",
"retention": 1,
"calculated_daily_exp:": "calculated daily exp",
"calculated_relative_daily_exp": "calculated relative daily exp",
"consumer_weight": "consumer weight",
"target_population": "target population"
},
"sed_formula_table": "df_json",
"sed_table": "df_json",
"toxicity_table": "df_json",
"undesired_effects": "no effets",
"description": "description",
"warnings": "warnings"
}



@ -1,211 +0,0 @@
# Refactoring Summary
## Completed: Phase 1 & 2
### Phase 1: Critical Bug Fixes ✅
**Fixed Issues:**
1. **[base_classes.py](src/pif_compiler/classes/models.py)** (now renamed to `models.py`)
- Fixed missing closing parenthesis in `StringConstraints` annotation (line 24)
- File renamed to `models.py` for clarity
2. **[pif_class.py](src/pif_compiler/classes/pif_class.py)**
- Removed unnecessary `streamlit` import
- Fixed duplicate `NormalUser` import conflict
- Fixed type annotations for optional fields (lines 33-36)
- Removed unused imports
3. **[classes/__init__.py](src/pif_compiler/classes/__init__.py)**
- Created proper module exports
- Added docstring
- Listed all available models and enums
### Phase 2: Code Organization ✅
**New Structure:**
```
src/pif_compiler/
├── classes/ # Data Models
│ ├── __init__.py # ✨ NEW: Proper exports
│ ├── models.py # ✨ RENAMED from base_classes.py
│ ├── pif_class.py # ✅ FIXED: Import conflicts
│ └── types_enum.py
├── services/ # ✨ NEW: Business Logic Layer
│ ├── __init__.py # Service exports
│ ├── echa_service.py # ECHA API (merged from find.py)
│ ├── echa_parser.py # HTML/Markdown/JSON parsing
│ ├── echa_extractor.py # High-level extraction
│ ├── cosing_service.py # COSING integration
│ ├── pubchem_service.py # PubChem integration
│ └── database_service.py # MongoDB operations
└── functions/ # Utilities & Legacy
├── _old/ # 🗄️ Deprecated files (moved here)
│ ├── echaFind.py # → Merged into echa_service.py
│ ├── find.py # → Merged into echa_service.py
│ ├── echaProcess.py # → Split into echa_parser + echa_extractor
│ ├── scraper_cosing.py # → Copied to cosing_service.py
│ ├── pubchem.py # → Copied to pubchem_service.py
│ └── mongo_functions.py # → Copied to database_service.py
├── html_to_pdf.py # PDF generation utilities
├── pdf_extraction.py # PDF processing utilities
└── resources/ # Static resources (logos, templates)
```
---
## Key Improvements
### 1. **Separation of Concerns**
- **Models** (`classes/`): Pure data structures with Pydantic validation
- **Services** (`services/`): Business logic and external API calls
- **Functions** (`functions/`): Legacy code, will be gradually migrated
### 2. **ECHA Module Consolidation**
Previously scattered across 3 files:
- `echaFind.py` (246 lines) - Old search implementation
- `find.py` (513 lines) - Better search with type hints
- `echaProcess.py` (947 lines) - Massive monolith
Now organized into 3 focused modules:
- `echa_service.py` (~513 lines) - API integration (from `find.py`)
- `echa_parser.py` (~250 lines) - Data parsing/cleaning
- `echa_extractor.py` (~350 lines) - High-level extraction logic
### 3. **Better Logging**
- Changed from module-level `logging.basicConfig()` to proper logger instances
- Each service has its own logger: `logger = logging.getLogger(__name__)`
- Prevents logging configuration conflicts
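The resulting per-module pattern, sketched:

```python
import logging

# One named logger per service module; no logging.basicConfig() at import
# time, so importing the module has no global side effects. Handlers are
# configured once, centrally.
logger = logging.getLogger(__name__)

def do_work():
    logger.debug("starting work")
```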
### 4. **Improved Imports**
Services can now be imported cleanly:
```python
# Old way
from src.func.echaFind import search_dossier
from src.func.echaProcess import echaExtract
# New way
from pif_compiler.services import search_dossier, echa_extract
```
---
## Migration Guide
### For Code Using Old Imports
**ECHA Functions:**
```python
# Before
from src.func.find import search_dossier
from src.func.echaProcess import echaExtract, echaPage_to_md, clean_json
# After
from pif_compiler.services import (
search_dossier,
echa_extract,
echa_page_to_markdown,
clean_json
)
```
**Data Models:**
```python
# Before
from classes import Ingredient, PIF
from base_classes import ExpositionInfo
# After
from pif_compiler.classes import Ingredient, PIF, ExpositionInfo
```
**COSING/PubChem:**
```python
# Before
from functions.scraper_cosing import cosing_search
from functions.pubchem import pubchem_dap
# After (when ready)
from pif_compiler.services.cosing_service import cosing_search
from pif_compiler.services.pubchem_service import pubchem_dap
```
---
## Next Steps (Phase 3 - Not Done Yet)
### Configuration Management
- [ ] Create `config.py` for MongoDB credentials, API keys
- [ ] Use environment variables (.env file)
- [ ] Separate dev/prod configurations
### Testing
- [ ] Add pytest setup
- [ ] Unit tests for models (Pydantic validation)
- [ ] Integration tests for services
- [ ] Mock external API calls
### Streamlit App
- [ ] Create `app.py` entry point
- [ ] Organize UI components
- [ ] Connect to services layer
### Database
- [ ] Document MongoDB schema
- [ ] Add migration scripts
- [ ] Consider adding SQLAlchemy for relational DB
### Documentation
- [ ] API documentation (docstrings → Sphinx)
- [ ] User guide for PIF creation workflow
- [ ] Developer setup guide
---
## Files Changed
### Modified:
- `src/pif_compiler/classes/models.py` (renamed, fixed)
- `src/pif_compiler/classes/pif_class.py` (fixed imports/types)
- `src/pif_compiler/classes/__init__.py` (new exports)
### Created:
- `src/pif_compiler/services/__init__.py`
- `src/pif_compiler/services/echa_service.py`
- `src/pif_compiler/services/echa_parser.py`
- `src/pif_compiler/services/echa_extractor.py`
- `src/pif_compiler/services/cosing_service.py`
- `src/pif_compiler/services/pubchem_service.py`
- `src/pif_compiler/services/database_service.py`
### Moved to Archive:
- `src/pif_compiler/functions/_old/echaFind.py` (merged into echa_service.py)
- `src/pif_compiler/functions/_old/find.py` (merged into echa_service.py)
- `src/pif_compiler/functions/_old/echaProcess.py` (split into echa_parser + echa_extractor)
- `src/pif_compiler/functions/_old/scraper_cosing.py` (copied to cosing_service.py)
- `src/pif_compiler/functions/_old/pubchem.py` (copied to pubchem_service.py)
- `src/pif_compiler/functions/_old/mongo_functions.py` (copied to database_service.py)
### Kept (Active):
- `src/pif_compiler/functions/html_to_pdf.py` (PDF utilities)
- `src/pif_compiler/functions/pdf_extraction.py` (PDF utilities)
- `src/pif_compiler/functions/resources/` (Static files)
---
## Benefits
**Cleaner imports** - No more relative path confusion
**Better testing** - Services can be mocked easily
**Easier debugging** - Smaller, focused modules
**Type safety** - Proper type hints throughout
**Maintainability** - Clear separation of concerns
**Backward compatible** - Old code still works
---
**Date:** 2025-01-04
**Status:** Phase 1 & 2 Complete ✅


@ -1,295 +0,0 @@
# ECHA Services Test Suite Summary
## Overview
Comprehensive test suites cover all three ECHA service modules, built bottom-up from the lowest-level module (parser) to the highest (extractor).
## Test Files Created
### 1. test_echa_parser.py (Lowest Level)
**Location**: `tests/test_echa_parser.py`
**Coverage**: Tests for HTML/Markdown/JSON processing functions
**Test Classes**:
- `TestOpenEchaPage` - HTML page opening (remote & local)
- `TestEchaPageToMarkdown` - HTML to Markdown conversion
- `TestMarkdownToJsonRaw` - Markdown to JSON conversion (skipped if markdown_to_json not installed)
- `TestNormalizeUnicodeCharacters` - Unicode normalization
- `TestCleanJson` - JSON cleaning and validation
- `TestIntegrationParser` - Full pipeline integration tests
**Total Tests**: 28 tests
- 20 tests for core parser functions
- 5 tests for markdown_to_json (conditional)
- 2 integration tests
- 1 test with known Unicode encoding issue (needs fix)
**Key Features**:
- Mocks external dependencies (requests, file I/O)
- Tests Unicode handling edge cases
- Validates data cleaning logic
- Tests comparison operator conversions (>, <, >=, <=)
**Known Issues**:
- Unicode literal encoding in test strings (line 372, 380) - use `chr()` instead of `\uXXXX`
- Missing `markdown_to_json` dependency (tests skip gracefully)
### 2. test_echa_service.py (Middle Level)
**Location**: `tests/test_echa_service.py`
**Coverage**: Tests for ECHA API interaction and search functions
**Test Classes**:
- `TestGetSubstanceByIdentifier` - Substance API search
- `TestGetDossierByRmlId` - Dossier retrieval with Active/Inactive fallback
- `TestExtractSectionLinks` - Section link extraction with validation
- `TestParseSectionsFromHtml` - HTML parsing for multiple sections
- `TestGetSectionLinksFromIndex` - Remote index.html fetching
- `TestGetSectionLinksFromFile` - Local file parsing
- `TestSearchDossier` - Main search workflow
- `TestIntegrationEchaService` - Real API integration tests (marked @pytest.mark.integration)
**Total Tests**: 36 tests
- 30 unit tests with mocked APIs
- 3 integration tests (require internet, marked for manual execution)
**Key Features**:
- Comprehensive API mocking
- Tests nested section bug fix (parent vs child section links)
- Tests URL encoding, error handling, fallback logic
- Tests local vs remote workflows
- Integration tests for real formaldehyde data
**Testing Approach**:
- Unit tests run by default (fast, no external deps)
- Integration tests require `-m integration` flag
### 3. test_echa_extractor.py (Highest Level)
**Location**: `tests/test_echa_extractor.py`
**Coverage**: Tests for high-level extraction orchestration
**Test Classes**:
- `TestSchemas` - Data schema validation
- `TestJsonToDataframe` - JSON to pandas DataFrame conversion
- `TestDfWrapper` - DataFrame metadata addition
- `TestEchaExtractLocal` - DuckDB cache querying
- `TestEchaExtract` - Main extraction workflow
- `TestIntegrationEchaExtractor` - Real data integration tests (marked @pytest.mark.integration)
**Total Tests**: 32 tests
- 28 unit tests with full mocking
- 4 integration tests (require internet)
**Key Features**:
- Tests both RepeatedDose and AcuteToxicity schemas
- Tests local cache (DuckDB) integration
- Tests key information extraction
- Tests error handling at each pipeline stage
- Tests DataFrame vs JSON output modes
- Validates metadata addition (substance, CAS, timestamps)
**Testing Strategy**:
- Mocks entire pipeline: search → parse → convert → clean → wrap
- Tests local_search and local_only modes
- Tests graceful degradation (returns key_infos on main extraction failure)
## Test Architecture
```
test_echa_parser.py     (Data Transformation)
        ▲
test_echa_service.py    (API & Search)
        ▲
test_echa_extractor.py  (Orchestration)
```
### Dependency Flow
1. **Parser** (lowest) - No dependencies on other ECHA modules
2. **Service** (middle) - Depends on Parser for some functionality
3. **Extractor** (highest) - Depends on both Service and Parser
### Mock Strategy
- **Parser**: Mocks `requests`, file I/O, `os.makedirs`
- **Service**: Mocks `requests.get` for API calls, HTML content
- **Extractor**: Mocks entire pipeline chain (search_dossier, open_echa_page, etc.)
## Running the Tests
### Run All Tests
```bash
uv run pytest tests/test_echa_*.py -v
```
### Run Specific Module
```bash
uv run pytest tests/test_echa_parser.py -v
uv run pytest tests/test_echa_service.py -v
uv run pytest tests/test_echa_extractor.py -v
```
### Run Only Unit Tests (Fast)
```bash
uv run pytest tests/test_echa_*.py -v -m "not integration"
```
### Run Integration Tests (Requires Internet)
```bash
uv run pytest tests/test_echa_*.py -v -m integration
```
### Run With Coverage
```bash
uv run pytest tests/test_echa_*.py --cov=pif_compiler.services --cov-report=html
```
## Test Coverage Summary
### Functions Tested
#### echa_parser.py (5/5 = 100%)
- ✅ `open_echa_page()` - Remote & local file opening
- ✅ `echa_page_to_markdown()` - HTML to Markdown with route formatting
- ✅ `markdown_to_json_raw()` - Markdown parsing & JSON conversion
- ✅ `normalize_unicode_characters()` - Unicode normalization
- ✅ `clean_json()` - Recursive cleaning & validation
#### echa_service.py (8/8 = 100%)
- ✅ `search_dossier()` - Main entry point with local file support
- ✅ `get_substance_by_identifier()` - Substance API search
- ✅ `get_dossier_by_rml_id()` - Dossier retrieval with fallback
- ✅ `_query_dossier_api()` - Helper for API queries
- ✅ `get_section_links_from_index()` - Remote HTML fetching
- ✅ `get_section_links_from_file()` - Local HTML parsing
- ✅ `parse_sections_from_html()` - HTML content parsing
- ✅ `extract_section_links()` - Individual section extraction with validation
#### echa_extractor.py (4/4 = 100%)
- ✅ `echa_extract()` - Main extraction function
- ✅ `echa_extract_local()` - DuckDB cache queries
- ✅ `json_to_dataframe()` - JSON to DataFrame conversion
- ✅ `df_wrapper()` - Metadata addition
**Total Functions**: 17/17 tested (100%)
## Edge Cases Covered
### Parser
- Empty/malformed HTML
- Missing sections
- Unicode encoding issues (â€, superscripts)
- Comparison operators (>, <, >=, <=)
- Nested structures
- [Empty] value filtering
- "no information available" filtering
### Service
- Substance not found
- No dossiers (active or inactive)
- Nested sections (parent without direct link)
- Input type mismatches
- Network errors
- Malformed API responses
- Local vs remote file paths
### Extractor
- Substance not found
- Missing scraping type pages
- Empty sections
- Empty cleaned JSON
- Local cache hits/misses
- Key information extraction
- DataFrame filtering (null Effect levels)
- JSON vs DataFrame output modes
## Dependencies Required
### Core Dependencies (Already in project)
- pytest
- pytest-mock
- pytest-cov
- beautifulsoup4
- pandas
- requests
- markdownify
- pydantic
### Optional Dependencies (Tests skip if missing)
- `markdown_to_json` - Required for markdown→JSON conversion tests
- `duckdb` - Required for local cache tests
- Internet connection - Required for integration tests
## Test Markers
### Custom Markers (defined in conftest.py)
- `@pytest.mark.unit` - Fast tests, no external dependencies
- `@pytest.mark.integration` - Tests requiring real APIs/internet
- `@pytest.mark.slow` - Long-running tests
- `@pytest.mark.database` - Tests requiring database
### Usage in ECHA Tests
- Unit tests: Default (run without flags)
- Integration tests: Require `-m integration`
- Skipped tests: Auto-skip if dependencies missing
## Known Issues & Improvements Needed
### 1. Unicode Test Encoding (test_echa_parser.py)
**Issue**: Lines 372 and 380 have truncated Unicode escape sequences
**Fix**: Replace `\u00c2\u00b²` with `chr(0xc2) + chr(0xb2)`
**Priority**: Medium
### 2. Missing markdown_to_json Dependency
**Issue**: Tests skip if not installed
**Fix**: Add to project dependencies or document as optional
**Priority**: Low (tests gracefully skip)
### 3. Integration Test Data
**Issue**: Real API tests may fail if ECHA structure changes
**Fix**: Add recorded fixtures for deterministic testing
**Priority**: Low
### 4. DuckDB Integration
**Issue**: test_echa_extractor local cache tests mock DuckDB
**Fix**: Create actual test database for integration testing
**Priority**: Low
## Test Statistics
| Module | Total Tests | Unit Tests | Integration Tests | Skipped (Conditional) |
|--------|-------------|------------|-------------------|-----------------------|
| echa_parser.py | 28 | 26 | 2 | 7 (markdown_to_json) |
| echa_service.py | 36 | 33 | 3 | 0 |
| echa_extractor.py | 32 | 28 | 4 | 0 |
| **TOTAL** | **96** | **87** | **9** | **7** |
## Next Steps
1. **Fix Unicode encoding** in test_echa_parser.py (lines 372, 380)
2. **Run full test suite** to verify all unit tests pass
3. **Add markdown_to_json** to dependencies if needed
4. **Run integration tests** manually to verify real API behavior
5. **Generate coverage report** to identify any untested code paths
6. **Document test patterns** for future service additions
7. **Add CI/CD integration** for automated testing
## Contributing
When adding new functions to ECHA services:
1. **Write tests first** (TDD approach)
2. **Follow existing patterns**:
- One test class per function
- Mock external dependencies
- Test happy path + edge cases
- Add integration tests for real API behavior
3. **Use appropriate markers**: `@pytest.mark.integration` for slow tests
4. **Update this document** with new test coverage
## References
- Main documentation: [docs/echa_architecture.md](echa_architecture.md)
- Test patterns: [tests/test_cosing_service.py](../tests/test_cosing_service.py)
- pytest configuration: [pytest.ini](../pytest.ini)
- Test fixtures: [tests/conftest.py](../tests/conftest.py)


@ -1,767 +0,0 @@
# Testing Guide - Theory and Best Practices
## Table of Contents
- [Introduction](#introduction)
- [Your Current Approach vs. Test-Driven Development](#your-current-approach-vs-test-driven-development)
- [The Testing Pyramid](#the-testing-pyramid)
- [Key Concepts](#key-concepts)
- [Real-World Testing Workflow](#real-world-testing-workflow)
- [Regression Testing](#regression-testing---the-killer-feature)
- [Code Coverage](#coverage---how-much-is-tested)
- [Best Practices](#best-practices-summary)
- [Practical Examples](#practical-example-your-workflow)
- [When Should You Write Tests](#when-should-you-write-tests)
- [Getting Started](#your-next-steps)
---
## Introduction
This guide explains the theory and best practices of software testing, specifically for the PIF Compiler project. It moves beyond ad-hoc testing scripts to a comprehensive, automated testing approach.
---
## Your Current Approach vs. Test-Driven Development
### What You Do Now (Ad-hoc Scripts):
```python
# test_script.py
from cosing_service import cosing_search
result = cosing_search("WATER", mode="name")
print(result) # Look at output, check if it looks right
```
**Problems:**
- ❌ Manual checking (is the output correct?)
- ❌ Not repeatable (you forget what "correct" looks like)
- ❌ Doesn't catch regressions (future changes break old code)
- ❌ No documentation (what should the function do?)
- ❌ Tedious for many functions
---
## The Testing Pyramid
```
        /\
       /  \       E2E Tests (Few)
      /----\
     /      \     Integration Tests (Some)
    /--------\
   /          \   Unit Tests (Many)
  /____________\
```
### 1. **Unit Tests** (Bottom - Most Important)
Test individual functions in isolation.
**Example:**
```python
def test_parse_cas_numbers_single():
"""Test parsing a single CAS number."""
result = parse_cas_numbers(["7732-18-5"])
assert result == ["7732-18-5"] # ← Automated check
```
**Benefits:**
- ✅ Fast (milliseconds)
- ✅ No external dependencies (no API, no database)
- ✅ Pinpoint exact problem
- ✅ Run hundreds in seconds
**When to use:**
- Testing individual functions
- Testing data parsing/validation
- Testing business logic calculations
---
### 2. **Integration Tests** (Middle)
Test multiple components working together.
**Example:**
```python
def test_full_cosing_workflow():
"""Test search + clean workflow."""
raw = cosing_search("WATER", mode="name")
clean = clean_cosing(raw)
assert "cosingUrl" in clean
```
**Benefits:**
- ✅ Tests real interactions
- ✅ Catches integration bugs
**Drawbacks:**
- ⚠️ Slower (hits real APIs)
- ⚠️ Requires internet/database
**When to use:**
- Testing workflows across multiple services
- Testing API integrations
- Testing database interactions
---
### 3. **E2E Tests** (End-to-End - Top - Fewest)
Test entire application flow (UI → Backend → Database).
**Example:**
```python
def test_create_pif_from_ui():
"""User creates PIF through Streamlit UI."""
# Click buttons, fill forms, verify PDF generated
```
**When to use:**
- Testing complete user workflows
- Smoke tests before deployment
- Critical business processes
---
## Key Concepts
### 1. **Assertions - Automated Verification**
**Old way (manual):**
```python
result = parse_cas_numbers(["7732-18-5/56-81-5"])
print(result) # You look at: ['7732-18-5', '56-81-5']
# Is this right? Maybe? You forget in 2 weeks.
```
**Test way (automated):**
```python
def test_parse_multiple_cas():
result = parse_cas_numbers(["7732-18-5/56-81-5"])
assert result == ["7732-18-5", "56-81-5"] # ← Computer checks!
# If wrong, test FAILS immediately
```
**Common Assertions:**
```python
# Equality
assert result == expected
# Truthiness
assert result is not None
assert "key" in result
# Exceptions
with pytest.raises(ValueError):
invalid_function()
# Approximate equality (for floats)
assert result == pytest.approx(3.14159, rel=1e-5)
```
---
### 2. **Mocking - Control External Dependencies**
**Problem:** Testing `cosing_search()` hits the real COSING API:
- ⚠️ Slow (network request)
- ⚠️ Unreliable (API might be down)
- ⚠️ Expensive (rate limits)
- ⚠️ Hard to test errors (how do you make API return error?)
**Solution: Mock it!**
```python
from unittest.mock import Mock, patch
@patch('cosing_service.req.post') # Replace real HTTP request
def test_search_by_name(mock_post):
# Control what the "API" returns
mock_response = Mock()
mock_response.json.return_value = {
"results": [{"metadata": {"inciName": ["WATER"]}}]
}
mock_post.return_value = mock_response
result = cosing_search("WATER", mode="name")
assert result["inciName"] == ["WATER"] # ← Test your logic, not the API
mock_post.assert_called_once() # Verify it was called
```
**Benefits:**
- ✅ Fast (no real network)
- ✅ Reliable (always works)
- ✅ Can test error cases (mock API failures)
- ✅ Isolate your code from external issues
**What to mock:**
- HTTP requests (`requests.get`, `requests.post`)
- Database calls (`db.find_one`, `db.insert`)
- File I/O (`open`, `read`, `write`)
- External APIs (COSING, ECHA, PubChem)
- Time-dependent functions (`datetime.now()`)
---
### 3. **Fixtures - Reusable Test Data**
**Without fixtures (repetitive):**
```python
def test_clean_basic():
data = {"inciName": ["WATER"], "casNo": ["7732-18-5"], ...}
result = clean_cosing(data)
assert ...
def test_clean_empty():
data = {"inciName": ["WATER"], "casNo": ["7732-18-5"], ...} # Copy-paste!
result = clean_cosing(data)
assert ...
```
**With fixtures (DRY - Don't Repeat Yourself):**
```python
# conftest.py
@pytest.fixture
def sample_cosing_response():
"""Reusable COSING response data."""
return {
"inciName": ["WATER"],
"casNo": ["7732-18-5"],
"substanceId": ["12345"]
}
# test file
def test_clean_basic(sample_cosing_response): # Auto-injected!
result = clean_cosing(sample_cosing_response)
assert result["inciName"] == "WATER"
def test_clean_empty(sample_cosing_response): # Reuse same data!
result = clean_cosing(sample_cosing_response)
assert "cosingUrl" in result
```
**Benefits:**
- ✅ No code duplication
- ✅ Centralized test data
- ✅ Easy to update (change once, affects all tests)
- ✅ Auto-cleanup (fixtures can tear down resources)
**Common fixture patterns:**
```python
# Database fixture with cleanup
@pytest.fixture
def test_db():
db = connect_to_test_db()
yield db # Test runs here
db.drop_all() # Cleanup after test
# Temporary file fixture
@pytest.fixture
def temp_file(tmp_path):
file_path = tmp_path / "test.json"
file_path.write_text('{"test": "data"}')
return file_path # Auto-cleaned by pytest
```
---
## Real-World Testing Workflow
### Scenario: You Add a New Feature
**Step 1: Write the test FIRST (TDD - Test-Driven Development):**
```python
def test_parse_cas_removes_parentheses():
"""CAS numbers with parentheses should be cleaned."""
result = parse_cas_numbers(["7732-18-5 (hydrate)"])
assert result == ["7732-18-5"]
```
**Step 2: Run test - it FAILS (expected!):**
```bash
$ uv run pytest tests/test_cosing_service.py::test_parse_cas_removes_parentheses
FAILED: AssertionError: assert ['7732-18-5 (hydrate)'] == ['7732-18-5']
```
**Step 3: Write code to make it pass:**
```python
def parse_cas_numbers(cas_string: list) -> list:
cas_string = cas_string[0]
cas_string = re.sub(r"\([^)]*\)", "", cas_string) # ← Add this
# ... rest of function
```
**Step 4: Run test again - it PASSES:**
```bash
$ uv run pytest tests/test_cosing_service.py::test_parse_cas_removes_parentheses
PASSED ✓
```
**Step 5: Refactor if needed - tests ensure you don't break anything!**
---
### TDD Cycle (Red-Green-Refactor)
```
1. RED: Write failing test
2. GREEN: Write minimal code to pass
3. REFACTOR: Improve code without breaking tests
Repeat
```
**Benefits:**
- ✅ Forces you to think about requirements first
- ✅ Prevents over-engineering
- ✅ Built-in documentation (tests show intended behavior)
- ✅ Confidence to refactor
---
## Regression Testing - The Killer Feature
**Scenario: You change code 6 months later:**
```python
# Original (working)
def parse_cas_numbers(cas_string: list) -> list:
cas_string = cas_string[0]
cas_string = re.sub(r"\([^)]*\)", "", cas_string)
cas_parts = re.split(r"[/;,]", cas_string) # Handles /, ;, ,
return [cas.strip() for cas in cas_parts]
# You "improve" it
def parse_cas_numbers(cas_string: list) -> list:
return cas_string[0].split("/") # Simpler! But...
```
**Run tests:**
```bash
$ uv run pytest
FAILED: test_multiple_cas_with_semicolon
Expected: ['7732-18-5', '56-81-5']
Got: ['7732-18-5;56-81-5'] # ← Oops, broke semicolon support!
FAILED: test_cas_with_parentheses
Expected: ['7732-18-5']
Got: ['7732-18-5 (hydrate)'] # ← Broke parentheses removal!
```
**Without tests:**
- You deploy
- Users report bugs
- You're confused what broke
- Spend hours debugging
**With tests:**
- Instant feedback
- Fix before deploying
- Save hours of debugging
---
## Coverage - How Much Is Tested?
### Running Coverage
```bash
uv run pytest --cov=src/pif_compiler --cov-report=html
```
### Sample Output
```
Name Stmts Miss Cover
--------------------------------------------------
cosing_service.py 89 5 94%
echa_service.py 156 89 43%
models.py 45 45 0%
--------------------------------------------------
TOTAL 290 139 52%
```
### Interpretation
- ✅ `cosing_service.py` - **94% covered** (great!)
- ⚠️ `echa_service.py` - **43% covered** (needs more tests)
- ❌ `models.py` - **0% covered** (no tests yet)
### Coverage Goals
| Coverage | Status | Action |
|----------|--------|--------|
| 90-100% | ✅ Excellent | Maintain |
| 70-90% | ⚠️ Good | Add edge cases |
| 50-70% | ⚠️ Acceptable | Prioritize critical paths |
| <50% | ❌ Poor | Add tests immediately |
**Target:** 80%+ for business-critical code
### HTML Coverage Report
```bash
uv run pytest --cov=src/pif_compiler --cov-report=html
# Open htmlcov/index.html in browser
```
Shows:
- Which lines are tested (green)
- Which lines are not tested (red)
- Which branches are not covered
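If you run coverage often, the flags can live in `pyproject.toml` instead of being retyped (a sketch assuming `pytest-cov` and standard `coverage.py` settings; adjust the path and threshold to taste):

```toml
[tool.pytest.ini_options]
addopts = "--cov=src/pif_compiler --cov-report=term-missing"

[tool.coverage.report]
# Fail the run when total coverage drops below the 80% target above
fail_under = 80
```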
---
## Best Practices Summary
### ✅ DO:
1. **Write tests for all business logic**
```python
# YES: Test calculations
def test_sed_calculation():
ingredient = Ingredient(quantity=10.0, dap=0.5)
assert ingredient.calculate_sed() == 5.0
```
2. **Mock external dependencies**
```python
# YES: Mock API calls
@patch('cosing_service.req.post')
def test_search(mock_post):
mock_post.return_value.json.return_value = {...}
```
3. **Test edge cases**
```python
# YES: Test edge cases
def test_parse_empty_cas():
assert parse_cas_numbers([""]) == []
def test_parse_invalid_cas():
with pytest.raises(ValueError):
parse_cas_numbers(["abc-def-ghi"])
```
4. **Keep tests simple**
```python
# YES: One test = one thing
def test_cas_removes_whitespace():
assert parse_cas_numbers([" 123-45-6 "]) == ["123-45-6"]
# NO: Testing multiple things
def test_cas_everything():
assert parse_cas_numbers([" 123-45-6 "]) == ["123-45-6"]
assert parse_cas_numbers(["123-45-6/789-01-2"]) == [...]
# Too much in one test!
```
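When one behavior genuinely needs several inputs, `pytest.mark.parametrize` keeps the one-test-one-thing rule without copy-paste. A sketch using a minimal stand-in for the `parse_cas_numbers` logic described in this guide:

```python
import re
import pytest

def parse_cas_numbers(cas_list: list) -> list:
    # Minimal stand-in matching the behavior described in this guide
    cas = re.sub(r"\([^)]*\)", "", cas_list[0])
    return [part.strip() for part in re.split(r"[/;,]", cas) if part.strip()]

@pytest.mark.parametrize("raw, expected", [
    ([" 123-45-6 "], ["123-45-6"]),                     # whitespace
    (["7732-18-5/56-81-5"], ["7732-18-5", "56-81-5"]),  # slash-separated
    (["7732-18-5 (hydrate)"], ["7732-18-5"]),           # parenthetical info
])
def test_parse_cas(raw, expected):
    assert parse_cas_numbers(raw) == expected
```

Each tuple becomes its own test case in the report, so a failure pinpoints exactly which input broke.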
5. **Run tests before committing**
```bash
git add .
uv run pytest # ← Always run first!
git commit -m "Add feature X"
```
6. **Use descriptive test names**
```python
# YES: Describes what it tests
def test_parse_cas_removes_parenthetical_info():
...
# NO: Vague
def test_cas_1():
...
```
---
### ❌ DON'T:
1. **Don't test external libraries**
```python
# NO: Testing if requests.post works
def test_requests_library():
response = requests.post("https://example.com")
assert response.status_code == 200
# YES: Test YOUR code that uses requests
@patch('requests.post')
def test_my_search_function(mock_post):
...
```
2. **Don't make tests dependent on each other**
```python
# NO: test_b depends on test_a
def test_a_creates_data():
db.insert({"id": 1, "name": "test"})
def test_b_uses_data():
data = db.find_one({"id": 1}) # Breaks if test_a fails!
# YES: Each test is independent
def test_b_uses_data():
db.insert({"id": 1, "name": "test"}) # Create own data
data = db.find_one({"id": 1})
```
3. **Don't test implementation details**
```python
# NO: Testing internal variable names
def test_internal_state():
obj = MyClass()
assert obj._internal_var == "value" # Breaks with refactoring
# YES: Test public behavior
def test_public_api():
obj = MyClass()
assert obj.get_value() == "value"
```
4. **Don't skip tests**
```python
# NO: Commenting out failing tests
# def test_broken_feature():
# assert broken_function() == "expected"
# YES: Fix the test or mark as TODO
@pytest.mark.skip(reason="Feature not implemented yet")
def test_future_feature():
...
```
---
## Practical Example: Your Workflow
### Before (Manual Script)
```python
# test_water.py
from cosing_service import cosing_search, clean_cosing
result = cosing_search("WATER", "name")
print(result) # ← You manually check
clean = clean_cosing(result)
print(clean) # ← You manually check again
# Run 10 times with different inputs... tedious!
```
**Problems:**
- Manual verification
- Slow (type command, read output, verify)
- Error-prone (miss things)
- Not repeatable
---
### After (Automated Tests)
```python
# tests/test_cosing_service.py
def test_search_and_clean_water():
"""Water should be searchable and cleanable."""
result = cosing_search("WATER", "name")
assert result is not None
assert "inciName" in result
clean = clean_cosing(result)
assert clean["inciName"] == "WATER"
assert "cosingUrl" in clean
# Run ONCE: pytest
# It checks everything automatically!
```
**Run all 25 tests:**
```bash
$ uv run pytest
tests/test_cosing_service.py::TestParseCasNumbers::test_single_cas_number PASSED
tests/test_cosing_service.py::TestParseCasNumbers::test_multiple_cas_with_slash PASSED
...
======================== 25 passed in 0.5s ========================
```
**Benefits:**
- ✅ All pass? Safe to deploy!
- ❌ One fails? Fix before deploying!
- ⏱️ 25 tests in 0.5 seconds vs. manual testing for 30 minutes
---
## When Should You Write Tests?
### Always Test:
✅ **Business logic** (calculations, data processing)
```python
# YES
def test_calculate_sed():
assert calculate_sed(quantity=10, dap=0.5) == 5.0
```
✅ **Data validation** (Pydantic models)
```python
# YES
def test_ingredient_validates_cas_format():
with pytest.raises(ValidationError):
Ingredient(cas="invalid", quantity=10.0)
```
✅ **API integrations** (with mocks)
```python
# YES
@patch('requests.post')
def test_cosing_search(mock_post):
...
```
✅ **Bug fixes** (write test first, then fix)
```python
# YES
def test_bug_123_empty_cas_crash():
"""Regression test for bug #123."""
result = parse_cas_numbers([]) # Used to crash
assert result == []
```
---
### Sometimes Test:
⚠️ **UI code** (harder to test, less critical)
```python
# Streamlit UI tests are complex, lower priority
```
⚠️ **Configuration** (usually simple)
```python
# Config loading is straightforward, test if complex logic
```
---
### Don't Test:
❌ **Third-party libraries** (they have their own tests)
```python
# NO: Testing if pandas works
def test_pandas_dataframe():
df = pd.DataFrame({"a": [1, 2, 3]})
assert len(df) == 3 # Pandas team already tested this!
```
❌ **Trivial code**
```python
# NO: Testing simple getters/setters
class MyClass:
def get_name(self):
return self.name # Too simple to test
```
---
## Your Next Steps
### 1. Install Pytest
```bash
cd c:\Users\adish\Projects\pif_compiler
uv add --dev pytest pytest-cov pytest-mock
```
### 2. Run the COSING Tests
```bash
# Run all tests
uv run pytest
# Run with verbose output
uv run pytest -v
# Run specific test file
uv run pytest tests/test_cosing_service.py
# Run specific test
uv run pytest tests/test_cosing_service.py::TestParseCasNumbers::test_single_cas_number
```
### 3. See Coverage
```bash
# Terminal report
uv run pytest --cov=src/pif_compiler/services/cosing_service
# HTML report (more detailed)
uv run pytest --cov=src/pif_compiler --cov-report=html
# Open htmlcov/index.html in browser
```
### 4. Start Writing Tests for New Code
Follow the TDD cycle:
1. **Red**: Write failing test
2. **Green**: Write minimal code to pass
3. **Refactor**: Improve code
4. Repeat!
---
## Additional Resources
### Pytest Documentation
- [Official Pytest Docs](https://docs.pytest.org/)
- [Pytest Fixtures](https://docs.pytest.org/en/stable/fixture.html)
- [Pytest Mocking](https://docs.pytest.org/en/stable/monkeypatch.html)
### Testing Philosophy
- [Test-Driven Development (TDD)](https://www.freecodecamp.org/news/test-driven-development-what-it-is-and-what-it-is-not-41fa6bca02a2/)
- [Testing Best Practices](https://testautomationuniversity.com/)
- [The Testing Pyramid](https://martinfowler.com/articles/practical-test-pyramid.html)
### PIF Compiler Specific
- [tests/README.md](../tests/README.md) - Test suite documentation
- [tests/RUN_TESTS.md](../tests/RUN_TESTS.md) - Quick start guide
- [REFACTORING.md](../REFACTORING.md) - Code organization changes
---
## Summary
**Testing transforms your development workflow:**
| Without Tests | With Tests |
|---------------|------------|
| Manual verification | Automated checks |
| Slow feedback | Instant feedback |
| Fear of breaking things | Confidence to refactor |
| Undocumented behavior | Tests as documentation |
| Debug for hours | Pinpoint issues immediately |
**Start small:**
1. Write tests for one service (✅ COSING done!)
2. Add tests for new features
3. Fix bugs with tests first
4. Gradually increase coverage
**The investment pays off:**
- Fewer bugs in production
- Faster development (less debugging)
- Better code design
- Easier collaboration
- Peace of mind 😌
---
*Last updated: 2025-01-04*


@@ -1,6 +0,0 @@
# User Journey
1) User logs in or signs up
- For this we will use Streamlit's built-in authentication component, backed by a Supabase DB (work in progress)
2) Open a recent project or create a new one:
- This is where we open an existing project file with all the specifics, or create a new one

BIN
main.db

Binary file not shown.



@@ -1,158 +0,0 @@
import marimo
__generated_with = "0.16.5"
app = marimo.App(width="medium")
@app.cell
def _():
import marimo as mo
import duckdb
return duckdb, mo
@app.cell
def _(duckdb):
con = duckdb.connect('main.db')
return (con,)
@app.cell
def _(con, mo):
_df = mo.sql(
f"""
--CREATE SEQUENCE seq_clienti START 1;
--CREATE SEQUENCE seq_compilatori START 1;
--CREATE SEQUENCE seq_tipi_prodotti START 1;
--CREATE SEQUENCE seq_stati_ordini START 1;
--CREATE SEQUENCE seq_ordini START 1;
""",
engine=con
)
return
@app.cell
def _(con, mo):
_df = mo.sql(
f"""
CREATE OR REPLACE TABLE clienti (
nome_cliente VARCHAR UNIQUE,
id_cliente INTEGER PRIMARY KEY DEFAULT NEXTVAL('seq_clienti')
)
""",
engine=con
)
return
@app.cell
def _(con, mo):
_df = mo.sql(
f"""
CREATE OR REPLACE TABLE compilatori (
nome_compilatore VARCHAR UNIQUE,
id_compilatore INTEGER PRIMARY KEY DEFAULT NEXTVAL('seq_compilatori')
)
""",
engine=con
)
return
@app.cell
def _(con, mo):
_df = mo.sql(
f"""
CREATE OR REPLACE TABLE tipi_prodotti (
nome_tipo VARCHAR UNIQUE,
id_tipo INTEGER PRIMARY KEY DEFAULT NEXTVAL('seq_tipi_prodotti'),
luogo_applicazione VARCHAR,
espo_primaria VARCHAR,
espo_secondaria VARCHAR,
espo_nano VARCHAR,
supericie_cm2 INT,
frequenza INT,
qty_daily_stimata INT,
qty_daily_relativa INT,
ritenzione FLOAT,
espo_daily_calcolata INT,
espo_daily_relativa_calcolata INT,
peso INT,
target VARCHAR
)
""",
engine=con
)
return
@app.cell
def _(con, mo):
_df = mo.sql(
f"""
CREATE OR REPLACE TABLE stati_ordini (
id_stato INTEGER PRIMARY KEY DEFAULT NEXTVAL('seq_stati_ordini'),
nome_stato VARCHAR
)
""",
engine=con
)
return (stati_ordini,)
@app.cell
def _(con, mo):
_df = mo.sql(
f"""
CREATE OR REPLACE TABLE ordini (
id_ordine INTEGER PRIMARY KEY DEFAULT NEXTVAL('seq_ordini'),
id_cliente INTEGER,
id_compilatore INTEGER,
id_tipo_prodotto INTEGER,
uuid_ordine VARCHAR NOT NULL,
uuid_progetto VARCHAR,
data_ordine DATETIME NOT NULL,
stato_ordine INTEGER DEFAULT 0,
note VARCHAR,
FOREIGN KEY (id_cliente) REFERENCES clienti(id_cliente),
FOREIGN KEY (id_compilatore) REFERENCES compilatori(id_compilatore),
FOREIGN KEY (id_tipo_prodotto) REFERENCES tipi_prodotti(id_tipo),
FOREIGN KEY (stato_ordine) REFERENCES stati_ordini(id_stato)
)
""",
engine=con
)
return
@app.cell
def _(con, mo):
_df = mo.sql(
f"""
""",
engine=con
)
return
@app.cell
def _(con, mo, stati_ordini):
_df = mo.sql(
f"""
INSERT INTO stati_ordini (nome_stato) VALUES (
'Ordine registrato'
)
""",
engine=con
)
return
if __name__ == "__main__":
app.run()

marimo/parsing_echa.py Normal file

@@ -0,0 +1,194 @@
import marimo
__generated_with = "0.16.5"
app = marimo.App(width="medium")
@app.cell
def _():
import marimo as mo
return
@app.cell
def _():
from pif_compiler.services.srv_echa import orchestrator
return (orchestrator,)
@app.cell
def _(orchestrator):
result = orchestrator("57-55-6")
return (result,)
@app.cell
def _(result):
test = result['repeated_dose_toxicity']
return (test,)
@app.cell
def _(result):
acute = result['acute_toxicity']
acute
return (acute,)
@app.cell
def _():
from pif_compiler.services.srv_echa import extract_levels, at_extractor, rdt_extractor
return at_extractor, extract_levels, rdt_extractor
@app.cell
def _(acute, at_extractor, extract_levels):
at = extract_levels(acute, at_extractor)
return (at,)
@app.cell
def _(at):
at
return
@app.cell
def _(extract_levels, rdt_extractor, test):
rdt = extract_levels(test, rdt_extractor)
return (rdt,)
@app.cell
def _(rdt):
rdt
return
@app.cell
def _():
from pydantic import BaseModel
from typing import Optional
return BaseModel, Optional
@app.cell
def _(BaseModel, Optional):
class ToxIndicator(BaseModel):
indicator : str
value : int
unit : str
route : str
toxicity_type : Optional[str] = None
ref : Optional[str] = None
@property
def priority_rank(self):
"""Returns the numerical priority based on the toxicological indicator."""
mapping = {
'LD50': 1,
'DL50': 1,
'NOAEC': 2,
'LOAEL': 3,
'NOAEL': 4
}
return mapping.get(self.indicator, 99)
@property
def factor(self):
"""Returns the factor based on the toxicity type."""
if self.priority_rank == 1:
return 10
elif self.priority_rank == 3:
return 3
return 1
return (ToxIndicator,)
@app.cell
def _(ToxIndicator, rdt):
lista = []
for i in rdt:
tesss = rdt.get(i)
t = ToxIndicator(**tesss)
lista.append(t)
lista
return
@app.cell
def _():
from pydantic import model_validator
return (model_validator,)
@app.cell
def _(
BaseModel,
Optional,
ToxIndicator,
at_extractor,
extract_levels,
model_validator,
rdt_extractor,
):
class Toxicity(BaseModel):
cas: str
indicators: list[ToxIndicator]
best_case: Optional[ToxIndicator] = None
factor: Optional[int] = None
@model_validator(mode='after')
def set_best_case(self) -> 'Toxicity':
if self.indicators:
self.best_case = max(self.indicators, key=lambda x: x.priority_rank)
self.factor = self.best_case.factor
return self
@classmethod
def from_result(cls, cas: str, result):
toxicity_types = ['repeated_dose_toxicity', 'acute_toxicity']
indicators_list = []
for tt in toxicity_types:
if tt not in result:
continue
try:
extractor = at_extractor if tt == 'acute_toxicity' else rdt_extractor
fetch = extract_levels(result[tt], extractor=extractor)
link = result.get(f"{tt}_link", "")
for key, lvl in fetch.items():
lvl['ref'] = link
elem = ToxIndicator(**lvl)
indicators_list.append(elem)
except Exception as e:
print(f"Error while extracting {tt}: {e}")
continue
return cls(
cas=cas,
indicators=indicators_list
)
return (Toxicity,)
@app.cell
def _(Toxicity, result):
tox = Toxicity.from_result("57-55-6", result)
tox
return
@app.cell
def _(result):
result
return
if __name__ == "__main__":
app.run()

marimo/test_obj.py Normal file

@@ -0,0 +1,49 @@
import marimo
__generated_with = "0.16.5"
app = marimo.App(width="medium")
@app.cell
def _():
from pif_compiler.classes.models import Esposition
return (Esposition,)
@app.cell
def _(Esposition):
it = Esposition(
preset_name="Test xzzx<xdsadsa<",
tipo_prodotto="Test Product",
luogo_applicazione="Face",
esp_normali=["DERMAL", "ASD"],
esp_secondarie=["ORAL"],
esp_nano=["NA"],
sup_esposta=500,
freq_applicazione=2,
qta_giornaliera=1.5,
ritenzione=0.1
)
return (it,)
@app.cell
def _(it):
it.save_to_postgres()
return
@app.cell
def _(Esposition):
data = Esposition.get_presets()
return (data,)
@app.cell
def _(data):
data
return
if __name__ == "__main__":
app.run()

marimo/worflow.py Normal file

@@ -0,0 +1,412 @@
import marimo
__generated_with = "0.16.5"
app = marimo.App(width="medium")
@app.cell
def _():
import marimo as mo
return (mo,)
@app.cell
def _():
from pif_compiler.functions.db_utils import db_connect
return (db_connect,)
@app.cell
def _(db_connect):
col = db_connect(collection_name="orders")
return (col,)
@app.cell
def _(col):
input = col.find_one({"client_name": "CSM Srl"})
return (input,)
@app.cell
def _(input):
input
return
@app.cell
def _():
import json
from pydantic import BaseModel, Field, field_validator, ConfigDict, model_validator
from pymongo import MongoClient
import re
from typing import List, Optional
return BaseModel, ConfigDict, Field, List, Optional, model_validator, re
app._unparsable_cell(
r"""
class CosmeticIngredient(BaseModel):
inci_name: str
cas: str = Field(..., pattern=r'^\d{2,7}-\d{2}-\d$')
colorant: bool = Field(default=False)
organic: bool = Field(default=False)
dap: dict | None = Field(default=None)
cosing: dict | None = Field(default=None)
tox_levels: dict | None = Field(default=None)
@field_validator('inci_name')
@classmethod
def make_uppercase(cls, v: str) -> str:
return v.upper()
""",
name="_"
)
@app.cell
def _(CosmeticIngredient, collection, mo):
mo.stop(True)
try:
ingredient = CosmeticIngredient(
inci_name="Glycerin",
cas="56-81-5",
percentage=5.5
)
print(f"✅ Object Created: {ingredient}")
except ValueError as e:
print(f"❌ Validation Error: {e}")
document_to_insert = ingredient.model_dump()
result = collection.insert_one(document_to_insert)
return
@app.cell
def _(
BaseModel,
ConfigDict,
CosingInfo,
Field,
List,
Optional,
mo,
model_validator,
re,
):
mo.stop(True)
class DapInfo(BaseModel):
"""Informazioni dal Dossier (es. origine, purezza)"""
origin: Optional[str] = None # es. "Synthetic", "Vegetable"
purity_percentage: Optional[float] = None
supplier_code: Optional[str] = None
class ToxInfo(BaseModel):
"""Dati Tossicologici"""
noael: Optional[float] = None # No Observed Adverse Effect Level
ld50: Optional[float] = None # Lethal Dose 50
sed: Optional[float] = None # Systemic Exposure Dosage
mos: Optional[float] = None # Margin of Safety
# --- 2. Main model ---
class CosmeticIngredient(BaseModel):
model_config = ConfigDict(validate_assignment=True) # Validate on field assignment too
# Handle multiple INCI names for the same CAS
inci_names: List[str] = Field(default_factory=list)
# CAS is a required string, but the regex depends on context
cas: str
colorant: bool = Field(default=False)
organic: bool = Field(default=False)
# Optional sub-objects
dap: Optional[DapInfo] = None
cosing: Optional[CosingInfo] = None
tox_levels: Optional[ToxInfo] = None
# --- CONDITIONAL CAS VALIDATION ---
@model_validator(mode='after')
def validate_cas_logic(self):
cas_value = self.cas
is_exempt = self.colorant or self.organic
if not cas_value or not cas_value.strip():
raise ValueError("The CAS field cannot be empty.")
if not is_exempt:
cas_regex = r'^\d{2,7}-\d{2}-\d$'
if not re.match(cas_regex, cas_value):
raise ValueError(f"Invalid CAS format ('{cas_value}') for a standard ingredient.")
# If it is a colorant/organic we accept any string (e.g. 'CI 77891' or internal codes)
return self
# --- HELPER METHOD TO ADD INCI NAMES ---
def add_inci(self, new_inci: str):
"""Adds an INCI name to the list only if it is not already present (case insensitive)."""
new_inci_upper = new_inci.upper()
# Check whether it already exists (normalizing to uppercase to be safe)
if not any(existing.upper() == new_inci_upper for existing in self.inci_names):
self.inci_names.append(new_inci_upper)
print(f"✅ INCI '{new_inci_upper}' added.")
else:
print(f" INCI '{new_inci_upper}' already present.")
return CosmeticIngredient, DapInfo
@app.cell
def _():
from pif_compiler.services.srv_pubchem import pubchem_dap
dato = pubchem_dap("56-81-5")
dato
return (dato,)
app._unparsable_cell(
r"""
molecular_weight = dato.get(\"MolecularWeight\")
log_pow = dato.get(\"XLogP\")
topological_polar_surface_area = dato.get(\"TPSA\")
melting_point = dato.get(\"MeltingPoint\")
ionization = dato.get(\"Dissociation Constants\")
""",
name="_"
)
@app.cell(hide_code=True)
def _(mo):
mo.md(
r"""
Molecular Weight > 500 Da
High degree of ionisation
Log Pow ≤ -1 or ≥ 4
Topological polar surface area > 120 Å²
Melting point > 200 °C
"""
)
return
@app.cell
def _(BaseModel, Field, Optional, model_validator):
class DapInfo(BaseModel):
cas: str
molecular_weight: Optional[float] = Field(default=None, description="In Daltons (Da)")
high_ionization: Optional[float] = Field(default=None, description="High degree of ionization")
log_pow: Optional[float] = Field(default=None, description="Partition coefficient")
tpsa: Optional[float] = Field(default=None, description="Topological polar surface area")
melting_point: Optional[float] = Field(default=None, description="In Celsius (°C)")
# --- The computed DAP value ---
# Defaults to 0.5 (50%); the validator will overwrite it
dap_value: float = 0.5
@model_validator(mode='after')
def compute_dap(self):
# List of conditions (True if the condition reduces absorption)
conditions = []
# 1. MW > 500 Da
if self.molecular_weight is not None:
conditions.append(self.molecular_weight > 500)
# 2. High ionization (a truthy value reduces absorption)
if self.high_ionization is not None:
conditions.append(bool(self.high_ionization))
# 3. Log Pow <= -1 OR >= 4
if self.log_pow is not None:
conditions.append(self.log_pow <= -1 or self.log_pow >= 4)
# 4. TPSA > 120 Å2
if self.tpsa is not None:
conditions.append(self.tpsa > 120)
# 5. Melting Point > 200°C
if self.melting_point is not None:
conditions.append(self.melting_point > 200)
# FINAL LOGIC:
# If at least one "safe" condition holds (True), the DAP is 0.1
if any(conditions):
self.dap_value = 0.1
else:
self.dap_value = 0.5
return self
return (DapInfo,)
@app.cell
def _(DapInfo, dato, re):
desired_keys = ['CAS', 'MolecularWeight', 'XLogP', 'TPSA', 'Melting Point', 'Dissociation Constants']
actual_keys = [key for key in dato.keys() if key in desired_keys]
props = {}  # do not shadow the builtin `dict`
for key in actual_keys:
if key == 'CAS':
props['cas'] = dato[key]
if key == 'MolecularWeight':
props['molecular_weight'] = float(dato[key])
if key == 'XLogP':
props['log_pow'] = float(dato[key])
if key == 'TPSA':
props['tpsa'] = float(dato[key])
if key == 'Melting Point':
try:
for item in dato[key]:
if '°C' in item['Value']:
mp_value = re.findall(r"[-+]?\d*\.\d+|\d+", item['Value'])
if mp_value:
props['melting_point'] = float(mp_value[0])
except Exception:
pass
if key == 'Dissociation Constants':
try:
for item in dato[key]:
if 'pKa' in item['Value']:
pk_value = re.findall(r"[-+]?\d*\.\d+|\d+", item['Value'])
if pk_value:
props['high_ionization'] = float(pk_value[0])
except Exception:
pass
dap_info = DapInfo(**props)
dap_info
return
@app.cell
def _():
from pif_compiler.services.srv_cosing import cosing_search, parse_cas_numbers, clean_cosing, identified_ingredients
return clean_cosing, cosing_search
@app.cell
def _(clean_cosing, cosing_search):
raw_cosing = cosing_search("72-48-0", 'cas')
cleaned_cosing = clean_cosing(raw_cosing)
cleaned_cosing
return cleaned_cosing, raw_cosing
@app.cell
def _(mo):
mo.md(
r"""
otherRestrictions
refNo
annexNo
casNo
functionName
"""
)
return
@app.cell
def _(BaseModel, Field, List, Optional):
class CosingInfo(BaseModel):
cas : List[str] = Field(default_factory=list)
common_names : List[str] = Field(default_factory=list)
inci : List[str] = Field(default_factory=list)
annex : List[str] = Field(default_factory=list)
functionName : List[str] = Field(default_factory=list)
otherRestrictions : List[str] = Field(default_factory=list)
cosmeticRestriction : Optional[str] = None
return (CosingInfo,)
@app.cell
def _(CosingInfo):
def cosing_builder(cleaned_cosing):
cosing_keys = ['nameOfCommonIngredientsGlossary', 'casNo', 'functionName', 'annexNo', 'refNo', 'otherRestrictions', 'cosmeticRestriction', 'inciName']
keys = [k for k in cleaned_cosing.keys() if k in cosing_keys]
cosing_dict = {}
for k in keys:
if k == 'nameOfCommonIngredientsGlossary':
names = []
for name in cleaned_cosing[k]:
names.append(name)
cosing_dict['common_names'] = names
if k == 'inciName':
inci = []
for inc in cleaned_cosing[k]:
inci.append(inc)
cosing_dict['inci'] = inci
if k == 'casNo':
cas_list = []
for casNo in cleaned_cosing[k]:
cas_list.append(casNo)
cosing_dict['cas'] = cas_list
if k == 'functionName':
functions = []
for func in cleaned_cosing[k]:
functions.append(func)
cosing_dict['functionName'] = functions
if k == 'annexNo':
annexes = []
i = 0
for ann in cleaned_cosing[k]:
restriction = ann + ' / ' + cleaned_cosing['refNo'][i]
annexes.append(restriction)
i = i+1
cosing_dict['annex'] = annexes
if k == 'otherRestrictions':
other_restrictions = []
for ores in cleaned_cosing[k]:
other_restrictions.append(ores)
cosing_dict['otherRestrictions'] = other_restrictions
if k == 'cosmeticRestriction':
cosing_dict['cosmeticRestriction'] = cleaned_cosing[k]
test_cosing = CosingInfo(
**cosing_dict
)
return test_cosing
return (cosing_builder,)
@app.cell
def _(cleaned_cosing, cosing_builder):
id = cleaned_cosing['identifiedIngredient']
if id:
for e in id:
obj = cosing_builder(e)
obj
return
@app.cell
def _(raw_cosing):
raw_cosing
return
@app.cell
def _():
return
if __name__ == "__main__":
app.run()


@@ -1,245 +0,0 @@
import requests
import urllib.parse
import re as standardre
import logging
import json
from datetime import datetime
from bs4 import BeautifulSoup
# Logging settings
logging.basicConfig(
format="{asctime} - {levelname} - {message}",
style="{",
datefmt="%Y-%m-%d %H:%M",
filename="echa.log",
encoding="utf-8",
filemode="a",
level=logging.INFO,
)
# Unused helper function
def getCas(substance):
results = {}
req_0 = requests.get(
"https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText="
+ urllib.parse.quote(substance)
)
req_0_json = req_0.json()
try:
rmlId = req_0_json["items"][0]["substanceIndex"]["rmlId"]
rmlName = req_0_json["items"][0]["substanceIndex"]["rmlName"]
rmlCas = req_0_json["items"][0]["substanceIndex"]["rmlCas"]
results["rmlId"] = rmlId
results["rmlName"] = rmlName
results["rmlCas"] = rmlCas
except Exception:
return False
return results
# Searches for a dossier given a CAS, a substance name, or an EC number as input
def search_dossier(substance, input_type='rmlCas'):
results = {}
# The dictionary to return at the end
# Part one: get rmlId and rmlName
# st.code('https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText='+substance)
req_0 = requests.get(
"https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText="
+ urllib.parse.quote(substance)
)
logging.info(f'echaFind.search_dossier(). searching "{substance}"')
# First, run a search with the substance name URL-encoded via urllib
req_0_json = req_0.json()
try:
# Extract the fields we need from the response
rmlId = req_0_json["items"][0]["substanceIndex"]["rmlId"]
rmlName = req_0_json["items"][0]["substanceIndex"]["rmlName"]
rmlCas = req_0_json["items"][0]["substanceIndex"]["rmlCas"]
rmlEc = req_0_json["items"][0]["substanceIndex"]["rmlEc"]
results['search_response'] = f"https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText={urllib.parse.quote(substance)}"
results["rmlId"] = rmlId
results["rmlName"] = rmlName
results["rmlCas"] = rmlCas
results["rmlEc"] = rmlEc
logging.info(
f"echaFind.search_dossier(). found substance on ECHA. rmlId: '{rmlId}', rmlName: '{rmlName}', rmlCas: '{rmlCas}'"
)
except Exception:
logging.info(
f"echaFind.search_dossier(). could not find substance for '{substance}'"
)
return False
# Update: in some cases, searching by a CAS could return a substance whose EC number happened to equal the input CAS.
# We now check that the substance found actually has a CAS equal to the one given in input.
# It is also possible to search by rmlName (substance name) or EC number (rmlEc): just specify in input_type what you are searching for.
if results[input_type] != substance:
logging.error(f'echa.echaFind.search_dossier(): results[{input_type}] "{results[input_type]}" is not equal to "{substance}".')
return f'search_error. results[{input_type}] ("{results[input_type]}") is not equal to "{substance}". Maybe you specified the wrong input_type. Check the results here: https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText={urllib.parse.quote(substance)}'
# Part two: search the ECHA site for dossiers by building a link with the rmlId obtained above.
req_1_url = (
"https://chem.echa.europa.eu/api-dossier-list/v1/dossier?pageIndex=1&pageSize=100&rmlId="
+ rmlId
+ "&registrationStatuses=Active"
) # Search the active dossiers first.
req_1 = requests.get(req_1_url)
req_1_json = req_1.json()
# If there are no active dossiers, search the inactive ones
if req_1_json["items"] == []:
logging.info(
f"echaFind.search_dossier(). could not find active dossier for '{substance}'. Proceeding to search in the unactive ones."
)
req_1_url = (
"https://chem.echa.europa.eu/api-dossier-list/v1/dossier?pageIndex=1&pageSize=100&rmlId="
+ rmlId
+ "&registrationStatuses=Inactive"
)
req_1 = requests.get(req_1_url)
req_1_json = req_1.json()
if req_1_json["items"] == []:
logging.info(
f"echaFind.search_dossier(). could not find unactive dossiers for '{substance}'"
) # Non ho trovato nè dossiers inattivi che attivi
return False
else:
logging.info(
f"echaFind.search_dossier(). found unactive dossiers for '{rmlName}'"
)
results["dossierType"] = "Inactive"
else:
logging.info(
f"echaFind.search_dossier(). found active dossiers for '{substance}'"
)
results["dossierType"] = "Active"
# These are the two values we need
assetExternalId = req_1_json["items"][0]["assetExternalId"]
# UPDATE: get the date of the dossier's last modification
try:
lastUpdateDate = req_1_json["items"][0]["lastUpdatedDate"]
datetime_object = datetime.fromisoformat(lastUpdateDate.replace('Z', '+00:00')) # fromisoformat() cannot parse a trailing 'Z' before Python 3.11
lastUpdateDate = datetime_object.date().isoformat()
results['lastUpdateDate'] = lastUpdateDate
except Exception:
logging.error("echa.echaFind(). Could not find lastUpdatedDate for the dossier")
rootKey = req_1_json["items"][0]["rootKey"]
# Third part. Obtain the assetExternalId.
# "With the assetExternalId you can reach the dossier's main page."
# "From that page, scrape the ID of the toxicological summary, IF IT EXISTS"
results["index"] = (
"https://chem.echa.europa.eu/html-pages/" + assetExternalId + "/index.html"
)
results["index_js"] = (
f"https://chem.echa.europa.eu/{rmlId}/dossier-view/{assetExternalId}"
)
req_2 = requests.get(
"https://chem.echa.europa.eu/html-pages/" + assetExternalId + "/index.html"
)
index = BeautifulSoup(req_2.text, "html.parser")
# Fourth part. Extract the toxicological summary ID from index.html
# "Only one div in all that HTML matters. BeautifulSoup struggles with deeply nested divs, so a combination of BeautifulSoup and regex is used"
div = index.find_all("div", id=["id_7_Toxicologicalinformation"])
str_div = str(div)
str_div = str_div.split("</div>")[0]
uic_found = False
# A regex to find the href we need
if standardre.search('href="([^"]+)"', str_div) is None:
logging.info(
"echaFind.search_dossier(). Could not find 'id_7_Toxicologicalinformation' in the body"
)
else:
UIC = standardre.search('href="([^"]+)"', str_div).group(1)
uic_found = True
# Acute toxicity
acute_toxicity_found = False
div_acute_toxicity = index.find_all("div", id=["id_72_AcuteToxicity"])
if div_acute_toxicity:
for div in div_acute_toxicity:
try:
a = div.find_all("a", href=True)[0]
acute_toxicity_id = standardre.search('href="([^"]+)"', str(a)).group(1)
acute_toxicity_found = True
except (IndexError, AttributeError):
logging.info(
f"echaFind.search_dossier(). No acute_toxicity_id found from index for {substance}"
)
# Repeated dose
repeated_dose_found = False
div_repeated_dose = index.find_all("div", id=["id_75_Repeateddosetoxicity"])
if div_repeated_dose:
for div in div_repeated_dose:
try:
a = div.find_all("a", href=True)[0]
repeated_dose_id = standardre.search('href="([^"]+)"', str(a)).group(1)
repeated_dose_found = True
except (IndexError, AttributeError):
logging.info(
f"echaFind.search_dossier(). No repeated_dose_id found from index for {substance}"
)
# Fifth part. Build the links to the toxicological dossier pages and return the content
if acute_toxicity_found:
acute_toxicity_link = (
"https://chem.echa.europa.eu/html-pages/"
+ assetExternalId
+ "/documents/"
+ acute_toxicity_id
+ ".html"
)
results["AcuteToxicity"] = acute_toxicity_link
results["AcuteToxicity_js"] = (
f"https://chem.echa.europa.eu/{rmlId}/dossier-view/{assetExternalId}/{acute_toxicity_id}"
)
if uic_found:
# UIC is the ID of the tox summary
final_url = (
"https://chem.echa.europa.eu/html-pages/"
+ assetExternalId
+ "/documents/"
+ UIC
+ ".html"
)
results["ToxSummary"] = final_url
results["ToxSummary_js"] = (
f"https://chem.echa.europa.eu/{rmlId}/dossier-view/{assetExternalId}/{UIC}"
)
if repeated_dose_found:
results["RepeatedDose"] = (
"https://chem.echa.europa.eu/html-pages/"
+ assetExternalId
+ "/documents/"
+ repeated_dose_id
+ ".html"
)
results["RepeatedDose_js"] = (
f"https://chem.echa.europa.eu/{rmlId}/dossier-view/{assetExternalId}/{repeated_dose_id}"
)
json_formatted_str = json.dumps(results)
logging.info(f"echaFind.search_dossier() OK. output: {json_formatted_str}")
return results
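The fourth-part scraping step above (locate a section div in the dossier index, then regex the first href out of its string form) can be sketched in isolation. The HTML below is synthetic: the section id is one of the real dossier section ids, but the "DOC-123" document id is invented for illustration.

```python
import re
from bs4 import BeautifulSoup

# Synthetic index fragment (the "DOC-123" id is invented)
html = '<div id="id_75_Repeateddosetoxicity"><a href="DOC-123">Repeated dose toxicity</a></div>'
index = BeautifulSoup(html, "html.parser")
# Same pattern as the scraper: find the div by id, stringify, regex the href
divs = index.find_all("div", id=["id_75_Repeateddosetoxicity"])
match = re.search('href="([^"]+)"', str(divs))
doc_id = match.group(1) if match else None
print(doc_id)  # DOC-123
```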


@ -1,946 +0,0 @@
from src.func.echaFind import search_dossier
from bs4 import BeautifulSoup
from markdownify import MarkdownConverter
import pandas as pd
import requests
import os
import re
import markdown_to_json
import json
import copy
import unicodedata
from datetime import datetime
import logging
import duckdb
# Logging settings
logging.basicConfig(
format="{asctime} - {levelname} - {message}",
style="{",
datefmt="%Y-%m-%d %H:%M",
filename="echa.log",
encoding="utf-8",
filemode="a",
level=logging.INFO,
)
try:
# Load the full scraping data into memory, if it exists
con = duckdb.connect()
res = con.sql("""
CREATE TABLE echa_full_scraping AS
SELECT * FROM read_csv_auto('src/data/echa_full_scraping.csv');
""")
logging.info(
f"echa.echaProcess().main: Loaded echa scraped data into duckdb memory. First CAS in the df is: {con.sql('select CAS from echa_full_scraping limit 1').fetchone()[0]}"
)
local_echa = True
except Exception:
local_echa = False
logging.error("echa.echaProcess().main: No local echa scraped data found")
# Method to find the relevant information on the ECHA site
# Works with either the substance name or the CAS number
def openEchaPage(link, local=False):
try:
if local:
page = open(link, encoding="utf8")
soup = BeautifulSoup(page, "html.parser")
else:
page = requests.get(link)
page.encoding = "utf-8"
soup = BeautifulSoup(page.text, "html.parser")
except Exception:
logging.error(
f"echa.echaProcess.openEchaPage() error. could not open: '{link}'",
exc_info=True,
)
return None
return soup
# Method to turn an ECHA page into Markdown
def echaPage_to_md(sezione, scrapingType=None, local=False, substance=None):
# sezione : the soup of the page extracted through search_dossier
# scrapingType : 'RepeatedDose' or 'AcuteToxicity'
# local : save the markdown content locally. Useful for debugging
# substance : the substance name, used to build the save path
# Create shorthand method for conversion
def md(soup, **options):
return MarkdownConverter(**options).convert_soup(soup)
output = md(sezione)
# Convert the html section into markdown, which then needs fixing.
# The markdown fixes differ slightly depending on the page type being scraped;
# exceptions are added as new substances are tested
if scrapingType == "RepeatedDose":
output = output.replace("### Oral route", "#### oral")
output = output.replace("### Dermal", "#### dermal")
output = output.replace("### Inhalation", "#### inhalation")
# Replace > and < with words, otherwise the jsonifier interprets those symbols as markup and moves the text into []
output = re.sub(r">\s+", "greater than ", output)
# Replace '<' followed by whitespace with 'less than '
output = re.sub(r"<\s+", "less than ", output)
output = re.sub(r">=\s*\n", "greater or equal than ", output)
output = re.sub(r"<=\s*\n", "less or equal than ", output)
elif scrapingType == "AcuteToxicity":
# Replace > and < with words, otherwise the jsonifier interprets those symbols as markup and moves the text into []
output = re.sub(r">\s+", "greater than ", output)
# Replace '<' followed by whitespace with 'less than '
output = re.sub(r"<\s+", "less than ", output)
output = re.sub(r">=\s*\n", "greater or equal than", output)
output = re.sub(r"<=\s*\n", "less or equal than ", output)
output = output.replace("–", "-")
output = re.sub(r"\s+mg", " mg", output)
# this part fixes units of measure that wrap to a new line, separated from their value
if local and substance:
path = f"{scrapingType}/mds/{substance}.md"
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "w") as text_file:
text_file.write(output)
return output
# Part 2 of the ECHA site processing: the markdown must be turned into a JSON
def markdown_to_json_raw(output, scrapingType=None, local=False, substance=None):
# output : the markdown
# scrapingType : 'RepeatedDose' or 'AcuteToxicity'
# substance : the substance name, used to build the save path
jsonified = markdown_to_json.jsonify(output)
dictified = json.loads(jsonified)
# Save the initial json exactly as it comes out of jsonify
if local and scrapingType and substance:
path = f"{scrapingType}/jsons/raws/{substance}_raw0.json"
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "w") as text_file:
text_file.write(jsonified)
# Now split the contents of the nested dictionaries.
for key, value in dictified.items():
if isinstance(value, dict):
for key2, value2 in value.items():
parts = value2.split("\n\n")
dictified[key][key2] = {
parts[i]: parts[i + 1]
for i in range(0, len(parts) - 1, 2)
if parts[i + 1] != "[Empty]"
}
else:
parts = value.split("\n\n")
dictified[key] = {
parts[i]: parts[i + 1]
for i in range(0, len(parts) - 1, 2)
if parts[i + 1] != "[Empty]"
}
jsonified = json.dumps(dictified)
if local and scrapingType and substance:
path = f"{scrapingType}/jsons/raws/{substance}_raw1.json"
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "w") as text_file:
text_file.write(jsonified)
return jsonified
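A minimal, self-contained sketch of the re-pairing step above: markdown_to_json.jsonify() leaves each leaf as a "\n\n"-separated string of alternating labels and values, which is split back into a dict while dropping "[Empty]" placeholders. The sample leaf data is invented.

```python
# Re-pair a "\n\n"-separated label/value string into a dict,
# skipping entries whose value is the "[Empty]" placeholder
def pair_md_leaf(value: str) -> dict:
    parts = value.split("\n\n")
    return {
        parts[i]: parts[i + 1]
        for i in range(0, len(parts) - 1, 2)
        if parts[i + 1] != "[Empty]"
    }

# Invented sample leaf, shaped like the jsonify() output
leaf = ("Dose descriptor\n\nNOAEL\n\n"
        "Effect level\n\n50 mg/kg bw/day\n\n"
        "Remarks\n\n[Empty]")
print(pair_md_leaf(leaf))
```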
# Method created by Claude to fix issues with unicode characters
def normalize_unicode_characters(text):
"""
Normalize Unicode characters, with special handling for superscript
"""
if not isinstance(text, str):
return text
# Specific replacements for common Unicode encoding issues
# and for other special exceptions
replacements = {
"\u00c2\u00b2": "²", # ² -> ²
"\u00c2\u00b3": "³", # ³ -> ³
"\u00b2": "²", # Bare superscript 2
"\u00b3": "³", # Bare superscript 3
"\n": "", # ogni tanto ci sono degli \n brutti da togliere
"greater than": ">",
"less than": "<",
"greater or equal than": ">=",
"less or equal than": "<",
# Ste due entry le ho messe io. >< creano problemi quindi le rinonimo temporaneamente
}
# Apply specific replacements first
for old, new in replacements.items():
text = text.replace(old, new)
# Normalize Unicode characters
text = unicodedata.normalize("NFKD", text)
return text
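The placeholder-restoring part of normalize_unicode_characters() can be illustrated on its own. This is a reduced sketch with only the operator entries, not the full replacement table:

```python
import unicodedata

def restore_operators(text: str) -> str:
    # Map the textual placeholders introduced during the markdown step
    # back to comparison operators, then normalize the remaining Unicode
    replacements = {
        "greater or equal than": ">=",
        "less or equal than": "<=",
        "greater than": ">",
        "less than": "<",
    }
    for old, new in replacements.items():
        text = text.replace(old, new)
    return unicodedata.normalize("NFKD", text)

print(restore_operators("NOAEL greater or equal than 50 mg/kg"))  # NOAEL >= 50 mg/kg
```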
# Another method created by Claude:
# recursively iterating a nested dictionary by hand proved too error-prone
def clean_json(data):
"""
Recursively clean JSON by removing empty/uninformative entries
and normalizing Unicode characters
"""
def is_uninformative(value, context=None):
"""
Check if a dictionary entry is considered uninformative
Args:
value: The value to check
context: Additional context about where the value is located
"""
# Specific exceptions
if context and context == "Key value for chemical safety assessment":
# Always keep all entries in this specific section
return False
uninformative_values = ["hours/week", "", None]
return value in uninformative_values or (
isinstance(value, str)
and (
value.strip() in uninformative_values
or value.lower() == "no information available"
)
)
def clean_recursive(obj, context=None):
# If it's a dictionary, process its contents
if isinstance(obj, dict):
# Create a copy to modify
cleaned = {}
for key, value in obj.items():
# Normalize key
normalized_key = normalize_unicode_characters(key)
# Set context for nested dictionaries
new_context = context or normalized_key
# Recursively clean nested structures
cleaned_value = clean_recursive(value, new_context)
# Conditions for keeping the entry
keep_entry = (
cleaned_value not in [None, {}, ""]
and not (
isinstance(cleaned_value, dict) and len(cleaned_value) == 0
)
and not is_uninformative(cleaned_value, new_context)
)
# Add to cleaned dict if conditions are met
if keep_entry:
cleaned[normalized_key] = cleaned_value
return cleaned if cleaned else None
# If it's a list, clean each item
elif isinstance(obj, list):
cleaned_list = [clean_recursive(item, context) for item in obj]
cleaned_list = [item for item in cleaned_list if item not in [None, {}, ""]]
return cleaned_list if cleaned_list else None
# For strings, normalize Unicode
elif isinstance(obj, str):
return normalize_unicode_characters(obj)
# Return as-is for other types
return obj
# Create a deep copy to avoid modifying original data
cleaned_data = clean_recursive(copy.deepcopy(data))
# This was the tricky part:
# iterating through nested dictionaries without being able to modify their structure
return cleaned_data
def json_to_dataframe(cleaned_json, scrapingType):
rows = []
schema = {
"RepeatedDose": [
"Substance",
"CAS",
"Toxicity Type",
"Route",
"Dose descriptor",
"Effect level",
"Species",
"Extraction_Timestamp",
"Endpoint conclusion",
],
"AcuteToxicity": [
"Substance",
"CAS",
"Route",
"Endpoint conclusion",
"Dose descriptor",
"Effect level",
"Extraction_Timestamp",
],
}
if scrapingType == "RepeatedDose":
# Iterate through top-level sections (excluding 'Key value for chemical safety assessment')
for toxicity_type, routes in cleaned_json.items():
if toxicity_type == "Key value for chemical safety assessment":
continue
# Iterate through routes within each toxicity type
for route, details in routes.items():
row = {"Toxicity Type": toxicity_type, "Route": route}
# Add details to the row, excluding 'Link to relevant study record(s)'
row.update(
{
k: v
for k, v in details.items()
if k != "Link to relevant study record(s)"
}
)
rows.append(row)
elif scrapingType == "AcuteToxicity":
for toxicity_type, routes in cleaned_json.items():
if (
toxicity_type == "Key value for chemical safety assessment"
or not routes
):
continue
row = {
"Route": toxicity_type.replace("Acute toxicity: via", "")
.replace("route", "")
.strip()
}
# Add details directly from the routes dictionary
row.update(
{
k: v
for k, v in routes.items()
if k != "Link to relevant study record(s)"
}
)
rows.append(row)
# Create DataFrame
df = pd.DataFrame(rows)
# Last-minute fixes, to enforce a schema
fair_columns = list(set(schema["RepeatedDose"] + schema["AcuteToxicity"]))
df = df.loc[:, df.columns.intersection(fair_columns)]
return df
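For reference, a self-contained sketch of the RepeatedDose flattening loop in json_to_dataframe(), run on an invented cleaned_json fragment: one row per (toxicity type, route) pair, skipping the summary section and the study-record links.

```python
import pandas as pd

# Invented cleaned_json fragment, shaped like the real cleaned output
cleaned = {
    "Key value for chemical safety assessment": {"skipped": "yes"},
    "Repeated dose toxicity: oral": {
        "oral": {
            "Dose descriptor": "NOAEL",
            "Effect level": "50 mg/kg bw/day",
            "Link to relevant study record(s)": "ignored",
        },
    },
}
rows = []
for toxicity_type, routes in cleaned.items():
    if toxicity_type == "Key value for chemical safety assessment":
        continue  # summary section is handled separately
    for route, details in routes.items():
        row = {"Toxicity Type": toxicity_type, "Route": route}
        row.update({k: v for k, v in details.items()
                    if k != "Link to relevant study record(s)"})
        rows.append(row)
df = pd.DataFrame(rows)
print(df.columns.tolist())
```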
def save_dataframe(df, file_path, scrapingType, schema):
"""
Save DataFrame with strict column requirements.
Args:
df (pd.DataFrame): DataFrame to potentially append
file_path (str): Path of CSV file
"""
# Mandatory columns for saved DataFrame
saved_columns = schema[scrapingType]
# Skip DataFrames that lack the "Effect level" column
if "Effect level" not in df.columns:
return
# Reindex to match the saved columns, filling missing ones with NaN
df = df.reindex(columns=saved_columns)
# Ignore rows that have no value for "Effect level"
df = df[df["Effect level"].notna()]
# Append or save the DataFrame
df.to_csv(
file_path,
mode="a" if os.path.exists(file_path) else "w",
header=not os.path.exists(file_path),
index=False,
)
def echaExtract(
substance: str,
scrapingType: str,
outputType="df",
key_infos=False,
local_search=False,
local_only = False
):
"""
Funzione principale per scrapare dal sito ECHA. Mette insieme tante funzioni diverse di ricerca, estrazione e pulizia.
Registra il logging delle operazioni.
Args:
substance (str): CAS o nome della sostanza. Vanno bene entrambi ma il CAS funziona meglio.
scrapingType (str): 'AcuteToxicity' (LD50) o 'RepeatedDose' (NOAEL)
outputType (str): 'pd.DataFrame' o 'json' (sconsigliato)
key_infos (bool): Di base True. Specifica se cercare la sezione "Description of Key Information" nei dossiers.
Certe sostanze hanno i dati inseriti a cazzo e mettono le informazioni in forma discorsiva al posto che altrove.
Output:
un dataframe o un json,
f"Non esistono lead dossiers attivi o inattivi per {substance}"
"""
# if local_search is True, try a local lookup first; otherwise go straight online
if local_search and local_echa:
result = echaExtract_local(substance, scrapingType, key_infos)
if not result.empty:
logging.info(
f"echa.echaProcess.echaExtract(): Found local data for {scrapingType}, {substance}. Returning it."
)
return result
elif result.empty:
logging.info(
f"echa.echaProcess.echaExtract(): Have not found local data for {scrapingType}, {substance}. Continuining."
)
if local_only:
logging.info(f'echa.echaProcess.echaExtract(): No data found in local-only search for {substance}, {scrapingType}')
return f'No data found in local-only search for {substance}, {scrapingType}'
try:
# search_dossier looks up the given substance on the ECHA site and returns the dossier information
links = search_dossier(substance)
if not links:
logging.info(
f'echaProcess.echaExtract(). no active or unactive lead dossiers for: "{substance}". Ending extraction.'
)
return f"Non esistono lead dossiers attivi o inattivi per {substance}"
# No active or inactive LEAD dossiers (the ones with the toxicological summaries) exist
# If they do, open the page of interest ('AcuteToxicity' or 'RepeatedDose')
if scrapingType not in links:
logging.info(
f'echaProcess.echaExtract(). No page for "{scrapingType}", "{substance}"'
)
return f'No data in "{scrapingType}", "{substance}". Page does not exist.'
soup = openEchaPage(link=links[scrapingType])
logging.info(
f"echaProcess.echaExtract(). soupped '{scrapingType}' echa page for '{substance}'"
)
# Grab the section we need
try:
sezione = soup.find(
"section",
class_="KeyValueForChemicalSafetyAssessment",
attrs={"data-cy": "das-block"},
)
except Exception:
sezione = None
logging.error(
f'echaProcess.echaExtract(). could not extract the "section" for "{scrapingType}" for "{substance}"',
exc_info=True,
)
# Current timestamp
now = datetime.now()
# UPDATE. Look for the key infos
key_infos_faund = False
if key_infos:
try:
key_infos = soup.find(
"section",
class_="KeyInformation",
attrs={"data-cy": "das-block"},
)
if key_infos:
key_infos = key_infos.find(
"div",
class_="das-field_value das-field_value_html",
)
key_infos = key_infos.text
key_infos = key_infos if key_infos.strip() != "[Empty]" else None
if key_infos:
key_infos_faund = True
logging.info(
f"echaProcess.echaExtract(). Extracted key_infos from '{scrapingType}' echa page for '{substance}': {key_infos}"
)
key_infos_df = pd.DataFrame(index=[0])
key_infos_df["key_information"] = key_infos
key_infos_df = df_wrapper(
df=key_infos_df,
rmlName=links["rmlName"],
rmlCas=links["rmlCas"],
timestamp=now.strftime("%Y-%m-%d"),
dossierType=links["dossierType"],
page=scrapingType,
linkPage=links[scrapingType],
key_infos=True,
)
else:
logging.error(
f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() ERROR. could not extract key_infos for "{scrapingType}", "{substance}"'
)
else:
logging.error(
f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() ERROR. could not extract key_infos for "{scrapingType}", "{substance}"'
)
except Exception:
logging.error(
f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() ERROR. could not extract key_infos for "{scrapingType}", "{substance}"',
exc_info=True,
)
try:
if not sezione:
logging.error(
f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() Empty section for the html > markdown conversion. No data for "{scrapingType}", "{substance}"'
)
if not key_infos_faund:
# If there is no data but the key informations exist, return those
return f'No data in "{scrapingType}", "{substance}"'
else:
return key_infos_df
# Convert the html section into markdown
output = echaPage_to_md(
sezione, scrapingType=scrapingType, substance=substance
)
logging.info(
f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() OK. created MD for "{scrapingType}", "{substance}"'
)
# In rare cases no acute toxicity or repeated dose page exists at all. In that case output will be empty and raise an error
# logging.info(output)
except Exception:
logging.error(
f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() ERROR. could not MD for "{scrapingType}", "{substance}"',
exc_info=True,
)
try:
# Turn the markdown into the first raw json
jsonified = markdown_to_json_raw(
output, scrapingType=scrapingType, substance=substance
)
logging.info(
f'echaProcess.echaExtract() > echaProcess.markdown_to_json_raw() OK. created initial json for "{scrapingType}", "{substance}"'
)
except Exception:
logging.error(
f'echaProcess.echaExtract() > echaProcess.markdown_to_json_raw() ERROR. could not create initial json for "{scrapingType}", "{substance}"',
exc_info=True,
)
json_data = json.loads(jsonified)
try:
# Second step of the json processing: clean the most nested dictionaries
cleaned_data = clean_json(json_data)
logging.info(
f'echaProcess.echaExtract() > echaProcess.clean_json() OK. cleaned the json for "{scrapingType}", "{substance}"'
)
# An empty cleaned_data means there is no data
if not cleaned_data:
logging.error(
f'echaProcess.echaExtract() > echaProcess.clean_json() Empty cleaned_json. No data for "{scrapingType}", "{substance}"'
)
if not key_infos_faund:
# If there is no data but the key informations exist, return those
return f'No data in "{scrapingType}", "{substance}"'
else:
return key_infos_df
except Exception:
logging.error(
f'echaProcess.echaExtract() > echaProcess.clean_json() ERROR. cleaning the json for "{scrapingType}", "{substance}"'
)
# If a dataframe is requested as output, build it and add a timestamp
try:
df = json_to_dataframe(cleaned_data, scrapingType)
df = df_wrapper(
df=df,
rmlName=links["rmlName"],
rmlCas=links["rmlCas"],
timestamp=now.strftime("%Y-%m-%d"),
dossierType=links["dossierType"],
page=scrapingType,
linkPage=links[scrapingType],
)
if outputType == "df":
logging.info(
f'echaProcess.echaExtract(). successfully extracted "{scrapingType}", "{substance}". Returning df'
)
# If the user wants the key infos and they were found, concatenate the two dfs
return df if not key_infos_faund else pd.concat([key_infos_df, df])
elif outputType == "json":
if key_infos_faund:
df = pd.concat([key_infos_df, df])
jayson = df.to_json(orient="records", force_ascii=False)
logging.info(
f'echaProcess.echaExtract(). successfully extracted "{scrapingType}", "{substance}". Returning json'
)
return jayson
except KeyError:
# Handle the pages that only contain "no information available"
if key_infos_faund:
return key_infos_df
json_output = list(cleaned_data[list(cleaned_data.keys())[0]].values())
if all(elem == "no information available" for elem in json_output):
logging.info(
f"echaProcess.echaExtract(). No data found for {scrapingType} for {substance}"
)
return f'No data in "{scrapingType}", "{substance}"'
else:
logging.error(
"echaProcess.json_to_dataframe(). Could not create dataframe"
)
cleaned_data["error"] = (
"Could not build the dataframe, probably there is not enough information. Returning the JSON"
)
return cleaned_data
except Exception:
logging.error(
"echaProcess.echaExtract() ERROR. Something went wrong, not quite sure what.",
exc_info=True,
)
def df_wrapper(
df, rmlName, rmlCas, timestamp, dossierType, page, linkPage, key_infos=False
):
# A simple helper to add all the required metadata to the dataframe,
# keeping echaExtract, which is already convoluted enough, a bit cleaner
df.insert(0, "Substance", rmlName)
df.insert(1, "CAS", rmlCas)
df["Extraction_Timestamp"] = timestamp
df = df.replace("\n", "", regex=True)
if not key_infos:
df = df[df["Effect level"].isnull() == False]
# Aggiungo il link del dossier e lo status
df["dossierType"] = dossierType
df["page"] = page
df["linkPage"] = linkPage
return df
def echaExtract_specific(
CAS: str,
scrapingType="RepeatedDose",
doseDescriptor="NOAEL",
route="inhalation",
local_search=False,
local_only=False
):
"""
Dato un CAS cerca di trovare il dose descriptor (di base NOAEL) per la route specificata (di base 'inhalation').
Args:
CAS (str): il cas o in alternativa la sostanza
route (str): 'inhalation', 'oral', 'dermal'. Di base 'inhalation'
scrapingType (str): la pagina su cui cercarlo
doseDescriptor (str): il tipo di valore da ricercare (NOAEL, DNEL, LD50, LC50)
"""
# Attempt the extraction
result = echaExtract(
substance=CAS,
scrapingType=scrapingType,
outputType="df",
local_search=local_search,
local_only=local_only
)
# Is the result a dataframe?
if isinstance(result, pd.DataFrame):
# If so, filter it for what we need
filtered_df = result[
(result["Route"] == route) & (result["Dose descriptor"] == doseDescriptor)
]
# Return it if it is not empty
if not filtered_df.empty:
return filtered_df
else:
return f'Non ho trovato {doseDescriptor} in {scrapingType} con route "{route}" per {CAS}'
elif isinstance(result, dict) and result.get("error"):
# A json carrying an error came back
return f'Non ho trovato {doseDescriptor} in {scrapingType} con route "{route}" per {CAS}'
elif isinstance(result, str) and result.startswith("Non esistono"):
# The search reported that no active or inactive lead dossiers exist for the substance
return result
def echa_noael_ld50(CAS: str, route="inhalation", outputType="df", local_search=False, local_only=False):
"""
Dato un CAS cerca di trovare il NOAEL per la route specificata (di base 'inhalation').
Se non esiste la pagina RepeatedDose con il NOAEL fa ritornare l'LD50 per quella route.
Args:
CAS (str): il cas o in alternativa la sostanza
route (str): 'inhalation', 'oral', 'dermal'. Di base 'inhalation'
outputType (str) = 'df', 'json'. Il tipo di output
"""
if route not in ["inhalation", "oral", "dermal"] or outputType not in [
"df",
"json",
]:
return "invalid input"
# First, try to scrape the "RepeatedDose" page
first_attempt = echaExtract_specific(
CAS=CAS,
scrapingType="RepeatedDose",
doseDescriptor="NOAEL",
route=route,
local_search=local_search,
local_only=local_only
)
if isinstance(first_attempt, pd.DataFrame):
return first_attempt
elif isinstance(first_attempt, str) and first_attempt.startswith("Non ho trovato"):
second_attempt = echaExtract_specific(
CAS=CAS,
scrapingType="AcuteToxicity",
doseDescriptor="LD50",
route=route,
local_search=local_search,
local_only=local_only
)
if isinstance(second_attempt, pd.DataFrame):
return second_attempt
elif isinstance(second_attempt, str) and second_attempt.startswith(
"Non ho trovato"
):
return second_attempt.replace("LD50", "NOAEL ed LD50")
elif isinstance(first_attempt, str) and first_attempt.startswith("Non esistono"):
return first_attempt
def echa_noael_ld50_multi(
casList: list, route="inhalation", messages=False, local_search=False, local_only=False
):
"""
Metodo abbastanza semplice. Data una lista di cas esegue echa_noael_ld50. Quindi cerca i NOAEL per la route desiderata o gli LD50 se non trova i NOAEL.
L'output è un df per le sostanze che trova e una lista di messaggi per quelle che non trova.
Args:
casList (list): la lista di CAS
route (str): 'inhalation', 'oral', 'dermal'. Di base 'inhalation'
messages (boolean) = True o False. Con True fa ritornare una lista. Il primo elemento sarà il dataframe, il secondo la lista di messaggi per le sostanze non trovate.
Di base è False e fa ritornare solo il dataframe.
"""
messages_list = []
df = pd.DataFrame()
for CAS in casList:
output = echa_noael_ld50(
CAS=CAS, route=route, outputType="df", local_search=local_search, local_only=local_only
)
if isinstance(output, str):
messages_list.append(output)
elif isinstance(output, pd.DataFrame):
df = pd.concat([df, output], ignore_index=True)
df.dropna(axis=1, how="all", inplace=True)
if messages and df.empty:
messages_list.append(
f'Non sono riuscito a trovare nessun NOAEL o LD50 per i cas per la route "{route}"'
)
return [None, messages_list]
elif messages and not df.empty:
return [df, messages_list]
elif not df.empty and not messages:
return df
elif df.empty and not messages:
return f'Non sono riuscito a trovare nessun NOAEL o LD50 per i cas per la route "{route}"'
def echaExtract_multi(
casList: list,
scrapingType="all",
local=False,
local_path=None,
log_path=None,
debug_print=False,
error=False,
error_path=None,
key_infos=False,
local_search=False,
local_only=False,
filter = None
):
"""
Given a list of CAS numbers, extract all the RepeatedDose pages, all the AcuteToxicity pages, or both
Args:
casList (list): the list of CAS numbers
scrapingType (str): 'RepeatedDose', 'AcuteToxicity', 'all'
local (boolean): if set to True, saves to disk progressively, appending each result as it is found.
Required for large-scale scraping
log_path (str): the path of the log file to fill during mass scraping
debug_print (bool): print progress while scraping, to track advancement
error (bool): return the list of errors once scraping is done
Output:
pd.DataFrame
"""
cas_len = len(casList)
i = 0
df = pd.DataFrame()
if scrapingType == "all":
scrapingTypeList = ["RepeatedDose", "AcuteToxicity"]
else:
scrapingTypeList = [scrapingType]
logging.info(
f"echa.echaExtract_multi(). Commencing mass extraction of {scrapingTypeList} for {casList}"
)
errors = []
for cas in casList:
for scrapingType in scrapingTypeList:
extraction = echaExtract(
substance=cas,
scrapingType=scrapingType,
outputType="df",
key_infos=key_infos,
local_search=local_search,
local_only=local_only
)
if isinstance(extraction, pd.DataFrame) and not extraction.empty:
status = "successful_scrape"
logging.info(
f"echa.echaExtract_multi(). Succesfully scraped {scrapingType} for {cas}"
)
df = pd.concat([df, extraction], ignore_index=True)
if local and local_path:
df.to_csv(local_path, index=False)
elif (
(isinstance(extraction, pd.DataFrame) and extraction.empty)
or (extraction is None)
or (isinstance(extraction, str) and extraction.startswith("No data"))
):
status = "no_data_found"
logging.info(
f"echa.echaExtract_multi(). Found no data for {scrapingType} for {cas}"
)
elif isinstance(extraction, dict):
if extraction["error"]:
status = "df_creation_error"
errors.append(extraction)
logging.info(
f"echa.echaExtract_multi(). Df creation error for {scrapingType} for {cas}"
)
elif isinstance(extraction, str) and extraction.startswith("Non esistono"):
status = "no_lead_dossiers"
logging.info(
f"echa.echaExtract_multi(). Found no lead dossiers for {cas}"
)
else:
status = "unknown_error"
logging.error(
f"echa.echaExtract_multi(). Unknown error for {scrapingType} for {cas}"
)
if log_path:
fill_log(cas, status, log_path, scrapingType)
if debug_print:
print(f"{i}: {cas}, {scrapingType}")
i += 1
if error and errors and error_path:
with open(error_path, "w") as json_file:
json.dump(errors, json_file, indent=4)
# This single filter step replaces four separate methods
if filter:
df = filter_dataframe_by_dict(df, filter)
return df
def fill_log(cas: str, status: str, log_path: str, scrapingType: str):
"""
Function used during mass scraping to fill a log file while substances are being extracted
"""
df = pd.read_csv(log_path)
df.loc[df["casNo"] == cas, f"scraping_{scrapingType}"] = status
df.loc[df["casNo"] == cas, "timestamp"] = datetime.now().strftime("%Y-%m-%d")
df.to_csv(log_path, index=False)
def echaExtract_local(substance: str, scrapingType: str, key_infos=False):
# Parameterized queries, so quotes in substance names cannot break the SQL
if not key_infos:
query = """
SELECT *
FROM echa_full_scraping
WHERE CAS = ? AND page = ? AND key_information IS NULL;
"""
else:
query = """
SELECT *
FROM echa_full_scraping
WHERE CAS = ? AND page = ?;
"""
result = con.execute(query, [substance, scrapingType]).df()
return result
def filter_dataframe_by_dict(df, filter_dict):
"""
Filters a Pandas DataFrame based on a dictionary.
Args:
df (pd.DataFrame): The input DataFrame.
filter_dict (dict): A dictionary where keys are column names and
values are lists of allowed values for that column.
Returns:
pd.DataFrame: A new DataFrame containing only the rows that match
the filter criteria.
"""
filter_condition = pd.Series(True, index=df.index) # Initialize with all True to start filtering
for column_name, allowed_values in filter_dict.items():
if column_name in df.columns: # Check if the column exists in the DataFrame
column_filter = df[column_name].isin(allowed_values) # Create a boolean Series for the current column
filter_condition = filter_condition & column_filter # Combine with existing condition using 'and'
else:
print(f"Warning: Column '{column_name}' not found in the DataFrame. Filter for this column will be ignored.")
filtered_df = df[filter_condition] # Apply the combined filter condition
return filtered_df
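A self-contained usage sketch of the column-mask approach used by filter_dataframe_by_dict(): one isin() mask per column, AND-ed together. The helper is re-declared here in minimal form so the snippet runs on its own, and the sample data is invented.

```python
import pandas as pd

def filter_by_dict(df: pd.DataFrame, filter_dict: dict) -> pd.DataFrame:
    # Start from an all-True mask and AND in one isin() mask per column
    mask = pd.Series(True, index=df.index)
    for column, allowed in filter_dict.items():
        if column in df.columns:
            mask &= df[column].isin(allowed)
    return df[mask]

df = pd.DataFrame({
    "Route": ["oral", "dermal", "inhalation"],
    "Dose descriptor": ["NOAEL", "LD50", "NOAEL"],
})
result = filter_by_dict(df, {"Route": ["oral", "inhalation"],
                             "Dose descriptor": ["NOAEL"]})
print(result["Route"].tolist())  # ['oral', 'inhalation']
```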


@ -1,497 +0,0 @@
import requests
import urllib.parse
import json
import logging
import os
import re
from datetime import datetime
from bs4 import BeautifulSoup, Tag
from typing import Dict, Union, Optional, Any
# Logging settings
logging.basicConfig(
format="{asctime} - {levelname} - {message}",
style="{",
datefmt="%Y-%m-%d %H:%M",
filename=".log",
encoding="utf-8",
filemode="a",
level=logging.INFO,
)
# Constants for API endpoints
QUACKO_BASE_URL = "https://chem.echa.europa.eu"
QUACKO_SUBSTANCE_API = f"{QUACKO_BASE_URL}/api-substance/v1/substance"
QUACKO_DOSSIER_API = f"{QUACKO_BASE_URL}/api-dossier-list/v1/dossier"
QUACKO_HTML_PAGES = f"{QUACKO_BASE_URL}/html-pages"
# Default sections to look for in the dossier
DEFAULT_SECTIONS = {
"id_7_Toxicologicalinformation": "ToxSummary",
"id_72_AcuteToxicity": "AcuteToxicity",
"id_75_Repeateddosetoxicity": "RepeatedDose",
"id_6_Ecotoxicologicalinformation": "EcotoxSummary",
"id_76_Genetictoxicity" : 'GeneticToxicity',
"id_42_Meltingpointfreezingpoint" : "MeltingFreezingPoint",
"id_43_Boilingpoint" : "BoilingPoint",
"id_48_Watersolubility" : "WaterSolubility",
"id_410_Surfacetension" : "SurfaceTension",
"id_420_pH" : "pH",
"Test" : "Test2"
}
def search_dossier(
substance: str,
input_type: str = 'rmlCas',
sections: Dict[str, str] = None,
local_index_path: str = None
) -> Union[Dict[str, Any], str, bool]:
"""
Search for a chemical substance in the QUACKO database and retrieve its dossier information.
Args:
substance (str): The identifier of the substance to search for (e.g. CAS number, name)
input_type (str): The type of identifier provided. Options: 'rmlCas', 'rmlName', 'rmlEc'
sections (Dict[str, str], optional): Dictionary mapping section IDs to result keys.
If None, default sections will be used.
local_index_path (str, optional): Path to a local index.html file to parse instead of
downloading from QUACKO. If provided, the function will
skip all API calls and only extract sections from the local file.
Returns:
Union[Dict[str, Any], str, bool]: Dictionary with substance information and dossier links on success,
error message string if substance found but with issues,
False if substance not found or other critical error
"""
# Use default sections if none provided
if sections is None:
sections = DEFAULT_SECTIONS
try:
results = {}
# If a local file is provided, extract sections from it directly
if local_index_path:
logging.info(f"QUACKO.search() - Using local index file: {local_index_path}")
# We still need some minimal info for constructing the URLs
if '/' not in local_index_path:
asset_id = "local"
rml_id = "local"
else:
# Try to extract information from the path if available
path_parts = local_index_path.split('/')
# If path follows expected structure: .../html-pages/ASSET_ID/index.html
if 'html-pages' in path_parts and 'index.html' in path_parts[-1]:
asset_id = path_parts[path_parts.index('html-pages') + 1]
rml_id = "extracted" # Just a placeholder
else:
asset_id = "local"
rml_id = "local"
# Add these to results for consistency
results["assetExternalId"] = asset_id
results["rmlId"] = rml_id
# Extract sections from the local file
section_links = get_section_links_from_file(local_index_path, asset_id, rml_id, sections)
if section_links:
results.update(section_links)
return results
# Normal flow with API calls
substance_data = get_substance_by_identifier(substance)
if not substance_data:
return False
# Verify that the found substance matches the input identifier
if substance_data.get(input_type) != substance:
error_msg = (f"Search error: results[{input_type}] (\"{substance_data.get(input_type)}\") "
f"is not equal to \"{substance}\". Maybe you specified the wrong input_type. "
f"Check the results here: {substance_data.get('search_response')}")
logging.error(f"QUACKO.search(): {error_msg}")
return error_msg
# Step 2: Find dossiers for the substance
rml_id = substance_data["rmlId"]
dossier_data = get_dossier_by_rml_id(rml_id, substance)
if not dossier_data:
return False
# Merge substance and dossier data
results = {**substance_data, **dossier_data}
# Step 3: Extract detailed information from dossier index page
asset_external_id = dossier_data["assetExternalId"]
section_links = get_section_links_from_index(asset_external_id, rml_id, sections)
if section_links:
results.update(section_links)
logging.info(f"QUACKO.search() OK. output: {json.dumps(results)}")
return results
except Exception as e:
logging.error(f"QUACKO.search(): Unexpected error in search_dossier for '{substance}': {str(e)}")
return False
def get_substance_by_identifier(substance: str) -> Optional[Dict[str, str]]:
"""
Search the QUACKO database for a substance using the provided identifier.
Args:
substance (str): The substance identifier to search for (CAS number, name, etc.)
Returns:
Optional[Dict[str, str]]: Dictionary with substance information or None if not found
"""
encoded_substance = urllib.parse.quote(substance)
search_url = f"{QUACKO_SUBSTANCE_API}?pageIndex=1&pageSize=100&searchText={encoded_substance}"
logging.info(f'QUACKO.search(). searching "{substance}"')
try:
response = requests.get(search_url)
response.raise_for_status() # Raise exception for HTTP errors
data = response.json()
if not data.get("items") or len(data["items"]) == 0:
logging.info(f"QUACKO.search() could not find substance for '{substance}'")
return None
# Extract substance information
substance_index = data["items"][0]["substanceIndex"]
result = {
'search_response': search_url,
'rmlId': substance_index.get("rmlId", ""),
'rmlName': substance_index.get("rmlName", ""),
'rmlCas': substance_index.get("rmlCas", ""),
'rmlEc': substance_index.get("rmlEc", "")
}
logging.info(
f"QUACKO.search() found substance on QUACKO. "
f"rmlId: '{result['rmlId']}', rmlName: '{result['rmlName']}', rmlCas: '{result['rmlCas']}'"
)
return result
except requests.RequestException as e:
logging.error(f"QUACKO.search() - Request error while searching for substance '{substance}': {str(e)}")
return None
except (KeyError, IndexError) as e:
logging.error(f"QUACKO.search() - Data parsing error for substance '{substance}': {str(e)}")
return None
def get_dossier_by_rml_id(rml_id: str, substance_name: str) -> Optional[Dict[str, Any]]:
"""
Find dossiers for a substance using its RML ID.
Args:
rml_id (str): The RML ID of the substance
substance_name (str): The name of the substance (for logging)
Returns:
Optional[Dict[str, Any]]: Dictionary with dossier information or None if not found
"""
# First try active dossiers
dossier_results = _query_dossier_api(rml_id, "Active")
# If no active dossiers found, try inactive ones
if not dossier_results:
logging.info(
f"QUACKO.search() - could not find active dossier for '{substance_name}'. "
"Proceeding to search the inactive ones."
)
dossier_results = _query_dossier_api(rml_id, "Inactive")
if not dossier_results:
logging.info(f"QUACKO.search() - could not find inactive dossiers for '{substance_name}'")
return None
else:
logging.info(f"QUACKO.search() - found inactive dossiers for '{substance_name}'")
dossier_results["dossierType"] = "Inactive"
else:
logging.info(f"QUACKO.search() - found active dossiers for '{substance_name}'")
dossier_results["dossierType"] = "Active"
return dossier_results
def _query_dossier_api(rml_id: str, status: str) -> Optional[Dict[str, Any]]:
"""
Helper function to query the QUACKO dossier API for a specific substance and status.
Args:
rml_id (str): The RML ID of the substance
status (str): The status of dossiers to search for ('Active' or 'Inactive')
Returns:
Optional[Dict[str, Any]]: Dictionary with dossier information or None if not found
"""
url = f"{QUACKO_DOSSIER_API}?pageIndex=1&pageSize=100&rmlId={rml_id}&registrationStatuses={status}"
try:
response = requests.get(url)
response.raise_for_status()
data = response.json()
if not data.get("items") or len(data["items"]) == 0:
return None
result = {
"assetExternalId": data["items"][0]["assetExternalId"],
"rootKey": data["items"][0]["rootKey"],
}
# Extract last update date if available
try:
last_update = data["items"][0]["lastUpdatedDate"]
datetime_object = datetime.fromisoformat(last_update.replace('Z', '+00:00'))
result['lastUpdateDate'] = datetime_object.date().isoformat()
except (KeyError, ValueError) as e:
logging.error(f"QUACKO.search() - Error extracting lastUpdateDate: {str(e)}")
# Add index URLs
result["index"] = f"{QUACKO_HTML_PAGES}/{result['assetExternalId']}/index.html"
result["index_js"] = f"{QUACKO_BASE_URL}/{rml_id}/dossier-view/{result['assetExternalId']}"
return result
except requests.RequestException as e:
logging.error(f"QUACKO.search() - Request error while getting dossiers for RML ID '{rml_id}': {str(e)}")
return None
except (KeyError, IndexError) as e:
logging.error(f"QUACKO.search() - Data parsing error for RML ID '{rml_id}': {str(e)}")
return None
def get_section_links_from_index(
asset_id: str,
rml_id: str,
sections: Dict[str, str]
) -> Dict[str, str]:
"""
Extract links to specified sections from the dossier index page by downloading it.
Args:
asset_id (str): The asset external ID of the dossier
rml_id (str): The RML ID of the substance
sections (Dict[str, str]): Dictionary mapping section IDs to result keys
Returns:
Dict[str, str]: Dictionary with links to the requested sections
"""
index_url = f"{QUACKO_HTML_PAGES}/{asset_id}/index.html"
try:
response = requests.get(index_url)
response.raise_for_status()
# Parse content using the shared method
return parse_sections_from_html(response.text, asset_id, rml_id, sections)
except requests.RequestException as e:
logging.error(f"QUACKO.search() - Request error while extracting section links: {str(e)}")
return {}
except Exception as e:
logging.error(f"QUACKO.search() - Error extracting section links: {str(e)}")
return {}
def get_section_links_from_file(
file_path: str,
asset_id: str,
rml_id: str,
sections: Dict[str, str]
) -> Dict[str, str]:
"""
Extract links to specified sections from a local index.html file.
Args:
file_path (str): Path to the local index.html file
asset_id (str): The asset external ID to use for constructing URLs
rml_id (str): The RML ID to use for constructing URLs
sections (Dict[str, str]): Dictionary mapping section IDs to result keys
Returns:
Dict[str, str]: Dictionary with links to the requested sections
"""
try:
if not os.path.exists(file_path):
logging.error(f"QUACKO.search() - Local file not found: {file_path}")
return {}
with open(file_path, 'r', encoding='utf-8') as file:
html_content = file.read()
# Parse content using the shared method
return parse_sections_from_html(html_content, asset_id, rml_id, sections)
except FileNotFoundError:
logging.error(f"QUACKO.search() - File not found: {file_path}")
return {}
except Exception as e:
logging.error(f"QUACKO.search() - Error parsing local file {file_path}: {str(e)}")
return {}
def parse_sections_from_html(
html_content: str,
asset_id: str,
rml_id: str,
sections: Dict[str, str]
) -> Dict[str, str]:
"""
Parse HTML content to extract links to specified sections.
Args:
html_content (str): HTML content to parse
asset_id (str): The asset external ID to use for constructing URLs
rml_id (str): The RML ID to use for constructing URLs
sections (Dict[str, str]): Dictionary mapping section IDs to result keys
Returns:
Dict[str, str]: Dictionary with links to the requested sections
"""
result = {}
try:
soup = BeautifulSoup(html_content, "html.parser")
# Extract each requested section
for section_id, section_name in sections.items():
section_links = extract_section_links(soup, section_id, asset_id, rml_id, section_name)
if section_links:
result.update(section_links)
logging.info(f"QUACKO.search() - Found section '{section_name}' in document")
else:
logging.info(f"QUACKO.search() - Section '{section_name}' not found in document")
return result
except Exception as e:
logging.error(f"QUACKO.search() - Error parsing HTML content: {str(e)}")
return {}
# --------------------------------------------------------------------------
# Function to Extract Section Links with Validation
# --------------------------------------------------------------------------
# This function extracts the document link associated with a specific section ID
# from the QUACKO index.html page structure.
#
# Problem Solved:
# Previous attempts faced issues where searching for a link within a parent
# section's div (e.g., "7 Toxicological Information" with id="id_7_...")
# would incorrectly grab the link belonging to the *first child* section
# (e.g., "7.2 Acute Toxicity" with id="id_72_..."). This happened because
# the simple `find("a", href=True)` doesn't distinguish ownership when nested.
#
# Solution Logic:
# 1. Find Target Div: Locate the `div` element using the specific `section_id` provided.
# This div typically contains the section's content or nested subsections.
# 2. Find First Link: Find the very first `<a>` tag that has an `href` attribute
# somewhere *inside* the `target_div`.
# 3. Find Link's Owning Section Div: Starting from the `first_link_tag`, traverse
# up the HTML tree using `find_parent()` to find the nearest ancestor `div`
# whose `id` attribute starts with "id_" (the pattern for section containers).
# 4. Validate Ownership: Compare the `id` of the `link_ancestor_section_div` found
# in step 3 with the original `section_id` passed into the function.
# 5. Decision:
# - If the IDs MATCH: It confirms that the `first_link_tag` truly belongs to the
# `section_id` we are querying. The function proceeds to extract and format
# this link.
# - If the IDs DO NOT MATCH: It indicates that the first link found actually
# belongs to a *nested* subsection div. Therefore, the original `section_id`
# (the parent/container) does not have its own direct link, and the function
# correctly returns an empty dictionary for this `section_id`.
#
# This validation step ensures that we only return links that are directly
# associated with the queried section ID, preventing the inheritance bug.
# --------------------------------------------------------------------------
def extract_section_links(
soup: BeautifulSoup,
section_id: str,
asset_id: str,
rml_id: str,
section_name: str
) -> Dict[str, str]:
"""
Extracts a link for a specific section ID by finding the first link
within its div and verifying that the link belongs directly to that
section, not a nested subsection.
Args:
soup (BeautifulSoup): The BeautifulSoup object of the index page.
section_id (str): The HTML ID of the section div.
asset_id (str): The asset external ID of the dossier.
rml_id (str): The RML ID of the substance.
section_name (str): The name to use for the section in the result.
Returns:
Dict[str, str]: Dictionary with link if found and validated,
otherwise empty.
"""
result = {}
# 1. Find the target div for the section ID
target_div = soup.find("div", id=section_id)
if not target_div:
logging.info(f"QUACKO.search() - extract_section_links(): No div found for id='{section_id}'")
return result
# 2. Find the first <a> tag with an href within this target div
first_link_tag = target_div.find("a", href=True)
if not first_link_tag:
logging.info(f"QUACKO.search() - extract_section_links(): No 'a' tag with href found within div id='{section_id}'")
return result # No links at all within this section
# 3. Validate: Find the closest ancestor div with an ID starting with "id_"
# This tells us which section container the link *actually* resides in.
# We use a lambda function for the id check.
# Need to handle potential None if the structure is unexpected.
link_ancestor_section_div: Optional[Tag] = first_link_tag.find_parent(
"div", id=lambda x: x and x.startswith("id_")
)
# 4. Compare IDs
if link_ancestor_section_div and link_ancestor_section_div.get('id') == section_id:
# The first link found belongs directly to the section we are looking for.
logging.debug(f"QUACKO.search() - extract_section_links(): Valid link found for id='{section_id}'.")
a_tag_to_use = first_link_tag # Use the link we found
else:
# The first link found belongs to a *different* (nested) section
# or the structure is broken (no ancestor div with id found).
# Therefore, the section_id we were originally checking has no direct link.
ancestor_id = link_ancestor_section_div.get('id') if link_ancestor_section_div else "None"
logging.info(f"QUACKO.search() - extract_section_links(): First link within id='{section_id}' belongs to ancestor id='{ancestor_id}'. No direct link for '{section_id}'.")
return result # Return empty dict
# 5. Proceed with link extraction using the validated a_tag_to_use
try:
document_id = a_tag_to_use.get('href') # Use .get() for safety
if not document_id:
logging.error(f"QUACKO.search() - extract_section_links(): Found 'a' tag for '{section_name}' has no href attribute.")
return {}
# Clean up the document ID
if document_id.startswith('./documents/'):
document_id = document_id.replace('./documents/', '')
if document_id.endswith('.html'):
document_id = document_id.replace('.html', '')
# Construct the full URLs unless in local-only mode
if asset_id == "local" and rml_id == "local":
result[section_name] = f"Local section found: {document_id}"
else:
result[section_name] = f"{QUACKO_HTML_PAGES}/{asset_id}/documents/{document_id}.html"
result[f"{section_name}_js"] = f"{QUACKO_BASE_URL}/{rml_id}/dossier-view/{asset_id}/{document_id}"
return result
except Exception as e: # Catch potential errors during processing
logging.error(f"QUACKO.search() - extract_section_links(): Error processing the validated link tag for '{section_name}': {str(e)}")
return {}
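The ownership check documented in the comment block above can be demonstrated on a toy index page (the HTML snippet and IDs are invented for illustration): the first link inside the parent `id_7_...` div actually lives in the nested `id_72_...` div, so the parent correctly reports no direct link of its own.

```python
from bs4 import BeautifulSoup

html = """
<div id="id_7_Toxicologicalinformation">
  <div id="id_72_AcuteToxicity">
    <a href="./documents/abc123.html">7.2 Acute Toxicity</a>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

def owning_section_id(section_id):
    # Mirror steps 1-4 of extract_section_links: take the first link inside the
    # target div, then walk up to the nearest ancestor div whose id starts with
    # "id_" to see which section container the link really belongs to.
    target = soup.find("div", id=section_id)
    link = target.find("a", href=True)
    owner = link.find_parent("div", id=lambda x: x and x.startswith("id_"))
    return owner.get("id") if owner else None

print(owning_section_id("id_7_Toxicologicalinformation"))  # nested owner, so no direct link
print(owning_section_id("id_72_AcuteToxicity"))            # owner matches: link is extracted
```

Only when the owning id equals the queried id does `extract_section_links` proceed to build the section URLs.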

@@ -1,37 +0,0 @@
from typing import Optional
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi
from pymongo.database import Database
#region General helpers for the MongoDB connection
# Database connection function
def connect(user: str, password: str, database: str = "toxinfo") -> Database:
uri = f"mongodb+srv://{user}:{password}@ufs13.dsmvdrx.mongodb.net/?retryWrites=true&w=majority&appName=UFS13"
client = MongoClient(uri, server_api=ServerApi('1'))
db = client[database]
return db
#endregion
#region Search helpers for our own DB
# Look up documents extracted from COSING
def value_search(db : Database,value:str,mode : Optional[str] = None) -> Optional[dict]:
if mode:
json = db.toxinfo.find_one({mode:value},{"_id":0})
if json:
return json
return None
else:
modes = ['commonName','inciName','casNo','ecNo','chemicalName','phEurName']
for el in modes:
json = db.toxinfo.find_one({el:value},{"_id":0})
if json:
return json
return None
#endregion
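The fallback that `value_search` performs over identifier fields can be sketched without a live MongoDB, using a plain list of dicts as a stand-in for the `toxinfo` collection (the documents below are invented):

```python
# In-memory stand-in for the toxinfo collection
docs = [
    {"inciName": "AQUA", "casNo": "7732-18-5"},
    {"inciName": "GLYCERIN", "casNo": "56-81-5", "commonName": "glycerol"},
]

def value_search_local(value, mode=None):
    # Same idea as value_search: with an explicit mode, match only that field;
    # otherwise try each identifier field in order until one document matches.
    fields = [mode] if mode else ['commonName', 'inciName', 'casNo', 'ecNo', 'chemicalName', 'phEurName']
    for field in fields:
        for doc in docs:
            if doc.get(field) == value:
                return doc
    return None

print(value_search_local("56-81-5"))        # found via the casNo fallback
print(value_search_local("AQUA", "casNo"))  # forced field does not match -> None
```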

@@ -1,149 +0,0 @@
import os
from contextlib import contextmanager
import pubchempy as pcp
from pubchemprops.pubchemprops import get_second_layer_props
import logging
logging.basicConfig(
format="{asctime} - {levelname} - {message}",
style="{",
datefmt="%Y-%m-%d %H:%M",
filename="echa.log",
encoding="utf-8",
filemode="a",
level=logging.INFO,
)
@contextmanager
def temporary_certificate(cert_path):
# This is needed because using the PubChem API requires temporarily swapping
# the certificate that outgoing requests are made with
"""
Context manager to temporarily change the certificate used for requests.
Args:
cert_path (str): Path to the certificate file to use temporarily
Example:
# Regular request uses default certificates
requests.get('https://api.example.com')
# Use custom certificate only within this block
with temporary_certificate('custom-cert.pem'):
requests.get('https://api.requiring.custom.cert.com')
# Back to default certificates
requests.get('https://api.example.com')
"""
# Store original environment variables
original_ca_bundle = os.environ.get('REQUESTS_CA_BUNDLE')
original_ssl_cert = os.environ.get('SSL_CERT_FILE')
try:
# Set new certificate
os.environ['REQUESTS_CA_BUNDLE'] = cert_path
os.environ['SSL_CERT_FILE'] = cert_path
yield
finally:
# Restore original environment variables
if original_ca_bundle is not None:
os.environ['REQUESTS_CA_BUNDLE'] = original_ca_bundle
else:
os.environ.pop('REQUESTS_CA_BUNDLE', None)
if original_ssl_cert is not None:
os.environ['SSL_CERT_FILE'] = original_ssl_cert
else:
os.environ.pop('SSL_CERT_FILE', None)
def clean_property_data(api_response):
"""
Simplifies the API response data by flattening nested structures.
Args:
api_response (dict): Raw API response containing property data
Returns:
dict: Cleaned data with simplified structure
"""
cleaned_data = {}
for property_name, measurements in api_response.items():
cleaned_measurements = []
for measurement in measurements:
cleaned_measurement = {
'ReferenceNumber': measurement.get('ReferenceNumber'),
'Description': measurement.get('Description', ''),
}
# Handle Reference field
if 'Reference' in measurement:
# Check if Reference is a list or string
ref = measurement['Reference']
cleaned_measurement['Reference'] = ref[0] if isinstance(ref, list) else ref
# Handle Value field
value = measurement.get('Value', {})
if isinstance(value, dict) and 'StringWithMarkup' in value:
cleaned_measurement['Value'] = value['StringWithMarkup'][0]['String']
else:
cleaned_measurement['Value'] = str(value)
# Remove empty values
cleaned_measurement = {k: v for k, v in cleaned_measurement.items() if v}
cleaned_measurements.append(cleaned_measurement)
cleaned_data[property_name] = cleaned_measurements
return cleaned_data
def pubchem_dap(cas):
'''
Given a CAS number, look up safety-sheet information on PubChem.
First-level properties (synonyms, CID, logP, MolecularWeight, ExactMass, TPSA) are extracted with pubchempy.
Second-level ones (Melting point) use pubchemprops.
args:
cas : string
'''
with temporary_certificate('src/data/ncbi-nlm-nih-gov-catena.pem'):
try:
# Initial search
out = pcp.get_synonyms(cas, 'name')
if out:
out = out[0]
output = {'CID' : out['CID'],
'CAS' : cas,
'first_pubchem_name' : out['Synonym'][0],
'pubchem_link' : f"https://pubchem.ncbi.nlm.nih.gov/compound/{out['CID']}"}
else:
return f'No results on PubChem for {cas}'
except Exception:
logging.error(f'various_utils.pubchem.pubchem_dap(). Some error during pubchem search for {cas}', exc_info=True)
return None  # without a CID the property lookups below cannot run
try:
# Ricerca delle proprietà
properties = pcp.get_properties(['xlogp', 'molecular_weight', 'tpsa', 'exact_mass'], identifier = out['CID'], namespace='cid', searchtype=None, as_dataframe=False)
if properties:
output = {**output, **properties[0]}
else:
return output
except Exception as E:
logging.error(f'various_utils.pubchem.pubchem_dap(). Some error during pubchem first level properties extraction for {cas}', exc_info=True)
try:
# Ricerca del Melting Point
second_layer_props = get_second_layer_props(output['first_pubchem_name'], ['Melting Point', 'Dissociation Constants', 'pH'])
if second_layer_props:
second_layer_props = clean_property_data(second_layer_props)
output = {**output, **second_layer_props}
except Exception as E:
logging.error(f'various_utils.pubchem.pubchem_dap(). Some error during pubchem second level properties extraction (Melting Point) for {cas}', exc_info=True)
return output
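The flattening that `clean_property_data` applies to PubChem's nested response can be seen on a fabricated sample (the measurement values are invented); the logic is re-stated here so the sketch runs standalone:

```python
def clean_property_data(api_response):
    # Flatten each measurement: unwrap StringWithMarkup values, take the first
    # Reference when it is a list, and drop empty fields from each row
    cleaned = {}
    for prop, measurements in api_response.items():
        rows = []
        for m in measurements:
            row = {'ReferenceNumber': m.get('ReferenceNumber'),
                   'Description': m.get('Description', '')}
            if 'Reference' in m:
                ref = m['Reference']
                row['Reference'] = ref[0] if isinstance(ref, list) else ref
            value = m.get('Value', {})
            if isinstance(value, dict) and 'StringWithMarkup' in value:
                row['Value'] = value['StringWithMarkup'][0]['String']
            else:
                row['Value'] = str(value)
            rows.append({k: v for k, v in row.items() if v})
        cleaned[prop] = rows
    return cleaned

sample = {"Melting Point": [{
    "ReferenceNumber": 7,
    "Reference": ["Hazardous Substances Data Bank"],
    "Value": {"StringWithMarkup": [{"String": "17.8 °C"}]},
}]}
print(clean_property_data(sample))
```

The empty `Description` is dropped by the final dict comprehension, so the flattened row carries only populated fields.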

@@ -1,182 +0,0 @@
import json as js
import re
import requests as req
from typing import Union
#region Parse a list of CAS numbers taken from COSING (thanks Jem)
def parse_cas_numbers(cas_string:list) -> list:
# The caller guarantees at least one CAS exists, so we can take the first string
cas_string = cas_string[0]
# Remove parentheses and their contents
cas_string = re.sub(r"\([^)]*\)", "", cas_string)
# Split on the various possible separators
cas_parts = re.split(r"[/;,]", cas_string)
# Build the list from the parts, stripping excess whitespace
cas_list = [cas.strip() for cas in cas_parts]
# Some CAS entries are separated by a double dash (--) that must be removed;
# this has to happen now, as a second pass
if len(cas_list) == 1 and "--" in cas_list[0]:
cas_list = [cas.strip() for cas in cas_list[0].split("--")]
# Some entries hold the invalid placeholder value "-", so find and remove those
if cas_list:
while "-" in cas_list:
cas_list.remove("-")
return cas_list
#endregion
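The parsing above handles several COSING formatting quirks in sequence; a standalone restatement with made-up inputs shows each rule firing:

```python
import re

def parse_cas_numbers(cas_string: list) -> list:
    # Take the first string, strip parenthesised notes, split on / ; , then
    # handle the double-dash separator and drop "-" placeholders
    s = re.sub(r"\([^)]*\)", "", cas_string[0])
    cas_list = [c.strip() for c in re.split(r"[/;,]", s)]
    if len(cas_list) == 1 and "--" in cas_list[0]:
        cas_list = [c.strip() for c in cas_list[0].split("--")]
    while "-" in cas_list:
        cas_list.remove("-")
    return cas_list

print(parse_cas_numbers(["7732-18-5 (water) / 64-17-5 ; -"]))  # ['7732-18-5', '64-17-5']
print(parse_cas_numbers(["50-00-0 -- 56-81-5"]))               # ['50-00-0', '56-81-5']
```

Note the double-dash split only applies when the usual separators produced a single element, matching the original's two-pass approach.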
#region Run a search directly against COSING
# The first argument is the search string, the second selects the search mode
def cosing_search(text : str, mode : str = "name") -> Union[list,dict,None]:
cosing_post_req = "https://api.tech.ec.europa.eu/search-api/prod/rest/search?apiKey=285a77fd-1257-4271-8507-f0c6b2961203&text=*&pageSize=100&pageNumber=1"
agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0"
# The default search mode is by name, whether INCI or another kind
if mode == "name":
search_query = {"bool":
{"must":[
{"text":
{"query":f"{text}","fields":
["inciName.exact","inciUsaName","innName.exact","phEurName","chemicalName","chemicalDescription"],
"defaultOperator":"AND"}}]}}
# Searching by CAS or EC number requires a different request payload
elif mode in ["cas","ec"]:
search_query = {"bool": {"must": [{"text": {"query": f"*{text}*","fields": ["casNo", "ecNo"]}}]}}
# Searching by ID is needed wherever the identified ingredients must be fetched
elif mode == "id":
search_query = {"bool":{"must":[{"term":{"substanceId":f"{text}"}}]}}
# Raise an error on any unsupported mode
else:
raise ValueError(f"Unsupported search mode: {mode}")
# Build the request payload
files = {"query": ("query",js.dumps(search_query),"application/json")}
# Run the search POST
risposta = req.post(cosing_post_req,headers={"user_agent":agent,"Connection":"keep-alive"},files=files)
risposta_json = risposta.json()
if risposta_json["results"]:
return risposta_json["results"][0]["metadata"]
# The function returns None when the search yields no results
return None
#endregion
#region Clean a COSING json and return it
def clean_cosing(json : dict, full : bool = True) -> dict:
# Define the fields we care about, split by the type of output we want for each
string_cols = ["itemType","nameOfCommonIngredientsGlossary","inciName","phEurName","chemicalName","innName","substanceId","refNo"]
list_cols = ["casNo","ecNo","functionName","otherRestrictions","sccsOpinion","sccsOpinionUrls","identifiedIngredient","annexNo","otherRegulations"]
# Build a list with every field to loop over
total_keys = string_cols + list_cols
# Base of the URL used to build the substance's COSING link
base_url = "https://ec.europa.eu/growth/tools-databases/cosing/details/"
clean_json = {}
# Loop over every field of interest
for key in total_keys:
# Some fields carry a useless placeholder that only wastes space, so remove it
while "<empty>" in json[key]:
json[key].remove("<empty>")
# If the field should stay a list in output, empty COSING lists are acceptable values
if key in list_cols:
value = json[key]
# CAS and EC are special cases, so extra processing runs when handling them
if key in ["casNo", "ecNo"]:
if value:
value = parse_cas_numbers(value)
# When identifiedIngredient entries are present, resolve them directly
# into the returned json, but only when the "full" flag is true
elif key == "identifiedIngredient":
if full:
if value:
value = identified_ingredients(value)
clean_json[key] = value
else:
# This field name was too long, so it gets a shorter alias
if key == "nameOfCommonIngredientsGlossary":
nKey = "commonName"
# Other fields needing no rename keep their original name
else:
nKey = key
# We want a plain string in output, and slicing an empty list would fail,
# so first check that the COSING list holds any values
if json[key]:
clean_json[nKey] = json[key][0]
else:
clean_json[nKey] = ""
# The cosingUrl field does not exist yet: build it by joining the substance ID to the base URL
clean_json["cosingUrl"] = f"{base_url}{json['substanceId'][0]}"
return clean_json
#endregion
#region Complete a COSING json when needed
def identified_ingredients(id_list : list) -> list:
identified = []
# Run a search for each ingredient in identifiedIngredient
for id in id_list:
ingredient = cosing_search(id,"id")
if ingredient:
# Clean the json we just found
ingredient = clean_cosing(ingredient,full=False)
# Store the cleaned document in the list
identified.append(ingredient)
# Once populated with the identifiedIngredient objects, return the list
return identified
#endregion

@@ -1,223 +0,0 @@
import requests
import urllib.parse
import re as standardre
import json
from bs4 import BeautifulSoup
from datetime import datetime
from pif_compiler.functions.common_log import get_logger
logger = get_logger()
# Look up a dossier given a CAS number, substance name, or EC number
def search_dossier(substance, input_type='rmlCas'):
results = {}
# The dictionary returned at the end
# Step 1: obtain rmlID and rmlName
req_0 = requests.get(
"https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText="
+ urllib.parse.quote(substance) # must be URL-encoded
)
logger.info(f'echaFind.search_dossier(). searching "{substance}"')
# The first thing to do is search on the substance name, transformed via urllib
req_0_json = req_0.json()
try:
# Extract the fields we need from the response
rmlId = req_0_json["items"][0]["substanceIndex"]["rmlId"]
rmlName = req_0_json["items"][0]["substanceIndex"]["rmlName"]
rmlCas = req_0_json["items"][0]["substanceIndex"]["rmlCas"]
rmlEc = req_0_json["items"][0]["substanceIndex"]["rmlEc"]
results['search_response'] = f"https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText={urllib.parse.quote(substance)}"
results["rmlId"] = rmlId
results["rmlName"] = rmlName
results["rmlCas"] = rmlCas
results["rmlEc"] = rmlEc
logger.info(
f"echaFind.search_dossier(). found substance on ECHA. rmlId: '{rmlId}', rmlName: '{rmlName}', rmlCas: '{rmlCas}'"
)
except (KeyError, IndexError):
logger.info(
f"echaFind.search_dossier(). could not find substance for '{substance}'"
)
return False
# Update: in some cases, searching by CAS could return a substance whose EC code merely equals the input CAS.
# We now check that the substance found really has a CAS equal to the one given in input.
# It is also possible to search by rmlName (substance name) or EC (rmlEc): just set input_type accordingly.
if results[input_type] != substance:
logger.error(f'echa.echaFind.search_dossier(): results[{input_type}] "{results[input_type]}" is not equal to "{substance}".')
return f'search_error. results[{input_type}] ("{results[input_type]}") is not equal to "{substance}". Maybe you specified the wrong input_type. Check the results here: https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText={urllib.parse.quote(substance)}'
# Step 2: search the ECHA site for dossiers, building a link from the ID obtained above.
req_1_url = (
"https://chem.echa.europa.eu/api-dossier-list/v1/dossier?pageIndex=1&pageSize=100&rmlId="
+ rmlId
+ "&registrationStatuses=Active"
) # Search the active dossiers first.
req_1 = requests.get(req_1_url)
req_1_json = req_1.json()
# If no active dossiers exist, search the inactive ones
if req_1_json["items"] == []:
logger.info(
f"echaFind.search_dossier(). could not find active dossier for '{substance}'. Proceeding to search the inactive ones."
)
req_1_url = (
"https://chem.echa.europa.eu/api-dossier-list/v1/dossier?pageIndex=1&pageSize=100&rmlId="
+ rmlId
+ "&registrationStatuses=Inactive"
)
req_1 = requests.get(req_1_url)
req_1_json = req_1.json()
if req_1_json["items"] == []:
logger.info(
f"echaFind.search_dossier(). could not find inactive dossiers for '{substance}'"
) # Found neither inactive nor active dossiers
return False
else:
logger.info(
f"echaFind.search_dossier(). found inactive dossiers for '{rmlName}'"
)
results["dossierType"] = "Inactive"
else:
logger.info(
f"echaFind.search_dossier(). found active dossiers for '{substance}'"
)
results["dossierType"] = "Active"
# These are the two values we need
assetExternalId = req_1_json["items"][0]["assetExternalId"]
# UPDATE: get the date of the last modification: used to tell whether the files we already downloaded locally are up to date,
# by comparing the scraping date with the last update (before or after)
try:
lastUpdateDate = req_1_json["items"][0]["lastUpdatedDate"]
datetime_object = datetime.fromisoformat(lastUpdateDate.replace('Z', '+00:00')) # Handle 'Z' if present, else it might break on older python versions
lastUpdateDate = datetime_object.date().isoformat()
results['lastUpdateDate'] = lastUpdateDate
except (KeyError, ValueError):
logger.error("echa.echaFind(). Could not find lastUpdateDate for the dossier")
rootKey = req_1_json["items"][0]["rootKey"]
# HTML PART
# Part three. Use the assetExternalId.
# "With the assetExternalId we can reach the dossier's main page."
# "From this page we need to scrape the ID of the toxicological summary, IF IT EXISTS"
results["index"] = (
"https://chem.echa.europa.eu/html-pages/" + assetExternalId + "/index.html"
)
results["index_js"] = (
f"https://chem.echa.europa.eu/{rmlId}/dossier-view/{assetExternalId}"
)
req_2 = requests.get(
"https://chem.echa.europa.eu/html-pages/" + assetExternalId + "/index.html"
)
index = BeautifulSoup(req_2.text, "html.parser")
# Part four. Get the toxicological summary ID from the index.html
# "In all that HTML we only care about one div. BeautifulSoup has problems when there are too many nested divs, so a combination of it and regex is used"
div = index.find_all("div", id=["id_7_Toxicologicalinformation"])
str_div = str(div)
str_div = str_div.split("</div>")[0]
# UIC is the id of the tox summary
uic_found = False
# A regex to find the href we need
uic_match = standardre.search('href="([^"]+)"', str_div)
if uic_match is None:
logger.info(
"echaFind.search_dossier(). Could not find 'id_7_Toxicologicalinformation' in the body"
)
else:
UIC = uic_match.group(1)
uic_found = True
# Acute toxicity
acute_toxicity_found = False
div_acute_toxicity = index.find_all("div", id=["id_72_AcuteToxicity"])
if div_acute_toxicity:
for div in div_acute_toxicity:
try:
a = div.find_all("a", href=True)[0]
acute_toxicity_id = standardre.search('href="([^"]+)"', str(a)).group(1)
acute_toxicity_found = True
except (IndexError, AttributeError):
logger.info(
f"echaFind.search_dossier(). No acute_toxicity_id found from index for {substance}"
)
# Repeated dose
repeated_dose_found = False
div_repeated_dose = index.find_all("div", id=["id_75_Repeateddosetoxicity"])
if div_repeated_dose:
for div in div_repeated_dose:
try:
a = div.find_all("a", href=True)[0]
repeated_dose_id = standardre.search('href="([^"]+)"', str(a)).group(1)
repeated_dose_found = True
except (IndexError, AttributeError):
logger.info(
f"echaFind.search_dossier(). No repeated_dose_id found from index for {substance}"
)
# Part five. Fetch the toxicological dossier HTML and return the content
if acute_toxicity_found:
acute_toxicity_link = (
"https://chem.echa.europa.eu/html-pages/"
+ assetExternalId
+ "/documents/"
+ acute_toxicity_id
+ ".html"
)
results["AcuteToxicity"] = acute_toxicity_link
# there are two different links: the plain-HTML one is ugly but holds the info in readable form, while the js one is the prettier version presented to the user,
# used to build the nice PDF
results["AcuteToxicity_js"] = (
f"https://chem.echa.europa.eu/{rmlId}/dossier-view/{assetExternalId}/{acute_toxicity_id}"
)
if uic_found:
# UIC is the id of the tox summary
final_url = (
"https://chem.echa.europa.eu/html-pages/"
+ assetExternalId
+ "/documents/"
+ UIC
+ ".html"
)
results["ToxSummary"] = final_url
results["ToxSummary_js"] = (
f"https://chem.echa.europa.eu/{rmlId}/dossier-view/{assetExternalId}/{UIC}"
)
if repeated_dose_found:
results["RepeatedDose"] = (
"https://chem.echa.europa.eu/html-pages/"
+ assetExternalId
+ "/documents/"
+ repeated_dose_id
+ ".html"
)
results["RepeatedDose_js"] = (
f"https://chem.echa.europa.eu/{rmlId}/dossier-view/{assetExternalId}/{repeated_dose_id}"
)
json_formatted_str = json.dumps(results)
logger.info(f"echaFind.search_dossier() OK. output: {json_formatted_str}")
return results
if __name__ == "__main__":
search_dossier("100-41-4", input_type='rmlCas')
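The dossier-list query above is assembled by string concatenation; a sketch of the same URL built with `urllib.parse.urlencode` is shown below. The parameter names are copied from the hard-coded URL in the code, not from any documented ECHA API surface.

```python
from urllib.parse import urlencode

def dossier_list_url(rml_id: str, status: str = "Active") -> str:
    # Mirrors the hard-coded dossier-list query used above; the parameter
    # names are assumptions taken from that URL, not from official API docs.
    base = "https://chem.echa.europa.eu/api-dossier-list/v1/dossier"
    query = urlencode({
        "pageIndex": 1,
        "pageSize": 100,
        "rmlId": rml_id,
        "registrationStatuses": status,
    })
    return f"{base}?{query}"

url = dossier_list_url("12345", status="Inactive")
```

Building the query with `urlencode` also takes care of percent-encoding, which the concatenated version skips.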


@@ -1,477 +0,0 @@
import os
import base64
import traceback
import logging # Import logging module
import datetime
import pandas as pd
# import time # Keep if you use page.wait_for_timeout
from playwright.sync_api import sync_playwright, TimeoutError # Catch specific errors
from pif_compiler.services.echa_find import search_dossier
import requests
# --- Basic Logging Setup (Commented Out) ---
# # Configure logging - uncomment and customize level/handler as needed
# logging.basicConfig(
# level=logging.INFO, # Or DEBUG for more details
# format='%(asctime)s - %(levelname)s - %(message)s',
# # filename='pdf_generator.log', # Optional: Log to a file
# # filemode='a'
# )
# --- End Logging Setup ---
# Assume svg_to_data_uri is defined elsewhere correctly
def svg_to_data_uri(svg_path):
try:
if not os.path.exists(svg_path):
# logging.error(f"SVG file not found: {svg_path}") # Example logging
raise FileNotFoundError(f"SVG file not found: {svg_path}")
with open(svg_path, 'rb') as f:
svg_content = f.read()
encoded_svg = base64.b64encode(svg_content).decode('utf-8')
return f"data:image/svg+xml;base64,{encoded_svg}"
except Exception as e:
print(f"Error converting SVG {svg_path}: {e}")
# logging.error(f"Error converting SVG {svg_path}: {e}", exc_info=True) # Example logging
return None
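The Data-URI conversion above can be exercised with a throwaway SVG written to a temp file. A minimal sketch (the sample SVG bytes are invented):

```python
import base64
import os
import tempfile

def svg_to_data_uri(svg_path):
    # Same idea as above: base64-encode the raw SVG bytes into a data: URI
    # that can be inlined as an <img src="..."> in the header template.
    with open(svg_path, 'rb') as f:
        encoded = base64.b64encode(f.read()).decode('utf-8')
    return f"data:image/svg+xml;base64,{encoded}"

# Round-trip a throwaway SVG through the converter.
svg_bytes = b"<svg xmlns='http://www.w3.org/2000/svg'/>"
with tempfile.NamedTemporaryFile(suffix=".svg", delete=False) as tmp:
    tmp.write(svg_bytes)
    tmp_path = tmp.name
uri = svg_to_data_uri(tmp_path)
os.unlink(tmp_path)
```

Decoding the payload after the `base64,` marker should return the original bytes unchanged.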
# --- JavaScript Expressions ---
# Define the cleanup logic as an immediately-invoked arrow function expression
# NOTE: .das-block_empty removal is currently disabled as per previous step
cleanup_js_expression = """
() => {
console.log('Running cleanup JS (DISABLED .das-block_empty removal)...');
let totalRemoved = 0;
// Example 1: Remove sections explicitly marked as empty (Currently Disabled)
// const emptyBlocks = document.querySelectorAll('.das-block_empty');
// emptyBlocks.forEach(el => {
// if (el && el.parentNode) {
// console.log(`Removing '.das-block_empty' block with ID: ${el.id || 'N/A'}`);
// el.remove();
// totalRemoved++;
// }
// });
// Add other specific cleanup logic here if needed
console.log(`Cleanup script removed ${totalRemoved} elements (DISABLED .das-block_empty removal).`);
return totalRemoved; // Return the count
}
"""
# --- End JavaScript Expressions ---
def generate_pdf_with_header_and_cleanup(
url,
pdf_path,
substance_name,
substance_link,
ec_number,
cas_number,
header_template_path=r"src\func\resources\injectableHeader.html",
echa_chem_logo_path=r"src\func\resources\echa_chem_logo.svg",
echa_logo_path=r"src\func\resources\ECHA_Logo.svg"
) -> bool: # Added return type hint
"""
Generates a PDF with a dynamic header and optionally removes empty sections.
Provides basic logging (commented out) and returns True/False for success/failure.
Args:
url (str): The target URL OR local HTML file path.
pdf_path (str): The output PDF path.
substance_name (str): The name of the chemical substance.
substance_link (str): The URL the substance name should link to (in header).
ec_number (str): The EC number for the substance.
cas_number (str): The CAS number for the substance.
header_template_path (str): Path to the HTML header template file.
echa_chem_logo_path (str): Path to the echa_chem_logo.svg file.
echa_logo_path (str): Path to the ECHA_Logo.svg file.
Returns:
bool: True if the PDF was generated successfully, False otherwise.
"""
final_header_html = None
# logging.info(f"Starting PDF generation for URL: {url} to path: {pdf_path}") # Example logging
# --- 1. Prepare Header HTML ---
try:
# logging.debug(f"Reading header template from: {header_template_path}") # Example logging
print(f"Reading header template from: {header_template_path}")
if not os.path.exists(header_template_path):
raise FileNotFoundError(f"Header template file not found: {header_template_path}")
with open(header_template_path, 'r', encoding='utf-8') as f:
header_template_content = f.read()
if not header_template_content:
raise ValueError("Header template file is empty.")
# logging.debug("Converting logos...") # Example logging
print("Converting logos...")
logo1_data_uri = svg_to_data_uri(echa_chem_logo_path)
logo2_data_uri = svg_to_data_uri(echa_logo_path)
if not logo1_data_uri or not logo2_data_uri:
raise ValueError("Failed to convert one or both logos to Data URIs.")
# logging.debug("Replacing placeholders...") # Example logging
print("Replacing placeholders...")
final_header_html = header_template_content.replace("##ECHA_CHEM_LOGO_SRC##", logo1_data_uri)
final_header_html = final_header_html.replace("##ECHA_LOGO_SRC##", logo2_data_uri)
final_header_html = final_header_html.replace("##SUBSTANCE_NAME##", substance_name)
final_header_html = final_header_html.replace("##SUBSTANCE_LINK##", substance_link)
final_header_html = final_header_html.replace("##EC_NUMBER##", ec_number)
final_header_html = final_header_html.replace("##CAS_NUMBER##", cas_number)
if "##" in final_header_html:
print("Warning: Not all placeholders seem replaced in the header HTML.")
# logging.warning("Not all placeholders seem replaced in the header HTML.") # Example logging
except Exception as e:
print(f"Error during header setup phase: {e}")
traceback.print_exc()
# logging.error(f"Error during header setup phase: {e}", exc_info=True) # Example logging
return False # Return False on header setup failure
# --- End Header Prep ---
# --- CSS Override Definition ---
# Using Revision 4 from previous step (simplified breaks, boundary focus)
selectors_to_fix = [
'.das-field .das-field_value_html',
'.das-field .das-field_value_large',
'.das-field .das-value_remark-text'
]
css_selector_string = ",\n".join(selectors_to_fix)
css_override = f"""
<style id='pdf-override-styles'>
/* Basic Resets & Overflows */
html, body {{ height: auto !important; overflow: visible !important; margin: 0 !important; padding: 0 !important; }}
* {{ box-sizing: border-box; }}
{css_selector_string} {{
overflow: visible !important; overflow-y: visible !important; height: auto !important; max-height: none !important;
}}
/* Boundary Fix */
#pdf-custom-header {{ margin-bottom: 0 !important; padding-bottom: 1px !important; page-break-after: auto !important; display: block !important; }}
#pdf-custom-header + .body-inner {{ margin-top: 0 !important; padding-top: 0 !important; page-break-before: auto !important; display: block !important; }}
.body-inner .document-header {{ margin-top: 0 !important; padding-top: 0 !important; page-break-before: auto !important; }}
/* Simplified Page Breaks */
.body-inner h1, .body-inner h2, .body-inner h3, .body-inner h4, .body-inner h5, .body-inner h6 {{ page-break-after: avoid !important; }}
#pdf-custom-header h2 {{ page-break-after: auto !important; }}
@media print {{
html, body {{ height: auto !important; overflow: visible !important; margin: 0; padding: 0; }}
#pdf-custom-header {{ margin-bottom: 0 !important; padding-bottom: 1px !important; page-break-after: auto !important; display: block !important;}}
#pdf-custom-header + .body-inner {{ margin-top: 0 !important; padding-top: 0 !important; page-break-before: auto !important; display: block !important; }}
.body-inner .document-header {{ margin-top: 0 !important; padding-top: 0 !important; page-break-before: auto !important; }}
.body-inner h1, .body-inner h2, .body-inner h3, .body-inner h4, .body-inner h5, .body-inner h6 {{ page-break-after: avoid !important; }}
#pdf-custom-header h2 {{ page-break-after: auto !important; }}
.das-doc-toolbar, .document-header__section-links, #das-totop {{ display: none !important; }}
}}
</style>
"""
# --- End CSS Override Definition ---
# --- Playwright Automation ---
try:
with sync_playwright() as p:
# logging.debug("Launching browser...") # Example logging
# browser = p.chromium.launch(headless=False, devtools=True) # For debugging
browser = p.chromium.launch()
page = browser.new_page()
# Capture console messages (Corrected: use msg.text)
page.on("console", lambda msg: print(f"Browser Console: {msg.text}"))
try:
# logging.info(f"Navigating to page: {url}") # Example logging
print(f"Navigating to: {url}")
if os.path.exists(url) and not url.startswith('file://'):
page_url = f'file://{os.path.abspath(url)}'
# logging.info(f"Treating as local file: {page_url}") # Example logging
print(f"Treating as local file: {page_url}")
else:
page_url = url
page.goto(page_url, wait_until='load', timeout=90000)
# logging.info("Page navigation complete.") # Example logging
# logging.debug("Injecting header HTML...") # Example logging
print("Injecting header HTML...")
page.evaluate('(headerHtml) => { document.body.insertAdjacentHTML("afterbegin", headerHtml); }', final_header_html)
# logging.debug("Injecting CSS overrides...") # Example logging
print("Injecting CSS overrides...")
page.evaluate("""(css) => {
const existingStyle = document.getElementById('pdf-override-styles');
if (existingStyle) existingStyle.remove();
document.head.insertAdjacentHTML('beforeend', css);
}""", css_override)
# logging.debug("Running JavaScript cleanup function...") # Example logging
print("Running JavaScript cleanup function...")
elements_removed_count = page.evaluate(cleanup_js_expression)
# logging.info(f"Cleanup script finished (reported removing {elements_removed_count} elements).") # Example logging
print(f"Cleanup script finished (reported removing {elements_removed_count} elements).")
# --- Optional: Emulate Print Media ---
# print("Emulating print media...")
# page.emulate_media(media='print')
# --- Generate PDF ---
# logging.info(f"Generating PDF: {pdf_path}") # Example logging
print(f"Generating PDF: {pdf_path}")
pdf_options = {
"path": pdf_path, "format": "A4", "print_background": True,
"margin": {'top': '20px', 'bottom': '20px', 'left': '20px', 'right': '20px'},
"scale": 1.0
}
page.pdf(**pdf_options)
# logging.info(f"PDF saved successfully to: {pdf_path}") # Example logging
print(f"PDF saved successfully to: {pdf_path}")
# logging.debug("Closing browser.") # Example logging
print("Closing browser.")
browser.close()
return True # Indicate success
except TimeoutError as e:
print(f"A Playwright TimeoutError occurred: {e}")
traceback.print_exc()
# logging.error(f"Playwright TimeoutError occurred: {e}", exc_info=True) # Example logging
browser.close() # Ensure browser is closed on error
return False # Indicate failure
except Exception as e: # Catch other potential errors during Playwright page operations
print(f"An unexpected error occurred during Playwright page operations: {e}")
traceback.print_exc()
# logging.error(f"Unexpected error during Playwright page operations: {e}", exc_info=True) # Example logging
# Optional: Save HTML state on error
try:
html_content = page.content()
error_html_path = pdf_path.replace('.pdf', '_error.html')
with open(error_html_path, 'w', encoding='utf-8') as f_err:
f_err.write(html_content)
# logging.info(f"Saved HTML state on error to: {error_html_path}") # Example logging
print(f"Saved HTML state on error to: {error_html_path}")
except Exception as save_e:
# logging.error(f"Could not save HTML state on error: {save_e}", exc_info=True) # Example logging
print(f"Could not save HTML state on error: {save_e}")
browser.close() # Ensure browser is closed on error
return False # Indicate failure
# Note: The finally block for the 'with sync_playwright()' context
# is handled automatically by the 'with' statement.
except Exception as e:
# Catch errors during Playwright startup (less common)
print(f"An error occurred during Playwright setup/teardown: {e}")
traceback.print_exc()
# logging.error(f"Error during Playwright setup/teardown: {e}", exc_info=True) # Example logging
return False # Indicate failure
# --- Example Usage ---
# result = generate_pdf_with_header_and_cleanup(
# url='path/to/your/input.html',
# pdf_path='output.pdf',
# substance_name='Glycerol Example',
# substance_link='http://example.com/glycerol',
# ec_number='200-289-5',
# cas_number='56-81-5',
# )
#
# if result:
# print("PDF Generation Succeeded.")
# # logging.info("Main script: PDF Generation Succeeded.") # Example logging
# else:
# print("PDF Generation Failed.")
# # logging.error("Main script: PDF Generation Failed.") # Example logging
def search_generate_pdfs(
cas_number_to_search: str,
page_types_to_extract: list[str],
base_output_folder: str = "data/library"
) -> bool:
"""
Searches for a substance by CAS, saves raw HTML and generates PDFs for
specified page types. Uses '_js' link variant for the PDF header link if available.
Args:
cas_number_to_search (str): CAS number to search for.
page_types_to_extract (list[str]): List of page type names (e.g., 'RepeatedDose').
Expects '{page_type}' and '{page_type}_js' keys
in the search result.
base_output_folder (str): Root directory for saving HTML/PDFs.
Returns:
bool: True if substance found and >=1 requested PDF generated, False otherwise.
"""
# logging.info(f"Starting process for CAS: {cas_number_to_search}")
print(f"\n===== Processing request for CAS: {cas_number_to_search} =====")
# --- 1. Search for Dossier Information ---
try:
# logging.debug(f"Calling search_dossier for CAS: {cas_number_to_search}")
search_result = search_dossier(substance=cas_number_to_search, input_type='rmlCas')
except Exception as e:
print(f"Error during dossier search for CAS '{cas_number_to_search}': {e}")
traceback.print_exc()
# logging.error(f"Exception during search_dossier for CAS '{cas_number_to_search}': {e}", exc_info=True)
return False
if not search_result:
print(f"Substance not found or search failed for CAS: {cas_number_to_search}")
# logging.warning(f"Substance not found or search failed for CAS: {cas_number_to_search}")
return False
# logging.info(f"Substance found for CAS: {cas_number_to_search}")
print(f"Substance found: {search_result.get('rmlName', 'N/A')}")
# --- 2. Extract Details and Filter Pages ---
try:
# Extract required info
rml_id = search_result.get('rmlId')
rml_name = search_result.get('rmlName')
rml_cas = search_result.get('rmlCas')
rml_ec = search_result.get('rmlEc')
asset_ext_id = search_result.get('assetExternalId')
# Basic validation
if not all([rml_id, rml_name, rml_cas, rml_ec, asset_ext_id]):
missing_keys = [k for k, v in {'rmlId': rml_id, 'rmlName': rml_name, 'rmlCas': rml_cas, 'rmlEc': rml_ec, 'assetExternalId': asset_ext_id}.items() if not v]
message = f"Search result for {cas_number_to_search} is missing required keys: {missing_keys}"
print(f"Error: {message}")
# logging.error(message)
return False
# --- Filtering Logic - Collect pairs of URLs ---
pages_to_process_list = [] # Store tuples: (page_name, raw_url, js_url)
# logging.debug(f"Filtering pages. Requested: {page_types_to_extract}.")
for page_type in page_types_to_extract:
raw_url_key = page_type
js_url_key = f"{page_type}_js"
raw_url = search_result.get(raw_url_key)
js_url = search_result.get(js_url_key) # Get the JS URL
# Check if both URLs are valid strings
if raw_url and isinstance(raw_url, str) and raw_url.strip():
if js_url and isinstance(js_url, str) and js_url.strip():
pages_to_process_list.append((page_type, raw_url, js_url))
# logging.debug(f"Found valid pair for '{page_type}': Raw='{raw_url}', JS='{js_url}'")
else:
# Found raw URL but not a valid JS URL - skip this page type for PDF?
# Or use raw_url for header too? Let's skip if JS URL is missing/invalid.
print(f"Found raw URL for '{page_type}' but missing/invalid JS URL ('{js_url}'). Skipping PDF generation for this type.")
# logging.warning(f"Missing/invalid JS URL for page type '{page_type}' for {rml_cas}. Raw URL: '{raw_url}'.")
else:
# Raw URL missing or invalid
if page_type in search_result: # Check if key existed at all
print(f"Found page type key '{page_type}' for {rml_cas}, but its value is not a valid URL ('{raw_url}'). Skipping.")
# logging.warning(f"Invalid raw URL value for page type '{page_type}' for {rml_cas}: '{raw_url}'.")
else:
print(f"Requested page type key '{page_type}' not found in search results for {rml_cas}.")
# logging.warning(f"Requested page type key '{page_type}' not found for {rml_cas}.")
# --- End Filtering Logic ---
if not pages_to_process_list:
print(f"After filtering, no requested page types ({page_types_to_extract}) resulted in a valid pair of Raw and JS URLs for substance {rml_cas}.")
# logging.warning(f"No pages with valid URL pairs to process for substance {rml_cas}.")
return False # Nothing to generate
except Exception as e:
print(f"Error processing search result for '{cas_number_to_search}': {e}")
traceback.print_exc()
# logging.error(f"Error processing search result for '{cas_number_to_search}': {e}", exc_info=True)
return False
# --- 3. Prepare Folders ---
safe_cas = rml_cas.replace('/', '_').replace('\\', '_')
substance_folder_name = f"{safe_cas}_{rml_ec}_{rml_id}"
substance_folder_path = os.path.join(base_output_folder, substance_folder_name)
try:
os.makedirs(substance_folder_path, exist_ok=True)
# logging.info(f"Ensured output directory exists: {substance_folder_path}")
print(f"Ensured output directory exists: {substance_folder_path}")
except OSError as e:
print(f"Error creating directory {substance_folder_path}: {e}")
# logging.error(f"Failed to create directory {substance_folder_path}: {e}", exc_info=True)
return False
# --- 4. Process Each Page (Save HTML, Generate PDF) ---
successful_pages = [] # Track successful PDF generations
overall_success = False # Track if any PDF was generated
for page_name, raw_html_url, js_header_link in pages_to_process_list:
print(f"\nProcessing page: {page_name}")
base_filename = f"{safe_cas}_{page_name}"
html_filename = f"{base_filename}.html"
pdf_filename = f"{base_filename}.pdf"
html_full_path = os.path.join(substance_folder_path, html_filename)
pdf_full_path = os.path.join(substance_folder_path, pdf_filename)
# --- Save Raw HTML ---
html_saved = False
try:
# logging.debug(f"Fetching raw HTML for {page_name} from {raw_html_url}")
print(f"Fetching raw HTML from: {raw_html_url}")
# Add headers to mimic a browser slightly if needed
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
response = requests.get(raw_html_url, timeout=30, headers=headers) # 30s timeout
response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)
# Decide encoding - response.text tries to guess, or use apparent_encoding
# Or assume utf-8 if unsure, which is common.
html_content = response.content.decode('utf-8', errors='replace')
with open(html_full_path, 'w', encoding='utf-8') as f:
f.write(html_content)
html_saved = True
# logging.info(f"Successfully saved raw HTML for {page_name} to {html_full_path}")
print(f"Successfully saved raw HTML to: {html_full_path}")
except requests.exceptions.RequestException as req_e:
print(f"Error fetching raw HTML for {page_name} from {raw_html_url}: {req_e}")
# logging.error(f"Error fetching raw HTML for {page_name}: {req_e}", exc_info=True)
except IOError as io_e:
print(f"Error saving raw HTML for {page_name} to {html_full_path}: {io_e}")
# logging.error(f"Error saving raw HTML for {page_name}: {io_e}", exc_info=True)
except Exception as e: # Catch other potential errors like decoding
print(f"Unexpected error saving HTML for {page_name}: {e}")
# logging.error(f"Unexpected error saving HTML for {page_name}: {e}", exc_info=True)
# --- Generate PDF (using raw URL for content, JS URL for header link) ---
# logging.info(f"Generating PDF for {page_name} from {raw_html_url}")
print(f"Generating PDF using content from: {raw_html_url}")
pdf_success = generate_pdf_with_header_and_cleanup(
url=raw_html_url, # Use raw URL for Playwright navigation/content
pdf_path=pdf_full_path,
substance_name=rml_name,
substance_link=js_header_link, # Use JS URL for the link in the header
ec_number=rml_ec,
cas_number=rml_cas
)
if pdf_success:
successful_pages.append(page_name) # Log success based on PDF generation
overall_success = True
# logging.info(f"Successfully generated PDF for {page_name} at {pdf_full_path}")
print(f"Successfully generated PDF for {page_name}")
else:
# logging.error(f"Failed to generate PDF for {page_name} from {raw_html_url}")
print(f"Failed to generate PDF for {page_name}")
# Decide if failure to save HTML should affect overall success or logging...
# Currently, success is tied only to PDF generation.
print(f"===== Finished request for CAS: {cas_number_to_search} =====")
print(f"Successfully generated {len(successful_pages)} PDFs: {successful_pages}")
return overall_success # Return success based on PDF generation
# Ad-hoc smoke test: render a known dossier page to PDF.
# Guarded so it doesn't launch a browser on import.
if __name__ == "__main__":
with sync_playwright() as p:
browser = p.chromium.launch()
page = browser.new_page()
page.goto("https://chem.echa.europa.eu/html-pages-prod/e4c88c6e-06c7-4daa-b0fb-1a55459ac22f/documents/IUC5-5f55d8ec-7a71-4e2c-9955-8469ead9fe84_0035f3f8-7467-4944-9028-1db2e9c99565.html")
page.pdf(path='output.pdf')
browser.close()


@@ -1,947 +0,0 @@
from pif_compiler.services.echa_find import search_dossier
from bs4 import BeautifulSoup
from markdownify import MarkdownConverter
import pandas as pd
import requests
import os
import re
import markdown_to_json
import json
import copy
import unicodedata
from datetime import datetime
import logging
import duckdb
# Logging settings
logging.basicConfig(
format="{asctime} - {levelname} - {message}",
style="{",
datefmt="%Y-%m-%d %H:%M",
filename="echa.log",
encoding="utf-8",
filemode="a",
level=logging.INFO,
)
try:
# Load the full scraping into memory if it exists
con = duckdb.connect()
res = con.sql("""
CREATE TABLE echa_full_scraping AS
SELECT * FROM read_csv_auto('src/data/echa_full_scraping.csv');
""") # read the CSV file as an in-memory table
logging.info(
f"echa.echaProcess().main: Loaded echa scraped data into duckdb memory. First CAS in the df is: {con.sql('select CAS from echa_full_scraping limit 1').fetchone()[0]}"
)
local_echa = True
except Exception:
local_echa = False
logging.error("echa.echaProcess().main: No local echa scraped data found")
# Method to find the relevant information on the ECHA site.
# Works both with the substance name and with the CAS number.
def openEchaPage(link, local=False):
soup = None
try:
if local:
page = open(link, encoding="utf8")
soup = BeautifulSoup(page, "html.parser")
else:
page = requests.get(link)
page.encoding = "utf-8"
soup = BeautifulSoup(page.text, "html.parser")
except Exception:
logging.error(
f"echa.echaProcess.openEchaPage() error. could not open: '{link}'",
exc_info=True,
)
return soup
# Method to turn an ECHA page into Markdown
def echaPage_to_md(sezione, scrapingType=None, local=False, substance=None):
# sezione : the soup of the page extracted via search_dossier
# scrapingType : 'RepeatedDose' or 'AcuteToxicity'
# local : whether to save the markdown content locally. Useful for debugging
# substance : the substance name, used to save it under the correct path
# Create shorthand method for conversion
def md(soup, **options):
return MarkdownConverter(**options).convert_soup(soup)
output = md(sezione)
# Convert the HTML section into a markdown, which then needs fixing.
# The way the .md is fixed changes a little depending on the type of page being scraped;
# exceptions are added as new substances are tested.
if scrapingType == "RepeatedDose":
output = output.replace("### Oral route", "#### oral")
output = output.replace("### Dermal", "#### dermal")
output = output.replace("### Inhalation", "#### inhalation")
# Replace > and < with words, otherwise the jsonifier interprets those symbols as markup and wraps the text in []
output = re.sub(r">\s+", "greater than ", output)
# Replace '<' followed by whitespace with 'less than '
output = re.sub(r"<\s+", "less than ", output)
output = re.sub(r">=\s*\n", "greater or equal than ", output)
output = re.sub(r"<=\s*\n", "less or equal than ", output)
elif scrapingType == "AcuteToxicity":
# Replace > and < with words, otherwise the jsonifier interprets those symbols as markup and wraps the text in []
output = re.sub(r">\s+", "greater than ", output)
# Replace '<' followed by whitespace with 'less than '
output = re.sub(r"<\s+", "less than ", output)
output = re.sub(r">=\s*\n", "greater or equal than ", output)
output = re.sub(r"<=\s*\n", "less or equal than ", output)
output = output.replace("–", "-")
output = re.sub(r"\s+mg", " mg", output)
# this part fixes units of measure that wrap to a new line and end up separated from their value
if local and substance:
path = f"{scrapingType}/mds/{substance}.md"
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "w") as text_file:
text_file.write(output)
return output
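The comparator substitutions above can be checked in isolation. A sketch using the same patterns (the sample string is invented):

```python
import re

def spell_out_comparators(text):
    # markdown_to_json treats bare '>' and '<' as markup, so they are
    # spelled out as words before jsonifying (mirrors the substitutions above).
    text = re.sub(r">=\s*\n", "greater or equal than ", text)
    text = re.sub(r"<=\s*\n", "less or equal than ", text)
    text = re.sub(r">\s+", "greater than ", text)
    text = re.sub(r"<\s+", "less than ", text)
    return text

out = spell_out_comparators("NOAEL: > 1000 mg/kg and < 2000 mg/kg")
```

Note that `>=` followed by `=` is not touched by the `>\s+` pattern, so the order of the four substitutions does not change the result here.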
# This is part 2 of the ECHA site processing. The markdown has to be turned into a JSON.
def markdown_to_json_raw(output, scrapingType=None, local=False, substance=None):
# output : the markdown
# scrapingType : 'RepeatedDose' or 'AcuteToxicity'
# substance : the substance name, used to save it under the correct path
jsonified = markdown_to_json.jsonify(output)
dictified = json.loads(jsonified)
# Save the initial JSON exactly as it comes out of jsonify
if local and scrapingType and substance:
path = f"{scrapingType}/jsons/raws/{substance}_raw0.json"
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "w") as text_file:
text_file.write(jsonified)
# Now split the contents of the nested dictionaries.
for key, value in dictified.items():
if isinstance(value, dict):
for key2, value2 in value.items():
parts = value2.split("\n\n")
dictified[key][key2] = {
parts[i]: parts[i + 1]
for i in range(0, len(parts) - 1, 2)
if parts[i + 1] != "[Empty]"
}
else:
parts = value.split("\n\n")
dictified[key] = {
parts[i]: parts[i + 1]
for i in range(0, len(parts) - 1, 2)
if parts[i + 1] != "[Empty]"
}
jsonified = json.dumps(dictified)
if local and scrapingType and substance:
path = f"{scrapingType}/jsons/raws/{substance}_raw1.json"
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "w") as text_file:
text_file.write(jsonified)
return jsonified
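The key/value splitting applied to the jsonified markdown above can be sketched on its own (the sample block is invented):

```python
def pairs_to_dict(block):
    # The jsonified markdown leaves flat "key\n\nvalue\n\nkey\n\nvalue" runs;
    # split them into alternating key/value pairs, dropping '[Empty]' values
    # (same dict comprehension as in markdown_to_json_raw above).
    parts = block.split("\n\n")
    return {
        parts[i]: parts[i + 1]
        for i in range(0, len(parts) - 1, 2)
        if parts[i + 1] != "[Empty]"
    }

out = pairs_to_dict("Dose descriptor\n\nNOAEL\n\nRemarks\n\n[Empty]")
```

A trailing key with no value is silently dropped by the `len(parts) - 1` bound, which matches the behaviour of the code above.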
# Method created by Claude to fix Unicode-character issues
def normalize_unicode_characters(text):
"""
Normalize Unicode characters, with special handling for superscript
"""
if not isinstance(text, str):
return text
# Specific replacements for common Unicode encoding issues
# and for other particular exceptions
replacements = {
"\u00c2\u00b2": "²", # mojibake 'Â²' -> '²'
"\u00c2\u00b3": "³", # mojibake 'Â³' -> '³'
"\u00b2": "²", # Bare superscript 2
"\u00b3": "³", # Bare superscript 3
"\n": "", # occasionally there are stray \n characters to remove
"greater than": ">",
"less than": "<",
"greater or equal than": ">=",
"less or equal than": "<=",
# These last entries restore > and < (and variants), which were temporarily spelled out as words because they break the jsonifier
}
# Apply specific replacements first
for old, new in replacements.items():
text = text.replace(old, new)
# Normalize Unicode characters
text = unicodedata.normalize("NFKD", text)
return text
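The replacement table above can be exercised directly. A condensed sketch (the input string is invented, and note that the final NFKD pass flattens superscripts to plain digits):

```python
import unicodedata

def normalize_unicode_characters(text):
    # Condensed version of the replacements above: repair mojibake
    # superscripts, drop stray newlines, and restore comparator symbols.
    if not isinstance(text, str):
        return text
    replacements = {
        "\u00c2\u00b2": "\u00b2",  # mojibake 'Â²' -> '²'
        "\u00c2\u00b3": "\u00b3",  # mojibake 'Â³' -> '³'
        "\n": "",
        "greater or equal than": ">=",
        "less or equal than": "<=",
        "greater than": ">",
        "less than": "<",
    }
    for old, new in replacements.items():
        text = text.replace(old, new)
    return unicodedata.normalize("NFKD", text)

out = normalize_unicode_characters("greater than 1000 mg/m\u00c2\u00b3\n")
```

Because NFKD decomposes `³` into the plain digit `3`, the mojibake repair and the normalization together turn `mg/mÂ³` into `mg/m3`.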
# Another method created by Claude.
# Recursively filtering a nested dictionary is easy to get wrong by hand.
def clean_json(data):
"""
Recursively clean JSON by removing empty/uninformative entries
and normalizing Unicode characters
"""
def is_uninformative(value, context=None):
"""
Check if a dictionary entry is considered uninformative
Args:
value: The value to check
context: Additional context about where the value is located
"""
# Specific exceptions
if context and context == "Key value for chemical safety assessment":
# Always keep all entries in this specific section
return False
uninformative_values = ["hours/week", "", None]
return value in uninformative_values or (
isinstance(value, str)
and (
value.strip() in uninformative_values
or value.lower() == "no information available"
)
)
def clean_recursive(obj, context=None):
# If it's a dictionary, process its contents
if isinstance(obj, dict):
# Create a copy to modify
cleaned = {}
for key, value in obj.items():
# Normalize key
normalized_key = normalize_unicode_characters(key)
# Set context for nested dictionaries
new_context = context or normalized_key
# Recursively clean nested structures
cleaned_value = clean_recursive(value, new_context)
# Conditions for keeping the entry
keep_entry = (
cleaned_value not in [None, {}, ""]
and not (
isinstance(cleaned_value, dict) and len(cleaned_value) == 0
)
and not is_uninformative(cleaned_value, new_context)
)
# Add to cleaned dict if conditions are met
if keep_entry:
cleaned[normalized_key] = cleaned_value
return cleaned if cleaned else None
# If it's a list, clean each item
elif isinstance(obj, list):
cleaned_list = [clean_recursive(item, context) for item in obj]
cleaned_list = [item for item in cleaned_list if item not in [None, {}, ""]]
return cleaned_list if cleaned_list else None
# For strings, normalize Unicode
elif isinstance(obj, str):
return normalize_unicode_characters(obj)
# Return as-is for other types
return obj
# Create a deep copy to avoid modifying original data
cleaned_data = clean_recursive(copy.deepcopy(data))
# The tricky part: walking nested dictionaries without mutating them mid-iteration
return cleaned_data
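A condensed sketch of the pruning logic, as a hypothetical simplified version that ignores the "Key value for chemical safety assessment" context exception:

```python
import copy

def prune(obj):
    # Simplified recursive cleaner: drop None/""/empty containers and
    # "no information available" strings, without mutating the input.
    if isinstance(obj, dict):
        cleaned = {}
        for key, value in obj.items():
            value = prune(value)
            if value in (None, "", {}):
                continue
            if isinstance(value, str) and value.lower() == "no information available":
                continue
            cleaned[key] = value
        return cleaned or None
    if isinstance(obj, list):
        items = [i for i in (prune(x) for x in obj) if i not in (None, "", {})]
        return items or None
    return obj

data = {
    "Oral": {"Dose descriptor": "NOAEL", "Remarks": ""},
    "Dermal": {"Effect level": "No information available"},
}
result = prune(copy.deepcopy(data))
print(result)  # {'Oral': {'Dose descriptor': 'NOAEL'}}
```

Dictionaries that become empty after pruning collapse to None and are dropped by their parent, which is the same cascading behavior clean_recursive implements.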
def json_to_dataframe(cleaned_json, scrapingType):
rows = []
schema = {
"RepeatedDose": [
"Substance",
"CAS",
"Toxicity Type",
"Route",
"Dose descriptor",
"Effect level",
"Species",
"Extraction_Timestamp",
"Endpoint conclusion",
],
"AcuteToxicity": [
"Substance",
"CAS",
"Route",
"Endpoint conclusion",
"Dose descriptor",
"Effect level",
"Extraction_Timestamp",
],
}
if scrapingType == "RepeatedDose":
# Iterate through top-level sections (excluding 'Key value for chemical safety assessment')
for toxicity_type, routes in cleaned_json.items():
if toxicity_type == "Key value for chemical safety assessment":
continue
# Iterate through routes within each toxicity type
for route, details in routes.items():
row = {"Toxicity Type": toxicity_type, "Route": route}
# Add details to the row, excluding 'Link to relevant study record(s)'
row.update(
{
k: v
for k, v in details.items()
if k != "Link to relevant study record(s)"
}
)
rows.append(row)
elif scrapingType == "AcuteToxicity":
for toxicity_type, routes in cleaned_json.items():
if (
toxicity_type == "Key value for chemical safety assessment"
or not routes
):
continue
row = {
"Route": toxicity_type.replace("Acute toxicity: via", "")
.replace("route", "")
.strip()
}
# Add details directly from the routes dictionary
row.update(
{
k: v
for k, v in routes.items()
if k != "Link to relevant study record(s)"
}
)
rows.append(row)
# Create DataFrame
df = pd.DataFrame(rows)
# Last-minute fix to enforce a consistent schema
fair_columns = list(set(schema["RepeatedDose"] + schema["AcuteToxicity"]))
df = df.loc[:, df.columns.intersection(fair_columns)]
return df
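The row-building loop above flattens a two-level mapping (toxicity type -> route -> details) into flat records while dropping the study-record links. A stdlib-only sketch with made-up miniature data:

```python
# Made-up miniature of the cleaned_json structure for the RepeatedDose case.
cleaned_json = {
    "Repeated dose toxicity: oral": {
        "oral": {
            "Dose descriptor": "NOAEL",
            "Effect level": "50 mg/kg bw/day",
            "Link to relevant study record(s)": "ignored",
        },
    },
    "Key value for chemical safety assessment": {"skipped": True},
}

rows = []
for toxicity_type, routes in cleaned_json.items():
    if toxicity_type == "Key value for chemical safety assessment":
        continue
    for route, details in routes.items():
        row = {"Toxicity Type": toxicity_type, "Route": route}
        # Drop the study-record links, keep everything else
        row.update({k: v for k, v in details.items()
                    if k != "Link to relevant study record(s)"})
        rows.append(row)

print(rows[0]["Effect level"])  # 50 mg/kg bw/day
```

Each flat record then becomes one DataFrame row, which is why the schema enforcement afterwards only has to intersect column names.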
def save_dataframe(df, file_path, scrapingType, schema):
"""
Save DataFrame with strict column requirements.
Args:
df (pd.DataFrame): DataFrame to potentially append
file_path (str): Path of CSV file
"""
# Mandatory columns for saved DataFrame
saved_columns = schema[scrapingType]
# Require an "Effect level" column before saving
if "Effect level" not in df.columns:
return
# Reindex to the saved schema, filling missing columns with NaN
df = df.reindex(columns=saved_columns)
# Drop rows that have no value for "Effect level"
df = df[df["Effect level"].notna()]
# Append if the file already exists, otherwise create it with a header
file_exists = os.path.exists(file_path)
df.to_csv(
file_path,
mode="a" if file_exists else "w",
header=not file_exists,
index=False,
)
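The append-or-create pattern used by save_dataframe (append mode plus a header only on the first write) can be exercised in isolation. A sketch assuming pandas is available, with a temporary file and invented batch data:

```python
import os
import tempfile

import pandas as pd

schema_cols = ["Substance", "CAS", "Effect level"]
path = os.path.join(tempfile.mkdtemp(), "echa.csv")

for batch in (
    pd.DataFrame({"Substance": ["glycerol"], "Effect level": ["50 mg/kg"]}),
    pd.DataFrame({"CAS": ["56-81-5"], "Effect level": ["200 mg/kg"]}),
):
    # Force the schema: missing columns become NaN, extras are dropped
    batch = batch.reindex(columns=schema_cols)
    exists = os.path.exists(path)
    # Append if the file exists; write the header only on the first write
    batch.to_csv(path, mode="a" if exists else "w", header=not exists, index=False)

combined = pd.read_csv(path)
print(len(combined))  # 2
```

Reindexing every batch to the same column list is what keeps rows from different scrapes aligned in the growing CSV.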
def echaExtract(
substance: str,
scrapingType: str,
outputType="df",
key_infos=False,
local_search=False,
local_only = False
):
"""
Funzione principale per scrapare dal sito ECHA. Mette insieme tante funzioni diverse di ricerca, estrazione e pulizia.
Registra il logging delle operazioni.
Args:
substance (str): CAS o nome della sostanza. Vanno bene entrambi ma il CAS funziona meglio.
scrapingType (str): 'AcuteToxicity' (LD50) o 'RepeatedDose' (NOAEL)
outputType (str): 'pd.DataFrame' o 'json' (sconsigliato)
key_infos (bool): Di base True. Specifica se cercare la sezione "Description of Key Information" nei dossiers.
Certe sostanze hanno i dati inseriti a cazzo e mettono le informazioni in forma discorsiva al posto che altrove.
Output:
un dataframe o un json,
f"Non esistono lead dossiers attivi o inattivi per {substance}"
"""
# se local_search = True tento una ricerca in locale. Altrimenti la provo online.
if local_search and local_echa:
result = echaExtract_local(substance, scrapingType, key_infos)
if not result.empty:
logging.info(
f"echa.echaProcess.echaExtract(): Found local data for {scrapingType}, {substance}. Returning it."
)
return result
elif result.empty:
logging.info(
f"echa.echaProcess.echaExtract(): Have not found local data for {scrapingType}, {substance}. Continuining."
)
if local_only:
logging.info(f'echa.echaProcess.echaExtract(): No data found in local-only search for {substance}, {scrapingType}')
return f'No data found in local-only search for {substance}, {scrapingType}'
try:
# search_dossier looks the substance up on the ECHA site and returns the relevant dossier links.
links = search_dossier(substance)
if not links:
logging.info(
f'echaProcess.echaExtract(). no active or unactive lead dossiers for: "{substance}". Ending extraction.'
)
return f"Non esistono lead dossiers attivi o inattivi per {substance}"
# Reached when no LEAD dossiers (the ones with the toxicological summaries), active or inactive, exist
# LEAD dossiers summarise the information from most other dossiers; they are the complete ones holding the data we need
# If they exist, open the page of interest ('AcuteToxicity' or 'RepeatedDose')
if scrapingType not in links:
logging.info(
f'echaProcess.echaExtract(). No page for "{scrapingType}", "{substance}"'
)
return f'No data in "{scrapingType}", "{substance}". Page does not exist.'
soup = openEchaPage(link=links[scrapingType])
logging.info(
f"echaProcess.echaExtract(). soupped '{scrapingType}' echa page for '{substance}'"
)
# Grab the section we need
sezione = None
try:
sezione = soup.find(
"section",
class_="KeyValueForChemicalSafetyAssessment",
attrs={"data-cy": "das-block"},
)
except Exception:
logging.error(
f'echaProcess.echaExtract(). could not extract the "section" for "{scrapingType}" for "{substance}"',
exc_info=True,
)
# Current timestamp for the extraction
now = datetime.now()
# Key infos lookup: retrieve the general free-text summary, if requested
key_infos_faund = False
if key_infos:
try:
key_infos = soup.find(
"section",
class_="KeyInformation",
attrs={"data-cy": "das-block"},
)
if key_infos:
key_infos = key_infos.find(
"div",
class_="das-field_value das-field_value_html",
)
key_infos = key_infos.text
key_infos = key_infos if key_infos.strip() != "[Empty]" else None
if key_infos:
key_infos_faund = True
logging.info(
f"echaProcess.echaExtract(). Extracted key_infos from '{scrapingType}' echa page for '{substance}': {key_infos}"
)
key_infos_df = pd.DataFrame(index=[0])
key_infos_df["key_information"] = key_infos
key_infos_df = df_wrapper(
df=key_infos_df,
rmlName=links["rmlName"],
rmlCas=links["rmlCas"],
timestamp=now.strftime("%Y-%m-%d"),
dossierType=links["dossierType"],  # active or inactive? to be verified
page=scrapingType,  # RepeatedDose or AcuteToxicity
linkPage=links[scrapingType],  # link to the RepeatedDose or AcuteToxicity dossier page
key_infos=True,
)
else:
logging.error(
f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() ERROR. could not extract key_infos for "{scrapingType}", "{substance}"'
)
else:
logging.error(
f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() ERROR. could not extract key_infos for "{scrapingType}", "{substance}"'
)
except Exception:
logging.error(
f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() ERROR. could not extract key_infos for "{scrapingType}", "{substance}"',
exc_info=True,
)
try:
if not sezione:  # the main scraped section
logging.error(
f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() Empty section for the html > markdown conversion. No data for "{scrapingType}", "{substance}"'
)
if not key_infos_faund:
# If there is no data but key information exists, return the latter
return f'No data in "{scrapingType}", "{substance}"'
else:
return key_infos_df
# Convert the html section to markdown
output = echaPage_to_md(
sezione, scrapingType=scrapingType, substance=substance
)
logging.info(
f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() OK. created MD for "{scrapingType}", "{substance}"'
)
# In rare cases no AcuteToxicity or RepeatedDose page exists at all; output will then be empty and raise
except Exception:
logging.error(
f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() ERROR. could not MD for "{scrapingType}", "{substance}"',
exc_info=True,
)
try:
# Convert the markdown into the first raw json
jsonified = markdown_to_json_raw(
output, scrapingType=scrapingType, substance=substance
)
logging.info(
f'echaProcess.echaExtract() > echaProcess.markdown_to_json_raw() OK. created initial json for "{scrapingType}", "{substance}"'
)
except Exception:
logging.error(
f'echaProcess.echaExtract() > echaProcess.markdown_to_json_raw() ERROR. could not create initial json for "{scrapingType}", "{substance}"',
exc_info=True,
)
json_data = json.loads(jsonified)
try:
# Second json-processing step: clean the most deeply nested dictionaries
cleaned_data = clean_json(json_data)
logging.info(
f'echaProcess.echaExtract() > echaProcess.clean_json() OK. cleaned the json for "{scrapingType}", "{substance}"'
)
# An empty cleaned_data means there is no data
if not cleaned_data:
logging.error(
f'echaProcess.echaExtract() > echaProcess.clean_json() Empty cleaned_json. No data for "{scrapingType}", "{substance}"'
)
if not key_infos_faund:
# If there is no data but key information exists, return the latter
return f'No data in "{scrapingType}", "{substance}"'
else:
return key_infos_df
except Exception:
logging.error(
f'echaProcess.echaExtract() > echaProcess.clean_json() ERROR. cleaning the json for "{scrapingType}", "{substance}"'
)
# If a dataframe is wanted as output, build it and add the extraction metadata
try:
df = json_to_dataframe(cleaned_data, scrapingType)
df = df_wrapper(
df=df,
rmlName=links["rmlName"],
rmlCas=links["rmlCas"],
timestamp=now.strftime("%Y-%m-%d"),
dossierType=links["dossierType"],
page=scrapingType,
linkPage=links[scrapingType],
)
if outputType == "df":
logging.info(
f'echaProcess.echaExtract(). successfully extracted "{scrapingType}", "{substance}". Returning df'
)
# If the user asked for key infos and they were found, merge the two dataframes
return df if not key_infos_faund else pd.concat([key_infos_df, df])
elif outputType == "json":
if key_infos_faund:
df = pd.concat([key_infos_df, df])
jayson = df.to_json(orient="records", force_ascii=False)
logging.info(
f'echaProcess.echaExtract(). successfully extracted "{scrapingType}", "{substance}". Returning json'
)
return jayson
except KeyError:
# Handle junk pages that only contain "no information available"
if key_infos_faund:
return key_infos_df
json_output = list(cleaned_data[list(cleaned_data.keys())[0]].values())
if all(elem == "no information available" for elem in json_output):
logging.info(
f"echaProcess.echaExtract(). No data found for {scrapingType} for {substance}"
)
return f'No data in "{scrapingType}", "{substance}"'
else:
logging.error(
"echaProcess.json_to_dataframe(). Could not create dataframe"
)
cleaned_data["error"] = (
"Could not build the dataframe, probably because there is not enough information. Returning the JSON instead"
)
return cleaned_data
except Exception:
logging.error(
f"echaProcess.echaExtract() ERROR. Something went wrong, not quite sure what.",
exc_info=True,
)
def df_wrapper(
df, rmlName, rmlCas, timestamp, dossierType, page, linkPage, key_infos=False
):
# A small helper that attaches all the required metadata to the dataframe,
# keeping echaExtract (already messy enough) uncluttered
df.insert(0, "Substance", rmlName)
df.insert(1, "CAS", rmlCas)
df["Extraction_Timestamp"] = timestamp
df = df.replace("\n", "", regex=True)
if not key_infos:
df = df[df["Effect level"].notna()]
# Attach the dossier link and status
df["dossierType"] = dossierType
df["page"] = page
df["linkPage"] = linkPage
return df
def echaExtract_specific(
CAS: str,
scrapingType="RepeatedDose",
doseDescriptor="NOAEL",
route="inhalation",
local_search=False,
local_only=False
):
"""
Dato un CAS cerca di trovare il dose descriptor (di base NOAEL) per la route specificata (di base 'inhalation').
Args:
CAS (str): il cas o in alternativa la sostanza
route (str): 'inhalation', 'oral', 'dermal'. Di base 'inhalation'
scrapingType (str): la pagina su cui cercarlo
doseDescriptor (str): il tipo di valore da ricercare (NOAEL, DNEL, LD50, LC50)
"""
# Attempt the extraction
result = echaExtract(
substance=CAS,
scrapingType=scrapingType,
outputType="df",
local_search=local_search,
local_only=local_only
)
# Is the result a dataframe?
if isinstance(result, pd.DataFrame):
# If so, filter it down to what we need
filtered_df = result[
(result["Route"] == route) & (result["Dose descriptor"] == doseDescriptor)
]
# Return it if it is not empty
if not filtered_df.empty:
return filtered_df
else:
return f'Non ho trovato {doseDescriptor} in {scrapingType} con route "{route}" per {CAS}'
elif isinstance(result, dict) and result.get("error"):
# This means a json carrying an error came back
return f'Non ho trovato {doseDescriptor} in {scrapingType} con route "{route}" per {CAS}'
# Otherwise a "Non esistono" message came back: no active or inactive lead dossiers exist for the searched substance
elif isinstance(result, str) and result.startswith("Non esistono"):
return result
def echa_noael_ld50(CAS: str, route="inhalation", outputType="df", local_search=False, local_only=False):
"""
Dato un CAS cerca di trovare il NOAEL per la route specificata (di base 'inhalation').
Se non esiste la pagina RepeatedDose con il NOAEL fa ritornare l'LD50 per quella route.
Args:
CAS (str): il cas o in alternativa la sostanza
route (str): 'inhalation', 'oral', 'dermal'. Di base 'inhalation'
outputType (str) = 'df', 'json'. Il tipo di output
"""
if route not in ["inhalation", "oral", "dermal"] or outputType not in [
"df",
"json",
]:
return "invalid input"
# First try to scrape the RepeatedDose page
first_attempt = echaExtract_specific(
CAS=CAS,
scrapingType="RepeatedDose",
doseDescriptor="NOAEL",
route=route,
local_search=local_search,
local_only=local_only
)
if isinstance(first_attempt, pd.DataFrame):
return first_attempt
elif isinstance(first_attempt, str) and first_attempt.startswith("Non ho trovato"):
second_attempt = echaExtract_specific(
CAS=CAS,
scrapingType="AcuteToxicity",
doseDescriptor="LD50",
route=route,
local_search=local_search,
local_only=local_only
)
if isinstance(second_attempt, pd.DataFrame):
return second_attempt
elif isinstance(second_attempt, str) and second_attempt.startswith(
"Non ho trovato"
):
return second_attempt.replace("LD50", "NOAEL ed LD50")
elif isinstance(first_attempt, str) and first_attempt.startswith("Non esistono"):
return first_attempt
def echa_noael_ld50_multi(
casList: list, route="inhalation", messages=False, local_search=False, local_only=False
):
"""
Metodo abbastanza semplice. Data una lista di cas esegue echa_noael_ld50. Quindi cerca i NOAEL per la route desiderata o gli LD50 se non trova i NOAEL.
L'output è un df per le sostanze che trova e una lista di messaggi per quelle che non trova.
Args:
casList (list): la lista di CAS
route (str): 'inhalation', 'oral', 'dermal'. Di base 'inhalation'
messages (boolean) = True o False. Con True fa ritornare una lista. Il primo elemento sarà il dataframe, il secondo la lista di messaggi per le sostanze non trovate.
Di base è False e fa ritornare solo il dataframe.
"""
messages_list = []
df = pd.DataFrame()
for CAS in casList:
output = echa_noael_ld50(
CAS=CAS, route=route, outputType="df", local_search=local_search, local_only=local_only
)
if isinstance(output, str):
messages_list.append(output)
elif isinstance(output, pd.DataFrame):
df = pd.concat([df, output], ignore_index=True)
df.dropna(axis=1, how="all", inplace=True)
if messages and df.empty:
messages_list.append(
f'Non sono riuscito a trovare nessun NOAEL o LD50 per i cas per la route "{route}"'
)
return [None, messages_list]
elif messages and not df.empty:
return [df, messages_list]
elif not df.empty and not messages:
return df
elif df.empty and not messages:
return f'Non sono riuscito a trovare nessun NOAEL o LD50 per i cas per la route "{route}"'
def echaExtract_multi(
casList: list,
scrapingType="all",
local=False,
local_path=None,
log_path=None,
debug_print=False,
error=False,
error_path=None,
key_infos=False,
local_search=False,
local_only=False,
filter = None
):
"""
Data una lista di CAS cerca di estrarre tutte le pagine con le Repeated Dose, tutte le pagine con l'AcuteToxicity, o entrambe
Args:
casList (list): la lista di CAS
scrapingType (str): 'RepeatedDose', 'AcuteToxicity', 'all'
local (boolean): Se impostato su True questo parametro salva sul disco in maniera progressiva, appendendo ogni result man mano che li trova.
è necessario per lo scraping su larga scala
log_path (str): il path per il log da fillare durante lo scraping di massa
debug_print (bool): per avere il printing durante lo scraping. per verificare l'avanzamento
error (bool): Per far ritornare la lista degli errori una volta scrapato
Output:
pd.Dataframe
"""
cas_len = len(casList)
i = 0
df = pd.DataFrame()
if scrapingType == "all":
scrapingTypeList = ["RepeatedDose", "AcuteToxicity"]
else:
scrapingTypeList = [scrapingType]
logging.info(
f"echa.echaExtract_multi(). Commencing mass extraction of {scrapingTypeList} for {casList}"
)
errors = []
for cas in casList:
for scrapingType in scrapingTypeList:
extraction = echaExtract(
substance=cas,
scrapingType=scrapingType,
outputType="df",
key_infos=key_infos,
local_search=local_search,
local_only=local_only
)
if isinstance(extraction, pd.DataFrame) and not extraction.empty:
status = "successful_scrape"
logging.info(
f"echa.echaExtract_multi(). Succesfully scraped {scrapingType} for {cas}"
)
df = pd.concat([df, extraction], ignore_index=True)
if local and local_path:
df.to_csv(local_path, index=False)
elif (
(isinstance(extraction, pd.DataFrame) and extraction.empty)
or (extraction is None)
or (isinstance(extraction, str) and extraction.startswith("No data"))
):
status = "no_data_found"
logging.info(
f"echa.echaExtract_multi(). Found no data for {scrapingType} for {cas}"
)
elif isinstance(extraction, dict):
if extraction.get("error"):
status = "df_creation_error"
errors.append(extraction)
logging.info(
f"echa.echaExtract_multi(). Df creation error for {scrapingType} for {cas}"
)
elif isinstance(extraction, str) and extraction.startswith("Non esistono"):
status = "no_lead_dossiers"
logging.info(
f"echa.echaExtract_multi(). Found no lead dossiers for {cas}"
)
else:
status = "unknown_error"
logging.error(
f"echa.echaExtract_multi(). Unknown error for {scrapingType} for {cas}"
)
if log_path:
fill_log(cas, status, log_path, scrapingType)
if debug_print:
print(f"{i}: {cas}, {scrapingType}")
i += 1
if error and errors and error_path:
with open(error_path, "w") as json_file:
json.dump(errors, json_file, indent=4)
# This single filter step replaces four former helper methods
if filter:
df = filter_dataframe_by_dict(df, filter)
return df
def fill_log(cas: str, status: str, log_path: str, scrapingType: str):
"""
Funzione usata durante lo scraping di massa per fillare un log mentre estraggo le sostanze
"""
df = pd.read_csv(log_path)
df.loc[df["casNo"] == cas, f"scraping_{scrapingType}"] = status
df.loc[df["casNo"] == cas, "timestamp"] = datetime.now().strftime("%Y-%m-%d")
df.to_csv(log_path, index=False)
def echaExtract_local(substance: str, scrapingType: str, key_infos=False):
# Use parameterized placeholders instead of f-string interpolation to avoid quoting/injection issues
if not key_infos:
query = """
SELECT *
FROM echa_full_scraping
WHERE CAS = ? AND page = ? AND key_information IS NULL;
"""
else:
query = """
SELECT *
FROM echa_full_scraping
WHERE CAS = ? AND page = ?;
"""
result = con.execute(query, [substance, scrapingType]).df()
return result
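Interpolating `substance` directly into the SQL breaks on names containing quotes; bound placeholders avoid this. A sketch using the stdlib sqlite3 in-memory database as a stand-in for the project's DuckDB connection (the `?` placeholder syntax is the same, and the table name mirrors `echa_full_scraping`):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE echa_full_scraping (CAS TEXT, page TEXT, key_information TEXT)")
con.execute("INSERT INTO echa_full_scraping VALUES ('56-81-5', 'RepeatedDose', NULL)")

# Placeholders are bound by the driver, so quotes in the search term are harmless
rows = con.execute(
    "SELECT * FROM echa_full_scraping WHERE CAS = ? AND page = ? AND key_information IS NULL",
    ("56-81-5", "RepeatedDose"),
).fetchall()
print(rows)  # [('56-81-5', 'RepeatedDose', None)]
```

The binding also means a substance name like `O'Neill's extract` needs no manual escaping.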
def filter_dataframe_by_dict(df, filter_dict):
"""
Filters a Pandas DataFrame based on a dictionary.
Args:
df (pd.DataFrame): The input DataFrame.
filter_dict (dict): A dictionary where keys are column names and
values are lists of allowed values for that column.
Returns:
pd.DataFrame: A new DataFrame containing only the rows that match
the filter criteria.
"""
filter_condition = pd.Series(True, index=df.index) # Initialize with all True to start filtering
for column_name, allowed_values in filter_dict.items():
if column_name in df.columns: # Check if the column exists in the DataFrame
column_filter = df[column_name].isin(allowed_values) # Create a boolean Series for the current column
filter_condition = filter_condition & column_filter # Combine with existing condition using 'and'
else:
logging.warning(f"Column '{column_name}' not found in the DataFrame. Filter for this column will be ignored.")
filtered_df = df[filter_condition] # Apply the combined filter condition
return filtered_df
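A usage sketch mirroring the same isin-based masking (assuming pandas is available; the columns and values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "Route": ["oral", "dermal", "inhalation"],
    "Dose descriptor": ["NOAEL", "LD50", "NOAEL"],
})

# Keep only NOAEL rows for the oral or inhalation routes,
# AND-combining one boolean mask per filtered column
mask = pd.Series(True, index=df.index)
mask &= df["Route"].isin(["oral", "inhalation"])
mask &= df["Dose descriptor"].isin(["NOAEL"])
filtered = df[mask]
print(list(filtered["Route"]))  # ['oral', 'inhalation']
```

Starting from an all-True Series makes the AND-accumulation a no-op for columns that are absent, which is exactly why missing filter columns can simply be skipped with a warning.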


@ -1,467 +0,0 @@
import os
import base64
import traceback
import logging # Import logging module
import datetime
import pandas as pd
# import time # Keep if you use page.wait_for_timeout
from playwright.sync_api import sync_playwright, TimeoutError # Catch specific errors
from src.func.find import search_dossier
import requests
# --- Basic Logging Setup (Commented Out) ---
# # Configure logging - uncomment and customize level/handler as needed
# logging.basicConfig(
# level=logging.INFO, # Or DEBUG for more details
# format='%(asctime)s - %(levelname)s - %(message)s',
# # filename='pdf_generator.log', # Optional: Log to a file
# # filemode='a'
# )
# --- End Logging Setup ---
# Assume svg_to_data_uri is defined elsewhere correctly
def svg_to_data_uri(svg_path):
try:
if not os.path.exists(svg_path):
# logging.error(f"SVG file not found: {svg_path}") # Example logging
raise FileNotFoundError(f"SVG file not found: {svg_path}")
with open(svg_path, 'rb') as f:
svg_content = f.read()
encoded_svg = base64.b64encode(svg_content).decode('utf-8')
return f"data:image/svg+xml;base64,{encoded_svg}"
except Exception as e:
print(f"Error converting SVG {svg_path}: {e}")
# logging.error(f"Error converting SVG {svg_path}: {e}", exc_info=True) # Example logging
return None
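The data-URI conversion above can be demonstrated with any bytes; a minimal sketch with an inline SVG string instead of a file on disk:

```python
import base64

svg_content = b'<svg xmlns="http://www.w3.org/2000/svg"/>'
encoded = base64.b64encode(svg_content).decode("utf-8")
data_uri = f"data:image/svg+xml;base64,{encoded}"
print(data_uri.startswith("data:image/svg+xml;base64,"))  # True
# Round-trip: decoding the payload restores the original bytes
assert base64.b64decode(encoded) == svg_content
```

Embedding the logos as base64 data URIs is what lets the injected header render without any network or filesystem access from inside the browser page.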
# --- JavaScript Expressions ---
# Define the cleanup logic as an immediately-invoked arrow function expression
# NOTE: .das-block_empty removal is currently disabled as per previous step
cleanup_js_expression = """
() => {
console.log('Running cleanup JS (DISABLED .das-block_empty removal)...');
let totalRemoved = 0;
// Example 1: Remove sections explicitly marked as empty (Currently Disabled)
// const emptyBlocks = document.querySelectorAll('.das-block_empty');
// emptyBlocks.forEach(el => {
// if (el && el.parentNode) {
// console.log(`Removing '.das-block_empty' block with ID: ${el.id || 'N/A'}`);
// el.remove();
// totalRemoved++;
// }
// });
// Add other specific cleanup logic here if needed
console.log(`Cleanup script removed ${totalRemoved} elements (DISABLED .das-block_empty removal).`);
return totalRemoved; // Return the count
}
"""
# --- End JavaScript Expressions ---
def generate_pdf_with_header_and_cleanup(
url,
pdf_path,
substance_name,
substance_link,
ec_number,
cas_number,
header_template_path=r"src\func\resources\injectableHeader.html",
echa_chem_logo_path=r"src\func\resources\echa_chem_logo.svg",
echa_logo_path=r"src\func\resources\ECHA_Logo.svg"
) -> bool: # Added return type hint
"""
Generates a PDF with a dynamic header and optionally removes empty sections.
Provides basic logging (commented out) and returns True/False for success/failure.
Args:
url (str): The target URL OR local HTML file path.
pdf_path (str): The output PDF path.
substance_name (str): The name of the chemical substance.
substance_link (str): The URL the substance name should link to (in header).
ec_number (str): The EC number for the substance.
cas_number (str): The CAS number for the substance.
header_template_path (str): Path to the HTML header template file.
echa_chem_logo_path (str): Path to the echa_chem_logo.svg file.
echa_logo_path (str): Path to the ECHA_Logo.svg file.
Returns:
bool: True if the PDF was generated successfully, False otherwise.
"""
final_header_html = None
# logging.info(f"Starting PDF generation for URL: {url} to path: {pdf_path}") # Example logging
# --- 1. Prepare Header HTML ---
try:
# logging.debug(f"Reading header template from: {header_template_path}") # Example logging
print(f"Reading header template from: {header_template_path}")
if not os.path.exists(header_template_path):
raise FileNotFoundError(f"Header template file not found: {header_template_path}")
with open(header_template_path, 'r', encoding='utf-8') as f:
header_template_content = f.read()
if not header_template_content:
raise ValueError("Header template file is empty.")
# logging.debug("Converting logos...") # Example logging
print("Converting logos...")
logo1_data_uri = svg_to_data_uri(echa_chem_logo_path)
logo2_data_uri = svg_to_data_uri(echa_logo_path)
if not logo1_data_uri or not logo2_data_uri:
raise ValueError("Failed to convert one or both logos to Data URIs.")
# logging.debug("Replacing placeholders...") # Example logging
print("Replacing placeholders...")
final_header_html = header_template_content.replace("##ECHA_CHEM_LOGO_SRC##", logo1_data_uri)
final_header_html = final_header_html.replace("##ECHA_LOGO_SRC##", logo2_data_uri)
final_header_html = final_header_html.replace("##SUBSTANCE_NAME##", substance_name)
final_header_html = final_header_html.replace("##SUBSTANCE_LINK##", substance_link)
final_header_html = final_header_html.replace("##EC_NUMBER##", ec_number)
final_header_html = final_header_html.replace("##CAS_NUMBER##", cas_number)
if "##" in final_header_html:
print("Warning: Not all placeholders seem replaced in the header HTML.")
# logging.warning("Not all placeholders seem replaced in the header HTML.") # Example logging
except Exception as e:
print(f"Error during header setup phase: {e}")
traceback.print_exc()
# logging.error(f"Error during header setup phase: {e}", exc_info=True) # Example logging
return False # Return False on header setup failure
# --- End Header Prep ---
# --- CSS Override Definition ---
# Using Revision 4 from previous step (simplified breaks, boundary focus)
selectors_to_fix = [
'.das-field .das-field_value_html',
'.das-field .das-field_value_large',
'.das-field .das-value_remark-text'
]
css_selector_string = ",\n".join(selectors_to_fix)
css_override = f"""
<style id='pdf-override-styles'>
/* Basic Resets & Overflows */
html, body {{ height: auto !important; overflow: visible !important; margin: 0 !important; padding: 0 !important; }}
* {{ box-sizing: border-box; }}
{css_selector_string} {{
overflow: visible !important; overflow-y: visible !important; height: auto !important; max-height: none !important;
}}
/* Boundary Fix */
#pdf-custom-header {{ margin-bottom: 0 !important; padding-bottom: 1px !important; page-break-after: auto !important; display: block !important; }}
#pdf-custom-header + .body-inner {{ margin-top: 0 !important; padding-top: 0 !important; page-break-before: auto !important; display: block !important; }}
.body-inner .document-header {{ margin-top: 0 !important; padding-top: 0 !important; page-break-before: auto !important; }}
/* Simplified Page Breaks */
.body-inner h1, .body-inner h2, .body-inner h3, .body-inner h4, .body-inner h5, .body-inner h6 {{ page-break-after: avoid !important; }}
#pdf-custom-header h2 {{ page-break-after: auto !important; }}
@media print {{
html, body {{ height: auto !important; overflow: visible !important; margin: 0; padding: 0; }}
#pdf-custom-header {{ margin-bottom: 0 !important; padding-bottom: 1px !important; page-break-after: auto !important; display: block !important;}}
#pdf-custom-header + .body-inner {{ margin-top: 0 !important; padding-top: 0 !important; page-break-before: auto !important; display: block !important; }}
.body-inner .document-header {{ margin-top: 0 !important; padding-top: 0 !important; page-break-before: auto !important; }}
.body-inner h1, .body-inner h2, .body-inner h3, .body-inner h4, .body-inner h5, .body-inner h6 {{ page-break-after: avoid !important; }}
#pdf-custom-header h2 {{ page-break-after: auto !important; }}
.das-doc-toolbar, .document-header__section-links, #das-totop {{ display: none !important; }}
}}
</style>
"""
# --- End CSS Override Definition ---
# --- Playwright Automation ---
try:
with sync_playwright() as p:
# logging.debug("Launching browser...") # Example logging
# browser = p.chromium.launch(headless=False, devtools=True) # For debugging
browser = p.chromium.launch()
page = browser.new_page()
# Capture console messages (Corrected: use msg.text)
page.on("console", lambda msg: print(f"Browser Console: {msg.text}"))
try:
# logging.info(f"Navigating to page: {url}") # Example logging
print(f"Navigating to: {url}")
if os.path.exists(url) and not url.startswith('file://'):
page_url = f'file://{os.path.abspath(url)}'
# logging.info(f"Treating as local file: {page_url}") # Example logging
print(f"Treating as local file: {page_url}")
else:
page_url = url
page.goto(page_url, wait_until='load', timeout=90000)
# logging.info("Page navigation complete.") # Example logging
# logging.debug("Injecting header HTML...") # Example logging
print("Injecting header HTML...")
page.evaluate(f'(headerHtml) => {{ document.body.insertAdjacentHTML("afterbegin", headerHtml); }}', final_header_html)
# logging.debug("Injecting CSS overrides...") # Example logging
print("Injecting CSS overrides...")
page.evaluate(f"""(css) => {{
const existingStyle = document.getElementById('pdf-override-styles');
if (existingStyle) existingStyle.remove();
document.head.insertAdjacentHTML('beforeend', css);
}}""", css_override)
# logging.debug("Running JavaScript cleanup function...") # Example logging
print("Running JavaScript cleanup function...")
elements_removed_count = page.evaluate(cleanup_js_expression)
# logging.info(f"Cleanup script finished (reported removing {elements_removed_count} elements).") # Example logging
print(f"Cleanup script finished (reported removing {elements_removed_count} elements).")
# --- Optional: Emulate Print Media ---
# print("Emulating print media...")
# page.emulate_media(media='print')
# --- Generate PDF ---
# logging.info(f"Generating PDF: {pdf_path}") # Example logging
print(f"Generating PDF: {pdf_path}")
pdf_options = {
"path": pdf_path, "format": "A4", "print_background": True,
"margin": {'top': '20px', 'bottom': '20px', 'left': '20px', 'right': '20px'},
"scale": 1.0
}
page.pdf(**pdf_options)
# logging.info(f"PDF saved successfully to: {pdf_path}") # Example logging
print(f"PDF saved successfully to: {pdf_path}")
# logging.debug("Closing browser.") # Example logging
print("Closing browser.")
browser.close()
return True # Indicate success
except TimeoutError as e:
print(f"A Playwright TimeoutError occurred: {e}")
traceback.print_exc()
# logging.error(f"Playwright TimeoutError occurred: {e}", exc_info=True) # Example logging
browser.close() # Ensure browser is closed on error
return False # Indicate failure
except Exception as e: # Catch other potential errors during Playwright page operations
print(f"An unexpected error occurred during Playwright page operations: {e}")
traceback.print_exc()
# logging.error(f"Unexpected error during Playwright page operations: {e}", exc_info=True) # Example logging
# Optional: Save HTML state on error
try:
html_content = page.content()
error_html_path = pdf_path.replace('.pdf', '_error.html')
with open(error_html_path, 'w', encoding='utf-8') as f_err:
f_err.write(html_content)
# logging.info(f"Saved HTML state on error to: {error_html_path}") # Example logging
print(f"Saved HTML state on error to: {error_html_path}")
except Exception as save_e:
# logging.error(f"Could not save HTML state on error: {save_e}", exc_info=True) # Example logging
print(f"Could not save HTML state on error: {save_e}")
browser.close() # Ensure browser is closed on error
return False # Indicate failure
# Note: cleanup of the 'with sync_playwright()' context is handled
# automatically when the 'with' block exits.
except Exception as e:
# Catch errors during Playwright startup (less common)
print(f"An error occurred during Playwright setup/teardown: {e}")
traceback.print_exc()
# logging.error(f"Error during Playwright setup/teardown: {e}", exc_info=True) # Example logging
return False # Indicate failure
# --- Example Usage ---
# result = generate_pdf_with_header_and_cleanup(
# url='path/to/your/input.html',
# pdf_path='output.pdf',
# substance_name='Glycerol Example',
# substance_link='http://example.com/glycerol',
# ec_number='200-289-5',
# cas_number='56-81-5',
# )
#
# if result:
# print("PDF Generation Succeeded.")
# # logging.info("Main script: PDF Generation Succeeded.") # Example logging
# else:
# print("PDF Generation Failed.")
# # logging.error("Main script: PDF Generation Failed.") # Example logging
def search_generate_pdfs(
cas_number_to_search: str,
page_types_to_extract: list[str],
base_output_folder: str = "data/library"
) -> bool:
"""
Searches for a substance by CAS, saves raw HTML and generates PDFs for
specified page types. Uses '_js' link variant for the PDF header link if available.
Args:
cas_number_to_search (str): CAS number to search for.
page_types_to_extract (list[str]): List of page type names (e.g., 'RepeatedDose').
Expects '{page_type}' and '{page_type}_js' keys
in the search result.
base_output_folder (str): Root directory for saving HTML/PDFs.
Returns:
bool: True if substance found and >=1 requested PDF generated, False otherwise.
"""
# logging.info(f"Starting process for CAS: {cas_number_to_search}")
print(f"\n===== Processing request for CAS: {cas_number_to_search} =====")
# --- 1. Search for Dossier Information ---
try:
# logging.debug(f"Calling search_dossier for CAS: {cas_number_to_search}")
search_result = search_dossier(substance=cas_number_to_search, input_type='rmlCas')
except Exception as e:
print(f"Error during dossier search for CAS '{cas_number_to_search}': {e}")
traceback.print_exc()
# logging.error(f"Exception during search_dossier for CAS '{cas_number_to_search}': {e}", exc_info=True)
return False
if not search_result:
print(f"Substance not found or search failed for CAS: {cas_number_to_search}")
# logging.warning(f"Substance not found or search failed for CAS: {cas_number_to_search}")
return False
# logging.info(f"Substance found for CAS: {cas_number_to_search}")
print(f"Substance found: {search_result.get('rmlName', 'N/A')}")
# --- 2. Extract Details and Filter Pages ---
try:
# Extract required info
rml_id = search_result.get('rmlId')
rml_name = search_result.get('rmlName')
rml_cas = search_result.get('rmlCas')
rml_ec = search_result.get('rmlEc')
asset_ext_id = search_result.get('assetExternalId')
# Basic validation
if not all([rml_id, rml_name, rml_cas, rml_ec, asset_ext_id]):
missing_keys = [k for k, v in {'rmlId': rml_id, 'rmlName': rml_name, 'rmlCas': rml_cas, 'rmlEc': rml_ec, 'assetExternalId': asset_ext_id}.items() if not v]
message = f"Search result for {cas_number_to_search} is missing required keys: {missing_keys}"
print(f"Error: {message}")
# logging.error(message)
return False
# --- Filtering Logic - Collect pairs of URLs ---
pages_to_process_list = [] # Store tuples: (page_name, raw_url, js_url)
# logging.debug(f"Filtering pages. Requested: {page_types_to_extract}.")
for page_type in page_types_to_extract:
raw_url_key = page_type
js_url_key = f"{page_type}_js"
raw_url = search_result.get(raw_url_key)
js_url = search_result.get(js_url_key) # Get the JS URL
# Check if both URLs are valid strings
if raw_url and isinstance(raw_url, str) and raw_url.strip():
if js_url and isinstance(js_url, str) and js_url.strip():
pages_to_process_list.append((page_type, raw_url, js_url))
# logging.debug(f"Found valid pair for '{page_type}': Raw='{raw_url}', JS='{js_url}'")
else:
# Raw URL present but JS URL missing/invalid: skip PDF generation for this page type.
print(f"Found raw URL for '{page_type}' but missing/invalid JS URL ('{js_url}'). Skipping PDF generation for this type.")
# logging.warning(f"Missing/invalid JS URL for page type '{page_type}' for {rml_cas}. Raw URL: '{raw_url}'.")
else:
# Raw URL missing or invalid
if page_type in search_result: # Check if key existed at all
print(f"Found page type key '{page_type}' for {rml_cas}, but its value is not a valid URL ('{raw_url}'). Skipping.")
# logging.warning(f"Invalid raw URL value for page type '{page_type}' for {rml_cas}: '{raw_url}'.")
else:
print(f"Requested page type key '{page_type}' not found in search results for {rml_cas}.")
# logging.warning(f"Requested page type key '{page_type}' not found for {rml_cas}.")
# --- End Filtering Logic ---
if not pages_to_process_list:
print(f"After filtering, no requested page types ({page_types_to_extract}) resulted in a valid pair of Raw and JS URLs for substance {rml_cas}.")
# logging.warning(f"No pages with valid URL pairs to process for substance {rml_cas}.")
return False # Nothing to generate
except Exception as e:
print(f"Error processing search result for '{cas_number_to_search}': {e}")
traceback.print_exc()
# logging.error(f"Error processing search result for '{cas_number_to_search}': {e}", exc_info=True)
return False
# --- 3. Prepare Folders ---
safe_cas = rml_cas.replace('/', '_').replace('\\', '_')
substance_folder_name = f"{safe_cas}_{rml_ec}_{rml_id}"
substance_folder_path = os.path.join(base_output_folder, substance_folder_name)
try:
os.makedirs(substance_folder_path, exist_ok=True)
# logging.info(f"Ensured output directory exists: {substance_folder_path}")
print(f"Ensured output directory exists: {substance_folder_path}")
except OSError as e:
print(f"Error creating directory {substance_folder_path}: {e}")
# logging.error(f"Failed to create directory {substance_folder_path}: {e}", exc_info=True)
return False
# --- 4. Process Each Page (Save HTML, Generate PDF) ---
successful_pages = [] # Track successful PDF generations
overall_success = False # Track if any PDF was generated
for page_name, raw_html_url, js_header_link in pages_to_process_list:
print(f"\nProcessing page: {page_name}")
base_filename = f"{safe_cas}_{page_name}"
html_filename = f"{base_filename}.html"
pdf_filename = f"{base_filename}.pdf"
html_full_path = os.path.join(substance_folder_path, html_filename)
pdf_full_path = os.path.join(substance_folder_path, pdf_filename)
# --- Save Raw HTML ---
html_saved = False
try:
# logging.debug(f"Fetching raw HTML for {page_name} from {raw_html_url}")
print(f"Fetching raw HTML from: {raw_html_url}")
# Send a browser-like User-Agent to reduce the chance of being blocked.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
response = requests.get(raw_html_url, timeout=30, headers=headers) # 30s timeout
response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)
# Decode explicitly as UTF-8, replacing invalid bytes; response.text would
# instead guess the encoding from headers/apparent_encoding.
html_content = response.content.decode('utf-8', errors='replace')
with open(html_full_path, 'w', encoding='utf-8') as f:
f.write(html_content)
html_saved = True
# logging.info(f"Successfully saved raw HTML for {page_name} to {html_full_path}")
print(f"Successfully saved raw HTML to: {html_full_path}")
except requests.exceptions.RequestException as req_e:
print(f"Error fetching raw HTML for {page_name} from {raw_html_url}: {req_e}")
# logging.error(f"Error fetching raw HTML for {page_name}: {req_e}", exc_info=True)
except IOError as io_e:
print(f"Error saving raw HTML for {page_name} to {html_full_path}: {io_e}")
# logging.error(f"Error saving raw HTML for {page_name}: {io_e}", exc_info=True)
except Exception as e: # Catch other potential errors like decoding
print(f"Unexpected error saving HTML for {page_name}: {e}")
# logging.error(f"Unexpected error saving HTML for {page_name}: {e}", exc_info=True)
# --- Generate PDF (using raw URL for content, JS URL for header link) ---
# logging.info(f"Generating PDF for {page_name} from {raw_html_url}")
print(f"Generating PDF using content from: {raw_html_url}")
pdf_success = generate_pdf_with_header_and_cleanup(
url=raw_html_url, # Use raw URL for Playwright navigation/content
pdf_path=pdf_full_path,
substance_name=rml_name,
substance_link=js_header_link, # Use JS URL for the link in the header
ec_number=rml_ec,
cas_number=rml_cas
)
if pdf_success:
successful_pages.append(page_name) # Log success based on PDF generation
overall_success = True
# logging.info(f"Successfully generated PDF for {page_name} at {pdf_full_path}")
print(f"Successfully generated PDF for {page_name}")
else:
# logging.error(f"Failed to generate PDF for {page_name} from {raw_html_url}")
print(f"Failed to generate PDF for {page_name}")
# Note: overall success is tied only to PDF generation; a failed HTML save
# is reported above but does not affect the return value.
print(f"===== Finished request for CAS: {cas_number_to_search} =====")
print(f"Successfully generated {len(successful_pages)} PDFs: {successful_pages}")
return overall_success # Return success based on PDF generation
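# --- Example Usage (illustrative sketch; the CAS number and page-type names
# below are hypothetical and depend on the keys returned by search_dossier) ---
# result = search_generate_pdfs(
#     cas_number_to_search='56-81-5',
#     page_types_to_extract=['RepeatedDose'],
#     base_output_folder='data/library',
# )
# if result:
#     print("At least one PDF was generated.")
# else:
#     print("Substance not found or no PDFs could be generated.")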

File diff suppressed because one or more lines are too long

@@ -1,141 +0,0 @@
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="352.909" height="64.542" viewBox="0 0 352.909 64.542">
<defs>
<linearGradient id="linear-gradient" x1="0.499" y1="0.955" x2="0.501" y2="0.043" gradientUnits="objectBoundingBox">
<stop offset="0" stop-color="#002658"/>
<stop offset="0.99" stop-color="#0160ae"/>
</linearGradient>
<radialGradient id="radial-gradient" cx="0.502" cy="0.5" r="0.881" gradientUnits="objectBoundingBox">
<stop offset="0.34" stop-color="#0961ad"/>
<stop offset="1" stop-color="#1c2f5d"/>
</radialGradient>
<radialGradient id="radial-gradient-2" cx="0.795" cy="0.199" r="0.8" xlink:href="#radial-gradient"/>
<linearGradient id="linear-gradient-2" y1="0.499" x2="1" y2="0.499" gradientUnits="objectBoundingBox">
<stop offset="0" stop-color="#fff"/>
<stop offset="0" stop-color="#0961ad"/>
<stop offset="1" stop-color="#1c2f5d"/>
</linearGradient>
<linearGradient id="linear-gradient-3" x1="-3.244" y1="0.922" x2="0.926" y2="0.075" gradientUnits="objectBoundingBox">
<stop offset="0" stop-color="#fff"/>
<stop offset="0" stop-color="#f6d46a"/>
<stop offset="0.99" stop-color="#f8a71b"/>
</linearGradient>
<linearGradient id="linear-gradient-4" x1="-0.547" y1="0.499" x2="0.453" y2="0.499" gradientUnits="objectBoundingBox">
<stop offset="0" stop-color="#f6d46a"/>
<stop offset="0.99" stop-color="#f8a71b"/>
</linearGradient>
<linearGradient id="linear-gradient-5" x1="-0.17" y1="0.5" x2="0.83" y2="0.5" xlink:href="#linear-gradient-3"/>
<linearGradient id="linear-gradient-6" x1="0.5" y1="-0.199" x2="0.5" y2="1.353" gradientUnits="objectBoundingBox">
<stop offset="0" stop-color="#fff"/>
<stop offset="0" stop-color="#feca0a"/>
<stop offset="0.96" stop-color="#faaa1b"/>
<stop offset="0.99" stop-color="#f8a71b"/>
</linearGradient>
<linearGradient id="linear-gradient-7" x1="0.5" y1="-0.223" x2="0.5" y2="1.383" xlink:href="#linear-gradient-6"/>
<linearGradient id="linear-gradient-9" x1="0.5" y1="-0.223" x2="0.5" y2="1.383" xlink:href="#linear-gradient-6"/>
</defs>
<g id="Group_1542" data-name="Group 1542" transform="translate(-103 -146)">
<g id="Group_1535" data-name="Group 1535">
<path id="Path_484" data-name="Path 484" d="M219.034,36.851,202.609.06h-5.448L180.736,36.851h8.4l3.267-8.032h14.867l3.347,8.032h8.4ZM204.71,22.718h-9.73l4.905-11.831,4.825,11.831ZM165.185,36.851h7.71V.985h-7.71V15.118h-17.9V.985h-7.71V36.841h7.71V21.853h17.9V36.841h0Zm-48.713.935c3.659,0,8.172-.141,12.223-1.719V29.322A40.491,40.491,0,0,1,116.4,30.97c-8.012,0-10.5-6.092-10.5-12.053S108.39,6.865,116.4,6.865A40.491,40.491,0,0,1,128.7,8.514V1.779C124.645.2,120.131.06,116.472.06c-13.229,0-18.526,8.534-18.526,18.858s5.3,18.858,18.526,18.858h0ZM60.13,36.851H89.472V30.106H67.84V21.853H86.909V15.108H67.84V7.72H89.472V.985H60.13V36.841h0Z" transform="translate(103.644 146.001)" fill="url(#linear-gradient)"/>
<circle id="Ellipse_9" data-name="Ellipse 9" cx="2.02" cy="2.02" r="2.02" transform="translate(129.958 175.363)" fill="url(#radial-gradient)"/>
<path id="Path_485" data-name="Path 485" d="M38.618,37a5.358,5.358,0,0,1-1.689.593,2.892,2.892,0,0,0,.151-.935,3.016,3.016,0,1,0-6,.432c-2.563-.281-4.956-.241-5.2-.362-.02,0-6.413-5.247-15.561-.623a.494.494,0,0,0-.221.482A14.761,14.761,0,0,0,24.836,50.638,14.529,14.529,0,0,0,39.5,37.54a.587.587,0,0,0-.885-.553Z" transform="translate(103.383 146.175)" fill="url(#radial-gradient-2)"/>
<path id="Path_486" data-name="Path 486" d="M10.281,36.025s6.4-5.277,13.751-1.448a24.047,24.047,0,0,0,7.047,2.513s-23.18,1.3-20.788-1.066Z" transform="translate(103.383 146.173)" fill="url(#linear-gradient-2)"/>
<rect id="Rectangle_2083" data-name="Rectangle 2083" width="4.322" height="15.48" rx="2.15" transform="translate(113.755 196.405) rotate(44.01)" fill="url(#linear-gradient-3)"/>
<path id="Path_487" data-name="Path 487" d="M2.734,63.94A1.62,1.62,0,0,1,1.628,63.5L.483,62.392a1.59,1.59,0,0,1-.04-2.242l8.866-9.178a1.6,1.6,0,0,1,1.116-.483,1.623,1.623,0,0,1,1.126.442L12.7,52.038a1.587,1.587,0,0,1,.04,2.242L3.87,63.457a1.6,1.6,0,0,1-1.116.483h-.03Zm7.72-13.007h-.02a1.12,1.12,0,0,0-.8.352L.764,60.442a1.173,1.173,0,0,0-.322.814,1.12,1.12,0,0,0,.352.8L1.94,63.166a1.141,1.141,0,0,0,1.618-.03l8.866-9.178a1.144,1.144,0,0,0-.03-1.618l-1.146-1.106a1.109,1.109,0,0,0-.794-.322Z" transform="translate(103.33 146.263)" fill="url(#linear-gradient-4)"/>
<path id="Path_488" data-name="Path 488" d="M19.3.99H13.977a.461.461,0,0,0-.462.462V4.83a.461.461,0,0,0,.462.462h1.568a.461.461,0,0,1,.462.462V16.028l-.02,1.206a.454.454,0,0,1-.261.412A21.882,21.882,0,0,0,4.96,29.226a.465.465,0,0,0,.312.623l3.3.895a.47.47,0,0,0,.553-.261,17.583,17.583,0,0,1,10.334-9.761.465.465,0,0,0,.312-.442V1.462A.461.461,0,0,0,19.3,1Z" transform="translate(103.356 146.005)" fill="#003c75"/>
<path id="Path_489" data-name="Path 489" d="M36.551.99H31.042a.378.378,0,0,0-.372.372V19.989a.378.378,0,0,0,.553.332l3.026-1.639a.365.365,0,0,0,.191-.332V5.674a.378.378,0,0,1,.372-.372h1.749a.378.378,0,0,0,.372-.372V1.362A.378.378,0,0,0,36.561.99Z" transform="translate(103.49 146.005)" fill="#003c75"/>
<path id="Path_490" data-name="Path 490" d="M45.919,34.292A21.215,21.215,0,0,0,31.545,16.741h0l-.08-.03-.181-.06h0L30.8,16.51v3.629a.758.758,0,0,0,.523.724h.02A17.285,17.285,0,1,1,7.661,38.946a17.1,17.1,0,0,1,.02-4.182.285.285,0,0,0-.211-.312l-3.277-.875a.294.294,0,0,0-.362.231,20.916,20.916,0,0,0-.07,5.609,21.236,21.236,0,1,0,42.159-5.147Z" transform="translate(103.349 146.086)" fill="url(#linear-gradient-5)"/>
<path id="Path_491" data-name="Path 491" d="M224.13,18.9a34.878,34.878,0,0,1,.714-7.2,17.285,17.285,0,0,1,2.372-5.911,11.557,11.557,0,0,1,4.4-3.991A14.5,14.5,0,0,1,238.484.34a26.823,26.823,0,0,1,5.669.523,22.624,22.624,0,0,1,4.091,1.246V7.226c-.985-.412-1.9-.754-2.724-1.025a23.113,23.113,0,0,0-2.392-.643,17.986,17.986,0,0,0-2.292-.332c-.764-.06-1.548-.09-2.342-.09a8.147,8.147,0,0,0-4.282,1.055,8.105,8.105,0,0,0-2.825,2.915,14.1,14.1,0,0,0-1.558,4.383,28.844,28.844,0,0,0-.483,5.428,28.767,28.767,0,0,0,.483,5.428,13.693,13.693,0,0,0,1.558,4.373,7.871,7.871,0,0,0,2.825,2.915,8.147,8.147,0,0,0,4.282,1.055c.794,0,1.578-.03,2.342-.1a17.985,17.985,0,0,0,2.292-.332,23.113,23.113,0,0,0,2.392-.643c.824-.271,1.739-.613,2.724-1.025V35.7a22.624,22.624,0,0,1-4.091,1.246,26.823,26.823,0,0,1-5.669.523,14.5,14.5,0,0,1-6.866-1.458,11.787,11.787,0,0,1-4.4-3.991,17.489,17.489,0,0,1-2.372-5.911,34.27,34.27,0,0,1-.714-7.2Z" transform="translate(104.499 146.002)" fill="url(#linear-gradient-6)"/>
<path id="Path_492" data-name="Path 492" d="M238.486,37.8a14.953,14.953,0,0,1-7.026-1.5,11.974,11.974,0,0,1-4.523-4.111,17.723,17.723,0,0,1-2.413-6.021A34.94,34.94,0,0,1,223.8,18.9a34.94,34.94,0,0,1,.724-7.268,17.723,17.723,0,0,1,2.413-6.021A12.056,12.056,0,0,1,231.46,1.5,14.9,14.9,0,0,1,238.486,0a27.543,27.543,0,0,1,5.74.533A23.034,23.034,0,0,1,248.377,1.8l.2.09V7.73l-.462-.191c-.975-.412-1.89-.754-2.7-1.015a20.145,20.145,0,0,0-2.352-.633,21.324,21.324,0,0,0-2.252-.332c-.754-.06-1.538-.09-2.312-.09a7.909,7.909,0,0,0-4.111,1.005,7.711,7.711,0,0,0-2.7,2.8,13.738,13.738,0,0,0-1.518,4.272,28.991,28.991,0,0,0-.472,5.368,28.99,28.99,0,0,0,.472,5.368,13.738,13.738,0,0,0,1.518,4.272,7.79,7.79,0,0,0,2.7,2.8,7.91,7.91,0,0,0,4.111,1.005c.784,0,1.568-.03,2.312-.09a17.129,17.129,0,0,0,2.252-.332,22.352,22.352,0,0,0,2.352-.633c.814-.261,1.719-.613,2.7-1.015l.462-.191v5.84l-.2.09a23.872,23.872,0,0,1-4.152,1.267,27.543,27.543,0,0,1-5.74.533Zm0-37.133a14.408,14.408,0,0,0-6.715,1.417,11.475,11.475,0,0,0-4.282,3.88,17.122,17.122,0,0,0-2.322,5.8,34.262,34.262,0,0,0-.714,7.127,34.262,34.262,0,0,0,.714,7.127,17.122,17.122,0,0,0,2.322,5.8,11.31,11.31,0,0,0,4.282,3.88,14.256,14.256,0,0,0,6.715,1.417,26.193,26.193,0,0,0,5.6-.523,23.16,23.16,0,0,0,3.83-1.136v-4.4c-.824.332-1.588.623-2.292.844a23.888,23.888,0,0,1-2.433.653,20.7,20.7,0,0,1-2.332.342c-.764.06-1.568.1-2.372.1a8.456,8.456,0,0,1-4.453-1.106A8.346,8.346,0,0,1,231.1,28.85a14.484,14.484,0,0,1-1.6-4.483,29.507,29.507,0,0,1-.482-5.488,29.432,29.432,0,0,1,.482-5.488,14.322,14.322,0,0,1,1.6-4.483,8.346,8.346,0,0,1,2.935-3.036,8.5,8.5,0,0,1,4.453-1.1c.8,0,1.6.03,2.372.1a20.551,20.551,0,0,1,2.342.342,23.884,23.884,0,0,1,2.433.653q1.055.347,2.292.844V2.322a23.159,23.159,0,0,0-3.83-1.136,26.856,26.856,0,0,0-5.6-.523Z" transform="translate(104.497 146)" fill="#e68a00"/>
<path id="Path_493" data-name="Path 493" d="M275.544,36.836V20.662H260.516V36.836H255.49V.95h5.026V15.877h15.028V.95h5.026V36.836Z" transform="translate(104.663 146.005)" fill="url(#linear-gradient-7)"/>
<path id="Path_494" data-name="Path 494" d="M280.9,37.17h-5.689V21H260.86V37.17h-5.69V.62h5.69V15.547h14.354V.62H280.9Zm-5.026-.663h4.363V1.283h-4.363V16.211H260.186V1.283h-4.353V36.506h4.353V20.332h15.691Z" transform="translate(104.661 146.004)" fill="#e68a00"/>
<path id="Path_495" data-name="Path 495" d="M308.534,20.662H293.556V32.051h16.556v4.785H288.53V.95h21.582V5.735H293.556V15.877h14.978Z" transform="translate(104.835 146.005)" fill="url(#linear-gradient-7)"/>
<path id="Path_496" data-name="Path 496" d="M310.445,37.17H288.2V.62h22.245V6.068H293.889v9.479h14.978V21H293.889V31.721h16.556Zm-21.582-.663h20.908V32.385H293.216V20.332h14.978V16.211H293.216V5.4h16.556V1.283H288.863Z" transform="translate(104.833 146.004)" fill="#e68a00"/>
<path id="Path_497" data-name="Path 497" d="M335.9,32.051h-3.83l-10-23.11.241,27.895H317.38V.95h6.071l10.525,24.6L344.5.95h6.072V36.836h-4.926l.241-27.895-10,23.11Z" transform="translate(104.985 146.005)" fill="url(#linear-gradient-9)"/>
<path id="Path_498" data-name="Path 498" d="M350.916,37.17h-5.6l.231-26.588-9.439,21.8h-4.262l-9.429-21.8.231,26.588h-5.6V.62h6.634l10.3,24.085L344.291.62h6.634V37.17Zm-4.926-.663h4.262V1.283h-5.519l-10.746,25.11L323.242,1.283h-5.519V36.506h4.262l-.251-29.2L332.3,31.721h3.388L346.251,7.3,346,36.506Z" transform="translate(104.984 146.004)" fill="#e68a00"/>
<g id="Group_1497" data-name="Group 1497" transform="translate(163.734 192.803)">
<path id="Path_499" data-name="Path 499" d="M66.049,56.891H60.53V47h5.519v1.025H61.686v3.187h4.1v1.015h-4.1v3.639h4.363Z" transform="translate(-60.088 -46.558)" fill="#003c75"/>
<path id="Path_500" data-name="Path 500" d="M66.051,57.336H60.532a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h5.519a.446.446,0,0,1,.442.442v1.025a.446.446,0,0,1-.442.442h-3.92v2.3H65.79a.446.446,0,0,1,.442.442v1.015a.446.446,0,0,1-.442.442H62.131v2.754h3.92a.446.446,0,0,1,.442.442v1.025A.446.446,0,0,1,66.051,57.336Zm-5.076-.885H65.6v-.141h-3.92a.446.446,0,0,1-.442-.442V52.229a.446.446,0,0,1,.442-.442h3.659v-.131H61.678a.446.446,0,0,1-.442-.442V48.028a.446.446,0,0,1,.442-.442H65.6v-.131H60.975v9.007Z" transform="translate(-60.09 -46.56)" fill="#336"/>
</g>
<g id="Group_1498" data-name="Group 1498" transform="translate(175.786 192.652)">
<path id="Path_501" data-name="Path 501" d="M77.275,47.875A3.226,3.226,0,0,0,74.7,48.961a4.368,4.368,0,0,0-.945,2.975,4.53,4.53,0,0,0,.915,3.006A3.228,3.228,0,0,0,77.265,56a8.72,8.72,0,0,0,2.362-.372v1.005a7.15,7.15,0,0,1-2.533.382,4.3,4.3,0,0,1-3.378-1.327,5.452,5.452,0,0,1-1.186-3.77,5.991,5.991,0,0,1,.573-2.684,4.13,4.13,0,0,1,1.649-1.769,5.045,5.045,0,0,1,2.543-.623,6.138,6.138,0,0,1,2.724.573l-.483.985a5.258,5.258,0,0,0-2.252-.523Z" transform="translate(-72.078 -46.408)" fill="#003c75"/>
<path id="Path_502" data-name="Path 502" d="M77.086,57.467a4.743,4.743,0,0,1-3.709-1.478,5.9,5.9,0,0,1-1.3-4.061,6.371,6.371,0,0,1,.623-2.875,4.6,4.6,0,0,1,1.83-1.96A5.472,5.472,0,0,1,77.3,46.41a6.653,6.653,0,0,1,2.915.613.391.391,0,0,1,.221.251.45.45,0,0,1-.02.342l-.483.985a.431.431,0,0,1-.583.2,4.823,4.823,0,0,0-2.061-.483,2.768,2.768,0,0,0-2.242.935,3.972,3.972,0,0,0-.834,2.684,4.142,4.142,0,0,0,.8,2.714c.864,1.005,2.362,1.146,4.5.553a.455.455,0,0,1,.392.07.429.429,0,0,1,.171.352v1.005a.432.432,0,0,1-.291.412,7.827,7.827,0,0,1-2.694.412Zm.2-10.173a4.618,4.618,0,0,0-2.322.563,3.707,3.707,0,0,0-1.478,1.588,5.508,5.508,0,0,0-.523,2.483,5.091,5.091,0,0,0,1.076,3.478,3.847,3.847,0,0,0,3.046,1.176,7.462,7.462,0,0,0,2.091-.261V56.2c-2.222.513-3.84.2-4.845-.965a4.908,4.908,0,0,1-1.015-3.287,4.747,4.747,0,0,1,1.055-3.267,3.661,3.661,0,0,1,2.915-1.236,5.412,5.412,0,0,1,2.031.4l.09-.191A6.036,6.036,0,0,0,77.3,47.3Z" transform="translate(-72.08 -46.41)" fill="#336"/>
</g>
<g id="Group_1499" data-name="Group 1499" transform="translate(189.93 192.803)">
<path id="Path_503" data-name="Path 503" d="M94.089,56.891H92.943V52.237H87.736v4.654H86.59V47h1.146v4.212h5.207V47h1.146Z" transform="translate(-86.148 -46.558)" fill="#003c75"/>
<path id="Path_504" data-name="Path 504" d="M94.091,57.336H92.945a.446.446,0,0,1-.442-.442V52.682H88.181v4.212a.446.446,0,0,1-.442.442H86.592a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h1.146a.446.446,0,0,1,.442.442v3.77H92.5V47a.446.446,0,0,1,.442-.442h1.146a.446.446,0,0,1,.442.442v9.891A.446.446,0,0,1,94.091,57.336Zm-.7-.885h.261V47.445h-.261v3.77a.446.446,0,0,1-.442.442H87.738a.446.446,0,0,1-.442-.442v-3.77h-.261v9.007H87.3V52.239a.446.446,0,0,1,.442-.442h5.207a.446.446,0,0,1,.442.442Z" transform="translate(-86.15 -46.56)" fill="#336"/>
</g>
<g id="Group_1500" data-name="Group 1500" transform="translate(203.644 192.763)">
<path id="Path_505" data-name="Path 505" d="M107.819,56.892l-1.236-3.146h-3.971L101.4,56.892H100.23l3.91-9.932h.965L109,56.892h-1.186Zm-1.588-4.182-1.146-3.066c-.151-.392-.3-.864-.462-1.427a14.321,14.321,0,0,1-.422,1.427l-1.166,3.066Z" transform="translate(-99.79 -46.518)" fill="#003c75"/>
<path id="Path_506" data-name="Path 506" d="M109.008,57.336h-1.186a.436.436,0,0,1-.412-.281l-1.126-2.865h-3.357l-1.106,2.865a.446.446,0,0,1-.412.281h-1.166a.445.445,0,0,1-.422-.6l3.91-9.932a.446.446,0,0,1,.412-.281h.965a.436.436,0,0,1,.412.281l3.89,9.932a.431.431,0,0,1-.05.412.444.444,0,0,1-.372.191Zm-.885-.885h.241L104.826,47.4h-.362l-3.558,9.047h.211l1.106-2.865a.446.446,0,0,1,.412-.281h3.971a.436.436,0,0,1,.412.281l1.126,2.865Zm-1.89-3.3h-3.2a.426.426,0,0,1-.362-.191.46.46,0,0,1-.05-.412l1.166-3.066c.171-.493.312-.955.412-1.367a.439.439,0,0,1,.422-.342.465.465,0,0,1,.442.322c.151.543.3,1.015.442,1.387l1.156,3.066a.46.46,0,0,1-.05.412.439.439,0,0,1-.362.191Zm-2.553-.885h1.92l-.925-2.463s-.02-.05-.03-.07c0,.02-.02.04-.02.07l-.935,2.473Z" transform="translate(-99.793 -46.52)" fill="#336"/>
</g>
<g id="Group_1501" data-name="Group 1501" transform="translate(226.59 192.652)">
<path id="Path_507" data-name="Path 507" d="M127.815,47.875a3.226,3.226,0,0,0-2.573,1.086,4.368,4.368,0,0,0-.945,2.975,4.53,4.53,0,0,0,.915,3.006A3.229,3.229,0,0,0,127.8,56a8.72,8.72,0,0,0,2.362-.372v1.005a7.15,7.15,0,0,1-2.533.382,4.3,4.3,0,0,1-3.378-1.327,5.452,5.452,0,0,1-1.186-3.77,5.991,5.991,0,0,1,.573-2.684,4.13,4.13,0,0,1,1.649-1.769,5.045,5.045,0,0,1,2.543-.623,6.138,6.138,0,0,1,2.724.573l-.483.985a5.258,5.258,0,0,0-2.252-.523Z" transform="translate(-122.618 -46.408)" fill="#003c75"/>
<path id="Path_508" data-name="Path 508" d="M127.626,57.467a4.743,4.743,0,0,1-3.709-1.478,5.9,5.9,0,0,1-1.3-4.061,6.37,6.37,0,0,1,.623-2.875,4.6,4.6,0,0,1,1.829-1.96,5.472,5.472,0,0,1,2.764-.684,6.653,6.653,0,0,1,2.915.613.391.391,0,0,1,.221.251.449.449,0,0,1-.02.342l-.482.985a.431.431,0,0,1-.583.2,4.823,4.823,0,0,0-2.061-.483,2.768,2.768,0,0,0-2.242.935,3.972,3.972,0,0,0-.834,2.684,4.142,4.142,0,0,0,.8,2.714c.864,1.005,2.362,1.146,4.5.553a.455.455,0,0,1,.392.07.429.429,0,0,1,.171.352v1.005a.431.431,0,0,1-.292.412,7.826,7.826,0,0,1-2.694.412Zm.2-10.173a4.618,4.618,0,0,0-2.322.563,3.707,3.707,0,0,0-1.478,1.588,5.508,5.508,0,0,0-.523,2.483,5.09,5.09,0,0,0,1.076,3.478,3.847,3.847,0,0,0,3.046,1.176,7.462,7.462,0,0,0,2.091-.261V56.2c-2.222.513-3.84.2-4.845-.965a4.908,4.908,0,0,1-1.015-3.287,4.747,4.747,0,0,1,1.056-3.267,3.661,3.661,0,0,1,2.915-1.236,5.412,5.412,0,0,1,2.031.4l.09-.191a6.036,6.036,0,0,0-2.111-.352Z" transform="translate(-122.62 -46.41)" fill="#336"/>
</g>
<g id="Group_1502" data-name="Group 1502" transform="translate(240.733 192.803)">
<path id="Path_509" data-name="Path 509" d="M144.629,56.891h-1.146V52.237h-5.207v4.654H137.13V47h1.146v4.212h5.207V47h1.146Z" transform="translate(-136.688 -46.558)" fill="#003c75"/>
<path id="Path_510" data-name="Path 510" d="M144.631,57.336h-1.146a.446.446,0,0,1-.442-.442V52.682h-4.322v4.212a.446.446,0,0,1-.442.442h-1.146a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h1.146a.446.446,0,0,1,.442.442v3.77h4.322V47a.446.446,0,0,1,.442-.442h1.146a.446.446,0,0,1,.442.442v9.891A.446.446,0,0,1,144.631,57.336Zm-.7-.885h.261V47.445h-.261v3.77a.446.446,0,0,1-.442.442h-5.207a.446.446,0,0,1-.442-.442v-3.77h-.261v9.007h.261V52.239a.446.446,0,0,1,.442-.442h5.207a.446.446,0,0,1,.442.442Z" transform="translate(-136.69 -46.56)" fill="#336"/>
</g>
<g id="Group_1503" data-name="Group 1503" transform="translate(255.801 192.803)">
<path id="Path_511" data-name="Path 511" d="M157.639,56.891H152.12V47h5.519v1.025h-4.363v3.187h4.1v1.015h-4.1v3.639h4.363Z" transform="translate(-151.678 -46.558)" fill="#003c75"/>
<path id="Path_512" data-name="Path 512" d="M157.641,57.336h-5.519a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h5.519a.446.446,0,0,1,.442.442v1.025a.446.446,0,0,1-.442.442h-3.92v2.3h3.659a.446.446,0,0,1,.442.442v1.015a.446.446,0,0,1-.442.442h-3.659v2.754h3.92a.446.446,0,0,1,.442.442v1.025A.446.446,0,0,1,157.641,57.336Zm-5.076-.885h4.624v-.141h-3.92a.446.446,0,0,1-.442-.442V52.229a.446.446,0,0,1,.442-.442h3.659v-.131h-3.659a.446.446,0,0,1-.442-.442V48.028a.446.446,0,0,1,.442-.442h3.92v-.131h-4.624v9.007Z" transform="translate(-151.68 -46.56)" fill="#336"/>
</g>
<g id="Group_1504" data-name="Group 1504" transform="translate(268.397 192.803)">
<path id="Path_513" data-name="Path 513" d="M169.013,56.891l-3.357-8.776h-.05q.091,1.04.09,2.473v6.293H164.63V46.99h1.729l3.136,8.162h.05L172.7,46.99h1.719v9.891h-1.146V50.508q0-1.1.09-2.382h-.05l-3.388,8.755H169Z" transform="translate(-164.208 -46.558)" fill="#003c75"/>
<path id="Path_514" data-name="Path 514" d="M174.433,57.336h-1.146a.446.446,0,0,1-.442-.442V50.621l-2.483,6.433a.446.446,0,0,1-.412.281h-.925a.436.436,0,0,1-.412-.281l-2.453-6.423v6.262a.446.446,0,0,1-.442.442h-1.065a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h1.729a.436.436,0,0,1,.412.281L169.538,54l2.774-7.157a.446.446,0,0,1,.412-.281h1.719a.446.446,0,0,1,.442.442v9.891a.446.446,0,0,1-.442.442Zm-.7-.885h.261V47.445h-.975l-3.046,7.881a.446.446,0,0,1-.412.281h-.05a.436.436,0,0,1-.412-.281l-3.026-7.881h-.985v9.007h.171V50.6c0-.935-.03-1.759-.09-2.433a.469.469,0,0,1,.111-.342.44.44,0,0,1,.332-.141h.05a.436.436,0,0,1,.412.281l3.247,8.484h.322l3.277-8.474a.446.446,0,0,1,.412-.281h.05a.434.434,0,0,1,.322.141.454.454,0,0,1,.121.332c-.06.844-.09,1.638-.09,2.352Z" transform="translate(-164.21 -46.56)" fill="#336"/>
</g>
<g id="Group_1505" data-name="Group 1505" transform="translate(285.767 192.803)">
<path id="Path_515" data-name="Path 515" d="M181.92,56.891V47h1.146v9.891Z" transform="translate(-181.488 -46.558)" fill="#003c75"/>
<path id="Path_516" data-name="Path 516" d="M183.078,57.336h-1.146a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h1.146a.446.446,0,0,1,.442.442v9.891A.446.446,0,0,1,183.078,57.336Zm-.7-.885h.261V47.445h-.261Z" transform="translate(-181.49 -46.56)" fill="#336"/>
</g>
<g id="Group_1506" data-name="Group 1506" transform="translate(293.969 192.652)">
<path id="Path_517" data-name="Path 517" d="M194.845,47.875a3.226,3.226,0,0,0-2.573,1.086,4.368,4.368,0,0,0-.945,2.975,4.531,4.531,0,0,0,.915,3.006A3.229,3.229,0,0,0,194.835,56a8.719,8.719,0,0,0,2.362-.372v1.005a7.15,7.15,0,0,1-2.533.382,4.3,4.3,0,0,1-3.378-1.327,5.452,5.452,0,0,1-1.186-3.77,5.991,5.991,0,0,1,.573-2.684,4.13,4.13,0,0,1,1.649-1.769,5.045,5.045,0,0,1,2.543-.623,6.138,6.138,0,0,1,2.724.573l-.482.985a5.258,5.258,0,0,0-2.252-.523Z" transform="translate(-189.648 -46.408)" fill="#003c75"/>
<path id="Path_518" data-name="Path 518" d="M194.656,57.467a4.743,4.743,0,0,1-3.709-1.478,5.9,5.9,0,0,1-1.3-4.061,6.371,6.371,0,0,1,.623-2.875,4.6,4.6,0,0,1,1.83-1.96,5.472,5.472,0,0,1,2.764-.684,6.653,6.653,0,0,1,2.915.613.392.392,0,0,1,.221.251.45.45,0,0,1-.02.342l-.482.985a.431.431,0,0,1-.583.2,4.823,4.823,0,0,0-2.061-.483,2.767,2.767,0,0,0-2.242.935,3.972,3.972,0,0,0-.834,2.684,4.142,4.142,0,0,0,.8,2.714c.865,1.005,2.362,1.146,4.5.553a.455.455,0,0,1,.392.07.429.429,0,0,1,.171.352v1.005a.432.432,0,0,1-.292.412,7.826,7.826,0,0,1-2.694.412Zm.2-10.173a4.618,4.618,0,0,0-2.322.563,3.707,3.707,0,0,0-1.478,1.588,5.508,5.508,0,0,0-.523,2.483,5.091,5.091,0,0,0,1.076,3.478,3.847,3.847,0,0,0,3.046,1.176,7.462,7.462,0,0,0,2.091-.261V56.2c-2.221.513-3.84.2-4.845-.965a4.908,4.908,0,0,1-1.015-3.287,4.747,4.747,0,0,1,1.055-3.267,3.661,3.661,0,0,1,2.915-1.236,5.412,5.412,0,0,1,2.031.4l.09-.191a6.036,6.036,0,0,0-2.111-.352Z" transform="translate(-189.65 -46.41)" fill="#336"/>
</g>
<g id="Group_1507" data-name="Group 1507" transform="translate(306.738 192.763)">
<path id="Path_519" data-name="Path 519" d="M210.369,56.892l-1.236-3.146h-3.971l-1.216,3.146H202.78l3.91-9.932h.965l3.89,9.932h-1.186Zm-1.588-4.182-1.146-3.066c-.151-.392-.3-.864-.462-1.427a14.3,14.3,0,0,1-.422,1.427l-1.166,3.066Z" transform="translate(-202.351 -46.518)" fill="#003c75"/>
<path id="Path_520" data-name="Path 520" d="M211.568,57.336h-1.186a.436.436,0,0,1-.412-.281l-1.126-2.865h-3.357l-1.106,2.865a.446.446,0,0,1-.412.281H202.8a.445.445,0,0,1-.422-.6l3.91-9.932a.446.446,0,0,1,.412-.281h.965a.436.436,0,0,1,.412.281l3.89,9.932a.431.431,0,0,1-.05.412.444.444,0,0,1-.372.191Zm-.885-.885h.241L207.386,47.4h-.362l-3.558,9.047h.211l1.106-2.865a.446.446,0,0,1,.412-.281h3.971a.436.436,0,0,1,.412.281l1.126,2.865Zm-1.89-3.3h-3.2a.426.426,0,0,1-.362-.191.46.46,0,0,1-.05-.412l1.166-3.066c.171-.493.312-.955.412-1.367a.439.439,0,0,1,.422-.342.465.465,0,0,1,.442.322c.151.543.3,1.015.442,1.387l1.156,3.066a.459.459,0,0,1-.05.412.439.439,0,0,1-.362.191Zm-2.553-.885h1.92l-.925-2.463s-.02-.05-.03-.07c0,.02-.02.04-.02.07l-.935,2.473Z" transform="translate(-202.353 -46.52)" fill="#336"/>
</g>
<g id="Group_1508" data-name="Group 1508" transform="translate(321.733 192.803)">
<path id="Path_521" data-name="Path 521" d="M217.71,56.891V47h1.146v8.856h4.363V56.9H217.7Z" transform="translate(-217.268 -46.558)" fill="#003c75"/>
<path id="Path_522" data-name="Path 522" d="M223.231,57.336h-5.519a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h1.146A.446.446,0,0,1,219.3,47v8.4h3.92a.446.446,0,0,1,.442.442v1.045a.446.446,0,0,1-.442.442Zm-5.076-.885h4.624V56.3h-3.92a.446.446,0,0,1-.442-.442v-8.4h-.261v9.007Z" transform="translate(-217.27 -46.56)" fill="#336"/>
</g>
<g id="Group_1509" data-name="Group 1509" transform="translate(333.153 192.652)">
<path id="Path_523" data-name="Path 523" d="M235.292,54.258a2.429,2.429,0,0,1-.945,2.041,4.142,4.142,0,0,1-2.573.734,6.435,6.435,0,0,1-2.7-.452V55.475a6.4,6.4,0,0,0,1.327.4,6.938,6.938,0,0,0,1.417.151,2.847,2.847,0,0,0,1.729-.432,1.447,1.447,0,0,0,.583-1.216,1.547,1.547,0,0,0-.211-.844,1.9,1.9,0,0,0-.694-.6,10.945,10.945,0,0,0-1.468-.633,4.756,4.756,0,0,1-1.97-1.166,2.6,2.6,0,0,1-.593-1.769,2.2,2.2,0,0,1,.864-1.819,3.554,3.554,0,0,1,2.272-.673,6.817,6.817,0,0,1,2.714.543l-.362,1.005a6.132,6.132,0,0,0-2.382-.513,2.3,2.3,0,0,0-1.427.392,1.29,1.29,0,0,0-.513,1.086,1.641,1.641,0,0,0,.191.844,1.789,1.789,0,0,0,.643.6,8,8,0,0,0,1.377.6,5.533,5.533,0,0,1,2.141,1.186,2.329,2.329,0,0,1,.583,1.649Z" transform="translate(-228.628 -46.408)" fill="#003c75"/>
<path id="Path_524" data-name="Path 524" d="M231.776,57.467a6.792,6.792,0,0,1-2.895-.493.448.448,0,0,1-.251-.4V55.467a.429.429,0,0,1,.2-.372.449.449,0,0,1,.422-.04,6.46,6.46,0,0,0,1.246.382,6.692,6.692,0,0,0,1.327.141,2.483,2.483,0,0,0,1.468-.352,1.011,1.011,0,0,0,.4-.864,1.139,1.139,0,0,0-.141-.6,1.546,1.546,0,0,0-.533-.452,8.992,8.992,0,0,0-1.4-.593,5.126,5.126,0,0,1-2.161-1.3,3.048,3.048,0,0,1-.7-2.061,2.632,2.632,0,0,1,1.025-2.171,4.025,4.025,0,0,1,2.553-.774,7.062,7.062,0,0,1,2.9.583.444.444,0,0,1,.241.553l-.362,1.005a.473.473,0,0,1-.241.261.43.43,0,0,1-.352,0,5.636,5.636,0,0,0-2.211-.483,1.887,1.887,0,0,0-1.156.3.862.862,0,0,0-.342.734,1.222,1.222,0,0,0,.131.623,1.4,1.4,0,0,0,.482.442,6.755,6.755,0,0,0,1.3.563,5.849,5.849,0,0,1,2.322,1.307,2.8,2.8,0,0,1,.7,1.95,2.857,2.857,0,0,1-1.116,2.392,4.576,4.576,0,0,1-2.845.824Zm-2.262-1.2a6.862,6.862,0,0,0,2.262.3,3.724,3.724,0,0,0,2.3-.643,1.989,1.989,0,0,0,.774-1.689,1.869,1.869,0,0,0-.472-1.347,5.01,5.01,0,0,0-1.96-1.076,8.118,8.118,0,0,1-1.457-.643,1.943,1.943,0,0,1-1.046-1.829,1.759,1.759,0,0,1,.694-1.447,2.748,2.748,0,0,1,1.7-.483,6.393,6.393,0,0,1,2.121.382l.06-.171a6.524,6.524,0,0,0-2.151-.352,3.137,3.137,0,0,0-2,.583,1.757,1.757,0,0,0-.694,1.468,2.162,2.162,0,0,0,.482,1.478,4.3,4.3,0,0,0,1.789,1.045,9.489,9.489,0,0,1,1.548.663,2.323,2.323,0,0,1,.844.754,2.01,2.01,0,0,1,.271,1.076,1.871,1.871,0,0,1-.764,1.568,3.27,3.27,0,0,1-2,.523,7.06,7.06,0,0,1-1.508-.161c-.271-.06-.543-.121-.794-.2v.181Z" transform="translate(-228.63 -46.41)" fill="#336"/>
</g>
<g id="Group_1510" data-name="Group 1510" transform="translate(354.724 192.803)">
<path id="Path_525" data-name="Path 525" d="M258.431,51.845a5.018,5.018,0,0,1-1.327,3.749,5.258,5.258,0,0,1-3.83,1.3H250.53V47h3.036a4.434,4.434,0,0,1,4.865,4.845Zm-1.216.04a3.96,3.96,0,0,0-.975-2.915,3.887,3.887,0,0,0-2.885-.985h-1.669v7.9h1.4a4.285,4.285,0,0,0,3.1-1.015,3.986,3.986,0,0,0,1.035-3Z" transform="translate(-250.088 -46.558)" fill="#003c75"/>
<path id="Path_526" data-name="Path 526" d="M253.277,57.336h-2.744a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h3.036a4.875,4.875,0,0,1,5.307,5.3,5.418,5.418,0,0,1-1.468,4.061,5.7,5.7,0,0,1-4.141,1.417Zm-2.292-.885h2.292a4.87,4.87,0,0,0,3.518-1.166,4.6,4.6,0,0,0,1.2-3.428,4,4,0,0,0-4.423-4.4h-2.583v9.007Zm2.111-.111h-1.4a.446.446,0,0,1-.442-.442V48a.446.446,0,0,1,.442-.442h1.669a4.319,4.319,0,0,1,3.207,1.116,4.387,4.387,0,0,1,1.1,3.227,4.522,4.522,0,0,1-1.166,3.317,4.712,4.712,0,0,1-3.408,1.136Zm-.955-.885h.955a3.888,3.888,0,0,0,2.784-.885,3.585,3.585,0,0,0,.9-2.674,3.006,3.006,0,0,0-3.418-3.448h-1.226v7.016Z" transform="translate(-250.09 -46.56)" fill="#336"/>
</g>
<g id="Group_1511" data-name="Group 1511" transform="translate(368.338 192.763)">
<path id="Path_527" data-name="Path 527" d="M271.649,56.892l-1.236-3.146h-3.971l-1.216,3.146H264.06l3.91-9.932h.965l3.89,9.932h-1.186Zm-1.588-4.182-1.146-3.066c-.151-.392-.3-.864-.462-1.427a14.326,14.326,0,0,1-.422,1.427l-1.166,3.066h3.2Z" transform="translate(-263.63 -46.518)" fill="#003c75"/>
<path id="Path_528" data-name="Path 528" d="M272.848,57.336h-1.186a.436.436,0,0,1-.412-.281l-1.126-2.865h-3.357l-1.106,2.865a.446.446,0,0,1-.412.281h-1.166a.445.445,0,0,1-.422-.6l3.91-9.932a.446.446,0,0,1,.412-.281h.965a.436.436,0,0,1,.412.281l3.89,9.932a.431.431,0,0,1-.05.412.444.444,0,0,1-.372.191Zm-.885-.885h.241L268.666,47.4H268.3l-3.558,9.047h.211l1.106-2.865a.446.446,0,0,1,.412-.281h3.971a.436.436,0,0,1,.412.281l1.126,2.865Zm-1.89-3.3h-3.2a.426.426,0,0,1-.362-.191.459.459,0,0,1-.05-.412l1.166-3.066c.171-.493.312-.955.412-1.367a.439.439,0,0,1,.422-.342.456.456,0,0,1,.442.322c.151.543.3,1.015.442,1.387l1.156,3.066a.46.46,0,0,1-.05.412.439.439,0,0,1-.362.191Zm-2.553-.885h1.92l-.925-2.463s-.02-.05-.03-.07c0,.02-.02.04-.02.07l-.935,2.473Z" transform="translate(-263.633 -46.52)" fill="#336"/>
</g>
<g id="Group_1512" data-name="Group 1512" transform="translate(382.096 192.803)">
<path id="Path_529" data-name="Path 529" d="M282.042,56.891H280.9V48.015H277.76V46.99h7.419v1.025h-3.136Z" transform="translate(-277.318 -46.558)" fill="#003c75"/>
<path id="Path_530" data-name="Path 530" d="M282.044,57.336H280.9a.446.446,0,0,1-.442-.442V48.47h-2.694a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h7.418a.446.446,0,0,1,.442.442v1.025a.446.446,0,0,1-.442.442h-2.694v8.424A.446.446,0,0,1,282.044,57.336Zm-.7-.885h.261V48.028a.446.446,0,0,1,.442-.442h2.694v-.131h-6.524v.131h2.694a.446.446,0,0,1,.442.442v8.424Z" transform="translate(-277.32 -46.56)" fill="#336"/>
</g>
<g id="Group_1513" data-name="Group 1513" transform="translate(394.504 192.763)">
<path id="Path_531" data-name="Path 531" d="M297.679,56.892l-1.237-3.146h-3.971l-1.216,3.146H290.09L294,46.96h.965l3.89,9.932h-1.186Zm-1.588-4.182-1.146-3.066c-.151-.392-.3-.864-.462-1.427a14.322,14.322,0,0,1-.422,1.427l-1.166,3.066Z" transform="translate(-289.66 -46.518)" fill="#003c75"/>
<path id="Path_532" data-name="Path 532" d="M298.878,57.336h-1.186a.436.436,0,0,1-.412-.281l-1.126-2.865H292.8l-1.106,2.865a.446.446,0,0,1-.412.281h-1.166a.445.445,0,0,1-.422-.6L293.6,46.8a.446.446,0,0,1,.412-.281h.965a.436.436,0,0,1,.412.281l3.89,9.932a.431.431,0,0,1-.05.412.457.457,0,0,1-.372.191Zm-.884-.885h.241L294.686,47.4h-.362l-3.558,9.047h.211l1.106-2.865a.446.446,0,0,1,.412-.281h3.971a.436.436,0,0,1,.412.281L298,56.452Zm-1.89-3.3h-3.2a.426.426,0,0,1-.362-.191.459.459,0,0,1-.05-.412l1.166-3.066c.171-.493.312-.955.412-1.367a.439.439,0,0,1,.422-.342.465.465,0,0,1,.442.322c.151.543.3,1.015.442,1.387l1.156,3.066a.46.46,0,0,1-.05.412.439.439,0,0,1-.362.191Zm-2.553-.885h1.92l-.925-2.463s-.02-.05-.03-.07c0,.02-.02.05-.02.07l-.935,2.473Z" transform="translate(-289.663 -46.52)" fill="#336"/>
</g>
<g id="Group_1514" data-name="Group 1514" transform="translate(409.509 192.803)">
<path id="Path_533" data-name="Path 533" d="M305.02,46.99h2.795a5.238,5.238,0,0,1,2.845.593,2.091,2.091,0,0,1,.885,1.86,2.125,2.125,0,0,1-.492,1.448,2.362,2.362,0,0,1-1.427.744v.07c1.5.261,2.252,1.045,2.252,2.372a2.551,2.551,0,0,1-.895,2.071,3.831,3.831,0,0,1-2.5.744H305.03V47Zm1.146,4.232h1.9a3.086,3.086,0,0,0,1.749-.382,1.479,1.479,0,0,0,.533-1.287,1.33,1.33,0,0,0-.593-1.206,3.736,3.736,0,0,0-1.89-.372h-1.689v3.237Zm0,.975v3.7h2.061a2.915,2.915,0,0,0,1.8-.462,1.716,1.716,0,0,0,.6-1.448,1.552,1.552,0,0,0-.623-1.357,3.289,3.289,0,0,0-1.88-.432h-1.97Z" transform="translate(-304.588 -46.558)" fill="#003c75"/>
<path id="Path_534" data-name="Path 534" d="M308.48,57.336h-3.448a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h2.794a5.644,5.644,0,0,1,3.1.663A2.51,2.51,0,0,1,312,49.455a2.6,2.6,0,0,1-.593,1.739,2.753,2.753,0,0,1-.513.452,2.559,2.559,0,0,1,1.447,2.443,2.977,2.977,0,0,1-1.055,2.413,4.254,4.254,0,0,1-2.795.844Zm-3.006-.885h3.006a3.365,3.365,0,0,0,2.221-.643,2.109,2.109,0,0,0,.734-1.729,1.679,1.679,0,0,0-.965-1.659,2.022,2.022,0,0,1,.623,1.568,2.14,2.14,0,0,1-.784,1.809,3.341,3.341,0,0,1-2.071.553h-2.061a.446.446,0,0,1-.442-.442v-3.7a.446.446,0,0,1,.442-.442h1.97a5.693,5.693,0,0,1,1.066.09.341.341,0,0,1-.02-.141v-.131a5.282,5.282,0,0,1-1.116.1h-1.9a.446.446,0,0,1-.442-.442V48.007a.446.446,0,0,1,.442-.442h1.689A4.121,4.121,0,0,1,310,48a1.726,1.726,0,0,1,.8,1.578,2.12,2.12,0,0,1-.372,1.307,1.432,1.432,0,0,0,.292-.261,1.7,1.7,0,0,0,.382-1.166,1.633,1.633,0,0,0-.683-1.488,4.919,4.919,0,0,0-2.6-.513h-2.352v9.007Zm1.146-.985h1.618a2.55,2.55,0,0,0,1.538-.372,1.282,1.282,0,0,0,.432-1.1,1.043,1.043,0,0,0-.432-.985,2.943,2.943,0,0,0-1.628-.352h-1.528v2.815Zm0-4.674h1.448a2.6,2.6,0,0,0,1.5-.3,1.074,1.074,0,0,0,.352-.925.861.861,0,0,0-.382-.824,3.2,3.2,0,0,0-1.659-.3h-1.246v2.352Z" transform="translate(-304.59 -46.56)" fill="#336"/>
</g>
<g id="Group_1515" data-name="Group 1515" transform="translate(421.986 192.763)">
<path id="Path_535" data-name="Path 535" d="M325.019,56.892l-1.236-3.146h-3.971L318.6,56.892H317.43l3.91-9.932h.965l3.89,9.932h-1.186Zm-1.588-4.182-1.146-3.066c-.151-.392-.3-.864-.463-1.427a14.318,14.318,0,0,1-.422,1.427l-1.166,3.066h3.2Z" transform="translate(-317 -46.518)" fill="#003c75"/>
<path id="Path_536" data-name="Path 536" d="M326.218,57.336h-1.186a.436.436,0,0,1-.412-.281l-1.126-2.865h-3.357l-1.106,2.865a.446.446,0,0,1-.412.281h-1.166a.445.445,0,0,1-.422-.6l3.91-9.932a.446.446,0,0,1,.412-.281h.965a.436.436,0,0,1,.412.281l3.89,9.932a.431.431,0,0,1-.05.412.457.457,0,0,1-.372.191Zm-.885-.885h.241L322.026,47.4h-.362l-3.559,9.047h.211l1.106-2.865a.446.446,0,0,1,.412-.281h3.971a.436.436,0,0,1,.412.281l1.126,2.865Zm-1.89-3.3h-3.2a.426.426,0,0,1-.362-.191.46.46,0,0,1-.05-.412L321,49.485c.171-.493.312-.955.412-1.367a.439.439,0,0,1,.422-.342.456.456,0,0,1,.442.322c.151.543.3,1.015.442,1.387l1.156,3.066a.46.46,0,0,1-.05.412.439.439,0,0,1-.362.191Zm-2.553-.885h1.92l-.925-2.463s-.02-.05-.03-.07c0,.02-.02.05-.02.07L320.9,52.28Z" transform="translate(-317.003 -46.52)" fill="#336"/>
</g>
<g id="Group_1516" data-name="Group 1516" transform="translate(436.338 192.662)">
<path id="Path_537" data-name="Path 537" d="M337.952,54.258a2.43,2.43,0,0,1-.945,2.041,4.079,4.079,0,0,1-2.573.734,6.436,6.436,0,0,1-2.7-.452V55.475a6.4,6.4,0,0,0,1.327.4,6.939,6.939,0,0,0,1.417.151A2.848,2.848,0,0,0,336.2,55.6a1.422,1.422,0,0,0,.583-1.216,1.547,1.547,0,0,0-.211-.844,1.9,1.9,0,0,0-.694-.6,10.945,10.945,0,0,0-1.468-.633,4.756,4.756,0,0,1-1.97-1.166,2.578,2.578,0,0,1-.593-1.769,2.2,2.2,0,0,1,.865-1.819,3.554,3.554,0,0,1,2.272-.673,6.816,6.816,0,0,1,2.714.543l-.362,1.005a6.131,6.131,0,0,0-2.382-.513,2.3,2.3,0,0,0-1.427.392,1.29,1.29,0,0,0-.513,1.086,1.641,1.641,0,0,0,.191.844,1.787,1.787,0,0,0,.643.6,8.367,8.367,0,0,0,1.377.6,5.531,5.531,0,0,1,2.141,1.186,2.329,2.329,0,0,1,.583,1.649Z" transform="translate(-331.278 -46.418)" fill="#003c75"/>
<path id="Path_538" data-name="Path 538" d="M334.426,57.467a6.792,6.792,0,0,1-2.9-.493.448.448,0,0,1-.251-.4V55.467a.429.429,0,0,1,.2-.372.449.449,0,0,1,.422-.04,6.458,6.458,0,0,0,1.246.382,6.692,6.692,0,0,0,1.327.141,2.483,2.483,0,0,0,1.468-.352,1,1,0,0,0,.4-.854,1.138,1.138,0,0,0-.141-.6,1.546,1.546,0,0,0-.533-.452,9.455,9.455,0,0,0-1.4-.593,5.127,5.127,0,0,1-2.161-1.3,3.049,3.049,0,0,1-.7-2.061,2.632,2.632,0,0,1,1.025-2.171,4.024,4.024,0,0,1,2.553-.774,7.062,7.062,0,0,1,2.9.583.444.444,0,0,1,.241.553l-.362,1.005a.473.473,0,0,1-.241.261.429.429,0,0,1-.352,0,5.637,5.637,0,0,0-2.212-.482,1.887,1.887,0,0,0-1.156.3.862.862,0,0,0-.342.734,1.224,1.224,0,0,0,.131.623,1.4,1.4,0,0,0,.482.442,6.758,6.758,0,0,0,1.3.563,5.849,5.849,0,0,1,2.322,1.307,2.8,2.8,0,0,1,.7,1.95,2.858,2.858,0,0,1-1.116,2.392,4.577,4.577,0,0,1-2.845.824Zm-2.262-1.2a6.861,6.861,0,0,0,2.262.3,3.789,3.789,0,0,0,2.3-.633,1.989,1.989,0,0,0,.774-1.689,1.869,1.869,0,0,0-.472-1.347,5.012,5.012,0,0,0-1.96-1.076,8.11,8.11,0,0,1-1.458-.643,1.943,1.943,0,0,1-1.045-1.829,1.778,1.778,0,0,1,.683-1.447,2.748,2.748,0,0,1,1.7-.483,6.394,6.394,0,0,1,2.121.382l.06-.171a6.524,6.524,0,0,0-2.151-.352,3.137,3.137,0,0,0-2,.583,1.757,1.757,0,0,0-.694,1.468,2.161,2.161,0,0,0,.483,1.478,4.294,4.294,0,0,0,1.789,1.045,9.486,9.486,0,0,1,1.548.663,2.322,2.322,0,0,1,.844.754,2.01,2.01,0,0,1,.271,1.076,1.871,1.871,0,0,1-.764,1.568,3.27,3.27,0,0,1-2,.523,7.059,7.059,0,0,1-1.508-.161c-.271-.06-.543-.121-.794-.2v.181Z" transform="translate(-331.28 -46.42)" fill="#336"/>
</g>
<g id="Group_1517" data-name="Group 1517" transform="translate(449.456 192.803)">
<path id="Path_539" data-name="Path 539" d="M350.289,56.891H344.77V47h5.519v1.025h-4.363v3.187h4.1v1.015h-4.1v3.639h4.363Z" transform="translate(-344.328 -46.558)" fill="#003c75"/>
<path id="Path_540" data-name="Path 540" d="M350.291,57.336h-5.519a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h5.519a.446.446,0,0,1,.442.442v1.025a.446.446,0,0,1-.442.442h-3.92v2.3h3.659a.446.446,0,0,1,.442.442v1.015a.446.446,0,0,1-.442.442H346.37v2.754h3.92a.446.446,0,0,1,.442.442v1.025A.446.446,0,0,1,350.291,57.336Zm-5.076-.885h4.624v-.141h-3.92a.446.446,0,0,1-.442-.442V52.229a.446.446,0,0,1,.442-.442h3.659v-.131h-3.659a.446.446,0,0,1-.442-.442V48.028a.446.446,0,0,1,.442-.442h3.92v-.131h-4.624v9.007Z" transform="translate(-344.33 -46.56)" fill="#336"/>
</g>
</g>
</g>
</svg>


View file

@@ -1,184 +0,0 @@
<!-- Start of Injectable ECHA Header Block (v7 - Dynamic Data) -->
<style>
/* ECHA Header Styles - Based on V5/V6 */
.echa-header-injected { /* Wrapper class */
width: 100%;
box-shadow: 0 1px 3px rgba(0, 0, 0, 0.1), 0 1px 2px rgba(0,0,0,0.06);
font-family: system-ui, -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, 'Open Sans', 'Helvetica Neue', sans-serif;
box-sizing: border-box;
background-color: #ffffff;
line-height: 1.4;
}
.echa-header-injected *, .echa-header-injected *::before, .echa-header-injected *::after {
box-sizing: inherit;
}
/* Top white bar - das-top-nav */
.echa-header-injected .das-top-nav {
background-color: #ffffff;
display: flex;
align-items: stretch;
padding: 8px 25px;
border-bottom: 1px solid #e7e7e7;
min-height: 55px;
}
.echa-header-injected .logo-container {
display: flex;
align-items: center;
gap: 20px;
}
.echa-header-injected .logo-main img {
height: 38px;
width: auto;
display: block;
border: 0;
}
.echa-header-injected .logo-part-of {
display: flex;
align-items: center;
padding-left: 20px;
border-left: 1px solid #cccccc;
height: 100%;
}
.echa-header-injected .logo-part-of img {
height: 18px;
width: auto;
display: block;
border: 0;
}
/* Bottom blue bar - das-primary-header_wrapper */
.echa-header-injected .das-primary-header_wrapper {
background-color: #005487;
background-image: linear-gradient(to bottom, rgba(255, 255, 255, 0.08), rgba(0, 0, 0, 0.05));
color: #ffffff;
padding: 12px 25px;
display: flex;
align-items: center;
gap: 15px;
}
.echa-header-injected .das-primary-header-info {
flex-grow: 1;
min-width: 0; /* Prevent flex item from overflowing */
}
/* Style for the substance link */
.echa-header-injected .substance-link {
color: #ffffff;
text-decoration: none;
display: block; /* Makes the whole H2 area clickable */
}
.echa-header-injected .substance-link:hover,
.echa-header-injected .substance-link:focus {
text-decoration: underline;
}
.echa-header-injected .das-primary-header-info h2 {
font-size: 1.5em; /* Set your desired FIXED font size */
margin: 0 0 4px 0;
line-height: 1.2; /* This will control spacing between lines if it wraps */
color: #ffffff;
font-weight: 600;
width: 100%; /* Constrains the text horizontally */
/* --- REMOVED ---
white-space: nowrap;
overflow: hidden;
text-overflow: ellipsis;
*/
/* --- ADDED (Recommended) --- */
white-space: normal; /* Explicitly allow wrapping (this is the default, but good for clarity) */
overflow-wrap: break-word; /* Helps break long words without spaces */
/* word-break: break-word; /* Alternative if overflow-wrap doesn't catch all cases */
/* Ensure overflow is visible (default, but explicit) */
overflow: visible;
}
.echa-header-injected .das-primary-header-info_details {
display: flex;
align-items: center;
gap: 18px;
flex-wrap: wrap;
}
.echa-header-injected .item {
display: flex;
align-items: baseline;
position: relative;
}
.echa-header-injected .item + .item::before {
content: '•';
color: #f5a623;
font-weight: bold;
font-size: 1.1em;
line-height: 1;
display: inline-block;
margin-right: 18px;
}
.echa-header-injected .item label {
font-size: 0.85em;
color: #e0eaf1;
margin-right: 8px;
font-weight: 400;
}
.echa-header-injected .item span {
font-size: 0.95em;
color: #ffffff;
font-weight: bold;
}
/* Minimal reset */
.echa-header-injected h2, .echa-header-injected span, .echa-header-injected label, .echa-header-injected div {
margin: 0; padding: 0;
}
.echa-header-injected a { color: inherit; text-decoration: none; } /* Basic reset for any links */
</style>
<header class="echa-header-injected" id="pdf-custom-header">
<div class="das-top-nav">
<div class="logo-container">
<div class="logo-main">
<!-- Logo link can be kept static or made dynamic if needed -->
<a title="ECHA Chemicals Database" href="/">
<img height="38" alt="ECHA Chemicals Database" src="##ECHA_CHEM_LOGO_SRC##">
</a>
</div>
<div class="logo-part-of">
<a title="visit ECHA website" target="_blank" rel="noopener noreferrer" href="https://echa.europa.eu/">
<img height="18" alt="European Chemicals Agency" src="##ECHA_LOGO_SRC##">
</a>
</div>
</div>
</div>
<div class="das-primary-header_wrapper">
<div class="das-primary-header-info">
<!-- ==== DYNAMIC CONTENT START ==== -->
<a href="##SUBSTANCE_LINK##" title="View substance details: ##SUBSTANCE_NAME##" class="substance-link">
<h2 class="das-text-truncate">##SUBSTANCE_NAME##</h2>
</a>
<div class="das-primary-header-info_details">
<div class="item">
<label>EC number</label>
<span>##EC_NUMBER##</span>
</div>
<div class="item">
<label>CAS number</label>
<span class="das-text-truncate">##CAS_NUMBER##</span>
</div>
</div>
<!-- ==== DYNAMIC CONTENT END ==== -->
</div>
</div>
</header>
<!-- End of Injectable ECHA Header Block (v7) -->

View file

@@ -16,6 +16,7 @@ dependencies = [
     "markdown-to-json>=2.1.2",
     "markdownify>=1.2.0",
     "playwright>=1.55.0",
+    "psycopg2>=2.9.11",
     "pubchemprops>=0.1.1",
     "pubchempy>=1.0.5",
     "pydantic>=2.11.10",

View file

@@ -1,28 +0,0 @@
[pytest]
# Pytest configuration for PIF Compiler
# Test discovery
testpaths = tests
python_files = test_*.py
python_classes = Test*
python_functions = test_*
# Output options
addopts =
-v
--strict-markers
--tb=short
--disable-warnings
# Markers for different test types
markers =
unit: Unit tests (fast, no external dependencies)
integration: Integration tests (may hit real APIs)
slow: Slow tests (skip by default)
database: Tests requiring MongoDB
# Coverage options (if pytest-cov is installed)
# addopts = --cov=src/pif_compiler --cov-report=html --cov-report=term
# Ignore patterns
norecursedirs = .git .venv __pycache__ *.egg-info dist build

View file

@@ -1,73 +1,423 @@
-from dataclasses import dataclass, field
-from typing import Dict, List, Optional, Any
-from datetime import datetime
-from pydantic import BaseModel, StringConstraints, Field
-from typing_extensions import Annotated
-from pif_compiler.classes.types_enum import CosmeticType, PhysicalForm, PlaceApplication, NormalUser, RoutesExposure, NanoRoutes
+from pydantic import BaseModel, Field, field_validator, ConfigDict, model_validator, computed_field
+import re
+from typing import List, Optional
+from datetime import datetime as dt
+from pif_compiler.services.srv_echa import extract_levels, at_extractor, rdt_extractor, orchestrator
+from pif_compiler.functions.db_utils import postgres_connect
+from pif_compiler.services.srv_pubchem import pubchem_dap
+from pif_compiler.services.srv_cosing import cosing_entry
class DapInfo(BaseModel):
cas: str
molecular_weight: Optional[float] = Field(default=None, description="In Daltons (Da)")
high_ionization: Optional[float] = Field(default=None, description="High degree of ionization")
log_pow: Optional[float] = Field(default=None, description="Partition coefficient")
tpsa: Optional[float] = Field(default=None, description="Topological polar surface area")
melting_point: Optional[float] = Field(default=None, description="In Celsius (°C)")
    # --- The computed DAP value ---
    # Defaults to 0.5 (50%); overwritten by the validator below
    dap_value: float = 0.5

    @model_validator(mode='after')
    def compute_dap(self):
        # List of conditions (True if the condition reduces absorption)
        conditions = []
        # 1. MW > 500 Da
        if self.molecular_weight is not None:
            conditions.append(self.molecular_weight > 500)
        # 2. High ionization (if True, reduces absorption)
        if self.high_ionization is not None:
            conditions.append(self.high_ionization is True)
        # 3. Log Pow <= -1 OR >= 4
        if self.log_pow is not None:
            conditions.append(self.log_pow <= -1 or self.log_pow >= 4)
        # 4. TPSA > 120 Å2
        if self.tpsa is not None:
            conditions.append(self.tpsa > 120)
        # 5. Melting point > 200 °C
        if self.melting_point is not None:
            conditions.append(self.melting_point > 200)
        # Final rule: if at least one absorption-reducing condition holds, the DAP is 0.1
        if any(conditions):
            self.dap_value = 0.1
        else:
            self.dap_value = 0.5
        return self
@classmethod
    def dap_builder(cls, dap_data: dict):
        """
        Builds a DapInfo object from the raw PubChem data.
        """
        desired_keys = ['CAS', 'MolecularWeight', 'XLogP', 'TPSA', 'Melting Point', 'Dissociation Constants']
        actual_keys = [key for key in dap_data.keys() if key in desired_keys]
        fields = {}
        for key in actual_keys:
            if key == 'CAS':
                fields['cas'] = dap_data[key]
            if key == 'MolecularWeight':
                fields['molecular_weight'] = float(dap_data[key])
            if key == 'XLogP':
                fields['log_pow'] = float(dap_data[key])
            if key == 'TPSA':
                fields['tpsa'] = float(dap_data[key])
            if key == 'Melting Point':
                try:
                    for item in dap_data[key]:
                        if '°C' in item['Value']:
                            mp_value = re.findall(r"[-+]?\d*\.\d+|\d+", item['Value'])
                            if mp_value:
                                fields['melting_point'] = float(mp_value[0])
                except (KeyError, TypeError, ValueError):
                    continue
            if key == 'Dissociation Constants':
                try:
                    for item in dap_data[key]:
                        if 'pKa' in item['Value']:
                            pk_value = re.findall(r"[-+]?\d*\.\d+|\d+", item['Value'])
                            if pk_value:
                                fields['high_ionization'] = float(pk_value[0])
                except (KeyError, TypeError, ValueError):
                    continue
        return cls(**fields)
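As a sanity check on the rule encoded by `compute_dap`, here is a minimal standalone sketch of the same decision logic (plain Python, no Pydantic; the thresholds are the ones listed in the validator above, and the example values are purely illustrative):

```python
# Standalone sketch of the DAP decision rule from DapInfo.compute_dap.
# Any single absorption-reducing property caps dermal absorption at 10%;
# otherwise the default 50% applies.

def dap_value(mw=None, high_ionization=None, log_pow=None, tpsa=None, melting_point=None):
    conditions = []
    if mw is not None:
        conditions.append(mw > 500)            # large molecules penetrate poorly
    if high_ionization is not None:
        conditions.append(high_ionization is True)
    if log_pow is not None:
        conditions.append(log_pow <= -1 or log_pow >= 4)  # too hydrophilic or too lipophilic
    if tpsa is not None:
        conditions.append(tpsa > 120)          # polar surface area in Å²
    if melting_point is not None:
        conditions.append(melting_point > 200) # °C
    return 0.1 if any(conditions) else 0.5

print(dap_value(mw=600))             # 0.1 — MW above 500 Da
print(dap_value(mw=300, log_pow=2))  # 0.5 — no reducing condition met
```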
class CosingInfo(BaseModel):
cas : List[str] = Field(default_factory=list)
common_names : List[str] = Field(default_factory=list)
inci : List[str] = Field(default_factory=list)
annex : List[str] = Field(default_factory=list)
functionName : List[str] = Field(default_factory=list)
otherRestrictions : List[str] = Field(default_factory=list)
    cosmeticRestriction : Optional[str] = None
@classmethod
    def cosing_builder(cls, cosing_data : dict):
        cosing_keys = ['nameOfCommonIngredientsGlossary', 'casNo', 'functionName', 'annexNo', 'refNo', 'otherRestrictions', 'cosmeticRestriction', 'inciName']
        keys = [k for k in cosing_data.keys() if k in cosing_keys]
        cosing_dict = {}
        for k in keys:
            if k == 'nameOfCommonIngredientsGlossary':
                cosing_dict['common_names'] = list(cosing_data[k])
            if k == 'inciName':
                cosing_dict['inci'] = list(cosing_data[k])
            if k == 'casNo':
                cosing_dict['cas'] = list(cosing_data[k])
            if k == 'functionName':
                cosing_dict['functionName'] = list(cosing_data[k])
            if k == 'annexNo':
                # Pair each annex number with its reference number
                cosing_dict['annex'] = [
                    f"{ann} / {ref}" for ann, ref in zip(cosing_data[k], cosing_data['refNo'])
                ]
            if k == 'otherRestrictions':
                cosing_dict['otherRestrictions'] = list(cosing_data[k])
            if k == 'cosmeticRestriction':
                cosing_dict['cosmeticRestriction'] = cosing_data[k]
        return cls(**cosing_dict)
@classmethod
def cycle_identified(cls, cosing_data : dict):
cosing_entries = []
if 'identifiedIngredient' in cosing_data.keys():
identified_cosing = cls.cosing_builder(cosing_data['identifiedIngredient'])
cosing_entries.append(identified_cosing)
main = cls.cosing_builder(cosing_data)
cosing_entries.append(main)
return cosing_entries
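The annex/reference pairing performed inside `cosing_builder` reduces to a `zip` over two parallel lists; a tiny sketch with made-up annex and reference numbers:

```python
# Sketch of the annexNo/refNo pairing done in CosingInfo.cosing_builder.
# The annex and reference values here are invented for illustration.
annex_no = ['III', 'V']
ref_no = ['98', '12']

annex = [f"{ann} / {ref}" for ann, ref in zip(annex_no, ref_no)]
print(annex)  # ['III / 98', 'V / 12']
```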
class ToxIndicator(BaseModel):
indicator : str
value : int
unit : str
route : str
toxicity_type : Optional[str] = None
ref : Optional[str] = None
@property
def priority_rank(self):
"""Returns the numerical priority based on the toxicological indicator."""
mapping = {
'LD50': 1,
'DL50': 1,
'LOAEL': 3,
'NOAEL': 4
}
return mapping.get(self.indicator, -1)
@property
def factor(self):
"""Returns the factor based on the toxicity type."""
if self.priority_rank == 1:
return 10
elif self.priority_rank == 3:
return 3
return 1
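The priority/factor scheme above can be exercised on its own; this sketch re-implements `priority_rank` and `factor` as plain functions (with a hypothetical mix of indicators, not real data) to show which indicator `Toxicity.set_best_case` would select:

```python
# Plain-function sketch of ToxIndicator.priority_rank and .factor.
PRIORITY = {'LD50': 1, 'DL50': 1, 'LOAEL': 3, 'NOAEL': 4}

def priority_rank(indicator: str) -> int:
    return PRIORITY.get(indicator, -1)

def factor(indicator: str) -> int:
    rank = priority_rank(indicator)
    if rank == 1:   # lethal-dose studies get the largest extrapolation factor
        return 10
    if rank == 3:   # LOAEL -> NOAEL extrapolation
        return 3
    return 1

indicators = ['LD50', 'LOAEL', 'NOAEL']
best = max(indicators, key=priority_rank)  # mirrors Toxicity.set_best_case
print(best, factor(best))  # NOAEL 1
```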
class Toxicity(BaseModel):
cas: str
indicators: list[ToxIndicator]
best_case: Optional[ToxIndicator] = None
factor: Optional[int] = None
@model_validator(mode='after')
def set_best_case(self) -> 'Toxicity':
if self.indicators:
self.best_case = max(self.indicators, key=lambda x: x.priority_rank)
self.factor = self.best_case.factor
return self
@classmethod
def from_result(cls, cas: str, result):
toxicity_types = ['repeated_dose_toxicity', 'acute_toxicity']
indicators_list = []
for tt in toxicity_types:
if tt not in result:
continue
try:
extractor = at_extractor if tt == 'acute_toxicity' else rdt_extractor
fetch = extract_levels(result[tt], extractor=extractor)
link = result.get(f"{tt}_link", "")
                for lvl in fetch.values():
                    lvl['ref'] = link
                    elem = ToxIndicator(**lvl)
                    indicators_list.append(elem)
            except Exception as e:
                print(f"Error while extracting {tt}: {e}")
                continue
return cls(
cas=cas,
indicators=indicators_list
)
 class Ingredient(BaseModel):
-    inci_name : str = Annotated[
-        str,
-        StringConstraints(
-            min_length=3,
-            max_length=50,
-            strip_whitespace=True,
-            to_upper=True)
-    ]
-    cas : str = Annotated[str, StringConstraints(
-        min_length=5,
-        max_length=13,
-        strip_whitespace=True
-    )]
-    quantity : float = Annotated[float, Field(gt=0.001, lt=100.0, allow_inf_nan = False)]
-    # pubchem data for dap
-    mol_weight : Optional[int]
-    degree_ioniz : Optional[bool]
-    log_pow : Optional[int]
-    melting_pnt : Optional[int]
-    # toxicity values
-    sed : Optional[float]
-    dap : float = 0.5
-    sedd : Optional[float]
-    noael : Optional[int]
-    mos : Optional[int]
-    # references
-    ref : Optional[str]
-    restriction: Optional[str]
-
-class ExpositionInfo(BaseModel):
-    type: CosmeticType
-    target_population: NormalUser
-    consumer_weight: str = "60 kg"
-    place_application: PlaceApplication
-    routes_exposure: RoutesExposure
-    nano_routes: NanoRoutes
-    surface_area: int
-    frequency: int
-    # to be approximated by LLM
-    estimated_daily_amount_applied: float
-    relative_daily_amount_applied: float
-    retention_factor: float
-    calculated_daily_exposure: float
-    calculated_relative_daily_exposure: float
+    cas: str
+    inci: Optional[List[str]] = None
+    dap_info: Optional[DapInfo] = None
+    cosing_info: Optional[List[CosingInfo]] = None
+    toxicity: Optional[Toxicity] = None
+    creation_date: Optional[str] = None
+
+    @classmethod
+    def ingredient_builder(
+        cls,
+        cas: str,
+        inci: Optional[List[str]] = None,
+        dap_data: Optional[dict] = None,
+        cosing_data: Optional[dict] = None,
+        toxicity_data: Optional[dict] = None):
+
+        dap_info = DapInfo.dap_builder(dap_data) if dap_data else None
+        cosing_info = CosingInfo.cycle_identified(cosing_data) if cosing_data else None
+        toxicity = Toxicity.from_result(cas, toxicity_data) if toxicity_data else None
+
+        return cls(
+            cas=cas,
+            inci=inci,
+            dap_info=dap_info,
+            cosing_info=cosing_info,
+            toxicity=toxicity
+        )
+
+    @model_validator(mode='after')
+    def set_creation_date(self) -> 'Ingredient':
+        self.creation_date = dt.now().isoformat()
+        return self
+
+    def update_ingredient(self, attr : str, data : dict):
setattr(self, attr, data)
def to_mongo_dict(self):
mongo_dict = self.model_dump()
return mongo_dict
def get_stats(self):
stats = {
"has_dap_info": self.dap_info is not None,
"has_cosing_info": self.cosing_info is not None,
"has_toxicity_info": self.toxicity is not None,
"num_tox_indicators": len(self.toxicity.indicators) if self.toxicity else 0,
"has_best_tox_indicator": self.toxicity.best_case is not None if self.toxicity else False,
"has_restrictions_in_cosing": any(self.cosing_info[0].annex) if self.cosing_info else False,
"has_noael_indicator": any(ind.indicator == 'NOAEL' for ind in self.toxicity.indicators) if self.toxicity else False,
"has_ld50_indicator": any(ind.indicator == 'LD50' for ind in self.toxicity.indicators) if self.toxicity else False,
"has_loael_indicator": any(ind.indicator == 'LOAEL' for ind in self.toxicity.indicators) if self.toxicity else False
}
return stats
def is_old(self, threshold_days: int = 365) -> bool:
if not self.creation_date:
return True
creation_dt = dt.fromisoformat(self.creation_date)
current_dt = dt.now()
delta = current_dt - creation_dt
return delta.days > threshold_days
def add_inci_name(self, inci_name: str):
if self.inci is None:
self.inci = []
if inci_name not in self.inci:
self.inci.append(inci_name)
def return_best_toxicity(self) -> Optional[ToxIndicator]:
if self.toxicity and self.toxicity.best_case:
return self.toxicity.best_case
return None
def return_cosing_restrictions(self) -> List[str]:
restrictions = []
if self.cosing_info:
for cosing in self.cosing_info:
restrictions.extend(cosing.annex)
return restrictions
class RetentionFactors:
LEAVE_ON = 1.0
RINSE_OFF = 0.01
DENTIFRICE = 0.05
MOUTHWASH = 0.10
DYE = 0.10
-class SedTable(BaseModel):
-    surface : int
-    total_exposition : int
-    frequency : int
-    retention : int
-    consumer_weight : int
-    total_sed : int
-
-class ProdCompany(BaseModel):
-    prod_company_name : str
-    prod_vat: int
-    prod_address : str
+class Esposition(BaseModel):
+    preset_name : str
+    tipo_prodotto: str
+    popolazione_target: str = "Adulti"
+    peso_target_kg: float = 60.0
+
+    luogo_applicazione: str
+    esp_normali: List[str]
+    esp_secondarie: List[str]
+    esp_nano: List[str]
+
+    sup_esposta: int = Field(ge=1, le=17500, description="Application area in cm2")
+    freq_applicazione: int = Field(default=1, description="Number of applications per day")
+    qta_giornaliera: float = Field(..., description="Amount of product applied (g/day)")
+    ritenzione: float = Field(default=1.0, ge=0, le=1.0, description="Retention factor")
+
+    note: Optional[str] = None
+
+    @field_validator('esp_normali', 'esp_secondarie', 'esp_nano', mode='before')
+    @classmethod
def parse_postgres_array(cls, v):
        # If Postgres returns a string like '{a,b}', turn it into ['a','b']
if isinstance(v, str):
cleaned = v.strip('{}[]')
return [item.strip() for item in cleaned.split(',')] if cleaned else []
return v
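The string branch of `parse_postgres_array` can be checked in isolation; here is a standalone copy of the same parsing (the sample inputs are invented):

```python
# Standalone version of Esposition.parse_postgres_array.
# Postgres text arrays arrive as '{a,b}'; lists pass through unchanged.
def parse_postgres_array(v):
    if isinstance(v, str):
        cleaned = v.strip('{}[]')
        return [item.strip() for item in cleaned.split(',')] if cleaned else []
    return v

print(parse_postgres_array('{dermal, oral}'))  # ['dermal', 'oral']
print(parse_postgres_array('{}'))              # []
print(parse_postgres_array(['already', 'a', 'list']))
```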
@computed_field
@property
def esposizione_calcolata(self) -> float:
return self.qta_giornaliera * self.ritenzione
@computed_field
@property
def esposizione_relativa(self) -> float:
return (self.esposizione_calcolata * 1000) / self.peso_target_kg
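The two computed fields are plain arithmetic: grams per day times the retention factor, then converted to mg and normalised by body weight. A standalone sketch with illustrative rinse-off numbers (not taken from any preset):

```python
# Sketch of Esposition's computed exposure fields.
# esposizione_calcolata: daily applied amount retained on skin (g/day)
# esposizione_relativa:  the same, in mg, normalised per kg body weight

def esposizione_calcolata(qta_giornaliera: float, ritenzione: float) -> float:
    return qta_giornaliera * ritenzione

def esposizione_relativa(qta_giornaliera: float, ritenzione: float,
                         peso_target_kg: float = 60.0) -> float:
    return esposizione_calcolata(qta_giornaliera, ritenzione) * 1000 / peso_target_kg

# Rinse-off example: 0.6 g/day applied, 1% retained, 60 kg consumer
print(esposizione_relativa(0.6, 0.01))  # ≈ 0.1 mg/kg bw/day
```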
def save_to_postgres(self):
data = self.model_dump(mode='json')
query = """INSERT INTO tipi_prodotti (
preset_name, tipo_prodotto, luogo_applicazione,
esp_normali, esp_secondarie, esp_nano,
sup_esposta, freq_applicazione, qta_giornaliera, ritenzione
) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s) RETURNING id_preset;"""
conn = postgres_connect()
try:
with conn.cursor() as cur:
cur.execute(query, (
data.get("preset_name"), data.get("tipo_prodotto"),
data.get("luogo_applicazione"), data.get("esp_normali"),
data.get("esp_secondarie"), data.get("esp_nano"),
data.get("sup_esposta"), data.get("freq_applicazione"),
data.get("qta_giornaliera"), data.get("ritenzione")
))
result = cur.fetchone()
conn.commit()
return result[0] if result else None
except Exception as e:
            print(f"Error while saving: {e}")
conn.rollback()
return False
finally:
conn.close()
@classmethod
def get_presets(cls):
conn = postgres_connect()
try:
with conn.cursor() as cur:
cur.execute("SELECT preset_name, tipo_prodotto, luogo_applicazione, esp_normali, esp_secondarie, esp_nano, sup_esposta, freq_applicazione, qta_giornaliera, ritenzione FROM tipi_prodotti;")
results = cur.fetchall()
lista_oggetti = []
for r in results:
obj = cls(
preset_name=r[0],
tipo_prodotto=r[1],
luogo_applicazione=r[2],
esp_normali=r[3],
esp_secondarie=r[4],
esp_nano=r[5],
sup_esposta=r[6],
freq_applicazione=r[7],
qta_giornaliera=r[8],
ritenzione=r[9]
)
lista_oggetti.append(obj)
return lista_oggetti
except Exception as e:
print(f"Errore: {e}")
return []
finally:
conn.close()


@ -1,36 +0,0 @@
from typing import List, Optional
from datetime import datetime
from pydantic import BaseModel, Field
from pif_compiler.classes.base_classes import ExpositionInfo, SedTable, ProdCompany, Ingredient
from pif_compiler.classes.types_enum import CosmeticType, PhysicalForm, NormalUser
class PIF(BaseModel):
    # GENERAL PRODUCT INFORMATION
    # PIF compilation date (time-of-day truncated to midnight)
    created_at: datetime = Field(
        default_factory=lambda: datetime.strptime(
            datetime.now().strftime('%Y-%m-%d'),
            '%Y-%m-%d'
        )
    )

    # Product information
    company: str
    product_name: str
    type: CosmeticType
    physical_form: PhysicalForm
    CNCP: int
    production_company: ProdCompany

    # Ingredients (quantity as a decimal percentage)
    ingredients: List[Ingredient]
    normal_consumer: Optional[NormalUser]
    exposition: Optional[ExpositionInfo]

    # Safety information
    sed_table: Optional[SedTable] = None
    undesired_effects: Optional[str] = None
    description: Optional[str] = None
    warnings: Optional[List[str]] = None


@ -1,145 +0,0 @@
from enum import Enum
class TranslatedEnum(str, Enum):
def get_translation(self, lang: str) -> str:
translations = self.value.split("|")
return translations[0] if lang == "en" else translations[1]
class PhysicalForm(TranslatedEnum):
SIERO = "Serum (liquid)|Siero (liquido)"
LOZIONE = "Lotion (liquid)|Lozione (liquido)"
CREMA = "Cream (liquid)|Crema (liquido)"
OLIO = "Oil (liquid)|Olio (liquido)"
GEL = "Gel (liquid)|Gel (liquido)"
SCHIUMA = "Foam (liquid)|Schiuma (liquido)"
SOLUZIONE = "Solution (liquid)|Soluzione (liquido)"
EMULSIONE = "Emulsion (liquid)|Emulsione (liquido)"
SOSPENSIONE = "Suspension (liquid)|Sospensione (liquido)"
BALSAMO = "Balm (semi-solid)|Balsamo (semi-solido)"
PASTA = "Paste (semi-solid)|Pasta (semi-solido)"
UNGUENTO = "Ointment (semi-solid)|Unguento (semi-solido)"
POLVERE_COMPATTA = "Pressed Powder (solid)|Polvere compatta (solido)"
POLVERE_LIBERA = "Loose Powder (solid)|Polvere libera (solido)"
STICK = "Stick (solid)|Stick (solido)"
BARRETTA = "Bar (solid)|Barretta (solido)"
PERLE = "Beads/Pearls (solid)|Perle (solido)"
SPRAY = "Spray/Mist (aerosol)|Spray/Nebulizzatore (aerosol)"
AEROSOL = "Aerosol (aerosol)|Aerosol (aerosol)"
SPRAY_IN_POLVERE = "Powder Spray (aerosol)|Spray in polvere (aerosol)"
CUSCINETTO = "Cushion (hybrid)|Cuscinetto (ibrido)"
GELATINA = "Jelly (hybrid)|Gelatina (ibrido)"
PRODOTTO_BIFASICO = "Bi-Phase Product (hybrid)|Prodotto bifasico (ibrido)"
MICROINCAPSULATO = "Encapsulated Actives (hybrid)|Attivi microincapsulati (ibrido)"
class CosmeticType(TranslatedEnum):
LIQUID_FOUNDATION = "Liquid foundation|Fondotinta liquido"
POWDER_FOUNDATION = "Powder foundation|Fondotinta in polvere"
BB_CREAM = "BB cream|BB cream"
CC_CREAM = "CC cream|CC cream"
CONCEALER = "Concealer|Correttore"
LOOSE_POWDER = "Loose powder|Cipria in polvere"
PRESSED_POWDER = "Pressed powder|Cipria compatta"
POWDER_BLUSH = "Powder blush|Blush in polvere"
CREAM_BLUSH = "Cream blush|Blush in crema"
LIQUID_BLUSH = "Liquid blush|Blush liquido"
BRONZER = "Bronzer|Bronzer"
HIGHLIGHTER = "Highlighter|Illuminante"
FACE_PRIMER = "Face primer|Primer viso"
SETTING_SPRAY = "Setting spray|Spray fissante"
COLOR_CORRECTOR = "Color corrector|Correttore colorato"
CONTOUR_POWDER = "Contour powder|Contouring in polvere"
CONTOUR_CREAM = "Contour cream|Contouring in crema"
TINTED_MOISTURIZER = "Tinted moisturizer|Crema colorata"
POWDER_EYESHADOW = "Powder eyeshadow|Ombretto in polvere"
CREAM_EYESHADOW = "Cream eyeshadow|Ombretto in crema"
LIQUID_EYESHADOW = "Liquid eyeshadow|Ombretto liquido"
PENCIL_EYELINER = "Pencil eyeliner|Matita occhi"
LIQUID_EYELINER = "Liquid eyeliner|Eyeliner liquido"
GEL_EYELINER = "Gel eyeliner|Eyeliner in gel"
KOHL_LINER = "Kohl liner|Matita kohl"
MASCARA = "Mascara|Mascara"
WATERPROOF_MASCARA = "Waterproof mascara|Mascara waterproof"
BROW_PENCIL = "Eyebrow pencil|Matita sopracciglia"
BROW_GEL = "Eyebrow gel|Gel sopracciglia"
BROW_POWDER = "Eyebrow powder|Polvere sopracciglia"
EYE_PRIMER = "Eye primer|Primer occhi"
FALSE_LASHES = "False eyelashes|Ciglia finte"
LASH_GLUE = "Eyelash glue|Colla ciglia"
BROW_POMADE = "Eyebrow pomade|Pomata sopracciglia"
MATTE_LIPSTICK = "Matte lipstick|Rossetto opaco"
CREAM_LIPSTICK = "Cream lipstick|Rossetto cremoso"
SATIN_LIPSTICK = "Satin lipstick|Rossetto satinato"
LIP_GLOSS = "Lip gloss|Lucidalabbra"
LIP_LINER = "Lip liner|Matita labbra"
LIP_STAIN = "Lip stain|Tinta labbra"
LIP_BALM = "Lip balm|Balsamo labbra"
LIP_PRIMER = "Lip primer|Primer labbra"
LIP_PLUMPER = "Lip plumper|Volumizzante labbra"
LIP_OIL = "Lip oil|Olio labbra"
LIP_MASK = "Lip mask|Maschera labbra"
LIQUID_LIPSTICK = "Liquid lipstick|Rossetto liquido"
GEL_CLEANSER = "Gel cleanser|Detergente gel"
FOAM_CLEANSER = "Foam cleanser|Detergente schiumoso"
OIL_CLEANSER = "Oil cleanser|Detergente oleoso"
CREAM_CLEANSER = "Cream cleanser|Detergente in crema"
MICELLAR_WATER = "Micellar water|Acqua micellare"
TONER = "Toner|Tonico"
ESSENCE = "Essence|Essenza"
SERUM = "Serum|Siero"
MOISTURIZER = "Moisturizer|Idratante"
FACE_OIL = "Face oil|Olio viso"
SHEET_MASK = "Sheet mask|Maschera in tessuto"
CLAY_MASK = "Clay mask|Maschera all'argilla"
GEL_MASK = "Gel mask|Maschera in gel"
CREAM_MASK = "Cream mask|Maschera in crema"
EYE_CREAM = "Eye cream|Crema contorno occhi"
PHYSICAL_EXFOLIATOR = "Physical exfoliator|Esfoliante fisico"
CHEMICAL_EXFOLIATOR = "Chemical exfoliator|Esfoliante chimico"
SUNSCREEN = "Sunscreen|Protezione solare"
NIGHT_CREAM = "Night cream|Crema notte"
FACE_MIST = "Face mist|Acqua spray"
SPOT_TREATMENT = "Spot treatment|Trattamento localizzato"
PORE_STRIPS = "Pore strips|Cerotti purificanti"
PEELING_GEL = "Peeling gel|Gel esfoliante"
BASE_COAT = "Base coat|Base smalto"
NAIL_POLISH = "Nail polish|Smalto"
TOP_COAT = "Top coat|Top coat"
CUTICLE_OIL = "Cuticle oil|Olio cuticole"
NAIL_STRENGTHENER = "Nail strengthener|Rinforzante unghie"
QUICK_DRY_DROPS = "Quick dry drops|Gocce asciugatura rapida"
NAIL_PRIMER = "Nail primer|Primer unghie"
GEL_POLISH = "Gel polish|Smalto gel"
ACRYLIC_POWDER = "Acrylic powder|Polvere acrilica"
NAIL_GLUE = "Nail glue|Colla unghie"
MAKEUP_BRUSHES = "Makeup brushes|Pennelli trucco"
MAKEUP_SPONGES = "Makeup sponges|Spugnette trucco"
EYELASH_CURLER = "Eyelash curler|Piegaciglia"
TWEEZERS = "Tweezers|Pinzette"
NAIL_CLIPPERS = "Nail clippers|Tagliaunghie"
NAIL_FILE = "Nail file|Lima unghie"
COTTON_PADS = "Cotton pads|Dischetti di cotone"
MAKEUP_REMOVER_PADS = "Makeup remover pads|Dischetti struccanti"
POWDER_PUFF = "Powder puff|Piumino cipria"
FACIAL_ROLLER = "Facial roller|Rullo facciale"
GUA_SHA = "Gua sha tool|Strumento gua sha"
BRUSH_CLEANER = "Brush cleaner|Detergente pennelli"
MAKEUP_ORGANIZER = "Makeup organizer|Organizzatore trucchi"
MIRROR = "Mirror|Specchio"
NAIL_BUFFER = "Nail buffer|Buffer unghie"
class NormalUser(TranslatedEnum):
ADULTO = "Adult|Adulto"
BAMBINO = "Child|Bambino"
class PlaceApplication(TranslatedEnum):
VISO = "Face|Viso"
class RoutesExposure(TranslatedEnum):
DERMAL = "Dermal|Dermale"
OCULAR = "Ocular|Oculare"
ORAL = "Oral|Orale"
class NanoRoutes(TranslatedEnum):
DERMAL = "Dermal|Dermale"
OCULAR = "Ocular|Oculare"
ORAL = "Oral|Orale"
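The `English|Italian` pattern above resolves like this; a self-contained sketch reusing two of the enums:

```python
from enum import Enum

class TranslatedEnum(str, Enum):
    # Values hold "English|Italian"; pick one side by language code
    def get_translation(self, lang: str) -> str:
        translations = self.value.split("|")
        return translations[0] if lang == "en" else translations[1]

class NormalUser(TranslatedEnum):
    ADULTO = "Adult|Adulto"
    BAMBINO = "Child|Bambino"

print(NormalUser.ADULTO.get_translation("en"))  # Adult
print(NormalUser.ADULTO.get_translation("it"))  # Adulto
```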


@ -2,6 +2,7 @@ import os
from urllib.parse import quote_plus
from dotenv import load_dotenv
import psycopg2
from pymongo import MongoClient
from pif_compiler.functions.common_log import get_logger

@ -40,9 +41,31 @@ def db_connect(db_name : str = 'toxinfo', collection_name : str = 'substance_ind
        return collection

def postgres_connect():
    DATABASE_URL = os.getenv("DATABASE_URL")
    return psycopg2.connect(DATABASE_URL)

def insert_compilatore(nome_compilatore):
    try:
        conn = postgres_connect()
        with conn.cursor() as cur:
            cur.execute("INSERT INTO compilatori (nome_compilatore) VALUES (%s)", (nome_compilatore,))
        conn.commit()
        conn.close()
    except Exception as e:
        logger.error(f"Error: {e}")

def log_ricerche(cas, target, esito):
    try:
        conn = postgres_connect()
        with conn.cursor() as cur:
            cur.execute("INSERT INTO logs.search_history (cas_ricercato, target, esito) VALUES (%s, %s, %s)", (cas, target, esito))
        conn.commit()
        conn.close()
    except Exception as e:
        logger.error(f"Error: {e}")

if __name__ == "__main__":
    log_ricerche("123-45-6", "ECHA", True)


@ -1,6 +0,0 @@
from pymongo import MongoClient
from pif_compiler.functions.common_log import get_logger
from pif_compiler.functions.db_utils import db_connect
log = get_logger()


@ -1,247 +1,141 @@
import json as js
import re
import requests as req
from typing import Union, List, Dict, Optional
from pif_compiler.functions.common_log import get_logger

logger = get_logger()

# --- PARSING ---
def parse_cas_numbers(cas_string: list) -> list:
    logger.debug(f"Parsing CAS numbers: {cas_string}")
    cas_raw = cas_string[0]
    # Remove parentheses and their content, then split on the possible separators
    cas_raw = re.sub(r"\([^)]*\)", "", cas_raw)
    cas_parts = re.split(r"[/;,]", cas_raw)
    cas_list = [cas.strip() for cas in cas_parts if cas.strip()]
    # Some CAS numbers are separated by a double dash (--)
    if len(cas_list) == 1 and "--" in cas_list[0]:
        cas_list = [cas.strip() for cas in cas_list[0].split("--")]
    # Drop invalid placeholder values ("-")
    cas_list = [cas for cas in cas_list if cas != "-"]
    logger.info(f"Parsed CAS numbers: {cas_list}")
    return cas_list

# --- SEARCH ---
def cosing_search(text: str, mode: str = "name") -> Union[list, dict, None]:
    logger.info(f"Starting COSING search: text='{text}', mode='{mode}'")
    url = "https://api.tech.ec.europa.eu/search-api/prod/rest/search?apiKey=285a77fd-1257-4271-8507-f0c6b2961203&text=*&pageSize=100&pageNumber=1"
    agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0"

    if mode == "name":
        search_query = {
            "bool": {
                "must": [{
                    "text": {
                        "query": f"{text}",
                        "fields": ["inciName.exact", "inciUsaName", "innName.exact", "phEurName", "chemicalName", "chemicalDescription"],
                        "defaultOperator": "AND"
                    }
                }]
            }
        }
    elif mode in ["cas", "ec"]:
        search_query = {"bool": {"must": [{"text": {"query": f"*{text}*", "fields": ["casNo", "ecNo"]}}]}}
    elif mode == "id":
        search_query = {"bool": {"must": [{"term": {"substanceId": f"{text}"}}]}}
    else:
        logger.error(f"Invalid search mode: {mode}")
        raise ValueError(f"Invalid search mode: {mode}")

    files = {"query": ("query", js.dumps(search_query), "application/json")}
    try:
        risposta = req.post(url, headers={"user_agent": agent, "Connection": "keep-alive"}, files=files)
        risposta.raise_for_status()
        data = risposta.json()
        if data.get("results"):
            logger.info("COSING search successful")
            return data["results"][0]["metadata"]
        else:
            logger.warning("No results found")
            return None
    except req.exceptions.RequestException as e:
        logger.error(f"HTTP request error: {e}")
        raise
    except (KeyError, ValueError, TypeError) as e:
        logger.error(f"Parsing error: {e}")
        raise

# --- CLEANING ---
def clean_cosing(json_data: dict, full: bool = True) -> dict:
    substance_id = json_data.get("substanceId", ["Unknown"])[0] if json_data.get("substanceId") else "Unknown"
    logger.info(f"Cleaning COSING data for: {substance_id}")

    string_cols = [
        "itemType", "phEurName", "chemicalName", "innName", "substanceId", "cosmeticRestriction"
    ]
    list_cols = [
        "casNo", "ecNo", "functionName", "otherRestrictions", "refNo",
        "sccsOpinion", "sccsOpinionUrls", "identifiedIngredient",
        "annexNo", "otherRegulations", "nameOfCommonIngredientsGlossary", "inciName"
    ]
    base_url = "https://ec.europa.eu/growth/tools-databases/cosing/details/"
    clean_json = {}

    for key in (string_cols + list_cols):
        current_val = json_data.get(key, [])
        filtered_val = [v for v in current_val if v != "<empty>"]
        if key in list_cols:
            if key in ["casNo", "ecNo"] and filtered_val:
                filtered_val = parse_cas_numbers(filtered_val)
            elif key == "identifiedIngredient" and full and filtered_val:
                filtered_val = identified_ingredients(filtered_val)
            clean_json[key] = filtered_val
        else:
            nKey = "commonName" if key == "nameOfCommonIngredientsGlossary" else key
            clean_json[nKey] = filtered_val[0] if filtered_val else ""

    clean_json["cosingUrl"] = f"{base_url}{json_data['substanceId'][0]}"
    return clean_json

def identified_ingredients(id_list: list) -> list:
    logger.info(f"Processing identified ingredients: {id_list}")
    identified = []
    for sub_id in id_list:
        ingredient = cosing_search(sub_id, "id")
        if ingredient:
            identified.append(clean_cosing(ingredient, full=False))
    return identified

def cosing_entry(cas: str) -> Optional[dict]:
    logger.info(f"Retrieving COSING entry for CAS: {cas}")
    try:
        search_result = cosing_search(cas, mode="cas")
        if search_result:
            return clean_cosing(search_result)
        else:
            logger.warning(f"No COSING entry found for CAS: {cas}")
            return None
    except Exception as e:
        logger.error(f"Error retrieving COSING entry for CAS {cas}: {e}")
        return None

if __name__ == "__main__":
    raw = cosing_search("72-48-0", "cas")
    clean = clean_cosing(raw)
    print(clean)


@ -6,9 +6,10 @@ import re
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from playwright.sync_api import sync_playwright
from typing import Callable, Any
from pif_compiler.functions.common_log import get_logger
from pif_compiler.functions.db_utils import db_connect, log_ricerche

log = get_logger()
load_dotenv()
@ -311,6 +312,114 @@ def parse_toxicology_html(html_content):
    return result
def parse_value_with_unit(value_str: str) -> tuple:
"""Parse a combined value+unit string like '5,040mg/kg bw/day' into (value, unit)."""
if not value_str:
return ("", "")
# Pattern to match numeric value (with commas/decimals) followed by unit
match = re.match(r'^([\d,\.]+)\s*(.*)$', value_str.strip())
if match:
numeric_value = match.group(1).replace(',', '') # Remove commas
unit = match.group(2).strip()
return (numeric_value, unit)
return (value_str, "")
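For reference, the regex above splits ECHA-style strings like this; a quick self-contained sketch reusing the same pattern (the sample inputs are illustrative):

```python
import re

def parse_value_with_unit(value_str: str) -> tuple:
    # Same pattern as above: leading number (commas/decimals allowed) + trailing unit
    if not value_str:
        return ("", "")
    match = re.match(r'^([\d,\.]+)\s*(.*)$', value_str.strip())
    if match:
        return (match.group(1).replace(',', ''), match.group(2).strip())
    return (value_str, "")

print(parse_value_with_unit("5,040mg/kg bw/day"))  # ('5040', 'mg/kg bw/day')
print(parse_value_with_unit("300 mg/kg"))          # ('300', 'mg/kg')
print(parse_value_with_unit("ca. 2000 mg/kg"))     # no leading number: ('ca. 2000 mg/kg', '')
```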
def extract_levels(data: dict | list, extractor: Callable[[dict, list], dict | None]) -> dict:
"""
Generic function to recursively extract data from nested JSON structures.
Args:
data: The JSON data (dict or list) to parse
extractor: A callable that receives (obj, path) and returns:
- A dict with extracted data if the object matches criteria
- None if no match (continue searching)
The 'path' is a list of labels encountered in the hierarchy.
Returns:
A dict keyed by context path (labels joined by " > "), with values
being whatever the extractor returns.
Example:
def my_extractor(obj, path):
if obj.get("SomeField"):
return {"field": obj["SomeField"]}
return None
results = extract_levels(data, my_extractor)
"""
results = {}
def recurse(obj: Any, path: list = None):
if path is None:
path = []
if isinstance(obj, dict):
# Check for label to use in path
current_label = obj.get("label", "")
current_path = path + [current_label] if current_label else path
# Call the extractor function
extracted = extractor(obj, current_path)
if extracted is not None:
key = " > ".join(filter(None, current_path)) or "root"
results[key] = extracted
# Recurse into all values
for k, val in obj.items():
if k != "label":
recurse(val, current_path)
elif isinstance(obj, list):
for item in obj:
recurse(item, path)
recurse(data)
return results
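To make the traversal concrete, here is a minimal end-to-end run of the same extract_levels logic on a mocked dossier fragment (the input shape is hypothetical, not real ECHA JSON):

```python
def extract_levels(data, extractor):
    # Same recursive walk as above: collect extractor hits keyed by label path
    results = {}

    def recurse(obj, path=None):
        if path is None:
            path = []
        if isinstance(obj, dict):
            current_label = obj.get("label", "")
            current_path = path + [current_label] if current_label else path
            extracted = extractor(obj, current_path)
            if extracted is not None:
                results[" > ".join(filter(None, current_path)) or "root"] = extracted
            for k, val in obj.items():
                if k != "label":
                    recurse(val, current_path)
        elif isinstance(obj, list):
            for item in obj:
                recurse(item, path)

    recurse(data)
    return results

# Mocked dossier fragment (illustrative labels and values)
data = {
    "label": "Toxicological information",
    "sections": [
        {
            "label": "Acute toxicity: via oral route",
            "EffectLevelUnit": "LD50",
            "EffectLevelValue": "300mg/kg",
        }
    ],
}

def ld50_extractor(obj, path):
    # Matches any node that carries an effect-level pair
    if obj.get("EffectLevelUnit") and obj.get("EffectLevelValue"):
        return {"indicator": obj["EffectLevelUnit"], "value": obj["EffectLevelValue"]}
    return None

print(extract_levels(data, ld50_extractor))
# {'Toxicological information > Acute toxicity: via oral route': {'indicator': 'LD50', 'value': '300mg/kg'}}
```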
def rdt_extractor(obj: dict, path: list) -> dict | None:
indicator = obj.get("EffectLevelUnit", "").strip()
value_str = obj.get("EffectLevelValue", "").strip()
if indicator and value_str:
numeric_value, unit = parse_value_with_unit(value_str)
# Extract route (last label) and toxicity_type (second-to-last label)
filtered_path = [p for p in path if p]
route = filtered_path[-1] if len(filtered_path) >= 1 else ""
toxicity_type = filtered_path[-2] if len(filtered_path) >= 2 else ""
return {
"indicator": indicator,
"value": numeric_value,
"unit": unit,
"route": route,
"toxicity_type": toxicity_type
}
return None
def at_extractor(obj: dict, path: list) -> dict | None:
indicator = obj.get("EffectLevelUnit", "").strip()
value_str = obj.get("EffectLevelValue", "").strip()
if indicator and value_str:
numeric_value, unit = parse_value_with_unit(value_str)
filtered_path = [p for p in path if p]
route = filtered_path[-1] if len(filtered_path) >= 1 else ""
route = route.replace("Acute toxicity: via ", "").strip()
return {
"indicator": indicator,
"value": numeric_value,
"unit": unit,
"route": route
}
return None
#endregion #endregion
#region Orchestrator functions #region Orchestrator functions
@ -418,16 +527,19 @@ def orchestrator(cas: str) -> dict:
    local_record = check_local(cas_validated)
    if local_record:
        log.info(f"Returning local record for CAS {cas}.")
        log_ricerche(cas, 'ECHA', True)
        return local_record
    else:
        log.info("No local record, starting echa flow")
        echa_data = echa_flow(cas_validated)
        if echa_data:
            log.info("Echa flow successful")
            log_ricerche(cas, 'ECHA', True)
            add_to_local(echa_data)
            return echa_data
        else:
            log.error(f"Failed to retrieve ECHA data for CAS {cas}.")
            log_ricerche(cas, 'ECHA', False)
            return None
    # to do: check if document is complete
@ -436,5 +548,4 @@ def orchestrator(cas: str) -> dict:
if __name__ == "__main__":
    cas_test = "113170-55-1"
    result = orchestrator(cas_test)


@ -1,220 +0,0 @@
# PIF Compiler - Test Suite
## Overview
Comprehensive test suite for the PIF Compiler project using `pytest`.
## Structure
```
tests/
├── __init__.py # Test package marker
├── conftest.py # Shared fixtures and configuration
├── test_cosing_service.py # COSING service tests
├── test_models.py # (TODO) Pydantic model tests
├── test_echa_service.py # (TODO) ECHA service tests
└── README.md # This file
```
## Installation
```bash
# Install test dependencies
uv add --dev pytest pytest-cov pytest-mock
# Or manually install
uv pip install pytest pytest-cov pytest-mock
```
## Running Tests
### Run All Tests (Unit only)
```bash
uv run pytest
```
### Run Specific Test File
```bash
uv run pytest tests/test_cosing_service.py
```
### Run Specific Test Class
```bash
uv run pytest tests/test_cosing_service.py::TestParseCasNumbers
```
### Run Specific Test
```bash
uv run pytest tests/test_cosing_service.py::TestParseCasNumbers::test_single_cas_number
```
### Run with Verbose Output
```bash
uv run pytest -v
```
### Run with Coverage Report
```bash
uv run pytest --cov=src/pif_compiler --cov-report=html
# Open htmlcov/index.html in browser
```
## Test Categories
### Unit Tests (Default)
Fast tests with no external dependencies. Run by default.
```bash
uv run pytest -m unit
```
### Integration Tests
Tests that hit real APIs or databases. Skipped by default.
```bash
uv run pytest -m integration
```
### Slow Tests
Tests that take longer to run. Skipped by default.
```bash
uv run pytest -m slow
```
### Database Tests
Tests requiring MongoDB. Ensure Docker is running.
```bash
cd utils
docker-compose up -d
uv run pytest -m database
```
## Test Organization
### `test_cosing_service.py`
**Coverage:**
- ✅ `parse_cas_numbers()` - CAS parsing logic
- Single/multiple CAS
- Different separators (/, ;, ,, --)
- Parentheses removal
- Whitespace handling
- Invalid dash removal
- ✅ `cosing_search()` - API search
- Search by name
- Search by CAS
- Search by EC number
- Search by ID
- No results handling
- Invalid mode error
- ✅ `clean_cosing()` - JSON cleaning
- Basic field cleaning
- Empty tag removal
- CAS parsing
- URL creation
- Field renaming
- ✅ Integration tests (marked as `@pytest.mark.integration`)
- Real API calls (requires internet)
## Writing New Tests
### Example Unit Test
```python
class TestMyFunction:
"""Test my_function."""
def test_basic_case(self):
"""Test basic functionality."""
result = my_function("input")
assert result == "expected"
def test_edge_case(self):
"""Test edge case handling."""
with pytest.raises(ValueError):
my_function("invalid")
```
### Example Mock Test
```python
from unittest.mock import Mock, patch
@patch('module.external_api_call')
def test_with_mock(mock_api):
"""Test with mocked external call."""
mock_api.return_value = {"data": "mocked"}
result = my_function()
assert result == "expected"
mock_api.assert_called_once()
```
### Example Fixture Usage
```python
def test_with_fixture(sample_cosing_response):
"""Test using a fixture from conftest.py."""
result = clean_cosing(sample_cosing_response)
assert "cosingUrl" in result
```
## Best Practices
1. **Naming**: Test files/classes/functions start with `test_`
2. **Arrange-Act-Assert**: Structure tests clearly
3. **One assertion focus**: Each test should test one thing
4. **Use fixtures**: Reuse test data via `conftest.py`
5. **Mock external calls**: Don't hit real APIs in unit tests
6. **Mark appropriately**: Use `@pytest.mark.integration` for slow tests
7. **Descriptive names**: Test names should describe what they test
## Common Commands
```bash
# Run fast tests only (skip integration/slow)
uv run pytest -m "not integration and not slow"
# Run only integration tests
uv run pytest -m integration
# Run with detailed output
uv run pytest -vv
# Stop at first failure
uv run pytest -x
# Run last failed tests
uv run pytest --lf
# Run tests matching pattern
uv run pytest -k "test_parse"
# Generate coverage report
uv run pytest --cov=src/pif_compiler --cov-report=term-missing
```
## CI/CD Integration
For GitHub Actions (example):
```yaml
- name: Run tests
run: |
uv run pytest -m "not integration" --cov --cov-report=xml
```
## TODO
- [ ] Add tests for `models.py` (Pydantic validation)
- [ ] Add tests for `echa_service.py`
- [ ] Add tests for `echa_parser.py`
- [ ] Add tests for `echa_extractor.py`
- [ ] Add tests for `database_service.py`
- [ ] Add tests for `pubchem_service.py`
- [ ] Add integration tests with test database
- [ ] Set up GitHub Actions CI


@ -1,86 +0,0 @@
# Quick Start - Running Tests
## 1. Install Test Dependencies
```bash
# Add pytest and related tools
uv add --dev pytest pytest-cov pytest-mock
```
## 2. Run the Tests
```bash
# Run all unit tests (fast, no API calls)
uv run pytest
# Run with more detail
uv run pytest -v
# Run just the COSING tests
uv run pytest tests/test_cosing_service.py
# Run integration tests (will hit real COSING API)
uv run pytest -m integration
```
## 3. See Coverage
```bash
# Generate HTML coverage report
uv run pytest --cov=src/pif_compiler --cov-report=html
# Open htmlcov/index.html in your browser
```
## What the Tests Cover
### ✅ `parse_cas_numbers()`
- Parses single CAS: `["7732-18-5"]` → `["7732-18-5"]`
- Parses multiple: `["7732-18-5/56-81-5"]` → `["7732-18-5", "56-81-5"]`
- Handles separators: `/`, `;`, `,`, `--`
- Removes parentheses: `["7732-18-5 (hydrate)"]` → `["7732-18-5"]`
- Cleans whitespace and invalid dashes
### ✅ `cosing_search()`
- Mocks API calls (no internet needed for unit tests)
- Tests search by name, CAS, EC, ID
- Tests error handling
- Integration tests hit real API
### ✅ `clean_cosing()`
- Cleans COSING JSON responses
- Removes empty tags
- Parses CAS numbers
- Creates COSING URLs
- Renames fields
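The cleaning steps can be sketched roughly like this (illustrative only; field handling in the real `clean_cosing` is more complete, and the `full` flag is ignored here):

```python
def clean_cosing(raw: dict, full: bool = False) -> dict:
    """Sketch: normalise one raw COSING metadata record (all values are lists)."""
    # Remove <empty> placeholder tags from every field
    cleaned = {k: [v for v in vals if v != "<empty>"] for k, vals in raw.items()}
    # Single-value fields collapse from list to string
    names = cleaned.get("inciName", [])
    cleaned["inciName"] = names[0] if names else ""
    # Build the public COSING detail URL from the substance id
    sid = cleaned.get("substanceId", [])
    cleaned["cosingUrl"] = (
        "https://ec.europa.eu/growth/tools-databases/cosing/details/"
        + (sid[0] if sid else "")
    )
    # Rename the glossary field to a shorter key
    glossary = cleaned.pop("nameOfCommonIngredientsGlossary", [])
    cleaned["commonName"] = glossary[0] if glossary else ""
    return cleaned
```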
## Test Results Example
```
tests/test_cosing_service.py::TestParseCasNumbers::test_single_cas_number PASSED
tests/test_cosing_service.py::TestParseCasNumbers::test_multiple_cas_with_slash PASSED
tests/test_cosing_service.py::TestCosingSearch::test_search_by_name_success PASSED
...
================================ 25 passed in 0.5s ================================
```
## Troubleshooting
### Import errors
Make sure you're in the project root:
```bash
cd /path/to/pif_compiler
uv run pytest
```
### Mock not found
Install pytest-mock:
```bash
uv add --dev pytest-mock
```
### Integration tests failing
These hit the real API and need internet. Skip them:
```bash
uv run pytest -m "not integration"
```
@ -1,3 +0,0 @@
"""
PIF Compiler - Test Suite
"""
@ -1,247 +0,0 @@
"""
Pytest configuration and fixtures for PIF Compiler tests.
This file contains shared fixtures and configuration for all tests.
"""
import pytest
import sys
from pathlib import Path
# Add src to Python path for imports
src_path = Path(__file__).parent.parent / "src"
sys.path.insert(0, str(src_path))
# Sample data fixtures
@pytest.fixture
def sample_cas_numbers():
"""Real CAS numbers for testing common cosmetic ingredients."""
return {
"water": "7732-18-5",
"glycerin": "56-81-5",
"sodium_hyaluronate": "9067-32-7",
"niacinamide": "98-92-0",
"ascorbic_acid": "50-81-7",
"retinol": "68-26-8",
"lanolin": "85507-69-3",
"sodium_chloride": "7647-14-5",
"propylene_glycol": "57-55-6",
"butylene_glycol": "107-88-0",
"salicylic_acid": "69-72-7",
"tocopherol": "59-02-9",
"caffeine": "58-08-2",
"citric_acid": "77-92-9",
"hyaluronic_acid": "9004-61-9",
"sodium_hyaluronate_crosspolymer": "63148-62-9",
"zinc_oxide": "1314-13-2",
"titanium_dioxide": "13463-67-7",
"lactic_acid": "50-21-5",
"lanolin_oil": "8006-54-0",
}
@pytest.fixture
def sample_cosing_response():
"""Sample COSING API response for testing."""
return {
"inciName": ["WATER"],
"casNo": ["7732-18-5"],
"ecNo": ["231-791-2"],
"substanceId": ["12345"],
"itemType": ["Ingredient"],
"functionName": ["Solvent"],
"chemicalName": ["Dihydrogen monoxide"],
"nameOfCommonIngredientsGlossary": ["Water"],
"sccsOpinion": [],
"sccsOpinionUrls": [],
"otherRestrictions": [],
"identifiedIngredient": [],
"annexNo": [],
"otherRegulations": [],
"refNo": ["REF123"],
"phEurName": [],
"innName": []
}
@pytest.fixture
def sample_ingredient_data():
"""Sample ingredient data for Pydantic model testing."""
return {
"inci_name": "WATER",
"cas": "7732-18-5",
"quantity": 70.0,
"mol_weight": 18,
"dap": 0.5,
}
@pytest.fixture
def sample_pif_data():
"""Sample PIF data for testing."""
return {
"company": "Beauty Corp",
"product_name": "Face Cream",
"type": "MOISTURIZER",
"physical_form": "CREMA",
"CNCP": 123456,
"production_company": {
"prod_company_name": "Manufacturer Inc",
"prod_vat": 12345678,
"prod_address": "123 Main St, City, Country"
},
"ingredients": [
{
"inci_name": "WATER",
"cas": "7732-18-5",
"quantity": 70.0,
"dap": 0.5
},
{
"inci_name": "GLYCERIN",
"cas": "56-81-5",
"quantity": 10.0,
"dap": 0.5
}
]
}
@pytest.fixture
def sample_echa_substance_response():
"""Sample ECHA substance search API response for glycerin."""
return {
"items": [{
"substanceIndex": {
"rmlId": "100.029.181",
"rmlName": "glycerol",
"rmlCas": "56-81-5",
"rmlEc": "200-289-5"
}
}]
}
@pytest.fixture
def sample_echa_substance_response_water():
"""Sample ECHA substance search API response for water."""
return {
"items": [{
"substanceIndex": {
"rmlId": "100.028.902",
"rmlName": "water",
"rmlCas": "7732-18-5",
"rmlEc": "231-791-2"
}
}]
}
@pytest.fixture
def sample_echa_substance_response_niacinamide():
"""Sample ECHA substance search API response for niacinamide."""
return {
"items": [{
"substanceIndex": {
"rmlId": "100.002.530",
"rmlName": "nicotinamide",
"rmlCas": "98-92-0",
"rmlEc": "202-713-4"
}
}]
}
@pytest.fixture
def sample_echa_dossier_response():
"""Sample ECHA dossier list API response."""
return {
"items": [{
"assetExternalId": "abc123def456",
"rootKey": "key123",
"lastUpdatedDate": "2024-01-15T10:30:00Z"
}]
}
@pytest.fixture
def sample_echa_index_html_full():
"""Sample ECHA index.html with all toxicology sections."""
return """
<html>
<head><title>ECHA Dossier</title></head>
<body>
<div id="id_7_Toxicologicalinformation">
<a href="tox_summary_001"></a>
</div>
<div id="id_72_AcuteToxicity">
<a href="acute_tox_001"></a>
</div>
<div id="id_75_Repeateddosetoxicity">
<a href="repeated_dose_001"></a>
</div>
</body>
</html>
"""
@pytest.fixture
def sample_echa_index_html_partial():
"""Sample ECHA index.html with only ToxSummary section."""
return """
<html>
<head><title>ECHA Dossier</title></head>
<body>
<div id="id_7_Toxicologicalinformation">
<a href="tox_summary_001"></a>
</div>
</body>
</html>
"""
@pytest.fixture
def sample_echa_index_html_empty():
"""Sample ECHA index.html with no toxicology sections."""
return """
<html>
<head><title>ECHA Dossier</title></head>
<body>
<p>No toxicology information available</p>
</body>
</html>
"""
# Skip markers
def pytest_configure(config):
"""Configure custom markers."""
config.addinivalue_line(
"markers", "unit: mark test as a unit test (fast, no external deps)"
)
config.addinivalue_line(
"markers", "integration: mark test as integration test (may use real APIs)"
)
config.addinivalue_line(
"markers", "slow: mark test as slow (skip by default)"
)
config.addinivalue_line(
"markers", "database: mark test as requiring database"
)
def pytest_collection_modifyitems(config, items):
"""Modify test collection to skip slow/integration tests by default."""
skip_slow = pytest.mark.skip(reason="Slow test (use -m slow to run)")
skip_integration = pytest.mark.skip(reason="Integration test (use -m integration to run)")
# Only skip when the marker is absent from the -m expression
# (substring check also covers compound expressions like "integration and not slow",
# where pytest's own -m deselection already excludes the marked tests)
markexpr = config.getoption("-m") or ""
run_slow = "slow" in markexpr
run_integration = "integration" in markexpr
for item in items:
if "slow" in item.keywords and not run_slow:
item.add_marker(skip_slow)
if "integration" in item.keywords and not run_integration:
item.add_marker(skip_integration)
@ -1,254 +0,0 @@
"""
Tests for COSING Service
Test coverage:
- parse_cas_numbers: CAS number parsing logic
- cosing_search: API search functionality
- clean_cosing: JSON cleaning and formatting
"""
import pytest
from unittest.mock import Mock, patch
from pif_compiler.services.srv_cosing import (
parse_cas_numbers,
cosing_search,
clean_cosing,
)
class TestParseCasNumbers:
"""Test CAS number parsing function."""
def test_single_cas_number(self):
"""Test parsing a single CAS number."""
result = parse_cas_numbers(["7732-18-5"])
assert result == ["7732-18-5"]
def test_multiple_cas_with_slash(self):
"""Test parsing multiple CAS numbers separated by slash."""
result = parse_cas_numbers(["7732-18-5/56-81-5"])
assert result == ["7732-18-5", "56-81-5"]
def test_multiple_cas_with_semicolon(self):
"""Test parsing multiple CAS numbers separated by semicolon."""
result = parse_cas_numbers(["7732-18-5;56-81-5"])
assert result == ["7732-18-5", "56-81-5"]
def test_multiple_cas_with_comma(self):
"""Test parsing multiple CAS numbers separated by comma."""
result = parse_cas_numbers(["7732-18-5,56-81-5"])
assert result == ["7732-18-5", "56-81-5"]
def test_double_dash_separator(self):
"""Test parsing CAS numbers with double dash separator."""
result = parse_cas_numbers(["7732-18-5--56-81-5"])
assert result == ["7732-18-5", "56-81-5"]
def test_cas_with_parentheses(self):
"""Test that parenthetical info is removed."""
result = parse_cas_numbers(["7732-18-5 (hydrate)"])
assert result == ["7732-18-5"]
def test_cas_with_extra_whitespace(self):
"""Test that extra whitespace is trimmed."""
result = parse_cas_numbers([" 7732-18-5 / 56-81-5 "])
assert result == ["7732-18-5", "56-81-5"]
def test_removes_invalid_dash(self):
"""Test that standalone dashes are removed."""
result = parse_cas_numbers(["7732-18-5/-/56-81-5"])
assert result == ["7732-18-5", "56-81-5"]
def test_complex_mixed_separators(self):
"""Test with multiple separator types."""
result = parse_cas_numbers(["7732-18-5/56-81-5;50-00-0"])
assert result == ["7732-18-5", "56-81-5", "50-00-0"]
class TestCosingSearch:
"""Test COSING API search functionality."""
@patch('pif_compiler.services.srv_cosing.req.post')
def test_search_by_name_success(self, mock_post):
"""Test successful search by ingredient name."""
# Mock API response
mock_response = Mock()
mock_response.json.return_value = {
"results": [{
"metadata": {
"inciName": ["WATER"],
"casNo": ["7732-18-5"],
"substanceId": ["12345"]
}
}]
}
mock_post.return_value = mock_response
result = cosing_search("WATER", mode="name")
assert result is not None
assert result["inciName"] == ["WATER"]
assert result["casNo"] == ["7732-18-5"]
@patch('pif_compiler.services.srv_cosing.req.post')
def test_search_by_cas_success(self, mock_post):
"""Test successful search by CAS number."""
mock_response = Mock()
mock_response.json.return_value = {
"results": [{
"metadata": {
"inciName": ["WATER"],
"casNo": ["7732-18-5"]
}
}]
}
mock_post.return_value = mock_response
result = cosing_search("7732-18-5", mode="cas")
assert result is not None
assert "7732-18-5" in result["casNo"]
@patch('pif_compiler.services.srv_cosing.req.post')
def test_search_by_ec_success(self, mock_post):
"""Test successful search by EC number."""
mock_response = Mock()
mock_response.json.return_value = {
"results": [{
"metadata": {
"ecNo": ["231-791-2"]
}
}]
}
mock_post.return_value = mock_response
result = cosing_search("231-791-2", mode="ec")
assert result is not None
assert "231-791-2" in result["ecNo"]
@patch('pif_compiler.services.srv_cosing.req.post')
def test_search_by_id_success(self, mock_post):
"""Test successful search by substance ID."""
mock_response = Mock()
mock_response.json.return_value = {
"results": [{
"metadata": {
"substanceId": ["12345"]
}
}]
}
mock_post.return_value = mock_response
result = cosing_search("12345", mode="id")
assert result is not None
assert result["substanceId"] == ["12345"]
@patch('pif_compiler.services.srv_cosing.req.post')
def test_search_no_results(self, mock_post):
"""Test search with no results returns status code."""
mock_response = Mock()
mock_response.json.return_value = {"results": []}
mock_post.return_value = mock_response
result = cosing_search("NONEXISTENT", mode="name")
assert result is None
def test_search_invalid_mode(self):
"""Test that invalid mode raises ValueError."""
with pytest.raises(ValueError):
cosing_search("WATER", mode="invalid_mode")
class TestCleanCosing:
"""Test COSING JSON cleaning function."""
def test_clean_basic_fields(self, sample_cosing_response):
"""Test cleaning basic string and list fields."""
result = clean_cosing(sample_cosing_response, full=False)
assert result["inciName"] == "WATER"
assert result["casNo"] == ["7732-18-5"]
assert result["ecNo"] == ["231-791-2"]
def test_removes_empty_tags(self, sample_cosing_response):
"""Test that <empty> tags are removed."""
sample_cosing_response["inciName"] = ["<empty>"]
sample_cosing_response["functionName"] = ["<empty>"]
result = clean_cosing(sample_cosing_response, full=False)
assert "<empty>" not in result["inciName"]
assert result["functionName"] == []
def test_parses_cas_numbers(self, sample_cosing_response):
"""Test that CAS numbers are parsed correctly."""
sample_cosing_response["casNo"] = ["56-81-5"]
result = clean_cosing(sample_cosing_response, full=False)
assert result["casNo"] == ["56-81-5"]
def test_creates_cosing_url(self, sample_cosing_response):
"""Test that COSING URL is created."""
result = clean_cosing(sample_cosing_response, full=False)
assert "cosingUrl" in result
assert "12345" in result["cosingUrl"]
assert result["cosingUrl"] == "https://ec.europa.eu/growth/tools-databases/cosing/details/12345"
def test_renames_common_name(self, sample_cosing_response):
"""Test that nameOfCommonIngredientsGlossary is renamed."""
result = clean_cosing(sample_cosing_response, full=False)
assert "commonName" in result
assert result["commonName"] == "Water"
assert "nameOfCommonIngredientsGlossary" not in result
def test_empty_lists_handled(self, sample_cosing_response):
"""Test that empty lists are handled correctly."""
sample_cosing_response["inciName"] = []
sample_cosing_response["casNo"] = []
result = clean_cosing(sample_cosing_response, full=False)
assert result["inciName"] == ""
assert result["casNo"] == []
class TestIntegration:
"""Integration tests with real API (marked as slow)."""
@pytest.mark.integration
def test_real_water_search(self):
"""Test real API call for WATER (requires internet)."""
result = cosing_search("WATER", mode="name")
if result and isinstance(result, dict):
# Real API call succeeded
assert "inciName" in result or "casNo" in result
@pytest.mark.integration
def test_real_cas_search(self):
"""Test real API call by CAS number (requires internet)."""
result = cosing_search("56-81-5", mode="cas")
if result and isinstance(result, dict):
assert "casNo" in result
@pytest.mark.integration
def test_full_workflow(self):
"""Test complete workflow: search -> clean."""
# Search for glycerin
raw_result = cosing_search("GLYCERIN", mode="name")
if raw_result and isinstance(raw_result, dict):
# Clean the result
clean_result = clean_cosing(raw_result, full=False)
# Verify cleaned structure
assert "cosingUrl" in clean_result
assert isinstance(clean_result.get("casNo"), list)
@ -1,857 +0,0 @@
"""
Tests for ECHA Find Service
Test coverage:
- search_dossier: Complete workflow for searching ECHA dossiers
- Substance search (by CAS, EC, rmlName)
- Dossier retrieval (Active/Inactive)
- HTML parsing for toxicology sections
- Error handling and edge cases
"""
import pytest
from unittest.mock import Mock, patch, MagicMock
from datetime import datetime
from pif_compiler.services.echa_find import search_dossier
class TestSearchDossierSubstanceSearch:
"""Test the initial substance search phase of search_dossier."""
@patch('pif_compiler.services.echa_find.requests.get')
def test_successful_cas_search(self, mock_get):
"""Test successful search by CAS number."""
# Mock the substance search API response
mock_response = Mock()
mock_response.json.return_value = {
"items": [{
"substanceIndex": {
"rmlId": "100.000.001",
"rmlName": "Test Substance",
"rmlCas": "50-00-0",
"rmlEc": "200-001-8"
}
}]
}
mock_get.return_value = mock_response
# Re-patch so a side_effect list can supply all three responses in order
with patch('pif_compiler.services.echa_find.requests.get') as mock_all_gets:
# First response: substance search
# Second response: dossier list
mock_dossier_response = Mock()
mock_dossier_response.json.return_value = {
"items": [{
"assetExternalId": "abc123",
"rootKey": "key123",
"lastUpdatedDate": "2024-01-15T10:30:00Z"
}]
}
# Third call: index.html page
mock_index_response = Mock()
mock_index_response.text = """
<html>
<div id="id_7_Toxicologicalinformation">
<a href="tox_summary_001"></a>
</div>
<div id="id_72_AcuteToxicity">
<a href="acute_tox_001"></a>
</div>
<div id="id_75_Repeateddosetoxicity">
<a href="repeated_dose_001"></a>
</div>
</html>
"""
mock_all_gets.side_effect = [
mock_response,
mock_dossier_response,
mock_index_response
]
result = search_dossier("50-00-0", input_type="rmlCas")
assert result is not False
assert result["rmlCas"] == "50-00-0"
assert result["rmlName"] == "Test Substance"
assert result["rmlId"] == "100.000.001"
assert result["rmlEc"] == "200-001-8"
@patch('pif_compiler.services.echa_find.requests.get')
def test_successful_ec_search(self, mock_get):
"""Test successful search by EC number."""
mock_response = Mock()
mock_response.json.return_value = {
"items": [{
"substanceIndex": {
"rmlId": "100.000.001",
"rmlName": "Test Substance",
"rmlCas": "50-00-0",
"rmlEc": "200-001-8"
}
}]
}
mock_get.return_value = mock_response
with patch('pif_compiler.services.echa_find.requests.get') as mock_all_gets:
mock_dossier_response = Mock()
mock_dossier_response.json.return_value = {
"items": [{
"assetExternalId": "abc123",
"rootKey": "key123",
"lastUpdatedDate": "2024-01-15T10:30:00Z"
}]
}
mock_index_response = Mock()
mock_index_response.text = "<html><div id='id_7_Toxicologicalinformation'><a href='tox_001'></a></div></html>"
mock_all_gets.side_effect = [
mock_response,
mock_dossier_response,
mock_index_response
]
result = search_dossier("200-001-8", input_type="rmlEc")
assert result is not False
assert result["rmlEc"] == "200-001-8"
@patch('pif_compiler.services.echa_find.requests.get')
def test_successful_name_search(self, mock_get):
"""Test successful search by substance name."""
mock_response = Mock()
mock_response.json.return_value = {
"items": [{
"substanceIndex": {
"rmlId": "100.000.001",
"rmlName": "formaldehyde",
"rmlCas": "50-00-0",
"rmlEc": "200-001-8"
}
}]
}
mock_get.return_value = mock_response
with patch('pif_compiler.services.echa_find.requests.get') as mock_all_gets:
mock_dossier_response = Mock()
mock_dossier_response.json.return_value = {
"items": [{
"assetExternalId": "abc123",
"rootKey": "key123",
"lastUpdatedDate": "2024-01-15T10:30:00Z"
}]
}
mock_index_response = Mock()
mock_index_response.text = "<html><div id='id_7_Toxicologicalinformation'><a href='tox_001'></a></div></html>"
mock_all_gets.side_effect = [
mock_response,
mock_dossier_response,
mock_index_response
]
result = search_dossier("formaldehyde", input_type="rmlName")
assert result is not False
assert result["rmlName"] == "formaldehyde"
@patch('pif_compiler.services.echa_find.requests.get')
def test_substance_not_found(self, mock_get):
"""Test when substance is not found in ECHA."""
mock_response = Mock()
mock_response.json.return_value = {"items": []}
mock_get.return_value = mock_response
result = search_dossier("999-99-9", input_type="rmlCas")
assert result is False
@patch('pif_compiler.services.echa_find.requests.get')
def test_empty_items_array(self, mock_get):
"""Test when API returns empty items array."""
mock_response = Mock()
mock_response.json.return_value = {"items": []}
mock_get.return_value = mock_response
result = search_dossier("NONEXISTENT", input_type="rmlName")
assert result is False
@patch('pif_compiler.services.echa_find.requests.get')
def test_malformed_api_response(self, mock_get):
"""Test when API response is malformed."""
mock_response = Mock()
mock_response.json.return_value = {} # Missing 'items' key
mock_get.return_value = mock_response
result = search_dossier("50-00-0", input_type="rmlCas")
assert result is False
class TestSearchDossierInputTypeValidation:
"""Test input_type parameter validation."""
@patch('pif_compiler.services.echa_find.requests.get')
def test_input_type_mismatch_cas(self, mock_get):
"""Test when input_type doesn't match actual search result (CAS)."""
mock_response = Mock()
mock_response.json.return_value = {
"items": [{
"substanceIndex": {
"rmlId": "100.000.001",
"rmlName": "Test Substance",
"rmlCas": "50-00-0",
"rmlEc": "200-001-8"
}
}]
}
mock_get.return_value = mock_response
# Search with CAS but specify wrong input_type
result = search_dossier("50-00-0", input_type="rmlEc")
assert isinstance(result, str)
assert "search_error" in result
assert "not equal" in result
@patch('pif_compiler.services.echa_find.requests.get')
def test_input_type_correct_match(self, mock_get):
"""Test when input_type correctly matches search result."""
mock_response = Mock()
mock_response.json.return_value = {
"items": [{
"substanceIndex": {
"rmlId": "100.000.001",
"rmlName": "Test Substance",
"rmlCas": "50-00-0",
"rmlEc": "200-001-8"
}
}]
}
mock_get.return_value = mock_response
with patch('pif_compiler.services.echa_find.requests.get') as mock_all_gets:
mock_dossier_response = Mock()
mock_dossier_response.json.return_value = {
"items": [{
"assetExternalId": "abc123",
"rootKey": "key123",
"lastUpdatedDate": "2024-01-15T10:30:00Z"
}]
}
mock_index_response = Mock()
mock_index_response.text = "<html><div id='id_7_Toxicologicalinformation'><a href='tox_001'></a></div></html>"
mock_all_gets.side_effect = [
mock_response,
mock_dossier_response,
mock_index_response
]
result = search_dossier("50-00-0", input_type="rmlCas")
assert result is not False
assert isinstance(result, dict)
class TestSearchDossierDossierRetrieval:
"""Test dossier retrieval (Active/Inactive)."""
@patch('pif_compiler.services.echa_find.requests.get')
def test_active_dossier_found(self, mock_get):
"""Test when active dossier is found."""
mock_substance_response = Mock()
mock_substance_response.json.return_value = {
"items": [{
"substanceIndex": {
"rmlId": "100.000.001",
"rmlName": "Test Substance",
"rmlCas": "50-00-0",
"rmlEc": "200-001-8"
}
}]
}
mock_dossier_response = Mock()
mock_dossier_response.json.return_value = {
"items": [{
"assetExternalId": "abc123",
"rootKey": "key123",
"lastUpdatedDate": "2024-01-15T10:30:00Z"
}]
}
mock_index_response = Mock()
mock_index_response.text = "<html><div id='id_7_Toxicologicalinformation'><a href='tox_001'></a></div></html>"
mock_get.side_effect = [
mock_substance_response,
mock_dossier_response,
mock_index_response
]
result = search_dossier("50-00-0", input_type="rmlCas")
assert result is not False
assert result["dossierType"] == "Active"
@patch('pif_compiler.services.echa_find.requests.get')
def test_inactive_dossier_fallback(self, mock_get):
"""Test when only inactive dossier exists (fallback)."""
mock_substance_response = Mock()
mock_substance_response.json.return_value = {
"items": [{
"substanceIndex": {
"rmlId": "100.000.001",
"rmlName": "Test Substance",
"rmlCas": "50-00-0",
"rmlEc": "200-001-8"
}
}]
}
# First dossier call returns empty (no active)
mock_active_dossier_response = Mock()
mock_active_dossier_response.json.return_value = {"items": []}
# Second dossier call returns inactive
mock_inactive_dossier_response = Mock()
mock_inactive_dossier_response.json.return_value = {
"items": [{
"assetExternalId": "abc123",
"rootKey": "key123",
"lastUpdatedDate": "2024-01-15T10:30:00Z"
}]
}
mock_index_response = Mock()
mock_index_response.text = "<html><div id='id_7_Toxicologicalinformation'><a href='tox_001'></a></div></html>"
mock_get.side_effect = [
mock_substance_response,
mock_active_dossier_response,
mock_inactive_dossier_response,
mock_index_response
]
result = search_dossier("50-00-0", input_type="rmlCas")
assert result is not False
assert result["dossierType"] == "Inactive"
@patch('pif_compiler.services.echa_find.requests.get')
def test_no_dossiers_found(self, mock_get):
"""Test when no dossiers (active or inactive) are found."""
mock_substance_response = Mock()
mock_substance_response.json.return_value = {
"items": [{
"substanceIndex": {
"rmlId": "100.000.001",
"rmlName": "Test Substance",
"rmlCas": "50-00-0",
"rmlEc": "200-001-8"
}
}]
}
# Both active and inactive return empty
mock_empty_response = Mock()
mock_empty_response.json.return_value = {"items": []}
mock_get.side_effect = [
mock_substance_response,
mock_empty_response, # Active
mock_empty_response # Inactive
]
result = search_dossier("50-00-0", input_type="rmlCas")
assert result is False
@patch('pif_compiler.services.echa_find.requests.get')
def test_last_update_date_parsed(self, mock_get):
"""Test that lastUpdateDate is correctly parsed."""
mock_substance_response = Mock()
mock_substance_response.json.return_value = {
"items": [{
"substanceIndex": {
"rmlId": "100.000.001",
"rmlName": "Test Substance",
"rmlCas": "50-00-0",
"rmlEc": "200-001-8"
}
}]
}
mock_dossier_response = Mock()
mock_dossier_response.json.return_value = {
"items": [{
"assetExternalId": "abc123",
"rootKey": "key123",
"lastUpdatedDate": "2024-01-15T10:30:00Z"
}]
}
mock_index_response = Mock()
mock_index_response.text = "<html><div id='id_7_Toxicologicalinformation'><a href='tox_001'></a></div></html>"
mock_get.side_effect = [
mock_substance_response,
mock_dossier_response,
mock_index_response
]
result = search_dossier("50-00-0", input_type="rmlCas")
assert result is not False
assert "lastUpdateDate" in result
assert result["lastUpdateDate"] == "2024-01-15"
@patch('pif_compiler.services.echa_find.requests.get')
def test_missing_last_update_date(self, mock_get):
"""Test when lastUpdateDate is missing from response."""
mock_substance_response = Mock()
mock_substance_response.json.return_value = {
"items": [{
"substanceIndex": {
"rmlId": "100.000.001",
"rmlName": "Test Substance",
"rmlCas": "50-00-0",
"rmlEc": "200-001-8"
}
}]
}
mock_dossier_response = Mock()
mock_dossier_response.json.return_value = {
"items": [{
"assetExternalId": "abc123",
"rootKey": "key123"
# lastUpdatedDate missing
}]
}
mock_index_response = Mock()
mock_index_response.text = "<html><div id='id_7_Toxicologicalinformation'><a href='tox_001'></a></div></html>"
mock_get.side_effect = [
mock_substance_response,
mock_dossier_response,
mock_index_response
]
result = search_dossier("50-00-0", input_type="rmlCas")
assert result is not False
# Should still work, just without lastUpdateDate
assert "lastUpdateDate" not in result
class TestSearchDossierHTMLParsing:
"""Test HTML parsing for toxicology sections."""
@patch('pif_compiler.services.echa_find.requests.get')
def test_all_tox_sections_found(self, mock_get):
"""Test when all toxicology sections are found."""
mock_substance_response = Mock()
mock_substance_response.json.return_value = {
"items": [{
"substanceIndex": {
"rmlId": "100.000.001",
"rmlName": "Test Substance",
"rmlCas": "50-00-0",
"rmlEc": "200-001-8"
}
}]
}
mock_dossier_response = Mock()
mock_dossier_response.json.return_value = {
"items": [{
"assetExternalId": "abc123",
"rootKey": "key123",
"lastUpdatedDate": "2024-01-15T10:30:00Z"
}]
}
mock_index_response = Mock()
mock_index_response.text = """
<html>
<div id="id_7_Toxicologicalinformation">
<a href="tox_summary_001"></a>
</div>
<div id="id_72_AcuteToxicity">
<a href="acute_tox_001"></a>
</div>
<div id="id_75_Repeateddosetoxicity">
<a href="repeated_dose_001"></a>
</div>
</html>
"""
mock_get.side_effect = [
mock_substance_response,
mock_dossier_response,
mock_index_response
]
result = search_dossier("50-00-0", input_type="rmlCas")
assert result is not False
assert "ToxSummary" in result
assert "AcuteToxicity" in result
assert "RepeatedDose" in result
assert "tox_summary_001" in result["ToxSummary"]
assert "acute_tox_001" in result["AcuteToxicity"]
assert "repeated_dose_001" in result["RepeatedDose"]
@patch('pif_compiler.services.echa_find.requests.get')
def test_only_tox_summary_found(self, mock_get):
"""Test when only ToxSummary section exists."""
mock_substance_response = Mock()
mock_substance_response.json.return_value = {
"items": [{
"substanceIndex": {
"rmlId": "100.000.001",
"rmlName": "Test Substance",
"rmlCas": "50-00-0",
"rmlEc": "200-001-8"
}
}]
}
mock_dossier_response = Mock()
mock_dossier_response.json.return_value = {
"items": [{
"assetExternalId": "abc123",
"rootKey": "key123",
"lastUpdatedDate": "2024-01-15T10:30:00Z"
}]
}
mock_index_response = Mock()
mock_index_response.text = """
<html>
<div id="id_7_Toxicologicalinformation">
<a href="tox_summary_001"></a>
</div>
</html>
"""
mock_get.side_effect = [
mock_substance_response,
mock_dossier_response,
mock_index_response
]
result = search_dossier("50-00-0", input_type="rmlCas")
assert result is not False
assert "ToxSummary" in result
assert "AcuteToxicity" not in result
assert "RepeatedDose" not in result
@patch('pif_compiler.services.echa_find.requests.get')
def test_no_tox_sections_found(self, mock_get):
"""Test when no toxicology sections are found."""
mock_substance_response = Mock()
mock_substance_response.json.return_value = {
"items": [{
"substanceIndex": {
"rmlId": "100.000.001",
"rmlName": "Test Substance",
"rmlCas": "50-00-0",
"rmlEc": "200-001-8"
}
}]
}
mock_dossier_response = Mock()
mock_dossier_response.json.return_value = {
"items": [{
"assetExternalId": "abc123",
"rootKey": "key123",
"lastUpdatedDate": "2024-01-15T10:30:00Z"
}]
}
mock_index_response = Mock()
mock_index_response.text = "<html><body>No toxicology sections</body></html>"
mock_get.side_effect = [
mock_substance_response,
mock_dossier_response,
mock_index_response
]
result = search_dossier("50-00-0", input_type="rmlCas")
assert result is not False
assert "ToxSummary" not in result
assert "AcuteToxicity" not in result
assert "RepeatedDose" not in result
# Basic info should still be present
assert "rmlId" in result
assert "index" in result
@patch('pif_compiler.services.echa_find.requests.get')
def test_js_links_created(self, mock_get):
"""Test that both HTML and JS links are created."""
mock_substance_response = Mock()
mock_substance_response.json.return_value = {
"items": [{
"substanceIndex": {
"rmlId": "100.000.001",
"rmlName": "Test Substance",
"rmlCas": "50-00-0",
"rmlEc": "200-001-8"
}
}]
}
mock_dossier_response = Mock()
mock_dossier_response.json.return_value = {
"items": [{
"assetExternalId": "abc123",
"rootKey": "key123",
"lastUpdatedDate": "2024-01-15T10:30:00Z"
}]
}
mock_index_response = Mock()
mock_index_response.text = """
<html>
<div id="id_7_Toxicologicalinformation">
<a href="tox_summary_001"></a>
</div>
<div id="id_72_AcuteToxicity">
<a href="acute_tox_001"></a>
</div>
</html>
"""
mock_get.side_effect = [
mock_substance_response,
mock_dossier_response,
mock_index_response
]
result = search_dossier("50-00-0", input_type="rmlCas")
assert result is not False
assert "ToxSummary" in result
assert "ToxSummary_js" in result
assert "AcuteToxicity" in result
assert "AcuteToxicity_js" in result
assert "index" in result
assert "index_js" in result
class TestSearchDossierURLConstruction:
"""Test URL construction for various endpoints."""
@patch('pif_compiler.services.echa_find.requests.get')
def test_search_response_url(self, mock_get):
"""Test that search_response URL is correctly constructed."""
mock_substance_response = Mock()
mock_substance_response.json.return_value = {
"items": [{
"substanceIndex": {
"rmlId": "100.000.001",
"rmlName": "Test Substance",
"rmlCas": "50-00-0",
"rmlEc": "200-001-8"
}
}]
}
mock_dossier_response = Mock()
mock_dossier_response.json.return_value = {
"items": [{
"assetExternalId": "abc123",
"rootKey": "key123",
"lastUpdatedDate": "2024-01-15T10:30:00Z"
}]
}
mock_index_response = Mock()
mock_index_response.text = "<html></html>"
mock_get.side_effect = [
mock_substance_response,
mock_dossier_response,
mock_index_response
]
result = search_dossier("50-00-0", input_type="rmlCas")
assert result is not False
assert "search_response" in result
assert "50-00-0" in result["search_response"]
assert "https://chem.echa.europa.eu/api-substance/v1/substance" in result["search_response"]
@patch('pif_compiler.services.echa_find.requests.get')
def test_url_encoding(self, mock_get):
"""Test that special characters in substance names are URL-encoded."""
substance_name = "test substance with spaces"
mock_substance_response = Mock()
mock_substance_response.json.return_value = {
"items": [{
"substanceIndex": {
"rmlId": "100.000.001",
"rmlName": substance_name,
"rmlCas": "50-00-0",
"rmlEc": "200-001-8"
}
}]
}
mock_dossier_response = Mock()
mock_dossier_response.json.return_value = {
"items": [{
"assetExternalId": "abc123",
"rootKey": "key123",
"lastUpdatedDate": "2024-01-15T10:30:00Z"
}]
}
mock_index_response = Mock()
mock_index_response.text = "<html></html>"
mock_get.side_effect = [
mock_substance_response,
mock_dossier_response,
mock_index_response
]
result = search_dossier(substance_name, input_type="rmlName")
assert result is not False
assert "search_response" in result
# Spaces should be encoded
assert "%20" in result["search_response"] or "+" in result["search_response"]
class TestIntegration:
"""Integration tests with real API (marked as integration)."""
@pytest.mark.integration
def test_real_formaldehyde_search(self):
"""Test real API call for formaldehyde (requires internet)."""
result = search_dossier("50-00-0", input_type="rmlCas")
if result and isinstance(result, dict):
# Real API call succeeded
assert "rmlId" in result
assert "rmlName" in result
assert "rmlCas" in result
assert result["rmlCas"] == "50-00-0"
assert "index" in result
assert "dossierType" in result
@pytest.mark.integration
def test_real_water_search(self):
"""Test real API call for water by CAS (requires internet)."""
result = search_dossier("7732-18-5", input_type="rmlCas")
if result and isinstance(result, dict):
assert "rmlCas" in result
assert result["rmlCas"] == "7732-18-5"
@pytest.mark.integration
def test_real_nonexistent_substance(self):
"""Test real API call for non-existent substance (requires internet)."""
result = search_dossier("999-99-9", input_type="rmlCas")
# Should return False (or an error string) for a non-existent substance
assert result is False or isinstance(result, str)
@pytest.mark.integration
def test_real_glycerin_search(self):
"""Test real API call for glycerin (requires internet)."""
result = search_dossier("56-81-5", input_type="rmlCas")
if result and isinstance(result, dict):
assert "rmlCas" in result
assert result["rmlCas"] == "56-81-5"
assert "rmlId" in result
assert "dossierType" in result
@pytest.mark.integration
def test_real_niacinamide_search(self):
"""Test real API call for niacinamide (requires internet)."""
result = search_dossier("98-92-0", input_type="rmlCas")
if result and isinstance(result, dict):
assert "rmlCas" in result
assert result["rmlCas"] == "98-92-0"
@pytest.mark.integration
def test_real_retinol_search(self):
"""Test real API call for retinol (requires internet)."""
result = search_dossier("68-26-8", input_type="rmlCas")
if result and isinstance(result, dict):
assert "rmlCas" in result
assert result["rmlCas"] == "68-26-8"
@pytest.mark.integration
def test_real_caffeine_search(self):
"""Test real API call for caffeine (requires internet)."""
result = search_dossier("58-08-2", input_type="rmlCas")
if result and isinstance(result, dict):
assert "rmlCas" in result
assert result["rmlCas"] == "58-08-2"
@pytest.mark.integration
def test_real_salicylic_acid_search(self):
"""Test real API call for salicylic acid (requires internet)."""
result = search_dossier("69-72-7", input_type="rmlCas")
if result and isinstance(result, dict):
assert "rmlCas" in result
assert result["rmlCas"] == "69-72-7"
@pytest.mark.integration
def test_real_titanium_dioxide_search(self):
"""Test real API call for titanium dioxide (requires internet)."""
result = search_dossier("13463-67-7", input_type="rmlCas")
if result and isinstance(result, dict):
assert "rmlCas" in result
assert result["rmlCas"] == "13463-67-7"
@pytest.mark.integration
def test_real_zinc_oxide_search(self):
"""Test real API call for zinc oxide (requires internet)."""
result = search_dossier("1314-13-2", input_type="rmlCas")
if result and isinstance(result, dict):
assert "rmlCas" in result
assert result["rmlCas"] == "1314-13-2"
@pytest.mark.integration
def test_multiple_cosmetic_ingredients(self):
"""Test real API calls for multiple cosmetic ingredients (requires internet)."""
# Test a subset of common cosmetic ingredients
test_ingredients = [
("water", "7732-18-5"),
("glycerin", "56-81-5"),
("propylene_glycol", "57-55-6"),
]
import time
for name, cas in test_ingredients:
result = search_dossier(cas, input_type="rmlCas")
if result and isinstance(result, dict):
assert result["rmlCas"] == cas
assert "rmlId" in result
# Give the API some time between requests
time.sleep(0.5)
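For the `@pytest.mark.integration` decorator to work without warnings, the marker has to be registered with pytest. Assuming the project keeps its pytest configuration in `pyproject.toml` (the actual location and wording may differ), a minimal registration would look like:

```toml
[tool.pytest.ini_options]
markers = [
    "integration: tests that hit live external APIs (deselect with -m 'not integration')",
]
```

With this in place, the network-dependent tests can be skipped in CI via `uv run pytest -m "not integration"`.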


@@ -1,153 +0,0 @@
# PIF Compiler - MongoDB Docker Setup
## Quick Start
Start MongoDB and Mongo Express web interface:
```bash
cd utils
docker-compose up -d
```
Stop the services:
```bash
docker-compose down
```
Stop and remove all data:
```bash
docker-compose down -v
```
## Services
### MongoDB
- **Port**: 27017
- **Database**: toxinfo
- **Username**: admin
- **Password**: admin123
- **Connection String**: `mongodb://admin:admin123@localhost:27017/toxinfo?authSource=admin`
### Mongo Express (Web UI)
- **URL**: http://localhost:8082
- **Username**: admin
- **Password**: admin123
## Usage in Python
Update your MongoDB connection in `src/pif_compiler/functions/mongo_functions.py`:
```python
# For local development with Docker
db = connect(user="admin", password="admin123", database="toxinfo")
```
Or use the connection URI directly:
```python
from pymongo import MongoClient
client = MongoClient("mongodb://admin:admin123@localhost:27017/toxinfo?authSource=admin")
db = client['toxinfo']
```
## Data Persistence
Data is persisted in Docker volumes:
- `mongodb_data` - Database files
- `mongodb_config` - Configuration files
These volumes persist even when containers are stopped.
## Creating Application User
It's recommended to create a dedicated user for your application instead of using the admin account.
### Option 1: Using mongosh (MongoDB Shell)
```bash
# Access MongoDB shell
docker exec -it pif_mongodb mongosh -u admin -p admin123 --authenticationDatabase admin
# In the MongoDB shell, run:
use toxinfo
db.createUser({
user: "pif_app",
pwd: "pif_app_password",
roles: [
{ role: "readWrite", db: "toxinfo" }
]
})
# Exit the shell
exit
```
### Option 2: Using Mongo Express Web UI
1. Go to http://localhost:8082
2. Login with admin/admin123
3. Select `toxinfo` database
4. Click on "Users" tab
5. Add new user with `readWrite` role
### Option 3: Using Python Script
Create a file `utils/create_user.py`:
```python
from pymongo import MongoClient
# Connect as admin
client = MongoClient("mongodb://admin:admin123@localhost:27017/?authSource=admin")
db = client['toxinfo']
# Create application user
db.command("createUser", "pif_app",
pwd="pif_app_password",
roles=[{"role": "readWrite", "db": "toxinfo"}])
print("User 'pif_app' created successfully!")
client.close()
```
Run it:
```bash
cd utils
uv run python create_user.py
```
### Update Your Application
After creating the user, update your connection in `src/pif_compiler/functions/mongo_functions.py`:
```python
# Use application user instead of admin
db = connect(user="pif_app", password="pif_app_password", database="toxinfo")
```
Or with connection URI:
```python
client = MongoClient("mongodb://pif_app:pif_app_password@localhost:27017/toxinfo?authSource=toxinfo")
```
### Available Roles
- `read`: Read-only access to the database
- `readWrite`: Read and write access (recommended for your app)
- `dbAdmin`: Database administration (create indexes, etc.)
- `userAdmin`: Manage users and roles
## Security Notes
⚠️ **WARNING**: The default credentials are for local development only.
For production:
1. Change all passwords in `docker-compose.yml`
2. Use environment variables or secrets management
3. Create dedicated users with minimal required permissions
4. Configure firewall rules
5. Enable SSL/TLS connections
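Point 2 above (environment variables instead of hard-coded credentials) can be sketched on the Python side as follows. The variable names (`MONGO_USER`, `MONGO_PASSWORD`, etc.) are hypothetical — adjust them to your deployment's conventions:

```python
import os

# Hypothetical env var names; the defaults mirror the local Docker setup above
user = os.environ.get("MONGO_USER", "admin")
password = os.environ.get("MONGO_PASSWORD", "admin123")
host = os.environ.get("MONGO_HOST", "localhost")
port = os.environ.get("MONGO_PORT", "27017")
database = os.environ.get("MONGO_DB", "toxinfo")

# Build the connection URI from the environment rather than a literal string
uri = f"mongodb://{user}:{password}@{host}:{port}/{database}?authSource=admin"
print(uri)
```

In production the defaults would be dropped so that a missing variable fails loudly instead of silently falling back to development credentials.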


@@ -1,140 +0,0 @@
#!/usr/bin/env python3
"""
Change Log Manager
Manages a change.log file with external and internal changes
"""
import os
from datetime import datetime
from enum import Enum
class ChangeType(Enum):
EXTERNAL = "EXTERNAL"
INTERNAL = "INTERNAL"
class ChangeLogManager:
def __init__(self, log_file="change.log"):
self.log_file = log_file
self._ensure_log_exists()
def _ensure_log_exists(self):
"""Create the log file if it doesn't exist"""
if not os.path.exists(self.log_file):
with open(self.log_file, 'w') as f:
f.write("# Change Log\n")
f.write(f"# Created: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
def add_change(self, change_type, description):
"""
Add a change entry to the log
Args:
change_type (ChangeType): Type of change (EXTERNAL or INTERNAL)
description (str): Description of the change
"""
timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
entry = f"[{timestamp}] [{change_type.value}] {description}\n"
with open(self.log_file, 'a') as f:
f.write(entry)
print(f"✓ Change added: {change_type.value} - {description}")
def view_log(self, filter_type=None):
"""
View the change log with optional filtering
Args:
filter_type (ChangeType, optional): Filter by change type
"""
if not os.path.exists(self.log_file):
print("No change log found.")
return
with open(self.log_file, 'r') as f:
lines = f.readlines()
print("\n" + "="*70)
print("CHANGE LOG")
print("="*70 + "\n")
for line in lines:
if filter_type and f"[{filter_type.value}]" not in line:
continue
print(line, end='')
print("\n" + "="*70 + "\n")
def get_statistics(self):
"""Display statistics about the change log"""
if not os.path.exists(self.log_file):
print("No change log found.")
return
with open(self.log_file, 'r') as f:
lines = f.readlines()
external_count = sum(1 for line in lines if "[EXTERNAL]" in line)
internal_count = sum(1 for line in lines if "[INTERNAL]" in line)
total = external_count + internal_count
print("\n" + "="*40)
print("CHANGE LOG STATISTICS")
print("="*40)
print(f"Total changes: {total}")
print(f"External changes: {external_count}")
print(f"Internal changes: {internal_count}")
print("="*40 + "\n")
def main():
manager = ChangeLogManager()
while True:
print("\n📝 Change Log Manager")
print("1. Add External Change")
print("2. Add Internal Change")
print("3. View All Changes")
print("4. View External Changes Only")
print("5. View Internal Changes Only")
print("6. Show Statistics")
print("7. Exit")
choice = input("\nSelect an option (1-7): ").strip()
if choice == '1':
description = input("Enter external change description: ").strip()
if description:
manager.add_change(ChangeType.EXTERNAL, description)
else:
print("Description cannot be empty.")
elif choice == '2':
description = input("Enter internal change description: ").strip()
if description:
manager.add_change(ChangeType.INTERNAL, description)
else:
print("Description cannot be empty.")
elif choice == '3':
manager.view_log()
elif choice == '4':
manager.view_log(filter_type=ChangeType.EXTERNAL)
elif choice == '5':
manager.view_log(filter_type=ChangeType.INTERNAL)
elif choice == '6':
manager.get_statistics()
elif choice == '7':
print("Goodbye!")
break
else:
print("Invalid option. Please select 1-7.")
if __name__ == "__main__":
main()
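The `[timestamp] [TYPE] description` format written by `add_change` is simple enough to parse back out of the log file, which is useful if the statistics ever need to move beyond substring counting. A standalone sketch, using made-up sample entries:

```python
import re

# Sample entries in the format written by ChangeLogManager.add_change (invented content)
sample_log = """\
[2024-01-15 10:30:00] [EXTERNAL] Bumped ECHA API base URL
[2024-01-15 11:00:00] [INTERNAL] Refactored mongo_functions
[2024-01-16 09:15:00] [EXTERNAL] COSING search endpoint changed
"""

# One named group per field of the log-line format
pattern = re.compile(r"^\[(?P<ts>[^\]]+)\] \[(?P<type>EXTERNAL|INTERNAL)\] (?P<desc>.+)$")

entries = [m.groupdict() for line in sample_log.splitlines() if (m := pattern.match(line))]
external = [e for e in entries if e["type"] == "EXTERNAL"]

print(len(entries), len(external))  # 3 2
```

Unlike the `"[EXTERNAL]" in line` checks in `get_statistics`, a full-line match cannot be fooled by a description that happens to contain the bracketed tag.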


@@ -1,86 +0,0 @@
"""
Create MongoDB application user for PIF Compiler.
This script creates a dedicated user with readWrite permissions
on the toxinfo database instead of using the admin account.
"""
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError, OperationFailure
import sys
def create_app_user():
"""Create application user for toxinfo database."""
# Configuration
ADMIN_USER = "admin"
ADMIN_PASSWORD = "admin123"
MONGO_HOST = "localhost"
MONGO_PORT = 27017
APP_USER = "pif_app"
APP_PASSWORD = "marox123"
APP_DATABASE = "pif-projects"
print(f"Connecting to MongoDB as admin...")
try:
# Connect as admin
client = MongoClient(
f"mongodb://{ADMIN_USER}:{ADMIN_PASSWORD}@{MONGO_HOST}:{MONGO_PORT}/?authSource=admin",
serverSelectionTimeoutMS=5000
)
# Test connection
client.admin.command('ping')
print("✓ Connected to MongoDB successfully")
# Switch to application database
db = client[APP_DATABASE]
# Create application user
print(f"\nCreating user '{APP_USER}' with readWrite permissions on '{APP_DATABASE}'...")
db.command(
"createUser",
APP_USER,
pwd=APP_PASSWORD,
roles=[{"role": "readWrite", "db": APP_DATABASE}]
)
print(f"✓ User '{APP_USER}' created successfully!")
print(f"\nConnection details:")
print(f" Username: {APP_USER}")
print(f" Password: {APP_PASSWORD}")
print(f" Database: {APP_DATABASE}")
print(f" Connection String: mongodb://{APP_USER}:{APP_PASSWORD}@{MONGO_HOST}:{MONGO_PORT}/{APP_DATABASE}?authSource={APP_DATABASE}")
print(f"\nUpdate your mongo_functions.py with:")
print(f" db = connect(user='{APP_USER}', password='{APP_PASSWORD}', database='{APP_DATABASE}')")
client.close()
return 0
except OperationFailure as e:
    # createUser on an existing user raises OperationFailure (code 51003),
    # not DuplicateKeyError, which only applies to duplicate index keys on writes
    if "already exists" in str(e):
        print(f"⚠ User '{APP_USER}' already exists!")
        print(f"\nTo delete and recreate, run:")
        print(f"  docker exec -it pif_mongodb mongosh -u admin -p admin123 --authenticationDatabase admin")
        print(f"  use {APP_DATABASE}")
        print(f"  db.dropUser('{APP_USER}')")
    else:
        print(f"✗ MongoDB operation failed: {e}")
    return 1
except Exception as e:
print(f"✗ Error: {e}")
print("\nMake sure MongoDB is running:")
print(" cd utils")
print(" docker-compose up -d")
return 1
if __name__ == "__main__":
sys.exit(create_app_user())


@@ -1,28 +0,0 @@
version: '3.8'
services:
mongodb:
image: mongo:latest
container_name: personal_mongodb
restart: unless-stopped
environment:
MONGO_INITDB_ROOT_USERNAME: admin
MONGO_INITDB_ROOT_PASSWORD: bello98A.
MONGO_INITDB_DATABASE: toxinfo
ports:
- "27017:27017"
volumes:
- mongodb_data:/data/db
- mongodb_config:/data/configdb
networks:
- personal_network
volumes:
mongodb_data:
driver: local
mongodb_config:
driver: local
networks:
personal_network:
driver: bridge

uv.lock

@@ -966,6 +966,7 @@ dependencies = [
{ name = "markdown-to-json" },
{ name = "markdownify" },
{ name = "playwright" },
{ name = "psycopg2" },
{ name = "pubchemprops" },
{ name = "pubchempy" },
{ name = "pydantic" },
@@ -990,6 +991,7 @@ requires-dist = [
{ name = "markdown-to-json", specifier = ">=2.1.2" },
{ name = "markdownify", specifier = ">=1.2.0" },
{ name = "playwright", specifier = ">=1.55.0" },
{ name = "psycopg2", specifier = ">=2.9.11" },
{ name = "pubchemprops", specifier = ">=0.1.1" },
{ name = "pubchempy", specifier = ">=1.0.5" },
{ name = "pydantic", specifier = ">=2.11.10" },
@@ -1128,6 +1130,17 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/26/65/1070a6e3c036f39142c2820c4b52e9243246fcfc3f96239ac84472ba361e/psutil-7.1.0-cp37-abi3-win_arm64.whl", hash = "sha256:6937cb68133e7c97b6cc9649a570c9a18ba0efebed46d8c5dae4c07fa1b67a07", size = 244971 },
]
[[package]]
name = "psycopg2"
version = "2.9.11"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/89/8d/9d12bc8677c24dad342ec777529bce705b3e785fa05d85122b5502b9ab55/psycopg2-2.9.11.tar.gz", hash = "sha256:964d31caf728e217c697ff77ea69c2ba0865fa41ec20bb00f0977e62fdcc52e3", size = 379598 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/b5/bf/635fbe5dd10ed200afbbfbe98f8602829252ca1cce81cc48fb25ed8dadc0/psycopg2-2.9.11-cp312-cp312-win_amd64.whl", hash = "sha256:e03e4a6dbe87ff81540b434f2e5dc2bddad10296db5eea7bdc995bf5f4162938", size = 2713969 },
{ url = "https://files.pythonhosted.org/packages/88/5a/18c8cb13fc6908dc41a483d2c14d927a7a3f29883748747e8cb625da6587/psycopg2-2.9.11-cp313-cp313-win_amd64.whl", hash = "sha256:8dc379166b5b7d5ea66dcebf433011dfc51a7bb8a5fc12367fa05668e5fc53c8", size = 2714048 },
{ url = "https://files.pythonhosted.org/packages/47/08/737aa39c78d705a7ce58248d00eeba0e9fc36be488f9b672b88736fbb1f7/psycopg2-2.9.11-cp314-cp314-win_amd64.whl", hash = "sha256:f10a48acba5fe6e312b891f290b4d2ca595fc9a06850fe53320beac353575578", size = 2803738 },
]
[[package]]
name = "pubchemprops"
version = "0.1.1"