refactoring and cleaning
This commit is contained in:
parent
2b39ba6324
commit
e02aca560c
58 changed files with 1439 additions and 8610 deletions
4
.gitignore
vendored
@@ -205,3 +205,7 @@ cython_debug/
marimo/_static/
marimo/_lsp/
__marimo__/

# other
pdfs/
126
CLAUDE.md
Normal file
@@ -0,0 +1,126 @@
# PIF Compiler - Comsoguard API

## Project Overview

**PIF Compiler** (branded as **Comsoguard API**) is a cosmetic safety assessment tool that compiles Product Information Files (PIFs) for cosmetic products, as required by EU regulation. It aggregates toxicological, regulatory, and chemical data from multiple external sources to evaluate ingredient safety (DAP calculations, SED/MoS computations, NOAEL extraction).

The primary language is **Italian** (variable names, comments, some log messages). Code is written in **Python 3.12**.

## Tech Stack

- **Framework**: FastAPI + Uvicorn
- **Package manager**: uv (with `pyproject.toml` + `uv.lock`)
- **Data models**: Pydantic v2
- **Databases**: MongoDB (substance-data cache via `pymongo`) + PostgreSQL (product presets, search logs, compilers via `psycopg2`)
- **Web scraping**: Playwright (PDF generation, browser automation), BeautifulSoup4 (HTML parsing)
- **External APIs**: PubChem (`pubchempy` + `pubchemprops`), COSING (EU Search API), ECHA (chem.echa.europa.eu)
- **Logging**: Python `logging` with rotating file handlers (`logs/debug.log`, `logs/error.log`)

## Project Structure

```
src/pif_compiler/
├── main.py                # FastAPI app, middleware, exception handlers, routers
├── __init__.py
├── api/
│   └── routes/
│       ├── api_echa.py    # ECHA endpoints (single + batch search)
│       ├── api_cosing.py  # COSING endpoints (single + batch search)
│       └── common.py      # PDF generation, PubChem, CIR search endpoints
├── classes/
│   └── models.py          # Pydantic models: Ingredient, DapInfo, CosingInfo,
│                          #   ToxIndicator, Toxicity, Esposition, RetentionFactors
├── functions/
│   ├── common_func.py     # PDF generation with Playwright
│   ├── common_log.py      # Centralized logging configuration
│   └── db_utils.py        # MongoDB + PostgreSQL connection helpers
└── services/
    ├── srv_echa.py        # ECHA scraping, HTML parsing, toxicology extraction,
    │                      #   orchestrator (validate -> check cache -> fetch -> store)
    ├── srv_cosing.py      # COSING API search + data cleaning
    ├── srv_pubchem.py     # PubChem property extraction (DAP data)
    └── srv_cir.py         # CIR (Cosmetic Ingredient Review) database search
```

### Other directories

- `data/` - Input data files (`input.json` with sample INCI/CAS/percentage lists), old CSV data
- `logs/` - Rotating log files (`debug.log`, `error.log`), auto-generated
- `pdfs/` - Generated PDF files from ECHA dossier pages
- `marimo/` - **Ignore this folder.** Debug/test notebooks, not part of the main application

## Architecture & Data Flow

### Core workflow (per ingredient)

1. **Input**: CAS number (and optionally INCI name + percentage)
2. **COSING** (`srv_cosing.py`): Search the EU cosmetic-ingredients database for regulatory restrictions, INCI match, annex references
3. **ECHA** (`srv_echa.py`): Search substance -> get dossier -> parse HTML index -> extract toxicological data (NOAEL, LD50, LOAEL) from acute and repeated-dose toxicity pages
4. **PubChem** (`srv_pubchem.py`): Get molecular weight, XLogP, TPSA, melting point, dissociation constants
5. **DAP calculation** (`DapInfo` model): Dermal Absorption Percentage based on molecular properties (MW > 500, LogP, TPSA > 120, etc.)
6. **Toxicity ranking** (`Toxicity` model): Select the best toxicological indicator by priority (NOAEL > LOAEL > LD50) and apply safety factors
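The indicator-selection and safety-margin steps can be sketched as follows. This is a minimal illustration of the priority rule and the conventional MoS = NOAEL / SED ratio; the function names are hypothetical, and the real logic lives in the Pydantic models (`DapInfo`, `Toxicity`).

```python
# Hypothetical sketch of the toxicity-ranking and MoS steps described above.

def select_best_indicator(indicators):
    """Pick the best available indicator with priority NOAEL > LOAEL > LD50."""
    for name in ("NOAEL", "LOAEL", "LD50"):
        value = indicators.get(name)
        if value is not None:
            return name, value
    raise ValueError("no toxicity indicator available")

def margin_of_safety(noael_mg_kg_day, sed_mg_kg_day):
    """MoS = NOAEL / SED; values >= 100 are conventionally considered acceptable."""
    return noael_mg_kg_day / sed_mg_kg_day

name, value = select_best_indicator({"NOAEL": None, "LOAEL": 50.0, "LD50": 2000.0})
mos = margin_of_safety(100.0, 0.5)
```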
### Caching strategy

- ECHA results are cached in MongoDB (`toxinfo.substance_index` collection), keyed by `substance.rmlCas`
- The orchestrator checks the local cache before making external requests
- Search history is logged to PostgreSQL (`logs.search_history` table)
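The cache-first lookup can be illustrated with an in-memory stand-in for the `toxinfo.substance_index` collection (the dotted key mirrors the `substance.rmlCas` query a `pymongo` `find_one` would use; this is a sketch, not the project's actual code):

```python
# In-memory stand-in for the MongoDB collection described above.
substance_index = []

def find_one(query):
    """Mimic find_one({"substance.rmlCas": ...}) on the cached documents."""
    cas = query["substance.rmlCas"]
    for doc in substance_index:
        if doc.get("substance", {}).get("rmlCas") == cas:
            return doc
    return None

def cache_result(doc):
    substance_index.append(doc)

cache_result({"substance": {"rmlCas": "50-00-0"}, "toxicology": {"NOAEL": 15.0}})
hit = find_one({"substance.rmlCas": "50-00-0"})   # served from cache
miss = find_one({"substance.rmlCas": "64-17-5"})  # would trigger a fetch
```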
## API Endpoints

All routes are under `/api/v1`:

| Method | Path | Description |
|--------|------|-------------|
| POST | `/echa/search` | Single ECHA substance search by CAS |
| POST | `/echa/batch-search` | Batch ECHA search for multiple CAS numbers |
| POST | `/cosing/search` | COSING search (by name, CAS, EC, or ID) |
| POST | `/cosing/batch-search` | Batch COSING search |
| POST | `/common/pubchem` | PubChem property lookup by CAS |
| POST | `/common/generate-pdf` | Generate a PDF from a URL via Playwright |
| GET | `/common/download-pdf/{name}` | Download a generated PDF |
| POST | `/common/cir-search` | CIR ingredient text search |
| GET | `/health`, `/ping` | Health-check endpoints |

Docs are available at `/docs` (Swagger) and `/redoc`.
## Environment Variables

Configured via a `.env` file (loaded with `python-dotenv`):

- `ADMIN_USER` - MongoDB admin username
- `ADMIN_PASSWORD` - MongoDB admin password
- `MONGO_HOST` - MongoDB host
- `MONGO_PORT` - MongoDB port
- `DATABASE_URL` - PostgreSQL connection string
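A minimal sketch of assembling a connection URI from these variables (in the real project, `python-dotenv`'s `load_dotenv()` populates `os.environ` from `.env` first; the URI format and the demo defaults here are assumptions):

```python
import os

# Demo defaults so the sketch runs without a .env file (assumptions).
os.environ.setdefault("ADMIN_USER", "admin")
os.environ.setdefault("ADMIN_PASSWORD", "secret")
os.environ.setdefault("MONGO_HOST", "localhost")
os.environ.setdefault("MONGO_PORT", "27017")

mongo_uri = (
    f"mongodb://{os.environ['ADMIN_USER']}:{os.environ['ADMIN_PASSWORD']}"
    f"@{os.environ['MONGO_HOST']}:{os.environ['MONGO_PORT']}"
)
```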
## Development

### Setup

```bash
uv sync             # Install dependencies
playwright install  # Install browser binaries for PDF generation
```

### Running the API

```bash
uv run python -m pif_compiler.main
# or
uv run uvicorn pif_compiler.main:app --reload --host 0.0.0.0 --port 8000
```
### Key conventions

- Services in `services/` handle external API calls and data extraction
- Models in `classes/models.py` use Pydantic `@model_validator` and `@classmethod` builders for construction from raw API data
- The orchestrator pattern (see `srv_echa.py`) handles: validate input -> check local cache -> fetch from external source -> store locally -> return
- All modules use the shared logger from `common_log.get_logger()`
- API routes define Pydantic request/response models inline in each route file
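The orchestrator steps named above can be sketched as a small pipeline. All names here are hypothetical stand-ins for the real `srv_echa.py` code; the fetcher is injected so the sketch stays self-contained:

```python
# Sketch of validate -> check cache -> fetch -> store -> return.
_cache = {}

def _is_valid_cas(cas):
    # Loose shape check for a CAS number such as "50-00-0".
    parts = cas.split("-")
    return len(parts) == 3 and all(p.isdigit() for p in parts)

def orchestrate(cas, fetch):
    if not _is_valid_cas(cas):       # 1. validate input
        raise ValueError(f"invalid CAS: {cas}")
    if cas in _cache:                # 2. check local cache
        return _cache[cas]
    result = fetch(cas)              # 3. fetch from external source
    _cache[cas] = result             # 4. store locally
    return result                    # 5. return

data = orchestrate("50-00-0", lambda cas: {"cas": cas, "noael": 15.0})
```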
### Important domain concepts

- **CAS number**: Chemical Abstracts Service identifier (e.g., "50-00-0")
- **INCI**: International Nomenclature of Cosmetic Ingredients
- **NOAEL**: No Observed Adverse Effect Level (preferred toxicity indicator)
- **LOAEL**: Lowest Observed Adverse Effect Level
- **LD50**: Lethal Dose, 50%
- **DAP**: Dermal Absorption Percentage
- **SED**: Systemic Exposure Dosage
- **MoS**: Margin of Safety
- **PIF**: Product Information File (EU cosmetic regulation requirement)
@@ -1,57 +0,0 @@
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "INCI": {
      "type": "array",
      "items": [
        { "type": "string" },
        { "type": "string" }
      ]
    },
    "CAS": {
      "type": "array",
      "items": [
        {
          "type": "array",
          "items": [
            { "type": "string" },
            { "type": "string" }
          ]
        },
        {
          "type": "array",
          "items": [
            { "type": "string" }
          ]
        }
      ]
    },
    "percentage": {
      "type": "array",
      "items": [
        { "type": "number" },
        { "type": "number" }
      ]
    }
  },
  "required": ["INCI", "CAS", "percentage"]
}
@@ -1,28 +0,0 @@
# ECHA Scraping Log Readme

The log file is used during scraping to keep track of the substances that have been extracted.

**Columns:**
- **casNo**: the substance's CAS number.
- **substanceId**: the substance's identifier in the COSING database.
- **inciName**: the substance's INCI name.
- **scraping_AcuteToxicity**: status of scraping the *Acute Toxicity* page (LD50, LC50 values, etc.).
- **scraping_RepeatedDose**: status of scraping the *Repeated Dose* page (NOAEL, DNEL values, etc.).
- **timestamp**: when the record was written.

**Possible values for scraping_AcuteToxicity and scraping_RepeatedDose:**
1. **no_lead_dossiers**: no active or inactive lead dossiers exist for the substance.
2. **successful_scrape**: data successfully extracted from the page.
3. **no_data_found**: a lead dossier was found, but the page does not exist or contains no data.
4. **error**: various kinds of errors.

---

I spent 20-30 minutes manually confirming the *no_data_found* and *no_lead_dossiers* results: I spot-checked that no dossiers existed, or that the pages were in fact empty of data.

The first full scraping run contained a bug, which I later fixed, allowing another 700 substances to be extracted. I don't know whether similar bugs remain.

---

There are currently **68 rows in the log with errors.** I'm investigating, but in most cases these are errors caused by missing data on the pages.
In practice, many of them are simply *no_data_found* mistakenly marked as *error*.
Can't render this file because it is too large. (shown for 6 oversized files)
BIN data/output.pdf (binary file not shown)
@@ -1,38 +0,0 @@
{
  "general_info": {
    "exec_date": "2021-07-01",
    "company": "Company Name",
    "product_name": "Product Name",
    "type": "pif",
    "ph_form": "physical state",
    "CPNP": "CPNP number",
    "prod_company": {"Company Name": "Company Name", "Address": "Company Address", "Country": "Country"}
  },

  "formula_table": "df_json",
  "normal_user": ["italiano", "english"],

  "exposition": {
    "type": "type",
    "place_application": "place",
    "routes_exposure": "routes",
    "secondary_routes": "secondary routes",
    "nano_exposure": "nano exposure",
    "surface_area": "surface area",
    "frequency": "frequency",
    "est_daily_amount": "est daily amount",
    "rel_daily_amount": "rel daily amount",
    "retention": 1,
    "calculated_daily_exp": "calculated daily exp",
    "calculated_relative_daily_exp": "calculated relative daily exp",
    "consumer_weight": "consumer weight",
    "target_population": "target population"
  },

  "sed_formula_table": "df_json",
  "sed_table": "df_json",
  "toxicity_table": "df_json",
  "undesired_effects": "no effects",
  "description": "description",
  "warnings": "warnings"
}

File diff suppressed because one or more lines are too long
@@ -1,211 +0,0 @@
# Refactoring Summary

## Completed: Phase 1 & 2

### Phase 1: Critical Bug Fixes ✅

**Fixed Issues:**

1. **[base_classes.py](src/pif_compiler/classes/models.py)** (now renamed to `models.py`)
   - Fixed a missing closing parenthesis in the `StringConstraints` annotation (line 24)
   - File renamed to `models.py` for clarity

2. **[pif_class.py](src/pif_compiler/classes/pif_class.py)**
   - Removed an unnecessary `streamlit` import
   - Fixed a duplicate `NormalUser` import conflict
   - Fixed type annotations for optional fields (lines 33-36)
   - Removed unused imports

3. **[classes/__init__.py](src/pif_compiler/classes/__init__.py)**
   - Created proper module exports
   - Added a docstring
   - Listed all available models and enums

### Phase 2: Code Organization ✅

**New Structure:**

```
src/pif_compiler/
├── classes/                   # Data models
│   ├── __init__.py            # ✨ NEW: Proper exports
│   ├── models.py              # ✨ RENAMED from base_classes.py
│   ├── pif_class.py           # ✅ FIXED: Import conflicts
│   └── types_enum.py
│
├── services/                  # ✨ NEW: Business logic layer
│   ├── __init__.py            # Service exports
│   ├── echa_service.py        # ECHA API (merged from find.py)
│   ├── echa_parser.py         # HTML/Markdown/JSON parsing
│   ├── echa_extractor.py      # High-level extraction
│   ├── cosing_service.py      # COSING integration
│   ├── pubchem_service.py     # PubChem integration
│   └── database_service.py    # MongoDB operations
│
└── functions/                 # Utilities & legacy
    ├── _old/                  # 🗄️ Deprecated files (moved here)
    │   ├── echaFind.py        # → Merged into echa_service.py
    │   ├── find.py            # → Merged into echa_service.py
    │   ├── echaProcess.py     # → Split into echa_parser + echa_extractor
    │   ├── scraper_cosing.py  # → Copied to cosing_service.py
    │   ├── pubchem.py         # → Copied to pubchem_service.py
    │   └── mongo_functions.py # → Copied to database_service.py
    ├── html_to_pdf.py         # PDF generation utilities
    ├── pdf_extraction.py      # PDF processing utilities
    └── resources/             # Static resources (logos, templates)
```

---

## Key Improvements

### 1. **Separation of Concerns**
- **Models** (`classes/`): Pure data structures with Pydantic validation
- **Services** (`services/`): Business logic and external API calls
- **Functions** (`functions/`): Legacy code, to be gradually migrated

### 2. **ECHA Module Consolidation**
Previously scattered across 3 files:
- `echaFind.py` (246 lines) - Old search implementation
- `find.py` (513 lines) - Better search with type hints
- `echaProcess.py` (947 lines) - Massive monolith

Now organized into 3 focused modules:
- `echa_service.py` (~513 lines) - API integration (from `find.py`)
- `echa_parser.py` (~250 lines) - Data parsing/cleaning
- `echa_extractor.py` (~350 lines) - High-level extraction logic

### 3. **Better Logging**
- Changed from module-level `logging.basicConfig()` to proper logger instances
- Each service has its own logger: `logger = logging.getLogger(__name__)`
- Prevents logging-configuration conflicts
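The per-module logger convention described above looks roughly like this in each service (handler configuration is done once centrally; modules only ever call `getLogger(__name__)`):

```python
import logging

# Module-level logger; inherits handlers from the central configuration
# instead of calling logging.basicConfig() itself.
logger = logging.getLogger(__name__)

def process(cas):
    # Records flow to whatever handlers the root/central config attached.
    logger.debug("processing %s", cas)
```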
### 4. **Improved Imports**
Services can now be imported cleanly:
```python
# Old way
from src.func.echaFind import search_dossier
from src.func.echaProcess import echaExtract

# New way
from pif_compiler.services import search_dossier, echa_extract
```

---

## Migration Guide

### For Code Using Old Imports

**ECHA Functions:**
```python
# Before
from src.func.find import search_dossier
from src.func.echaProcess import echaExtract, echaPage_to_md, clean_json

# After
from pif_compiler.services import (
    search_dossier,
    echa_extract,
    echa_page_to_markdown,
    clean_json,
)
```

**Data Models:**
```python
# Before
from classes import Ingredient, PIF
from base_classes import ExpositionInfo

# After
from pif_compiler.classes import Ingredient, PIF, ExpositionInfo
```

**COSING/PubChem:**
```python
# Before
from functions.scraper_cosing import cosing_search
from functions.pubchem import pubchem_dap

# After (when ready)
from pif_compiler.services.cosing_service import cosing_search
from pif_compiler.services.pubchem_service import pubchem_dap
```

---

## Next Steps (Phase 3 - Not Done Yet)

### Configuration Management
- [ ] Create `config.py` for MongoDB credentials and API keys
- [ ] Use environment variables (`.env` file)
- [ ] Separate dev/prod configurations

### Testing
- [ ] Add a pytest setup
- [ ] Unit tests for models (Pydantic validation)
- [ ] Integration tests for services
- [ ] Mock external API calls

### Streamlit App
- [ ] Create an `app.py` entry point
- [ ] Organize UI components
- [ ] Connect to the services layer

### Database
- [ ] Document the MongoDB schema
- [ ] Add migration scripts
- [ ] Consider adding SQLAlchemy for the relational DB

### Documentation
- [ ] API documentation (docstrings → Sphinx)
- [ ] User guide for the PIF creation workflow
- [ ] Developer setup guide

---

## Files Changed

### Modified:
- `src/pif_compiler/classes/models.py` (renamed, fixed)
- `src/pif_compiler/classes/pif_class.py` (fixed imports/types)
- `src/pif_compiler/classes/__init__.py` (new exports)

### Created:
- `src/pif_compiler/services/__init__.py`
- `src/pif_compiler/services/echa_service.py`
- `src/pif_compiler/services/echa_parser.py`
- `src/pif_compiler/services/echa_extractor.py`
- `src/pif_compiler/services/cosing_service.py`
- `src/pif_compiler/services/pubchem_service.py`
- `src/pif_compiler/services/database_service.py`

### Moved to Archive:
- `src/pif_compiler/functions/_old/echaFind.py` (merged into `echa_service.py`)
- `src/pif_compiler/functions/_old/find.py` (merged into `echa_service.py`)
- `src/pif_compiler/functions/_old/echaProcess.py` (split into `echa_parser` + `echa_extractor`)
- `src/pif_compiler/functions/_old/scraper_cosing.py` (copied to `cosing_service.py`)
- `src/pif_compiler/functions/_old/pubchem.py` (copied to `pubchem_service.py`)
- `src/pif_compiler/functions/_old/mongo_functions.py` (copied to `database_service.py`)

### Kept (Active):
- `src/pif_compiler/functions/html_to_pdf.py` (PDF utilities)
- `src/pif_compiler/functions/pdf_extraction.py` (PDF utilities)
- `src/pif_compiler/functions/resources/` (static files)

---

## Benefits

✅ **Cleaner imports** - No more relative-path confusion
✅ **Better testing** - Services can be mocked easily
✅ **Easier debugging** - Smaller, focused modules
✅ **Type safety** - Proper type hints throughout
✅ **Maintainability** - Clear separation of concerns
✅ **Backward compatible** - Old code still works

---

**Date:** 2025-01-04
**Status:** Phase 1 & 2 Complete ✅
@@ -1,295 +0,0 @@
# ECHA Services Test Suite Summary

## Overview

Comprehensive test suites have been created for all three ECHA service modules, following a bottom-up approach from the lowest- to the highest-level dependencies.

## Test Files Created

### 1. test_echa_parser.py (Lowest Level)
**Location**: `tests/test_echa_parser.py`

**Coverage**: Tests for HTML/Markdown/JSON processing functions

**Test Classes**:
- `TestOpenEchaPage` - HTML page opening (remote & local)
- `TestEchaPageToMarkdown` - HTML to Markdown conversion
- `TestMarkdownToJsonRaw` - Markdown to JSON conversion (skipped if markdown_to_json is not installed)
- `TestNormalizeUnicodeCharacters` - Unicode normalization
- `TestCleanJson` - JSON cleaning and validation
- `TestIntegrationParser` - Full pipeline integration tests

**Total Tests**: 28 tests
- 20 tests for core parser functions
- 5 tests for markdown_to_json (conditional)
- 2 integration tests
- 1 test with a known Unicode encoding issue (needs a fix)

**Key Features**:
- Mocks external dependencies (requests, file I/O)
- Tests Unicode-handling edge cases
- Validates data-cleaning logic
- Tests comparison-operator conversions (>, <, >=, <=)

**Known Issues**:
- Unicode literal encoding in test strings (lines 372, 380) - use `chr()` instead of `\uXXXX`
- Missing `markdown_to_json` dependency (tests skip gracefully)

### 2. test_echa_service.py (Middle Level)
**Location**: `tests/test_echa_service.py`

**Coverage**: Tests for ECHA API interaction and search functions

**Test Classes**:
- `TestGetSubstanceByIdentifier` - Substance API search
- `TestGetDossierByRmlId` - Dossier retrieval with Active/Inactive fallback
- `TestExtractSectionLinks` - Section-link extraction with validation
- `TestParseSectionsFromHtml` - HTML parsing for multiple sections
- `TestGetSectionLinksFromIndex` - Remote index.html fetching
- `TestGetSectionLinksFromFile` - Local file parsing
- `TestSearchDossier` - Main search workflow
- `TestIntegrationEchaService` - Real API integration tests (marked @pytest.mark.integration)

**Total Tests**: 36 tests
- 30 unit tests with mocked APIs
- 3 integration tests (require internet; marked for manual execution)

**Key Features**:
- Comprehensive API mocking
- Tests the nested-section bug fix (parent vs. child section links)
- Tests URL encoding, error handling, fallback logic
- Tests local vs. remote workflows
- Integration tests against real formaldehyde data

**Testing Approach**:
- Unit tests run by default (fast, no external deps)
- Integration tests require the `-m integration` flag

### 3. test_echa_extractor.py (Highest Level)
**Location**: `tests/test_echa_extractor.py`

**Coverage**: Tests for high-level extraction orchestration

**Test Classes**:
- `TestSchemas` - Data schema validation
- `TestJsonToDataframe` - JSON to pandas DataFrame conversion
- `TestDfWrapper` - DataFrame metadata addition
- `TestEchaExtractLocal` - DuckDB cache querying
- `TestEchaExtract` - Main extraction workflow
- `TestIntegrationEchaExtractor` - Real data integration tests (marked @pytest.mark.integration)

**Total Tests**: 32 tests
- 28 unit tests with full mocking
- 4 integration tests (require internet)

**Key Features**:
- Tests both RepeatedDose and AcuteToxicity schemas
- Tests local cache (DuckDB) integration
- Tests key-information extraction
- Tests error handling at each pipeline stage
- Tests DataFrame vs. JSON output modes
- Validates metadata addition (substance, CAS, timestamps)

**Testing Strategy**:
- Mocks the entire pipeline: search → parse → convert → clean → wrap
- Tests local_search and local_only modes
- Tests graceful degradation (returns key_infos on main extraction failure)

## Test Architecture

```
test_echa_parser.py     (Data Transformation)
          ↓
test_echa_service.py    (API & Search)
          ↓
test_echa_extractor.py  (Orchestration)
```

### Dependency Flow
1. **Parser** (lowest) - No dependencies on other ECHA modules
2. **Service** (middle) - Depends on Parser for some functionality
3. **Extractor** (highest) - Depends on both Service and Parser

### Mock Strategy
- **Parser**: Mocks `requests`, file I/O, `os.makedirs`
- **Service**: Mocks `requests.get` for API calls, HTML content
- **Extractor**: Mocks the entire pipeline chain (`search_dossier`, `open_echa_page`, etc.)
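The mock strategy above boils down to replacing the network-touching call so the unit under test never leaves the process. Here is a self-contained sketch using only the standard library (the real suites patch `requests.get`; the class and method names here are hypothetical):

```python
from unittest.mock import patch

class Service:
    """Hypothetical service wrapper standing in for a real ECHA service."""

    def http_get_json(self, url):
        # In real code this would be requests.get(url).json().
        raise RuntimeError("network disabled in this sketch")

    def get_substance_name(self, cas):
        data = self.http_get_json(f"https://example.invalid/api?cas={cas}")
        return data["name"]

def test_get_substance_name():
    svc = Service()
    # Patch the network call so the test is fast and deterministic.
    with patch.object(svc, "http_get_json", return_value={"name": "formaldehyde"}):
        assert svc.get_substance_name("50-00-0") == "formaldehyde"

test_get_substance_name()
```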
## Running the Tests

### Run All Tests
```bash
uv run pytest tests/test_echa_*.py -v
```

### Run a Specific Module
```bash
uv run pytest tests/test_echa_parser.py -v
uv run pytest tests/test_echa_service.py -v
uv run pytest tests/test_echa_extractor.py -v
```

### Run Only Unit Tests (Fast)
```bash
uv run pytest tests/test_echa_*.py -v -m "not integration"
```

### Run Integration Tests (Requires Internet)
```bash
uv run pytest tests/test_echa_*.py -v -m integration
```

### Run With Coverage
```bash
uv run pytest tests/test_echa_*.py --cov=pif_compiler.services --cov-report=html
```

## Test Coverage Summary

### Functions Tested

#### echa_parser.py (5/5 = 100%)
- ✅ `open_echa_page()` - Remote & local file opening
- ✅ `echa_page_to_markdown()` - HTML to Markdown with route formatting
- ✅ `markdown_to_json_raw()` - Markdown parsing & JSON conversion
- ✅ `normalize_unicode_characters()` - Unicode normalization
- ✅ `clean_json()` - Recursive cleaning & validation

#### echa_service.py (8/8 = 100%)
- ✅ `search_dossier()` - Main entry point with local file support
- ✅ `get_substance_by_identifier()` - Substance API search
- ✅ `get_dossier_by_rml_id()` - Dossier retrieval with fallback
- ✅ `_query_dossier_api()` - Helper for API queries
- ✅ `get_section_links_from_index()` - Remote HTML fetching
- ✅ `get_section_links_from_file()` - Local HTML parsing
- ✅ `parse_sections_from_html()` - HTML content parsing
- ✅ `extract_section_links()` - Individual section extraction with validation

#### echa_extractor.py (4/4 = 100%)
- ✅ `echa_extract()` - Main extraction function
- ✅ `echa_extract_local()` - DuckDB cache queries
- ✅ `json_to_dataframe()` - JSON to DataFrame conversion
- ✅ `df_wrapper()` - Metadata addition

**Total Functions**: 17/17 tested (100%)

## Edge Cases Covered

### Parser
- Empty/malformed HTML
- Missing sections
- Unicode encoding issues (â€, superscripts)
- Comparison operators (>, <, >=, <=)
- Nested structures
- [Empty] value filtering
- "no information available" filtering
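A toy version of the filtering behaviour these parser tests exercise, dropping `[Empty]` and "no information available" values recursively (this is an illustrative sketch, not the project's `clean_json`):

```python
_PLACEHOLDERS = ("[empty]", "no information available")

def clean_json(value):
    """Recursively drop placeholder strings and then empty containers."""
    if isinstance(value, dict):
        cleaned = {k: clean_json(v) for k, v in value.items()}
        return {k: v for k, v in cleaned.items() if v not in (None, {}, [])}
    if isinstance(value, list):
        return [v for v in (clean_json(x) for x in value) if v not in (None, {}, [])]
    if isinstance(value, str) and value.strip().lower() in _PLACEHOLDERS:
        return None
    return value

result = clean_json({
    "a": "[Empty]",
    "b": {"c": "no information available"},
    "d": "NOAEL: 50 mg/kg",
})
```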
### Service
- Substance not found
- No dossiers (active or inactive)
- Nested sections (parent without a direct link)
- Input type mismatches
- Network errors
- Malformed API responses
- Local vs. remote file paths

### Extractor
- Substance not found
- Missing scraping-type pages
- Empty sections
- Empty cleaned JSON
- Local cache hits/misses
- Key-information extraction
- DataFrame filtering (null Effect levels)
- JSON vs. DataFrame output modes

## Dependencies Required

### Core Dependencies (Already in project)
- pytest
- pytest-mock
- pytest-cov
- beautifulsoup4
- pandas
- requests
- markdownify
- pydantic

### Optional Dependencies (Tests skip if missing)
- `markdown_to_json` - Required for markdown→JSON conversion tests
- `duckdb` - Required for local cache tests
- Internet connection - Required for integration tests

## Test Markers

### Custom Markers (defined in conftest.py)
- `@pytest.mark.unit` - Fast tests, no external dependencies
- `@pytest.mark.integration` - Tests requiring real APIs/internet
- `@pytest.mark.slow` - Long-running tests
- `@pytest.mark.database` - Tests requiring a database

### Usage in ECHA Tests
- Unit tests: Default (run without flags)
- Integration tests: Require `-m integration`
- Skipped tests: Auto-skip if dependencies are missing
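Custom markers like the ones above are typically registered in `conftest.py` via the `pytest_configure` hook so pytest does not warn about unknown markers (a sketch; the project's actual `conftest.py` may differ):

```python
# conftest.py (sketch): register the custom markers with pytest.
def pytest_configure(config):
    for marker in ("unit", "integration", "slow", "database"):
        config.addinivalue_line("markers", f"{marker}: custom project marker")
```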
## Known Issues & Improvements Needed

### 1. Unicode Test Encoding (test_echa_parser.py)
**Issue**: Lines 372 and 380 have truncated Unicode escape sequences
**Fix**: Replace `\u00c2\u00b²` with `chr(0xc2) + chr(0xb2)`
**Priority**: Medium
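The `chr()` form avoids putting fragile `\uXXXX` escapes in the source while building the same characters:

```python
# Same two characters, built without a \uXXXX escape in the literal.
s = chr(0xC2) + chr(0xB2)  # "Â" + "²"
```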
### 2. Missing markdown_to_json Dependency
**Issue**: Tests skip if it is not installed
**Fix**: Add to project dependencies or document it as optional
**Priority**: Low (tests skip gracefully)

### 3. Integration Test Data
**Issue**: Real API tests may fail if the ECHA page structure changes
**Fix**: Add recorded fixtures for deterministic testing
**Priority**: Low

### 4. DuckDB Integration
**Issue**: test_echa_extractor local cache tests mock DuckDB
**Fix**: Create an actual test database for integration testing
**Priority**: Low

## Test Statistics

| Module | Total Tests | Unit Tests | Integration Tests | Skipped (Conditional) |
|--------|-------------|------------|-------------------|-----------------------|
| echa_parser.py | 28 | 26 | 2 | 7 (markdown_to_json) |
| echa_service.py | 36 | 33 | 3 | 0 |
| echa_extractor.py | 32 | 28 | 4 | 0 |
| **TOTAL** | **96** | **87** | **9** | **7** |

## Next Steps

1. **Fix Unicode encoding** in test_echa_parser.py (lines 372, 380)
2. **Run the full test suite** to verify all unit tests pass
3. **Add markdown_to_json** to dependencies if needed
4. **Run integration tests** manually to verify real API behavior
5. **Generate a coverage report** to identify any untested code paths
6. **Document test patterns** for future service additions
7. **Add CI/CD integration** for automated testing

## Contributing

When adding new functions to ECHA services:

1. **Write tests first** (TDD approach)
2. **Follow existing patterns**:
   - One test class per function
   - Mock external dependencies
   - Test the happy path + edge cases
   - Add integration tests for real API behavior
3. **Use appropriate markers**: `@pytest.mark.integration` for slow tests
4. **Update this document** with new test coverage

## References

- Main documentation: [docs/echa_architecture.md](echa_architecture.md)
- Test patterns: [tests/test_cosing_service.py](../tests/test_cosing_service.py)
- pytest configuration: [pytest.ini](../pytest.ini)
- Test fixtures: [tests/conftest.py](../tests/conftest.py)
@ -1,767 +0,0 @@
|
|||
# Testing Guide - Theory and Best Practices

## Table of Contents

- [Introduction](#introduction)
- [Your Current Approach vs. Test-Driven Development](#your-current-approach-vs-test-driven-development)
- [The Testing Pyramid](#the-testing-pyramid)
- [Key Concepts](#key-concepts)
- [Real-World Testing Workflow](#real-world-testing-workflow)
- [Regression Testing](#regression-testing---the-killer-feature)
- [Code Coverage](#coverage---how-much-is-tested)
- [Best Practices](#best-practices-summary)
- [Practical Examples](#practical-example-your-workflow)
- [When Should You Write Tests](#when-should-you-write-tests)
- [Getting Started](#your-next-steps)

---
## Introduction

This guide explains the theory and best practices of software testing, specifically for the PIF Compiler project. It moves beyond ad-hoc testing scripts to a comprehensive, automated testing approach.

---

## Your Current Approach vs. Test-Driven Development

### What You Do Now (Ad-hoc Scripts):

```python
# test_script.py
from cosing_service import cosing_search

result = cosing_search("WATER", mode="name")
print(result)  # Look at output, check if it looks right
```

**Problems:**
- ❌ Manual checking (is the output correct?)
- ❌ Not repeatable (you forget what "correct" looks like)
- ❌ Doesn't catch regressions (future changes break old code)
- ❌ No documentation (what should the function do?)
- ❌ Tedious for many functions

---
## The Testing Pyramid

```
        /\
       /  \       E2E Tests (Few)
      /----\
     /      \     Integration Tests (Some)
    /--------\
   /          \   Unit Tests (Many)
  /____________\
```

### 1. **Unit Tests** (Bottom - Most Important)

Test individual functions in isolation.

**Example:**
```python
def test_parse_cas_numbers_single():
    """Test parsing a single CAS number."""
    result = parse_cas_numbers(["7732-18-5"])
    assert result == ["7732-18-5"]  # ← Automated check
```

**Benefits:**
- ✅ Fast (milliseconds)
- ✅ No external dependencies (no API, no database)
- ✅ Pinpoint exact problem
- ✅ Run hundreds in seconds

**When to use:**
- Testing individual functions
- Testing data parsing/validation
- Testing business logic calculations

---
### 2. **Integration Tests** (Middle)

Test multiple components working together.

**Example:**
```python
def test_full_cosing_workflow():
    """Test search + clean workflow."""
    raw = cosing_search("WATER", mode="name")
    clean = clean_cosing(raw)
    assert "cosingUrl" in clean
```

**Benefits:**
- ✅ Tests real interactions
- ✅ Catches integration bugs

**Drawbacks:**
- ⚠️ Slower (hits real APIs)
- ⚠️ Requires internet/database

**When to use:**
- Testing workflows across multiple services
- Testing API integrations
- Testing database interactions

---
### 3. **E2E Tests** (End-to-End - Top - Fewest)

Test entire application flow (UI → Backend → Database).

**Example:**
```python
def test_create_pif_from_ui():
    """User creates PIF through Streamlit UI."""
    # Click buttons, fill forms, verify PDF generated
```

**When to use:**
- Testing complete user workflows
- Smoke tests before deployment
- Critical business processes

---
## Key Concepts

### 1. **Assertions - Automated Verification**

**Old way (manual):**
```python
result = parse_cas_numbers(["7732-18-5/56-81-5"])
print(result)  # You look at: ['7732-18-5', '56-81-5']
# Is this right? Maybe? You forget in 2 weeks.
```

**Test way (automated):**
```python
def test_parse_multiple_cas():
    result = parse_cas_numbers(["7732-18-5/56-81-5"])
    assert result == ["7732-18-5", "56-81-5"]  # ← Computer checks!
    # If wrong, test FAILS immediately
```

**Common Assertions:**
```python
# Equality
assert result == expected

# Truthiness
assert result is not None
assert "key" in result

# Exceptions
with pytest.raises(ValueError):
    invalid_function()

# Approximate equality (for floats)
assert result == pytest.approx(3.14159, rel=1e-5)
```

---
### 2. **Mocking - Control External Dependencies**

**Problem:** Testing `cosing_search()` hits the real COSING API:
- ⚠️ Slow (network request)
- ⚠️ Unreliable (API might be down)
- ⚠️ Expensive (rate limits)
- ⚠️ Hard to test errors (how do you make the API return an error?)

**Solution: Mock it!**
```python
from unittest.mock import Mock, patch

@patch('cosing_service.req.post')  # Replace real HTTP request
def test_search_by_name(mock_post):
    # Control what the "API" returns
    mock_response = Mock()
    mock_response.json.return_value = {
        "results": [{"metadata": {"inciName": ["WATER"]}}]
    }
    mock_post.return_value = mock_response

    result = cosing_search("WATER", mode="name")

    assert result["inciName"] == ["WATER"]  # ← Test your logic, not the API
    mock_post.assert_called_once()  # Verify it was called
```

**Benefits:**
- ✅ Fast (no real network)
- ✅ Reliable (always works)
- ✅ Can test error cases (mock API failures)
- ✅ Isolate your code from external issues

**What to mock:**
- HTTP requests (`requests.get`, `requests.post`)
- Database calls (`db.find_one`, `db.insert`)
- File I/O (`open`, `read`, `write`)
- External APIs (COSING, ECHA, PubChem)
- Time-dependent functions (`datetime.now()`)

---
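The last bullet, time-dependent functions, can be mocked the same way. A minimal, self-contained sketch; the `Clock` seam and `log_entry` helper are hypothetical, not part of the project code:

```python
from datetime import datetime
from unittest.mock import patch


class Clock:
    """Hypothetical seam around the system clock (an assumption of this sketch)."""

    def now(self) -> datetime:
        return datetime.now()


clock = Clock()


def log_entry(message: str) -> str:
    """Hypothetical helper that timestamps a log line."""
    return f"{clock.now().isoformat()} {message}"


def test_log_entry_uses_frozen_time():
    # Freeze the clock so the assertion is deterministic
    fixed = datetime(2025, 1, 4, 12, 0, 0)
    with patch.object(clock, "now", return_value=fixed):
        assert log_entry("started") == "2025-01-04T12:00:00 started"
```

Wrapping the clock in a small object makes it easy to patch without touching the `datetime` module itself, which cannot be monkeypatched directly because it is a C type.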
### 3. **Fixtures - Reusable Test Data**

**Without fixtures (repetitive):**
```python
def test_clean_basic():
    data = {"inciName": ["WATER"], "casNo": ["7732-18-5"], ...}
    result = clean_cosing(data)
    assert ...

def test_clean_empty():
    data = {"inciName": ["WATER"], "casNo": ["7732-18-5"], ...}  # Copy-paste!
    result = clean_cosing(data)
    assert ...
```

**With fixtures (DRY - Don't Repeat Yourself):**
```python
# conftest.py
@pytest.fixture
def sample_cosing_response():
    """Reusable COSING response data."""
    return {
        "inciName": ["WATER"],
        "casNo": ["7732-18-5"],
        "substanceId": ["12345"]
    }

# test file
def test_clean_basic(sample_cosing_response):  # Auto-injected!
    result = clean_cosing(sample_cosing_response)
    assert result["inciName"] == "WATER"

def test_clean_empty(sample_cosing_response):  # Reuse same data!
    result = clean_cosing(sample_cosing_response)
    assert "cosingUrl" in result
```

**Benefits:**
- ✅ No code duplication
- ✅ Centralized test data
- ✅ Easy to update (change once, affects all tests)
- ✅ Auto-cleanup (fixtures can tear down resources)

**Common fixture patterns:**
```python
# Database fixture with cleanup
@pytest.fixture
def test_db():
    db = connect_to_test_db()
    yield db  # Test runs here
    db.drop_all()  # Cleanup after test

# Temporary file fixture
@pytest.fixture
def temp_file(tmp_path):
    file_path = tmp_path / "test.json"
    file_path.write_text('{"test": "data"}')
    return file_path  # Auto-cleaned by pytest
```

---
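When only the input/expected pair varies between tests, `pytest.mark.parametrize` plays a similar deduplication role to fixtures. A sketch; the `parse_cas_numbers` body below is a simplified stand-in for the project's real parser, written here only so the example is self-contained:

```python
import re

import pytest


def parse_cas_numbers(values: list) -> list:
    """Simplified stand-in for the project's parser (illustration only)."""
    if not values or not values[0]:
        return []
    # Drop parenthetical notes, then split on the common separators
    cleaned = re.sub(r"\([^)]*\)", "", values[0])
    return [p.strip() for p in re.split(r"[/;,]", cleaned) if p.strip()]


@pytest.mark.parametrize("raw, expected", [
    (["7732-18-5"], ["7732-18-5"]),
    (["7732-18-5/56-81-5"], ["7732-18-5", "56-81-5"]),
    (["7732-18-5 (hydrate)"], ["7732-18-5"]),
    ([""], []),
])
def test_parse_cas_numbers(raw, expected):
    assert parse_cas_numbers(raw) == expected
```

Each tuple becomes its own test case in the report, so a single failing separator shows up as one red line instead of hiding inside a multi-assert test.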
## Real-World Testing Workflow

### Scenario: You Add a New Feature

**Step 1: Write the test FIRST (TDD - Test-Driven Development):**
```python
def test_parse_cas_removes_parentheses():
    """CAS numbers with parentheses should be cleaned."""
    result = parse_cas_numbers(["7732-18-5 (hydrate)"])
    assert result == ["7732-18-5"]
```

**Step 2: Run test - it FAILS (expected!):**
```bash
$ uv run pytest tests/test_cosing_service.py::test_parse_cas_removes_parentheses

FAILED: AssertionError: assert ['7732-18-5 (hydrate)'] == ['7732-18-5']
```

**Step 3: Write code to make it pass:**
```python
def parse_cas_numbers(cas_string: list) -> list:
    cas_string = cas_string[0]
    cas_string = re.sub(r"\([^)]*\)", "", cas_string)  # ← Add this
    # ... rest of function
```

**Step 4: Run test again - it PASSES:**
```bash
$ uv run pytest tests/test_cosing_service.py::test_parse_cas_removes_parentheses

PASSED ✓
```

**Step 5: Refactor if needed - tests ensure you don't break anything!**

---
### TDD Cycle (Red-Green-Refactor)

```
1. RED: Write failing test
        ↓
2. GREEN: Write minimal code to pass
        ↓
3. REFACTOR: Improve code without breaking tests
        ↓
      Repeat
```

**Benefits:**
- ✅ Forces you to think about requirements first
- ✅ Prevents over-engineering
- ✅ Built-in documentation (tests show intended behavior)
- ✅ Confidence to refactor

---
## Regression Testing - The Killer Feature

**Scenario: You change code 6 months later:**

```python
# Original (working)
def parse_cas_numbers(cas_string: list) -> list:
    cas_string = cas_string[0]
    cas_string = re.sub(r"\([^)]*\)", "", cas_string)
    cas_parts = re.split(r"[/;,]", cas_string)  # Handles /, ;, ,
    return [cas.strip() for cas in cas_parts]

# You "improve" it
def parse_cas_numbers(cas_string: list) -> list:
    return cas_string[0].split("/")  # Simpler! But...
```

**Run tests:**
```bash
$ uv run pytest

FAILED: test_multiple_cas_with_semicolon
  Expected: ['7732-18-5', '56-81-5']
  Got:      ['7732-18-5;56-81-5']  # ← Oops, broke semicolon support!

FAILED: test_cas_with_parentheses
  Expected: ['7732-18-5']
  Got:      ['7732-18-5 (hydrate)']  # ← Broke parentheses removal!
```

**Without tests:**
- You deploy
- Users report bugs
- You're confused about what broke
- You spend hours debugging

**With tests:**
- Instant feedback
- Fix before deploying
- Save hours of debugging

---
## Coverage - How Much Is Tested?

### Running Coverage

```bash
uv run pytest --cov=src/pif_compiler --cov-report=html
```

### Sample Output

```
Name                 Stmts   Miss  Cover
--------------------------------------------------
cosing_service.py       89      5    94%
echa_service.py        156     89    43%
models.py               45     45     0%
--------------------------------------------------
TOTAL                  290    139    52%
```

### Interpretation

- ✅ `cosing_service.py` - **94% covered** (great!)
- ⚠️ `echa_service.py` - **43% covered** (needs more tests)
- ❌ `models.py` - **0% covered** (no tests yet)

### Coverage Goals

| Coverage | Status | Action |
|----------|--------|--------|
| 90-100% | ✅ Excellent | Maintain |
| 70-90% | ⚠️ Good | Add edge cases |
| 50-70% | ⚠️ Acceptable | Prioritize critical paths |
| <50% | ❌ Poor | Add tests immediately |

**Target:** 80%+ for business-critical code

### HTML Coverage Report

```bash
uv run pytest --cov=src/pif_compiler --cov-report=html
# Open htmlcov/index.html in browser
```

Shows:
- Which lines are tested (green)
- Which lines are not tested (red)
- Which branches are not covered

---
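To avoid retyping the coverage flags on every run, they can be set once in the pytest configuration. A sketch, assuming `pytest-cov` is installed; merge it with the project's existing `pytest.ini` rather than replacing it:

```ini
; pytest.ini (hypothetical fragment)
[pytest]
addopts = --cov=src/pif_compiler --cov-report=term-missing
```

With this in place, a plain `uv run pytest` already prints per-file coverage with the missing line numbers. In CI, adding `--cov-fail-under=80` makes the run fail when coverage drops below the 80% target above.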
## Best Practices Summary

### ✅ DO:

1. **Write tests for all business logic**
   ```python
   # YES: Test calculations
   def test_sed_calculation():
       ingredient = Ingredient(quantity=10.0, dap=0.5)
       assert ingredient.calculate_sed() == 5.0
   ```

2. **Mock external dependencies**
   ```python
   # YES: Mock API calls
   @patch('cosing_service.req.post')
   def test_search(mock_post):
       mock_post.return_value.json.return_value = {...}
   ```

3. **Test edge cases**
   ```python
   # YES: Test edge cases
   def test_parse_empty_cas():
       assert parse_cas_numbers([""]) == []

   def test_parse_invalid_cas():
       with pytest.raises(ValueError):
           parse_cas_numbers(["abc-def-ghi"])
   ```

4. **Keep tests simple**
   ```python
   # YES: One test = one thing
   def test_cas_removes_whitespace():
       assert parse_cas_numbers([" 123-45-6 "]) == ["123-45-6"]

   # NO: Testing multiple things
   def test_cas_everything():
       assert parse_cas_numbers([" 123-45-6 "]) == ["123-45-6"]
       assert parse_cas_numbers(["123-45-6/789-01-2"]) == [...]
       # Too much in one test!
   ```

5. **Run tests before committing**
   ```bash
   git add .
   uv run pytest  # ← Always run first!
   git commit -m "Add feature X"
   ```

6. **Use descriptive test names**
   ```python
   # YES: Describes what it tests
   def test_parse_cas_removes_parenthetical_info():
       ...

   # NO: Vague
   def test_cas_1():
       ...
   ```

---
### ❌ DON'T:

1. **Don't test external libraries**
   ```python
   # NO: Testing if requests.post works
   def test_requests_library():
       response = requests.post("https://example.com")
       assert response.status_code == 200

   # YES: Test YOUR code that uses requests
   @patch('requests.post')
   def test_my_search_function(mock_post):
       ...
   ```

2. **Don't make tests dependent on each other**
   ```python
   # NO: test_b depends on test_a
   def test_a_creates_data():
       db.insert({"id": 1, "name": "test"})

   def test_b_uses_data():
       data = db.find_one({"id": 1})  # Breaks if test_a fails!

   # YES: Each test is independent
   def test_b_uses_data():
       db.insert({"id": 1, "name": "test"})  # Create own data
       data = db.find_one({"id": 1})
   ```

3. **Don't test implementation details**
   ```python
   # NO: Testing internal variable names
   def test_internal_state():
       obj = MyClass()
       assert obj._internal_var == "value"  # Breaks with refactoring

   # YES: Test public behavior
   def test_public_api():
       obj = MyClass()
       assert obj.get_value() == "value"
   ```

4. **Don't skip tests**
   ```python
   # NO: Commenting out failing tests
   # def test_broken_feature():
   #     assert broken_function() == "expected"

   # YES: Fix the test or mark as TODO
   @pytest.mark.skip(reason="Feature not implemented yet")
   def test_future_feature():
       ...
   ```

---
## Practical Example: Your Workflow

### Before (Manual Script)

```python
# test_water.py
from cosing_service import cosing_search, clean_cosing

result = cosing_search("WATER", "name")
print(result)  # ← You manually check

clean = clean_cosing(result)
print(clean)  # ← You manually check again

# Run 10 times with different inputs... tedious!
```

**Problems:**
- Manual verification
- Slow (type command, read output, verify)
- Error-prone (miss things)
- Not repeatable

---

### After (Automated Tests)

```python
# tests/test_cosing_service.py
def test_search_and_clean_water():
    """Water should be searchable and cleanable."""
    result = cosing_search("WATER", "name")
    assert result is not None
    assert "inciName" in result

    clean = clean_cosing(result)
    assert clean["inciName"] == "WATER"
    assert "cosingUrl" in clean

# Run ONCE: pytest
# It checks everything automatically!
```

**Run all 25 tests:**
```bash
$ uv run pytest

tests/test_cosing_service.py::TestParseCasNumbers::test_single_cas_number PASSED
tests/test_cosing_service.py::TestParseCasNumbers::test_multiple_cas_with_slash PASSED
...
======================== 25 passed in 0.5s ========================
```

**Benefits:**
- ✅ All pass? Safe to deploy!
- ❌ One fails? Fix before deploying!
- ⏱️ 25 tests in 0.5 seconds vs. 30 minutes of manual testing

---
## When Should You Write Tests?

### Always Test:

✅ **Business logic** (calculations, data processing)
```python
# YES
def test_calculate_sed():
    assert calculate_sed(quantity=10, dap=0.5) == 5.0
```

✅ **Data validation** (Pydantic models)
```python
# YES
def test_ingredient_validates_cas_format():
    with pytest.raises(ValidationError):
        Ingredient(cas="invalid", quantity=10.0)
```

✅ **API integrations** (with mocks)
```python
# YES
@patch('requests.post')
def test_cosing_search(mock_post):
    ...
```

✅ **Bug fixes** (write test first, then fix)
```python
# YES
def test_bug_123_empty_cas_crash():
    """Regression test for bug #123."""
    result = parse_cas_numbers([])  # Used to crash
    assert result == []
```

---

### Sometimes Test:

⚠️ **UI code** (harder to test, less critical)
```python
# Streamlit UI tests are complex, lower priority
```

⚠️ **Configuration** (usually simple)
```python
# Config loading is straightforward; test it only if it has complex logic
```

---

### Don't Test:

❌ **Third-party libraries** (they have their own tests)
```python
# NO: Testing if pandas works
def test_pandas_dataframe():
    df = pd.DataFrame({"a": [1, 2, 3]})
    assert len(df) == 3  # The pandas team already tested this!
```

❌ **Trivial code**
```python
# NO: Testing simple getters/setters
class MyClass:
    def get_name(self):
        return self.name  # Too simple to test
```

---
## Your Next Steps

### 1. Install Pytest

```bash
cd c:\Users\adish\Projects\pif_compiler
uv add --dev pytest pytest-cov pytest-mock
```

### 2. Run the COSING Tests

```bash
# Run all tests
uv run pytest

# Run with verbose output
uv run pytest -v

# Run specific test file
uv run pytest tests/test_cosing_service.py

# Run specific test
uv run pytest tests/test_cosing_service.py::TestParseCasNumbers::test_single_cas_number
```
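If the suite includes tests marked `@pytest.mark.integration` that hit real APIs, the marker should be registered so pytest does not warn about it, and it can be deselected for fast local runs. A sketch to merge into the project's actual `pytest.ini`:

```ini
; pytest.ini (hypothetical fragment)
[pytest]
markers =
    integration: tests that hit real external APIs (deselect with -m "not integration")
```

Then `uv run pytest -m "not integration"` runs only the fast, fully mocked unit tests, while a plain `uv run pytest` still runs everything.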
### 3. See Coverage

```bash
# Terminal report
uv run pytest --cov=src/pif_compiler/services/cosing_service

# HTML report (more detailed)
uv run pytest --cov=src/pif_compiler --cov-report=html
# Open htmlcov/index.html in browser
```

### 4. Start Writing Tests for New Code

Follow the TDD cycle:
1. **Red**: Write failing test
2. **Green**: Write minimal code to pass
3. **Refactor**: Improve code
4. Repeat!

---
## Additional Resources

### Pytest Documentation
- [Official Pytest Docs](https://docs.pytest.org/)
- [Pytest Fixtures](https://docs.pytest.org/en/stable/fixture.html)
- [Pytest Mocking](https://docs.pytest.org/en/stable/monkeypatch.html)

### Testing Philosophy
- [Test-Driven Development (TDD)](https://www.freecodecamp.org/news/test-driven-development-what-it-is-and-what-it-is-not-41fa6bca02a2/)
- [Testing Best Practices](https://testautomationuniversity.com/)
- [The Testing Pyramid](https://martinfowler.com/articles/practical-test-pyramid.html)

### PIF Compiler Specific
- [tests/README.md](../tests/README.md) - Test suite documentation
- [tests/RUN_TESTS.md](../tests/RUN_TESTS.md) - Quick start guide
- [REFACTORING.md](../REFACTORING.md) - Code organization changes

---

## Summary

**Testing transforms your development workflow:**

| Without Tests | With Tests |
|---------------|------------|
| Manual verification | Automated checks |
| Slow feedback | Instant feedback |
| Fear of breaking things | Confidence to refactor |
| Undocumented behavior | Tests as documentation |
| Debug for hours | Pinpoint issues immediately |

**Start small:**
1. Write tests for one service (✅ COSING done!)
2. Add tests for new features
3. Fix bugs with tests first
4. Gradually increase coverage

**The investment pays off:**
- Fewer bugs in production
- Faster development (less debugging)
- Better code design
- Easier collaboration
- Peace of mind 😌

---

*Last updated: 2025-01-04*
@@ -1,6 +0,0 @@
# User Journey

1) User logs in or signs up
   - For this feature we will use Streamlit's built-in authentication component, backed by a Supabase DB (work in progress)
2) Open a recent project or create a new one:
   - This is where we open an existing project file with all the specifics, or create a new one
BIN main.db
Binary file not shown.
BIN main.db.wal
Binary file not shown.
@@ -1,158 +0,0 @@
import marimo

__generated_with = "0.16.5"
app = marimo.App(width="medium")


@app.cell
def _():
    import marimo as mo
    import duckdb
    return duckdb, mo


@app.cell
def _(duckdb):
    con = duckdb.connect('main.db')
    return (con,)


@app.cell
def _(con, mo):
    _df = mo.sql(
        f"""
        --CREATE SEQUENCE seq_clienti START 1;
        --CREATE SEQUENCE seq_compilatori START 1;
        --CREATE SEQUENCE seq_tipi_prodotti START 1;
        --CREATE SEQUENCE seq_stati_ordini START 1;
        --CREATE SEQUENCE seq_ordini START 1;
        """,
        engine=con
    )
    return


@app.cell
def _(con, mo):
    _df = mo.sql(
        f"""
        CREATE OR REPLACE TABLE clienti (
            nome_cliente VARCHAR UNIQUE,
            id_cliente INTEGER PRIMARY KEY DEFAULT NEXTVAL('seq_clienti')
        )
        """,
        engine=con
    )
    return


@app.cell
def _(con, mo):
    _df = mo.sql(
        f"""
        CREATE OR REPLACE TABLE compilatori (
            nome_compilatore VARCHAR UNIQUE,
            id_compilatore INTEGER PRIMARY KEY DEFAULT NEXTVAL('seq_compilatori')
        )
        """,
        engine=con
    )
    return


@app.cell
def _(con, mo):
    _df = mo.sql(
        f"""
        CREATE OR REPLACE TABLE tipi_prodotti (
            nome_tipo VARCHAR UNIQUE,
            id_tipo INTEGER PRIMARY KEY DEFAULT NEXTVAL('seq_tipi_prodotti'),
            luogo_applicazione VARCHAR,
            espo_primaria VARCHAR,
            espo_secondaria VARCHAR,
            espo_nano VARCHAR,
            supericie_cm2 INT,
            frequenza INT,
            qty_daily_stimata INT,
            qty_daily_relativa INT,
            ritenzione FLOAT,
            espo_daily_calcolata INT,
            espo_daily_relativa_calcolata INT,
            peso INT,
            target VARCHAR
        )
        """,
        engine=con
    )
    return


@app.cell
def _(con, mo):
    _df = mo.sql(
        f"""
        CREATE OR REPLACE TABLE stati_ordini (
            id_stato INTEGER PRIMARY KEY DEFAULT NEXTVAL('seq_stati_ordini'),
            nome_stato VARCHAR
        )
        """,
        engine=con
    )
    return (stati_ordini,)


@app.cell
def _(con, mo):
    _df = mo.sql(
        f"""
        CREATE OR REPLACE TABLE ordini (
            id_ordine INTEGER PRIMARY KEY DEFAULT NEXTVAL('seq_ordini'),
            id_cliente INTEGER,
            id_compilatore INTEGER,
            id_tipo_prodotto INTEGER,
            uuid_ordine VARCHAR NOT NULL,
            uuid_progetto VARCHAR,
            data_ordine DATETIME NOT NULL,
            stato_ordine INTEGER DEFAULT 0,
            note VARCHAR,
            FOREIGN KEY (id_cliente) REFERENCES clienti(id_cliente),
            FOREIGN KEY (id_compilatore) REFERENCES compilatori(id_compilatore),
            FOREIGN KEY (id_tipo_prodotto) REFERENCES tipi_prodotti(id_tipo),
            FOREIGN KEY (stato_ordine) REFERENCES stati_ordini(id_stato)
        )
        """,
        engine=con
    )
    return


@app.cell
def _(con, mo):
    _df = mo.sql(
        f"""

        """,
        engine=con
    )
    return


@app.cell
def _(con, mo, stati_ordini):
    _df = mo.sql(
        f"""
        -- multi-row insert: each state is its own parenthesised tuple
        INSERT INTO stati_ordini (nome_stato) VALUES
            ('Ordine registrato'),
            ('')
        """,
        engine=con
    )
    return


if __name__ == "__main__":
    app.run()
194 marimo/parsing_echa.py Normal file
@@ -0,0 +1,194 @@
import marimo

__generated_with = "0.16.5"
app = marimo.App(width="medium")


@app.cell
def _():
    import marimo as mo
    return


@app.cell
def _():
    from pif_compiler.services.srv_echa import orchestrator
    return (orchestrator,)


@app.cell
def _(orchestrator):
    result = orchestrator("57-55-6")
    return (result,)


@app.cell
def _(result):
    test = result['repeated_dose_toxicity']
    return (test,)


@app.cell
def _(result):
    acute = result['acute_toxicity']
    acute
    return (acute,)


@app.cell
def _():
    from pif_compiler.services.srv_echa import extract_levels, at_extractor, rdt_extractor
    return at_extractor, extract_levels, rdt_extractor


@app.cell
def _(acute, at_extractor, extract_levels):
    at = extract_levels(acute, at_extractor)
    return (at,)


@app.cell
def _(at):
    at
    return


@app.cell
def _(extract_levels, rdt_extractor, test):
    rdt = extract_levels(test, rdt_extractor)
    return (rdt,)


@app.cell
def _(rdt):
    rdt
    return


@app.cell
def _():
    from pydantic import BaseModel
    from typing import Optional
    return BaseModel, Optional


@app.cell
def _(BaseModel, Optional):
    class ToxIndicator(BaseModel):
        indicator: str
        value: int
        unit: str
        route: str
        toxicity_type: Optional[str] = None
        ref: Optional[str] = None

        @property
        def priority_rank(self):
            """Returns the numerical priority based on the toxicological indicator."""
            mapping = {
                'LD50': 1,
                'DL50': 1,
                'NOAEC': 2,
                'LOAEL': 3,
                'NOAEL': 4
            }
            return mapping.get(self.indicator, 99)

        @property
        def factor(self):
            """Returns the factor based on the toxicity type."""
            if self.priority_rank == 1:
                return 10
            elif self.priority_rank == 3:
                return 3
            return 1
    return (ToxIndicator,)


@app.cell
def _(ToxIndicator, rdt):
    lista = []
    for i in rdt:
        tesss = rdt.get(i)
        t = ToxIndicator(**tesss)
        lista.append(t)

    lista
    return


@app.cell
def _():
    from pydantic import model_validator
    return (model_validator,)


@app.cell
def _(
    BaseModel,
    Optional,
    ToxIndicator,
    at_extractor,
    extract_levels,
    model_validator,
    rdt_extractor,
):
    class Toxicity(BaseModel):
        cas: str
        indicators: list[ToxIndicator]
        best_case: Optional[ToxIndicator] = None
        factor: Optional[int] = None

        @model_validator(mode='after')
        def set_best_case(self) -> 'Toxicity':
            if self.indicators:
                self.best_case = max(self.indicators, key=lambda x: x.priority_rank)
                self.factor = self.best_case.factor
            return self

        @classmethod
        def from_result(cls, cas: str, result):
            toxicity_types = ['repeated_dose_toxicity', 'acute_toxicity']
            indicators_list = []

            for tt in toxicity_types:
                if tt not in result:
                    continue

                try:
                    extractor = at_extractor if tt == 'acute_toxicity' else rdt_extractor
                    fetch = extract_levels(result[tt], extractor=extractor)

                    link = result.get(f"{tt}_link", "")

                    for key, lvl in fetch.items():
                        lvl['ref'] = link
                        elem = ToxIndicator(**lvl)
                        indicators_list.append(elem)

                except Exception as e:
                    print(f"Errore durante l'estrazione di {tt}: {e}")
                    continue

            return cls(
                cas=cas,
                indicators=indicators_list
            )
    return (Toxicity,)


@app.cell
def _(Toxicity, result):
    tox = Toxicity.from_result("57-55-6", result)
    tox
    return


@app.cell
def _(result):
    result
    return


if __name__ == "__main__":
    app.run()
49 marimo/test_obj.py Normal file
@@ -0,0 +1,49 @@
import marimo

__generated_with = "0.16.5"
app = marimo.App(width="medium")


@app.cell
def _():
    from pif_compiler.classes.models import Esposition
    return (Esposition,)


@app.cell
def _(Esposition):
    it = Esposition(
        preset_name="Test xzzx<xdsadsa<",
        tipo_prodotto="Test Product",
        luogo_applicazione="Face",
        esp_normali=["DERMAL", "ASD"],
        esp_secondarie=["ORAL"],
        esp_nano=["NA"],
        sup_esposta=500,
        freq_applicazione=2,
        qta_giornaliera=1.5,
        ritenzione=0.1
    )
    return (it,)


@app.cell
def _(it):
    it.save_to_postgres()
    return


@app.cell
def _(Esposition):
    data = Esposition.get_presets()
    return (data,)


@app.cell
def _(data):
    data
    return


if __name__ == "__main__":
    app.run()
412 marimo/worflow.py Normal file
@@ -0,0 +1,412 @@
import marimo

__generated_with = "0.16.5"
app = marimo.App(width="medium")


@app.cell
def _():
    import marimo as mo
    return (mo,)


@app.cell
def _():
    from pif_compiler.functions.db_utils import db_connect
    return (db_connect,)


@app.cell
def _(db_connect):
    col = db_connect(collection_name="orders")
    return (col,)


@app.cell
def _(col):
    order_doc = col.find_one({"client_name": "CSM Srl"})
    return (order_doc,)


@app.cell
def _(order_doc):
    order_doc
    return


@app.cell
def _():
    import json
    from pydantic import BaseModel, Field, field_validator, ConfigDict, model_validator
    from pymongo import MongoClient
    import re
    from typing import List, Optional
    return BaseModel, ConfigDict, Field, List, Optional, model_validator, re
app._unparsable_cell(
    r"""
    class CosmeticIngredient(BaseModel):
        inci_name: str
        cas: str = Field(..., pattern=r'^\d{2,7}-\d{2}-\d$')
        colorant: bool = Field(default=False)
        organic: bool = Field(default=False)

        dap: dict | None = Field(default=None)
        cosing: dict | None = Field(default=None)
        tox_levels: dict | None = Field(default=None)

        @field_validator('inci_name')
        @classmethod
        def make_uppercase(cls, v: str) -> str:
            return v.upper()
    """,
    name="_"
)
@app.cell
def _(CosmeticIngredient, collection, mo):
    mo.stop(True)
    try:
        ingredient = CosmeticIngredient(
            inci_name="Glycerin",
            cas="56-81-5",
            percentage=5.5
        )
        print(f"✅ Object Created: {ingredient}")
    except ValueError as e:
        print(f"❌ Validation Error: {e}")

    document_to_insert = ingredient.model_dump()

    result = collection.insert_one(document_to_insert)
    return
@app.cell
def _(
    BaseModel,
    ConfigDict,
    CosingInfo,
    Field,
    List,
    Optional,
    mo,
    model_validator,
    re,
):
    mo.stop(True)


    class DapInfo(BaseModel):
        """Dossier information (e.g. origin, purity)"""
        origin: Optional[str] = None  # e.g. "Synthetic", "Vegetable"
        purity_percentage: Optional[float] = None
        supplier_code: Optional[str] = None

    class ToxInfo(BaseModel):
        """Toxicological data"""
        noael: Optional[float] = None  # No Observed Adverse Effect Level
        ld50: Optional[float] = None  # Lethal Dose 50
        sed: Optional[float] = None  # Systemic Exposure Dosage
        mos: Optional[float] = None  # Margin of Safety

    # --- 2. Main model ---

    class CosmeticIngredient(BaseModel):
        model_config = ConfigDict(validate_assignment=True)  # Also validate when fields are reassigned

        # Handle multiple INCI names for the same CAS
        inci_names: List[str] = Field(default_factory=list)

        # The CAS is a required string, but the regex depends on the context
        cas: str

        colorant: bool = Field(default=False)
        organic: bool = Field(default=False)

        # Optional sub-objects
        dap: Optional[DapInfo] = None
        cosing: Optional[CosingInfo] = None
        tox_levels: Optional[ToxInfo] = None

        # --- CONDITIONAL CAS VALIDATION ---
        @model_validator(mode='after')
        def validate_cas_logic(self):
            cas_value = self.cas
            is_exempt = self.colorant or self.organic

            if not cas_value or not cas_value.strip():
                raise ValueError("The CAS field cannot be empty.")

            if not is_exempt:
                cas_regex = r'^\d{2,7}-\d{2}-\d$'
                if not re.match(cas_regex, cas_value):
                    raise ValueError(f"Invalid CAS format ('{cas_value}') for a standard ingredient.")

            # If it is a colorant/organic ingredient, any string is accepted (e.g. 'CI 77891' or internal codes)
            return self

        # --- HELPER METHOD TO ADD AN INCI NAME ---
        def add_inci(self, new_inci: str):
            """Add an INCI name to the list only if it is not already present (case insensitive)."""
            new_inci_upper = new_inci.upper()
            # Check whether it already exists (normalizing to uppercase to be safe)
            if not any(existing.upper() == new_inci_upper for existing in self.inci_names):
                self.inci_names.append(new_inci_upper)
                print(f"✅ INCI '{new_inci_upper}' added.")
            else:
                print(f"ℹ️ INCI '{new_inci_upper}' already present.")
    return CosmeticIngredient, DapInfo
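The regex above only checks the CAS *format*. A CAS number also carries a check digit (the weighted sum of the preceding digits mod 10); as an illustrative standalone sketch (the helper name is mine, not part of the project code):

```python
import re

def cas_checksum_ok(cas: str) -> bool:
    """Validate a CAS number's format and check digit."""
    if not re.match(r'^\d{2,7}-\d{2}-\d$', cas):
        return False
    digits = cas.replace('-', '')
    body, check = digits[:-1], int(digits[-1])
    # Weight digits 1, 2, 3, ... starting from the rightmost digit of the body
    total = sum(int(d) * w for d, w in zip(reversed(body), range(1, len(body) + 1)))
    return total % 10 == check

print(cas_checksum_ok("56-81-5"))   # glycerin -> True
print(cas_checksum_ok("56-81-4"))   # wrong check digit -> False
print(cas_checksum_ok("CI 77891"))  # not CAS-shaped (exempt colorant code) -> False
```

Such a check could complement the conditional validator, since a format-valid but mistyped CAS would otherwise pass silently.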


@app.cell
def _():
    from pif_compiler.services.srv_pubchem import pubchem_dap

    dato = pubchem_dap("56-81-5")
    dato
    return (dato,)


app._unparsable_cell(
    r"""
    molecular_weight = dato.get(\"MolecularWeight\")
    log_pow = dato.get(\"XLogP\")
    topological_polar_surface_area = dato.get(\"TPSA\")
    melting_point = dato.get(\"MeltingPoint\")
    ionization = dato.get(\"Dissociation Constants\")
    """,
    name="_"
)
@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
        Molecular weight > 500 Da
        High degree of ionisation
        Log Pow ≤ -1 or ≥ 4
        Topological polar surface area > 120 Å²
        Melting point > 200 °C
        """
    )
    return
@app.cell
def _(BaseModel, Field, Optional, model_validator):
    class DapInfo(BaseModel):
        cas: str

        molecular_weight: Optional[float] = Field(default=None, description="In Daltons (Da)")
        high_ionization: Optional[float] = Field(default=None, description="High degree of ionization")
        log_pow: Optional[float] = Field(default=None, description="Partition coefficient")
        tpsa: Optional[float] = Field(default=None, description="Topological polar surface area")
        melting_point: Optional[float] = Field(default=None, description="In Celsius (°C)")

        # --- The computed DAP value ---
        # Defaults to 0.5 (50%); it is overwritten by the validator below
        dap_value: float = 0.5

        @model_validator(mode='after')
        def compute_dap(self):
            # List of conditions (True if the condition reduces absorption)
            conditions = []

            # 1. MW > 500 Da
            if self.molecular_weight is not None:
                conditions.append(self.molecular_weight > 500)

            # 2. High ionization (if True, it reduces absorption)
            if self.high_ionization is not None:
                conditions.append(self.high_ionization is True)

            # 3. Log Pow <= -1 OR >= 4
            if self.log_pow is not None:
                conditions.append(self.log_pow <= -1 or self.log_pow >= 4)

            # 4. TPSA > 120 Å²
            if self.tpsa is not None:
                conditions.append(self.tpsa > 120)

            # 5. Melting point > 200 °C
            if self.melting_point is not None:
                conditions.append(self.melting_point > 200)

            # FINAL LOGIC:
            # if at least one "low absorption" condition is True, the DAP is 0.1
            if any(conditions):
                self.dap_value = 0.1
            else:
                self.dap_value = 0.5

            return self
    return (DapInfo,)
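The compute_dap validator reduces to a small decision rule; as a minimal standalone sketch (the function name and sample values are mine, not project code; the ionization criterion is omitted for brevity):

```python
# If any "low dermal absorption" criterion is met, DAP = 0.1 (10%),
# otherwise the conservative default 0.5 (50%).
def dap_value(mw=None, log_pow=None, tpsa=None, melting_point=None):
    conditions = []
    if mw is not None:
        conditions.append(mw > 500)                       # MW > 500 Da
    if log_pow is not None:
        conditions.append(log_pow <= -1 or log_pow >= 4)  # Log Pow <= -1 or >= 4
    if tpsa is not None:
        conditions.append(tpsa > 120)                     # TPSA > 120 Å²
    if melting_point is not None:
        conditions.append(melting_point > 200)            # melting point > 200 °C
    return 0.1 if any(conditions) else 0.5

print(dap_value(mw=92.09, log_pow=-1.76))  # glycerin-like values -> 0.1
print(dap_value(mw=300, log_pow=1.0))      # no criterion met -> 0.5
```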


@app.cell
def _(DapInfo, dato, re):
    desired_keys = ['CAS', 'MolecularWeight', 'XLogP', 'TPSA', 'Melting Point', 'Dissociation Constants']
    actual_keys = [key for key in dato.keys() if key in desired_keys]

    dap_dict = {}

    for key in actual_keys:
        if key == 'CAS':
            dap_dict['cas'] = dato[key]
        if key == 'MolecularWeight':
            mw = float(dato[key])
            dap_dict['molecular_weight'] = mw
        if key == 'XLogP':
            log_pow = float(dato[key])
            dap_dict['log_pow'] = log_pow
        if key == 'TPSA':
            tpsa = float(dato[key])
            dap_dict['tpsa'] = tpsa
        if key == 'Melting Point':
            try:
                for item in dato[key]:
                    if '°C' in item['Value']:
                        mp = item['Value']
                        mp_value = re.findall(r"[-+]?\d*\.\d+|\d+", mp)
                        if mp_value:
                            dap_dict['melting_point'] = float(mp_value[0])
            except Exception:
                continue
        if key == 'Dissociation Constants':
            try:
                for item in dato[key]:
                    if 'pKa' in item['Value']:
                        pk = item['Value']
                        pk_value = re.findall(r"[-+]?\d*\.\d+|\d+", pk)
                        if pk_value:
                            dap_dict['high_ionization'] = float(pk_value[0])
            except Exception:
                continue

    dap_info = DapInfo(**dap_dict)
    dap_info
    return


@app.cell
def _():
    from pif_compiler.services.srv_cosing import cosing_search, parse_cas_numbers, clean_cosing, identified_ingredients
    return clean_cosing, cosing_search


@app.cell
def _(clean_cosing, cosing_search):
    raw_cosing = cosing_search("72-48-0", 'cas')
    cleaned_cosing = clean_cosing(raw_cosing)
    cleaned_cosing
    return cleaned_cosing, raw_cosing


@app.cell
def _(mo):
    mo.md(
        r"""
        otherRestrictions
        refNo
        annexNo
        casNo
        functionName
        """
    )
    return


@app.cell
def _(BaseModel, Field, List, Optional):
    class CosingInfo(BaseModel):
        cas: List[str] = Field(default_factory=list)
        common_names: List[str] = Field(default_factory=list)
        inci: List[str] = Field(default_factory=list)
        annex: List[str] = Field(default_factory=list)
        functionName: List[str] = Field(default_factory=list)
        otherRestrictions: List[str] = Field(default_factory=list)
        cosmeticRestriction: Optional[str] = None
    return (CosingInfo,)


@app.cell
def _(CosingInfo):
    def cosing_builder(cleaned_cosing):
        cosing_keys = ['nameOfCommonIngredientsGlossary', 'casNo', 'functionName', 'annexNo', 'refNo', 'otherRestrictions', 'cosmeticRestriction', 'inciName']
        keys = [k for k in cleaned_cosing.keys() if k in cosing_keys]

        cosing_dict = {}

        for k in keys:
            if k == 'nameOfCommonIngredientsGlossary':
                names = []
                for name in cleaned_cosing[k]:
                    names.append(name)
                cosing_dict['common_names'] = names
            if k == 'inciName':
                inci = []
                for inc in cleaned_cosing[k]:
                    inci.append(inc)
                cosing_dict['inci'] = inci
            if k == 'casNo':
                cas_list = []
                for casNo in cleaned_cosing[k]:
                    cas_list.append(casNo)
                cosing_dict['cas'] = cas_list
            if k == 'functionName':
                functions = []
                for func in cleaned_cosing[k]:
                    functions.append(func)
                cosing_dict['functionName'] = functions
            if k == 'annexNo':
                annexes = []
                i = 0
                for ann in cleaned_cosing[k]:
                    restriction = ann + ' / ' + cleaned_cosing['refNo'][i]
                    annexes.append(restriction)
                    i += 1
                cosing_dict['annex'] = annexes
            if k == 'otherRestrictions':
                other_restrictions = []
                for ores in cleaned_cosing[k]:
                    other_restrictions.append(ores)
                cosing_dict['otherRestrictions'] = other_restrictions
            if k == 'cosmeticRestriction':
                cosing_dict['cosmeticRestriction'] = cleaned_cosing[k]

        test_cosing = CosingInfo(
            **cosing_dict
        )
        return test_cosing
    return (cosing_builder,)


@app.cell
def _(cleaned_cosing, cosing_builder):
    identified = cleaned_cosing['identifiedIngredient']
    if identified:
        for e in identified:
            obj = cosing_builder(e)
    obj
    return


@app.cell
def _(raw_cosing):
    raw_cosing
    return


@app.cell
def _():
    return


if __name__ == "__main__":
    app.run()
@@ -1,245 +0,0 @@
import requests
import urllib.parse
import re as standardre
import logging
import json
from bs4 import BeautifulSoup
from datetime import datetime


# Logging settings
logging.basicConfig(
    format="{asctime} - {levelname} - {message}",
    style="{",
    datefmt="%Y-%m-%d %H:%M",
    filename="echa.log",
    encoding="utf-8",
    filemode="a",
    level=logging.INFO,
)


# Unused function
def getCas(substance):
    results = {}
    req_0 = requests.get(
        "https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText="
        + urllib.parse.quote(substance)
    )
    req_0_json = req_0.json()
    try:
        rmlId = req_0_json["items"][0]["substanceIndex"]["rmlId"]
        rmlName = req_0_json["items"][0]["substanceIndex"]["rmlName"]
        rmlCas = req_0_json["items"][0]["substanceIndex"]["rmlCas"]

        results["rmlId"] = rmlId
        results["rmlName"] = rmlName
        results["rmlCas"] = rmlCas
    except Exception:
        return False
    return results

# Search for a dossier given a CAS number, a substance name or an EC number as input
def search_dossier(substance, input_type='rmlCas'):
    results = {}
    # The dictionary returned at the end

    # Part 1: get rmlId and rmlName
    # st.code('https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText='+substance)
    req_0 = requests.get(
        "https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText="
        + urllib.parse.quote(substance)
    )

    logging.info(f'echaFind.search_dossier(). searching "{substance}"')

    # The first step is a search with the substance name encoded via urllib
    req_0_json = req_0.json()
    try:
        # Extract the fields we need from the response
        rmlId = req_0_json["items"][0]["substanceIndex"]["rmlId"]
        rmlName = req_0_json["items"][0]["substanceIndex"]["rmlName"]
        rmlCas = req_0_json["items"][0]["substanceIndex"]["rmlCas"]
        rmlEc = req_0_json["items"][0]["substanceIndex"]["rmlEc"]

        results['search_response'] = f"https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText={urllib.parse.quote(substance)}"
        results["rmlId"] = rmlId
        results["rmlName"] = rmlName
        results["rmlCas"] = rmlCas
        results["rmlEc"] = rmlEc

        logging.info(
            f"echaFind.search_dossier(). found substance on ECHA. rmlId: '{rmlId}', rmlName: '{rmlName}', rmlCas: '{rmlCas}'"
        )
    except Exception:
        logging.info(
            f"echaFind.search_dossier(). could not find substance for '{substance}'"
        )
        return False

    # Update: in some cases a CAS search could return a substance whose EC number equals the input CAS.
    # We now check that the substance found actually has a CAS equal to the one given in input.
    # It is also possible to search by rmlName (substance name) or EC number (rmlEc): just specify in input_type what you are searching for
    if results[input_type] != substance:
        logging.error(f'echa.echaFind.search_dossier(): results[{input_type}] "{results[input_type]}" is not equal to "{substance}".')
        return f'search_error. results[{input_type}] ("{results[input_type]}") is not equal to "{substance}". Maybe you specified the wrong input_type. Check the results here: https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText={urllib.parse.quote(substance)}'

    # Part 2: search the ECHA dossier list, building a link with the rmlId obtained above.
    req_1_url = (
        "https://chem.echa.europa.eu/api-dossier-list/v1/dossier?pageIndex=1&pageSize=100&rmlId="
        + rmlId
        + "&registrationStatuses=Active"
    )  # Search the active dossiers first.

    req_1 = requests.get(req_1_url)
    req_1_json = req_1.json()

    # If there are no active dossiers, search the inactive ones
    if req_1_json["items"] == []:
        logging.info(
            f"echaFind.search_dossier(). could not find active dossier for '{substance}'. Proceeding to search in the unactive ones."
        )
        req_1_url = (
            "https://chem.echa.europa.eu/api-dossier-list/v1/dossier?pageIndex=1&pageSize=100&rmlId="
            + rmlId
            + "&registrationStatuses=Inactive"
        )
        req_1 = requests.get(req_1_url)
        req_1_json = req_1.json()
        if req_1_json["items"] == []:
            logging.info(
                f"echaFind.search_dossier(). could not find unactive dossiers for '{substance}'"
            )  # Found neither inactive nor active dossiers
            return False
        else:
            logging.info(
                f"echaFind.search_dossier(). found unactive dossiers for '{rmlName}'"
            )
            results["dossierType"] = "Inactive"

    else:
        logging.info(
            f"echaFind.search_dossier(). found active dossiers for '{substance}'"
        )
        results["dossierType"] = "Active"

    # These are the two values we need
    assetExternalId = req_1_json["items"][0]["assetExternalId"]

    # UPDATE: to get the last-modified date
    try:
        lastUpdateDate = req_1_json["items"][0]["lastUpdatedDate"]
        datetime_object = datetime.fromisoformat(lastUpdateDate.replace('Z', '+00:00'))  # Handle 'Z' if present, else it might break on older python versions
        lastUpdateDate = datetime_object.date().isoformat()
        results['lastUpdateDate'] = lastUpdateDate
    except Exception:
        logging.error("echa.echaFind(). Could not find lastUpdateDate for the dossier")

    rootKey = req_1_json["items"][0]["rootKey"]

    # Part 3: use the assetExternalId.
    # "With the assetExternalId we can reach the dossier's main page."
    # "From that page we must scrape the ID of the toxicological summary, IF IT EXISTS"
    results["index"] = (
        "https://chem.echa.europa.eu/html-pages/" + assetExternalId + "/index.html"
    )
    results["index_js"] = (
        f"https://chem.echa.europa.eu/{rmlId}/dossier-view/{assetExternalId}"
    )

    req_2 = requests.get(
        "https://chem.echa.europa.eu/html-pages/" + assetExternalId + "/index.html"
    )
    index = BeautifulSoup(req_2.text, "html.parser")
    index.prettify()

    # Part 4: get the toxicological summary ID from index.html
    # "Only one div in all that HTML matters. BeautifulSoup struggles when divs are too deeply nested, so we combine it with a regex"

    div = index.find_all("div", id=["id_7_Toxicologicalinformation"])
    str_div = str(div)
    str_div = str_div.split("</div>")[0]

    uic_found = False
    # A regex to find the href we need
    if standardre.search('href="([^"]+)"', str_div) is None:
        logging.info(
            "echaFind.search_dossier(). Could not find 'id_7_Toxicologicalinformation' in the body"
        )
    else:
        UIC = standardre.search('href="([^"]+)"', str_div).group(1)
        uic_found = True

    # Acute toxicity
    acute_toxicity_found = False
    div_acute_toxicity = index.find_all("div", id=["id_72_AcuteToxicity"])
    if div_acute_toxicity:
        for div in div_acute_toxicity:
            try:
                a = div.find_all("a", href=True)[0]
                acute_toxicity_id = standardre.search('href="([^"]+)"', str(a)).group(1)
                acute_toxicity_found = True
            except Exception:
                logging.info(
                    f"echaFind.search_dossier(). No acute_toxicity_id found from index for {substance}"
                )

    # Repeated dose
    repeated_dose_found = False
    div_repeated_dose = index.find_all("div", id=["id_75_Repeateddosetoxicity"])
    if div_repeated_dose:
        for div in div_repeated_dose:
            try:
                a = div.find_all("a", href=True)[0]
                repeated_dose_id = standardre.search('href="([^"]+)"', str(a)).group(1)
                repeated_dose_found = True
            except Exception:
                logging.info(
                    f"echaFind.search_dossier(). No repeated_dose_id found from index for {substance}"
                )

    # Part 5: build the toxicological dossier URLs and return the results

    if acute_toxicity_found:
        acute_toxicity_link = (
            "https://chem.echa.europa.eu/html-pages/"
            + assetExternalId
            + "/documents/"
            + acute_toxicity_id
            + ".html"
        )
        results["AcuteToxicity"] = acute_toxicity_link
        results["AcuteToxicity_js"] = (
            f"https://chem.echa.europa.eu/{rmlId}/dossier-view/{assetExternalId}/{acute_toxicity_id}"
        )

    if uic_found:
        # UIC is the ID of the tox summary
        final_url = (
            "https://chem.echa.europa.eu/html-pages/"
            + assetExternalId
            + "/documents/"
            + UIC
            + ".html"
        )
        results["ToxSummary"] = final_url
        results["ToxSummary_js"] = (
            f"https://chem.echa.europa.eu/{rmlId}/dossier-view/{assetExternalId}/{UIC}"
        )

    if repeated_dose_found:
        results["RepeatedDose"] = (
            "https://chem.echa.europa.eu/html-pages/"
            + assetExternalId
            + "/documents/"
            + repeated_dose_id
            + ".html"
        )
        results["RepeatedDose_js"] = (
            f"https://chem.echa.europa.eu/{rmlId}/dossier-view/{assetExternalId}/{repeated_dose_id}"
        )

    json_formatted_str = json.dumps(results)
    logging.info(f"echaFind.search_dossier() OK. output: {json_formatted_str}")
    return results

@@ -1,946 +0,0 @@
from src.func.echaFind import search_dossier
from bs4 import BeautifulSoup
from markdownify import MarkdownConverter
import pandas as pd
import requests
import os
import re
import markdown_to_json
import json
import copy
import unicodedata
from datetime import datetime
import logging
import duckdb

# Logging settings
logging.basicConfig(
    format="{asctime} - {levelname} - {message}",
    style="{",
    datefmt="%Y-%m-%d %H:%M",
    filename="echa.log",
    encoding="utf-8",
    filemode="a",
    level=logging.INFO,
)

try:
    # Load the full scraping data into memory, if it exists
    con = duckdb.connect()
    os.chdir(".")
    res = con.sql("""
        CREATE TABLE echa_full_scraping AS
        SELECT * FROM read_csv_auto('src/data/echa_full_scraping.csv');
    """)
    logging.info(
        f"echa.echaProcess().main: Loaded echa scraped data into duckdb memory. First CAS in the df is: {con.sql('select CAS from echa_full_scraping limit 1').fetchone()[0]}"
    )
    local_echa = True
except Exception:
    logging.error("echa.echaProcess().main: No local echa scraped data found")

# Method to fetch and parse a page from the ECHA site.
# Works both with a URL and with a local file path
def openEchaPage(link, local=False):
    try:
        if local:
            page = open(link, encoding="utf8")
            soup = BeautifulSoup(page, "html.parser")
        else:
            page = requests.get(link)
            page.encoding = "utf-8"
            soup = BeautifulSoup(page.text, "html.parser")
    except Exception:
        logging.error(
            f"echa.echaProcess.openEchaPage() error. could not open: '{link}'",
            exc_info=True,
        )
        return None
    return soup

# Turn an ECHA page into Markdown
def echaPage_to_md(sezione, scrapingType=None, local=False, substance=None):
    # sezione : the soup of the page extracted via search_dossier
    # scrapingType : 'RepeatedDose' or 'AcuteToxicity'
    # local : save the markdown content locally. Useful for debugging
    # substance : the substance name, used to build the save path

    # Create shorthand method for conversion
    def md(soup, **options):
        return MarkdownConverter(**options).convert_soup(soup)

    output = md(sezione)
    # Convert the HTML section into a markdown, which still needs fixing.

    # The way the .md is adjusted changes a bit depending on the kind of page being scraped;
    # exceptions are added as new substances are tested
    if scrapingType == "RepeatedDose":
        output = output.replace("### Oral route", "#### oral")
        output = output.replace("### Dermal", "#### dermal")
        output = output.replace("### Inhalation", "#### inhalation")
        # '>' and '<' must be replaced with words, otherwise the jsonifier
        # interprets those two symbols as markup and wraps the text in []
        output = re.sub(r">\s+", "greater than ", output)
        # Replace '<' followed by whitespace with 'less than '
        output = re.sub(r"<\s+", "less than ", output)
        output = re.sub(r">=\s*\n", "greater or equal than ", output)
        output = re.sub(r"<=\s*\n", "less or equal than ", output)

    elif scrapingType == "AcuteToxicity":
        # '>' and '<' must be replaced with words, otherwise the jsonifier
        # interprets those two symbols as markup and wraps the text in []
        output = re.sub(r">\s+", "greater than ", output)
        # Replace '<' followed by whitespace with 'less than '
        output = re.sub(r"<\s+", "less than ", output)
        output = re.sub(r">=\s*\n", "greater or equal than", output)
        output = re.sub(r"<=\s*\n", "less or equal than ", output)

    output = output.replace("–", "-")

    output = re.sub(r"\s+mg", " mg", output)
    # This part fixes units of measure that wrap to a new line, separated from their value

    if local and substance:
        path = f"{scrapingType}/mds/{substance}.md"
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as text_file:
            text_file.write(output)

    return output

# Part 2 of the ECHA site processing: turn the markdown into a JSON
def markdown_to_json_raw(output, scrapingType=None, local=False, substance=None):
    # output : the markdown
    # scrapingType : 'RepeatedDose' or 'AcuteToxicity'
    # substance : the substance name, used to build the save path
    jsonified = markdown_to_json.jsonify(output)
    dictified = json.loads(jsonified)

    # Save the initial json exactly as produced by jsonify
    if local and scrapingType and substance:
        path = f"{scrapingType}/jsons/raws/{substance}_raw0.json"
        os.makedirs(os.path.dirname(path), exist_ok=True)

        with open(path, "w") as text_file:
            text_file.write(jsonified)

    # Now split the contents of the nested dictionaries.
    for key, value in dictified.items():
        if isinstance(value, dict):
            for key2, value2 in value.items():
                parts = value2.split("\n\n")
                dictified[key][key2] = {
                    parts[i]: parts[i + 1]
                    for i in range(0, len(parts) - 1, 2)
                    if parts[i + 1] != "[Empty]"
                }
        else:
            parts = value.split("\n\n")
            dictified[key] = {
                parts[i]: parts[i + 1]
                for i in range(0, len(parts) - 1, 2)
                if parts[i + 1] != "[Empty]"
            }

    jsonified = json.dumps(dictified)

    if local and scrapingType and substance:
        path = f"{scrapingType}/jsons/raws/{substance}_raw1.json"
        os.makedirs(os.path.dirname(path), exist_ok=True)

        with open(path, "w") as text_file:
            text_file.write(jsonified)

    dictified = json.loads(jsonified)

    return jsonified
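The "\n\n"-pairing comprehension used here can be seen on a tiny input (the sample strings are made up for illustration):

```python
# jsonify leaves flat "key\n\nvalue\n\nkey\n\nvalue" bodies; they are split
# into consecutive (key, value) pairs, skipping values equal to "[Empty]".
body = "Dose descriptor\n\nNOAEL\n\nEffect level\n\n300 mg/kg\n\nRemarks\n\n[Empty]"
parts = body.split("\n\n")
pairs = {
    parts[i]: parts[i + 1]
    for i in range(0, len(parts) - 1, 2)
    if parts[i + 1] != "[Empty]"
}
print(pairs)  # {'Dose descriptor': 'NOAEL', 'Effect level': '300 mg/kg'}
```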


# Method written by Claude to fix unicode character issues
def normalize_unicode_characters(text):
    """
    Normalize Unicode characters, with special handling for superscript
    """
    if not isinstance(text, str):
        return text

    # Specific replacements for common Unicode encoding issues
    # and for other special exceptions
    replacements = {
        "\u00c2\u00b2": "²",  # Â² -> ²
        "\u00c2\u00b3": "³",  # Â³ -> ³
        "\u00b2": "²",  # Bare superscript 2
        "\u00b3": "³",  # Bare superscript 3
        "\n": "",  # occasionally there are stray \n to remove
        "greater than": ">",
        "less than": "<",
        "greater or equal than": ">=",
        "less or equal than": "<=",
        # These entries restore '>' and '<', which were renamed
        # temporarily because they cause problems
    }

    # Apply specific replacements first
    for old, new in replacements.items():
        text = text.replace(old, new)

    # Normalize Unicode characters
    text = unicodedata.normalize("NFKD", text)

    return text

# Another method written by Claude.
# Apparently my brain is too small to recursively walk
# a nested dictionary. If only we had studied algorithms and data structures...
def clean_json(data):
    """
    Recursively clean JSON by removing empty/uninformative entries
    and normalizing Unicode characters
    """

    def is_uninformative(value, context=None):
        """
        Check if a dictionary entry is considered uninformative

        Args:
            value: The value to check
            context: Additional context about where the value is located
        """
        # Specific exceptions
        if context and context == "Key value for chemical safety assessment":
            # Always keep all entries in this specific section
            return False

        uninformative_values = ["hours/week", "", None]

        return value in uninformative_values or (
            isinstance(value, str)
            and (
                value.strip() in uninformative_values
                or value.lower() == "no information available"
            )
        )

    def clean_recursive(obj, context=None):
        # If it's a dictionary, process its contents
        if isinstance(obj, dict):
            # Create a copy to modify
            cleaned = {}
            for key, value in obj.items():
                # Normalize key
                normalized_key = normalize_unicode_characters(key)

                # Set context for nested dictionaries
                new_context = context or normalized_key

                # Recursively clean nested structures
                cleaned_value = clean_recursive(value, new_context)

                # Conditions for keeping the entry
                keep_entry = (
                    cleaned_value not in [None, {}, ""]
                    and not (
                        isinstance(cleaned_value, dict) and len(cleaned_value) == 0
                    )
                    and not is_uninformative(cleaned_value, new_context)
                )

                # Add to cleaned dict if conditions are met
                if keep_entry:
                    cleaned[normalized_key] = cleaned_value

            return cleaned if cleaned else None

        # If it's a list, clean each item
        elif isinstance(obj, list):
            cleaned_list = [clean_recursive(item, context) for item in obj]
            cleaned_list = [item for item in cleaned_list if item not in [None, {}, ""]]
            return cleaned_list if cleaned_list else None

        # For strings, normalize Unicode
        elif isinstance(obj, str):
            return normalize_unicode_characters(obj)

        # Return as-is for other types
        return obj

    # Create a deep copy to avoid modifying original data
    cleaned_data = clean_recursive(copy.deepcopy(data))
    # Yes, this is the part that drove me crazy:
    # looping over nested dictionaries without being able to change the structure
    return cleaned_data
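The recursion in clean_recursive can be illustrated with a minimal standalone version (simplified: no context exceptions and no Unicode normalization; the sample data is made up):

```python
# Drop empty or "No information available" leaves at any nesting depth.
def clean(obj):
    if isinstance(obj, dict):
        out = {k: clean(v) for k, v in obj.items()}
        out = {k: v for k, v in out.items() if v not in (None, {}, "", [])}
        return out or None
    if isinstance(obj, list):
        out = [clean(i) for i in obj]
        out = [i for i in out if i not in (None, {}, "", [])]
        return out or None
    if isinstance(obj, str) and obj.strip().lower() in ("", "no information available"):
        return None
    return obj

print(clean({"a": {"b": "", "c": "No information available", "d": "NOAEL 300"}}))
# -> {'a': {'d': 'NOAEL 300'}}
```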


def json_to_dataframe(cleaned_json, scrapingType):
    rows = []
    schema = {
        "RepeatedDose": [
            "Substance",
            "CAS",
            "Toxicity Type",
            "Route",
            "Dose descriptor",
            "Effect level",
            "Species",
            "Extraction_Timestamp",
            "Endpoint conclusion",
        ],
        "AcuteToxicity": [
            "Substance",
            "CAS",
            "Route",
            "Endpoint conclusion",
            "Dose descriptor",
            "Effect level",
            "Extraction_Timestamp",
        ],
    }
    if scrapingType == "RepeatedDose":
        # Iterate through top-level sections (excluding 'Key value for chemical safety assessment')
        for toxicity_type, routes in cleaned_json.items():
            if toxicity_type == "Key value for chemical safety assessment":
                continue

            # Iterate through routes within each toxicity type
            for route, details in routes.items():
                row = {"Toxicity Type": toxicity_type, "Route": route}

                # Add details to the row, excluding 'Link to relevant study record(s)'
                row.update(
                    {
                        k: v
                        for k, v in details.items()
                        if k != "Link to relevant study record(s)"
                    }
                )
                rows.append(row)
    elif scrapingType == "AcuteToxicity":
        for toxicity_type, routes in cleaned_json.items():
            if (
                toxicity_type == "Key value for chemical safety assessment"
                or not routes
            ):
                continue

            row = {
                "Route": toxicity_type.replace("Acute toxicity: via", "")
                .replace("route", "")
                .strip()
            }

            # Add details directly from the routes dictionary
            row.update(
                {
                    k: v
                    for k, v in routes.items()
                    if k != "Link to relevant study record(s)"
                }
            )
            rows.append(row)

    # Create the DataFrame
    df = pd.DataFrame(rows)

    # Last-minute fix to enforce a consistent schema
    fair_columns = list(set(schema["RepeatedDose"] + schema["AcuteToxicity"]))
    df = df.loc[:, df.columns.intersection(fair_columns)]
    return df


def save_dataframe(df, file_path, scrapingType, schema):
    """
    Save a DataFrame with strict column requirements.

    Args:
        df (pd.DataFrame): DataFrame to potentially append
        file_path (str): Path of the CSV file
    """
    # Mandatory columns for the saved DataFrame
    saved_columns = schema[scrapingType]

    # The input DataFrame must at least have an "Effect level" column
    if "Effect level" not in df.columns:
        return

    # Reindex to match the saved columns, filling missing ones with NaN
    df = df.reindex(columns=saved_columns)

    # Drop rows that have no value for "Effect level"
    df = df[df["Effect level"].notna()]

    # Append to the file if it already exists, otherwise create it
    df.to_csv(
        file_path,
        mode="a" if os.path.exists(file_path) else "w",
        header=not os.path.exists(file_path),
        index=False,
    )


def echaExtract(
    substance: str,
    scrapingType: str,
    outputType="df",
    key_infos=False,
    local_search=False,
    local_only=False,
):
    """
    Main function for scraping the ECHA website. It ties together several
    search, extraction and cleaning helpers, and logs every operation.

    Args:
        substance (str): CAS number or name of the substance. Both work, but the CAS is more reliable.
        scrapingType (str): 'AcuteToxicity' (LD50) or 'RepeatedDose' (NOAEL)
        outputType (str): 'df' (pd.DataFrame) or 'json' (not recommended)
        key_infos (bool): Whether to also look for the "Description of Key Information"
            section of the dossiers. Some substances have their data entered carelessly,
            with the values written there in free text instead of in the structured fields.

    Output:
        a DataFrame or a JSON string, or
        f"Non esistono lead dossiers attivi o inattivi per {substance}"
    """

    # If local_search is True, try a local lookup first; otherwise search online.
    if local_search and local_echa:
        result = echaExtract_local(substance, scrapingType, key_infos)

        if not result.empty:
            logging.info(
                f"echa.echaProcess.echaExtract(): Found local data for {scrapingType}, {substance}. Returning it."
            )
            return result
        else:
            logging.info(
                f"echa.echaProcess.echaExtract(): Found no local data for {scrapingType}, {substance}. Continuing."
            )
            if local_only:
                logging.info(f'echa.echaProcess.echaExtract(): No data found in local-only search for {substance}, {scrapingType}')
                return f'No data found in local-only search for {substance}, {scrapingType}'

    try:
        # search_dossier looks the substance up on the ECHA site and returns
        # the dossier information.
        links = search_dossier(substance)
        if not links:
            # No LEAD dossiers (the ones with the toxicological summaries),
            # active or inactive, exist for this substance.
            logging.info(
                f'echaProcess.echaExtract(). no active or inactive lead dossiers for: "{substance}". Ending extraction.'
            )
            return f"Non esistono lead dossiers attivi o inattivi per {substance}"

        # If they exist, open the page of interest ('AcuteToxicity' or 'RepeatedDose')
        if scrapingType not in list(links.keys()):
            logging.info(
                f'echaProcess.echaExtract(). No page for "{scrapingType}", "{substance}"'
            )
            return f'No data in "{scrapingType}", "{substance}". Page does not exist.'

        soup = openEchaPage(link=links[scrapingType])
        logging.info(
            f"echaProcess.echaExtract(). soupped '{scrapingType}' echa page for '{substance}'"
        )

        # Grab the section we need
        sezione = None
        try:
            sezione = soup.find(
                "section",
                class_="KeyValueForChemicalSafetyAssessment",
                attrs={"data-cy": "das-block"},
            )
        except Exception:
            logging.error(
                f'echaProcess.echaExtract(). could not extract the "section" for "{scrapingType}" for "{substance}"',
                exc_info=True,
            )

        # Current timestamp
        now = datetime.now()

        # Look for the key informations
        key_infos_faund = False
        if key_infos:
            try:
                key_infos = soup.find(
                    "section",
                    class_="KeyInformation",
                    attrs={"data-cy": "das-block"},
                )
                if key_infos:
                    key_infos = key_infos.find(
                        "div",
                        class_="das-field_value das-field_value_html",
                    )
                    key_infos = key_infos.text
                    key_infos = key_infos if key_infos.strip() != "[Empty]" else None
                    if key_infos:
                        key_infos_faund = True
                        logging.info(
                            f"echaProcess.echaExtract(). Extracted key_infos from '{scrapingType}' echa page for '{substance}': {key_infos}"
                        )
                        key_infos_df = pd.DataFrame(index=[0])
                        key_infos_df["key_information"] = key_infos
                        key_infos_df = df_wrapper(
                            df=key_infos_df,
                            rmlName=links["rmlName"],
                            rmlCas=links["rmlCas"],
                            timestamp=now.strftime("%Y-%m-%d"),
                            dossierType=links["dossierType"],
                            page=scrapingType,
                            linkPage=links[scrapingType],
                            key_infos=True,
                        )
                    else:
                        logging.error(
                            f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() ERROR. could not extract key_infos for "{scrapingType}", "{substance}"'
                        )
                else:
                    logging.error(
                        f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() ERROR. could not extract key_infos for "{scrapingType}", "{substance}"'
                    )
            except Exception:
                logging.error(
                    f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() ERROR. could not extract key_infos for "{scrapingType}", "{substance}"',
                    exc_info=True,
                )

        try:
            if not sezione:
                logging.error(
                    f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() Empty section for the html > markdown conversion. No data for "{scrapingType}", "{substance}"'
                )
                if not key_infos_faund:
                    return f'No data in "{scrapingType}", "{substance}"'
                else:
                    # No structured data, but the key informations were found,
                    # so return those instead.
                    return key_infos_df

            # Convert the html section to markdown
            output = echaPage_to_md(
                sezione, scrapingType=scrapingType, substance=substance
            )
            logging.info(
                f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() OK. created MD for "{scrapingType}", "{substance}"'
            )

            # In rare cases no page exists at all for acute toxicity or repeated
            # dose. In that case "output" will be empty and raise an error.
        except Exception:
            logging.error(
                f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() ERROR. could not create MD for "{scrapingType}", "{substance}"',
                exc_info=True,
            )

        try:
            # Convert the markdown into the first raw json
            jsonified = markdown_to_json_raw(
                output, scrapingType=scrapingType, substance=substance
            )
            json_data = json.loads(jsonified)
            logging.info(
                f'echaProcess.echaExtract() > echaProcess.markdown_to_json_raw() OK. created initial json for "{scrapingType}", "{substance}"'
            )
        except Exception:
            logging.error(
                f'echaProcess.echaExtract() > echaProcess.markdown_to_json_raw() ERROR. could not create initial json for "{scrapingType}", "{substance}"',
                exc_info=True,
            )

        try:
            # Second json processing step: clean the most deeply nested dictionaries
            cleaned_data = clean_json(json_data)
            logging.info(
                f'echaProcess.echaExtract() > echaProcess.clean_json() OK. cleaned the json for "{scrapingType}", "{substance}"'
            )
            # If cleaned_data is empty there is no data
            if not cleaned_data:
                logging.error(
                    f'echaProcess.echaExtract() > echaProcess.clean_json() Empty cleaned_json. No data for "{scrapingType}", "{substance}"'
                )
                if not key_infos_faund:
                    return f'No data in "{scrapingType}", "{substance}"'
                else:
                    # No structured data, but the key informations were found,
                    # so return those instead.
                    return key_infos_df
        except Exception:
            logging.error(
                f'echaProcess.echaExtract() > echaProcess.clean_json() ERROR. cleaning the json for "{scrapingType}", "{substance}"'
            )

        # If a dataframe is requested, build it and add a timestamp
        try:
            df = json_to_dataframe(cleaned_data, scrapingType)
            df = df_wrapper(
                df=df,
                rmlName=links["rmlName"],
                rmlCas=links["rmlCas"],
                timestamp=now.strftime("%Y-%m-%d"),
                dossierType=links["dossierType"],
                page=scrapingType,
                linkPage=links[scrapingType],
            )

            if outputType == "df":
                logging.info(
                    f'echaProcess.echaExtract(). successfully extracted "{scrapingType}", "{substance}". Returning df'
                )

                # If the user asked for the key infos and they were found,
                # concatenate the two dataframes
                return df if not key_infos_faund else pd.concat([key_infos_df, df])

            elif outputType == "json":
                if key_infos_faund:
                    df = pd.concat([key_infos_df, df])
                jayson = df.to_json(orient="records", force_ascii=False)
                logging.info(
                    f'echaProcess.echaExtract(). successfully extracted "{scrapingType}", "{substance}". Returning json'
                )
                return jayson
        except KeyError:
            # Handles the low-quality pages that only contain "no information available"
            if key_infos_faund:
                return key_infos_df

            json_output = list(cleaned_data[list(cleaned_data.keys())[0]].values())
            if json_output == ["no information available" for elem in json_output]:
                logging.info(
                    f"echaProcess.echaExtract(). No data found for {scrapingType} for {substance}"
                )
                return f'No data in "{scrapingType}", "{substance}"'
            else:
                logging.error(
                    "echaProcess.json_to_dataframe(). Could not create dataframe"
                )
                cleaned_data["error"] = (
                    "Could not create the dataframe, probably because there is not enough information. Returning the JSON instead."
                )
                return cleaned_data

    except Exception:
        logging.error(
            "echaProcess.echaExtract() ERROR. Something went wrong, not quite sure what.",
            exc_info=True,
        )


def df_wrapper(
    df, rmlName, rmlCas, timestamp, dossierType, page, linkPage, key_infos=False
):
    # A small helper that adds all the metadata we need to the dataframe,
    # so as not to clutter echaExtract, which is already messy enough.
    df.insert(0, "Substance", rmlName)
    df.insert(1, "CAS", rmlCas)
    df["Extraction_Timestamp"] = timestamp
    df = df.replace("\n", "", regex=True)
    if not key_infos:
        # Drop rows with no "Effect level" value
        df = df[df["Effect level"].notna()]

    # Add the dossier link and status
    df["dossierType"] = dossierType
    df["page"] = page
    df["linkPage"] = linkPage
    return df


def echaExtract_specific(
    CAS: str,
    scrapingType="RepeatedDose",
    doseDescriptor="NOAEL",
    route="inhalation",
    local_search=False,
    local_only=False,
):
    """
    Given a CAS number, tries to find the dose descriptor (NOAEL by default)
    for the specified route ('inhalation' by default).

    Args:
        CAS (str): the CAS number, or alternatively the substance name
        route (str): 'inhalation', 'oral', 'dermal'. Defaults to 'inhalation'
        scrapingType (str): the page to search on
        doseDescriptor (str): the type of value to look for (NOAEL, DNEL, LD50, LC50)
    """

    # Attempt the extraction
    result = echaExtract(
        substance=CAS,
        scrapingType=scrapingType,
        outputType="df",
        local_search=local_search,
        local_only=local_only,
    )

    # Is the result a dataframe?
    if isinstance(result, pd.DataFrame):
        # If so, filter it down to what we are interested in
        filtered_df = result[
            (result["Route"] == route) & (result["Dose descriptor"] == doseDescriptor)
        ]
        # Return it if it is not empty
        if not filtered_df.empty:
            return filtered_df
        else:
            return f'Non ho trovato {doseDescriptor} in {scrapingType} con route "{route}" per {CAS}'

    elif isinstance(result, dict) and result.get("error"):
        # This means a json carrying an error was returned
        return f'Non ho trovato {doseDescriptor} in {scrapingType} con route "{route}" per {CAS}'

    # This means the result was a "Non esistono" message: no active or
    # inactive lead dossiers exist for the searched substance
    elif isinstance(result, str) and result.startswith("Non esistono"):
        return result


def echa_noael_ld50(CAS: str, route="inhalation", outputType="df", local_search=False, local_only=False):
    """
    Given a CAS number, tries to find the NOAEL for the specified route
    ('inhalation' by default). If no RepeatedDose page with a NOAEL exists,
    it returns the LD50 for that route instead.

    Args:
        CAS (str): the CAS number, or alternatively the substance name
        route (str): 'inhalation', 'oral', 'dermal'. Defaults to 'inhalation'
        outputType (str): 'df' or 'json'. The output type
    """
    if route not in ["inhalation", "oral", "dermal"] or outputType not in [
        "df",
        "json",
    ]:
        return "invalid input"
    # First try to scrape the "RepeatedDose" page
    first_attempt = echaExtract_specific(
        CAS=CAS,
        scrapingType="RepeatedDose",
        doseDescriptor="NOAEL",
        route=route,
        local_search=local_search,
        local_only=local_only,
    )

    if isinstance(first_attempt, pd.DataFrame):
        return first_attempt
    elif isinstance(first_attempt, str) and first_attempt.startswith("Non ho trovato"):
        second_attempt = echaExtract_specific(
            CAS=CAS,
            scrapingType="AcuteToxicity",
            doseDescriptor="LD50",
            route=route,
            local_search=local_search,
            local_only=local_only,
        )
        if isinstance(second_attempt, pd.DataFrame):
            return second_attempt
        elif isinstance(second_attempt, str) and second_attempt.startswith(
            "Non ho trovato"
        ):
            return second_attempt.replace("LD50", "NOAEL ed LD50")
    elif isinstance(first_attempt, str) and first_attempt.startswith("Non esistono"):
        return first_attempt


def echa_noael_ld50_multi(
    casList: list, route="inhalation", messages=False, local_search=False, local_only=False
):
    """
    Fairly simple helper. Given a list of CAS numbers it runs echa_noael_ld50
    on each one, i.e. it looks for the NOAELs for the requested route, or for
    the LD50s when no NOAEL is found. The output is a dataframe for the
    substances it finds and a list of messages for those it does not.

    Args:
        casList (list): the list of CAS numbers
        route (str): 'inhalation', 'oral', 'dermal'. Defaults to 'inhalation'
        messages (bool): with True the function returns a list whose first
            element is the dataframe and whose second element is the list of
            messages for the substances that were not found. Defaults to
            False, which returns only the dataframe.
    """
    messages_list = []
    df = pd.DataFrame()
    for CAS in casList:
        output = echa_noael_ld50(
            CAS=CAS, route=route, outputType="df", local_search=local_search, local_only=local_only
        )
        if isinstance(output, str):
            messages_list.append(output)
        elif isinstance(output, pd.DataFrame):
            df = pd.concat([df, output], ignore_index=True)
    df.dropna(axis=1, how="all", inplace=True)
    if messages and df.empty:
        messages_list.append(
            f'Non sono riuscito a trovare nessun NOAEL o LD50 per i cas per la route "{route}"'
        )
        return [None, messages_list]
    elif messages and not df.empty:
        return [df, messages_list]
    elif not df.empty and not messages:
        return df
    elif df.empty and not messages:
        return f'Non sono riuscito a trovare nessun NOAEL o LD50 per i cas per la route "{route}"'


def echaExtract_multi(
    casList: list,
    scrapingType="all",
    local=False,
    local_path=None,
    log_path=None,
    debug_print=False,
    error=False,
    error_path=None,
    key_infos=False,
    local_search=False,
    local_only=False,
    filter=None,
):
    """
    Given a list of CAS numbers, tries to extract all the RepeatedDose pages,
    all the AcuteToxicity pages, or both.

    Args:
        casList (list): the list of CAS numbers
        scrapingType (str): 'RepeatedDose', 'AcuteToxicity', 'all'
        local (bool): when True, saves to disk progressively, appending each
            result as it is found. Required for large-scale scraping.
        log_path (str): path of the log to fill during mass scraping
        debug_print (bool): print progress while scraping
        error (bool): return the list of errors once the scraping is done

    Output:
        pd.DataFrame
    """
    cas_len = len(casList)
    i = 0

    df = pd.DataFrame()
    if scrapingType == "all":
        scrapingTypeList = ["RepeatedDose", "AcuteToxicity"]
    else:
        scrapingTypeList = [scrapingType]

    logging.info(
        f"echa.echaExtract_multi(). Commencing mass extraction of {scrapingTypeList} for {casList}"
    )

    errors = []

    for cas in casList:
        for scrapingType in scrapingTypeList:
            extraction = echaExtract(
                substance=cas,
                scrapingType=scrapingType,
                outputType="df",
                key_infos=key_infos,
                local_search=local_search,
                local_only=local_only,
            )
            if isinstance(extraction, pd.DataFrame) and not extraction.empty:
                status = "successful_scrape"
                logging.info(
                    f"echa.echaExtract_multi(). Successfully scraped {scrapingType} for {cas}"
                )

                df = pd.concat([df, extraction], ignore_index=True)
                if local and local_path:
                    df.to_csv(local_path, index=False)

            elif (
                (isinstance(extraction, pd.DataFrame) and extraction.empty)
                or (extraction is None)
                or (isinstance(extraction, str) and extraction.startswith("No data"))
            ):
                status = "no_data_found"
                logging.info(
                    f"echa.echaExtract_multi(). Found no data for {scrapingType} for {cas}"
                )
            elif isinstance(extraction, dict):
                if extraction.get("error"):
                    status = "df_creation_error"
                    errors.append(extraction)
                    logging.info(
                        f"echa.echaExtract_multi(). Df creation error for {scrapingType} for {cas}"
                    )
            elif isinstance(extraction, str) and extraction.startswith("Non esistono"):
                status = "no_lead_dossiers"
                logging.info(
                    f"echa.echaExtract_multi(). Found no lead dossiers for {cas}"
                )
            else:
                status = "unknown_error"
                logging.error(
                    f"echa.echaExtract_multi(). Unknown error for {scrapingType} for {cas}"
                )

            if log_path:
                fill_log(cas, status, log_path, scrapingType)
            if debug_print:
                print(f"{i}: {cas}, {scrapingType}")
            i += 1

    if error and errors and error_path:
        with open(error_path, "w") as json_file:
            json.dump(errors, json_file, indent=4)

    # This single filter argument is what let me drop four separate methods
    if filter:
        df = filter_dataframe_by_dict(df, filter)
    return df


def fill_log(cas: str, status: str, log_path: str, scrapingType: str):
    """
    Used during mass scraping to fill a log while the substances are being extracted.
    """
    df = pd.read_csv(log_path)
    df.loc[df["casNo"] == cas, f"scraping_{scrapingType}"] = status
    df.loc[df["casNo"] == cas, "timestamp"] = datetime.now().strftime("%Y-%m-%d")

    df.to_csv(log_path, index=False)


def echaExtract_local(substance: str, scrapingType: str, key_infos=False):
    # NOTE: interpolating the substance directly into the query is
    # injection-prone; prefer a parameterized query where the driver allows it.
    if not key_infos:
        query = f"""
        SELECT *
        FROM echa_full_scraping
        WHERE CAS = '{substance}' AND page = '{scrapingType}' AND key_information IS NULL;
        """
    else:
        query = f"""
        SELECT *
        FROM echa_full_scraping
        WHERE CAS = '{substance}' AND page = '{scrapingType}';
        """
    result = con.sql(query).df()
    return result
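The query above can be parameterized instead of string-interpolated. A minimal sketch using the stdlib `sqlite3` driver as a stand-in for the actual connection (the table and values here are illustrative, and the real driver's placeholder syntax may differ):

```python
import sqlite3

# In-memory stand-in database with the same table shape
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE echa_full_scraping (CAS TEXT, page TEXT, key_information TEXT)")
con.execute("INSERT INTO echa_full_scraping VALUES ('50-00-0', 'RepeatedDose', NULL)")

# Placeholders keep user input out of the SQL text entirely
query = "SELECT * FROM echa_full_scraping WHERE CAS = ? AND page = ?"
rows = con.execute(query, ("50-00-0", "RepeatedDose")).fetchall()
print(rows)  # [('50-00-0', 'RepeatedDose', None)]
```

With placeholders, a CAS value containing a quote character is treated as data rather than as SQL, which also avoids syntax errors on legitimate substance names.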


def filter_dataframe_by_dict(df, filter_dict):
    """
    Filters a pandas DataFrame based on a dictionary.

    Args:
        df (pd.DataFrame): The input DataFrame.
        filter_dict (dict): A dictionary where keys are column names and
            values are lists of allowed values for that column.

    Returns:
        pd.DataFrame: A new DataFrame containing only the rows that match
            the filter criteria.
    """
    # Start from an all-True mask and AND in one condition per column
    filter_condition = pd.Series(True, index=df.index)

    for column_name, allowed_values in filter_dict.items():
        if column_name in df.columns:
            # Boolean Series for the current column
            column_filter = df[column_name].isin(allowed_values)
            filter_condition = filter_condition & column_filter
        else:
            print(f"Warning: Column '{column_name}' not found in the DataFrame. Filter for this column will be ignored.")

    # Apply the combined filter condition
    filtered_df = df[filter_condition]
    return filtered_df
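The same column-membership filter can be sketched without pandas, with rows as plain dicts (`filter_rows` is a hypothetical helper for illustration, not part of this module):

```python
def filter_rows(rows, filter_dict):
    """Keep rows whose value in every filtered column is in the allowed list."""
    def keep(row):
        return all(row.get(col) in allowed for col, allowed in filter_dict.items())
    return [row for row in rows if keep(row)]

rows = [
    {"Route": "oral", "Dose descriptor": "NOAEL"},
    {"Route": "dermal", "Dose descriptor": "LD50"},
]
print(filter_rows(rows, {"Route": ["oral"], "Dose descriptor": ["NOAEL"]}))
# [{'Route': 'oral', 'Dose descriptor': 'NOAEL'}]
```

Note one behavioral difference: this sketch drops every row when a filtered column is missing, whereas the pandas version above ignores unknown columns with a warning.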

old/_old/find.py (deleted, 497 lines)
@@ -1,497 +0,0 @@
import requests
import urllib.parse
import json
import logging
import re
from datetime import datetime
from bs4 import BeautifulSoup
from typing import Dict, Union, Optional, Any

# Logging settings
logging.basicConfig(
    format="{asctime} - {levelname} - {message}",
    style="{",
    datefmt="%Y-%m-%d %H:%M",
    filename=".log",
    encoding="utf-8",
    filemode="a",
    level=logging.INFO,
)


# Constants for API endpoints
QUACKO_BASE_URL = "https://chem.echa.europa.eu"
QUACKO_SUBSTANCE_API = f"{QUACKO_BASE_URL}/api-substance/v1/substance"
QUACKO_DOSSIER_API = f"{QUACKO_BASE_URL}/api-dossier-list/v1/dossier"
QUACKO_HTML_PAGES = f"{QUACKO_BASE_URL}/html-pages"

# Default sections to look for in the dossier
DEFAULT_SECTIONS = {
    "id_7_Toxicologicalinformation": "ToxSummary",
    "id_72_AcuteToxicity": "AcuteToxicity",
    "id_75_Repeateddosetoxicity": "RepeatedDose",
    "id_6_Ecotoxicologicalinformation": "EcotoxSummary",
    "id_76_Genetictoxicity": "GeneticToxicity",
    "id_42_Meltingpointfreezingpoint": "MeltingFreezingPoint",
    "id_43_Boilingpoint": "BoilingPoint",
    "id_48_Watersolubility": "WaterSolubility",
    "id_410_Surfacetension": "SurfaceTension",
    "id_420_pH": "pH",
    "Test": "Test2",
}

def search_dossier(
    substance: str,
    input_type: str = 'rmlCas',
    sections: Dict[str, str] = None,
    local_index_path: str = None
) -> Union[Dict[str, Any], str, bool]:
    """
    Search for a chemical substance in the QUACKO database and retrieve its dossier information.

    Args:
        substance (str): The identifier of the substance to search for (e.g. CAS number, name)
        input_type (str): The type of identifier provided. Options: 'rmlCas', 'rmlName', 'rmlEc'
        sections (Dict[str, str], optional): Dictionary mapping section IDs to result keys.
            If None, default sections will be used.
        local_index_path (str, optional): Path to a local index.html file to parse instead of
            downloading from QUACKO. If provided, the function will skip all API calls and
            only extract sections from the local file.

    Returns:
        Union[Dict[str, Any], str, bool]: Dictionary with substance information and dossier links on success,
            error message string if substance found but with issues,
            False if substance not found or other critical error
    """
    # Use default sections if none provided
    if sections is None:
        sections = DEFAULT_SECTIONS

    try:
        results = {}

        # If a local file is provided, extract sections from it directly
        if local_index_path:
            logging.info(f"QUACKO.search() - Using local index file: {local_index_path}")

            # We still need some minimal info for constructing the URLs
            if '/' not in local_index_path:
                asset_id = "local"
                rml_id = "local"
            else:
                # Try to extract information from the path if available
                path_parts = local_index_path.split('/')
                # If the path follows the expected structure: .../html-pages/ASSET_ID/index.html
                if 'html-pages' in path_parts and 'index.html' in path_parts[-1]:
                    asset_id = path_parts[path_parts.index('html-pages') + 1]
                    rml_id = "extracted"  # Just a placeholder
                else:
                    asset_id = "local"
                    rml_id = "local"

            # Add these to results for consistency
            results["assetExternalId"] = asset_id
            results["rmlId"] = rml_id

            # Extract sections from the local file
            section_links = get_section_links_from_file(local_index_path, asset_id, rml_id, sections)
            if section_links:
                results.update(section_links)

            return results

        # Normal flow with API calls
        substance_data = get_substance_by_identifier(substance)
        if not substance_data:
            return False

        # Verify that the found substance matches the input identifier
        if substance_data.get(input_type) != substance:
            error_msg = (f"Search error: results[{input_type}] (\"{substance_data.get(input_type)}\") "
                         f"is not equal to \"{substance}\". Maybe you specified the wrong input_type. "
                         f"Check the results here: {substance_data.get('search_response')}")
            logging.error(f"QUACKO.search(): {error_msg}")
            return error_msg

        # Step 2: Find dossiers for the substance
        rml_id = substance_data["rmlId"]
        dossier_data = get_dossier_by_rml_id(rml_id, substance)
        if not dossier_data:
            return False

        # Merge substance and dossier data
        results = {**substance_data, **dossier_data}

        # Step 3: Extract detailed information from the dossier index page
        asset_external_id = dossier_data["assetExternalId"]
        section_links = get_section_links_from_index(asset_external_id, rml_id, sections)
        if section_links:
            results.update(section_links)

        logging.info(f"QUACKO.search() OK. output: {json.dumps(results)}")
        return results

    except Exception as e:
        logging.error(f"QUACKO.search(): Unexpected error in search_dossier for '{substance}': {str(e)}")
        return False


def get_substance_by_identifier(substance: str) -> Optional[Dict[str, str]]:
    """
    Search the QUACKO database for a substance using the provided identifier.

    Args:
        substance (str): The substance identifier to search for (CAS number, name, etc.)

    Returns:
        Optional[Dict[str, str]]: Dictionary with substance information or None if not found
    """
    encoded_substance = urllib.parse.quote(substance)
    search_url = f"{QUACKO_SUBSTANCE_API}?pageIndex=1&pageSize=100&searchText={encoded_substance}"

    logging.info(f'QUACKO.search(). searching "{substance}"')

    try:
        response = requests.get(search_url)
        response.raise_for_status()  # Raise exception for HTTP errors
        data = response.json()

        if not data.get("items") or len(data["items"]) == 0:
            logging.info(f"QUACKO.search() could not find substance for '{substance}'")
            return None

        # Extract substance information
        substance_index = data["items"][0]["substanceIndex"]
        result = {
            'search_response': search_url,
            'rmlId': substance_index.get("rmlId", ""),
            'rmlName': substance_index.get("rmlName", ""),
            'rmlCas': substance_index.get("rmlCas", ""),
            'rmlEc': substance_index.get("rmlEc", "")
        }

        logging.info(
            f"QUACKO.search() found substance on QUACKO. "
            f"rmlId: '{result['rmlId']}', rmlName: '{result['rmlName']}', rmlCas: '{result['rmlCas']}'"
        )
        return result

    except requests.RequestException as e:
        logging.error(f"QUACKO.search() - Request error while searching for substance '{substance}': {str(e)}")
        return None
    except (KeyError, IndexError) as e:
        logging.error(f"QUACKO.search() - Data parsing error for substance '{substance}': {str(e)}")
        return None

def get_dossier_by_rml_id(rml_id: str, substance_name: str) -> Optional[Dict[str, Any]]:
    """
    Find dossiers for a substance using its RML ID.

    Args:
        rml_id (str): The RML ID of the substance
        substance_name (str): The name of the substance (used for logging)

    Returns:
        Optional[Dict[str, Any]]: Dictionary with dossier information, or None if not found
    """
    # First try active dossiers
    dossier_results = _query_dossier_api(rml_id, "Active")

    # If no active dossiers were found, fall back to the inactive ones
    if not dossier_results:
        logging.info(
            f"QUACKO.search() - could not find an active dossier for '{substance_name}'. "
            "Proceeding to search the inactive ones."
        )
        dossier_results = _query_dossier_api(rml_id, "Inactive")

        if not dossier_results:
            logging.info(f"QUACKO.search() - could not find inactive dossiers for '{substance_name}'")
            return None
        else:
            logging.info(f"QUACKO.search() - found inactive dossiers for '{substance_name}'")
            dossier_results["dossierType"] = "Inactive"
    else:
        logging.info(f"QUACKO.search() - found active dossiers for '{substance_name}'")
        dossier_results["dossierType"] = "Active"

    return dossier_results


def _query_dossier_api(rml_id: str, status: str) -> Optional[Dict[str, Any]]:
    """
    Helper function to query the QUACKO dossier API for a specific substance and status.

    Args:
        rml_id (str): The RML ID of the substance
        status (str): The status of the dossiers to search for ('Active' or 'Inactive')

    Returns:
        Optional[Dict[str, Any]]: Dictionary with dossier information, or None if not found
    """
    url = f"{QUACKO_DOSSIER_API}?pageIndex=1&pageSize=100&rmlId={rml_id}&registrationStatuses={status}"

    try:
        response = requests.get(url)
        response.raise_for_status()
        data = response.json()

        if not data.get("items"):
            return None

        result = {
            "assetExternalId": data["items"][0]["assetExternalId"],
            "rootKey": data["items"][0]["rootKey"],
        }

        # Extract the last update date if available
        try:
            last_update = data["items"][0]["lastUpdatedDate"]
            datetime_object = datetime.fromisoformat(last_update.replace('Z', '+00:00'))
            result['lastUpdateDate'] = datetime_object.date().isoformat()
        except (KeyError, ValueError) as e:
            logging.error(f"QUACKO.search() - Error extracting lastUpdateDate: {str(e)}")

        # Add index URLs
        result["index"] = f"{QUACKO_HTML_PAGES}/{result['assetExternalId']}/index.html"
        result["index_js"] = f"{QUACKO_BASE_URL}/{rml_id}/dossier-view/{result['assetExternalId']}"

        return result

    except requests.RequestException as e:
        logging.error(f"QUACKO.search() - Request error while getting dossiers for RML ID '{rml_id}': {str(e)}")
        return None
    except (KeyError, IndexError) as e:
        logging.error(f"QUACKO.search() - Data parsing error for RML ID '{rml_id}': {str(e)}")
        return None


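The `lastUpdatedDate` handling above normalizes API timestamps before parsing. A minimal sketch of that step in isolation (the sample timestamp is invented):

```python
from datetime import datetime

def parse_last_update(timestamp: str) -> str:
    """Normalize an API timestamp (possibly ending in 'Z') to an ISO date."""
    # fromisoformat() rejects the 'Z' suffix before Python 3.11,
    # so it is replaced with an explicit UTC offset first.
    return datetime.fromisoformat(timestamp.replace('Z', '+00:00')).date().isoformat()

print(parse_last_update("2023-05-17T09:30:00Z"))  # → 2023-05-17
```

On Python 3.11+ `fromisoformat` accepts the `'Z'` suffix directly, but the `replace` keeps the code working on older interpreters as well.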
def get_section_links_from_index(
    asset_id: str,
    rml_id: str,
    sections: Dict[str, str]
) -> Dict[str, str]:
    """
    Extract links to the specified sections by downloading the dossier index page.

    Args:
        asset_id (str): The asset external ID of the dossier
        rml_id (str): The RML ID of the substance
        sections (Dict[str, str]): Dictionary mapping section IDs to result keys

    Returns:
        Dict[str, str]: Dictionary with links to the requested sections
    """
    index_url = f"{QUACKO_HTML_PAGES}/{asset_id}/index.html"

    try:
        response = requests.get(index_url)
        response.raise_for_status()

        # Parse the content using the shared helper
        return parse_sections_from_html(response.text, asset_id, rml_id, sections)

    except requests.RequestException as e:
        logging.error(f"QUACKO.search() - Request error while extracting section links: {str(e)}")
        return {}
    except Exception as e:
        logging.error(f"QUACKO.search() - Error extracting section links: {str(e)}")
        return {}


def get_section_links_from_file(
    file_path: str,
    asset_id: str,
    rml_id: str,
    sections: Dict[str, str]
) -> Dict[str, str]:
    """
    Extract links to the specified sections from a local index.html file.

    Args:
        file_path (str): Path to the local index.html file
        asset_id (str): The asset external ID used to construct URLs
        rml_id (str): The RML ID used to construct URLs
        sections (Dict[str, str]): Dictionary mapping section IDs to result keys

    Returns:
        Dict[str, str]: Dictionary with links to the requested sections
    """
    try:
        if not os.path.exists(file_path):
            logging.error(f"QUACKO.search() - Local file not found: {file_path}")
            return {}

        with open(file_path, 'r', encoding='utf-8') as file:
            html_content = file.read()

        # Parse the content using the shared helper
        return parse_sections_from_html(html_content, asset_id, rml_id, sections)

    except FileNotFoundError:
        logging.error(f"QUACKO.search() - File not found: {file_path}")
        return {}
    except Exception as e:
        logging.error(f"QUACKO.search() - Error parsing local file {file_path}: {str(e)}")
        return {}


def parse_sections_from_html(
    html_content: str,
    asset_id: str,
    rml_id: str,
    sections: Dict[str, str]
) -> Dict[str, str]:
    """
    Parse HTML content and extract links to the specified sections.

    Args:
        html_content (str): HTML content to parse
        asset_id (str): The asset external ID used to construct URLs
        rml_id (str): The RML ID used to construct URLs
        sections (Dict[str, str]): Dictionary mapping section IDs to result keys

    Returns:
        Dict[str, str]: Dictionary with links to the requested sections
    """
    result = {}

    try:
        soup = BeautifulSoup(html_content, "html.parser")

        # Extract each requested section
        for section_id, section_name in sections.items():
            section_links = extract_section_links(soup, section_id, asset_id, rml_id, section_name)
            if section_links:
                result.update(section_links)
                logging.info(f"QUACKO.search() - Found section '{section_name}' in document")
            else:
                logging.info(f"QUACKO.search() - Section '{section_name}' not found in document")

        return result

    except Exception as e:
        logging.error(f"QUACKO.search() - Error parsing HTML content: {str(e)}")
        return {}


# --------------------------------------------------------------------------
# Function to Extract Section Links with Validation
# --------------------------------------------------------------------------
# This function extracts the document link associated with a specific section ID
# from the QUACKO index.html page structure.
#
# Problem Solved:
# Previous attempts faced issues where searching for a link within a parent
# section's div (e.g., "7 Toxicological Information" with id="id_7_...")
# would incorrectly grab the link belonging to the *first child* section
# (e.g., "7.2 Acute Toxicity" with id="id_72_..."). This happened because
# a simple `find("a", href=True)` does not distinguish ownership when
# sections are nested.
#
# Solution Logic:
# 1. Find Target Div: Locate the `div` element using the specific `section_id`
#    provided. This div typically contains the section's content or nested
#    subsections.
# 2. Find First Link: Find the very first `<a>` tag with an `href` attribute
#    anywhere *inside* the `target_div`.
# 3. Find the Link's Owning Section Div: Starting from the `first_link_tag`,
#    traverse up the HTML tree using `find_parent()` to find the nearest
#    ancestor `div` whose `id` attribute starts with "id_" (the pattern for
#    section containers).
# 4. Validate Ownership: Compare the `id` of the `link_ancestor_section_div`
#    found in step 3 with the original `section_id` passed into the function.
# 5. Decision:
#    - If the IDs MATCH: the `first_link_tag` truly belongs to the
#      `section_id` being queried, and the function proceeds to extract and
#      format this link.
#    - If the IDs DO NOT MATCH: the first link found actually belongs to a
#      *nested* subsection div. The original `section_id` (the parent or
#      container) therefore has no direct link of its own, and the function
#      correctly returns an empty dictionary for this `section_id`.
#
# This validation step ensures that only links directly associated with the
# queried section ID are returned, preventing the inheritance bug.
# --------------------------------------------------------------------------
def extract_section_links(
    soup: BeautifulSoup,
    section_id: str,
    asset_id: str,
    rml_id: str,
    section_name: str
) -> Dict[str, str]:
    """
    Extracts a link for a specific section ID by finding the first link
    within its div and verifying that the link belongs directly to that
    section, not to a nested subsection.

    Args:
        soup (BeautifulSoup): The BeautifulSoup object of the index page.
        section_id (str): The HTML ID of the section div.
        asset_id (str): The asset external ID of the dossier.
        rml_id (str): The RML ID of the substance.
        section_name (str): The name to use for the section in the result.

    Returns:
        Dict[str, str]: Dictionary with the link if found and validated,
                        otherwise empty.
    """
    result = {}

    # 1. Find the target div for the section ID
    target_div = soup.find("div", id=section_id)
    if not target_div:
        logging.info(f"QUACKO.search() - extract_section_links(): No div found for id='{section_id}'")
        return result

    # 2. Find the first <a> tag with an href within this target div
    first_link_tag = target_div.find("a", href=True)
    if not first_link_tag:
        logging.info(f"QUACKO.search() - extract_section_links(): No 'a' tag with href found within div id='{section_id}'")
        return result  # No links at all within this section

    # 3. Validate: find the closest ancestor div whose ID starts with "id_".
    #    This tells us which section container the link *actually* resides in.
    #    A lambda is used for the id check; the result may be None if the
    #    structure is unexpected.
    link_ancestor_section_div: Optional[Tag] = first_link_tag.find_parent(
        "div", id=lambda x: x and x.startswith("id_")
    )

    # 4. Compare IDs
    if link_ancestor_section_div and link_ancestor_section_div.get('id') == section_id:
        # The first link found belongs directly to the section we are looking for.
        logging.debug(f"QUACKO.search() - extract_section_links(): Valid link found for id='{section_id}'.")
        a_tag_to_use = first_link_tag  # Use the link we found
    else:
        # The first link found belongs to a *different* (nested) section,
        # or the structure is broken (no ancestor div with an id was found).
        # Therefore the section_id we were originally checking has no direct link.
        ancestor_id = link_ancestor_section_div.get('id') if link_ancestor_section_div else "None"
        logging.info(f"QUACKO.search() - extract_section_links(): First link within id='{section_id}' belongs to ancestor id='{ancestor_id}'. No direct link for '{section_id}'.")
        return result  # Return an empty dict

    # 5. Proceed with link extraction using the validated a_tag_to_use
    try:
        document_id = a_tag_to_use.get('href')  # Use .get() for safety
        if not document_id:
            logging.error(f"QUACKO.search() - extract_section_links(): Found 'a' tag for '{section_name}' has no href attribute.")
            return {}

        # Clean up the document ID
        if document_id.startswith('./documents/'):
            document_id = document_id.replace('./documents/', '')
        if document_id.endswith('.html'):
            document_id = document_id.replace('.html', '')

        # Construct the full URLs unless in local-only mode
        if asset_id == "local" and rml_id == "local":
            result[section_name] = f"Local section found: {document_id}"
        else:
            result[section_name] = f"{QUACKO_HTML_PAGES}/{asset_id}/documents/{document_id}.html"
            result[f"{section_name}_js"] = f"{QUACKO_BASE_URL}/{rml_id}/dossier-view/{asset_id}/{document_id}"

        return result

    except Exception as e:  # Catch potential errors during processing
        logging.error(f"QUACKO.search() - extract_section_links(): Error processing the validated link tag for '{section_name}': {str(e)}")
        return {}
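The ownership check described above can be seen in isolation on a minimal HTML fragment. The ids and nesting below are illustrative, not taken from a real QUACKO page:

```python
from bs4 import BeautifulSoup

html = """
<div id="id_7_Toxicologicalinformation">
  <div id="id_72_AcuteToxicity">
    <a href="./documents/abc123.html">Acute Toxicity</a>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

parent = soup.find("div", id="id_7_Toxicologicalinformation")
first_link = parent.find("a", href=True)

# The nearest "id_" ancestor of the link is the child section,
# not the parent we queried, so the parent has no direct link of its own.
owner = first_link.find_parent("div", id=lambda x: x and x.startswith("id_"))
print(owner.get("id"))  # → id_72_AcuteToxicity
```

Because `owner`'s id differs from the queried `section_id`, `extract_section_links` would return an empty dict here instead of wrongly inheriting the child's link.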
@@ -1,37 +0,0 @@
from typing import Optional
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi
from pymongo.database import Database


#region General MongoDB helper functions

# Database connection function
def connect(user: str, password: str, database: str = "INCI") -> Database:
    # NOTE: the 'database' parameter is currently unused; the 'toxinfo'
    # database is always selected.
    uri = f"mongodb+srv://{user}:{password}@ufs13.dsmvdrx.mongodb.net/?retryWrites=true&w=majority&appName=UFS13"
    client = MongoClient(uri, server_api=ServerApi('1'))
    db = client['toxinfo']
    return db

#endregion

#region Search functions within the local DB

# Search function for the elements extracted from COSING
def value_search(db: Database, value: str, mode: Optional[str] = None) -> Optional[dict]:
    if mode:
        json = db.toxinfo.find_one({mode: value}, {"_id": 0})
        if json:
            return json
        return None
    else:
        modes = ['commonName', 'inciName', 'casNo', 'ecNo', 'chemicalName', 'phEurName']
        for el in modes:
            json = db.toxinfo.find_one({el: value}, {"_id": 0})
            if json:
                return json
        return None

#endregion
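The fallback logic in `value_search` — try one identifier field after another until a document matches — can be sketched without a live MongoDB connection. The sample records and the `find_one` stub below are invented for illustration:

```python
from typing import Optional

# Invented sample documents standing in for the 'toxinfo' collection
RECORDS = [
    {"inciName": "AQUA", "casNo": "7732-18-5"},
    {"inciName": "GLYCERIN", "casNo": "56-81-5"},
]

def find_one(query: dict) -> Optional[dict]:
    """Minimal stand-in for pymongo's find_one()."""
    for doc in RECORDS:
        if all(doc.get(k) == v for k, v in query.items()):
            return doc
    return None

def value_search_stub(value: str) -> Optional[dict]:
    # Try each identifier field in turn, as value_search() does
    for field in ['commonName', 'inciName', 'casNo', 'ecNo', 'chemicalName', 'phEurName']:
        doc = find_one({field: value})
        if doc:
            return doc
    return None

print(value_search_stub("56-81-5"))  # → {'inciName': 'GLYCERIN', 'casNo': '56-81-5'}
```

The first field that yields a match wins, so the order of the `modes` list effectively defines a priority among identifier types.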
@@ -1,149 +0,0 @@
import os
from contextlib import contextmanager
import pubchempy as pcp
from pubchemprops.pubchemprops import get_second_layer_props
import logging

logging.basicConfig(
    format="{asctime} - {levelname} - {message}",
    style="{",
    datefmt="%Y-%m-%d %H:%M",
    filename="echa.log",
    encoding="utf-8",
    filemode="a",
    level=logging.INFO,
)


@contextmanager
def temporary_certificate(cert_path):
    # This is needed because calling the PubChem API requires temporarily
    # changing the certificate used for the requests.

    """
    Context manager to temporarily change the certificate used for requests.

    Args:
        cert_path (str): Path to the certificate file to use temporarily

    Example:
        # Regular request uses default certificates
        requests.get('https://api.example.com')

        # Use custom certificate only within this block
        with temporary_certificate('custom-cert.pem'):
            requests.get('https://api.requiring.custom.cert.com')

        # Back to default certificates
        requests.get('https://api.example.com')
    """
    # Store the original environment variables
    original_ca_bundle = os.environ.get('REQUESTS_CA_BUNDLE')
    original_ssl_cert = os.environ.get('SSL_CERT_FILE')

    try:
        # Set the new certificate
        os.environ['REQUESTS_CA_BUNDLE'] = cert_path
        os.environ['SSL_CERT_FILE'] = cert_path
        yield
    finally:
        # Restore the original environment variables
        if original_ca_bundle is not None:
            os.environ['REQUESTS_CA_BUNDLE'] = original_ca_bundle
        else:
            os.environ.pop('REQUESTS_CA_BUNDLE', None)

        if original_ssl_cert is not None:
            os.environ['SSL_CERT_FILE'] = original_ssl_cert
        else:
            os.environ.pop('SSL_CERT_FILE', None)

def clean_property_data(api_response):
    """
    Simplifies the API response data by flattening nested structures.

    Args:
        api_response (dict): Raw API response containing property data

    Returns:
        dict: Cleaned data with a simplified structure
    """
    cleaned_data = {}

    for property_name, measurements in api_response.items():
        cleaned_measurements = []

        for measurement in measurements:
            cleaned_measurement = {
                'ReferenceNumber': measurement.get('ReferenceNumber'),
                'Description': measurement.get('Description', ''),
            }

            # Handle the Reference field
            if 'Reference' in measurement:
                # Reference may be a list or a string
                ref = measurement['Reference']
                cleaned_measurement['Reference'] = ref[0] if isinstance(ref, list) else ref

            # Handle the Value field
            value = measurement.get('Value', {})
            if isinstance(value, dict) and 'StringWithMarkup' in value:
                cleaned_measurement['Value'] = value['StringWithMarkup'][0]['String']
            else:
                cleaned_measurement['Value'] = str(value)

            # Remove empty values
            cleaned_measurement = {k: v for k, v in cleaned_measurement.items() if v}

            cleaned_measurements.append(cleaned_measurement)

        cleaned_data[property_name] = cleaned_measurements

    return cleaned_data

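As a concrete illustration, PubChem's PUG-View-style responses wrap scalar values in a `StringWithMarkup` list. The helper below mirrors only the Value handling of `clean_property_data`; the measurement itself is an invented sample:

```python
# Invented measurement mimicking PubChem's nested PUG-View shape
measurement = {
    'ReferenceNumber': 42,
    'Description': 'Peer reviewed',
    'Value': {'StringWithMarkup': [{'String': '17.8 °C'}]},
}

def flatten_value(value):
    """Mirror of the Value handling in clean_property_data()."""
    # Unwrap the nested StringWithMarkup structure if present,
    # otherwise fall back to a plain string conversion.
    if isinstance(value, dict) and 'StringWithMarkup' in value:
        return value['StringWithMarkup'][0]['String']
    return str(value)

print(flatten_value(measurement['Value']))  # → 17.8 °C
```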
def pubchem_dap(cas):
    '''
    Given an input CAS number, looks up the safety-data-sheet information on PubChem.
    First-level properties (synonyms, cid, logP, MolecularWeight, ExactMass, TPSA)
    are extracted with Pubchempy; second-level ones (Melting Point) with pubchemprops.

    args:
        cas : string
    '''
    with temporary_certificate('src/data/ncbi-nlm-nih-gov-catena.pem'):
        try:
            # Initial search
            out = pcp.get_synonyms(cas, 'name')
            if out:
                out = out[0]
                output = {'CID': out['CID'],
                          'CAS': cas,
                          'first_pubchem_name': out['Synonym'][0],
                          'pubchem_link': f"https://pubchem.ncbi.nlm.nih.gov/compound/{out['CID']}"}
            else:
                return f'No results on PubChem for {cas}'

        except Exception:
            logging.error(f'various_utils.pubchem.pubchem_dap(). Some error during pubchem search for {cas}', exc_info=True)
            # Without a CID there is nothing further to look up
            return None

        try:
            # Property lookup
            properties = pcp.get_properties(['xlogp', 'molecular_weight', 'tpsa', 'exact_mass'], identifier=out['CID'], namespace='cid', searchtype=None, as_dataframe=False)
            if properties:
                output = {**output, **properties[0]}
            else:
                return output
        except Exception:
            logging.error(f'various_utils.pubchem.pubchem_dap(). Some error during pubchem first level properties extraction for {cas}', exc_info=True)

        try:
            # Melting Point lookup
            second_layer_props = get_second_layer_props(output['first_pubchem_name'], ['Melting Point', 'Dissociation Constants', 'pH'])
            if second_layer_props:
                second_layer_props = clean_property_data(second_layer_props)
                output = {**output, **second_layer_props}
        except Exception:
            logging.error(f'various_utils.pubchem.pubchem_dap(). Some error during pubchem second level properties extraction (Melting Point) for {cas}', exc_info=True)

        return output
@@ -1,182 +0,0 @@
import json as js
import re
import requests as req
from typing import Union


#region Function that processes a list of CAS numbers taken from COSING (thanks Jem)

def parse_cas_numbers(cas_string: list) -> list:

    # Since we ensure externally that at least one CAS exists, we can take the string
    cas_string = cas_string[0]

    # Remove parentheses and their content
    cas_string = re.sub(r"\([^)]*\)", "", cas_string)

    # Split on the various possible separators
    cas_parts = re.split(r"[/;,]", cas_string)

    # Build a list from the parts above, removing excess whitespace
    cas_list = [cas.strip() for cas in cas_parts]

    # Some CAS numbers are separated by a double dash (--) that must be removed;
    # this has to be done now, as a second pass
    if len(cas_list) == 1 and "--" in cas_list[0]:
        cas_list = [cas.strip() for cas in cas_list[0].split("--")]

    # Some CAS entries hold invalid values ("-"), so we find and remove them
    if cas_list:
        while "-" in cas_list:
            cas_list.remove("-")

    return cas_list
#endregion

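For example, a raw COSING CAS entry such as `"7732-18-5 (water) / 56-81-5; -"` (an invented sample) reduces in three steps. The demo below mirrors the steps of `parse_cas_numbers`:

```python
import re

def parse_cas_numbers_demo(cas_values: list) -> list:
    """Same steps as parse_cas_numbers(), shown on an invented sample."""
    s = cas_values[0]
    s = re.sub(r"\([^)]*\)", "", s)            # drop parenthesised notes
    parts = [p.strip() for p in re.split(r"[/;,]", s)]
    if len(parts) == 1 and "--" in parts[0]:   # split double-dash pairs
        parts = [p.strip() for p in parts[0].split("--")]
    return [p for p in parts if p != "-"]      # drop invalid "-" entries

print(parse_cas_numbers_demo(["7732-18-5 (water) / 56-81-5; -"]))
# → ['7732-18-5', '56-81-5']
```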
#region Function to run a search directly on COSING

# The first argument is the string to search for; the second selects the search type
def cosing_search(text: str, mode: str = "name") -> Union[list, dict, None]:
    cosing_post_req = "https://api.tech.ec.europa.eu/search-api/prod/rest/search?apiKey=285a77fd-1257-4271-8507-f0c6b2961203&text=*&pageSize=100&pageNumber=1"
    agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0"

    # The default search mode is by name, whether INCI or any other kind
    if mode == "name":
        search_query = {"bool":
                        {"must": [
                            {"text":
                             {"query": f"{text}", "fields":
                              ["inciName.exact", "inciUsaName", "innName.exact", "phEurName", "chemicalName", "chemicalDescription"],
                              "defaultOperator": "AND"}}]}}

    # When searching by CAS or EC number the request payload is different
    elif mode in ["cas", "ec"]:
        search_query = {"bool": {"must": [{"text": {"query": f"*{text}*", "fields": ["casNo", "ecNo"]}}]}}

    # Searching by ID is needed wherever the identified ingredients must be retrieved
    elif mode == "id":
        search_query = {"bool": {"must": [{"term": {"substanceId": f"{text}"}}]}}

    # If the given mode is not supported, raise an error
    else:
        raise ValueError(f"Unsupported search mode: {mode}")

    # Build the request payload
    files = {"query": ("query", js.dumps(search_query), "application/json")}

    # Run the search POST
    risposta = req.post(cosing_post_req, headers={"User-Agent": agent, "Connection": "keep-alive"}, files=files)
    risposta = risposta.json()
    if risposta["results"]:
        return risposta["results"][0]["metadata"]

    # The function returns None when the search yields no results
    return None
#endregion

#region Function to clean a COSING json and return it

def clean_cosing(json: dict, full: bool = True) -> dict:

    # Define the fields of interest, split according to the type of output we want
    string_cols = ["itemType", "nameOfCommonIngredientsGlossary", "inciName", "phEurName", "chemicalName", "innName", "substanceId", "refNo"]
    list_cols = ["casNo", "ecNo", "functionName", "otherRestrictions", "sccsOpinion", "sccsOpinionUrls", "identifiedIngredient", "annexNo", "otherRegulations"]

    # Build a list with all the fields to loop over
    total_keys = string_cols + list_cols

    # Base of the URL needed to build the substance's COSING link
    base_url = "https://ec.europa.eu/growth/tools-databases/cosing/details/"
    clean_json = {}

    # Loop over all the fields of interest
    for key in total_keys:

        # Some fields contain a useless placeholder that only takes up space,
        # so we remove it
        while "<empty>" in json[key]:
            json[key].remove("<empty>")

        # If the output field must be a list, COSING's empty lists are acceptable values
        if key in list_cols:
            value = json[key]

            # CAS and EC numbers are special cases that need extra processing
            if key in ["casNo", "ecNo"]:
                if value:
                    value = parse_cas_numbers(value)

            # Where identifiedIngredient entries are present, complete the output json
            # directly, but only when the "full" flag is true
            elif key == "identifiedIngredient":
                if full:
                    if value:
                        value = identified_ingredients(value)

            clean_json[key] = value

        else:

            # This field name was too long, so it is simplified
            if key == "nameOfCommonIngredientsGlossary":
                nKey = "commonName"

            # Since only some fields are renamed, the others keep their original name
            else:
                nKey = key

            # We want a plain string in the output, and indexing an empty list would fail,
            # so first check that the COSING list contains values
            if json[key]:
                clean_json[nKey] = json[key][0]
            else:
                clean_json[nKey] = ""

    # The cosingUrl field does not exist yet; build it by joining the substance ID to the base URL
    clean_json["cosingUrl"] = f"{base_url}{json['substanceId'][0]}"

    return clean_json
#endregion

#region Function to complete, when needed, a COSING json

def identified_ingredients(id_list: list) -> list:

    identified = []

    # Run a search for each of the ingredients in identifiedIngredient
    for id in id_list:

        ingredient = cosing_search(id, "id")

        if ingredient:

            # Clean the json we just found
            ingredient = clean_cosing(ingredient, full=False)

            # Save the cleaned document in the list
            identified.append(ingredient)

    # Once the list is populated with the identifiedIngredient objects, return it
    return identified
#endregion
old/echa_find.py

@@ -1,223 +0,0 @@
import requests
import urllib.parse
import re as standardre
import json
from bs4 import BeautifulSoup
from datetime import datetime
from pif_compiler.functions.common_log import get_logger

logger = get_logger()

# Function that looks up the dossier given an input CAS number, substance name, or EC number
def search_dossier(substance, input_type='rmlCas'):
    results = {}  # The dictionary returned at the end

    # Part one: obtain rmlId and rmlName
    # st.code('https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText='+substance)
    req_0 = requests.get(
        "https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText="
        + urllib.parse.quote(substance)  # must be URL-encoded for the web
    )

    logger.info(f'echaFind.search_dossier(). searching "{substance}"')

    # The first step is to run a search with the substance name, URL-encoded via urllib
    req_0_json = req_0.json()
    try:
        # Extract the fields we need from the response
        rmlId = req_0_json["items"][0]["substanceIndex"]["rmlId"]
        rmlName = req_0_json["items"][0]["substanceIndex"]["rmlName"]
        rmlCas = req_0_json["items"][0]["substanceIndex"]["rmlCas"]
        rmlEc = req_0_json["items"][0]["substanceIndex"]["rmlEc"]

        results['search_response'] = f"https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText={urllib.parse.quote(substance)}"
        results["rmlId"] = rmlId
        results["rmlName"] = rmlName
        results["rmlCas"] = rmlCas
        results["rmlEc"] = rmlEc

        logger.info(
            f"echaFind.search_dossier(). found substance on ECHA. rmlId: '{rmlId}', rmlName: '{rmlName}', rmlCas: '{rmlCas}'"
        )
    except (KeyError, IndexError):
        logger.info(
            f"echaFind.search_dossier(). could not find substance for '{substance}'"
        )
        return False

    # Update: in some cases, entering a CAS number could match a substance whose EC code
    # happened to equal the input CAS. We now check that the substance found really has
    # a CAS equal to the one given in input. It is also possible to search by rmlName
    # (substance name) or EC (rmlEc): just specify in input_type what you are searching by.
    if results[input_type] != substance:
        logger.error(f'echa.echaFind.search_dossier(): results[{input_type}] "{results[input_type]}" is not equal to "{substance}".')
        return f'search_error. results[{input_type}] ("{results[input_type]}") is not equal to "{substance}". Maybe you specified the wrong input_type. Check the results here: https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText={urllib.parse.quote(substance)}'

    # Part two: search the ECHA site for dossiers, building a link with the ID obtained above.
    req_1_url = (
        "https://chem.echa.europa.eu/api-dossier-list/v1/dossier?pageIndex=1&pageSize=100&rmlId="
        + rmlId
        + "&registrationStatuses=Active"
    )  # Search the active dossiers first.

    req_1 = requests.get(req_1_url)
    req_1_json = req_1.json()

    # If no active dossiers exist, search the inactive ones
    if req_1_json["items"] == []:
        logger.info(
            f"echaFind.search_dossier(). could not find an active dossier for '{substance}'. Proceeding to search the inactive ones."
        )
        req_1_url = (
            "https://chem.echa.europa.eu/api-dossier-list/v1/dossier?pageIndex=1&pageSize=100&rmlId="
            + rmlId
            + "&registrationStatuses=Inactive"
        )
        req_1 = requests.get(req_1_url)
        req_1_json = req_1.json()
        if req_1_json["items"] == []:
            logger.info(
                f"echaFind.search_dossier(). could not find inactive dossiers for '{substance}'"
            )  # Found neither inactive nor active dossiers
            return False
        else:
            logger.info(
                f"echaFind.search_dossier(). found inactive dossiers for '{rmlName}'"
            )
            results["dossierType"] = "Inactive"

    else:
        logger.info(
            f"echaFind.search_dossier(). found active dossiers for '{substance}'"
        )
        results["dossierType"] = "Active"

# Queste erano le due robe che mi servivano
|
||||
assetExternalId = req_1_json["items"][0]["assetExternalId"]
|
||||
|
||||
# UPDATE: Per ottenere la data dell'ultima modifica: serve per capire se abbiamo già dei file aggiornati scaricati in locale
|
||||
# confrontare data di scraping e ultimo aggiornato (se prima o dopo)
|
||||
|
||||
try:
|
||||
lastUpdateDate = req_1_json["items"][0]["lastUpdatedDate"]
|
||||
datetime_object = datetime.fromisoformat(lastUpdateDate.replace('Z', '+00:00')) # Handle 'Z' if present, else it might break on older python versions
|
||||
lastUpdateDate = datetime_object.date().isoformat()
|
||||
results['lastUpdateDate'] = lastUpdateDate
|
||||
except:
|
||||
logger.error(f"echa.echaFind(). Could not find lastUpdateDate for the dossier")
|
||||
|
||||
rootKey = req_1_json["items"][0]["rootKey"]
|
||||
|
||||

    # HTML SECTION

    # Part three: use the assetExternalId.
    # "With the assetExternalId we can reach the dossier's main page."
    # "From that page we must scrape the ID of the toxicological summary, IF IT EXISTS."
    results["index"] = (
        "https://chem.echa.europa.eu/html-pages/" + assetExternalId + "/index.html"
    )
    results["index_js"] = (
        f"https://chem.echa.europa.eu/{rmlId}/dossier-view/{assetExternalId}"
    )

    req_2 = requests.get(
        "https://chem.echa.europa.eu/html-pages/" + assetExternalId + "/index.html"
    )
    index = BeautifulSoup(req_2.text, "html.parser")

    # Part four: extract the toxicological-summary ID from index.html.
    # "Out of all that HTML we only care about one div. BeautifulSoup struggles when
    # divs are nested too deeply, so we use a combination of it and a regex."
    div = index.find_all("div", id=["id_7_Toxicologicalinformation"])
    str_div = str(div)
    str_div = str_div.split("</div>")[0]

    # UIC is the ID of the toxicological summary.
    uic_found = False
    uic_match = standardre.search('href="([^"]+)"', str_div)  # a regex to find the href we need
    if uic_match is None:
        logger.info(
            "echaFind.search_dossier(). Could not find 'id_7_Toxicologicalinformation' in the body"
        )
    else:
        UIC = uic_match.group(1)
        uic_found = True

    # Acute toxicity
    acute_toxicity_found = False
    div_acute_toxicity = index.find_all("div", id=["id_72_AcuteToxicity"])
    if div_acute_toxicity:
        for div in div_acute_toxicity:
            try:
                a = div.find_all("a", href=True)[0]
                acute_toxicity_id = standardre.search('href="([^"]+)"', str(a)).group(1)
                acute_toxicity_found = True
            except Exception:
                logger.info(
                    f"echaFind.search_dossier(). No acute_toxicity_id found from index for {substance}"
                )

    # Repeated dose
    repeated_dose_found = False
    div_repeated_dose = index.find_all("div", id=["id_75_Repeateddosetoxicity"])
    if div_repeated_dose:
        for div in div_repeated_dose:
            try:
                a = div.find_all("a", href=True)[0]
                repeated_dose_id = standardre.search('href="([^"]+)"', str(a)).group(1)
                repeated_dose_found = True
            except Exception:
                logger.info(
                    f"echaFind.search_dossier(). No repeated_dose_id found from index for {substance}"
                )

    # Part five: build the links to the toxicological dossier pages and return the content.

    if acute_toxicity_found:
        acute_toxicity_link = (
            "https://chem.echa.europa.eu/html-pages/"
            + assetExternalId
            + "/documents/"
            + acute_toxicity_id
            + ".html"
        )
        results["AcuteToxicity"] = acute_toxicity_link
        # There are two different links: the plain-HTML one is ugly but has the
        # information in readable form, while the JS one is the nicer version
        # presented to the user, used to build the pretty PDF.
        results["AcuteToxicity_js"] = (
            f"https://chem.echa.europa.eu/{rmlId}/dossier-view/{assetExternalId}/{acute_toxicity_id}"
        )

    if uic_found:
        # UIC is the ID of the toxicological summary.
        final_url = (
            "https://chem.echa.europa.eu/html-pages/"
            + assetExternalId
            + "/documents/"
            + UIC
            + ".html"
        )
        results["ToxSummary"] = final_url
        results["ToxSummary_js"] = (
            f"https://chem.echa.europa.eu/{rmlId}/dossier-view/{assetExternalId}/{UIC}"
        )

    if repeated_dose_found:
        results["RepeatedDose"] = (
            "https://chem.echa.europa.eu/html-pages/"
            + assetExternalId
            + "/documents/"
            + repeated_dose_id
            + ".html"
        )
        results["RepeatedDose_js"] = (
            f"https://chem.echa.europa.eu/{rmlId}/dossier-view/{assetExternalId}/{repeated_dose_id}"
        )

    json_formatted_str = json.dumps(results)
    logger.info(f"echaFind.search_dossier() OK. output: {json_formatted_str}")
    return results

if __name__ == "__main__":
    search_dossier("100-41-4", input_type='rmlCas')
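A hedged sketch of how a caller can interpret what `search_dossier` returns without touching the network: it yields `False` on failure, otherwise the `results` dict built above, whose keys include `dossierType` and, when found, `ToxSummary` / `AcuteToxicity` / `RepeatedDose` links. The helper name here is hypothetical, not part of the module:

```python
def summarize_dossier(result) -> str:
    """Summarize a search_dossier() result (False, or the results dict built above)."""
    if not result:
        return "no dossier found"
    sections = [k for k in ("ToxSummary", "AcuteToxicity", "RepeatedDose") if k in result]
    return f"{result.get('dossierType', '?')} dossier, sections: {', '.join(sections) or 'none'}"
```

This keeps the "falsy means failure" convention of the original in one place instead of re-checking it at every call site.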

477  old/echa_pdf.py

@@ -1,477 +0,0 @@
import os
import base64
import traceback
import logging  # Import logging module
import datetime
import pandas as pd
# import time  # Keep if you use page.wait_for_timeout
from playwright.sync_api import sync_playwright, TimeoutError  # Catch specific errors
from pif_compiler.services.echa_find import search_dossier
import requests

# --- Basic Logging Setup (Commented Out) ---
# # Configure logging - uncomment and customize level/handler as needed
# logging.basicConfig(
#     level=logging.INFO,  # Or DEBUG for more details
#     format='%(asctime)s - %(levelname)s - %(message)s',
#     # filename='pdf_generator.log',  # Optional: Log to a file
#     # filemode='a'
# )
# --- End Logging Setup ---


# Assume svg_to_data_uri is defined elsewhere correctly
def svg_to_data_uri(svg_path):
    try:
        if not os.path.exists(svg_path):
            # logging.error(f"SVG file not found: {svg_path}")  # Example logging
            raise FileNotFoundError(f"SVG file not found: {svg_path}")
        with open(svg_path, 'rb') as f:
            svg_content = f.read()
        encoded_svg = base64.b64encode(svg_content).decode('utf-8')
        return f"data:image/svg+xml;base64,{encoded_svg}"
    except Exception as e:
        print(f"Error converting SVG {svg_path}: {e}")
        # logging.error(f"Error converting SVG {svg_path}: {e}", exc_info=True)  # Example logging
        return None
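The encoding step inside `svg_to_data_uri` can be sketched in isolation on in-memory bytes (no file I/O), which makes the data-URI construction easy to verify; the function name here is hypothetical:

```python
import base64

def bytes_to_svg_data_uri(svg_content: bytes) -> str:
    """Same encoding as svg_to_data_uri, but taking the SVG bytes directly."""
    encoded = base64.b64encode(svg_content).decode("utf-8")
    return f"data:image/svg+xml;base64,{encoded}"
```

Keeping the pure encoding separate from file access is also what makes the file-reading wrapper trivially testable.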


# --- JavaScript Expressions ---

# Define the cleanup logic as an immediately-invoked arrow function expression
# NOTE: .das-block_empty removal is currently disabled as per previous step
cleanup_js_expression = """
() => {
    console.log('Running cleanup JS (DISABLED .das-block_empty removal)...');
    let totalRemoved = 0;

    // Example 1: Remove sections explicitly marked as empty (Currently Disabled)
    // const emptyBlocks = document.querySelectorAll('.das-block_empty');
    // emptyBlocks.forEach(el => {
    //     if (el && el.parentNode) {
    //         console.log(`Removing '.das-block_empty' block with ID: ${el.id || 'N/A'}`);
    //         el.remove();
    //         totalRemoved++;
    //     }
    // });

    // Add other specific cleanup logic here if needed

    console.log(`Cleanup script removed ${totalRemoved} elements (DISABLED .das-block_empty removal).`);
    return totalRemoved;  // Return the count
}
"""
# --- End JavaScript Expressions ---

def generate_pdf_with_header_and_cleanup(
    url,
    pdf_path,
    substance_name,
    substance_link,
    ec_number,
    cas_number,
    header_template_path=r"src\func\resources\injectableHeader.html",
    echa_chem_logo_path=r"src\func\resources\echa_chem_logo.svg",
    echa_logo_path=r"src\func\resources\ECHA_Logo.svg"
) -> bool:  # Added return type hint
    """
    Generates a PDF with a dynamic header and optionally removes empty sections.
    Provides basic logging (commented out) and returns True/False for success/failure.

    Args:
        url (str): The target URL OR local HTML file path.
        pdf_path (str): The output PDF path.
        substance_name (str): The name of the chemical substance.
        substance_link (str): The URL the substance name should link to (in header).
        ec_number (str): The EC number for the substance.
        cas_number (str): The CAS number for the substance.
        header_template_path (str): Path to the HTML header template file.
        echa_chem_logo_path (str): Path to the echa_chem_logo.svg file.
        echa_logo_path (str): Path to the ECHA_Logo.svg file.

    Returns:
        bool: True if the PDF was generated successfully, False otherwise.
    """
    final_header_html = None
    # logging.info(f"Starting PDF generation for URL: {url} to path: {pdf_path}")  # Example logging

    # --- 1. Prepare Header HTML ---
    try:
        # logging.debug(f"Reading header template from: {header_template_path}")  # Example logging
        print(f"Reading header template from: {header_template_path}")
        if not os.path.exists(header_template_path):
            raise FileNotFoundError(f"Header template file not found: {header_template_path}")
        with open(header_template_path, 'r', encoding='utf-8') as f:
            header_template_content = f.read()
        if not header_template_content:
            raise ValueError("Header template file is empty.")

        # logging.debug("Converting logos...")  # Example logging
        print("Converting logos...")
        logo1_data_uri = svg_to_data_uri(echa_chem_logo_path)
        logo2_data_uri = svg_to_data_uri(echa_logo_path)
        if not logo1_data_uri or not logo2_data_uri:
            raise ValueError("Failed to convert one or both logos to Data URIs.")

        # logging.debug("Replacing placeholders...")  # Example logging
        print("Replacing placeholders...")
        final_header_html = header_template_content.replace("##ECHA_CHEM_LOGO_SRC##", logo1_data_uri)
        final_header_html = final_header_html.replace("##ECHA_LOGO_SRC##", logo2_data_uri)
        final_header_html = final_header_html.replace("##SUBSTANCE_NAME##", substance_name)
        final_header_html = final_header_html.replace("##SUBSTANCE_LINK##", substance_link)
        final_header_html = final_header_html.replace("##EC_NUMBER##", ec_number)
        final_header_html = final_header_html.replace("##CAS_NUMBER##", cas_number)

        if "##" in final_header_html:
            print("Warning: Not all placeholders seem replaced in the header HTML.")
            # logging.warning("Not all placeholders seem replaced in the header HTML.")  # Example logging

    except Exception as e:
        print(f"Error during header setup phase: {e}")
        traceback.print_exc()
        # logging.error(f"Error during header setup phase: {e}", exc_info=True)  # Example logging
        return False  # Return False on header setup failure
    # --- End Header Prep ---

    # --- CSS Override Definition ---
    # Using Revision 4 from previous step (simplified breaks, boundary focus)
    selectors_to_fix = [
        '.das-field .das-field_value_html',
        '.das-field .das-field_value_large',
        '.das-field .das-value_remark-text'
    ]
    css_selector_string = ",\n".join(selectors_to_fix)
    css_override = f"""
    <style id='pdf-override-styles'>
    /* Basic Resets & Overflows */
    html, body {{ height: auto !important; overflow: visible !important; margin: 0 !important; padding: 0 !important; }}
    * {{ box-sizing: border-box; }}
    {css_selector_string} {{
        overflow: visible !important; overflow-y: visible !important; height: auto !important; max-height: none !important;
    }}
    /* Boundary Fix */
    #pdf-custom-header {{ margin-bottom: 0 !important; padding-bottom: 1px !important; page-break-after: auto !important; display: block !important; }}
    #pdf-custom-header + .body-inner {{ margin-top: 0 !important; padding-top: 0 !important; page-break-before: auto !important; display: block !important; }}
    .body-inner .document-header {{ margin-top: 0 !important; padding-top: 0 !important; page-break-before: auto !important; }}
    /* Simplified Page Breaks */
    .body-inner h1, .body-inner h2, .body-inner h3, .body-inner h4, .body-inner h5, .body-inner h6 {{ page-break-after: avoid !important; }}
    #pdf-custom-header h2 {{ page-break-after: auto !important; }}
    @media print {{
        html, body {{ height: auto !important; overflow: visible !important; margin: 0; padding: 0; }}
        #pdf-custom-header {{ margin-bottom: 0 !important; padding-bottom: 1px !important; page-break-after: auto !important; display: block !important;}}
        #pdf-custom-header + .body-inner {{ margin-top: 0 !important; padding-top: 0 !important; page-break-before: auto !important; display: block !important; }}
        .body-inner .document-header {{ margin-top: 0 !important; padding-top: 0 !important; page-break-before: auto !important; }}
        .body-inner h1, .body-inner h2, .body-inner h3, .body-inner h4, .body-inner h5, .body-inner h6 {{ page-break-after: avoid !important; }}
        #pdf-custom-header h2 {{ page-break-after: auto !important; }}
        .das-doc-toolbar, .document-header__section-links, #das-totop {{ display: none !important; }}
    }}
    </style>
    """
    # --- End CSS Override Definition ---

    # --- Playwright Automation ---
    try:
        with sync_playwright() as p:
            # logging.debug("Launching browser...")  # Example logging
            # browser = p.chromium.launch(headless=False, devtools=True)  # For debugging
            browser = p.chromium.launch()
            page = browser.new_page()
            # Capture console messages (Corrected: use msg.text)
            page.on("console", lambda msg: print(f"Browser Console: {msg.text}"))

            try:
                # logging.info(f"Navigating to page: {url}")  # Example logging
                print(f"Navigating to: {url}")
                if os.path.exists(url) and not url.startswith('file://'):
                    page_url = f'file://{os.path.abspath(url)}'
                    # logging.info(f"Treating as local file: {page_url}")  # Example logging
                    print(f"Treating as local file: {page_url}")
                else:
                    page_url = url

                page.goto(page_url, wait_until='load', timeout=90000)
                # logging.info("Page navigation complete.")  # Example logging

                # logging.debug("Injecting header HTML...")  # Example logging
                print("Injecting header HTML...")
                page.evaluate('(headerHtml) => { document.body.insertAdjacentHTML("afterbegin", headerHtml); }', final_header_html)

                # logging.debug("Injecting CSS overrides...")  # Example logging
                print("Injecting CSS overrides...")
                page.evaluate("""(css) => {
                    const existingStyle = document.getElementById('pdf-override-styles');
                    if (existingStyle) existingStyle.remove();
                    document.head.insertAdjacentHTML('beforeend', css);
                }""", css_override)

                # logging.debug("Running JavaScript cleanup function...")  # Example logging
                print("Running JavaScript cleanup function...")
                elements_removed_count = page.evaluate(cleanup_js_expression)
                # logging.info(f"Cleanup script finished (reported removing {elements_removed_count} elements).")  # Example logging
                print(f"Cleanup script finished (reported removing {elements_removed_count} elements).")

                # --- Optional: Emulate Print Media ---
                # print("Emulating print media...")
                # page.emulate_media(media='print')

                # --- Generate PDF ---
                # logging.info(f"Generating PDF: {pdf_path}")  # Example logging
                print(f"Generating PDF: {pdf_path}")
                pdf_options = {
                    "path": pdf_path, "format": "A4", "print_background": True,
                    "margin": {'top': '20px', 'bottom': '20px', 'left': '20px', 'right': '20px'},
                    "scale": 1.0
                }
                page.pdf(**pdf_options)
                # logging.info(f"PDF saved successfully to: {pdf_path}")  # Example logging
                print(f"PDF saved successfully to: {pdf_path}")

                # logging.debug("Closing browser.")  # Example logging
                print("Closing browser.")
                browser.close()
                return True  # Indicate success

            except TimeoutError as e:
                print(f"A Playwright TimeoutError occurred: {e}")
                traceback.print_exc()
                # logging.error(f"Playwright TimeoutError occurred: {e}", exc_info=True)  # Example logging
                browser.close()  # Ensure browser is closed on error
                return False  # Indicate failure
            except Exception as e:  # Catch other potential errors during Playwright page operations
                print(f"An unexpected error occurred during Playwright page operations: {e}")
                traceback.print_exc()
                # logging.error(f"Unexpected error during Playwright page operations: {e}", exc_info=True)  # Example logging
                # Optional: Save HTML state on error
                try:
                    html_content = page.content()
                    error_html_path = pdf_path.replace('.pdf', '_error.html')
                    with open(error_html_path, 'w', encoding='utf-8') as f_err:
                        f_err.write(html_content)
                    # logging.info(f"Saved HTML state on error to: {error_html_path}")  # Example logging
                    print(f"Saved HTML state on error to: {error_html_path}")
                except Exception as save_e:
                    # logging.error(f"Could not save HTML state on error: {save_e}", exc_info=True)  # Example logging
                    print(f"Could not save HTML state on error: {save_e}")
                browser.close()  # Ensure browser is closed on error
                return False  # Indicate failure
            # Note: cleanup of the 'with sync_playwright()' context
            # is handled automatically by the 'with' statement.

    except Exception as e:
        # Catch errors during Playwright startup (less common)
        print(f"An error occurred during Playwright setup/teardown: {e}")
        traceback.print_exc()
        # logging.error(f"Error during Playwright setup/teardown: {e}", exc_info=True)  # Example logging
        return False  # Indicate failure

# --- Example Usage ---
# result = generate_pdf_with_header_and_cleanup(
#     url='path/to/your/input.html',
#     pdf_path='output.pdf',
#     substance_name='Glycerol Example',
#     substance_link='http://example.com/glycerol',
#     ec_number='200-289-5',
#     cas_number='56-81-5',
# )
#
# if result:
#     print("PDF Generation Succeeded.")
#     # logging.info("Main script: PDF Generation Succeeded.")  # Example logging
# else:
#     print("PDF Generation Failed.")
#     # logging.error("Main script: PDF Generation Failed.")  # Example logging

def search_generate_pdfs(
    cas_number_to_search: str,
    page_types_to_extract: list[str],
    base_output_folder: str = "data/library"
) -> bool:
    """
    Searches for a substance by CAS, saves raw HTML and generates PDFs for
    specified page types. Uses '_js' link variant for the PDF header link if available.

    Args:
        cas_number_to_search (str): CAS number to search for.
        page_types_to_extract (list[str]): List of page type names (e.g., 'RepeatedDose').
            Expects '{page_type}' and '{page_type}_js' keys in the search result.
        base_output_folder (str): Root directory for saving HTML/PDFs.

    Returns:
        bool: True if substance found and >=1 requested PDF generated, False otherwise.
    """
    # logging.info(f"Starting process for CAS: {cas_number_to_search}")
    print(f"\n===== Processing request for CAS: {cas_number_to_search} =====")

    # --- 1. Search for Dossier Information ---
    try:
        # logging.debug(f"Calling search_dossier for CAS: {cas_number_to_search}")
        search_result = search_dossier(substance=cas_number_to_search, input_type='rmlCas')
    except Exception as e:
        print(f"Error during dossier search for CAS '{cas_number_to_search}': {e}")
        traceback.print_exc()
        # logging.error(f"Exception during search_dossier for CAS '{cas_number_to_search}': {e}", exc_info=True)
        return False

    if not search_result:
        print(f"Substance not found or search failed for CAS: {cas_number_to_search}")
        # logging.warning(f"Substance not found or search failed for CAS: {cas_number_to_search}")
        return False

    # logging.info(f"Substance found for CAS: {cas_number_to_search}")
    print(f"Substance found: {search_result.get('rmlName', 'N/A')}")

    # --- 2. Extract Details and Filter Pages ---
    try:
        # Extract required info
        rml_id = search_result.get('rmlId')
        rml_name = search_result.get('rmlName')
        rml_cas = search_result.get('rmlCas')
        rml_ec = search_result.get('rmlEc')
        asset_ext_id = search_result.get('assetExternalId')

        # Basic validation
        if not all([rml_id, rml_name, rml_cas, rml_ec, asset_ext_id]):
            missing_keys = [k for k, v in {'rmlId': rml_id, 'rmlName': rml_name, 'rmlCas': rml_cas, 'rmlEc': rml_ec, 'assetExternalId': asset_ext_id}.items() if not v]
            message = f"Search result for {cas_number_to_search} is missing required keys: {missing_keys}"
            print(f"Error: {message}")
            # logging.error(message)
            return False

        # --- Filtering Logic - Collect pairs of URLs ---
        pages_to_process_list = []  # Store tuples: (page_name, raw_url, js_url)
        # logging.debug(f"Filtering pages. Requested: {page_types_to_extract}.")

        for page_type in page_types_to_extract:
            raw_url_key = page_type
            js_url_key = f"{page_type}_js"

            raw_url = search_result.get(raw_url_key)
            js_url = search_result.get(js_url_key)  # Get the JS URL

            # Check if both URLs are valid strings
            if raw_url and isinstance(raw_url, str) and raw_url.strip():
                if js_url and isinstance(js_url, str) and js_url.strip():
                    pages_to_process_list.append((page_type, raw_url, js_url))
                    # logging.debug(f"Found valid pair for '{page_type}': Raw='{raw_url}', JS='{js_url}'")
                else:
                    # Found raw URL but not a valid JS URL - skip this page type for PDF?
                    # Or use raw_url for header too? Let's skip if JS URL is missing/invalid.
                    print(f"Found raw URL for '{page_type}' but missing/invalid JS URL ('{js_url}'). Skipping PDF generation for this type.")
                    # logging.warning(f"Missing/invalid JS URL for page type '{page_type}' for {rml_cas}. Raw URL: '{raw_url}'.")
            else:
                # Raw URL missing or invalid
                if page_type in search_result:  # Check if key existed at all
                    print(f"Found page type key '{page_type}' for {rml_cas}, but its value is not a valid URL ('{raw_url}'). Skipping.")
                    # logging.warning(f"Invalid raw URL value for page type '{page_type}' for {rml_cas}: '{raw_url}'.")
                else:
                    print(f"Requested page type key '{page_type}' not found in search results for {rml_cas}.")
                    # logging.warning(f"Requested page type key '{page_type}' not found for {rml_cas}.")
        # --- End Filtering Logic ---

        if not pages_to_process_list:
            print(f"After filtering, no requested page types ({page_types_to_extract}) resulted in a valid pair of Raw and JS URLs for substance {rml_cas}.")
            # logging.warning(f"No pages with valid URL pairs to process for substance {rml_cas}.")
            return False  # Nothing to generate

    except Exception as e:
        print(f"Error processing search result for '{cas_number_to_search}': {e}")
        traceback.print_exc()
        # logging.error(f"Error processing search result for '{cas_number_to_search}': {e}", exc_info=True)
        return False

    # --- 3. Prepare Folders ---
    safe_cas = rml_cas.replace('/', '_').replace('\\', '_')
    substance_folder_name = f"{safe_cas}_{rml_ec}_{rml_id}"
    substance_folder_path = os.path.join(base_output_folder, substance_folder_name)

    try:
        os.makedirs(substance_folder_path, exist_ok=True)
        # logging.info(f"Ensured output directory exists: {substance_folder_path}")
        print(f"Ensured output directory exists: {substance_folder_path}")
    except OSError as e:
        print(f"Error creating directory {substance_folder_path}: {e}")
        # logging.error(f"Failed to create directory {substance_folder_path}: {e}", exc_info=True)
        return False

    # --- 4. Process Each Page (Save HTML, Generate PDF) ---
    successful_pages = []  # Track successful PDF generations
    overall_success = False  # Track if any PDF was generated

    for page_name, raw_html_url, js_header_link in pages_to_process_list:
        print(f"\nProcessing page: {page_name}")
        base_filename = f"{safe_cas}_{page_name}"
        html_filename = f"{base_filename}.html"
        pdf_filename = f"{base_filename}.pdf"
        html_full_path = os.path.join(substance_folder_path, html_filename)
        pdf_full_path = os.path.join(substance_folder_path, pdf_filename)

        # --- Save Raw HTML ---
        html_saved = False
        try:
            # logging.debug(f"Fetching raw HTML for {page_name} from {raw_html_url}")
            print(f"Fetching raw HTML from: {raw_html_url}")
            # Add headers to mimic a browser slightly if needed
            headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
            response = requests.get(raw_html_url, timeout=30, headers=headers)  # 30s timeout
            response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

            # Decide encoding - response.text tries to guess, or use apparent_encoding.
            # Or assume utf-8 if unsure, which is common.
            html_content = response.content.decode('utf-8', errors='replace')

            with open(html_full_path, 'w', encoding='utf-8') as f:
                f.write(html_content)
            html_saved = True
            # logging.info(f"Successfully saved raw HTML for {page_name} to {html_full_path}")
            print(f"Successfully saved raw HTML to: {html_full_path}")
        except requests.exceptions.RequestException as req_e:
            print(f"Error fetching raw HTML for {page_name} from {raw_html_url}: {req_e}")
            # logging.error(f"Error fetching raw HTML for {page_name}: {req_e}", exc_info=True)
        except IOError as io_e:
            print(f"Error saving raw HTML for {page_name} to {html_full_path}: {io_e}")
            # logging.error(f"Error saving raw HTML for {page_name}: {io_e}", exc_info=True)
        except Exception as e:  # Catch other potential errors like decoding
            print(f"Unexpected error saving HTML for {page_name}: {e}")
            # logging.error(f"Unexpected error saving HTML for {page_name}: {e}", exc_info=True)

        # --- Generate PDF (using raw URL for content, JS URL for header link) ---
        # logging.info(f"Generating PDF for {page_name} from {raw_html_url}")
        print(f"Generating PDF using content from: {raw_html_url}")
        pdf_success = generate_pdf_with_header_and_cleanup(
            url=raw_html_url,  # Use raw URL for Playwright navigation/content
            pdf_path=pdf_full_path,
            substance_name=rml_name,
            substance_link=js_header_link,  # Use JS URL for the link in the header
            ec_number=rml_ec,
            cas_number=rml_cas
        )

        if pdf_success:
            successful_pages.append(page_name)  # Log success based on PDF generation
            overall_success = True
            # logging.info(f"Successfully generated PDF for {page_name} at {pdf_full_path}")
            print(f"Successfully generated PDF for {page_name}")
        else:
            # logging.error(f"Failed to generate PDF for {page_name} from {raw_html_url}")
            print(f"Failed to generate PDF for {page_name}")
        # Decide if failure to save HTML should affect overall success or logging...
        # Currently, success is tied only to PDF generation.

    print(f"===== Finished request for CAS: {cas_number_to_search} =====")
    print(f"Successfully generated {len(successful_pages)} PDFs: {successful_pages}")
    return overall_success  # Return success based on PDF generation


# Leftover standalone smoke test: runs at import time and fetches a fixed ECHA page.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://chem.echa.europa.eu/html-pages-prod/e4c88c6e-06c7-4daa-b0fb-1a55459ac22f/documents/IUC5-5f55d8ec-7a71-4e2c-9955-8469ead9fe84_0035f3f8-7467-4944-9028-1db2e9c99565.html")
    page.pdf(path='output.pdf')
    browser.close()

@@ -1,947 +0,0 @@
from pif_compiler.services.echa_find import search_dossier
from bs4 import BeautifulSoup
from markdownify import MarkdownConverter
import pandas as pd
import requests
import os
import re
import markdown_to_json
import json
import copy
import unicodedata
from datetime import datetime
import logging
import duckdb

# Logging settings
logging.basicConfig(
    format="{asctime} - {levelname} - {message}",
    style="{",
    datefmt="%Y-%m-%d %H:%M",
    filename="echa.log",
    encoding="utf-8",
    filemode="a",
    level=logging.INFO,
)

try:
    # Load the full scraping into memory, if it exists.
    con = duckdb.connect()
    os.chdir(".")  # the directory Python reads from
    res = con.sql(r"""
        CREATE TABLE echa_full_scraping AS
        SELECT * FROM read_csv_auto('src\data\echa_full_scraping.csv');
    """)  # read the CSV file as an in-memory database
    logging.info(
        f"echa.echaProcess().main: Loaded echa scraped data into duckdb memory. First CAS in the df is: {con.sql('select CAS from echa_full_scraping limit 1').fetchone()[0]}"
    )
    local_echa = True
except Exception:
    logging.error("echa.echaProcess().main: No local echa scraped data found")


# Helper that finds the relevant information on the ECHA site.
# Works with both the substance name and the CAS number.
def openEchaPage(link, local=False):
    try:
        if local:
            page = open(link, encoding="utf8")
            soup = BeautifulSoup(page, "html.parser")
        else:
            page = requests.get(link)
            page.encoding = "utf-8"
            soup = BeautifulSoup(page.text, "html.parser")
    except Exception:
        logging.error(
            f"echa.echaProcess.openEchaPage() error. could not open: '{link}'",
            exc_info=True,
        )
    return soup


# Helper that turns the ECHA page into Markdown.
def echaPage_to_md(sezione, scrapingType=None, local=False, substance=None):
    # sezione : the soup of the page extracted via search_dossier
    # scrapingType : 'RepeatedDose' or 'AcuteToxicity'
    # local : whether to save the Markdown content locally. Useful for debugging.
    # substance : the substance name, used to save it under the correct path.

    # Create shorthand method for conversion
    def md(soup, **options):
        return MarkdownConverter(**options).convert_soup(soup)

    output = md(sezione)
    # Convert the HTML section into Markdown, which still needs fixing.

    # How the .md is adjusted changes slightly with the type of page being scraped;
    # exceptions are added as new substances get tested.
    if scrapingType == "RepeatedDose":
        output = output.replace("### Oral route", "#### oral")
        output = output.replace("### Dermal", "#### dermal")
        output = output.replace("### Inhalation", "#### inhalation")
        # '>' and '<' must be replaced with words, otherwise the jsonifier treats
        # those two symbols as markup and nests the text inside [].
        output = re.sub(r">\s+", "greater than ", output)
        # Replace '<' followed by whitespace with 'less than '
        output = re.sub(r"<\s+", "less than ", output)
        output = re.sub(r">=\s*\n", "greater or equal than ", output)
        output = re.sub(r"<=\s*\n", "less or equal than ", output)

    elif scrapingType == "AcuteToxicity":
        # '>' and '<' must be replaced with words, otherwise the jsonifier treats
        # those two symbols as markup and nests the text inside [].
        output = re.sub(r">\s+", "greater than ", output)
        # Replace '<' followed by whitespace with 'less than '
        output = re.sub(r"<\s+", "less than ", output)
        output = re.sub(r">=\s*\n", "greater or equal than", output)
        output = re.sub(r"<=\s*\n", "less or equal than ", output)

    output = output.replace("–", "-")

    output = re.sub(r"\s+mg", " mg", output)
    # This part fixes units of measure that wrap to the next line,
    # separated from their value.

    if local and substance:
        path = f"{scrapingType}/mds/{substance}.md"
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as text_file:
            text_file.write(output)

    return output
|
||||
|
||||
|
||||
|
||||
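The comparator rewriting above can be exercised in isolation. A minimal standalone sketch of the same substitutions (the helper name is illustrative, not part of the module):

```python
import re

def sanitize_comparators(text: str) -> str:
    """Replace comparison symbols with words so a markdown jsonifier
    does not mistake them for markup (mirrors echaPage_to_md)."""
    text = re.sub(r">=\s*\n", "greater or equal than ", text)
    text = re.sub(r"<=\s*\n", "less or equal than ", text)
    text = re.sub(r">\s+", "greater than ", text)  # '>' followed by whitespace
    text = re.sub(r"<\s+", "less than ", text)     # '<' followed by whitespace
    return text

print(sanitize_comparators("NOAEL: > 1000 mg/kg bw/day"))
# → "NOAEL: greater than 1000 mg/kg bw/day"
```

Note that `>\s+` only fires when the symbol is followed by whitespace, so `>=` is never half-consumed by the bare-symbol rule.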
# Step 2 of the ECHA site processing: turn the markdown into JSON
def markdown_to_json_raw(output, scrapingType=None, local=False, substance=None):
    # output : the markdown
    # scrapingType : 'RepeatedDose' or 'AcuteToxicity'
    # substance : the substance name, used to build the save path
    jsonified = markdown_to_json.jsonify(output)
    dictified = json.loads(jsonified)

    # Save the initial json exactly as it comes out of jsonify
    if local and scrapingType and substance:
        path = f"{scrapingType}/jsons/raws/{substance}_raw0.json"
        os.makedirs(os.path.dirname(path), exist_ok=True)

        with open(path, "w") as text_file:
            text_file.write(jsonified)

    # Now split the contents of the nested dictionaries.
    for key, value in dictified.items():
        if isinstance(value, dict):
            for key2, value2 in value.items():
                parts = value2.split("\n\n")
                dictified[key][key2] = {
                    parts[i]: parts[i + 1]
                    for i in range(0, len(parts) - 1, 2)
                    if parts[i + 1] != "[Empty]"
                }
        else:
            parts = value.split("\n\n")
            dictified[key] = {
                parts[i]: parts[i + 1]
                for i in range(0, len(parts) - 1, 2)
                if parts[i + 1] != "[Empty]"
            }

    jsonified = json.dumps(dictified)

    if local and scrapingType and substance:
        path = f"{scrapingType}/jsons/raws/{substance}_raw1.json"
        os.makedirs(os.path.dirname(path), exist_ok=True)

        with open(path, "w") as text_file:
            text_file.write(jsonified)

    return jsonified


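The nested-dictionary splitting above relies on jsonify emitting alternating label/value blocks separated by blank lines. A self-contained sketch of that pairing step (the function name is illustrative):

```python
def pair_blocks(text: str) -> dict:
    """Split a '\\n\\n'-separated run of alternating labels and values
    into a dict, dropping '[Empty]' values (mirrors markdown_to_json_raw)."""
    parts = text.split("\n\n")
    return {
        parts[i]: parts[i + 1]
        for i in range(0, len(parts) - 1, 2)
        if parts[i + 1] != "[Empty]"
    }

raw = "Dose descriptor\n\nNOAEL\n\nSpecies\n\n[Empty]\n\nEffect level\n\n100 mg/kg"
print(pair_blocks(raw))
# → {'Dose descriptor': 'NOAEL', 'Effect level': '100 mg/kg'}
```

The `range(0, len(parts) - 1, 2)` step also silently drops a trailing label that has no value block after it.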
# Helper (originally generated with Claude) to fix unicode character issues
def normalize_unicode_characters(text):
    """
    Normalize Unicode characters, with special handling for superscripts
    """
    if not isinstance(text, str):
        return text

    # Specific replacements for common Unicode encoding issues
    # and other particular exceptions
    replacements = {
        "\u00c2\u00b2": "²",  # mojibake 'Â²' -> ²
        "\u00c2\u00b3": "³",  # mojibake 'Â³' -> ³
        "\u00b2": "²",  # bare superscript 2
        "\u00b3": "³",  # bare superscript 3
        "\n": "",  # stray newlines occasionally show up and must go
        # These entries restore the '>'/'<' symbols that echaPage_to_md
        # temporarily rewrote as words (they break the jsonifier).
        # Longer phrases come first, so that "greater or equal than" is
        # not partially eaten by "greater than".
        "greater or equal than": ">=",
        "less or equal than": "<=",
        "greater than": ">",
        "less than": "<",
    }

    # Apply specific replacements first
    for old, new in replacements.items():
        text = text.replace(old, new)

    # Normalize Unicode characters
    text = unicodedata.normalize("NFKD", text)

    return text


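The restore step above depends on dict insertion order: the compound phrases must be replaced before their substrings, otherwise "greater or equal than" degrades to "> or equal than". A minimal sketch of just that ordering (helper name illustrative):

```python
import unicodedata

def restore_comparators(text: str) -> str:
    """Undo the word-encoding of comparison symbols; longer phrases
    must be replaced before their substrings (dicts preserve order)."""
    replacements = {
        "greater or equal than": ">=",
        "less or equal than": "<=",
        "greater than": ">",
        "less than": "<",
    }
    for old, new in replacements.items():
        text = text.replace(old, new)
    return unicodedata.normalize("NFKD", text)

print(restore_comparators("greater or equal than 300 mg/kg"))
# → ">= 300 mg/kg"
```

Note that the final NFKD pass folds compatibility characters, so superscripts such as "³" decompose to plain digits downstream of this function.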
# Another helper originally generated with Claude: recursing through a
# nested dictionary while filtering it is the fiddly part of this module.
def clean_json(data):
    """
    Recursively clean JSON by removing empty/uninformative entries
    and normalizing Unicode characters
    """

    def is_uninformative(value, context=None):
        """
        Check if a dictionary entry is considered uninformative

        Args:
            value: The value to check
            context: Additional context about where the value is located
        """
        # Specific exceptions
        if context == "Key value for chemical safety assessment":
            # Always keep all entries in this specific section
            return False

        uninformative_values = ["hours/week", "", None]

        return value in uninformative_values or (
            isinstance(value, str)
            and (
                value.strip() in uninformative_values
                or value.lower() == "no information available"
            )
        )

    def clean_recursive(obj, context=None):
        # If it's a dictionary, process its contents
        if isinstance(obj, dict):
            # Build a new dict rather than mutating while iterating
            cleaned = {}
            for key, value in obj.items():
                # Normalize key
                normalized_key = normalize_unicode_characters(key)

                # Set context for nested dictionaries
                new_context = context or normalized_key

                # Recursively clean nested structures
                cleaned_value = clean_recursive(value, new_context)

                # Conditions for keeping the entry
                keep_entry = (
                    cleaned_value not in [None, {}, ""]
                    and not (
                        isinstance(cleaned_value, dict) and len(cleaned_value) == 0
                    )
                    and not is_uninformative(cleaned_value, new_context)
                )

                # Add to cleaned dict if conditions are met
                if keep_entry:
                    cleaned[normalized_key] = cleaned_value

            return cleaned if cleaned else None

        # If it's a list, clean each item
        elif isinstance(obj, list):
            cleaned_list = [clean_recursive(item, context) for item in obj]
            cleaned_list = [item for item in cleaned_list if item not in [None, {}, ""]]
            return cleaned_list if cleaned_list else None

        # For strings, normalize Unicode
        elif isinstance(obj, str):
            return normalize_unicode_characters(obj)

        # Return as-is for other types
        return obj

    # Work on a deep copy so the caller's nested data is never mutated:
    # this is what makes the recursive filtering above safe.
    cleaned_data = clean_recursive(copy.deepcopy(data))
    return cleaned_data


def json_to_dataframe(cleaned_json, scrapingType):
    rows = []
    schema = {
        "RepeatedDose": [
            "Substance",
            "CAS",
            "Toxicity Type",
            "Route",
            "Dose descriptor",
            "Effect level",
            "Species",
            "Extraction_Timestamp",
            "Endpoint conclusion",
        ],
        "AcuteToxicity": [
            "Substance",
            "CAS",
            "Route",
            "Endpoint conclusion",
            "Dose descriptor",
            "Effect level",
            "Extraction_Timestamp",
        ],
    }
    if scrapingType == "RepeatedDose":
        # Iterate through top-level sections (excluding 'Key value for chemical safety assessment')
        for toxicity_type, routes in cleaned_json.items():
            if toxicity_type == "Key value for chemical safety assessment":
                continue

            # Iterate through routes within each toxicity type
            for route, details in routes.items():
                row = {"Toxicity Type": toxicity_type, "Route": route}

                # Add details to the row, excluding 'Link to relevant study record(s)'
                row.update(
                    {
                        k: v
                        for k, v in details.items()
                        if k != "Link to relevant study record(s)"
                    }
                )
                rows.append(row)
    elif scrapingType == "AcuteToxicity":
        for toxicity_type, routes in cleaned_json.items():
            if (
                toxicity_type == "Key value for chemical safety assessment"
                or not routes
            ):
                continue

            row = {
                "Route": toxicity_type.replace("Acute toxicity: via", "")
                .replace("route", "")
                .strip()
            }

            # Add details directly from the routes dictionary
            row.update(
                {
                    k: v
                    for k, v in routes.items()
                    if k != "Link to relevant study record(s)"
                }
            )
            rows.append(row)

    # Create DataFrame
    df = pd.DataFrame(rows)

    # Last-moment fix, to enforce a common schema across both page types
    fair_columns = list(set(schema["RepeatedDose"] + schema["AcuteToxicity"]))
    df = df.loc[:, df.columns.intersection(fair_columns)]
    return df


def save_dataframe(df, file_path, scrapingType, schema):
    """
    Save DataFrame with strict column requirements.

    Args:
        df (pd.DataFrame): DataFrame to potentially append
        file_path (str): Path of CSV file
    """
    # Mandatory columns for the saved DataFrame
    saved_columns = schema[scrapingType]

    # The input DataFrame must at least carry an Effect level column
    if "Effect level" not in df.columns:
        return

    # Reindex to match the saved columns, filling missing ones with NaN
    # (the same reindex applies whether or not the file already exists)
    df = df.reindex(columns=saved_columns)

    # Ignore rows that have no value for Effect level
    df = df[df["Effect level"].notna()]

    # Append to the CSV if it exists, otherwise create it with a header
    df.to_csv(
        file_path,
        mode="a" if os.path.exists(file_path) else "w",
        header=not os.path.exists(file_path),
        index=False,
    )


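The append-or-create toggle used above (write the header only on first creation) can be shown with the stdlib alone. A sketch with hypothetical names, using `csv` instead of pandas:

```python
import csv
import os
import tempfile

def append_rows(file_path, rows, fieldnames):
    """Append rows to a CSV, writing the header only when the file
    does not exist yet (the same mode/header toggle as save_dataframe)."""
    exists = os.path.exists(file_path)
    with open(file_path, "a" if exists else "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if not exists:
            writer.writeheader()
        writer.writerows(rows)

path = os.path.join(tempfile.mkdtemp(), "demo.csv")
append_rows(path, [{"CAS": "50-00-0", "Effect level": "100 mg/kg"}], ["CAS", "Effect level"])
append_rows(path, [{"CAS": "64-17-5", "Effect level": "500 mg/kg"}], ["CAS", "Effect level"])
with open(path) as f:
    print(f.read())  # header appears only once, followed by both rows
```

The `exists` flag must be read before opening the file, since opening in `"w"` mode would itself create it.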
def echaExtract(
    substance: str,
    scrapingType: str,
    outputType="df",
    key_infos=False,
    local_search=False,
    local_only=False,
):
    """
    Main function for scraping the ECHA site. It ties together several search,
    extraction and cleaning functions, and logs each operation.

    Args:
        substance (str): CAS number or substance name. Both work, but the CAS works better.
        scrapingType (str): 'AcuteToxicity' (LD50) or 'RepeatedDose' (NOAEL)
        outputType (str): 'df' (pd.DataFrame) or 'json' (not recommended)
        key_infos (bool): Whether to also look for the "Description of Key Information"
            section in the dossiers. Some substances have their data entered sloppily
            and put the information there as free text instead of in the usual fields.
        local_search (bool): Try the local cache first.
        local_only (bool): Stop after the local search instead of going online.

    Output:
        a dataframe or a json, or
        f"Non esistono lead dossiers attivi o inattivi per {substance}"
    """

    # If local_search is True, try a local lookup first; otherwise go online.
    if local_search and local_echa:
        result = echaExtract_local(substance, scrapingType, key_infos)

        if not result.empty:
            logging.info(
                f"echa.echaProcess.echaExtract(): Found local data for {scrapingType}, {substance}. Returning it."
            )
            return result
        logging.info(
            f"echa.echaProcess.echaExtract(): No local data found for {scrapingType}, {substance}. Continuing."
        )
        if local_only:
            logging.info(
                f"echa.echaProcess.echaExtract(): No data found in local-only search for {substance}, {scrapingType}"
            )
            return f"No data found in local-only search for {substance}, {scrapingType}"

    try:
        # search_dossier looks the substance up on the ECHA site and returns
        # the dossier links and metadata.
        links = search_dossier(substance)
        if not links:
            # No active or inactive LEAD dossiers exist for this substance.
            # LEAD dossiers summarise the information of the other dossiers;
            # they are the complete ones that carry the toxicological summaries.
            logging.info(
                f'echaProcess.echaExtract(). no active or inactive lead dossiers for: "{substance}". Ending extraction.'
            )
            return f"Non esistono lead dossiers attivi o inattivi per {substance}"

        # If they exist, open the page of interest ('AcuteToxicity' or 'RepeatedDose')
        if scrapingType not in links:
            logging.info(
                f'echaProcess.echaExtract(). No page for "{scrapingType}", "{substance}"'
            )
            return f'No data in "{scrapingType}", "{substance}". Page does not exist.'

        soup = openEchaPage(link=links[scrapingType])
        logging.info(
            f"echaProcess.echaExtract(). soupped '{scrapingType}' echa page for '{substance}'"
        )

        # Grab the section we need
        sezione = None
        try:
            sezione = soup.find(
                "section",
                class_="KeyValueForChemicalSafetyAssessment",
                attrs={"data-cy": "das-block"},
            )
        except Exception:
            logging.error(
                f'echaProcess.echaExtract(). could not extract the "section" for "{scrapingType}" for "{substance}"',
                exc_info=True,
            )

        # Timestamp of the extraction
        now = datetime.now()

        # Look for the key infos: the general free-text summary
        key_infos_found = False
        if key_infos:
            try:
                key_infos = soup.find(
                    "section",
                    class_="KeyInformation",
                    attrs={"data-cy": "das-block"},
                )
                if key_infos:
                    key_infos = key_infos.find(
                        "div",
                        class_="das-field_value das-field_value_html",
                    )
                    key_infos = key_infos.text
                    key_infos = key_infos if key_infos.strip() != "[Empty]" else None
                    if key_infos:
                        key_infos_found = True
                        logging.info(
                            f"echaProcess.echaExtract(). Extracted key_infos from '{scrapingType}' echa page for '{substance}': {key_infos}"
                        )
                        key_infos_df = pd.DataFrame(index=[0])
                        key_infos_df["key_information"] = key_infos
                        key_infos_df = df_wrapper(
                            df=key_infos_df,
                            rmlName=links["rmlName"],
                            rmlCas=links["rmlCas"],
                            timestamp=now.strftime("%Y-%m-%d"),
                            dossierType=links["dossierType"],  # active or inactive? to verify
                            page=scrapingType,  # RepeatedDose or AcuteToxicity
                            linkPage=links[scrapingType],  # link to the scraped dossier page
                            key_infos=True,
                        )
                    else:
                        logging.error(
                            f'echaProcess.echaExtract() ERROR. could not extract key_infos for "{scrapingType}", "{substance}"'
                        )
                else:
                    logging.error(
                        f'echaProcess.echaExtract() ERROR. could not extract key_infos for "{scrapingType}", "{substance}"'
                    )
            except Exception:
                logging.error(
                    f'echaProcess.echaExtract() ERROR. could not extract key_infos for "{scrapingType}", "{substance}"',
                    exc_info=True,
                )

        try:
            if not sezione:  # the main section being scraped
                logging.error(
                    f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() Empty section for the html > markdown conversion. No data for "{scrapingType}", "{substance}"'
                )
                if not key_infos_found:
                    return f'No data in "{scrapingType}", "{substance}"'
                else:
                    # No tabular data, but the key informations exist: return those
                    return key_infos_df

            # Convert the html section into markdown
            output = echaPage_to_md(
                sezione, scrapingType=scrapingType, substance=substance
            )
            logging.info(
                f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() OK. created MD for "{scrapingType}", "{substance}"'
            )

            # In rare cases no AcuteToxicity or RepeatedDose page exists at all.
            # In that case output will be empty and raise further down.
        except Exception:
            logging.error(
                f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() ERROR. could not create MD for "{scrapingType}", "{substance}"',
                exc_info=True,
            )

        try:
            # Convert the markdown into the first raw json
            jsonified = markdown_to_json_raw(
                output, scrapingType=scrapingType, substance=substance
            )
            logging.info(
                f'echaProcess.echaExtract() > echaProcess.markdown_to_json_raw() OK. created initial json for "{scrapingType}", "{substance}"'
            )
        except Exception:
            logging.error(
                f'echaProcess.echaExtract() > echaProcess.markdown_to_json_raw() ERROR. could not create initial json for "{scrapingType}", "{substance}"',
                exc_info=True,
            )

        json_data = json.loads(jsonified)

        try:
            # Second json processing step: clean the most nested dictionaries
            cleaned_data = clean_json(json_data)
            logging.info(
                f'echaProcess.echaExtract() > echaProcess.clean_json() OK. cleaned the json for "{scrapingType}", "{substance}"'
            )
            # An empty cleaned_data means there is no data
            if not cleaned_data:
                logging.error(
                    f'echaProcess.echaExtract() > echaProcess.clean_json() Empty cleaned_json. No data for "{scrapingType}", "{substance}"'
                )
                if not key_infos_found:
                    return f'No data in "{scrapingType}", "{substance}"'
                else:
                    # No tabular data, but the key informations exist: return those
                    return key_infos_df
        except Exception:
            logging.error(
                f'echaProcess.echaExtract() > echaProcess.clean_json() ERROR. cleaning the json for "{scrapingType}", "{substance}"'
            )

        # Build the dataframe and attach the timestamp
        try:
            df = json_to_dataframe(cleaned_data, scrapingType)
            df = df_wrapper(
                df=df,
                rmlName=links["rmlName"],
                rmlCas=links["rmlCas"],
                timestamp=now.strftime("%Y-%m-%d"),
                dossierType=links["dossierType"],
                page=scrapingType,
                linkPage=links[scrapingType],
            )

            if outputType == "df":
                logging.info(
                    f'echaProcess.echaExtract(). successfully extracted "{scrapingType}", "{substance}". Returning df'
                )
                # If the user asked for key infos and they were found, join the two dfs
                return df if not key_infos_found else pd.concat([key_infos_df, df])

            elif outputType == "json":
                if key_infos_found:
                    df = pd.concat([key_infos_df, df])
                jayson = df.to_json(orient="records", force_ascii=False)
                logging.info(
                    f'echaProcess.echaExtract(). successfully extracted "{scrapingType}", "{substance}". Returning json'
                )
                return jayson
        except KeyError:
            # Handles the pages that only contain "no information available"
            if key_infos_found:
                return key_infos_df

            json_output = list(cleaned_data[list(cleaned_data.keys())[0]].values())
            if json_output == ["no information available" for elem in json_output]:
                logging.info(
                    f"echaProcess.echaExtract(). No data found for {scrapingType} for {substance}"
                )
                return f'No data in "{scrapingType}", "{substance}"'
            else:
                logging.error(
                    "echaProcess.json_to_dataframe(). Could not create dataframe"
                )
                cleaned_data["error"] = (
                    "Non sono riuscito a creare il dataframe, probabilmente non ci sono abbastanza informazioni. Ritorno il JSON"
                )
                return cleaned_data

    except Exception:
        logging.error(
            "echaProcess.echaExtract() ERROR. Something went wrong, not quite sure what.",
            exc_info=True,
        )


def df_wrapper(
    df, rmlName, rmlCas, timestamp, dossierType, page, linkPage, key_infos=False
):
    # A small helper that attaches all the required metadata to the dataframe,
    # keeping echaExtract (already tangled enough) from growing further.
    df.insert(0, "Substance", rmlName)
    df.insert(1, "CAS", rmlCas)
    df["Extraction_Timestamp"] = timestamp
    df = df.replace("\n", "", regex=True)
    if not key_infos:
        # Drop the rows that have no Effect level
        df = df[df["Effect level"].notnull()]

    # Add the dossier link and its status
    df["dossierType"] = dossierType
    df["page"] = page
    df["linkPage"] = linkPage
    return df


def echaExtract_specific(
    CAS: str,
    scrapingType="RepeatedDose",
    doseDescriptor="NOAEL",
    route="inhalation",
    local_search=False,
    local_only=False,
):
    """
    Given a CAS, tries to find the dose descriptor (NOAEL by default)
    for the specified route ('inhalation' by default).

    Args:
        CAS (str): the CAS number, or alternatively the substance name
        route (str): 'inhalation', 'oral', 'dermal'. Defaults to 'inhalation'
        scrapingType (str): the page to search on
        doseDescriptor (str): the type of value to look for (NOAEL, DNEL, LD50, LC50)
    """

    # Attempt the extraction
    result = echaExtract(
        substance=CAS,
        scrapingType=scrapingType,
        outputType="df",
        local_search=local_search,
        local_only=local_only,
    )

    # Is the result a dataframe?
    if isinstance(result, pd.DataFrame):
        # If so, filter it down to what we need
        filtered_df = result[
            (result["Route"] == route) & (result["Dose descriptor"] == doseDescriptor)
        ]
        # Return it if it is not empty
        if not filtered_df.empty:
            return filtered_df
        else:
            return f'Non ho trovato {doseDescriptor} in {scrapingType} con route "{route}" per {CAS}'

    elif isinstance(result, dict) and result.get("error"):
        # A json carrying an error came back
        return f'Non ho trovato {doseDescriptor} in {scrapingType} con route "{route}" per {CAS}'

    # A "Non esistono" result: no active or inactive lead dossiers
    # exist for the searched substance
    elif isinstance(result, str) and result.startswith("Non esistono"):
        return result


def echa_noael_ld50(CAS: str, route="inhalation", outputType="df", local_search=False, local_only=False):
    """
    Given a CAS, tries to find the NOAEL for the specified route ('inhalation' by default).
    If no RepeatedDose page with a NOAEL exists, it returns the LD50 for that route instead.

    Args:
        CAS (str): the CAS number, or alternatively the substance name
        route (str): 'inhalation', 'oral', 'dermal'. Defaults to 'inhalation'
        outputType (str): 'df' or 'json'. The output type
    """
    # Reject the input if either argument is invalid
    if route not in ["inhalation", "oral", "dermal"] or outputType not in [
        "df",
        "json",
    ]:
        return "invalid input"
    # First try to scrape the "RepeatedDose" page
    first_attempt = echaExtract_specific(
        CAS=CAS,
        scrapingType="RepeatedDose",
        doseDescriptor="NOAEL",
        route=route,
        local_search=local_search,
        local_only=local_only,
    )

    if isinstance(first_attempt, pd.DataFrame):
        return first_attempt
    elif isinstance(first_attempt, str) and first_attempt.startswith("Non ho trovato"):
        # Fall back to the LD50 on the AcuteToxicity page
        second_attempt = echaExtract_specific(
            CAS=CAS,
            scrapingType="AcuteToxicity",
            doseDescriptor="LD50",
            route=route,
            local_search=local_search,
            local_only=local_only,
        )
        if isinstance(second_attempt, pd.DataFrame):
            return second_attempt
        elif isinstance(second_attempt, str) and second_attempt.startswith(
            "Non ho trovato"
        ):
            return second_attempt.replace("LD50", "NOAEL ed LD50")
    elif first_attempt.startswith("Non esistono"):
        return first_attempt


def echa_noael_ld50_multi(
    casList: list, route="inhalation", messages=False, local_search=False, local_only=False
):
    """
    Fairly simple helper. Given a list of CAS numbers it runs echa_noael_ld50 on
    each: it looks for the NOAELs for the desired route, falling back to the LD50s
    when no NOAEL is found. The output is a df for the substances it finds and a
    list of messages for the ones it does not.

    Args:
        casList (list): the list of CAS numbers
        route (str): 'inhalation', 'oral', 'dermal'. Defaults to 'inhalation'
        messages (bool): with True it returns a list whose first element is the
            dataframe and whose second element is the list of messages for the
            substances that were not found. Defaults to False, returning only
            the dataframe.
    """
    messages_list = []
    df = pd.DataFrame()
    for CAS in casList:
        output = echa_noael_ld50(
            CAS=CAS, route=route, outputType="df", local_search=local_search, local_only=local_only
        )
        if isinstance(output, str):
            messages_list.append(output)
        elif isinstance(output, pd.DataFrame):
            df = pd.concat([df, output], ignore_index=True)
    # Drop all-empty columns once, after the whole list has been processed
    df.dropna(axis=1, how="all", inplace=True)
    if messages and df.empty:
        messages_list.append(
            f'Non sono riuscito a trovare nessun NOAEL o LD50 per i cas per la route "{route}"'
        )
        return [None, messages_list]
    elif messages and not df.empty:
        return [df, messages_list]
    elif not df.empty and not messages:
        return df
    elif df.empty and not messages:
        return f'Non sono riuscito a trovare nessun NOAEL o LD50 per i cas per la route "{route}"'


def echaExtract_multi(
    casList: list,
    scrapingType="all",
    local=False,
    local_path=None,
    log_path=None,
    debug_print=False,
    error=False,
    error_path=None,
    key_infos=False,
    local_search=False,
    local_only=False,
    filter=None,
):
    """
    Given a list of CAS numbers, extracts all the RepeatedDose pages, all the
    AcuteToxicity pages, or both.

    Args:
        casList (list): the list of CAS numbers
        scrapingType (str): 'RepeatedDose', 'AcuteToxicity', 'all'
        local (bool): if True, saves to disk progressively, appending each
            result as it is found. Required for large-scale scraping.
        log_path (str): the path of the log to fill during mass scraping
        debug_print (bool): print progress while scraping
        error (bool): return the list of errors once the scraping is done

    Output:
        pd.DataFrame
    """
    i = 0

    df = pd.DataFrame()
    if scrapingType == "all":
        scrapingTypeList = ["RepeatedDose", "AcuteToxicity"]
    else:
        scrapingTypeList = [scrapingType]

    logging.info(
        f"echa.echaExtract_multi(). Commencing mass extraction of {scrapingTypeList} for {casList}"
    )

    errors = []

    for cas in casList:
        for scrapingType in scrapingTypeList:
            extraction = echaExtract(
                substance=cas,
                scrapingType=scrapingType,
                outputType="df",
                key_infos=key_infos,
                local_search=local_search,
                local_only=local_only,
            )
            if isinstance(extraction, pd.DataFrame) and not extraction.empty:
                status = "successful_scrape"
                logging.info(
                    f"echa.echaExtract_multi(). Successfully scraped {scrapingType} for {cas}"
                )

                df = pd.concat([df, extraction], ignore_index=True)
                if local and local_path:
                    df.to_csv(local_path, index=False)

            elif (
                (isinstance(extraction, pd.DataFrame) and extraction.empty)
                or (extraction is None)
                or (isinstance(extraction, str) and extraction.startswith("No data"))
            ):
                status = "no_data_found"
                logging.info(
                    f"echa.echaExtract_multi(). Found no data for {scrapingType} for {cas}"
                )
            elif isinstance(extraction, dict):
                if extraction.get("error"):
                    status = "df_creation_error"
                    errors.append(extraction)
                    logging.info(
                        f"echa.echaExtract_multi(). Df creation error for {scrapingType} for {cas}"
                    )
            elif isinstance(extraction, str) and extraction.startswith("Non esistono"):
                status = "no_lead_dossiers"
                logging.info(
                    f"echa.echaExtract_multi(). Found no lead dossiers for {cas}"
                )
            else:
                status = "unknown_error"
                logging.error(
                    f"echa.echaExtract_multi(). Unknown error for {scrapingType} for {cas}"
                )

            if log_path:
                fill_log(cas, status, log_path, scrapingType)
            if debug_print:
                print(f"{i}: {cas}, {scrapingType}")
            i += 1

    if error and errors and error_path:
        with open(error_path, "w") as json_file:
            json.dump(errors, json_file, indent=4)

    # This single filter argument replaces four dedicated methods
    if filter:
        df = filter_dataframe_by_dict(df, filter)
    return df


def fill_log(cas: str, status: str, log_path: str, scrapingType: str):
    """
    Used during mass scraping to fill a log while the substances are extracted
    """

    df = pd.read_csv(log_path)
    df.loc[df["casNo"] == cas, f"scraping_{scrapingType}"] = status
    df.loc[df["casNo"] == cas, "timestamp"] = datetime.now().strftime("%Y-%m-%d")

    df.to_csv(log_path, index=False)


def echaExtract_local(substance: str, scrapingType: str, key_infos=False):
    # Parameterized query: substance comes from user input, so it must not be
    # interpolated directly into the SQL string.
    if not key_infos:
        query = """
            SELECT *
            FROM echa_full_scraping
            WHERE CAS = ? AND page = ? AND key_information IS NULL;
        """
    else:
        query = """
            SELECT *
            FROM echa_full_scraping
            WHERE CAS = ? AND page = ?;
        """
    result = con.execute(query, [substance, scrapingType]).df()
    return result


def filter_dataframe_by_dict(df, filter_dict):
    """
    Filters a Pandas DataFrame based on a dictionary.

    Args:
        df (pd.DataFrame): The input DataFrame.
        filter_dict (dict): A dictionary where keys are column names and
                            values are lists of allowed values for that column.

    Returns:
        pd.DataFrame: A new DataFrame containing only the rows that match
                      the filter criteria.
    """

    # Start from an all-True mask and AND in one condition per filtered column
    filter_condition = pd.Series(True, index=df.index)

    for column_name, allowed_values in filter_dict.items():
        if column_name in df.columns:
            # Boolean Series for the current column
            column_filter = df[column_name].isin(allowed_values)
            filter_condition = filter_condition & column_filter
        else:
            logging.warning(
                f"Column '{column_name}' not found in the DataFrame. Filter for this column will be ignored."
            )

    # Apply the combined filter condition
    filtered_df = df[filter_condition]
    return filtered_df
|
|
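A quick usage sketch of `filter_dataframe_by_dict`. The function body is repeated from above so the snippet runs standalone, and the sample rows are invented for illustration.

```python
import pandas as pd

# Reproduced from filter_dataframe_by_dict above so this example is standalone.
def filter_dataframe_by_dict(df, filter_dict):
    condition = pd.Series(True, index=df.index)
    for column_name, allowed_values in filter_dict.items():
        if column_name in df.columns:
            condition &= df[column_name].isin(allowed_values)
    return df[condition]

df = pd.DataFrame({
    "CAS":  ["56-81-5", "56-81-5", "50-00-0"],
    "page": ["Toxicity", "RepeatedDose", "Toxicity"],
})

# Keep only Toxicity rows for CAS 56-81-5.
filtered = filter_dataframe_by_dict(df, {"CAS": ["56-81-5"], "page": ["Toxicity"]})
print(len(filtered))
```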
@@ -1,467 +0,0 @@
import os
import base64
import traceback
import logging  # Import logging module
import datetime
import pandas as pd
# import time  # Keep if you use page.wait_for_timeout
from playwright.sync_api import sync_playwright, TimeoutError  # Catch specific errors
from src.func.find import search_dossier
import requests

# --- Basic Logging Setup (Commented Out) ---
# # Configure logging - uncomment and customize level/handler as needed
# logging.basicConfig(
#     level=logging.INFO,  # Or DEBUG for more details
#     format='%(asctime)s - %(levelname)s - %(message)s',
#     # filename='pdf_generator.log',  # Optional: Log to a file
#     # filemode='a'
# )
# --- End Logging Setup ---

def svg_to_data_uri(svg_path):
    try:
        if not os.path.exists(svg_path):
            # logging.error(f"SVG file not found: {svg_path}")  # Example logging
            raise FileNotFoundError(f"SVG file not found: {svg_path}")
        with open(svg_path, 'rb') as f:
            svg_content = f.read()
        encoded_svg = base64.b64encode(svg_content).decode('utf-8')
        return f"data:image/svg+xml;base64,{encoded_svg}"
    except Exception as e:
        print(f"Error converting SVG {svg_path}: {e}")
        # logging.error(f"Error converting SVG {svg_path}: {e}", exc_info=True)  # Example logging
        return None

# --- JavaScript Expressions ---

# Define the cleanup logic as an immediately-invoked arrow function expression
# NOTE: .das-block_empty removal is currently disabled as per previous step
cleanup_js_expression = """
() => {
    console.log('Running cleanup JS (DISABLED .das-block_empty removal)...');
    let totalRemoved = 0;

    // Example 1: Remove sections explicitly marked as empty (Currently Disabled)
    // const emptyBlocks = document.querySelectorAll('.das-block_empty');
    // emptyBlocks.forEach(el => {
    //     if (el && el.parentNode) {
    //         console.log(`Removing '.das-block_empty' block with ID: ${el.id || 'N/A'}`);
    //         el.remove();
    //         totalRemoved++;
    //     }
    // });

    // Add other specific cleanup logic here if needed

    console.log(`Cleanup script removed ${totalRemoved} elements (DISABLED .das-block_empty removal).`);
    return totalRemoved; // Return the count
}
"""
# --- End JavaScript Expressions ---

def generate_pdf_with_header_and_cleanup(
    url,
    pdf_path,
    substance_name,
    substance_link,
    ec_number,
    cas_number,
    header_template_path=r"src\func\resources\injectableHeader.html",
    echa_chem_logo_path=r"src\func\resources\echa_chem_logo.svg",
    echa_logo_path=r"src\func\resources\ECHA_Logo.svg"
) -> bool:
    """
    Generates a PDF with a dynamic header and optionally removes empty sections.
    Provides basic logging (commented out) and returns True/False for success/failure.

    Args:
        url (str): The target URL OR local HTML file path.
        pdf_path (str): The output PDF path.
        substance_name (str): The name of the chemical substance.
        substance_link (str): The URL the substance name should link to (in header).
        ec_number (str): The EC number for the substance.
        cas_number (str): The CAS number for the substance.
        header_template_path (str): Path to the HTML header template file.
        echa_chem_logo_path (str): Path to the echa_chem_logo.svg file.
        echa_logo_path (str): Path to the ECHA_Logo.svg file.

    Returns:
        bool: True if the PDF was generated successfully, False otherwise.
    """
    final_header_html = None
    # logging.info(f"Starting PDF generation for URL: {url} to path: {pdf_path}")  # Example logging

    # --- 1. Prepare Header HTML ---
    try:
        # logging.debug(f"Reading header template from: {header_template_path}")  # Example logging
        print(f"Reading header template from: {header_template_path}")
        if not os.path.exists(header_template_path):
            raise FileNotFoundError(f"Header template file not found: {header_template_path}")
        with open(header_template_path, 'r', encoding='utf-8') as f:
            header_template_content = f.read()
        if not header_template_content:
            raise ValueError("Header template file is empty.")

        # logging.debug("Converting logos...")  # Example logging
        print("Converting logos...")
        logo1_data_uri = svg_to_data_uri(echa_chem_logo_path)
        logo2_data_uri = svg_to_data_uri(echa_logo_path)
        if not logo1_data_uri or not logo2_data_uri:
            raise ValueError("Failed to convert one or both logos to Data URIs.")

        # logging.debug("Replacing placeholders...")  # Example logging
        print("Replacing placeholders...")
        final_header_html = header_template_content.replace("##ECHA_CHEM_LOGO_SRC##", logo1_data_uri)
        final_header_html = final_header_html.replace("##ECHA_LOGO_SRC##", logo2_data_uri)
        final_header_html = final_header_html.replace("##SUBSTANCE_NAME##", substance_name)
        final_header_html = final_header_html.replace("##SUBSTANCE_LINK##", substance_link)
        final_header_html = final_header_html.replace("##EC_NUMBER##", ec_number)
        final_header_html = final_header_html.replace("##CAS_NUMBER##", cas_number)

        if "##" in final_header_html:
            print("Warning: Not all placeholders seem replaced in the header HTML.")
            # logging.warning("Not all placeholders seem replaced in the header HTML.")  # Example logging

    except Exception as e:
        print(f"Error during header setup phase: {e}")
        traceback.print_exc()
        # logging.error(f"Error during header setup phase: {e}", exc_info=True)  # Example logging
        return False  # Return False on header setup failure
    # --- End Header Prep ---

    # --- CSS Override Definition ---
    # Using Revision 4 from previous step (simplified breaks, boundary focus)
    selectors_to_fix = [
        '.das-field .das-field_value_html',
        '.das-field .das-field_value_large',
        '.das-field .das-value_remark-text'
    ]
    css_selector_string = ",\n".join(selectors_to_fix)
    css_override = f"""
    <style id='pdf-override-styles'>
        /* Basic Resets & Overflows */
        html, body {{ height: auto !important; overflow: visible !important; margin: 0 !important; padding: 0 !important; }}
        * {{ box-sizing: border-box; }}
        {css_selector_string} {{
            overflow: visible !important; overflow-y: visible !important; height: auto !important; max-height: none !important;
        }}
        /* Boundary Fix */
        #pdf-custom-header {{ margin-bottom: 0 !important; padding-bottom: 1px !important; page-break-after: auto !important; display: block !important; }}
        #pdf-custom-header + .body-inner {{ margin-top: 0 !important; padding-top: 0 !important; page-break-before: auto !important; display: block !important; }}
        .body-inner .document-header {{ margin-top: 0 !important; padding-top: 0 !important; page-break-before: auto !important; }}
        /* Simplified Page Breaks */
        .body-inner h1, .body-inner h2, .body-inner h3, .body-inner h4, .body-inner h5, .body-inner h6 {{ page-break-after: avoid !important; }}
        #pdf-custom-header h2 {{ page-break-after: auto !important; }}
        @media print {{
            html, body {{ height: auto !important; overflow: visible !important; margin: 0; padding: 0; }}
            #pdf-custom-header {{ margin-bottom: 0 !important; padding-bottom: 1px !important; page-break-after: auto !important; display: block !important; }}
            #pdf-custom-header + .body-inner {{ margin-top: 0 !important; padding-top: 0 !important; page-break-before: auto !important; display: block !important; }}
            .body-inner .document-header {{ margin-top: 0 !important; padding-top: 0 !important; page-break-before: auto !important; }}
            .body-inner h1, .body-inner h2, .body-inner h3, .body-inner h4, .body-inner h5, .body-inner h6 {{ page-break-after: avoid !important; }}
            #pdf-custom-header h2 {{ page-break-after: auto !important; }}
            .das-doc-toolbar, .document-header__section-links, #das-totop {{ display: none !important; }}
        }}
    </style>
    """
    # --- End CSS Override Definition ---

    # --- Playwright Automation ---
    try:
        with sync_playwright() as p:
            # logging.debug("Launching browser...")  # Example logging
            # browser = p.chromium.launch(headless=False, devtools=True)  # For debugging
            browser = p.chromium.launch()
            page = browser.new_page()
            # Capture console messages (use msg.text)
            page.on("console", lambda msg: print(f"Browser Console: {msg.text}"))

            try:
                # logging.info(f"Navigating to page: {url}")  # Example logging
                print(f"Navigating to: {url}")
                if os.path.exists(url) and not url.startswith('file://'):
                    page_url = f'file://{os.path.abspath(url)}'
                    # logging.info(f"Treating as local file: {page_url}")  # Example logging
                    print(f"Treating as local file: {page_url}")
                else:
                    page_url = url

                page.goto(page_url, wait_until='load', timeout=90000)
                # logging.info("Page navigation complete.")  # Example logging

                # logging.debug("Injecting header HTML...")  # Example logging
                print("Injecting header HTML...")
                page.evaluate('(headerHtml) => { document.body.insertAdjacentHTML("afterbegin", headerHtml); }', final_header_html)

                # logging.debug("Injecting CSS overrides...")  # Example logging
                print("Injecting CSS overrides...")
                page.evaluate("""(css) => {
                    const existingStyle = document.getElementById('pdf-override-styles');
                    if (existingStyle) existingStyle.remove();
                    document.head.insertAdjacentHTML('beforeend', css);
                }""", css_override)

                # logging.debug("Running JavaScript cleanup function...")  # Example logging
                print("Running JavaScript cleanup function...")
                elements_removed_count = page.evaluate(cleanup_js_expression)
                # logging.info(f"Cleanup script finished (reported removing {elements_removed_count} elements).")  # Example logging
                print(f"Cleanup script finished (reported removing {elements_removed_count} elements).")

                # --- Optional: Emulate Print Media ---
                # print("Emulating print media...")
                # page.emulate_media(media='print')

                # --- Generate PDF ---
                # logging.info(f"Generating PDF: {pdf_path}")  # Example logging
                print(f"Generating PDF: {pdf_path}")
                pdf_options = {
                    "path": pdf_path, "format": "A4", "print_background": True,
                    "margin": {'top': '20px', 'bottom': '20px', 'left': '20px', 'right': '20px'},
                    "scale": 1.0
                }
                page.pdf(**pdf_options)
                # logging.info(f"PDF saved successfully to: {pdf_path}")  # Example logging
                print(f"PDF saved successfully to: {pdf_path}")

                # logging.debug("Closing browser.")  # Example logging
                print("Closing browser.")
                browser.close()
                return True  # Indicate success

            except TimeoutError as e:
                print(f"A Playwright TimeoutError occurred: {e}")
                traceback.print_exc()
                # logging.error(f"Playwright TimeoutError occurred: {e}", exc_info=True)  # Example logging
                browser.close()  # Ensure browser is closed on error
                return False  # Indicate failure
            except Exception as e:  # Catch other potential errors during Playwright page operations
                print(f"An unexpected error occurred during Playwright page operations: {e}")
                traceback.print_exc()
                # logging.error(f"Unexpected error during Playwright page operations: {e}", exc_info=True)  # Example logging
                # Optional: Save HTML state on error
                try:
                    html_content = page.content()
                    error_html_path = pdf_path.replace('.pdf', '_error.html')
                    with open(error_html_path, 'w', encoding='utf-8') as f_err:
                        f_err.write(html_content)
                    # logging.info(f"Saved HTML state on error to: {error_html_path}")  # Example logging
                    print(f"Saved HTML state on error to: {error_html_path}")
                except Exception as save_e:
                    # logging.error(f"Could not save HTML state on error: {save_e}", exc_info=True)  # Example logging
                    print(f"Could not save HTML state on error: {save_e}")
                browser.close()  # Ensure browser is closed on error
                return False  # Indicate failure
            # Note: cleanup for the 'with sync_playwright()' context
            # is handled automatically by the 'with' statement.

    except Exception as e:
        # Catch errors during Playwright startup (less common)
        print(f"An error occurred during Playwright setup/teardown: {e}")
        traceback.print_exc()
        # logging.error(f"Error during Playwright setup/teardown: {e}", exc_info=True)  # Example logging
        return False  # Indicate failure

# --- Example Usage ---
# result = generate_pdf_with_header_and_cleanup(
#     url='path/to/your/input.html',
#     pdf_path='output.pdf',
#     substance_name='Glycerol Example',
#     substance_link='http://example.com/glycerol',
#     ec_number='200-289-5',
#     cas_number='56-81-5',
# )
#
# if result:
#     print("PDF Generation Succeeded.")
#     # logging.info("Main script: PDF Generation Succeeded.")  # Example logging
# else:
#     print("PDF Generation Failed.")
#     # logging.error("Main script: PDF Generation Failed.")  # Example logging

def search_generate_pdfs(
    cas_number_to_search: str,
    page_types_to_extract: list[str],
    base_output_folder: str = "data/library"
) -> bool:
    """
    Searches for a substance by CAS, saves raw HTML and generates PDFs for
    specified page types. Uses the '_js' link variant for the PDF header link
    if available.

    Args:
        cas_number_to_search (str): CAS number to search for.
        page_types_to_extract (list[str]): List of page type names (e.g., 'RepeatedDose').
            Expects '{page_type}' and '{page_type}_js' keys in the search result.
        base_output_folder (str): Root directory for saving HTML/PDFs.

    Returns:
        bool: True if substance found and >=1 requested PDF generated, False otherwise.
    """
    # logging.info(f"Starting process for CAS: {cas_number_to_search}")
    print(f"\n===== Processing request for CAS: {cas_number_to_search} =====")

    # --- 1. Search for Dossier Information ---
    try:
        # logging.debug(f"Calling search_dossier for CAS: {cas_number_to_search}")
        search_result = search_dossier(substance=cas_number_to_search, input_type='rmlCas')
    except Exception as e:
        print(f"Error during dossier search for CAS '{cas_number_to_search}': {e}")
        traceback.print_exc()
        # logging.error(f"Exception during search_dossier for CAS '{cas_number_to_search}': {e}", exc_info=True)
        return False

    if not search_result:
        print(f"Substance not found or search failed for CAS: {cas_number_to_search}")
        # logging.warning(f"Substance not found or search failed for CAS: {cas_number_to_search}")
        return False

    # logging.info(f"Substance found for CAS: {cas_number_to_search}")
    print(f"Substance found: {search_result.get('rmlName', 'N/A')}")

    # --- 2. Extract Details and Filter Pages ---
    try:
        # Extract required info
        rml_id = search_result.get('rmlId')
        rml_name = search_result.get('rmlName')
        rml_cas = search_result.get('rmlCas')
        rml_ec = search_result.get('rmlEc')
        asset_ext_id = search_result.get('assetExternalId')

        # Basic validation
        if not all([rml_id, rml_name, rml_cas, rml_ec, asset_ext_id]):
            missing_keys = [k for k, v in {'rmlId': rml_id, 'rmlName': rml_name, 'rmlCas': rml_cas, 'rmlEc': rml_ec, 'assetExternalId': asset_ext_id}.items() if not v]
            message = f"Search result for {cas_number_to_search} is missing required keys: {missing_keys}"
            print(f"Error: {message}")
            # logging.error(message)
            return False

        # --- Filtering Logic - Collect pairs of URLs ---
        pages_to_process_list = []  # Store tuples: (page_name, raw_url, js_url)
        # logging.debug(f"Filtering pages. Requested: {page_types_to_extract}.")

        for page_type in page_types_to_extract:
            raw_url_key = page_type
            js_url_key = f"{page_type}_js"

            raw_url = search_result.get(raw_url_key)
            js_url = search_result.get(js_url_key)  # Get the JS URL

            # Check if both URLs are valid strings
            if raw_url and isinstance(raw_url, str) and raw_url.strip():
                if js_url and isinstance(js_url, str) and js_url.strip():
                    pages_to_process_list.append((page_type, raw_url, js_url))
                    # logging.debug(f"Found valid pair for '{page_type}': Raw='{raw_url}', JS='{js_url}'")
                else:
                    # Found raw URL but not a valid JS URL - skip this page type for PDF,
                    # rather than falling back to raw_url for the header link.
                    print(f"Found raw URL for '{page_type}' but missing/invalid JS URL ('{js_url}'). Skipping PDF generation for this type.")
                    # logging.warning(f"Missing/invalid JS URL for page type '{page_type}' for {rml_cas}. Raw URL: '{raw_url}'.")
            else:
                # Raw URL missing or invalid
                if page_type in search_result:  # Check if key existed at all
                    print(f"Found page type key '{page_type}' for {rml_cas}, but its value is not a valid URL ('{raw_url}'). Skipping.")
                    # logging.warning(f"Invalid raw URL value for page type '{page_type}' for {rml_cas}: '{raw_url}'.")
                else:
                    print(f"Requested page type key '{page_type}' not found in search results for {rml_cas}.")
                    # logging.warning(f"Requested page type key '{page_type}' not found for {rml_cas}.")
        # --- End Filtering Logic ---

        if not pages_to_process_list:
            print(f"After filtering, no requested page types ({page_types_to_extract}) resulted in a valid pair of Raw and JS URLs for substance {rml_cas}.")
            # logging.warning(f"No pages with valid URL pairs to process for substance {rml_cas}.")
            return False  # Nothing to generate

    except Exception as e:
        print(f"Error processing search result for '{cas_number_to_search}': {e}")
        traceback.print_exc()
        # logging.error(f"Error processing search result for '{cas_number_to_search}': {e}", exc_info=True)
        return False

    # --- 3. Prepare Folders ---
    safe_cas = rml_cas.replace('/', '_').replace('\\', '_')
    substance_folder_name = f"{safe_cas}_{rml_ec}_{rml_id}"
    substance_folder_path = os.path.join(base_output_folder, substance_folder_name)

    try:
        os.makedirs(substance_folder_path, exist_ok=True)
        # logging.info(f"Ensured output directory exists: {substance_folder_path}")
        print(f"Ensured output directory exists: {substance_folder_path}")
    except OSError as e:
        print(f"Error creating directory {substance_folder_path}: {e}")
        # logging.error(f"Failed to create directory {substance_folder_path}: {e}", exc_info=True)
        return False

|
||||
# --- 4. Process Each Page (Save HTML, Generate PDF) ---
|
||||
successful_pages = [] # Track successful PDF generations
|
||||
overall_success = False # Track if any PDF was generated
|
||||
|
||||
for page_name, raw_html_url, js_header_link in pages_to_process_list:
|
||||
print(f"\nProcessing page: {page_name}")
|
||||
base_filename = f"{safe_cas}_{page_name}"
|
||||
html_filename = f"{base_filename}.html"
|
||||
pdf_filename = f"{base_filename}.pdf"
|
||||
html_full_path = os.path.join(substance_folder_path, html_filename)
|
||||
pdf_full_path = os.path.join(substance_folder_path, pdf_filename)
|
||||
|
||||
# --- Save Raw HTML ---
|
||||
html_saved = False
|
||||
try:
|
||||
# logging.debug(f"Fetching raw HTML for {page_name} from {raw_html_url}")
|
||||
print(f"Fetching raw HTML from: {raw_html_url}")
|
||||
# Add headers to mimic a browser slightly if needed
|
||||
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
|
||||
response = requests.get(raw_html_url, timeout=30, headers=headers) # 30s timeout
|
||||
response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)
|
||||
|
||||
# Decide encoding - response.text tries to guess, or use apparent_encoding
|
||||
# Or assume utf-8 if unsure, which is common.
|
||||
html_content = response.content.decode('utf-8', errors='replace')
|
||||
|
||||
with open(html_full_path, 'w', encoding='utf-8') as f:
|
||||
f.write(html_content)
|
||||
html_saved = True
|
||||
# logging.info(f"Successfully saved raw HTML for {page_name} to {html_full_path}")
|
||||
print(f"Successfully saved raw HTML to: {html_full_path}")
|
||||
except requests.exceptions.RequestException as req_e:
|
||||
print(f"Error fetching raw HTML for {page_name} from {raw_html_url}: {req_e}")
|
||||
# logging.error(f"Error fetching raw HTML for {page_name}: {req_e}", exc_info=True)
|
||||
except IOError as io_e:
|
||||
print(f"Error saving raw HTML for {page_name} to {html_full_path}: {io_e}")
|
||||
# logging.error(f"Error saving raw HTML for {page_name}: {io_e}", exc_info=True)
|
||||
except Exception as e: # Catch other potential errors like decoding
|
||||
print(f"Unexpected error saving HTML for {page_name}: {e}")
|
||||
# logging.error(f"Unexpected error saving HTML for {page_name}: {e}", exc_info=True)
|
||||
|
||||
# --- Generate PDF (using raw URL for content, JS URL for header link) ---
|
||||
# logging.info(f"Generating PDF for {page_name} from {raw_html_url}")
|
||||
print(f"Generating PDF using content from: {raw_html_url}")
|
||||
pdf_success = generate_pdf_with_header_and_cleanup(
|
||||
url=raw_html_url, # Use raw URL for Playwright navigation/content
|
||||
pdf_path=pdf_full_path,
|
||||
substance_name=rml_name,
|
||||
substance_link=js_header_link, # Use JS URL for the link in the header
|
||||
ec_number=rml_ec,
|
||||
cas_number=rml_cas
|
||||
)
|
||||
|
||||
if pdf_success:
|
||||
successful_pages.append(page_name) # Log success based on PDF generation
|
||||
overall_success = True
|
||||
# logging.info(f"Successfully generated PDF for {page_name} at {pdf_full_path}")
|
||||
print(f"Successfully generated PDF for {page_name}")
|
||||
else:
|
||||
# logging.error(f"Failed to generate PDF for {page_name} from {raw_html_url}")
|
||||
print(f"Failed to generate PDF for {page_name}")
|
||||
# Decide if failure to save HTML should affect overall success or logging...
|
||||
# Currently, success is tied only to PDF generation.
|
||||
|
||||
print(f"===== Finished request for CAS: {cas_number_to_search} =====")
|
||||
print(f"Successfully generated {len(successful_pages)} PDFs: {successful_pages}")
|
||||
return overall_success # Return success based on PDF generation
|
||||
@@ -1,141 +0,0 @@
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="352.909" height="64.542" viewBox="0 0 352.909 64.542">
<defs>
<linearGradient id="linear-gradient" x1="0.499" y1="0.955" x2="0.501" y2="0.043" gradientUnits="objectBoundingBox">
<stop offset="0" stop-color="#002658"/>
<stop offset="0.99" stop-color="#0160ae"/>
</linearGradient>
<radialGradient id="radial-gradient" cx="0.502" cy="0.5" r="0.881" gradientUnits="objectBoundingBox">
<stop offset="0.34" stop-color="#0961ad"/>
<stop offset="1" stop-color="#1c2f5d"/>
</radialGradient>
<radialGradient id="radial-gradient-2" cx="0.795" cy="0.199" r="0.8" xlink:href="#radial-gradient"/>
<linearGradient id="linear-gradient-2" y1="0.499" x2="1" y2="0.499" gradientUnits="objectBoundingBox">
<stop offset="0" stop-color="#fff"/>
<stop offset="0" stop-color="#0961ad"/>
<stop offset="1" stop-color="#1c2f5d"/>
</linearGradient>
<linearGradient id="linear-gradient-3" x1="-3.244" y1="0.922" x2="0.926" y2="0.075" gradientUnits="objectBoundingBox">
<stop offset="0" stop-color="#fff"/>
<stop offset="0" stop-color="#f6d46a"/>
<stop offset="0.99" stop-color="#f8a71b"/>
</linearGradient>
<linearGradient id="linear-gradient-4" x1="-0.547" y1="0.499" x2="0.453" y2="0.499" gradientUnits="objectBoundingBox">
<stop offset="0" stop-color="#f6d46a"/>
<stop offset="0.99" stop-color="#f8a71b"/>
</linearGradient>
<linearGradient id="linear-gradient-5" x1="-0.17" y1="0.5" x2="0.83" y2="0.5" xlink:href="#linear-gradient-3"/>
<linearGradient id="linear-gradient-6" x1="0.5" y1="-0.199" x2="0.5" y2="1.353" gradientUnits="objectBoundingBox">
<stop offset="0" stop-color="#fff"/>
<stop offset="0" stop-color="#feca0a"/>
<stop offset="0.96" stop-color="#faaa1b"/>
<stop offset="0.99" stop-color="#f8a71b"/>
</linearGradient>
<linearGradient id="linear-gradient-7" x1="0.5" y1="-0.223" x2="0.5" y2="1.383" xlink:href="#linear-gradient-6"/>
<linearGradient id="linear-gradient-9" x1="0.5" y1="-0.223" x2="0.5" y2="1.383" xlink:href="#linear-gradient-6"/>
</defs>
<g id="Group_1542" data-name="Group 1542" transform="translate(-103 -146)">
<g id="Group_1535" data-name="Group 1535">
<path id="Path_484" data-name="Path 484" d="M219.034,36.851,202.609.06h-5.448L180.736,36.851h8.4l3.267-8.032h14.867l3.347,8.032h8.4ZM204.71,22.718h-9.73l4.905-11.831,4.825,11.831ZM165.185,36.851h7.71V.985h-7.71V15.118h-17.9V.985h-7.71V36.841h7.71V21.853h17.9V36.841h0Zm-48.713.935c3.659,0,8.172-.141,12.223-1.719V29.322A40.491,40.491,0,0,1,116.4,30.97c-8.012,0-10.5-6.092-10.5-12.053S108.39,6.865,116.4,6.865A40.491,40.491,0,0,1,128.7,8.514V1.779C124.645.2,120.131.06,116.472.06c-13.229,0-18.526,8.534-18.526,18.858s5.3,18.858,18.526,18.858h0ZM60.13,36.851H89.472V30.106H67.84V21.853H86.909V15.108H67.84V7.72H89.472V.985H60.13V36.841h0Z" transform="translate(103.644 146.001)" fill="url(#linear-gradient)"/>
<circle id="Ellipse_9" data-name="Ellipse 9" cx="2.02" cy="2.02" r="2.02" transform="translate(129.958 175.363)" fill="url(#radial-gradient)"/>
<path id="Path_485" data-name="Path 485" d="M38.618,37a5.358,5.358,0,0,1-1.689.593,2.892,2.892,0,0,0,.151-.935,3.016,3.016,0,1,0-6,.432c-2.563-.281-4.956-.241-5.2-.362-.02,0-6.413-5.247-15.561-.623a.494.494,0,0,0-.221.482A14.761,14.761,0,0,0,24.836,50.638,14.529,14.529,0,0,0,39.5,37.54a.587.587,0,0,0-.885-.553Z" transform="translate(103.383 146.175)" fill="url(#radial-gradient-2)"/>
<path id="Path_486" data-name="Path 486" d="M10.281,36.025s6.4-5.277,13.751-1.448a24.047,24.047,0,0,0,7.047,2.513s-23.18,1.3-20.788-1.066Z" transform="translate(103.383 146.173)" fill="url(#linear-gradient-2)"/>
<rect id="Rectangle_2083" data-name="Rectangle 2083" width="4.322" height="15.48" rx="2.15" transform="translate(113.755 196.405) rotate(44.01)" fill="url(#linear-gradient-3)"/>
<path id="Path_487" data-name="Path 487" d="M2.734,63.94A1.62,1.62,0,0,1,1.628,63.5L.483,62.392a1.59,1.59,0,0,1-.04-2.242l8.866-9.178a1.6,1.6,0,0,1,1.116-.483,1.623,1.623,0,0,1,1.126.442L12.7,52.038a1.587,1.587,0,0,1,.04,2.242L3.87,63.457a1.6,1.6,0,0,1-1.116.483h-.03Zm7.72-13.007h-.02a1.12,1.12,0,0,0-.8.352L.764,60.442a1.173,1.173,0,0,0-.322.814,1.12,1.12,0,0,0,.352.8L1.94,63.166a1.141,1.141,0,0,0,1.618-.03l8.866-9.178a1.144,1.144,0,0,0-.03-1.618l-1.146-1.106a1.109,1.109,0,0,0-.794-.322Z" transform="translate(103.33 146.263)" fill="url(#linear-gradient-4)"/>
<path id="Path_488" data-name="Path 488" d="M19.3.99H13.977a.461.461,0,0,0-.462.462V4.83a.461.461,0,0,0,.462.462h1.568a.461.461,0,0,1,.462.462V16.028l-.02,1.206a.454.454,0,0,1-.261.412A21.882,21.882,0,0,0,4.96,29.226a.465.465,0,0,0,.312.623l3.3.895a.47.47,0,0,0,.553-.261,17.583,17.583,0,0,1,10.334-9.761.465.465,0,0,0,.312-.442V1.462A.461.461,0,0,0,19.3,1Z" transform="translate(103.356 146.005)" fill="#003c75"/>
<path id="Path_489" data-name="Path 489" d="M36.551.99H31.042a.378.378,0,0,0-.372.372V19.989a.378.378,0,0,0,.553.332l3.026-1.639a.365.365,0,0,0,.191-.332V5.674a.378.378,0,0,1,.372-.372h1.749a.378.378,0,0,0,.372-.372V1.362A.378.378,0,0,0,36.561.99Z" transform="translate(103.49 146.005)" fill="#003c75"/>
<path id="Path_490" data-name="Path 490" d="M45.919,34.292A21.215,21.215,0,0,0,31.545,16.741h0l-.08-.03-.181-.06h0L30.8,16.51v3.629a.758.758,0,0,0,.523.724h.02A17.285,17.285,0,1,1,7.661,38.946a17.1,17.1,0,0,1,.02-4.182.285.285,0,0,0-.211-.312l-3.277-.875a.294.294,0,0,0-.362.231,20.916,20.916,0,0,0-.07,5.609,21.236,21.236,0,1,0,42.159-5.147Z" transform="translate(103.349 146.086)" fill="url(#linear-gradient-5)"/>
<path id="Path_491" data-name="Path 491" d="M224.13,18.9a34.878,34.878,0,0,1,.714-7.2,17.285,17.285,0,0,1,2.372-5.911,11.557,11.557,0,0,1,4.4-3.991A14.5,14.5,0,0,1,238.484.34a26.823,26.823,0,0,1,5.669.523,22.624,22.624,0,0,1,4.091,1.246V7.226c-.985-.412-1.9-.754-2.724-1.025a23.113,23.113,0,0,0-2.392-.643,17.986,17.986,0,0,0-2.292-.332c-.764-.06-1.548-.09-2.342-.09a8.147,8.147,0,0,0-4.282,1.055,8.105,8.105,0,0,0-2.825,2.915,14.1,14.1,0,0,0-1.558,4.383,28.844,28.844,0,0,0-.483,5.428,28.767,28.767,0,0,0,.483,5.428,13.693,13.693,0,0,0,1.558,4.373,7.871,7.871,0,0,0,2.825,2.915,8.147,8.147,0,0,0,4.282,1.055c.794,0,1.578-.03,2.342-.1a17.985,17.985,0,0,0,2.292-.332,23.113,23.113,0,0,0,2.392-.643c.824-.271,1.739-.613,2.724-1.025V35.7a22.624,22.624,0,0,1-4.091,1.246,26.823,26.823,0,0,1-5.669.523,14.5,14.5,0,0,1-6.866-1.458,11.787,11.787,0,0,1-4.4-3.991,17.489,17.489,0,0,1-2.372-5.911,34.27,34.27,0,0,1-.714-7.2Z" transform="translate(104.499 146.002)" fill="url(#linear-gradient-6)"/>
<path id="Path_492" data-name="Path 492" d="M238.486,37.8a14.953,14.953,0,0,1-7.026-1.5,11.974,11.974,0,0,1-4.523-4.111,17.723,17.723,0,0,1-2.413-6.021A34.94,34.94,0,0,1,223.8,18.9a34.94,34.94,0,0,1,.724-7.268,17.723,17.723,0,0,1,2.413-6.021A12.056,12.056,0,0,1,231.46,1.5,14.9,14.9,0,0,1,238.486,0a27.543,27.543,0,0,1,5.74.533A23.034,23.034,0,0,1,248.377,1.8l.2.09V7.73l-.462-.191c-.975-.412-1.89-.754-2.7-1.015a20.145,20.145,0,0,0-2.352-.633,21.324,21.324,0,0,0-2.252-.332c-.754-.06-1.538-.09-2.312-.09a7.909,7.909,0,0,0-4.111,1.005,7.711,7.711,0,0,0-2.7,2.8,13.738,13.738,0,0,0-1.518,4.272,28.991,28.991,0,0,0-.472,5.368,28.99,28.99,0,0,0,.472,5.368,13.738,13.738,0,0,0,1.518,4.272,7.79,7.79,0,0,0,2.7,2.8,7.91,7.91,0,0,0,4.111,1.005c.784,0,1.568-.03,2.312-.09a17.129,17.129,0,0,0,2.252-.332,22.352,22.352,0,0,0,2.352-.633c.814-.261,1.719-.613,2.7-1.015l.462-.191v5.84l-.2.09a23.872,23.872,0,0,1-4.152,1.267,27.543,27.543,0,0,1-5.74.533Zm0-37.133a14.408,14.408,0,0,0-6.715,1.417,11.475,11.475,0,0,0-4.282,3.88,17.122,17.122,0,0,0-2.322,5.8,34.262,34.262,0,0,0-.714,7.127,34.262,34.262,0,0,0,.714,7.127,17.122,17.122,0,0,0,2.322,5.8,11.31,11.31,0,0,0,4.282,3.88,14.256,14.256,0,0,0,6.715,1.417,26.193,26.193,0,0,0,5.6-.523,23.16,23.16,0,0,0,3.83-1.136v-4.4c-.824.332-1.588.623-2.292.844a23.888,23.888,0,0,1-2.433.653,20.7,20.7,0,0,1-2.332.342c-.764.06-1.568.1-2.372.1a8.456,8.456,0,0,1-4.453-1.106A8.346,8.346,0,0,1,231.1,28.85a14.484,14.484,0,0,1-1.6-4.483,29.507,29.507,0,0,1-.482-5.488,29.432,29.432,0,0,1,.482-5.488,14.322,14.322,0,0,1,1.6-4.483,8.346,8.346,0,0,1,2.935-3.036,8.5,8.5,0,0,1,4.453-1.1c.8,0,1.6.03,2.372.1a20.551,20.551,0,0,1,2.342.342,23.884,23.884,0,0,1,2.433.653q1.055.347,2.292.844V2.322a23.159,23.159,0,0,0-3.83-1.136,26.856,26.856,0,0,0-5.6-.523Z" transform="translate(104.497 146)" fill="#e68a00"/>
<path id="Path_493" data-name="Path 493" d="M275.544,36.836V20.662H260.516V36.836H255.49V.95h5.026V15.877h15.028V.95h5.026V36.836Z" transform="translate(104.663 146.005)" fill="url(#linear-gradient-7)"/>
<path id="Path_494" data-name="Path 494" d="M280.9,37.17h-5.689V21H260.86V37.17h-5.69V.62h5.69V15.547h14.354V.62H280.9Zm-5.026-.663h4.363V1.283h-4.363V16.211H260.186V1.283h-4.353V36.506h4.353V20.332h15.691Z" transform="translate(104.661 146.004)" fill="#e68a00"/>
<path id="Path_495" data-name="Path 495" d="M308.534,20.662H293.556V32.051h16.556v4.785H288.53V.95h21.582V5.735H293.556V15.877h14.978Z" transform="translate(104.835 146.005)" fill="url(#linear-gradient-7)"/>
<path id="Path_496" data-name="Path 496" d="M310.445,37.17H288.2V.62h22.245V6.068H293.889v9.479h14.978V21H293.889V31.721h16.556Zm-21.582-.663h20.908V32.385H293.216V20.332h14.978V16.211H293.216V5.4h16.556V1.283H288.863Z" transform="translate(104.833 146.004)" fill="#e68a00"/>
<path id="Path_497" data-name="Path 497" d="M335.9,32.051h-3.83l-10-23.11.241,27.895H317.38V.95h6.071l10.525,24.6L344.5.95h6.072V36.836h-4.926l.241-27.895-10,23.11Z" transform="translate(104.985 146.005)" fill="url(#linear-gradient-9)"/>
<path id="Path_498" data-name="Path 498" d="M350.916,37.17h-5.6l.231-26.588-9.439,21.8h-4.262l-9.429-21.8.231,26.588h-5.6V.62h6.634l10.3,24.085L344.291.62h6.634V37.17Zm-4.926-.663h4.262V1.283h-5.519l-10.746,25.11L323.242,1.283h-5.519V36.506h4.262l-.251-29.2L332.3,31.721h3.388L346.251,7.3,346,36.506Z" transform="translate(104.984 146.004)" fill="#e68a00"/>
<g id="Group_1497" data-name="Group 1497" transform="translate(163.734 192.803)">
<path id="Path_499" data-name="Path 499" d="M66.049,56.891H60.53V47h5.519v1.025H61.686v3.187h4.1v1.015h-4.1v3.639h4.363Z" transform="translate(-60.088 -46.558)" fill="#003c75"/>
<path id="Path_500" data-name="Path 500" d="M66.051,57.336H60.532a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h5.519a.446.446,0,0,1,.442.442v1.025a.446.446,0,0,1-.442.442h-3.92v2.3H65.79a.446.446,0,0,1,.442.442v1.015a.446.446,0,0,1-.442.442H62.131v2.754h3.92a.446.446,0,0,1,.442.442v1.025A.446.446,0,0,1,66.051,57.336Zm-5.076-.885H65.6v-.141h-3.92a.446.446,0,0,1-.442-.442V52.229a.446.446,0,0,1,.442-.442h3.659v-.131H61.678a.446.446,0,0,1-.442-.442V48.028a.446.446,0,0,1,.442-.442H65.6v-.131H60.975v9.007Z" transform="translate(-60.09 -46.56)" fill="#336"/>
</g>
<g id="Group_1498" data-name="Group 1498" transform="translate(175.786 192.652)">
<path id="Path_501" data-name="Path 501" d="M77.275,47.875A3.226,3.226,0,0,0,74.7,48.961a4.368,4.368,0,0,0-.945,2.975,4.53,4.53,0,0,0,.915,3.006A3.228,3.228,0,0,0,77.265,56a8.72,8.72,0,0,0,2.362-.372v1.005a7.15,7.15,0,0,1-2.533.382,4.3,4.3,0,0,1-3.378-1.327,5.452,5.452,0,0,1-1.186-3.77,5.991,5.991,0,0,1,.573-2.684,4.13,4.13,0,0,1,1.649-1.769,5.045,5.045,0,0,1,2.543-.623,6.138,6.138,0,0,1,2.724.573l-.483.985a5.258,5.258,0,0,0-2.252-.523Z" transform="translate(-72.078 -46.408)" fill="#003c75"/>
<path id="Path_502" data-name="Path 502" d="M77.086,57.467a4.743,4.743,0,0,1-3.709-1.478,5.9,5.9,0,0,1-1.3-4.061,6.371,6.371,0,0,1,.623-2.875,4.6,4.6,0,0,1,1.83-1.96A5.472,5.472,0,0,1,77.3,46.41a6.653,6.653,0,0,1,2.915.613.391.391,0,0,1,.221.251.45.45,0,0,1-.02.342l-.483.985a.431.431,0,0,1-.583.2,4.823,4.823,0,0,0-2.061-.483,2.768,2.768,0,0,0-2.242.935,3.972,3.972,0,0,0-.834,2.684,4.142,4.142,0,0,0,.8,2.714c.864,1.005,2.362,1.146,4.5.553a.455.455,0,0,1,.392.07.429.429,0,0,1,.171.352v1.005a.432.432,0,0,1-.291.412,7.827,7.827,0,0,1-2.694.412Zm.2-10.173a4.618,4.618,0,0,0-2.322.563,3.707,3.707,0,0,0-1.478,1.588,5.508,5.508,0,0,0-.523,2.483,5.091,5.091,0,0,0,1.076,3.478,3.847,3.847,0,0,0,3.046,1.176,7.462,7.462,0,0,0,2.091-.261V56.2c-2.222.513-3.84.2-4.845-.965a4.908,4.908,0,0,1-1.015-3.287,4.747,4.747,0,0,1,1.055-3.267,3.661,3.661,0,0,1,2.915-1.236,5.412,5.412,0,0,1,2.031.4l.09-.191A6.036,6.036,0,0,0,77.3,47.3Z" transform="translate(-72.08 -46.41)" fill="#336"/>
</g>
<g id="Group_1499" data-name="Group 1499" transform="translate(189.93 192.803)">
<path id="Path_503" data-name="Path 503" d="M94.089,56.891H92.943V52.237H87.736v4.654H86.59V47h1.146v4.212h5.207V47h1.146Z" transform="translate(-86.148 -46.558)" fill="#003c75"/>
<path id="Path_504" data-name="Path 504" d="M94.091,57.336H92.945a.446.446,0,0,1-.442-.442V52.682H88.181v4.212a.446.446,0,0,1-.442.442H86.592a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h1.146a.446.446,0,0,1,.442.442v3.77H92.5V47a.446.446,0,0,1,.442-.442h1.146a.446.446,0,0,1,.442.442v9.891A.446.446,0,0,1,94.091,57.336Zm-.7-.885h.261V47.445h-.261v3.77a.446.446,0,0,1-.442.442H87.738a.446.446,0,0,1-.442-.442v-3.77h-.261v9.007H87.3V52.239a.446.446,0,0,1,.442-.442h5.207a.446.446,0,0,1,.442.442Z" transform="translate(-86.15 -46.56)" fill="#336"/>
</g>
<g id="Group_1500" data-name="Group 1500" transform="translate(203.644 192.763)">
<path id="Path_505" data-name="Path 505" d="M107.819,56.892l-1.236-3.146h-3.971L101.4,56.892H100.23l3.91-9.932h.965L109,56.892h-1.186Zm-1.588-4.182-1.146-3.066c-.151-.392-.3-.864-.462-1.427a14.321,14.321,0,0,1-.422,1.427l-1.166,3.066Z" transform="translate(-99.79 -46.518)" fill="#003c75"/>
<path id="Path_506" data-name="Path 506" d="M109.008,57.336h-1.186a.436.436,0,0,1-.412-.281l-1.126-2.865h-3.357l-1.106,2.865a.446.446,0,0,1-.412.281h-1.166a.445.445,0,0,1-.422-.6l3.91-9.932a.446.446,0,0,1,.412-.281h.965a.436.436,0,0,1,.412.281l3.89,9.932a.431.431,0,0,1-.05.412.444.444,0,0,1-.372.191Zm-.885-.885h.241L104.826,47.4h-.362l-3.558,9.047h.211l1.106-2.865a.446.446,0,0,1,.412-.281h3.971a.436.436,0,0,1,.412.281l1.126,2.865Zm-1.89-3.3h-3.2a.426.426,0,0,1-.362-.191.46.46,0,0,1-.05-.412l1.166-3.066c.171-.493.312-.955.412-1.367a.439.439,0,0,1,.422-.342.465.465,0,0,1,.442.322c.151.543.3,1.015.442,1.387l1.156,3.066a.46.46,0,0,1-.05.412.439.439,0,0,1-.362.191Zm-2.553-.885h1.92l-.925-2.463s-.02-.05-.03-.07c0,.02-.02.04-.02.07l-.935,2.473Z" transform="translate(-99.793 -46.52)" fill="#336"/>
</g>
<g id="Group_1501" data-name="Group 1501" transform="translate(226.59 192.652)">
<path id="Path_507" data-name="Path 507" d="M127.815,47.875a3.226,3.226,0,0,0-2.573,1.086,4.368,4.368,0,0,0-.945,2.975,4.53,4.53,0,0,0,.915,3.006A3.229,3.229,0,0,0,127.8,56a8.72,8.72,0,0,0,2.362-.372v1.005a7.15,7.15,0,0,1-2.533.382,4.3,4.3,0,0,1-3.378-1.327,5.452,5.452,0,0,1-1.186-3.77,5.991,5.991,0,0,1,.573-2.684,4.13,4.13,0,0,1,1.649-1.769,5.045,5.045,0,0,1,2.543-.623,6.138,6.138,0,0,1,2.724.573l-.483.985a5.258,5.258,0,0,0-2.252-.523Z" transform="translate(-122.618 -46.408)" fill="#003c75"/>
<path id="Path_508" data-name="Path 508" d="M127.626,57.467a4.743,4.743,0,0,1-3.709-1.478,5.9,5.9,0,0,1-1.3-4.061,6.37,6.37,0,0,1,.623-2.875,4.6,4.6,0,0,1,1.829-1.96,5.472,5.472,0,0,1,2.764-.684,6.653,6.653,0,0,1,2.915.613.391.391,0,0,1,.221.251.449.449,0,0,1-.02.342l-.482.985a.431.431,0,0,1-.583.2,4.823,4.823,0,0,0-2.061-.483,2.768,2.768,0,0,0-2.242.935,3.972,3.972,0,0,0-.834,2.684,4.142,4.142,0,0,0,.8,2.714c.864,1.005,2.362,1.146,4.5.553a.455.455,0,0,1,.392.07.429.429,0,0,1,.171.352v1.005a.431.431,0,0,1-.292.412,7.826,7.826,0,0,1-2.694.412Zm.2-10.173a4.618,4.618,0,0,0-2.322.563,3.707,3.707,0,0,0-1.478,1.588,5.508,5.508,0,0,0-.523,2.483,5.09,5.09,0,0,0,1.076,3.478,3.847,3.847,0,0,0,3.046,1.176,7.462,7.462,0,0,0,2.091-.261V56.2c-2.222.513-3.84.2-4.845-.965a4.908,4.908,0,0,1-1.015-3.287,4.747,4.747,0,0,1,1.056-3.267,3.661,3.661,0,0,1,2.915-1.236,5.412,5.412,0,0,1,2.031.4l.09-.191a6.036,6.036,0,0,0-2.111-.352Z" transform="translate(-122.62 -46.41)" fill="#336"/>
</g>
<g id="Group_1502" data-name="Group 1502" transform="translate(240.733 192.803)">
<path id="Path_509" data-name="Path 509" d="M144.629,56.891h-1.146V52.237h-5.207v4.654H137.13V47h1.146v4.212h5.207V47h1.146Z" transform="translate(-136.688 -46.558)" fill="#003c75"/>
<path id="Path_510" data-name="Path 510" d="M144.631,57.336h-1.146a.446.446,0,0,1-.442-.442V52.682h-4.322v4.212a.446.446,0,0,1-.442.442h-1.146a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h1.146a.446.446,0,0,1,.442.442v3.77h4.322V47a.446.446,0,0,1,.442-.442h1.146a.446.446,0,0,1,.442.442v9.891A.446.446,0,0,1,144.631,57.336Zm-.7-.885h.261V47.445h-.261v3.77a.446.446,0,0,1-.442.442h-5.207a.446.446,0,0,1-.442-.442v-3.77h-.261v9.007h.261V52.239a.446.446,0,0,1,.442-.442h5.207a.446.446,0,0,1,.442.442Z" transform="translate(-136.69 -46.56)" fill="#336"/>
</g>
<g id="Group_1503" data-name="Group 1503" transform="translate(255.801 192.803)">
<path id="Path_511" data-name="Path 511" d="M157.639,56.891H152.12V47h5.519v1.025h-4.363v3.187h4.1v1.015h-4.1v3.639h4.363Z" transform="translate(-151.678 -46.558)" fill="#003c75"/>
<path id="Path_512" data-name="Path 512" d="M157.641,57.336h-5.519a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h5.519a.446.446,0,0,1,.442.442v1.025a.446.446,0,0,1-.442.442h-3.92v2.3h3.659a.446.446,0,0,1,.442.442v1.015a.446.446,0,0,1-.442.442h-3.659v2.754h3.92a.446.446,0,0,1,.442.442v1.025A.446.446,0,0,1,157.641,57.336Zm-5.076-.885h4.624v-.141h-3.92a.446.446,0,0,1-.442-.442V52.229a.446.446,0,0,1,.442-.442h3.659v-.131h-3.659a.446.446,0,0,1-.442-.442V48.028a.446.446,0,0,1,.442-.442h3.92v-.131h-4.624v9.007Z" transform="translate(-151.68 -46.56)" fill="#336"/>
</g>
<g id="Group_1504" data-name="Group 1504" transform="translate(268.397 192.803)">
<path id="Path_513" data-name="Path 513" d="M169.013,56.891l-3.357-8.776h-.05q.091,1.04.09,2.473v6.293H164.63V46.99h1.729l3.136,8.162h.05L172.7,46.99h1.719v9.891h-1.146V50.508q0-1.1.09-2.382h-.05l-3.388,8.755H169Z" transform="translate(-164.208 -46.558)" fill="#003c75"/>
<path id="Path_514" data-name="Path 514" d="M174.433,57.336h-1.146a.446.446,0,0,1-.442-.442V50.621l-2.483,6.433a.446.446,0,0,1-.412.281h-.925a.436.436,0,0,1-.412-.281l-2.453-6.423v6.262a.446.446,0,0,1-.442.442h-1.065a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h1.729a.436.436,0,0,1,.412.281L169.538,54l2.774-7.157a.446.446,0,0,1,.412-.281h1.719a.446.446,0,0,1,.442.442v9.891a.446.446,0,0,1-.442.442Zm-.7-.885h.261V47.445h-.975l-3.046,7.881a.446.446,0,0,1-.412.281h-.05a.436.436,0,0,1-.412-.281l-3.026-7.881h-.985v9.007h.171V50.6c0-.935-.03-1.759-.09-2.433a.469.469,0,0,1,.111-.342.44.44,0,0,1,.332-.141h.05a.436.436,0,0,1,.412.281l3.247,8.484h.322l3.277-8.474a.446.446,0,0,1,.412-.281h.05a.434.434,0,0,1,.322.141.454.454,0,0,1,.121.332c-.06.844-.09,1.638-.09,2.352Z" transform="translate(-164.21 -46.56)" fill="#336"/>
</g>
<g id="Group_1505" data-name="Group 1505" transform="translate(285.767 192.803)">
<path id="Path_515" data-name="Path 515" d="M181.92,56.891V47h1.146v9.891Z" transform="translate(-181.488 -46.558)" fill="#003c75"/>
<path id="Path_516" data-name="Path 516" d="M183.078,57.336h-1.146a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h1.146a.446.446,0,0,1,.442.442v9.891A.446.446,0,0,1,183.078,57.336Zm-.7-.885h.261V47.445h-.261Z" transform="translate(-181.49 -46.56)" fill="#336"/>
</g>
<g id="Group_1506" data-name="Group 1506" transform="translate(293.969 192.652)">
<path id="Path_517" data-name="Path 517" d="M194.845,47.875a3.226,3.226,0,0,0-2.573,1.086,4.368,4.368,0,0,0-.945,2.975,4.531,4.531,0,0,0,.915,3.006A3.229,3.229,0,0,0,194.835,56a8.719,8.719,0,0,0,2.362-.372v1.005a7.15,7.15,0,0,1-2.533.382,4.3,4.3,0,0,1-3.378-1.327,5.452,5.452,0,0,1-1.186-3.77,5.991,5.991,0,0,1,.573-2.684,4.13,4.13,0,0,1,1.649-1.769,5.045,5.045,0,0,1,2.543-.623,6.138,6.138,0,0,1,2.724.573l-.482.985a5.258,5.258,0,0,0-2.252-.523Z" transform="translate(-189.648 -46.408)" fill="#003c75"/>
<path id="Path_518" data-name="Path 518" d="M194.656,57.467a4.743,4.743,0,0,1-3.709-1.478,5.9,5.9,0,0,1-1.3-4.061,6.371,6.371,0,0,1,.623-2.875,4.6,4.6,0,0,1,1.83-1.96,5.472,5.472,0,0,1,2.764-.684,6.653,6.653,0,0,1,2.915.613.392.392,0,0,1,.221.251.45.45,0,0,1-.02.342l-.482.985a.431.431,0,0,1-.583.2,4.823,4.823,0,0,0-2.061-.483,2.767,2.767,0,0,0-2.242.935,3.972,3.972,0,0,0-.834,2.684,4.142,4.142,0,0,0,.8,2.714c.865,1.005,2.362,1.146,4.5.553a.455.455,0,0,1,.392.07.429.429,0,0,1,.171.352v1.005a.432.432,0,0,1-.292.412,7.826,7.826,0,0,1-2.694.412Zm.2-10.173a4.618,4.618,0,0,0-2.322.563,3.707,3.707,0,0,0-1.478,1.588,5.508,5.508,0,0,0-.523,2.483,5.091,5.091,0,0,0,1.076,3.478,3.847,3.847,0,0,0,3.046,1.176,7.462,7.462,0,0,0,2.091-.261V56.2c-2.221.513-3.84.2-4.845-.965a4.908,4.908,0,0,1-1.015-3.287,4.747,4.747,0,0,1,1.055-3.267,3.661,3.661,0,0,1,2.915-1.236,5.412,5.412,0,0,1,2.031.4l.09-.191a6.036,6.036,0,0,0-2.111-.352Z" transform="translate(-189.65 -46.41)" fill="#336"/>
</g>
<g id="Group_1507" data-name="Group 1507" transform="translate(306.738 192.763)">
<path id="Path_519" data-name="Path 519" d="M210.369,56.892l-1.236-3.146h-3.971l-1.216,3.146H202.78l3.91-9.932h.965l3.89,9.932h-1.186Zm-1.588-4.182-1.146-3.066c-.151-.392-.3-.864-.462-1.427a14.3,14.3,0,0,1-.422,1.427l-1.166,3.066Z" transform="translate(-202.351 -46.518)" fill="#003c75"/>
<path id="Path_520" data-name="Path 520" d="M211.568,57.336h-1.186a.436.436,0,0,1-.412-.281l-1.126-2.865h-3.357l-1.106,2.865a.446.446,0,0,1-.412.281H202.8a.445.445,0,0,1-.422-.6l3.91-9.932a.446.446,0,0,1,.412-.281h.965a.436.436,0,0,1,.412.281l3.89,9.932a.431.431,0,0,1-.05.412.444.444,0,0,1-.372.191Zm-.885-.885h.241L207.386,47.4h-.362l-3.558,9.047h.211l1.106-2.865a.446.446,0,0,1,.412-.281h3.971a.436.436,0,0,1,.412.281l1.126,2.865Zm-1.89-3.3h-3.2a.426.426,0,0,1-.362-.191.46.46,0,0,1-.05-.412l1.166-3.066c.171-.493.312-.955.412-1.367a.439.439,0,0,1,.422-.342.465.465,0,0,1,.442.322c.151.543.3,1.015.442,1.387l1.156,3.066a.459.459,0,0,1-.05.412.439.439,0,0,1-.362.191Zm-2.553-.885h1.92l-.925-2.463s-.02-.05-.03-.07c0,.02-.02.04-.02.07l-.935,2.473Z" transform="translate(-202.353 -46.52)" fill="#336"/>
</g>
<g id="Group_1508" data-name="Group 1508" transform="translate(321.733 192.803)">
<path id="Path_521" data-name="Path 521" d="M217.71,56.891V47h1.146v8.856h4.363V56.9H217.7Z" transform="translate(-217.268 -46.558)" fill="#003c75"/>
<path id="Path_522" data-name="Path 522" d="M223.231,57.336h-5.519a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h1.146A.446.446,0,0,1,219.3,47v8.4h3.92a.446.446,0,0,1,.442.442v1.045a.446.446,0,0,1-.442.442Zm-5.076-.885h4.624V56.3h-3.92a.446.446,0,0,1-.442-.442v-8.4h-.261v9.007Z" transform="translate(-217.27 -46.56)" fill="#336"/>
</g>
<g id="Group_1509" data-name="Group 1509" transform="translate(333.153 192.652)">
<path id="Path_523" data-name="Path 523" d="M235.292,54.258a2.429,2.429,0,0,1-.945,2.041,4.142,4.142,0,0,1-2.573.734,6.435,6.435,0,0,1-2.7-.452V55.475a6.4,6.4,0,0,0,1.327.4,6.938,6.938,0,0,0,1.417.151,2.847,2.847,0,0,0,1.729-.432,1.447,1.447,0,0,0,.583-1.216,1.547,1.547,0,0,0-.211-.844,1.9,1.9,0,0,0-.694-.6,10.945,10.945,0,0,0-1.468-.633,4.756,4.756,0,0,1-1.97-1.166,2.6,2.6,0,0,1-.593-1.769,2.2,2.2,0,0,1,.864-1.819,3.554,3.554,0,0,1,2.272-.673,6.817,6.817,0,0,1,2.714.543l-.362,1.005a6.132,6.132,0,0,0-2.382-.513,2.3,2.3,0,0,0-1.427.392,1.29,1.29,0,0,0-.513,1.086,1.641,1.641,0,0,0,.191.844,1.789,1.789,0,0,0,.643.6,8,8,0,0,0,1.377.6,5.533,5.533,0,0,1,2.141,1.186,2.329,2.329,0,0,1,.583,1.649Z" transform="translate(-228.628 -46.408)" fill="#003c75"/>
<path id="Path_524" data-name="Path 524" d="M231.776,57.467a6.792,6.792,0,0,1-2.895-.493.448.448,0,0,1-.251-.4V55.467a.429.429,0,0,1,.2-.372.449.449,0,0,1,.422-.04,6.46,6.46,0,0,0,1.246.382,6.692,6.692,0,0,0,1.327.141,2.483,2.483,0,0,0,1.468-.352,1.011,1.011,0,0,0,.4-.864,1.139,1.139,0,0,0-.141-.6,1.546,1.546,0,0,0-.533-.452,8.992,8.992,0,0,0-1.4-.593,5.126,5.126,0,0,1-2.161-1.3,3.048,3.048,0,0,1-.7-2.061,2.632,2.632,0,0,1,1.025-2.171,4.025,4.025,0,0,1,2.553-.774,7.062,7.062,0,0,1,2.9.583.444.444,0,0,1,.241.553l-.362,1.005a.473.473,0,0,1-.241.261.43.43,0,0,1-.352,0,5.636,5.636,0,0,0-2.211-.483,1.887,1.887,0,0,0-1.156.3.862.862,0,0,0-.342.734,1.222,1.222,0,0,0,.131.623,1.4,1.4,0,0,0,.482.442,6.755,6.755,0,0,0,1.3.563,5.849,5.849,0,0,1,2.322,1.307,2.8,2.8,0,0,1,.7,1.95,2.857,2.857,0,0,1-1.116,2.392,4.576,4.576,0,0,1-2.845.824Zm-2.262-1.2a6.862,6.862,0,0,0,2.262.3,3.724,3.724,0,0,0,2.3-.643,1.989,1.989,0,0,0,.774-1.689,1.869,1.869,0,0,0-.472-1.347,5.01,5.01,0,0,0-1.96-1.076,8.118,8.118,0,0,1-1.457-.643,1.943,1.943,0,0,1-1.046-1.829,1.759,1.759,0,0,1,.694-1.447,2.748,2.748,0,0,1,1.7-.483,6.393,6.393,0,0,1,2.121.382l.06-.171a6.524,6.524,0,0,0-2.151-.352,3.137,3.137,0,0,0-2,.583,1.757,1.757,0,0,0-.694,1.468,2.162,2.162,0,0,0,.482,1.478,4.3,4.3,0,0,0,1.789,1.045,9.489,9.489,0,0,1,1.548.663,2.323,2.323,0,0,1,.844.754,2.01,2.01,0,0,1,.271,1.076,1.871,1.871,0,0,1-.764,1.568,3.27,3.27,0,0,1-2,.523,7.06,7.06,0,0,1-1.508-.161c-.271-.06-.543-.121-.794-.2v.181Z" transform="translate(-228.63 -46.41)" fill="#336"/>
</g>
<g id="Group_1510" data-name="Group 1510" transform="translate(354.724 192.803)">
<path id="Path_525" data-name="Path 525" d="M258.431,51.845a5.018,5.018,0,0,1-1.327,3.749,5.258,5.258,0,0,1-3.83,1.3H250.53V47h3.036a4.434,4.434,0,0,1,4.865,4.845Zm-1.216.04a3.96,3.96,0,0,0-.975-2.915,3.887,3.887,0,0,0-2.885-.985h-1.669v7.9h1.4a4.285,4.285,0,0,0,3.1-1.015,3.986,3.986,0,0,0,1.035-3Z" transform="translate(-250.088 -46.558)" fill="#003c75"/>
<path id="Path_526" data-name="Path 526" d="M253.277,57.336h-2.744a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h3.036a4.875,4.875,0,0,1,5.307,5.3,5.418,5.418,0,0,1-1.468,4.061,5.7,5.7,0,0,1-4.141,1.417Zm-2.292-.885h2.292a4.87,4.87,0,0,0,3.518-1.166,4.6,4.6,0,0,0,1.2-3.428,4,4,0,0,0-4.423-4.4h-2.583v9.007Zm2.111-.111h-1.4a.446.446,0,0,1-.442-.442V48a.446.446,0,0,1,.442-.442h1.669a4.319,4.319,0,0,1,3.207,1.116,4.387,4.387,0,0,1,1.1,3.227,4.522,4.522,0,0,1-1.166,3.317,4.712,4.712,0,0,1-3.408,1.136Zm-.955-.885h.955a3.888,3.888,0,0,0,2.784-.885,3.585,3.585,0,0,0,.9-2.674,3.006,3.006,0,0,0-3.418-3.448h-1.226v7.016Z" transform="translate(-250.09 -46.56)" fill="#336"/>
</g>
<g id="Group_1511" data-name="Group 1511" transform="translate(368.338 192.763)">
<path id="Path_527" data-name="Path 527" d="M271.649,56.892l-1.236-3.146h-3.971l-1.216,3.146H264.06l3.91-9.932h.965l3.89,9.932h-1.186Zm-1.588-4.182-1.146-3.066c-.151-.392-.3-.864-.462-1.427a14.326,14.326,0,0,1-.422,1.427l-1.166,3.066h3.2Z" transform="translate(-263.63 -46.518)" fill="#003c75"/>
<path id="Path_528" data-name="Path 528" d="M272.848,57.336h-1.186a.436.436,0,0,1-.412-.281l-1.126-2.865h-3.357l-1.106,2.865a.446.446,0,0,1-.412.281h-1.166a.445.445,0,0,1-.422-.6l3.91-9.932a.446.446,0,0,1,.412-.281h.965a.436.436,0,0,1,.412.281l3.89,9.932a.431.431,0,0,1-.05.412.444.444,0,0,1-.372.191Zm-.885-.885h.241L268.666,47.4H268.3l-3.558,9.047h.211l1.106-2.865a.446.446,0,0,1,.412-.281h3.971a.436.436,0,0,1,.412.281l1.126,2.865Zm-1.89-3.3h-3.2a.426.426,0,0,1-.362-.191.459.459,0,0,1-.05-.412l1.166-3.066c.171-.493.312-.955.412-1.367a.439.439,0,0,1,.422-.342.456.456,0,0,1,.442.322c.151.543.3,1.015.442,1.387l1.156,3.066a.46.46,0,0,1-.05.412.439.439,0,0,1-.362.191Zm-2.553-.885h1.92l-.925-2.463s-.02-.05-.03-.07c0,.02-.02.04-.02.07l-.935,2.473Z" transform="translate(-263.633 -46.52)" fill="#336"/>
</g>
<g id="Group_1512" data-name="Group 1512" transform="translate(382.096 192.803)">
<path id="Path_529" data-name="Path 529" d="M282.042,56.891H280.9V48.015H277.76V46.99h7.419v1.025h-3.136Z" transform="translate(-277.318 -46.558)" fill="#003c75"/>
<path id="Path_530" data-name="Path 530" d="M282.044,57.336H280.9a.446.446,0,0,1-.442-.442V48.47h-2.694a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h7.418a.446.446,0,0,1,.442.442v1.025a.446.446,0,0,1-.442.442h-2.694v8.424A.446.446,0,0,1,282.044,57.336Zm-.7-.885h.261V48.028a.446.446,0,0,1,.442-.442h2.694v-.131h-6.524v.131h2.694a.446.446,0,0,1,.442.442v8.424Z" transform="translate(-277.32 -46.56)" fill="#336"/>
</g>
<g id="Group_1513" data-name="Group 1513" transform="translate(394.504 192.763)">
<path id="Path_531" data-name="Path 531" d="M297.679,56.892l-1.237-3.146h-3.971l-1.216,3.146H290.09L294,46.96h.965l3.89,9.932h-1.186Zm-1.588-4.182-1.146-3.066c-.151-.392-.3-.864-.462-1.427a14.322,14.322,0,0,1-.422,1.427l-1.166,3.066Z" transform="translate(-289.66 -46.518)" fill="#003c75"/>
<path id="Path_532" data-name="Path 532" d="M298.878,57.336h-1.186a.436.436,0,0,1-.412-.281l-1.126-2.865H292.8l-1.106,2.865a.446.446,0,0,1-.412.281h-1.166a.445.445,0,0,1-.422-.6L293.6,46.8a.446.446,0,0,1,.412-.281h.965a.436.436,0,0,1,.412.281l3.89,9.932a.431.431,0,0,1-.05.412.457.457,0,0,1-.372.191Zm-.884-.885h.241L294.686,47.4h-.362l-3.558,9.047h.211l1.106-2.865a.446.446,0,0,1,.412-.281h3.971a.436.436,0,0,1,.412.281L298,56.452Zm-1.89-3.3h-3.2a.426.426,0,0,1-.362-.191.459.459,0,0,1-.05-.412l1.166-3.066c.171-.493.312-.955.412-1.367a.439.439,0,0,1,.422-.342.465.465,0,0,1,.442.322c.151.543.3,1.015.442,1.387l1.156,3.066a.46.46,0,0,1-.05.412.439.439,0,0,1-.362.191Zm-2.553-.885h1.92l-.925-2.463s-.02-.05-.03-.07c0,.02-.02.05-.02.07l-.935,2.473Z" transform="translate(-289.663 -46.52)" fill="#336"/>
</g>
<g id="Group_1514" data-name="Group 1514" transform="translate(409.509 192.803)">
<path id="Path_533" data-name="Path 533" d="M305.02,46.99h2.795a5.238,5.238,0,0,1,2.845.593,2.091,2.091,0,0,1,.885,1.86,2.125,2.125,0,0,1-.492,1.448,2.362,2.362,0,0,1-1.427.744v.07c1.5.261,2.252,1.045,2.252,2.372a2.551,2.551,0,0,1-.895,2.071,3.831,3.831,0,0,1-2.5.744H305.03V47Zm1.146,4.232h1.9a3.086,3.086,0,0,0,1.749-.382,1.479,1.479,0,0,0,.533-1.287,1.33,1.33,0,0,0-.593-1.206,3.736,3.736,0,0,0-1.89-.372h-1.689v3.237Zm0,.975v3.7h2.061a2.915,2.915,0,0,0,1.8-.462,1.716,1.716,0,0,0,.6-1.448,1.552,1.552,0,0,0-.623-1.357,3.289,3.289,0,0,0-1.88-.432h-1.97Z" transform="translate(-304.588 -46.558)" fill="#003c75"/>
<path id="Path_534" data-name="Path 534" d="M308.48,57.336h-3.448a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h2.794a5.644,5.644,0,0,1,3.1.663A2.51,2.51,0,0,1,312,49.455a2.6,2.6,0,0,1-.593,1.739,2.753,2.753,0,0,1-.513.452,2.559,2.559,0,0,1,1.447,2.443,2.977,2.977,0,0,1-1.055,2.413,4.254,4.254,0,0,1-2.795.844Zm-3.006-.885h3.006a3.365,3.365,0,0,0,2.221-.643,2.109,2.109,0,0,0,.734-1.729,1.679,1.679,0,0,0-.965-1.659,2.022,2.022,0,0,1,.623,1.568,2.14,2.14,0,0,1-.784,1.809,3.341,3.341,0,0,1-2.071.553h-2.061a.446.446,0,0,1-.442-.442v-3.7a.446.446,0,0,1,.442-.442h1.97a5.693,5.693,0,0,1,1.066.09.341.341,0,0,1-.02-.141v-.131a5.282,5.282,0,0,1-1.116.1h-1.9a.446.446,0,0,1-.442-.442V48.007a.446.446,0,0,1,.442-.442h1.689A4.121,4.121,0,0,1,310,48a1.726,1.726,0,0,1,.8,1.578,2.12,2.12,0,0,1-.372,1.307,1.432,1.432,0,0,0,.292-.261,1.7,1.7,0,0,0,.382-1.166,1.633,1.633,0,0,0-.683-1.488,4.919,4.919,0,0,0-2.6-.513h-2.352v9.007Zm1.146-.985h1.618a2.55,2.55,0,0,0,1.538-.372,1.282,1.282,0,0,0,.432-1.1,1.043,1.043,0,0,0-.432-.985,2.943,2.943,0,0,0-1.628-.352h-1.528v2.815Zm0-4.674h1.448a2.6,2.6,0,0,0,1.5-.3,1.074,1.074,0,0,0,.352-.925.861.861,0,0,0-.382-.824,3.2,3.2,0,0,0-1.659-.3h-1.246v2.352Z" transform="translate(-304.59 -46.56)" fill="#336"/>
</g>
<g id="Group_1515" data-name="Group 1515" transform="translate(421.986 192.763)">
<path id="Path_535" data-name="Path 535" d="M325.019,56.892l-1.236-3.146h-3.971L318.6,56.892H317.43l3.91-9.932h.965l3.89,9.932h-1.186Zm-1.588-4.182-1.146-3.066c-.151-.392-.3-.864-.463-1.427a14.318,14.318,0,0,1-.422,1.427l-1.166,3.066h3.2Z" transform="translate(-317 -46.518)" fill="#003c75"/>
<path id="Path_536" data-name="Path 536" d="M326.218,57.336h-1.186a.436.436,0,0,1-.412-.281l-1.126-2.865h-3.357l-1.106,2.865a.446.446,0,0,1-.412.281h-1.166a.445.445,0,0,1-.422-.6l3.91-9.932a.446.446,0,0,1,.412-.281h.965a.436.436,0,0,1,.412.281l3.89,9.932a.431.431,0,0,1-.05.412.457.457,0,0,1-.372.191Zm-.885-.885h.241L322.026,47.4h-.362l-3.559,9.047h.211l1.106-2.865a.446.446,0,0,1,.412-.281h3.971a.436.436,0,0,1,.412.281l1.126,2.865Zm-1.89-3.3h-3.2a.426.426,0,0,1-.362-.191.46.46,0,0,1-.05-.412L321,49.485c.171-.493.312-.955.412-1.367a.439.439,0,0,1,.422-.342.456.456,0,0,1,.442.322c.151.543.3,1.015.442,1.387l1.156,3.066a.46.46,0,0,1-.05.412.439.439,0,0,1-.362.191Zm-2.553-.885h1.92l-.925-2.463s-.02-.05-.03-.07c0,.02-.02.05-.02.07L320.9,52.28Z" transform="translate(-317.003 -46.52)" fill="#336"/>
</g>
<g id="Group_1516" data-name="Group 1516" transform="translate(436.338 192.662)">
<path id="Path_537" data-name="Path 537" d="M337.952,54.258a2.43,2.43,0,0,1-.945,2.041,4.079,4.079,0,0,1-2.573.734,6.436,6.436,0,0,1-2.7-.452V55.475a6.4,6.4,0,0,0,1.327.4,6.939,6.939,0,0,0,1.417.151A2.848,2.848,0,0,0,336.2,55.6a1.422,1.422,0,0,0,.583-1.216,1.547,1.547,0,0,0-.211-.844,1.9,1.9,0,0,0-.694-.6,10.945,10.945,0,0,0-1.468-.633,4.756,4.756,0,0,1-1.97-1.166,2.578,2.578,0,0,1-.593-1.769,2.2,2.2,0,0,1,.865-1.819,3.554,3.554,0,0,1,2.272-.673,6.816,6.816,0,0,1,2.714.543l-.362,1.005a6.131,6.131,0,0,0-2.382-.513,2.3,2.3,0,0,0-1.427.392,1.29,1.29,0,0,0-.513,1.086,1.641,1.641,0,0,0,.191.844,1.787,1.787,0,0,0,.643.6,8.367,8.367,0,0,0,1.377.6,5.531,5.531,0,0,1,2.141,1.186,2.329,2.329,0,0,1,.583,1.649Z" transform="translate(-331.278 -46.418)" fill="#003c75"/>
<path id="Path_538" data-name="Path 538" d="M334.426,57.467a6.792,6.792,0,0,1-2.9-.493.448.448,0,0,1-.251-.4V55.467a.429.429,0,0,1,.2-.372.449.449,0,0,1,.422-.04,6.458,6.458,0,0,0,1.246.382,6.692,6.692,0,0,0,1.327.141,2.483,2.483,0,0,0,1.468-.352,1,1,0,0,0,.4-.854,1.138,1.138,0,0,0-.141-.6,1.546,1.546,0,0,0-.533-.452,9.455,9.455,0,0,0-1.4-.593,5.127,5.127,0,0,1-2.161-1.3,3.049,3.049,0,0,1-.7-2.061,2.632,2.632,0,0,1,1.025-2.171,4.024,4.024,0,0,1,2.553-.774,7.062,7.062,0,0,1,2.9.583.444.444,0,0,1,.241.553l-.362,1.005a.473.473,0,0,1-.241.261.429.429,0,0,1-.352,0,5.637,5.637,0,0,0-2.212-.482,1.887,1.887,0,0,0-1.156.3.862.862,0,0,0-.342.734,1.224,1.224,0,0,0,.131.623,1.4,1.4,0,0,0,.482.442,6.758,6.758,0,0,0,1.3.563,5.849,5.849,0,0,1,2.322,1.307,2.8,2.8,0,0,1,.7,1.95,2.858,2.858,0,0,1-1.116,2.392,4.577,4.577,0,0,1-2.845.824Zm-2.262-1.2a6.861,6.861,0,0,0,2.262.3,3.789,3.789,0,0,0,2.3-.633,1.989,1.989,0,0,0,.774-1.689,1.869,1.869,0,0,0-.472-1.347,5.012,5.012,0,0,0-1.96-1.076,8.11,8.11,0,0,1-1.458-.643,1.943,1.943,0,0,1-1.045-1.829,1.778,1.778,0,0,1,.683-1.447,2.748,2.748,0,0,1,1.7-.483,6.394,6.394,0,0,1,2.121.382l.06-.171a6.524,6.524,0,0,0-2.151-.352,3.137,3.137,0,0,0-2,.583,1.757,1.757,0,0,0-.694,1.468,2.161,2.161,0,0,0,.483,1.478,4.294,4.294,0,0,0,1.789,1.045,9.486,9.486,0,0,1,1.548.663,2.322,2.322,0,0,1,.844.754,2.01,2.01,0,0,1,.271,1.076,1.871,1.871,0,0,1-.764,1.568,3.27,3.27,0,0,1-2,.523,7.059,7.059,0,0,1-1.508-.161c-.271-.06-.543-.121-.794-.2v.181Z" transform="translate(-331.28 -46.42)" fill="#336"/>
</g>
<g id="Group_1517" data-name="Group 1517" transform="translate(449.456 192.803)">
<path id="Path_539" data-name="Path 539" d="M350.289,56.891H344.77V47h5.519v1.025h-4.363v3.187h4.1v1.015h-4.1v3.639h4.363Z" transform="translate(-344.328 -46.558)" fill="#003c75"/>
<path id="Path_540" data-name="Path 540" d="M350.291,57.336h-5.519a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h5.519a.446.446,0,0,1,.442.442v1.025a.446.446,0,0,1-.442.442h-3.92v2.3h3.659a.446.446,0,0,1,.442.442v1.015a.446.446,0,0,1-.442.442H346.37v2.754h3.92a.446.446,0,0,1,.442.442v1.025A.446.446,0,0,1,350.291,57.336Zm-5.076-.885h4.624v-.141h-3.92a.446.446,0,0,1-.442-.442V52.229a.446.446,0,0,1,.442-.442h3.659v-.131h-3.659a.446.446,0,0,1-.442-.442V48.028a.446.446,0,0,1,.442-.442h3.92v-.131h-4.624v9.007Z" transform="translate(-344.33 -46.56)" fill="#336"/>
</g>
</g>
</g>
<script xmlns=""/></svg>
Before Width: | Height: | Size: 35 KiB |
@@ -1,184 +0,0 @@
<!-- Start of Injectable ECHA Header Block (v7 - Dynamic Data) -->
<style>
/* ECHA Header Styles - Based on V5/V6 */
.echa-header-injected { /* Wrapper class */
    width: 100%;
    box-shadow: 0 1px 3px rgba(0, 0, 0, 0.1), 0 1px 2px rgba(0,0,0,0.06);
    font-family: system-ui, -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, 'Open Sans', 'Helvetica Neue', sans-serif;
    box-sizing: border-box;
    background-color: #ffffff;
    line-height: 1.4;
}
.echa-header-injected *, .echa-header-injected *::before, .echa-header-injected *::after {
    box-sizing: inherit;
}

/* Top white bar - das-top-nav */
.echa-header-injected .das-top-nav {
    background-color: #ffffff;
    display: flex;
    align-items: stretch;
    padding: 8px 25px;
    border-bottom: 1px solid #e7e7e7;
    min-height: 55px;
}

.echa-header-injected .logo-container {
    display: flex;
    align-items: center;
    gap: 20px;
}

.echa-header-injected .logo-main img {
    height: 38px;
    width: auto;
    display: block;
    border: 0;
}

.echa-header-injected .logo-part-of {
    display: flex;
    align-items: center;
    padding-left: 20px;
    border-left: 1px solid #cccccc;
    height: 100%;
}

.echa-header-injected .logo-part-of img {
    height: 18px;
    width: auto;
    display: block;
    border: 0;
}

/* Bottom blue bar - das-primary-header_wrapper */
.echa-header-injected .das-primary-header_wrapper {
    background-color: #005487;
    background-image: linear-gradient(to bottom, rgba(255, 255, 255, 0.08), rgba(0, 0, 0, 0.05));
    color: #ffffff;
    padding: 12px 25px;
    display: flex;
    align-items: center;
    gap: 15px;
}

.echa-header-injected .das-primary-header-info {
    flex-grow: 1;
    min-width: 0; /* Prevent flex item from overflowing */
}

/* Style for the substance link */
.echa-header-injected .substance-link {
    color: #ffffff;
    text-decoration: none;
    display: block; /* Makes the whole H2 area clickable */
}
.echa-header-injected .substance-link:hover,
.echa-header-injected .substance-link:focus {
    text-decoration: underline;
}

.echa-header-injected .das-primary-header-info h2 {
    font-size: 1.5em; /* Set your desired FIXED font size */
    margin: 0 0 4px 0;
    line-height: 1.2; /* This will control spacing between lines if it wraps */
    color: #ffffff;
    font-weight: 600;
    width: 100%; /* Constrains the text horizontally */

    /* --- REMOVED ---
    white-space: nowrap;
    overflow: hidden;
    text-overflow: ellipsis;
    */

    /* --- ADDED (Recommended) --- */
    white-space: normal; /* Explicitly allow wrapping (this is the default, but good for clarity) */
    overflow-wrap: break-word; /* Helps break long words without spaces */
    /* word-break: break-word; (alternative if overflow-wrap doesn't catch all cases) */

    /* Ensure overflow is visible (default, but explicit) */
    overflow: visible;
}

.echa-header-injected .das-primary-header-info_details {
    display: flex;
    align-items: center;
    gap: 18px;
    flex-wrap: wrap;
}

.echa-header-injected .item {
    display: flex;
    align-items: baseline;
    position: relative;
}

.echa-header-injected .item + .item::before {
    content: '•';
    color: #f5a623;
    font-weight: bold;
    font-size: 1.1em;
    line-height: 1;
    display: inline-block;
    margin-right: 18px;
}

.echa-header-injected .item label {
    font-size: 0.85em;
    color: #e0eaf1;
    margin-right: 8px;
    font-weight: 400;
}

.echa-header-injected .item span {
    font-size: 0.95em;
    color: #ffffff;
    font-weight: bold;
}

/* Minimal reset */
.echa-header-injected h2, .echa-header-injected span, .echa-header-injected label, .echa-header-injected div {
    margin: 0; padding: 0;
}
.echa-header-injected a { color: inherit; text-decoration: none; } /* Basic reset for any links */

</style>
<header class="echa-header-injected" id="pdf-custom-header">
    <div class="das-top-nav">
        <div class="logo-container">
            <div class="logo-main">
                <!-- Logo link can be kept static or made dynamic if needed -->
                <a title="ECHA Chemicals Database" href="/">
                    <img height="38" alt="ECHA Chemicals Database" src="##ECHA_CHEM_LOGO_SRC##">
                </a>
            </div>
            <div class="logo-part-of">
                <a title="visit ECHA website" target="_blank" rel="noopener noreferrer" href="https://echa.europa.eu/">
                    <img height="18" alt="European Chemicals Agency" src="##ECHA_LOGO_SRC##">
                </a>
            </div>
        </div>
    </div>
    <div class="das-primary-header_wrapper">
        <div class="das-primary-header-info">
            <!-- ==== DYNAMIC CONTENT START ==== -->
            <a href="##SUBSTANCE_LINK##" title="View substance details: ##SUBSTANCE_NAME##" class="substance-link">
                <h2 class="das-text-truncate">##SUBSTANCE_NAME##</h2>
            </a>
            <div class="das-primary-header-info_details">
                <div class="item">
                    <label>EC number</label>
                    <span>##EC_NUMBER##</span>
                </div>
                <div class="item">
                    <label>CAS number</label>
                    <span class="das-text-truncate">##CAS_NUMBER##</span>
                </div>
            </div>
            <!-- ==== DYNAMIC CONTENT END ==== -->
        </div>
    </div>
</header>
<!-- End of Injectable ECHA Header Block (v7) -->
@@ -16,6 +16,7 @@ dependencies = [
    "markdown-to-json>=2.1.2",
    "markdownify>=1.2.0",
    "playwright>=1.55.0",
    "psycopg2>=2.9.11",
    "pubchemprops>=0.1.1",
    "pubchempy>=1.0.5",
    "pydantic>=2.11.10",
28 pytest.ini

@@ -1,28 +0,0 @@
[pytest]
# Pytest configuration for PIF Compiler

# Test discovery
testpaths = tests
python_files = test_*.py
python_classes = Test*
python_functions = test_*

# Output options
addopts =
    -v
    --strict-markers
    --tb=short
    --disable-warnings

# Markers for different test types
markers =
    unit: Unit tests (fast, no external dependencies)
    integration: Integration tests (may hit real APIs)
    slow: Slow tests (skip by default)
    database: Tests requiring MongoDB

# Coverage options (if pytest-cov is installed)
# addopts = --cov=src/pif_compiler --cov-report=html --cov-report=term

# Ignore patterns
norecursedirs = .git .venv __pycache__ *.egg-info dist build
@@ -1,73 +1,423 @@
import re
from datetime import datetime as dt
from typing import List, Optional

from pydantic import (
    BaseModel, Field, StringConstraints, field_validator,
    model_validator, computed_field,
)
from typing_extensions import Annotated

from pif_compiler.classes.types_enum import CosmeticType, PhysicalForm, PlaceApplication, NormalUser, RoutesExposure, NanoRoutes
from pif_compiler.services.srv_echa import extract_levels, at_extractor, rdt_extractor, orchestrator
from pif_compiler.functions.db_utils import postgres_connect
from pif_compiler.services.srv_pubchem import pubchem_dap
from pif_compiler.services.srv_cosing import cosing_entry


class DapInfo(BaseModel):
    cas: str

    molecular_weight: Optional[float] = Field(default=None, description="In Daltons (Da)")
    high_ionization: Optional[float] = Field(default=None, description="High degree of ionization (pKa)")
    log_pow: Optional[float] = Field(default=None, description="Partition coefficient")
    tpsa: Optional[float] = Field(default=None, description="Topological polar surface area")
    melting_point: Optional[float] = Field(default=None, description="In Celsius (°C)")

    # --- The computed DAP value ---
    # Defaults to 0.5 (50%); overwritten by the validator below
    dap_value: float = 0.5

    @model_validator(mode='after')
    def compute_dap(self):
        # Each condition is True when it reduces dermal absorption
        conditions = []

        # 1. MW > 500 Da
        if self.molecular_weight is not None:
            conditions.append(self.molecular_weight > 500)

        # 2. High degree of ionization (the value stored is a pKa, so this
        #    boolean check never fires for a float; it needs a proper threshold)
        if self.high_ionization is not None:
            conditions.append(self.high_ionization is True)

        # 3. Log Pow <= -1 OR >= 4
        if self.log_pow is not None:
            conditions.append(self.log_pow <= -1 or self.log_pow >= 4)

        # 4. TPSA > 120 Å²
        if self.tpsa is not None:
            conditions.append(self.tpsa > 120)

        # 5. Melting point > 200 °C
        if self.melting_point is not None:
            conditions.append(self.melting_point > 200)

        # FINAL LOGIC: if at least one "low absorption" condition holds,
        # the DAP is 0.1; otherwise it stays at 0.5
        if any(conditions):
            self.dap_value = 0.1
        else:
            self.dap_value = 0.5

        return self

    @classmethod
    def dap_builder(cls, dap_data: dict):
        """
        Builds a DapInfo object from raw PubChem data.
        """
        desired_keys = ['CAS', 'MolecularWeight', 'XLogP', 'TPSA', 'Melting Point', 'Dissociation Constants']
        actual_keys = [key for key in dap_data.keys() if key in desired_keys]

        data = {}

        for key in actual_keys:
            if key == 'CAS':
                data['cas'] = dap_data[key]
            if key == 'MolecularWeight':
                data['molecular_weight'] = float(dap_data[key])
            if key == 'XLogP':
                data['log_pow'] = float(dap_data[key])
            if key == 'TPSA':
                data['tpsa'] = float(dap_data[key])
            if key == 'Melting Point':
                try:
                    for item in dap_data[key]:
                        if '°C' in item['Value']:
                            mp_value = re.findall(r"[-+]?\d*\.\d+|\d+", item['Value'])
                            if mp_value:
                                data['melting_point'] = float(mp_value[0])
                except (KeyError, TypeError, ValueError):
                    continue
            if key == 'Dissociation Constants':
                try:
                    for item in dap_data[key]:
                        if 'pKa' in item['Value']:
                            pk_value = re.findall(r"[-+]?\d*\.\d+|\d+", item['Value'])
                            if pk_value:
                                data['high_ionization'] = float(pk_value[0])
                except (KeyError, TypeError, ValueError):
                    continue

        return cls(**data)
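The absorption rule encoded by `compute_dap` can be sketched standalone: absorption defaults to 50% and drops to 10% as soon as any "low absorption" criterion holds. The thresholds (MW > 500 Da, log Pow outside [-1, 4], TPSA > 120 Å², melting point > 200 °C) come from the model above; the example values are invented.

```python
from typing import Optional

def dap_value(molecular_weight: Optional[float] = None,
              log_pow: Optional[float] = None,
              tpsa: Optional[float] = None,
              melting_point: Optional[float] = None) -> float:
    """Return the dermal absorption fraction (0.5 default, 0.1 reduced)."""
    conditions = []
    if molecular_weight is not None:
        conditions.append(molecular_weight > 500)          # MW > 500 Da
    if log_pow is not None:
        conditions.append(log_pow <= -1 or log_pow >= 4)   # extreme log Pow
    if tpsa is not None:
        conditions.append(tpsa > 120)                      # TPSA > 120 Å²
    if melting_point is not None:
        conditions.append(melting_point > 200)             # MP > 200 °C
    return 0.1 if any(conditions) else 0.5

print(dap_value(molecular_weight=650.0))                 # large molecule -> 0.1
print(dap_value(molecular_weight=180.0, log_pow=1.2))    # no condition holds -> 0.5
```

Missing properties simply contribute no condition, so a substance with no PubChem data keeps the conservative 0.5 default.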
class CosingInfo(BaseModel):
    cas: List[str] = Field(default_factory=list)
    common_names: List[str] = Field(default_factory=list)
    inci: List[str] = Field(default_factory=list)
    annex: List[str] = Field(default_factory=list)
    functionName: List[str] = Field(default_factory=list)
    otherRestrictions: List[str] = Field(default_factory=list)
    cosmeticRestriction: Optional[str] = None

    @classmethod
    def cosing_builder(cls, cosing_data: dict):
        cosing_keys = ['nameOfCommonIngredientsGlossary', 'casNo', 'functionName', 'annexNo', 'refNo', 'otherRestrictions', 'cosmeticRestriction', 'inciName']
        keys = [k for k in cosing_data.keys() if k in cosing_keys]

        cosing_dict = {}

        for k in keys:
            if k == 'nameOfCommonIngredientsGlossary':
                cosing_dict['common_names'] = list(cosing_data[k])
            if k == 'inciName':
                cosing_dict['inci'] = list(cosing_data[k])
            if k == 'casNo':
                cosing_dict['cas'] = list(cosing_data[k])
            if k == 'functionName':
                cosing_dict['functionName'] = list(cosing_data[k])
            if k == 'annexNo':
                # Pair each annex number with its positional reference number
                cosing_dict['annex'] = [
                    f"{ann} / {ref}" for ann, ref in zip(cosing_data[k], cosing_data['refNo'])
                ]
            if k == 'otherRestrictions':
                cosing_dict['otherRestrictions'] = list(cosing_data[k])
            if k == 'cosmeticRestriction':
                cosing_dict['cosmeticRestriction'] = cosing_data[k]

        return cls(**cosing_dict)

    @classmethod
    def cycle_identified(cls, cosing_data: dict):
        cosing_entries = []
        if 'identifiedIngredient' in cosing_data.keys():
            identified_cosing = cls.cosing_builder(cosing_data['identifiedIngredient'])
            cosing_entries.append(identified_cosing)
        main = cls.cosing_builder(cosing_data)
        cosing_entries.append(main)

        return cosing_entries
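The annex/reference pairing done in `cosing_builder` is just a positional zip of the `annexNo` and `refNo` lists returned by COSING; a minimal sketch with invented values:

```python
# Sample COSING-style fields (invented values)
annex_no = ['III', 'VI']
ref_no = ['98', '27']

# Each annex number is joined with the reference number at the same position
annex = [f"{a} / {r}" for a, r in zip(annex_no, ref_no)]
print(annex)  # ['III / 98', 'VI / 27']
```

`zip` stops at the shorter list, so a missing reference number cannot raise an IndexError the way positional indexing would.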
class ToxIndicator(BaseModel):
    indicator: str
    value: int
    unit: str
    route: str
    toxicity_type: Optional[str] = None
    ref: Optional[str] = None

    @property
    def priority_rank(self):
        """Returns the numerical priority based on the toxicological indicator."""
        mapping = {
            'LD50': 1,
            'DL50': 1,
            'LOAEL': 3,
            'NOAEL': 4
        }
        return mapping.get(self.indicator, -1)

    @property
    def factor(self):
        """Returns the safety factor based on the indicator's priority."""
        if self.priority_rank == 1:
            return 10
        elif self.priority_rank == 3:
            return 3
        return 1


class Toxicity(BaseModel):
    cas: str
    indicators: list[ToxIndicator]
    best_case: Optional[ToxIndicator] = None
    factor: Optional[int] = None

    @model_validator(mode='after')
    def set_best_case(self) -> 'Toxicity':
        if self.indicators:
            self.best_case = max(self.indicators, key=lambda x: x.priority_rank)
            self.factor = self.best_case.factor
        return self

    @classmethod
    def from_result(cls, cas: str, result):
        toxicity_types = ['repeated_dose_toxicity', 'acute_toxicity']
        indicators_list = []

        for tt in toxicity_types:
            if tt not in result:
                continue

            try:
                extractor = at_extractor if tt == 'acute_toxicity' else rdt_extractor
                fetch = extract_levels(result[tt], extractor=extractor)

                link = result.get(f"{tt}_link", "")

                for key, lvl in fetch.items():
                    lvl['ref'] = link
                    elem = ToxIndicator(**lvl)
                    indicators_list.append(elem)

            except Exception as e:
                print(f"Error while extracting {tt}: {e}")
                continue

        return cls(
            cas=cas,
            indicators=indicators_list
        )
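The ranking used by `ToxIndicator.priority_rank` and `Toxicity.set_best_case` reduces to picking the indicator with the highest rank (NOAEL > LOAEL > LD50/DL50) and deriving a correction factor from it. A standalone sketch with invented sample records:

```python
# Rank and factor tables mirroring the model above
RANK = {'LD50': 1, 'DL50': 1, 'LOAEL': 3, 'NOAEL': 4}
FACTOR = {1: 10, 3: 3, 4: 1}

def best_indicator(indicators: list) -> dict:
    """Pick the highest-ranked indicator and attach its factor."""
    best = max(indicators, key=lambda i: RANK.get(i['indicator'], -1))
    rank = RANK.get(best['indicator'], -1)
    return {**best, 'factor': FACTOR.get(rank, 1)}

data = [
    {'indicator': 'LD50', 'value': 2000, 'unit': 'mg/kg'},
    {'indicator': 'NOAEL', 'value': 100, 'unit': 'mg/kg bw/day'},
]
print(best_indicator(data))  # the NOAEL record wins, with factor 1
```

Unknown indicator names rank at -1, so they never beat a recognized one but still produce the neutral factor of 1 if they are the only data available.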
class Ingredient(BaseModel):
    # Constraints recovered from the draft annotations in the original builder
    cas: Annotated[str, StringConstraints(min_length=5, max_length=13, strip_whitespace=True)]
    inci: Optional[List[str]] = None
    dap_info: Optional[DapInfo] = None
    cosing_info: Optional[List[CosingInfo]] = None
    toxicity: Optional[Toxicity] = None
    creation_date: Optional[str] = None

    inci_name: Optional[Annotated[str, StringConstraints(
        min_length=3,
        max_length=50,
        strip_whitespace=True,
        to_upper=True)]] = None

    quantity: Optional[Annotated[float, Field(gt=0.001, lt=100.0, allow_inf_nan=False)]] = Field(
        default=None, description="Quantity as a decimal percentage")

    # PubChem data for the DAP (legacy fields)
    mol_weight: Optional[int] = None
    degree_ioniz: Optional[bool] = None
    log_pow: Optional[int] = None
    melting_pnt: Optional[int] = None

    # Toxicity values (legacy fields)
    sed: Optional[float] = None
    dap: float = 0.5
    sedd: Optional[float] = None
    noael: Optional[int] = None
    mos: Optional[int] = None

    # References
    ref: Optional[str] = None
    restriction: Optional[str] = None

    @classmethod
    def ingredient_builder(
            cls,
            cas: str,
            inci: Optional[List[str]] = None,
            dap_data: Optional[dict] = None,
            cosing_data: Optional[dict] = None,
            toxicity_data: Optional[dict] = None):

        dap_info = DapInfo.dap_builder(dap_data) if dap_data else None
        cosing_info = CosingInfo.cycle_identified(cosing_data) if cosing_data else None
        toxicity = Toxicity.from_result(cas, toxicity_data) if toxicity_data else None

        return cls(
            cas=cas,
            inci=inci,
            dap_info=dap_info,
            cosing_info=cosing_info,
            toxicity=toxicity
        )

    @model_validator(mode='after')
    def set_creation_date(self) -> 'Ingredient':
        self.creation_date = dt.now().isoformat()
        return self

    def update_ingredient(self, attr: str, data: dict):
        setattr(self, attr, data)

    def to_mongo_dict(self):
        mongo_dict = self.model_dump()
        return mongo_dict

    def get_stats(self):
        stats = {
            "has_dap_info": self.dap_info is not None,
            "has_cosing_info": self.cosing_info is not None,
            "has_toxicity_info": self.toxicity is not None,
            "num_tox_indicators": len(self.toxicity.indicators) if self.toxicity else 0,
            "has_best_tox_indicator": self.toxicity.best_case is not None if self.toxicity else False,
            "has_restrictions_in_cosing": any(self.cosing_info[0].annex) if self.cosing_info else False,
            "has_noael_indicator": any(ind.indicator == 'NOAEL' for ind in self.toxicity.indicators) if self.toxicity else False,
            "has_ld50_indicator": any(ind.indicator == 'LD50' for ind in self.toxicity.indicators) if self.toxicity else False,
            "has_loael_indicator": any(ind.indicator == 'LOAEL' for ind in self.toxicity.indicators) if self.toxicity else False
        }
        return stats

    def is_old(self, threshold_days: int = 365) -> bool:
        if not self.creation_date:
            return True

        creation_dt = dt.fromisoformat(self.creation_date)
        current_dt = dt.now()
        delta = current_dt - creation_dt

        return delta.days > threshold_days

    def add_inci_name(self, inci_name: str):
        if self.inci is None:
            self.inci = []
        if inci_name not in self.inci:
            self.inci.append(inci_name)

    def return_best_toxicity(self) -> Optional[ToxIndicator]:
        if self.toxicity and self.toxicity.best_case:
            return self.toxicity.best_case
        return None

    def return_cosing_restrictions(self) -> List[str]:
        restrictions = []
        if self.cosing_info:
            for cosing in self.cosing_info:
                restrictions.extend(cosing.annex)
        return restrictions


# --- Earlier exposure/product models ---

class ExpositionInfo(BaseModel):
    type: CosmeticType
    target_population: NormalUser
    consumer_weight: str = "60 kg"
    place_application: PlaceApplication
    routes_exposure: RoutesExposure
    nano_routes: NanoRoutes
    surface_area: int
    frequency: int

    # To be approximated by an LLM
    estimated_daily_amount_applied: float
    relative_daily_amount_applied: float
    retention_factor: float
    calculated_daily_exposure: float
    calculated_relative_daily_exposure: float


class SedTable(BaseModel):
    surface: int
    total_exposition: int
    frequency: int
    retention: int
    consumer_weight: int
    total_sed: int


class ProdCompany(BaseModel):
    prod_company_name: str
    prod_vat: int
    prod_address: str
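The cache-freshness check in `Ingredient.is_old` can be sketched standalone: a record with no creation date, or one older than the threshold, counts as stale. The sample timestamps are generated on the fly.

```python
from datetime import datetime, timedelta
from typing import Optional

def is_old(creation_date: Optional[str], threshold_days: int = 365) -> bool:
    """True when the ISO timestamp is missing or older than the threshold."""
    if not creation_date:
        return True
    delta = datetime.now() - datetime.fromisoformat(creation_date)
    return delta.days > threshold_days

fresh = datetime.now().isoformat()
stale = (datetime.now() - timedelta(days=400)).isoformat()
print(is_old(fresh), is_old(stale))  # False True
```

Treating a missing date as stale is the safe default for a substance cache: the record is re-fetched rather than trusted indefinitely.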
class RetentionFactors:
    LEAVE_ON = 1.0
    RINSE_OFF = 0.01
    DENTIFRICE = 0.05
    MOUTHWASH = 0.10
    DYE = 0.10
class Esposition(BaseModel):
    preset_name: str
    tipo_prodotto: str
    popolazione_target: str = "Adulti"
    peso_target_kg: float = 60.0

    luogo_applicazione: str
    esp_normali: List[str]
    esp_secondarie: List[str]
    esp_nano: List[str]

    sup_esposta: int = Field(ge=1, le=17500, description="Application area in cm²")
    freq_applicazione: int = Field(default=1, description="Number of applications per day")
    qta_giornaliera: float = Field(..., description="Amount of product applied (g/day)")
    ritenzione: float = Field(default=1.0, ge=0, le=1.0, description="Retention factor")

    note: Optional[str] = None

    @field_validator('esp_normali', 'esp_secondarie', 'esp_nano', mode='before')
    @classmethod
    def parse_postgres_array(cls, v):
        # If Postgres returns a string like '{a,b}', turn it into ['a', 'b']
        if isinstance(v, str):
            cleaned = v.strip('{}[]')
            return [item.strip() for item in cleaned.split(',')] if cleaned else []
        return v

    @computed_field
    @property
    def esposizione_calcolata(self) -> float:
        return self.qta_giornaliera * self.ritenzione

    @computed_field
    @property
    def esposizione_relativa(self) -> float:
        return (self.esposizione_calcolata * 1000) / self.peso_target_kg

    def save_to_postgres(self):
        data = self.model_dump(mode='json')
        query = """INSERT INTO tipi_prodotti (
            preset_name, tipo_prodotto, luogo_applicazione,
            esp_normali, esp_secondarie, esp_nano,
            sup_esposta, freq_applicazione, qta_giornaliera, ritenzione
        ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s) RETURNING id_preset;"""

        conn = postgres_connect()
        try:
            with conn.cursor() as cur:
                cur.execute(query, (
                    data.get("preset_name"), data.get("tipo_prodotto"),
                    data.get("luogo_applicazione"), data.get("esp_normali"),
                    data.get("esp_secondarie"), data.get("esp_nano"),
                    data.get("sup_esposta"), data.get("freq_applicazione"),
                    data.get("qta_giornaliera"), data.get("ritenzione")
                ))
                result = cur.fetchone()
                conn.commit()
                return result[0] if result else None
        except Exception as e:
            print(f"Errore salvataggio: {e}")
            conn.rollback()
            return False
        finally:
            conn.close()

    @classmethod
    def get_presets(cls):
        conn = postgres_connect()
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT preset_name, tipo_prodotto, luogo_applicazione, esp_normali, esp_secondarie, esp_nano, sup_esposta, freq_applicazione, qta_giornaliera, ritenzione FROM tipi_prodotti;")
                results = cur.fetchall()

                lista_oggetti = []
                for r in results:
                    obj = cls(
                        preset_name=r[0],
                        tipo_prodotto=r[1],
                        luogo_applicazione=r[2],
                        esp_normali=r[3],
                        esp_secondarie=r[4],
                        esp_nano=r[5],
                        sup_esposta=r[6],
                        freq_applicazione=r[7],
                        qta_giornaliera=r[8],
                        ritenzione=r[9]
                    )
                    lista_oggetti.append(obj)
                return lista_oggetti
        except Exception as e:
            print(f"Errore: {e}")
            return []
        finally:
            conn.close()
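The computed fields `esposizione_calcolata` and `esposizione_relativa` reduce to simple arithmetic: the daily applied amount times the retention factor gives the retained exposure in g/day; scaling to mg and dividing by body weight gives the relative daily exposure in mg/kg bw/day. A standalone sketch with hypothetical figures:

```python
def relative_daily_exposure(qta_giornaliera_g: float,
                            ritenzione: float,
                            peso_kg: float = 60.0) -> float:
    """Relative daily exposure in mg/kg bw/day."""
    esposizione = qta_giornaliera_g * ritenzione   # g/day actually retained
    return (esposizione * 1000) / peso_kg          # convert to mg, per kg bw

# e.g. 8 g/day of a rinse-off product (retention 0.01) on a 60 kg adult
print(relative_daily_exposure(8.0, 0.01))  # ≈ 1.33 mg/kg bw/day
```

The retention factor is what separates leave-on products (1.0) from rinse-off ones (0.01), so the same applied amount can yield exposures two orders of magnitude apart.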
@@ -1,36 +0,0 @@
from typing import List, Optional
from datetime import datetime
from pydantic import BaseModel, Field

from pif_compiler.classes.base_classes import ExpositionInfo, SedTable, ProdCompany, Ingredient
from pif_compiler.classes.types_enum import CosmeticType, PhysicalForm, NormalUser


class PIF(BaseModel):
    # GENERAL PRODUCT INFORMATION

    # Compilation date: pif = datetime.now()
    created_at: datetime = Field(
        default_factory=lambda: datetime.strptime(
            datetime.now().strftime('%Y-%m-%d'),
            '%Y-%m-%d'
        )
    )

    # Product information
    company: str
    product_name: str
    type: CosmeticType
    physical_form: PhysicalForm
    CNCP: int
    production_company: ProdCompany

    # Ingredients
    ingredients: List[Ingredient]  # str = decimal quantity %
    normal_consumer: Optional[NormalUser]
    exposition: Optional[ExpositionInfo]

    # Safety information
    sed_table: Optional[SedTable] = None
    undesired_effets: Optional[str] = None
    description: Optional[str] = None
    warnings: Optional[List[str]] = None
@@ -1,145 +0,0 @@
from enum import Enum


class TranslatedEnum(str, Enum):
    def get_translation(self, lang: str) -> str:
        translations = self.value.split("|")
        return translations[0] if lang == "en" else translations[1]


class PhysicalForm(TranslatedEnum):
    SIERO = "Serum (liquid)|Siero (liquido)"
    LOZIONE = "Lotion (liquid)|Lozione (liquido)"
    CREMA = "Cream (liquid)|Crema (liquido)"
    OLIO = "Oil (liquid)|Olio (liquido)"
    GEL = "Gel (liquid)|Gel (liquido)"
    SCHIUMA = "Foam (liquid)|Schiuma (liquido)"
    SOLUZIONE = "Solution (liquid)|Soluzione (liquido)"
    EMULSIONE = "Emulsion (liquid)|Emulsione (liquido)"
    SOSPENSIONE = "Suspension (liquid)|Sospensione (liquido)"
    BALSAMO = "Balm (semi-solid)|Balsamo (semi-solido)"
    PASTA = "Paste (semi-solid)|Pasta (semi-solido)"
    UNGENTO = "Ointment (semi-solid)|Unguento (semi-solido)"
    POLVERE_COMPATTA = "Pressed Powder (solid)|Polvere compatta (solido)"
    POLVERE_LIBERA = "Loose Powder (solid)|Polvere libera (solido)"
    STICK = "Stick (solid)|Stick (solido)"
    BARRETTA = "Bar (solid)|Barretta (solido)"
    PERLE = "Beads/Pearls (solid)|Perle (solido)"
    SPRAY = "Spray/Mist (aerosol)|Spray/Nebulizzatore (aerosol)"
    AEROSOL = "Aerosol (aerosol)|Aerosol (aerosol)"
    SPRAY_IN_POLVERE = "Powder Spray (aerosol)|Spray in polvere (aerosol)"
    CUSCINETTO = "Cushion (hybrid)|Cuscinetto (ibrido)"
    GELATINA = "Jelly (hybrid)|Gelatina (ibrido)"
    PRODOTTO_BIFASICO = "Bi-Phase Product (hybrid)|Prodotto bifasico (ibrido)"
    MICROINCAPSULATO = "Encapsulated Actives (hybrid)|Attivi microincapsulati (ibrido)"


class CosmeticType(TranslatedEnum):
    LIQUID_FOUNDATION = "Liquid foundation|Fondotinta liquido"
    POWDER_FOUNDATION = "Powder foundation|Fondotinta in polvere"
    BB_CREAM = "BB cream|BB cream"
    CC_CREAM = "CC cream|CC cream"
    CONCEALER = "Concealer|Correttore"
    LOOSE_POWDER = "Loose powder|Cipria in polvere"
    PRESSED_POWDER = "Pressed powder|Cipria compatta"
    POWDER_BLUSH = "Powder blush|Blush in polvere"
    CREAM_BLUSH = "Cream blush|Blush in crema"
    LIQUID_BLUSH = "Liquid blush|Blush liquido"
    BRONZER = "Bronzer|Bronzer"
    HIGHLIGHTER = "Highlighter|Illuminante"
    FACE_PRIMER = "Face primer|Primer viso"
    SETTING_SPRAY = "Setting spray|Spray fissante"
    COLOR_CORRECTOR = "Color corrector|Correttore colorato"
    CONTOUR_POWDER = "Contour powder|Contouring in polvere"
    CONTOUR_CREAM = "Contour cream|Contouring in crema"
    TINTED_MOISTURIZER = "Tinted moisturizer|Crema colorata"
    POWDER_EYESHADOW = "Powder eyeshadow|Ombretto in polvere"
    CREAM_EYESHADOW = "Cream eyeshadow|Ombretto in crema"
    LIQUID_EYESHADOW = "Liquid eyeshadow|Ombretto liquido"
    PENCIL_EYELINER = "Pencil eyeliner|Matita occhi"
    LIQUID_EYELINER = "Liquid eyeliner|Eyeliner liquido"
    GEL_EYELINER = "Gel eyeliner|Eyeliner in gel"
    KOHL_LINER = "Kohl liner|Matita kohl"
    MASCARA = "Mascara|Mascara"
    WATERPROOF_MASCARA = "Waterproof mascara|Mascara waterproof"
    BROW_PENCIL = "Eyebrow pencil|Matita sopracciglia"
    BROW_GEL = "Eyebrow gel|Gel sopracciglia"
    BROW_POWDER = "Eyebrow powder|Polvere sopracciglia"
    EYE_PRIMER = "Eye primer|Primer occhi"
    FALSE_LASHES = "False eyelashes|Ciglia finte"
    LASH_GLUE = "Eyelash glue|Colla ciglia"
    BROW_POMADE = "Eyebrow pomade|Pomata sopracciglia"
    MATTE_LIPSTICK = "Matte lipstick|Rossetto opaco"
    CREAM_LIPSTICK = "Cream lipstick|Rossetto cremoso"
    SATIN_LIPSTICK = "Satin lipstick|Rossetto satinato"
    LIP_GLOSS = "Lip gloss|Lucidalabbra"
    LIP_LINER = "Lip liner|Matita labbra"
    LIP_STAIN = "Lip stain|Tinta labbra"
    LIP_BALM = "Lip balm|Balsamo labbra"
    LIP_PRIMER = "Lip primer|Primer labbra"
    LIP_PLUMPER = "Lip plumper|Volumizzante labbra"
    LIP_OIL = "Lip oil|Olio labbra"
    LIP_MASK = "Lip mask|Maschera labbra"
    LIQUID_LIPSTICK = "Liquid lipstick|Rossetto liquido"
    GEL_CLEANSER = "Gel cleanser|Detergente gel"
    FOAM_CLEANSER = "Foam cleanser|Detergente schiumoso"
    OIL_CLEANSER = "Oil cleanser|Detergente oleoso"
    CREAM_CLEANSER = "Cream cleanser|Detergente in crema"
    MICELLAR_WATER = "Micellar water|Acqua micellare"
    TONER = "Toner|Tonico"
    ESSENCE = "Essence|Essenza"
    SERUM = "Serum|Siero"
    MOISTURIZER = "Moisturizer|Idratante"
    FACE_OIL = "Face oil|Olio viso"
    SHEET_MASK = "Sheet mask|Maschera in tessuto"
    CLAY_MASK = "Clay mask|Maschera all'argilla"
    GEL_MASK = "Gel mask|Maschera in gel"
    CREAM_MASK = "Cream mask|Maschera in crema"
    EYE_CREAM = "Eye cream|Crema contorno occhi"
    PHYSICAL_EXFOLIATOR = "Physical exfoliator|Esfoliante fisico"
    CHEMICAL_EXFOLIATOR = "Chemical exfoliator|Esfoliante chimico"
    SUNSCREEN = "Sunscreen|Protezione solare"
    NIGHT_CREAM = "Night cream|Crema notte"
    FACE_MIST = "Face mist|Acqua spray"
    SPOT_TREATMENT = "Spot treatment|Trattamento localizzato"
    PORE_STRIPS = "Pore strips|Cerotti purificanti"
    PEELING_GEL = "Peeling gel|Gel esfoliante"
    BASE_COAT = "Base coat|Base smalto"
    NAIL_POLISH = "Nail polish|Smalto"
    TOP_COAT = "Top coat|Top coat"
    CUTICLE_OIL = "Cuticle oil|Olio cuticole"
    NAIL_STRENGTHENER = "Nail strengthener|Rinforzante unghie"
    QUICK_DRY_DROPS = "Quick dry drops|Gocce asciugatura rapida"
    NAIL_PRIMER = "Nail primer|Primer unghie"
    GEL_POLISH = "Gel polish|Smalto gel"
    ACRYLIC_POWDER = "Acrylic powder|Polvere acrilica"
    NAIL_GLUE = "Nail glue|Colla unghie"
    MAKEUP_BRUSHES = "Makeup brushes|Pennelli trucco"
    MAKEUP_SPONGES = "Makeup sponges|Spugnette trucco"
    EYELASH_CURLER = "Eyelash curler|Piegaciglia"
    TWEEZERS = "Tweezers|Pinzette"
    NAIL_CLIPPERS = "Nail clippers|Tagliaunghie"
    NAIL_FILE = "Nail file|Lima unghie"
    COTTON_PADS = "Cotton pads|Dischetti di cotone"
    MAKEUP_REMOVER_PADS = "Makeup remover pads|Dischetti struccanti"
    POWDER_PUFF = "Powder puff|Piumino cipria"
    FACIAL_ROLLER = "Facial roller|Rullo facciale"
    GUA_SHA = "Gua sha tool|Strumento gua sha"
    BRUSH_CLEANER = "Brush cleaner|Detergente pennelli"
    MAKEUP_ORGANIZER = "Makeup organizer|Organizzatore trucchi"
    MIRROR = "Mirror|Specchio"
    NAIL_BUFFER = "Nail buffer|Buffer unghie"


class NormalUser(TranslatedEnum):
    ADULTO = "Adult|Adulto"
    BAMBINO = "Child|Bambino"


class PlaceApplication(TranslatedEnum):
    VISO = "Face|Viso"


class RoutesExposure(TranslatedEnum):
    DERMAL = "Dermal|Dermale"
    OCULAR = "Ocular|Oculare"
    ORAL = "Oral|Orale"


class NanoRoutes(TranslatedEnum):
    DERMAL = "Dermal|Dermale"
    OCULAR = "Ocular|Oculare"
    ORAL = "Oral|Orale"
@@ -2,6 +2,7 @@ import os
from urllib.parse import quote_plus

from dotenv import load_dotenv
import psycopg2
from pymongo import MongoClient

from pif_compiler.functions.common_log import get_logger

@@ -40,9 +41,31 @@ def db_connect(db_name : str = 'toxinfo', collection_name : str = 'substance_ind

    return collection


def postgres_connect():
    DATABASE_URL = os.getenv("DATABASE_URL")
    # Return an open connection; callers are responsible for closing it.
    # (A "with psycopg2.connect(...)" block only wraps a transaction, it does
    # not close the connection, so returning from inside it is misleading.)
    return psycopg2.connect(DATABASE_URL)


def insert_compilatore(nome_compilatore):
    try:
        conn = postgres_connect()
        with conn.cursor() as cur:
            cur.execute("INSERT INTO compilatori (nome_compilatore) VALUES (%s)", (nome_compilatore,))
        conn.commit()
        conn.close()
    except Exception as e:
        logger.error(f"Error: {e}")


def log_ricerche(cas, target, esito):
    try:
        conn = postgres_connect()
        with conn.cursor() as cur:
            cur.execute("INSERT INTO logs.search_history (cas_ricercato, target, esito) VALUES (%s, %s, %s)", (cas, target, esito))
        conn.commit()
        conn.close()
    except Exception as e:
        logger.error(f"Error: {e}")
        return


if __name__ == "__main__":
    coll = db_connect()
    if coll is not None:
        logger.info("Database connection successful.")
    else:
        logger.error("Database connection failed.")
    log_ricerche("123-45-6", "ECHA", True)
@@ -1,6 +0,0 @@
from pymongo import MongoClient

from pif_compiler.functions.common_log import get_logger
from pif_compiler.functions.db_utils import db_connect

log = get_logger()
@ -1,247 +1,141 @@
|
|||
import json as js
|
||||
import re
|
||||
import requests as req
|
||||
from typing import Union
|
||||
from typing import Union, List, Dict, Optional
|
||||
from pif_compiler.functions.common_log import get_logger
|
||||
|
||||
logger = get_logger()
|
||||
|
#region Function that parses a list of CAS numbers taken from COSING
# --- PARSING ---

def parse_cas_numbers(cas_string: list) -> list:
    logger.debug(f"Parsing CAS numbers: {cas_string}")

    # We ensure externally that at least one CAS exists, so the first string can be taken
    cas_raw = cas_string[0]

    # Remove parentheses and their content
    cas_raw = re.sub(r"\([^)]*\)", "", cas_raw)

    # Split on the various possible separators
    cas_parts = re.split(r"[/;,]", cas_raw)

    # Build a list from the parts, dropping excess whitespace and empty entries
    cas_list = [cas.strip() for cas in cas_parts if cas.strip()]

    # Some CAS numbers are joined by a double dash (--), which must be
    # split in a second pass
    if len(cas_list) == 1 and "--" in cas_list[0]:
        logger.debug("Found double dash separator, splitting further")
        cas_list = [cas.strip() for cas in cas_list[0].split("--")]

    # Some CAS entries hold an invalid value ("-"), so filter those out
    cas_list = [cas for cas in cas_list if cas != "-"]

    logger.info(f"Parsed CAS numbers: {cas_list}")
    return cas_list
#endregion
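The parsing steps above can be exercised standalone. A minimal sketch, with the logging stripped out, together with the separator cases it handles:

```python
import re

def parse_cas_numbers(cas_string: list) -> list:
    # Take the first entry and drop any parenthetical content
    cas_raw = re.sub(r"\([^)]*\)", "", cas_string[0])
    # Split on the supported separators and strip whitespace
    cas_list = [c.strip() for c in re.split(r"[/;,]", cas_raw) if c.strip()]
    # Second pass: some CAS pairs are joined by a double dash
    if len(cas_list) == 1 and "--" in cas_list[0]:
        cas_list = [c.strip() for c in cas_list[0].split("--")]
    # Drop placeholder dashes
    return [c for c in cas_list if c != "-"]

print(parse_cas_numbers(["7732-18-5 (hydrate)/56-81-5"]))  # → ['7732-18-5', '56-81-5']
```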

#region Function that performs a search directly on COSING
# --- SEARCH ---

# The first argument is the string to search for, the second selects the search type
def cosing_search(text: str, mode: str = "name") -> Union[list, dict, None]:
    logger.info(f"Starting COSING search: text='{text}', mode='{mode}'")

    url = "https://api.tech.ec.europa.eu/search-api/prod/rest/search?apiKey=285a77fd-1257-4271-8507-f0c6b2961203&text=*&pageSize=100&pageNumber=1"
    agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0"

    # The default search mode is by name, whether INCI or another kind
    if mode == "name":
        logger.debug("Search mode: name (INCI, chemical name, etc.)")
        search_query = {
            "bool": {
                "must": [{
                    "text": {
                        "query": f"{text}",
                        "fields": ["inciName.exact", "inciUsaName", "innName.exact", "phEurName", "chemicalName", "chemicalDescription"],
                        "defaultOperator": "AND"
                    }
                }]
            }
        }

    # Searching by CAS or EC number requires a different request payload
    elif mode in ["cas", "ec"]:
        logger.debug(f"Search mode: {mode}")
        search_query = {"bool": {"must": [{"text": {"query": f"*{text}*", "fields": ["casNo", "ecNo"]}}]}}

    # Searching by ID is needed wherever the identified ingredients must be retrieved
    elif mode == "id":
        logger.debug("Search mode: substance ID")
        search_query = {"bool": {"must": [{"term": {"substanceId": f"{text}"}}]}}

    # Raise an error for any unsupported mode
    else:
        logger.error(f"Invalid search mode: {mode}")
        raise ValueError(f"Invalid search mode: {mode}")

    # Build the request payload
    logger.debug(f"Search query: {search_query}")
    files = {"query": ("query", js.dumps(search_query), "application/json")}

    # Execute the search POST
    try:
        logger.debug("Sending POST request to COSING API")
        risposta = req.post(url, headers={"user_agent": agent, "Connection": "keep-alive"}, files=files)
        risposta.raise_for_status()
        data = risposta.json()

        if data.get("results"):
            logger.info(f"COSING search successful: found {len(data['results'])} result(s)")
            return data["results"][0]["metadata"]
        else:
            # Return None when the search yields no results
            logger.warning(f"COSING search returned no results for text='{text}', mode='{mode}'")
            return None

    except req.exceptions.RequestException as e:
        logger.error(f"HTTP request error during COSING search: {e}")
        raise
    except (KeyError, ValueError, TypeError) as e:
        logger.error(f"Error parsing COSING response: {e}")
        raise
#endregion

#region Function that cleans a COSING json and returns it
# --- CLEANING ---

def clean_cosing(json_data: dict, full: bool = True) -> dict:
    substance_id = json_data.get("substanceId", ["Unknown"])[0] if json_data.get("substanceId") else "Unknown"
    logger.info(f"Cleaning COSING data for: {substance_id}")

    # Fields of interest, split by the kind of output we want for each
    string_cols = [
        "itemType", "phEurName", "chemicalName", "innName", "substanceId", "cosmeticRestriction"
    ]

    list_cols = [
        "casNo", "ecNo", "functionName", "otherRestrictions", "refNo",
        "sccsOpinion", "sccsOpinionUrls", "identifiedIngredient",
        "annexNo", "otherRegulations", "nameOfCommonIngredientsGlossary", "inciName"
    ]

    # Base of the URL used to build the COSING link for the substance
    base_url = "https://ec.europa.eu/growth/tools-databases/cosing/details/"
    clean_json = {}

    # Loop over every field of interest
    for key in (string_cols + list_cols):
        # Some fields contain a useless "<empty>" placeholder, so drop it
        current_val = json_data.get(key, [])
        filtered_val = [v for v in current_val if v != "<empty>"]

        # Fields whose output must be a list can keep COSING's empty lists as a value
        if key in list_cols:
            # CAS and EC numbers are special cases that need extra parsing
            if key in ["casNo", "ecNo"] and filtered_val:
                logger.debug(f"Processing {key}: {filtered_val}")
                filtered_val = parse_cas_numbers(filtered_val)
            # Resolve any identifiedIngredient entries, but only when the "full" flag is true
            elif key == "identifiedIngredient" and full and filtered_val:
                logger.debug(f"Processing {len(filtered_val)} identified ingredient(s)")
                filtered_val = identified_ingredients(filtered_val)

            clean_json[key] = filtered_val
        else:
            # "nameOfCommonIngredientsGlossary" was too long, so it is shortened to "commonName"
            nKey = "commonName" if key == "nameOfCommonIngredientsGlossary" else key
            # Output a plain string; fall back to "" when the COSING list is empty
            clean_json[nKey] = filtered_val[0] if filtered_val else ""

    # The cosingUrl field does not exist yet: build it by joining the substance ID to the base URL
    clean_json["cosingUrl"] = f"{base_url}{json_data['substanceId'][0]}"
    logger.info(f"Successfully cleaned COSING data for substance ID: {substance_id}")
    return clean_json
#endregion

#region Function that completes, where needed, a COSING json

def identified_ingredients(id_list: list) -> list:
    logger.info(f"Processing identified ingredients: {id_list}")
    identified = []

    # Run a search for each ingredient listed in identifiedIngredients
    for sub_id in id_list:
        logger.debug(f"Searching for identified ingredient with ID: {sub_id}")
        ingredient = cosing_search(sub_id, "id")
        if ingredient:
            # Clean the json just found and store it in the list
            identified.append(clean_cosing(ingredient, full=False))
            logger.debug(f"Successfully added identified ingredient ID: {sub_id}")
        else:
            logger.warning(f"Could not find identified ingredient with ID: {sub_id}")

    # Once the list of identifiedIngredient objects is populated, return it
    logger.info(f"Successfully processed {len(identified)} of {len(id_list)} identified ingredient(s)")
    return identified
#endregion

def cosing_entry(cas: str) -> Optional[dict]:
    logger.info(f"Retrieving COSING entry for CAS: {cas}")
    try:
        search_result = cosing_search(cas, mode="cas")
        if search_result:
            return clean_cosing(search_result)
        else:
            logger.warning(f"No COSING entry found for CAS: {cas}")
            return None
    except Exception as e:
        logger.error(f"Error retrieving COSING entry for CAS {cas}: {e}")
        return None


if __name__ == "__main__":
    raw = cosing_search("72-48-0", "cas")
    clean = clean_cosing(raw)
    print(clean)

@@ -6,9 +6,10 @@ import re
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from playwright.sync_api import sync_playwright
from typing import Callable, Any

from pif_compiler.functions.common_log import get_logger
from pif_compiler.functions.db_utils import db_connect, log_ricerche

log = get_logger()
load_dotenv()

@@ -311,6 +312,114 @@ def parse_toxicology_html(html_content):

    return result


def parse_value_with_unit(value_str: str) -> tuple:
    """Parse a combined value+unit string like '5,040mg/kg bw/day' into (value, unit)."""
    if not value_str:
        return ("", "")

    # Pattern to match numeric value (with commas/decimals) followed by unit
    match = re.match(r'^([\d,\.]+)\s*(.*)$', value_str.strip())
    if match:
        numeric_value = match.group(1).replace(',', '')  # Remove commas
        unit = match.group(2).strip()
        return (numeric_value, unit)
    return (value_str, "")
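For illustration, the value/unit split behaves as follows; this is a self-contained copy of the function above so it can run on its own:

```python
import re

def parse_value_with_unit(value_str: str) -> tuple:
    """Split a string like '5,040mg/kg bw/day' into (value, unit)."""
    if not value_str:
        return ("", "")
    # Leading digits (with thousands commas or decimals), then the unit
    match = re.match(r'^([\d,\.]+)\s*(.*)$', value_str.strip())
    if match:
        return (match.group(1).replace(',', ''), match.group(2).strip())
    return (value_str, "")

print(parse_value_with_unit("5,040mg/kg bw/day"))  # → ('5040', 'mg/kg bw/day')
print(parse_value_with_unit("ca. 300 mg/kg"))      # non-numeric prefix: returned unchanged, empty unit
```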


def extract_levels(data: dict | list, extractor: Callable[[dict, list], dict | None]) -> dict:
    """
    Generic function to recursively extract data from nested JSON structures.

    Args:
        data: The JSON data (dict or list) to parse
        extractor: A callable that receives (obj, path) and returns:
            - A dict with extracted data if the object matches criteria
            - None if no match (continue searching)
            The 'path' is a list of labels encountered in the hierarchy.

    Returns:
        A dict keyed by context path (labels joined by " > "), with values
        being whatever the extractor returns.

    Example:
        def my_extractor(obj, path):
            if obj.get("SomeField"):
                return {"field": obj["SomeField"]}
            return None

        results = extract_levels(data, my_extractor)
    """
    results = {}

    def recurse(obj: Any, path: list = None):
        if path is None:
            path = []

        if isinstance(obj, dict):
            # Check for label to use in path
            current_label = obj.get("label", "")
            current_path = path + [current_label] if current_label else path

            # Call the extractor function
            extracted = extractor(obj, current_path)
            if extracted is not None:
                key = " > ".join(filter(None, current_path)) or "root"
                results[key] = extracted

            # Recurse into all values
            for k, val in obj.items():
                if k != "label":
                    recurse(val, current_path)

        elif isinstance(obj, list):
            for item in obj:
                recurse(item, path)

    recurse(data)
    return results


def rdt_extractor(obj: dict, path: list) -> dict | None:
    indicator = obj.get("EffectLevelUnit", "").strip()
    value_str = obj.get("EffectLevelValue", "").strip()

    if indicator and value_str:
        numeric_value, unit = parse_value_with_unit(value_str)

        # Extract route (last label) and toxicity_type (second-to-last label)
        filtered_path = [p for p in path if p]
        route = filtered_path[-1] if len(filtered_path) >= 1 else ""
        toxicity_type = filtered_path[-2] if len(filtered_path) >= 2 else ""

        return {
            "indicator": indicator,
            "value": numeric_value,
            "unit": unit,
            "route": route,
            "toxicity_type": toxicity_type
        }
    return None


def at_extractor(obj: dict, path: list) -> dict | None:
    indicator = obj.get("EffectLevelUnit", "").strip()
    value_str = obj.get("EffectLevelValue", "").strip()

    if indicator and value_str:
        numeric_value, unit = parse_value_with_unit(value_str)

        filtered_path = [p for p in path if p]
        route = filtered_path[-1] if len(filtered_path) >= 1 else ""
        route = route.replace("Acute toxicity: via ", "").strip()

        return {
            "indicator": indicator,
            "value": numeric_value,
            "unit": unit,
            "route": route
        }
    return None

#endregion
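A compact, self-contained sketch of how `extract_levels` pairs with an extractor on a toy nested structure; the `EffectLevelUnit`/`EffectLevelValue` field names match the extractors above, while the sample data itself is invented for illustration:

```python
def extract_levels(data, extractor):
    """Walk nested dicts/lists, tracking 'label' values as the path."""
    results = {}

    def recurse(obj, path=None):
        path = path or []
        if isinstance(obj, dict):
            label = obj.get("label", "")
            current = path + [label] if label else path
            found = extractor(obj, current)
            if found is not None:
                results[" > ".join(filter(None, current)) or "root"] = found
            for k, v in obj.items():
                if k != "label":
                    recurse(v, current)
        elif isinstance(obj, list):
            for item in obj:
                recurse(item, path)

    recurse(data)
    return results

def noael_extractor(obj, path):
    # Match any node carrying an effect level
    if obj.get("EffectLevelUnit"):
        return {"indicator": obj["EffectLevelUnit"], "value": obj["EffectLevelValue"]}
    return None

data = {
    "label": "Toxicological information",
    "sections": [
        {"label": "Repeated dose toxicity: oral",
         "EffectLevelUnit": "NOAEL", "EffectLevelValue": "300 mg/kg bw/day"},
    ],
}

print(extract_levels(data, noael_extractor))
# → {'Toxicological information > Repeated dose toxicity: oral':
#    {'indicator': 'NOAEL', 'value': '300 mg/kg bw/day'}}
```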

#region Orchestrator functions

@@ -418,16 +527,19 @@ def orchestrator(cas: str) -> dict:
    local_record = check_local(cas_validated)
    if local_record:
        log.info(f"Returning local record for CAS {cas}.")
        log_ricerche(cas, 'ECHA', True)
        return local_record
    else:
        log.info("No local record, starting ECHA flow")
        echa_data = echa_flow(cas_validated)
        if echa_data:
            log.info("ECHA flow successful")
            log_ricerche(cas, 'ECHA', True)
            add_to_local(echa_data)
            return echa_data
        else:
            log.error(f"Failed to retrieve ECHA data for CAS {cas}.")
            log_ricerche(cas, 'ECHA', False)
            return None

    # to do: check if document is complete

@@ -437,4 +549,3 @@ def orchestrator(cas: str) -> dict:
if __name__ == "__main__":
    cas_test = "113170-55-1"
    result = orchestrator(cas_test)
    print(result)
tests/README.md (220 lines deleted)
@@ -1,220 +0,0 @@
# PIF Compiler - Test Suite

## Overview

Comprehensive test suite for the PIF Compiler project using `pytest`.

## Structure

```
tests/
├── __init__.py              # Test package marker
├── conftest.py              # Shared fixtures and configuration
├── test_cosing_service.py   # COSING service tests
├── test_models.py           # (TODO) Pydantic model tests
├── test_echa_service.py     # (TODO) ECHA service tests
└── README.md                # This file
```

## Installation

```bash
# Install test dependencies
uv add --dev pytest pytest-cov pytest-mock

# Or manually install
uv pip install pytest pytest-cov pytest-mock
```

## Running Tests

### Run All Tests (Unit only)
```bash
uv run pytest
```

### Run Specific Test File
```bash
uv run pytest tests/test_cosing_service.py
```

### Run Specific Test Class
```bash
uv run pytest tests/test_cosing_service.py::TestParseCasNumbers
```

### Run Specific Test
```bash
uv run pytest tests/test_cosing_service.py::TestParseCasNumbers::test_single_cas_number
```

### Run with Verbose Output
```bash
uv run pytest -v
```

### Run with Coverage Report
```bash
uv run pytest --cov=src/pif_compiler --cov-report=html
# Open htmlcov/index.html in browser
```

## Test Categories

### Unit Tests (Default)
Fast tests with no external dependencies. Run by default.

```bash
uv run pytest -m unit
```

### Integration Tests
Tests that hit real APIs or databases. Skipped by default.

```bash
uv run pytest -m integration
```

### Slow Tests
Tests that take longer to run. Skipped by default.

```bash
uv run pytest -m slow
```

### Database Tests
Tests requiring MongoDB. Ensure Docker is running.

```bash
cd utils
docker-compose up -d
uv run pytest -m database
```

## Test Organization

### `test_cosing_service.py`

**Coverage:**
- ✅ `parse_cas_numbers()` - CAS parsing logic
  - Single/multiple CAS
  - Different separators (`/`, `;`, `,`, `--`)
  - Parentheses removal
  - Whitespace handling
  - Invalid dash removal

- ✅ `cosing_search()` - API search
  - Search by name
  - Search by CAS
  - Search by EC number
  - Search by ID
  - No results handling
  - Invalid mode error

- ✅ `clean_cosing()` - JSON cleaning
  - Basic field cleaning
  - Empty tag removal
  - CAS parsing
  - URL creation
  - Field renaming

- ✅ Integration tests (marked as `@pytest.mark.integration`)
  - Real API calls (requires internet)

## Writing New Tests

### Example Unit Test

```python
class TestMyFunction:
    """Test my_function."""

    def test_basic_case(self):
        """Test basic functionality."""
        result = my_function("input")
        assert result == "expected"

    def test_edge_case(self):
        """Test edge case handling."""
        with pytest.raises(ValueError):
            my_function("invalid")
```

### Example Mock Test

```python
from unittest.mock import Mock, patch

@patch('module.external_api_call')
def test_with_mock(mock_api):
    """Test with mocked external call."""
    mock_api.return_value = {"data": "mocked"}
    result = my_function()
    assert result == "expected"
    mock_api.assert_called_once()
```

### Example Fixture Usage

```python
def test_with_fixture(sample_cosing_response):
    """Test using a fixture from conftest.py."""
    result = clean_cosing(sample_cosing_response)
    assert "cosingUrl" in result
```

## Best Practices

1. **Naming**: Test files/classes/functions start with `test_`
2. **Arrange-Act-Assert**: Structure tests clearly
3. **One assertion focus**: Each test should test one thing
4. **Use fixtures**: Reuse test data via `conftest.py`
5. **Mock external calls**: Don't hit real APIs in unit tests
6. **Mark appropriately**: Use `@pytest.mark.integration` for slow tests
7. **Descriptive names**: Test names should describe what they test

## Common Commands

```bash
# Run fast tests only (skip integration/slow)
uv run pytest -m "not integration and not slow"

# Run only integration tests
uv run pytest -m integration

# Run with detailed output
uv run pytest -vv

# Stop at first failure
uv run pytest -x

# Run last failed tests
uv run pytest --lf

# Run tests matching pattern
uv run pytest -k "test_parse"

# Generate coverage report
uv run pytest --cov=src/pif_compiler --cov-report=term-missing
```

## CI/CD Integration

For GitHub Actions (example):

```yaml
- name: Run tests
  run: |
    uv run pytest -m "not integration" --cov --cov-report=xml
```

## TODO

- [ ] Add tests for `models.py` (Pydantic validation)
- [ ] Add tests for `echa_service.py`
- [ ] Add tests for `echa_parser.py`
- [ ] Add tests for `echa_extractor.py`
- [ ] Add tests for `database_service.py`
- [ ] Add tests for `pubchem_service.py`
- [ ] Add integration tests with test database
- [ ] Set up GitHub Actions CI

@@ -1,86 +0,0 @@
# Quick Start - Running Tests

## 1. Install Test Dependencies

```bash
# Add pytest and related tools
uv add --dev pytest pytest-cov pytest-mock
```

## 2. Run the Tests

```bash
# Run all unit tests (fast, no API calls)
uv run pytest

# Run with more detail
uv run pytest -v

# Run just the COSING tests
uv run pytest tests/test_cosing_service.py

# Run integration tests (will hit real COSING API)
uv run pytest -m integration
```

## 3. See Coverage

```bash
# Generate HTML coverage report
uv run pytest --cov=src/pif_compiler --cov-report=html

# Open htmlcov/index.html in your browser
```

## What the Tests Cover

### ✅ `parse_cas_numbers()`
- Parses single CAS: `["7732-18-5"]` → `["7732-18-5"]`
- Parses multiple: `["7732-18-5/56-81-5"]` → `["7732-18-5", "56-81-5"]`
- Handles separators: `/`, `;`, `,`, `--`
- Removes parentheses: `["7732-18-5 (hydrate)"]` → `["7732-18-5"]`
- Cleans whitespace and invalid dashes

### ✅ `cosing_search()`
- Mocks API calls (no internet needed for unit tests)
- Tests search by name, CAS, EC, ID
- Tests error handling
- Integration tests hit real API

### ✅ `clean_cosing()`
- Cleans COSING JSON responses
- Removes empty tags
- Parses CAS numbers
- Creates COSING URLs
- Renames fields

## Test Results Example

```
tests/test_cosing_service.py::TestParseCasNumbers::test_single_cas_number PASSED
tests/test_cosing_service.py::TestParseCasNumbers::test_multiple_cas_with_slash PASSED
tests/test_cosing_service.py::TestCosingSearch::test_search_by_name_success PASSED
...
================================ 25 passed in 0.5s ================================
```

## Troubleshooting

### Import errors
Make sure you're in the project root:
```bash
cd c:\Users\adish\Projects\pif_compiler
uv run pytest
```

### Mock not found
Install pytest-mock:
```bash
uv add --dev pytest-mock
```

### Integration tests failing
These hit the real API and need internet. Skip them:
```bash
uv run pytest -m "not integration"
```

@@ -1,3 +0,0 @@
"""
PIF Compiler - Test Suite
"""

@ -1,247 +0,0 @@
|
|||
"""
|
||||
Pytest configuration and fixtures for PIF Compiler tests.
|
||||
|
||||
This file contains shared fixtures and configuration for all tests.
|
||||
"""
|
||||
|
||||
import pytest
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
# Add src to Python path for imports
|
||||
src_path = Path(__file__).parent.parent / "src"
|
||||
sys.path.insert(0, str(src_path))
|
||||
|
||||
|
||||
# Sample data fixtures
|
||||
@pytest.fixture
|
||||
def sample_cas_numbers():
|
||||
"""Real CAS numbers for testing common cosmetic ingredients."""
|
||||
return {
|
||||
"water": "7732-18-5",
|
||||
"glycerin": "56-81-5",
|
||||
"sodium_hyaluronate": "9067-32-7",
|
||||
"niacinamide": "98-92-0",
|
||||
"ascorbic_acid": "50-81-7",
|
||||
"retinol": "68-26-8",
|
||||
"lanolin": "85507-69-3",
|
||||
"sodium_chloride": "7647-14-5",
|
||||
"propylene_glycol": "57-55-6",
|
||||
"butylene_glycol": "107-88-0",
|
||||
"salicylic_acid": "69-72-7",
|
||||
"tocopherol": "59-02-9",
|
||||
"caffeine": "58-08-2",
|
||||
"citric_acid": "77-92-9",
|
||||
"hyaluronic_acid": "9004-61-9",
|
||||
"sodium_hyaluronate_crosspolymer": "63148-62-9",
|
||||
"zinc_oxide": "1314-13-2",
|
||||
"titanium_dioxide": "13463-67-7",
|
||||
"lactic_acid": "50-21-5",
|
||||
"lanolin_oil": "8006-54-0",
|
||||
}
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def sample_cosing_response():
|
||||
"""Sample COSING API response for testing."""
|
||||
return {
|
||||
"inciName": ["WATER"],
|
||||
"casNo": ["7732-18-5"],
|
||||
"ecNo": ["231-791-2"],
|
||||
"substanceId": ["12345"],
|
||||
"itemType": ["Ingredient"],
|
||||
"functionName": ["Solvent"],
|
||||
"chemicalName": ["Dihydrogen monoxide"],
|
||||
"nameOfCommonIngredientsGlossary": ["Water"],
|
||||
"sccsOpinion": [],
|
||||
"sccsOpinionUrls": [],
|
||||
"otherRestrictions": [],
|
||||
"identifiedIngredient": [],
|
||||
"annexNo": [],
|
||||
"otherRegulations": [],
|
||||
"refNo": ["REF123"],
|
||||
"phEurName": [],
|
||||
"innName": []
|
||||
}
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def sample_ingredient_data():
|
||||
"""Sample ingredient data for Pydantic model testing."""
|
||||
return {
|
||||
"inci_name": "WATER",
|
||||
"cas": "7732-18-5",
|
||||
"quantity": 70.0,
|
||||
"mol_weight": 18,
|
||||
"dap": 0.5,
|
||||
}
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def sample_pif_data():
|
||||
"""Sample PIF data for testing."""
|
||||
return {
|
||||
"company": "Beauty Corp",
|
||||
"product_name": "Face Cream",
|
||||
"type": "MOISTURIZER",
|
||||
"physical_form": "CREMA",
|
||||
"CNCP": 123456,
|
||||
"production_company": {
|
||||
"prod_company_name": "Manufacturer Inc",
|
||||
"prod_vat": 12345678,
|
||||
"prod_address": "123 Main St, City, Country"
|
||||
},
|
||||
"ingredients": [
|
||||
{
|
||||
"inci_name": "WATER",
|
||||
"cas": "7732-18-5",
|
||||
"quantity": 70.0,
|
||||
"dap": 0.5
|
||||
},
|
||||
{
|
||||
"inci_name": "GLYCERIN",
|
||||
"cas": "56-81-5",
|
||||
"quantity": 10.0,
|
||||
"dap": 0.5
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def sample_echa_substance_response():
|
||||
"""Sample ECHA substance search API response for glycerin."""
|
||||
return {
|
||||
"items": [{
|
||||
"substanceIndex": {
|
||||
"rmlId": "100.029.181",
|
||||
"rmlName": "glycerol",
|
||||
"rmlCas": "56-81-5",
|
||||
"rmlEc": "200-289-5"
|
||||
}
|
||||
}]
|
||||
}
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def sample_echa_substance_response_water():
|
||||
"""Sample ECHA substance search API response for water."""
|
||||
return {
|
||||
"items": [{
|
||||
"substanceIndex": {
|
||||
"rmlId": "100.028.902",
|
||||
"rmlName": "water",
|
||||
"rmlCas": "7732-18-5",
|
||||
"rmlEc": "231-791-2"
|
||||
}
|
||||
}]
|
||||
}
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def sample_echa_substance_response_niacinamide():
|
||||
"""Sample ECHA substance search API response for niacinamide."""
|
||||
return {
|
||||
"items": [{
|
||||
"substanceIndex": {
|
||||
"rmlId": "100.002.530",
|
||||
"rmlName": "nicotinamide",
|
||||
"rmlCas": "98-92-0",
|
||||
"rmlEc": "202-713-4"
|
||||
}
|
||||
}]
|
||||
}
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def sample_echa_dossier_response():
|
||||
"""Sample ECHA dossier list API response."""
|
||||
return {
|
||||
"items": [{
|
||||
"assetExternalId": "abc123def456",
|
||||
"rootKey": "key123",
|
||||
"lastUpdatedDate": "2024-01-15T10:30:00Z"
|
||||
}]
|
||||
}
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def sample_echa_index_html_full():
|
||||
"""Sample ECHA index.html with all toxicology sections."""
|
||||
return """
|
||||
<html>
|
||||
<head><title>ECHA Dossier</title></head>
|
||||
<body>
|
||||
<div id="id_7_Toxicologicalinformation">
|
||||
<a href="tox_summary_001"></a>
|
||||
</div>
|
||||
<div id="id_72_AcuteToxicity">
|
||||
<a href="acute_tox_001"></a>
|
||||
</div>
|
||||
<div id="id_75_Repeateddosetoxicity">
|
||||
<a href="repeated_dose_001"></a>
|
||||
</div>
|
||||
</body>
|
||||
</html>
|
||||
"""
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def sample_echa_index_html_partial():
|
||||
"""Sample ECHA index.html with only ToxSummary section."""
|
||||
return """
|
||||
<html>
|
||||
<head><title>ECHA Dossier</title></head>
|
||||
<body>
|
||||
<div id="id_7_Toxicologicalinformation">
|
||||
<a href="tox_summary_001"></a>
|
||||
</div>
|
||||
</body>
|
||||
</html>
|
||||
"""
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def sample_echa_index_html_empty():
|
||||
"""Sample ECHA index.html with no toxicology sections."""
|
||||
return """
|
||||
<html>
|
||||
<head><title>ECHA Dossier</title></head>
|
||||
<body>
|
||||
<p>No toxicology information available</p>
|
||||
</body>
|
||||
</html>
|
||||
"""
|
||||
|
||||
|
||||
# Skip markers
|
||||
def pytest_configure(config):
|
||||
"""Configure custom markers."""
|
||||
config.addinivalue_line(
|
||||
"markers", "unit: mark test as a unit test (fast, no external deps)"
|
||||
)
|
||||
config.addinivalue_line(
|
||||
"markers", "integration: mark test as integration test (may use real APIs)"
|
||||
)
|
||||
config.addinivalue_line(
|
||||
"markers", "slow: mark test as slow (skip by default)"
|
||||
)
|
||||
config.addinivalue_line(
|
||||
"markers", "database: mark test as requiring database"
|
||||
)
|
||||
|
||||
|
||||
def pytest_collection_modifyitems(config, items):
    """Modify test collection to skip slow/integration tests by default."""
    skip_slow = pytest.mark.skip(reason="Slow test (use -m slow to run)")
    skip_integration = pytest.mark.skip(reason="Integration test (use -m integration to run)")

    # Only skip if not explicitly requested; use a membership check so that
    # compound expressions such as -m "slow or integration" also opt in
    markexpr = config.getoption("-m") or ""
    run_slow = "slow" in markexpr
    run_integration = "integration" in markexpr

    for item in items:
        if "slow" in item.keywords and not run_slow:
            item.add_marker(skip_slow)
        if "integration" in item.keywords and not run_integration:
            item.add_marker(skip_integration)
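The hook's selection rule (skip `slow`/`integration` items unless the `-m` expression asks for them) can be sketched in isolation. `select_skips` below is a hypothetical stand-in for illustration, not part of the project:

```python
def select_skips(markexpr, item_keywords):
    """Return True when an item with these marker keywords would be skipped.

    markexpr mimics config.getoption("-m"); item_keywords mimics item.keywords.
    """
    markexpr = markexpr or ""
    run_slow = "slow" in markexpr
    run_integration = "integration" in markexpr
    # An item is skipped when it carries a gated marker that was not requested
    if "slow" in item_keywords and not run_slow:
        return True
    if "integration" in item_keywords and not run_integration:
        return True
    return False


# Default run (no -m): gated items are skipped
assert select_skips("", {"slow"}) is True
# Explicit opt-in runs them
assert select_skips("slow", {"slow"}) is False
assert select_skips("integration", {"integration"}) is False
```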
@@ -1,254 +0,0 @@
"""
Tests for COSING Service

Test coverage:
- parse_cas_numbers: CAS number parsing logic
- cosing_search: API search functionality
- clean_cosing: JSON cleaning and formatting
"""

import pytest
from unittest.mock import Mock, patch
from pif_compiler.services.srv_cosing import (
    parse_cas_numbers,
    cosing_search,
    clean_cosing,
)


class TestParseCasNumbers:
    """Test CAS number parsing function."""

    def test_single_cas_number(self):
        """Test parsing a single CAS number."""
        result = parse_cas_numbers(["7732-18-5"])
        assert result == ["7732-18-5"]

    def test_multiple_cas_with_slash(self):
        """Test parsing multiple CAS numbers separated by slash."""
        result = parse_cas_numbers(["7732-18-5/56-81-5"])
        assert result == ["7732-18-5", "56-81-5"]

    def test_multiple_cas_with_semicolon(self):
        """Test parsing multiple CAS numbers separated by semicolon."""
        result = parse_cas_numbers(["7732-18-5;56-81-5"])
        assert result == ["7732-18-5", "56-81-5"]

    def test_multiple_cas_with_comma(self):
        """Test parsing multiple CAS numbers separated by comma."""
        result = parse_cas_numbers(["7732-18-5,56-81-5"])
        assert result == ["7732-18-5", "56-81-5"]

    def test_double_dash_separator(self):
        """Test parsing CAS numbers with double dash separator."""
        result = parse_cas_numbers(["7732-18-5--56-81-5"])
        assert result == ["7732-18-5", "56-81-5"]

    def test_cas_with_parentheses(self):
        """Test that parenthetical info is removed."""
        result = parse_cas_numbers(["7732-18-5 (hydrate)"])
        assert result == ["7732-18-5"]

    def test_cas_with_extra_whitespace(self):
        """Test that extra whitespace is trimmed."""
        result = parse_cas_numbers([" 7732-18-5 / 56-81-5 "])
        assert result == ["7732-18-5", "56-81-5"]

    def test_removes_invalid_dash(self):
        """Test that standalone dashes are removed."""
        result = parse_cas_numbers(["7732-18-5/-/56-81-5"])
        assert result == ["7732-18-5", "56-81-5"]

    def test_complex_mixed_separators(self):
        """Test with multiple separator types."""
        result = parse_cas_numbers(["7732-18-5/56-81-5;50-00-0"])
        assert result == ["7732-18-5", "56-81-5", "50-00-0"]


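The cases above pin the parsing contract down fairly tightly. As an illustration only (the project's real `parse_cas_numbers` lives in `srv_cosing` and may differ), a minimal parser satisfying exactly these cases might look like:

```python
import re


def parse_cas_numbers_sketch(raw_values):
    """Split raw CAS strings on /, ;, , and -- separators,
    strip parenthetical notes and whitespace, and drop stray dashes."""
    results = []
    for raw in raw_values:
        # Remove parenthetical info such as "(hydrate)"
        cleaned = re.sub(r"\([^)]*\)", "", raw)
        # Split on "--" first, then on single-character separators
        for part in re.split(r"--|[/;,]", cleaned):
            part = part.strip()
            # Keep only tokens shaped like CAS numbers (2-7 digits, 2 digits, 1 digit)
            if re.fullmatch(r"\d{2,7}-\d{2}-\d", part):
                results.append(part)
    return results
```

Validating against the same inputs the tests use, e.g. `parse_cas_numbers_sketch(["7732-18-5/-/56-81-5"])` drops the standalone dash because it fails the CAS shape check.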
class TestCosingSearch:
    """Test COSING API search functionality."""

    @patch('pif_compiler.services.srv_cosing.req.post')
    def test_search_by_name_success(self, mock_post):
        """Test successful search by ingredient name."""
        # Mock API response
        mock_response = Mock()
        mock_response.json.return_value = {
            "results": [{
                "metadata": {
                    "inciName": ["WATER"],
                    "casNo": ["7732-18-5"],
                    "substanceId": ["12345"]
                }
            }]
        }
        mock_post.return_value = mock_response

        result = cosing_search("WATER", mode="name")

        assert result is not None
        assert result["inciName"] == ["WATER"]
        assert result["casNo"] == ["7732-18-5"]

    @patch('pif_compiler.services.srv_cosing.req.post')
    def test_search_by_cas_success(self, mock_post):
        """Test successful search by CAS number."""
        mock_response = Mock()
        mock_response.json.return_value = {
            "results": [{
                "metadata": {
                    "inciName": ["WATER"],
                    "casNo": ["7732-18-5"]
                }
            }]
        }
        mock_post.return_value = mock_response

        result = cosing_search("7732-18-5", mode="cas")

        assert result is not None
        assert "7732-18-5" in result["casNo"]

    @patch('pif_compiler.services.srv_cosing.req.post')
    def test_search_by_ec_success(self, mock_post):
        """Test successful search by EC number."""
        mock_response = Mock()
        mock_response.json.return_value = {
            "results": [{
                "metadata": {
                    "ecNo": ["231-791-2"]
                }
            }]
        }
        mock_post.return_value = mock_response

        result = cosing_search("231-791-2", mode="ec")

        assert result is not None
        assert "231-791-2" in result["ecNo"]

    @patch('pif_compiler.services.srv_cosing.req.post')
    def test_search_by_id_success(self, mock_post):
        """Test successful search by substance ID."""
        mock_response = Mock()
        mock_response.json.return_value = {
            "results": [{
                "metadata": {
                    "substanceId": ["12345"]
                }
            }]
        }
        mock_post.return_value = mock_response

        result = cosing_search("12345", mode="id")

        assert result is not None
        assert result["substanceId"] == ["12345"]

    @patch('pif_compiler.services.srv_cosing.req.post')
    def test_search_no_results(self, mock_post):
        """Test that a search with no results returns None."""
        mock_response = Mock()
        mock_response.json.return_value = {"results": []}
        mock_post.return_value = mock_response

        result = cosing_search("NONEXISTENT", mode="name")
        assert result is None

    def test_search_invalid_mode(self):
        """Test that an invalid mode raises ValueError."""
        with pytest.raises(ValueError):
            cosing_search("WATER", mode="invalid_mode")


class TestCleanCosing:
    """Test COSING JSON cleaning function."""

    def test_clean_basic_fields(self, sample_cosing_response):
        """Test cleaning basic string and list fields."""
        result = clean_cosing(sample_cosing_response, full=False)

        assert result["inciName"] == "WATER"
        assert result["casNo"] == ["7732-18-5"]
        assert result["ecNo"] == ["231-791-2"]

    def test_removes_empty_tags(self, sample_cosing_response):
        """Test that <empty> tags are removed."""
        sample_cosing_response["inciName"] = ["<empty>"]
        sample_cosing_response["functionName"] = ["<empty>"]

        result = clean_cosing(sample_cosing_response, full=False)

        assert "<empty>" not in result["inciName"]
        assert result["functionName"] == []

    def test_parses_cas_numbers(self, sample_cosing_response):
        """Test that CAS numbers are parsed correctly."""
        sample_cosing_response["casNo"] = ["56-81-5"]

        result = clean_cosing(sample_cosing_response, full=False)

        assert result["casNo"] == ["56-81-5"]

    def test_creates_cosing_url(self, sample_cosing_response):
        """Test that the COSING URL is created."""
        result = clean_cosing(sample_cosing_response, full=False)

        assert "cosingUrl" in result
        assert "12345" in result["cosingUrl"]
        assert result["cosingUrl"] == "https://ec.europa.eu/growth/tools-databases/cosing/details/12345"

    def test_renames_common_name(self, sample_cosing_response):
        """Test that nameOfCommonIngredientsGlossary is renamed."""
        result = clean_cosing(sample_cosing_response, full=False)

        assert "commonName" in result
        assert result["commonName"] == "Water"
        assert "nameOfCommonIngredientsGlossary" not in result

    def test_empty_lists_handled(self, sample_cosing_response):
        """Test that empty lists are handled correctly."""
        sample_cosing_response["inciName"] = []
        sample_cosing_response["casNo"] = []

        result = clean_cosing(sample_cosing_response, full=False)

        assert result["inciName"] == ""
        assert result["casNo"] == []
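Taken together, these tests imply the shape of the cleaning step. The following is a rough sketch of that behavior; `clean_cosing_sketch` is hypothetical, covers only the cases asserted here, and is not the project's real `clean_cosing`:

```python
def clean_cosing_sketch(raw, full=False):
    """Normalize a raw COSING metadata dict along the lines the tests expect."""
    result = dict(raw)

    # Drop "<empty>" placeholder tags from list fields
    for key, value in result.items():
        if isinstance(value, list):
            result[key] = [v for v in value if v != "<empty>"]

    # inciName collapses to a plain string (empty string if absent)
    inci = result.get("inciName", [])
    result["inciName"] = inci[0] if inci else ""

    # Build the public COSING detail URL from the substance id
    sid = result.get("substanceId", [])
    if sid:
        result["cosingUrl"] = (
            "https://ec.europa.eu/growth/tools-databases/cosing/details/" + sid[0]
        )

    # Rename the glossary field to the shorter commonName
    if "nameOfCommonIngredientsGlossary" in result:
        glossary = result.pop("nameOfCommonIngredientsGlossary")
        result["commonName"] = glossary[0] if glossary else ""

    return result
```

For example, a raw dict with `substanceId: ["12345"]` comes back with a `cosingUrl` ending in `/12345`, matching `test_creates_cosing_url`.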
class TestIntegration:
    """Integration tests with the real API (marked as integration)."""

    @pytest.mark.integration
    def test_real_water_search(self):
        """Test real API call for WATER (requires internet)."""
        result = cosing_search("WATER", mode="name")

        if result and isinstance(result, dict):
            # Real API call succeeded
            assert "inciName" in result or "casNo" in result

    @pytest.mark.integration
    def test_real_cas_search(self):
        """Test real API call by CAS number (requires internet)."""
        result = cosing_search("56-81-5", mode="cas")

        if result and isinstance(result, dict):
            assert "casNo" in result

    @pytest.mark.integration
    def test_full_workflow(self):
        """Test complete workflow: search -> clean."""
        # Search for glycerin
        raw_result = cosing_search("GLYCERIN", mode="name")

        if raw_result and isinstance(raw_result, dict):
            # Clean the result
            clean_result = clean_cosing(raw_result, full=False)

            # Verify cleaned structure
            assert "cosingUrl" in clean_result
            assert isinstance(clean_result.get("casNo"), list)

@@ -1,857 +0,0 @@
"""
Tests for ECHA Find Service

Test coverage:
- search_dossier: Complete workflow for searching ECHA dossiers
- Substance search (by CAS, EC, rmlName)
- Dossier retrieval (Active/Inactive)
- HTML parsing for toxicology sections
- Error handling and edge cases
"""

import pytest
from unittest.mock import Mock, patch
from pif_compiler.services.echa_find import search_dossier


class TestSearchDossierSubstanceSearch:
    """Test the initial substance search phase of search_dossier."""

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_successful_cas_search(self, mock_get):
        """Test successful search by CAS number."""
        # First call: substance search
        mock_response = Mock()
        mock_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        # Second call: dossier list
        mock_dossier_response = Mock()
        mock_dossier_response.json.return_value = {
            "items": [{
                "assetExternalId": "abc123",
                "rootKey": "key123",
                "lastUpdatedDate": "2024-01-15T10:30:00Z"
            }]
        }

        # Third call: index.html page
        mock_index_response = Mock()
        mock_index_response.text = """
        <html>
        <div id="id_7_Toxicologicalinformation">
            <a href="tox_summary_001"></a>
        </div>
        <div id="id_72_AcuteToxicity">
            <a href="acute_tox_001"></a>
        </div>
        <div id="id_75_Repeateddosetoxicity">
            <a href="repeated_dose_001"></a>
        </div>
        </html>
        """

        mock_get.side_effect = [
            mock_response,
            mock_dossier_response,
            mock_index_response
        ]

        result = search_dossier("50-00-0", input_type="rmlCas")

        assert result is not False
        assert result["rmlCas"] == "50-00-0"
        assert result["rmlName"] == "Test Substance"
        assert result["rmlId"] == "100.000.001"
        assert result["rmlEc"] == "200-001-8"

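The three-response `side_effect` sequencing used throughout these tests can be shown in isolation. The mock shapes below mirror the fixtures above, but `get` here is a plain `Mock` standing in for the patched `requests.get`:

```python
from unittest.mock import Mock

# One mock per expected HTTP call, in the order search_dossier makes them:
# substance search -> dossier list -> index.html
substance, dossier, index = Mock(), Mock(), Mock()
substance.json.return_value = {"items": [{"substanceIndex": {"rmlCas": "50-00-0"}}]}
dossier.json.return_value = {"items": [{"assetExternalId": "abc123"}]}
index.text = "<html></html>"

get = Mock(side_effect=[substance, dossier, index])

# Each call consumes the next element of side_effect, in order
first = get("https://example.invalid/substance")
second = get("https://example.invalid/dossier")
third = get("https://example.invalid/index.html")

assert first.json()["items"][0]["substanceIndex"]["rmlCas"] == "50-00-0"
assert second.json()["items"][0]["assetExternalId"] == "abc123"
assert third.text == "<html></html>"
```

A fourth call would raise `StopIteration`, which is a useful safety net: it fails loudly if the code under test makes more requests than the test expects.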
    @patch('pif_compiler.services.echa_find.requests.get')
    def test_successful_ec_search(self, mock_get):
        """Test successful search by EC number."""
        mock_response = Mock()
        mock_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        mock_dossier_response = Mock()
        mock_dossier_response.json.return_value = {
            "items": [{
                "assetExternalId": "abc123",
                "rootKey": "key123",
                "lastUpdatedDate": "2024-01-15T10:30:00Z"
            }]
        }

        mock_index_response = Mock()
        mock_index_response.text = "<html><div id='id_7_Toxicologicalinformation'><a href='tox_001'></a></div></html>"

        mock_get.side_effect = [
            mock_response,
            mock_dossier_response,
            mock_index_response
        ]

        result = search_dossier("200-001-8", input_type="rmlEc")

        assert result is not False
        assert result["rmlEc"] == "200-001-8"

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_successful_name_search(self, mock_get):
        """Test successful search by substance name."""
        mock_response = Mock()
        mock_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "formaldehyde",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        mock_dossier_response = Mock()
        mock_dossier_response.json.return_value = {
            "items": [{
                "assetExternalId": "abc123",
                "rootKey": "key123",
                "lastUpdatedDate": "2024-01-15T10:30:00Z"
            }]
        }

        mock_index_response = Mock()
        mock_index_response.text = "<html><div id='id_7_Toxicologicalinformation'><a href='tox_001'></a></div></html>"

        mock_get.side_effect = [
            mock_response,
            mock_dossier_response,
            mock_index_response
        ]

        result = search_dossier("formaldehyde", input_type="rmlName")

        assert result is not False
        assert result["rmlName"] == "formaldehyde"

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_substance_not_found(self, mock_get):
        """Test when the substance is not found in ECHA."""
        mock_response = Mock()
        mock_response.json.return_value = {"items": []}
        mock_get.return_value = mock_response

        result = search_dossier("999-99-9", input_type="rmlCas")

        assert result is False

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_empty_items_array(self, mock_get):
        """Test when the API returns an empty items array."""
        mock_response = Mock()
        mock_response.json.return_value = {"items": []}
        mock_get.return_value = mock_response

        result = search_dossier("NONEXISTENT", input_type="rmlName")

        assert result is False

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_malformed_api_response(self, mock_get):
        """Test when the API response is malformed."""
        mock_response = Mock()
        mock_response.json.return_value = {}  # Missing 'items' key
        mock_get.return_value = mock_response

        result = search_dossier("50-00-0", input_type="rmlCas")

        assert result is False


class TestSearchDossierInputTypeValidation:
    """Test input_type parameter validation."""

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_input_type_mismatch_cas(self, mock_get):
        """Test when input_type doesn't match the actual search result (CAS)."""
        mock_response = Mock()
        mock_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }
        mock_get.return_value = mock_response

        # Search with a CAS number but specify the wrong input_type
        result = search_dossier("50-00-0", input_type="rmlEc")

        assert isinstance(result, str)
        assert "search_error" in result
        assert "not equal" in result

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_input_type_correct_match(self, mock_get):
        """Test when input_type correctly matches the search result."""
        mock_response = Mock()
        mock_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        mock_dossier_response = Mock()
        mock_dossier_response.json.return_value = {
            "items": [{
                "assetExternalId": "abc123",
                "rootKey": "key123",
                "lastUpdatedDate": "2024-01-15T10:30:00Z"
            }]
        }

        mock_index_response = Mock()
        mock_index_response.text = "<html><div id='id_7_Toxicologicalinformation'><a href='tox_001'></a></div></html>"

        mock_get.side_effect = [
            mock_response,
            mock_dossier_response,
            mock_index_response
        ]

        result = search_dossier("50-00-0", input_type="rmlCas")

        assert result is not False
        assert isinstance(result, dict)


class TestSearchDossierDossierRetrieval:
    """Test dossier retrieval (Active/Inactive)."""

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_active_dossier_found(self, mock_get):
        """Test when an active dossier is found."""
        mock_substance_response = Mock()
        mock_substance_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        mock_dossier_response = Mock()
        mock_dossier_response.json.return_value = {
            "items": [{
                "assetExternalId": "abc123",
                "rootKey": "key123",
                "lastUpdatedDate": "2024-01-15T10:30:00Z"
            }]
        }

        mock_index_response = Mock()
        mock_index_response.text = "<html><div id='id_7_Toxicologicalinformation'><a href='tox_001'></a></div></html>"

        mock_get.side_effect = [
            mock_substance_response,
            mock_dossier_response,
            mock_index_response
        ]

        result = search_dossier("50-00-0", input_type="rmlCas")

        assert result is not False
        assert result["dossierType"] == "Active"

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_inactive_dossier_fallback(self, mock_get):
        """Test when only an inactive dossier exists (fallback)."""
        mock_substance_response = Mock()
        mock_substance_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        # First dossier call returns empty (no active dossier)
        mock_active_dossier_response = Mock()
        mock_active_dossier_response.json.return_value = {"items": []}

        # Second dossier call returns the inactive dossier
        mock_inactive_dossier_response = Mock()
        mock_inactive_dossier_response.json.return_value = {
            "items": [{
                "assetExternalId": "abc123",
                "rootKey": "key123",
                "lastUpdatedDate": "2024-01-15T10:30:00Z"
            }]
        }

        mock_index_response = Mock()
        mock_index_response.text = "<html><div id='id_7_Toxicologicalinformation'><a href='tox_001'></a></div></html>"

        mock_get.side_effect = [
            mock_substance_response,
            mock_active_dossier_response,
            mock_inactive_dossier_response,
            mock_index_response
        ]

        result = search_dossier("50-00-0", input_type="rmlCas")

        assert result is not False
        assert result["dossierType"] == "Inactive"

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_no_dossiers_found(self, mock_get):
        """Test when no dossiers (active or inactive) are found."""
        mock_substance_response = Mock()
        mock_substance_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        # Both active and inactive lookups return empty
        mock_empty_response = Mock()
        mock_empty_response.json.return_value = {"items": []}

        mock_get.side_effect = [
            mock_substance_response,
            mock_empty_response,  # Active
            mock_empty_response   # Inactive
        ]

        result = search_dossier("50-00-0", input_type="rmlCas")

        assert result is False

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_last_update_date_parsed(self, mock_get):
        """Test that lastUpdateDate is correctly parsed."""
        mock_substance_response = Mock()
        mock_substance_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        mock_dossier_response = Mock()
        mock_dossier_response.json.return_value = {
            "items": [{
                "assetExternalId": "abc123",
                "rootKey": "key123",
                "lastUpdatedDate": "2024-01-15T10:30:00Z"
            }]
        }

        mock_index_response = Mock()
        mock_index_response.text = "<html><div id='id_7_Toxicologicalinformation'><a href='tox_001'></a></div></html>"

        mock_get.side_effect = [
            mock_substance_response,
            mock_dossier_response,
            mock_index_response
        ]

        result = search_dossier("50-00-0", input_type="rmlCas")

        assert result is not False
        assert "lastUpdateDate" in result
        assert result["lastUpdateDate"] == "2024-01-15"

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_missing_last_update_date(self, mock_get):
        """Test when lastUpdatedDate is missing from the response."""
        mock_substance_response = Mock()
        mock_substance_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        mock_dossier_response = Mock()
        mock_dossier_response.json.return_value = {
            "items": [{
                "assetExternalId": "abc123",
                "rootKey": "key123"
                # lastUpdatedDate missing
            }]
        }

        mock_index_response = Mock()
        mock_index_response.text = "<html><div id='id_7_Toxicologicalinformation'><a href='tox_001'></a></div></html>"

        mock_get.side_effect = [
            mock_substance_response,
            mock_dossier_response,
            mock_index_response
        ]

        result = search_dossier("50-00-0", input_type="rmlCas")

        assert result is not False
        # Should still work, just without lastUpdateDate
        assert "lastUpdateDate" not in result


class TestSearchDossierHTMLParsing:
    """Test HTML parsing for toxicology sections."""

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_all_tox_sections_found(self, mock_get):
        """Test when all toxicology sections are found."""
        mock_substance_response = Mock()
        mock_substance_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        mock_dossier_response = Mock()
        mock_dossier_response.json.return_value = {
            "items": [{
                "assetExternalId": "abc123",
                "rootKey": "key123",
                "lastUpdatedDate": "2024-01-15T10:30:00Z"
            }]
        }

        mock_index_response = Mock()
        mock_index_response.text = """
        <html>
        <div id="id_7_Toxicologicalinformation">
            <a href="tox_summary_001"></a>
        </div>
        <div id="id_72_AcuteToxicity">
            <a href="acute_tox_001"></a>
        </div>
        <div id="id_75_Repeateddosetoxicity">
            <a href="repeated_dose_001"></a>
        </div>
        </html>
        """

        mock_get.side_effect = [
            mock_substance_response,
            mock_dossier_response,
            mock_index_response
        ]

        result = search_dossier("50-00-0", input_type="rmlCas")

        assert result is not False
        assert "ToxSummary" in result
        assert "AcuteToxicity" in result
        assert "RepeatedDose" in result
        assert "tox_summary_001" in result["ToxSummary"]
        assert "acute_tox_001" in result["AcuteToxicity"]
        assert "repeated_dose_001" in result["RepeatedDose"]

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_only_tox_summary_found(self, mock_get):
        """Test when only the ToxSummary section exists."""
        mock_substance_response = Mock()
        mock_substance_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        mock_dossier_response = Mock()
        mock_dossier_response.json.return_value = {
            "items": [{
                "assetExternalId": "abc123",
                "rootKey": "key123",
                "lastUpdatedDate": "2024-01-15T10:30:00Z"
            }]
        }

        mock_index_response = Mock()
        mock_index_response.text = """
        <html>
        <div id="id_7_Toxicologicalinformation">
            <a href="tox_summary_001"></a>
        </div>
        </html>
        """

        mock_get.side_effect = [
            mock_substance_response,
            mock_dossier_response,
            mock_index_response
        ]

        result = search_dossier("50-00-0", input_type="rmlCas")

        assert result is not False
        assert "ToxSummary" in result
        assert "AcuteToxicity" not in result
        assert "RepeatedDose" not in result

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_no_tox_sections_found(self, mock_get):
        """Test when no toxicology sections are found."""
        mock_substance_response = Mock()
        mock_substance_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        mock_dossier_response = Mock()
        mock_dossier_response.json.return_value = {
            "items": [{
                "assetExternalId": "abc123",
                "rootKey": "key123",
                "lastUpdatedDate": "2024-01-15T10:30:00Z"
            }]
        }

        mock_index_response = Mock()
        mock_index_response.text = "<html><body>No toxicology sections</body></html>"

        mock_get.side_effect = [
            mock_substance_response,
            mock_dossier_response,
            mock_index_response
        ]

        result = search_dossier("50-00-0", input_type="rmlCas")

        assert result is not False
        assert "ToxSummary" not in result
        assert "AcuteToxicity" not in result
        assert "RepeatedDose" not in result
        # Basic info should still be present
        assert "rmlId" in result
        assert "index" in result

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_js_links_created(self, mock_get):
        """Test that both HTML and JS links are created."""
        mock_substance_response = Mock()
        mock_substance_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        mock_dossier_response = Mock()
        mock_dossier_response.json.return_value = {
            "items": [{
                "assetExternalId": "abc123",
                "rootKey": "key123",
                "lastUpdatedDate": "2024-01-15T10:30:00Z"
            }]
        }

        mock_index_response = Mock()
        mock_index_response.text = """
        <html>
        <div id="id_7_Toxicologicalinformation">
            <a href="tox_summary_001"></a>
        </div>
        <div id="id_72_AcuteToxicity">
            <a href="acute_tox_001"></a>
        </div>
        </html>
        """

        mock_get.side_effect = [
            mock_substance_response,
            mock_dossier_response,
            mock_index_response
        ]

        result = search_dossier("50-00-0", input_type="rmlCas")

        assert result is not False
        assert "ToxSummary" in result
        assert "ToxSummary_js" in result
        assert "AcuteToxicity" in result
        assert "AcuteToxicity_js" in result
        assert "index" in result
        assert "index_js" in result


class TestSearchDossierURLConstruction:
    """Test URL construction for various endpoints."""

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_search_response_url(self, mock_get):
        """Test that the search_response URL is correctly constructed."""
        mock_substance_response = Mock()
        mock_substance_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        mock_dossier_response = Mock()
        mock_dossier_response.json.return_value = {
            "items": [{
                "assetExternalId": "abc123",
                "rootKey": "key123",
                "lastUpdatedDate": "2024-01-15T10:30:00Z"
            }]
        }

        mock_index_response = Mock()
        mock_index_response.text = "<html></html>"

        mock_get.side_effect = [
            mock_substance_response,
            mock_dossier_response,
            mock_index_response
        ]

        result = search_dossier("50-00-0", input_type="rmlCas")

        assert result is not False
        assert "search_response" in result
        assert "50-00-0" in result["search_response"]
        assert "https://chem.echa.europa.eu/api-substance/v1/substance" in result["search_response"]

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_url_encoding(self, mock_get):
        """Test that special characters in substance names are URL-encoded."""
        substance_name = "test substance with spaces"

        mock_substance_response = Mock()
        mock_substance_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": substance_name,
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        mock_dossier_response = Mock()
        mock_dossier_response.json.return_value = {
            "items": [{
                "assetExternalId": "abc123",
                "rootKey": "key123",
                "lastUpdatedDate": "2024-01-15T10:30:00Z"
            }]
        }

        mock_index_response = Mock()
        mock_index_response.text = "<html></html>"

        mock_get.side_effect = [
            mock_substance_response,
            mock_dossier_response,
            mock_index_response
        ]

        result = search_dossier(substance_name, input_type="rmlName")

        assert result is not False
        assert "search_response" in result
        # Spaces should be encoded
        assert "%20" in result["search_response"] or "+" in result["search_response"]

||||
class TestIntegration:
|
||||
"""Integration tests with real API (marked as integration)."""
|
||||
|
||||
@pytest.mark.integration
|
||||
def test_real_formaldehyde_search(self):
|
||||
"""Test real API call for formaldehyde (requires internet)."""
|
||||
result = search_dossier("50-00-0", input_type="rmlCas")
|
||||
|
||||
if result and isinstance(result, dict):
|
||||
# Real API call succeeded
|
||||
assert "rmlId" in result
|
||||
assert "rmlName" in result
|
||||
assert "rmlCas" in result
|
||||
assert result["rmlCas"] == "50-00-0"
|
||||
assert "index" in result
|
||||
assert "dossierType" in result
|
||||
|
||||
@pytest.mark.integration
|
||||
def test_real_water_search(self):
|
||||
"""Test real API call for water by CAS (requires internet)."""
|
||||
result = search_dossier("7732-18-5", input_type="rmlCas")
|
||||
|
||||
if result and isinstance(result, dict):
|
||||
assert "rmlCas" in result
|
||||
assert result["rmlCas"] == "7732-18-5"
|
||||
|
||||
@pytest.mark.integration
|
||||
def test_real_nonexistent_substance(self):
|
||||
"""Test real API call for non-existent substance (requires internet)."""
|
||||
result = search_dossier("999-99-9", input_type="rmlCas")
|
||||
|
||||
# Should return False for non-existent substance
|
||||
assert result is False or isinstance(result, str)
|
||||
|
||||
@pytest.mark.integration
|
||||
def test_real_glycerin_search(self):
|
||||
"""Test real API call for glycerin (requires internet)."""
|
||||
result = search_dossier("56-81-5", input_type="rmlCas")
|
||||
|
||||
if result and isinstance(result, dict):
|
||||
assert "rmlCas" in result
|
||||
assert result["rmlCas"] == "56-81-5"
|
||||
assert "rmlId" in result
|
||||
assert "dossierType" in result
|
||||
|
||||
@pytest.mark.integration
|
||||
def test_real_niacinamide_search(self):
|
||||
"""Test real API call for niacinamide (requires internet)."""
|
||||
result = search_dossier("98-92-0", input_type="rmlCas")
|
||||
|
||||
if result and isinstance(result, dict):
|
||||
assert "rmlCas" in result
|
||||
assert result["rmlCas"] == "98-92-0"
|
||||
|
||||
@pytest.mark.integration
|
||||
def test_real_retinol_search(self):
|
||||
"""Test real API call for retinol (requires internet)."""
|
||||
result = search_dossier("68-26-8", input_type="rmlCas")
|
||||
|
||||
if result and isinstance(result, dict):
|
||||
assert "rmlCas" in result
|
||||
assert result["rmlCas"] == "68-26-8"
|
||||
|
||||
@pytest.mark.integration
|
||||
def test_real_caffeine_search(self):
|
||||
"""Test real API call for caffeine (requires internet)."""
|
||||
result = search_dossier("58-08-2", input_type="rmlCas")
|
||||
|
||||
if result and isinstance(result, dict):
|
||||
assert "rmlCas" in result
|
||||
assert result["rmlCas"] == "58-08-2"
|
||||
|
||||
@pytest.mark.integration
|
||||
def test_real_salicylic_acid_search(self):
|
||||
"""Test real API call for salicylic acid (requires internet)."""
|
||||
result = search_dossier("69-72-7", input_type="rmlCas")
|
||||
|
||||
if result and isinstance(result, dict):
|
||||
assert "rmlCas" in result
|
||||
assert result["rmlCas"] == "69-72-7"
|
||||
|
||||
@pytest.mark.integration
|
||||
def test_real_titanium_dioxide_search(self):
|
||||
"""Test real API call for titanium dioxide (requires internet)."""
|
||||
result = search_dossier("13463-67-7", input_type="rmlCas")
|
||||
|
||||
if result and isinstance(result, dict):
|
||||
assert "rmlCas" in result
|
||||
assert result["rmlCas"] == "13463-67-7"
|
||||
|
||||
@pytest.mark.integration
|
||||
def test_real_zinc_oxide_search(self):
|
||||
"""Test real API call for zinc oxide (requires internet)."""
|
||||
result = search_dossier("1314-13-2", input_type="rmlCas")
|
||||
|
||||
if result and isinstance(result, dict):
|
||||
assert "rmlCas" in result
|
||||
assert result["rmlCas"] == "1314-13-2"
|
||||
|
||||
@pytest.mark.integration
|
||||
def test_multiple_cosmetic_ingredients(self, sample_cas_numbers):
|
||||
"""Test real API calls for multiple cosmetic ingredients (requires internet)."""
|
||||
# Test a subset of common cosmetic ingredients
|
||||
test_ingredients = [
|
||||
("water", "7732-18-5"),
|
||||
("glycerin", "56-81-5"),
|
||||
("propylene_glycol", "57-55-6"),
|
||||
]
|
||||
|
||||
for name, cas in test_ingredients:
|
||||
result = search_dossier(cas, input_type="rmlCas")
|
||||
if result and isinstance(result, dict):
|
||||
assert result["rmlCas"] == cas
|
||||
assert "rmlId" in result
|
||||
# Give the API some time between requests
|
||||
import time
|
||||
time.sleep(0.5)
|
||||
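The `@pytest.mark.integration` marker used throughout these tests is not built into pytest; it has to be registered so that `pytest -m integration` (or `-m "not integration"` to skip the network-bound tests) can select on it without warnings. A minimal sketch, assuming this repo keeps its pytest settings in `pyproject.toml` alongside the uv configuration:

```toml
# Hypothetical pytest configuration; the marker name is taken from the tests
# above, the file location and surrounding keys are assumptions.
[tool.pytest.ini_options]
markers = [
    "integration: tests that call the real ECHA API (require internet)",
]
```

With this in place, CI can run `pytest -m "not integration"` for fast, fully mocked runs and reserve `-m integration` for scheduled jobs.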
153	utils/README.md

@@ -1,153 +0,0 @@
# PIF Compiler - MongoDB Docker Setup

## Quick Start

Start MongoDB and Mongo Express web interface:

```bash
cd utils
docker-compose up -d
```

Stop the services:

```bash
docker-compose down
```

Stop and remove all data:

```bash
docker-compose down -v
```

## Services

### MongoDB
- **Port**: 27017
- **Database**: toxinfo
- **Username**: admin
- **Password**: admin123
- **Connection String**: `mongodb://admin:admin123@localhost:27017/toxinfo?authSource=admin`

### Mongo Express (Web UI)
- **URL**: http://localhost:8082
- **Username**: admin
- **Password**: admin123

## Usage in Python

Update your MongoDB connection in `src/pif_compiler/functions/mongo_functions.py`:

```python
# For local development with Docker
db = connect(user="admin", password="admin123", database="toxinfo")
```

Or use the connection URI directly:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://admin:admin123@localhost:27017/toxinfo?authSource=admin")
db = client['toxinfo']
```

## Data Persistence

Data is persisted in Docker volumes:
- `mongodb_data` - Database files
- `mongodb_config` - Configuration files

These volumes persist even when containers are stopped.

## Creating Application User

It's recommended to create a dedicated user for your application instead of using the admin account.

### Option 1: Using mongosh (MongoDB Shell)

```bash
# Access MongoDB shell
docker exec -it pif_mongodb mongosh -u admin -p admin123 --authenticationDatabase admin

# In the MongoDB shell, run:
use toxinfo

db.createUser({
  user: "pif_app",
  pwd: "pif_app_password",
  roles: [
    { role: "readWrite", db: "toxinfo" }
  ]
})

# Exit the shell
exit
```

### Option 2: Using Mongo Express Web UI

1. Go to http://localhost:8082
2. Login with admin/admin123
3. Select `toxinfo` database
4. Click on "Users" tab
5. Add new user with `readWrite` role

### Option 3: Using Python Script

Create a file `utils/create_user.py`:

```python
from pymongo import MongoClient

# Connect as admin
client = MongoClient("mongodb://admin:admin123@localhost:27017/?authSource=admin")
db = client['toxinfo']

# Create application user
db.command("createUser", "pif_app",
           pwd="pif_app_password",
           roles=[{"role": "readWrite", "db": "toxinfo"}])

print("User 'pif_app' created successfully!")
client.close()
```

Run it:

```bash
cd utils
uv run python create_user.py
```

### Update Your Application

After creating the user, update your connection in `src/pif_compiler/functions/mongo_functions.py`:

```python
# Use application user instead of admin
db = connect(user="pif_app", password="pif_app_password", database="toxinfo")
```

Or with connection URI:

```python
client = MongoClient("mongodb://pif_app:pif_app_password@localhost:27017/toxinfo?authSource=toxinfo")
```

### Available Roles

- `read`: Read-only access to the database
- `readWrite`: Read and write access (recommended for your app)
- `dbAdmin`: Database administration (create indexes, etc.)
- `userAdmin`: Manage users and roles

## Security Notes

⚠️ **WARNING**: The default credentials are for local development only.

For production:
1. Change all passwords in `docker-compose.yml`
2. Use environment variables or secrets management
3. Create dedicated users with minimal required permissions
4. Configure firewall rules
5. Enable SSL/TLS connections
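The security notes above recommend environment variables over hard-coded credentials. A minimal sketch of building the connection string from the environment; the variable names (`MONGO_USER`, `MONGO_PASSWORD`, etc.) are illustrative, not part of this repo:

```python
import os

def mongo_uri_from_env() -> str:
    """Build a MongoDB URI from environment variables instead of hard-coding
    credentials. Required: MONGO_USER, MONGO_PASSWORD. Optional with defaults:
    MONGO_HOST, MONGO_PORT, MONGO_DB (defaults match the local Docker setup)."""
    user = os.environ["MONGO_USER"]
    password = os.environ["MONGO_PASSWORD"]
    host = os.environ.get("MONGO_HOST", "localhost")
    port = os.environ.get("MONGO_PORT", "27017")
    database = os.environ.get("MONGO_DB", "toxinfo")
    # Authenticate against the application database, as in the examples above
    return f"mongodb://{user}:{password}@{host}:{port}/{database}?authSource={database}"
```

The URI can then be passed straight to `MongoClient(mongo_uri_from_env())`, keeping secrets out of version control.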

@@ -1,140 +0,0 @@
#!/usr/bin/env python3
"""
Change Log Manager
Manages a change.log file with external and internal changes
"""

import os
from datetime import datetime
from enum import Enum


class ChangeType(Enum):
    EXTERNAL = "EXTERNAL"
    INTERNAL = "INTERNAL"


class ChangeLogManager:
    def __init__(self, log_file="change.log"):
        self.log_file = log_file
        self._ensure_log_exists()

    def _ensure_log_exists(self):
        """Create the log file if it doesn't exist"""
        if not os.path.exists(self.log_file):
            with open(self.log_file, 'w') as f:
                f.write("# Change Log\n")
                f.write(f"# Created: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")

    def add_change(self, change_type, description):
        """
        Add a change entry to the log

        Args:
            change_type (ChangeType): Type of change (EXTERNAL or INTERNAL)
            description (str): Description of the change
        """
        timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        entry = f"[{timestamp}] [{change_type.value}] {description}\n"

        with open(self.log_file, 'a') as f:
            f.write(entry)

        print(f"✓ Change added: {change_type.value} - {description}")

    def view_log(self, filter_type=None):
        """
        View the change log with optional filtering

        Args:
            filter_type (ChangeType, optional): Filter by change type
        """
        if not os.path.exists(self.log_file):
            print("No change log found.")
            return

        with open(self.log_file, 'r') as f:
            lines = f.readlines()

        print("\n" + "="*70)
        print("CHANGE LOG")
        print("="*70 + "\n")

        for line in lines:
            if filter_type and f"[{filter_type.value}]" not in line:
                continue
            print(line, end='')

        print("\n" + "="*70 + "\n")

    def get_statistics(self):
        """Display statistics about the change log"""
        if not os.path.exists(self.log_file):
            print("No change log found.")
            return

        with open(self.log_file, 'r') as f:
            lines = f.readlines()

        external_count = sum(1 for line in lines if "[EXTERNAL]" in line)
        internal_count = sum(1 for line in lines if "[INTERNAL]" in line)
        total = external_count + internal_count

        print("\n" + "="*40)
        print("CHANGE LOG STATISTICS")
        print("="*40)
        print(f"Total changes: {total}")
        print(f"External changes: {external_count}")
        print(f"Internal changes: {internal_count}")
        print("="*40 + "\n")


def main():
    manager = ChangeLogManager()

    while True:
        print("\n📝 Change Log Manager")
        print("1. Add External Change")
        print("2. Add Internal Change")
        print("3. View All Changes")
        print("4. View External Changes Only")
        print("5. View Internal Changes Only")
        print("6. Show Statistics")
        print("7. Exit")

        choice = input("\nSelect an option (1-7): ").strip()

        if choice == '1':
            description = input("Enter external change description: ").strip()
            if description:
                manager.add_change(ChangeType.EXTERNAL, description)
            else:
                print("Description cannot be empty.")

        elif choice == '2':
            description = input("Enter internal change description: ").strip()
            if description:
                manager.add_change(ChangeType.INTERNAL, description)
            else:
                print("Description cannot be empty.")

        elif choice == '3':
            manager.view_log()

        elif choice == '4':
            manager.view_log(filter_type=ChangeType.EXTERNAL)

        elif choice == '5':
            manager.view_log(filter_type=ChangeType.INTERNAL)

        elif choice == '6':
            manager.get_statistics()

        elif choice == '7':
            print("Goodbye!")
            break

        else:
            print("Invalid option. Please select 1-7.")


if __name__ == "__main__":
    main()

@@ -1,86 +0,0 @@
"""
|
||||
Create MongoDB application user for PIF Compiler.
|
||||
|
||||
This script creates a dedicated user with readWrite permissions
|
||||
on the toxinfo database instead of using the admin account.
|
||||
"""
|
||||
|
||||
from pymongo import MongoClient
|
||||
from pymongo.errors import DuplicateKeyError, OperationFailure
|
||||
import sys
|
||||
|
||||
|
||||
def create_app_user():
|
||||
"""Create application user for toxinfo database."""
|
||||
|
||||
# Configuration
|
||||
ADMIN_USER = "admin"
|
||||
ADMIN_PASSWORD = "admin123"
|
||||
MONGO_HOST = "localhost"
|
||||
MONGO_PORT = 27017
|
||||
|
||||
APP_USER = "pif_app"
|
||||
APP_PASSWORD = "marox123"
|
||||
APP_DATABASE = "pif-projects"
|
||||
|
||||
print(f"Connecting to MongoDB as admin...")
|
||||
|
||||
try:
|
||||
# Connect as admin
|
||||
client = MongoClient(
|
||||
f"mongodb://{ADMIN_USER}:{ADMIN_PASSWORD}@{MONGO_HOST}:{MONGO_PORT}/?authSource=admin",
|
||||
serverSelectionTimeoutMS=5000
|
||||
)
|
||||
|
||||
# Test connection
|
||||
client.admin.command('ping')
|
||||
print("✓ Connected to MongoDB successfully")
|
||||
|
||||
# Switch to application database
|
||||
db = client[APP_DATABASE]
|
||||
|
||||
# Create application user
|
||||
print(f"\nCreating user '{APP_USER}' with readWrite permissions on '{APP_DATABASE}'...")
|
||||
|
||||
db.command(
|
||||
"createUser",
|
||||
APP_USER,
|
||||
pwd=APP_PASSWORD,
|
||||
roles=[{"role": "readWrite", "db": APP_DATABASE}]
|
||||
)
|
||||
|
||||
print(f"✓ User '{APP_USER}' created successfully!")
|
||||
print(f"\nConnection details:")
|
||||
print(f" Username: {APP_USER}")
|
||||
print(f" Password: {APP_PASSWORD}")
|
||||
print(f" Database: {APP_DATABASE}")
|
||||
print(f" Connection String: mongodb://{APP_USER}:{APP_PASSWORD}@{MONGO_HOST}:{MONGO_PORT}/{APP_DATABASE}?authSource={APP_DATABASE}")
|
||||
|
||||
print(f"\nUpdate your mongo_functions.py with:")
|
||||
print(f" db = connect(user='{APP_USER}', password='{APP_PASSWORD}', database='{APP_DATABASE}')")
|
||||
|
||||
client.close()
|
||||
return 0
|
||||
|
||||
except DuplicateKeyError:
|
||||
print(f"⚠ User '{APP_USER}' already exists!")
|
||||
print(f"\nTo delete and recreate, run:")
|
||||
print(f" docker exec -it pif_mongodb mongosh -u admin -p admin123 --authenticationDatabase admin")
|
||||
print(f" use {APP_DATABASE}")
|
||||
print(f" db.dropUser('{APP_USER}')")
|
||||
return 1
|
||||
|
||||
except OperationFailure as e:
|
||||
print(f"✗ MongoDB operation failed: {e}")
|
||||
return 1
|
||||
|
||||
except Exception as e:
|
||||
print(f"✗ Error: {e}")
|
||||
print("\nMake sure MongoDB is running:")
|
||||
print(" cd utils")
|
||||
print(" docker-compose up -d")
|
||||
return 1
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(create_app_user())
|
||||
|

@@ -1,28 +0,0 @@
version: '3.8'

services:
  mongodb:
    image: mongo:latest
    container_name: personal_mongodb
    restart: unless-stopped
    environment:
      MONGO_INITDB_ROOT_USERNAME: admin
      MONGO_INITDB_ROOT_PASSWORD: bello98A.
      MONGO_INITDB_DATABASE: toxinfo
    ports:
      - "27017:27017"
    volumes:
      - mongodb_data:/data/db
      - mongodb_config:/data/configdb
    networks:
      - personal_network

volumes:
  mongodb_data:
    driver: local
  mongodb_config:
    driver: local

networks:
  personal_network:
    driver: bridge
13	uv.lock
@@ -966,6 +966,7 @@ dependencies = [
    { name = "markdown-to-json" },
    { name = "markdownify" },
    { name = "playwright" },
    { name = "psycopg2" },
    { name = "pubchemprops" },
    { name = "pubchempy" },
    { name = "pydantic" },

@@ -990,6 +991,7 @@ requires-dist = [
    { name = "markdown-to-json", specifier = ">=2.1.2" },
    { name = "markdownify", specifier = ">=1.2.0" },
    { name = "playwright", specifier = ">=1.55.0" },
    { name = "psycopg2", specifier = ">=2.9.11" },
    { name = "pubchemprops", specifier = ">=0.1.1" },
    { name = "pubchempy", specifier = ">=1.0.5" },
    { name = "pydantic", specifier = ">=2.11.10" },

@@ -1128,6 +1130,17 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/26/65/1070a6e3c036f39142c2820c4b52e9243246fcfc3f96239ac84472ba361e/psutil-7.1.0-cp37-abi3-win_arm64.whl", hash = "sha256:6937cb68133e7c97b6cc9649a570c9a18ba0efebed46d8c5dae4c07fa1b67a07", size = 244971 },
]

[[package]]
name = "psycopg2"
version = "2.9.11"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/89/8d/9d12bc8677c24dad342ec777529bce705b3e785fa05d85122b5502b9ab55/psycopg2-2.9.11.tar.gz", hash = "sha256:964d31caf728e217c697ff77ea69c2ba0865fa41ec20bb00f0977e62fdcc52e3", size = 379598 }
wheels = [
    { url = "https://files.pythonhosted.org/packages/b5/bf/635fbe5dd10ed200afbbfbe98f8602829252ca1cce81cc48fb25ed8dadc0/psycopg2-2.9.11-cp312-cp312-win_amd64.whl", hash = "sha256:e03e4a6dbe87ff81540b434f2e5dc2bddad10296db5eea7bdc995bf5f4162938", size = 2713969 },
    { url = "https://files.pythonhosted.org/packages/88/5a/18c8cb13fc6908dc41a483d2c14d927a7a3f29883748747e8cb625da6587/psycopg2-2.9.11-cp313-cp313-win_amd64.whl", hash = "sha256:8dc379166b5b7d5ea66dcebf433011dfc51a7bb8a5fc12367fa05668e5fc53c8", size = 2714048 },
    { url = "https://files.pythonhosted.org/packages/47/08/737aa39c78d705a7ce58248d00eeba0e9fc36be488f9b672b88736fbb1f7/psycopg2-2.9.11-cp314-cp314-win_amd64.whl", hash = "sha256:f10a48acba5fe6e312b891f290b4d2ca595fc9a06850fe53320beac353575578", size = 2803738 },
]

[[package]]
name = "pubchemprops"
version = "0.1.1"