first commit: checkpoint for multi-device collab
This commit is contained in:
parent
356fc0d7aa
commit
497dba7aab
58 changed files with 195067 additions and 2 deletions
1
.python-version
Normal file
@@ -0,0 +1 @@
3.12
@@ -1,2 +0,0 @@
# cosmoguard_backend
Backend for the 'CosmoGuard' pif compiler
211
REFACTORING.md
Normal file
@@ -0,0 +1,211 @@
# Refactoring Summary

## Completed: Phase 1 & 2

### Phase 1: Critical Bug Fixes ✅

**Fixed Issues:**

1. **[base_classes.py](src/pif_compiler/classes/models.py)** (now renamed to `models.py`)
   - Fixed missing closing parenthesis in `StringConstraints` annotation (line 24)
   - Renamed the file to `models.py` for clarity

2. **[pif_class.py](src/pif_compiler/classes/pif_class.py)**
   - Removed unnecessary `streamlit` import
   - Fixed duplicate `NormalUser` import conflict
   - Fixed type annotations for optional fields (lines 33-36)
   - Removed unused imports

3. **[classes/__init__.py](src/pif_compiler/classes/__init__.py)**
   - Created proper module exports
   - Added a docstring
   - Listed all available models and enums
### Phase 2: Code Organization ✅

**New Structure:**

```
src/pif_compiler/
├── classes/                  # Data Models
│   ├── __init__.py           # ✨ NEW: Proper exports
│   ├── models.py             # ✨ RENAMED from base_classes.py
│   ├── pif_class.py          # ✅ FIXED: Import conflicts
│   └── types_enum.py
│
├── services/                 # ✨ NEW: Business Logic Layer
│   ├── __init__.py           # Service exports
│   ├── echa_service.py       # ECHA API (merged from find.py)
│   ├── echa_parser.py        # HTML/Markdown/JSON parsing
│   ├── echa_extractor.py     # High-level extraction
│   ├── cosing_service.py     # COSING integration
│   ├── pubchem_service.py    # PubChem integration
│   └── database_service.py   # MongoDB operations
│
└── functions/                # Utilities & Legacy
    ├── _old/                 # 🗄️ Deprecated files (moved here)
    │   ├── echaFind.py       # → Merged into echa_service.py
    │   ├── find.py           # → Merged into echa_service.py
    │   ├── echaProcess.py    # → Split into echa_parser + echa_extractor
    │   ├── scraper_cosing.py # → Copied to cosing_service.py
    │   ├── pubchem.py        # → Copied to pubchem_service.py
    │   └── mongo_functions.py # → Copied to database_service.py
    ├── html_to_pdf.py        # PDF generation utilities
    ├── pdf_extraction.py     # PDF processing utilities
    └── resources/            # Static resources (logos, templates)
```

---

## Key Improvements

### 1. **Separation of Concerns**

- **Models** (`classes/`): Pure data structures with Pydantic validation
- **Services** (`services/`): Business logic and external API calls
- **Functions** (`functions/`): Legacy code, to be migrated gradually

### 2. **ECHA Module Consolidation**

Previously scattered across 3 files:

- `echaFind.py` (246 lines) - Old search implementation
- `find.py` (513 lines) - Better search with type hints
- `echaProcess.py` (947 lines) - Massive monolith

Now organized into 3 focused modules:

- `echa_service.py` (~513 lines) - API integration (from `find.py`)
- `echa_parser.py` (~250 lines) - Data parsing/cleaning
- `echa_extractor.py` (~350 lines) - High-level extraction logic

### 3. **Better Logging**

- Changed from module-level `logging.basicConfig()` to proper logger instances
- Each service has its own logger: `logger = logging.getLogger(__name__)`
- Prevents logging configuration conflicts
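The per-module logger pattern looks like this (the service function below is a placeholder, not a function from the codebase):

```python
import logging

# One logger per module, instead of calling logging.basicConfig() at import time;
# handler/level configuration is left to the application entry point
logger = logging.getLogger(__name__)

def fetch_dossier(cas: str) -> None:
    # Placeholder service function, shown only to illustrate the pattern
    logger.info("Searching ECHA for CAS %s", cas)
```

Because each module owns its logger, an application can configure handlers once at startup without modules fighting over the root logger's configuration.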
### 4. **Improved Imports**

Services can now be imported cleanly:

```python
# Old way
from src.func.echaFind import search_dossier
from src.func.echaProcess import echaExtract

# New way
from pif_compiler.services import search_dossier, echa_extract
```

---

## Migration Guide

### For Code Using Old Imports

**ECHA Functions:**

```python
# Before
from src.func.find import search_dossier
from src.func.echaProcess import echaExtract, echaPage_to_md, clean_json

# After
from pif_compiler.services import (
    search_dossier,
    echa_extract,
    echa_page_to_markdown,
    clean_json,
)
```

**Data Models:**

```python
# Before
from classes import Ingredient, PIF
from base_classes import ExpositionInfo

# After
from pif_compiler.classes import Ingredient, PIF, ExpositionInfo
```

**COSING/PubChem:**

```python
# Before
from functions.scraper_cosing import cosing_search
from functions.pubchem import pubchem_dap

# After (when ready)
from pif_compiler.services.cosing_service import cosing_search
from pif_compiler.services.pubchem_service import pubchem_dap
```

---

## Next Steps (Phase 3 - Not Done Yet)

### Configuration Management

- [ ] Create `config.py` for MongoDB credentials and API keys
- [ ] Use environment variables (`.env` file)
- [ ] Separate dev/prod configurations
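A minimal sketch of what such a `config.py` could look like; the variable names are assumptions, not the project's actual settings:

```python
import os

class Config:
    """Reads settings from environment variables (e.g. loaded from a .env file)."""
    MONGO_USER: str = os.environ.get("MONGO_USER", "")
    MONGO_PASSWORD: str = os.environ.get("MONGO_PASSWORD", "")
    # Network timeout for ECHA requests, in seconds
    ECHA_TIMEOUT: int = int(os.environ.get("ECHA_TIMEOUT", "30"))
```

A library such as `python-dotenv` can populate the environment from a `.env` file before this module is imported.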
### Testing

- [ ] Add a pytest setup
- [ ] Unit tests for models (Pydantic validation)
- [ ] Integration tests for services
- [ ] Mock external API calls
### Streamlit App

- [ ] Create `app.py` entry point
- [ ] Organize UI components
- [ ] Connect to the services layer

### Database

- [ ] Document the MongoDB schema
- [ ] Add migration scripts
- [ ] Consider adding SQLAlchemy for a relational DB

### Documentation

- [ ] API documentation (docstrings → Sphinx)
- [ ] User guide for the PIF creation workflow
- [ ] Developer setup guide

---

## Files Changed

### Modified:

- `src/pif_compiler/classes/models.py` (renamed, fixed)
- `src/pif_compiler/classes/pif_class.py` (fixed imports/types)
- `src/pif_compiler/classes/__init__.py` (new exports)

### Created:

- `src/pif_compiler/services/__init__.py`
- `src/pif_compiler/services/echa_service.py`
- `src/pif_compiler/services/echa_parser.py`
- `src/pif_compiler/services/echa_extractor.py`
- `src/pif_compiler/services/cosing_service.py`
- `src/pif_compiler/services/pubchem_service.py`
- `src/pif_compiler/services/database_service.py`

### Moved to Archive:

- `src/pif_compiler/functions/_old/echaFind.py` (merged into `echa_service.py`)
- `src/pif_compiler/functions/_old/find.py` (merged into `echa_service.py`)
- `src/pif_compiler/functions/_old/echaProcess.py` (split into `echa_parser` + `echa_extractor`)
- `src/pif_compiler/functions/_old/scraper_cosing.py` (copied to `cosing_service.py`)
- `src/pif_compiler/functions/_old/pubchem.py` (copied to `pubchem_service.py`)
- `src/pif_compiler/functions/_old/mongo_functions.py` (copied to `database_service.py`)

### Kept (Active):

- `src/pif_compiler/functions/html_to_pdf.py` (PDF utilities)
- `src/pif_compiler/functions/pdf_extraction.py` (PDF utilities)
- `src/pif_compiler/functions/resources/` (static files)

---

## Benefits

✅ **Cleaner imports** - No more relative-path confusion
✅ **Better testing** - Services can be mocked easily
✅ **Easier debugging** - Smaller, focused modules
✅ **Type safety** - Proper type hints throughout
✅ **Maintainability** - Clear separation of concerns
✅ **Backward compatible** - Old code still works

---

**Date:** 2025-01-04
**Status:** Phase 1 & 2 Complete ✅
194
claude.md
Normal file
@@ -0,0 +1,194 @@
# PIF Compiler - Project Context

## Overview

Application to generate **Product Information Files (PIF)** for cosmetic products. A PIF is a regulatory document required for the safety assessment of cosmetics.

## Development Environment

- **Platform**: Windows
- **Package Manager**: [uv](https://github.com/astral-sh/uv) - fast Python package installer and resolver
- **Python Version**: 3.12+

## Tech Stack

- **Backend**: Python 3.12+
- **Frontend**: Streamlit
- **Database**: MongoDB (primary), potential relational DB (not yet implemented)
- **Package Manager**: uv
- **Build System**: hatchling

## Common Commands

```bash
# Install dependencies
uv sync

# Add a new dependency
uv add <package-name>

# Run the application
uv run pif-compiler

# Activate the virtual environment (if needed)
.venv\Scripts\activate  # Windows
```

## Project Structure

```
pif_compiler/
├── src/pif_compiler/
│   ├── classes/                 # Data models & type definitions
│   │   ├── pif_class.py         # Main PIF data model
│   │   ├── classes.py           # Supporting classes (Ingredient, ExpositionInfo, SedTable, ProdCompany)
│   │   └── types_enum.py        # Enums for cosmetic types, physical forms, exposure routes
│   │
│   └── functions/               # Core functionality modules
│       ├── scraper_cosing.py    # COSING database scraper (EU cosmetic ingredients)
│       ├── mongo_functions.py   # MongoDB connection & queries
│       ├── html_to_pdf.py       # PDF generation with Playwright
│       ├── echaFind.py          # ECHA dossier search
│       ├── echaProcess.py       # ECHA data extraction & processing
│       ├── pubchem.py           # PubChem API for chemical properties
│       ├── find.py              # Unified search interface (QUACKO/ECHA)
│       └── pdf_extraction.py    # PDF processing utilities
│
└── data/
    ├── pif_schema.json          # JSON schema for the PIF structure
    └── input.json               # Example input data format
```

## Core Functionality

### 1. Data Models ([classes/](src/pif_compiler/classes/))

#### PIF Class ([pif_class.py](src/pif_compiler/classes/pif_class.py:10))

Main data model containing:

- Product information (name, type, CNCP, company)
- Ingredient list with quantities
- Exposure information
- Safety evaluation data (SED table, warnings)
- Metadata (creation date)
#### Supporting Classes ([classes.py](src/pif_compiler/classes/classes.py))

- **Ingredient**: INCI name, CAS number, quantity, toxicity values (SED, NOAEL, MOS), PubChem data
- **ExpositionInfo**: Application details, exposure routes, calculated daily exposure
- **SedTable**: Safety evaluation data table
- **ProdCompany**: Production company information

#### Type Enumerations ([types_enum.py](src/pif_compiler/classes/types_enum.py))

Bilingual (EN/IT) enums for:

- **CosmeticType**: 100+ product types (foundations, lipsticks, skincare, etc.)
- **PhysicalForm**: Liquid, semi-solid, solid, aerosol, hybrid forms
- **NormalUser**: Adult/Child
- **PlaceApplication**: Face, etc.
- **RoutesExposure**: Dermal, Ocular, Oral
- **NanoRoutes**: Same as above, for nanomaterials

### 2. External Data Sources

#### COSING Database ([scraper_cosing.py](src/pif_compiler/functions/scraper_cosing.py))

EU cosmetic ingredients database

- Search by INCI name, CAS number, or EC number
- Extract: substance ID, CAS/EC numbers, restrictions, SCCS opinions
- Handle "identified ingredients" recursively
- Functions: `cosing_search()`, `clean_cosing()`, `parse_cas_numbers()`

#### ECHA Database ([echaFind.py](src/pif_compiler/functions/echaFind.py), [echaProcess.py](src/pif_compiler/functions/echaProcess.py))

European Chemicals Agency dossiers

- **Search**: Find dossiers by CAS/substance name ([echaFind.py:44](src/pif_compiler/functions/echaFind.py:44))
- **Extract**: Toxicity data (NOAEL, LD50) from HTML pages
- **Process**: Convert HTML → Markdown → JSON → DataFrame
- **Scraping Types**: RepeatedDose (NOAEL), AcuteToxicity (LD50)
- **Local caching**: In-memory DuckDB for scraped data
- Functions: `search_dossier()`, `echaExtract()`, `echa_noael_ld50()`

#### PubChem ([pubchem.py](src/pif_compiler/functions/pubchem.py))

Chemical properties for DAP calculation

- Properties: LogP, molecular weight, TPSA, melting point, pH, dissociation constants
- Uses `pubchempy` + custom certificate handling
- Function: `pubchem_dap(cas)`

#### QUACKO/Find Module ([find.py](src/pif_compiler/functions/find.py))

Unified search interface for ECHA

- Search by CAS, EC, or substance name
- Extract multiple sections (ToxSummary, AcuteToxicity, RepeatedDose, GeneticToxicity, physical properties)
- Support for local HTML files
- Functions: `search_dossier()`, `get_section_links_from_index()`

### 3. Database Layer

#### MongoDB ([mongo_functions.py](src/pif_compiler/functions/mongo_functions.py))

- Database: `toxinfo`
- Collection: `toxinfo` (ingredient data from COSING/ECHA)
- Functions:
  - `connect(user, password, database)` - MongoDB Atlas connection
  - `value_search(db, value, mode)` - Search by INCI, CAS, EC, chemical name
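A self-contained sketch of what a `value_search`-style helper does; the real function queries MongoDB, and the field names here (taken from the scraping-log columns) are assumptions:

```python
def value_search_sketch(documents, value, mode="CAS"):
    # Map the search mode to the (assumed) document field it matches against;
    # a list of dicts stands in for the MongoDB collection
    field = {"CAS": "casNo", "INCI": "inciName", "EC": "ecNo"}[mode]
    return [doc for doc in documents if doc.get(field) == value]

docs = [{"casNo": "56-81-5", "inciName": "GLYCERIN"}]
assert value_search_sketch(docs, "56-81-5")[0]["inciName"] == "GLYCERIN"
```

The real implementation would express the same lookup as a `find({field: value})` query against the `toxinfo` collection.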
### 4. PDF Generation ([html_to_pdf.py](src/pif_compiler/functions/html_to_pdf.py), [pdf_extraction.py](src/pif_compiler/functions/pdf_extraction.py))

- **Playwright-based**: Headless browser for HTML → PDF
- **Dynamic headers**: Inject substance info and ECHA logos
- **Cleanup**: Remove empty sections, fix page breaks
- **Batch processing**: `search_generate_pdfs()` for multiple pages
- Output: Structured folders by CAS/EC/RML ID

## Data Flow

1. **Input**: Product formulation (INCI names, quantities)
2. **Enrichment**:
   - Search COSING for ingredient info
   - Query MongoDB for cached data
   - Fetch chemical properties from PubChem
   - Extract ECHA toxicity data (NOAEL/LD50)
3. **Calculation**:
   - SED (Systemic Exposure Dose)
   - MOS (Margin of Safety)
   - Daily exposure values
4. **Output**: PIF document (likely PDF/HTML format)
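The SED and MOS steps follow the standard SCCS-style formulas; this sketch uses illustrative parameter names and a hypothetical example, not the project's actual calculation code:

```python
def sed(relative_daily_exposure_mg_kg: float,
        concentration_pct: float,
        dermal_absorption_pct: float) -> float:
    """Systemic Exposure Dose in mg/kg bw/day."""
    return (relative_daily_exposure_mg_kg
            * concentration_pct / 100
            * dermal_absorption_pct / 100)

def mos(noael_mg_kg: float, sed_mg_kg: float) -> float:
    """Margin of Safety; >= 100 is the usual acceptance threshold."""
    return noael_mg_kg / sed_mg_kg

# e.g. an ingredient used at 6 % with an assumed 50 % dermal absorption
s = sed(123.20, 6.0, 50.0)   # 3.696 mg/kg bw/day
assert mos(1000.0, s) > 100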
## Key Dependencies

- `streamlit` - Frontend
- `pydantic` - Data validation
- `pymongo` - MongoDB client
- `requests` - HTTP requests
- `beautifulsoup4` - HTML parsing
- `playwright` - PDF generation
- `pubchempy` - PubChem API
- `pandas` - Data processing
- `duckdb` - Local caching

## Important Notes

### CAS Number Handling

- CAS numbers can contain special separators (`/`, `;`, `,`, `--`)
- The parser handles parenthetical info and invalid values
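A sketch of the separator handling described above; the real `parse_cas_numbers()` may differ in details:

```python
import re

def split_cas(raw: str) -> list[str]:
    # Drop parenthetical notes, then split on the separators '/', ';', ',' and '--'
    cleaned = re.sub(r"\([^)]*\)", "", raw)
    parts = re.split(r"\s*(?:--|[/;,])\s*", cleaned)
    return [p.strip() for p in parts if p.strip()]

assert split_cas("9007-16-3;9003-01-4") == ["9007-16-3", "9003-01-4"]
```

The `--` alternative is listed before the single-character class so a double hyphen is never mistaken for a hyphen inside a CAS number.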
### ECHA Scraping

- **Logging**: All operations are logged to `echa.log`
- **Dossier Status**: Active preferred, falls back to Inactive
- **Scraping Modes**:
  - `local_search=True`: Check the local cache first
  - `local_only=True`: Only use cached data
- **Multi-substance**: `echaExtract_multi()` for batch processing
- **Filtering**: Can filter by route (oral/dermal/inhalation) and dose descriptor

### Bilingual Support

- Enums support EN/IT via `TranslatedEnum.get_translation(lang)`
- Italian is the primary language in comments
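A minimal sketch of how a bilingual enum with `get_translation(lang)` could be built; the project's `TranslatedEnum` may be implemented differently:

```python
from enum import Enum

class TranslatedEnum(Enum):
    # Each member's value is an (english, italian) pair
    def get_translation(self, lang: str) -> str:
        en, it = self.value
        return it if lang == "it" else en

class NormalUser(TranslatedEnum):
    ADULT = ("Adult", "Adulto")
    CHILD = ("Child", "Bambino")

assert NormalUser.ADULT.get_translation("it") == "Adulto"
```

Tuple values keep the members hashable while letting every enum that subclasses `TranslatedEnum` inherit the same lookup.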
### Regulatory Context

- SCCS: Scientific Committee on Consumer Safety
- CNCP: Cosmetic Notification Portal
- NOAEL: No Observed Adverse Effect Level
- SED: Systemic Exposure Dose
- MOS: Margin of Safety
- DAP: Dermal Absorption Percentage

## TODO/Future Work

- Relational DB implementation (mentioned but not present)
- Streamlit UI (referenced, but the code is not in the current files)
- Main entry point (`pif-compiler` script in pyproject.toml)
- LLM approximation for exposure values (mentioned in [classes.py:55-60](src/pif_compiler/classes/classes.py:55))

## Development Notes

- The project appears to consolidate previously separate codebases
- Heavy use of external APIs (rate limiting may apply)
- Certificate handling is needed for the PubChem API
- MongoDB credentials are required for database access
0
data/__init__.py
Normal file
44613
data/clean_response_full.csv
Normal file
File diff suppressed because one or more lines are too long
31933
data/clean_responses_shrunk.csv
Normal file
File diff suppressed because it is too large
19558
data/echa-cosing-scraping-log.csv
Normal file
File diff suppressed because it is too large
21480
data/echa-reach-scraping-log.csv
Normal file
File diff suppressed because it is too large
32461
data/echa_full_scraping.csv
Normal file
File diff suppressed because one or more lines are too long
34053
data/exploded_cas_cosing.csv
Normal file
File diff suppressed because it is too large
5
data/input.json
Normal file
@@ -0,0 +1,5 @@
{
  "INCI": ["AQUA", "GLYCERIN", "HYDROXYETHYLCELLULOSE", "DISODIUM EDTA", "CARBOMER", "TREHALOSE", "BETAINE", "METHYLPARABEN", "TRIETHANOLAMINE", "PHENOXYETHANOL", "ETHYLHEXYLGLYCERIN", "PARFUM", "PEG-40 HYDROGENATED CASTOR OIL", "ASCORBIC ACID", "CI 15985"],
  "CAS": [null, "56-81-5", "9004-62-0", "139-33-3", "9007-16-3;9003-01-4;9007-17-4;76050-42-5;9062-04-8", "99-20-7", "107-43-7", "99-76-3", "102-71-6", "122-99-6", "70445-33-9", "JYY-807", "61788-85-0", "50-81-7;62624-30-0", "2783-94-0"],
  "percentage": [90.567, 6.0, 0.1, 0.03, 0.35, 1.0, 1.0, 0.2, 0.35, 0.3, 0.02, 0.005, 0.025, 0.05, 0.0002]
}
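The three arrays in `input.json` are parallel: one entry per ingredient (INCI name, CAS number or `null`, percentage). A shortened sanity-check sketch of that shape:

```python
# Truncated copy of the input format; the full file lists 15 ingredients
data = {
    "INCI": ["AQUA", "GLYCERIN", "METHYLPARABEN"],
    "CAS": [None, "56-81-5", "99-76-3"],
    "percentage": [90.567, 6.0, 0.2],
}

# One entry per ingredient in every array
assert len(data["INCI"]) == len(data["CAS"]) == len(data["percentage"])
# The percentages of a formulation must not exceed 100 in total
assert sum(data["percentage"]) <= 100
```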
57
data/jsonschema.json
Normal file
@@ -0,0 +1,57 @@
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "INCI": {
      "type": "array",
      "items": [
        { "type": "string" },
        { "type": "string" }
      ]
    },
    "CAS": {
      "type": "array",
      "items": [
        {
          "type": "array",
          "items": [
            { "type": "string" },
            { "type": "string" }
          ]
        },
        {
          "type": "array",
          "items": [
            { "type": "string" }
          ]
        }
      ]
    },
    "percentage": {
      "type": "array",
      "items": [
        { "type": "number" },
        { "type": "number" }
      ]
    }
  },
  "required": [
    "INCI",
    "CAS",
    "percentage"
  ]
}
28
data/log_readme.md
Normal file
@@ -0,0 +1,28 @@
# ECHA Scraping Log Readme

The log file is used during scraping to keep track of the extracted substances.

**Columns:**

- **casNo**: the CAS number of the substance.
- **substanceId**: the substance identifier in the COSING database.
- **inciName**: the INCI name of the substance.
- **scraping_AcuteToxicity**: status of scraping the *Acute Toxicity* page (LD50, LC50 values, etc.).
- **scraping_RepeatedDose**: status of scraping the *Repeated Dose* page (NOAEL, DNEL values, etc.).
- **timestamp**: when the record was logged.

**Possible values for scraping_AcuteToxicity and scraping_RepeatedDose:**

1. **no_lead_dossiers**: no active or inactive lead dossiers exist for the substance.
2. **successful_scrape**: data successfully extracted from the page.
3. **no_data_found**: a lead dossier was found, but the page does not exist or contains no data.
4. **error**: various kinds of errors.

---

I spent 20-30 minutes manually confirming the *no_data_found* and *no_lead_dossiers* results: I spot-checked that no dossiers existed, or that the pages really contained no data.

The first full scrape contained a bug, which I later fixed, allowing another 700 substances to be extracted. I do not know whether similar bugs remain.

---

At the moment there are **68 rows in the log with errors.** I am investigating, but in most cases they are errors caused by missing data on the pages.
In practice, many of these are simply *no_data_found* incorrectly marked as *error*.
38
data/pif_schema.json
Normal file
@@ -0,0 +1,38 @@
{
  "general_info": {
    "exec_date": "2021-07-01",
    "company": "Company Name",
    "product_name": "Product Name",
    "type": "pif",
    "ph_form": "physical state",
    "CPNP": "CPNP number",
    "prod_company": {"Company Name": "Company Name", "Address": "Company Address", "Country": "Country"}
  },

  "formula_table": "df_json",
  "normal_user": ["italiano", "english"],

  "exposition": {
    "type": "type",
    "place_application": "place",
    "routes_exposure": "routes",
    "secondary_routes": "secondary routes",
    "nano_exposure": "nano exposure",
    "surface_area": "surface area",
    "frequency": "frequency",
    "est_daily_amount": "est daily amount",
    "rel_daily_amount": "rel daily amount",
    "retention": 1,
    "calculated_daily_exp": "calculated daily exp",
    "calculated_relative_daily_exp": "calculated relative daily exp",
    "consumer_weight": "consumer weight",
    "target_population": "target population"
  },

  "sed_formula_table": "df_json",
  "sed_table": "df_json",
  "toxicity_table": "df_json",
  "undesired_effects": "no effects",
  "description": "description",
  "warnings": "warnings"
}
270
debug_echa_find.py
Normal file
@@ -0,0 +1,270 @@
import marimo


__generated_with = "0.16.5"
app = marimo.App(width="medium")


@app.cell
def _():
    import marimo as mo
    import urllib.parse
    import re as standardre
    import json
    from bs4 import BeautifulSoup
    import requests
    return BeautifulSoup, mo, requests, urllib


@app.cell
def _():
    from pif_compiler.services.common_log import get_logger

    log = get_logger()
    return (log,)


@app.cell
def _(log):
    log.info("testing with marimo")
    return


@app.cell
def _():
    cas_test = "100-41-4"
    return (cas_test,)


@app.cell
def _(cas_test, urllib):
    urllib.parse.quote(cas_test)
    return


@app.cell
def _():
    BASE_SEARCH = "https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText="
    BASE_DOSSIER_LIST = "https://chem.echa.europa.eu/api-dossier-list/v1/dossier?pageIndex=1&pageSize=100&rmlId="
    SUBSTANCE_SUMMARY = "https://chem.echa.europa.eu/api-substance/v1/substance/"  # + id
    CLASSIFICATION_ID = "https://chem.echa.europa.eu/api-cnl-inventory/prominent/overview/classifications/harmonised/459160"
    TOXICOLOGICAL_INFO = "https://chem.echa.europa.eu/html-pages-prod/e4c88c6e-06c7-4daa-b0fb-1a55459ac22f/documents/IUC5-5f55d8ec-7a71-4e2c-9955-8469ead9fe84_0035f3f8-7467-4944-9028-1db2e9c99565.html"  # external + rootkey
    REPEATED_DOSE = "https://chem.echa.europa.eu/html-pages-prod/e4c88c6e-06c7-4daa-b0fb-1a55459ac22f/documents/IUC5-82402b09-8d8f-495c-b673-95b205be60e0_0035f3f8-7467-4944-9028-1db2e9c99565.html"

    # "&reg" had been mangled into the "®" entity; these are plain query-string fragments
    active = "&registrationStatuses=Active"
    inactive = "&registrationStatuses=Inactive"
    legislation = "&legislation=REACH"
    return BASE_DOSSIER_LIST, BASE_SEARCH, active, legislation
@app.cell
def _(BASE_SEARCH, cas_test, requests):
    test_search_request = requests.get(BASE_SEARCH + cas_test)
    return (test_search_request,)


@app.cell
def _(test_search_request):
    response = test_search_request.json()
    return (response,)


@app.cell
def _(test_search_request):
    test_search_request.json()
    return


@app.cell
def _(cas_test, response):
    substance = {}

    for result in response['items']:
        if result["substanceIndex"]["rmlCas"] == cas_test:
            substance["rmlCas"] = result["substanceIndex"]["rmlCas"]
            substance["rmlId"] = result["substanceIndex"]["rmlId"]
            substance["rmlEc"] = result["substanceIndex"]["rmlEc"]
            substance["rmlName"] = result["substanceIndex"]["rmlName"]
    return (substance,)


@app.cell
def _(substance):
    substance
    return
@app.cell
def _(BASE_DOSSIER_LIST, active, substance):
    url = BASE_DOSSIER_LIST + substance['rmlId'] + active
    url
    return


@app.cell
def _(BASE_DOSSIER_LIST, active, legislation, requests, substance):
    response_dossier = requests.get(BASE_DOSSIER_LIST + substance['rmlId'] + active + legislation)
    return (response_dossier,)


@app.cell
def _(response_dossier):
    response_dossier_json = response_dossier.json()
    response_dossier_json
    return (response_dossier_json,)


@app.cell
def _(response_dossier_json, substance):
    substance['lastUpdatedDate'] = response_dossier_json['items'][0]['lastUpdatedDate']
    substance['registrationStatus'] = response_dossier_json['items'][0]['registrationStatus']
    substance['registrationStatusChangedDate'] = response_dossier_json['items'][0]['registrationStatusChangedDate']
    substance['registrationRole'] = response_dossier_json['items'][0]['reachDossierInfo']['registrationRole']
    substance['assetExternalId'] = response_dossier_json['items'][0]['assetExternalId']
    substance['rootKey'] = response_dossier_json['items'][0]['rootKey']
    substance
    return
@app.cell
def _():
    from pif_compiler.services.mongo_conn import get_client

    client = get_client()

    db = client.get_database(name="toxinfo")
    return (db,)


@app.cell
def _(db):
    collection = db.get_collection("substance_index")
    collection_names = db.list_collection_names()  # don't shadow the built-in `list`
    print(collection_names)
    return (collection,)


@app.cell
def _(cas_test, collection, substance):
    sub = collection.find_one({"rmlCas": cas_test})
    if not sub:
        collection.insert_one(substance)
    return
@app.cell
def _(substance):
    # assetExternalId was never defined as a cell output; it lives in the substance dict
    INDEX_HTML = "https://chem.echa.europa.eu/html-pages/" + substance['assetExternalId'] + "/index.html"
    return
@app.cell
def _(BASE_SEARCH, log, requests):
    def search_substance(cas: str) -> dict:
        # Perform the search for this CAS (the original reused test_search_request
        # and checked status_code on the already-parsed JSON dict)
        search_response = requests.get(BASE_SEARCH + cas)
        if search_response.status_code != 200:
            log.error(f"Network error: {search_response.status_code}")
            return {}
        data = search_response.json()
        if data['totalItems'] == 0:
            log.info(f"No substance found for CAS {cas}")
            return {}
        for result in data['items']:
            if result["substanceIndex"]["rmlCas"] == cas:
                return {
                    "rmlCas": result["substanceIndex"]["rmlCas"],
                    "rmlId": result["substanceIndex"]["rmlId"],
                    "rmlEc": result["substanceIndex"]["rmlEc"],
                    "rmlName": result["substanceIndex"]["rmlName"],
                }
        log.error("No exact CAS match in the search results")
        return {}
    return
@app.cell
def _(BASE_DOSSIER, active, legislation, log, requests):
    def get_dossier_info(rmlId: str) -> dict:
        url = BASE_DOSSIER + rmlId + active + legislation
        response_dossier = requests.get(url)

        if response_dossier.status_code != 200:
            log.error(f"Network error: {response_dossier.status_code}")
            return {}

        response_dossier_json = response_dossier.json()

        if response_dossier_json['totalItems'] == 0:
            log.info(f"No dossier found for RML ID {rmlId}")
            return {}

        dossier_info = {
            "lastUpdatedDate": response_dossier_json['items'][0]['lastUpdatedDate'],
            "registrationStatus": response_dossier_json['items'][0]['registrationStatus'],
            "registrationStatusChangedDate": response_dossier_json['items'][0]['registrationStatusChangedDate'],
            "registrationRole": response_dossier_json['items'][0]['reachDossierInfo']['registrationRole'],
            "assetExternalId": response_dossier_json['items'][0]['assetExternalId'],
            "rootKey": response_dossier_json['items'][0]['rootKey']
        }
        return dossier_info
    return


@app.cell
def _(BeautifulSoup, log, requests):
    def get_substance_index(assetExternalId: str) -> dict:
        INDEX = "https://chem.echa.europa.eu/html-pages-prod/" + assetExternalId
        LINK_DOSSIER = INDEX + "/documents/"

        response = requests.get(INDEX + "/index.html")
        if response.status_code != 200:
            log.error(f"Network error: {response.status_code}")
            return {}

        soup = BeautifulSoup(response.content, 'html.parser')
        index_data = {}

        # Section div ids and the keys they map to:
        # toxicological information (txi), repeated dose toxicity (rdt), acute toxicity (at)
        sections = {
            'id_7_Toxicologicalinformation': 'toxicological_information_link',
            'id_75_Repeateddosetoxicity': 'repeated_dose_toxicity_link',
            'id_72_AcuteToxicity': 'acute_toxicity_link',
        }

        for div_id, key in sections.items():
            div = soup.find('div', id=div_id)
            link = div.find('a', class_='das-leaf') if div else None
            if link is None:
                log.info(f"Section {div_id} not found in index page")
                continue
            index_data[key] = LINK_DOSSIER + link['href'] + '.html'

        return index_data

    get_substance_index("e4c88c6e-06c7-4daa-b0fb-1a55459ac22f")
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
    # What remains to be done

    1. Create a new orchestrator for the search part, with caching in MongoDB, and a single entry-point method for searches
    2. A method to validate the JSON documents saved in the database, checking their date
    3. Methods to abstract the HTML pages into JSON
    4. Tests for each function
    5. Documentation for each function
    """
    )
    return


if __name__ == "__main__":
    app.run()
295 docs/test_summary.md Normal file

@@ -0,0 +1,295 @@

# ECHA Services Test Suite Summary

## Overview

Comprehensive test suites have been created for all three ECHA service modules, following a bottom-up approach from the lowest-level to the highest-level dependencies.

## Test Files Created

### 1. test_echa_parser.py (Lowest Level)
**Location**: `tests/test_echa_parser.py`

**Coverage**: Tests for HTML/Markdown/JSON processing functions

**Test Classes**:
- `TestOpenEchaPage` - HTML page opening (remote & local)
- `TestEchaPageToMarkdown` - HTML to Markdown conversion
- `TestMarkdownToJsonRaw` - Markdown to JSON conversion (skipped if markdown_to_json is not installed)
- `TestNormalizeUnicodeCharacters` - Unicode normalization
- `TestCleanJson` - JSON cleaning and validation
- `TestIntegrationParser` - Full pipeline integration tests

**Total Tests**: 28 tests
- 20 tests for core parser functions
- 5 tests for markdown_to_json (conditional)
- 2 integration tests
- 1 test with a known Unicode encoding issue (needs a fix)

**Key Features**:
- Mocks external dependencies (requests, file I/O)
- Tests Unicode handling edge cases
- Validates data cleaning logic
- Tests comparison operator conversions (>, <, >=, <=)

**Known Issues**:
- Unicode literal encoding in test strings (lines 372 and 380) - use `chr()` instead of `\uXXXX`
- Missing `markdown_to_json` dependency (tests skip gracefully)

### 2. test_echa_service.py (Middle Level)
**Location**: `tests/test_echa_service.py`

**Coverage**: Tests for ECHA API interaction and search functions

**Test Classes**:
- `TestGetSubstanceByIdentifier` - Substance API search
- `TestGetDossierByRmlId` - Dossier retrieval with Active/Inactive fallback
- `TestExtractSectionLinks` - Section link extraction with validation
- `TestParseSectionsFromHtml` - HTML parsing for multiple sections
- `TestGetSectionLinksFromIndex` - Remote index.html fetching
- `TestGetSectionLinksFromFile` - Local file parsing
- `TestSearchDossier` - Main search workflow
- `TestIntegrationEchaService` - Real API integration tests (marked @pytest.mark.integration)

**Total Tests**: 36 tests
- 30 unit tests with mocked APIs
- 3 integration tests (require internet, marked for manual execution)

**Key Features**:
- Comprehensive API mocking
- Tests the nested section bug fix (parent vs child section links)
- Tests URL encoding, error handling, fallback logic
- Tests local vs remote workflows
- Integration tests for real formaldehyde data

**Testing Approach**:
- Unit tests run by default (fast, no external deps)
- Integration tests require the `-m integration` flag

### 3. test_echa_extractor.py (Highest Level)
**Location**: `tests/test_echa_extractor.py`

**Coverage**: Tests for high-level extraction orchestration

**Test Classes**:
- `TestSchemas` - Data schema validation
- `TestJsonToDataframe` - JSON to pandas DataFrame conversion
- `TestDfWrapper` - DataFrame metadata addition
- `TestEchaExtractLocal` - DuckDB cache querying
- `TestEchaExtract` - Main extraction workflow
- `TestIntegrationEchaExtractor` - Real data integration tests (marked @pytest.mark.integration)

**Total Tests**: 32 tests
- 28 unit tests with full mocking
- 4 integration tests (require internet)

**Key Features**:
- Tests both RepeatedDose and AcuteToxicity schemas
- Tests local cache (DuckDB) integration
- Tests key information extraction
- Tests error handling at each pipeline stage
- Tests DataFrame vs JSON output modes
- Validates metadata addition (substance, CAS, timestamps)

**Testing Strategy**:
- Mocks the entire pipeline: search → parse → convert → clean → wrap
- Tests local_search and local_only modes
- Tests graceful degradation (returns key_infos on main extraction failure)

## Test Architecture

```
test_echa_parser.py      (Data Transformation)
          ↓
test_echa_service.py     (API & Search)
          ↓
test_echa_extractor.py   (Orchestration)
```

### Dependency Flow
1. **Parser** (lowest) - No dependencies on other ECHA modules
2. **Service** (middle) - Depends on Parser for some functionality
3. **Extractor** (highest) - Depends on both Service and Parser

### Mock Strategy
- **Parser**: Mocks `requests`, file I/O, `os.makedirs`
- **Service**: Mocks `requests.get` for API calls, HTML content
- **Extractor**: Mocks the entire pipeline chain (search_dossier, open_echa_page, etc.)

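As a minimal, self-contained sketch of this mocking strategy (the `http_get` and `fetch_title` helpers are hypothetical stand-ins, not functions from the codebase):

```python
import sys
from unittest.mock import Mock, patch

def http_get(url):
    # Stand-in for requests.get; in the real service this would be a network call.
    raise RuntimeError("network disabled in unit tests")

def fetch_title(url: str) -> str:
    # Hypothetical service-style helper: fetch a page and read one field.
    response = http_get(url)
    if response.status_code != 200:
        return ""
    return response.json().get("title", "")

# Patch http_get on this module, so no real network call is ever made.
with patch.object(sys.modules[__name__], "http_get") as mock_get:
    mock_response = Mock()
    mock_response.status_code = 200
    mock_response.json.return_value = {"title": "Formaldehyde"}
    mock_get.return_value = mock_response

    result = fetch_title("https://example.invalid/substance")

print(result)  # → Formaldehyde
```

The same pattern applies one level up: the extractor tests patch `search_dossier` and friends instead of the HTTP layer.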
## Running the Tests

### Run All Tests
```bash
uv run pytest tests/test_echa_*.py -v
```

### Run Specific Module
```bash
uv run pytest tests/test_echa_parser.py -v
uv run pytest tests/test_echa_service.py -v
uv run pytest tests/test_echa_extractor.py -v
```

### Run Only Unit Tests (Fast)
```bash
uv run pytest tests/test_echa_*.py -v -m "not integration"
```

### Run Integration Tests (Requires Internet)
```bash
uv run pytest tests/test_echa_*.py -v -m integration
```

### Run With Coverage
```bash
uv run pytest tests/test_echa_*.py --cov=pif_compiler.services --cov-report=html
```

## Test Coverage Summary

### Functions Tested

#### echa_parser.py (5/5 = 100%)
- ✅ `open_echa_page()` - Remote & local file opening
- ✅ `echa_page_to_markdown()` - HTML to Markdown with route formatting
- ✅ `markdown_to_json_raw()` - Markdown parsing & JSON conversion
- ✅ `normalize_unicode_characters()` - Unicode normalization
- ✅ `clean_json()` - Recursive cleaning & validation

#### echa_service.py (8/8 = 100%)
- ✅ `search_dossier()` - Main entry point with local file support
- ✅ `get_substance_by_identifier()` - Substance API search
- ✅ `get_dossier_by_rml_id()` - Dossier retrieval with fallback
- ✅ `_query_dossier_api()` - Helper for API queries
- ✅ `get_section_links_from_index()` - Remote HTML fetching
- ✅ `get_section_links_from_file()` - Local HTML parsing
- ✅ `parse_sections_from_html()` - HTML content parsing
- ✅ `extract_section_links()` - Individual section extraction with validation

#### echa_extractor.py (4/4 = 100%)
- ✅ `echa_extract()` - Main extraction function
- ✅ `echa_extract_local()` - DuckDB cache queries
- ✅ `json_to_dataframe()` - JSON to DataFrame conversion
- ✅ `df_wrapper()` - Metadata addition

**Total Functions**: 17/17 tested (100%)

## Edge Cases Covered

### Parser
- Empty/malformed HTML
- Missing sections
- Unicode encoding issues (â€, superscripts)
- Comparison operators (>, <, >=, <=)
- Nested structures
- [Empty] value filtering
- "no information available" filtering

### Service
- Substance not found
- No dossiers (active or inactive)
- Nested sections (parent without a direct link)
- Input type mismatches
- Network errors
- Malformed API responses
- Local vs remote file paths

### Extractor
- Substance not found
- Missing scraping type pages
- Empty sections
- Empty cleaned JSON
- Local cache hits/misses
- Key information extraction
- DataFrame filtering (null Effect levels)
- JSON vs DataFrame output modes

## Dependencies Required

### Core Dependencies (Already in project)
- pytest
- pytest-mock
- pytest-cov
- beautifulsoup4
- pandas
- requests
- markdownify
- pydantic

### Optional Dependencies (Tests skip if missing)
- `markdown_to_json` - Required for markdown→JSON conversion tests
- `duckdb` - Required for local cache tests
- Internet connection - Required for integration tests

## Test Markers

### Custom Markers (defined in conftest.py)
- `@pytest.mark.unit` - Fast tests, no external dependencies
- `@pytest.mark.integration` - Tests requiring real APIs/internet
- `@pytest.mark.slow` - Long-running tests
- `@pytest.mark.database` - Tests requiring a database

### Usage in ECHA Tests
- Unit tests: Default (run without flags)
- Integration tests: Require `-m integration`
- Skipped tests: Auto-skip if dependencies missing

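Registering such markers in a `conftest.py` usually goes through the `pytest_configure` hook; a minimal sketch (the hook body is an assumption, not a copy of the project's conftest, and the marker descriptions are illustrative):

```python
# Sketch of marker registration as it could appear in tests/conftest.py.
MARKERS = {
    "unit": "Fast tests, no external dependencies",
    "integration": "Tests requiring real APIs/internet",
    "slow": "Long-running tests",
    "database": "Tests requiring a database",
}

def pytest_configure(config):
    # pytest calls this hook at startup; addinivalue_line registers each marker
    # so that `pytest --markers` lists it and strict-marker mode accepts it.
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")

# Quick self-check with a stand-in config object (pytest supplies the real one).
class _FakeConfig:
    def __init__(self):
        self.registered = []
    def addinivalue_line(self, section, line):
        self.registered.append((section, line))

cfg = _FakeConfig()
pytest_configure(cfg)
print(len(cfg.registered))  # → 4
```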
## Known Issues & Improvements Needed

### 1. Unicode Test Encoding (test_echa_parser.py)
**Issue**: Lines 372 and 380 have truncated Unicode escape sequences
**Fix**: Replace `\u00c2\u00b2` with `chr(0xc2) + chr(0xb2)`
**Priority**: Medium

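A small sketch of the suggested fix, building the problematic two-character sequence with `chr()` so that no escape sequence can end up truncated in the source file:

```python
# Build the mojibake pair U+00C2 U+00B2 ("Â" + "²") without \uXXXX escapes.
broken = chr(0xC2) + chr(0xB2)

# Equivalent escaped literal, for comparison.
escaped = "\u00c2\u00b2"

print(broken == escaped, len(broken))  # → True 2
```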
### 2. Missing markdown_to_json Dependency
**Issue**: Tests skip if it is not installed
**Fix**: Add it to the project dependencies or document it as optional
**Priority**: Low (tests gracefully skip)

### 3. Integration Test Data
**Issue**: Real API tests may fail if the ECHA page structure changes
**Fix**: Add recorded fixtures for deterministic testing
**Priority**: Low

### 4. DuckDB Integration
**Issue**: The test_echa_extractor local cache tests mock DuckDB
**Fix**: Create an actual test database for integration testing
**Priority**: Low

## Test Statistics

| Module | Total Tests | Unit Tests | Integration Tests | Skipped (Conditional) |
|--------|-------------|------------|-------------------|-----------------------|
| echa_parser.py | 28 | 26 | 2 | 7 (markdown_to_json) |
| echa_service.py | 36 | 33 | 3 | 0 |
| echa_extractor.py | 32 | 28 | 4 | 0 |
| **TOTAL** | **96** | **87** | **9** | **7** |

## Next Steps

1. **Fix Unicode encoding** in test_echa_parser.py (lines 372 and 380)
2. **Run full test suite** to verify all unit tests pass
3. **Add markdown_to_json** to dependencies if needed
4. **Run integration tests** manually to verify real API behavior
5. **Generate coverage report** to identify any untested code paths
6. **Document test patterns** for future service additions
7. **Add CI/CD integration** for automated testing

## Contributing

When adding new functions to the ECHA services:

1. **Write tests first** (TDD approach)
2. **Follow existing patterns**:
   - One test class per function
   - Mock external dependencies
   - Test happy path + edge cases
   - Add integration tests for real API behavior
3. **Use appropriate markers**: `@pytest.mark.integration` for slow tests
4. **Update this document** with new test coverage

## References

- Main documentation: [docs/echa_architecture.md](echa_architecture.md)
- Test patterns: [tests/test_cosing_service.py](../tests/test_cosing_service.py)
- pytest configuration: [pytest.ini](../pytest.ini)
- Test fixtures: [tests/conftest.py](../tests/conftest.py)
767 docs/testing_guide.md Normal file

@@ -0,0 +1,767 @@

# Testing Guide - Theory and Best Practices

## Table of Contents
- [Introduction](#introduction)
- [Your Current Approach vs. Test-Driven Development](#your-current-approach-vs-test-driven-development)
- [The Testing Pyramid](#the-testing-pyramid)
- [Key Concepts](#key-concepts)
- [Real-World Testing Workflow](#real-world-testing-workflow)
- [Regression Testing](#regression-testing---the-killer-feature)
- [Code Coverage](#coverage---how-much-is-tested)
- [Best Practices](#best-practices-summary)
- [Practical Examples](#practical-example-your-workflow)
- [When Should You Write Tests](#when-should-you-write-tests)
- [Getting Started](#your-next-steps)

---

## Introduction

This guide explains the theory and best practices of software testing, specifically for the PIF Compiler project. It moves beyond ad-hoc testing scripts to a comprehensive, automated testing approach.

---

## Your Current Approach vs. Test-Driven Development

### What You Do Now (Ad-hoc Scripts):

```python
# test_script.py
from cosing_service import cosing_search

result = cosing_search("WATER", mode="name")
print(result)  # Look at output, check if it looks right
```

**Problems:**
- ❌ Manual checking (is the output correct?)
- ❌ Not repeatable (you forget what "correct" looks like)
- ❌ Doesn't catch regressions (future changes break old code)
- ❌ No documentation (what should the function do?)
- ❌ Tedious for many functions

---

## The Testing Pyramid

```
        /\
       /  \       E2E Tests (Few)
      /----\
     /      \     Integration Tests (Some)
    /--------\
   /          \   Unit Tests (Many)
  /____________\
```

### 1. **Unit Tests** (Bottom - Most Important)

Test individual functions in isolation.

**Example:**
```python
def test_parse_cas_numbers_single():
    """Test parsing a single CAS number."""
    result = parse_cas_numbers(["7732-18-5"])
    assert result == ["7732-18-5"]  # ← Automated check
```

**Benefits:**
- ✅ Fast (milliseconds)
- ✅ No external dependencies (no API, no database)
- ✅ Pinpoints the exact problem
- ✅ Runs hundreds in seconds

**When to use:**
- Testing individual functions
- Testing data parsing/validation
- Testing business logic calculations

---

### 2. **Integration Tests** (Middle)

Test multiple components working together.

**Example:**
```python
def test_full_cosing_workflow():
    """Test search + clean workflow."""
    raw = cosing_search("WATER", mode="name")
    clean = clean_cosing(raw)
    assert "cosingUrl" in clean
```

**Benefits:**
- ✅ Tests real interactions
- ✅ Catches integration bugs

**Drawbacks:**
- ⚠️ Slower (hits real APIs)
- ⚠️ Requires internet/database

**When to use:**
- Testing workflows across multiple services
- Testing API integrations
- Testing database interactions

---

### 3. **E2E Tests** (End-to-End - Top - Fewest)

Test the entire application flow (UI → Backend → Database).

**Example:**
```python
def test_create_pif_from_ui():
    """User creates a PIF through the Streamlit UI."""
    # Click buttons, fill forms, verify PDF generated
```

**When to use:**
- Testing complete user workflows
- Smoke tests before deployment
- Critical business processes

---

## Key Concepts

### 1. **Assertions - Automated Verification**

**Old way (manual):**
```python
result = parse_cas_numbers(["7732-18-5/56-81-5"])
print(result)  # You look at: ['7732-18-5', '56-81-5']
# Is this right? Maybe? You forget in 2 weeks.
```

**Test way (automated):**
```python
def test_parse_multiple_cas():
    result = parse_cas_numbers(["7732-18-5/56-81-5"])
    assert result == ["7732-18-5", "56-81-5"]  # ← Computer checks!
    # If wrong, test FAILS immediately
```

**Common Assertions:**
```python
# Equality
assert result == expected

# Truthiness
assert result is not None
assert "key" in result

# Exceptions
with pytest.raises(ValueError):
    invalid_function()

# Approximate equality (for floats)
assert result == pytest.approx(3.14159, rel=1e-5)
```

---

### 2. **Mocking - Control External Dependencies**

**Problem:** Testing `cosing_search()` hits the real COSING API:
- ⚠️ Slow (network request)
- ⚠️ Unreliable (the API might be down)
- ⚠️ Expensive (rate limits)
- ⚠️ Hard to test errors (how do you make the API return an error?)

**Solution: Mock it!**
```python
from unittest.mock import Mock, patch

@patch('cosing_service.req.post')  # Replace the real HTTP request
def test_search_by_name(mock_post):
    # Control what the "API" returns
    mock_response = Mock()
    mock_response.json.return_value = {
        "results": [{"metadata": {"inciName": ["WATER"]}}]
    }
    mock_post.return_value = mock_response

    result = cosing_search("WATER", mode="name")

    assert result["inciName"] == ["WATER"]  # ← Test your logic, not the API
    mock_post.assert_called_once()  # Verify it was called
```

**Benefits:**
- ✅ Fast (no real network)
- ✅ Reliable (always works)
- ✅ Can test error cases (mock API failures)
- ✅ Isolates your code from external issues

**What to mock:**
- HTTP requests (`requests.get`, `requests.post`)
- Database calls (`db.find_one`, `db.insert`)
- File I/O (`open`, `read`, `write`)
- External APIs (COSING, ECHA, PubChem)
- Time-dependent functions (`datetime.now()`)

---

### 3. **Fixtures - Reusable Test Data**

**Without fixtures (repetitive):**
```python
def test_clean_basic():
    data = {"inciName": ["WATER"], "casNo": ["7732-18-5"], ...}
    result = clean_cosing(data)
    assert ...

def test_clean_empty():
    data = {"inciName": ["WATER"], "casNo": ["7732-18-5"], ...}  # Copy-paste!
    result = clean_cosing(data)
    assert ...
```

**With fixtures (DRY - Don't Repeat Yourself):**
```python
# conftest.py
@pytest.fixture
def sample_cosing_response():
    """Reusable COSING response data."""
    return {
        "inciName": ["WATER"],
        "casNo": ["7732-18-5"],
        "substanceId": ["12345"]
    }

# test file
def test_clean_basic(sample_cosing_response):  # Auto-injected!
    result = clean_cosing(sample_cosing_response)
    assert result["inciName"] == "WATER"

def test_clean_empty(sample_cosing_response):  # Reuse the same data!
    result = clean_cosing(sample_cosing_response)
    assert "cosingUrl" in result
```

**Benefits:**
- ✅ No code duplication
- ✅ Centralized test data
- ✅ Easy to update (change once, affects all tests)
- ✅ Auto-cleanup (fixtures can tear down resources)

**Common fixture patterns:**
```python
# Database fixture with cleanup
@pytest.fixture
def test_db():
    db = connect_to_test_db()
    yield db  # Test runs here
    db.drop_all()  # Cleanup after the test

# Temporary file fixture
@pytest.fixture
def temp_file(tmp_path):
    file_path = tmp_path / "test.json"
    file_path.write_text('{"test": "data"}')
    return file_path  # Auto-cleaned by pytest
```

---

## Real-World Testing Workflow

### Scenario: You Add a New Feature

**Step 1: Write the test FIRST (TDD - Test-Driven Development):**
```python
def test_parse_cas_removes_parentheses():
    """CAS numbers with parentheses should be cleaned."""
    result = parse_cas_numbers(["7732-18-5 (hydrate)"])
    assert result == ["7732-18-5"]
```

**Step 2: Run the test - it FAILS (expected!):**
```bash
$ uv run pytest tests/test_cosing_service.py::test_parse_cas_removes_parentheses

FAILED: AssertionError: assert ['7732-18-5 (hydrate)'] == ['7732-18-5']
```

**Step 3: Write code to make it pass:**
```python
def parse_cas_numbers(cas_string: list) -> list:
    cas_string = cas_string[0]
    cas_string = re.sub(r"\([^)]*\)", "", cas_string)  # ← Add this
    # ... rest of function
```

**Step 4: Run the test again - it PASSES:**
```bash
$ uv run pytest tests/test_cosing_service.py::test_parse_cas_removes_parentheses

PASSED ✓
```

**Step 5: Refactor if needed - tests ensure you don't break anything!**
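The steps above can be collapsed into one runnable sketch (the helper mirrors the guide's `parse_cas_numbers`; its exact splitting behavior is an assumption for illustration):

```python
import re

def parse_cas_numbers(cas_string: list) -> list:
    """Hypothetical helper: drop parenthetical notes, split on /, ; and ,."""
    s = cas_string[0]
    s = re.sub(r"\([^)]*\)", "", s)   # remove "(hydrate)" and similar notes
    parts = re.split(r"[/;,]", s)     # split multi-CAS strings
    return [cas.strip() for cas in parts if cas.strip()]

def test_parse_cas_removes_parentheses():
    assert parse_cas_numbers(["7732-18-5 (hydrate)"]) == ["7732-18-5"]

# Run the test directly; pytest would collect it automatically.
test_parse_cas_removes_parentheses()
print("PASSED")  # → PASSED
```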
---

### TDD Cycle (Red-Green-Refactor)

```
1. RED: Write a failing test
        ↓
2. GREEN: Write minimal code to pass
        ↓
3. REFACTOR: Improve code without breaking tests
        ↓
   Repeat
```

**Benefits:**
- ✅ Forces you to think about requirements first
- ✅ Prevents over-engineering
- ✅ Built-in documentation (tests show intended behavior)
- ✅ Confidence to refactor

---

## Regression Testing - The Killer Feature

**Scenario: You change the code 6 months later:**

```python
# Original (working)
def parse_cas_numbers(cas_string: list) -> list:
    cas_string = cas_string[0]
    cas_string = re.sub(r"\([^)]*\)", "", cas_string)
    cas_parts = re.split(r"[/;,]", cas_string)  # Handles /, ;, ,
    return [cas.strip() for cas in cas_parts]

# You "improve" it
def parse_cas_numbers(cas_string: list) -> list:
    return cas_string[0].split("/")  # Simpler! But...
```

**Run the tests:**
```bash
$ uv run pytest

FAILED: test_multiple_cas_with_semicolon
  Expected: ['7732-18-5', '56-81-5']
  Got: ['7732-18-5;56-81-5']  # ← Oops, broke semicolon support!

FAILED: test_cas_with_parentheses
  Expected: ['7732-18-5']
  Got: ['7732-18-5 (hydrate)']  # ← Broke parentheses removal!
```

**Without tests:**
- You deploy
- Users report bugs
- You're confused about what broke
- You spend hours debugging

**With tests:**
- Instant feedback
- Fix before deploying
- Save hours of debugging

---

## Coverage - How Much Is Tested?

### Running Coverage

```bash
uv run pytest --cov=src/pif_compiler --cov-report=html
```

### Sample Output

```
Name                    Stmts   Miss  Cover
--------------------------------------------------
cosing_service.py          89      5    94%
echa_service.py           156     89    43%
models.py                  45     45     0%
--------------------------------------------------
TOTAL                     290    139    52%
```

### Interpretation

- ✅ `cosing_service.py` - **94% covered** (great!)
- ⚠️ `echa_service.py` - **43% covered** (needs more tests)
- ❌ `models.py` - **0% covered** (no tests yet)

### Coverage Goals

| Coverage | Status | Action |
|----------|--------|--------|
| 90-100% | ✅ Excellent | Maintain |
| 70-90% | ⚠️ Good | Add edge cases |
| 50-70% | ⚠️ Acceptable | Prioritize critical paths |
| <50% | ❌ Poor | Add tests immediately |

**Target:** 80%+ for business-critical code

### HTML Coverage Report

```bash
uv run pytest --cov=src/pif_compiler --cov-report=html
# Open htmlcov/index.html in a browser
```

The report shows:
- Which lines are tested (green)
- Which lines are not tested (red)
- Which branches are not covered

---
|
||||||
|
|
||||||
|
## Best Practices Summary
|
||||||
|
|
||||||
|
### ✅ DO:
|
||||||
|
|
||||||
|
1. **Write tests for all business logic**
|
||||||
|
```python
|
||||||
|
# YES: Test calculations
|
||||||
|
def test_sed_calculation():
|
||||||
|
ingredient = Ingredient(quantity=10.0, dap=0.5)
|
||||||
|
assert ingredient.calculate_sed() == 5.0
|
||||||
|
```
|
||||||
|
|
||||||
|
2. **Mock external dependencies**
|
||||||
|
```python
|
||||||
|
# YES: Mock API calls
|
||||||
|
@patch('cosing_service.req.post')
|
||||||
|
def test_search(mock_post):
|
||||||
|
mock_post.return_value.json.return_value = {...}
|
||||||
|
```
|
||||||
|
|
||||||
|
3. **Test edge cases**
|
||||||
|
```python
|
||||||
|
# YES: Test edge cases
|
||||||
|
def test_parse_empty_cas():
|
||||||
|
assert parse_cas_numbers([""]) == []
|
||||||
|
|
||||||
|
def test_parse_invalid_cas():
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
parse_cas_numbers(["abc-def-ghi"])
|
||||||
|
```
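   For reference, a minimal implementation that satisfies these edge cases could look like the sketch below (illustrative only; the real `parse_cas_numbers` lives in `cosing_service.py` and may differ):

   ```python
   import re

   # CAS registry numbers: 2-7 digits, 2 digits, 1 digit, separated by hyphens
   CAS_RE = re.compile(r"^\d{2,7}-\d{2}-\d$")

   def parse_cas_numbers(raw: list[str]) -> list[str]:
       """Normalize raw CAS strings: split on '/', strip whitespace,
       drop empties, and reject anything that is not CAS-shaped."""
       result = []
       for entry in raw:
           for part in entry.split("/"):
               part = part.strip()
               if not part:
                   continue  # empty strings are silently dropped
               if not CAS_RE.match(part):
                   raise ValueError(f"Invalid CAS number: {part!r}")
               result.append(part)
       return result
   ```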

4. **Keep tests simple**
   ```python
   # YES: One test = one thing
   def test_cas_removes_whitespace():
       assert parse_cas_numbers([" 123-45-6 "]) == ["123-45-6"]

   # NO: Testing multiple things
   def test_cas_everything():
       assert parse_cas_numbers([" 123-45-6 "]) == ["123-45-6"]
       assert parse_cas_numbers(["123-45-6/789-01-2"]) == [...]
       # Too much in one test!
   ```

5. **Run tests before committing**
   ```bash
   git add .
   uv run pytest  # ← Always run first!
   git commit -m "Add feature X"
   ```

6. **Use descriptive test names**
   ```python
   # YES: Describes what it tests
   def test_parse_cas_removes_parenthetical_info():
       ...

   # NO: Vague
   def test_cas_1():
       ...
   ```
---

### ❌ DON'T:

1. **Don't test external libraries**
   ```python
   # NO: Testing if requests.post works
   def test_requests_library():
       response = requests.post("https://example.com")
       assert response.status_code == 200

   # YES: Test YOUR code that uses requests
   @patch('requests.post')
   def test_my_search_function(mock_post):
       ...
   ```

2. **Don't make tests dependent on each other**
   ```python
   # NO: test_b depends on test_a
   def test_a_creates_data():
       db.insert({"id": 1, "name": "test"})

   def test_b_uses_data():
       data = db.find_one({"id": 1})  # Breaks if test_a fails!

   # YES: Each test is independent
   def test_b_uses_data():
       db.insert({"id": 1, "name": "test"})  # Create own data
       data = db.find_one({"id": 1})
   ```

3. **Don't test implementation details**
   ```python
   # NO: Testing internal variable names
   def test_internal_state():
       obj = MyClass()
       assert obj._internal_var == "value"  # Breaks with refactoring

   # YES: Test public behavior
   def test_public_api():
       obj = MyClass()
       assert obj.get_value() == "value"
   ```

4. **Don't skip tests**
   ```python
   # NO: Commenting out failing tests
   # def test_broken_feature():
   #     assert broken_function() == "expected"

   # YES: Fix the test or mark as TODO
   @pytest.mark.skip(reason="Feature not implemented yet")
   def test_future_feature():
       ...
   ```

---

## Practical Example: Your Workflow

### Before (Manual Script)

```python
# test_water.py
from cosing_service import cosing_search, clean_cosing

result = cosing_search("WATER", "name")
print(result)  # ← You manually check

clean = clean_cosing(result)
print(clean)  # ← You manually check again

# Run 10 times with different inputs... tedious!
```

**Problems:**
- Manual verification
- Slow (type command, read output, verify)
- Error-prone (miss things)
- Not repeatable

---

### After (Automated Tests)

```python
# tests/test_cosing_service.py
def test_search_and_clean_water():
    """Water should be searchable and cleanable."""
    result = cosing_search("WATER", "name")
    assert result is not None
    assert "inciName" in result

    clean = clean_cosing(result)
    assert clean["inciName"] == "WATER"
    assert "cosingUrl" in clean

# Run ONCE: pytest
# It checks everything automatically!
```

**Run all 25 tests:**
```bash
$ uv run pytest

tests/test_cosing_service.py::TestParseCasNumbers::test_single_cas_number PASSED
tests/test_cosing_service.py::TestParseCasNumbers::test_multiple_cas_with_slash PASSED
...
======================== 25 passed in 0.5s ========================
```

**Benefits:**
- ✅ All pass? Safe to deploy!
- ❌ One fails? Fix before deploying!
- ⏱️ 25 tests in 0.5 seconds vs. 30 minutes of manual testing

---

## When Should You Write Tests?

### Always Test:

✅ **Business logic** (calculations, data processing)
```python
# YES
def test_calculate_sed():
    assert calculate_sed(quantity=10, dap=0.5) == 5.0
```

✅ **Data validation** (Pydantic models)
```python
# YES
def test_ingredient_validates_cas_format():
    with pytest.raises(ValidationError):
        Ingredient(cas="invalid", quantity=10.0)
```

✅ **API integrations** (with mocks)
```python
# YES
@patch('requests.post')
def test_cosing_search(mock_post):
    ...
```

✅ **Bug fixes** (write test first, then fix)
```python
# YES
def test_bug_123_empty_cas_crash():
    """Regression test for bug #123."""
    result = parse_cas_numbers([])  # Used to crash
    assert result == []
```

---

### Sometimes Test:

⚠️ **UI code** (harder to test, less critical)
```python
# Streamlit UI tests are complex, lower priority
```

⚠️ **Configuration** (usually simple)
```python
# Config loading is straightforward; test it only if it contains complex logic
```

---

### Don't Test:

❌ **Third-party libraries** (they have their own tests)
```python
# NO: Testing if pandas works
def test_pandas_dataframe():
    df = pd.DataFrame({"a": [1, 2, 3]})
    assert len(df) == 3  # The pandas team already tested this!
```

❌ **Trivial code**
```python
# NO: Testing simple getters/setters
class MyClass:
    def get_name(self):
        return self.name  # Too simple to test
```

---

## Your Next Steps

### 1. Install Pytest

```bash
cd c:\Users\adish\Projects\pif_compiler
uv add --dev pytest pytest-cov pytest-mock
```

### 2. Run the COSING Tests

```bash
# Run all tests
uv run pytest

# Run with verbose output
uv run pytest -v

# Run specific test file
uv run pytest tests/test_cosing_service.py

# Run specific test
uv run pytest tests/test_cosing_service.py::TestParseCasNumbers::test_single_cas_number
```

### 3. See Coverage

```bash
# Terminal report
uv run pytest --cov=src/pif_compiler/services/cosing_service

# HTML report (more detailed)
uv run pytest --cov=src/pif_compiler --cov-report=html
# Open htmlcov/index.html in browser
```

### 4. Start Writing Tests for New Code

Follow the TDD cycle:
1. **Red**: Write a failing test
2. **Green**: Write minimal code to pass
3. **Refactor**: Improve the code
4. Repeat!
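
The cycle in miniature, using a hypothetical `normalize_inci` helper (the name and behavior are illustrative, not part of the codebase):

```python
# Step 1 (Red): write the test first — it fails because the function doesn't exist yet.
def test_normalize_inci_uppercases_and_strips():
    assert normalize_inci("  aqua ") == "AQUA"

# Step 2 (Green): write the minimal code that makes it pass.
def normalize_inci(name: str) -> str:
    return name.strip().upper()

# Step 3 (Refactor): improve the implementation or naming —
# the passing test keeps you safe while you do.
```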

---

## Additional Resources

### Pytest Documentation
- [Official Pytest Docs](https://docs.pytest.org/)
- [Pytest Fixtures](https://docs.pytest.org/en/stable/fixture.html)
- [Pytest Mocking](https://docs.pytest.org/en/stable/monkeypatch.html)

### Testing Philosophy
- [Test-Driven Development (TDD)](https://www.freecodecamp.org/news/test-driven-development-what-it-is-and-what-it-is-not-41fa6bca02a2/)
- [Testing Best Practices](https://testautomationuniversity.com/)
- [The Testing Pyramid](https://martinfowler.com/articles/practical-test-pyramid.html)

### PIF Compiler Specific
- [tests/README.md](../tests/README.md) - Test suite documentation
- [tests/RUN_TESTS.md](../tests/RUN_TESTS.md) - Quick start guide
- [REFACTORING.md](../REFACTORING.md) - Code organization changes

---

## Summary

**Testing transforms your development workflow:**

| Without Tests | With Tests |
|---------------|------------|
| Manual verification | Automated checks |
| Slow feedback | Instant feedback |
| Fear of breaking things | Confidence to refactor |
| Undocumented behavior | Tests as documentation |
| Debug for hours | Pinpoint issues immediately |

**Start small:**
1. Write tests for one service (✅ COSING done!)
2. Add tests for new features
3. Fix bugs with tests first
4. Gradually increase coverage

**The investment pays off:**
- Fewer bugs in production
- Faster development (less debugging)
- Better code design
- Easier collaboration
- Peace of mind 😌

---

*Last updated: 2025-01-04*
6
docs/user_journey.md
Normal file
@@ -0,0 +1,6 @@
# User Journey

1) User logs in or signs up
   - For this we will use Streamlit's built-in authentication component, backed by a Supabase database (work in progress)
2) Open a recent project or create a new one
   - This is where we open an existing project file with all its specifics, or create a new one
33
pyproject.toml
Normal file
@@ -0,0 +1,33 @@
[project]
name = "pif-compiler"
version = "0.1.0"
description = "Pif Software"
readme = "README.md"
authors = [
    { name = "adish-rmr", email = "adish@hotmail.it" }
]
requires-python = ">=3.12"
dependencies = [
    "beautifulsoup4>=4.14.2",
    "duckdb>=1.4.1",
    "marimo>=0.16.5",
    "markdown-to-json>=2.1.2",
    "markdownify>=1.2.0",
    "playwright>=1.55.0",
    "pubchemprops>=0.1.1",
    "pubchempy>=1.0.5",
    "pydantic>=2.11.10",
    "pymongo>=4.15.2",
    "pytest>=8.4.2",
    "pytest-cov>=7.0.0",
    "pytest-mock>=3.15.1",
    "requests>=2.32.5",
    "streamlit>=1.50.0",
]

[project.scripts]
pif-compiler = "pif_compiler:main"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
28
pytest.ini
Normal file
@@ -0,0 +1,28 @@
[pytest]
# Pytest configuration for PIF Compiler

# Test discovery
testpaths = tests
python_files = test_*.py
python_classes = Test*
python_functions = test_*

# Output options
addopts =
    -v
    --strict-markers
    --tb=short
    --disable-warnings

# Markers for different test types
markers =
    unit: Unit tests (fast, no external dependencies)
    integration: Integration tests (may hit real APIs)
    slow: Slow tests (skip by default)
    database: Tests requiring MongoDB

# Coverage options (if pytest-cov is installed)
# addopts = --cov=src/pif_compiler --cov-report=html --cov-report=term

# Ignore patterns
norecursedirs = .git .venv __pycache__ *.egg-info dist build
0
src/pif_compiler/__init__.py
Normal file
42
src/pif_compiler/classes/__init__.py
Normal file
@@ -0,0 +1,42 @@
"""
PIF Compiler - Data Models

This module contains all data models for the PIF (Product Information File) system.
"""

from pif_compiler.classes.models import (
    Ingredient,
    ExpositionInfo,
    SedTable,
    ProdCompany,
)

from pif_compiler.classes.pif_class import PIF

from pif_compiler.classes.types_enum import (
    CosmeticType,
    PhysicalForm,
    PlaceApplication,
    NormalUser,
    RoutesExposure,
    NanoRoutes,
    TranslatedEnum,
)

__all__ = [
    # Main PIF model
    "PIF",
    # Component models
    "Ingredient",
    "ExpositionInfo",
    "SedTable",
    "ProdCompany",
    # Enums
    "CosmeticType",
    "PhysicalForm",
    "PlaceApplication",
    "NormalUser",
    "RoutesExposure",
    "NanoRoutes",
    "TranslatedEnum",
]
73
src/pif_compiler/classes/models.py
Normal file
@@ -0,0 +1,73 @@
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Any
from datetime import datetime
from pydantic import BaseModel, StringConstraints, Field
from typing_extensions import Annotated
from pif_compiler.classes.types_enum import CosmeticType, PhysicalForm, PlaceApplication, NormalUser, RoutesExposure, NanoRoutes


class Ingredient(BaseModel):

    inci_name: Annotated[
        str,
        StringConstraints(
            min_length=3,
            max_length=50,
            strip_whitespace=True,
            to_upper=True,
        ),
    ]

    cas: Annotated[str, StringConstraints(
        min_length=5,
        max_length=13,
        strip_whitespace=True,
    )]

    quantity: Annotated[float, Field(gt=0.001, lt=100.0, allow_inf_nan=False)]

    # PubChem data used for the DAP
    mol_weight: Optional[int]
    degree_ioniz: Optional[bool]
    log_pow: Optional[int]
    melting_pnt: Optional[int]

    # toxicity values
    sed: Optional[float]
    dap: float = 0.5
    sedd: Optional[float]
    noael: Optional[int]
    mos: Optional[int]

    # references
    ref: Optional[str]
    restriction: Optional[str]


class ExpositionInfo(BaseModel):
    type: CosmeticType
    target_population: NormalUser
    consumer_weight: str = "60 kg"
    place_application: PlaceApplication
    routes_exposure: RoutesExposure
    nano_routes: NanoRoutes
    surface_area: int
    frequency: int

    # to be approximated by LLM
    estimated_daily_amount_applied: float
    relative_daily_amount_applied: float
    retention_factor: float
    calculated_daily_exposure: float
    calculated_relative_daily_exposure: float


class SedTable(BaseModel):
    surface: int
    total_exposition: int
    frequency: int
    retention: int
    consumer_weight: int
    total_sed: int


class ProdCompany(BaseModel):
    prod_company_name: str
    prod_vat: int
    prod_address: str
36
src/pif_compiler/classes/pif_class.py
Normal file
@@ -0,0 +1,36 @@
from typing import List, Optional
from datetime import datetime
from pydantic import BaseModel, Field

from pif_compiler.classes.models import ExpositionInfo, SedTable, ProdCompany, Ingredient
from pif_compiler.classes.types_enum import CosmeticType, PhysicalForm, NormalUser


class PIF(BaseModel):
    # GENERAL PRODUCT INFORMATION

    # Creation date of the PIF, truncated to the day
    created_at: datetime = Field(
        default_factory=lambda: datetime.strptime(
            datetime.now().strftime('%Y-%m-%d'),
            '%Y-%m-%d'
        )
    )

    # Product information
    company: str
    product_name: str
    type: CosmeticType
    physical_form: PhysicalForm
    CNCP: int
    production_company: ProdCompany

    # Ingredients
    ingredients: List[Ingredient]  # quantity expressed as a decimal percentage
    normal_consumer: Optional[NormalUser]
    exposition: Optional[ExpositionInfo]

    # Safety information
    sed_table: Optional[SedTable] = None
    undesired_effets: Optional[str] = None
    description: Optional[str] = None
    warnings: Optional[List[str]] = None
145
src/pif_compiler/classes/types_enum.py
Normal file
@@ -0,0 +1,145 @@
from enum import Enum


class TranslatedEnum(str, Enum):
    def get_translation(self, lang: str) -> str:
        translations = self.value.split("|")
        return translations[0] if lang == "en" else translations[1]


class PhysicalForm(TranslatedEnum):
    SIERO = "Serum (liquid)|Siero (liquido)"
    LOZIONE = "Lotion (liquid)|Lozione (liquido)"
    CREMA = "Cream (liquid)|Crema (liquido)"
    OLIO = "Oil (liquid)|Olio (liquido)"
    GEL = "Gel (liquid)|Gel (liquido)"
    SCHIUMA = "Foam (liquid)|Schiuma (liquido)"
    SOLUZIONE = "Solution (liquid)|Soluzione (liquido)"
    EMULSIONE = "Emulsion (liquid)|Emulsione (liquido)"
    SOSPENSIONE = "Suspension (liquid)|Sospensione (liquido)"
    BALSAMO = "Balm (semi-solid)|Balsamo (semi-solido)"
    PASTA = "Paste (semi-solid)|Pasta (semi-solido)"
    UNGENTO = "Ointment (semi-solid)|Unguento (semi-solido)"
    POLVERE_COMPATTA = "Pressed Powder (solid)|Polvere compatta (solido)"
    POLVERE_LIBERA = "Loose Powder (solid)|Polvere libera (solido)"
    STICK = "Stick (solid)|Stick (solido)"
    BARRETTA = "Bar (solid)|Barretta (solido)"
    PERLE = "Beads/Pearls (solid)|Perle (solido)"
    SPRAY = "Spray/Mist (aerosol)|Spray/Nebulizzatore (aerosol)"
    AEROSOL = "Aerosol (aerosol)|Aerosol (aerosol)"
    SPRAY_IN_POLVERE = "Powder Spray (aerosol)|Spray in polvere (aerosol)"
    CUSCINETTO = "Cushion (hybrid)|Cuscinetto (ibrido)"
    GELATINA = "Jelly (hybrid)|Gelatina (ibrido)"
    PRODOTTO_BIFASICO = "Bi-Phase Product (hybrid)|Prodotto bifasico (ibrido)"
    MICROINCAPSULATO = "Encapsulated Actives (hybrid)|Attivi microincapsulati (ibrido)"


class CosmeticType(TranslatedEnum):
    LIQUID_FOUNDATION = "Liquid foundation|Fondotinta liquido"
    POWDER_FOUNDATION = "Powder foundation|Fondotinta in polvere"
    BB_CREAM = "BB cream|BB cream"
    CC_CREAM = "CC cream|CC cream"
    CONCEALER = "Concealer|Correttore"
    LOOSE_POWDER = "Loose powder|Cipria in polvere"
    PRESSED_POWDER = "Pressed powder|Cipria compatta"
    POWDER_BLUSH = "Powder blush|Blush in polvere"
    CREAM_BLUSH = "Cream blush|Blush in crema"
    LIQUID_BLUSH = "Liquid blush|Blush liquido"
    BRONZER = "Bronzer|Bronzer"
    HIGHLIGHTER = "Highlighter|Illuminante"
    FACE_PRIMER = "Face primer|Primer viso"
    SETTING_SPRAY = "Setting spray|Spray fissante"
    COLOR_CORRECTOR = "Color corrector|Correttore colorato"
    CONTOUR_POWDER = "Contour powder|Contouring in polvere"
    CONTOUR_CREAM = "Contour cream|Contouring in crema"
    TINTED_MOISTURIZER = "Tinted moisturizer|Crema colorata"
    POWDER_EYESHADOW = "Powder eyeshadow|Ombretto in polvere"
    CREAM_EYESHADOW = "Cream eyeshadow|Ombretto in crema"
    LIQUID_EYESHADOW = "Liquid eyeshadow|Ombretto liquido"
    PENCIL_EYELINER = "Pencil eyeliner|Matita occhi"
    LIQUID_EYELINER = "Liquid eyeliner|Eyeliner liquido"
    GEL_EYELINER = "Gel eyeliner|Eyeliner in gel"
    KOHL_LINER = "Kohl liner|Matita kohl"
    MASCARA = "Mascara|Mascara"
    WATERPROOF_MASCARA = "Waterproof mascara|Mascara waterproof"
    BROW_PENCIL = "Eyebrow pencil|Matita sopracciglia"
    BROW_GEL = "Eyebrow gel|Gel sopracciglia"
    BROW_POWDER = "Eyebrow powder|Polvere sopracciglia"
    EYE_PRIMER = "Eye primer|Primer occhi"
    FALSE_LASHES = "False eyelashes|Ciglia finte"
    LASH_GLUE = "Eyelash glue|Colla ciglia"
    BROW_POMADE = "Eyebrow pomade|Pomata sopracciglia"
    MATTE_LIPSTICK = "Matte lipstick|Rossetto opaco"
    CREAM_LIPSTICK = "Cream lipstick|Rossetto cremoso"
    SATIN_LIPSTICK = "Satin lipstick|Rossetto satinato"
    LIP_GLOSS = "Lip gloss|Lucidalabbra"
    LIP_LINER = "Lip liner|Matita labbra"
    LIP_STAIN = "Lip stain|Tinta labbra"
    LIP_BALM = "Lip balm|Balsamo labbra"
    LIP_PRIMER = "Lip primer|Primer labbra"
    LIP_PLUMPER = "Lip plumper|Volumizzante labbra"
    LIP_OIL = "Lip oil|Olio labbra"
    LIP_MASK = "Lip mask|Maschera labbra"
    LIQUID_LIPSTICK = "Liquid lipstick|Rossetto liquido"
    GEL_CLEANSER = "Gel cleanser|Detergente gel"
    FOAM_CLEANSER = "Foam cleanser|Detergente schiumoso"
    OIL_CLEANSER = "Oil cleanser|Detergente oleoso"
    CREAM_CLEANSER = "Cream cleanser|Detergente in crema"
    MICELLAR_WATER = "Micellar water|Acqua micellare"
    TONER = "Toner|Tonico"
    ESSENCE = "Essence|Essenza"
    SERUM = "Serum|Siero"
    MOISTURIZER = "Moisturizer|Idratante"
    FACE_OIL = "Face oil|Olio viso"
    SHEET_MASK = "Sheet mask|Maschera in tessuto"
    CLAY_MASK = "Clay mask|Maschera all'argilla"
    GEL_MASK = "Gel mask|Maschera in gel"
    CREAM_MASK = "Cream mask|Maschera in crema"
    EYE_CREAM = "Eye cream|Crema contorno occhi"
    PHYSICAL_EXFOLIATOR = "Physical exfoliator|Esfoliante fisico"
    CHEMICAL_EXFOLIATOR = "Chemical exfoliator|Esfoliante chimico"
    SUNSCREEN = "Sunscreen|Protezione solare"
    NIGHT_CREAM = "Night cream|Crema notte"
    FACE_MIST = "Face mist|Acqua spray"
    SPOT_TREATMENT = "Spot treatment|Trattamento localizzato"
    PORE_STRIPS = "Pore strips|Cerotti purificanti"
    PEELING_GEL = "Peeling gel|Gel esfoliante"
    BASE_COAT = "Base coat|Base smalto"
    NAIL_POLISH = "Nail polish|Smalto"
    TOP_COAT = "Top coat|Top coat"
    CUTICLE_OIL = "Cuticle oil|Olio cuticole"
    NAIL_STRENGTHENER = "Nail strengthener|Rinforzante unghie"
    QUICK_DRY_DROPS = "Quick dry drops|Gocce asciugatura rapida"
    NAIL_PRIMER = "Nail primer|Primer unghie"
    GEL_POLISH = "Gel polish|Smalto gel"
    ACRYLIC_POWDER = "Acrylic powder|Polvere acrilica"
    NAIL_GLUE = "Nail glue|Colla unghie"
    MAKEUP_BRUSHES = "Makeup brushes|Pennelli trucco"
    MAKEUP_SPONGES = "Makeup sponges|Spugnette trucco"
    EYELASH_CURLER = "Eyelash curler|Piegaciglia"
    TWEEZERS = "Tweezers|Pinzette"
    NAIL_CLIPPERS = "Nail clippers|Tagliaunghie"
    NAIL_FILE = "Nail file|Lima unghie"
    COTTON_PADS = "Cotton pads|Dischetti di cotone"
    MAKEUP_REMOVER_PADS = "Makeup remover pads|Dischetti struccanti"
    POWDER_PUFF = "Powder puff|Piumino cipria"
    FACIAL_ROLLER = "Facial roller|Rullo facciale"
    GUA_SHA = "Gua sha tool|Strumento gua sha"
    BRUSH_CLEANER = "Brush cleaner|Detergente pennelli"
    MAKEUP_ORGANIZER = "Makeup organizer|Organizzatore trucchi"
    MIRROR = "Mirror|Specchio"
    NAIL_BUFFER = "Nail buffer|Buffer unghie"


class NormalUser(TranslatedEnum):
    ADULTO = "Adult|Adulto"
    BAMBINO = "Child|Bambino"


class PlaceApplication(TranslatedEnum):
    VISO = "Face|Viso"


class RoutesExposure(TranslatedEnum):
    DERMAL = "Dermal|Dermale"
    OCULAR = "Ocular|Oculare"
    ORAL = "Oral|Orale"


class NanoRoutes(TranslatedEnum):
    DERMAL = "Dermal|Dermale"
    OCULAR = "Ocular|Oculare"
    ORAL = "Oral|Orale"
0
src/pif_compiler/functions/__init__.py
Normal file
245
src/pif_compiler/functions/_old/echaFind.py
Normal file
@@ -0,0 +1,245 @@
|
||||||
|
import requests
|
||||||
|
import urllib.parse
|
||||||
|
import re as standardre
|
||||||
|
import logging
|
||||||
|
import json
|
||||||
|
from bs4 import BeautifulSoup
|
||||||
|
|
||||||
|
|
||||||
|
# Settings per il logging
|
||||||
|
logging.basicConfig(
|
||||||
|
format="{asctime} - {levelname} - {message}",
|
||||||
|
style="{",
|
||||||
|
datefmt="%Y-%m-%d %H:%M",
|
||||||
|
filename="echa.log",
|
||||||
|
encoding="utf-8",
|
||||||
|
filemode="a",
|
||||||
|
level=logging.INFO,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Funzione inutile
|
||||||
|
def getCas(substance, ):
|
||||||
|
results = {}
|
||||||
|
req_0 = requests.get(
|
||||||
|
"https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText="
|
||||||
|
+ urllib.parse.quote(substance)
|
||||||
|
)
|
||||||
|
req_0_json = req_0.json()
|
||||||
|
try:
|
||||||
|
rmlId = req_0_json["items"][0]["substanceIndex"]["rmlId"]
|
||||||
|
rmlName = req_0_json["items"][0]["substanceIndex"]["rmlName"]
|
||||||
|
rmlCas = req_0_json["items"][0]["substanceIndex"]["rmlCas"]
|
||||||
|
|
||||||
|
results["rmlId"] = rmlId
|
||||||
|
results["rmlName"] = rmlName
|
||||||
|
results["rmlCas"] = rmlCas
|
||||||
|
except:
|
||||||
|
return False
|
||||||
|
return results
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
# Funzione per cercare il dossier dato in input un CAS, una sostanza o un EN
|
||||||
|
def search_dossier(substance, input_type='rmlCas'):
|
||||||
|
results = {}
|
||||||
|
# Il dizionario che farò tornare alla fine
|
||||||
|
|
||||||
|
# Prima parte. Ottengo rmlID e rmlName
|
||||||
|
# st.code('https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText='+substance)
|
||||||
|
req_0 = requests.get(
|
||||||
|
"https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText="
|
||||||
|
+ urllib.parse.quote(substance)
|
||||||
|
)
|
||||||
|
|
||||||
|
logging.info(f'echaFind.search_dossier(). searching "{substance}"')
|
||||||
|
|
||||||
|
#'La prima cosa da fare è fare una ricerca con il nome della sostanza ma trasformata attraverso urllib'
|
||||||
|
req_0_json = req_0.json()
|
||||||
|
try:
|
||||||
|
# Estraggo i campi che mi servono dalla response
|
||||||
|
rmlId = req_0_json["items"][0]["substanceIndex"]["rmlId"]
|
||||||
|
rmlName = req_0_json["items"][0]["substanceIndex"]["rmlName"]
|
||||||
|
rmlCas = req_0_json["items"][0]["substanceIndex"]["rmlCas"]
|
||||||
|
rmlEc = req_0_json["items"][0]["substanceIndex"]["rmlEc"]
|
||||||
|
|
||||||
|
results['search_response'] = f"https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText={urllib.parse.quote(substance)}"
|
||||||
|
results["rmlId"] = rmlId
|
||||||
|
results["rmlName"] = rmlName
|
||||||
|
results["rmlCas"] = rmlCas
|
||||||
|
results["rmlEc"] = rmlEc
|
||||||
|
|
||||||
|
logging.info(
|
||||||
|
f"echaFind.search_dossier(). found substance on ECHA. rmlId: '{rmlId}', rmlName: '{rmlName}', rmlCas: '{rmlCas}'"
|
||||||
|
)
|
||||||
|
except:
|
||||||
|
logging.info(
|
||||||
|
f"echaFind.search_dossier(). could not find substance for '{substance}'"
|
||||||
|
)
|
||||||
|
return False
|
||||||
|
|
||||||
|
# Update: in certi casi poteva verificarsi che inserendo un CAS si trovasse invece una sostanza con codice EN uguale al CAS in input.
|
||||||
|
# Ora controllo che la sostanza trovata abbia effettivamente un CAS uguale a quello inserito in input.
|
||||||
|
# è inoltre possibile cercare per rmlName (nome della sostanza) o EN (rmlEn): basta specificare in input_type per cosa si sta cercando
|
||||||
|
if results[input_type] != substance:
|
||||||
|
logging.error(f'echa.echaFind.search_dossier(): results[{input_type}] "{results[input_type]}is not equal to "{substance}". ')
|
||||||
|
return f'search_error. results[{input_type}] ("{results[input_type]}") is not equal to "{substance}". Maybe you specified the wrong input_type. Check the results here: https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText={urllib.parse.quote(substance)}'
|
||||||
|
|
||||||
|
# Part two. Search the ECHA site for dossiers, building a link with the previously obtained ID.
req_1_url = (
    "https://chem.echa.europa.eu/api-dossier-list/v1/dossier?pageIndex=1&pageSize=100&rmlId="
    + rmlId
    + "&registrationStatuses=Active"
)  # Search the active dossiers first.

req_1 = requests.get(req_1_url)
req_1_json = req_1.json()

# If no active dossiers exist, search the inactive ones
if req_1_json["items"] == []:
    logging.info(
        f"echaFind.search_dossier(). could not find active dossiers for '{substance}'. Proceeding to search the inactive ones."
    )
    req_1_url = (
        "https://chem.echa.europa.eu/api-dossier-list/v1/dossier?pageIndex=1&pageSize=100&rmlId="
        + rmlId
        + "&registrationStatuses=Inactive"
    )
    req_1 = requests.get(req_1_url)
    req_1_json = req_1.json()
    if req_1_json["items"] == []:
        logging.info(
            f"echaFind.search_dossier(). could not find inactive dossiers for '{substance}'"
        )  # Found neither inactive nor active dossiers
        return False
    else:
        logging.info(
            f"echaFind.search_dossier(). found inactive dossiers for '{rmlName}'"
        )
        results["dossierType"] = "Inactive"

else:
    logging.info(
        f"echaFind.search_dossier(). found active dossiers for '{substance}'"
    )
    results["dossierType"] = "Active"
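The mangled `®istrationStatuses` in the original URLs is what happens when a hand-concatenated `&registrationStatuses` query string passes through HTML tooling (`&reg` renders as `®`). A minimal sketch of a safer way to build the same dossier-list URL with `urllib.parse.urlencode` (the `rmlId` value here is hypothetical, for illustration only):

```python
from urllib.parse import urlencode

# Build the dossier-list URL from a dict of parameters instead of string
# concatenation, so '&registrationStatuses' can never be mangled into '®...'.
params = {
    "pageIndex": 1,
    "pageSize": 100,
    "rmlId": "100.028.320",  # hypothetical rmlId, for illustration only
    "registrationStatuses": "Active",
}
url = "https://chem.echa.europa.eu/api-dossier-list/v1/dossier?" + urlencode(params)
print(url)
```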
# These are the two pieces of information we need
assetExternalId = req_1_json["items"][0]["assetExternalId"]

# UPDATE: to obtain the date of the last modification
try:
    lastUpdateDate = req_1_json["items"][0]["lastUpdatedDate"]
    datetime_object = datetime.fromisoformat(lastUpdateDate.replace('Z', '+00:00'))  # Strip a trailing 'Z', which fromisoformat does not accept before Python 3.11
    lastUpdateDate = datetime_object.date().isoformat()
    results['lastUpdateDate'] = lastUpdateDate
except Exception:
    logging.error(f"echa.echaFind(). Could not find lastUpdateDate for the dossier")

rootKey = req_1_json["items"][0]["rootKey"]
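The date handling above can be exercised in isolation; this sketch uses a sample timestamp (not taken from a real dossier) to show the `Z`-to-`+00:00` workaround:

```python
from datetime import datetime

# Timestamps like '2023-05-04T09:30:00Z' end in 'Z' (UTC), which
# datetime.fromisoformat rejects before Python 3.11; replacing it with
# '+00:00' keeps the parse portable.
raw = "2023-05-04T09:30:00Z"  # sample value, for illustration only
parsed = datetime.fromisoformat(raw.replace("Z", "+00:00"))
last_update = parsed.date().isoformat()
print(last_update)  # 2023-05-04
```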
# Part three. Use the assetExternalId.
# "With the assetExternalId we can reach the main page of the dossier."
# "From that page we need to scrape the ID of the toxicological summary, IF IT EXISTS"
results["index"] = (
    "https://chem.echa.europa.eu/html-pages/" + assetExternalId + "/index.html"
)
results["index_js"] = (
    f"https://chem.echa.europa.eu/{rmlId}/dossier-view/{assetExternalId}"
)

req_2 = requests.get(
    "https://chem.echa.europa.eu/html-pages/" + assetExternalId + "/index.html"
)
index = BeautifulSoup(req_2.text, "html.parser")
# Part four. Get the ID of the toxicological summary from index.html
# "In all that HTML we only care about one div. BeautifulSoup struggles when there are too many nested divs, so I use a combination of it and regex"

div = index.find_all("div", id=["id_7_Toxicologicalinformation"])
str_div = str(div)
str_div = str_div.split("</div>")[0]

uic_found = False
href_match = standardre.search('href="([^"]+)"', str_div)  # A regex to find the href we need
if href_match is None:
    logging.info(
        f"echaFind.search_dossier(). Could not find 'id_7_Toxicologicalinformation' in the body"
    )
else:
    UIC = href_match.group(1)
    uic_found = True
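The same `href` pattern can be run standalone; the anchor tag below is made up for illustration:

```python
import re

# The pattern used above, applied to a hypothetical anchor tag.
snippet = '<a href="documents/abc123.html">Toxicological information</a>'
match = re.search(r'href="([^"]+)"', snippet)
if match is not None:  # mirror the None guard used above
    uic = match.group(1)
print(uic)  # documents/abc123.html
```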
# For acute toxicity
acute_toxicity_found = False
div_acute_toxicity = index.find_all("div", id=["id_72_AcuteToxicity"])
if div_acute_toxicity:
    for div in div_acute_toxicity:
        try:
            a = div.find_all("a", href=True)[0]
            acute_toxicity_id = standardre.search('href="([^"]+)"', str(a)).group(1)
            acute_toxicity_found = True
        except (IndexError, AttributeError):
            logging.info(
                f"echaFind.search_dossier(). No acute_toxicity_id found from index for {substance}"
            )

# For repeated dose
repeated_dose_found = False
div_repeated_dose = index.find_all("div", id=["id_75_Repeateddosetoxicity"])
if div_repeated_dose:
    for div in div_repeated_dose:
        try:
            a = div.find_all("a", href=True)[0]
            repeated_dose_id = standardre.search('href="([^"]+)"', str(a)).group(1)
            repeated_dose_found = True
        except (IndexError, AttributeError):
            logging.info(
                f"echaFind.search_dossier(). No repeated_dose_id found from index for {substance}"
            )
# Part five. Fetch the HTML of the toxicological dossier and return the content

if acute_toxicity_found:
    acute_toxicity_link = (
        "https://chem.echa.europa.eu/html-pages/"
        + assetExternalId
        + "/documents/"
        + acute_toxicity_id
        + ".html"
    )
    results["AcuteToxicity"] = acute_toxicity_link
    results["AcuteToxicity_js"] = (
        f"https://chem.echa.europa.eu/{rmlId}/dossier-view/{assetExternalId}/{acute_toxicity_id}"
    )

if uic_found:
    # UIC is the id of the toxicological summary
    final_url = (
        "https://chem.echa.europa.eu/html-pages/"
        + assetExternalId
        + "/documents/"
        + UIC
        + ".html"
    )
    results["ToxSummary"] = final_url
    results["ToxSummary_js"] = (
        f"https://chem.echa.europa.eu/{rmlId}/dossier-view/{assetExternalId}/{UIC}"
    )

if repeated_dose_found:
    results["RepeatedDose"] = (
        "https://chem.echa.europa.eu/html-pages/"
        + assetExternalId
        + "/documents/"
        + repeated_dose_id
        + ".html"
    )
    results["RepeatedDose_js"] = (
        f"https://chem.echa.europa.eu/{rmlId}/dossier-view/{assetExternalId}/{repeated_dose_id}"
    )

json_formatted_str = json.dumps(results)
logging.info(f"echaFind.search_dossier() OK. output: {json_formatted_str}")
return results
946
src/pif_compiler/functions/_old/echaProcess.py
Normal file
@ -0,0 +1,946 @@
from src.func.echaFind import search_dossier
from bs4 import BeautifulSoup
from markdownify import MarkdownConverter
import pandas as pd
import requests
import os
import re
import markdown_to_json
import json
import copy
import unicodedata
from datetime import datetime
import logging
import duckdb

# Logging settings
logging.basicConfig(
    format="{asctime} - {levelname} - {message}",
    style="{",
    datefmt="%Y-%m-%d %H:%M",
    filename="echa.log",
    encoding="utf-8",
    filemode="a",
    level=logging.INFO,
)
try:
    # Load the full scraping into memory, if it exists
    con = duckdb.connect()
    res = con.sql("""
        CREATE TABLE echa_full_scraping AS
        SELECT * FROM read_csv_auto('src/data/echa_full_scraping.csv');
    """)
    logging.info(
        f"echa.echaProcess().main: Loaded echa scraped data into duckdb memory. First CAS in the df is: {con.sql('select CAS from echa_full_scraping limit 1').fetchone()[0]}"
    )
    local_echa = True
except Exception:
    local_echa = False  # must be defined either way, it is checked later in echaExtract
    logging.error(f"echa.echaProcess().main: No local echa scraped data found")
# Method to find the relevant information on the ECHA site.
# Works both with the substance name and with the CAS.
def openEchaPage(link, local=False):
    soup = None  # returned as-is if opening the page fails
    try:
        if local:
            page = open(link, encoding="utf8")
            soup = BeautifulSoup(page, "html.parser")
        else:
            page = requests.get(link)
            page.encoding = "utf-8"
            soup = BeautifulSoup(page.text, "html.parser")
    except Exception:
        logging.error(
            f"echa.echaProcess.openEchaPage() error. could not open: '{link}'",
            exc_info=True,
        )
    return soup
# Method to turn the ECHA page into a Markdown document
def echaPage_to_md(sezione, scrapingType=None, local=False, substance=None):
    # sezione : the soup of the page extracted through search_dossier
    # scrapingType : 'RepeatedDose' or 'AcuteToxicity'
    # local : whether to save the markdown content locally. Useful for debugging
    # substance : the name of the substance, to save it under the correct path

    # Create shorthand method for conversion
    def md(soup, **options):
        return MarkdownConverter(**options).convert_soup(soup)

    output = md(sezione)
    # Turn the HTML section into a markdown document, which still needs fixing.

    # The way the .md is fixed changes slightly depending on the type of page being scraped.
    # Exceptions are added as new substances get tested.
    if scrapingType == "RepeatedDose":
        output = output.replace("### Oral route", "#### oral")
        output = output.replace("### Dermal", "#### dermal")
        output = output.replace("### Inhalation", "#### inhalation")
        # '>' and '<' must be replaced with words, otherwise the jsonifier interprets those two symbols as markup and wraps the text in []
        output = re.sub(r">\s+", "greater than ", output)
        # Replace '<' followed by whitespace with 'less than '
        output = re.sub(r"<\s+", "less than ", output)
        output = re.sub(r">=\s*\n", "greater or equal than ", output)
        output = re.sub(r"<=\s*\n", "less or equal than ", output)

    elif scrapingType == "AcuteToxicity":
        # '>' and '<' must be replaced with words, otherwise the jsonifier interprets those two symbols as markup and wraps the text in []
        output = re.sub(r">\s+", "greater than ", output)
        # Replace '<' followed by whitespace with 'less than '
        output = re.sub(r"<\s+", "less than ", output)
        output = re.sub(r">=\s*\n", "greater or equal than ", output)
        output = re.sub(r"<=\s*\n", "less or equal than ", output)

    output = output.replace("–", "-")

    output = re.sub(r"\s+mg", " mg", output)
    # This part fixes units of measure that wrap onto a new line, separated from their value

    if local and substance:
        path = f"{scrapingType}/mds/{substance}.md"
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as text_file:
            text_file.write(output)

    return output
# This is part 2 of the ECHA site processing: the markdown must be turned into a JSON
def markdown_to_json_raw(output, scrapingType=None, local=False, substance=None):
    # output : the markdown
    # scrapingType : 'RepeatedDose' or 'AcuteToxicity'
    # substance : the name of the substance, to save it under the correct path
    jsonified = markdown_to_json.jsonify(output)
    dictified = json.loads(jsonified)

    # Save the initial json exactly as it comes out of jsonify
    if local and scrapingType and substance:
        path = f"{scrapingType}/jsons/raws/{substance}_raw0.json"
        os.makedirs(os.path.dirname(path), exist_ok=True)

        with open(path, "w") as text_file:
            text_file.write(jsonified)

    # Now split the contents of the nested dictionaries.
    for key, value in dictified.items():
        if isinstance(value, dict):
            for key2, value2 in value.items():
                parts = value2.split("\n\n")
                dictified[key][key2] = {
                    parts[i]: parts[i + 1]
                    for i in range(0, len(parts) - 1, 2)
                    if parts[i + 1] != "[Empty]"
                }
        else:
            parts = value.split("\n\n")
            dictified[key] = {
                parts[i]: parts[i + 1]
                for i in range(0, len(parts) - 1, 2)
                if parts[i + 1] != "[Empty]"
            }

    jsonified = json.dumps(dictified)

    if local and scrapingType and substance:
        path = f"{scrapingType}/jsons/raws/{substance}_raw1.json"
        os.makedirs(os.path.dirname(path), exist_ok=True)

        with open(path, "w") as text_file:
            text_file.write(jsonified)

    return jsonified
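The pairwise split in the loop above turns a flat string of alternating labels and values into key/value pairs, dropping `[Empty]` values. A minimal standalone run on a hypothetical value (not taken from a real dossier):

```python
# A hypothetical value shaped like the output of markdown_to_json.jsonify:
# alternating labels and values separated by blank lines.
value = "Dose descriptor\n\nNOAEL\n\nSpecies\n\n[Empty]\n\nEffect level\n\n30 mg/kg"
parts = value.split("\n\n")
pairs = {
    parts[i]: parts[i + 1]
    for i in range(0, len(parts) - 1, 2)
    if parts[i + 1] != "[Empty]"
}
print(pairs)  # the Species entry is dropped because its value is [Empty]
```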
# Method written by Claude to fix the Unicode-character issues
def normalize_unicode_characters(text):
    """
    Normalize Unicode characters, with special handling for superscripts
    """
    if not isinstance(text, str):
        return text

    # Specific replacements for common Unicode encoding issues
    # and for other particular exceptions
    replacements = {
        "\u00c2\u00b2": "²",  # mojibake 'Â²' -> ²
        "\u00c2\u00b3": "³",  # mojibake 'Â³' -> ³
        "\u00b2": "²",  # Bare superscript 2
        "\u00b3": "³",  # Bare superscript 3
        "\n": "",  # every now and then there are stray \n characters to remove
        "greater than": ">",
        "less than": "<",
        "greater or equal than": ">=",
        "less or equal than": "<=",
        # These last entries are mine: > and < cause problems, so they were temporarily renamed
    }

    # Apply specific replacements first
    for old, new in replacements.items():
        text = text.replace(old, new)

    # Normalize Unicode characters
    text = unicodedata.normalize("NFKD", text)

    return text
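A standalone replica of the replace-then-normalize pipeline shows a subtlety worth knowing: NFKD applies compatibility decompositions, so a `²` restored from mojibake ends up as a plain `2`. A minimal sketch (only two of the replacement entries, for brevity):

```python
import unicodedata

# Minimal replica of the normalization above: fix the 'Â²' mojibake,
# strip newlines, then NFKD-normalize. NFKD decomposes '²' to '2'.
def normalize_sketch(text):
    for old, new in {"\u00c2\u00b2": "²", "\n": ""}.items():
        text = text.replace(old, new)
    return unicodedata.normalize("NFKD", text)

print(normalize_sketch("10 mg/m\u00c2\u00b2\n"))  # 10 mg/m2
```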
# Another method written by Claude.
# Apparently my brain is too small to recursively loop over a nested
# dictionary. If only we had taken Algorithms and Data Structures...
def clean_json(data):
    """
    Recursively clean JSON by removing empty/uninformative entries
    and normalizing Unicode characters
    """

    def is_uninformative(value, context=None):
        """
        Check if a dictionary entry is considered uninformative

        Args:
            value: The value to check
            context: Additional context about where the value is located
        """
        # Specific exceptions
        if context and context == "Key value for chemical safety assessment":
            # Always keep all entries in this specific section
            return False

        uninformative_values = ["hours/week", "", None]

        return value in uninformative_values or (
            isinstance(value, str)
            and (
                value.strip() in uninformative_values
                or value.lower() == "no information available"
            )
        )

    def clean_recursive(obj, context=None):
        # If it's a dictionary, process its contents
        if isinstance(obj, dict):
            # Create a copy to modify
            cleaned = {}
            for key, value in obj.items():
                # Normalize key
                normalized_key = normalize_unicode_characters(key)

                # Set context for nested dictionaries
                new_context = context or normalized_key

                # Recursively clean nested structures
                cleaned_value = clean_recursive(value, new_context)

                # Conditions for keeping the entry
                keep_entry = (
                    cleaned_value not in [None, {}, ""]
                    and not (
                        isinstance(cleaned_value, dict) and len(cleaned_value) == 0
                    )
                    and not is_uninformative(cleaned_value, new_context)
                )

                # Add to cleaned dict if conditions are met
                if keep_entry:
                    cleaned[normalized_key] = cleaned_value

            return cleaned if cleaned else None

        # If it's a list, clean each item
        elif isinstance(obj, list):
            cleaned_list = [clean_recursive(item, context) for item in obj]
            cleaned_list = [item for item in cleaned_list if item not in [None, {}, ""]]
            return cleaned_list if cleaned_list else None

        # For strings, normalize Unicode
        elif isinstance(obj, str):
            return normalize_unicode_characters(obj)

        # Return as-is for other types
        return obj

    # Create a deep copy to avoid modifying original data
    cleaned_data = clean_recursive(copy.deepcopy(data))
    # Yes, this is the part that drove me crazy:
    # looping over nested dictionaries without being able to change their structure
    return cleaned_data
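The recursive pruning can be exercised on a toy nested dict. This sketch reimplements only the core idea (drop `None`/empty/"no information available" entries, and drop parents that become empty); the input data is hypothetical, not a real dossier:

```python
# A reduced version of clean_recursive: prune None/empty/'no information
# available' values from a nested dict, keeping everything else.
def prune(obj):
    if isinstance(obj, dict):
        cleaned = {k: prune(v) for k, v in obj.items()}
        cleaned = {k: v for k, v in cleaned.items() if v not in (None, {}, "")}
        return cleaned or None
    if isinstance(obj, str) and obj.lower() == "no information available":
        return None
    return obj

data = {  # hypothetical dossier fragment, for illustration only
    "oral": {"Dose descriptor": "NOAEL", "Species": "no information available"},
    "dermal": {"Effect level": ""},
}
print(prune(data))  # the 'dermal' branch disappears entirely
```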
def json_to_dataframe(cleaned_json, scrapingType):
    rows = []
    schema = {
        "RepeatedDose": [
            "Substance",
            "CAS",
            "Toxicity Type",
            "Route",
            "Dose descriptor",
            "Effect level",
            "Species",
            "Extraction_Timestamp",
            "Endpoint conclusion",
        ],
        "AcuteToxicity": [
            "Substance",
            "CAS",
            "Route",
            "Endpoint conclusion",
            "Dose descriptor",
            "Effect level",
            "Extraction_Timestamp",
        ],
    }
    if scrapingType == "RepeatedDose":
        # Iterate through top-level sections (excluding 'Key value for chemical safety assessment')
        for toxicity_type, routes in cleaned_json.items():
            if toxicity_type == "Key value for chemical safety assessment":
                continue

            # Iterate through routes within each toxicity type
            for route, details in routes.items():
                row = {"Toxicity Type": toxicity_type, "Route": route}

                # Add details to the row, excluding 'Link to relevant study record(s)'
                row.update(
                    {
                        k: v
                        for k, v in details.items()
                        if k != "Link to relevant study record(s)"
                    }
                )
                rows.append(row)
    elif scrapingType == "AcuteToxicity":
        for toxicity_type, routes in cleaned_json.items():
            if (
                toxicity_type == "Key value for chemical safety assessment"
                or not routes
            ):
                continue

            row = {
                "Route": toxicity_type.replace("Acute toxicity: via", "")
                .replace("route", "")
                .strip()
            }

            # Add details directly from the routes dictionary
            row.update(
                {
                    k: v
                    for k, v in routes.items()
                    if k != "Link to relevant study record(s)"
                }
            )
            rows.append(row)

    # Create DataFrame
    df = pd.DataFrame(rows)

    # Last-moment fixes, to enforce a schema
    fair_columns = list(set(schema["RepeatedDose"] + schema["AcuteToxicity"]))
    df = df.loc[:, df.columns.intersection(fair_columns)]
    return df
def save_dataframe(df, file_path, scrapingType, schema):
    """
    Save DataFrame with strict column requirements.

    Args:
        df (pd.DataFrame): DataFrame to potentially append
        file_path (str): Path of CSV file
    """
    # Mandatory columns for the saved DataFrame
    saved_columns = schema[scrapingType]

    # Check that the input DataFrame has at least an Effect level column
    if not all(col in df.columns for col in ["Effect level"]):
        return

    # Reindex to match the saved columns, filling missing ones with NaN
    df = df.reindex(columns=saved_columns)

    df = df[df["Effect level"].notna()]
    # Skip the rows that have no value for Effect level

    # Append or save the DataFrame
    df.to_csv(
        file_path,
        mode="a" if os.path.exists(file_path) else "w",
        header=not os.path.exists(file_path),
        index=False,
    )
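The append-or-create pattern in `save_dataframe` (write the header only when the file does not exist yet, append otherwise) can be sketched with the stdlib `csv` module, independent of pandas; the path and column names here are throwaway examples:

```python
import csv
import os
import tempfile

# Same append-or-create pattern as save_dataframe, using the stdlib csv
# module: write the header only on first creation, then append rows.
def append_rows(file_path, columns, rows):
    exists = os.path.exists(file_path)
    with open(file_path, "a" if exists else "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns, restval="")
        if not exists:
            writer.writeheader()
        writer.writerows(rows)

path = os.path.join(tempfile.mkdtemp(), "demo.csv")  # throwaway path
cols = ["Substance", "Effect level"]
append_rows(path, cols, [{"Substance": "demo", "Effect level": "30 mg/kg"}])
append_rows(path, cols, [{"Substance": "demo2"}])  # missing key -> restval ""
with open(path) as f:
    lines = f.read().splitlines()
print(lines)
```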
def echaExtract(
    substance: str,
    scrapingType: str,
    outputType="df",
    key_infos=False,
    local_search=False,
    local_only=False
):
    """
    Main function for scraping the ECHA site. It ties together several different search, extraction and cleaning functions.
    It logs the operations it performs.

    Args:
        substance (str): CAS number or name of the substance. Both work, but the CAS works better.
        scrapingType (str): 'AcuteToxicity' (LD50) or 'RepeatedDose' (NOAEL)
        outputType (str): 'df' (pd.DataFrame) or 'json' (not recommended)
        key_infos (bool): False by default. Whether to also look for the "Description of Key Information" section in the dossiers.
            Some substances have their data entered sloppily and put the information there in narrative form instead of elsewhere.

    Output:
        a dataframe or a json, or
        f"No active or inactive lead dossiers exist for {substance}"
    """
    # if local_search = True, attempt a local search first. Otherwise go online.
    if local_search and local_echa:
        result = echaExtract_local(substance, scrapingType, key_infos)

        if not result.empty:
            logging.info(
                f"echa.echaProcess.echaExtract(): Found local data for {scrapingType}, {substance}. Returning it."
            )
            return result
        elif result.empty:
            logging.info(
                f"echa.echaProcess.echaExtract(): Have not found local data for {scrapingType}, {substance}. Continuing."
            )
    if local_only:
        logging.info(f'echa.echaProcess.echaExtract(): No data found in local-only search for {substance}, {scrapingType}')
        return f'No data found in local-only search for {substance}, {scrapingType}'
    try:
        # search_dossier finds the dossier information by searching the ECHA site for the given substance.
        links = search_dossier(substance)
        if not links:
            logging.info(
                f'echaProcess.echaExtract(). no active or inactive lead dossiers for: "{substance}". Ending extraction.'
            )
            return f"No active or inactive lead dossiers exist for {substance}"
        # This happens when no LEAD dossiers (the ones with the toxicological summaries), active or inactive, exist

        # If they exist, open the page of interest ('AcuteToxicity' or 'RepeatedDose')

        if scrapingType not in list(links.keys()):
            logging.info(
                f'echaProcess.echaExtract(). No page for "{scrapingType}", "{substance}"'
            )
            return f'No data in "{scrapingType}", "{substance}". Page does not exist.'

        soup = openEchaPage(link=links[scrapingType])
        logging.info(
            f"echaProcess.echaExtract(). soupped '{scrapingType}' echa page for '{substance}'"
        )
        # Grab the section we need
        sezione = None
        try:
            sezione = soup.find(
                "section",
                class_="KeyValueForChemicalSafetyAssessment",
                attrs={"data-cy": "das-block"},
            )
        except Exception:
            logging.error(
                f'echaProcess.echaExtract(). could not extract the "section" for "{scrapingType}" for "{substance}"',
                exc_info=True,
            )

        # Get the current timestamp
        now = datetime.now()

        # UPDATE. Look for the key infos
        key_infos_faund = False
        if key_infos:
            try:
                key_infos = soup.find(
                    "section",
                    class_="KeyInformation",
                    attrs={"data-cy": "das-block"},
                )
                if key_infos:
                    key_infos = key_infos.find(
                        "div",
                        class_="das-field_value das-field_value_html",
                    )
                    key_infos = key_infos.text
                    key_infos = key_infos if key_infos.strip() != "[Empty]" else None
                    if key_infos:
                        key_infos_faund = True
                        logging.info(
                            f"echaProcess.echaExtract(). Extracted key_infos from '{scrapingType}' echa page for '{substance}': {key_infos}"
                        )
                        key_infos_df = pd.DataFrame(index=[0])
                        key_infos_df["key_information"] = key_infos
                        key_infos_df = df_wrapper(
                            df=key_infos_df,
                            rmlName=links["rmlName"],
                            rmlCas=links["rmlCas"],
                            timestamp=now.strftime("%Y-%m-%d"),
                            dossierType=links["dossierType"],
                            page=scrapingType,
                            linkPage=links[scrapingType],
                            key_infos=True,
                        )
                    else:
                        logging.error(
                            f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() ERROR. could not extract key_infos for "{scrapingType}", "{substance}"'
                        )
                else:
                    logging.error(
                        f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() ERROR. could not extract key_infos for "{scrapingType}", "{substance}"'
                    )
            except Exception:
                logging.error(
                    f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() ERROR. could not extract key_infos for "{scrapingType}", "{substance}"',
                    exc_info=True,
                )
        try:
            if not sezione:
                logging.error(
                    f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() Empty section for the html > markdown conversion. No data for "{scrapingType}", "{substance}"'
                )
                if not key_infos_faund:
                    return f'No data in "{scrapingType}", "{substance}"'
                else:
                    # If there is no data but the key informations exist, return those
                    return key_infos_df

            # Turn the html section into markdown
            output = echaPage_to_md(
                sezione, scrapingType=scrapingType, substance=substance
            )
            logging.info(
                f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() OK. created MD for "{scrapingType}", "{substance}"'
            )

            # There are rare cases where no pages exist at all for acute toxicity or repeated dose. In that case output will be empty and raise an error
            # logging.info(output)
        except Exception:
            logging.error(
                f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() ERROR. could not create MD for "{scrapingType}", "{substance}"',
                exc_info=True,
            )
        try:
            # Turn the markdown into the first raw json
            jsonified = markdown_to_json_raw(
                output, scrapingType=scrapingType, substance=substance
            )
            logging.info(
                f'echaProcess.echaExtract() > echaProcess.markdown_to_json_raw() OK. created initial json for "{scrapingType}", "{substance}"'
            )
        except Exception:
            logging.error(
                f'echaProcess.echaExtract() > echaProcess.markdown_to_json_raw() ERROR. could not create initial json for "{scrapingType}", "{substance}"',
                exc_info=True,
            )

        json_data = json.loads(jsonified)

        try:
            # Second step of the json processing: clean the most deeply nested dictionaries
            cleaned_data = clean_json(json_data)
            logging.info(
                f'echaProcess.echaExtract() > echaProcess.clean_json() OK. cleaned the json for "{scrapingType}", "{substance}"'
            )
            # If cleaned_data is empty, there is no data
            if not cleaned_data:
                logging.error(
                    f'echaProcess.echaExtract() > echaProcess.clean_json() Empty cleaned_json. No data for "{scrapingType}", "{substance}"'
                )
                if not key_infos_faund:
                    return f'No data in "{scrapingType}", "{substance}"'
                else:
                    # If there is no data but the key informations exist, return those
                    return key_infos_df
        except Exception:
            logging.error(
                f'echaProcess.echaExtract() > echaProcess.clean_json() ERROR. cleaning the json for "{scrapingType}", "{substance}"'
            )
        # If the desired output is a dataframe, create one and add a timestamp
        try:
            df = json_to_dataframe(cleaned_data, scrapingType)
            df = df_wrapper(
                df=df,
                rmlName=links["rmlName"],
                rmlCas=links["rmlCas"],
                timestamp=now.strftime("%Y-%m-%d"),
                dossierType=links["dossierType"],
                page=scrapingType,
                linkPage=links[scrapingType],
            )

            if outputType == "df":
                logging.info(
                    f'echaProcess.echaExtract(). successfully extracted "{scrapingType}", "{substance}". Returning df'
                )

                # If the user wants the key infos and they were found, merge the two dataframes
                return df if not key_infos_faund else pd.concat([key_infos_df, df])

            elif outputType == "json":
                if key_infos_faund:
                    df = pd.concat([key_infos_df, df])
                jayson = df.to_json(orient="records", force_ascii=False)
                logging.info(
                    f'echaProcess.echaExtract(). successfully extracted "{scrapingType}", "{substance}". Returning json'
                )
                return jayson
        except KeyError:
            # To handle the awful pages that only contain "no information available"

            if key_infos_faund:
                return key_infos_df

            json_output = list(cleaned_data[list(cleaned_data.keys())[0]].values())
            if json_output == ["no information available" for elem in json_output]:
                logging.info(
                    f"echaProcess.echaExtract(). No data found for {scrapingType} for {substance}"
                )
                return f'No data in "{scrapingType}", "{substance}"'
            else:
                logging.error(
                    f"echaProcess.json_to_dataframe(). Could not create dataframe"
                )
                cleaned_data["error"] = (
                    "Could not create the dataframe, probably there is not enough information. Returning the JSON"
                )
                return cleaned_data

    except Exception:
        logging.error(
            f"echaProcess.echaExtract() ERROR. Something went wrong, not quite sure what.",
            exc_info=True,
        )
def df_wrapper(
    df, rmlName, rmlCas, timestamp, dossierType, page, linkPage, key_infos=False
):
    # A simple method to add everything we need to the dataframe,
    # so as not to clutter echaExtract, which is already messy enough
    df.insert(0, "Substance", rmlName)
    df.insert(1, "CAS", rmlCas)
    df["Extraction_Timestamp"] = timestamp
    df = df.replace("\n", "", regex=True)
    if not key_infos:
        df = df[df["Effect level"].notnull()]

    # Add the dossier link and its status
    df["dossierType"] = dossierType
    df["page"] = page
    df["linkPage"] = linkPage
    return df
def echaExtract_specific(
    CAS: str,
    scrapingType="RepeatedDose",
    doseDescriptor="NOAEL",
    route="inhalation",
    local_search=False,
    local_only=False,
):
    """
    Given a CAS number, tries to find the dose descriptor (default NOAEL) for the specified route (default 'inhalation').

    Args:
        CAS (str): the CAS number, or alternatively the substance name
        route (str): 'inhalation', 'oral', 'dermal'. Defaults to 'inhalation'
        scrapingType (str): the page to search on
        doseDescriptor (str): the kind of value to look for (NOAEL, DNEL, LD50, LC50)
    """

    # Attempt the extraction
    result = echaExtract(
        substance=CAS,
        scrapingType=scrapingType,
        outputType="df",
        local_search=local_search,
        local_only=local_only,
    )

    # Is the result a dataframe?
    if isinstance(result, pd.DataFrame):
        # If so, filter it down to what we are interested in
        filtered_df = result[
            (result["Route"] == route) & (result["Dose descriptor"] == doseDescriptor)
        ]
        # Return it if it is not empty
        if not filtered_df.empty:
            return filtered_df
        else:
            return f'Non ho trovato {doseDescriptor} in {scrapingType} con route "{route}" per {CAS}'

    elif isinstance(result, dict) and result.get("error"):
        # This means a JSON carrying an error came back
        return f'Non ho trovato {doseDescriptor} in {scrapingType} con route "{route}" per {CAS}'

    # This means the result was a "Non esistono" message: no active or inactive lead dossiers exist for the searched substance
    elif isinstance(result, str) and result.startswith("Non esistono"):
        return result


def echa_noael_ld50(CAS: str, route="inhalation", outputType="df", local_search=False, local_only=False):
    """
    Given a CAS number, tries to find the NOAEL for the specified route (default 'inhalation').
    If the RepeatedDose page with the NOAEL does not exist, returns the LD50 for that route instead.

    Args:
        CAS (str): the CAS number, or alternatively the substance name
        route (str): 'inhalation', 'oral', 'dermal'. Defaults to 'inhalation'
        outputType (str): 'df', 'json'. The output type
    """
    if route not in ["inhalation", "oral", "dermal"] or outputType not in [
        "df",
        "json",
    ]:
        return "invalid input"
    # By default, try to scrape the "Repeated Dose" page
    first_attempt = echaExtract_specific(
        CAS=CAS,
        scrapingType="RepeatedDose",
        doseDescriptor="NOAEL",
        route=route,
        local_search=local_search,
        local_only=local_only,
    )

    if isinstance(first_attempt, pd.DataFrame):
        return first_attempt
    elif isinstance(first_attempt, str) and first_attempt.startswith("Non ho trovato"):
        second_attempt = echaExtract_specific(
            CAS=CAS,
            scrapingType="AcuteToxicity",
            doseDescriptor="LD50",
            route=route,
            local_search=True,
            local_only=local_only,
        )
        if isinstance(second_attempt, pd.DataFrame):
            return second_attempt
        elif isinstance(second_attempt, str) and second_attempt.startswith(
            "Non ho trovato"
        ):
            return second_attempt.replace("LD50", "NOAEL ed LD50")
    elif isinstance(first_attempt, str) and first_attempt.startswith("Non esistono"):
        return first_attempt


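The function above implements a try-primary-then-fall-back lookup: ask for the NOAEL first and only query the LD50 when the first lookup reports a miss. The pattern can be sketched without pandas or network access; `FAKE_DB`, `lookup`, and `noael_or_ld50` below are hypothetical stand-ins, not part of the real module.

```python
# Minimal, dependency-free sketch of the NOAEL-then-LD50 fallback above.
# FAKE_DB stands in for echaExtract_specific; None means "not found".
FAKE_DB = {
    ("50-00-0", "RepeatedDose", "NOAEL"): None,        # no NOAEL on record
    ("50-00-0", "AcuteToxicity", "LD50"): "LD50=100 mg/kg",
    ("64-17-5", "RepeatedDose", "NOAEL"): "NOAEL=1 mg/kg",
}


def lookup(cas, page, descriptor):
    """Stand-in for a single-page extraction; None means 'not found'."""
    return FAKE_DB.get((cas, page, descriptor))


def noael_or_ld50(cas):
    """Return the NOAEL if present, otherwise fall back to the LD50."""
    primary = lookup(cas, "RepeatedDose", "NOAEL")
    if primary is not None:
        return primary
    secondary = lookup(cas, "AcuteToxicity", "LD50")
    if secondary is not None:
        return secondary
    return f"no NOAEL or LD50 found for {cas}"
```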
def echa_noael_ld50_multi(
    casList: list, route="inhalation", messages=False, local_search=False, local_only=False
):
    """
    Fairly simple helper. Given a list of CAS numbers, runs echa_noael_ld50 on each: it looks for the NOAELs for the desired route, falling back to the LD50s when no NOAEL is found.
    The output is a dataframe for the substances it finds and a list of messages for those it does not.

    Args:
        casList (list): the list of CAS numbers
        route (str): 'inhalation', 'oral', 'dermal'. Defaults to 'inhalation'
        messages (bool): with True, returns a list whose first element is the dataframe and whose second element is the list of messages for the substances that were not found.
            Defaults to False, returning only the dataframe.
    """
    messages_list = []
    df = pd.DataFrame()
    for CAS in casList:
        output = echa_noael_ld50(
            CAS=CAS, route=route, outputType="df", local_search=local_search, local_only=local_only
        )
        if isinstance(output, str):
            messages_list.append(output)
        elif isinstance(output, pd.DataFrame):
            df = pd.concat([df, output], ignore_index=True)
    df.dropna(axis=1, how="all", inplace=True)
    if messages and df.empty:
        messages_list.append(
            f'Non sono riuscito a trovare nessun NOAEL o LD50 per i cas per la route "{route}"'
        )
        return [None, messages_list]
    elif messages and not df.empty:
        return [df, messages_list]
    elif not df.empty and not messages:
        return df
    elif df.empty and not messages:
        return f'Non sono riuscito a trovare nessun NOAEL o LD50 per i cas per la route "{route}"'


def echaExtract_multi(
    casList: list,
    scrapingType="all",
    local=False,
    local_path=None,
    log_path=None,
    debug_print=False,
    error=False,
    error_path=None,
    key_infos=False,
    local_search=False,
    local_only=False,
    filter=None,
):
    """
    Given a list of CAS numbers, tries to extract all the RepeatedDose pages, all the AcuteToxicity pages, or both.

    Args:
        casList (list): the list of CAS numbers
        scrapingType (str): 'RepeatedDose', 'AcuteToxicity', 'all'
        local (bool): if True, saves to disk progressively, appending each result as it is found.
            Required for large-scale scraping
        local_path (str): where to save the progressive CSV when local is True
        log_path (str): path of the log file to fill during mass scraping
        debug_print (bool): enables printing during scraping, to track progress
        error (bool): if True, the list of errors is written out once scraping is done

    Output:
        pd.DataFrame
    """
    cas_len = len(casList)
    i = 0

    df = pd.DataFrame()
    if scrapingType == "all":
        scrapingTypeList = ["RepeatedDose", "AcuteToxicity"]
    else:
        scrapingTypeList = [scrapingType]

    logging.info(
        f"echa.echaExtract_multi(). Commencing mass extraction of {scrapingTypeList} for {casList}"
    )

    errors = []

    for cas in casList:
        for scrapingType in scrapingTypeList:
            extraction = echaExtract(
                substance=cas,
                scrapingType=scrapingType,
                outputType="df",
                key_infos=key_infos,
                local_search=local_search,
                local_only=local_only,
            )
            if isinstance(extraction, pd.DataFrame) and not extraction.empty:
                status = "successful_scrape"
                logging.info(
                    f"echa.echaExtract_multi(). Successfully scraped {scrapingType} for {cas}"
                )

                df = pd.concat([df, extraction], ignore_index=True)
                if local and local_path:
                    df.to_csv(local_path, index=False)

            elif (
                (isinstance(extraction, pd.DataFrame) and extraction.empty)
                or (extraction is None)
                or (isinstance(extraction, str) and extraction.startswith("No data"))
            ):
                status = "no_data_found"
                logging.info(
                    f"echa.echaExtract_multi(). Found no data for {scrapingType} for {cas}"
                )
            elif isinstance(extraction, dict) and extraction.get("error"):
                status = "df_creation_error"
                errors.append(extraction)
                logging.info(
                    f"echa.echaExtract_multi(). Df creation error for {scrapingType} for {cas}"
                )
            elif isinstance(extraction, str) and extraction.startswith("Non esistono"):
                status = "no_lead_dossiers"
                logging.info(
                    f"echa.echaExtract_multi(). Found no lead dossiers for {cas}"
                )
            else:
                status = "unknown_error"
                logging.error(
                    f"echa.echaExtract_multi(). Unknown error for {scrapingType} for {cas}"
                )

            if log_path:
                fill_log(cas, status, log_path, scrapingType)
            if debug_print:
                print(f"{i}: {cas}, {scrapingType}")
            i += 1

    if error and errors and error_path:
        with open(error_path, "w") as json_file:
            json.dump(errors, json_file, indent=4)

    # This move lets us drop four separate helper methods
    if filter:
        df = filter_dataframe_by_dict(df, filter)
    return df


def fill_log(cas: str, status: str, log_path: str, scrapingType: str):
    """
    Used during mass scraping to fill a log file while the substances are being extracted
    """

    df = pd.read_csv(log_path)
    df.loc[df["casNo"] == cas, f"scraping_{scrapingType}"] = status
    df.loc[df["casNo"] == cas, "timestamp"] = datetime.now().strftime("%Y-%m-%d")

    df.to_csv(log_path, index=False)


def echaExtract_local(substance: str, scrapingType: str, key_infos=False):
    # Parameter binding (instead of f-string interpolation) avoids SQL
    # injection through the substance string. `con` is the DuckDB connection
    # defined elsewhere in the module.
    if not key_infos:
        query = """
            SELECT *
            FROM echa_full_scraping
            WHERE CAS = ? AND page = ? AND key_information IS NULL;
        """
    else:
        query = """
            SELECT *
            FROM echa_full_scraping
            WHERE CAS = ? AND page = ?;
        """
    result = con.execute(query, [substance, scrapingType]).df()
    return result


def filter_dataframe_by_dict(df, filter_dict):
    """
    Filters a Pandas DataFrame based on a dictionary.

    Args:
        df (pd.DataFrame): The input DataFrame.
        filter_dict (dict): A dictionary where keys are column names and
            values are lists of allowed values for that column.

    Returns:
        pd.DataFrame: A new DataFrame containing only the rows that match
            the filter criteria.
    """

    filter_condition = pd.Series(True, index=df.index)  # Start with all rows selected

    for column_name, allowed_values in filter_dict.items():
        if column_name in df.columns:  # Check if the column exists in the DataFrame
            column_filter = df[column_name].isin(allowed_values)  # Boolean mask for this column
            filter_condition = filter_condition & column_filter  # Combine with logical AND
        else:
            print(f"Warning: Column '{column_name}' not found in the DataFrame. Filter for this column will be ignored.")

    filtered_df = df[filter_condition]  # Apply the combined filter condition
    return filtered_df
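The same filtering logic can be shown without pandas: keep only the records whose value for every filtered key is in the allowed list, silently skipping keys the record does not have (mirroring the missing-column case above). `filter_records_by_dict` is an illustrative helper, not part of the module.

```python
def filter_records_by_dict(records, filter_dict):
    """Plain-Python analogue of filter_dataframe_by_dict over a list of dicts."""
    kept = []
    for record in records:
        ok = True
        for key, allowed in filter_dict.items():
            # A key absent from the record is ignored, like a missing column
            if key in record and record[key] not in allowed:
                ok = False
                break
        if ok:
            kept.append(record)
    return kept
```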
497
src/pif_compiler/functions/_old/find.py
Normal file
@@ -0,0 +1,497 @@
import os
import requests
import urllib.parse
import json
import logging
import re
from datetime import datetime
from bs4 import BeautifulSoup, Tag
from typing import Dict, Union, Optional, Any

# Logging settings
logging.basicConfig(
    format="{asctime} - {levelname} - {message}",
    style="{",
    datefmt="%Y-%m-%d %H:%M",
    filename=".log",
    encoding="utf-8",
    filemode="a",
    level=logging.INFO,
)


# Constants for API endpoints
QUACKO_BASE_URL = "https://chem.echa.europa.eu"
QUACKO_SUBSTANCE_API = f"{QUACKO_BASE_URL}/api-substance/v1/substance"
QUACKO_DOSSIER_API = f"{QUACKO_BASE_URL}/api-dossier-list/v1/dossier"
QUACKO_HTML_PAGES = f"{QUACKO_BASE_URL}/html-pages"

# Default sections to look for in the dossier
DEFAULT_SECTIONS = {
    "id_7_Toxicologicalinformation": "ToxSummary",
    "id_72_AcuteToxicity": "AcuteToxicity",
    "id_75_Repeateddosetoxicity": "RepeatedDose",
    "id_6_Ecotoxicologicalinformation": "EcotoxSummary",
    "id_76_Genetictoxicity": "GeneticToxicity",
    "id_42_Meltingpointfreezingpoint": "MeltingFreezingPoint",
    "id_43_Boilingpoint": "BoilingPoint",
    "id_48_Watersolubility": "WaterSolubility",
    "id_410_Surfacetension": "SurfaceTension",
    "id_420_pH": "pH",
    "Test": "Test2",
}

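The search URL below is built by percent-encoding the free-text substance identifier before interpolating it into the query string; names with commas, apostrophes, or spaces would otherwise corrupt the URL. A small sketch (the base URL mirrors the constant above; the substance name is just an example):

```python
import urllib.parse

base = "https://chem.echa.europa.eu/api-substance/v1/substance"
substance = "2,2'-oxydiethanol"  # a name with characters that need escaping
encoded = urllib.parse.quote(substance)
url = f"{base}?pageIndex=1&pageSize=100&searchText={encoded}"
# The comma and apostrophe become %2C and %27 in the encoded query string
```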
def search_dossier(
    substance: str,
    input_type: str = 'rmlCas',
    sections: Dict[str, str] = None,
    local_index_path: str = None
) -> Union[Dict[str, Any], str, bool]:
    """
    Search for a chemical substance in the QUACKO database and retrieve its dossier information.

    Args:
        substance (str): The identifier of the substance to search for (e.g. CAS number, name)
        input_type (str): The type of identifier provided. Options: 'rmlCas', 'rmlName', 'rmlEc'
        sections (Dict[str, str], optional): Dictionary mapping section IDs to result keys.
            If None, default sections will be used.
        local_index_path (str, optional): Path to a local index.html file to parse instead of
            downloading from QUACKO. If provided, the function will skip all API calls and
            only extract sections from the local file.

    Returns:
        Union[Dict[str, Any], str, bool]: Dictionary with substance information and dossier links on success,
            error message string if substance found but with issues,
            False if substance not found or other critical error
    """
    # Use default sections if none provided
    if sections is None:
        sections = DEFAULT_SECTIONS

    try:
        results = {}

        # If a local file is provided, extract sections from it directly
        if local_index_path:
            logging.info(f"QUACKO.search() - Using local index file: {local_index_path}")

            # We still need some minimal info for constructing the URLs
            if '/' not in local_index_path:
                asset_id = "local"
                rml_id = "local"
            else:
                # Try to extract information from the path if available
                path_parts = local_index_path.split('/')
                # If path follows expected structure: .../html-pages/ASSET_ID/index.html
                if 'html-pages' in path_parts and 'index.html' in path_parts[-1]:
                    asset_id = path_parts[path_parts.index('html-pages') + 1]
                    rml_id = "extracted"  # Just a placeholder
                else:
                    asset_id = "local"
                    rml_id = "local"

            # Add these to results for consistency
            results["assetExternalId"] = asset_id
            results["rmlId"] = rml_id

            # Extract sections from the local file
            section_links = get_section_links_from_file(local_index_path, asset_id, rml_id, sections)
            if section_links:
                results.update(section_links)

            return results

        # Normal flow with API calls
        substance_data = get_substance_by_identifier(substance)
        if not substance_data:
            return False

        # Verify that the found substance matches the input identifier
        if substance_data.get(input_type) != substance:
            error_msg = (f"Search error: results[{input_type}] (\"{substance_data.get(input_type)}\") "
                         f"is not equal to \"{substance}\". Maybe you specified the wrong input_type. "
                         f"Check the results here: {substance_data.get('search_response')}")
            logging.error(f"QUACKO.search(): {error_msg}")
            return error_msg

        # Step 2: Find dossiers for the substance
        rml_id = substance_data["rmlId"]
        dossier_data = get_dossier_by_rml_id(rml_id, substance)
        if not dossier_data:
            return False

        # Merge substance and dossier data
        results = {**substance_data, **dossier_data}

        # Step 3: Extract detailed information from dossier index page
        asset_external_id = dossier_data["assetExternalId"]
        section_links = get_section_links_from_index(asset_external_id, rml_id, sections)
        if section_links:
            results.update(section_links)

        logging.info(f"QUACKO.search() OK. output: {json.dumps(results)}")
        return results

    except Exception as e:
        logging.error(f"QUACKO.search(): Unexpected error in search_dossier for '{substance}': {str(e)}")
        return False


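The local-file branch above derives the asset ID from the path layout `.../html-pages/ASSET_ID/index.html`, falling back to the `"local"` placeholder otherwise. The same path handling can be isolated as a standalone helper (`asset_id_from_path` is illustrative, not part of the module):

```python
def asset_id_from_path(local_index_path):
    """Derive the asset ID the way search_dossier's local branch does."""
    if '/' not in local_index_path:
        return "local"
    parts = local_index_path.split('/')
    # Expected structure: .../html-pages/ASSET_ID/index.html
    if 'html-pages' in parts and 'index.html' in parts[-1]:
        return parts[parts.index('html-pages') + 1]
    return "local"
```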
def get_substance_by_identifier(substance: str) -> Optional[Dict[str, str]]:
    """
    Search the QUACKO database for a substance using the provided identifier.

    Args:
        substance (str): The substance identifier to search for (CAS number, name, etc.)

    Returns:
        Optional[Dict[str, str]]: Dictionary with substance information or None if not found
    """
    encoded_substance = urllib.parse.quote(substance)
    search_url = f"{QUACKO_SUBSTANCE_API}?pageIndex=1&pageSize=100&searchText={encoded_substance}"

    logging.info(f'QUACKO.search(). searching "{substance}"')

    try:
        response = requests.get(search_url)
        response.raise_for_status()  # Raise exception for HTTP errors
        data = response.json()

        if not data.get("items") or len(data["items"]) == 0:
            logging.info(f"QUACKO.search() could not find substance for '{substance}'")
            return None

        # Extract substance information
        substance_index = data["items"][0]["substanceIndex"]
        result = {
            'search_response': search_url,
            'rmlId': substance_index.get("rmlId", ""),
            'rmlName': substance_index.get("rmlName", ""),
            'rmlCas': substance_index.get("rmlCas", ""),
            'rmlEc': substance_index.get("rmlEc", ""),
        }

        logging.info(
            f"QUACKO.search() found substance on QUACKO. "
            f"rmlId: '{result['rmlId']}', rmlName: '{result['rmlName']}', rmlCas: '{result['rmlCas']}'"
        )
        return result

    except requests.RequestException as e:
        logging.error(f"QUACKO.search() - Request error while searching for substance '{substance}': {str(e)}")
        return None
    except (KeyError, IndexError) as e:
        logging.error(f"QUACKO.search() - Data parsing error for substance '{substance}': {str(e)}")
        return None


def get_dossier_by_rml_id(rml_id: str, substance_name: str) -> Optional[Dict[str, Any]]:
    """
    Find dossiers for a substance using its RML ID.

    Args:
        rml_id (str): The RML ID of the substance
        substance_name (str): The name of the substance (for logging)

    Returns:
        Optional[Dict[str, Any]]: Dictionary with dossier information or None if not found
    """
    # First try active dossiers
    dossier_results = _query_dossier_api(rml_id, "Active")

    # If no active dossiers found, try inactive ones
    if not dossier_results:
        logging.info(
            f"QUACKO.search() - could not find active dossier for '{substance_name}'. "
            "Proceeding to search the inactive ones."
        )
        dossier_results = _query_dossier_api(rml_id, "Inactive")

        if not dossier_results:
            logging.info(f"QUACKO.search() - could not find inactive dossiers for '{substance_name}'")
            return None
        else:
            logging.info(f"QUACKO.search() - found inactive dossiers for '{substance_name}'")
            dossier_results["dossierType"] = "Inactive"
    else:
        logging.info(f"QUACKO.search() - found active dossiers for '{substance_name}'")
        dossier_results["dossierType"] = "Active"

    return dossier_results


def _query_dossier_api(rml_id: str, status: str) -> Optional[Dict[str, Any]]:
    """
    Helper function to query the QUACKO dossier API for a specific substance and status.

    Args:
        rml_id (str): The RML ID of the substance
        status (str): The status of dossiers to search for ('Active' or 'Inactive')

    Returns:
        Optional[Dict[str, Any]]: Dictionary with dossier information or None if not found
    """
    url = f"{QUACKO_DOSSIER_API}?pageIndex=1&pageSize=100&rmlId={rml_id}&registrationStatuses={status}"

    try:
        response = requests.get(url)
        response.raise_for_status()
        data = response.json()

        if not data.get("items") or len(data["items"]) == 0:
            return None

        result = {
            "assetExternalId": data["items"][0]["assetExternalId"],
            "rootKey": data["items"][0]["rootKey"],
        }

        # Extract last update date if available
        try:
            last_update = data["items"][0]["lastUpdatedDate"]
            datetime_object = datetime.fromisoformat(last_update.replace('Z', '+00:00'))
            result['lastUpdateDate'] = datetime_object.date().isoformat()
        except (KeyError, ValueError) as e:
            logging.error(f"QUACKO.search() - Error extracting lastUpdateDate: {str(e)}")

        # Add index URLs
        result["index"] = f"{QUACKO_HTML_PAGES}/{result['assetExternalId']}/index.html"
        result["index_js"] = f"{QUACKO_BASE_URL}/{rml_id}/dossier-view/{result['assetExternalId']}"

        return result

    except requests.RequestException as e:
        logging.error(f"QUACKO.search() - Request error while getting dossiers for RML ID '{rml_id}': {str(e)}")
        return None
    except (KeyError, IndexError) as e:
        logging.error(f"QUACKO.search() - Data parsing error for RML ID '{rml_id}': {str(e)}")
        return None


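The `lastUpdatedDate` handling above rewrites a trailing `Z` to an explicit `+00:00` offset because `datetime.fromisoformat()` only started accepting the `Z` suffix in Python 3.11. A self-contained sketch with a sample API-style timestamp (the value is illustrative):

```python
from datetime import datetime

last_update = "2023-05-17T09:30:00Z"  # sample API-style UTC timestamp
# Normalize the 'Z' suffix so fromisoformat() also works on Python < 3.11
parsed = datetime.fromisoformat(last_update.replace('Z', '+00:00'))
date_str = parsed.date().isoformat()
```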
def get_section_links_from_index(
    asset_id: str,
    rml_id: str,
    sections: Dict[str, str]
) -> Dict[str, str]:
    """
    Extract links to specified sections from the dossier index page by downloading it.

    Args:
        asset_id (str): The asset external ID of the dossier
        rml_id (str): The RML ID of the substance
        sections (Dict[str, str]): Dictionary mapping section IDs to result keys

    Returns:
        Dict[str, str]: Dictionary with links to the requested sections
    """
    index_url = f"{QUACKO_HTML_PAGES}/{asset_id}/index.html"

    try:
        response = requests.get(index_url)
        response.raise_for_status()

        # Parse content using the shared method
        return parse_sections_from_html(response.text, asset_id, rml_id, sections)

    except requests.RequestException as e:
        logging.error(f"QUACKO.search() - Request error while extracting section links: {str(e)}")
        return {}
    except Exception as e:
        logging.error(f"QUACKO.search() - Error extracting section links: {str(e)}")
        return {}


def get_section_links_from_file(
    file_path: str,
    asset_id: str,
    rml_id: str,
    sections: Dict[str, str]
) -> Dict[str, str]:
    """
    Extract links to specified sections from a local index.html file.

    Args:
        file_path (str): Path to the local index.html file
        asset_id (str): The asset external ID to use for constructing URLs
        rml_id (str): The RML ID to use for constructing URLs
        sections (Dict[str, str]): Dictionary mapping section IDs to result keys

    Returns:
        Dict[str, str]: Dictionary with links to the requested sections
    """
    try:
        if not os.path.exists(file_path):
            logging.error(f"QUACKO.search() - Local file not found: {file_path}")
            return {}

        with open(file_path, 'r', encoding='utf-8') as file:
            html_content = file.read()

        # Parse content using the shared method
        return parse_sections_from_html(html_content, asset_id, rml_id, sections)

    except FileNotFoundError:
        logging.error(f"QUACKO.search() - File not found: {file_path}")
        return {}
    except Exception as e:
        logging.error(f"QUACKO.search() - Error parsing local file {file_path}: {str(e)}")
        return {}


def parse_sections_from_html(
    html_content: str,
    asset_id: str,
    rml_id: str,
    sections: Dict[str, str]
) -> Dict[str, str]:
    """
    Parse HTML content to extract links to specified sections.

    Args:
        html_content (str): HTML content to parse
        asset_id (str): The asset external ID to use for constructing URLs
        rml_id (str): The RML ID to use for constructing URLs
        sections (Dict[str, str]): Dictionary mapping section IDs to result keys

    Returns:
        Dict[str, str]: Dictionary with links to the requested sections
    """
    result = {}

    try:
        soup = BeautifulSoup(html_content, "html.parser")

        # Extract each requested section
        for section_id, section_name in sections.items():
            section_links = extract_section_links(soup, section_id, asset_id, rml_id, section_name)
            if section_links:
                result.update(section_links)
                logging.info(f"QUACKO.search() - Found section '{section_name}' in document")
            else:
                logging.info(f"QUACKO.search() - Section '{section_name}' not found in document")

        return result

    except Exception as e:
        logging.error(f"QUACKO.search() - Error parsing HTML content: {str(e)}")
        return {}


# --------------------------------------------------------------------------
# Function to Extract Section Links with Validation
# --------------------------------------------------------------------------
# This function extracts the document link associated with a specific section ID
# from the QUACKO index.html page structure.
#
# Problem Solved:
# Previous attempts faced issues where searching for a link within a parent
# section's div (e.g., "7 Toxicological Information" with id="id_7_...")
# would incorrectly grab the link belonging to the *first child* section
# (e.g., "7.2 Acute Toxicity" with id="id_72_..."). This happened because
# the simple `find("a", href=True)` doesn't distinguish ownership when nested.
#
# Solution Logic:
# 1. Find Target Div: Locate the `div` element using the specific `section_id` provided.
#    This div typically contains the section's content or nested subsections.
# 2. Find First Link: Find the very first `<a>` tag that has an `href` attribute
#    somewhere *inside* the `target_div`.
# 3. Find Link's Owning Section Div: Starting from the `first_link_tag`, traverse
#    up the HTML tree using `find_parent()` to find the nearest ancestor `div`
#    whose `id` attribute starts with "id_" (the pattern for section containers).
# 4. Validate Ownership: Compare the `id` of the `link_ancestor_section_div` found
#    in step 3 with the original `section_id` passed into the function.
# 5. Decision:
#    - If the IDs MATCH: It confirms that the `first_link_tag` truly belongs to the
#      `section_id` we are querying. The function proceeds to extract and format
#      this link.
#    - If the IDs DO NOT MATCH: It indicates that the first link found actually
#      belongs to a *nested* subsection div. Therefore, the original `section_id`
#      (the parent/container) does not have its own direct link, and the function
#      correctly returns an empty dictionary for this `section_id`.
#
# This validation step ensures that we only return links that are directly
# associated with the queried section ID, preventing the inheritance bug.
# --------------------------------------------------------------------------
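The ownership-validation idea described above can be demonstrated without BeautifulSoup, using only the standard-library HTML parser: track the stack of open `div` ids and record which `"id_"`-prefixed div actually encloses the first `<a href>` tag. The `FirstLinkOwner` class and the miniature `HTML` snippet are illustrative assumptions, not part of the module.

```python
from html.parser import HTMLParser

# Hypothetical miniature of the index-page structure described above: the
# parent section div contains no link of its own; the first <a> belongs to
# the nested subsection div.
HTML = (
    '<div id="id_7_Tox">'
    '  <div id="id_72_Acute"><a href="acute.html">7.2</a></div>'
    '</div>'
)


class FirstLinkOwner(HTMLParser):
    """Record which "id_*" div actually owns the first <a href> tag."""

    def __init__(self):
        super().__init__()
        self.div_stack = []   # ids of currently open divs
        self.owner = None     # id of the section div owning the first link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            self.div_stack.append(attrs.get("id", ""))
        elif tag == "a" and "href" in attrs and self.owner is None:
            # Nearest enclosing div whose id starts with "id_"
            for div_id in reversed(self.div_stack):
                if div_id.startswith("id_"):
                    self.owner = div_id
                    break

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()


parser = FirstLinkOwner()
parser.feed(HTML)
# The first link inside "id_7_Tox" is owned by the nested "id_72_Acute" div,
# so "id_7_Tox" has no direct link of its own.
```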
def extract_section_links(
    soup: BeautifulSoup,
    section_id: str,
    asset_id: str,
    rml_id: str,
    section_name: str
) -> Dict[str, str]:
    """
    Extracts a link for a specific section ID by finding the first link
    within its div and verifying that the link belongs directly to that
    section, not a nested subsection.

    Args:
        soup (BeautifulSoup): The BeautifulSoup object of the index page.
        section_id (str): The HTML ID of the section div.
        asset_id (str): The asset external ID of the dossier.
        rml_id (str): The RML ID of the substance.
        section_name (str): The name to use for the section in the result.

    Returns:
        Dict[str, str]: Dictionary with link if found and validated,
            otherwise empty.
    """
    result = {}

    # 1. Find the target div for the section ID
    target_div = soup.find("div", id=section_id)
    if not target_div:
        logging.info(f"QUACKO.search() - extract_section_links(): No div found for id='{section_id}'")
        return result

    # 2. Find the first <a> tag with an href within this target div
    first_link_tag = target_div.find("a", href=True)
    if not first_link_tag:
        logging.info(f"QUACKO.search() - extract_section_links(): No 'a' tag with href found within div id='{section_id}'")
        return result  # No links at all within this section

    # 3. Validate: find the closest ancestor div with an ID starting with "id_".
    #    This tells us which section container the link *actually* resides in.
    #    We use a lambda function for the id check, and handle a potential
    #    None if the structure is unexpected.
    link_ancestor_section_div: Optional[Tag] = first_link_tag.find_parent(
        "div", id=lambda x: x and x.startswith("id_")
    )

    # 4. Compare IDs
    if link_ancestor_section_div and link_ancestor_section_div.get('id') == section_id:
        # The first link found belongs directly to the section we are looking for.
        logging.debug(f"QUACKO.search() - extract_section_links(): Valid link found for id='{section_id}'.")
        a_tag_to_use = first_link_tag  # Use the link we found
    else:
        # The first link found belongs to a *different* (nested) section,
        # or the structure is broken (no ancestor div with id found).
        # Therefore, the section_id we were originally checking has no direct link.
        ancestor_id = link_ancestor_section_div.get('id') if link_ancestor_section_div else "None"
        logging.info(f"QUACKO.search() - extract_section_links(): First link within id='{section_id}' belongs to ancestor id='{ancestor_id}'. No direct link for '{section_id}'.")
        return result  # Return empty dict
|
||||||
|
|
||||||
|
# 5. Proceed with link extraction using the validated a_tag_to_use
|
||||||
|
try:
|
||||||
|
document_id = a_tag_to_use.get('href') # Use .get() for safety
|
||||||
|
if not document_id:
|
||||||
|
logging.error(f"QUACKO.search() - extract_section_links(): Found 'a' tag for '{section_name}' has no href attribute.")
|
||||||
|
return {}
|
||||||
|
|
||||||
|
# Clean up the document ID
|
||||||
|
if document_id.startswith('./documents/'):
|
||||||
|
document_id = document_id.replace('./documents/', '')
|
||||||
|
if document_id.endswith('.html'):
|
||||||
|
document_id = document_id.replace('.html', '')
|
||||||
|
|
||||||
|
# Construct the full URLs unless in local-only mode
|
||||||
|
if asset_id == "local" and rml_id == "local":
|
||||||
|
result[section_name] = f"Local section found: {document_id}"
|
||||||
|
else:
|
||||||
|
result[section_name] = f"{QUACKO_HTML_PAGES}/{asset_id}/documents/{document_id}.html"
|
||||||
|
result[f"{section_name}_js"] = f"{QUACKO_BASE_URL}/{rml_id}/dossier-view/{asset_id}/{document_id}"
|
||||||
|
|
||||||
|
return result
|
||||||
|
|
||||||
|
except Exception as e: # Catch potential errors during processing
|
||||||
|
logging.error(f"QUACKO.search() - extract_section_links(): Error processing the validated link tag for '{section_name}': {str(e)}")
|
||||||
|
return {}
|
||||||
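The ancestor-validation idea above can be exercised on a tiny HTML snippet; a minimal sketch (assuming `bs4` is installed, with hypothetical `id_parent`/`id_child` ids rather than real dossier markup):

```python
from bs4 import BeautifulSoup

HTML = """
<div id="id_parent">
  <div id="id_child">
    <a href="./documents/doc-123.html">nested link</a>
  </div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

def owns_first_link(soup, section_id):
    """Return True only if the first <a href> inside the section's div
    belongs to that div directly, not to a nested id_* subsection."""
    target = soup.find("div", id=section_id)
    link = target.find("a", href=True) if target else None
    if not link:
        return False
    # Nearest ancestor div whose id starts with "id_" tells us the real owner
    ancestor = link.find_parent("div", id=lambda x: x and x.startswith("id_"))
    return bool(ancestor) and ancestor.get("id") == section_id

print(owns_first_link(soup, "id_parent"))  # False: the link belongs to id_child
print(owns_first_link(soup, "id_child"))   # True
```

Querying `id_parent` finds the nested link first, but the ancestor check correctly rejects it, which is exactly the inheritance bug the function guards against.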
37
src/pif_compiler/functions/_old/mongo_functions.py
Normal file

@@ -0,0 +1,37 @@
from typing import Optional
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi
from pymongo.database import Database


#region General helper functions for MongoDB

# Database connection function
def connect(user: str, password: str, database: str = "toxinfo") -> Database:

    uri = f"mongodb+srv://{user}:{password}@ufs13.dsmvdrx.mongodb.net/?retryWrites=true&w=majority&appName=UFS13"
    client = MongoClient(uri, server_api=ServerApi('1'))
    # Use the requested database (the original hard-coded 'toxinfo' and ignored
    # the parameter; the default preserves the old behaviour).
    db = client[database]
    return db

#endregion

#region Search functions for our own DB

# Looks up an element extracted from CosIng
def value_search(db: Database, value: str, mode: Optional[str] = None) -> Optional[dict]:
    if mode:
        json = db.toxinfo.find_one({mode: value}, {"_id": 0})
        if json:
            return json
        return None
    else:
        modes = ['commonName', 'inciName', 'casNo', 'ecNo', 'chemicalName', 'phEurName']
        for el in modes:
            json = db.toxinfo.find_one({el: value}, {"_id": 0})
            if json:
                return json
        return None

#endregion
149
src/pif_compiler/functions/_old/pubchem.py
Normal file
149
src/pif_compiler/functions/_old/pubchem.py
Normal file
|
|
@ -0,0 +1,149 @@
import os
from contextlib import contextmanager
import pubchempy as pcp
from pubchemprops.pubchemprops import get_second_layer_props
import logging

logging.basicConfig(
    format="{asctime} - {levelname} - {message}",
    style="{",
    datefmt="%Y-%m-%d %H:%M",
    filename="echa.log",
    encoding="utf-8",
    filemode="a",
    level=logging.INFO,
)

@contextmanager
def temporary_certificate(cert_path):
    # This is needed because using the PubChem API requires temporarily
    # swapping the certificate used for outgoing requests.

    """
    Context manager to temporarily change the certificate used for requests.

    Args:
        cert_path (str): Path to the certificate file to use temporarily

    Example:
        # Regular request uses default certificates
        requests.get('https://api.example.com')

        # Use custom certificate only within this block
        with temporary_certificate('custom-cert.pem'):
            requests.get('https://api.requiring.custom.cert.com')

        # Back to default certificates
        requests.get('https://api.example.com')
    """
    # Store original environment variables
    original_ca_bundle = os.environ.get('REQUESTS_CA_BUNDLE')
    original_ssl_cert = os.environ.get('SSL_CERT_FILE')

    try:
        # Set new certificate
        os.environ['REQUESTS_CA_BUNDLE'] = cert_path
        os.environ['SSL_CERT_FILE'] = cert_path
        yield
    finally:
        # Restore original environment variables
        if original_ca_bundle is not None:
            os.environ['REQUESTS_CA_BUNDLE'] = original_ca_bundle
        else:
            os.environ.pop('REQUESTS_CA_BUNDLE', None)

        if original_ssl_cert is not None:
            os.environ['SSL_CERT_FILE'] = original_ssl_cert
        else:
            os.environ.pop('SSL_CERT_FILE', None)

def clean_property_data(api_response):
    """
    Simplifies the API response data by flattening nested structures.

    Args:
        api_response (dict): Raw API response containing property data

    Returns:
        dict: Cleaned data with simplified structure
    """
    cleaned_data = {}

    for property_name, measurements in api_response.items():
        cleaned_measurements = []

        for measurement in measurements:
            cleaned_measurement = {
                'ReferenceNumber': measurement.get('ReferenceNumber'),
                'Description': measurement.get('Description', ''),
            }

            # Handle Reference field
            if 'Reference' in measurement:
                # Reference may be a list or a string
                ref = measurement['Reference']
                cleaned_measurement['Reference'] = ref[0] if isinstance(ref, list) else ref

            # Handle Value field
            value = measurement.get('Value', {})
            if isinstance(value, dict) and 'StringWithMarkup' in value:
                cleaned_measurement['Value'] = value['StringWithMarkup'][0]['String']
            else:
                cleaned_measurement['Value'] = str(value)

            # Remove empty values
            cleaned_measurement = {k: v for k, v in cleaned_measurement.items() if v}

            cleaned_measurements.append(cleaned_measurement)

        cleaned_data[property_name] = cleaned_measurements

    return cleaned_data

def pubchem_dap(cas):
    '''
    Given a CAS number, looks up the safety-data-sheet information on PubChem.
    First-level properties (synonyms, cid, logP, MolecularWeight, ExactMass, TPSA)
    are extracted with pubchempy; second-level ones (Melting Point) with pubchemprops.

    args:
        cas : string

    '''
    with temporary_certificate('src/data/ncbi-nlm-nih-gov-catena.pem'):
        try:
            # Initial lookup
            out = pcp.get_synonyms(cas, 'name')
            if out:
                out = out[0]
                output = {'CID': out['CID'],
                          'CAS': cas,
                          'first_pubchem_name': out['Synonym'][0],
                          'pubchem_link': f"https://pubchem.ncbi.nlm.nih.gov/compound/{out['CID']}"}
            else:
                return f'No results on PubChem for {cas}'

        except Exception as E:
            logging.error(f'various_utils.pubchem.pubchem_dap(). Some error during pubchem search for {cas}', exc_info=True)
            # Without a successful lookup 'output' is undefined, so bail out
            # here instead of raising a NameError below.
            return None

        try:
            # First-level property lookup
            properties = pcp.get_properties(['xlogp', 'molecular_weight', 'tpsa', 'exact_mass'], identifier=out['CID'], namespace='cid', searchtype=None, as_dataframe=False)
            if properties:
                output = {**output, **properties[0]}
            else:
                return output
        except Exception as E:
            logging.error(f'various_utils.pubchem.pubchem_dap(). Some error during pubchem first level properties extraction for {cas}', exc_info=True)

        try:
            # Melting Point lookup
            second_layer_props = get_second_layer_props(output['first_pubchem_name'], ['Melting Point', 'Dissociation Constants', 'pH'])
            if second_layer_props:
                second_layer_props = clean_property_data(second_layer_props)
                output = {**output, **second_layer_props}
        except Exception as E:
            logging.error(f'various_utils.pubchem.pubchem_dap(). Some error during pubchem second level properties extraction (Melting Point) for {cas}', exc_info=True)

        return output
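The save/restore pattern used by `temporary_certificate` can be exercised without any network access; a minimal sketch using a hypothetical `DEMO_CERT` variable rather than the real certificate paths:

```python
import os
from contextlib import contextmanager

@contextmanager
def temp_env(name, value):
    """Temporarily set an environment variable, restoring (or removing)
    the original value on exit -- same shape as temporary_certificate."""
    original = os.environ.get(name)
    try:
        os.environ[name] = value
        yield
    finally:
        if original is not None:
            os.environ[name] = original
        else:
            os.environ.pop(name, None)

os.environ.pop("DEMO_CERT", None)       # start from a clean slate
with temp_env("DEMO_CERT", "custom-cert.pem"):
    inside = os.environ["DEMO_CERT"]    # set only within the block
outside = os.environ.get("DEMO_CERT")   # restored (here: removed) afterwards
print(inside, outside)  # custom-cert.pem None
```

The `finally` clause is what makes the swap safe: the original value is restored even if the body raises.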
182
src/pif_compiler/functions/_old/scraper_cosing.py
Normal file

@@ -0,0 +1,182 @@
import json as js
import re
import requests as req
from typing import Union


#region Processes a list of CAS numbers taken from CosIng (thanks Jem)

def parse_cas_numbers(cas_string: list) -> list:

    # The caller guarantees at least one CAS exists, so we can take the string
    cas_string = cas_string[0]

    # Remove parentheses and their content
    cas_string = re.sub(r"\([^)]*\)", "", cas_string)

    # Split on the various possible separators
    cas_parts = re.split(r"[/;,]", cas_string)

    # Build a list from the parts, removing excess whitespace
    cas_list = [cas.strip() for cas in cas_parts]

    # Some CAS numbers are separated by a double dash (--) that must be removed;
    # this has to happen now, as a second pass
    if len(cas_list) == 1 and "--" in cas_list[0]:

        cas_list = [cas.strip() for cas in cas_list[0].split("--")]

    # Some entries hold the invalid placeholder "-"; find and remove them
    if cas_list:

        while "-" in cas_list:
            cas_list.remove("-")

    return cas_list
#endregion

#region Runs a search directly against CosIng

# The first argument is the string to search for, the second the search type
def cosing_search(text: str, mode: str = "name") -> Union[list, dict, None]:
    cosing_post_req = "https://api.tech.ec.europa.eu/search-api/prod/rest/search?apiKey=285a77fd-1257-4271-8507-f0c6b2961203&text=*&pageSize=100&pageNumber=1"
    agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0"

    # The default search mode is by name, whether INCI or another kind
    if mode == "name":
        search_query = {"bool":
                        {"must": [
                            {"text":
                             {"query": f"{text}", "fields":
                              ["inciName.exact", "inciUsaName", "innName.exact", "phEurName", "chemicalName", "chemicalDescription"],
                              "defaultOperator": "AND"}}]}}

    # Searching by CAS or EC number needs a different request payload
    elif mode in ["cas", "ec"]:
        search_query = {"bool": {"must": [{"text": {"query": f"*{text}*", "fields": ["casNo", "ecNo"]}}]}}

    # Search by ID is needed wherever the identified ingredients must be retrieved
    elif mode == "id":
        search_query = {"bool": {"must": [{"term": {"substanceId": f"{text}"}}]}}

    # Raise an error for any unsupported mode
    else:
        raise ValueError

    # Build the request payload
    files = {"query": ("query", js.dumps(search_query), "application/json")}

    # Run the search POST ("User-Agent" is the proper header name; the original
    # "user_agent" key is not a valid HTTP header and was ignored)
    response = req.post(cosing_post_req, headers={"User-Agent": agent, "Connection": "keep-alive"}, files=files)
    data = response.json()
    if data["results"]:

        return data["results"][0]["metadata"]

    # The function returns None when the search yields no results
    # (the original returned .status_code on the already-parsed JSON dict,
    # which would raise AttributeError)
    return None
#endregion

#region Cleans a CosIng json and returns it

def clean_cosing(json: dict, full: bool = True) -> dict:

    # Define the fields of interest, split by the desired output type

    string_cols = ["itemType", "nameOfCommonIngredientsGlossary", "inciName", "phEurName", "chemicalName", "innName", "substanceId", "refNo"]
    list_cols = ["casNo", "ecNo", "functionName", "otherRestrictions", "sccsOpinion", "sccsOpinionUrls", "identifiedIngredient", "annexNo", "otherRegulations"]

    # Build a list with every field to loop over

    total_keys = string_cols + list_cols

    # Base of the URL needed to obtain the substance's CosIng link

    base_url = "https://ec.europa.eu/growth/tools-databases/cosing/details/"
    clean_json = {}

    # Loop over all fields of interest

    for key in total_keys:

        # Some fields carry a useless placeholder that only wastes space,
        # so remove it

        while "<empty>" in json[key]:
            json[key].remove("<empty>")

        # If the field's output should be a list, CosIng's empty lists are
        # acceptable values as-is

        if key in list_cols:
            value = json[key]

            # CAS and EC are special cases that need extra processing

            if key in ["casNo", "ecNo"]:
                if value:
                    value = parse_cas_numbers(value)

            # When identifiedIngredient entries are present, complete the
            # returned json directly -- but only when the "full" flag is true

            elif key == "identifiedIngredient":
                if full:
                    if value:
                        value = identified_ingredients(value)

            clean_json[key] = value

        else:

            # This field name was too long, so it is simplified

            if key == "nameOfCommonIngredientsGlossary":
                nKey = "commonName"

            # Since some fields are renamed, the others keep their original name

            else:
                nKey = key

            # We want a plain string in output, and indexing an empty list
            # would fail, so first check that the CosIng list holds values

            if json[key]:
                clean_json[nKey] = json[key][0]
            else:
                clean_json[nKey] = ""

    # The cosingUrl field does not exist yet; build it by joining the
    # substance ID to the base URL

    clean_json["cosingUrl"] = f"{base_url}{json['substanceId'][0]}"

    return clean_json
#endregion

#region Completes a CosIng json, when needed

def identified_ingredients(id_list: list) -> list:

    identified = []

    # Run a search for each of the ingredients in IdentifiedIngredients

    for id in id_list:

        ingredient = cosing_search(id, "id")

        if ingredient:

            # Clean the json we just found

            ingredient = clean_cosing(ingredient, full=False)

            # Store the cleaned document in the list

            identified.append(ingredient)

    # Once the list is populated with the identifiedIngredient objects, return it

    return identified
#endregion
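The CAS-splitting logic is pure string manipulation and easy to check offline; a self-contained sketch that mirrors `parse_cas_numbers` above (the sample CAS strings are illustrative inputs, not taken from CosIng):

```python
import re

def parse_cas_numbers(cas_string: list) -> list:
    """Mirror of the diff's parse_cas_numbers: split a raw CosIng CAS
    string into individual CAS numbers."""
    s = cas_string[0]
    s = re.sub(r"\([^)]*\)", "", s)           # drop parenthesised notes
    parts = [p.strip() for p in re.split(r"[/;,]", s)]
    if len(parts) == 1 and "--" in parts[0]:  # second-pass double-dash split
        parts = [p.strip() for p in parts[0].split("--")]
    while "-" in parts:                       # drop invalid "-" placeholders
        parts.remove("-")
    return parts

print(parse_cas_numbers(["7732-18-5 (water) / 64-17-5"]))  # ['7732-18-5', '64-17-5']
print(parse_cas_numbers(["50-00-0 -- 30525-89-4"]))        # ['50-00-0', '30525-89-4']
```

Note that the `--` split only fires when the separator split produced a single part, which is why it must run as a second pass.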
18
src/pif_compiler/functions/html_to_pdf.py
Normal file

@@ -0,0 +1,18 @@
from playwright.sync_api import sync_playwright


def generate_pdf(url, pdf_path):
    with sync_playwright() as p:
        # Launch a browser (can be 'chromium', 'firefox', or 'webkit')
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Navigate to the given URL
        page.goto(url)

        # Generate the PDF
        page.pdf(path=pdf_path, format="A4")

        # Close the browser
        browser.close()
        print(f"PDF saved successfully to: {pdf_path}")
467
src/pif_compiler/functions/pdf_extraction.py
Normal file

@@ -0,0 +1,467 @@
|
||||||
|
import os
|
||||||
|
import base64
|
||||||
|
import traceback
|
||||||
|
import logging # Import logging module
|
||||||
|
import datetime
|
||||||
|
import pandas as pd
|
||||||
|
# import time # Keep if you use page.wait_for_timeout
|
||||||
|
from playwright.sync_api import sync_playwright, TimeoutError # Catch specific errors
|
||||||
|
from src.func.find import search_dossier
|
||||||
|
import requests
|
||||||
|
|
||||||
|
# --- Basic Logging Setup (Commented Out) ---
|
||||||
|
# # Configure logging - uncomment and customize level/handler as needed
|
||||||
|
# logging.basicConfig(
|
||||||
|
# level=logging.INFO, # Or DEBUG for more details
|
||||||
|
# format='%(asctime)s - %(levelname)s - %(message)s',
|
||||||
|
# # filename='pdf_generator.log', # Optional: Log to a file
|
||||||
|
# # filemode='a'
|
||||||
|
# )
|
||||||
|
# --- End Logging Setup ---
|
||||||
|
|
||||||
|
|
||||||
|
# Assume svg_to_data_uri is defined elsewhere correctly
|
||||||
|
def svg_to_data_uri(svg_path):
|
||||||
|
try:
|
||||||
|
if not os.path.exists(svg_path):
|
||||||
|
# logging.error(f"SVG file not found: {svg_path}") # Example logging
|
||||||
|
raise FileNotFoundError(f"SVG file not found: {svg_path}")
|
||||||
|
with open(svg_path, 'rb') as f:
|
||||||
|
svg_content = f.read()
|
||||||
|
encoded_svg = base64.b64encode(svg_content).decode('utf-8')
|
||||||
|
return f"data:image/svg+xml;base64,{encoded_svg}"
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error converting SVG {svg_path}: {e}")
|
||||||
|
# logging.error(f"Error converting SVG {svg_path}: {e}", exc_info=True) # Example logging
|
||||||
|
return None
|
||||||
|
|
||||||
|
# --- JavaScript Expressions ---
|
||||||
|
|
||||||
|
# Define the cleanup logic as an immediately-invoked arrow function expression
|
||||||
|
# NOTE: .das-block_empty removal is currently disabled as per previous step
|
||||||
|
cleanup_js_expression = """
|
||||||
|
() => {
|
||||||
|
console.log('Running cleanup JS (DISABLED .das-block_empty removal)...');
|
||||||
|
let totalRemoved = 0;
|
||||||
|
|
||||||
|
// Example 1: Remove sections explicitly marked as empty (Currently Disabled)
|
||||||
|
// const emptyBlocks = document.querySelectorAll('.das-block_empty');
|
||||||
|
// emptyBlocks.forEach(el => {
|
||||||
|
// if (el && el.parentNode) {
|
||||||
|
// console.log(`Removing '.das-block_empty' block with ID: ${el.id || 'N/A'}`);
|
||||||
|
// el.remove();
|
||||||
|
// totalRemoved++;
|
||||||
|
// }
|
||||||
|
// });
|
||||||
|
|
||||||
|
// Add other specific cleanup logic here if needed
|
||||||
|
|
||||||
|
console.log(`Cleanup script removed ${totalRemoved} elements (DISABLED .das-block_empty removal).`);
|
||||||
|
return totalRemoved; // Return the count
|
||||||
|
}
|
||||||
|
"""
|
||||||
|
# --- End JavaScript Expressions ---
|
||||||
|
|
||||||
|
|
||||||
|
def generate_pdf_with_header_and_cleanup(
|
||||||
|
url,
|
||||||
|
pdf_path,
|
||||||
|
substance_name,
|
||||||
|
substance_link,
|
||||||
|
ec_number,
|
||||||
|
cas_number,
|
||||||
|
header_template_path=r"src\func\resources\injectableHeader.html",
|
||||||
|
echa_chem_logo_path=r"src\func\resources\echa_chem_logo.svg",
|
||||||
|
echa_logo_path=r"src\func\resources\ECHA_Logo.svg"
|
||||||
|
) -> bool: # Added return type hint
|
||||||
|
"""
|
||||||
|
Generates a PDF with a dynamic header and optionally removes empty sections.
|
||||||
|
Provides basic logging (commented out) and returns True/False for success/failure.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
url (str): The target URL OR local HTML file path.
|
||||||
|
pdf_path (str): The output PDF path.
|
||||||
|
substance_name (str): The name of the chemical substance.
|
||||||
|
substance_link (str): The URL the substance name should link to (in header).
|
||||||
|
ec_number (str): The EC number for the substance.
|
||||||
|
cas_number (str): The CAS number for the substance.
|
||||||
|
header_template_path (str): Path to the HTML header template file.
|
||||||
|
echa_chem_logo_path (str): Path to the echa_chem_logo.svg file.
|
||||||
|
echa_logo_path (str): Path to the ECHA_Logo.svg file.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
bool: True if the PDF was generated successfully, False otherwise.
|
||||||
|
"""
|
||||||
|
final_header_html = None
|
||||||
|
# logging.info(f"Starting PDF generation for URL: {url} to path: {pdf_path}") # Example logging
|
||||||
|
|
||||||
|
# --- 1. Prepare Header HTML ---
|
||||||
|
try:
|
||||||
|
# logging.debug(f"Reading header template from: {header_template_path}") # Example logging
|
||||||
|
print(f"Reading header template from: {header_template_path}")
|
||||||
|
if not os.path.exists(header_template_path):
|
||||||
|
raise FileNotFoundError(f"Header template file not found: {header_template_path}")
|
||||||
|
with open(header_template_path, 'r', encoding='utf-8') as f:
|
||||||
|
header_template_content = f.read()
|
||||||
|
if not header_template_content:
|
||||||
|
raise ValueError("Header template file is empty.")
|
||||||
|
|
||||||
|
# logging.debug("Converting logos...") # Example logging
|
||||||
|
print("Converting logos...")
|
||||||
|
logo1_data_uri = svg_to_data_uri(echa_chem_logo_path)
|
||||||
|
logo2_data_uri = svg_to_data_uri(echa_logo_path)
|
||||||
|
if not logo1_data_uri or not logo2_data_uri:
|
||||||
|
raise ValueError("Failed to convert one or both logos to Data URIs.")
|
||||||
|
|
||||||
|
# logging.debug("Replacing placeholders...") # Example logging
|
||||||
|
print("Replacing placeholders...")
|
||||||
|
final_header_html = header_template_content.replace("##ECHA_CHEM_LOGO_SRC##", logo1_data_uri)
|
||||||
|
final_header_html = final_header_html.replace("##ECHA_LOGO_SRC##", logo2_data_uri)
|
||||||
|
final_header_html = final_header_html.replace("##SUBSTANCE_NAME##", substance_name)
|
||||||
|
final_header_html = final_header_html.replace("##SUBSTANCE_LINK##", substance_link)
|
||||||
|
final_header_html = final_header_html.replace("##EC_NUMBER##", ec_number)
|
||||||
|
final_header_html = final_header_html.replace("##CAS_NUMBER##", cas_number)
|
||||||
|
|
||||||
|
if "##" in final_header_html:
|
||||||
|
print("Warning: Not all placeholders seem replaced in the header HTML.")
|
||||||
|
# logging.warning("Not all placeholders seem replaced in the header HTML.") # Example logging
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error during header setup phase: {e}")
|
||||||
|
traceback.print_exc()
|
||||||
|
# logging.error(f"Error during header setup phase: {e}", exc_info=True) # Example logging
|
||||||
|
return False # Return False on header setup failure
|
||||||
|
# --- End Header Prep ---
|
||||||
|
|
||||||
|
# --- CSS Override Definition ---
|
||||||
|
# Using Revision 4 from previous step (simplified breaks, boundary focus)
|
||||||
|
selectors_to_fix = [
|
||||||
|
'.das-field .das-field_value_html',
|
||||||
|
'.das-field .das-field_value_large',
|
||||||
|
'.das-field .das-value_remark-text'
|
||||||
|
]
|
||||||
|
css_selector_string = ",\n".join(selectors_to_fix)
|
||||||
|
css_override = f"""
|
||||||
|
<style id='pdf-override-styles'>
|
||||||
|
/* Basic Resets & Overflows */
|
||||||
|
html, body {{ height: auto !important; overflow: visible !important; margin: 0 !important; padding: 0 !important; }}
|
||||||
|
* {{ box-sizing: border-box; }}
|
||||||
|
{css_selector_string} {{
|
||||||
|
overflow: visible !important; overflow-y: visible !important; height: auto !important; max-height: none !important;
|
||||||
|
}}
|
||||||
|
/* Boundary Fix */
|
||||||
|
#pdf-custom-header {{ margin-bottom: 0 !important; padding-bottom: 1px !important; page-break-after: auto !important; display: block !important; }}
|
||||||
|
#pdf-custom-header + .body-inner {{ margin-top: 0 !important; padding-top: 0 !important; page-break-before: auto !important; display: block !important; }}
|
||||||
|
.body-inner .document-header {{ margin-top: 0 !important; padding-top: 0 !important; page-break-before: auto !important; }}
|
||||||
|
/* Simplified Page Breaks */
|
||||||
|
.body-inner h1, .body-inner h2, .body-inner h3, .body-inner h4, .body-inner h5, .body-inner h6 {{ page-break-after: avoid !important; }}
|
||||||
|
#pdf-custom-header h2 {{ page-break-after: auto !important; }}
|
||||||
|
@media print {{
|
||||||
|
html, body {{ height: auto !important; overflow: visible !important; margin: 0; padding: 0; }}
|
||||||
|
#pdf-custom-header {{ margin-bottom: 0 !important; padding-bottom: 1px !important; page-break-after: auto !important; display: block !important;}}
|
||||||
|
#pdf-custom-header + .body-inner {{ margin-top: 0 !important; padding-top: 0 !important; page-break-before: auto !important; display: block !important; }}
|
||||||
|
.body-inner .document-header {{ margin-top: 0 !important; padding-top: 0 !important; page-break-before: auto !important; }}
|
||||||
|
.body-inner h1, .body-inner h2, .body-inner h3, .body-inner h4, .body-inner h5, .body-inner h6 {{ page-break-after: avoid !important; }}
|
||||||
|
#pdf-custom-header h2 {{ page-break-after: auto !important; }}
|
||||||
|
.das-doc-toolbar, .document-header__section-links, #das-totop {{ display: none !important; }}
|
||||||
|
}}
|
||||||
|
</style>
|
||||||
|
"""
|
||||||
|
# --- End CSS Override Definition ---
|
||||||
|
|
||||||
|
# --- Playwright Automation ---
|
||||||
|
try:
|
||||||
|
with sync_playwright() as p:
|
||||||
|
# logging.debug("Launching browser...") # Example logging
|
||||||
|
# browser = p.chromium.launch(headless=False, devtools=True) # For debugging
|
||||||
|
browser = p.chromium.launch()
|
||||||
|
page = browser.new_page()
|
||||||
|
# Capture console messages (Corrected: use msg.text)
|
||||||
|
page.on("console", lambda msg: print(f"Browser Console: {msg.text}"))
|
||||||
|
|
||||||
|
try:
|
||||||
|
# logging.info(f"Navigating to page: {url}") # Example logging
|
||||||
|
print(f"Navigating to: {url}")
|
||||||
|
if os.path.exists(url) and not url.startswith('file://'):
|
||||||
|
page_url = f'file://{os.path.abspath(url)}'
|
||||||
|
# logging.info(f"Treating as local file: {page_url}") # Example logging
|
||||||
|
print(f"Treating as local file: {page_url}")
|
||||||
|
else:
|
||||||
|
page_url = url
|
||||||
|
|
||||||
|
page.goto(page_url, wait_until='load', timeout=90000)
|
                # logging.info("Page navigation complete.")  # Example logging

                # logging.debug("Injecting header HTML...")  # Example logging
                print("Injecting header HTML...")
                page.evaluate('(headerHtml) => { document.body.insertAdjacentHTML("afterbegin", headerHtml); }', final_header_html)

                # logging.debug("Injecting CSS overrides...")  # Example logging
                print("Injecting CSS overrides...")
                page.evaluate("""(css) => {
                    const existingStyle = document.getElementById('pdf-override-styles');
                    if (existingStyle) existingStyle.remove();
                    document.head.insertAdjacentHTML('beforeend', css);
                }""", css_override)

                # logging.debug("Running JavaScript cleanup function...")  # Example logging
                print("Running JavaScript cleanup function...")
                elements_removed_count = page.evaluate(cleanup_js_expression)
                # logging.info(f"Cleanup script finished (reported removing {elements_removed_count} elements).")  # Example logging
                print(f"Cleanup script finished (reported removing {elements_removed_count} elements).")

                # --- Optional: Emulate Print Media ---
                # print("Emulating print media...")
                # page.emulate_media(media='print')

                # --- Generate PDF ---
                # logging.info(f"Generating PDF: {pdf_path}")  # Example logging
                print(f"Generating PDF: {pdf_path}")
                pdf_options = {
                    "path": pdf_path, "format": "A4", "print_background": True,
                    "margin": {'top': '20px', 'bottom': '20px', 'left': '20px', 'right': '20px'},
                    "scale": 1.0
                }
                page.pdf(**pdf_options)
                # logging.info(f"PDF saved successfully to: {pdf_path}")  # Example logging
                print(f"PDF saved successfully to: {pdf_path}")

                # logging.debug("Closing browser.")  # Example logging
                print("Closing browser.")
                browser.close()
                return True  # Indicate success
            except TimeoutError as e:
                print(f"A Playwright TimeoutError occurred: {e}")
                traceback.print_exc()
                # logging.error(f"Playwright TimeoutError occurred: {e}", exc_info=True)  # Example logging
                browser.close()  # Ensure browser is closed on error
                return False  # Indicate failure
            except Exception as e:  # Catch other potential errors during Playwright page operations
                print(f"An unexpected error occurred during Playwright page operations: {e}")
                traceback.print_exc()
                # logging.error(f"Unexpected error during Playwright page operations: {e}", exc_info=True)  # Example logging
                # Optional: Save HTML state on error
                try:
                    html_content = page.content()
                    error_html_path = pdf_path.replace('.pdf', '_error.html')
                    with open(error_html_path, 'w', encoding='utf-8') as f_err:
                        f_err.write(html_content)
                    # logging.info(f"Saved HTML state on error to: {error_html_path}")  # Example logging
                    print(f"Saved HTML state on error to: {error_html_path}")
                except Exception as save_e:
                    # logging.error(f"Could not save HTML state on error: {save_e}", exc_info=True)  # Example logging
                    print(f"Could not save HTML state on error: {save_e}")
                browser.close()  # Ensure browser is closed on error
                return False  # Indicate failure
            # Note: cleanup of the 'with sync_playwright()' context is handled
            # automatically by the 'with' statement.

    except Exception as e:
        # Catch errors during Playwright startup (less common)
        print(f"An error occurred during Playwright setup/teardown: {e}")
        traceback.print_exc()
        # logging.error(f"Error during Playwright setup/teardown: {e}", exc_info=True)  # Example logging
        return False  # Indicate failure

# --- Example Usage ---
# result = generate_pdf_with_header_and_cleanup(
#     url='path/to/your/input.html',
#     pdf_path='output.pdf',
#     substance_name='Glycerol Example',
#     substance_link='http://example.com/glycerol',
#     ec_number='200-289-5',
#     cas_number='56-81-5',
# )
#
# if result:
#     print("PDF Generation Succeeded.")
#     # logging.info("Main script: PDF Generation Succeeded.")  # Example logging
# else:
#     print("PDF Generation Failed.")
#     # logging.error("Main script: PDF Generation Failed.")  # Example logging

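A minimal, self-contained sketch of the filesystem-safe folder naming that `search_generate_pdfs` below relies on; the CAS/EC/ID values here are hypothetical examples, not real lookups.

```python
import os

def substance_folder(base: str, cas: str, ec: str, rml_id: str) -> str:
    """Build the per-substance output folder path, sanitizing the CAS number."""
    # Replace path separators so the CAS number is safe as a folder-name component.
    safe_cas = cas.replace('/', '_').replace('\\', '_')
    return os.path.join(base, f"{safe_cas}_{ec}_{rml_id}")

print(substance_folder("data/library", "56-81-5", "200-289-5", "12345"))
```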
def search_generate_pdfs(
    cas_number_to_search: str,
    page_types_to_extract: list[str],
    base_output_folder: str = "data/library"
) -> bool:
    """
    Searches for a substance by CAS, saves the raw HTML, and generates PDFs for
    the specified page types. Uses the '_js' link variant for the PDF header
    link if available.

    Args:
        cas_number_to_search (str): CAS number to search for.
        page_types_to_extract (list[str]): List of page type names (e.g., 'RepeatedDose').
            Expects '{page_type}' and '{page_type}_js' keys in the search result.
        base_output_folder (str): Root directory for saving HTML/PDFs.

    Returns:
        bool: True if the substance was found and at least one requested PDF was
            generated, False otherwise.
    """
    # logging.info(f"Starting process for CAS: {cas_number_to_search}")
    print(f"\n===== Processing request for CAS: {cas_number_to_search} =====")

    # --- 1. Search for Dossier Information ---
    try:
        # logging.debug(f"Calling search_dossier for CAS: {cas_number_to_search}")
        search_result = search_dossier(substance=cas_number_to_search, input_type='rmlCas')
    except Exception as e:
        print(f"Error during dossier search for CAS '{cas_number_to_search}': {e}")
        traceback.print_exc()
        # logging.error(f"Exception during search_dossier for CAS '{cas_number_to_search}': {e}", exc_info=True)
        return False

    if not search_result:
        print(f"Substance not found or search failed for CAS: {cas_number_to_search}")
        # logging.warning(f"Substance not found or search failed for CAS: {cas_number_to_search}")
        return False

    # logging.info(f"Substance found for CAS: {cas_number_to_search}")
    print(f"Substance found: {search_result.get('rmlName', 'N/A')}")

    # --- 2. Extract Details and Filter Pages ---
    try:
        # Extract required info
        rml_id = search_result.get('rmlId')
        rml_name = search_result.get('rmlName')
        rml_cas = search_result.get('rmlCas')
        rml_ec = search_result.get('rmlEc')
        asset_ext_id = search_result.get('assetExternalId')

        # Basic validation
        if not all([rml_id, rml_name, rml_cas, rml_ec, asset_ext_id]):
            missing_keys = [k for k, v in {'rmlId': rml_id, 'rmlName': rml_name, 'rmlCas': rml_cas, 'rmlEc': rml_ec, 'assetExternalId': asset_ext_id}.items() if not v]
            message = f"Search result for {cas_number_to_search} is missing required keys: {missing_keys}"
            print(f"Error: {message}")
            # logging.error(message)
            return False

        # --- Filtering Logic: collect pairs of URLs ---
        pages_to_process_list = []  # Store tuples: (page_name, raw_url, js_url)
        # logging.debug(f"Filtering pages. Requested: {page_types_to_extract}.")

        for page_type in page_types_to_extract:
            raw_url_key = page_type
            js_url_key = f"{page_type}_js"

            raw_url = search_result.get(raw_url_key)
            js_url = search_result.get(js_url_key)  # Get the JS URL

            # Check that both URLs are valid, non-empty strings
            if raw_url and isinstance(raw_url, str) and raw_url.strip():
                if js_url and isinstance(js_url, str) and js_url.strip():
                    pages_to_process_list.append((page_type, raw_url, js_url))
                    # logging.debug(f"Found valid pair for '{page_type}': Raw='{raw_url}', JS='{js_url}'")
                else:
                    # Found a raw URL but no valid JS URL; skip PDF generation for
                    # this page type rather than fall back to raw_url for the header link.
                    print(f"Found raw URL for '{page_type}' but missing/invalid JS URL ('{js_url}'). Skipping PDF generation for this type.")
                    # logging.warning(f"Missing/invalid JS URL for page type '{page_type}' for {rml_cas}. Raw URL: '{raw_url}'.")
            else:
                # Raw URL missing or invalid
                if page_type in search_result:  # Check if the key existed at all
                    print(f"Found page type key '{page_type}' for {rml_cas}, but its value is not a valid URL ('{raw_url}'). Skipping.")
                    # logging.warning(f"Invalid raw URL value for page type '{page_type}' for {rml_cas}: '{raw_url}'.")
                else:
                    print(f"Requested page type key '{page_type}' not found in search results for {rml_cas}.")
                    # logging.warning(f"Requested page type key '{page_type}' not found for {rml_cas}.")
        # --- End Filtering Logic ---

        if not pages_to_process_list:
            print(f"After filtering, no requested page types ({page_types_to_extract}) resulted in a valid pair of raw and JS URLs for substance {rml_cas}.")
            # logging.warning(f"No pages with valid URL pairs to process for substance {rml_cas}.")
            return False  # Nothing to generate

    except Exception as e:
        print(f"Error processing search result for '{cas_number_to_search}': {e}")
        traceback.print_exc()
        # logging.error(f"Error processing search result for '{cas_number_to_search}': {e}", exc_info=True)
        return False

    # --- 3. Prepare Folders ---
    safe_cas = rml_cas.replace('/', '_').replace('\\', '_')
    substance_folder_name = f"{safe_cas}_{rml_ec}_{rml_id}"
    substance_folder_path = os.path.join(base_output_folder, substance_folder_name)

    try:
        os.makedirs(substance_folder_path, exist_ok=True)
        # logging.info(f"Ensured output directory exists: {substance_folder_path}")
        print(f"Ensured output directory exists: {substance_folder_path}")
    except OSError as e:
        print(f"Error creating directory {substance_folder_path}: {e}")
        # logging.error(f"Failed to create directory {substance_folder_path}: {e}", exc_info=True)
        return False

    # --- 4. Process Each Page (Save HTML, Generate PDF) ---
    successful_pages = []  # Track successful PDF generations
    overall_success = False  # Track whether any PDF was generated

    for page_name, raw_html_url, js_header_link in pages_to_process_list:
        print(f"\nProcessing page: {page_name}")
        base_filename = f"{safe_cas}_{page_name}"
        html_filename = f"{base_filename}.html"
        pdf_filename = f"{base_filename}.pdf"
        html_full_path = os.path.join(substance_folder_path, html_filename)
        pdf_full_path = os.path.join(substance_folder_path, pdf_filename)

        # --- Save Raw HTML ---
        html_saved = False
        try:
            # logging.debug(f"Fetching raw HTML for {page_name} from {raw_html_url}")
            print(f"Fetching raw HTML from: {raw_html_url}")
            # Add headers to mimic a browser if needed
            headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
            response = requests.get(raw_html_url, timeout=30, headers=headers)  # 30s timeout
            response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

            # Decide on an encoding: response.text guesses, or use apparent_encoding;
            # assuming UTF-8 is a common, reasonable default when unsure.
            html_content = response.content.decode('utf-8', errors='replace')

            with open(html_full_path, 'w', encoding='utf-8') as f:
                f.write(html_content)
            html_saved = True
            # logging.info(f"Successfully saved raw HTML for {page_name} to {html_full_path}")
            print(f"Successfully saved raw HTML to: {html_full_path}")
        except requests.exceptions.RequestException as req_e:
            print(f"Error fetching raw HTML for {page_name} from {raw_html_url}: {req_e}")
            # logging.error(f"Error fetching raw HTML for {page_name}: {req_e}", exc_info=True)
        except IOError as io_e:
            print(f"Error saving raw HTML for {page_name} to {html_full_path}: {io_e}")
            # logging.error(f"Error saving raw HTML for {page_name}: {io_e}", exc_info=True)
        except Exception as e:  # Catch other potential errors, e.g. decoding
            print(f"Unexpected error saving HTML for {page_name}: {e}")
            # logging.error(f"Unexpected error saving HTML for {page_name}: {e}", exc_info=True)

        # --- Generate PDF (raw URL for content, JS URL for the header link) ---
        # logging.info(f"Generating PDF for {page_name} from {raw_html_url}")
        print(f"Generating PDF using content from: {raw_html_url}")
        pdf_success = generate_pdf_with_header_and_cleanup(
            url=raw_html_url,  # Use the raw URL for Playwright navigation/content
            pdf_path=pdf_full_path,
            substance_name=rml_name,
            substance_link=js_header_link,  # Use the JS URL for the link in the header
            ec_number=rml_ec,
            cas_number=rml_cas
        )

        if pdf_success:
            successful_pages.append(page_name)  # Log success based on PDF generation
            overall_success = True
            # logging.info(f"Successfully generated PDF for {page_name} at {pdf_full_path}")
            print(f"Successfully generated PDF for {page_name}")
        else:
            # logging.error(f"Failed to generate PDF for {page_name} from {raw_html_url}")
            print(f"Failed to generate PDF for {page_name}")
        # Note: failure to save the raw HTML (html_saved) does not currently affect
        # overall success; success is tied only to PDF generation.

    print(f"===== Finished request for CAS: {cas_number_to_search} =====")
    print(f"Successfully generated {len(successful_pages)} PDFs: {successful_pages}")
    return overall_success  # Return success based on PDF generation
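The raw/JS URL pairing step above can be sketched as a small standalone helper; the search-result dict and its `'RepeatedDose'` / `'RepeatedDose_js'` keys here are hypothetical illustrations of the expected shape, not real API output.

```python
def collect_url_pairs(search_result: dict, page_types: list[str]) -> list[tuple]:
    """Return (page_type, raw_url, js_url) tuples where both URLs are valid strings."""
    pairs = []
    for page_type in page_types:
        raw_url = search_result.get(page_type)
        js_url = search_result.get(f"{page_type}_js")
        # Only keep page types where both the raw and '_js' URLs are non-empty strings.
        if isinstance(raw_url, str) and raw_url.strip():
            if isinstance(js_url, str) and js_url.strip():
                pairs.append((page_type, raw_url, js_url))
    return pairs

demo = {
    "RepeatedDose": "https://example.com/raw/repeated-dose",
    "RepeatedDose_js": "https://example.com/js/repeated-dose",
    "Toxicity": "https://example.com/raw/toxicity",  # no '_js' variant -> skipped
}
print(collect_url_pairs(demo, ["RepeatedDose", "Toxicity", "Missing"]))
```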
3 src/pif_compiler/functions/resources/ECHA_Logo.svg Normal file
File diff suppressed because one or more lines are too long
After Width: | Height: | Size: 6.1 KiB

141 src/pif_compiler/functions/resources/echa_chem_logo.svg Normal file
@@ -0,0 +1,141 @@
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="352.909" height="64.542" viewBox="0 0 352.909 64.542">
<defs>
<linearGradient id="linear-gradient" x1="0.499" y1="0.955" x2="0.501" y2="0.043" gradientUnits="objectBoundingBox">
<stop offset="0" stop-color="#002658"/>
<stop offset="0.99" stop-color="#0160ae"/>
</linearGradient>
<radialGradient id="radial-gradient" cx="0.502" cy="0.5" r="0.881" gradientUnits="objectBoundingBox">
<stop offset="0.34" stop-color="#0961ad"/>
<stop offset="1" stop-color="#1c2f5d"/>
</radialGradient>
<radialGradient id="radial-gradient-2" cx="0.795" cy="0.199" r="0.8" xlink:href="#radial-gradient"/>
<linearGradient id="linear-gradient-2" y1="0.499" x2="1" y2="0.499" gradientUnits="objectBoundingBox">
<stop offset="0" stop-color="#fff"/>
<stop offset="0" stop-color="#0961ad"/>
<stop offset="1" stop-color="#1c2f5d"/>
</linearGradient>
<linearGradient id="linear-gradient-3" x1="-3.244" y1="0.922" x2="0.926" y2="0.075" gradientUnits="objectBoundingBox">
<stop offset="0" stop-color="#fff"/>
<stop offset="0" stop-color="#f6d46a"/>
<stop offset="0.99" stop-color="#f8a71b"/>
</linearGradient>
<linearGradient id="linear-gradient-4" x1="-0.547" y1="0.499" x2="0.453" y2="0.499" gradientUnits="objectBoundingBox">
<stop offset="0" stop-color="#f6d46a"/>
<stop offset="0.99" stop-color="#f8a71b"/>
</linearGradient>
<linearGradient id="linear-gradient-5" x1="-0.17" y1="0.5" x2="0.83" y2="0.5" xlink:href="#linear-gradient-3"/>
<linearGradient id="linear-gradient-6" x1="0.5" y1="-0.199" x2="0.5" y2="1.353" gradientUnits="objectBoundingBox">
<stop offset="0" stop-color="#fff"/>
<stop offset="0" stop-color="#feca0a"/>
<stop offset="0.96" stop-color="#faaa1b"/>
<stop offset="0.99" stop-color="#f8a71b"/>
</linearGradient>
<linearGradient id="linear-gradient-7" x1="0.5" y1="-0.223" x2="0.5" y2="1.383" xlink:href="#linear-gradient-6"/>
<linearGradient id="linear-gradient-9" x1="0.5" y1="-0.223" x2="0.5" y2="1.383" xlink:href="#linear-gradient-6"/>
</defs>
<g id="Group_1542" data-name="Group 1542" transform="translate(-103 -146)">
<g id="Group_1535" data-name="Group 1535">
<path id="Path_484" data-name="Path 484" d="M219.034,36.851,202.609.06h-5.448L180.736,36.851h8.4l3.267-8.032h14.867l3.347,8.032h8.4ZM204.71,22.718h-9.73l4.905-11.831,4.825,11.831ZM165.185,36.851h7.71V.985h-7.71V15.118h-17.9V.985h-7.71V36.841h7.71V21.853h17.9V36.841h0Zm-48.713.935c3.659,0,8.172-.141,12.223-1.719V29.322A40.491,40.491,0,0,1,116.4,30.97c-8.012,0-10.5-6.092-10.5-12.053S108.39,6.865,116.4,6.865A40.491,40.491,0,0,1,128.7,8.514V1.779C124.645.2,120.131.06,116.472.06c-13.229,0-18.526,8.534-18.526,18.858s5.3,18.858,18.526,18.858h0ZM60.13,36.851H89.472V30.106H67.84V21.853H86.909V15.108H67.84V7.72H89.472V.985H60.13V36.841h0Z" transform="translate(103.644 146.001)" fill="url(#linear-gradient)"/>
<circle id="Ellipse_9" data-name="Ellipse 9" cx="2.02" cy="2.02" r="2.02" transform="translate(129.958 175.363)" fill="url(#radial-gradient)"/>
<path id="Path_485" data-name="Path 485" d="M38.618,37a5.358,5.358,0,0,1-1.689.593,2.892,2.892,0,0,0,.151-.935,3.016,3.016,0,1,0-6,.432c-2.563-.281-4.956-.241-5.2-.362-.02,0-6.413-5.247-15.561-.623a.494.494,0,0,0-.221.482A14.761,14.761,0,0,0,24.836,50.638,14.529,14.529,0,0,0,39.5,37.54a.587.587,0,0,0-.885-.553Z" transform="translate(103.383 146.175)" fill="url(#radial-gradient-2)"/>
<path id="Path_486" data-name="Path 486" d="M10.281,36.025s6.4-5.277,13.751-1.448a24.047,24.047,0,0,0,7.047,2.513s-23.18,1.3-20.788-1.066Z" transform="translate(103.383 146.173)" fill="url(#linear-gradient-2)"/>
<rect id="Rectangle_2083" data-name="Rectangle 2083" width="4.322" height="15.48" rx="2.15" transform="translate(113.755 196.405) rotate(44.01)" fill="url(#linear-gradient-3)"/>
<path id="Path_487" data-name="Path 487" d="M2.734,63.94A1.62,1.62,0,0,1,1.628,63.5L.483,62.392a1.59,1.59,0,0,1-.04-2.242l8.866-9.178a1.6,1.6,0,0,1,1.116-.483,1.623,1.623,0,0,1,1.126.442L12.7,52.038a1.587,1.587,0,0,1,.04,2.242L3.87,63.457a1.6,1.6,0,0,1-1.116.483h-.03Zm7.72-13.007h-.02a1.12,1.12,0,0,0-.8.352L.764,60.442a1.173,1.173,0,0,0-.322.814,1.12,1.12,0,0,0,.352.8L1.94,63.166a1.141,1.141,0,0,0,1.618-.03l8.866-9.178a1.144,1.144,0,0,0-.03-1.618l-1.146-1.106a1.109,1.109,0,0,0-.794-.322Z" transform="translate(103.33 146.263)" fill="url(#linear-gradient-4)"/>
<path id="Path_488" data-name="Path 488" d="M19.3.99H13.977a.461.461,0,0,0-.462.462V4.83a.461.461,0,0,0,.462.462h1.568a.461.461,0,0,1,.462.462V16.028l-.02,1.206a.454.454,0,0,1-.261.412A21.882,21.882,0,0,0,4.96,29.226a.465.465,0,0,0,.312.623l3.3.895a.47.47,0,0,0,.553-.261,17.583,17.583,0,0,1,10.334-9.761.465.465,0,0,0,.312-.442V1.462A.461.461,0,0,0,19.3,1Z" transform="translate(103.356 146.005)" fill="#003c75"/>
<path id="Path_489" data-name="Path 489" d="M36.551.99H31.042a.378.378,0,0,0-.372.372V19.989a.378.378,0,0,0,.553.332l3.026-1.639a.365.365,0,0,0,.191-.332V5.674a.378.378,0,0,1,.372-.372h1.749a.378.378,0,0,0,.372-.372V1.362A.378.378,0,0,0,36.561.99Z" transform="translate(103.49 146.005)" fill="#003c75"/>
<path id="Path_490" data-name="Path 490" d="M45.919,34.292A21.215,21.215,0,0,0,31.545,16.741h0l-.08-.03-.181-.06h0L30.8,16.51v3.629a.758.758,0,0,0,.523.724h.02A17.285,17.285,0,1,1,7.661,38.946a17.1,17.1,0,0,1,.02-4.182.285.285,0,0,0-.211-.312l-3.277-.875a.294.294,0,0,0-.362.231,20.916,20.916,0,0,0-.07,5.609,21.236,21.236,0,1,0,42.159-5.147Z" transform="translate(103.349 146.086)" fill="url(#linear-gradient-5)"/>
<path id="Path_491" data-name="Path 491" d="M224.13,18.9a34.878,34.878,0,0,1,.714-7.2,17.285,17.285,0,0,1,2.372-5.911,11.557,11.557,0,0,1,4.4-3.991A14.5,14.5,0,0,1,238.484.34a26.823,26.823,0,0,1,5.669.523,22.624,22.624,0,0,1,4.091,1.246V7.226c-.985-.412-1.9-.754-2.724-1.025a23.113,23.113,0,0,0-2.392-.643,17.986,17.986,0,0,0-2.292-.332c-.764-.06-1.548-.09-2.342-.09a8.147,8.147,0,0,0-4.282,1.055,8.105,8.105,0,0,0-2.825,2.915,14.1,14.1,0,0,0-1.558,4.383,28.844,28.844,0,0,0-.483,5.428,28.767,28.767,0,0,0,.483,5.428,13.693,13.693,0,0,0,1.558,4.373,7.871,7.871,0,0,0,2.825,2.915,8.147,8.147,0,0,0,4.282,1.055c.794,0,1.578-.03,2.342-.1a17.985,17.985,0,0,0,2.292-.332,23.113,23.113,0,0,0,2.392-.643c.824-.271,1.739-.613,2.724-1.025V35.7a22.624,22.624,0,0,1-4.091,1.246,26.823,26.823,0,0,1-5.669.523,14.5,14.5,0,0,1-6.866-1.458,11.787,11.787,0,0,1-4.4-3.991,17.489,17.489,0,0,1-2.372-5.911,34.27,34.27,0,0,1-.714-7.2Z" transform="translate(104.499 146.002)" fill="url(#linear-gradient-6)"/>
<path id="Path_492" data-name="Path 492" d="M238.486,37.8a14.953,14.953,0,0,1-7.026-1.5,11.974,11.974,0,0,1-4.523-4.111,17.723,17.723,0,0,1-2.413-6.021A34.94,34.94,0,0,1,223.8,18.9a34.94,34.94,0,0,1,.724-7.268,17.723,17.723,0,0,1,2.413-6.021A12.056,12.056,0,0,1,231.46,1.5,14.9,14.9,0,0,1,238.486,0a27.543,27.543,0,0,1,5.74.533A23.034,23.034,0,0,1,248.377,1.8l.2.09V7.73l-.462-.191c-.975-.412-1.89-.754-2.7-1.015a20.145,20.145,0,0,0-2.352-.633,21.324,21.324,0,0,0-2.252-.332c-.754-.06-1.538-.09-2.312-.09a7.909,7.909,0,0,0-4.111,1.005,7.711,7.711,0,0,0-2.7,2.8,13.738,13.738,0,0,0-1.518,4.272,28.991,28.991,0,0,0-.472,5.368,28.99,28.99,0,0,0,.472,5.368,13.738,13.738,0,0,0,1.518,4.272,7.79,7.79,0,0,0,2.7,2.8,7.91,7.91,0,0,0,4.111,1.005c.784,0,1.568-.03,2.312-.09a17.129,17.129,0,0,0,2.252-.332,22.352,22.352,0,0,0,2.352-.633c.814-.261,1.719-.613,2.7-1.015l.462-.191v5.84l-.2.09a23.872,23.872,0,0,1-4.152,1.267,27.543,27.543,0,0,1-5.74.533Zm0-37.133a14.408,14.408,0,0,0-6.715,1.417,11.475,11.475,0,0,0-4.282,3.88,17.122,17.122,0,0,0-2.322,5.8,34.262,34.262,0,0,0-.714,7.127,34.262,34.262,0,0,0,.714,7.127,17.122,17.122,0,0,0,2.322,5.8,11.31,11.31,0,0,0,4.282,3.88,14.256,14.256,0,0,0,6.715,1.417,26.193,26.193,0,0,0,5.6-.523,23.16,23.16,0,0,0,3.83-1.136v-4.4c-.824.332-1.588.623-2.292.844a23.888,23.888,0,0,1-2.433.653,20.7,20.7,0,0,1-2.332.342c-.764.06-1.568.1-2.372.1a8.456,8.456,0,0,1-4.453-1.106A8.346,8.346,0,0,1,231.1,28.85a14.484,14.484,0,0,1-1.6-4.483,29.507,29.507,0,0,1-.482-5.488,29.432,29.432,0,0,1,.482-5.488,14.322,14.322,0,0,1,1.6-4.483,8.346,8.346,0,0,1,2.935-3.036,8.5,8.5,0,0,1,4.453-1.1c.8,0,1.6.03,2.372.1a20.551,20.551,0,0,1,2.342.342,23.884,23.884,0,0,1,2.433.653q1.055.347,2.292.844V2.322a23.159,23.159,0,0,0-3.83-1.136,26.856,26.856,0,0,0-5.6-.523Z" transform="translate(104.497 146)" fill="#e68a00"/>
<path id="Path_493" data-name="Path 493" d="M275.544,36.836V20.662H260.516V36.836H255.49V.95h5.026V15.877h15.028V.95h5.026V36.836Z" transform="translate(104.663 146.005)" fill="url(#linear-gradient-7)"/>
<path id="Path_494" data-name="Path 494" d="M280.9,37.17h-5.689V21H260.86V37.17h-5.69V.62h5.69V15.547h14.354V.62H280.9Zm-5.026-.663h4.363V1.283h-4.363V16.211H260.186V1.283h-4.353V36.506h4.353V20.332h15.691Z" transform="translate(104.661 146.004)" fill="#e68a00"/>
<path id="Path_495" data-name="Path 495" d="M308.534,20.662H293.556V32.051h16.556v4.785H288.53V.95h21.582V5.735H293.556V15.877h14.978Z" transform="translate(104.835 146.005)" fill="url(#linear-gradient-7)"/>
<path id="Path_496" data-name="Path 496" d="M310.445,37.17H288.2V.62h22.245V6.068H293.889v9.479h14.978V21H293.889V31.721h16.556Zm-21.582-.663h20.908V32.385H293.216V20.332h14.978V16.211H293.216V5.4h16.556V1.283H288.863Z" transform="translate(104.833 146.004)" fill="#e68a00"/>
<path id="Path_497" data-name="Path 497" d="M335.9,32.051h-3.83l-10-23.11.241,27.895H317.38V.95h6.071l10.525,24.6L344.5.95h6.072V36.836h-4.926l.241-27.895-10,23.11Z" transform="translate(104.985 146.005)" fill="url(#linear-gradient-9)"/>
<path id="Path_498" data-name="Path 498" d="M350.916,37.17h-5.6l.231-26.588-9.439,21.8h-4.262l-9.429-21.8.231,26.588h-5.6V.62h6.634l10.3,24.085L344.291.62h6.634V37.17Zm-4.926-.663h4.262V1.283h-5.519l-10.746,25.11L323.242,1.283h-5.519V36.506h4.262l-.251-29.2L332.3,31.721h3.388L346.251,7.3,346,36.506Z" transform="translate(104.984 146.004)" fill="#e68a00"/>
<g id="Group_1497" data-name="Group 1497" transform="translate(163.734 192.803)">
<path id="Path_499" data-name="Path 499" d="M66.049,56.891H60.53V47h5.519v1.025H61.686v3.187h4.1v1.015h-4.1v3.639h4.363Z" transform="translate(-60.088 -46.558)" fill="#003c75"/>
<path id="Path_500" data-name="Path 500" d="M66.051,57.336H60.532a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h5.519a.446.446,0,0,1,.442.442v1.025a.446.446,0,0,1-.442.442h-3.92v2.3H65.79a.446.446,0,0,1,.442.442v1.015a.446.446,0,0,1-.442.442H62.131v2.754h3.92a.446.446,0,0,1,.442.442v1.025A.446.446,0,0,1,66.051,57.336Zm-5.076-.885H65.6v-.141h-3.92a.446.446,0,0,1-.442-.442V52.229a.446.446,0,0,1,.442-.442h3.659v-.131H61.678a.446.446,0,0,1-.442-.442V48.028a.446.446,0,0,1,.442-.442H65.6v-.131H60.975v9.007Z" transform="translate(-60.09 -46.56)" fill="#336"/>
</g>
<g id="Group_1498" data-name="Group 1498" transform="translate(175.786 192.652)">
<path id="Path_501" data-name="Path 501" d="M77.275,47.875A3.226,3.226,0,0,0,74.7,48.961a4.368,4.368,0,0,0-.945,2.975,4.53,4.53,0,0,0,.915,3.006A3.228,3.228,0,0,0,77.265,56a8.72,8.72,0,0,0,2.362-.372v1.005a7.15,7.15,0,0,1-2.533.382,4.3,4.3,0,0,1-3.378-1.327,5.452,5.452,0,0,1-1.186-3.77,5.991,5.991,0,0,1,.573-2.684,4.13,4.13,0,0,1,1.649-1.769,5.045,5.045,0,0,1,2.543-.623,6.138,6.138,0,0,1,2.724.573l-.483.985a5.258,5.258,0,0,0-2.252-.523Z" transform="translate(-72.078 -46.408)" fill="#003c75"/>
<path id="Path_502" data-name="Path 502" d="M77.086,57.467a4.743,4.743,0,0,1-3.709-1.478,5.9,5.9,0,0,1-1.3-4.061,6.371,6.371,0,0,1,.623-2.875,4.6,4.6,0,0,1,1.83-1.96A5.472,5.472,0,0,1,77.3,46.41a6.653,6.653,0,0,1,2.915.613.391.391,0,0,1,.221.251.45.45,0,0,1-.02.342l-.483.985a.431.431,0,0,1-.583.2,4.823,4.823,0,0,0-2.061-.483,2.768,2.768,0,0,0-2.242.935,3.972,3.972,0,0,0-.834,2.684,4.142,4.142,0,0,0,.8,2.714c.864,1.005,2.362,1.146,4.5.553a.455.455,0,0,1,.392.07.429.429,0,0,1,.171.352v1.005a.432.432,0,0,1-.291.412,7.827,7.827,0,0,1-2.694.412Zm.2-10.173a4.618,4.618,0,0,0-2.322.563,3.707,3.707,0,0,0-1.478,1.588,5.508,5.508,0,0,0-.523,2.483,5.091,5.091,0,0,0,1.076,3.478,3.847,3.847,0,0,0,3.046,1.176,7.462,7.462,0,0,0,2.091-.261V56.2c-2.222.513-3.84.2-4.845-.965a4.908,4.908,0,0,1-1.015-3.287,4.747,4.747,0,0,1,1.055-3.267,3.661,3.661,0,0,1,2.915-1.236,5.412,5.412,0,0,1,2.031.4l.09-.191A6.036,6.036,0,0,0,77.3,47.3Z" transform="translate(-72.08 -46.41)" fill="#336"/>
</g>
<g id="Group_1499" data-name="Group 1499" transform="translate(189.93 192.803)">
<path id="Path_503" data-name="Path 503" d="M94.089,56.891H92.943V52.237H87.736v4.654H86.59V47h1.146v4.212h5.207V47h1.146Z" transform="translate(-86.148 -46.558)" fill="#003c75"/>
<path id="Path_504" data-name="Path 504" d="M94.091,57.336H92.945a.446.446,0,0,1-.442-.442V52.682H88.181v4.212a.446.446,0,0,1-.442.442H86.592a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h1.146a.446.446,0,0,1,.442.442v3.77H92.5V47a.446.446,0,0,1,.442-.442h1.146a.446.446,0,0,1,.442.442v9.891A.446.446,0,0,1,94.091,57.336Zm-.7-.885h.261V47.445h-.261v3.77a.446.446,0,0,1-.442.442H87.738a.446.446,0,0,1-.442-.442v-3.77h-.261v9.007H87.3V52.239a.446.446,0,0,1,.442-.442h5.207a.446.446,0,0,1,.442.442Z" transform="translate(-86.15 -46.56)" fill="#336"/>
</g>
<g id="Group_1500" data-name="Group 1500" transform="translate(203.644 192.763)">
<path id="Path_505" data-name="Path 505" d="M107.819,56.892l-1.236-3.146h-3.971L101.4,56.892H100.23l3.91-9.932h.965L109,56.892h-1.186Zm-1.588-4.182-1.146-3.066c-.151-.392-.3-.864-.462-1.427a14.321,14.321,0,0,1-.422,1.427l-1.166,3.066Z" transform="translate(-99.79 -46.518)" fill="#003c75"/>
<path id="Path_506" data-name="Path 506" d="M109.008,57.336h-1.186a.436.436,0,0,1-.412-.281l-1.126-2.865h-3.357l-1.106,2.865a.446.446,0,0,1-.412.281h-1.166a.445.445,0,0,1-.422-.6l3.91-9.932a.446.446,0,0,1,.412-.281h.965a.436.436,0,0,1,.412.281l3.89,9.932a.431.431,0,0,1-.05.412.444.444,0,0,1-.372.191Zm-.885-.885h.241L104.826,47.4h-.362l-3.558,9.047h.211l1.106-2.865a.446.446,0,0,1,.412-.281h3.971a.436.436,0,0,1,.412.281l1.126,2.865Zm-1.89-3.3h-3.2a.426.426,0,0,1-.362-.191.46.46,0,0,1-.05-.412l1.166-3.066c.171-.493.312-.955.412-1.367a.439.439,0,0,1,.422-.342.465.465,0,0,1,.442.322c.151.543.3,1.015.442,1.387l1.156,3.066a.46.46,0,0,1-.05.412.439.439,0,0,1-.362.191Zm-2.553-.885h1.92l-.925-2.463s-.02-.05-.03-.07c0,.02-.02.04-.02.07l-.935,2.473Z" transform="translate(-99.793 -46.52)" fill="#336"/>
</g>
<g id="Group_1501" data-name="Group 1501" transform="translate(226.59 192.652)">
<path id="Path_507" data-name="Path 507" d="M127.815,47.875a3.226,3.226,0,0,0-2.573,1.086,4.368,4.368,0,0,0-.945,2.975,4.53,4.53,0,0,0,.915,3.006A3.229,3.229,0,0,0,127.8,56a8.72,8.72,0,0,0,2.362-.372v1.005a7.15,7.15,0,0,1-2.533.382,4.3,4.3,0,0,1-3.378-1.327,5.452,5.452,0,0,1-1.186-3.77,5.991,5.991,0,0,1,.573-2.684,4.13,4.13,0,0,1,1.649-1.769,5.045,5.045,0,0,1,2.543-.623,6.138,6.138,0,0,1,2.724.573l-.483.985a5.258,5.258,0,0,0-2.252-.523Z" transform="translate(-122.618 -46.408)" fill="#003c75"/>
<path id="Path_508" data-name="Path 508" d="M127.626,57.467a4.743,4.743,0,0,1-3.709-1.478,5.9,5.9,0,0,1-1.3-4.061,6.37,6.37,0,0,1,.623-2.875,4.6,4.6,0,0,1,1.829-1.96,5.472,5.472,0,0,1,2.764-.684,6.653,6.653,0,0,1,2.915.613.391.391,0,0,1,.221.251.449.449,0,0,1-.02.342l-.482.985a.431.431,0,0,1-.583.2,4.823,4.823,0,0,0-2.061-.483,2.768,2.768,0,0,0-2.242.935,3.972,3.972,0,0,0-.834,2.684,4.142,4.142,0,0,0,.8,2.714c.864,1.005,2.362,1.146,4.5.553a.455.455,0,0,1,.392.07.429.429,0,0,1,.171.352v1.005a.431.431,0,0,1-.292.412,7.826,7.826,0,0,1-2.694.412Zm.2-10.173a4.618,4.618,0,0,0-2.322.563,3.707,3.707,0,0,0-1.478,1.588,5.508,5.508,0,0,0-.523,2.483,5.09,5.09,0,0,0,1.076,3.478,3.847,3.847,0,0,0,3.046,1.176,7.462,7.462,0,0,0,2.091-.261V56.2c-2.222.513-3.84.2-4.845-.965a4.908,4.908,0,0,1-1.015-3.287,4.747,4.747,0,0,1,1.056-3.267,3.661,3.661,0,0,1,2.915-1.236,5.412,5.412,0,0,1,2.031.4l.09-.191a6.036,6.036,0,0,0-2.111-.352Z" transform="translate(-122.62 -46.41)" fill="#336"/>
</g>
<g id="Group_1502" data-name="Group 1502" transform="translate(240.733 192.803)">
<path id="Path_509" data-name="Path 509" d="M144.629,56.891h-1.146V52.237h-5.207v4.654H137.13V47h1.146v4.212h5.207V47h1.146Z" transform="translate(-136.688 -46.558)" fill="#003c75"/>
<path id="Path_510" data-name="Path 510" d="M144.631,57.336h-1.146a.446.446,0,0,1-.442-.442V52.682h-4.322v4.212a.446.446,0,0,1-.442.442h-1.146a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h1.146a.446.446,0,0,1,.442.442v3.77h4.322V47a.446.446,0,0,1,.442-.442h1.146a.446.446,0,0,1,.442.442v9.891A.446.446,0,0,1,144.631,57.336Zm-.7-.885h.261V47.445h-.261v3.77a.446.446,0,0,1-.442.442h-5.207a.446.446,0,0,1-.442-.442v-3.77h-.261v9.007h.261V52.239a.446.446,0,0,1,.442-.442h5.207a.446.446,0,0,1,.442.442Z" transform="translate(-136.69 -46.56)" fill="#336"/>
</g>
<g id="Group_1503" data-name="Group 1503" transform="translate(255.801 192.803)">
<path id="Path_511" data-name="Path 511" d="M157.639,56.891H152.12V47h5.519v1.025h-4.363v3.187h4.1v1.015h-4.1v3.639h4.363Z" transform="translate(-151.678 -46.558)" fill="#003c75"/>
<path id="Path_512" data-name="Path 512" d="M157.641,57.336h-5.519a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h5.519a.446.446,0,0,1,.442.442v1.025a.446.446,0,0,1-.442.442h-3.92v2.3h3.659a.446.446,0,0,1,.442.442v1.015a.446.446,0,0,1-.442.442h-3.659v2.754h3.92a.446.446,0,0,1,.442.442v1.025A.446.446,0,0,1,157.641,57.336Zm-5.076-.885h4.624v-.141h-3.92a.446.446,0,0,1-.442-.442V52.229a.446.446,0,0,1,.442-.442h3.659v-.131h-3.659a.446.446,0,0,1-.442-.442V48.028a.446.446,0,0,1,.442-.442h3.92v-.131h-4.624v9.007Z" transform="translate(-151.68 -46.56)" fill="#336"/>
</g>
<g id="Group_1504" data-name="Group 1504" transform="translate(268.397 192.803)">
<path id="Path_513" data-name="Path 513" d="M169.013,56.891l-3.357-8.776h-.05q.091,1.04.09,2.473v6.293H164.63V46.99h1.729l3.136,8.162h.05L172.7,46.99h1.719v9.891h-1.146V50.508q0-1.1.09-2.382h-.05l-3.388,8.755H169Z" transform="translate(-164.208 -46.558)" fill="#003c75"/>
|
||||||
|
<path id="Path_514" data-name="Path 514" d="M174.433,57.336h-1.146a.446.446,0,0,1-.442-.442V50.621l-2.483,6.433a.446.446,0,0,1-.412.281h-.925a.436.436,0,0,1-.412-.281l-2.453-6.423v6.262a.446.446,0,0,1-.442.442h-1.065a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h1.729a.436.436,0,0,1,.412.281L169.538,54l2.774-7.157a.446.446,0,0,1,.412-.281h1.719a.446.446,0,0,1,.442.442v9.891a.446.446,0,0,1-.442.442Zm-.7-.885h.261V47.445h-.975l-3.046,7.881a.446.446,0,0,1-.412.281h-.05a.436.436,0,0,1-.412-.281l-3.026-7.881h-.985v9.007h.171V50.6c0-.935-.03-1.759-.09-2.433a.469.469,0,0,1,.111-.342.44.44,0,0,1,.332-.141h.05a.436.436,0,0,1,.412.281l3.247,8.484h.322l3.277-8.474a.446.446,0,0,1,.412-.281h.05a.434.434,0,0,1,.322.141.454.454,0,0,1,.121.332c-.06.844-.09,1.638-.09,2.352Z" transform="translate(-164.21 -46.56)" fill="#336"/>
|
||||||
|
</g>
|
||||||
|
<g id="Group_1505" data-name="Group 1505" transform="translate(285.767 192.803)">
|
||||||
|
<path id="Path_515" data-name="Path 515" d="M181.92,56.891V47h1.146v9.891Z" transform="translate(-181.488 -46.558)" fill="#003c75"/>
|
||||||
|
<path id="Path_516" data-name="Path 516" d="M183.078,57.336h-1.146a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h1.146a.446.446,0,0,1,.442.442v9.891A.446.446,0,0,1,183.078,57.336Zm-.7-.885h.261V47.445h-.261Z" transform="translate(-181.49 -46.56)" fill="#336"/>
|
||||||
|
</g>
|
||||||
|
<g id="Group_1506" data-name="Group 1506" transform="translate(293.969 192.652)">
|
||||||
|
<path id="Path_517" data-name="Path 517" d="M194.845,47.875a3.226,3.226,0,0,0-2.573,1.086,4.368,4.368,0,0,0-.945,2.975,4.531,4.531,0,0,0,.915,3.006A3.229,3.229,0,0,0,194.835,56a8.719,8.719,0,0,0,2.362-.372v1.005a7.15,7.15,0,0,1-2.533.382,4.3,4.3,0,0,1-3.378-1.327,5.452,5.452,0,0,1-1.186-3.77,5.991,5.991,0,0,1,.573-2.684,4.13,4.13,0,0,1,1.649-1.769,5.045,5.045,0,0,1,2.543-.623,6.138,6.138,0,0,1,2.724.573l-.482.985a5.258,5.258,0,0,0-2.252-.523Z" transform="translate(-189.648 -46.408)" fill="#003c75"/>
|
||||||
|
<path id="Path_518" data-name="Path 518" d="M194.656,57.467a4.743,4.743,0,0,1-3.709-1.478,5.9,5.9,0,0,1-1.3-4.061,6.371,6.371,0,0,1,.623-2.875,4.6,4.6,0,0,1,1.83-1.96,5.472,5.472,0,0,1,2.764-.684,6.653,6.653,0,0,1,2.915.613.392.392,0,0,1,.221.251.45.45,0,0,1-.02.342l-.482.985a.431.431,0,0,1-.583.2,4.823,4.823,0,0,0-2.061-.483,2.767,2.767,0,0,0-2.242.935,3.972,3.972,0,0,0-.834,2.684,4.142,4.142,0,0,0,.8,2.714c.865,1.005,2.362,1.146,4.5.553a.455.455,0,0,1,.392.07.429.429,0,0,1,.171.352v1.005a.432.432,0,0,1-.292.412,7.826,7.826,0,0,1-2.694.412Zm.2-10.173a4.618,4.618,0,0,0-2.322.563,3.707,3.707,0,0,0-1.478,1.588,5.508,5.508,0,0,0-.523,2.483,5.091,5.091,0,0,0,1.076,3.478,3.847,3.847,0,0,0,3.046,1.176,7.462,7.462,0,0,0,2.091-.261V56.2c-2.221.513-3.84.2-4.845-.965a4.908,4.908,0,0,1-1.015-3.287,4.747,4.747,0,0,1,1.055-3.267,3.661,3.661,0,0,1,2.915-1.236,5.412,5.412,0,0,1,2.031.4l.09-.191a6.036,6.036,0,0,0-2.111-.352Z" transform="translate(-189.65 -46.41)" fill="#336"/>
|
||||||
|
</g>
|
||||||
|
<g id="Group_1507" data-name="Group 1507" transform="translate(306.738 192.763)">
|
||||||
|
<path id="Path_519" data-name="Path 519" d="M210.369,56.892l-1.236-3.146h-3.971l-1.216,3.146H202.78l3.91-9.932h.965l3.89,9.932h-1.186Zm-1.588-4.182-1.146-3.066c-.151-.392-.3-.864-.462-1.427a14.3,14.3,0,0,1-.422,1.427l-1.166,3.066Z" transform="translate(-202.351 -46.518)" fill="#003c75"/>
|
||||||
|
<path id="Path_520" data-name="Path 520" d="M211.568,57.336h-1.186a.436.436,0,0,1-.412-.281l-1.126-2.865h-3.357l-1.106,2.865a.446.446,0,0,1-.412.281H202.8a.445.445,0,0,1-.422-.6l3.91-9.932a.446.446,0,0,1,.412-.281h.965a.436.436,0,0,1,.412.281l3.89,9.932a.431.431,0,0,1-.05.412.444.444,0,0,1-.372.191Zm-.885-.885h.241L207.386,47.4h-.362l-3.558,9.047h.211l1.106-2.865a.446.446,0,0,1,.412-.281h3.971a.436.436,0,0,1,.412.281l1.126,2.865Zm-1.89-3.3h-3.2a.426.426,0,0,1-.362-.191.46.46,0,0,1-.05-.412l1.166-3.066c.171-.493.312-.955.412-1.367a.439.439,0,0,1,.422-.342.465.465,0,0,1,.442.322c.151.543.3,1.015.442,1.387l1.156,3.066a.459.459,0,0,1-.05.412.439.439,0,0,1-.362.191Zm-2.553-.885h1.92l-.925-2.463s-.02-.05-.03-.07c0,.02-.02.04-.02.07l-.935,2.473Z" transform="translate(-202.353 -46.52)" fill="#336"/>
|
||||||
|
</g>
|
||||||
|
<g id="Group_1508" data-name="Group 1508" transform="translate(321.733 192.803)">
|
||||||
|
<path id="Path_521" data-name="Path 521" d="M217.71,56.891V47h1.146v8.856h4.363V56.9H217.7Z" transform="translate(-217.268 -46.558)" fill="#003c75"/>
|
||||||
|
<path id="Path_522" data-name="Path 522" d="M223.231,57.336h-5.519a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h1.146A.446.446,0,0,1,219.3,47v8.4h3.92a.446.446,0,0,1,.442.442v1.045a.446.446,0,0,1-.442.442Zm-5.076-.885h4.624V56.3h-3.92a.446.446,0,0,1-.442-.442v-8.4h-.261v9.007Z" transform="translate(-217.27 -46.56)" fill="#336"/>
|
||||||
|
</g>
|
||||||
|
<g id="Group_1509" data-name="Group 1509" transform="translate(333.153 192.652)">
|
||||||
|
<path id="Path_523" data-name="Path 523" d="M235.292,54.258a2.429,2.429,0,0,1-.945,2.041,4.142,4.142,0,0,1-2.573.734,6.435,6.435,0,0,1-2.7-.452V55.475a6.4,6.4,0,0,0,1.327.4,6.938,6.938,0,0,0,1.417.151,2.847,2.847,0,0,0,1.729-.432,1.447,1.447,0,0,0,.583-1.216,1.547,1.547,0,0,0-.211-.844,1.9,1.9,0,0,0-.694-.6,10.945,10.945,0,0,0-1.468-.633,4.756,4.756,0,0,1-1.97-1.166,2.6,2.6,0,0,1-.593-1.769,2.2,2.2,0,0,1,.864-1.819,3.554,3.554,0,0,1,2.272-.673,6.817,6.817,0,0,1,2.714.543l-.362,1.005a6.132,6.132,0,0,0-2.382-.513,2.3,2.3,0,0,0-1.427.392,1.29,1.29,0,0,0-.513,1.086,1.641,1.641,0,0,0,.191.844,1.789,1.789,0,0,0,.643.6,8,8,0,0,0,1.377.6,5.533,5.533,0,0,1,2.141,1.186,2.329,2.329,0,0,1,.583,1.649Z" transform="translate(-228.628 -46.408)" fill="#003c75"/>
|
||||||
|
<path id="Path_524" data-name="Path 524" d="M231.776,57.467a6.792,6.792,0,0,1-2.895-.493.448.448,0,0,1-.251-.4V55.467a.429.429,0,0,1,.2-.372.449.449,0,0,1,.422-.04,6.46,6.46,0,0,0,1.246.382,6.692,6.692,0,0,0,1.327.141,2.483,2.483,0,0,0,1.468-.352,1.011,1.011,0,0,0,.4-.864,1.139,1.139,0,0,0-.141-.6,1.546,1.546,0,0,0-.533-.452,8.992,8.992,0,0,0-1.4-.593,5.126,5.126,0,0,1-2.161-1.3,3.048,3.048,0,0,1-.7-2.061,2.632,2.632,0,0,1,1.025-2.171,4.025,4.025,0,0,1,2.553-.774,7.062,7.062,0,0,1,2.9.583.444.444,0,0,1,.241.553l-.362,1.005a.473.473,0,0,1-.241.261.43.43,0,0,1-.352,0,5.636,5.636,0,0,0-2.211-.483,1.887,1.887,0,0,0-1.156.3.862.862,0,0,0-.342.734,1.222,1.222,0,0,0,.131.623,1.4,1.4,0,0,0,.482.442,6.755,6.755,0,0,0,1.3.563,5.849,5.849,0,0,1,2.322,1.307,2.8,2.8,0,0,1,.7,1.95,2.857,2.857,0,0,1-1.116,2.392,4.576,4.576,0,0,1-2.845.824Zm-2.262-1.2a6.862,6.862,0,0,0,2.262.3,3.724,3.724,0,0,0,2.3-.643,1.989,1.989,0,0,0,.774-1.689,1.869,1.869,0,0,0-.472-1.347,5.01,5.01,0,0,0-1.96-1.076,8.118,8.118,0,0,1-1.457-.643,1.943,1.943,0,0,1-1.046-1.829,1.759,1.759,0,0,1,.694-1.447,2.748,2.748,0,0,1,1.7-.483,6.393,6.393,0,0,1,2.121.382l.06-.171a6.524,6.524,0,0,0-2.151-.352,3.137,3.137,0,0,0-2,.583,1.757,1.757,0,0,0-.694,1.468,2.162,2.162,0,0,0,.482,1.478,4.3,4.3,0,0,0,1.789,1.045,9.489,9.489,0,0,1,1.548.663,2.323,2.323,0,0,1,.844.754,2.01,2.01,0,0,1,.271,1.076,1.871,1.871,0,0,1-.764,1.568,3.27,3.27,0,0,1-2,.523,7.06,7.06,0,0,1-1.508-.161c-.271-.06-.543-.121-.794-.2v.181Z" transform="translate(-228.63 -46.41)" fill="#336"/>
|
||||||
|
</g>
|
||||||
|
<g id="Group_1510" data-name="Group 1510" transform="translate(354.724 192.803)">
|
||||||
|
<path id="Path_525" data-name="Path 525" d="M258.431,51.845a5.018,5.018,0,0,1-1.327,3.749,5.258,5.258,0,0,1-3.83,1.3H250.53V47h3.036a4.434,4.434,0,0,1,4.865,4.845Zm-1.216.04a3.96,3.96,0,0,0-.975-2.915,3.887,3.887,0,0,0-2.885-.985h-1.669v7.9h1.4a4.285,4.285,0,0,0,3.1-1.015,3.986,3.986,0,0,0,1.035-3Z" transform="translate(-250.088 -46.558)" fill="#003c75"/>
|
||||||
|
<path id="Path_526" data-name="Path 526" d="M253.277,57.336h-2.744a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h3.036a4.875,4.875,0,0,1,5.307,5.3,5.418,5.418,0,0,1-1.468,4.061,5.7,5.7,0,0,1-4.141,1.417Zm-2.292-.885h2.292a4.87,4.87,0,0,0,3.518-1.166,4.6,4.6,0,0,0,1.2-3.428,4,4,0,0,0-4.423-4.4h-2.583v9.007Zm2.111-.111h-1.4a.446.446,0,0,1-.442-.442V48a.446.446,0,0,1,.442-.442h1.669a4.319,4.319,0,0,1,3.207,1.116,4.387,4.387,0,0,1,1.1,3.227,4.522,4.522,0,0,1-1.166,3.317,4.712,4.712,0,0,1-3.408,1.136Zm-.955-.885h.955a3.888,3.888,0,0,0,2.784-.885,3.585,3.585,0,0,0,.9-2.674,3.006,3.006,0,0,0-3.418-3.448h-1.226v7.016Z" transform="translate(-250.09 -46.56)" fill="#336"/>
|
||||||
|
</g>
|
||||||
|
<g id="Group_1511" data-name="Group 1511" transform="translate(368.338 192.763)">
|
||||||
|
<path id="Path_527" data-name="Path 527" d="M271.649,56.892l-1.236-3.146h-3.971l-1.216,3.146H264.06l3.91-9.932h.965l3.89,9.932h-1.186Zm-1.588-4.182-1.146-3.066c-.151-.392-.3-.864-.462-1.427a14.326,14.326,0,0,1-.422,1.427l-1.166,3.066h3.2Z" transform="translate(-263.63 -46.518)" fill="#003c75"/>
|
||||||
|
<path id="Path_528" data-name="Path 528" d="M272.848,57.336h-1.186a.436.436,0,0,1-.412-.281l-1.126-2.865h-3.357l-1.106,2.865a.446.446,0,0,1-.412.281h-1.166a.445.445,0,0,1-.422-.6l3.91-9.932a.446.446,0,0,1,.412-.281h.965a.436.436,0,0,1,.412.281l3.89,9.932a.431.431,0,0,1-.05.412.444.444,0,0,1-.372.191Zm-.885-.885h.241L268.666,47.4H268.3l-3.558,9.047h.211l1.106-2.865a.446.446,0,0,1,.412-.281h3.971a.436.436,0,0,1,.412.281l1.126,2.865Zm-1.89-3.3h-3.2a.426.426,0,0,1-.362-.191.459.459,0,0,1-.05-.412l1.166-3.066c.171-.493.312-.955.412-1.367a.439.439,0,0,1,.422-.342.456.456,0,0,1,.442.322c.151.543.3,1.015.442,1.387l1.156,3.066a.46.46,0,0,1-.05.412.439.439,0,0,1-.362.191Zm-2.553-.885h1.92l-.925-2.463s-.02-.05-.03-.07c0,.02-.02.04-.02.07l-.935,2.473Z" transform="translate(-263.633 -46.52)" fill="#336"/>
|
||||||
|
</g>
|
||||||
|
<g id="Group_1512" data-name="Group 1512" transform="translate(382.096 192.803)">
|
||||||
|
<path id="Path_529" data-name="Path 529" d="M282.042,56.891H280.9V48.015H277.76V46.99h7.419v1.025h-3.136Z" transform="translate(-277.318 -46.558)" fill="#003c75"/>
|
||||||
|
<path id="Path_530" data-name="Path 530" d="M282.044,57.336H280.9a.446.446,0,0,1-.442-.442V48.47h-2.694a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h7.418a.446.446,0,0,1,.442.442v1.025a.446.446,0,0,1-.442.442h-2.694v8.424A.446.446,0,0,1,282.044,57.336Zm-.7-.885h.261V48.028a.446.446,0,0,1,.442-.442h2.694v-.131h-6.524v.131h2.694a.446.446,0,0,1,.442.442v8.424Z" transform="translate(-277.32 -46.56)" fill="#336"/>
|
||||||
|
</g>
|
||||||
|
<g id="Group_1513" data-name="Group 1513" transform="translate(394.504 192.763)">
|
||||||
|
<path id="Path_531" data-name="Path 531" d="M297.679,56.892l-1.237-3.146h-3.971l-1.216,3.146H290.09L294,46.96h.965l3.89,9.932h-1.186Zm-1.588-4.182-1.146-3.066c-.151-.392-.3-.864-.462-1.427a14.322,14.322,0,0,1-.422,1.427l-1.166,3.066Z" transform="translate(-289.66 -46.518)" fill="#003c75"/>
|
||||||
|
<path id="Path_532" data-name="Path 532" d="M298.878,57.336h-1.186a.436.436,0,0,1-.412-.281l-1.126-2.865H292.8l-1.106,2.865a.446.446,0,0,1-.412.281h-1.166a.445.445,0,0,1-.422-.6L293.6,46.8a.446.446,0,0,1,.412-.281h.965a.436.436,0,0,1,.412.281l3.89,9.932a.431.431,0,0,1-.05.412.457.457,0,0,1-.372.191Zm-.884-.885h.241L294.686,47.4h-.362l-3.558,9.047h.211l1.106-2.865a.446.446,0,0,1,.412-.281h3.971a.436.436,0,0,1,.412.281L298,56.452Zm-1.89-3.3h-3.2a.426.426,0,0,1-.362-.191.459.459,0,0,1-.05-.412l1.166-3.066c.171-.493.312-.955.412-1.367a.439.439,0,0,1,.422-.342.465.465,0,0,1,.442.322c.151.543.3,1.015.442,1.387l1.156,3.066a.46.46,0,0,1-.05.412.439.439,0,0,1-.362.191Zm-2.553-.885h1.92l-.925-2.463s-.02-.05-.03-.07c0,.02-.02.05-.02.07l-.935,2.473Z" transform="translate(-289.663 -46.52)" fill="#336"/>
|
||||||
|
</g>
|
||||||
|
<g id="Group_1514" data-name="Group 1514" transform="translate(409.509 192.803)">
|
||||||
|
<path id="Path_533" data-name="Path 533" d="M305.02,46.99h2.795a5.238,5.238,0,0,1,2.845.593,2.091,2.091,0,0,1,.885,1.86,2.125,2.125,0,0,1-.492,1.448,2.362,2.362,0,0,1-1.427.744v.07c1.5.261,2.252,1.045,2.252,2.372a2.551,2.551,0,0,1-.895,2.071,3.831,3.831,0,0,1-2.5.744H305.03V47Zm1.146,4.232h1.9a3.086,3.086,0,0,0,1.749-.382,1.479,1.479,0,0,0,.533-1.287,1.33,1.33,0,0,0-.593-1.206,3.736,3.736,0,0,0-1.89-.372h-1.689v3.237Zm0,.975v3.7h2.061a2.915,2.915,0,0,0,1.8-.462,1.716,1.716,0,0,0,.6-1.448,1.552,1.552,0,0,0-.623-1.357,3.289,3.289,0,0,0-1.88-.432h-1.97Z" transform="translate(-304.588 -46.558)" fill="#003c75"/>
|
||||||
|
<path id="Path_534" data-name="Path 534" d="M308.48,57.336h-3.448a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h2.794a5.644,5.644,0,0,1,3.1.663A2.51,2.51,0,0,1,312,49.455a2.6,2.6,0,0,1-.593,1.739,2.753,2.753,0,0,1-.513.452,2.559,2.559,0,0,1,1.447,2.443,2.977,2.977,0,0,1-1.055,2.413,4.254,4.254,0,0,1-2.795.844Zm-3.006-.885h3.006a3.365,3.365,0,0,0,2.221-.643,2.109,2.109,0,0,0,.734-1.729,1.679,1.679,0,0,0-.965-1.659,2.022,2.022,0,0,1,.623,1.568,2.14,2.14,0,0,1-.784,1.809,3.341,3.341,0,0,1-2.071.553h-2.061a.446.446,0,0,1-.442-.442v-3.7a.446.446,0,0,1,.442-.442h1.97a5.693,5.693,0,0,1,1.066.09.341.341,0,0,1-.02-.141v-.131a5.282,5.282,0,0,1-1.116.1h-1.9a.446.446,0,0,1-.442-.442V48.007a.446.446,0,0,1,.442-.442h1.689A4.121,4.121,0,0,1,310,48a1.726,1.726,0,0,1,.8,1.578,2.12,2.12,0,0,1-.372,1.307,1.432,1.432,0,0,0,.292-.261,1.7,1.7,0,0,0,.382-1.166,1.633,1.633,0,0,0-.683-1.488,4.919,4.919,0,0,0-2.6-.513h-2.352v9.007Zm1.146-.985h1.618a2.55,2.55,0,0,0,1.538-.372,1.282,1.282,0,0,0,.432-1.1,1.043,1.043,0,0,0-.432-.985,2.943,2.943,0,0,0-1.628-.352h-1.528v2.815Zm0-4.674h1.448a2.6,2.6,0,0,0,1.5-.3,1.074,1.074,0,0,0,.352-.925.861.861,0,0,0-.382-.824,3.2,3.2,0,0,0-1.659-.3h-1.246v2.352Z" transform="translate(-304.59 -46.56)" fill="#336"/>
|
||||||
|
</g>
|
||||||
|
<g id="Group_1515" data-name="Group 1515" transform="translate(421.986 192.763)">
|
||||||
|
<path id="Path_535" data-name="Path 535" d="M325.019,56.892l-1.236-3.146h-3.971L318.6,56.892H317.43l3.91-9.932h.965l3.89,9.932h-1.186Zm-1.588-4.182-1.146-3.066c-.151-.392-.3-.864-.463-1.427a14.318,14.318,0,0,1-.422,1.427l-1.166,3.066h3.2Z" transform="translate(-317 -46.518)" fill="#003c75"/>
|
||||||
|
<path id="Path_536" data-name="Path 536" d="M326.218,57.336h-1.186a.436.436,0,0,1-.412-.281l-1.126-2.865h-3.357l-1.106,2.865a.446.446,0,0,1-.412.281h-1.166a.445.445,0,0,1-.422-.6l3.91-9.932a.446.446,0,0,1,.412-.281h.965a.436.436,0,0,1,.412.281l3.89,9.932a.431.431,0,0,1-.05.412.457.457,0,0,1-.372.191Zm-.885-.885h.241L322.026,47.4h-.362l-3.559,9.047h.211l1.106-2.865a.446.446,0,0,1,.412-.281h3.971a.436.436,0,0,1,.412.281l1.126,2.865Zm-1.89-3.3h-3.2a.426.426,0,0,1-.362-.191.46.46,0,0,1-.05-.412L321,49.485c.171-.493.312-.955.412-1.367a.439.439,0,0,1,.422-.342.456.456,0,0,1,.442.322c.151.543.3,1.015.442,1.387l1.156,3.066a.46.46,0,0,1-.05.412.439.439,0,0,1-.362.191Zm-2.553-.885h1.92l-.925-2.463s-.02-.05-.03-.07c0,.02-.02.05-.02.07L320.9,52.28Z" transform="translate(-317.003 -46.52)" fill="#336"/>
|
||||||
|
</g>
|
||||||
|
<g id="Group_1516" data-name="Group 1516" transform="translate(436.338 192.662)">
|
||||||
|
<path id="Path_537" data-name="Path 537" d="M337.952,54.258a2.43,2.43,0,0,1-.945,2.041,4.079,4.079,0,0,1-2.573.734,6.436,6.436,0,0,1-2.7-.452V55.475a6.4,6.4,0,0,0,1.327.4,6.939,6.939,0,0,0,1.417.151A2.848,2.848,0,0,0,336.2,55.6a1.422,1.422,0,0,0,.583-1.216,1.547,1.547,0,0,0-.211-.844,1.9,1.9,0,0,0-.694-.6,10.945,10.945,0,0,0-1.468-.633,4.756,4.756,0,0,1-1.97-1.166,2.578,2.578,0,0,1-.593-1.769,2.2,2.2,0,0,1,.865-1.819,3.554,3.554,0,0,1,2.272-.673,6.816,6.816,0,0,1,2.714.543l-.362,1.005a6.131,6.131,0,0,0-2.382-.513,2.3,2.3,0,0,0-1.427.392,1.29,1.29,0,0,0-.513,1.086,1.641,1.641,0,0,0,.191.844,1.787,1.787,0,0,0,.643.6,8.367,8.367,0,0,0,1.377.6,5.531,5.531,0,0,1,2.141,1.186,2.329,2.329,0,0,1,.583,1.649Z" transform="translate(-331.278 -46.418)" fill="#003c75"/>
|
||||||
|
<path id="Path_538" data-name="Path 538" d="M334.426,57.467a6.792,6.792,0,0,1-2.9-.493.448.448,0,0,1-.251-.4V55.467a.429.429,0,0,1,.2-.372.449.449,0,0,1,.422-.04,6.458,6.458,0,0,0,1.246.382,6.692,6.692,0,0,0,1.327.141,2.483,2.483,0,0,0,1.468-.352,1,1,0,0,0,.4-.854,1.138,1.138,0,0,0-.141-.6,1.546,1.546,0,0,0-.533-.452,9.455,9.455,0,0,0-1.4-.593,5.127,5.127,0,0,1-2.161-1.3,3.049,3.049,0,0,1-.7-2.061,2.632,2.632,0,0,1,1.025-2.171,4.024,4.024,0,0,1,2.553-.774,7.062,7.062,0,0,1,2.9.583.444.444,0,0,1,.241.553l-.362,1.005a.473.473,0,0,1-.241.261.429.429,0,0,1-.352,0,5.637,5.637,0,0,0-2.212-.482,1.887,1.887,0,0,0-1.156.3.862.862,0,0,0-.342.734,1.224,1.224,0,0,0,.131.623,1.4,1.4,0,0,0,.482.442,6.758,6.758,0,0,0,1.3.563,5.849,5.849,0,0,1,2.322,1.307,2.8,2.8,0,0,1,.7,1.95,2.858,2.858,0,0,1-1.116,2.392,4.577,4.577,0,0,1-2.845.824Zm-2.262-1.2a6.861,6.861,0,0,0,2.262.3,3.789,3.789,0,0,0,2.3-.633,1.989,1.989,0,0,0,.774-1.689,1.869,1.869,0,0,0-.472-1.347,5.012,5.012,0,0,0-1.96-1.076,8.11,8.11,0,0,1-1.458-.643,1.943,1.943,0,0,1-1.045-1.829,1.778,1.778,0,0,1,.683-1.447,2.748,2.748,0,0,1,1.7-.483,6.394,6.394,0,0,1,2.121.382l.06-.171a6.524,6.524,0,0,0-2.151-.352,3.137,3.137,0,0,0-2,.583,1.757,1.757,0,0,0-.694,1.468,2.161,2.161,0,0,0,.483,1.478,4.294,4.294,0,0,0,1.789,1.045,9.486,9.486,0,0,1,1.548.663,2.322,2.322,0,0,1,.844.754,2.01,2.01,0,0,1,.271,1.076,1.871,1.871,0,0,1-.764,1.568,3.27,3.27,0,0,1-2,.523,7.059,7.059,0,0,1-1.508-.161c-.271-.06-.543-.121-.794-.2v.181Z" transform="translate(-331.28 -46.42)" fill="#336"/>
|
||||||
|
</g>
|
||||||
|
<g id="Group_1517" data-name="Group 1517" transform="translate(449.456 192.803)">
|
||||||
|
<path id="Path_539" data-name="Path 539" d="M350.289,56.891H344.77V47h5.519v1.025h-4.363v3.187h4.1v1.015h-4.1v3.639h4.363Z" transform="translate(-344.328 -46.558)" fill="#003c75"/>
|
||||||
|
<path id="Path_540" data-name="Path 540" d="M350.291,57.336h-5.519a.446.446,0,0,1-.442-.442V47a.446.446,0,0,1,.442-.442h5.519a.446.446,0,0,1,.442.442v1.025a.446.446,0,0,1-.442.442h-3.92v2.3h3.659a.446.446,0,0,1,.442.442v1.015a.446.446,0,0,1-.442.442H346.37v2.754h3.92a.446.446,0,0,1,.442.442v1.025A.446.446,0,0,1,350.291,57.336Zm-5.076-.885h4.624v-.141h-3.92a.446.446,0,0,1-.442-.442V52.229a.446.446,0,0,1,.442-.442h3.659v-.131h-3.659a.446.446,0,0,1-.442-.442V48.028a.446.446,0,0,1,.442-.442h3.92v-.131h-4.624v9.007Z" transform="translate(-344.33 -46.56)" fill="#336"/>
|
||||||
|
</g>
|
||||||
|
</g>
|
||||||
|
</g>
|
||||||
|
<script xmlns=""/></svg>
|
||||||
|
After Width: | Height: | Size: 35 KiB |
184
src/pif_compiler/functions/resources/injectableHeader.html
Normal file
@ -0,0 +1,184 @@
<!-- Start of Injectable ECHA Header Block (v7 - Dynamic Data) -->
<style>
/* ECHA Header Styles - Based on V5/V6 */
.echa-header-injected { /* Wrapper class */
    width: 100%;
    box-shadow: 0 1px 3px rgba(0, 0, 0, 0.1), 0 1px 2px rgba(0,0,0,0.06);
    font-family: system-ui, -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, 'Open Sans', 'Helvetica Neue', sans-serif;
    box-sizing: border-box;
    background-color: #ffffff;
    line-height: 1.4;
}
.echa-header-injected *, .echa-header-injected *::before, .echa-header-injected *::after {
    box-sizing: inherit;
}

/* Top white bar - das-top-nav */
.echa-header-injected .das-top-nav {
    background-color: #ffffff;
    display: flex;
    align-items: stretch;
    padding: 8px 25px;
    border-bottom: 1px solid #e7e7e7;
    min-height: 55px;
}

.echa-header-injected .logo-container {
    display: flex;
    align-items: center;
    gap: 20px;
}

.echa-header-injected .logo-main img {
    height: 38px;
    width: auto;
    display: block;
    border: 0;
}

.echa-header-injected .logo-part-of {
    display: flex;
    align-items: center;
    padding-left: 20px;
    border-left: 1px solid #cccccc;
    height: 100%;
}

.echa-header-injected .logo-part-of img {
    height: 18px;
    width: auto;
    display: block;
    border: 0;
}

/* Bottom blue bar - das-primary-header_wrapper */
.echa-header-injected .das-primary-header_wrapper {
    background-color: #005487;
    background-image: linear-gradient(to bottom, rgba(255, 255, 255, 0.08), rgba(0, 0, 0, 0.05));
    color: #ffffff;
    padding: 12px 25px;
    display: flex;
    align-items: center;
    gap: 15px;
}

.echa-header-injected .das-primary-header-info {
    flex-grow: 1;
    min-width: 0; /* Prevent flex item from overflowing */
}

/* Style for the substance link */
.echa-header-injected .substance-link {
    color: #ffffff;
    text-decoration: none;
    display: block; /* Makes the whole H2 area clickable */
}
.echa-header-injected .substance-link:hover,
.echa-header-injected .substance-link:focus {
    text-decoration: underline;
}

.echa-header-injected .das-primary-header-info h2 {
    font-size: 1.5em; /* Set your desired FIXED font size */
    margin: 0 0 4px 0;
    line-height: 1.2; /* This will control spacing between lines if it wraps */
    color: #ffffff;
    font-weight: 600;
    width: 100%; /* Constrains the text horizontally */

    /* --- REMOVED ---
    white-space: nowrap;
    overflow: hidden;
    text-overflow: ellipsis;
    */

    /* --- ADDED (Recommended) --- */
    white-space: normal; /* Explicitly allow wrapping (this is the default, but good for clarity) */
    overflow-wrap: break-word; /* Helps break long words without spaces */
    /* word-break: break-word; Alternative if overflow-wrap doesn't catch all cases */

    /* Ensure overflow is visible (default, but explicit) */
    overflow: visible;
}

.echa-header-injected .das-primary-header-info_details {
    display: flex;
    align-items: center;
    gap: 18px;
    flex-wrap: wrap;
}

.echa-header-injected .item {
    display: flex;
    align-items: baseline;
    position: relative;
}

.echa-header-injected .item + .item::before {
    content: '•';
    color: #f5a623;
    font-weight: bold;
    font-size: 1.1em;
    line-height: 1;
    display: inline-block;
    margin-right: 18px;
}

.echa-header-injected .item label {
    font-size: 0.85em;
    color: #e0eaf1;
    margin-right: 8px;
    font-weight: 400;
}

.echa-header-injected .item span {
    font-size: 0.95em;
    color: #ffffff;
    font-weight: bold;
}

/* Minimal reset */
.echa-header-injected h2, .echa-header-injected span, .echa-header-injected label, .echa-header-injected div {
    margin: 0; padding: 0;
}
.echa-header-injected a { color: inherit; text-decoration: none; } /* Basic reset for any links */

</style>

<header class="echa-header-injected" id="pdf-custom-header">
    <div class="das-top-nav">
        <div class="logo-container">
            <div class="logo-main">
                <!-- Logo link can be kept static or made dynamic if needed -->
                <a title="ECHA Chemicals Database" href="/">
                    <img height="38" alt="ECHA Chemicals Database" src="##ECHA_CHEM_LOGO_SRC##">
                </a>
            </div>
            <div class="logo-part-of">
                <a title="visit ECHA website" target="_blank" rel="noopener noreferrer" href="https://echa.europa.eu/">
                    <img height="18" alt="European Chemicals Agency" src="##ECHA_LOGO_SRC##">
                </a>
            </div>
        </div>
    </div>
    <div class="das-primary-header_wrapper">
        <div class="das-primary-header-info">
            <!-- ==== DYNAMIC CONTENT START ==== -->
            <a href="##SUBSTANCE_LINK##" title="View substance details: ##SUBSTANCE_NAME##" class="substance-link">
                <h2 class="das-text-truncate">##SUBSTANCE_NAME##</h2>
            </a>
            <div class="das-primary-header-info_details">
                <div class="item">
                    <label>EC number</label>
                    <span>##EC_NUMBER##</span>
                </div>
                <div class="item">
                    <label>CAS number</label>
                    <span class="das-text-truncate">##CAS_NUMBER##</span>
                </div>
            </div>
            <!-- ==== DYNAMIC CONTENT END ==== -->
        </div>
    </div>
</header>
<!-- End of Injectable ECHA Header Block (v7) -->
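The `##...##` markers in the header above are plain-text placeholders that the PDF pipeline is expected to substitute before injecting the block into a page. A minimal sketch of that substitution, assuming simple string replacement (the function and field names here are hypothetical, not the project's actual API):

```python
def render_header(template: str, substance: dict) -> str:
    """Fill the ##...## placeholders of the injectable header template."""
    replacements = {
        "##SUBSTANCE_NAME##": substance.get("name", ""),
        "##SUBSTANCE_LINK##": substance.get("link", "#"),
        "##EC_NUMBER##": substance.get("ec", "-"),
        "##CAS_NUMBER##": substance.get("cas", "-"),
    }
    # Replace each token wherever it appears (names may occur twice, e.g. in title and h2)
    for token, value in replacements.items():
        template = template.replace(token, value)
    return template

# Tiny template excerpt used for illustration only
html = render_header(
    "<h2>##SUBSTANCE_NAME##</h2><span>##CAS_NUMBER##</span>",
    {"name": "Glycerol", "cas": "56-81-5"},
)
print(html)  # → <h2>Glycerol</h2><span>56-81-5</span>
```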
95
src/pif_compiler/services/__init__.py
Normal file
@ -0,0 +1,95 @@
"""
PIF Compiler - Services Layer

This module contains business logic and external API integrations for
regulatory data sources used in cosmetic safety assessment.

Modules:
- echa_find: ECHA dossier search functionality
- echa_process: ECHA data extraction and processing (HTML -> Markdown -> JSON -> DataFrame)
- echa_pdf: PDF generation from ECHA dossiers with Playwright
- cosing_service: COSING database integration (EU cosmetic ingredients)
- pubchem_service: PubChem API integration for chemical properties
- common_log: Centralized logging configuration
- mongo_conn: MongoDB client connection
"""

# ECHA Services
from pif_compiler.services.echa_find import (
    search_dossier,
)

from pif_compiler.services.echa_process import (
    echaExtract,
    echaExtract_multi,
    echaExtract_specific,
    echaExtract_local,
    echa_noael_ld50,
    echa_noael_ld50_multi,
    echaPage_to_md,
    openEchaPage,
    markdown_to_json_raw,
    clean_json,
    json_to_dataframe,
    filter_dataframe_by_dict,
)

from pif_compiler.services.echa_pdf import (
    generate_pdf_with_header_and_cleanup,
    search_generate_pdfs,
    svg_to_data_uri,
)

# COSING Service
from pif_compiler.services.cosing_service import (
    cosing_search,
    clean_cosing,
    parse_cas_numbers,
    identified_ingredients,
)

# PubChem Service
from pif_compiler.services.pubchem_service import (
    pubchem_dap,
    clean_property_data,
)

# Logging
from pif_compiler.services.common_log import (
    get_logger,
)

from pif_compiler.services.mongo_conn import get_client


__all__ = [
    # ECHA Find
    "search_dossier",
    # ECHA Process
    "echaExtract",
    "echaExtract_multi",
    "echaExtract_specific",
    "echaExtract_local",
    "echa_noael_ld50",
    "echa_noael_ld50_multi",
    "echaPage_to_md",
    "openEchaPage",
    "markdown_to_json_raw",
    "clean_json",
    "json_to_dataframe",
    "filter_dataframe_by_dict",
    # ECHA PDF
    "generate_pdf_with_header_and_cleanup",
    "search_generate_pdfs",
    "svg_to_data_uri",
    # COSING Service
    "cosing_search",
    "clean_cosing",
    "parse_cas_numbers",
    "identified_ingredients",
    # PubChem Service
    "pubchem_dap",
    "clean_property_data",
    # Logging
    "get_logger",
    "get_client",
]
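Among the exports above, `svg_to_data_uri` suggests that logo SVGs are embedded as data URIs (e.g. to fill the `##ECHA_LOGO_SRC##` placeholders in the injectable header). A self-contained sketch of that idea, assuming base64 encoding; the project's actual implementation may differ:

```python
import base64

def svg_to_data_uri_sketch(svg_markup: str) -> str:
    """Encode SVG markup as a data URI usable directly in an <img src="...">."""
    encoded = base64.b64encode(svg_markup.encode("utf-8")).decode("ascii")
    return f"data:image/svg+xml;base64,{encoded}"

uri = svg_to_data_uri_sketch('<svg xmlns="http://www.w3.org/2000/svg"/>')
print(uri[:26])  # → data:image/svg+xml;base64,
```

Embedding the image as a data URI avoids a second network fetch when the header is rendered for PDF generation.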
106
src/pif_compiler/services/common_log.py
Normal file
@ -0,0 +1,106 @@
"""
Common logging configuration for the PIF Compiler project.

Provides a centralized logging setup with:
- Dual outputs: separate files for errors and debug logs
- Automatic log rotation at 1MB
- Detailed formatting: timestamp - filename - function - message
"""

import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path
from typing import Optional

# Single logger instance for the entire project
_logger: Optional[logging.Logger] = None


def get_logger(
    log_dir: Optional[str] = None,
    max_bytes: int = 1_000_000,  # 1MB
    backup_count: int = 5,
    console_output: bool = True,
) -> logging.Logger:
    """
    Get the centralized logger instance for the PIF Compiler project.

    Returns the same logger instance across all modules to consolidate logs
    into single debug.log and error.log files.

    Args:
        log_dir: Directory for log files. Defaults to 'logs' in project root
        max_bytes: Maximum size of log file before rotation (default: 1MB)
        backup_count: Number of backup files to keep (default: 5)
        console_output: Whether to also output logs to console (default: True)

    Returns:
        Configured logging.Logger instance

    Example:
        >>> from pif_compiler.services.common_log import get_logger
        >>> logger = get_logger()
        >>> logger.info("Processing ingredient data")
        >>> logger.error("Failed to connect to database")
    """
    global _logger

    # Return existing logger if already configured
    if _logger is not None:
        return _logger

    logger = logging.getLogger("pif_compiler")
    logger.setLevel(logging.DEBUG)  # Capture all levels

    # Determine log directory
    if log_dir is None:
        # Default to 'logs' folder in project root
        project_root = Path(__file__).parent.parent.parent.parent
        log_dir = project_root / "logs"
    else:
        log_dir = Path(log_dir)

    # Create log directory if it doesn't exist
    log_dir.mkdir(parents=True, exist_ok=True)

    # Define log format: timestamp - filename - function - level - message
    log_format = logging.Formatter(
        fmt="%(asctime)s - %(filename)s - %(funcName)s - %(levelname)s - %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    )

    # --- DEBUG LOG HANDLER ---
    # Captures all messages (DEBUG and above)
    debug_handler = RotatingFileHandler(
        filename=log_dir / "debug.log",
        maxBytes=max_bytes,
        backupCount=backup_count,
        encoding="utf-8",
    )
    debug_handler.setLevel(logging.DEBUG)
    debug_handler.setFormatter(log_format)
    logger.addHandler(debug_handler)

    # --- ERROR LOG HANDLER ---
    # Captures WARNING, ERROR, and CRITICAL level messages
    error_handler = RotatingFileHandler(
        filename=log_dir / "error.log",
        maxBytes=max_bytes,
        backupCount=backup_count,
        encoding="utf-8",
    )
    error_handler.setLevel(logging.WARNING)
    error_handler.setFormatter(log_format)
    logger.addHandler(error_handler)

    # --- CONSOLE HANDLER (Optional) ---
    if console_output:
        console_handler = logging.StreamHandler()
        console_handler.setLevel(logging.DEBUG)  # Mirror all levels to the console
        console_handler.setFormatter(log_format)
        logger.addHandler(console_handler)

    # Store the logger instance
    _logger = logger

    return logger
|
||||||
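The handler split above (everything to `debug.log`, WARNING and up to `error.log`) can be sketched stand-alone with only the standard library. Names and the temp-dir location here are illustrative, not the service's API:

```python
import logging
import tempfile
from logging.handlers import RotatingFileHandler
from pathlib import Path

# Build the same two-file rotating setup in a throwaway directory
log_dir = Path(tempfile.mkdtemp())
logger = logging.getLogger("rotating_sketch")
logger.setLevel(logging.DEBUG)  # logger passes everything; handlers filter

fmt = logging.Formatter("%(asctime)s - %(filename)s - %(funcName)s - %(levelname)s - %(message)s")

debug_h = RotatingFileHandler(log_dir / "debug.log", maxBytes=1_000_000, backupCount=3, encoding="utf-8")
debug_h.setLevel(logging.DEBUG)      # debug.log sees every record
debug_h.setFormatter(fmt)

error_h = RotatingFileHandler(log_dir / "error.log", maxBytes=1_000_000, backupCount=3, encoding="utf-8")
error_h.setLevel(logging.WARNING)    # error.log only sees WARNING and up
error_h.setFormatter(fmt)

logger.addHandler(debug_h)
logger.addHandler(error_h)

logger.info("goes to debug.log only")
logger.error("goes to both files")
for h in (debug_h, error_h):
    h.flush()

print(sorted(p.name for p in log_dir.iterdir()))  # → ['debug.log', 'error.log']
```

Because each record is filtered per handler, `error.log` stays small and scannable while `debug.log` keeps the full trace.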
247
src/pif_compiler/services/cosing_service.py
Normal file
@ -0,0 +1,247 @@
import json as js
import re
import requests as req
from typing import Union

from pif_compiler.services.common_log import get_logger

logger = get_logger()

#region Function that parses a list of CAS numbers taken from COSING

def parse_cas_numbers(cas_string: list) -> list:
    logger.debug(f"Parsing CAS numbers from input: {cas_string}")

    # The caller guarantees that at least one CAS is present, so we can take the first string
    cas_string = cas_string[0]
    logger.debug(f"Extracted CAS string: {cas_string}")

    # Remove parentheses and their content
    cas_string = re.sub(r"\([^)]*\)", "", cas_string)
    logger.debug(f"After removing parentheses: {cas_string}")

    # Split on the various possible separators
    cas_parts = re.split(r"[/;,]", cas_string)

    # Build a list from the parts, stripping excess whitespace
    cas_list = [cas.strip() for cas in cas_parts]
    logger.debug(f"CAS list after splitting: {cas_list}")

    # Some CAS entries are separated by a double dash (--) that must also be split;
    # this has to happen in a second pass
    if len(cas_list) == 1 and "--" in cas_list[0]:
        logger.debug("Found double dash separator, splitting further")
        cas_list = [cas.strip() for cas in cas_list[0].split("--")]

    # Some entries hold an invalid value ("-"), so we find and remove them
    while "-" in cas_list:
        logger.debug("Removing invalid CAS value: '-'")
        cas_list.remove("-")

    logger.info(f"Parsed CAS numbers: {cas_list}")
    return cas_list
#endregion
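The parsing pipeline can be exercised in isolation. A minimal sketch — a stand-alone re-implementation of the same regex steps on a single string, not an import of the service:

```python
import re

def parse_cas_sketch(raw: str) -> list:
    """Re-implementation sketch of the COSING CAS parsing steps."""
    # Drop parenthesised annotations, e.g. "7732-18-5 (water)"
    raw = re.sub(r"\([^)]*\)", "", raw)
    # Split on the separators COSING uses between alternative CAS numbers
    cas_list = [p.strip() for p in re.split(r"[/;,]", raw)]
    # A lone double-dash-separated entry is split in a second pass
    if len(cas_list) == 1 and "--" in cas_list[0]:
        cas_list = [p.strip() for p in cas_list[0].split("--")]
    # Drop placeholder "-" values
    return [c for c in cas_list if c != "-"]

print(parse_cas_sketch("7732-18-5 (water) / 1310-73-2; -"))  # → ['7732-18-5', '1310-73-2']
print(parse_cas_sketch("123-45-6--78-90-1"))                 # → ['123-45-6', '78-90-1']
```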

#region Function to run a search directly against COSING

# The first argument is the string to search for, the second selects the search type
def cosing_search(text: str, mode: str = "name") -> Union[dict, None]:
    logger.info(f"Starting COSING search: text='{text}', mode='{mode}'")

    cosing_post_req = "https://api.tech.ec.europa.eu/search-api/prod/rest/search?apiKey=285a77fd-1257-4271-8507-f0c6b2961203&text=*&pageSize=100&pageNumber=1"
    agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0"

    # The default search mode is by name, whether INCI or another kind of name
    if mode == "name":
        logger.debug("Search mode: name (INCI, chemical name, etc.)")
        search_query = {
            "bool": {
                "must": [
                    {
                        "text": {
                            "query": f"{text}",
                            "fields": [
                                "inciName.exact",
                                "inciUsaName",
                                "innName.exact",
                                "phEurName",
                                "chemicalName",
                                "chemicalDescription",
                            ],
                            "defaultOperator": "AND",
                        }
                    }
                ]
            }
        }

    # Searching by CAS or EC number requires a different request payload
    elif mode in ["cas", "ec"]:
        logger.debug(f"Search mode: {mode}")
        search_query = {"bool": {"must": [{"text": {"query": f"*{text}*", "fields": ["casNo", "ecNo"]}}]}}

    # Searching by ID is needed wherever the identified ingredients must be retrieved
    elif mode == "id":
        logger.debug("Search mode: substance ID")
        search_query = {"bool": {"must": [{"term": {"substanceId": f"{text}"}}]}}

    # If the given mode is not supported, raise an error
    else:
        logger.error(f"Invalid search mode: {mode}")
        raise ValueError(f"Invalid search mode: {mode}")

    # Build the request payload
    files = {"query": ("query", js.dumps(search_query), "application/json")}
    logger.debug(f"Search query: {search_query}")

    # Send the search POST request
    try:
        logger.debug("Sending POST request to COSING API")
        risposta = req.post(
            cosing_post_req,
            headers={"User-Agent": agent, "Connection": "keep-alive"},
            files=files,
        )
        risposta.raise_for_status()
        risposta = risposta.json()

        if risposta["results"]:
            logger.info(f"COSING search successful: found {len(risposta['results'])} result(s)")
            logger.debug(f"First result substance ID: {risposta['results'][0]['metadata'].get('substanceId', 'N/A')}")
            return risposta["results"][0]["metadata"]
        else:
            # The function returns None when the search yields no results
            logger.warning(f"COSING search returned no results for text='{text}', mode='{mode}'")
            return None

    except req.exceptions.RequestException as e:
        logger.error(f"HTTP request error during COSING search: {e}")
        raise
    except (KeyError, ValueError, TypeError) as e:
        logger.error(f"Error parsing COSING response: {e}")
        raise
#endregion
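Because `cosing_search` mixes payload construction with the HTTP call, the query-building half can only be unit-tested if factored out. A sketch of that refactor — `build_cosing_query` is a hypothetical helper mirroring the three payload shapes above, testable without touching the network:

```python
import json

def build_cosing_query(text: str, mode: str = "name") -> dict:
    """Build the COSING search payload without sending it (illustrative refactor)."""
    if mode == "name":
        return {"bool": {"must": [{"text": {
            "query": text,
            "fields": ["inciName.exact", "inciUsaName", "innName.exact",
                       "phEurName", "chemicalName", "chemicalDescription"],
            "defaultOperator": "AND"}}]}}
    if mode in ("cas", "ec"):
        # Wildcards around the number match partial CAS/EC fields
        return {"bool": {"must": [{"text": {"query": f"*{text}*",
                                            "fields": ["casNo", "ecNo"]}}]}}
    if mode == "id":
        return {"bool": {"must": [{"term": {"substanceId": text}}]}}
    raise ValueError(f"Invalid search mode: {mode}")

# The multipart body then wraps the serialized query, as in cosing_search:
payload = {"query": ("query", json.dumps(build_cosing_query("7732-18-5", "cas")), "application/json")}
print(payload["query"][1])
```

Keeping the dict construction pure also makes the payload for each mode easy to inspect in a REPL before pointing it at the live API.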

#region Function to clean a COSING JSON document and return it

def clean_cosing(json: dict, full: bool = True) -> dict:
    substance_id = json.get("substanceId", ["Unknown"])[0] if json.get("substanceId") else "Unknown"
    logger.info(f"Cleaning COSING data for substance ID: {substance_id}, full={full}")

    # Define the fields of interest, split by the type of output we want for each

    string_cols = [
        "itemType",
        "nameOfCommonIngredientsGlossary",
        "inciName",
        "phEurName",
        "chemicalName",
        "innName",
        "substanceId",
        "refNo",
    ]

    list_cols = [
        "casNo",
        "ecNo",
        "functionName",
        "otherRestrictions",
        "sccsOpinion",
        "sccsOpinionUrls",
        "identifiedIngredient",
        "annexNo",
        "otherRegulations",
    ]

    # Build a single list with all the fields to loop over
    total_keys = string_cols + list_cols

    # Base URL used to build the COSING link for the substance
    base_url = "https://ec.europa.eu/growth/tools-databases/cosing/details/"
    clean_json = {}

    # Loop over all the fields of interest
    for key in total_keys:

        # Guard against fields that are missing from the API response
        json.setdefault(key, [])

        # Some fields contain a useless "<empty>" marker that only wastes space,
        # so remove it
        while "<empty>" in json[key]:
            json[key].remove("<empty>")

        # If the field should end up as a list, empty COSING lists are acceptable values
        if key in list_cols:
            value = json[key]

            # CAS and EC numbers are special cases that need extra processing
            if key in ["casNo", "ecNo"]:
                if value:
                    logger.debug(f"Processing {key}: {value}")
                    value = parse_cas_numbers(value)

            # When identifiedIngredient entries are present, resolve them directly
            # into the output JSON, but only when the "full" flag is true
            elif key == "identifiedIngredient":
                if full:
                    if value:
                        logger.debug(f"Processing {len(value)} identified ingredient(s)")
                        value = identified_ingredients(value)

            clean_json[key] = value

        else:
            # This field name was too long, so it is shortened
            if key == "nameOfCommonIngredientsGlossary":
                nKey = "commonName"
            # The other fields keep their original name
            else:
                nKey = key

            # We want a plain string in output, and indexing an empty list would fail,
            # so first check that the COSING list actually contains values
            if json[key]:
                clean_json[nKey] = json[key][0]
            else:
                clean_json[nKey] = ""

    # The cosingUrl field does not exist yet; build it by joining the substance ID to the base URL
    clean_json["cosingUrl"] = f"{base_url}{json['substanceId'][0]}"
    logger.debug(f"Generated COSING URL: {clean_json['cosingUrl']}")
    logger.info(f"Successfully cleaned COSING data for substance ID: {substance_id}")

    return clean_json
#endregion
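The string-column rule above — drop `<empty>` markers, then take the first value or fall back to an empty string — can be isolated into a tiny pure function. A sketch with a hypothetical helper name, not part of the service:

```python
def flatten_first(values: list) -> str:
    """COSING-style flattening: first real value of the list, or '' when none remain."""
    cleaned = [v for v in values if v != "<empty>"]
    return cleaned[0] if cleaned else ""

print(flatten_first(["<empty>", "AQUA"]))  # → AQUA
print(repr(flatten_first(["<empty>"])))    # → ''
```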

#region Function to complete, when needed, a COSING JSON document

def identified_ingredients(id_list: list) -> list:
    logger.info(f"Processing {len(id_list)} identified ingredient(s): {id_list}")

    identified = []

    # Run a search for each of the ingredients in identifiedIngredient
    for ingredient_id in id_list:
        logger.debug(f"Searching for identified ingredient with ID: {ingredient_id}")

        ingredient = cosing_search(ingredient_id, "id")

        if ingredient:
            # Clean the JSON documents just retrieved
            ingredient = clean_cosing(ingredient, full=False)

            # Store the cleaned document in the list
            identified.append(ingredient)
            logger.debug(f"Successfully added identified ingredient ID: {ingredient_id}")
        else:
            logger.warning(f"Could not find identified ingredient with ID: {ingredient_id}")

    # Once the list is populated with the identifiedIngredient objects, return it
    logger.info(f"Successfully processed {len(identified)} of {len(id_list)} identified ingredient(s)")
    return identified
#endregion

if __name__ == "__main__":
    print(cosing_search("Water", "name"))
0
src/pif_compiler/services/debug_echa_find.py
Normal file
223
src/pif_compiler/services/echa_find.py
Normal file
@ -0,0 +1,223 @@
import requests
import urllib.parse
import re as standardre
import json
from bs4 import BeautifulSoup
from datetime import datetime
from pif_compiler.services.common_log import get_logger

logger = get_logger()

# Function to look up a dossier given a CAS number, a substance name or an EC number
def search_dossier(substance, input_type='rmlCas'):
    results = {}  # The dictionary returned at the end

    # Part one: obtain rmlId and rmlName.
    # The first step is a search for the substance name, percent-encoded via urllib
    req_0 = requests.get(
        "https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText="
        + urllib.parse.quote(substance)  # must be percent-encoded for the web
    )

    logger.info(f'echaFind.search_dossier(). searching "{substance}"')

    req_0_json = req_0.json()
    try:
        # Extract the fields we need from the response
        rmlId = req_0_json["items"][0]["substanceIndex"]["rmlId"]
        rmlName = req_0_json["items"][0]["substanceIndex"]["rmlName"]
        rmlCas = req_0_json["items"][0]["substanceIndex"]["rmlCas"]
        rmlEc = req_0_json["items"][0]["substanceIndex"]["rmlEc"]

        results['search_response'] = f"https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText={urllib.parse.quote(substance)}"
        results["rmlId"] = rmlId
        results["rmlName"] = rmlName
        results["rmlCas"] = rmlCas
        results["rmlEc"] = rmlEc

        logger.info(
            f"echaFind.search_dossier(). found substance on ECHA. rmlId: '{rmlId}', rmlName: '{rmlName}', rmlCas: '{rmlCas}'"
        )
    except (KeyError, IndexError):
        logger.info(
            f"echaFind.search_dossier(). could not find substance for '{substance}'"
        )
        return False

    # Update: in some cases, searching for a CAS could instead match a substance whose EC number
    # equals the CAS given in input. We now check that the substance found really has a CAS equal
    # to the one given in input. It is also possible to search by rmlName (substance name) or
    # EC number (rmlEc): just specify in input_type what you are searching for.
    if results[input_type] != substance:
        logger.error(f'echa.echaFind.search_dossier(): results[{input_type}] "{results[input_type]}" is not equal to "{substance}".')
        return f'search_error. results[{input_type}] ("{results[input_type]}") is not equal to "{substance}". Maybe you specified the wrong input_type. Check the results here: https://chem.echa.europa.eu/api-substance/v1/substance?pageIndex=1&pageSize=100&searchText={urllib.parse.quote(substance)}'

    # Part two: search the ECHA dossier site, building a link with the ID obtained above.
    req_1_url = (
        "https://chem.echa.europa.eu/api-dossier-list/v1/dossier?pageIndex=1&pageSize=100&rmlId="
        + rmlId
        + "&registrationStatuses=Active"
    )  # Search among the active dossiers first.

    req_1 = requests.get(req_1_url)
    req_1_json = req_1.json()

    # If no active dossiers exist, look for the inactive ones
    if req_1_json["items"] == []:
        logger.info(
            f"echaFind.search_dossier(). could not find active dossier for '{substance}'. Proceeding to search in the inactive ones."
        )
        req_1_url = (
            "https://chem.echa.europa.eu/api-dossier-list/v1/dossier?pageIndex=1&pageSize=100&rmlId="
            + rmlId
            + "&registrationStatuses=Inactive"
        )
        req_1 = requests.get(req_1_url)
        req_1_json = req_1.json()
        if req_1_json["items"] == []:
            logger.info(
                f"echaFind.search_dossier(). could not find inactive dossiers for '{substance}'"
            )  # Found neither inactive nor active dossiers
            return False
        else:
            logger.info(
                f"echaFind.search_dossier(). found inactive dossiers for '{rmlName}'"
            )
            results["dossierType"] = "Inactive"

    else:
        logger.info(
            f"echaFind.search_dossier(). found active dossiers for '{substance}'"
        )
        results["dossierType"] = "Active"

    # These are the two values we needed
    assetExternalId = req_1_json["items"][0]["assetExternalId"]

    # UPDATE: also fetch the last-modified date. It tells us whether the files already
    # downloaded locally are up to date, by comparing the scraping date with the
    # last update date (before or after).
    try:
        lastUpdateDate = req_1_json["items"][0]["lastUpdatedDate"]
        datetime_object = datetime.fromisoformat(lastUpdateDate.replace('Z', '+00:00'))  # Handle 'Z' if present; older Python versions may otherwise fail
        lastUpdateDate = datetime_object.date().isoformat()
        results['lastUpdateDate'] = lastUpdateDate
    except (KeyError, IndexError, ValueError):
        logger.error("echa.echaFind(). Could not find lastUpdateDate for the dossier")

    rootKey = req_1_json["items"][0]["rootKey"]

    # HTML SECTION

    # Part three: use assetExternalId.
    # With the assetExternalId it is possible to reach the dossier's main page.
    # From this page we must scrape the ID of the toxicological summary, IF IT EXISTS.
    results["index"] = (
        "https://chem.echa.europa.eu/html-pages/" + assetExternalId + "/index.html"
    )
    results["index_js"] = (
        f"https://chem.echa.europa.eu/{rmlId}/dossier-view/{assetExternalId}"
    )

    req_2 = requests.get(
        "https://chem.echa.europa.eu/html-pages/" + assetExternalId + "/index.html"
    )
    index = BeautifulSoup(req_2.text, "html.parser")

    # Part four: get the ID of the toxicological summary from index.html.
    # In all that HTML only one div matters. BeautifulSoup struggles when divs are
    # nested too deeply, so a combination of BeautifulSoup and regex is used.
    div = index.find_all("div", id=["id_7_Toxicologicalinformation"])
    str_div = str(div)
    str_div = str_div.split("</div>")[0]

    # UIC is the ID of the toxicological summary
    uic_found = False
    uic_match = standardre.search('href="([^"]+)"', str_div)  # a regex to find the href we need
    if uic_match is None:
        logger.info(
            "echaFind.search_dossier(). Could not find 'id_7_Toxicologicalinformation' in the body"
        )
    else:
        UIC = uic_match.group(1)
        uic_found = True

    # Acute toxicity
    acute_toxicity_found = False
    div_acute_toxicity = index.find_all("div", id=["id_72_AcuteToxicity"])
    if div_acute_toxicity:
        for div in div_acute_toxicity:
            try:
                a = div.find_all("a", href=True)[0]
                acute_toxicity_id = standardre.search('href="([^"]+)"', str(a)).group(1)
                acute_toxicity_found = True
            except (IndexError, AttributeError):
                logger.info(
                    f"echaFind.search_dossier(). No acute_toxicity_id found from index for {substance}"
                )

    # Repeated dose toxicity
    repeated_dose_found = False
    div_repeated_dose = index.find_all("div", id=["id_75_Repeateddosetoxicity"])
    if div_repeated_dose:
        for div in div_repeated_dose:
            try:
                a = div.find_all("a", href=True)[0]
                repeated_dose_id = standardre.search('href="([^"]+)"', str(a)).group(1)
                repeated_dose_found = True
            except (IndexError, AttributeError):
                logger.info(
                    f"echaFind.search_dossier(). No repeated_dose_id found from index for {substance}"
                )

    # Part five: build the links to the toxicological dossier HTML and return the content

    if acute_toxicity_found:
        acute_toxicity_link = (
            "https://chem.echa.europa.eu/html-pages/"
            + assetExternalId
            + "/documents/"
            + acute_toxicity_id
            + ".html"
        )
        results["AcuteToxicity"] = acute_toxicity_link
        # There are two different links: one is plain, ugly HTML that holds the readable
        # information, while the js one is the nicer version presented to the user,
        # used to build the pretty PDF
        results["AcuteToxicity_js"] = (
            f"https://chem.echa.europa.eu/{rmlId}/dossier-view/{assetExternalId}/{acute_toxicity_id}"
        )

    if uic_found:
        # UIC is the ID of the toxicological summary
        final_url = (
            "https://chem.echa.europa.eu/html-pages/"
            + assetExternalId
            + "/documents/"
            + UIC
            + ".html"
        )
        results["ToxSummary"] = final_url
        results["ToxSummary_js"] = (
            f"https://chem.echa.europa.eu/{rmlId}/dossier-view/{assetExternalId}/{UIC}"
        )

    if repeated_dose_found:
        results["RepeatedDose"] = (
            "https://chem.echa.europa.eu/html-pages/"
            + assetExternalId
            + "/documents/"
            + repeated_dose_id
            + ".html"
        )
        results["RepeatedDose_js"] = (
            f"https://chem.echa.europa.eu/{rmlId}/dossier-view/{assetExternalId}/{repeated_dose_id}"
        )

    json_formatted_str = json.dumps(results)
    logger.info(f"echaFind.search_dossier() OK. output: {json_formatted_str}")
    return results


if __name__ == "__main__":
    search_dossier("100-41-4", input_type='rmlCas')
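Hand-concatenating query strings is what let the `&reg` → `®` mojibake creep into the `registrationStatuses` parameter; `urllib.parse.urlencode` sidesteps the problem entirely. A sketch of building the dossier-list URL that way (parameter names come from the code above; the sample `rml_id` value is made up):

```python
from urllib.parse import urlencode

def dossier_list_url(rml_id: str, status: str = "Active") -> str:
    """Build the ECHA dossier-list URL without manual '&' concatenation."""
    base = "https://chem.echa.europa.eu/api-dossier-list/v1/dossier"
    params = {
        "pageIndex": 1,
        "pageSize": 100,
        "rmlId": rml_id,
        "registrationStatuses": status,
    }
    return f"{base}?{urlencode(params)}"

print(dossier_list_url("100.002.244"))
```

Since `urlencode` escapes each value and joins the pairs with literal `&`, no editor or clipboard can silently fold `&reg` into an entity.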
467
src/pif_compiler/services/echa_pdf.py
Normal file
@ -0,0 +1,467 @@
import os
import base64
import traceback
import logging
import datetime
import pandas as pd
# import time  # Keep if you use page.wait_for_timeout
from playwright.sync_api import sync_playwright, TimeoutError  # Catch specific errors
from pif_compiler.services.echa_find import search_dossier
import requests

# --- Basic Logging Setup (Commented Out) ---
# # Configure logging - uncomment and customize level/handler as needed
# logging.basicConfig(
#     level=logging.INFO,  # Or DEBUG for more details
#     format='%(asctime)s - %(levelname)s - %(message)s',
#     # filename='pdf_generator.log',  # Optional: Log to a file
#     # filemode='a'
# )
# --- End Logging Setup ---


def svg_to_data_uri(svg_path):
    try:
        if not os.path.exists(svg_path):
            # logging.error(f"SVG file not found: {svg_path}")  # Example logging
            raise FileNotFoundError(f"SVG file not found: {svg_path}")
        with open(svg_path, 'rb') as f:
            svg_content = f.read()
        encoded_svg = base64.b64encode(svg_content).decode('utf-8')
        return f"data:image/svg+xml;base64,{encoded_svg}"
    except Exception as e:
        print(f"Error converting SVG {svg_path}: {e}")
        # logging.error(f"Error converting SVG {svg_path}: {e}", exc_info=True)  # Example logging
        return None
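The encoding step inside `svg_to_data_uri` can be verified without a file on disk. A minimal sketch that works on raw bytes instead of a path — `svg_bytes_to_data_uri` is an illustrative variant, not part of the module:

```python
import base64

def svg_bytes_to_data_uri(svg_content: bytes) -> str:
    """Encode raw SVG bytes as a data: URI suitable for an <img src> attribute."""
    encoded = base64.b64encode(svg_content).decode("utf-8")
    return f"data:image/svg+xml;base64,{encoded}"

uri = svg_bytes_to_data_uri(b'<svg xmlns="http://www.w3.org/2000/svg"/>')
print(uri[:26])  # → data:image/svg+xml;base64,
```

Embedding logos this way keeps the injected header self-contained, so Playwright never needs network or filesystem access while rendering the PDF header.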
|
||||||
|
# --- JavaScript Expressions ---
|
||||||
|
|
||||||
|
# Define the cleanup logic as an immediately-invoked arrow function expression
|
||||||
|
# NOTE: .das-block_empty removal is currently disabled as per previous step
|
||||||
|
cleanup_js_expression = """
|
||||||
|
() => {
|
||||||
|
console.log('Running cleanup JS (DISABLED .das-block_empty removal)...');
|
||||||
|
let totalRemoved = 0;
|
||||||
|
|
||||||
|
// Example 1: Remove sections explicitly marked as empty (Currently Disabled)
|
||||||
|
// const emptyBlocks = document.querySelectorAll('.das-block_empty');
|
||||||
|
// emptyBlocks.forEach(el => {
|
||||||
|
// if (el && el.parentNode) {
|
||||||
|
// console.log(`Removing '.das-block_empty' block with ID: ${el.id || 'N/A'}`);
|
||||||
|
// el.remove();
|
||||||
|
// totalRemoved++;
|
||||||
|
// }
|
||||||
|
// });
|
||||||
|
|
||||||
|
// Add other specific cleanup logic here if needed
|
||||||
|
|
||||||
|
console.log(`Cleanup script removed ${totalRemoved} elements (DISABLED .das-block_empty removal).`);
|
||||||
|
return totalRemoved; // Return the count
|
||||||
|
}
|
||||||
|
"""
|
||||||
|
# --- End JavaScript Expressions ---
|
||||||
|
|
||||||
|
|
||||||
|
def generate_pdf_with_header_and_cleanup(
|
||||||
|
url,
|
||||||
|
pdf_path,
|
||||||
|
substance_name,
|
||||||
|
substance_link,
|
||||||
|
ec_number,
|
||||||
|
cas_number,
|
||||||
|
header_template_path=r"src\func\resources\injectableHeader.html",
|
||||||
|
echa_chem_logo_path=r"src\func\resources\echa_chem_logo.svg",
|
||||||
|
echa_logo_path=r"src\func\resources\ECHA_Logo.svg"
|
||||||
|
) -> bool: # Added return type hint
|
||||||
|
"""
|
||||||
|
Generates a PDF with a dynamic header and optionally removes empty sections.
|
||||||
|
Provides basic logging (commented out) and returns True/False for success/failure.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
url (str): The target URL OR local HTML file path.
|
||||||
|
pdf_path (str): The output PDF path.
|
||||||
|
substance_name (str): The name of the chemical substance.
|
||||||
|
substance_link (str): The URL the substance name should link to (in header).
|
||||||
|
ec_number (str): The EC number for the substance.
|
||||||
|
cas_number (str): The CAS number for the substance.
|
||||||
|
header_template_path (str): Path to the HTML header template file.
|
||||||
|
echa_chem_logo_path (str): Path to the echa_chem_logo.svg file.
|
||||||
|
echa_logo_path (str): Path to the ECHA_Logo.svg file.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
bool: True if the PDF was generated successfully, False otherwise.
|
||||||
|
"""
|
||||||
|
final_header_html = None
|
||||||
|
# logging.info(f"Starting PDF generation for URL: {url} to path: {pdf_path}") # Example logging
|
||||||
|
|
||||||
|
# --- 1. Prepare Header HTML ---
|
||||||
|
try:
|
||||||
|
# logging.debug(f"Reading header template from: {header_template_path}") # Example logging
|
||||||
|
print(f"Reading header template from: {header_template_path}")
|
||||||
|
if not os.path.exists(header_template_path):
|
||||||
|
raise FileNotFoundError(f"Header template file not found: {header_template_path}")
|
||||||
|
with open(header_template_path, 'r', encoding='utf-8') as f:
|
||||||
|
header_template_content = f.read()
|
||||||
|
if not header_template_content:
|
||||||
|
raise ValueError("Header template file is empty.")
|
||||||
|
|
||||||
|
# logging.debug("Converting logos...") # Example logging
|
||||||
|
print("Converting logos...")
|
||||||
|
logo1_data_uri = svg_to_data_uri(echa_chem_logo_path)
|
||||||
|
logo2_data_uri = svg_to_data_uri(echa_logo_path)
|
||||||
|
if not logo1_data_uri or not logo2_data_uri:
|
||||||
|
raise ValueError("Failed to convert one or both logos to Data URIs.")
|
||||||
|
|
||||||
|
# logging.debug("Replacing placeholders...") # Example logging
|
||||||
|
print("Replacing placeholders...")
|
||||||
|
final_header_html = header_template_content.replace("##ECHA_CHEM_LOGO_SRC##", logo1_data_uri)
|
||||||
|
final_header_html = final_header_html.replace("##ECHA_LOGO_SRC##", logo2_data_uri)
|
||||||
|
final_header_html = final_header_html.replace("##SUBSTANCE_NAME##", substance_name)
|
||||||
|
final_header_html = final_header_html.replace("##SUBSTANCE_LINK##", substance_link)
|
||||||
|
final_header_html = final_header_html.replace("##EC_NUMBER##", ec_number)
|
||||||
|
final_header_html = final_header_html.replace("##CAS_NUMBER##", cas_number)
|
||||||
|
|
||||||
|
if "##" in final_header_html:
|
||||||
|
print("Warning: Not all placeholders seem replaced in the header HTML.")
|
||||||
|
# logging.warning("Not all placeholders seem replaced in the header HTML.") # Example logging
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error during header setup phase: {e}")
|
||||||
|
traceback.print_exc()
|
||||||
|
# logging.error(f"Error during header setup phase: {e}", exc_info=True) # Example logging
|
||||||
|
return False # Return False on header setup failure
|
||||||
|
# --- End Header Prep ---
|
||||||
|
|
||||||
|
# --- CSS Override Definition ---
|
||||||
|
# Using Revision 4 from previous step (simplified breaks, boundary focus)
|
||||||
|
selectors_to_fix = [
|
||||||
|
'.das-field .das-field_value_html',
|
||||||
|
'.das-field .das-field_value_large',
|
||||||
|
'.das-field .das-value_remark-text'
|
||||||
|
]
|
||||||
|
css_selector_string = ",\n".join(selectors_to_fix)
|
||||||
|
css_override = f"""
|
||||||
|
<style id='pdf-override-styles'>
|
||||||
|
/* Basic Resets & Overflows */
|
||||||
|
html, body {{ height: auto !important; overflow: visible !important; margin: 0 !important; padding: 0 !important; }}
|
||||||
|
* {{ box-sizing: border-box; }}
|
||||||
|
{css_selector_string} {{
|
||||||
|
overflow: visible !important; overflow-y: visible !important; height: auto !important; max-height: none !important;
|
||||||
|
}}
|
||||||
|
/* Boundary Fix */
|
||||||
|
#pdf-custom-header {{ margin-bottom: 0 !important; padding-bottom: 1px !important; page-break-after: auto !important; display: block !important; }}
|
||||||
|
#pdf-custom-header + .body-inner {{ margin-top: 0 !important; padding-top: 0 !important; page-break-before: auto !important; display: block !important; }}
|
||||||
|
.body-inner .document-header {{ margin-top: 0 !important; padding-top: 0 !important; page-break-before: auto !important; }}
|
||||||
|
/* Simplified Page Breaks */
|
||||||
|
.body-inner h1, .body-inner h2, .body-inner h3, .body-inner h4, .body-inner h5, .body-inner h6 {{ page-break-after: avoid !important; }}
|
||||||
|
#pdf-custom-header h2 {{ page-break-after: auto !important; }}
|
||||||
|
@media print {{
|
||||||
|
html, body {{ height: auto !important; overflow: visible !important; margin: 0; padding: 0; }}
|
||||||
|
#pdf-custom-header {{ margin-bottom: 0 !important; padding-bottom: 1px !important; page-break-after: auto !important; display: block !important;}}
|
||||||
|
#pdf-custom-header + .body-inner {{ margin-top: 0 !important; padding-top: 0 !important; page-break-before: auto !important; display: block !important; }}
|
||||||
|
.body-inner .document-header {{ margin-top: 0 !important; padding-top: 0 !important; page-break-before: auto !important; }}
|
||||||
|
.body-inner h1, .body-inner h2, .body-inner h3, .body-inner h4, .body-inner h5, .body-inner h6 {{ page-break-after: avoid !important; }}
|
||||||
|
#pdf-custom-header h2 {{ page-break-after: auto !important; }}
|
||||||
|
.das-doc-toolbar, .document-header__section-links, #das-totop {{ display: none !important; }}
|
||||||
|
}}
|
||||||
|
</style>
|
||||||
|
"""
|
||||||
|
# --- End CSS Override Definition ---
|
||||||
|

# --- Playwright Automation ---
try:
    with sync_playwright() as p:
        # logging.debug("Launching browser...")  # Example logging
        # browser = p.chromium.launch(headless=False, devtools=True)  # For debugging
        browser = p.chromium.launch()
        page = browser.new_page()
        # Capture console messages (note: msg.text is a property, not a method)
        page.on("console", lambda msg: print(f"Browser Console: {msg.text}"))

        try:
            # logging.info(f"Navigating to page: {url}")  # Example logging
            print(f"Navigating to: {url}")
            if os.path.exists(url) and not url.startswith('file://'):
                page_url = f'file://{os.path.abspath(url)}'
                # logging.info(f"Treating as local file: {page_url}")  # Example logging
                print(f"Treating as local file: {page_url}")
            else:
                page_url = url

            page.goto(page_url, wait_until='load', timeout=90000)
            # logging.info("Page navigation complete.")  # Example logging

            # logging.debug("Injecting header HTML...")  # Example logging
            print("Injecting header HTML...")
            page.evaluate(
                '(headerHtml) => { document.body.insertAdjacentHTML("afterbegin", headerHtml); }',
                final_header_html,
            )

            # logging.debug("Injecting CSS overrides...")  # Example logging
            print("Injecting CSS overrides...")
            page.evaluate(
                """(css) => {
                    const existingStyle = document.getElementById('pdf-override-styles');
                    if (existingStyle) existingStyle.remove();
                    document.head.insertAdjacentHTML('beforeend', css);
                }""",
                css_override,
            )

            # logging.debug("Running JavaScript cleanup function...")  # Example logging
            print("Running JavaScript cleanup function...")
            elements_removed_count = page.evaluate(cleanup_js_expression)
            # logging.info(f"Cleanup script finished (reported removing {elements_removed_count} elements).")  # Example logging
            print(f"Cleanup script finished (reported removing {elements_removed_count} elements).")

            # --- Optional: Emulate Print Media ---
            # print("Emulating print media...")
            # page.emulate_media(media='print')

            # --- Generate PDF ---
            # logging.info(f"Generating PDF: {pdf_path}")  # Example logging
            print(f"Generating PDF: {pdf_path}")
            pdf_options = {
                "path": pdf_path, "format": "A4", "print_background": True,
                "margin": {'top': '20px', 'bottom': '20px', 'left': '20px', 'right': '20px'},
                "scale": 1.0,
            }
            page.pdf(**pdf_options)
            # logging.info(f"PDF saved successfully to: {pdf_path}")  # Example logging
            print(f"PDF saved successfully to: {pdf_path}")

            # logging.debug("Closing browser.")  # Example logging
            print("Closing browser.")
            browser.close()
            return True  # Indicate success

        except TimeoutError as e:
            print(f"A Playwright TimeoutError occurred: {e}")
            traceback.print_exc()
            # logging.error(f"Playwright TimeoutError occurred: {e}", exc_info=True)  # Example logging
            browser.close()  # Ensure browser is closed on error
            return False  # Indicate failure
        except Exception as e:  # Catch other potential errors during Playwright page operations
            print(f"An unexpected error occurred during Playwright page operations: {e}")
            traceback.print_exc()
            # logging.error(f"Unexpected error during Playwright page operations: {e}", exc_info=True)  # Example logging
            # Optional: Save HTML state on error
            try:
                html_content = page.content()
                error_html_path = pdf_path.replace('.pdf', '_error.html')
                with open(error_html_path, 'w', encoding='utf-8') as f_err:
                    f_err.write(html_content)
                # logging.info(f"Saved HTML state on error to: {error_html_path}")  # Example logging
                print(f"Saved HTML state on error to: {error_html_path}")
            except Exception as save_e:
                # logging.error(f"Could not save HTML state on error: {save_e}", exc_info=True)  # Example logging
                print(f"Could not save HTML state on error: {save_e}")
            browser.close()  # Ensure browser is closed on error
            return False  # Indicate failure
        # Note: teardown for the 'with sync_playwright()' context
        # is handled automatically by the 'with' statement.

except Exception as e:
    # Catch errors during Playwright startup (less common)
    print(f"An error occurred during Playwright setup/teardown: {e}")
    traceback.print_exc()
    # logging.error(f"Error during Playwright setup/teardown: {e}", exc_info=True)  # Example logging
    return False  # Indicate failure
# --- Example Usage ---
# result = generate_pdf_with_header_and_cleanup(
#     url='path/to/your/input.html',
#     pdf_path='output.pdf',
#     substance_name='Glycerol Example',
#     substance_link='http://example.com/glycerol',
#     ec_number='200-289-5',
#     cas_number='56-81-5',
# )
#
# if result:
#     print("PDF Generation Succeeded.")
#     # logging.info("Main script: PDF Generation Succeeded.")  # Example logging
# else:
#     print("PDF Generation Failed.")
#     # logging.error("Main script: PDF Generation Failed.")  # Example logging


def search_generate_pdfs(
    cas_number_to_search: str,
    page_types_to_extract: list[str],
    base_output_folder: str = "data/library"
) -> bool:
    """
    Searches for a substance by CAS, saves raw HTML and generates PDFs for
    specified page types. Uses the '_js' link variant for the PDF header link if available.

    Args:
        cas_number_to_search (str): CAS number to search for.
        page_types_to_extract (list[str]): List of page type names (e.g., 'RepeatedDose').
            Expects '{page_type}' and '{page_type}_js' keys in the search result.
        base_output_folder (str): Root directory for saving HTML/PDFs.

    Returns:
        bool: True if the substance was found and at least one requested PDF was
        generated, False otherwise.
    """
    # logging.info(f"Starting process for CAS: {cas_number_to_search}")
    print(f"\n===== Processing request for CAS: {cas_number_to_search} =====")

    # --- 1. Search for Dossier Information ---
    try:
        # logging.debug(f"Calling search_dossier for CAS: {cas_number_to_search}")
        search_result = search_dossier(substance=cas_number_to_search, input_type='rmlCas')
    except Exception as e:
        print(f"Error during dossier search for CAS '{cas_number_to_search}': {e}")
        traceback.print_exc()
        # logging.error(f"Exception during search_dossier for CAS '{cas_number_to_search}': {e}", exc_info=True)
        return False

    if not search_result:
        print(f"Substance not found or search failed for CAS: {cas_number_to_search}")
        # logging.warning(f"Substance not found or search failed for CAS: {cas_number_to_search}")
        return False

    # logging.info(f"Substance found for CAS: {cas_number_to_search}")
    print(f"Substance found: {search_result.get('rmlName', 'N/A')}")

    # --- 2. Extract Details and Filter Pages ---
    try:
        # Extract required info
        rml_id = search_result.get('rmlId')
        rml_name = search_result.get('rmlName')
        rml_cas = search_result.get('rmlCas')
        rml_ec = search_result.get('rmlEc')
        asset_ext_id = search_result.get('assetExternalId')

        # Basic validation
        if not all([rml_id, rml_name, rml_cas, rml_ec, asset_ext_id]):
            missing_keys = [k for k, v in {'rmlId': rml_id, 'rmlName': rml_name, 'rmlCas': rml_cas, 'rmlEc': rml_ec, 'assetExternalId': asset_ext_id}.items() if not v]
            message = f"Search result for {cas_number_to_search} is missing required keys: {missing_keys}"
            print(f"Error: {message}")
            # logging.error(message)
            return False

        # --- Filtering Logic: collect pairs of URLs ---
        pages_to_process_list = []  # Store tuples: (page_name, raw_url, js_url)
        # logging.debug(f"Filtering pages. Requested: {page_types_to_extract}.")

        for page_type in page_types_to_extract:
            raw_url_key = page_type
            js_url_key = f"{page_type}_js"

            raw_url = search_result.get(raw_url_key)
            js_url = search_result.get(js_url_key)  # Get the JS URL

            # Check if both URLs are valid strings
            if raw_url and isinstance(raw_url, str) and raw_url.strip():
                if js_url and isinstance(js_url, str) and js_url.strip():
                    pages_to_process_list.append((page_type, raw_url, js_url))
                    # logging.debug(f"Found valid pair for '{page_type}': Raw='{raw_url}', JS='{js_url}'")
                else:
                    # Found a raw URL but not a valid JS URL. The raw URL could be
                    # reused for the header link, but we skip the page type instead
                    # whenever the JS URL is missing/invalid.
                    print(f"Found raw URL for '{page_type}' but missing/invalid JS URL ('{js_url}'). Skipping PDF generation for this type.")
                    # logging.warning(f"Missing/invalid JS URL for page type '{page_type}' for {rml_cas}. Raw URL: '{raw_url}'.")
            else:
                # Raw URL missing or invalid
                if page_type in search_result:  # Check if the key existed at all
                    print(f"Found page type key '{page_type}' for {rml_cas}, but its value is not a valid URL ('{raw_url}'). Skipping.")
                    # logging.warning(f"Invalid raw URL value for page type '{page_type}' for {rml_cas}: '{raw_url}'.")
                else:
                    print(f"Requested page type key '{page_type}' not found in search results for {rml_cas}.")
                    # logging.warning(f"Requested page type key '{page_type}' not found for {rml_cas}.")
        # --- End Filtering Logic ---

        if not pages_to_process_list:
            print(f"After filtering, no requested page types ({page_types_to_extract}) resulted in a valid pair of Raw and JS URLs for substance {rml_cas}.")
            # logging.warning(f"No pages with valid URL pairs to process for substance {rml_cas}.")
            return False  # Nothing to generate

    except Exception as e:
        print(f"Error processing search result for '{cas_number_to_search}': {e}")
        traceback.print_exc()
        # logging.error(f"Error processing search result for '{cas_number_to_search}': {e}", exc_info=True)
        return False

    # --- 3. Prepare Folders ---
    safe_cas = rml_cas.replace('/', '_').replace('\\', '_')
    substance_folder_name = f"{safe_cas}_{rml_ec}_{rml_id}"
    substance_folder_path = os.path.join(base_output_folder, substance_folder_name)

    try:
        os.makedirs(substance_folder_path, exist_ok=True)
        # logging.info(f"Ensured output directory exists: {substance_folder_path}")
        print(f"Ensured output directory exists: {substance_folder_path}")
    except OSError as e:
        print(f"Error creating directory {substance_folder_path}: {e}")
        # logging.error(f"Failed to create directory {substance_folder_path}: {e}", exc_info=True)
        return False

    # --- 4. Process Each Page (Save HTML, Generate PDF) ---
    successful_pages = []  # Track successful PDF generations
    overall_success = False  # Track if any PDF was generated

    for page_name, raw_html_url, js_header_link in pages_to_process_list:
        print(f"\nProcessing page: {page_name}")
        base_filename = f"{safe_cas}_{page_name}"
        html_filename = f"{base_filename}.html"
        pdf_filename = f"{base_filename}.pdf"
        html_full_path = os.path.join(substance_folder_path, html_filename)
        pdf_full_path = os.path.join(substance_folder_path, pdf_filename)

        # --- Save Raw HTML ---
        html_saved = False
        try:
            # logging.debug(f"Fetching raw HTML for {page_name} from {raw_html_url}")
            print(f"Fetching raw HTML from: {raw_html_url}")
            # Add headers to mimic a browser slightly if needed
            headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
            response = requests.get(raw_html_url, timeout=30, headers=headers)  # 30s timeout
            response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

            # Decide encoding: response.text tries to guess, or use apparent_encoding;
            # assuming utf-8 when unsure is a common fallback.
            html_content = response.content.decode('utf-8', errors='replace')

            with open(html_full_path, 'w', encoding='utf-8') as f:
                f.write(html_content)
            html_saved = True
            # logging.info(f"Successfully saved raw HTML for {page_name} to {html_full_path}")
            print(f"Successfully saved raw HTML to: {html_full_path}")
        except requests.exceptions.RequestException as req_e:
            print(f"Error fetching raw HTML for {page_name} from {raw_html_url}: {req_e}")
            # logging.error(f"Error fetching raw HTML for {page_name}: {req_e}", exc_info=True)
        except IOError as io_e:
            print(f"Error saving raw HTML for {page_name} to {html_full_path}: {io_e}")
            # logging.error(f"Error saving raw HTML for {page_name}: {io_e}", exc_info=True)
        except Exception as e:  # Catch other potential errors like decoding
            print(f"Unexpected error saving HTML for {page_name}: {e}")
            # logging.error(f"Unexpected error saving HTML for {page_name}: {e}", exc_info=True)

        # --- Generate PDF (using raw URL for content, JS URL for header link) ---
        # logging.info(f"Generating PDF for {page_name} from {raw_html_url}")
        print(f"Generating PDF using content from: {raw_html_url}")
        pdf_success = generate_pdf_with_header_and_cleanup(
            url=raw_html_url,  # Use raw URL for Playwright navigation/content
            pdf_path=pdf_full_path,
            substance_name=rml_name,
            substance_link=js_header_link,  # Use JS URL for the link in the header
            ec_number=rml_ec,
            cas_number=rml_cas
        )

        if pdf_success:
            successful_pages.append(page_name)  # Log success based on PDF generation
            overall_success = True
            # logging.info(f"Successfully generated PDF for {page_name} at {pdf_full_path}")
            print(f"Successfully generated PDF for {page_name}")
        else:
            # logging.error(f"Failed to generate PDF for {page_name} from {raw_html_url}")
            print(f"Failed to generate PDF for {page_name}")
        # Note: failure to save the raw HTML currently does not affect overall
        # success; success is tied only to PDF generation.

    print(f"===== Finished request for CAS: {cas_number_to_search} =====")
    print(f"Successfully generated {len(successful_pages)} PDFs: {successful_pages}")
    return overall_success  # Return success based on PDF generation
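The raw/JS URL pairing filter above can be exercised in isolation. The sketch below uses a hypothetical `search_result` dict (the keys and URLs are made up for illustration) to show which page types survive the check that both URL variants are non-empty strings.

```python
# Hypothetical search_result shape; only 'RepeatedDose' has both URL variants.
search_result = {
    "RepeatedDose": "http://example.org/raw",
    "RepeatedDose_js": "http://example.org/js",
    "AcuteToxicity": "   ",  # whitespace-only raw URL -> skipped
}

pages_to_process_list = []
for page_type in ["RepeatedDose", "AcuteToxicity"]:
    raw_url = search_result.get(page_type)
    js_url = search_result.get(f"{page_type}_js")
    # Keep the page type only when both the raw and the JS URL are usable strings
    if raw_url and isinstance(raw_url, str) and raw_url.strip():
        if js_url and isinstance(js_url, str) and js_url.strip():
            pages_to_process_list.append((page_type, raw_url, js_url))

print(pages_to_process_list)
# [('RepeatedDose', 'http://example.org/raw', 'http://example.org/js')]
```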
src/pif_compiler/services/echa_process.py (new file, 947 lines)
from pif_compiler.services.echa_find import search_dossier
from bs4 import BeautifulSoup
from markdownify import MarkdownConverter
import pandas as pd
import requests
import os
import re
import markdown_to_json
import json
import copy
import unicodedata
from datetime import datetime
import logging
import duckdb

# Logging settings
logging.basicConfig(
    format="{asctime} - {levelname} - {message}",
    style="{",
    datefmt="%Y-%m-%d %H:%M",
    filename="echa.log",
    encoding="utf-8",
    filemode="a",
    level=logging.INFO,
)

try:
    # Load the full scraping data into memory if it exists
    con = duckdb.connect()
    os.chdir(".")  # directory Python reads from
    res = con.sql("""
        CREATE TABLE echa_full_scraping AS
        SELECT * FROM read_csv_auto('src/data/echa_full_scraping.csv');
    """)  # read the CSV file as an in-memory table
    logging.info(
        f"echa.echaProcess().main: Loaded echa scraped data into duckdb memory. First CAS in the df is: {con.sql('select CAS from echa_full_scraping limit 1').fetchone()[0]}"
    )
    local_echa = True
except Exception:
    local_echa = False  # without this flag, later checks would raise NameError
    logging.error("echa.echaProcess().main: No local echa scraped data found")

# Finds the related information on the ECHA site.
# Works both with the substance name and with the CAS number.
def openEchaPage(link, local=False):
    soup = None
    try:
        if local:
            page = open(link, encoding="utf8")
            soup = BeautifulSoup(page, "html.parser")
        else:
            page = requests.get(link)
            page.encoding = "utf-8"
            soup = BeautifulSoup(page.text, "html.parser")
    except Exception:
        logging.error(
            f"echa.echaProcess.openEchaPage() error. could not open: '{link}'",
            exc_info=True,
        )
    return soup  # None when the page could not be opened

# Converts an ECHA page into Markdown
def echaPage_to_md(sezione, scrapingType=None, local=False, substance=None):
    # sezione : the soup of the page extracted via search_dossier
    # scrapingType : 'RepeatedDose' or 'AcuteToxicity'
    # local : whether to save the markdown content locally. Useful for debugging
    # substance : the substance name, used to build the correct save path

    # Create shorthand method for conversion
    def md(soup, **options):
        return MarkdownConverter(**options).convert_soup(soup)

    output = md(sezione)
    # Convert the HTML section into markdown, which still needs fixing.

    # The way the .md is adjusted varies slightly with the type of page being scraped;
    # exceptions are added as new substances are tested
    if scrapingType == "RepeatedDose":
        output = output.replace("### Oral route", "#### oral")
        output = output.replace("### Dermal", "#### dermal")
        output = output.replace("### Inhalation", "#### inhalation")
        # '>' and '<' must be replaced with words, otherwise the jsonifier
        # interprets the two symbols as markup and wraps the text in []
        output = re.sub(r">\s+", "greater than ", output)
        # Replace '<' followed by whitespace with 'less than '
        output = re.sub(r"<\s+", "less than ", output)
        output = re.sub(r">=\s*\n", "greater or equal than ", output)
        output = re.sub(r"<=\s*\n", "less or equal than ", output)

    elif scrapingType == "AcuteToxicity":
        # '>' and '<' must be replaced with words, otherwise the jsonifier
        # interprets the two symbols as markup and wraps the text in []
        output = re.sub(r">\s+", "greater than ", output)
        # Replace '<' followed by whitespace with 'less than '
        output = re.sub(r"<\s+", "less than ", output)
        output = re.sub(r">=\s*\n", "greater or equal than ", output)
        output = re.sub(r"<=\s*\n", "less or equal than ", output)

    output = output.replace("–", "-")

    output = re.sub(r"\s+mg", " mg", output)
    # Fixes measurement units that wrap to the next line, separated from their value

    if local and substance:
        path = f"{scrapingType}/mds/{substance}.md"
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as text_file:
            text_file.write(output)

    return output

# Part 2 of processing the ECHA site: the markdown must be turned into JSON
def markdown_to_json_raw(output, scrapingType=None, local=False, substance=None):
    # output : the markdown
    # scrapingType : 'RepeatedDose' or 'AcuteToxicity'
    # substance : the substance name, used to build the correct save path
    jsonified = markdown_to_json.jsonify(output)
    dictified = json.loads(jsonified)

    # Save the initial JSON exactly as it comes out of jsonify
    if local and scrapingType and substance:
        path = f"{scrapingType}/jsons/raws/{substance}_raw0.json"
        os.makedirs(os.path.dirname(path), exist_ok=True)

        with open(path, "w") as text_file:
            text_file.write(jsonified)

    # Now split the contents of the nested dictionaries.
    for key, value in dictified.items():
        if isinstance(value, dict):
            for key2, value2 in value.items():
                parts = value2.split("\n\n")
                dictified[key][key2] = {
                    parts[i]: parts[i + 1]
                    for i in range(0, len(parts) - 1, 2)
                    if parts[i + 1] != "[Empty]"
                }
        else:
            parts = value.split("\n\n")
            dictified[key] = {
                parts[i]: parts[i + 1]
                for i in range(0, len(parts) - 1, 2)
                if parts[i + 1] != "[Empty]"
            }

    jsonified = json.dumps(dictified)

    if local and scrapingType and substance:
        path = f"{scrapingType}/jsons/raws/{substance}_raw1.json"
        os.makedirs(os.path.dirname(path), exist_ok=True)

        with open(path, "w") as text_file:
            text_file.write(jsonified)

    return jsonified
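The even/odd pairing used in the splitting loop above can be seen on a toy value. The labels below (`Dose descriptor`, `NOAEL`) are made-up stand-ins for the label/value text that the jsonify step typically leaves separated by blank lines.

```python
# Alternating label/value text separated by blank lines, as left by jsonify
value = "Dose descriptor\n\nNOAEL\n\nEffect level\n\n[Empty]"
parts = value.split("\n\n")

# Pair even items (labels) with odd items (values), dropping "[Empty]" values
paired = {
    parts[i]: parts[i + 1]
    for i in range(0, len(parts) - 1, 2)
    if parts[i + 1] != "[Empty]"
}
print(paired)  # {'Dose descriptor': 'NOAEL'}
```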

# Claude-assisted helper to resolve Unicode character issues
def normalize_unicode_characters(text):
    """
    Normalize Unicode characters, with special handling for superscripts
    """
    if not isinstance(text, str):
        return text

    # Specific replacements for common Unicode encoding issues
    # and for other particular exceptions
    replacements = {
        "\u00c2\u00b2": "²",  # mojibake superscript 2 -> ²
        "\u00c2\u00b3": "³",  # mojibake superscript 3 -> ³
        "\u00b2": "²",  # Bare superscript 2
        "\u00b3": "³",  # Bare superscript 3
        "\n": "",  # occasional stray newlines to strip
        "greater than": ">",
        "less than": "<",
        "greater or equal than": ">=",
        "less or equal than": "<=",  # was "<", which silently dropped the '='
        # The word forms undo the temporary renaming of '>' and '<',
        # which would otherwise cause problems upstream
    }

    # Apply specific replacements first
    for old, new in replacements.items():
        text = text.replace(old, new)

    # Normalize Unicode characters
    text = unicodedata.normalize("NFKD", text)

    return text
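A reduced sketch of the replace-then-normalize flow above, assuming only a subset of the real replacement table; `normalize_sketch` is a hypothetical stand-in for illustration, not the function defined here.

```python
import unicodedata

# Subset of the replacement table: undo the word forms, then NFKD-normalize
def normalize_sketch(text):
    for old, new in {"greater than": ">", "less than": "<"}.items():
        text = text.replace(old, new)
    return unicodedata.normalize("NFKD", text)

print(normalize_sketch("greater than 300 mg"))  # > 300 mg
```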

# Another Claude-assisted method.
# Recursively walking a nested dictionary without being able to modify its
# structure in place is the tricky part this helper takes care of.
def clean_json(data):
    """
    Recursively clean JSON by removing empty/uninformative entries
    and normalizing Unicode characters
    """

    def is_uninformative(value, context=None):
        """
        Check if a dictionary entry is considered uninformative

        Args:
            value: The value to check
            context: Additional context about where the value is located
        """
        # Specific exceptions
        if context and context == "Key value for chemical safety assessment":
            # Always keep all entries in this specific section
            return False

        uninformative_values = ["hours/week", "", None]

        return value in uninformative_values or (
            isinstance(value, str)
            and (
                value.strip() in uninformative_values
                or value.lower() == "no information available"
            )
        )

    def clean_recursive(obj, context=None):
        # If it's a dictionary, process its contents
        if isinstance(obj, dict):
            # Create a copy to modify
            cleaned = {}
            for key, value in obj.items():
                # Normalize key
                normalized_key = normalize_unicode_characters(key)

                # Set context for nested dictionaries
                new_context = context or normalized_key

                # Recursively clean nested structures
                cleaned_value = clean_recursive(value, new_context)

                # Conditions for keeping the entry
                keep_entry = (
                    cleaned_value not in [None, {}, ""]
                    and not (
                        isinstance(cleaned_value, dict) and len(cleaned_value) == 0
                    )
                    and not is_uninformative(cleaned_value, new_context)
                )

                # Add to cleaned dict if conditions are met
                if keep_entry:
                    cleaned[normalized_key] = cleaned_value

            return cleaned if cleaned else None

        # If it's a list, clean each item
        elif isinstance(obj, list):
            cleaned_list = [clean_recursive(item, context) for item in obj]
            cleaned_list = [item for item in cleaned_list if item not in [None, {}, ""]]
            return cleaned_list if cleaned_list else None

        # For strings, normalize Unicode
        elif isinstance(obj, str):
            return normalize_unicode_characters(obj)

        # Return as-is for other types
        return obj

    # Create a deep copy to avoid modifying original data
    cleaned_data = clean_recursive(copy.deepcopy(data))
    return cleaned_data
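The recursive pruning idea in clean_json can be condensed to a minimal stand-in; `prune` below is a simplified sketch (no context handling, no Unicode normalization), not the actual helper.

```python
# Descend depth-first, drop empty values, collapse emptied dicts to None
def prune(obj):
    if isinstance(obj, dict):
        cleaned = {k: prune(v) for k, v in obj.items()}
        cleaned = {k: v for k, v in cleaned.items() if v not in (None, {}, "")}
        return cleaned if cleaned else None
    return obj

print(prune({"a": "", "b": {"c": None, "d": "x"}}))  # {'b': {'d': 'x'}}
```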

def json_to_dataframe(cleaned_json, scrapingType):
    rows = []
    schema = {
        "RepeatedDose": [
            "Substance",
            "CAS",
            "Toxicity Type",
            "Route",
            "Dose descriptor",
            "Effect level",
            "Species",
            "Extraction_Timestamp",
            "Endpoint conclusion",
        ],
        "AcuteToxicity": [
            "Substance",
            "CAS",
            "Route",
            "Endpoint conclusion",
            "Dose descriptor",
            "Effect level",
            "Extraction_Timestamp",
        ],
    }
    if scrapingType == "RepeatedDose":
        # Iterate through top-level sections (excluding 'Key value for chemical safety assessment')
        for toxicity_type, routes in cleaned_json.items():
            if toxicity_type == "Key value for chemical safety assessment":
                continue

            # Iterate through routes within each toxicity type
            for route, details in routes.items():
                row = {"Toxicity Type": toxicity_type, "Route": route}

                # Add details to the row, excluding 'Link to relevant study record(s)'
                row.update(
                    {
                        k: v
                        for k, v in details.items()
                        if k != "Link to relevant study record(s)"
                    }
                )
                rows.append(row)
    elif scrapingType == "AcuteToxicity":
        for toxicity_type, routes in cleaned_json.items():
            if (
                toxicity_type == "Key value for chemical safety assessment"
                or not routes
            ):
                continue

            row = {
                "Route": toxicity_type.replace("Acute toxicity: via", "")
                .replace("route", "")
                .strip()
            }

            # Add details directly from the routes dictionary
            row.update(
                {
                    k: v
                    for k, v in routes.items()
                    if k != "Link to relevant study record(s)"
                }
            )
            rows.append(row)

    # Create DataFrame
    df = pd.DataFrame(rows)

    # Last-moment fixes, to enforce a schema
    fair_columns = list(set(schema["RepeatedDose"] + schema["AcuteToxicity"]))
    df = df.loc[:, df.columns.intersection(fair_columns)]
    return df
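The schema-enforcement step above keeps only columns found in the union of the two schemas. The pure-Python sketch below (with made-up column names) mirrors the pandas column intersection without requiring pandas.

```python
# Union of the two schemas decides which columns survive (made-up names)
schema = {"RepeatedDose": ["Substance", "Route"], "AcuteToxicity": ["Route", "CAS"]}
fair_columns = set(schema["RepeatedDose"] + schema["AcuteToxicity"])

row = {"Route": "oral", "Comment": "dropped", "CAS": "56-81-5"}
kept = {k: v for k, v in row.items() if k in fair_columns}
print(kept)  # {'Route': 'oral', 'CAS': '56-81-5'}
```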

def save_dataframe(df, file_path, scrapingType, schema):
    """
    Save DataFrame with strict column requirements.

    Args:
        df (pd.DataFrame): DataFrame to potentially append
        file_path (str): Path of CSV file
    """
    # Mandatory columns for the saved DataFrame
    saved_columns = schema[scrapingType]

    # Skip DataFrames that lack the 'Effect level' column
    if "Effect level" not in df.columns:
        return

    # Reindex to match the saved columns, filling missing ones with NaN
    df = df.reindex(columns=saved_columns)

    df = df[df["Effect level"].notna()]
    # Ignore rows with no Effect level value

    # Append or save the DataFrame
    df.to_csv(
        file_path,
        mode="a" if os.path.exists(file_path) else "w",
        header=not os.path.exists(file_path),
        index=False,
    )
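save_dataframe relies on the append-or-create CSV pattern (write the header only when the file does not yet exist). Below is a stdlib-only sketch of that pattern on a temporary file, with a made-up `Effect level` column.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "echa_demo_rows.csv")
if os.path.exists(path):
    os.remove(path)

# Two batches: the first write creates the file and its header,
# the second appends without repeating the header.
for effect_level in ("100 mg/kg", "250 mg/kg"):
    exists = os.path.exists(path)
    with open(path, "a" if exists else "w", newline="") as f:
        writer = csv.writer(f)
        if not exists:
            writer.writerow(["Effect level"])  # header only on first write
        writer.writerow([effect_level])

with open(path, newline="") as f:
    rows = [r["Effect level"] for r in csv.DictReader(f)]
print(rows)  # ['100 mg/kg', '250 mg/kg']
os.remove(path)
```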
def echaExtract(
    substance: str,
    scrapingType: str,
    outputType="df",
    key_infos=False,
    local_search=False,
    local_only=False,
):
    """
    Main entry point for scraping the ECHA website. It ties together several
    different search, extraction and cleaning functions, and logs every step.

    Args:
        substance (str): CAS number or substance name. Both work, but the CAS number is more reliable.
        scrapingType (str): 'AcuteToxicity' (LD50) or 'RepeatedDose' (NOAEL)
        outputType (str): 'df' (pd.DataFrame) or 'json' (not recommended)
        key_infos (bool): Defaults to False. Whether to also look for the
            "Description of Key Information" section in the dossiers. Some substances
            have their data entered carelessly, with the relevant information written
            there in free text instead of in the structured fields.

    Output:
        a DataFrame or a JSON string, or
        f"Non esistono lead dossiers attivi o inattivi per {substance}"
    """

    # If local_search is True, try a local lookup first; otherwise go online.
    if local_search and local_echa:
        result = echaExtract_local(substance, scrapingType, key_infos)

        if not result.empty:
            logging.info(
                f"echa.echaProcess.echaExtract(): Found local data for {scrapingType}, {substance}. Returning it."
            )
            return result
        else:
            logging.info(
                f"echa.echaProcess.echaExtract(): Have not found local data for {scrapingType}, {substance}. Continuing."
            )
            if local_only:
                logging.info(f'echa.echaProcess.echaExtract(): No data found in local-only search for {substance}, {scrapingType}')
                return f'No data found in local-only search for {substance}, {scrapingType}'

    try:
        # search_dossier looks the substance up on the ECHA website and returns
        # the dossier metadata and page links.
        links = search_dossier(substance)
        if not links:
            # No active or inactive LEAD dossiers exist. LEAD dossiers summarise
            # the information of most of the other dossiers; they are the complete
            # ones that contain the toxicological summaries we need.
            logging.info(
                f'echaProcess.echaExtract(). no active or inactive lead dossiers for: "{substance}". Ending extraction.'
            )
            return f"Non esistono lead dossiers attivi o inattivi per {substance}"

        # If they exist, open the page of interest ('AcuteToxicity' or 'RepeatedDose')
        if scrapingType not in links:
            logging.info(
                f'echaProcess.echaExtract(). No page for "{scrapingType}", "{substance}"'
            )
            return f'No data in "{scrapingType}", "{substance}". Page does not exist.'

        soup = openEchaPage(link=links[scrapingType])
        logging.info(
            f"echaProcess.echaExtract(). souped '{scrapingType}' echa page for '{substance}'"
        )

        # Grab the section we need
        sezione = None
        try:
            sezione = soup.find(
                "section",
                class_="KeyValueForChemicalSafetyAssessment",
                attrs={"data-cy": "das-block"},
            )
        except Exception:
            logging.error(
                f'echaProcess.echaExtract(). could not extract the "section" for "{scrapingType}" for "{substance}"',
                exc_info=True,
            )

        # Current timestamp
        now = datetime.now()

        # UPDATE. Look for the key infos: the general summary text.
        key_infos_found = False
        if key_infos:
            try:
                key_infos = soup.find(
                    "section",
                    class_="KeyInformation",
                    attrs={"data-cy": "das-block"},
                )
                if key_infos:
                    key_infos = key_infos.find(
                        "div",
                        class_="das-field_value das-field_value_html",
                    )
                    key_infos = key_infos.text
                    key_infos = key_infos if key_infos.strip() != "[Empty]" else None
                    if key_infos:
                        key_infos_found = True
                        logging.info(
                            f"echaProcess.echaExtract(). Extracted key_infos from '{scrapingType}' echa page for '{substance}': {key_infos}"
                        )
                        key_infos_df = pd.DataFrame(index=[0])
                        key_infos_df["key_information"] = key_infos
                        key_infos_df = df_wrapper(
                            df=key_infos_df,
                            rmlName=links["rmlName"],
                            rmlCas=links["rmlCas"],
                            timestamp=now.strftime("%Y-%m-%d"),
                            dossierType=links["dossierType"],  # active or inactive? to be verified
                            page=scrapingType,  # RepeatedDose or AcuteToxicity
                            linkPage=links[scrapingType],  # link to the RepeatedDose or AcuteToxicity dossier page
                            key_infos=True,
                        )
                    else:
                        logging.error(
                            f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() ERROR. could not extract key_infos for "{scrapingType}", "{substance}"'
                        )
                else:
                    logging.error(
                        f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() ERROR. could not extract key_infos for "{scrapingType}", "{substance}"'
                    )
            except Exception:
                logging.error(
                    f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() ERROR. could not extract key_infos for "{scrapingType}", "{substance}"',
                    exc_info=True,
                )

        try:
            if not sezione:  # the main section being scraped
                logging.error(
                    f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() Empty section for the html > markdown conversion. No data for "{scrapingType}", "{substance}"'
                )
                if not key_infos_found:
                    return f'No data in "{scrapingType}", "{substance}"'
                else:
                    # No structured data, but the key informations exist: return those.
                    return key_infos_df

            # Convert the html section to markdown
            output = echaPage_to_md(
                sezione, scrapingType=scrapingType, substance=substance
            )
            logging.info(
                f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() OK. created MD for "{scrapingType}", "{substance}"'
            )

            # In rare cases no AcuteToxicity or RepeatedDose page exists at all;
            # output will then be empty and raise an error downstream.
        except Exception:
            logging.error(
                f'echaProcess.echaExtract() > echaProcess.echaPage_to_md() ERROR. could not MD for "{scrapingType}", "{substance}"',
                exc_info=True,
            )

        try:
            # Convert the markdown into the first raw json
            jsonified = markdown_to_json_raw(
                output, scrapingType=scrapingType, substance=substance
            )
            logging.info(
                f'echaProcess.echaExtract() > echaProcess.markdown_to_json_raw() OK. created initial json for "{scrapingType}", "{substance}"'
            )
        except Exception:
            logging.error(
                f'echaProcess.echaExtract() > echaProcess.markdown_to_json_raw() ERROR. could not create initial json for "{scrapingType}", "{substance}"',
                exc_info=True,
            )

        json_data = json.loads(jsonified)

        try:
            # Second json processing step: clean the most deeply nested dictionaries
            cleaned_data = clean_json(json_data)
            logging.info(
                f'echaProcess.echaExtract() > echaProcess.clean_json() OK. cleaned the json for "{scrapingType}", "{substance}"'
            )
            # If cleaned_data is empty there is no data
            if not cleaned_data:
                logging.error(
                    f'echaProcess.echaExtract() > echaProcess.clean_json() Empty cleaned_json. No data for "{scrapingType}", "{substance}"'
                )
                if not key_infos_found:
                    return f'No data in "{scrapingType}", "{substance}"'
                else:
                    # No structured data, but the key informations exist: return those.
                    return key_infos_df
        except Exception:
            logging.error(
                f'echaProcess.echaExtract() > echaProcess.clean_json() ERROR. cleaning the json for "{scrapingType}", "{substance}"'
            )

        # If a dataframe is wanted, build one and attach a timestamp
        try:
            df = json_to_dataframe(cleaned_data, scrapingType)
            df = df_wrapper(
                df=df,
                rmlName=links["rmlName"],
                rmlCas=links["rmlCas"],
                timestamp=now.strftime("%Y-%m-%d"),
                dossierType=links["dossierType"],
                page=scrapingType,
                linkPage=links[scrapingType],
            )

            if outputType == "df":
                logging.info(
                    f'echaProcess.echaExtract(). successfully extracted "{scrapingType}", "{substance}". Returning df'
                )

                # If the user asked for key infos and they were found, merge the two dfs
                return df if not key_infos_found else pd.concat([key_infos_df, df])

            elif outputType == "json":
                if key_infos_found:
                    df = pd.concat([key_infos_df, df])
                jayson = df.to_json(orient="records", force_ascii=False)
                logging.info(
                    f'echaProcess.echaExtract(). successfully extracted "{scrapingType}", "{substance}". Returning json'
                )
                return jayson
        except KeyError:
            # Handles the low-quality pages that contain nothing but
            # "no information available"

            if key_infos_found:
                return key_infos_df

            json_output = list(cleaned_data[list(cleaned_data.keys())[0]].values())
            if json_output == ["no information available" for elem in json_output]:
                logging.info(
                    f"echaProcess.echaExtract(). No data found for {scrapingType} for {substance}"
                )
                return f'No data in "{scrapingType}", "{substance}"'
            else:
                logging.error(
                    "echaProcess.json_to_dataframe(). Could not create dataframe"
                )
                cleaned_data["error"] = (
                    "Could not create the dataframe, probably there is not enough information. Returning the JSON"
                )
                return cleaned_data

    except Exception:
        logging.error(
            "echaProcess.echaExtract() ERROR. Something went wrong, not quite sure what.",
            exc_info=True,
        )


def df_wrapper(
    df, rmlName, rmlCas, timestamp, dossierType, page, linkPage, key_infos=False
):
    # A small helper that attaches all the metadata we need to the dataframe,
    # so that echaExtract, already convoluted enough, does not get any longer.
    df.insert(0, "Substance", rmlName)
    df.insert(1, "CAS", rmlCas)
    df["Extraction_Timestamp"] = timestamp
    df = df.replace("\n", "", regex=True)
    if not key_infos:
        df = df[df["Effect level"].notnull()]

    # Add the dossier link and status
    df["dossierType"] = dossierType
    df["page"] = page
    df["linkPage"] = linkPage
    return df


def echaExtract_specific(
    CAS: str,
    scrapingType="RepeatedDose",
    doseDescriptor="NOAEL",
    route="inhalation",
    local_search=False,
    local_only=False,
):
    """
    Given a CAS number, tries to find the dose descriptor (default NOAEL)
    for the specified route (default 'inhalation').

    Args:
        CAS (str): the CAS number, or alternatively the substance name
        route (str): 'inhalation', 'oral' or 'dermal'. Defaults to 'inhalation'
        scrapingType (str): the page to search on
        doseDescriptor (str): the kind of value to look for (NOAEL, DNEL, LD50, LC50)
    """

    # Attempt the extraction
    result = echaExtract(
        substance=CAS,
        scrapingType=scrapingType,
        outputType="df",
        local_search=local_search,
        local_only=local_only,
    )

    # Is the result a dataframe?
    if isinstance(result, pd.DataFrame):
        # If so, filter it down to what we are interested in
        filtered_df = result[
            (result["Route"] == route) & (result["Dose descriptor"] == doseDescriptor)
        ]
        # If the filtered frame is not empty, return it
        if not filtered_df.empty:
            return filtered_df
        else:
            return f'Non ho trovato {doseDescriptor} in {scrapingType} con route "{route}" per {CAS}'

    elif isinstance(result, dict) and result.get("error"):
        # A json carrying an error came back
        return f'Non ho trovato {doseDescriptor} in {scrapingType} con route "{route}" per {CAS}'

    # A "Non esistono" message came back: no active or inactive lead dossiers
    # exist for the requested substance.
    elif isinstance(result, str) and result.startswith("Non esistono"):
        return result


def echa_noael_ld50(CAS: str, route="inhalation", outputType="df", local_search=False, local_only=False):
    """
    Given a CAS number, tries to find the NOAEL for the specified route (default 'inhalation').
    If no RepeatedDose page with a NOAEL exists, it falls back to returning the LD50 for that route.

    Args:
        CAS (str): the CAS number, or alternatively the substance name
        route (str): 'inhalation', 'oral' or 'dermal'. Defaults to 'inhalation'
        outputType (str): 'df' or 'json'. The output type
    """
    if route not in ["inhalation", "oral", "dermal"] or outputType not in [
        "df",
        "json",
    ]:
        return "invalid input"
    # By default, try to scrape the "RepeatedDose" page first
    first_attempt = echaExtract_specific(
        CAS=CAS,
        scrapingType="RepeatedDose",
        doseDescriptor="NOAEL",
        route=route,
        local_search=local_search,
        local_only=local_only,
    )

    if isinstance(first_attempt, pd.DataFrame):
        return first_attempt
    elif isinstance(first_attempt, str) and first_attempt.startswith("Non ho trovato"):
        second_attempt = echaExtract_specific(
            CAS=CAS,
            scrapingType="AcuteToxicity",
            doseDescriptor="LD50",
            route=route,
            local_search=True,
            local_only=local_only,
        )
        if isinstance(second_attempt, pd.DataFrame):
            return second_attempt
        elif isinstance(second_attempt, str) and second_attempt.startswith(
            "Non ho trovato"
        ):
            return second_attempt.replace("LD50", "NOAEL ed LD50")
    elif isinstance(first_attempt, str) and first_attempt.startswith("Non esistono"):
        return first_attempt


def echa_noael_ld50_multi(
    casList: list, route="inhalation", messages=False, local_search=False, local_only=False
):
    """
    A fairly simple helper. Given a list of CAS numbers it runs echa_noael_ld50 on each:
    it looks for the NOAELs for the desired route, falling back to the LD50s when no
    NOAEL is found.
    The output is a df for the substances it finds and a list of messages for those it does not.

    Args:
        casList (list): the list of CAS numbers
        route (str): 'inhalation', 'oral' or 'dermal'. Defaults to 'inhalation'
        messages (bool): when True, returns a two-element list: the dataframe first,
            then the list of messages for the substances that were not found.
            Defaults to False, which returns only the dataframe.
    """
    messages_list = []
    df = pd.DataFrame()
    for CAS in casList:
        output = echa_noael_ld50(
            CAS=CAS, route=route, outputType="df", local_search=local_search, local_only=local_only
        )
        if isinstance(output, str):
            messages_list.append(output)
        elif isinstance(output, pd.DataFrame):
            df = pd.concat([df, output], ignore_index=True)
    df.dropna(axis=1, how="all", inplace=True)
    if messages and df.empty:
        messages_list.append(
            f'Non sono riuscito a trovare nessun NOAEL o LD50 per i cas per la route "{route}"'
        )
        return [None, messages_list]
    elif messages and not df.empty:
        return [df, messages_list]
    elif not df.empty and not messages:
        return df
    elif df.empty and not messages:
        return f'Non sono riuscito a trovare nessun NOAEL o LD50 per i cas per la route "{route}"'


def echaExtract_multi(
    casList: list,
    scrapingType="all",
    local=False,
    local_path=None,
    log_path=None,
    debug_print=False,
    error=False,
    error_path=None,
    key_infos=False,
    local_search=False,
    local_only=False,
    filter=None,
):
    """
    Given a list of CAS numbers, tries to extract every RepeatedDose page,
    every AcuteToxicity page, or both.

    Args:
        casList (list): the list of CAS numbers
        scrapingType (str): 'RepeatedDose', 'AcuteToxicity' or 'all'
        local (bool): when True, saves to disk progressively, appending each
            result as it is found. Required for large-scale scraping.
        log_path (str): path of the log to fill during mass scraping
        debug_print (bool): print progress while scraping, to track advancement
        error (bool): return the list of errors once scraping is done

    Output:
        pd.DataFrame
    """
    cas_len = len(casList)
    i = 0

    df = pd.DataFrame()
    if scrapingType == "all":
        scrapingTypeList = ["RepeatedDose", "AcuteToxicity"]
    else:
        scrapingTypeList = [scrapingType]

    logging.info(
        f"echa.echaExtract_multi(). Commencing mass extraction of {scrapingTypeList} for {casList}"
    )

    errors = []

    for cas in casList:
        for scrapingType in scrapingTypeList:
            extraction = echaExtract(
                substance=cas,
                scrapingType=scrapingType,
                outputType="df",
                key_infos=key_infos,
                local_search=local_search,
                local_only=local_only,
            )
            if isinstance(extraction, pd.DataFrame) and not extraction.empty:
                status = "successful_scrape"
                logging.info(
                    f"echa.echaExtract_multi(). Successfully scraped {scrapingType} for {cas}"
                )

                df = pd.concat([df, extraction], ignore_index=True)
                if local and local_path:
                    df.to_csv(local_path, index=False)

            elif (
                (isinstance(extraction, pd.DataFrame) and extraction.empty)
                or (extraction is None)
                or (isinstance(extraction, str) and extraction.startswith("No data"))
            ):
                status = "no_data_found"
                logging.info(
                    f"echa.echaExtract_multi(). Found no data for {scrapingType} for {cas}"
                )
            elif isinstance(extraction, dict) and extraction.get("error"):
                status = "df_creation_error"
                errors.append(extraction)
                logging.info(
                    f"echa.echaExtract_multi(). Df creation error for {scrapingType} for {cas}"
                )
            elif isinstance(extraction, str) and extraction.startswith("Non esistono"):
                status = "no_lead_dossiers"
                logging.info(
                    f"echa.echaExtract_multi(). Found no lead dossiers for {cas}"
                )
            else:
                status = "unknown_error"
                logging.error(
                    f"echa.echaExtract_multi(). Unknown error for {scrapingType} for {cas}"
                )

            if log_path:
                fill_log(cas, status, log_path, scrapingType)
            if debug_print:
                print(f"{i}: {cas}, {scrapingType}")
            i += 1

    if error and errors and error_path:
        with open(error_path, "w") as json_file:
            json.dump(errors, json_file, indent=4)

    # This single filter argument is what lets us drop four separate helper methods
    if filter:
        df = filter_dataframe_by_dict(df, filter)
    return df


def fill_log(cas: str, status: str, log_path: str, scrapingType: str):
    """
    Used during mass scraping to fill in a log while the substances are being extracted.
    """

    df = pd.read_csv(log_path)
    df.loc[df["casNo"] == cas, f"scraping_{scrapingType}"] = status
    df.loc[df["casNo"] == cas, "timestamp"] = datetime.now().strftime("%Y-%m-%d")

    df.to_csv(log_path, index=False)


def echaExtract_local(substance: str, scrapingType: str, key_infos=False):
    # Parameterised queries avoid quoting issues (and SQL injection)
    # with the substance string.
    if not key_infos:
        query = """
            SELECT *
            FROM echa_full_scraping
            WHERE CAS = ? AND page = ? AND key_information IS NULL;
        """
    else:
        query = """
            SELECT *
            FROM echa_full_scraping
            WHERE CAS = ? AND page = ?;
        """
    result = con.execute(query, [substance, scrapingType]).df()
    return result


def filter_dataframe_by_dict(df, filter_dict):
    """
    Filters a Pandas DataFrame based on a dictionary.

    Args:
        df (pd.DataFrame): The input DataFrame.
        filter_dict (dict): A dictionary where keys are column names and
            values are lists of allowed values for that column.

    Returns:
        pd.DataFrame: A new DataFrame containing only the rows that match
            the filter criteria.
    """

    # Start from an all-True mask and AND in one condition per column
    filter_condition = pd.Series(True, index=df.index)

    for column_name, allowed_values in filter_dict.items():
        if column_name in df.columns:
            # Boolean Series for the current column
            column_filter = df[column_name].isin(allowed_values)
            filter_condition = filter_condition & column_filter
        else:
            print(f"Warning: Column '{column_name}' not found in the DataFrame. Filter for this column will be ignored.")

    filtered_df = df[filter_condition]
    return filtered_df
15
src/pif_compiler/services/mongo_conn.py
Normal file
from pymongo import MongoClient

def get_client():
    # TODO: move these credentials out of the source and into environment variables
    ADMIN_USER = "admin"
    ADMIN_PASSWORD = "bello98A."
    MONGO_HOST = "204.216.215.1"
    MONGO_PORT = 27017

    # Connect as admin
    client = MongoClient(
        f"mongodb://{ADMIN_USER}:{ADMIN_PASSWORD}@{MONGO_HOST}:{MONGO_PORT}/?authSource=admin",
        serverSelectionTimeoutMS=5000
    )

    # MongoClient() itself is never falsy; connection problems only surface
    # on the first actual operation.
    return client
149
src/pif_compiler/services/pubchem_service.py
Normal file
import os
from contextlib import contextmanager
import pubchempy as pcp
from pubchemprops.pubchemprops import get_second_layer_props
import logging

logging.basicConfig(
    format="{asctime} - {levelname} - {message}",
    style="{",
    datefmt="%Y-%m-%d %H:%M",
    filename="echa.log",
    encoding="utf-8",
    filemode="a",
    level=logging.INFO,
)

@contextmanager
def temporary_certificate(cert_path):
    # This is needed because using the PubChem API requires temporarily swapping
    # the certificate with which the requests are made.

    """
    Context manager to temporarily change the certificate used for requests.

    Args:
        cert_path (str): Path to the certificate file to use temporarily

    Example:
        # Regular request uses default certificates
        requests.get('https://api.example.com')

        # Use custom certificate only within this block
        with temporary_certificate('custom-cert.pem'):
            requests.get('https://api.requiring.custom.cert.com')

        # Back to default certificates
        requests.get('https://api.example.com')
    """
    # Store original environment variables
    original_ca_bundle = os.environ.get('REQUESTS_CA_BUNDLE')
    original_ssl_cert = os.environ.get('SSL_CERT_FILE')

    try:
        # Set new certificate
        os.environ['REQUESTS_CA_BUNDLE'] = cert_path
        os.environ['SSL_CERT_FILE'] = cert_path
        yield
    finally:
        # Restore original environment variables
        if original_ca_bundle is not None:
            os.environ['REQUESTS_CA_BUNDLE'] = original_ca_bundle
        else:
            os.environ.pop('REQUESTS_CA_BUNDLE', None)

        if original_ssl_cert is not None:
            os.environ['SSL_CERT_FILE'] = original_ssl_cert
        else:
            os.environ.pop('SSL_CERT_FILE', None)
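The save-and-restore pattern used by `temporary_certificate` can be exercised on its own with any environment variable; in this sketch the variable name and path are made up:

```python
import os
from contextlib import contextmanager

@contextmanager
def temporary_env(name, value):
    # Remember the original value (or its absence), set the override,
    # and put things back exactly as they were on exit.
    original = os.environ.get(name)
    try:
        os.environ[name] = value
        yield
    finally:
        if original is not None:
            os.environ[name] = original
        else:
            os.environ.pop(name, None)

os.environ.pop("DEMO_CERT", None)
with temporary_env("DEMO_CERT", "/tmp/demo.pem"):
    assert os.environ["DEMO_CERT"] == "/tmp/demo.pem"
assert "DEMO_CERT" not in os.environ  # restored to its prior (unset) state
```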

def clean_property_data(api_response):
    """
    Simplifies the API response data by flattening nested structures.

    Args:
        api_response (dict): Raw API response containing property data

    Returns:
        dict: Cleaned data with simplified structure
    """
    cleaned_data = {}

    for property_name, measurements in api_response.items():
        cleaned_measurements = []

        for measurement in measurements:
            cleaned_measurement = {
                'ReferenceNumber': measurement.get('ReferenceNumber'),
                'Description': measurement.get('Description', ''),
            }

            # Handle Reference field, which can be a list or a string
            if 'Reference' in measurement:
                ref = measurement['Reference']
                cleaned_measurement['Reference'] = ref[0] if isinstance(ref, list) else ref

            # Handle Value field
            value = measurement.get('Value', {})
            if isinstance(value, dict) and 'StringWithMarkup' in value:
                cleaned_measurement['Value'] = value['StringWithMarkup'][0]['String']
            else:
                cleaned_measurement['Value'] = str(value)

            # Remove empty values
            cleaned_measurement = {k: v for k, v in cleaned_measurement.items() if v}

            cleaned_measurements.append(cleaned_measurement)

        cleaned_data[property_name] = cleaned_measurements

    return cleaned_data
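A small worked example of the flattening performed above; the input shape mimics PubChem's `StringWithMarkup` nesting, but the property name and values are made up:

```python
# Hypothetical raw response shaped like a PubChem property payload
raw = {
    "Melting Point": [
        {
            "ReferenceNumber": 1,
            "Reference": ["Some handbook"],
            "Value": {"StringWithMarkup": [{"String": "99 °C"}]},
        }
    ]
}

# The same flattening steps as clean_property_data, inlined
cleaned = {}
for prop, measurements in raw.items():
    cleaned[prop] = []
    for m in measurements:
        entry = {"ReferenceNumber": m.get("ReferenceNumber")}
        ref = m.get("Reference")
        if ref is not None:
            entry["Reference"] = ref[0] if isinstance(ref, list) else ref
        value = m.get("Value", {})
        if isinstance(value, dict) and "StringWithMarkup" in value:
            entry["Value"] = value["StringWithMarkup"][0]["String"]
        else:
            entry["Value"] = str(value)
        cleaned[prop].append({k: v for k, v in entry.items() if v})

print(cleaned["Melting Point"][0]["Value"])  # 99 °C
```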

def pubchem_dap(cas):
    '''
    Given a CAS number, looks up the safety-data-sheet information on PubChem.
    First-level properties (synonyms, CID, logP, MolecularWeight, ExactMass, TPSA)
    are fetched via pubchempy; second-level ones (Melting Point) via pubchemprops.

    args:
        cas : string
    '''
    with temporary_certificate('src/data/ncbi-nlm-nih-gov-catena.pem'):
        try:
            # Initial search
            out = pcp.get_synonyms(cas, 'name')
            if out:
                out = out[0]
                output = {'CID': out['CID'],
                          'CAS': cas,
                          'first_pubchem_name': out['Synonym'][0],
                          'pubchem_link': f"https://pubchem.ncbi.nlm.nih.gov/compound/{out['CID']}"}
            else:
                return f'No results on PubChem for {cas}'

        except Exception:
            logging.error(f'various_utils.pubchem.pubchem_dap(). Some error during pubchem search for {cas}', exc_info=True)
            # Without a successful initial search there is nothing to build on
            return None

        try:
            # Property lookup
            properties = pcp.get_properties(['xlogp', 'molecular_weight', 'tpsa', 'exact_mass'], identifier=out['CID'], namespace='cid', searchtype=None, as_dataframe=False)
            if properties:
                output = {**output, **properties[0]}
            else:
                return output
        except Exception:
            logging.error(f'various_utils.pubchem.pubchem_dap(). Some error during pubchem first level properties extraction for {cas}', exc_info=True)

        try:
            # Melting point lookup
            second_layer_props = get_second_layer_props(output['first_pubchem_name'], ['Melting Point', 'Dissociation Constants', 'pH'])
            if second_layer_props:
                second_layer_props = clean_property_data(second_layer_props)
                output = {**output, **second_layer_props}
        except Exception:
            logging.error(f'various_utils.pubchem.pubchem_dap(). Some error during pubchem second level properties extraction (Melting Point) for {cas}', exc_info=True)

        return output
220
tests/README.md
Normal file
# PIF Compiler - Test Suite

## Overview

Comprehensive test suite for the PIF Compiler project using `pytest`.

## Structure

```
tests/
├── __init__.py                # Test package marker
├── conftest.py                # Shared fixtures and configuration
├── test_cosing_service.py     # COSING service tests
├── test_models.py             # (TODO) Pydantic model tests
├── test_echa_service.py       # (TODO) ECHA service tests
└── README.md                  # This file
```

## Installation

```bash
# Install test dependencies
uv add --dev pytest pytest-cov pytest-mock

# Or install them manually
uv pip install pytest pytest-cov pytest-mock
```

## Running Tests

### Run All Tests (Unit only)
```bash
uv run pytest
```

### Run Specific Test File
```bash
uv run pytest tests/test_cosing_service.py
```

### Run Specific Test Class
```bash
uv run pytest tests/test_cosing_service.py::TestParseCasNumbers
```

### Run Specific Test
```bash
uv run pytest tests/test_cosing_service.py::TestParseCasNumbers::test_single_cas_number
```

### Run with Verbose Output
```bash
uv run pytest -v
```

### Run with Coverage Report
```bash
uv run pytest --cov=src/pif_compiler --cov-report=html
# Open htmlcov/index.html in a browser
```

## Test Categories

### Unit Tests (Default)
Fast tests with no external dependencies. Run by default.

```bash
uv run pytest -m unit
```

### Integration Tests
Tests that hit real APIs or databases. Skipped by default.

```bash
uv run pytest -m integration
```

### Slow Tests
Tests that take longer to run. Skipped by default.

```bash
uv run pytest -m slow
```

### Database Tests
Tests requiring MongoDB. Ensure Docker is running.

```bash
cd utils
docker-compose up -d
uv run pytest -m database
```
|
||||||
|
|
||||||
|
## Test Organization
|
||||||
|
|
||||||
|
### `test_cosing_service.py`
|
||||||
|
|
||||||
|
**Coverage:**
|
||||||
|
- ✅ `parse_cas_numbers()` - CAS parsing logic
|
||||||
|
- Single/multiple CAS
|
||||||
|
- Different separators (/, ;, ,, --)
|
||||||
|
- Parentheses removal
|
||||||
|
- Whitespace handling
|
||||||
|
- Invalid dash removal
|
||||||
|
|
||||||
|
- ✅ `cosing_search()` - API search
|
||||||
|
- Search by name
|
||||||
|
- Search by CAS
|
||||||
|
- Search by EC number
|
||||||
|
- Search by ID
|
||||||
|
- No results handling
|
||||||
|
- Invalid mode error
|
||||||
|
|
||||||
|
- ✅ `clean_cosing()` - JSON cleaning
|
||||||
|
- Basic field cleaning
|
||||||
|
- Empty tag removal
|
||||||
|
- CAS parsing
|
||||||
|
- URL creation
|
||||||
|
- Field renaming
|
||||||
|
|
||||||
|
- ✅ Integration tests (marked as `@pytest.mark.integration`)
|
||||||
|
- Real API calls (requires internet)
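
Taken together, the `clean_cosing()` behaviours listed above can be sketched roughly like this. This is a hypothetical simplification for illustration only — the real function in `services/cosing_service.py` handles more fields and the `full` flag:

```python
def clean_cosing_sketch(raw: dict) -> dict:
    """Hypothetical sketch of clean_cosing(); not the real implementation."""
    result = {}
    for key, values in raw.items():
        # Empty tag removal: drop "<empty>" placeholders from every field
        result[key] = [v for v in values if v != "<empty>"]
    # Single-value name fields are flattened to plain strings
    inci = result.get("inciName", [])
    result["inciName"] = inci[0] if inci else ""
    # URL creation: build the public COSING detail link from the substance id
    sid = result.get("substanceId", [])
    result["cosingUrl"] = (
        "https://ec.europa.eu/growth/tools-databases/cosing/details/"
        + (sid[0] if sid else "")
    )
    # Field renaming: the verbose glossary key becomes commonName
    glossary = result.pop("nameOfCommonIngredientsGlossary", [])
    result["commonName"] = glossary[0] if glossary else ""
    return result
```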

## Writing New Tests

### Example Unit Test

```python
class TestMyFunction:
    """Test my_function."""

    def test_basic_case(self):
        """Test basic functionality."""
        result = my_function("input")
        assert result == "expected"

    def test_edge_case(self):
        """Test edge case handling."""
        with pytest.raises(ValueError):
            my_function("invalid")
```

### Example Mock Test

```python
from unittest.mock import Mock, patch

@patch('module.external_api_call')
def test_with_mock(mock_api):
    """Test with mocked external call."""
    mock_api.return_value = {"data": "mocked"}
    result = my_function()
    assert result == "expected"
    mock_api.assert_called_once()
```

### Example Fixture Usage

```python
def test_with_fixture(sample_cosing_response):
    """Test using a fixture from conftest.py."""
    result = clean_cosing(sample_cosing_response)
    assert "cosingUrl" in result
```

## Best Practices

1. **Naming**: Test files/classes/functions start with `test_`
2. **Arrange-Act-Assert**: Structure tests clearly
3. **One assertion focus**: Each test should test one thing
4. **Use fixtures**: Reuse test data via `conftest.py`
5. **Mock external calls**: Don't hit real APIs in unit tests
6. **Mark appropriately**: Use `@pytest.mark.integration` for slow tests
7. **Descriptive names**: Test names should describe what they test
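
Put together, a test following these conventions might look like the sketch below. The helper `normalize_ec_number` is hypothetical, not part of the codebase — it exists only to illustrate naming and Arrange-Act-Assert structure:

```python
def normalize_ec_number(raw: str) -> str:
    """Hypothetical helper used only to illustrate the conventions above."""
    return raw.strip().replace(" ", "")


def test_normalize_ec_number_strips_surrounding_whitespace():
    # Arrange: build the input
    raw = " 231-791-2 "
    # Act: call the unit under test
    result = normalize_ec_number(raw)
    # Assert: one focused check
    assert result == "231-791-2"
```

A slow or networked variant of the same test would additionally carry `@pytest.mark.slow` or `@pytest.mark.integration`, so the collection hooks in `conftest.py` can skip it by default.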

## Common Commands

```bash
# Run fast tests only (skip integration/slow)
uv run pytest -m "not integration and not slow"

# Run only integration tests
uv run pytest -m integration

# Run with detailed output
uv run pytest -vv

# Stop at first failure
uv run pytest -x

# Run last failed tests
uv run pytest --lf

# Run tests matching pattern
uv run pytest -k "test_parse"

# Generate coverage report
uv run pytest --cov=src/pif_compiler --cov-report=term-missing
```

## CI/CD Integration

For GitHub Actions (example):

```yaml
- name: Run tests
  run: |
    uv run pytest -m "not integration" --cov --cov-report=xml
```

## TODO

- [ ] Add tests for `models.py` (Pydantic validation)
- [ ] Add tests for `echa_service.py`
- [ ] Add tests for `echa_parser.py`
- [ ] Add tests for `echa_extractor.py`
- [ ] Add tests for `database_service.py`
- [ ] Add tests for `pubchem_service.py`
- [ ] Add integration tests with test database
- [ ] Set up GitHub Actions CI
86
tests/RUN_TESTS.md
Normal file
@ -0,0 +1,86 @@

# Quick Start - Running Tests

## 1. Install Test Dependencies

```bash
# Add pytest and related tools
uv add --dev pytest pytest-cov pytest-mock
```

## 2. Run the Tests

```bash
# Run all unit tests (fast, no API calls)
uv run pytest

# Run with more detail
uv run pytest -v

# Run just the COSING tests
uv run pytest tests/test_cosing_service.py

# Run integration tests (will hit the real COSING API)
uv run pytest -m integration
```

## 3. See Coverage

```bash
# Generate HTML coverage report
uv run pytest --cov=src/pif_compiler --cov-report=html

# Open htmlcov/index.html in your browser
```

## What the Tests Cover

### ✅ `parse_cas_numbers()`
- Parses single CAS: `["7732-18-5"]` → `["7732-18-5"]`
- Parses multiple: `["7732-18-5/56-81-5"]` → `["7732-18-5", "56-81-5"]`
- Handles separators: `/`, `;`, `,`, `--`
- Removes parentheses: `["7732-18-5 (hydrate)"]` → `["7732-18-5"]`
- Cleans whitespace and invalid dashes
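
The behaviour above can be approximated with a short sketch. This is a hypothetical re-implementation for illustration; the real `parse_cas_numbers` in `services/cosing_service.py` may differ in detail:

```python
import re

def parse_cas_numbers(raw_values):
    """Split raw CAS strings on the documented separators and drop noise.

    Sketch only: mirrors the behaviours listed above, not the real code.
    """
    results = []
    for value in raw_values:
        # Drop parenthetical annotations such as "(hydrate)"
        value = re.sub(r"\([^)]*\)", "", value)
        # Normalize the documented separators (--, ;, ,) to "/"
        value = re.sub(r"--|;|,", "/", value)
        for part in value.split("/"):
            part = part.strip()
            # Keep only tokens shaped like CAS numbers; this discards
            # standalone dashes and other invalid fragments
            if re.fullmatch(r"\d{2,7}-\d{2}-\d", part):
                results.append(part)
    return results
```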

### ✅ `cosing_search()`
- Mocks API calls (no internet needed for unit tests)
- Tests search by name, CAS, EC, ID
- Tests error handling
- Integration tests hit real API

### ✅ `clean_cosing()`
- Cleans COSING JSON responses
- Removes empty tags
- Parses CAS numbers
- Creates COSING URLs
- Renames fields

## Test Results Example

```
tests/test_cosing_service.py::TestParseCasNumbers::test_single_cas_number PASSED
tests/test_cosing_service.py::TestParseCasNumbers::test_multiple_cas_with_slash PASSED
tests/test_cosing_service.py::TestCosingSearch::test_search_by_name_success PASSED
...
================================ 25 passed in 0.5s ================================
```

## Troubleshooting

### Import errors
Make sure you're in the project root:
```bash
cd c:\Users\adish\Projects\pif_compiler
uv run pytest
```

### Mock not found
Install pytest-mock:
```bash
uv add --dev pytest-mock
```

### Integration tests failing
These hit the real API and need internet. Skip them:
```bash
uv run pytest -m "not integration"
```
3
tests/__init__.py
Normal file
@ -0,0 +1,3 @@
"""
PIF Compiler - Test Suite
"""
247
tests/conftest.py
Normal file
@ -0,0 +1,247 @@
"""
Pytest configuration and fixtures for PIF Compiler tests.

This file contains shared fixtures and configuration for all tests.
"""

import sys
from pathlib import Path

import pytest

# Add src to the Python path for imports
src_path = Path(__file__).parent.parent / "src"
sys.path.insert(0, str(src_path))


# Sample data fixtures
@pytest.fixture
def sample_cas_numbers():
    """Real CAS numbers for testing common cosmetic ingredients."""
    return {
        "water": "7732-18-5",
        "glycerin": "56-81-5",
        "sodium_hyaluronate": "9067-32-7",
        "niacinamide": "98-92-0",
        "ascorbic_acid": "50-81-7",
        "retinol": "68-26-8",
        "lanolin": "85507-69-3",
        "sodium_chloride": "7647-14-5",
        "propylene_glycol": "57-55-6",
        "butylene_glycol": "107-88-0",
        "salicylic_acid": "69-72-7",
        "tocopherol": "59-02-9",
        "caffeine": "58-08-2",
        "citric_acid": "77-92-9",
        "hyaluronic_acid": "9004-61-9",
        "sodium_hyaluronate_crosspolymer": "63148-62-9",
        "zinc_oxide": "1314-13-2",
        "titanium_dioxide": "13463-67-7",
        "lactic_acid": "50-21-5",
        "lanolin_oil": "8006-54-0",
    }


@pytest.fixture
def sample_cosing_response():
    """Sample COSING API response for testing."""
    return {
        "inciName": ["WATER"],
        "casNo": ["7732-18-5"],
        "ecNo": ["231-791-2"],
        "substanceId": ["12345"],
        "itemType": ["Ingredient"],
        "functionName": ["Solvent"],
        "chemicalName": ["Dihydrogen monoxide"],
        "nameOfCommonIngredientsGlossary": ["Water"],
        "sccsOpinion": [],
        "sccsOpinionUrls": [],
        "otherRestrictions": [],
        "identifiedIngredient": [],
        "annexNo": [],
        "otherRegulations": [],
        "refNo": ["REF123"],
        "phEurName": [],
        "innName": [],
    }


@pytest.fixture
def sample_ingredient_data():
    """Sample ingredient data for Pydantic model testing."""
    return {
        "inci_name": "WATER",
        "cas": "7732-18-5",
        "quantity": 70.0,
        "mol_weight": 18,
        "dap": 0.5,
    }


@pytest.fixture
def sample_pif_data():
    """Sample PIF data for testing."""
    return {
        "company": "Beauty Corp",
        "product_name": "Face Cream",
        "type": "MOISTURIZER",
        "physical_form": "CREMA",
        "CNCP": 123456,
        "production_company": {
            "prod_company_name": "Manufacturer Inc",
            "prod_vat": 12345678,
            "prod_address": "123 Main St, City, Country",
        },
        "ingredients": [
            {
                "inci_name": "WATER",
                "cas": "7732-18-5",
                "quantity": 70.0,
                "dap": 0.5,
            },
            {
                "inci_name": "GLYCERIN",
                "cas": "56-81-5",
                "quantity": 10.0,
                "dap": 0.5,
            },
        ],
    }


@pytest.fixture
def sample_echa_substance_response():
    """Sample ECHA substance search API response for glycerin."""
    return {
        "items": [{
            "substanceIndex": {
                "rmlId": "100.029.181",
                "rmlName": "glycerol",
                "rmlCas": "56-81-5",
                "rmlEc": "200-289-5",
            }
        }]
    }


@pytest.fixture
def sample_echa_substance_response_water():
    """Sample ECHA substance search API response for water."""
    return {
        "items": [{
            "substanceIndex": {
                "rmlId": "100.028.902",
                "rmlName": "water",
                "rmlCas": "7732-18-5",
                "rmlEc": "231-791-2",
            }
        }]
    }


@pytest.fixture
def sample_echa_substance_response_niacinamide():
    """Sample ECHA substance search API response for niacinamide."""
    return {
        "items": [{
            "substanceIndex": {
                "rmlId": "100.002.530",
                "rmlName": "nicotinamide",
                "rmlCas": "98-92-0",
                "rmlEc": "202-713-4",
            }
        }]
    }


@pytest.fixture
def sample_echa_dossier_response():
    """Sample ECHA dossier list API response."""
    return {
        "items": [{
            "assetExternalId": "abc123def456",
            "rootKey": "key123",
            "lastUpdatedDate": "2024-01-15T10:30:00Z",
        }]
    }


@pytest.fixture
def sample_echa_index_html_full():
    """Sample ECHA index.html with all toxicology sections."""
    return """
    <html>
    <head><title>ECHA Dossier</title></head>
    <body>
        <div id="id_7_Toxicologicalinformation">
            <a href="tox_summary_001"></a>
        </div>
        <div id="id_72_AcuteToxicity">
            <a href="acute_tox_001"></a>
        </div>
        <div id="id_75_Repeateddosetoxicity">
            <a href="repeated_dose_001"></a>
        </div>
    </body>
    </html>
    """


@pytest.fixture
def sample_echa_index_html_partial():
    """Sample ECHA index.html with only the ToxSummary section."""
    return """
    <html>
    <head><title>ECHA Dossier</title></head>
    <body>
        <div id="id_7_Toxicologicalinformation">
            <a href="tox_summary_001"></a>
        </div>
    </body>
    </html>
    """


@pytest.fixture
def sample_echa_index_html_empty():
    """Sample ECHA index.html with no toxicology sections."""
    return """
    <html>
    <head><title>ECHA Dossier</title></head>
    <body>
        <p>No toxicology information available</p>
    </body>
    </html>
    """


# Skip markers
def pytest_configure(config):
    """Configure custom markers."""
    config.addinivalue_line(
        "markers", "unit: mark test as a unit test (fast, no external deps)"
    )
    config.addinivalue_line(
        "markers", "integration: mark test as integration test (may use real APIs)"
    )
    config.addinivalue_line(
        "markers", "slow: mark test as slow (skip by default)"
    )
    config.addinivalue_line(
        "markers", "database: mark test as requiring database"
    )


def pytest_collection_modifyitems(config, items):
    """Modify test collection to skip slow/integration tests by default."""
    skip_slow = pytest.mark.skip(reason="Slow test (use -m slow to run)")
    skip_integration = pytest.mark.skip(reason="Integration test (use -m integration to run)")

    # Only skip if not explicitly requested; a substring check also covers
    # combined expressions such as -m "slow or integration"
    marker_expr = config.getoption("-m") or ""
    run_slow = "slow" in marker_expr
    run_integration = "integration" in marker_expr

    for item in items:
        if "slow" in item.keywords and not run_slow:
            item.add_marker(skip_slow)
        if "integration" in item.keywords and not run_integration:
            item.add_marker(skip_integration)
254
tests/test_cosing_service.py
Normal file
@ -0,0 +1,254 @@
"""
Tests for COSING Service

Test coverage:
- parse_cas_numbers: CAS number parsing logic
- cosing_search: API search functionality
- clean_cosing: JSON cleaning and formatting
"""

import pytest
from unittest.mock import Mock, patch

from pif_compiler.services.cosing_service import (
    parse_cas_numbers,
    cosing_search,
    clean_cosing,
)


class TestParseCasNumbers:
    """Test CAS number parsing function."""

    def test_single_cas_number(self):
        """Test parsing a single CAS number."""
        result = parse_cas_numbers(["7732-18-5"])
        assert result == ["7732-18-5"]

    def test_multiple_cas_with_slash(self):
        """Test parsing multiple CAS numbers separated by slash."""
        result = parse_cas_numbers(["7732-18-5/56-81-5"])
        assert result == ["7732-18-5", "56-81-5"]

    def test_multiple_cas_with_semicolon(self):
        """Test parsing multiple CAS numbers separated by semicolon."""
        result = parse_cas_numbers(["7732-18-5;56-81-5"])
        assert result == ["7732-18-5", "56-81-5"]

    def test_multiple_cas_with_comma(self):
        """Test parsing multiple CAS numbers separated by comma."""
        result = parse_cas_numbers(["7732-18-5,56-81-5"])
        assert result == ["7732-18-5", "56-81-5"]

    def test_double_dash_separator(self):
        """Test parsing CAS numbers with double dash separator."""
        result = parse_cas_numbers(["7732-18-5--56-81-5"])
        assert result == ["7732-18-5", "56-81-5"]

    def test_cas_with_parentheses(self):
        """Test that parenthetical info is removed."""
        result = parse_cas_numbers(["7732-18-5 (hydrate)"])
        assert result == ["7732-18-5"]

    def test_cas_with_extra_whitespace(self):
        """Test that extra whitespace is trimmed."""
        result = parse_cas_numbers([" 7732-18-5 / 56-81-5 "])
        assert result == ["7732-18-5", "56-81-5"]

    def test_removes_invalid_dash(self):
        """Test that standalone dashes are removed."""
        result = parse_cas_numbers(["7732-18-5/-/56-81-5"])
        assert result == ["7732-18-5", "56-81-5"]

    def test_complex_mixed_separators(self):
        """Test with multiple separator types."""
        result = parse_cas_numbers(["7732-18-5/56-81-5;50-00-0"])
        assert result == ["7732-18-5", "56-81-5", "50-00-0"]


class TestCosingSearch:
    """Test COSING API search functionality."""

    @patch('pif_compiler.services.cosing_service.req.post')
    def test_search_by_name_success(self, mock_post):
        """Test successful search by ingredient name."""
        # Mock API response
        mock_response = Mock()
        mock_response.json.return_value = {
            "results": [{
                "metadata": {
                    "inciName": ["WATER"],
                    "casNo": ["7732-18-5"],
                    "substanceId": ["12345"],
                }
            }]
        }
        mock_post.return_value = mock_response

        result = cosing_search("WATER", mode="name")

        assert result is not None
        assert result["inciName"] == ["WATER"]
        assert result["casNo"] == ["7732-18-5"]

    @patch('pif_compiler.services.cosing_service.req.post')
    def test_search_by_cas_success(self, mock_post):
        """Test successful search by CAS number."""
        mock_response = Mock()
        mock_response.json.return_value = {
            "results": [{
                "metadata": {
                    "inciName": ["WATER"],
                    "casNo": ["7732-18-5"],
                }
            }]
        }
        mock_post.return_value = mock_response

        result = cosing_search("7732-18-5", mode="cas")

        assert result is not None
        assert "7732-18-5" in result["casNo"]

    @patch('pif_compiler.services.cosing_service.req.post')
    def test_search_by_ec_success(self, mock_post):
        """Test successful search by EC number."""
        mock_response = Mock()
        mock_response.json.return_value = {
            "results": [{
                "metadata": {
                    "ecNo": ["231-791-2"],
                }
            }]
        }
        mock_post.return_value = mock_response

        result = cosing_search("231-791-2", mode="ec")

        assert result is not None
        assert "231-791-2" in result["ecNo"]

    @patch('pif_compiler.services.cosing_service.req.post')
    def test_search_by_id_success(self, mock_post):
        """Test successful search by substance ID."""
        mock_response = Mock()
        mock_response.json.return_value = {
            "results": [{
                "metadata": {
                    "substanceId": ["12345"],
                }
            }]
        }
        mock_post.return_value = mock_response

        result = cosing_search("12345", mode="id")

        assert result is not None
        assert result["substanceId"] == ["12345"]

    @patch('pif_compiler.services.cosing_service.req.post')
    def test_search_no_results(self, mock_post):
        """Test that a search with no results returns None."""
        mock_response = Mock()
        mock_response.json.return_value = {"results": []}
        mock_post.return_value = mock_response

        result = cosing_search("NONEXISTENT", mode="name")
        assert result is None

    def test_search_invalid_mode(self):
        """Test that invalid mode raises ValueError."""
        with pytest.raises(ValueError):
            cosing_search("WATER", mode="invalid_mode")

class TestCleanCosing:
    """Test COSING JSON cleaning function."""

    def test_clean_basic_fields(self, sample_cosing_response):
        """Test cleaning basic string and list fields."""
        result = clean_cosing(sample_cosing_response, full=False)

        assert result["inciName"] == "WATER"
        assert result["casNo"] == ["7732-18-5"]
        assert result["ecNo"] == ["231-791-2"]

    def test_removes_empty_tags(self, sample_cosing_response):
        """Test that <empty> tags are removed."""
        sample_cosing_response["inciName"] = ["<empty>"]
        sample_cosing_response["functionName"] = ["<empty>"]

        result = clean_cosing(sample_cosing_response, full=False)

        assert "<empty>" not in result["inciName"]
        assert result["functionName"] == []

    def test_parses_cas_numbers(self, sample_cosing_response):
        """Test that CAS numbers are parsed correctly."""
        sample_cosing_response["casNo"] = ["56-81-5"]

        result = clean_cosing(sample_cosing_response, full=False)

        assert result["casNo"] == ["56-81-5"]

    def test_creates_cosing_url(self, sample_cosing_response):
        """Test that the COSING URL is created."""
        result = clean_cosing(sample_cosing_response, full=False)

        assert "cosingUrl" in result
        assert "12345" in result["cosingUrl"]
        assert result["cosingUrl"] == "https://ec.europa.eu/growth/tools-databases/cosing/details/12345"

    def test_renames_common_name(self, sample_cosing_response):
        """Test that nameOfCommonIngredientsGlossary is renamed."""
        result = clean_cosing(sample_cosing_response, full=False)

        assert "commonName" in result
        assert result["commonName"] == "Water"
        assert "nameOfCommonIngredientsGlossary" not in result

    def test_empty_lists_handled(self, sample_cosing_response):
        """Test that empty lists are handled correctly."""
        sample_cosing_response["inciName"] = []
        sample_cosing_response["casNo"] = []

        result = clean_cosing(sample_cosing_response, full=False)

        assert result["inciName"] == ""
        assert result["casNo"] == []


class TestIntegration:
    """Integration tests with the real API (marked as integration)."""

    @pytest.mark.integration
    def test_real_water_search(self):
        """Test real API call for WATER (requires internet)."""
        result = cosing_search("WATER", mode="name")

        if result and isinstance(result, dict):
            # Real API call succeeded
            assert "inciName" in result or "casNo" in result

    @pytest.mark.integration
    def test_real_cas_search(self):
        """Test real API call by CAS number (requires internet)."""
        result = cosing_search("56-81-5", mode="cas")

        if result and isinstance(result, dict):
            assert "casNo" in result

    @pytest.mark.integration
    def test_full_workflow(self):
        """Test complete workflow: search -> clean."""
        # Search for glycerin
        raw_result = cosing_search("GLYCERIN", mode="name")

        if raw_result and isinstance(raw_result, dict):
            # Clean the result
            clean_result = clean_cosing(raw_result, full=False)

            # Verify cleaned structure
            assert "cosingUrl" in clean_result
            assert isinstance(clean_result.get("casNo"), list)
857
tests/test_echa_find.py
Normal file
@ -0,0 +1,857 @@
"""
Tests for ECHA Find Service

Test coverage:
- search_dossier: Complete workflow for searching ECHA dossiers
- Substance search (by CAS, EC, rmlName)
- Dossier retrieval (Active/Inactive)
- HTML parsing for toxicology sections
- Error handling and edge cases
"""

import pytest
from unittest.mock import Mock, patch, MagicMock
from datetime import datetime

from pif_compiler.services.echa_find import search_dossier


class TestSearchDossierSubstanceSearch:
    """Test the initial substance search phase of search_dossier."""

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_successful_cas_search(self, mock_get):
        """Test successful search by CAS number."""
        # Mock the substance search API response
        mock_response = Mock()
        mock_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8",
                }
            }]
        }
        mock_get.return_value = mock_response

        # Mocking all subsequent calls
        with patch('pif_compiler.services.echa_find.requests.get') as mock_all_gets:
            # First call: substance search (already mocked above)
            # Second call: dossier list
            mock_dossier_response = Mock()
            mock_dossier_response.json.return_value = {
                "items": [{
                    "assetExternalId": "abc123",
                    "rootKey": "key123",
                    "lastUpdatedDate": "2024-01-15T10:30:00Z",
                }]
            }

            # Third call: index.html page
            mock_index_response = Mock()
            mock_index_response.text = """
            <html>
                <div id="id_7_Toxicologicalinformation">
                    <a href="tox_summary_001"></a>
                </div>
                <div id="id_72_AcuteToxicity">
                    <a href="acute_tox_001"></a>
                </div>
                <div id="id_75_Repeateddosetoxicity">
                    <a href="repeated_dose_001"></a>
                </div>
            </html>
            """

            mock_all_gets.side_effect = [
                mock_response,
                mock_dossier_response,
                mock_index_response,
            ]

            result = search_dossier("50-00-0", input_type="rmlCas")

            assert result is not False
            assert result["rmlCas"] == "50-00-0"
            assert result["rmlName"] == "Test Substance"
            assert result["rmlId"] == "100.000.001"
            assert result["rmlEc"] == "200-001-8"

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_successful_ec_search(self, mock_get):
        """Test successful search by EC number."""
        mock_response = Mock()
        mock_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8",
                }
            }]
        }
        mock_get.return_value = mock_response

        with patch('pif_compiler.services.echa_find.requests.get') as mock_all_gets:
            mock_dossier_response = Mock()
            mock_dossier_response.json.return_value = {
                "items": [{
                    "assetExternalId": "abc123",
                    "rootKey": "key123",
                    "lastUpdatedDate": "2024-01-15T10:30:00Z",
                }]
            }

            mock_index_response = Mock()
            mock_index_response.text = "<html><div id='id_7_Toxicologicalinformation'><a href='tox_001'></a></div></html>"

            mock_all_gets.side_effect = [
                mock_response,
                mock_dossier_response,
                mock_index_response,
            ]

            result = search_dossier("200-001-8", input_type="rmlEc")

            assert result is not False
            assert result["rmlEc"] == "200-001-8"

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_successful_name_search(self, mock_get):
        """Test successful search by substance name."""
        mock_response = Mock()
        mock_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "formaldehyde",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8",
                }
            }]
        }
        mock_get.return_value = mock_response

        with patch('pif_compiler.services.echa_find.requests.get') as mock_all_gets:
            mock_dossier_response = Mock()
            mock_dossier_response.json.return_value = {
                "items": [{
                    "assetExternalId": "abc123",
                    "rootKey": "key123",
                    "lastUpdatedDate": "2024-01-15T10:30:00Z",
                }]
            }

            mock_index_response = Mock()
            mock_index_response.text = "<html><div id='id_7_Toxicologicalinformation'><a href='tox_001'></a></div></html>"

            mock_all_gets.side_effect = [
                mock_response,
                mock_dossier_response,
                mock_index_response,
            ]

            result = search_dossier("formaldehyde", input_type="rmlName")

            assert result is not False
            assert result["rmlName"] == "formaldehyde"
|
||||||
|
@patch('pif_compiler.services.echa_find.requests.get')
|
||||||
|
def test_substance_not_found(self, mock_get):
|
||||||
|
"""Test when substance is not found in ECHA."""
|
||||||
|
mock_response = Mock()
|
||||||
|
mock_response.json.return_value = {"items": []}
|
||||||
|
mock_get.return_value = mock_response
|
||||||
|
|
||||||
|
result = search_dossier("999-99-9", input_type="rmlCas")
|
||||||
|
|
||||||
|
assert result is False
|
||||||
|
|
||||||
|
@patch('pif_compiler.services.echa_find.requests.get')
|
||||||
|
def test_empty_items_array(self, mock_get):
|
||||||
|
"""Test when API returns empty items array."""
|
||||||
|
mock_response = Mock()
|
||||||
|
mock_response.json.return_value = {"items": []}
|
||||||
|
mock_get.return_value = mock_response
|
||||||
|
|
||||||
|
result = search_dossier("NONEXISTENT", input_type="rmlName")
|
||||||
|
|
||||||
|
assert result is False
|
||||||
|
|
||||||
|
@patch('pif_compiler.services.echa_find.requests.get')
|
||||||
|
def test_malformed_api_response(self, mock_get):
|
||||||
|
"""Test when API response is malformed."""
|
||||||
|
mock_response = Mock()
|
||||||
|
mock_response.json.return_value = {} # Missing 'items' key
|
||||||
|
mock_get.return_value = mock_response
|
||||||
|
|
||||||
|
result = search_dossier("50-00-0", input_type="rmlCas")
|
||||||
|
|
||||||
|
assert result is False
|
||||||
|
|
||||||
|
|
||||||
|


class TestSearchDossierInputTypeValidation:
    """Test input_type parameter validation."""

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_input_type_mismatch_cas(self, mock_get):
        """Test when input_type doesn't match the actual search result (CAS)."""
        mock_response = Mock()
        mock_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }
        mock_get.return_value = mock_response

        # Search with a CAS number but specify the wrong input_type
        result = search_dossier("50-00-0", input_type="rmlEc")

        assert isinstance(result, str)
        assert "search_error" in result
        assert "not equal" in result

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_input_type_correct_match(self, mock_get):
        """Test when input_type correctly matches the search result."""
        mock_response = Mock()
        mock_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }
        mock_get.return_value = mock_response

        with patch('pif_compiler.services.echa_find.requests.get') as mock_all_gets:
            mock_dossier_response = Mock()
            mock_dossier_response.json.return_value = {
                "items": [{
                    "assetExternalId": "abc123",
                    "rootKey": "key123",
                    "lastUpdatedDate": "2024-01-15T10:30:00Z"
                }]
            }

            mock_index_response = Mock()
            mock_index_response.text = "<html><div id='id_7_Toxicologicalinformation'><a href='tox_001'></a></div></html>"

            mock_all_gets.side_effect = [
                mock_response,
                mock_dossier_response,
                mock_index_response
            ]

            result = search_dossier("50-00-0", input_type="rmlCas")

            assert result is not False
            assert isinstance(result, dict)


class TestSearchDossierDossierRetrieval:
    """Test dossier retrieval (Active/Inactive)."""

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_active_dossier_found(self, mock_get):
        """Test when an active dossier is found."""
        mock_substance_response = Mock()
        mock_substance_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        mock_dossier_response = Mock()
        mock_dossier_response.json.return_value = {
            "items": [{
                "assetExternalId": "abc123",
                "rootKey": "key123",
                "lastUpdatedDate": "2024-01-15T10:30:00Z"
            }]
        }

        mock_index_response = Mock()
        mock_index_response.text = "<html><div id='id_7_Toxicologicalinformation'><a href='tox_001'></a></div></html>"

        mock_get.side_effect = [
            mock_substance_response,
            mock_dossier_response,
            mock_index_response
        ]

        result = search_dossier("50-00-0", input_type="rmlCas")

        assert result is not False
        assert result["dossierType"] == "Active"

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_inactive_dossier_fallback(self, mock_get):
        """Test when only an inactive dossier exists (fallback)."""
        mock_substance_response = Mock()
        mock_substance_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        # First dossier call returns empty (no active)
        mock_active_dossier_response = Mock()
        mock_active_dossier_response.json.return_value = {"items": []}

        # Second dossier call returns inactive
        mock_inactive_dossier_response = Mock()
        mock_inactive_dossier_response.json.return_value = {
            "items": [{
                "assetExternalId": "abc123",
                "rootKey": "key123",
                "lastUpdatedDate": "2024-01-15T10:30:00Z"
            }]
        }

        mock_index_response = Mock()
        mock_index_response.text = "<html><div id='id_7_Toxicologicalinformation'><a href='tox_001'></a></div></html>"

        mock_get.side_effect = [
            mock_substance_response,
            mock_active_dossier_response,
            mock_inactive_dossier_response,
            mock_index_response
        ]

        result = search_dossier("50-00-0", input_type="rmlCas")

        assert result is not False
        assert result["dossierType"] == "Inactive"

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_no_dossiers_found(self, mock_get):
        """Test when no dossiers (active or inactive) are found."""
        mock_substance_response = Mock()
        mock_substance_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        # Both active and inactive return empty
        mock_empty_response = Mock()
        mock_empty_response.json.return_value = {"items": []}

        mock_get.side_effect = [
            mock_substance_response,
            mock_empty_response,  # Active
            mock_empty_response   # Inactive
        ]

        result = search_dossier("50-00-0", input_type="rmlCas")

        assert result is False

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_last_update_date_parsed(self, mock_get):
        """Test that lastUpdateDate is correctly parsed."""
        mock_substance_response = Mock()
        mock_substance_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        mock_dossier_response = Mock()
        mock_dossier_response.json.return_value = {
            "items": [{
                "assetExternalId": "abc123",
                "rootKey": "key123",
                "lastUpdatedDate": "2024-01-15T10:30:00Z"
            }]
        }

        mock_index_response = Mock()
        mock_index_response.text = "<html><div id='id_7_Toxicologicalinformation'><a href='tox_001'></a></div></html>"

        mock_get.side_effect = [
            mock_substance_response,
            mock_dossier_response,
            mock_index_response
        ]

        result = search_dossier("50-00-0", input_type="rmlCas")

        assert result is not False
        assert "lastUpdateDate" in result
        assert result["lastUpdateDate"] == "2024-01-15"

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_missing_last_update_date(self, mock_get):
        """Test when lastUpdateDate is missing from the response."""
        mock_substance_response = Mock()
        mock_substance_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        mock_dossier_response = Mock()
        mock_dossier_response.json.return_value = {
            "items": [{
                "assetExternalId": "abc123",
                "rootKey": "key123"
                # lastUpdatedDate missing
            }]
        }

        mock_index_response = Mock()
        mock_index_response.text = "<html><div id='id_7_Toxicologicalinformation'><a href='tox_001'></a></div></html>"

        mock_get.side_effect = [
            mock_substance_response,
            mock_dossier_response,
            mock_index_response
        ]

        result = search_dossier("50-00-0", input_type="rmlCas")

        assert result is not False
        # Should still work, just without lastUpdateDate
        assert "lastUpdateDate" not in result


class TestSearchDossierHTMLParsing:
    """Test HTML parsing for toxicology sections."""

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_all_tox_sections_found(self, mock_get):
        """Test when all toxicology sections are found."""
        mock_substance_response = Mock()
        mock_substance_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        mock_dossier_response = Mock()
        mock_dossier_response.json.return_value = {
            "items": [{
                "assetExternalId": "abc123",
                "rootKey": "key123",
                "lastUpdatedDate": "2024-01-15T10:30:00Z"
            }]
        }

        mock_index_response = Mock()
        mock_index_response.text = """
        <html>
            <div id="id_7_Toxicologicalinformation">
                <a href="tox_summary_001"></a>
            </div>
            <div id="id_72_AcuteToxicity">
                <a href="acute_tox_001"></a>
            </div>
            <div id="id_75_Repeateddosetoxicity">
                <a href="repeated_dose_001"></a>
            </div>
        </html>
        """

        mock_get.side_effect = [
            mock_substance_response,
            mock_dossier_response,
            mock_index_response
        ]

        result = search_dossier("50-00-0", input_type="rmlCas")

        assert result is not False
        assert "ToxSummary" in result
        assert "AcuteToxicity" in result
        assert "RepeatedDose" in result
        assert "tox_summary_001" in result["ToxSummary"]
        assert "acute_tox_001" in result["AcuteToxicity"]
        assert "repeated_dose_001" in result["RepeatedDose"]

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_only_tox_summary_found(self, mock_get):
        """Test when only the ToxSummary section exists."""
        mock_substance_response = Mock()
        mock_substance_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        mock_dossier_response = Mock()
        mock_dossier_response.json.return_value = {
            "items": [{
                "assetExternalId": "abc123",
                "rootKey": "key123",
                "lastUpdatedDate": "2024-01-15T10:30:00Z"
            }]
        }

        mock_index_response = Mock()
        mock_index_response.text = """
        <html>
            <div id="id_7_Toxicologicalinformation">
                <a href="tox_summary_001"></a>
            </div>
        </html>
        """

        mock_get.side_effect = [
            mock_substance_response,
            mock_dossier_response,
            mock_index_response
        ]

        result = search_dossier("50-00-0", input_type="rmlCas")

        assert result is not False
        assert "ToxSummary" in result
        assert "AcuteToxicity" not in result
        assert "RepeatedDose" not in result

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_no_tox_sections_found(self, mock_get):
        """Test when no toxicology sections are found."""
        mock_substance_response = Mock()
        mock_substance_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        mock_dossier_response = Mock()
        mock_dossier_response.json.return_value = {
            "items": [{
                "assetExternalId": "abc123",
                "rootKey": "key123",
                "lastUpdatedDate": "2024-01-15T10:30:00Z"
            }]
        }

        mock_index_response = Mock()
        mock_index_response.text = "<html><body>No toxicology sections</body></html>"

        mock_get.side_effect = [
            mock_substance_response,
            mock_dossier_response,
            mock_index_response
        ]

        result = search_dossier("50-00-0", input_type="rmlCas")

        assert result is not False
        assert "ToxSummary" not in result
        assert "AcuteToxicity" not in result
        assert "RepeatedDose" not in result
        # Basic info should still be present
        assert "rmlId" in result
        assert "index" in result

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_js_links_created(self, mock_get):
        """Test that both HTML and JS links are created."""
        mock_substance_response = Mock()
        mock_substance_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        mock_dossier_response = Mock()
        mock_dossier_response.json.return_value = {
            "items": [{
                "assetExternalId": "abc123",
                "rootKey": "key123",
                "lastUpdatedDate": "2024-01-15T10:30:00Z"
            }]
        }

        mock_index_response = Mock()
        mock_index_response.text = """
        <html>
            <div id="id_7_Toxicologicalinformation">
                <a href="tox_summary_001"></a>
            </div>
            <div id="id_72_AcuteToxicity">
                <a href="acute_tox_001"></a>
            </div>
        </html>
        """

        mock_get.side_effect = [
            mock_substance_response,
            mock_dossier_response,
            mock_index_response
        ]

        result = search_dossier("50-00-0", input_type="rmlCas")

        assert result is not False
        assert "ToxSummary" in result
        assert "ToxSummary_js" in result
        assert "AcuteToxicity" in result
        assert "AcuteToxicity_js" in result
        assert "index" in result
        assert "index_js" in result


class TestSearchDossierURLConstruction:
    """Test URL construction for various endpoints."""

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_search_response_url(self, mock_get):
        """Test that the search_response URL is correctly constructed."""
        mock_substance_response = Mock()
        mock_substance_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": "Test Substance",
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        mock_dossier_response = Mock()
        mock_dossier_response.json.return_value = {
            "items": [{
                "assetExternalId": "abc123",
                "rootKey": "key123",
                "lastUpdatedDate": "2024-01-15T10:30:00Z"
            }]
        }

        mock_index_response = Mock()
        mock_index_response.text = "<html></html>"

        mock_get.side_effect = [
            mock_substance_response,
            mock_dossier_response,
            mock_index_response
        ]

        result = search_dossier("50-00-0", input_type="rmlCas")

        assert result is not False
        assert "search_response" in result
        assert "50-00-0" in result["search_response"]
        assert "https://chem.echa.europa.eu/api-substance/v1/substance" in result["search_response"]

    @patch('pif_compiler.services.echa_find.requests.get')
    def test_url_encoding(self, mock_get):
        """Test that special characters in substance names are URL-encoded."""
        substance_name = "test substance with spaces"

        mock_substance_response = Mock()
        mock_substance_response.json.return_value = {
            "items": [{
                "substanceIndex": {
                    "rmlId": "100.000.001",
                    "rmlName": substance_name,
                    "rmlCas": "50-00-0",
                    "rmlEc": "200-001-8"
                }
            }]
        }

        mock_dossier_response = Mock()
        mock_dossier_response.json.return_value = {
            "items": [{
                "assetExternalId": "abc123",
                "rootKey": "key123",
                "lastUpdatedDate": "2024-01-15T10:30:00Z"
            }]
        }

        mock_index_response = Mock()
        mock_index_response.text = "<html></html>"

        mock_get.side_effect = [
            mock_substance_response,
            mock_dossier_response,
            mock_index_response
        ]

        result = search_dossier(substance_name, input_type="rmlName")

        assert result is not False
        assert "search_response" in result
        # Spaces should be encoded
        assert "%20" in result["search_response"] or "+" in result["search_response"]
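# ---------------------------------------------------------------------------
# Illustrative sketch (not the service's actual implementation): the
# URL-encoding behaviour that test_url_encoding checks for. Spaces in a
# substance name become "%20" once the term is percent-encoded for the
# search URL. `encoded_search_term` is a hypothetical helper name.
from urllib.parse import quote


def encoded_search_term(term: str) -> str:
    """Percent-encode a search term before placing it in a query URL."""
    return quote(term)  # "test substance" -> "test%20substance"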


class TestIntegration:
    """Integration tests against the real API (marked as integration)."""

    @pytest.mark.integration
    def test_real_formaldehyde_search(self):
        """Test real API call for formaldehyde (requires internet)."""
        result = search_dossier("50-00-0", input_type="rmlCas")

        if result and isinstance(result, dict):
            # Real API call succeeded
            assert "rmlId" in result
            assert "rmlName" in result
            assert "rmlCas" in result
            assert result["rmlCas"] == "50-00-0"
            assert "index" in result
            assert "dossierType" in result

    @pytest.mark.integration
    def test_real_water_search(self):
        """Test real API call for water by CAS (requires internet)."""
        result = search_dossier("7732-18-5", input_type="rmlCas")

        if result and isinstance(result, dict):
            assert "rmlCas" in result
            assert result["rmlCas"] == "7732-18-5"

    @pytest.mark.integration
    def test_real_nonexistent_substance(self):
        """Test real API call for a non-existent substance (requires internet)."""
        result = search_dossier("999-99-9", input_type="rmlCas")

        # Should return False for a non-existent substance
        assert result is False or isinstance(result, str)

    @pytest.mark.integration
    def test_real_glycerin_search(self):
        """Test real API call for glycerin (requires internet)."""
        result = search_dossier("56-81-5", input_type="rmlCas")

        if result and isinstance(result, dict):
            assert "rmlCas" in result
            assert result["rmlCas"] == "56-81-5"
            assert "rmlId" in result
            assert "dossierType" in result

    @pytest.mark.integration
    def test_real_niacinamide_search(self):
        """Test real API call for niacinamide (requires internet)."""
        result = search_dossier("98-92-0", input_type="rmlCas")

        if result and isinstance(result, dict):
            assert "rmlCas" in result
            assert result["rmlCas"] == "98-92-0"

    @pytest.mark.integration
    def test_real_retinol_search(self):
        """Test real API call for retinol (requires internet)."""
        result = search_dossier("68-26-8", input_type="rmlCas")

        if result and isinstance(result, dict):
            assert "rmlCas" in result
            assert result["rmlCas"] == "68-26-8"

    @pytest.mark.integration
    def test_real_caffeine_search(self):
        """Test real API call for caffeine (requires internet)."""
        result = search_dossier("58-08-2", input_type="rmlCas")

        if result and isinstance(result, dict):
            assert "rmlCas" in result
            assert result["rmlCas"] == "58-08-2"

    @pytest.mark.integration
    def test_real_salicylic_acid_search(self):
        """Test real API call for salicylic acid (requires internet)."""
        result = search_dossier("69-72-7", input_type="rmlCas")

        if result and isinstance(result, dict):
            assert "rmlCas" in result
            assert result["rmlCas"] == "69-72-7"

    @pytest.mark.integration
    def test_real_titanium_dioxide_search(self):
        """Test real API call for titanium dioxide (requires internet)."""
        result = search_dossier("13463-67-7", input_type="rmlCas")

        if result and isinstance(result, dict):
            assert "rmlCas" in result
            assert result["rmlCas"] == "13463-67-7"

    @pytest.mark.integration
    def test_real_zinc_oxide_search(self):
        """Test real API call for zinc oxide (requires internet)."""
        result = search_dossier("1314-13-2", input_type="rmlCas")

        if result and isinstance(result, dict):
            assert "rmlCas" in result
            assert result["rmlCas"] == "1314-13-2"

    @pytest.mark.integration
    def test_multiple_cosmetic_ingredients(self, sample_cas_numbers):
        """Test real API calls for multiple cosmetic ingredients (requires internet)."""
        import time

        # Test a subset of common cosmetic ingredients
        test_ingredients = [
            ("water", "7732-18-5"),
            ("glycerin", "56-81-5"),
            ("propylene_glycol", "57-55-6"),
        ]

        for name, cas in test_ingredients:
            result = search_dossier(cas, input_type="rmlCas")
            if result and isinstance(result, dict):
                assert result["rmlCas"] == cas
                assert "rmlId" in result
            # Give the API some time between requests
            time.sleep(0.5)
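# ---------------------------------------------------------------------------
# Illustrative sketch (not the service's actual implementation): how the
# ISO-8601 "lastUpdatedDate" value "2024-01-15T10:30:00Z" can be reduced
# to the plain "2024-01-15" string that test_last_update_date_parsed
# asserts on. `parse_last_update` is a hypothetical helper name.
from datetime import datetime


def parse_last_update(iso_timestamp: str) -> str:
    """Return only the YYYY-MM-DD part of an ISO-8601 timestamp."""
    # fromisoformat() on older Pythons rejects a literal "Z" suffix,
    # so normalise it to an explicit UTC offset first.
    return datetime.fromisoformat(iso_timestamp.replace("Z", "+00:00")).date().isoformat()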
153
utils/README.md
Normal file
@@ -0,0 +1,153 @@
# PIF Compiler - MongoDB Docker Setup

## Quick Start

Start MongoDB and the Mongo Express web interface:

```bash
cd utils
docker-compose up -d
```

Stop the services:

```bash
docker-compose down
```

Stop and remove all data:

```bash
docker-compose down -v
```

## Services

### MongoDB
- **Port**: 27017
- **Database**: toxinfo
- **Username**: admin
- **Password**: admin123
- **Connection String**: `mongodb://admin:admin123@localhost:27017/toxinfo?authSource=admin`

### Mongo Express (Web UI)
- **URL**: http://localhost:8082
- **Username**: admin
- **Password**: admin123

## Usage in Python

Update your MongoDB connection in `src/pif_compiler/functions/mongo_functions.py`:

```python
# For local development with Docker
db = connect(user="admin", password="admin123", database="toxinfo")
```

Or use the connection URI directly:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://admin:admin123@localhost:27017/toxinfo?authSource=admin")
db = client['toxinfo']
```

## Data Persistence

Data is persisted in Docker volumes:
- `mongodb_data` - Database files
- `mongodb_config` - Configuration files

These volumes persist even when the containers are stopped.

## Creating an Application User

It's recommended to create a dedicated user for your application instead of using the admin account.

### Option 1: Using mongosh (MongoDB Shell)

```bash
# Access the MongoDB shell
docker exec -it pif_mongodb mongosh -u admin -p admin123 --authenticationDatabase admin

# In the MongoDB shell, run:
use toxinfo

db.createUser({
    user: "pif_app",
    pwd: "pif_app_password",
    roles: [
        { role: "readWrite", db: "toxinfo" }
    ]
})

# Exit the shell
exit
```

### Option 2: Using the Mongo Express Web UI

1. Go to http://localhost:8082
2. Log in with admin/admin123
3. Select the `toxinfo` database
4. Click on the "Users" tab
5. Add a new user with the `readWrite` role

### Option 3: Using a Python Script

Create a file `utils/create_user.py`:

```python
from pymongo import MongoClient

# Connect as admin
client = MongoClient("mongodb://admin:admin123@localhost:27017/?authSource=admin")
db = client['toxinfo']

# Create the application user
db.command("createUser", "pif_app",
           pwd="pif_app_password",
           roles=[{"role": "readWrite", "db": "toxinfo"}])

print("User 'pif_app' created successfully!")
client.close()
```

Run it:

```bash
cd utils
uv run python create_user.py
```

### Update Your Application

After creating the user, update your connection in `src/pif_compiler/functions/mongo_functions.py`:

```python
# Use the application user instead of admin
db = connect(user="pif_app", password="pif_app_password", database="toxinfo")
```

Or with a connection URI:

```python
client = MongoClient("mongodb://pif_app:pif_app_password@localhost:27017/toxinfo?authSource=toxinfo")
```

### Available Roles

- `read`: Read-only access to the database
- `readWrite`: Read and write access (recommended for your app)
- `dbAdmin`: Database administration (create indexes, etc.)
- `userAdmin`: Manage users and roles

## Security Notes

⚠️ **WARNING**: The default credentials are for local development only.

For production:
1. Change all passwords in `docker-compose.yml`
2. Use environment variables or a secrets manager
3. Create dedicated users with the minimal required permissions
4. Configure firewall rules
5. Enable SSL/TLS connections
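As a minimal sketch of point 2 above, the connection URI can be built from environment variables instead of hardcoded credentials. `MONGO_USER` and `MONGO_PASSWORD` are assumed variable names, not ones this repo defines:

```python
import os

def mongo_uri_from_env(host="localhost", port=27017, database="toxinfo"):
    """Build a MongoDB connection URI from environment variables.

    MONGO_USER / MONGO_PASSWORD are assumed names; adapt them to your
    deployment's secrets management.
    """
    user = os.environ["MONGO_USER"]          # raises KeyError if unset
    password = os.environ["MONGO_PASSWORD"]
    return f"mongodb://{user}:{password}@{host}:{port}/{database}?authSource=admin"
```

The resulting string can be passed straight to `MongoClient(...)` as in the examples above.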
140
utils/changelog.py
Normal file
@@ -0,0 +1,140 @@
#!/usr/bin/env python3
"""
Change Log Manager
Manages a change.log file with external and internal changes.
"""

import os
from datetime import datetime
from enum import Enum


class ChangeType(Enum):
    EXTERNAL = "EXTERNAL"
    INTERNAL = "INTERNAL"


class ChangeLogManager:
    def __init__(self, log_file="change.log"):
        self.log_file = log_file
        self._ensure_log_exists()

    def _ensure_log_exists(self):
        """Create the log file if it doesn't exist."""
        if not os.path.exists(self.log_file):
            with open(self.log_file, 'w') as f:
                f.write("# Change Log\n")
                f.write(f"# Created: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")

    def add_change(self, change_type, description):
        """
        Add a change entry to the log.

        Args:
            change_type (ChangeType): Type of change (EXTERNAL or INTERNAL)
            description (str): Description of the change
        """
        timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        entry = f"[{timestamp}] [{change_type.value}] {description}\n"

        with open(self.log_file, 'a') as f:
            f.write(entry)

        print(f"✓ Change added: {change_type.value} - {description}")

    def view_log(self, filter_type=None):
        """
        View the change log with optional filtering.

        Args:
            filter_type (ChangeType, optional): Filter by change type
        """
        if not os.path.exists(self.log_file):
            print("No change log found.")
            return

        with open(self.log_file, 'r') as f:
            lines = f.readlines()

        print("\n" + "="*70)
        print("CHANGE LOG")
        print("="*70 + "\n")

        for line in lines:
            if filter_type and f"[{filter_type.value}]" not in line:
                continue
            print(line, end='')

        print("\n" + "="*70 + "\n")

    def get_statistics(self):
        """Display statistics about the change log."""
        if not os.path.exists(self.log_file):
            print("No change log found.")
            return

        with open(self.log_file, 'r') as f:
            lines = f.readlines()

        external_count = sum(1 for line in lines if "[EXTERNAL]" in line)
        internal_count = sum(1 for line in lines if "[INTERNAL]" in line)
        total = external_count + internal_count

        print("\n" + "="*40)
        print("CHANGE LOG STATISTICS")
        print("="*40)
        print(f"Total changes: {total}")
        print(f"External changes: {external_count}")
        print(f"Internal changes: {internal_count}")
        print("="*40 + "\n")


def main():
    manager = ChangeLogManager()

    while True:
        print("\n📝 Change Log Manager")
        print("1. Add External Change")
        print("2. Add Internal Change")
        print("3. View All Changes")
        print("4. View External Changes Only")
        print("5. View Internal Changes Only")
        print("6. Show Statistics")
        print("7. Exit")

        choice = input("\nSelect an option (1-7): ").strip()

        if choice == '1':
            description = input("Enter external change description: ").strip()
|
if description:
|
||||||
|
manager.add_change(ChangeType.EXTERNAL, description)
|
||||||
|
else:
|
||||||
|
print("Description cannot be empty.")
|
||||||
|
|
||||||
|
elif choice == '2':
|
||||||
|
description = input("Enter internal change description: ").strip()
|
||||||
|
if description:
|
||||||
|
manager.add_change(ChangeType.INTERNAL, description)
|
||||||
|
else:
|
||||||
|
print("Description cannot be empty.")
|
||||||
|
|
||||||
|
elif choice == '3':
|
||||||
|
manager.view_log()
|
||||||
|
|
||||||
|
elif choice == '4':
|
||||||
|
manager.view_log(filter_type=ChangeType.EXTERNAL)
|
||||||
|
|
||||||
|
elif choice == '5':
|
||||||
|
manager.view_log(filter_type=ChangeType.INTERNAL)
|
||||||
|
|
||||||
|
elif choice == '6':
|
||||||
|
manager.get_statistics()
|
||||||
|
|
||||||
|
elif choice == '7':
|
||||||
|
print("Goodbye!")
|
||||||
|
break
|
||||||
|
|
||||||
|
else:
|
||||||
|
print("Invalid option. Please select 1-7.")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
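Every line `add_change` writes follows a fixed `[timestamp] [TYPE] description` shape, so downstream tooling can recover the fields with a small regex. A sketch (the pattern and helper name are illustrative, not part of the module):

```python
import re

# Shape of each line written by ChangeLogManager.add_change:
#   [YYYY-MM-DD HH:MM:SS] [EXTERNAL|INTERNAL] description
ENTRY_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] "
    r"\[(?P<kind>EXTERNAL|INTERNAL)\] (?P<desc>.*)$"
)

def parse_entry(line):
    """Return (timestamp, kind, description), or None for non-entry
    lines such as the '# Change Log' header."""
    m = ENTRY_RE.match(line)
    return (m.group("ts"), m.group("kind"), m.group("desc")) if m else None

print(parse_entry("[2025-01-15 10:30:00] [EXTERNAL] Bumped pymongo"))
# → ('2025-01-15 10:30:00', 'EXTERNAL', 'Bumped pymongo')
```

Returning `None` instead of raising keeps the parser safe to run over the whole file, header lines included.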
86
utils/create_user.py
Normal file

@@ -0,0 +1,86 @@
"""
Create MongoDB application user for PIF Compiler.

This script creates a dedicated user with readWrite permissions
on the toxinfo database instead of using the admin account.
"""

import sys

from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError, OperationFailure


def create_app_user():
    """Create application user for toxinfo database."""

    # Configuration
    ADMIN_USER = "admin"
    ADMIN_PASSWORD = "admin123"
    MONGO_HOST = "localhost"
    MONGO_PORT = 27017

    APP_USER = "pif_app"
    APP_PASSWORD = "marox123"
    APP_DATABASE = "pif-projects"

    print("Connecting to MongoDB as admin...")

    try:
        # Connect as admin
        client = MongoClient(
            f"mongodb://{ADMIN_USER}:{ADMIN_PASSWORD}@{MONGO_HOST}:{MONGO_PORT}/?authSource=admin",
            serverSelectionTimeoutMS=5000
        )

        # Test connection
        client.admin.command('ping')
        print("✓ Connected to MongoDB successfully")

        # Switch to application database
        db = client[APP_DATABASE]

        # Create application user
        print(f"\nCreating user '{APP_USER}' with readWrite permissions on '{APP_DATABASE}'...")

        db.command(
            "createUser",
            APP_USER,
            pwd=APP_PASSWORD,
            roles=[{"role": "readWrite", "db": APP_DATABASE}]
        )

        print(f"✓ User '{APP_USER}' created successfully!")
        print("\nConnection details:")
        print(f"  Username: {APP_USER}")
        print(f"  Password: {APP_PASSWORD}")
        print(f"  Database: {APP_DATABASE}")
        print(f"  Connection String: mongodb://{APP_USER}:{APP_PASSWORD}@{MONGO_HOST}:{MONGO_PORT}/{APP_DATABASE}?authSource={APP_DATABASE}")

        print("\nUpdate your mongo_functions.py with:")
        print(f"  db = connect(user='{APP_USER}', password='{APP_PASSWORD}', database='{APP_DATABASE}')")

        client.close()
        return 0

    except DuplicateKeyError:
        print(f"⚠ User '{APP_USER}' already exists!")
        print("\nTo delete and recreate, run:")
        print("  docker exec -it pif_mongodb mongosh -u admin -p admin123 --authenticationDatabase admin")
        print(f"  use {APP_DATABASE}")
        print(f"  db.dropUser('{APP_USER}')")
        return 1

    except OperationFailure as e:
        print(f"✗ MongoDB operation failed: {e}")
        return 1

    except Exception as e:
        print(f"✗ Error: {e}")
        print("\nMake sure MongoDB is running:")
        print("  cd utils")
        print("  docker-compose up -d")
        return 1


if __name__ == "__main__":
    sys.exit(create_app_user())
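pymongo's `db.command("createUser", APP_USER, pwd=..., roles=...)` assembles a single command document from its arguments. A small helper that builds the equivalent document (the helper name is illustrative; values mirror the script's defaults) makes the least-privilege intent explicit and easy to reuse:

```python
def readwrite_user_spec(user, password, database):
    """Argument document for MongoDB's createUser command: readWrite
    on one database only, the minimum the application needs."""
    return {
        "createUser": user,
        "pwd": password,
        "roles": [{"role": "readWrite", "db": database}],
    }

spec = readwrite_user_spec("pif_app", "marox123", "pif-projects")
print(spec["roles"])  # → [{'role': 'readWrite', 'db': 'pif-projects'}]
```

Keeping the role list to a single `readWrite` entry on the application database is what distinguishes this account from the admin account it replaces.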
28
utils/docker-compose.yml
Normal file

@@ -0,0 +1,28 @@
version: '3.8'

services:
  mongodb:
    image: mongo:latest
    container_name: personal_mongodb
    restart: unless-stopped
    environment:
      MONGO_INITDB_ROOT_USERNAME: admin
      MONGO_INITDB_ROOT_PASSWORD: bello98A.
      MONGO_INITDB_DATABASE: toxinfo
    ports:
      - "27017:27017"
    volumes:
      - mongodb_data:/data/db
      - mongodb_config:/data/configdb
    networks:
      - personal_network

volumes:
  mongodb_data:
    driver: local
  mongodb_config:
    driver: local

networks:
  personal_network:
    driver: bridge