# ECHA Services Test Suite Summary

## Overview

Comprehensive test suites have been created for all three ECHA service modules, following a bottom-up approach from the lowest-level to the highest-level dependencies.

## Test Files Created

### 1. test_echa_parser.py (Lowest Level)
**Location**: `tests/test_echa_parser.py`

**Coverage**: Tests for HTML/Markdown/JSON processing functions

**Test Classes**:
- `TestOpenEchaPage` - HTML page opening (remote & local)
- `TestEchaPageToMarkdown` - HTML to Markdown conversion
- `TestMarkdownToJsonRaw` - Markdown to JSON conversion (skipped if markdown_to_json not installed)
- `TestNormalizeUnicodeCharacters` - Unicode normalization
- `TestCleanJson` - JSON cleaning and validation
- `TestIntegrationParser` - Full pipeline integration tests

**Total Tests**: 28 tests
- 20 tests for core parser functions
- 5 tests for markdown_to_json (conditional)
- 2 integration tests
- 1 test with known Unicode encoding issue (needs fix)

**Key Features**:
- Mocks external dependencies (requests, file I/O)
- Tests Unicode handling edge cases
- Validates data cleaning logic
- Tests comparison operator conversions (>, <, >=, <=)

**Known Issues**:
- Unicode literal encoding in test strings (line 372, 380) - use `chr()` instead of `\uXXXX`
- Missing `markdown_to_json` dependency (tests skip gracefully)

### 2. test_echa_service.py (Middle Level)
**Location**: `tests/test_echa_service.py`

**Coverage**: Tests for ECHA API interaction and search functions

**Test Classes**:
- `TestGetSubstanceByIdentifier` - Substance API search
- `TestGetDossierByRmlId` - Dossier retrieval with Active/Inactive fallback
- `TestExtractSectionLinks` - Section link extraction with validation
- `TestParseSectionsFromHtml` - HTML parsing for multiple sections
- `TestGetSectionLinksFromIndex` - Remote index.html fetching
- `TestGetSectionLinksFromFile` - Local file parsing
- `TestSearchDossier` - Main search workflow
- `TestIntegrationEchaService` - Real API integration tests (marked `@pytest.mark.integration`)

**Total Tests**: 36 tests
- 33 unit tests with mocked APIs
- 3 integration tests (require internet, marked for manual execution)

**Key Features**:
- Comprehensive API mocking
- Tests nested section bug fix (parent vs child section links)
- Tests URL encoding, error handling, fallback logic
- Tests local vs remote workflows
- Integration tests for real formaldehyde data

**Testing Approach**:
- Unit tests run by default (fast, no external deps)
- Integration tests require `-m integration` flag

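
The Active/Inactive fallback that `TestGetDossierByRmlId` exercises can be tested without network access by scripting the API responses in order. The sketch below is illustrative only: `get_dossier_with_fallback`, its `query_api` parameter, and the placeholder identifiers are hypothetical stand-ins, not the real signature of `get_dossier_by_rml_id()`.

```python
from unittest.mock import Mock, call


def get_dossier_with_fallback(rml_id, query_api):
    """Hypothetical stand-in: query Active dossiers first, then Inactive."""
    for status in ("Active", "Inactive"):
        result = query_api(rml_id, status)
        if result:  # an empty or missing result triggers the fallback
            return {"status": status, **result}
    return None


def test_falls_back_to_inactive():
    # First call (Active) finds nothing, second call (Inactive) succeeds.
    query_api = Mock(side_effect=[None, {"asset_id": "placeholder-asset"}])

    dossier = get_dossier_with_fallback("placeholder-rml-id", query_api)

    assert dossier == {"status": "Inactive", "asset_id": "placeholder-asset"}
    assert query_api.call_args_list == [
        call("placeholder-rml-id", "Active"),
        call("placeholder-rml-id", "Inactive"),
    ]


def test_returns_none_when_no_dossier_exists():
    # Both lookups come back empty, so the caller gets None.
    query_api = Mock(return_value=None)
    assert get_dossier_with_fallback("placeholder-rml-id", query_api) is None
    assert query_api.call_count == 2
```

Scripting responses with `side_effect` keeps the order of the two lookups explicit, which is the behavior a fallback test most needs to pin down.
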
### 3. test_echa_extractor.py (Highest Level)
**Location**: `tests/test_echa_extractor.py`

**Coverage**: Tests for high-level extraction orchestration

**Test Classes**:
- `TestSchemas` - Data schema validation
- `TestJsonToDataframe` - JSON to pandas DataFrame conversion
- `TestDfWrapper` - DataFrame metadata addition
- `TestEchaExtractLocal` - DuckDB cache querying
- `TestEchaExtract` - Main extraction workflow
- `TestIntegrationEchaExtractor` - Real data integration tests (marked `@pytest.mark.integration`)

**Total Tests**: 32 tests
- 28 unit tests with full mocking
- 4 integration tests (require internet)

**Key Features**:
- Tests both RepeatedDose and AcuteToxicity schemas
- Tests local cache (DuckDB) integration
- Tests key information extraction
- Tests error handling at each pipeline stage
- Tests DataFrame vs JSON output modes
- Validates metadata addition (substance, CAS, timestamps)

**Testing Strategy**:
- Mocks entire pipeline: search → parse → convert → clean → wrap
- Tests local_search and local_only modes
- Tests graceful degradation (returns key_infos on main extraction failure)

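
The graceful-degradation case can be pinned down with a test that forces the main extraction step to fail. This is a minimal sketch under assumptions: `extract_with_fallback` and both callables are hypothetical, and the real `echa_extract()` may shape its return value differently.

```python
from unittest.mock import Mock


def extract_with_fallback(extract_key_infos, extract_main):
    """Hypothetical orchestration: keep key_infos even if the main step fails."""
    key_infos = extract_key_infos()
    try:
        data = extract_main()
    except Exception:
        # Degrade gracefully: return what was already collected.
        return {"key_infos": key_infos, "data": None}
    return {"key_infos": key_infos, "data": data}


def test_returns_key_infos_when_main_extraction_fails():
    extract_key_infos = Mock(return_value=["key information"])
    extract_main = Mock(side_effect=ValueError("section page missing"))

    result = extract_with_fallback(extract_key_infos, extract_main)

    assert result == {"key_infos": ["key information"], "data": None}
    extract_main.assert_called_once_with()


def test_returns_both_parts_on_success():
    result = extract_with_fallback(lambda: ["key information"], lambda: [{"row": 1}])
    assert result == {"key_infos": ["key information"], "data": [{"row": 1}]}
```
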
## Test Architecture

```
test_echa_parser.py (Data Transformation)
↓
test_echa_service.py (API & Search)
↓
test_echa_extractor.py (Orchestration)
```

### Dependency Flow
1. **Parser** (lowest) - No dependencies on other ECHA modules
2. **Service** (middle) - Depends on Parser for some functionality
3. **Extractor** (highest) - Depends on both Service and Parser

### Mock Strategy
- **Parser**: Mocks `requests`, file I/O, `os.makedirs`
- **Service**: Mocks `requests.get` for API calls, HTML content
- **Extractor**: Mocks entire pipeline chain (search_dossier, open_echa_page, etc.)

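
As a rough illustration of the HTTP mocking described above, the sketch below replaces the network call with a `unittest.mock.Mock` that returns a canned response. The `fetch_html` helper, its injected `http_get` parameter, and the `example.invalid` URL are assumptions made for this example; the real service functions may call `requests.get` directly instead.

```python
from unittest.mock import Mock


def fetch_html(url, http_get):
    """Hypothetical helper: return the page body, or None on a non-200 status."""
    response = http_get(url, timeout=10)
    if response.status_code != 200:
        return None
    return response.text


def test_fetch_html_returns_body():
    # The fake response exposes only the attributes the helper reads.
    fake_response = Mock(status_code=200, text="<html>ok</html>")
    http_get = Mock(return_value=fake_response)

    body = fetch_html("https://example.invalid/index.html", http_get)

    assert body == "<html>ok</html>"
    http_get.assert_called_once_with("https://example.invalid/index.html", timeout=10)


def test_fetch_html_returns_none_on_error_status():
    http_get = Mock(return_value=Mock(status_code=404, text="not found"))
    assert fetch_html("https://example.invalid/missing.html", http_get) is None
```

With pytest-mock, the same fake response could instead be installed through `mocker.patch(...)` on the module under test; the exact patch target depends on how that module imports `requests`.
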
## Running the Tests

### Run All Tests
```bash
uv run pytest tests/test_echa_*.py -v
```

### Run Specific Module
```bash
uv run pytest tests/test_echa_parser.py -v
uv run pytest tests/test_echa_service.py -v
uv run pytest tests/test_echa_extractor.py -v
```

### Run Only Unit Tests (Fast)
```bash
uv run pytest tests/test_echa_*.py -v -m "not integration"
```

### Run Integration Tests (Requires Internet)
```bash
uv run pytest tests/test_echa_*.py -v -m integration
```

### Run With Coverage
```bash
uv run pytest tests/test_echa_*.py --cov=pif_compiler.services --cov-report=html
```

## Test Coverage Summary

### Functions Tested

#### echa_parser.py (5/5 = 100%)
- ✅ `open_echa_page()` - Remote & local file opening
- ✅ `echa_page_to_markdown()` - HTML to Markdown with route formatting
- ✅ `markdown_to_json_raw()` - Markdown parsing & JSON conversion
- ✅ `normalize_unicode_characters()` - Unicode normalization
- ✅ `clean_json()` - Recursive cleaning & validation

#### echa_service.py (8/8 = 100%)
- ✅ `search_dossier()` - Main entry point with local file support
- ✅ `get_substance_by_identifier()` - Substance API search
- ✅ `get_dossier_by_rml_id()` - Dossier retrieval with fallback
- ✅ `_query_dossier_api()` - Helper for API queries
- ✅ `get_section_links_from_index()` - Remote HTML fetching
- ✅ `get_section_links_from_file()` - Local HTML parsing
- ✅ `parse_sections_from_html()` - HTML content parsing
- ✅ `extract_section_links()` - Individual section extraction with validation

#### echa_extractor.py (4/4 = 100%)
- ✅ `echa_extract()` - Main extraction function
- ✅ `echa_extract_local()` - DuckDB cache queries
- ✅ `json_to_dataframe()` - JSON to DataFrame conversion
- ✅ `df_wrapper()` - Metadata addition

**Total Functions**: 17/17 tested (100%)

## Edge Cases Covered

### Parser
- Empty/malformed HTML
- Missing sections
- Unicode encoding issues (â€, superscripts)
- Comparison operators (>, <, >=, <=)
- Nested structures
- [Empty] value filtering
- "no information available" filtering

### Service
- Substance not found
- No dossiers (active or inactive)
- Nested sections (parent without direct link)
- Input type mismatches
- Network errors
- Malformed API responses
- Local vs remote file paths

### Extractor
- Substance not found
- Missing scraping type pages
- Empty sections
- Empty cleaned JSON
- Local cache hits/misses
- Key information extraction
- DataFrame filtering (null Effect levels)
- JSON vs DataFrame output modes

## Dependencies Required

### Core Dependencies (Already in project)
- pytest
- pytest-mock
- pytest-cov
- beautifulsoup4
- pandas
- requests
- markdownify
- pydantic

### Optional Dependencies (Tests skip if missing)
- `markdown_to_json` - Required for markdown→JSON conversion tests
- `duckdb` - Required for local cache tests
- Internet connection - Required for integration tests

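
One way to drive the conditional skips is to check whether an optional package can be located before any test imports it. The sketch below uses only the standard library; the `is_installed` helper and the two flag names are illustrative and are not taken from the test suite.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Flags a test module could feed into a skip condition.
HAS_MARKDOWN_TO_JSON = is_installed("markdown_to_json")
HAS_DUCKDB = is_installed("duckdb")
```

A test class could then be decorated with `@pytest.mark.skipif(not HAS_MARKDOWN_TO_JSON, reason="markdown_to_json not installed")`. Calling `pytest.importorskip("markdown_to_json")` at the top of a test is an alternative that skips and imports in one step.
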
## Test Markers

### Custom Markers (defined in conftest.py)
- `@pytest.mark.unit` - Fast tests, no external dependencies
- `@pytest.mark.integration` - Tests requiring real APIs/internet
- `@pytest.mark.slow` - Long-running tests
- `@pytest.mark.database` - Tests requiring database

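
This document only states that the markers live in conftest.py. A common way to register them is the `pytest_configure` hook, sketched below. Treat the `MARKERS` dictionary and its description strings as assumptions; the project may register its markers differently, for example in pytest.ini.

```python
# conftest.py (sketch)

# Marker name -> description shown by `pytest --markers`.
MARKERS = {
    "unit": "fast tests, no external dependencies",
    "integration": "tests requiring real APIs/internet",
    "slow": "long-running tests",
    "database": "tests requiring database",
}


def pytest_configure(config):
    """pytest hook: register each custom marker so pytest recognizes it."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

Registering markers this way lets pytest list them and, when run with `--strict-markers`, reject misspelled marker names.
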
### Usage in ECHA Tests
- Unit tests: Default (run without flags)
- Integration tests: Require `-m integration`
- Skipped tests: Auto-skip if dependencies missing

## Known Issues & Improvements Needed

### 1. Unicode Test Encoding (test_echa_parser.py)
**Issue**: Lines 372 and 380 have truncated Unicode escape sequences
**Fix**: Replace `\u00c2\u00b²` with `chr(0xc2) + chr(0xb2)`
**Priority**: Medium

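
Building the test string from code points avoids depending on how escape sequences are typed in the source file. The sketch below checks that `chr(0xC2) + chr(0xB2)` is exactly the mojibake produced when the UTF-8 bytes of a superscript two are misread as Latin-1. The constant and test names are illustrative, and the re-encoding repair is a generic technique, not necessarily what `normalize_unicode_characters()` does.

```python
# Superscript two, built from its code point instead of a literal.
SUPERSCRIPT_TWO = chr(0xB2)

# Mojibake form: the UTF-8 bytes 0xC2 0xB2 misread as two Latin-1 characters.
MOJIBAKE_SUPERSCRIPT_TWO = chr(0xC2) + chr(0xB2)


def test_chr_builds_expected_mojibake():
    # The chr() form matches what a wrong decode actually produces.
    wrongly_decoded = SUPERSCRIPT_TWO.encode("utf-8").decode("latin-1")
    assert MOJIBAKE_SUPERSCRIPT_TWO == wrongly_decoded
    assert len(MOJIBAKE_SUPERSCRIPT_TWO) == 2


def test_mojibake_round_trips_to_original():
    # Reversing the wrong decode recovers the intended character.
    repaired = MOJIBAKE_SUPERSCRIPT_TWO.encode("latin-1").decode("utf-8")
    assert repaired == SUPERSCRIPT_TWO
```
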
### 2. Missing markdown_to_json Dependency
**Issue**: Tests skip if not installed
**Fix**: Add to project dependencies or document as optional
**Priority**: Low (tests gracefully skip)

### 3. Integration Test Data
**Issue**: Real API tests may fail if ECHA structure changes
**Fix**: Add recorded fixtures for deterministic testing
**Priority**: Low

### 4. DuckDB Integration
**Issue**: test_echa_extractor local cache tests mock DuckDB
**Fix**: Create actual test database for integration testing
**Priority**: Low

## Test Statistics

| Module | Total Tests | Unit Tests | Integration Tests | Skipped (Conditional) |
|--------|-------------|------------|-------------------|-----------------------|
| echa_parser.py | 28 | 26 | 2 | 7 (markdown_to_json) |
| echa_service.py | 36 | 33 | 3 | 0 |
| echa_extractor.py | 32 | 28 | 4 | 0 |
| **TOTAL** | **96** | **87** | **9** | **7** |

## Next Steps

1. **Fix Unicode encoding** in test_echa_parser.py (lines 372, 380)
2. **Run full test suite** to verify all unit tests pass
3. **Add markdown_to_json** to dependencies if needed
4. **Run integration tests** manually to verify real API behavior
5. **Generate coverage report** to identify any untested code paths
6. **Document test patterns** for future service additions
7. **Add CI/CD integration** for automated testing

## Contributing

When adding new functions to ECHA services:

1. **Write tests first** (TDD approach)
2. **Follow existing patterns**:
   - One test class per function
   - Mock external dependencies
   - Test happy path + edge cases
   - Add integration tests for real API behavior
3. **Use appropriate markers**: `@pytest.mark.integration` for tests that need real APIs or internet access, `@pytest.mark.slow` for long-running tests
4. **Update this document** with new test coverage

## References

- Main documentation: [docs/echa_architecture.md](echa_architecture.md)
- Test patterns: [tests/test_cosing_service.py](../tests/test_cosing_service.py)
- pytest configuration: [pytest.ini](../pytest.ini)
- Test fixtures: [tests/conftest.py](../tests/conftest.py)