# ECHA Services Test Suite Summary
## Overview
Test suites cover all three ECHA service modules. They are organized bottom-up, from the module with no internal dependencies (parser) to the module that depends on the other two (extractor).
## Test Files Created
### 1. test_echa_parser.py (Lowest Level)
**Location**: `tests/test_echa_parser.py`
**Coverage**: Tests for HTML/Markdown/JSON processing functions
**Test Classes**:
- `TestOpenEchaPage` - HTML page opening (remote & local)
- `TestEchaPageToMarkdown` - HTML to Markdown conversion
- `TestMarkdownToJsonRaw` - Markdown to JSON conversion (skipped if `markdown_to_json` is not installed)
- `TestNormalizeUnicodeCharacters` - Unicode normalization
- `TestCleanJson` - JSON cleaning and validation
- `TestIntegrationParser` - Full pipeline integration tests
**Total Tests**: 28 tests
- 20 tests for core parser functions
- 5 tests for markdown_to_json (conditional)
- 2 integration tests
- 1 test with known Unicode encoding issue (needs fix)
**Key Features**:
- Mocks external dependencies (requests, file I/O)
- Tests Unicode handling edge cases
- Validates data cleaning logic
- Tests comparison operator conversions (>, <, >=, <=)
**Known Issues**:
- Unicode literal encoding in test strings (line 372, 380) - use `chr()` instead of `\uXXXX`
- Missing `markdown_to_json` dependency (tests skip gracefully)
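
The graceful skip for a missing optional package can be sketched as follows. This is a minimal illustration that uses only the standard library's `unittest` so it runs on its own; the test method name is hypothetical, and the actual suite may use a pytest mechanism instead.

```python
import importlib.util
import unittest

# True only when the optional package can be imported in this environment.
HAS_MARKDOWN_TO_JSON = importlib.util.find_spec("markdown_to_json") is not None


class TestMarkdownToJsonRaw(unittest.TestCase):
    @unittest.skipUnless(HAS_MARKDOWN_TO_JSON, "markdown_to_json not installed")
    def test_optional_dependency_is_importable(self):
        # Hypothetical test: imported lazily so collection never fails.
        import markdown_to_json

        self.assertIsNotNone(markdown_to_json)


# Run the class in-process: a missing package produces a skip, not an error.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMarkdownToJsonRaw)
result = unittest.TestResult()
suite.run(result)
```

In a pytest-only module, calling `pytest.importorskip("markdown_to_json")` at the top of the file has a similar effect.
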
### 2. test_echa_service.py (Middle Level)
**Location**: `tests/test_echa_service.py`
**Coverage**: Tests for ECHA API interaction and search functions
**Test Classes**:
- `TestGetSubstanceByIdentifier` - Substance API search
- `TestGetDossierByRmlId` - Dossier retrieval with Active/Inactive fallback
- `TestExtractSectionLinks` - Section link extraction with validation
- `TestParseSectionsFromHtml` - HTML parsing for multiple sections
- `TestGetSectionLinksFromIndex` - Remote index.html fetching
- `TestGetSectionLinksFromFile` - Local file parsing
- `TestSearchDossier` - Main search workflow
- `TestIntegrationEchaService` - Real API integration tests (marked `@pytest.mark.integration`)
**Total Tests**: 36 tests
- 33 unit tests with mocked APIs
- 3 integration tests (require internet, marked for manual execution)
**Key Features**:
- Comprehensive API mocking
- Tests nested section bug fix (parent vs child section links)
- Tests URL encoding, error handling, fallback logic
- Tests local vs remote workflows
- Integration tests for real formaldehyde data
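
The Active/Inactive fallback and the way a unit test can exercise it might look like the sketch below. The function name, the `query` parameter, and the `rmlId`/`status` keys are assumptions made for illustration; the real signature of `get_dossier_by_rml_id()` is not shown in this document.

```python
from typing import Callable, Optional


def get_dossier_with_fallback(
    rml_id: str,
    query: Callable[[str, str], Optional[dict]],
) -> Optional[dict]:
    """Hypothetical fallback: try Active dossiers first, then Inactive ones."""
    for status in ("Active", "Inactive"):
        dossier = query(rml_id, status)
        if dossier:
            # Record which registration status produced the hit.
            return {**dossier, "status": status}
    return None  # no dossier under either status


# A fake query function stands in for the API helper and records each call.
calls = []


def fake_query(rml_id: str, status: str) -> Optional[dict]:
    calls.append(status)
    return {"rmlId": rml_id} if status == "Inactive" else None


found = get_dossier_with_fallback("hypothetical-id", fake_query)
missing = get_dossier_with_fallback("hypothetical-id", lambda rml_id, status: None)
```

Injecting the query function keeps the fallback logic testable without patching any HTTP library.
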
**Testing Approach**:
- Unit tests run by default (fast, no external deps)
- Integration tests require `-m integration` flag
### 3. test_echa_extractor.py (Highest Level)
**Location**: `tests/test_echa_extractor.py`
**Coverage**: Tests for high-level extraction orchestration
**Test Classes**:
- `TestSchemas` - Data schema validation
- `TestJsonToDataframe` - JSON to pandas DataFrame conversion
- `TestDfWrapper` - DataFrame metadata addition
- `TestEchaExtractLocal` - DuckDB cache querying
- `TestEchaExtract` - Main extraction workflow
- `TestIntegrationEchaExtractor` - Real data integration tests (marked `@pytest.mark.integration`)
**Total Tests**: 32 tests
- 28 unit tests with full mocking
- 4 integration tests (require internet)
**Key Features**:
- Tests both RepeatedDose and AcuteToxicity schemas
- Tests local cache (DuckDB) integration
- Tests key information extraction
- Tests error handling at each pipeline stage
- Tests DataFrame vs JSON output modes
- Validates metadata addition (substance, CAS, timestamps)
**Testing Strategy**:
- Mocks entire pipeline: search → parse → convert → clean → wrap
- Tests local_search and local_only modes
- Tests graceful degradation (returns key_infos on main extraction failure)
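
Graceful degradation can be tested by making one mocked stage raise. The sketch below is an assumption about the general shape only: the `extract` function, its parameters, and the `ValueError` are hypothetical stand-ins, not the actual `echa_extract()` code.

```python
from unittest import mock


def extract(substance, search, parse_key_infos, parse_main):
    """Hypothetical orchestrator; every parameter name here is illustrative."""
    dossier = search(substance)
    if dossier is None:
        return None  # substance not found
    key_infos = parse_key_infos(dossier)
    try:
        return parse_main(dossier)
    except ValueError:
        # Graceful degradation: fall back to the key information already parsed.
        return key_infos


# Each stage is a mock, so no network or file access happens.
search = mock.Mock(return_value={"id": "hypothetical-dossier"})
parse_key_infos = mock.Mock(return_value=["hypothetical key information"])
parse_main = mock.Mock(side_effect=ValueError("main section could not be parsed"))

result = extract("formaldehyde", search, parse_key_infos, parse_main)
```

Using `side_effect` to raise from a single stage is what allows error handling to be checked stage by stage.
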
## Test Architecture
```
test_echa_parser.py (Data Transformation)
test_echa_service.py (API & Search)
test_echa_extractor.py (Orchestration)
```
### Dependency Flow
1. **Parser** (lowest) - No dependencies on other ECHA modules
2. **Service** (middle) - Depends on Parser for some functionality
3. **Extractor** (highest) - Depends on both Service and Parser
### Mock Strategy
- **Parser**: Mocks `requests`, file I/O, `os.makedirs`
- **Service**: Mocks `requests.get` for API calls, HTML content
- **Extractor**: Mocks entire pipeline chain (search_dossier, open_echa_page, etc.)
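
As a rough illustration of the HTTP mocking pattern, the sketch below patches the network call and returns canned HTML. It uses the standard library's `urllib` so that it runs without third-party packages; the real tests mock `requests`, and `fetch_page` and the URL are hypothetical.

```python
import urllib.request
from unittest import mock


def fetch_page(url: str) -> str:
    """Hypothetical stand-in for a page-opening helper."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def run_mocked_fetch() -> str:
    # Build a fake response object that also works as a context manager.
    fake_response = mock.MagicMock()
    fake_response.read.return_value = b"<html><h1>Repeated dose toxicity</h1></html>"
    fake_response.__enter__.return_value = fake_response

    url = "https://example.invalid/index.html"
    with mock.patch("urllib.request.urlopen", return_value=fake_response) as patched:
        html = fetch_page(url)
        # The fake was called exactly once, with the URL we passed in.
        patched.assert_called_once_with(url)
    return html
```

With `requests`, the same idea applies, but the patch target should be the name as it is looked up in the module under test.
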
## Running the Tests
### Run All Tests
```bash
uv run pytest tests/test_echa_*.py -v
```
### Run Specific Module
```bash
uv run pytest tests/test_echa_parser.py -v
uv run pytest tests/test_echa_service.py -v
uv run pytest tests/test_echa_extractor.py -v
```
### Run Only Unit Tests (Fast)
```bash
uv run pytest tests/test_echa_*.py -v -m "not integration"
```
### Run Integration Tests (Requires Internet)
```bash
uv run pytest tests/test_echa_*.py -v -m integration
```
### Run With Coverage
```bash
uv run pytest tests/test_echa_*.py --cov=pif_compiler.services --cov-report=html
```
## Test Coverage Summary
### Functions Tested
#### echa_parser.py (5/5 = 100%)
- `open_echa_page()` - Remote & local file opening
- `echa_page_to_markdown()` - HTML to Markdown with route formatting
- `markdown_to_json_raw()` - Markdown parsing & JSON conversion
- `normalize_unicode_characters()` - Unicode normalization
- `clean_json()` - Recursive cleaning & validation
#### echa_service.py (8/8 = 100%)
- `search_dossier()` - Main entry point with local file support
- `get_substance_by_identifier()` - Substance API search
- `get_dossier_by_rml_id()` - Dossier retrieval with fallback
- `_query_dossier_api()` - Helper for API queries
- `get_section_links_from_index()` - Remote HTML fetching
- `get_section_links_from_file()` - Local HTML parsing
- `parse_sections_from_html()` - HTML content parsing
- `extract_section_links()` - Individual section extraction with validation
#### echa_extractor.py (4/4 = 100%)
- `echa_extract()` - Main extraction function
- `echa_extract_local()` - DuckDB cache queries
- `json_to_dataframe()` - JSON to DataFrame conversion
- `df_wrapper()` - Metadata addition
**Total Functions**: 17/17 tested (100%)
## Edge Cases Covered
### Parser
- Empty/malformed HTML
- Missing sections
- Unicode encoding issues (mojibake such as `â€`, superscripts)
- Comparison operators (>, <, >=, <=)
- Nested structures
- [Empty] value filtering
- "no information available" filtering
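
The placeholder filtering listed above can be pictured as a recursive pass over the parsed JSON. This is a minimal sketch under assumptions: only the two placeholder strings come from this document, while the function name, the sample keys, and the rule of dropping emptied containers are illustrative and may differ from `clean_json()`.

```python
def clean(value):
    """Recursively drop placeholder strings and the containers they leave empty."""
    if isinstance(value, dict):
        cleaned = {key: clean(item) for key, item in value.items()}
        return {k: v for k, v in cleaned.items() if v not in (None, {}, [])}
    if isinstance(value, list):
        cleaned = [clean(item) for item in value]
        return [v for v in cleaned if v not in (None, {}, [])]
    if isinstance(value, str):
        text = value.strip()
        if text == "[Empty]" or text.lower() == "no information available":
            return None  # marked for removal by the enclosing container
        return text
    return value


# Hypothetical record with both placeholders at different nesting levels.
record = {
    "Species": "rat",
    "Effect level": "[Empty]",
    "Remarks": {"note": "No information available"},
    "Doses": ["10 mg/kg", "[Empty]"],
}
cleaned_record = clean(record)
```
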
### Service
- Substance not found
- No dossiers (active or inactive)
- Nested sections (parent without direct link)
- Input type mismatches
- Network errors
- Malformed API responses
- Local vs remote file paths
### Extractor
- Substance not found
- Missing scraping type pages
- Empty sections
- Empty cleaned JSON
- Local cache hits/misses
- Key information extraction
- DataFrame filtering (null Effect levels)
- JSON vs DataFrame output modes
## Dependencies Required
### Core Dependencies (Already in project)
- pytest
- pytest-mock
- pytest-cov
- beautifulsoup4
- pandas
- requests
- markdownify
- pydantic
### Optional Dependencies (Tests skip if missing)
- `markdown_to_json` - Required for markdown→JSON conversion tests
- `duckdb` - Required for local cache tests
- Internet connection - Required for integration tests
## Test Markers
### Custom Markers (defined in conftest.py)
- `@pytest.mark.unit` - Fast tests, no external dependencies
- `@pytest.mark.integration` - Tests requiring real APIs/internet
- `@pytest.mark.slow` - Long-running tests
- `@pytest.mark.database` - Tests requiring database
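
Markers like these are commonly registered through the `pytest_configure` hook. The contents of the project's `conftest.py` are not shown in this document, so the sketch below is an assumption about how the four markers could be declared; the description strings are paraphrased from the list above.

```python
# conftest.py (sketch)

# Marker names come from this document; the descriptions are paraphrased.
MARKERS = {
    "unit": "fast tests with no external dependencies",
    "integration": "tests requiring real APIs or internet access",
    "slow": "long-running tests",
    "database": "tests requiring a database",
}


def pytest_configure(config):
    """Register custom markers so pytest does not warn about unknown marks."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

Once registered, `-m integration` and `-m "not integration"` select or exclude the marked tests.
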
### Usage in ECHA Tests
- Unit tests: Default (run without flags)
- Integration tests: Require `-m integration`
- Skipped tests: Auto-skip if dependencies missing
## Known Issues & Improvements Needed
### 1. Unicode Test Encoding (test_echa_parser.py)
**Issue**: Lines 372 and 380 have truncated Unicode escape sequences
**Fix**: Replace `\u00c2\u00b²` with `chr(0xc2) + chr(0xb2)`
**Priority**: Medium
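
To make the proposed fix concrete, the snippet below shows where the two characters come from and that building them with `chr()` gives the same string as well-formed escapes. The variable names are illustrative.

```python
# The superscript two character, written with a complete escape sequence.
superscript_two = "\u00b2"

# Its UTF-8 bytes (0xC2 0xB2) misread as Latin-1 produce two-character mojibake.
mojibake = superscript_two.encode("utf-8").decode("latin-1")

# Building the test input with chr() avoids hand-typed escapes that can be truncated.
expected_input = chr(0xC2) + chr(0xB2)

# Reversing the misread recovers the original character.
repaired = mojibake.encode("latin-1").decode("utf-8")
```
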
### 2. Missing markdown_to_json Dependency
**Issue**: Tests skip if not installed
**Fix**: Add to project dependencies or document as optional
**Priority**: Low (tests gracefully skip)
### 3. Integration Test Data
**Issue**: Real API tests may fail if ECHA structure changes
**Fix**: Add recorded fixtures for deterministic testing
**Priority**: Low
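
One way to get recorded fixtures is a small record-then-replay helper: the first run stores the real response as JSON, and later runs read the file instead of calling the API. The helper name, file layout, and payload below are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_or_record(path: Path, fetch: Callable[[], dict]) -> dict:
    """Return the recorded payload if present; otherwise fetch and record it."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    payload = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return payload


# Demonstration with a fake fetcher that counts how often it is called.
fetch_calls = []


def fake_fetch() -> dict:
    fetch_calls.append(1)
    return {"name": "formaldehyde"}


with tempfile.TemporaryDirectory() as tmp:
    fixture = Path(tmp) / "fixtures" / "substance.json"
    first = load_or_record(fixture, fake_fetch)   # records the payload
    second = load_or_record(fixture, fake_fetch)  # replays from disk
```

Recorded files would need to be refreshed deliberately when the ECHA structure changes, which makes such changes visible in review.
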
### 4. DuckDB Integration
**Issue**: The local cache tests in `test_echa_extractor.py` mock DuckDB, so no real queries are exercised
**Fix**: Create actual test database for integration testing
**Priority**: Low
## Test Statistics
| Module | Total Tests | Unit Tests | Integration Tests | Skipped (Conditional) |
|--------|-------------|------------|-------------------|-----------------------|
| echa_parser.py | 28 | 26 | 2 | 7 (markdown_to_json) |
| echa_service.py | 36 | 33 | 3 | 0 |
| echa_extractor.py | 32 | 28 | 4 | 0 |
| **TOTAL** | **96** | **87** | **9** | **7** |
## Next Steps
1. **Fix Unicode encoding** in test_echa_parser.py (lines 372, 380)
2. **Run full test suite** to verify all unit tests pass
3. **Add markdown_to_json** to dependencies if needed
4. **Run integration tests** manually to verify real API behavior
5. **Generate coverage report** to identify any untested code paths
6. **Document test patterns** for future service additions
7. **Add CI/CD integration** for automated testing
## Contributing
When adding new functions to ECHA services:
1. **Write tests first** (TDD approach)
2. **Follow existing patterns**:
   - One test class per function
   - Mock external dependencies
   - Test happy path + edge cases
   - Add integration tests for real API behavior
3. **Use appropriate markers**: `@pytest.mark.integration` for tests that call real APIs, `@pytest.mark.slow` for long-running tests
4. **Update this document** with new test coverage
## References
- Main documentation: [docs/echa_architecture.md](echa_architecture.md)
- Test patterns: [tests/test_cosing_service.py](../tests/test_cosing_service.py)
- pytest configuration: [pytest.ini](../pytest.ini)
- Test fixtures: [tests/conftest.py](../tests/conftest.py)