ECHA Services Test Suite Summary
Overview
Comprehensive test suites have been created for all three ECHA service modules, following a bottom-up approach from lowest to highest level dependencies.
Test Files Created
1. test_echa_parser.py (Lowest Level)
Location: tests/test_echa_parser.py
Coverage: Tests for HTML/Markdown/JSON processing functions
Test Classes:
- TestOpenEchaPage - HTML page opening (remote & local)
- TestEchaPageToMarkdown - HTML to Markdown conversion
- TestMarkdownToJsonRaw - Markdown to JSON conversion (skipped if markdown_to_json not installed)
- TestNormalizeUnicodeCharacters - Unicode normalization
- TestCleanJson - JSON cleaning and validation
- TestIntegrationParser - Full pipeline integration tests
Total Tests: 28 tests
- 20 tests for core parser functions
- 5 tests for markdown_to_json (conditional)
- 2 integration tests
- 1 test with known Unicode encoding issue (needs fix)
Key Features:
- Mocks external dependencies (requests, file I/O)
- Tests Unicode handling edge cases
- Validates data cleaning logic
- Tests comparison operator conversions (>, <, >=, <=)
Known Issues:
- Unicode literal encoding in test strings (lines 372, 380) - use chr() instead of \uXXXX
- Missing markdown_to_json dependency (tests skip gracefully)
2. test_echa_service.py (Middle Level)
Location: tests/test_echa_service.py
Coverage: Tests for ECHA API interaction and search functions
Test Classes:
- TestGetSubstanceByIdentifier - Substance API search
- TestGetDossierByRmlId - Dossier retrieval with Active/Inactive fallback
- TestExtractSectionLinks - Section link extraction with validation
- TestParseSectionsFromHtml - HTML parsing for multiple sections
- TestGetSectionLinksFromIndex - Remote index.html fetching
- TestGetSectionLinksFromFile - Local file parsing
- TestSearchDossier - Main search workflow
- TestIntegrationEchaService - Real API integration tests (marked @pytest.mark.integration)
Total Tests: 36 tests
- 33 unit tests with mocked APIs
- 3 integration tests (require internet, marked for manual execution)
Key Features:
- Comprehensive API mocking
- Tests nested section bug fix (parent vs child section links)
- Tests URL encoding, error handling, fallback logic
- Tests local vs remote workflows
- Integration tests for real formaldehyde data
Testing Approach:
- Unit tests run by default (fast, no external deps)
- Integration tests require the -m integration flag
3. test_echa_extractor.py (Highest Level)
Location: tests/test_echa_extractor.py
Coverage: Tests for high-level extraction orchestration
Test Classes:
- TestSchemas - Data schema validation
- TestJsonToDataframe - JSON to pandas DataFrame conversion
- TestDfWrapper - DataFrame metadata addition
- TestEchaExtractLocal - DuckDB cache querying
- TestEchaExtract - Main extraction workflow
- TestIntegrationEchaExtractor - Real data integration tests (marked @pytest.mark.integration)
Total Tests: 32 tests
- 28 unit tests with full mocking
- 4 integration tests (require internet)
Key Features:
- Tests both RepeatedDose and AcuteToxicity schemas
- Tests local cache (DuckDB) integration
- Tests key information extraction
- Tests error handling at each pipeline stage
- Tests DataFrame vs JSON output modes
- Validates metadata addition (substance, CAS, timestamps)
Testing Strategy:
- Mocks entire pipeline: search → parse → convert → clean → wrap
- Tests local_search and local_only modes
- Tests graceful degradation (returns key_infos on main extraction failure)
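A graceful-degradation test of this kind could look roughly like the sketch below. Everything in it is an assumption made for illustration: extract_with_fallback and its injected stage functions are hypothetical stand-ins, not the real echa_extract() signature, and returning None for an unknown substance is likewise assumed.

```python
from unittest.mock import Mock


def extract_with_fallback(substance, search, parse_key_infos, parse_main):
    """Hypothetical orchestration: return main results, or key_infos if that stage fails."""
    dossier = search(substance)
    if dossier is None:
        return None  # assumed behaviour when the substance is not found
    key_infos = parse_key_infos(dossier)
    try:
        return parse_main(dossier)
    except Exception:
        # Graceful degradation: keep what was already extracted.
        return key_infos


def test_returns_key_infos_when_main_extraction_fails():
    search = Mock(return_value={"rml_id": "hypothetical-id"})
    parse_key_infos = Mock(return_value={"key": "info"})
    parse_main = Mock(side_effect=ValueError("section missing"))

    result = extract_with_fallback("formaldehyde", search, parse_key_infos, parse_main)

    assert result == {"key": "info"}
    search.assert_called_once_with("formaldehyde")
    parse_main.assert_called_once_with({"rml_id": "hypothetical-id"})


def test_returns_none_when_substance_not_found():
    search = Mock(return_value=None)
    parse_key_infos = Mock()
    parse_main = Mock()

    assert extract_with_fallback("unknown", search, parse_key_infos, parse_main) is None
    # Later stages must not run when the search finds nothing.
    parse_key_infos.assert_not_called()
    parse_main.assert_not_called()
```

Injecting the stages as arguments keeps the sketch self-contained; tests against a real module would more likely patch the imported names instead.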
Test Architecture
test_echa_parser.py (Data Transformation)
↓
test_echa_service.py (API & Search)
↓
test_echa_extractor.py (Orchestration)
Dependency Flow
- Parser (lowest) - No dependencies on other ECHA modules
- Service (middle) - Depends on Parser for some functionality
- Extractor (highest) - Depends on both Service and Parser
Mock Strategy
- Parser: Mocks requests, file I/O, os.makedirs
- Service: Mocks requests.get for API calls, HTML content
- Extractor: Mocks entire pipeline chain (search_dossier, open_echa_page, etc.)
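As a rough illustration of the service-level strategy, the sketch below replaces the HTTP call with a Mock. The helper fetch_index_html and the example.invalid URL are hypothetical; the http_get parameter stands in for requests.get so the example runs without network access or third-party packages.

```python
from unittest.mock import Mock


def fetch_index_html(url, http_get):
    """Hypothetical helper; http_get plays the role of requests.get."""
    response = http_get(url, timeout=10)
    response.raise_for_status()
    return response.text


def test_fetch_index_html_returns_body():
    fake_response = Mock()
    fake_response.text = "<html><a href='section-7'>Toxicological information</a></html>"
    fake_get = Mock(return_value=fake_response)

    html = fetch_index_html("https://example.invalid/index.html", fake_get)

    assert "section-7" in html
    fake_get.assert_called_once_with("https://example.invalid/index.html", timeout=10)
    fake_response.raise_for_status.assert_called_once()


def test_fetch_index_html_propagates_network_errors():
    # side_effect makes the mock raise instead of returning a response.
    fake_get = Mock(side_effect=ConnectionError("network down"))
    try:
        fetch_index_html("https://example.invalid/index.html", fake_get)
    except ConnectionError:
        return
    raise AssertionError("expected ConnectionError")
```

When the code under test calls requests.get directly, unittest.mock.patch (or the mocker fixture from pytest-mock) can swap it out in the same way.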
Running the Tests
Run All Tests
uv run pytest tests/test_echa_*.py -v
Run Specific Module
uv run pytest tests/test_echa_parser.py -v
uv run pytest tests/test_echa_service.py -v
uv run pytest tests/test_echa_extractor.py -v
Run Only Unit Tests (Fast)
uv run pytest tests/test_echa_*.py -v -m "not integration"
Run Integration Tests (Requires Internet)
uv run pytest tests/test_echa_*.py -v -m integration
Run With Coverage
uv run pytest tests/test_echa_*.py --cov=pif_compiler.services --cov-report=html
Test Coverage Summary
Functions Tested
echa_parser.py (5/5 = 100%)
- ✅ open_echa_page() - Remote & local file opening
- ✅ echa_page_to_markdown() - HTML to Markdown with route formatting
- ✅ markdown_to_json_raw() - Markdown parsing & JSON conversion
- ✅ normalize_unicode_characters() - Unicode normalization
- ✅ clean_json() - Recursive cleaning & validation
echa_service.py (8/8 = 100%)
- ✅ search_dossier() - Main entry point with local file support
- ✅ get_substance_by_identifier() - Substance API search
- ✅ get_dossier_by_rml_id() - Dossier retrieval with fallback
- ✅ _query_dossier_api() - Helper for API queries
- ✅ get_section_links_from_index() - Remote HTML fetching
- ✅ get_section_links_from_file() - Local HTML parsing
- ✅ parse_sections_from_html() - HTML content parsing
- ✅ extract_section_links() - Individual section extraction with validation
echa_extractor.py (4/4 = 100%)
- ✅ echa_extract() - Main extraction function
- ✅ echa_extract_local() - DuckDB cache queries
- ✅ json_to_dataframe() - JSON to DataFrame conversion
- ✅ df_wrapper() - Metadata addition
Total Functions: 17/17 tested (100%)
Edge Cases Covered
Parser
- Empty/malformed HTML
- Missing sections
- Unicode encoding issues (â€, superscripts)
- Comparison operators (>, <, >=, <=)
- Nested structures
- [Empty] value filtering
- "no information available" filtering
Service
- Substance not found
- No dossiers (active or inactive)
- Nested sections (parent without direct link)
- Input type mismatches
- Network errors
- Malformed API responses
- Local vs remote file paths
Extractor
- Substance not found
- Missing scraping type pages
- Empty sections
- Empty cleaned JSON
- Local cache hits/misses
- Key information extraction
- DataFrame filtering (null Effect levels)
- JSON vs DataFrame output modes
Dependencies Required
Core Dependencies (Already in project)
- pytest
- pytest-mock
- pytest-cov
- beautifulsoup4
- pandas
- requests
- markdownify
- pydantic
Optional Dependencies (Tests skip if missing)
- markdown_to_json - Required for markdown→JSON conversion tests
- duckdb - Required for local cache tests
- Internet connection - Required for integration tests
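One way to check up front which optional packages are absent is sketched below, using only the standard library. The helper name missing_modules is hypothetical and is not part of the test suite described here.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Optional dependencies named in this document.
OPTIONAL = ["markdown_to_json", "duckdb"]

if __name__ == "__main__":
    for name in missing_modules(OPTIONAL):
        print(f"{name} is not installed; the tests that need it would be skipped")
```

Inside a test module, pytest.importorskip("markdown_to_json") gives the skip behaviour directly: it imports the module if available and skips the test otherwise.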
Test Markers
Custom Markers (defined in conftest.py)
- @pytest.mark.unit - Fast tests, no external dependencies
- @pytest.mark.integration - Tests requiring real APIs/internet
- @pytest.mark.slow - Long-running tests
- @pytest.mark.database - Tests requiring database
Usage in ECHA Tests
- Unit tests: Default (run without flags)
- Integration tests: Require -m integration
- Skipped tests: Auto-skip if dependencies missing
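For reference, markers like these are commonly registered through the pytest_configure hook. The sketch below is an assumption about how conftest.py might do it, not a copy of the project's file; the marker descriptions are taken from the list above.

```python
# conftest.py (sketch, assumed layout)
MARKERS = (
    "unit: fast tests, no external dependencies",
    "integration: tests requiring real APIs/internet",
    "slow: long-running tests",
    "database: tests requiring database",
)


def pytest_configure(config):
    """pytest hook: register custom markers so -m selections work without warnings."""
    for line in MARKERS:
        config.addinivalue_line("markers", line)
```

A test is then tagged with @pytest.mark.integration and selected with -m integration, or excluded with -m "not integration".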
Known Issues & Improvements Needed
1. Unicode Test Encoding (test_echa_parser.py)
Issue: Lines 372 and 380 have truncated Unicode escape sequences
Fix: Replace \u00c2\u00b² with chr(0xc2) + chr(0xb2)
Priority: Medium
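To illustrate the chr() approach, a minimal sketch: the two characters involved are what the superscript two (U+00B2) becomes when its UTF-8 bytes are misread as Latin-1. The helper repair_mojibake is hypothetical and is not claimed to match what normalize_unicode_characters() does.

```python
# Build the mis-decoded "superscript two" without any \uXXXX escapes in the source.
MOJIBAKE_SUPERSCRIPT_TWO = chr(0xC2) + chr(0xB2)


def repair_mojibake(text: str) -> str:
    """Hypothetical helper: undo a UTF-8-read-as-Latin-1 round trip.

    Returns the input unchanged when it is not that kind of mojibake.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# The same two characters arise from misreading the UTF-8 bytes of U+00B2.
assert chr(0xB2).encode("utf-8").decode("latin-1") == MOJIBAKE_SUPERSCRIPT_TWO
assert repair_mojibake(MOJIBAKE_SUPERSCRIPT_TWO) == chr(0xB2)
```

Because chr() takes integer code points, the test string no longer depends on how the source file's escape sequences are written or saved.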
2. Missing markdown_to_json Dependency
Issue: Tests skip if not installed
Fix: Add to project dependencies or document as optional
Priority: Low (tests gracefully skip)
3. Integration Test Data
Issue: Real API tests may fail if ECHA structure changes
Fix: Add recorded fixtures for deterministic testing
Priority: Low
4. DuckDB Integration
Issue: test_echa_extractor local cache tests mock DuckDB
Fix: Create actual test database for integration testing
Priority: Low
Test Statistics
| Module | Total Tests | Unit Tests | Integration Tests | Skipped (Conditional) |
|---|---|---|---|---|
| echa_parser.py | 28 | 26 | 2 | 7 (markdown_to_json) |
| echa_service.py | 36 | 33 | 3 | 0 |
| echa_extractor.py | 32 | 28 | 4 | 0 |
| TOTAL | 96 | 87 | 9 | 7 |
Next Steps
- Fix Unicode encoding in test_echa_parser.py (lines 372, 380)
- Run full test suite to verify all unit tests pass
- Add markdown_to_json to dependencies if needed
- Run integration tests manually to verify real API behavior
- Generate coverage report to identify any untested code paths
- Document test patterns for future service additions
- Add CI/CD integration for automated testing
Contributing
When adding new functions to ECHA services:
- Write tests first (TDD approach)
- Follow existing patterns:
- One test class per function
- Mock external dependencies
- Test happy path + edge cases
- Add integration tests for real API behavior
- Use appropriate markers: @pytest.mark.integration for slow tests
- Update this document with new test coverage
References
- Main documentation: docs/echa_architecture.md
- Test patterns: tests/test_cosing_service.py
- pytest configuration: pytest.ini
- Test fixtures: tests/conftest.py