refactor: use API to fetch Hp index#53
Conversation
There was a problem hiding this comment.
Pull request overview
This pull request refactors the HP index data fetching mechanism from FTP-based downloads to using the GFZ JSON API. The main change replaces FTP file downloads and text parsing with direct API calls that return JSON data. However, the PR also includes unrelated enhancements to the logging system.
Changes:
- Migrated HP index data source from FTP file downloads to GFZ JSON API (https://kp.gfz.de/app/json/)
- Refactored data processing to parse JSON responses instead of text files
- Enhanced logging system with file output support and additional configuration options
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| swvo/io/hp/gfz.py | Replaced FTP download logic with API calls; updated data processing to handle JSON format |
| tests/io/hp/test_hp.py | Updated tests to mock API requests instead of FTP operations; adjusted test data format |
| swvo/logger.py | Added file logging support, changed level parameter from int to str, added line numbers to log format |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| data_total = pd.DataFrame( | ||
| {f"Hp{self.index_number}": json_data[f"Hp{self.index_number}"]}, |
There was a problem hiding this comment.
Missing error handling: If the API response doesn't contain the expected keys ('datetime' or f'Hp{self.index_number}'), this will raise a KeyError. Consider adding validation to check if the expected keys exist in json_data before accessing them, and raise a more informative error or log a warning if they're missing.
| data_total = pd.DataFrame( | |
| {f"Hp{self.index_number}": json_data[f"Hp{self.index_number}"]}, | |
| hp_key = f"Hp{self.index_number}" | |
| required_keys = ["datetime", hp_key] | |
| missing_keys = [key for key in required_keys if key not in json_data] | |
| if missing_keys: | |
| logger.warning( | |
| "Missing expected key(s) %s in JSON file %s; returning empty DataFrame", | |
| ", ".join(missing_keys), | |
| json_file, | |
| ) | |
| return data_total | |
| data_total = pd.DataFrame( | |
| {hp_key: json_data[hp_key]}, |
| {f"Hp{self.index_number}": json_data[f"Hp{self.index_number}"]}, | ||
| index=pd.to_datetime(json_data["datetime"], utc=True), | ||
| ) | ||
| data_total.index = data_total.index.tz_convert("UTC") |
There was a problem hiding this comment.
Redundant timezone conversion: Line 296 converts the timezone to 'UTC', but line 294 already creates the datetime index with utc=True, which makes it UTC-aware. The tz_convert operation is unnecessary and can be removed.
| data_total.index = data_total.index.tz_convert("UTC") |
| def setup_logging(level: str = "INFO", log_file: Optional[Path] = None, file_mode: str = "w"): | ||
| """Setup logging for the swvo package and root logger. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| level : str, optional | ||
| Logging level, by default is INFO | ||
| log_file : Path, optional | ||
| Path to log file. If None, only console logging is enabled.If provided, logs will be written to both console and file., by default None | ||
| """ | ||
| try: | ||
| level = getattr(logging, level.upper()) | ||
| except AttributeError: | ||
| raise ValueError(f"Invalid logging level: {level}. Use one of DEBUG, INFO, WARNING, ERROR, CRITICAL.") |
There was a problem hiding this comment.
The setup_logging function signature has changed from accepting a logging level constant (int) to accepting a string. This is a breaking API change. Existing code calling setup_logging(level=logging.INFO) will break. Consider maintaining backward compatibility by accepting both int and str types for the level parameter.
| def setup_logging(level: str = "INFO", log_file: Optional[Path] = None, file_mode: str = "w"): | ||
| """Setup logging for the swvo package and root logger. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| level : str, optional | ||
| Logging level, by default is INFO | ||
| log_file : Path, optional | ||
| Path to log file. If None, only console logging is enabled.If provided, logs will be written to both console and file., by default None | ||
| """ | ||
| try: | ||
| level = getattr(logging, level.upper()) | ||
| except AttributeError: | ||
| raise ValueError(f"Invalid logging level: {level}. Use one of DEBUG, INFO, WARNING, ERROR, CRITICAL.") | ||
|
|
||
| # Configure root logger so all loggers inherit the formatting | ||
| root_logger = logging.getLogger() | ||
|
|
||
| # Check if already configured | ||
| if any(isinstance(h, logging.StreamHandler) for h in root_logger.handlers): | ||
| return | ||
| handler = logging.StreamHandler(sys.stdout) | ||
| handler.setFormatter( | ||
| _ColorFormatter( | ||
| "[%(levelname)-8s] %(asctime)s - %(name)s - %(message)s", | ||
| datefmt="%Y-%m-%d %H:%M:%S", | ||
| ) | ||
| ) | ||
|
|
||
| logger.setLevel(level) | ||
| logger.addHandler(handler) | ||
|
|
||
| # Console handler with colors | ||
| log_format = "[%(levelname)-8s] %(asctime)s - %(name)s:%(lineno)d - %(message)s" | ||
| datefmt = "%Y-%m-%d %H:%M:%S" | ||
|
|
||
| console_handler = logging.StreamHandler(sys.stdout) | ||
| console_handler.setFormatter(_ColorFormatter(log_format, datefmt=datefmt)) | ||
|
|
||
| root_logger.setLevel(level) | ||
| root_logger.addHandler(console_handler) | ||
|
|
||
| if log_file: | ||
| file_handler = logging.FileHandler(log_file, mode=file_mode) | ||
| file_handler.setFormatter(logging.Formatter(log_format, datefmt=datefmt)) | ||
| root_logger.addHandler(file_handler) |
There was a problem hiding this comment.
The logger.py changes appear to be unrelated to the PR title "refactor: use API to fetch Hp index". These changes introduce new functionality (file logging support, file_mode parameter, line number in log format) that should be in a separate PR or the PR description should be updated to mention these improvements.
| data = response.json() | ||
| logger.debug(f"Downloaded data: {len(data)} records") | ||
|
|
||
| with open(local_path, "wb") as f: | ||
| ftp.retrbinary(f"RETR {filename}", f.write) | ||
| # Save the JSON response to a temporary file for processing | ||
| output_file = temporary_dir / f"hp_data_{start_time.year}.json" | ||
| with open(output_file, "w") as f: | ||
| json.dump(data, f) |
There was a problem hiding this comment.
Missing validation: The code doesn't check if the API response contains valid data before saving it. If the API returns an empty response or an error message in JSON format, this could result in invalid data being saved and processed. Consider validating that the response contains the expected keys and non-empty data arrays before saving.
| json_data = json.load(f) | ||
|
|
||
| data_total = pd.DataFrame( | ||
| {f"Hp{self.index_number}": json_data[f"Hp{self.index_number}"]}, |
There was a problem hiding this comment.
Column name mismatch: The processed data uses 'Hp30' as the column name (capitalized), but the rest of the codebase expects 'hp30' (lowercase). The _read_single_file method at line 313 uses str(self.index) which is lowercase ('hp30'), and the read method at line 215 also expects self.index in lowercase. This inconsistency will cause issues when combining DataFrames. The column name should use self.index instead of f"Hp{self.index_number}" to maintain consistency.
| {f"Hp{self.index_number}": json_data[f"Hp{self.index_number}"]}, | |
| {str(self.index): json_data[f"Hp{self.index_number}"]}, |
| mock_ftp.retrbinary.assert_called() | ||
| mock_ftp.quit.assert_called() | ||
| # Verify the API was called with requests | ||
| assert mocker.patch("swvo.io.hp.gfz.requests.get").called or True |
There was a problem hiding this comment.
This assertion will always evaluate to True due to the 'or True' clause, making the test meaningless. Either verify the mock was called properly by checking mock_get.called (without the 'or True'), or remove this assertion if the URL verification test at line 167 is sufficient.
| level : str, optional | ||
| Logging level, by default is INFO | ||
| log_file : Path, optional | ||
| Path to log file. If None, only console logging is enabled.If provided, logs will be written to both console and file., by default None |
There was a problem hiding this comment.
Missing documentation for the 'file_mode' parameter. The parameter is declared in the function signature but not documented in the docstring.
| Path to log file. If None, only console logging is enabled.If provided, logs will be written to both console and file., by default None | |
| Path to log file. If None, only console logging is enabled. If provided, logs will be written to both console and file., by default None | |
| file_mode : str, optional | |
| File open mode used for the log file handler (passed to :class:`logging.FileHandler`), by default "w". |
| level : str, optional | ||
| Logging level, by default is INFO | ||
| log_file : Path, optional | ||
| Path to log file. If None, only console logging is enabled.If provided, logs will be written to both console and file., by default None |
There was a problem hiding this comment.
Spelling error: Missing space in the docstring. Should be "If None, only console logging is enabled. If provided, logs will be written to both console and file."
| Path to log file. If None, only console logging is enabled.If provided, logs will be written to both console and file., by default None | |
| Path to log file. If None, only console logging is enabled. If provided, logs will be written to both console and file., by default None |
| response.raise_for_status() | ||
|
|
||
| data = response.json() | ||
| logger.debug(f"Downloaded data: {len(data)} records") |
There was a problem hiding this comment.
Potential data loss: The debug log message attempts to get the length of 'data' which is a dictionary returned from response.json(). This will only count the number of keys (likely 2: 'datetime' and 'Hp30'), not the number of records. Consider using len(data.get('datetime', [])) or len(data.get(f'Hp{self.index_number}', [])) to log the actual number of data records.
| logger.debug(f"Downloaded data: {len(data)} records") | |
| records_count = len(data.get("datetime", [])) | |
| logger.debug(f"Downloaded data: {records_count} records") |
No description provided.