Contributing Artifacts
The success of this project hinges on the availability of a rich corpus of validation artifacts.
Contributions are strongly encouraged and highly appreciated.
To contribute new artifacts to the artifact-core project:
- Add a new value to the appropriate existing Enum (e.g., in artifact_core/table_comparison/registries/scores/types.py)
- Create and register your hyperparameters class (inheriting from ArtifactHyperparams)
- Add the default configuration values to the appropriate config file (e.g., artifact_core/table_comparison/config/raw.json)
- Create and register your artifact class (inheriting from Artifact, with the appropriate generics matching the engine of interest)
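The four steps above can be sketched end to end with a toy registry. This is a stand-alone illustration of the enum, decorator and config pattern, not artifact-core's actual implementation: `ToyScoreType`, `ToyRegistry`, `ToyScore`, `ToyScoreHyperparams`, `TOY_CONFIG` and the `build` method are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict


class ToyScoreType(Enum):
    # Step 1: add a new enum value (hypothetical stand-in for the score-type enum)
    TOY_SCORE = "toy_score"


class ToyRegistry:
    """Hypothetical registry mapping enum members to hyperparams and artifact classes."""

    _hyperparams: Dict[ToyScoreType, type] = {}
    _artifacts: Dict[ToyScoreType, type] = {}

    @classmethod
    def register_artifact_hyperparams(cls, score_type: ToyScoreType) -> Callable[[type], type]:
        def decorator(hyperparams_cls: type) -> type:
            cls._hyperparams[score_type] = hyperparams_cls
            return hyperparams_cls

        return decorator

    @classmethod
    def register_artifact(cls, score_type: ToyScoreType) -> Callable[[type], type]:
        def decorator(artifact_cls: type) -> type:
            cls._artifacts[score_type] = artifact_cls
            return artifact_cls

        return decorator

    @classmethod
    def build(cls, score_type: ToyScoreType, config: Dict[str, Dict[str, Any]]) -> Any:
        # Look up the config section by the enum member's string value
        hyperparams = cls._hyperparams[score_type](**config[score_type.value])
        return cls._artifacts[score_type](hyperparams)


# Step 2: create and register the hyperparameters class
@ToyRegistry.register_artifact_hyperparams(ToyScoreType.TOY_SCORE)
@dataclass(frozen=True)
class ToyScoreHyperparams:
    threshold: float
    use_weights: bool


# Step 3: default configuration values, keyed by the enum value
TOY_CONFIG = {"toy_score": {"threshold": 0.5, "use_weights": True}}


# Step 4: create and register the artifact class
@ToyRegistry.register_artifact(ToyScoreType.TOY_SCORE)
class ToyScore:
    def __init__(self, hyperparams: ToyScoreHyperparams) -> None:
        self._hyperparams = hyperparams

    def compute(self, value: float) -> float:
        # Double the value when it exceeds the threshold and weighting is enabled
        if value > self._hyperparams.threshold and self._hyperparams.use_weights:
            return 2 * value
        return value


toy_score = ToyRegistry.build(ToyScoreType.TOY_SCORE, TOY_CONFIG)
```

The point of the pattern is that the enum member is the single key tying together the hyperparameters class, the config section and the artifact class, so all three must be registered under the same member.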
Example: Contributing a New Score Artifact to the TableComparisonEngine
First, add your new score type to the existing enum in: artifact_core/table_comparison/registries/scores/types.py.
class TableComparisonScoreType(ArtifactType):
    MEAN_JS_DISTANCE = "mean_js_distance"
    CORRELATION_DISTANCE = "correlation_distance"
    # Add your new score type
    NEW_TABLE_COMPARISON_SCORE = "new_table_comparison_score"
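Next, create and register your hyperparameters class:
from dataclasses import dataclass

from artifact_core._base.artifact_dependencies import ArtifactHyperparams
from artifact_core.table_comparison._registries.scores.registry import TableComparisonScoreRegistry
from artifact_core.table_comparison._registries.scores.types import TableComparisonScoreType

# Register the hyperparameters under the new enum member
@TableComparisonScoreRegistry.register_artifact_hyperparams(
    TableComparisonScoreType.NEW_TABLE_COMPARISON_SCORE
)
@dataclass
class NewTableComparisonScoreHyperparams(ArtifactHyperparams):
    threshold: float
    use_weights: bool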
The corresponding contribution to the configuration file (artifact_core/table_comparison/config/raw.json) should then look like:
{
    "scores": {
        "new_table_comparison_score": {
            "threshold": 0.5,
            "use_weights": true
        }
    }
}
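To make the link between the config file and the hyperparameters class concrete, the sketch below parses a config section and uses it to build a dataclass instance. It is a stand-alone illustration, not artifact-core's loading code: `load_hyperparams` and `RAW_CONFIG` are hypothetical, the dataclass deliberately does not inherit from `ArtifactHyperparams`, and the config key is assumed to match the enum member's string value.

```python
import json
from dataclasses import dataclass, fields

# Hypothetical in-memory copy of the relevant part of the config file
RAW_CONFIG = """
{
    "scores": {
        "new_table_comparison_score": {
            "threshold": 0.5,
            "use_weights": true
        }
    }
}
"""


@dataclass(frozen=True)
class NewTableComparisonScoreHyperparams:
    # Stand-alone mirror of the hyperparameters class (no library base class)
    threshold: float
    use_weights: bool


def load_hyperparams(raw: str, score_key: str) -> NewTableComparisonScoreHyperparams:
    """Build the hyperparameters from the "scores" section of a JSON config."""
    section = json.loads(raw)["scores"][score_key]
    expected = {f.name for f in fields(NewTableComparisonScoreHyperparams)}
    # Fail early if the config keys and the dataclass fields have drifted apart
    if set(section) != expected:
        raise KeyError(f"Config keys {sorted(section)} do not match fields {sorted(expected)}")
    return NewTableComparisonScoreHyperparams(**section)


hyperparams = load_hyperparams(RAW_CONFIG, "new_table_comparison_score")
```

Every field of the hyperparameters class needs a matching key in the config section, so a field added to the class without a config default fails here rather than later at compute time.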
Should your contribution not require any hyperparameters, simply use the following as the generic parameter:
from artifact_core._base.artifact_dependencies import NoArtifactHyperparams
In this case no hyperparams class needs to be registered and no configuration params need to be added to the config file.
The appropriate generics for table comparison scores are as follows:
import pandas as pd
from artifact_core._base.artifact import Artifact
from artifact_core._libs.resource_spec.tabular.protocol import TabularDataSpecProtocol
from artifact_core._core.dataset_comparison.artifact import DatasetComparisonResources
Artifact[
    DatasetComparisonResources[pd.DataFrame],
    float,
    <HyperparamsT>,
    TabularDataSpecProtocol
]
In practice, rather than subclassing Artifact directly, you should work with the group-specific base classes: they implement core logic tailored to the artifact group in question.
To illustrate: all table comparison scores should inherit the following base:
import pandas as pd
from artifact_core.table_comparison._artifacts.base import TableComparisonScore
from artifact_core.table_comparison._registries.scores.types import TableComparisonScoreType
TableComparisonScore[<HyperparamsT>]
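The relationship between the fully generic base and a group-specific base can be illustrated with a toy hierarchy. The sketch below is not artifact-core's code: all `Toy*` names, `ScaleHyperparams` and `MeanGapScore` are hypothetical, plain lists stand in for DataFrames, and the resource-spec type parameter is omitted for brevity. It shows a group base fixing the resources and result types while leaving only the hyperparameters type open.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, List, TypeVar

ResourcesT = TypeVar("ResourcesT")
ResultT = TypeVar("ResultT")
HyperparamsT = TypeVar("HyperparamsT")


class ToyArtifact(ABC, Generic[ResourcesT, ResultT, HyperparamsT]):
    """Hypothetical fully generic base: every type parameter is open."""

    def __init__(self, hyperparams: HyperparamsT) -> None:
        self._hyperparams = hyperparams

    def compute(self, resources: ResourcesT) -> ResultT:
        return self._compute(resources)

    @abstractmethod
    def _compute(self, resources: ResourcesT) -> ResultT: ...


@dataclass(frozen=True)
class ToyComparisonResources:
    dataset_real: List[float]
    dataset_synthetic: List[float]


class ToyComparisonScore(ToyArtifact[ToyComparisonResources, float, HyperparamsT]):
    """Hypothetical group base: fixes resources and result, leaves hyperparams open."""

    def _compute(self, resources: ToyComparisonResources) -> float:
        # Shared group logic: unpack the resources once for all subclasses
        return self._compare_datasets(resources.dataset_real, resources.dataset_synthetic)

    @abstractmethod
    def _compare_datasets(
        self, dataset_real: List[float], dataset_synthetic: List[float]
    ) -> float: ...


@dataclass(frozen=True)
class ScaleHyperparams:
    scale: float


class MeanGapScore(ToyComparisonScore[ScaleHyperparams]):
    def _compare_datasets(
        self, dataset_real: List[float], dataset_synthetic: List[float]
    ) -> float:
        # Scaled absolute difference between the two dataset means
        gap = abs(sum(dataset_real) / len(dataset_real) - sum(dataset_synthetic) / len(dataset_synthetic))
        return self._hyperparams.scale * gap


result = MeanGapScore(ScaleHyperparams(scale=2.0)).compute(
    ToyComparisonResources(dataset_real=[1.0, 3.0], dataset_synthetic=[2.0, 6.0])
)
```

A concrete score then only supplies the hyperparameters type and the comparison method; the unpacking of resources lives in the group base.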
Finally, implement and register your artifact (accessing the relevant hyperparameters and resource spec):
import pandas as pd

from artifact_core.table_comparison._artifacts.base import TableComparisonScore, TableComparisonArtifactResources
from artifact_core.table_comparison._registries.scores.registry import TableComparisonScoreRegistry
from artifact_core.table_comparison._registries.scores.types import TableComparisonScoreType

# NewTableComparisonScoreHyperparams is the hyperparameters class defined earlier
@TableComparisonScoreRegistry.register_artifact(
    TableComparisonScoreType.NEW_TABLE_COMPARISON_SCORE
)
class NewTableComparisonScore(
    TableComparisonScore[
        NewTableComparisonScoreHyperparams
    ]
):
    def _validate(
        self,
        resources: TableComparisonArtifactResources
    ) -> TableComparisonArtifactResources:
        # Reject incomplete inputs before any computation takes place
        if resources.dataset_real is None or resources.dataset_synthetic is None:
            raise ValueError(
                "Both real and synthetic datasets must be provided"
            )
        return resources

    def _compare_datasets(
        self,
        dataset_real: pd.DataFrame,
        dataset_synthetic: pd.DataFrame
    ) -> float:
        # Restrict both datasets to the continuous features named in the resource spec
        dataset_real = dataset_real[self._resource_spec.ls_cts_features]
        dataset_synthetic = dataset_synthetic[self._resource_spec.ls_cts_features]
        # Placeholder value: replace with your actual comparison logic
        score = 1.0
        if score > self._hyperparams.threshold and self._hyperparams.use_weights:
            score = 2 * score
        return score
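The constant `score = 1.0` above is only a placeholder. As a hedged illustration of what a comparison body might look like, the stand-alone function below computes the mean absolute gap between per-column means over the continuous features, then applies the same threshold and weighting rule. The function name `mean_abs_mean_gap`, the choice of metric and the sample data are assumptions made for this sketch; they are not part of artifact-core.

```python
from typing import List

import pandas as pd


def mean_abs_mean_gap(
    dataset_real: pd.DataFrame,
    dataset_synthetic: pd.DataFrame,
    ls_cts_features: List[str],
    threshold: float,
    use_weights: bool,
) -> float:
    """Hypothetical score: average absolute difference between column means."""
    # Keep only the continuous features, as in _compare_datasets
    real = dataset_real[ls_cts_features]
    synthetic = dataset_synthetic[ls_cts_features]
    score = float((real.mean() - synthetic.mean()).abs().mean())
    # Same threshold and weighting rule as the artifact example
    if score > threshold and use_weights:
        score = 2 * score
    return score


# Illustrative data: "city" is categorical and therefore excluded from the score
real_df = pd.DataFrame({"age": [20.0, 40.0], "income": [1.0, 3.0], "city": ["a", "b"]})
synthetic_df = pd.DataFrame({"age": [22.0, 42.0], "income": [1.0, 3.0], "city": ["a", "c"]})

result = mean_abs_mean_gap(
    real_df, synthetic_df, ["age", "income"], threshold=0.5, use_weights=True
)
```

Here the column-mean gaps are 2.0 for `age` and 0.0 for `income`, giving an unweighted score of 1.0, which exceeds the threshold and is therefore doubled.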