API Reference
Submodules
sqlsynthgen.create module
Functions and classes to create and populate the target database.
- sqlsynthgen.create.create_db_data(sorted_tables: Sequence[sqlalchemy.schema.Table], table_generator_dict: Mapping[str, TableGenerator], story_generator_list: Sequence[Mapping[str, Any]], num_passes: int) Counter[str]
Connect to a database and populate it with data.
- sqlsynthgen.create.create_db_tables(metadata: sqlalchemy.schema.MetaData) None
Create tables described by the sqlalchemy metadata object.
- sqlsynthgen.create.create_db_vocab(vocab_dict: Mapping[str, FileUploader]) None
Load vocabulary tables from files.
- sqlsynthgen.create.populate(dst_conn: sqlalchemy.Connection, tables: Sequence[sqlalchemy.schema.Table], table_generator_dict: Mapping[str, TableGenerator], story_generator_list: Sequence[Mapping[str, Any]]) Counter[str]
Populate a database schema with synthetic data.
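The Counter[str] return type suggests a per-table tally of generated rows. The following is a minimal sketch of that bookkeeping only, not sqlsynthgen's implementation: `run_pass`, `run_passes`, the `rows_per_pass` mapping and the `make_row` callback are hypothetical names, and no database is involved.

```python
from collections import Counter
from typing import Callable, Mapping, Sequence


def run_pass(
    table_names: Sequence[str],
    rows_per_pass: Mapping[str, int],
    make_row: Callable[[str], dict],
) -> Counter:
    """Hypothetical helper: generate rows for each table once, in order."""
    counts: Counter = Counter()
    for name in table_names:
        for _ in range(rows_per_pass.get(name, 1)):
            make_row(name)  # a real implementation would insert the row here
            counts[name] += 1
    return counts


def run_passes(
    table_names: Sequence[str],
    rows_per_pass: Mapping[str, int],
    make_row: Callable[[str], dict],
    num_passes: int,
) -> Counter:
    """Accumulate the per-table row counts over several passes."""
    total: Counter = Counter()
    for _ in range(num_passes):
        total += run_pass(table_names, rows_per_pass, make_row)
    return total


row_counts = run_passes(
    ["person", "visit"],
    {"person": 2, "visit": 3},
    lambda name: {"table": name},
    num_passes=4,
)
```

The table order is preserved within each pass, which matters when later tables reference rows in earlier ones.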
sqlsynthgen.main module
Entrypoint for the SQLSynthGen package.
- sqlsynthgen.main.create_data(orm_file: str = typer.Option, ssg_file: str = typer.Option, config_file: Optional[str] = typer.Option, num_passes: int = typer.Option, verbose: bool = typer.Option) None
Populate schema with synthetic data.
This CLI command generates synthetic data for Python table structures and inserts the rows into a destination schema.
It takes as input the object relational model, represented by a file containing Python classes and their attributes.
It also takes the sqlsynthgen output, represented by Python classes, their attributes, and the methods for generating values for those attributes.
The final input is the number of passes to make (num_passes).
Example
$ sqlsynthgen create-data
- Parameters:
orm_file (str) – Name of Python ORM file. Must be in the current working directory.
ssg_file (str) – Name of generators file. Must be in the current working directory.
config_file (str) – Path to configuration file.
num_passes (int) – Number of passes to make.
verbose (bool) – Be verbose. Default to False.
- sqlsynthgen.main.create_tables(orm_file: str = typer.Option, config_file: Optional[str] = typer.Option, verbose: bool = typer.Option) None
Create schema from a SQLAlchemy ORM file.
This CLI command creates the destination schema using the object relational model declared as Python tables.
Example
$ sqlsynthgen create-tables
- Parameters:
orm_file (str) – Name of Python ORM file. Must be in the current working directory.
config_file (str) – Path to configuration file.
verbose (bool) – Be verbose. Default to False.
- sqlsynthgen.main.create_vocab(ssg_file: str = typer.Option, verbose: bool = typer.Option) None
Import vocabulary data.
Example
$ sqlsynthgen create-vocab
- Parameters:
ssg_file (str) – Name of generators file. Must be in the current working directory.
verbose (bool) – Be verbose. Default to False.
- sqlsynthgen.main.make_generators(orm_file: str = typer.Option, ssg_file: str = typer.Option, config_file: Optional[str] = typer.Option, stats_file: Optional[str] = typer.Option, force: bool = typer.Option, verbose: bool = typer.Option) None
Make a SQLSynthGen file of generator classes.
This CLI command takes an object relational model output by sqlacodegen and returns a set of synthetic data generators for each attribute.
Example
$ sqlsynthgen make-generators
- Parameters:
orm_file (str) – Name of Python ORM file. Must be in the current working directory.
ssg_file (str) – Path to write the generators file to.
config_file (str) – Path to configuration file.
stats_file (str) – Path to source stats file (output of make-stats).
force (bool) – Overwrite the generators file if it exists. Default to False.
verbose (bool) – Be verbose. Default to False.
- sqlsynthgen.main.make_stats(config_file: str = typer.Option, stats_file: str = typer.Option, force: bool = typer.Option, verbose: bool = typer.Option) None
Compute summary statistics from the source database.
Writes the statistics to a YAML file.
Example
$ sqlsynthgen make-stats --config-file=example_config.yaml
- sqlsynthgen.main.make_tables(config_file: Optional[str] = typer.Option, orm_file: str = typer.Option, force: bool = typer.Option, verbose: bool = typer.Option) None
Make a SQLAlchemy file of Table classes.
This CLI command deploys sqlacodegen to discover a schema structure, and generates an object relational model declared as Python classes.
Example
$ sqlsynthgen make-tables
- Parameters:
config_file (str) – Path to configuration file.
orm_file (str) – Path to write the Python ORM file.
force (bool) – Overwrite the ORM file if it exists. Default to False.
verbose (bool) – Be verbose. Default to False.
- sqlsynthgen.main.remove_data(orm_file: str = typer.Option, ssg_file: str = typer.Option, config_file: Optional[str] = typer.Option, yes: bool = typer.Option, verbose: bool = typer.Option) None
Truncate non-vocabulary tables in the destination schema.
- sqlsynthgen.main.remove_tables(orm_file: str = typer.Option, config_file: Optional[str] = typer.Option, yes: bool = typer.Option, verbose: bool = typer.Option) None
Drop all tables in the destination schema.
Does not drop the schema itself.
- sqlsynthgen.main.remove_vocab(orm_file: str = typer.Option, ssg_file: str = typer.Option, config_file: Optional[str] = typer.Option, yes: bool = typer.Option, verbose: bool = typer.Option) None
Truncate vocabulary tables in the destination schema.
- sqlsynthgen.main.validate_config(config_file: Path, verbose: bool = typer.Option) None
Validate the format of a config file.
- sqlsynthgen.main.version() None
Display version information.
sqlsynthgen.make module
Functions to make a module of generator classes.
- class sqlsynthgen.make.FunctionCall(function_name: str, argument_values: list[str])
Bases: object
Contains the ssg.py content related to function calls.
- argument_values: list[str]
- function_name: str
- class sqlsynthgen.make.RowGeneratorInfo(variable_names: list[str], function_call: FunctionCall, primary_key: bool = False)
Bases: object
Contains the ssg.py content related to row generators of a table.
- function_call: FunctionCall
- primary_key: bool = False
- variable_names: list[str]
- class sqlsynthgen.make.StoryGeneratorInfo(wrapper_name: str, function_call: FunctionCall, num_stories_per_pass: int)
Bases: object
Contains the ssg.py content related to story generators.
- function_call: FunctionCall
- num_stories_per_pass: int
- wrapper_name: str
- class sqlsynthgen.make.TableGeneratorInfo(class_name: str, table_name: str, rows_per_pass: int, row_gens: list[sqlsynthgen.make.RowGeneratorInfo] = <factory>, unique_constraints: list[sqlalchemy.UniqueConstraint] = <factory>)
Bases: object
Contains the ssg.py content related to regular tables.
- class_name: str
- row_gens: list[sqlsynthgen.make.RowGeneratorInfo]
- rows_per_pass: int
- table_name: str
- unique_constraints: list[sqlalchemy.UniqueConstraint]
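The `<factory>` defaults in the signature above are how dataclass fields declared with `default_factory` are rendered. A minimal sketch of that pattern, using stand-in classes (`RowGenSketch` and `TableGenSketch` are hypothetical names with simplified fields, not the library's own classes):

```python
from dataclasses import dataclass, field


@dataclass
class RowGenSketch:
    """Simplified stand-in for a row generator record."""

    variable_names: list[str]
    function_name: str
    primary_key: bool = False


@dataclass
class TableGenSketch:
    """Simplified stand-in for a table generator record."""

    class_name: str
    table_name: str
    rows_per_pass: int
    # default_factory gives every instance its own empty list.
    row_gens: list[RowGenSketch] = field(default_factory=list)


person = TableGenSketch("PersonGenerator", "person", rows_per_pass=1)
visit = TableGenSketch("VisitGenerator", "visit", rows_per_pass=2)
person.row_gens.append(RowGenSketch(["name"], "generic.person.full_name"))
```

Appending to `person.row_gens` leaves `visit.row_gens` empty, because the list is created per instance rather than shared.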
- class sqlsynthgen.make.VocabularyTableGeneratorInfo(variable_name: str, class_name: str, table_name: str, dictionary_entry: str)
Bases: object
Contains the ssg.py content related to vocabulary tables.
- class_name: str
- dictionary_entry: str
- table_name: str
- variable_name: str
- sqlsynthgen.make.generate_ssg_content(template_context: Mapping[str, Any]) str
Generate the content of the ssg.py file as a string.
- async sqlsynthgen.make.make_src_stats(dsn: str, config: Mapping, schema_name: Optional[str] = None) dict[str, list[dict]]
Run the src-stats queries specified by the configuration.
Query the src database with the queries in the src-stats block of the config dictionary, using the differential privacy parameters set in the smartnoise-sql block of config. Record the results in a dictionary and return it.
- Parameters:
dsn – Database connection string.
config – A dictionary with the necessary configuration.
schema_name – Name of the database schema.
- Returns:
The dictionary of src-stats.
- sqlsynthgen.make.make_table_generators(tables_module: module, config: Mapping, src_stats_filename: Optional[str], overwrite_files: bool = False) str
Create sqlsynthgen generator classes from a sqlacodegen-generated file.
- Parameters:
tables_module – A sqlacodegen-generated module.
config – Configuration to control the generator creation.
src_stats_filename – Filename to read src stats from. Optional; if None, this feature is skipped.
overwrite_files – Whether to overwrite pre-existing vocabulary files.
- Returns:
A string that is a valid Python module, once written to file.
- sqlsynthgen.make.make_tables_file(db_dsn: str, schema_name: Optional[str], config: Mapping[str, Any]) str
Write a file with the SQLAlchemy ORM classes.
Exits with an error if sqlacodegen is unsuccessful.
sqlsynthgen.providers module
This module contains Mimesis Provider sub-classes.
- class sqlsynthgen.providers.BytesProvider(locale: Locale = Locale.EN, seed: Union[None, int, float, str, bytes, bytearray] = None)
Bases: BaseDataProvider
A Mimesis provider of binary data.
- bytes() bytes
Return a UTF-8 encoded sentence.
- class sqlsynthgen.providers.ColumnValueProvider(*, seed: Union[None, int, float, str, bytes, bytearray] = None, **kwargs: Any)
Bases: BaseProvider
A Mimesis provider of random values from the source database.
- class Meta
Bases: object
Meta-class for ColumnValueProvider settings.
- name = 'column_value_provider'
- static column_value(db_connection: sqlalchemy.Connection, orm_class: Any, column_name: str) Any
Return a random value from the column specified.
- class sqlsynthgen.providers.NullProvider(*, seed: Union[None, int, float, str, bytes, bytearray] = None, **kwargs: Any)
Bases: BaseProvider
A Mimesis provider that always returns None.
- static null() None
Return None.
- class sqlsynthgen.providers.SQLGroupByProvider(*, seed: Union[None, int, float, str, bytes, bytearray] = None, **kwargs: Any)
Bases: BaseProvider
A Mimesis provider that samples from the results of a SQL GROUP BY query.
- class Meta
Bases: object
Meta-class for SQLGroupByProvider settings.
- name = 'sql_group_by_provider'
- sample(group_by_result: list[dict[str, Any]], weights_column: str, value_columns: Optional[Union[str, list[str]]] = None, filter_dict: Optional[dict[str, Any]] = None) Union[Any, dict[str, Any], tuple[Any, ...]]
Randomly sample a row from the result of a SQL GROUP BY query.
The result of the query is assumed to be in the format that sqlsynthgen’s make-stats outputs.
For example, if one executes the following src-stats query
SELECT COUNT(*) AS num, nationality, gender, age FROM person GROUP BY nationality, gender, age
and calls it the count_demographics query, one can then use
generic.sql_group_by_provider.sample(
    SRC_STATS["count_demographics"],
    weights_column="num",
    value_columns=["gender", "nationality"],
    filter_dict={"age": 23},
)
to restrict the results of the query to only people aged 23, and randomly sample a pair of gender and nationality values (returned as a tuple in that order), with the sampling weights given by the counts num.
- Parameters:
group_by_result – Result of the query. A list of rows, with each row being a dictionary with names of columns as keys.
weights_column – Name of the column which holds the weights based on which to sample. Typically the result of a COUNT(*).
value_columns – Name(s) of the column(s) to include in the result. Either a string for a single column, an iterable of strings for multiple columns, or None for all columns (default).
filter_dict – Dictionary of {name_of_column: value_it_must_have}, to restrict the sampling to a subset of group_by_result. Optional.
- Returns:
A single value if value_columns is a single column name.
A tuple of values in the same order as value_columns if value_columns is an iterable of strings.
A dictionary of {name_of_column: value} if value_columns is None.
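The filtering, weighting and return-shape rules described above can be sketched with the standard library alone. This is an illustrative stand-in, not the provider's source: `sample_group_by` and its `rng` parameter are hypothetical, and returning every column (including the weights column) when value_columns is None is an assumption.

```python
import random
from typing import Any, Optional, Union


def sample_group_by(
    group_by_result: list[dict[str, Any]],
    weights_column: str,
    value_columns: Optional[Union[str, list[str]]] = None,
    filter_dict: Optional[dict[str, Any]] = None,
    rng: Optional[random.Random] = None,
) -> Any:
    """Hypothetical stand-in: weighted sampling of one row from GROUP BY results."""
    rng = rng or random.Random()
    filter_dict = filter_dict or {}
    # Keep only rows that match every column/value pair in filter_dict.
    rows = [
        row
        for row in group_by_result
        if all(row[col] == val for col, val in filter_dict.items())
    ]
    # Draw one row, with probability proportional to its weight.
    weights = [row[weights_column] for row in rows]
    row = rng.choices(rows, weights=weights, k=1)[0]
    if value_columns is None:
        return dict(row)  # assumption: all columns are returned
    if isinstance(value_columns, str):
        return row[value_columns]
    return tuple(row[col] for col in value_columns)


stats = [
    {"num": 10, "nationality": "FR", "gender": "F", "age": 23},
    {"num": 0, "nationality": "DE", "gender": "M", "age": 23},
    {"num": 5, "nationality": "IT", "gender": "M", "age": 40},
]
pair = sample_group_by(stats, "num", ["gender", "nationality"], {"age": 23})
```

In this made-up data the second row has weight 0, so the filtered sample for age 23 can only come from the first row.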
- class sqlsynthgen.providers.TimedeltaProvider(*, seed: Union[None, int, float, str, bytes, bytearray] = None, **kwargs: Any)
Bases: BaseProvider
A Mimesis provider of timedeltas.
- static timedelta(min_dt: timedelta = datetime.timedelta(0), max_dt: timedelta = datetime.timedelta(days=49710, seconds=23296)) timedelta
Return a random timedelta object.
- class sqlsynthgen.providers.TimespanProvider(*, seed: Union[None, int, float, str, bytes, bytearray] = None, **kwargs: Any)
Bases: BaseProvider
A Mimesis provider for timespans.
A timespan consists of a start datetime, an end datetime, and the timedelta in between. Returns a 3-tuple.
- static timespan(earliest_start_year: int, last_start_year: int, min_dt: timedelta = datetime.timedelta(0), max_dt: timedelta = datetime.timedelta(days=49710, seconds=23296)) tuple[datetime.datetime, datetime.datetime, datetime.timedelta]
Return a timespan as a 3-tuple of (start, end, delta).
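A minimal sketch of how such a 3-tuple could be produced with the standard library. `timespan_sketch` and its `rng` parameter are hypothetical; treating last_start_year as inclusive and drawing whole seconds uniformly are assumptions, not documented behaviour of the provider.

```python
import datetime as dt
import random
from typing import Optional


def timespan_sketch(
    earliest_start_year: int,
    last_start_year: int,
    min_dt: dt.timedelta = dt.timedelta(0),
    max_dt: dt.timedelta = dt.timedelta(days=49710, seconds=23296),
    rng: Optional[random.Random] = None,
) -> tuple[dt.datetime, dt.datetime, dt.timedelta]:
    """Hypothetical stand-in: return (start, end, delta) with end = start + delta."""
    rng = rng or random.Random()
    first = dt.datetime(earliest_start_year, 1, 1)
    # Assumption: the last start year is inclusive, so stop at the next January 1.
    last = dt.datetime(last_start_year + 1, 1, 1)
    # Pick a start uniformly within the allowed years, in whole seconds.
    offset = rng.randrange(int((last - first).total_seconds()))
    start = first + dt.timedelta(seconds=offset)
    # Pick a duration uniformly between min_dt and max_dt, bounds included.
    delta = dt.timedelta(
        seconds=rng.randint(int(min_dt.total_seconds()), int(max_dt.total_seconds()))
    )
    return start, start + delta, delta


start, end, delta = timespan_sketch(2000, 2005, rng=random.Random(0))
```

Deriving the end from the start and the delta keeps the three values consistent by construction.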
- class sqlsynthgen.providers.WeightedBooleanProvider(*, seed: Union[None, int, float, str, bytes, bytearray] = None, **kwargs: Any)
Bases: BaseProvider
A Mimesis provider for booleans with a given probability for True.
- class Meta
Bases: object
Meta-class for WeightedBooleanProvider settings.
- name = 'weighted_boolean_provider'
- bool(probability: float) bool
Return True with given probability, otherwise False.
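One common way to get this behaviour is to compare a uniform draw against the probability. The sketch below assumes that approach; `weighted_bool` is a hypothetical free function, not the provider method itself.

```python
import random


def weighted_bool(probability: float, rng: random.Random) -> bool:
    """Return True with the given probability, otherwise False."""
    # rng.random() is uniform on [0.0, 1.0), so 0.0 never yields True
    # and 1.0 always does.
    return rng.random() < probability


rng = random.Random(42)
draws = [weighted_bool(0.8, rng) for _ in range(10_000)]
share_true = sum(draws) / len(draws)
```

Over many draws, `share_true` should settle near the requested probability.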
sqlsynthgen.settings module
Utils for reading settings from environment variables.
See module pydantic for enforcing type hints at runtime. See module functools.lru_cache to save time and memory in case of repeated calls. See module typing for type hinting.
Classes:
Settings
Functions:
get_settings() -> Settings
- class sqlsynthgen.settings.Settings(*args: Any, **kwargs: Any)
Bases: BaseSettings
A Pydantic settings class with optional and mandatory settings.
Settings class attributes describe two database connections. The source database connection is the database schema from which the object relational model is discovered. The destination database connection is the location where tables based on the ORM are created and synthetic values inserted.
- src_dsn
A DSN for connecting to the source database.
- Type:
str
- src_schema
The source database schema to use, if applicable.
- Type:
str
- dst_dsn
A DSN for connecting to the destination database.
- Type:
str
- dst_schema
The destination database schema to use, if applicable.
- Type:
str
- class Config
Bases: object
Meta-settings for the Settings class.
- validate_dst_dsn(dsn: Optional[str], values: Any) Optional[str]
Create and validate the destination DB DSN.
- validate_src_dsn(dsn: Optional[str], values: Any) Optional[str]
Create and validate the source DB DSN.
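The pattern the module describes, reading settings from environment variables once and caching the result with functools.lru_cache, can be sketched without Pydantic. Everything here is illustrative: `SettingsSketch` and `get_settings_sketch` are hypothetical names, the environment variable names are assumed to be the upper-cased attribute names, and no DSN validation is performed.

```python
import os
from dataclasses import dataclass
from functools import lru_cache
from typing import Optional


@dataclass(frozen=True)
class SettingsSketch:
    """Plain dataclass stand-in for the four documented attributes."""

    src_dsn: Optional[str] = None
    src_schema: Optional[str] = None
    dst_dsn: Optional[str] = None
    dst_schema: Optional[str] = None


@lru_cache(maxsize=1)
def get_settings_sketch() -> SettingsSketch:
    """Read the settings from the environment once and cache the result."""
    return SettingsSketch(
        src_dsn=os.environ.get("SRC_DSN"),  # assumed variable name
        src_schema=os.environ.get("SRC_SCHEMA"),  # assumed variable name
        dst_dsn=os.environ.get("DST_DSN"),  # assumed variable name
        dst_schema=os.environ.get("DST_SCHEMA"),  # assumed variable name
    )


os.environ["SRC_DSN"] = "postgresql://user:password@localhost:5432/src"
settings = get_settings_sketch()
```

Because the result is cached, later changes to the environment are not picked up until `get_settings_sketch.cache_clear()` is called.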