Impostor.jl
Introduction
Impostor.jl is a Julia package which eases the generation of synthetic tabular data using a flexible yet concise API. Built from scratch upon Julia’s Multiple Dispatch paradigm with simplicity in mind, Impostor is the concretization of a software engineering project from its data back-end and API design; to its packaging, registration and distribution via the Julia General Registry.
Its main selling point is the capability to generate cohesive relations between table columns from templated objects but it also provides several generator and utility functions to generate standalone data series from a pre-selected set of entries. Check out the API reference for a complete list.
Some of the interesting features present in Impostor are:
- Support for cohesion across columns in the generated tables.
- Support for generic table templates via the
ImpostorTemplate
object. - Multi-locale support, for both per-method calls and current session.
- Compliance with Julia’s multiple dispatch paradigm, allowing for more control over data series generation.
- Consise and simple helper functions.
Motivation
In a couple courses during my graduation I found myself in a situation where I needed a small script to quickly generate sample tables for a database or generate enough tabular data to perform a load test in some service. After my third rewrite of some of such scripts I decided it was time for something a bit more structured and the idea of a Julia library for that seemed very appealing not only for that apparent “gap” in the functionalities I needed from similar libraries in the Julia ecosystem at the time, but also as an opportunity to learn the process of developing and publishing a Julia package.
As it turns out, it is surprisingly simple to have a package published in the Julia General Registry, provided that you follow the instructions in the Creating Packages page in the Pkg.jl stdlib docs. I was surprised to know that the structure was very straight forward and the publishing process was very simple and direct. For more info on registering Julia packages check this link.
Examples
Instead of giving a full description of step by step examples building up on complexity (which would only duplicate its documentation) I’ll provide a more practical use case to show some of its capabilities. Suppose we are mocking some sort of registration system with driver licenses information. The code snippets bellow shows concise ways to do it with Impostor.
Impostor Templates
|
|
Since we are outputing the resulting table to a dataframe, we must also import the DataFrames.jl package on our top-level scope. Now, by calling this generate_mocked_rows
, we have the following table:
generate_mocked_rows(5)
# 5×11 DataFrame
# Row │ firstname surname occupation birthdate state state_code city street postcode licence year
# │ String String String String String31 String3 String31 String String String Int64
# ─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# 1 │ Stephen Friedman Accountant 1946-05-06 Utah UT Salt Lake City Daugherty Avenue 363-646-702 GLU-2A12 2014
# 2 │ Dorothy Mann Mathematician 1969-08-29 New York NY New York City Oconnor Road 863-875-466 WHN-0M46 2019
# 3 │ Andrew Espinoza Phisician 1947-04-16 South Carolina SC Charleston Olivia Le Alley 541-284-571 UQJ-4H86 2017
# 4 │ Shawn Patterson Accountant 1946-05-16 Iowa IA Des Moines Callahan Road 549-716-942 HCN-4H78 2016
# 5 │ Barry Moran Phisician 1959-11-25 Alabama AL Montgomery Peterson Driveway 075-858-427 AYA-3L14 2019
General Purpose API
If you prefer to have more control over the data generation process, here is the same example but without ImpostorTemplate
s, i.e. using only the general purpuse generator-functions:
|
|
Note that the lines 13 and 14 (highlighted above) receive as input a mask specifying the state codes in order to generate the matching states names and codes (number of generated rows is assumed to be the same as length(mask)
). If for some reason this matching is not necessary/important, you may as well just call the the same functions state
and city
passing n_rows
as the only argument and Julia will dispatch the corresponding methods to generate single data series without filters.
generate_mocked_without_template(5)
# 5×11 DataFrame
# Row │ firstname surname occupation birthdate state_code state city street postcode licence year
# │ String String String String String String String String String String Int64
# ─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# 1 │ Sonia Lawson Sociologist 2011-05-09 CO Colorado Denver West Road 024-483-265 PDC-4H97 2015
# 2 │ Shelly Walter Anthropologyst 2005-04-01 WV West Virginia Charleston Tran Driveway 623-875-415 FIC-8B04 2019
# 3 │ Lacey Marks Sociologist 2020-02-01 GA Georgia Atlanta Mason Valencia Alley 943-370-619 PDW-2I72 2020
# 4 │ Lance Day Aircraft Engineer 2004-01-12 AL Alabama Montgomery Kent Road 002-914-648 KPE-0J59 2021
# 5 │ Gregg Walsh Sociologist 1969-08-05 SC South Carolina Columbia Katherine Huff Driveway 058-703-987 DJM-5F65 2018