B2B companies often face the problem of data migration during customer onboarding. While the company has a standard data model defined, it cannot expect customers to deliver their data in the same format. To handle this problem, one can use a one-time script to transform the data. However, this approach has several drawbacks:
- The script is not reusable, as it is built for a specific customer
- With every new customer, there is significant engineering work to create a new script for them
- If the data format changes in between, the script may require a lot of changes
- The script has to contain the data validation logic and must handle API failures; otherwise, it crashes somewhere in the middle, and you have to fix the problem and run it all over again
In this article, I will explore how I built a simple solution to handle these problems, designed so it can easily be extended.
Go-To Solutions
Before building software to solve a problem ourselves, it is always good to look at the options available in the market and check whether we can trade money for engineering effort. Since B2B customer onboarding is a common problem, there were already a few established SaaS companies providing solutions for it.
- FlatFile: an excellent tool for onboarding business customers. Their offerings range from an interactive UI for collaborating with customers to custom NodeJS code for transforming the fields of the data.
- OneSchema: operates in the same business area as FlatFile. Their solution offers an easy way to clean the data, export the rows with errors, and share them with the customer. They also support transforming the data and handling ingestion and incremental updates.
- CSVBox: a simple solution to map the columns of an input spreadsheet to a defined schema. Although googling “Flatfile alternative” brings them to the top, they are not quite there yet!
After exploring some other solutions such as Dromo, I decided to go with a custom solution.
The Solution
The solution was named Morpheus, after the “Matrix” character and the Greek god of dreams. The interesting part of the name was “morph”: to transform, which is exactly what the project does to the spreadsheet input data.
The project consisted of two parts: one to import JSON data that is already in the format required by the POST APIs, and another to convert the incoming spreadsheet data into that standard JSON format. A validator was placed between these steps to ensure that the data was in the format needed by the API.
Importer
The importer tool is responsible for uploading the data present in the JSON files to the APIs. The challenge here comes from the references between the entities. To give a concrete example, let's assume the following scenario:
- We have Student and Course resources, where multiple Students can take a single Course.
- The backend generates UUIDs for the Student and Course resources and uses them as the primary keys.
- The endpoint POST /api/course takes a JSON body field students, an array of UUIDs corresponding to Student resources.
- The endpoint GET /api/student has a query parameter search, which takes a string in the [field1: value1][field2: value2] format and returns the matching resources.
We do not know the UUID of a Student resource until it is present in the backend; thus, the JSON input file has to contain some other unique field of the student. The code must then call the GET endpoint with that field and return the UUID of the resource.
To handle such scenarios, I created a Model class in the code, responsible for keeping track of dependencies and resolving them. The class looked like:
type Reference = {
  model: Model
  field: string
  oneToMany: boolean
}

type SearchParams = Record<string, string>

class Model {
  constructor(
    public name: string,
    private args: {
      getEndpoint: string
      postEndpoint: string
      references: Reference[]
    }
  ) {
    // initialize all properties
  }

  async getOne(params: SearchParams): Promise<string> {
    // returns the ID of the object matching the search criteria
  }

  async createOne(obj: {}): Promise<string> {
    // creates one object and returns its ID
  }

  async resolveDependencies(obj: {}): Promise<{}> {
    // takes in a value from JSON, returns the object with resolved dependencies
    // the returned object can be used in `createOne()`
  }
}
The getOne() method converts SearchParams into a string in the format of the search parameter and calls the GET API, handling the cases of multiple objects and no objects returned.
Similarly, the createOne() method calls the POST API to persist the object and returns its ID once it is created.
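To make the search lookup concrete, here is a minimal sketch of what getOne() has to do, assuming the GET API responds with a JSON array of objects carrying an id field; the helper name and the response shape are illustrative, not the actual implementation.

// Sketch of the lookup behind getOne(). SearchParams is the type from the
// Model snippet above; the response shape here is an assumption.
async function getOneId(
  getEndpoint: string,
  modelName: string,
  params: SearchParams
): Promise<string> {
  // Build the "[field1: value1][field2: value2]" search string
  const search = Object.entries(params)
    .map(([field, value]) => `[${field}: ${value}]`)
    .join("")
  const response = await fetch(`${getEndpoint}?search=${encodeURIComponent(search)}`)
  const matches: { id: string }[] = await response.json()
  // Handle the "no objects" and "multiple objects" cases explicitly
  if (matches.length !== 1) {
    throw new Error(`${modelName}: expected one match for ${search}, found ${matches.length}`)
  }
  return matches[0].id
}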
The method of interest here is resolveDependencies(). It walks through all references, reads the value at each field, and replaces it with IDs obtained from the getOne() method. For this method to work, the resources have to be imported in dependency order (here, Course must be imported after Student). I used jsonpath to allow extra flexibility when accessing the field.
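To make the dependency resolution concrete, here is a rough sketch of how resolveDependencies() could work, assuming each reference field holds either a single search-params object or, for oneToMany references, an array of them; it is an illustration rather than the exact implementation.

import jp from "jsonpath"

// Reference and SearchParams are the types from the Model snippet above.
async function resolveReferences(obj: object, references: Reference[]): Promise<object> {
  const resolved = structuredClone(obj)
  for (const ref of references) {
    // Read the value addressed by the jsonpath expression, e.g. "$.students"
    const value = jp.value(resolved, ref.field)
    if (value === undefined) continue
    const ids = ref.oneToMany
      ? await Promise.all((value as SearchParams[]).map(v => ref.model.getOne(v)))
      : await ref.model.getOne(value as SearchParams)
    // Replace the search params with the resolved ID(s)
    jp.value(resolved, ref.field, ids)
  }
  return resolved
}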
Using the Model class, we can define the Student and Course models as:
const student = new Model("Student", {
getEndpoint: "https://exmaple.com/api/student",
postEndpoint: "https://example.com/api/student",
references: [],
})
const course = new Model("Course", {
getEndpoint: "https://example.com/api/course",
postEndpoint: "https://example.com/api/course",
references: [
{
model: student,
field: "$.students",
oneToMany: true,
},
],
})
Now, we can start importing the objects. Let’s say we have the following data.
const students = [
  {
    email: "alex@uni.edu",
    name: "Alex Jones",
  },
  {
    email: "bob@uni.edu",
    name: "Bob Harris",
  },
  {
    email: "chris@uni.edu",
    name: "Chris Muller",
  },
]

const courses = [
  {
    name: "Intro to CS",
    id: "CS101",
    students: [
      {
        email: "alex@uni.edu",
      },
      {
        email: "bob@uni.edu",
      },
    ],
  },
  {
    name: "Fundamentals of Programming",
    id: "CS102",
    students: [
      {
        email: "chris@uni.edu",
      },
    ],
  },
]
Creating the students is straightforward: we call student.createOne() for all student objects.
await Promise.all(
  students.map(studentObject =>
    student.createOne(studentObject)
  )
)
Before creating the courses, however, we need to resolve student IDs.
const resolvedCourses = await Promise.all(
  courses.map(courseObject =>
    course.resolveDependencies(courseObject)
  )
)
The resolvedCourses array looks like:
[
  {
    name: "Intro to CS",
    id: "CS101",
    students: ["uuid1", "uuid2"],
  },
  {
    name: "Fundamentals of Programming",
    id: "CS102",
    students: ["uuid3"],
  },
]
Now, we can import the resolvedCourses as usual.
await Promise.all(
  resolvedCourses.map(courseObject =>
    course.createOne(courseObject)
  )
)
Note: The Model class only takes care of handling references and creating and getting objects. It is still your responsibility to ensure that the other fields are valid.
Transformer
Once the importer tool was ready, it was time to build the Transformer tool.
Assuming that we have the student names in student.xlsx in a sheet called names, and an API to query their email addresses by name (which is a bad example, please bear with me 😅), the Transformer would look as follows:
const studentTransformer = new Transformer()
  .field("name", data => data["Student Name"])
  .field("email", async data =>
    fetch(`https://uni.edu/api/email?name=${data["Student Name"]}`)
      .then(response => response.json())
      .then(res => res.email)
  )
  .input("student.xlsx", { sheet: "names" })
  .output("student.json")

await studentTransformer.transform()
The field(name: string, action: FieldAction) method defines the JSON field name and the action that computes the value of that field. It is possible to return objects and arrays from the action, thus making nested structures possible. The action handlers support both async and sync operations and have the following type:
type DataRow = Record<string, string>
type Output<T> = T | Promise<T>
type FieldAction<T> = (dataRow: DataRow) => Output<T>
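To make the fluent interface more tangible, here is a rough sketch of how such a Transformer could be put together; it is not the project's actual implementation, and readSheet() is a stand-in for whatever spreadsheet-reading library is used (for example, the xlsx package).

import fs from "node:fs"

// DataRow, Output and FieldAction are the types defined just above.
// readSheet() is a hypothetical helper that reads one sheet into rows.
declare function readSheet(file: string, sheet: string): DataRow[]

class Transformer {
  private fields: { name: string; action: FieldAction<unknown> }[] = []
  private inputFile = ""
  private sheet = ""
  private outputFile = ""

  field(name: string, action: FieldAction<unknown>): this {
    this.fields.push({ name, action })
    return this
  }

  input(file: string, options: { sheet: string }): this {
    this.inputFile = file
    this.sheet = options.sheet
    return this
  }

  output(file: string): this {
    this.outputFile = file
    return this
  }

  async transform(): Promise<void> {
    const rows = readSheet(this.inputFile, this.sheet)
    const objects: Record<string, unknown>[] = []
    for (const row of rows) {
      const obj: Record<string, unknown> = {}
      for (const { name, action } of this.fields) {
        // await works for both sync and async field actions
        obj[name] = await action(row)
      }
      objects.push(obj)
    }
    fs.writeFileSync(this.outputFile, JSON.stringify(objects, null, 2))
  }
}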
When working with real data, I came across the need to skip some rows based on a condition. This resulted in the addition of the filter(predicate: Predicate) method to the fluent interface, with Predicate defined as:
type Predicate = (dataRow: DataRow) => Output<boolean>
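For example, rows for students who should not be imported can be skipped before any fields are computed; the Status column below is purely illustrative and not part of the original data.

const activeStudentTransformer = new Transformer()
  // skip rows whose (hypothetical) Status column marks the student as inactive
  .filter(data => data["Status"] !== "Inactive")
  .field("name", data => data["Student Name"])
  .input("student.xlsx", { sheet: "names" })
  .output("active-students.json")

await activeStudentTransformer.transform()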
Continuing the example of Students and Courses, let's assume courses.xlsx contains a big list of Course IDs and Course Names, duplicated for every student taking the course. To generate courses.json, we need a grouping rather than a 1-to-1 mapping. To do this, I created the groupBy(action: GroupByAction, aggregateFields: string[]) method, with GroupByAction defined as:
type GroupByAction = (dataRow: DataRow) => Output<string[]>
The GroupByAction returns a list of strings, ideally picked from the dataRow. The transformer keeps track of the returned strings as keys and aggregates the results returned by the FieldActions of the fields listed in aggregateFields into arrays. The fields not in aggregateFields are computed only once. With the groupBy() method in place, we can write the transformer for Course:
const courseTransformer = new Transformer()
  .groupBy(data => [data["Course ID"]], ["students"])
  .field("name", data => data["Course Name"])
  .field("id", data => data["Course ID"])
  .field("students", async data => ({
    email: await fetch(
      `https://uni.edu/api/email?name=${data["Student Name"]}`
    )
      .then(response => response.json())
      .then(res => res.email),
  }))
  .input("courses.xlsx", { sheet: "course" })
  .output("courses.json")

await courseTransformer.transform()
In the above example, the FieldAction for the "id" field will run only once per course, and the results from the FieldAction of the "students" field will be aggregated into an array.
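With the sample data from the Importer section, the generated courses.json would therefore look roughly like the courses array shown earlier:

[
  {
    name: "Intro to CS",
    id: "CS101",
    students: [{ email: "alex@uni.edu" }, { email: "bob@uni.edu" }],
  },
  {
    name: "Fundamentals of Programming",
    id: "CS102",
    students: [{ email: "chris@uni.edu" }],
  },
]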
Validator
Once the transformer had produced the JSON file, it was necessary to check, before importing it, whether the objects followed the specification of the API, since the importer only takes care of resolving references. I chose zod, which I had come across on GitHub trending, to define the schema and validate the JSON file. With zod, the schema definitions for our Student and Course would look like:
import z from "zod"
const student = z.object({
  name: z.string(),
  email: z.string().email(),
})

const course = z.object({
  name: z.string(),
  id: z.string().regex(/CS\d{3}/),
  students: z.array(student.partial()).nonempty(),
})
And we can do validation as:
const result = student.safeParse(obj)
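In practice, the validator runs over every object in the transformed file and separates the valid objects from the failures. A minimal sketch, reusing the student schema defined above; the file names are placeholders:

import fs from "node:fs"

const objects: unknown[] = JSON.parse(fs.readFileSync("student.json", "utf-8"))
const valid: unknown[] = []
const failures: { object: unknown; message: string }[] = []

for (const obj of objects) {
  const result = student.safeParse(obj)
  if (result.success) {
    valid.push(result.data)
  } else {
    // keep the offending object next to its error message for later inspection
    failures.push({ object: obj, message: result.error.message })
  }
}

fs.writeFileSync("errors.json", JSON.stringify(failures, null, 2))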
The Interface
The project Morpheus had a CLI interface, built using commander.
For the Import tool, I defined all the models of the backend in models.ts and exported them in a Map<string, Model>. The same was repeated for the transformers, with transformer.ts exporting a Map<string, Transformer>; a separate transformer.ts is required for each customer. The maps use the model name as the key, which makes it easy to choose a model by name in the CLI and pick up the corresponding transformer or importer.
If an error happens during processing, the program does not halt; instead, it writes the error-causing object and the error message to an error file (specified on the command line) and continues working.
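A rough sketch of how this could be wired up with commander; the subcommand, the option names, and the models export are illustrative rather than the project's exact interface:

import { Command } from "commander"
import { models } from "./models" // assumed name for the exported Map<string, Model>

const program = new Command()

program
  .command("import")
  .argument("<model>", "name of the model to import")
  .requiredOption("--input <path>", "JSON file to import")
  .requiredOption("--error-file <path>", "file to write failed objects to")
  .action(async (modelName, options) => {
    const model = models.get(modelName)
    if (!model) {
      console.error(`Unknown model: ${modelName}`)
      process.exit(1)
    }
    // read options.input, resolve dependencies, call createOne() per object,
    // and append any failures to options.errorFile instead of halting
  })

program.parse()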
Once the process has finished, one can inspect the error file, fix the errors, and get the objects back using the clean subcommand. These objects can then be fed into the tool again, or sent back to the customer for clarification.
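The clean subcommand essentially strips the error messages and writes the bare objects back out. A small sketch, assuming the error file stores object/message pairs as in the validation example above; the file names are placeholders:

import fs from "node:fs"

// Assumed error-file format: an array of { object, message } entries
const entries: { object: unknown; message: string }[] = JSON.parse(
  fs.readFileSync("errors.json", "utf-8")
)

// Keep only the objects so they can be fixed and fed back into the tool
fs.writeFileSync(
  "cleaned.json",
  JSON.stringify(entries.map(entry => entry.object), null, 2)
)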
Impacts
This project had a huge impact on customer onboarding. Specifically, it
- reduced the work required to onboard a customer from creating an entire project down to writing several functions
- decreased the customer onboarding time from weeks to days
- increased the visibility into the errors during processing and import
Apart from customer onboarding, I used the Import tool heavily in combination with Mockaroo to create dummy data for product demos to potential customers. Thanks to it, I created around 10 such demo environments, incrementally gathering feedback from the business 😄.
Learnings
Since I handled this project from initialization to the first milestone, I learned how to run a project, how to prioritize features based on the business requirements, and how to communicate them to all the stakeholders. Apart from coding, I was involved in documenting the project and gathering feedback from other developers regarding code quality and interface usability, which helped me understand the project from different perspectives.