Data catalog and infrastructure to process tabular data from MS Excel and delimited files
# Project Mission
- Goal: Build a platform that organizes and manages data stored in MS Excel and other tabular data files, covering data ingestion, metadata management, and schema evolution. The basic flow: the user is given a web application and points it at an .xlsx or .xls file, then defines the data sources within the file by providing sheet names and column ranges (an auto-detect process will also be available). The data is ingested into the platform and stored in a structured format, after which the user can query it and receive structured results. The user can also define a schema for the data, which the platform validates incoming data against, and schema evolution rules, which the platform applies automatically as the data changes.
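The source-definition step above (sheet name plus column range) could be sketched as follows. This is a minimal illustration using pandas and openpyxl; the `source_def` field names are placeholders, not the platform's actual configuration format, and the workbook is generated inline to stand in for a user upload.

```python
import openpyxl
import pandas as pd

# Build a tiny example workbook standing in for a user-uploaded file.
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Sales"
for row in [["region", "amount"], ["EU", 100], ["US", 250]]:
    ws.append(row)
wb.save("example.xlsx")

# A data-source definition as described above: sheet name plus column range.
# These field names are illustrative only.
source_def = {"sheet_name": "Sales", "columns": "A:B"}

df = pd.read_excel(
    "example.xlsx",
    sheet_name=source_def["sheet_name"],
    usecols=source_def["columns"],  # restrict ingestion to the configured range
)
print(df.shape)  # (2, 2): two data rows, two columns
```

The auto-detect process would replace the hand-written `source_def` with one inferred from the workbook's contents.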
- Data Flow:
Web app -> Source data configuration (stored in Postgres) -> DuckDB/DuckLake (temporary storage layer) -> dbt -> Superset
- Architecture
The web app will be built using FastAPI. It will allow a user to create Data Source Templates, which contain the sheet names, column ranges, and schema. Templates can be versioned to accommodate changes in the source files or different file versions. This information will be stored in Postgres.
After a Data Source Template is created, the user can upload a file. The platform will find the most appropriate Data Source Template for the uploaded file and ingest the data, landing it in the bronze schema of the DuckDB database. Both the results of the template matching and identification process and the structured data itself will be stored in DuckDB.
In the second phase of the build, the data in the bronze schema will be transformed into silver and gold schemas using dbt: the silver schema will contain the cleaned data and the gold schema the aggregated data. The dbt models will live in the dbt directory and will be triggered from the web app. The silver and gold schemas will reside in DuckDB and will also need to be replicated into Postgres.
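Triggering dbt from the web app could be as simple as shelling out to the dbt CLI after an ingestion completes. The sketch below only builds the command; the `silver+` selector and `dev` target are illustrative, while `dbt run --select`/`--target` are standard dbt-core flags.

```python
import subprocess

def build_dbt_command(models: str, target: str = "dev") -> list[str]:
    # Construct the dbt invocation the web app would run after an upload.
    # `--select models+` runs the named model and everything downstream.
    return ["dbt", "run", "--select", models, "--target", target]

cmd = build_dbt_command("silver+")
print(cmd)  # ['dbt', 'run', '--select', 'silver+', '--target', 'dev']
# In the web app this would then be executed as:
#   subprocess.run(cmd, check=True)
```

Running dbt out-of-process keeps the FastAPI worker decoupled from dbt's own dependency tree; a background task queue could own these invocations later.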
Postgres will store all configuration data for the data sources, including sheet names, column ranges, schemas, schema evolution rules, etc.
A business intelligence layer will be present to view the data in DuckDB; Superset will be used for this purpose.
- Tech Stack: Python, Postgres, dbt-core, duckdb, Superset