More often than not, the real challenge in building business intelligence solutions begins with integrating the source data into the BI tools or other applications. One of the clients we recently worked with needed direct integration with their source data lake. The task was to fetch data from the data lake set up by the client, which was an S3 bucket on the Amazon Web Services (AWS) cloud.

We perform ETL operations to pull data from the data lake onto a different server. ETL is the process of "Extracting" data from a data lake or data warehouse, "Transforming" it into the proper format by cleaning it and defining a standard structure for how it will be stored, and finally "Loading" it into our database in a structured way that matches our requirements. In this case the data arrived as flat files. Our team runs this ETL process to move data from the S3 bucket data lake into our column-oriented relational database (MonetDB). We use a NoSQL database (MongoDB) to store the data structures and to automate the process. To improve performance and efficiency, we convert pre-built views into tables so that analytics queries run fast; a sketch of that step is shown below.
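As an illustration of converting a view into a table, the following minimal sketch materialises a view into a physical table in MonetDB using the pymonetdb driver. The connection settings and the view/table names (sales_summary_view, sales_summary) are assumptions for the example, not the client's actual schema.

```python
# Hypothetical sketch: materialise a pre-built view into a plain table in MonetDB
# so analytics queries hit a physical table instead of re-running the view each time.
import pymonetdb

conn = pymonetdb.connect(
    username="monetdb", password="monetdb",
    hostname="localhost", database="analytics",   # placeholder connection details
)
cur = conn.cursor()

# Drop the previous materialised copy, then rebuild it from the view.
cur.execute("DROP TABLE IF EXISTS sales_summary")
cur.execute("CREATE TABLE sales_summary AS SELECT * FROM sales_summary_view WITH DATA")
conn.commit()
conn.close()
```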

The data is usually refreshed on a daily basis, which means the sanity and non-redundancy of the data must be checked against the original source for discrepancies before loading. There are usually rules and conditions applied before the data is uploaded; a hedged example of such checks is sketched below.
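The snippet below shows the kind of sanity checks that can run on each flat file before loading: dropping duplicate rows and enforcing simple rules. The column names (order_id, amount) and the rules themselves are assumptions for illustration only.

```python
# Illustrative pre-load validation; column names and rules are placeholders.
import pandas as pd

def validate(df: pd.DataFrame) -> pd.DataFrame:
    # Reject exact duplicate rows so the same record is never loaded twice.
    df = df.drop_duplicates()

    # Example rule checks: keys must be present and numeric fields non-negative.
    if df["order_id"].isna().any():
        raise ValueError("order_id contains null values")
    if (df["amount"] < 0).any():
        raise ValueError("negative amounts found")
    return df
```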

Since the ETL process runs daily and is fully automated, but there is no fixed time at which data is pushed into the S3 bucket, one can set up a scheduler that runs every minute throughout the day. To check whether the data in the AWS S3 bucket has been updated, we usually set up a trigger point that indicates whether all the necessary files have landed in S3, so that the ETL process can start.
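A minimal sketch of that per-minute check is shown below: it looks for a marker object that signals all of the day's files have landed in the bucket. The bucket name and the marker key are placeholders, and in practice the function would be invoked by cron or a similar scheduler.

```python
# Hedged example: poll S3 for a trigger object before starting the ETL run.
import datetime
import boto3

s3 = boto3.client("s3")
BUCKET = "client-data-lake"  # assumed bucket name

def files_ready(today: datetime.date) -> bool:
    marker_key = f"exports/{today:%Y-%m-%d}/_SUCCESS"  # assumed trigger object
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=marker_key)
    return resp.get("KeyCount", 0) > 0

if files_ready(datetime.date.today()):
    print("Trigger found - start the ETL run")
```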

We deploy a number of tools that make the ETL process more efficient. We use the open-source library boto3 to connect our server to the Amazon S3 bucket. The ETL itself is written in Python, where tools such as Pandas, NumPy, SQLAlchemy, and our own pre-built libraries handle data cleaning, filtering, sorting, and related operations.
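The following end-to-end sketch ties these tools together under stated assumptions: it downloads one flat file from S3 with boto3, cleans it with pandas, and loads it into MonetDB through SQLAlchemy. The file paths, table names, column names, and connection URL are illustrative, and the MonetDB dialect is assumed to come from the sqlalchemy-monetdb package.

```python
# Hedged extract-transform-load sketch; all names are placeholders.
import boto3
import pandas as pd
from sqlalchemy import create_engine

# Extract: pull one flat file from the S3 bucket to the local server.
s3 = boto3.client("s3")
s3.download_file("client-data-lake", "exports/2024-01-01/orders.csv", "/tmp/orders.csv")

# Transform: basic cleaning with pandas.
df = pd.read_csv("/tmp/orders.csv")
df = df.drop_duplicates()
df["order_date"] = pd.to_datetime(df["order_date"])  # assumed column

# Load: append the cleaned data into MonetDB via SQLAlchemy.
engine = create_engine("monetdb://monetdb:monetdb@localhost:50000/analytics")
df.to_sql("orders", engine, if_exists="append", index=False)
```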

The process not only automates data engineering activities but also reduces errors and speeds up delivery, providing near real-time data to users.
