This is a Gigajob job posting for:

Data Engineer (Lead) (#1,474,321,528) 

Job offer #1,474,321,528 in Londonderry (County Londonderry), United Kingdom

Job Description

INTRODUCTION

We are looking for a hands-on lead data engineer to take ownership of ETL at the Company and then build a team. This will involve evolving our distributed data parsing framework, building scalability towards submitting billions of tasks per day, and handling multiple petabytes of data per year. The role includes design, build and maintenance; however, this is not a software development role but an infrastructure development role. The successful candidate will have a passion for managing complexity.

This advert was placed (or updated) on 07 January 2019. The closing date for applications for this role is six months from this date.

TEAM BUILDING

We are looking for someone to start in this role in Q1 2019. It is anticipated that they will spend 8-12 weeks getting up to speed on the systems in place, working in conjunction with the existing BMLL staff in this area during that period. Once trust has been built between the hire and the Company, the plan is to gradually add up to three new junior staff (up to three years' experience) to this team over the following twelve months. As the team lead, you will be the key decision maker for these hires.

TEAM REMIT

Currently at BMLL, ETL is managed by a variety of individuals from different teams. The Company's aim in forming a dedicated ETL team is to define clear boundaries of ownership between teams. The ETL team will work closely with the parser team (writing code to process the raw data), the devops team (wider devops support in the Company) and the metadata team. The ETL team will own the process from ingesting raw data to producing a curated object store, complete with quality control metrics, derived data, and appropriate APIs and visualization to access those outputs.

TECHNICAL BACKGROUND

The fully-automated BMLL framework runs highly complex multi-step data processing code around the clock, scaling on-demand using AWS spot fleet.
The platform is currently migrating to Apache Airflow (scheduler) and AWS Batch (executor) as our distributed cluster computing framework. Data storage solutions underlying this framework include AWS S3, EFS and PostgreSQL Aurora. Files are cached on an NFS layer (EFS) pending being moved to long-term S3 storage. File states are recorded in SQL using AWS CloudWatch Events, and are then used to control movement through the graph. For the batch framework we use SQS as a message broker between scheduler and executor, and ElastiCache as an in-memory database.

The ETL team will own this batch processing framework, which is currently used to submit tens of millions of tasks per day, and will move the system towards supporting submission of billions of tasks per day. You will increase the functionality, reliability and scalability of the system, as well as maintaining what already exists. The ETL team will also own the billions of output files from the parsing process which sit on S3. This involves issues such as curation of the data, mapping the data in a relational database, and checking for existence, consistency and corruption.

Examples of the sort of problems you will work on include:

* Owning the packaging and build of the data processing code into a container.
* Building and visualizing DAGs, and performing reporting and predictions using DAGs.
* Managing batch resources such as database connections: there is no point initiating a job if the resources it requires are not available.
* Optimizing the use of external resources when they are required.
* Minimizing cost, e.g. scheduling jobs taking into account predictable patterns in the spot price.
* Scheduling jobs conditional on when they need to be finished by.
* Using the ELK stack on AWS to monitor logging: understanding which jobs have failed and why, re-running jobs and/or liaising with the development team as required.
* Designing and building a Python feature extraction library for the data, running this library in the DAG, and persisting results into SQL.
* Evaluating technologies that could add value, for example AWS Glue, AWS SWF and AWS Data Pipeline, in conjunction and consultation with the CTO and our AWS solutions architect.
* Designing and maintaining the quality control framework for ETL, and producing user-facing dashboards to enable inspection of quality.

MACHINE LEARNING

The applicant needs to be clear that this is not a machine learning role per se, but that as a Company BMLL does work in the field of machine learning, and that the ambition of our customers is to use machine learning to add value to their businesses. The data output by the ETL pipeline will be used by customers and applications to perform machine learning. The scope of machine learning in this role is that the Company wishes to systemize feature extraction at scale during the ETL phase and present those curated features to users through an API. Features will range from being statistically trivial to being the result of sophisticated a
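To illustrate the DAG-building work described above: the ordering of a multi-step pipeline can be expressed with nothing more than the Python standard library. This is a minimal sketch only; the stage names are hypothetical, not BMLL's actual pipeline, and a production system would use the Airflow scheduler rather than an in-process sort.

```python
from graphlib import TopologicalSorter

# Hypothetical ETL stages; each entry maps a stage to the stages it depends on.
dag = {
    "ingest_raw": set(),
    "parse": {"ingest_raw"},
    "quality_control": {"parse"},
    "extract_features": {"parse"},
    "persist_to_sql": {"quality_control", "extract_features"},
}

# static_order() yields stages in a dependency-safe execution order,
# which is the core primitive behind scheduling or visualizing the graph.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

The same structure generalizes to reporting: walking the graph tells you which downstream stages are blocked when an upstream task fails.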
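The feature extraction library and its persistence into SQL could be sketched along these lines. The feature functions, table schema and dataset identifier are illustrative assumptions, not the Company's actual library; SQLite stands in for PostgreSQL Aurora to keep the sketch self-contained.

```python
import sqlite3
import statistics

# Hypothetical feature registry; real features would be computed from parsed market data.
FEATURES = {
    "mean": statistics.mean,
    "max": max,
    "count": len,
}

def extract_and_persist(conn, dataset_id, values):
    """Compute each registered feature for a dataset and persist the results into SQL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS features (dataset_id TEXT, name TEXT, value REAL)"
    )
    rows = [(dataset_id, name, float(fn(values))) for name, fn in FEATURES.items()]
    conn.executemany("INSERT INTO features VALUES (?, ?, ?)", rows)
    conn.commit()
    return rows

# Usage: one task in the DAG would call this per dataset, then downstream
# consumers query the features table through an API.
conn = sqlite3.connect(":memory:")
rows = extract_and_persist(conn, "dataset-2019-01-07", [1.0, 2.0, 3.0])
```

Keeping feature functions in a registry means new features can be added without touching the DAG wiring, which matches the "statistically trivial to sophisticated" range the posting describes.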

The Company

Company Name
BMLL Technologies Ltd

Job Details

Job Location
Post Code
County Londonderry
United Kingdom

