A decision many engineers face at some point in their career is what to focus their attention on next. One of the amazing advantages of working in a consultancy is being exposed to many different technologies, giving you the opportunity to explore any emerging trends you might be interested in. I’ve been lucky enough to work with a huge variety of clients, ranging from industry leaders in the FTSE 100 to smaller start-ups disrupting the same technology space.
So why did I pick Big Data?
A common pattern I’ve noticed is that everyone has access to data – large amounts of raw, unstructured data. Business and technology leaders all recognise its importance and the value and insight it can deliver. Processes have been established to extract, transform and store this information, but the architecture is usually inefficient and incomplete.
Years ago these steps may have equated to an efficient data pipeline, but with emerging technologies such as Kinesis Streams, Redshift and even serverless databases there is now another way: a real-time, cost-efficient solution with low operational overhead.
Alongside this, companies are setting their sights on creating a data lake in the cloud. In doing so, they take advantage of a whole suite of technologies to store information in formats they leverage today and in configurations they may harness in the future. These are all clear steps on the journey towards digital transformation, and with the current pace of development in AWS technologies it is the perfect time to become more acquainted with Big Data.
But why is the certification necessary?
The AWS Certified Big Data Specialty exam introduces and validates several key big data fundamentals. The exam itself is not limited to AWS-specific technologies; it also explores the wider big data ecosystem. Taken straight from the exam guide, the domains cover:
- Collection
- Storage
- Processing
- Analysis
- Visualisation
- Data Security
These domains span a broad range of technical roles, from data engineers and data scientists to individuals in SecOps. Personally, I’ve had some exposure to the collection and storage of data, but much less to visualisation and security. You certainly have to be comfortable wearing many different hats when tackling this exam, as it tests not only your technical understanding of the solutions but also the business value created by the implementation. It’s equally important to consider the costs involved, including any forecasts as the solution scales.
Having already completed several associate exams, I found this certification significantly more difficult because you are required to deep dive into Big Data concepts and the relevant technologies. One benefit of this certification is that its scope extends to how these technologies are applied to Big Data problems, so be prepared to dive into Machine Learning and popular frameworks like Spark & Presto.
Okay, so how do I pass the exam?
1. A Cloud Guru’s Certified Big Data Specialty course provides an excellent introduction and overview.
2. Get some practical experience of Big Data in AWS; theoretical knowledge alone is not enough to pass this exam…
- Practise architecting data pipelines, and consider when Kinesis Streams vs Kinesis Firehose would be appropriate (see the first sketch after this list).
- Think about how the solution would differ according to the size of the data transfer; sometimes even Snowmobile becomes the efficient option.
3. Understand the different storage options on AWS – S3, DynamoDB, RDS, Redshift, HDFS vs EMRFS, HBase…
4. Understand the differences and use cases of popular Big Data frameworks e.g. Presto, Hive, Spark.
5. Data Security contributes the most to your overall exam score at 20%, and it touches every single AWS service. There are always options for making a solution more secure, and sometimes they’re enabled by default.
- Understand how to enable encryption at rest and in transit, whether to use SSE-KMS or SSE-S3, and when to encrypt client side vs server side (the second sketch after this list shows both server-side options).
- How to grant privileged access to data, e.g. IAM policies and Redshift views.
- Authentication flows with Cognito and integrations with external identity providers.
6. Performance is a key theme
- Have a sound understanding of what GSIs and LSIs (global and local secondary indexes) are in DynamoDB; the third sketch after this list creates a table with a GSI.
- Consider primary & sort keys, and distribution styles, across the database services.
- Know the different compression types and the speed of compressing/decompressing.
7. Dive into Machine Learning (ML)
- The Cloud Guru course mentioned above gives a good overview of the different ML models.
- If you have time, I would recommend this machine learning course by Andrew Ng on Coursera. Its technical depth is lower level than you will need for the exam, but it provides a novice with a very good introduction to the whole machine learning landscape.
8. Dive into Visualisation
- The A Cloud Guru course provides more than enough knowledge to tackle any questions here.
- Again, if you have the time, there’s an excellent data science course on Udemy with a data visualisation chapter that would prove useful here.
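To make tip 2’s Kinesis comparison concrete, here’s a minimal sketch in Python with boto3; the stream name, delivery stream name and record shape are all hypothetical. It writes the same record to a Kinesis Data Stream, where you size the shards and build the consumers yourself, and to a Firehose delivery stream, which buffers records and delivers them to a destination for you.

```python
import json
import boto3

kinesis = boto3.client("kinesis")
firehose = boto3.client("firehose")

record = {"sensor_id": "s-42", "temperature": 21.7}
payload = json.dumps(record).encode("utf-8")

# Kinesis Data Streams: you pick a partition key, size the shards and
# write your own consumers (KCL, Lambda, etc.) for sub-second processing.
kinesis.put_record(
    StreamName="clickstream-events",  # hypothetical stream name
    Data=payload,
    PartitionKey=record["sensor_id"],
)

# Kinesis Firehose: no shards or consumers to manage; records are buffered
# and delivered near real time to a configured destination (e.g. S3, Redshift).
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",  # hypothetical delivery stream
    Record={"Data": payload},
)
```

Roughly: Streams suits low-latency custom processing, while Firehose suits “just land it in S3 or Redshift” pipelines with minimal operational overhead.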
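For the encryption-at-rest options in tip 5, here’s a sketch of the two server-side choices on S3 (the bucket name and KMS key alias are hypothetical): SSE-S3, where AWS manages the keys entirely, and SSE-KMS, which adds key-level access control, rotation and CloudTrail auditing of key usage.

```python
import boto3

s3 = boto3.client("s3")

# SSE-S3: server-side encryption with S3-managed keys -- the simplest option.
s3.put_object(
    Bucket="my-data-lake",  # hypothetical bucket
    Key="raw/events.json",
    Body=b'{"example": true}',
    ServerSideEncryption="AES256",
)

# SSE-KMS: server-side encryption with a KMS key -- adds key-level access
# control, key rotation and an audit trail of key usage in CloudTrail.
s3.put_object(
    Bucket="my-data-lake",
    Key="raw/events.json",
    Body=b'{"example": true}',
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/data-lake-key",  # hypothetical key alias
)
```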
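And for tip 6, a sketch of creating a DynamoDB table with a GSI (all names are illustrative). The GSI introduces a completely new partition/sort key pair and can be added after the table exists; an LSI, by contrast, must reuse the table’s partition key and can only be defined at table creation.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Table keyed on user_id + order_date, plus a GSI keyed on product_id so the
# same items can be queried by an entirely different partition/sort key pair.
dynamodb.create_table(
    TableName="orders",  # hypothetical table
    AttributeDefinitions=[
        {"AttributeName": "user_id", "AttributeType": "S"},
        {"AttributeName": "order_date", "AttributeType": "S"},
        {"AttributeName": "product_id", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "user_id", "KeyType": "HASH"},      # partition key
        {"AttributeName": "order_date", "KeyType": "RANGE"},  # sort key
    ],
    GlobalSecondaryIndexes=[
        {
            "IndexName": "by-product",
            "KeySchema": [
                {"AttributeName": "product_id", "KeyType": "HASH"},
                {"AttributeName": "order_date", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
            "ProvisionedThroughput": {
                "ReadCapacityUnits": 5,
                "WriteCapacityUnits": 5,
            },
        }
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
```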
It can’t be emphasised enough that AWS themselves provide amazing learning resources. As preparation for the exam, definitely watch re:Invent videos and read AWS blogs & case studies.
Watch these videos:
- AWS re:Invent 2017: Big Data Architectural Patterns and Best Practices on AWS
- AWS re:Invent 2017: Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier
- AWS re:Invent 2016: Deep Dive: Amazon EMR Best Practices & Design Patterns
- AWS Summit Series 2016 | Chicago – Deep Dive + Best Practices for Real-Time Streaming Applications
Read these AWS blogs:
- Secure Amazon EMR with Encryption
- Building a Near Real-Time Discovery Platform with AWS
- Streaming Data Solutions on AWS with Amazon Kinesis
- Big Data Analytics Options on AWS
- Lambda Architecture for Batch and Real-Time Processing on AWS with Spark Streaming and Spark SQL
Also read the developer guides for all of the Big Data services.
One last note…
This exam will expect you to consider each question from many different perspectives. You’ll need to think not just about the technical feasibility of the solution presented but also about the business value it can create. The majority of questions are scenario specific and often there is more than one valid answer; look for subtle clues to determine which solution is more ‘correct’ than the others, e.g. whether speed is a factor or whether the question expects you to answer from a cost perspective.
Finally, this exam is very long (three hours) and requires a lot of reading. I found that the time given was more than enough, but remember to pace yourself, otherwise you can burn out quite easily.
Hopefully my experience and tips will help you prepare for the exam. Let us know if they helped you.
Visit our services page to explore how we enable organisations to transform their internal cultures, making it easier for teams to collaborate and adopt practices such as Continuous Integration, Continuous Delivery, and Continuous Testing.