If you’re looking to take your data analytics career to the next level, then this AWS Data Analytics Specialty Certification Exam Preparation blog is a must-read! With over 100 exam questions and answers, plus data science and data analytics interview questions, cheat sheets and more, you’ll be fully prepared to ace the DAS-C01 exam.
In this blog, we talk about big data and data analytics; we also give you the last updated top 100 AWS Certified Data Analytics – Specialty Questions and Answers Dumps
Download the App for an interactive experience:
AWS DAS-C01 Exam Prep on android
AWS DAS-C01 Exam Prep on Windows
Domain 1: Collection 18%
Domain 2: Storage and Data Management 22%
Domain 3: Processing 24%
Domain 4: Analysis and Visualization 18%
Domain 5: Security 18%
https://enoumen.com/2021/11/07/top-100-data-science-and-data-analytics-interview-questions-and-answers/
A) Snowball, Kinesis Firehose, Direct Connect
B) Data Migration Services, Kinesis Firehose, Direct Connect
C) Snowball, Data Migration Services, Direct Connect
D) Snowball, Direct Connection, Kinesis Firehose
ANSWER1:
Notes/Hint1:
Reference1: Big Data Analytics Options
ANSWER2:
Notes/Hint2:
Reference1: Relationalize PySpark
A) You did not have versioning enabled and would need to create individual buckets to prevent the data from being overwritten.
B) Data records are only accessible for a default of 24 hours from the time they are added to a stream.
C) One of the sensors failed, so there was no data to record.
D) You needed to use EMR to send the data to S3; Kinesis Streams are only compatible with DynamoDB.
ANSWER3:
Notes/Hint3:
Reference3: Kinesis Extended Reading
A) Include a variable in the clickstream data from the publisher website to maintain a counter for the number of active user sessions. Use a timestamp for the partition key for the stream. Configure the consumer application to read the data from the stream and change the number of processor threads based upon the counter. Deploy the consumer application on Amazon EC2 instances in an EC2 Auto Scaling group.
B) Include a variable in the clickstream to maintain a counter for each user action during their session. Use the action type as the partition key for the stream. Use the Kinesis Client Library (KCL) in the consumer application to retrieve the data from the stream and perform the processing. Configure the consumer application to read the data from the stream and change the number of processor threads based upon the
counter. Deploy the consumer application on AWS Lambda.
C) Include a session identifier in the clickstream data from the publisher website and use as the partition key for the stream. Use the Kinesis Client Library (KCL) in the consumer application to retrieve the data from the stream and perform the processing. Deploy the consumer application on Amazon EC2 instances in an
EC2 Auto Scaling group. Use an AWS Lambda function to reshard the stream based upon Amazon CloudWatch alarms.
D) Include a variable in the clickstream data from the publisher website to maintain a counter for the number of active user sessions. Use a timestamp for the partition key for the stream. Configure the consumer application to read the data from the stream and change the number of processor threads based upon the counter. Deploy the consumer application on AWS Lambda.
ANSWER4:
Notes/Hint4:
Reference4: UpdateShardCount API
A) Use SQS for storing the transaction messages; when the billing batch process performs first and consumes the message, write the code in a way that does not remove the message after consumed, so it is available for the audit application several hours later. The audit application can consume the SQS message and remove it from the queue when completed.
B) Use Kinesis to store the transaction information. The billing application will consume data from the stream and the audit application can consume the same data several hours later.
C) Store the transaction information in a DynamoDB table. The billing application can read the rows while the audit application will read the rows then remove the data.
D) Use SQS for storing the transaction messages. When the billing batch process consumes each message, have the application create an identical message and place it in a different SQS for the audit application to use several hours later.
SQS would make this more difficult because the data does not need to persist after a full day.
ANSWER5:
Notes/Hint5:
Reference5: Amazon Kinesis
Get mobile friendly version of the quiz @ the App Store
AWS DAS-C01 Exam Prep on android
AWS DAS-C01 Exam Prep on Windows
A) Store the file in Amazon DocumentDB and the document ID as an attribute in the DynamoDB table.
B) Store the file in Amazon S3 and the object key as an attribute in the DynamoDB table.
C) Split the file into smaller parts and store the parts as multiple items in a separate DynamoDB table.
D) Store the file as an attribute in the DynamoDB table using Base64 encoding.
ANSWER6:
Notes/Hint6:
Reference6: S3 Storage Cost – DynamODB Storage Cost
A) Log all events using the Kinesis Producer Library.
B) Log critical events using the Kinesis Producer Library, and log informational events using the PutRecords API method.
C) Log critical events using the PutRecords API method, and log informational events using the Kinesis Producer Library.
D) Log all events using the PutRecords API method.
ANSWER2:
Notes/Hint7:
Reference7: PutRecords API
A) Amazon SQS
B) Amazon EMR
C) AWS Data Pipeline
D) Amazon Kinesis Firehose
ANSWER8:
Notes/Hint8:
Reference8: Amazon Kinesis Firehose
A) Create an Amazon Managed Streaming for Kafka cluster and ingest the data for each order into a topic. Use a Kafka consumer running on Amazon EC2 instances to read these messages and invoke the Amazon SageMaker endpoint.
B) Create an Amazon Kinesis Data Streams stream and ingest the data for each order into the stream. Create an AWS Lambda function to read these messages and invoke the Amazon SageMaker endpoint.
C) Create an Amazon Kinesis Data Firehose delivery stream and ingest the data for each order into the stream. Configure Kinesis Data Firehose to deliver the data to an Amazon S3 bucket. Trigger an AWS Lambda function with an S3 event notification to read the data and invoke the Amazon SageMaker endpoint.
D) Create an Amazon SNS topic and publish the data for each order to the topic. Subscribe the Amazon SageMaker endpoint to the SNS topic.
ANSWER9:
Notes/Hint9:
Reference9: Amazon Managed Streaming for Kafka cluster
A) IoT Device Shadow Service
B) Redshift
C) Kinesis
D) IoT Rules Engine
ANSWER10:
Notes/Hint10:
Reference10: The IoT rules engine
Get mobile friendly version of the quiz @ the App Store
A) Deploy the existing Oracle Java Archive as a custom bootstrap action and run the job on the EMR cluster.
B) Compile the Java program for the desired Hadoop version and run it using a CUSTOM_JAR step on the EMR cluster.
C) Submit the Java program as an Apache Hive or Apache Spark step for the EMR cluster.
D) Use SSH to connect the master node of the EMR cluster and submit the Java program using the AWS CLI.
ANSWER11:
Notes/Hint11:
Reference11: Automating analytics workflows on EMR
A) AWS Kinesis
B) AWS Database Migration Service
C) AWS Data Pipeline
D) AWS RDS Aurora
ANSWER12:
Notes/Hint12:
Reference12: DMS
A) Copy the source data into Amazon Redshift and rewrite the Apache Spark code to create analytical reports by querying Amazon Redshift.
B) Copy the source data from Amazon S3 into Hadoop Distributed File System (HDFS) using s3distcp.
C) Load the data into Spark DataFrames.
D) Stream the data into Amazon Kinesis and use the Kinesis Connector Library (KCL) in multiple Spark jobs to perform analytical jobs.
E) Use Amazon S3 Select to retrieve the data necessary for the dashboards from the S3 objects.
ANSWER13:
Notes/Hint13:
Reference13: Spark DataFrames
A) VPC peering
B) Data Pipeline
C) Direct Connect
D) EMR
ANSWER14:
Notes/Hint14:
Reference14: Direct Connect
A) Installs the AWS Direct Connect client on all EC2 instances and uses it to stream the data directly to S3.
B) Leverages the Kinesis Agent to send data to Kinesis Data Streams and output that data in S3.
C) Ingests the data directly from S3 by configuring regular Amazon Snowball transactions.
D) Leverages the Kinesis Agent to send data to Kinesis Firehose and output that data in S3.
ANSWER15:
Notes/Hint15:
Reference15: Kinesis Firehose
A) Publish the raw social media data to an Amazon Kinesis Data Firehose delivery stream. Use Kinesis Data Analytics for SQL Applications to perform a sliding window analysis to compute the metrics and output the results to a Kinesis Data Streams data stream. Configure an AWS Lambda function to save the stream data to an Amazon DynamoDB table. Deploy a real-time dashboard hosted in an Amazon S3 bucket to read and display the metrics data stored in the DynamoDB table.
B) Publish the raw social media data to an Amazon Kinesis Data Firehose delivery stream. Configure the stream to deliver the data to an Amazon Elasticsearch Service cluster with a buffer interval of 0 seconds. Use Kibana to perform the analysis and display the results.
C) Publish the raw social media data to an Amazon Kinesis Data Streams data stream. Configure an AWS Lambda function to compute the metrics on the stream data and save the results in an Amazon S3 bucket. Configure a dashboard in Amazon QuickSight to query the data using Amazon Athena and display the results.
D) Publish the raw social media data to an Amazon SNS topic. Subscribe an Amazon SQS queue to the topic. Configure Amazon EC2 instances as workers to poll the queue, compute the metrics, and save the results to an Amazon Aurora MySQL database. Configure a dashboard in Amazon QuickSight to query the data in Aurora and display the results.
ANSWER16:
Notes/Hint16:
Reference16: Amazon Kinesis Data Analytics can query data in a Kinesis Data Firehose delivery stream in near-real time using SQL
A) Schedule an AWS Lambda function to drop and re-create the dataset daily.
B) Configure the visualization to query the data in Amazon S3 directly without loading the data into SPICE.
C) Schedule the dataset to refresh daily.
D) Close and open the Amazon QuickSight visualization.
ANSWER17:
Notes/Hint17:
Reference17: Amazon QuickSight and SPICE
A) Establish additional Direct Connect connections.
B) Use Data Pipeline to migrate the data in bulk to S3.
C) Use Kinesis Firehose to stream all new and existing data into S3.
D) Snowball
ANSWER18:
Notes/Hint18:
Reference18: Snowball
A) It is not possible to migrate an on-premises database to AWS at this time.
B) Use AWS Data Pipeline to create a target database, migrate the database schema, set up the data replication process, initiate the full load and a subsequent change data capture and apply, and conclude with a switchover of your production environment to the new database once the target database is caught up with the source database.
C) Use AWS Database Migration Services and create a target database, migrate the database schema, set up the data replication process, initiate the full load and a subsequent change data capture and apply, and conclude with a switch-over of your production environment to the new database once the target database is caught up with the source database.
D) Use AWS Glue to crawl the on-premises database schemas and then migrate them into AWS with Data Pipeline jobs.
https://aws.amazon.com/dms/
ANSWER19:
Notes/Hint19:
Reference19: DMS
A) Enable at-rest encryption for EMR File System (EMRFS) data in Amazon S3 in a security configuration. Re-create the cluster using the newly created security configuration.
B) Specify local disk encryption in a security configuration. Re-create the cluster using the newly created security configuration.
C) Detach the Amazon EBS volumes from the master node. Encrypt the EBS volume and attach it back to the master node.
D) Re-create the EMR cluster with LZO encryption enabled on all volumes.
ANSWER20:
Notes/Hint20:
Reference20: EMR Cluster Local disk encryption
What should be done to improve the performance of the Amazon Elasticsearch Service cluster?
A) Amazon Kinesis Data Firehose
B) Amazon QuickSight
C) Amazon Athena
D) AWS Storage Gateway
A) AWS Storage Gateway
B) AWS Schema Conversion Tool
C) AWS Database Migration Service
D) Amazon Kinesis Data Firehose
A) A fully managed ETL (extract, transform, and load) pipeline service
B) A service to schedule jobs
C) A visual data preparation tool
D) An index to the location, schema, and runtime metrics of your data
A) AWS Glue crawler
B) AWS Glue DataBrew
C) AWS Glue Studio
D) AWS Glue Elastic Views
Question 30: Your small start-uo company is developing a data analytics solution. You need to clean and normalize large datasets, but you do not have developers with the skill set to write custom scripts. Which tool will help efficiently design and run the data preparation activities?
A) Analyze data in real-time as data comes into the data lake
B) Transform data in real-time as data comes into the data lake
C) Analyze data in batches on schedule or on demand
D) Transform data in batches on schedule or on demand.
A) Amazon Athena Federated Query
B) Amazon Redshift Query Editor
C) SQl Workbench
D) AWS Glue DataBrew
A) Build data lakes quickly
B) Simplify security management
C) Provide self-service access to data
D) All of the above
A) Analyze Data stored in your data lake
B) Maintain performance at scale
C) Focus effort on Data warehouse administration
D) Store all the data to meet analytics need
E) Amazon Redshift includes enterprise-level security and compliance features.
2- Prepare for Your AWS Certification Exam
3- LinuxAcademy
[/bg_collapse]
Clever Questions, Answers, Resources about:
Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician. – Josh Wills
Data scientists apply sophisticated quantitative and computer science skills to both structure and analyze massive stores or continuous streams of unstructured data, with the intent to derive insights and prescribe action. – Burtch Works Data Science Salary Survey, May 2018
More than anything, what data scientists do is make discoveries while swimming in data… In a competitive landscape where challenges keep changing and data never stop flowing, data scientists help decision makers shift from ad hoc analysis to an ongoing conversation with data. – Data Scientist: The Sexiest Job of the 21st Century, Harvard Business Review
Data scientists are highly educated. With exceedingly rare exception, every data scientist holds at least an undergraduate degree. 91% of data scientists in 2018 held advanced degrees. The remaining 9% all held undergraduate degrees. Furthermore,
The remaining 17% of surveyed data scientists held degrees in business, social science, or economics.
Broadly speaking, the roles differ in scope: data analysts build reports with narrow, well-defined KPIs. Data scientists often to work on broader business problems without clear solutions. Data scientists live on the edge of the known and unknown.
We’ll leave you with a concrete example: A data analyst cares about profit margins. A data scientist at the same company cares about market share.
Data science in healthcare best translates to biostatistics. It can be quite different from data science in other industries as it usually focuses on small samples with several confounding variables.
How Is Data Science Used in Manufacturing?
Data science in manufacturing is vast; it includes everything from supply chain optimization to the assembly line.
Most people are attracted to data science for the salary. It’s true that data scientists garner high salaries compares to their peers. There is data to support this: The May 2018 edition of the BurtchWorks Data Science Salary Survey, annual salary statistics were
Note the above numbers do not reflect total compensation which often includes standard benefits and may include company ownership at high levels.
How will data science evolve in the next 5 years?
Will AI replace data scientists?
It’s common for data scientists across the US to work 40 hours weekly. While company culture does dictate different levels of work life balance, it’s rare to see data scientists who work more than they want. That’s the virtue of being an expensive resource in a competitive job market.
The roadmap given to aspiring data scientists can be boiled down to three steps:
All three require a significant time and financial commitment.
There used to be a saying around datascience: The road into a data science starts with two years of university-level math.
This answer assumes your academic background ends with a HS diploma in the US.
Some follow up questions and answers:
Why Python first?
When do I start working with data?
How long will this take me?
If you don’t know the first thing about programming, start with MIT’s course in the curated list.
These modules are the standard tools for data analysis in Python:
pandas
(and by extension, numpy
)Check out Minimally Sufficient Pandas for style guides and best practices.matplotlib
and seaborn
See /u/rhiever’s response to How do you decide between the plotting libraries: Matplotlib, Seaborn, Bokeh?Don’t worry about bokeh or dash unless you have a personal interest in interactive visualizations.scipy
and scikit-learn
Internalize the .fit()
and .predict()
pattern.Curated Threads & Resources
If you don’t know the first thing about programming, start with R for Data Science in the curated list.
These modules are the standard tools for data analysis in Python:
Curated Threads & Resources
Prioritize the basics of SQL. i.e. when to use functions like POW
, SUM
, RANK
; the computational complexity of the different kinds of joins.
Concepts like relational algebra, when to use clustered/non-clustered indexes, etc. are useful, but (almost) never come up in interviews.
You absolutely do not need to understand administrative concepts like managing permissions.
Finally, there are numerous query engines and therefore numerous dialects of SQL. Use whichever dialect is supported in your chosen resource. There’s not much difference between them, so it’s easy to learn another dialect after you’ve learned one.
Fortunately (or unfortunately), calculus is the lament of many students, and so resources for it are plentiful. Khan Academy mimics lectures very well, and Paul’s Online Math Notes are a terrific reference full of practice problems and solutions.
Calculus, however, is not just calculus. For those unfamiliar with US terminology,
Differential and integral calculus are both necessary for probability and statistics, and should be completed first.
Multivariable calculus can be paired with linear algebra, but is also required.
Differential equations is where consensus falls apart. The short it is, they’re all but necessary for mathematical modeling, but not everyone does mathematical modeling. It’s another tool in the toolbox.
Probability is not friendly to beginners. Definitions are rooted in higher mathematics, notation varies from source to source, and solutions are frequently unintuitive. Probability may present the biggest barrier to entry in data science.
It’s best to pick a single primary source and a community for help. If you can spend the money, register for a university or community college course and attend in person.
The best free resource is MIT’s 18.05 Introduction to Probability and Statistics (Spring 2014). Leverage /r/learnmath, /r/learnmachinelearning, and /r/AskStatistics when you get inevitably stuck.
Curated Threads & Resources https://www.youtube.com/watch?v=fNk_zzaMoSs&index=1&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab
What does the typical data science interview process look like?
For general advice, Mastering the DS Interview Loop is a terrific article. The community discussed the article here.
Briefly summarized, most companies follow a five stage process:
Preparation:
Tips:
Remember, interviewing is a skill that can be learned, just like anything else. Hopefully, this article has given you some insight on what to expect in a data science interview loop.
The process also isn’t perfect and there will be times that you fail to impress an interviewer because you don’t possess some obscure piece of knowledge. However, with repeated persistence and adequate preparation, you’ll be able to land a data science job in no time!
What does the Airbnb data science interview process look like? [Coming soon]
What does the Facebook data science interview process look like? [Coming soon]
What does the Uber data science interview process look like? [Coming soon]
What does the Microsoft data science interview process look like? [Coming soon]
What does the Google data science interview process look like? [Coming soon]
What does the Netflix data science interview process look like? [Coming soon]
What does the Apple data science interview process look like? [Coming soon]
Question: How is SQL used in real data science jobs?
Real life enterprise databases are orders of magnitude more complex than the “customers, products, orders” examples used as teaching tools. SQL as a language is actually, IMO, a relatively simple language (the db administration component can get complex, but mostly data scientists aren’t doing that anyways). SQL is an incredibly important skill though for any DS role. I think when people emphasize SQL, what they really are talking about is the ability to write queries that interrogate the data and discover the nuances behind how it is collected and/or manipulated by an application before it is written to the dB. For example, is the employee’s phone number their current phone number or does the database store a history of all previous phone numbers? Critically important questions for understanding the nature of your data, and it doesn’t necessarily deal with statistics! The level of syntax required to do this is not that sophisticated, you can get pretty damn far with knowledge of all the joins, group by/analytical functions, filtering and nesting queries. In many cases, the data is too large to just select * and dump into a csv to load into pandas, so you start with SQL against the source. In my mind it’s more important for “SQL skills” to know how to generate hypotheses (that will build up to answering your business question) that can be investigated via a query than it is to be a master of SQL’s syntax. Just my two cents though!
AWS DAS-C01 Exam Prep on android
AWS DAS-C01 Exam Prep on Windows
Human population density estimates based on the Hyde 3.2 model.
Source: 24/7 Kevin Rooke, Google Search, SEC Edgar
Data sources: Coindesk-Bitcoin ; Coindesk-Dodgecoin
Data source: Performance data on these cryptocurrencies from Investing.com which provides free historic data
Data Source: Here
AWS Data Analytics Specialty Certification DAS-C01 Exam Prep on iOS
Data Source: Made in Google Sheets using data from this USA Today article (for the number of arrests by arrestee’s home state) and this spreadsheet of the results of the 2020 Census (for the population of each state and DC in 2020, which was used as the denominator in calculating arrests/million people).
AWS Data Analytics Specialty Certification DAS-C01 Exam Prep on iOS
AWS DAS-C01 Exam Prep on android
AWS DAS-C01 Exam Prep on Windows
AWS DAS-C01 Exam Prep on android
AWS DAS-C01 Exam Prep on Windows
• Informed decision making
• Consolidated data from many sources
• Historical data analysis
• Data quality, consistency, and accuracy
• Separation of analytics processing from transactional databases
AWS Data Analytics Specialty Certification DAS-C01 Exam Prep on iOS
AWS DAS-C01 Exam Prep on android
AWS DAS-C01 Exam Prep on Windows
A data warehouse is specially designed for data analytics, which identifies relationships and trends across large amounts of data. A database is used to capture and store data, such as the details of a transaction. Unlike a data warehouse, a data lake is a centralized repository for structured, semi-structured, and unstructured data. A data warehouse organizes data in a tabular format (or schema) that enables SQL queries on the data. But not all applications require data to be in tabular format. Some applications can access data in the data lake even if it is “semi-structured” or unstructured. These include big data analytics, full-text search, and machine learning.
An AWS data lake only has a storage charge for the data. No servers are necessary for the data to be stored and accessed. In the case of Amazon Athena, also, there are no additional charges for processing. Data warehouse enable fast queries of structured data from transactional systems for batch reports, business intelligence, and visualization use cases. A data lake stores data without regard to its structure. Data scientists, data analysts, and business analysts use the data lake. They support use cases such as machine learning, predictive analytics, and data discovery and profiling.
AWS Data Analytics Specialty Certification DAS-C01 Exam Prep on iOS
AWS DAS-C01 Exam Prep on android
AWS DAS-C01 Exam Prep on Windows
Data definition language (DDL) refers to the subset of SQL commands that define data structures and objects such as databases, tables, and views. DDL commands include the following:
• CREATE: used to create a new object.
• DROP: used to delete an object.
• ALTER: used to modify an object.
• RENAME: used to rename an object.
• TRUNCATE: used to remove all rows from a table without deleting the table itself.
AWS Data Analytics Specialty Certification DAS-C01 Exam Prep on iOS
AWS DAS-C01 Exam Prep on android
AWS DAS-C01 Exam Prep on Windows
Data manipulation language (DML) refers to the subset of SQL commands that are used to work with data. DML commands include the following:
• SELECT: used to request records from one or more tables.
• INSERT: used to insert one or more records into a table.
• UPDATE: used to modify the data of one or more records in a table.
• DELETE: used to delete one or more records from a table.
• EXPLAIN: used to analyze and display the expected execution plan of a SQL statement.
• LOCK: used to lock a table from write operations (INSERT, UPDATE, DELETE) and prevent concurrent operations from conflicting with one another.
Data control language (DCL) refers to the subset of SQL commands that are used to configure permissions to objects. DCL commands include:
• GRANT: used to grant access and permissions to a database or object in a database, such as a schema or table.
• REVOKE: used to remove access and permissions from a database or objects in a database.
AWS Data Analytics Specialty Certification DAS-C01 Exam Prep on iOS
AWS DAS-C01 Exam Prep on android
AWS DAS-C01 Exam Prep on Windows
AWS Data Analytics Specialty Certification DAS-C01 Exam Prep on iOS
AWS DAS-C01 Exam Prep on android
AWS DAS-C01 Exam Prep on Windows
Businesses are responsible to identify and limit disclosure of sensitive data such as personally identifiable information (PII) or proprietary information. Identifying and masking sensitive information is time consuming, and becomes more complex in data lakes with various data sources and formats and broad user access to published data sets.
Amazon Macie is a fully managed data security and privacy service that uses machine learning and pattern matching to discover sensitive data in AWS. Macie includes a set of managed data identifiers which automatically detect common types of sensitive data. Examples of managed data identifiers include keywords, credentials, financial information, health information, and PII. You can also configure custom data identifiers using keywords or regular expressions to highlight organizational proprietary data, intellectual property, and other specific scenarios. You can develop security controls that operate at scale to monitor and remediate risk automatically when Macie detects sensitive data. You can use AWS Lambda functions to automatically turn on encryption for an Amazon S3 bucket where Macie detects sensitive data. Or automatically tag datasets containing sensitive data, for inclusion in orchestrated data transformations or audit reports.
Amazon Macie can be integrated into the data ingestion and processing steps of your data pipeline. This approach avoids inadvertent disclosures in published data sets by detecting and addressing the sensitive data as it is ingested and processed. Building the automated detection and processing of sensitive data into your ETL pipelines simplifies and standardizes handling of sensitive data at scale.
AWS Data Analytics Specialty Certification DAS-C01 Exam Prep on iOS
AWS DAS-C01 Exam Prep on android
AWS DAS-C01 Exam Prep on Windows
AWS Glue DataBrew is a visual data preparation tool that simplifies cleaning and normalizing datasets in preparation for use in analytics and machine learning.
• Profile data quality, identifying patterns and automatically detecting anomalies.
• Clean and normalize data using over 250 pre-built transformations, without writing code.
• Visually map the lineage of your data to understand data sources and transformation history.
• Save data cleaning and normalization workflows for automatic application to new data.
Data processed in AWS Glue DataBrew is immediately available for use in analytics and machine learning projects.
Learn more about the built-in transformations available in AWS Glue DataBrew in the Recipe actions reference: https://docs.aws.amazon.com/databrew/latest/dg/recipe-actions-reference.html
AWS Data Analytics Specialty Certification DAS-C01 Exam Prep on iOS
AWS DAS-C01 Exam Prep on android
AWS DAS-C01 Exam Prep on Windows
AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores and data streams. AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue can run your ETL jobs as new data arrives. For example, you can use an AWS Lambda function to trigger your ETL jobs to run as soon as new data becomes available in Amazon S3. You can also register this new dataset in the
AWS Glue Data Catalog as part of your ETL jobs.
AWS Glue is serverless, so there’s no infrastructure to set up or manage.
AWS Data Analytics Specialty Certification DAS-C01 Exam Prep on iOS
AWS DAS-C01 Exam Prep on android
AWS DAS-C01 Exam Prep on Windows
AWS Glue Data Catalog The AWS Glue Data Catalog provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos, and use that metadata to query and transform the data. Once the data is cataloged, it is immediately available for search and query using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
You can use AWS Identity and Access Management (IAM) policies to control access to the data sources managed by the AWS Glue Data Catalog. The Data Catalog also provides comprehensive audit and governance capabilities, with schema-change tracking and data access controls.
AWS Glue crawler
AWS Glue crawlers can scan data in all kinds of repositories, classify it, extract schema information from it, and store the metadata automatically in the AWS Glue Data Catalog.
AWS Glue ETL
AWS Glue can run your ETL jobs as new data arrives. For example, you can use an AWS Lambda function to trigger your ETL jobs to run as soon as new data becomes available in Amazon S3. You can also register this new dataset in the AWS Glue Data Catalog as part of your ETL jobs.
AWS Glue Studio
AWS Glue Studio provides a graphical interface to create, run, and monitor extract, transform, and load (ETL) jobs in AWS Glue. You can visually compose data transformation workflows and seamlessly run them on AWS Glue’s Apache Spark-based serverless ETL engine. AWS Glue Studio also offers tools to monitor ETL workflows and validate that they are operating as intended.
AWS Data Analytics Specialty Certification DAS-C01 Exam Prep on iOS
AWS DAS-C01 Exam Prep on android
AWS DAS-C01 Exam Prep on Windows
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and you can start analyzing data immediately. You don’t even need to load your data into Athena, it works directly with data stored in S3. To get started, just log into the Amazon Athena console, define your schema, and start querying. Athena uses Presto with full standard SQL support. It works with a variety of standard data formats, including CSV, JSON, ORC, Apache Parquet and Avro. While Athena is ideal for quick, ad-hoc querying, it can also handle complex analysis, including large joins, window functions, and arrays.
Amazon Athena helps you analyze data stored in Amazon S3. You can use Athena to run ad-hoc queries using ANSI SQL, without the need to aggregate or load the data into Athena. It can process unstructured, semi-structured, and structured datasets. Examples include CSV, JSON, Avro or columnar data formats such as Apache Parquet and Apache ORC. Athena integrates with Amazon QuickSight for easy visualization. You can also use Athena to generate reports or to explore data with business intelligence tools or SQL clients, connected via an ODBC or JDBC driver.
The tables and databases that you work with in Athena to run queries are based on metadata. Metadata is data about the underlying data in your dataset. How that metadata describes your dataset is called the schema. For example, a table name, the column names in the table, and the data type of each column are schema, saved as metadata, that describe an underlying dataset. In Athena, we call a system for organizing metadata a data catalog or a metastore. The combination of a dataset and the data catalog that describes it is called a data source.
The relationship of metadata to an underlying dataset depends on the type of data source that you work with. Relational data sources like MySQL, PostgreSQL, and SQL Server tightly integrate the metadata with the dataset. In these systems, the metadata is most often written when the data is written. Other data sources, like those built using Hive, allow you to define metadata on-the-fly when you read the dataset. The dataset can be in a variety of formats; for example, CSV, JSON, Parquet, or Avro.
AWS Data Analytics Specialty Certification DAS-C01 Exam Prep on iOS
AWS DAS-C01 Exam Prep on android
AWS DAS-C01 Exam Prep on Windows
Lake Formation is a fully managed service that enables data engineers, security officers, and data analysts to build, secure, manage, and use your data lake
To build your data lake in AWS Lake Formation, you must register an Amazon S3 location as a data lake. The Lake Formation service must have permission to write to the AWS Glue Data Catalog and to Amazon S3 locations in the data lake.
Next, identify the data sources to be ingested. AWS Lake formation can move data into your data lake from existing Amazon S3 data stores. Lake Formation can collect and organize datasets, such as logs from AWS CloudTrail, AWS CloudFront, detailed billing reports, or Elastic Load Balancing. You can ingest bulk or incremental datasets from relational, NoSQL, or non-relational databases. Lake Formation can ingest data from databases running in Amazon RDS or hosted in Amazon EC2. You can also ingest data from on-premises databases using Java Database Connectivity JDBC connectors. You can use custom AWS Glue jobs to load data from other databases or to ingest streaming data using Amazon Kinesis or Amazon DynamoDB.
AWS Lake Formation manages AWS Glue crawlers, AWS Glue ETL jobs, the AWS Glue Data Catalog, security settings, and access control:
• Lake Formation is an automated build environment based on AWS Glue.
• Lake Formation coordinates AWS Glue crawlers to identify datasets within the specified data stores and collect metadata for each dataset
• Lake Formation can perform transformations on your data, such as rewriting and organizing data into a consistent, analytics-friendly format. Lake Formation creates transformation templates and schedules AWS Glue jobs to prepare and optimize your data for analytics. Lake Formation also helps clean your data using FindMatches, an ML-based deduplication transform. AWS Glue jobs encapsulate scripts, such as ETL scripts, which connect to source data, process it, and write it out to a data target. AWS Glue triggers can start jobs based on a schedule or event, or on demand. AWS Glue workflows orchestrate AWS ETL jobs, crawlers, and triggers. You can define a workflow manually or use a blueprint based on commonly ingested data source types.
• The AWS Glue Data Catalog within the data lake persistently stores the metadata from raw and processed datasets. Metadata about data sources and targets is in the form of databases and tables. Tables store information about the underlying data, including schema information, partition information, and data location. Databases are collections of tables. Each AWS account has one data catalog per AWS Region.
• Lake Formation provides centralized access controls for your data lake, including security policy-based rules for users and applications by role. You can authenticate the users and roles using AWS IAM. Once the rules are defined, Lake Formation enforces them with table-and column-level granularity for users of Amazon Redshift Spectrum and Amazon Athena. Rules are enforced at the table-level in AWS Glue, which is normally accessed for administrators.
• Lake Formation leverages the encryption capabilities of Amazon S3 for data in the data lake. This approach provides automatic server-side encryption with keys managed by the AWS Key Management Service (KMS). S3 encrypts data in transit when replicating across Regions. You can separate accounts for source and destination Regions to further protect your data
AWS Data Analytics Specialty Certification DAS-C01 Exam Prep on iOS
AWS DAS-C01 Exam Prep on android
AWS DAS-C01 Exam Prep on Windows
Amazon QuickSight is a cloud-scale business intelligence (BI) service. In a single data dashboard, QuickSight gives decision-makers the opportunity to explore and interpret information in an interactive visual environment. QuickSight can include AWS data, third-party data, big data, spreadsheet data, SaaS data, B2B data, and more. QuickSight delivers fast and responsive query performance by using a robust in-memory engine (SPICE).
Scale from tens to tens of thousands of users
Amazon QuickSight has a serverless architecture that automatically scales to tens of thousands of users without the need to setup, configure, or manage your own servers.
Embed BI dashboards in your applications
With QuickSight, you can quickly embed interactive dashboards into your applications, websites, and portals.
Access deeper insights with Machine Learning
QuickSight leverages the proven machine learning (ML) capabilities of AWS. BI teams can perform advanced analytics without prior data science experience.
Ask questions of your data, receive answers
With QuickSight, you can quickly get answers to business questions asked in natural language with QuickSight’s new ML-powered natural language query capability, Q.
AWS Data Analytics Specialty Certification DAS-C01 Exam Prep on iOS
AWS DAS-C01 Exam Prep on android
AWS DAS-C01 Exam Prep on Windows
SPICE is the Super-fast, Parallel, In-memory Calculation Engine in QuickSight. SPICE is engineered to rapidly perform advanced calculations and serve data. The storage and processing capacity available in SPICE speeds up the analytical queries that you run against your imported data. By using SPICE, you save time because you don’t need to retrieve the data every time that you change an analysis or update a visual.
When you import data into a dataset rather than using a direct SQL query, it becomes SPICE data because of how it’s stored. SPICE is the Amazon QuickSight Super-fast, Parallel, In-memory Calculation Engine. It’s engineered to rapidly perform advanced calculations and serve data. In Enterprise edition, data stored in SPICE is encrypted at rest.
When you create or edit a dataset, you choose to use either SPICE or a direct query, unless the dataset contains uploaded files. Importing (also called ingesting) your data into SPICE can save time and money:
• Your analytical queries process faster.
• You don’t need to wait for a direct query to process.
• Data stored in SPICE can be reused multiple times without incurring additional costs. If you use a data source that charges per query, you’re charged for querying the data when you first create the dataset and later when you refresh the dataset.
AWS Data Analytics Specialty Certification DAS-C01 Exam Prep on iOS
AWS DAS-C01 Exam Prep on android
AWS DAS-C01 Exam Prep on Windows
You can use AWS services as building blocks to build serverless data lakes and analytics pipelines. You can apply best practices on how to ingest, store, transform, and analyze structured and unstructured data at scale. Achieve the scale without needing to manage any storage or compute infrastructure. A decoupled, component-driven architecture allows you to start small and scale out slowly. You can quickly add new purpose-built components to one of six architecture layers to address new requirements and data sources.
This data lake-centric architecture can support business intelligence (BI) dashboarding, interactive SQL queries, big data processing, predictive analytics, and machine learning use cases.
• The ingestion layer includes protocols to support ingestion of structured, unstructured, or streaming data from a variety of sources.
• The storage layer provides durable, scalable, secure, and cost-effective storage of datasets across ingestion and processing.
• The landing zone stores data as ingested.
• Data engineers run initial quality checks to validate and cleanse data in the landing zone, producing the raw dataset.
• The processing layer creates curated datasets by further cleansing, normalizing, standardizing, and enriching data from the raw zone. The curated dataset is typically stored in formats that support performant and cost-effective access by the consumption layer.
• The catalog layer stores business and technical metadata about the datasets hosted in the storage layer.
• The consumption layer contains functionality for Search, Analytics, and Visualization. It integrates with the data lake storage, cataloging, and security layers. This integration supports analysis methods such as SQL, batch analytics, BI dashboards, reporting, and ML.
• The security and monitoring layer protects data within the storage layer and other resources in the data lake. This layer includes access control, encryption, network protection, usage monitoring, and auditing.
You can learn more about this reference architecture at AWS Big Data Blog: AWS serverless data analytics pipeline reference architecture: https://aws.amazon.com/blogs/big-data/aws-serverless-data-analytics-pipeline-reference-architecture/
The main challenge with a data lake architecture is that raw data is stored with no oversight of the contents. To make the data usable, you must have defined mechanisms to catalog and secure the data. Without these mechanisms, data cannot be found or trusted, resulting in a “data swamp.” Meeting the needs of diverse stakeholders requires data lakes to have governance, semantic consistency, and access controls.
The Analytics Lens for the AWS Well-Architected Framework covers common analytics applications scenarios, including data lakes. It identifies key elements to help you architect your data lake according to best practices, including the following configuration notes:
• Decide on a location for data lake ingestion (that is, an S3 bucket). Select a frequency and isolation mechanism that meets your business needs.
• For Tier 2 Data, partition the data with keys that align to common query filter
. This enables pruning by common analytics tools that work on raw data files and increases performance
• Choose optimal file sizes to reduce Amazon S3 round trips during compute environment ingestion. Recommended: 512 MB – 1 GB in a columnar format (ORC/Parquet) per partition.
• Perform frequent scheduled compactions that align to the optimal file sizes noted previously. For example, compact into daily partitions if hourly files are too small.
• For data with frequent updates or deletes (that is, mutable data), either: o Temporarily store replicated data to a database like Amazon Redshift, Apache Hive, or Amazon RDS. Once the data becomes static, and then offload it to Amazon S3. Or, o Append the data to delta files per partition and compact it on a scheduled basis. You can use AWS Glue or Apache Spark on Amazon EMR for this processing.
With Tier 2 and Tier 3 Data being stored in Amazon S3, partition data using a high cardinality key. This is honored by Presto, Apache Hive, and Apache Spark and improves the query filter performance on that key
• Sort data in each partition with a secondary key that aligns to common filter queries. This allows query engines to skip files and get to requested data faster. For more information on the Analytics Lens for the AWS Well-Architected Framework, visit https://docs.aws.amazon.com/wellarchitected/latest/analytics-lens/data-lake.html
For additional information on AWS data lakes and data analytics architectures, visit:
• AWS Well-Architected: Learn, measure, and build using architectural best practices: https://aws.amazon.com/architecture/well-architected
• AWS Lake Formation: Build a secure data lake in days: https://aws.amazon.com/lake-formation
• Getting Started with Amazon S3: https://aws.amazon.com/s3/getting-started
• Security in AWS Lake Formation: https://docs.aws.amazon.com/lake-formation/latest/dg/security.html
AWS Lake Formation: How It Works: https://docs.aws.amazon.com/lake-formation/latest/dg/how-it-works.html
• AWS Lake Formation Dashboard: https://us-west-2.console.aws.amazon.com/lakeformation
• Data Lake Storage on AWS: https://aws.amazon.com/products/storage/data-lake-storage/
• Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility: https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/building-data-lake-aws.html
• Data Ingestion Methods: https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/data-ingestion-methods.html
Offering employees, coworkers, teammates, and students constructive feedback is a vital part of growth on…
Millennials should avoid delaying the inevitable and look into various retirement investment pathways. Here’s why…
For most people, a satisfactory career is essential for leading a happy life. However, ensuring…
The pipeline industry is more than pipework and construction, and we explore those details in…
SQL Interview Questions and Answers In the world of data-driven decision-making, SQL (Structured Query Language)…