Big Data and Data Analytics 101 – Top 20 AWS Certified Data Analytics – Specialty Questions and Answers Dumps

In this blog, we discuss big data and data analytics, and we share the latest top 20 AWS Certified Data Analytics – Specialty questions and answers.

The AWS Certified Data Analytics – Specialty (DAS-C01) examination is intended for individuals who perform in a data analytics-focused role. This exam validates an examinee’s comprehensive understanding of using AWS services to design, build, secure, and maintain analytics solutions that provide insight from data.


The AWS Certified Data Analytics – Specialty (DAS-C01) covers the following domains:

Domain 1: Collection 18%

Domain 2: Storage and Data Management 22%

Domain 3: Processing 24%

Domain 4: Analysis and Visualization 18%

Domain 5: Security 18%


Below are the Top 20 AWS Certified Data Analytics – Specialty Questions and Answers Dumps and References

Question 1: Which combination of services meets the following requirements: accelerating petabyte-scale data transfers, loading streaming data, and creating scalable, private connections? Select the answer that lists the services in the correct order.

A) Snowball, Kinesis Firehose, Direct Connect

B) Data Migration Services, Kinesis Firehose, Direct Connect

C) Snowball, Data Migration Services, Direct Connect

D) Snowball, Direct Connection, Kinesis Firehose

ANSWER1:

A

Notes/Hint1:

AWS has many options to help get data into the cloud, including secure devices like AWS Import/Export Snowball to accelerate petabyte-scale data transfers, Amazon Kinesis Firehose to load streaming data, and scalable private connections through AWS Direct Connect.

Reference1: Big Data Analytics Options 

Get mobile friendly version of the quiz @ the App Store

ANSWER2:

C

Notes/Hint2:

Reference2: Relationalize PySpark

Question 3: There is a five-day car rally race across Europe. The race coordinators are using a Kinesis stream and IoT sensors to monitor the movement of the cars. Each car has a sensor and data is getting back to the stream with the default stream settings. On the last day of the rally, data is sent to S3. When you go to interpret the data in S3, there is only data for the last day and nothing for the first 4 days. Which of the following is the most probable cause of this?

A) You did not have versioning enabled and would need to create individual buckets to prevent the data from being overwritten.

B) Data records are only accessible for a default of 24 hours from the time they are added to a stream.

C) One of the sensors failed, so there was no data to record.

D) You needed to use EMR to send the data to S3; Kinesis Streams are only compatible with DynamoDB.

ANSWER3:

B

Notes/Hint3: 

Streams support changes to the data record retention period of your stream. An Amazon Kinesis stream is an ordered sequence of data records, meant to be written to and read from in real-time. Data records are therefore stored in shards in your stream temporarily. The period from when a record is added to when it is no longer accessible is called the retention period. An Amazon Kinesis stream stores records for 24 hours by default, up to 168 hours.
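
The five-day timeline in this question can be sketched in a few lines of Python (a toy helper, not exam material): with the default 24-hour retention, a record written on day 1 has already expired by day 5, while the 168-hour extended retention would have kept it.

```python
from datetime import datetime, timedelta

# Toy helper (not exam material): is a record written at `arrival_time`
# still readable at `now`, given the stream's retention period in hours?
# Kinesis Data Streams defaults to 24 hours, extendable to 168 hours
# (7 days) at the time this question was written.
def is_record_available(arrival_time, now, retention_hours=24):
    return now - arrival_time <= timedelta(hours=retention_hours)

day1 = datetime(2021, 6, 1, 9, 0)   # a coordinate from rally day 1
day5 = datetime(2021, 6, 5, 9, 0)   # read attempt on the last day

print(is_record_available(day1, day5))                       # False: expired
print(is_record_available(day1, day5, retention_hours=168))  # True: within 7 days
```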

Reference3: Kinesis Extended Reading

Question 4:  A publisher website captures user activity and sends clickstream data to Amazon Kinesis Data Streams. The publisher wants to design a cost-effective solution to process the data to create a timeline of user activity within a session. The solution must be able to scale depending on the number of active sessions.
Which solution meets these requirements?

A) Include a variable in the clickstream data from the publisher website to maintain a counter for the number of active user sessions. Use a timestamp for the partition key for the stream. Configure the consumer application to read the data from the stream and change the number of processor threads based upon the counter. Deploy the consumer application on Amazon EC2 instances in an EC2 Auto Scaling group.

B) Include a variable in the clickstream to maintain a counter for each user action during their session. Use the action type as the partition key for the stream. Use the Kinesis Client Library (KCL) in the consumer application to retrieve the data from the stream and perform the processing. Configure the consumer application to read the data from the stream and change the number of processor threads based upon the counter. Deploy the consumer application on AWS Lambda.

C) Include a session identifier in the clickstream data from the publisher website and use it as the partition key for the stream. Use the Kinesis Client Library (KCL) in the consumer application to retrieve the data from the stream and perform the processing. Deploy the consumer application on Amazon EC2 instances in an EC2 Auto Scaling group. Use an AWS Lambda function to reshard the stream based upon Amazon CloudWatch alarms.

D) Include a variable in the clickstream data from the publisher website to maintain a counter for the number of active user sessions. Use a timestamp for the partition key for the stream. Configure the consumer application to read the data from the stream and change the number of processor threads based upon the counter. Deploy the consumer application on AWS Lambda.

ANSWER4:

C

Notes/Hint4: 

Partitioning by the session ID will allow a single processor to process all the actions for a user session in order. An AWS Lambda function can call the UpdateShardCount API action to change the number of shards in the stream. The KCL will automatically manage the number of processors to match the number of shards. Amazon EC2 Auto Scaling will ensure that the correct number of instances is running to meet the processing load.
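
As a rough sketch of the resharding decision (the function, the 1 MB/s per-shard write limit applied here, and the 20% headroom are illustrative assumptions, not part of the question), a Lambda handler could derive a target shard count from the observed ingest rate:

```python
import math

# Illustrative sketch only: the helper name and headroom are assumptions.
# Each Kinesis shard accepts up to 1 MB/s of writes, so a Lambda triggered
# by a CloudWatch alarm could derive a target shard count from the
# observed ingest rate and pass it to the UpdateShardCount API.
def target_shard_count(ingest_mb_per_s, mb_per_shard=1.0, headroom=1.2):
    # Round up and keep ~20% headroom so short bursts do not throttle writers.
    return max(1, math.ceil(ingest_mb_per_s * headroom / mb_per_shard))

print(target_shard_count(0.4))  # 1  (light traffic still needs one shard)
print(target_shard_count(8.0))  # 10 (9.6 MB/s with headroom, rounded up)
```

The result is what the function would pass as `TargetShardCount` to the UpdateShardCount API (in boto3, `kinesis.update_shard_count(StreamName=..., TargetShardCount=..., ScalingType='UNIFORM_SCALING')`).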

Reference4: UpdateShardCount API

Question 5: Your company has two batch processing applications that consume financial data about the day’s stock transactions. Each transaction needs to be stored durably and guarantee that a record of each application is delivered so the audit and billing batch processing applications can process the data. However, the two applications run separately and several hours apart and need access to the same transaction information. After reviewing the transaction information for the day, the information no longer needs to be stored. What is the best way to architect this application?

A) Use SQS for storing the transaction messages; when the billing batch process performs first and consumes the message, write the code in a way that does not remove the message after consumed, so it is available for the audit application several hours later. The audit application can consume the SQS message and remove it from the queue when completed.

B)  Use Kinesis to store the transaction information. The billing application will consume data from the stream and the audit application can consume the same data several hours later.

C) Store the transaction information in a DynamoDB table. The billing application can read the rows while the audit application will read the rows then remove the data.

D) Use SQS for storing the transaction messages. When the billing batch process consumes each message, have the application create an identical message and place it in a different SQS for the audit application to use several hours later.

ANSWER5:

B

Notes/Hint5:

Kinesis is the best solution here: multiple consumers can read the same records independently during the stream's retention period, so the billing application and the audit application can each process the day's transactions several hours apart. SQS would make this more difficult because a message is normally consumed and deleted by a single consumer, so the two applications could not easily read the same transaction data.

Reference5: Amazon Kinesis

Question 6: A company is currently using Amazon DynamoDB as the database for a user support application. The company is developing a new version of the application that will store a PDF file for each support case ranging in size from 1–10 MB. The file should be retrievable whenever the case is accessed in the application.
How can the company store the file in the MOST cost-effective manner?

A) Store the file in Amazon DocumentDB and the document ID as an attribute in the DynamoDB table.

B) Store the file in Amazon S3 and the object key as an attribute in the DynamoDB table.

C) Split the file into smaller parts and store the parts as multiple items in a separate DynamoDB table.

D) Store the file as an attribute in the DynamoDB table using Base64 encoding.

ANSWER6:

B

Notes/Hint6: 

Use Amazon S3 to store large attribute values that cannot fit in an Amazon DynamoDB item. Store each file as an object in Amazon S3 and then store the object path in the DynamoDB item.
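
The pattern in this hint can be sketched with plain dictionaries standing in for the two services (all names here are invented for illustration):

```python
# Sketch of the S3-pointer pattern from the hint: keep the large PDF in S3
# and store only its object key in the DynamoDB item. Dicts stand in for
# the S3 bucket and the DynamoDB table; all names are made up.
pdf_bytes = b"%PDF-1.7 ... case evidence ..."

s3_bucket = {}          # stands in for an S3 bucket
dynamodb_table = {}     # stands in for a DynamoDB table keyed by case_id

def save_case(case_id, pdf):
    key = f"support-cases/{case_id}.pdf"
    s3_bucket[key] = pdf                                  # large object -> S3
    dynamodb_table[case_id] = {"case_id": case_id,
                               "pdf_key": key}            # small pointer -> DynamoDB

def load_case_pdf(case_id):
    return s3_bucket[dynamodb_table[case_id]["pdf_key"]]  # follow the pointer

save_case("case-42", pdf_bytes)
print(load_case_pdf("case-42") == pdf_bytes)  # True
```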

Reference6: S3 Storage Cost – DynamoDB Storage Cost

Question 7: Your client has a web app that emits multiple events to Amazon Kinesis Streams for reporting purposes. Critical events need to be immediately captured before processing can continue, but informational events do not need to delay processing. What solution should your client use to record these types of events without unnecessarily slowing the application?

A) Log all events using the Kinesis Producer Library.

B) Log critical events using the Kinesis Producer Library, and log informational events using the PutRecords API method.

C) Log critical events using the PutRecords API method, and log informational events using the Kinesis Producer Library.

D) Log all events using the PutRecords API method.

ANSWER7:

C

Notes/Hint7: 

The PutRecords API can be used in code to be synchronous; it will wait for the API request to complete before the application continues. This means you can use it when you need to wait for the critical events to finish logging before continuing. The Kinesis Producer Library is asynchronous and can send many messages without needing to slow down your application. This makes the KPL ideal for the sending of many non-critical alerts asynchronously.
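
A stub illustrates the difference (the classes are invented for this sketch; the real KPL and PutRecords APIs behave analogously but are more involved): critical events are written synchronously, while informational events are buffered and flushed in batches, KPL-style.

```python
# Stub illustration: critical events go out synchronously (the caller
# waits), informational events are buffered and sent later in batches.
class StubKinesisClient:
    def __init__(self):
        self.records = []
    def put_records(self, batch):          # stands in for the PutRecords API
        self.records.extend(batch)

class Producer:
    def __init__(self, client, batch_size=3):
        self.client, self.batch_size, self.buffer = client, batch_size, []
    def log_critical(self, event):
        self.client.put_records([event])   # synchronous: delivered before returning
    def log_info(self, event):
        self.buffer.append(event)          # asynchronous: only buffered here
        if len(self.buffer) >= self.batch_size:
            self.flush()
    def flush(self):
        self.client.put_records(self.buffer)
        self.buffer = []

client = StubKinesisClient()
p = Producer(client)
p.log_critical("payment-failed")     # in the stream immediately
p.log_info("page-view")              # still sitting in the buffer
print(client.records)                # ['payment-failed']
```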

Reference7: PutRecords API

Question 8: You work for a start-up that tracks commercial delivery trucks via GPS. You receive coordinates that are transmitted from each delivery truck once every 6 seconds. You need to process these coordinates in near real-time from multiple sources and load them into Elasticsearch without significant technical overhead to maintain. Which tool should you use to ingest the data?

A) Amazon SQS

B) Amazon EMR

C) AWS Data Pipeline

D) Amazon Kinesis Firehose

ANSWER8:

D

Notes/Hint8: 

Amazon Kinesis Firehose is the easiest way to load streaming data into AWS. It can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service, enabling near real-time analytics with existing business intelligence tools and dashboards.

Reference8: Amazon Kinesis Firehose

Question 9: A company needs to implement a near-real-time fraud prevention feature for its ecommerce site. User and order details need to be delivered to an Amazon SageMaker endpoint to flag suspected fraud. The amount of input data needed for the inference could be as much as 1.5 MB.
Which solution meets the requirements with the LOWEST overall latency?

A) Create an Amazon Managed Streaming for Kafka cluster and ingest the data for each order into a topic. Use a Kafka consumer running on Amazon EC2 instances to read these messages and invoke the Amazon SageMaker endpoint.

B) Create an Amazon Kinesis Data Streams stream and ingest the data for each order into the stream. Create an AWS Lambda function to read these messages and invoke the Amazon SageMaker endpoint.

C) Create an Amazon Kinesis Data Firehose delivery stream and ingest the data for each order into the stream. Configure Kinesis Data Firehose to deliver the data to an Amazon S3 bucket. Trigger an AWS Lambda function with an S3 event notification to read the data and invoke the Amazon SageMaker endpoint.

D) Create an Amazon SNS topic and publish the data for each order to the topic. Subscribe the Amazon SageMaker endpoint to the SNS topic.


ANSWER9:

A

Notes/Hint9: 

An Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster can deliver the messages with very low latency, and its maximum message size is configurable, so it can handle the 1.5 MB payload.

Reference9: Amazon Managed Streaming for Kafka cluster

Question 10: You need to filter and transform incoming messages coming from a smart sensor you have connected with AWS. Once messages are received, you need to store them as time series data in DynamoDB. Which AWS service can you use?

A) IoT Device Shadow Service

B) Redshift

C) Kinesis

D) IoT Rules Engine

ANSWER10:

D

Notes/Hint10: 

The IoT rules engine will allow you to send sensor data to AWS services like DynamoDB.
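
As a sketch of what such a rule looks like (the topic filter, table name, and role ARN below are made up for illustration, not from the question), an AWS IoT topic rule pairs a SQL statement that filters and transforms incoming messages with an action that writes the result to DynamoDB:

```json
{
  "sql": "SELECT temperature, humidity, timestamp() AS ts FROM 'sensors/+/telemetry' WHERE temperature > -40",
  "awsIotSqlVersion": "2016-03-23",
  "actions": [
    {
      "dynamoDBv2": {
        "roleArn": "arn:aws:iam::123456789012:role/iot-to-dynamodb",
        "putItem": { "tableName": "SensorTimeSeries" }
      }
    }
  ]
}
```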

Reference10: The IoT rules engine

Question 11: A media company is migrating its on-premises legacy Hadoop cluster with its associated data processing scripts and workflow to an Amazon EMR environment running the latest Hadoop release. The developers want to reuse the Java code that was written for data processing jobs for the on-premises cluster.
Which approach meets these requirements?

A) Deploy the existing Oracle Java Archive as a custom bootstrap action and run the job on the EMR cluster.

B) Compile the Java program for the desired Hadoop version and run it using a CUSTOM_JAR step on the EMR cluster.

C) Submit the Java program as an Apache Hive or Apache Spark step for the EMR cluster.

D) Use SSH to connect the master node of the EMR cluster and submit the Java program using the AWS CLI.


ANSWER11:

B

Notes/Hint11: 

A CUSTOM_JAR step can be configured to download a JAR file from an Amazon S3 bucket and execute it. Since the Hadoop versions are different, the Java application has to be recompiled.

Reference11:  Automating analytics workflows on EMR

Question 12: You currently have databases running on-site and in another data center off-site. What service allows you to consolidate to one database in Amazon?

A) AWS Kinesis

B) AWS Database Migration Service

C) AWS Data Pipeline

D) AWS RDS Aurora

ANSWER12:

B

Notes/Hint12: 

AWS Database Migration Service can migrate your data to and from most of the widely used commercial and open source databases. It supports homogeneous migrations such as Oracle to Oracle, as well as heterogeneous migrations between different database platforms, such as Oracle to Amazon Aurora. Migrations can be from on-premises databases to Amazon RDS or Amazon EC2, databases running on EC2 to RDS, or vice versa, as well as from one RDS database to another RDS database.

Reference12: DMS

Question 13:  An online retail company wants to perform analytics on data in large Amazon S3 objects using Amazon EMR. An Apache Spark job repeatedly queries the same data to populate an analytics dashboard. The analytics team wants to minimize the time to load the data and create the dashboard.
Which approaches could improve the performance? (Select TWO.)

A) Copy the source data into Amazon Redshift and rewrite the Apache Spark code to create analytical reports by querying Amazon Redshift.

B) Copy the source data from Amazon S3 into Hadoop Distributed File System (HDFS) using s3distcp.

C) Load the data into Spark DataFrames.

D) Stream the data into Amazon Kinesis and use the Kinesis Client Library (KCL) in multiple Spark jobs to perform analytical jobs.

E) Use Amazon S3 Select to retrieve the data necessary for the dashboards from the S3 objects.

ANSWER13:

C and E

Notes/Hint13: 

One of the speed advantages of Apache Spark comes from loading data into immutable dataframes, which can be accessed repeatedly in memory. Spark DataFrames organize distributed data into columns, which makes summaries and aggregates much quicker to calculate. Also, instead of loading an entire large Amazon S3 object, load only what is needed using Amazon S3 Select. Keeping the data in Amazon S3 avoids loading the large dataset into HDFS.
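
For illustration, here is the shape of an Amazon S3 Select request as it would be passed to boto3's `select_object_content` (building the parameters as a plain dict keeps the example runnable without AWS credentials; the bucket, key, and column names are made up):

```python
# Sketch of an Amazon S3 Select request. Only the needed columns and rows
# are pulled from the large CSV object instead of downloading all of it.
def build_s3_select_request(bucket, key, columns, where):
    return {
        "Bucket": bucket,
        "Key": key,
        "Expression": f"SELECT {', '.join(columns)} FROM s3object s WHERE {where}",
        "ExpressionType": "SQL",
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"JSON": {}},
    }

req = build_s3_select_request("analytics-bucket", "sales/2021.csv",
                              ["s.region", "s.revenue"], "s.revenue > 1000")
print(req["Expression"])  # SELECT s.region, s.revenue FROM s3object s WHERE s.revenue > 1000
```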

Reference13: Spark DataFrames 

Question 14: You have been hired as a consultant to provide a solution to integrate a client’s on-premises data center to AWS. The customer requires a 300 Mbps dedicated, private connection to their VPC. Which AWS tool do you need?

A) VPC peering

B) Data Pipeline

C) Direct Connect

D) EMR

ANSWER14:

C

Notes/Hint14: 

Direct Connect will provide a dedicated and private connection to an AWS VPC.

Reference14: Direct Connect


Question 15: Your organization has a variety of different services deployed on EC2 and needs to efficiently send application logs to a central system for processing and analysis. They've determined it is best to use a managed AWS service to transfer their data from the EC2 instances into Amazon S3, and they've decided to use a solution that will do what?

A) Installs the AWS Direct Connect client on all EC2 instances and uses it to stream the data directly to S3.

B) Leverages the Kinesis Agent to send data to Kinesis Data Streams and output that data in S3.

C) Ingests the data directly from S3 by configuring regular Amazon Snowball transactions.

D) Leverages the Kinesis Agent to send data to Kinesis Firehose and output that data in S3.

ANSWER15:

D

Notes/Hint15: 

Kinesis Firehose is a managed solution, and log files can be sent from EC2 to Firehose to S3 using the Kinesis agent.
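
For illustration (the endpoint region, file pattern, and stream name below are made up), the Kinesis agent is configured through `/etc/aws-kinesis/agent.json`, pointing a log file pattern at a Firehose delivery stream:

```json
{
  "firehose.endpoint": "firehose.us-east-1.amazonaws.com",
  "flows": [
    {
      "filePattern": "/var/log/myapp/*.log",
      "deliveryStream": "app-logs-to-s3"
    }
  ]
}
```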

Reference15: Kinesis Firehose

Question 16: A data engineer needs to create a dashboard to display social media trends during the last hour of a large company event. The dashboard needs to display the associated metrics with a latency of less than 1 minute.
Which solution meets these requirements?

A) Publish the raw social media data to an Amazon Kinesis Data Firehose delivery stream. Use Kinesis Data Analytics for SQL Applications to perform a sliding window analysis to compute the metrics and output the results to a Kinesis Data Streams data stream. Configure an AWS Lambda function to save the stream data to an Amazon DynamoDB table. Deploy a real-time dashboard hosted in an Amazon S3 bucket to read and display the metrics data stored in the DynamoDB table.

B) Publish the raw social media data to an Amazon Kinesis Data Firehose delivery stream. Configure the stream to deliver the data to an Amazon Elasticsearch Service cluster with a buffer interval of 0 seconds. Use Kibana to perform the analysis and display the results.

C) Publish the raw social media data to an Amazon Kinesis Data Streams data stream. Configure an AWS Lambda function to compute the metrics on the stream data and save the results in an Amazon S3 bucket. Configure a dashboard in Amazon QuickSight to query the data using Amazon Athena and display the results.

D) Publish the raw social media data to an Amazon SNS topic. Subscribe an Amazon SQS queue to the topic. Configure Amazon EC2 instances as workers to poll the queue, compute the metrics, and save the results to an Amazon Aurora MySQL database. Configure a dashboard in Amazon QuickSight to query the data in Aurora and display the results.


ANSWER16:

A

Notes/Hint16: 

Amazon Kinesis Data Analytics can query data in a Kinesis Data Firehose delivery stream in near-real time using SQL. A sliding window analysis is appropriate for determining trends in the stream. Amazon S3 can host a static webpage that includes JavaScript that reads the data in Amazon DynamoDB and refreshes the dashboard.
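
The sliding-window idea can be shown in miniature (a toy helper, not the exam's SQL; Kinesis Data Analytics expresses the same thing in SQL over a stream):

```python
from collections import Counter

# Toy version of a sliding-window analysis: count hashtag mentions over
# the trailing 60 seconds of events. Plain Python makes the mechanics
# visible; event times and tags are invented.
def sliding_window_counts(events, now, window_seconds=60):
    # events: (epoch_seconds, hashtag) pairs; keep only those inside the window
    window = [tag for ts, tag in events if now - ts <= window_seconds]
    return Counter(window)

events = [(0, "#race"), (30, "#race"), (70, "#pitstop"), (100, "#race")]
counts = sliding_window_counts(events, now=110)
print(counts["#race"], counts["#pitstop"])  # 1 1 -- the t=0 and t=30 events aged out
```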

Reference16: Amazon Kinesis Data Analytics can query data in a Kinesis Data Firehose delivery stream in near-real time using SQL

Question 17: A real estate company is receiving new property listing data from its agents through .csv files every day and storing these files in Amazon S3. The data analytics team created an Amazon QuickSight visualization report that uses a dataset imported from the S3 files. The data analytics team wants the visualization report to reflect the current data up to the previous day. How can a data analyst meet these requirements?

A) Schedule an AWS Lambda function to drop and re-create the dataset daily.

B) Configure the visualization to query the data in Amazon S3 directly without loading the data into SPICE.

C) Schedule the dataset to refresh daily.

D) Close and open the Amazon QuickSight visualization.

ANSWER17:

C

Notes/Hint17:

Datasets created using Amazon S3 as the data source are automatically imported into SPICE. The Amazon QuickSight console allows for the refresh of SPICE data on a schedule.

Reference17: Amazon QuickSight and SPICE


Question 18: You need to migrate data to AWS. It is estimated that the data transfer will take over a month via the current AWS Direct Connect connection your company has set up. Which AWS tool should you use?

A) Establish additional Direct Connect connections.

B) Use Data Pipeline to migrate the data in bulk to S3.

C) Use Kinesis Firehose to stream all new and existing data into S3.

D) Snowball

ANSWER18:

D

Notes/Hint18:

As a general rule, if it takes more than one week to upload your data to AWS using the spare capacity of your existing Internet connection, then you should consider using Snowball. For example, if you have a 100 Mbps connection that you can solely dedicate to transferring your data and need to transfer 100 TB of data, it takes more than 100 days to complete a data transfer over that connection. You can make the same transfer by using multiple Snowballs in about a week.
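
The arithmetic behind that rule of thumb can be worked out directly (decimal units, 1 TB = 10**12 bytes):

```python
# How long does 100 TB take over a dedicated 100 Mbps link?
def transfer_days(terabytes, mbps):
    bits = terabytes * 10**12 * 8      # payload size in bits
    seconds = bits / (mbps * 10**6)    # link speed in bits per second
    return seconds / 86_400            # 86,400 seconds per day

print(round(transfer_days(100, 100)))  # 93
```

That is roughly 93 days of raw transfer time at full line rate; with protocol overhead and a link that is rarely 100% dedicated in practice, it stretches past the 100 days the hint cites, while a fleet of Snowballs finishes in about a week.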

Reference18: Snowball

Question 19: You currently have an on-premises Oracle database and have decided to leverage AWS and use Aurora. You need to do this as quickly as possible. How do you achieve this?

A) It is not possible to migrate an on-premises database to AWS at this time.

B) Use AWS Data Pipeline to create a target database, migrate the database schema, set up the data replication process, initiate the full load and a subsequent change data capture and apply, and conclude with a switchover of your production environment to the new database once the target database is caught up with the source database.

C) Use AWS Database Migration Services and create a target database, migrate the database schema, set up the data replication process, initiate the full load and a subsequent change data capture and apply, and conclude with a switch-over of your production environment to the new database once the target database is caught up with the source database.

D) Use AWS Glue to crawl the on-premises database schemas and then migrate them into AWS with Data Pipeline jobs.

ANSWER19:

C

Notes/Hint19: 

DMS can efficiently support this sort of migration using the steps outlined. While AWS Glue can help you crawl schemas and store metadata on them inside of Glue for later use, it isn't the best tool for actually transitioning a database over to AWS itself. Similarly, while Data Pipeline is great for ETL and ELT jobs, it isn't the best option to migrate a database over to AWS.

Reference19: DMS (https://aws.amazon.com/dms/faqs/)


Question 20: A financial company uses Amazon EMR for its analytics workloads. During the company’s annual security audit, the security team determined that none of the EMR clusters’ root volumes are encrypted. The security team recommends the company encrypt its EMR clusters’ root volume as soon as possible.
Which solution would meet these requirements?

A) Enable at-rest encryption for EMR File System (EMRFS) data in Amazon S3 in a security configuration. Re-create the cluster using the newly created security configuration.

B) Specify local disk encryption in a security configuration. Re-create the cluster using the newly created security configuration.

C) Detach the Amazon EBS volumes from the master node. Encrypt the EBS volume and attach it back to the master node.

D) Re-create the EMR cluster with LZO encryption enabled on all volumes.

ANSWER20:

B

Notes/Hint20: 

Local disk encryption can be enabled as part of a security configuration to encrypt root and storage volumes.
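
As a sketch of such a security configuration (the KMS key ARN is made up; the JSON shape follows the EMR security configuration format), local disk encryption is declared under the at-rest encryption settings:

```json
{
  "EncryptionConfiguration": {
    "EnableInTransitEncryption": false,
    "EnableAtRestEncryption": true,
    "AtRestEncryptionConfiguration": {
      "LocalDiskEncryptionConfiguration": {
        "EncryptionKeyProviderType": "AwsKms",
        "AwsKmsKey": "arn:aws:kms:us-east-1:123456789012:key/example-key-id"
      }
    }
  }
}
```

The cluster must then be re-created with this security configuration attached, since a security configuration cannot be applied to a running cluster.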

Reference20: EMR Cluster Local disk encryption

Additional resources:

  1. Djamga Data Sciences Big Data – Data Analytics Youtube Playlist
  2. Prepare for Your AWS Certification Exam
  3. LinuxAcademy


Clever Questions, Answers, Resources about:

  • Data Sciences
  • Big Data
  • Data Analytics
  • Databases
  • Data Streams
  • Large DataSets

What Is a Data Scientist?

Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician. – Josh Wills

Data scientists apply sophisticated quantitative and computer science skills to both structure and analyze massive stores or continuous streams of unstructured data, with the intent to derive insights and prescribe action. – Burtch Works Data Science Salary Survey, May 2018

More than anything, what data scientists do is make discoveries while swimming in data… In a competitive landscape where challenges keep changing and data never stop flowing, data scientists help decision makers shift from ad hoc analysis to an ongoing conversation with data. – Data Scientist: The Sexiest Job of the 21st Century, Harvard Business Review

Do All Data Scientists Hold Graduate Degrees?

Data scientists are highly educated. With exceedingly rare exception, every data scientist holds at least an undergraduate degree. 91% of data scientists in 2018 held advanced degrees. The remaining 9% all held undergraduate degrees. Furthermore,

  • 25% of data scientists hold a degree in statistics or mathematics,
  • 20% have a computer science degree,
  • an additional 20% hold a degree in the natural sciences, and
  • 18% hold an engineering degree.

The remaining 17% of surveyed data scientists held degrees in business, social science, or economics.

How Are Data Scientists Different From Data Analysts?

Broadly speaking, the roles differ in scope: data analysts build reports with narrow, well-defined KPIs, while data scientists often work on broader business problems without clear solutions. Data scientists live on the edge of the known and unknown.

We’ll leave you with a concrete example: A data analyst cares about profit margins. A data scientist at the same company cares about market share.

How Is Data Science Used in Medicine?

Data science in healthcare best translates to biostatistics. It can be quite different from data science in other industries as it usually focuses on small samples with several confounding variables.

How Is Data Science Used in Manufacturing?

Data science in manufacturing is vast; it includes everything from supply chain optimization to the assembly line.

What are data scientists paid?

Most people are attracted to data science for the salary. It's true that data scientists garner high salaries compared to their peers, and there is data to support this: the May 2018 edition of the Burtch Works Data Science Salary Survey reports annual salary statistics.

Note the above numbers do not reflect total compensation which often includes standard benefits and may include company ownership at high levels.

How will data science evolve in the next 5 years?

Will AI replace data scientists?

What is the workday like for a data scientist?

It’s common for data scientists across the US to work 40 hours weekly. While company culture does dictate different levels of work life balance, it’s rare to see data scientists who work more than they want. That’s the virtue of being an expensive resource in a competitive job market.

How do I become a Data Scientist?

The roadmap given to aspiring data scientists can be boiled down to three steps:

  1. Earning an undergraduate and/or advanced degree in computer science, statistics, or mathematics,
  2. Building their portfolio of SQL, Python, and R skills, and
  3. Getting related work experience through technical internships.

All three require a significant time and financial commitment.

There used to be a saying around data science: the road into data science starts with two years of university-level math.

What Should I Learn? What Order Do I Learn Them?

This answer assumes your academic background ends with a HS diploma in the US.

  1. Python
  2. Differential Calculus
  3. Integral Calculus
  4. Multivariable Calculus
  5. Linear Algebra
  6. Probability
  7. Statistics

Some follow up questions and answers:

Why Python first?

  • Python is a general purpose language. R is used primarily by statisticians. In the likely scenario that you decide data science requires too much time, effort, and money, Python will be more valuable than your R skills. It’s preparing you to fail, sure, but in the same way a savings account is preparing you to fail.

When do I start working with data?

  • You’ll start working with data when you’ve learned enough Python to do so. Whether you’ll have the tools to have any fun is a much more open-ended question.

How long will this take me?

  • Assuming self-study and average intelligence, 3-5 years from start to finish.

How Do I Learn Python?

If you don’t know the first thing about programming, start with MIT’s course in the curated list.

These modules are the standard tools for data analysis in Python:

Curated Threads & Resources

  1. MIT’s Introduction to Computer Science and Programming in Python: a free, archived course taught at MIT in the fall 2016 semester.
  2. Data Scientist with Python Career Track | DataCamp: the first courses are free, but unlimited access costs $29/month. Users usually report a positive experience, and it’s one of the better hands-on ways to learn Python.
  3. Sentdex’s (Harrison Kinsley) YouTube channel of Python programming tutorials.
  4. /r/learnpython is an active sub and very useful for learning the basics.

How Do I Learn R?

If you don’t know the first thing about programming, start with R for Data Science in the curated list.

These modules are the standard tools for data analysis in R:

Curated Threads & Resources

  1. R for Data Science by Hadley Wickham: a free ebook full of succinct code examples, terrific for learning tidyverse syntax. Folks with some math background may prefer the free alternative, Introduction to Statistical Learning.
  2. Data Scientist with R Career Track | DataCamp: the first courses are free, but unlimited access costs $29/month. Users usually report a positive experience, and it’s one of the few hands-on ways to learn R.
  3. R Inferno: learners with a CS background will appreciate this free handbook explaining how and why R behaves the way that it does.

How Do I Learn SQL?

Prioritize the basics of SQL, i.e., when to use functions like POW, SUM, and RANK, and the computational complexity of the different kinds of joins.

Concepts like relational algebra, when to use clustered/non-clustered indexes, etc. are useful, but (almost) never come up in interviews.

You absolutely do not need to understand administrative concepts like managing permissions.

Finally, there are numerous query engines and therefore numerous dialects of SQL. Use whichever dialect is supported in your chosen resource. There’s not much difference between them, so it’s easy to learn another dialect after you’ve learned one.
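
As a minimal illustration of those basics, here is a toy join-plus-aggregation query, run through Python’s built-in sqlite3 module so it is self-contained. The customers/orders schema is invented purely for the example:

```python
import sqlite3

def total_spend_per_customer():
    """A toy version of core interview SQL: a JOIN plus GROUP BY/SUM.

    The customers/orders schema here is hypothetical, for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 5.0), (3, 2, 12.5);
    """)
    rows = conn.execute("""
        SELECT c.name, SUM(o.amount) AS total
        FROM customers c
        JOIN orders o ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY total DESC
    """).fetchall()
    conn.close()
    return rows
```

The same query would run, with minor syntax differences at most, on Postgres, MySQL, or any other dialect you happen to learn first.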

Curated Threads & Resources

  1. The SQL Tutorial for Data Analysis | Mode.com
  2. Introduction to Databases. A free MOOC supported by Stanford University.
  3. SQL Queries for Mere Mortals. A $30 book highly recommended by /u/karmanujan.

How Do I Learn Calculus?

Fortunately (or unfortunately), calculus is the lament of many students, and so resources for it are plentiful. Khan Academy mimics lectures very well, and Paul’s Online Math Notes are a terrific reference full of practice problems and solutions.

Calculus, however, is not a single course. For those unfamiliar with US terminology:

  • Calculus I is differential calculus.
  • Calculus II is integral calculus.
  • Calculus III is multivariable calculus.
  • Calculus IV is differential equations.

Differential and integral calculus are both necessary for probability and statistics, and should be completed first.

Multivariable calculus can be paired with linear algebra, but is also required.

Differential equations is where consensus falls apart. The short of it is, they’re all but necessary for mathematical modeling, but not everyone does mathematical modeling. It’s another tool in the toolbox.
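
The bridge from calculus to probability is direct: differentiating a CDF gives the density, and integration gives probabilities and expectations. For a continuous random variable $X$ with CDF $F$ and density $f$:

```latex
F'(x) = f(x), \qquad
P(a \le X \le b) = \int_a^b f(x)\,dx, \qquad
\mathbb{E}[X] = \int_{-\infty}^{\infty} x\,f(x)\,dx
```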


How Do I Learn Probability?

Probability is not friendly to beginners. Definitions are rooted in higher mathematics, notation varies from source to source, and solutions are frequently unintuitive. Probability may present the biggest barrier to entry in data science.

It’s best to pick a single primary source and a community for help. If you can spend the money, register for a university or community college course and attend in person.

The best free resource is MIT’s 18.05 Introduction to Probability and Statistics (Spring 2014). Leverage /r/learnmath, /r/learnmachinelearning, and /r/AskStatistics when you inevitably get stuck.
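
A worked example of that unintuitiveness is the birthday problem: with only 23 people, the chance of a shared birthday is about 50%. A quick Monte Carlo simulation (parameter defaults are my own choice, not from any course) lets you sanity-check analytic answers like this while you learn:

```python
import random

def birthday_collision_prob(n_people=23, trials=20_000, seed=0):
    """Monte Carlo estimate of P(at least one shared birthday among n people)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Draw a birthday (day-of-year) for each person, ignoring leap years.
        days = [rng.randrange(365) for _ in range(n_people)]
        # A collision occurred if any day repeats.
        if len(set(days)) < n_people:
            hits += 1
    return hits / trials
```

The exact answer is about 0.507, which the simulation should approach; simulation is a cheap way to check your integrals and counting arguments.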

How Do I Learn Linear Algebra?

Curated Threads & Resources

  1. 3Blue1Brown’s Essence of Linear Algebra playlist: https://www.youtube.com/watch?v=fNk_zzaMoSs&index=1&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab
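
The central idea in that playlist is that a matrix is a linear transformation. A tiny pure-Python sketch of the idea, using a 2×2 rotation matrix (helper names are my own):

```python
import math

def mat_vec(A, v):
    """Multiply a 2x2 matrix by a 2-vector: a matrix acting as a linear map."""
    return [A[0][0] * v[0] + A[0][1] * v[1],
            A[1][0] * v[0] + A[1][1] * v[1]]

def rotation(theta):
    """Counter-clockwise rotation matrix for angle theta (radians)."""
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta),  math.cos(theta)]]
```

For example, rotating the vector (1, 0) by 90° yields (0, 1); composing rotations corresponds to multiplying their matrices.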

What does the typical data science interview process look like?

For general advice, Mastering the DS Interview Loop is a terrific article. The community discussed the article here.

Briefly summarized, most companies follow a five-stage process:

  1. Coding Challenge: Most common at software companies and roles contributing to a digital product.
  2. HR Screen
  3. Technical Screen: Often in the form of a project. Less frequently, it takes the form of a whiteboarding session at the onsite.
  4. Onsite: Usually the project from the technical screen is presented here, followed by a meeting with the director overseeing the team you’ll join.
  5. Negotiation & Offer

Preparation:

  1. Practice questions on Leetcode which has both SQL and traditional data structures/algorithm questions
  2. Review Brilliant for math and statistics questions.
  3. SQL Zoo and Mode Analytics both offer various SQL exercises you can solve in your browser.

Tips:

  1. Before you start coding, read through all the questions. This allows your unconscious mind to start working on problems in the background.
  2. Start with the hardest problem first; when you hit a snag, move to a simpler problem before returning to the harder one.
  3. Focus on passing all the test cases first, then worry about improving complexity and readability.
  4. If you’re done and have a few minutes left, go get a drink and try to clear your head. Read through your solutions one last time, then submit.
  5. It’s okay to not finish a coding challenge. Sometimes companies will create unreasonably tedious coding challenges with one-week time limits that require 5–10 hours to complete. Unless you’re desperate, you can always walk away and spend your time preparing for the next interview.

Remember, interviewing is a skill that can be learned, just like anything else. Hopefully, this article has given you some insight on what to expect in a data science interview loop.

The process also isn’t perfect and there will be times that you fail to impress an interviewer because you don’t possess some obscure piece of knowledge. However, with repeated persistence and adequate preparation, you’ll be able to land a data science job in no time!

What does the Airbnb data science interview process look like? [Coming soon]

What does the Facebook data science interview process look like? [Coming soon]

What does the Uber data science interview process look like? [Coming soon]

What does the Microsoft data science interview process look like? [Coming soon]

What does the Google data science interview process look like? [Coming soon]

What does the Netflix data science interview process look like? [Coming soon]

What does the Apple data science interview process look like? [Coming soon]

Question: How is SQL used in real data science jobs?

Real-life enterprise databases are orders of magnitude more complex than the “customers, products, orders” examples used as teaching tools. SQL as a language is, IMO, relatively simple (the DB administration component can get complex, but data scientists mostly aren’t doing that anyway). SQL is an incredibly important skill for any DS role, though.

I think when people emphasize SQL, what they really mean is the ability to write queries that interrogate the data and discover the nuances behind how it is collected and/or manipulated by an application before it is written to the DB. For example, is the employee’s phone number their current phone number, or does the database store a history of all previous phone numbers? These are critically important questions for understanding the nature of your data, and they don’t necessarily involve statistics. The level of syntax required is not that sophisticated; you can get pretty far with knowledge of all the joins, GROUP BY/analytical functions, filtering, and nested queries.

In many cases, the data is too large to just SELECT * and dump into a CSV to load into pandas, so you start with SQL against the source. In my mind, “SQL skills” are less about mastering SQL’s syntax and more about knowing how to generate hypotheses (ones that build up to answering your business question) that can be investigated via a query. Just my two cents though!
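
As a hedged sketch of that phone-number example (schema and data invented for illustration), here is the classic “latest row per entity” query, run through Python’s built-in sqlite3:

```python
import sqlite3

def current_phone_numbers():
    """Pick the most recent phone number per employee from a history table.

    The phone_history schema and rows are hypothetical, for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE phone_history (
            employee_id INTEGER, phone TEXT, updated_at TEXT
        );
        INSERT INTO phone_history VALUES
            (1, '555-0100', '2019-01-01'),
            (1, '555-0199', '2021-06-15'),
            (2, '555-0150', '2020-03-02');
    """)
    rows = conn.execute("""
        SELECT p.employee_id, p.phone
        FROM phone_history p
        JOIN (
            -- Latest update timestamp per employee.
            SELECT employee_id, MAX(updated_at) AS latest
            FROM phone_history
            GROUP BY employee_id
        ) m ON m.employee_id = p.employee_id AND m.latest = p.updated_at
        ORDER BY p.employee_id
    """).fetchall()
    conn.close()
    return rows
```

Forgetting the history table exists, and joining on it naively, silently duplicates rows: exactly the kind of data nuance the answer above is describing.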


12,000 Years of Human Population Dynamics

[OC] 12,000 years of human population dynamics from dataisbeautiful

Human population density estimates based on the Hyde 3.2 model.

Capitol insurrection arrests per million people by state

[OC] Capitol insurrection arrests per million people by state from dataisbeautiful

Data Source: Made in Google Sheets using data from this USA Today article (for the number of arrests by arrestee’s home state) and this spreadsheet of the results of the 2020 Census (for the population of each state and DC in 2020, which was used as the denominator in calculating arrests/million people).

Top 20 AWS Certified Associate SysOps Administrator Practice Quiz – Questions and Answers Dumps



What is the AWS Certified SysOps Administrator – Associate?

The AWS Certified SysOps Administrator – Associate (SOA-C01) examination is intended for individuals who have technical expertise in deployment, management, and operations on AWS.

The AWS Certified SysOps Administrator – Associate exam covers the following domains:

Domain 1: Monitoring and Reporting 22%

Domain 2: High Availability 8%


Domain 3: Deployment and Provisioning 14%

Domain 4: Storage and Data Management 12%

Domain 5: Security and Compliance 18%

Domain 6: Networking 14%

Domain 7: Automation and Optimization 12%

AWS Certified SysOps Administrator

Top 20 AWS Certified SysOps Administrator – Associate Practice Quiz Questions, Answers and References – SOA-C01:

Question 1: Under which security model does AWS provide secure infrastructure and services, while the customer is responsible for secure operating systems, platforms, and data?

ANSWER1:

C

Get mobile friendly version of the quiz @ the App Store

NOTES/HINT1: The Shared Responsibility Model is the security model under which AWS provides secure infrastructure and services, while the customer is responsible for secure operating systems, platforms, and data.

Question 2: Which type of testing method is used to compare a control system to a test system, with the goal of assessing whether changes applied to the test system improve a particular metric compared to the control system?

ANSWER2:

A


NOTES/HINT2: The side-by-side testing method is used to compare a control system to a test system, with the goal of assessing whether changes applied to the test system improve a particular metric compared to the control system.

Reference2: AWS Side by side testing 

Question 3: When BGP is used with a hardware VPN, the IPSec and the BGP connections must both be which of the following on the same user gateway device?

ANSWER3:

B


NOTES/HINT3: The IPSec and the BGP connections must both be terminated on the same user gateway device.

Reference3: IpSec and BGP in AWS

Question 4: Which pillar of the AWS Well-Architected Framework includes the ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies?

ANSWER4:

D


NOTES/HINT4: Security is the pillar of the AWS Well-Architected Framework that includes the ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies.

Reference4: AWS Well-Architected Framework: Security

Question 5: Within the realm of Amazon S3 backups, snapshots are which of the following?

ANSWER5:

A


NOTES/HINT5: Within the realm of Amazon S3 backups, snapshots are block-based.

Reference5: Snapshots are block based

Question 6: Amazon VPC provides the option of creating a hardware VPN connection between remote customer networks and their Amazon VPC over the Internet using which encryption technology?

ANSWER6:

E


NOTES/HINT6: Amazon VPC provides the option of creating a hardware VPN connection between remote customer networks and their Amazon VPC over the Internet using IPsec encryption technology.

Reference6: Amazon VPC IPSec Encryption

Question 7: To make a clean backup of a database, that database should be put into what mode before making a snapshot of it?

ANSWER7:

C


NOTES/HINT7: To make a clean backup of a database, that database should be put into hot backup mode before making a snapshot of it.


Reference7: AWS Prescriptive Backup Recovery Guide

Question 8: Which pillar of the AWS Well-Architected Framework includes the ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve?

ANSWER8:

B


NOTES/HINT8: Performance efficiency is the pillar of the AWS Well-Architected Framework that includes the ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve.

Reference8: Performance Efficiency Pillar – AWS Well-Architected Framework

Question 9: AWS Storage Gateway supports which three configurations?

ANSWER9:

C


NOTES/HINT9: AWS Storage Gateway supports Gateway-stored volumes, Gateway-cached volumes, and Gateway-virtual tape library.

Reference9: AWS Storage Gateway configurations

Question 10: With which of the following can you establish private connectivity between AWS and a data center, office, or co-location environment?

ANSWER10:

B


NOTES/HINT10: With AWS Direct Connect you can establish private connectivity between AWS and a data center, office, or co-location environment.

Reference10: AWS Direct Connect

Question 11: A company is migrating a legacy web application from a single server to multiple Amazon EC2 instances behind an Application Load Balancer (ALB). After the migration, users report that they are frequently losing their sessions and are being prompted to log in again. Which action should be taken to resolve the issue reported by users?

ANSWER11:

Enable sticky sessions (session affinity) on the Application Load Balancer.


NOTES/HINT11: Legacy applications designed to run on a single server frequently store session data locally. When these applications are deployed on multiple instances behind a load balancer, user requests are routed to instances using the round robin routing algorithm. Session data stored on one instance would not be present on the others. By enabling sticky sessions, cookies are used to track user requests and keep subsequent requests going to the same instance.

Reference 11: Sticky Sessions
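
For illustration only, enabling stickiness on an ALB target group might look like the following boto3 sketch. The target group ARN is a placeholder and real AWS credentials are required, so treat this as a configuration sketch rather than a tested script:

```python
# Illustration only: enabling ALB sticky sessions via boto3.
# The target group ARN below is a placeholder.
import boto3

elbv2 = boto3.client("elbv2")
elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:region:account:targetgroup/legacy-app/1234567890abcdef",
    Attributes=[
        {"Key": "stickiness.enabled", "Value": "true"},
        {"Key": "stickiness.type", "Value": "lb_cookie"},
        # Keep routing a user to the same instance for up to one day.
        {"Key": "stickiness.lb_cookie.duration_seconds", "Value": "86400"},
    ],
)
```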

Question 12: An ecommerce company wants to lower costs on its nightly jobs that aggregate the current day’s sales and store the results in Amazon S3. The jobs run on multiple On-Demand Instances, and the jobs take just under 2 hours to complete. The jobs can run at any time during the night. If the job fails for any reason, it needs to be started from the beginning. Which solution is the MOST cost-effective based on these requirements?

A) Purchase Reserved Instances.

B) Submit a request for a Spot block.

C) Submit a request for all Spot Instances.

D) Use a mixture of On-Demand and Spot Instances.

ANSWER12:

B


NOTES/HINT12: The solution will take advantage of Spot pricing, but by using a Spot block instead of Spot Instances, the company can be assured the job will not be interrupted.

Reference12: Spot Block

Question 13: A sysops team checks their AWS Personal Health Dashboard every week for upcoming AWS hardware maintenance events. Recently, a team member was on vacation and the team missed an event, which resulted in an outage. The team wants a simple method to ensure that everyone is aware of upcoming events without depending on an individual team member checking the dashboard. What should be done to address this?

A) Build a web scraper to monitor the Personal Health Dashboard. When new health events are detected, send a notification to an Amazon SNS topic monitored by the entire team.

B) Create an Amazon CloudWatch Events event based off the AWS Health service and send a notification to an Amazon SNS topic monitored by the entire team.

C) Create an Amazon CloudWatch Events event that sends a notification to an Amazon SNS topic monitored by the entire team to remind the team to view the maintenance events on the Personal Health Dashboard.

D) Create an AWS Lambda function that continuously pings all EC2 instances to confirm their health. Alert the team if this check fails.

ANSWER13:

B


NOTES/HINT13: The AWS Health service publishes Amazon CloudWatch Events. CloudWatch Events can trigger Amazon SNS notifications. This method requires neither additional coding nor infrastructure. It automatically notifies the team of upcoming events, and does not depend upon brittle solutions like web scraping.

Reference 13: Amazon CloudWatch Events

Question 14: An application running in a VPC needs to access instances owned by a different account and running in a VPC in a different AWS Region. For compliance purposes, the traffic must not traverse the public internet.
How should a sysops administrator configure network routing to meet these requirements?

A) Within each account, create a custom routing table containing routes that point to the other account’s virtual private gateway.

B) Within each account, set up a NAT gateway in a public subnet in its respective VPC. Then, using the public IP address from the NAT gateway, enable routing between the two VPCs.

C) From one account, configure a Site-to-Site VPN connection between the VPCs. Within each account, add routes in the VPC route tables that point to the CIDR block of the remote VPC.

D) From one account, create a VPC peering request. After an administrator from the other account accepts the request, add routes in the route tables for each VPC that point to the CIDR block of the peered VPC.

ANSWER14:

D


NOTES/HINT14: A VPC peering connection enables routing using each VPC’s private IP addresses as if they were in the same network. Traffic using inter-Region VPC peering always stays on the global AWS backbone and never traverses the public internet.

Reference14: VPC Peering

Question 15: An application running on Amazon EC2 instances needs to access data stored in an Amazon DynamoDB table.

Which solution will grant the application access to the table in the MOST secure manner?

A) Create an IAM group for the application and attach a permissions policy with the necessary privileges. Add the EC2 instances to the IAM group.

B) Create an IAM resource policy for the DynamoDB table that grants the necessary permissions to Amazon EC2.

C) Create an IAM role with the necessary privileges to access the DynamoDB table. Associate the role with the EC2 instances.

D) Create an IAM user for the application and attach a permissions policy with the necessary privileges. Generate an access key and embed the key in the application code.

ANSWER15:

C


NOTES/HINT15: An IAM role can be used to provide permissions for applications that are running on Amazon EC2 instances to make AWS API requests using temporary credentials.

Reference15: IAM Role

Question 16: A third-party service uploads objects to Amazon S3 every night. Occasionally, the service uploads an incorrectly formatted version of an object. In these cases, the sysops administrator needs to recover an older version of the object.
What is the MOST efficient way to recover the object without having to retrieve it from the remote service?

A) Configure an Amazon CloudWatch Events scheduled event that triggers an AWS Lambda function that backs up the S3 bucket prior to the nightly job. When bad objects are discovered, restore the backed up version.

B) Create an S3 event on object creation that copies the object to an Amazon Elasticsearch Service (Amazon ES) cluster. When bad objects are discovered, retrieve the previous version from Amazon ES.

C) Create an AWS Lambda function that copies the object to an S3 bucket owned by a different account. Trigger the function when new objects are created in Amazon S3. When bad objects are discovered, retrieve the previous version from the other account.

D) Enable versioning on the S3 bucket. When bad objects are discovered, access previous versions with the AWS CLI or AWS Management Console.

ANSWER16:

D


NOTES/HINT16: Enabling versioning is a simple solution. (A) involves writing custom code; (B) would use Amazon ES, an expensive store not well suited to backing up objects; and (C) has no versioning, so replication would overwrite the old version with the bad one if the error is not discovered quickly.

Reference16: Versioning
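
A sketch of what answer D looks like in practice with boto3. Bucket and key names are placeholders and AWS credentials are required, so this is an illustration rather than a tested script:

```python
# Illustration only: answer D with boto3. Bucket/key names are placeholders.
import boto3

s3 = boto3.client("s3")

# One-time setup: turn on versioning so every overwrite keeps the old object.
s3.put_bucket_versioning(
    Bucket="nightly-uploads-example",
    VersioningConfiguration={"Status": "Enabled"},
)

# When a bad upload is discovered, list the object's versions
# and fetch an older VersionId via get_object.
versions = s3.list_object_versions(Bucket="nightly-uploads-example", Prefix="data.csv")
```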

Question 17: According to the AWS shared responsibility model, for which of the following Amazon EC2 activities is AWS responsible? (Select TWO.)
A) Configuring network ACLs
B) Maintaining network infrastructure
C) Monitoring memory utilization
D) Patching the guest operating system
E) Patching the hypervisor

ANSWER17:

D and E


NOTES/HINT17: AWS provides security of the cloud, including maintenance of the hardware and hypervisor software supporting Amazon EC2. Customers are responsible for any maintenance or monitoring within an EC2 instance, and for configuring their VPC infrastructure.

Reference17: Security of the cloud

Question 18: A security and compliance team requires that all Amazon EC2 workloads use approved Amazon Machine Images (AMIs). A sysops administrator must implement a process to find EC2 instances launched from unapproved AMIs.

Which solution will meet these requirements?
A) Create a custom report using AWS Systems Manager inventory to identify unapproved AMIs.
B) Run Amazon Inspector on each EC2 instance and flag the instance if it is using unapproved AMIs.
C) Use an AWS Config rule to identify unapproved AMIs.
D) Use AWS Trusted Advisor to identify the EC2 workloads using unapproved AMIs.

ANSWER18:

C


NOTES/HINT18: AWS Config has a managed rule that handles this scenario.

Reference18: Managed Rule

Question 19: A sysops administrator observes a large number of rogue HTTP requests on an Application Load Balancer. The requests originate from various IP addresses. These requests cause increased server load and costs.

What should the administrator do to block this traffic?
A) Install Amazon Inspector on Amazon EC2 instances to block the traffic.
B) Use Amazon GuardDuty to protect the web servers from bots and scrapers.
C) Use AWS Lambda to analyze the web server logs, detect bot traffic, and block the IP addresses in the security groups.
D) Use an AWS WAF rate-based rule to block the traffic when it exceeds a threshold.

ANSWER19:

D


NOTES/HINT19: AWS WAF has rules that can protect web applications from HTTP flood attacks.

Reference19: HTTP Flood

Question 20: A sysops administrator is implementing security group policies for a web application running on AWS.

An Elastic Load Balancer connects to a fleet of Amazon EC2 instances that connect to an Amazon RDS database over port 1521. The security groups are named elbSG, ec2SG, and rdsSG, respectively.
How should these security groups be implemented?
A) elbSG: allow port 80 and 443 from 0.0.0.0/0;
ec2SG: allow port 443 from elbSG;
rdsSG: allow port 1521 from ec2SG.

B) elbSG: allow port 80 and 443 from 0.0.0.0/0;
ec2SG: allow port 80 and 443 from elbSG and rdsSG;
rdsSG: allow port 1521 from ec2SG.

C) elbSG: allow port 80 and 443 from ec2SG;
ec2SG: allow port 80 and 443 from elbSG and rdsSG;
rdsSG: allow port 1521 from ec2SG.

D) elbSG: allow port 80 and 443 from ec2SG;
ec2SG: allow port 443 from elbSG;
rdsSG: allow port 1521 from elbSG.

ANSWER20: 

A


NOTES/HINT20: elbSG must allow all web traffic (HTTP and HTTPS) from the internet. ec2SG must allow traffic from the load balancer only, in this case identified as traffic from elbSG. The database must allow traffic from the EC2 instances only, in this case identified as traffic from ec2SG.

Reference20: Allow all traffic
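
A boto3 sketch of the rules in answer A, with placeholder group IDs (illustration only; an actual VPC and credentials are required). The key detail is that ec2SG and rdsSG reference the upstream security group by ID rather than by IP range:

```python
# Illustration only: answer A's rules via boto3. Group IDs are placeholders.
import boto3

ec2 = boto3.client("ec2")

# elbSG: web traffic (HTTP and HTTPS) from anywhere.
ec2.authorize_security_group_ingress(
    GroupId="sg-elb-placeholder",
    IpPermissions=[
        {"IpProtocol": "tcp", "FromPort": p, "ToPort": p,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}
        for p in (80, 443)
    ],
)

# ec2SG: only traffic arriving through the load balancer.
ec2.authorize_security_group_ingress(
    GroupId="sg-ec2-placeholder",
    IpPermissions=[{"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
                    "UserIdGroupPairs": [{"GroupId": "sg-elb-placeholder"}]}],
)

# rdsSG: only database traffic (port 1521) from the application instances.
ec2.authorize_security_group_ingress(
    GroupId="sg-rds-placeholder",
    IpPermissions=[{"IpProtocol": "tcp", "FromPort": 1521, "ToPort": 1521,
                    "UserIdGroupPairs": [{"GroupId": "sg-ec2-placeholder"}]}],
)
```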

SOURCES:

Djamga DevOps YouTube Channel:

Prepare for Your AWS Certification Exam

GoCertify

SYSOPS AND SYSADMIN NEWS

SYSADMIN – SYSOPS RESOURCES

I WANT TO BECOME A SYSADMIN

This is a common topic that has been asked multiple times.

Professional/Non-technical

Sysadmin Utilities

Security

Linux

Microsoft / Windows Server

Virtualization

MacOS (formerly OSX) and Apple iOS

Google ChromeOS

Backup and Storage

Networking

Monitoring

  • Because your network and infrastructure can’t be a black box

Business and Standards Compliance

Major Vulnerabilities

Podcasts

Documentation