Tag: Kinesis Firehose

Quizzes, Practice Exams: Modeling, Data Engineering, Vision, Exploratory Data Analysis, ML Ops, Cheat Sheets, ML Jobs Interview Q&A

Use this App to learn about Machine Learning on AWS and prepare for the AWS Machine Learning Specialty Certification MLS-C01.

Earning AWS Certified Machine Learning Specialty validates expertise in building, training, tuning, and deploying machine learning (ML) models on AWS.

The App provides hundreds of quizzes and practice exams on:

– Machine Learning Operations on AWS

– Modelling

– Data Engineering

– Computer Vision

– Exploratory Data Analysis

– ML implementation & Operations

– Machine Learning Basics Questions and Answers

– Machine Learning Advanced Questions and Answers

– Scorecard

– Countdown timer

– Machine Learning Cheat Sheets

– Machine Learning Interview Questions and Answers

– Machine Learning Latest News

The App covers Machine Learning Basics and Advanced topics including: NLP, Computer Vision, Python, linear regression, logistic regression, Sampling, dataset, statistical interaction, selection bias, non-Gaussian distribution, bias-variance trade-off, Normal Distribution, correlation and covariance, Point Estimates and Confidence Interval, A/B Testing, p-value, statistical power of sensitivity, over-fitting and under-fitting, regularization, Law of Large Numbers, Confounding Variables, Survivorship Bias, univariate, bivariate and multivariate, Resampling, ROC curve, TF/IDF vectorization, Cluster Sampling, etc.

Domain 1: Data Engineering

Create data repositories for machine learning.

Identify data sources (e.g., content and location, primary sources such as user data)

Determine storage mediums (e.g., DB, Data Lake, S3, EFS, EBS)

Identify and implement a data ingestion solution.

Data job styles/types (batch load, streaming)

Data ingestion pipelines (Batch-based ML workloads and streaming-based ML workloads), etc.

Domain 2: Exploratory Data Analysis

Sanitize and prepare data for modeling.

Perform feature engineering.

Analyze and visualize data for machine learning.

Domain 3: Modeling

Frame business problems as machine learning problems.

Select the appropriate model(s) for a given machine learning problem.

Train machine learning models.

Perform hyperparameter optimization.

Evaluate machine learning models.

Domain 4: Machine Learning Implementation and Operations

Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.

Recommend and implement the appropriate machine learning services and features for a given problem.

Apply basic AWS security practices to machine learning solutions.

Deploy and operationalize machine learning solutions.

Amazon Comprehend

AWS Deep Learning AMIs (DLAMI)

AWS DeepLens

Amazon Forecast

Amazon Fraud Detector

Amazon Lex

Amazon Polly

Amazon Rekognition

Amazon SageMaker

Amazon Textract

Amazon Transcribe

Amazon Translate

Other Services and topics covered are:

Ingestion/Collection

Processing/ETL

Data analysis/visualization

Model training

Model deployment/inference

Operational

AWS ML application services

Language relevant to ML (for example, Python, Java, Scala, R, SQL)

Notebooks and integrated development environments (IDEs)

S3, SageMaker, Kinesis, Lake Formation, Athena, Kibana, Redshift, Textract, EMR, Glue, CSV, JSON, IMG, Parquet or databases

Amazon EC2, Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Container Service, Amazon Elastic Kubernetes Service, Amazon Redshift

SageMaker API Explained:

AWS Certified Machine Learning Engineer Specialty Questions and Answers:

Question1: An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, and do so with minimal changes to the existing code.

Answer1: Use the Build Your Own Container (BYOC) Amazon SageMaker option.
Create a new Docker container with the existing code. Register the container in Amazon Elastic Container Registry. Finally, run the training and inference jobs using this container.

Question2: Which feature of Amazon SageMaker can you use for preprocessing the data?

Amazon SageMaker training jobs
Amazon SageMaker hyperparameter tuning
Amazon SageMaker notebook instances
Amazon SageMaker endpoints

Answer2: Amazon SageMaker notebook instances

Amazon SageMaker enables developers and data scientists to build, train, tune, and deploy machine learning (ML) models at scale. You can deploy trained ML models for real-time or batch predictions on unseen data, a process known as inference. However, in most cases, the raw input data must be preprocessed and can’t be used directly for making predictions. This is because most ML models expect the data in a predefined format, so the raw data needs to be first cleaned and formatted in order for the ML model to process the data. You can use the Amazon SageMaker built-in Scikit-learn library for preprocessing input data and then use the Amazon SageMaker built-in Linear Learner algorithm for predictions.
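The cleaning and formatting step described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library rather than the Scikit-learn container the answer mentions; the `preprocess` function, the `(age, plan)` row layout, and the sample values are all hypothetical.

```python
from statistics import mean, pstdev


def preprocess(rows):
    """Impute, scale, and one-hot encode rows of (age, plan)."""
    known_ages = [age for age, _ in rows if age is not None]
    age_mean, age_std = mean(known_ages), pstdev(known_ages)
    plans = sorted({plan for _, plan in rows})  # fixed column order

    features = []
    for age, plan in rows:
        age = age_mean if age is None else age  # fill missing values with the mean
        scaled = (age - age_mean) / age_std if age_std else 0.0
        one_hot = [1.0 if plan == p else 0.0 for p in plans]
        features.append([scaled] + one_hot)
    return features


# Hypothetical raw records: one numeric column with a gap, one categorical column.
raw = [(20, "basic"), (None, "pro"), (40, "basic")]
features = preprocess(raw)
print(features)
```

Each output row is a fixed-length numeric vector, which is the kind of predefined format a model such as Linear Learner expects.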

Question3: What setting, when creating an Amazon SageMaker notebook instance, can you use to install libraries and import data?

Answer3: LifeCycle Configuration

Question4: How to choose the right SageMaker built-in algorithm?

Guide to choosing the right unsupervised learning algorithm

Choosing the right ML algorithm based on Data Type

This is a general guide for choosing which algorithm to use depending on what business problem you have and what data you have.

Top 10 Google Professional Machine Learning Engineer Sample Questions

Question 1: You work for a textile manufacturer and have been asked to build a model to detect and classify fabric defects. You trained a machine learning model with high recall based on high resolution images taken at the end of the production line. You want quality control inspectors to gain trust in your model. Which technique should you use to understand the rationale of your classifier?

A. Use K-fold cross validation to understand how the model performs on different test datasets.

B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.

C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.

D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.

Answer 1) B

Notes 1)

B is correct because Integrated Gradients identifies the pixels of the input image that contribute most to its classification.
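To make the idea of feature attribution concrete, here is a minimal sketch of Integrated Gradients on a toy two-feature function instead of an image classifier. The `model` and `gradient` functions are hypothetical stand-ins with a hand-written gradient; a real image model would get its gradients from the framework's automatic differentiation.

```python
def model(x):
    # Hypothetical differentiable score over two input features.
    return x[0] ** 2 + 3.0 * x[1]


def gradient(x):
    # Analytic gradient of model() with respect to each input feature.
    return [2.0 * x[0], 3.0]


def integrated_gradients(x, baseline, steps=50):
    """Approximate the path integral of gradients from baseline to x."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint rule along the straight-line path
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            totals[i] += g
    # Scale the average gradient by how far each feature moved from the baseline.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


x, baseline = [2.0, 1.0], [0.0, 0.0]
attributions = integrated_gradients(x, baseline)
print(attributions)
```

A useful sanity check is that the attributions add up to the difference between the model output at the input and at the baseline.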

Question 2: You need to write a generic test to verify whether Dense Neural Network (DNN) models automatically released by your team have a sufficient number of parameters to learn the task for which they were built. What should you do?

A. Train the model for a few iterations, and check for NaN values.

B. Train the model for a few iterations, and verify that the loss is constant.

C. Train a simple linear model, and determine if the DNN model outperforms it.

D. Train the model with no regularization, and verify that the loss function is close to zero.

Answer 2) D

Notes 2)

D is correct because the test can check that the model has enough parameters to memorize the task.
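The memorization check can be sketched on a toy problem. This is only an illustration under simple assumptions: a tiny linear model trained by plain gradient descent stands in for the DNN, and the `train` function, the data, and the `use_weight` switch are hypothetical.

```python
def train(xs, ys, use_weight, steps=2000, lr=0.1):
    """Fit y = w*x + b by gradient descent with no regularization; return final MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        b -= lr * 2.0 * sum(errors) / n
        if use_weight:  # the smaller model has no weight to update
            w -= lr * 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
    return sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / n


xs = [-1.0, -0.5, 0.5, 1.0]
ys = [3.0 * x - 1.0 for x in xs]  # the task the model should be able to memorize

full_loss = train(xs, ys, use_weight=True)    # enough parameters
small_loss = train(xs, ys, use_weight=False)  # bias only, too few parameters
print(full_loss, small_loss)
```

The model with enough parameters drives the training loss to essentially zero, while the bias-only model plateaus well above it, which is the signal a generic test would assert on.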

Question 3: Your team is using a TensorFlow Inception-v3 CNN model pretrained on ImageNet for an image classification prediction challenge on 10,000 images. You will use AI Platform to perform the model training. What TensorFlow distribution strategy and AI Platform training job configuration should you use to train the model and optimize for wall-clock time?

A. Default Strategy; Custom tier with a single master node and four v100 GPUs.

B. One Device Strategy; Custom tier with a single master node and four v100 GPUs.

C. One Device Strategy; Custom tier with a single master node and eight v100 GPUs.

D. Central Storage Strategy; Custom tier with a single master node and four v100 GPUs.

Answer 3) D

Notes 3)

D is correct because it is the only listed strategy that can perform distributed training across the GPUs, although only a single copy of the variables is kept on the CPU host.

Question 4: You work on a team where the process for deploying a model into production starts with data scientists training different versions of models in a Kubeflow pipeline. The workflow then stores the new model artifact into the corresponding Cloud Storage bucket. You need to build the next steps of the pipeline after the submitted model is ready to be tested and deployed in production on AI Platform. How should you configure the architecture before deploying the model to production?

A. Deploy model in test environment -> Validate model -> Create a new AI Platform model version

B. Validate model -> Deploy model in test environment -> Create a new AI Platform model version

C. Create a new AI Platform model version -> Validate model -> Deploy model in test environment

D. Create a new AI Platform model version -> Deploy model in test environment -> Validate model

Answer 4) A

Notes 4)

A is correct because the model can be validated after it is deployed to the test environment, and the release version is established before the model is deployed in production.

Question 5: You work for a maintenance company and have built and trained a deep learning model that identifies defects based on thermal images of underground electric cables. Your dataset contains 10,000 images, 100 of which contain visible defects. How should you evaluate the performance of the model on a test dataset?

A. Calculate the Area Under the Curve (AUC) value.

B. Calculate the number of true positive results predicted by the model.

C. Calculate the fraction of images predicted by the model to have a visible defect.

D. Calculate the Cosine Similarity to compare the model’s performance on the test dataset to the model’s performance on the training dataset.

Answer 5) A

Notes 5)

A is correct because it is scale-invariant. AUC measures how well predictions are ranked, rather than their absolute values. AUC is also classification-threshold invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.
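The ranking interpretation of AUC can be shown directly. The sketch below computes AUC as the fraction of positive/negative pairs in which the positive example gets the higher score; the `roc_auc` function and the sample labels and scores are hypothetical, and the pairwise loop is meant for illustration rather than large data sets.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical imbalanced test set: 4 negatives, 2 positives.
labels = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.8, 0.7, 0.9]

auc = roc_auc(labels, scores)
rescaled_auc = roc_auc(labels, [s / 10 for s in scores])  # same ranking, new scale
print(auc, rescaled_auc)
```

Rescaling the scores leaves the result unchanged because only the ordering matters, and no classification threshold appears anywhere in the computation.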

Question 6: You work for a manufacturing company that owns a high-value machine which has several machine settings and multiple sensors. A history of the machine’s hourly sensor readings and known failure event data are stored in BigQuery. You need to predict if the machine will fail within the next 3 days in order to schedule maintenance before the machine fails. Which data preparation and model training steps should you take?

A. Data preparation: Daily max value feature engineering with DataPrep; Model training: AutoML classification with BQML

B. Data preparation: Daily min value feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True

C. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to False

D. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True

Answer 6) D

Notes 6)

D is correct because it uses the rolling average of the sensor data and balances the weights using the BQML auto class weight balance parameter.
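Both ideas in this answer are easy to sketch outside BigQuery. The code below is an illustration only: `rolling_mean` and `balanced_class_weights` are hypothetical helpers, and the weighting formula is one common inverse-frequency convention, assumed here for illustration rather than taken from BQML's implementation.

```python
from collections import Counter


def rolling_mean(values, window):
    """Trailing mean over the last `window` readings (shorter at the start)."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def balanced_class_weights(labels):
    """Weight each class in inverse proportion to its frequency."""
    counts = Counter(labels)
    total, n_classes = len(labels), len(counts)
    return {cls: total / (n_classes * count) for cls, count in counts.items()}


# Hypothetical hourly sensor readings with one spike.
readings = [10.0, 12.0, 11.0, 30.0, 13.0]
smoothed = rolling_mean(readings, window=3)

# Hypothetical labels: failures (1) are rare compared with normal hours (0).
weights = balanced_class_weights([0] * 9 + [1])
print(smoothed, weights)
```

The rolling mean dampens single-reading noise while keeping the trend, and the rare failure class receives a much larger weight so the classifier cannot ignore it.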

Question 7: You are an ML engineer at a media company. You need to build an ML model to analyze video content frame-by-frame, identify objects, and alert users if there is inappropriate content. Which Google Cloud products should you use to build this project?

A. Pub/Sub, Cloud Function, Cloud Vision API

B. Pub/Sub, Cloud IoT, Dataflow, Cloud Vision API, Cloud Logging

C. Pub/Sub, Cloud Function, Video Intelligence API, Cloud Logging

D. Pub/Sub, Cloud Function, AutoML Video Intelligence, Cloud Logging

Answer 7) C

Notes 7)

C is correct as Video Intelligence API can find inappropriate components and other components satisfy the requirements of real-time processing and notification.

Question 8: You work for a large retailer. You want to use ML to forecast future sales leveraging 10 years of historical sales data. The historical data is stored in Cloud Storage in Avro format. You want to rapidly experiment with all the available data. How should you build and train your model for the sales forecast?

A. Load data into BigQuery and use the ARIMA model type on BigQuery ML.

B. Convert the data into CSV format and create a regression model on AutoML Tables.

C. Convert the data into TFRecords and create an RNN model on TensorFlow on AI Platform Notebooks.

D. Convert and refactor the data into CSV format and use the built-in XGBoost algorithm on AI Platform Training.

Answer 8) A

Notes 8)

A is correct because BigQuery ML is designed for fast and rapid experimentation and it is possible to use federated queries to read data directly from Cloud Storage. Moreover, ARIMA is considered one of the best in class for time series forecasting.

Question 9: You need to build an object detection model for a small startup company to identify if and where the company’s logo appears in an image. You were given a large repository of images, some with logos and some without. These images are not yet labelled. You need to label these pictures, and then train and deploy the model. What should you do?

A. Use Google Cloud’s Data Labelling Service to label your data. Use AutoML Object Detection to train and deploy the model.

B. Use Vision API to detect and identify logos in pictures and use it as a label. Use AI Platform to build and train a convolutional neural network.

C. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a convolutional neural network.

D. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a real time object detection model.

Answer 9) A

Notes 9)

A is correct as this will allow you to easily create a request for a labelling task and deploy a high-performance model.

Question 10: You work for a large financial institution that is planning to use Dialogflow to create a chatbot for the company’s mobile app. You have reviewed old chat logs and tagged each conversation for intent based on each customer’s stated intention for contacting customer service. About 70% of customer inquiries are simple requests that are solved within 10 intents. The remaining 30% of inquiries require much longer and more complicated requests. Which intents should you automate first?

A. Automate a blend of the shortest and longest intents to be representative of all intents.

B. Automate the more complicated requests first because those require more of the agents’ time.

C. Automate the 10 intents that cover 70% of the requests so that live agents can handle the more complicated requests.

D. Automate intents in places where common words such as “payment” only appear once to avoid confusing the software.

Answer 10)

Notes 10)

Machine Learning Q&A Part I:

The Complete Python Course for Machine Learning Engineers

Google.

Azure and AWS are second class citizens in this area.

Sure, AWS has 70% of the market.

Sure, Azure is the easiest turn key and super user friendly.

But, the king of machine learning in the cloud is GCP.

GCP = Google Cloud Platform

Google has the largest data science team in the world, not to mention they have Hinton.

Let’s forget for a minute that they created TensorFlow and gave it away.

Let’s just talk about building a real-world model with data that doesn’t fit into an Excel spreadsheet.

The vast majority of applied machine learning is supervised and that means we need data.

Not just normal data, we need very clean highly structured data.

Where’s the easiest place in the world to upload and model a Petabyte of structured data? BigQuery of course.

Why BigQuery? I don’t have to do anything but upload my data. No spinning up Redshift clusters or whatever I have to do in Azure, just upload and massage data with my familiar SQL. If I do have to wrangle my data, it won’t take me six months to update 5 rows here; it usually takes minutes.

Then, you’ll need a front end. Cloud Datalab is a Jupyter notebook, which is good because I neither want nor need anything else.

Then, with a single line of code I connect my Datalab (Jupyter) notebook to my data in BigQuery and build away.

I’ve worked in all three and the only thing I care about is getting to my job the fastest and right now that means I build my models in GCP.

If you’re new to machine learning don’t start in GCP or any cloud vendor for that matter. Start learning Python from the comfort of your laptop.

The course below is free to the first 20.

What are the machine learning classification techniques?

Here, I want to share the best research paper on Machine Learning classification methods, titled ‘Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?’, published in the ‘Journal of Machine Learning Research’.

This paper evaluates 179 classification techniques on 121 data sets. Here is a short summary of the paper:

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

https://jmlr.org/papers/v15/delgado14a.html

The paper evaluated 179 classifiers arising from 17 ML families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbours, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods), implemented in Weka, R (with and without the caret package), C and Matlab, including all the relevant classifiers available today.

The experiments used 121 data sets in total, which represent the whole UCI database (excluding the large-scale problems) plus the authors’ own real problems, in order to reach significant conclusions about classifier behaviour that do not depend on the data set collection.

The whole data set and partitions are available from: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz

The classifiers most likely to be the best are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy, exceeding 90% in 84.3% of the data sets. However, the difference is not statistically significant with the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy. A few models are clearly better than the remaining ones: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package).

The random forest is clearly the best family of classifiers (3 out of the 5 best classifiers are RF), followed by SVM (4 classifiers in the top-10), neural networks and boosting ensembles (5 and 3 members in the top-20, respectively).

You can see the table with the complete results: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/results.txt

I hope it will be helpful for statistics and machine learning aspirants!

Thank you!

What is the best way to know which machine learning algorithm has a better probability to accurately or more precisely classify a dataset, before applying it?

These basic questions should help:

1. Is the classification going to be supervised or unsupervised? Several well-defined techniques such as SVMs (Support Vector Machines) and trained neural networks are applicable to supervised classification. For unsupervised classification, GMMs (Gaussian Mixture Models) or HMMs (Hidden Markov Models) with Bayesian techniques could be used. (Several other techniques could of course be used as well.)

2. How much training data do you have, if the problem is supervised? A small amount of training data may yield discouraging classification accuracy even if the chosen classifier is the most suitable one for the problem. In that case, try to obtain more samples. There is also generally a relationship (for practical purposes at least) between the feature dimensionality and the number of samples needed for a given technique. For example, with an SVM, the linear kernel tends to yield better results than RBF or other kernels when the number of training samples is less than, equal to, or only slightly more than the number of feature dimensions.

3. If the feature vector dimensionality is small enough (1, 2, or 3 dimensions), it makes sense to plot the data and visually inspect whether techniques like clustering could be more useful. With a very high number of feature dimensions, methods like clustering are generally not advisable (refer to “The Curse of Dimensionality”).

4. Are you doing classification in real time? Some techniques, e.g. “Template Match” in image classification, may lead to a higher number of errors but are generally faster than most other techniques if the number of templates to be evaluated is not excessively high.

5. Depending on the problem domain, you may be able to choose the underlying model so that it exploits temporal or spatial correlations inherent in the data. For example, HMMs use the temporal continuity of speech samples to enhance classification results in speech recognition problems.

Another point, slightly off topic perhaps: classification performance is as much a function of choosing the correct feature vectors and pre-processing them as of the classifier itself. It’s generally a good idea to reserve some initial part of the project for trying out various classifiers on the same data set. It may at least help you reject the ones that are highly inaccurate.

What are the applications of probability theory in statistics, machine learning, artificial intelligence, economics, commerce, and business intelligence?

Which language should I start to learn machine learning for the first time? Python, R, or Julia? And why?

What are the fundamental mathematical requirements for understanding machine learning, as a novice/hobbyist dev?

What is the best image segmentation pre-trained method publicly available? Currently using DeepLab demo (link attached) but I’m looking for a better version of this.

Using public photos from the internet, they were able to reconstruct viewpoints of a scene conserving the realistic shadows and lightings. Would it be possible to do this efficiently and just tag a place in Google map and get the 3D scene from it?

Do you think GPT-3 will change our lives, or is it just hype? Are the applications really useful and real, in the real-world, or are they only the hand-picked results by the researchers and startup to get some hype around them and followers?

Suggest some really good ML projects? Which can be worth adding to CV or resume?

Is it possible to finish Andrew Ng’s course on ML in 15 days?

I’ve reached a dead end with my algorithm for Exact Three Cover, and it’s supposedly trash. What makes it trash?

How do I get started with learning Machine Learning by myself?

Which is preferable in machine learning: model A with 50% accuracy on training data and 97% on test data, or model B with 80% accuracy on training data and 75% accuracy on test data?

I’ve seen you emphasize mathematical knowledge being important in data science/machine learning. Mike West emphasizes SQL and Python skills (the former especially) as being the most important. Where does this difference in opinion stem from?

If I am a beginner at machine learning and my current goal is to publish a good paper within 6 months, how should I start, considering I have no experience in publications, and what ML topic should I pick in 2020?

What are the differences between the Bayesian and Frequentist methods within machine learning?

Why is the Gaussian distribution widely used in practice?

What is the Confusion Matrix in Machine Learning?- Simplest Explanation!

Is Julia’s syntax even more intuitive than Python’s?

Andrew Ng: What is the Future of Deep Reinforcement Learning (DL + RL)?

How is TensorFlow/Keras capable of computing seemingly non-smooth loss functions such as max?

What is the first thing you do when looking at a new data set?

How do I get data science, computer vision, and machine learning tutorial apps?

Reinforcement learning (RL) is an area of machine learning concerned with how software agents ought to take actions in an environment in order to maximize the notion of cumulative reward. Reinforcement learning is one of three basic machine learning

Announcing our new Professional Machine Learning Engineer certification…

What is the best machine learning paper you have read in 2018?

Does every paper in machine learning introduce a new algorithm?

Predicting Credit Card Approvals using ML Techniques

Android Document Scanner with offline OCR application

What are the recommended data pre-processing methods for tree-based machine learning models?

What are some interesting statistics about the growing rate of the Julia programming language?

Is it possible to learn machine learning without prior knowledge in any coding language?

What skills does a data scientist need to learn in order to put machine learning models in production?

At a high level, these skills are a combination of software and data engineering.

The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.

That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:

Well-structured code: it doesn’t need to be perfect, but it should at least be understandable and updatable by other team members. Avoid spaghetti code like the plague.
Add logs: if you are a Python user, the logging module is your friend. Avoid print statements at any cost.
Model versioning: add a hash key to your different models. You will thank me later.
Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
Monitor performance: track the execution time and statistical scores of your models.
Data and model management: store the necessary data and models somewhere that is available to everyone (S3, for example). Avoid uploading these to your VCS. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read: often).
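Several of the habits above (logging, a hash key per model, and saved metadata) fit together in a few lines. This is a minimal sketch using only the standard library; the `model_version` and `build_metadata` helpers, the logger name, the placeholder model bytes, and every value in the metadata are hypothetical.

```python
import hashlib
import json
import logging
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(name)s: %(message)s")
logger = logging.getLogger("training")  # hypothetical logger name


def model_version(model_bytes):
    """Return a short content hash to use as the model's version key."""
    return hashlib.sha256(model_bytes).hexdigest()[:12]


def build_metadata(model_bytes, hyperparameters, features, cv_score, seconds):
    """Collect the experiment details worth saving next to the model."""
    return {
        "version": model_version(model_bytes),
        "hyperparameters": hyperparameters,
        "features": features,
        "cv_score": cv_score,
        "running_time_s": round(seconds, 3),
    }


start = time.perf_counter()
model_bytes = b"placeholder for a serialized model"  # stands in for real training
metadata = build_metadata(
    model_bytes,
    hyperparameters={"max_depth": 4},  # placeholder values, not real results
    features=["age", "plan"],
    cv_score=0.91,
    seconds=time.perf_counter() - start,
)
logger.info("model trained: %s", json.dumps(metadata, sort_keys=True))
```

Because the version key is derived from the model's bytes, two identical artifacts always get the same key and any change to the model produces a new one.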

What are some mistakes data scientists make when building machine learning models?

Some of the mistakes that can occur while building a machine learning model are listed here:

Not understanding the structure of the dataset
Not taking proper care during feature selection
Leaving out categorical features and considering only numerical variables
Falling into the dummy variable trap
Selecting an inefficient machine learning algorithm
Not trying out various ML algorithms for building the model based on the structure of the data
Improper tuning of model parameters
Most importantly: building a naive model that ignores class imbalance. Suppose we have a classification problem where 99% of the samples fall into class1 and the rest into class2. The model may learn a mapping that predicts class1 for every input. One might say the model has 99% accuracy, but in reality it never identifies the 1% of class2 cases. This must be taken into consideration.
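The class-imbalance mistake is easy to reproduce. The sketch below scores a hypothetical model that always predicts class1 on data with a 99-to-1 split; the `accuracy` and `recall` helpers are written out by hand for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of the true `positive` cases that the model found."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in relevant) / len(relevant)


# 99 samples of class1, 1 sample of class2, and a model that always says class1.
y_true = ["class1"] * 99 + ["class2"]
y_pred = ["class1"] * 100

acc = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, positive="class2")
print(acc, minority_recall)
```

Accuracy looks excellent while recall on class2 is zero, which is why imbalanced problems need a per-class metric alongside accuracy.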

What is the difference between data analytics and data mining?

Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.

That’s just the surface-level comparison though.

One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected outputs are patterns or trends, which don’t require coming up with a statement or fact to test.

However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis.

What is the life cycle of a data science project?

The data science life cycle is not something well-defined like the software development life-cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life-cycle of a data science project depends on various data scientist skills and data science tools. The typical life-cycle of a data science project involves jumping back and forth among various interdependent data science tasks using a variety of tools, techniques, programming, etc.

Thus, the data science life-cycle can include the following steps:

Business requirement understanding.
Data collection.
Data cleaning.
Data analysis.
Modeling.
Performance evaluation.
Communicating with stakeholders.
Deployment.
Real-world testing.
Business buy-in.
Support and maintenance.

This looks neat, but in reality the steps rarely follow one another in such a linear order.

Agile development processes, especially continuous delivery, lend themselves well to the data science project life-cycle. The early comparison helps the data science team to change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build them.

Machine Learning Q&A -Part II:

Which language should I start to learn machine learning for the first time? Python, R, or Julia? And why?

What are the fundamental mathematical requirements for understanding machine learning, as a novice/hobbyist dev?

What is the best image segmentation pre-trained method publicly available? Currently using DeepLab demo (link attached) but I’m looking for a better version of this.

Suggest some really good ML projects? Which can be worth adding to CV or resume?

Is it possible to finish Andrew Ng’s course on ML in 15 days?

I’ve reached a dead end with my algorithm for Exact Three Cover, and it’s supposedly trash. What makes it trash?

How do I get started with learning Machine Learning by myself?

What are the differences between the Bayesian and Frequentist methods within machine learning?

Why is the Gaussian distribution widely used in practice?

What is the Confusion Matrix in Machine Learning?- Simplest Explanation!

Is Julia’s syntax even more intuitive than Python’s?

Andrew Ng: What is the Future of Deep Reinforcement Learning (DL + RL)?

How is TensorFlow/Keras capable of computing seemingly non-smooth loss functions such as max?

What is the first thing you do when looking at a new data set?

How do I get data science, computer vision, and machine learning tutorial apps?

Announcing our new Professional Machine Learning Engineer certification…

What is the best machine learning paper you have read in 2018?

Does every paper in machine learning introduce a new algorithm?

Predicting Credit Card Approvals using ML Techniques

Android Document Scanner with offline OCR application

What are the recommended data pre-processing methods for tree-based machine learning models?

What are some interesting statistics about the growing rate of the Julia programming language?

Is it possible to learn machine learning without prior knowledge in any coding language?

What skills does a data scientist need to learn in order to put machine learning models in production?

At a high level, these skills are a combination of software and data engineering.

The people best suited to this job are data engineers and/or machine learning engineers.

That being said, if you work at a startup or a small company and need to put the models into production yourself, here are the top skills you need:

Well-structured code: it doesn’t need to be perfect, but it should at least be understandable and updatable by other team members. Avoid spaghetti code[1] like the plague.
Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
Model versioning: add a hash key to your different models. You will thank me later.
Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, features used, CV scores, and so on). You will thank me later, again.
Monitor performance: track the execution time and statistical scores of your models.
Data and model management: store the necessary data and models somewhere that is available to everyone (S3[3], for example). Avoid uploading these to your VCS[4]. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read: often). Read more here …..
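As a rough illustration of the logging, versioning, and metadata points above, here is a minimal Python sketch. The function names (model_version, build_metadata), the 12-character hash length, and the metadata fields are assumptions chosen for the example, not a prescribed scheme.

```python
import hashlib
import json
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("training")


def model_version(model_bytes: bytes, hyperparameters: dict) -> str:
    """Return a short, deterministic hash for a model artifact and its settings."""
    digest = hashlib.sha256()
    digest.update(model_bytes)
    # sort_keys makes the hash independent of dict insertion order.
    digest.update(json.dumps(hyperparameters, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()[:12]  # 12 characters is an arbitrary choice


def build_metadata(model_bytes, hyperparameters, features, cv_score, running_time_s):
    """Collect experiment metadata in one dict and log it instead of printing."""
    metadata = {
        "version": model_version(model_bytes, hyperparameters),
        "hyperparameters": hyperparameters,
        "features": list(features),
        "cv_score": cv_score,
        "running_time_s": running_time_s,
    }
    logger.info("trained model %s (cv_score=%.3f)", metadata["version"], cv_score)
    return metadata


if __name__ == "__main__":
    # Hypothetical values standing in for a real serialized model.
    meta = build_metadata(b"weights", {"max_depth": 3, "eta": 0.1}, ["age"], 0.91, 12.5)
    print(json.dumps(meta, indent=2))
```

The resulting dict can be written next to the model artifact (for example in the same S3 prefix) so that every stored model carries its own description.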

What are some mistakes data scientists make when building machine learning models?

Some of the mistakes that can occur while building a machine learning model (that I can think of) are listed here:

Not understanding the structure of the dataset
Not taking proper care during feature selection
Leaving out categorical features and considering just numerical variables
Falling into the dummy variable trap
Selecting an inefficient machine learning algorithm
Not trying out various ML algorithms based on the structure of the data
Improper tuning of model parameters
Most importantly: building a naive model on imbalanced data. Suppose we have a classification problem in which 99% of the examples fall into class1 and the remaining 1% into class2. The model may learn a mapping that predicts class1 for every input. One might then say the model has 99% accuracy, but in reality it never identifies the 1% of class2 cases. This must be taken into consideration.
Read more here…
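The class-imbalance mistake in the last item can be reproduced in a few lines of plain Python. This is only a sketch with made-up labels (99 examples of class1, 1 of class2); the helper names accuracy and recall are defined here for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of the examples of class `positive` that the model found."""
    preds_for_positive = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positive:
        return 0.0
    return sum(p == positive for p in preds_for_positive) / len(preds_for_positive)


# Made-up data: 99% class1, 1% class2.
y_true = ["class1"] * 99 + ["class2"] * 1
# A degenerate "model" that always predicts the majority class.
y_pred = ["class1"] * 100

print("accuracy:", accuracy(y_true, y_pred))
print("recall for class2:", recall(y_true, y_pred, "class2"))
```

Accuracy looks excellent while recall for class2 is zero, which is why per-class metrics are worth checking on imbalanced data.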

What is the difference between data analytics and data mining?

That’s just the surface-level comparison though. The image above gives an overview of how the two differ.

iOS: https://apps.apple.com/ca/app/aws-machine-learning-prep-pro/id1611045854

Windows: https://www.microsoft.com/en-ca/p/aws-machine-learning-mls-c01-specialty-certification-exam-prep/9n8rl80hvm4t

Android/Amazon: https://www.amazon.com/gp/product/B09TZ4H8V6

Domain 1: Data Engineering

Create data repositories for machine learning.

Identify data sources (e.g., content and location, primary sources such as user data)

Determine storage mediums (e.g., DB, Data Lake, S3, EFS, EBS)

Identify and implement a data ingestion solution.

Data job styles/types (batch load, streaming)

Data ingestion pipelines (Batch-based ML workloads and streaming-based ML workloads), etc.

Domain 2: Exploratory Data Analysis

Sanitize and prepare data for modeling.

Perform feature engineering.

Analyze and visualize data for machine learning.

Domain 3: Modeling

Frame business problems as machine learning problems.

Select the appropriate model(s) for a given machine learning problem.

Train machine learning models.

Perform hyperparameter optimization.

Evaluate machine learning models.

Domain 4: Machine Learning Implementation and Operations

Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.

Recommend and implement the appropriate machine learning services and features for a given problem.

Apply basic AWS security practices to machine learning solutions.

Deploy and operationalize machine learning solutions.

Amazon Comprehend

AWS Deep Learning AMIs (DLAMI)

AWS DeepLens

Amazon Forecast

Amazon Fraud Detector

Amazon Lex

Amazon Polly

Amazon Rekognition

Amazon SageMaker

Amazon Textract

Amazon Transcribe

Amazon Translate

Other Services and topics covered are:

Ingestion/Collection

Processing/ETL

Data analysis/visualization

Model training

Model deployment/inference

Operational

AWS ML application services

Language relevant to ML (for example, Python, Java, Scala, R, SQL)

Notebooks and integrated development environments (IDEs)

S3, SageMaker, Kinesis, Lake Formation, Athena, Kibana, Redshift, Textract, EMR, Glue, CSV, JSON, IMG, Parquet or databases

Amazon EC2, Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Container Service, Amazon Elastic Kubernetes Service, Amazon Redshift

Sagemaker API Explained:

AWS Certified Machine Learning Engineer Specialty Questions and Answers:

Question1: An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, and do so with minimal changes to the existing code.

Answer1: Use the Build Your Own Container (BYOC) Amazon SageMaker option.
Create a new Docker container with the existing code. Register the container in Amazon Elastic Container Registry (Amazon ECR). Finally, run the training and inference jobs using this container.
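To make the last step of the answer concrete, the sketch below builds the request that could be passed to the SageMaker create_training_job API once the container is in ECR. The image URI, role ARN, bucket, job name, and instance type are hypothetical placeholders, and the actual boto3 call is left as a comment so the snippet runs without AWS credentials.

```python
def byoc_training_request(job_name, image_uri, role_arn, output_s3_uri,
                          instance_type="ml.m5.xlarge", max_runtime_s=3600):
    """Build keyword arguments for SageMaker's create_training_job with a custom image."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            # The custom container (here: the existing R code) registered in ECR.
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_s},
    }


# All identifiers below are placeholders, not real resources.
request = byoc_training_request(
    job_name="r-xgboost-byoc-demo",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/r-xgboost:latest",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    output_s3_uri="s3://example-bucket/model-artifacts/",
)

# With credentials configured, the job could be started roughly like this:
# import boto3
# boto3.client("sagemaker").create_training_job(**request)
print(request["AlgorithmSpecification"]["TrainingImage"])
```

Input channels (InputDataConfig) are omitted here for brevity; a real job would normally point at training data in S3 as well.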

Question2: Which feature of Amazon SageMaker can you use for preprocessing the data?

Amazon SageMaker training jobs
Amazon SageMaker hyperparameter tuning
Amazon SageMaker notebook instances
Amazon SageMaker endpoints

Answer2: Amazon SageMaker notebook instances

Question3: What setting, when creating an Amazon SageMaker notebook instance, can you use to install libraries and import data?

Answer3: Lifecycle configuration
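A lifecycle configuration is essentially a shell script that runs when the notebook instance is created or started. The sketch below prepares such a script for the create_notebook_instance_lifecycle_config API, which expects the script content base64-encoded. The script body, bucket name, and configuration name are hypothetical, and the API call itself is left as a comment.

```python
import base64

# Hypothetical on-start script: install a library and copy a dataset.
ON_START_SCRIPT = """#!/bin/bash
set -e
pip install --quiet xgboost
aws s3 cp s3://example-bucket/data/train.csv /home/ec2-user/SageMaker/train.csv
"""


def lifecycle_config_request(name, on_start_script):
    """Build keyword arguments for create_notebook_instance_lifecycle_config."""
    encoded = base64.b64encode(on_start_script.encode("utf-8")).decode("ascii")
    return {
        "NotebookInstanceLifecycleConfigName": name,
        # OnStart runs every time the instance starts; OnCreate runs only once.
        "OnStart": [{"Content": encoded}],
    }


request = lifecycle_config_request("install-libs-and-data", ON_START_SCRIPT)

# With credentials configured, the configuration could be registered roughly like this:
# import boto3
# boto3.client("sagemaker").create_notebook_instance_lifecycle_config(**request)
print(request["NotebookInstanceLifecycleConfigName"])
```

In practice the pip command may need to target a specific notebook kernel environment; that detail is left out of this sketch.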

Question4: How do you choose the right SageMaker built-in algorithm?

This is a general guide for choosing which algorithm to use depending on what business problem you have and what data you have.

Top 10 Google Professional Machine Learning Engineer Sample Questions

A. Use K-fold cross validation to understand how the model performs on different test datasets.

B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.

C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.

D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.

Answer 1) B

Notes 1)

B is correct because it identifies the pixel of the input image that leads to the classification of the image itself.
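For intuition about what Integrated Gradients computes, here is a small numerical sketch in plain Python. It uses a toy two-input function with a hand-written gradient instead of an image model, an all-zeros baseline, and a midpoint Riemann sum; all of these are simplifying assumptions for illustration.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate Integrated Gradients attributions with a midpoint Riemann sum."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight path from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        for i in range(n):
            avg_grad[i] += grad[i] / steps
    # Attribution = (input - baseline) * average gradient along the path.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


def f(p):
    """Toy 'model': f(x0, x1) = x0^2 + 3*x1."""
    return p[0] ** 2 + 3.0 * p[1]


def grad_f(p):
    """Analytic gradient of f."""
    return [2.0 * p[0], 3.0]


x = [2.0, 1.0]
baseline = [0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)
print(attributions)
# Completeness check: attributions sum to f(x) - f(baseline).
print(sum(attributions), f(x) - f(baseline))
```

For an image classifier the same idea applies per pixel, with the gradient supplied by the framework's automatic differentiation rather than written by hand.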

A. Train the model for a few iterations, and check for NaN values.

B. Train the model for a few iterations, and verify that the loss is constant.

C. Train a simple linear model, and determine if the DNN model outperforms it.

D. Train the model with no regularization, and verify that the loss function is close to zero.

Answer 2) D

Notes 2)

D is correct because the test can check that the model has enough parameters to memorize the task.
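The idea behind this sanity check can be shown at toy scale: train without any regularization on a tiny dataset and confirm that the training loss goes to (nearly) zero. The sketch below uses a one-feature linear model and plain gradient descent as a stand-in for the DNN in the question; the data, learning rate, and step count are made up.

```python
def fit_without_regularization(xs, ys, lr=0.1, steps=2000):
    """Fit y = w*x + b by gradient descent on mean squared error, no regularization."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b
    loss = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n
    return w, b, loss


# Tiny dataset generated from y = 2x + 1, so a perfect fit exists.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b, loss = fit_without_regularization(xs, ys)
print(f"w={w:.4f} b={b:.4f} training loss={loss:.2e}")
```

If the loss of a real model plateaus well above zero in the same setting, that can point to a bug in the data pipeline, the loss, or the optimizer rather than to a lack of capacity.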

A. Default Strategy; Custom tier with a single master node and four v100 GPUs.

B. One Device Strategy; Custom tier with a single master node and four v100 GPUs.

C. One Device Strategy; Custom tier with a single master node and eight v100 GPUs.

D. Central Storage Strategy; Custom tier with a single master node and four v100 GPUs.

Answer 3) D

Notes 3)

D is correct because this is the only strategy that can perform distributed training; albeit there is only a single copy of the variables on the CPU host.

A. Deploy model in test environment -> Validate model -> Create a new AI Platform model version

B. Validate model -> Deploy model in test environment -> Create a new AI Platform model version

C. Create a new AI Platform model version -> Validate model -> Deploy model in test environment

D. Create a new AI Platform model version -> Deploy model in test environment -> Validate model

Answer 4) A

Notes 4)

A is correct because the model can be validated after it is deployed to the test environment, and the release version is established before the model is deployed in production.

A. Calculate the Area Under the Curve (AUC) value.

B. Calculate the number of true positive results predicted by the model.

C. Calculate the fraction of images predicted by the model to have a visible defect.

D. Calculate the Cosine Similarity to compare the model’s performance on the test dataset to the model’s performance on the training dataset.

Answer 5)

Notes 5)

A. Data preparation: Daily max value feature engineering with DataPrep; Model training: AutoML classification with BQML

B. Data preparation: Daily min value feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True

C. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to False

D. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True

Answer 6) D

Notes 6)

D is correct because it uses the rolling average of the sensor data and balances the weights using the BQML auto class weight balance parameter.
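The two ingredients named in this note can be sketched in plain Python: a rolling average over sensor readings, and class weights that are inversely proportional to class frequency. The weight formula below is one common "balanced" heuristic and is an assumption for illustration; it is not claimed to be the exact computation BQML performs when AUTO_CLASS_WEIGHTS is set to True.

```python
from collections import Counter


def rolling_average(values, window):
    """Trailing rolling mean; early positions use as many values as are available."""
    out = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up sensor readings and failure labels (1 = failure, rare).
readings = [1.0, 2.0, 3.0, 4.0, 5.0]
labels = [0] * 9 + [1]

print(rolling_average(readings, window=3))
print(balanced_class_weights(labels))
```

The rolling average smooths out noisy individual readings, and the weights make the rare failure class count for more during training.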

A. Pub/Sub, Cloud Function, Cloud Vision API

B. Pub/Sub, Cloud IoT, Dataflow, Cloud Vision API, Cloud Logging

C. Pub/Sub, Cloud Function, Video Intelligence API, Cloud Logging

D. Pub/Sub, Cloud Function, AutoML Video Intelligence, Cloud Logging

Answer 7) C

Notes 7)

C is correct as Video Intelligence API can find inappropriate components and other components satisfy the requirements of real-time processing and notification.

A. Load data into BigQuery and use the ARIMA model type on BigQuery ML.

B. Convert the data into CSV format and create a regression model on AutoML Tables.

C. Convert the data into TFRecords and create an RNN model on TensorFlow on AI Platform Notebooks.

D. Convert and refactor the data into CSV format and use the built-in XGBoost algorithm on AI Platform Training.

Answer 8)

Notes 8)

A. Use Google Cloud’s Data Labelling Service to label your data. Use AutoML Object Detection to train and deploy the model.

B. Use Vision API to detect and identify logos in pictures and use it as a label. Use AI Platform to build and train a convolutional neural network.

C. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a convolutional neural network.

D. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a real time object detection model.

Answer 9) A

Notes 9)

A is correct as this will allow you to easily create a request for a labelling task and deploy a high-performance model.

A. Automate a blend of the shortest and longest intents to be representative of all intents.

B. Automate the more complicated requests first because those require more of the agents’ time.

C. Automate the 10 intents that cover 70% of the requests so that live agents can handle the more complicated requests.

D. Automate intents in places where common words such as “payment” only appear once to avoid confusing the software.

Answer 10)

Notes 10)

Machine Learning Q&A Part I:

The Complete Python Course for Machine Learning Engineers

Google.

Azure and AWS are second class citizens in this area.

Sure, AWS has 70% of the market.

Sure, Azure is the easiest turn key and super user friendly.

But, the king of machine learning in the cloud is GCP.

GCP = Google Cloud Platform

Google has the largest data science team in the world, not to mention they have Hinton.

Let’s forget for a minute that they created TensorFlow and gave it away.

Let’s just talk about building a real-world model with data that doesn’t fit into an Excel spreadsheet.

The vast majority of applied machine learning is supervised, and that means we need data.

Not just normal data; we need very clean, highly structured data.

Where’s the easiest place in the world to upload and model a petabyte of structured data? BigQuery, of course.

Then, you’ll need a front end. Cloud Datalab is a Jupyter notebook, which is good because I neither want nor need anything else.

Then, with a single line of code, I connect my Datalab (Jupyter) notebook to my data in BigQuery and build away.

I’ve worked in all three, and the only thing I care about is getting my job done the fastest. Right now that means I build my models in GCP.

If you’re new to machine learning don’t start in GCP or any cloud vendor for that matter. Start learning Python from the comfort of your laptop.

The course below is free to the first 20.

What is the list of machine learning classification techniques?

This paper nicely evaluates 179 classification techniques on 121 data sets, so here are the pointers to it:

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

https://jmlr.org/papers/v15/delgado14a.html

The whole data set and partitions are available from: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz

You can see the table with the complete results: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/results.txt

I hope it will be helpful for statistics and machine learning aspirants!

Thank you!

What is the best way to know which machine learning algorithm has a better probability to accurately or more precisely classify a dataset, before applying it?

These basic questions should help:

What skills does a data scientist need to learn in order to put machine learning models in production?

At a high level, these skills are a combination of software and data engineering.

The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.

That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:

Well structured code: it doesn’t need to be perfect but at least can be understood and updated by other team members. Avoid spaghetti code[1] as the plague.
Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
Model versioning: add a hash key to your different models. You will thank me later.
Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
Monitor performances: execution time and statistical scores of your models.
Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..

What are some mistakes data scientists make when building machine learning models?

Some of the mistakes that might involve during building a machine learning model (I can think of) are listed here:

Not understanding the structure of the dataset
Not giving proper care during features selection
Leaving out categorical features and considering just numerical variables
Falling into dummy variable trap
Selection of inefficient machine learning algorithm
Not trying out various ML algorithms for building the model based on structure of data.
Improper tuning of model parameters
Most importantly: Building an idiotstic imperfect model i.e. suppose we have a classification problem with 99% chances of falling into class1 and remaining to class2. The built model may develop a mapping function which all the time for all data inputs, may predict the result to be class1. Well, one might say his/her model has 99% accuracy. But in reality the 1% class2 case hasn’t been included in the model. So this must be taken into consideration.
Read more here…

[appbox appstore 1560083470-iphone screenshots]
[appbox googleplay com.awssolutionarchitectassociateexampreppro.app]

What is the difference between data analytics and data mining?

That’s just the surface-level comparison though. The image above gives an overview of how the two differ.

What is the life cycle of a data science project?

Thus, the data science life-cycle can include the following steps:

Business requirement understanding.
Data collection.
Data cleaning.
Data analysis.
Modeling.
Performance evaluation.
Communicating with stakeholders.
Deployment.
Real-world testing.
Business buy-in.
Support and maintenance.

Looks neat, but here is the scheme to visualize how it is happening in reality:

What are the application of probability theory in statistics, machine learning, artificial intelligence, economic, commerce, business intelligence?

Top

[appbox appstore 1611045854-iphone screenshots]

[appbox microsoftstore 9n8rl80hvm4t-mobile screenshots]

Machine Learning Q&A -Part II:

Which language should I start to learn machine learning for the first time? Python, R, or Julia? And why?

What are the fundamental mathematical requirements for understanding machine learning, as a novice/hobbyist dev?

What is the best image segmentation pre-trained method publicly available? Currently using DeepLab demo (link attached) but I’m looking for a better version of this.

Suggest some really good ML projects? Which can be worth adding to CV or resume?

Is it possible to finish Andrew Ng’s course on ML in 15 days?

Which language should I start to learn machine learning for the first time? Python, R, or Julia? And why?

I’ve reached a dead end with my algorithm for Exact Three Cover, and it’s supposedly trash. What makes it trash?

Which language should I start to learn machine learning for the first time? Python, R, or Julia? And why?

How do I get started with learning Machine Learning by myself?

What are the differences between the Bayesian and Frequentist methods within machine learning?

Why is the Gaussian distribution widely used in practice?

What are the fundamental mathematical requirements for understanding machine learning, as a novice/hobbyist dev?

What is the Confusion Matrix in Machine Learning?- Simplest Explanation!

Is Julia’s syntax even more intuitive than Python’s?

Andrew Ng: What is the Future of Deep Reinforcement Learning (DL + RL)?

How is TensorFlow/Keras capable of computing seemingly non-smooth loss functions such as max?

Which language should I start to learn machine learning for the first time? Python, R, or Julia? And why?

What is the first thing you do when looking at a new data set?

How do I get data science, computer vision, and machine learning tutorial apps?

Announcing our new Professional Machine Learning Engineer certification…

What is the best machine learning paper you have read in 2018?

Does every paper in machine learning introduce a new algorithm?

Predicting Credit Card Approvals using ML Techniques

Android Document Scanner with offline OCR application

What are the recommended data pre-processing methods for tree-based machine learning models?

What are some interesting statistics about the growing rate of the Julia programming language?

Is it possible to learn machine learning without prior knowledge in any coding language?

What skills does a data scientist need to learn in order to put machine learning models in production?

At a high level, these skills are a combination of software and data engineering.

The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.

That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:

Well structured code: it doesn’t need to be perfect but at least can be understood and updated by other team members. Avoid spaghetti code[1] as the plague.
Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
Model versioning: add a hash key to your different models. You will thank me later.
Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
Monitor performances: execution time and statistical scores of your models.
Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..

What are some mistakes data scientists make when building machine learning models?

Some of the mistakes that might involve during building a machine learning model (I can think of) are listed here:

Not understanding the structure of the dataset
Not giving proper care during features selection
Leaving out categorical features and considering just numerical variables
Falling into dummy variable trap
Selection of inefficient machine learning algorithm
Not trying out various ML algorithms for building the model based on structure of data.
Improper tuning of model parameters
Most importantly: Building an idiotstic imperfect model i.e. suppose we have a classification problem with 99% chances of falling into class1 and remaining to class2. The built model may develop a mapping function which all the time for all data inputs, may predict the result to be class1. Well, one might say his/her model has 99% accuracy. But in reality the 1% class2 case hasn’t been included in the model. So this must be taken into consideration.
Read more here…

What is the difference between data analytics and data mining?

That’s just the surface-level comparison though. The image above gives an overview of how the two differ.

What is the life cycle of a data science project?

Thus, the data science life-cycle can include the following steps:

Business requirement understanding.
Data collection.
Data cleaning.
Data analysis.
Modeling.
Performance evaluation.
Communicating with stakeholders.
Deployment.
Real-world testing.
Business buy-in.
Support and maintenance.

Looks neat, but here is the scheme to visualize how it is happening in reality:

iOs: https://apps.apple.com/ca/app/aws-machine-learning-prep-pro/id1611045854

Top

Windows: https://www.microsoft.com/en-ca/p/aws-machine-learning-mls-c01-specialty-certification-exam-prep/9n8rl80hvm4t

Android/Amazon: https://www.amazon.com/gp/product/B09TZ4H8V6

Quizzes, Practice Exams: Modeling, Data Engineering, Vision, Exploratory Data Analysis, ML Ops, Cheat Sheets, ML Jobs Interview Q&A

Use this App to learn about Machine Learning on AWS and prepare for the AWS Machine Learning Specialty Certification MLS-C01.

Earning AWS Certified Machine Learning Specialty validates expertise in building, training, tuning, and deploying machine learning (ML) models on AWS.

The App provides hundreds of quizzes and practice exam about:

– Machine Learning Operation on AWS

– Modelling

– Data Engineering

– Computer Vision,

– Exploratory Data Analysis,

– ML implementation & Operations

– Machine Learning Basics Questions and Answers

– Machine Learning Advanced Questions and Answers

– Scorecard

– Countdown timer

– Machine Learning Cheat Sheets

– Machine Learning Interview Questions and Answers

– Machine Learning Latest News

Domain 1: Data Engineering

Create data repositories for machine learning.

Identify data sources (e.g., content and location, primary sources such as user data)

Determine storage mediums (e.g., DB, Data Lake, S3, EFS, EBS)

Identify and implement a data ingestion solution.

Data job styles/types (batch load, streaming)

Data ingestion pipelines (Batch-based ML workloads and streaming-based ML workloads), etc.

Domain 2: Exploratory Data Analysis

Sanitize and prepare data for modeling.

Perform feature engineering.

Analyze and visualize data for machine learning.

Domain 3: Modeling

Frame business problems as machine learning problems.

Select the appropriate model(s) for a given machine learning problem.

Train machine learning models.

Perform hyperparameter optimization.

Evaluate machine learning models.

Domain 4: Machine Learning Implementation and Operations

Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.

Recommend and implement the appropriate machine learning services and features for a given problem.

Apply basic AWS security practices to machine learning solutions.

Deploy and operationalize machine learning solutions.

Amazon Comprehend

AWS Deep Learning AMIs (DLAMI)

AWS DeepLens

Amazon Forecast

Amazon Fraud Detector

Amazon Lex

Amazon Polly

Amazon Rekognition

Amazon SageMaker

Amazon Textract

Amazon Transcribe

Amazon Translate

Other Services and topics covered are:

Ingestion/Collection

Processing/ETL

Data analysis/visualization

Model training

Model deployment/inference

Operational

AWS ML application services

Languages relevant to ML (for example, Python, Java, Scala, R, SQL)

Notebooks and integrated development environments (IDEs)

S3, SageMaker, Kinesis, Lake Formation, Athena, Kibana, Redshift, Textract, EMR, Glue, and data formats and sources such as CSV, JSON, IMG, Parquet, or databases

Amazon EC2, Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Container Service, Amazon Elastic Kubernetes Service

SageMaker API Explained:

AWS Certified Machine Learning Specialty Questions and Answers:

Question1: An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, and do so with minimal changes to the existing code.

Answer1: Use the Build Your Own Container (BYOC) Amazon SageMaker option.
Create a new Docker container with the existing code. Register the container in Amazon Elastic Container Registry. Finally, run the training and inference jobs using this container.
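
As a rough sketch of the last step, the snippet below builds the request for a SageMaker training job that points at a custom image in Amazon ECR. The account ID, region, repository name, role ARN, bucket, and instance settings are all hypothetical placeholders. The field names follow the boto3 `create_training_job` request as far as I know it, and the call itself is left commented out because it needs AWS credentials.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build the URI of a container image stored in Amazon ECR."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def build_training_job_request(job_name, image_uri, role_arn, output_s3_uri):
    """Assemble the arguments for a training job that runs a custom container."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,      # the BYOC image with the R code
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",  # assumed instance type
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


# Hypothetical values, for illustration only.
image_uri = ecr_image_uri("123456789012", "us-east-1", "r-xgboost")
request = build_training_job_request(
    job_name="r-xgboost-training",
    image_uri=image_uri,
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    output_s3_uri="s3://example-bucket/output/",
)

# With credentials configured, the job could then be started with:
#   import boto3
#   boto3.client("sagemaker").create_training_job(**request)
print(request["AlgorithmSpecification"]["TrainingImage"])
```

Keeping the request as a plain dictionary makes it easy to inspect before anything is submitted.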

Question2: Which feature of Amazon SageMaker can you use for preprocessing the data?

Amazon SageMaker training jobs
Amazon SageMaker hyperparameter tuning
Amazon SageMaker notebook instances
Amazon SageMaker endpoints

Answer2: Amazon SageMaker notebook instances

Question3: What setting, when creating an Amazon SageMaker notebook instance, can you use to install libraries and import data?

Answer3: Lifecycle configuration
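
A lifecycle configuration is a shell script that runs when the notebook instance is created or started. The sketch below prepares such a script for the boto3 `create_notebook_instance_lifecycle_config` call, which expects the script content base64-encoded. The configuration name, the library, and the S3 path are hypothetical, and the API call is left commented out because it needs AWS credentials.

```python
import base64

# Hypothetical start-up script: install a library and copy a dataset.
ON_START_SCRIPT = """#!/bin/bash
set -e
sudo -u ec2-user -i <<'EOF'
pip install --quiet xgboost
aws s3 cp s3://example-bucket/data/train.csv /home/ec2-user/SageMaker/train.csv
EOF
"""


def build_lifecycle_config_request(name, on_start_script):
    """Base64-encode the script and wrap it in the request structure."""
    encoded = base64.b64encode(on_start_script.encode("utf-8")).decode("ascii")
    return {
        "NotebookInstanceLifecycleConfigName": name,
        "OnStart": [{"Content": encoded}],  # runs every time the instance starts
    }


lifecycle_request = build_lifecycle_config_request("install-libs", ON_START_SCRIPT)

# With credentials configured, the configuration could be registered with:
#   import boto3
#   boto3.client("sagemaker").create_notebook_instance_lifecycle_config(**lifecycle_request)
print(lifecycle_request["NotebookInstanceLifecycleConfigName"])
```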

Question4: How do you choose the right SageMaker built-in algorithm?

This is a general guide for choosing which algorithm to use depending on what business problem you have and what data you have.

Top 10 Google Professional Machine Learning Engineer Sample Questions

A. Use K-fold cross validation to understand how the model performs on different test datasets.

B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.

C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.

D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.

Answer 1)

Notes 1)

B is correct because it identifies the pixel of the input image that leads to the classification of the image itself.
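
To make the idea of per-pixel attribution concrete, here is a minimal sketch of Integrated Gradients on a toy model. The model (a weighted sum of squared inputs), its weights, and the three "pixel" values are assumptions chosen so the gradient can be written by hand. A real image classifier would obtain gradients from a deep learning framework.

```python
def model(x, weights):
    """Toy differentiable model: a weighted sum of squared inputs."""
    return sum(w * xi * xi for w, xi in zip(weights, x))


def gradient(x, weights):
    """Analytic gradient of the toy model with respect to each input."""
    return [2.0 * w * xi for w, xi in zip(weights, x)]


def integrated_gradients(x, baseline, weights, steps=50):
    """Approximate the path integral of gradients from baseline to x (midpoint rule)."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point, weights)):
            totals[i] += g
    # Scale the averaged gradient by the distance from the baseline.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


weights = [0.5, -1.0, 2.0]
pixels = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]  # an all-zero "black image" baseline

attributions = integrated_gradients(pixels, baseline, weights)
print(attributions)
```

The attributions sum to the difference between the model output at the input and at the baseline, which is a useful sanity check for any implementation.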

A. Train the model for a few iterations, and check for NaN values.

B. Train the model for a few iterations, and verify that the loss is constant.

C. Train a simple linear model, and determine if the DNN model outperforms it.

D. Train the model with no regularization, and verify that the loss function is close to zero.

Answer 2)

Notes 2)

D is correct because the test can check that the model has enough parameters to memorize the task.
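
The check described in the note can be sketched on a toy problem: train without regularization on a tiny dataset and confirm the loss approaches zero. The two-parameter linear model, the two data points, and the learning rate are assumptions for illustration. The same idea applies to a DNN trained on a small batch.

```python
def fit_without_regularization(xs, ys, learning_rate=0.1, steps=5000):
    """Plain gradient descent on mean squared error, with no regularization term."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = sum(2.0 * e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(2.0 * e for e in errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    loss = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n
    return w, b, loss


# Two points and two parameters: the model has enough capacity to memorize them.
w, b, final_loss = fit_without_regularization([1.0, 2.0], [3.0, 5.0])
print(w, b, final_loss)
```

If the loss stalls well above zero on such a small task, the model or the training loop probably has a bug or too little capacity.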

A. Default Strategy; Custom tier with a single master node and four v100 GPUs.

B. One Device Strategy; Custom tier with a single master node and four v100 GPUs.

C. One Device Strategy; Custom tier with a single master node and eight v100 GPUs.

D. Central Storage Strategy; Custom tier with a single master node and four v100 GPUs.

Answer 3)

Notes 3)

D is correct because this is the only strategy that can perform distributed training; albeit there is only a single copy of the variables on the CPU host.

A. Deploy model in test environment -> Validate model -> Create a new AI Platform model version

B. Validate model -> Deploy model in test environment -> Create a new AI Platform model version

C. Create a new AI Platform model version -> Validate model -> Deploy model in test environment

D. Create a new AI Platform model version -> Deploy model in test environment -> Validate model

Answer 4)

Notes 4)

A is correct because the model can be validated after it is deployed to the test environment, and the release version is established before the model is deployed in production.

A. Calculate the Area Under the Curve (AUC) value.

B. Calculate the number of true positive results predicted by the model.

C. Calculate the fraction of images predicted by the model to have a visible defect.

D. Calculate the Cosine Similarity to compare the model’s performance on the test dataset to the model’s performance on the training dataset.

Answer 5)

Notes 5)
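
The question stem and answer are missing here, so the following does not settle which option is intended. It only sketches how the AUC mentioned in option A can be computed, using the fact that it equals the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The labels and scores are made-up values.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]            # 1 = defect visible (hypothetical)
scores = [0.1, 0.4, 0.35, 0.8]   # model scores (hypothetical)
auc = roc_auc(labels, scores)
print(auc)  # 0.75
```

The pairwise form is quadratic in the number of examples, so it is suited to small illustrations rather than large test sets.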

A. Data preparation: Daily max value feature engineering with DataPrep; Model training: AutoML classification with BQML

B. Data preparation: Daily min value feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True

C. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to False

D. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True

Answer 6)

Notes 6)

D is correct because it uses the rolling average of the sensor data and balances the weights using the BQML auto class weight balance parameter.
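
The two ideas in this note can be sketched in plain Python: a rolling average that smooths noisy sensor readings, and class weights that are inversely proportional to class frequency. The readings and labels are made up, and the weighting formula is the common "balanced" heuristic, offered as an assumption rather than the exact computation BQML performs when AUTO_CLASS_WEIGHTS is enabled.

```python
from collections import Counter


def rolling_average(values, window):
    """Mean of each consecutive window of readings."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the reading that left the window
        if i >= window - 1:
            averages.append(running / window)
    return averages


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (assumed heuristic)."""
    counts = Counter(labels)
    total, num_classes = len(labels), len(counts)
    return {label: total / (num_classes * count) for label, count in counts.items()}


readings = [1.0, 2.0, 3.0, 4.0, 5.0]
smoothed = rolling_average(readings, window=3)
class_weights = balanced_class_weights([0] * 9 + [1])  # rare failure class
print(smoothed)        # [2.0, 3.0, 4.0]
print(class_weights)
```

With nine normal examples and one failure, the failure class receives a weight nine times larger, so errors on it count for more during training.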

A. Pub/Sub, Cloud Function, Cloud Vision API

B. Pub/Sub, Cloud IoT, Dataflow, Cloud Vision API, Cloud Logging

C. Pub/Sub, Cloud Function, Video Intelligence API, Cloud Logging

D. Pub/Sub, Cloud Function, AutoML Video Intelligence, Cloud Logging

Answer 7)

Notes 7)

C is correct as Video Intelligence API can find inappropriate components and other components satisfy the requirements of real-time processing and notification.

A. Load data into BigQuery and use the ARIMA model type on BigQuery ML.

B. Convert the data into CSV format and create a regression model on AutoML Tables.

C. Convert the data into TFRecords and create an RNN model on TensorFlow on AI Platform Notebooks.

D. Convert and refactor the data into CSV format and use the built-in XGBoost algorithm on AI Platform Training.

Answer 8)

Notes 8)

A. Use Google Cloud’s Data Labelling Service to label your data. Use AutoML Object Detection to train and deploy the model.

B. Use Vision API to detect and identify logos in pictures and use it as a label. Use AI Platform to build and train a convolutional neural network.

C. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a convolutional neural network.

D. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a real time object detection model.

Answer 9)

Notes 9)

A is correct as this will allow you to easily create a request for a labelling task and deploy a high-performance model.

A. Automate a blend of the shortest and longest intents to be representative of all intents.

B. Automate the more complicated requests first because those require more of the agents’ time.

C. Automate the 10 intents that cover 70% of the requests so that live agents can handle the more complicated requests.

D. Automate intents in places where common words such as “payment” only appear once to avoid confusing the software.

Answer 10)

Notes 10)

Machine Learning Q&A Part I:

The Complete Python Course for Machine Learning Engineers

Google.

Azure and AWS are second class citizens in this area.

Sure, AWS has the largest share of the cloud market.

Sure, Azure is the easiest turn key and super user friendly.

But, the king of machine learning in the cloud is GCP.

GCP = Google Cloud Platform

Google has the largest data science team in the world, not to mention they have Hinton.

Let’s forget for a minute that they created TensorFlow and gave it away.

Let’s just talk about building a real-world model with data that doesn’t fit into an Excel spreadsheet.

The vast majority of applied machine learning is supervised and that means we need data.

Not just normal data, we need very clean highly structured data.

Where’s the easiest place in the world to upload and model a Petabyte of structured data? BigQuery of course.

Then, you’ll need a front end. Cloud Datalab is a Jupyter notebook, which is good because I neither want nor need anything else.

Then, with a single line of code, I connect my Datalab (Jupyter) notebook to my data in BigQuery and build away.

I’ve worked in all three, and the only thing I care about is getting my job done the fastest. Right now that means I build my models in GCP.

If you’re new to machine learning don’t start in GCP or any cloud vendor for that matter. Start learning Python from the comfort of your laptop.

The course below is free to the first 20.

What is the list of machine learning classification techniques?

This paper evaluates 179 classification techniques on 121 data sets, so here is a pointer to it:

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

https://jmlr.org/papers/v15/delgado14a.html

The whole data set and partitions are available from: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz

You can see the table with the complete results: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/results.txt

I hope it will be helpful for statistics and machine learning aspirants!

Thank you!

What is the best way to know, before applying it, which machine learning algorithm is more likely to classify a dataset accurately?

These basic questions should help:

What are the applications of probability theory in statistics, machine learning, artificial intelligence, economics, commerce, and business intelligence?

Which language should I start to learn machine learning for the first time? Python, R, or Julia? And why?

What are the fundamental mathematical requirements for understanding machine learning, as a novice/hobbyist dev?

What is the best image segmentation pre-trained method publicly available? Currently using DeepLab demo (link attached) but I’m looking for a better version of this.

Suggest some really good ML projects? Which can be worth adding to CV or resume?

Is it possible to finish Andrew Ng’s course on ML in 15 days?

I’ve reached a dead end with my algorithm for Exact Three Cover, and it’s supposedly trash. What makes it trash?

How do I get started with learning Machine Learning by myself?

What are the differences between the Bayesian and Frequentist methods within machine learning?

Why is the Gaussian distribution widely used in practice?

What is the Confusion Matrix in Machine Learning?- Simplest Explanation!

Is Julia’s syntax even more intuitive than Python’s?

Andrew Ng: What is the Future of Deep Reinforcement Learning (DL + RL)?

How is TensorFlow/Keras capable of computing seemingly non-smooth loss functions such as max?

What is the first thing you do when looking at a new data set?

How do I get data science, computer vision, and machine learning tutorial apps?

Announcing our new Professional Machine Learning Engineer certification…

What is the best machine learning paper you have read in 2018?

Does every paper in machine learning introduce a new algorithm?

Predicting Credit Card Approvals using ML Techniques

Android Document Scanner with offline OCR application

What are the recommended data pre-processing methods for tree-based machine learning models?

What are some interesting statistics about the growing rate of the Julia programming language?

Is it possible to learn machine learning without prior knowledge in any coding language?

What skills does a data scientist need to learn in order to put machine learning models in production?

At a high level, these skills are a combination of software engineering and data engineering.

The people best suited to this job are data engineers and machine learning engineers.

That being said, if you work at a startup or a small company and need to put the models into production yourself, here are the top skills you need:

Well-structured code: it doesn’t need to be perfect, but other team members should be able to understand and update it. Avoid spaghetti code[1] like the plague.
Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements in production code.
Model versioning: add a hash key to each of your models. You will thank me later.
Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, features used, CV scores, and so on). You will thank me later, again.
Monitor performance: track the execution time and statistical scores of your models.
Data and model management: store the necessary data and models somewhere that is available to everyone (S3[3], for example). Avoid uploading these to your VCS[4]. Don’t share them using Slack or Drive. I won’t judge you though; I do it sometimes (read: often).
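
A few of the points above (logging instead of print, a hash key per model, and saved metadata) can be sketched with the Python standard library. The serialized "model", the hyperparameters, and the scores are made-up values, and the 12-character truncated SHA-256 is just one possible versioning convention, not a prescribed one.

```python
import hashlib
import json
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("training")


def model_version(model_bytes):
    """Derive a short, deterministic version key from the serialized model."""
    return hashlib.sha256(model_bytes).hexdigest()[:12]


def build_metadata(model_bytes, hyperparameters, features, cv_score, runtime_seconds):
    """Collect everything worth remembering about one experiment."""
    return {
        "model_version": model_version(model_bytes),
        "hyperparameters": hyperparameters,
        "features": features,
        "cv_score": cv_score,
        "runtime_seconds": runtime_seconds,
    }


# Stand-in for a real serialized model (hypothetical coefficients).
model_bytes = json.dumps({"coef": [0.4, -1.2], "intercept": 0.1}, sort_keys=True).encode("utf-8")

metadata = build_metadata(
    model_bytes,
    hyperparameters={"max_depth": 6, "learning_rate": 0.1},
    features=["age", "income"],
    cv_score=0.87,
    runtime_seconds=12.5,
)
logger.info("trained model %s with CV score %.2f", metadata["model_version"], metadata["cv_score"])
metadata_json = json.dumps(metadata, indent=2, sort_keys=True)  # ready to store next to the model
```

Because the version key is derived from the model bytes, retraining to an identical model yields the same key, while any change in the weights produces a new one.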

What are some mistakes data scientists make when building machine learning models?

Some of the mistakes that can occur while building a machine learning model are listed here:

Not understanding the structure of the dataset
Not taking proper care during feature selection
Leaving out categorical features and considering only numerical variables
Falling into the dummy variable trap
Selecting an inefficient machine learning algorithm
Not trying out various ML algorithms suited to the structure of the data
Improper tuning of model parameters
Most importantly: building a naive model that ignores class imbalance. Suppose we have a classification problem where 99% of the examples fall into class1 and the remaining 1% into class2. The trained model may learn a mapping that predicts class1 for every input. One might say the model has 99% accuracy, but in reality it never identifies the 1% of class2 cases. So class imbalance must be taken into consideration.
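
The last mistake in the list can be shown with a few lines of arithmetic. In this sketch, a "model" that always predicts class1 is scored on 100 made-up examples with a 99-to-1 class split. Accuracy looks excellent while recall on class2 is zero.

```python
def accuracy(labels, predictions):
    """Fraction of examples predicted correctly."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def recall(labels, predictions, target_class):
    """Fraction of examples of target_class that the model actually found."""
    relevant = [p for y, p in zip(labels, predictions) if y == target_class]
    if not relevant:
        raise ValueError("no examples of the target class")
    return sum(1 for p in relevant if p == target_class) / len(relevant)


labels = ["class1"] * 99 + ["class2"]
predictions = ["class1"] * 100  # the model ignores its input entirely

overall_accuracy = accuracy(labels, predictions)
minority_recall = recall(labels, predictions, "class2")
print(overall_accuracy)  # 0.99
print(minority_recall)   # 0.0
```

Reporting per-class recall (or precision, F1, or AUC) alongside accuracy exposes this failure immediately.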

What is the difference between data analytics and data mining?

That’s just the surface-level comparison though. The image above gives an overview of how the two differ.

What is the life cycle of a data science project?

Thus, the data science life-cycle can include the following steps:

Business requirement understanding.
Data collection.
Data cleaning.
Data analysis.
Modeling.
Performance evaluation.
Communicating with stakeholders.
Deployment.
Real-world testing.
Business buy-in.
Support and maintenance.

Looks neat, but here is the scheme to visualize how it is happening in reality:

iOS: https://apps.apple.com/ca/app/aws-machine-learning-prep-pro/id1611045854

Windows: https://www.microsoft.com/en-ca/p/aws-machine-learning-mls-c01-specialty-certification-exam-prep/9n8rl80hvm4t

Android/Amazon: https://www.amazon.com/gp/product/B09TZ4H8V6

Quizzes, Practice Exams: Modeling, Data Engineering, Vision, Exploratory Data Analysis, ML Ops, Cheat Sheets, ML Jobs Interview Q&A

Use this App to learn about Machine Learning on AWS and prepare for the AWS Machine Learning Specialty Certification MLS-C01.

Earning AWS Certified Machine Learning Specialty validates expertise in building, training, tuning, and deploying machine learning (ML) models on AWS.

The App provides hundreds of quizzes and practice exam about:

– Machine Learning Operation on AWS

– Modelling

– Data Engineering

– Computer Vision,

– Exploratory Data Analysis,

– ML implementation & Operations

– Machine Learning Basics Questions and Answers

– Machine Learning Advanced Questions and Answers

– Scorecard

– Countdown timer

– Machine Learning Cheat Sheets

– Machine Learning Interview Questions and Answers

– Machine Learning Latest News

Domain 1: Data Engineering

Create data repositories for machine learning.

Identify data sources (e.g., content and location, primary sources such as user data)

Determine storage mediums (e.g., DB, Data Lake, S3, EFS, EBS)

Identify and implement a data ingestion solution.

Data job styles/types (batch load, streaming)

Data ingestion pipelines (Batch-based ML workloads and streaming-based ML workloads), etc.

Domain 2: Exploratory Data Analysis

Sanitize and prepare data for modeling.

Perform feature engineering.

Analyze and visualize data for machine learning.

Domain 3: Modeling

Frame business problems as machine learning problems.

Select the appropriate model(s) for a given machine learning problem.

Train machine learning models.

Perform hyperparameter optimization.

Evaluate machine learning models.

Domain 4: Machine Learning Implementation and Operations

Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.

Recommend and implement the appropriate machine learning services and features for a given problem.

Apply basic AWS security practices to machine learning solutions.

Deploy and operationalize machine learning solutions.

Amazon Comprehend

AWS Deep Learning AMIs (DLAMI)

AWS DeepLens

Amazon Forecast

Amazon Fraud Detector

Amazon Lex

Amazon Polly

Amazon Rekognition

Amazon SageMaker

Amazon Textract

Amazon Transcribe

Amazon Translate

Other Services and topics covered are:

Ingestion/Collection

Processing/ETL

Data analysis/visualization

Model training

Model deployment/inference

Operational

AWS ML application services

Language relevant to ML (for example, Python, Java, Scala, R, SQL)

Notebooks and integrated development environments (IDEs),

S3, SageMaker, Kinesis, Lake Formation, Athena, Kibana, Redshift, Textract, EMR, Glue, SageMaker, CSV, JSON, IMG, parquet or databases, Amazon Athena

Amazon EC2, Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Container Service, Amazon Elastic Kubernetes Service , Amazon Redshift

Sagemaker API Explained:

AWS Certified Machine Learning Engineer Specialty Questions and Answers:

Question1: An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, and do so with minimal changes to the existing code.

Answer1: Use the Build Your Own Container (BYOC) Amazon Sagemaker option.
Create a new docker container with the existing code. Register the container in Amazon Elastic Container registry. with the existing code. Register the container in Amazon Elastic Container Registry. Finally run the training and inference jobs using this container.

Question2: Which feature of Amazon SageMaker can you use for preprocessing the data?

Amazon SageMaker training jobs
Amazon SageMaker hyperaparameter tuning
Amazon SageMaker notebook instances
Amazon SageMaker endpoints

Answer2: Amazon Sagemaker Notebook instances

Question3: What setting, when creating an Amazon SageMaker notebook instance, can you use to install libraries and import data?

Answer3: LifeCycle Configuration

Question4: How to Choose the right Sagemaker built-in algorithm?

This is a general guide for choosing which algorithm to use depending on what business problem you have and what data you have.

Top

Top 10 Google Professional Machine Learning Engineer Sample Questions

A. Use K-fold cross validation to understand how the model performs on different test datasets.

B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.

C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.

D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.

Answer 1)

Notes 1)

B is correct because it identifies the pixel of the input image that leads to the classification of the image itself.

A. Train the model for a few iterations, and check for NaN values.

B. Train the model for a few iterations, and verify that the loss is constant.

C. Train a simple linear model, and determine if the DNN model outperforms it.

D. Train the model with no regularization, and verify that the loss function is close to zero.

Answer 2)

Notes 2)

D is correct because the test can check that the model has enough parameters to memorize the task.

[appbox appstore 1560083470-iphone screenshots]
[appbox googleplay com.awssolutionarchitectassociateexampreppro.app]

A. Default Strategy; Custom tier with a single master node and four v100 GPUs.

B. One Device Strategy; Custom tier with a single master node and four v100 GPUs.

C. One Device Strategy; Custom tier with a single master node and eight v100 GPUs.

D. Central Storage Strategy; Custom tier with a single master node and four v100 GPUs.

Answer 3)

Notes 3)

D is correct because this is the only strategy that can perform distributed training; albeit there is only a single copy of the variables on the CPU host.

A. Deploy model in test environment -> Validate model -> Create a new AI Platform model version

B. Validate model -> Deploy model in test environment -> Create a new AI Platform model version

C. Create a new AI Platform model version -> Validate model -> Deploy model in test environment

D. Create a new AI Platform model version – > Deploy model in test environment -> Validate model

Answer 4)

Notes 4)

A is correct because the model can be validated after it is deployed to the test environment, and the release version is established before the model is deployed in production.

A. Calculate the Area Under the Curve (AUC) value.

B. Calculate the number of true positive results predicted by the model.

C. Calculate the fraction of images predicted by the model to have a visible defect.

D. Calculate the Cosine Similarity to compare the model’s performance on the test dataset to the model’s performance on the training dataset.

Answer 5)

Notes 5)

A. Data preparation: Daily max value feature engineering with DataPrep; Model training: AutoML classification with BQML

B. Data preparation: Daily min value feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True

C. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to False

D. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True

Answer 6)

Notes 6)

D is correct because it uses the rolling average of the sensor data and balances the weights using the BQML auto class weight balance parameter.

A. Pub/Sub, Cloud Function, Cloud Vision API

B. Pub/Sub, Cloud IoT, Dataflow, Cloud Vision API, Cloud Logging

C. Pub/Sub, Cloud Function, Video Intelligence API, Cloud Logging

D. Pub/Sub, Cloud Function, AutoML Video Intelligence, Cloud Logging

Answer 7)

Notes 7)

C is correct as Video Intelligence API can find inappropriate components and other components satisfy the requirements of real-time processing and notification.

A. Load data into BigQuery and use the ARIMA model type on BigQuery ML.

B. Convert the data into CSV format and create a regression model on AutoML Tables.

C. Convert the data into TFRecords and create an RNN model on TensorFlow on AI Platform Notebooks.

D. Convert and refactor the data into CSV format and use the built-in XGBoost algorithm on AI Platform Training.

Answer 8)

Notes 8)

A. Use Google Cloud’s Data Labelling Service to label your data. Use AutoML Object Detection to train and deploy the model.

B. Use Vision API to detect and identify logos in pictures and use it as a label. Use AI Platform to build and train a convolutional neural network.

C. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a convolutional neural network.

D. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a real time object detection model.

Answer 9)

Notes 9)

A is correct as this will allow you to easily create a request for a labelling task and deploy a high-performance model.

A. Automate a blend of the shortest and longest intents to be representative of all intents.

B. Automate the more complicated requests first because those require more of the agents’ time.

C. Automate the 10 intents that cover 70% of the requests so that live agents can handle the more complicated requests.

D. Automate intents in places where common words such as “payment” only appear once to avoid confusing the software.

Answer 10)

Notes 10)

[appbox appstore 1611045854-iphone screenshots]

[appbox microsoftstore 9n8rl80hvm4t-mobile screenshots]

Machine Learning Q&A Part I:

The Complete Python Course for Machine Learning Engineers

Google.

Azure and AWS are second class citizens in this area.

Sure, AWS has 70% of the market.

Sure, Azure is the easiest turn key and super user friendly.

But, the king of machine learning in the cloud is GCP.

GCP = Google Cloud Platform

Google has the largest data science team in the world, not mention they have Hinton.

Let’s forgot for a minute they created TensorFlow and give it away.

Let’s just talk about building a real world model with data that doesn’t fit into a excel spreadsheet.

The vast majority of applied machine learning is supervised and that means we need data.

Not just normal data, we need very clean highly structured data.

Where’s the easiest place in the world to upload and model a Petabyte of structured data? BigQuery of course.

Then, you’ll need a front end. Cloud Datalab is a Jupyter notebook, which is good because I neither want nor need anything else.

Then, with a single line of code, I connect my Datalab (Jupyter) notebook to my data in BigQuery and build away.

I’ve worked in all three and the only thing I care about is getting to my job the fastest and right now that means I build my models in GCP.

If you’re new to machine learning don’t start in GCP or any cloud vendor for that matter. Start learning Python from the comfort of your laptop.

The course below is free to the first 20.

What is the list of machine learning classification techniques?

This paper explains 179 classification techniques and applies them to 121 data sets, so I am sharing a short pointer to it:

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

https://jmlr.org/papers/v15/delgado14a.html

The whole data set and partitions are available from: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz

You can see the table with the complete results: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/results.txt

I hope it will be helpful for Statistics and Machine Learning aspirants!

Thank you!

What is the best way to know which machine learning algorithm has a better probability to accurately or more precisely classify a dataset, before applying it?

These basic questions should help:

What are the applications of probability theory in statistics, machine learning, artificial intelligence, economics, commerce, and business intelligence?

Which language should I start to learn machine learning for the first time? Python, R, or Julia? And why?

What are the fundamental mathematical requirements for understanding machine learning, as a novice/hobbyist dev?

What is the best image segmentation pre-trained method publicly available? Currently using DeepLab demo (link attached) but I’m looking for a better version of this.

Suggest some really good ML projects? Which can be worth adding to CV or resume?

Is it possible to finish Andrew Ng’s course on ML in 15 days?

I’ve reached a dead end with my algorithm for Exact Three Cover, and it’s supposedly trash. What makes it trash?

How do I get started with learning Machine Learning by myself?

What are the differences between the Bayesian and Frequentist methods within machine learning?

Why is the Gaussian distribution widely used in practice?

What is the Confusion Matrix in Machine Learning?- Simplest Explanation!

Is Julia’s syntax even more intuitive than Python’s?

Andrew Ng: What is the Future of Deep Reinforcement Learning (DL + RL)?

How is TensorFlow/Keras capable of computing seemingly non-smooth loss functions such as max?

What is the first thing you do when looking at a new data set?

How do I get data science, computer vision, and machine learning tutorial apps?

Announcing our new Professional Machine Learning Engineer certification…

What is the best machine learning paper you have read in 2018?

Does every paper in machine learning introduce a new algorithm?

Predicting Credit Card Approvals using ML Techniques

Android Document Scanner with offline OCR application

What are the recommended data pre-processing methods for tree-based machine learning models?

What are some interesting statistics about the growing rate of the Julia programming language?

Is it possible to learn machine learning without prior knowledge in any coding language?

What skills does a data scientist need to learn in order to put machine learning models in production?

At a high level, these skills are a combination of software and data engineering.

The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.

That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:

Well-structured code: it doesn’t need to be perfect, but it should at least be understandable and updatable by other team members. Avoid spaghetti code[1] like the plague.
Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
Model versioning: add a hash key to your different models. You will thank me later.
Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
Monitor performance: track the execution time and statistical scores of your models.
Data and model management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often).
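The logging, versioning, and metadata points above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the logger name, the function names `model_version` and `build_metadata`, and the example hyperparameters and scores are all hypothetical, and the "model" is a placeholder byte string rather than a real serialized model.

```python
import hashlib
import json
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("training")  # hypothetical logger name


def model_version(model_bytes: bytes, hyperparameters: dict) -> str:
    """Return a short, reproducible hash key for a serialized model and its settings."""
    digest = hashlib.sha256()
    digest.update(model_bytes)
    # sort_keys makes the hash independent of dictionary ordering
    digest.update(json.dumps(hyperparameters, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()[:12]


def build_metadata(model_bytes, hyperparameters, cv_scores, runtime_s):
    """Collect experiment metadata in one dictionary and log it instead of printing."""
    metadata = {
        "version": model_version(model_bytes, hyperparameters),
        "hyperparameters": hyperparameters,
        "cv_scores": cv_scores,
        "mean_cv_score": sum(cv_scores) / len(cv_scores),
        "runtime_s": runtime_s,
    }
    logger.info("trained model %s", json.dumps(metadata, sort_keys=True))
    return metadata


# Placeholder values for illustration only.
meta = build_metadata(
    b"fake-serialized-model",
    {"max_depth": 3, "n_estimators": 50},
    [0.81, 0.79, 0.83],
    12.4,
)
```

The resulting dictionary can be written next to the model artifact (for example as a JSON file in the same shared storage location) so that every stored model carries its own version key and experiment record.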

What are some mistakes data scientists make when building machine learning models?

Some of the mistakes that can occur while building a machine learning model (that I can think of) are listed here:

Not understanding the structure of the dataset
Not giving proper care to feature selection
Leaving out categorical features and considering just numerical variables
Falling into the dummy variable trap
Selecting an inefficient machine learning algorithm
Not trying out various ML algorithms for building the model based on the structure of the data
Improper tuning of model parameters
Most importantly: building a trivially biased model. Suppose we have a classification problem where 99% of the samples fall into class1 and the remaining 1% into class2. The model may learn a mapping that predicts class1 for every input. One might say the model has 99% accuracy, but in reality it never identifies the 1% of class2 cases. Class imbalance must therefore be taken into consideration.
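The last mistake in the list can be made concrete with a small calculation. The sketch below builds a made-up dataset of 100 labels (99 of class1, 1 of class2) and a "model" that always predicts class1; the helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true/false positives and negatives for one class treated as 'positive'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


# Made-up imbalanced labels: 99 samples of class1, 1 sample of class2.
y_true = ["class1"] * 99 + ["class2"] * 1
# A degenerate model that predicts class1 for every input.
y_pred = ["class1"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive="class2")
# Recall on the rare class: the share of class2 samples the model actually finds.
recall_class2 = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.2f}, recall on class2={recall_class2:.2f}")
```

Accuracy comes out at 0.99 while recall on class2 is 0.0, which is why per-class metrics are worth checking alongside accuracy on imbalanced data.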

What is the difference between data analytics and data mining?

That’s just the surface-level comparison though. The image above gives an overview of how the two differ.

What is the life cycle of a data science project?

Thus, the data science life-cycle can include the following steps:

Business requirement understanding.
Data collection.
Data cleaning.
Data analysis.
Modeling.
Performance evaluation.
Communicating with stakeholders.
Deployment.
Real-world testing.
Business buy-in.
Support and maintenance.

Looks neat, but here is the scheme to visualize how it is happening in reality:

[1] [1607.05691] Information-theoretical label embeddings for large-scale image classification

Machine Learning Latest News

Top 10 Machine Learning Algorithms

What are the simplest examples of machine learning algorithms?

Originally Answered: What are the top algorithms for machine learning?

Source: Top 10 Machine Learning Algorithms for Data Scientist

In machine learning, there’s something called the “No Free Lunch” theorem. In a nutshell, it states that no one algorithm works best for every problem. It’s especially relevant for supervised learning. For example, you can’t say that neural networks are always better than decision trees or vice-versa. Furthermore, there are many factors at play, such as the size and structure of your dataset. As a result, you should try many different algorithms for your problem!

Top ML Algorithms

1. Linear Regression

Regression is a technique for numerical prediction. It is a statistical method that attempts to determine the strength of the relationship between one dependent variable and a series of other changing variables, the independent variables. Just as classification predicts categorical labels, regression predicts a continuous value. For example, we may wish to predict the salary of university graduates with 5 years of work experience. We use regression to determine how much specific factors influence the dependent variable.

Linear regression attempts to model the relationship between a scalar variable and explanatory variables by fitting a linear equation. For example, one might want to relate the weights of individuals to their heights using a linear regression model.

Additionally, some linear regression implementations use the Akaike information criterion for model selection. The Akaike information criterion is a measure of the relative goodness of fit of a statistical model.
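As a minimal sketch of fitting a linear equation, the snippet below uses the closed-form least-squares solution for one explanatory variable. The height and weight values are made up for illustration (and chosen to lie exactly on a line), and the function names `fit_line` and `predict` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with a single input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    return slope * x + intercept


# Made-up data: heights in cm and weights in kg.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 58, 66, 74, 82]

slope, intercept = fit_line(heights_cm, weights_kg)
predicted_weight = predict(175, slope, intercept)
print(f"weight ~ {slope:.2f} * height + ({intercept:.2f})")
```

On this toy data the slope is about 0.8 kg per cm and the intercept about -70 kg, so a height of 175 cm maps to roughly 70 kg.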

2. Logistic Regression

Logistic regression is a classification model. It uses input variables to predict a categorical outcome variable. The variable can take on one of a limited set of class values. A binomial logistic regression relates to two binary output categories. A multinomial logistic regression allows for more than two classes. Examples of logistic regression include classifying a binary condition as “healthy” / “not healthy”. Logistic regression applies the logistic sigmoid function to weighted input values to generate a prediction of the data class.

A logistic regression model estimates the probability of a dependent variable as a function of independent variables. The dependent variable is the output that we are trying to predict. The independent variables or explanatory variables are the factors that we feel could influence the output. Multiple regression refers to regression analysis with two or more independent variables. Multivariate regression, on the other hand, refers to regression analysis with two or more dependent variables.
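The prediction step described above (a logistic sigmoid applied to weighted inputs) can be sketched as follows. The weights, bias, feature values, and the 0.5 decision threshold are assumptions chosen for illustration; in practice the weights are learned from training data.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for one example."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_class(features, weights, bias, threshold=0.5):
    """Binomial logistic regression: pick one of two classes from the probability."""
    p = predict_proba(features, weights, bias)
    return "not healthy" if p >= threshold else "healthy"


# Hypothetical, hand-picked parameters for two input variables.
weights = [0.8, -0.4]
bias = -1.0

p_low = predict_proba([0.5, 2.0], weights, bias)   # weighted sum is -1.4
p_high = predict_proba([4.0, 1.0], weights, bias)  # weighted sum is 1.8
```

A weighted sum of zero gives a probability of exactly 0.5, which is why 0.5 is the usual default threshold between the two classes.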

3. Linear Discriminant Analysis

Logistic Regression is a classification algorithm traditionally for two-class classification problems. If you have more than two classes then the Linear Discriminant Analysis algorithm is the preferred linear classification technique.

The representation of LDA is pretty straightforward. It consists of statistical properties of your data, calculated for each class. For a single input variable this includes:

The mean value for each class.
The variance calculated across all classes.

We make predictions by calculating a discriminant value for each class. After that we make a prediction for the class with the largest value. The technique assumes that the data has a Gaussian distribution. Hence, it is a good idea to remove outliers from your data beforehand. It’s a simple and powerful method for classification predictive modelling problems.
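For a single input variable, the per-class means, the shared variance, and the discriminant value can be sketched as below. The formula used for the discriminant is the standard one for LDA with a pooled variance and class priors; the toy data and function names are assumptions for illustration.

```python
import math


def fit_lda_1d(xs, labels):
    """Estimate per-class means, class priors, and one variance pooled across classes."""
    classes = sorted(set(labels))
    n = len(xs)
    means, priors = {}, {}
    squared_deviations = 0.0
    for c in classes:
        values = [x for x, label in zip(xs, labels) if label == c]
        means[c] = sum(values) / len(values)
        priors[c] = len(values) / n
        squared_deviations += sum((v - means[c]) ** 2 for v in values)
    variance = squared_deviations / (n - len(classes))
    return means, priors, variance


def discriminant(x, mean, prior, variance):
    """Discriminant value of one class for input x (larger means more likely)."""
    return x * mean / variance - mean ** 2 / (2 * variance) + math.log(prior)


def predict_lda_1d(x, means, priors, variance):
    """Predict the class with the largest discriminant value."""
    return max(means, key=lambda c: discriminant(x, means[c], priors[c], variance))


# Made-up one-dimensional data with two classes.
xs = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = ["a", "a", "a", "b", "b", "b"]
means, priors, variance = fit_lda_1d(xs, labels)
```

With equal priors the decision boundary sits halfway between the two class means, at 5.0 for this toy data.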

4. Classification and Regression Trees

Prediction trees predict a response or class Y from inputs X1, X2, …, Xn. If the response is continuous it is a regression tree; if it is categorical, it is a classification tree. At each node of the tree, we check the value of one of the inputs Xi. Depending on the (binary) answer, we continue to the left or to the right subbranch. When we reach a leaf we find the prediction.

Contrary to linear or polynomial regression, which are global models, trees try to partition the data space into parts small enough that a simple, different model can be applied to each part. The non-leaf part of the tree is just the procedure that determines, for each data point x, which model we will use to classify it.
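The walk from the root to a leaf can be sketched with two small data classes. The tree below is hand-built and hypothetical (the features, thresholds, and labels are invented); a real tree would be learned from data.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    prediction: Union[str, float]  # class label, or numeric response for a regression tree


@dataclass
class Node:
    feature: int      # index i of the input Xi tested at this node
    threshold: float  # binary question: is x[feature] <= threshold?
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]


def predict(tree, x):
    """Follow the binary answers from the root until a leaf is reached."""
    node = tree
    while isinstance(node, Node):
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction


# Hypothetical classification tree over x = (age, income).
tree = Node(
    feature=0,
    threshold=30.0,
    left=Leaf("declined"),
    right=Node(feature=1, threshold=50000.0, left=Leaf("review"), right=Leaf("approved")),
)

first = predict(tree, (25, 90000))   # age <= 30, so the left leaf is reached
second = predict(tree, (40, 60000))  # age > 30 and income > 50000
```

Each leaf here holds a constant prediction, which is the simplest possible "model per region"; a regression tree would store a number in the leaf instead of a label.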

5. Naive Bayes

A Naive Bayes Classifier is a supervised machine-learning algorithm that applies Bayes’ Theorem together with the assumption that features are statistically independent. The classifier relies on the naive assumption that input variables are independent of each other, i.e. knowing the value of one variable tells you nothing about the others. Regardless of this assumption, it has proven itself to be a classifier with good results.

Naive Bayes Classifiers rely on Bayes’ Theorem, which is based on conditional probability or, in simple terms, the likelihood that an event (A) will happen given that another event (B) has already happened. Essentially, the theorem allows a hypothesis to be updated each time new evidence is introduced. The equation below expresses Bayes’ Theorem in the language of probability:

P(A | B) = P(B | A) × P(A) / P(B)

Let’s explain what each of these terms means.

“P” is the symbol to denote probability.
P(A | B) = The probability of event A (hypothesis) occurring given that B (evidence) has occurred.
P(B | A) = The probability of the event B (evidence) occurring given that A (hypothesis) has occurred.
P(A) = The probability of event A (hypothesis) occurring.
P(B) = The probability of event B (evidence) occurring.
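As a worked example of the terms above, the sketch below updates a hypothesis A after observing evidence B. The three input probabilities are made-up numbers, and P(B) is obtained from the law of total probability.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A | B) from P(A), P(B | A), and P(B | not A)."""
    # P(B) = P(B | A) P(A) + P(B | not A) P(not A)
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / evidence


# Made-up numbers: A is rare (1%), B is likely under A (90%) and unlikely otherwise (5%).
p_a_given_b = posterior(prior_a=0.01, p_b_given_a=0.9, p_b_given_not_a=0.05)
print(f"P(A | B) = {p_a_given_b:.3f}")
```

Here P(B) = 0.9 × 0.01 + 0.05 × 0.99 = 0.0585, so P(A | B) = 0.009 / 0.0585, roughly 0.154: the evidence raises the probability of A from 1% to about 15%.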

6. K-Nearest Neighbors

k-nearest neighbours (or k-NN for short) is a simple machine learning algorithm that categorizes an input by using its k nearest neighbours.

For example, suppose a k-NN algorithm is given data points describing the weight and height of specific men and women, as plotted below. To determine the gender of an unknown input (green point), k-NN looks at the k nearest neighbours and assigns the class held by most of them, in this case male. This method is a very simple and logical way of labelling unknown inputs, with a high rate of success.

Also, we can use k-NN in a variety of machine learning tasks; for example, in computer vision, k-NN can help identify handwritten letters, and in gene expression analysis, the algorithm can determine which genes contribute to a certain characteristic. Overall, k-nearest neighbours provides a combination of simplicity and effectiveness that makes it an attractive algorithm to use for many machine learning tasks.
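A minimal k-NN classifier fits in a few lines. The height and weight values below are invented for illustration, k = 3 is an arbitrary choice, and Euclidean distance is assumed as the similarity measure.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Made-up (height in cm, weight in kg) examples.
points = [(180, 80), (175, 78), (185, 90), (160, 55), (165, 60), (158, 52)]
labels = ["male", "male", "male", "female", "female", "female"]

guess_a = knn_predict(points, labels, (178, 82), k=3)
guess_b = knn_predict(points, labels, (162, 57), k=3)
```

Because the vote depends on raw distances, features measured on very different scales are usually rescaled before applying k-NN.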

7. Learning Vector Quantization

A downside of K-Nearest Neighbors is that you need to hang on to your entire training dataset. The Learning Vector Quantization algorithm (or LVQ for short) is an artificial neural network algorithm that allows you to choose how many training instances to hang onto and learns exactly what those instances should look like.

Additionally, the representation for LVQ is a collection of codebook vectors. They are selected randomly at the beginning and adapted to best summarize the training dataset over a number of iterations of the learning algorithm. Once learned, the codebook vectors can make predictions just like K-Nearest Neighbors. We find the most similar neighbour (best matching codebook vector) by calculating the distance between each codebook vector and the new data instance. The class value (or real value in the case of regression) for the best matching unit is then returned as the prediction. Moreover, you can get the best results if you rescale your data to have the same range, such as between 0 and 1.

If you discover that KNN gives good results on your dataset try using LVQ to reduce the memory requirements of storing the entire training dataset.
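The adaptation of codebook vectors can be sketched with the basic LVQ1 update rule: the best matching codebook vector moves toward a training instance of the same class and away from one of a different class. This is one common variant, not necessarily the one the text has in mind. The data, the fixed starting codebooks (used instead of random selection so the example is repeatable), and the linearly decaying learning rate are all assumptions.

```python
import math


def best_matching_unit(codebooks, x):
    """Index of the codebook vector closest to x."""
    return min(range(len(codebooks)), key=lambda i: math.dist(codebooks[i][0], x))


def train_lvq(data, labels, initial_codebooks, learning_rate=0.3, epochs=20):
    """LVQ1: pull the best match toward same-class points, push it away otherwise."""
    codebooks = [(list(vector), label) for vector, label in initial_codebooks]
    for epoch in range(epochs):
        rate = learning_rate * (1 - epoch / epochs)  # decay the step size over time
        for x, y in zip(data, labels):
            vector, label = codebooks[best_matching_unit(codebooks, x)]
            sign = 1 if label == y else -1
            for d in range(len(vector)):
                vector[d] += sign * rate * (x[d] - vector[d])
    return codebooks


def predict_lvq(codebooks, x):
    """Return the class of the best matching codebook vector."""
    return codebooks[best_matching_unit(codebooks, x)][1]


# Made-up data already scaled to the range 0..1.
data = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.2), (0.9, 0.9), (0.8, 0.9), (0.9, 0.8)]
labels = ["a", "a", "a", "b", "b", "b"]
start = [((0.4, 0.4), "a"), ((0.6, 0.6), "b")]

codebooks = train_lvq(data, labels, start)
```

After training, only the two codebook vectors need to be stored to classify new points, rather than all six training instances.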

8. Bagging and Random Forest

A Random Forest consists of a collection or ensemble of simple tree predictors, each capable of producing a response when presented with a set of predictor values. For classification problems, this response takes the form of a class membership, which associates, or classifies, a set of independent predictor values with one of the categories present in the dependent variable. Alternatively, for regression problems, the tree response is an estimate of the dependent variable given the predictors.

A Random Forest consists of an arbitrary number of simple trees, which determine the final outcome. For classification problems, the ensemble of simple trees votes for the most popular class. In the regression problem, we average responses to obtain an estimate of the dependent variable. Using tree ensembles can lead to significant improvement in prediction accuracy (i.e., better ability to predict new data cases).
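The way an ensemble combines its trees can be sketched without training real trees. In the snippet below each "tree" is a hypothetical one-question rule written as a function; only the bootstrap sampling, the majority vote, and the averaging are the point of the example.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the bagging step used to train each tree)."""
    return [rng.choice(rows) for _ in rows]


def forest_classify(trees, x):
    """Classification: every tree votes and the most popular class wins."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


def forest_regress(trees, x):
    """Regression: average the responses of all trees."""
    responses = [tree(x) for tree in trees]
    return sum(responses) / len(responses)


# Hypothetical stand-ins for trained trees.
class_trees = [
    lambda x: "spam" if x["links"] > 3 else "ham",
    lambda x: "spam" if x["caps_ratio"] > 0.5 else "ham",
    lambda x: "spam" if x["links"] > 10 else "ham",
]
value_trees = [lambda x: 10.0, lambda x: 12.0, lambda x: 14.0]

message = {"links": 5, "caps_ratio": 0.7}
label = forest_classify(class_trees, message)   # two of the three rules vote "spam"
estimate = forest_regress(value_trees, message)
sample = bootstrap_sample(list(range(10)), random.Random(0))
```

In a real Random Forest each tree would be fitted on its own bootstrap sample, typically with a random subset of features considered at each split.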

9. SVM

A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be employed for both classification and regression purposes. Also, SVMs have more common usage in classification problems and as such, this is what we will focus on in this post.

SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes, as shown in the image below.

Also, you can think of a hyperplane as a line that linearly separates and classifies a set of data.

Intuitively, the further from the hyperplane our data points lie, the more confident we are that they have been correctly classified. We, therefore, want our data points to be as far away from the hyperplane as possible, while still being on the correct side of it.

So when we add new test data, whichever side of the hyperplane it lands on decides the class that we assign to it.

The distance between the hyperplane and the nearest data point from either set is the margin. Furthermore, the goal is to choose a hyperplane with the greatest possible margin between the hyperplane and any point within the training set, giving a greater chance of correct classification of data.
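The margin of a given hyperplane can be computed directly from the definition above. The sketch below does not train an SVM; it only compares two hand-picked candidate hyperplanes of the form w·x + b = 0 on made-up, linearly separable points with labels -1 and +1.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def classify(w, b, x):
    """The side of the hyperplane decides the class."""
    return 1 if signed_distance(w, b, x) >= 0 else -1


def margin(w, b, points, labels):
    """Smallest distance from the hyperplane to any correctly classified point."""
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))


# Made-up training points.
points = [(1, 1), (2, 0), (4, 4), (5, 3)]
labels = [-1, -1, 1, 1]

margin_diagonal = margin((1, 1), -5, points, labels)  # hyperplane x1 + x2 = 5
margin_vertical = margin((1, 0), -3, points, labels)  # hyperplane x1 = 3
```

Both candidates separate the data, but the diagonal one leaves a margin of about 2.12 against 1.0 for the vertical one, so it is the candidate a maximum-margin classifier would prefer.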

But the data is rarely ever as clean as our simple example above. A dataset will often look more like the jumbled balls below which represent a linearly non-separable dataset.

10. Boosting and AdaBoost

Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers. We do this by building a model from the training data, then creating a second model that attempts to correct the errors from the first model. We can add models until the training set is predicted perfectly or a maximum number of models are added.

AdaBoost was the first really successful boosting algorithm developed for binary classification. It is the best starting point for understanding boosting. Modern boosting methods build on AdaBoost, most notably stochastic gradient boosting machines.

AdaBoost is used with short decision trees. After the first tree is created, its performance on each training instance is used to weight how much attention the next tree should pay to each training instance. Training data that is hard to predict is given more weight, whereas easy-to-predict instances are given less weight. Models are created sequentially, one after the other, each updating the weights on the training instances that affect the learning performed by the next tree in the sequence. After all the trees are built, predictions are made for new data, and each tree’s vote is weighted by how accurate it was on the training data.

Because so much attention is put on correcting mistakes by the algorithm it is important that you have clean data with outliers removed.
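One round of the instance re-weighting can be sketched with the classic AdaBoost formulas for labels in {-1, +1}. The labels and predictions below are made up, and the sketch assumes the weighted error is strictly between 0 and 1 so that the logarithm is defined.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """Return the model's vote weight (alpha) and the re-normalized instance weights."""
    # Weighted error: total weight of the misclassified instances.
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1.0 - error) / error)
    # t * p is +1 for a correct prediction and -1 for a mistake,
    # so mistakes are scaled up and correct instances are scaled down.
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Made-up round: five equally weighted instances, the last one misclassified.
weights = [0.2] * 5
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
```

With a weighted error of 0.2, alpha is about 0.693, the misclassified instance ends up with weight 0.5, and each correctly classified instance drops to 0.125, so the next tree concentrates on the hard case.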

Summary

A typical question asked by a beginner, when facing a wide variety of machine learning algorithms, is “which algorithm should I use?” The answer to the question varies depending on many factors, including: (1) The size, quality, and nature of data; (2) The available computational time; (3) The urgency of the task; and (4) What you want to do with the data.

Even an experienced data scientist cannot tell which algorithm will perform the best before trying different algorithms. Although there are many other Machine Learning algorithms, these are the most popular ones. If you’re a newbie to Machine Learning, these would be a good starting point to learn.

What are the basic elements of a machine learning algorithm?

The foundations of most algorithms lie in linear algebra, multivariable calculus, and optimization methods. Most algorithms use a sequence of combinations to estimate an objective function given a set of data, and the sequence order and included methods distinguish one algorithm from another. It’s helpful to learn enough math to read the development papers associated with key algorithms in the field, as many other methods (or one’s own innovations) include pieces of those algorithms. It’s like learning the language of machine learning. Once you are fluent in it, it’s pretty easy to modify algorithms as needed and create new ones likely to improve on a problem in a short period of time.

What is your favorite machine learning algorithm?

Matrix factorization: a simple, beautiful way to do dimensionality reduction —and dimensionality reduction is the essence of cognition. Recommender systems would be a big application of matrix factorization. Another application I’ve been using over the years (starting in 2010 with video data) is factorizing a matrix of pairwise mutual information (or pointwise mutual information, which is more common) between features, which can be used for feature extraction, computing word embeddings, computing label embeddings (that was the topic of a recent paper of mine [1]), etc.

Used in a convolutional setting, this acts as an excellent unsupervised feature extractor for images and videos. There’s one big issue though: it is fundamentally a shallow algorithm. Deep neural networks will quickly outperform it if any kind of supervision labels are available.
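For the recommender-system use case mentioned above, a low-rank factorization can be sketched with plain stochastic gradient descent. This is a generic illustration, not the author's method: it factorizes a small made-up ratings matrix rather than a mutual-information matrix, and the rank, learning rate, regularization strength, and seed are arbitrary assumptions.

```python
import random


def factorize(ratings, rank=2, learning_rate=0.02, reg=0.01, epochs=1000, seed=0):
    """Approximate the observed entries of ratings by the product of two thin matrices."""
    rng = random.Random(seed)
    n_rows, n_cols = len(ratings), len(ratings[0])
    P = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_rows)]
    Q = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_cols)]
    observed = [
        (i, j, r)
        for i, row in enumerate(ratings)
        for j, r in enumerate(row)
        if r is not None
    ]
    for _ in range(epochs):
        for i, j, r in observed:
            error = r - sum(P[i][k] * Q[j][k] for k in range(rank))
            for k in range(rank):
                p, q = P[i][k], Q[j][k]
                P[i][k] += learning_rate * (error * q - reg * p)
                Q[j][k] += learning_rate * (error * p - reg * q)
    return P, Q


def squared_error(ratings, P, Q):
    """Sum of squared differences over the observed entries only."""
    total = 0.0
    for i, row in enumerate(ratings):
        for j, r in enumerate(row):
            if r is not None:
                estimate = sum(p * q for p, q in zip(P[i], Q[j]))
                total += (r - estimate) ** 2
    return total


# Made-up user x item ratings; None marks a missing rating.
ratings = [[5, 3, None], [4, None, 1], [1, 1, 5], [None, 1, 4]]

P0, Q0 = factorize(ratings, epochs=0)  # untrained starting point
P, Q = factorize(ratings)
initial_error = squared_error(ratings, P0, Q0)
final_error = squared_error(ratings, P, Q)
```

The rows of P and Q are the reduced-dimension representations of users and items; a missing rating is estimated by the dot product of the corresponding rows.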

Machine Learning Demos:

1- TensorFlow Demos

LipSync by YouTube

See how well you synchronize to the lyrics of the popular hit “Dance Monkey.” This in-browser experience uses the Facemesh model for estimating key points around the lips to score lip-syncing accuracy.

Emoji Scavenger Hunt

Use your phone’s camera to identify emojis in the real world. Can you find all the emojis before time expires?

Webcam Controller

Play Pac-Man using images trained in your browser.

Teachable Machine

No coding required! Teach a machine to recognize images and play sounds.

Move Mirror

Explore pictures in a fun new way, just by moving around.

Performance RNN

Enjoy a real-time piano performance by a neural network.

Node.js Pitch Prediction

Train a server-side model to classify baseball pitch types using Node.js.

Visualize Model Training

See how to visualize in-browser training and model behaviour using tfjs-vis.

Community demos

Get started with official templates and explore top picks from the community for inspiration.

Glitch

Check out community Glitches and make your own TensorFlow.js-powered projects.

CodePen

Fork boilerplate templates and check out working examples from the community.

GitHub Community Projects

See what the community has created and submitted to the TensorFlow.js gallery page.

https://cdpn.io/jasonmayes/fullcpgrid/QWbNeJd

Real time body segmentation using TensorFlow.js

Load in a pre-trained Body-Pix model from the TensorFlow.js team so that you can locate all pixels in an image that are part of a body, and what part of the body they belong to. Clone this to make your own TensorFlow.js powered projects to recognize body parts in images from your webcam and more!

https://cdpn.io/jasonmayes/fullcpgrid/qBEJxgg

Multiple object detection using pre trained model in TensorFlow.js

This demo shows how we can use a pre made machine learning solution to recognize objects (yes, more than one at a time!) on any image you wish to present to it. Even better, not only do we know that the image contains an object, but we can also get the co-ordinates of the bounding box for each object it finds, which allows you to highlight the found object in the image.

For this demo we are loading a model using the ImageNet-SSD architecture, to recognize 90 common objects it has already been taught to find from the COCO dataset.

If what you want to recognize is in that list of things it knows about (for example a cat, dog, etc), this may be useful to you as is in your own projects, or just to experiment with Machine Learning in the browser and get familiar with the possibilities of machine learning.

If you are feeling particularly confident you can check out our GitHub documentation (https://github.com/tensorflow/tfjs-models/tree/master/coco-ssd) which goes into much more detail for customizing various parameters to tailor performance to your needs.

https://cdpn.io/jasonmayes/fullcpgrid/Jjompww

Classifying images using a pre trained model in TensorFlow.js

This demo shows how we can use a pre-made machine learning solution to classify images (aka an image classifier). It should be noted that this model works best when a single item is in the image at a time. Busy images may not work so well. You may want to try our demo for Multiple Object Detection (https://codepen.io/jasonmayes/pen/qBEJxgg) for that.

For this demo we are loading a model using the MobileNet architecture, to recognize 1000 common objects it has already been taught to find from the ImageNet data set (http://image-net.org/).

Please note: This demo loads an easy-to-use JavaScript class made by the TensorFlow.js team to do the hard work for you, so no machine learning knowledge is needed to use it.

If you were looking to learn how to load in a TensorFlow.js saved model directly yourself then please see our tutorial on loading TensorFlow.js models directly.

If you want to train a system to recognize your own objects, using your own data, then check out our tutorials on “transfer learning”.

Tensorflow.js Boilerplate

The hello world for TensorFlow.js 🙂 Absolute minimum needed to import into your website and simply prints the loaded TensorFlow.js version. From here we can do great things. Clone this to make your own TensorFlow.js powered projects or if you are following a tutorial that needs TensorFlow.js to work.

Examples

tfjs-examples provides small code examples that implement various ML tasks using TensorFlow.js.

MNIST Digit Recognizer

Train a model to recognize handwritten digits from the MNIST database.

Addition RNN

Train a model to learn addition from text examples.

TensorFlow.js Layers: Iris Demo

More TensorFlow examples


Complete overview of machine learning concepts seen in 27 data science and machine learning interviews:

Supervised Learning

Linear Regression

Logistic Regression

Naive Bayes

Support Vector Machines

Decision Trees

K-Nearest Neighbors

Test your knowledge

Machine Learning in Practice

Bias-Variance Tradeoff

How to Select a Model

How to Select Features

Regularizing Your Model

Ensembling: How to Combine Your Models

Evaluation Metrics

Unsupervised Learning

Market Basket Analysis

K-Means Clustering

Principal Components Analysis

Deep Learning

Feedforward Neural Networks

Grab Bag of Neural Network Practices

Convolutional Neural Networks

Recurrent Neural Networks

Test Your Knowledge

Feature Extraction

Best Subset Features

Feature Selection Examples

Adding Features Example
Activation Practice I
Activation Practice II
Activation Practice III
Weight Initialization
Batch vs. Stochastic

Recurrent Network Advantages

Alternative Recurrent Units

Convolutional Application
Convolutional Layer Advantages

Are you interested in becoming an AWS Certified Machine Learning Specialist? If so, then this exam preparation blog is for you! The blog contains over 100 quiz and practice exam questions, as well as detailed answers. The questions are very similar to those you will encounter on the actual exam, so this is a great way to prepare. In addition, the blog also includes cheat sheets and illustrations to help you understand the concepts better.

Bring your own algorithm to an MLOps Pipeline: Architecture

AWS Certified Machine Learning Specialty Exam Prep MLS-C01: AWS architecture diagram showing all services used and how they are connected

Code and Serve Your ML Model with AWS CodeBuild

What are some ways we can use machine learning and artificial intelligence for algorithmic trading in the stock market?

How do we know that the Top 3 Voice Recognition Devices like Siri Alexa and Ok Google are not spying on us?

What are some good datasets for Data Science and Machine Learning?

Machine Learning Engineer Interview Questions and Answers


October 12, 2020 | January 24, 2023

Big Data and Data Analytics 101 – Top 100 AWS Certified Data Analytics Specialty Certification Questions and Answers Dumps


Top 100 AWS Certified Data Analytics Specialty Certification Questions and Answers Dumps

If you’re looking to take your data analytics career to the next level, then this AWS Data Analytics Specialty Certification Exam Preparation blog is a must-read! With over 100 exam questions and answers, plus data science and data analytics interview questions, cheat sheets and more, you’ll be fully prepared to ace the DAS-C01 exam.

In this blog, we talk about big data and data analytics; we also give you the latest top 100 AWS Certified Data Analytics – Specialty Questions and Answers Dumps.

The AWS Certified Data Analytics – Specialty (DAS-C01) examination is intended for individuals who perform in a data analytics-focused role. This exam validates an examinee’s comprehensive understanding of using AWS services to design, build, secure, and maintain analytics solutions that provide insight from data.


https://enoumen.com/2021/11/07/top-100-data-science-and-data-analytics-interview-questions-and-answers/

The AWS Certified Data Analytics – Specialty (DAS-C01) covers the following domains:

Domain 1: Collection 18%

Domain 2: Storage and Data Management 22%

Domain 3: Processing 24%


Domain 4: Analysis and Visualization 18%


Domain 5: Security 18%

Below are the Top 100 AWS Certified Data Analytics – Specialty Questions and Answers Dumps and References –

Question 1: What combination of services do you need for the following requirements: accelerate petabyte-scale data transfers, load streaming data, and create scalable, private connections? Select the answer that lists the services in the correct order.

A) Snowball, Kinesis Firehose, Direct Connect


B) Data Migration Services, Kinesis Firehose, Direct Connect

C) Snowball, Data Migration Services, Direct Connect


D) Snowball, Direct Connection, Kinesis Firehose

ANSWER1: A

Notes/Hint1:

AWS has many options to help get data into the cloud, including secure devices like AWS Import/Export Snowball to accelerate petabyte-scale data transfers, Amazon Kinesis Firehose to load streaming data, and scalable private connections through AWS Direct Connect.

Reference1: Big Data Analytics Options

AWS Data Analytics Specialty Certification Exam Preparation App is a great way to prepare for your upcoming AWS Data Analytics Specialty Certification Exam. The app provides you with over 300 questions and answers, detailed explanations of each answer, a scorecard to track your progress, and a countdown timer to help keep you on track. You can also find data science and data analytics interview questions and detailed answers, cheat sheets, and flashcards to help you study. The app is very similar to the real exam, so you will be well-prepared when it comes time to take the test.

Question2: A company ingests a large set of clickstream data in nested JSON format from different sources and stores it in Amazon S3. Data analysts need to analyze this data in combination with data stored in an Amazon Redshift cluster. Data analysts want to build a cost-effective and automated solution for this need.
Which solution meets these requirements?

A) Use Apache Spark SQL on Amazon EMR to convert the clickstream data to a tabular format. Use the Amazon Redshift COPY command to load the data into the Amazon Redshift cluster.

B) Use AWS Lambda to convert the data to a tabular format and write it to Amazon S3. Use the Amazon Redshift COPY command to load the data into the Amazon Redshift cluster.

C) Use the Relationalize class in an AWS Glue ETL job to transform the data and write the data back to Amazon S3. Use Amazon Redshift Spectrum to create external tables and join with the internal tables.

D) Use the Amazon Redshift COPY command to move the clickstream data directly into new tables in the Amazon Redshift cluster.

ANSWER2: C

Notes/Hint2:

The Relationalize PySpark transform can be used to flatten the nested data into a structured format. Amazon Redshift Spectrum can join the external tables and query the transformed clickstream data in place rather than needing to scale the cluster to accommodate the large dataset.

Reference2: Relationalize PySpark
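
To make the flattening step concrete, here is a minimal pure-Python sketch of what turning a nested record into tabular columns means. It is not the AWS Glue Relationalize transform itself, which runs inside a Glue ETL job and also handles arrays; the function name `flatten_record` and the sample click event are hypothetical and exist only for illustration.

```python
from typing import Any, Dict


def flatten_record(record: Dict[str, Any], parent: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dicts into a single level with dotted column names."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        column = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the path as a prefix.
            flat.update(flatten_record(value, column, sep))
        else:
            # Scalars (and lists, which this sketch leaves untouched) become columns.
            flat[column] = value
    return flat


# Hypothetical clickstream event, for illustration only.
event = {"user": {"id": 42, "geo": {"country": "DE"}}, "page": "/home"}
flat_event = flatten_record(event)
```

Each flattened record has a fixed set of column names, which is the shape an external table definition can describe.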


Question 3: There is a five-day car rally race across Europe. The race coordinators are using a Kinesis stream and IoT sensors to monitor the movement of the cars. Each car has a sensor and data is getting back to the stream with the default stream settings. On the last day of the rally, data is sent to S3. When you go to interpret the data in S3, there is only data for the last day and nothing for the first 4 days. Which of the following is the most probable cause of this?

A) You did not have versioning enabled and would need to create individual buckets to prevent the data from being overwritten.

B) Data records are only accessible for a default of 24 hours from the time they are added to a stream.

C) One of the sensors failed, so there was no data to record.

D) You needed to use EMR to send the data to S3; Kinesis Streams are only compatible with DynamoDB.

ANSWER3: B

Notes/Hint3:

Kinesis streams support changes to the data record retention period. An Amazon Kinesis stream is an ordered sequence of data records, meant to be written to and read from in real time. Data records are therefore stored in the shards of your stream only temporarily. The period from when a record is added to when it is no longer accessible is called the retention period. An Amazon Kinesis stream stores records for 24 hours by default; the retention period can be extended up to 8,760 hours (365 days).


Reference3: Kinesis Extended Reading
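
The retention arithmetic behind this answer can be sketched in a few lines. The dates below are hypothetical, and `extend_retention` assumes that `kinesis_client` is a boto3 Kinesis client (`boto3.client("kinesis")`); it is one possible way to raise the retention period before the race, not the only one.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION_HOURS = 24


def is_still_readable(arrival: datetime, now: datetime,
                      retention_hours: int = DEFAULT_RETENTION_HOURS) -> bool:
    """Return True if a record that arrived at `arrival` is still inside the retention window."""
    return now - arrival < timedelta(hours=retention_hours)


def extend_retention(kinesis_client, stream_name: str, hours: int) -> None:
    """Raise the retention period; `kinesis_client` is assumed to be boto3.client("kinesis")."""
    kinesis_client.increase_stream_retention_period(
        StreamName=stream_name, RetentionPeriodHours=hours
    )


race_start = datetime(2024, 6, 1, 8, 0, tzinfo=timezone.utc)  # hypothetical date
read_time = race_start + timedelta(days=4, hours=10)           # reading on day 5

day1_readable = is_still_readable(race_start, read_time)
day5_readable = is_still_readable(race_start + timedelta(days=4), read_time)
day1_readable_extended = is_still_readable(race_start, read_time, retention_hours=120)
```

With the default window, only the day-5 record is still readable at `read_time`; with a 120-hour window the day-1 record would be readable as well.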

Question 4: A publisher website captures user activity and sends clickstream data to Amazon Kinesis Data Streams. The publisher wants to design a cost-effective solution to process the data to create a timeline of user activity within a session. The solution must be able to scale depending on the number of active sessions.
Which solution meets these requirements?

A) Include a variable in the clickstream data from the publisher website to maintain a counter for the number of active user sessions. Use a timestamp for the partition key for the stream. Configure the consumer application to read the data from the stream and change the number of processor threads based upon the counter. Deploy the consumer application on Amazon EC2 instances in an EC2 Auto Scaling group.

B) Include a variable in the clickstream to maintain a counter for each user action during their session. Use the action type as the partition key for the stream. Use the Kinesis Client Library (KCL) in the consumer application to retrieve the data from the stream and perform the processing. Configure the consumer application to read the data from the stream and change the number of processor threads based upon the counter. Deploy the consumer application on AWS Lambda.

C) Include a session identifier in the clickstream data from the publisher website and use it as the partition key for the stream. Use the Kinesis Client Library (KCL) in the consumer application to retrieve the data from the stream and perform the processing. Deploy the consumer application on Amazon EC2 instances in an EC2 Auto Scaling group. Use an AWS Lambda function to reshard the stream based upon Amazon CloudWatch alarms.

D) Include a variable in the clickstream data from the publisher website to maintain a counter for the number of active user sessions. Use a timestamp for the partition key for the stream. Configure the consumer application to read the data from the stream and change the number of processor threads based upon the counter. Deploy the consumer application on AWS Lambda.

ANSWER4: C

Notes/Hint4:

Partitioning by the session ID allows a single processor to process all the actions for a user session in order. An AWS Lambda function can call the UpdateShardCount API action to change the number of shards in the stream. The KCL automatically manages the number of processors to match the number of shards. Amazon EC2 Auto Scaling ensures that the correct number of instances is running to meet the processing load.

Reference4: UpdateShardCount API
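
Why the session ID keeps a session's actions together can be illustrated with the way Kinesis routes records: the partition key is hashed with MD5 into a 128-bit integer, and that integer falls into exactly one shard's hash key range. The sketch below assumes the shards split the hash space evenly, which need not hold after resharding; `shard_for_key` and the session IDs are hypothetical names.

```python
import hashlib

HASH_SPACE = 2 ** 128  # size of the 128-bit hash key space


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    # Scale the hash into [0, shard_count); the same key always gives the same index.
    return hash_value * shard_count // HASH_SPACE


# Two actions from the same (hypothetical) session go to the same shard.
shard_a = shard_for_key("session-1234", shard_count=4)
shard_b = shard_for_key("session-1234", shard_count=4)
```

A timestamp or action type used as the key would scatter one session's actions across shards, so no single processor would see them in order.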


Question 5: Your company has two batch processing applications that consume financial data about the day’s stock transactions. Each transaction needs to be stored durably, with a guarantee that a record of each transaction is delivered to both applications so that the audit and billing batch processing applications can process the data. However, the two applications run separately and several hours apart, and they need access to the same transaction information. After the transaction information for the day has been reviewed, it no longer needs to be stored. What is the best way to architect this application?

A) Use SQS for storing the transaction messages; when the billing batch process performs first and consumes the message, write the code in a way that does not remove the message after consumed, so it is available for the audit application several hours later. The audit application can consume the SQS message and remove it from the queue when completed.

B) Use Kinesis to store the transaction information. The billing application will consume data from the stream and the audit application can consume the same data several hours later.

C) Store the transaction information in a DynamoDB table. The billing application can read the rows while the audit application will read the rows then remove the data.

D) Use SQS for storing the transaction messages. When the billing batch process consumes each message, have the application create an identical message and place it in a different SQS for the audit application to use several hours later.

ANSWER5: B

Notes/Hint5:

Kinesis appears to be the best solution because it allows multiple consumers to easily interact with the same records. SQS would make this more difficult because the data does not need to persist after a full day.

Reference5: Amazon Kinesis
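
The difference that matters here is that reading from a stream does not remove the record, so a second consumer can read the same data later. The toy class below is an in-memory stand-in for that behavior, not a Kinesis client; in Kinesis each consumer tracks its own position (the KCL checkpoints it), whereas this sketch keeps the positions inside the class only for brevity. All names are hypothetical.

```python
from typing import Dict, List


class ToyStream:
    """In-memory stand-in for a stream: reads never delete records."""

    def __init__(self) -> None:
        self._records: List[str] = []
        self._positions: Dict[str, int] = {}

    def put(self, record: str) -> None:
        self._records.append(record)

    def read(self, consumer: str) -> List[str]:
        """Return records this consumer has not seen yet and advance its position."""
        start = self._positions.get(consumer, 0)
        self._positions[consumer] = len(self._records)
        return self._records[start:]


stream = ToyStream()
for trade in ["BUY AAPL 10", "SELL MSFT 5"]:
    stream.put(trade)

billing_view = stream.read("billing")  # runs first
audit_view = stream.read("audit")      # runs hours later, sees the same records
```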

Question 6: A company is currently using Amazon DynamoDB as the database for a user support application. The company is developing a new version of the application that will store a PDF file for each support case ranging in size from 1–10 MB. The file should be retrievable whenever the case is accessed in the application.
How can the company store the file in the MOST cost-effective manner?

A) Store the file in Amazon DocumentDB and the document ID as an attribute in the DynamoDB table.

B) Store the file in Amazon S3 and the object key as an attribute in the DynamoDB table.

C) Split the file into smaller parts and store the parts as multiple items in a separate DynamoDB table.

D) Store the file as an attribute in the DynamoDB table using Base64 encoding.

ANSWER6: B

Notes/Hint6:

Use Amazon S3 to store large attribute values that cannot fit in an Amazon DynamoDB item. Store each file as an object in Amazon S3 and then store the object path in the DynamoDB item.

Reference6: S3 Storage Cost – DynamoDB Storage Cost
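
The pointer pattern can be sketched as follows. A DynamoDB item is limited to 400 KB, so a 1–10 MB PDF cannot be stored as a single attribute. The sketch assumes `s3_client` and `dynamodb_client` are boto3 clients (`boto3.client("s3")` and `boto3.client("dynamodb")`); the bucket, table, key layout, and attribute names are hypothetical.

```python
DYNAMODB_MAX_ITEM_BYTES = 400 * 1024  # DynamoDB's per-item size limit


def fits_in_dynamodb_item(size_bytes: int) -> bool:
    """Return True if a payload of this size could fit inside one DynamoDB item."""
    return size_bytes <= DYNAMODB_MAX_ITEM_BYTES


def store_case_pdf(s3_client, dynamodb_client, bucket: str, table: str,
                   case_id: str, pdf_bytes: bytes) -> str:
    """Upload the PDF to S3 and record its object key on the support-case item."""
    key = f"cases/{case_id}.pdf"  # hypothetical key layout
    s3_client.put_object(Bucket=bucket, Key=key, Body=pdf_bytes)
    # update_item adds the pointer without overwriting the rest of the item.
    dynamodb_client.update_item(
        TableName=table,
        Key={"case_id": {"S": case_id}},
        UpdateExpression="SET pdf_s3_key = :k",
        ExpressionAttributeValues={":k": {"S": key}},
    )
    return key


one_megabyte_fits = fits_in_dynamodb_item(1 * 1024 * 1024)
```

When a case is opened, the application reads `pdf_s3_key` from the item and fetches the object from S3.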

Question 7: Your client has a web app that emits multiple events to Amazon Kinesis Streams for reporting purposes. Critical events need to be immediately captured before processing can continue, but informational events do not need to delay processing. What solution should your client use to record these types of events without unnecessarily slowing the application?

A) Log all events using the Kinesis Producer Library.

B) Log critical events using the Kinesis Producer Library, and log informational events using the PutRecords API method.

C) Log critical events using the PutRecords API method, and log informational events using the Kinesis Producer Library.

D) Log all events using the PutRecords API method.

ANSWER7: C

Notes/Hint7:

A PutRecords call can be made synchronously: the application waits for the API request to complete before it continues, so use it when critical events must be confirmed as logged before processing moves on. The Kinesis Producer Library (KPL) is asynchronous and can send many messages without slowing down your application, which makes it ideal for sending many non-critical events.

Reference7: PutRecords API
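
The difference between the two paths can be illustrated with a toy producer. This is a conceptual sketch only: it does not use the Kinesis API or the KPL (which is a Java library), and `EventLogger` and its `send` callable are hypothetical names introduced for illustration.

```python
import queue
import threading


class EventLogger:
    """Toy producer: critical events block, informational events are queued."""

    def __init__(self, send):
        self._send = send                  # callable that delivers one event
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log_critical(self, event):
        # Synchronous path: return only after the send call has completed.
        self._send(event)

    def log_info(self, event):
        # Asynchronous path: enqueue and return immediately.
        self._pending.put(event)

    def _drain(self):
        # Background worker that delivers queued events in arrival order.
        while True:
            event = self._pending.get()
            try:
                self._send(event)
            finally:
                self._pending.task_done()

    def flush(self):
        # Block until every queued informational event has been sent.
        self._pending.join()
```

The caller of `log_critical` pays the full latency of the send, while the caller of `log_info` pays only the cost of an in-memory enqueue.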

Question 8: You work for a start-up that tracks commercial delivery trucks via GPS. You receive coordinates that are transmitted from each delivery truck once every 6 seconds. You need to process these coordinates in near real-time from multiple sources and load them into Elasticsearch without significant technical overhead to maintain. Which tool should you use to digest the data?

A) Amazon SQS

B) Amazon EMR

C) AWS Data Pipeline

D) Amazon Kinesis Firehose

ANSWER8:
D

Notes/Hint8:

Amazon Kinesis Firehose is the easiest way to load streaming data into AWS. It can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service, enabling near real-time analytics with existing business intelligence tools and dashboards.

Reference8: Amazon Kinesis Firehose

Question 9: A company needs to implement a near-real-time fraud prevention feature for its ecommerce site. User and order details need to be delivered to an Amazon SageMaker endpoint to flag suspected fraud. The amount of input data needed for the inference could be as much as 1.5 MB.
Which solution meets the requirements with the LOWEST overall latency?

A) Create an Amazon Managed Streaming for Kafka cluster and ingest the data for each order into a topic. Use a Kafka consumer running on Amazon EC2 instances to read these messages and invoke the Amazon SageMaker endpoint.

B) Create an Amazon Kinesis Data Streams stream and ingest the data for each order into the stream. Create an AWS Lambda function to read these messages and invoke the Amazon SageMaker endpoint.

C) Create an Amazon Kinesis Data Firehose delivery stream and ingest the data for each order into the stream. Configure Kinesis Data Firehose to deliver the data to an Amazon S3 bucket. Trigger an AWS Lambda function with an S3 event notification to read the data and invoke the Amazon SageMaker endpoint.

D) Create an Amazon SNS topic and publish the data for each order to the topic. Subscribe the Amazon SageMaker endpoint to the SNS topic.

ANSWER9:
A

Notes/Hint9:

An Amazon Managed Streaming for Kafka cluster can be used to deliver the messages with very low latency. It has a configurable message size that can handle the 1.5 MB payload.

Reference9: Amazon Managed Streaming for Kafka cluster

Question 10: You need to filter and transform incoming messages coming from a smart sensor you have connected with AWS. Once messages are received, you need to store them as time series data in DynamoDB. Which AWS service can you use?

A) IoT Device Shadow Service

B) Redshift

C) Kinesis

D) IoT Rules Engine

ANSWER10:
D

Notes/Hint10:

The IoT rules engine lets you filter and transform incoming device messages and route them to AWS services such as DynamoDB.

Reference10: The IoT rules engine
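
As a rough sketch of what such a rule might look like, the function below builds a topic rule payload in the shape accepted by boto3's `iot.create_topic_rule`. The topic filter, field names, temperature threshold, rule name, table name, and role ARN are all assumptions made up for illustration.

```python
def sensor_rule_payload(table_name, role_arn):
    """Build an IoT topic rule payload that filters, reshapes, and stores messages."""
    return {
        # SELECT picks and reshapes fields; WHERE filters messages.
        # Topic filter and threshold are hypothetical.
        "sql": (
            "SELECT deviceId, temperature, timestamp() AS ts "
            "FROM 'sensors/+/telemetry' WHERE temperature > 30"
        ),
        "awsIotSqlVersion": "2016-03-23",
        "ruleDisabled": False,
        "actions": [
            {
                # dynamoDBv2 writes each selected field to its own attribute.
                "dynamoDBv2": {
                    "roleArn": role_arn,
                    "putItem": {"tableName": table_name},
                }
            }
        ],
    }


def create_sensor_rule(iot_client, rule_name, payload):
    # iot_client is expected to behave like boto3.client("iot").
    iot_client.create_topic_rule(ruleName=rule_name, topicRulePayload=payload)
```

With `deviceId` as the table's partition key and `ts` as its sort key (an assumed table design), the stored items form a time series per device.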

Question 11: A media company is migrating its on-premises legacy Hadoop cluster with its associated data processing scripts and workflow to an Amazon EMR environment running the latest Hadoop release. The developers want to reuse the Java code that was written for data processing jobs for the on-premises cluster.
Which approach meets these requirements?

A) Deploy the existing Oracle Java Archive as a custom bootstrap action and run the job on the EMR cluster.

B) Compile the Java program for the desired Hadoop version and run it using a CUSTOM_JAR step on the EMR cluster.

C) Submit the Java program as an Apache Hive or Apache Spark step for the EMR cluster.

D) Use SSH to connect the master node of the EMR cluster and submit the Java program using the AWS CLI.

ANSWER11:
B

Notes/Hint11:

A CUSTOM_JAR step can be configured to download a JAR file from an Amazon S3 bucket and execute it. Since the Hadoop versions are different, the Java application has to be recompiled.

Reference11: Automating analytics workflows on EMR
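
For illustration, a custom JAR step can be described as a plain dictionary and submitted with boto3's `add_job_flow_steps`. The sketch below assumes a recompiled JAR already uploaded to S3; the step name, S3 URI, arguments, and cluster ID are hypothetical.

```python
def custom_jar_step(name, jar_s3_uri, args):
    """Describe an EMR step that runs a JAR stored in S3."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": jar_s3_uri,        # JAR recompiled for the cluster's Hadoop version
            "Args": list(args),       # arguments passed to the JAR's main class
        },
    }


def submit_step(emr_client, cluster_id, step):
    # emr_client is expected to behave like boto3.client("emr").
    response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"]
```

A call might look like `submit_step(emr, "j-EXAMPLE", custom_jar_step("nightly-etl", "s3://my-bucket/jobs/etl.jar", ["--date", "2021-01-01"]))`, where every value is a placeholder.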

Question 12: You currently have databases running on-site and in another data center off-site. What service allows you to consolidate to one database in Amazon?

A) AWS Kinesis

B) AWS Database Migration Service

C) AWS Data Pipeline

D) AWS RDS Aurora

ANSWER12:
B

Notes/Hint12:

AWS Database Migration Service can migrate your data to and from most of the widely used commercial and open source databases. It supports homogeneous migrations such as Oracle to Oracle, as well as heterogeneous migrations between different database platforms, such as Oracle to Amazon Aurora. Migrations can be from on-premises databases to Amazon RDS or Amazon EC2, databases running on EC2 to RDS, or vice versa, as well as from one RDS database to another RDS database.

Reference12: DMS

Question 13: An online retail company wants to perform analytics on data in large Amazon S3 objects using Amazon EMR. An Apache Spark job repeatedly queries the same data to populate an analytics dashboard. The analytics team wants to minimize the time to load the data and create the dashboard.
Which approaches could improve the performance? (Select TWO.)

A) Copy the source data into Amazon Redshift and rewrite the Apache Spark code to create analytical reports by querying Amazon Redshift.

B) Copy the source data from Amazon S3 into Hadoop Distributed File System (HDFS) using s3distcp.

C) Load the data into Spark DataFrames.

D) Stream the data into Amazon Kinesis and use the Kinesis Connector Library (KCL) in multiple Spark jobs to perform analytical jobs.

E) Use Amazon S3 Select to retrieve the data necessary for the dashboards from the S3 objects.

ANSWER13:

C and E

Notes/Hint13:

One of the speed advantages of Apache Spark comes from loading data into immutable DataFrames, which can be accessed repeatedly in memory. Spark DataFrames organize distributed data into columns, which makes summaries and aggregates much quicker to calculate. Also, instead of loading an entire large Amazon S3 object, load only what is needed using Amazon S3 Select. Keeping the data in Amazon S3 avoids loading the large dataset into HDFS.

Reference13: Spark DataFrames

Question 14: You have been hired as a consultant to provide a solution to integrate a client’s on-premises data center to AWS. The customer requires a 300 Mbps dedicated, private connection to their VPC. Which AWS tool do you need?

A) VPC peering

B) Data Pipeline

C) Direct Connect

D) EMR

ANSWER14:
C

Notes/Hint14:

Direct Connect will provide a dedicated and private connection to an AWS VPC.

Reference14: Direct Connect

Question 15: Your organization has a variety of different services deployed on EC2 and needs to efficiently send application logs over to a central system for processing and analysis. They’ve determined it is best to use a managed AWS service to transfer their data from the EC2 instances into Amazon S3 and they’ve decided to use a solution that will do what?

A) Installs the AWS Direct Connect client on all EC2 instances and uses it to stream the data directly to S3.

B) Leverages the Kinesis Agent to send data to Kinesis Data Streams and output that data in S3.

C) Ingests the data directly from S3 by configuring regular Amazon Snowball transactions.

D) Leverages the Kinesis Agent to send data to Kinesis Firehose and output that data in S3.

ANSWER15:
D

Notes/Hint15:

Kinesis Firehose is a managed solution, and log files can be sent from EC2 to Firehose to S3 using the Kinesis agent.

Reference15: Kinesis Firehose
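
The Kinesis Agent reads its settings from a JSON file (by default `/etc/aws-kinesis/agent.json`). The sketch below generates a minimal configuration with one flow that tails log files and sends them to a Firehose delivery stream; the log path and delivery stream name are hypothetical.

```python
import json


def kinesis_agent_config(log_glob, delivery_stream):
    """Build a minimal Kinesis Agent configuration with a single Firehose flow."""
    return {
        "cloudwatch.emitMetrics": True,
        "flows": [
            {
                "filePattern": log_glob,            # files the agent tails
                "deliveryStream": delivery_stream,  # Firehose stream that writes to S3
            }
        ],
    }


# Hypothetical values; the output would be saved as agent.json on each instance.
print(json.dumps(kinesis_agent_config("/var/log/myapp/*.log", "app-logs-to-s3"), indent=2))
```

A flow that targets Kinesis Data Streams instead would use the `kinesisStream` key in place of `deliveryStream`, but then a separate consumer is needed to land the data in S3.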

Question 16: A data engineer needs to create a dashboard to display social media trends during the last hour of a large company event. The dashboard needs to display the associated metrics with a latency of less than 1 minute.
Which solution meets these requirements?

A) Publish the raw social media data to an Amazon Kinesis Data Firehose delivery stream. Use Kinesis Data Analytics for SQL Applications to perform a sliding window analysis to compute the metrics and output the results to a Kinesis Data Streams data stream. Configure an AWS Lambda function to save the stream data to an Amazon DynamoDB table. Deploy a real-time dashboard hosted in an Amazon S3 bucket to read and display the metrics data stored in the DynamoDB table.

B) Publish the raw social media data to an Amazon Kinesis Data Firehose delivery stream. Configure the stream to deliver the data to an Amazon Elasticsearch Service cluster with a buffer interval of 0 seconds. Use Kibana to perform the analysis and display the results.

C) Publish the raw social media data to an Amazon Kinesis Data Streams data stream. Configure an AWS Lambda function to compute the metrics on the stream data and save the results in an Amazon S3 bucket. Configure a dashboard in Amazon QuickSight to query the data using Amazon Athena and display the results.

D) Publish the raw social media data to an Amazon SNS topic. Subscribe an Amazon SQS queue to the topic. Configure Amazon EC2 instances as workers to poll the queue, compute the metrics, and save the results to an Amazon Aurora MySQL database. Configure a dashboard in Amazon QuickSight to query the data in Aurora and display the results.

ANSWER16:
A

Notes/Hint16:

Amazon Kinesis Data Analytics can query data in a Kinesis Data Firehose delivery stream in near-real time using SQL. A sliding window analysis is appropriate for determining trends in the stream. Amazon S3 can host a static webpage that includes JavaScript that reads the data in Amazon DynamoDB and refreshes the dashboard.

Reference16: Amazon Kinesis Data Analytics can query data in a Kinesis Data Firehose delivery stream in near-real time using SQL
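
To make the idea of a sliding window concrete, here is a small in-memory sketch in plain Python. It is not Kinesis Data Analytics SQL; it only shows the technique of counting events within the most recent hour as new events arrive. The class name and the assumption that timestamps arrive in non-decreasing order are illustrative.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Count hashtags seen during the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = deque()    # (timestamp, hashtag) in arrival order
        self._counts = Counter()

    def add(self, timestamp, hashtag):
        # Assumes timestamps are non-decreasing.
        self._events.append((timestamp, hashtag))
        self._counts[hashtag] += 1
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have slid out of the window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_tag = self._events.popleft()
            self._counts[old_tag] -= 1
            if self._counts[old_tag] == 0:
                del self._counts[old_tag]

    def top(self, n):
        return self._counts.most_common(n)
```

Each new event both updates the counts and evicts expired events, so the metrics always reflect the trailing hour rather than fixed, non-overlapping intervals.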

Question 17: A real estate company is receiving new property listing data from its agents through .csv files every day and storing these files in Amazon S3. The data analytics team created an Amazon QuickSight visualization report that uses a dataset imported from the S3 files. The data analytics team wants the visualization report to reflect the current data up to the previous day. How can a data analyst meet these requirements?

A) Schedule an AWS Lambda function to drop and re-create the dataset daily.

B) Configure the visualization to query the data in Amazon S3 directly without loading the data into SPICE.

C) Schedule the dataset to refresh daily.

D) Close and open the Amazon QuickSight visualization.

ANSWER17:
C

Notes/Hint17:

Datasets created using Amazon S3 as the data source are automatically imported into SPICE. The Amazon QuickSight console allows for the refresh of SPICE data on a schedule.

Reference17: Amazon QuickSight and SPICE

Question 18: You need to migrate data to AWS. It is estimated that the data transfer will take over a month via the current AWS Direct Connect connection your company has set up. Which AWS tool should you use?

A) Establish additional Direct Connect connections.

B) Use Data Pipeline to migrate the data in bulk to S3.

C) Use Kinesis Firehose to stream all new and existing data into S3.

D) Snowball

ANSWER18:
D

Notes/Hint18:

As a general rule, if it takes more than one week to upload your data to AWS using the spare capacity of your existing Internet connection, then you should consider using Snowball. For example, if you have a 100 Mb connection that you can solely dedicate to transferring your data and need to transfer 100 TB of data, it takes more than 100 days to complete a data transfer over that connection. You can make the same transfer by using multiple Snowballs in about a week.

Reference18: Snowball
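
The arithmetic behind this rule of thumb is easy to check. The sketch below assumes decimal units (1 TB = 10^12 bytes, 100 Mb/s = 10^8 bits per second); the 80% utilization figure is an assumed value, not something stated in the note.

```python
def transfer_days(data_bytes, link_bits_per_second, utilization=1.0):
    """Days needed to move data_bytes over a link at the given utilization."""
    seconds = (data_bytes * 8) / (link_bits_per_second * utilization)
    return seconds / 86_400


data = 100 * 10**12      # 100 TB
link = 100 * 10**6       # 100 Mb/s

print(round(transfer_days(data, link), 1))        # theoretical line rate
print(round(transfer_days(data, link, 0.8), 1))   # assumed 80% sustained throughput
```

At the full theoretical line rate the transfer takes roughly 93 days; at an assumed 80% sustained throughput it exceeds 100 days, which is consistent with the figure quoted above.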

Question 19: You currently have an on-premises Oracle database and have decided to leverage AWS and use Aurora. You need to do this as quickly as possible. How do you achieve this?

A) It is not possible to migrate an on-premises database to AWS at this time.

B) Use AWS Data Pipeline to create a target database, migrate the database schema, set up the data replication process, initiate the full load and a subsequent change data capture and apply, and conclude with a switchover of your production environment to the new database once the target database is caught up with the source database.

C) Use AWS Database Migration Services and create a target database, migrate the database schema, set up the data replication process, initiate the full load and a subsequent change data capture and apply, and conclude with a switch-over of your production environment to the new database once the target database is caught up with the source database.

D) Use AWS Glue to crawl the on-premises database schemas and then migrate them into AWS with Data Pipeline jobs.

https://aws.amazon.com/dms/faqs/

ANSWER19:
C

Notes/Hint19:

DMS can efficiently support this sort of migration using the steps outlined. While AWS Glue can help you crawl schemas and store metadata on them inside of Glue for later use, it isn’t the best tool for actually transitioning a database over to AWS itself. Similarly, while Data Pipeline is great for ETL and ELT jobs, it isn’t the best option to migrate a database over to AWS.

Reference19: DMS

2- Prepare for Your AWS Certification Exam

Question 20: A financial company uses Amazon EMR for its analytics workloads. During the company’s annual security audit, the security team determined that none of the EMR clusters’ root volumes are encrypted. The security team recommends the company encrypt its EMR clusters’ root volume as soon as possible.
Which solution would meet these requirements?

A) Enable at-rest encryption for EMR File System (EMRFS) data in Amazon S3 in a security configuration. Re-create the cluster using the newly created security configuration.

B) Specify local disk encryption in a security configuration. Re-create the cluster using the newly created security configuration.

C) Detach the Amazon EBS volumes from the master node. Encrypt the EBS volume and attach it back to the master node.

D) Re-create the EMR cluster with LZO encryption enabled on all volumes.

ANSWER20:
B

Notes/Hint20:

Local disk encryption can be enabled as part of a security configuration to encrypt root and storage volumes.

Reference20: EMR Cluster Local disk encryption

Question 21: A company has a clickstream analytics solution using Amazon Elasticsearch Service. The solution ingests 2 TB of data from Amazon Kinesis Data Firehose and stores the latest data collected within 24 hours in an Amazon ES cluster. The cluster is running on a single index that has 12 data nodes and 3 dedicated master nodes. The cluster is configured with 3,000 shards and each node has 3 TB of EBS storage attached. The Data Analyst noticed that the query performance of Elasticsearch is sluggish, and some intermittent errors are produced by the Kinesis Data Firehose when it tries to write to the index. Upon further investigation, there were occasional JVMMemoryPressure errors found in Amazon ES logs.

What should be done to improve the performance of the Amazon Elasticsearch Service cluster?

A) Improve the cluster performance by increasing the number of master nodes of Amazon Elasticsearch.

B) Improve the cluster performance by increasing the number of shards of the Amazon Elasticsearch index.

C) Improve the cluster performance by decreasing the number of data nodes of Amazon Elasticsearch.

D) Improve the cluster performance by decreasing the number of shards of the Amazon Elasticsearch index.

ANSWER21:
D

Notes/Hint21:

Amazon Elasticsearch Service (Amazon ES) is a managed service that makes it easy to deploy, operate, and scale Elasticsearch clusters in the AWS Cloud. Elasticsearch is a popular open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analysis. With Amazon ES, you get direct access to the Elasticsearch APIs; existing code and applications work seamlessly with the service.

Each Elasticsearch index is split into some number of shards. You should decide the shard count before indexing your first document. The overarching goal of choosing a number of shards is to distribute an index evenly across all data nodes in the cluster. However, these shards shouldn’t be too large or too numerous.

A good rule of thumb is to keep shard size between 10 and 50 GiB. Large shards can make it difficult for Elasticsearch to recover from failure, but because each shard uses some amount of CPU and memory, having too many small shards can cause performance issues and out-of-memory errors. In other words, shards should be small enough that the underlying Amazon ES instance can handle them, but not so small that they place needless strain on the hardware. Therefore the correct answer is: Improve the cluster performance by decreasing the number of shards of the Amazon Elasticsearch index.

Reference: Elasticsearch
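
Applying the 10–50 GiB rule of thumb to the numbers in the question shows why the shard count is the problem. This is a back-of-the-envelope sketch that treats 2 TB as 2,048 GiB and ignores replicas and indexing overhead, both of which are simplifying assumptions.

```python
import math


def shard_size_gib(index_gib, shard_count):
    """Average shard size if the index is spread evenly over shard_count shards."""
    return index_gib / shard_count


def shard_count_range(index_gib, min_shard_gib=10, max_shard_gib=50):
    """Fewest and most shards that keep each shard within the size guideline."""
    fewest = math.ceil(index_gib / max_shard_gib)
    most = max(fewest, math.floor(index_gib / min_shard_gib))
    return fewest, most


index_gib = 2 * 1024   # 2 TB of data, treated as 2,048 GiB

print(round(shard_size_gib(index_gib, 3000), 2))   # average size with 3,000 shards
print(shard_count_range(index_gib))                # shard counts within the guideline
```

With 3,000 shards each shard holds well under 1 GiB, far below the guideline, while somewhere between 41 and 204 shards would keep shards in the suggested range under these assumptions.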

Question 22: A data lake is a central repository that enables which operation?

A) Store unstructured data from a single data source

B) Store structured data from any data source

C) Store structured and unstructured data from any source

D) Store structured and unstructured data from a single source

ANSWER22:
C

Notes/Hint22:

A data lake is a centralized repository for large amounts of structured and unstructured data to enable direct analytics.

Reference: Data Lakes

Question 23: What is the most cost-effective storage option for your data lake?

A) Amazon EBS

B) Amazon S3

C) Amazon RDS

D) Amazon Redshift

ANSWER23:
B

Notes/Hint23:

Amazon S3

Reference: Data Lakes – S3

Question 24: Which services are used in the processing layer of a data lake architecture? (SELECT TWO)

A. AWS Snowball

B. AWS Glue

C. Amazon EMR

D. Amazon QuickSight

ANSWER24:
B and C

Notes/Hint24:

AWS Glue and Amazon EMR

Reference: Data Lakes – Glue and EMR

Question 25: Which services can be used for data ingestion into your data lake? (SELECT TWO)

A) Amazon Kinesis Data Firehose

B) Amazon QuickSight

C) Amazon Athena

D) AWS Storage Gateway

ANSWER25:
A and D

Notes/Hint25:

Amazon Kinesis Data Firehose and AWS Storage Gateway

Reference: Data Lakes

Question 26: Which service uses continuous data replication with high availability to consolidate databases into a petabyte-scale data warehouse by streaming data to Amazon Redshift and Amazon S3?

A) AWS Storage Gateway

B) AWS Schema Conversion Tool

C) AWS Database Migration Service

D) Amazon Kinesis Data Firehose

ANSWER26:
C

Notes/Hint26:

AWS Database Migration Service

Reference: Data Lakes

Question 27: What is the AWS Glue Data Catalog?

A) A fully managed ETL (extract, transform, and load) pipeline service

B) A service to schedule jobs

C) A visual data preparation tool

D) An index to the location, schema, and runtime metrics of your data

ANSWER27:
D

Notes/Hint27:

An index to the location, schema, and runtime metrics of your data

Reference: Data Lakes

Question 28: What AWS Glue feature “catalogs” your data?

A) AWS Glue crawler

B) AWS Glue DataBrew

C) AWS Glue Studio

D) AWS Glue Elastic Views

ANSWER28:
A

Notes/Hint28:

AWS Glue crawler

Reference: Data Lakes

Question 29: During your data preparation stage, the raw data has been enriched to support additional insights. You need to improve query performance and reduce costs of the final analytics solution.

Which data formats meet these requirements? (SELECT TWO)

ANSWER29:
C and D

Notes/Hint29:

Apache Parquet and Apache ORC

Reference: Data Lakes

Question 30: Your small start-up company is developing a data analytics solution. You need to clean and normalize large datasets, but you do not have developers with the skill set to write custom scripts. Which tool will help efficiently design and run the data preparation activities?

ANSWER30:
B

Notes/Hint30:

AWS Glue DataBrew

To be able to run analytics, build reports, or apply machine learning, you need to be sure the data you’re using is clean and in the right format. This data preparation step requires data analysts and data scientists to write custom code and perform many manual activities. When cleaning and normalizing data, it is helpful to first review the dataset to understand which possible values are present. Simple visualizations are helpful for determining whether correlations exist between the columns.

AWS Glue DataBrew is a visual data preparation tool that helps you clean and normalize data up to 80% faster so you can focus more on the business value you can get. DataBrew provides a visual interface that quickly connects to your data stored in Amazon S3, Amazon Redshift, Amazon Relational Database Service (RDS), any JDBC-accessible data store, or data indexed by the AWS Glue Data Catalog. You can then explore the data, look for patterns, and apply transformations. For example, you can apply joins and pivots, merge different datasets, or use functions to manipulate data.

Reference: Data Lakes

Question 30: In which scenario would you use AWS Glue jobs?

A) Analyze data in real-time as data comes into the data lake

B) Transform data in real-time as data comes into the data lake

C) Analyze data in batches on schedule or on demand

D) Transform data in batches on schedule or on demand.

ANSWER30:
D

Notes/Hint30:

An AWS Glue job encapsulates a script that connects to your source data, processes it, and then writes it out to your data target. Typically, a job runs extract, transform, and load (ETL) scripts. Jobs can also run general-purpose Python scripts (Python shell jobs). AWS Glue triggers can start jobs based on a schedule or event, or on demand. You can monitor job runs to understand runtime metrics such as completion status, duration, and start time.

Question 31: Your data resides in multiple data stores, including Amazon S3, Amazon RDS, and Amazon DynamoDB. You need to efficiently query the combined datasets.

Which tool can achieve this, using a single query, without moving data?

A) Amazon Athena Federated Query

B) Amazon Redshift Query Editor

C) SQL Workbench

D) AWS Glue DataBrew

ANSWER31:
A

Notes/Hint31:

With Amazon Athena Federated Query, you can run SQL queries across a variety of relational, non-relational, and custom data sources. You get a unified way to run SQL queries across various data stores.

Athena uses data source connectors that run on AWS Lambda to run federated queries. A data source connector is a piece of code that can translate between your target data source and Athena. You can think of a connector as an extension of Athena’s query engine. Pre-built Athena data source connectors exist for data sources like Amazon CloudWatch Logs, Amazon DynamoDB, Amazon DocumentDB, Amazon RDS, and JDBC-compliant relational data sources such as MySQL and PostgreSQL under the Apache 2.0 license. You can also use the Athena Query Federation SDK to write custom connectors. To choose, configure, and deploy a data source connector to your account, you can use the Athena and Lambda consoles or the AWS Serverless Application Repository. After you deploy data source connectors, the connector is associated with a catalog that you can specify in SQL queries. You can combine SQL statements from multiple catalogs and span multiple data sources with a single query.

Question 32: Which benefit do you achieve by using AWS Lake Formation to build data lakes?

A) Build data lakes quickly

B) Simplify security management

C) Provide self-service access to data

D) All of the above

ANSWER32:
D

Notes/Hint32:

Build data lakes quickly

With Lake Formation, you can move, store, catalog, and clean your data faster. You simply point Lake Formation at your data sources, and Lake Formation crawls those sources and moves the data into your new Amazon S3 data lake. Lake Formation organizes data in S3 around frequently used query terms and into right-sized chunks to increase efficiency. Lake Formation also changes data into formats like Apache Parquet and ORC for faster analytics. In addition, Lake Formation has built-in machine learning to deduplicate and find matching records (two entries that refer to the same thing) to increase data quality.

Simplify security management

You can use Lake Formation to centrally define security, governance, and auditing policies in one place, versus doing these tasks per service. You can then enforce those policies for your users across their analytics applications. Your policies are consistently implemented, eliminating the need to manually configure them across security services like AWS Identity and Access Management (AWS IAM) and AWS Key Management Service (AWS KMS), storage services like Amazon S3, and analytics and machine learning services like Amazon Redshift, Amazon Athena, and (in beta) Amazon EMR for Apache Spark. This reduces the effort in configuring policies across services and provides consistent enforcement and compliance.

Provide self-service access to data

With Lake Formation, you build a data catalog that describes the different available datasets along with which groups of users have access to each. This makes your users more productive by helping them find the right dataset to analyze. By providing a catalog of your data with consistent security enforcement, Lake Formation makes it easier for your analysts and data scientists to use their preferred analytics service. They can use Amazon EMR for Apache Spark (in beta), Amazon Redshift, or Amazon Athena on diverse datasets that are now housed in a single data lake. Users can also combine these services without having to move data between silos.

Question 33: What are the three stages to set up a data lake using AWS Lake Formation? (SELECT THREE)

A) Register the storage location

B) Create a database

C) Populate the database

D) Grant permissions

ANSWER33:
A B and D

Notes/Hint33:

Register the storage location

Lake Formation manages access to designated storage locations within Amazon S3. Register the storage locations that you want to be part of the data lake.

Create a database

Lake Formation organizes data into a catalog of logical databases and tables. Create one or more databases and then automatically generate tables during data ingestion for common workflows.

Grant permissions

Lake Formation manages access for IAM users, roles, and Active Directory users and groups via flexible database, table, and column permissions. Grant permissions to one or more resources for your selected users.
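
The three stages can be sketched with boto3-style calls. The function below takes the Lake Formation and Glue clients as parameters (for example `boto3.client("lakeformation")` and `boto3.client("glue")`); the bucket ARN, database name, principal ARN, and the choice of the `DESCRIBE` permission are hypothetical values chosen for illustration.

```python
def set_up_data_lake(lakeformation, glue, bucket_arn, database, principal_arn):
    """Walk through the three setup stages for a Lake Formation data lake."""
    # 1. Register the storage location that should be part of the data lake.
    lakeformation.register_resource(
        ResourceArn=bucket_arn,
        UseServiceLinkedRole=True,
    )

    # 2. Create a logical database in the Data Catalog.
    glue.create_database(DatabaseInput={"Name": database})

    # 3. Grant permissions on the database to the selected principal.
    lakeformation.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": principal_arn},
        Resource={"Database": {"Name": database}},
        Permissions=["DESCRIBE"],
    )
```

Tables, and table-level or column-level grants, would typically follow once data has been ingested into the registered location.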

Question 34: Which of the following AWS Lake Formation tasks are performed by the AWS Glue service? (SELECT THREE)

A) ETL code creation and job monitoring

B) Blueprints to create workflows

C) Data catalog and serverless architecture

D) Simplify security management

ANSWER34:
A B and C

Notes/Hint34:

Lake Formation leverages a shared infrastructure with AWS Glue, including console controls, ETL code creation and job monitoring, blueprints to create workflows for data ingest, the same data catalog, and a serverless architecture. While AWS Glue focuses on these types of functions, Lake Formation encompasses all AWS Glue features AND provides additional capabilities designed to help build, secure, and manage a data lake. See the AWS Glue features page for more details.

Question 35: A digital media customer needs to quickly build a data lake solution for the data housed in a PostgreSQL database. As a solutions architect, what service and feature would meet this requirement?

A) Copy PostgreSQL data to an Amazon S3 bucket and build a data lake using AWS Lake Formation

B) Use AWS Lake Formation blueprints

C) Build a data lake manually

D) Build an analytics solution by directly accessing the database.

ANSWER35:
B

Notes/Hint35:

A blueprint is a data management template that enables you to easily ingest data into a data lake. Lake Formation provides several blueprints, each for a predefined source type, such as a relational database or AWS CloudTrail logs. From a blueprint, you can create a workflow. Workflows consist of AWS Glue crawlers, jobs, and triggers that are generated to orchestrate the loading and update of data. Blueprints take the data source, data target, and schedule as input to configure the workflow.

Question 36: AWS Lake Formation has a set of suggested personas and IAM permissions. Which is a required persona?

A) Data lake administrator

B) Data engineer

C) Data analyst

D) Business analyst

ANSWER36:
A

Notes/Hint36:

Data lake administrator (Required)

A user who can register Amazon S3 locations, access the Data Catalog, create databases, create and run workflows, grant Lake Formation permissions to other users, and view AWS CloudTrail logs. The user has fewer IAM permissions than the IAM administrator but enough to administer the data lake. This user cannot add other data lake administrators.

Data engineer (Optional) A user who can create and run crawlers and workflows and grant Lake Formation permissions on the Data Catalog tables that the crawlers and workflows create.

Data analyst (Optional) A user who can run queries against the data lake using, for example, Amazon Athena. The user has only enough permissions to run queries.

Business analyst (Optional) Generally, an end-user application specific persona that would query data and resource using a workflow role.

Question 37: Which three types of blueprints does AWS Lake Formation support? (SELECT THREE)

A) ETL code creation and job monitoring

B) Database snapshot

C) Incremental database

D) Log file sources (AWS CloudTrail, ELB/ALB logs)

ANSWER37:
B C and D

Notes/Hint37:

AWS Lake Formation blueprints simplify and automate creating workflows. Lake Formation provides the following types of blueprints:

• Database snapshot – Loads or reloads data from all tables into the data lake from a JDBC source. You can exclude some data from the source based on an exclude pattern.

• Incremental database – Loads only new data into the data lake from a JDBC source, based on previously set bookmarks. You specify the individual tables in the JDBC source database to include. For each table, you choose the bookmark columns and bookmark sort order to keep track of data that has previously been loaded. The first time that you run an incremental database blueprint against a set of tables, the workflow loads all data from the tables and sets bookmarks for the next incremental database blueprint run. You can therefore use an incremental database blueprint instead of the database snapshot blueprint to load all data, provided that you specify each table in the data source as a parameter.

• Log file – Bulk loads data from log file sources, including AWS CloudTrail, Elastic Load Balancing logs, and Application Load Balancer logs.
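
The bookmark behavior described for the incremental database blueprint can be illustrated with a small in-memory sketch. This is not Lake Formation code; `incremental_load` is a hypothetical helper that mimics the idea with an ascending bookmark column.

```python
def incremental_load(rows, bookmark_column, bookmark=None):
    """Return rows newer than the bookmark, plus the bookmark for the next run."""
    # Ascending sort order on the bookmark column is assumed.
    ordered = sorted(rows, key=lambda row: row[bookmark_column])
    new_rows = [
        row for row in ordered
        if bookmark is None or row[bookmark_column] > bookmark
    ]
    # If nothing new arrived, the bookmark stays where it was.
    next_bookmark = new_rows[-1][bookmark_column] if new_rows else bookmark
    return new_rows, next_bookmark


table = [{"id": 1}, {"id": 2}, {"id": 3}]
loaded, mark = incremental_load(table, "id")          # first run loads everything
table += [{"id": 4}, {"id": 5}]
loaded, mark = incremental_load(table, "id", mark)    # later runs load only new rows
```

The first call has no bookmark and therefore behaves like a full snapshot, which matches the note that an incremental blueprint can stand in for a snapshot blueprint on its first run.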

Question 38: Which one of the following is the best description of the capabilities of Amazon QuickSight?

A) Automated configuration service built on AWS Glue

B) Fast, serverless, business intelligence service

C) Fast, simple, cost-effective data warehousing

D) Simple, scalable, and serverless data integration

ANSWER38:
B

Notes/Hint38:

B. Scalable, serverless business intelligence service is the correct choice.

See the brief descriptions of several AWS Analytics services below:

AWS Lake Formation Build a secure data lake in days using Glue blueprints and workflows

Amazon QuickSight Scalable, serverless, embeddable, ML-powered BI Service built for the cloud

Amazon Redshift Analyze all of your data with the fastest and most widely used cloud data warehouse

AWS Glue Simple, scalable, and serverless data integration

Question 39: Which benefits are provided by Amazon Redshift? (Select TWO)

A) Analyze Data stored in your data lake

B) Maintain performance at scale

C) Focus effort on Data warehouse administration

D) Store all the data to meet analytics need

E) Amazon Redshift includes enterprise-level security and compliance features.

ANSWER39:
A and B

Notes/Hint39:

• A is correct – With Amazon Redshift, you can analyze all your data, including exabytes of data stored in your Amazon S3 data lake.

• B is correct – Amazon Redshift provides consistent performance at scale.

• C is incorrect – Amazon Redshift is a fully managed data warehouse solution. It includes automations to reduce the administrative overhead traditionally associated with data warehouses. When using Amazon Redshift, you can focus your development effort on strategic data analytics solutions.

• D is incorrect – With Amazon Redshift features—such as Amazon Redshift Spectrum, materialized views, and federated query—you can analyze data where it is stored in your data lake or AWS databases. This capability provides flexibility to meet new analytics requirements without the cost, time, or complexity of moving large volumes of data between solutions.

• Answer E is incorrect – Amazon Redshift includes enterprise-level security and compliance features.

[/bg_collapse]

Clever Questions, Answers, Resources about:

Data Sciences
Big Data
Data Analytics
Databases
Data Streams
Large DataSets

What Is a Data Scientist?

Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician. – Josh Wills
Data scientists apply sophisticated quantitative and computer science skills to both structure and analyze massive stores or continuous streams of unstructured data, with the intent to derive insights and prescribe action. – Burtch Works Data Science Salary Survey, May 2018
More than anything, what data scientists do is make discoveries while swimming in data… In a competitive landscape where challenges keep changing and data never stop flowing, data scientists help decision makers shift from ad hoc analysis to an ongoing conversation with data. – Data Scientist: The Sexiest Job of the 21st Century, Harvard Business Review

Do All Data Scientists Hold Graduate Degrees?

Data scientists are highly educated. With exceedingly rare exception, every data scientist holds at least an undergraduate degree. 91% of data scientists in 2018 held advanced degrees. The remaining 9% all held undergraduate degrees. Furthermore,

25% of data scientists hold a degree in statistics or mathematics,
20% have a computer science degree,
an additional 20% hold a degree in the natural sciences, and
18% hold an engineering degree.

The remaining 17% of surveyed data scientists held degrees in business, social science, or economics.

How Are Data Scientists Different From Data Analysts?

Broadly speaking, the roles differ in scope: data analysts build reports with narrow, well-defined KPIs. Data scientists often work on broader business problems without clear solutions. Data scientists live on the edge of the known and unknown.

We’ll leave you with a concrete example: A data analyst cares about profit margins. A data scientist at the same company cares about market share.

How Is Data Science Used in Medicine?

Data science in healthcare best translates to biostatistics. It can be quite different from data science in other industries as it usually focuses on small samples with several confounding variables.

How Is Data Science Used in Manufacturing?

Data science in manufacturing is vast; it includes everything from supply chain optimization to the assembly line.

What are data scientists paid?

Most people are attracted to data science for the salary. It’s true that data scientists garner high salaries compared to their peers. There is data to support this: in the May 2018 edition of the Burtch Works Data Science Salary Survey, annual salary statistics were

Note the above numbers do not reflect total compensation which often includes standard benefits and may include company ownership at high levels.

How will data science evolve in the next 5 years?

Will AI replace data scientists?

What is the workday like for a data scientist?

It’s common for data scientists across the US to work 40 hours weekly. While company culture does dictate different levels of work-life balance, it’s rare to see data scientists who work more than they want. That’s the virtue of being an expensive resource in a competitive job market.

How do I become a Data Scientist?

The roadmap given to aspiring data scientists can be boiled down to three steps:

Earning an undergraduate and/or advanced degree in computer science, statistics, or mathematics,
Building their portfolio of SQL, Python, and R skills, and
Getting related work experience through technical internships.

All three require a significant time and financial commitment.

There used to be a saying around data science: the road into data science starts with two years of university-level math.

What Should I Learn? What Order Do I Learn Them?

This answer assumes your academic background ends with a HS diploma in the US.

Python
Differential Calculus
Integral Calculus
Multivariable Calculus
Linear Algebra
Probability
Statistics

Some follow up questions and answers:

Why Python first?

Python is a general purpose language. R is used primarily by statisticians. In the likely scenario that you decide data science requires too much time, effort, and money, Python will be more valuable than your R skills. It’s preparing you to fail, sure, but in the same way a savings account is preparing you to fail.

When do I start working with data?

You’ll start working with data when you’ve learned enough Python to do so. Whether you’ll have the tools to have any fun is a much more open-ended question.

How long will this take me?

Assuming self-study and average intelligence, 3-5 years from start to finish.

How Do I Learn Python?

If you don’t know the first thing about programming, start with MIT’s course in the curated list.

These modules are the standard tools for data analysis in Python:

pandas (and by extension, numpy): Check out Minimally Sufficient Pandas for style guides and best practices.
matplotlib and seaborn: See /u/rhiever’s response to How do you decide between the plotting libraries: Matplotlib, Seaborn, Bokeh? Don’t worry about bokeh or dash unless you have a personal interest in interactive visualizations.
scipy and scikit-learn: Internalize the .fit() and .predict() pattern.
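To make the .fit() and .predict() pattern concrete without installing anything, here is a toy estimator written in plain Python. The class `MeanRegressor` is hypothetical and is not part of scikit-learn; it only mimics the convention scikit-learn estimators follow, where .fit() learns from training data and returns the object itself, and .predict() applies what was learned to new inputs.

```python
class MeanRegressor:
    """Toy estimator (hypothetical) that always predicts the training mean."""

    def fit(self, X, y):
        # Learned state is stored on the instance; scikit-learn marks such
        # attributes with a trailing underscore.
        self.mean_ = sum(y) / len(y)
        # Returning self allows chaining: MeanRegressor().fit(X, y).predict(...)
        return self

    def predict(self, X):
        # One prediction per input row, ignoring the features entirely.
        return [self.mean_ for _ in X]


X_train = [[1], [2], [3]]
y_train = [10.0, 20.0, 30.0]

model = MeanRegressor().fit(X_train, y_train)
predictions = model.predict([[4], [5]])
```

Real scikit-learn estimators are used the same way: construct the object, call .fit() on training data, then call .predict() on unseen data.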

Curated Threads & Resources

MIT’s Introduction to Computer Science and Programming in Python A free, archived course taught at MIT in the fall 2016 semester.
Data Scientist with Python Career Track | DataCamp The first courses are free, but unlimited access costs $29/month. Users usually report a positive experience, and it’s one of the better hands-on ways to learn Python.
Sentdex’s (Harrison Kinsley) Youtube Channel Related to Python Programming Tutorials
/r/learnpython is an active sub and very useful for learning the basics.

How Do I Learn R?

If you don’t know the first thing about programming, start with R for Data Science in the curated list.

Curated Threads & Resources

R for Data Science by Hadley Wickham A free ebook full of succinct code examples. Terrific for learning tidyverse syntax. Folks with some math background may prefer the free alternative, Introduction to Statistical Learning.
Data Scientist with R Career Track | DataCamp The first courses are free, but unlimited access costs $29/month. Users usually report a positive experience, and it’s one of the few hands-on ways to learn R.
R Inferno Learners with a CS background will appreciate this free handbook explaining how and why R behaves the way that it does.

How Do I Learn SQL?

Prioritize the basics of SQL, i.e., when to use functions like POW, SUM, and RANK, and the computational complexity of the different kinds of joins.

Concepts like relational algebra, when to use clustered/non-clustered indexes, etc. are useful, but (almost) never come up in interviews.

You absolutely do not need to understand administrative concepts like managing permissions.

Finally, there are numerous query engines and therefore numerous dialects of SQL. Use whichever dialect is supported in your chosen resource. There’s not much difference between them, so it’s easy to learn another dialect after you’ve learned one.
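As a small practice sandbox for those basics, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The tables `customers` and `orders` and their contents are invented for illustration, and the SQLite dialect is just one of the many mentioned above; RANK() needs SQLite 3.25 or newer, which recent Python builds normally bundle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 25.0), (3, 2, 100.0);
""")

# Aggregate with SUM over a LEFT JOIN so customers without orders still appear.
totals = conn.execute("""
SELECT c.name, COALESCE(SUM(o.amount), 0) AS total
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
GROUP BY c.id, c.name
ORDER BY total DESC
""").fetchall()

# Window function: rank individual orders by amount, largest first.
ranked = conn.execute("""
SELECT customer_id, amount,
       RANK() OVER (ORDER BY amount DESC) AS amount_rank
FROM orders
ORDER BY amount_rank
""").fetchall()
```

Swapping LEFT JOIN for an inner JOIN drops the customer with no orders, which is a quick way to see how the join types differ.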

Curated Threads & Resources

The SQL Tutorial for Data Analysis | Mode.com
Introduction to Databases A Free MOOC supported by Stanford University.
SQL Queries for Mere Mortals A $30 book highly recommended by /u/karmanujan

How Do I Learn Calculus?

Fortunately (or unfortunately), calculus is the lament of many students, and so resources for it are plentiful. Khan Academy mimics lectures very well, and Paul’s Online Math Notes are a terrific reference full of practice problems and solutions.

Calculus, however, is not just calculus. For those unfamiliar with US terminology,

Calculus I is differential calculus.
Calculus II is integral calculus.
Calculus III is multivariable calculus.
Calculus IV is differential equations.

Differential and integral calculus are both necessary for probability and statistics, and should be completed first.

Multivariable calculus can be paired with linear algebra, but is also required.

Differential equations is where consensus falls apart. The short of it is, they’re all but necessary for mathematical modeling, but not everyone does mathematical modeling. It’s another tool in the toolbox.

Curated Threads & Resources about Data Science and Data Analytics

Khan Academy: Differential Calculus, Integral Calculus, Multivariable Calculus, Differential Equations
Paul’s Online Math Notes: Differential Calculus, Integral Calculus, Multivariable Calculus

How Do I Learn Probability?

Probability is not friendly to beginners. Definitions are rooted in higher mathematics, notation varies from source to source, and solutions are frequently unintuitive. Probability may present the biggest barrier to entry in data science.

It’s best to pick a single primary source and a community for help. If you can spend the money, register for a university or community college course and attend in person.

The best free resource is MIT’s 18.05 Introduction to Probability and Statistics (Spring 2014). Leverage /r/learnmath, /r/learnmachinelearning, and /r/AskStatistics when you inevitably get stuck.

How Do I Learn Linear Algebra?

Curated Threads & Resources https://www.youtube.com/watch?v=fNk_zzaMoSs&index=1&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab

What does the typical data science interview process look like?

For general advice, Mastering the DS Interview Loop is a terrific article. The community discussed the article here.

Briefly summarized, most companies follow a five stage process:

Coding Challenge: Most common at software companies and roles contributing to a digital product.
HR Screen
Technical Screen: Often in the form of a project. Less frequently, it takes the form of a whiteboarding session at the onsite.
Onsite: Usually the project from the technical screen is presented here, followed by a meeting with the director overseeing the team you’ll join.
Negotiation & Offer

Mastering the DS Interview Loop

Preparation:

Practice questions on Leetcode, which has both SQL and traditional data structures/algorithm questions.
Review Brilliant for math and statistics questions.
SQL Zoo and Mode Analytics both offer various SQL exercises you can solve in your browser.

Tips:

Before you start coding, read through all the questions. This allows your unconscious mind to start working on problems in the background.
Start with the hardest problem first; when you hit a snag, move to a simpler problem before returning to the harder one.
Focus on passing all the test cases first, then worry about improving complexity and readability.
If you’re done and have a few minutes left, go get a drink and try to clear your head. Read through your solutions one last time, then submit.
It’s okay to not finish a coding challenge. Sometimes companies will create unreasonably tedious coding challenges with one-week time limits that require 5–10 hours to complete. Unless you’re desperate, you can always walk away and spend your time preparing for the next interview.

Remember, interviewing is a skill that can be learned, just like anything else. Hopefully, this article has given you some insight on what to expect in a data science interview loop.

The process also isn’t perfect and there will be times that you fail to impress an interviewer because you don’t possess some obscure piece of knowledge. However, with repeated persistence and adequate preparation, you’ll be able to land a data science job in no time!

Real-life enterprise databases are orders of magnitude more complex than the “customers, products, orders” examples used as teaching tools. SQL as a language is actually, IMO, relatively simple (the database administration component can get complex, but mostly data scientists aren’t doing that anyways). SQL is an incredibly important skill, though, for any DS role. I think when people emphasize SQL, what they really are talking about is the ability to write queries that interrogate the data and discover the nuances behind how it is collected and/or manipulated by an application before it is written to the database. For example, is the employee’s phone number their current phone number, or does the database store a history of all previous phone numbers? These are critically important questions for understanding the nature of your data, and they don’t necessarily deal with statistics! The level of syntax required to do this is not that sophisticated; you can get pretty damn far with knowledge of all the joins, group by/analytical functions, filtering, and nesting queries. In many cases, the data is too large to just select * and dump into a CSV to load into pandas, so you start with SQL against the source. In my mind, it’s more important for “SQL skills” to know how to generate hypotheses (that will build up to answering your business question) that can be investigated via a query than it is to be a master of SQL’s syntax. Just my two cents though!
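The phone number question above can be turned into a query. The following is a minimal sketch using Python's built-in sqlite3 module; the table `employee_phone`, its columns, and its rows are all hypothetical. The first query tests the hypothesis "this table stores history" by looking for employees with more than one row; the second recovers the current number per employee once that is confirmed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee_phone (employee_id INTEGER, phone TEXT, updated_at TEXT);
INSERT INTO employee_phone VALUES
  (1, '555-0100', '2020-01-01'),
  (1, '555-0199', '2021-06-01'),
  (2, '555-0142', '2019-03-15');
""")

# Hypothesis check: any employee with several rows means history is kept.
employees_with_history = conn.execute("""
SELECT employee_id, COUNT(*) AS n_numbers
FROM employee_phone
GROUP BY employee_id
HAVING COUNT(*) > 1
""").fetchall()

# If history is kept, the "current" number is the most recently updated row.
current_numbers = conn.execute("""
SELECT p.employee_id, p.phone
FROM employee_phone p
JOIN (SELECT employee_id, MAX(updated_at) AS latest
      FROM employee_phone
      GROUP BY employee_id) m
  ON m.employee_id = p.employee_id AND m.latest = p.updated_at
ORDER BY p.employee_id
""").fetchall()
```

Nothing here goes beyond joins, GROUP BY, filtering, and a nested query, which is the point the comment makes about syntax.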

AWS Data Analytics Specialty Certification DAS-C01 Exam Prep on iOS

Data Visualization example: 12,000 Years of Human Population Dynamics

[OC] 12,000 years of human population dynamics from dataisbeautiful

Human population density estimates based on the Hyde 3.2 model.

Data visualization example: AirPods Revenue vs. Top Tech Companies

[OC] AirPods Revenue vs. Top Tech Companies from dataisbeautiful

Source: 24/7 Kevin Rooke, Google Search, SEC Edgar

Data visualization example: Crypto race: DOGE (red) versus BTC (blue), 5/6/2020 – 5/5/2021

[OC] Crypto race: DOGE (red) versus BTC (blue), 5/6/2020 – 5/5/2021 from dataisbeautiful

Data sources: Coindesk-Bitcoin ; Coindesk-Dogecoin

Data visualization example: How have cryptocurrencies done during the Pandemic?

[OC] How have cryptocurrencies done during the Pandemic? from dataisbeautiful

Data source: Performance data on these cryptocurrencies from Investing.com which provides free historic data

Data visualization example: Countries with the Most Nuclear Warheads

[OC] Countries with the Most Nuclear Warheads from dataisbeautiful

Data Source: Here

For more information about analytics architecture, visit the AWS Big Data Blog: AWS serverless data analytics pipeline reference architecture here

Capitol insurrection arrests per million people by state

[OC] Capitol insurrection arrests per million people by state from dataisbeautiful

Data Source: Made in Google Sheets using data from this USA Today article (for the number of arrests by arrestee’s home state) and this spreadsheet of the results of the 2020 Census (for the population of each state and DC in 2020, which was used as the denominator in calculating arrests/million people).

Basic Data Lake Architecture

Data Analytics Architecture on AWS

Data Analytics Process

Data Lake Storage:

Event Driven Data Analytics Workflow on AWS

What is a Data Lake?

What is a Data Warehouse?

What are benefits of a data warehouse?

• Informed decision making

• Consolidated data from many sources

• Historical data analysis

• Data quality, consistency, and accuracy

• Separation of analytics processing from transactional databases

Data Lake vs Data Warehouse – Comparison

A data warehouse is specially designed for data analytics, which identifies relationships and trends across large amounts of data. A database is used to capture and store data, such as the details of a transaction. Unlike a data warehouse, a data lake is a centralized repository for structured, semi-structured, and unstructured data. A data warehouse organizes data in a tabular format (or schema) that enables SQL queries on the data. But not all applications require data to be in tabular format. Some applications can access data in the data lake even if it is “semi-structured” or unstructured. These include big data analytics, full-text search, and machine learning.

An AWS data lake only has a storage charge for the data. No servers are necessary for the data to be stored and accessed. With Amazon Athena, there are also no servers to manage; you pay per query, based on the amount of data scanned. Data warehouses enable fast queries of structured data from transactional systems for batch reports, business intelligence, and visualization use cases. A data lake stores data without regard to its structure. Data scientists, data analysts, and business analysts use the data lake, which supports use cases such as machine learning, predictive analytics, and data discovery and profiling.

Transactional Data Ingestion

Structured Query Language (SQL)

Data definition language (DDL) refers to the subset of SQL commands that define data structures and objects such as databases, tables, and views. DDL commands include the following:

• CREATE: used to create a new object.

• DROP: used to delete an object.

• ALTER: used to modify an object.

• RENAME: used to rename an object.

• TRUNCATE: used to remove all rows from a table without deleting the table itself.
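The DDL commands above can be tried out with Python's built-in sqlite3 module. This is a sketch in the SQLite dialect, with a made-up `invoices` table; note that SQLite has no TRUNCATE statement and expresses renaming through ALTER TABLE ... RENAME TO, so other engines will differ in those details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CREATE: define a new table.
conn.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY, total REAL)")

# ALTER: modify the table by adding a column.
conn.execute("ALTER TABLE invoices ADD COLUMN currency TEXT")

# RENAME: in SQLite this is a form of ALTER TABLE.
conn.execute("ALTER TABLE invoices RENAME TO invoice_archive")

# Inspect the resulting structure (column name is the second field).
columns = [row[1] for row in conn.execute("PRAGMA table_info(invoice_archive)")]

# DROP: delete the table itself, not just its rows.
conn.execute("DROP TABLE invoice_archive")
remaining_tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
```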

Data manipulation language (DML) refers to the subset of SQL commands that are used to work with data. DML commands include the following:

• SELECT: used to request records from one or more tables.

• INSERT: used to insert one or more records into a table.

• UPDATE: used to modify the data of one or more records in a table.

• DELETE: used to delete one or more records from a table.

• EXPLAIN: used to analyze and display the expected execution plan of a SQL statement.

• LOCK: used to lock a table from write operations (INSERT, UPDATE, DELETE) and prevent concurrent operations from conflicting with one another.
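A short sketch of the core DML commands, again using Python's built-in sqlite3 module and a hypothetical `stock` table. SQLite spells EXPLAIN as EXPLAIN QUERY PLAN for a readable plan and has no LOCK statement, so that command is left out here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock (sku TEXT PRIMARY KEY, qty INTEGER)")

# INSERT: add several records at once using parameter placeholders.
conn.executemany(
    "INSERT INTO stock VALUES (?, ?)",
    [("A1", 10), ("B2", 0), ("C3", 5)],
)

# UPDATE: modify one record.
conn.execute("UPDATE stock SET qty = qty - 3 WHERE sku = ?", ("A1",))

# DELETE: remove records that match a condition.
conn.execute("DELETE FROM stock WHERE qty = 0")

# SELECT: read back what is left.
stock_rows = conn.execute("SELECT sku, qty FROM stock ORDER BY sku").fetchall()

# EXPLAIN: ask the engine how it plans to run a statement.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT qty FROM stock WHERE sku = 'A1'"
).fetchall()
```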

Data control language (DCL) refers to the subset of SQL commands that are used to configure permissions to objects. DCL commands include:

• GRANT: used to grant access and permissions to a database or object in a database, such as a schema or table.

• REVOKE: used to remove access and permissions from a database or objects in a database.

Comparison of OLTP and OLAP

What is Amazon Macie?

Businesses are responsible for identifying and limiting disclosure of sensitive data such as personally identifiable information (PII) or proprietary information. Identifying and masking sensitive information is time-consuming, and becomes more complex in data lakes with various data sources and formats and broad user access to published data sets.

Amazon Macie is a fully managed data security and privacy service that uses machine learning and pattern matching to discover sensitive data in AWS. Macie includes a set of managed data identifiers which automatically detect common types of sensitive data. Examples of managed data identifiers include keywords, credentials, financial information, health information, and PII. You can also configure custom data identifiers using keywords or regular expressions to highlight organizational proprietary data, intellectual property, and other specific scenarios. You can develop security controls that operate at scale to monitor and remediate risk automatically when Macie detects sensitive data. For example, you can use AWS Lambda functions to automatically turn on encryption for an Amazon S3 bucket where Macie detects sensitive data, or automatically tag datasets containing sensitive data for inclusion in orchestrated data transformations or audit reports.
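To give a feel for what a keyword plus regular expression rule does, here is a plain Python sketch. It does not call Macie and is not Macie's API; the ID format `EMP-` followed by six digits, the keyword list, and the function `find_sensitive` are all assumptions invented for this illustration, and the keyword check here is simplified to "anywhere in the text".

```python
import re

# Hypothetical in-house ID format: "EMP-" followed by exactly six digits.
EMPLOYEE_ID = re.compile(r"\bEMP-\d{6}\b")

# Hypothetical keywords that must accompany a match for it to count.
KEYWORDS = ("employee id", "staff number")


def find_sensitive(text):
    """Return regex matches only when one of the keywords is also present."""
    lowered = text.lower()
    if not any(keyword in lowered for keyword in KEYWORDS):
        return []
    return EMPLOYEE_ID.findall(text)


hits = find_sensitive("Employee ID: EMP-004211 approved the request.")
no_hits = find_sensitive("Ticket EMP-004211 closed.")
```

Requiring a keyword alongside the pattern is a common way to cut false positives, since a bare pattern can match unrelated strings such as ticket numbers.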

Amazon Macie can be integrated into the data ingestion and processing steps of your data pipeline. This approach avoids inadvertent disclosures in published data sets by detecting and addressing the sensitive data as it is ingested and processed. Building the automated detection and processing of sensitive data into your ETL pipelines simplifies and standardizes handling of sensitive data at scale.

What is AWS Glue DataBrew?

AWS Glue DataBrew is a visual data preparation tool that simplifies cleaning and normalizing datasets in preparation for use in analytics and machine learning.

• Profile data quality, identifying patterns and automatically detecting anomalies.

• Clean and normalize data using over 250 pre-built transformations, without writing code.

• Visually map the lineage of your data to understand data sources and transformation history.

• Save data cleaning and normalization workflows for automatic application to new data.

Data processed in AWS Glue DataBrew is immediately available for use in analytics and machine learning projects.

Learn more about the built-in transformations available in AWS Glue DataBrew in the Recipe actions reference: https://docs.aws.amazon.com/databrew/latest/dg/recipe-actions-reference.html

What is AWS Glue?

AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores and data streams. AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue can run your ETL jobs as new data arrives. For example, you can use an AWS Lambda function to trigger your ETL jobs to run as soon as new data becomes available in Amazon S3. You can also register this new dataset in the AWS Glue Data Catalog as part of your ETL jobs.

AWS Glue is serverless, so there’s no infrastructure to set up or manage.

AWS Glue Data Catalog

The AWS Glue Data Catalog provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos, and use that metadata to query and transform the data. Once the data is cataloged, it is immediately available for search and query using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.

You can use AWS Identity and Access Management (IAM) policies to control access to the data sources managed by the AWS Glue Data Catalog. The Data Catalog also provides comprehensive audit and governance capabilities, with schema-change tracking and data access controls.

AWS Glue crawler

AWS Glue crawlers can scan data in all kinds of repositories, classify it, extract schema information from it, and store the metadata automatically in the AWS Glue Data Catalog.

AWS Glue ETL

AWS Glue can run your ETL jobs as new data arrives. For example, you can use an AWS Lambda function to trigger your ETL jobs to run as soon as new data becomes available in Amazon S3. You can also register this new dataset in the AWS Glue Data Catalog as part of your ETL jobs.
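A Lambda function that starts a Glue job when an object lands in S3 might look roughly like the sketch below. It assumes the function is subscribed to S3 object-created notifications; the job name `example-etl-job` and the `--source_path` argument are hypothetical and would need to match a job you have defined. The optional `glue_client` parameter is an addition for this example so the handler can be exercised without AWS access.

```python
import urllib.parse


def lambda_handler(event, context, glue_client=None):
    """Start one Glue job run per S3 object in the incoming event."""
    if glue_client is None:
        # Imported lazily so the sketch can be tested without boto3 installed.
        import boto3
        glue_client = boto3.client("glue")

    run_ids = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        response = glue_client.start_job_run(
            JobName="example-etl-job",  # hypothetical job name
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return run_ids
```

In a deployed function, Lambda supplies only `event` and `context`, so the real boto3 client is used; the Lambda execution role also needs permission to start the Glue job.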

AWS Glue Studio

AWS Glue Studio provides a graphical interface to create, run, and monitor extract, transform, and load (ETL) jobs in AWS Glue. You can visually compose data transformation workflows and seamlessly run them on AWS Glue’s Apache Spark-based serverless ETL engine. AWS Glue Studio also offers tools to monitor ETL workflows and validate that they are operating as intended.

What is Amazon Athena?

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and you can start analyzing data immediately. You don’t even need to load your data into Athena; it works directly with data stored in S3. To get started, just log into the Amazon Athena console, define your schema, and start querying. Athena uses Presto with full standard SQL support. It works with a variety of standard data formats, including CSV, JSON, ORC, Apache Parquet and Avro. While Athena is ideal for quick, ad-hoc querying, it can also handle complex analysis, including large joins, window functions, and arrays.

Amazon Athena helps you analyze data stored in Amazon S3. You can use Athena to run ad-hoc queries using ANSI SQL, without the need to aggregate or load the data into Athena. It can process unstructured, semi-structured, and structured datasets. Examples include CSV, JSON, Avro or columnar data formats such as Apache Parquet and Apache ORC. Athena integrates with Amazon QuickSight for easy visualization. You can also use Athena to generate reports or to explore data with business intelligence tools or SQL clients, connected via an ODBC or JDBC driver.
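Besides the console and ODBC/JDBC drivers, an ad-hoc query can also be submitted from code. The helper below is a minimal sketch that assumes you pass in a boto3 Athena client (for example `boto3.client("athena")`); the function name `run_athena_query`, the database name, and the S3 output location are placeholders to replace with your own.

```python
import time


def run_athena_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Submit a query and wait until Athena reports a final state."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes result files to this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        # Still QUEUED or RUNNING: wait before asking again.
        time.sleep(poll_seconds)


# Example call (requires AWS credentials and boto3; names are hypothetical):
# import boto3
# query_id, state = run_athena_query(
#     boto3.client("athena"),
#     "SELECT COUNT(*) FROM example_table",
#     "example_database",
#     "s3://example-bucket/athena-results/",
# )
```

Query execution is asynchronous, which is why the helper polls for a final state before returning.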

The tables and databases that you work with in Athena to run queries are based on metadata. Metadata is data about the underlying data in your dataset. How that metadata describes your dataset is called the schema. For example, a table name, the column names in the table, and the data type of each column are schema, saved as metadata, that describe an underlying dataset. In Athena, we call a system for organizing metadata a data catalog or a metastore. The combination of a dataset and the data catalog that describes it is called a data source.

The relationship of metadata to an underlying dataset depends on the type of data source that you work with. Relational data sources like MySQL, PostgreSQL, and SQL Server tightly integrate the metadata with the dataset. In these systems, the metadata is most often written when the data is written. Other data sources, like those built using Hive, allow you to define metadata on-the-fly when you read the dataset. The dataset can be in a variety of formats; for example, CSV, JSON, Parquet, or Avro.
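The idea of defining metadata when the data is read, rather than when it is written, can be illustrated in plain Python. This sketch is not Athena or Hive code; the raw text, the `schema` list, and the function `read_with_schema` are made up for the example. The stored data carries no column names or types, and the schema is applied only at read time.

```python
import csv
import io

# Raw data as it might sit in a file: no header, no type information.
raw = "1,2021-01-05,19.99\n2,2021-01-06,5.00\n"

# The schema lives apart from the data: column names plus a cast per column.
schema = [("order_id", int), ("order_date", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Apply column names and types to raw CSV text at read time."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        rows.append(
            {name: cast(value) for (name, cast), value in zip(schema, fields)}
        )
    return rows


typed_rows = read_with_schema(raw, schema)
```

Changing the `schema` list changes how the same bytes are interpreted, without rewriting the underlying data.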

What is AWS Lake Formation?

Lake Formation is a fully managed service that enables data engineers, security officers, and data analysts to build, secure, manage, and use your data lake.

To build your data lake in AWS Lake Formation, you must register an Amazon S3 location as a data lake. The Lake Formation service must have permission to write to the AWS Glue Data Catalog and to Amazon S3 locations in the data lake.

Next, identify the data sources to be ingested. AWS Lake Formation can move data into your data lake from existing Amazon S3 data stores. Lake Formation can collect and organize datasets, such as logs from AWS CloudTrail, Amazon CloudFront, detailed billing reports, or Elastic Load Balancing. You can ingest bulk or incremental datasets from relational, NoSQL, or non-relational databases. Lake Formation can ingest data from databases running in Amazon RDS or hosted in Amazon EC2. You can also ingest data from on-premises databases using Java Database Connectivity (JDBC) connectors. You can use custom AWS Glue jobs to load data from other databases or to ingest streaming data using Amazon Kinesis or Amazon DynamoDB.

AWS Lake Formation manages AWS Glue crawlers, AWS Glue ETL jobs, the AWS Glue Data Catalog, security settings, and access control:

• Lake Formation is an automated build environment based on AWS Glue.

• Lake Formation coordinates AWS Glue crawlers to identify datasets within the specified data stores and collect metadata for each dataset.

• Lake Formation can perform transformations on your data, such as rewriting and organizing data into a consistent, analytics-friendly format. Lake Formation creates transformation templates and schedules AWS Glue jobs to prepare and optimize your data for analytics. Lake Formation also helps clean your data using FindMatches, an ML-based deduplication transform. AWS Glue jobs encapsulate scripts, such as ETL scripts, which connect to source data, process it, and write it out to a data target. AWS Glue triggers can start jobs based on a schedule or event, or on demand. AWS Glue workflows orchestrate AWS Glue ETL jobs, crawlers, and triggers. You can define a workflow manually or use a blueprint based on commonly ingested data source types.

• The AWS Glue Data Catalog within the data lake persistently stores the metadata from raw and processed datasets. Metadata about data sources and targets is in the form of databases and tables. Tables store information about the underlying data, including schema information, partition information, and data location. Databases are collections of tables. Each AWS account has one data catalog per AWS Region.

• Lake Formation provides centralized access controls for your data lake, including security policy-based rules for users and applications by role. You can authenticate the users and roles using AWS IAM. Once the rules are defined, Lake Formation enforces them with table- and column-level granularity for users of Amazon Redshift Spectrum and Amazon Athena. Rules are enforced at the table level in AWS Glue, which is normally accessed by administrators.

• Lake Formation leverages the encryption capabilities of Amazon S3 for data in the data lake. This approach provides automatic server-side encryption with keys managed by the AWS Key Management Service (KMS). S3 encrypts data in transit when replicating across Regions. You can use separate accounts for source and destination Regions to further protect your data.
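The column-level access control described in the list above can be sketched as a request to the Lake Formation `GrantPermissions` API. The snippet below only builds the request arguments as a plain dictionary so it runs without AWS credentials. The helper name `build_column_grant`, the role ARN, and the database, table, and column names are hypothetical and used purely for illustration.

```python
def build_column_grant(principal_arn, database, table, columns):
    """Build keyword arguments for a column-level SELECT grant."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


grant = build_column_grant(
    "arn:aws:iam::111122223333:role/AnalystRole",  # hypothetical IAM role
    "sales_db",                                    # hypothetical database
    "orders",                                      # hypothetical table
    ["order_id", "amount"],                        # only these columns are exposed
)

# With AWS credentials configured, the grant could be submitted like this:
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**grant)
```

Under this sketch, a principal querying the table through Athena or Redshift Spectrum would be limited to the listed columns, assuming no broader grant exists for that principal.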

What is Amazon QuickSight?

Amazon QuickSight is a cloud-scale business intelligence (BI) service. In a single data dashboard, QuickSight gives decision-makers the opportunity to explore and interpret information in an interactive visual environment. QuickSight can include AWS data, third-party data, big data, spreadsheet data, SaaS data, B2B data, and more. QuickSight delivers fast and responsive query performance by using a robust in-memory engine (SPICE).

Scale from tens to tens of thousands of users

Amazon QuickSight has a serverless architecture that automatically scales to tens of thousands of users without the need to set up, configure, or manage your own servers.

Embed BI dashboards in your applications

With QuickSight, you can quickly embed interactive dashboards into your applications, websites, and portals.

Access deeper insights with Machine Learning

QuickSight leverages the proven machine learning (ML) capabilities of AWS. BI teams can perform advanced analytics without prior data science experience.

Ask questions of your data, receive answers

With QuickSight, you can quickly get answers to business questions asked in natural language with QuickSight’s new ML-powered natural language query capability, Q.

What is SPICE?

SPICE is the Super-fast, Parallel, In-memory Calculation Engine in QuickSight. SPICE is engineered to rapidly perform advanced calculations and serve data. The storage and processing capacity available in SPICE speeds up the analytical queries that you run against your imported data. By using SPICE, you save time because you don’t need to retrieve the data every time that you change an analysis or update a visual.

When you import data into a dataset rather than using a direct SQL query, it becomes SPICE data because of how it’s stored. In Enterprise edition, data stored in SPICE is encrypted at rest.

When you create or edit a dataset, you choose to use either SPICE or a direct query, unless the dataset contains uploaded files. Importing (also called ingesting) your data into SPICE can save time and money:

• Your analytical queries process faster.

• You don’t need to wait for a direct query to process.

• Data stored in SPICE can be reused multiple times without incurring additional costs. If you use a data source that charges per query, you’re charged for querying the data when you first create the dataset and later when you refresh the dataset.
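The cost point in the list above can be illustrated with a rough count of billed queries against a pay-per-query source. This is a simplified model, not QuickSight's actual billing logic: the function name `billed_queries` and the numbers are hypothetical, and it assumes each dashboard view issues exactly one direct query.

```python
def billed_queries(dashboard_views, refreshes, use_spice):
    """Count queries billed by a pay-per-query source (simplified model)."""
    if use_spice:
        # One query for the initial import, plus one per scheduled refresh.
        return 1 + refreshes
    # Assumption: every dashboard view sends one direct query to the source.
    return dashboard_views


# Hypothetical month: 500 dashboard views and a daily refresh (30 refreshes).
direct = billed_queries(dashboard_views=500, refreshes=30, use_spice=False)
spice = billed_queries(dashboard_views=500, refreshes=30, use_spice=True)
print(direct, spice)  # 500 31
```

With SPICE, the number of source queries depends on how often the dataset is refreshed rather than on how often it is viewed, which is why reuse of imported data adds no further query charges.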