Tag: What are some mistakes data scientists make when building machine learning models?

Quizzes, Practice Exams: Modeling, Data Engineering, Vision, Exploratory Data Analysis, ML Ops, Cheat Sheets, ML Jobs Interview Q&A

Use this App to learn about Machine Learning on AWS and prepare for the AWS Machine Learning Specialty Certification MLS-C01.

Earning AWS Certified Machine Learning Specialty validates expertise in building, training, tuning, and deploying machine learning (ML) models on AWS.

The App provides hundreds of quizzes and practice exams on:

– Machine Learning Operations on AWS

– Modeling

– Data Engineering

– Computer Vision

– Exploratory Data Analysis

– ML implementation & Operations

– Machine Learning Basics Questions and Answers

– Machine Learning Advanced Questions and Answers

– Scorecard

– Countdown timer

– Machine Learning Cheat Sheets

– Machine Learning Interview Questions and Answers

– Machine Learning Latest News

The App covers Machine Learning Basics and Advanced topics including: NLP, Computer Vision, Python, linear regression, logistic regression, Sampling, dataset, statistical interaction, selection bias, non-Gaussian distribution, bias-variance trade-off, Normal Distribution, correlation and covariance, Point Estimates and Confidence Interval, A/B Testing, p-value, statistical power of sensitivity, over-fitting and under-fitting, regularization, Law of Large Numbers, Confounding Variables, Survivorship Bias, univariate, bivariate and multivariate, Resampling, ROC curve, TF/IDF vectorization, Cluster Sampling, etc.

Domain 1: Data Engineering

Create data repositories for machine learning.

Identify data sources (e.g., content and location, primary sources such as user data)

Determine storage mediums (e.g., DB, Data Lake, S3, EFS, EBS)

Identify and implement a data ingestion solution.

Data job styles/types (batch load, streaming)

Data ingestion pipelines (Batch-based ML workloads and streaming-based ML workloads), etc.

Domain 2: Exploratory Data Analysis

Sanitize and prepare data for modeling.

Perform feature engineering.

Analyze and visualize data for machine learning.

Domain 3: Modeling

Frame business problems as machine learning problems.

Select the appropriate model(s) for a given machine learning problem.

Train machine learning models.

Perform hyperparameter optimization.

Evaluate machine learning models.

Domain 4: Machine Learning Implementation and Operations

Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.

Recommend and implement the appropriate machine learning services and features for a given problem.

Apply basic AWS security practices to machine learning solutions.

Deploy and operationalize machine learning solutions.

Amazon Comprehend

AWS Deep Learning AMIs (DLAMI)

AWS DeepLens

Amazon Forecast

Amazon Fraud Detector

Amazon Lex

Amazon Polly

Amazon Rekognition

Amazon SageMaker

Amazon Textract

Amazon Transcribe

Amazon Translate

Other Services and topics covered are:

Ingestion/Collection

Processing/ETL

Data analysis/visualization

Model training

Model deployment/inference

Operational

AWS ML application services

Language relevant to ML (for example, Python, Java, Scala, R, SQL)

Notebooks and integrated development environments (IDEs),

S3, SageMaker, Kinesis, Lake Formation, Athena, Kibana, Redshift, Textract, EMR, Glue; data in CSV, JSON, IMG, Parquet, or databases

Amazon EC2, Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Container Service, Amazon Elastic Kubernetes Service, Amazon Redshift

Sagemaker API Explained:

AWS Certified Machine Learning Engineer Specialty Questions and Answers:

Question1: An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, and do so with minimal changes to the existing code.

Answer1: Use the Build Your Own Container (BYOC) Amazon SageMaker option.
Create a new Docker container with the existing code. Register the container in Amazon Elastic Container Registry (Amazon ECR). Finally, run the training and inference jobs using this container.

Question2: Which feature of Amazon SageMaker can you use for preprocessing the data?

Amazon SageMaker training jobs
Amazon SageMaker hyperparameter tuning
Amazon SageMaker notebook instances
Amazon SageMaker endpoints

Answer2: Amazon SageMaker notebook instances

Amazon SageMaker enables developers and data scientists to build, train, tune, and deploy machine learning (ML) models at scale. You can deploy trained ML models for real-time or batch predictions on unseen data, a process known as inference. However, in most cases, the raw input data must be preprocessed and can’t be used directly for making predictions. This is because most ML models expect the data in a predefined format, so the raw data needs to be first cleaned and formatted in order for the ML model to process the data. You can use the Amazon SageMaker built-in Scikit-learn library for preprocessing input data and then use the Amazon SageMaker built-in Linear Learner algorithm for predictions.
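To make the preprocessing step concrete, here is a minimal pure-Python sketch of standardizing features before prediction. It is not SageMaker code: the function names `fit_scaler` and `transform` and the sample values are assumptions made up for illustration. The point it shows is that scaling statistics are learned from the training data once and then reused unchanged on new inputs at inference time.

```python
from statistics import mean, pstdev

def fit_scaler(rows):
    """Learn per-column mean and standard deviation from the training rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 so a constant column does not cause division by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def transform(rows, scaler):
    """Apply previously learned statistics to any set of rows."""
    means, stds = scaler
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Made-up training data: two features on very different scales.
train = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
scaler = fit_scaler(train)
train_scaled = transform(train, scaler)

# At inference time, reuse the training statistics instead of refitting.
new_scaled = transform([[2.0, 400.0]], scaler)
```

In a real pipeline the same idea would typically be handled by a preprocessing library or a dedicated processing step rather than hand-written code.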

Question3: What setting, when creating an Amazon SageMaker notebook instance, can you use to install libraries and import data?

Answer3: Lifecycle configuration

Question4: How do you choose the right SageMaker built-in algorithm?

Guide to choosing the right unsupervised learning algorithm

Choosing the right ML algorithm based on data type

This is a general guide for choosing which algorithm to use depending on what business problem you have and what data you have.

Top 10 Google Professional Machine Learning Engineer Sample Questions

Question 1: You work for a textile manufacturer and have been asked to build a model to detect and classify fabric defects. You trained a machine learning model with high recall based on high resolution images taken at the end of the production line. You want quality control inspectors to gain trust in your model. Which technique should you use to understand the rationale of your classifier?

A. Use K-fold cross validation to understand how the model performs on different test datasets.

B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.

C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.

D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.

Answer 1) B

Notes 1)

B is correct because Integrated Gradients identifies the pixels of the input image that most influenced its classification.
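As a rough illustration of how Integrated Gradients produces per-feature attributions, the sketch below applies it to a toy three-input model rather than an image classifier. The model, its weights, the all-zero baseline, and the step count are all assumptions chosen for the example. The attribution for each input is the input's distance from the baseline multiplied by the average gradient along the straight path between them.

```python
import math

WEIGHTS = [2.0, -1.0, 0.5]  # hypothetical weights of the toy model

def model(x):
    """Toy differentiable score: sigmoid of a weighted sum of the inputs."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1.0 / (1.0 + math.exp(-z))

def model_grad(x):
    """Analytic gradient of the toy model with respect to each input."""
    s = model(x)
    return [s * (1.0 - s) * w for w in WEIGHTS]

def integrated_gradients(x, baseline, steps=200):
    """Approximate the path integral with a midpoint Riemann sum."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(model_grad(point)):
            total[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)

# Completeness check: attributions should sum to model(x) - model(baseline).
gap = abs(sum(attributions) - (model(x) - model(baseline)))
```

For images, each pixel plays the role of one input, and the attributions can be displayed as a heat map over the picture.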

Question 2: You need to write a generic test to verify whether Dense Neural Network (DNN) models automatically released by your team have a sufficient number of parameters to learn the task for which they were built. What should you do?

A. Train the model for a few iterations, and check for NaN values.

B. Train the model for a few iterations, and verify that the loss is constant.

C. Train a simple linear model, and determine if the DNN model outperforms it.

D. Train the model with no regularization, and verify that the loss function is close to zero.

Answer 2) D

Notes 2)

D is correct because the test can check that the model has enough parameters to memorize the task.
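The idea behind this check can be sketched with a toy stand-in for a DNN. In the code below, which is an assumption-laden illustration and not a real neural network, example i is predicted by parameter `i % n_params`, so the parameter count directly controls capacity. Training uses plain gradient descent on mean squared error with no regularization, and the test passes only if the final training loss is close to zero. The function names, targets, and threshold are hypothetical.

```python
def train_toy_model(targets, n_params, steps=500, lr=0.1):
    """Fit the toy model with unregularized gradient descent; return final MSE."""
    params = [0.0] * n_params
    n = len(targets)
    for _ in range(steps):
        grads = [0.0] * n_params
        for i, y in enumerate(targets):
            grads[i % n_params] += 2.0 * (params[i % n_params] - y) / n
        params = [p - lr * g for p, g in zip(params, grads)]
    return sum((params[i % n_params] - y) ** 2 for i, y in enumerate(targets)) / n

def has_enough_capacity(targets, n_params, tolerance=1e-6):
    """Generic test: the model should be able to memorize the training set."""
    return train_toy_model(targets, n_params) < tolerance

targets = [1.0, -2.0, 0.5, 3.0]
big_enough = has_enough_capacity(targets, n_params=4)   # one parameter per example
too_small = has_enough_capacity(targets, n_params=1)    # a single shared constant
```

With a real DNN the same test would train the actual network on a small batch and assert on its loss instead of using this lookup-style model.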

Question 3: Your team is using a TensorFlow Inception-v3 CNN model pretrained on ImageNet for an image classification prediction challenge on 10,000 images. You will use AI Platform to perform the model training. What TensorFlow distribution strategy and AI Platform training job configuration should you use to train the model and optimize for wall-clock time?

A. Default Strategy; Custom tier with a single master node and four v100 GPUs.

B. One Device Strategy; Custom tier with a single master node and four v100 GPUs.

C. One Device Strategy; Custom tier with a single master node and eight v100 GPUs.

D. Central Storage Strategy; Custom tier with a single master node and four v100 GPUs.

Answer 3) D

Notes 3)

D is correct because this is the only strategy that can perform distributed training; albeit there is only a single copy of the variables on the CPU host.

Question 4: You work on a team where the process for deploying a model into production starts with data scientists training different versions of models in a Kubeflow pipeline. The workflow then stores the new model artifact into the corresponding Cloud Storage bucket. You need to build the next steps of the pipeline after the submitted model is ready to be tested and deployed in production on AI Platform. How should you configure the architecture before deploying the model to production?

A. Deploy model in test environment -> Validate model -> Create a new AI Platform model version

B. Validate model -> Deploy model in test environment -> Create a new AI Platform model version

C. Create a new AI Platform model version -> Validate model -> Deploy model in test environment

D. Create a new AI Platform model version -> Deploy model in test environment -> Validate model

Answer 4) A

Notes 4)

A is correct because the model can be validated after it is deployed to the test environment, and the release version is established before the model is deployed in production.

Question 5: You work for a maintenance company and have built and trained a deep learning model that identifies defects based on thermal images of underground electric cables. Your dataset contains 10,000 images, 100 of which contain visible defects. How should you evaluate the performance of the model on a test dataset?

A. Calculate the Area Under the Curve (AUC) value.

B. Calculate the number of true positive results predicted by the model.

C. Calculate the fraction of images predicted by the model to have a visible defect.

D. Calculate the Cosine Similarity to compare the model’s performance on the test dataset to the model’s performance on the training dataset.

Answer 5) A

Notes 5)

A is correct because it is scale-invariant. AUC measures how well predictions are ranked, rather than their absolute values. AUC is also classification-threshold invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.
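A small sketch of why AUC is more informative than accuracy here. It computes AUC directly from its ranking definition, the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half. The data is made up to mirror the 1% defect rate in the question, and `roc_auc` is a hypothetical helper written for this example.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# 1% positives, and a model that scores every image as "no defect".
labels = [0] * 99 + [1]
constant_scores = [0.0] * 100
accuracy = sum(1 for y in labels if y == 0) / len(labels)
auc_constant = roc_auc(labels, constant_scores)

# A tiny ranking example, plus the same scores multiplied by 10.
small_labels = [0, 0, 1, 1]
small_scores = [0.1, 0.4, 0.35, 0.8]
auc_small = roc_auc(small_labels, small_scores)
auc_rescaled = roc_auc(small_labels, [10 * s for s in small_scores])
```

The constant model reaches 99% accuracy but an AUC of only 0.5, and rescaling the scores leaves AUC unchanged, which is the scale invariance the note refers to. The pairwise loop is quadratic, so for large test sets a sort-based implementation from a metrics library would be the practical choice.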

Question 6: You work for a manufacturing company that owns a high-value machine which has several machine settings and multiple sensors. A history of the machine’s hourly sensor readings and known failure event data are stored in BigQuery. You need to predict if the machine will fail within the next 3 days in order to schedule maintenance before the machine fails. Which data preparation and model training steps should you take?

A. Data preparation: Daily max value feature engineering with DataPrep; Model training: AutoML classification with BQML

B. Data preparation: Daily min value feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True

C. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to False

D. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True

Answer 6) D

Notes 6)

D is correct because it uses the rolling average of the sensor data and balances the weights using the BQML auto class weight balance parameter.
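The two ideas in this answer can be sketched in plain Python. This is not DataPrep or BQML code. The trailing rolling mean smooths noisy hourly readings, and the class weights use the common "inverse frequency" convention n / (n_classes * count), which is an assumption for illustration; the source does not state the exact formula BQML applies. The readings and labels are made up.

```python
from collections import Counter

def rolling_mean(values, window):
    """Trailing rolling mean; early positions average the values seen so far."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        out.append(running / min(i + 1, window))
    return out

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (n_classes * count)."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: n / (len(counts) * c) for cls, c in counts.items()}

# Hypothetical sensor readings with one spike.
readings = [10.0, 12.0, 11.0, 30.0, 13.0]
smoothed = rolling_mean(readings, window=3)

# Failures (label 1) are rare, so they receive a much larger weight.
weights = balanced_class_weights([0] * 9 + [1])
```

The spike at 30.0 is damped in the smoothed series, and the rare failure class is weighted nine times more heavily than the normal class, so the model is not rewarded for ignoring failures.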

Question 7: You are an ML engineer at a media company. You need to build an ML model to analyze video content frame-by-frame, identify objects, and alert users if there is inappropriate content. Which Google Cloud products should you use to build this project?

A. Pub/Sub, Cloud Function, Cloud Vision API

B. Pub/Sub, Cloud IoT, Dataflow, Cloud Vision API, Cloud Logging

C. Pub/Sub, Cloud Function, Video Intelligence API, Cloud Logging

D. Pub/Sub, Cloud Function, AutoML Video Intelligence, Cloud Logging

Answer 7) C

Notes 7)

C is correct as Video Intelligence API can find inappropriate components and other components satisfy the requirements of real-time processing and notification.

Question 8: You work for a large retailer. You want to use ML to forecast future sales leveraging 10 years of historical sales data. The historical data is stored in Cloud Storage in Avro format. You want to rapidly experiment with all the available data. How should you build and train your model for the sales forecast?

A. Load data into BigQuery and use the ARIMA model type on BigQuery ML.

B. Convert the data into CSV format and create a regression model on AutoML Tables.

C. Convert the data into TFRecords and create an RNN model on TensorFlow on AI Platform Notebooks.

D. Convert and refactor the data into CSV format and use the built-in XGBoost algorithm on AI Platform Training.

Answer 8) A

Notes 8)

A is correct because BigQuery ML is designed for fast and rapid experimentation and it is possible to use federated queries to read data directly from Cloud Storage. Moreover, ARIMA is considered one of the best in class for time series forecasting.

Question 9: You need to build an object detection model for a small startup company to identify if and where the company’s logo appears in an image. You were given a large repository of images, some with logos and some without. These images are not yet labelled. You need to label these pictures, and then train and deploy the model. What should you do?

A. Use Google Cloud’s Data Labelling Service to label your data. Use AutoML Object Detection to train and deploy the model.

B. Use Vision API to detect and identify logos in pictures and use it as a label. Use AI Platform to build and train a convolutional neural network.

C. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a convolutional neural network.

D. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a real time object detection model.

Answer 9) A

Notes 9)

A is correct as this will allow you to easily create a request for a labelling task and deploy a high-performance model.

Question 10: You work for a large financial institution that is planning to use Dialogflow to create a chatbot for the company’s mobile app. You have reviewed old chat logs and tagged each conversation for intent based on each customer’s stated intention for contacting customer service. About 70% of customer inquiries are simple requests that are solved within 10 intents. The remaining 30% of inquiries require much longer and more complicated requests. Which intents should you automate first?

A. Automate a blend of the shortest and longest intents to be representative of all intents.

B. Automate the more complicated requests first because those require more of the agents’ time.

C. Automate the 10 intents that cover 70% of the requests so that live agents can handle the more complicated requests.

D. Automate intents in places where common words such as “payment” only appear once to avoid confusing the software.

Answer 10)

Notes 10)

Machine Learning Q&A Part I:

The Complete Python Course for Machine Learning Engineers

Google.

Azure and AWS are second class citizens in this area.

Sure, AWS has 70% of the market.

Sure, Azure is the easiest turn key and super user friendly.

But, the king of machine learning in the cloud is GCP.

GCP = Google Cloud Platform

Google has the largest data science team in the world, not to mention they have Hinton.

Let’s forget for a minute that they created TensorFlow and gave it away.

Let’s just talk about building a real-world model with data that doesn’t fit into an Excel spreadsheet.

The vast majority of applied machine learning is supervised and that means we need data.

Not just normal data, we need very clean highly structured data.

Where’s the easiest place in the world to upload and model a Petabyte of structured data? BigQuery of course.

Why BigQuery? I don’t have to do anything but upload my data. No spinning up Redshift clusters or whatever I have to do in Azure, just upload and massage data with my familiar SQL. If I do have to wrangle my data, it won’t take me six months to update 5 rows here; it usually takes minutes.

Then, you’ll need a front end. Cloud Datalab is a Jupyter notebook, which is good because I neither want nor need anything else.

Then, with a single line of code I connect my Datalab (Jupyter) notebook to my data in BigQuery and build away.

I’ve worked in all three and the only thing I care about is getting my job done the fastest, and right now that means I build my models in GCP.

If you’re new to machine learning don’t start in GCP or any cloud vendor for that matter. Start learning Python from the comfort of your laptop.

The course below is free to the first 20.

What is the list of machine learning classification techniques?

Here, I want to share the best research paper on Machine Learning classification methods, titled ‘Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?’, published in the ‘Journal of Machine Learning Research’.

This paper evaluates 179 classification techniques on 121 data sets. Here is a short summary of the paper:

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

https://jmlr.org/papers/v15/delgado14a.html

The paper evaluated 179 classifiers arising from 17 ML families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbours, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods), implemented in Weka, R ( with and without the caret package), C and Matlab, including all the relevant classifiers available today.

The experiments used a total of 121 data sets, which represent the whole UCI database (excluding the large-scale problems) plus the authors’ own real problems, in order to reach significant conclusions about classifier behaviour that do not depend on the data set collection.

The whole data set and partitions are available from: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz

The classifiers most likely to be the best are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy, exceeding 90% in 84.3% of the data sets. However, the difference from the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy, is not statistically significant. A few models are clearly better than the remaining ones: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package).

The random forest is clearly the best family of classifiers (3 out of the 5 best classifiers are RF), followed by SVM (4 classifiers in the top-10), neural networks and boosting ensembles (5 and 3 members in the top-20, respectively).

You can see the table with the complete results: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/results.txt

I hope it will be helpful for Statistics and Machine Learning aspirants!

Thank you!

What is the best way to know which machine learning algorithm has a better probability to accurately or more precisely classify a dataset, before applying it?

These basic questions should help:

1. Is the classification going to be supervised or unsupervised? Several well-defined techniques like SVMs (Support Vector Machines), trained neural nets, etc. are applicable for supervised classification. For unsupervised classification, GMMs (Gaussian Mixture Models) or HMMs (Hidden Markov Models) with Bayesian techniques could be used. (Several other techniques could of course be used as well.)

2. How much training data do you have, if the problem is supervised? A small amount of training data may yield discouraging classification accuracy even if the chosen classifier is the most suitable one for the problem. In such a case, try to obtain more samples. There is also generally a relationship (for practical purposes at least) between the feature dimensionality and the number of samples needed for a given technique. For example, with an SVM, the linear kernel tends to yield better results than RBF or other kernels when the number of training samples is less than, equal to, or only slightly more than the number of feature dimensions.

3. If the feature vector dimensionality is small enough (1-D, 2-D, or 3-D), it makes sense to plot the data and visually inspect whether techniques like clustering could be more useful. With a very high number of feature dimensions, methods like clustering are generally not advisable (refer to “The Curse of Dimensionality”).

4. Are you doing classification in real time? Some techniques, e.g. “Template Match” in image classification, may lead to a higher number of errors but are generally faster than most other techniques, provided the number of templates to be evaluated is not excessively high.

5. Depending on the problem domain, you may be able to choose the underlying model so that it exploits temporal or spatial correlations inherent in the data. For example, HMMs use the temporal continuity of speech samples to enhance classification results in speech recognition problems.

Another point, slightly off topic perhaps: classification performance is as much a function of choosing the correct feature vectors and pre-processing them as of the classifier itself. It’s generally a good idea to reserve some initial part of the project for trying out various classifiers on the same data set. It may at least help you reject the ones that are highly inaccurate.

What are the application of probability theory in statistics, machine learning, artificial intelligence, economic, commerce, business intelligence?

Which language should I start to learn machine learning for the first time? Python, R, or Julia? And why?

What are the fundamental mathematical requirements for understanding machine learning, as a novice/hobbyist dev?

What is the best image segmentation pre-trained method publicly available? Currently using DeepLab demo (link attached) but I’m looking for a better version of this.

Using public photos from the internet, they were able to reconstruct viewpoints of a scene conserving the realistic shadows and lightings. Would it be possible to do this efficiently and just tag a place in Google map and get the 3D scene from it?

Do you think GPT-3 will change our lives, or is it just hype? Are the applications really useful and real, in the real-world, or are they only the hand-picked results by the researchers and startup to get some hype around them and followers?

Suggest some really good ML projects? Which can be worth adding to CV or resume?

Is it possible to finish Andrew Ng’s course on ML in 15 days?

I’ve reached a dead end with my algorithm for Exact Three Cover, and it’s supposedly trash. What makes it trash?

How do I get started with learning Machine Learning by myself?

What is more preferable in machine learning, the accuracy of model A is 50% on training data and 97% on test data, or is model B with 80% accuracy on train data and 75 % accuracy on test data?(more detail in comment below) thank you!

I’ve seen you emphasize mathematical knowledge being important in data science/machine learning. Mike West emphasizes SQL and Python skills (the former especially) as being the most important. Where does this difference in opinion stem from?

If I am a beginner at machine learning and my current goal is publish a good paper within 6 month how should I start with considering I have no experience in publications and what ML topic I should pick in 2020?

What are the differences between the Bayesian and Frequentist methods within machine learning?

Why is the Gaussian distribution widely used in practice?

What is the Confusion Matrix in Machine Learning?- Simplest Explanation!

Is Julia’s syntax even more intuitive than Python’s?

Andrew Ng: What is the Future of Deep Reinforcement Learning (DL + RL)?

How is TensorFlow/Keras capable of computing seemingly non-smooth loss functions such as max?

What is the first thing you do when looking at a new data set?

How do I get data science, computer vision, and machine learning tutorial apps?

Reinforcement learning (RL) is an area of machine learning concerned with how software agents ought to take actions in an environment in order to maximize the notion of cumulative reward. Reinforcement learning is one of three basic machine learning

Announcing our new Professional Machine Learning Engineer certification…

What is the best machine learning paper you have read in 2018?

Does every paper in machine learning introduce a new algorithm?

Predicting Credit Card Approvals using ML Techniques

Android Document Scanner with offline OCR application

What are the recommended data pre-processing methods for tree-based machine learning models?

What are some interesting statistics about the growing rate of the Julia programming language?

Is it possible to learn machine learning without prior knowledge in any coding language?

What skills does a data scientist need to learn in order to put machine learning models in production?

At a high level, these skills are a combination of software and data engineering.

The people best suited to this job are data engineers and/or machine learning engineers.

That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:

Well-structured code: it doesn’t need to be perfect, but it should at least be understandable and updatable by other team members. Avoid spaghetti code[1] like the plague.
Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
Model versioning: add a hash key to your different models. You will thank me later.
Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, features used, CV scores, and so on). You will thank me later, again.
Monitor performance: track the execution time and statistical scores of your models.
Data and model management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4]. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read: often).
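Several of the habits above (logging instead of print, hash-based model versioning, and saving metadata) can be combined in a few lines. The sketch below is one possible way to do it using only the Python standard library. The helper names, the dictionary used as a stand-in for a trained model, and all the values in it are made up for illustration.

```python
import hashlib
import json
import logging
import pickle
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("model_release")

def model_version(model_bytes):
    """Short content hash, so identical artifacts always get the same version id."""
    return hashlib.sha256(model_bytes).hexdigest()[:12]

def build_metadata(model, hyperparameters, features, cv_score, started_at):
    """Collect what is needed to reproduce or audit this training run."""
    payload = pickle.dumps(model)
    return {
        "version": model_version(payload),
        "hyperparameters": hyperparameters,
        "features": features,
        "cv_score": cv_score,
        "training_seconds": round(time.time() - started_at, 3),
    }

started = time.time()
toy_model = {"weights": [0.2, -1.3], "bias": 0.05}  # stand-in for a trained model
metadata = build_metadata(toy_model, {"lr": 0.1}, ["age", "income"], 0.87, started)

# Log the metadata as JSON so it can be searched later, instead of printing it.
logger.info("trained model %s", json.dumps(metadata))
```

In practice the serialized model and its metadata file would be written to shared storage under the version id, which keeps large artifacts out of version control.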

What are some mistakes data scientists make when building machine learning models?

Some of the mistakes that can occur when building a machine learning model (that I can think of) are listed here:

Not understanding the structure of the dataset
Not taking proper care during feature selection
Leaving out categorical features and considering only numerical variables
Falling into the dummy variable trap
Selecting an inefficient machine learning algorithm
Not trying out various ML algorithms suited to the structure of the data
Improper tuning of model parameters
Most importantly: building a naive, imperfect model. Suppose we have a classification problem where 99% of examples fall into class1 and the remainder into class2. The model may learn a mapping that predicts class1 for every input. One might then say the model has 99% accuracy, but in reality it never identifies the 1% of class2 cases. Class imbalance like this must be taken into consideration.
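The class-imbalance trap in the last point is easy to reproduce. The sketch below uses made-up labels with a 99 to 1 split and a "model" that always predicts the majority class, then compares accuracy with recall on the minority class. The helper `confusion_counts` is a hypothetical name written for this example.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (tp, fp, fn, tn) treating `positive` as the class of interest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn

y_true = ["class1"] * 99 + ["class2"]
y_pred = ["class1"] * 100  # a model that always predicts the majority class

tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive="class2")
accuracy = (tp + tn) / len(y_true)
recall_class2 = tp / (tp + fn) if (tp + fn) else 0.0
```

Accuracy comes out at 0.99 while recall for class2 is 0.0, which is why metrics such as recall, precision, or AUC should be reported alongside accuracy on imbalanced data.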

What is the difference between data analytics and data mining?

Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.

That’s just the surface-level comparison though. The image above gives an overview of how the two differ.

One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected outputs are patterns or trends, which don’t require coming up with a statement or fact to test.

However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis, on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis.

What is the life cycle of a data science project?

The data science life cycle is not as well-defined as the software development life-cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life-cycle of a data science project depends on various data scientist skills and data science tools. The typical life-cycle of a data science project involves jumping back and forth among various interdependent data science tasks using a variety of tools, techniques, programming, etc.

Thus, the data science life-cycle can include the following steps:

Business requirement understanding.
Data collection.
Data cleaning.
Data analysis.
Modeling.
Performance evaluation.
Communicating with stakeholders.
Deployment.
Real-world testing.
Business buy-in.
Support and maintenance.

Looks neat, but here is the scheme to visualize how it is happening in reality:

Agile development processes, especially continuous delivery, lend themselves well to the data science project life-cycle. The early comparison helps the data science team to change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build them.

Machine Learning Q&A -Part II:

Which language should I start to learn machine learning for the first time? Python, R, or Julia? And why?

What are the fundamental mathematical requirements for understanding machine learning, as a novice/hobbyist dev?

What is the best image segmentation pre-trained method publicly available? Currently using DeepLab demo (link attached) but I’m looking for a better version of this.

Suggest some really good ML projects? Which can be worth adding to CV or resume?

Is it possible to finish Andrew Ng’s course on ML in 15 days?

I’ve reached a dead end with my algorithm for Exact Three Cover, and it’s supposedly trash. What makes it trash?

How do I get started with learning Machine Learning by myself?

What are the differences between the Bayesian and Frequentist methods within machine learning?

Why is the Gaussian distribution widely used in practice?

What is the Confusion Matrix in Machine Learning?- Simplest Explanation!

Is Julia’s syntax even more intuitive than Python’s?

Andrew Ng: What is the Future of Deep Reinforcement Learning (DL + RL)?

How is TensorFlow/Keras capable of computing seemingly non-smooth loss functions such as max?

What is the first thing you do when looking at a new data set?

How do I get data science, computer vision, and machine learning tutorial apps?

Announcing our new Professional Machine Learning Engineer certification…

What is the best machine learning paper you have read in 2018?

Does every paper in machine learning introduce a new algorithm?

Predicting Credit Card Approvals using ML Techniques

Android Document Scanner with offline OCR application

What are the recommended data pre-processing methods for tree-based machine learning models?

What are some interesting statistics about the growing rate of the Julia programming language?

Is it possible to learn machine learning without prior knowledge in any coding language?

What skills does a data scientist need to learn in order to put machine learning models in production?

At a high level, these skills are a combination of software and data engineering.

The people best suited to this job are data engineers and/or machine learning engineers.

That being said, if you work at a startup or a small company and need to put the models into production yourself, here are the top skills you need:

Well-structured code: it doesn’t need to be perfect, but other team members should at least be able to understand and update it. Avoid spaghetti code[1] like the plague.
Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at all costs.
Model versioning: add a hash key to each of your models. You will thank me later.
Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, features used, CV scores, and so on). You will thank me later, again.
Monitor performance: track the execution time and statistical scores of your models.
Data and model management: store the necessary data and models somewhere available to everyone (S3[3], for example). Avoid uploading them to your VCS[4]. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read: often).
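The logging, versioning, and metadata points above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the toy model, the feature names, the hyperparameters, and the CV scores are all made up, and the helper names `model_hash` and `build_metadata` are hypothetical, not part of any library.

```python
import hashlib
import json
import logging
import pickle
import time

# Use the logging module instead of print statements.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("training")


def model_hash(model_bytes: bytes) -> str:
    """Return a short SHA-256 digest that identifies one serialized model."""
    return hashlib.sha256(model_bytes).hexdigest()[:12]


def build_metadata(model_bytes, hyperparameters, features, cv_scores, started_at):
    """Collect the facts worth storing next to every trained model."""
    return {
        "model_version": model_hash(model_bytes),
        "hyperparameters": hyperparameters,
        "features": features,
        "cv_scores": cv_scores,
        "mean_cv_score": sum(cv_scores) / len(cv_scores),
        "running_time_s": round(time.time() - started_at, 3),
    }


started = time.time()
# Stand-in for a real trained model (assumption): any picklable object works.
toy_model = {"weights": [0.4, -1.2, 0.7], "bias": 0.1}
serialized = pickle.dumps(toy_model)

metadata = build_metadata(
    serialized,
    hyperparameters={"learning_rate": 0.1, "max_depth": 3},  # made-up values
    features=["age", "income", "tenure"],                    # made-up names
    cv_scores=[0.81, 0.79, 0.83],                            # made-up scores
    started_at=started,
)
logger.info("trained model %s", metadata["model_version"])
logger.info("metadata: %s", json.dumps(metadata, sort_keys=True))
```

The hash is computed from the serialized bytes, so two artifacts with the same version string are byte-for-byte identical. The JSON record can be stored next to the model file in shared storage.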

What are some mistakes data scientists make when building machine learning models?

Some of the mistakes that can happen while building a machine learning model (that I can think of) are listed here:

Not understanding the structure of the dataset
Not taking proper care during feature selection
Leaving out categorical features and considering only numerical variables
Falling into the dummy variable trap
Selecting an inefficient machine learning algorithm
Not trying out various ML algorithms suited to the structure of the data
Improper tuning of model parameters
Most importantly: building a naive, flawed model. Suppose we have a classification problem in which 99% of the cases fall into class1 and the remaining 1% into class2. The model may learn a mapping that predicts class1 for every input. One might then say the model has 99% accuracy, but in reality it never identifies the 1% of class2 cases. So class imbalance must be taken into consideration.
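The last mistake in the list is easy to reproduce. The sketch below uses a synthetic 99%/1% label split (an assumption chosen to match the example above) and a "model" that always predicts class1. It then compares plain accuracy with per-class recall and balanced accuracy. The helper names are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of the `positive` class that the model actually finds."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    classes = sorted(set(y_true))
    return sum(recall(y_true, y_pred, c) for c in classes) / len(classes)


# Synthetic data: 990 class1 cases and 10 class2 cases.
y_true = ["class1"] * 990 + ["class2"] * 10
# A degenerate model that predicts class1 for every input.
y_pred = ["class1"] * len(y_true)

acc = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, positive="class2")
bal_acc = balanced_accuracy(y_true, y_pred)

print(f"accuracy={acc:.2f} minority recall={minority_recall:.2f} "
      f"balanced accuracy={bal_acc:.2f}")
```

Accuracy looks excellent, while the recall on class2 shows the model never finds a single minority case. Reporting recall, a confusion matrix, or balanced accuracy alongside accuracy is one way to catch this.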

iOs: https://apps.apple.com/ca/app/aws-machine-learning-prep-pro/id1611045854

Windows: https://www.microsoft.com/en-ca/p/aws-machine-learning-mls-c01-specialty-certification-exam-prep/9n8rl80hvm4t

Android/Amazon: https://www.amazon.com/gp/product/B09TZ4H8V6

Domain 1: Data Engineering

Create data repositories for machine learning.

Identify data sources (e.g., content and location, primary sources such as user data)

Determine storage mediums (e.g., DB, Data Lake, S3, EFS, EBS)

Identify and implement a data ingestion solution.

Data job styles/types (batch load, streaming)

Data ingestion pipelines (Batch-based ML workloads and streaming-based ML workloads), etc.

Domain 2: Exploratory Data Analysis

Sanitize and prepare data for modeling.

Perform feature engineering.

Analyze and visualize data for machine learning.

Domain 3: Modeling

Frame business problems as machine learning problems.

Select the appropriate model(s) for a given machine learning problem.

Train machine learning models.

Perform hyperparameter optimization.

Evaluate machine learning models.

Domain 4: Machine Learning Implementation and Operations

Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.

Recommend and implement the appropriate machine learning services and features for a given problem.

Apply basic AWS security practices to machine learning solutions.

Deploy and operationalize machine learning solutions.

Amazon Comprehend

AWS Deep Learning AMIs (DLAMI)

AWS DeepLens

Amazon Forecast

Amazon Fraud Detector

Amazon Lex

Amazon Polly

Amazon Rekognition

Amazon SageMaker

Amazon Textract

Amazon Transcribe

Amazon Translate

Other Services and topics covered are:

Ingestion/Collection

Processing/ETL

Data analysis/visualization

Model training

Model deployment/inference

Operational

AWS ML application services

Languages relevant to ML (for example, Python, Java, Scala, R, SQL)

Notebooks and integrated development environments (IDEs)

S3, SageMaker, Kinesis, Lake Formation, Athena, Kibana, Redshift, Textract, EMR, Glue, CSV, JSON, IMG, Parquet or databases

Amazon EC2, Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Container Service, Amazon Elastic Kubernetes Service, Amazon Redshift

Sagemaker API Explained:

AWS Certified Machine Learning Engineer Specialty Questions and Answers:

Question1: An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, and do so with minimal changes to the existing code.

Answer1: Use the Build Your Own Container (BYOC) Amazon SageMaker option.
Create a new Docker container with the existing code. Register the container in Amazon Elastic Container Registry. Finally, run the training and inference jobs using this container.
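As a rough sketch of the last step in that answer, the snippet below builds the request that could be passed to the SageMaker `CreateTrainingJob` API once the container image has been pushed to Amazon ECR. It only assembles a dictionary and does not call AWS. The account ID, region, repository name, role ARN, bucket paths, and instance settings are all placeholders, and the helper names are hypothetical.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build the URI of an image stored in Amazon ECR."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def training_job_request(job_name, image_uri, role_arn, input_s3_uri, output_s3_uri):
    """Assemble a CreateTrainingJob request that runs a custom container."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,   # the BYOC image registered in ECR
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": input_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",   # placeholder instance type
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


# All identifiers below are placeholders, not real resources.
image = ecr_image_uri("123456789012", "us-east-1", "r-xgboost-byoc")
request = training_job_request(
    job_name="r-xgboost-byoc-demo",
    image_uri=image,
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    input_s3_uri="s3://example-bucket/train/",
    output_s3_uri="s3://example-bucket/output/",
)
# With credentials configured, the request could be submitted with:
#   boto3.client("sagemaker").create_training_job(**request)
print(request["AlgorithmSpecification"]["TrainingImage"])
```

Because the existing R code lives inside the image, the request only has to point SageMaker at the image, the data, and an IAM role. That is why this option needs minimal changes to the code itself.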

Question2: Which feature of Amazon SageMaker can you use for preprocessing the data?

Amazon SageMaker training jobs
Amazon SageMaker hyperparameter tuning
Amazon SageMaker notebook instances
Amazon SageMaker endpoints

Answer2: Amazon SageMaker notebook instances

Question3: What setting, when creating an Amazon SageMaker notebook instance, can you use to install libraries and import data?

Answer3: Lifecycle configuration

Question4: How do you choose the right SageMaker built-in algorithm?

This is a general guide for choosing which algorithm to use depending on what business problem you have and what data you have.

Top 10 Google Professional Machine Learning Engineer Sample Questions

A. Use K-fold cross validation to understand how the model performs on different test datasets.

B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.

C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.

D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.

Answer 1)

Notes 1)

B is correct because it identifies the pixel of the input image that leads to the classification of the image itself.
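To make the idea behind option B concrete, here is a minimal pure-Python sketch of Integrated Gradients on a toy model. Everything in it is an assumption for illustration: the four-"pixel" input, the weights, the all-zero baseline, and the use of numerical gradients. A real image model would use a framework's automatic differentiation instead.

```python
import math


def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of f at x."""
    grads = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads


def integrated_gradients(f, x, baseline, steps=200):
    """Average the gradient along the straight path from baseline to x,
    then scale by (x - baseline). Uses the midpoint rule for the integral."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(numerical_gradient(f, point)):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


def toy_model(pixels):
    """Hypothetical classifier: sigmoid of a weighted sum of four pixels."""
    weights = [2.0, -1.0, 0.0, 0.5]
    z = sum(w * p for w, p in zip(weights, pixels))
    return 1.0 / (1.0 + math.exp(-z))


image = [1.0, 0.5, 0.8, 0.2]
baseline = [0.0, 0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_model, image, baseline)

for index, value in enumerate(attributions):
    print(f"pixel {index}: attribution {value:+.4f}")
print("sum of attributions:", round(sum(attributions), 4))
print("f(image) - f(baseline):", round(toy_model(image) - toy_model(baseline), 4))
```

The attributions sum (approximately) to the difference between the model output on the input and on the baseline. The pixel with zero weight receives zero attribution, which matches the intuition that it did not contribute to the prediction.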

A. Train the model for a few iterations, and check for NaN values.

B. Train the model for a few iterations, and verify that the loss is constant.

C. Train a simple linear model, and determine if the DNN model outperforms it.

D. Train the model with no regularization, and verify that the loss function is close to zero.

Answer 2)

Notes 2)

D is correct because the test can check that the model has enough parameters to memorize the task.
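The memorization check described in Notes 2 can be illustrated with a tiny stand-in model rather than a DNN. The sketch below is an assumption-laden toy: a three-parameter polynomial model trained by gradient descent on three made-up points. Without regularization the training loss should drop to nearly zero; with an L2 penalty it should not.

```python
def predict(weights, features):
    """Linear model over the given feature vector."""
    return sum(w * f for w, f in zip(weights, features))


def mse(weights, rows, targets):
    """Mean squared error on the training set."""
    return sum((predict(weights, r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def train(rows, targets, l2=0.0, lr=0.3, steps=500):
    """Plain gradient descent on MSE plus an optional L2 penalty."""
    weights = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(steps):
        grads = [2.0 * l2 * w for w in weights]  # gradient of the L2 term
        for r, t in zip(rows, targets):
            err = predict(weights, r) - t
            for j, f in enumerate(r):
                grads[j] += 2.0 * err * f / n
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


# Three made-up points and three parameters (1, x, x^2): enough capacity
# to memorize the data exactly.
xs = [-1.0, 0.0, 1.0]
rows = [[1.0, x, x * x] for x in xs]
targets = [1.0, -2.0, 0.5]

loss_no_reg = mse(train(rows, targets, l2=0.0), rows, targets)
loss_with_reg = mse(train(rows, targets, l2=0.5), rows, targets)

print(f"training loss without regularization: {loss_no_reg:.2e}")
print(f"training loss with L2 regularization: {loss_with_reg:.2e}")
```

If the unregularized loss refuses to approach zero on a small sample, the model may lack capacity or the training code may have a bug, which is the point of the check.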

A. Default Strategy; Custom tier with a single master node and four v100 GPUs.

B. One Device Strategy; Custom tier with a single master node and four v100 GPUs.

C. One Device Strategy; Custom tier with a single master node and eight v100 GPUs.

D. Central Storage Strategy; Custom tier with a single master node and four v100 GPUs.

Answer 3)

Notes 3)

D is correct because this is the only strategy that can perform distributed training; albeit there is only a single copy of the variables on the CPU host.

A. Deploy model in test environment -> Validate model -> Create a new AI Platform model version

B. Validate model -> Deploy model in test environment -> Create a new AI Platform model version

C. Create a new AI Platform model version -> Validate model -> Deploy model in test environment

D. Create a new AI Platform model version -> Deploy model in test environment -> Validate model

Answer 4)

Notes 4)

A is correct because the model can be validated after it is deployed to the test environment, and the release version is established before the model is deployed in production.

A. Calculate the Area Under the Curve (AUC) value.

B. Calculate the number of true positive results predicted by the model.

C. Calculate the fraction of images predicted by the model to have a visible defect.

D. Calculate the Cosine Similarity to compare the model’s performance on the test dataset to the model’s performance on the training dataset.

Answer 5)

Notes 5)

A. Data preparation: Daily max value feature engineering with DataPrep; Model training: AutoML classification with BQML

B. Data preparation: Daily min value feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True

C. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to False

D. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True

Answer 6)

Notes 6)

D is correct because it uses the rolling average of the sensor data and balances the weights using the BQML auto class weight balance parameter.
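The two ideas in Notes 6 can be sketched in plain Python. The sensor readings and labels below are invented, and the inverse-frequency weighting is one common balancing scheme assumed here for illustration. It is not taken from the BQML documentation, and the question stem itself is missing from this page.

```python
from collections import Counter


def rolling_average(values, window):
    """Trailing mean over the last `window` readings (shorter at the start)."""
    averages = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the reading that left the window
        averages.append(running / min(i + 1, window))
    return averages


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


# Invented sensor readings with one spike.
readings = [10.0, 12.0, 11.0, 30.0, 12.0, 11.0]
smoothed = rolling_average(readings, window=3)

# Invented, heavily imbalanced labels: failures are rare.
labels = ["ok"] * 95 + ["failure"] * 5
weights = balanced_class_weights(labels)

print("smoothed readings:", [round(v, 2) for v in smoothed])
print("class weights:", {k: round(v, 3) for k, v in weights.items()})
```

The rolling average dampens single-reading noise, and the class weights make each rare failure count for more during training so the model is not dominated by the majority class.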

A. Pub/Sub, Cloud Function, Cloud Vision API

B. Pub/Sub, Cloud IoT, Dataflow, Cloud Vision API, Cloud Logging

C. Pub/Sub, Cloud Function, Video Intelligence API, Cloud Logging

D. Pub/Sub, Cloud Function, AutoML Video Intelligence, Cloud Logging

Answer 7)

Notes 7)

C is correct as Video Intelligence API can find inappropriate components and other components satisfy the requirements of real-time processing and notification.

A. Load data into BigQuery and use the ARIMA model type on BigQuery ML.

B. Convert the data into CSV format and create a regression model on AutoML Tables.

C. Convert the data into TFRecords and create an RNN model on TensorFlow on AI Platform Notebooks.

D. Convert and refactor the data into CSV format and use the built-in XGBoost algorithm on AI Platform Training.

Answer 8)

Notes 8)

A. Use Google Cloud’s Data Labelling Service to label your data. Use AutoML Object Detection to train and deploy the model.

B. Use Vision API to detect and identify logos in pictures and use it as a label. Use AI Platform to build and train a convolutional neural network.

C. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a convolutional neural network.

D. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a real time object detection model.

Answer 9)

Notes 9)

A is correct as this will allow you to easily create a request for a labelling task and deploy a high-performance model.

A. Automate a blend of the shortest and longest intents to be representative of all intents.

B. Automate the more complicated requests first because those require more of the agents’ time.

C. Automate the 10 intents that cover 70% of the requests so that live agents can handle the more complicated requests.

D. Automate intents in places where common words such as “payment” only appear once to avoid confusing the software.

Answer 10)

Notes 10)

Machine Learning Q&A Part I:

The Complete Python Course for Machine Learning Engineers

Google.

Azure and AWS are second class citizens in this area.

Sure, AWS has 70% of the market.

Sure, Azure is the easiest turnkey option and super user-friendly.

But the king of machine learning in the cloud is GCP.

GCP = Google Cloud Platform

Google has the largest data science team in the world, not to mention they have Hinton.

Let’s forget for a minute that they created TensorFlow and gave it away.

Let’s just talk about building a real-world model with data that doesn’t fit into an Excel spreadsheet.

The vast majority of applied machine learning is supervised, and that means we need data.

Not just normal data: we need very clean, highly structured data.

Where’s the easiest place in the world to upload and model a petabyte of structured data? BigQuery, of course.

Then, you’ll need a front end. Cloud Datalab is a Jupyter notebook, which is good because I neither want nor need anything else.

Then, with a single line of code I connect my Datalab (Jupyter) notebook to my data in BigQuery and build away.

I’ve worked in all three, and the only thing I care about is getting my job done the fastest. Right now that means I build my models in GCP.

If you’re new to machine learning don’t start in GCP or any cloud vendor for that matter. Start learning Python from the comfort of your laptop.

The course below is free to the first 20.

What is the list of machine learning classification techniques?

This paper evaluates 179 classifiers on 121 data sets, so here is a pointer to it and its results:

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

https://jmlr.org/papers/v15/delgado14a.html

The whole data set and partitions are available from: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz

You can see the table with the complete results: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/results.txt

I hope it will be helpful for statistics and machine learning aspirants!

Thank you!

What is the best way to know which machine learning algorithm has a better probability to accurately or more precisely classify a dataset, before applying it?

These basic questions should help:

SageMaker API Explained:

AWS Certified Machine Learning Engineer Specialty Questions and Answers:

Question1: An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, and do so with minimal changes to the existing code.

Answer1: Use the Build Your Own Container (BYOC) Amazon SageMaker option.
Create a new Docker container with the existing code. Register the container in Amazon Elastic Container Registry. Finally, run the training and inference jobs using this container.
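
As a rough sketch of the last step, the snippet below builds the image URI that Amazon ECR assigns to a pushed container and a request shaped for the SageMaker CreateTrainingJob API. The account ID, region, repository name, role ARN, bucket and instance settings are all made-up placeholders, and nothing is sent to AWS.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Return the URI under which Amazon ECR exposes a pushed image."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def training_job_request(job_name, image_uri, role_arn, output_s3_uri):
    """Build keyword arguments shaped for the SageMaker CreateTrainingJob API."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,      # the custom (BYOC) container
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",  # assumed instance type
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


# Hypothetical values, for illustration only.
image = ecr_image_uri("123456789012", "us-east-1", "r-xgboost-byoc")
request = training_job_request(
    "r-xgboost-demo",
    image,
    "arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    "s3://example-bucket/output/",
)
print(image)
```

With AWS credentials configured, a request like this could be passed to boto3.client("sagemaker").create_training_job(**request).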

Question2: Which feature of Amazon SageMaker can you use for preprocessing the data?

Amazon SageMaker training jobs
Amazon SageMaker hyperparameter tuning
Amazon SageMaker notebook instances
Amazon SageMaker endpoints

Answer2: Amazon SageMaker notebook instances

Question3: What setting, when creating an Amazon SageMaker notebook instance, can you use to install libraries and import data?

Answer3: Lifecycle configuration
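
A lifecycle configuration is essentially a shell script that runs when the notebook instance is created or started. The sketch below base64-encodes a made-up script and builds a request shaped for the CreateNotebookInstanceLifecycleConfig API. The script contents, bucket and configuration name are assumptions for illustration, and no AWS call is made.

```python
import base64

# Hypothetical start-up script: install a library and copy a data set from S3.
ON_START_SCRIPT = """#!/bin/bash
set -e
sudo -u ec2-user -i <<'EOF'
pip install --quiet lightgbm
aws s3 cp s3://example-bucket/train.csv /home/ec2-user/SageMaker/train.csv
EOF
"""


def lifecycle_config_request(name, on_start_script):
    """Build keyword arguments for CreateNotebookInstanceLifecycleConfig.

    The API expects the script content as a base64-encoded string.
    """
    encoded = base64.b64encode(on_start_script.encode("utf-8")).decode("ascii")
    return {
        "NotebookInstanceLifecycleConfigName": name,
        "OnStart": [{"Content": encoded}],
    }


config_request = lifecycle_config_request("install-libs-and-data", ON_START_SCRIPT)
print(config_request["NotebookInstanceLifecycleConfigName"])
```

With credentials in place, this could be passed to boto3.client("sagemaker").create_notebook_instance_lifecycle_config(**config_request).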

Question4: How do you choose the right SageMaker built-in algorithm?

This is a general guide for choosing which algorithm to use depending on what business problem you have and what data you have.

Top 10 Google Professional Machine Learning Engineer Sample Questions

A. Use K-fold cross validation to understand how the model performs on different test datasets.

B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.

C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.

D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.

Answer 1) B

Notes 1)

B is correct because Integrated Gradients identifies the pixels of the input image that contribute most to the classification of that image.
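
To make the idea concrete, here is a minimal sketch of Integrated Gradients on a toy logistic model, where each input feature stands in for a pixel. The weights, input and all-zero baseline are made up, and the path integral is approximated with a midpoint Riemann sum. A real image model would use automatic differentiation instead of the hand-written gradient.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, w, b):
    """Toy model: logistic regression over the input features."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def gradient(x, w, b):
    """Gradient of the prediction with respect to each input feature."""
    p = predict(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, b, steps=200):
    """Average the gradient along the straight path from baseline to x,
    then scale by (x - baseline) to get one attribution per feature."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        for i, g in enumerate(gradient(point, w, b)):
            totals[i] += g
    return [(xi - bi) * t / steps for xi, bi, t in zip(x, baseline, totals)]


w = [2.0, -1.0, 0.0, 0.5]   # made-up weights
b = -0.25
x = [1.0, 0.5, 3.0, -1.0]   # made-up "image" with four pixels
baseline = [0.0] * len(x)   # all-black baseline

attributions = integrated_gradients(x, baseline, w, b)
print(attributions)
```

The attributions should sum to approximately predict(x) minus predict(baseline), and a feature the model ignores (weight 0 here) receives zero attribution.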

A. Train the model for a few iterations, and check for NaN values.

B. Train the model for a few iterations, and verify that the loss is constant.

C. Train a simple linear model, and determine if the DNN model outperforms it.

D. Train the model with no regularization, and verify that the loss function is close to zero.

Answer 2) D

Notes 2)

D is correct because the test can check that the model has enough parameters to memorize the task.
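
The same sanity check can be sketched without a deep learning framework. In the snippet below, a linear model with more parameters than training examples stands in for the DNN (an assumption made to keep the example short), is trained with plain gradient descent and no regularization, and the training loss is then checked to be close to zero. The data is random and exists only for illustration.

```python
import random


def mse(X, y, w):
    """Mean squared error of the linear model w on (X, y)."""
    errors = [sum(wj * xj for wj, xj in zip(w, row)) - t for row, t in zip(X, y)]
    return sum(e * e for e in errors) / len(y)


def train(X, y, steps=3000, lr=0.02):
    """Plain gradient descent with no regularization term."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        errors = [sum(wj * xj for wj, xj in zip(w, row)) - t for row, t in zip(X, y)]
        for j in range(d):
            grad_j = 2.0 / n * sum(e * row[j] for e, row in zip(errors, X))
            w[j] -= lr * grad_j
    return w


rng = random.Random(0)
# 5 examples, 20 parameters: enough capacity to memorize the tiny data set.
X = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(5)]
y = [rng.gauss(0, 1) for _ in range(5)]

initial_loss = mse(X, y, [0.0] * 20)
final_loss = mse(X, y, train(X, y))
print(initial_loss, final_loss)
```

If the final loss did not approach zero on such a small batch, that would point to a bug in the model or the training loop rather than to a lack of data.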

A. Default Strategy; Custom tier with a single master node and four v100 GPUs.

B. One Device Strategy; Custom tier with a single master node and four v100 GPUs.

C. One Device Strategy; Custom tier with a single master node and eight v100 GPUs.

D. Central Storage Strategy; Custom tier with a single master node and four v100 GPUs.

Answer 3) D

Notes 3)

D is correct because this is the only strategy that can perform distributed training; albeit there is only a single copy of the variables on the CPU host.

A. Deploy model in test environment -> Validate model -> Create a new AI Platform model version

B. Validate model -> Deploy model in test environment -> Create a new AI Platform model version

C. Create a new AI Platform model version -> Validate model -> Deploy model in test environment

D. Create a new AI Platform model version -> Deploy model in test environment -> Validate model

Answer 4) A

Notes 4)

A is correct because the model can be validated after it is deployed to the test environment, and the release version is established before the model is deployed in production.

A. Calculate the Area Under the Curve (AUC) value.

B. Calculate the number of true positive results predicted by the model.

C. Calculate the fraction of images predicted by the model to have a visible defect.

D. Calculate the Cosine Similarity to compare the model’s performance on the test dataset to the model’s performance on the training dataset.

Answer 5)

Notes 5)

A. Data preparation: Daily max value feature engineering with DataPrep; Model training: AutoML classification with BQML

B. Data preparation: Daily min value feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True

C. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to False

D. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True

Answer 6) D

Notes 6)

D is correct because it uses the rolling average of the sensor data and balances the weights using the BQML auto class weight balance parameter.
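
The two ideas in this answer can be sketched in plain Python: a trailing rolling average over sensor readings, and class weights that are inversely proportional to class frequency. The readings and labels are made up, and the weighting formula is one common inverse-frequency scheme, not necessarily the exact one BQML applies when AUTO_CLASS_WEIGHTS is set to True.

```python
from collections import Counter


def rolling_mean(values, window):
    """Trailing moving average; early positions average the values seen so far."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        out.append(running / min(i + 1, window))
    return out


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


readings = [10.0, 12.0, 11.0, 30.0, 13.0, 12.0]  # made-up sensor values
smoothed = rolling_mean(readings, window=3)

labels = [0] * 9 + [1]                            # 9 normal readings, 1 failure
weights = balanced_class_weights(labels)
print(smoothed)
print(weights)
```

With these weights, the rare failure class contributes as much total weight to the loss as the common class, which is the point of balancing.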

A. Pub/Sub, Cloud Function, Cloud Vision API

B. Pub/Sub, Cloud IoT, Dataflow, Cloud Vision API, Cloud Logging

C. Pub/Sub, Cloud Function, Video Intelligence API, Cloud Logging

D. Pub/Sub, Cloud Function, AutoML Video Intelligence, Cloud Logging

Answer 7) C

Notes 7)

C is correct because the Video Intelligence API can detect inappropriate content, and the other components satisfy the requirements for real-time processing and notification.

A. Load data into BigQuery and use the ARIMA model type on BigQuery ML.

B. Convert the data into CSV format and create a regression model on AutoML Tables.

C. Convert the data into TFRecords and create an RNN model on TensorFlow on AI Platform Notebooks.

D. Convert and refactor the data into CSV format and use the built-in XGBoost algorithm on AI Platform Training.

Answer 8)

Notes 8)

A. Use Google Cloud’s Data Labelling Service to label your data. Use AutoML Object Detection to train and deploy the model.

B. Use Vision API to detect and identify logos in pictures and use it as a label. Use AI Platform to build and train a convolutional neural network.

C. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a convolutional neural network.

D. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a real time object detection model.

Answer 9) A

Notes 9)

A is correct as this will allow you to easily create a request for a labelling task and deploy a high-performance model.

A. Automate a blend of the shortest and longest intents to be representative of all intents.

B. Automate the more complicated requests first because those require more of the agents’ time.

C. Automate the 10 intents that cover 70% of the requests so that live agents can handle the more complicated requests.

D. Automate intents in places where common words such as “payment” only appear once to avoid confusing the software.

Answer 10)

Notes 10)

Machine Learning Q&A Part I:

The Complete Python Course for Machine Learning Engineers

Google.

Azure and AWS are second class citizens in this area.

Sure, AWS has 70% of the market.

Sure, Azure is the easiest turn key and super user friendly.

But, the king of machine learning in the cloud is GCP.

GCP = Google Cloud Platform

Google has the largest data science team in the world, not to mention they have Hinton.

Let’s forget for a minute that they created TensorFlow and gave it away.

Let’s just talk about building a real-world model with data that doesn’t fit into an Excel spreadsheet.

The vast majority of applied machine learning is supervised and that means we need data.

Not just normal data, we need very clean highly structured data.

Where’s the easiest place in the world to upload and model a Petabyte of structured data? BigQuery of course.

Then, you’ll need a front end. Cloud Datalab is a Jupyter notebook, which is good because I neither want nor need anything else.

Then, with a single line of code, I connect my Datalab (Jupyter) notebook to my data in BigQuery and build away.

I’ve worked in all three, and the only thing I care about is getting my job done the fastest. Right now that means I build my models in GCP.

If you’re new to machine learning don’t start in GCP or any cloud vendor for that matter. Start learning Python from the comfort of your laptop.

The course below is free to the first 20.

What is a list of machine learning classification techniques?

This paper evaluates 179 classifiers on 121 data sets, so here is a short pointer to it:

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

https://jmlr.org/papers/v15/delgado14a.html

The whole data set and partitions are available from: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz

You can see the table with the complete results: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/results.txt

I hope it will be helpful for statistics and machine learning aspirants!

Thank you!

What is the best way to know which machine learning algorithm has a better probability to accurately or more precisely classify a dataset, before applying it?

These basic questions should help:

What are the applications of probability theory in statistics, machine learning, artificial intelligence, economics, commerce, and business intelligence?

Which language should I start to learn machine learning for the first time? Python, R, or Julia? And why?

What are the fundamental mathematical requirements for understanding machine learning, as a novice/hobbyist dev?

What is the best image segmentation pre-trained method publicly available? Currently using DeepLab demo (link attached) but I’m looking for a better version of this.

Suggest some really good ML projects? Which can be worth adding to CV or resume?

Is it possible to finish Andrew Ng’s course on ML in 15 days?

I’ve reached a dead end with my algorithm for Exact Three Cover, and it’s supposedly trash. What makes it trash?

How do I get started with learning Machine Learning by myself?

What are the differences between the Bayesian and Frequentist methods within machine learning?

Why is the Gaussian distribution widely used in practice?

What is the Confusion Matrix in Machine Learning?- Simplest Explanation!

Is Julia’s syntax even more intuitive than Python’s?

Andrew Ng: What is the Future of Deep Reinforcement Learning (DL + RL)?

How is TensorFlow/Keras capable of computing seemingly non-smooth loss functions such as max?

What is the first thing you do when looking at a new data set?

How do I get data science, computer vision, and machine learning tutorial apps?

Announcing our new Professional Machine Learning Engineer certification…

What is the best machine learning paper you have read in 2018?

Does every paper in machine learning introduce a new algorithm?

Predicting Credit Card Approvals using ML Techniques

Android Document Scanner with offline OCR application

What are the recommended data pre-processing methods for tree-based machine learning models?

What are some interesting statistics about the growing rate of the Julia programming language?

Is it possible to learn machine learning without prior knowledge in any coding language?

What skills does a data scientist need to learn in order to put machine learning models in production?

At a high level, these skills are a combination of software and data engineering.

The people best suited to this job are a data engineer and/or a machine learning engineer.

That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:

Well-structured code: it doesn’t need to be perfect, but it should at least be understandable and updatable by other team members. Avoid spaghetti code[1] like the plague.
Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
Model versioning: add a hash key to your different models. You will thank me later.
Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, features used, CV scores, and so on). You will thank me later, again.
Monitor performance: track the execution time and statistical scores of your models.
Data and model management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4]. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read: often).
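
A minimal sketch of the logging, versioning and metadata points, using only the Python standard library. The "model" here is a stand-in dictionary, and the hyperparameters, feature names and CV scores are made up. In practice the metadata record would be stored next to the model artifact.

```python
import hashlib
import json
import logging
import pickle
import time

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("training")


def model_hash(model_bytes):
    """Short content hash used as the model version key."""
    return hashlib.sha256(model_bytes).hexdigest()[:12]


def build_metadata(model_bytes, hyperparameters, features, cv_scores, started_at):
    """Collect everything worth knowing about one training run."""
    return {
        "model_version": model_hash(model_bytes),
        "hyperparameters": hyperparameters,
        "features": features,
        "cv_scores": cv_scores,
        "cv_mean": sum(cv_scores) / len(cv_scores),
        "training_seconds": round(time.time() - started_at, 3),
    }


started = time.time()
model = {"weights": [0.4, -1.2, 0.7]}  # stand-in for a trained model object
model_bytes = pickle.dumps(model)

metadata = build_metadata(
    model_bytes,
    hyperparameters={"max_depth": 4, "learning_rate": 0.1},  # made-up values
    features=["age", "income", "tenure"],                    # made-up names
    cv_scores=[0.81, 0.83, 0.82],
    started_at=started,
)
# Use the logging module instead of print, as recommended above.
logger.info("trained model %s", json.dumps(metadata, sort_keys=True))
```

Because the version key is derived from the serialized model, two identical models get the same key and any change to the weights produces a new one.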

What are some mistakes data scientists make when building machine learning models?

Some of the mistakes that can occur while building a machine learning model (that I can think of) are listed here:

Not understanding the structure of the dataset
Not taking proper care during feature selection
Leaving out categorical features and considering only numerical variables
Falling into the dummy variable trap
Selecting an inefficient machine learning algorithm
Not trying out various ML algorithms suited to the structure of the data
Improper tuning of model parameters
Most importantly: building a naive model that ignores class imbalance. Suppose we have a classification problem where 99% of samples fall into class1 and the remaining 1% into class2. The trained model may learn a mapping function that predicts class1 for every input. One might say the model has 99% accuracy, but in reality it never detects the 1% class2 cases. So class imbalance must be taken into consideration when evaluating the model.
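
The last point is easy to reproduce. The sketch below uses made-up labels with a 99%/1% split and a "model" that always predicts class1, then compares accuracy with recall on the rare class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of true `positive` samples that the model actually found."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos


# Made-up data: 990 samples of class1 and 10 of class2.
y_true = ["class1"] * 990 + ["class2"] * 10
# A degenerate model that predicts class1 for every input.
y_pred = ["class1"] * len(y_true)

acc = accuracy(y_true, y_pred)
rare_recall = recall(y_true, y_pred, positive="class2")
print(f"accuracy={acc:.2f}, class2 recall={rare_recall:.2f}")
```

Accuracy looks excellent while recall on class2 is zero, which is why metrics such as recall, precision or AUC are worth reporting alongside accuracy on imbalanced data.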

What is the difference between data analytics and data mining?

That’s just the surface-level comparison though. The image above gives an overview of how the two differ.

What is the life cycle of a data science project?

Thus, the data science life-cycle can include the following steps:

Business requirement understanding.
Data collection.
Data cleaning.
Data analysis.
Modeling.
Performance evaluation.
Communicating with stakeholders.
Deployment.
Real-world testing.
Business buy-in.
Support and maintenance.

Looks neat, but here is a diagram of how it actually happens in practice:

iOS: https://apps.apple.com/ca/app/aws-machine-learning-prep-pro/id1611045854

Windows: https://www.microsoft.com/en-ca/p/aws-machine-learning-mls-c01-specialty-certification-exam-prep/9n8rl80hvm4t

Android/Amazon: https://www.amazon.com/gp/product/B09TZ4H8V6

Quizzes, Practice Exams: Modeling, Data Engineering, Vision, Exploratory Data Analysis, ML Ops, Cheat Sheets, ML Jobs Interview Q&A

Use this App to learn about Machine Learning on AWS and prepare for the AWS Machine Learning Specialty Certification MLS-C01.

Earning AWS Certified Machine Learning Specialty validates expertise in building, training, tuning, and deploying machine learning (ML) models on AWS.

The App provides hundreds of quizzes and practice exam about:

– Machine Learning Operation on AWS

– Modelling

– Data Engineering

– Computer Vision,

– Exploratory Data Analysis,

– ML implementation & Operations

– Machine Learning Basics Questions and Answers

– Machine Learning Advanced Questions and Answers

– Scorecard

– Countdown timer

– Machine Learning Cheat Sheets

– Machine Learning Interview Questions and Answers

– Machine Learning Latest News

Domain 1: Data Engineering

Create data repositories for machine learning.

Identify data sources (e.g., content and location, primary sources such as user data)

Determine storage mediums (e.g., DB, Data Lake, S3, EFS, EBS)

Identify and implement a data ingestion solution.

Data job styles/types (batch load, streaming)

Data ingestion pipelines (Batch-based ML workloads and streaming-based ML workloads), etc.

Domain 2: Exploratory Data Analysis

Sanitize and prepare data for modeling.

Perform feature engineering.

Analyze and visualize data for machine learning.

Domain 3: Modeling

Frame business problems as machine learning problems.

Select the appropriate model(s) for a given machine learning problem.

Train machine learning models.

Perform hyperparameter optimization.

Evaluate machine learning models.

Domain 4: Machine Learning Implementation and Operations

Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.

Recommend and implement the appropriate machine learning services and features for a given problem.

Apply basic AWS security practices to machine learning solutions.

Deploy and operationalize machine learning solutions.

Amazon Comprehend

AWS Deep Learning AMIs (DLAMI)

AWS DeepLens

Amazon Forecast

Amazon Fraud Detector

Amazon Lex

Amazon Polly

Amazon Rekognition

Amazon SageMaker

Amazon Textract

Amazon Transcribe

Amazon Translate

Other Services and topics covered are:

Ingestion/Collection

Processing/ETL

Data analysis/visualization

Model training

Model deployment/inference

Operational

AWS ML application services

Language relevant to ML (for example, Python, Java, Scala, R, SQL)

Notebooks and integrated development environments (IDEs),

S3, SageMaker, Kinesis, Lake Formation, Athena, Kibana, Redshift, Textract, EMR, Glue, SageMaker, CSV, JSON, IMG, parquet or databases, Amazon Athena

Amazon EC2, Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Container Service, Amazon Elastic Kubernetes Service , Amazon Redshift

Sagemaker API Explained:

AWS Certified Machine Learning Engineer Specialty Questions and Answers:

Question1: An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, and do so with minimal changes to the existing code.

Answer1: Use the Build Your Own Container (BYOC) Amazon Sagemaker option.
Create a new docker container with the existing code. Register the container in Amazon Elastic Container registry. with the existing code. Register the container in Amazon Elastic Container Registry. Finally run the training and inference jobs using this container.

Question2: Which feature of Amazon SageMaker can you use for preprocessing the data?

Amazon SageMaker training jobs
Amazon SageMaker hyperaparameter tuning
Amazon SageMaker notebook instances
Amazon SageMaker endpoints

Answer2: Amazon Sagemaker Notebook instances

Question3: What setting, when creating an Amazon SageMaker notebook instance, can you use to install libraries and import data?

Answer3: LifeCycle Configuration

Question4: How to Choose the right Sagemaker built-in algorithm?

This is a general guide for choosing which algorithm to use depending on what business problem you have and what data you have.

Top

Top 10 Google Professional Machine Learning Engineer Sample Questions

A. Use K-fold cross validation to understand how the model performs on different test datasets.

B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.

C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.

D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.

Answer 1)

Notes 1)

B is correct because it identifies the pixel of the input image that leads to the classification of the image itself.

A. Train the model for a few iterations, and check for NaN values.

B. Train the model for a few iterations, and verify that the loss is constant.

C. Train a simple linear model, and determine if the DNN model outperforms it.

D. Train the model with no regularization, and verify that the loss function is close to zero.

Answer 2)

Notes 2)

D is correct because the test can check that the model has enough parameters to memorize the task.

[appbox appstore 1560083470-iphone screenshots]
[appbox googleplay com.awssolutionarchitectassociateexampreppro.app]

A. Default Strategy; Custom tier with a single master node and four v100 GPUs.

B. One Device Strategy; Custom tier with a single master node and four v100 GPUs.

C. One Device Strategy; Custom tier with a single master node and eight v100 GPUs.

D. Central Storage Strategy; Custom tier with a single master node and four v100 GPUs.

Answer 3)

Notes 3)

D is correct because this is the only strategy that can perform distributed training; albeit there is only a single copy of the variables on the CPU host.

A. Deploy model in test environment -> Validate model -> Create a new AI Platform model version

B. Validate model -> Deploy model in test environment -> Create a new AI Platform model version

C. Create a new AI Platform model version -> Validate model -> Deploy model in test environment

D. Create a new AI Platform model version – > Deploy model in test environment -> Validate model

Answer 4)

Notes 4)

A is correct because the model can be validated after it is deployed to the test environment, and the release version is established before the model is deployed in production.

A. Calculate the Area Under the Curve (AUC) value.

B. Calculate the number of true positive results predicted by the model.

C. Calculate the fraction of images predicted by the model to have a visible defect.

D. Calculate the Cosine Similarity to compare the model’s performance on the test dataset to the model’s performance on the training dataset.

Answer 5)

Notes 5)

A. Data preparation: Daily max value feature engineering with DataPrep; Model training: AutoML classification with BQML

B. Data preparation: Daily min value feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True

C. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to False

D. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True

Answer 6)

Notes 6)

D is correct because it uses the rolling average of the sensor data and balances the weights using the BQML auto class weight balance parameter.

A. Pub/Sub, Cloud Function, Cloud Vision API

B. Pub/Sub, Cloud IoT, Dataflow, Cloud Vision API, Cloud Logging

C. Pub/Sub, Cloud Function, Video Intelligence API, Cloud Logging

D. Pub/Sub, Cloud Function, AutoML Video Intelligence, Cloud Logging

Answer 7)

Notes 7)

C is correct as Video Intelligence API can find inappropriate components and other components satisfy the requirements of real-time processing and notification.

A. Load data into BigQuery and use the ARIMA model type on BigQuery ML.

B. Convert the data into CSV format and create a regression model on AutoML Tables.

C. Convert the data into TFRecords and create an RNN model on TensorFlow on AI Platform Notebooks.

D. Convert and refactor the data into CSV format and use the built-in XGBoost algorithm on AI Platform Training.

Answer 8)

Notes 8)

A. Use Google Cloud’s Data Labelling Service to label your data. Use AutoML Object Detection to train and deploy the model.

B. Use Vision API to detect and identify logos in pictures and use it as a label. Use AI Platform to build and train a convolutional neural network.

C. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a convolutional neural network.

D. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a real time object detection model.

Answer 9)

Notes 9)

A is correct as this will allow you to easily create a request for a labelling task and deploy a high-performance model.

A. Automate a blend of the shortest and longest intents to be representative of all intents.

B. Automate the more complicated requests first because those require more of the agents’ time.

C. Automate the 10 intents that cover 70% of the requests so that live agents can handle the more complicated requests.

D. Automate intents in places where common words such as “payment” only appear once to avoid confusing the software.

Answer 10)

Notes 10)

[appbox appstore 1611045854-iphone screenshots]

[appbox microsoftstore 9n8rl80hvm4t-mobile screenshots]

Machine Learning Q&A Part I:

The Complete Python Course for Machine Learning Engineers

Google.

Azure and AWS are second class citizens in this area.

Sure, AWS has 70% of the market.

Sure, Azure is the easiest turn key and super user friendly.

But, the king of machine learning in the cloud is GCP.

GCP = Google Cloud Platform

Google has the largest data science team in the world, not mention they have Hinton.

Let’s forgot for a minute they created TensorFlow and give it away.

Let’s just talk about building a real world model with data that doesn’t fit into a excel spreadsheet.

The vast majority of applied machine learning is supervised and that means we need data.

Not just normal data, we need very clean highly structured data.

Where’s the easiest place in the world to upload and model a petabyte of structured data? BigQuery, of course.

Then, you’ll need a front end. Cloud Datalab is a Jupyter notebook, which is good because I neither want nor need anything else.

Then, with a single line of code I connect my Datalab (Jupyter) notebook to my data in BigQuery and build away.

I’ve worked in all three and the only thing I care about is getting to my job the fastest and right now that means I build my models in GCP.

If you’re new to machine learning, don’t start in GCP or with any cloud vendor for that matter. Start by learning Python from the comfort of your laptop.

The course below is free to the first 20.

What is a list of machine learning classification techniques?

The paper below evaluates 179 classifiers on 121 data sets, so I am sharing a short pointer to it:

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

https://jmlr.org/papers/v15/delgado14a.html

The whole data set and partitions are available from: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz

You can see the table with the complete results: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/results.txt

I hope it will be helpful for statistics and machine learning aspirants!

Thank you!

What is the best way to know which machine learning algorithm has a better probability to accurately or more precisely classify a dataset, before applying it?

These basic questions should help:

What are the applications of probability theory in statistics, machine learning, artificial intelligence, economics, commerce, and business intelligence?

Which language should I start to learn machine learning for the first time? Python, R, or Julia? And why?

What are the fundamental mathematical requirements for understanding machine learning, as a novice/hobbyist dev?

What is the best image segmentation pre-trained method publicly available? Currently using DeepLab demo (link attached) but I’m looking for a better version of this.

Suggest some really good ML projects? Which can be worth adding to CV or resume?

Is it possible to finish Andrew Ng’s course on ML in 15 days?

I’ve reached a dead end with my algorithm for Exact Three Cover, and it’s supposedly trash. What makes it trash?

How do I get started with learning Machine Learning by myself?

What are the differences between the Bayesian and Frequentist methods within machine learning?

Why is the Gaussian distribution widely used in practice?

What is the Confusion Matrix in Machine Learning?- Simplest Explanation!

Is Julia’s syntax even more intuitive than Python’s?

Andrew Ng: What is the Future of Deep Reinforcement Learning (DL + RL)?

How is TensorFlow/Keras capable of computing seemingly non-smooth loss functions such as max?

What is the first thing you do when looking at a new data set?

How do I get data science, computer vision, and machine learning tutorial apps?

Announcing our new Professional Machine Learning Engineer certification…

What is the best machine learning paper you have read in 2018?

Does every paper in machine learning introduce a new algorithm?

Predicting Credit Card Approvals using ML Techniques

Android Document Scanner with offline OCR application

What are the recommended data pre-processing methods for tree-based machine learning models?

What are some interesting statistics about the growing rate of the Julia programming language?

Is it possible to learn machine learning without prior knowledge in any coding language?

What skills does a data scientist need to learn in order to put machine learning models in production?

At a high level, these skills are a combination of software and data engineering.

The people best suited to this job are a data engineer and/or a machine learning engineer.

That being said, if you work at a startup or a small company and need to put the models into production yourself, here are the top skills you need to acquire:

Well-structured code: it doesn’t need to be perfect, but it should at least be understandable and updatable by other team members. Avoid spaghetti code[1] like the plague.
Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
Model versioning: add a hash key to each of your models. You will thank me later.
Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, features used, CV scores, and so on). You will thank me later, again.
Monitor performance: track the execution time and statistical scores of your models.
Data and model management: store the necessary data and models somewhere that is available to everyone (S3[3], for example). Avoid uploading these to your VCS[4]. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read: often).
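
To make the logging, versioning and metadata points concrete, here is a minimal sketch using only the Python standard library. The function names (`model_hash`, `build_metadata`), the metadata field names and the toy "model" are hypothetical and only illustrate the idea. Hashing the pickled bytes is one possible way to derive a version key, assuming the serialized form is stable in your environment.

```python
import hashlib
import json
import logging
import pickle
import time

# Configure logging once, instead of scattering print statements.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("training")


def model_hash(model_bytes: bytes) -> str:
    """Return a short, deterministic version key for a serialized model."""
    return hashlib.sha256(model_bytes).hexdigest()[:12]


def build_metadata(model, hyperparameters, features, cv_scores, started_at):
    """Collect experiment details worth saving next to the model."""
    # Assumption: the pickled bytes are stable enough to act as a version key.
    model_bytes = pickle.dumps(model)
    return {
        "model_version": model_hash(model_bytes),
        "hyperparameters": hyperparameters,
        "features": features,
        "cv_scores": cv_scores,
        "running_time_s": round(time.time() - started_at, 3),
    }


if __name__ == "__main__":
    start = time.time()
    # Hypothetical stand-in for a trained model object.
    toy_model = {"weights": [0.4, -1.2], "bias": 0.1}
    metadata = build_metadata(
        toy_model,
        hyperparameters={"learning_rate": 0.01},
        features=["age", "income"],
        cv_scores=[0.81, 0.79, 0.83],
        started_at=start,
    )
    logger.info("trained model version %s", metadata["model_version"])
    # In practice this JSON would be written next to the model in shared storage.
    print(json.dumps(metadata, indent=2))
```

The same metadata dictionary could be uploaded alongside the model file to shared storage, so that every stored model carries its own version key and experiment record.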

What are some mistakes data scientists make when building machine learning models?

Some of the mistakes that can occur while building a machine learning model (that I can think of) are listed here:

Not understanding the structure of the dataset
Not taking proper care during feature selection
Leaving out categorical features and considering only numerical variables
Falling into the dummy variable trap
Selecting an inefficient machine learning algorithm
Not trying out various ML algorithms suited to the structure of the data
Improper tuning of model parameters
Most importantly: building a naive model that ignores class imbalance. Suppose we have a classification problem in which 99% of the examples fall into class1 and the remaining 1% into class2. The trained model may learn a mapping function that predicts class1 for every input. One might then say the model has 99% accuracy, but in reality it never identifies the 1% of class2 cases. So class imbalance must be taken into consideration.
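
Two of the mistakes above are easy to see in a few lines of code. The sketch below, in plain Python, first reproduces the 99%/1% example: a model that always predicts the majority class reaches 99% accuracy while its recall on class2 is zero. It then shows one common way to avoid the dummy variable trap, which is to drop one category when one-hot encoding so that the dummy columns no longer sum to one for every row. The helper names (`accuracy`, `recall`, `one_hot`) and the colour data are made up for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of actual `positive` cases that were predicted as `positive`."""
    preds_for_positive = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positive) / len(preds_for_positive)


def one_hot(values, drop_first=True):
    """One-hot encode a categorical column as rows of 0/1 integers.

    With drop_first=True the first category (in sorted order) becomes the
    baseline and gets no column, which avoids the dummy variable trap.
    """
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows


if __name__ == "__main__":
    # 99 examples of class1 and a single example of class2.
    y_true = ["class1"] * 99 + ["class2"]
    majority = Counter(y_true).most_common(1)[0][0]
    y_pred = [majority] * len(y_true)  # always predict the majority class

    print("accuracy:", accuracy(y_true, y_pred))
    print("recall on class2:", recall(y_true, y_pred, "class2"))

    columns, rows = one_hot(["red", "green", "blue", "green"])
    print(columns, rows)
```

Reporting per-class recall (or a full confusion matrix) next to accuracy is a simple guard against the first problem; the right metric still depends on the cost of missing the rare class.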

What is the difference between data analytics and data mining?

That’s just the surface-level comparison though. The image above gives an overview of how the two differ.

What is the life cycle of a data science project?

Thus, the data science life-cycle can include the following steps:

Business requirement understanding.
Data collection.
Data cleaning.
Data analysis.
Modeling.
Performance evaluation.
Communicating with stakeholders.
Deployment.
Real-world testing.
Business buy-in.
Support and maintenance.

Looks neat, but here is the scheme to visualize how it is happening in reality:

Machine Learning Q&A -Part II:
