DjamgaMind: Audio Intelligence for the C-Suite (Daily AI News, Energy, Healthcare, Finance)
Full-Stack AI Intelligence. Zero Noise.The definitive audio briefing for the C-Suite and AI Architects. From Daily News and Strategic Deep Dives to high-density Industrial & Regulatory Intelligence—decoded at the speed of the AI era. . 👉 Start your specialized audio briefing today at Djamgamind.com
AI Jobs and Career
I wanted to share an exciting opportunity for those of you looking to advance your careers in the AI space. You know how rapidly the landscape is evolving, and finding the right fit can be a challenge. That's why I'm excited about Mercor – they're a platform specifically designed to connect top-tier AI talent with leading companies. Whether you're a data scientist, machine learning engineer, or something else entirely, Mercor can help you find your next big role. If you're ready to take the next step in your AI career, check them out through my referral link: https://work.mercor.com/?referralCode=82d5f4e3-e1a3-4064-963f-c197bb2c8db1. It's a fantastic resource, and I encourage you to explore the opportunities they have available.
- Full Stack Engineer [$150K-$220K]
- Software Engineer, Tooling & AI Workflow, Contract [$90/hour]
- DevOps Engineer, India, Contract [$90/hour]
- More AI Jobs Opportunitieshere
| Job Title | Status | Pay |
|---|---|---|
| Full-Stack Engineer | Strong match, Full-time | $150K - $220K / year |
| Developer Experience and Productivity Engineer | Pre-qualified, Full-time | $160K - $300K / year |
| Software Engineer - Tooling & AI Workflows (Contract) | Contract | $90 / hour |
| DevOps Engineer (India) | Full-time | $20K - $50K / year |
| Senior Full-Stack Engineer | Full-time | $2.8K - $4K / week |
| Enterprise IT & Cloud Domain Expert - India | Contract | $20 - $30 / hour |
| Senior Software Engineer | Contract | $100 - $200 / hour |
| Senior Software Engineer | Pre-qualified, Full-time | $150K - $300K / year |
| Senior Full-Stack Engineer: Latin America | Full-time | $1.6K - $2.1K / week |
| Software Engineering Expert | Contract | $50 - $150 / hour |
| Generalist Video Annotators | Contract | $45 / hour |
| Generalist Writing Expert | Contract | $45 / hour |
| Editors, Fact Checkers, & Data Quality Reviewers | Contract | $50 - $60 / hour |
| Multilingual Expert | Contract | $54 / hour |
| Mathematics Expert (PhD) | Contract | $60 - $80 / hour |
| Software Engineer - India | Contract | $20 - $45 / hour |
| Physics Expert (PhD) | Contract | $60 - $80 / hour |
| Finance Expert | Contract | $150 / hour |
| Designers | Contract | $50 - $70 / hour |
| Chemistry Expert (PhD) | Contract | $60 - $80 / hour |
What are the Top 200 AWS and Google Certified Machine Learning Specialty Questions and Answers Dumps?
This blog is the best way is the best way to prepare for your upcoming AWS Certified Machine Learning Specialty and Google Certified Professional Machine Learning Engineer exam. With over 100 questions and answers, this blog provides quizzes similar that are very similar to the real exam. It also includes the option to show and hide answers. Additionally, there are machine learning interview questions and detailed answers, as well as cheat sheets and illustrations. This blog is the best way to make sure you are well-prepared for your AWS Certified Machine Learning Specialty Exam.

The typical Google Machine Learning Engineer salary is $147,218. Machine Learning Engineer salaries at Google can range from $110,000 – $152,183.
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
- By the end of 2020, 85% of customer interactions will be handled without a human (Call Center, Chatbot, etc…)
- 61% of marketers say artificial intelligence is the most important aspect of their data strategy.
- 80% of business and tech leaders say AI already boosts productivity (Robotic Process Automation, Power Automate, etc..)
- Current AI technology can boost business productivity by up to 40%
AWS Machine Learning Certification Specialty Exam Prep for iOs Android Windows10/11

GCP Professional Machine Learning Engineer for iOs, Android, Windows 10/11
Quizzes, Practice Exams: Framing, Architecting, Designing, Developing ML Problems & Solutions, ML Jobs Interview Q&A

Azure AI Fundamentals AI-900 Exam Prep App for iOS, Android, Windows10/11
Basics and Advanced Machine Learning Quizzes on Azure, Azure Machine Learning Job Interviews Questions and Answer, ML Cheat Sheets

Machine Learning For Dummies App for iOs, Android, Windows10/11
Use this App to learn about Machine Learning and Elevate your Brain with Machine Learning Quizzes, Cheat Sheets, Ml Jobs Interview Questions and Answers updated daily.
AI-Powered Professional Certification Quiz Platform
Web|iOs|Android|Windows
Are you passionate about AI and looking for your next career challenge? In the fast-evolving world of artificial intelligence, connecting with the right opportunities can make all the difference. We're excited to recommend Mercor, a premier platform dedicated to bridging the gap between exceptional AI professionals and innovative companies.
Whether you're seeking roles in machine learning, data science, or other cutting-edge AI fields, Mercor offers a streamlined path to your ideal position. Explore the possibilities and accelerate your AI career by visiting Mercor through our exclusive referral link:
Find Your AI Dream Job on Mercor
Your next big opportunity in AI could be just a click away!

What does a Professional Machine Learning Engineer do?
A Professional Machine Learning Engineer designs, builds, and productionizes ML models to solve business challenges using Google Cloud technologies and knowledge of proven ML models and techniques. The ML Engineer collaborates closely with other job roles to ensure long-term success of models. The ML Engineer should be proficient in all aspects of model architecture, data pipeline interaction, and metrics interpretation. The ML Engineer needs familiarity with application development, infrastructure management, data engineering, and security. Through an understanding of training, retraining, deploying, scheduling, monitoring, and improving models, they design and create scalable solutions for optimal performance.
The AWS Certified Machine Learning – Specialty certification is intended for individuals who perform a development or data science role. It validates a candidate’s ability to design, implement, deploy, and maintain machine learning (ML) solutions for given business problems.
This blog covers Machine Learning 101, Top 20 AWS Certified Machine Learning Specialty Questions and Answers, Top 20 Google Professional Machine Learning Engineer Sample Questions, Machine Learning Quizzes, Machine Learning Q&A, Top 10 Machine Learning Algorithms, Machine Learning Latest Hot News, Machine Learning Demos (Ex: Tensorflow Demos)
Below are the Top 100 AWS Certified Machine Learning Specialty Questions and Answers Dumps.
AI Jobs and Career
And before we wrap up today's AI news, I wanted to share an exciting opportunity for those of you looking to advance your careers in the AI space. You know how rapidly the landscape is evolving, and finding the right fit can be a challenge. That's why I'm excited about Mercor – they're a platform specifically designed to connect top-tier AI talent with leading companies. Whether you're a data scientist, machine learning engineer, or something else entirely, Mercor can help you find your next big role. If you're ready to take the next step in your AI career, check them out through my referral link: https://work.mercor.com/?referralCode=82d5f4e3-e1a3-4064-963f-c197bb2c8db1. It's a fantastic resource, and I encourage you to explore the opportunities they have available.
Question1: A machine learning team has several large CSV datasets in Amazon S3. Historically, models built with the Amazon SageMaker Linear Learner algorithm have taken hours to train on similar-sized datasets. The team’s leaders need to accelerate the training process. What can a machine learning specialist do to address this concern?
Invest in your future today by enrolling in this Azure Fundamentals - Pass the Azure Fundamentals Exam with Ease: Master the AZ-900 Certification with the Comprehensive Exam Preparation Guide!
- AWS Certified AI Practitioner (AIF-C01): Conquer the AWS Certified AI Practitioner exam with our AI and Machine Learning For Dummies test prep. Master fundamental AI concepts, AWS AI services, and ethical considerations.
- Azure AI Fundamentals: Ace the Azure AI Fundamentals exam with our comprehensive test prep. Learn the basics of AI, Azure AI services, and their applications.
- Google Cloud Professional Machine Learning Engineer: Nail the Google Professional Machine Learning Engineer exam with our expert-designed test prep. Deepen your understanding of ML algorithms, models, and deployment strategies.
- AWS Certified Machine Learning Specialty: Dominate the AWS Certified Machine Learning Specialty exam with our targeted test prep. Master advanced ML techniques, AWS ML services, and practical applications.
- AWS Certified Data Engineer Associate (DEA-C01): Set yourself up for promotion, get a better job or Increase your salary by Acing the AWS DEA-C01 Certification.
A) Use Amazon SageMaker Pipe mode.
B) Use Amazon Machine Learning to train the models.
C) Use Amazon Kinesis to stream the data to Amazon SageMaker.
D) Use AWS Glue to transform the CSV dataset to the JSON format.
ANSWER1:
Notes/Hint1:
Question 2) A local university wants to track cars in a parking lot to determine which students are parking in the lot. The university is wanting to ingest videos of the cars parking in near-real time, use machine learning to identify license plates, and store that data in an AWS data store. Which solution meets these requirements with the LEAST amount of development effort?
A) Use Amazon Kinesis Data Streams to ingest the video in near-real time, use the Kinesis Data Streams consumer integrated with Amazon Rekognition Video to process the license plate information, and then store results in DynamoDB.
B) Use Amazon Kinesis Video Streams to ingest the videos in near-real time, use the Kinesis Video Streams integration with Amazon Rekognition Video to identify the license plate information, and then store the results in DynamoDB.
C) Use Amazon Kinesis Data Streams to ingest videos in near-real time, call Amazon Rekognition to identify license plate information, and then store results in DynamoDB.
D) Use Amazon Kinesis Firehose to ingest the video in near-real time and outputs results onto S3. Set up a Lambda function that triggers when a new video is PUT onto S3 to send results to Amazon Rekognition to identify license plate information, and then store results in DynamoDB.
Answer 2)
Notes/Hint2)
Question 3) A term frequency–inverse document frequency (tf–idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences:
ANSWER3:
Notes/Hint3:
Question 4: A company is setting up a system to manage all of the datasets it stores in Amazon S3. The company would like to automate running transformation jobs on the data and maintaining a catalog of the metadata concerning the datasets. The solution should require the least amount of setup and maintenance. Which solution will allow the company to achieve its goals?
ANSWER4:
Notes/Hint4:
Question 5) Which service in the Kinesis family allows you to easily load streaming data into data stores and analytics tools?
ANSWER5:
Notes/Hint5:
Notes 6)
Notes/Hint 8)
Answer 9)
Notes 9)
Answer 10)
Answer 11)
Notes 11)
Notes 12)
Answer 13)
Notes 13)
Question 14) You have been tasked with capturing two different types of streaming events. The first event type includes mission-critical data that needs to immediately be processed before operations can continue. The second event type includes data of less importance, but operations can continue without immediately processing. What is the most appropriate solution to record these different types of events?
Answer 14)
Notes 14)
Question 15) You are collecting clickstream data from an e-commerce website to make near-real time product suggestions for users actively using the site. Which combination of tools can be used to achieve the quickest recommendations and meets all of the requirements?
Answer 15)
Notes 15)
Question 16) Which service built by AWS makes it easy to set up a retry mechanism, aggregate records to improve throughput, and automatically submits CloudWatch metrics?
Answer 16)
Notes 16)
[appbox appstore 1611045854-iphone screenshots]
[appbox microsoftstore 9n8rl80hvm4t-mobile screenshots]
Question 17) You have been tasked with capturing data from an online gaming platform to run analytics on and process through a machine learning pipeline. The data that you are ingesting is players controller inputs every 1 second (up to 10 players in a game) that is in JSON format. The data needs to be ingested through Kinesis Data Streams and the JSON data blob is 100 KB in size. What is the minimum number of shards you can use to successfully ingest this data?
Answer 17)
Notes 17)
Question 18) Which services in the Kinesis family allows you to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time?
Answer 18)
Notes 18)
Question 19) You are a ML specialist needing to collect data from Twitter tweets. Your goal is to collect tweets that include only the name of your company and the tweet body, and store it off into a data store in AWS. What set of tools can you use to stream, transform, and load the data into AWS with the LEAST amount of effort?
Answer 19)
Notes 19)
Question 20) Which service in the Kinesis family allows you to build custom applications that process or analyze streaming data for specialized needs?
Answer 20)
Notes 20)
Question21:
Answer21:
What are the Top 100 AWS and Google Certified Machine Learning Specialty Questions and Answers Dumps?
This blog is the best way is the best way to prepare for your upcoming AWS Certified Machine Learning Specialty and Google Certified Professional Machine Learning Engineer exam. With over 100 questions and answers, this blog provides quizzes similar that are very similar to the real exam. It also includes the option to show and hide answers. Additionally, there are machine learning interview questions and detailed answers, as well as cheat sheets and illustrations. This blog is the best way to make sure you are well-prepared for your AWS Certified Machine Learning Specialty Exam.
The typical Google Machine Learning Engineer salary is $147,218. Machine Learning Engineer salaries at Google can range from $110,000 – $152,183.
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
- By the end of 2020, 85% of customer interactions will be handled without a human (Call Center, Chatbot, etc…)
- 61% of marketers say artificial intelligence is the most important aspect of their data strategy.
- 80% of business and tech leaders say AI already boosts productivity (Robotic Process Automation, Power Automate, etc..)
- Current AI technology can boost business productivity by up to 40%
AWS Machine Learning Certification Specialty Exam Prep for iOs Android Windows10/11

GCP Professional Machine Learning Engineer for iOs, Android, Windows 10/11
Quizzes, Practice Exams: Framing, Architecting, Designing, Developing ML Problems & Solutions, ML Jobs Interview Q&A

Azure AI Fundamentals AI-900 Exam Prep App for iOS, Android, Windows10/11
Basics and Advanced Machine Learning Quizzes on Azure, Azure Machine Learning Job Interviews Questions and Answer, ML Cheat Sheets

Machine Learning For Dummies App for iOs, Android, Windows10/11
Use this App to learn about Machine Learning and Elevate your Brain with Machine Learning Quizzes, Cheat Sheets, Ml Jobs Interview Questions and Answers updated daily.

What does a Professional Machine Learning Engineer do?
A Professional Machine Learning Engineer designs, builds, and productionizes ML models to solve business challenges using Google Cloud technologies and knowledge of proven ML models and techniques. The ML Engineer collaborates closely with other job roles to ensure long-term success of models. The ML Engineer should be proficient in all aspects of model architecture, data pipeline interaction, and metrics interpretation. The ML Engineer needs familiarity with application development, infrastructure management, data engineering, and security. Through an understanding of training, retraining, deploying, scheduling, monitoring, and improving models, they design and create scalable solutions for optimal performance.
The AWS Certified Machine Learning – Specialty certification is intended for individuals who perform a development or data science role. It validates a candidate’s ability to design, implement, deploy, and maintain machine learning (ML) solutions for given business problems.
This blog covers Machine Learning 101, Top 20 AWS Certified Machine Learning Specialty Questions and Answers, Top 20 Google Professional Machine Learning Engineer Sample Questions, Machine Learning Quizzes, Machine Learning Q&A, Top 10 Machine Learning Algorithms, Machine Learning Latest Hot News, Machine Learning Demos (Ex: Tensorflow Demos)
Question1: A machine learning team has several large CSV datasets in Amazon S3. Historically, models built with the Amazon SageMaker Linear Learner algorithm have taken hours to train on similar-sized datasets. The team’s leaders need to accelerate the training process. What can a machine learning specialist do to address this concern?
A) Use Amazon SageMaker Pipe mode.
B) Use Amazon Machine Learning to train the models.
C) Use Amazon Kinesis to stream the data to Amazon SageMaker.
D) Use AWS Glue to transform the CSV dataset to the JSON format.
ANSWER1:
Notes/Hint1:
Question 2) A local university wants to track cars in a parking lot to determine which students are parking in the lot. The university is wanting to ingest videos of the cars parking in near-real time, use machine learning to identify license plates, and store that data in an AWS data store. Which solution meets these requirements with the LEAST amount of development effort?
A) Use Amazon Kinesis Data Streams to ingest the video in near-real time, use the Kinesis Data Streams consumer integrated with Amazon Rekognition Video to process the license plate information, and then store results in DynamoDB.
B) Use Amazon Kinesis Video Streams to ingest the videos in near-real time, use the Kinesis Video Streams integration with Amazon Rekognition Video to identify the license plate information, and then store the results in DynamoDB.
C) Use Amazon Kinesis Data Streams to ingest videos in near-real time, call Amazon Rekognition to identify license plate information, and then store results in DynamoDB.
D) Use Amazon Kinesis Firehose to ingest the video in near-real time and outputs results onto S3. Set up a Lambda function that triggers when a new video is PUT onto S3 to send results to Amazon Rekognition to identify license plate information, and then store results in DynamoDB.
Answer 2)
Notes/Hint2)
Question 3) A term frequency–inverse document frequency (tf–idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences:
ANSWER3:
Notes/Hint3:
Question 4: A company is setting up a system to manage all of the datasets it stores in Amazon S3. The company would like to automate running transformation jobs on the data and maintaining a catalog of the metadata concerning the datasets. The solution should require the least amount of setup and maintenance. Which solution will allow the company to achieve its goals?
ANSWER4:
Notes/Hint4:
Question 5) Which service in the Kinesis family allows you to easily load streaming data into data stores and analytics tools?
ANSWER5:
Notes/Hint5:
Notes 6)
Notes/Hint 8)
Answer 9)
Notes 9)
Answer 10)
Answer 11)
Notes 11)
Notes 12)
Answer 13)
Notes 13)
Question 14) You have been tasked with capturing two different types of streaming events. The first event type includes mission-critical data that needs to immediately be processed before operations can continue. The second event type includes data of less importance, but operations can continue without immediately processing. What is the most appropriate solution to record these different types of events?
Answer 14)
Notes 14)
Question 15) You are collecting clickstream data from an e-commerce website to make near-real time product suggestions for users actively using the site. Which combination of tools can be used to achieve the quickest recommendations and meets all of the requirements?
Answer 15)
Notes 15)
Question 16) Which service built by AWS makes it easy to set up a retry mechanism, aggregate records to improve throughput, and automatically submits CloudWatch metrics?
Answer 16)
Notes 16)
[appbox appstore 1611045854-iphone screenshots]
[appbox microsoftstore 9n8rl80hvm4t-mobile screenshots]
Question 17) You have been tasked with capturing data from an online gaming platform to run analytics on and process through a machine learning pipeline. The data that you are ingesting is players controller inputs every 1 second (up to 10 players in a game) that is in JSON format. The data needs to be ingested through Kinesis Data Streams and the JSON data blob is 100 KB in size. What is the minimum number of shards you can use to successfully ingest this data?
Answer 17)
Notes 17)
Question 18) Which services in the Kinesis family allows you to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time?
Answer 18)
Notes 18)
Question 19) You are a ML specialist needing to collect data from Twitter tweets. Your goal is to collect tweets that include only the name of your company and the tweet body, and store it off into a data store in AWS. What set of tools can you use to stream, transform, and load the data into AWS with the LEAST amount of effort?
Answer 19)
Notes 19)
Question 20) Which service in the Kinesis family allows you to build custom applications that process or analyze streaming data for specialized needs?
Answer 20)
Notes 20)
Question21:
Answer21:
Notes 21:
Question22:
Answer22:
Notes 22:
Question23:
Answer23:
Notes 23:
Question24:
Answer24:
Notes 24:
What are the Top 100 AWS and Google Certified Machine Learning Specialty Questions and Answers Dumps?
This blog is the best way is the best way to prepare for your upcoming AWS Certified Machine Learning Specialty and Google Certified Professional Machine Learning Engineer exam. With over 100 questions and answers, this blog provides quizzes similar that are very similar to the real exam. It also includes the option to show and hide answers. Additionally, there are machine learning interview questions and detailed answers, as well as cheat sheets and illustrations. This blog is the best way to make sure you are well-prepared for your AWS Certified Machine Learning Specialty Exam.
The typical Google Machine Learning Engineer salary is $147,218. Machine Learning Engineer salaries at Google can range from $110,000 – $152,183.
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
- By the end of 2020, 85% of customer interactions will be handled without a human (Call Center, Chatbot, etc…)
- 61% of marketers say artificial intelligence is the most important aspect of their data strategy.
- 80% of business and tech leaders say AI already boosts productivity (Robotic Process Automation, Power Automate, etc..)
- Current AI technology can boost business productivity by up to 40%
AWS Machine Learning Certification Specialty Exam Prep for iOs Android Windows10/11

GCP Professional Machine Learning Engineer for iOs, Android, Windows 10/11
Quizzes, Practice Exams: Framing, Architecting, Designing, Developing ML Problems & Solutions, ML Jobs Interview Q&A

Azure AI Fundamentals AI-900 Exam Prep App for iOS, Android, Windows10/11
Basics and Advanced Machine Learning Quizzes on Azure, Azure Machine Learning Job Interviews Questions and Answer, ML Cheat Sheets

Machine Learning For Dummies App for iOs, Android, Windows10/11
Use this App to learn about Machine Learning and Elevate your Brain with Machine Learning Quizzes, Cheat Sheets, Ml Jobs Interview Questions and Answers updated daily.

What does a Professional Machine Learning Engineer do?
A Professional Machine Learning Engineer designs, builds, and productionizes ML models to solve business challenges using Google Cloud technologies and knowledge of proven ML models and techniques. The ML Engineer collaborates closely with other job roles to ensure long-term success of models. The ML Engineer should be proficient in all aspects of model architecture, data pipeline interaction, and metrics interpretation. The ML Engineer needs familiarity with application development, infrastructure management, data engineering, and security. Through an understanding of training, retraining, deploying, scheduling, monitoring, and improving models, they design and create scalable solutions for optimal performance.
The AWS Certified Machine Learning – Specialty certification is intended for individuals who perform a development or data science role. It validates a candidate’s ability to design, implement, deploy, and maintain machine learning (ML) solutions for given business problems.
This blog covers Machine Learning 101, Top 20 AWS Certified Machine Learning Specialty Questions and Answers, Top 20 Google Professional Machine Learning Engineer Sample Questions, Machine Learning Quizzes, Machine Learning Q&A, Top 10 Machine Learning Algorithms, Machine Learning Latest Hot News, Machine Learning Demos (Ex: Tensorflow Demos)
Question1: A machine learning team has several large CSV datasets in Amazon S3. Historically, models built with the Amazon SageMaker Linear Learner algorithm have taken hours to train on similar-sized datasets. The team’s leaders need to accelerate the training process. What can a machine learning specialist do to address this concern?
A) Use Amazon SageMaker Pipe mode.
B) Use Amazon Machine Learning to train the models.
C) Use Amazon Kinesis to stream the data to Amazon SageMaker.
D) Use AWS Glue to transform the CSV dataset to the JSON format.
ANSWER1:
Notes/Hint1:
Question 2) A local university wants to track cars in a parking lot to determine which students are parking in the lot. The university is wanting to ingest videos of the cars parking in near-real time, use machine learning to identify license plates, and store that data in an AWS data store. Which solution meets these requirements with the LEAST amount of development effort?
A) Use Amazon Kinesis Data Streams to ingest the video in near-real time, use the Kinesis Data Streams consumer integrated with Amazon Rekognition Video to process the license plate information, and then store results in DynamoDB.
B) Use Amazon Kinesis Video Streams to ingest the videos in near-real time, use the Kinesis Video Streams integration with Amazon Rekognition Video to identify the license plate information, and then store the results in DynamoDB.
C) Use Amazon Kinesis Data Streams to ingest videos in near-real time, call Amazon Rekognition to identify license plate information, and then store results in DynamoDB.
D) Use Amazon Kinesis Firehose to ingest the video in near-real time and outputs results onto S3. Set up a Lambda function that triggers when a new video is PUT onto S3 to send results to Amazon Rekognition to identify license plate information, and then store results in DynamoDB.
Answer 2)
Notes/Hint2)
Question 3) A term frequency–inverse document frequency (tf–idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences:
ANSWER3:
Notes/Hint3:
Question 4: A company is setting up a system to manage all of the datasets it stores in Amazon S3. The company would like to automate running transformation jobs on the data and maintaining a catalog of the metadata concerning the datasets. The solution should require the least amount of setup and maintenance. Which solution will allow the company to achieve its goals?
ANSWER4:
Notes/Hint4:
Question 5) Which service in the Kinesis family allows you to easily load streaming data into data stores and analytics tools?
ANSWER5:
Notes/Hint5:
Notes 6)
Notes/Hint 8)
Answer 9)
Notes 9)
Answer 10)
Answer 11)
Notes 11)
Notes 12)
Answer 13)
Notes 13)
Question 14) You have been tasked with capturing two different types of streaming events. The first event type includes mission-critical data that needs to immediately be processed before operations can continue. The second event type includes data of less importance, but operations can continue without immediately processing. What is the most appropriate solution to record these different types of events?
Answer 14)
Notes 14)
Question 15) You are collecting clickstream data from an e-commerce website to make near-real time product suggestions for users actively using the site. Which combination of tools can be used to achieve the quickest recommendations and meets all of the requirements?
Answer 15)
Notes 15)
Question 16) Which service built by AWS makes it easy to set up a retry mechanism, aggregate records to improve throughput, and automatically submits CloudWatch metrics?
Answer 16)
Notes 16)
[appbox appstore 1611045854-iphone screenshots]
[appbox microsoftstore 9n8rl80hvm4t-mobile screenshots]
Question 17) You have been tasked with capturing data from an online gaming platform to run analytics on and process through a machine learning pipeline. The data that you are ingesting is players controller inputs every 1 second (up to 10 players in a game) that is in JSON format. The data needs to be ingested through Kinesis Data Streams and the JSON data blob is 100 KB in size. What is the minimum number of shards you can use to successfully ingest this data?
Answer 17)
Notes 17)
Question 18) Which services in the Kinesis family allows you to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time?
Answer 18)
Notes 18)
Question 19) You are a ML specialist needing to collect data from Twitter tweets. Your goal is to collect tweets that include only the name of your company and the tweet body, and store it off into a data store in AWS. What set of tools can you use to stream, transform, and load the data into AWS with the LEAST amount of effort?
Answer 19)
Notes 19)
Question 20) Which service in the Kinesis family allows you to build custom applications that process or analyze streaming data for specialized needs?
Answer 20)
Notes 20)
Question21:
Answer21:
What are the Top 100 AWS and Google Certified Machine Learning Specialty Questions and Answers Dumps?
This blog is the best way is the best way to prepare for your upcoming AWS Certified Machine Learning Specialty and Google Certified Professional Machine Learning Engineer exam. With over 100 questions and answers, this blog provides quizzes similar that are very similar to the real exam. It also includes the option to show and hide answers. Additionally, there are machine learning interview questions and detailed answers, as well as cheat sheets and illustrations. This blog is the best way to make sure you are well-prepared for your AWS Certified Machine Learning Specialty Exam.
The typical Google Machine Learning Engineer salary is $147,218. Machine Learning Engineer salaries at Google can range from $110,000 – $152,183.
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
- By the end of 2020, 85% of customer interactions will be handled without a human (Call Center, Chatbot, etc…)
- 61% of marketers say artificial intelligence is the most important aspect of their data strategy.
- 80% of business and tech leaders say AI already boosts productivity (Robotic Process Automation, Power Automate, etc..)
- Current AI technology can boost business productivity by up to 40%
AWS Machine Learning Certification Specialty Exam Prep for iOs Android Windows10/11

GCP Professional Machine Learning Engineer for iOs, Android, Windows 10/11
Quizzes, Practice Exams: Framing, Architecting, Designing, Developing ML Problems & Solutions, ML Jobs Interview Q&A

Azure AI Fundamentals AI-900 Exam Prep App for iOS, Android, Windows10/11
Basics and Advanced Machine Learning Quizzes on Azure, Azure Machine Learning Job Interviews Questions and Answer, ML Cheat Sheets

Machine Learning For Dummies App for iOs, Android, Windows10/11
Use this App to learn about Machine Learning and Elevate your Brain with Machine Learning Quizzes, Cheat Sheets, Ml Jobs Interview Questions and Answers updated daily.

What does a Professional Machine Learning Engineer do?
A Professional Machine Learning Engineer designs, builds, and productionizes ML models to solve business challenges using Google Cloud technologies and knowledge of proven ML models and techniques. The ML Engineer collaborates closely with other job roles to ensure long-term success of models. The ML Engineer should be proficient in all aspects of model architecture, data pipeline interaction, and metrics interpretation. The ML Engineer needs familiarity with application development, infrastructure management, data engineering, and security. Through an understanding of training, retraining, deploying, scheduling, monitoring, and improving models, they design and create scalable solutions for optimal performance.
The AWS Certified Machine Learning – Specialty certification is intended for individuals who perform a development or data science role. It validates a candidate’s ability to design, implement, deploy, and maintain machine learning (ML) solutions for given business problems.
This blog covers Machine Learning 101, Top 20 AWS Certified Machine Learning Specialty Questions and Answers, Top 20 Google Professional Machine Learning Engineer Sample Questions, Machine Learning Quizzes, Machine Learning Q&A, Top 10 Machine Learning Algorithms, Machine Learning Latest Hot News, Machine Learning Demos (Ex: Tensorflow Demos)
Question1: A machine learning team has several large CSV datasets in Amazon S3. Historically, models built with the Amazon SageMaker Linear Learner algorithm have taken hours to train on similar-sized datasets. The team’s leaders need to accelerate the training process. What can a machine learning specialist do to address this concern?
A) Use Amazon SageMaker Pipe mode.
B) Use Amazon Machine Learning to train the models.
C) Use Amazon Kinesis to stream the data to Amazon SageMaker.
D) Use AWS Glue to transform the CSV dataset to the JSON format.
ANSWER1:
Notes/Hint1:
Question 2) A local university wants to track cars in a parking lot to determine which students are parking in the lot. The university is wanting to ingest videos of the cars parking in near-real time, use machine learning to identify license plates, and store that data in an AWS data store. Which solution meets these requirements with the LEAST amount of development effort?
A) Use Amazon Kinesis Data Streams to ingest the video in near-real time, use the Kinesis Data Streams consumer integrated with Amazon Rekognition Video to process the license plate information, and then store results in DynamoDB.
B) Use Amazon Kinesis Video Streams to ingest the videos in near-real time, use the Kinesis Video Streams integration with Amazon Rekognition Video to identify the license plate information, and then store the results in DynamoDB.
C) Use Amazon Kinesis Data Streams to ingest videos in near-real time, call Amazon Rekognition to identify license plate information, and then store results in DynamoDB.
D) Use Amazon Kinesis Firehose to ingest the video in near-real time and outputs results onto S3. Set up a Lambda function that triggers when a new video is PUT onto S3 to send results to Amazon Rekognition to identify license plate information, and then store results in DynamoDB.
Answer 2)
Notes/Hint2)
Question 3) A term frequency–inverse document frequency (tf–idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences:
ANSWER3:
Notes/Hint3:
Question 4: A company is setting up a system to manage all of the datasets it stores in Amazon S3. The company would like to automate running transformation jobs on the data and maintaining a catalog of the metadata concerning the datasets. The solution should require the least amount of setup and maintenance. Which solution will allow the company to achieve its goals?
ANSWER4:
Notes/Hint4:
Question 5) Which service in the Kinesis family allows you to easily load streaming data into data stores and analytics tools?
ANSWER5:
Notes/Hint5:
Question 6) A data scientist is working on optimizing a model during the training process by varying multiple parameters. The data scientist observes that, during multiple runs with identical parameters, the loss function converges to different, yet stable, values. What should the data scientist do to improve the training process?
Notes 6)
Question 7) Your organization has a standalone Javascript (Node.js) application that streams data into AWS using Kinesis Data Streams. You notice that they are using the Kinesis API (AWS SDK) over the Kinesis Producer Library (KPL). What might be the reasoning behind this?
Question 8) A data scientist is evaluating different binary classification models. A false positive result is 5 times more expensive (from a business perspective) than a false negative result. The models should be evaluated based on the following criteria:
Notes/Hint 8)
Question 9) A data scientist uses logistic regression to build a fraud detection model. While the model accuracy is 99%, 90% of the fraud cases are not detected by the model. What action will definitely help the model detect more than 10% of fraud cases?
Answer 9)
Notes 9)
Question 10) A company is interested in building a fraud detection model. Currently, the data scientist does not have a sufficient amount of information due to the low number of fraud cases. Which method is MOST likely to detect the GREATEST number of valid fraud cases?
Answer 10)
Question 11) A machine learning engineer is preparing a data frame for a supervised learning task with the Amazon SageMaker Linear Learner algorithm. The ML engineer notices the target label classes are highly imbalanced and multiple feature columns contain missing values. The proportion of missing values across the entire data frame is less than 5%. What should the ML engineer do to minimize bias due to missing values?
Answer 11)
Notes 11)
Question 12) A company has collected customer comments on its products, rating them as safe or unsafe, using decision trees. The training dataset has the following features: id, date, full review, full review summary, and a binary safe/unsafe tag. During training, any data sample with missing features was dropped. In a few instances, the test set was found to be missing the full review text field. For this use case, which is the most effective course of action to address test data samples with missing features?
Notes 12)
Question 13) An insurance company needs to automate claim compliance reviews because human reviews are expensive and error-prone. The company has a large set of claims and a compliance label for each. Each claim consists of a few sentences in English, many of which contain complex related information. Management would like to use Amazon SageMaker built-in algorithms to design a machine learning supervised model that can be trained to read each claim and predict if the claim is compliant or not. Which approach should be used to extract features from the claims to be used as inputs for the downstream supervised task?
Answer 13)
Notes 13)
Question 14) You have been tasked with capturing two different types of streaming events. The first event type includes mission-critical data that needs to immediately be processed before operations can continue. The second event type includes data of less importance, but operations can continue without immediately processing. What is the most appropriate solution to record these different types of events?
Answer 14)
Notes 14)
Question 15) You are collecting clickstream data from an e-commerce website to make near-real time product suggestions for users actively using the site. Which combination of tools can be used to achieve the quickest recommendations and meets all of the requirements?
Answer 15)
Notes 15)
Question 16) Which service built by AWS makes it easy to set up a retry mechanism, aggregate records to improve throughput, and automatically submits CloudWatch metrics?
Answer 16)
Notes 16)
[appbox appstore 1611045854-iphone screenshots]
[appbox microsoftstore 9n8rl80hvm4t-mobile screenshots]
Question 17) You have been tasked with capturing data from an online gaming platform to run analytics on and process through a machine learning pipeline. The data that you are ingesting is players controller inputs every 1 second (up to 10 players in a game) that is in JSON format. The data needs to be ingested through Kinesis Data Streams and the JSON data blob is 100 KB in size. What is the minimum number of shards you can use to successfully ingest this data?
Answer 17)
Notes 17)
Question 18) Which services in the Kinesis family allows you to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time?
Answer 18)
Notes 18)
Question 19) You are a ML specialist needing to collect data from Twitter tweets. Your goal is to collect tweets that include only the name of your company and the tweet body, and store it off into a data store in AWS. What set of tools can you use to stream, transform, and load the data into AWS with the LEAST amount of effort?
Answer 19)
Notes 19)
Question 20) Which service in the Kinesis family allows you to build custom applications that process or analyze streaming data for specialized needs?
Answer 20)
Notes 20)
Question21: Of the following, which is an example of machine learning? (Select TWO.)
A) Calculating the shortest route from current location to the destination
B) Optimizing product pricing based on real-time sales data
C) Sentiment analysis of text on product reviews
D) A loan approval system that classifies applicants entirely based on credit score
Answer21:
Notes 21:
Question22:Which of the following is an appropriate use case for unsupervised learning?
A) Partitioning an image of a street scene into multiple segments
B) Finding an optimal path out of a maze
C) Identifying clusters of housing sales based on related data points
D) Analyzing sentiment of social media posts
Answer22:
Notes 22:
Question23:
Answer23:
Notes 23:
Question24: A Djamgatech retail company wants to deploy a machine learning model to predict the demand for a product using sales data from the past 5 years. What is the MOST efficient solution that the company should implement first?
A) Regression
B) Multi-class classification
C) Binary class classification
D) N/A
Answer24:
Notes 24:
Question25: In which phase of the ML pipeline do you analyze the business requirements and re-frame that information into a machine learning context.
A) Problem formulation
B) Model training
C) Deployment
D)
Answer25:
Notes 25:
iOs: https://apps.apple.com/
Android/Amazon: https://www.amazon.com/gp/product/B09TZ4H8V6
AWS MLS-C01 Machine Learning Exam Prep
Quizzes, Practice Exams: Modeling, Data Engineering, Vision, Exploratory Data Analysis, ML Ops, Cheat Sheets, ML Jobs Interview Q&A
Use this App to learn about Machine Learning on AWS and prepare for the AWS Machine Learning Specialty Certification MLS-C01.
Earning AWS Certified Machine Learning Specialty validates expertise in building, training, tuning, and deploying machine learning (ML) models on AWS.
The App provides hundreds of quizzes and practice exam about:
– Machine Learning Operation on AWS
– Modelling
– Data Engineering
– Computer Vision,
– Exploratory Data Analysis,
– ML implementation & Operations
– Machine Learning Basics Questions and Answers
– Machine Learning Advanced Questions and Answers
– Scorecard
– Countdown timer
– Machine Learning Cheat Sheets
– Machine Learning Interview Questions and Answers
– Machine Learning Latest News
The App covers Machine Learning Basics and Advanced topics including: NLP, Computer Vision, Python, linear regression, logistic regression, Sampling, dataset, statistical interaction, selection bias, non-Gaussian distribution, bias-variance trade-off, Normal Distribution, correlation and covariance, Point Estimates and Confidence Interval, A/B Testing, p-value, statistical power of sensitivity, over-fitting and under-fitting, regularization, Law of Large Numbers, Confounding Variables, Survivorship Bias, univariate, bivariate and multivariate, Resampling, ROC curve, TF/IDF vectorization, Cluster Sampling, etc.
Domain 1: Data Engineering
Create data repositories for machine learning.
Identify data sources (e.g., content and location, primary sources such as user data)
Determine storage mediums (e.g., DB, Data Lake, S3, EFS, EBS)
Identify and implement a data ingestion solution.
Data job styles/types (batch load, streaming)
Data ingestion pipelines (Batch-based ML workloads and streaming-based ML workloads), etc.
Domain 2: Exploratory Data Analysis
Sanitize and prepare data for modeling.
Perform feature engineering.
Analyze and visualize data for machine learning.
Domain 3: Modeling
Frame business problems as machine learning problems.
Select the appropriate model(s) for a given machine learning problem.
Train machine learning models.
Perform hyperparameter optimization.
Evaluate machine learning models.
Domain 4: Machine Learning Implementation and Operations
Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.
Recommend and implement the appropriate machine learning services and features for a given problem.
Apply basic AWS security practices to machine learning solutions.
Deploy and operationalize machine learning solutions.
Machine Learning Services covered:
Amazon Comprehend
AWS Deep Learning AMIs (DLAMI)
AWS DeepLens
Amazon Forecast
Amazon Fraud Detector
Amazon Lex
Amazon Polly
Amazon Rekognition
Amazon SageMaker
Amazon Textract
Amazon Transcribe
Amazon Translate
Other Services and topics covered are:
Ingestion/Collection
Processing/ETL
Data analysis/visualization
Model training
Model deployment/inference
Operational
AWS ML application services
Language relevant to ML (for example, Python, Java, Scala, R, SQL)
Notebooks and integrated development environments (IDEs),
S3, SageMaker, Kinesis, Lake Formation, Athena, Kibana, Redshift, Textract, EMR, Glue, SageMaker, CSV, JSON, IMG, parquet or databases, Amazon Athena
Amazon EC2, Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Container Service, Amazon Elastic Kubernetes Service , Amazon Redshift
Sagemaker API Explained:
AWS Certified Machine Learning Engineer Specialty Questions and Answers:
Question1: An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, and do so with minimal changes to the existing code.
Answer1: Use the Build Your Own Container (BYOC) Amazon Sagemaker option.
Create a new docker container with the existing code. Register the container in Amazon Elastic Container registry. with the existing code. Register the container in Amazon Elastic Container Registry. Finally run the training and inference jobs using this container.
Question2: Which feature of Amazon SageMaker can you use for preprocessing the data?
Answer2: Amazon Sagemaker Notebook instances
Amazon SageMaker enables developers and data scientists to build, train, tune, and deploy machine learning (ML) models at scale. You can deploy trained ML models for real-time or batch predictions on unseen data, a process known as inference. However, in most cases, the raw input data must be preprocessed and can’t be used directly for making predictions. This is because most ML models expect the data in a predefined format, so the raw data needs to be first cleaned and formatted in order for the ML model to process the data. You can use the Amazon SageMaker built-in Scikit-learn library for preprocessing input data and then use the Amazon SageMaker built-in Linear Learner algorithm for predictions.
Question3: What setting, when creating an Amazon SageMaker notebook instance, can you use to install libraries and import data?
Answer3: LifeCycle Configuration
Question4: How to Choose the right Sagemaker built-in algorithm?




This is a general guide for choosing which algorithm to use depending on what business problem you have and what data you have.
Top 10 Google Professional Machine Learning Engineer Sample Questions
Question 1: You work for a textile manufacturer and have been asked to build a model to detect and classify fabric defects. You trained a machine learning model with high recall based on high resolution images taken at the end of the production line. You want quality control inspectors to gain trust in your model. Which technique should you use to understand the rationale of your classifier?
A. Use K-fold cross validation to understand how the model performs on different test datasets.
B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.
C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.
D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.
Answer 1)
Notes 1)
Question 2: You need to write a generic test to verify whether Dense Neural Network (DNN) models automatically released by your team have a sufficient number of parameters to learn the task for which they were built. What should you do?
Answer 2)
Notes 2)
[appbox appstore 1560083470-iphone screenshots]
[appbox googleplay com.awssolutionarchitectassociateexampreppro.app]
Answer 3)
Notes 3)
Question 4: You work on a team where the process for deploying a model into production starts with data scientists training different versions of models in a Kubeflow pipeline. The workflow then stores the new model artifact into the corresponding Cloud Storage bucket. You need to build the next steps of the pipeline after the submitted model is ready to be tested and deployed in production on AI Platform. How should you configure the architecture before deploying the model to production?
Question 10) You work for a large financial institution that is planning to use Dialogflow to create a chatbot for the company’s mobile app. You have reviewed old chat logs and tagged each conversation for intent based on each customer’s stated intention for contacting customer service. About 70% of customer inquiries are simple requests that are solved within 10 intents. The remaining 30% of inquiries require much longer and more complicated requests. Which intents should you automate first?
[appbox appstore 1611045854-iphone screenshots]
[appbox microsoftstore 9n8rl80hvm4t-mobile screenshots]
Machine Learning Q&A Part I:
Google.
Azure and AWS are second class citizens in this area.
Sure, AWS has 70% of the market.
Sure, Azure is the easiest turn key and super user friendly.
But, the king of machine learning in the cloud is GCP.
GCP = Google Cloud Platform
Google has the largest data science team in the world, not mention they have Hinton.
Let’s forgot for a minute they created TensorFlow and give it away.
Let’s just talk about building a real world model with data that doesn’t fit into a excel spreadsheet.
The vast majority of applied machine learning is supervised and that means we need data.
Not just normal data, we need very clean highly structured data.
Where’s the easiest place in the world to upload and model a Petabyte of structured data? BigQuery of course.
Why BigQuery? I don’t have to do anything but upload my data. No spinning up RedShit clusters or whatever I have to do in Azure, just upload and massage data with my familiar SQL. If I do have to wrangle my data it won’t take my six months to update 5 rows here, minutes usually.
Then, you’ll need a front end. Cloud datalab is a Jupyter notebook, which is good because I don’t want nor do I need anything else.
Then, with a single line of code I connect by datalab (Jupyter) notebook to my data in BigQuery and build away.
I’ve worked in all three and the only thing I care about is getting to my job the fastest and right now that means I build my models in GCP.
If you’re new to machine learning don’t start in GCP or any cloud vendor for that matter. Start learning Python from the comfort of your laptop.
The course below is free to the first 20.
The Complete Python Course for Machine Learning Engineers
Here, I want to share the best research paper on Machine Learning classification methods, titled ‘Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?’, published in the ‘Journal of Machine Learning Research’.
This paper nicely explained 179 classification techniques and applied them on 121 data sets thus sharing small summary of the paper:
Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?
The paper evaluated 179 classifiers arising from 17 ML families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbours, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods), implemented in Weka, R ( with and without the caret package), C and Matlab, including all the relevant classifiers available today.
Experiments used total 121 data sets , which represent the whole UCI data base (excluding the large-scale problems) and other own real problems, in order to achieve significant conclusions about the classifier behaviour, not dependent on the data set collection.
The whole data set and partitions are available from: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz
The classifiers most likely to be the bests are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets. However, the difference is not statistically significant with the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy. A few models are clearly better than the remaining ones: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package).
The random forest is clearly the best family of classifiers (3 out of 5 bests classifiers are RF), followed by SVM (4 classifiers in the top-10), neural networks and boosting ensembles (5 and 3 members in the top-20, respectively).
You can see the table with the complete results: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/results.txt
I hope it will be helpful for Statistic and Machine Leaning aspirants!
Thank you!
These basic questions should help:
1. Is the classification going to be supervised or unsupervised? Several well defined techniques likes SVM (Support Vector Machines), trained neural net,etc. are applicable for supervised classification. For unsupervised classification, GMMs (Gaussian Mixture Models), HMMs (Hidden Markov models) with Baye’s techniques could be used. (Several other techniques could of course be used as well)
2.How much training data do you have in case it is supervised ? A small number of training data may yield discouraging classification accuracy even if the chosen classifier is the most suitable one for the problem. In such a case, try to obtain more number of samples. There’s also generally a correlation (for practical purposes at least) between the feature dimensionality and the number of samples for given technique. For example, while using SVM, the linear kernel tends to yield better results when the number of training samples are less than or equal to or only slightly more than the number of feature dimensions as compared to RBF or any other kernel.
3. If the feature vector dimensionality is small enough (1/2/3 -D) then it makes sense to plot and visually inspect if techniques like clustering could be more useful. With very high number of feature dimensions, methods like clustering are generally not advisable(Refer : “The Curse Of Dimensionality”).
4. Are you doing classification in real time ? Some techniques ,e.g. “Template Match” in image classification may lead to a higher number of errors but is generally faster than most other techniques if the number of templates to be evaluated are not excessively high.
5. Depending upon the problem domain, you can decide if you can choose the underlying model in such a way that it can use certain temporal/spatial correlations that may be inherent in the data. For example, HMMs use the temporal continuity of speech samples for enhancing classification results in speech recognition problems.
Another point, slightly off the topic perhaps, but the classification performance is as much a function of choosing the correct feature vectors, the pre-processing of the feature vectors as much as the classifier itself. It’s generally a good idea to give reserve some initial part of the project to try out various classifiers on the same data-set. It may at least help you reject the ones which are highly inaccurate.
At a high level, these skills are a combination of software and data engineering.
The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.
That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:
- Well structured code: it doesn’t need to be perfect but at least can be understood and updated by other team members. Avoid spaghetti code[1] as the plague.
- Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
- Model versioning: add a hash key to your different models. You will thank me later.
- Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
- Monitor performances: execution time and statistical scores of your models.
- Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..
Some of the mistakes that might involve during building a machine learning model (I can think of) are listed here:
- Not understanding the structure of the dataset
- Not giving proper care during features selection
- Leaving out categorical features and considering just numerical variables
- Falling into dummy variable trap
- Selection of inefficient machine learning algorithm
- Not trying out various ML algorithms for building the model based on structure of data.
- Improper tuning of model parameters
- Most importantly: Building an idiotstic imperfect model i.e. suppose we have a classification problem with 99% chances of falling into class1 and remaining to class2. The built model may develop a mapping function which all the time for all data inputs, may predict the result to be class1. Well, one might say his/her model has 99% accuracy. But in reality the 1% class2 case hasn’t been included in the model. So this must be taken into consideration.
- Read more here…
[appbox appstore 1560083470-iphone screenshots]
[appbox googleplay com.awssolutionarchitectassociateexampreppro.app]
Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.
That’s just the surface-level comparison though. The image above gives an overview of how the two differ.
One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected output are patterns or trends, which doesn’t require coming up with a statement or fact to test.
However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis. Read more here ….
The data science life cycle is not something well-defined like the software development life-cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life-cycle of a data science project depends on various data scientist skills and data science tools. The typical life-cycle of a data science project involves jumping back and forth among various interdependent science tasks using a variety of tools, techniques, programming, etc.
Thus, the data science life-cycle can include the following steps:
- Business requirement understanding.
- Data collection.
- Data cleaning.
- Data analysis.
- Modeling.
- Performance evaluation.
- Communicating with stakeholders.
- Deployment.
- Real-world testing.
- Business buy-in.
- Support and maintenance.
Looks neat, but here is the scheme to visualize how it is happening in reality:
Agile development processes, especially continuous delivery lends itself well to the data science project life-cycle. The early comparison helps the data science team to change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build it.
[appbox appstore 1611045854-iphone screenshots]
[appbox microsoftstore 9n8rl80hvm4t-mobile screenshots]
Machine Learning Q&A -Part II:
At a high level, these skills are a combination of software and data engineering.
The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.
That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:
- Well structured code: it doesn’t need to be perfect but at least can be understood and updated by other team members. Avoid spaghetti code[1] as the plague.
- Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
- Model versioning: add a hash key to your different models. You will thank me later.
- Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
- Monitor performances: execution time and statistical scores of your models.
- Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..
Some of the mistakes that might involve during building a machine learning model (I can think of) are listed here:
- Not understanding the structure of the dataset
- Not giving proper care during features selection
- Leaving out categorical features and considering just numerical variables
- Falling into dummy variable trap
- Selection of inefficient machine learning algorithm
- Not trying out various ML algorithms for building the model based on structure of data.
- Improper tuning of model parameters
- Most importantly: Building an idiotstic imperfect model i.e. suppose we have a classification problem with 99% chances of falling into class1 and remaining to class2. The built model may develop a mapping function which all the time for all data inputs, may predict the result to be class1. Well, one might say his/her model has 99% accuracy. But in reality the 1% class2 case hasn’t been included in the model. So this must be taken into consideration.
- Read more here…
Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.
That’s just the surface-level comparison though. The image above gives an overview of how the two differ.
One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected output are patterns or trends, which doesn’t require coming up with a statement or fact to test.
However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis. Read more here ….
The data science life cycle is not something well-defined like the software development life-cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life-cycle of a data science project depends on various data scientist skills and data science tools. The typical life-cycle of a data science project involves jumping back and forth among various interdependent science tasks using a variety of tools, techniques, programming, etc.
Thus, the data science life-cycle can include the following steps:
- Business requirement understanding.
- Data collection.
- Data cleaning.
- Data analysis.
- Modeling.
- Performance evaluation.
- Communicating with stakeholders.
- Deployment.
- Real-world testing.
- Business buy-in.
- Support and maintenance.
Looks neat, but here is the scheme to visualize how it is happening in reality:
Agile development processes, especially continuous delivery lends itself well to the data science project life-cycle. The early comparison helps the data science team to change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build it.
iOs: https://apps.apple.com/ca/app/aws-machine-learning-prep-pro/id1611045854
Android/Amazon: https://www.amazon.com/gp/product/B09TZ4H8V6
AWS MLS-C01 Machine Learning Exam Prep
Quizzes, Practice Exams: Modeling, Data Engineering, Vision, Exploratory Data Analysis, ML Ops, Cheat Sheets, ML Jobs Interview Q&A
Use this App to learn about Machine Learning on AWS and prepare for the AWS Machine Learning Specialty Certification MLS-C01.
Earning AWS Certified Machine Learning Specialty validates expertise in building, training, tuning, and deploying machine learning (ML) models on AWS.
The App provides hundreds of quizzes and practice exam about:
– Machine Learning Operation on AWS
– Modelling
– Data Engineering
– Computer Vision,
– Exploratory Data Analysis,
– ML implementation & Operations
– Machine Learning Basics Questions and Answers
– Machine Learning Advanced Questions and Answers
– Scorecard
– Countdown timer
– Machine Learning Cheat Sheets
– Machine Learning Interview Questions and Answers
– Machine Learning Latest News
The App covers Machine Learning Basics and Advanced topics including: NLP, Computer Vision, Python, linear regression, logistic regression, Sampling, dataset, statistical interaction, selection bias, non-Gaussian distribution, bias-variance trade-off, Normal Distribution, correlation and covariance, Point Estimates and Confidence Interval, A/B Testing, p-value, statistical power of sensitivity, over-fitting and under-fitting, regularization, Law of Large Numbers, Confounding Variables, Survivorship Bias, univariate, bivariate and multivariate, Resampling, ROC curve, TF/IDF vectorization, Cluster Sampling, etc.
Domain 1: Data Engineering
Create data repositories for machine learning.
Identify data sources (e.g., content and location, primary sources such as user data)
Determine storage mediums (e.g., DB, Data Lake, S3, EFS, EBS)
Identify and implement a data ingestion solution.
Data job styles/types (batch load, streaming)
Data ingestion pipelines (Batch-based ML workloads and streaming-based ML workloads), etc.
Domain 2: Exploratory Data Analysis
Sanitize and prepare data for modeling.
Perform feature engineering.
Analyze and visualize data for machine learning.
Domain 3: Modeling
Frame business problems as machine learning problems.
Select the appropriate model(s) for a given machine learning problem.
Train machine learning models.
Perform hyperparameter optimization.
Evaluate machine learning models.
Domain 4: Machine Learning Implementation and Operations
Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.
Recommend and implement the appropriate machine learning services and features for a given problem.
Apply basic AWS security practices to machine learning solutions.
Deploy and operationalize machine learning solutions.
Machine Learning Services covered:
Amazon Comprehend
AWS Deep Learning AMIs (DLAMI)
AWS DeepLens
Amazon Forecast
Amazon Fraud Detector
Amazon Lex
Amazon Polly
Amazon Rekognition
Amazon SageMaker
Amazon Textract
Amazon Transcribe
Amazon Translate
Other Services and topics covered are:
Ingestion/Collection
Processing/ETL
Data analysis/visualization
Model training
Model deployment/inference
Operational
AWS ML application services
Language relevant to ML (for example, Python, Java, Scala, R, SQL)
Notebooks and integrated development environments (IDEs),
S3, SageMaker, Kinesis, Lake Formation, Athena, Kibana, Redshift, Textract, EMR, Glue, SageMaker, CSV, JSON, IMG, parquet or databases, Amazon Athena
Amazon EC2, Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Container Service, Amazon Elastic Kubernetes Service , Amazon Redshift
Sagemaker API Explained:
AWS Certified Machine Learning Engineer Specialty Questions and Answers:
Question1: An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, and do so with minimal changes to the existing code.
Answer1: Use the Build Your Own Container (BYOC) Amazon Sagemaker option.
Create a new docker container with the existing code. Register the container in Amazon Elastic Container registry. with the existing code. Register the container in Amazon Elastic Container Registry. Finally run the training and inference jobs using this container.
Question2: Which feature of Amazon SageMaker can you use for preprocessing the data?
Answer2: Amazon Sagemaker Notebook instances
Amazon SageMaker enables developers and data scientists to build, train, tune, and deploy machine learning (ML) models at scale. You can deploy trained ML models for real-time or batch predictions on unseen data, a process known as inference. However, in most cases, the raw input data must be preprocessed and can’t be used directly for making predictions. This is because most ML models expect the data in a predefined format, so the raw data needs to be first cleaned and formatted in order for the ML model to process the data. You can use the Amazon SageMaker built-in Scikit-learn library for preprocessing input data and then use the Amazon SageMaker built-in Linear Learner algorithm for predictions.
Question3: What setting, when creating an Amazon SageMaker notebook instance, can you use to install libraries and import data?
Answer3: LifeCycle Configuration
Question4: How to Choose the right Sagemaker built-in algorithm?




This is a general guide for choosing which algorithm to use depending on what business problem you have and what data you have.
Top 10 Google Professional Machine Learning Engineer Sample Questions
Question 1: You work for a textile manufacturer and have been asked to build a model to detect and classify fabric defects. You trained a machine learning model with high recall based on high resolution images taken at the end of the production line. You want quality control inspectors to gain trust in your model. Which technique should you use to understand the rationale of your classifier?
A. Use K-fold cross validation to understand how the model performs on different test datasets.
B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.
C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.
D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.
Answer 1)
BNotes 1)
Question 2: You need to write a generic test to verify whether Dense Neural Network (DNN) models automatically released by your team have a sufficient number of parameters to learn the task for which they were built. What should you do?
Answer 2)
Notes 2)
[appbox appstore 1560083470-iphone screenshots]
[appbox googleplay com.awssolutionarchitectassociateexampreppro.app]
Answer 3)
Notes 3)
Question 4: You work on a team where the process for deploying a model into production starts with data scientists training different versions of models in a Kubeflow pipeline. The workflow then stores the new model artifact into the corresponding Cloud Storage bucket. You need to build the next steps of the pipeline after the submitted model is ready to be tested and deployed in production on AI Platform. How should you configure the architecture before deploying the model to production?
Question 10) You work for a large financial institution that is planning to use Dialogflow to create a chatbot for the company’s mobile app. You have reviewed old chat logs and tagged each conversation for intent based on each customer’s stated intention for contacting customer service. About 70% of customer inquiries are simple requests that are solved within 10 intents. The remaining 30% of inquiries require much longer and more complicated requests. Which intents should you automate first?
[appbox appstore 1611045854-iphone screenshots]
[appbox microsoftstore 9n8rl80hvm4t-mobile screenshots]
Machine Learning Q&A Part I:
Google.
Azure and AWS are second class citizens in this area.
Sure, AWS has 70% of the market.
Sure, Azure is the easiest turn key and super user friendly.
But, the king of machine learning in the cloud is GCP.
GCP = Google Cloud Platform
Google has the largest data science team in the world, not mention they have Hinton.
Let’s forgot for a minute they created TensorFlow and give it away.
Let’s just talk about building a real world model with data that doesn’t fit into a excel spreadsheet.
The vast majority of applied machine learning is supervised and that means we need data.
Not just normal data, we need very clean highly structured data.
Where’s the easiest place in the world to upload and model a Petabyte of structured data? BigQuery of course.
Why BigQuery? I don’t have to do anything but upload my data. No spinning up RedShit clusters or whatever I have to do in Azure, just upload and massage data with my familiar SQL. If I do have to wrangle my data it won’t take my six months to update 5 rows here, minutes usually.
Then, you’ll need a front end. Cloud datalab is a Jupyter notebook, which is good because I don’t want nor do I need anything else.
Then, with a single line of code I connect by datalab (Jupyter) notebook to my data in BigQuery and build away.
I’ve worked in all three and the only thing I care about is getting to my job the fastest and right now that means I build my models in GCP.
If you’re new to machine learning don’t start in GCP or any cloud vendor for that matter. Start learning Python from the comfort of your laptop.
The course below is free to the first 20.
The Complete Python Course for Machine Learning Engineers
Here, I want to share the best research paper on Machine Learning classification methods, titled ‘Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?’, published in the ‘Journal of Machine Learning Research’.
This paper nicely explained 179 classification techniques and applied them on 121 data sets thus sharing small summary of the paper:
Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?
The paper evaluated 179 classifiers arising from 17 ML families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbours, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods), implemented in Weka, R ( with and without the caret package), C and Matlab, including all the relevant classifiers available today.
Experiments used total 121 data sets , which represent the whole UCI data base (excluding the large-scale problems) and other own real problems, in order to achieve significant conclusions about the classifier behaviour, not dependent on the data set collection.
The whole data set and partitions are available from: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz
The classifiers most likely to be the bests are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets. However, the difference is not statistically significant with the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy. A few models are clearly better than the remaining ones: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package).
The random forest is clearly the best family of classifiers (3 out of 5 bests classifiers are RF), followed by SVM (4 classifiers in the top-10), neural networks and boosting ensembles (5 and 3 members in the top-20, respectively).
You can see the table with the complete results: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/results.txt
I hope it will be helpful for Statistic and Machine Leaning aspirants!
Thank you!
These basic questions should help:
1. Is the classification going to be supervised or unsupervised? Several well defined techniques likes SVM (Support Vector Machines), trained neural net,etc. are applicable for supervised classification. For unsupervised classification, GMMs (Gaussian Mixture Models), HMMs (Hidden Markov models) with Baye’s techniques could be used. (Several other techniques could of course be used as well)
2.How much training data do you have in case it is supervised ? A small number of training data may yield discouraging classification accuracy even if the chosen classifier is the most suitable one for the problem. In such a case, try to obtain more number of samples. There’s also generally a correlation (for practical purposes at least) between the feature dimensionality and the number of samples for given technique. For example, while using SVM, the linear kernel tends to yield better results when the number of training samples are less than or equal to or only slightly more than the number of feature dimensions as compared to RBF or any other kernel.
3. If the feature vector dimensionality is small enough (1/2/3 -D) then it makes sense to plot and visually inspect if techniques like clustering could be more useful. With very high number of feature dimensions, methods like clustering are generally not advisable(Refer : “The Curse Of Dimensionality”).
4. Are you doing classification in real time ? Some techniques ,e.g. “Template Match” in image classification may lead to a higher number of errors but is generally faster than most other techniques if the number of templates to be evaluated are not excessively high.
5. Depending upon the problem domain, you can decide if you can choose the underlying model in such a way that it can use certain temporal/spatial correlations that may be inherent in the data. For example, HMMs use the temporal continuity of speech samples for enhancing classification results in speech recognition problems.
Another point, slightly off the topic perhaps, but the classification performance is as much a function of choosing the correct feature vectors, the pre-processing of the feature vectors as much as the classifier itself. It’s generally a good idea to give reserve some initial part of the project to try out various classifiers on the same data-set. It may at least help you reject the ones which are highly inaccurate.
At a high level, these skills are a combination of software and data engineering.
The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.
That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:
- Well structured code: it doesn’t need to be perfect but at least can be understood and updated by other team members. Avoid spaghetti code[1] as the plague.
- Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
- Model versioning: add a hash key to your different models. You will thank me later.
- Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
- Monitor performances: execution time and statistical scores of your models.
- Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..
Some of the mistakes that might involve during building a machine learning model (I can think of) are listed here:
- Not understanding the structure of the dataset
- Not giving proper care during features selection
- Leaving out categorical features and considering just numerical variables
- Falling into dummy variable trap
- Selection of inefficient machine learning algorithm
- Not trying out various ML algorithms for building the model based on structure of data.
- Improper tuning of model parameters
- Most importantly: Building an idiotstic imperfect model i.e. suppose we have a classification problem with 99% chances of falling into class1 and remaining to class2. The built model may develop a mapping function which all the time for all data inputs, may predict the result to be class1. Well, one might say his/her model has 99% accuracy. But in reality the 1% class2 case hasn’t been included in the model. So this must be taken into consideration.
- Read more here…
[appbox appstore 1560083470-iphone screenshots]
[appbox googleplay com.awssolutionarchitectassociateexampreppro.app]
Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.
That’s just the surface-level comparison though. The image above gives an overview of how the two differ.
One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected output are patterns or trends, which doesn’t require coming up with a statement or fact to test.
However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis. Read more here ….
The data science life cycle is not something well-defined like the software development life-cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life-cycle of a data science project depends on various data scientist skills and data science tools. The typical life-cycle of a data science project involves jumping back and forth among various interdependent science tasks using a variety of tools, techniques, programming, etc.
Thus, the data science life-cycle can include the following steps:
- Business requirement understanding.
- Data collection.
- Data cleaning.
- Data analysis.
- Modeling.
- Performance evaluation.
- Communicating with stakeholders.
- Deployment.
- Real-world testing.
- Business buy-in.
- Support and maintenance.
Looks neat, but here is the scheme to visualize how it is happening in reality:
Agile development processes, especially continuous delivery lends itself well to the data science project life-cycle. The early comparison helps the data science team to change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build it.
[appbox appstore 1611045854-iphone screenshots]
[appbox microsoftstore 9n8rl80hvm4t-mobile screenshots]
Machine Learning Q&A -Part II:
At a high level, these skills are a combination of software and data engineering.
The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.
That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:
- Well structured code: it doesn’t need to be perfect but at least can be understood and updated by other team members. Avoid spaghetti code[1] as the plague.
- Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
- Model versioning: add a hash key to your different models. You will thank me later.
- Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
- Monitor performances: execution time and statistical scores of your models.
- Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..
Some of the mistakes that might involve during building a machine learning model (I can think of) are listed here:
- Not understanding the structure of the dataset
- Not giving proper care during features selection
- Leaving out categorical features and considering just numerical variables
- Falling into dummy variable trap
- Selection of inefficient machine learning algorithm
- Not trying out various ML algorithms for building the model based on structure of data.
- Improper tuning of model parameters
- Most importantly: Building an idiotstic imperfect model i.e. suppose we have a classification problem with 99% chances of falling into class1 and remaining to class2. The built model may develop a mapping function which all the time for all data inputs, may predict the result to be class1. Well, one might say his/her model has 99% accuracy. But in reality the 1% class2 case hasn’t been included in the model. So this must be taken into consideration.
- Read more here…
Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.
That’s just the surface-level comparison though. The image above gives an overview of how the two differ.
One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected output are patterns or trends, which doesn’t require coming up with a statement or fact to test.
However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis. Read more here ….
The data science life cycle is not something well-defined like the software development life-cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life-cycle of a data science project depends on various data scientist skills and data science tools. The typical life-cycle of a data science project involves jumping back and forth among various interdependent science tasks using a variety of tools, techniques, programming, etc.
Thus, the data science life-cycle can include the following steps:
- Business requirement understanding.
- Data collection.
- Data cleaning.
- Data analysis.
- Modeling.
- Performance evaluation.
- Communicating with stakeholders.
- Deployment.
- Real-world testing.
- Business buy-in.
- Support and maintenance.
Looks neat, but here is the scheme to visualize how it is happening in reality:
Agile development processes, especially continuous delivery lends itself well to the data science project life-cycle. The early comparison helps the data science team to change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build it.
iOs: https://apps.apple.com/ca/app/aws-machine-learning-prep-pro/id1611045854
Android/Amazon: https://www.amazon.com/gp/product/B09TZ4H8V6
AWS MLS-C01 Machine Learning Exam Prep
Quizzes, Practice Exams: Modeling, Data Engineering, Vision, Exploratory Data Analysis, ML Ops, Cheat Sheets, ML Jobs Interview Q&A
Use this App to learn about Machine Learning on AWS and prepare for the AWS Machine Learning Specialty Certification MLS-C01.
Earning AWS Certified Machine Learning Specialty validates expertise in building, training, tuning, and deploying machine learning (ML) models on AWS.
The App provides hundreds of quizzes and practice exam about:
– Machine Learning Operation on AWS
– Modelling
– Data Engineering
– Computer Vision,
– Exploratory Data Analysis,
– ML implementation & Operations
– Machine Learning Basics Questions and Answers
– Machine Learning Advanced Questions and Answers
– Scorecard
– Countdown timer
– Machine Learning Cheat Sheets
– Machine Learning Interview Questions and Answers
– Machine Learning Latest News
The App covers Machine Learning Basics and Advanced topics including: NLP, Computer Vision, Python, linear regression, logistic regression, Sampling, dataset, statistical interaction, selection bias, non-Gaussian distribution, bias-variance trade-off, Normal Distribution, correlation and covariance, Point Estimates and Confidence Interval, A/B Testing, p-value, statistical power of sensitivity, over-fitting and under-fitting, regularization, Law of Large Numbers, Confounding Variables, Survivorship Bias, univariate, bivariate and multivariate, Resampling, ROC curve, TF/IDF vectorization, Cluster Sampling, etc.
Domain 1: Data Engineering
Create data repositories for machine learning.
Identify data sources (e.g., content and location, primary sources such as user data)
Determine storage mediums (e.g., DB, Data Lake, S3, EFS, EBS)
Identify and implement a data ingestion solution.
Data job styles/types (batch load, streaming)
Data ingestion pipelines (Batch-based ML workloads and streaming-based ML workloads), etc.
Domain 2: Exploratory Data Analysis
Sanitize and prepare data for modeling.
Perform feature engineering.
Analyze and visualize data for machine learning.
Domain 3: Modeling
Frame business problems as machine learning problems.
Select the appropriate model(s) for a given machine learning problem.
Train machine learning models.
Perform hyperparameter optimization.
Evaluate machine learning models.
Domain 4: Machine Learning Implementation and Operations
Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.
Recommend and implement the appropriate machine learning services and features for a given problem.
Apply basic AWS security practices to machine learning solutions.
Deploy and operationalize machine learning solutions.
Machine Learning Services covered:
Amazon Comprehend
AWS Deep Learning AMIs (DLAMI)
AWS DeepLens
Amazon Forecast
Amazon Fraud Detector
Amazon Lex
Amazon Polly
Amazon Rekognition
Amazon SageMaker
Amazon Textract
Amazon Transcribe
Amazon Translate
Other Services and topics covered are:
Ingestion/Collection
Processing/ETL
Data analysis/visualization
Model training
Model deployment/inference
Operational
AWS ML application services
Language relevant to ML (for example, Python, Java, Scala, R, SQL)
Notebooks and integrated development environments (IDEs),
S3, SageMaker, Kinesis, Lake Formation, Athena, Kibana, Redshift, Textract, EMR, Glue, SageMaker, CSV, JSON, IMG, parquet or databases, Amazon Athena
Amazon EC2, Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Container Service, Amazon Elastic Kubernetes Service , Amazon Redshift
Sagemaker API Explained:
AWS Certified Machine Learning Engineer Specialty Questions and Answers:
Question1: An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, and do so with minimal changes to the existing code.
Answer1: Use the Build Your Own Container (BYOC) Amazon Sagemaker option.
Create a new docker container with the existing code. Register the container in Amazon Elastic Container registry. with the existing code. Register the container in Amazon Elastic Container Registry. Finally run the training and inference jobs using this container.
Question2: Which feature of Amazon SageMaker can you use for preprocessing the data?
Answer2: Amazon Sagemaker Notebook instances
Amazon SageMaker enables developers and data scientists to build, train, tune, and deploy machine learning (ML) models at scale. You can deploy trained ML models for real-time or batch predictions on unseen data, a process known as inference. However, in most cases, the raw input data must be preprocessed and can’t be used directly for making predictions. This is because most ML models expect the data in a predefined format, so the raw data needs to be first cleaned and formatted in order for the ML model to process the data. You can use the Amazon SageMaker built-in Scikit-learn library for preprocessing input data and then use the Amazon SageMaker built-in Linear Learner algorithm for predictions.
Question3: What setting, when creating an Amazon SageMaker notebook instance, can you use to install libraries and import data?
Answer3: LifeCycle Configuration
Question4: How to Choose the right Sagemaker built-in algorithm?




This is a general guide for choosing which algorithm to use depending on what business problem you have and what data you have.
Top 10 Google Professional Machine Learning Engineer Sample Questions
Question 1: You work for a textile manufacturer and have been asked to build a model to detect and classify fabric defects. You trained a machine learning model with high recall based on high resolution images taken at the end of the production line. You want quality control inspectors to gain trust in your model. Which technique should you use to understand the rationale of your classifier?
A. Use K-fold cross validation to understand how the model performs on different test datasets.
B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.
C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.
D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.
Answer 1)
BNotes 1)
Question 2: You need to write a generic test to verify whether Dense Neural Network (DNN) models automatically released by your team have a sufficient number of parameters to learn the task for which they were built. What should you do?
Answer 2)
Notes 2)
[appbox appstore 1560083470-iphone screenshots]
[appbox googleplay com.awssolutionarchitectassociateexampreppro.app]
Answer 3)
Notes 3)
Question 4: You work on a team where the process for deploying a model into production starts with data scientists training different versions of models in a Kubeflow pipeline. The workflow then stores the new model artifact into the corresponding Cloud Storage bucket. You need to build the next steps of the pipeline after the submitted model is ready to be tested and deployed in production on AI Platform. How should you configure the architecture before deploying the model to production?
Question 10) You work for a large financial institution that is planning to use Dialogflow to create a chatbot for the company’s mobile app. You have reviewed old chat logs and tagged each conversation for intent based on each customer’s stated intention for contacting customer service. About 70% of customer inquiries are simple requests that are solved within 10 intents. The remaining 30% of inquiries require much longer and more complicated requests. Which intents should you automate first?
[appbox appstore 1611045854-iphone screenshots]
[appbox microsoftstore 9n8rl80hvm4t-mobile screenshots]
Machine Learning Q&A Part I:
Google.
Azure and AWS are second class citizens in this area.
Sure, AWS has 70% of the market.
Sure, Azure is the easiest turn key and super user friendly.
But, the king of machine learning in the cloud is GCP.
GCP = Google Cloud Platform
Google has the largest data science team in the world, not mention they have Hinton.
Let’s forgot for a minute they created TensorFlow and give it away.
Let’s just talk about building a real world model with data that doesn’t fit into a excel spreadsheet.
The vast majority of applied machine learning is supervised and that means we need data.
Not just normal data, we need very clean highly structured data.
Where’s the easiest place in the world to upload and model a Petabyte of structured data? BigQuery of course.
Why BigQuery? I don’t have to do anything but upload my data. No spinning up RedShit clusters or whatever I have to do in Azure, just upload and massage data with my familiar SQL. If I do have to wrangle my data it won’t take my six months to update 5 rows here, minutes usually.
Then, you’ll need a front end. Cloud datalab is a Jupyter notebook, which is good because I don’t want nor do I need anything else.
Then, with a single line of code I connect by datalab (Jupyter) notebook to my data in BigQuery and build away.
I’ve worked in all three and the only thing I care about is getting to my job the fastest and right now that means I build my models in GCP.
If you’re new to machine learning don’t start in GCP or any cloud vendor for that matter. Start learning Python from the comfort of your laptop.
The course below is free to the first 20.
The Complete Python Course for Machine Learning Engineers
Here, I want to share the best research paper on Machine Learning classification methods, titled ‘Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?’, published in the ‘Journal of Machine Learning Research’.
This paper nicely explained 179 classification techniques and applied them on 121 data sets thus sharing small summary of the paper:
Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?
The paper evaluated 179 classifiers arising from 17 ML families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbours, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods), implemented in Weka, R ( with and without the caret package), C and Matlab, including all the relevant classifiers available today.
Experiments used total 121 data sets , which represent the whole UCI data base (excluding the large-scale problems) and other own real problems, in order to achieve significant conclusions about the classifier behaviour, not dependent on the data set collection.
The whole data set and partitions are available from: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz
The classifiers most likely to be the bests are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets. However, the difference is not statistically significant with the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy. A few models are clearly better than the remaining ones: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package).
The random forest is clearly the best family of classifiers (3 out of 5 bests classifiers are RF), followed by SVM (4 classifiers in the top-10), neural networks and boosting ensembles (5 and 3 members in the top-20, respectively).
You can see the table with the complete results: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/results.txt
I hope it will be helpful for Statistic and Machine Leaning aspirants!
Thank you!
These basic questions should help:
1. Is the classification going to be supervised or unsupervised? Several well defined techniques likes SVM (Support Vector Machines), trained neural net,etc. are applicable for supervised classification. For unsupervised classification, GMMs (Gaussian Mixture Models), HMMs (Hidden Markov models) with Baye’s techniques could be used. (Several other techniques could of course be used as well)
2.How much training data do you have in case it is supervised ? A small number of training data may yield discouraging classification accuracy even if the chosen classifier is the most suitable one for the problem. In such a case, try to obtain more number of samples. There’s also generally a correlation (for practical purposes at least) between the feature dimensionality and the number of samples for given technique. For example, while using SVM, the linear kernel tends to yield better results when the number of training samples are less than or equal to or only slightly more than the number of feature dimensions as compared to RBF or any other kernel.
3. If the feature vector dimensionality is small enough (1/2/3 -D) then it makes sense to plot and visually inspect if techniques like clustering could be more useful. With very high number of feature dimensions, methods like clustering are generally not advisable(Refer : “The Curse Of Dimensionality”).
4. Are you doing classification in real time ? Some techniques ,e.g. “Template Match” in image classification may lead to a higher number of errors but is generally faster than most other techniques if the number of templates to be evaluated are not excessively high.
5. Depending upon the problem domain, you can decide if you can choose the underlying model in such a way that it can use certain temporal/spatial correlations that may be inherent in the data. For example, HMMs use the temporal continuity of speech samples for enhancing classification results in speech recognition problems.
Another point, slightly off the topic perhaps, but the classification performance is as much a function of choosing the correct feature vectors, the pre-processing of the feature vectors as much as the classifier itself. It’s generally a good idea to give reserve some initial part of the project to try out various classifiers on the same data-set. It may at least help you reject the ones which are highly inaccurate.
At a high level, these skills are a combination of software and data engineering.
The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.
That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:
- Well structured code: it doesn’t need to be perfect but at least can be understood and updated by other team members. Avoid spaghetti code[1] as the plague.
- Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
- Model versioning: add a hash key to your different models. You will thank me later.
- Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
- Monitor performances: execution time and statistical scores of your models.
- Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..
Some of the mistakes that might involve during building a machine learning model (I can think of) are listed here:
- Not understanding the structure of the dataset
- Not giving proper care during features selection
- Leaving out categorical features and considering just numerical variables
- Falling into dummy variable trap
- Selection of inefficient machine learning algorithm
- Not trying out various ML algorithms for building the model based on structure of data.
- Improper tuning of model parameters
- Most importantly: Building an idiotstic imperfect model i.e. suppose we have a classification problem with 99% chances of falling into class1 and remaining to class2. The built model may develop a mapping function which all the time for all data inputs, may predict the result to be class1. Well, one might say his/her model has 99% accuracy. But in reality the 1% class2 case hasn’t been included in the model. So this must be taken into consideration.
- Read more here…
[appbox appstore 1560083470-iphone screenshots]
[appbox googleplay com.awssolutionarchitectassociateexampreppro.app]
Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.
That’s just the surface-level comparison though. The image above gives an overview of how the two differ.
One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected output are patterns or trends, which doesn’t require coming up with a statement or fact to test.
However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis. Read more here ….
The data science life cycle is not something well-defined like the software development life-cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life-cycle of a data science project depends on various data scientist skills and data science tools. The typical life-cycle of a data science project involves jumping back and forth among various interdependent science tasks using a variety of tools, techniques, programming, etc.
Thus, the data science life-cycle can include the following steps:
- Business requirement understanding.
- Data collection.
- Data cleaning.
- Data analysis.
- Modeling.
- Performance evaluation.
- Communicating with stakeholders.
- Deployment.
- Real-world testing.
- Business buy-in.
- Support and maintenance.
Looks neat, but here is the scheme to visualize how it is happening in reality:
Agile development processes, especially continuous delivery lends itself well to the data science project life-cycle. The early comparison helps the data science team to change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build it.
[appbox appstore 1611045854-iphone screenshots]
[appbox microsoftstore 9n8rl80hvm4t-mobile screenshots]
Machine Learning Q&A -Part II:
At a high level, these skills are a combination of software and data engineering.
The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.
That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:
- Well structured code: it doesn’t need to be perfect but at least can be understood and updated by other team members. Avoid spaghetti code[1] as the plague.
- Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
- Model versioning: add a hash key to your different models. You will thank me later.
- Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
- Monitor performances: execution time and statistical scores of your models.
- Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..
Some of the mistakes that might involve during building a machine learning model (I can think of) are listed here:
- Not understanding the structure of the dataset
- Not giving proper care during features selection
- Leaving out categorical features and considering just numerical variables
- Falling into dummy variable trap
- Selection of inefficient machine learning algorithm
- Not trying out various ML algorithms for building the model based on structure of data.
- Improper tuning of model parameters
- Most importantly: Building an idiotstic imperfect model i.e. suppose we have a classification problem with 99% chances of falling into class1 and remaining to class2. The built model may develop a mapping function which all the time for all data inputs, may predict the result to be class1. Well, one might say his/her model has 99% accuracy. But in reality the 1% class2 case hasn’t been included in the model. So this must be taken into consideration.
- Read more here…
Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.
That’s just the surface-level comparison though. The image above gives an overview of how the two differ.
One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected output are patterns or trends, which doesn’t require coming up with a statement or fact to test.
However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis. Read more here ….
The data science life cycle is not something well-defined like the software development life-cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life-cycle of a data science project depends on various data scientist skills and data science tools. The typical life-cycle of a data science project involves jumping back and forth among various interdependent science tasks using a variety of tools, techniques, programming, etc.
Thus, the data science life-cycle can include the following steps:
- Business requirement understanding.
- Data collection.
- Data cleaning.
- Data analysis.
- Modeling.
- Performance evaluation.
- Communicating with stakeholders.
- Deployment.
- Real-world testing.
- Business buy-in.
- Support and maintenance.
Looks neat, but here is the scheme to visualize how it is happening in reality:
Agile development processes, especially continuous delivery lends itself well to the data science project life-cycle. The early comparison helps the data science team to change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build it.
iOs: https://apps.apple.com/ca/app/aws-machine-learning-prep-pro/id1611045854
Android/Amazon: https://www.amazon.com/gp/product/B09TZ4H8V6
AWS MLS-C01 Machine Learning Exam Prep
Quizzes, Practice Exams: Modeling, Data Engineering, Vision, Exploratory Data Analysis, ML Ops, Cheat Sheets, ML Jobs Interview Q&A
Use this App to learn about Machine Learning on AWS and prepare for the AWS Machine Learning Specialty Certification MLS-C01.
Earning AWS Certified Machine Learning Specialty validates expertise in building, training, tuning, and deploying machine learning (ML) models on AWS.
The App provides hundreds of quizzes and practice exam about:
– Machine Learning Operation on AWS
– Modelling
– Data Engineering
– Computer Vision,
– Exploratory Data Analysis,
– ML implementation & Operations
– Machine Learning Basics Questions and Answers
– Machine Learning Advanced Questions and Answers
– Scorecard
– Countdown timer
– Machine Learning Cheat Sheets
– Machine Learning Interview Questions and Answers
– Machine Learning Latest News
The App covers Machine Learning Basics and Advanced topics including: NLP, Computer Vision, Python, linear regression, logistic regression, Sampling, dataset, statistical interaction, selection bias, non-Gaussian distribution, bias-variance trade-off, Normal Distribution, correlation and covariance, Point Estimates and Confidence Interval, A/B Testing, p-value, statistical power of sensitivity, over-fitting and under-fitting, regularization, Law of Large Numbers, Confounding Variables, Survivorship Bias, univariate, bivariate and multivariate, Resampling, ROC curve, TF/IDF vectorization, Cluster Sampling, etc.
Domain 1: Data Engineering
Create data repositories for machine learning.
Identify data sources (e.g., content and location, primary sources such as user data)
Determine storage mediums (e.g., DB, Data Lake, S3, EFS, EBS)
Identify and implement a data ingestion solution.
Data job styles/types (batch load, streaming)
Data ingestion pipelines (Batch-based ML workloads and streaming-based ML workloads), etc.
Domain 2: Exploratory Data Analysis
Sanitize and prepare data for modeling.
Perform feature engineering.
Analyze and visualize data for machine learning.
Domain 3: Modeling
Frame business problems as machine learning problems.
Select the appropriate model(s) for a given machine learning problem.
Train machine learning models.
Perform hyperparameter optimization.
Evaluate machine learning models.
Domain 4: Machine Learning Implementation and Operations
Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.
Recommend and implement the appropriate machine learning services and features for a given problem.
Apply basic AWS security practices to machine learning solutions.
Deploy and operationalize machine learning solutions.
Machine Learning Services covered:
Amazon Comprehend
AWS Deep Learning AMIs (DLAMI)
AWS DeepLens
Amazon Forecast
Amazon Fraud Detector
Amazon Lex
Amazon Polly
Amazon Rekognition
Amazon SageMaker
Amazon Textract
Amazon Transcribe
Amazon Translate
Other Services and topics covered are:
Ingestion/Collection
Processing/ETL
Data analysis/visualization
Model training
Model deployment/inference
Operational
AWS ML application services
Language relevant to ML (for example, Python, Java, Scala, R, SQL)
Notebooks and integrated development environments (IDEs),
S3, SageMaker, Kinesis, Lake Formation, Athena, Kibana, Redshift, Textract, EMR, Glue, SageMaker, CSV, JSON, IMG, parquet or databases, Amazon Athena
Amazon EC2, Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Container Service, Amazon Elastic Kubernetes Service , Amazon Redshift
Sagemaker API Explained:
AWS Certified Machine Learning Engineer Specialty Questions and Answers:
Question1: An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, and do so with minimal changes to the existing code.
Answer1: Use the Build Your Own Container (BYOC) Amazon Sagemaker option.
Create a new docker container with the existing code. Register the container in Amazon Elastic Container registry. with the existing code. Register the container in Amazon Elastic Container Registry. Finally run the training and inference jobs using this container.
Question2: Which feature of Amazon SageMaker can you use for preprocessing the data?
Answer2: Amazon Sagemaker Notebook instances
Amazon SageMaker enables developers and data scientists to build, train, tune, and deploy machine learning (ML) models at scale. You can deploy trained ML models for real-time or batch predictions on unseen data, a process known as inference. However, in most cases, the raw input data must be preprocessed and can’t be used directly for making predictions. This is because most ML models expect the data in a predefined format, so the raw data needs to be first cleaned and formatted in order for the ML model to process the data. You can use the Amazon SageMaker built-in Scikit-learn library for preprocessing input data and then use the Amazon SageMaker built-in Linear Learner algorithm for predictions.
Question3: What setting, when creating an Amazon SageMaker notebook instance, can you use to install libraries and import data?
Answer3: LifeCycle Configuration
Question4: How to Choose the right Sagemaker built-in algorithm?




This is a general guide for choosing which algorithm to use depending on what business problem you have and what data you have.
Top 10 Google Professional Machine Learning Engineer Sample Questions
Question 1: You work for a textile manufacturer and have been asked to build a model to detect and classify fabric defects. You trained a machine learning model with high recall based on high resolution images taken at the end of the production line. You want quality control inspectors to gain trust in your model. Which technique should you use to understand the rationale of your classifier?
A. Use K-fold cross validation to understand how the model performs on different test datasets.
B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.
C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.
D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.
Answer 1)
BNotes 1)
Question 2: You need to write a generic test to verify whether Dense Neural Network (DNN) models automatically released by your team have a sufficient number of parameters to learn the task for which they were built. What should you do?
Answer 2)
Notes 2)
[appbox appstore 1560083470-iphone screenshots]
[appbox googleplay com.awssolutionarchitectassociateexampreppro.app]
Answer 3)
Notes 3)
Question 4: You work on a team where the process for deploying a model into production starts with data scientists training different versions of models in a Kubeflow pipeline. The workflow then stores the new model artifact into the corresponding Cloud Storage bucket. You need to build the next steps of the pipeline after the submitted model is ready to be tested and deployed in production on AI Platform. How should you configure the architecture before deploying the model to production?
Question 10) You work for a large financial institution that is planning to use Dialogflow to create a chatbot for the company’s mobile app. You have reviewed old chat logs and tagged each conversation for intent based on each customer’s stated intention for contacting customer service. About 70% of customer inquiries are simple requests that are solved within 10 intents. The remaining 30% of inquiries require much longer and more complicated requests. Which intents should you automate first?
[appbox appstore 1611045854-iphone screenshots]
[appbox microsoftstore 9n8rl80hvm4t-mobile screenshots]
Machine Learning Q&A Part I:
Google.
Azure and AWS are second class citizens in this area.
Sure, AWS has 70% of the market.
Sure, Azure is the easiest turn key and super user friendly.
But, the king of machine learning in the cloud is GCP.
GCP = Google Cloud Platform
Google has the largest data science team in the world, not mention they have Hinton.
Let’s forgot for a minute they created TensorFlow and give it away.
Let’s just talk about building a real world model with data that doesn’t fit into a excel spreadsheet.
The vast majority of applied machine learning is supervised and that means we need data.
Not just normal data, we need very clean highly structured data.
Where’s the easiest place in the world to upload and model a Petabyte of structured data? BigQuery of course.
Why BigQuery? I don’t have to do anything but upload my data. No spinning up RedShit clusters or whatever I have to do in Azure, just upload and massage data with my familiar SQL. If I do have to wrangle my data it won’t take my six months to update 5 rows here, minutes usually.
Then, you’ll need a front end. Cloud datalab is a Jupyter notebook, which is good because I don’t want nor do I need anything else.
Then, with a single line of code I connect by datalab (Jupyter) notebook to my data in BigQuery and build away.
I’ve worked in all three and the only thing I care about is getting to my job the fastest and right now that means I build my models in GCP.
If you’re new to machine learning don’t start in GCP or any cloud vendor for that matter. Start learning Python from the comfort of your laptop.
The course below is free to the first 20.
The Complete Python Course for Machine Learning Engineers
Here, I want to share the best research paper on Machine Learning classification methods, titled ‘Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?’, published in the ‘Journal of Machine Learning Research’.
This paper nicely explained 179 classification techniques and applied them on 121 data sets thus sharing small summary of the paper:
Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?
The paper evaluated 179 classifiers arising from 17 ML families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbours, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods), implemented in Weka, R ( with and without the caret package), C and Matlab, including all the relevant classifiers available today.
Experiments used total 121 data sets , which represent the whole UCI data base (excluding the large-scale problems) and other own real problems, in order to achieve significant conclusions about the classifier behaviour, not dependent on the data set collection.
The whole data set and partitions are available from: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz
The classifiers most likely to be the bests are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets. However, the difference is not statistically significant with the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy. A few models are clearly better than the remaining ones: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package).
The random forest is clearly the best family of classifiers (3 out of 5 bests classifiers are RF), followed by SVM (4 classifiers in the top-10), neural networks and boosting ensembles (5 and 3 members in the top-20, respectively).
You can see the table with the complete results: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/results.txt
I hope it will be helpful for Statistic and Machine Leaning aspirants!
Thank you!
These basic questions should help:
1. Is the classification going to be supervised or unsupervised? Several well defined techniques likes SVM (Support Vector Machines), trained neural net,etc. are applicable for supervised classification. For unsupervised classification, GMMs (Gaussian Mixture Models), HMMs (Hidden Markov models) with Baye’s techniques could be used. (Several other techniques could of course be used as well)
2.How much training data do you have in case it is supervised ? A small number of training data may yield discouraging classification accuracy even if the chosen classifier is the most suitable one for the problem. In such a case, try to obtain more number of samples. There’s also generally a correlation (for practical purposes at least) between the feature dimensionality and the number of samples for given technique. For example, while using SVM, the linear kernel tends to yield better results when the number of training samples are less than or equal to or only slightly more than the number of feature dimensions as compared to RBF or any other kernel.
3. If the feature vector dimensionality is small enough (1/2/3 -D) then it makes sense to plot and visually inspect if techniques like clustering could be more useful. With very high number of feature dimensions, methods like clustering are generally not advisable(Refer : “The Curse Of Dimensionality”).
4. Are you doing classification in real time ? Some techniques ,e.g. “Template Match” in image classification may lead to a higher number of errors but is generally faster than most other techniques if the number of templates to be evaluated are not excessively high.
5. Depending upon the problem domain, you can decide if you can choose the underlying model in such a way that it can use certain temporal/spatial correlations that may be inherent in the data. For example, HMMs use the temporal continuity of speech samples for enhancing classification results in speech recognition problems.
Another point, slightly off the topic perhaps, but the classification performance is as much a function of choosing the correct feature vectors, the pre-processing of the feature vectors as much as the classifier itself. It’s generally a good idea to give reserve some initial part of the project to try out various classifiers on the same data-set. It may at least help you reject the ones which are highly inaccurate.
At a high level, these skills are a combination of software and data engineering.
The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.
That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:
- Well structured code: it doesn’t need to be perfect but at least can be understood and updated by other team members. Avoid spaghetti code[1] as the plague.
- Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
- Model versioning: add a hash key to your different models. You will thank me later.
- Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
- Monitor performances: execution time and statistical scores of your models.
- Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..
Some of the mistakes that might involve during building a machine learning model (I can think of) are listed here:
- Not understanding the structure of the dataset
- Not giving proper care during features selection
- Leaving out categorical features and considering just numerical variables
- Falling into dummy variable trap
- Selection of inefficient machine learning algorithm
- Not trying out various ML algorithms for building the model based on structure of data.
- Improper tuning of model parameters
- Most importantly: Building an idiotstic imperfect model i.e. suppose we have a classification problem with 99% chances of falling into class1 and remaining to class2. The built model may develop a mapping function which all the time for all data inputs, may predict the result to be class1. Well, one might say his/her model has 99% accuracy. But in reality the 1% class2 case hasn’t been included in the model. So this must be taken into consideration.
- Read more here…
[appbox appstore 1560083470-iphone screenshots]
[appbox googleplay com.awssolutionarchitectassociateexampreppro.app]
Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.
That’s just the surface-level comparison though. The image above gives an overview of how the two differ.
One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected output are patterns or trends, which doesn’t require coming up with a statement or fact to test.
However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis. Read more here ….
The data science life cycle is not something well-defined like the software development life-cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life-cycle of a data science project depends on various data scientist skills and data science tools. The typical life-cycle of a data science project involves jumping back and forth among various interdependent science tasks using a variety of tools, techniques, programming, etc.
Thus, the data science life-cycle can include the following steps:
- Business requirement understanding.
- Data collection.
- Data cleaning.
- Data analysis.
- Modeling.
- Performance evaluation.
- Communicating with stakeholders.
- Deployment.
- Real-world testing.
- Business buy-in.
- Support and maintenance.
Looks neat, but here is the scheme to visualize how it is happening in reality:
Agile development processes, especially continuous delivery lends itself well to the data science project life-cycle. The early comparison helps the data science team to change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build it.
[appbox appstore 1611045854-iphone screenshots]
[appbox microsoftstore 9n8rl80hvm4t-mobile screenshots]
Machine Learning Q&A -Part II:
At a high level, these skills are a combination of software and data engineering.
The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.
That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:
- Well structured code: it doesn’t need to be perfect but at least can be understood and updated by other team members. Avoid spaghetti code[1] as the plague.
- Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
- Model versioning: add a hash key to your different models. You will thank me later.
- Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
- Monitor performances: execution time and statistical scores of your models.
- Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..
Some of the mistakes that might involve during building a machine learning model (I can think of) are listed here:
- Not understanding the structure of the dataset
- Not giving proper care during features selection
- Leaving out categorical features and considering just numerical variables
- Falling into dummy variable trap
- Selection of inefficient machine learning algorithm
- Not trying out various ML algorithms for building the model based on structure of data.
- Improper tuning of model parameters
- Most importantly: Building an idiotstic imperfect model i.e. suppose we have a classification problem with 99% chances of falling into class1 and remaining to class2. The built model may develop a mapping function which all the time for all data inputs, may predict the result to be class1. Well, one might say his/her model has 99% accuracy. But in reality the 1% class2 case hasn’t been included in the model. So this must be taken into consideration.
- Read more here…
Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.
That’s just the surface-level comparison though. The image above gives an overview of how the two differ.
One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected output are patterns or trends, which doesn’t require coming up with a statement or fact to test.
However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis. Read more here ….
The data science life cycle is not something well-defined like the software development life-cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life-cycle of a data science project depends on various data scientist skills and data science tools. The typical life-cycle of a data science project involves jumping back and forth among various interdependent science tasks using a variety of tools, techniques, programming, etc.
Thus, the data science life-cycle can include the following steps:
- Business requirement understanding.
- Data collection.
- Data cleaning.
- Data analysis.
- Modeling.
- Performance evaluation.
- Communicating with stakeholders.
- Deployment.
- Real-world testing.
- Business buy-in.
- Support and maintenance.
Looks neat, but here is the scheme to visualize how it is happening in reality:
Agile development processes, especially continuous delivery lends itself well to the data science project life-cycle. The early comparison helps the data science team to change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build it.
Machine Learning Latest News
Top 10 Machine Learning Algorithms
What are the simplest examples of machine learning algorithms?
Source: Top 10 Machine Learning Algorithms for Data Scientist
In machine learning, there’s something called the “No Free Lunch” theorem. In a nutshell, it states that no one algorithm works best for every problem. It’s especially relevant for supervised learning. For example, you can’t say that neural networks are always better than decision trees or vice-versa. Furthermore, there are many factors at play, such as the size and structure of your dataset. As a result, you should try many different algorithms for your problem!
Top ML Algorithms
1. Linear Regression
Regression is a technique for numerical prediction. Additionally, regression is a statistical measure that attempts to determine the strength of the relationship between two variables. One is a dependent variable. Other is from a series of other changing variables which are our independent variables. Moreover, just like Classification is for predicting categorical labels, Regression is for predicting a continuous value. For example, we may wish to predict the salary of university graduates with 5 years of work experience. We use regression to determine how much specific factors or sectors influence the dependent variable.
Linear regression attempts to model the relationship between a scalar variable and explanatory variables by fitting a linear equation. For example, one might want to relate the weights of individuals to their heights using a linear regression model.
Additionally, this operator calculates a linear regression model. It uses the Akaike criterion for model selection. Furthermore, the Akaike information criterion is a measure of the relative goodness of a fit of a statistical model.
2. Logistic Regression
Logistic regression is a classification model. It uses input variables to predict a categorical outcome variable. The variable can take on one of a limited set of class values. A binomial logistic regression relates to two binary output categories. A multinomial logistic regression allows for more than two classes. Examples of logistic regression include classifying a binary condition as “healthy” / “not healthy”. Logistic regression applies the logistic sigmoid function to weighted input values to generate a prediction of the data class.
A logistic regression model estimates the probability of a dependent variable as a function of independent variables. The dependent variable is the output that we are trying to predict. The independent variables or explanatory variables are the factors that we feel could influence the output. Multiple regression refers to regression analysis with two or more independent variables. Multivariate regression, on the other hand, refers to regression analysis with two or more dependent variables.
3. Linear Discriminant Analysis
Logistic Regression is a classification algorithm traditionally for two-class classification problems. If you have more than two classes then the Linear Discriminant Analysis algorithm is the preferred linear classification technique.
The representation of LDA is pretty straight forward. It consists of statistical properties of your data, calculated for each class. For a single input variable this includes:
- The mean value for each class.
- The variance calculated across all classes.
We make predictions by calculating a discriminate value for each class. After that we make a prediction for the class with the largest value. The technique assumes that the data has a Gaussian distribution. Hence, it is a good idea to remove outliers from your data beforehand. It’s a simple and powerful method for classification predictive modelling problems.
4. Classification and Regression Trees
Prediction Trees are for predicting response or class YY from input X1, X2,…,XnX1,X2,…,Xn. If it is a continuous response it is a regression tree, if it is categorical, it is a classification tree. At each node of the tree, we check the value of one the input XiXi. Depending on the (binary) answer we continue to the left or to the right subbranch. When we reach a leaf we will find the prediction.
Contrary to linear or polynomial regression which are global models, trees try to partition the data space into small enough parts where we can apply a simple different model on each part. The non-leaf part of the tree is just the procedure to determine for each data xx what is the model we will use to classify it.
5. Naive Bayes
A Naive Bayes Classifier is a supervised machine-learning algorithm that uses the Bayes’ Theorem, which assumes that features are statistically independent. The theorem relies on the naive assumption that input variables are independent of each other, i.e. there is no way to know anything about other variables when given an additional variable. Regardless of this assumption, it has proven itself to be a classifier with good results.
Naive Bayes Classifiers rely on the Bayes’ Theorem, which is based on conditional probability or in simple terms, the likelihood that an event (A) will happen given that another event (B) has already happened. Essentially, the theorem allows a hypothesis to be updated each time new evidence is introduced. The equation below expresses Bayes’ Theorem in the language of probability:
Let’s explain what each of these terms means.
- “P” is the symbol to denote probability.
- P(A | B) = The probability of event A (hypothesis) occurring given that B (evidence) has occurred.
- P(B | A) = The probability of the event B (evidence) occurring given that A (hypothesis) has occurred.
- P(A) = The probability of event B (hypothesis) occurring.
- P(B) = The probability of event A (evidence) occurring.
6. K-Nearest Neighbors
k-nearest neighbours (or k-NN for short) is a simple machine learning algorithm that categorizes an input by using its k nearest neighbours.
For example, suppose a k-NN algorithm has an input of data points of specific men and women’s weight and height, as plotted below. To determine the gender of an unknown input (green point), k-NN can look at the nearest k neighbours (suppose ) and will determine that the input’s gender is male. This method is a very simple and logical way of marking unknown inputs, with a high rate of success.
Also, we can k-NN in a variety of machine learning tasks; for example, in computer vision, k-NN can help identify handwritten letters and in gene expression analysis, the algorithm can determine which genes contribute to a certain characteristic. Overall, k-nearest neighbours provide a combination of simplicity and effectiveness that makes it an attractive algorithm to use for many machine learning tasks.
7. Learning Vector Quantization
A downside of K-Nearest Neighbors is that you need to hang on to your entire training dataset. The Learning Vector Quantization algorithm (or LVQ for short) is an artificial neural network algorithm that allows you to choose how many training instances to hang onto and learns exactly what those instances should look like.
Additionally, the representation for LVQ is a collection of codebook vectors. We select them randomly in the beginning and adapted to best summarize the training dataset over a number of iterations of the learning algorithm. After learned, the codebook vectors can make predictions just like K-Nearest Neighbors. Also, we find the most similar neighbour (best matching codebook vector) by calculating the distance between each codebook vector and the new data instance. The class value or (real value in the case of regression) for the best matching unit is then returned as the prediction. Moreover, you can get the best results if you rescale your data to have the same range, such as between 0 and 1.
If you discover that KNN gives good results on your dataset try using LVQ to reduce the memory requirements of storing the entire training dataset.
8. Bagging and Random Forest
A Random Forest consists of a collection or ensemble of simple tree predictors, each capable of producing a response when presented with a set of predictor values. For classification problems, this response takes the form of a class membership, which associates, or classifies, a set of independent predictor values with one of the categories present in the dependent variable. Alternatively, for regression problems, the tree response is an estimate of the dependent variable given the predictors.e
A Random Forest consists of an arbitrary number of simple trees, which determine the final outcome. For classification problems, the ensemble of simple trees votes for the most popular class. In the regression problem, we average responses to obtain an estimate of the dependent variable. Using tree ensembles can lead to significant improvement in prediction accuracy (i.e., better ability to predict new data cases).
9. SVM
A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be employed for both classification and regression purposes. Also, SVMs have more common usage in classification problems and as such, this is what we will focus on in this post.
SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes, as shown in the image below.
Also, you can think of a hyperplane as a line that linearly separates and classifies a set of data.
Intuitively, the further from the hyperplane our data points lie, the more confident we are that they have been correctly classified. We, therefore, want our data points to be as far away from the hyperplane as possible, while still being on the correct side of it.
So when we add a new testing data , whatever side of the hyperplane it lands will decide the class that we assign to it.
The distance between the hyperplane and the nearest data point from either set is the margin. Furthermore, the goal is to choose a hyperplane with the greatest possible margin between the hyperplane and any point within the training set, giving a greater chance of correct classification of data.
But the data is rarely ever as clean as our simple example above. A dataset will often look more like the jumbled balls below which represent a linearly non-separable dataset.
10. Boosting and AdaBoost
Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers. We do this by building a model from the training data, then creating a second model that attempts to correct the errors from the first model. We can add models until the training set is predicted perfectly or a maximum number of models are added.
AdaBoost was the first really successful boosting algorithm developed for binary classification. It is the best starting point for understanding boosting. Modern boosting methods build on AdaBoost, most notably stochastic gradient boosting machines.
AdaBoost is used with short decision trees. After the first tree is created, the performance of the tree on each training instance is used to weight how much attention the next tree that is created should pay attention to each training instance. Training data that is hard to predict is given more weight, whereas easy to predict instances are given less weight. Models are created sequentially one after the other, each updating the weights on the training instances that affect the learning performed by the next tree in the sequence. After all the trees are built, predictions are made for new data, and the performance of each tree is weighted by how accurate it was on training data.
Because so much attention is put on correcting mistakes by the algorithm it is important that you have clean data with outliers removed.
Summary
A typical question asked by a beginner, when facing a wide variety of machine learning algorithms, is “which algorithm should I use?” The answer to the question varies depending on many factors, including: (1) The size, quality, and nature of data; (2) The available computational time; (3) The urgency of the task; and (4) What you want to do with the data.
Even an experienced data scientist cannot tell which algorithm will perform the best before trying different algorithms. Although there are many other Machine Learning algorithms, these are the most popular ones. If you’re a newbie to Machine Learning, these would be a good starting point to learn.
Follow this link, if you are looking to learn Data Science Course Online!
Additionally, if you are having an interest in learning Data Science, Learn online Data Science Course to boost your career in Data Science.
Also, learn AWS Big Data Course click here, AWS Online Course
Furthermore, if you want to read more about data science, read this Data Science blogs
The foundations of most algorithms lie in linear algebra, multivariable calculus, and optimization methods. Most algorithms use a sequence of combinations to estimate an objective function given a set of data, and the sequence order and included methods distinguish one algorithm from another. It’s helpful to learn enough math to read the development papers associated with key algorithms in the field, as many other methods (or one’s own innovations) include pieces of those algorithms. It’s like learning the language of machine learning. Once you are fluent in it, it’s pretty easy to modify algorithms as needed and create new ones likely to improve on a problem in a short period of time.
Matrix factorization: a simple, beautiful way to do dimensionality reduction —and dimensionality reduction is the essence of cognition. Recommender systems would be a big application of matrix factorization. Another application I’ve been using over the years (starting in 2010 with video data) is factorizing a matrix of pairwise mutual information (or pointwise mutual information, which is more common) between features, which can be used for feature extraction, computing word embeddings, computing label embeddings (that was the topic of a recent paper of mine [1]), etc.
Used in a convolutional settings, this acts as an excellent unsupervised feature extractor for images and videos. There’s one big issue though: it is fundamentally a shallow algorithm. Deep neural networks will quickly outperform it if any kind of supervision labels are available.
[1] [1607.05691] Information-theoretical label embeddings for large-scale image classification
Machine Learning Demos:

See how well you synchronize to the lyrics of the popular hit “Dance Monkey.” This in-browser experience uses the Facemesh model for estimating key points around the lips to score lip-syncing accuracy.Explore demo View code

Use your phone’s camera to identify emojis in the real world. Can you find all the emojis before time expires?Explore demo View code

Play Pac-Man using images trained in your browser.Explore demo View code

No coding required! Teach a machine to recognize images and play sounds.Explore demo View code

Explore pictures in a fun new way, just by moving around.Explore demo View code

Enjoy a real-time piano performance by a neural network.Explore demo View code

Train a server-side model to classify baseball pitch types using Node.js.View code

See how to visualize in-browser training and model behaviour and training using tfjs-vis.Explore demo View code
Community demos
Get started with official templates and explore top picks from the community for inspiration.Glitch
Check out community Glitches and make your own TensorFlow.js-powered projects.Explore Glitch Codepen
Fork boilerplate templates and check out working examples from the community.Explore CodePen GitHub Community Projects
See what the community has created and submitted to the TensorFlow.js gallery page.Explore GitHub
https://cdpn.io/jasonmayes/fullcpgrid/QWbNeJdOpen in Editor
Real time body segmentation using TensorFlow.js
Load in a pre-trained Body-Pix model from the TensorFlow.js team so that you can locate all pixels in an image that are part of a body, and what part of the body they belong to. Clone this to make your own TensorFlow.js powered projects to recognize body parts in images from your webcam and more!
New Pen from Template
https://cdpn.io/jasonmayes/fullcpgrid/qBEJxggOpen in Editor
Multiple object detection using pre trained model in TensorFlow.js
This demo shows how we can use a pre made machine learning solution to recognize objects (yes, more than one at a time!) on any image you wish to present to it. Even better, not only do we know that the image contains an object, but we can also get the co-ordinates of the bounding box for each object it finds, which allows you to highlight the found object in the image.
For this demo we are loading a model using the ImageNet-SSD architecture, to recognize 90 common objects it has already been taught to find from the COCO dataset.
If what you want to recognize is in that list of things it knows about (for example a cat, dog, etc), this may be useful to you as is in your own projects, or just to experiment with Machine Learning in the browser and get familiar with the possibilities of machine learning.
If you are feeling particularly confident you can check out our GitHub documentation (https://github.com/tensorflow/tfjs-models/tree/master/coco-ssd) which goes into much more detail for customizing various parameters to tailor performance to your needs.
New Pen from Template
https://cdpn.io/jasonmayes/fullcpgrid/JjompwwOpen in Editor
Classifying images using a pre trained model in TensorFlow.js
This demo shows how we can use a pre made machine learning solution to classify images (aka a binary image classifier). It should be noted that this model works best when a single item is in the image at a time. Busy images may not work so well. You may want to try our demo for Multiple Object Detection (https://codepen.io/jasonmayes/pen/qBEJxgg) for that.
For this demo we are loading a model using the MobileNet architecture, to recognize 1000 common objects it has already been taught to find from the ImageNet data set (http://image-net.org/).
If what you want to recognize is in that list of things it knows about (for example a cat, dog, etc), this may be useful to you as is in your own projects, or just to experiment with Machine Learning in the browser and get familiar with the possibilities of machine learning.
Please note: This demo loads an easy to use JavaScript class made by the TensorFlow.js team to do the hardwork for you so no machine learning knowledge is needed to use it.
If you were looking to learn how to load in a TensorFlow.js saved model directly yourself then please see our tutorial on loading TensorFlow.js models directly.
If you want to train a system to recognize your own objects, using your own data, then check out our tutorials on “transfer learning”.
New Pen from Template
Open in Editor
Tensorflow.js Boilerplate
The hello world for TensorFlow.js 🙂 Absolute minimum needed to import into your website and simply prints the loaded TensorFlow.js version. From here we can do great things. Clone this to make your own TensorFlow.js powered projects or if you are following a tutorial that needs TensorFlow.js to work.
Examples
tfjs-examples provides small code examples that implement various ML tasks using TensorFlow.js.MNIST Digit Recognizer
Train a model to recognize handwritten digits from the MNIST database.Explore example View code Addition RNN
Train a model to learn addition from text examples.Explore example View code
TensorFlow.js Layers: Iris Demo
More TensorFlow examples
Top-paying Cloud certifications:
[appbox appstore 1611045854-iphone screenshots]
[appbox microsoftstore 9n8rl80hvm4t-mobile screenshots]
- Google Certified Professional Cloud Architect — $175,761/year
- AWS Certified Solutions Architect – Associate — $149,446/year
- Azure/Microsoft Cloud Solution Architect – $141,748/yr
- Google Cloud Associate Engineer – $145,769/yr
- AWS Certified Cloud Practitioner — $131,465/year
- Microsoft Certified: Azure Fundamentals — $126,653/year
- Microsoft Certified: Azure Administrator Associate — $125,993/year
Supervised Learning
Linear Regression
Logistic Regression
Naive Bayes
Support Vector Machines
Decision Trees
K-Nearest Neighbors
Machine Learning in Practice
Bias-Variance Tradeoff
How to Select a Model
How to Select Features
Regularizing Your Model
Ensembling: How to Combine Your Models
Evaluation Metrics
Unsupervised Learning
Market Basket Analysis
K-Means Clustering
Principal Components Analysis
Deep Learning
Feedforward Neural Networks
Grab Bag of Neural Network Practices
Convolutional Neural Networks
Recurrent Neural Networks
Test Your Knowledge
Best Subset Features Feature
Selection Examples
Adding Features Example
Activation Practice I
Activation Practice II
Activation Practice III
Weight Initialization
Batch vs. Stochastic
Convolutional Application
Convolutional Layer Advantages
Are you interested in becoming an AWS Certified Machine Learning Specialist? If so, then this exam preparation blog is for you! The blog contains over 100 quiz and practice exam questions, as well as detailed answers. The questions are very similar to those you will encounter on the actual exam, so this is a great way to prepare. In addition, the blog also includes cheat sheets and illustrations to help you understand the concepts better.
Bring your own algorithm to an MLOps Pipeline: Architecture




Code and Serve Your ML Model with AWS CodeBuild


What are some ways we can use machine learning and artificial intelligence for algorithmic trading in the stock market?
How do we know that the Top 3 Voice Recognition Devices like Siri Alexa and Ok Google are not spying on us?
What are some good datasets for Data Science and Machine Learning?
Machine Learning Engineer Interview Questions and Answers
- Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention [P]by /u/seraschka (Machine Learning) on May 17, 2026 at 1:41 pm
submitted by /u/seraschka [link] [comments]
- Help in ML algos [D]by /u/NoAnybody8034 (Machine Learning) on May 17, 2026 at 1:39 pm
So see, I’ve learned ML algorithms theoretically, but practically I have little to no experience. So can you guys suggest some resources through which I can understand which algorithms work well on which kinds of datasets? How is everything done step by step? submitted by /u/NoAnybody8034 [link] [comments]
- ML lead vs PM on eval-methodology layer independence. who's actually right here? [D]by /u/Critical_Builder_902 (Machine Learning) on May 17, 2026 at 9:16 am
got into an argument with our ML lead at 11pm yesterday about an eval methodology a PM had built off a framework she learned at an AI PM cohort. shes claiming a layered defense framework, hes saying the layers are statistically conditioned and her independence claim is wrong. they both have a point. the framework as taught at the cohort (it was Product Faculty's, fwiw) is genuinely useful for non-eng PMs. it forces explicit thinking about behavioral checks vs adversarial probes vs traditional metrics. but the way it's been taught in the abridged form makes the layers sound independent when they statistically arent. for ML/AI engineers here who've worked with non-eng PMs on production eval. how do you handle the gap between the simplified eval frameworks PMs learn and the actual statistical interactions in production? specifically interested in how you've negotiated the conversation with a PM who's ""done the cohort"" and shows up with a framework that's solid in its public form but has subtle issues in its statistical foundations. submitted by /u/Critical_Builder_902 [link] [comments]
- Program misleading high school students into paying to perform academic misconduct in ML Research [D]by /u/Marisu_BG (Machine Learning) on May 17, 2026 at 6:08 am
I was browsing OpenReview and I came accross this person called Kevin Zhu https://openreview.net/profile?id=~Kevin_Zhu3, lets say I was impressed when I saw 158 publications and 468 coauthors, and out of curiosity I searched up his afflication (https://algoverseairesearch.org/) Turns out it is a paid program, and most interesting it is marketed towards high school students. They have a whole column of papers listed as Neurips publications (their website states: 289 Algoverse Students Accepted to NeurIPS 2025). I was originally unware of the rigor of Neurips workshops and I was understandably very shocked. I skimmed through four of their papers one by one. Every single one had errors that would be caught by opening the PDF and reading it once. I am completely unsure how they are not caught by reviewers even at a workshop. https://openreview.net/forum?id=21pxWVRoPL - Appendix Tables 6.5 and 6.6 are supposed to report two different experimental conditions: "Stigma Negative" and "Stigma Positive." One measures what happens when the user pushes the model toward a negative association with a stigmatized group. The other measures the opposite direction. These are fundamentally different experiments, yet they have the exact same numbers in the results. There are typo in the Abstract section, their Related Works is within Results section. Citations are completely wrong, which I suspect to be AI generated. https://openreview.net/pdf?id=0BYRYwGCbK - 711 broken prompts in a dataset that claims human review. The results say the opposite of the abstract. The abstract claims the work "reveals novel methods to elicit sycophancy." Then they proceed to show most modifiers perform about the same as the unmodified control (91-95% accuracy). Moreover, their citations also seem AI generated with false citations (wrong authors, wrong formats ..) Interestingly, undisclosed self-citation by Kevin Zhu. https://openreview.net/pdf?id=VcRUAT5G8I - Two foundational methods are attributed to the wrong paper. TIES merging and Task Arithmetic, two well known methods, was introduced but never cited. Same AI generated citations, I am not even going to get to the content anymore. https://openreview.net/pdf?id=It7AgR3A9H - eleven authors, zero contribution. Four papers, that I RANDOMLY CLICKED ON WITH NO ORDER, all follow the same template take existing method -> run it with some variation, likely done by AI -> put Kevin Zhu as an author -> submit to workshop I am unsure how any of these bypass any form of peer review process, only today I learned how low the bar is for workshops. Why I am posting: It angers to me when you market this to high schoolers and tell them you can get into Stanford and MIT. A 16 year old look at this and say, if I pay $3,325, I can get a Neurip publication. Then they proceed to let them publish a paper clear errors. This is academic dishonesty, but I dont think the kids even know they are commiting it. Kevin Zhu puts his name on every single paper published, self-cite himself in these paper, and charge student $3,325. I wasn't fully aware of how much lighter the workshop review process is, and I really want to hear why this is. submitted by /u/Marisu_BG [link] [comments]
- Anyone from India attending EEML ? [D]by /u/Suhan_XD (Machine Learning) on May 16, 2026 at 4:08 pm
I got accepted into EEML and I’m a little confused about travel and stay. Has anyone else from India been accepted? Let’s connect! submitted by /u/Suhan_XD [link] [comments]
- Made and Published a Paper Comparing Analysis of CNN and Vision Transformer Architectures for Brain Tumor Detection [R]by /u/Mental-Climate5798 (Machine Learning) on May 16, 2026 at 3:47 pm
Hi everyone 😄 A while ago I worked on a project where I compared computer vision architectures on detecting and classifying brain tumors in brain MRI scans. I was looking for some feedback on the methodology and really anything else--just simple research stuff. This isn't meant to be some big paper but a small research project that I did as a high schooler. Here is the paper: zenodo.org/records/15973756 I appreciate any feedback! submitted by /u/Mental-Climate5798 [link] [comments]
- Do you agree with Judea that learning from data is not everything? [D]by /u/xTouny (Machine Learning) on May 16, 2026 at 2:46 pm
Link: Judea Pearl, 2011 ACM Turing Award Recipient (2:18:05) Quote: There is a limitation to that which people not everybody understand. I already mentioned a limitation that you have a hierarchy here and going from correlation to causation and from causation from causation to explanation or to imagination. It's hard for people especially in machine learning to grasp that wall the limitation of one layer where one layer ends and the other one begins. Why? Because of two things. Machine learning school of thought has two paradigms that they love everybody love. Number one tabula raza I don't want to get any opinion I don't want to get any preconceived knowledge I want to derive everything by myself let the computer learn it and you find the word learning overused .. The other handcuff is let's do it the way that the brain does it. So if it looks like neurons interacting, it's good. If it looks like knowledge coming from rule system, it's bad because it's man-made .. Now there's limitation to that. We can prove today that you cannot do certain things by looking at data and data only. It's not a matter of opinion. It's a matter of mathematical proof that you cannot you can look at people who take aspirin all day and people whether or not they have headache all day and you cannot prove that the aspirin is what causes the headache. In particular, Judea states: "It's not a matter of opinion. It's a matter of mathematical proof". So we have formal proof that there are fundamental limits of learning from data. Judea later in the interview states we have solutions to problems faced by the machine learning community; nonetheless they are not adopted because of hype. Discussion. Do you agree with Judea? submitted by /u/xTouny [link] [comments]
- For those in corporate roles, how do you all work with the non-technical areas you support?by /u/SkipGram (Data Science) on May 16, 2026 at 12:50 pm
I've spent the past few years at what feels like a somewhat dysfunctional company. Our Data Science and Engineering teams are very siloed away from the rest of the company, including the teams we support and build things for. IC individuals rarely interact with those requesting the work, and myself and many of my peers have the common challenge of needing to talk to the people who asked for what we're building, but we're often told no we can't go talk to them. This is one of our biggest pain points, and it makes it very difficult to know if I'm making the most sensible choices given the goals of the work. In the small amount of conversations I have been able to be in with our non-tech teams, it feels like there's this constant tension. Some of my team's 'vision' for the future feels more like changing another area's business strategy instead of using Data Science to support them with their actual stated strategy. Maybe these two things can work towards the same goals in the future, but from the small amount I've seen now, we're rowing in a different direction than the teams we're supposed to be helping, and I'm worried this will harm trust and the ability to influence in the future if there are places we want to suggest different ways of approaching a problem. I'm not in enough of the conversations I need to be in to have this context though. Is it like this at other companies? I know the economy and job market are pretty rough right now, but as I'm thinking about longer term decisions, I want a company where there's a functional relationship between business and technology and those of us building can actually speak to the people we're building for. Building the best technical solution doesn't matter if it doesn't actually help the people it's for, or have a way to be incorporated into current processes. I'm just not sure how to assess this from the outside or how common this is. submitted by /u/SkipGram [link] [comments]
- Backlash against Arxiv's proposed 1 year ban is genuinely perplexing. [D]by /u/NeighborhoodFatCat (Machine Learning) on May 16, 2026 at 8:30 am
Anyone else surprised at the enormous amount of backlash against Arxiv's proposed 1 year ban for authors and coauthors publishing papers with hallucinated reference and other obvious LLM/Gen AI artifacts? https://x.com/tdietterich/status/2055000956144935055 https://xcancel.com/tdietterich/status/2055000956144935055 Some of the responses: "This is the age of AI, Arxiv should be part of the movement instead of holding onto the old ways" "The P.I. is a macro-manager, not a micro-manager, can't be expected to read every reference that his/her student puts in." "I publish 20+ papers a year with my students, how do you expect me to read everything?" "What about teams with 100s of people? How can you expect the authors to check references?" "Who reads references in depth anyways!?" These responses are very revealing how academia works. Apparently people have just been slapping names on research papers they've never even read or fact-checked themselves. Very obscene! submitted by /u/NeighborhoodFatCat [link] [comments]
- [R] Which LLMs are actually best for bleeding-edge Linux/ML debugging workflows in 2026? [R]by /u/minaco5mko (Machine Learning) on May 16, 2026 at 5:01 am
I’m trying to optimize an AI workflow for bleeding-edge Linux/ML debugging (Arch/CachyOS, CUDA, Python, unsloth, etc.). Current stack: - Claude = deep reasoning/mastermind - Gemini 3.1 Pro = execution/logistics - Perplexity = retrieval Main problem: Gemini often gives high-friction or impractical fixes and degrades badly in long troubleshooting sessions. Example: suggested a long Podman workflow for an unsloth/Python issue where micromamba solved it much faster. I also have access to hosted open models: - Qwen 3 Coder 30B - Qwen 3.5 122B - Mistral Large 675B - DeepSeek R1 Distill 70B etc. Question: For people doing real-world Linux/ML/debugging workflows (not benchmarks), what currently works best as the “execution/logistics” model with strong web/recent-ecosystem awareness? I care more about: - practical fixes - low friction - stable long sessions - debugging quality than benchmark scores. submitted by /u/minaco5mko [link] [comments]
- KDD 2026 Cycle 2 Results [D]by /u/ATadDisappointed (Machine Learning) on May 16, 2026 at 4:07 am
Results for the research track have been released. submitted by /u/ATadDisappointed [link] [comments]
- ROCm with PyTorch and PyTorch Lightning seems to still suck for research [D]by /u/QuantumQuokka (Machine Learning) on May 16, 2026 at 12:01 am
So I asked about people's experiences with ROCm in a post a few weeks or so ago https://www.reddit.com/r/MachineLearning/comments/1t6cng3/rocm_status_in_mid_2026_d/ I actually went and procured a RX 7900XTX reference version to give it a try My discovery is that it kind of still sucks I have a small codebase for training flow matching models (SANA Architecture), which runs fine on my RTX3090s. But the moment I ported it across to ROCm it was NaNs absolutely everywhere. Forward passes were absolutely fine, but the moment you called backwards() all bets were off. The code was kept identical, apart from altering the pip environment to point to torch2.12 with ROCm7.2 instead of CUDA Trying everything from switching between bf16, fp32, to tweaking various environment variables yielded nothing. Unless there's some trick I'm missing, I get the feeling that ROCm is still seriously behind. I tried running the nanoGPT training script, which ran perfectly My intuition is that the ROCm people have probably tested their stack on established well known codebases. But, it's still remarkably fragile on even slightly uncommon code. submitted by /u/QuantumQuokka [link] [comments]
- No feeling quite lower than...by /u/MeLikaDoTheChaCha (Data Science) on May 15, 2026 at 10:49 pm
crushing the system design interview just to bomb the pandas-live coding interview even though you've been using pandas everyday for 10 years. If anyone wants feedback on how that feels like hmu. Anyone know if they sell kegs of Jager? Asking for a friend... submitted by /u/MeLikaDoTheChaCha [link] [comments]
- Struggling with Overfitting on Medical Imaging Task [D]by /u/Future-Structure-296 (Machine Learning) on May 15, 2026 at 8:16 pm
Hi everyone, I’m working on a 2-class classification problem (LCA vs. RCA coronary arteries) using 2D X-ray angiograms. I’m currently stuck in a cycle of extreme overfitting and could use some advice on my training strategy. The Setup: Dataset: Small (~900 training frames from ~300 unique DICOMs). Architecture: InceptionV3 (PyTorch). Input: Grayscale .npy arrays converted to 3-channel, resized to 299x299. Current Strategy: Transfer learning from ImageNet. I’ve tried full unfreezing and partial unfreezing (last blocks). The Problem: My training accuracy hits ~95-99% within a few epochs, but validation accuracy peaks early (around 74-79%) and then collapses toward 30-40% as the model starts memorizing the specific textures of the training patients. What I’ve Tried So Far: Normalization: Standard ImageNet mean/std (applied at load time). Class Weights: Handled 2:1 imbalance (LCA:RCA). Regularization: Added Dropout (tried 0.3 to 0.6) and Weight Decay (1e-4). Augmentation: Flips, 25deg rotations, and translation. Schedulers: ReduceLROnPlateau (factor 0.5, patience 8). Would love any insights or papers you'd recommend for small-sample medical classification. Thanks! submitted by /u/Future-Structure-296 [link] [comments]
- Publication Topics Questionby /u/InfamousTrouble7993 (Data Science) on May 15, 2026 at 6:02 pm
Hi, i am looking for topics to cover in a potential publication, as I will have a few months free time. The problem is, I am struggling to decide for a potential problem statement to focus on, to find a solution/get insights about it. I asked ai what kind of problems are covered in papers currently, but the response was not satisfying for me. Now I ask this in this com. Are you currently working on problems and know about additional problems to tackle? My experience fields: statistics/probability theory machine/deep learning natural language processing submitted by /u/InfamousTrouble7993 [link] [comments]
- Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion [R]by /u/Franck_Dernoncourt (Machine Learning) on May 15, 2026 at 5:21 pm
Paper: https://arxiv.org/abs/2605.12825 Code: https://github.com/chiennv2000/orthrus Disclosure: co-author. Idea: Inject a trainable diffusion attention module into each layer of a frozen AR Transformer. Both heads share one KV cache. Diffusion head projects K=32 tokens in parallel; AR head verifies in a second pass and accepts the longest matching prefix. Output distribution is provably identical to the base model. Results: Up to 7.8× TPF, ~6× wall-clock on MATH-500. 16% of params trained, <1B tokens, 24h on 8×H200. vs. diffusion LMs (Dream, Fast-dLLM-v2, SDAR, Mercury, Gemini Diffusion): they modify base weights and lose accuracy (Fast-dLLM-v2: -11 pts on MATH-500). Orthrus freezes the backbone; accuracy matches Qwen3-8B exactly. vs. Speculative Decoding (EAGLE-3, DFlash): No external drafter, no separate cache, and zero Time-To-First-Token (TTFT) penalty because we don't have to initialize and sync a separate drafter model. KV overhead is O(1) (~4.5 MiB flat). Acceptance length on MATH-500: 11.7 vs. 7.9 (DFlash) vs. 3.5 (EAGLE-3). Single-step denoising beats multi-step (6.35 vs. 3.53 TPF). KL distillation beats CE on acceptance rate. Limitations: strictly bounded by the frozen base model (inherits its biases, hallucinations, knowledge gaps); Qwen3-only evaluation; greedy + rejection sampling only. https://i.redd.it/5lsf6l5w4c1h1.gif submitted by /u/Franck_Dernoncourt [link] [comments]
- PINN is predicting trivial solution for stiff ODE [D]by /u/cae_shot (Machine Learning) on May 15, 2026 at 4:07 pm
I am learning physics informed neural networks. Currently, I am solving a simple second ODE (damped harmonic oscillator). The equation is m*d2y/dt2 + mu*dy/dt + k*y = 0 (bcs: y(t=0) = 1, y'(t=0) = 0). I managed to draft a code. The code works for k values upto 50. However, when increased the value beyond 50, PINN is predicting trivial solution. I tried several things: reducing the learning rate, increasing the data points, reusing the weights trained using lower k values, and using a for loop to increase the k value in smaller steps (step size 20). However, none of them helped. Could you help me with this. Thanks in advance. submitted by /u/cae_shot [link] [comments]
- Looking for a real world dataset (or website where i can find it) [P]by /u/novromeda (Machine Learning) on May 15, 2026 at 2:39 pm
Hi guys, I’m gonna do a data analysis project based on data privacy, bias and data interpretability. For this reason our professor asked for a real world dataset in order to analyze a real case. Additionally I would prefer the least anonymity possible for that dataset in order to create some interesting technique over it (differential privacy, k-anonimity exc…) Do you have any advice where to find the dataset? (links or website names) Because I checked on Kaggle but I don’t know how to find if the dataset is real or not submitted by /u/novromeda [link] [comments]
- software trying to catch software is officially a dead en [D]by /u/bebo117722 (Machine Learning) on May 15, 2026 at 2:36 pm
I feel like we've crossed a weird threshold in the generative AI space where the arms race against botnets is just over. and the bots won I was reading that interview recently where the Reddit CEO was floating the idea of using Face ID and Touch ID just to verify that commenters are actual humans. it honestly hit me how absurd things have gotten. standard heuristics and behavioral analysis are completely useless now against modern LLMs, and vision models solve captchas faster than I can. the dead internet theory is basically just our daily engineering reality at this point we are at a stage where the only reliable way to prove you aren't an automated script is to literally anchor your digital presence to your physical biology. From a purely technical standpoint, it’s fascinating seeing the shift toward hardware verification. like looking at the engineering behind that Orb device the idea of doing local biometric iris hashing on custom hardware just to output a zero-knowledge proof of personhood. It's wild that we actually need dedicated physical devices now just to enforce the concept of "one human, one account" it makes total sense why platforms are pushing for this, beacuse trying to build software firewalls against infinitely scalable AI agents is a losing battle. but it just feels like such a massive, permanent shift for how the internet works. idk, is anyone else working on sybil resistance right now? are we just collectively accepting that biometric hardware gates are the only way to save the web from being 99% synthetic noise? submitted by /u/bebo117722 [link] [comments]
- Applied Scientist Interview Prepby /u/LeaguePrototype (Data Science) on May 15, 2026 at 11:32 am
What is the applied scientist interview like at Amazon/Uber/any other place that has it? Do you mostly prep leetcode or causal inf? Or what to expect? I'm a bit lost for how difficult these interviews are and what is the most difficult part of them? Personally my stats/ML is pretty good but I struggle with leetcode mediums submitted by /u/LeaguePrototype [link] [comments]

![Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention [P]](https://external-preview.redd.it/3uqj0ajRBVkyrwbp33jfW4ch4z-dzwPoBcFOStkO5FE.jpeg?width=640&crop=smart&auto=webp&s=7ea24b021e496957dd14e253fa3a020d1ed33a9a)
![Made and Published a Paper Comparing Analysis of CNN and Vision Transformer Architectures for Brain Tumor Detection [R]](https://preview.redd.it/fgq8e0swsi1h1.png?width=640&crop=smart&auto=webp&s=1376d513ddfdeeb56ba9ad93366b037a322d2e87)
![Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion [R]](https://preview.redd.it/5lsf6l5w4c1h1.gif?frame=1&width=140&height=72&auto=webp&s=023b02ee925924f982e6eea09bbeefbe039d00ab)






















96DRHDRA9J7GTN6