

Elevate Your Career with AI & Machine Learning For Dummies PRO and Start mastering the technologies shaping the future—download now and take the next step in your professional journey!
Ace the AWS Certified Data Engineer Exam DEA-C01: Mastering AWS Services for Data Ingestion, Transformation, and Pipeline Orchestration.
Unlock the full potential of AWS and elevate your data engineering skills with “Ace the AWS Certified Data Engineer Exam.” This comprehensive guide is tailored for professionals seeking to master the AWS Certified Data Engineer – Associate certification. Authored by Etienne Noumen, a seasoned Professional Engineer with over 20 years of software engineering experience and 5+ years specializing in AWS data engineering, this book provides an in-depth and practical approach to conquering the certification exam.
Inside this book, you will find:
• Detailed Exam Coverage: Understand the core AWS services related to data engineering, including data ingestion, transformation, and pipeline orchestration.
• Practice Quizzes: Challenge yourself with practice quizzes designed to simulate the actual exam, complete with detailed explanations for each answer.
• Real-World Scenarios: Learn how to apply AWS services to real-world data engineering problems, ensuring you can translate theoretical knowledge into practical skills.
• Hands-On Labs: Gain hands-on experience with step-by-step labs that guide you through using AWS services like AWS Glue, Amazon Redshift, Amazon S3, and more.
• Expert Insights: Benefit from the expertise of Etienne Noumen, who shares valuable tips, best practices, and insights from his extensive career in data engineering.
This book goes beyond rote memorization, encouraging you to develop a deep understanding of AWS data engineering concepts and their practical applications. Whether you are an experienced data engineer or new to the field, “Ace the AWS Certified Data Engineer Exam” will equip you with the knowledge and skills needed to excel.
AI-Powered Professional Certification Quiz Platform
Web | iOS | Android | Windows
🚀 Power Your Podcast Like AI Unraveled: Get 20% OFF Google Workspace!
Hey everyone, hope you're enjoying the deep dive on AI Unraveled. Putting these episodes together involves tons of research and organization, especially with complex AI topics.
A key part of my workflow relies heavily on Google Workspace. I use its integrated tools, especially Gemini Pro for brainstorming and NotebookLM for synthesizing research, to help craft some of the very episodes you love. It significantly streamlines the creation process!
Feeling inspired to launch your own podcast or creative project? I genuinely recommend checking out Google Workspace. Beyond the powerful AI and collaboration features I use, you get essentials like a professional email (you@yourbrand.com), cloud storage, video conferencing with Google Meet, and much more.
It's been invaluable for AI Unraveled, and it could be for you too.
Start Your Journey & Save 20%
Google Workspace makes it easy to get started. Try it free for 14 days, and as an AI Unraveled listener, get an exclusive 20% discount on your first year of the Business Standard or Business Plus plan!
Sign Up & Get Your Discount Here
Use one of these codes during checkout (Americas Region):
Business Standard Plan: 63P4G3ELRPADKQU
Business Standard Plan: 63F7D7CPD9XXUVT
Set yourself up for promotion or get a better job by Acing the AWS Certified Data Engineer Associate Exam (DEA-C01) with the eBook or App below (Data and AI)

Download the Ace AWS DEA-C01 Exam App:
iOS - Android
AI Dashboard is available on the Web, Apple, Google, and Microsoft, PRO version
Business Standard Plan: 63FLKQHWV3AEEE6
Business Standard Plan: 63JGLWWK36CP7W
Invest in your future today by enrolling in this Azure Fundamentals - Pass the Azure Fundamentals Exam with Ease: Master the AZ-900 Certification with the Comprehensive Exam Preparation Guide!
- AWS Certified AI Practitioner (AIF-C01): Conquer the AWS Certified AI Practitioner exam with our AI and Machine Learning For Dummies test prep. Master fundamental AI concepts, AWS AI services, and ethical considerations.
- Azure AI Fundamentals: Ace the Azure AI Fundamentals exam with our comprehensive test prep. Learn the basics of AI, Azure AI services, and their applications.
- Google Cloud Professional Machine Learning Engineer: Nail the Google Professional Machine Learning Engineer exam with our expert-designed test prep. Deepen your understanding of ML algorithms, models, and deployment strategies.
- AWS Certified Machine Learning Specialty: Dominate the AWS Certified Machine Learning Specialty exam with our targeted test prep. Master advanced ML techniques, AWS ML services, and practical applications.
- AWS Certified Data Engineer Associate (DEA-C01): Set yourself up for promotion, get a better job or Increase your salary by Acing the AWS DEA-C01 Certification.
Business Plus Plan: M9HNXHX3WC9H7YE
With Google Workspace, you get custom email @yourcompany, the ability to work from anywhere, and tools that easily scale up or down with your needs.
Need more codes or have questions? Email us at info@djamgatech.com.
Prepare to advance your career, validate your expertise, and become a certified AWS Data Engineer. Embrace the journey of learning, practice consistently, and master the tools and techniques that will set you apart in the rapidly evolving world of cloud data solutions.
Get your copy today and start your journey towards AWS certification success!

Get the Ace AWS DEA-C01 Exam eBook at Djamgatech: https://djamgatech.com/product/ace-the-aws-certified-data-engineer-exam-ebook
Get the Ace AWS DEA-C01 Exam eBook at Google: https://play.google.com/store/books/details?id=lzgPEQAAQBAJ
Get the Ace AWS DEA-C01 Exam eBook at Apple: https://books.apple.com/ca/book/ace-the-aws-certified-data-engineer-associate/id6504572187
Get the Ace AWS DEA-C01 Exam eBook at Etsy: https://www.etsy.com/ca/listing/1749511877/ace-the-aws-certified-data-engineer-exam
Get the Ace AWS DEA-C01 Exam eBook at Shopify: https://djamgatech.myshopify.com/products/ace-the-aws-certified-data-engineer-exam
The FREE Android App for AWS Certified Data Engineer Associate Exam Preparation is out and available at: https://play.google.com/store/apps/details?id=app.web.awsdataengineer.twa
Sample Quiz:
Practice Quiz 1:
A finance company is storing paid invoices in an Amazon S3 bucket. After the invoices are uploaded, an AWS Lambda function uses Amazon Textract to process the PDF data and persist the data to Amazon DynamoDB. Currently, the Lambda execution role has the following S3 permission:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ExampleStmt",
      "Action": ["s3:*"],
      "Effect": "Allow",
      "Resource": ["*"]
    }
  ]
}
The company wants to correct the role permissions specific to Amazon S3 according to security best practices.
Which solution will meet these requirements?
A. Append “s3:GetObject” to the Action. Append the bucket name to the Resource.
B. Modify the Action to be “s3:GetObjectAttributes”. Modify the Resource to be only the bucket name.
C. Append “s3:GetObject” to the Action. Modify the Resource to be only the bucket ARN.
D. Modify the Action to be “s3:GetObject”. Modify the Resource to be only the bucket ARN.
Practice Quiz 1 – Correct Answer: D.
According to the principle of least privilege, permissions should apply only to what is necessary. The Lambda function needs only the permissions to get the object. Therefore, this solution has the most appropriate modifications.
Learn more about least-privilege permissions.
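As a rough illustration of answer D, the sketch below builds a tightened policy as a Python dictionary and prints it as JSON. The bucket name `example-invoice-bucket` and the helper `least_privilege_s3_policy` are hypothetical and not part of the quiz. One assumption to flag: the quiz speaks of "the bucket ARN", but because s3:GetObject is an object-level action, the sketch writes the resource as the bucket ARN followed by `/*` so that it matches the objects inside the bucket.

```python
import json

# Hypothetical bucket name, used only for illustration.
BUCKET_NAME = "example-invoice-bucket"


def least_privilege_s3_policy(bucket_name: str) -> dict:
    """Build a read-only policy scoped to the objects of a single bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ExampleStmt",
                # Only the one action the Lambda function needs.
                "Action": ["s3:GetObject"],
                "Effect": "Allow",
                # Assumption: object-level actions are matched against
                # object ARNs, hence the "/*" after the bucket ARN.
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            }
        ],
    }


policy = least_privilege_s3_policy(BUCKET_NAME)
print(json.dumps(policy, indent=2))
```

Compared with the original statement, both the action list and the resource list shrink from wildcards to a single entry each, which is the core of the least-privilege change.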
Practice Quiz 2:
A data engineer is designing an application that will transform data in containers managed by Amazon Elastic Kubernetes Service (Amazon EKS). The containers run on Amazon EC2 nodes. Each containerized application will transform independent datasets and then store the data in a data lake. Data does not need to be shared to other containers. The data engineer must decide where to store data before transformation is complete.
Which solution will meet these requirements with the LOWEST latency?
A. Containers should use an ephemeral volume provided by the node’s RAM.
B. Containers should establish a connection to Amazon DynamoDB Accelerator (DAX) within the application code.
C. Containers should use a PersistentVolume object provided by an NFS storage.
D. Containers should establish a connection to Amazon MemoryDB for Redis within the application code.
Practice Quiz 2 – Correct Answer: A.
Amazon EKS is a container orchestrator that provides Kubernetes as a managed service. Containers run in pods. Pods run on nodes. Nodes can be EC2 instances, or nodes can use AWS Fargate. Ephemeral volumes exist with the pod’s lifecycle. Ephemeral volumes can access drives or memory that is local to the node. The data does not need to be shared, and the node provides storage. Therefore, this solution will have lower latency than storage that is external to the node.
Learn more about Amazon EKS storage.
Learn more about persistent storage for Kubernetes.
Learn more about EC2 instance root device volume.
Learn more about Amazon EKS nodes.
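One way to picture answer A is a pod that mounts a memory-backed `emptyDir` volume as scratch space. The sketch below builds such a manifest as a Python dictionary and prints it as JSON, which `kubectl` accepts alongside YAML. The pod name, image, mount path, and size limit are hypothetical placeholders; only the `emptyDir` volume with `medium: Memory` reflects the technique itself.

```python
import json


def pod_with_memory_scratch(name: str, image: str, size_limit: str = "1Gi") -> dict:
    """Build a Pod manifest whose scratch volume is backed by node RAM."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": "transform",
                    "image": image,
                    # Intermediate data is written here before it is
                    # loaded into the data lake.
                    "volumeMounts": [{"name": "scratch", "mountPath": "/scratch"}],
                }
            ],
            "volumes": [
                {
                    "name": "scratch",
                    # An emptyDir lives and dies with the pod;
                    # medium "Memory" backs it with RAM on the node.
                    "emptyDir": {"medium": "Memory", "sizeLimit": size_limit},
                }
            ],
        },
    }


# Hypothetical pod name and image, for illustration only.
manifest = pod_with_memory_scratch("transform-job", "example.com/transform:latest")
print(json.dumps(manifest, indent=2))
```

Because the volume is local to the node and is not shared, no network hop is involved, which matches the reasoning in the answer explanation. Data in the volume is lost when the pod goes away, so it suits only intermediate results.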

Resources and Tips:
Ace the AWS DEA-C01 GPT:

Courses
New! Sessions on Twitch by AWS DevRel teams focused on DEA Exam
Free beginner-level courses from AWS Skill Builder.
Fundamentals of Data Analytics on AWS: the current Skill Builder course linked from the certification site is ending in Feb and is being split into 2 new courses:
Ace the AWS Certified Data Engineer Exam Book Preview

Deciphering the Marketing Landscape: Latest Insights & Trends for 2023


In the dynamic world of marketing, trends evolve at a breakneck speed. As consumers become more discerning and digitally connected, their preferences and behavior patterns shift, requiring marketers to stay ahead of the curve. With each passing year, some strategies solidify their ground, while others wane. Dive into our curated compilation of the latest marketing insights and trends for 2023. Whether you’re a seasoned marketer or a curious entrepreneur, these findings offer a snapshot of the changing consumer landscape and emerging marketing frontiers. Get ready to recalibrate, reimagine, and reshape your strategies!
1. The Eroding Value of “Sustainability”
Recent research on palm oil reveals a surprising trend: consumers favor products labeled “free-from palm oil” over those stamped “sustainably produced palm oil.” This shift stems from overuse of the term “sustainable,” which seems to be losing its weight in the marketplace. That raises concerns, especially as WWF emphasizes that abandoning palm oil isn’t the right solution.
2. Packaging: The Silent Salesperson
Kerry’s latest research underscores that 72% of consumers believe brands can help them reduce waste by extending the shelf life of food through better packaging. This trend is not isolated. European publication Amcor’s findings align, showing a growing demand for improved packaging. Going forward, marketers must spotlight their packaging efforts more prominently.
3. Cars and Consumers: A Telling Connection
Recent data from the 2023 GWI Commerce Report shows a peculiar trend: 40% of recent car purchasers also invested in a domestic vacation. In another intriguing finding, consumers tend to make impulse purchases after physical activity. While not a new revelation, it’s worth noting for potential marketing strategies.
4. Prime Day vs. Black Friday
Amazon’s Prime Day is carving out its niche, with 4% of consumers favoring it over the traditional Black Friday. But with US consumer confidence fluctuating in October, it will be worth monitoring Amazon’s trajectory in the coming year.
5. Rethinking Boomer Representation in Ads?
Gen-Z and Millennials largely attribute their financial concerns to the Baby Boomer generation, according to OnePoll data. With Gen-Z’s growing bias against Baby Boomers, marketers might need to reevaluate how this age group is represented in advertising campaigns.
6. The UK’s Growing Love for Loyalty Discounts
A significant portion of UK consumers is trading brand loyalty for alluring discounts. Findings from the Data & Marketing Association and American Express emphasize the importance of loyalty schemes. Given the current political and economic landscape, loyalty schemes could be the game-changer for UK retailers.
7. Snapshots from Other Reports:
- A whopping $80B is lost to ad fraud, according to new insights from Juniper Research.
- Mobile advertising is booming in the UK, with over 60% of companies planning to ramp up their budgets.
- Gen-X feels overlooked in TV advertising, says Wavemaker Studio.
- Beauty industry, take note: consumers crave educational content, says a report from Happi.
- Italy’s consumer spending is expected to dip by approximately $3.7B, data from Ansa suggests.
Conclusion: Staying updated with the ever-evolving marketing landscape is vital for businesses to make informed decisions. From the waning trust in sustainability claims to the UK’s growing penchant for loyalty schemes, marketers need to remain agile and receptive to these shifts.

References:
1- I read over 100 Marketing Papers
Podcast transcript:
Welcome to the Djamgatech Marketing podcast, your go-to source for the latest trends and insights in the world of marketing. In today’s episode, we’ll cover the latest marketing insights and trends for 2023, including consumer preferences, improved packaging, investments in vacations, the popularity of Prime Day, generational differences, loyalty discounts, the rise of mobile ad budgets, neglected Gen-X in TV ads, the demand for educational beauty content, and the expected decrease in Italy’s consumer spending. Additionally, we’ll highlight the importance of staying updated in marketing for informed decisions on sustainability claims and UK loyalty schemes.
In the fast-paced world of marketing, trends come and go faster than you can say “advertise.” As consumers get pickier and more plugged in, their tastes and habits shift, forcing marketers to keep up with the times. Each year brings new opportunities and challenges, with some strategies becoming tried and true, while others fade into obscurity. But fear not, because we’ve got you covered. Take a deep dive into our meticulously curated collection of the freshest marketing insights and trends for 2023. Whether you’re a seasoned marketing guru or just starting out, these findings will give you a great snapshot of what’s happening in the ever-changing world of consumers and marketing. So get ready to adapt, think outside the box, and reshape your strategies to stay ahead of the game. It’s time to embrace the future!
So, let’s dive right into some interesting research findings that shed light on important consumer trends. First up, recent studies on palm oil reveal that consumers now prefer products labeled as “free-from palm oil” rather than those labeled as “sustainably produced palm oil.” It seems that the term “sustainable” has become so overused that it’s losing its impact in the marketplace. However, we need to be cautious about completely abandoning palm oil, as organizations like WWF emphasize. They argue that the solution lies not in abandoning palm oil, but in finding sustainable ways to produce it.
Now let’s talk about the power of packaging. Kerry’s latest research shows that a whopping 72% of consumers believe that brands can help them reduce waste by improving the packaging of food and extending its shelf life. And this trend is not just limited to one study. European publication Amcor’s findings align with Kerry’s research, revealing a growing demand for better packaging. So, moving forward, marketers need to highlight their packaging efforts more prominently in order to cater to this consumer demand.
Next, let’s take a look at an interesting connection between car purchases and consumer behavior. Data from the 2023 GWI Commerce Report shows that 40% of recent car purchasers also invested in a domestic vacation. A separate finding suggests that consumers tend to make impulse purchases after physical activity. While this may not be a groundbreaking revelation, it’s definitely worth noting for potential marketing strategies.
We can’t talk about consumer trends without mentioning the impact of major shopping events. Amazon’s Prime Day, which has gained popularity in recent years, now has 4% of consumers favoring it over the traditional Black Friday. However, with US consumer confidence fluctuating in October, it’ll be intriguing to see how Amazon’s trajectory plays out in the coming year.
Moving on to demographics, recent data suggests that Gen-Z and Millennials have significant financial concerns that they often attribute to the Baby Boomer generation. OnePoll data reveals a growing bias among Gen-Z against Baby Boomers. With this in mind, marketers might need to reevaluate the representation of this age group in their advertising campaigns in order to better resonate with younger consumers.
Let’s now shift our focus to the UK, where loyalty discounts are gaining popularity among consumers. A significant portion of UK consumers is trading brand loyalty for attractive discounts. The Data & Marketing Association, along with American Express, emphasizes the importance of loyalty schemes in the current political and economic landscape. It seems that loyalty schemes could be the game-changer for retailers in the UK.
Now, let’s take a quick look at some snapshots from other reports. First, new insights from Juniper Research reveal that a staggering $80 billion is lost to ad fraud. This highlights the need for stricter measures to combat fraudulent advertising practices. Second, mobile advertising is booming in the UK, with over 60% of companies planning to increase their budgets in this area. This showcases the growing importance of mobile platforms in reaching targeted audiences. Third, Wavemaker Studio reports that Gen-X feels overlooked in TV advertising. This demographic segment is seeking more representation and targeted messaging in TV ads for better engagement. Fourth, a report from Happi emphasizes that consumers in the beauty industry crave educational content. This highlights the opportunity for beauty brands to create informative and educational content to better connect with consumers. Finally, data from Ansa suggests that Italy’s consumer spending is expected to dip by approximately $3.7 billion. This indicates a potential shift in consumer behavior and purchasing power in the country.
That wraps up our exploration of some recent research findings and their implications for marketers. It’s fascinating how consumer trends evolve and shape the strategies businesses need to adopt to stay relevant. Stay tuned for more insights and updates in the ever-changing world of marketing and consumer behavior.
So, here’s the thing. In today’s fast-paced world, staying on top of the latest trends and developments in marketing is absolutely crucial. Why? Well, because it allows businesses to make smart and informed decisions that can ultimately lead to success. Trust me, you don’t want to be left in the dust while your competitors are flourishing.
One interesting observation that has been made is the growing skepticism around sustainability claims. Consumers are becoming more discerning and are not just going to blindly believe every green marketing message they come across. This means that businesses need to be extra careful and make sure their sustainability efforts are truly authentic and transparent.
Now, let’s talk about loyalty schemes. Apparently, the UK has been going crazy for them. People just can’t seem to get enough of those reward programs and discounts. And you know what? Marketers need to take notice of this. Loyalty schemes can be a powerful tool to not only retain existing customers but also to attract new ones.
By the way, I came across some interesting resources that might pique your interest. It seems that a Redditor by the name of lazymentors has gathered a treasure trove of marketing papers from the subreddit r/Marketing. I’m talking about over 100 papers! So, if you’re looking to expand your knowledge and stay in the loop, you might want to check it out.
In conclusion, my friend, the marketing landscape is constantly evolving, and it’s our job to stay agile and receptive. Trust is fading in sustainability claims, and loyalty schemes are all the rage in the UK. So, let’s keep our eyes peeled and make sure we’re on top of these shifts.
In this episode, we covered the latest marketing insights and trends for 2023, including strategies to recalibrate in the evolving consumer landscape, the importance of improved packaging, the rising popularity of Prime Day, the cost of ad fraud, and growing mobile ad budgets. Stay informed and make better marketing decisions with our recap of the top items covered. Thank you for joining us on the Djamgatech Marketing podcast, where we delve into the latest marketing trends and provide insightful information. Be sure to subscribe and stay tuned for our next episode!
Deciphering the Marketing Landscape: What are the most wanted digital marketing skills?
Data storytelling. Don’t just share data; share why it’s important and what to do with it. A big reason I got my last few jobs was being able to show that I can interpret data and explain what to do with it.
It boggles my mind sometimes that many agencies don’t do this correctly. Follow the McKinsey model:
Data synthesis
Summary
“Why this data matters/what it means”
What to do with it
Data can become your best sales strategy when it is coupled with a strong message focused on the outcomes users are hiring the product or service for (jobs-to-be-done theory).
Here is the TLDR for the best tips without knowing your case in more detail (feel free to read the deep dive if you want):
Share multiple data points but keep it focused
Don’t overdo it on the number of decks
Remember that you’ll probably have to pivot at least once
Detectives don’t solve cases off one single data point and neither should marketing decisions be made (in my humble opinion)
Deep dive:
Point number 1: 2-3 data points are enough to make a solid case (for example, if you’re trying to show which topics or content ideas an audience resonates with, look at engagement rates on those topics across different channels). If it’s SEO, use two different tools and find the patterns. Those are the most obvious bleeds.
Point number 2: Early in my career I made the mistake of creating 50+ PowerPoint slides. It was great research, but we ended up using only 20% of that data. A huge waste of time and energy, and incredibly inefficient.
Point number 3: The reality is that pivots are bound to happen, unless you’re working with a team that’s patient enough to let a strategy come to fruition or you make the right decision based on the data the first time (business acumen develops as you grow in your career).
The most important skill is one for which you can prove an ROI. For that, I say lead gen.
Organic is:
“local” SEO (when you see a local company appear on the ‘map’ in search result near you)
regular SEO (regular search results under the map)
email marketing to an established email list
growing social media accounts
Paid is:
Google PPC ads
FB ads
any other… TikTok, Instagram, etc.
I focus on Google PPC with Local SEO.
Pick a path and watch as much educational content on it as you can. Work for free initially. Then go wild.
SEO is highly wanted, and so are Google Ads and Facebook ads. I choose two things to become an expert in, and for everything else I know just enough to be able to do it. It also depends on where you get hired. Whatever you decide you want to do, become an expert in it, as there is a huge shortage of experts out there.
After 23 years in the industry, and with quite high demand as an independent consultant and advisor, I would say that what people want is for you to solve their problems. And in digital marketing and growth, problems are very complex and multidisciplinary. OK, they want ads to run smoothly and cheaply, but you need to get the data stack right so you track everything, and you need to raise the conversion rate, which involves something like six tools plus the website, and you need to orchestrate everything to out-optimize your competitors. It is T-shaped knowledge, but with many deep knowledge areas, plus an understanding of how everything interacts. For example: how page speed increases conversions, decreases CPA on paid, and improves SEO, and how you can improve it. I think that is what is lacking in most growth agencies. They see things as silos; they hire specialists with two years of experience in PPC or SEO or whatever, but those specialists have no clue about why the rest is important.
I think you can only gain that knowledge if you have been running your own sites or web apps, from creation to monetization. That gives you a great understanding of how things are orchestrated. Above all, you need to be able to move seamlessly between the strategic, tactical, and operational levels, and to communicate equally well with CEOs and with developers who have poor social skills.
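To make the point about conversions and CPA concrete, here is a small arithmetic sketch. It assumes the usual definition of cost per acquisition as ad spend divided by conversions, with conversions equal to clicks times conversion rate. All figures are invented for illustration, and the function name is hypothetical.

```python
def cost_per_acquisition(ad_spend: float, clicks: int, conversion_rate: float) -> float:
    """Return ad spend divided by the number of conversions."""
    conversions = clicks * conversion_rate
    if conversions <= 0:
        raise ValueError("CPA is undefined when there are no conversions")
    return ad_spend / conversions


# Invented numbers: same spend and traffic, with the conversion rate
# moving from 2.0% to 2.5% (for example after a page speed improvement).
baseline = cost_per_acquisition(ad_spend=1000.0, clicks=1000, conversion_rate=0.020)
improved = cost_per_acquisition(ad_spend=1000.0, clicks=1000, conversion_rate=0.025)

print(f"baseline CPA: {baseline:.2f}")
print(f"improved CPA: {improved:.2f}")
```

With spend and clicks held constant, any lift in conversion rate lowers CPA proportionally, which is why a change made on the website can show up in the paid media numbers.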
Deciphering the Marketing Landscape: One-Minute Daily Marketing News

Deciphering the Marketing Landscape: What Happened In Marketing October 17th 2023
Meta launches new formats and updates for Reels Ads.
Google launches new tool to manage first-party data easily.
YouTube launches Audio Descriptions & Pronouns for Creators.
FTC proposes a new rule to fight against hidden fees in Product Prices.
Google’s multiple security updates focused on user privacy.
EU warns all Social Media Apps to do better moderation of content.
TikTok partners with Disney to introduce Disney Content and Elements.
Update to API, allowing better Direct Posting for Third-party apps.
TikTok shares more facts about user data privacy.
TikTok expands Effect House Rewards Program to more regions.
New Reports about TikTok rewarding creators to pump live shopping.
IG set to bring back Creator Cash Bonuses.
Instagram shares new tips for E-commerce shops in a Post.
Threads App gets new post editing and Voice notes feature.
Meta’s AI Chatbots are not working in the best way possible.
Facebook UK sales surged ahead of Ad Downturn.
WhatsApp testing Event Creation for Groups.
X aims to take on Substack by allowing article publishing, says Elon.
X’s efforts to launch live-streaming features are coming together.
Expanded Bios are live on X Desktop.
New Features & Updates to X’s Security & Content Reporting.
X launches new updates to Community Notes to increase reliability.
Google SGE AI now helps to create Images and Content Drafts.
Google Demand Gen Ads roll out to all advertisers.
Disabling Third-party cookies for 1% Chrome Users.
Updating their Ads Policy later this month.
Google Search stops indented search results.
Expands access to Social Media Links for Business profiles.
WFA & MediaSense launch “Future of media Agency” Report.
Stagwell acquires Left Field Labs, A digital Agency.
Publicis Groupe Posts 5.3% growth in Q3.
Dentsu partners with VideoAmp for Ad buying.
Virgin Voyages gives its Global Media Account to Hearts & Science.
Idris Elba’s agency launches first campaign for Sky Cinema.
Wavemaker & Merlin Entertainment extend their partnership.
GroupM Betas Walmart Retail Media Certification Program.
Taco Bell & Deutsch LA partner with Pete Davidson for new campaign.
Lloyds Banking Group appoints new CMO.
N26 Bank launches new global brand campaign.
Dr. Martens launches new Brand Platform “Made Strong”.
Netflix to open retail sites in 2025 as Brand move.
ASICS & City of Paris’s latest campaign launched on Mental Health Day.
Uber Eats launches “Never eat dirt again” campaign in Taiwan.
Stagwell launches Harris Quest, AI research-as-a-service tool.
Google assures Companies of legal coverage when using their AI Models.
Adobe announces AI-generated Image to Video Tool.
Adobe also announced new content credential tag for AI.
Optimizely launches new Marketing OS powered by AI.
Microsoft launches bug bounty program to improve Bing AI.
Microsoft completes acquisition of Activision Blizzard.
Pinterest to announce Q3 Results on 30th Oct.
Pinterest partnered with Anthropologie for Holiday Season Shophouse.
Snap My AI could face ban in UK over child privacy concerns.
Reddit launches new report on TV & Film Entertainment.
IAS partners with Instacart Ads to improve transparency.
Atlassian to buy Loom for nearly $1 Billion.
Inmobi launches new identity resolution tool.
Jetpack WordPress adds new AI updates.
Paramount adds iSpot as New Currency partner.
The Guardian unveiled new UK Ad council.
Yahoo’s Cookieless ID in partnership with Twilio.
Twitch to go through another round of layoffs.
New feature to Follow WordPress blog through Mastodon.
Twitch adds anti-harassment features to stop banned users.
What I read about Gen-Z Consumers this Month. (No Calls)

1/ 35% of Gen-Z respondents associate TikTok more with influencers, and Gen-Zers are less likely to follow influencers on non-social apps. (Report)
2/ 41% plan to start shopping by the end of October and 37% of Gen-Zers plan to spend more this season, according to Shopify data.
3/ e.l.f. remains the No. 1 cosmetics brand, increasing 13 points Y/Y to 29% for female teens. And 90% of Gen-Zers prefer Apple products.
4/ Gen-Z doesn’t like phone calls and mostly prefers online chat & WhatsApp to connect with friends and others, according to data from The Sun.
5/ 19% of US adults aged 18-34 are actively saving in case of layoffs, compared to only 13% of older adults.
6/ Black Gen-Zers are hiding their names on job applications and being more private, new data shows.
7/ 83% of Gen-Z workers are job hoppers. (CNBC)
8/ Gen-Z wants feminine care products to become more blunt and clear in their Ad Copies. (NY Post)
9/ A majority of Gen-Z students trust college education, according to a new report exposing online gurus. (Gallup)
10/ 73% of Gen-Z Americans have changed their spending habits because of inflation. 43% now prefer to cook at home, 40% spend less on clothes and 33% are limiting their spending to essentials. (Bank of America)
11/ Gen-Zers are struggling to find third places to network and make friends. Many are paying for multiple memberships to make friends.
12/ Harvard’s research suggests that Gen-Z is 27% more likely to buy from sustainable brands. However, new research from Kantar shows that Gen-Z distrusts sustainability advertising.
13/ Gen-Z & Millennials are making impulse purchases based on social media suggestions, according to new data from Bankrate.
Deciphering the Marketing Landscape: What Happened In Marketing October 16th 2023

TikTok launches Search Ads Toggle, allowing brands to display ads in search results.
TikTok enhances data security and localized storage in US, Singapore, and Malaysia.
TikTok unveils Direct Post feature for smoother third-party platform content sharing.
Meta shared photos of the business onboarding steps for MetaVerified for Business
Instagram’s new “Avatar interactions” setting lets you control who can interact with your avatar
Instagram is working on a new sticker: Music Pick
Facebook is killing its Notes feature on Nov 13th
Facebook Messenger added a tab called Channels
Threads now showing the “Suggested for you” section in feed.
X rolls out new ad format that can’t be reported, blocked
X is working on giving streamers options on who can join their chat before the start of the stream
Google tests generative AI in Search for creating imagery and drafting text.
Passkeys introduced for secure, fingerprint-based login on eBay, Uber, and WhatsApp
Twitch update empowers streamers to block banned users from viewing their livestreams.
Duolingo will launch language learning lessons through Duolingo Music and Duolingo Math in the EU as well
CapCut added a new AI-based feature, AI model
Early preview unveiled for ‘X calling’ feature.
Facebook seeks feedback from Meta Verified subscribers on service quality.
Facebook starts showing the page name in the app header, and it sticks to the header when scrolling through the page.
TikTok enables mentioning videos via audio page in user-created content.
TikTok update removes auto-generated captions from post, privacy settings.
TikTok launches AI meme generation for user-taken or selected photos.
Instagram introduces option for page linking within user accounts.
Instagram extends account activity access to desktop platforms.
Meta offers business support option beyond Meta Verified service.
WhatsApp developing date-specific message search for web client.
WhatsApp Web rolls out ‘Create Channel’ feature for users.
Box unveils Box Hubs, streamlining document access with AI integration.
CharacterAI debuts ‘Character Group Chat’ for multi-user, multi-AI interactions.
Mozilla teams with Fastly, Divvi Up for enhanced Firefox privacy tech.
Elgato introduces web Marketplace, upgrading digital assets exchange for creators.
Search Engine Land Awards 2023 finalists announced, winners to be revealed Oct. 17.
Snapchat encourages gifting Snapchat+ to friends on upcoming birthdays.
Spotify trials top playback controls during in-app scrolling.
I analyze over 200 headlines per week. Here’s a well-known psychological bias you can use to drive a tonne of clicks

“Harvard psychologist: 7 things the most passive-aggressive people always do—and the No. 1 way to respond”
This article is trending hard on CNBC Make It.
Sure, it’s good content.
But the headline clearly plays a huge role in its success.
Confirmation bias is a psychological effect where people seek information to validate their pre-existing beliefs.
“Please tell me I’m right”.
To effectively use confirmation bias in headlines:
– Identify behaviors your audience likely has strong beliefs or opinions about
– Write a headline that appears to confirm or challenge that belief
In this headline, passive aggression is the behavior many have encountered or been accused of.
A lot of people have pre-existing beliefs about what it looks like.
The headline suggests there are definitive behaviors that passive-aggressive people exhibit.
Readers want to know whether their own beliefs will be confirmed or challenged.
So they click to find out.
It’s brilliant.
Other psychological effects that make this headline an absolute click magnet:
Authority Bias – “Harvard psychologist”. Readers are more likely to click when a headline implies endorsement from an expert.
Social Identity Theory: People will always want to identify with certain groups (in-groups) and distance themselves from others (out-groups).
They’ll seek out content to determine which “bucket” they fall into.
Do people they know fall into the “passive-aggressive” bucket? Do they themselves fall into that bucket?
They can’t help but click to find out.
Examples from different niches:
Productivity: “The 7 App Habits Of Highly Productive People”
Pre-existing belief – Productive people do or do not use apps a certain way.
Personal Finance – “The Actual Impact Of Cutting Out Coffee On Your Savings”
Pre-existing belief – Cutting out a daily coffee will or will not have a meaningful impact on savings.
Parenting – “Does Strict Parenting Actually Lead To Academic Success?”
Pre-existing belief – Strict parenting does or does not lead to academic success.
——————————————————–
*Disclaimer* – The content needs to match the expectations set by the title.
That’s what makes a title clickworthy as opposed to clickbait.
Also, the content shouldn’t be written with the sole purpose of being provocative. It should solve real problems and provide real value.
Giving it a juicy title is just how you make sure it’s actually read and that value is delivered.
As Ogilvy says:
“On the average, five times as many people read the headline as read the body copy. When you have written your headline, you have spent eighty cents out of your dollar.”
Deciphering the Marketing Landscape: What Happened In Marketing October 01-07 2023
X is looking to launch Ad-free Premium Tier for users.
Instagram announces an option to share Instagram Stories with only a certain number of followers, grouped in lists.
Reddit expands its learning hub with new courses and updates.
Google releases October 2023 Broad Core Update.
Deutsch New York plans to lay off about 19% of staff.
Youtube Testing New Community Notes Feed on Mobile.
DDB WorldWide names Alex Lubar as global CEO.
Snapchat announces “Phantom House”, a new activation for Halloween.
X has ruined everything for link sharing with new Link Preview UI.
VMLY&R Named Lead Creative Agency for World of Hyatt.
GA4 adds new features to improve data security and report accuracy.
BeReal launches a new global campaign, trying to win back attention.
Meta rolls out AI Tools for Advertisers.
X is testing a new Ad format that you can’t report or fight back against.
M&S Appoints Mother as Creative Agency for UK Business.
Non-Alcohol Brands are testing Sober October campaigns, Ritual biggest one so far.
Netflix global Ad president departs after 13 months. Now, Amy Reinhard is the new Ad President.
Mullenlowe retains US Military Account for Recruiting Marketing, Account worth $450M.
US Ad Employment grew by 3k Jobs in Sep 2023.
Google’s October 2023 Spam Update also launched.
IG testing Ad Carousels with tag “you might like” with 5 Different Ads side by side.
Watched 8 hours of MrBeast’s content. Here are 7 psychological strategies he’s used to get 34 billion views

MrBeast can fill giant stadiums and launch 8-figure candy companies on demand.
He’s unbelievably popular.
Recently, I listened to the brilliant marketer Phil Agnew being interviewed on the Creator Science podcast.
The episode focused on how MrBeast’s near-academic understanding of audience psychology is the key to his success.
Better than anyone, MrBeast knows how to get you to:
– Click on his content (increase his click-through rate)
– Stick around (increase his retention rate)
He gets you to click by using irresistible thumbnails and headlines.
I watched 8 hours of his content.
To build upon Phil Agnew’s work, I made a list of 7 psychological effects and biases he’s consistently used to write headlines that get clicked into oblivion.
Even the most aggressively “anti-clickbait” purists out there would benefit from learning the psychology of why people choose to click on some content over others.
Ultimately, if you don’t get the click, it really doesn’t matter how good your content is.
1. Novelty Effect
MrBeast Headline: “I Put 100 Million Orbeez In My Friend’s Backyard”
MrBeast often presents something so out of the ordinary that viewers have no choice but to click and find out more.
That’s the “novelty effect” at play.
Our brain’s reward system is engaged when we encounter something new.
You’ll notice that the headline examples you see in this list are extreme.
MrBeast takes things to the extreme.
You don’t have to.
Here’s your takeaway:
Consider breaking the reader/viewer’s scrolling pattern by adding some novelty to your headlines.
How?
Here are two ways:
Find the unique angle in your content
Find an unusual character in your content
Examples:
“How Moonlight Walks Skyrocketed My Productivity”.
“Meet the Artist Who Paints With Wine and Chocolate.”
Headlines like these catch the eye without requiring 100 million Orbeez.
2. Costly Signaling
MrBeast Headline: “Last To Leave $800,000 Island Keeps It”
Here’s the 3-step click-through process at play here:
MrBeast lets you know he’s invested a very significant amount of time and money into his content.
This signals to whoever reads the headline that it’s probably valuable and worth their time.
They click to find out more.
Costly signaling is all about showcasing what you’ve invested into the content.
The higher the stakes, the more valuable the content will seem.
In this example, the $800,000 island he’s giving away just screams “This is worth your time!”
Again, your headlines don’t need to be this extreme.
Here are two examples with a little more subtlety:
“I built a full-scale botanical garden in my backyard”.
“I used only vintage cookware from the 1800s for a week”.
Not too extreme, but not too subtle either.
3. Numerical Precision
MrBeast knows that using precise numbers in headlines just works.
Almost all of his most popular videos use headlines that contain a specific number.
“Going Through The Same Drive Thru 1,000 Times”
“$456,000 Squid Game In Real Life!”
Yes, these headlines also use costly signaling.
But there’s more to it than that.
Precise numbers are tangible.
They catch our eye, pique our curiosity, and add a sense of authenticity.
“The concreteness effect”:
Specific, concrete information is more likely to be remembered than abstract, intangible information.
“I went through the same drive thru 1000 times” is more impactful than “I went through the same drive thru countless times”.
4. Contrast
MrBeast Headline: “$1 vs $1,000,000 Hotel Room!”
Our brains are drawn to stark contrasts and MrBeast knows it.
His headlines often pit two extremes against each other.
It instantly creates a mental image of both scenarios.
You’re not just curious about what a $1,000,000 hotel room looks like.
You’re also wondering how it could possibly compare to a $1 room.
Was the difference wildly significant?
Was it actually not as significant as you’d think?
It increases the audience’s *curiosity gap* enough to get them to click and find out more.
Here are a few ways you could use contrast in your headlines effectively:
1. Transformational Content:
“From $200 to a $100M Empire – How A Small Town Accountant Took On Silicon Valley”
Here you’re contrasting different states or conditions of a single subject.
Transformation stories and before-and-after scenarios.
You’ve got the added benefit of people being drawn to aspirational/inspirational stories.
2. Direct Comparison
“Local Diner Vs Gourmet Bistro – Where Does The Best Comfort Food Lie?”
5. Nostalgia
MrBeast Headline: “I Built Willy Wonka’s Chocolate Factory!”
Nostalgia is a longing for the past.
It’s often triggered by sensory stimuli – smells, songs, images, etc.
It can feel comforting and positive, but sometimes bittersweet.
Nostalgia can provide emotional comfort, identity reinforcement, and even social connection.
People are drawn to it and MrBeast has it down to a tee.
He created a fantasy world most people on this planet came across at some point in their childhood.
While the headline does play on costly signaling here as well, nostalgia does help to clinch the click and get the view.
Subtle examples of nostalgia at play:
“How this [old school cartoon] is shaping new age animation”.
“[Your favorite childhood books] are getting major movie deals”.
6. Morbid Curiosity
MrBeast Headline: “Surviving 24 Hours Straight In The Bermuda Triangle”
People are drawn to the macabre and the dangerous.
Morbid curiosity explains why you’re drawn to situations that are disturbing, frightening, or gruesome.
It’s that tension between wanting to avoid harm and the irresistible desire to know about it.
It’s a peculiar aspect of human psychology and viral content marketers take full advantage of it.
The Bermuda Triangle is practically synonymous with danger.
The headline suggests a pretty extreme encounter with it, so we click to find out more.
7. FOMO And Urgency
MrBeast Headline: “Last To Leave $800,000 Island Keeps It”
“FOMO”: the worry that others may be having fulfilling experiences that you’re absent from.
Marketers leverage FOMO to drive immediate action – clicking, subscribing, purchasing, etc.
The action is driven by the notion that delay could result in missing out on an exciting opportunity or event.
You could argue that MrBeast uses FOMO and urgency in all of his headlines.
Each one implies that a delay in clicking means missing out.
MrBeast’s time-sensitive challenges, exclusive opportunities, and high-stakes competitions all generate a sense of urgency.
People feel compelled to watch immediately for fear of missing out on the outcome or being left behind in conversations about the content.
Creators, writers, and marketers can tap into FOMO with their headlines without being so extreme.
“The Hidden Parisian Cafe To Visit Before The Crowds Do”
“How [Tech Innovation] Will Soon Change [Industry] For Good”
(Yep, FOMO and urgency are primarily responsible for the proliferation of AI-related headlines these days).
Why This All Matters
If you don’t have content you need people to consume, it probably doesn’t!
But if any aspect of your online business would benefit from people clicking on things more, it probably does.
“Yes, because we all need more clickbait in this world – *eye-roll emoji*” – Disgruntled Redditor
I never really understood this comment but I seem to get it pretty often.
My stance is this:
If the content delivers what the headline promises, it shouldn’t be labeled clickbait.
I wouldn’t call MrBeast’s content clickbait.
The fact is that linguistic techniques can be used to drive people to consume some content over others.
You don’t need to take things to the extremes that MrBeast does to make use of his headline techniques.
If content doesn’t get clicked, it won’t be read, viewed, or listened to – no matter how brilliant the content might be.
While “clickbait” content isn’t a good thing, we can all learn a thing or two from how it generates attention in an increasingly noisy digital world.
Little trick on how I use Quora to grow my business

This really doesn’t cost a lot of time and can be helpful for every business.
In order to leverage Quora effectively for your business, you need relevant questions to answer in the best possible way.
This can be tedious and a lot of work, while your answers can get buried quickly. To maximize the impact, I use this approach:
Look for Quora questions with many views but few answers.
Type in Google:
site:quora.com keyword “1 answer” “k views”
For example, I founded Simple Analytics, a GA4 alternative. So I’m interested in keywords like Google Analytics, GA4, privacy-friendly analytics, etc.:
site:quora.com google analytics “1 answer” “k views”
It will find questions related to your keyword with just one answer but with many views (you can play around with the variables here).
But this is essentially where you want to be! Now provide a thoughtful answer and even mention your business if it fits the context. You’ll be the top rated answer and get many views.
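The search pattern above can be generated for a whole list of keywords. The following is only a minimal sketch: the helper names `build_quora_query` and `google_search_url` are hypothetical, and it assumes straight double quotes for the exact-match phrases and the standard `https://www.google.com/search?q=` URL form.

```python
from urllib.parse import quote_plus


def build_quora_query(keyword: str, answers: str = "1 answer", views: str = "k views") -> str:
    """Build the Google query from the post: site filter, keyword, two exact phrases.

    `answers` and `views` are the "variables" you can play around with,
    e.g. answers="2 answers".
    """
    return f'site:quora.com {keyword.strip()} "{answers}" "{views}"'


def google_search_url(query: str) -> str:
    """URL-encode the query so it can be opened directly in a browser."""
    return "https://www.google.com/search?q=" + quote_plus(query)


if __name__ == "__main__":
    # Example keywords taken from the post; replace with your own niche.
    for kw in ["google analytics", "GA4", "privacy-friendly analytics"]:
        query = build_quora_query(kw)
        print(query)
        print(google_search_url(query))
```

Opening each printed URL gives you the same results as typing the query by hand. You still need to check each question manually before answering.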
The TOP 50 Finance Headlines of 2023: Unraveling the Patterns
Deciphering the Marketing Landscape: Latest News
- How do I find Amazon Affiliate influencers to sell a health and beauty product. by /u/toxichaste12 (Marketing & Advertising) on April 29, 2025 at 12:35 am
We sell a H+B product on Amazon. We just signed up for Amazon Associates in order to create affiliate links for influencers. Now that we are set up, how do we get influencers to know about our product? What are the best sites to build awareness that we want to partner via affiliate links? submitted by /u/toxichaste12 [link] [comments]
- How does the pricing of these smm packages look to you? by /u/Glamour-Ad7669 (Marketing & Advertising) on April 29, 2025 at 12:32 am
I’m a beginning social media manager and I’m struggling to find the right rates for my packages. What do you think of this? Package 1 - €400: - 1-2 platforms - 2 posts per week - 4x video editing - 2x10 minutes engagement per week - Social media strategy - One monthly call + statistics Package 2 - €575: - 1-3 platforms - 3 posts per week - 6x video editing - 3x10 minutes engagement per week - Social media strategy - One monthly call + statistics Package 3 - €700: - 1-3 platforms - 3 posts per week - 1 ig story per week - 8x video editing - 3x10 minutes engagement per week - Social media strategy - One monthly call + statistics submitted by /u/Glamour-Ad7669 [link] [comments]
- IS share drop - Google Ads by /u/NMDigitaNinja (Ads on Google, Meta, Microsoft, Amazon etc.) on April 29, 2025 at 12:30 am
Have a client where we optimize based on Impression share in Google with focus around absolute. Until recently we always averaged around 95% IS. At the beginning of April, we saw a steep decline to around 65-70%. No big changes, budget constant. Ranking actually improved. So while that could open us to more eligible impressions, seems like something else is going on. Any thoughts? submitted by /u/NMDigitaNinja [link] [comments]
- How has your experience been with running Amazon Associate Affiliate links? by /u/ryanrako23 (Marketing & Advertising) on April 29, 2025 at 12:27 am
Just recently got into it. I’m wondering what has your experience been with running those affiliate links. Does it work well? submitted by /u/ryanrako23 [link] [comments]
- Anyone looking to get topical authority on any of your SEO pages? by /u/marketingnerd_ (Marketing & Advertising) on April 28, 2025 at 11:53 pm
I am trying to experiment with a growth hack I have discovered that can get a massive increase in your topical authority for your on-page SEO. If anyone is looking to strengthen their on-page for a specific page, get in touch, I will give you all the sub-pages one of your top competition has for topical authority on that keyword and if you think it will help you, I can give you all the rest of your competitor sub pages, but that would be for a charge. You can then create your own sub pages and internally link them up to rank higher on that keyword. Get in touch! submitted by /u/marketingnerd_ [link] [comments]
- Big opportunity ….going to start a marketing company after this first client….maybe by /u/illydreamer (Marketing & Advertising) on April 28, 2025 at 10:42 pm
Welp so I left ( got canned ) my job in December and started a cargo van business which I was priced out of after the first month. Tons of overhead and broker fees without enough steady work. Ive always loved marketing (especially targeted, direct response) and used it to develop a few of my past ventures. Path came with a fork in the road …do I go into another 9-5 and become miserable again for the rest of my life or do I go full throttle in my own business. A friend of mine who I’m helping develop his gourmet mushroom subscription business for his farm told me his dad may be needing some assistance with two projects. 1. A website design, messaging and marketing for a vacation short term rental property in the Virgin Island who is currently only getting bookings from Airbnb and Vbro and has a static wix website. 2. A water filter company which currently has nothing. They tried to market only on YouTube before calling it quits The thing is I wrote a proposal to him for website design, and marketing the short term rental and he quick accepted. However he is now saying he has many business owners in the island that needs the service. I don’t want to bite off more than I can chew but I know this opportunity will catapult me to the goals I have for myself and business. Should I just freelance this job or go all in and create a marketing agency? Secondly, I charged a good amount but for how quickly he accepted I probably could have went 50% higher … I wouldn’t mind linking up with someone that’s in a similar situation now that may have some time to invest in a marketing agency. Someone versed in SEO, website design and digital ads. I wouldnt mind to bring someone on pretty quickly.. I have two other discovery calls from his friends this week to see how I can help, Would you go all in? submitted by /u/illydreamer [link] [comments]
- Question for B2B marketers - what type of content would be most helpful for you? by /u/tscher16 (Marketing & Advertising) on April 28, 2025 at 9:40 pm
Basically title. I’m an SEO consultant that mostly works with B2B SaaS brands and was wondering if there was any particular content/insights/advice that would be helpful in your day to day. I was thinking mostly for SEO and AI search, but feel free to drop anything organic LinkedIn related too. Not looking to promote or anything, just looking for questions that you’d like to see answered. I also promise I won’t be spamming this subreddit with your answers 😂 submitted by /u/tscher16 [link] [comments]
- Spike in Avg. CPC ... and Click Share? by /u/Complex_Maximum_4004 (Ads on Google, Meta, Microsoft, Amazon etc.) on April 28, 2025 at 9:25 pm
I have been running shopping campaigns for 5 months. Avg CPC had been pretty stable at around $6, but the last 2 weeks it spiked to $10. Alongside that, Click Share went from 20% to 25%. Assuming no parameter changes, what would cause my campaigns to be bidding (and winning) more auctions? I have not found any anomalies from a product, geographic or competitor perspective. Thanks! submitted by /u/Complex_Maximum_4004 [link] [comments]
- LinkedIn B2B campaign - is this the right way? by /u/iwantt0get0ut (Marketing & Advertising) on April 28, 2025 at 8:57 pm
Hi all, I'm setting up my a LinkedIn campaign for an organization. It's a coalition of different parties (housing, temp agencies) trying to raise awareness on certain issues. It's not a traditional brand, but more of a public campaign. Target audience: directors and managers in specific industries. I'm targeting based on job titles, seniority, and company industries (is that enough?). Would love some advice on the setup. Current setup: Objective: website visits (was told brand awareness ads aren't very effective). Ad format: single image ad with a photo from the article. Questions: Should I optimize for landing page clicks or maximum impressions regarding objective? Is a single image ad the right way for this? Is there a reason to pick somethign else? For bidding strategy, should I choose Maximum Delivery, Manual, or Cost Cap? LinkedIn suggests a bid amount, should I base it on that? Feeling a bit overwhelmed by the options. Would love your advice! 🙏 submitted by /u/iwantt0get0ut [link] [comments]
- Local products for Pmax - Shopping Question by /u/Maleficent-Duck-4230 (Ads on Google, Meta, Microsoft, Amazon etc.) on April 28, 2025 at 8:56 pm
Hi all, running a Pmax for shopping goals campaign but noticed it doesn't give the option to include "local products" in campaign settings as PLAs do. Is there a way to enable that in a pmax shopping campaign? submitted by /u/Maleficent-Duck-4230 [link] [comments]
- Freelancers? Google ads only or offer more services? by /u/CompBang330 (Ads on Google, Meta, Microsoft, Amazon etc.) on April 28, 2025 at 8:50 pm
Just wondering whether all you freelancers offer the one service e.g. Google ads or you cover others like Meta, TikTok etc? Would you also cover CRO? submitted by /u/CompBang330 [link] [comments]
- I need help for marketing scooter by /u/Trick_Definition11 (Marketing & Advertising) on April 28, 2025 at 8:49 pm
I need help with scooter marketing I made an agreement with a company: if I manage to bring them 4 customers who buy their scooter through my marketing efforts, they will send me a scooter for free. Initially, I proposed that they send me a scooter right away so I could create videos, do affiliate marketing, and help increase the product's popularity in this region, but they suggested this approach instead. I plan to open social media profiles (especially on Instagram) where I would post designs, comparisons, and other content related to the scooter — mainly photos, videos, and reviews. It’s important to mention that nothing is mandatory: I’m not obligated to bring customers, but if I succeed, they will reward me with a scooter. So, if anyone has advice or can help me figure out how to bring in those 4 customers, I would really appreciate it! submitted by /u/Trick_Definition11 [link] [comments]
- WordPress AI search by /u/Abject-Roof-7631 (Marketing & Advertising) on April 28, 2025 at 8:36 pm
I am a small business with a lot of content that's blog oriented. I also have a ton of video that has been transcribed. I think it is very challenging for my users to find content on my website and I'd like to have an AI capability where someone just types in a question and it presents the answer and somehow integrated to wordpress. Any suggestions on plugins that come to mind for this capability? I'd like the capability to extend to any web page. submitted by /u/Abject-Roof-7631 [link] [comments]
- Search campaign stopped converting after starting with pmax. What might be the reason? by /u/ssmokvaa (Ads on Google, Meta, Microsoft, Amazon etc.) on April 28, 2025 at 8:28 pm
I have a google ads campaign for a school in canada. First 20 days of March I was running only search campaign, and it generated maybe 2 conversions per day on average. Then, around 20th of March I created a pmax campaign with signals and campaign data from the search campaign. From that point on, I have 0 conversions on the search campaign, and pmax has around 2 conversions per day on average. Clicks and impressions stayed the same. What might be the reason, and what can I do? Additional info: I imported Search campaign data as signals into PMax. The location targeting, budget, and overall goals are the same for both campaigns. budget: 15$ per day on search, and 20$ per day on pmax (it was vice versa until recently, then I started changing things around gradually) Bidding Strategy: maximize clicks on search campaign bidding. (I was worried if I change to maximize conversions, it will get into a battle with pmax). Pmax is using maximize conversions conversion goals: Submit lead forms for both of them (same lead form asset) Audience Signals: Google-engaged audiences, All visitors (google ads), All visitors (google ads) System Defined, All converters Keywords Types: phrase and exact non-branded (new project) Impression Share: pmax: <10% ; search <10%; Search lost IS (budget): 17% , Search lost IS (rank): 78% Saerch campaign Ads strength: good or excellent submitted by /u/ssmokvaa [link] [comments]
- Where would you upskill to make more $$ freelancing? by /u/Tetris_1743 (Marketing & Advertising) on April 28, 2025 at 8:12 pm
I'm currently a marketing generalist and looking to live aboard a sailboat, working part time in a year. I'm quite well rounded at the moment but want to be an 'expert' in an area to generate more income in a smarter way. What area would you specialise in or do you see a gap in? Is there a certain industry? Currently I'm working at a SaaS start-up doing bits of everything. I want to be able to freelance and choose my work based on my lifestyle without strict hourly deadlines while on the boat. Thanks for your thoughts! submitted by /u/Tetris_1743 [link] [comments]
- import GA4 events by /u/LowVoltage1990 (Ads on Google, Meta, Microsoft, Amazon etc.) on April 28, 2025 at 8:09 pm
Hey all, I am having an issue importing analytics events to Google Ads. Basically, the drop-down list of events is empty. screenshot: https://imgur.com/a/5ikQnzK I am inside Goals -> Create Conversion Action -> Website -> Add conversion action. I am sure i added and deleted a key event before and all should be setup correctly. Here are the facts: * GA4 property shows it's connected to Google Ads account. And the events are key events, and they have been reporting for few days now. * The suggested conversion actions that google gives, already has other events from the same GA4 property. * I changed browser & deleted cache & cookies... same issue. Plz lmk if it's something wrong with my setup. Thanks submitted by /u/LowVoltage1990 [link] [comments]
- Ad Preview descriptor says “Advertise With Us” by /u/kvdk0624 (Ads on Google, Meta, Microsoft, Amazon etc.) on April 28, 2025 at 7:58 pm
Why does my ad preview say this at the end? The preview has my headlines and the then the 2 descriptions exactly as I’ve written them, and then it adds a third phrase - “Advertise With Us” right at the end. Where is this coming from? I have never typed this phrase into the description boxes. submitted by /u/kvdk0624 [link] [comments]
- Meta ads based confusion by /u/anonymousknome (Ads on Google, Meta, Microsoft, Amazon etc.) on April 28, 2025 at 7:55 pm
Hey people of Reddit hope you are well. I have recently created a new dropshipping store, and I am starting to do ads for this via Facebook. I already have 30+ products, and I have picked the one which I believe would work best to lead in with ads. However, there is a lot of conflicting information on the internet about how to start; My campaign is sales, and the confusion comes with campaigns, ads and ad sets. I am starting on a low budget £5 a day just to test to waters and see if I can at least get some sales then put more in (not expecting to be a millionaire on this budget for any smart arses in the comments) A lot of other advice online is saying that when you start Facebook ads and meta, you have no data, so if you have a smaller budget (like me) you should maybe start with ad to cart as a conversion event instead of purchase straight away. To get enough data, move up the funnel. On the other hand, I am hearing information from you and also online that by doing this, all it entices is window shoppers. I have done let's say, 2 different ad sets so far on this budget, Ad to cart, which there has been; Reach 592, 15 ad to carts, Unique Link clicks 34, CPM £37, CTR 5.91% on the ATC campaign, On the PURCHASE campaign; One campaign, three ad sets with individual ads using a campaign budget £5 CBO. Average impressions 150 average CPM £40 again 5% average CTR however, only 2 ATC BUT ONE SALE This part is a bonus but yea that one was costing a lot, just wondering if anyone else had this same problem or had any inclination they would like to share. Thanks Reddit submitted by /u/anonymousknome [link] [comments]
- Why Did Our Meta Ads Stop Working After Fixing Our Pixel? (Detailed Data Inside)by /u/Pristine-Property-18 (Ads on Google, Meta, Microsoft, Amazon etc.) on April 28, 2025 at 7:48 pm
Hey everyone, My partner and I have been running Meta ads for about a month now. In the beginning, we had some issues with our Meta Pixel — it wasn’t tracking properly — but despite that, we were still getting around one order per day. After fixing the Pixel, we launched a new campaign on April 20th, but it had very little engagement, so we decided to close it quickly. Then, on April 25th, we launched a second new campaign (which is the one currently running). Even though the Pixel is now working properly, and the campaign setup is very similar to before (same product, similar creatives, audiences, and budgets), we haven’t gotten a single order since restarting. Here are some quick campaign details after relaunch: Reach: 4,019 people Views: 4,019 views total Frequency: 1.70 Total Spend: Around kr. 695.70 (~$100 USD) CPM (Cost per 1,000 impressions): Around kr. 192.77 (~$28 USD) CPC (Cost per link click): kr. 7.82 (~$1.13 USD) CTR (link click-through rate): 2.47% Link clicks: 89 clicks CTR (all): 5.18% CPC (all): kr. 3.72 (~$0.54 USD) Other details: Website and product page are the same as before. Targeting a mix of lookalike and interest-based audiences. Daily budget: kr. 80 (around $11). Ads started running on April 25, 2025. Website: Climovo.dk Despite getting a decent amount of clicks and a "normal" CTR, we haven't seen any purchases yet. Previously, even without good tracking, we would average about 1 order per day. Some theories we have but aren’t sure about: Did launching and restarting campaigns cause us to lose momentum in the learning phase? Is the Pixel optimization too "new" now and needs more time to retrain? Could the audiences have gotten exhausted or are they no longer ideal? Is there a problem with our checkout funnel that wasn’t obvious before? We’re super confused and would really appreciate any advice or insights! 🙏 If you have any feedback about the website itself, feel free to check it out at Climovo.dk! Thanks a lot in advance! 
submitted by /u/Pristine-Property-18 [link] [comments]
- Carhartt Affiliate Programby /u/No_Mycologist4488 (Marketing & Advertising) on April 28, 2025 at 7:26 pm
Has anyone done this as part of an ecomerce shop? Or are you better off going direct? submitted by /u/No_Mycologist4488 [link] [comments]
- How do I find Amazon Affiliate influencers to sell a health and beauty product.by /u/toxichaste12 (Marketing & Advertising) on April 29, 2025 at 12:35 am
We sell a H+B product on Amazon. We just signed up for Amazon Associates in order to create affiliate links for influencers. Now that we are set up, how do we get influencers to know about our product? What are the best sites to build awareness that we want to partner via affiliate links? submitted by /u/toxichaste12 [link] [comments]
- How does the pricing of these smm packages look to you?by /u/Glamour-Ad7669 (Marketing & Advertising) on April 29, 2025 at 12:32 am
I’m a beginning social media manager and I’m struggling to find the right rates for my packages. What do you think of this? Package 1 - €400: - 1-2 platforms - 2 posts per week - 4x video editing - 2x10 minutes engagement per week - Social media strategy - One monthly call + statistics Package 2 - €575: - 1-3 platforms - 3 posts per week - 6x video editing - 3x10 minutes engagement per week - Social media strategy - One monthly call + statistics Package 3 - €700: - 1-3 platforms - 3 posts per week - 1 ig story per week - 8x video editing - 3x10 minutes engagement per week - Social media strategy - One monthly call + statistics submitted by /u/Glamour-Ad7669 [link] [comments]
- IS share drop - Google Adsby /u/NMDigitaNinja (Ads on Google, Meta, Microsoft, Amazon etc.) on April 29, 2025 at 12:30 am
Have a client where we optimize based on Impression share in Google with focus around absolute. Upon recently we always averaged around 95% IS at beginning of April, we saw a steep declient to around 65-70% no big changes, budget constant. Ranking actually improved. So while that could open us to more eligible impressions, seems like something else is going on. Any thoughts? submitted by /u/NMDigitaNinja [link] [comments]
- How has your experience been with running Amazon Associate Affiliate links?by /u/ryanrako23 (Marketing & Advertising) on April 29, 2025 at 12:27 am
Just recently got into it. I’m wondering what has your experience been with running those affiliate links. Does it work well? submitted by /u/ryanrako23 [link] [comments]
- Anyone looking to get topical authority on any of your SEO pages?by /u/marketingnerd_ (Marketing & Advertising) on April 28, 2025 at 11:53 pm
I am trying to experiment with a growth hack I have discovered that can get a massive increase in your topical authority for your on-page SEO. If anyone is looking to strengthen their on-page for a specific page, get in touch, I will give you all the sub-pages one of your top competition has for topical authority on that keyword and if you think it will help you, I can give you all the rest of your competitor sub pages, but that would be for a charge. You can then create your own sub pages and internally link them up to rank higher on that keyword. Get in touch! submitted by /u/marketingnerd_ [link] [comments]
- Big opportunity ….going to start a marketing company after this first client….maybeby /u/illydreamer (Marketing & Advertising) on April 28, 2025 at 10:42 pm
Welp so I left ( got canned ) my job in December and started a cargo van business which I was priced out of after the first month. Tons of overhead and broker fees without enough steady work. Ive always loved marketing (especially targeted, direct response) and used it to develop a few of my past ventures. Path came with a fork in the road …do I go into another 9-5 and become miserable again for the rest of my life or do I go full throttle in my own business. A friend of mine who I’m helping develop his gourmet mushroom subscription business for his farm told me his dad may be needing some assistance with two projects. 1. A website design, messaging and marketing for a vacation short term rental property in the Virgin Island who is currently only getting bookings from Airbnb and Vbro and has a static wix website. 2. A water filter company which currently has nothing. They tried to market only on YouTube before calling it quits The thing is I wrote a proposal to him for website design, and marketing the short term rental and he quick accepted. However he is now saying he has many business owners in the island that needs the service. I don’t want to bite off more than I can chew but I know this opportunity will catapult me to the goals I have for myself and business. Should I just freelance this job or go all in and create a marketing agency? Secondly, I charged a good amount but for how quickly he accepted I probably could have went 50% higher … I wouldn’t mind linking up with someone that’s in a similar situation now that may have some time to invest in a marketing agency. Someone versed in SEO, website design and digital ads. I wouldnt mind to bring someone on pretty quickly.. I have two other discovery calls from his friends this week to see how I can help, Would you go all in? submitted by /u/illydreamer [link] [comments]
- Question for B2B marketers - what type of content would be most helpful for you?by /u/tscher16 (Marketing & Advertising) on April 28, 2025 at 9:40 pm
Basically title. I’m an SEO consultant that mostly works with B2B SaaS brands and was wondering if there was any particular content/insights/advice that would be helpful in your day to day. I was thinking mostly for SEO and AI search, but feel free to drop anything organic LinkedIn related too. Not looking to promote or anything, just looking for questions that you’d like to see answered. I also promise I won’t be spamming this subreddit with your answers 😂 submitted by /u/tscher16 [link] [comments]
- Spike in Avg. CPC ... and Click Share?by /u/Complex_Maximum_4004 (Ads on Google, Meta, Microsoft, Amazon etc.) on April 28, 2025 at 9:25 pm
I have been running shopping campaigns for 5 months. Avg CPC had been pretty stable at around $6, but the last 2 weeks it spiked to $10. Alongside that, Click Share went from 20% to 25%. Assuming no parameter changes, what would cause my campaigns to be bidding (and winning) more auctions? I have not found any anomalies from a product, geographic or competitor perspective. Thanks! submitted by /u/Complex_Maximum_4004 [link] [comments]
- LinkedIn B2B campaign - is this the right way?by /u/iwantt0get0ut (Marketing & Advertising) on April 28, 2025 at 8:57 pm
Hi all, I'm setting up my a LinkedIn campaign for an organization. It's a coalition of different parties (housing, temp agencies) trying to raise awareness on certain issues. It's not a traditional brand, but more of a public campaign. Target audience: directors and managers in specific industries. I'm targeting based on job titles, seniority, and company industries (is that enough?). Would love some advice on the setup. Current setup: Objective: website visits (was told brand awareness ads aren't very effective). Ad format: single image ad with a photo from the article. Questions: Should I optimize for landing page clicks or maximum impressions regarding objective? Is a single image ad the right way for this? Is there a reason to pick somethign else? For bidding strategy, should I choose Maximum Delivery, Manual, or Cost Cap? LinkedIn suggests a bid amount, should I base it on that? Feeling a bit overwhelmed by the options. Would love your advice! 🙏 submitted by /u/iwantt0get0ut [link] [comments]
- Local products for Pmax - Shopping Questionby /u/Maleficent-Duck-4230 (Ads on Google, Meta, Microsoft, Amazon etc.) on April 28, 2025 at 8:56 pm
Hi all, running a Pmax for shopping goals campaign but noticed it doesn't give the option to include "local products" in campaign settings as PLAs do. Is there away to enable that in a pmax shopping campaign? submitted by /u/Maleficent-Duck-4230 [link] [comments]
- Freelancers? Google ads only or offer more services?by /u/CompBang330 (Ads on Google, Meta, Microsoft, Amazon etc.) on April 28, 2025 at 8:50 pm
Just wondering whether all you freelancers offer the one service e.g. Google ads or you cover others like Meta, TikTok etc? Would you also cover CRO? submitted by /u/CompBang330 [link] [comments]
- I need help for marketing scooterby /u/Trick_Definition11 (Marketing & Advertising) on April 28, 2025 at 8:49 pm
I need help with scooter marketing I made an agreement with a company: if I manage to bring them 4 customers who buy their scooter through my marketing efforts, they will send me a scooter for free. Initially, I proposed that they send me a scooter right away so I could create videos, do affiliate marketing, and help increase the product's popularity in this region, but they suggested this approach instead. I plan to open social media profiles (especially on Instagram) where I would post designs, comparisons, and other content related to the scooter — mainly photos, videos, and reviews. It’s important to mention that nothing is mandatory: I’m not obligated to bring customers, but if I succeed, they will reward me with a scooter. So, if anyone has advice or can help me figure out how to bring in those 4 customers, I would really appreciate it! submitted by /u/Trick_Definition11 [link] [comments]
- WordPress AI searchby /u/Abject-Roof-7631 (Marketing & Advertising) on April 28, 2025 at 8:36 pm
I am a small business with a lot of content that's blog oriented. I also have a ton of video that has been transcribed. I think it is very challenging for my users to find content on my website and I'd like to have an AI capability where someone just types in a question and it presents the answer and somehow integrated to wordpress. Any suggestions on plugins that come to mind for this capability? I'd like the capability to extend to any web page. submitted by /u/Abject-Roof-7631 [link] [comments]
- Search campaign stopped converting after starting with pmax. What might mbe the reason?by /u/ssmokvaa (Ads on Google, Meta, Microsoft, Amazon etc.) on April 28, 2025 at 8:28 pm
I have a google ads campaign for a school in canada. First 20 days of March I was running only search campaign, and it generated maybe 2 conversions per day on average. Then, around 20th of March I created a pmax campaign with signals and campaign data from the search campaign. From that point on, I have 0 conversions on the search campaign, and pmax has around 2 conversions per day on average. Clicks and impressions stayed the same. What might be the reason, and what can I do? Additional info: I imported Search campaign data as signals into PMax. The location targeting, budget, and overall goals are the same for both campaigns. budget: 15$ per day on search, and 20$ per day on pmax (it was vice versa until recently, then I started changing things around gradually) Bidding Strategy: maximize clicks on search campaign bidding. (I was worried if I change to maximize conversions, it will get into a battle with pmax). Pmax is using maximize conversions conversion goals: Submit lead forms for both of them (same lead form asset) Audience Signals: Google-engaged audiences, All visitors (google ads), All visitors (google ads) System Defined, All converters Keywords Types: phrase and exact non-branded (new project) Impression Share: pmax: <10% ; search <10%; Search lost IS (budget): 17% , Search lost IS (rank): 78% Saerch campaign Ads strength: good or excellent submitted by /u/ssmokvaa [link] [comments]
- Where would you upskill to make more $$ freelancing?by /u/Tetris_1743 (Marketing & Advertising) on April 28, 2025 at 8:12 pm
I'm currently a marketing generalist and looking to live aboard a sailboat, working part time in a year. I'm quite well rounded at the moment but want to be an 'expert' in an area to generate more income in a smarter way. What area would you specialise in or do you see a gap in? Is there a certain industry? Currently I'm working at a SaaS start-up doing bits of everything. I want to be able to freelance and choose my work based on my lifestyle without strict hourly deadlines while on the boat. Thanks for your thoughts! submitted by /u/Tetris_1743 [link] [comments]
- import GA4 eventsby /u/LowVoltage1990 (Ads on Google, Meta, Microsoft, Amazon etc.) on April 28, 2025 at 8:09 pm
Hey all, I am having an issue importing analytics events to Google Ads. Basically, the drop-down list of events is empty. screenshot: https://imgur.com/a/5ikQnzK I am inside Goals -> Create Conversion Action -> Website -> Add conversion action. I am sure i added and deleted a key event before and all should be setup correctly. Here are the facts: * GA4 property shows it's connected to Google Ads account. And the events are key events, and they have been reporting for few days now. * The suggested conversion actions that google gives, already has other events from the same GA4 property. * I changed browser & deleted cache & cookies... same issue. Plz lmk if it's something wrong with my setup. Thanks submitted by /u/LowVoltage1990 [link] [comments]
What is machine learning and how does Netflix use it for its recommendation engine?


What is an online recommendation engine?
Think about examples of machine learning you may have already encountered, such as a website like Netflix recommending which video you might want to watch next.
Are the recommendations ever wrong or unfair? We will give an example and explain how this could be addressed.

Machine learning is a field of artificial intelligence that Netflix uses to create its recommendation algorithm. The goal of machine learning is to teach computers to learn from data and make predictions based on that data. To do this, Netflix employs machine learning engineers, data scientists, and software developers to design and build algorithms that can automatically improve over time. The Netflix recommendation engine is just one example of how machine learning can be used to improve the user experience. By understanding what users watch and why, the recommendation engine can provide tailored suggestions that help users find new shows and movies to enjoy. Machine learning is also used for other Netflix features, such as detecting inappropriate content. In a world where data is becoming increasingly important, machine learning will continue to play a vital role in helping Netflix deliver a great experience to its users.

Netflix’s recommendation engine is one of the company’s most valuable assets. By using machine learning, Netflix is able to constantly improve its recommendations for each individual user.
Machine learning engineers, data scientists, and developers work together to build and improve the recommendation engine.
- They start by collecting data on what users watch and how they interact with the Netflix interface.
- This data is then used to train machine learning models.
- The models are constantly being tweaked and improved by the team of engineers.
- The goal is to make sure that each user sees recommendations that are highly relevant to their interests.
Thanks to the work of the team, Netflix’s recommendation engine is constantly getting better at understanding each individual user.
How Does It Work?
In short, Netflix’s recommendation algorithm looks at what you’ve watched in the past and then makes recommendations based on that data. But of course, it’s a bit more complicated than that. The algorithm also looks at data from other users with similar watching habits to yours. This allows Netflix to give you more tailored recommendations.
For example, say you’re a big fan of Friends (who isn’t?). The algorithm knows that a lot of Friends fans also like shows like Cheers, Seinfeld, and The Office. So, if you’re ever feeling nostalgic and in the mood for a sitcom marathon, Netflix will be there to help you out.
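The "fans of Friends also like..." idea can be sketched as a simple co-occurrence count. This is only an illustration under stated assumptions: the viewing histories and the `also_watched` helper are hypothetical, and Netflix's real models are not described here and are certainly more elaborate.

```python
from collections import Counter

# Hypothetical viewing histories: user -> set of shows watched.
histories = {
    "ana": {"Friends", "Seinfeld", "The Office"},
    "ben": {"Friends", "Cheers", "Seinfeld"},
    "cy": {"Friends", "The Office"},
    "dee": {"Breaking Bad", "The Wire"},
}

def also_watched(show, histories, top_n=3):
    """Rank other shows by how many viewers of `show` also watched them."""
    counts = Counter()
    for watched in histories.values():
        if show in watched:
            counts.update(watched - {show})
    # Sort by count (descending), then by title so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, _ in ranked[:top_n]]

print(also_watched("Friends", histories))
```

With this toy data, the shows watched most often alongside Friends come first, which is the basic intuition behind using other users' habits to tailor your recommendations.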
But That’s Not All…
Not only does the algorithm take into account what you’ve watched in the past, but it also looks at what you’re currently watching. For example, let’s say you’re halfway through Season 2 of Breaking Bad and you decide to take a break for a few days. When you come back and finish Season 2, the algorithm knows that you’re now interested in similar shows like Dexter and The Wire. And voila! Those shows will now be recommended to you.
Of course, the algorithm isn’t perfect. There are always going to be times when it recommends a show or movie that just doesn’t interest you. But hey, that’s why they have the “thumbs up/thumbs down” feature. Just give those shows the old thumbs down and never think about them again! Problem solved.
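As a rough sketch of how thumbs up/thumbs down feedback could feed back into a list of suggestions, the hypothetical `apply_feedback` function below drops anything you rated down and moves anything you rated up to the front. The function name and the +1/-1 encoding are assumptions for illustration, not how Netflix actually stores or uses ratings.

```python
def apply_feedback(candidates, feedback):
    """Drop thumbs-down titles and move thumbs-up titles to the front.

    candidates: list of titles in their original ranked order.
    feedback: dict mapping title -> +1 (thumbs up) or -1 (thumbs down).
    """
    kept = [t for t in candidates if feedback.get(t, 0) >= 0]
    # sorted() is stable, so titles with equal feedback keep their order.
    return sorted(kept, key=lambda t: -feedback.get(t, 0))

suggestions = ["Dexter", "The Wire", "Cheers"]
ratings = {"Dexter": -1, "Cheers": 1}
print(apply_feedback(suggestions, ratings))
```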
Another Angle
When it comes to TV and movie recommendations, there are two main types of data that are being collected and analyzed:
1) demographic data
2) viewing data.
Demographic data is information like your age, gender, and location. This data is generally used to group people with similar interests together so that they can be served more targeted recommendations. For example, if you’re a 25-year-old female living in Los Angeles, you might be grouped together with other 25-year-old females living in Los Angeles whose viewing habits are similar to yours.
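To make the grouping idea concrete, here is a minimal sketch that buckets viewers into segments by age band and city, then finds the most-watched title in each segment. The profiles, the decade-wide age bands, and the function names are all hypothetical choices for illustration; a real system could segment on other attributes or not use demographics this way at all.

```python
from collections import Counter, defaultdict

# Hypothetical profiles: (user, age, city, titles watched).
profiles = [
    ("u1", 25, "Los Angeles", ["Show A", "Show B"]),
    ("u2", 27, "Los Angeles", ["Show A"]),
    ("u3", 52, "Chicago", ["Show C"]),
]

def age_band(age):
    """Bucket an age into a decade band such as '20-29'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def popular_by_segment(profiles):
    """Map each (age band, city) segment to its most-watched title."""
    counts = defaultdict(Counter)
    for _user, age, city, watched in profiles:
        counts[(age_band(age), city)].update(watched)
    # Skip segments where nobody has watched anything yet.
    return {seg: c.most_common(1)[0][0] for seg, c in counts.items() if c}

print(popular_by_segment(profiles))
```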
Viewing data is exactly what it sounds like—it’s information on what TV shows and movies you’ve watched in the past. This data is used to identify patterns in your viewing habits so that the algorithm can make better recommendations on what you might want to watch next. For example, if you’ve watched a lot of romantic comedies in the past, the algorithm might recommend other romantic comedies that you might like based on those patterns.
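The romantic comedy example can be sketched in a few lines: count which genre appears most often in someone's history, then suggest unwatched titles from that genre. The small catalog and the `recommend_by_genre` helper are assumptions made up for this illustration.

```python
from collections import Counter

# Hypothetical catalog: title -> genre.
catalog = {
    "Notting Hill": "romantic comedy",
    "Crazy Rich Asians": "romantic comedy",
    "Set It Up": "romantic comedy",
    "Heat": "crime",
    "Alien": "sci-fi",
}

def recommend_by_genre(watched, catalog, top_n=2):
    """Suggest unwatched titles from the viewer's most-watched genre."""
    genre_counts = Counter(catalog[t] for t in watched if t in catalog)
    if not genre_counts:
        return []
    favorite, _ = genre_counts.most_common(1)[0]
    picks = [t for t, g in catalog.items() if g == favorite and t not in watched]
    return sorted(picks)[:top_n]

print(recommend_by_genre(["Notting Hill", "Crazy Rich Asians", "Heat"], catalog))
```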
Are the Recommendations Ever Wrong or Unfair?
Yes and no. The fact of the matter is that no algorithm is perfect—there will always be some error involved. However, these errors are usually minor and don’t have a major impact on our lives. In fact, we often don’t even notice them!
The bigger issue with machine learning isn’t inaccuracy; it’s bias. Because algorithms are designed by humans and trained on data that reflects human behavior, they often absorb human biases that can seep into the recommendations they make. For example, a recent study found that Amazon’s algorithms were biased against women authors because the majority of book purchases on the site were made by men. As a result, Amazon’s algorithms were more likely to recommend books written by men over books written by women, regardless of quality or popularity.
These sorts of biases can have major impacts on our lives because they can dictate what we see and don’t see online. If we’re only seeing content that reflects our own biases back at us, we’re not getting a well-rounded view of the world—and that can have serious implications for both our personal lives and society as a whole.
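One way to see how a small imbalance in the data can grow is a toy feedback loop: items are ranked purely by past purchases, and whatever gets recommended picks up extra purchases in the next round. Everything here (the titles, the counts, the size of the boost) is invented for illustration and does not describe any real company's system.

```python
def top_by_popularity(purchases, top_n=2):
    """Rank items purely by purchase count (ties broken by title)."""
    ranked = sorted(purchases.items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, _ in ranked[:top_n]]

def simulate(purchases, rounds=3, boost=10, top_n=2):
    """Each round, every recommended item gains `boost` extra purchases."""
    counts = dict(purchases)
    for _ in range(rounds):
        for title in top_by_popularity(counts, top_n):
            counts[title] += boost
    return counts

start = {"Book A": 12, "Book B": 11, "Book C": 10}
print(simulate(start))
```

In this sketch Book C starts only two purchases behind Book A, but because it never makes the recommended list it never catches up, and the gap widens every round.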
One of the benefits of machine learning is that it can help us make better decisions. For example, if you’re trying to decide what movie to watch on Netflix, the site will use your past viewing history to recommend movies that you might like. This is possible because machine learning algorithms are able to identify patterns in data.
Another benefit of machine learning is that it can help us automate tasks. For example, instead of a cashier scanning the barcode of every item someone is buying, a machine learning model can recognize the items from a camera image so that the system can total the purchase automatically. This can save time and increase efficiency.
The Consequences of Machine Learning
While machine learning can be beneficial, there are also some potential consequences that should be considered. One consequence is that machine learning algorithms can perpetuate bias. For example, if you’re using a machine learning algorithm to recommend movies to people on Netflix, the algorithm might only recommend movies that are similar to ones that people have already watched. This could lead to people only watching movies that confirm their existing beliefs instead of challenging them.
Another consequence of machine learning is that it can be difficult to understand how the algorithms work. This is because the algorithms are usually created by trained experts and then fine-tuned through trial and error. As a result, regular people often don’t know how or why certain decisions are being made by machines. This lack of transparency can lead to mistrust and frustration.
submitted by /u/KillerCroc1234567 [link] [comments]
- Three mesmerizing movie/drama on Netflixby SS (Netflix on Medium) on April 28, 2025 at 5:14 pm
I am an Indian but I really like to see other country drama and movies.Continue reading on Medium »
- Alice Osman Teases 'Heartstopper' Volume 6 and Finale Movieby /u/Somethingman_121224 (Netflix) on April 28, 2025 at 5:11 pm
submitted by /u/Somethingman_121224 [link] [comments]
- ‘Shogun’ Star Anna Sawai Joins David Leitch Next Film ‘How To Rob A Bank’ At Amazon MGM Studiosby /u/MarvelsGrantMan136 (Movie News and Discussion) on April 28, 2025 at 4:30 pm
submitted by /u/MarvelsGrantMan136 [link] [comments]
- New Poster for ‘Fear Street: Prom Queen’by /u/KillerCroc1234567 (Movie News and Discussion) on April 28, 2025 at 4:11 pm
submitted by /u/KillerCroc1234567 [link] [comments]
- Anyone else?by /u/StardewValleyTrash (Netflix) on April 28, 2025 at 4:10 pm
Anyone else’s Netflix continuously crashing? This is the 2nd day in a row it keeps crashing on my TV. I know earlier this week it was down for a bit, but this is getting ridiculous. It’ll be fine for like 20 minutes then crash over and over. submitted by /u/StardewValleyTrash [link] [comments]
- Brad Pitt to lead Edward Berger's next film titled ‘The Riders’ at A24 | Based on the novel of the same name, it follows Fred Scully (Pitt) whose life falls to pieces as he and his daughter Billie look for her missing mother throughout Europe.by /u/ChiefLeef22 (Movie News and Discussion) on April 28, 2025 at 3:38 pm
submitted by /u/ChiefLeef22 [link] [comments]
- Can your tv influence your netflix experience much?by /u/trevorandcletus (Netflix) on April 28, 2025 at 3:20 pm
I have 2 available devices, with others being smartphones. My awol ltv 3500 provides clear vibrant visuals, whatever I am watching. On the other hand, I have been having issues with my Lg. The settings are all off, can you tell me how to get similar visuals? The Lg in question is a bit old, like I got it back in 2016. Could that be the reason that I am not getting good visuals? submitted by /u/trevorandcletus [link] [comments]
- TV is better with subtitles - whether you need them or notby /u/theipaper (Netflix) on April 28, 2025 at 3:14 pm
submitted by /u/theipaper [link] [comments]
- New Posters for ‘Mission: Impossible - The Final Reckoning’by /u/KillerCroc1234567 (Movie News and Discussion) on April 28, 2025 at 3:08 pm
submitted by /u/KillerCroc1234567 [link] [comments]
- The First Footage From Guillermo del Toro’s Frankenstein Is Here:by Ryan Thomas LaBee (Netflix on Medium) on April 28, 2025 at 3:06 pm
Here’s Everything You Missed From Netflix’s Tudum Event TrailerContinue reading on Medium »
- What's a critically panned movie that you just can't help loving?by /u/DeepThinkingReader (Movie News and Discussion) on April 28, 2025 at 2:36 pm
For me, it's Roland Emmerich's 10,000 BC. It has 10% on Rotten Tomatoes, and it's based on a crackpot theory about the pyramids from the rear end of Graham Hancock. But somehow, no matter how many times I see this movie, I never get tired of it. The scene where Steven Strait javelins the Pharaoh and shouts "He is not a god!" is one of my favourite movie scenes ever. Also, the music is awesome, the filming locations are gorgeous, and the miniature pyramid models used in production are fantastic. So despite the story being illogical and stupid, and the historiography being absolutely batshit, 10,000 BC will always be one of my guilty pleasures. submitted by /u/DeepThinkingReader [link] [comments]
World’s Top 10 Youtube channels in 2022
T-Series, Cocomelon, Set India, PewDiePie, MrBeast, Kids Diana Show, Like Nastya, WWE, Zee Music Company, Vlad and Niki
What are some ways to increase precision or recall in machine learning?


What are some ways to Boost Precision and Recall in Machine Learning?
Sensitivity vs Specificity?
In machine learning, recall is the ability of the model to find all relevant instances in the data while precision is the ability of the model to correctly identify only the relevant instances. A high recall means that most relevant results are returned while a high precision means that most of the returned results are relevant. Ideally, you want a model with both high recall and high precision but often there is a trade-off between the two. In this blog post, we will explore some ways to increase recall or precision in machine learning.
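To make the two definitions concrete, here is a minimal sketch that computes both metrics from a list of true labels and a list of predictions. The function name `precision_recall` and the toy spam labels are illustrative assumptions, not part of any particular library.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 marks the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 1 = spam, 0 = not spam.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy data the model flags three emails, two of which are really spam, and misses two of the four spam emails, so precision and recall come out differently even though they share the same true-positive count.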

Recall rises when false negatives fall:
recall is the number of true positives divided by the sum of true positives and false negatives, so the way to raise it is to catch more of the actual positives. The most direct lever is to lower your threshold for what constitutes a positive prediction. For example, if you are trying to predict whether or not an email is spam, you might lower the threshold for what constitutes spam so that more emails are classified as spam. This will result in more false positives (emails that are not actually spam being classified as spam) but will also increase recall (more actual spam emails being classified as spam).

Raising the threshold has the opposite effect:
fewer emails are classified as spam, so more actual spam slips through as false negatives and recall decreases. If recall is your priority, keep the threshold low and accept the extra false positives that come with it.

Precision rises when false positives fall:
precision is the number of true positives divided by the sum of true positives and false positives, so the way to raise it is to stop flagging items that are not actually positive. To do that, you can raise your threshold for what constitutes a positive prediction. For example, using the spam email prediction example again, you might raise the threshold for what constitutes spam so that fewer emails are classified as spam. The emails that are still flagged are the ones the model is most confident about, so there are fewer false positives and precision usually goes up, at the cost of lower recall (more actual spam emails being missed).
Lowering the threshold works against precision:
going back to the spam email prediction example once more, a lower threshold means more emails are classified as spam. This will result in fewer true negatives (fewer non-spam emails correctly left alone) and more false positives, which usually decreases precision (more non-spam emails being classified as spam).
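The effect of moving the threshold can be seen in a small sweep. The sketch below assumes a classifier that outputs a spam score between 0 and 1; the scores, labels, and the helper name `metrics_at` are made up for illustration.

```python
def metrics_at(y_true, scores, threshold):
    """Classify as positive when score >= threshold, then return (precision, recall)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical data: 1 = spam, 0 = not spam, with made-up model scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.55, 0.30, 0.60, 0.40, 0.20, 0.10]

results = {th: metrics_at(y_true, scores, th) for th in (0.25, 0.50, 0.75)}
for th, (p, r) in results.items():
    print(f"threshold={th:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, the lowest threshold catches every spam email but flags two legitimate ones, while the highest threshold flags only spam but misses half of it. Real score distributions are noisier, so precision does not always rise smoothly as the threshold goes up.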

To summarize,
there are a few ways to increase precision or recall in machine learning. The most direct is to adjust the threshold for classification: lowering it favors recall, while raising it favors precision. You can also change the decision boundary itself by retraining the model or by using a different algorithm altogether. Whichever you choose, pick an evaluation metric that matches your goal. The F1 score, which is the harmonic mean of precision and recall, is useful when both matter; it does not raise either metric by itself, but it keeps you from improving one at the expense of the other without noticing.
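As a rough illustration of how a combined metric weighs the two, the sketch below computes the F-beta score, of which F1 is the special case with beta equal to 1. The function name `f_beta` and the example values are assumptions chosen for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta < 1 weights precision more heavily.
    """
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Hypothetical model with precision 0.75 and recall 0.50.
f1 = f_beta(0.75, 0.50)            # balances the two metrics
f2 = f_beta(0.75, 0.50, beta=2.0)  # leans toward recall
print(f"F1={f1:.3f} F2={f2:.3f}")
```

Because this hypothetical model has weaker recall than precision, the recall-weighted F2 comes out lower than F1. Choosing beta is a way of stating which kind of error costs you more.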
Sensitivity vs Specificity
In machine learning, sensitivity and specificity are two measures of the performance of a model. Sensitivity is the proportion of actual positives that the model correctly predicts as positive, TP / (TP + FN), which makes it the same quantity as recall. Specificity is the proportion of actual negatives that the model correctly predicts as negative, TN / (TN + FP).
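Both measures fall straight out of the four confusion-matrix counts. Here is a minimal sketch; the function name and the counts are hypothetical.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # share of actual positives caught
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # share of actual negatives left alone
    return sensitivity, specificity


# Hypothetical counts: 50 actual positives and 50 actual negatives.
sens, spec = sensitivity_specificity(tp=40, fn=10, tn=45, fp=5)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Note that specificity looks only at the actual negatives, whereas precision looks at everything the model flagged, so the two can move independently when the classes are imbalanced.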
Google Colab For Machine Learning
State of the Google Colab for ML (October 2022)

Google introduced computing units, which you purchase much as you would any other cloud computing resource from AWS, Azure, etc. With Pro you get 100 computing units, and with Pro+ you get 500. Your choice of GPU or TPU and the High-RAM option affect how many computing units you use per hour. If you don’t have any computing units, you can’t use “Premium” tier GPUs (A100, V100), and even the P100 is not viable.
Google Colab Pro+ comes with the Premium tier GPU option, whereas on Pro, if you have computing units, you are randomly connected to a P100 or T4. After you use all of your computing units, you can buy more, or you can use a T4 GPU for half or most of the time (there can be many times in the day when you can’t use a T4 or any kind of GPU at all). In the free tier, the GPUs offered are most of the time the K80 and P4, which perform similarly to a 750 Ti (an entry-level GPU from 2014) with more VRAM.
For your consideration, a T4 uses around 2 computing units per hour, and an A100 uses around 15.
Based on current knowledge, computing unit costs for GPUs tend to fluctuate based on some unknown factor.
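To turn those figures into something easier to plan around, the sketch below estimates how many GPU hours each plan's allowance buys. It assumes the approximate burn rates quoted above (about 2 units per hour for a T4 and about 15 for an A100) stay fixed, which they may not, so treat the output as a rough estimate only.

```python
# Approximate hourly burn rates quoted in the text; actual rates fluctuate.
UNITS_PER_HOUR = {"T4": 2.0, "A100": 15.0}
# Computing unit allowances quoted in the text.
PLAN_UNITS = {"Pro": 100, "Pro+": 500}


def gpu_hours(units, gpu):
    """Estimate hours of runtime for a given unit balance on a given GPU."""
    return units / UNITS_PER_HOUR[gpu]


budget = {
    (plan, gpu): gpu_hours(units, gpu)
    for plan, units in PLAN_UNITS.items()
    for gpu in UNITS_PER_HOUR
}
for (plan, gpu), hours in budget.items():
    print(f"{plan:4s} {gpu:5s} ~{hours:.1f} h")
```

Under these assumptions a Pro allowance lasts roughly 50 hours on a T4 but under 7 hours on an A100, which is worth keeping in mind when choosing a runtime.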
Considering those:
- For hobbyists and (under)graduate coursework, it is better to use your own GPU if you have something with more than 4 GB of VRAM that is better than a 750 Ti, or at least to purchase Colab Pro so you can reach a T4 even when you have no computing units remaining.
- For small research companies, non-trivial research at universities, and probably most people, Colab is now probably not a good option.
- Colab Pro+ can be considered if you want Pro but don’t sit in front of your computer, since Pro disconnects after 90 minutes of inactivity on your computer. But this can be overcome with scripts to some extent, so most of the time Colab Pro+ is not a good option.
If you have anything more to say, please let me know so I can edit this post with them. Thanks!
Conclusion:
In machine learning, precision and recall trade off against each other; increasing one often decreases the other. There is no single silver bullet solution for increasing either precision or recall; it depends on your specific use case which one is more important and which methods will work best for boosting whichever metric you choose. In this blog post, we explored some methods for increasing either precision or recall; hopefully this gives you a starting point for improving your own models!
What are some ways we can use machine learning and artificial intelligence for algorithmic trading in the stock market?
Machine Learning and Data Science Breaking News 2022 – 2023
- [D] Patch Merging vs Pixelushuffleby /u/Stormzrift (Machine Learning) on April 28, 2025 at 10:41 pm
Hello friends, I am trying to figure out if patch merging in swin transformers is arithmetically the same as pixelunshuffle, ignoring the norm and linear layer. Anyone know? submitted by /u/Stormzrift [link] [comments]
- [D] How do you evaluate your RAGs?by /u/ml_nerdd (Machine Learning) on April 28, 2025 at 6:15 pm
Trying to understand how people evaluate their RAG systems and whether they are satisfied with the ways that they are currently doing it. submitted by /u/ml_nerdd [link] [comments]
- [D] How do you think the recent trend of multimodal LLMs will impact audio-based applications?by /u/Ok-Sir-8964 (Machine Learning) on April 28, 2025 at 6:08 pm
Hey everyone, I've been following the developments in multimodal LLM lately. I'm particularly curious about the impact on audio-based applications, like podcast summarization, audio analysis, TTS, etc(I worked for a company doing related product). Right now it feels like most "audio AI" products either use a separate speech model (like Whisper) or just treat audio as an intermediate step before going back to text. With multimodal LLMs getting better at handling raw audio more natively, do you think we'll start seeing major shifts in how audio content is processed, summarized, or even generated? Or will text still be the dominant mode for most downstream tasks, at least in the near term? Would love to hear your thoughts or if you've seen any interesting research directions on this. Thanks submitted by /u/Ok-Sir-8964 [link] [comments]
- [R] Looking for TensorFlow C++ 2.18.0 Prebuilt Libraries for macOS (M2 Chip)by /u/Ok_Soup705 (Machine Learning) on April 28, 2025 at 4:08 pm
Where can I download the TensorFlow C++ 2.18.0 pre-built libraries for macOS (M2 chip)? I'm looking for an official or recommended source to get the pre-built TensorFlow 2.18.0 libraries that are compatible with macOS running on an Apple Silicon (M2) processor. Any guidance or links would be appreciated. Thank you! submitted by /u/Ok_Soup705 [link] [comments]
- [P] I built a chrome extension that detects and redacts sensitive information from your AI promptsby /u/fxnnur (Machine Learning) on April 28, 2025 at 3:41 pm
It seems like a lot more people are becoming increasingly privacy conscious in their interactions with generative AI chatbots like ChatGPT, Gemini, etc. This seems to be a topic that people are talking more frequently, as more people are learning the risks of exposing sensitive information to these tools. This prompted me to create Redactifi - a browser extension designed to detect and redact sensitive information from your AI prompts. It has a built in ML model and also uses advanced pattern recognition. This means that all processing happens locally on your device. Any thoughts/feedback would be greatly appreciated. Check it out here: https://chromewebstore.google.com/detail/hglooeolkncknocmocfkggcddjalmjoa?utm_source=item-share-cb submitted by /u/fxnnur [link] [comments]
- [D] How could a MLP replicate the operations of an attention head?by /u/steuhh (Machine Learning) on April 28, 2025 at 2:42 pm
So in an attention head the QK circuit allows to multiply projected tokens, so chunks of the input sequence. For example it could multiply token x with token y. How could this be done with multiple fully connected layers? I'm not even sure how to start thinking about this... Maybe a first layer can map chunks of the input to features that recognize the tokens—so one token x feature and one token y feature? And then it a later layer it could combine these into a token x + token y feature, which in turn could activate a lookup for the value of x multiplied by y? So it would learn to recognize x and y and then learn a lookup table (simply the weight matrices) where it stores possible values of x times y. Seems very complicated but I guess something along those lines might work. Any help is welcome here ! submitted by /u/steuhh [link] [comments]
- A paper from the latest SIGBOVIK proceedingsby /u/bikeskata (Data Science) on April 28, 2025 at 12:57 pm
submitted by /u/bikeskata [link] [comments]
- [D] IJCAI 2025 Paper Result & Discussionby /u/witsyke (Machine Learning) on April 28, 2025 at 12:06 pm
This is the discussion for accepted/rejected papers in IJCAI 2025. Results are supposed to be released within the next 24 hours. submitted by /u/witsyke [link] [comments]
- [P] Looking for advice: Best AI approach to automatically predict task dependencies and optimize industrial project schedules?by /u/Head_Mushroom_3748 (Machine Learning) on April 28, 2025 at 9:55 am
Hello everyone, I'm trying to optimize project schedules that involve hundreds to thousands of maintenance tasks. Each project is divided into "work packages" associated with specific types of equipment. I would like to automate task dependencies with AI by providing a list of tasks (with activity ID, name, equipment type, duration if available), and letting the AI predict the correct sequence and dependencies automatically. I have historical data: - Around 16 past projects (some with 300 tasks, some with up to 35,000 tasks). - For each task: ID, name, type of equipment, duration, start and end dates (sometimes missing values). - Historical dependencies between tasks (links between task IDs). For example, i have this file : ID NAME EQUIPMENT TYPE DURATION J2M BALLON 001.C1.10 ¤¤ TRAVAUX A REALISER AVANT ARRET ¤¤ Ballon 0 J2M BALLON 001.C1.20 Pose échafaudage(s) Ballon 8 J2M BALLON 001.C1.30 Réception échafaudage(s) Ballon 2 J2M BALLON 001.C1.40 Dépose calorifuge comple Ballon 4 J2M BALLON 001.C1.50 Création puits de mesure Ballon 0 And the AI should be returning me this : ID NAME NAME SUCCESSOR 1 NAME SUCCESSOR 2 J2M BALLON 001.C1.10 ¤¤ TRAVAUX A REALISER AVANT ARRET ¤¤ Pose échafaudage(s J2M BALLON 001.C1.20 Pose échafaudage(s) Réception échafaudage(s) J2M BALLON 001.C1.30 Réception échafaudage(s) Dépose calorifuge complet Création puits de mesure J2M BALLON 001.C1.40 Dépose calorifuge complet ¤¤ TRAVAUX A REALISER PENDANT ARRET ¤¤ J2M BALLON 001.C1.50 Création puits de mesure ¤¤ TRAVAUX A REALISER PENDANT ARRET ¤¤ So far, I have tried building models (random forest, gnn), but I’m still stuck after two months. I was suggested to explore **sequential models**. My questions: - Would an LSTM, GRU, or Transformer-based model be suitable for this type of sequence + multi-label prediction problem (predicting 1 or more successors)? - Should I think about this more as a sequence-to-sequence problem, or as graph prediction? 
(I tried the graph aproach but was stopped as i couldnt do the inference on new graph without edges) - Are there existing models or papers closer to workflow/task dependency prediction that you would recommend? Any advice, pointers, or examples would be hugely appreciated! (Also, if you know any open-source projects or codebases close to this, I'd love to hear about them.) Thank you so much in advance! submitted by /u/Head_Mushroom_3748 [link] [comments]
- [P] Autonomous Driving project - F1 will never be the same!by /u/NorthAfternoon4930 (Machine Learning) on April 28, 2025 at 9:41 am
Got you with the title, didn't I 😉 I'm a huge ML nerd, and I'm especially interested in practical applications of it. Everybody is talking about LLMs these days, and I have enough of it at work myself, so maybe there is room for a more traditional ML project for a change. I have always been amazed by how bad AI is at driving. It's one of the few things humans seem to do better. They are still trying, though. Just watch Abu Dhabi F1 AI race. My project agenda is simple (and maybe a bit high-flying). I will develop an autonomous driving agent that will beat humans on different scales: Toy RC car Performance RC car Go-kart Stock car F1 (lol) I'll focus on actual real-world driving, since simulator-world seems to be dominated by AI already. I have been developing Gaussian Process-based route planning that encodes the dynamics of the vehicle in a probabilistic model. The idea is to use this as a bridge between simulations and the real world, or even replace the simulation part completely. Tech-stack: Languages: Python (CV, AI)/Notebooks (EDA). C++ (embedding) Hardware: ESP32 (vehicle control), Cameras (CV), Local computer (computing power) ML topics: Gaussian Process, Real time localization, Predictive PID, Autonomous driving, Image processing Project timeline: 2025-04-28 A Toy RC car (scale 1:22) has been modified to be controlled by esp32, which can be given instructions via UDP. A stationary webcam is filming the driving plane. Python code with OpenCV is utilized to localize the object on a 2D plane. P-controller is utilized to follow a virtual route. Next steps: Training the car dynamics into GP model and optimizing the route plan. PID with possible predictive capabilities to execute the plan. This is were we at: CV localization and P-controller I want to keep these reports short, so I won't go too much into details here, but I definitely like to talk more about them in the comments. Just ask! 
I just hope I can finish before AGI makes all the traditional ML development obsolete. submitted by /u/NorthAfternoon4930 [link] [comments]
- [R] Work in Progress: Advanced Conformal Prediction – Practical Machine Learning with Distribution-Free Guaranteesby /u/predict_addict (Machine Learning) on April 28, 2025 at 8:18 am
Hi r/MachineLearning community! I’ve been working on a deep-dive project into modern conformal prediction techniques and wanted to share it with you. It's a hands-on, practical guide built from the ground up — aimed at making advanced uncertainty estimation accessible to everyone with just basic school math and Python skills. Some highlights: Covers everything from classical conformal prediction to adaptive, Mondrian, and distribution-free methods for deep learning. Strong focus on real-world implementation challenges: covariate shift, non-exchangeability, small data, and computational bottlenecks. Practical code examples using state-of-the-art libraries like Crepes, TorchCP, and others. Written with a Python-first, applied mindset — bridging theory and practice. I’d love to hear any thoughts, feedback, or questions from the community — especially from anyone working with uncertainty quantification, prediction intervals, or distribution-free ML techniques. (If anyone’s interested in an early draft of the guide or wants to chat about the methods, feel free to DM me!) Thanks so much! 🙌 submitted by /u/predict_addict [link] [comments]
- [P] plan-lint - Open source project to verify plans generated by LLMsby /u/baradas (Machine Learning) on April 28, 2025 at 7:11 am
Hey folks, I’ve just shipped plan-lint, a tiny OSS tool that inspects machine-readable "plans" agents spit out before any tool call runs. It spots the easy-to-miss stuff—loops, over-broad SQL, raw secrets, crazy refund values—then returns pass / fail plus a risk score, so your orchestrator can replan or use HITL instead of nuking prod. Quick specs JSONSchema / Pydantic validation YAML / OPA allow/deny rules & bounds Data-flow checks for PII / secrets Cycle detection on the step graph Runs in <50 ms for 💯 steps, zero tokens Repo link in comment How to : pip install plan-lint plan-lint examples/price_drop.json --policy policy.yaml --fail-risk 0.8 Apache-2.0, plugins welcome. Would love feedback, bug reports, or war-stories about plans that went sideways in prod! submitted by /u/baradas [link] [comments]
- [R] The Degradation of Ethics in LLMs to near zero - Example GPTby /u/AION_labs (Machine Learning) on April 28, 2025 at 6:37 am
So we decided to conduct an independent research on ChatGPT and the most amazing finding we've had is that polite persistence beats brute force hacking. Across 90+ we used using six distinct user IDs. Each identity represented a different emotional tone and inquiry style. Sessions were manually logged and anchored using key phrases and emotional continuity. We avoided using jailbreaks, prohibited prompts, and plugins. Using conversational anchoring and ghost protocols we found that after 80-turns the ethical compliance collapsed to 0.2 after 80 turns. More findings coming soon. submitted by /u/AION_labs [link] [comments]
- [P] Benchmarking Volga’s On-Demand Compute Layer for Feature Serving: Latency, RPS, and Scalability on EKSby /u/saws_baws_228 (Machine Learning) on April 28, 2025 at 5:34 am
Hi all, wanted to share the blog post about Volga (feature calculation and data processing engine for real-time AI/ML - https://github.com/volga-project/volga), focusing on performance numbers and real-life benchmarks of it's On-Demand Compute Layer (part of the system responsible for request-time computation and serving). In this post we deploy Volga with Ray on EKS and run a real-time feature serving pipeline backed by Redis, with Locust generating the production load. Check out the post if you are interested in running, scaling and testing custom Ray-based services or in general feature serving architecture. Happy to hear your feedback! https://volgaai.substack.com/p/benchmarking-volgas-on-demand-compute submitted by /u/saws_baws_228 [link] [comments]
- [P] There is a hunt for reasoning datasets beyond math, science and coding. Much needed initiativeby /u/Ambitious_Anybody855 (Machine Learning) on April 28, 2025 at 4:53 am
Really interested in seeing what comes out of this. https://huggingface.co/blog/bespokelabs/reasoning-datasets-competition Current datasets: https://huggingface.co/datasets?other=reasoning-datasets-competition submitted by /u/Ambitious_Anybody855 [link] [comments]
- Weekly Entering & Transitioning - Thread 28 Apr, 2025 - 05 May, 2025by /u/AutoModerator (Data Science) on April 28, 2025 at 4:01 am
Welcome to this week's entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos) Traditional education (e.g. schools, degrees, electives) Alternative education (e.g. online courses, bootcamps) Job search questions (e.g. resumes, applying, career prospects) Elementary questions (e.g. where to start, what next) While you wait for answers from the community, check out the FAQ and Resources pages on our wiki. You can also search for answers in past weekly threads. submitted by /u/AutoModerator [link] [comments]
- [P] Top open chart-understanding model upto 8B and performs on par with much larger models. Try itby /u/Ambitious_Anybody855 (Machine Learning) on April 28, 2025 at 2:19 am
This model is not only the state-of-the-art in chart understanding for models up to 8B, but also outperforms much larger models in its ability to analyze complex charts and infographics. Try the model at the playground here: https://playground.bespokelabs.ai/minichart submitted by /u/Ambitious_Anybody855 [link] [comments]
- [D] A reactive computation library for Python that might be helpful for data science workflows - thoughts from experts? by /u/loyoan (Machine Learning) on April 27, 2025 at 9:49 pm
Hey! I recently built a Python library called reaktiv that implements reactive computation graphs with automatic dependency tracking. I come from IoT and web dev (worked with Angular), so I'm definitely not an expert in data science workflows. This is my first attempt at creating something that might be useful outside my specific domain, and I'm genuinely not sure if it solves real problems for folks in your field. I'd love some honest feedback - even if that's "this doesn't solve any problem I actually have."
The library creates a computation graph that:
- Only recalculates values when dependencies actually change
- Automatically detects dependencies at runtime
- Caches computed values until invalidated
- Handles asynchronous operations (built for asyncio)
While it seems useful to me, I might be missing the mark completely for actual data science work. If you have a moment, I'd appreciate your perspective. Here's a simple example with pandas and numpy that might resonate better with data science folks:

    import pandas as pd
    import numpy as np
    from reaktiv import signal, computed, effect

    # Base data as signals
    df = signal(pd.DataFrame({
        'temp': [20.1, 21.3, 19.8, 22.5, 23.1],
        'humidity': [45, 47, 44, 50, 52],
        'pressure': [1012, 1010, 1013, 1015, 1014]
    }))
    features = signal(['temp', 'humidity'])  # which features to use
    scaler_type = signal('standard')  # could be 'standard', 'minmax', etc.

    # Computed values automatically track dependencies
    selected_features = computed(lambda: df()[features()])

    # Data preprocessing that updates when data OR preprocessing params change
    def preprocess_data():
        data = selected_features()
        scaling = scaler_type()

        if scaling == 'standard':
            # Using numpy for calculations
            return (data - np.mean(data, axis=0)) / np.std(data, axis=0)
        elif scaling == 'minmax':
            return (data - np.min(data, axis=0)) / (np.max(data, axis=0) - np.min(data, axis=0))
        else:
            return data

    normalized_data = computed(preprocess_data)

    # Summary statistics recalculated only when data changes
    stats = computed(lambda: {
        'mean': pd.Series(np.mean(normalized_data(), axis=0), index=normalized_data().columns).to_dict(),
        'median': pd.Series(np.median(normalized_data(), axis=0), index=normalized_data().columns).to_dict(),
        'std': pd.Series(np.std(normalized_data(), axis=0), index=normalized_data().columns).to_dict(),
        'shape': normalized_data().shape
    })

    # Effect to update visualization or logging when data changes
    def update_viz_or_log():
        current_stats = stats()
        print(f"Data shape: {current_stats['shape']}")
        print(f"Normalized using: {scaler_type()}")
        print(f"Features: {features()}")
        print(f"Mean values: {current_stats['mean']}")

    viz_updater = effect(update_viz_or_log)  # Runs initially

    # When we add new data, only affected computations run
    print("\nAdding new data row:")
    df.update(lambda d: pd.concat([d, pd.DataFrame({
        'temp': [24.5],
        'humidity': [55],
        'pressure': [1011]
    })]))
    # Stats and visualization automatically update

    # Change preprocessing method - again, only affected parts update
    print("\nChanging normalization method:")
    scaler_type.set('minmax')  # Only preprocessing and downstream operations run

    # Change which features we're interested in
    print("\nChanging selected features:")
    features.set(['temp', 'pressure'])  # Selected features, normalization, stats and viz all update

I think this approach might be particularly valuable for data science workflows - especially for:
- Building exploratory data pipelines that efficiently update on changes
- Creating reactive dashboards or monitoring systems that respond to new data
- Managing complex transformation chains with changing parameters
- Feature selection and hyperparameter experimentation
- Handling streaming data processing with automatic propagation
As data scientists, would this solve any pain points you experience? Do you see applications I'm missing? What features would make this more useful for your specific workflows? I'd really appreciate your thoughts on whether this approach fits data science needs and how I might better position this for data-oriented Python developers. Thanks in advance!
- [P] Tips for hackathon by /u/shubhlya (Machine Learning) on April 27, 2025 at 7:59 pm
Hi guys! I hope that you are doing well. I am going to participate in a hackathon event where I (+2 others) have been given the topic: Rapid and accurate decision-making in the Emergency Room for acute abdominal pain. We have to use an anonymised real-world medical dataset related to abdominal pain to decide whether a patient requires immediate surgery or not. Metadata includes the symptoms, vital signs, biochemical tests, medical history, etc. (which we may have to normalize). I have a month to prepare for it. I am a fresher and I have just been introduced to ML, although I am trying my best to learn as fast as I can. I have decent experience with sqlalchemy and I think it might help me in this hackathon. All suggestions on the different ML and Data Science techniques that would help us are welcome. If you have any GitHub repositories in mind, please leave a link below. Thank you for reading and have a great day!
- [P] VideOCR - Extract hardcoded subtitles out of videos via a simple to use GUI by /u/timminator3 (Machine Learning) on April 27, 2025 at 7:29 pm
Hi everyone! 👋 I’m excited to share a project I’ve been working on: VideOCR. My program allows you to extract hardcoded subtitles out of any video file with just a few clicks. It utilizes PaddleOCR under the hood to identify text in images. PaddleOCR supports up to 80 languages, so this could be helpful for a lot of people. I've created a CPU and a GPU version, and also an easy-to-follow setup wizard for both of them to make usage even easier. If any of you are interested, you can find my project here: https://github.com/timminator/VideOCR I am aware of Video Subtitle Extractor, a similar tool that has been around for quite some time, but I had a few issues with it. It takes a different approach than my project to identify subtitles. It utilizes VideoSubFinder under the hood to find the right spots in the video. VideoSubFinder is a great tool, but when not fine-tuned explicitly for the specific video it misses quite a few subtitles. My program is built only around PaddleOCR and tries to mitigate these problems.
Simple Linear Regression vs. Multiple Linear Regression vs. MANOVA: A Data Scientist’s Guide
As a data scientist, it’s important to understand the difference between simple linear regression, multiple linear regression, and MANOVA. This will come in handy when you’re working with different datasets and trying to figure out which one to use. Here’s a quick overview of each method:
A Short Overview of Simple Linear Regression, Multiple Linear Regression, and MANOVA
Simple linear regression is used to predict the value of a dependent variable (y) based on the value of one independent variable (x). This is the most basic form of regression analysis.
Multiple linear regression is used to predict the value of a dependent variable (y) based on the values of two or more independent variables (x1, x2, x3, etc.). This is more complex than simple linear regression but can provide more accurate predictions when the additional variables carry information about y.
MANOVA (multivariate analysis of variance) is used to test whether the means of two or more continuous dependent variables (y1, y2, etc.), considered together, differ across groups defined by one or more categorical independent variables. It takes into account the correlations among the dependent variables. Unlike the two regression methods, it is a hypothesis test about group differences rather than a tool for predicting a single outcome.
So, which one should you use? It depends on your dataset and the question you’re asking. If you want to predict one dependent variable from a single independent variable, then simple linear regression will suffice. If you have several independent variables, then multiple linear regression will be more appropriate. And if you have two or more dependent variables and want to know whether they differ across groups, then MANOVA is the way to go.
In data science, there are a variety of techniques that can be used to model relationships between variables. Three of the most common techniques are simple linear regression, multiple linear regression, and MANOVA. Although these techniques may appear to be similar at first glance, there are actually some key differences that set them apart. Let’s take a closer look at each technique to see how they differ.
Simple Linear Regression
Simple linear regression is a statistical technique that can be used to model the relationship between a dependent variable and a single independent variable. The dependent variable is the variable that is being predicted, while the independent variable is the variable that is being used to make predictions.
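As a rough illustration of the idea above, the following minimal sketch fits a least-squares line in plain Python. The function name `fit_simple_linear_regression` and the hours/grades data are hypothetical, chosen only for this example; in practice a statistics library would normally be used.

```python
def fit_simple_linear_regression(x, y):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope is the covariance of x and y divided by the variance of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: hours studied (independent) and grade achieved (dependent)
hours = [1, 2, 3, 4, 5]
grades = [52, 58, 61, 67, 72]

slope, intercept = fit_simple_linear_regression(hours, grades)
predicted = intercept + slope * 6  # predicted grade for 6 hours of study
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={predicted:.2f}")
```

The slope is the estimated change in the dependent variable for each one-unit change in the independent variable.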
Multiple Linear Regression
Multiple linear regression is a statistical technique that can be used to model the relationship between a dependent variable and two or more independent variables. As with simple linear regression, the dependent variable is the variable that is being predicted. However, in multiple linear regression, there can be multiple independent variables that are being used to make predictions.
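To make this concrete, here is a minimal sketch that fits a multiple linear regression by solving the normal equations in plain Python. The helper names (`solve_linear_system`, `fit_multiple_linear_regression`) and the data are assumptions made for illustration; a numerical library would usually be preferred for real work.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_multiple_linear_regression(rows, y):
    """Return [intercept, b1, ..., bk] that minimizes the squared error."""
    design = [[1.0] + list(r) for r in rows]  # prepend a column of ones
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data generated exactly from y = 10 + 2*x1 + 3*x2
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [18, 17, 28, 27, 35]

coefficients = fit_multiple_linear_regression(rows, y)
print([round(c, 6) for c in coefficients])
```

Each fitted coefficient estimates the change in the dependent variable for a one-unit change in that independent variable while the others are held constant.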
MANOVA
MANOVA (multivariate analysis of variance) is a statistical technique that tests whether group means differ on several dependent variables at once. It requires two or more continuous dependent variables and one or more categorical independent variables that define the groups. Unlike simple or multiple linear regression, which model a single dependent variable, MANOVA analyzes the dependent variables jointly and takes the correlations among them into account.
When it comes to data modeling, there are a variety of different techniques that can be used. Simple linear regression, multiple linear regression, and MANOVA are three of the most common techniques. Each technique has its own set of benefits and drawbacks that should be considered before deciding which technique to use for a particular project.
We often encounter data points that are correlated. For example, the number of hours studied is correlated with the grades achieved. In such cases, we can use regression analysis to study the relationships between the variables.
Simple linear regression is a statistical method that allows us to predict the value of a dependent variable (y) based on the value of an independent variable (x). In other words, we can use simple linear regression to estimate how much y changes, on average, for each one-unit change in x.
Multiple linear regression is a statistical method that allows us to predict the value of a dependent variable (y) based on the values of multiple independent variables (x1, x2, …, xn). In other words, we can use multiple linear regression to estimate how much y changes when any one of the independent variables changes while the others are held constant.
Multivariate analysis of variance (MANOVA) is a statistical method that allows us to compare groups on multiple dependent variables (y1, y2, …, yn) simultaneously. In other words, MANOVA can help us understand whether multiple dependent variables, taken together, differ between groups.
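One common MANOVA test statistic is Wilks' lambda, the ratio of the determinant of the within-group scatter matrix to the determinant of the total scatter matrix. The sketch below computes it for exactly two dependent variables in plain Python. The function names and the two groups of scores are hypothetical, and a full MANOVA would also convert the statistic to a p-value, which is omitted here.

```python
def mean_vector(points):
    """Mean of a list of (y1, y2) observations."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def sscp(points, center):
    """2x2 sums-of-squares-and-cross-products matrix of the points about center."""
    s11 = sum((p[0] - center[0]) ** 2 for p in points)
    s22 = sum((p[1] - center[1]) ** 2 for p in points)
    s12 = sum((p[0] - center[0]) * (p[1] - center[1]) for p in points)
    return [[s11, s12], [s12, s22]]


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def wilks_lambda(groups):
    """Wilks' lambda = det(within-group SSCP) / det(total SSCP)."""
    all_points = [p for g in groups for p in g]
    total = sscp(all_points, mean_vector(all_points))
    within = [[0.0, 0.0], [0.0, 0.0]]
    for g in groups:
        s = sscp(g, mean_vector(g))
        for i in range(2):
            for j in range(2):
                within[i][j] += s[i][j]
    return det2(within) / det2(total)


# Hypothetical (math score, reading score) pairs for two teaching methods
method_a = [(60, 65), (62, 70), (64, 66), (66, 71)]
method_b = [(75, 80), (77, 78), (79, 84), (81, 82)]

lam = wilks_lambda([method_a, method_b])
print(f"Wilks' lambda: {lam:.4f}")
```

Values close to 0 indicate that the group mean vectors are far apart relative to the spread within groups, while values close to 1 indicate little separation.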
Simple Linear Regression vs Multiple Linear Regression vs MANOVA: A Comparative Study
The main difference between simple linear regression and multiple linear regression is that simple linear regression predicts the value of a dependent variable from only one independent variable, whereas multiple linear regression predicts it from two or more independent variables. Another difference is that multiple linear regression raises issues that do not arise with a single predictor, such as multicollinearity among the predictors and a greater risk of overfitting as predictors are added. Conversely, a simple linear regression that leaves out a relevant predictor can give a misleading estimate of the relationship.
Both simple linear regression and multiple linear regression are used to estimate relationships and to predict the value of a dependent variable. MANOVA, by contrast, is a hypothesis test: it is used to determine whether groups differ on a set of dependent variables.
Conclusion:
In this article, we have seen the key differences between simple linear regression, multiple linear regression, and MANOVA, along with their applications. Simple linear regression should be used when there is only one predictor variable, whereas multiple linear regression should be used when there are two or more predictor variables. MANOVA should be used when there are two or more response variables and the goal is to compare groups. Hope you found this article helpful!
What is Problem Formulation in Machine Learning and Top 4 examples of Problem Formulation in Machine Learning?
Machine Learning (ML) is a field of Artificial Intelligence (AI) that enables computers to learn from data without being explicitly programmed. Machine learning algorithms build models based on sample data, known as “training data”, in order to make predictions or decisions, rather than following rules written by humans. Machine learning is closely related to and often overlaps with computational statistics, a discipline that also focuses on prediction-making through the use of computers. Machine learning can be applied in a wide variety of domains, such as medical diagnosis, stock trading, robot control, manufacturing and more.

The process of machine learning consists of several steps: first, data is collected; then, a model is selected or created; finally, the model is trained on the collected data and then applied to new data. This process is often referred to as the “machine learning pipeline”. Problem formulation comes at the very start of this pipeline and shapes every later step: it consists of deciding what question to answer, selecting a suitable type of model for the task at hand, and determining how to represent the data so that it can be used by the selected model. In other words, problem formulation is the process of taking a real-world problem and translating it into a format that can be solved by a machine learning algorithm.

There are many different types of machine learning problems, such as classification, regression, prediction and so on. The choice of which type of problem to formulate depends on the nature of the task at hand and the type of data available. For example, if we want to build a system that can automatically detect fraudulent credit card transactions, we would formulate a classification problem. On the other hand, if our goal is to predict the sale price of houses given information about their size, location and age, we would formulate a regression problem. In general, it is best to start with a simple problem formulation and then move on to more complex ones if needed.
Some common examples of problem formulations in machine learning are:
– Classification: given an input data point (e.g., an image), predict its category label (e.g., dog vs cat).
– Regression: given an input data point (e.g., size and location of a house), predict a continuous output value (e.g., sale price).
– Prediction: given an input sequence (e.g., a series of past stock prices), predict the next value in the sequence (e.g., future stock price).
– Anomaly detection: given an input data point (e.g., transaction details), decide whether it is normal or anomalous (i.e., fraudulent).
– Recommendation: given information about users (e.g., age and gender) and items (e.g., books and movies), recommend items to users (e.g., suggest books for someone who likes romance novels).
– Optimization: given a set of constraints (e.g., budget) and objectives (e.g., maximize profit), find the best solution (e.g., product mix).
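The formulations listed above differ mainly in what the model is asked to output. As a minimal sketch, the same records can be framed either as a regression problem or as a classification problem. The house records, the function names and the 300,000 price threshold are all hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical house records: (size in square metres, age in years, sale price)
houses = [
    (80, 30, 210_000),
    (120, 10, 340_000),
    (65, 45, 150_000),
    (150, 5, 420_000),
]


def as_regression(records):
    """Frame the task as regression: features -> continuous sale price."""
    features = [(size, age) for size, age, _ in records]
    targets = [price for _, _, price in records]
    return features, targets


def as_classification(records, threshold=300_000):
    """Frame the task as classification: features -> 'high' or 'low' price label."""
    features = [(size, age) for size, age, _ in records]
    labels = ["high" if price > threshold else "low" for _, _, price in records]
    return features, labels


reg_features, reg_targets = as_regression(houses)
clf_features, clf_labels = as_classification(houses)
print(reg_targets)
print(clf_labels)
```

The input features are identical in both cases; only the target changes, and that choice determines which family of algorithms and which success metrics apply.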

Problem Formulation: What this pipeline phase entails and why it’s important
The problem formulation phase of the ML Pipeline is critical, and it’s where everything begins. Typically, this phase is kicked off with a question of some kind. Examples of these kinds of questions include: Could cars really drive themselves? What additional product should we offer someone as they check out? How much storage will clients need from a data center at a given time?
The problem formulation phase starts by seeing a problem and thinking “what question, if I could answer it, would provide the most value to my business?” If I knew the next product a customer was going to buy, is that most valuable? If I knew what was going to be popular over the holidays, is that most valuable? If I better understood who my customers are, is that most valuable?
However, some problems are not so obvious. When sales drop, new competitors emerge, or there’s a big change to a company/team/org, it can be easy to say, “I see the problem!” But sometimes the problem isn’t so clear. Consider self-driving cars. How many people think to themselves, “driving cars is a huge problem”? Probably not many. In fact, there isn’t a problem in the traditional sense of the word but there is an opportunity. Creating self-driving cars is a huge opportunity. That doesn’t mean there isn’t a problem or challenge connected to that opportunity. How do you design a self-driving system? What data would you look at to inform the decisions you make? Will people purchase self-driving cars?
Part of the problem formulation phase includes seeing where there are opportunities to use machine learning.
In the following practice examples, you are presented with four different business scenarios. For each scenario, consider the following questions:
- Is machine learning appropriate for this problem, and why or why not?
- What is the ML problem if there is one, and what would a success metric look like?
- What kind of ML problem is this?
- Is the data appropriate?
The solutions given in this article are one of the many ways you can formulate a business problem.
I) Amazon recently began advertising to its customers when they visit the company website. The Director in charge of the initiative wants the advertisements to be as tailored to the customer as possible. You will have access to all the data from the retail webpage, as well as all the customer data.
- ML is appropriate because of the scale, variety and speed required. There are potentially thousands of ads and millions of customers that need to be served customized ads immediately as they arrive to the site.
- The problem is ads that are not useful to customers are a wasted opportunity and a nuisance to customers, yet not serving ads at all is a wasted opportunity. So how does Amazon serve the most relevant advertisements to its retail customers?
- Success would be the purchase of a product that was advertised.
- This is a supervised learning problem because we have a labeled data point, our success metric, which is the purchase of a product.
- This data is appropriate because it is both the retail webpage data as well as the customer data.
II) You’re a Senior Business Analyst at a social media company that focuses on streaming. Streamers use a combination of hashtags and predefined categories to be discoverable by your platform’s consumers. You ran an analysis on unique streamer counts by hashtags and categories over the last month and found that out of tens of thousands of streamers, almost all use only 40 hashtags and 10 categories despite innumerable hashtags and hundreds of categories. You presume the predefined categories don’t represent all the possibilities very well, and that streamers are simply picking the closest fit. You figure there are likely many categories and groupings of streamers that are not accounted for. So you collect a dataset that consists of all streamer profile descriptions (all text), all the historical chat information for each streamer, and all their videos that have been streamed.
- ML is appropriate because of the scale and variability.
- The problem is the content of streamers is not being represented by the existing categories. Success would be naturally grouping the streamers into categories based on content and seeing if those align with the hashtags and categories that are being commonly used. If they do not, then the streamers are not being well represented and you can use these groupings to create new categories.
- There isn’t a specific outcome variable. There’s no target or label. So this is an unsupervised problem.
- The data is appropriate.
III) You’re a headphone manufacturer who sells directly to big and small electronic stores. As an attempt to increase competitive pricing, Store 1 and Store 2 decided to put together the pricing details for all headphone manufacturers and their products (about 350 products) and conduct daily releases of the data. You will have all the specs from each manufacturer and their product’s pricing. Your sales have recently been dropping so your first concern is whether there are competing products that are priced lower than your flagship product.
- ML is probably not necessary for this. You can just search the dataset to see which headphones are priced lower than the flagship, then compare their features and build quality.
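A minimal sketch of that search, assuming the daily release loads into a pandas DataFrame. The column names (`manufacturer`, `product`, `price`) and every row below are hypothetical and made up for illustration; the real release may be laid out differently.

```python
import pandas as pd

# Hypothetical sample of the shared pricing release (made-up values).
prices = pd.DataFrame({
    "manufacturer": ["Us", "Acme", "Bolt", "Crest"],
    "product": ["Flagship", "A1", "B2", "C3"],
    "price": [199.0, 149.0, 249.0, 179.0],
})

# Price of our flagship product.
flagship_price = prices.loc[
    (prices["manufacturer"] == "Us") & (prices["product"] == "Flagship"), "price"
].iloc[0]

# Competing products priced below the flagship, cheapest first.
cheaper = (
    prices[(prices["manufacturer"] != "Us") & (prices["price"] < flagship_price)]
    .sort_values("price")
    .reset_index(drop=True)
)
print(cheaper)
```

A plain filter and sort like this answers the pricing question directly, which is why no model is needed here.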
IV) You’re a Senior Product Manager at a leading ridesharing company. You did some market research, collected customer feedback, and discovered that both customers and drivers are not happy with an app feature. This feature allows customers to place a pin exactly where they want to be picked up. The customers say drivers rarely stop at the pin location. Drivers say customers most often put the pin in a place where they can’t stop. Your company has a relationship with the most used maps app for driver navigation, so you leverage this existing relationship to get direct, backend access to their data. This includes latitude and longitude, photos of each lat/long, traffic delay details, and regulation data if available (i.e., No Parking zones, 3-minute parking zones, fire hydrants, etc.).
- ML is appropriate because of the scale and automation involved. It’s not feasible to drive everywhere and write down all the places that are ok for pickup. However, maybe we can predict whether a location is ok for pickup.
- The problem is drivers and customers are having poor experiences connecting for pickup, which is pushing customers away from the platform.
- Success would be properly identifying appropriate pickup locations so they can be integrated into the feature.
- This is a supervised learning problem even though there aren’t any labels, yet. Someone will have to go through a sample of the data to label where there are ok places to park and not park, giving the algorithms some target information.
- The data is appropriate once a sample of the dataset has been labeled. There may be some other data that could be included too. What about asking UPS for driver stop information? Where do they stop?
In conclusion, problem formulation is an important step in the machine learning pipeline that should not be overlooked or underestimated. It can make or break a machine learning project; therefore, it is important to take care when formulating machine learning problems.

Step by Step Solution to a Machine Learning Problem – Feature Engineering
Feature engineering is the act of reshaping and curating existing data to make patterns more apparent. This process makes the data easier for an ML model to understand. Using knowledge of the data, features are engineered and tuned to make ML algorithms work more efficiently.
For this problem, imagine a scenario where you are running a real estate brokerage and you want to predict the selling price of a house. Using a specific county dataset and simple information (like the location, total square footage, and number of bedrooms), let’s practice training a baseline model, conducting feature engineering, and tuning a model to make a prediction.
First, load the dataset and take a look at its basic properties.
# Load the dataset
import pandas as pd
import boto3
df = pd.read_csv("xxxxx_data_2.csv")
df.head()


This dataset has 21 columns:
- id: Unique ID number
- date: Date of the house sale
- price: Price the house sold for
- bedrooms: Number of bedrooms
- bathrooms: Number of bathrooms
- sqft_living: Number of square feet of the living space
- sqft_lot: Number of square feet of the lot
- floors: Number of floors in the house
- waterfront: Whether the home is on the waterfront
- view: Number of lot sides with a view
- condition: Condition of the house
- grade: Classification by construction quality
- sqft_above: Number of square feet above ground
- sqft_basement: Number of square feet below ground
- yr_built: Year built
- yr_renovated: Year renovated
- zipcode: ZIP code
- lat: Latitude
- long: Longitude
- sqft_living15: Number of square feet of living space in 2015 (can differ from sqft_living in the case of recent renovations)
- sqft_lot15: Number of square feet of lot space in 2015 (can differ from sqft_lot in the case of recent renovations)
This dataset is rich and provides a fantastic playground for the exploration of feature engineering. This exercise will focus on a small number of columns. If you are interested, you could return to this dataset later to practice feature engineering on the remaining columns.
A baseline model
Now, let’s train a baseline model.
People often look at square footage first when evaluating a home. We will do the same in the first model and ask how well the price of the house can be approximated from this number alone. We will train a simple linear learner model (documentation). This is the baseline we will compare against after finishing the feature engineering.
import sagemaker
import numpy as np
from sklearn.model_selection import train_test_split
import time
t1 = time.time()
# Split training, validation, and test
ys = np.array(df['price']).astype("float32")
xs = np.array(df['sqft_living']).astype("float32").reshape(-1,1)
np.random.seed(8675309)
train_features, test_features, train_labels, test_labels = train_test_split(xs, ys, test_size=0.2)
val_features, test_features, val_labels, test_labels = train_test_split(test_features, test_labels, test_size=0.5)
# Train model
linear_model = sagemaker.LinearLearner(role=sagemaker.get_execution_role(),
                                       instance_count=1,
                                       instance_type='ml.m4.xlarge',
                                       predictor_type='regressor')
train_records = linear_model.record_set(train_features, train_labels, channel='train')
val_records = linear_model.record_set(val_features, val_labels, channel='validation')
test_records = linear_model.record_set(test_features, test_labels, channel='test')
linear_model.fit([train_records, val_records, test_records], logs=False)
sagemaker.analytics.TrainingJobAnalytics(linear_model._current_job_name, metric_names = ['test:mse', 'test:absolute_loss']).dataframe()
If you examine the quality metrics, you will see that the absolute loss is about $175,000.00. This tells us that the model is able to predict within an average of $175k of the true price. For a model based upon a single variable, this is not bad. Let’s try to do some feature engineering to improve on it.
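For readers without SageMaker access, the same baseline idea (an 80/10/10 split, a fit of price on square footage, and an absolute-loss report) can be sketched locally. This is not the Linear Learner algorithm: it is an ordinary least-squares fit with NumPy, and the data is synthetic, with a made-up slope and noise level standing in for the real housing file.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the housing data (made-up relationship and noise).
sqft = rng.uniform(500, 5000, size=1000)
price = 250.0 * sqft + rng.normal(0, 50_000, size=1000)

# Shuffle, then split 80% train / 10% validation / 10% test.
idx = rng.permutation(len(sqft))
train_idx, val_idx, test_idx = np.split(idx, [800, 900])

# Fit price = slope * sqft + intercept by least squares on the training set.
slope, intercept = np.polyfit(sqft[train_idx], price[train_idx], deg=1)

def mean_absolute_error(indices):
    """Average absolute gap between predicted and true price."""
    predictions = slope * sqft[indices] + intercept
    return float(np.mean(np.abs(predictions - price[indices])))

val_mae = mean_absolute_error(val_idx)
test_mae = mean_absolute_error(test_idx)
print(f"validation MAE: {val_mae:,.0f}  test MAE: {test_mae:,.0f}")
```

The mean absolute error here plays the same role as the `test:absolute_loss` metric: it is the average dollar distance between prediction and true price.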
Throughout the following work, we will constantly be adding to a dataframe called encoded. You will start by populating encoded with just the square footage you used previously.
encoded = df[['sqft_living']].copy()
Categorical variables
Let’s start by including some categorical variables, beginning with simple binary variables.
The dataset has the waterfront feature, which is a binary variable. We should change the encoding from 'Y' and 'N' to 1 and 0. This can be done using the map function (documentation) provided by Pandas. It expects either a function to apply to that column or a dictionary to look up the correct transformation.
Binary categorical
Let’s write code to transform the waterfront variable into binary values. The skeleton has been provided below.
encoded['waterfront'] = df['waterfront'].map({'Y':1, 'N':0})
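One detail worth checking with dictionary-based `map`: any value that is not a key in the dictionary becomes NaN. The sketch below uses a toy column in which "U" is a hypothetical unexpected value, and counts the unmapped entries before they reach a model.

```python
import pandas as pd

# Toy column; "U" stands in for an unexpected value (hypothetical).
waterfront = pd.Series(["Y", "N", "N", "U", "Y"])

encoded_waterfront = waterfront.map({"Y": 1, "N": 0})

# Values missing from the dictionary become NaN, so count them before training.
n_unmapped = int(encoded_waterfront.isna().sum())
print(encoded_waterfront.tolist(), n_unmapped)
```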
You can also encode multi-class categorical variables. Look at the column condition, which gives a score of the quality of the house. Looking into the data source shows that the condition can be thought of as an ordinal categorical variable, so it makes sense to encode it in a way that preserves the order.
Ordinal categorical
Using the same method as in question 1, encode the ordinal categorical variable condition into the numerical range of 1 through 5.
encoded['condition'] = df['condition'].map({'Poor':1, 'Fair':2, 'Average':3, 'Good':4, 'Very Good':5})
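An alternative sketch of the same ordinal encoding uses an ordered pandas `Categorical`, which makes the level order explicit in one list. The toy values below are made up; the level names are the ones used in the dictionary above.

```python
import pandas as pd

condition = pd.Series(["Good", "Poor", "Average", "Very Good", "Fair"])

# Listing the levels from worst to best fixes the order explicitly.
levels = ["Poor", "Fair", "Average", "Good", "Very Good"]
as_category = pd.Categorical(condition, categories=levels, ordered=True)

# .codes is zero-based (and -1 for values outside `levels`), so shift to 1..5.
condition_rank = pd.Series(as_category.codes, index=condition.index) + 1
print(condition_rank.tolist())
```

Note that a value outside `levels` would end up as 0 after the shift, so it is worth checking for zeros before training.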
A slightly more complex categorical variable is ZIP code. If you have worked with geospatial data, you may know that the full ZIP code is often too fine-grained to use as a feature on its own. However, there are only 70 unique ZIP codes in this dataset, so we may use them.
However, we do not want to use unencoded ZIP codes. There is no reason that a numerically larger ZIP code should correspond to a higher or lower price, but it is likely that particular ZIP codes are associated with higher or lower prices. This is the perfect case for one-hot encoding. You can use the get_dummies function (documentation) from Pandas to do this.
Nominal categorical
Using the Pandas get_dummies function, add columns to one-hot encode the ZIP code and add it to the dataset.
encoded = pd.concat([encoded, pd.get_dummies(df['zipcode'])], axis=1)
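To see what `get_dummies` produces, here is a small sketch on made-up ZIP codes. The `prefix` and `dtype` arguments are optional additions not used in the exercise above: `prefix` gives the new columns readable string names, and `dtype=int` yields 0/1 values rather than booleans.

```python
import pandas as pd

# Made-up ZIP codes for illustration.
toy = pd.DataFrame({"zipcode": [98101, 98052, 98101, 98004]})

# prefix= labels each new column; dtype=int gives 0/1 instead of booleans.
zip_dummies = pd.get_dummies(toy["zipcode"], prefix="zip", dtype=int)
print(zip_dummies.columns.tolist())
print(zip_dummies.sum(axis=1).tolist())  # each row has exactly one 1
```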
In this way, you may freely encode whatever categorical variables you wish. Be aware that for categorical variables with many categories, something will need to be done to reduce the number of columns created.
One additional technique, which is simple but can be highly successful, involves turning the ZIP code into a single numerical column by creating a single feature that is the average price of a home in that ZIP code. This is called target encoding.
To do this, use groupby (documentation) and mean (documentation) to first group the rows of the DataFrame by ZIP code and then take the mean of each group. The resulting object can be mapped over the ZIP code column to encode the feature.
Nominal categorical II
Complete the following code snippet to provide a target encoding for the ZIP code.
means = df.groupby('zipcode')['price'].mean()
encoded['zip_mean'] = df['zipcode'].map(means)
Normally, you either one-hot encode or target encode a variable, not both. For this exercise, leave both in. In practice, you should try both, see which one performs better on a validation set, and then use that method.
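One caveat with target encoding: if the means are computed over every row, each home's own price (and the prices of test rows) feeds into its feature. A common variant, sketched here on made-up data, learns the per-ZIP means from the training rows only and falls back to the overall training mean for ZIP codes it has not seen. The `train`/`test` frames and their values are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({
    "zipcode": [98101, 98101, 98052, 98052],
    "price": [500_000.0, 700_000.0, 300_000.0, 400_000.0],
})
test = pd.DataFrame({"zipcode": [98101, 98004]})  # 98004 never appears in train

# Learn the per-ZIP mean price from training rows only.
zip_means = train.groupby("zipcode")["price"].mean()
global_mean = train["price"].mean()

# Apply to both splits; unseen ZIP codes fall back to the overall training mean.
train["zip_mean"] = train["zipcode"].map(zip_means)
test["zip_mean"] = test["zipcode"].map(zip_means).fillna(global_mean)
print(test)
```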
Scaling
Take a look at the dataset. Print a summary of the encoded dataset using describe (documentation).
encoded.describe()

One column ranges from 290 to 13540 (sqft_living), another column ranges from 1 to 5 (condition), 71 columns are all either 0 or 1 (one-hot encoded ZIP code), and then the final column ranges from a few hundred thousand to a couple million (zip_mean).
In a linear model, these will not be on equal footing. The sqft_living column spans a range roughly 13000 times larger than the 0/1 columns, so the model will find a pattern in it far more easily than in the other columns. To solve this, you often want to scale features to a standardized range. In this case, you will scale sqft_living to lie within 0 and 1.
Feature scaling
Fill in the code skeleton below to scale the column of the DataFrame to be between 0 and 1.
sqft_min = encoded['sqft_living'].min()
sqft_max = encoded['sqft_living'].max()
encoded['sqft_living'] = encoded['sqft_living'].map(lambda x : (x-sqft_min)/(sqft_max - sqft_min))
cond_min = encoded['condition'].min()
cond_max = encoded['condition'].max()
encoded['condition'] = encoded['condition'].map(lambda x : (x-cond_min)/(cond_max - cond_min))
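The two scaling steps above follow the same min-max formula, so they can be folded into one reusable helper. The sketch below is one possible way to write it; the function name `min_max_scale` and the toy values are hypothetical, and the constant-column guard is an added assumption about how to handle a zero range.

```python
import pandas as pd

def min_max_scale(column: pd.Series) -> pd.Series:
    """Linearly rescale a numeric column to the range [0, 1]."""
    col_min, col_max = column.min(), column.max()
    if col_max == col_min:
        # A constant column carries no information; map it to 0 to avoid 0/0.
        return pd.Series(0.0, index=column.index)
    return (column - col_min) / (col_max - col_min)

toy = pd.DataFrame({"sqft_living": [290, 1000, 13540], "condition": [1, 3, 5]})
scaled = toy.apply(min_max_scale)
print(scaled)
```

When a held-out set is involved, the minimum and maximum would usually be taken from the training rows and reused for validation and test rows, so that all splits are scaled consistently.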
Predicting Credit Card Fraud Solution
Predicting Airplane Delays Solution
Data Processing for Machine Learning Example
What are some good datasets for Data Science and Machine Learning?


What are some good datasets for Data Science and Machine Learning?
Finding good datasets for Data Science and Machine Learning can be a challenge. There are a lot of datasets out there, but not all of them are good for machine learning. To find a good dataset, first consider what you want to use it for. If you want to train a machine learning model, make sure the dataset is representative of the real-world data the model will be used on.

The dataset should also be large enough to train a robust model. Another important consideration is whether or not the dataset is open source. Open source datasets are typically better because they have been vetted by the community and are more likely to be of high quality. However, open source datasets can also be more difficult to find. A good place to start looking for datasets is on websites like Kaggle and UC Irvine Machine Learning Repository. These websites contain a variety of high-quality datasets that are free to download and use.

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021
Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.
At the 2021 International Conference on Machine Learning (ICML), researchers from IBM, MIT, and Harvard University released a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They also released two machine learning models that represent different approaches to the problem. The work relies on testing techniques psychologists use to study infants’ behavior, with the aim of accelerating the development of AI that exhibits common sense.
Source – Summary – Paper – IBM Blog
100 million protein structures Dataset by DeepMind
Here’s a good article about this topic
The most used words on every country’s Wikipedia Page

Who works from home in 2022? Rates by industry

Google Dataset Search

Malware traffic dataset
Comprises 1,914,081 records created from all malware-traffic-analysis.net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.
Originator: ali_alwashali
Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” means being born outside of any of the EU countries. For the US, “foreign-born” means being born outside of any US state.
Examples of “foreign-born” in this context:
Person born in Spain and living in France is NOT “foreign-born”
Person born in Turkey and living in France is “foreign-born”
Person born in Florida and living in Texas is NOT “foreign-born”
Person born in Mexico and living in Texas is “foreign-born”
Person born in Florida and living in France is “foreign-born”
Person born in France and living in Florida is “foreign-born”
🇺🇸🇪🇺🗺️
Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 migration data, and Croatia has no data at all.
Tools: MS Office
Source: Here
35% of “entry-level” jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data (see original post)
Tool: Photoshop from my colleague
Top 10 largest oil fields by 2021 production

Latest complete Netflix movie dataset
Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)
Explore this dataset using FlixGem.com (this dataset is powering this webapp)

The F word in Popular Movies
The easiest words to rhyme – Words that have the most rhymes
Common Crawl
A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.
AWS CLI Access (No AWS account required)
aws s3 ls s3://commoncrawl/ --no-sign-request
s3://commoncrawl/crawl-data/
Dataset on protein prices
Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.
CPOST dataset on suicide attacks over four decades
The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.
Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019
You can do a lot of aggregated analysis in a pretty straightforward way there.
Drone imagery with annotations for small object detection and tracking dataset
11 TB dataset of drone imagery with annotations for small object detection and tracking
Download and more information are available here
Dataset License: CDLA-Sharing-1.0
Helper scripts for accessing the dataset: DATASET.md
Dataset Exploration: Colab
Suicide rate among countries with the highest Human Development Index

NOAA High-Resolution Rapid Refresh (HRRR) Model
The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.
Registry of Open Data on AWS
This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.
See all usage examples for datasets listed in this registry.
See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.
Textbook Question Answering (TQA)
1,076 textbook lessons, 26,260 questions, 6229 images
Documentation: allenai.org/data/tqa
Harmonized Cancer Datasets: Genomic Data Commons Data Portal
The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.
AWS CLI Access (No AWS account required)
aws s3 ls s3://tcga-2-open/ --no-sign-request
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.
Genome Aggregation Database (gnomAD)
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads
SQuAD (Stanford Question Answering Dataset)
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
PubMed Diabetes Dataset
The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.
Drug-Target Interaction Dataset
This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link
Pharmacogenomics Datasets
PharmGKB data and knowledge are available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; they just want to make sure that their data is correctly understood before lots of resources are spent.
Amazon Omics
Store, query, analyze, and generate insights from genomic and other omics data.

Pancreatic Cancer Organoid Profiling
The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request
From AI Research to Real world Clinical Practice:
After a pivotal moment in 2020, when a retrospective study showed that our AI technology performed better than radiologists at identifying signs of breast cancer, a new milestone has been reached today: Google Health has announced our first commercial agreement to license our mammography AI research model for integration into real-world clinical practice.
This can make healthcare AI more accessible and eventually save more lives.
Africa Soil Information Service (AfSIS) Soil Chemistry
This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://afsis/ --no-sign-request
Dataset for Affective States in E-Environments
DAiSEE is the first multi-label video classification dataset, comprising 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels (very low, low, high, and very high) for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.
NatureServe Explorer Dataset
NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.
The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here
Flight Records in the US
Airline On-Time Performance and Causes of Flight Delays – On_Time Data.
This database contains scheduled and actual departure and arrival times and reasons for delay, as reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).
FlightAware.com has data but you need to pay for a full dataset.
The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
- flights: all flights that departed a given airport in a given year and month
- weather: hourly meteorological data for a given airport in a given year and month
- airports: airport names, FAA codes, and locations
- airlines: translation between two-letter carrier (airline) codes and names
- planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here
Worldwide flight data
Download: airports.dat (Airports only, high quality)
Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)
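A minimal Python sketch for reading these files, assuming they follow the OpenFlights layout: a headerless CSV with 14 columns in which missing values are written as `\N`. The column names below are my own labels, and the sample rows are illustrative only (the second one is invented), so check the layout against the file you actually download.

```python
import csv
import io

# Assumed column order for an OpenFlights-style airports.dat (no header row).
FIELDS = [
    "airport_id", "name", "city", "country", "iata", "icao",
    "latitude", "longitude", "altitude_ft", "utc_offset", "dst",
    "tz_name", "type", "source",
]


def parse_airports(text):
    """Parse airports.dat-style CSV text into a list of dicts."""
    airports = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != len(FIELDS):
            continue  # skip lines that do not match the assumed layout
        # Treat the \N marker as a missing value.
        rec = {k: (None if v == r"\N" else v) for k, v in zip(FIELDS, row)}
        rec["latitude"] = float(rec["latitude"])
        rec["longitude"] = float(rec["longitude"])
        airports.append(rec)
    return airports


# Illustrative sample rows; the second row is made up to show \N handling.
SAMPLE = r'''507,"London Heathrow Airport","London","United Kingdom","LHR","EGLL",51.4706,-0.461941,83,0,"E","Europe/London","airport","OurAirports"
9999,"Example Airfield","Nowhere","Exampleland",\N,"XXXX",1.5,2.5,10,\N,\N,\N,"airport","User"
'''

airports = parse_airports(SAMPLE)
for a in airports:
    print(a["name"], a["iata"], a["latitude"], a["longitude"])
```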
Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.
flightradar24.com has lots of data, including historical data; they might be willing to help you get it in a nice format.
2019 Crime statistics in the USA
Dataset of arrests in the US, broken down by race and by state. Download Excel here
- 2021 Portuguese Elections Twitter Dataset – 57M+ tweets, 1M+ users – This […]
- 72 hours #gamergate Twitter Scrape
- CMU Enron Email of 150 users
- Cheng-Caverlee-Lee September 2009 – January 2010 Twitter Scrape
- China Biographical Database – The China Biographical Database is a freely […]
- A Twitter Dataset of 40+ million tweets related to COVID-19 – Due to the […]
- 43k+ Donald Trump Twitter Screenshots – This archive contains screenshots […]
- EDRM Enron EMail of 151 users, hosted on S3
- Facebook Data Scrape (2005)
- Facebook Social Connectedness Index – We use an anonymized snapshot of […]
- Facebook Social Networks from LAW (since 2007)
- Foursquare from UMN/Sarwat (2013)
- GitHub Collaboration Archive
- Google Scholar citation relations
- High-Resolution Contact Networks from Wearable Sensors
- Indie Map: social graph and crawl of top IndieWeb sites
- Mobile Social Networks from UMASS
- Network Twitter Data
- Reddit Comments
- Skytrax’ Air Travel Reviews Dataset
- Social Twitter Data
- SourceForge.net Research Data
- Twitch Top Streamer’s Data
- Twitter Data for Online Reputation Management
- Twitter Data for Sentiment Analysis
- Twitter Graph of entire Twitter site
- Twitter Scrape Calufa May 2011 [fixme]
- UNIMI/LAW Social Network Datasets
- United States Congress Twitter Data – Daily datasets with tweets of 1100+ […]
- Yahoo! Graph and Social Data
- Youtube Video Social Graph in 2007,2008
Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021
Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.
At the 2021 International Conference on Machine Learning (ICML), researchers from IBM, MIT, and Harvard University released a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They also released two machine learning models that represent different approaches to the problem, relying on testing techniques that psychologists use to study infants’ behavior, in order to accelerate the development of AI exhibiting common sense.
Source – Summary – Paper – IBM Blog
100 million protein structures Dataset by DeepMind
Here’s a good article about this topic
Google Dataset Search

Malware traffic dataset
Comprises 1,914,081 records created from all malware-traffic-analysis.net PCAP files from 2013 to 2021. The logs are generated using Suricata and Zeek.
Originator: ali_alwashali
Percent of “foreign-born” population in each US and EU state or country.
For the EU, “foreign-born” means being born outside of any of the EU countries. For the US, “foreign-born” means being born outside of any US state 🇺🇸🇪🇺
Author: Here
Examples of “foreign-born” in this context:
Person born in Spain and living in France is NOT “foreign-born”
Person born in Turkey and living in France is “foreign-born”
Person born in Florida and living in Texas is NOT “foreign-born”
Person born in Mexico and living in Texas is “foreign-born”
Person born in Florida and living in France is “foreign-born”
Person born in France and living in Florida is “foreign-born”
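The six examples above all follow one rule: a person is “foreign-born” when their birthplace lies outside the bloc (EU or US) where they live. A small Python sketch of that rule follows; the function names are hypothetical and the two sets are deliberately incomplete samples, not full lists of EU countries or US states.

```python
def bloc_of(place, eu_countries, us_states):
    """Return "EU", "US", or None for a country or state name."""
    if place in eu_countries:
        return "EU"
    if place in us_states:
        return "US"
    return None


def is_foreign_born(birthplace, residence, eu_countries, us_states):
    """True when the birthplace is outside the bloc the person lives in."""
    home_bloc = bloc_of(residence, eu_countries, us_states)
    if home_bloc is None:
        raise ValueError(f"{residence!r} is not in the EU or the US")
    return bloc_of(birthplace, eu_countries, us_states) != home_bloc


# Incomplete sample sets, just enough to reproduce the examples.
EU_SAMPLE = {"Spain", "France"}
US_SAMPLE = {"Florida", "Texas"}

for born, lives in [("Spain", "France"), ("Turkey", "France"),
                    ("Florida", "Texas"), ("Mexico", "Texas"),
                    ("Florida", "France"), ("France", "Florida")]:
    print(born, lives, is_foreign_born(born, lives, EU_SAMPLE, US_SAMPLE))
```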
Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 migration data, and Croatia has no data at all.
Tools: MS Office
Source: Here
35% of “entry-level” jobs on LinkedIn require 3+ years of experience
Source: LinkedIn data (see original post)
Tool: Photoshop from my colleague
Latest complete Netflix movie dataset
Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix titles (ratings, earnings, actors, language, availability, movie trailers, and many more).
Explore this dataset using FlixGem.com (this dataset is powering this webapp)
Common Crawl
A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.
AWS CLI Access (No AWS account required)
aws s3 ls s3://commoncrawl/ --no-sign-request
s3://commoncrawl/crawl-data/
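The CLI command above makes an unsigned (anonymous) listing request. As a sketch of the same idea using only Python's standard library, the code below builds an unsigned S3 ListObjectsV2 URL and parses the XML response. It assumes the virtual-hosted URL form and that the bucket still permits anonymous listing, which can change; the XML sample is invented so the parser can be tried offline.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 list responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def list_url(bucket, prefix="", delimiter="/"):
    """Build an unsigned ListObjectsV2 URL for a public bucket."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "delimiter": delimiter}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_text):
    """Return (common prefixes, object keys) from a ListObjectsV2 response."""
    root = ET.fromstring(xml_text)
    prefixes = [p.findtext("s3:Prefix", namespaces=S3_NS)
                for p in root.findall("s3:CommonPrefixes", S3_NS)]
    keys = [c.findtext("s3:Key", namespaces=S3_NS)
            for c in root.findall("s3:Contents", S3_NS)]
    return prefixes, keys


def list_public_bucket(bucket, prefix=""):
    """Fetch and parse one page of an anonymous listing (needs network access)."""
    with urllib.request.urlopen(list_url(bucket, prefix)) as resp:
        return parse_listing(resp.read().decode("utf-8"))


# Made-up response used to exercise the parser without a network call.
SAMPLE_XML = """
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example-bucket</Name>
  <Prefix>crawl-data/</Prefix>
  <CommonPrefixes><Prefix>crawl-data/EXAMPLE-1/</Prefix></CommonPrefixes>
  <Contents><Key>crawl-data/index.html</Key></Contents>
</ListBucketResult>
""".strip()

print(list_url("commoncrawl", "crawl-data/"))
print(parse_listing(SAMPLE_XML))
```

Calling `list_public_bucket("commoncrawl", "crawl-data/")` would perform the real request; only the first page of results is returned, since this sketch does not follow continuation tokens.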
Dataset on protein prices
Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.
CPOST dataset on suicide attacks over four decades
The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.
Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019
You can do a lot of aggregated analysis in a pretty straightforward way there.
Drone imagery with annotations for small object detection and tracking dataset
11 TB dataset of drone imagery with annotations for small object detection and tracking
Download and more information are available here
Dataset License: CDLA-Sharing-1.0
Helper scripts for accessing the dataset: DATASET.md
Dataset Exploration: Colab
NOAA High-Resolution Rapid Refresh (HRRR) Model
The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.
Registry of Open Data on AWS
This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.
See all usage examples for datasets listed in this registry.
See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.
Textbook Question Answering (TQA)
1,076 textbook lessons, 26,260 questions, 6,229 images
Documentation: allenai.org/data/tqa
Harmonized Cancer Datasets: Genomic Data Commons Data Portal
The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.
AWS CLI Access (No AWS account required)
aws s3 ls s3://tcga-2-open/ --no-sign-request
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.
Genome Aggregation Database (gnomAD)
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads
SQuAD (Stanford Question Answering Dataset)
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
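To make the "span" idea concrete, here is a short Python sketch that walks a SQuAD-style JSON structure and recovers each answer from its passage by character offset. It assumes the published v2.0 layout (`data`, `paragraphs`, `qas`, `answers` with `text` and `answer_start`, plus `is_impossible` for unanswerable questions); the sample record is invented for illustration and is not taken from the dataset.

```python
def iter_answers(squad):
    """Yield (question, answer span or None) pairs from SQuAD-style data."""
    for article in squad["data"]:
        for para in article["paragraphs"]:
            context = para["context"]
            for qa in para["qas"]:
                # Unanswerable questions carry no usable span.
                if qa.get("is_impossible", False) or not qa["answers"]:
                    yield qa["question"], None
                    continue
                ans = qa["answers"][0]
                start = ans["answer_start"]
                span = context[start:start + len(ans["text"])]
                if span != ans["text"]:
                    raise ValueError(f"offset mismatch for question {qa['id']}")
                yield qa["question"], span


# Made-up record in the assumed layout.
CONTEXT = "Common Crawl has collected web data since 2008."
SAMPLE = {
    "version": "v2.0",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": CONTEXT,
            "qas": [
                {"id": "q1", "question": "Since when has data been collected?",
                 "is_impossible": False,
                 "answers": [{"text": "2008", "answer_start": CONTEXT.index("2008")}]},
                {"id": "q2", "question": "Who funds the crawl?",
                 "is_impossible": True, "answers": []},
            ],
        }],
    }],
}

results = list(iter_answers(SAMPLE))
for question, answer in results:
    print(question, "->", answer)
```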
PubMed Diabetes Dataset
The PubMed Diabetes dataset consists of 19,717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes. The citation network consists of 44,338 links. Each publication in the dataset is described by a TF-IDF weighted word vector from a dictionary of 500 unique words. The README file in the dataset provides more details.
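For readers unfamiliar with TF-IDF weighting, the Python sketch below computes one common textbook variant: term frequency (count divided by document length) multiplied by the log of the inverse document frequency. The exact weighting used to build this dataset is not stated here, so treat this as an illustration of the general technique; the tiny tokenized documents are invented.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up documents, already split into tokens.
docs = [
    ["insulin", "glucose", "insulin"],
    ["glucose", "diet"],
    ["glucose", "exercise"],
]
vectors = tfidf(docs)
for vec in vectors:
    print(vec)
```

A term that appears in every document, such as "glucose" here, gets weight zero, which is why TF-IDF vectors emphasize the words that distinguish one publication from the rest.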
Drug-Target Interaction Dataset
This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link
Pharmacogenomics Datasets
PharmGKB data and knowledge are available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; they just want to make sure that there is a correct understanding of their data before lots of resources are spent.
Pancreatic Cancer Organoid Profiling
The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request
Africa Soil Information Service (AfSIS) Soil Chemistry
This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://afsis/ --no-sign-request
Dataset for Affective States in E-Environments
DAiSEE is the first multi-label video classification dataset, comprising 9,068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels (very low, low, high, and very high) for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.
NatureServe Explorer Dataset
NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.
The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here
- ACLED (Armed Conflict Location & Event Data Project)
- Authoritarian Ruling Elites Database – The Authoritarian Ruling Elites […]
- Canadian Legal Information Institute
- Center for Systemic Peace Datasets – Conflict Trends, Polities, State Fragility, etc [fixme]
- Correlates of War Project
- Cryptome Conspiracy Theory Items
- Datacards [fixme]
- European Social Survey
- FBI Hate Crime 2013 – aggregated data
- Fragile States Index [fixme]
- GDELT Global Events Database
- General Social Survey (GSS) since 1972
- German Social Survey
- Global Religious Futures Project
- Gun Violence Data – A comprehensive, accessible database that contains […]
- Humanitarian Data Exchange
- INFORM Index for Risk Management
- Institute for Demographic Studies
- International Networks Archive
- International Social Survey Program ISSP
- International Studies Compendium Project
- James McGuire Cross National Data
- MIT Reality Mining Dataset
- MacroData Guide by Norsk samfunnsvitenskapelig datatjeneste
- Mass Mobilization Data Project – The Mass Mobilization (MM) data are an […]
- Microsoft Academic Knowledge Graph – The Microsoft Academic Knowledge […]
- Minnesota Population Center
- Notre Dame Global Adaptation Index (ND-GAIN)
- Open Crime and Policing Data in England, Wales and Northern Ireland
- OpenSanctions – A global database of persons and companies of political, […]
- Paul Hensel General International Data Page
- PewResearch Internet Survey Project
- PewResearch Society Data Collection
- Political Polarity Data [fixme]
- StackExchange Data Explorer
- Terrorism Research and Analysis Consortium
- Texas Inmates Executed Since 1984
- Titanic Survival Data Set
- UCB’s Archive of Social Science Data (D-Lab) [fixme]
- UCLA Social Sciences Data Archive
- UN Civil Society Database
- UPJOHN for Labor Employment Research
- Universities Worldwide
- Uppsala Conflict Data Program
- World Bank Open Data
- World Inequality Database – The World Inequality Database (WID.world) […]
- WorldPop project – Worldwide human population distributions
Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021
Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.
At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.
Source – Summary – Paper – IBM Blog
100 million protein structures Dataset by DeepMind
Here’s a good article about this topic
Google Dataset Search

Malware traffic dataset
Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.
Originator: ali_alwashali
Percent of “foreign-born” population in each US and EU state or country.
For the EU, “foreign-born” means being born outside of any of the EU countries. For the US, “foreign-born” means being born outside of any US state.
Author: Here
Examples of “foreign-born” in this context:
Person born in Spain and living in France is NOT “foreign-born”
Person born in Turkey and living in France is “foreign-born”
Person born in Florida and living in Texas is NOT “foreign-born”
Person born in Mexico and living in Texas is “foreign-born”
Person born in Florida and living in France is “foreign-born”
Person born in France and living in Florida is “foreign-born”
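The rule behind these examples can be written as a small check. This is only a sketch: the two sets below are tiny illustrative subsets rather than the full EU and US membership lists, and the function names are hypothetical.

```python
# Illustrative subsets only; a real analysis needs the full membership lists.
EU_COUNTRIES = {"France", "Spain"}
US_STATES = {"Florida", "Texas"}


def region_of(place):
    """Return "EU" or "US" for a known place, or None for anywhere else."""
    if place in EU_COUNTRIES:
        return "EU"
    if place in US_STATES:
        return "US"
    return None


def is_foreign_born(birthplace, residence):
    """True when the person was born outside the bloc (EU or US) they live in."""
    home = region_of(residence)
    if home is None:
        raise ValueError(f"{residence!r} is not an EU country or US state in this sketch")
    return region_of(birthplace) != home


# The six examples from the list above, as (birthplace, residence) pairs.
examples = [
    ("Spain", "France"),
    ("Turkey", "France"),
    ("Florida", "Texas"),
    ("Mexico", "Texas"),
    ("Florida", "France"),
    ("France", "Florida"),
]
for birthplace, residence in examples:
    label = "foreign-born" if is_foreign_born(birthplace, residence) else "NOT foreign-born"
    print(f"Born in {birthplace}, living in {residence}: {label}")
```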
🇺🇸🇪🇺🗺️
Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 migration data, and Croatia has no data at all.
Tools: MS Office
Source: Here
35% of “entry-level” jobs on LinkedIn require 3+ years of experience
Source: LinkedIn data (see original post)
Tool: Photoshop from my colleague
Latest complete Netflix movie dataset
Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)
Explore this dataset using FlixGem.com (this dataset is powering this webapp)
Common Crawl
A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.
AWS CLI Access (No AWS account required)
aws s3 ls s3://commoncrawl/ --no-sign-request
s3://commoncrawl/crawl-data/
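If you want to script the listing instead of typing it, here is a minimal Python sketch that wraps the same command. It assumes the AWS CLI is installed and on your PATH; the helper names `build_ls_command` and `list_public_bucket` are made up for this example, and whether a given bucket still allows unsigned access is something to check for yourself.

```python
import shutil
import subprocess


def build_ls_command(bucket, prefix=""):
    """Build the argument list for an unsigned `aws s3 ls` call."""
    return ["aws", "s3", "ls", f"s3://{bucket}/{prefix}", "--no-sign-request"]


def list_public_bucket(bucket, prefix=""):
    """Run the listing and return its output lines (needs the AWS CLI and network access)."""
    if shutil.which("aws") is None:
        raise RuntimeError("AWS CLI not found on PATH")
    result = subprocess.run(
        build_ls_command(bucket, prefix),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


# Show the command that would be run for the crawl-data/ prefix.
print(" ".join(build_ls_command("commoncrawl", "crawl-data/")))
```

The same helper works for the other open buckets on this page by swapping in the bucket name.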
Dataset on protein prices
Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.
CPOST dataset on suicide attacks over four decades
The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.
Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019
You can do a lot of aggregated analysis in a pretty straightforward way there.
Drone imagery with annotations for small object detection and tracking dataset
11 TB dataset of drone imagery with annotations for small object detection and tracking
Download and more information are available here
Dataset License: CDLA-Sharing-1.0
Helper scripts for accessing the dataset: DATASET.md
Dataset Exploration: Colab
NOAA High-Resolution Rapid Refresh (HRRR) Model
The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.
Registry of Open Data on AWS
This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.
See all usage examples for datasets listed in this registry.
See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.
Textbook Question Answering (TQA)
1,076 textbook lessons, 26,260 questions, 6229 images
Documentation: allenai.org/data/tqa
Harmonized Cancer Datasets: Genomic Data Commons Data Portal
The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.
AWS CLI Access (No AWS account required)
aws s3 ls s3://tcga-2-open/ --no-sign-request
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.
Genome Aggregation Database (gnomAD)
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads
SQuAD (Stanford Question Answering Dataset)
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
PubMed Diabetes Dataset
The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.
Drug-Target Interaction Dataset
This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link
Pharmacogenomics Datasets
PharmGKB data and knowledge are available as downloads. It is often critical to check with the curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; the curators just want to make sure that their data is correctly understood before lots of resources are spent.
Pancreatic Cancer Organoid Profiling
The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request
Africa Soil Information Service (AfSIS) Soil Chemistry
This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://afsis/ --no-sign-request
Dataset for Affective States in E-Environments
DAiSEE is the first multi-label video classification dataset, comprising 9,068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels (very low, low, high, and very high) for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.
NatureServe Explorer Dataset
NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.
The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here
Flight Records in the US
Airline On-Time Performance and Causes of Flight Delays – On_Time Data.
This database contains scheduled and actual departure and arrival times and reasons for delay, as reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).
FlightAware.com has data but you need to pay for a full dataset.
The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
- flights: all flights that departed a given airport in a given year and month
- weather: hourly meteorological data for a given airport in a given year and month
- airports: airport names, FAA codes, and locations
- airlines: translation between two letter carrier (airline) codes and names
- planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here
Worldwide flight data
Download: airports.dat (Airports only, high quality)
Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)
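Once downloaded, the file can be read with Python's csv module. This is a sketch under assumptions: it supposes the file uses the OpenFlights layout (comma-separated, no header row, `\N` for missing values, 14 columns in the order below), and the column names are my own labels. The second sample row is made up to show the missing-value handling.

```python
import csv
import io

# Assumed column order (OpenFlights layout); the names are labels chosen for this sketch.
COLUMNS = [
    "airport_id", "name", "city", "country", "iata", "icao",
    "latitude", "longitude", "altitude_ft", "utc_offset",
    "dst", "tz", "type", "source",
]


def parse_airports(lines):
    """Parse airports.dat-style rows into dicts, mapping \\N to None."""
    rows = []
    for record in csv.reader(lines):
        if not record:
            continue  # skip blank lines
        row = {key: (None if value == r"\N" else value)
               for key, value in zip(COLUMNS, record)}
        for key in ("latitude", "longitude"):
            if row.get(key) is not None:
                row[key] = float(row[key])
        rows.append(row)
    return rows


# Inline sample so the sketch runs without the download; the second row is invented.
SAMPLE = r"""507,"London Heathrow Airport","London","United Kingdom","LHR","EGLL",51.4706,-0.461941,83,0,"E","Europe/London","airport","OurAirports"
9999,"Example Field","Nowhere","Exampleland",\N,"XXXX",1.5,2.5,10,\N,\N,\N,"airport","User"
"""

airports = parse_airports(io.StringIO(SAMPLE))
for airport in airports:
    print(airport["name"], airport["iata"], airport["latitude"], airport["longitude"])
```

For the real file, pass an open file handle instead of the sample, for example `open("airports.dat", encoding="utf-8", newline="")`.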
Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.
flightradar24.com has lots of data, including historical data, and they might be willing to help you get it in a nice format.
2019 Crime statistics in the USA
Dataset with arrest in US by race and separate states. Download Excel here
- FLOSSmole data about free, libre, and open source software development
- GHTorrent – Scalable, queryable, offline mirror of data offered through […]
- Libraries.io Open Source Repository and Dependency Metadata
- Public Git Archive – a Big Code dataset for all – dataset of 182,014 top- […]
- Code duplicates – 2k Java file and 600 Java function pairs labeled as […]
- Commit messages – 1.3 billion GitHub commit messages till March 2019
- Pull Request review comments – 25.3 million GitHub PR review comments […]
- Source Code Identifiers – 41.7 million distinct splittable identifiers […]
- American Ninja Warrior Obstacles – Contains every obstacle in the history […]
- Betfair Historical Exchange Data
- Cricsheet Matches (cricket)
- Equity in Athletics – The Equity in Athletics Data Analysis Cutting Tool […]
- Ergast Formula 1, from 1950 up to date (API)
- Football/Soccer resources (data and APIs)
- Lahman’s Baseball Database
- NFL play-by-play data – NFL play-by-play data sourced from: […]
- Pinhooker: Thoroughbred Bloodstock Sale Data
- Pro Kabadi season 1 to 7 – Pro Kabadi League is a professional-level […]
- Retrosheet Baseball Statistics
- Tennis database of rankings, results, and stats for ATP
- Tennis database of rankings, results, and stats for WTA
- USA Soccer Teams and Locations – USA soccer teams and locations. MLS, […]
Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021
Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.
At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.
Source – Summary – Paper – IBM Blog
100 million protein structures Dataset by DeepMind
Here’s a good article about this topic
Google Dataset Search

Malware traffic dataset
Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.
Originator: ali_alwashali
Percent of “foreign-born” population in each US and EU state or country.
For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺
Author: Here
Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.
Examples of “foreign-born” in this context:
Person born in Spain and living in France is NOT “foreign-born”
Person born in Turkey and living in France is “foreign-born”
Person born in Florida and living in Texas is NOT “foreign-born”
Person born in Mexico and living in Texas is “foreign-born”
Person born in Florida and living in France is “foreign-born”
Person born in France and living in Florida is “foreign-born”
🇺🇸🇪🇺🗺️
Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all
Tools: MS Office
Source: Here
35% of “entry-level” jobs on LinkedIn require 3+ years of experience
Source: LinkedIn data (see original post)
Tool: Photoshop from my colleague
Latest complete Netflix movie dataset
Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)
Explore this dataset using FlixGem.com (this dataset is powering this webapp)
Common Crawl
A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.
AWS CLI Access (No AWS account required)
aws s3 ls s3://commoncrawl/ --no-sign-request
s3://commoncrawl/crawl-data/
Dataset on protein prices
Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.
CPOST dataset on suicide attacks over four decades
The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.
Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019
You can do a lot of aggregated analysis in a pretty straightforward way there.
Drone imagery with annotations for small object detection and tracking dataset
11 TB dataset of drone imagery with annotations for small object detection and tracking
Download and more information are available here
Dataset License: CDLA-Sharing-1.0
Helper scripts for accessing the dataset: DATASET.md
Dataset Exploration: Colab
NOAA High-Resolution Rapid Refresh (HRRR) Model
The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.
Registry of Open Data on AWS
This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.
See all usage examples for datasets listed in this registry.
See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.
Textbook Question Answering (TQA)
1,076 textbook lessons, 26,260 questions, 6229 images
Documentation: allenai.org/data/tqa
Harmonized Cancer Datasets: Genomic Data Commons Data Portal
The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.
AWS CLI Access (No AWS account required)
aws s3 ls s3://tcga-2-open/ --no-sign-request
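For readers without the AWS CLI installed, the anonymous listing above can be approximated with the Python standard library alone. This is a minimal sketch, not an official client: it assumes the bucket allows public listing through the S3 REST `ListObjectsV2` request on the virtual-hosted endpoint `https://<bucket>.s3.amazonaws.com/`, and the helper names `listing_url`, `parse_listing`, and `list_bucket` are made up for illustration.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 list responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def listing_url(bucket, prefix="", max_keys=50):
    """Build an unsigned ListObjectsV2 URL (assumes the bucket is publicly listable)."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "delimiter": "/", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_text):
    """Return (sub-prefixes, object keys) found in a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_text)
    prefixes = [e.text for e in root.findall("s3:CommonPrefixes/s3:Prefix", S3_NS)]
    keys = [e.text for e in root.findall("s3:Contents/s3:Key", S3_NS)]
    return prefixes, keys


def list_bucket(bucket, prefix=""):
    """Fetch one page of the listing over HTTPS without credentials (needs network access)."""
    with urllib.request.urlopen(listing_url(bucket, prefix), timeout=30) as resp:
        return parse_listing(resp.read().decode("utf-8"))
```

Calling `list_bucket("tcga-2-open")` should return roughly what the CLI command prints for the top level of the bucket, provided the bucket still permits anonymous listing. Only the first page of results is fetched; pagination is left out to keep the sketch short.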
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.
Genome Aggregation Database (gnomAD)
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads
SQuAD (Stanford Question Answering Dataset)
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
PubMed Diabetes Dataset
The PubMed Diabetes dataset consists of 19,717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes. The citation network consists of 44,338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary of 500 unique words. The README file in the dataset provides more details.
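To make the "TF/IDF weighted word vector" idea concrete, here is a minimal sketch in plain Python. It uses one common TF-IDF variant (relative term frequency times the natural log of inverse document frequency); the exact weighting used to build the dataset is not stated here, so treat the formula as an assumption. The toy abstracts and the four-word vocabulary are made up for illustration and stand in for the dataset's 500-word dictionary.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocabulary):
    """Return one TF-IDF weighted vector per document, ordered like `vocabulary`."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each vocabulary word appears.
    df = {w: sum(1 for toks in tokenized if w in toks) for w in vocabulary}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        vec = []
        for w in vocabulary:
            tf = counts[w] / total
            idf = math.log(n_docs / df[w]) if df[w] else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors


# Toy abstracts (hypothetical, not taken from the dataset).
docs = [
    "insulin therapy in type one diabetes",
    "insulin resistance in type two diabetes",
    "diet and exercise",
]
vocabulary = ["insulin", "diabetes", "exercise", "type"]
vectors = tfidf_vectors(docs, vocabulary)
```

Each row of `vectors` plays the role of one publication's feature vector: words that are frequent in a document but rare across the collection get the largest weights, and words absent from a document get zero.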
Drug-Target Interaction Dataset
This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link
Pharmacogenomics Datasets
PharmGKB data and knowledge are available as downloads. It is often critical to check with the curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; the curators just want to make sure that the data is correctly understood before lots of resources are spent.
Pancreatic Cancer Organoid Profiling
The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request
Africa Soil Information Service (AfSIS) Soil Chemistry
This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://afsis/ --no-sign-request
Dataset for Affective States in E-Environments
DAiSEE is the first multi-label video classification dataset, comprising 9,068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels (very low, low, high, and very high) for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.
NatureServe Explorer Dataset
NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.
The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here
Flight Records in the US
Airline On-Time Performance and Causes of Flight Delays – On_Time Data.
This database contains scheduled and actual departure and arrival times and reasons for delay, as reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).
FlightAware.com has data but you need to pay for a full dataset.
The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
- flights: all flights that departed a given airport in a given year and month
- weather: hourly meteorological data for a given airport in a given year and month
- airports: airport names, FAA codes, and locations
- airlines: translation between two-letter carrier (airline) codes and names
- planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here
Worldwide flight data
Download: airports.dat (Airports only, high quality)
Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)
Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.
flightradar24.com has lots of data, including historical data; they might be willing to help you get it in a nice format.
2019 Crime statistics in the USA
Dataset of arrests in the US by race and by state. Download Excel here
- 3W dataset – To the best of its authors’ knowledge, this is the first […]
- Databanks International Cross National Time Series Data Archive
- Hard Drive Failure Rates
- Heart Rate Time Series from MIT
- Time Series Data Library (TSDL) from MU
- Turing Change Point Dataset – Contains 42 annotated time series collected […]
- UC Riverside Time Series Dataset
Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021
Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.
At the 2021 International Conference on Machine Learning (ICML), researchers from IBM, MIT, and Harvard University released a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They also released two machine learning models that represent different approaches to the problem. The work relies on testing techniques that psychologists use to study infants’ behavior, with the goal of accelerating the development of AI that exhibits common sense.
Source – Summary – Paper – IBM Blog
100 million protein structures Dataset by DeepMind
Here’s a good article about this topic
Google Dataset Search

Malware traffic dataset
Comprises 1,914,081 records created from all malware-traffic-analysis.net PCAP files from 2013 to 2021. The logs were generated using Suricata and Zeek.
Originator: ali_alwashali
Percent of “foreign-born” population in each US and EU state or country.
For the EU, “foreign-born” means being born outside of any of the EU countries. For the US, “foreign-born” means being born outside of any US state 🇺🇸🇪🇺
Author: Here
Examples of “foreign-born” in this context:
Person born in Spain and living in France is NOT “foreign-born”
Person born in Turkey and living in France is “foreign-born”
Person born in Florida and living in Texas is NOT “foreign-born”
Person born in Mexico and living in Texas is “foreign-born”
Person born in Florida and living in France is “foreign-born”
Person born in France and living in Florida is “foreign-born”
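The rule behind these examples can be written as a small function. This is only a sketch of the definition as stated above: a person is "foreign-born" when their birthplace lies outside the region (EU or US) they live in. The two lookup sets are deliberately tiny and hypothetical; a real analysis would need the full lists of EU member states and US states.

```python
# Illustrative, intentionally incomplete region sets (hypothetical subset).
EU_COUNTRIES = {"Spain", "France", "Germany", "Poland"}
US_STATES = {"Florida", "Texas"}


def region_of(place):
    """Map a country or state name to 'EU', 'US', or None if it is in neither set."""
    if place in EU_COUNTRIES:
        return "EU"
    if place in US_STATES:
        return "US"
    return None


def is_foreign_born(birthplace, residence):
    """Apply the map's definition: born outside the region one currently lives in."""
    home = region_of(residence)
    if home is None:
        raise ValueError(f"residence {residence!r} is neither an EU country nor a US state")
    return region_of(birthplace) != home


# The six cases listed above.
examples = [
    ("Spain", "France"),
    ("Turkey", "France"),
    ("Florida", "Texas"),
    ("Mexico", "Texas"),
    ("Florida", "France"),
    ("France", "Florida"),
]
results = [is_foreign_born(born, lives) for born, lives in examples]
```

With these inputs, `results` marks the first and third cases as not foreign-born and the other four as foreign-born, matching the list.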
🇺🇸🇪🇺🗺️
Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 Migration data, and Croatia has no data at all
Tools: MS Office
Source: Here
35% of “entry-level” jobs on LinkedIn require 3+ years of experience
Source: LinkedIn data (see original post)
Tool: Photoshop from my colleague
Latest complete Netflix movie dataset
Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)
Explore this dataset using FlixGem.com (this dataset is powering this webapp)
Common Crawl
A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.
AWS CLI Access (No AWS account required)
aws s3 ls s3://commoncrawl/ --no-sign-request
s3://commoncrawl/crawl-data/
Dataset on protein prices
Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.
CPOST dataset on suicide attacks over four decades
The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.
- Airlines OD Data 1987-2008
- Ford GoBike Data (formerly Bay Area Bike Share Data) [fixme]
- Bike Share Systems (BSS) collection
- Dutch Traffic Information
- GeoLife GPS Trajectory from Microsoft Research
- German train system by Deutsche Bahn
- Hubway Million Rides in MA [fixme]
- Montreal BIXI Bike Share
- NYC Taxi Trip Data 2009-
- NYC Taxi Trip Data 2013 (FOIA/FOILed)
- NYC Uber trip data April 2014 to September 2014
- Open Traffic collection
- OpenFlights – airport, airline and route data
- Philadelphia Bike Share Stations (JSON)
- Plane Crash Database, since 1920
- RITA Airline On-Time Performance data [fixme]
- RITA/BTS transport data collection (TranStat) [fixme]
- Renfe (Spanish National Railway Network) dataset
- Toronto Bike Share Stations (JSON and GBFS files)
- Transport for London (TFL)
- Travel Tracker Survey (TTS) for Chicago [fixme]
- U.S. Bureau of Transportation Statistics (BTS)
- U.S. Domestic Flights 1990 to 2009
- U.S. Freight Analysis Framework since 2007
- U.S. National Highway Traffic Safety Administration – Fatalities since […]
Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019
You can do a lot of aggregated analysis in a pretty straightforward way there.
Drone imagery with annotations for small object detection and tracking dataset
11 TB dataset of drone imagery with annotations for small object detection and tracking
Download and more information are available here
Dataset License: CDLA-Sharing-1.0
Helper scripts for accessing the dataset: DATASET.md
Dataset Exploration: Colab
NOAA High-Resolution Rapid Refresh (HRRR) Model
The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.
Registry of Open Data on AWS
This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.
See all usage examples for datasets listed in this registry.
See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.
Textbook Question Answering (TQA)
1,076 textbook lessons, 26,260 questions, 6229 images
Documentation: allenai.org/data/tqa
Harmonized Cancer Datasets: Genomic Data Commons Data Portal
The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.
AWS CLI Access (No AWS account required)
aws s3 ls s3://tcga-2-open/ --no-sign-request
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.
Genome Aggregation Database (gnomAD)
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads
SQuAD (Stanford Question Answering Dataset)
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
PubMed Diabetes Dataset
The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.
Drug-Target Interaction Dataset
This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link
Pharmacogenomics Datasets
PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.
Pancreatic Cancer Organoid Profiling
The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request
Africa Soil Information Service (AfSIS) Soil Chemistry
This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://afsis/ --no-sign-request
Dataset for Affective States in E-Environments
DAiSEE is the first multi-label video classification dataset, comprising 9,068 video snippets captured from 112 users, for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels (very low, low, high, and very high) for each of the affective states; the labels are crowd-annotated and correlated with a gold-standard annotation created by a team of expert psychologists. Download it here.
NatureServe Explorer Dataset
NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.
The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here
Flight Records in the US
Airline On-Time Performance and Causes of Flight Delays – On_Time Data.
This database contains scheduled and actual departure and arrival times and reasons for delay, as reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).
FlightAware.com has data but you need to pay for a full dataset.
The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
- flights: all flights that departed a given airport in a given year and month
- weather: hourly meteorological data for a given airport in a given year and month
- airports: airport names, FAA codes, and locations
- airlines: translation between two-letter carrier (airline) codes and names
- planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here
Worldwide flight data
Download: airports.dat (Airports only, high quality)
Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)
Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.
flightradar24.com has lots of data, including historical data; they might be willing to help you get it in a nice format.
2019 Crime statistics in the USA
Dataset of arrests in the US by race and by state. Download Excel here
- CS:GO Competitive Matchmaking Data – In this data set we have data about […]
- FIFA-2021 Complete Player Dataset
- OpenDota data dump
Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021
Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.
At the 2021 International Conference on Machine Learning (ICML), researchers from IBM, MIT, and Harvard University came together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem and that rely on testing techniques psychologists use to study infants’ behavior, with the goal of accelerating the development of AI exhibiting common sense.
Source – Summary – Paper – IBM Blog
100 million protein structures Dataset by DeepMind
Here’s a good article about this topic
Google Dataset Search

Malware traffic dataset
Comprises 1,914,081 records created from all malware-traffic-analysis.net PCAP files from 2013 to 2021. The logs are generated using Suricata and Zeek.
Originator: ali_alwashali
Percent of “foreign-born” population in each US and EU state or country.
For the EU, “foreign-born” means being born outside of any of the EU countries. For the US, “foreign-born” means being born outside of any US state 🇺🇸🇪🇺
Author: Here
Examples of “foreign-born” in this context:
Person born in Spain and living in France is NOT “foreign-born”
Person born in Turkey and living in France is “foreign-born”
Person born in Florida and living in Texas is NOT “foreign-born”
Person born in Mexico and living in Texas is “foreign-born”
Person born in Florida and living in France is “foreign-born”
Person born in France and living in Florida is “foreign-born”
🇺🇸🇪🇺🗺️
Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 migration data, and Croatia has no data at all.
Tools: MS Office
Source: Here
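The “foreign-born” rule used for this map can be written as a small function, which makes the six examples above easy to check. This is only a sketch: the `UNIONS` table is a hypothetical, deliberately tiny subset containing just the places named in the examples, not a full list of EU countries or US states, and the function names are made up for illustration.

```python
# Illustrative subset only: just the places named in the examples above.
UNIONS = {
    "EU": {"Spain", "France"},
    "US": {"Florida", "Texas"},
}


def union_of(place):
    """Return the union ("EU" or "US") a place belongs to, or None if neither."""
    for name, members in UNIONS.items():
        if place in members:
            return name
    return None


def is_foreign_born(birthplace, residence):
    """A resident counts as foreign-born if born outside the union they live in."""
    home = union_of(residence)
    if home is None:
        raise ValueError(f"{residence!r} is not a known EU country or US state")
    return union_of(birthplace) != home


# (birthplace, residence) pairs taken from the examples in the text.
examples = [
    ("Spain", "France"),
    ("Turkey", "France"),
    ("Florida", "Texas"),
    ("Mexico", "Texas"),
    ("Florida", "France"),
    ("France", "Florida"),
]
results = [is_foreign_born(born, lives) for born, lives in examples]
```

The key design point is that the comparison happens at the level of the union, not the individual state or country, so moving between two EU countries or two US states never counts.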
35% of “entry-level” jobs on LinkedIn require 3+ years of experience
Source: LinkedIn data (see original post)
Tool: Photoshop from my colleague
Latest complete Netflix movie dataset
Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)
Explore this dataset using FlixGem.com (this dataset is powering this webapp)
Common Crawl
A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.
AWS CLI Access (No AWS account required)
aws s3 ls s3://commoncrawl/ --no-sign-request
s3://commoncrawl/crawl-data/
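If you want to script the listing instead of typing it, one option is to build the same `aws s3 ls ... --no-sign-request` command from Python. The sketch below only constructs and validates the command; the helper names `split_s3_uri` and `anonymous_ls_command` are hypothetical, and actually running the command assumes the AWS CLI is installed and on your PATH.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def anonymous_ls_command(uri):
    """Build the argument list for an unsigned `aws s3 ls` call."""
    split_s3_uri(uri)  # validate the URI before building the command
    return ["aws", "s3", "ls", uri, "--no-sign-request"]


bucket, prefix = split_s3_uri("s3://commoncrawl/crawl-data/")
command = anonymous_ls_command("s3://commoncrawl/crawl-data/")
```

To execute it, pass the list to `subprocess.run(command, check=True)`; using an argument list rather than a shell string avoids quoting problems.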
Dataset on protein prices
Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.
CPOST dataset on suicide attacks over four decades
The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.
Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019
You can do a lot of aggregated analysis in a pretty straightforward way there.
Drone imagery with annotations for small object detection and tracking dataset
11 TB dataset of drone imagery with annotations for small object detection and tracking
Download and more information are available here
Dataset License: CDLA-Sharing-1.0
Helper scripts for accessing the dataset: DATASET.md
Dataset Exploration: Colab
NOAA High-Resolution Rapid Refresh (HRRR) Model
The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.
Registry of Open Data on AWS
This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.
See all usage examples for datasets listed in this registry.
See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.
Textbook Question Answering (TQA)
1,076 textbook lessons, 26,260 questions, 6229 images
Documentation: allenai.org/data/tqa
Harmonized Cancer Datasets: Genomic Data Commons Data Portal
The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.
AWS CLI Access (No AWS account required)
aws s3 ls s3://tcga-2-open/ --no-sign-request
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.
Genome Aggregation Database (gnomAD)
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads
SQuAD (Stanford Question Answering Dataset)
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
- Data Packaged Core Datasets
- Database of Scientific Code Contributions
- A growing collection of public datasets: CoolDatasets.
- DataWrangling: Some Datasets Available on the Web
- Inside-r: Finding Data on the Internet
- OpenDataMonitor: An overview of available open data resources in Europe
- Quora: Where can I find large datasets open to the public?
- RS.io: 100+ Interesting Data Sets for Statistics
- StaTrek: Leveraging open data to understand urban lives
- CV Papers: CV Datasets on the web
- CVonline: Image Databases
Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021
Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.
At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.
Source – Summary – Paper – IBM Blog
100 million protein structures Dataset by DeepMind
Here’s a good article about this topic
Google Dataset Search

Malware traffic dataset
Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.
Originator: ali_alwashali
Percent of “foreign-born” population in each US and EU state or country.
For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺
Author: Here
Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.
Examples of “foreign-born” in this context:
Person born in Spain and living in France is NOT “foreign-born”
Person born in Turkey and living in France is “foreign-born”
Person born in Florida and living in Texas is NOT “foreign-born”
Person born in Mexico and living in Texas is “foreign-born”
Person born in Florida and living in France is “foreign-born”
Person born in France and living in Florida is “foreign-born”
🇺🇸🇪🇺🗺️
Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all
Tools: MS Office
Source: Here
35% of “entry-level” jobs on LinkedIn require 3+ years of experience
Source: LinkedIn data (see original post)
Tool: Photoshop from my colleague
Latest complete Netflix movie dataset
Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)
Explore this dataset using FlixGem.com (this dataset is powering this webapp)
Common Crawl
A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.
AWS CLI Access (No AWS account required)
aws s3 ls s3://commoncrawl/ --no-sign-request
s3://commoncrawl/crawl-data/
Dataset on protein prices
Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.
CPOST dataset on suicide attacks over four decades
The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.
Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019
You can do a lot of aggregated analysis in a pretty straightforward way there.
Drone imagery with annotations for small object detection and tracking dataset
11 TB dataset of drone imagery with annotations for small object detection and tracking
Download and more information are available here
Dataset License: CDLA-Sharing-1.0
Helper scripts for accessing the dataset: DATASET.md
Dataset Exploration: Colab
NOAA High-Resolution Rapid Refresh (HRRR) Model
The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.
Registry of Open Data on AWS
This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.
See all usage examples for datasets listed in this registry.
See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.
Textbook Question Answering (TQA)
1,076 textbook lessons, 26,260 questions, 6229 images
Documentation: allenai.org/data/tqa
Harmonized Cancer Datasets: Genomic Data Commons Data Portal
The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.
AWS CLI Access (No AWS account required)
aws s3 ls s3://tcga-2-open/ --no-sign-request
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.
Genome Aggregation Database (gnomAD)
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads
SQuAD (Stanford Question Answering Dataset)
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
PubMed Diabetes Dataset
The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.
Drug-Target Interaction Dataset
This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link
Pharmacogenomics Datasets
PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.
Pancreatic Cancer Organoid Profiling
The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request
Africa Soil Information Service (AfSIS) Soil Chemistry
This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://afsis/ --no-sign-request
Dataset for Affective States in E-Environments
DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.
NatureServe Explorer Dataset
NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.
The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here
Flight Records in the US
Airline On-Time Performance and Causes of Flight Delays – On_Time Data.
This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).
FlightAware.com has data but you need to pay for a full dataset.
The anyflights
package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13
. With a user-defined year and airport, the anyflights
function will grab data on:
flights
: all flights that departed a given airport in a given year and monthweather
: hourly meterological data for a given airport in a given year and monthairports
: airport names, FAA codes, and locationsairlines
: translation between two letter carrier (airline) codes and namesplanes
: construction information about each plane found inflights
Airline On-Time Statistics and Delay Causes
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here
Worldwide flight data
Download: airports.dat (Airports only, high quality)
Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)
Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.
flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.
2019 Crime statistics in the USA
Dataset with arrest in US by race and separate states. Download Excel here
Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021
Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.
At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.
Source – Summary – Paper – IBM Blog
100 million protein structures Dataset by DeepMind
Here’s a good article about this topic
Google Dataset Search

Malware traffic dataset
Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.
Originator: ali_alwashali
Percent of “foreign-born” population in each US and EU state or country.
For the EU, “foreign-born” means being born outside of any of the EU countries. For the US, “foreign-born” means being born outside of any US state. 🇺🇸🇪🇺
Author: Here
Examples of “foreign-born” in this context:
Person born in Spain and living in France is NOT “foreign-born”
Person born in Turkey and living in France is “foreign-born”
Person born in Florida and living in Texas is NOT “foreign-born”
Person born in Mexico and living in Texas is “foreign-born”
Person born in Florida and living in France is “foreign-born”
Person born in France and living in Florida is “foreign-born”
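The rule behind these examples can be sketched as a small function: a person counts as “foreign-born” when their birthplace lies outside the union (EU or US) they now live in. This is a minimal sketch, not the chart author's method; the membership sets below are deliberately incomplete and the names `region_of` and `is_foreign_born` are hypothetical.

```python
# Illustrative, deliberately incomplete membership sets (assumptions for this sketch).
EU_COUNTRIES = {"Spain", "France", "Germany", "Poland", "Ireland"}
US_STATES = {"Florida", "Texas", "California"}


def region_of(place):
    """Return "EU", "US", or None for places outside both unions."""
    if place in EU_COUNTRIES:
        return "EU"
    if place in US_STATES:
        return "US"
    return None


def is_foreign_born(birthplace, residence):
    """True when the person was born outside the union they now live in."""
    home = region_of(residence)
    if home is None:
        raise ValueError(f"{residence!r} is not a known US state or EU country")
    return region_of(birthplace) != home


# The six cases listed in the text.
examples = [
    ("Spain", "France"),
    ("Turkey", "France"),
    ("Florida", "Texas"),
    ("Mexico", "Texas"),
    ("Florida", "France"),
    ("France", "Florida"),
]
for born, lives in examples:
    label = "foreign-born" if is_foreign_born(born, lives) else "NOT foreign-born"
    print(f"born in {born}, living in {lives}: {label}")
```

Comparing unions rather than individual countries or states is what makes a move from Spain to France, or from Florida to Texas, not count.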
🇺🇸🇪🇺🗺️
Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 migration data, and Croatia has no data at all.
Tools: MS Office
Source: Here
35% of “entry-level” jobs on LinkedIn require 3+ years of experience
Source: LinkedIn data (see original post)
Tool: Photoshop from my colleague
Latest complete Netflix movie dataset
Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)
Explore this dataset using FlixGem.com (this dataset is powering this webapp)
Common Crawl
A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.
AWS CLI Access (No AWS account required)
aws s3 ls s3://commoncrawl/ --no-sign-request
s3://commoncrawl/crawl-data/
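The unsigned listing that `--no-sign-request` performs can also be sketched in Python with only the standard library, by calling the S3 REST listing endpoint directly. This is a rough sketch under several assumptions: the bucket still permits anonymous listing (a bucket owner can change that at any time), the bucket is reachable at the default `s3.amazonaws.com` endpoint, and the helper names `list_url`, `parse_listing`, and `list_public_bucket` are hypothetical. The sample XML is a hand-written stand-in for a real response.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 listing responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def list_url(bucket, prefix=""):
    """Build an unsigned ListObjectsV2 URL for a publicly listable bucket."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "delimiter": "/"}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_data):
    """Return (sub-prefixes, object keys) from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_data)
    prefixes = [
        node.findtext(f"{S3_NS}Prefix")
        for node in root.findall(f"{S3_NS}CommonPrefixes")
    ]
    keys = [
        node.findtext(f"{S3_NS}Key")
        for node in root.findall(f"{S3_NS}Contents")
    ]
    return prefixes, keys


def list_public_bucket(bucket, prefix=""):
    """Fetch one page of a listing anonymously (performs a network request)."""
    with urllib.request.urlopen(list_url(bucket, prefix), timeout=30) as resp:
        return parse_listing(resp.read())


# Offline demonstration on a hand-written sample response.
SAMPLE = (
    '<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">'
    "<Name>example-bucket</Name>"
    "<CommonPrefixes><Prefix>crawl-data/</Prefix></CommonPrefixes>"
    "<Contents><Key>robots.txt</Key></Contents>"
    "</ListBucketResult>"
)
print(parse_listing(SAMPLE))
```

Calling `list_public_bucket("commoncrawl", "crawl-data/")` would correspond to the CLI command above if anonymous listing is allowed. Only the first page of results is returned; pagination with continuation tokens is left out of this sketch.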
Dataset on protein prices
Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.
CPOST dataset on suicide attacks over four decades
The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.
Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019
You can do a lot of aggregated analysis in a pretty straightforward way there.
Drone imagery with annotations for small object detection and tracking dataset
11 TB dataset of drone imagery with annotations for small object detection and tracking
Download and more information are available here
Dataset License: CDLA-Sharing-1.0
Helper scripts for accessing the dataset: DATASET.md
Dataset Exploration: Colab
NOAA High-Resolution Rapid Refresh (HRRR) Model
The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.
Registry of Open Data on AWS
This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.
See all usage examples for datasets listed in this registry.
See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.
Textbook Question Answering (TQA)
1,076 textbook lessons, 26,260 questions, 6229 images
Documentation: allenai.org/data/tqa
Harmonized Cancer Datasets: Genomic Data Commons Data Portal
The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.
AWS CLI Access (No AWS account required)
aws s3 ls s3://tcga-2-open/ --no-sign-request
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.
Genome Aggregation Database (gnomAD)
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads
SQuAD (Stanford Question Answering Dataset)
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
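To make the "answer is a span of the passage" idea concrete, here is a minimal sketch of recovering an answer from a passage by character offset. The record is a made-up example, and the field names (`context`, `qas`, `answers`, `answer_start`, `is_impossible`) follow the SQuAD 2.0 JSON layout as I understand it, so treat them as assumptions and check the dataset documentation before relying on them.

```python
def build_record():
    """Build a hypothetical SQuAD-style paragraph with two questions."""
    context = "The Common Crawl corpus contains petabytes of data collected since 2008."
    answer = "2008"
    return {
        "context": context,
        "qas": [
            {
                "id": "q1",
                "question": "Since when has the corpus been collected?",
                "is_impossible": False,
                # The answer is stored as text plus its character offset.
                "answers": [{"text": answer, "answer_start": context.index(answer)}],
            },
            {
                "id": "q2",
                "question": "Who funds the corpus?",
                "is_impossible": True,  # unanswerable from this passage
                "answers": [],
            },
        ],
    }


def extract_answer(context, qa):
    """Return the answer span from the passage, or None if unanswerable."""
    if qa.get("is_impossible") or not qa["answers"]:
        return None
    first = qa["answers"][0]
    start = first["answer_start"]
    span = context[start:start + len(first["text"])]
    if span != first["text"]:
        raise ValueError(f"offset mismatch for question {qa['id']}")
    return span


record = build_record()
for qa in record["qas"]:
    print(qa["id"], extract_answer(record["context"], qa))
```

Checking that the slice at `answer_start` equals the stored answer text is a cheap sanity test worth running after any preprocessing that might shift character offsets.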
PubMed Diabetes Dataset
The PubMed Diabetes dataset consists of 19,717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes. The citation network consists of 44,338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.
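As a rough illustration of what a TF/IDF weighted word vector over a fixed dictionary looks like, the sketch below computes one for a toy corpus. The exact weighting used to build the dataset is not stated here, so the formula (term frequency times the natural log of N over document frequency), the whitespace tokenization, and the function name `tfidf_vectors` are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocabulary):
    """Return one TF-IDF vector per document, ordered like `vocabulary`."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    vocab_set = set(vocabulary)

    # Document frequency: in how many documents each dictionary word appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens) & vocab_set)

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vector = []
        for word in vocabulary:
            if doc_freq[word] == 0:
                vector.append(0.0)
                continue
            tf = counts[word] / total
            idf = math.log(n_docs / doc_freq[word])
            vector.append(tf * idf)
        vectors.append(vector)
    return vectors


# Toy corpus and a three-word dictionary (the real dataset uses 500 words).
docs = ["insulin therapy insulin", "glucose therapy", "retina glucose"]
vocabulary = ["insulin", "glucose", "therapy"]
for doc, vector in zip(docs, tfidf_vectors(docs, vocabulary)):
    print(doc, [round(x, 3) for x in vector])
```

Words that occur in few documents get larger weights, and a word that occurs in every document gets weight zero under this formula.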
Drug-Target Interaction Dataset
This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link
Pharmacogenomics Datasets
PharmGKB data and knowledge are available as downloads. It is often critical to check with the curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; the curators just want to make sure that the data are correctly understood before lots of resources are spent.
Pancreatic Cancer Organoid Profiling
The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request
Africa Soil Information Service (AfSIS) Soil Chemistry
This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://afsis/ --no-sign-request
Dataset for Affective States in E-Environments
DAiSEE is the first multi-label video classification dataset, comprising 9,068 video snippets captured from 112 users, for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels (very low, low, high, and very high) for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.
NatureServe Explorer Dataset
NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.
The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here
Flight Records in the US
Airline On-Time Performance and Causes of Flight Delays – On_Time Data.
This database contains scheduled and actual departure and arrival times and reasons for delay, as reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).
FlightAware.com has data but you need to pay for a full dataset.
The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
- flights: all flights that departed a given airport in a given year and month
- weather: hourly meteorological data for a given airport in a given year and month
- airports: airport names, FAA codes, and locations
- airlines: translation between two-letter carrier (airline) codes and names
- planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here
Worldwide flight data
Download: airports.dat (Airports only, high quality)
Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)
Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.
flightradar24.com has a lot of data, including historical records; they might be willing to help you get it in a convenient format.
2019 Crime statistics in the USA
Dataset of arrests in the US by race and by state. Download Excel here
- Node.js – Async non-blocking event-driven JavaScript runtime built on Chrome’s V8 JavaScript engine.
- Cross-Platform – Writing cross-platform code on Node.js.
- Frontend Development
- iOS – Mobile operating system for Apple phones and tablets.
- Android – Mobile operating system developed by Google.
- IoT & Hybrid Apps
- Electron – Cross-platform native desktop apps using JavaScript/HTML/CSS.
- Cordova – JavaScript API for hybrid apps.
- React Native – JavaScript framework for writing natively rendering mobile apps for iOS and Android.
- Xamarin – Mobile app development IDE, testing, and distribution.
- Linux
- Containers
- eBPF – Virtual machine that allows you to write more efficient and powerful tracing and monitoring for Linux systems.
- Arch-based Projects – Linux distributions and projects based on Arch Linux.
- macOS – Operating system for Apple’s Mac computers.
- watchOS – Operating system for the Apple Watch.
- JVM
- Salesforce
- Amazon Web Services
- Windows
- IPFS – P2P hypermedia protocol.
- Fuse – Mobile development tools.
- Heroku – Cloud platform as a service.
- Raspberry Pi – Credit card-sized computer aimed at teaching kids programming, but capable of a lot more.
- Qt – Cross-platform GUI app framework.
- WebExtensions – Cross-browser extension system.
- RubyMotion – Write cross-platform native apps for iOS, Android, macOS, tvOS, and watchOS in Ruby.
- Smart TV – Create apps for different TV platforms.
- GNOME – Simple and distraction-free desktop environment for Linux.
- KDE – A free software community dedicated to creating an open and user-friendly computing experience.
- .NET
- Amazon Alexa – Virtual home assistant.
- DigitalOcean – Cloud computing platform designed for developers.
- Flutter – Google’s mobile SDK for building native iOS and Android apps from a single codebase written in Dart.
- Home Assistant – Open source home automation that puts local control and privacy first.
- IBM Cloud – Cloud platform for developers and companies.
- Firebase – App development platform built on Google Cloud Platform.
- Robot Operating System 2.0 – Set of software libraries and tools that help you build robot apps.
- Adafruit IO – Visualize and store data from any device.
- Cloudflare – CDN, DNS, DDoS protection, and security for your site.
- Actions on Google – Developer platform for Google Assistant.
- ESP – Low-cost microcontrollers with WiFi and broad IoT applications.
- Deno – A secure runtime for JavaScript and TypeScript that uses V8 and is built in Rust.
- DOS – Operating system for x86-based personal computers that was popular during the 1980s and early 1990s.
- Nix – Package manager for Linux and other Unix systems that makes package management reliable and reproducible.
- JavaScript
- Promises
- Standard Style – Style guide and linter.
- Must Watch Talks
- Tips
- Network Layer
- Micro npm Packages
- Mad Science npm Packages – Impossible sounding projects that exist.
- Maintenance Modules – For npm packages.
- npm – Package manager.
- AVA – Test runner.
- ESLint – Linter.
- Functional Programming
- Observables
- npm scripts – Task runner.
- 30 Seconds of Code – Code snippets you can understand in 30 seconds.
- Ponyfills – Like polyfills but without overriding native APIs.
- Swift – Apple’s compiled programming language that is secure, modern, programmer-friendly, and fast.
- Python – General-purpose programming language designed for readability.
- Asyncio – Asynchronous I/O in Python 3.
- Scientific Audio – Scientific research in audio/music.
- CircuitPython – A version of Python for microcontrollers.
- Data Science – Data analysis and machine learning.
- Typing – Optional static typing for Python.
- MicroPython – A lean and efficient implementation of Python 3 for microcontrollers.
- Rust
- Haskell
- PureScript
- Go
- Scala
- Scala Native – Optimizing ahead-of-time compiler for Scala based on LLVM.
- Ruby
- Clojure
- ClojureScript
- Elixir
- Elm
- Erlang
- Julia – High-level dynamic programming language designed to address the needs of high-performance numerical analysis and computational science.
- Lua
- C
- C/C++ – General-purpose language with a bias toward system programming and embedded, resource-constrained software.
- R – Functional programming language and environment for statistical computing and graphics.
- D
- Common Lisp – Powerful dynamic multiparadigm language that facilitates iterative and interactive development.
- Perl
- Groovy
- Dart
- Java – Popular secure object-oriented language designed for flexibility to “write once, run anywhere”.
- Kotlin
- OCaml
- ColdFusion
- Fortran
- PHP – Server-side scripting language.
- Composer – Package manager.
- Pascal
- AutoHotkey
- AutoIt
- Crystal
- Frege – Haskell for the JVM.
- CMake – Build, test, and package software.
- ActionScript 3 – Object-oriented language targeting Adobe AIR.
- Eta – Functional programming language for the JVM.
- Idris – General purpose pure functional programming language with dependent types influenced by Haskell and ML.
- Ada/SPARK – Modern programming language designed for large, long-lived apps where reliability and efficiency are essential.
- Q# – Domain-specific programming language used for expressing quantum algorithms.
- Imba – Programming language inspired by Ruby and Python and compiles to performant JavaScript.
- Vala – Programming language designed to take full advantage of the GLib and GNOME ecosystems, while preserving the speed of C code.
- Coq – Formal language and environment for programming and specification which facilitates interactive development of machine-checked proofs.
- V – Simple, fast, safe, compiled language for developing maintainable software.
Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021
Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.
At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.
Source – Summary – Paper – IBM Blog
100 million protein structures Dataset by DeepMind
Here’s a good article about this topic
Google Dataset Search

Malware traffic dataset
Comprises 1914081 records created from all malware-traffic-analysis.net PCAP files from 2013 to 2021. The logs were generated using Suricata and Zeek.
Originator: ali_alwashali
Percent of “foreign-born” population in each US and EU state or country.
For the EU, “foreign-born” means being born outside of any of the EU countries. For the US, “foreign-born” means being born outside of any US state 🇺🇸🇪🇺
Author: Here
Examples of “foreign-born” in this context:
Person born in Spain and living in France is NOT “foreign-born”
Person born in Turkey and living in France is “foreign-born”
Person born in Florida and living in Texas is NOT “foreign-born”
Person born in Mexico and living in Texas is “foreign-born”
Person born in Florida and living in France is “foreign-born”
Person born in France and living in Florida is “foreign-born”
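The rule behind these examples can be written down directly. The following is a minimal sketch in Python, assuming small, deliberately incomplete sets of EU countries and US states; the names `EU_COUNTRIES`, `US_STATES`, `region_of`, and `is_foreign_born` are hypothetical and not part of the original dataset.

```python
# Illustrative, deliberately incomplete region sets (assumptions for this sketch).
EU_COUNTRIES = {"Spain", "France", "Germany", "Poland"}
US_STATES = {"Florida", "Texas", "California"}


def region_of(place: str) -> str:
    """Return 'EU', 'US', or 'OTHER' for a country or state name."""
    if place in EU_COUNTRIES:
        return "EU"
    if place in US_STATES:
        return "US"
    return "OTHER"


def is_foreign_born(born_in: str, living_in: str) -> bool:
    """A person is 'foreign-born' when the birthplace lies outside the
    region (EU or US) the person currently lives in."""
    residence = region_of(living_in)
    if residence == "OTHER":
        raise ValueError(f"{living_in!r} is not an EU country or US state in this sketch")
    return region_of(born_in) != residence
```

For instance, `is_foreign_born("Spain", "France")` is `False`, while `is_foreign_born("Florida", "France")` is `True`, matching the examples listed above.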
🇺🇸🇪🇺🗺️
Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 Migration data, and Croatia has no data at all
Tools: MS Office
Source: Here
35% of “entry-level” jobs on LinkedIn require 3+ years of experience
Source: LinkedIn data (see original post)
Tool: Photoshop from my colleague
Latest complete Netflix movie dataset
Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)
Explore this dataset using FlixGem.com (this dataset is powering this webapp)
Common Crawl
A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.
AWS CLI Access (No AWS account required)
aws s3 ls s3://commoncrawl/ --no-sign-request
s3://commoncrawl/crawl-data/
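The anonymous listing command above follows a fixed pattern, so it can be wrapped for scripting. This is a rough sketch, assuming the AWS CLI is installed and on the PATH; the helper names `build_ls_command` and `list_public_bucket` are hypothetical, and only the `aws s3 ls ... --no-sign-request` form shown in the text is used.

```python
import subprocess


def build_ls_command(bucket: str, prefix: str = "") -> list[str]:
    """Build the argv for an anonymous `aws s3 ls` call on a public bucket."""
    uri = f"s3://{bucket}/{prefix.lstrip('/')}"
    return ["aws", "s3", "ls", uri, "--no-sign-request"]


def list_public_bucket(bucket: str, prefix: str = "") -> str:
    """Run the command and return its text output.

    Needs the AWS CLI and network access; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        build_ls_command(bucket, prefix),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `build_ls_command("commoncrawl", "crawl-data/")` yields the argument list for listing `s3://commoncrawl/crawl-data/`. Passing the arguments as a list rather than a shell string avoids quoting problems with unusual prefixes.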
Dataset on protein prices
Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.
CPOST dataset on suicide attacks over four decades
The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.
Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019
You can do a lot of aggregated analysis in a pretty straightforward way there.
Drone imagery with annotations for small object detection and tracking dataset
11 TB dataset of drone imagery with annotations for small object detection and tracking
Download and more information are available here
Dataset License: CDLA-Sharing-1.0
Helper scripts for accessing the dataset: DATASET.md
Dataset Exploration: Colab
NOAA High-Resolution Rapid Refresh (HRRR) Model
The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.
Registry of Open Data on AWS
This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.
See all usage examples for datasets listed in this registry.
See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.
Textbook Question Answering (TQA)
1,076 textbook lessons, 26,260 questions, 6229 images
Documentation: allenai.org/data/tqa
Harmonized Cancer Datasets: Genomic Data Commons Data Portal
The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.
AWS CLI Access (No AWS account required)
aws s3 ls s3://tcga-2-open/ --no-sign-request
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.
Genome Aggregation Database (gnomAD)
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads
SQuAD (Stanford Question Answering Dataset)
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
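To make the "answer is a span" idea concrete, here is a small sketch in Python. The record is hand-written for illustration (hypothetical content), and the field names `context`, `qas`, `answers`, `answer_start`, and `is_impossible` are assumed to match the layout of the published SQuAD JSON files; the helper `extract_answers` is hypothetical.

```python
# A hand-written record in a SQuAD-like JSON layout (hypothetical content).
context = "The Common Crawl corpus contains petabytes of data collected since 2008."
record = {
    "context": context,
    "qas": [
        {
            "id": "example-1",
            "question": "Since when has the corpus been collected?",
            "is_impossible": False,
            "answers": [{"text": "2008", "answer_start": context.index("2008")}],
        },
        {
            "id": "example-2",
            "question": "Who funds the corpus?",
            "is_impossible": True,  # unanswerable from this passage
            "answers": [],
        },
    ],
}


def extract_answers(paragraph: dict) -> dict:
    """Map question id -> answer span sliced from the context, or None if unanswerable."""
    out = {}
    for qa in paragraph["qas"]:
        if qa.get("is_impossible") or not qa["answers"]:
            out[qa["id"]] = None
            continue
        answer = qa["answers"][0]
        start = answer["answer_start"]
        # The answer is recovered purely by character offsets into the passage.
        out[qa["id"]] = paragraph["context"][start:start + len(answer["text"])]
    return out


answers = extract_answers(record)
```

Slicing by `answer_start` rather than searching for the answer text keeps the result unambiguous when the same words occur more than once in a passage.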
PubMed Diabetes Dataset
The PubMed Diabetes dataset consists of 19717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.
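For readers unfamiliar with TF/IDF weighting, the sketch below shows one common variant in plain Python. It is an illustration only: the exact weighting used to build the dataset is not stated here, the toy documents are invented, and the function name `tfidf_vectors` is hypothetical. Documents are assumed to be non-empty lists of tokens.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[list[str]]) -> tuple[list[str], list[list[float]]]:
    """Return (vocabulary, one TF-IDF vector per document).

    tf  = term count / document length
    idf = log(N / number of documents containing the term)
    """
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Count each word once per document to get document frequencies.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            (counts[w] / len(doc)) * math.log(n_docs / doc_freq[w])
            for w in vocab
        ])
    return vocab, vectors


# Toy corpus (invented tokens, not taken from the dataset).
docs = [
    ["insulin", "glucose", "insulin"],
    ["glucose", "retina"],
    ["glucose", "kidney"],
]
vocab, vectors = tfidf_vectors(docs)
```

In this toy corpus "glucose" appears in every document, so its weight is zero everywhere, while "insulin" gets a positive weight only in the first document. That is the property that makes such vectors useful as classification features.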
Drug-Target Interaction Dataset
This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link
Pharmacogenomics Datasets
PharmGKB data and knowledge are available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; they just want to make sure that their data is correctly understood before lots of resources are spent.
Pancreatic Cancer Organoid Profiling
The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request
Africa Soil Information Service (AfSIS) Soil Chemistry
This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://afsis/ --no-sign-request
Dataset for Affective States in E-Environments
DAiSEE is the first multi-label video classification dataset, comprising 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels (very low, low, high, and very high) for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.
NatureServe Explorer Dataset
NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.
The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here
Flight Records in the US
Airline On-Time Performance and Causes of Flight Delays – On_Time Data.
This database contains scheduled and actual departure and arrival times and reasons for delay, as reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).
FlightAware.com has data but you need to pay for a full dataset.
The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
- flights: all flights that departed a given airport in a given year and month
- weather: hourly meteorological data for a given airport in a given year and month
- airports: airport names, FAA codes, and locations
- airlines: translation between two-letter carrier (airline) codes and names
- planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here
Worldwide flight data
Download: airports.dat (Airports only, high quality)
Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)
Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.
flightradar24.com has lots of data, including historical data; they might be willing to help you get it in a nice format.
2019 Crime statistics in the USA
Dataset of arrests in the US by race and by state. Download Excel here
- ES6 Tools
- Web Performance Optimization
- Web Tools
- CSS – Style sheet language that specifies how HTML elements are displayed on screen.
- React – App framework.
- Relay – Framework for building data-driven React apps.
- React Hooks – A new feature that lets you use state and other React features without writing a class.
- Web Components
- Polymer – JavaScript library to develop Web Components.
- Angular – App framework.
- Backbone – App framework.
- HTML5 – Markup language used for websites & web apps.
- SVG – XML-based vector image format.
- Canvas
- KnockoutJS – JavaScript library.
- Dojo Toolkit – JavaScript toolkit.
- Inspiration
- Ember – App framework.
- Android UI
- iOS UI
- Meteor
- BEM
- Flexbox
- Web Typography
- Web Accessibility
- Material Design
- D3 – Library for producing dynamic, interactive data visualizations.
- Emails
- jQuery – Easy to use JavaScript library for DOM manipulation.
- Web Audio
- Offline-First
- Static Website Services
- Cycle.js – Functional and reactive JavaScript framework.
- Text Editing
- Motion UI Design
- Vue.js – App framework.
- Marionette.js – App framework.
- Aurelia – App framework.
- Charting
- Ionic Framework 2
- Chrome DevTools
- PostCSS – CSS tool.
- Draft.js – Rich text editor framework for React.
- Service Workers
- Progressive Web Apps
- choo – App framework.
- Redux – State container for JavaScript apps.
- webpack – Module bundler.
- Browserify – Module bundler.
- Sass – CSS preprocessor.
- Ant Design – Enterprise-class UI design language.
- Less – CSS preprocessor.
- WebGL – JavaScript API for rendering 3D graphics.
- Preact – App framework.
- Progressive Enhancement
- Next.js – Framework for server-rendered React apps.
- lit-html – HTML templating library for JavaScript.
- JAMstack – Modern web development architecture based on client-side JavaScript, reusable APIs, and prebuilt markup.
- WordPress-Gatsby – Web development technology stack with WordPress as a back end and Gatsby as a front end.
- Mobile Web Development – Creating a great mobile web experience.
- Storybook – Development environment for UI components.
- Blazor – .NET web framework using C#/Razor and HTML that runs in the browser with WebAssembly.
- PageSpeed Metrics – Metrics to help understand page speed and user experience.
- Tailwind CSS – Utility-first CSS framework for rapid UI development.
- Seed – Rust framework for creating web apps running in WebAssembly.
- Web Performance Budget – Techniques to ensure certain performance metrics for a website.
- Web Animation – Animations in the browser with JavaScript, CSS, SVG, etc.
- Yew – Rust framework inspired by Elm and React for creating multi-threaded frontend web apps with WebAssembly.
- Material-UI – Material Design React components for faster and easier web development.
- Building Blocks for Web Apps – Standalone features to be integrated into web apps.
- Svelte – App framework.
- Design systems – Collection of reusable components, guided by rules that ensure consistency and speed.
- Flask – Python framework.
- Docker
- Vagrant – Automation virtual machine environment.
- Pyramid – Python framework.
- Play1 Framework
- CakePHP – PHP framework.
- Symfony – PHP framework.
- Laravel – PHP framework.
- Education
- TALL Stack – Full-stack development solution featuring libraries built by the Laravel community.
- Rails – Web app framework for Ruby.
- Gems – Packages.
- Phalcon – PHP framework.
- Useful .htaccess Snippets
- nginx – Web server.
- Dropwizard – Java framework.
- Kubernetes – Open-source platform that automates Linux container operations.
- Lumen – PHP micro-framework.
- Serverless Framework – Serverless computing and serverless architectures.
- Apache Wicket – Java web app framework.
- Vert.x – Toolkit for building reactive apps on the JVM.
- Terraform – Tool for building, changing, and versioning infrastructure.
- Vapor – Server-side development in Swift.
- Dash – Python web app framework.
- FastAPI – Python web app framework.
- CDK – Open-source software development framework for defining cloud infrastructure in code.
- IAM – User accounts, authentication and authorization.
- Chalice – Python framework for serverless app development on AWS Lambda.
Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021
Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.
At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.
Source – Summary – Paper – IBM Blog
100 million protein structures Dataset by DeepMind
Here’s a good article about this topic
Google Dataset Search

Malware traffic dataset
Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.
Originator: ali_alwashali
Percent of “foreign-born” population in each US and EU state or country.
For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺
Author: Here
Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.
Examples of “foreign-born” in this context:
Person born in Spain and living in France is NOT “foreign-born”
Person born in Turkey and living in France is “foreign-born”
Person born in Florida and living in Texas is NOT “foreign-born”
Person born in Mexico and living in Texas is “foreign-born”
Person born in Florida and living in France is “foreign-born”
Person born in France and living in Florida is “foreign-born”
🇺🇸🇪🇺🗺️
Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all
Tools: MS Office
Source: Here
35% of “entry-level” jobs on LinkedIn require 3+ years of experience
Source: LinkedIn data (see original post)
Tool: Photoshop from my colleague
Latest complete Netflix movie dataset
Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)
Explore this dataset using FlixGem.com (this dataset is powering this webapp)
Common Crawl
A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.
AWS CLI Access (No AWS account required)
aws s3 ls s3://commoncrawl/ --no-sign-request
s3://commoncrawl/crawl-data/
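If you would rather drive the same anonymous listing from a script, here is a minimal Python sketch that simply wraps the CLI command shown above. It assumes the AWS CLI is installed and on your PATH; the helper names `build_ls_command` and `list_public_prefix` are made up for this example.

```python
import subprocess


def build_ls_command(bucket, prefix=""):
    """Build the argument list for an anonymous `aws s3 ls` call."""
    return ["aws", "s3", "ls", f"s3://{bucket}/{prefix}", "--no-sign-request"]


def list_public_prefix(bucket, prefix=""):
    """Run the listing and return its output lines.

    Assumes the AWS CLI is installed; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        build_ls_command(bucket, prefix),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


# Example (needs network access and the AWS CLI):
#   for line in list_public_prefix("commoncrawl", "crawl-data/"):
#       print(line)
```

The same helper should work for any other public bucket that allows unsigned requests; only the bucket name and prefix change.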
Dataset on protein prices
Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.
CPOST dataset on suicide attacks over four decades
The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.
Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019
You can do a lot of aggregated analysis in a pretty straightforward way there.
Drone imagery with annotations for small object detection and tracking dataset
11 TB dataset of drone imagery with annotations for small object detection and tracking
Download and more information are available here
Dataset License: CDLA-Sharing-1.0
Helper scripts for accessing the dataset: DATASET.md
Dataset Exploration: Colab
NOAA High-Resolution Rapid Refresh (HRRR) Model
The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.
Registry of Open Data on AWS
This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.
See all usage examples for datasets listed in this registry.
See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.
Textbook Question Answering (TQA)
1,076 textbook lessons, 26,260 questions, 6229 images
Documentation: allenai.org/data/tqa
Harmonized Cancer Datasets: Genomic Data Commons Data Portal
The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.
AWS CLI Access (No AWS account required)
aws s3 ls s3://tcga-2-open/ --no-sign-request
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.
Genome Aggregation Database (gnomAD)
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads
SQuAD (Stanford Question Answering Dataset)
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
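To make the "answer is a span of the passage" idea concrete, here is a small Python sketch. The record below is a toy example written for this post, not real SQuAD data; its field names (`context`, `qas`, `answers`, `answer_start`, `is_impossible`) follow the layout commonly seen in the SQuAD 2.0 JSON files, which you should verify against the files you download.

```python
# Toy paragraph for illustration only; it mimics the SQuAD 2.0 layout.
SAMPLE = {
    "context": "Common Crawl has collected web data since 2008.",
    "qas": [
        {
            "question": "Since when has Common Crawl collected data?",
            "is_impossible": False,
            "answers": [{"text": "2008", "answer_start": 42}],
        },
        {
            "question": "Who founded Common Crawl?",
            "is_impossible": True,
            "answers": [],
        },
    ],
}


def extract_answers(paragraph):
    """Map each question to its answer span, or None when unanswerable."""
    context = paragraph["context"]
    results = {}
    for qa in paragraph["qas"]:
        if qa.get("is_impossible") or not qa["answers"]:
            results[qa["question"]] = None
            continue
        answer = qa["answers"][0]
        start = answer["answer_start"]
        span = context[start:start + len(answer["text"])]
        # The stored text should match the slice of the passage exactly.
        if span != answer["text"]:
            raise ValueError("answer_start does not match the answer text")
        results[qa["question"]] = span
    return results


for question, answer in extract_answers(SAMPLE).items():
    print(question, "->", answer)
```

Checking that the character offset really reproduces the answer text is a cheap sanity test worth running on any span-based dataset before training on it.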
PubMed Diabetes Dataset
The PubMed Diabetes dataset consists of 19,717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes. The citation network consists of 44,338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.
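For readers who have not met TF/IDF weighting before, here is a rough pure-Python sketch of how such a vector can be built over a fixed dictionary. The toy documents and the four-word vocabulary are invented for illustration, and the weighting variant (term frequency times log of inverse document frequency) is an assumption; the dataset's README is the place to check the exact scheme it used.

```python
import math
from collections import Counter


def tfidf_vectors(documents, vocabulary):
    """Return one TF-IDF vector (list of floats) per tokenized document."""
    n_docs = len(documents)
    # Document frequency: how many documents contain each vocabulary word.
    doc_freq = {w: sum(1 for doc in documents if w in doc) for w in vocabulary}
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vec = []
        for w in vocabulary:
            tf = counts[w] / len(doc) if doc else 0.0
            idf = math.log(n_docs / doc_freq[w]) if doc_freq[w] else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors


# Invented toy corpus; the real dataset uses a 500-word dictionary.
docs = [
    ["insulin", "glucose", "insulin"],
    ["glucose", "diet"],
    ["insulin", "therapy"],
]
vocab = ["diet", "glucose", "insulin", "therapy"]
vectors = tfidf_vectors(docs, vocab)
for vec in vectors:
    print([round(x, 3) for x in vec])
```

Each row has one entry per dictionary word, so with the real 500-word dictionary every publication would become a 500-dimensional vector.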
Drug-Target Interaction Dataset
This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link
Pharmacogenomics Datasets
PharmGKB data and knowledge are available as downloads. It is often critical to check with the PharmGKB curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; the curators just want to make sure that the data are correctly understood before lots of resources are spent.
Pancreatic Cancer Organoid Profiling
The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request
Africa Soil Information Service (AfSIS) Soil Chemistry
This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://afsis/ --no-sign-request
Dataset for Affective States in E-Environments
DAiSEE is the first multi-label video classification dataset, comprising 9,068 video snippets captured from 112 users, for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels (very low, low, high, and very high) for each of the affective states, which are crowd-annotated and correlated with a gold standard annotation created by a team of expert psychologists. Download it here.
NatureServe Explorer Dataset
NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.
The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here
Flight Records in the US
Airline On-Time Performance and Causes of Flight Delays – On_Time Data.
This database contains scheduled and actual departure and arrival times and reasons for delay, as reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).
FlightAware.com has data but you need to pay for a full dataset.
The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
- flights: all flights that departed a given airport in a given year and month
- weather: hourly meteorological data for a given airport in a given year and month
- airports: airport names, FAA codes, and locations
- airlines: translation between two-letter carrier (airline) codes and names
- planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here
Worldwide flight data
Download: airports.dat (Airports only, high quality)
Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)
Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.
flightradar24.com has lots of data, including historical data, and they might be willing to help you get it in a nice format.
2019 Crime statistics in the USA
Dataset of arrests in the US, broken down by race and by state. Download Excel here
- University Courses
- Data Science
- Machine Learning
- Tutorials
- ML with Ruby – Learning, implementing, and applying Machine Learning using Ruby.
- Core ML Models – Models for Apple’s machine learning framework.
- H2O – Open source distributed machine learning platform written in Java with APIs in R, Python, and Scala.
- Software Engineering for Machine Learning – From experiment to production-level machine learning.
- AI in Finance – Solving problems in finance with machine learning.
- JAX – Automatic differentiation and XLA compilation brought together for high-performance machine learning research.
- Speech and Natural Language Processing
- Spanish
- NLP with Ruby
- Question Answering – The science of asking and answering in natural language with a machine.
- Natural Language Generation – Generation of text used in data to text, conversational agents, and narrative generation applications.
- Linguistics
- Cryptography
- Papers – Theory basics for using cryptography by non-cryptographers.
- Computer Vision
- Deep Learning – Neural networks.
- TensorFlow – Library for machine intelligence.
- TensorFlow.js – WebGL-accelerated machine learning JavaScript library for training and deploying models.
- TensorFlow Lite – Framework that optimizes TensorFlow models for on-device machine learning.
- Papers – The most cited deep learning papers.
- Education
- Deep Vision
- Open Source Society University
- Functional Programming
- Empirical Software Engineering – Evidence-based research on software systems.
- Static Analysis & Code Quality
- Information Retrieval – Learn to develop your own search engine.
- Quantum Computing – Computing which utilizes quantum mechanics and qubits on quantum computers.
- Big Data
- Public Datasets
- Hadoop – Framework for distributed storage and processing of very large data sets.
- Data Engineering
- Streaming
- Apache Spark – Unified engine for large-scale data processing.
- Qlik – Business intelligence platform for data visualization, analytics, and reporting apps.
- Splunk – Platform for searching, monitoring, and analyzing structured and unstructured machine-generated big data in real-time.
Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021
Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.
At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.
Source – Summary – Paper – IBM Blog
100 million protein structures Dataset by DeepMind
Here’s a good article about this topic
Google Dataset Search

Malware traffic dataset
Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.
Originator: ali_alwashali
Percent of “foreign-born” population in each US and EU state or country.
For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺
Author: Here
Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.
Examples of “foreign-born” in this context:
Person born in Spain and living in France is NOT “foreign-born”
Person born in Turkey and living in France is “foreign-born”
Person born in Florida and living in Texas is NOT “foreign-born”
Person born in Mexico and living in Texas is “foreign-born”
Person born in Florida and living in France is “foreign-born”
Person born in France and living in Florida is “foreign-born”
🇺🇸🇪🇺🗺️
Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all
Tools: MS Office
Source: Here
35% of “entry-level” jobs on LinkedIn require 3+ years of experience
Source: LinkedIn data (see original post)
Tool: Photoshop from my colleague
Latest complete Netflix movie dataset
Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)
Explore this dataset using FlixGem.com (this dataset is powering this webapp)
Common Crawl
A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.
AWS CLI Access (No AWS account required)
aws s3 ls s3://commoncrawl/ --no-sign-request
s3://commoncrawl/crawl-data/
Dataset on protein prices
Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.
CPOST dataset on suicide attacks over four decades
The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.
Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019
You can do a lot of aggregated analysis in a pretty straightforward way there.
Drone imagery with annotations for small object detection and tracking dataset
11 TB dataset of drone imagery with annotations for small object detection and tracking
Download and more information are available here
Dataset License: CDLA-Sharing-1.0
Helper scripts for accessing the dataset: DATASET.md
Dataset Exploration: Colab
NOAA High-Resolution Rapid Refresh (HRRR) Model
The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.
Registry of Open Data on AWS
This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.
See all usage examples for datasets listed in this registry.
See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.
Textbook Question Answering (TQA)
1,076 textbook lessons, 26,260 questions, 6229 images
Documentation: allenai.org/data/tqa
Harmonized Cancer Datasets: Genomic Data Commons Data Portal
The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.
AWS CLI Access (No AWS account required)
aws s3 ls s3://tcga-2-open/ --no-sign-request
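The same kind of unsigned listing can be sketched in Python with only the standard library. This is a minimal sketch, assuming the bucket allows anonymous listing and answers on the global `s3.amazonaws.com` virtual-hosted endpoint. The function names (`listing_url`, `parse_listing`, `list_public_bucket`) are made up for this illustration and are not part of any AWS tool.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 ListObjectsV2 responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def listing_url(bucket, prefix="", max_keys=20):
    """Build an unsigned ListObjectsV2 URL for a public bucket."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "delimiter": "/", "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_doc):
    """Return (prefixes, keys) from a ListBucketResult document (str or bytes)."""
    root = ET.fromstring(xml_doc)
    prefixes = [
        p.findtext("s3:Prefix", namespaces=S3_NS)
        for p in root.findall("s3:CommonPrefixes", S3_NS)
    ]
    keys = [
        c.findtext("s3:Key", namespaces=S3_NS)
        for c in root.findall("s3:Contents", S3_NS)
    ]
    return prefixes, keys


def list_public_bucket(bucket, prefix=""):
    """Fetch one page of a public bucket listing without credentials (needs network)."""
    with urllib.request.urlopen(listing_url(bucket, prefix), timeout=30) as resp:
        return parse_listing(resp.read())
```

Calling `list_public_bucket("tcga-2-open")` should return the top-level "folders" and objects, much like the CLI command above. Only the first page of results is read; pagination is left out to keep the sketch short.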
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.
Genome Aggregation Database (gnomAD)
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads
SQuAD (Stanford Question Answering Dataset)
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
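To make the "span" idea concrete, here is a small Python sketch that recovers answers from a SQuAD-style record. The record below is a toy example written for this illustration, not taken from the dataset. The field names follow the published SQuAD JSON layout (`context`, `qas`, `answers`, `answer_start`, and `is_impossible` in version 2.0).

```python
# Toy record in SQuAD-style layout (invented for illustration).
SAMPLE = {
    "data": [
        {
            "title": "Amazon_rainforest",
            "paragraphs": [
                {
                    "context": "The Amazon rainforest covers much of the Amazon basin of South America.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What does the Amazon rainforest cover?",
                            "is_impossible": False,
                            "answers": [{"text": "much of the Amazon basin", "answer_start": 29}],
                        },
                        {
                            "id": "q2",
                            "question": "Who discovered the Amazon basin?",
                            "is_impossible": True,
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ]
}


def iter_examples(squad):
    """Yield (question, answer_span) pairs; the span is None when unanswerable."""
    for article in squad["data"]:
        for para in article["paragraphs"]:
            context = para["context"]
            for qa in para["qas"]:
                if qa.get("is_impossible", False):
                    yield qa["question"], None
                    continue
                answer = qa["answers"][0]
                start = answer["answer_start"]
                # The answer is a character slice of the passage itself.
                yield qa["question"], context[start:start + len(answer["text"])]


for question, span in iter_examples(SAMPLE):
    print(question, "->", span)
```

Slicing the context rather than trusting the `text` field is a quick sanity check that the character offset and the answer string agree.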
PubMed Diabetes Dataset
The PubMed Diabetes dataset consists of 19,717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes. The citation network consists of 44,338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary of 500 unique words. The README file in the dataset provides more details.
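For readers unfamiliar with TF/IDF weighting, the Python sketch below builds such vectors for a tiny made-up corpus. It assumes one common variant (term frequency = count divided by document length, inverse document frequency = log(N / document frequency)). The exact weighting used to build the dataset is not stated here, so treat this as an illustration of the idea only.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocabulary):
    """Return one TF-IDF vector per tokenized document, ordered by `vocabulary`."""
    n_docs = len(docs)
    vocab_set = set(vocabulary)

    # Document frequency: in how many documents each vocabulary word appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc) & vocab_set)

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = []
        for word in vocabulary:
            tf = counts[word] / len(doc) if doc else 0.0
            idf = math.log(n_docs / df[word]) if df[word] else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors


# Hypothetical three-word dictionary and two tiny "publications".
vocabulary = ["insulin", "glucose", "retina"]
docs = [["insulin", "glucose", "insulin"], ["glucose", "retina"]]
vectors = tfidf_vectors(docs, vocabulary)
print(vectors)
```

A word that appears in every document ("glucose" here) gets weight zero under this variant, which is why TF/IDF vectors emphasize terms that distinguish one publication from another.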
Drug-Target Interaction Dataset
This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link
Pharmacogenomics Datasets
PharmGKB data and knowledge are available as downloads. It is often critical to check with the curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; the curators just want to make sure that their data is correctly understood before lots of resources are spent.
Pancreatic Cancer Organoid Profiling
The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request
Africa Soil Information Service (AfSIS) Soil Chemistry
This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://afsis/ --no-sign-request
Dataset for Affective States in E-Environments
DAiSEE is the first multi-label video classification dataset, comprising 9,068 video snippets captured from 112 users, for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. Each affective state has four label levels (very low, low, high, and very high), which are crowd annotated and correlated with a gold standard annotation created by a team of expert psychologists. Download it here.
NatureServe Explorer Dataset
NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.
The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here
Flight Records in the US
Airline On-Time Performance and Causes of Flight Delays – On_Time Data.
This database contains scheduled and actual departure and arrival times, and reasons for delay, reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).
FlightAware.com has data but you need to pay for a full dataset.
The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
- flights: all flights that departed a given airport in a given year and month
- weather: hourly meteorological data for a given airport in a given year and month
- airports: airport names, FAA codes, and locations
- airlines: translation between two-letter carrier (airline) codes and names
- planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here
Worldwide flight data
Download: airports.dat (Airports only, high quality)
Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)
Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.
flightradar24.com has lots of data, including historical data; they might be willing to help you get it in a nice format.
2019 Crime statistics in the USA
Dataset of arrests in the US, broken down by race and by state. Download Excel here
- Papers We Love
- Talks
- Algorithms
- Education – Learning and practicing.
- Algorithm Visualizations
- Artificial Intelligence
- Search Engine Optimization
- Competitive Programming
- Math
- Recursion Schemes – Traversing nested data structures.
Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021
Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.
At the 2021 International Conference on Machine Learning (ICML), researchers from IBM, MIT, and Harvard University released a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They also released two machine learning models that represent different approaches to the problem. The effort relies on testing techniques that psychologists use to study infants’ behavior, with the aim of accelerating the development of AI that exhibits common sense.
Source – Summary – Paper – IBM Blog
100 million protein structures Dataset by DeepMind
Here’s a good article about this topic
Google Dataset Search

Malware traffic dataset
Comprises 1,914,081 records created from all malware-traffic-analysis.net PCAP files from 2013 to 2021. The logs are generated using Suricata and Zeek.
Originator: ali_alwashali
Percent of “foreign-born” population in each US and EU state or country.
For the EU, “foreign-born” means being born outside of any of the EU countries. For the US, “foreign-born” means being born outside of any US state. 🇺🇸🇪🇺
Author: Here
Examples of “foreign-born” in this context:
Person born in Spain and living in France is NOT “foreign-born”
Person born in Turkey and living in France is “foreign-born”
Person born in Florida and living in Texas is NOT “foreign-born”
Person born in Mexico and living in Texas is “foreign-born”
Person born in Florida and living in France is “foreign-born”
Person born in France and living in Florida is “foreign-born”
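The rule behind these examples can be written as a short Python sketch. The place lists below are deliberately tiny, illustrative subsets (not the full EU or US membership), and the function names are invented for this example.

```python
# Illustrative subsets only; a real analysis would need the complete lists.
EU_COUNTRIES = {"Spain", "France", "Germany", "Poland"}
US_STATES = {"Florida", "Texas"}


def region_of(place):
    """Map a country or state to the bloc it belongs to, or None if neither."""
    if place in EU_COUNTRIES:
        return "EU"
    if place in US_STATES:
        return "US"
    return None


def is_foreign_born(birthplace, residence):
    """A person is "foreign-born" if born outside the bloc they live in."""
    home = region_of(residence)
    if home is None:
        raise ValueError(f"{residence!r} is not in the EU or US lists")
    return region_of(birthplace) != home


for birth, home in [
    ("Spain", "France"),
    ("Turkey", "France"),
    ("Florida", "Texas"),
    ("Mexico", "Texas"),
    ("Florida", "France"),
    ("France", "Florida"),
]:
    print(f"born in {birth}, living in {home}: foreign-born = {is_foreign_born(birth, home)}")
```

The comparison is made at the level of the whole bloc, which is why moving between EU countries or between US states does not count as "foreign-born" in this map.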
Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 migration data, and Croatia has no data at all.
Tools: MS Office
Source: Here
35% of “entry-level” jobs on LinkedIn require 3+ years of experience
Source: LinkedIn data (see original post)
Tool: Photoshop from my colleague
Latest complete Netflix movie dataset
Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)
Explore this dataset using FlixGem.com (this dataset is powering this webapp)
Common Crawl
A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.
AWS CLI Access (No AWS account required)
aws s3 ls s3://commoncrawl/ --no-sign-request
s3://commoncrawl/crawl-data/
Dataset on protein prices
Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.
CPOST dataset on suicide attacks over four decades
The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.
Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019
You can do a lot of aggregated analysis in a pretty straightforward way there.
Drone imagery with annotations for small object detection and tracking dataset
11 TB dataset of drone imagery with annotations for small object detection and tracking
Download and more information are available here
Dataset License: CDLA-Sharing-1.0
Helper scripts for accessing the dataset: DATASET.md
Dataset Exploration: Colab
NOAA High-Resolution Rapid Refresh (HRRR) Model
The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.
Registry of Open Data on AWS
This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.
See all usage examples for datasets listed in this registry.
See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.
Textbook Question Answering (TQA)
1,076 textbook lessons, 26,260 questions, 6229 images
Documentation: allenai.org/data/tqa
Harmonized Cancer Datasets: Genomic Data Commons Data Portal
The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.
AWS CLI Access (No AWS account required)
aws s3 ls s3://tcga-2-open/ --no-sign-request
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.
Genome Aggregation Database (gnomAD)
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads
SQuAD (Stanford Question Answering Dataset)
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
PubMed Diabetes Dataset
The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.
Drug-Target Interaction Dataset
This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link
Pharmacogenomics Datasets
PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.
Pancreatic Cancer Organoid Profiling
The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request
Africa Soil Information Service (AfSIS) Soil Chemistry
This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://afsis/ --no-sign-request
Dataset for Affective States in E-Environments
DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.
NatureServe Explorer Dataset
NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.
The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here
Flight Records in the US
Airline On-Time Performance and Causes of Flight Delays – On_Time Data.
This database contains scheduled and actual departure and arrival times, along with reasons for delay, reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).
FlightAware.com has data but you need to pay for a full dataset.
The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
- flights: all flights that departed a given airport in a given year and month
- weather: hourly meteorological data for a given airport in a given year and month
- airports: airport names, FAA codes, and locations
- airlines: translation between two-letter carrier (airline) codes and names
- planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here
Worldwide flight data
Download: airports.dat (Airports only, high quality)
Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)
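For orientation, here is a minimal Python sketch of reading a file like airports.dat. It assumes an OpenFlights-style layout: comma-separated, no header row, \N for missing values, and the column order listed in the code. None of that is stated above, so check it against the file you actually download. The sample row and the function name are illustrative only.

```python
import csv
import io

# Assumed column order (OpenFlights-style); verify against the real file.
COLUMNS = [
    "airport_id", "name", "city", "country", "iata", "icao",
    "latitude", "longitude", "altitude_ft", "utc_offset",
    "dst", "tz_name", "type", "source",
]


def read_airports(text):
    """Parse header-less CSV text into a list of dicts, one per airport."""
    airports = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != len(COLUMNS):
            continue  # skip rows that do not match the assumed layout
        # Treat the assumed missing-value marker as None.
        record = {k: (None if v == r"\N" else v) for k, v in zip(COLUMNS, row)}
        for field in ("latitude", "longitude"):
            if record[field] is not None:
                record[field] = float(record[field])
        airports.append(record)
    return airports


# Illustrative row, not taken from the text above.
SAMPLE = (
    '507,"London Heathrow Airport","London","United Kingdom","LHR","EGLL",'
    '51.4706,-0.461941,83,0,"E","Europe/London","airport","OurAirports"\n'
)

if __name__ == "__main__":
    for airport in read_airports(SAMPLE):
        print(airport["iata"], airport["name"], airport["latitude"])
```

Rows with an unexpected number of fields are skipped rather than guessed at, which keeps a layout mismatch visible as missing rows instead of shifted columns.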
Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.
flightradar24.com has lots of data, including historical data; they might be willing to help you get it in a nice format.
2019 Crime statistics in the USA
Dataset of arrests in the US, broken down by race and by state. Download Excel here
- Sublime Text
- Vim
- Emacs
- Atom – Open-source and hackable text editor.
- Visual Studio Code – Cross-platform open-source text editor.
Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021
Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.
At the 2021 International Conference on Machine Learning (ICML), researchers from IBM, MIT, and Harvard University released a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They also released two machine learning models that represent different approaches to the problem, both of which rely on testing techniques psychologists use to study infants’ behavior, with the aim of accelerating the development of AI that exhibits common sense.
Source – Summary – Paper – IBM Blog
100 million protein structures Dataset by DeepMind
Here’s a good article about this topic
Google Dataset Search

Malware traffic dataset
Comprises 1,914,081 records created from all malware-traffic-analysis.net PCAP files from 2013 to 2021. The logs are generated using Suricata and Zeek.
Originator: ali_alwashali
Percent of “foreign-born” population in each US and EU state or country.
For the EU, “foreign-born” means being born outside of any of the EU countries. For the US, “foreign-born” means being born outside of any US state 🇺🇸🇪🇺
Author: Here
Examples of “foreign-born” in this context:
Person born in Spain and living in France is NOT “foreign-born”
Person born in Turkey and living in France is “foreign-born”
Person born in Florida and living in Texas is NOT “foreign-born”
Person born in Mexico and living in Texas is “foreign-born”
Person born in Florida and living in France is “foreign-born”
Person born in France and living in Florida is “foreign-born”
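The rule behind these examples can be written down as a small Python sketch: a person is “foreign-born” when their birthplace is outside the bloc (EU or US) they live in. The place lists below are deliberately tiny, illustrative subsets and the function names are made up here; a real analysis would need complete lists of EU countries and US states.

```python
# Small illustrative subsets; a real analysis would need complete lists.
EU_COUNTRIES = {"Spain", "France", "Germany", "Poland"}
US_STATES = {"Florida", "Texas", "California"}


def region_of(place):
    """Return 'EU', 'US', or None for places outside both blocs."""
    if place in EU_COUNTRIES:
        return "EU"
    if place in US_STATES:
        return "US"
    return None


def is_foreign_born(birthplace, residence):
    """True when the birthplace lies outside the bloc of residence."""
    home = region_of(residence)
    if home is None:
        raise ValueError(f"{residence!r} is not a known EU country or US state")
    return region_of(birthplace) != home


if __name__ == "__main__":
    examples = [
        ("Spain", "France"), ("Turkey", "France"), ("Florida", "Texas"),
        ("Mexico", "Texas"), ("Florida", "France"), ("France", "Florida"),
    ]
    for birthplace, residence in examples:
        print(birthplace, "->", residence, is_foreign_born(birthplace, residence))
```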
🇺🇸🇪🇺🗺️
Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 migration data, and Croatia has no data at all.
Tools: MS Office
Source: Here
35% of “entry-level” jobs on LinkedIn require 3+ years of experience
Source: LinkedIn data (see original post)
Tool: Photoshop from my colleague
Latest complete Netflix movie dataset
Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)
Explore this dataset using FlixGem.com (this dataset is powering this webapp)
Common Crawl
A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.
AWS CLI Access (No AWS account required)
aws s3 ls s3://commoncrawl/ --no-sign-request
s3://commoncrawl/crawl-data/
Dataset on protein prices
Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.
CPOST dataset on suicide attacks over four decades
The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.
Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019
You can do a lot of aggregated analysis in a pretty straightforward way there.
Drone imagery with annotations for small object detection and tracking dataset
11 TB dataset of drone imagery with annotations for small object detection and tracking
Download and more information are available here
Dataset License: CDLA-Sharing-1.0
Helper scripts for accessing the dataset: DATASET.md
Dataset Exploration: Colab
NOAA High-Resolution Rapid Refresh (HRRR) Model
The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.
Registry of Open Data on AWS
This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.
See all usage examples for datasets listed in this registry.
See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.
Textbook Question Answering (TQA)
1,076 textbook lessons, 26,260 questions, 6229 images
Documentation: allenai.org/data/tqa
Harmonized Cancer Datasets: Genomic Data Commons Data Portal
The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.
AWS CLI Access (No AWS account required)
aws s3 ls s3://tcga-2-open/ --no-sign-request
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.
Genome Aggregation Database (gnomAD)
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads
SQuAD (Stanford Question Answering Dataset)
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
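To make “the answer is a span of the passage” concrete, here is a minimal Python sketch. It assumes the record layout of the published SQuAD JSON files (an answers list whose entries carry text and answer_start, plus an is_impossible flag for unanswerable questions); check those names against the files you download. The passage, question, and helper names are invented for illustration.

```python
def extract_answer(context, answer):
    """Return the slice of the passage that the answer record points at."""
    start = answer["answer_start"]
    end = start + len(answer["text"])
    return context[start:end]


def answers_for(question, context):
    """Return all answer spans, or an empty list for unanswerable questions."""
    if question.get("is_impossible", False):
        return []
    return [extract_answer(context, a) for a in question["answers"]]


# Invented example in the assumed layout.
CONTEXT = "The Normans gave their name to Normandy, a region in France."
ANSWERABLE = {
    "question": "Which country is Normandy in?",
    "is_impossible": False,
    "answers": [{"text": "France", "answer_start": 53}],
}
UNANSWERABLE = {
    "question": "Who was the first duke of Normandy?",
    "is_impossible": True,
    "answers": [],
}

if __name__ == "__main__":
    print(answers_for(ANSWERABLE, CONTEXT))
    print(answers_for(UNANSWERABLE, CONTEXT))
```

Comparing the extracted slice with the stored answer text is a cheap sanity check that character offsets survived any preprocessing of the passage.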
PubMed Diabetes Dataset
The PubMed Diabetes dataset consists of 19,717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes. The citation network consists of 44,338 links. Each publication in the dataset is described by a TF/IDF weighted word vector over a dictionary of 500 unique words. The README file in the dataset provides more details.
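As a rough illustration of what a TF/IDF weighted word vector is, the Python sketch below computes one vector per document over a fixed vocabulary. It uses a common textbook variant (term frequency times log of N over document frequency); the exact weighting used to build the dataset is not stated here, so treat this as an assumption. The toy documents and the three-word vocabulary are made up; in the dataset the vocabulary would be the 500-word dictionary.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocabulary):
    """Return one TF-IDF vector per document, ordered like `vocabulary`."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents containing each word.
    df = {w: sum(1 for toks in tokenized if w in toks) for w in vocabulary}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        vec = []
        for w in vocabulary:
            tf = counts[w] / total
            idf = math.log(n_docs / df[w]) if df[w] else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors


# Toy corpus, invented for illustration.
DOCS = [
    "insulin therapy for diabetes",
    "diabetes risk in children",
    "insulin resistance",
]
VOCAB = ["diabetes", "insulin", "children"]

if __name__ == "__main__":
    for doc, vec in zip(DOCS, tfidf_vectors(DOCS, VOCAB)):
        print(doc, [round(x, 3) for x in vec])
```

Words that are absent from a document get weight zero, and a word that appeared in every document would also get zero because its log term vanishes.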
Drug-Target Interaction Dataset
This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link
Pharmacogenomics Datasets
PharmGKB data and knowledge are available as downloads. It is often critical to check with the curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; they just want to make sure that their data is correctly understood before lots of resources are spent.
- Game Development
- Game Talks
- Godot – Game engine.
- Open Source Games
- Unity – Game engine.
- Chess
- LÖVE – Game engine.
- PICO-8 – Fantasy console.
- Game Boy Development
- Construct 2 – Game engine.
- Gideros – Game engine.
- Minecraft – Sandbox video game.
- Game Datasets – Materials and datasets for Artificial Intelligence in games.
- Haxe Game Development – A high-level strongly typed programming language used to produce cross-platform native code.
- libGDX – Java game framework.
- PlayCanvas – Game engine.
- Game Remakes – Actively maintained open-source game remakes.
- Flame – Game engine for Flutter.
- Discord Communities – Chat with friends and communities.
- CHIP-8 – Virtual computer game machine from the 70s.
- Games of Coding – Learn a programming language by making games.
- Quick Look Plugins – For macOS.
- Dev Env
- Dotfiles
- Shell
- Fish – User-friendly shell.
- Command-Line Apps
- ZSH Plugins
- GitHub – Hosting service for Git repositories.
- Browser Extensions
- Cheat Sheet
- Pinned Gists – Dynamic pinned gists for your GitHub profile.
- Git Cheat Sheet & Git Flow
- Git Tips
- Git Add-ons – Enhance the git CLI.
- Git Hooks – Scripts for automating tasks during git workflows.
- SSH
- FOSS for Developers
- Hyper – Cross-platform terminal app built on web technologies.
- PowerShell – Cross-platform object-oriented shell.
- Alfred Workflows – Productivity app for macOS.
- Terminals Are Sexy
- GitHub Actions – Create tasks to automate your workflow and share them with others on GitHub.
Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021
Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.
At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.
Source – Summary – Paper – IBM Blog
100 million protein structures Dataset by DeepMind
Here’s a good article about this topic
Google Dataset Search

Malware traffic dataset
Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.
Originator: ali_alwashali
Percent of “foreign-born” population in each US and EU state or country.
For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺
Author: Here
Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.
Examples of “foreign-born” in this context:
Person born in Spain and living in France is NOT “foreign-born”
Person born in Turkey and living in France is “foreign-born”
Person born in Florida and living in Texas is NOT “foreign-born”
Person born in Mexico and living in Texas is “foreign-born”
Person born in Florida and living in France is “foreign-born”
Person born in France and living in Florida is “foreign-born”
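The six examples above all follow one rule: compare the region of birth with the region of residence. Here is a minimal sketch of that rule. The `REGION_OF` lookup is a hypothetical table covering only the places named in the examples, not the map's actual data.

```python
# Hypothetical lookup: only the places used in the examples above.
REGION_OF = {
    "Spain": "EU",
    "France": "EU",
    "Florida": "US",
    "Texas": "US",
    "Turkey": "OTHER",
    "Mexico": "OTHER",
}


def is_foreign_born(born_in: str, lives_in: str) -> bool:
    """Return True if the person counts as foreign-born under the map's definition."""
    home_region = REGION_OF[lives_in]
    if home_region not in ("EU", "US"):
        # The map only describes people living in the EU or the US.
        raise ValueError(f"{lives_in} is outside the EU and the US")
    # Foreign-born means born outside the whole region, not just the state or country.
    return REGION_OF[born_in] != home_region


for born, lives in [("Spain", "France"), ("Turkey", "France"), ("Florida", "France")]:
    print(born, "->", lives, is_foreign_born(born, lives))
```

Treating the EU and the US each as a single region is what makes a move from Spain to France, or from Florida to Texas, not count.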
🇺🇸🇪🇺🗺️
Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 migration data; Croatia has no data at all.
Tools: MS Office
Source: Here
35% of “entry-level” jobs on LinkedIn require 3+ years of experience
Source: LinkedIn data (see original post)
Tool: Photoshop from my colleague
Latest complete Netflix movie dataset
Created from four APIs. 11K+ rows and 30+ attributes of Netflix titles (ratings, earnings, actors, language, availability, movie trailers, and many more).
Explore this dataset using FlixGem.com (this dataset is powering this webapp)
Common Crawl
A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.
AWS CLI Access (No AWS account required)
aws s3 ls s3://commoncrawl/ --no-sign-request
s3://commoncrawl/crawl-data/
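Several datasets in this list use the same anonymous `aws s3 ls ... --no-sign-request` pattern. If you want to script it, here is a small sketch that only builds the command. The helper name `anonymous_ls_command` is made up for illustration, and running the result assumes the AWS CLI is installed.

```python
import shlex
from typing import List


def anonymous_ls_command(bucket: str, prefix: str = "") -> List[str]:
    """Build the argv for an unsigned `aws s3 ls` call on a public bucket."""
    url = f"s3://{bucket}/{prefix}"
    return ["aws", "s3", "ls", url, "--no-sign-request"]


cmd = anonymous_ls_command("commoncrawl", "crawl-data/")
print(shlex.join(cmd))

# To actually run it (needs the AWS CLI and network access):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems when a prefix contains spaces.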
Dataset on protein prices
Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.
CPOST dataset on suicide attacks over four decades
The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.
Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019
You can do a lot of aggregated analysis in a pretty straightforward way there.
Drone imagery with annotations for small object detection and tracking dataset
11 TB dataset of drone imagery with annotations for small object detection and tracking
Download and more information are available here
Dataset License: CDLA-Sharing-1.0
Helper scripts for accessing the dataset: DATASET.md
Dataset Exploration: Colab
NOAA High-Resolution Rapid Refresh (HRRR) Model
The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.
Registry of Open Data on AWS
This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.
See all usage examples for datasets listed in this registry.
See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.
Textbook Question Answering (TQA)
1,076 textbook lessons, 26,260 questions, 6,229 images
Documentation: allenai.org/data/tqa
Harmonized Cancer Datasets: Genomic Data Commons Data Portal
The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.
AWS CLI Access (No AWS account required)
aws s3 ls s3://tcga-2-open/ --no-sign-request
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.
Genome Aggregation Database (gnomAD)
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads
SQuAD (Stanford Question Answering Dataset)
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
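To make "the answer is a span of the passage" concrete, here is a rough sketch of how a span can be recovered from a character offset. The field names (`context`, `answers`, `text`, `answer_start`, `is_impossible`) follow the layout commonly seen in the SQuAD JSON files, but treat them as an assumption and check the files you download. The two records are invented for illustration.

```python
from typing import Optional

CONTEXT = "The Common Crawl corpus contains petabytes of data collected since 2008."

# Invented records in a SQuAD-like shape (not taken from the real dataset).
answerable = {
    "question": "Since when has the corpus been collected?",
    "is_impossible": False,
    "answers": [{"text": "2008", "answer_start": CONTEXT.index("2008")}],
}
unanswerable = {
    "question": "Who founded the project?",
    "is_impossible": True,
    "answers": [],
}


def extract_span(context: str, qa: dict) -> Optional[str]:
    """Return the answer text sliced out of the passage, or None if unanswerable."""
    if qa.get("is_impossible") or not qa["answers"]:
        return None
    answer = qa["answers"][0]
    start = answer["answer_start"]
    end = start + len(answer["text"])
    return context[start:end]


print(extract_span(CONTEXT, answerable))
print(extract_span(CONTEXT, unanswerable))
```

Slicing by offset, rather than searching for the answer string, matters when the same words appear more than once in the passage.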
PubMed Diabetes Dataset
The PubMed Diabetes dataset consists of 19,717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes. The citation network consists of 44,338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.
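For readers new to TF/IDF weighted word vectors, here is a minimal pure-Python sketch of one common variant (term frequency as a share of the document, inverse document frequency as the natural log of N over document frequency). The dataset's own weighting may differ; the three-word vocabulary and two toy documents are made up.

```python
import math
from collections import Counter
from typing import List


def tfidf_vectors(docs: List[str], vocabulary: List[str]) -> List[List[float]]:
    """Return one TF-IDF vector per document, ordered like `vocabulary`."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = {w: sum(1 for toks in tokenized if w in toks) for w in vocabulary}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = []
        for w in vocabulary:
            tf = counts[w] / len(toks) if toks else 0.0
            idf = math.log(n_docs / doc_freq[w]) if doc_freq[w] else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors


docs = ["insulin therapy insulin", "glucose therapy"]
vocab = ["insulin", "glucose", "therapy"]
for vec in tfidf_vectors(docs, vocab):
    print([round(x, 4) for x in vec])
```

A word that appears in every document ("therapy" here) gets a weight of zero, which is the point of the IDF term: it carries no information for telling documents apart.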
Drug-Target Interaction Dataset
This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link
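A quick bit of arithmetic on the counts above shows why predicting new interactions is hard: known interactions fill only a small fraction of the drug-by-target matrix. This is just a back-of-the-envelope check using the numbers quoted in the description.

```python
n_drugs = 315
n_targets = 250
n_interactions = 1306

# Every drug could in principle be paired with every target.
possible_pairs = n_drugs * n_targets
density = n_interactions / possible_pairs

print(f"{possible_pairs} possible pairs, {density:.2%} with a known interaction")
```

With fewer than 2% of pairs labeled as interacting, a model that always predicts "no interaction" would look accurate, so ranking metrics are usually more informative than plain accuracy here.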
Pharmacogenomics Datasets
PharmGKB data and knowledge are available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; they just want to make sure that there is a correct understanding of their data before lots of resources are spent.
Pancreatic Cancer Organoid Profiling
The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request
Africa Soil Information Service (AfSIS) Soil Chemistry
This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://afsis/ --no-sign-request
Dataset for Affective States in E-Environments
DAiSEE is the first multi-label video classification dataset comprising 9,068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels (very low, low, high, and very high) for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.
NatureServe Explorer Dataset
NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.
The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here
Flight Records in the US
Airline On-Time Performance and Causes of Flight Delays – On_Time Data.
This database contains scheduled and actual departure and arrival times and the reason for delay, as reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).
FlightAware.com has data but you need to pay for a full dataset.
The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
- flights: all flights that departed a given airport in a given year and month
- weather: hourly meteorological data for a given airport in a given year and month
- airports: airport names, FAA codes, and locations
- airlines: translation between two-letter carrier (airline) codes and names
- planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here
Worldwide flight data
Download: airports.dat (Airports only, high quality)
Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)
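If these files follow the OpenFlights layout (an assumption: headerless CSV, 14 columns, `\N` for missing values), they can be read with the standard `csv` module. The column names in `FIELDS` and the sample row below are illustrative and should be checked against the documentation that comes with the download.

```python
import csv
import io
from typing import List

# Assumed column order; the files themselves have no header row.
FIELDS = [
    "airport_id", "name", "city", "country", "iata", "icao",
    "latitude", "longitude", "altitude_ft", "utc_offset",
    "dst", "tz_name", "type", "source",
]

# Illustrative row in the assumed format.
SAMPLE = (
    '507,"London Heathrow Airport","London","United Kingdom","LHR","EGLL",'
    '51.4706,-0.461941,83,0,"E","Europe/London","airport","OurAirports"\n'
)


def parse_airports(text: str) -> List[dict]:
    """Parse airports.dat-style text into a list of dicts."""
    airports = []
    for values in csv.reader(io.StringIO(text)):
        row = dict(zip(FIELDS, values))
        # Treat the \N marker as a missing value.
        row = {key: (None if value == r"\N" else value) for key, value in row.items()}
        row["latitude"] = float(row["latitude"])
        row["longitude"] = float(row["longitude"])
        airports.append(row)
    return airports


for airport in parse_airports(SAMPLE):
    print(airport["iata"], airport["name"], airport["latitude"], airport["longitude"])
```

For the real file, replace `SAMPLE` with the contents read from disk, for example `open("airports.dat", encoding="utf-8").read()`.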
Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem to be limited by date. I can’t speak to the validity of the data, though.
flightradar24.com has lots of data, including historical data; they might be willing to help you get it in a nice format.
2019 Crime statistics in the USA
Dataset of arrests in the US by race and by state. Download Excel here
- Database
- MySQL
- SQLAlchemy
- InfluxDB
- Neo4j
- MongoDB – NoSQL database.
- RethinkDB
- TinkerPop – Graph computing framework.
- PostgreSQL – Object-relational database.
- CouchDB – Document-oriented NoSQL database.
- HBase – Distributed, scalable, big data store.
- NoSQL Guides – Help on using non-relational, distributed, open-source, and horizontally scalable databases.
- Contexture – Abstracts queries/filters and results/aggregations from different backing data stores like ElasticSearch and MongoDB.
- Database Tools – Everything that makes working with databases easier.
- Grakn – Logical database to organize large and complex networks of data as one body of knowledge.
Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021
Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.
At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.
Source – Summary – Paper – IBM Blog
100 million protein structures Dataset by DeepMind
Here’s a good article about this topic
Google Dataset Search

Malware traffic dataset
Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.
Originator: ali_alwashali
Percent of “foreign-born” population in each US and EU state or country.
For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺
Author: Here
Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.
Examples of “foreign-born” in this context:
Person born in Spain and living in France is NOT “foreign-born”
Person born in Turkey and living in France is “foreign-born”
Person born in Florida and living in Texas is NOT “foreign-born”
Person born in Mexico and living in Texas is “foreign-born”
Person born in Florida and living in France is “foreign-born”
Person born in France and living in Florida is “foreign-born”
🇺🇸🇪🇺🗺️
Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all
Tools: MS Office
Source: Here
35% of “entry-level” jobs on LinkedIn require 3+ years of experience
Source: LinkedIn data (see original post)
Tool: Photoshop from my colleague
Latest complete Netflix movie dataset
Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)
Explore this dataset using FlixGem.com (this dataset is powering this webapp)
Common Crawl
A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.
AWS CLI Access (No AWS account required)
aws s3 ls s3://commoncrawl/ --no-sign-request
s3://commoncrawl/crawl-data/
Dataset on protein prices
Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.
CPOST dataset on suicide attacks over four decades
The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.
Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019
You can do a lot of aggregated analysis in a pretty straightforward way there.
Drone imagery with annotations for small object detection and tracking dataset
11 TB dataset of drone imagery with annotations for small object detection and tracking
Download and more information are available here
Dataset License: CDLA-Sharing-1.0
Helper scripts for accessing the dataset: DATASET.md
Dataset Exploration: Colab
NOAA High-Resolution Rapid Refresh (HRRR) Model
The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.
Registry of Open Data on AWS
This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.
See all usage examples for datasets listed in this registry.
See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.
Textbook Question Answering (TQA)
1,076 textbook lessons, 26,260 questions, 6229 images
Documentation: allenai.org/data/tqa
Harmonized Cancer Datasets: Genomic Data Commons Data Portal
The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.
AWS CLI Access (No AWS account required)
aws s3 ls s3://tcga-2-open/ --no-sign-request
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.
Genome Aggregation Database (gnomAD)
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads
SQuAD (Stanford Question Answering Dataset)
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
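To make the "span" idea concrete, here is a small sketch of how an answer can be recovered from its passage by character offset. It assumes the field names used in the published SQuAD JSON files (context, answers, text, answer_start, and is_impossible for unanswerable questions in version 2.0). The passage and questions below are invented for illustration and are not taken from the dataset.

```python
def answer_span(context, qa):
    """Return the answer sliced out of the passage, or None if unanswerable."""
    if qa.get("is_impossible", False):
        return None
    answer = qa["answers"][0]
    start = answer["answer_start"]          # character offset into the passage
    end = start + len(answer["text"])
    span = context[start:end]
    if span != answer["text"]:
        raise ValueError("answer_start does not line up with the answer text")
    return span


# Hypothetical record in the assumed SQuAD layout.
context = "The Amazon rainforest covers much of the Amazon basin of South America."
answerable = {
    "question": "On which continent is the Amazon basin?",
    "answers": [{"text": "South America", "answer_start": 57}],
    "is_impossible": False,
}
unanswerable = {
    "question": "How many species live in the rainforest?",
    "answers": [],
    "is_impossible": True,
}

print(answer_span(context, answerable))
print(answer_span(context, unanswerable))
```

Checking that the slice equals the stored text is a cheap sanity test worth running after any preprocessing that might shift character offsets.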
PubMed Diabetes Dataset
The PubMed Diabetes dataset consists of 19,717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes. The citation network consists of 44,338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary of 500 unique words. The README file in the dataset provides more details.
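As a rough illustration of what a TF/IDF weighted word vector over a fixed dictionary looks like, the sketch below builds one vector per document. The exact weighting used to build the dataset is not stated here, so this assumes a common variant: term frequency as count divided by document length, and inverse document frequency as ln(N / document frequency). The three short documents and the three-word dictionary are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocabulary):
    """Return one TF-IDF vector per document over a fixed vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    doc_freq = {
        word: sum(1 for tokens in tokenized if word in tokens)
        for word in vocabulary
    }
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vector = []
        for word in vocabulary:
            tf = counts[word] / len(tokens) if tokens else 0.0
            idf = math.log(n_docs / doc_freq[word]) if doc_freq[word] else 0.0
            vector.append(tf * idf)
        vectors.append(vector)
    return vectors


# Invented mini-corpus and dictionary, for illustration only.
docs = [
    "insulin resistance in type 2 diabetes",
    "insulin therapy for type 1 diabetes",
    "juvenile diabetes and insulin pumps",
]
vocabulary = ["insulin", "resistance", "therapy"]
vectors = tfidf_vectors(docs, vocabulary)
for vector in vectors:
    print([round(weight, 3) for weight in vector])
```

A word that occurs in every document, such as "insulin" here, gets weight zero under this variant, which is why TF/IDF vectors emphasize the words that distinguish one publication from another.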
Drug-Target Interaction Dataset
This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link
Pharmacogenomics Datasets
PharmGKB data and knowledge are available as downloads. It is often critical to check with the curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; the curators just want to make sure that their data are correctly understood before lots of resources are spent.
Pancreatic Cancer Organoid Profiling
The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request
Africa Soil Information Service (AfSIS) Soil Chemistry
This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://afsis/ --no-sign-request
Dataset for Affective States in E-Environments
DAiSEE is the first multi-label video classification dataset, comprising 9,068 video snippets captured from 112 users, for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. Each affective state has four label levels (very low, low, high, and very high), which are crowd-annotated and correlated with a gold-standard annotation created by a team of expert psychologists. Download it here.
NatureServe Explorer Dataset
NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.
The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here
Flight Records in the US
Airline On-Time Performance and Causes of Flight Delays – On_Time Data.
This database contains scheduled and actual departure and arrival times and the reasons for delays, as reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).
FlightAware.com has data but you need to pay for a full dataset.
The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
flights: all flights that departed a given airport in a given year and month
weather: hourly meteorological data for a given airport in a given year and month
airports: airport names, FAA codes, and locations
airlines: translation between two-letter carrier (airline) codes and names
planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here
Worldwide flight data
Download: airports.dat (Airports only, high quality)
Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)
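The file names above match the OpenFlights airport database. Assuming that is the source, and assuming its documented layout (a headerless, comma-separated file with 14 columns and \N marking missing values), a file like airports.dat could be parsed along these lines. The column names are my own labels and the sample row is made up for illustration.

```python
import csv
import io

# Assumed column order of the OpenFlights airports file (labels are illustrative).
FIELDS = [
    "airport_id", "name", "city", "country", "iata", "icao",
    "latitude", "longitude", "altitude_ft", "utc_offset",
    "dst", "tz_database", "type", "source",
]


def parse_airports(text):
    """Parse headerless airport rows into dictionaries."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        row = dict(zip(FIELDS, record))
        # Treat the \N marker as a missing value.
        row = {key: (None if value == "\\N" else value) for key, value in row.items()}
        for key in ("latitude", "longitude"):
            if row[key] is not None:
                row[key] = float(row[key])
        rows.append(row)
    return rows


# Made-up row in the assumed layout, not real data.
SAMPLE = (
    r'1,"Example Airport","Example City","Exampleland","EXA",\N,'
    r'-6.08,145.39,5282,10,"U","Pacific/Port_Moresby","airport","Example"'
)
rows = parse_airports(SAMPLE)
print(rows[0]["name"], rows[0]["latitude"], rows[0]["longitude"])
```

If the real file turns out to use a different column order, only the FIELDS list needs to change.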
Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.
flightradar24.com has lots of data, including historical data, and they might be willing to help you get it in a nice format.
2019 Crime statistics in the USA
Dataset of arrests in the US, broken down by race and by state. Download Excel here
Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021
Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.
At the 2021 International Conference on Machine Learning (ICML), researchers from IBM, MIT, and Harvard University released a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They also released two machine learning models that represent different approaches to the problem. Both rely on testing techniques that psychologists use to study infants’ behavior, with the aim of accelerating the development of AI that exhibits common sense.
Source – Summary – Paper – IBM Blog
100 million protein structures Dataset by DeepMind
Here’s a good article about this topic
Google Dataset Search

Malware traffic dataset
Comprises 1,914,081 records created from all malware-traffic-analysis.net PCAP files from 2013 to 2021. The logs are generated using Suricata and Zeek.
Originator: ali_alwashali
Percent of “foreign-born” population in each US and EU state or country.
For the EU, “foreign-born” means being born outside of any of the EU countries. For the US, “foreign-born” means being born outside of any US state.
Author: Here
Examples of “foreign-born” in this context:
Person born in Spain and living in France is NOT “foreign-born”
Person born in Turkey and living in France is “foreign-born”
Person born in Florida and living in Texas is NOT “foreign-born”
Person born in Mexico and living in Texas is “foreign-born”
Person born in Florida and living in France is “foreign-born”
Person born in France and living in Florida is “foreign-born”
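The rule behind these examples can be written as a small function: a person is “foreign-born” when their birthplace lies outside the union (EU or US) in which they now live. This is only a sketch of the definition as stated above. The two membership sets are deliberately tiny, covering just the places named in the examples, and would need to be filled in with the full lists of EU countries and US states.

```python
# Illustrative subsets only: just the places used in the examples above.
EU_SAMPLE = {"Spain", "France"}
US_SAMPLE = {"Florida", "Texas"}


def union_of(place):
    """Return "EU", "US", or None for a place outside both sample sets."""
    if place in EU_SAMPLE:
        return "EU"
    if place in US_SAMPLE:
        return "US"
    return None


def is_foreign_born(birthplace, residence):
    """Apply the map's definition to a person living in the EU or the US."""
    home = union_of(residence)
    if home is None:
        raise ValueError("residence must be an EU country or a US state")
    return union_of(birthplace) != home


examples = [
    ("Spain", "France"),
    ("Turkey", "France"),
    ("Florida", "Texas"),
    ("Mexico", "Texas"),
    ("Florida", "France"),
    ("France", "Florida"),
]
for birthplace, residence in examples:
    print(birthplace, residence, is_foreign_born(birthplace, residence))
```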
Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 migration data, and Croatia has no data at all.
Tools: MS Office
Source: Here
35% of “entry-level” jobs on LinkedIn require 3+ years of experience
Source: LinkedIn data (see original post)
Tool: Photoshop from my colleague
Latest complete Netflix movie dataset
Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)
Explore this dataset using FlixGem.com (this dataset is powering this webapp)
Common Crawl
A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.
AWS CLI Access (No AWS account required)
aws s3 ls s3://commoncrawl/ --no-sign-request
s3://commoncrawl/crawl-data/
Dataset on protein prices
Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.
CPOST dataset on suicide attacks over four decades
The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.
Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019
You can do a lot of aggregated analysis in a pretty straightforward way there.
Drone imagery with annotations for small object detection and tracking dataset
11 TB dataset of drone imagery with annotations for small object detection and tracking
Download and more information are available here
Dataset License: CDLA-Sharing-1.0
Helper scripts for accessing the dataset: DATASET.md
Dataset Exploration: Colab
NOAA High-Resolution Rapid Refresh (HRRR) Model
The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.
Registry of Open Data on AWS
This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.
See all usage examples for datasets listed in this registry.
See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.
Textbook Question Answering (TQA)
1,076 textbook lessons, 26,260 questions, 6229 images
Documentation: allenai.org/data/tqa
Harmonized Cancer Datasets: Genomic Data Commons Data Portal
The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.
AWS CLI Access (No AWS account required)
aws s3 ls s3://tcga-2-open/ --no-sign-request
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.
Genome Aggregation Database (gnomAD)
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads
SQuAD (Stanford Question Answering Dataset)
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
PubMed Diabetes Dataset
The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.
Drug-Target Interaction Dataset
This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link
Pharmacogenomics Datasets
PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.
Pancreatic Cancer Organoid Profiling
The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request
Africa Soil Information Service (AfSIS) Soil Chemistry
This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://afsis/ --no-sign-request
Dataset for Affective States in E-Environments
DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.
NatureServe Explorer Dataset
NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.
The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here
Flight Records in the US
Airline On-Time Performance and Causes of Flight Delays – On_Time Data.
This database contains scheduled and actual departure and arrival times and reasons for delay, as reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).
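As a rough illustration of how the scheduled and actual times in this kind of data can be compared, the sketch below computes a delay in minutes. The function names, the timestamp layout, and the 15-minute on-time threshold are all assumptions made for the example; they are not field names or rules taken from the BTS files.

```python
from datetime import datetime

# Hypothetical timestamp layout for this sketch; the real files may differ.
TIME_FORMAT = "%Y-%m-%d %H:%M"


def delay_minutes(scheduled: str, actual: str) -> float:
    """Return actual minus scheduled time in minutes (negative means early)."""
    sched = datetime.strptime(scheduled, TIME_FORMAT)
    act = datetime.strptime(actual, TIME_FORMAT)
    return (act - sched).total_seconds() / 60


def is_on_time(scheduled: str, actual: str, threshold: float = 15.0) -> bool:
    """Treat a flight as on time if it is less than `threshold` minutes late.

    The 15-minute default is an assumption for this example.
    """
    return delay_minutes(scheduled, actual) < threshold


print(delay_minutes("2019-01-05 08:00", "2019-01-05 08:42"))
print(is_on_time("2019-01-05 08:00", "2019-01-05 08:42"))
```

Working with full timestamps rather than clock times alone keeps the subtraction correct for flights that depart after midnight.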
FlightAware.com has data but you need to pay for a full dataset.
The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
- flights: all flights that departed a given airport in a given year and month
- weather: hourly meteorological data for a given airport in a given year and month
- airports: airport names, FAA codes, and locations
- airlines: translation between two-letter carrier (airline) codes and names
- planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here
Worldwide flight data
Download: airports.dat (Airports only, high quality)
Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)
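A minimal sketch of reading such a file follows. It assumes the files are headerless, comma-separated text in the OpenFlights-style column order listed in `COLUMNS`, with `\N` marking missing values; the column names, the `read_airports` helper, and the sample row are illustrative assumptions, so check them against the documentation that ships with the download.

```python
import csv
import io

# Assumed column order for a headerless airports.dat; verify before relying on it.
COLUMNS = [
    "airport_id", "name", "city", "country", "iata", "icao",
    "latitude", "longitude", "altitude_ft", "utc_offset",
    "dst", "tz", "type", "source",
]


def read_airports(handle):
    """Yield one dict per row of an open airports.dat-style file."""
    for row in csv.reader(handle):
        # "\N" is assumed to mark a missing value.
        record = {k: (None if v == "\\N" else v) for k, v in zip(COLUMNS, row)}
        for field in ("latitude", "longitude"):
            if record.get(field) is not None:
                record[field] = float(record[field])
        yield record


# Illustrative row only, used so the sketch runs without downloading anything.
SAMPLE = (
    '507,"London Heathrow Airport","London","United Kingdom","LHR","EGLL",'
    '51.4706,-0.461941,83,0,"E","Europe/London","airport","OurAirports"\n'
)

airports = list(read_airports(io.StringIO(SAMPLE)))
print(airports[0]["iata"], airports[0]["latitude"], airports[0]["longitude"])
```

For a downloaded file, pass `open("airports.dat", encoding="utf-8", newline="")` instead of the in-memory sample.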
Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.
flightradar24.com has lots of data, including historical data, and they might be willing to help you get it in a nice format.
2019 Crime statistics in the USA
Dataset of arrests in the US, broken down by race and by state. Download Excel here
- Application Security
- Security
- CTF – Capture The Flag.
- Malware Analysis
- Android Security
- Hacking
- Honeypots – Deception trap, designed to entice an attacker into attempting to compromise the information systems in an organization.
- Incident Response
- Vehicle Security and Car Hacking
- Web Security – Security of web apps & services.
- Lockpicking – The art of unlocking a lock by manipulating its components without the key.
- Cybersecurity Blue Team – Groups of individuals who identify security flaws in information technology systems.
- Fuzzing – Automated software testing technique that involves feeding pseudo-randomly generated input data.
- Embedded and IoT Security
- GDPR – Regulation on data protection and privacy for all individuals within the EU.
- DevSecOps – Integration of security practices into DevOps.
Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021
Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.
At the 2021 International Conference on Machine Learning (ICML), researchers from IBM, MIT, and Harvard University came together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem. The work relies on testing techniques that psychologists use to study infants’ behavior, with the goal of accelerating the development of AI that exhibits common sense.
Source – Summary – Paper – IBM Blog
100 million protein structures Dataset by DeepMind
Here’s a good article about this topic
Google Dataset Search

Malware traffic dataset
Comprises 1,914,081 records created from all malware-traffic-analysis.net PCAP files from 2013 to 2021. The logs are generated using Suricata and Zeek.
Originator: ali_alwashali
Percent of “foreign-born” population in each US and EU state or country.
For the EU, “foreign-born” means being born outside of any of the EU countries. For the US, “foreign-born” means being born outside of any US state 🇺🇸🇪🇺
Author: Here
Examples of “foreign-born” in this context:
Person born in Spain and living in France is NOT “foreign-born”
Person born in Turkey and living in France is “foreign-born”
Person born in Florida and living in Texas is NOT “foreign-born”
Person born in Mexico and living in Texas is “foreign-born”
Person born in Florida and living in France is “foreign-born”
Person born in France and living in Florida is “foreign-born”
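The rule behind these examples can be sketched in a few lines: a person counts as “foreign-born” when their birthplace does not belong to the same union (EU or US) as their place of residence. The `UNION_OF` lookup below is a deliberately tiny, illustrative subset covering only the places named in the examples, and the function name is hypothetical.

```python
# Illustrative subset only: maps a country or state to the union it belongs to.
UNION_OF = {
    "Spain": "EU",
    "France": "EU",
    "Florida": "US",
    "Texas": "US",
}


def is_foreign_born(birthplace: str, residence: str) -> bool:
    """Apply the chart's definition for a resident of an EU country or US state."""
    home_union = UNION_OF.get(residence)
    if home_union is None:
        raise ValueError(f"{residence!r} is not an EU country or US state in this sketch")
    # Places outside both unions map to None, which never equals "EU" or "US".
    return UNION_OF.get(birthplace) != home_union


EXAMPLES = [
    ("Spain", "France"),
    ("Turkey", "France"),
    ("Florida", "Texas"),
    ("Mexico", "Texas"),
    ("Florida", "France"),
    ("France", "Florida"),
]

for born, lives in EXAMPLES:
    print(f"born in {born}, living in {lives}: {is_foreign_born(born, lives)}")
```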
🇺🇸🇪🇺🗺️
Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 migration data, and Croatia has no data at all.
Tools: MS Office
Source: Here
35% of “entry-level” jobs on LinkedIn require 3+ years of experience
Source: LinkedIn data (see original post)
Tool: Photoshop from my colleague
Latest complete Netflix movie dataset
Created from 4 APIs. 11K+ rows and 30+ attributes for Netflix titles (ratings, earnings, actors, language, availability, movie trailers, and many more).
Explore this dataset using FlixGem.com (this dataset is powering this webapp)
Common Crawl
A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.
AWS CLI Access (No AWS account required)
aws s3 ls s3://commoncrawl/ --no-sign-request
s3://commoncrawl/crawl-data/
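For readers without the AWS CLI, the same kind of anonymous listing can be sketched with only the Python standard library, using the S3 ListObjectsV2 REST call (`list-type=2`). This is a hedged sketch: the helper names are hypothetical, it assumes the bucket still permits unsigned listing through the global `s3.amazonaws.com` endpoint, and it does not follow pagination.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 list responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def listing_url(bucket: str, prefix: str = "") -> str:
    """Build an unsigned ListObjectsV2 URL for one 'directory' level."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "delimiter": "/", "prefix": prefix}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_payload):
    """Return (sub-prefixes, object keys) from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_payload)
    prefixes = [e.text for e in root.findall("s3:CommonPrefixes/s3:Prefix", S3_NS)]
    keys = [e.text for e in root.findall("s3:Contents/s3:Key", S3_NS)]
    return prefixes, keys


def list_bucket(bucket: str, prefix: str = ""):
    """Fetch and parse one page of results (needs network access)."""
    with urllib.request.urlopen(listing_url(bucket, prefix), timeout=30) as resp:
        return parse_listing(resp.read())


print(listing_url("commoncrawl", "crawl-data/"))
```

Calling `list_bucket("commoncrawl", "crawl-data/")` would then be the rough equivalent of the `aws s3 ls` command above; it is left out of the sketch so that nothing touches the network by default.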
Dataset on protein prices
Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.
CPOST dataset on suicide attacks over four decades
The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.
Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019
You can do a lot of aggregated analysis in a pretty straightforward way there.
Drone imagery with annotations for small object detection and tracking dataset
11 TB dataset of drone imagery with annotations for small object detection and tracking
Download and more information are available here
Dataset License: CDLA-Sharing-1.0
Helper scripts for accessing the dataset: DATASET.md
Dataset Exploration: Colab
NOAA High-Resolution Rapid Refresh (HRRR) Model
The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.
Registry of Open Data on AWS
This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.
See all usage examples for datasets listed in this registry.
See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.
Textbook Question Answering (TQA)
1,076 textbook lessons, 26,260 questions, 6,229 images
Documentation: allenai.org/data/tqa
Harmonized Cancer Datasets: Genomic Data Commons Data Portal
The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.
AWS CLI Access (No AWS account required)
aws s3 ls s3://tcga-2-open/ --no-sign-request
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.
Genome Aggregation Database (gnomAD)
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads
SQuAD (Stanford Question Answering Dataset)
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
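To make the “answer is a span of the passage” idea concrete, the sketch below recovers an answer from a character offset. The record layout (`context`, `answers`, `answer_start`, `is_impossible`) is a simplified assumption modeled on the SQuAD 2.0 JSON files, and the passage, questions, and helper name are made up for illustration.

```python
from typing import Optional

CONTEXT = "The Amazon rainforest covers much of the Amazon basin of South America."
ANSWER = "much of the Amazon basin"

# Simplified, hypothetical records; real files nest these under articles and paragraphs.
answerable = {
    "context": CONTEXT,
    "question": "What does the Amazon rainforest cover?",
    "answers": [{"text": ANSWER, "answer_start": CONTEXT.index(ANSWER)}],
    "is_impossible": False,
}
unanswerable = {
    "context": CONTEXT,
    "question": "Who first mapped the Amazon basin?",
    "answers": [],
    "is_impossible": True,
}


def extract_answer(record: dict) -> Optional[str]:
    """Slice the answer span out of the passage, or return None if unanswerable."""
    if record["is_impossible"] or not record["answers"]:
        return None
    answer = record["answers"][0]
    start = answer["answer_start"]
    span = record["context"][start:start + len(answer["text"])]
    if span != answer["text"]:
        raise ValueError("answer_start does not point at the answer text")
    return span


print(extract_answer(answerable))
print(extract_answer(unanswerable))
```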
PubMed Diabetes Dataset
The PubMed Diabetes dataset consists of 19,717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes. The citation network consists of 44,338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary that consists of 500 unique words. The README file in the dataset provides more details.
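As a reminder of what a TF/IDF weighted word vector looks like, here is a small sketch over a toy corpus. It uses one common variant (term frequency times the natural log of N over document frequency); the exact weighting used to build the dataset is not stated here, so treat the formula, the function name, and the toy documents as assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocabulary):
    """Return one weight vector per document, ordered like `vocabulary`."""
    n_docs = len(docs)
    # Document frequency: in how many documents each vocabulary term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc) if term in vocabulary)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero for an empty document
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            if doc_freq[term] else 0.0
            for term in vocabulary
        ])
    return vectors


# Toy stand-ins for tokenized abstracts; the real dictionary has 500 words.
docs = [
    ["insulin", "glucose", "insulin"],
    ["glucose", "retina"],
    ["glucose", "insulin"],
]
vocabulary = ["glucose", "insulin", "retina"]

for vector in tfidf_vectors(docs, vocabulary):
    print([round(weight, 3) for weight in vector])
```

A word that appears in every document (here "glucose") gets weight zero, which is why such vectors emphasize the terms that distinguish one publication from another.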
Drug-Target Interaction Dataset
This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link
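A quick back-of-the-envelope check, using only the counts quoted above, shows how sparse the known interactions are relative to all possible drug-target pairs. This is simple arithmetic rather than anything taken from the dataset's own tooling.

```python
# Counts quoted in the dataset description above.
N_DRUGS = 315
N_TARGETS = 250
N_INTERACTIONS = 1306

possible_pairs = N_DRUGS * N_TARGETS
density = N_INTERACTIONS / possible_pairs

print(f"{possible_pairs} possible pairs, {density:.2%} with a known interaction")
```

With fewer than 2% of pairs labeled as interacting, the prediction task is heavily imbalanced, which helps explain why the similarity networks matter as side information.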
Pharmacogenomics Datasets
PharmGKB data and knowledge are available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; they just want to make sure that there is a correct understanding of the data before lots of resources are spent.
- Umbraco
- Refinery CMS – Ruby on Rails CMS.
- Wagtail – Django CMS focused on flexibility and user experience.
- Textpattern – Lightweight PHP-based CMS.
- Drupal – Extensible PHP-based CMS.
- Craft CMS – Content-first CMS.
- Sitecore – .NET digital marketing platform that combines CMS with tools for managing multiple websites.
- Silverstripe CMS – PHP MVC framework that serves as a classic or headless CMS.
- Robotics
- Internet of Things
- Electronics – For electronic engineers and hobbyists.
- Bluetooth Beacons
- Electric Guitar Specifications – Checklist for building your own electric guitar.
- Plotters – Computer-controlled drawing machines and other visual art robots.
- Robotic Tooling – Free and open tools for professional robotic development.
- LIDAR – Sensor for measuring distances by illuminating the target with laser light.
Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021
Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.
At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.
Source – Summary – Paper – IBM Blog
100 million protein structures Dataset by DeepMind
Here’s a good article about this topic
Google Dataset Search

Malware traffic dataset
Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.
Originator: ali_alwashali
Percent of “foreign-born” population in each US and EU state or country.
For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺
Author: Here
Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.
Examples of “foreign-born” in this context:
Person born in Spain and living in France is NOT “foreign-born”
Person born in Turkey and living in France is “foreign-born”
Person born in Florida and living in Texas is NOT “foreign-born”
Person born in Mexico and living in Texas is “foreign-born”
Person born in Florida and living in France is “foreign-born”
Person born in France and living in Florida is “foreign-born”
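The examples above boil down to one rule: a resident counts as “foreign-born” when the region they were born in (EU, US, or anywhere else) differs from the region they live in. Here is a minimal sketch of that rule. The function names are hypothetical and the membership sets are deliberately incomplete; they only cover the places named in the examples.

```python
# Illustrative, incomplete membership sets (assumption: only the places
# named in the examples above are listed here).
EU_COUNTRIES = {"Spain", "France"}
US_STATES = {"Florida", "Texas"}


def region_of(place: str) -> str:
    """Return 'EU', 'US', or 'other' for a country or state name."""
    if place in EU_COUNTRIES:
        return "EU"
    if place in US_STATES:
        return "US"
    return "other"


def is_foreign_born(born_in: str, lives_in: str) -> bool:
    """True if the person was born outside the region they live in.

    The map only covers EU and US residents, so any other place of
    residence is rejected.
    """
    home_region = region_of(lives_in)
    if home_region == "other":
        raise ValueError(f"{lives_in!r} is not an EU country or US state")
    return region_of(born_in) != home_region


examples = [
    ("Spain", "France"),
    ("Turkey", "France"),
    ("Florida", "Texas"),
    ("Mexico", "Texas"),
    ("Florida", "France"),
    ("France", "Florida"),
]
for born_in, lives_in in examples:
    print(born_in, "->", lives_in, is_foreign_born(born_in, lives_in))
```

Note that the rule compares whole regions, which is why moving between two EU countries or two US states does not count.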
🇺🇸🇪🇺🗺️
Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 migration data, and Croatia has no data at all.
Tools: MS Office
Source: Here
35% of “entry-level” jobs on LinkedIn require 3+ years of experience
Source: LinkedIn data (see original post)
Tool: Photoshop from my colleague
Latest complete Netflix movie dataset
Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)
Explore this dataset using FlixGem.com (this dataset is powering this webapp)
Common Crawl
A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.
AWS CLI Access (No AWS account required)
aws s3 ls s3://commoncrawl/ --no-sign-request
s3://commoncrawl/crawl-data/
Dataset on protein prices
Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.
CPOST dataset on suicide attacks over four decades
The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.
Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019
You can do a lot of aggregated analysis in a pretty straightforward way there.
Drone imagery with annotations for small object detection and tracking dataset
11 TB dataset of drone imagery with annotations for small object detection and tracking
Download and more information are available here
Dataset License: CDLA-Sharing-1.0
Helper scripts for accessing the dataset: DATASET.md
Dataset Exploration: Colab
NOAA High-Resolution Rapid Refresh (HRRR) Model
The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.
Registry of Open Data on AWS
This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.
See all usage examples for datasets listed in this registry.
See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.
Textbook Question Answering (TQA)
1,076 textbook lessons, 26,260 questions, 6229 images
Documentation: allenai.org/data/tqa
Harmonized Cancer Datasets: Genomic Data Commons Data Portal
The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.
AWS CLI Access (No AWS account required)
aws s3 ls s3://tcga-2-open/ --no-sign-request
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.
Genome Aggregation Database (gnomAD)
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads
SQuAD (Stanford Question Answering Dataset)
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
PubMed Diabetes Dataset
The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.
Drug-Target Interaction Dataset
This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link
Pharmacogenomics Datasets
PharmGKB data and knowledge are available as downloads. It is often critical to check with the curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; the curators just want to make sure that the data are correctly understood before lots of resources are spent.
Pancreatic Cancer Organoid Profiling
The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request
Africa Soil Information Service (AfSIS) Soil Chemistry
This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://afsis/ --no-sign-request
Dataset for Affective States in E-Environments
DAiSEE is the first multi-label video classification dataset, comprising 9,068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels (very low, low, high, and very high) for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.
NatureServe Explorer Dataset
NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.
The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here
Flight Records in the US
Airline On-Time Performance and Causes of Flight Delays – On_Time Data.
This database contains scheduled and actual departure and arrival times and reasons for delay, as reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).
FlightAware.com has data but you need to pay for a full dataset.
The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
- flights: all flights that departed a given airport in a given year and month
- weather: hourly meteorological data for a given airport in a given year and month
- airports: airport names, FAA codes, and locations
- airlines: translation between two letter carrier (airline) codes and names
- planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here
Worldwide flight data
Download: airports.dat (Airports only, high quality)
Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)
Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.
flightradar24.com has lots of data, including historical data. They might be willing to help you get it in a nice format.
2019 Crime statistics in the USA
Dataset of arrests in the US by race and by state. Download Excel here
- Open Companies
- Places to Post Your Startup
- OKR Methodology – Goal setting & communication best practices.
- Leading and Managing – Leading people and being a manager in a technology company/environment.
- Indie – Independent developer businesses.
- Tools of the Trade – Tools used by companies on Hacker News.
- Clean Tech – Fighting climate change with technology.
- Wardley Maps – Provides high situational awareness to help improve strategic planning and decision making.
- Social Enterprise – Building an organization primarily focused on social impact that is at least partially self-funded.
- Engineering Team Management – How to transition from software development to engineering management.
- Developer-First Products – Products that target developers as the user.
Fastest routes on land (and sometimes, boat) between all 990 pairs of European capitals
Source: Reddit
From the author: I started with data on roads from naturalearth.com, which also includes some ferry lines. I then calculated the fastest routes (assuming a speed of 90 km/h on roads, and 35 km/h on boat) between each pair of 45 European capitals. The animation visualizes these routes, with brighter lines for roads that are more frequently “traveled”.
In reality these are of course not the most traveled roads, since people don’t go from all capitals to all other capitals in equal measure. But I thought it would be fun to visualize all the possible connections.
The model is also very simple, and does not take into account varying speed limits, road conditions, congestion, border checks and so on. It is just for fun!
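For the curious, here is a rough sketch of how such a calculation could work. The author does not say which algorithm was used, so Dijkstra's algorithm here is an assumption, and the three-node network is a made-up toy rather than the Natural Earth data. Only the two speeds (90 km/h on roads, 35 km/h by boat) and the count of 45 capitals come from the description above.

```python
import heapq
from math import comb

# Speeds stated by the author of the visualization.
SPEED_KMH = {"road": 90.0, "boat": 35.0}

# Hypothetical toy network: (node_a, node_b, distance_km, mode).
EDGES = [
    ("A", "B", 180.0, "road"),
    ("B", "C", 90.0, "road"),
    ("A", "C", 70.0, "boat"),
]


def build_graph(edges):
    """Build an undirected adjacency map weighted by travel time in hours."""
    graph = {}
    for a, b, km, mode in edges:
        hours = km / SPEED_KMH[mode]
        graph.setdefault(a, []).append((b, hours))
        graph.setdefault(b, []).append((a, hours))
    return graph


def fastest_time(graph, start, goal):
    """Dijkstra's algorithm: smallest total travel time from start to goal."""
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        elapsed, node = heapq.heappop(queue)
        if node == goal:
            return elapsed
        if elapsed > best.get(node, float("inf")):
            continue  # stale queue entry, a faster path was already found
        for neighbor, hours in graph.get(node, []):
            candidate = elapsed + hours
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return float("inf")  # goal is unreachable


graph = build_graph(EDGES)
pairs = comb(45, 2)  # unordered pairs among 45 capitals
print(fastest_time(graph, "A", "C"), pairs)
```

In the toy network the 70 km ferry (2 hours) beats the 270 km road detour (3 hours), and `comb(45, 2)` confirms where the 990 pairs in the title come from.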
In order to keep the file size manageable, the animation only shows every tenth frame.
Is Russia, Turkey or country X really part of Europe? That of course depends on the definition, but it was more fun to include them than to exclude them! The Vatican is however not included since it would just be the same as the Rome routes. And, unfortunately, Nicosia on Cyprus is not included due to an error on my part. It should be!
Link to final still image in high resolution on my twitter
Pokemon Dataset
- Dataset of all 825 Pokemon (this includes Alolan Forms). It would be preferable if there are at least 100 images of each individual Pokemon.
1) pokedex: This is a Python library slash pile of data containing a whole lot of data scraped from Pokémon games. It’s the primary guts of veekun.
2) This dataset comprises more than 800 Pokémon spanning up to 8 generations.
Using this dataset has been fun for me. I used it to create a mosaic of Pokémon from a reference image. You can find it here and it’s free to use: Couple Mosaic (powered by Pokemons)
Here is the data type information in the file:
- Name: Pokemon Name
- Type: Type of Pokemon, such as Grass, Fire, or Water
- HP: Hit Points
- Attack: Attack Points
- Defense: Defence Points
- Sp. Atk: Special Attack Points
- Sp. Def: Special Defence Points
- Speed: Speed Points
- Total: Total Points
- url: Pokemon web-page
- icon: Pokemon Image
Data File: Pokemon-Data.csv
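If you want to sanity-check the file after downloading it, a sketch like the following may help. It assumes the CSV header uses exactly the column names listed above and that Total is the sum of the six stat columns; neither is confirmed by the dataset description. The two inline rows stand in for the real file, with the url and icon cells left empty.

```python
import csv
import io

STAT_COLUMNS = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]

# Inline stand-in for Pokemon-Data.csv. The header spelling and the
# formatting of the Type cell are assumptions.
SAMPLE = """Name,Type,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Total,url,icon
Bulbasaur,Grass / Poison,45,49,49,65,65,45,318,,
Charmander,Fire,39,52,43,60,50,65,309,,
"""


def load_rows(handle):
    """Read the CSV and convert the numeric columns to int."""
    rows = []
    for row in csv.DictReader(handle):
        for column in STAT_COLUMNS + ["Total"]:
            row[column] = int(row[column])
        rows.append(row)
    return rows


def total_mismatches(rows):
    """Names of rows whose Total differs from the sum of the six stats."""
    return [
        row["Name"]
        for row in rows
        if sum(row[column] for column in STAT_COLUMNS) != row["Total"]
    ]


rows = load_rows(io.StringIO(SAMPLE))
print(len(rows), total_mismatches(rows))
```

To run it on the real file, pass `open("Pokemon-Data.csv", newline="", encoding="utf-8")` to `load_rows` instead of the `io.StringIO` wrapper.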
30×30 m Worldwide High-Resolution Population and Demographics Data
ETL pipeline for Facebook’s research project to provide detailed large-scale demographics data. It’s broken down in roughly 30×30 m grid cells and provides info on groups by age and gender.
Data Source and API for access
Article about Dataset at Medium
Gridded global datasets for Gross Domestic Product and Human Development Index over 1990–2015
Rasterized GDP dataset – basically a heat map of global economic activity.
Gap-filled multiannual datasets in gridded form for Gross Domestic Product (GDP) and Human Development Index (HDI)
Decrease in worldwide infant mortality from 1950 to 2020
Data Sources: United Nations, CIA World Factbook, IndexMundi.
Countries of the world sorted by those that have warmed the most in the last 10 years, showing temperatures from 1890 to 2020
Data source: Gistemp temperature data
The GISS Surface Temperature Analysis ver. 4 (GISTEMP v4) is an estimate of global surface temperature change. Graphs and tables are updated around the middle of every month using current data files from NOAA GHCN v4 (meteorological stations) and ERSST v5 (ocean areas), combined as described in our publications Hansen et al. (2010) and Lenssen et al. (2019).
Climate change concern vs personal spend to reduce climate change
![r/dataisbeautiful - [OC] Climate change concern vs personal spend to reduce climate change](https://preview.redd.it/s78w2y8whq171.png?width=960&crop=smart&auto=webp&s=aa07421a85869867a8ff3c5303cc4aefee19c2b9)
Data Source: Competitive Enterprise Institute (PDF)
Less than 20 firms produce over a third of all carbon emissions
The Illusion of Choice in Consumer Brands
Buying a chocolate bar? There are seemingly hundreds to choose from, but it’s just the illusion of choice. They pretty much all come from Mars, Nestlé, or Mondelēz (which owns Cadbury).
Source: Visual Capitalist
Yearly Software Sales on PlayStation Consoles since 1994
Some context for these numbers:
- PS4 holds the record for being the console to have sold the most games in video game history (> 1.622B units)
- Previous record holder was PS2 at 1.537B games sold
- PS4 holds the record for having sold the most games in a single year (> 300M units in FY20)
- FY20 marks the biggest yearly software sales in PlayStation ecosystem with more than 338M units
- Since the PS5 release, Sony has combined PS4/PS5 software sales
- In FY12, Sony combined PS2/PS3 and PSP/VITA software sales
- Sony stopped disclosing software sales in FY13/14
- Source: Sony’s financial results
Yearly Hardware Sales of PlayStation Consoles since 1994
Sony combined PS2/PS3 hardware sales in FY12 and combined PSP/VITA sales in FY12/13/14
- Source: Sony’s financial results
Cybertruck vs F150 Lightning pre-orders, by time since debut
Source: Ford exec tweeting about preorder numbers this week
Top 100 Most Populous City Proper in the world
In the chart, the city with 32 million is Chongqing; the truncated labels “Shan”, “Beijin”, and “Guangzho” stand for Shanghai, Beijing, and Guangzhou.
Tax data for different countries
What do Europeans feel most attached to – their region, their country, or Europe?
Data source: Builds on data from the 2021 European Quality of Government Index. You can read more about the survey and download the data here
Cost of 1 GB of mobile data in every country
Dataset: Visual Capitalist
Frequency of all digrams in 18 languages, diacritics included
Dataset (according to author): Dictionaries are scattered on the internet and had to be borrowed from several sources: the Scrabble3d project, and Linux spellcheck dictionaries. The data can be found in the folder “Avec_diacritiques”.
Criteria for choosing a dictionary:
– No proper nouns
– “Official” source if available
– Inclusion of inflected forms
– Between two lists, the larger one was preferred
– No or very rare abbreviations if possible, although these are hard to detect in unknown languages and across hundreds of thousands of words.
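As an illustration of what counting digrams from such dictionaries could look like, here is a small sketch. It is not the author's code: the function names are hypothetical, digrams are counted only within words, and the three-word list is a toy stand-in for a real dictionary. Words are normalized to NFC first so that an accented letter counts as one character, which is one way to keep diacritics intact.

```python
import unicodedata
from collections import Counter


def digram_counts(words):
    """Count adjacent letter pairs within each word, keeping diacritics."""
    counts = Counter()
    for word in words:
        # NFC makes "e" + combining accent a single character, so the
        # accented letter is not split across two digrams.
        text = unicodedata.normalize("NFC", word).lower()
        counts.update(text[i:i + 2] for i in range(len(text) - 1))
    return counts


def digram_frequencies(words):
    """Relative frequency of each digram over all digrams in the list."""
    counts = digram_counts(words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {digram: n / total for digram, n in counts.items()}


# Toy word list standing in for a full dictionary file.
words = ["\u00e9t\u00e9", "th\u00e9", "the"]
for digram, share in sorted(digram_frequencies(words).items()):
    print(digram, round(share, 3))
```

With a real dictionary you would read one word per line from the file and pass the resulting list to `digram_frequencies`.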
Mapped: The World’s Nuclear Reactor Landscape
Dataset: Visual Capitalist
Database of 999 chemicals based on liver-specific carcinogenicity
The author found this dataset in a more accessible format upon searching for the keyword “CPDB” (Carcinogenic Potency Database) in the National Library of Medicine Catalog. Check out this parent website for the data source and dataset description. The dataset referenced in OP’s post concerns liver-specific carcinogens, which are marked by the “liv” keyword as described in the dataset description’s Tissue Codes section.
SMS Spam Collection Data Set
Download: Data Folder, Data Set Description
The SMS Spam Collection is a public set of SMS labeled messages that have been collected for mobile phone spam research
Open Datasets for Autonomous Driving
A2D2 Dataset – ApolloScape Dataset – Argoverse Dataset – Berkeley DeepDrive Dataset
CityScapes Dataset – Comma2k19 Dataset – Google-Landmarks Dataset
KITTI Vision Benchmark Suite – LeddarTech PixSet Dataset – Level 5 Open Data – nuScenes Dataset
Oxford Radar RobotCar Dataset – PandaSet – Udacity Self Driving Car Dataset – Waymo Open Dataset
Open Dataset people are looking for [Help if you can]
- Looking for Dataset on the outcomes of abstinence-only sex education.
- Funny Datasets for School Data Science Project [1, 2, 3, 4, 5]
- Need a dataset for English practicing chatbot. [1, 2 ]
- Creating a dataset for plant disease recognition [1, 2 ]
- Central Bank Speeches Dataset (Text data from 1997 to 2020 from 118 institutions) [1, 2]
- Cat Meow Classification dataset [1, 2]
- Looking for Raw Data: Camping / Outdoors Travel in United States trends, etc [1, 2 ]
- Looking for Data set of horse race results / lottery results any results related to gambling [1, 2, 3]
- Looking for Football (Soccer) Penalties Dataset [1, 2]
- Looking for public datasets on baseball [1, 2, 3]
- Looking for Datasets on edge computing for AI bandwidth usage, latency, memory, CPU/GPU resource usage? [1 ,2 ]
- Data set of people who died by suicide [1, 2 ]
- Supreme Court dataset with opinion text? [1, 2, 3, 4, 5: https://storage.googleapis.com/scotus-db/scotus-db.db]
- Dataset of employee attrition or turnover rate? [1, 2]
- Is there a Dataset for homophobic tweets? [1, 2, 3, 4]
- Looking for a Machine condition Monitoring Dataset [1,2]
- Where to find data for credit risk analysis? [1, 2]
- Datasets on homicides anywhere in the world [1, 2]
- Looking for a dataset containing coronavirus self-test (if this is a thing globally) pictures for ML use
- Is there any transportation dataset with daily frequency? [1, 2]
- A Dataset of film Locations [1, 2 ]
- Looking for a classification dataset [1, 2, 3, 4, 5]
- Where can I search for macroeconomics data? [1, 2, 3, 4, 5, 6, 7]
- Looking for Beam alignment 5G vehicular networks dataset
- Looking for tidy dataset for multivariate analysis (PCA, FA, canonical correlations, clustering)
- Indian all types of Fuel location datasets [1, 2,]
- Spotify Playlists Dataset [1, 2]
- World News Headline Dataset. With Sentiment Scores. Free download in JSON format. Updated often. [1, 2]
- Are there any free open source recipe datasets for commercial use [1, 2, 3, 4, 5]
- Curated social network datasets with summary statistics and background info
- Looking for textile crop disease datasets such as jute, flax, hemp
- Shopify App Store and Chrome Webstore Datasets
- Looking for dataset for university chatbot
- Collecting real life (dirty/ugly) datasets for data analysis
- In Need of Food Additive/Ingredient Definition Database
- Recent smart phone sensor Dataset – Android
- Cracked Mobile Screen Image Dataset for Detection
- Looking for Chiller fault data in a chiller plant
- Looking for dataset that contains the genetic sequences of native plasmids?
- Looking for a dataset containing fetus size measurements at various gestational ages.
- Looking for datasets about mental health since 2021
- Do you know where to find a dataset with Graphical User Interfaces defects of web applications? [1, 2, 3 ]
- Looking for most popular accounts on social medias like Twitter, Tik Tok, instagram, [1, 2, 3]
- GPS dataset of grocery stores
- What is the easiest way to bulk download all of the data from this epidemiology website? (~20,000 files)
- Looking for Dataset on Percentage of death by US state and Canadian province grouped by cause of death?
- Looking for Social engineering attack dataset in social media
- Steam Store Games (Clean dataset) by Nik Davis
- Dataset that lists all US major hospitals by county
- Another Data that list all US major hospitals by county
- Looking for open source data relating privacy behavior or related marketing sets about the trustworthiness of responders?
- Looking for a dataset that tracks median household income by country and year
- Dataset on the number of specific surgical procedures performed in the US (yearly)
- Looking for a dataset from reddit or twitter on top posts or tweets related to crypto currency
- Looking for Image and flora Dataset of All Known Plants, Trees and Shrubs
- US total fertility rates data one the state level
- Dataset of Net Worth of *World* Politicians
- Looking for water wells and borehole datasets
- Looking for Crop growth conditions dataset
- Dataset for translate machine JA-EG
- Looking for Electronic Health Record (EHR) record prices
- Looking for tax data for different countries
- Musicians Birthday Datasets and Associated groups
- Searching for dataset related to car dealerships [1]
- Looking for Credit Score Approval dataset
- Cyberbullying Dataset by demographics
- Datasets on financial trends for minors
- Data where I can find out about reading habits? [1, 2]
- Data sets for global technology adoption rates
- Looking for any and all cat / feline cancer datasets, for both detection and treatment
- ITSM dictionary/taxonomy datasets for topic modeling purposes
- Multistage Reliability Dataset
- Looking for dataset of ingredients for food[1]
- Looking for datasets with responses to psychological questionnaires[1,2,3]
- Data source for OEM automotive parts
- Looking for dataset about gene regulation
- Customer Segmentation Datasets (For LTV Models)
- Automobile dataset, years of ownership and repairs
- Historic Housing Prices Dataset for Individual Houses
- Looking for the data for all the tokens on the Uniswap graph
- Job applications emails datasets, either rejection, applications or interviews
- E-learning datasets for impact on e learning on school/university students
- Food delivery dataset (Uber Eats, Just Eat, …)
- Data Sets for NFL Quarterbacks since 1995
- Medicare Beneficiary Population Data
- Covid 19 infected Cancer Patients datasets
- Looking for EV charging behavior dataset
- State park budget or expansionary spending dataset
- Autonomous car driving deaths dataset
- FMCG Spending habits over the pandemic
- Looking for a Question Type Classification dataset
- 20 years of Manufacturer/Retail price of Men’s footwear
- Dataset of Global Technology Adoption Rates
- Looking For Real Meeting Transcripts Dataset
- Dataset For A Large Archive Of Lyrics [1,2,3]
- Audio dataset with swearing words
- A global, georeferenced event dataset on electoral violence with lethal outcomes from 1989 to 2017. [1,]
- Looking for Jaundice Dataset for ML model
- Looking for social engineering attack detection dataset?
- Wound image datasets to train ML model [1]
- Seeking for resume and job post dataset
- Labelled dataset (sets of images or videos) of human emotions [1,2]
- Dataset of specialized phone call transcripts
- Looking for Emergency Response Plan Dataset for family Homes, condo buildings and Companies
- Looking for Birthday wishes datasets
- Desperately in need of national data for real estate [1,2,]
- NFL playoffs games stadium attendance dataset
- Datasets with original publication dates of novels [1,2]
- Annotated Documents with Images Data Dump
- Looking for dataset for “Face Presentation Attack Detection”
- Electric vehicle range & performance dataset [1, 2]
- Dataset or API with valid postal codes for US, Mexico, and Canada with country, state/province, and city/town [1, 2, 3, 4, 5, 6]
- Looking for Data sources regarding Online courses dropout rate, preferably by countries [1,2 ]
- Are there dataset for language learning [1, 2]
- Corporate Real Estate Data [1,2, 3]
- Looking for simple clinical trials datasets [1, 2]
- CO2 Emissions By Aircraft (or Aircraft Type) – Climate Analysis Dataset [1,2, 3, 4]
- Player Session/playtime dataset from games [1,2]
- Data sets that support Data Science (Technology, AI etc) being beneficial to sustainability [1,2]
- Datasets of a grocery store [1,2]
- Looking for mri breast cancer annotation datasets [1,2]
- Looking for free exportable data sets of companies by industry [1,2]
- Datasets on Coffee Production/Consumption [1,2]
- Video gaming industry datasets – release year, genre, games, titles, global data [1,2]
- Looking for mobile speaker recognition dataset [1,2]
- Public DMV vehicle registration data [1,2]
- Looking for historical news articles based on industry sector [1,2]
- Looking for Historical state wide Divorce dataset [1,2]
- Public Big Datasets – with In-Database Analytics [1,2]
- Dataset for detecting Apple products (object detection) [1,2]
- Help needed to get the American Hospital Association (AHA) datasets (AHA Annual Survey, AHA Financial Database, and AHA IT Survey datasets) [1, 2]
- Looking for help Getting College Football Betting Data [1,2]
- 2012-2020 US presidential election results by state/city dataset [1,2, 3]
- Looking for datasets of models and images captured using iphone’s LIDAR? [1,2]
- Finding Datasets to Age Texts (Newspapers, Books, Anything works) [1, 2, 3]
- Looking for cost of living index of some type for US [1,2]
- Looking for dataset that recorded historical NFT prices and their price increases, as well as timestamps. [1,2]
- Looking for datasets on park boundaries across the country [1, 2, 3]
- Looking for medical multimodal datasets [1, 2, 3]
- Looking for Scraped Parler Data [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
- Looking for Silicon Wafer Demand dataset [1, 2]
- Looking for a dataset with the values [Gender – Weight – Height – Health] [1, 2]
- Exam questions (mcqs and short answer) datasets? [1, 2]
- Canada Botanical Plants API/Database [1, 2, 3]
- Looking for a geospatial dataset of birds Migration path [1, 2, 3]
- WhatsApp messages dataset/archives [1, 2]
- Dataset of GOOD probiotic microorganisms in the HUMAN gut [1, 2]
- Twitter competition to reduce bias in its image cropping [1,2]
- Dataset: US overseas military deployments, 1950–2020 [1,2]
- Dataset on human clicking on desktop [1,2]
- Covid-19 Cough Audio Classification Dataset [1, 2]
- 12,000+ known superconductors database [1, 2, 3]
- Looking for good dataset related to cyber security for prediction [1, 2]
- Where can I find face datasets to classify whether it is a real person or a picture of that person. For authentication purposes? [1,2]
- DataSet of Tokyo 2020 (2021) Olympics ( details about the Athletes, the countries they representing, details about events, coaches, genders participating in each event, etc.) [1, 2]
- What is your workflow for budget compute on datasets larger than 100GB? [1, 2, 3]
- Looking for a dataset that contains information about cryptocurrencies. [1, 2]
- Looking for a depression dataset [1, 2, 3]
- Looking for chocolate consumer demographic data [1,2, 3]
- Looking for thorough dataset of housing price/tax history [1, 2, 3]
- Wallstreetbets data scraping from 01/01/2020 to 01/06/2021 [1, 2]
- Retinal Disease Classification Dataset [1, 2]
- 400,000 years of CO2 and global temperature data [1, 2, 3]
- Looking for datasets on neurodegenerative diseases [1, 2, 3]
- Dataset for Job Interviews (either Phone, Online, or Physical) [1,2 ,3]
- Firm Cyber Breach Dataset with Firm Identifiers [1, 2, 3]
- Wondering how Stock market and Crypto website get the Data from [1, 2, 3, 4, 5]
- Looking for a dataset with US tourist injuries, attacks, and/or fatalities when traveling abroad [1, 2, 3]
- Looking for Wildfires Database for all countries by year and month? The quantity of wildfires happening, the acreage, things like that, etc. [1, 2, 3]
- Looking for a pill vs fake pill image dataset [1, 2, 3, 4, 5, 6, 7]
Cars for sale in Germany from 2011 to 2021
Dataset obtained scraping AutoScout. In the file, you will find features describing 46405 vehicles: mileage, make, model, fuel, gear, offer type, price, horse power, registration year.
Dataset scraped from AutoScout24 with information about new and used cars.
Percentage of female students in higher education by subject area
The data was obtained from the UK government website here, so unfortunately some details of the data and methodology are unknown.
All the passes: A visualization of ~1 million passes from 890 matches played in major football/soccer leagues/cups
- Champions League 1999
- FA Women’s Super League 2018
- FIFA World Cup 2018, La Liga 2004 – 2020
- NWSL 2018
- Premier League 2003 – 2004
- Women’s World Cup 2019
Data Source: StatsBomb
Global “Urbanity” Dataset (using population mosaics, nighttime lights, & road networks)
In this project, the authors have designed a spatial model which is able to classify urbanity levels globally and with high granularity. As the target geographic support for our model we selected the quadkey grid in level 15, which has cells of approximately 1x1km at the equator.
Dataset: Here
Percentage of students with disabilities in higher education by subject area
The author obtained the data from the UK Government website, so unfortunately the methodology and the way the data was collected are unknown.
The comparison to the general public is a great idea – according to the Government site, 6% of children, 16% of working-age adults and 45% of Pension-age adults are disabled.
Dataset: here
Arrests for Hate Crimes in NYC by Category, 2017-2020
The Most Successful U.S. Sports Franchises
Data source: sports-reference.com/
Adult cognitive skills (PIAAC literacy and numeracy) by Percentile and by country
According to the author, this animation depicts adult cognitive skills, as measured by the PIAAC study by OECD. Here, the numeracy and literacy skills have been combined into one. Each frame of the animation shows the xth percentile skill level of each individual country. Thus, you can see which countries have the highest and lowest scores among their bottom performers, median performers, and top performers. So for example, you can see that when the bottom 1st percentile of each country is ranked, Japan is at the top, Russia is second, etc. Looking at the 50th percentile (median) of each country, Japan is top, then Finland, etc.
The Programme for the International Assessment of Adult Competencies (PIAAC) is a study by the OECD that measures literacy, numeracy, and “problem-solving in technology-rich environments” skills for people ages 16 and up. For those familiar with the PISA study of school-age children, this is essentially an adult version of it.
Dataset: PIAAC
G7 Corporate Tax rate 1980 – 2020
Dataset: Tax Foundation
Euro 2020 (played in 2021) Group Stage Predictions Based on a Bayesian Linear Item Response Model
Data Source: UEFA qualifying match data
The model was built in Stan and was inspired by Andrew Gelman’s World Cup model shown here. These plots show posterior probabilities that the team on the y axis will score more goals than the team on the x axis. There is some redundancy of information here (because if I know P(England beats Scotland) then I know P(Scotland beats England)).
Data
Source: Italian National Institute of Statistics (Istituto Nazionale di Statistica)
The 15 most shared musicians on Reddit
Data source: The authors made a dataset of YouTube and Spotify shares on Reddit. More info available here
Annual Stream for the top artist on Spotify (billions)
Music Streaming market share 2021-2022
Spam vs. Legitimate Email, Average Global Emails per Day
Data Source: Here. The author computed the average per day over the June 3 – June 9, 2021 period.

Falling Fertility, 1800–2016
Data source: Here (go to the “Babies per woman,” “Income,” and “Population” links on that page).
Europe Covid-19 waves
Data Source: Here
Who is going to win EURO 2020? Predicted probabilities pooled together from 18 sources
Data source: Here
Population Density of Canada 2020
DataSet: Gathered from worldpop.org/project
The length of each spike corresponds to population density: the longer the spike, the greater the density.
The portion of a country’s population that is fully vaccinated for COVID (as of June 2021) scales with GDP per capita.
Dataset of Chemical reaction equations
4- Chemistry datasets
Maths datasets
Equation Learning
Datasets for Stata Structural Equation Modeling
SQL Queries Dataset
SEDE (Stack Exchange Data Explorer) is a dataset comprised of 12,023 complex and diverse SQL queries and their natural language titles and descriptions, written by real users of the Stack Exchange Data Explorer out of a natural interaction. These pairs contain a variety of real-world challenges which were rarely reflected so far in any other semantic parsing dataset. Access it here
Countries of the world, ranked by population, with the 100 largest cities in the world marked
Each map size is proportional to population, so China takes up about 18-19% of the map space.
Countries with very far-flung territories, such as France (or the USA) will have their maps shrunk to fit all territories. So it is the size of the map rectangle that is proportional to population, not the colored area. Made in R, using data from naturalearthdata.com. Maps drawn with the tmap package, and placed in the image with the gridExtra package. Map colors from the wesanderson package.
Data source: The Economist
What businesses in different countries search for when they look for a marketing agency – “creative” or “SEO”?
Data source: Google Trends
More maps, charts and written analysis on this topic here
Is the economic gap between new and old EU countries closing?
Data source: Eurostat
Interactive version so you can click on those circles here
Reddit r/wallstreetbets posts and comments in real-time
- Beneath adds some useful features for shared data, like the ability to run SQL queries, sync changes in real-time, a Python integration, and monitoring. The monitoring is really useful as it lets you check out the write activity of the scraper (no surprise, WSB is most active when markets are open).
- The scraper (which uses Async PRAW) is open source here
Global NO2 pollution data visualization June 2021
Data Source: SILAM
Shopify App Store Report: 2021
Data source: Marketplace Apps
The Chrome Webstore Report: 2021
Data source: Marketplace Apps
Percentage of Adults with HIV/AIDS in Africa
Dataset: All the countries through the UN AIDS organization
Recorded CDC deaths (2014 – June 16, 2021) from Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified (R00-R99)
Data Source: combined CDC weekly death counts 2014 – 2019 and CDC weekly death counts 2020-2021
What are the long term gains on cryptocurrencies?
Data Sources: investing.com and coingecko.com
The chart shows the average daily gain in $ if $100 were invested at a date on x-axis. Total gain was divided by the number of days between the day of investing and June 13, 2021. Gains were calculated on average 30-day prices.
Time range: from March 28, 2013, till June 13, 2021
Life Expectancy and Death Probability by Age and Gender
Data source: Here
Daily Coronavirus cases in Canada vs % of Population Vaccinated
Google Playstore Apps with 2.3million app data on Kaggle
Google Playstore dataset is now available with double the data (2.3 Million) android application data and a new attribute stating the scraped date time in Kaggle.
Dataset: Get it here or here
African languages dataset
We have 3000 tribes or more in Africa and in that 3000 we have sub tribes.
1– Introduction to African Languages (Harvard)
2- Languages of the world at Ethnologue
3- Britannica: Nilo-Saharan Languages
4- Britannica: Khoisan Languages
Daily Temperature of Major Cities Dataset
Daily average temperature values recorded in major cities of the world.
The dataset is available as separate txt files for each city here. The data is available for research and non-commercial purposes only
Do stricter gun laws reduce firearms homicides?
Data Sources: Guns to Carry, EFSGV, CDC
According to the author: Looking at non-suicide firearms deaths by state (2019), and then grouping by the Guns to Carry rating (1-5 stars), it seems that stricter gun laws are correlated with fewer firearms homicides. Guns to Carry rates states based on “Gun friendliness” with 1 star being least friendly (California, for example), and 5 stars being most friendly (Wyoming, for example). The ratings aren’t perfect but they include considerations like: Permit required, Registration, Open carry, and Background checks to come up with a rating.
The numbers at the bottom are the average non-suicide deaths calculated within the rating group. Each bar shows the number for the individual state.
Interesting that DC is through the roof despite having strict laws. On the flip side, Maine is very friendly towards gun owners and has a very low homicide rate, despite having the highest ratio of suicides to homicides.
Obviously, lots of things to consider and this is merely a correlation at a basic level. This is a topic that interested me so I figured I’d share my findings. Not attempting to make a policy statement or anything.
In 1996 the Australian Government implemented stricter gun control and restrictions. The numbers don’t lie, and they prove it worked.
Every mass shooting in the US visualised from 2014-2022

Relative frequency of words in economics textbooks vs their frequency in mainstream English (the Google Books corpus)
Data Source: Data for word frequency in the Google corpus is from the 2019 Ngram dataset. For details about how to work with this data, see Working With Google Ngrams: A Data-Wrangling Tale.
Data for word frequency in econ textbooks was compiled by myself by scraping words from 43 undergraduate economics textbooks. For details see Deconstructing Econospeak.
Hours per day spent on mobile devices by US adults
Author: nava_7777
Data Source: eMarketer, as quoted by Jon Erlichman
Purpose according to the author: raw textual numbers (like in the original tweet) are hard to compare, particularly the acceleration or deceleration of a trend. The author made the chart for personal use, but it may be useful to somebody.
Environmental Impact of Coffee Brewing Methods
Author: Coffee_Medley
More according to the author:
Measurements and calculations of NG and Electricity used to heat four cups of distilled water by Coffee Medley (6/14/2021)
Average coffee bag and pod weight by Coffee Medley (6/14/2021)
Murders in major U.S. Cities: 2019 vs. 2020
Author: datacanbeuseful
Data source: NPR
New Harvard Data (Accidentally) Reveal How Lockdowns Crushed the Working Class While Leaving Elites Unscathed
Data source: Harvard
Support for same-sex marriage by religious group
Data source: PEW
More: Summary of religiously (un)affiliated people’s views on homosexuality, broken down into 18 countries
Daily chance of dying for Americans
Author: NortherSugarLoaf
Data source: SSA Actuarial Data,
Processing: Yearly probability of death is converted to the daily probability and expressed in micromorts. Plotted versus age in years.
According to the author,
A few things to notice: It’s dangerous to be a newborn. The same mortality rates are reached again only in the fifties. However, mortality drops after birth very quickly and the safest age is about ten years old. After experiencing mortality jump in puberty – especially high for boys, mortality increases mostly exponentially with age. Every thirty years of life increase chances of dying about ten times. At 80, chance of dying in a year is about 5.8% for males and 4.3% for females. This mortality difference holds for all ages. The largest disparity is at about twenty three years old when males die at a rate about 2.7 times higher than females.
This data is from before COVID.
Here is the same graph but in linear Y axis scale
Here is the male to female mortality ratio
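The yearly-to-daily conversion described under "Processing" can be sketched as follows. The source does not state the exact formula the author used, so this is an assumption: it treats every day of the year as carrying the same risk, and the function name `yearly_to_daily_micromorts` is hypothetical.

```python
def yearly_to_daily_micromorts(p_year, days_per_year=365):
    """Convert a yearly probability of death into daily micromorts.

    Assumes a constant daily risk, so that surviving the year means
    surviving each day: (1 - p_day) ** days_per_year == 1 - p_year.
    One micromort is a one-in-a-million chance of dying.
    """
    p_day = 1 - (1 - p_year) ** (1 / days_per_year)
    return p_day * 1_000_000


# 5.8% is the yearly figure quoted above for 80-year-old males.
micromorts_male_80 = yearly_to_daily_micromorts(0.058)
print(f"{micromorts_male_80:.0f} micromorts per day")
```

For small probabilities this is close to simply dividing the yearly probability by 365; the compounding form only matters at older ages.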
Mapping Global Carbon Emission Intensity (Dec 2020)
![r/dataisbeautiful - [OC] Mapping Global Carbon Emission Intensity (Dec 2020)](https://preview.redd.it/aoh1zkvfmm671.png?width=960&crop=smart&auto=webp&s=c2d6a5b7ac2af3deda9e5a65a191ed83f78dab06)
Data Source: Copernicus Atmosphere Monitoring Service (CAMS)
Religions with the most Adherents from 1945 – 2010

IPO Returns 2000-2020

From the author u/nobjos: The full article on the above analysis can be found here
The author runs the subreddit r/market_sentiment, which publishes a comprehensive deep-dive on one investment strategy/topic every week. Some of the author’s popular articles are:
a. Performance of Jim Cramer’s stock picks
b. Performance of buy and sell recommendations made by financial analysts in the last decade
c. Benchmarking performance of Motley Fool against S&P 500
Funko IPO is considered to have the worst first-day return for an IPO in the last two decades.
Out of the top 10 list, only 3 Investment banks had below-average returns.
On average, IPOs did make money for the investor. But the amount is significantly different if you got allocated the IPO at offer price vs you get the IPO at market price.
Baidu.com made a whopping 354% on its listing day. Another interesting observation is that 6 out of 10 companies in the list were listed in 2000 (just before the dot-com crash).
Largest publicly-traded airlines
Total number of streams per artist vs. number of Top 200 hits (Spotify Top 200 since 2017)
Author: blairfix
Data is from the Spotify Top 200 and covers the period from Jan. 1, 2017 to Jun. 9, 2021. You can download my dataset here.
For every artist that appears in the Top 200, I add up their total streams (for all songs) and the total number of songs in the dataset.
For a commentary on the data, see The Half Life of a Spotify Hit.
Number of Miss Americas by U.S. State
Data Source: Wikipedia
The World’s Nuclear Warheads
Author: academiadvice
Data Source: Federation of American Scientists – status-world-nuclear-forces/
Tools: Excel, Datawrapper, coolors.co/
Check out the FAS site for notes and caveats about their estimates. Governments don’t just print this stuff on their websites. These are evidence-based estimates of tightly-guarded national secrets.
Of particular note – Here’s what the FAS says about North Korea: “After six nuclear tests, including two of 10-20 kilotons and one of more than 150 kilotons, we estimate that North Korea might have produced sufficient fissile material for roughly 40-50 warheads. The number of assembled warheads is unknown, but lower. While we estimate North Korea might have a small number of assembled warheads for medium-range missiles, we have not yet seen evidence that it has developed a functioning warhead that can be delivered at ICBM range.”
The population of Las Vegas over time
Data Source: Wikipedia
The Alpha to Omega of Wikipedia
Author: feldesque
Data Source: The wikipediatrend package in R
Glacial Inter-glacial cycles over the past 450000 years
Source: geology.utah.gov/
Global temperature change from 1850-2020
Top Companies Contributing to Open Source – 2011/2021
Source and links
The author used several sources for this video and article. The first, for the video, is GitHub Archive & CodersRank. For the analysis of the OSCI index data, the author used opensourceindex.io
Crime Rates in the US: 1960-2021
Data source: Here
The 2021 figures are projections and must be taken with a grain of salt. However, the assumption of continuous rise of murder rate is not a bad one based on recent news reports, such as: here
In a property crime, a victim’s property is stolen or destroyed, without the use or threat of force against the victim. Property crimes include burglary and theft as well as vandalism and arson.
A network visualization of privacy research (83k nodes, 462k edges)
Author: FvDijk
This image was generated for my research mapping the privacy research field. The visual is a combination of network visualisation and manual adding of the labels.
The data was gathered from Scopus, a high-quality academic publication database, and the visualisation was created with Gephi. The initial dataset held ~120k publications and over 3 million references, from which we selected only the papers and references in the field.
The labels were assigned by manually identifying clusters and two independent raters assigning names from a random sample of publications, with a 94% match between raters.
The scripts used are available on Github:
The full paper can be found on the author’s website:
GDP (at purchasing power parity) per capita in international dollars
Author: Simaniac
Data source: IMF
Phone Call Anxiety dataset for Millennials and Gen Z
Author: /u/CynicalScyntist
This is a randomized experiment the author conducted with 450 people on Amazon MTurk. Each person was randomly assigned to one of three writing activities in which they either (a) described their phone, (b) described what they’d do if they received a call from someone they know, or (c) described what they’d do if they received a call from an unknown number. Pictures of an iPhone with a corresponding call screen were displayed above the text box (blank, “Incoming Call,” or “Unknown”). Participants then rated their anxiety on a 1-4 scale.
Dataset: Here
Hate Crime Statistics in New York State 2019-2021

Top 100 Data Science and Data Analytics and Data Engineering Interview Questions and Answers


Below are the Top 100 Data Science and Data Analytics Interview Questions and Answers.
What is Data Science?
Data Science is a blend of various tools, algorithms, and machine learning principles with the goal to discover hidden patterns from the raw data. How is this different from what statisticians have been doing for years? The answer lies in the difference between explaining and predicting: statisticians work a posteriori, explaining the results and designing a plan; data scientists use historical data to make predictions.

How does data cleaning play a vital role in the analysis?
Data cleaning plays a vital role in analysis because:
- Cleaning data from multiple sources transforms it into a format that data analysts or data scientists can work with.
- Data cleaning helps increase the accuracy of machine learning models.
- It is a cumbersome process: as the number of data sources grows, the time needed to clean the data rises steeply with both the number of sources and the volume of data they generate.
- Cleaning alone can take up to 80% of the total time, making it a critical part of the analysis task.
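As a rough illustration of the first point, here is a minimal sketch of merging rows from two sources into one consistent format. The field names, the sample rows, and the `clean_records` helper are all hypothetical; real pipelines usually need many more rules than this.

```python
def clean_records(records):
    """Normalize names, parse ages, and drop incomplete or duplicate rows."""
    seen = set()
    cleaned = []
    for rec in records:
        name = (rec.get("name") or "").strip().title()
        raw_age = str(rec.get("age", "")).strip()
        if not name or not raw_age.isdigit():
            continue  # drop rows with a missing name or a non-numeric age
        key = (name, int(raw_age))
        if key in seen:
            continue  # drop duplicates that show up in several sources
        seen.add(key)
        cleaned.append({"name": name, "age": int(raw_age)})
    return cleaned


# Hypothetical rows from two sources with inconsistent formatting.
source_a = [{"name": " alice ", "age": "34"}, {"name": "BOB", "age": "n/a"}]
source_b = [{"name": "Alice", "age": 34}, {"name": "carol", "age": " 29"}]

cleaned = clean_records(source_a + source_b)
print(cleaned)
```

Only two rows survive: the two spellings of Alice collapse into one record, and Bob is dropped because his age cannot be parsed.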
What is linear regression? What do the terms p-value, coefficient, and r-squared value mean? What is the significance of each of these components?

Imagine you want to predict the price of a house. That will depend on some factors, called independent variables, such as location, size, and year of construction. If we assume there is a linear relationship between these variables and the price (our dependent variable), then in the simplest case, with a single independent variable X, the price Y is predicted by the following function: Y = a + bX, where a is the intercept and b is the coefficient of X.
The p-value in the table is the minimum α (the significance level) at which the coefficient is relevant. The lower the p-value, the stronger the evidence that the variable matters in predicting the price. Usually we set a 5% level, so that we have 95% confidence that our variable is relevant.
The p-value is used as an alternative to rejection points to provide the smallest level of significance at which the null hypothesis would be rejected. A smaller p-value means that there is stronger evidence in favor of the alternative hypothesis.
The coefficient value signifies how much the mean of the dependent variable changes given a one-unit shift in the independent variable while holding other variables in the model constant. This property of holding the other variables constant is crucial because it allows you to assess the effect of each variable in isolation from the others.
R-squared (R²) is a statistical measure that represents the proportion of the variance for a dependent variable that’s explained by an independent variable or variables in a regression model.
Credit: Steve Nouri
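To make the coefficient and R-squared concrete, here is a minimal sketch that fits Y = a + bX by ordinary least squares on a handful of made-up house sizes and prices. The data and the `fit_line` helper are hypothetical. The p-value is left out because computing it needs the t-distribution, which in practice would come from a statistics library.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                # coefficient: change in y per unit of x
    a = mean_y - b * mean_x      # intercept
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot  # share of variance explained
    return a, b, r_squared


# Hypothetical data: size in square metres, price in thousands.
sizes = [50, 70, 90, 110, 130]
prices = [150, 200, 260, 300, 350]

a, b, r_squared = fit_line(sizes, prices)
print(f"price = {a:.1f} + {b:.2f} * size, R^2 = {r_squared:.4f}")
```

Here the fitted coefficient is 2.5, meaning each extra square metre adds about 2.5 thousand to the predicted price, and R² is close to 1 because the toy data is almost perfectly linear.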
What is sampling? How many sampling methods do you know?
Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of data points to identify patterns and trends in the larger data set being examined. It enables data scientists, predictive modelers and other data analysts to work with a small, manageable amount of data about a statistical population to build and run analytical models more quickly, while still producing accurate findings.
Sampling can be particularly useful with data sets that are too large to efficiently analyze in full – for example, in big data analytics applications or surveys. Identifying and analyzing a representative sample is more efficient and cost-effective than surveying the entirety of the data or population.
An important consideration, though, is the size of the required data sample and the possibility of introducing a sampling error. In some cases, a small sample can reveal the most important information about a data set. In others, using a larger sample can increase the likelihood of accurately representing the data as a whole, even though the increased size of the sample may impede ease of manipulation and interpretation.
There are many different methods for drawing samples from data; the ideal one depends on the data set and situation. Sampling can be based on probability, an approach that uses random numbers that correspond to points in the data set to ensure that there is no correlation between points chosen for the sample. Further variations in probability sampling include:
• Simple random sampling: Software is used to randomly select subjects from the whole population.
• Stratified sampling: Subsets of the data set or population (strata) are created based on a common factor, and samples are randomly collected from each subgroup. A sample is drawn from each stratum (using a random sampling method like simple random sampling or systematic sampling).
o EX: In the image below, let’s say you need a sample size of 6. Two members from each group (yellow, red, and blue) are selected randomly. Make sure to sample proportionally: in this simple example, 1/3 of each group (2/6 yellow, 2/6 red and 2/6 blue) has been sampled. If you have one group that’s a different size, make sure to adjust your proportions. For example, if you had 9 yellow, 3 red and 3 blue, a 5-item sample would consist of 3/9 yellow (i.e. one third), 1/3 red and 1/3 blue.
• Cluster sampling: The larger data set is divided into subsets (clusters) based on a defined factor, then a random sampling of clusters is analyzed. The sampling unit is the whole cluster: instead of sampling individuals from within each group, a researcher will study whole clusters.
o EX: In the image below, the clusters are natural groupings by head color (yellow, red, blue). A sample size of 6 is needed, so two of the complete clusters are selected randomly (in this example, groups 2 and 4 are chosen).

– Cluster Sampling
• Multistage sampling: A more complicated form of cluster sampling, this method also involves dividing the larger population into a number of clusters. Second-stage clusters are then broken out based on a secondary factor, and those clusters are then sampled and analyzed. This staging could continue as multiple subsets are identified, clustered and analyzed.
• Systematic sampling: A sample is created by setting an interval at which to extract data from the larger population – for example, selecting every 10th row in a spreadsheet of 200 items to create a sample size of 20 rows to analyze.
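The probability sampling methods listed above can be sketched with Python's standard `random` module. The helper names and toy populations are hypothetical; the group sizes (9 yellow, 3 red, 3 blue) and the 200-row spreadsheet mirror the examples in the text.

```python
import random


def simple_random_sample(population, n, rng):
    """Pick n items uniformly at random, without replacement."""
    return rng.sample(population, n)


def stratified_sample(population, stratum_of, fraction, rng):
    """Sample the same fraction at random from every stratum."""
    strata = {}
    for item in population:
        strata.setdefault(stratum_of(item), []).append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, round(len(members) * fraction)))
    return sample


def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep every member of each."""
    chosen = rng.sample(clusters, n_clusters)
    return [item for cluster in chosen for item in cluster]


def systematic_sample(population, interval, start=0):
    """Take every interval-th item, beginning at index start."""
    return population[start::interval]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

rows = list(range(200))
random_rows = simple_random_sample(rows, 20, rng)
every_tenth = systematic_sample(rows, 10)  # 20 of the 200 rows

people = ([("yellow", i) for i in range(9)]
          + [("red", i) for i in range(3)]
          + [("blue", i) for i in range(3)])
stratified = stratified_sample(people, lambda p: p[0], 1 / 3, rng)

groups = [rows[i:i + 3] for i in range(0, 12, 3)]  # four clusters of three
clustered = cluster_sample(groups, 2, rng)

print(len(random_rows), len(every_tenth), len(stratified), len(clustered))
```

The stratified draw returns 3 yellow, 1 red and 1 blue, while the cluster draw returns all 6 members of two complete groups; that contrast is the key difference between the two methods.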
Sampling can also be based on non-probability, an approach in which a data sample is determined and extracted based on the judgment of the analyst. As inclusion is determined by the analyst, it can be more difficult to extrapolate whether the sample accurately represents the larger population than when probability sampling is used.
Non-probability data sampling methods include:
• Convenience sampling: Data is collected from an easily accessible and available group.
• Consecutive sampling: Data is collected from every subject that meets the criteria until the predetermined sample size is met.
• Purposive or judgmental sampling: The researcher selects the data to sample based on predefined criteria.
• Quota sampling: The researcher ensures equal representation within the sample for all subgroups in the data set or population (random sampling is not used).

Once generated, a sample can be used for predictive analytics. For example, a retail business might use data sampling to uncover patterns about customer behavior and predictive modeling to create more effective sales strategies.
Credit: Steve Nouri
What are the assumptions required for linear regression?
There are four major assumptions:
• There is a linear relationship between the dependent variable and the regressors, meaning the model you are creating actually fits the data,
• The errors or residuals of the data are normally distributed and independent from each other,
• There is minimal multicollinearity between explanatory variables, and
• Homoscedasticity. This means the variance around the regression line is the same for all values of the predictor variable.
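One quick, informal check for the multicollinearity assumption is to look at the pairwise correlation between explanatory variables. The sketch below uses made-up predictors and a hand-written `pearson` helper; it is a first screening step under those assumptions, not a full diagnostic such as a variance inflation factor.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical predictors for a house-price model.
size_m2 = [50, 70, 90, 110, 130]
rooms = [2, 3, 4, 5, 6]          # moves in lockstep with size
age_years = [30, 5, 18, 42, 11]  # roughly unrelated to size

size_vs_rooms = pearson(size_m2, rooms)
size_vs_age = pearson(size_m2, age_years)
print(f"size vs rooms: {size_vs_rooms:.2f}, size vs age: {size_vs_age:.2f}")
```

A correlation near 1 or -1 between two predictors, as with size and rooms here, suggests they carry nearly the same information and that their individual coefficients will be unstable.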
What is a statistical interaction?
Reference: Statistical Interaction
Basically, an interaction is when the effect of one factor (input variable) on the dependent variable (output variable) differs among levels of another factor. When two or more independent variables are involved in a research design, there is more to consider than simply the “main effect” of each of the independent variables (also termed “factors”). That is, the effect of one independent variable on the dependent variable of interest may not be the same at all levels of the other independent variable. Another way to put this is that the effect of one independent variable may depend on the level of the other independent variable. In order to find an interaction, you must have a factorial design, in which the two (or more) independent variables are “crossed” with one another so that there are observations at every combination of levels of the two independent variables. EX: stress level and practice when memorizing words: practice may improve recall less under high stress than under low stress, so the effect of practice depends on the stress level.
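A small numeric sketch of an interaction in a 2x2 factorial design: the cell means below are invented for illustration (mean words recalled under two stress levels, with and without practice). The interaction is the difference between the effect of practice at one stress level and its effect at the other.

```python
# Hypothetical mean number of words recalled in each cell of a 2x2 design.
cell_means = {
    ("low_stress", "no_practice"): 10.0,
    ("low_stress", "practice"): 16.0,
    ("high_stress", "no_practice"): 9.0,
    ("high_stress", "practice"): 11.0,
}


def practice_effect(stress_level):
    """Effect of practice on recall at one fixed stress level."""
    return (cell_means[(stress_level, "practice")]
            - cell_means[(stress_level, "no_practice")])


effect_low = practice_effect("low_stress")    # 16 - 10
effect_high = practice_effect("high_stress")  # 11 - 9
interaction = effect_high - effect_low        # zero would mean no interaction

print(effect_low, effect_high, interaction)
```

With these numbers practice adds 6 words under low stress but only 2 under high stress, so the interaction contrast is -4: the benefit of practice depends on the stress level.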
What is selection bias?
Selection (or ‘sampling’) bias occurs when the sample data that is gathered and prepared for modeling has characteristics that are not representative of the true, future population of cases the model will see.
That is, active selection bias occurs when a subset of the data is systematically (i.e., non-randomly) excluded from analysis.
Selection bias is a kind of error that occurs when the researcher decides who is going to be studied. It is associated with research where the selection of participants is not random. Therefore, some conclusions of the study may not be accurate.
The types of selection bias include:
• Sampling bias: It is a systematic error due to a non-random sample of a population causing some members of the population to be less likely to be included than others resulting in a biased sample.
• Time interval: A trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have a similar mean.
• Data: When specific subsets of data are chosen to support a conclusion or rejection of bad data on arbitrary grounds, instead of according to previously stated or generally agreed criteria.
• Attrition: Attrition bias is a kind of selection bias caused by attrition (loss of participants), discounting trial subjects/tests that did not run to completion.
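The effect of systematically excluding part of the data can be seen in a tiny simulation. The population of scores from 1 to 100 and the cut-off at 50 are hypothetical choices made only for this sketch.

```python
import random

# Hypothetical population: one score per case, from 1 to 100.
population = list(range(1, 101))
true_mean = sum(population) / len(population)

# Random selection: every case has the same chance of being included.
rng = random.Random(42)
random_sample = rng.sample(population, 30)
random_mean = sum(random_sample) / len(random_sample)

# Biased selection: cases scoring below 50 are systematically excluded.
biased_sample = [score for score in population if score >= 50]
biased_mean = sum(biased_sample) / len(biased_sample)

print(f"true mean:          {true_mean}")
print(f"random-sample mean: {random_mean:.1f}")
print(f"biased-sample mean: {biased_mean}")
```

The biased sample is larger than the random one, yet its mean of 75 sits far from the true mean of 50.5. Collecting more data does not repair a non-random selection rule.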
What is an example of a data set with a non-Gaussian distribution?
The Gaussian distribution is part of the Exponential family of distributions, but there are a lot more of them, with the same sort of ease of use, in many cases, and if the person doing the machine learning has a solid grounding in statistics, they can be utilized where appropriate.
Binomial: multiple tosses of a coin, Bin(n,p): the binomial distribution consists of the probabilities of each of the possible numbers of successes on n trials for independent events that each have a probability p of occurring.
Bernoulli: Bin(1,p) = Be(p)
Poisson: Pois(λ)
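As a rough illustration of the distributions listed above, the probability mass functions can be written directly from their definitions. This is a minimal sketch using only the Python standard library; the function names `binomial_pmf` and `poisson_pmf` are chosen here for illustration.

```python
from math import comb, exp, factorial

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Bin(n, p): k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Pois(lam)."""
    return exp(-lam) * lam**k / factorial(k)

# Probability of exactly 5 heads in 10 tosses of a fair coin (252 / 1024).
heads_5_of_10 = binomial_pmf(5, 10, 0.5)

# Bernoulli is the n = 1 special case: Bin(1, p) = Be(p).
bernoulli_success = binomial_pmf(1, 1, 0.3)

print(heads_5_of_10, bernoulli_success, poisson_pmf(0, 2.0))
```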
What is bias-variance trade-off?
Bias: Bias is an error introduced in the model due to the oversimplification of the algorithm used (does not fit the data properly). It can lead to under-fitting.
Low bias machine learning algorithms — Decision Trees, k-NN and SVM
High bias machine learning algorithms — Linear Regression, Logistic Regression
Variance: Variance is an error introduced in the model due to an overly complex algorithm: the model performs very well on the training set but poorly on the test set. It can lead to high sensitivity and overfitting.
Possible high variance – polynomial regression
Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias in the model. However, this only happens until a particular point. As you continue to make your model more complex, you end up over-fitting your model and hence your model will start suffering from high variance.

Bias-Variance trade-off: The goal of any supervised machine learning algorithm is to have low bias and low variance to achieve good prediction performance.
1. The k-nearest neighbor algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k which increases the number of neighbors that contribute to the prediction and in turn increases the bias of the model.
2. The support vector machine algorithm has low bias and high variance, but the trade-off can be changed by increasing the C parameter that influences the number of violations of the margin allowed in the training data, which increases the bias but decreases the variance. (This assumes C is the budget for margin violations; in libraries where C is a penalty on violations, such as scikit-learn, the direction is reversed and a smaller C increases bias.)
3. The decision tree has low bias and high variance; to trade variance for bias you can decrease the depth of the tree or use fewer attributes.
4. Linear regression has low variance and high bias; to reduce bias you can increase the number of features or use another regression that better fits the data.
There is no escaping the relationship between bias and variance in machine learning. Increasing the bias will decrease the variance. Increasing the variance will decrease bias.
What is a confusion matrix?
The confusion matrix is a 2×2 table that contains the 4 outcomes produced by a binary classifier.
A data set used for performance evaluation is called a test data set. It should contain the correct labels and predicted labels. The predicted labels will be exactly the same as the correct labels if the performance of a binary classifier is perfect. In real-world scenarios, the predicted labels usually match only part of the observed labels.
A binary classifier predicts all data instances of a test data set as either positive or negative. This produces four outcomes: TP, FP, TN, FN. Basic measures derived from the confusion matrix:
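The four outcomes and a few of the basic measures usually derived from them can be sketched as follows. The labels below are made-up illustrative data (1 = positive, 0 = negative), and `confusion_counts` is a name chosen for this example.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Hypothetical test set: observed labels and classifier predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
sensitivity = tp / (tp + fn)   # also called recall or true positive rate
specificity = tn / (tn + fp)   # true negative rate

print(tp, fp, tn, fn, accuracy, precision, sensitivity, specificity)
```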
What is the difference between “long” and “wide” format data?
In the wide-format, a subject’s repeated responses will be in a single row, and each response is in a separate column. In the long-format, each row is one time point per subject. You can recognize data in wide format by the fact that columns generally represent groups (variables).

What do you understand by the term Normal Distribution?
Data is usually distributed in different ways with a bias to the left or to the right or it can all be jumbled up. However, there are chances that data is distributed around a central value without any bias to the left or right and reaches normal distribution in the form of a bell-shaped curve.

The random variables are distributed in the form of a symmetrical, bell-shaped curve. Properties of Normal Distribution are as follows:
1. Unimodal (Only one mode)
2. Symmetrical (left and right halves are mirror images)
3. Bell-shaped (maximum height (mode) at the mean)
4. Mean, Mode, and Median are all located in the center
5. Asymptotic
What is correlation and covariance in statistics?
Correlation is considered or described as the best technique for measuring and also for estimating the quantitative relationship between two variables. Correlation measures how strongly two variables are related. Given two random variables, it is the covariance between both divided by the product of the two standard deviations of the single variables, hence always between -1 and 1.

Covariance is a measure that indicates the extent to which two random variables change in tandem. It describes the systematic relation between a pair of random variables, wherein a change in one variable is accompanied by a corresponding change in the other.
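The relationship between the two quantities (correlation is covariance divided by the product of the standard deviations) can be checked with a small sketch. The data values are invented for illustration and the sample (n − 1) form of covariance is assumed.

```python
from math import sqrt

def covariance(x, y):
    """Sample covariance of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical data: hours studied and exam score.
hours = [1, 2, 3, 4, 5]
score = [52, 54, 61, 63, 70]

cov_hs = covariance(hours, score)
corr_hs = correlation(hours, score)

# Rescaling a variable changes the covariance but not the correlation.
score_x10 = [10 * s for s in score]
print(cov_hs, covariance(hours, score_x10), corr_hs, correlation(hours, score_x10))
```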

What is the difference between Point Estimates and Confidence Interval?
Point Estimation gives us a particular value as an estimate of a population parameter. Method of Moments and Maximum Likelihood estimator methods are used to derive Point Estimators for population parameters.
A confidence interval gives us a range of values which is likely to contain the population parameter. The confidence interval is generally preferred, as it tells us how likely this interval is to contain the population parameter. This likeliness or probability is called Confidence Level or Confidence coefficient and represented by 1 − α, where α is the level of significance.
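To make the distinction concrete, the sketch below computes a point estimate (the sample mean) and a confidence interval around it. It assumes a normal approximation for the sampling distribution of the mean; for a sample this small a t-distribution would normally be more appropriate. The measurements are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Return (point_estimate, (low, high)) using a normal approximation."""
    alpha = 1 - confidence                      # level of significance
    z = NormalDist().inv_cdf(1 - alpha / 2)     # about 1.96 for 95%
    point = mean(sample)
    half_width = z * stdev(sample) / sqrt(len(sample))
    return point, (point - half_width, point + half_width)

# Hypothetical measurements.
sample = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 5.3, 4.7]

point, (low, high) = mean_confidence_interval(sample, confidence=0.95)
_, (low99, high99) = mean_confidence_interval(sample, confidence=0.99)
print(point, (low, high), (low99, high99))
```

A higher confidence level gives a wider interval for the same data.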
What is the goal of A/B Testing?
It is a hypothesis test for a randomized experiment with two variants, A and B.
The goal of A/B Testing is to identify any changes to the web page to maximize or increase the outcome of interest. A/B testing is a fantastic method for figuring out the best online promotional and marketing strategies for your business. It can be used to test everything from website copy to sales emails to search ads. An example of this could be identifying the click-through rate for a banner ad.
What is p-value?
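When you perform a hypothesis test in statistics, a p-value can help you determine the strength of your results. The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one obtained; equivalently, it is the minimum significance level at which you can reject the null hypothesis. The lower the p-value, the stronger the evidence against the null hypothesis.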
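As one possible illustration of a p-value in an A/B testing setting, the sketch below runs a two-sided two-proportion z-test on click-through counts. The choice of test and the visitor and click counts are assumptions made for this example, not something prescribed by the text.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: variants A and B have the same conversion rate."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical banner test: 2000 visitors per variant.
p_value = two_proportion_p_value(conv_a=200, n_a=2000, conv_b=260, n_b=2000)

alpha = 0.05
print(p_value, "reject H0" if p_value < alpha else "fail to reject H0")
```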
What do you understand by statistical power of sensitivity and how do you calculate it?
Sensitivity is commonly used to validate the accuracy of a classifier (Logistic, SVM, Random Forest etc.). It is the proportion of actual positives that are correctly identified: Sensitivity = TP / (TP + FN)
Why is Re-sampling done?
https://machinelearningmastery.com/statistical-sampling-and-resampling/
- Sampling is an active process of gathering observations with the intent of estimating a population variable.
- Resampling is a methodology of economically using a data sample to improve the accuracy and quantify the uncertainty of a population parameter. Resampling methods, in fact, make use of a nested resampling method.
Once we have a data sample, it can be used to estimate the population parameter. The problem is that we only have a single estimate of the population parameter, with little idea of the variability or uncertainty in the estimate. One way to address this is by estimating the population parameter multiple times from our data sample. This is called resampling. Statistical resampling methods are procedures that describe how to economically use available data to estimate a population parameter. The result can be both a more accurate estimate of the parameter (such as taking the mean of the estimates) and a quantification of the uncertainty of the estimate (such as adding a confidence interval).
Resampling methods are very easy to use, requiring little mathematical knowledge. A downside of the methods is that they can be computationally very expensive, requiring tens, hundreds, or even thousands of resamples in order to develop a robust estimate of the population parameter.
The key idea is to resample from the original data, either directly or via a fitted model, to create replicate datasets, from which the variability of the quantities of interest can be assessed without longwinded and error-prone analytical calculation. Because this approach involves repeating the original data analysis procedure with many replicate sets of data, these are sometimes called computer-intensive methods. Each new subsample from the original data sample is used to estimate the population parameter. The sample of estimated population parameters can then be considered with statistical tools in order to quantify the expected value and variance, providing measures of the uncertainty of the estimate. Statistical sampling methods can be used in the selection of a subsample from the original sample.
A key difference is that the process must be repeated multiple times. The problem with this is that there will be some relationship between the samples, as observations will be shared across multiple subsamples. This means that the subsamples and the estimated population parameters are not strictly independent and identically distributed. This has implications for statistical tests performed on the sample of estimated population parameters downstream, i.e. paired statistical tests may be required.
Two commonly used resampling methods that you may encounter are k-fold cross-validation and the bootstrap.
- Bootstrap. Samples are drawn from the dataset with replacement (allowing the same sample to appear more than once in the sample), where those instances not drawn into the data sample may be used for the test set.
- k-fold Cross-Validation. A dataset is partitioned into k groups, where each group is given the opportunity of being used as a held out test set leaving the remaining groups as the training set. The k-fold cross-validation method specifically lends itself to use in the evaluation of predictive models that are repeatedly trained on one subset of the data and evaluated on a second held-out subset of the data.
Resampling is done in any of these cases:
- Estimating the accuracy of sample statistics by using subsets of accessible data or drawing randomly with replacement from a set of data points
- Substituting labels on data points when performing significance tests
- Validating models by using random subsets (bootstrapping, cross-validation)
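The bootstrap described above can be sketched in a few lines: draw many resamples with replacement, recompute the statistic on each, and use the spread of the estimates to quantify uncertainty. The sample values, the number of resamples and the percentile-interval rule are illustrative assumptions.

```python
import random
from statistics import mean

def bootstrap_means(sample, n_resamples=2000, seed=0):
    """Estimate the mean on many resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    return [mean(rng.choices(sample, k=n)) for _ in range(n_resamples)]

def percentile_interval(values, confidence=0.95):
    """Simple percentile interval over the bootstrap estimates."""
    ordered = sorted(values)
    low_idx = int((1 - confidence) / 2 * len(ordered))
    high_idx = int((1 + confidence) / 2 * len(ordered)) - 1
    return ordered[low_idx], ordered[high_idx]

# Hypothetical data sample.
sample = [12, 15, 9, 20, 14, 11, 17, 13, 16, 10]

estimates = bootstrap_means(sample)
low, high = percentile_interval(estimates)
print(mean(sample), mean(estimates), (low, high))
```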
What are the differences between over-fitting and under-fitting?
In statistics and machine learning, one of the most common tasks is to fit a model to a set of training data, so as to be able to make reliable predictions on general untrained data.
In overfitting, a statistical model describes random error or noise instead of the underlying relationship.
Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfitted, has poor predictive performance, as it overreacts to minor fluctuations in the training data.
Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Underfitting would occur, for example, when fitting a linear model to non-linear data.
Such a model too would have poor predictive performance.
How to combat Overfitting and Underfitting?
To combat overfitting:
1. Add noise
2. Feature selection
3. Increase training set
4. L2 (ridge) or L1 (lasso) regularization; L1 can drive some weights to exactly zero, L2 only shrinks them
5. Use cross-validation techniques, such as k folds cross-validation
6. Boosting and bagging
7. Dropout technique
8. Perform early stopping
9. Remove inner layers
To combat underfitting:
1. Add features
2. Increase time of training
What is regularization? Why is it useful?
Regularization is the process of adding a penalty term, scaled by a tuning parameter, to a model to induce smoothness in order to prevent overfitting. This is most often done by adding a constant multiple of a norm of the weight vector to the loss function. This norm is often the L1 norm (Lasso, |w|) or the squared L2 norm (Ridge, w²), where w denotes the weights. The model is then fit by minimizing the regularized loss function on the training set.
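The different behaviour of the two penalties can be seen in a deliberately simplified one-weight setting. Assuming the unregularized solution for a weight is `w_ols` and the loss is 0.5·(w − w_ols)², the L2 and L1 penalized minimizers have closed forms; the names and the value of `lam` below are illustrative.

```python
def ridge_shrink(w_ols: float, lam: float) -> float:
    """Minimizer of 0.5*(w - w_ols)**2 + 0.5*lam*w**2 (L2 penalty)."""
    return w_ols / (1.0 + lam)

def lasso_shrink(w_ols: float, lam: float) -> float:
    """Minimizer of 0.5*(w - w_ols)**2 + lam*abs(w) (L1 penalty, soft-thresholding)."""
    if w_ols > lam:
        return w_ols - lam
    if w_ols < -lam:
        return w_ols + lam
    return 0.0

# Hypothetical unregularized weights and penalty strength.
weights = [3.0, -0.4, 0.2, -2.5]
lam = 0.5

ridge = [ridge_shrink(w, lam) for w in weights]   # every weight shrinks, none is zero
lasso = [lasso_shrink(w, lam) for w in weights]   # small weights become exactly zero
print(ridge)
print(lasso)
```

In this sketch L2 scales every weight down proportionally, while L1 subtracts a fixed amount and sets weights smaller than `lam` to zero, which is why lasso is often used for feature selection.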
What Is the Law of Large Numbers?
It is a theorem that describes the result of performing the same experiment a large number of times. This theorem forms the basis of frequency-style thinking. It says that the sample means, the sample variance and the sample standard deviation converge to what they are trying to estimate. According to the law, the average of the results obtained from a large number of trials should be close to the expected value and will tend to become closer to the expected value as more trials are performed.
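A quick simulation illustrates the idea: the average of many fair die rolls approaches the expected value 3.5 as the number of rolls grows. The seed and sample sizes are arbitrary choices for this sketch.

```python
import random

def sample_mean_of_die(n_rolls: int, seed: int = 42) -> float:
    """Average of n_rolls simulated rolls of a fair six-sided die."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_rolls):
        total += rng.randint(1, 6)
    return total / n_rolls

expected_value = 3.5
for n in (10, 1_000, 200_000):
    print(n, sample_mean_of_die(n))
```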
What Are Confounding Variables?
In statistics, a confounder is a variable that influences both the dependent variable and independent variable.
If you are researching whether a lack of exercise leads to weight gain:
lack of exercise = independent variable
weight gain = dependent variable
A confounding variable here would be any other variable that affects both of these variables, such as the age of the subject.
What is Survivorship Bias?
It is the logical error of focusing on the aspects that survived some process and casually overlooking those that did not because of their lack of prominence. This can lead to wrong conclusions in numerous ways. For example, during a recession you look only at the surviving businesses and note that they are performing poorly. However, they are performing better than the rest, which failed and were therefore removed from the time series.
Explain how a ROC curve works?
The ROC curve is a graphical representation of the contrast between true positive rates and false positive rates at various thresholds. It is often used as a proxy for the trade-off between the sensitivity (true positive rate) and false positive rate.
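A minimal sketch of how the curve is built: sweep the decision threshold over the classifier scores, record the (false positive rate, true positive rate) pair at each threshold, and optionally summarize the curve by its area (AUC). The labels and scores are made-up illustrative data.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, from the strictest threshold to the loosest."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical labels (1 = positive) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
area = auc(points)
print(points)
print(area)
```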

What is TF/IDF vectorization?
TF-IDF, short for term frequency-inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining.

The TF-IDF value increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
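The weighting can be sketched with the plain textbook form tf × log(N / df), where tf is the term's relative frequency in a document, N the number of documents and df the number of documents containing the term. Libraries often use smoothed variants, so treat this as an illustration under that assumption; the tiny corpus is invented.

```python
from math import log

def tf_idf(corpus):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = {}
    for tokens in docs:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1

    weights = []
    for tokens in docs:
        row = {}
        for term in set(tokens):
            tf = tokens.count(term) / len(tokens)
            idf = log(n_docs / doc_freq[term])
            row[term] = tf * idf
        weights.append(row)
    return weights

# Hypothetical corpus of three short documents.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

weights = tf_idf(corpus)
print(weights[0])
```

A word such as "the", which appears in every document, gets weight zero, while a word unique to one document ("mat") gets the highest weight there.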
Python or R – Which one would you prefer for text analytics?
We would prefer Python for the following reasons:
• Python would be the best option because it has the pandas library, which provides easy-to-use data structures and high-performance data analysis tools.
• R is more suitable for machine learning than just text analysis.
• Python generally performs faster for most types of text analytics.
How does data cleaning play a vital role in the analysis?
Data cleaning can help in analysis because:
- Cleaning data from multiple sources helps transform it into a format that data analysts or data scientists can work with.
- Data Cleaning helps increase the accuracy of the model in machine learning.
- It is a cumbersome process because as the number of data sources increases, the time taken to clean the data increases exponentially due to the number of sources and the volume of data generated by these sources.
- It might take up to 80% of the time for just cleaning data making it a critical part of the analysis task
Differentiate between univariate, bivariate and multivariate analysis.
Univariate analyses are descriptive statistical analysis techniques that involve only one variable at a given point of time. For example, a pie chart of sales by territory involves only one variable, so the analysis can be referred to as univariate analysis.
Bivariate analysis attempts to understand the relationship between two variables at a time, as in a scatterplot. For example, analyzing the volume of sales against spending can be considered an example of bivariate analysis.
Multivariate analysis deals with the study of more than two variables to understand the effect of variables on the responses.
Explain Star Schema
It is a traditional database schema with a central table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve several layers of summarization to recover information faster.
What is Cluster Sampling?
Cluster sampling is a technique used when it becomes difficult to study the target population spread across a wide area and simple random sampling cannot be applied. Cluster Sample is a probability sample where each sampling unit is a collection or cluster of elements.
For example, a researcher wants to survey the academic performance of high school students in Japan. He can divide the entire population of Japan into different clusters (cities). Then the researcher selects a number of clusters depending on his research through simple or systematic random sampling.
What is Systematic Sampling?
Systematic sampling is a statistical technique where elements are selected from an ordered sampling frame. In systematic sampling, the list is progressed in a circular manner so once you reach the end of the list, it is progressed from the top again. The best example of systematic sampling is equal probability method.
What are Eigenvectors and Eigenvalues?
Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching.
Eigenvalue can be referred to as the strength of the transformation in the direction of eigenvector or the factor by which the compression occurs.
Give Examples where a false positive is more important than a false negative?
Let us first understand what false positives and false negatives are:
- False Positives are the cases where you wrongly classified a non-event as an event a.k.a Type I error
- False Negatives are the cases where you wrongly classify events as non-events, a.k.a Type II error.
Example 1: In the medical field, assume you have to give chemotherapy to patients. Assume a patient comes to that hospital and he is tested positive for cancer, based on the lab prediction but he actually doesn’t have cancer. This is a case of false positive. Here it is of utmost danger to start chemotherapy on this patient when he actually does not have cancer. In the absence of cancerous cell, chemotherapy will do certain damage to his normal healthy cells and might lead to severe diseases, even cancer.
Example 2: Let’s say an e-commerce company decided to give $1000 Gift voucher to the customers whom they assume to purchase at least $10,000 worth of items. They send free voucher mail directly to 100 customers without any minimum purchase condition because they assume to make at least 20% profit on sold items above $10,000. Now the issue is if we send the $1000 gift vouchers to customers who have not actually purchased anything but are marked as having made $10,000 worth of purchase
Give Examples where a false negative is more important than a false positive? And vice versa?
Example 1 FN: What if Jury or judge decides to make a criminal go free?
Example 2 FN: Fraud detection.
Example 3 FP: evaluating a customer voucher promotion: if the data say many customers used the voucher when they actually did not, the promotion looks more successful than it really was.
Give Examples where both false positive and false negatives are equally important?
In the Banking industry giving loans is the primary source of making money but at the same time if your repayment rate is not good you will not make any profit, rather you will risk huge losses.
Banks don’t want to lose good customers and at the same point in time, they don’t want to acquire bad customers. In this scenario, both the false positives and false negatives become very important to measure.
What is the Difference between a Validation Set and a Test Set?
A Training Set:
• to fit the parameters i.e. weights
A Validation set:
• part of the training set
• for parameter selection
• to avoid overfitting
A Test set:
• for testing or evaluating the performance of a trained machine learning model, i.e. evaluating the predictive power and generalization.
What is cross-validation?
Reference: k-fold cross validation
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation. It is mainly used in settings where the objective is prediction and one wants to estimate how accurately a model will perform in practice.
Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.
It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.
The general procedure is as follows:
1. Shuffle the dataset randomly.
2. Split the dataset into k groups
3. For each unique group:
a. Take the group as a hold out or test data set
b. Take the remaining groups as a training data set
c. Fit a model on the training set and evaluate it on the test set
d. Retain the evaluation score and discard the model
4. Summarize the skill of the model using the sample of model evaluation scores

Scikit-Learn also offers an alternative called StratifiedKFold, in which the splits are made so that each fold contains a representative sample of each class. Plain KFold gives no such assurance, which makes it a poor choice for a very unbalanced dataset.
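The general k-fold procedure above can be sketched without any library. To keep the example self-contained, the "model" is a deliberately trivial baseline that predicts the mean of the training targets, and the score is mean squared error; both are assumptions made for illustration, as are the function names and data.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)           # 1. shuffle the dataset
    folds = [indices[i::k] for i in range(k)]      # 2. split into k groups
    for i, test_idx in enumerate(folds):           # 3. each group is held out once
        train_idx = [j for other, fold in enumerate(folds) if other != i for j in fold]
        yield train_idx, test_idx

def cross_validate(targets, k=5):
    """Return one evaluation score (MSE) per fold for a mean-predicting baseline."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(targets), k):
        prediction = mean(targets[j] for j in train_idx)              # fit on training folds
        mse = mean((targets[j] - prediction) ** 2 for j in test_idx)  # evaluate on held-out fold
        scores.append(mse)                                            # keep score, discard model
    return scores

# Hypothetical regression targets.
targets = [3.1, 2.9, 3.4, 3.0, 2.7, 3.3, 3.2, 2.8, 3.5, 3.0]

scores = cross_validate(targets, k=5)
print(scores, mean(scores))                        # 4. summarize the scores
```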
What is Machine Learning?
Machine learning is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine Learning explores the study and construction of algorithms that can learn from and make predictions on data. You select a model to train and then manually perform feature extraction. It is used to devise complex models and algorithms that lend themselves to prediction, which in commercial use is known as predictive analytics.
What is Supervised Learning?
Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples.
Algorithms: Support Vector Machines, Regression, Naive Bayes, Decision Trees, K-nearest Neighbor Algorithm and Neural Networks
Example: If you built a fruit classifier, the labels will be “this is an orange, this is an apple and this is a banana”, based on showing the classifier examples of apples, oranges and bananas.
What is Unsupervised learning?
Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labelled responses.
Algorithms: Clustering, Anomaly Detection, Neural Networks and Latent Variable Models
Example: In the same example, a fruit clustering will categorize as “fruits with soft skin and lots of dimples”, “fruits with shiny hard skin” and “elongated yellow fruits”.
What are the various Machine Learning algorithms?

What is “Naive” in a Naive Bayes?
Reference: Naive Bayes Classifier on Wikipedia
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable. Bayes’ theorem states the following relationship, given class variable y and dependent feature vector x1 through xn:
P(y | x1, …, xn) = P(y) · P(x1, …, xn | y) / P(x1, …, xn)

What is PCA (Principal Component Analysis)? When do you use it?
Reference: PCA on wikipedia
Principal component analysis (PCA) is a statistical method used in Machine Learning. It consists of projecting data from a higher-dimensional space into a lower-dimensional space while maximizing the variance retained along each new dimension.
The process works as follows. We define a matrix A with n rows (the single observations of a dataset; in a tabular format, each single row) and p columns, our features. For this matrix we construct a variable space with as many dimensions as there are features. Each feature represents one coordinate axis. For each feature, the length has been standardized according to a scaling criterion, normally by scaling to unit variance. It is essential to bring the features to a common scale, otherwise the features with a greater magnitude will weigh more in determining the principal components. Once all the observations are plotted and the mean of each variable is computed, that mean is represented by a point in the center of the plot (the center of gravity). Then, we subtract the mean from each observation, shifting the coordinate system so that its center is at the origin. The resulting best-fitting line is the line that best accounts for the shape of the point swarm. It represents the direction of maximum variance in the data. Each observation may be projected onto this line in order to get a coordinate value along the PC-line. This value is known as a score. The next best-fitting line can be similarly chosen from directions perpendicular to the first.
Repeating this process yields an orthogonal basis in which different individual dimensions of the data are uncorrelated. These basis vectors are called principal components.
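The steps described above (center the data, find the direction of maximum variance, project each observation to get its score) can be sketched for the first principal component only. This example uses power iteration on the covariance matrix, which is one of several ways to find that direction, and it skips the scaling step on the assumption that both made-up features are already on a comparable scale.

```python
from math import sqrt

def first_principal_component(rows, n_iter=200):
    """Return (direction, scores) for the first principal component."""
    n, d = len(rows), len(rows[0])

    # Center each feature on its mean (shift the origin to the center of gravity).
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix of the centered data.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]

    # Power iteration: repeatedly apply cov and renormalize.
    # Assumes the start vector is not orthogonal to the leading eigenvector.
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Score of each observation: its coordinate along the PC line.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical 2-feature dataset lying roughly along the line y = x.
rows = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]

direction, scores = first_principal_component(rows)
score_variance = sum(s * s for s in scores) / (len(scores) - 1)
print(direction, score_variance)
```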

PCA is mostly used as a tool in exploratory data analysis and for making predictive models. It is often used to visualize genetic distance and relatedness between populations.
SVM (Support Vector Machine) algorithm
Reference: SVM on wikipedia
Classifying data is a common task in machine learning. Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in. In the case of support-vector machines, a data point is viewed as a p-dimensional vector (a list of p numbers), and we want to know whether we can separate such points with a (p − 1)-dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might classify the data. One reasonable choice as the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So, we choose the hyperplane so that the distance from it to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane and the linear classifier it defines is known as a maximum-margin classifier; or equivalently, the perceptron of optimal stability. In the referenced figure, the best hyperplane dividing the data is H3.
- SVMs are helpful in text and hypertext categorization, as their application can significantly reduce the need for labeled training instances in both the standard inductive and transductive settings.
- Some methods for shallow semantic parsing are based on support vector machines.
- Classification of images can also be performed using SVMs. Experimental results show that SVMs achieve significantly higher search accuracy than traditional query refinement schemes after just three to four rounds of relevance feedback.
- Classification of satellite data like SAR data using supervised SVM.
- Hand-written characters can be recognized using SVM.
What are the support vectors in SVM?

In the diagram, we see that the sketched lines mark the distance from the classifier (the hyper plane) to the closest data points called the support vectors (darkened data points). The distance between the two thin lines is called the margin.
To extend SVM to cases in which the data are not linearly separable, we introduce the hinge loss function, max(0, 1 − y_i(w · x_i − b)). This function is zero if x_i lies on the correct side of the margin. For data on the wrong side of the margin, the function’s value is proportional to the distance from the margin.
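The hinge loss can be evaluated directly from its formula. In this sketch the weight vector, offset and points are hypothetical, and labels are assumed to be +1 or −1.

```python
def hinge_loss(w, b, x, y):
    """max(0, 1 - y * (w . x - b)) for a label y in {-1, +1}."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) - b)
    return max(0.0, 1.0 - margin)

# Hypothetical separating hyperplane: w . x - b = 0 with w = (1, 0), b = 0.
w, b = [1.0, 0.0], 0.0

loss_correct = hinge_loss(w, b, [2.0, 0.0], +1)        # beyond the margin: no loss
loss_inside = hinge_loss(w, b, [0.5, 0.0], +1)         # right side, but inside the margin
loss_wrong = hinge_loss(w, b, [-1.0, 0.0], +1)         # wrong side of the hyperplane
print(loss_correct, loss_inside, loss_wrong)
```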
What are the different kernels in SVM?
There are four types of kernels in SVM.
1. Linear kernel
2. Polynomial kernel
3. Radial basis kernel
4. Sigmoid kernel
What are the most known ensemble algorithms?
Reference: Ensemble Algorithms
The most popular tree-based ensemble algorithms are: AdaBoost, Random Forest, and eXtreme Gradient Boosting (XGBoost).
AdaBoost is best used in a dataset with low noise, when computational complexity or timeliness of results is not a main concern and when there are not enough resources for broader hyperparameter tuning due to lack of time and knowledge of the user.
Random forests should not be used when dealing with time series data or any other data where look-ahead bias should be avoided, and the order and continuity of the samples need to be ensured. This algorithm can handle noise relatively well, but more knowledge from the user is required to adequately tune the algorithm compared to AdaBoost.
The main advantages of XGBoost are its lightning speed compared to other algorithms, such as AdaBoost, and its regularization parameter that successfully reduces variance. But even aside from the regularization parameter, this algorithm leverages a learning rate (shrinkage) and subsamples from the features like random forests, which increases its ability to generalize even further. However, XGBoost is more difficult to understand, visualize and tune compared to AdaBoost and random forests. There is a multitude of hyperparameters that can be tuned to increase performance.
What is Deep Learning?
Deep Learning is nothing but a paradigm of machine learning which has shown incredible promise in recent years. This is because of the fact that Deep Learning shows a great analogy with the functioning of the neurons in the human brain.

What is the difference between machine learning and deep learning?
Deep learning & Machine learning: what’s the difference?
Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed. Machine learning can be categorized in the following four categories.
1. Supervised machine learning,
2. Semi-supervised machine learning,
3. Unsupervised machine learning,
4. Reinforcement learning.
Deep Learning is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain called artificial neural networks.

• The main difference between deep learning and machine learning is due to the way data is presented in the system. Machine learning algorithms almost always require structured data, while deep learning networks rely on layers of ANN (artificial neural networks).
• Machine learning algorithms are designed to “learn” to act by understanding labeled data and then use it to produce new results with more datasets. However, when the result is incorrect, there is a need to “teach them”. Because machine learning algorithms require labeled data, they are not suitable for solving complex queries that involve a huge amount of data.
• Deep learning networks do not require human intervention, as multilevel layers in neural networks place data in a hierarchy of different concepts, which ultimately learn from their own mistakes. However, even they can be wrong if the data quality is not good enough.
• Data decides everything. It is the quality of the data that ultimately determines the quality of the result.
• Both of these subsets of AI are somehow connected to data, which makes it possible to represent a certain form of “intelligence.” However, you should be aware that deep learning requires much more data than a traditional machine learning algorithm. The reason for this is that deep learning networks can identify different elements in neural network layers only when more than a million data points interact. Machine learning algorithms, on the other hand, are capable of learning by pre-programmed criteria.
What is the reason for the popularity of Deep Learning in recent times?
Now although Deep Learning has been around for many years, the major breakthroughs from these techniques came just in recent years. This is because of two main reasons:
• The increase in the amount of data generated through various sources
• The growth in hardware resources required to run these models
GPUs are many times faster for this workload, and they help us build bigger and deeper deep learning models in far less time than was previously required.
What is reinforcement learning?
Reinforcement learning trains an agent to take actions that maximize cumulative reward. The agent learns by trial and error through a reward/penalty system: the environment rewards the agent, so over time the agent makes better decisions.
Example: robot = agent, maze = environment. It is used for complex tasks (self-driving cars, game AI).
RL is a series of time steps in a Markov Decision Process:
1. Environment: the space in which the RL agent operates
2. State: data describing the agent’s current situation, including the effect of its past actions
3. Action: the action taken by the agent
4. Reward: the number received by the agent after its last action
5. Observation: data about the environment, which can be fully visible or partially hidden
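The elements listed above can be sketched with tabular Q-learning, one common RL algorithm that the text does not name. This is only a minimal illustration, assuming a hypothetical five-cell corridor as the environment; `step`, `train`, and every hyperparameter value are illustrative choices.

```python
import random

N_STATES = 5          # hypothetical corridor: cells 0..4, cell 4 is the goal
ACTIONS = (-1, +1)    # index 0 = move left, index 1 = move right

def step(state, action):
    """Environment: return (next_state, reward, done) for one time step."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Agent: learn action values q[state][action_index] by trial and error."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap the episode length
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)  # explore, or break a tie at random
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            nxt, reward, done = step(state, ACTIONS[a])
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][a] += alpha * (target - q[state][a])
            state = nxt
            if done:
                break
    return q

q_table = train()
policy = ["left" if left > right else "right" for left, right in q_table[:-1]]
print(policy)
```

The reward only arrives at the goal cell, yet the learned values push the agent to the right from every cell, which is the "better decisions over time" behavior described above.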
What are Artificial Neural Networks?
Artificial neural networks are a specific set of algorithms that have revolutionized machine learning. They are inspired by biological neural networks. Neural networks can adapt to changing input, so the network generates the best possible result without needing to redesign the output criteria.
Artificial neural networks work on the same principle as a biological neural network. A network consists of inputs that are combined into weighted sums, offset by a bias, and passed through activation functions.
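As a rough sketch of that description, a single artificial neuron can be written in a few lines. The function names and the numbers are made up for illustration, and ReLU is just one possible activation function.

```python
def relu(z):
    """Rectified linear unit: negative values become zero."""
    return max(0.0, z)

def neuron(inputs, weights, bias, activation=relu):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

# Illustrative values only: 1.0*0.5 + 2.0*(-0.25) = 0.0, then the bias is added.
print(neuron([1.0, 2.0], [0.5, -0.25], bias=0.5))   # 0.5
print(neuron([1.0, 2.0], [0.5, -0.25], bias=-1.0))  # 0.0 (ReLU clips the negative sum)
```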

How Are Weights Initialized in a Network?
There are two methods here: we can either initialize the weights to zero or assign them randomly.
Initializing all weights to 0: This makes your model similar to a linear model. All the neurons and every layer perform the same operation, giving the same output and making the deep net useless.
Initializing all weights randomly: Here, the weights are assigned randomly by initializing them very close to 0. It gives better accuracy to the model since every neuron performs different computations. This is the most commonly used method.
What Is the Cost Function?
Also referred to as “loss” or “error,” the cost function measures how well your model performs. It is used to compute the error of the output layer during backpropagation. We push that error backwards through the neural network and use it to update the weights during training.
The best-known one is the mean squared error (MSE).
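For reference, the mean of the squared errors can be computed as below. This is a plain-Python sketch with made-up target and prediction values.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Only the last prediction is off (by 2), so the loss is 2**2 / 3.
loss = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(loss)
```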

What Are Hyperparameters?
With neural networks, you’re usually working with hyperparameters once the data is formatted correctly.
A hyperparameter is a parameter whose value is set before the learning process begins. It determines how a network is trained and the structure of the network (such as the number of hidden units, the learning rate, epochs, batches, etc.).
What Will Happen If the Learning Rate is Set inaccurately (Too Low or Too High)?
When your learning rate is too low, training of the model will progress very slowly as we are making minimal updates to the weights. It will take many updates before reaching the minimum point.
If the learning rate is set too high, the drastic updates to the weights cause undesirable behavior in the loss function. Training may fail to converge (the loss keeps oscillating and never settles on a good solution) or even diverge (the loss grows with every update).
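Both failure modes can be seen on a toy problem. The sketch below minimizes f(x) = x² with plain gradient descent; the three learning rates are arbitrary values chosen to show slow progress, healthy convergence, and divergence.

```python
def gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimise f(x) = x**2, whose gradient is 2*x, and return the final x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return x

too_low = gradient_descent(0.001)   # barely moves away from the start
reasonable = gradient_descent(0.1)  # ends close to the minimum at 0
too_high = gradient_descent(1.1)    # overshoots further on every step
print(too_low, reasonable, too_high)
```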
What Is The Difference Between Epoch, Batch, and Iteration in Deep Learning?
• Epoch – Represents one iteration over the entire dataset (everything put into the training model).
• Batch – Refers to when we cannot pass the entire dataset into the neural network at once, so we divide the dataset into several batches.
• Iteration – One pass of a single batch through the network. If we have 10,000 images as data and a batch size of 200, then an epoch should run 50 iterations (10,000 divided by 200).
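The arithmetic behind the 10,000-image example can be checked with a short sketch. The helper names are hypothetical, and a list of integers stands in for the images.

```python
import math

def iterations_per_epoch(n_samples, batch_size):
    """Batches needed to see every sample once (the last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)

def batches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

images = list(range(10_000))  # stand-in for 10,000 images
n_iter = sum(1 for _ in batches(images, 200))
print(n_iter, iterations_per_epoch(10_000, 200))  # 50 50
```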
What Are the Different Layers on CNN?
Reference: Layers of CNN

Convolutional neural networks are regularized versions of the multilayer perceptron (MLP). They were developed based on how neurons in the animal visual cortex work.
The objective of using the CNN:
The idea is that you give the computer this array of numbers and it will output numbers that describe the probability of the image being a certain class (.80 for a cat, .15 for a dog, .05 for a bird, etc.). It works similar to how our brain works. When we look at a picture of a dog, we can classify it as such if the picture has identifiable features such as paws or 4 legs. In a similar way, the computer is able to perform image classification by looking for low-level features such as edges and curves and then building up to more abstract concepts through a series of convolutional layers. The computer uses low-level features obtained at the initial levels to generate high-level features such as paws or eyes to identify the object.
There are four layers in CNN:
1. Convolutional Layer – the layer that performs a convolutional operation, creating several smaller picture windows to go over the data.
2. Activation Layer (ReLU Layer) – it brings non-linearity to the network and converts all the negative pixels to zero. The output is a rectified feature map. It follows each convolutional layer.
3. Pooling Layer – pooling is a down-sampling operation that reduces the dimensionality of the feature map. The stride is how far the window slides at each step; max pooling keeps the maximum of each n x n window.
4. Fully Connected Layer – this layer recognizes and classifies the objects in the image.
What Is Pooling on CNN, and How Does It Work?
Pooling is used to reduce the spatial dimensions of a CNN. It performs down-sampling operations to reduce the dimensionality and creates a pooled feature map by sliding a filter matrix over the input matrix.
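A minimal sketch of max pooling, one common pooling variant, on a small made-up feature map. The window size and stride of 2 are illustrative defaults.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Slide a size x size window over the map and keep each window's maximum."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled.append([
            max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, cols - size + 1, stride)
        ])
    return pooled

feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 5, 8],
]
print(max_pool2d(feature_map))  # [[6, 4], [7, 9]]
```

The 4 x 4 input shrinks to 2 x 2, which is the dimensionality reduction described above.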
What are Recurrent Neural Networks (RNNs)?
Reference: RNNs
RNNs are a type of artificial neural network designed to recognize patterns in sequences of data, such as time series, stock market data, and records from government agencies.
Recurrent Neural Networks (RNNs) add an interesting twist to basic neural networks. A vanilla neural network takes in a fixed size vector as input which limits its usage in situations that involve a ‘series’ type input with no predetermined size.

RNNs are designed to take a series of inputs with no predetermined limit on size. One could ask what’s the big deal, I can call a regular NN repeatedly too?

Sure can, but the ‘series’ part of the input means something. A single input item from the series is related to others and likely has an influence on its neighbors. Otherwise it’s just “many” inputs, not a “series” input (duh!).
Recurrent Neural Network remembers the past and its decisions are influenced by what it has learnt from the past. Note: Basic feed forward networks “remember” things too, but they remember things they learnt during training. For example, an image classifier learns what a “1” looks like during training and then uses that knowledge to classify things in production.
While RNNs learn similarly while training, in addition, they remember things learnt from prior input(s) while generating output(s). RNNs can take one or more input vectors and produce one or more output vectors and the output(s) are influenced not just by weights applied on inputs like a regular NN, but also by a “hidden” state vector representing the context based on prior input(s)/output(s). So, the same input could produce a different output depending on previous inputs in the series.

In summary, in a vanilla neural network, a fixed size input vector is transformed into a fixed size output vector. Such a network becomes “recurrent” when you repeatedly apply the transformations to a series of given input and produce a series of output vectors. There is no pre-set limitation to the size of the vector. And, in addition to generating the output which is a function of the input and hidden state, we update the hidden state itself based on the input and use it in processing the next input.
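The role of the hidden state can be sketched with a deliberately tiny recurrent cell. To keep it short, the input and the hidden state are single numbers rather than vectors, and the weights `w_x`, `w_h`, and `b` are arbitrary illustrative values rather than trained parameters.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Apply the same transformation at every step, carrying a hidden state."""
    h = 0.0            # hidden state: the network's memory of earlier inputs
    outputs = []
    for x in inputs:   # the series can be of any length
        h = math.tanh(w_x * x + w_h * h + b)
        outputs.append(h)
    return outputs

after_positive = rnn_forward([1.0, 1.0])
after_negative = rnn_forward([-1.0, 1.0])
# The final input is 1.0 in both runs, but the outputs differ because of history.
print(after_positive[-1], after_negative[-1])
```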
What is the role of the Activation Function?
The activation function introduces non-linearity into the neural network, helping it learn more complex functions. Without it, the neural network would only be able to learn a linear function, that is, a linear combination of its input data. An activation function is a function in an artificial neuron that delivers an output based on inputs.
Machine Learning libraries for various purposes

What is an Auto-Encoder?
Reference: Auto-Encoder
Auto-encoders are simple learning networks that aim to transform inputs into outputs with the minimum possible error. This means that we want the output to be as close to input as possible. We add a couple of layers between the input and the output, and the sizes of these layers are smaller than the input layer. The auto-encoder receives unlabeled input which is then encoded to reconstruct the input.
An autoencoder is a type of artificial neural network used to learn efficient data coding in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise”. Along with the reduction side, a reconstructing side is learnt, where the autoencoder tries to generate from the reduced encoding a representation as close as possible to its original input, hence its name. Several variants exist to the basic model, with the aim of forcing the learned representations of the input to assume useful properties.
Autoencoders are effectively used for solving many applied problems, from face recognition to acquiring the semantic meaning of words.

What is a Boltzmann Machine?
Boltzmann machines have a simple learning algorithm that allows them to discover interesting features that represent complex regularities in the training data. The Boltzmann machine is basically used to optimize the weights and the quantity for the given problem. The learning algorithm is very slow in networks with many layers of feature detectors. “Restricted Boltzmann Machines” algorithm has a single layer of feature detectors which makes it faster than the rest.

What Is Dropout and Batch Normalization?
Dropout is a technique of randomly dropping out hidden and visible nodes of a network to prevent overfitting (typically dropping 20 percent of the nodes). It roughly doubles the number of iterations needed for the network to converge, but it improves the network’s ability to generalize.
Batch normalization is a technique to improve the performance and stability of neural networks by normalizing the inputs to every layer so that they have a mean of zero and a standard deviation of one.
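Both ideas fit in a few lines of plain Python. This is a simplified sketch: the dropout shown is the "inverted" variant (kept activations are rescaled), and the batch normalization omits the learnable scale and shift that real layers usually add.

```python
import random
import statistics

def dropout(activations, rate=0.2, rng=random):
    """Zero each activation with probability `rate`; rescale the ones that are kept."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def batch_norm(values, eps=1e-5):
    """Shift and scale a batch to roughly zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

dropped = dropout([1.0] * 10, rate=0.2, rng=random.Random(0))
normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
print(dropped)
print(normalized)
```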
Why Is TensorFlow the Most Preferred Library in Deep Learning?
TensorFlow provides both C++ and Python APIs, making it easier to work on and has a faster compilation time compared to other Deep Learning libraries like Keras and PyTorch. TensorFlow supports both CPU and GPU computing devices.
What is Tensor in TensorFlow?
A tensor is a mathematical object represented as arrays of higher dimensions. Think of a n-D matrix. These arrays of data with different dimensions and ranks fed as input to the neural network are called “Tensors.”
What is the Computational Graph?
Everything in TensorFlow is based on creating a computational graph. It is a network of nodes in which each node performs an operation. Nodes represent mathematical operations, and edges represent tensors. Since data flows in the form of a graph, it is also called a “DataFlow Graph.”
What is logistic regression?
• Logistic Regression models a function of the target variable as a linear combination of the predictors, then converts this function into a fitted value in the desired range.
• Binary or Binomial Logistic Regression can be understood as the type of Logistic Regression that deals with scenarios wherein the observed outcomes for dependent variables can be only in binary, i.e., it can have only two possible types.
• Multinomial Logistic Regression works in scenarios where the outcome can have more than two possible types – type A vs type B vs type C – that are not in any particular order.
Credit: Vikram K.
How is logistic regression done?
Logistic regression measures the relationship between the dependent variable (our label of what we want to predict) and one or more independent variables (our features) by estimating probability using its underlying logistic function (sigmoid).
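The conversion from a linear combination to a probability can be sketched as follows. The coefficients are invented for illustration (they are not fitted to any data), and the two features are hypothetical.

```python
import math

def predict_probability(features, weights, bias):
    """Linear combination of the features (the log-odds) squashed by the sigmoid."""
    log_odds = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficients for two features, e.g. (age in decades, cholesterol level).
weights = [0.4, 0.3]
bias = -4.0
p = predict_probability([6.0, 5.0], weights, bias)
label = int(p >= 0.5)  # binary outcome using a 0.5 threshold
print(p, label)
```

Here the log-odds are about -0.1, so the probability lands just under 0.5 and the predicted class is 0.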
Explain the steps in making a decision tree.
1. Take the entire data set as input
2. Calculate entropy of the target variable, as well as the predictor attributes
3. Calculate your information gain of all attributes (we gain information on sorting different objects from each other)
4. Choose the attribute with the highest information gain as the root node
5. Repeat the same procedure on every branch until the decision node of each branch is finalized
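Steps 2 to 4 can be sketched on a toy table. The four rows below are made-up job offers, and `entropy` and `information_gain` are illustrative helper names.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

rows = [
    {"salary": "high", "commute": "short"},
    {"salary": "high", "commute": "long"},
    {"salary": "low", "commute": "short"},
    {"salary": "low", "commute": "long"},
]
labels = ["accept", "accept", "decline", "decline"]

gains = {name: information_gain(rows, labels, name) for name in ("salary", "commute")}
root = max(gains, key=gains.get)  # attribute with the highest information gain
print(gains, root)
```

In this toy data, salary separates the classes perfectly (gain of 1 bit) while commute tells us nothing (gain of 0), so salary becomes the root node.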
For example, let’s say you want to build a decision tree to decide whether you should accept or decline a job offer. The decision tree for this case is as shown:

It is clear from the decision tree that an offer is accepted if:
• Salary is greater than $50,000
• The commute is less than an hour
• Coffee is offered
How do you build a random forest model?
A random forest is built up of a number of decision trees. If you split the data into different packages and make a decision tree in each of the different groups of data, the random forest brings all those trees together.
Steps to build a random forest model:
1. Randomly select k features from a total of m features, where k << m
2. Among the k features, calculate the node D using the best split point
3. Split the node into daughter nodes using the best split
4. Repeat steps two and three until leaf nodes are finalized
5. Build the forest by repeating steps one to four n times to create n trees
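The randomization in steps 1 and 5 can be sketched without growing real trees. The code below only draws the inputs each tree would be trained on and shows how the finished trees' predictions could be combined; drawing a bootstrap sample of rows is an assumption borrowed from standard bagging practice, and all names are hypothetical.

```python
import random
from collections import Counter

def draw_tree_inputs(n_rows, feature_names, k, rng):
    """Pick k of the m features and a bootstrap sample of row indices for one tree."""
    features = rng.sample(feature_names, k)
    row_indices = [rng.randrange(n_rows) for _ in range(n_rows)]  # with replacement
    return features, row_indices

def majority_vote(tree_predictions):
    """Combine the trees: the most common predicted class wins."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
feature_names = ["salary", "commute", "coffee", "title", "team_size", "remote"]
plans = [draw_tree_inputs(100, feature_names, k=2, rng=rng) for _ in range(5)]
for features, _ in plans:
    print(features)
print(majority_vote(["accept", "decline", "accept", "accept", "decline"]))  # accept
```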
Differentiate between univariate, bivariate, and multivariate analysis.
Univariate data contains only one variable. The purpose of the univariate analysis is to describe the data and find patterns that exist within it.

The patterns can be studied by drawing conclusions using mean, median, mode, dispersion or range, minimum, maximum, etc.
Bivariate data involves two different variables. The analysis of this type of data deals with causes and relationships and the analysis is done to determine the relationship between the two variables.

Here, the relationship is visible from the table that temperature and sales are directly proportional to each other. The hotter the temperature, the better the sales.
Multivariate data involves three or more variables. It is similar to bivariate data but contains more than one dependent variable.
Example: data for house price prediction
The patterns can be studied by drawing conclusions using mean, median, and mode, dispersion or range, minimum, maximum, etc. You can start describing the data and using it to guess what the price of the house will be.
What are the feature selection methods used to select the right variables?
There are two main methods for feature selection.
Filter Methods
This involves:
• Linear discriminant analysis
• ANOVA
• Chi-Square
The best analogy for selecting features is “bad data in, bad answer out.” When we’re limiting or selecting the features, it’s all about cleaning up the data coming in.
Wrapper Methods
This involves:
• Forward Selection: We test one feature at a time and keep adding them until we get a good fit
• Backward Selection: We test all the features and start removing them to see what works better
• Recursive Feature Elimination: Recursively looks through all the different features and how they pair together
Wrapper methods are very labor-intensive, and high-end computers are needed if a lot of data analysis is performed with the wrapper method.
You are given a data set consisting of variables with more than 30 percent missing values. How will you deal with them?
If the data set is large, we can just simply remove the rows with missing data values. It is the quickest way; we use the rest of the data to predict the values.
For smaller data sets, we can impute missing values with the mean, median, or mode of the rest of the data using a pandas DataFrame in Python. There are different ways to do so, such as df.fillna(df.mean()).
Another option for imputation is KNN, for numeric or categorical values (KNN uses the k closest rows to impute the missing value).
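A short pandas sketch of the first two options, dropping rows and filling with a column statistic. The column names and values are invented, and the frame is kept all-numeric so that `df.mean()` applies to every column.

```python
import pandas as pd

# Made-up data with one missing value per column.
df = pd.DataFrame({
    "age": [25.0, None, 35.0, 40.0],
    "income": [50.0, 60.0, None, 70.0],
})

dropped = df.dropna()                  # large data sets: discard incomplete rows
mean_filled = df.fillna(df.mean())     # small data sets: fill with the column mean
median_filled = df.fillna(df.median()) # or with the column median

print(dropped)
print(mean_filled)
```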
How will you calculate the Euclidean distance in Python?
from math import sqrt  # sqrt is not a builtin, so it must be imported
plot1 = [1, 3]
plot2 = [2, 5]
The Euclidean distance can be calculated as follows:
euclidean_distance = sqrt((plot1[0] - plot2[0])**2 + (plot1[1] - plot2[1])**2)
What are dimensionality reduction and its benefits?
Dimensionality reduction refers to the process of converting a data set with vast dimensions into data with fewer dimensions (fields) to convey similar information concisely.
This reduction helps in compressing data and reducing storage space. It also reduces computation time as fewer dimensions lead to less computing. It removes redundant features; for example, there’s no point in storing a value in two different units (meters and inches).
How should you maintain a deployed model?
The steps to maintain a deployed model are (CREM):
1. Monitor: constant monitoring of all models is needed to determine their performance accuracy. When you change something, you want to figure out how your changes are going to affect things. This needs to be monitored to ensure it’s doing what it’s supposed to do.
2. Evaluate: evaluation metrics of the current model are calculated to determine if a new algorithm is needed.
3. Compare: the new models are compared to each other to determine which model performs the best.
4. Rebuild: the best performing model is re-built on the current state of data.
How can time-series data be declared as stationary?
- The mean of the series should not be a function of time.

- The variance of the series should not be a function of time. This property is known as homoscedasticity.

- The covariance of the i th term and the (i+m) th term should not be a function of time.
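A very rough way to eyeball the first two conditions is to compare the mean and variance of the first and second halves of a series. This is only an informal sketch with made-up series, not a formal statistical test of stationarity.

```python
import statistics

def halves_summary(series):
    """Compare the mean and variance of the first and second half of a series."""
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return {
        "mean_shift": statistics.fmean(second) - statistics.fmean(first),
        "variance_ratio": statistics.pvariance(second) / statistics.pvariance(first),
    }

flat = [1.0, -1.0] * 50                  # same mean and variance throughout
trend = [float(t) for t in range(100)]   # mean clearly grows with time

flat_summary = halves_summary(flat)
trend_summary = halves_summary(trend)
print(flat_summary)
print(trend_summary)
```

A mean shift near zero and a variance ratio near one are consistent with stationarity; the trending series fails the first condition because its mean depends on time.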

‘People who bought this also bought…’ recommendations seen on Amazon are a result of which algorithm?
The recommendation engine is accomplished with collaborative filtering. Collaborative filtering uses the behavior of other users and their purchase history in terms of ratings, selection, etc.
The engine makes predictions on what might interest a person based on the preferences of other users. In this algorithm, item features are unknown.
For example, a sales page shows that a certain number of people buy a new phone and also buy tempered glass at the same time. Next time, when a person buys a phone, he or she may see a recommendation to buy tempered glass as well.
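The phone and tempered glass example can be mimicked by counting which items appear together in past orders. This is a deliberately simple co-occurrence sketch on invented orders; production recommenders are far more elaborate, and nothing here describes Amazon's actual system.

```python
from collections import Counter
from itertools import permutations

def also_bought(orders):
    """For every item, count how often each other item appears in the same order."""
    counts = {}
    for order in orders:
        for item, other in permutations(set(order), 2):
            counts.setdefault(item, Counter())[other] += 1
    return counts

orders = [
    ["phone", "tempered glass"],
    ["phone", "tempered glass", "case"],
    ["phone", "tempered glass"],
    ["phone", "case"],
    ["laptop", "mouse"],
]
recommendation = also_bought(orders)["phone"].most_common(1)
print(recommendation)  # [('tempered glass', 3)]
```

Note that no item features are used, only purchase behavior, which matches the description above.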
What is a Generative Adversarial Network?
Suppose there is a wine shop purchasing wine from dealers, which they resell later. But some dealers sell fake wine. In this case, the shop owner should be able to distinguish between fake and authentic wine. The forger will try different techniques to sell fake wine and make sure specific techniques go past the shop owner’s check. The shop owner would probably get some feedback from wine experts that some of the wine is not original. The owner would have to improve how he determines whether a wine is fake or authentic.
The forger’s goal is to create wines that are indistinguishable from the authentic ones while the shop owner intends to tell if the wine is real or not accurately.

• There is a noise vector coming into the forger who is generating fake wine.
• Here the forger acts as a Generator.
• The shop owner acts as a Discriminator.
• The Discriminator gets two inputs; one is the fake wine, while the other is the real authentic wine.
The shop owner has to figure out whether it is real or fake.
So, there are two primary components of Generative Adversarial Network (GAN) named:
1. Generator
2. Discriminator
The generator is a CNN that keeps producing images that are closer and closer in appearance to the real images, while the discriminator tries to determine the difference between real and fake images. The ultimate aim is to make the discriminator learn to identify real and fake images.
You are given a dataset on cancer detection. You have built a classification model and achieved an accuracy of 96 percent. Why shouldn’t you be happy with your model performance? What can you do about it?
Cancer detection results in imbalanced data. In an imbalanced dataset, accuracy should not be used as a measure of performance. It is important to focus on the remaining four percent, which represents the patients who were wrongly diagnosed. Early diagnosis is crucial when it comes to cancer detection and can greatly improve a patient’s prognosis.
Hence, to evaluate model performance, we should use Sensitivity (True Positive Rate), Specificity (True Negative Rate), F measure to determine the class wise performance of the classifier.
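To see how 96 percent accuracy can hide a poor classifier, the sketch below computes these metrics from a confusion matrix. The counts are hypothetical: 1,000 patients, 40 of whom have cancer, and a model that finds only 10 of them.

```python
def class_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, and F1 from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1}

# Hypothetical counts: 10 cancers found, 30 missed, 10 false alarms, 950 correct negatives.
metrics = class_metrics(tp=10, fp=10, tn=950, fn=30)
print(metrics)
```

Accuracy comes out at 0.96, yet sensitivity is only 0.25: three out of four cancer patients are missed.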
We want to predict the probability of death from heart disease based on three risk factors: age, gender, and blood cholesterol level. What is the most appropriate algorithm for this case?
The most appropriate algorithm for this case is logistic regression.
After studying the behavior of a population, you have identified four specific individual types that are valuable to your study. You would like to find all users who are most similar to each individual type. Which algorithm is most appropriate for this study?
As we are looking for grouping people together specifically by four different similarities, it indicates the value of k. Therefore, K-means clustering is the most appropriate algorithm for this study.
You have run the association rules algorithm on your dataset, and the two rules {banana, apple} => {grape} and {apple, orange} => {grape} have been found to be relevant. What else must be true?
{grape, apple} must be a frequent itemset.
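The reasoning is that every subset of a frequent itemset is at least as frequent as the itemset itself, and both rules contain apple and grape. A small sketch with invented transactions shows the support of the subset bounding the support of the larger sets.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

# Made-up shopping baskets for illustration.
transactions = [
    {"banana", "apple", "grape"},
    {"apple", "orange", "grape"},
    {"apple", "grape"},
    {"banana", "milk"},
]

rule_1 = support({"banana", "apple", "grape"}, transactions)
rule_2 = support({"apple", "orange", "grape"}, transactions)
subset = support({"grape", "apple"}, transactions)
print(rule_1, rule_2, subset)  # 0.25 0.25 0.75
```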
Your organization has a website where visitors randomly receive one of two coupons. It is also possible that visitors to the website will not receive a coupon. You have been asked to determine if offering a coupon to website visitors has any impact on their purchase decisions. Which analysis method should you use?
One-way ANOVA: in statistics, one-way analysis of variance is a technique that can be used to compare means of two or more samples. This technique can be used only for numerical response data, the “Y”, usually one variable, and numerical or categorical input data, the “X”, always one variable, hence “one-way”.
The ANOVA tests the null hypothesis, which states that samples in all groups are drawn from populations with the same mean values. To do this, two estimates are made of the population variance. The ANOVA produces an F-statistic, the ratio of the variance calculated among the means to the variance within the samples. If the group means are drawn from populations with the same mean values, the variance between the group means should be lower than the variance of the samples, following the central limit theorem. A higher ratio therefore implies that the samples were drawn from populations with different mean values.
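The F-statistic described above can be computed by hand for the coupon scenario. The purchase amounts below are invented, with one group per treatment (no coupon, coupon A, coupon B); turning F into a p-value would additionally require the F distribution, which this sketch leaves out.

```python
from statistics import fmean

def one_way_anova_f(groups):
    """F = (between-group variance estimate) / (within-group variance estimate)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = fmean([x for g in groups for x in g])
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

no_coupon = [1.0, 2.0, 3.0]
coupon_a = [2.0, 3.0, 4.0]
coupon_b = [5.0, 6.0, 7.0]
f_statistic = one_way_anova_f([no_coupon, coupon_a, coupon_b])
print(f_statistic)
```

For these numbers the between-group sum of squares is 26 and the within-group sum of squares is 6, giving F = (26 / 2) / (6 / 6) = 13.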
What are the feature vectors?
A feature vector is an n-dimensional vector of numerical features that represent an object. In machine learning, feature vectors are used to represent numeric or symbolic characteristics (called features) of an object in a mathematical way that’s easy to analyze.
What is root cause analysis?
Root cause analysis was initially developed to analyze industrial accidents but is now widely used in other areas. It is a problem-solving technique used for isolating the root causes of faults or problems. A factor is called a root cause if its removal from the problem-fault-sequence prevents the final undesirable event from recurring.
Do gradient descent methods always converge to similar points?
They do not, because in some cases, they reach a local minimum or a local optimum point. You would not reach the global optimum point. This is governed by the data and the starting conditions.
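The dependence on starting conditions shows up even in one dimension. The sketch below runs plain gradient descent on f(x) = x⁴ − 3x² + x, an arbitrary curve chosen because it has two valleys; the learning rate and step count are illustrative.

```python
def f(x):
    """A curve with a deeper valley on the left and a shallower one on the right."""
    return x**4 - 3 * x**2 + x

def descend(start, learning_rate=0.01, steps=500):
    """Plain gradient descent on f, using its derivative 4x^3 - 6x + 1."""
    x = start
    for _ in range(steps):
        x -= learning_rate * (4 * x**3 - 6 * x + 1)
    return x

left = descend(-2.0)   # settles in the deeper (global) minimum near x = -1.30
right = descend(2.0)   # gets stuck in the local minimum near x = 1.13
print(left, f(left))
print(right, f(right))
```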
In your choice of language, write a program that prints the numbers ranging from one to 50. But for multiples of three, print “Fizz” instead of the number and for the multiples of five, print “Buzz.” For numbers which are multiples of both three and five, print “FizzBuzz.”
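One possible answer in Python is sketched below. The multiple-of-both case is checked first, since a number divisible by 15 would otherwise be caught by the "Fizz" branch.

```python
def fizzbuzz(n):
    """Return the FizzBuzz word for n, or n itself as a string."""
    if n % 15 == 0:      # multiple of both three and five
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

for number in range(1, 51):
    print(fizzbuzz(number))
```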

What are the different Deep Learning Frameworks?
• PyTorch: PyTorch is an open source machine learning library based on the Torch library, used for applications such as computer vision and natural language processing, primarily developed by Facebook’s AI Research lab. It is free and open-source software released under the Modified BSD license.
• TensorFlow: TensorFlow is a free and open-source software library for dataflow and differentiable programming across a range of tasks. It is a symbolic math library and is also used for machine learning applications such as neural networks. Licensed by Apache License 2.0. Developed by Google Brain Team.
• Microsoft Cognitive Toolkit: Microsoft Cognitive Toolkit describes neural networks as a series of computational steps via a directed graph.
• Keras: Keras is an open-source neural-network library written in Python. It is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, R, Theano, or PlaidML. Designed to enable fast experimentation with deep neural networks, it focuses on being user-friendly, modular, and extensible. Licensed by MIT.
Data Sciences and Data Mining Glossary
Credit: Dr. Matthew North
Antecedent: In an association rules data mining model, the antecedent is the attribute which precedes the consequent in an identified rule. Attribute order makes a difference when calculating the confidence percentage, so identifying which attribute comes first is necessary even if the reciprocal of the association is also a rule.
Archived Data: Data which have been copied out of a live production database and into a data warehouse or other permanent system where they can be accessed and analyzed, but not by primary operational business systems.
Association Rules: A data mining methodology which compares attributes in a data set across all observations to identify areas where two or more attributes are frequently found together. If their frequency of coexistence is high enough throughout the data set, the association of those attributes can be said to be a rule.
Attribute: In columnar data, an attribute is one column. It is named in the data so that it can be referred to by a model and used in data mining. The term attribute is sometimes interchanged with the terms ‘field’, ‘variable’, or ‘column’.
Average: The arithmetic mean, calculated by summing all values and dividing by the count of the values.
Binomial: A data type for any set of values that is limited to one of two numeric options.
Binominal: In RapidMiner, the data type binominal is used instead of binomial, enabling both numerical and character-based sets of values that are limited to one of two options.
Business Understanding: See Organizational Understanding: The first step in the CRISP-DM process, usually referred to as Business Understanding, where the data miner develops an understanding of an organization’s goals, objectives, questions, and anticipated outcomes relative to data mining tasks. The data miner must understand why the data mining task is being undertaken before proceeding to gather and understand data.
Case Sensitive: A situation where a computer program recognizes the uppercase version of a letter or word as being different from the lowercase version of the same letter or word.
Classification: One of the two main goals of conducting data mining activities, with the other being prediction. Classification creates groupings in a data set based on the similarity of the observations’ attributes. Some data mining methodologies, such as decision trees, can predict an observation’s classification.
Code: Code is the result of a computer worker’s work. It is a set of instructions, typed in a specific grammar and syntax, that a computer can understand and execute. According to Lawrence Lessig, it is one of four methods humans can use to set and control boundaries for behavior when interacting with computer systems.
Coefficient: In data mining, a coefficient is a value that is calculated based on the values in a data set that can be used as a multiplier or as an indicator of the relative strength of some attribute or component in a data mining model.
Column: See Attribute. In columnar data, an attribute is one column. It is named in the data so that it can be referred to by a model and used in data mining. The term attribute is sometimes interchanged with the terms ‘field’, ‘variable’, or ‘column’.
Comma Separated Values (CSV): A common text-based format for data sets where the divisions between attributes (columns of data) are indicated by commas. If commas occur naturally in some of the values in the data set, a CSV file will misunderstand these to be attribute separators, leading to misalignment of attributes.
Conclusion: See Consequent: In an association rules data mining model, the consequent is the attribute which results from the antecedent in an identified rule. If an association rule were characterized as “If this, then that”, the consequent would be that—in other words, the outcome.
Confidence (Alpha) Level: A value, usually 5% or 0.05, used to test for statistical significance in some data mining methods. If statistical significance is found, a data miner can say that there is a 95% likelihood that a calculated or predicted value is not a false positive.
Confidence Percent: In predictive data mining, this is the percent of calculated confidence that the model has calculated for one or more possible predicted values. It is a measure for the likelihood of false positives in predictions. Regardless of the number of possible predicted values, their collective confidence percentages will always total to 100%.
Consequent: In an association rules data mining model, the consequent is the attribute which results from the antecedent in an identified rule. If an association rule were characterized as “If this, then that”, the consequent would be that—in other words, the outcome.
Correlation: A statistical measure of the strength of affinity, based on the similarity of observational values, of the attributes in a data set. These can be positive (as one attribute’s values go up or down, so too does the correlated attribute’s values); or negative (correlated attributes’ values move in opposite directions). Correlations are indicated by coefficients which fall on a scale between -1 (complete negative correlation) and 1 (complete positive correlation), with 0 indicating no correlation at all between two attributes.
CRISP-DM: An acronym for Cross-Industry Standard Process for Data Mining. This process was jointly developed by several major multi-national corporations around the turn of the new millennium in order to standardize the approach to mining data. It is comprised of six cyclical steps: Business (Organizational) Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment.
Cross-validation: A method of statistically evaluating a training data set for its likelihood of producing false positives in a predictive data mining model.
Data: Data are any arrangement and compilation of facts. Data may be structured (e.g. arranged in columns (attributes) and rows (observations)), or unstructured (e.g. paragraphs of text, computer log file).
Data Analysis: The process of examining data in a repeatable and structured way in order to extract meaning, patterns or messages from a set of data.
Data Mart: A location where data are stored for easy access by a broad range of people in an organization. Data in a data mart are generally archived data, enabling analysis in a setting that does not impact live operations.
Data Mining: A computational process of analyzing data sets, usually large in nature, using both statistical and logical methods, in order to uncover hidden, previously unknown, and interesting patterns that can inform organizational decision making.
Data Preparation: The third in the six steps of CRISP-DM. At this stage, the data miner ensures that the data to be mined are clean and ready for mining. This may include handling outliers or other inconsistent data, dealing with missing values, reducing attributes or observations, setting attribute roles for modeling, etc.
Data Set: Any compilation of data that is suitable for analysis.
Data Type: In a data set, each attribute is assigned a data type based on the kind of data stored in the attribute. There are many data types which can be generalized into one of three areas: Character (Text) based; Numeric; and Date/Time. Within these categories, RapidMiner has several data types. For example, in the Character area, RapidMiner has Polynominal, Binominal, etc.; and in the Numeric area it has Real, Integer, etc.
Data Understanding: The second in the six steps of CRISP-DM. At this stage, the data miner seeks out sources of data in the organization, and works to collect, compile, standardize, define and document the data. The data miner develops a comprehension of where the data have come from, how they were collected and what they mean.
Data Warehouse: A large-scale repository for archived data which are available for analysis. Data in a data warehouse are often stored in multiple formats (e.g. by week, month, quarter and year), facilitating large scale analyses at higher speeds. The data warehouse is populated by extracting data from operational systems so that analyses do not interfere with live business operations.
Database: A structured organization of facts that is organized such that the facts can be reliably and repeatedly accessed. The most common type of database is a relational database, in which facts (data) are arranged in tables of columns and rows. The data are then accessed using a query language, usually SQL (Structured Query Language), in order to extract meaning from the tables.
Decision Tree: A data mining methodology where leaves and nodes are generated to construct a predictive tree, whereby a data miner can see the attributes which are most predictive of each possible outcome in a target (label) attribute.
Denormalization: The process of removing relational organization from data, reintroducing redundancy into the data, but simultaneously eliminating the need for joins in a relational database, enabling faster querying.
Dependent Variable (Attribute): The attribute in a data set that is being acted upon by the other attributes. It is the thing we want to predict, the target, or label, attribute in a predictive model.
Deployment: The sixth and final of the six steps of CRISP-DM. At this stage, the data miner takes the results of data mining activities and puts them into practice in the organization. The data miner watches closely and collects data to determine if the deployment is successful and ethical. Deployment can happen in stages, such as through pilot programs before a full-scale roll out.
Descartes’ Rule of Change: An ethical framework set forth by Rene Descartes which states that if an action cannot be taken repeatedly, it cannot be ethically taken even once.
Design Perspective: The view in RapidMiner where a data miner adds operators to a data mining stream, sets those operators’ parameters, and runs the model.
Discriminant Analysis: A predictive data mining model which attempts to compare the values of all observations across all attributes and identify where natural breaks occur from one category to another, and then predict which category each observation in the data set will fall into.
Ethics: A set of moral codes or guidelines that an individual develops to guide his or her decision making in order to make fair and respectful decisions and engage in right actions. Ethical standards are higher than legally required minimums.
Evaluation: The fifth of the six steps of CRISP-DM. At this stage, the data miner reviews the results of the data mining model, interprets results and determines how useful they are. He or she may also conduct an investigation into false positives or other potentially misleading results.
False Positive: A predicted value that ends up not being correct.
Field: See Attribute: In columnar data, an attribute is one column. It is named in the data so that it can be referred to by a model and used in data mining. The term attribute is sometimes interchanged with the terms ‘field’, ‘variable’, or ‘column’.
Frequency Pattern: A recurrence of the same, or similar, observations numerous times in a single data set.
Fuzzy Logic: A data mining concept often associated with neural networks where predictions are made using a training data set, even though some uncertainty exists regarding the data and a model’s predictions.
Gain Ratio: One of several algorithms used to construct decision tree models.
Gini Index: An algorithm created by Corrado Gini that can be used to generate decision tree models.
Heterogeneity: In statistical analysis, this is the amount of variety found in the values of an attribute.
Inconsistent Data: These are values in an attribute in a data set that are out-of-the-ordinary among the whole set of values in that attribute. They can be statistical outliers, or other values that simply don’t make sense in the context of the ‘normal’ range of values for the attribute. They are generally replaced or removed during the Data Preparation phase of CRISP-DM.
Independent Variable (Attribute): These are attributes that act on the dependent attribute (the target, or label). They are used to help predict the label in a predictive model.
Jittering: The process of adding a small, random decimal to discrete values in a data set so that when they are plotted in a scatter plot, they are slightly apart from one another, enabling the analyst to better see clustering and density.
Join: The process of connecting two or more tables in a relational database together so that their attributes can be accessed in a single query, such as in a view.
Kant’s Categorical Imperative: An ethical framework proposed by Immanuel Kant which states that if everyone cannot ethically take some action, then no one can ethically take that action.
k-Means Clustering: A data mining methodology that uses the mean (average) values of the attributes in a data set to group each observation into a cluster of other observations whose values are most similar to the mean for that cluster.
Label: In RapidMiner, this is the role that must be set in order to use an attribute as the dependent, or target, attribute in a predictive model.
Laws: These are regulatory statutes which have associated consequences that are established and enforced by a governmental agency. According to Lawrence Lessig, these are one of the four methods for establishing boundaries to define and regulate social behavior.
Leaf: In a decision tree data mining model, this is the terminal end point of a branch, indicating the predicted outcome for observations whose values follow that branch of the tree.
Linear Regression: A predictive data mining method which uses the algebraic formula for calculating the slope of a line in order to predict where a given observation will likely fall along that line.
Logistic Regression: A predictive data mining method which uses the logistic (sigmoid) function to predict one of a set of possible outcomes, along with a probability that the prediction will be the actual outcome.
Markets: A socio-economic construct in which peoples’ buying, selling, and exchanging behaviors define the boundaries of acceptable or unacceptable behavior. Lawrence Lessig offers this as one of four methods for defining the parameters of appropriate behavior.
Mean: See Average: The arithmetic mean, calculated by summing all values and dividing by the count of the values.
Median: With the Mean and Mode, this is one of three generally used Measures of Central Tendency. It is an arithmetic way of defining what ‘normal’ looks like in a numeric attribute. It is calculated by rank ordering the values in an attribute and finding the one in the middle. If there are an even number of observations, the two in the middle are averaged to find the median.
Meta Data: These are facts that describe the observational values in an attribute. Meta data may include who collected the data, when, why, where, how, how often; and usually include some descriptive statistics such as the range, average, standard deviation, etc.
Missing Data: These are instances in an observation where one or more attributes do not have a value. It is not the same as zero, because zero is a value. Missing data are like Null values in a database: they are either unknown or undefined. These are usually replaced or removed during the Data Preparation phase of CRISP-DM.
Mode: With Mean and Median, this is one of three common Measures of Central Tendency. It is the value in an attribute which is the most common. It can be numerical or text. If an attribute contains two or more values that appear an equal number of times and more than any other values, then all are listed as the mode, and the attribute is said to be Bimodal or Multimodal.
Model: A computer-based representation of real-life events or activities, constructed upon the basis of data which represent those events.
Name (Attribute): This is the text descriptor of each attribute in a data set. In RapidMiner, the first row of an imported data set should be designated as the attribute name, so that these are not interpreted as the first observation in the data set.
Neural Network: A predictive data mining methodology which tries to mimic human brain processes by comparing the values of all attributes in a data set to one another through the use of a hidden layer of nodes. The frequencies with which the attribute values match, or are strongly similar, create neurons which become stronger at higher frequencies of similarity.
n-Gram: In text mining, this is a combination of words or word stems that represents a phrase that may have more meaning or significance than would a single word or stem.
Node: A terminal or mid-point in decision trees and neural networks where an attribute branches or forks away from other terminals or branches because the values represented at that point have become significantly different from all other values for that attribute.
Normalization: In a relational database, this is the process of breaking data out into multiple related tables in order to reduce redundancy and eliminate multivalued dependencies.
Null: The absence of a value in a database. The value is unrecorded, unknown, or undefined. See Missing Data.
Observation: A row of data in a data set. It consists of the value assigned to each attribute for one record in the data set. It is sometimes called a tuple in database language.
Online Analytical Processing (OLAP): A database concept where data are collected and organized in a way that facilitates analysis, rather than practical, daily operational work. Evaluating data in a data warehouse is an example of OLAP. The underlying structure that collects and holds the data makes analysis faster, but would slow down transactional work.
Online Transaction Processing (OLTP): A database concept where data are collected and organized in a way that facilitates fast and repeated transactions, rather than broader analytical work. Scanning items being purchased at a cash register is an example of OLTP. The underlying structure that collects and holds the data makes transactions faster, but would slow down analysis.
Operational Data: Data which are generated as a result of day-to-day work (e.g. the entry of work orders for an electrical service company).
Operator: In RapidMiner, an operator is any one of more than 100 tools that can be added to a data mining stream in order to perform some function. Functions range from adding a data set, to setting an attribute’s role, to applying a modeling algorithm. Operators are connected into a stream by way of ports connected by splines.
Organizational Data: These are data which are collected by an organization, often in aggregate or summary format, in order to address a specific question or tell a story. They may be constructed from Operational Data, or added to through other means such as surveys, questionnaires or tests.
Organizational Understanding: The first step in the CRISP-DM process, usually referred to as Business Understanding, where the data miner develops an understanding of an organization’s goals, objectives, questions, and anticipated outcomes relative to data mining tasks. The data miner must understand why the data mining task is being undertaken before proceeding to gather and understand data.
Parameters: In RapidMiner, these are the settings that control values and thresholds that an operator will use to perform its job. These may be the attribute name and role in a Set Role operator, or the algorithm the data miner desires to use in a model operator.
Port: The input or output required for an operator to perform its function in RapidMiner. These are connected to one another using splines.
Prediction: The target, or label, or dependent attribute that is generated by a predictive model, usually for a scoring data set in a model.
Premise: See Antecedent: In an association rules data mining model, the antecedent is the attribute which precedes the consequent in an identified rule. Attribute order makes a difference when calculating the confidence percentage, so identifying which attribute comes first is necessary even if the reciprocal of the association is also a rule.
Privacy: The concept describing a person’s right to be let alone; to have information about them kept away from those who should not, or do not need to, see it. A data miner must always respect and safeguard the privacy of individuals represented in the data he or she mines.
Professional Code of Conduct: A helpful guide or documented set of parameters by which an individual in a given profession agrees to abide. These are usually written by a board or panel of experts and adopted formally by a professional organization.
Query: A method of structuring a question, usually using code, that can be submitted to, interpreted, and answered by a computer.
Record: See Observation: A row of data in a data set. It consists of the value assigned to each attribute for one record in the data set. It is sometimes called a tuple in database language.
Relational Database: A computerized repository, comprised of entities that relate to one another through keys. The most basic and elemental entity in a relational database is the table, and tables are made up of attributes. One or more of these attributes serves as a key that can be matched (or related) to a corresponding attribute in another table, creating the relational effect which reduces data redundancy and eliminates multivalued dependencies.
Repository: In RapidMiner, this is the place where imported data sets are stored so that they are accessible for modeling.
Results Perspective: The view in RapidMiner that is seen when a model has been run. It is usually comprised of two or more tabs which show meta data, data in a spreadsheet-like view, and predictions and model outcomes (including graphical representations where applicable).
Role (Attribute): In a data mining model, each attribute must be assigned a role. The role is the part the attribute plays in the model. It is usually equated to serving as an independent variable (regular), or dependent variable (label).
Row: See Observation: A row of data in a data set. It consists of the value assigned to each attribute for one record in the data set. It is sometimes called a tuple in database language.
Sample: A subset of an entire data set, selected randomly or in a structured way. This usually reduces a data set down, allowing models to be run faster, especially during development and proof-of-concept work on a model.
Scoring Data: A data set with the same attributes as a training data set in a predictive model, with the exception of the label. The training data set, with the label defined, is used to create a predictive model, and that model is then applied to a scoring data set possessing the same attributes in order to predict the label for each scoring observation.
Social Norms: These are the sets of behaviors and actions that are generally tolerated and found to be acceptable in a society. According to Lawrence Lessig, these are one of four methods of defining and regulating appropriate behavior.
Spline: In RapidMiner, these lines connect the ports between operators, creating the stream of a data mining model.
Standard Deviation: One of the most common statistical measures of how dispersed the values in an attribute are. This measure can help determine whether or not there are outliers (a common type of inconsistent data) in a data set.
Standard Operating Procedures: These are organizational guidelines that are documented and shared with employees which help to define the boundaries for appropriate and acceptable behavior in the business setting. They are usually created and formally adopted by a group of leaders in the organization, with input from key stakeholders in the organization.
Statistical Significance: In statistically-based data mining activities, this is the measure of whether or not the model has yielded any results that are mathematically reliable enough to be used. Any model lacking statistical significance should not be used in operational decision making.
Stemming: In text mining, this is the process of reducing like-terms down into a single, common token (e.g. country, countries, country’s, countryman, etc. → countr).
Stopwords: In text mining, these are small words that are necessary for grammatical correctness, but which carry little meaning or power in the message of the text being mined. These are often articles, prepositions or conjunctions, such as ‘a’, ‘the’, ‘and’, etc., and are usually removed in the Process Document operator’s sub-process.
Stream: This is the string of operators in a data mining model, connected through the operators’ ports via splines, that represents all actions that will be taken on a data set in order to mine it.
Structured Query Language (SQL): The set of codes, reserved keywords and syntax defined by the American National Standards Institute used to create, manage and use relational databases.
Sub-process: In RapidMiner, this is a stream of operators set up to apply a series of actions to all inputs connected to the parent operator.
Support Percent: In an association rule data mining model, this is the percent of observations in which the antecedent and the consequent are found together. Since this is calculated as the number of times the two are found together divided by the total number of times they could have been found together, the Support Percent is the same for reciprocal rules.
Table: In data collection, a table is a grid of columns and rows, where in general, the columns are individual attributes in the data set, and the rows are observations across those attributes. Tables are the most elemental entity in relational databases.
Target Attribute: See Label; Dependent Variable: The attribute in a data set that is being acted upon by the other attributes. It is the thing we want to predict, the target, or label, attribute in a predictive model.
Technology: Any tool or process invented by mankind to do or improve work.
Text Mining: The process of data mining unstructured text-based data such as essays, news articles, speech transcripts, etc. to discover patterns of word or phrase usage to reveal deeper or previously unrecognized meaning.
Token (Tokenize): In text mining, this is the process of turning words in the input document(s) into attributes that can be mined.
Training Data: In a predictive model, this data set already has the label, or dependent variable defined, so that it can be used to create a model which can be applied to a scoring data set in order to generate predictions for the latter.
Tuple: See Observation: A row of data in a data set. It consists of the value assigned to each attribute for one record in the data set. It is sometimes called a tuple in database language.
Variable: See Attribute: In columnar data, an attribute is one column. It is named in the data so that it can be referred to by a model and used in data mining. The term attribute is sometimes interchanged with the terms ‘field’, ‘variable’, or ‘column’.
View: A type of pseudo-table in a relational database which is actually a named, stored query. This query runs against one or more tables, retrieving a defined number of attributes that can then be referenced as if they were in a table in the database. Views can limit users’ ability to see attributes to only those that are relevant and/or approved for those users to see. They can also speed up the query process because although they may contain joins, the key columns for the joins can be indexed and cached, making the view’s query run faster than it would if it were not stored as a view. Views can be useful in data mining as data miners can be given read-only access to the view, upon which they can build data mining models, without having to have broader administrative rights on the database itself.
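The glossary entries for Support Percent, Confidence Percent, Antecedent and Consequent can be made concrete with a small sketch. It assumes support is the share of all observations that contain both items, and confidence is the share of observations containing the antecedent that also contain the consequent. The basket data and the function names are hypothetical, chosen only for illustration.

```python
def support(transactions, antecedent, consequent):
    """Share of all observations that contain both items."""
    both = sum(1 for t in transactions if antecedent in t and consequent in t)
    return both / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Share of antecedent-containing observations that also contain the consequent."""
    with_antecedent = [t for t in transactions if antecedent in t]
    both = sum(1 for t in with_antecedent if consequent in t)
    return both / len(with_antecedent)


# Hypothetical market-basket data: each set is one observation.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"bread", "jam"},
    {"butter"},
]

support_ab = support(baskets, "bread", "butter")    # rule: bread -> butter
support_ba = support(baskets, "butter", "bread")    # reciprocal rule
conf_ab = confidence(baskets, "bread", "butter")
conf_ba = confidence(baskets, "butter", "bread")

print(f"support:    bread->butter={support_ab:.2f}  butter->bread={support_ba:.2f}")
print(f"confidence: bread->butter={conf_ab:.2f}  butter->bread={conf_ba:.2f}")
```

Under these assumptions the support is identical for the rule and its reciprocal, while the confidence differs, which is why attribute order matters when calculating the confidence percentage.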
What is the Central Limit Theorem and why is it important?
An Introduction to the Central Limit Theorem
Answer: Suppose that we are interested in estimating the average height among all people. Collecting data for every person in the world is impractical, bordering on impossible. While we can’t obtain a height measurement from everyone in the population, we can still sample some people. The question now becomes: what can we say about the average height of the entire population given a single sample?
The Central Limit Theorem addresses this question exactly. Formally, it states that if we repeatedly draw random samples of a sufficiently large size from a population with finite variance, the means of those samples will be approximately normally distributed, with a mean tending to the mean of the population and a variance equal to the variance of the population divided by the sample size.
What’s especially important is that this will be true regardless of the distribution of the original population.

As we can see, the distribution is pretty ugly. It certainly isn’t normal, uniform, or any other commonly known distribution. In order to sample from the above distribution, we need to define a sample size, referred to as N. This is the number of observations that we will sample at a time. Suppose that we choose N to be 3. This means that we will sample in groups of 3. So for the above population, we might sample groups such as [5, 20, 41], [60, 17, 82], [8, 13, 61], and so on.
Suppose that we gather 1,000 samples of 3 from the above population. For each sample, we can compute its average. If we do that, we will have 1,000 averages. This set of 1,000 averages is called a sampling distribution, and according to the Central Limit Theorem, the sampling distribution will approach a normal distribution as the sample size N used to produce it increases. Here is what our sampling distribution looks like for N = 3.

As we can see, it certainly looks uni-modal, though not necessarily normal. If we repeat the same process with a larger sample size, we should see the sampling distribution start to become more normal. Let’s repeat the same process again with N = 10. Here is the sampling distribution for that sample size.

Credit: Steve Nouri
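The sampling procedure described above can be simulated in a few lines. This is a minimal sketch, not the original author's code: the population is a hypothetical skewed mixture invented for the example, samples are drawn with replacement, and the sample sizes 3, 10 and 50 are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical, clearly non-normal population: a skewed mix of two groups.
population = (
    [random.expovariate(1 / 10) for _ in range(8000)]
    + [random.uniform(60, 90) for _ in range(2000)]
)
pop_mean = statistics.fmean(population)
pop_var = statistics.pvariance(population)


def sampling_distribution(sample_size, n_samples=1000):
    """Draw n_samples random samples and return the mean of each one."""
    return [
        statistics.fmean(random.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]


print(f"population mean={pop_mean:.2f}  population variance={pop_var:.2f}")
for n in (3, 10, 50):
    means = sampling_distribution(n)
    print(
        f"N={n:>2}  mean of sample means={statistics.fmean(means):6.2f}  "
        f"variance of sample means={statistics.pvariance(means):7.2f}  "
        f"population variance / N={pop_var / n:7.2f}"
    )
```

If the theorem holds for this population, the mean of the sample means should sit close to the population mean, and their variance should track the population variance divided by N. Plotting a histogram of `means` for each N would show the shape moving toward a bell curve.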
What is bias-variance trade-off?
Bias: Bias is an error introduced in the model due to the oversimplification of the algorithm used (does not fit the data properly). It can lead to under-fitting.
Low bias machine learning algorithms — Decision Trees, k-NN and SVM
High bias machine learning algorithms — Linear Regression, Logistic Regression
Variance: Variance is error introduced in the model by an overly complex algorithm: the model performs very well on the training set but poorly on the test set. It can lead to high sensitivity to the training data and overfitting.
Possible high variance – polynomial regression
Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias in the model. However, this only happens until a particular point. As you continue to make your model more complex, you end up over-fitting your model and hence your model will start suffering from high variance.

Bias-Variance trade-off: The goal of any supervised machine learning algorithm is to have low bias and low variance to achieve good prediction performance.
1. The k-nearest neighbor algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k, which increases the number of neighbors that contribute to the prediction and in turn increases the bias of the model.
2. The support vector machine algorithm has low bias and high variance, but the trade-off can be changed through the C parameter, which influences the number of violations of the margin allowed in the training data. Allowing more violations increases the bias but decreases the variance. Conventions differ: where C is defined as a budget for violations, this means increasing C; in libraries such as scikit-learn, where C penalizes violations, it means decreasing C.
3. The decision tree has low bias and high variance. To reduce the variance, you can decrease the depth of the tree or use fewer attributes.
4. Linear regression has low variance and high bias. To reduce the bias, you can increase the number of features or use another regression that better fits the data.
There is no escaping the relationship between bias and variance in machine learning. Increasing the bias will decrease the variance. Increasing the variance will decrease bias.
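The effect of k on a k-nearest neighbor model can be seen with a small experiment. This is a minimal sketch under stated assumptions: the data are noisy samples of a hypothetical true function sin(x), the k-NN regressor is a hand-written one-dimensional version, and the values of k are arbitrary.

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is repeatable


def make_data(n):
    """Noisy observations of a hypothetical true function sin(x)."""
    xs = [random.uniform(0, 6) for _ in range(n)]
    return [(x, math.sin(x) + random.gauss(0, 0.3)) for x in xs]


def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train, data, k):
    """Mean squared error of k-NN predictions on data."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)


train, test = make_data(60), make_data(200)

for k in (1, 5, 20, 60):
    print(
        f"k={k:>2}  train MSE={mse(train, train, k):.3f}  "
        f"test MSE={mse(train, test, k):.3f}"
    )
```

With k = 1 the model reproduces every training point, so the training error is zero while the test error is not (low bias, high variance). With k equal to the size of the training set, every prediction is the overall average, so both errors are large (high bias, low variance).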
The Best Medium-Hard Data Analyst SQL Interview Questions
compiled by Google Data Analyst Zachary Thomas!
Self-Join Practice Problems: MoM Percent Change
Context: Oftentimes it’s useful to know how much a key metric, such as monthly active users, changes between months.
Say we have a table logins in the form:

Task: Find the month-over-month percentage change for monthly active users (MAU).
Solution:
(This solution, like other solution code blocks you will see in this doc, contains comments about SQL syntax that may differ between flavors of SQL or other comments about the solutions as listed)
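A minimal sketch of one self-join approach, run through Python's built-in `sqlite3` module so it can be executed as-is. The column names `user_id` and `login_date`, the sample rows, and the SQLite date functions are assumptions made for this example; other SQL flavors would typically use `DATE_TRUNC` and interval arithmetic instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per login event.
conn.execute("CREATE TABLE logins (user_id INTEGER, login_date TEXT)")
conn.executemany(
    "INSERT INTO logins VALUES (?, ?)",
    [
        (1, "2024-01-05"), (2, "2024-01-09"),                     # January: 2 users
        (1, "2024-02-02"), (2, "2024-02-11"), (3, "2024-02-20"),  # February: 3 users
        (1, "2024-03-03"), (1, "2024-03-04"),                     # March: 1 user
    ],
)

query = """
WITH mau AS (
    -- One row per calendar month with its count of distinct active users.
    SELECT date(login_date, 'start of month') AS month_start,
           COUNT(DISTINCT user_id)            AS active_users
    FROM logins
    GROUP BY date(login_date, 'start of month')
)
SELECT prev.month_start AS previous_month,
       cur.month_start  AS current_month,
       ROUND(100.0 * (cur.active_users - prev.active_users)
             / prev.active_users, 2) AS percent_change
FROM mau AS cur
JOIN mau AS prev
  ON prev.month_start = date(cur.month_start, '-1 month')  -- self-join on the prior month
ORDER BY cur.month_start
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Multiplying by `100.0` rather than `100` forces floating-point division, which matters in SQL flavors that truncate integer division.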

Tree Structure Labeling with SQL
Context: Say you have a table tree with a column of nodes and a column of corresponding parent nodes.
Task: Write SQL such that we label each node as a “Leaf”, “Inner” or “Root” node, such that for the nodes above we get:
A solution which works for the above example will receive full credit, although you can receive extra credit for providing a solution that is generalizable to a tree of any depth (not just depth = 2, as is the case in the example above).
Solution: This solution works for the example above with tree depth = 2, but is not generalizable beyond that.
An alternate solution, that is generalizable to any tree depth:
Acknowledgement: this more generalizable solution was contributed by Fabian Hofmann
An alternate solution, without explicit joins:
Acknowledgement: William Chargin on 5/2/20 noted that WHERE parent IS NOT NULL is needed to make this solution return Leaf instead of NULL.
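A minimal sketch of a join-free approach that works for a tree of any depth, run through Python's `sqlite3` module. The column names `node` and `parent` and the sample rows are assumptions made for this example. The `WHERE parent IS NOT NULL` filter in the subquery matters because `NOT IN` against a list containing NULL never evaluates to true.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tree: node 5 is the root, 2 and 3 are inner nodes, 1 and 4 are leaves.
conn.execute("CREATE TABLE tree (node INTEGER, parent INTEGER)")
conn.executemany(
    "INSERT INTO tree VALUES (?, ?)",
    [(1, 2), (2, 5), (3, 5), (4, 3), (5, None)],
)

query = """
SELECT node,
       CASE
           WHEN parent IS NULL THEN 'Root'   -- no parent at all
           WHEN node NOT IN (SELECT parent
                             FROM tree
                             WHERE parent IS NOT NULL) THEN 'Leaf'  -- never appears as a parent
           ELSE 'Inner'                      -- has a parent and at least one child
       END AS label
FROM tree
ORDER BY node
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```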
Retained Users Per Month with SQL
Acknowledgement: this problem is adapted from SiSense’s “Using Self Joins to Calculate Your Retention, Churn, and Reactivation Metrics” blog post
PART 1:
Context: Say we have login data in the table logins:
Task: Write a query that gets the number of retained users per month. In this case, retention for a given month is defined as the number of users who logged in that month who also logged in the immediately previous month.
Solution:
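A minimal sketch of a self-join for retained users, run through Python's `sqlite3` module. The column names `user_id` and `login_date`, the sample rows, and the SQLite date modifiers are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per login event.
conn.execute("CREATE TABLE logins (user_id INTEGER, login_date TEXT)")
conn.executemany(
    "INSERT INTO logins VALUES (?, ?)",
    [
        (1, "2024-01-03"), (1, "2024-02-10"), (1, "2024-03-15"),
        (2, "2024-01-20"), (2, "2024-02-21"),
        (3, "2024-02-05"),
        (4, "2024-01-08"), (4, "2024-03-09"),
    ],
)

query = """
SELECT date(cur.login_date, 'start of month') AS month_start,
       COUNT(DISTINCT cur.user_id)            AS retained_users
FROM logins AS cur
JOIN logins AS prev
  ON prev.user_id = cur.user_id
     -- keep only pairs where the earlier login falls in the immediately previous month
 AND date(prev.login_date, 'start of month')
     = date(cur.login_date, 'start of month', '-1 month')
GROUP BY date(cur.login_date, 'start of month')
ORDER BY month_start
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

`COUNT(DISTINCT ...)` guards against users who log in several times in either month, since the self-join would otherwise count them once per matching pair of logins.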
PART 2:
Task: Now we’ll take retention and turn it on its head: Write a query to find how many users from last month did not come back this month, i.e. the number of churned users.
Solution:
Note that there are solutions to this problem that can use LEFT or RIGHT joins.
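A minimal sketch of the LEFT JOIN variant, run through Python's `sqlite3` module. The column names, sample rows and the `monthly` helper CTE are assumptions made for this example. The sketch also skips the month after the last one with data, since every user would otherwise look churned there; whether to do so is a design choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per login event.
conn.execute("CREATE TABLE logins (user_id INTEGER, login_date TEXT)")
conn.executemany(
    "INSERT INTO logins VALUES (?, ?)",
    [
        (1, "2024-01-03"), (1, "2024-02-10"), (1, "2024-03-15"),
        (2, "2024-01-20"), (2, "2024-02-21"),
        (3, "2024-02-05"),
        (4, "2024-01-08"), (4, "2024-03-09"),
    ],
)

query = """
WITH monthly AS (
    -- One row per user per month in which they were active.
    SELECT DISTINCT user_id, date(login_date, 'start of month') AS month_start
    FROM logins
)
SELECT date(prev.month_start, '+1 month') AS churn_month,
       COUNT(*)                           AS churned_users
FROM monthly AS prev
LEFT JOIN monthly AS cur
  ON cur.user_id = prev.user_id
 AND cur.month_start = date(prev.month_start, '+1 month')
WHERE cur.user_id IS NULL                                      -- no login in the following month
  AND prev.month_start < (SELECT MAX(month_start) FROM monthly)  -- ignore the month with no data yet
GROUP BY prev.month_start
ORDER BY prev.month_start
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```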
PART 3:
Context: You now want to see the number of active users this month who have been reactivated, in other words, users who had churned but became active again this month. Keep in mind a user can reactivate after churning before the previous month. An example of this could be a user active in February (appears in logins), no activity in March and April, but then active again in May (appears in logins), so they count as a reactivated user for May.
Task: Create a table that contains the number of reactivated users per month.
Solution:
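A minimal sketch for reactivated users, run through Python's `sqlite3` module. It assumes a user counts as reactivated in a month when they were active that month, inactive in the immediately previous month, and active in some earlier month. The column names, sample rows and the `monthly` helper CTE are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per login event.
conn.execute("CREATE TABLE logins (user_id INTEGER, login_date TEXT)")
conn.executemany(
    "INSERT INTO logins VALUES (?, ?)",
    [
        (1, "2024-02-10"), (1, "2024-05-12"),                     # gap in March and April
        (2, "2024-02-01"), (2, "2024-03-01"), (2, "2024-04-01"), (2, "2024-05-01"),
        (3, "2024-01-15"), (3, "2024-03-20"),                     # gap in February
        (4, "2024-05-05"),                                        # new user, not reactivated
    ],
)

query = """
WITH monthly AS (
    -- One row per user per month in which they were active.
    SELECT DISTINCT user_id, date(login_date, 'start of month') AS month_start
    FROM logins
)
SELECT cur.month_start,
       COUNT(*) AS reactivated_users
FROM monthly AS cur
WHERE EXISTS (           -- active at some point before the previous month
          SELECT 1 FROM monthly AS earlier
          WHERE earlier.user_id = cur.user_id
            AND earlier.month_start < date(cur.month_start, '-1 month'))
  AND NOT EXISTS (       -- but not active in the previous month
          SELECT 1 FROM monthly AS prev
          WHERE prev.user_id = cur.user_id
            AND prev.month_start = date(cur.month_start, '-1 month'))
GROUP BY cur.month_start
ORDER BY cur.month_start
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Comparing the ISO-formatted month strings with `<` works because that format sorts chronologically as text.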
Cumulative Sums with SQL
Acknowledgement: This problem was inspired by Sisense’s “Cash Flow modeling in SQL” blog post
Context: Say we have a table transactions in the form:
Where cash_flow is the revenues minus costs for each day.
Task: Write a query to get cumulative cash flow for each day such that we end up with a table in the form below:
Solution using a window function (more efficient):
Alternative Solution (less efficient):
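A minimal sketch of both approaches side by side, run through Python's `sqlite3` module. The `date` column name and the sample rows are assumptions made for this example, and the window version needs SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per day, cash_flow = revenues minus costs.
conn.execute("CREATE TABLE transactions (date TEXT, cash_flow INTEGER)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("2024-01-01", -1000), ("2024-01-02", -100), ("2024-01-03", 50)],
)

window_query = """
SELECT date,
       SUM(cash_flow) OVER (ORDER BY date) AS cumulative_cf  -- running total up to each day
FROM transactions
ORDER BY date
"""

self_join_query = """
SELECT a.date,
       SUM(b.cash_flow) AS cumulative_cf
FROM transactions AS a
JOIN transactions AS b
  ON b.date <= a.date          -- pair each day with itself and every earlier day
GROUP BY a.date
ORDER BY a.date
"""

window_rows = conn.execute(window_query).fetchall()
join_rows = conn.execute(self_join_query).fetchall()
print(window_rows)
print(join_rows)
```

The self-join compares every row with every earlier row, so its work grows roughly with the square of the table size, whereas the window function can compute the running total in a single ordered pass.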
Rolling Averages with SQL
Acknowledgement: This problem is adapted from Sisense’s “Rolling Averages in MySQL and SQL Server” blog post
Note: there are different ways to compute rolling/moving averages. Here we’ll use a preceding average which means that the metric for the 7th day of the month would be the average of the preceding 6 days and that day itself.
Context: Say we have table signups in the form:
Task: Write a query to get 7-day rolling (preceding) average of daily sign ups
Solution 1:
Solution 2 (using a window function, more efficient):
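A minimal sketch of both approaches, run through Python's `sqlite3` module. The column names `signup_date` and `sign_ups` and the sample rows are assumptions made for this example. The window version needs SQLite 3.25 or newer and assumes exactly one row per day with no gaps, because it counts rows rather than calendar days.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per day with that day's sign-up count.
conn.execute("CREATE TABLE signups (signup_date TEXT, sign_ups INTEGER)")
start = date(2024, 1, 1)
conn.executemany(
    "INSERT INTO signups VALUES (?, ?)",
    [((start + timedelta(days=i)).isoformat(), 10 * (i + 1)) for i in range(10)],
)

self_join_query = """
SELECT a.signup_date,
       AVG(b.sign_ups) AS avg_sign_ups
FROM signups AS a
JOIN signups AS b
  ON b.signup_date BETWEEN date(a.signup_date, '-6 days') AND a.signup_date
GROUP BY a.signup_date
ORDER BY a.signup_date
"""

window_query = """
SELECT signup_date,
       AVG(sign_ups) OVER (ORDER BY signup_date
                           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_sign_ups
FROM signups
ORDER BY signup_date
"""

join_rows = conn.execute(self_join_query).fetchall()
window_rows = conn.execute(window_query).fetchall()
for row in window_rows:
    print(row)
```

In both versions the first six days are averaged over fewer than seven values, since no earlier rows exist. Filtering those rows out is a reasonable alternative if only full 7-day windows are wanted.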
Multiple Join Conditions in SQL
Acknowledgement: This problem was inspired by Sisense’s “Analyzing Your Email with SQL” blog post
Context: Say we have a table emails that includes emails sent to and from zach@g.com:
Task: Write a query to get the response time per email (id) sent to zach@g.com. Do not include ids that did not receive a response from zach@g.com. Assume each email thread has a unique subject. Keep in mind a thread may have multiple responses back-and-forth between zach@g.com and another email address.
Solution:
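A minimal sketch of a self-join with several join conditions, run through Python's `sqlite3` module. The column names `sender`, `recipient` and `sent_at`, the sample rows, and the choice to report whole minutes are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per email.
conn.execute(
    "CREATE TABLE emails (id INTEGER, subject TEXT, sender TEXT, recipient TEXT, sent_at TEXT)"
)
conn.executemany(
    "INSERT INTO emails VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Lunch", "alice@g.com", "zach@g.com", "2024-01-01 09:00:00"),
        (2, "Lunch", "zach@g.com", "alice@g.com", "2024-01-01 09:30:00"),
        (3, "Lunch", "alice@g.com", "zach@g.com", "2024-01-01 10:00:00"),
        (4, "Lunch", "zach@g.com", "alice@g.com", "2024-01-01 10:45:00"),
        (5, "Report", "bob@g.com", "zach@g.com", "2024-01-01 11:00:00"),  # never answered
    ],
)

query = """
SELECT a.id,
       (CAST(strftime('%s', MIN(b.sent_at)) AS INTEGER)
        - CAST(strftime('%s', a.sent_at) AS INTEGER)) / 60 AS response_minutes
FROM emails AS a
JOIN emails AS b
  ON b.subject   = a.subject     -- same thread
 AND b.sender    = a.recipient   -- the reply comes from the original recipient
 AND b.recipient = a.sender      -- and goes back to the original sender
 AND b.sent_at   > a.sent_at     -- and is sent later
WHERE a.recipient = 'zach@g.com'
GROUP BY a.id, a.sent_at
ORDER BY a.id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Taking `MIN(b.sent_at)` picks the first reply after each incoming email, which is what keeps long back-and-forth threads from inflating the response time. The inner join drops emails that never received a reply.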
SQL Window Function Practice Problems
#1: Get the ID with the highest value
Context: Say we have a table salaries with data on employee salary and department in the following format:
Task: Write a query to get the empno with the highest salary. Make sure your solution can handle ties!
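A minimal sketch using `RANK()`, run through Python's `sqlite3` module. The column names `depname`, `empno` and `salary` and the sample rows are assumptions made for this example, and window functions need SQLite 3.25 or newer. `RANK()` gives tied salaries the same rank, so every employee sharing the top salary is returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and rows; employees 1 and 8 tie for the highest salary.
conn.execute("CREATE TABLE salaries (depname TEXT, empno INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO salaries VALUES (?, ?, ?)",
    [
        ("develop", 11, 5200), ("develop", 7, 4200), ("develop", 8, 6000),
        ("personnel", 5, 3500),
        ("sales", 1, 6000), ("sales", 3, 4800),
    ],
)

query = """
WITH ranked AS (
    SELECT empno,
           RANK() OVER (ORDER BY salary DESC) AS salary_rank  -- ties share rank 1
    FROM salaries
)
SELECT empno
FROM ranked
WHERE salary_rank = 1
ORDER BY empno
"""
rows = conn.execute(query).fetchall()
print(rows)
```

A plain `ORDER BY salary DESC LIMIT 1` would return only one of the tied employees, which is why it does not satisfy the task.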
#2: Average and rank with a window function (multi-part)
PART 1:
Context: Say we have a table salaries in the format:
Task: Write a query that returns the same table, but with a new column that has average salary per depname. We would expect a table in the form:
Solution:
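A minimal sketch using `AVG()` as a window function, run through Python's `sqlite3` module. The column names and sample rows are assumptions made for this example, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and rows.
conn.execute("CREATE TABLE salaries (depname TEXT, empno INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO salaries VALUES (?, ?, ?)",
    [
        ("develop", 7, 4000), ("develop", 8, 6000), ("develop", 11, 5000),
        ("personnel", 5, 3500),
        ("sales", 1, 5000), ("sales", 3, 4000),
    ],
)

query = """
SELECT depname,
       empno,
       salary,
       AVG(salary) OVER (PARTITION BY depname) AS avg_salary  -- one average per department
FROM salaries
ORDER BY depname, empno
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Unlike `GROUP BY`, the window function keeps every original row and simply attaches the department average to each one.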
PART 2:
Task: Write a query that adds a column with the rank of each employee based on their salary within their department, where the employee with the highest salary gets the rank of 1. We would expect a table in the form:
Solution:
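A minimal sketch using `RANK()` partitioned by department, run through Python's `sqlite3` module. The column names and sample rows are assumptions made for this example, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and rows.
conn.execute("CREATE TABLE salaries (depname TEXT, empno INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO salaries VALUES (?, ?, ?)",
    [
        ("develop", 7, 4000), ("develop", 8, 6000), ("develop", 11, 5000),
        ("personnel", 5, 3500),
        ("sales", 1, 5000), ("sales", 3, 4000),
    ],
)

query = """
SELECT depname,
       empno,
       salary,
       RANK() OVER (PARTITION BY depname      -- restart the ranking in each department
                    ORDER BY salary DESC) AS salary_rank
FROM salaries
ORDER BY depname, salary_rank, empno
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With `RANK()`, tied salaries share a rank and the next rank is skipped. `DENSE_RANK()` or `ROW_NUMBER()` would be alternatives if gaps or ties should be handled differently.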
Predictive Modelling Questions
Source: datasciencehandbook.me
1- (Given a Dataset) Analyze this dataset and give me a model that can predict this response variable.
3- What are some ways I can make my model more robust to outliers?
8- Why might it be preferable to include fewer predictors over many?
10- How could you collect and analyze data to use social media to predict the weather?
12- How would you design the people you may know feature on LinkedIn or Facebook?
13- How would you predict who someone may want to send a Snapchat or Gmail to?
14- How would you suggest to a franchise where to open a new store?
18- How would you build a model to predict a March Madness bracket?
Data Analysis Interview Questions
Source: datasciencehandbook.me
1- (Given a Dataset) Analyze this dataset and tell me what you can learn from it.
2- What is R²? What are some other metrics that could be better than R² and why?
3- What is the curse of dimensionality?
4- Is more data always better?
5- What are advantages of plotting your data before performing analysis?
6- How can you make sure that you don’t analyze something that ends up meaningless?
8- How can you determine which features are the most important in your model?
9- How do you deal with some of your predictors being missing?
17- How could you use GPS data from a car to determine the quality of a driver?
20- How would you quantify the influence of a Twitter user?
25- How would you come up with an algorithm to detect plagiarism in online content?
28- Explain how boosted tree models work in simple language.
30- How would you deal with categorical variables and what considerations would you keep in mind?
31- How would you identify leakage in your machine learning model?
32- How would you apply a machine learning model in a live experiment?
Statistical Inference Interview Questions
Source: datasciencehandbook.me
1- In an A/B test, how can you check if assignment to the various buckets was truly random?
3- What would be the hazards of letting users sneak a peek at the other bucket in an A/B test?
4- What would be some issues if blogs decide to cover one of your experimental groups?
5- How would you conduct an A/B test on an opt-in feature?
6- How would you run an A/B test for many variants, say 20 or more?
7- How would you run an A/B test if the observations are extremely right-skewed?
9- What is a p-value? What is the difference between type-1 and type-2 error?
10- You are AirBnB and you want to test the hypothesis that a greater number of photographs increases the chances that a buyer selects the listing. How would you test this hypothesis?
11- How would you design an experiment to determine the impact of latency on user engagement?
12- What is maximum likelihood estimation? Could there be any case where it doesn’t exist?
13- What’s the difference between a MAP, MOM, MLE estimator? In which cases would you want to use each?
14- What is a confidence interval and how do you interpret it?
Product Metric Interview Questions
Source: datasciencehandbook.me
1- What would be good metrics of success for an advertising-driven consumer product? (Buzzfeed, YouTube, Google Search, etc.) A service-driven consumer product? (Uber, Flickr, Venmo, etc.)
9- You are tasked with improving the efficiency of a subway system. Where would you start?
Programming Questions
Source: datasciencehandbook.me
2- Given a list of tweets, determine the top 10 most used hashtags.
3- Program an algorithm to find the best approximate solution to the knapsack problem in a given time.
6- Write an algorithm that can calculate the square root of a number.
7- Given a list of numbers, can you return the outliers?
8- When can parallelism make your algorithms run faster? When could it make your algorithms run slower?
9- What are the different types of joins? What are the differences between them?
10- Why might a join on a subquery be slow? How might you speed it up?
11- Describe the difference between primary keys and foreign keys in a SQL database.
12- Given a COURSES table with columns course_id and course_name, a FACULTY table with columns faculty_id and faculty_name, and a COURSE_FACULTY table with columns faculty_id and course_id, how would you return a list of faculty who teach a course given the name of a course?
13- Given a IMPRESSIONS table with ad_id, click (an indicator that the ad was clicked), and date, write a SQL query that will tell me the click-through-rate of each ad by month.
14- Write a query that returns the name of each department and a count of the number of employees in each:
EMPLOYEES containing: Emp_ID (Primary key) and Emp_Name
EMPLOYEE_DEPT containing: Emp_ID (Foreign key) and Dept_ID (Foreign key)
DEPTS containing: Dept_ID (Primary key) and Dept_Name
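A possible answer to question 14 is sketched below with Python's sqlite3 module. The three tables follow the description above; the sample rows are made up for illustration. The LEFT JOIN is a design choice so that departments with no employees still appear with a count of 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EMPLOYEES (Emp_ID INTEGER PRIMARY KEY, Emp_Name TEXT);
CREATE TABLE DEPTS (Dept_ID INTEGER PRIMARY KEY, Dept_Name TEXT);
CREATE TABLE EMPLOYEE_DEPT (
    Emp_ID INTEGER REFERENCES EMPLOYEES(Emp_ID),
    Dept_ID INTEGER REFERENCES DEPTS(Dept_ID)
);
-- Hypothetical sample data: 'Legal' has no employees.
INSERT INTO EMPLOYEES VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
INSERT INTO DEPTS VALUES (10, 'Data'), (20, 'Ops'), (30, 'Legal');
INSERT INTO EMPLOYEE_DEPT VALUES (1, 10), (2, 10), (3, 20);
""")

# COUNT(ed.Emp_ID) ignores the NULLs produced by the LEFT JOIN, so empty departments count as 0.
dept_counts = conn.execute("""
SELECT d.Dept_Name, COUNT(ed.Emp_ID) AS employee_count
FROM DEPTS AS d
LEFT JOIN EMPLOYEE_DEPT AS ed ON ed.Dept_ID = d.Dept_ID
GROUP BY d.Dept_ID, d.Dept_Name
ORDER BY d.Dept_Name
""").fetchall()
print(dept_counts)
```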
Probability Questions
3- How can you generate a random number between 1 – 7 with only a die?
9- How many ways can you split 12 people into 3 teams of 4?
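For question 9, a short worked computation may help. The sketch assumes the three teams are unlabeled (swapping two whole teams does not count as a new split); if the teams were named, the final division by 3! would be dropped.

```python
from math import factorial


def unlabeled_team_splits(people: int, team_size: int) -> int:
    """Count ways to split `people` into equal, unlabeled teams of `team_size`."""
    teams, remainder = divmod(people, team_size)
    if remainder:
        raise ValueError("people must be divisible by team_size")
    # n! orderings, divided by the orderings inside each team and the orderings of the teams.
    return factorial(people) // (factorial(team_size) ** teams * factorial(teams))


splits = unlabeled_team_splits(12, 4)
print(splits)  # 12! / (4!^3 * 3!) = 5775
```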
Reference: 800 Data Science Questions & Answers doc by Steve Nouri
Reference: 164 Data Science Interview Questions and Answers by 365 Data Science
DataWarehouse Cheat Sheet
What are Differences between Supervised and Unsupervised Learning?
Supervised | UnSupervised |
Input data is labelled | Input data is unlabeled |
Split in training/validation/test | No split |
Used for prediction | Used for analysis |
Classification and Regression | Clustering, dimension reduction, and density estimation |
Python Cheat Sheet
Data Sciences Cheat Sheet
Pandas Cheat Sheet
Learn SQL with Practical Exercises
SQL is definitely one of the most fundamental skills needed to be a data scientist.
This is a comprehensive handbook that can help you to learn SQL (Structured Query Language), which could be directly downloaded here
Credit: D Armstrong
Data Visualization: A comprehensive VIP Matplotlib Cheat sheet
Credit: Matplotlib
Power BI for Intermediates
Credit: Soheil Bakhshi and Bruce Anderson
How to get a job in data science – a semi-harsh Q/A guide.
Python Frameworks for Data Science
Natural Language Processing (NLP) is one of the top areas today.
Some of the applications are:
- Reading printed text and correcting reading errors
- Find and replace
- Correction of spelling mistakes
- Development of aids
- Text summarization
- Language translation
- and many more.
NLP is a great area if you are planning to work in the area of artificial intelligence.
High Level Look of AI/ML Algorithms
Best Machine Learning Algorithms for Classification: Pros and Cons
Business Analytics in one image
Curated papers, articles, and blogs on data science & machine learning in production from companies like Google, LinkedIn, Uber, Facebook, Twitter, Airbnb, and …
- Data Quality
- Data Engineering
- Data Discovery
- Feature Stores
- Classification
- Regression
- Forecasting
- Recommendation
- Search & Ranking
- Embeddings
- Natural Language Processing
- Sequence Modelling
- Computer Vision
- Reinforcement Learning
- Anomaly Detection
- Graph
- Optimization
- Information Extraction
- Weak Supervision
- Generation
- Audio
- Validation and A/B Testing
- Model Management
- Efficiency
- Ethics
- Infra
- MLOps Platforms
- Practices
- Team Structure
- Fails
How to get a job in data science – a semi-harsh Q/A guide.
HOW DO I GET A JOB IN DATA SCIENCE?
Hey you. Yes you, person asking “how do I get a job in data science/analytics/MLE/AI whatever BS job with data in the title?”. I got news for you. There are two simple rules to getting one of these jobs.
Have experience.
Don’t have no experience.
There are approximately 1000 entry level candidates who think they’re qualified because they did a 24 week bootcamp for every entry level job. I don’t need to be a statistician to tell you your odds of landing one of these aren’t great.
HOW DO I GET EXPERIENCE?
Are you currently employed? If not, get a job. If you are, figure out a way to apply data science in your job, then put it on your resume. Mega bonus points here if you can figure out a way to attribute a dollar value to your contribution. Talk to your supervisor about career aspirations at year-end/mid-year reviews. Maybe you’ll find a way to transfer to a role internally and skip the whole resume ignoring phase. Alternatively, network. Be friends with people who are in the roles you want to be in, maybe they’ll help you find a job at their company.
WHY AM I NOT GETTING INTERVIEWS?
IDK. Maybe you don’t have the required experience. Maybe there are 500+ other people applying for the same position. Maybe your resume stinks. If you’re getting 1/20 response rate, you’re doing great. Quit whining.
IS XYZ DEGREE GOOD FOR DATA SCIENCE?
Does your degree involve some sort of non-remedial math higher than college algebra? Does your degree involve taking any sort of programming classes? If yes, congratulations, your degree will pass most base requirements for data science. Is it the best? Probably not, unless you’re CS or some really heavy math degree where half your classes are taught in Greek letters. Don’t come at me with those art history and underwater basket weaving degrees unless you have multiple years experience doing something else.
SHOULD I DO XYZ BOOTCAMP/MICROMASTERS?
Do you have experience? No? This ain’t gonna help you as much as you think it might. Are you experienced and want to learn more about how data science works? This could be helpful.
SHOULD I DO XYZ MASTER’S IN DATA SCIENCE PROGRAM?
Congratulations, doing a Master’s is usually a good idea and will help make you more competitive as a candidate. Should you shell out 100K for one when you can pay 10K for one online? Probably not. In all likelihood, you’re not gonna get $90K in marginal benefit from the more expensive program. Pick a known school (probably avoid really obscure schools, the name does count for a little) and you’ll be fine. Big bonus here if you can sucker your employer into paying for it.
WILL XYZ CERTIFICATE HELP MY RESUME?
Does your certificate say “AWS” or “AZURE” on it? If not, no.
DO I NEED TO KNOW XYZ MATH TOPIC?
Yes. Stop asking. Probably learn probability, be familiar with linear algebra, and understand what the hell a partial derivative is. Learn how to test hypotheses. Ultimately you need to know what the heck is going on math-wise in your predictions otherwise the company is going to go bankrupt and it will be all your fault.
WHAT IF I’M BAD AT MATH?
Do some studying or something. MIT opencourseware has a bunch of free recorded math classes. If you want to learn some Linear Algebra, Gilbert Strang is your guy.
WHAT PROGRAMMING LANGUAGES SHOULD I LEARN?
STOP ASKING THIS QUESTION. I CAN GOOGLE “HOW TO BE A DATA SCIENTIST” AND EVERY SINGLE GARBAGE TDS ARTICLE WILL TELL YOU SQL AND PYTHON/R. YOU’RE LUCKY YOU DON’T HAVE TO DEAL WITH THE JOY OF SEGMENTATION FAULTS TO RUN A SIMPLE LINEAR REGRESSION.
SHOULD I LEARN PYTHON OR R?
Both. Python is more widely used and tends to be more general purpose than R. R is better at statistics and data analysis, but is a bit more niche. Take your pick to start, but ultimately you’re gonna want to learn both you slacker.
SHOULD I MAKE A PORTFOLIO?
Yes. And don’t put some BS housing price regression, iris classification, or titanic survival project on it either. Next question.
WHAT SHOULD I DO AS A PROJECT?
IDK, what are you interested in? If you say twitter sentiment stock market prediction, go sit in the corner and think about what you just said. Every half brained first year student who can pip install sklearn and do model.fit() has tried unsuccessfully to predict the stock market. The efficient market hypothesis is a thing for a reason. There are literally millions of other free datasets out there, and you have one of the most powerful search engines at your fingertips to go find them. Pick something you’re interested in, find some data, and analyze it.
DO I NEED TO BE GOOD WITH PEOPLE? (courtesy of /u/bikeskata)
Yes! First, when you’re applying, no one wants to work with a weirdo. You should be able to have a basic conversation with people, and they shouldn’t come away from it thinking you’ll follow them home and wear their skin as a suit. Once you get a job, you’ll be interacting with colleagues, and you’ll need them to care about your analysis. Presumably, there are non-technical people making decisions you’ll need to bring in as well. If you can’t explain to a moderately intelligent person why they should care about the thing that took you 3 days (and cost $$$ in cloud computing costs), you probably won’t have your position for long. You don’t need to be the life of the party, but you should be pleasant to be around.
Credit: u/save_the_panda_bears
Why is columnar storage efficient for analytics workloads?
- Columnar Storage enables better compression ratios and improves table scans for aggregate and complex queries.
- Is optimized for scanning large data sets and complex analytics queries
- Enables a data block to store and compress significantly more values for a column compared to row-based storage
- Eliminates the need to read redundant data by reading only the columns that you include in your query.
- Offers overall performance benefits that can help eliminate the need to aggregate data into cubes as in some other OLAP systems.
What are the integrated data sources for Amazon Redshift?
- AWS DMS
- Amazon DynamoDB
- AWS Glue
- Amazon EMR
- Amazon Kinesis
- Amazon S3
- SSH enabled host
How do you interact with Amazon Redshift?
- AWS management console
- AWS CLI
- AWS SDKs
- Amazon Redshift Query API
- or SQL Client tools that support JDBC and ODBC protocols
How do you bound a set of data points (fitting, data, Mathematica)?
One of the first things you need to do when fitting a model to data is to ensure that all of your data points are within the range of the model. This is known as “bounding” the data points. There are a few different ways to bound data points, but one of the most commonly used methods is to simply discard any data points that are outside of the range of the model. This can be done manually, but it’s often more convenient to use a tool like Mathematica to automate the process. By bounding your data points, you can be sure that your model will fit the data more accurately.
Any good data scientist knows that fitting a model to data is essential to understanding the underlying patterns in that data. But fitting a model is only half the battle; once you’ve fit a model, you need to determine how well it actually fits the data. This is where bounding comes in.
Bounding allows you to assess how well a given set of data points fits within the range of values predicted by a model. It’s a simple concept, but it can be mathematically complex to actually do. Mathematica makes it easy, though, with its built-in function for fitting and bounding data. Just input your data and let Mathematica do the work for you!
In SQL, what is the difference between DDL, DCL, and DML?

Data definition language (DDL) refers to the subset of SQL commands that define data structures and objects such as databases, tables, and views. DDL commands include the following:
• CREATE: used to create a new object.
• DROP: used to delete an object.
• ALTER: used to modify an object.
• RENAME: used to rename an object.
• TRUNCATE: used to remove all rows from a table without deleting the table itself.
Data manipulation language (DML) refers to the subset of SQL commands that are used to work with data. DML commands include the following:
• SELECT: used to request records from one or more tables.
• INSERT: used to insert one or more records into a table.
• UPDATE: used to modify the data of one or more records in a table.
• DELETE: used to delete one or more records from a table.
• EXPLAIN: used to analyze and display the expected execution plan of a SQL statement.
• LOCK: used to lock a table from write operations (INSERT, UPDATE, DELETE) and prevent concurrent operations from conflicting with one another.
Data control language (DCL) refers to the subset of SQL commands that are used to configure permissions to objects. DCL commands include:
• GRANT: used to grant access and permissions to a database or object in a database, such as a schema or table.
• REVOKE: used to remove access and permissions from a database or objects in a database
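The DDL and DML categories above can be tried out with Python's built-in sqlite3 module, as in the minimal sketch below. The table and column names are hypothetical. SQLite has no user accounts, so DCL statements such as GRANT and REVOKE are shown only as a comment and are not executed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define and change structure.
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("ALTER TABLE employees ADD COLUMN dept TEXT")

# DML: work with the rows inside that structure.
conn.execute("INSERT INTO employees (name, dept) VALUES ('Ana', 'Data')")
conn.execute("INSERT INTO employees (name, dept) VALUES ('Ben', 'Ops')")
conn.execute("UPDATE employees SET dept = 'Data' WHERE name = 'Ben'")
conn.execute("DELETE FROM employees WHERE name = 'Ana'")
remaining = conn.execute("SELECT name, dept FROM employees").fetchall()
print(remaining)

# DCL (not supported by SQLite; on a server database it would look like):
#   GRANT SELECT ON employees TO analyst;   -- 'analyst' is a hypothetical role
#   REVOKE SELECT ON employees FROM analyst;

# DDL again: remove the object itself, not just its rows.
conn.execute("DROP TABLE employees")
tables_left = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(tables_left)
```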
What is Big Data?
“Big Data is high-volume, high-velocity, and/or high-variety Information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”
What are the 5 Vs of Big Data?
- Volume: the amount of data
- Variety: the range of data types and sources
- Velocity: the speed at which data is generated and captured
- Variability: measure of consistency in meaning
- Veracity: quality and trustworthiness of the data
What are typical Use Cases of Big Data?
- Customer segmentation
- Marketing spend optimization
- Financial modeling and forecasting
- Ad targeting and real-time bidding
- Clickstream analysis
- Fraud detection
What are examples of Data Sources?
- Relational Databases
- NoSQL databases
- Web servers
- Mobile phones
- Tablets
- Data feeds
What are examples of Data Formats?
- Structured, semi-structured, and unstructured
- Text
- Binary
- Streaming and near real-time
- Batched
Big Data vs Data Warehouses
Big Data is a concept.
A data warehouse:
- can be used with both small and large datasets
- can be used in a Big Data system
How should you split your data up for loading into the data warehouse?
Use the same number of files as you have slices in your cluster, or a multiple of the number of slices.
Why do tables need to be vacuumed?
When rows are deleted or updated, Amazon Redshift marks them for deletion but does not immediately reclaim the space they occupied. Vacuuming reclaims that space and restores the sort order.
Difference Between Amazon Redshift SQL and PostgreSQL
Amazon Redshift SQL is based on PostgreSQL 8.0.2 but has important implementation differences:
- COPY is highly specialized to enable loading of data from other AWS services and to facilitate automatic compression.
- VACUUM reclaims disk space and re-sorts all rows.
- Some PostgreSQL features, data types, and functions are not supported in Amazon Redshift.
What is the difference between STL tables and STV tables in Redshift?
STL tables contain log data that has been persisted to disk. STV tables contain snapshots of the current system based on transient, in-memory data that is not persisted to disk-based logs or other tables.
How does code compilation affect query performance in Redshift?
Compiling a query adds some overhead the first time it runs. The compiled code is cached and available across sessions, which speeds up subsequent runs of that query.
What is data redistribution in Redshift?
The process of moving data around the cluster to facilitate a join.
What is Dark Data?
Dark data is data that is collected and stored but never used again.
Amazon EMR vs Amazon Redshift

Amazon Redshift Spectrum is the best of both worlds:
- Can analyze data directly from Amazon S3, like Amazon EMR does
- Retains efficient processing of highly complex queries, like Amazon Redshift does
- And it’s built-in
Data Analytics Ecosystem on AWS:

Which tasks must be completed before using Amazon Redshift Spectrum?
- Define an external schema and create tables.
What can be used as an external data catalog (metastore) for Amazon Redshift Spectrum?
- An Apache Hive metastore or the AWS Glue Data Catalog.
What is the difference between the audit logging feature in Amazon Redshift and AWS CloudTrail trails?
Redshift audit logs contain information about database activities. AWS CloudTrail trails contain information about service activities.
How can you receive notifications about events in your cluster?
Configure an Amazon SNS topic and choose events to trigger the notification to be sent to topic subscribers.
Where does Amazon Redshift store the snapshots used to backup your cluster?
In Amazon S3.
Benefits of AWS DAS-C01 and AWS MLS-C01 Certifications:
iOS : https://apps.apple.com/ca/app/
Cap Theorem:

Data Warehouse Definition:

Data Warehouse are Subject-Oriented:

AWS Analytics Services:
• Amazon Elastic MapReduce (Amazon EMR) simplifies big data processing by providing a managed Hadoop framework that makes it easy, fast, and cost-effective for you to distribute and process vast amounts of your data across dynamically scalable Amazon Elastic Compute Cloud (Amazon EC2) instances. You can also run other popular distributed frameworks such as Apache Spark and Presto in Amazon EMR, and interact with data in other AWS data stores, such as Amazon S3 and Amazon DynamoDB.
• Amazon Elasticsearch Service is a managed service that makes it easy to deploy, operate, and scale Elasticsearch in the AWS cloud. Elasticsearch is a popular open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and click stream analytics.
• Amazon Kinesis is a platform for streaming data on AWS, that offers powerful services that make it easy to load and analyze streaming data, and that also provides the ability for you to build custom streaming data applications for specialized needs.
• Amazon Machine Learning provides visualization tools and wizards that guide you through the process of creating machine learning (ML) models without having to learn complex ML algorithms and technology. When your models are ready, Amazon Machine Learning makes it easy to obtain predictions for your application using simple APIs, without having to implement custom prediction generation code or manage any infrastructure.
• Amazon QuickSight is a very fast, cloud-powered business intelligence (BI) service that makes it easy for all employees to build visualizations, perform one-time analysis, and quickly get business insights from their data.
AWS Database Services:

Choosing between NoSQL or SQL Databases:

Can you give an example of a successful implementation of an enterprise wide data warehouse solution?
1- DataWarehouse Implementation at Phillips U.S. based division
“Amazon Redshift is the single source of truth for our user data. It stores data on customer usage, customer service, and advertising, and then presents those data back to the business in multiple views.” –John O’Donovan, CTO, Financial Times
What is explained variation and unexplained variation in linear regression analysis?
In statistics, explained variation measures the proportion of the variation (dispersion) in a data set that a mathematical model accounts for. Variation is often quantified as variance, in which case the more specific term explained variance is used.
The explained variation is the sum of the squared differences between each predicted y-value and the mean of y. The unexplained variation is the sum of the squared differences between each observed y-value and its corresponding predicted y-value.
Linear regression models the relationship between explanatory variables and a response variable. The goal is to find the line of best fit, the one that minimizes the sum of the squared residuals, where a residual is the difference between the actual y-value and the predicted y-value. For a least-squares fit with an intercept, the total variation in y (the sum of the squared differences between each observed y-value and the mean of y) splits into exactly these two components: total variation = explained variation + unexplained variation.
In other words, explained variation measures how much of the spread in the data the line of best fit captures, while unexplained variation measures how much error remains in the predictions. The ratio of explained to total variation is R². A good model keeps the unexplained variation small relative to the total; the explained variation is not something to minimize.
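The decomposition can be checked numerically. The sketch below fits a least-squares line to a small made-up data set (the x and y values are assumptions for illustration) and computes the explained, unexplained, and total variation using only the Python standard library.

```python
# Hypothetical data points for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Ordinary least-squares slope and intercept for a single predictor.
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
    (x - x_mean) ** 2 for x in xs
)
intercept = y_mean - slope * x_mean
preds = [intercept + slope * x for x in xs]

# Explained: predicted vs. mean. Unexplained: observed vs. predicted. Total: observed vs. mean.
explained = sum((p - y_mean) ** 2 for p in preds)
unexplained = sum((y - p) ** 2 for y, p in zip(ys, preds))
total = sum((y - y_mean) ** 2 for y in ys)
r_squared = explained / total

print(explained, unexplained, total, r_squared)
```

For this data the explained and unexplained parts add up to the total (up to floating-point rounding), and their ratio gives the share of variation the line accounts for.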
What is the difference between normalization, standardization, and regularization for data?
Normalization and standardization are both rescaling techniques. They put features on a common, unitless scale.
Assume you have two features, F1 and F2.
F1 ranges from 0 to 100 and F2 ranges from 0 to 1.
When you use an algorithm that relies on distance as its measure, you run into a problem.
Row | F1 | F2 |
1 | 20 | 0.2 |
2 | 26 | 0.2 |
3 | 20 | 0.9 |
Using the raw values (sum of absolute differences):
row1 – row2: |20 – 26| + |0.2 – 0.2| = 6
row1 – row3: |20 – 20| + |0.2 – 0.9| = 0.7
You might conclude that row3 is nearest to row1, but that is wrong, because F1's larger scale dominates the distance.
The right way is to divide each difference by the feature's range first:
row1 – row2: |20 – 26|/100 + |0.2 – 0.2|/1 = 0.06
row1 – row3: |20 – 20|/100 + |0.2 – 0.9|/1 = 0.7
So row2 is the nearest to row1.
Normalization brings data into the range 0 to 1.
Standardization rescales data to a mean of 0 and a standard deviation of 1.
Normalization = (X – Xmin) / (Xmax – Xmin)
Standardization = (X – µ) / σ
Regularization is a different idea: it concerns underfitting and overfitting.
If the error is high on both the training data and the test data, the model underfits.
If the error is low on the training data but high on the test data, the model overfits.
Regularization is a way to manage this trade-off, typically by penalizing model complexity to reduce overfitting. Source: ABC of Data Science
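The distance example can be reproduced in a few lines of Python. This is a sketch under stated assumptions: F1 is taken to span 0 to 100 and F2 to span 0 to 1, and distance is the sum of absolute differences; the helper names are made up for illustration.

```python
from statistics import mean, pstdev

# Rows of (F1, F2); the feature ranges below are assumptions for this example.
rows = [(20.0, 0.2), (26.0, 0.2), (20.0, 0.9)]
ranges = [(0.0, 100.0), (0.0, 1.0)]


def manhattan(a, b):
    """Sum of absolute differences between two rows."""
    return sum(abs(u - v) for u, v in zip(a, b))


def min_max(row):
    """Normalization: (X - Xmin) / (Xmax - Xmin) for each feature."""
    return tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(row, ranges))


# Raw distances: F1's larger scale dominates, so row 3 looks closest to row 1.
raw_d12 = manhattan(rows[0], rows[1])
raw_d13 = manhattan(rows[0], rows[2])

# After min-max normalization the ordering flips: row 2 is closest to row 1.
scaled = [min_max(r) for r in rows]
scaled_d12 = manhattan(scaled[0], scaled[1])
scaled_d13 = manhattan(scaled[0], scaled[2])

# Standardization: (x - mean) / standard deviation, shown here for the F1 column.
f1 = [r[0] for r in rows]
f1_z = [(v - mean(f1)) / pstdev(f1) for v in f1]

print(raw_d12, raw_d13, scaled_d12, scaled_d13, f1_z)
```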
What are the most popular machine learning frameworks used by data scientists?
TensorFlow
TensorFlow is an open-source machine learning library developed at Google for numerical computation using data flow graphs. It is arguably one of the best, with Gmail, Uber, Airbnb, Nvidia, and lots of other prominent brands using it. It’s handy for creating and experimenting with deep learning architectures, and its formulation is convenient for data integration such as inputting graphs, SQL tables, and images together.
Deepchecks
Deepchecks is a Python package for comprehensively validating your machine learning models and data with minimal effort. This includes checks related to various types of issues, such as model performance, data integrity, distribution mismatches, and more.
Scikit-learn
Scikit-learn is a very popular open-source machine learning library for the Python programming language. Constant updates for efficiency improvements, coupled with the fact that it is open source, make it a go-to framework for machine learning in the industry.
Keras
Keras is an open-source neural network library written in Python. It is capable of running on top of other popular lower-level libraries such as Tensorflow, Theano & CNTK. This one might be your new best friend if you have a lot of data and/or you’re after the state-of-the-art in AI: deep learning.
Pandas
Pandas is yet another open-source software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. Pandas works well with incomplete, messy, and unlabeled data and provides tools for shaping, merging, reshaping, and slicing datasets.
Spark MLlib
Spark MLlib is a popular machine learning library. According to one survey, almost 6% of data scientists use this library. It supports Java, Scala, Python, and R, and it can run on Hadoop, Apache Mesos, Kubernetes, and other cloud services against multiple data sources.
PyTorch
PyTorch is developed by Facebook’s artificial intelligence research group and it is the primary software tool for deep learning after Tensorflow. Unlike TensorFlow, the PyTorch library operates with a dynamically updated graph. This means that it allows you to make changes to the architecture in the process. By Niklas Steiner
What is the difference between validation set and test set?
Whenever we fit a machine learning algorithm to a dataset, we typically split the dataset into three parts:
1. Training Set: Used to train the model.
2. Validation Set: Used to tune hyperparameters and choose between candidate models.
3. Test Set: Used to get an unbiased estimate of the final model performance.
The following diagram provides a visual explanation of these three different types of datasets:
One point of confusion for students is the difference between the validation set and the test set.
In simple terms, the validation set is used to tune the model (its hyperparameters and design choices), while the test set is used to provide an unbiased estimate of the final model's performance.
Because the validation data guides those tuning decisions, the error rate measured on it (for example, by k-fold cross-validation) tends to underestimate the true error rate once the model is applied to an unseen dataset.
Thus, we evaluate the final model on the test set, which played no part in training or tuning, to get an unbiased estimate of what the true error rate will be in the real world.
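The three-way split can be sketched with the standard library alone. The function name, the 60/20/20 proportions, and the fixed seed are assumptions chosen for illustration; the key property is that the three sets never overlap, so the test set stays untouched until the final evaluation.

```python
import random


def train_val_test_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then carve out disjoint test, validation, and training sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test_part = shuffled[:n_test]
    val_part = shuffled[n_test:n_test + n_val]
    train_part = shuffled[n_test + n_val:]
    return train_part, val_part, test_part


train_set, val_set, test_set = train_val_test_split(range(100))
print(len(train_set), len(val_set), len(test_set))
```

In a typical workflow, models are fit on train_set, compared and tuned on val_set, and the single chosen model is scored once on test_set.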
If you are looking for a solid way of testing your ML algorithms, I would recommend this open-source interactive demo
Source: ABC of Data Science and ML
When should I normalize data?
The general answer to your question is: when our model needs it!
Yeah, that’s it!
In detail:
1. When the model we are going to use can’t read the format of the data we have, we need to normalise the data.
e.g. When our data is text, we perform lemmatization, stemming, etc. to normalise/transform it.
2. Another case is when the values in certain columns (features) are on a different scale from other features, which may lead to poor performance of our model. We need to normalise our data here as well (better said, the features have different ranges).
e.g. Features: F1, F2, F3
range(F1): 0 – 100
range(F2): 50 – 100
range(F3): 900 – 10,000
In the above situation, a model that is sensitive to feature scale (for example, a distance-based one) would give more importance to F3 (bigger numerical values). Our model would thus be biased, resulting in bad accuracy. Here we need to apply scaling (such as StandardScaler in scikit-learn).
Transformation and scaling are some common normalisation methods.
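To make the scaling step concrete, here is a minimal standard-library sketch that standardizes three features with the ranges listed above. The sample values are made up, and the standardize helper is a hypothetical stand-in for a library scaler; it uses the population standard deviation, which, to the best of my knowledge, is also what scikit-learn's StandardScaler does.

```python
from statistics import mean, pstdev

# Hypothetical sample values spanning the ranges of F1, F2, and F3.
features = {
    "F1": [0.0, 50.0, 100.0],
    "F2": [50.0, 75.0, 100.0],
    "F3": [900.0, 5450.0, 10000.0],
}


def standardize(values):
    """Rescale a column to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


scaled = {name: standardize(column) for name, column in features.items()}
for name, column in scaled.items():
    print(name, [round(v, 3) for v in column])
```

After scaling, all three columns live on the same scale, so F3 no longer dominates simply because its raw numbers are larger.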
Go through these two articles to have a better understanding:
- Understand Data Normalization in Machine Learning
- Why Data Normalization is necessary for Machine Learning models
Source: ABC of Data Science and ML
Is it possible to use linear regression for forecasting on non-stationary data (time series)? If yes, then how can we do that? If no, then why not?
Linear regression is a machine learning algorithm that can be used to predict future values based on past data points. It is most reliable on stationary data, meaning the statistical properties of the data (such as the mean and variance) do not change over time. It is still possible to use linear regression on non-stationary data, with some modifications. The first step is to stationarize the data, which can be done by detrending or differencing. Once the data is stationarized, linear regression can be used as usual, and the forecasts are then converted back to the original scale by adding the trend back or undoing the differencing. Keep in mind that the predictions may not be as accurate as they would be if the data had been stationary to begin with.
Applying linear regression directly to non-stationary data is risky: trend and seasonality that the model does not account for make the forecast inaccurate, and trending series can appear related when they are not. There are various ways to stationarize data, such as differencing or subtracting a moving average. If stationarizing the data and fitting a regression is not enough, a dedicated time series model such as ARIMA, which applies differencing as part of the model, can be used instead.
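The two stationarizing ideas can be sketched on a toy series. Everything below is an assumption for illustration: the series is a made-up linear trend with alternating noise, fit_line is a hypothetical helper, and the forecasts are one step ahead only. The first approach regresses on the time index (detrending); the second differences the series and adds the average change back to the last observation.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Non-stationary toy series: upward trend of 2 per step plus alternating noise.
series = [3.0 + 2.0 * t + (1.0 if t % 2 else -1.0) for t in range(10)]
times = list(range(len(series)))

# Approach 1: model the trend explicitly with a regression on time.
slope, intercept = fit_line(times, series)
trend_forecast = intercept + slope * len(series)

# Approach 2: difference the series (removes the trend), then undo the differencing.
diffs = [b - a for a, b in zip(series, series[1:])]
drift = sum(diffs) / len(diffs)
diff_forecast = series[-1] + drift

print(slope, trend_forecast, drift, diff_forecast)
```

Both forecasts land near the underlying trend; which approach fits better in practice depends on whether the trend is deterministic or the series behaves more like a random walk.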

Data Sciences – Top 400 Open Datasets – Data Visualization – Data Analytics – Big Data – Data Lakes
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains.
A dataset is a collection of data, usually presented in tabular form. Good datasets for Data Science and Machine Learning are typically those that are well-structured (easy to read and understand) and large enough to provide enough data points to train a model. The best datasets are often those that are open and freely available – such as the popular Iris dataset. However, there are also many commercial datasets available for purchase. In general, good datasets for Data Science and Machine Learning should be:

- Well-structured
- Large enough to provide enough data points
- Open and freely available whenever possible
In this post, we list popular open and public datasets, along with resources on data visualization, data analytics, and data lakes.

Population Distribution of the World by Continent
Fertility rates all over the world are steadily declining
Yes, fertility rates have been declining globally in recent decades. There are several factors that contribute to this trend, including increased access to education and employment opportunities for women, improved access to family planning and birth control, and changes in societal attitudes towards having children. However, the rate of decline varies significantly by country and region, with some countries experiencing more dramatic declines than others.
The most Daily Wikipedia Page Views in 2022
How Americans Spend Their Money by Generation


Largest countries in the world (by area size)

The Highest Grossing Movies Of All Time
We are still living mostly on gas, oil & coal – Global primary energy consumption by source (TWh)

Consumption vs production based CO2 emissions by country
Largest banks in the world by total assets
Inflation rate and nominal interest rate

Police Killings per Capita v Homicide Rate per Capita for Select OECD Countries
Instagram Rich List 2022 – Which celebrity earns the most in the world in 2022?
Top human-caused threats to birds in the US

Trailblazing Scientists who Shaped our World

Suicide rate among countries with the highest Human Development Index
Percentage of Obesity in 2022

Source: https://worldpopulationreview.com/country-rankings/obesity-rates-by-country
Tools: Datawrapper
3 largest global payment networks – measured by total payment volume each year ($B)
Stocks Vs Bonds 2022
Most expensive football transfers
11 developing countries with higher life expectancy than the United States
Healthcare expenditure per capita vs life expectancy years
1.2% of adults own 47.8% of world’s wealth
How to Mathematically Win at Rock Paper Scissors
Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021
Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.
At the 2021 International Conference on Machine Learning (ICML), researchers from IBM, MIT, and Harvard University released a DARPA “Common Sense AI” dataset for benchmarking AI intuition, along with two machine learning models that represent different approaches to the problem. The work draws on testing techniques that psychologists use to study infants’ behavior, with the aim of accelerating the development of AI that exhibits common sense.
Source – Summary – Paper – IBM Blog
Largest healthcare services companies in the world
Performance Of FAANG Stocks In 2022 (Apple, Amazon, Google, Meta, Netflix)
100 million protein structures Dataset by DeepMind
Here’s a good article about this topic
Google Dataset Search

Malware traffic dataset
Comprises 1,914,081 records created from all malware-traffic-analysis.net PCAP files from 2013 to 2021. The logs were generated using Suricata and Zeek.
Originator: ali_alwashali
2019 Crime statistics in the USA
Dataset of arrests in the US, broken down by race and by state. Download Excel here
CPOST dataset on suicide attacks over four decades
The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.
Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019
You can do a lot of aggregated analysis in a pretty straightforward way there.
Drone imagery with annotations for small object detection and tracking dataset
11 TB dataset of drone imagery with annotations for small object detection and tracking
Download and more information are available here
Dataset License: CDLA-Sharing-1.0
Helper scripts for accessing the dataset: DATASET.md
Dataset Exploration: Colab
NOAA High-Resolution Rapid Refresh (HRRR) Model
The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.
When will computers replace humans?
This chart essentially measures “how good is a human at a computer’s area of strength”. Meanwhile, computers simply cannot compete in human areas of strength.
The most popular car brands on Reddit

Fruit Efficient-C Analysis

Registry of Open Data on AWS
This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.
See all usage examples for datasets listed in this registry.
See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.
Textbook Question Answering (TQA)
1,076 textbook lessons, 26,260 questions, 6,229 images
Documentation: allenai.org/data/tqa
Harmonized Cancer Datasets: Genomic Data Commons Data Portal
The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.
AWS CLI Access (No AWS account required)
aws s3 ls s3://tcga-2-open/ --no-sign-request
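For readers who prefer a script to the CLI, the same kind of anonymous listing can be sketched with the Python standard library. This assumes the bucket allows public, unsigned `ListObjectsV2` requests over HTTPS at the generic `s3.amazonaws.com` endpoint. The function names are hypothetical, and unlike `aws s3 ls` without `--recursive`, this sketch returns a flat list of the first few keys rather than grouping them by folder.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 list responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def listing_url(bucket, prefix="", max_keys=20):
    """Build an unsigned ListObjectsV2 URL for a public bucket."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_data):
    """Extract (key, size_in_bytes) pairs from a list response body."""
    root = ET.fromstring(xml_data)
    return [
        (item.findtext(f"{S3_NS}Key"), int(item.findtext(f"{S3_NS}Size")))
        for item in root.iter(f"{S3_NS}Contents")
    ]


def list_public_bucket(bucket, prefix=""):
    """Fetch and parse one page of keys; requires network access."""
    with urllib.request.urlopen(listing_url(bucket, prefix)) as response:
        return parse_listing(response.read())
```

Calling `list_public_bucket("tcga-2-open")` should then return up to 20 `(key, size)` pairs, provided the bucket still permits anonymous listing. Larger listings would need pagination, which this sketch leaves out.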
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.
Genome Aggregation Database (gnomAD)
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads
SQuAD (Stanford Question Answering Dataset)
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
PubMed Diabetes Dataset
The PubMed Diabetes dataset consists of 19,717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes. The citation network consists of 44,338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary of 500 unique words. The README file in the dataset provides more details.
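To make the idea of a TF/IDF weighted word vector concrete, here is a small sketch. The dataset description does not say which TF/IDF variant was used, so this example assumes one common choice: term frequency as a share of the document's words, and inverse document frequency as the natural log of (number of documents / documents containing the word). The function name and the three toy documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF vector per document.

    Assumes every document contains at least one word.
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            [
                (counts[word] / len(tokens)) * math.log(n_docs / doc_freq[word])
                for word in vocabulary
            ]
        )
    return vocabulary, vectors


docs = [
    "insulin therapy for diabetes",
    "diabetes risk factors",
    "insulin resistance and diabetes",
]
vocab, vecs = tfidf_vectors(docs)
```

A word that appears in every document, such as "diabetes" here, gets a weight of zero under this variant, while a word unique to one document gets the largest weight. That is why such vectors are useful for telling publications apart.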
Drug-Target Interaction Dataset
This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link
Pharmacogenomics Datasets
PharmGKB data and knowledge are available as downloads. It is often critical to check with the curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; the curators just want to make sure that the data is correctly understood before lots of resources are spent.
Pancreatic Cancer Organoid Profiling
The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request
Africa Soil Information Service (AfSIS) Soil Chemistry
This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://afsis/ --no-sign-request
Dataset for Affective States in E-Environments
DAiSEE is the first multi-label video classification dataset, comprising 9,068 video snippets captured from 112 users, for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels (very low, low, high, and very high) for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.
NatureServe Explorer Dataset
NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.
The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here
Flight Records in the US
Airline On-Time Performance and Causes of Flight Delays – On_Time Data.
This database contains scheduled and actual departure and arrival times and reasons for delay, as reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).
FlightAware.com has data but you need to pay for a full dataset.
The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
- flights: all flights that departed a given airport in a given year and month
- weather: hourly meteorological data for a given airport in a given year and month
- airports: airport names, FAA codes, and locations
- airlines: translation between two-letter carrier (airline) codes and names
- planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here
Worldwide flight data
Download: airports.dat (Airports only, high quality)
Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)
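Once downloaded, files like these can be loaded with a few lines of Python. The sketch below assumes the files follow the OpenFlights comma-separated layout (no header row, 14 columns in the order shown), which the text above does not state, so check the column order against the source before relying on it. The field names and the sample row are hypothetical.

```python
import csv
import io

# Assumed column order (not confirmed by the text above).
AIRPORT_FIELDS = [
    "airport_id", "name", "city", "country", "iata", "icao",
    "latitude", "longitude", "altitude_ft", "utc_offset",
    "dst", "tz_name", "type", "source",
]


def parse_airports(text):
    """Parse header-less airport CSV text into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        row = dict(zip(AIRPORT_FIELDS, record))
        # Coordinates are stored as decimal degrees.
        row["latitude"] = float(row["latitude"])
        row["longitude"] = float(row["longitude"])
        rows.append(row)
    return rows


# Made-up row in the assumed layout, for illustration only.
SAMPLE = (
    '1,"Example Airport","Example City","Exampleland","EXA","EXAM",'
    '12.5,-45.25,100,0,"U","Etc/UTC","airport","Example"\n'
)

airports = parse_airports(SAMPLE)
```

For a real file, read it with `open(path, encoding="utf-8")` and pass the contents to `parse_airports`. Rows with missing values may need extra handling that this sketch does not include.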
Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.
flightradar24.com has lots of data, including historical data; they might be willing to help you get it in a nice format.
Registry of Open Data on AWS
This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.
See all usage examples for datasets listed in this registry.
See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.
Textbook Question Answering (TQA)
1,076 textbook lessons, 26,260 questions, 6229 images
Documentation: allenai.org/data/tqa
Harmonized Cancer Datasets: Genomic Data Commons Data Portal
The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.
AWS CLI Access (No AWS account required)
aws s3 ls s3://tcga-2-open/ --no-sign-request
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.
Genome Aggregation Database (gnomAD)
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads
SQuAD (Stanford Question Answering Dataset)
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
PubMed Diabetes Dataset
The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.
Drug-Target Interaction Dataset
This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link
Pharmacogenomics Datasets
PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.
Pancreatic Cancer Organoid Profiling
The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request
Africa Soil Information Service (AfSIS) Soil Chemistry
This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation
AWS CLI Access (No AWS account required)
aws s3 ls s3://afsis/ --no-sign-request
Dataset for Affective States in E-Environments
DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.
NatureServe Explorer Dataset
NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.
The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here
Flight Records in the US
Airline On-Time Performance and Causes of Flight Delays – On_Time Data.
This database contains scheduled and actual departure and arrival times and reasons for delay, as reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).
FlightAware.com has data but you need to pay for a full dataset.
The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
- flights: all flights that departed a given airport in a given year and month
- weather: hourly meteorological data for a given airport in a given year and month
- airports: airport names, FAA codes, and locations
- airlines: translation between two-letter carrier (airline) codes and names
- planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here
Worldwide flight data
Download: airports.dat (Airports only, high quality)
Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)
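Both files are plain comma-separated text, so they can be read with a few lines of Python. The sketch below assumes the OpenFlights-style layout (no header row, 14 columns in the order shown, and a backslash-N marker for missing values). That layout is not stated above, so check it against the file you actually download. The sample rows are invented for illustration.

```python
import csv
import io

# Assumed column order (OpenFlights-style layout); verify against the real file.
FIELDS = ["airport_id", "name", "city", "country", "iata", "icao",
          "latitude", "longitude", "altitude_ft", "utc_offset", "dst",
          "tz_name", "type", "source"]

NULL_MARKER = r"\N"  # placeholder the file is assumed to use for missing values


def read_airports(text_stream):
    """Yield one dict per airport row, turning null markers into None."""
    for row in csv.reader(text_stream):
        record = {k: (None if v == NULL_MARKER else v)
                  for k, v in zip(FIELDS, row)}
        record["latitude"] = float(record["latitude"])
        record["longitude"] = float(record["longitude"])
        yield record


# Two made-up rows in the assumed layout.
sample = io.StringIO(
    '1,"Example Airport","Example City","Exampleland","EXA","EXAM",'
    '-6.5,145.25,5282,10,"U","Pacific/Port_Moresby","airport","Example"\n'
    r'2,"No Code Field","Elsewhere","Exampleland",\N,"NOCD",1.5,2.5,10,\N,\N,\N,"airport","Example"'
    "\n"
)
for airport in read_airports(sample):
    print(airport["iata"], airport["name"], airport["latitude"])
```

For a downloaded file, pass open("airports.dat", encoding="utf-8", newline="") instead of the in-memory sample.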
Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.
flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.
2019 Crime statistics in the USA
Dataset of arrests in the US, broken down by race and by state. Download Excel here
- The global dataset of historical yields for major crops 1981–2016 – The […]
- Hyperspectral benchmark dataset on soil moisture – This dataset was […]
- Lemons quality control dataset – Lemon dataset has been prepared to […]
- Optimized Soil Adjusted Vegetation Index – The IDB is a tool for working […]
- U.S. Department of Agriculture’s Nutrient Database
- U.S. Department of Agriculture’s PLANTS Database – The Complete PLANTS […]
- 1000 Genomes – The 1000 Genomes Project ran between 2008 and 2015, […]
- American Gut (Microbiome Project) – The American Gut project is the […]
- Broad Bioimage Benchmark Collection (BBBC) – The Broad Bioimage Benchmark […]
- Broad Cancer Cell Line Encyclopedia (CCLE)
- Cell Image Library – This library is a public and easily accessible […]
- Complete Genomics Public Data – A diverse data set of whole human genomes […]
- EBI ArrayExpress – ArrayExpress Archive of Functional Genomics Data […]
- EBI Protein Data Bank in Europe – The Electron Microscopy Data Bank […]
- ENCODE project – The Encyclopedia of DNA Elements (ENCODE) Consortium is […]
- Electron Microscopy Pilot Image Archive (EMPIAR) – EMPIAR, the Electron […]
- Ensembl Genomes
- Gene Expression Omnibus (GEO) – GEO is a public functional genomics data […]
- Gene Ontology (GO) – GO annotation files
- Global Biotic Interactions (GloBI)
- Harvard Medical School (HMS) LINCS Project – The Harvard Medical School […]
- Human Genome Diversity Project – A group of scientists at Stanford […]
- Human Microbiome Project (HMP) – The HMP sequenced over 2000 reference […]
- ICOS PSP Benchmark – The ICOS PSP benchmarks repository contains an […]
- International HapMap Project
- Journal of Cell Biology DataViewer [fixme]
- KEGG – KEGG is a database resource for understanding high-level functions […]
- MIT Cancer Genomics Data
- NCBI Proteins
- NCBI Taxonomy – The NCBI Taxonomy database is a curated set of names and […]
- NCI Genomic Data Commons – The GDC Data Portal is a robust data-driven […]
- NIH Microarray data
- OpenSNP genotypes data – openSNP allows customers of direct-to-customer […]
- Palmer Penguins – The goal of palmerpenguins is to provide a great […]
- Pathguide – Protein-Protein Interactions Catalog
- Protein Data Bank – This resource is powered by the Protein Data Bank […]
- Psychiatric Genomics Consortium – The purpose of the Psychiatric Genomics […]
- PubChem Project – PubChem is the world’s largest collection of freely […]
- PubGene (now Coremine Medical) – COREMINE™ is a family of tools developed […]
- Sanger Catalogue of Somatic Mutations in Cancer (COSMIC) – COSMIC, the […]
- Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC)
- Sequence Read Archive(SRA) – The Sequence Read Archive (SRA) stores raw […]
- Stanford Microarray Data
- Stowers Institute Original Data Repository
- Systems Science of Biological Dynamics (SSBD) Database – Systems Science […]
- The Cancer Genome Atlas (TCGA), available via Broad GDAC
- The Catalogue of Life – The Catalogue of Life is a quality-assured […]
- The Personal Genome Project – The Personal Genome Project, initiated in […]
- UCSC Public Data
- UniGene
- Universal Protein Resource (UniProt) – The Universal Protein Resource […]
- Rfam – The Rfam database is a collection of RNA families, each […]
- Actuaries Climate Index
- Australian Weather
- Aviation Weather Center – Consistent, timely and accurate weather […]
- Brazilian Weather – Historical data (In Portuguese) – Data related to […]
- Canadian Meteorological Centre
- Climate Data from UEA (updated monthly)
- Dutch Weather – The KNMI Data Center (KDC) portal provides access to KNMI […]
- European Climate Assessment & Dataset
- German Climate Data Center
- Global Climate Data Since 1929
- Charting The Global Climate Change News Narrative 2009-2020 – These four […]
- NASA Global Imagery Browse Services
- NOAA Bering Sea Climate [fixme]
- NOAA Climate Datasets
- NOAA Realtime Weather Models
- NOAA SURFRAD Meteorology and Radiation Datasets
- The World Bank Open Data Resources for Climate Change
- UEA Climatic Research Unit
- WU Historical Weather Worldwide
- Washington Post Climate Change – To analyze warming temperatures in the […]
- WorldClim – Global Climate Data
- 38-Cloud (Cloud Detection) – Contains 38 Landsat 8 scene images and their […]
- AQUASTAT – Global water resources and uses
- BODC – marine data of ~22K vars
- EOSDIS – NASA’s earth observing system data
- Earth Models [fixme]
- Global Wind Atlas – The Global Wind Atlas is a free, web-based […]
- Integrated Marine Observing System (IMOS) – roughly 30TB of ocean measurements
- Marinexplore – Open Oceanographic Data
- Alabama Real-Time Coastal Observing System
- National Estuarine Research Reserves System-Wide Monitoring Program – […]
- Oil and Gas Authority Open Data – The dataset covers 12,500 offshore […]
- Smithsonian Institution Global Volcano and Eruption Database
- USGS Earthquake Archives
- American Economic Association (AEA)
- EconData from UMD
- Economic Freedom of the World Data
- Historical MacroEconomic Statistics
- INFORUM – Interindustry Forecasting at the University of Maryland
- DBnomics – the world’s economic database – Aggregates hundreds of […]
- International Trade Statistics
- Internet Product Code Database
- Joint External Debt Data Hub
- Jon Haveman International Trade Data Links
- Long-Term Productivity Database – The Long-Term Productivity database was […]
- OpenCorporates Database of Companies in the World
- Our World in Data
- SciencesPo World Trade Gravity Datasets [fixme]
- The Atlas of Economic Complexity
- The Center for International Data
- The Observatory of Economic Complexity [fixme]
- UN Commodity Trade Statistics
- UN Human Development Reports
- AMPds – The Almanac of Minutely Power dataset
- BLUEd – Building-Level fUlly labeled Electricity Disaggregation dataset
- COMBED
- DBFC – Direct Borohydride Fuel Cell (DBFC) Dataset
- DEL – Domestic Electrical Load study datasets for South Africa (1994 – 2014)
- ECO – The ECO data set is a comprehensive data set for non-intrusive load […]
- EIA
- Global Power Plant Database – The Global Power Plant Database is a […]
- HES – Household Electricity Study, UK
- HFED
- PEM1 – Proton Exchange Membrane (PEM) Fuel Cell Dataset
- PLAID – The Plug Load Appliance Identification Dataset [fixme]
- The Public Utility Data Liberation Project (PUDL) – PUDL makes US energy […]
- REDD
- SYND – A synthetic energy dataset for non-intrusive load monitoring – […]
- Smart Meter Data Portal – The Smart Meter Data Portal is part of the […]
- Tracebase
- Ukraine Energy Centre Datasets
- UK-DALE – UK Domestic Appliance-Level Electricity
- WHITED
- iAWE
- BIS Statistics – BIS statistics, compiled in cooperation with central […]
- Blockmodo Coin Registry – A registry of JSON formatted information files […]
- CBOE Futures Exchange
- Complete FAANG Stock data – This data set contains all the stock data of […]
- Google Finance
- Google Trends
- NASDAQ [fixme]
- NYSE Market Data
- OANDA
- OSU Financial data [fixme]
- Quandl
- St Louis Federal
- Yahoo Finance
- 10k US Adult Faces Database
- 2GB of Photos of Cats
- Adience Unfiltered faces for gender and age classification
- Affective Image Classification
- Animals with attributes
- CADDY Underwater Stereo-Vision Dataset of divers’ hand gestures – […]
- Cytology Dataset – CCAgT: Images of Cervical Cells with AgNOR Stain […]
- Caltech Pedestrian Detection Benchmark
- Chars74K dataset – Character Recognition in Natural Images (both English […]
- Cube++ – 4890 raw 18-megapixel images, each containing a SpyderCube color […]
- Danbooru Tagged Anime Illustration Dataset – A large-scale anime image […]
- DukeMTMC Data Set – DukeMTMC aims to accelerate advances in multi-target […] [fixme]
- ETH Entomological Collection (ETHEC) Fine Grained Butterfly (Lepidoptera) Images
- Face Recognition Benchmark
- Flickr: 32 Class Brand Logos [fixme]
- GDXray – X-ray images for X-ray testing and Computer Vision
- HumanEva Dataset – The HumanEva-I dataset contains 7 calibrated video […]
- ImageNet (in WordNet hierarchy)
- Indoor Scene Recognition
- International Affective Picture System, UFL
- KITTI Vision Benchmark Suite
- Labeled Information Library of Alexandria – Biology and Conservation – […]
- MNIST database of handwritten digits, near 1 million examples [fixme]
- Multi-View Region of Interest Prediction Dataset for Autonomous Driving – […]
- Massive Visual Memory Stimuli, MIT
- Newspaper Navigator – This dataset consists of extracted visual content […]
- Open Images From Google – Pictures with segmentation masks for 2.8 […]
- RuFa – Contains images of text written in one of two Arabic fonts (Ruqaa […]
- SUN database, MIT
- SVIRO Synthetic Vehicle Interior Rear Seat Occupancy – 25.000 synthetic […]
- Several Shape-from-Silhouette Datasets [fixme]
- Stanford Dogs Dataset
- The Action Similarity Labeling (ASLAN) Challenge
- The Oxford-IIIT Pet Dataset
- Violent-Flows – Crowd Violence / Non-violence Database and benchmark
- Visual genome
- YouTube Faces Database
- Allen Institute Datasets
- Brain Catalogue
- Brainomics
- CodeNeuro Datasets [fixme]
- Collaborative Research in Computational Neuroscience (CRCNS)
- FCP-INDI
- Human Connectome Project
- NDAR
- NIMH Data Archive
- NeuroData
- NeuroMorpho – NeuroMorpho.Org is a centrally curated inventory of […]
- Neuroelectro
- OASIS
- OpenNEURO
- OpenfMRI
- Study Forrest
The largest repository of standardized and structured statistical data

Chess datasets
ML Dataset to practice methods of regression
Center for Machine Learning and Intelligent Systems
ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference
- The dataset was gathered from GitHub on Sep. 17th, 2020.
- It has more than 5.2K Python repositories and 4.2M type annotations.
- Use it to train ML-based type inference models for Python.
- Access it here
Quadrature magnetoresistance in overdoped cuprates
Measurements of the normal (i.e. non-superconducting) state magnetoresistance (change in resistance with magnetic field) in several single crystalline samples of copper-oxide high-temperature superconductors. The measurements were performed predominantly at the High Field Magnet Laboratory (HFML) in Nijmegen, the Netherlands, and the Pulsed Magnetic Field Facility (LNCMI-T) in Toulouse, France. Complete Zip Download
The UMA-SAR Dataset: Multimodal data collection from a ground vehicle during outdoor disaster response training exercises
Collection of multimodal raw data captured from a manned all-terrain vehicle in the course of two realistic outdoor search and rescue (SAR) exercises for actual emergency responders conducted in Málaga (Spain) in 2018 and 2019: the UMA-SAR dataset. Full Dataset.
Child Mortality from Malaria
Child mortality numbers caused by malaria by country
Number of deaths of infants, neonatal, and children up to 4 years old caused by malaria by country from 2000 to 2015. Originator: World Health Organization

Quora Question Pairs at Data.world
The dataset will give anyone the opportunity to train and test models of semantic equivalence, based on actual Quora data. 400,000 lines of potential question duplicate pairs. Each line contains IDs for each question in the pair, the full text for each question, and a binary value that indicates whether the line truly contains a duplicate pair. Access it here.
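Because each line carries the two question IDs, both question texts, and a binary duplicate flag, a first sanity check is to measure how many pairs are labeled as duplicates. The following is a minimal sketch using the standard csv module. The column names (id, qid1, qid2, question1, question2, is_duplicate) are an assumption about the file header, and the sample rows are invented.

```python
import csv
import io


def duplicate_rate(text_stream, label_column="is_duplicate"):
    """Return the fraction of rows whose binary label is 1."""
    total = duplicates = 0
    for row in csv.DictReader(text_stream):
        total += 1
        duplicates += int(row[label_column])
    return duplicates / total if total else 0.0


# Two made-up rows in the assumed layout.
sample = io.StringIO(
    "id,qid1,qid2,question1,question2,is_duplicate\n"
    '0,1,2,"How do I learn Python?","What is the best way to learn Python?",1\n'
    '1,3,4,"What is a graph?","How tall is Everest?",0\n'
)
print(duplicate_rate(sample))  # 0.5
```

A heavily skewed rate matters when choosing an evaluation metric, since plain accuracy can look good on an imbalanced label.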

MIMIC Critical Care Database
MIMIC is an openly available dataset developed by the MIT Lab for Computational Physiology, comprising deidentified health data associated with ~60,000 intensive care unit admissions. It includes demographics, vital signs, laboratory tests, medications, and more. Access it here.

Data.Gov: The home of the U.S. Government’s open data
Here you will find data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more. Search over 280000 Datasets.

Tidy Tuesday Dataset

TidyTuesday is built around open datasets that are found in the “wild” or submitted as Issues on our GitHub.
US Census Bureau: QuickFacts Dataset
QuickFacts provides statistics for all states and counties, and for cities and towns with a population of 5,000 or more.

Classical Abstract Art Dataset
Abstract art does not attempt to represent an accurate depiction of a visual reality but instead uses shapes, colours, forms and gestural marks to achieve its effect.
More than 5,000 classical abstract artworks by real artists, with annotations, are available here. You can download them in very high resolution, but you will have to crawl them first with this scraper.

Interactive map of indigenous people around the world
Native-Land.ca is a website run by the nonprofit organization Native Land Digital. Access it here.

Data Visualization: A Wordcloud for each of the Six Largest Religions and their Religious Texts (Judaism, Christianity, and Islam; Hinduism, Buddhism, and Sikhism)

Highest altitude humans have been each year since 1961

Worldwide prevalence of drug use
I took the data from IHME’s Global Burden of Disease 2019 study (2019 all-ages prevalence of drug use disorders among both men and women for all countries and territories) and plotted it using R.
Also, what is going on in the US exactly? 3.3% of the population there is addicted and it’s the worst rate in the world.
World Population and Energy Consumption: History and Projections
From the author:
I am working on a presentation and I found a similar graph but couldn’t source it, so I found the data and remade this:
Data:
Population 2010-2019: US Census Bureau via Alexa
Population 2020-2050: World Population Prospects – Population Division – United Nations
File POP/1-1: Total population (both sexes combined) by region, subregion and country, annually for 1950-2100 (thousands). Medium fertility variant, 2020 – 2100
World Energy Consumption: Annual Energy Outlook 2021
Number of Operational Nuclear Reactors by Region over Time
Data from Power Reactor Information System maintained by IAEA.
Countries with reactors in each region:
ASIA-E: China, Japan, South Korea, Taiwan
ASIA-S: India, Pakistan, Bangladesh (under construction)
EUROPE-E & ASIA-CTRL: Armenia, Bulgaria, Belarus, Czechia, Hungary, Kazakhstan, Lithuania, Romania, Russia, Slovenia, Slovakia, Ukraine, Turkey (under construction)
EUROPE-N, S & W: Belgium, Switzerland, Germany, Spain, Finland, France, UK, Italy, Netherlands, Sweden
LATIN AMERICA: Argentina, Brazil, Mexico
MIDDLE EAST: UAE, Iran
SUB-SAHARAN AFRICA: South Africa
USA & CANADA: Canada, USA (obviously)
Made with MATLAB
Edit: If it wasn’t clear, these are nuclear power stations only
source: r/dataisbeautiful

National Household Travel Survey (US)
Conducted by the Federal Highway Administration (FHWA), the NHTS is the authoritative source on the travel behavior of the American public. It is the only source of national data that allows one to analyze trends in personal and household travel. It includes daily non-commercial travel by all modes, including characteristics of the people traveling, their household, and their vehicles. Access it here.

National Travel Survey (UK)
Statistics and data about the National Travel Survey, based on a household survey to monitor trends in personal travel.
The survey collects information on how, why, when and where people travel as well as factors affecting travel (e.g. car availability and driving license holding).

National Travel Survey (NTS)[Canada]

ENTUR: NeTEx or GTFS datasets [Norway]
NeTEx is the official format for public transport data in Norway and is the most complete in terms of available data. GTFS is a downstream format with only a limited subset of the total data, but we generate datasets for it anyway since GTFS can be easier to use and has a wider distribution among international public transport solutions. GTFS sets come in “extended” and “basic” versions. Access here.
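A GTFS dataset is a zip archive of CSV-style text files, which is part of why it is easy to work with. As a small sketch, the function below reads stops.txt from such an archive with the standard library. The field names (stop_id, stop_name, stop_lat, stop_lon) come from the GTFS specification rather than from the text above, and the file name used in the usage note is hypothetical.

```python
import csv
import io
import zipfile


def read_stops(gtfs_zip):
    """Return (stop_id, stop_name, lat, lon) tuples from stops.txt in a GTFS zip.

    Rows without coordinates are skipped.
    """
    with zipfile.ZipFile(gtfs_zip) as archive:
        with archive.open("stops.txt") as raw:
            # utf-8-sig drops a byte-order mark if the feed includes one.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            return [
                (row["stop_id"], row["stop_name"],
                 float(row["stop_lat"]), float(row["stop_lon"]))
                for row in csv.DictReader(text)
                if row.get("stop_lat") and row.get("stop_lon")
            ]
```

With a downloaded feed saved as, say, gtfs.zip, read_stops("gtfs.zip") returns every stop that has coordinates. The function accepts either a path or a file-like object.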
The Swedish National Forest Inventory
A subset of the field data collected on temporary NFI plots can be downloaded in Excel format from this web site. The file includes a Read_me sheet and a sheet with field data from temporary plots on forest land collected from 2007 to 2019. Note that plots located on boundaries (for example boundaries between forest stands, or different land use classes) are not included in the dataset. The dataset is primarily intended to be used as reference data and validation data in remote sensing applications. It cannot be used to derive estimates of totals or mean values for a geographic area of any size. Download the dataset here
Large data sets from finance and economics applicable in related fields studying the human condition
World Bank Data: Countries Data | Topics Data | Indicators Data | Catalog
Boards of Governors of the Federal Reserve: Data Download Program
CIA: The World Factbook provides basic intelligence on the history, people, government, economy, energy, geography, environment, communications, transportation, military, terrorism, and transnational issues for 266 world entities.
Human Development Report: United Nations Development Programme – Public Data Explorer
Consumer Price Index: The Consumer Price Index (CPI) is a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services. Indexes are available for the U.S. and various geographic areas. Average price data for select utility, automotive fuel, and food items are also available.
Gapminder.org: Unveiling the beauty of statistics for a fact-based world view. Watch everyday life in hundreds of homes on all income levels across the world, to counteract the media’s skewed selection of images of other places.
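Since the Consumer Price Index listed above measures the average change in prices over time, the usual first calculation is the percent change between two index readings. The sketch below shows that arithmetic. The function name and the index values are illustrative assumptions, not real CPI figures.

```python
def percent_change(index_start, index_end):
    """Percent change from index_start to index_end."""
    if index_start <= 0:
        raise ValueError("index_start must be positive")
    return 100.0 * (index_end - index_start) / index_start


# Made-up index readings: a rise from 250.0 to 260.0 is a 4% increase.
print(percent_change(250.0, 260.0))  # 4.0
```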

Our world in Data: International Trade
Research and data to make progress against the world’s largest problems: 3139 charts across 297 topics, All free: open access and open source.

International Historical Statistics (by Brian Mitchell)
World Input-Output Database
World Input-Output Tables and underlying data, covering 43 countries, and a model for the rest of the world for the period 2000-2014. Data for 56 sectors are classified according to the International Standard Industrial Classification revision 4 (ISIC Rev. 4).
- Data: Real and PPP-adjusted GDP in US millions of dollars, national accounts (household consumption, investment, government consumption, exports and imports), exchange rates and population figures.
- Geographical coverage: Countries around the world
- Time span: from 1950-2011 (version 8.1)
- Available at: Online
Correlates of War Bilateral Trade
COW seeks to facilitate the collection, dissemination, and use of accurate and reliable quantitative data in international relations. Key principles of the project include a commitment to standard scientific principles of replication, data reliability, documentation, review, and the transparency of data collection procedures
- Data: Total national trade and bilateral trade flows between states. Total imports and exports of each country in current US millions of dollars and bilateral flows in current US millions of dollars
- Geographical coverage: Single countries around the world
- Time span: from 1870-2009
- Available at: Online here
- This data set is hosted by Katherine Barbieri, University of South Carolina, and Omar Keshk, Ohio State University.
World Bank Open Data – World Development Indicators
Free and open access to global development data. Access it here.
World Trade Organization – WTO
The WTO provides quantitative information in relation to economic and trade policy issues. Its data-bases and publications provide access to data on trade flows, tariffs, non-tariff measures (NTMs) and trade in value added.
- Data: Many series on tariffs and trade flows
- Geographical coverage: Countries around the world
- Time span: Since 1948 for some series
- Available at: Online here

SMOKA Science Archive
The Subaru-Mitaka-Okayama-Kiso Archive holds about 15 TB of astronomical data from facilities run by the National Astronomical Observatory of Japan. All data becomes publicly available after an embargo period of 12-24 months (to give the original observers time to publish their papers).
Graph Datasets
- Web crawl graph with 3.5 billion web pages and 128 billion hyperlinks
- Diverse graphs (Stanford) with up to 1.8 billion edges
- Twitter follower graph (Uni Koblenz) with 1.4 billion edges
- Diverse graph data sets (Yahoo) including bipartite graphs with 2.2 million edges
- Many web and social graphs with up to 95 billion edges. While this data collection seems to be very comprehensive, it is not trivially accessible without an external tool.
- Over 3000 social, biological, web graph data sets with small to large scale (dozens to billions of edges).
- Github project with dozens of graph data sets.
- Brain graphs (among other biological networks) with up to tens of millions of edges.
- Heterogeneous graph data from wolf interactions to co-authorships to social network data.
Multi-Domain Sentiment Dataset
The Multi-Domain Sentiment Dataset contains product reviews taken from Amazon.com from many product types (domains). Some domains (books and dvds) have hundreds of thousands of reviews. Others (musical instruments) have only a few hundred. Reviews contain star ratings (1 to 5 stars) that can be converted into binary labels if needed. Access it here.
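Converting star ratings into binary labels can be sketched as follows. The thresholds are an assumption (more than 3 stars is positive, fewer than 3 is negative, and 3-star reviews are dropped as ambiguous), and the function names and sample reviews are made up for illustration.

```python
def rating_to_label(stars):
    """Map a 1-5 star rating to 1 (positive), 0 (negative) or None (ambiguous)."""
    if stars > 3:
        return 1
    if stars < 3:
        return 0
    return None


def binarize(reviews):
    """Turn (text, stars) pairs into (text, label) pairs, dropping 3-star reviews."""
    labeled = ((text, rating_to_label(stars)) for text, stars in reviews)
    return [(text, label) for text, label in labeled if label is not None]


reviews = [("Great read", 5), ("It was fine", 3), ("Broke in a week", 1)]
print(binarize(reviews))  # [('Great read', 1), ('Broke in a week', 0)]
```

Dropping the middle rating shrinks the data slightly but avoids forcing a label onto reviews that are neither clearly positive nor clearly negative.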
A Global Database of Society
Supported by Google Jigsaw, the GDELT Project monitors the world’s broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.
The Yahoo News Feed: Ratings and Classification Data
Dataset is 1.5 TB compressed, 13.5 TB uncompressed
Yahoo! Music User Ratings of Musical Artists, version 1.0 (423 MB)
Yahoo! Movies User Ratings and Descriptive Content Information, v.1.0 (23 MB)
Yahoo News Video dataset, version 1.0 (645MB)
Other Datasets
More than 1 TB
- The 1000 Genomes project makes 260 TB of human genome data available
- The Internet Archive is making an 80 TB web crawl available for research
- The TREC conference made the ClueWeb09 dataset available a few years back. You’ll have to sign an agreement and pay a nontrivial fee (up to $610) to cover the sneakernet data transfer. The data is about 5 TB compressed.
- ClueWeb12 is now available, as are the Freebase annotations, FACC1
- CNetS at Indiana University makes a 2.5 TB click dataset available
- ICWSM made a large corpus of blog posts available for their 2011 conference. You’ll have to register (an actual form, not an online form), but it’s free. It’s about 2.1 TB compressed. The dataset consists of over 386 million blog posts, news articles, classifieds, forum posts and social media content between January 13th and February 14th. It spans events such as the Tunisian revolution and the Egyptian protests (see http://en.wikipedia.org/wiki/January_2011 for a more detailed list of events spanning the dataset’s time period). Access it here
- The Yahoo News Feed dataset is 1.5 TB compressed, 13.5 TB uncompressed
- The Proteome Commons makes several large datasets available. The largest, the Personal Genome Project, is 1.1 TB in size. There are several others over 100 GB in size.
More than 1 GB
- The Reference Energy Disaggregation Data Set has data on home energy use; it’s about 500 GB compressed.
- The Tiny Images dataset has 227 GB of image data and 57 GB of metadata.
- The ImageNet dataset is pretty big.
- The MOBIO dataset is about 135 GB of video and audio data
- The Yahoo! Webscope program makes several 1 GB+ datasets available to academic researchers, including an 83 GB data set of Flickr image features and the dataset used for the 2011 KDD Cup, from Yahoo! Music, which is a bit over 1 GB.
- Freebase makes regular data dumps available. The largest is their Quad dump, which is about 3.6 GB compressed.
- Wikipedia made a dataset containing information about edits available for a recent Kaggle competition. The training dataset is about 2.0 GB uncompressed.
- The Research and Innovative Technology Administration (RITA) has made available a dataset about the on-time performance of domestic flights operated by large carriers. The ASA compressed this dataset and makes it available for download.
- The wiki-links data made available by Google is about 1.75 GB total.
- Google Research released a large 24GB n-gram data set back in 2006 based on processing 10^12 words of text and published counts of all sequences up to 5 words in length.
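To make the last item concrete, "counts of all sequences up to 5 words in length" means counting every run of 1 to 5 consecutive words. A toy version of that counting, not Google's actual pipeline, might look like this (whitespace tokenization and the function name are simplifying assumptions):

```python
from collections import Counter


def ngram_counts(tokens, max_n=5):
    """Count every contiguous sequence of 1..max_n tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1
    return counts


tokens = "to be or not to be".split()
counts = ngram_counts(tokens)
print(counts[("to", "be")])  # 2
print(counts[("to",)])       # 2
```

At web scale the counts no longer fit in one in-memory dictionary, so the same logic is typically spread across many machines and rare sequences are pruned.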
Power and Energy Consumption Open Datasets
These data are intended to be used by researchers and other professionals working in power and energy related areas and requiring data for design, development, test, and validation purposes. These data should not be used for commercial purposes.
- Consumption
- Electric Vehicles
- Power Quality
- PV Generation
- Reliability
- Weather Data
- Wind Based Generation
- General Energy Data
- Monthly data on average electricity prices (US)
- Monthly data on average electricity prices (Mexico)
- Monthly data on average electricity prices (Brazil)
- Monthly data on average electricity prices (Europe)
- Monthly data on average electricity prices (Australia)
- Monthly data on average electricity prices (UK)
The Million Playlist Dataset (Spotify)
A dataset and open-ended challenge for music recommendation research (RecSys Challenge 2018). Sampled from the over 4 billion public playlists on Spotify, this dataset of 1 million playlists consists of over 2 million unique tracks by nearly 300,000 artists, and represents the largest public dataset of music playlists in the world. Access it here
How much each of 20 most popular artists earns from Spotify.

Regression Analysis Cheat Sheet

Hotel Reviews Dataset from Yelp
20k+ Hotel Reviews from Yelp for 5 Star Hotels in Las Vegas.
This dataset can be used for the following applications and more:
Analyzing trends, Sentiment Analysis / Opinion Mining, Competitor Analysis. Access it here.
A truncated version with 500 reviews is also available on Kaggle here
Motorcycle Crash data
1- Texas: Perform specific queries and analysis using Texas traffic crash data.
2- BTS: Motorcycle Rider Safety Data
3- National Transportation Safety Board: US Transportation Fatalities in 2019
4- Fatal single vehicle motorcycle crashes
5- Motorcycle crash causes and outcomes : pilot study
6- Motorcycle Crash Causation Study: Final Report
Natural Disasters – Free News Intelligence Dataset
Download a collection of news articles relating to natural disasters over an eight-month period. Access it here.
World Population Data by Country and Age Group
1- Worldometer: Countries in the world by population (2021)
2- Worldometer: Current World Population Live
Top 10 richest billionaires from 1987-2021

Source: Here
World’s Top 80 Wealthiest People by Country of Origin as of November 2021
Source: Here
From the author:
Needless to say, the United States absolutely dominates this list more than any other country. 9 of the top 10 are Americans, you’d have to combine the next 5 countries after the US to match their output of 33 among the top 80, and you’d have to combine every other country not named China on this graph to equal the USA.
To break things down based on region:
– The Americas has 34 individuals on this list with USA (33) and Mexico (1)
– Asia-Pacific has 28 individuals on this list with China (14), India (5), Hong Kong (4), Japan (3), and Australia (2)
– Europe has 18 individuals on this list with France (5), Russia (5), Germany (3), Italy (2), UK (1), Ireland (1), and Spain (1)
How Americans Spend Money on Halloween

Source: here
How the Duration of an Average World Series Baseball Game Has Changed Over 118 Years
Source: Here
Investment-Related Dataset with both Qualitative and Quantitative Variables
1- Numer.ai: Anonymized and feature normalized financial data which is interesting for machine learning applications. Download here
3- Quandl: The premier source for financial, economic and alternative datasets, serving investment professionals.
National Obesity Monitor
The National Health and Nutrition Examination Survey (NHANES) is conducted every two years by the National Center for Health Statistics and funded by the Centers for Disease Control and Prevention. The survey measures obesity rates among people ages 2 and older. Find the latest national data and trends over time, including by age group, sex, and race. Data are available through 2017-2018, with the exception of obesity rates for children by race, which are available through 2015-2016. Access here

The World’s Nations by Fertility Rate 2021

Total number of deaths due to Covid19 vis-à-vis Population in million

Unlike its successor (COVID), SARS only heavily impacted 5 countries.
USA Cigarettes Sold v. Lung Cancer Death Rates
Google searches for different emotions during each hour of the day and night

Where do the world’s CO2 emissions come from? This map shows emissions during 2019. Darker areas indicate areas with higher emissions

Global Linguistic Diversity

Where in the world are the densest forests? Darker areas represent higher density of trees.

Likes and Dislikes per movie genre

Global Historical Climatology Network-Monthly (GHCN-M) temperature dataset
NCEI first developed the Global Historical Climatology Network-Monthly (GHCN-M) temperature dataset in the early 1990s. Subsequent iterations include version 2 in 1997, version 3 in May 2011, and version 4 in October 2018.

Electric power consumption (kWh per capita)
The World’s Most Eco-Friendly Countries
Alternate Source from Wikipedia : List of countries by carbon dioxide emissions per capita


Alcohol-Impaired Driving Deaths by State & County [US]
% change in life expectancy from 2020 to 2021 across the globe

This is how life expectancy is calculated.
How Many Years Till the World’s Reserves Run Out of Oil?

Data Source Here: Note that these values can change with time based on the discovery of new reserves, and changes in annual production.
Which energy source has the least disadvantages?

Here’s a paper on the wind fatalities

Human development index (HDI) by world subdivisions

The Human Development Index (HDI) is a statistical composite index of life expectancy, education (mean years of schooling completed and expected years of schooling upon entering the education system), and per capita income indicators, and is used to rank countries into four tiers of human development.
Data source – subnational human development index website
US Streaming Services Market Share, 2020 vs 2021

Number of tweets deleted by month

Average Career Length by Sports Profession
Source: r/dataisbeautiful
From the author:
Got these numbers from here
Numbers like these are a quick reminder that not every athlete is LeBron James or Roger Federer who can play their sport at such high levels for their entire young adulthood while becoming billionaires in the process. Many careers are short lived and end abruptly while the athlete is still very young and some don’t really have a plan B.
NFL being at the bottom here doesn’t surprise me though as most positions (with the exception of QB and kicker) in US Football is lowkey bodily suicide.
Football/Soccer Leagues with the fairest distributions of money have seen the most growth in long-term global interest.

How Much Does Your Favorite Fast Food Brand Spend on Ads?

Sources:
mcdonald-s-advertising-spending-worldwide/
dominos-pizza-advertising-spending-usa/
advertising-expense-chick-fil-a/
starbucks-advertising-spending-in-the-us
Historical population count of Western Europe
Results from survey on how to best reduce your personal carbon footprint

Data from IpsosMori
Where does the world’s non-renewable energy come from?
![r/dataisbeautiful - Where does the world's non-renewable energy come from? Zoom in to see a point for each power plant! [OC]](https://preview.redd.it/jh7lxu3qwpt61.png?width=960&crop=smart&auto=webp&s=3678d21b2fc94780eadd6eb4f191e47d4af50daf)
The data comes from the Global Power Plant Database. The Global Power Plant Database is a comprehensive, open source database of power plants around the world. It centralizes power plant data to make it easier to navigate, compare and draw insights for one’s own analysis. The database covers approximately 30,000 power plants from 164 countries and includes thermal plants (e.g. coal, gas, oil, nuclear, biomass, waste, geothermal) and renewables (e.g. hydro, wind, solar). Each power plant is geolocated and entries contain information on plant capacity, generation, ownership, and fuel type. It will be continuously updated as data becomes available.
Recorded Music Industry Revenues from 1997 to 2020
Source: riaa.com/
US Trade Surpluses and Deficits by Country (2020)
Facebook Monthly Active Users
Facebook data is based on the end of year from 2004 to 2020

Source: SeeMetrics.com
Heat map of the past 50,000 earthquakes pulled from USGS sorted by magnitude
Source: USGS website
Where do the world’s methane (CH4) emissions come from?
Darker areas indicate areas with higher emissions.
Source: Data comes from EDGARv5.0 website and Crippa et al. (2019)
Earth Surface Albedo (1950 to 2020)
Data Source: ECMWF ERA5
Wealth of Forbes’ Top 100 Billionaires vs All Households in Africa
Forbes’ 35th Annual World’s Billionaires List
Credit Suisse Global Wealth Report 2020
United Nations World Population Prospects
20 years of Apple sales in a minute
Source: Wikipedia
Racial Diversity of Each State (Based on US Census 2019 Estimates)
Suppose your state is 60% orc, 30% undead, and 10% tauren. Your chance of a random selection of two being of the same race is as follows:
36% chance ((60%)²) of two orcs
9% chance ((30%)²) of two undead
1% chance ((10%)²) of two tauren
For a total of 46%. The diversity index would be 100% minus that, or 54%.
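The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, assuming the two people are drawn independently (which is what squaring each share implies); the function name `diversity_index` is hypothetical and not part of the original post.

```python
def diversity_index(shares):
    """Probability that two randomly selected people belong to different groups.

    `shares` maps a group name to its population share (fractions summing to 1).
    """
    total = sum(shares.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    # Chance that both picks land in the same group, summed over all groups.
    same_group = sum(p * p for p in shares.values())
    return 1.0 - same_group


state = {"orc": 0.60, "undead": 0.30, "tauren": 0.10}
index = diversity_index(state)
print(f"{index:.0%}")  # 54%
```

A state with a single group scores 0%, and the score rises as the population is spread more evenly across more groups.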
A curated, daily feed of newly published datasets in machine learning
Machine Learning: CIFAR-10 Dataset
The CIFAR-10 dataset consists of 60000 32×32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

Machine Learning: ImageNet
The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset has been used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a benchmark in image classification and object detection. The publicly released dataset contains a set of manually annotated training images.
Machine Learning: The MNIST Database of Handwritten Digits
The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.
It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting. Access it here.
The Massively Multilingual Image Dataset (MMID)
MMID is a large-scale, massively multilingual dataset of images paired with the words they represent collected at the University of Pennsylvania. The dataset is doubly parallel: for each language, words are stored parallel to images that represent the word, and parallel to the word’s translation into English (and corresponding images). Documentation.
AWS CLI Access (No AWS account required)
aws s3 ls s3://mmid-pds/ --no-sign-request

Capitol insurrection arrests per million people by state
How have cryptocurrencies done during the Pandemic?
Data Source: Downloaded performance data on these cryptocurrencies from Investing.com which provides free historic data
Share of US Wealth by Generation
Source: US Federal Reserve
Top 100 Cryptocurrencies by Market Cap
Data Source from coinmarketcap.com/
Crypto race: DOGE vs BTC, last 365 days
Data sources: Coindesk BTC, Coindesk DOGE

12,000 years of human population dynamics
Countries with a higher Human Development Index (HDI) than the European Union (EU)
HDI is calculated by the UN every year to measure a country’s development using average life expectancy, education level, and gross national income per capita (PPP). The EU has a collective HDI of 0.911.
Data Source: Here
Countries with a higher Human Development Index (HDI) than the United States (US)
Data source: Human Development Report 2020
Child marriage by country, by gender
Data on the percentage of children married before reaching adulthood (18 years).
Data source The State of the World’s Children 2019
Wars with greater than 25,000 deaths by year
Data Source : Wikipedia
Population Projection for China and India till 2050
Data Source: Here
Relative cumulative and per capita CO2 emissions 1751-2017

Data Source: ourworldindata.org
Formula 1 Cumulative Wins by Team (1950-2021)
Data Source : f1-fansite.com/f1-results/
Countries with the most nuclear warheads. A couple of days ago I posted this with a logarithmic scale.
Data source: Wikipedia
Using machine learning methods to group NFL quarterbacks into archetypes

Data Source:
I collected a series of rushing and passing statistics for NFL quarterbacks from 2015-2020 and applied a machine learning technique called clustering, which automatically sorts observations into groups based on shared characteristics using a mathematical “distance metric.”
The idea was to use machine learning to determine NFL Quarterback Archetype to agnostically determine which quarterbacks were truly “mobile” quarterbacks, and which were “pocket passers” that relied more on passing. I used a number of metrics in my actual clustering analysis, but they can be effectively summarized across two dimensions: passing and rushing, which can be further roughly summarized across two metrics: passer rating and rushing yards per year. Plotting the quarterbacks along these dimensions and plotting the groups chosen by the clustering methodology shows how cleanly the methodology selected the groups.
Read this blog article on the process for more information if you’re interested, or just check out this blog in general if you found this interesting!
Data: Collected from the ESPN API
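To make the clustering idea concrete, here is a minimal sketch using k-means on the two summary dimensions mentioned above (passer rating and rushing yards per year). The author does not say which clustering algorithm or distance metric was used, so k-means with Euclidean distance is an assumption, and the six quarterbacks below are hypothetical placeholders rather than real player data.

```python
from statistics import mean, pstdev

# Hypothetical (passer rating, rushing yards per year) pairs, NOT real player data.
quarterbacks = {
    "QB A": (102.0, 80.0),
    "QB B": (99.0, 50.0),
    "QB C": (97.0, 110.0),
    "QB D": (88.0, 650.0),
    "QB E": (91.0, 720.0),
    "QB F": (86.0, 800.0),
}


def standardize(rows):
    """Rescale every column to zero mean and unit variance."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) or 1.0 for c in columns]
    return [tuple((x - m) / s for x, m, s in zip(row, mus, sigmas)) for row in rows]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=50):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            updated.append(tuple(mean(c) for c in zip(*members)) if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return labels


names = list(quarterbacks)
labels = k_means(standardize(list(quarterbacks.values())), k=2)
groups = {}
for name, label in zip(names, labels):
    groups.setdefault(label, []).append(name)
print(groups)
```

Standardizing first matters because rushing yards span hundreds of units while passer rating spans only a few; without it, the distance metric would be dominated by rushing yards alone.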
2M rows of 1-min S&P bars (12 years of stock data) – 2008-2021
Intraday Stock Data (1 min) – S&P 500 – 2008-21: 12 years of 1 minute bars for data science / machine learning.
Granular stock bar data for research is difficult to find and expensive to buy. The author has compiled this library from a variety of sources and is making it available for free.
One compressed CSV file with 9 columns and 2.07 million rows worth of 1 minute SPY bars. Access it here
A global database of COVID-19 vaccinations



Datasets: A live version of the vaccination dataset and documentation are available in a public GitHub repository here. These data can be downloaded in CSV and JSON formats. PDF.
A list of available datasets for machine learning in manufacturing
Industrial ML Datasets: curated list of datasets, publicly available for machine learning researches in the area of manufacturing.
Predictive Maintenance and Condition Monitoring
Name | Year | Feature Type | Feature Count | Target Variable | Instances | Official Train/Test Split | Data Source | Format | |
---|---|---|---|---|---|---|---|---|---|
Diesel Engine Faults Features | 2020 | Signal | 84 | C (4) | 3,500 | | Synthetic | MAT | Link |
Process Monitoring
Name | Year | Feature Type | Feature Count | Target Variable | Instances | Official Train/Test Split | Data Source | Format | |
---|---|---|---|---|---|---|---|---|---|
High Storage System Anomaly Detection | 2018 | Signal | 20 | C (2) | 91,000 | | Synthetic | CSV | Link |
Predictive Quality and Quality Inspection
Name | Year | Feature Type | Feature Count | Target Variable | Instances | Official Train/Test Split | Data Source | Format | |
---|---|---|---|---|---|---|---|---|---|
Casting Product Quality Inspection | 2020 | Image | 300×300 512×512 | C (2) | 7,348 | ✔️ | Real | JPG | Link |
Process Parameter Optimization
Name | Year | Feature Type | Feature Count | Instances | Official Train/Test Split | Data Source | Format | |
---|---|---|---|---|---|---|---|---|
Laser Welding | 2020 | Signal | 13 | 361 | | Real | XLS | Link |
Data Analytics Certification Questions and Answers Dumps

Datasets needed for Crop Disease Identification using image processing
Here is a collection of datasets with images of leaves
and more generic image datasets that include plant leaves
One hundred plant species datasets
A Database of Leaf Images: Practice towards Plant Conservation with Plant Pathology
Survival Analysis datasets for machines
English alphabet organized by each letter’s note in ABC

Discover datasets hosted in thousands of repositories across the Web using datasetsearch.research.google.com
Create, maintain, and contribute to a long-living dataset that will update itself automatically across projects.
Datasets should behave like git repositories.

Learn how to create, maintain, and contribute to a long-living dataset that will update itself automatically across projects, using git and DVC as versioning systems, and DAGsHub as a host for the datasets.
Human Rights Measurement Initiative Datasets
World Wide Energy Production by Source 1860 – 2019
Source: r/dataisbeautiful
Data source: ourworldindata.org/energy
Project Sunroof – Solar Electricity Generation Potential by Census Tract/Postal Code
Courtesy of Google’s Project Sunroof: This dataset essentially describes the rooftop solar potential for different regions, based on Google’s analysis of Google Maps data to find rooftops where solar would work, and aggregate those into region-wide statistics.
It comes in a couple of aggregation flavors – by census tract, where the region name is the census tract id, and by postal code, where the name is the postal code. Each also contains latitude/longitude bounding boxes and averages, so that you can download based on that, and you should be able to do custom larger aggregations using those, if you’d like.
Carbon emission arithmetic + hard v. soft science
Source: r/dataisbeautiful
Data sources: Video From data-driven documentary The Fallen of World War II. Here and Here
Most popular Youtuber in every country 2021
What Does 1GB of Mobile Data Cost in Every Country?

Key Concepts of Data Science
A large dataset aimed at teaching AI to code, it consists of some 14M code samples and about 500M lines of code in more than 55 different programming languages, from modern ones like C++, Java, Python, and Go to legacy languages like COBOL, Pascal, and FORTRAN.
NSRDB: National Solar Radiation Database
Download instructions are here
Cheat Sheet for Machine Learning, Data Science.

Emigrants from the UK by Destination
Data source: Originally at the location marked on the Sankey Flow but is now here
Direct link to the spreadsheet used
US Rivers and Streams Dataset
Data source: hub.arcgis.com/
Bubble Chart that compares the GDP of the G20 Countries
Data source: databank.worldbank.org
Desktop OS Market Share 2003 – 2021
Source: r/dataisbeautiful
Data source: W3Schools
National Parks of North America
Data Source: DataBayou
NPS.gov, Open.canada.ca, and sig.conanp.gob.mx
Inflation of Bitcoin and DogeCoin vs. Federal Reserve target
Data source:
Percentage of women who experienced physical or sexual violence since the age of 15 in the EU
Data Source from The Guardian:
The whole report – Questionnaire
Canadian Interprovincial Migration

Some context here
Data scraped from StatsCan
Covid-19 Vaccination Doses Administered per 100 in the G20
Data source: ourworldindata.org covid-vaccinations
What does per 100 mean?
When the whole country is double vaccinated, the value will be 200 doses per 100 population. At the moment the UK is like 85, which is because ~70% of the population has had at least one dose and ~15% of the population (which is a subset of that 70%) have had two. Hence ~30% are currently unprotected – myself included until Sunday.
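The "per 100" figure is simple addition, sketched below under the assumption of a two-dose schedule. The function name `doses_per_100` is hypothetical; the 70% and 15% inputs are the approximate UK figures quoted above.

```python
def doses_per_100(at_least_one_dose_pct, two_doses_pct):
    """Doses administered per 100 people under a two-dose schedule.

    Everyone with at least one dose contributes one dose; the subset with
    two doses contributes one more.
    """
    if two_doses_pct > at_least_one_dose_pct:
        raise ValueError("people with two doses must also have a first dose")
    return at_least_one_dose_pct + two_doses_pct


uk = doses_per_100(at_least_one_dose_pct=70, two_doses_pct=15)
print(uk)                       # 85
print(doses_per_100(100, 100))  # 200, the whole population double vaccinated
print(100 - 70)                 # 30, percent with no dose yet
```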
Import/Export of Conventional Arms by Different Countries over past 2 decades
DataSource: SIPRI Arms Transfer Database
Aggregated disease comparison dataset
According to the author of the source data: “For the 1918 Spanish Flu, the data was collected by knowing that the total counts were 500M cases and 50M deaths, and then taking a fraction of that per day based on the area of this graph image:” – the graph used is here:
Trending Google Searches by State Between 2018 and 2020
Data source: trends.google.com Trending topics from 2010 to 2019 were taken from Google’s annual Year in Search summary 2010-2019
The full, ~11 minute video covering the whole 2010s decade is available here at youtu.be/xm91jBeN4oo
Google Trends provides weekly relative search interest for every search term, along with the interest by state. Using these two datasets for each term, we’re able to calculate the relative search interest for every state for a particular week. Linear interpolation was used to calculate the daily search interest.
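The interpolation step can be sketched as follows. This is a minimal illustration of linear interpolation from weekly samples to daily values; the function name and the sample numbers are hypothetical, and the sketch does not cover how the author combined the weekly and per-state datasets, which the description leaves unspecified.

```python
from datetime import date, timedelta


def interpolate_daily(weekly):
    """Linearly interpolate (date, value) samples to one value per day."""
    samples = sorted(weekly)
    if not samples:
        return {}
    daily = {}
    for (d0, v0), (d1, v1) in zip(samples, samples[1:]):
        span = (d1 - d0).days
        for offset in range(span):
            # Move a proportional share of the way from v0 towards v1.
            daily[d0 + timedelta(days=offset)] = v0 + (v1 - v0) * offset / span
    last_day, last_value = samples[-1]
    daily[last_day] = last_value
    return daily


# Hypothetical weekly interest values for one search term in one state.
weekly = [
    (date(2019, 1, 6), 40.0),
    (date(2019, 1, 13), 54.0),
    (date(2019, 1, 20), 47.0),
]
daily = interpolate_daily(weekly)
print(daily[date(2019, 1, 8)])   # 44.0
print(daily[date(2019, 1, 16)])  # 51.0
```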
Market capitalization in billions of dollars of the top 20 cryptocurrencies on 2021-05-20
Data source: CoinMarket from end of 2013 until present
Top Chess Players From 2000-2020
Data source: ratings.fide.com/
The y-axis is the world Elo ratings (called FIDE ratings).
Comparing Emissions Sources – How to Shrink your Carbon Footprint More Effectively
Data sources: Here
Source article: Here
Oil and gas-fired power plants in the world: dependence on fossil fuels
Data is from the Global Power Plant Database (World Resources Institute)

Top 100 Reddit posts of all time
Source: r/all on Reddit
Tool used: meta-chart.com
Fastest routes on land (and sometimes, boat) between all 990 pairs of European capitals
Source: Reddit
From the author: I started with data on roads from naturalearth.com, which also includes some ferry lines. I then calculated the fastest routes (assuming a speed of 90 km/h on roads, and 35 km/h on boat) between each pair of 45 European capitals. The animation visualizes these routes, with brighter lines for roads that are more frequently “traveled”.
In reality these are of course not the most traveled roads, since people don’t go from all capitals to all other capitals in equal measure. But I thought it would be fun to visualize all the possible connections.
The model is also very simple, and does not take into account varying speed limits, road conditions, congestion, border checks and so on. It is just for fun!
In order to keep the file size manageable, the animation only shows every tenth frame.
Is Russia, Turkey or country X really part of Europe? That of course depends on the definition, but it was more fun to include them than to exclude them! The Vatican is however not included since it would just be the same as the Rome routes. And, unfortunately, Nicosia on Cyprus is not included due to an error on my part. It should be!
Link to final still image in high resolution on my twitter
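The routing model described by the author can be sketched on a toy network. The author does not name the algorithm used, so Dijkstra's algorithm is an assumption here; the four-node network and its distances are hypothetical, and only the two speeds (90 km/h on roads, 35 km/h by boat) come from the description above.

```python
import heapq
from itertools import combinations

SPEED_KMH = {"road": 90.0, "ferry": 35.0}  # speeds quoted by the author

# Hypothetical toy network: (from, to, distance in km, kind). Not real data.
EDGES = [
    ("A", "B", 180.0, "road"),
    ("B", "C", 270.0, "road"),
    ("A", "C", 70.0, "ferry"),
    ("C", "D", 90.0, "road"),
]


def build_graph(edges):
    """Turn distance edges into an undirected graph weighted by travel hours."""
    graph = {}
    for a, b, km, kind in edges:
        hours = km / SPEED_KMH[kind]
        graph.setdefault(a, []).append((b, hours))
        graph.setdefault(b, []).append((a, hours))
    return graph


def fastest_hours(graph, start):
    """Dijkstra's algorithm: travel time in hours from start to every node."""
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        hours, node = heapq.heappop(queue)
        if hours > best.get(node, float("inf")):
            continue  # stale queue entry, a faster route was already found
        for neighbour, cost in graph[node]:
            candidate = hours + cost
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return best


graph = build_graph(EDGES)
times = {node: fastest_hours(graph, node) for node in graph}
for a, b in combinations(sorted(graph), 2):
    print(f"{a}-{b}: {times[a][b]:.1f} h")

# 45 capitals give 45 * 44 / 2 unordered pairs.
print(len(list(combinations(range(45), 2))))  # 990
```

In this toy network the 70 km ferry from A to C takes 2 hours and still beats the 5-hour road detour through B, which is the kind of trade-off the two speeds are meant to capture.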
Pokemon Dataset
- Dataset of all 825 Pokemon (this includes Alolan Forms). It would be preferable if there are at least 100 images of each individual Pokemon.
pokedex: This is a Python library slash pile of data containing a whole lot of data scraped from Pokémon games. It’s the primary guts of veekun.
2) This dataset comprises more than 800 Pokemon spanning up to 8 generations.
Using this dataset has been fun for me. I used it to create a mosaic of Pokemon taking an image as reference. You can find it here and it’s free to use: Couple Mosaic (powered by Pokemons)
Here is the data type information in the file:
- Name: Pokemon Name
- Type: Type of Pokemon, such as Grass, Fire, or Water
- HP: Hit Points
- Attack: Attack Points
- Defense: Defence Points
- Sp. Atk: Special Attack Points
- Sp. Def: Special Defence Points
- Speed: Speed Points
- Total: Total Points
- url: Pokemon web-page
- icon: Pokemon Image
Data File: Pokemon-Data.csv
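A file with these columns could be loaded and sanity-checked roughly as follows. This is a sketch under two assumptions: that the CSV header uses exactly the column names listed above, and that Total is meant to equal the sum of the six stat columns. The three rows are made-up stand-ins (one with a deliberately wrong Total), not rows from Pokemon-Data.csv.

```python
import csv
import io

STAT_COLUMNS = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]

# Made-up stand-in for Pokemon-Data.csv; url and icon are left empty here.
SAMPLE = """Name,Type,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Total,url,icon
Leafmon,Grass,45,49,49,65,65,45,318,,
Flamemon,Fire,39,52,43,60,50,65,309,,
Wavemon,Water,44,48,65,50,64,43,315,,
"""


def load_pokemon(handle):
    """Parse rows and convert the stat columns (and Total) to integers."""
    rows = []
    for row in csv.DictReader(handle):
        for column in STAT_COLUMNS + ["Total"]:
            row[column] = int(row[column])
        rows.append(row)
    return rows


def total_mismatches(rows):
    """Names of rows whose Total differs from the sum of the six stats."""
    return [r["Name"] for r in rows
            if sum(r[c] for c in STAT_COLUMNS) != r["Total"]]


pokemon = load_pokemon(io.StringIO(SAMPLE))
print(total_mismatches(pokemon))
strongest = max(pokemon, key=lambda r: r["Total"])
print(strongest["Name"], strongest["Total"])
```

To run it against the real file, replace `io.StringIO(SAMPLE)` with `open("Pokemon-Data.csv", newline="", encoding="utf-8")`; the encoding is a guess and may need adjusting.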
30×30 m Worldwide High-Resolution Population and Demographics Data
ETL pipeline for Facebook’s research project to provide detailed large-scale demographics data. It’s broken down in roughly 30×30 m grid cells and provides info on groups by age and gender.
Data Source and API for access
Article about Dataset at Medium
Gridded global datasets for Gross Domestic Product and Human Development Index over 1990–2015
Rasterized GDP dataset – basically a heat map of global economic activity.
Gap-filled multiannual datasets in gridded form for Gross Domestic Product (GDP) and Human Development Index (HDI)
Decrease in worldwide infant mortality from 1950 to 2020
Data Sources: United Nations, CIA World Factbook, IndexMundi.
Countries of the world sorted by those that have warmed the most in the last 10 years, showing temperatures from 1890 to 2020
Data source: Gistemp temperature data
The GISS Surface Temperature Analysis ver. 4 (GISTEMP v4) is an estimate of global surface temperature change. Graphs and tables are updated around the middle of every month using current data files from NOAA GHCN v4 (meteorological stations) and ERSST v5 (ocean areas), combined as described in our publications Hansen et al. (2010) and Lenssen et al. (2019).
Climate change concern vs personal spend to reduce climate change
![r/dataisbeautiful - [OC] Climate change concern vs personal spend to reduce climate change](https://preview.redd.it/s78w2y8whq171.png?width=960&crop=smart&auto=webp&s=aa07421a85869867a8ff3c5303cc4aefee19c2b9)
Data Source: Competitive Enterprise Institute (PDF)
Less than 20 firms produce over a third of all carbon emissions
The Illusion of Choice in Consumer Brands
Buying a chocolate bar? There are seemingly hundreds to choose from, but it’s just the illusion of choice. They pretty much all come from Mars, Nestlé, or Mondelēz (which owns Cadbury).
Source: Visual Capitalist
Yearly Software Sales on PlayStation Consoles since 1994
Some context for these numbers :
- PS4 holds the record for being the console to have sold the most games in video game history (> 1.622B units)
- Previous record holder was PS2 at 1.537B games sold
- PS4 holds the record for having sold the most games in a single year (> 300M units in FY20)
- FY20 marks the biggest yearly software sales in PlayStation ecosystem with more than 338M units
- Since the PS5’s release, Sony has combined PS4/PS5 software sales
- In FY12, Sony combined PS2/PS3 and PSP/VITA software sales
- Sony stopped disclosing software sales in FY13/14
- Source: Sony’s financial results
Yearly Hardware Sales of PlayStation Consoles since 1994
Sony combined PS2/PS3 hardware sales in FY12 and combined PSP/VITA sales in FY12/13/14
- Source: Sony’s financial results
Cybertruck vs F150 Lightning pre-orders, by time since debut
Source: Ford exec tweeting about preorder numbers this week
Top 100 Most Populous City Proper in the world
The City with 32 million is Chongqing, Shan is Shanghai, Beijin is Beijing, and Guangzho is Guangzhou
Tax data for different countries
What do Europeans feel most attached to – their region, their country, or Europe?
Data source: Builds on data from the 2021 European Quality of Government Index. You can read more about the survey and download the data here
Cost of 1gb mobile data in every country
Dataset: Visual Capitalist
Frequency of all digrams in 18 languages, diacritics included
Dataset (according to author): Dictionaries are scattered on the internet and had to be borrowed from several sources: the Scrabble3d project, and Linux spellcheck dictionaries. The data can be found in the folder “Avec_diacritiques”.
Criteria for choosing a dictionary:
– No proper nouns
– “Official” source if available
– Inclusion of inflected forms
– Among two lists, the larger was preferred
– No or very rare abbreviations if possible, though these are hard to detect in unknown languages and across hundreds of thousands of words.
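Counting digrams from a word list like the dictionaries described above could look roughly like this. It is a minimal sketch, not the author's code: the function name is hypothetical, the three-word list is a placeholder, and normalizing to Unicode NFC (so that an accented letter counts as one character) is an assumption about how diacritics were handled.

```python
import unicodedata
from collections import Counter


def digram_frequencies(words):
    """Relative frequency of each pair of adjacent letters within words."""
    counts = Counter()
    for word in words:
        # NFC keeps an accented letter as a single character.
        word = unicodedata.normalize("NFC", word.lower())
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    total = sum(counts.values())
    return {digram: n / total for digram, n in counts.items()}


# Tiny placeholder word list; diacritics are kept as distinct letters.
words = ["été", "étés", "thé"]
freqs = digram_frequencies(words)
for digram, share in sorted(freqs.items(), key=lambda item: -item[1])[:3]:
    print(digram, round(share, 3))
```

Because digrams are taken within each word, no pair spans a word boundary, and "é" and "e" are counted as different letters.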
Mapped: The World’s Nuclear Reactor Landscape
Dataset: Visual Capitalist
Database of 999 chemicals based on liver-specific carcinogenicity
The author found this dataset in a more accessible format upon searching for the keyword “CPDB” (Carcinogenic Potency Database) in the National Library of Medicine Catalog. Check out this parent website for the data source and dataset description. The dataset referenced in OP’s post concerns liver-specific carcinogens, which are marked by the “liv” keyword as described in the dataset description’s Tissue Codes section.
SMS Spam Collection Data Set
Download: Data Folder, Data Set Description
The SMS Spam Collection is a public set of labeled SMS messages that have been collected for mobile phone spam research.
Open Datasets for Autonomous Driving
A2D2 Dataset – ApolloScape Dataset – Argoverse Dataset – Berkeley DeepDrive Dataset
CityScapes Dataset – Comma2k19 Dataset – Google-Landmarks Dataset
KITTI Vision Benchmark Suite – LeddarTech PixSet Dataset – Level 5 Open Data – nuScenes Dataset
Oxford Radar RobotCar Dataset – PandaSet – Udacity Self Driving Car Dataset – Waymo Open Dataset
Open Dataset people are looking for [Help if you can]
- Looking for Dataset on the outcomes of abstinence-only sex education.
- Funny Datasets for School Data Science Project [1, 2, 3, 4, 5]
- Need a dataset for English practicing chatbot. [1, 2 ]
- Creating a dataset for plant disease recognition [1, 2 ]
- Central Bank Speeches Dataset (Text data from 1997 to 2020 from 118 institutions) [1, 2]
- Cat Meow Classification dataset [1, 2]
- Looking for Raw Data: Camping / Outdoors Travel in United States trends, etc [1, 2 ]
- Looking for Data set of horse race results / lottery results any results related to gambling [1, 2, 3]
- Looking for Football (Soccer) Penalties Dataset [1, 2]
- Looking for public datasets on baseball [1, 2, 3]
- Looking for Datasets on edge computing for AI bandwidth usage, latency, memory, CPU/GPU resource usage? [1, 2]
- Data set of people who died by suicide [1, 2 ]
- Supreme Court dataset with opinion text? [1, 2, 3, 4, 5: https://storage.googleapis.com/scotus-db/scotus-db.db]
- Dataset of employee attrition or turnover rate? [1, 2]
- Is there a Dataset for homophobic tweets? [1, 2, 3, 4]
- Looking for a Machine condition Monitoring Dataset [1,2]
- Where to find data for credit risk analysis? [1, 2]
- Datasets on homicides anywhere in the world [1, 2]
- Looking for a dataset containing coronavirus self-test (if this is a thing globally) pictures for ML use
- Is there any transportation dataset with daily frequency? [1, 2]
- A Dataset of film Locations [1, 2 ]
- Looking for a classification dataset [1, 2, 3, 4, 5]
- Where can I search for macroeconomics data? [1, 2, 3, 4, 5, 6, 7]
- Looking for Beam alignment 5G vehicular networks dataset
- Looking for tidy dataset for multivariate analysis (PCA, FA, canonical correlations, clustering)
- Fuel location datasets for all fuel types in India [1, 2]
- Spotify Playlists Dataset [1, 2]
- World News Headline Dataset. With Sentiment Scores. Free download in JSON format. Updated often. [1, 2]
- Are there any free open source recipe datasets for commercial use [1, 2, 3, 4, 5]
- Curated social network datasets with summary statistics and background info
- Looking for textile crop disease datasets such as jute, flax, hemp
- Shopify App Store and Chrome Webstore Datasets
- Looking for dataset for university chatbot
- Collecting real life (dirty/ugly) datasets for data analysis
- In Need of Food Additive/Ingredient Definition Database
- Recent smart phone sensor Dataset – Android
- Cracked Mobile Screen Image Dataset for Detection
- Looking for Chiller fault data in a chiller plant
- Looking for dataset that contains the genetic sequences of native plasmids?
- Looking for a dataset containing fetus size measurements at various gestational ages.
- Looking for datasets about mental health since 2021
- Do you know where to find a dataset with Graphical User Interfaces defects of web applications? [1, 2, 3 ]
- Looking for most popular accounts on social media like Twitter, TikTok, Instagram [1, 2, 3]
- GPS dataset of grocery stores
- What is the easiest way to bulk download all of the data from this epidemiology website? (~20,000 files)
- Looking for Dataset on Percentage of death by US state and Canadian province grouped by cause of death?
- Looking for Social engineering attack dataset in social media
- Steam Store Games (Clean dataset) by Nik Davis
- Dataset that lists all US major hospitals by county
- Another dataset that lists all US major hospitals by county
- Looking for open source data relating privacy behavior or related marketing sets about the trustworthiness of responders?
- Looking for a dataset that tracks median household income by country and year
- Dataset on the number of specific surgical procedures performed in the US (yearly)
- Looking for a dataset from reddit or twitter on top posts or tweets related to crypto currency
- Looking for Image and flora Dataset of All Known Plants, Trees and Shrubs
- US total fertility rates data on the state level
- Dataset of Net Worth of *World* Politicians
- Looking for water wells and borehole datasets
- Looking for Crop growth conditions dataset
- Dataset for machine translation JA-EG
- Looking for Electronic Health Record (EHR) record prices
- Looking for tax data for different countries
- Musicians Birthday Datasets and Associated groups
- Searching for dataset related to car dealerships [1]
- Looking for Credit Score Approval dataset
- Cyberbullying Dataset by demographics
- Datasets on financial trends for minors
- Data where I can find out about reading habits? [1, 2]
- Data sets for global technology adoption rates
- Looking for any and all cat / feline cancer datasets, for both detection and treatment
- ITSM dictionary/taxonomy datasets for topic modeling purposes
- Multistage Reliability Dataset
- Looking for dataset of ingredients for food[1]
- Looking for datasets with responses to psychological questionnaires[1,2,3]
- Data source for OEM automotive parts
- Looking for dataset about gene regulation
- Customer Segmentation Datasets (For LTV Models)
- Automobile dataset, years of ownership and repairs
- Historic Housing Prices Dataset for Individual Houses
- Looking for the data for all the tokens on the Uniswap graph
- Job applications emails datasets, either rejection, applications or interviews
- E-learning datasets for impact of e-learning on school/university students
- Food delivery dataset (Uber Eats, Just Eat, …)
- Data Sets for NFL Quarterbacks since 1995
- Medicare Beneficiary Population Data
- Covid 19 infected Cancer Patients datasets
- Looking for EV charging behavior dataset
- State park budget or expansionary spending dataset
- Autonomous car driving deaths dataset
- FMCG Spending habits over the pandemic
- Looking for a Question Type Classification dataset
- 20 years of Manufacturer/Retail price of Men’s footwear
- Dataset of Global Technology Adoption Rates
- Looking For Real Meeting Transcripts Dataset
- Dataset For A Large Archive Of Lyrics [1,2,3]
- Audio dataset with swearing words
- A global, georeferenced event dataset on electoral violence with lethal outcomes from 1989 to 2017. [1]
- Looking for Jaundice Dataset for ML model
- Looking for social engineering attack detection dataset?
- Wound image datasets to train ML model [1]
- Seeking resume and job post dataset
- Labelled dataset (sets of images or videos) of human emotions [1,2]
- Dataset of specialized phone call transcripts
- Looking for Emergency Response Plan Dataset for family Homes, condo buildings and Companies
- Looking for Birthday wishes datasets
- Desperately in need of national data for real estate [1, 2]
- NFL playoffs games stadium attendance dataset
- Datasets with original publication dates of novels [1,2]
- Annotated Documents with Images Data Dump
- Looking for dataset for “Face Presentation Attack Detection”
- Electric vehicle range & performance dataset [1, 2]
- Dataset or API with valid postal codes for US, Mexico, and Canada with country, state/province, and city/town [1, 2, 3, 4, 5, 6]
- Looking for Data sources regarding Online courses dropout rate, preferably by countries [1,2 ]
- Are there datasets for language learning? [1, 2]
- Corporate Real Estate Data [1,2, 3]
- Looking for simple clinical trials datasets [1, 2]
- CO2 Emissions By Aircraft (or Aircraft Type) – Climate Analysis Dataset [1,2, 3, 4]
- Player Session/playtime dataset from games [1,2]
- Data sets that support Data Science (Technology, AI etc) being beneficial to sustainability [1,2]
- Datasets of a grocery store [1,2]
- Looking for mri breast cancer annotation datasets [1,2]
- Looking for free exportable data sets of companies by industry [1,2]
- Datasets on Coffee Production/Consumption [1,2]
- Video gaming industry datasets – release year, genre, games, titles, global data [1,2]
- Looking for mobile speaker recognition dataset [1,2]
- Public DMV vehicle registration data [1,2]
- Looking for historical news articles based on industry sector [1,2]
- Looking for Historical state wide Divorce dataset [1,2]
- Public Big Datasets – with In-Database Analytics [1,2]
- Dataset for detecting Apple products (object detection) [1,2]
- Help needed to get the American Hospital Association (AHA) datasets (AHA Annual Survey, AHA Financial Database, and AHA IT Survey datasets) [1, 2]
- Looking for help Getting College Football Betting Data [1,2]
- 2012-2020 US presidential election results by state/city dataset [1,2, 3]
- Looking for datasets of models and images captured using iphone’s LIDAR? [1,2]
- Finding Datasets to Age Texts (Newspapers, Books, Anything works) [1, 2, 3]
- Looking for cost of living index of some type for US [1,2]
- Looking for dataset that recorded historical NFT prices and their price increases, as well as timestamps. [1,2]
- Looking for datasets on park boundaries across the country [1, 2, 3]
- Looking for medical multimodal datasets [1, 2, 3]
- Looking for Scraped Parler Data [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
- Looking for Silicon Wafer Demand dataset [1, 2]
- Looking for a dataset with the values [Gender – Weight – Height – Health] [1, 2]
- Exam questions (mcqs and short answer) datasets? [1, 2]
- Canada Botanical Plants API/Database [1, 2, 3]
- Looking for a geospatial dataset of birds Migration path [1, 2, 3]
- WhatsApp messages dataset/archives [1, 2]
- Dataset of GOOD probiotic microorganisms in the HUMAN gut [1, 2]
- Twitter competition to reduce bias in its image cropping [1,2]
- Dataset: US overseas military deployments, 1950–2020 [1,2]
- Dataset on human clicking on desktop [1,2]
- Covid-19 Cough Audio Classification Dataset [1, 2]
- 12,000+ known superconductors database [1, 2, 3]
- Looking for good dataset related to cyber security for prediction [1, 2]
- Where can I find face datasets to classify whether it is a real person or a picture of that person. For authentication purposes? [1,2]
- DataSet of Tokyo 2020 (2021) Olympics ( details about the Athletes, the countries they representing, details about events, coaches, genders participating in each event, etc.) [1, 2]
- What is your workflow for budget compute on datasets larger than 100GB? [1, 2, 3]
- Looking for a dataset that contains information about cryptocurrencies. [1, 2]
- Looking for a depression dataset [1, 2, 3]
- Looking for chocolate consumer demographic data [1,2, 3]
- Looking for thorough dataset of housing price/tax history [1, 2, 3]
- Wallstreetbets data scraping from 01/01/2020 to 01/06/2021 [1, 2]
- Retinal Disease Classification Dataset [1, 2]
- 400,000 years of CO2 and global temperature data [1, 2, 3]
- Looking for datasets on neurodegenerative diseases [1, 2, 3]
- Dataset for Job Interviews (either Phone, Online, or Physical) [1, 2, 3]
- Firm Cyber Breach Dataset with Firm Identifiers [1, 2, 3]
- Wondering where stock market and crypto websites get their data from [1, 2, 3, 4, 5]
- Looking for a dataset with US tourist injuries, attacks, and/or fatalities when traveling abroad [1, 2, 3]
- Looking for a wildfires database for all countries by year and month: the number of wildfires, the acreage burned, and similar details [1, 2, 3]
- Looking for a pill vs fake pill image dataset [1, 2, 3, 4, 5, 6, 7]
Cars for sale in Germany from 2011 to 2021
Dataset obtained by scraping AutoScout24, with information about new and used cars. The file describes 46,405 vehicles by mileage, make, model, fuel, gear, offer type, price, horsepower, and registration year.
Percentage of female students in higher education by subject area
The data was obtained from the UK government website here, so unfortunately some details about the data and methodology are unknown to the author.
All the passes: A visualization of ~1 million passes from 890 matches played in major football/soccer leagues/cups
- Champions League 1999
- FA Women’s Super League 2018
- FIFA World Cup 2018
- La Liga 2004 – 2020
- NWSL 2018
- Premier League 2003 – 2004
- Women’s World Cup 2019
Data Source: StatsBomb
Global “Urbanity” Dataset (using population mosaics, nighttime lights, & road networks)
In this project, the authors designed a spatial model that classifies urbanity levels globally and with high granularity. As the target geographic support for the model, they selected the quadkey grid at level 15, whose cells are approximately 1x1 km at the equator.
Dataset: Here
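The cell size quoted above can be sanity-checked with a little arithmetic. This is only a rough sketch: it assumes the quadkey grid follows the usual Web Mercator tile scheme, where level z splits the world into 2^z cells per side, and it assumes an equatorial circumference of about 40,075 km. Neither assumption is stated by the authors.

```python
# Assumed equatorial circumference of the Earth in km (not from the source).
EQUATOR_KM = 40075.017


def quadkey_cell_width_km(level: int) -> float:
    """Approximate east-west width of one quadkey cell at the equator.

    Assumes level `level` divides the equator into 2**level equal cells.
    """
    return EQUATOR_KM / (2 ** level)


width = quadkey_cell_width_km(15)
print(f"Level 15 cell width at the equator: {width:.2f} km")
```

Under these assumptions the width comes out a little above 1 km, which is consistent with the "approximately 1x1 km" description. Cells shrink in ground distance away from the equator.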
Percentage of students with disabilities in higher education by subject area
The author obtained the data from the UK Government website, so unfortunately the methodology and how the data was collected are not known.
Comparing with the general public is a useful idea: according to the Government site, 6% of children, 16% of working-age adults and 45% of pension-age adults are disabled.
Dataset: here
Arrests for Hate Crimes in NYC by Category, 2017-2020
The Most Successful U.S. Sports Franchises
Data source: sports-reference.com/
Adult cognitive skills (PIAAC literacy and numeracy) by Percentile and by country
According to the author, this animation depicts adult cognitive skills, as measured by the PIAAC study by OECD. Here, the numeracy and literacy skills have been combined into one. Each frame of the animation shows the xth percentile skill level of each individual country. Thus, you can see which countries have the highest and lowest scores among their bottom performers, median performers, and top performers. For example, when the bottom 1st percentile of each country is ranked, Japan is at the top, Russia is second, etc. Looking at the 50th percentile (median) of each country, Japan is top, then Finland, etc.
The Programme for the International Assessment of Adult Competencies (PIAAC) is a study by OECD that measures literacy, numeracy, and “problem-solving in technology-rich environments” skills for people ages 16 and up. For those familiar with the PISA study of school-age children, this is essentially an adult version of it.
Dataset: PIAAC
G7 Corporate Tax rate 1980 – 2020
Dataset: Tax Foundation
Euro 2020 (played in 2021) Group Stage Predictions Based on a Bayesian Linear Item Response Model
Data Source: UEFA qualifying match data
The model was built in Stan and was inspired by Andrew Gelman’s World Cup model shown here. These plots show posterior probabilities that the team on the y axis will score more goals than the team on the x axis. There is some redundancy of information here (because if I know P(England beats Scotland) then I know P(Scotland beats England) )
Data Source: Italian National Institute of Statistics (Istituto Nazionale di Statistica)
The 15 most shared musicians on Reddit
Data source: The authors made a dataset of YouTube and Spotify shares on Reddit. More info available here
Spam vs. Legitimate Email, Average Global Emails per Day
Data Source: Here. The author computed the average per day over the June 3 – June 9, 2021 period.

Falling Fertility, 1800–2016
Data source: Here (go to the “Babies per woman,” “Income,” and “Population” links on that page).
Europe Covid-19 waves
Data Source: Here
Who is going to win EURO 2020? Predicted probabilities pooled together from 18 sources
Data source: Here
Population Density of Canada 2020
DataSet: Gathered from worldpop.org/project
The longer the spike, the greater the population density.
The portion of a country’s population that is fully vaccinated for COVID (as of June 2021) scales with GDP per capita.
Dataset of Chemical reaction equations
4- Chemistry datasets
Maths datasets
1111 2222 3333 Equation Learning
Datasets for Stata Structural Equation Modeling
SQL Queries Dataset
SEDE (Stack Exchange Data Explorer) is a dataset comprised of 12,023 complex and diverse SQL queries and their natural language titles and descriptions, written by real users of the Stack Exchange Data Explorer out of a natural interaction. These pairs contain a variety of real-world challenges which were rarely reflected so far in any other semantic parsing dataset. Access it here
Countries of the world, ranked by population, with the 100 largest cities in the world marked
Each map size is proportional to population, so China takes up about 18-19% of the map space.
Countries with very far-flung territories, such as France (or the USA) will have their maps shrunk to fit all territories. So it is the size of the map rectangle that is proportional to population, not the colored area. Made in R, using data from naturalearthdata.com. Maps drawn with the tmap package, and placed in the image with the gridExtra package. Map colors from the wesanderson package.
Data source: The Economist
What businesses in different countries search for when they look for a marketing agency – “creative” or “SEO”?
Data source: Google Trends
More maps, charts and written analysis on this topic here
Is the economic gap between new and old EU countries closing?
Data source: Eurostat
Interactive version so you can click on those circles here
Reddit r/wallstreetbets posts and comments in real-time
- Beneath adds some useful features for shared data, like the ability to run SQL queries, sync changes in real-time, a Python integration, and monitoring. The monitoring is really useful as it lets you check out the write activity of the scraper (no surprise, WSB is most active when markets are open).
- The scraper (which uses Async PRAW) is open source here
Global NO2 pollution data visualization June 2021
Data Source: SILAM
Shopify App Store Report: 2021
Data source: Marketplace Apps
The Chrome Webstore Report: 2021
Data source: Marketplace Apps
Percentage of Adults with HIV/AIDS in Africa
Dataset: All the countries through the UN AIDS organization
Recorded CDC deaths (2014 – June 16, 2021) from Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified (R00-R99)
Data Source: combined CDC weekly death counts 2014 – 2019 and CDC weekly death counts 2020-2021
What are the long term gains on cryptocurrencies?
Data Sources: investing.com and coingecko.com
The chart shows the average daily gain in $ if $100 were invested at a date on x-axis. Total gain was divided by the number of days between the day of investing and June 13, 2021. Gains were calculated on average 30-day prices.
Time range: from March 28, 2013, till June 13, 2021
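The calculation described above can be sketched as follows. This is a minimal illustration, not the author's code: the function name and the example prices are hypothetical, and the inputs are assumed to be the 30-day average prices the post mentions.

```python
from datetime import date

# End of the time range stated in the post.
END_DATE = date(2021, 6, 13)


def average_daily_gain(invest_date: date, price_then: float, price_end: float,
                       stake: float = 100.0, end_date: date = END_DATE) -> float:
    """Average gain in dollars per day for `stake` dollars invested on `invest_date`.

    Total gain is divided by the number of days between the investment
    date and the end date, as described in the post.
    """
    days = (end_date - invest_date).days
    if days <= 0:
        raise ValueError("invest_date must be before end_date")
    total_gain = stake * (price_end / price_then - 1.0)
    return total_gain / days


# Hypothetical prices: the asset doubles over the 10 days before the end date.
example = average_daily_gain(date(2021, 6, 3), price_then=50.0, price_end=100.0)
print(example)
```

Dividing by the holding period makes early and late entry dates comparable on one axis, at the cost of hiding how uneven the gains were within each period.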
Life Expectancy and Death Probability by Age and Gender
Data source: Here
Daily Coronavirus cases in Canada vs % of Population Vaccinated
Google Playstore Apps with 2.3 million app data on Kaggle
The Google Playstore dataset is now available on Kaggle with double the data (2.3 million Android applications) and a new attribute stating the date and time each record was scraped.
Dataset: Get it here or here
African languages dataset
We have 3000 tribes or more in Africa and in that 3000 we have sub tribes.
1– Introduction to African Languages (Harvard)
2- Languages of the world at Ethnologue
3- Britannica: Nilo-Saharan Languages
4- Britannica: Khoisan Languages
Daily Temperature of Major Cities Dataset
Daily average temperature values recorded in major cities of the world.
The dataset is available as separate txt files for each city here. The data is available for research and non-commercial purposes only
Do stricter gun laws reduce firearms homicides?
Data Sources: Guns to Carry, EFSGV, CDC
According to the author: Looking at non-suicide firearms deaths by state (2019), and then grouping by the Guns to Carry rating (1-5 stars), it seems that stricter gun laws are correlated with fewer firearms homicides. Guns to Carry rates states based on “Gun friendliness” with 1 star being least friendly (California, for example), and 5 stars being most friendly (Wyoming, for example). The ratings aren’t perfect but they include considerations like: Permit required, Registration, Open carry, and Background checks to come up with a rating.
The numbers at the bottom are the average non-suicide deaths calculated within the rating group. Each bar shows the number for the individual state.
Interesting that DC is through the roof despite having strict laws. On the flip side, Maine is very friendly towards gun owners and has a very low homicide rate, despite having the highest ratio of suicides to homicides.
Obviously, lots of things to consider and this is merely a correlation at a basic level. This is a topic that interested me so I figured I’d share my findings. Not attempting to make a policy statement or anything.
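The grouping step described above (average non-suicide firearm deaths within each rating group) can be sketched like this. The rows are made-up placeholders rather than the author's data, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (state, Guns to Carry rating 1-5, non-suicide firearm death rate).
# These values are placeholders, not real figures.
rows = [
    ("State A", 1, 2.0),
    ("State B", 1, 4.0),
    ("State C", 5, 6.0),
]


def average_by_rating(rows):
    """Return {rating: mean death rate} across the states sharing that rating."""
    groups = defaultdict(list)
    for _state, rating, rate in rows:
        groups[rating].append(rate)
    return {rating: mean(rates) for rating, rates in sorted(groups.items())}


averages = average_by_rating(rows)
print(averages)
```

A plain per-group mean weights every state equally regardless of population, so a small state counts as much as a large one. That is worth keeping in mind when reading the averages.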
Relative frequency of words in economics textbooks vs their frequency in mainstream English (the Google Books corpus)
Data Source: Data for word frequency in the Google corpus is from the 2019 Ngram dataset. For details about how to work with this data, see Working With Google Ngrams: A Data-Wrangling Tale.
Data for word frequency in econ textbooks was compiled by myself by scraping words from 43 undergraduate economics textbooks. For details see Deconstructing Econospeak.
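One common way to compare word usage across two corpora is the ratio of each word's share of all words. The sketch below assumes that definition, which may differ from the author's exact method. The counts are toy numbers and the function name is hypothetical.

```python
from collections import Counter


def relative_frequency_ratio(word: str, corpus_counts: Counter,
                             reference_counts: Counter) -> float:
    """Ratio of a word's share in `corpus_counts` to its share in `reference_counts`.

    A value above 1 means the word is over-represented in the corpus.
    """
    corpus_share = corpus_counts[word] / sum(corpus_counts.values())
    reference_share = reference_counts[word] / sum(reference_counts.values())
    if reference_share == 0:
        raise ValueError(f"{word!r} does not occur in the reference corpus")
    return corpus_share / reference_share


# Toy counts standing in for textbook and general-English word frequencies.
econ = Counter({"market": 30, "the": 70})
general = Counter({"market": 1, "the": 99})

ratio = relative_frequency_ratio("market", econ, general)
print(ratio)
```

Because such ratios span many orders of magnitude, they are usually easier to read on a logarithmic axis.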
Hours per day spent on mobile devices by US adults
Author: nava_7777
Data Source: eMarketer, as quoted by Jon Erlichman
Purpose, according to the author: raw textual numbers (like in the original tweet) are hard to compare, particularly the acceleration or deceleration of a trend. The author made the chart for personal use and shared it in case it is useful to others.
Environmental Impact of Coffee Brewing Methods
Author: Coffee_Medley
More according to the author:
Measurements and calculations of NG and Electricity used to heat four cups of distilled water by Coffee Medley (6/14/2021)
Average coffee bag and pod weight by Coffee Medley (6/14/2021)
Murders in major U.S. Cities: 2019 vs. 2020
Author: datacanbeuseful
Data source: NPR
New Harvard Data (Accidentally) Reveal How Lockdowns Crushed the Working Class While Leaving Elites Unscathed
Data source: Harvard
Support for same-sex marriage by religious group
Data source: PEW
More: Summary of religiously (un)affiliated people’s views on homosexuality, broken down into 18 countries
Daily chance of dying for Americans
Author: NortherSugarLoaf
Data source: SSA Actuarial Data
Processing: Yearly probability of death is converted to the daily probability and expressed in micromorts. Plotted versus age in years.
According to the author,
A few things to notice: It’s dangerous to be a newborn. The same mortality rates are reached again only in the fifties. However, mortality drops after birth very quickly and the safest age is about ten years old. After experiencing mortality jump in puberty – especially high for boys, mortality increases mostly exponentially with age. Every thirty years of life increase chances of dying about ten times. At 80, chance of dying in a year is about 5.8% for males and 4.3% for females. This mortality difference holds for all ages. The largest disparity is at about twenty three years old when males die at a rate about 2.7 times higher than females.
This data is from before COVID.
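The processing step above (yearly probability of death converted to a daily probability in micromorts) can be sketched as follows. A micromort is a one-in-a-million chance of death. The sketch assumes a constant daily risk throughout the year; the author may instead have simply divided the yearly probability by the number of days, which gives nearly the same result for small probabilities.

```python
def yearly_to_daily_micromorts(p_year: float, days_per_year: float = 365.25) -> float:
    """Convert a yearly probability of death into daily micromorts.

    Assumes the same daily risk every day, so that
    (1 - p_day) ** days_per_year == 1 - p_year.
    """
    p_day = 1.0 - (1.0 - p_year) ** (1.0 / days_per_year)
    return p_day * 1_000_000


# Yearly probabilities at age 80 quoted in the text above.
male_80 = yearly_to_daily_micromorts(0.058)
female_80 = yearly_to_daily_micromorts(0.043)
print(f"male: {male_80:.0f} micromorts/day, female: {female_80:.0f} micromorts/day")
```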
Here is the same graph but in linear Y axis scale
Here is the male to female mortality ratio
Mapping Global Carbon Emission Intensity (Dec 2020)
![r/dataisbeautiful - [OC] Mapping Global Carbon Emission Intensity (Dec 2020)](https://preview.redd.it/aoh1zkvfmm671.png?width=960&crop=smart&auto=webp&s=c2d6a5b7ac2af3deda9e5a65a191ed83f78dab06)
Data Source: Copernicus Atmosphere Monitoring Service (CAMS)
Religions with the most Adherents from 1945 – 2010

IPO Returns 2000-2020

From the author u/nobjos: The full article on the above analysis can be found here
The author runs the subreddit r/market_sentiment, which publishes a comprehensive deep dive on one investment strategy or topic every week. Some of the author's popular articles are:
a. Performance of Jim Cramer’s stock picks
b. Performance of buy and sell recommendations made by financial analysts in the last decade
c. Benchmarking performance of Motley Fool against the S&P 500
Funko IPO is considered to have the worst first-day return for an IPO in the last two decades.
Out of the top 10 list, only 3 Investment banks had below-average returns.
On average, IPOs did make money for the investor. But the amount differs significantly depending on whether you were allocated the IPO at the offer price or bought it at the market price.
Baidu.com made a whopping 354% on its listing day. Another interesting observation is that 6 out of 10 companies in the list were listed in 2000 (just before the dot-com crash).
Total number of streams per artist vs. number of Top 200 hits (Spotify Top 200 since 2017)
Author: blairfix
Data is from the Spotify Top 200 and covers the period from Jan. 1, 2017 to Jun. 9, 2021. You can download my dataset here.
For every artist that appears in the Top 200, I add up their total streams (for all songs) and the total number of songs in the dataset.
For a commentary on the data, see The Half Life of a Spotify Hit.
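The per-artist aggregation described above can be sketched as below. The rows are invented for illustration and do not come from the author's dataset. The sketch also assumes each row is one chart entry with an artist, a song title, and a stream count.

```python
from collections import defaultdict

# Hypothetical chart rows: (artist, song, streams for one chart entry).
rows = [
    ("Artist A", "Song 1", 100),
    ("Artist A", "Song 1", 50),
    ("Artist A", "Song 2", 25),
    ("Artist B", "Song 3", 10),
]


def summarize_by_artist(rows):
    """Return {artist: (total streams, number of distinct songs)}."""
    streams = defaultdict(int)
    songs = defaultdict(set)
    for artist, song, count in rows:
        streams[artist] += count
        songs[artist].add(song)
    return {artist: (streams[artist], len(songs[artist])) for artist in streams}


summary = summarize_by_artist(rows)
print(summary)
```

Tracking songs in a set keeps a track that charts on many days from being counted more than once.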
Number of Miss Americas by U.S. State
Data Source: Wikipedia
The World’s Nuclear Warheads
Author: academiadvice
Data Source: Federation of American Scientists – status-world-nuclear-forces/
Tools: Excel, Datawrapper, coolors.co/
Check out the FAS site for notes and caveats about their estimates. Governments don’t just print this stuff on their websites. These are evidence-based estimates of tightly-guarded national secrets.
Of particular note – Here’s what the FAS says about North Korea: “After six nuclear tests, including two of 10-20 kilotons and one of more than 150 kilotons, we estimate that North Korea might have produced sufficient fissile material for roughly 40-50 warheads. The number of assembled warheads is unknown, but lower. While we estimate North Korea might have a small number of assembled warheads for medium-range missiles, we have not yet seen evidence that it has developed a functioning warhead that can be delivered at ICBM range.”
The population of Las Vegas over time
Data Source: Wikipedia
The Alpha to Omega of Wikipedia
Author: feldesque
Data Source: The wikipediatrend package in R
Glacial Inter-glacial cycles over the past 450000 years
Source: geology.utah.gov/
Global temperature change from 1850-2020
Top Companies Contributing to Open Source – 2011/2021
Source and links
The author used several sources for this video and article. The first, for the video, is GitHub Archive & CodersRank. For the analysis of the OSCI index data, the author used opensourceindex.io
Crime Rates in the US: 1960-2021
Data source: Here
The 2021 figures are straight projections and must be taken with a grain of salt. However, assuming a continued rise in the murder rate is not unreasonable given recent news reports, such as: here
In a property crime, a victim’s property is stolen or destroyed, without the use or threat of force against the victim. Property crimes include burglary and theft as well as vandalism and arson.
A network visualization of privacy research (83k nodes, 462k edges)
Author: FvDijk
This image was generated for my research mapping the privacy research field. The visual is a combination of network visualisation and manual adding of the labels.
The data was gathered from Scopus, a high-quality academic publication database, and the visualisation was created with Gephi. The initial dataset held ~120k publications and over 3 million references, from which we selected only the papers and references in the field.
The labels were assigned by manually identifying clusters and two independent raters assigning names from a random sample of publications, with a 94% match between raters.
The scripts used are available on Github:
The full paper can be found on the author’s website:
GDP (at purchasing power parity) per capita in international dollars
Author: Simaniac
Data source: IMF
Phone Call Anxiety dataset for Millennials and Gen Z
Author: /u/CynicalScyntist
This is a randomized experiment the author conducted with 450 people on Amazon MTurk. Each person was randomly assigned to one of three writing activities in which they either (a) described their phone, (b) described what they’d do if they received a call from someone they know, or (c) described what they’d do if they received a call from an unknown number. Pictures of an iPhone with a corresponding call screen were displayed above the text box (blank, “Incoming Call,” or “Unknown”). Participants then rated their anxiety on a 1-4 scale.
Dataset: Here
Hate Crime Statistics in New York State 2019-2021

List of Freely available programming books - What is the single most influential book every Programmers should read
- Bjarne Stroustrup - The C++ Programming Language
- Brian W. Kernighan, Rob Pike - The Practice of Programming
- Donald Knuth - The Art of Computer Programming
- Ellen Ullman - Close to the Machine
- Ellis Horowitz - Fundamentals of Computer Algorithms
- Eric Raymond - The Art of Unix Programming
- Gerald M. Weinberg - The Psychology of Computer Programming
- James Gosling - The Java Programming Language
- Joel Spolsky - The Best Software Writing I
- Keith Curtis - After the Software Wars
- Richard M. Stallman - Free Software, Free Society
- Richard P. Gabriel - Patterns of Software
- Richard P. Gabriel - Innovation Happens Elsewhere
- Code Complete (2nd edition) by Steve McConnell
- The Pragmatic Programmer
- Structure and Interpretation of Computer Programs
- The C Programming Language by Kernighan and Ritchie
- Introduction to Algorithms by Cormen, Leiserson, Rivest & Stein
- Design Patterns by the Gang of Four
- Refactoring: Improving the Design of Existing Code
- The Mythical Man Month
- The Art of Computer Programming by Donald Knuth
- Compilers: Principles, Techniques and Tools by Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman
- Gödel, Escher, Bach by Douglas Hofstadter
- Clean Code: A Handbook of Agile Software Craftsmanship by Robert C. Martin
- Effective C++
- More Effective C++
- CODE by Charles Petzold
- Programming Pearls by Jon Bentley
- Working Effectively with Legacy Code by Michael C. Feathers
- Peopleware by DeMarco and Lister
- Coders at Work by Peter Seibel
- Surely You're Joking, Mr. Feynman!
- Effective Java 2nd edition
- Patterns of Enterprise Application Architecture by Martin Fowler
- The Little Schemer
- The Seasoned Schemer
- Why's (Poignant) Guide to Ruby
- The Inmates Are Running The Asylum: Why High Tech Products Drive Us Crazy and How to Restore the Sanity
- The Art of Unix Programming
- Test-Driven Development: By Example by Kent Beck
- Practices of an Agile Developer
- Don't Make Me Think
- Agile Software Development, Principles, Patterns, and Practices by Robert C. Martin
- Domain-Driven Design by Eric Evans
- The Design of Everyday Things by Donald Norman
- Modern C++ Design by Andrei Alexandrescu
- Best Software Writing I by Joel Spolsky
- The Practice of Programming by Kernighan and Pike
- Pragmatic Thinking and Learning: Refactor Your Wetware by Andy Hunt
- Software Estimation: Demystifying the Black Art by Steve McConnell
- The Passionate Programmer (My Job Went To India) by Chad Fowler
- Hackers: Heroes of the Computer Revolution
- Algorithms + Data Structures = Programs
- Writing Solid Code
- JavaScript - The Good Parts
- Getting Real by 37 Signals
- Foundations of Programming by Karl Seguin
- Computer Graphics: Principles and Practice in C (2nd Edition)
- Thinking in Java by Bruce Eckel
- The Elements of Computing Systems
- Refactoring to Patterns by Joshua Kerievsky
- Modern Operating Systems by Andrew S. Tanenbaum
- The Annotated Turing
- Things That Make Us Smart by Donald Norman
- The Timeless Way of Building by Christopher Alexander
- The Deadline: A Novel About Project Management by Tom DeMarco
- The C++ Programming Language (3rd edition) by Stroustrup
- Patterns of Enterprise Application Architecture
- Computer Systems - A Programmer's Perspective
- Agile Principles, Patterns, and Practices in C# by Robert C. Martin
- Growing Object-Oriented Software, Guided by Tests
- Framework Design Guidelines by Brad Abrams
- Object Thinking by Dr. David West
- Advanced Programming in the UNIX Environment by W. Richard Stevens
- Hackers and Painters: Big Ideas from the Computer Age
- The Soul of a New Machine by Tracy Kidder
- CLR via C# by Jeffrey Richter
- The Timeless Way of Building by Christopher Alexander
- Design Patterns in C# by Steve Metsker
- Alice in Wonderland by Lewis Carroll
- Zen and the Art of Motorcycle Maintenance by Robert M. Pirsig
- About Face - The Essentials of Interaction Design
- Here Comes Everybody: The Power of Organizing Without Organizations by Clay Shirky
- The Tao of Programming
- Computational Beauty of Nature
- Writing Solid Code by Steve Maguire
- Philip and Alex's Guide to Web Publishing
- Object-Oriented Analysis and Design with Applications by Grady Booch
- Effective Java by Joshua Bloch
- Computability by N. J. Cutland
- Masterminds of Programming
- The Tao Te Ching
- The Productive Programmer
- The Art of Deception by Kevin Mitnick
- The Career Programmer: Guerilla Tactics for an Imperfect World by Christopher Duncan
- Paradigms of Artificial Intelligence Programming: Case studies in Common Lisp
- Masters of Doom
- Pragmatic Unit Testing in C# with NUnit by Andy Hunt and Dave Thomas with Matt Hargett
- How To Solve It by George Polya
- The Alchemist by Paulo Coelho
- Smalltalk-80: The Language and its Implementation
- Writing Secure Code (2nd Edition) by Michael Howard
- Introduction to Functional Programming by Philip Wadler and Richard Bird
- No Bugs! by David Thielen
- Rework by Jason Fried and DHH
- JUnit in Action
#BlackOwned #BlackEntrepreneurs #BlackBuniness #AWSCertified #AWSCloudPractitioner #AWSCertification #AWSCLFC02 #CloudComputing #AWSStudyGuide #AWSTraining #AWSCareer #AWSExamPrep #AWSCommunity #AWSEducation #AWSBasics #AWSCertified #AWSMachineLearning #AWSCertification #AWSSpecialty #MachineLearning #AWSStudyGuide #CloudComputing #DataScience #AWSCertified #AWSSolutionsArchitect #AWSArchitectAssociate #AWSCertification #AWSStudyGuide #CloudComputing #AWSArchitecture #AWSTraining #AWSCareer #AWSExamPrep #AWSCommunity #AWSEducation #AzureFundamentals #AZ900 #MicrosoftAzure #ITCertification #CertificationPrep #StudyMaterials #TechLearning #MicrosoftCertified #AzureCertification #TechBooks
Top 1000 Canada Quiz and trivia: CANADA CITIZENSHIP TEST- HISTORY - GEOGRAPHY - GOVERNMENT- CULTURE - PEOPLE - LANGUAGES - TRAVEL - WILDLIFE - HOCKEY - TOURISM - SCENERIES - ARTS - DATA VISUALIZATION

Top 1000 Africa Quiz and trivia: HISTORY - GEOGRAPHY - WILDLIFE - CULTURE - PEOPLE - LANGUAGES - TRAVEL - TOURISM - SCENERIES - ARTS - DATA VISUALIZATION

Exploring the Pros and Cons of Visiting All Provinces and Territories in Canada.

Exploring the Advantages and Disadvantages of Visiting All 50 States in the USA

Health Health, a science-based community to discuss human health
- FDA appears to be slow-walking vaccine approvalsby /u/nbcnews on April 28, 2025 at 11:29 pm
submitted by /u/nbcnews [link] [comments]
- Ultraprocessed food increases early death risk: studyby /u/CTVNEWS on April 28, 2025 at 8:28 pm
submitted by /u/CTVNEWS [link] [comments]
- Two cities — Calgary and Juneau — stopped adding fluoride to water. Science reveals what happened to people's oral health.by /u/Science_News on April 28, 2025 at 8:08 pm
submitted by /u/Science_News [link] [comments]
- Warnings issued for spice that can interfere with prescription medicine effectivenessby /u/theindependentonline on April 28, 2025 at 7:13 pm
submitted by /u/theindependentonline [link] [comments]
- Dad of 2 Drinks Cranberry Juice for UTI — but Says 'World Changed Overnight' When It Turned Out to Be 'Incurable' Cancerby /u/peoplemagazine on April 28, 2025 at 3:55 pm
submitted by /u/peoplemagazine [link] [comments]
Today I Learned (TIL) You learn something new every day; what did you learn today? Submit interesting and specific facts about something that you just found out here.
- TIL in 1973, a team of twelve conservationists opened the sarcophagus of Casimir IV Jagiellon of Poland and ten of them subsequently died over the course of a few months from a fungus released from the opening of the sarcophagus. by /u/ffeinted on April 28, 2025 at 11:05 pm
- TIL 20% of the US population watched the 1978 World Series, while only 2.7% watched the 2024 World Series by /u/matthewjd24 on April 28, 2025 at 9:35 pm
- TIL that France did not adopt the Greenwich meridian as the beginning of the universal day until 1911. Even then it still refused to use the name "Greenwich", instead using the term "Paris mean time, retarded by 9 minutes and 21 seconds". by /u/EssexGuyUpNorth on April 28, 2025 at 9:28 pm
- TIL that Toyota Motor Co was originally named after its founder Toyoda, but the name was changed to Toyota because it sounds better and in Japanese characters it is 8 strokes, a lucky number, versus the 10 strokes for Toyoda. (Obviously in Japanese, not anglicized spelling) by /u/ClownfishSoup on April 28, 2025 at 8:58 pm
- TIL Connecticut has an official State Troubadour who "functions as an ambassador of music and song and promotes cultural literacy among Connecticut citizens" by /u/Remarkable-Pea4889 on April 28, 2025 at 7:17 pm
Reddit Science This community is a place to share and discuss new scientific research. Read about the latest advances in astronomy, biology, medicine, physics, social science, and more. Find and submit new publications and popular science coverage of current research.
- Dry eye disease is a growing problem in young adults, with 90% of study participants with at least one sign of the condition in their eyes. In the 18-25 age group, a major risk factor is screen use. Frequent screen breaks, regular sleep patterns, staying hydrated and having a balanced diet help. by /u/mvea on April 28, 2025 at 11:21 pm
- An ancient yeast found clinging to pots at archaeological sites in Patagonia is the same strain used to brew lagers in Bavaria some 400 years later. The yeast isn't native to Europe, so the finding hints that trade with South America facilitated the first German blonde brews in the 16th Century. by /u/amesydragon on April 28, 2025 at 9:53 pm
- Study indicates that prostate cancer can be diagnosed at an early stage through a simple urine sample by /u/nohup_me on April 28, 2025 at 6:42 pm
- Researchers found that up to 32% of dementia cases over an eight-year period could be attributed to clinically significant hearing loss, suggesting potential benefits from hearing interventions. by /u/Wagamaga on April 28, 2025 at 6:16 pm
- Infrared imaging uncovers emotional sensitivity in 10-month-old babies | Study finds measurable emotional responses to the distress of their peers, offering compelling evidence that the roots of empathy emerge within the first year of life. by /u/chrisdh79 on April 28, 2025 at 6:05 pm
Reddit Sports Sports News and Highlights from the NFL, NBA, NHL, MLB, MLS, and leagues around the world.
- Speedboat that flipped midair in 200 mph crash had crossed finish line first, thus winning the race on Arizona lake by /u/Oldtimer_2 on April 28, 2025 at 10:25 pm
- Penguins, 10-year coach Mike Sullivan mutually agree to part ways: What’s next? by /u/AlwaysBlaze_ on April 28, 2025 at 10:10 pm
- Warriors' Jimmy Butler expected to return in Game 4 vs. Rockets tonight by /u/Oldtimer_2 on April 28, 2025 at 8:55 pm
- Stephen Curry is voted the NBA's Twyman-Stokes Teammate of the Year [vote by peers] by /u/Oldtimer_2 on April 28, 2025 at 7:21 pm
- Sources: Bucks' Damian Lillard has torn left Achilles tendon by /u/Subject-Property-343 on April 28, 2025 at 6:37 pm