Deciphering the Marketing Landscape: Latest Insights & Trends for 2023

In the dynamic world of marketing, trends evolve at a breakneck speed. As consumers become more discerning and digitally connected, their preferences and behavior patterns shift, requiring marketers to stay ahead of the curve. With each passing year, some strategies solidify their ground, while others wane. Dive into our curated compilation of the latest marketing insights and trends for 2023. Whether you’re a seasoned marketer or a curious entrepreneur, these findings offer a snapshot of the changing consumer landscape and emerging marketing frontiers. Get ready to recalibrate, reimagine, and reshape your strategies!

1. The Eroding Value of “Sustainability”

Recent research on palm oil reveals a surprising trend – consumers favor products labeled as “free-from palm oil” over those stamped with “sustainably produced palm oil.” This shift stems from the overused term “sustainable,” which seems to be losing its weight in the marketplace. This raises concerns, especially as WWF emphasizes that abandoning palm oil isn’t the right solution.


2. Packaging – The Silent Salesperson

Kerry’s latest research underscores that 72% of consumers believe brands can help them reduce waste by enhancing the shelf life of food through better packaging. This trend is not isolated: Amcor’s findings align, showing a growing demand for improved packaging. Going forward, marketers must spotlight their packaging efforts more prominently.


3. Cars and Consumers: A Telling Connection

Recent data from the 2023 GWI Commerce Report showcases a peculiar trend – 40% of recent car purchasers also invested in a domestic vacation. In another intriguing find, consumers tend to make impulse purchases after physical activities. While not a new revelation, it’s worth noting for potential marketing strategies.


4. Prime Day vs. Black Friday

Amazon’s Prime Day is carving out its niche, with 4% of consumers favoring it over the traditional Black Friday. But with US consumer confidence fluctuating in October, it’ll be intriguing to monitor Amazon’s trajectory in the coming year.


5. Rethinking Boomer Representation in Ads?

Gen-Z and Millennials largely attribute their financial concerns to the Baby Boomer generation, per OnePoll data. With Gen-Z’s growing bias against Baby Boomers, marketers might need to reevaluate how this age group is represented in advertising campaigns.


6. The UK’s Growing Love for Loyalty Discounts

A significant portion of consumers in the UK is trading brand loyalty for alluring discounts. Findings from the Data & Marketing Association and American Express emphasize the importance of loyalty schemes. Given the current political and economic landscape, loyalty schemes could be the game-changer for retailers in the UK.


7. Snapshots from Other Reports:

  • A whopping $80B is lost to Ad Fraud, as per new insights from Juniper Research.
  • Mobile advertising is booming in the UK, with over 60% of companies planning to ramp up their budgets.
  • Gen-X feels overlooked in TV advertising, says Wavemaker Studio.
  • The beauty industry, take note: consumers crave educational content, says a report from Happi.
  • Italy’s consumer spending is expected to dip by approximately $3.7B, data from Ansa suggests.

Conclusion: Staying updated with the ever-evolving marketing landscape is vital for businesses to make informed decisions. From the waning trust in sustainability claims to the UK’s growing penchant for loyalty schemes, marketers need to remain agile and receptive to these shifts.


References: 

1. “I read over 100 Marketing Papers” (u/lazymentors, r/Marketing)

Podcast transcript: 

Welcome to the Djamgatech Marketing podcast, your go-to source for the latest trends and insights in the world of marketing. In today’s episode, we’ll cover the latest marketing insights and trends for 2023, including consumer preferences, improved packaging, investments in vacations, the popularity of Prime Day, generational differences, loyalty discounts, the rise of mobile ad budgets, neglected Gen-X in TV ads, the demand for educational beauty content, and the expected decrease in Italy’s consumer spending. Additionally, we’ll highlight the importance of staying updated in marketing for informed decisions on sustainability claims and UK loyalty schemes.

In the fast-paced world of marketing, trends come and go faster than you can say “advertise.” As consumers get pickier and more plugged in, their tastes and habits shift, forcing marketers to keep up with the times. Each year brings new opportunities and challenges, with some strategies becoming tried and true, while others fade into obscurity. But fear not, because we’ve got you covered. Take a deep dive into our meticulously curated collection of the freshest marketing insights and trends for 2023. Whether you’re a seasoned marketing guru or just starting out, these findings will give you a great snapshot of what’s happening in the ever-changing world of consumers and marketing. So get ready to adapt, think outside the box, and reshape your strategies to stay ahead of the game. It’s time to embrace the future!

So, let’s dive right into some interesting research findings that shed light on important consumer trends. First up, recent studies on Palm Oil reveal that consumers now prefer products labeled as “free-from palm oil” rather than those labeled as “sustainably produced palm oil.” It seems that the term “sustainable” has become so overused that it’s losing its impact in the marketplace. However, we need to be cautious about completely abandoning palm oil, as organizations like WWF emphasize. They argue that the solution lies not in abandoning palm oil, but in finding sustainable ways to produce it. Now let’s talk about the power of packaging. Kerry’s latest research shows that a whopping 72% of consumers believe that brands can help them reduce waste by improving the packaging of food and extending its shelf life. And this trend is not just limited to one study. European publication Amcor’s findings align with Kerry’s research, revealing a growing demand for better packaging. So, moving forward, marketers need to highlight their packaging efforts more prominently in order to cater to this consumer demand. Next, let’s take a look at an interesting connection between car purchases and consumer behavior. Data from the 2023 GWI Commerce Report shows that 40% of recent car purchasers also invested in a domestic vacation. This finding uncovers a possible pattern of consumers making impulse purchases following physical activities. While this may not be a groundbreaking revelation, it’s definitely worth noting for potential marketing strategies. We can’t talk about consumer trends without mentioning the impact of major shopping events.

Amazon’s Prime Day, which has gained popularity in recent years, now has 4% of consumers favoring it over the traditional Black Friday. However, with US Consumer Confidence fluctuating in October, it’ll be intriguing to see how Amazon’s trajectory plays out in the coming year. Moving on to demographics, recent data suggests that Gen-Z and Millennials have significant financial concerns that are often attributed to the Baby Boomer generation. OnePoll data reveals a growing bias among Gen-Z towards Baby Boomers. With this in mind, marketers might need to reevaluate the representation of this age group in their advertising campaigns in order to better resonate with younger consumers. Let’s now shift our focus to the UK, where loyalty discounts are gaining popularity among consumers. A significant portion of UK consumers is trading brand loyalty for attractive discounts.

The Data & Marketing Association, along with American Express, emphasizes the importance of loyalty schemes in the current political and economic landscape. It seems that loyalty schemes could be the game-changer for retailers in the UK. Now, let’s take a quick look at some snapshots from other reports: First, new insights from Juniper Research reveal that a staggering $80 billion is lost to ad fraud. This highlights the need for stricter measures to combat fraudulent advertising practices. Second, mobile advertising is booming in the UK, with over 60% of companies planning to increase their budgets in this area. This showcases the growing importance of mobile platforms in reaching targeted audiences.

Third, Wavemaker Studio reports that Gen-X feels overlooked in TV advertising. This demographic segment is seeking more representation and targeted messaging in TV ads for better engagement. Fourth, a report from Happi emphasizes that consumers in the beauty industry crave educational content. This highlights the opportunity for beauty brands to create informative and educational content to better connect with consumers. Finally, data from Ansa suggests that Italy’s consumer spending is expected to dip by approximately $3.7 billion. This indicates a potential shift in consumer behavior and purchasing power in the country. That wraps up our exploration of some recent research findings and their implications for marketers. It’s fascinating how consumer trends evolve and shape the strategies businesses need to adopt to stay relevant. Stay tuned for more insights and updates in the ever-changing world of marketing and consumer behavior.

So, here’s the thing. In today’s fast-paced world, staying on top of the latest trends and developments in marketing is absolutely crucial. Why? Well, because it allows businesses to make smart and informed decisions that can ultimately lead to success. Trust me, you don’t want to be left in the dust while your competitors are flourishing. One interesting observation that has been made is the growing skepticism around sustainability claims. Consumers are becoming more discerning and are not just going to blindly believe every green marketing message they come across. This means that businesses need to be extra careful and make sure their sustainability efforts are truly authentic and transparent. Now, let’s talk about loyalty schemes. Apparently, the UK has been going crazy for them. People just can’t seem to get enough of those reward programs and discounts. And you know what? Marketers need to take notice of this.

Loyalty schemes can be a powerful tool to not only retain existing customers but also to attract new ones. By the way, I came across some interesting resources that might pique your interest. It seems that a Redditor by the name of lazymentors has gathered a treasure trove of marketing papers from the subreddit r/Marketing. I’m talking about over 100 papers! So, if you’re looking to expand your knowledge and stay in the loop, you might want to check it out. In conclusion, my friend, the marketing landscape is constantly evolving, and it’s our job to stay agile and receptive. Trust is fading in sustainability claims, and loyalty schemes are all the rage in the UK. So, let’s keep our eyes peeled and make sure we’re on top of these shifts.

In this episode, we covered the latest marketing insights and trends for 2023, including strategies to recalibrate in the evolving consumer landscape, the importance of improved packaging, the rising popularity of Prime Day, the scale of ad fraud, and growing mobile ad budgets. Stay informed and make better decisions in marketing with our recap of the top items covered. Thank you for joining us on the Djamgatech Marketing podcast, where we delve into the latest marketing trends and provide insightful information – be sure to subscribe and stay tuned for our next episode!

Deciphering the Marketing Landscape: What are the most wanted digital marketing skills?

Data storytelling. Don’t just share data; share why it’s important and what to do with it. A big reason I got my last few jobs is being able to show that I can translate data and explain what to do with it.

It boggles my mind sometimes that many agencies don’t do this correctly. Follow the McKinsey model:

  • Data synthesis

  • Summary

  • “Why this data matters/what it means”

  • What to do with it

How can data become your best sales strategy, coupled with a strong message focusing on the user outcomes they are hiring the product/service for (jobs-to-be-done theory)? Link

Here is the TLDR for the best tips without knowing your case in more detail (feel free to read the deep dive if you want):

  • Share multiple data points but keep it focused

  • Don’t overdo it on the number of decks

  • Remember that you’ll probably have to pivot at least once

Detectives don’t solve cases off one single data point and neither should marketing decisions be made (in my humble opinion)

Deep dive:

Point number 1: 2-3 data points are enough to make a solid case (e.g., if you’re trying to show which topics/content ideas their audience resonates with, look at engagement rates on those topics across different channels). If it’s SEO, use two different tools and find the patterns. Those are the most obvious bleeds.

Point number 2: Early in my career I made the mistake of creating 50+ PowerPoint slides, which was great research, but we ended up using only 20% of that data. A huge waste of time and energy, not to mention incredibly inefficient.

Point number 3: The reality is, pivots are bound to happen unless you’re working with a team that’s super patient for a strategy to come to fruition, or unless you make the right call from the data the first time (business acumen develops as you grow in your career).

The most important skill is one where you can prove ROI. For that, I say lead gen.

Organic is:

  • “local” SEO (when you see a local company appear on the ‘map’ in search results near you)

  • regular SEO (regular search results under the map)

  • email marketing to an established email list

  • growing social media accounts

Paid is:

  • Google PPC ads

  • FB ads

  • any other… Tiktok, instagram etc.

I focus on Google PPC with Local SEO.

Pick a path and watch as much educational content on it as you can. Work for free initially. Then go wild.

SEO is highly wanted, and Google Ads and Facebook Ads are also in high demand. I chose two things to become an expert in; everything else, I just know enough to be able to do it. It also depends on where you get hired. Whatever you decide you want to do, become an expert in it, as there is a huge shortage of experts out there.

After 23 years in the industry, and in quite high demand as an independent consultant and advisor, I would say what people want is for you to solve their problems. And in digital marketing and growth, problems are very complex and multidisciplinary. OK, they want ads to run smoothly and cheaply, but you need to make the data stack solid so you track everything, you need to raise the conversion rate (which involves half a dozen tools plus the website itself), and you need to orchestrate everything to out-optimize your competitors. It is T-shaped knowledge, but with many deep knowledge areas, and an understanding of how everything interacts: how page speed increases conversions, decreases CPA on paid, and improves SEO, and how you can improve it. I think that is what is lacking in most growth agencies. They see things as silos; they hire specialists with two years of experience in PPC or SEO or whatever, but those specialists have no clue about how the rest fits in.

I think you can only gain that knowledge if you have run your own sites or web apps, from creation to monetization. That gives you a great understanding of how everything is orchestrated. And above all, you need to be able to move seamlessly between strategy, tactics, and operations, and communicate equally well with CEOs and with developers who have poor social skills.

Deciphering the Marketing Landscape: One-Minute Daily Marketing News 

Deciphering the Marketing Landscape: What Happened In Marketing October 17th 2023

  • Meta launches new formats and updates for Reels Ads.

  • Google launches new tool to manage first-party data easily.

  • Youtube launches Audio Descriptions & Pronouns for Creators.

  • FTC proposes a new bill to fight against hidden fees in Product Prices.

  • Google’s multiple security updates focused on user privacy.

  • EU warns all Social Media Apps to do better moderation of content.

TikTok 
  • TikTok partners with Disney to introduce Disney Content and Elements.

  • Update to API, allowing better Direct Posting for Third-party apps.

  • TikTok shares more facts about user data privacy.

  • TikTok expands Effect House Rewards Program to more regions.

  • New Reports about TikTok rewarding creators to pump live shopping.

Instagram & Threads
  • IG set to bring back Creator Cash Bonuses.

  • Instagram shares new tips for E-commerce shops in a Post.

  • Threads App gets new post editing and Voice notes feature.

Meta
  • Meta’s AI Chatbots are not working in the best way possible.

  • Facebook UK sales surged ahead of Ad Downturn.

  • WhatsApp testing Event Creation for Groups.

Twitter (X)
  • X aims to fight Substack; Elon says X will allow article publishing.

  • X’s efforts to launch live-streaming features are coming together.

  • Expanded Bios are live on X Desktop.

  • New features & updates to X’s Security & Content Reporting.

  • X launches new updates to Community Notes to increase reliability.

Google
  • Google SGE AI now helps to create Images and Content Drafts.

  • Google Demand Gen Ads roll out to all advertisers.

  • Disabling Third-party cookies for 1% Chrome Users.

  • Updating their Ads Policy later this month.

  • Google Search stops showing indented search results.

  • Expands access to Social Media Links for Business profiles.

Agency News
  • WFA & MediaSense launch “Future of media Agency” Report.

  • Stagwell acquires Left Field Labs, A digital Agency.

  • Publicis Groupe Posts 5.3% growth in Q3.

  • Dentsu partners with VideoAmp for Ad buying.

  • Virgin Voyages gives its Global Media Account to Hearts & Science.

  • Idris Elba’s agency launches first campaign for Sky Cinema.

  • Wavemaker & Merlin Entertainment extend their partnership.

  • GroupM Betas Walmart Retail Media Certification Program.

Brands & Ads
  • Taco Bell & Deutsch LA partner with Pete Davidson for new campaign.

  • Lloyds Banking Group appoints new CMO.

  • N26 Bank launches new global brand campaign.

  • Dr. Martens launches new brand platform “Made Strong”.

  • Netflix to open retail sites in 2025 as Brand move.

  • ASICS & City of Paris’s latest campaign launched on Mental Health Day.

  • Uber Eats launches “Never eat dirt again” campaign in Taiwan.

  • Stagwell launches Harris Quest, AI research-as-a-service tool.

AI 
  • Google assures Companies of legal coverage when using their AI Models.

  • Adobe announces AI-generated Image to Video Tool.

  • Adobe also announced new content credential tag for AI.

  • Optimizely launches new Marketing OS powered by AI.

Microsoft 
  • Microsoft launches bug bounty program to improve Bing AI.

  • Microsoft completes acquisition of Activision Blizzard.

Pinterest & Snap
  • Pinterest to announce Q3 Results on 30th Oct.

  • Pinterest partnered with Anthropologie for Holiday Season Shophouse.

  • Snap’s My AI could face a ban in the UK over child privacy concerns.

Reddit
  • Reddit launches new report on TV & Film Entertainment.

Marketing & Ad Tech
  • IAS partners with Instacart Ads to improve transparency.

  • Atlassian to buy Loom for nearly $1 Billion.

  • Inmobi launches new identity resolution tool.

  • Jetpack WordPress adds new AI updates.

  • Paramount adds iSpot as New Currency partner.

  • The Guardian unveiled new UK Ad council.

  • Yahoo’s Cookieless ID in partnership with Twilio.

  • Twitch to go through another round of layoffs.

  • New feature to Follow WordPress blog through Mastodon.

  • Twitch adds anti-harassment features to stop banned users.

What I read about Gen-Z Consumers this Month. (No Calls)

1/ 35% of Gen-Z respondents associate TikTok more with influencers, and Gen-Zers are less likely to follow influencers on non-social apps. (Report)

2/ 41% plan to start shopping by the end of October, and 37% of Gen-Z plan to spend more this season, per Shopify data.

3/ e.l.f. remains the No. 1 cosmetics brand among female teens, up 13 points Y/Y to 29%. And 90% of Gen-Zers prefer Apple products.

4/ Gen-Z doesn’t like to be called; most prefer online chat and WhatsApp to connect with friends and others, per data from The Sun.

5/ 19% of US adults aged 18-34 are actively saving in case of layoffs, compared to only 13% of older adults.

6/ Black Gen-Zers are hiding their names on job applications and being more private, new data shows.

7/ 83% of Gen-Z workers are job hoppers. (CNBC)

8/ Gen-Z wants feminine care products to be more blunt and clear in their ad copy. (NY Post)

9/ The majority of Gen-Z students trust college education, according to a new report pushing back on online gurus. (Gallup)

10/ 73% of Gen-Z Americans have changed their spending habits because of inflation: 43% now prefer to cook at home, 40% spend less on clothes, and 33% limit spending to essential shopping. (Bank of America)

11/ Gen-Zers are struggling to find third places to network and make friends. Many are paying for multiple memberships to make friends.

12/ Harvard research suggests that Gen-Z is 27% more likely to buy from sustainable brands. However, new research from Kantar points to Gen-Z distrust of sustainability advertising.

13/ Gen-Z and Millennials are making impulse purchases based on social media suggestions, new data from Bankrate shows.

Deciphering the Marketing Landscape: What Happened In Marketing October 16th 2023

TikTok
  • TikTok launches Search Ads Toggle, allowing brands to display ads in search results.

  • TikTok enhances data security and localized storage in US, Singapore, and Malaysia.

  • TikTok unveils Direct Post feature for smoother third-party platform content sharing.

Meta
  • Meta shared photos of the business onboarding steps for Meta Verified for Business

  • Instagram’s new “Avatar interactions” setting lets you control who can interact with your avatar

  • Instagram is working on a new sticker: Music Pick

  • Facebook is killing its Notes feature on Nov 13th

  • Facebook Messenger added a tab called Channels

  • Threads now showing the “Suggested for you” section in feed.

X (Twitter)
  • X rolls out new ad format that can’t be reported or blocked

  • X is working on giving streamers options on who can join their chat before the start of the stream

Google
  • Google tests generative AI in Search for creating imagery and drafting text.

  • Passkeys introduced for secure, fingerprint-based login on eBay, Uber, and WhatsApp

Others
  • Twitch update empowers streamers to block banned users from viewing their livestreams.

  • Duolingo will launch language learning lessons through Duolingo Music and Duolingo Math in the EU as well

  • CapCut added a new AI-based feature, AI model

Twitter
  • Early preview unveiled for ‘X calling’ feature.

Facebook
  • Facebook seeks feedback from Meta Verified subscribers on service quality.

  • Facebook starts showing the page name in the app header, and it sticks to the header when scrolling through the page.

TikTok
  • TikTok enables mentioning videos via audio page in user-created content.

  • TikTok update removes auto-generated captions from post, privacy settings.

  • TikTok launches AI meme generation for user-taken or selected photos.

Instagram 🔥
  • Instagram introduces option for page linking within user accounts.

  • Instagram extends account activity access to desktop platforms.

Meta
  • Meta offers business support option beyond Meta Verified service.

WhatsApp
  • WhatsApp developing date-specific message search for web client.

  • WhatsApp Web rolls out ‘Create Channel’ feature for users.

AI
  • Box unveils Box Hubs, streamlining document access with AI integration.

  • CharacterAI debuts ‘Character Group Chat’ for multi-user, multi-AI interactions.

Others
  • Mozilla teams with Fastly, Divvi Up for enhanced Firefox privacy tech.

  • Elgato introduces web Marketplace, upgrading digital assets exchange for creators.

  • Search Engine Land Awards 2023 finalists announced, winners to be revealed Oct. 17.

  • Snapchat encourages gifting Snapchat+ to friends on upcoming birthdays.

  • Spotify trials top playback controls during in-app scrolling.

I analyze over 200 headlines per week. Here’s a well-known psychological bias you can use to drive a tonne of clicks

“Harvard psychologist: 7 things the most passive-aggressive people always do—and the No. 1 way to respond”

This article is trending hard on CNBC Make It.

Sure, it’s good content.

But the headline clearly plays a huge role in its success.

Confirmation bias is a psychological effect where people seek information to validate their pre-existing beliefs.

“Please tell me I’m right”.

To effectively use confirmation bias in headlines:

– Identify behaviors your audience likely has strong beliefs or opinions about

– Write a headline that appears to confirm or challenge that belief

In this headline, passive aggression is the behavior many have encountered or been accused of.

A lot of people have pre-existing beliefs about what it looks like.

The headline suggests there are definitive behaviors that passive-aggressive people exhibit.

Readers want to know whether their own beliefs will be confirmed or challenged.

So they click to find out.

It’s brilliant.

Other psychological effects that make this headline an absolute click magnet:

Authority Bias – “Harvard psychologist”. Readers are more likely to click when a headline implies endorsement from an expert.

Social Identity Theory: People will always want to identify with certain groups (in-groups) and distance themselves from others (out-groups).

They’ll seek out content to determine which “bucket” they fall into.

Do people they know fall into the “passive-aggressive” bucket? Do they themselves fall into that bucket?

They can’t help but click to find out.

Examples from different niches:

Productivity: “The 7 App Habits Of Highly Productive People”

Pre-existing belief – Productive people do or do not use apps a certain way.

Personal Finance – “The Actual Impact Of Cutting Out Coffee On Your Savings”

Pre-existing belief – Cutting out a daily coffee will or will not have a meaningful impact on savings.

Parenting – “Does Strict Parenting Actually Lead To Academic Success?”

Pre-existing belief – Strict parenting does or does not lead to academic success.

——————————————————–

*Disclaimer* – The content needs to match the expectations set by the title.

That’s what makes a title clickworthy as opposed to clickbait.

Also, the content shouldn’t be written with the sole purpose of being provocative. It should solve real problems and provide real value.

Giving it a juicy title is just how you make sure it’s actually read and that value is delivered.

As Ogilvy says:

“On the average, five times as many people read the headline as read the body copy. When you have written your headline, you have spent eighty cents out of your dollar.”

Deciphering the Marketing Landscape: What Happened In Marketing October 01-07 2023

X is looking to launch Ad-free Premium Tier for users.

Instagram announces an option to share Instagram Stories with only a certain number of followers via lists.

Reddit expands its learning hub with new courses and updates.

Google releases the October 2023 Broad Core Update.

Deutsch New York plans to lay off about 19% of staff.

Youtube Testing New Community Notes Feed on Mobile.

DDB WorldWide names Alex Lubar as global CEO.

Snapchat announces “Phantom House”, a new activation for Halloween.

X has ruined everything for link sharing with new Link Preview UI.

VMLY&R Named Lead Creative Agency for World of Hyatt.

GA4 adds new features to improve data security and report accuracy.

BeReal launches a new global campaign, trying to win back attention.

Meta rolls out AI Tools for Advertisers.

X is testing a new Ad format that you can’t report or fight back against.

M&S Appoints Mother as Creative Agency for UK Business.

Non-alcohol brands are testing Sober October campaigns; Ritual is the biggest one so far.

Netflix global Ad president departs after 13 months. Now, Amy Reinhard is the new Ad President.

MullenLowe retains the US military recruiting marketing account, worth $450M.

US ad employment grew by 3K jobs in Sep 2023.

Google’s October 2023 Spam Update also launched.

IG testing ad carousels tagged “you might like”, with 5 different ads side by side.

Watched 8 hours of MrBeast’s content. Here are 7 psychological strategies he’s used to get 34 billion views

MrBeast can fill giant stadiums and launch 8-figure candy companies on demand.

He’s unbelievably popular.

Recently, I listened to the brilliant marketer Phil Agnew being interviewed on the Creator Science podcast.

The episode focused on how MrBeast’s near-academic understanding of audience psychology is the key to his success.

Better than anyone, MrBeast knows how to get you:

– Click on his content (increase his click-through rate)

– Get you to stick around (increase his retention rate)

He gets you to click by using irresistible thumbnails and headlines.

I watched 8 hours of his content.

To build upon Phil Agnew’s work, I made a list of 7 psychological effects and biases he’s consistently used to write headlines that get clicked into oblivion.

Even the most aggressively “anti-clickbait” purists out there would benefit from learning the psychology of why people choose to click on some content over others.

Ultimately, if you don’t get the click, it really doesn’t matter how good your content is.

1. Novelty Effect

MrBeast Headline: “I Put 100 Million Orbeez In My Friend’s Backyard”

MrBeast often presents something so out of the ordinary that they have no choice but to click and find out more.

That’s the “novelty effect” at play.

Our brain’s reward system is engaged when we encounter something new.

You’ll notice that the headline examples you see in this list are extreme.

MrBeast takes things to the extreme.

You don’t have to.

Here’s your takeaway:

Consider breaking the reader/viewer’s scrolling pattern by adding some novelty to your headlines.

How?

Here are two ways:

  1. Find the unique angle in your content

  2. Find an unusual character in your content

Examples:

“How Moonlight Walks Skyrocketed My Productivity”.

“Meet the Artist Who Paints With Wine and Chocolate.”

Headlines like these catch the eye without requiring 100 million Orbeez.

2. Costly Signaling

MrBeast Headline: “Last To Leave $800,000 Island Keeps It”

Here’s the 3-step click-through process at play here:

  1. MrBeast lets you know he’s invested a very significant amount of time and money into his content.

  2. This signals to whoever reads the headline that it’s probably valuable and worth their time.

  3. They click to find out more.

Costly signaling is all about showcasing what you’ve invested into the content.

The higher the stakes, the more valuable the content will seem.

In this example, the $800,000 island he’s giving away just screams “This is worth your time!”

Again, they don’t need to be this extreme.

Here are two examples with a little more subtlety:

“I built a full-scale botanical garden in my backyard”.

“I used only vintage cookware from the 1800s for a week”.

Not too extreme, but not too subtle either.

3. Numerical Precision

MrBeast knows that using precise numbers in headlines just works.

Almost all of his most popular videos use headlines that contain a specific number.

“Going Through The Same Drive Thru 1,000 Times”

“$456,000 Squid Game In Real Life!”

Yes, these headlines also use costly signaling.

But there’s more to it than that.

Precise numbers are tangible.

They catch our eye, pique our curiosity, and add a sense of authenticity.

“The concreteness effect”:

Specific, concrete information is more likely to be remembered than abstract, intangible information.

“I went through the same drive thru 1000 times” is more impactful than “I went through the same drive thru countless times”.

4. Contrast

MrBeast Headline: “$1 vs $1,000,000 Hotel Room!”

Our brains are drawn to stark contrasts and MrBeast knows it.

His headlines often pit two extremes against each other.

It instantly creates a mental image of both scenarios.

You’re not just curious about what a $1,000,000 hotel room looks like.

You’re also wondering how it could possibly compare to a $1 room.

Was the difference wildly significant?

Was it actually not as significant as you’d think?

It increases the audience’s *curiosity gap* enough to get them to click and find out more.

Here are a few ways you could use contrast in your headlines effectively:

  1. Transformational Content:

“From $200 to a $100M Empire – How A Small Town Accountant Took On Silicon Valley”

Here you’re contrasting different states or conditions of a single subject.

Transformation stories and before-and-after scenarios.

You’ve got the added benefit of people being drawn to aspirational/inspirational stories.

  2. Direct Comparison:

“Local Diner Vs Gourmet Bistro – Where Does The Best Comfort Food Lie?”

5. Nostalgia

MrBeast Headline: “I Built Willy Wonka’s Chocolate Factory!”

Nostalgia is a longing for the past.

It’s often triggered by sensory stimuli – smells, songs, images, etc.

It can feel comforting and positive, but sometimes bittersweet.

Nostalgia can provide emotional comfort, identity reinforcement, and even social connection.

People are drawn to it and MrBeast has it down to a tee.

He created a fantasy world most people on this planet came across at some point in their childhood.

While the headline does play on costly signaling here as well, nostalgia does help to clinch the click and get the view.

Subtle examples of nostalgia at play:

“How this [old school cartoon] is shaping new age animation”.

“[Your favorite childhood books] are getting major movie deals”.

6. Morbid Curiosity

MrBeast Headline: “Surviving 24 Hours Straight In The Bermuda Triangle”

People are drawn to the macabre and the dangerous.

Morbid curiosity explains why you’re drawn to situations that are disturbing, frightening, or gruesome.

It’s that tension between wanting to avoid harm and the irresistible desire to know about it.

It’s a peculiar aspect of human psychology and viral content marketers take full advantage of it.

The Bermuda Triangle is practically synonymous with danger.

The headline suggests a pretty extreme encounter with it, so we click to find out more.

7. FOMO And Urgency

MrBeast Headline: “Last To Leave $800,000 Island Keeps It”

“FOMO”: the worry that others may be having fulfilling experiences that you’re absent from.

Marketers leverage FOMO to drive immediate action – clicking, subscribing, purchasing, etc.

The action is driven by the notion that delay could result in missing out on an exciting opportunity or event.

You could argue that MrBeast uses FOMO and urgency in all of his headlines.

They work under the notion that a delay in clicking could result in missing out on an exciting opportunity or event.

MrBeast’s time-sensitive challenges, exclusive opportunities, and high-stakes competitions all generate a sense of urgency.

People feel compelled to watch immediately for fear of missing out on the outcome or being left behind in conversations about the content.

Creators, writers, and marketers can tap into FOMO with their headlines without being so extreme.

“The Hidden Parisian Cafe To Visit Before The Crowds Do”

“How [Tech Innovation] Will Soon Change [Industry] For Good”

(Yep, FOMO and urgency are primarily responsible for the proliferation of AI-related headlines these days).

Why This All Matters

If you don’t have content you need people to consume, it probably doesn’t!

But if any aspect of your online business would benefit from people clicking on things more, it probably does.

“Yes, because we all need more clickbait in this world – *eye-roll emoji*” – Disgruntled Redditor

I never really understood this comment but I seem to get it pretty often.

My stance is this:

If the content delivers what the headline promises, it shouldn’t be labeled clickbait.

I wouldn’t call MrBeast’s content clickbait.

The fact is that linguistic techniques can be used to drive people to consume some content over others.

You don’t need to take things to the extremes that MrBeast does to make use of his headline techniques.

If content doesn’t get clicked, it won’t be read, viewed, or listened to – no matter how brilliant the content might be.

While “clickbait” content isn’t a good thing, we can all learn a thing or two from how these headlines generate attention in an increasingly noisy digital world.

Little trick on how I use Quora to grow my business

This really doesn’t cost a lot of time and can be helpful for every business.

In order to leverage Quora effectively for your business, you need relevant questions to answer in the best possible way.

This can be tedious and a lot of work, while your answers can get buried quickly. To maximize the impact, I use this approach:

Look for Quora questions with many views but few answers.

Type in Google:

site:quora.com keyword “1 answer” “k views”

For example, I founded Simple Analytics, a GA4 alternative. So I’m interested in keywords like Google Analytics, Ga4, privacy-friendly analytics etc:

site:quora.com google analytics “1 answer” “k views”

It will find questions related to your keyword with just one answer but with many views (you can play around with the variables here)

But this is essentially where you want to be! Now provide a thoughtful answer and even mention your business if it fits the context. You’ll be the top rated answer and get many views.
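
If you track several keywords, a small script can generate these Google queries in one batch. Here is a minimal sketch in Python; the keyword list just reuses the example above, and the "1 answer" / "k views" strings are the same filters you would type by hand:

from urllib.parse import quote_plus

# Keywords from the example above - swap in your own niche terms.
keywords = ["google analytics", "ga4", "privacy-friendly analytics"]

for kw in keywords:
    # Few answers ("1 answer") but thousands of views ("k views") = underserved question.
    query = f'site:quora.com {kw} "1 answer" "k views"'
    print("https://www.google.com/search?q=" + quote_plus(query))

Paste any of the printed URLs into your browser (or swap print for webbrowser.open from the standard library) and you land straight on the filtered results.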

Deciphering the Marketing Landscape: Latest News

  • Outsourcing Content Creation
    by /u/TheBoyAsh (Marketing & Advertising) on April 23, 2024 at 8:47 pm

    Hey, how do I outsource content creation for my business and get quality content for a good price? More than just Upwork or Fiverr, and ideally I don't want to hire a full-time employee quite yet. Are there AI tools that can help you make pretty decent quality content? Just thinking out loud. Please let me know.

  • PMax Keyword search themes
    by /u/wildbill_89 (Ads on Google, Meta, Microsoft, etc.) on April 23, 2024 at 8:36 pm

    Quick question: when creating a PMax campaign and you add in the "search keyword themes", can these be used effectively as stand-alone targeting? Or must they be used in tandem with the audience signals?

  • Revamping Product Types with SEO-Driven Taxonomy: Does It Really Boost PPC Performance?
    by /u/Madpotatolul (Ads on Google, Meta, Microsoft, etc.) on April 23, 2024 at 8:33 pm

    Hi there, fellow PPC strategists, I have a question about the "product_type" product specification. I know what it is and what it does, but there’s a tip a Google Ads Strategist gave me at my previous agency: include relevant and often-searched terms. For example: A typical taxonomy for "product_type" might look like this: Men > T-Shirt > High Neck T-Shirt > Blue T-Shirt. However, if we include relevant search terms for the product, it would look more like this: Men's High T-Shirt > Blue Men High Neck T-Shirt > BrandName Men T-Shirt > Tencel Men T-Shirt > High Neck T-shirt for Men > Fitted Shirt for Men > Soft Men High Neckline T-Shirt > Machine Washable Men T-Shirt > Premium Men T-Shirt > Warm Weather Shirt for Men > Made in USA Men T-Shirt > Sustainable Men T-Shirt. I've implemented this strategy for two product feeds and I can't really tell for sure if it makes a difference. Any thoughts?

  • Pricing for Events and Outreach
    by /u/screamingcabbages (Marketing & Advertising) on April 23, 2024 at 8:25 pm

    Hello, all! Client pricing question here. I am specifically referring to the brand ambassador experiential marketing industry. I am not talking about social media ambassadors, but temporary staff who help represent a brand at events, or when a CPG brand outsources to a staffing agency to run in-store demo campaigns for them. These are the folks you see at professional conferences or concerts at branded booths who tell you they don’t actually work for the company when you ask them. Or the food samplers in Whole Foods. Another example is the LG booth at CES: most of those staff every year are brand ambassadors, not LG employees. The question: how much are agencies charging companies for these BA/event staffing services? If you ever worked as a brand manager with a larger company that outsourced staff for some events, do you recall how much you paid the agency for that service? Hope this is the right sub to ask this in, and I humbly request a little kindness in the replies. Thank you.

  • Meta Ads: What goes in the Tax ID field for Canadian business?
    by /u/Educational-Vast9943 (Ads on Google, Meta, Microsoft, etc.) on April 23, 2024 at 8:19 pm

    Hi all, I am wondering what goes in the Tax ID section when setting up a Meta Business Manager for Facebook/IG ads? Is it a GST/HST number or something else? Is it still necessary to run ads? Any help appreciated, thank you all!

  • Google Ads - Enhanced CPC Bid Set
    by /u/tonycoinshey (Ads on Google, Meta, Microsoft, etc.) on April 23, 2024 at 8:17 pm

    If you use enhanced CPC, what do you typically do to set your bids? I have always had doubts about the estimated bids for appearing at the top. I know this depends on budget and goals, but do you typically like to raise to top of page with a cap and then manually push certain ad groups up further? Do you do blanket coverage across the board to keep things simple and manually adjust certain ad groups from there? Do you manually try to adjust ad groups and keywords as much as possible, knowing bandwidth? Just curious about approaches given budgets, simplicity, and bandwidth. For context, I'm using enhanced CPC due to a niche B2B industry and a newer account with little conversion history. Thanks!

  • Google ads for in store visits
    by /u/Stunning-Cat-5471 (Ads on Google, Meta, Microsoft, etc.) on April 23, 2024 at 8:15 pm

    We have a website with our products and the ability to buy/order them, but they are very niche and expensive, so 80% of all of our customers are in-store visits and the others are emails or calls. What would be the best way/ad type to drive in-store visits for local customers?

  • Google Trends Help!
    by /u/OllieC20 (Marketing & Advertising) on April 23, 2024 at 7:53 pm

    I work for a small dealership looking to start buying/selling Teslas. Why is it that the ranking for used Teslas and electric vehicles shows as quite high, around 75 per week, but when you add Tesla to the same list, it then removes all the searches for used Teslas and used EVs? Is it because ‘used Tesla’ has ‘Tesla’ inside of it, so it adds that into the Tesla search, or is it because ‘used Tesla’ alone will include any search of the word Tesla? Hoping someone can help!

  • Best Ways to Market to Service Businesses (Roofers, Lighting, Plumbers, Solar, etc)?
    by /u/brvhbrvh (Marketing & Advertising) on April 23, 2024 at 7:41 pm

    I just started working for a company that’s targeting service businesses, and I’m trying to figure out the best acquisition channels for them. Here’s what I have so far: Google Ads (long-tail keywords related to the service this company provides), cold email, trade association marketing (newsletter placements, webinars, etc.), and webinars. Is there anything I’m missing here that works well for this ICP? If anyone has experience targeting these types of customers I would love to hear about it. Any help is greatly appreciated! Thanks!

  • I need help with what bidding strategy to use? Can somebody here help a newbie?
    by /u/myyouthisyourz (Ads on Google, Meta, Microsoft, etc.) on April 23, 2024 at 7:13 pm

    Hey everyone, I need help with my bidding strategy for a new search ad campaign for a law client. I've consumed a lot of content already, but I still can't wrap my head around it. His budget's not huge, so I'm thinking of starting with manual CPC and then maybe switching to Max Conversions later. My last campaign got some conversions with Max Clicks, but also a lot of irrelevant clicks/spam. Any suggestions for when Max Clicks doesn't work well? Also, do you think Google will show our ads if we use manual CPC instead of automated strategies? I just redesigned their entire website, so I want to start fresh with the ads.

  • Optimize Facebook Ads Lead campaign for VALUE?
    by /u/NaiveAd2092 (Ads on Google, Meta, Microsoft, etc.) on April 23, 2024 at 7:12 pm

    Hi, I'm running Facebook Ads for a lead generation campaign at a budget of $50 a day. I realized that my lead quality is quite low, and in order to cherry-pick the higher-quality conversions I came up with an internal tagging system that reports a value back to Facebook (let's say $100, $500 or $1,000 per lead, depending on their responses). However, I now realize that I can't optimize a lead campaign for conversion value. Is this only possible via Custom Conversions, or do I have to take the route of making this conversion a Purchase conversion, effectively, and then assigning the values properly - essentially just calling the lead conversions "purchases"? Thank you!

  • Budget Recommendation for an 8-Van Plumbing Company
    by /u/Beneficial-Rough-158 (Ads on Google, Meta, Microsoft, etc.) on April 23, 2024 at 7:07 pm

    I am the marketing director and customer service manager for a plumbing company. We recently added 3 more technicians to our fleet. What would you recommend for ad spend per month? We have an 89% booking rate and our technicians run 3 calls per day. We are currently converting at about 75%.

  • Getting into PPC with a computer science degree
    by /u/Vallren (Ads on Google, Meta, Microsoft, etc.) on April 23, 2024 at 6:51 pm

    After graduating uni with a compsci degree I've been looking for work and have seen a position for a trainee PPC executive. I have a basic understanding of what PPC is, but would my degree have any relevance or transferable skills?

  • Ad groups are not running; why?
    by /u/Vbort44 (Ads on Google, Meta, Microsoft, etc.) on April 23, 2024 at 6:48 pm

    I don't get it. The campaign is on, ads are approved, keywords are approved... only the top one is running. Please take a look at the image here. Any thoughts?

  • I just got an email saying "welcome to auto apply recommendations"
    by /u/Examiner7 (Ads on Google, Meta, Microsoft, etc.) on April 23, 2024 at 6:46 pm

    "Welcome to auto apply recommendations This feature has been activated, so your selected Google Ads recommendations are being applied to your account." I got one for both of my ad accounts. I didn't sign up for this. Does Google just randomly automatically enroll you in this garbage? I guess my chore today will be undoing whatever evil they are doing in my ad accounts. submitted by /u/Examiner7 [link] [comments]

  • High Click-Thru-Rates are a vanity metric
    by /u/FireYourAgency (Marketing & Advertising) on April 23, 2024 at 6:30 pm

    This isn’t clickbait; after 7 years of working both at agencies and in-house, with big names like FitBit, Snowflake, AWS, Microsoft, and Marketo, focusing mainly on SaaS and PaaS sectors, I truly believe this. We usually look at Click-Through Rates (CTRs) to show our campaigns are doing well. If CTRs go up, we think that's good; if they go down, we think that's bad. However, I’m here to tell you that a decrease in CTR isn’t always something to worry about, and more and more it’s becoming a good thing. First off, it's a good idea to report CTRs as a range, especially if you’re giving updates every two weeks. CTRs change all the time, and it’s important for clients to know this. But honestly, trying too hard to raise CTRs is often just a waste of time. But why is it a good thing for CTRs to drop? Because of bot traffic. It’s much worse than most people think. Some estimates now say bots make up about 50% of internet traffic, which is a big jump from 30% just a few years back. These bots aren’t just lurking; they're actively clicking on ads, even on well-known, reputable sites. If you’re curious, there are tools like Fou Analytics that help detect bots, and they’re free for websites with less than a million clicks a year. I’d highly recommend looking. Here’s a real-world example where we actively chose to lower CTR: We used to run LinkedIn ads with a feature called the LinkedIn audience network. (Not audience expansion, this is a different setting.) Looking at referral traffic we soon learned it would place our ads on low-quality mobile gaming apps. Not always ideal, but it seemed great because it boosted our CTR and click volume. But when we tracked who these clicks were coming from, we found out 99% were bots. By turning this feature off, we actually got a lower CTR but a much higher quality of traffic and made our marketing dollars go further. Another issue with CTRs is on Google Display, where some site owners use bots to click on their ads. These bots end up in your remarketing funnels, getting targeted again and again, which just wastes your budget. So, even though a high CTR might look good, it doesn’t really mean much if those clicks aren’t turning into real business. It’s essential to look beyond the clicks to see where your marketing money is really going, especially to make sure it’s not just feeding bots. In the end, a high CTR doesn’t necessarily mean you're succeeding. Real success comes from actual engagement and sales, not just clicks. CTRs might help compare different ads in a straightforward A/B test, but focusing too much on average CTR can be misleading. That’s why it’s important to go deeper into the analytics to really understand and improve your marketing effectiveness. TL;DR: bots click ads, humans mostly don’t. High CTR is probably inflated by bots and not a sign of a great ad.

  • Which campaign objective would you choose for this promo? (Meta)
    by /u/FightOn_7256 (Ads on Google, Meta, Microsoft, etc.) on April 23, 2024 at 6:22 pm

    I have a client that's running a promotion. Let's pretend it's a tanning salon and the promo is a $10 monthly membership (normal price: $25). There's no trackable conversion event on the website, since the vast majority of customers sign up in person; and from an operations standpoint, they're not able to track code redemptions or "how did you hear about this?" responses. If the goal is to drive on-site memberships, which campaign objective would you choose? Here are some things I've considered: If I had a trackable conversion, I'd go with a Conversions objective (but I don't have one). I know that an Awareness campaign will result in a much lower cost-per-1000-reached compared to a Traffic campaign, but CPCs will also be a lot higher. A Traffic campaign would at least give the algorithm some data to optimize for; but it might also just go after a more expensive, "FB clickers" audience; and the value of a website session is unclear, since there's no online conversion. I could run an Ad Engagement campaign. But Meta counts 3-sec video views as engagements, and I wonder if I'd be better off just running an Awareness campaign at that point. I've always been really hesitant to run Awareness campaigns. But given the parameters, I wonder if Awareness might be my best bet, since I'm trying to "get the word out" vs. drive online conversions or generate leads. I know the "right" answer is to conduct a lift test - but that's not currently in scope.

  • Marketing Tracking Numbers - how many is too many?
    by /u/13_letters (Ads on Google, Meta, Microsoft, etc.) on April 23, 2024 at 5:51 pm

    I’m trying to get feedback on managing too many tracking numbers for marketing campaigns. We have about 50 retail sites nationwide and roughly 1,300 tracking numbers we utilize in our phone system, with that number slowly growing. Is this normal? Are DNIs a best practice at this scale? What are some industry-leading solutions to manage this ecosystem? Thank you in advance for any feedback.

  • Transitioning into the career
    by /u/Cheers-lol (Marketing & Advertising) on April 23, 2024 at 5:50 pm

    Hello, looking for some advice and maybe even someone to relate to. I have been trying to transition into a marketing career from education (substitute teaching) since January. I completed a digital marketing certification through one of the universities in my city (EdX) and graduated at the end of that month. Since then, I have had absolutely no luck. No interviews, all rejections, even from positions that are incredibly low-paying and “open rank” so they’ll consider people with no experience. I will literally take whatever I can get lmao. Every time I apply for jobs I follow up with a friendly LinkedIn message or email to recruiters; some reply, most don’t. I got on freelancing websites like Upwork to apply for gigs just to get some experience - nothing once again. I’m just at a loss. I have a bachelor’s in psychology and social work as well, and years of work experience, just not in marketing. Has anyone successfully pivoted, specifically in this job market, and how did you do it? Thanks

What is machine learning and how does Netflix use it for its recommendation engine?

What is an online recommendation engine?

Think about examples of machine learning you may have encountered in the past, such as a website like Netflix that recommends what video you may be interested in watching next.
Are the recommendations ever wrong or unfair? We will give an example and explain how this could be addressed.

2023 AWS Certified Machine Learning Specialty (MLS-C01) Practice Exams
2023 AWS Certified Machine Learning Specialty (MLS-C01) Practice Exams

Machine learning is a field of artificial intelligence that Netflix uses to create its recommendation algorithm. The goal of machine learning is to teach computers to learn from data and make predictions based on that data. To do this, Netflix employs Machine Learning Engineers, Data Scientists, and software developers to design and build algorithms that can automatically improve over time. The Netflix recommendations engine is just one example of how machine learning can be used to improve the user experience. By understanding what users watch and why, the recommendations engine can provide tailored suggestions that help users find new shows and movies to enjoy. Machine learning is also used for other Netflix features, such as predicting which shows a user might be interested in watching next, or detecting inappropriate content. In a world where data is becoming increasingly important, machine learning will continue to play a vital role in helping Netflix deliver a great experience to its users.

What is machine learning and how does Netflix use it for its recommendation engine?
What is machine learning and how does Netflix use it for its recommendation engine?

Netflix’s recommendation engine is one of the company’s most valuable assets. By using machine learning, Netflix is able to constantly improve its recommendations for each individual user.

Machine learning engineers, data scientists, and developers work together to build and improve the recommendation engine.

  • They start by collecting data on what users watch and how they interact with the Netflix interface.
  • This data is then used to train machine learning models.
  • The models are constantly being tweaked and improved by the team of engineers.
  • The goal is to make sure that each user sees recommendations that are highly relevant to their interests.

Thanks to the work of the team, Netflix’s recommendation engine is constantly getting better at understanding each individual user.

How Does It Work?

In short, Netflix’s recommendation algorithm looks at what you’ve watched in the past and then makes recommendations based on that data. But of course, it’s a bit more complicated than that. The algorithm also looks at data from other users with similar watching habits to yours. This allows Netflix to give you more tailored recommendations.

For example, say you’re a big fan of Friends (who isn’t?). The algorithm knows that a lot of Friends fans also like shows like Cheers, Seinfeld, and The Office. So, if you’re ever feeling nostalgic and in the mood for a sitcom marathon, Netflix will be there to help you out.

But That’s Not All…

Not only does the algorithm take into account what you’ve watched in the past, but it also looks at what you’re currently watching. For example, let’s say you’re halfway through Season 2 of Breaking Bad and you decide to take a break for a few days. When you come back and finish Season 2, the algorithm knows that you’re now interested in similar shows like Dexter and The Wire. And voila! Those shows will now be recommended to you.

Of course, the algorithm isn’t perfect. There are always going to be times when it recommends a show or movie that just doesn’t interest you. But hey, that’s why they have the “thumbs up/thumbs down” feature. Just give those shows the old thumbs down and never think about them again! Problem solved.

Another angle :

When it comes to TV and movie recommendations, there are two main types of data that are being collected and analyzed:

1) demographic data

2) viewing data.

Get 20% off Google Google Workspace (Google Meet) Standard Plan with  the following codes: 96DRHDRA9J7GTN6
Get 20% off Google Workspace (Google Meet)  Business Plan (AMERICAS) with  the following codes:  C37HCAQRVR7JTFK Get 20% off Google Workspace (Google Meet) Business Plan (AMERICAS): M9HNXHX3WC9H7YE (Email us for more codes)

Active Anti-Aging Eye Gel, Reduces Dark Circles, Puffy Eyes, Crow's Feet and Fine Lines & Wrinkles, Packed with Hyaluronic Acid & Age Defying Botanicals

Demographic data is information like your age, gender, location, etc. This data is generally used to group people with similar interests together so that they can be served more targeted recommendations. For example, if you’re a 25-year-old female living in Los Angeles, you might be grouped together with other 25-year-old females living in Los Angeles who have similar viewing habits as you.

Viewing data is exactly what it sounds like—it’s information on what TV shows and movies you’ve watched in the past. This data is used to identify patterns in your viewing habits so that the algorithm can make better recommendations on what you might want to watch next. For example, if you’ve watched a lot of romantic comedies in the past, the algorithm might recommend other romantic comedies that you might like based on those patterns.

Are the Recommendations Ever Wrong or Unfair?
Yes and no. The fact of the matter is that no algorithm is perfect—there will always be some error involved. However, these errors are usually minor and don’t have a major impact on our lives. In fact, we often don’t even notice them!


AI Unraveled: Demystifying Frequently Asked Questions on Artificial Intelligence (OpenAI, ChatGPT, Google Bard, Generative AI, Discriminative AI, xAI, LLMs, GPUs, Machine Learning, NLP, Promp Engineering)

The bigger issue with machine learning isn’t inaccuracy; it’s bias. Because algorithms are designed by humans, they often contain human biases that can seep into the recommendations they make. For example, a recent study found that Amazon’s algorithms were biased against women authors because the majority of book purchases on the site were made by men. As a result, Amazon’s algorithms were more likely to recommend books written by men over books written by women—regardless of quality or popularity.

These sorts of biases can have major impacts on our lives because they can dictate what we see and don’t see online. If we’re only seeing content that reflects our own biases back at us, we’re not getting a well-rounded view of the world—and that can have serious implications for both our personal lives and society as a whole.

One of the benefits of machine learning is that it can help us make better decisions. For example, if you’re trying to decide what movie to watch on Netflix, the site will use your past viewing history to recommend movies that you might like. This is possible because machine learning algorithms are able to identify patterns in data.

Another benefit of machine learning is that it can help us automate tasks. For example, if you’re a cashier and have to scan the barcodes of the items someone is buying, a machine learning algorithm can be used to automatically scan the barcodes and calculate the total cost of the purchase. This can save time and increase efficiency.

The Consequences of Machine Learning

While machine learning can be beneficial, there are also some potential consequences that should be considered. One consequence is that machine learning algorithms can perpetuate bias. For example, if you’re using a machine learning algorithm to recommend movies to people on Netflix, the algorithm might only recommend movies that are similar to ones that people have already watched. This could lead to people only watching movies that confirm their existing beliefs instead of challenged them.

Another consequence of machine learning is that it can be difficult to understand how the algorithms work. This is because the algorithms are usually created by trained experts and then fine-tuned through trial and error. As a result, regular people often don’t know how or why certain decisions are being made by machines. This lack of transparency can lead to mistrust and frustration.

What is Problem Formulation in Machine Learning and Top 4 examples of Problem Formulation in Machine Learning?

If you are looking for an all-in-one solution to help you prepare for the AWS Cloud Practitioner Certification Exam, look no further than this AWS Cloud Practitioner CCP CLF-C02 book

What are some good datasets for Data Science and Machine Learning?

This scene in the Black Panther trailer, is it T’Challa’s funeral?

r/marvelstudios - This scene in the Black Panther trailer, is it T’Challa’s funeral?

Recommended New Netflix  Movies 2022

  • What percentage of contents on Netflix Prime are in HDR?
    by /u/My_Bellstone (Netflix) on April 23, 2024 at 7:21 pm

    As basic users we only get access on HD contents, I am thinking to upgrade to Prime but just curious what percentage on Netflix Prime are 4K and HDR? are they HDR10+ or DV? submitted by /u/My_Bellstone [link] [comments]

  • ‘Wednesday’ Season 2 Casts Thandiwe Newton (EXCLUSIVE)
    by /u/mapleer (Netflix) on April 23, 2024 at 7:05 pm

    submitted by /u/mapleer [link] [comments]

  • Does Hollywood have a pain problem? A study of Netflix finds that depictions of pain in TV and movies could be reinforcing stereotypes
    by /u/NGNResearch (Netflix) on April 23, 2024 at 7:01 pm

    submitted by /u/NGNResearch [link] [comments]

  • Never gonna use netfix again evet.
    by /u/Short_Ad6649 (Netflix) on April 23, 2024 at 6:32 pm

    I was watching The Alienist on netflix and watched tille episode 7 when it suddently stopped and told me that I am using VPN, and now I cannot watch it. show me the image below when visiting this page from google link and when searching for the show it not there. What happenned I don't care. just never going to use this ever service again. https://preview.redd.it/bd8iev5gv9wc1.png?width=1919&format=png&auto=webp&s=f52cd885f96f3a715644415cceaebb7e49c5d0d0 submitted by /u/Short_Ad6649 [link] [comments]

  • ROBOT DREAMS - Official Trailer - In Theaters May 31
    by /u/jdcmopwjdmw (Movie News and Discussion) on April 23, 2024 at 6:21 pm

    submitted by /u/jdcmopwjdmw [link] [comments]

  • A question about Exam (2009)
    by /u/flubbajubba2 (Movie News and Discussion) on April 23, 2024 at 5:50 pm

    So I just watched Exam and I really liked it. Looking at some reviews it turned out to be pretty divisive but I thought it was a very solid movie through and through. However, there's one thing that I'm rather confused about, which I'll explain to you right now. So, in the movie, the characters are given the exam where they don't even know what the question is, and they are trying to figure it out. They start discussing the company, and they say that they don't actually know what job they're applying for just that it is for a company so great that this has to be one of the best jobs in the world. Then shortly later they deduce that it is not the company they thought it was, and it's actually a company called Biorg which created a suppressant for a virus that, a decade earlier, killed millions of people. Here's my question: when they deduce that they're actually applying for a job at Biorg, the way they talk it's like they didn't know that, so what company did they think they were applying for? submitted by /u/flubbajubba2 [link] [comments]

  • Movies that would've worked better as TV shows?
    by /u/RVarki (Movie News and Discussion) on April 23, 2024 at 5:34 pm

    I was watching the surprisingly sweet 2010 Alice Eve/Jay Baruchel movie "She's Out of My League", and I kept thinking how much better the film would've worked as a sitcom. It had 2 leads with great chemistry, funny and distinct supporting characters, and a solid central theme. But because it rarely deviated from it's central plot, and didn't rely on any sub-genre conceits, the flick ended up feeling too slight. A traditional sitcom structure would've let the concept breathe, and allowed the characters to develop So what are some films that you've watched, that immediately made you want them as TV shows instead? submitted by /u/RVarki [link] [comments]

  • Is 3 body problem worth watching? I have completed till eps 3 but I kinda find boring in between. Does it get more interesting in the next eps?
    by /u/GloomyButterscotch70 (Netflix) on April 23, 2024 at 5:02 pm

    I recently started watching this series after seeing short clips on Instagram reels. But idk why, I don’t find that much interesting for now. I know some series gets more interesting in between or after some eps. Does it get any better after 3 eps? submitted by /u/GloomyButterscotch70 [link] [comments]

  • Stalker Shows, they bother me…
    by /u/Red-Rolodex (Netflix) on April 23, 2024 at 4:58 pm

    Okay I know some people might think “well don’t watch them!” But they literally make me feel like I’m choking when I watch them. I love horror movies, I’ve become desensitized to gore, running from creatures, all of that. But when it comes to shows like Baby Reindeer or You… I can’t do it. I want to get into them but there’s something about them that makes me want to hurl. I can feel the emotions and their psycho energy better than any CSI or Criminal Minds episode I’ve ever seen. Wondering if others feel the same way. submitted by /u/Red-Rolodex [link] [comments]

  • Rebel Moon The Scargiver - Is Bad DLC
    by /u/DarkhourX (Netflix) on April 23, 2024 at 4:36 pm

    submitted by /u/DarkhourX [link] [comments]

  • Baby Reindeer paints a petrifying picture of stalking. What’s being done?
    by /u/saffron-rice-amazing (Netflix) on April 23, 2024 at 4:34 pm

    https://metro.co.uk/2024/04/22/new-stalking-guidance-2024-rule-changes-spos-explained-20693719/ submitted by /u/saffron-rice-amazing [link] [comments]

  • What's the inverse of Plot Armor?
    by /u/UF1977 (Movie News and Discussion) on April 23, 2024 at 4:11 pm

    Plot armor: a plot device wherein a fictional character is preserved from harm due to their necessity for the plot to proceed*.* So what's the opposite of that? Wherein otherwise competent protagonist(s) suffer repeated unexplained bursts of incompetence, fundamental skills lapse, or forget they can do certain things, for the sole purpose of allowing the plot to move along. I finally got around to watching the first season of Jack Ryan on Amazon and it occurred to me that every single element of the Big Bad's plot succeeding completely hinges upon the utter incompetence of hundreds of professionals from dozens of US agencies, including them making inexplicably dumb decisions that are actually significantly harder than other choices they had available to them (never mind what they'd actually do IRL). If any one of them had said, "Hey wait, maybe we shouldn't do that, because it's stupid," the plot screeches to a halt. Maybe another example is the recent Top Gun sequel. Instead of using B-2 stealth bombers with bunker-busting weapons specifically designed for targets exactly like the one in question...or even using the whole Navy carrier air wing that is right there on the ship with them, including aircraft whose entire purpose is to jam and attack SAMs and their guidance radars...they instead go with a bare-bones strike package which has statistically no chance of success and will get absolutely clobbered with missiles...because it looks cooler. submitted by /u/UF1977 [link] [comments]

  • BLINK TWICE | Official Trailer
    by /u/plustom (Movie News and Discussion) on April 23, 2024 at 4:05 pm

    submitted by /u/plustom [link] [comments]

  • Movies with unsettling scores/vocals?
    by /u/latrallyidk (Movie News and Discussion) on April 23, 2024 at 3:56 pm

    I'm looking for films with very unsettling yet human-sounding (?) scores incorporating either vocalization or some level of organic/human sounds. The only thing coming to mind for me is Midsommar, I really like all of the dissonant humming/sounds, though I'm sure I've heard some similarly eerie scores that I can't remember- any suggestions? submitted by /u/latrallyidk [link] [comments]

  • Prince Harry Polo and Netflix
    by /u/jahazafat (Netflix) on April 23, 2024 at 3:48 pm

    This guy doesn't even own a Polo Pony and never has. If Netflix wants to showcase the sport you have to do it with someone who has a real affiliation with the sport. A person who owns, loves and cares for their horses. HARRY DOESN'T CUT IT. ​ submitted by /u/jahazafat [link] [comments]

  • Tony Scott’s Cinematic Triumph: The Legacy of Man on Fire
    by /u/JannTosh50 (Movie News and Discussion) on April 23, 2024 at 3:29 pm

    submitted by /u/JannTosh50 [link] [comments]

  • Toni Collette, Andy Garcia, Alex Pettyfer & Eva De Dominici Lead Rom-Com ‘Under The Stars’ With Filming Underway In Italy; Arclight To Sell At Cannes
    by /u/KillerCroc1234567 (Movie News and Discussion) on April 23, 2024 at 2:47 pm

    submitted by /u/KillerCroc1234567 [link] [comments]

  • The fastest a movie ever made you go "... uh oh, something isn't right here" in terms of your quality expectations
    by /u/MattAlbie60 (Movie News and Discussion) on April 23, 2024 at 2:37 pm

    I'm sure we've all had the experience where we're looking forward to a particular movie, we're sitting in a theater, we're pre-disposed to love it... and slowly it dawns on us that "oh, shit, this is going to be a disappointment I think." Disclaimer: I really do like Superman Returns. But I followed that movie mercilessly from the moment it started production. I saw every behind the scenes still. I watched every video blog from the set a hundred times. I poured over every interview. And then, the movie opened with a card quickly explaining the entire premise of the movie... and that was an enormous red flag for me that this wasn't going to be what I expected. I really do think I literally went "uh oh" and the movie hadn't even technically started yet. Because it seemed to me that what I'd assumed the first act was going to be had just been waved away in a few lines of expository text, so maybe this wasn't about to be the tightly structured superhero masterpiece I was hoping for. submitted by /u/MattAlbie60 [link] [comments]

  • Movie recs set in the ocean for work? Not documentaries...
    by /u/hannahwhis (Movie News and Discussion) on April 23, 2024 at 2:28 pm

    Does anyone have any recommendations for movies that are set in the ocean for work? Thinking like oil rigging, deep sea fishing, etc. This would not include movies where the amount of time where the characters are impacted by water is a very small amount of time. No documentaries please 🙂 Movies I’ve already thought of: - Deepwater Horizon - The Guardian submitted by /u/hannahwhis [link] [comments]

  • Scary movies for kids??
    by /u/Crispymama1210 (Movie News and Discussion) on April 23, 2024 at 1:45 pm

    My 8 year old has been reading kid friendly scary books (think goosebumps and Scary Stories to tell in the dark with the original nightmare fuel illustrations) and is asking to watch scary movies. Suggestions?? Looking to avoid tons of gore and sex stuff. Profanity is fine. Extra problem - my very brave 5yo is probably going to insist on watching too. I’m thinking Jurassic Park since they both love dinosaurs but looking for other suggestions as well. Thanks! submitted by /u/Crispymama1210 [link] [comments]

World’s Top 10 Youtube channels in 2022

r/dataisbeautiful - [OC] World's Top 10 Youtube Channels of 2022

T-Series, Cocomelon, Set India, PewDiePie, MrBeast, Kids Diana Show, Like Nastya, WWE, Zee Music Company, Vlad and Niki

What are some ways to increase precision or recall in machine learning?

What are some ways to increase precision or recall in machine learning?

AI Dashboard is available on the Web, Apple, Google, and Microsoft, PRO version

What are some ways to increase precision or recall in machine learning?

What are some ways to Boost Precision and Recall in Machine Learning?

Sensitivity vs Specificity?


In machine learning, recall is the ability of the model to find all relevant instances in the data while precision is the ability of the model to correctly identify only the relevant instances. A high recall means that most relevant results are returned while a high precision means that most of the returned results are relevant. Ideally, you want a model with both high recall and high precision but often there is a trade-off between the two. In this blog post, we will explore some ways to increase recall or precision in machine learning.

What are some ways to increase precision or recall in machine learning?
What are some ways to increase precision or recall in machine learning?


There are two main ways to increase recall:

by increasing the number of false positives or by decreasing the number of false negatives. To increase the number of false positives, you can lower your threshold for what constitutes a positive prediction. For example, if you are trying to predict whether or not an email is spam, you might lower the threshold for what constitutes spam so that more emails are classified as spam. This will result in more false positives (emails that are not actually spam being classified as spam) but will also increase recall (more actual spam emails being classified as spam).

Get 20% off Google Google Workspace (Google Meet) Standard Plan with  the following codes: 96DRHDRA9J7GTN6
Get 20% off Google Workspace (Google Meet)  Business Plan (AMERICAS) with  the following codes:  C37HCAQRVR7JTFK Get 20% off Google Workspace (Google Meet) Business Plan (AMERICAS): M9HNXHX3WC9H7YE (Email us for more codes)

Active Anti-Aging Eye Gel, Reduces Dark Circles, Puffy Eyes, Crow's Feet and Fine Lines & Wrinkles, Packed with Hyaluronic Acid & Age Defying Botanicals

2023 AWS Certified Machine Learning Specialty (MLS-C01) Practice Exams
2023 AWS Certified Machine Learning Specialty (MLS-C01) Practice Exams

To decrease the number of false negatives,

you can increase your threshold for what constitutes a positive prediction. For example, going back to the spam email prediction example, you might raise the threshold for what constitutes spam so that fewer emails are classified as spam. This will result in fewer false negatives (actual spam emails not being classified as spam) but will also decrease recall (fewer actual spam emails being classified as spam).

What are some ways to increase precision or recall in machine learning?

There are two main ways to increase precision:

by increasing the number of true positives or by decreasing the number of true negatives. To increase the number of true positives, you can raise your threshold for what constitutes a positive prediction. For example, using the spam email prediction example again, you might raise the threshold for what constitutes spam so that fewer emails are classified as spam. This will result in more true positives (emails that are actually spam being classified as spam) but will also decrease precision (more non-spam emails being classified as spam).

To decrease the number of true negatives,

you can lower your threshold for what constitutes a positive prediction. For example, going back to the spam email prediction example once more, you might lower the threshold for what constitutes spam so that more emails are classified as spam. This will result in fewer true negatives (emails that are not actually spam not being classified as spam) but will also decrease precision (more non-spam emails being classified as spam).


AI Unraveled: Demystifying Frequently Asked Questions on Artificial Intelligence (OpenAI, ChatGPT, Google Bard, Generative AI, Discriminative AI, xAI, LLMs, GPUs, Machine Learning, NLP, Promp Engineering)
What are some ways to increase precision or recall in machine learning?

To summarize,

there are a few ways to increase precision or recall in machine learning. One way is to use a different evaluation metric. For example, if you are trying to maximize precision, you can use the F1 score, which is a combination of precision and recall. Another way to increase precision or recall is to adjust the threshold for classification. This can be done by changing the decision boundary or by using a different algorithm altogether.

What are some ways to increase precision or recall in machine learning?

Sensitivity vs Specificity

In machine learning, sensitivity and specificity are two measures of the performance of a model. Sensitivity is the proportion of true positives that are correctly predicted by the model, while specificity is the proportion of true negatives that are correctly predicted by the model.

Google Colab For Machine Learning

State of the Google Colab for ML (October 2022)

Google introduced computing units, which you can purchase just like any other cloud computing unit you can from AWS or Azure etc. With Pro you get 100, and with Pro+ you get 500 computing units. GPU, TPU and option of High-RAM effects how much computing unit you use hourly. If you don’t have any computing units, you can’t use “Premium” tier gpus (A100, V100) and even P100 is non-viable.

If you are looking for an all-in-one solution to help you prepare for the AWS Cloud Practitioner Certification Exam, look no further than this AWS Cloud Practitioner CCP CLF-C02 book

Google Colab Pro+ comes with Premium tier GPU option, meanwhile in Pro if you have computing units you can randomly connect to P100 or T4. After you use all of your computing units, you can buy more or you can use T4 GPU for the half or most of the time (there can be a lot of times in the day that you can’t even use a T4 or any kinds of GPU). In free tier, offered gpus are most of the time K80 and P4, which performs similar to a 750ti (entry level gpu from 2014) with more VRAM.

For your consideration, T4 uses around 2, and A100 uses around 15 computing units hourly.
Based on the current knowledge, computing units costs for GPUs tend to fluctuate based on some unknown factor.

Considering those:

  1. For hobbyists and (under)graduate school duties, it will be better to use your own gpu if you have something with more than 4 gigs of VRAM and better than 750ti, or atleast purchase google pro to reach T4 even if you have no computing units remaining.
  2. For small research companies, and non-trivial research at universities, and probably for most of the people Colab now probably is not a good option.
  3. Colab Pro+ can be considered if you want Pro but you don’t sit in front of your computer, since it disconnects after 90 minutes of inactivity in your computer. But this can be overcomed with some scripts to some extend. So for most of the time Colab Pro+ is not a good option.

If you have anything more to say, please let me know so I can edit this post with them. Thanks!

Conclusion:


In machine learning, precision and recall trade off against each other; increasing one often decreases the other. There is no single silver bullet solution for increasing either precision or recall; it depends on your specific use case which one is more important and which methods will work best for boosting whichever metric you choose. In this blog post, we explored some methods for increasing either precision or recall; hopefully this gives you a starting point for improving your own models!

 

What are some ways we can use machine learning and artificial intelligence for algorithmic trading in the stock market?

Machine Learning and Data Science Breaking News 2022 – 2023

  • Need Help in training Spiking Neural Networks based Transformer [D]
    by /u/Sweaty-Jackfruit1271 (Machine Learning) on April 23, 2024 at 8:10 pm

    Hello all, I am training a spiking neural network based transformer, which I implemented using spiking jelly. The loss I used was ce_rate_loss from snntorch. I am training it for 3 timesteps and for 10 epochs, could you guys please help me how to make this loss curve better, as this is converging too soon... submitted by /u/Sweaty-Jackfruit1271 [link] [comments]

  • [Research] Best industries to do AI research in ??
    by /u/unbelievably-elegant (Machine Learning) on April 23, 2024 at 7:44 pm

    I am applying for a research internship at a well-known university and I am being asked to propose a topic of research and I am really confused about what kind of topic I should propose since I want to have a more industry-oriented career afterwards and not a research-oriented one. I decided that I will propose a more R&D kind of subject and not a pure theoretical one. However, I was wondering about what industry is currently growing in its adoptation of AI (healthcare/automotive/finance etc..) and is hiring new engineering grads. Thanks in advance! submitted by /u/unbelievably-elegant [link] [comments]

  • [D] Speech to Text Word Level Timestamps Accuracy Issue
    by /u/Mindless-Ordinary485 (Machine Learning) on April 23, 2024 at 7:18 pm

    I've had a lot of success with Whisper when it comes to transcriptions, but word level timestamps seems to be slightly inaccurate. From my understanding ("Whisper cannot provide reliable word timestamps, because the END-TO-END models like Transformer using cross-entropy training criterion are not designed for reliably estimating word timestamps." https://www.youtube.com/watch?v=H576iCWt1Co&t=192s) For my use case, I need precise word level timestamps, because I'm doing audio insertion after specific words. This becomes problematic when I do an insertion and the back part of a word ends up on the other side. Example: Given an original audio file with speech that has been transcribed, If I want to insert a clip at the end of the word "France", and according to the timestamp, the word "France" starts at 19.26 and ends at 19.85, I will insert the clip at 19.85. However, if the actual end of France is at 19.92, then when I insert the laugher at 19.85, I will here the remaining "France", likely "ce" (0.07), at the end. I'm curious if anyone has been posed with a similar problem and what they did to get around this? I've experimented with a few open source variations of whisper, but still running into that issue. submitted by /u/Mindless-Ordinary485 [link] [comments]

  • [R] Wu's Method can Boost Symbolic AI to Rival Silver Medalists and AlphaGeometry to Outperform Gold Medalists at IMO Geometry
    by /u/SeawaterFlows (Machine Learning) on April 23, 2024 at 7:11 pm

    Paper: https://arxiv.org/abs/2404.06405 Code: https://huggingface.co/datasets/bethgelab/simplegeometry Abstract: Proving geometric theorems constitutes a hallmark of visual reasoning combining both intuitive and logical skills. Therefore, automated theorem proving of Olympiad-level geometry problems is considered a notable milestone in human-level automated reasoning. The introduction of AlphaGeometry, a neuro-symbolic model trained with 100 million synthetic samples, marked a major breakthrough. It solved 25 of 30 International Mathematical Olympiad (IMO) problems whereas the reported baseline based on Wu's method solved only ten. In this note, we revisit the IMO-AG-30 Challenge introduced with AlphaGeometry, and find that Wu's method is surprisingly strong. Wu's method alone can solve 15 problems, and some of them are not solved by any of the other methods. This leads to two key findings: (i) Combining Wu's method with the classic synthetic methods of deductive databases and angle, ratio, and distance chasing solves 21 out of 30 methods by just using a CPU-only laptop with a time limit of 5 minutes per problem. Essentially, this classic method solves just 4 problems less than AlphaGeometry and establishes the first fully symbolic baseline strong enough to rival the performance of an IMO silver medalist. (ii) Wu's method even solves 2 of the 5 problems that AlphaGeometry failed to solve. Thus, by combining AlphaGeometry with Wu's method we set a new state-of-the-art for automated theorem proving on IMO-AG-30, solving 27 out of 30 problems, the first AI method which outperforms an IMO gold medalist. submitted by /u/SeawaterFlows [link] [comments]

  • Data science career prospect in aerospace
    by /u/oppapoocow (Data Science) on April 23, 2024 at 6:22 pm

    Hey guys, I'm working on finishing up my data science degree, with a high possibility to go into a masters. I'm currently holding a good job in aero space with a lot of background in aero space manufacturing. I really like the job and it pays well, but I'm lost for what field or job I can incorporate my aero space experience and data science. If anyone can offer any guidance, that would be great. submitted by /u/oppapoocow [link] [comments]

  • [D] Method to generate shapely contributions without model object
    by /u/ozymandias_514 (Machine Learning) on April 23, 2024 at 6:08 pm

    Is there a method to generate the approximations of Shapely values (or something similar) for a data without using model object. Essentialy I input features and model predictions on benchmark data, and the same for test data, and output is contributions for each feature on test data submitted by /u/ozymandias_514 [link] [comments]

  • DS becoming underpaid Software Engineers?
    by /u/nxp1818 (Data Science) on April 23, 2024 at 5:59 pm

    Just curious what everyone’s thoughts are on this. Seems like more DS postings are placing a larger emphasis on software development than statistics/model development. I’ve also noticed this trend at my company. There are even senior DS managers at my company saying stats are for analysts (which is a wild statement). DS is well paid, however, not as well paid as SWE, typically. Feels like shady HR tactics are at work to save dollars on software development. submitted by /u/nxp1818 [link] [comments]

  • [P] Computer Vision for package defects
    by /u/aalwiz099 (Machine Learning) on April 23, 2024 at 4:44 pm

    Hi, My team got a new project requirement.The ask is to develop a computer vision solution that can Identify the defects in inbound packages coming to a manufacturing warehouse. They come in fixed size pallets which contains multiple boxes. The defect could be anything - dents, torn packaging, broken pallet etc . How should I got about 1.Solution - GPT4 vision or custom trained models? Any recommendations from your experience in terms of model and training 2.Hardware requirements - how many cameras,what was your setup? Was latency a big issue? Any help would be greatly appreciated submitted by /u/aalwiz099 [link] [comments]

  • [P] A Python Intelligence Config Manager. Superset of hydra+pydantic+lsp
    by /u/cssunfu (Machine Learning) on April 23, 2024 at 3:57 pm

    I developed a very powerful Python Config Management Tool. It can make your json config as powerful as python code. And very friendly to humans. The most attractive feature is that it will analyze Python code and json config file in real time, provide document display, parameter completion, and goto Python definition from the json config. (Powered by LSP) Similar or better config inheritance, parameter reference and parameter grid search like hydra Data validation similar to pydantic, and the ability to convert dataclass to json-schema This project is still in its early stages, and everyone is welcome to provide some suggestions and ideas. git repo submitted by /u/cssunfu [link] [comments]

  • [N] Phi-3-mini released on HuggingFace
    by /u/topcodemangler (Machine Learning) on April 23, 2024 at 3:26 pm

    https://huggingface.co/microsoft/Phi-3-mini-128k-instruct The numbers in the technical report look really great, I guess need to be verified by 3rd parties. submitted by /u/topcodemangler [link] [comments]

  • Niche for MLE with web-dev skills?
    by /u/answersareallyouneed (Data Science) on April 23, 2024 at 3:06 pm

    I currently work as an MLE(~5 YOE) focused mainly in computer vision. I'm looking towards the future and trying to figure out next steps for my career and how to up-skill. I find myself gravitating towards working on product with ML as part of core-value-added and can't see myself working on ML/DS for analytics. Web-dev has always seemed interesting to me - specifically front-end dev (Eg. React, WebGL, Three.js). I have a limited amount of time to invest in up-skilling and am debating whether there's any justification for investing the time here. Is there a niche for an ML-Engineer with front-end knowledge? Maybe developing ML-based-tools? submitted by /u/answersareallyouneed [link] [comments]

  • [D]Drastic Change in the Accuracy score and other measures after hyper parameter tuning.
    by /u/Saheenus (Machine Learning) on April 23, 2024 at 1:41 pm

    Hey, I am currently doing a malware classification (malware,benign).Used the naive Bayes(Bernoulli) the accuracy was 67 at this point.After the tuning it goes straight up 100. Is this normal or not? I did Outlier removal using the IQR and feature selection using the correlation. submitted by /u/Saheenus [link] [comments]

  • Why Aren't Boilerplates More Common in DS?
    by /u/AccomplishedPace6024 (Data Science) on April 23, 2024 at 12:45 pm

    I've been working as a DS in predictive analytics a good amount of years now, and recently I've been pushed to delve a bit more on the more data viz side, which eeeehh fortunately or unfortunately meant coding web-dev stuff. I have realized that in the web-dev side, there are a shit ton of people building and consuming boilerplates. Like for real, a mind-blowing amount of demand for such things that I would not have expected in my life. However, I've never seen anything similar for DS projects. Sure, there's decent documentation and examples in most libraries, but it's still far from the convenience of a boilerplate. Talking to a mate he was like, I'm sure in web-dev everything is more standard than DS, and I'm like... man have you seen how many frameworks, backends, styling clusterfuck of technologies is out there. So I don't think standardization is the reason here. Do you guys think there is a gap in DS when it comes to this kind of things? Any ideas why is not more widespread? Am I missing something all toghether? EDIT: By boilerplate I don't mean ready to go models, I mean skeleton code for things like data loading and processing or result analysis, so the repetitive stuff... NOT things like model and parameter turning. submitted by /u/AccomplishedPace6024 [link] [comments]

  • [D] How to and Deploy LLaMA 3 Into Production, and Hardware Requirements
    by /u/juliensalinas (Machine Learning) on April 23, 2024 at 12:33 pm

    Many are trying to install and deploy their own LLaMA 3 model, so here is a tutorial I just made showing how to deploy LLaMA 3 on an AWS EC2 instance: https://nlpcloud.com/how-to-install-and-deploy-llama-3-into-production.html Deploying LLaMA 3 8B is fairly easy but LLaMA 3 70B is another beast. Given the amount of VRAM needed you might want to provision more than one GPU and use a dedicated inference server like vLLM in order to split your model on several GPUs. LLaMA 3 8B requires around 16GB of disk space and 20GB of VRAM (GPU memory) in FP16. As for LLaMA 3 70B, it requires around 140GB of disk space and 160GB of VRAM in FP16. I hope it is useful, and if you have questions please don't hesitate to ask! Julien submitted by /u/juliensalinas [link] [comments]

  • [D] Language Model for students
    by /u/mbungee (Machine Learning) on April 23, 2024 at 11:31 am

    Hello Community, I am preparing a lecture on language models with a lot of hands on work in python. I am trying Llama 3 8b but it seems to run very slow to answer prompts. I have a laptop with a modern I7 and 32GB of RAM. I assume my students have something less powerful and I don’t want them to take hours just for a prompt. So do you know a model that is reasonably fast without giving up too much performance? submitted by /u/mbungee [link] [comments]

  • Robotics and AI [D]
    by /u/navarrox99 (Machine Learning) on April 23, 2024 at 11:31 am

    Hello! I've studied robotics engineering, and my approach to Al has primarily focused on computer vision. I've been deeply interested in this area and, of course, nowadays, I use GPT-4 on a daily basis for both work and personal projects. I'm starting an MSc in Al next year, and in the cover letter for the master's application, I would like to discuss potential research topics that combine robotics and Al. My first choice involves using reinforcement learning for navigation and path planning. I'm also interested in the potential application of vision transformers to visual and semantic SLAM, as well as deep learning-based feature extractors. I'm also impressed by what large companies are achieving with multimodal transformers and humanoid robot like the figure Al, although I understand that these projects might only be feasible for such organizations. I'd love to hear your opinion and insights on the trends in robotics and Al. And receive some guidance from experts on the field. What do you think are the current hot topics, and what will be the key areas of focus in the next 5, 10, and 15 years? Thanks! submitted by /u/navarrox99 [link] [comments]

  • [P] I built a website to detect audio deepfakes and spoofs.
    by /u/mummni (Machine Learning) on April 23, 2024 at 10:27 am

    Hi everybody, I've been working on a project for over two years now: Deepfake-Total.com, a website where you can analyze audio files for traces of text-to-speech or voice conversion, i.e. identify audio deepfakes and spoofs, which I feel is an important topic, especially given the number of important elections coming up. The tool is free-to-use and supports YouTube and twitter URLs, as well as the manual upload of audio files. I wanted to share this and get your feedback: What works for you and what does not, where can I improve? The model should be quite precise for most input, but similar to an anti-virus scanner, there'll always be the occasional outlier. If you find any, feedback is much appreciated, as well as ideas for (scientific) collaboration, etc. submitted by /u/mummni [link] [comments]

  • [D] Are there any MoE models other than LLMs?
    by /u/lime_52 (Machine Learning) on April 23, 2024 at 9:58 am

    Is MoE architecture also applied in other ML areas, let’s say Computer Vision? Why aren’t they popular? Is it because we don’t scale vision transformers as much as LLMs, and MoE is best for scalability? submitted by /u/lime_52 [link] [comments]

  • [D] What best practices and workflows those working solo as DS/MLE should keep in mind?
    by /u/Melodic_Reality_646 (Machine Learning) on April 23, 2024 at 9:40 am

    I'm wondering what technical recruiters or seasoned DS/MLE have to say about people with profiles like mine: good theoretical and decent technical background but working solo for too long. Summary of my career for context: I've been working 8 years now as a DS, the first 3 in medium sized R&D and consulting teams (for a big tech company), then for the past 5 as a solo DS for relatively successful non-ai focused start-ups, mostly developing ML/NLP stuff to address specific issues or improve one specific feature of their product (i.e. never a whole product). In 5 years I designed. developed and deployed, say, 4 models (but experimented with many ofc) - along with a few dashboards and simple streamlit POCs). Recently attending to meetups and seeing how people that make part of actual teams work, discuss and exchange knowledge it suddenly stroke me: I'm missing out, I'm becoming obsolete. I dont feel sharp enough for technical interviews, I'm not sure the way I develop and maintain my projects are following good standards/best practices (heck, i hardly follow a kanban, mostly use my planner to report to my boss on progress). I do some version control and document what I put into prod, but not even that I'm sure I'm doing as it'd be expected within a team. submitted by /u/Melodic_Reality_646 [link] [comments]

  • CVS Data Science Interview
    by /u/LebrawnJames416 (Data Science) on April 23, 2024 at 9:23 am

    Has anyone gone through the interview process, in particular the live coding part and have any insight on what I should expect or any tips. submitted by /u/LebrawnJames416 [link] [comments]

Top 100 Data Science and Data Analytics and Data Engineering Interview Questions and Answers

What are some good datasets for Data Science and Machine Learning?

Simple Linear Regression vs. Multiple Linear Regression vs. MANOVA: A Data Scientist’s Guide

As a data scientist, it's important to understand the difference between simple linear regression, multiple linear regression, and MANOVA. This will come in handy when you're working with different datasets and trying to figure out which one to use. Here's a quick overview of each method:

AI Dashboard is available on the Web, Apple, Google, and Microsoft, PRO version

Simple Linear Regression vs. Multiple Linear Regression vs. MANOVA: A Data Scientist’s Guide

As a data scientist, it’s important to understand the difference between simple linear regression, multiple linear regression, and MANOVA. This will come in handy when you’re working with different datasets and trying to figure out which one to use. Here’s a quick overview of each method:

A Short Overview of Simple Linear Regression, Multiple Linear Regression, and MANOVA

Simple linear regression is used to predict the value of a dependent variable (y) based on the value of one independent variable (x). This is the most basic form of regression analysis.

Multiple linear regression is used to predict the value of a dependent variable (y) based on the values of two or more independent variables (x1, x2, x3, etc.). This is more complex than simple linear regression but can provide more accurate predictions.

MANOVA is used to predict the value of a dependent variable (y) based on the values of two or more independent variables (x1, x2, x3, etc.), while also taking into account the relationships between those variables. This is the most complex form of regression analysis but can provide the most accurate predictions.

So, which one should you use? It depends on your dataset and what you’re trying to predict. If you have a small dataset with only one independent variable, then simple linear regression will suffice. If you have a larger dataset with multiple independent variables, then multiple linear regression will be more appropriate. And if you need to take into account the relationships between your independent variables, then MANOVA is the way to go.

Get 20% off Google Google Workspace (Google Meet) Standard Plan with  the following codes: 96DRHDRA9J7GTN6
Get 20% off Google Workspace (Google Meet)  Business Plan (AMERICAS) with  the following codes:  C37HCAQRVR7JTFK Get 20% off Google Workspace (Google Meet) Business Plan (AMERICAS): M9HNXHX3WC9H7YE (Email us for more codes)

Active Anti-Aging Eye Gel, Reduces Dark Circles, Puffy Eyes, Crow's Feet and Fine Lines & Wrinkles, Packed with Hyaluronic Acid & Age Defying Botanicals

In data science, there are a variety of techniques that can be used to model relationships between variables. Three of the most common techniques are simple linear regression, multiple linear regression, and MANOVA. Although these techniques may appear to be similar at first glance, there are actually some key differences that set them apart. Let’s take a closer look at each technique to see how they differ.

Simple Linear Regression

Simple linear regression is a statistical technique that can be used to model the relationship between a dependent variable and a single independent variable. The dependent variable is the variable that is being predicted, while the independent variable is the variable that is being used to make predictions.

Simple Linear Regression vs. Multiple Linear Regression vs. MANOVA: A Data Scientist's Guide
Linear Regression Basics for Absolute Beginners | by Benjamin Obi Tayo Ph.D. | Towards AI


AI Unraveled: Demystifying Frequently Asked Questions on Artificial Intelligence (OpenAI, ChatGPT, Google Bard, Generative AI, Discriminative AI, xAI, LLMs, GPUs, Machine Learning, NLP, Promp Engineering)

Multiple Linear Regression

Multiple linear regression is a statistical technique that can be used to model the relationship between a dependent variable and two or more independent variables. As with simple linear regression, the dependent variable is the variable that is being predicted. However, in multiple linear regression, there can be multiple independent variables that are being used to make predictions.

Simple Linear Regression vs. Multiple Linear Regression vs. MANOVA: A Data Scientist's Guide\
Multiple Linear Regression from scratch using only numpy | by Debidutta Dash | Analytics Vidhya | Medium

MANOVA

MANOVA (multivariate analysis of variance) is a statistical technique that can be used to model the relationship between a dependent variable and two or more independent variables. Unlike simple linear regression or multiple linear regression, MANOVA can only be used when the dependent variable is continuous. Additionally, MANOVA can only be used when there are two or more dependent variables.

If you are looking for an all-in-one solution to help you prepare for the AWS Cloud Practitioner Certification Exam, look no further than this AWS Cloud Practitioner CCP CLF-C02 book

Simple Linear Regression vs. Multiple Linear Regression vs. MANOVA: A Data Scientist's Guide
Simple Linear Regression vs. Multiple Linear Regression vs. MANOVA: A Data Scientist’s Guide

When it comes to data modeling, there are a variety of different techniques that can be used. Simple linear regression, multiple linear regression, and MANOVA are three of the most common techniques. Each technique has its own set of benefits and drawbacks that should be considered before deciding which technique to use for a particular project.We often encounter data points that are correlated. For example, the number of hours studied is correlated with the grades achieved. In such cases, we can use regression analysis to study the relationships between the variables.

Simple linear regression is a statistical method that allows us to predict the value of a dependent variable (y) based on the value of an independent variable (x). In other words, we can use simple linear regression to find out how much y will change when x changes.

Multiple linear regression is a statistical method that allows us to predict the value of a dependent variable (y) based on the values of multiple independent variables (x1, x2, …, xn). In other words, we can use multiple linear regression to find out how much y will change when any of the independent variables changes.

Multivariate analysis of variance (MANOVA) is a statistical method that allows us to compare multiple dependent variables (y1, y2, …, yn) simultaneously. In other words, MANOVA can help us understand how multiple dependent variables vary together.

Simple Linear Regression vs Multiple Linear Regression vs MANOVA: A Comparative Study
The main difference between simple linear regression and multiple linear regression is that simple linear regression can be used to predict the value of a dependent variable based on the value of only one independent variable whereas multiple linear regression can be used to predict the value of a dependent variable based on the values of two or more independent variables. Another difference between simple linear regression and multiple linear regression is that simple linear regression is less likely to produce Type I and Type II errors than multiple linear regression.

Both simple linear regression and multiple linear regression are used to predict future values. However, MANOVA is used to understand how present values vary.

Conclusion:

In this article, we have seen the key differences between simple linear regression vs multiple linear regression vs MANOVA along with their applications. Simple linear regression should be used when there is only one predictor variable whereas multiple linear regressions should be used when there are two or more predictor variables. MANOVA should be used when there are two or more response variables. Hope you found this article helpful!

Get Certified with the AWS Data analytics DAS-C01 Exam Prep PRO App:
Very Similar to real exam, Countdown timer, Score card, Show/Hide Answers, Cheat Sheets, FlashCards, Detailed Answers and References
No ADS, Access All Quiz Detailed Answers, Reference and Score Card

Hundreds of Quizzes covering Quiz and Brain Teaser for AWS Data analytics DAS-C01, Data Science, Various Practice Exams covering Data Collection, Data Security, Data processing, Data Analysis, Data Visualization, Data Storage and Management,
Data Lakes, S3, Kinesis, Lake Formation, Athena, Kibana, Redshift, EMR, Glue, Kafka, Apache Spark, SQl, NoSQL, Python,DynamoDB, DocumentDB,  linear regression, logistic regression, Sampling, dataset, statistical interaction, selection bias, non-Gaussian distribution, bias-variance trade-off, Normal Distribution, correlation and covariance, Point Estimates and Confidence Interval, A/B Testing, p-value, statistical power of sensitivity, over-fitting and under-fitting, regularization, Law of Large Numbers, Confounding Variables, Survivorship Bias, univariate, bivariate and multivariate, Resampling, ROC curve, TF/IDF vectorization, Cluster Sampling, Data cleansing, ETL, Data Science and Analytics Cheat Sheets

Youtube:

What are some ways we can use machine learning and artificial intelligence for algorithmic trading in the stock market?

What are some good datasets for Data Science and Machine Learning?

Top 100 Data Science and Data Analytics and Data Engineering Interview Questions and Answers

Simple Linear Regression vs. Multiple Linear Regression vs. MANOVA: A Data Scientist’s Guide

As a data scientist, it’s important to understand the difference between simple linear regression, multiple linear regression, and MANOVA. This will come in handy when you’re working with different datasets and trying to figure out which one to use. Here’s a quick overview of each method:

A Short Overview of Simple Linear Regression, Multiple Linear Regression, and MANOVA

Simple linear regression is used to predict the value of a dependent variable (y) based on the value of one independent variable (x). This is the most basic form of regression analysis.

Multiple linear regression is used to predict the value of a dependent variable (y) based on the values of two or more independent variables (x1, x2, x3, etc.). This is more complex than simple linear regression but can provide more accurate predictions.

MANOVA is used to predict the value of a dependent variable (y) based on the values of two or more independent variables (x1, x2, x3, etc.), while also taking into account the relationships between those variables. This is the most complex form of regression analysis but can provide the most accurate predictions.

So, which one should you use? It depends on your dataset and what you’re trying to predict. If you have a small dataset with only one independent variable, then simple linear regression will suffice. If you have a larger dataset with multiple independent variables, then multiple linear regression will be more appropriate. And if you need to take into account the relationships between your independent variables, then MANOVA is the way to go.

In data science, there are a variety of techniques that can be used to model relationships between variables. Three of the most common techniques are simple linear regression, multiple linear regression, and MANOVA. Although these techniques may appear to be similar at first glance, there are actually some key differences that set them apart. Let’s take a closer look at each technique to see how they differ.

Simple Linear Regression

Simple linear regression is a statistical technique that can be used to model the relationship between a dependent variable and a single independent variable. The dependent variable is the variable that is being predicted, while the independent variable is the variable that is being used to make predictions.

Simple Linear Regression vs. Multiple Linear Regression vs. MANOVA: A Data Scientist's Guide
Linear Regression Basics for Absolute Beginners | by Benjamin Obi Tayo Ph.D. | Towards AI

Multiple Linear Regression

Ace the Microsoft Azure Fundamentals AZ-900 Certification Exam: Pass the Azure Fundamentals Exam with Ease

Multiple linear regression is a statistical technique that can be used to model the relationship between a dependent variable and two or more independent variables. As with simple linear regression, the dependent variable is the variable that is being predicted. However, in multiple linear regression, there can be multiple independent variables that are being used to make predictions.

Simple Linear Regression vs. Multiple Linear Regression vs. MANOVA: A Data Scientist's Guide\
Multiple Linear Regression from scratch using only numpy | by Debidutta Dash | Analytics Vidhya | Medium

MANOVA

MANOVA (multivariate analysis of variance) is a statistical technique that can be used to model the relationship between a dependent variable and two or more independent variables. Unlike simple linear regression or multiple linear regression, MANOVA can only be used when the dependent variable is continuous. Additionally, MANOVA can only be used when there are two or more dependent variables.

Simple Linear Regression vs. Multiple Linear Regression vs. MANOVA: A Data Scientist's Guide
Simple Linear Regression vs. Multiple Linear Regression vs. MANOVA: A Data Scientist’s Guide

When it comes to data modeling, there are a variety of different techniques that can be used. Simple linear regression, multiple linear regression, and MANOVA are three of the most common techniques. Each technique has its own set of benefits and drawbacks that should be considered before deciding which technique to use for a particular project.We often encounter data points that are correlated. For example, the number of hours studied is correlated with the grades achieved. In such cases, we can use regression analysis to study the relationships between the variables.

Simple linear regression is a statistical method that allows us to predict the value of a dependent variable (y) based on the value of an independent variable (x). In other words, we can use simple linear regression to find out how much y will change when x changes.

Multiple linear regression is a statistical method that allows us to predict the value of a dependent variable (y) based on the values of multiple independent variables (x1, x2, …, xn). In other words, we can use multiple linear regression to find out how much y will change when any of the independent variables changes.

Multivariate analysis of variance (MANOVA) is a statistical method that allows us to compare multiple dependent variables (y1, y2, …, yn) simultaneously. In other words, MANOVA can help us understand how multiple dependent variables vary together.

Simple Linear Regression vs Multiple Linear Regression vs MANOVA: A Comparative Study
The main difference between simple linear regression and multiple linear regression is that simple linear regression can be used to predict the value of a dependent variable based on the value of only one independent variable whereas multiple linear regression can be used to predict the value of a dependent variable based on the values of two or more independent variables. Another difference between simple linear regression and multiple linear regression is that simple linear regression is less likely to produce Type I and Type II errors than multiple linear regression.

Both simple linear regression and multiple linear regression are used to predict future values. However, MANOVA is used to understand how present values vary.

Conclusion:

In this article, we have seen the key differences between simple linear regression vs multiple linear regression vs MANOVA along with their applications. Simple linear regression should be used when there is only one predictor variable whereas multiple linear regressions should be used when there are two or more predictor variables. MANOVA should be used when there are two or more response variables. Hope you found this article helpful!

Get Certified with the AWS Data analytics DAS-C01 Exam Prep PRO App:
Very Similar to real exam, Countdown timer, Score card, Show/Hide Answers, Cheat Sheets, FlashCards, Detailed Answers and References
No ADS, Access All Quiz Detailed Answers, Reference and Score Card

Hundreds of Quizzes covering Quiz and Brain Teaser for AWS Data analytics DAS-C01, Data Science, Various Practice Exams covering Data Collection, Data Security, Data processing, Data Analysis, Data Visualization, Data Storage and Management,
Data Lakes, S3, Kinesis, Lake Formation, Athena, Kibana, Redshift, EMR, Glue, Kafka, Apache Spark, SQl, NoSQL, Python,DynamoDB, DocumentDB,  linear regression, logistic regression, Sampling, dataset, statistical interaction, selection bias, non-Gaussian distribution, bias-variance trade-off, Normal Distribution, correlation and covariance, Point Estimates and Confidence Interval, A/B Testing, p-value, statistical power of sensitivity, over-fitting and under-fitting, regularization, Law of Large Numbers, Confounding Variables, Survivorship Bias, univariate, bivariate and multivariate, Resampling, ROC curve, TF/IDF vectorization, Cluster Sampling, Data cleansing, ETL, Data Science and Analytics Cheat Sheets

What is Problem Formulation in Machine Learning and Top 4 examples of Problem Formulation in Machine Learning?

Summary of Machine Learning and Artificial Intelligence Capabilities

AI Dashboard is available on the Web, Apple, Google, and Microsoft, PRO version

What is Problem Formulation in Machine Learning and Top 4 examples of Problem Formulation in Machine Learning?

Machine Learning (ML) is a field of Artificial Intelligence (AI) that enables computers to learn from data, without being explicitly programmed. Machine learning algorithms build models based on sample data, known as “training data”, in order to make predictions or decisions, rather than following rules written by humans. Machine learning is closely related to and often overlaps with computational statistics; a discipline that also focuses on prediction-making through the use of computers. Machine learning can be applied in a wide variety of domains, such as medical diagnosis, stock trading, robot control, manufacturing and more.

Problem Formulation in Machine Learning
What is Problem Formulation in Machine Learning and Top 4 examples of Problem Formulation in Machine Learning?

The process of machine learning consists of several steps: first, data is collected; then, a model is selected or created; finally, the model is trained on the collected data and then applied to new data. This process is often referred to as the “machine learning pipeline”. Problem formulation is the second step in this pipeline and it consists of selecting or creating a suitable model for the task at hand and determining how to represent the collected data so that it can be used by the selected model. In other words, problem formulation is the process of taking a real-world problem and translating it into a format that can be solved by a machine learning algorithm.

2023 AWS Certified Machine Learning Specialty (MLS-C01) Practice Exams
2023 AWS Certified Machine Learning Specialty (MLS-C01) Practice Exams

There are many different types of machine learning problems, such as classification, regression, prediction and so on. The choice of which type of problem to formulate depends on the nature of the task at hand and the type of data available. For example, if we want to build a system that can automatically detect fraudulent credit card transactions, we would formulate a classification problem. On the other hand, if our goal is to predict the sale price of houses given information about their size, location and age, we would formulate a regression problem. In general, it is best to start with a simple problem formulation and then move on to more complex ones if needed.

Some common examples of problem formulations in machine learning are:
Classification: given an input data point (e.g., an image), predict its category label (e.g., dog vs cat).
Regression: given an input data point (e.g., size and location of a house), predict a continuous output value (e.g., sale price).
Prediction: given an input sequence (e.g., a series of past stock prices), predict the next value in the sequence (e.g., future stock price).
Anomaly detection: given an input data point (e.g., transaction details), decide whether it is normal or anomalous (i.e., fraudulent).
Recommendation: given information about users (e.g., age and gender) and items (e.g., books and movies), recommend items to users (e.g., suggest books for someone who likes romance novels).
Optimization: given a set of constraints (e.g., budget) and objectives (e.g., maximize profit), find the best solution (e.g., product mix).

Machine Learning For Dummies
Machine Learning For Dummies

ML For Dummies on iOs

Get 20% off Google Google Workspace (Google Meet) Standard Plan with  the following codes: 96DRHDRA9J7GTN6
Get 20% off Google Workspace (Google Meet)  Business Plan (AMERICAS) with  the following codes:  C37HCAQRVR7JTFK Get 20% off Google Workspace (Google Meet) Business Plan (AMERICAS): M9HNXHX3WC9H7YE (Email us for more codes)

Active Anti-Aging Eye Gel, Reduces Dark Circles, Puffy Eyes, Crow's Feet and Fine Lines & Wrinkles, Packed with Hyaluronic Acid & Age Defying Botanicals

ML PRO without ADS on iOs [No Ads]

ML PRO without ADS on Windows [No Ads]

ML PRO For Web/Android on Amazon [No Ads]


AI Unraveled: Demystifying Frequently Asked Questions on Artificial Intelligence (OpenAI, ChatGPT, Google Bard, Generative AI, Discriminative AI, xAI, LLMs, GPUs, Machine Learning, NLP, Promp Engineering)

Problem Formulation: What this pipeline phase entails and why it’s important

The problem formulation phase of the ML Pipeline is critical, and it’s where everything begins. Typically, this phase is kicked off with a question of some kind. Examples of these kinds of questions include: Could cars really drive themselves?  What additional product should we offer someone as they checkout? How much storage will clients need from a data center at a given time?

The problem formulation phase starts by seeing a problem and thinking “what question, if I could answer it, would provide the most value to my business?” If I knew the next product a customer was going to buy, is that most valuable? If I knew what was going to be popular over the holidays, is that most valuable? If I better understood who my customers are, is that most valuable?

However, some problems are not so obvious. When sales drop, new competitors emerge, or there’s a big change to a company/team/org, it can be easy to say, “I see the problem!” But sometimes the problem isn’t so clear. Consider self-driving cars. How many people think to themselves, “driving cars is a huge problem”? Probably not many. In fact, there isn’t a problem in the traditional sense of the word but there is an opportunity. Creating self-driving cars is a huge opportunity. That doesn’t mean there isn’t a problem or challenge connected to that opportunity. How do you design a self-driving system? What data would you look at to inform the decisions you make? Will people purchase self-driving cars?

If you are looking for an all-in-one solution to help you prepare for the AWS Cloud Practitioner Certification Exam, look no further than this AWS Cloud Practitioner CCP CLF-C02 book

Part of the problem formulation phase includes seeing where there are opportunities to use machine learning.

In the following practice examples, you are presented with four different business scenarios. For each scenario, consider the following questions:

  1. Is machine learning appropriate for this problem, and why or why not?
  2. What is the ML problem if there is one, and what would a success metric look like?
  3. What kind of ML problem is this?
  4. Is the data appropriate?’

The solutions given in this article are one of the many ways you can formulate a business problem.

I)  Amazon recently began advertising to its customers when they visit the company website. The Director in charge of the initiative wants the advertisements to be as tailored to the customer as possible. You will have access to all the data from the retail webpage, as well as all the customer data.

  1. ML is appropriate because of the scale, variety and speed required. There are potentially thousands of ads and millions of customers that need to be served customized ads immediately as they arrive to the site.
  2. The problem is ads that are not useful to customers are a wasted opportunity and a nuisance to customers, yet not serving ads at all is a wasted opportunity. So how does Amazon serve the most relevant advertisements to its retail customers?
    1. Success would be the purchase of a product that was advertised.
  3. This is a supervised learning problem because we have a labeled data point, our success metric, which is the purchase of a product.
  4. This data is appropriate because it is both the retail webpage data as well as the customer data.

II) You’re a Senior Business Analyst at a social media company that focuses on streaming. Streamers use a combination of hashtags and predefined categories to be discoverable by your platform’s consumers. You ran an analysis on unique streamer counts by hashtags and categories over the last month and found that out of tens of thousands of streamers, almost all use only 40 hashtags and 10 categories despite innumerable hashtags and hundreds of categories. You presume the predefined categories don’t represent all the possibilities very well, and that streamers are simply picking the closest fit. You figure there are likely many categories and groupings of streamers that are not accounted for. So you collect a dataset that consists of all streamer profile descriptions (all text), all the historical chat information for each streamer, and all their videos that have been streamed.

  1. ML is appropriate because of the scale and variability.
  2. The problem is the content of streamers is not being represented by the existing categories. Success would be naturally grouping the streamers into categories based on content and seeing if those align with the hashtags and categories that are being commonly used.  If they do not, then the streamers are not being well represented and you can use these groupings to create new categories.
  3. There isn’t a specific outcome variable. There’s no target or label. So this is an unsupervised problem.
  4. The data is appropriate.

III) You’re a headphone manufacturer who sells directly to big and small electronic stores. As an attempt to increase competitive pricing, Store 1 and Store 2 decided to put together the pricing details for all headphone manufacturers and their products (about 350 products) and conduct daily releases of the data. You will have all the specs from each manufacturer and their product’s pricing. Your sales have recently been dropping so your first concern is whether there are competing products that are priced lower than your flagship product.

  1. ML is probably not necessary for this. You can just search the dataset to see which headphones are priced lower than the flagship, then compare their features and build quality.

IV) You’re a Senior Product Manager at a leading ridesharing company. You did some market research, collected customer feedback, and discovered that both customers and drivers are not happy with an app feature. This feature allows customers to place a pin exactly where they want to be picked up. The customers say drivers rarely stop at the pin location. Drivers say customers most often put the pin in a place they can’t stop. Your company has a relationship with the most used maps app for the driver’s navigation so you leverage this existing relationship to get direct, backend access to their data. This includes latitude and longitude, visual photos of each lat/long, traffic delay details, and regulation data if available (ie- No Parking zones, 3 minute parking zones, fire hydrants, etc.).

  1. ML is appropriate because of the scale and automation involved. It’s not feasible to drive everywhere and write down all the places that are ok for pickup. However, maybe we can predict whether a location is ok for pickup.
  2. The problem is drivers and customers are having poor experiences connecting for pickup, which is pushing customers away from the platform.
    1. Success would be properly identifying appropriate pickup locations so they can be integrated into the feature.
  3. This is a supervised learning problem even though there aren’t any labels, yet. Someone will have to go through a sample of the data to label where there are ok places to park and not park, giving the algorithms some target information.
  4. The data is appropriate once a sample of the dataset has been labeled. There may be some other data that could be included too. What about asking UPS for driver stop information? Where do they stop?

In conclusion, problem formulation is an important step in the machine learning pipeline that should not be overlooked or underestimated. It can make or break a machine learning project; therefore, it is important to take care when formulating machine learning problems.”

AWS machine Learning Specialty Exam Prep MLS-C01
AWS machine Learning Specialty Exam Prep MLS-C01

Step by Step Solution to a Machine Learning Problem – Feature Engineering

Feature Engineering is the act of reshaping and curating existing data to make patters more apparent. This process makes the data easier for an ML model to understand. Using knowledge of the data, features are engineered and  tuned to make ML algorithms work more efficiently.

 

For this problem, imagine a scenario where you are running a real estate brokerage and you want to predict the selling price of a house. Using a specific county dataset and simple information (like the location, total square footage, and number of bedrooms), let’s practice training a baseline model, conducting feature engineering, and tuning a model to make a prediction.

First, load the dataset and take a look at its basic properties.

# Load the dataset
import pandas as pd
import boto3

Djamgatech: Build the skills that’ll drive your career into six figures: Get Djamgatech.

df = pd.read_csv(“xxxxx_data_2.csv”)
df.head()

housing dataset example
housing dataset example: xxxxx_data_2.csv

Output:

feature_engineering_dataset_example
feature_engineering_dataset_example

This dataset has 21 columns:

  • id – Unique id number
  • date – Date of the house sale
  • price – Price the house sold for
  • bedrooms – Number of bedrooms
  • bathrooms – Number of bathrooms
  • sqft_living – Number of square feet of the living space
  • sqft_lot – Number of square feet of the lot
  • floors – Number of floors in the house
  • waterfront – Whether the home is on the waterfront
  • view – Number of lot sides with a view
  • condition – Condition of the house
  • grade – Classification by construction quality
  • sqft_above – Number of square feet above ground
  • sqft_basement – Number of square feet below ground
  • yr_built – Year built
  • yr_renovated – Year renovated
  • zipcode – ZIP code
  • lat – Latitude
  • long – Longitude
  • sqft_living15 – Number of square feet of living space in 2015 (can differ from sqft_living in the case of recent renovations)
  • sqrt_lot15 – Nnumber of square feet of lot space in 2015 (can differ from sqft_lot in the case of recent renovations)

This dataset is rich and provides a fantastic playground for the exploration of feature engineering. This exercise will focus on a small number of columns. If you are interested, you could return to this dataset later to practice feature engineering on the remaining columns.

A baseline model

Now, let’s  train a baseline model.

People often look at square footage first when evaluating a home. We will do the same in the oflorur model and ask how well can the cost of the house be approximated based on this number alone. We will train a simple linear learner model (documentation). We will compare to this after finishing the feature engineering.

import sagemaker
import numpy as np
from sklearn.model_selection import train_test_split
import time

t1 = time.time()

# Split training, validation, and test
ys = np.array(df[‘price’]).astype(“float32”)
xs = np.array(df[‘sqft_living’]).astype(“float32”).reshape(-1,1)

np.random.seed(8675309)
train_features, test_features, train_labels, test_labels = train_test_split(xs, ys, test_size=0.2)
val_features, test_features, val_labels, test_labels = train_test_split(test_features, test_labels, test_size=0.5)

Ace the Microsoft Azure Fundamentals AZ-900 Certification Exam: Pass the Azure Fundamentals Exam with Ease

# Train model
linear_model = sagemaker.LinearLearner(role=sagemaker.get_execution_role(),
instance_count=1,
instance_type=’ml.m4.xlarge’,
predictor_type=’regressor’)

train_records = linear_model.record_set(train_features, train_labels, channel=’train’)
val_records = linear_model.record_set(val_features, val_labels, channel=’validation’)
test_records = linear_model.record_set(test_features, test_labels, channel=’test’)

linear_model.fit([train_records, val_records, test_records], logs=False)

sagemaker.analytics.TrainingJobAnalytics(linear_model._current_job_name, metric_names = [‘test:mse’, ‘test:absolute_loss’]).dataframe()

 

If you examine the quality metrics, you will see that the absolute loss is about $175,000.00. This tells us that the model is able to predict within an average of $175k of the true price. For a model based upon a single variable, this is not bad. Let’s try to do some feature engineering to improve on it.

Throughout the following work, we will constantly be adding to a dataframe called encoded. You will start by populating encoded with just the square footage you used previously.

 

encoded = df[[‘sqft_living’]].copy()

Categorical variables

Let’s start by including some categorical variables, beginning with simple binary variables.

The dataset has the waterfront feature, which is a binary variable. We should change the encoding from 'Y' and 'N' to 1 and 0. This can be done using the map function (documentation) provided by Pandas. It expects either a function to apply to that column or a dictionary to look up the correct transformation.

Binary categorical

Let’s write code to transform the waterfront variable into binary values. The skeleton has been provided below.

encoded[‘waterfront’] = df[‘waterfront’].map({‘Y’:1, ‘N’:0})

You can also encode many class categorical variables. Look at column condition, which gives a score of the quality of the house. Looking into the data source shows that the condition can be thought of as an ordinal categorical variable, so it makes sense to encode it with the order.

Ordinal categorical

Using the same method as in question 1, encode the ordinal categorical variable condition into the numerical range of 1 through 5.

encoded[‘condition’] = df[‘condition’].map({‘Poor’:1, ‘Fair’:2, ‘Average’:3, ‘Good’:4, ‘Very Good’:5})

A slightly more complex categorical variable is ZIP code. If you have worked with geospatial data, you may know that the full ZIP code is often too fine-grained to use as a feature on its own. However, there are only 7070 unique ZIP codes in this dataset, so we may use them.

However, we do not want to use unencoded ZIP codes. There is no reason that a larger ZIP code should correspond to a higher or lower price, but it is likely that particular ZIP codes would. This is the perfect case to perform one-hot encoding. You can use the get_dummies function (documentation) from Pandas to do this.

Nominal categorical

Using the Pandas get_dummies function,  add columns to one-hot encode the ZIP code and add it to the dataset.

encoded = pd.concat([encoded, pd.get_dummies(df[‘zipcode’])], axis=1)

In this way, you may freely encode whatever categorical variables you wish. Be aware that for categorical variables with many categories, something will need to be done to reduce the number of columns created.

One additional technique, which is simple but can be highly successful, involves turning the ZIP code into a single numerical column by creating a single feature that is the average price of a home in that ZIP code. This is called target encoding.

To do this, use groupby (documentation) and mean (documentation) to first group the rows of the DataFrame by ZIP code and then take the mean of each group. The resulting object can be mapped over the ZIP code column to encode the feature.

Nominal categorical II

Complete the following code snippet to provide a target encoding for the ZIP code.

means = df.groupby(‘zipcode’)[‘price’].mean()
encoded[‘zip_mean’] = df[‘zipcode’].map(means)

Normally, you only either one-hot encode or target encode. For this exercise, leave both in. In practice, you should try both, see which one performs better on a validation set, and then use that method.

Scaling

Take a look at the dataset. Print a summary of the encoded dataset using describe (documentation).

encoded.describe()

Scaling  - summary of the encoded dataset using describe
Scaling – summary of the encoded dataset using describe

One column ranges from 290290 to 1354013540 (sqft_living), another column ranges from 11 to 55 (condition), 7171 columns are all either 00 or 11 (one-hot encoded ZIP code), and then the final column ranges from a few hundred thousand to a couple million (zip_mean).

In a linear model, these will not be on equal footing. The sqft_living column will be approximately 1300013000 times easier for the model to find a pattern in than the other columns. To solve this, you often want to scale features to a standardized range. In this case, you will scale sqft_living to lie within 00 and 11.

Feature scaling

Fill in the code skeleton below to scale the column of the DataFrame to be between 00 and 11.

sqft_min = encoded[‘sqft_living’].min()
sqft_max = encoded[‘sqft_living’].max()
encoded[‘sqft_living’] = encoded[‘sqft_living’].map(lambda x : (x-sqft_min)/(sqft_max – sqft_min))

cond_min = encoded[‘condition’].min()
cond_max = encoded[‘condition’].max()
encoded[‘condition’] = encoded[‘condition’].map(lambda x : (x-cond_min)/(cond_max – cond_min))]

Read more here….

Amazon Reviews Solution

Predicting Credit Card Fraud Solution

Predicting Airplane Delays Solution

Data Processing for Machine Learning Example

Model Training and Evaluation Examples

Targeting Direct Marketing Solution

What are some good datasets for Data Science and Machine Learning?

What are some good datasets for Data Science and Machine Learning?

AI Dashboard is available on the Web, Apple, Google, and Microsoft, PRO version

What are some good datasets for Data Science and Machine Learning?

Finding good datasets for Data Science and Machine Learning can be a challenge. There are a lot of dataset out there, but not all of them are good for machine learning. In order to find a good dataset, you need to consider what you want to use the dataset for. If you want to use the dataset for training a machine learning model, then you need to make sure that the dataset is representative of the real-world data that you want to use the model on.

2023 AWS Certified Machine Learning Specialty (MLS-C01) Practice Exams
2023 AWS Certified Machine Learning Specialty (MLS-C01) Practice Exams

The dataset should also be large enough to train a robust model. Another important consideration is whether or not the dataset is open source. Open source datasets are typically better because they have been vetted by the community and are more likely to be of high quality. However, open source datasets can also be more difficult to find. A good place to start looking for datasets is on websites like Kaggle and UC Irvine Machine Learning Repository. These websites contain a variety of high-quality datasets that are free to download and use.

What are the Top 10 AWS jobs you can get with an AWS certification in 2022 plus AWS Interview Questions
AWS Data Analytics Specialty Certification Practice Exams

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Person climbing a staircase. Learn Data Science from Scratch: online program with 21 courses

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works,

Get 20% off Google Google Workspace (Google Meet) Standard Plan with  the following codes: 96DRHDRA9J7GTN6
Get 20% off Google Workspace (Google Meet)  Business Plan (AMERICAS) with  the following codes:  C37HCAQRVR7JTFK Get 20% off Google Workspace (Google Meet) Business Plan (AMERICAS): M9HNXHX3WC9H7YE (Email us for more codes)

Active Anti-Aging Eye Gel, Reduces Dark Circles, Puffy Eyes, Crow's Feet and Fine Lines & Wrinkles, Packed with Hyaluronic Acid & Age Defying Botanicals

Here’s a good article about this topic

Earth’s population reaches 8 billion

Earth's population reaches 8 billion
Earth’s population reaches 8 billion

The most used words on every country’s Wikipedia Page

What are some good datasets for Data Science and Machine Learning?
The most used words on every country’s Wikipedia Page

Who works from home in 2022? Rates by industry 

Who works from home in 2022? Rates by industry
Who works from home in 2022? Rates by industry

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali


AI Unraveled: Demystifying Frequently Asked Questions on Artificial Intelligence (OpenAI, ChatGPT, Google Bard, Generative AI, Discriminative AI, xAI, LLMs, GPUs, Machine Learning, NLP, Promp Engineering)

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

If you are looking for an all-in-one solution to help you prepare for the AWS Cloud Practitioner Certification Exam, look no further than this AWS Cloud Practitioner CCP CLF-C02 book

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

What are some good datasets for Data Science and Machine Learning?
What are some good datasets for Data Science and Machine Learning?

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Person climbing a staircase. Learn Data Science from Scratch: online program with 21 courses

Djamgatech: Build the skills that’ll drive your career into six figures: Get Djamgatech.

 
 

The Biggest Source of Power in Every US and Canadian State and Province 

The Biggest Source of Power in Every State and Province
The Biggest Source of Power in Every State and Province

Top 10 largest oil fields by 2021 production

Top 10 largest oil fields by 2021 production
Top 10 largest oil fields by 2021 production

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

The Largest Entertainment Streaming Companies
The Largest Entertainment Streaming Companies

The F word in Popular Movies

r/dataisbeautiful - [OC] The F word in Popular Movies

The easiest words to rhyme – Words that have the most rhymes

Post image

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Ace the Microsoft Azure Fundamentals AZ-900 Certification Exam: Pass the Azure Fundamentals Exam with Ease

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

Suicide rate among countries with the highest Human Development Index

Suicide rate among countries with the highest Human Development Index
Suicide rate among countries with the highest Human Development Index

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.

Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Amazon Omics

Store, query, analyze, and generate insights from genomic and other omics data.

Amazon Omics
Amazon Omics

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

 
 
Behshad Behzadi on LinkedIn: Partnering with iCAD to improve breast cancer screening
 

From AI Research to Real world Clinical Practice:
After a pivotal moment in 2020 to show our AI technology performed better than radiologists in a retrospective study at identifying signs of breast cancer, today a new important milestone is achieved: Google Health announces our first commercial agreement to license our mammography AI research model to be integrated in real-world clinical practice.

This can make healthcare AI to be more accessible and eventually saves more lives.

#ai #research #google #health #healthcare #breastcancer #mammography

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works,

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.

Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works,

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.

Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works,

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.

Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works,

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.

Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works,

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.

Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works,

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.

Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works,

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.

Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works,

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.

Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works,

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.

Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meteorological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here
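Once a monthly on-time performance extract has been downloaded as CSV, a quick delay summary takes a few lines of pandas. The column names below (OP_UNIQUE_CARRIER, ARR_DELAY) follow the BTS reporting fields but should be checked against the file you actually download; the filename is hypothetical:

import pandas as pd

# Assumes a BTS "Reporting Carrier On-Time Performance" monthly extract saved locally.
df = pd.read_csv("ontime_2023_01.csv")  # hypothetical filename

# Share of flights arriving 15+ minutes late, by carrier
delayed = (
    df.assign(is_delayed=df["ARR_DELAY"] >= 15)
      .groupby("OP_UNIQUE_CARRIER")["is_delayed"]
      .mean()
      .sort_values(ascending=False)
)
print(delayed.head(10))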

Worldwide flight data

OpenFlights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations, and ferry terminals spanning the globe.

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, including historical records; they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset of arrests in the US by race and by state. Download the Excel file here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), researchers from IBM, MIT, and Harvard University came together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They also released two machine learning models representing different approaches to the problem; both rely on testing techniques psychologists use to study infants’ behavior, with the aim of accelerating the development of AI that exhibits common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works.”

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1,914,081 records created from all malware-traffic-analysis.net PCAP files from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali
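Suricata typically emits its events as newline-delimited JSON (an eve.json file), which pandas can read directly. A minimal sketch, assuming the dataset's Suricata logs keep that format and the usual event_type field (path and field names should be checked against the extracted files):

import pandas as pd

# Suricata EVE logs are newline-delimited JSON; adjust the path to the extracted dataset.
events = pd.read_json("eve.json", lines=True)  # hypothetical filename

# Count events by type (alert, dns, http, ...)
print(events["event_type"].value_counts().head())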

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” means being born outside of any of the EU countries. For the US, “foreign-born” means being born outside of any US state. 🇺🇸🇪🇺

Author: Here


Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 migration data, and Croatia has no data at all.

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience


Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (the dataset powers that web app)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021
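Listing everything under a crawl prefix from Python uses the same anonymous-access pattern as the CLI command above, plus a paginator so the listing does not stop at the first page; this is a sketch only, and the prefix is the April 2021 crawl mentioned above:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
paginator = s3.get_paginator("list_objects_v2")

# Walk the April 2021 crawl prefix without downloading anything
for page in paginator.paginate(Bucket="commoncrawl",
                               Prefix="crawl-data/CC-MAIN-2021-17/",
                               PaginationConfig={"MaxItems": 20}):
    for obj in page.get("Contents", []):
        print(obj["Key"])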

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.
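HRRR output is distributed as GRIB2 files (for example through the NOAA open data buckets). One way to open a single forecast file in Python is with xarray and the cfgrib engine; a minimal sketch, with a hypothetical local filename and a filter on surface-level fields:

import xarray as xr

# Open the surface-level fields from one HRRR GRIB2 forecast file.
ds = xr.open_dataset(
    "hrrr.t00z.wrfsfcf01.grib2",          # hypothetical local file
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"typeOfLevel": "surface"}},
)
print(ds.data_vars)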

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6,229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
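The SQuAD files are plain JSON, so iterating over the question–answer pairs takes only a few lines of Python; a sketch assuming a locally downloaded SQuAD 2.0 dev set (filename hypothetical):

import json

with open("dev-v2.0.json") as f:   # hypothetical local copy of a SQuAD release
    squad = json.load(f)

for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            answers = [a["text"] for a in qa["answers"]]
            # Unanswerable questions (SQuAD 2.0) have an empty answer list
            print(qa["question"], "->", answers[:1])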

PubMed Diabetes Dataset

The PubMed Diabetes dataset consists of 19,717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes. The citation network consists of 44,338 links. Each publication in the dataset is described by a TF-IDF weighted word vector drawn from a dictionary of 500 unique words. The README file in the dataset provides more details.

Download Link
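If the citation network is exported as a simple tab-separated edge list (the exact file layout is described in the dataset's README, and the filename below is hypothetical), networkx can load it directly for quick inspection:

import networkx as nx

# Directed citation graph; replace the path with the edge-list file named in the README.
G = nx.read_edgelist("pubmed_citations.tsv", create_using=nx.DiGraph, delimiter="\t")
print(G.number_of_nodes(), "papers,", G.number_of_edges(), "citation links")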


  • Node.js – Async non-blocking event-driven JavaScript runtime built on Chrome’s V8 JavaScript engine.
  • Frontend Development
  • iOS – Mobile operating system for Apple phones and tablets.
  • Android – Mobile operating system developed by Google.
  • IoT & Hybrid Apps
  • Electron – Cross-platform native desktop apps using JavaScript/HTML/CSS.
  • Cordova – JavaScript API for hybrid apps.
  • React Native – JavaScript framework for writing natively rendering mobile apps for iOS and Android.
  • Xamarin – Mobile app development IDE, testing, and distribution.
  • Linux
    • Containers
    • eBPF – Virtual machine that allows you to write more efficient and powerful tracing and monitoring for Linux systems.
    • Arch-based Projects – Linux distributions and projects based on Arch Linux.
  • macOS – Operating system for Apple’s Mac computers.
  • watchOS – Operating system for the Apple Watch.
  • JVM
  • Salesforce
  • Amazon Web Services
  • Windows
  • IPFS – P2P hypermedia protocol.
  • Fuse – Mobile development tools.
  • Heroku – Cloud platform as a service.
  • Raspberry Pi – Credit card-sized computer aimed at teaching kids programming, but capable of a lot more.
  • Qt – Cross-platform GUI app framework.
  • WebExtensions – Cross-browser extension system.
  • RubyMotion – Write cross-platform native apps for iOS, Android, macOS, tvOS, and watchOS in Ruby.
  • Smart TV – Create apps for different TV platforms.
  • GNOME – Simple and distraction-free desktop environment for Linux.
  • KDE – A free software community dedicated to creating an open and user-friendly computing experience.
  • .NET
    • Core
    • Roslyn – Open-source compilers and code analysis APIs for C# and VB.NET languages.
  • Amazon Alexa – Virtual home assistant.
  • DigitalOcean – Cloud computing platform designed for developers.
  • Flutter – Google’s mobile SDK for building native iOS and Android apps from a single codebase written in Dart.
  • Home Assistant – Open source home automation that puts local control and privacy first.
  • IBM Cloud – Cloud platform for developers and companies.
  • Firebase – App development platform built on Google Cloud Platform.
  • Robot Operating System 2.0 – Set of software libraries and tools that help you build robot apps.
  • Adafruit IO – Visualize and store data from any device.
  • Cloudflare – CDN, DNS, DDoS protection, and security for your site.
  • Actions on Google – Developer platform for Google Assistant.
  • ESP – Low-cost microcontrollers with WiFi and broad IoT applications.
  • Deno – A secure runtime for JavaScript and TypeScript that uses V8 and is built in Rust.
  • DOS – Operating system for x86-based personal computers that was popular during the 1980s and early 1990s.
  • Nix – Package manager for Linux and other Unix systems that makes package management reliable and reproducible.


  • JavaScript
  • Swift – Apple’s compiled programming language that is secure, modern, programmer-friendly, and fast.
  • Python – General-purpose programming language designed for readability.
    • Asyncio – Asynchronous I/O in Python 3.
    • Scientific Audio – Scientific research in audio/music.
    • CircuitPython – A version of Python for microcontrollers.
    • Data Science – Data analysis and machine learning.
    • Typing – Optional static typing for Python.
    • MicroPython – A lean and efficient implementation of Python 3 for microcontrollers.
  • Rust
  • Haskell
  • PureScript
  • Go
  • Scala
    • Scala Native – Optimizing ahead-of-time compiler for Scala based on LLVM.
  • Ruby
  • Clojure
  • ClojureScript
  • Elixir
  • Elm
  • Erlang
  • Julia – High-level dynamic programming language designed to address the needs of high-performance numerical analysis and computational science.
  • Lua
  • C
  • C/C++ – General-purpose language with a bias toward system programming and embedded, resource-constrained software.
  • R – Functional programming language and environment for statistical computing and graphics.
  • D
  • Common Lisp – Powerful dynamic multiparadigm language that facilitates iterative and interactive development.
  • Perl
  • Groovy
  • Dart
  • Java – Popular secure object-oriented language designed for flexibility to “write once, run anywhere”.
  • Kotlin
  • OCaml
  • ColdFusion
  • Fortran
  • PHP – Server-side scripting language.
  • Pascal
  • AutoHotkey
  • AutoIt
  • Crystal
  • Frege – Haskell for the JVM.
  • CMake – Build, test, and package software.
  • ActionScript 3 – Object-oriented language targeting Adobe AIR.
  • Eta – Functional programming language for the JVM.
  • Idris – General purpose pure functional programming language with dependent types influenced by Haskell and ML.
  • Ada/SPARK – Modern programming language designed for large, long-lived apps where reliability and efficiency are essential.
  • Q# – Domain-specific programming language used for expressing quantum algorithms.
  • Imba – Programming language inspired by Ruby and Python and compiles to performant JavaScript.
  • Vala – Programming language designed to take full advantage of the GLib and GNOME ecosystems, while preserving the speed of C code.
  • Coq – Formal language and environment for programming and specification which facilitates interactive development of machine-checked proofs.
  • V – Simple, fast, safe, compiled language for developing maintainable software.



NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

  • Flask – Python framework.
  • Docker
  • Vagrant – Automation virtual machine environment.
  • Pyramid – Python framework.
  • Play1 Framework
  • CakePHP – PHP framework.
  • Symfony – PHP framework.
  • Laravel – PHP framework.
    • Education
    • TALL Stack – Full-stack development solution featuring libraries built by the Laravel community.
  • Rails – Web app framework for Ruby.
  • Phalcon – PHP framework.
  • Useful .htaccess Snippets
  • nginx – Web server.
  • Dropwizard – Java framework.
  • Kubernetes – Open-source platform that automates Linux container operations.
  • Lumen – PHP micro-framework.
  • Serverless Framework – Serverless computing and serverless architectures.
  • Apache Wicket – Java web app framework.
  • Vert.x – Toolkit for building reactive apps on the JVM.
  • Terraform – Tool for building, changing, and versioning infrastructure.
  • Vapor – Server-side development in Swift.
  • Dash – Python web app framework.
  • FastAPI – Python web app framework.
  • CDK – Open-source software development framework for defining cloud infrastructure in code.
  • IAM – User accounts, authentication and authorization.
  • Chalice – Python framework for serverless app development on AWS Lambda.

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works,

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0