What are some ways to increase precision or recall in machine learning?
Sensitivity vs Specificity?
In machine learning, recall is the ability of the model to find all relevant instances in the data while precision is the ability of the model to correctly identify only the relevant instances. A high recall means that most relevant results are returned while a high precision means that most of the returned results are relevant. Ideally, you want a model with both high recall and high precision but often there is a trade-off between the two. In this blog post, we will explore some ways to increase recall or precision in machine learning.
The main way to increase recall is to reduce the number of false negatives, i.e. to miss fewer of the genuinely positive cases.
In practice, the lever is to lower your threshold for what constitutes a positive prediction. For example, if you are trying to predict whether or not an email is spam, you might lower the threshold for what counts as spam so that more emails are classified as spam. This catches more of the actual spam (fewer false negatives, so higher recall), but it also produces more false positives (legitimate emails wrongly flagged as spam), which pulls precision down.
The main way to increase precision is to reduce the number of false positives, i.e. to make fewer positive predictions that turn out to be wrong.
Here the lever works in the opposite direction: raise your threshold for what constitutes a positive prediction. Using the spam example again, raising the threshold means fewer emails are classified as spam, so the emails that do get flagged are more likely to be genuine spam (fewer false positives, higher precision). The cost is that more actual spam slips through as false negatives, which lowers recall.
To summarize,
the most direct way to trade precision against recall is to adjust the classification threshold, which moves the decision boundary. You can also change the metric you optimize: if you need a balance of the two, the F1 score, the harmonic mean of precision and recall, is a common choice. Finally, a different algorithm may simply produce a decision boundary that trades the two off more favorably for your data.
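As a rough illustration, here is a minimal scikit-learn sketch of the threshold trade-off; the synthetic dataset, logistic regression model, and threshold values are all illustrative assumptions. Lowering the threshold pushes recall up, raising it pushes precision up, and F1 summarizes the balance.

```python
# Minimal sketch of threshold tuning (synthetic data, illustrative thresholds).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # probability of the positive class

for threshold in (0.3, 0.5, 0.7):
    preds = (probs >= threshold).astype(int)  # lower threshold -> more positives
    print(f"threshold={threshold:.1f}  "
          f"precision={precision_score(y_test, preds, zero_division=0):.2f}  "
          f"recall={recall_score(y_test, preds):.2f}  "
          f"f1={f1_score(y_test, preds):.2f}")
```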
Sensitivity vs Specificity
In machine learning, sensitivity and specificity are two complementary measures of a classifier's performance. Sensitivity (also called recall, or the true positive rate) is the proportion of actual positives that the model correctly identifies, while specificity (the true negative rate) is the proportion of actual negatives that the model correctly identifies.
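As a small illustration, the sketch below computes both from a confusion matrix with scikit-learn; the labels and predictions are made-up placeholders.

```python
# Minimal sketch: sensitivity and specificity from a binary confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical ground truth
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # hypothetical model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate (= recall)
specificity = tn / (tn + fp)   # true negative rate
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```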
Google introduced computing units, which you purchase much like compute from AWS, Azure, or any other cloud provider. With Colab Pro you get 100 computing units, and with Pro+ you get 500. The GPU or TPU you pick and the high-RAM option determine how many computing units you burn per hour. If you have no computing units left, you can't use the "Premium" tier GPUs (A100, V100), and even the P100 is not viable.
Colab Pro+ comes with the Premium-tier GPU option, while on Pro, as long as you have computing units, you are randomly assigned a P100 or a T4. Once your computing units run out, you can buy more, or fall back to a T4 for part of the time (there are often stretches of the day when you can't get a T4, or any GPU at all). On the free tier, the GPUs on offer are usually a K80 or a P4, which perform similarly to a 750 Ti (an entry-level GPU from 2014) but with more VRAM.
For reference, a T4 consumes around 2 computing units per hour and an A100 around 15. As far as anyone can tell, the per-GPU computing-unit costs tend to fluctuate based on some unknown factor.
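Taking those approximate per-hour rates at face value (they may well change), the back-of-the-envelope math works out as follows:

```python
# Rough estimate of how long each plan's quota lasts, using the approximate
# computing-unit costs mentioned above (both figures are assumptions).
units = {"Pro": 100, "Pro+": 500}
cost_per_hour = {"T4": 2, "A100": 15}  # approximate computing units per hour

for plan, quota in units.items():
    for gpu, rate in cost_per_hour.items():
        print(f"{plan}: ~{quota / rate:.0f} hours on a {gpu}")
```

So Pro buys roughly 50 T4-hours or about 7 A100-hours, and Pro+ about five times that.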
For hobbyists and (under)graduate coursework, you are better off using your own GPU if you have something with more than 4 GB of VRAM and faster than a 750 Ti, or at least purchasing Colab Pro so you can still reach a T4 even when you have no computing units remaining.
For small research companies, non-trivial research at universities, and probably most people, Colab is no longer a good option.
Colab Pro+ can be worth considering if you want Pro but can't sit in front of your computer, since Pro disconnects after 90 minutes of inactivity. That limitation can be worked around with scripts to some extent, though, so most of the time Pro+ is not a good option either.
If you have anything to add, please let me know and I will edit it into this post. Thanks!
In machine learning, precision and recall trade off against each other: increasing one often decreases the other. There is no silver bullet for boosting either metric; which one matters more, and which methods work best for improving it, depends on your specific use case. In this blog post, we explored some methods for increasing precision or recall; hopefully this gives you a starting point for improving your own models!
Oasis, by Decart and Etched, has been released; it can output playable video games in which the user can perform actions like move, jump, check inventory, etc. This is unlike GameNGen by Google, which can only output gameplay videos (that can't be played). Check the demo and other details here: https://youtu.be/INsEs1sve9k submitted by /u/mehul_gupta1997
How often do stakeholders point out issues in your calculations? Honestly, such cases make me question my proficiency. submitted by /u/Terrible_Dimension66
I haven't worked in the advertising industry, but I have read about some not-so-good experiences in it. submitted by /u/lostmillenial97531
I'm planning to start a new data analysis project and want to ensure it aligns with what interviewers look for. Can anyone share insights on what really stands out to hiring managers in data analysis projects? Any advice on key elements or focus areas that would make a project more impressive during interviews would be much appreciated! submitted by /u/ds_reddit1
I work at a company where the data engineering team is new and quite junior, mostly focused on simple ingestion and pushing whatever logic our (also often junior) data scientists give them. The data scientists also write the orchestration, like how to process a real-time streaming pipeline for their metric construction and models. So we have a lot of messy code the data scientists put together that can be inefficient.

As the most senior person on my team, I've been tasked with taking more of a lead in teaching the team best practices related to data engineering: simple things like good approaches for backfilling, modularizing queries and query efficiency, DAG construction and monitoring, etc. While I've picked up a lot from experience, I'm curious to learn more "proper" ways to approach some of these problems.

What are some good and practical data/analytics engineering resources you've used? I saw dbt has interesting documentation on best practices for analytics engineering in the context of their product, but I'm looking for other sources. submitted by /u/IronManFolgore
I think I have a fairly solid grasp now of what a random forest is and how it works in practice, but I am still unsure as to exactly how a random forest makes predictions on data it hasn't seen before. Let me explain what I mean.

When you fit something like a logistic regression model, you train/fit it (i.e. find the model coefficients which minimise prediction error) on some data, and evaluate how that model performs using those coefficients on unseen data. When you do this for a decision tree, a similar logic applies, except instead of finding coefficients, you're finding "splits" which likewise minimise some error. You could then evaluate the performance of this tree "using" those splits on unseen data.

Now, a random forest is a collection of decision trees, and each tree is trained on a bootstrapped sample of the data with a random set of predictors considered at the splits. Say you want to train 1000 trees for your forest. Sampling dictates a scenario where a single datapoint (row of data) could appear in 300/1000 trees, and for 297/300 of those trees it predicts (1) while for the other 3/300 it predicts (0), so the overall prediction would be a 1. The same logic follows for a regression problem, except it'd be taking the arithmetic mean.

But what I can't grasp is how you'd then use this to predict on unseen data. What are the values I obtained from fitting the random forest model, i.e. what splits is the random forest using? Is it some sort of average of the splits of all the trees trained for the model? Or am I missing the point, i.e. is a new data point actually put through all 1000 trees of the forest? submitted by /u/JLane1996
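For what it's worth, the last reading is the right one: a new row is passed through every fitted tree, each tree applies its own learned splits, and the forest aggregates the per-tree outputs (scikit-learn averages the per-tree class probabilities; regression forests average the per-tree predictions). A minimal sketch on synthetic data, purely illustrative:

```python
# Minimal sketch: a "new" point is run through every tree in the forest and
# the forest's output is the average of the per-tree predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)
forest = RandomForestClassifier(n_estimators=1000, random_state=0).fit(X, y)

x_new = X[:1]  # stand-in for an unseen row

# Per-tree class probabilities, one row per tree.
per_tree = np.stack([tree.predict_proba(x_new)[0] for tree in forest.estimators_])
print("mean per-tree P(class 1):", per_tree[:, 1].mean())
print("forest predict_proba:   ", forest.predict_proba(x_new)[0, 1])  # same number
```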
I just joined a company that is premature when it comes to DS. All code is in notebooks. There is no version control on data, code, or models. There is some model documentation, but it is quite sparse and more of a business memo than a technical document.

The good part is that my manager wants to improve the DS workflow. The devops team also wants to improve the implementations, and they are quite collaborative. The issue is the product owners. I recently did a task for one of them, which was to update one of their models. The model lacked a proper technical document, had half-assed code on git (really just a model dump on S3), even the labels were partly incorrect, there was no performance monitoring... you can guess the rest.

After working on it for a few weeks, I realized that the product owner only expects a document as the outcome of my efforts: a document exactly like their business memos, in which half of the numbers are cherry-picked and the other half are either meaningless data tables or other info that is downright wrong. He also insists that the graphs be preserved, graphs that lack basic attributes such as axis labels, proper font sizes, or overlaying to allow easy comparison. His single argument is that "customers like it this way", which, if true, is quite a valid argument. I caved in for now.

Any suggestions on how to navigate the situation? submitted by /u/furioncruz
Hi guys, recently I've been collaborating on automated pricing. We built a model based on demand elasticity for e-commerce products, all statistical methods; even the price elasticity of demand function assumes a linear relationship between demand and margins. So far it's doing "good", but not "good enough".

The thing is, the elasticity considers the trendline of the product alone, which is not the whole picture: what if that product is a complement to another product or products, or a substitute? I managed to tackle that partially with cross-elasticity, and it did give good estimates, but there's still too much room for improvement.

My manager is planning on converting a lot of the structure of the model to RL; in the current model we're actually mimicking the behaviour of RL, where there's an achievement function that we're trying to optimize and it works with the estimated PED. But I find it really hard to make RL practical. I've also read in many articles and discussions that many companies tackle this problem with deep learning, because it can absorb many complex patterns in the data, so I decided to give it a shot and started learning deep learning from a hands-on deep learning book and a YouTube course. But is it worth it? If anyone has worked on pricing and would share wisdom, that'd be awesome. Thanks. submitted by /u/Careful_Engineer_700
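For context on the elasticity piece, the usual statistical starting point is a log-log regression, where the coefficient on log price is the own-price elasticity; the sketch below uses synthetic data and statsmodels, and the numbers, features, and any cross-price terms are purely illustrative assumptions, not the poster's actual model.

```python
# Rough sketch: estimate own-price elasticity of demand with a log-log OLS.
# Synthetic data; the true elasticity is set to -1.3 for checking.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
price = rng.uniform(5, 50, size=1_000)
quantity = 500 * price ** -1.3 * np.exp(rng.normal(0, 0.2, size=1_000))

X = sm.add_constant(np.log(price))          # [intercept, log(price)]
fit = sm.OLS(np.log(quantity), X).fit()
print(f"estimated elasticity: {fit.params[1]:.2f}")  # close to -1.3
```

Cross-price elasticities extend the same regression with the log prices of related products.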
Project goal: create a 'reasonable' 30-year forecast with some core component generating variation which resembles reality.

Input data: annual US macroeconomic features such as inflation, GDP, wage growth, M2, imports, exports, etc. Features have varying ranges of availability (some going back to 1900 and others starting in the 90s).

Problem statement: which method(s) are SOTA for this type of prediction? The recent papers I've read mention BNNs, MAGAN, and LightGBM for smaller data like this, and TFT, Prophet, and NeuralProphet for big data. I'm mainly curious if others out there have done something similar and have special insights. My current method of extracting temporal features and using a Trend + Level blend with LightGBM works, but I don't want to be missing out on better ideas, especially ones that fit into a Monte Carlo framework and include something like labeling years into probabilistic 'regimes' of boom/recession. submitted by /u/SwitchFace
I am currently two years into my first role as a data scientist in risk in finance post-grad. I have been applying for my next opportunity and I have two choices as of now. I could do a lateral internal move into more of a project manager role in credit risk data analytics. This job is very safe, everything is pretty built out, and in banking things evolve a lot slower. My other option is to become a data scientist in a niche department at a new company. This is a lot riskier and more exciting, as I would be the first data scientist in the role and would essentially be building things in my own vision. Has anyone ever been the first data scientist in a department? What should I expect in a role like this? I have two years of experience; would that be enough to take on a challenge like this? submitted by /u/NuBoston
I'm thinking of doing more of these, not necessarily on politics, but on other topics. Would love to hear feedback/critique on communication, visuals, content etc. https://youtu.be/kFDkvrICM48?si=FbXzoepCNSCUz0wi submitted by /u/jarena009 [link] [comments]
Hey there! Does anyone here know if sequential models like LSTMs and Transformers work for real trading? I know that stock price data usually has low autocorrelation, but I've seen DL courses that use that kind of data and get good results. I am new to time series forecasting and trading, so please forgive my ignorance. submitted by /u/gomezalp
I have extensive experience in experimentation, but the experiments generally came from those higher up; all I had to do was analyse them and give a thumbs up if they were worth pursuing. However, I want to know how you conduct analysis and come up with potential experiments to try from existing data itself. A big part of it would be looking at pain points in customer journeys, but because the sequence of events is unique for everyone, I'm wondering how you go about this analysis and visualise it. submitted by /u/ShayBae23EEE
I am on my way and have built a few small models, and now I am looking toward deployment, but I can't find any resources that help me do so. If anyone here can recommend any straightforward resources for model deployment, I'd appreciate it. submitted by /u/Emotional-Rhubarb725
I have been looking for months for this formula and an explanation for it, and I can't wrap my head around the math. Basically my problem is:

1. Every person uses different terminology, which is actually confusing.
2. I saw professor lectures out there where the formula is not the same as the ATE formula from https://matheusfacure.github.io/python-causality-handbook/02-Randomised-Experiments.html (my source for trying to figure it out; I also checked the GitHub issues and still don't get it) and https://clas.ucdenver.edu/marcelo-perraillon/sites/default/files/attached-files/week_3_causal_0.pdf (professor lectures).

I don't get what's going on. This is like a blocker for me before I understand anything further. I am trying to genuinely understand it and apply it in my job, but I can't seem to get the whole estimation part.

I have seen cases where a data scientist would say that causal inference problems are basically predictive modeling problems, in that they think of the DAGs for feature selection and treat the features' importance/contribution as the causal inference estimate of the outcome. Nothing mentioned regarding experimental design or any of the methods like PSM or meta-learners. So from the looks of it, everyone has their own understanding of this, some of which are objectively wrong and others I am not sure about, and I don't know exactly why it's so inconsistent. How can the insight be ethical and properly validated?

Predictive modeling is very well established, but I am struggling to see that level of maturity in the causal inference sphere. I am specifically talking about model fairness and racial bias, as well as things like sensitivity and error analysis. Can someone with experience help clear this up? Maybe I'm overthinking this, but typically there is a level of scrutiny in our work if we're in a regulated field, so how do people actually work with high levels of scrutiny? submitted by /u/JobIsAss
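For a concrete anchor, the formula in the linked handbook chapter is the average treatment effect under a randomized experiment, which reduces to a simple difference in group means; the sketch below uses made-up data with a known effect of 2, just to show the estimator.

```python
# Minimal sketch: with randomized treatment assignment, the ATE estimate is
# the difference in mean outcomes between treated and control groups.
# All data here is synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(0)
treatment = rng.integers(0, 2, size=10_000)             # randomized 0/1 assignment
outcome = 5 + 2 * treatment + rng.normal(size=10_000)   # true effect = 2

ate_hat = outcome[treatment == 1].mean() - outcome[treatment == 0].mean()
print(f"estimated ATE: {ate_hat:.2f}")  # close to 2
```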
Create unlimited AI wallpapers using a single prompt with Stable Diffusion on Google Colab. The wallpaper generator:

1. Can generate both desktop and mobile wallpapers
2. Uses free-tier Google Colab
3. Generates about 100 wallpapers per hour
4. Can generate on any theme
5. Creates a zip for downloading

Check the demo here: https://youtu.be/1i_vciE8Pug?si=NwXMM372pTo7LgIA submitted by /u/mehul_gupta1997
With experimentation being a major focus at a lot of tech companies, there is demand for understanding the causal effect of interventions. Traditional causal inference techniques have been used quite a bit: propensity score matching, diff-in-diff, instrumental variables, etc., but these are generally harder to implement in practice with modern datasets. A lot of the traditional causal inference techniques are grounded in regression, and while regression is great, in modern datasets the functional forms are more complicated than a linear model, or even a linear model with interactions. Failing to capture the true functional form can result in bias in causal effect estimates.

Hence, one would be interested in finding a way to do this accurately with more complicated machine learning algorithms which can capture the complex functional forms in large datasets. This is the exact goal of double/debiased ML: https://economics.mit.edu/sites/default/files/2022-08/2017.01%20Double%20DeBiased.pdf

We consider the average treatment effect problem as a two-step prediction problem. Using very flexible machine learning methods can help identify target parameters with more accuracy. This idea has been extended to biostatistics, where the goal is finding causal effects of drugs; there it is done using targeted maximum likelihood estimation.

My question is: how much adoption has double ML gotten in data science? How often are you guys using it? submitted by /u/AdFew4357
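As a reference point for what the two-step idea looks like in code, here is a minimal "partialling out" sketch with scikit-learn on synthetic data; a real analysis would follow the paper more closely (proper cross-fitting folds, standard errors) or use a dedicated package such as DoubleML or EconML, so treat this as an illustration only.

```python
# Minimal double ML (partialling-out) sketch on synthetic data:
# 1) predict the outcome y and the treatment d from controls X with flexible ML,
# 2) regress the outcome residuals on the treatment residuals.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 5_000
X = rng.normal(size=(n, 5))
d = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(size=n)   # treatment
y = 1.5 * d + np.cos(X[:, 0]) + rng.normal(size=n)              # true effect = 1.5

# Cross-fitted nuisance predictions E[y|X] and E[d|X].
y_hat = cross_val_predict(GradientBoostingRegressor(), X, y, cv=5)
d_hat = cross_val_predict(GradientBoostingRegressor(), X, d, cv=5)

y_res, d_res = y - y_hat, d - d_hat
theta = (d_res @ y_res) / (d_res @ d_res)   # final-stage regression coefficient
print(f"estimated treatment effect: {theta:.2f}")  # close to 1.5
```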
I'm currently managing a churn prediction model (XGBoost) that I retrain every month using a Jupyter notebook. I've gotten by with my own routines: naming conventions (e.g., model_20241001_data_20240901.pkl), logging experiment info in CSV/Excel (e.g., columns used, column crosses, data window, sample method), and storing validation/test metrics in another CSV.

As the project grows, it's becoming increasingly difficult to track different permutations of columns, model types, and data ranges, and to know which version I'm iterating on. I've found options on the internet like MLflow, W&B, ClearML, and other vendor solutions, but as a one-person team, paid solutions seem overkill. I'm struggling to find good discussions or a general consensus on this. How do you all handle this?

Edit: I'm seeing a consensus around MLflow for logging and tracking. But to trigger experiments or run through a model grid with different features/configurations, do I need to combine it with other orchestration tools like Kubeflow / Prefect / Metaflow?

Just adding some more context: my data is currently sitting in GCP BigQuery tables, and I'm training in a Vertex AI JupyterLab instance. I know GCP will recommend Vertex AI Model Registry and Vertex Experiments, but they seem overkill and expensive for my use case. submitted by /u/MobileOk3170
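Since MLflow keeps coming up, here is roughly what a minimal local-tracking version of that monthly retrain loop could look like; the run name, params, and metric below are placeholders standing in for the real training code, not a prescription.

```python
# Minimal MLflow tracking sketch for a monthly retrain; values are placeholders.
import mlflow

with mlflow.start_run(run_name="churn_model_20241001"):
    mlflow.log_params({
        "data_window": "2024-09",                 # placeholder for the real window
        "features": "col_a,col_b,col_a_x_col_b",  # placeholder feature list
        "sample_method": "stratified",
    })
    # model = xgb.train(params, dtrain)           # existing training code goes here
    mlflow.log_metric("val_auc", 0.87)            # placeholder validation metric
    # mlflow.xgboost.log_model(model, "model")    # stores the model artifact with the run
```

By default this writes to a local ./mlruns directory that you can browse with `mlflow ui`, so no server or paid service is required.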