What Are the Best Machine Learning Algorithms for Imbalanced Datasets?

Machine Learning Algorithms and Imbalanced Datasets

In machine learning, imbalanced datasets are those where one class heavily outnumbers the others. This can be due to the nature of the problem or simply because more data is available for one class than the others. Either way, imbalanced datasets can pose a challenge for machine learning algorithms. In this blog post, we’ll take a look at which machine learning algorithms are best suited for imbalanced datasets and why they tend to perform better than others.

For example, in a binary classification problem, if there are 100 observations and only 10 of them are positive (the rest are negative), we say that the dataset is imbalanced: positive cases make up just 10% of the data, a 1:9 positive-to-negative ratio.
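To make this concrete, here is a minimal sketch (assuming scikit-learn and NumPy are available; the dataset is synthetic) that generates a small dataset with roughly 10% positive cases and prints the class counts:

```python
# A minimal sketch (assumes scikit-learn and NumPy are installed).
# It generates a small binary dataset in which roughly 10% of the
# observations are positive, mirroring the example above.
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=100,        # 100 observations
    n_features=5,
    weights=[0.9, 0.1],   # ~90% negative, ~10% positive
    random_state=42,
)

counts = np.bincount(y)
print(f"Negatives: {counts[0]}, Positives: {counts[1]}")
print(f"Positive share: {counts[1] / len(y):.0%}")
```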

There are a few reasons why some machine learning algorithms tend to perform better on imbalanced datasets than others. First, certain algorithms are designed to handle imbalanced datasets. Second, some algorithms are more robust to outliers, which can be more common in imbalanced datasets. And third, some algorithms are better able to learn from a limited amount of data, which can be an issue when one class is heavily outnumbered by the others.

Some of the best machine learning algorithms for imbalanced datasets include:

– Support Vector Machines (SVMs)
– Decision Trees
– Random Forests
– Naive Bayes Classifiers
– k-Nearest Neighbors (kNN)

Of these, SVMs tend to be a popular choice. SVMs work by finding a hyperplane that maximizes the margin between the two classes, which helps reduce overfitting and improve generalization; combined with class weighting, which penalizes mistakes on the minority class more heavily, they often hold up well under imbalance. Decision trees and random forests are also popular choices because they are less sensitive to outliers than algorithms such as linear regression. Naive Bayes classifiers are another good choice because they can learn from a limited amount of data. kNN is also worth considering for the same reason, although it can be computationally intensive for large datasets.
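As a hedged sketch of how these models can be adapted to imbalance in practice (assuming scikit-learn; the dataset below is synthetic), setting class_weight="balanced" tells both the SVM and the random forest to penalize mistakes on the minority class more heavily:

```python
# A minimal sketch (assumes scikit-learn). class_weight="balanced"
# re-weights errors inversely to class frequency, which is one common
# way to adapt SVMs and random forests to imbalanced data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

for model in (
    SVC(class_weight="balanced"),
    RandomForestClassifier(class_weight="balanced", random_state=0),
):
    model.fit(X_train, y_train)
    print(model.__class__.__name__)
    print(classification_report(y_test, model.predict(X_test)))
```

Accuracy alone can look deceptively good on imbalanced data, so the report above focuses on per-class precision and recall.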

There are two main types of machine learning algorithms: supervised and unsupervised. Supervised algorithms tend to perform better on imbalanced datasets than unsupervised algorithms. Below, we discuss why this is so and look at some examples.

Supervised Algorithms
Supervised algorithms are those where the target variable is known. In other words, we have training data where the correct answers are already given. The algorithm then learns from this data and is able to generalize to new data. Some examples of supervised algorithms are regression and classification.

Unsupervised Algorithms
Unsupervised algorithms are those where the target variable is not known. With unsupervised algorithms, we only have input data, without any corresponding output labels. The algorithm has to learn from the data itself without any guidance. Some examples of unsupervised algorithms are clustering and dimensionality reduction.

Why Supervised Algorithms Perform Better on Imbalanced Datasets
Supervised algorithms perform better on imbalanced datasets because they can learn from the labeled training data which cases are more important. With unsupervised algorithms, all data points are treated equally, regardless of whether they belong to the minority or majority class.

For example, in a binary classification problem with an imbalanced dataset, let’s say that we want to predict whether a customer will default on their loan payment or not. We have a training dataset of 1000 customers, out of which only 100 (10%) have defaulted on their loan in the past.

If we use a supervised algorithm like logistic regression, the algorithm will learn from the training data that defaulting on a loan is rare (since only 10% of cases in the training data are positive). This means it will be more likely to predict correctly that a new customer will not default on their loan, since this is the majority class in the training data.
However, if we use an unsupervised algorithm like k-means clustering, all data points will be treated equally since there is no target variable to guide the algorithm. This means that it might incorrectly cluster together customers who have defaulted on their loans with those who haven’t since there is no guidance provided by a target variable.
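The sketch below (assuming scikit-learn; the data is a synthetic stand-in for the loan example) contrasts the two approaches: logistic regression is guided by the labels, while k-means only sees the inputs, so its clusters need not line up with defaulters at all:

```python
# A minimal sketch (assumes scikit-learn). It contrasts supervised
# logistic regression with unsupervised k-means on synthetic data
# resembling the loan example: ~10% of customers default (class 1).
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, adjusted_rand_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=1)

# Supervised: the labels guide the model toward the rare "default" class.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
print("Logistic regression accuracy:", accuracy_score(y, clf.predict(X)))

# Unsupervised: k-means never sees the labels, so its two clusters need
# not correspond to defaulters vs. non-defaulters at all.
clusters = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)
print("k-means agreement with true labels (ARI):",
      round(adjusted_rand_score(y, clusters), 3))
```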

Conclusion:
In conclusion, supervised machine learning algorithms tend to perform better on imbalanced datasets than unsupervised machine learning algorithms because they can learn from the training data which cases are more important. 

Some machine learning algorithms tend to perform better on highly imbalanced datasets, either because they support techniques such as class weighting and resampling or because they can learn effectively from relatively few minority-class examples. If you are working with a highly imbalanced dataset, consider using one of these algorithms.
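One common way to deal with imbalance directly is resampling. The sketch below (assuming the third-party imbalanced-learn package is installed; the data is synthetic) shows SMOTE oversampling the minority class before any classifier is fit:

```python
# A hedged sketch using the third-party imbalanced-learn package
# (pip install imbalanced-learn). SMOTE synthesizes new minority-class
# examples so that the training data becomes balanced.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("Before:", Counter(y))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("After: ", Counter(y_res))  # classes are now balanced
```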

Thanks for reading!

How are machine learning techniques being used to address unstructured data challenges?

Machine learning techniques are being used to address unstructured data challenges in a number of ways:

  1. Natural language processing (NLP): NLP algorithms can be used to extract meaningful information from unstructured text data, such as emails, documents, and social media posts. NLP algorithms can be trained to classify text data, identify key terms and concepts, and extract structured data from unstructured text.
  2. Image recognition: Machine learning algorithms can be used to analyze and classify images, enabling the automatic identification and classification of objects, people, and other elements in images. This can be useful for tasks such as image tagging and search, as well as for applications such as security and surveillance.
  3. Audio and speech recognition: Machine learning algorithms can be used to analyze and classify audio data, enabling the automatic transcription and translation of spoken language. This can be useful for tasks such as speech-to-text transcription, as well as for applications such as call center automation and language translation.
  4. Video analysis: Machine learning algorithms can be used to analyze and classify video data, enabling the automatic detection and classification of objects, people, and other elements in video. This can be useful for tasks such as video tagging and search, as well as for applications such as security and surveillance.

Overall, machine learning techniques are being used in a wide range of applications to extract meaningful information from unstructured data, and to enable the automatic classification and analysis of data in a variety of formats.
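To make the NLP point above concrete, here is a minimal, illustrative sketch (assuming scikit-learn; the tiny corpus and the spam/legitimate labels are invented for demonstration) that turns unstructured text into features and classifies it:

```python
# A minimal, illustrative sketch (assumes scikit-learn). It converts
# unstructured text into TF-IDF features and trains a simple classifier;
# the example corpus and labels are made up for demonstration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Your invoice is attached, payment is due Friday",
    "Win a free prize, click this link now",
    "Meeting moved to 3pm, see updated agenda",
    "Limited time offer, claim your reward today",
]
labels = [0, 1, 0, 1]  # 0 = legitimate email, 1 = spam

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["Claim your free reward now"]))  # likely [1]
```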

How is AI and machine learning impacting application development today?

Artificial intelligence (AI) and machine learning are having a significant impact on application development today in a number of ways:

  1. Enabling new capabilities: AI and machine learning algorithms can be used to enable applications to perform tasks that would be difficult or impossible for humans to do. For example, AI-powered applications can be used to analyze and classify large amounts of data, or to automate complex decision-making processes.
  2. Improving performance: AI and machine learning algorithms can be used to optimize the performance of applications, making them faster, more efficient, and more accurate. For example, machine learning algorithms can be used to improve the accuracy of predictive models, or to optimize the performance of search algorithms.
  3. Streamlining development: AI and machine learning algorithms can be used to automate various aspects of application development, such as testing, debugging, and deployment. This can help to streamline the development process and reduce the time and resources needed to build and maintain applications.
  4. Enhancing user experiences: AI and machine learning algorithms can be used to enhance the user experience of applications by providing personalized recommendations or by enabling applications to anticipate and respond to the needs and preferences of users.

Overall, AI and machine learning are having a significant impact on application development today, and they are likely to continue to shape the way applications are built and used in the future.

How will advancements in artificial intelligence and machine learning shape the future of work and society?

Advancements in artificial intelligence (AI) and machine learning are likely to shape the future of work and society in a number of ways. Some potential impacts include:

  1. Automation: AI and machine learning algorithms can be used to automate tasks that are currently performed by humans, such as data entry, customer service, and manufacturing. This could lead to changes in the types of jobs that are available and the skills that are in demand, as well as to increased productivity and efficiency.
  2. Job displacement: While automation may create new job opportunities, it could also lead to job displacement, particularly for workers in industries that are more susceptible to automation. This could lead to social and economic challenges, including unemployment and income inequality.
  3. Increased efficiency: AI and machine learning algorithms can be used to optimize and streamline business processes, leading to increased efficiency and productivity. This could lead to economic growth and innovation, and could also help to reduce costs for businesses and consumers.
  4. Enhanced decision-making: AI and machine learning algorithms can be used to analyze large amounts of data and make more informed and accurate decisions. This could lead to improved outcomes in fields such as healthcare, finance, and education, and could also help to reduce bias and improve fairness.

Overall, the impact of AI and machine learning on the future of work and society is likely to be significant and complex, with both potential benefits and challenges. It will be important to consider and address these impacts as these technologies continue to advance and become more widely adopted.

  • [D] Master's in AI. Where to go?
    by /u/LastSector3612 (Machine Learning) on April 18, 2025 at 1:24 pm

    Hi everyone, I recently made an admission request for an MSc in Artificial Intelligence at the following universities: Imperial EPFL (the MSc is in CS, but most courses I'd choose would be AI-related, so it'd basically be an AI MSc) UCL University of Edinburgh University of Amsterdam My goal is to be able to work in this field in a top paying European country. I am an Italian student now finishing my bachelor's in CS in my home country in a good, although not top, university (actually there are no top CS unis here). I'm sure I will pursue a Master's and I'm considering these options only. Would you have to do a ranking of these unis, what would it be? Here are some points to take into consideration: I highly value the prestige of the university I also value the quality of teaching and networking/friendship opportunities Don't take into consideration fees and living costs for now Doing an MSc in one year instead of two seems very attractive, but I care a lot about quality and what I will learn Thanks in advance submitted by /u/LastSector3612 [link] [comments]

  • [D] A very nice blog post from Sander Dielman on VAEs and other stuff.
    by /u/Academic_Sleep1118 (Machine Learning) on April 18, 2025 at 11:57 am

    Hi guys! Andrej Karpathy recently retweeted a blog post from Sander Dielman that is mostly about VAEs and latent space modeling. Dielman really does a great job of getting the reader on an intellectual journey, while keeping the math and stuff rigorous. Best of both worlds. Here's the link: https://sander.ai/2025/04/15/latents.html I find that it really, really gets interesting from point 4 on. The passage on the KL divergence term not doing much work in terms of curating the latent space is really interesting, I didn't know about that. Also, his explanations on the difficulty of finding a nice reconstruction loss are fascinating. (Why do I sound like an LLM?). He says that the spectral decay of images doesn't align with the human experience that high frequencies are actually very important for the quality of an image. So, L2 and L1 reconstruction losses tend to overweigh low frequency terms, resulting in blurry reconstructed images. Anyway, just 2 cherry-picked examples from a great (and quite long blog post) that has much more into it. submitted by /u/Academic_Sleep1118 [link] [comments]

  • arXiv moving from Cornell servers to Google Cloud
    by /u/sh_tomer (Machine Learning) on April 18, 2025 at 11:31 am

    submitted by /u/sh_tomer [link] [comments]

  • [N] Semantic Memory Layer for LLMs – from long-form GPT interaction
    by /u/lazylazylazyl (Machine Learning) on April 18, 2025 at 9:57 am

    Hi everyone, I’ve spent the past few months interacting with GPT-4 in extended, structured, multi-layered conversations. One limitation became increasingly clear: LLMs are great at maintaining local coherence, but they don’t preserve semantic continuity - the deeper, persistent relevance of ideas across sessions. So a concept started to emerge - the Semantic Memory Layer. The core idea: LLMs could extract semantic nodes - meaning clusters from high-attention passages, weighted by recurrence, emphasis, and user intent. These would form a lightweight conceptual map over time - not a full memory log, but a layer for symbolic relevance and reentry into meaning, not just tokens. This map could live between attention output and decoding - a mechanism for continuity of meaning, rather than short-term prompt recall. This is not a formal proposal or paper — more a structured idea from someone who’s spent a lot of time inside the model’s rhythm. If this connects with ongoing research, I’d be happy to know. Thanks. submitted by /u/lazylazylazyl [link] [comments]

  • Memorization vs Reasoning [D]
    by /u/Over_Profession7864 (Machine Learning) on April 18, 2025 at 7:35 am

    Are questions like in 'what if' book, which people rarely bother to ask, way to test whether large language models truly reason, rather than simply remixing patterns and content they see from their training data? submitted by /u/Over_Profession7864 [link] [comments]

  • [P] Gym retro issues
    by /u/dbejar19 (Machine Learning) on April 18, 2025 at 7:25 am

    Hey guys, I’ve been having some issues with Gym Retro. I have installed Gym Retro in PyCharm and have successfully imported Donkey Kong Country into it. From my understanding, Donkey Kong already has a pre-configured environment for Gym Retro to start from, but I don't know how to run the program. Does anyone have a solution? submitted by /u/dbejar19 [link] [comments]

  • [D]Seeking Ideas: How to Build a Highly Accurate OCR for Short Alphanumeric Codes?
    by /u/ThickDoctor007 (Machine Learning) on April 18, 2025 at 6:54 am

    I’m working on a task that involves reading 9-character alphanumeric codes from small paper snippets — similar to voucher codes or printed serials (example images below) - there are two cases - training to detect only solid codes and both, solid and dotted. The biggest challenge is accuracy — we need near-perfect results. Models often confuse I vs 1 or O vs 0, and even a single misread character makes the entire code invalid. For instance, Amazon Textract reached 93% accuracy in our tests — decent, but still not reliable enough. What I’ve tried so far: Florence 2: Only about 65% of codes were read correctly. Frequent confusion between I/1, O/0, and other character-level mistakes. TrOCR (fine-tuned on ~300 images): Didn’t yield great results — likely due to training limitations or architectural mismatch for short strings. SmolDocling: Lightweight, but too inaccurate for this task. LLama3.2-vision: Performs okay but lacks consistency at the character level. Best results (so far): Custom-trained YOLO Approach: Train YOLO to detect each character in the code as a separate object. After detection, sort bounding boxes by x-coordinate and concatenate predictions to reconstruct the string. This setup works better than expected. It’s fast, adaptable to different fonts and distortions, and more reliable than the other models I tested. That said, edge cases remain — especially misclassifications of visually similar characters. At this stage, I’m leaning toward a more specialized solution — something between classical OCR and object detection, optimized for short structured text like codes or price tags. I'm curious: Any suggestions for OCR models specifically optimized for short alphanumeric strings? Would a hybrid architecture (e.g. YOLO + sequence model) help resolve edge cases? Are there any post-processing techniques that helped you correct ambiguous characters? Roughly how many images would be needed to train a custom model (from scratch or fine-tuned) to reach near-perfect accuracy in this kind of task Currently, I have around 300 examples — not enough, it seems. What’s a good target? Thanks in advance! Looking forward to learning from your experiences. Solid Code example Dotted Code example submitted by /u/ThickDoctor007 [link] [comments]

  • [D]Need advice regarding sentence embedding
    by /u/Imaginary_Event_850 (Machine Learning) on April 18, 2025 at 4:03 am

    Hi I am actually working on a mini project where I have extracted posts from Stack Overflow related to “nlp” tags. I am extracting 4 columns namely title, description, tags and accepted answers(if available). Now I basically want the posts to be categorised using unsupervised learning as I don’t want the posts to be categorised based on the given set of static labels. I have heard about BERT and SBERT models can do sentence embeddings but have a very little knowledge about it? Does anyone know how this task would be achieved? I have also gone through something called word embeddings where I would get posts categorised with labels like “package installation “ or “implementation issue” but can there be sentence level categorisation as well ? submitted by /u/Imaginary_Event_850 [link] [comments]

  • Time Series forecasting [P]
    by /u/zaynst (Machine Learning) on April 18, 2025 at 3:11 am

    Hey, i am working on time series forecasting for the first time . Some information about my data : 30 days data 43200 rows It has two features i.e timestamp and http_requests Time interval is 1 minute I trained LSTM model,followed all the data preprocessing process , but the results are not good and also when i used model for forecasting What would be the reason ? Also how much window size and forecasting step should i take . Any help would be appreciated Thnks submitted by /u/zaynst [link] [comments]

  • [D] A new DINO Training Framework
    by /u/Training-Week6779 (Machine Learning) on April 18, 2025 at 1:34 am

    Hello everyone, I'm a PhD student in computer science. One of my PhD projects is about DINO (Distillation with No Label) models. Considering the problems we've encountered in this field, we've developed a new framework. The framework allows you to train both DINOv1 and DINOv2 models. Additionally, trained models are fully compatible with Hugging Face. You can also distill a model from Hugging Face into a smaller model. You can perform all these training processes using either DDP or FSDP for distributed training. If you want, you can fine-tune a model trained with DINOv1 using DINOv2 training code (FSDP or DDP), or vice versa. Furthermore, you can submit all these models to Hugging Face or present a new approach using specially defined augmentation techniques for medical images. We'll also have a GUI design for those who don't fully understand AI training. We're planning to train giant models using this framework. My question is, how useful would such a framework be after graduation, or would it help me find a job? How much interest would it generate or would it provide any reputation? I can't follow the industry due to constant work, and honestly, I have no idea what's happening in the sector. Thank you. submitted by /u/Training-Week6779 [link] [comments]

  • [D] Val loss not drop, in different lr ,loss always around 0.8.
    by /u/OkLight9431 (Machine Learning) on April 18, 2025 at 1:06 am

    I'm training a model based on the original Tango codebase, which combines a VAE with a UNet diffusion model. The original model used single-channel Mel spectrograms, but my data consists of dual-channel Mel spectrograms, so I retrained the VAE. The VAE achieves a validation reconstruction loss of 0.05, which is a great result. I then used this VAE to retrain the UNet. The latent shape is [16, 256, 16]. I modified the channel configuration based on Tango's original model config and experimented with learning rates of 1e-4, 6e-5, 1e-5, 3e-5, 1e-6, and 6e-6. I'm using the AdamW optimizer with either Warmup or linear decay schedulers. However, the validation loss for the UNet stays around 0.8 and doesn't decrease. How can I address this issue, and what steps should I take to troubleshoot it? { "_class_name": "UNet2DConditionModel", "_diffusers_version": "0.10.0.dev0", "act_fn": "silu", "attention_head_dim": [ 5, 10, 20, 20 ], "block_out_channels": [ 320, 640, 1280, 1280 ], "center_input_sample": false, "cross_attention_dim": 1024, "down_block_fusion_channels": [ 320, 640, 1280, 1280 ], "down_block_types": [ "CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "DownBlock2D" ], "downsample_padding": 1, "dual_cross_attention": false, "flip_sin_to_cos": true, "freq_shift": 0, "in_channels": 8, "layers_per_block": 2, "mid_block_scale_factor": 1, "norm_eps": 1e-05, "norm_num_groups": 32, "num_class_embeds": null, "only_cross_attention": false, "out_channels": 8, "sample_size": [32, 2], "up_block_fusion_channels": [ ], "up_block_types": [ "UpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D" ], "use_linear_projection": true, "upcast_attention": true } Above is the Tango model config { "dropout":0.3, "_class_name": "UNet2DConditionModel", "_diffusers_version": "0.10.0.dev0", "act_fn": "silu", "attention_head_dim": [8, 16, 32, 32], "center_input_sample": false, "cross_attention_dim": 1024, "down_block_types": [ "CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "DownBlock2D" ], "downsample_padding": 1, "dual_cross_attention": false, "flip_sin_to_cos": true, "freq_shift": 0, "in_channels": 16, "layers_per_block": 3, "mid_block_scale_factor": 1, "norm_eps": 1e-05, "norm_num_groups": 16, "num_class_embeds": null, "only_cross_attention": false, "out_channels": 16, "sample_size": [256, 16], "up_block_types": [ "UpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D" ], "use_linear_projection": false, "upcast_attention": true } This is my model config: submitted by /u/OkLight9431 [link] [comments]

  • [D] Sharing dataset splits: What are the standard practices (if any)?
    by /u/LetsTacoooo (Machine Learning) on April 17, 2025 at 11:26 pm

    Wanted to get other people's takes. A common observation: papers often generate their own train/val/test splits, usually random. But the exact split isn't always shared. For smaller datasets, this matters. Different splits can lead to different performance numbers, making it hard to truly compare models or verify SOTA claims across papers – you might be evaluating on a different test set. We have standard splits for big benchmarks (MNIST, CIFAR, ImageNet, any LLM evals), but for many other datasets, it's less defined. I guess my questions are: When a dataset lacks a standard split, what's your default approach? (e.g., generate new random, save & share exact indices/files, use k-fold?) Have you seen or used any good examples of people successfully sharing their specific dataset splits (maybe linked in code repos, data platforms, etc.)? Are there specific domain-specific norms or more standardized ways of handling splits that are common practice in certain fields? Given the impact splits can have, particularly on smaller data, how critical do you feel it is to standardize or at least share them for reproducibility and SOTA claims? (Sometimes I feel like I'm overthinking how uncommon this seems for many datasets!) What are the main practical challenges in making shared/standardized splits more widespread? TLDR: Splits are super important for measuring performance (and progress), what are some standard practices? submitted by /u/LetsTacoooo [link] [comments]

  • [N] We just made scikit-learn, UMAP, and HDBSCAN run on GPUs with zero code changes! 🚀
    by /u/celerimo (Machine Learning) on April 17, 2025 at 9:02 pm

    Hi! I'm a lead software engineer on the cuML team at NVIDIA (csadorf on github). After months of hard work, we're excited to share our new accelerator mode that was recently announced at GTC. This mode allows you to run native scikit-learn code (or umap-learn or hdbscan) directly with zero code changes. We call it cuML zero code change, and it works with both Python scripts and Jupyter notebooks (you can try it directly on Colab). This follows the same zero-code-change approach we've been using with cudf.pandas to accelerate pandas operations. Just like with pandas, you can keep using your familiar APIs while getting GPU acceleration behind the scenes. This is a beta release, so there are still some rough edges to smooth out, but we expect most common use cases to work and show significant acceleration compared to running on CPU. We'll roll out further improvements with each release in the coming months. The accelerator mode automatically attempts to replace compatible estimators with their GPU equivalents. If something isn't supported yet, it gracefully falls back to the CPU variant - no harm done! 🙂 We've enabled CUDA Unified Memory (UVM) by default. This means you generally don't need to worry about whether your dataset fits entirely in GPU memory. However, working with datasets that significantly exceed available memory will slow down performance due to excessive paging. Here's a quick example of how it works. Let’s assume we have a simple training workflow like this: # train_rfc.py #%load_ext cuml.accel # Uncomment this if you're running in a Jupyter notebook from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split from sklearn.ensemble import RandomForestClassifier # Generate a large dataset X, y = make_classification(n_samples=500000, n_features=100, random_state=0) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0) # Set n_jobs=-1 to take full advantage of CPU parallelism in native scikit-learn. # This parameter is ignored when running with cuml.accel since the code already # runs in parallel on the GPU! rf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1) rf.fit(X_train, y_train) You can run this code in three ways: On CPU directly: python train_rfc.py With GPU acceleration: python -m cuml.accel train_rfc.py In Jupyter notebooks: Add %load_ext cuml.accel at the top Here are some results from our benchmarking: Random Forest: ~25x faster Linear Regression: ~52x faster t-SNE: ~50x faster UMAP: ~60x faster HDBSCAN: ~175x faster Performance will depend on dataset size and characteristics, so your mileage may vary. As a rule of thumb: the larger the dataset, the more speedup you can expect, since moving data to and from the GPU also takes some time. We're actively working on improvements and adding more algorithms. Our top priority is ensuring code always falls back gracefully (there are still some cases where this isn't perfect). Check out the docs or our blog post to learn more. I'm also happy to answer any questions here. I'd love to hear about your experiences! Feel free to share if you've observed speedups in your projects, but I'm also interested in hearing about what didn't work well. Your feedback will help us immensely in prioritizing future work. submitted by /u/celerimo [link] [comments]

  • [R] Algorithm for rotation images in 3D
    by /u/Gauwal (Machine Learning) on April 17, 2025 at 8:49 pm

    Note: It's only tangentially related, but I feel like this community might still be of help Hi ! I'm looking for a specific algorithm (or at the very list something similar to what has been used) in the game "smack studio". It's a an algo used to rotate a bunch of 2D images in 3D space (so it looks like 3D in the end) . I think adobe uses something similar to rotate vector images, but this one seems AI driven and I'm interested in something that I can learn from. I'm a computer science master student and I want to learn more about it and hopefully make it better (it's tangentially linked to my master thesis, so I hope to improve it along the way) But it's mostly just that It looks cool too me I'd be glad if any of you has any kind of idea to point me in a better research direction than aiming in the dark Thanks for your help ! PS: Even straight black box AI can be useful if you have anything please share !!! submitted by /u/Gauwal [link] [comments]

  • [P] I made 'Talk‑to‑Your‑Slides'.
    by /u/Big_Occasion_182 (Machine Learning) on April 17, 2025 at 5:43 pm

    Just finished working on an exciting new tool that lets you edit PowerPoint presentations using simple instructions! Talk-to-Your-Slides transforms how you work with presentations. Just type commands like "Find and fix all typos" or "Make the title fonts consistent across slides" and watch as your slides get updated automatically. Key Features: Natural language editing commands Instant slide updates Works with existing PowerPoint files Powered by an LLM agent Demo Available Now! Check out our working demo at: https://github.com/KyuDan1/Talk-to-Your-Slides We built this using Gradio for the interface. Our team will be releasing the research paper, evaluation dataset, and full source code in the coming weeks. If you find this useful, please like and share the post to help spread the word! Your support means a lot to our team. https://www.linkedin.com/posts/kyudanjung_powerpoint-llm-agent-activity-7318688635321491456-E42j?utm_source=share&utm_medium=member_desktop&rcm=ACoAAEb15SsBoLMoaQreihIlDmJGlX6urPN1ZBQ submitted by /u/Big_Occasion_182 [link] [comments]

  • [D] Question and distractor generation using T5 Evaluation
    by /u/Sweaty_Importance_83 (Machine Learning) on April 17, 2025 at 5:05 pm

    Hello everyone! I'm currently finetuning araT5 model (finetuned version of T5 model on Arabic language) and I'm using it for question and distractor generation (each finetuned on their own) and I'm currently struggling with how I should assess model performance and how to use evaluation techniques, since the generated questions and distractors are totally random and are not necessarily similar to reference questions/distractors in the original dataset submitted by /u/Sweaty_Importance_83 [link] [comments]

  • [R] Experiment Report: OpenAI GPT 4.1-mini is a really cost-effective model
    by /u/DueKitchen3102 (Machine Learning) on April 17, 2025 at 4:56 pm

    OpenAI new models: how do GPT 4.1 models compare to 4o models? GPT4.1-mini appears to be the best cost-effective model! To ease our curiosity, we conduct a set of RAG experiments. The public dataset is a collection of messages (hence it might be particularly interesting to cell phone and/or PC manufacturers) . Supposedly, it should also be a good dataset for testing knowledge graph (KG) RAG (or Graph RAG) algorithms. As shown in the Table, the RAG results on this dataset appears to support the claim that GPT4.1-mini is the best cost-effective model overall. The RAG platform hosted by VecML allows users to choose the number of tokens retrieved by RAG. Because OpenAI charges users by the number of tokens, it is always good to use fewer tokens if the accuracy is not affected. For example, using 500 tokens reduces the cost to merely 1/10 of the cost w/ using 5000 tokens. This dataset is really challenging for RAG and using more tokens help improve the accuracy. On other datasets we have experimented with, often RAG w/ 1600 tokens performs as well as RAG w/ 10000 tokens. In our experience, using 1,600 tokens might be suitable for flagship android phones (8gen4) . Using 500 tokens might be still suitable for older phones and often still achieves reasonable accuracy. We would like to test on more RAG datasets, with a clear document collection, query set, and golden (or reference) answers. Please send us the information if you happen to know some relevant datasets. Thank you very much. https://preview.redd.it/r11noh65efve1.png?width=1585&format=png&auto=webp&s=f66d776489b6d8b1d0f5a6c1b524a0d4dc2ad18a submitted by /u/DueKitchen3102 [link] [comments]

  • [Discussion] Evaluating multiple feature sets/models—am I leaking by selecting the best of top 5 on the test set?
    by /u/TooMuchForMyself (Machine Learning) on April 17, 2025 at 1:45 pm

    Hi all, I’m working on a machine learning project where I’m evaluating two different outcomes (binary classification tasks). The setup is as follows: • 12 different feature sets • Each feature set has 6 time window variations • 6 different models • 10-fold CV is used to select models based on the highest F0.5 score So for one outcome, that’s: 12 feature sets × 6 time windows × 6 models = 432 configurations Each of these is run with 10-fold cross-validation on the training set for tuning. My process so far: 1. For each outcome, I select the top 5 configurations (based on mean F0.5 in CV). 2. Then I train those 5 models on the entire training set, and evaluate them on the held-out test set. 3. The idea is to eventually use the best performing configuration in real-world deployment. My question: If I evaluate the top 5 on the test set and then choose the best of those 5 to deploy, am I effectively leaking information or overfitting to the test set? Should I instead: • Only evaluate the best 1 (from CV) on the test set to avoid cherry-picking? • Or is it acceptable to test multiple pre-selected models and choose the best among them, as long as I don’t further tweak them afterward? Some context: In previous experiments, the best CV model didn’t always perform best on the test set—but I had to fix some issues in the code, so the new results may differ. My original plan was to carry the top 5 forward from each outcome, but now I’m wondering if that opens the door to test set bias. submitted by /u/TooMuchForMyself [link] [comments]

  • [D] Difference between ACL main, ACL Findings, and NeurIPS?
    by /u/007noob0071 (Machine Learning) on April 17, 2025 at 11:47 am

    Hey everyone, I'm new to the NLP community and noticed that papers not accepted into the main ACL conference can sometimes be published in "ACL Findings." Could someone clarify: How does ACL Findings compare to ACL main conference papers? How does publishing in ACL/ACL Findings compare to NeurIPS (main conference or workshops) in terms of prestige, visibility, or career impact? Thanks! submitted by /u/007noob0071 [link] [comments]
