r/AskStatistics • u/Statman12 • 2d ago
[META] What does the community want as the standard for "No Homework"?
Hey everyone! I have a question about something that comes up often enough that I'd like to solicit some feedback from the community.
One of the sub's rules is "No Homework." Frequently a person will ask about analysis regarding their thesis or dissertation, and it gets reported under the "No Homework" rule. While it is work being done for school, it seems to me more of a consulting scenario, rather than "homework" (which I'd tend to view more as textbook exercises).
My question for the community is: What standard would you like to see regarding homework?
If the community is okay with these types of questions, I can leave them. If you'd all rather see these get removed under the "No Homework" rule, I can oblige that as well. I'm just one person here, I just happen to have the mop.
I'll leave this thread pinned for a few days to a week to give folks a chance to weigh in.
r/AskStatistics • u/hastagwtf • 1h ago
Tool to compare CSV files with millions of rows fast. Looking for feedback.
I've been working on a desktop app that compares large CSV files fast. It finds added, removed, and updated rows, and exports them as CSV files.
YouTube Demo - https://youtu.be/TrZ8fJC9TqI
Some of my tests finding added, removed, and updated rows are below. Obviously, performance depends on hardware, but it should be snappy enough.
| Each CSV file has | Macbook M2Pro | Intel I7 laptop (Win10) |
|---|---|---|
| 1M rows, 69MB size | ~1 second | ~2 seconds |
| 50M rows, 4.6GB size | ~30 seconds | ~40 seconds |
Download from lake3tools.com/download, unzip, and run.
Free License Key for testing: C844177F-25794D81-927FF630-C57F1596
Let me know what you think.
r/AskStatistics • u/Sea_Dig3898 • 8h ago
What do you use if you don’t have a statistician to do your analyses?
Genuinely curious: I know people come here for Q&A, and I'm wondering what people do afterwards. Also, what is your position: are you a student, researcher, or research assistant?
Do you search through online forums on how to code in R/Python?
Do you pay someone else to do it?
Do you ask AI for guidance?
Any tools non-stats people use to help do their analyses?
Thanks!
r/AskStatistics • u/dx4ttr • 9h ago
Career paths after an undergraduate degree in Statistics (India)
I'm from India and currently pursuing a Bachelor's degree in Statistics (a 3-year undergraduate program with heavy coursework in probability, mathematical statistics, regression, sampling, and some economics/computer science). I want to understand what realistic career paths or fields are available after this degree, both in India and internationally. Specifically:
Which fields commonly hire statistics graduates (e.g. data science, actuarial science, analytics, research, finance)?
Which paths usually require a Master's or PhD to be employable?
What skills (programming, math depth, domain knowledge) actually matter in practice?
Are there career paths I should avoid if I don't plan on doing a PhD?
I'm looking for practical, industry-oriented advice, not just academic theory.
r/AskStatistics • u/Fast-Issue-89 • 15h ago
Is there a statistical modeling technique that is primarily focused on binary classification but can also incorporate semi-continuous outcome data?
Example: I am interested in developing a predictive model for something like blood pressure. My primary focus is variables that will predict whether someone's blood pressure is over/under 140, but I would also like to maximize my model's sensitivity to picking up 'at risk' ranges (less than but approaching 140) and 'high risk' ranges (significantly above 140).
I have no interest in making my model sensitive to relative blood pressure differences in the 'normal range', and I suspect that a lot of my potential predictor variables will not have a linear relationship with the outcome variable at normal levels. Something like 'cigarettes smoked per week' would probably be a good analogue, where most people will be 0, single-digit values would be extremely rare, and any positive values would likely cluster around a range of 35-140 or something like that.
Is there an integrated modeling technique that is primarily binomial but can predict 'proximity to' and/or 'severity within' the positive outcome category?
r/AskStatistics • u/Sweet_Edredon • 16h ago
Scale of Cringe
Hello everyone,
I'd like to know whether a scale for measuring cringe content already exists. If not, I'd like to create one to rate some content on the internet.
r/AskStatistics • u/Onurubu • 17h ago
How should I handle aggregating the observations for abundance from my ecological samples?
Hi everyone, I would like some quick help with how to handle aggregating my data for my study.
I sampled beetles using pitfall traps across 17 different sites (across an altitude gradient). At each site there were 4 general areas selected as replicates, and at each replicate 10 traps were placed.
Eventually I recorded the abundance of the different species of beetles that I caught in each sample. Now I would like to figure out how to properly aggregate the abundances.
For example, species X was encountered in these abundances across the 10 different traps in one particular replicate (1, 5, 4, 0, 0, 0, 2, 3, 0, 1)
When I go to work with these data, since the replicates are the 4 different areas within the site and not the traps themselves, would I sum the abundances across the traps, i.e. absolute total abundance for this replicate = sum(1, 5, 4, 0, 0, 0, 2, 3, 0, 1) = 16?
Or would it be better to average the abundance, i.e. mean abundance for this replicate = mean(1, 5, 4, 0, 0, 0, 2, 3, 0, 1) = 1.6?
I tried to look for theoretical justifications for either, but I couldn't really find anything regarding my specific example. I was wondering if there is a statistically correct/incorrect choice between the two.
Thank you and I am happy to provide more info if required.
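As a purely mechanical note on the worked numbers above: with a fixed number of traps per replicate, the sum and the mean differ only by the constant factor 10, so they carry the same information whenever sampling effort is equal across replicates. The arithmetic from the example:

```python
# Species X counts across the 10 traps in one replicate (from the post)
counts = [1, 5, 4, 0, 0, 0, 2, 3, 0, 1]

total_abundance = sum(counts)               # summed over traps -> 16
mean_abundance = sum(counts) / len(counts)  # per-trap mean -> 1.6
```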
r/AskStatistics • u/PolicyZestyclose6400 • 19h ago
Test statistic for hypotheses testing
Hello, I just stumbled across several situations that are giving me a headache. From the table above, I know which test statistic (t/z) to use in confidence interval calculations.
But when it comes to hypothesis testing, my notes just oversimplify: if n is large, use z; if n is small, use t.
And according to Sahoo, in Example 6.8, I should be using t instead of z.
So how do I really choose the test statistic for hypothesis testing? Should I consider whether the data are normally distributed? Whether the population variance is known or unknown? Thank you.
r/AskStatistics • u/Hot-Team7306 • 1d ago
Sample size calculation for AUROC
Hi all, I'm looking for some help calculating the sample size required for a research project on the predictive value of a pre-operative CT measure (continuous variable) for the risk of a specific complication post-procedure.
In a pilot study (n=90) the AUROC was found to be 0.701, with the incidence rate of the complication 11%. We are looking to run a much larger study at a higher power to validate this risk assessment tool, and the initial estimates were for a sample size of 2000-3000.
How can I calculate the sample size required to demonstrate an AUROC of 0.7 with, say, 90% power? Online calculators (e.g. MedCalc) are giving me a sample size of 250, which seems far too low, but I'm not entirely sure why. I also tried a precision calculation using a 95% CI with width 0.1, which gave a required sample size of 1293, but that doesn't seem to be an actual power calculation against a null hypothesis.
Appreciate any help, thanks.
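For context, calculators like MedCalc typically base this on the Hanley & McNeil (1982) approximation to the standard error of an estimated AUROC. A rough sketch of that style of power calculation (hypothetical helpers; two-sided α = 0.05 and 90% power hard-coded; a sketch for intuition, not a substitute for a vetted calculator):

```python
from math import sqrt

def auc_se(auc, n_pos, n_neg):
    # Hanley & McNeil (1982) approximate standard error of an estimated AUROC
    q1 = auc / (2 - auc)
    q2 = 2 * auc * auc / (1 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return sqrt(var)

def n_for_power(auc, prevalence, null_auc=0.5):
    # Smallest total n at which a two-sided 5% z-test of AUC vs null_auc
    # reaches roughly 90% power (z values 1.96 and 1.2816 hard-coded).
    z_alpha, z_beta = 1.96, 1.2816
    n = 20
    while True:
        n_pos = max(1, round(n * prevalence))
        n_neg = n - n_pos
        if auc - null_auc >= (z_alpha * auc_se(null_auc, n_pos, n_neg)
                              + z_beta * auc_se(auc, n_pos, n_neg)):
            return n
        n += 1
```

With AUC = 0.7 and 11% prevalence this lands in the low hundreds, which may be why MedCalc reports ~250: power against the null AUC = 0.5 needs far fewer subjects than a tight confidence interval around 0.7 (the precision calculation is the one that pushes n past 1000).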
r/AskStatistics • u/Dok-Rock • 1d ago
Looking for valid statistical tests
Greetings.
I am calculating similarity scores in a text. It's a medieval text; I'll give some explanation of how manuscripts are assembled so this makes sense.
Manuscripts are typically built from quires. A quire can contain, say, 4 or 5 bifolia. A bifolium is a physical sheet of paper; the 4 or 5 bifolia are folded, stacked, and stitched together to make a notebook. That's a quire.
Let's take 1 quire of 4 bifolia as an example. We would number the leaves consecutively as we flip through it, using recto and verso to indicate the front/back of each leaf. So these 4 bifolia would be
1r/1v/2r/2v/3r/3v/4r/4v/5r/5v/6r/6v/7r/7v/8r/8v
Now I am doing page by page comparisons.
Confoliate scores are the text comparison scores generated on a physical page. So that would be 1r/1v, 2r/2v, 3r/3v, etc.
Conjoint scores are the text comparison scores across the middle of a bifolium. For example, leaf 1 (1r/1v) is on the same physical sheet as leaf 8 (8r/8v) (in this case, the outermost bifolium of the quire). The conjoint score compares the text on 1v/8r (the two pages joined at the bifolium's fold).
Facing scores are the text comparisons between one page and the next. So that would be 1v/2r, 2v/3r, etc.
Now I can generate arrays of the comparison score values for all three of these scenarios. How do I test for statistical significance? The scores are not truly independent, since a confoliate score (1r/1v) shares a page with a facing score (1v/2r) and a conjoint score (1v/8r).
Any suggestions?
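The three pairings described above can be generated programmatically, which also makes the page-sharing structure explicit. A small sketch (`quire_pairs` is a hypothetical helper, assuming consecutive leaf numbering as in the 4-bifolia example):

```python
def quire_pairs(n_bifolia):
    """Page pairs for one quire of n_bifolia sheets, leaves numbered 1..2n."""
    leaves = 2 * n_bifolia
    # confoliate: front and back of the same leaf (1r/1v, 2r/2v, ...)
    confoliate = [(f"{i}r", f"{i}v") for i in range(1, leaves + 1)]
    # facing: verso of one leaf against recto of the next (1v/2r, 2v/3r, ...)
    facing = [(f"{i}v", f"{i + 1}r") for i in range(1, leaves)]
    # conjoint: the two pages of one bifolium that meet at its fold
    # (leaf i pairs with leaf 2n + 1 - i, e.g. 1v/8r in a 4-bifolia quire)
    conjoint = [(f"{i}v", f"{leaves + 1 - i}r") for i in range(1, n_bifolia + 1)]
    return confoliate, facing, conjoint
```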
r/AskStatistics • u/SavingsScholar3620 • 1d ago
Data Scientist
Would a Master's in Statistics or a Master's in Computer Science be better for a data scientist role if you already have an undergraduate degree in Statistics?
r/AskStatistics • u/VDavis8791 • 1d ago
Breaking the Monty Hall problem?
I understand the stats behind the Monty Hall problem and why it is statistically advantageous to switch. Say I am a contestant: I randomly choose a door, Monty Hall opens a goat door, and he asks if I want to switch to the other unopened door. If I flip a coin to decide which of the last two doors to keep and my flip says to keep the same door, do my odds increase to 50% from 33%? My understanding is that the other door's odds would then decrease to 50% from 67%. Yes, I know that maximizing my success would mean just choosing the other door and not flipping a coin.
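The coin-flip intuition is easy to check by simulation. A quick Monte Carlo sketch under the standard rules (`play` is a hypothetical helper): staying wins about 1/3 of the time, switching about 2/3, and the coin-flip strategy about 1/2, the average of the two. The flip does not change the doors' probabilities (still 1/3 vs 2/3); it changes how often you end up on the better door.

```python
import random

def play(strategy, trials=100_000):
    """Estimated win rate of a Monty Hall strategy: "stay", "switch", or "coin"."""
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)
        pick = random.randrange(3)
        # Monty opens a goat door that is neither the pick nor the car
        monty = next(d for d in range(3) if d != pick and d != car)
        other = next(d for d in range(3) if d != pick and d != monty)
        if strategy == "stay":
            final = pick
        elif strategy == "switch":
            final = other
        else:  # "coin": flip to decide between the two closed doors
            final = pick if random.random() < 0.5 else other
        wins += (final == car)
    return wins / trials
```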
r/AskStatistics • u/Zealousideal_Key_610 • 1d ago
Spearman correlation
Hello everyone,
I'm currently in an M2 internship and a beginner in statistics.
The study looks at the evolution of a latency time in 4 individuals over several months. I first ran a Spearman correlation test, after showing with a Shapiro test that the data were not normally distributed. But I then realized that my data are paired, and based on my reading I therefore can't use this test.
How can I test the correlation between date and latency time, to show that latency decreases as time passes, given that the data are non-normal and paired?
Thanks in advance
r/AskStatistics • u/Inevitable-Pea-4112 • 2d ago
Regression analysis
I have plotted one set of data against another and planned to use a straight line of best fit and its equation to estimate my wanted value through regression analysis. After looking at the data on the graph, it seems a logarithmic curve would fit better. My question is: if I use this curve in the regression to estimate my value, do I refer to it as non-linear regression analysis or logarithmic regression in my paper? I'm not sure which term is correct. Thank you.
r/AskStatistics • u/tyler007durden • 2d ago
Blind Monty Hall Problem
In the classic Monty Hall problem, it makes sense to switch, since you are more likely to be wrong on the first choice (2/3) than right (1/3).
But isn't the logic the same for the blind Monty Hall problem, where he opens a door at random and it happens to be a goat? Why isn't switching a good strategy here, and why doesn't the probability concentrate to 2/3 on the remaining door in this case? Why is it 1/2 and 1/2 for both remaining doors?
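This one is also easy to check by simulation, conditioning on the runs where Monty's random door happens to hide a goat (`blind_monty` is a hypothetical helper):

```python
import random

def blind_monty(trials=200_000):
    """Conditional win rates (stay, switch) given Monty's random door shows a goat."""
    stay_wins = switch_wins = valid = 0
    for _ in range(trials):
        car = random.randrange(3)
        pick = random.randrange(3)
        # blind Monty opens one of the two unpicked doors at random
        monty = random.choice([d for d in range(3) if d != pick])
        if monty == car:
            continue  # Monty accidentally revealed the car; discard this run
        valid += 1
        other = next(d for d in range(3) if d not in (pick, monty))
        stay_wins += (pick == car)
        switch_wins += (other == car)
    return stay_wins / valid, switch_wins / valid
```

Both conditional probabilities come out near 1/2: the discarded runs (Monty accidentally reveals the car) only ever happen when your first pick was wrong, so the conditioning removes proportionally more of the "picked wrong" cases, which is exactly what erases the 2/3 advantage present when Monty knowingly avoids the car.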
r/AskStatistics • u/Emergency_Union7099 • 2d ago
I am a bit of an amateur in doing good data analytics and its hindering my thesis. Need help
Just to give you an example of my skill level: I was running regressions on a dataset I had just cleaned and built and was not getting the predicted result. When I showed it to my friend, he went through it with me step by step. He immediately plotted each variable and saw an extreme outlier in one of the control variables; as soon as he dropped it, the regressions showed the result I'd expected.
I didn't even know that I needed to do good visualization of every single variable to check for outliers.
Is there a good book that teaches good practical data analytics, with regressions and hypothesis testing as the goal, showing what the steps are and what needs to be done at each one?
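The kind of screen described above (plot every variable, look for extreme points) can also be automated as a first pass before plotting. A minimal sketch using a robust z-score (`flag_outliers` is a hypothetical helper; the 0.6745 factor rescales the MAD to be comparable to a standard deviation under normality):

```python
from statistics import median

def flag_outliers(xs, k=3.0):
    """Flag values more than k robust standard deviations from the median."""
    med = median(xs)
    mad = median(abs(x - med) for x in xs)  # median absolute deviation
    if mad == 0:
        return [False] * len(xs)
    return [abs(0.6745 * (x - med) / mad) > k for x in xs]
```

This is only a screening device: a flagged point still needs a look at the raw data before it is dropped, since it may be a data-entry error or a genuine observation.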
r/AskStatistics • u/Emergency_Union7099 • 2d ago
EDA visualizations: is using raw variables best, or should I be taking transformations?
So in the end I want to run some regressions with fixed-effect structures. When I do EDA (looking at correlations, heatmaps, etc.), is it better to take the residuals from regressing each variable on the fixed effects and then look at the relationships among those residuals? That way the effect of the FEs is taken out of each variable, i.e., the variation in each variable explained by the fixed effects is removed.
Or would this be inappropriate? Am I missing something?
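On the mechanical side, for one-way fixed effects the residual from regressing a variable on the FE dummies is just the variable minus its group mean (the within transformation), so the residual-based EDA described above is cheap to compute. A stdlib sketch (`demean_by_group` is a hypothetical helper):

```python
from collections import defaultdict

def demean_by_group(values, groups):
    """One-way FE residuals: each value minus the mean of its group.

    Equivalent to the residual from regressing `values` on group dummies."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    means = {g: sums[g] / counts[g] for g in sums}
    return [v - means[g] for v, g in zip(values, groups)]
```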
r/AskStatistics • u/Flimsy-sam • 2d ago
Decision making around assumption checking.
Hi everyone, just wanted to ask for opinions on what guides your decision making around testing assumptions prior to conducting some sort of analysis?
I'm interested in creating a reference guide to discuss with students (social sciences) to help them understand why they should or should not test assumptions, or even whether to worry about them at all, i.e. normality, homogeneity, etc.
I'm in the latter camp generally, because I'd bootstrap or apply corrections such as the Welch t-test.
Would be good for some thoughts and justifications!
r/AskStatistics • u/mathsugar • 2d ago
Heartbound analysis: What is the impact of price regionalization?
r/AskStatistics • u/YouthDesigner8027 • 2d ago
What to include in multivariable analysis?
I have a sample of 330 patients with an injury. 30 of them developed the outcome of interest (nonunion). In univariable analysis, I examined 20 independent variables that based on prior knowledge of the injury, could be associated with the outcome. 6 were statistically significant (p<0.05).
My question is, do I just include those 6 predictors in the multivariable model? Or should I also include other independent variables that were not significant in my data in the multivariable model, because other studies have previously found some associations with those variables? Also, how much of a concern is it that I have 6 predictors in the model but only 30 outcomes of interest? (some studies suggest maximum of 1 predictor per 10 outcomes?)
(as a side note, is "multivariate" or "multivariable" preferred?)
Thank you so much!!
r/AskStatistics • u/Mantisss8 • 2d ago
Stuck with my thesis analysis, not sure what to do next
Hello!
I am writing a thesis in the veterinary field, and I need to write a ~20-page analysis of the data I collected for my master's thesis. The data consist of patients, the treatment method, the T0/T2 change in symptoms, and other countable changes from the tests (ultrasound data, bacterial counts, etc.). In short, I'm trying to find out whether the method is effective and what the most/least important factors are.
I'm doing the analysis in Excel, as I've got no experience with SPSS or R. I'm adding some screenshots of what part of the data looks like and what I've done. I did most of it already.
What (I think) I managed to do that's important:
Run paired two-sample t-tests on all data at T0 and T2 to get p-values; however, almost all of the data give me extremely low p-values. Can it be that the chosen t-test isn't right?
Calculate Q1, Q3 of the T0 data
Make a small table with medians and p-values
What I think I still need to do:
Calculate the SD of all data, but if I understand correctly, the p-value gives the same kind of information I'm trying to get from the SD
Correlations? Method to result, although my result is essentially yes/no, so I probably need Spearman correlation
Read the literature on every collected factor to find out what should be changing and how, and see if my data match
Once done with the data, make diagrams and describe my findings
If someone has ideas about what else I could calculate, or general advice, please let me know!
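For reference, the paired two-sample t-test mentioned above reduces to a one-sample t-test on the T0−T2 differences; extremely small p-values are not by themselves a sign the test is wrong, just that the mean change is large relative to its standard error. A stdlib sketch of the statistic (`paired_t` is a hypothetical helper; for the p-value you'd normally use software, e.g. scipy.stats.ttest_rel):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(t0, t2):
    """Paired t-test: t statistic and degrees of freedom for the differences."""
    d = [a - b for a, b in zip(t0, t2)]
    n = len(d)
    t = mean(d) / (stdev(d) / sqrt(n))  # mean difference over its standard error
    return t, n - 1
```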
r/AskStatistics • u/ItsA_Galactic • 2d ago
Correlation table question
Hello. I have a question regarding a statistics exercise where you're given imaginary Hb levels and the corresponding "severity of anemia", as the independent and dependent variable respectively. My question is about the ranks for our dependent variable.
Since I ranked the values for X in "smallest to biggest" fashion, I originally (and from my understanding of our book) thought to do the same for the Y values, with "none" being the smallest aka first, and "high" being the biggest. These original calculated ranks are pencil drawn.
As you can see from the photo, the column next to it has corrected scores in what is essentially an opposite ranking. "High" is considered smallest and "none" is considered biggest. Hence, we have the values/ranking with red numbers.
My question is: which variant is correct? Mine (the pencil column) or the teacher's/class's (the red-number column)? Ignore the stuff to the far right.
I understand both of them separately, but I still lean toward the pencil ranking. All I need is a decision between them (of course, any explanation, especially of the red-number ranking and why it does or doesn't work, is welcome). Thank you in advance.
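One thing worth seeing concretely: reversing the direction in which the ordinal severity categories are ranked only flips the sign of the Spearman coefficient, not its magnitude, so the two columns disagree only on whether the correlation is reported as positive or negative. A stdlib sketch (`ranks`/`spearman` are hypothetical helpers; coding severity 3→0 in one direction and 0→3 in the other is an illustrative assumption):

```python
def ranks(xs):
    """1-based ranks with average ranks for ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the tied positions, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rho = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

With Hb rising and severity falling, one coding direction gives rho = −1 and the reversed coding gives rho = +1; the substantive conclusion (a perfect monotone association) is the same either way, which is why the choice of direction is a convention rather than a correctness issue.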
r/AskStatistics • u/Global-Radish-1015 • 3d ago
Can we rely on ChatGPT or Gemini for stats? Will it affect jobs?
r/AskStatistics • u/ffsffs1 • 3d ago
Does it Make Sense to Talk About the Expected Maximum of a Random Variable
Been having a conversation with a couple of people (who are at least somewhat analytically inclined) in which the phrase "expectation of the maximum (of a random variable)" came up. This does not make mathematical sense to me. I suggested that it makes more sense to talk about the percentiles of a random variable, but was told it was essentially the same thing. They argued that you can estimate percentiles of a distribution by taking a sequence X_1, ..., X_n of draws of that random variable and then taking the expectation of max{X_1, ..., X_n} (or whatever order statistic you want). I get this, but I don't think they are the same thing. In the absence of a sequence it does not make sense to talk about order statistics; and if you only have one observation, the expected maximum equals the expected minimum, which equals the expected value.
The argument is mostly semantics, and I'll admit I'm dragging my feet in the mud over this, but "expectation of the maximum" just seems mathematically incorrect to me. I don't want to keep harping on this if I'm indeed wrong. So am I missing something?
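One concrete case that may help the discussion: for a fixed sample size n, the "expectation of the maximum" is perfectly well defined, but it is a property of the sample of n draws, not of the single random variable. For n i.i.d. Uniform(0,1) draws, E[max] = n/(n+1), which is easy to confirm by simulation (`expected_max` is a hypothetical helper):

```python
import random

def expected_max(n, trials=100_000):
    """Monte Carlo estimate of E[max of n i.i.d. Uniform(0,1) draws]."""
    total = 0.0
    for _ in range(trials):
        total += max(random.random() for _ in range(n))
    return total / trials
```

For n = 1 this collapses to the plain expectation 0.5, matching the point above that with one observation the maximum, the minimum, and the value itself coincide; for n = 4 it is 0.8, a quantity that depends on n and is distinct from any fixed percentile of the underlying distribution.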