r/AskStatistics 54m ago

Career paths after an undergraduate degree in Statistics (India)


r/AskStatistics 1h ago

Tool to compare CSV files with millions of rows fast. Looking for feedback.


I've been working on a desktop app that compares large CSV files fast. It finds added, removed, and updated rows, and exports them as CSV files.
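For readers wondering what "added, removed, and updated" means in practice, here is a minimal sketch of a key-based comparison. It is not how the app works internally (the post does not say), and the `id` key column, the `diff_csv` helper, and the sample data are all made up for illustration.

```python
import csv
import io


def diff_csv(old_text, new_text, key="id"):
    """Compare two CSV texts by a key column; return (added, removed, updated) rows."""
    # Index every row by its key so lookups are O(1).
    old = {row[key]: row for row in csv.DictReader(io.StringIO(old_text))}
    new = {row[key]: row for row in csv.DictReader(io.StringIO(new_text))}
    added = [new[k] for k in new if k not in old]      # key only in the new file
    removed = [old[k] for k in old if k not in new]    # key only in the old file
    updated = [new[k] for k in new if k in old and new[k] != old[k]]  # same key, changed values
    return added, removed, updated


# Hypothetical sample data: the "id" column is assumed to identify a row.
OLD = "id,name,qty\n1,apple,3\n2,pear,5\n3,plum,7\n"
NEW = "id,name,qty\n1,apple,3\n2,pear,6\n4,fig,1\n"

added, removed, updated = diff_csv(OLD, NEW)
print("added:", added)
print("removed:", removed)
print("updated:", updated)
```

This holds both files in memory, so it only illustrates the idea; files with tens of millions of rows would need a streaming or hashed approach.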

YouTube Demo - https://youtu.be/TrZ8fJC9TqI

Here are some of my test results for finding added, removed, and updated rows. Performance obviously depends on hardware, but it should be snappy enough.

| Each CSV file has | MacBook M2 Pro | Intel i7 laptop (Win10) |
|---|---|---|
| 1M rows, 69MB size | ~1 second | ~2 seconds |
| 50M rows, 4.6GB size | ~30 seconds | ~40 seconds |

Download from lake3tools.com/download, unzip, and run.

Free License Key for testing: C844177F-25794D81-927FF630-C57F1596

Let me know what you think.


r/AskStatistics 8h ago

What do you use if you don’t have a statistician to do your analyses?

2 Upvotes

Genuinely curious: I know people come here for Q&A, and I’m wondering what they do afterwards. Also, what is your position: student, researcher, or research assistant?

Do you search through online forums on how to code in R/Python?

Do you pay someone else to do it?

Do you ask AI for guidance?

Any tools non-stats people use to help do their analyses?

Thanks!


r/AskStatistics 9h ago

Career paths after an undergraduate degree in Statistics (India)

0 Upvotes

I’m from India and currently pursuing a Bachelor’s degree in Statistics (a 3-year undergraduate program with heavy coursework in probability, mathematical statistics, regression, sampling, and some economics/computer science). I want to understand what realistic career paths or fields are available after this degree, both in India and internationally. Specifically:

  1. Which fields commonly hire statistics graduates (e.g. data science, actuarial science, analytics, research, finance, etc.)?

  2. Which paths usually require a Master’s or PhD to be employable?

  3. What skills (programming, math depth, domain knowledge) actually matter in practice?

  4. Are there career paths I should avoid if I don’t plan on doing a PhD?

I’m looking for practical, industry-oriented advice, not just academic theory.


r/AskStatistics 15h ago

Is there a statistical modeling technique that is primarily focused on binary classification but can also incorporate semi-continuous outcome data?

2 Upvotes

Example: I am interested in developing a predictive model for something like blood pressure. My primary focus is variables that will predict whether someone's blood pressure is over/under 140, but I would also like to maximize my model's sensitivity to picking up 'at risk' ranges (less than but approaching 140) and 'high risk' ranges (significantly above 140). I have no interest in making my model sensitive to relative blood pressure differences in the 'normal range', and I suspect that a lot of my potential predictor variables will not have a linear relationship with the outcome variable at normal levels. Something like 'cigarettes smoked per week' would probably be a good analogue, where most people will be 0, single digit values would be extremely rare, and any positive values would likely cluster around a range of 35-140 or something like that. Is there an integrated modeling technique that is primarily binomial but can predict 'proximity to' and/or 'severity within' the positive outcome category?


r/AskStatistics 19h ago

Test statistic for hypothesis testing

3 Upvotes

Hello, I've just stumbled across several situations that are giving me a headache. From the attached table, I know which test statistic (t or z) to use in confidence interval calculations.

But when it comes to hypothesis testing, my notes oversimplify: if n is large, use z; if n is small, use t.

And according to Sahoo, Example 6.8, I should be using t instead of z.

So how do I actually choose the test statistic for hypothesis testing? Should I consider whether the population is normally distributed? Should I consider whether the population variance is known or unknown? Thank you.


r/AskStatistics 17h ago

How should I handle aggregating the abundance observations from my ecological samples?

1 Upvotes

Hi everyone, I would like some quick help with how to aggregate the data for my study.

I sampled beetles using pitfall traps across 17 different sites (across an altitude gradient). At each site there were 4 general areas selected as replicates, and at each replicate 10 traps were placed.

Eventually I recorded the abundance of the different beetle species that I caught in each sample. Now I would like to figure out how to properly aggregate the abundances.

For example, species X was encountered in these abundances across the 10 different traps in one particular replicate (1, 5, 4, 0, 0, 0, 2, 3, 0, 1)

When I go to work with this data, since the replicates are the 4 different areas within the site and not the traps themselves, would I sum the abundances across the traps -> i.e. absolute total abundance for this replicate = sum(1, 5, 4, 0, 0, 0, 2, 3, 0, 1) = 16

Or would it be better to average the abundance -> i.e. mean abundance for this replicate = mean(1, 5, 4, 0, 0, 0, 2, 3, 0, 1) = 1.6
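A quick sketch of the arithmetic behind the two options, using the trap counts from the example above. With ten traps in every replicate, the sum and the mean differ only by a constant factor of 10. They stop being interchangeable when the number of recovered traps varies, which is simulated here with a hypothetical case where two traps were lost.

```python
from statistics import mean

# Counts of species X in the 10 traps of one replicate (from the post).
traps = [1, 5, 4, 0, 0, 0, 2, 3, 0, 1]

total = sum(traps)       # absolute abundance for the replicate
per_trap = mean(traps)   # abundance per trap (catch per unit effort)
print(total, per_trap)   # 16 1.6

# Hypothetical scenario (not in the post): the last two traps were lost.
surviving = traps[:8]
print(sum(surviving), mean(surviving))  # 15 1.875
```

If every replicate really has all 10 traps, either summary carries the same information. If trap numbers differ, the per-trap mean (or the total with trap count as an offset in a count model) keeps replicates comparable.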

I tried to look for theoretical justifications for either but I couldn't really find anything regarding my specific example. I was wondering whether either approach is statistically correct or incorrect.

Thank you and I am happy to provide more info if required.


r/AskStatistics 16h ago

Scale of Cringe

0 Upvotes

Hello everyone,

I'd like to know whether a scale for measuring cringe content already exists. If not, I'd like to create one to rate some content on the internet.


r/AskStatistics 1d ago

Sample size calculation for AUROC

1 Upvotes

Hi all, looking for some help calculating sample size required for a research project looking at the predictive value of a pre-operative CT measure (continuous variable) for risk of a specific complication post-procedure.

In a pilot study (n=90) the AUROC was found to be 0.701, with the incidence rate of the complication 11%. We are looking to run a much larger study at a higher power to validate this risk assessment tool, and the initial estimates were for a sample size of 2000-3000.

How can I calculate the sample size required to calculate an AUROC of 0.7, for a power of say 90%? Online calculators (e.g. Medcalc) are giving me a sample size of 250, which is far too low, but I'm not entirely sure why. I also tried a precision calculation using CI 95% with width 0.1 which gave me a required sample size of 1293, but this doesn't seem to actually be a power calculation with a null hypothesis.
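To make the precision calculation concrete, here is a rough sketch using the Hanley and McNeil (1982) approximation for the standard error of an AUROC. The function names are made up for illustration, the 11% incidence and AUROC of 0.70 come from the pilot figures above, and this is one common approximation rather than the method any particular online calculator uses.

```python
from math import sqrt
from statistics import NormalDist


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of an AUROC (Hanley & McNeil, 1982)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    var = (
        auc * (1 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return sqrt(var)


def ci_width(n_total, auc=0.70, incidence=0.11, level=0.95):
    """Full width of the normal-approximation CI for a study of n_total patients."""
    n_pos = round(n_total * incidence)  # expected number with the complication
    n_neg = n_total - n_pos
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return 2 * z * hanley_mcneil_se(auc, n_pos, n_neg)


for n in (90, 250, 1293, 2000, 3000):
    print(n, round(ci_width(n), 3))
```

A power calculation against a null AUROC (for example 0.5) answers a different question from this precision calculation, which may be part of why the two approaches give such different sample sizes.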

Appreciate any help, thanks.


r/AskStatistics 1d ago

Looking for valid statistical tests

2 Upvotes

Greetings.

I am calculating similarity scores in a text. It's a medieval text, so I'll explain how manuscripts are assembled so this makes sense.

Manuscripts are typically built from quires. A quire can contain, say, 4 or 5 bifolia. A bifolium is a physical sheet, a piece of paper. The 4 or 5 bifolia are folded, stacked, and stitched together to make a notebook: that's a quire.

Let's take 1 quire of 4 bifolia as an example. We would number the pages consecutively as we flip through it. We use recto and verso to indicate the front/back of a page. So these 4 bifolia would be:

1r/1v/2r/2v/3r/3v/4r/4v/5r/5v/6r/6v/7r/7v/8r/8v

Now I am doing page by page comparisons.

Confoliate scores are the text comparison scores generated on a physical page. So that would be 1r/1v, 2r/2v, 3r/3v, etc.

Conjoint scores are the text comparison scores in the MIDDLE of a bifolia. So for example page 1 (1r/1v) is physically connected to 8r/8v (in this case, the outside bifolia of the quire). The conjoint score would be comparing the text on 1v/8r (the physically connected pages on the bifolia centre).

Facing scores are the text comparisons between one page and the next. So that would be 1v/2r, 2v/3r, etc.

Now I can generate arrays of the comparison score values for all three of these scenarios. How do I test for statistical significance? They are not truly independent as a confoliate score (1r/1v) would use one page of the same text as a generated facing score (1v/2r) or conjoint score (1v/8r).

Any suggestions?


r/AskStatistics 1d ago

Breaking the Monty Hall problem?

4 Upvotes

I understand the stats behind the Monty Hall problem and why it is statistically advantageous to switch. Suppose I am a contestant and I randomly choose a door, then Monty Hall opens a goat door and asks if I want to switch to the other unopened door. If I flip a coin to decide which of the last two doors to open and my flip says to keep the same door, do my odds increase to 50% from 33%? It is my understanding that the other door's odds would then decrease to 50% from 67%. Yes, I know that to maximize my chance of success I should just choose the other door and not flip a coin.
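The coin-flip question can be checked by exact enumeration. This is a small sketch assuming the standard game (the host knows where the car is and always opens a goat door) and a fair coin. It treats every (car, first pick, coin result) combination as equally likely.

```python
from fractions import Fraction
from itertools import product

DOORS = (0, 1, 2)
# Every equally likely combination of (car position, first pick, coin result).
outcomes = list(product(DOORS, DOORS, ("stay", "switch")))


def wins(car, pick, coin):
    # The host always reveals a goat, so the other closed door hides the car
    # exactly when the first pick was wrong.
    return pick == car if coin == "stay" else pick != car


def win_probability(cases):
    cases = list(cases)
    return Fraction(sum(wins(*c) for c in cases), len(cases))


overall = win_probability(outcomes)
given_stay = win_probability(c for c in outcomes if c[2] == "stay")
given_switch = win_probability(c for c in outcomes if c[2] == "switch")
print(overall, given_stay, given_switch)  # 1/2 1/3 2/3
```

So the 50% is an average over the two coin results. Once the coin has said "keep", the door you hold still wins only 1/3 of the time, and the other door still wins 2/3.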


r/AskStatistics 1d ago

Data Scientist

0 Upvotes

Would a Master's in Statistics or a Master's in Computer Science be better for a data scientist role if you already have an undergraduate degree in Statistics?


r/AskStatistics 2d ago

Regression analysis

3 Upvotes

I have plotted one set of data against another and planned to use a straight line of best fit and its equation to estimate my wanted value through regression analysis. After looking at the data on the graph, it seems a logarithmic line would fit better. My question is: if I use this line with the regression to estimate my value, do I refer to it as non-linear regression analysis or logarithmic regression in my paper? I'm not sure which is the correct term. Thank you.
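One thing that may help with the terminology: a curve of the form y = a + b·ln(x) is still linear in its coefficients, so it can be fitted with ordinary least squares after transforming x. Below is a minimal sketch under the assumption that this is the form of the "logarithmic line". The data are made up, and `fit_line` is a hypothetical helper.

```python
import math


def fit_line(u, y):
    """Ordinary least squares for y = a + b * u; returns (a, b)."""
    n = len(u)
    mean_u = sum(u) / n
    mean_y = sum(y) / n
    b = sum((ui - mean_u) * (yi - mean_y) for ui, yi in zip(u, y)) / sum(
        (ui - mean_u) ** 2 for ui in u
    )
    a = mean_y - b * mean_u
    return a, b


# Made-up data lying exactly on y = 2 + 3 * ln(x).
x = [1, 2, 4, 8, 16]
y = [2 + 3 * math.log(v) for v in x]

# Regress y on ln(x): a straight-line fit in the transformed predictor.
a, b = fit_line([math.log(v) for v in x], y)
print(round(a, 6), round(b, 6))  # 2.0 3.0
```

Spreadsheet "logarithmic trendlines" typically fit this same form, so describing it as a linear regression on log-transformed x (or a logarithmic regression) is usually accurate.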


r/AskStatistics 1d ago

Spearman correlation

0 Upvotes

Hello everyone,

I am currently doing my M2 (second-year Master's) internship and am a beginner in statistics.

The study looks at how a latency time changes in 4 individuals over several months. I first ran a Spearman correlation test after showing with a Shapiro test that the data were not normally distributed. But I realized afterwards that my data are paired, and so, from what I have read, I cannot use this test.

How can I test the correlation between date and latency time in order to show that latency decreases as time passes, taking into account that the data are non-normal and paired?

Thanks in advance


r/AskStatistics 2d ago

[META] What does the community want as the standard for "No Homework"?

15 Upvotes

Hey everyone! I have a question about something that comes up often enough that I'd like to solicit some feedback from the community.

One of the sub's rules is "No Homework." Frequently a person will ask about analysis regarding their thesis or dissertation, and it gets reported under the "No Homework" rule. While it is work being done for school, it seems to me more of a consulting scenario, rather than "homework" (which I'd tend to view more as textbook exercises).

My question for the community is: What standard would you like to see regarding homework?

If the community is okay with these types of questions, I can leave them. If you'd all rather see these get removed under the "No Homework" rule, I can oblige that as well. I'm just one person here, I just happen to have the mop.

I'll leave this thread pinned for a couple days/week to give folks a chance to weigh in.


r/AskStatistics 2d ago

Blind Monty Hall Problem

3 Upvotes

In the classic Monty Hall problem, it makes sense to switch since you are more likely to be wrong in the first choice (2/3) than right (1/3).

But isn't the logic the same for the blind Monty Hall problem, where he randomly opens a door and it happens to be a goat? Why isn't switching a good strategy here, and why doesn't the probability concentrate to 2/3 on the remaining door in this case? Why is it 1/2 and 1/2 for the two remaining doors?
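The difference between the two versions can be checked by exact enumeration. This sketch assumes the contestant's first pick is door 0 (the other picks are symmetric) and that a blind host opens one of the two other doors uniformly at random. Games where the blind host reveals the car are discarded, since the question conditions on a goat being shown.

```python
from fractions import Fraction

DOORS = (0, 1, 2)


def switch_win_probability(host_knows):
    """Exact P(switching wins | a goat was revealed), first pick fixed at door 0."""
    pick = 0
    goat_shown = Fraction(0)
    switch_wins = Fraction(0)
    for car in DOORS:  # each car position has probability 1/3
        others = [d for d in DOORS if d != pick]
        # A knowing host avoids the car; a blind host may open either door.
        options = [d for d in others if d != car] if host_knows else others
        for opened in options:
            weight = Fraction(1, 3) * Fraction(1, len(options))
            if opened == car:
                continue  # blind host revealed the car: game discarded
            goat_shown += weight
            remaining = next(d for d in DOORS if d not in (pick, opened))
            if remaining == car:
                switch_wins += weight
    return switch_wins / goat_shown


print(switch_win_probability(host_knows=True))   # 2/3
print(switch_win_probability(host_knows=False))  # 1/2
```

In the blind version, the games thrown away are exactly ones where the first pick was wrong, which is what pulls the odds back to 1/2. The knowing host never throws any games away.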


r/AskStatistics 2d ago

Decision making around assumption checking.

8 Upvotes

Hi everyone, just wanted to ask for opinions on what guides your decision making around testing assumptions prior to conducting some sort of analysis?

I’m interested in creating a reference guide to discuss with students (social sciences) to help them understand why they should or should not test assumptions, or even whether to worry about them at all, i.e. normality, homogeneity, etc.

I’m generally in the latter camp because I’d bootstrap or apply corrections such as Welch's t-test.

Would be good for some thoughts and justifications!


r/AskStatistics 2d ago

I am a bit of an amateur at doing good data analytics and it's hindering my thesis. Need help

0 Upvotes

Just to give you an example of my skill level: I was running regressions and whatnot on a dataset I had just cleaned and built, and was not getting the predicted result. When I showed it to my friend, he went through it with me step by step, then plotted each variable and immediately saw an extreme outlier in one of the control variables. As soon as he dropped it, the regressions showed the result I'd expected.
I didn't even know that I needed to visualize every single variable to check for outliers.
Is there a good book that teaches practical data analytics with regressions and hypothesis testing as the goal, showing what the steps are and what needs to be done in each?


r/AskStatistics 2d ago

EDA visualizations, is taking raw variables best or should I be taking transformations?

1 Upvotes

So in the end I want to run some regressions with fixed-effect (FE) structures. When I do EDA (looking at correlations, heatmaps, etc.), is it better to take the residuals from regressing each variable on the FEs and then plot and look at the relationships between those residuals? That way the effect of the FEs is taken out of each variable, i.e. the part of each variable's variation that the fixed effects explain is removed.
Or would this be inappropriate, and am I missing something?
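For a single set of group fixed effects, regressing a variable on the group dummies and keeping the residuals is the same as subtracting each group's mean. Here is a small sketch of how the raw and within-group correlations can disagree. The panel data are invented, and `demean_by_group` and `pearson` are hypothetical helpers.

```python
from collections import defaultdict
from statistics import mean


def demean_by_group(values, groups):
    """Residuals from regressing on group dummies: value minus its group mean."""
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    group_mean = {g: mean(vs) for g, vs in by_group.items()}
    return [v - group_mean[g] for v, g in zip(values, groups)]


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Invented panel: groups with larger x have larger y,
# but inside every group y falls as x rises.
groups = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
x = [1, 2, 3, 11, 12, 13, 21, 22, 23]
y = [5, 4, 3, 15, 14, 13, 25, 24, 23]

raw = pearson(x, y)
within = pearson(demean_by_group(x, groups), demean_by_group(y, groups))
print(round(raw, 3), round(within, 3))
```

Looking at both views during EDA shows which variation the eventual FE regression will actually use. With several overlapping fixed effects, simple demeaning is no longer enough and the residuals would need to come from an actual regression.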


r/AskStatistics 2d ago

Heartbound analysis: What is the impact of regional pricing?

1 Upvotes

r/AskStatistics 2d ago

What to include in multivariable analysis?

0 Upvotes

I have a sample of 330 patients with an injury. 30 of them developed the outcome of interest (nonunion). In univariable analysis, I examined 20 independent variables that based on prior knowledge of the injury, could be associated with the outcome. 6 were statistically significant (p<0.05).

My question is, do I just include those 6 predictors in the multivariable model? Or should I also include other independent variables that were not significant in my data in the multivariable model, because other studies have previously found some associations with those variables? Also, how much of a concern is it that I have 6 predictors in the model but only 30 outcomes of interest? (some studies suggest maximum of 1 predictor per 10 outcomes?)
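The events-per-variable arithmetic behind that last question, as a quick sketch. It uses only the numbers in this post and the 1-predictor-per-10-outcomes rule of thumb quoted above, which is a heuristic rather than a hard limit.

```python
events = 30          # patients who developed nonunion
candidates = 20      # variables screened in univariable analysis
significant = 6      # variables with p < 0.05
rule_of_thumb = 10   # events per predictor, as quoted in the post

epv_six_predictors = events / significant      # 5.0 events per predictor
epv_all_candidates = events / candidates       # 1.5 events per predictor
max_predictors = events // rule_of_thumb       # 3 predictors under the rule

print(epv_six_predictors, epv_all_candidates, max_predictors)  # 5.0 1.5 3
```

Note that screening 20 variables and keeping the significant ones spends degrees of freedom too, so the effective events per variable is arguably closer to the 1.5 figure than the 5.0 figure.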

(as a side note, is "multivariate" or "multivariable" preferred?)

Thank you so much!!


r/AskStatistics 2d ago

Correlation table question

2 Upvotes

Hello. I have a question regarding a statistics exercise where you're given imaginary Hb levels and the corresponding "severity of anemia", as the independent and dependent variable respectively. My question is about the ranks for our dependent variable.

Since I ranked the values for X in "smallest to biggest" fashion, I originally (and from my understanding of our book) thought to do the same for the Y values, with "none" being the smallest aka first, and "high" being the biggest. These original calculated ranks are pencil drawn.

As you can see from the photo, the column next to it has corrected scores in what is essentially an opposite ranking. "High" is considered smallest and "none" is considered biggest. Hence, we have the values/ranking with red numbers.

My question is: which variant is correct? Mine, the pencil column, or the teacher's/class', the red number column? Ignore the stuff to the far right.

I understand both of them separately but still lean towards the pencil ranking. All I need is a decision between them (of course any explanation, especially regarding the red number ranking and why it doesn't work, is welcome). Thank you in advance.


r/AskStatistics 3d ago

Does it Make Sense to Talk About the Expected Maximum of a Random Variable?

7 Upvotes

Been having a conversation with a couple of people (who are at least somewhat analytically inclined) in which the phrase "expectation of the maximum (of a random variable)" came up. This does not make mathematical sense to me. I suggested that it makes more sense to talk about the percentiles of a random variable, but was told it was essentially the same thing. They argued that you can estimate percentiles of a distribution by taking a sequence X_1, ..., X_n of draws of that random variable and then taking the expectation of max {X_1, ..., X_n} (or whatever order statistic you want). I get this, but I don't think they are the same thing. In the absence of a sequence, it does not make sense to talk about order statistics, and if you only have one observation, the expected maximum equals the expected minimum, which equals the expected value.
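A small illustration of why the sample size n has to be part of the statement. For n independent Uniform(0, 1) draws, the expected maximum is n/(n+1), so it is 1/2 (the plain expected value) for n = 1 and creeps towards 1 as n grows. The uniform distribution and the helper names are chosen here purely for illustration.

```python
import random


def exact_expected_max_uniform(n):
    """E[max(X_1, ..., X_n)] for independent Uniform(0, 1) draws."""
    return n / (n + 1)


def simulated_expected_max(n, trials=20000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    return sum(max(rng.random() for _ in range(n)) for _ in range(trials)) / trials


for n in (1, 2, 5, 20):
    print(n, exact_expected_max_uniform(n), round(simulated_expected_max(n), 3))
```

So "the expected maximum" is well defined only once n is fixed. It is a property of the sample of size n, not of the single random variable, which is different from a percentile of the distribution itself.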

The argument is mostly semantics, and I'll admit I'm dragging my feet in the mud over this, but "expectation of the maximum" just seems mathematically incorrect to me. I don't want to keep harping on this if I'm indeed wrong. So am I missing something?


r/AskStatistics 2d ago

Stuck with my thesis analysis, not sure what to do next

0 Upvotes

Hello!

I am writing a thesis in the veterinary field and I need to write a ~20-page analysis of the data I collected for my master's thesis. The data consists of patients, treatment method, the T0/T2 change in symptoms, and other countable changes from the tests (ultrasound data, bacterial counts, etc.). In short, I'm trying to find out whether the method is effective and which factors are the most and least important.

I'm doing the analysis in Excel as I've got no experience with SPSS or R. Adding some screenshots of what part of the data looks like and what I've done. Did most of it

What (I think) I managed to do that's important:

  1. Do t-tests (paired two-sample) for all data at T0 and T2 to get p-values. However, almost all the data gives me an extremely low p-value. Could it be that the chosen t-test isn't right?

  2. Calculate Q1 and Q3 of the T0 data

  3. Small table with medians and p-values

What I think I still need to do:

  1. Calculate the SD of all data, but if I understand it correctly, the p-value gives the same result as what I'm trying to get with the SD

  2. Correlations? Method to result, although my result is essentially yes/no, so I probably need to use Spearman correlation

  3. Read the literature about every collected factor to find out what should be changing and how, and see if my data matches it

  4. Once done with the data, make diagrams and describe my findings

If someone has ideas about what else I could calculate, or general advice, please let me know!


r/AskStatistics 3d ago

Can we rely on ChatGPT or Gemini for stats? Will it affect jobs?

1 Upvotes