Patterns in random data (how not to fake your tax return)


Imagine you kept track of your electricity bills and recorded the first decimal digit (the first digit after the decimal point) of each monthly total: if the totals were 32.84, 28.41, 29.98, 26.04 and so on, the digits would be 8, 4, 9, 0 and so on. What would you find in the long run? Would every digit occur with the same frequency? Surprisingly, the answer is no. Instead, you will observe a pattern that recurs in many other datasets, such as digits in stock prices, statistical data and financial statements. In fact, there is a very plausible mathematical explanation for this behaviour, and it is rigorous enough to use in practice: if a dataset that should be random deviates from the pattern, it is likely that the numbers were made up. This fact is used to detect fraudulent tax returns and forged company financial statements.
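As a starting point for the empirical part, the digit extraction from the electricity-bill example can be sketched in Python as follows. This is only a minimal sketch: the function names `first_decimal_digit`, `digit_frequencies` and `chi_square_statistic` are hypothetical, the amounts are assumed to be given as decimal strings, and the uniform distribution used at the end is merely a placeholder hypothesis to compare against, not the pattern the project is about.

```python
from collections import Counter
from decimal import Decimal


def first_decimal_digit(amount: str) -> int:
    """Return the first digit after the decimal point of an amount.

    Decimal is used instead of float so that e.g. "26.04" is not
    distorted by binary rounding.
    """
    value = abs(Decimal(amount))
    # Multiply by 10, truncate to an integer, keep the last digit.
    return int(value * 10) % 10


def digit_frequencies(amounts):
    """Relative frequency of each digit 0..9 among the given amounts."""
    counts = Counter(first_decimal_digit(a) for a in amounts)
    total = sum(counts.values())
    return {digit: counts.get(digit, 0) / total for digit in range(10)}


def chi_square_statistic(observed_freqs, expected_probs, n):
    """Pearson chi-square statistic for n observations.

    expected_probs is whatever distribution you conjecture; all of its
    entries must be positive.
    """
    return sum(
        n * (observed_freqs[d] - expected_probs[d]) ** 2 / expected_probs[d]
        for d in range(10)
    )


totals = ["32.84", "28.41", "29.98", "26.04"]
digits = [first_decimal_digit(t) for t in totals]
print(digits)  # [8, 4, 9, 0]

freqs = digit_frequencies(totals)
# Placeholder hypothesis (an assumption for illustration): all digits equally likely.
uniform = {d: 0.1 for d in range(10)}
stat = chi_square_statistic(freqs, uniform, len(totals))
print(stat)
```

With real data you would replace `totals` by a long series of amounts and `uniform` by your own conjectured distribution; a large value of the statistic then indicates that the data do not fit the conjecture.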
Your task in this project is to verify the pattern empirically using real-world data, come up with conjectures about its exact form (the underlying probability distribution of the ten digits), visualize what you found, and then use this information to test datasets that are supposed to be random. In later stages, a similar analysis is planned for another pattern of this kind: the relation between the most frequent words in a language (in English, candidates would be 'the', 'or' and 'and') and their actual proportion among all words. For example, assuming that 'the' and 'or' are the two most frequent words, how much more likely are you to encounter 'the' than 'or' in a random English sentence? Again you will see a pattern emerge, and it is independent of the language.
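For the word-frequency part, a first experiment could look like the following Python sketch. The helper names `word_counts` and `frequency_ratio` are hypothetical, the tokenization (lowercase letters and apostrophes only) is a simplifying assumption, and the sample sentence is made up purely to show the mechanics; a meaningful analysis needs a large text corpus.

```python
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count how often each word occurs, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def frequency_ratio(counts: Counter, word_a: str, word_b: str) -> float:
    """How many times more frequent word_a is than word_b."""
    if counts[word_b] == 0:
        raise ValueError(f"{word_b!r} does not occur in the text")
    return counts[word_a] / counts[word_b]


# Made-up sample text, for illustration only.
sample = "the cat and the dog or the bird and the fish"
counts = word_counts(sample)

# Words sorted from most to least frequent, with their counts.
ranked = counts.most_common()
print(ranked[:3])

print(frequency_ratio(counts, "the", "or"))  # 4.0
```

Plotting the counts in `ranked` against their rank (first, second, third most frequent word and so on) is one way to look for the pattern described above.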
In this project you will thus learn how to do statistical data analysis, which is a very useful skill to acquire. As a prerequisite, you need some basic probability theory (mean, variance, histogram, empirical distribution function).

Supervisors: Simon Campese

Difficulty level: Any (will be adjusted depending on the level and ambitions of the participants)

Tools: Programming is best done in R (the second-best choice would be Python), but if you have a strong preference for another language, this can be accommodated as well.

Bibliography: Upon request

FSTC -- University of Luxembourg