Scientific Method & Philosophy of ML#

Lead Scribe: Derek Jacobs

admin#

  • grading contract will be posted in time for Wednesday

  • notes & posting example

  • can leave out admin in notes; that material will mostly be available elsewhere

Opening Notes#

  • sets the stage for how we think about others’ work and how to set up your projects

Scientific Method in the Science of Machine Learning#

paper

Introduction#

  • ML is having a hard time explaining some results

  • Replications are failing

Scientific Method#

Starting from the assumption that there exists accessible ground truth, the scientific method is a systematic framework for experimentation that allows researchers to make objective statements about phenomena and gain knowledge of the fundamental workings of a system under investigation.

—Jessica Zosa Forde and Michela Paganini

  • process and a social contract

    • ML might not be systematic

      • Control the randomness of our algorithms (see the seed-setting sketch after this list)

    • Procedures/Trust allow comparisons of results

  • assumes a ground truth

    • Things can go wrong if there simply isn’t a ground truth

    • Especially when using ML in social contexts

  • Hypotheses

    • An informed “guess”

    • In CS, the learning process is as follows

      • Here’s a problem (that has an answer)

      • Write code to solve it to specifications

    • Hypotheses can be falsified through statistical and other analysis

      • You can show a hypothesis is false, but it is challenging to prove it true
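
As a concrete illustration of “controlling the randomness,” here is a minimal seed-setting sketch in Python (assuming a NumPy/PyTorch stack; adapt it to whatever libraries your experiment actually uses):

```python
import os
import random

import numpy as np
import torch  # only relevant if your experiment uses PyTorch


def set_seed(seed: int = 0) -> None:
    """Fix the common sources of randomness in an ML experiment."""
    os.environ["PYTHONHASHSEED"] = str(seed)  # hash randomization (fully effective only if set before startup)
    random.seed(seed)                 # Python's stdlib RNG
    np.random.seed(seed)              # NumPy's global RNG
    torch.manual_seed(seed)           # PyTorch CPU RNG
    torch.cuda.manual_seed_all(seed)  # all CUDA devices
    # trade speed for determinism in cuDNN convolutions
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


set_seed(42)
```

Fixing seeds makes a single run repeatable; for the statistical points below, you still want to vary the seed across runs and treat the results as samples.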

You can express hypotheses as priors, but that use of “prior” wouldn’t mean the same thing in the rest of the ML literature.

Discuss

When might this not apply?

At the base of scientific research lies the notion that an experimental outcome is a random variable, and that appropriate statistical machinery must be employed to estimate the properties of its distribution.

  • Treat your accuracies as random variables: each run of an experiment is a sample, and you are estimating properties of their distribution

  • In statistics, people chase p-values <= 0.05, and so we have a reproducibility crisis

  • Reproducibility Study

    • fewer than 40% of studies replicated the original results

  • there are some problems with NHST (null hypothesis significance testing), but there are alternatives that are still rigorous

    • for design research

    • in psychology

    • Bayesian methods in HCI

      • Bayesian

        • Priors allowed

        • More broadly defined

        • Everything we do is influenced by our prior beliefs, so there is some subjectivity

        • Interpreting results in terms of previous results

      • Frequentist

        • No Priors

        • Probabilities strictly refer to repeatable events like fair coin flips (in 100 flips we expect 50 heads and 50 tails)

        • Knowledge does not accrue across experiments (both approaches are contrasted in the sketch after this list)
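
To make the contrast concrete, here is a minimal sketch with invented numbers (a hypothetical classifier that got 88 of 100 test examples right; the Beta prior is an illustrative assumption, not from either paper):

```python
import numpy as np
from scipy import stats

correct, n = 88, 100  # invented outcome: 88 of 100 test examples correct

# Frequentist: accuracy is a fixed unknown quantity; report a point
# estimate and a 95% normal-approximation confidence interval.
p_hat = correct / n
se = np.sqrt(p_hat * (1 - p_hat) / n)
print(f"frequentist: {p_hat:.3f}, 95% CI ({p_hat - 1.96*se:.3f}, {p_hat + 1.96*se:.3f})")

# Bayesian: accuracy is a random variable; a Beta prior encodes belief
# before the experiment, and the posterior updates it with the data.
prior_a, prior_b = 8, 2  # prior belief: comparable models get roughly 80%
posterior = stats.beta(prior_a + correct, prior_b + (n - correct))
lo, hi = posterior.ppf([0.025, 0.975])
print(f"bayesian: mean {posterior.mean():.3f}, 95% credible interval ({lo:.3f}, {hi:.3f})")
```

The prior is where previous results enter; the frequentist analysis of the same data uses only the evidence in front of it.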

Case study#

HEP (high energy physics):

  • proposed theory

  • null hypothesis

  • careful accounting

  • model building and hypothesis testing phases

  • parametric models derived from first principles

  • a statistical test is constructed

ML Analogy:

  • suspect a new activation function will help

  • formulate quantitative hypothesis

    • How will it work, what will it do, and how much will it improve results

    • And how the model’s behavior will change

  • run experiments

    • Record outcomes from baseline models (no intervention)

  • This is a statistical model of your experiment, not an ML model

    • Dataset, optimizer, initialization, hyperparameters, etc. are just noise in the question “does the activation function work?”

    • Or make the hypothesis specific (e.g., improves results on a specific dataset); see the sketch below
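
Here is a minimal sketch of that workflow, with a made-up `train_and_eval` standing in for a real training run; Welch’s t-test is one choice of statistical machinery, not the only one:

```python
import numpy as np
from scipy import stats


def train_and_eval(activation: str, seed: int) -> float:
    """Hypothetical harness: train a fixed architecture with the given
    activation and seed, return test accuracy. Placeholder for real code."""
    rng = np.random.default_rng(seed)
    base = 0.90 if activation == "relu" else 0.91  # invented effect size
    return base + rng.normal(0, 0.01)


# Seeds (and with them init, data order, ...) are the "noise" in the
# question "does the activation function work?"
seeds = range(10)
baseline = [train_and_eval("relu", s) for s in seeds]
proposed = [train_and_eval("new_act", s) for s in seeds]

# Hypothesis: the new activation improves mean accuracy.
t, p = stats.ttest_ind(proposed, baseline, equal_var=False)
print(f"baseline {np.mean(baseline):.3f}, proposed {np.mean(proposed):.3f}, p = {p:.3f}")
```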

Recommendations#

1. Formulate hypotheses first
2. Use statistical testing
3. Operate in controlled, reproducible, and verifiable settings (see the run-logging sketch below)
4. Negative result workshops
   - preregistration workshop at NeurIPS 2021 (speaker was not an author)
   - if your experimental plan is good, results are published regardless of a positive or negative outcome
   - a new journal
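
One lightweight way to make a run verifiable (recommendation 3) is to log the full configuration and code version next to every result; this is a sketch with made-up fields, not a prescribed format:

```python
import json
import time

# Hypothetical record of one run: everything needed to reproduce it
# travels with the result.
run_record = {
    "hypothesis": "new_act improves test accuracy over relu on CIFAR-10",
    "config": {"model": "resnet18", "activation": "new_act",
               "optimizer": "sgd", "lr": 0.1, "epochs": 50, "seed": 42},
    "code_version": "abc1234",  # e.g., the git commit hash of the run
    "result": {"test_accuracy": 0.913},
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
}

# Append-only log: one JSON object per line, positive or negative result alike.
with open("runs.jsonl", "a") as f:
    f.write(json.dumps(run_record) + "\n")
```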

Value Laden Shifts in ML#

paper

@brownsarahm reminder from Lily’s talk on incorporating ethics into teaching data structures:

  • environmental cost of algorithms

  • Looking at how different values influence what is being done in ML

  • disciplinary shifts are not objective, but value laden

  • Model Types

    • Different categories of models typically used for specific purposes

    • The structure of what our learning algorithm outputs

    • e.g. CNN for image processing, Linear Models, SVMs, etc

Model types as organizing#

  • many researchers self-organize by model type, e.g., for reviewing, workshops, etc.

    • How do we organize them in a package like sklearn, but also “who knows who works on this”

  • commitment is fueled by exemplars

  • has downstream effects:

    • guide research agenda and problem selection

      • How model types influence problem selection

        • Different types are tailored to different problems

        • The most popular model type attracts the most funding

        • More funding on specific models drives improvements to that area specifically and other problems get lost

    • constrain search for solutions

    • prerequisites, e.g., deep learning requires data volume (Fig. 1) and compute power (Fig. 2)

      • fig1

        • Depending on data amount, we may favor one model over the other, and that in turn is researched more

      • fig2

        • Shows how much compute power has been used in training AI over time

        • Recently exponential growth

  • model types have parallels to, but important differences from, organizing principles in the philosophy of science

    • decreases theory development

    • Kuhn

      • The organizing paradigm defines what questions are valid

      • For example

        • If we only have earth, water, fire, and ether, we can’t ask what molecules do

  • When committed to model types…

    • The whole industry shifts towards it

    • Like NVIDIA GPUs, computer architectures, etc.

Model type is self-reinforcing#

  • comparing them is influenced by the model type points above (problems, prerequisites)

Comparison between types is value laden#

  • Applied to ImageNet

    • Benchmark image classification problem

    • Until 2011, best error rate was 25% (no deep learning)

    • 2012 - AlexNet reached 16% (deep learning)

    • By 2016, ImageNet is basically “done”: all top methods have extremely high accuracies (e.g., 97% vs. 97.1%)

  • This doesn’t mean the problem of image classification is solved

  • prerequisites

    • compute

      • Who has access

    • data

      • Who has access

      • What sets are created/curated

  • Evaluation criteria

    • for theories: consistency (internally and with other accepted theories), predictive power, etc.

Evaluating theories based on their theoretical virtues is a value-laden activity when theoretical virtues are carriers of values.

—Dotan and Milli

  • e.g., consistency requires carrying the flaws of the old theory into the new one

  • eg: metrics and discrimination

    • Something that encoded sexist outcomes previously still does now (see the toy sketch after this list)
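
As a toy illustration of the metrics point, here is a sketch with invented data where aggregate accuracy looks fine while per-group accuracy reveals a disparity; selecting models on the aggregate number alone would carry that disparity forward:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented evaluation set: 1000 examples with a binary label and a
# binary group attribute (purely illustrative).
n = 1000
group = rng.integers(0, 2, size=n)
y_true = rng.integers(0, 2, size=n)

# An invented model that is right 90% of the time for group 0 but only
# 70% of the time for group 1.
p_correct = np.where(group == 0, 0.9, 0.7)
y_pred = np.where(rng.random(n) < p_correct, y_true, 1 - y_true)

print(f"overall accuracy: {(y_pred == y_true).mean():.2f}")  # hides the gap
for g in (0, 1):
    acc = (y_pred == y_true)[group == g].mean()
    print(f"group {g} accuracy: {acc:.2f}")                  # exposes it
```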

Conclusion#

A related question inspired by these issues is: who should make decisions about which values are furthered? Who gets to have a voice? In talking about the selection of problems in science, Kitcher (2011) argues that all sides should have a say, including laypersons [71]. Is the same true for machine learning? Who should have a say about which criteria are important in evaluating model-types? That is itself another value-laden question.

Overall Paper thoughts#

Discuss

General thoughts?

  • Think about things that are interesting, confusing, etc. (for future reference)

  • Take care when conducting ML

    • Don’t just feed in data and hope for the best without considering the “why”s

Discuss

How is this different from how you’ve thought about CS before?

  • There’s more to consider before actually coding

  • Vast social influences in things as simple as “can I even get this dataset?”

  • We focus on what works now instead of what hasn’t worked

Discuss

How might the value-laden points about theory development relate to the scientific method points?

  • Both rest on the two pillars of hypotheses + statistical testing

Meta points#

  • (science) the scientific method one is a workshop paper (approximately the length of your project papers)

  • the value-laden one is a position paper (it argues for a position rather than presenting new experimental results like a classic research paper)

next class#