[{"content":"","date":"2025-06","externalUrl":null,"permalink":"/categories/artificial-intelligence/","section":"Categories","summary":"","title":"Artificial Intelligence","type":"categories"},{"content":"","date":"2025-06","externalUrl":null,"permalink":"/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"},{"content":" Training the first reasoning LLM for chemistry. Overview # In late January 2025, DeepSeek released R1, illuminating a pathway to the world: reinforcement learning with verifiable rewards (RLVR) can elicit and improve reasoning capability.\nFutureHouse\u0026rsquo;s Head of Science Andrew immediately became excited about applying RLVR in the chemistry domain. We began a sprint scoped for two weeks that went on to take five months, materializing into:\nAn open-weight 24B model called ether0 1. Open-source reward functions 2 and test set. An acceptance to NeurIPS 2025\u0026rsquo;s main track. arXiv preprint: Training a Scientific Reasoning Model for Chemistry A feature on the front page of nature.com. A cool overview video and research blog post. There were two aspects not open sourced:\nRL training code: written via torch + torch.distributed, our code was standard GRPO code. It wasn\u0026rsquo;t open sourced due to a combination of other pressing projects, and a desire to open source code using a framework such as Nvidia\u0026rsquo;s NeMo RL. RL training data: due to licensing issues we cannot open the dataset. However, we open sourced question templates in problem_prompts.py. Late Nights # Burned into my memory is both the excitement and the stress associated with week-long training jobs involving up to 384 H100 GPUs.\nFrom the get-go, we had an immense exploration space. We investigated reward shaping such as soft vs. hard rewards, curriculum learning via caching success rates, data augmentations like rephrasal, rejection sampling heuristics for SFT data, different multitask pairings, and format reward strictness. Each run required specification of many hyperparameters: KL penalty or reference, completion length, base model, model size and/or quantization, attention mechanism, sampling parameters (e.g. temperature or top-K), group size + batch size + corresponding learning rate, learning rate warm-up, and more. Since our goal was multitask chemical reasoning, we often settled upon known-reasonable hypers to keep the ball rolling.\nThroughout training, we constantly scrutinized outputs for signs of reward hacking. Fixing one reward hack would unlock the next level of reward hacks in a mostly iterative/sequential process. Fixes could be paired with the addition of more tasks, leaning into the hopes of a shared underlying representation. This coincided with opportunities for better engineering, such as recording entropy or reward failure modes, improving logs for future debugging, recognizing retryable failures, or simply unit testing the verifiability of train/test data.\nAnd the world was coming out with new papers every week, such as LIMO, s1, DAPO, LLama-Nemotron, and many more.\nAll of our excitement was invested into LLM training, with much less attention given to ML engineering/ops. Besides Slurm for job scheduling, we had no cluster monitoring/alerting, synchronous sampling and checkpointing, no job prioritization, and limited investment into optimizations such as DeepSpeed ZeRO. 
And the world was coming out with new papers every week, such as LIMO, s1, DAPO, Llama-Nemotron, and many more.\nAll of our excitement was invested into LLM training, with much less attention given to ML engineering/ops. Besides Slurm for job scheduling, we had no cluster monitoring/alerting, only synchronous sampling and checkpointing, no job prioritization, and limited investment into optimizations such as DeepSpeed ZeRO. Internally the term \u0026ldquo;node arrakis\u0026rdquo; was coined because we had far more experiments to run than GPU nodes.\nThe solution we employed was functional but tough: monitor training jobs all the time, memorize the landscape of hyperparameters and job failure modes, advocate for your ideas when timelines were tight, and believe in yourself. Thankfully, FutureHouse began investing in our infra stack after ether0\u0026rsquo;s success.\nExpert Iteration Reinforcement Learning # Following DeepSeek\u0026rsquo;s methodology, we trained with a warm start of SFT followed by multitask reinforcement learning. We tested pure RL, but observed our model falling into nonsensical babbling, and empirically observed 15-minute SFT jobs surpassing days of RL training. With this knowledge we settled our first training baselines and began comparing against other models.\nContemporary frontier models were terrible at our tasks, with initial open-answer benchmarks of DeepSeek R1 solving below 10% of questions. We also observed tasks such as solubility edit attaining similar levels of performance to other tasks like retrosynthesis in less than one fifth the training steps, meaning tasks were learned at different rates. Eventually we discovered our model hill-climbed much faster with a subset of tasks, and began \u0026ldquo;specialist\u0026rdquo; trainings focused on one or a few tasks.\nSeparately, we observed an intriguing phenomenon appearing in long RL runs. The reasoning would begin to contain flaws: typos, misuse of emojis, malformatted Markdown/sentences, and usage of up to 25 non-English languages. We quantified the steady increase of flaws using regexes, typo counts, and LLM judges. Was this steganography, the model learning to encode information, or the model buying itself time to complete latent logic flows? Others witnessed this too, as six months later an OpenAI study documented low-quality reasoning as a distinct dialect. Non-English phrases decrease usefulness (you probably can\u0026rsquo;t read Tifinagh) and compromise trust, so this was a problem. We considered the flawed reasoning to be the tip of the iceberg of weight-space shifts inside the model, some useful, some not. The question was: can one preserve the accuracy gains of RL while ditching whatever underpinned the low-quality reasoning?\nComing off the aviary paper, our minds were primed for iterative self-improvement techniques such as expert iteration. To resume training we could start from a model checkpoint, but an alternate approach was conceived: form an SFT dataset by (1) exporting all completions throughout training, (2) filtering out completions with incorrect answers or flawed reasoning (rejection sampling), and (3) deduplicating problems, keeping each problem\u0026rsquo;s completion from the latest training step (a code sketch of this filtering appears after the Sources section below). The intuition was that high-quality, correct reasoning traces distilled the behavioral shifts the model had learned. Actually, this \u0026ldquo;data checkpoint\u0026rdquo; approach worked surprisingly well\u0026hellip; after SFT\u0026rsquo;ing the base model, we saw performance drop \u0026lt;10% from the prior RL\u0026rsquo;d model. We furthermore hypothesized that one pass of SFT shifted the model less in weight space than a lengthy RL run containing many KL resets.\nFrom this realization, we formalized a core methodology of our paper, informally nicknamed Expert Iteration Reinforcement Learning (EIRL):\nSFT from the base model to jumpstart performance, RL to push performance, Rejection sample RL completions into an improved SFT dataset.
Repeat steps 1-3 until plateau, optionally adding more tasks each iteration. The EIRL process is depicted in the upper left corner of the ether0 paper\u0026rsquo;s Figure 1. We ran two iterations of EIRL: the first made eight specialist models (each trained on up to three tasks), and the second consolidated them into a generalist performing well on all tasks. The \u0026ldquo;Distillation\u0026rdquo; box in the bottom center illustrates the slight performance loss after SFT, which is quickly recovered during RL. EIRL showed us the long-term asset would be our SFT data, not a given model checkpoint. These synthetic traces can apply to any base model, circumvent the issue of tasks learning at different rates, and are a white box for human audits. We improved EIRL by grouping several tasks together for the first iteration, such as functional group with molecular formula. And for our final RL training run, we included all tasks at once.\nInterestingly enough, an alternate version of EIRL existed in our curriculum learning technique. We stored problems the model neither always solved nor always failed, and re-fed these problems into future training steps, maximizing the learnable problems encountered. EIRL and this curriculum learning together hint at some class of methods reminiscent of the movie Edge of Tomorrow 3, where throwaway runs acquire information or data that sets the stage for a speedrun final training.\nFast-forward to fall 2025, and in the DeepSeek-V3.2-Exp technical report, DeepSeek describes training math, coding, logic, and search specialists. The specialists are then distilled into the final model, exactly as EIRL describes. It\u0026rsquo;s unclear if DeepSeek was inspired by ether0\u0026rsquo;s methods, but it is evidence that EIRL is generally applicable.\nSources # futurehouse/ether0 text-generation futurehouse/ether0-benchmark ether0-benchmark QA benchmark (test set) for the ether0 reasoning language model: https://huggingface.co/futurehouse/ether0 This benchmark is made from commonly used tasks - like reaction prediction in USPTO/ORD, molecular captioning from PubChem, or predicting GHS classification. It\u0026#39;s unique from other benchmarks in that all answers are a molecule. It\u0026#39;s balanced so that each task is about 25 questions, a reasonable amount for frontier model evaluations. The tasks generally follow… See the full description on the dataset page: https://huggingface.co/datasets/futurehouse/ether0-benchmark. dataset Future-House/ether0 A scientific reasoning model, dataset, and reward functions for chemistry. Python
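As promised above, here is a minimal sketch of the data-checkpoint filtering that feeds EIRL. The record fields ("problem_id", "step", "correct", "clean_reasoning") are hypothetical stand-ins, not our actual export schema.

```python
# A minimal sketch of the "data checkpoint": rejection sampling RL
# completions into an SFT dataset. Field names are hypothetical.
def build_sft_dataset(completions: list[dict]) -> list[dict]:
    latest: dict[str, dict] = {}
    for record in completions:  # (1) every completion exported during RL
        # (2) Rejection sampling: drop incorrect answers and flawed reasoning
        if not (record["correct"] and record["clean_reasoning"]):
            continue
        # (3) Deduplicate per problem, keeping the latest training step
        prior = latest.get(record["problem_id"])
        if prior is None or record["step"] > prior["step"]:
            latest[record["problem_id"]] = record
    return list(latest.values())
```

The "latest step wins" rule is what biases the dataset toward the most-trained behavior while the correctness and reasoning filters discard whatever underpinned the flawed traces.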
Closing # ether0 was a very memorable project; here is a song embodying what kicking off some of the big runs felt like:\nether0 was post-trained from Mistral-Small-24B-Instruct-2501.\nAndrew detailed the battles against reward hacks in his blog post building reward functions.\nAn Edge of Tomorrow-inspired presentation, featuring EIRL and the custom curriculum learning, can be found on GitHub.\n","date":"2025-06","externalUrl":null,"permalink":"/projects/ether0/","section":"Projects","summary":"Training a scientific reasoning model for chemistry","title":"ether0","type":"projects"},{"content":"","date":"2025-06","externalUrl":null,"permalink":"/tags/futurehouse/","section":"Tags","summary":"","title":"FutureHouse","type":"tags"},{"content":"","date":"2025-06","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"},{"content":" Training agents to operate tools in complex scientific environments. It was summer 2024, and \u0026ldquo;2025, the year of the agents\u0026rdquo; felt just around the corner. Contemporary language agents were untrained, made of a foundation model such as Claude 3.5 connected to a task-specific toolset.\nAviary began with the goal of actually training our agents, improving their ability to complete scientific tasks beyond the skills of foundation models wrapped in a simple tool-calling loop. We were also aware of the lack of an overarching mathematical framework for language agents and tools, where every agent paper featured a flowchart diagram of emoji entities.\nAviary and LDP (Language Decision Process) were sister frameworks built concurrently within FutureHouse to address these problems and actually train agents.\nAviary models stochastic environments designed for interaction with language agents. We included high-quality implementations of five environments for scientific tasks such as literature-based question answering. LDP connects trainable agents and stochastic environments together in what we called a \u0026ldquo;language decision process\u0026rdquo;, a special case of a partially observable Markov decision process (POMDP) where actions and observations are in natural language. Agents are implemented with a stochastic compute graph traversable by an optimizer, and house learnable behaviors such as language models, memories, or prompts. Our paper was accepted to ICLR 2025\u0026rsquo;s Scaling Self-Improving Foundation Models workshop and featured in Jack Clark\u0026rsquo;s Import AI newsletter.\nFindings # Aviary and LDP bridge the gap between reinforcement learning and language agents + environments. This is made practical by our open source Python frameworks and accompanying environment/agent implementations. Other notable findings include:\nExpert iteration is a simple and effective method for training agents. Trained open-weight agents can attain the same or better performance, at less than 1% of the inference cost of closed frontier models. In August 2025 I gave a talk featuring EI with other self-improvement techniques; its slides are on GitHub. Majority vote remains applicable to language decision processes, unlocking an additional 10% accuracy over majority voting on the base LLM (sketched in code below).
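Here is the majority voting sketch referenced above, in plain Python; it is an illustration of the idea, not the ldp library API.

```python
# Plain-Python majority voting over k sampled rollouts; an illustration,
# not the ldp library's API.
from collections import Counter


def majority_vote(answers: list[str]) -> str:
    """Return the modal answer across sampled rollouts."""
    return Counter(answers).most_common(1)[0][0]


# E.g., five rollouts of one agent on the same question:
assert majority_vote(["A", "B", "A", "A", "C"]) == "A"
```

The same aggregation applies whether the k samples come from re-prompting a base LLM or from full agent-environment rollouts, which is why the technique transfers to language decision processes.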
One LDP feature we did not fully explore was the utility of backing agents with a stochastic compute graph. We explored prompt and long-term memory optimization, but did not have the time to fully bake these techniques.\nAs of summer 2025, Aviary and LDP are actively maintained, and their paradigm remains relevant. Going forward it will be interesting to see how two design axes evolve:\nState space: perhaps code-writing systems such as Cursor or Claude Code move away from a message-based primitive. Action space: new agent approaches may not use function calling. For example, Biomni uses code execution instead of standalone function calling. Source Code # Future-House/aviary A language agent gym with challenging scientific tasks Python Future-House/ldp Framework enabling modular interchange of language agents, environments, and optimizers Python ","date":"2024-12","externalUrl":null,"permalink":"/projects/aviary/","section":"Projects","summary":"Training language agents on challenging scientific tasks","title":"Aviary and LDP","type":"projects"},{"content":" Code is read more often than it is written. - Guido van Rossum Here\u0026rsquo;s an alternate wording for that quote: code is about communication: communicating instructions to machines, and concepts to developers.\nKnowing this reality, I find most developers underestimate the value of clean code. Clean code is a blessing because readers can focus on the task at hand. On the contrary, messy code is full of distractions; readers waste mental clock cycles thinking about latent bugs or inefficiencies.\nThe ability to draft clean code is not necessarily a product of experience or of investing a bunch of time manually reviewing code diffs; it comes with the right developer toolchain.\nOrigins # While working on the biotech startup Synthego\u0026rsquo;s innovation team, I often worked in a greenfield without code review. Paired with the reality that immunotherapy research involves multi-week experiments and costly reagents, preventable software failures (e.g. TypeErrors) were excruciating.\nI became interested in automated quality assurance, both to prevent bugs and to receive feedback on my programs. Modern conveniences such as Python typing, pyproject.toml-centralized configs, and GitHub Actions didn\u0026rsquo;t yet exist, though flake8 plugins were plentiful. I began building a toolchain just for myself, discovering where the tools helped, missed, and hindered.\nAdoption # Over the years, I passively expanded the toolchain, growing it for new discoveries like codespell or black preview rules, disabling aspects found unhelpful, and contracting it alongside consolidations into ruff. Like a plumber\u0026rsquo;s toolbox, I repeatedly ported it to my current setting: production software at Synthego, group projects in Stanford computer science, open-source software facing easily-prevented bugs.\nI started receiving gratitude from colleagues who had not experienced such a toolchain before, and this began to give me ideas. Leaving Synthego for FutureHouse in early February 2024, I created the GitHub repo configurator with a two-part vision:\nCreating a single toolchain fluent for both early-stage exploratory research and production software. Building an automated yet flexible system to propagate tooling updates across repos. Fast-forward in time, and item 1 has come true. FutureHouse has adopted the configuration system org-wide.
It\u0026rsquo;s been used for research projects such as PaperQA2, aviary, and ether0, and has been adopted universally by our platform engineering team.\nItem 2 has some weak competition from tools like cookiecutter or nitpick, but really the ultimate solution will be an AI agent. Maybe someday there will be time to properly tackle item 2.\nSource Code # jamesbraza/configurator Tool with configurations for the creation of better software Python ","date":"2024-08","externalUrl":null,"permalink":"/projects/configurator/","section":"Projects","summary":"One configured toolchain valuable in research, development, and production settings","title":"Configurator","type":"projects"},{"content":"","date":"2024-08","externalUrl":null,"permalink":"/categories/engineering/","section":"Categories","summary":"","title":"Engineering","type":"categories"},{"content":"","date":"2024-08","externalUrl":null,"permalink":"/tags/quality-assurance/","section":"Tags","summary":"","title":"Quality Assurance","type":"tags"},{"content":"Hello, my name is James Braza, and welcome to my professional website.\nI work onsite at FutureHouse on both the learning and platform teams. Through FutureHouse I have trained LLMs and built agentic systems, and have authorship on multiple conference papers. Previously I worked onsite at Synthego in the Advanced Technologies Group for five years, creating software systems with Python. I also took graduate computer science coursework in artificial intelligence at Stanford part-time for 2.5 years, completing almost enough graduate coursework to fulfill an MSCS.\nI grew up in New England and received a bachelor\u0026rsquo;s degree in mechanical engineering and a minor in economics at the University of Pittsburgh. I specialized my coursework towards robotics and was quite active within Pittsburgh\u0026rsquo;s Robotics Club. In summer 2016 I interned within Tesla\u0026rsquo;s powertrain manufacturing operations, doing machine vision. I graduated from Pittsburgh in April 2017, then moved to California to work for SpaceX as an automation engineer in vehicle engineering\u0026rsquo;s propulsion components.\nI am most interested in ML research and consider myself a technologist. I believe in open collaboration, that mistakes are forgotten with quick iteration, and in an overall emphasis on strong engineering. I consider us to be in the era of the great builder. Thanks for visiting, and feel free to reach out.\nWebsite # Hosting: Namecheap shared. Framework: Hugo. Theme: Blowfish. ","date":"2023-07","externalUrl":null,"permalink":"/about/","section":"Welcome to jamesbraza.com","summary":"Extended description of myself","title":"About","type":"page"},{"content":" Collection of the past decade\u0026rsquo;s DIY, academic, and open source projects. ","date":"2023-07","externalUrl":null,"permalink":"/projects/","section":"Projects","summary":"Homepage for projects","title":"Projects","type":"projects"},{"content":"I build intelligent systems to solve real world problems.\n","date":"2023-07","externalUrl":null,"permalink":"/","section":"Welcome to jamesbraza.com","summary":"Homepage for the entire site","title":"Welcome to jamesbraza.com","type":"page"},{"content":" Convolutional U-Net models can be trimmed and tuned to fit on miniature computers, enabling real-time inference within medical devices. Introduction # In spring 2023, I took CS231n: Deep Learning for Computer Vision at Stanford.
My group of three chose to explore volumetric (\u0026ldquo;3-D\u0026rdquo;) segmentation of MRI images. Using 2020 data from the University of Pennsylvania\u0026rsquo;s Brain Tumor Segmentation (BraTS) Challenge, we trained and experimented with multiple U-Net architectures.\nThe dataset consisted of 369 labelled examples, each containing 5 MRIs:\nFour input scans: T2 FLAIR, T1, T1 contrast enhanced, T2 One mask of four classes: non-tumor (0b000), non-enhancing tumor core (0b001), peritumoral edema (0b010), and Gadolinium-enhancing tumor (0b100) A 3-D U-Net was given four MRI scans (top) and segmented them into tumor classes. This cross-section shows the target mask (left) and binarized predictions (right). U-Net Architecture # The U-Net architecture is named for its U-shaped encoder-decoder structure; we used 4 or 5 levels. The sigmoid\u0026rsquo;s output is binarized in post-processing for predictions, and the binary threshold is a hyperparameter. One can choose to use 3-D or 2-D convolutions in the U-Net:\n3-D convolution: can directly intake a 3-D MRI and leverage 3-D spatial information, at the cost of 3X more weights 2-D convolution: an MRI now becomes a list of 2-D images, so to process an MRI the model is internally doing a nested for loop We experimented with both 2-D and 3-D convolutions.\nOne note: because the network\u0026rsquo;s raw outputs are logits, the loss function used was the equally-weighted sum of binary cross-entropy (with logits) loss and Dice loss (sketched in code below).\nBinary Threshold Tuning # We swept across possible global binary thresholds, computing the metric intersection over union (IoU) on our validation set, to optimize the threshold. We can see the 2-D and 3-D convolutions require different thresholds, and 3-D convolutions attain an almost universally higher IoU. Other Findings # 3-D convolutional layers outperform 2-D layers, indicating 3-D U-Nets are actually leveraging 3-D spatial information Applying one global binary threshold to convert raw predictions to class predictions is as performant as per-mask binary thresholding Global parameter pruning, encoder-decoder pair pruning, and reducing weight precision (float32 to float16) were effective methods of reducing model size without hampering performance Source Code # jamesbraza/cs231n-3d-segmentation Stanford CS231n Deep Learning for Computer Vision Class Project Python ","date":"2023-04","externalUrl":null,"permalink":"/projects/3d-segmentation/","section":"Projects","summary":"Volumetric segmentation of brain MRIs with minimal compute","title":"3-D Segmentation of Brain MRIs","type":"projects"},{"content":"","date":"2023-04","externalUrl":null,"permalink":"/categories/coursework/","section":"Categories","summary":"","title":"Coursework","type":"categories"},{"content":"","date":"2023-04","externalUrl":null,"permalink":"/tags/medical-imaging/","section":"Tags","summary":"","title":"Medical Imaging","type":"tags"},{"content":"","date":"2023-04","externalUrl":null,"permalink":"/tags/perception/","section":"Tags","summary":"","title":"Perception","type":"tags"},{"content":"","date":"2023-04","externalUrl":null,"permalink":"/tags/stanford-cs/","section":"Tags","summary":"","title":"Stanford CS","type":"tags"},{"content":"","date":"2023-01","externalUrl":null,"permalink":"/tags/chess-ai/","section":"Tags","summary":"","title":"Chess AI","type":"tags"},
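Circling back to the 3-D segmentation loss mentioned above, here is a minimal sketch of the equally-weighted BCE-with-logits + Dice combination. It assumes PyTorch and is illustrative, not the project's exact code.

```python
# Equally-weighted BCE-with-logits + Dice loss, as described in the
# U-Net section above. Assumes PyTorch; not the project's exact code.
import torch
import torch.nn.functional as F


def segmentation_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """logits and target share shape (B, C, D, H, W); target is a float 0/1 mask."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    probs = torch.sigmoid(logits)  # Binary thresholding happens only at inference
    intersection = (probs * target).sum()
    dice = 1 - (2 * intersection + 1e-6) / (probs.sum() + target.sum() + 1e-6)
    return bce + dice  # Equal weighting of the two terms
```

BCE supplies smooth per-voxel gradients while the Dice term directly targets overlap, which is why the pairing is common for imbalanced tumor masks.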
{"content":" AlphaGo Zero-style chess-playing AI with our own chess state-action embedding technique. In winter 2023, I took CS234: Reinforcement Learning at Stanford. We found that no open-source AlphaGo Zero-style chess AI with available weights existed, so we sought to be the first (and to explore recreating the AlphaGo Zero paper simultaneously).\nChess Representation # To get started quickly, we assumed a conservative action space \\( |A| = 64^2 = 4096 \\) moves and no pawn underpromotion (only queen promotions), and on any given turn constrained our probability distribution to valid moves.\nGame state was a two-tuple of a chess Board object and a player ID (white or black). We \u0026ldquo;canonicalized\u0026rdquo; the board view, meaning the board was flipped for the black player, so the network always had the same perspective (bottom looking up). We also mirrored the board + player to store two training examples per self-play move. Lastly, we used the library python-chess to handle all chess-specific rules.\nOur signed one-hot Board embedding used one board per piece (0: pawn, 1: knight, 2: bishop, 3: rook, 4: queen, 5: king), with elements being 1 for a white piece, -1 for a black piece, and 0 for no piece. Thus, the signed embedding has shape 6 x 8 x 8 (sketched in code below). An alternate unsigned one-hot embedding uses 12 boards containing only 0 or 1, giving players separate boards. Architecture Versions # Both versions of our architecture adhered to the same signature:\nInput: Batch of board embeddings of shape \\( (B, D, H, W) \\) D = 6 (signed embedding) or 12 (unsigned embedding) Policy head output: batch of policy distributions of shape \\( (B, |A|) \\) Value head output: batch of values of shape \\( (B, 1) \\) Training matched the AlphaZero paper:\nReward: 1 for win, -1 for loss, 0 if ongoing, \\( 10^{-5} \\) for tie Loss: sum of cross-entropy loss on the policy head, mean squared error loss on the value head, and L2 regularization The v1 network has a residual tower of 3-D convolutions and both a policy and a value head. The v2 network moves to 2-D convolutions to avoid convolving piece types (thus preserving piece differences), adds dropout to combat policy head overfitting, and vastly shrinks the network to match the AlphaGo Zero paper\u0026rsquo;s hyperparameters. Findings # Embedding pieces into an unsigned 12 x 8 x 8 matrix demonstrated superior gameplay over a signed 6 x 8 x 8 matrix, at the expense of training time Our v2 network was superior to our v1 network when using the same piece embedding Both the v1 and v2 networks could beat a random player, but neither was able to beat a 1350 Elo Stockfish engine player Source Code # jamesbraza/cs234-dreamchess Stanford CS234 Reinforcement Learning Class Project Python ","date":"2023-01","externalUrl":null,"permalink":"/projects/dreamchess/","section":"Projects","summary":"AlphaGo Zero-style chess-playing AI and state-action embedding system","title":"DreamChess: Chess-Playing AI","type":"projects"},{"content":"","date":"2023-01","externalUrl":null,"permalink":"/tags/offline-learning/","section":"Tags","summary":"","title":"Offline Learning","type":"tags"},{"content":"","date":"2023-01","externalUrl":null,"permalink":"/tags/reinforcement-learning/","section":"Tags","summary":"","title":"Reinforcement Learning","type":"tags"},
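As an addendum to the DreamChess chess representation above, here is a sketch of the signed one-hot 6 x 8 x 8 board embedding. It uses the python-chess package and is illustrative, not our exact implementation.

```python
# The signed one-hot 6 x 8 x 8 board embedding described above; a sketch
# using python-chess, not the project's exact implementation.
import chess
import numpy as np

PIECE_TO_PLANE = {chess.PAWN: 0, chess.KNIGHT: 1, chess.BISHOP: 2,
                  chess.ROOK: 3, chess.QUEEN: 4, chess.KING: 5}


def embed_board(board: chess.Board) -> np.ndarray:
    """One plane per piece type: +1 for white, -1 for black, 0 for empty."""
    planes = np.zeros((6, 8, 8), dtype=np.int8)
    for square, piece in board.piece_map().items():
        rank, file = divmod(square, 8)  # Squares are numbered 0 (a1) to 63 (h8)
        planes[PIECE_TO_PLANE[piece.piece_type], rank, file] = (
            1 if piece.color == chess.WHITE else -1
        )
    return planes


assert embed_board(chess.Board()).sum() == 0  # Starting position is symmetric
```

The unsigned variant would instead allocate 12 planes of 0/1, trading a larger input tensor for player-separated piece channels.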
{"content":" Studied transfer learning fundamentals by predicting test-set accuracy given fine-tuning dataset paired with pre-trained model. In autumn 2022, I took CS330: Deep Multi-Task and Meta Learning at Stanford. My partner and I delved into the foundations of transfer learning by attempting to build a network or technique that can inform which transfer learning starting point to use, given a fine-tuning dataset. Our model, which we called ChoiceNet, had the following signature:\nInput: two-tuple Pre-trained model\u0026rsquo;s weights or training dataset Fine-tuning dataset Output: test-set accuracy after fine-tuning In other words, if you know your fine-tuning dataset and can pick from 2+ pre-trained models, is there a heuristic or network to choose the frozen base model for fine-tuning?\nTransfer Learning Dataset # To begin experimentation, we fabricated a dataset mapping:\nX: (transfer dataset or pre-trained weights, fine-tuning dataset) Y: test-set accuracy We used a very simple CNN as our core model being transfer learned (\u0026quot;TransferModel\u0026quot;), so we could quickly pre-train and export weights. Our fine-tuning dataset was of plant leaves. We created four categories of datapoints:\nConstant and similar to fine-tuning dataset (via TensorFlow plant_village dataset) Constant and dissimilar to fine-tuning dataset (via bird species dataset) Random 10-class subsets of CIFAR-100 or ImageNet Empty dataset (no pre-training = random initialization), as an experimental control Diagram showing the hierarchy of transfer learning dataset creation. Architecture Versions # ChoiceNet v1 took the following actions:\nEmbedding the pre-trained model: flattened weights from the TransferModel\u0026rsquo;s last 2-D convolution Embedding the fine-tuning dataset: applied 256-element principal component analysis (PCA) for bulk reduction, then averaged along examples (sketched in code below) Combining both embeddings: (1) LoRA-similar reduction of embedded pre-trained model, (2) concatenate both embeddings, and (3) pass into head of two fully-connected layers with dropout ChoiceNet v1 architecture. ChoiceNet v2 addressed a few shortcomings:\nFine-tuning dataset\u0026rsquo;s average along examples removed too much information Class-specifics were entirely ignored Transfer learning dataset was unused To fix these issues, ChoiceNet v2 does the following:\nEmbedding the pre-trained model and transfer dataset: use activations (not weights) from the last 2-D conv, and average activations per-class Embedding the fine-tuning dataset: use activations from a pre-trained ResNet50 v2, average along classes, and keep the 10 classes with the highest average absolute value activation Combining both embeddings: (1) LoRA-similar reduction to both embeddings, (2) sum the embeddings (instead of concatenation), and (3) pass through the same fully-connected head as ChoiceNet v1 ChoiceNet v2 architecture.
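To ground the v1 dataset-embedding step referenced above, here is a hedged sketch assuming scikit-learn; the array shapes are illustrative placeholders, not our actual data dimensions.

```python
# ChoiceNet v1's fine-tuning dataset embedding, as described above: PCA to
# 256 components, then averaging along examples. Assumes scikit-learn;
# shapes are illustrative.
import numpy as np
from sklearn.decomposition import PCA

images = np.random.rand(1000, 64 * 64 * 3)  # 1000 flattened example images
reduced = PCA(n_components=256).fit_transform(images)  # Shape (1000, 256)
dataset_embedding = reduced.mean(axis=0)  # Average along examples -> (256,)
```

As noted in the v2 shortcomings, this mean over examples discards class structure, which motivated v2's per-class activation averaging.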
Findings # A disclaimer: this project\u0026rsquo;s core question was incredibly broad. Given just one quarter, we used only one fine-tuning dataset and one domain (supervised image classification).\nScatter plot of ChoiceNet v2\u0026rsquo;s predictions vs. actual. We can observe ChoiceNet v2 underestimates performance, and a similar transfer learning dataset performs best. ChoiceNet v2 decreases test-time MSE loss by 80% and underestimates final accuracy vastly less than ChoiceNet v1 We created several unsupervised mathematical techniques (not detailed in this article) using distribution distance, raw pixel value distributions, or average class correlation to predict test-set performance Source Code # jamesbraza/cs330-project Stanford CS330 Deep Multi-Task and Meta Learning Class Project Jupyter Notebook ","date":"2022-09","externalUrl":null,"permalink":"/projects/choicenet/","section":"Projects","summary":"Network that assists in choosing one’s transfer learning dataset or model","title":"ChoiceNet: Quantitative Transfer Learning","type":"projects"},{"content":"","date":"2022-09","externalUrl":null,"permalink":"/tags/meta-learning/","section":"Tags","summary":"","title":"Meta Learning","type":"tags"},{"content":"","date":"2022-09","externalUrl":null,"permalink":"/tags/transfer-learning/","section":"Tags","summary":"","title":"Transfer Learning","type":"tags"},{"content":"","date":"2022-09","externalUrl":null,"permalink":"/tags/abstract/","section":"Tags","summary":"","title":"Abstract","type":"tags"},{"content":"I run a website where I post my thoughts and other minutiae.\nI will add more content in time.\nLink: onetwofoureight ","date":"2022-09","externalUrl":null,"permalink":"/projects/onetwofoureight/","section":"Projects","summary":"Website where I post my thoughts and other minutiae","title":"onetwofoureight.com","type":"projects"},{"content":"","date":"2022-09","externalUrl":null,"permalink":"/tags/philosophy/","section":"Tags","summary":"","title":"Philosophy","type":"tags"},{"content":"","date":"2022-09","externalUrl":null,"permalink":"/tags/thoughts/","section":"Tags","summary":"","title":"Thoughts","type":"tags"},{"content":"","date":"2022-04","externalUrl":null,"permalink":"/tags/pycon/","section":"Tags","summary":"","title":"PyCon","type":"tags"},{"content":"I curate a website (updated annually) about the best talks, wisdom, facts, and language information from the Python conference.\nLink: pycon-redux ","date":"2022-04","externalUrl":null,"permalink":"/projects/pycon-redux/","section":"Projects","summary":"Annually updated redux of Python Conference (PyCon) talks","title":"pycon-redux.com","type":"projects"},{"content":"","date":"2022-04","externalUrl":null,"permalink":"/tags/python/","section":"Tags","summary":"","title":"Python","type":"tags"},{"content":"","date":"2022-03","externalUrl":null,"permalink":"/tags/image-classification/","section":"Tags","summary":"","title":"Image Classification","type":"tags"},{"content":" Image classification of laundry with 99.5% accuracy. In spring 2022, I took CS230: Deep Learning at Stanford. This was my first AI project fully traversing training, validation, and testing.
Our goal was attaining the highest possible classification accuracy by applying many different machine learning techniques.\nSearching for Best Training #\nTechnique | Baseline model | Test-set F1 score improvement over baseline\nTraining from random initialization (no ImageNet pre-training) | ResNet50 | -28.7%\nData augmentation: ≤20% width/height translation, horizontal flip, ≤20% rotation/zoom | VGG16 | +4.0%\nData augmentation: 2.5X more data | VGG16, ResNet50 | +15.3%, +9.2%\nRegularization search for L2 regularization and dropout | VGG16 | -1.9%\nSwitching architecture from VGG16 to ResNet50 | VGG16 | +5.0%\nVisualizing Performance: Home Dataset # To gain insight into the performance of full-stack training, we:\nCombined two Kaggle datasets\u0026rsquo; training subsets and trained a ResNet50 from random initialization on this. The datasets are clothing dataset small and Clothing dataset (full, high resolution) Created a dataset of my own personal clothing items, called the home dataset (matching the classes). This dataset was used as an evaluation set for unseen data. The end result is shown in the below figure.\nChart showing a ResNet50 trained from random initialization being evaluated on the home dataset. Red text means an incorrect prediction; parenthesis-enclosed text is the actual class. Findings # Semantically-blurry classes (e.g. long sleeve vs. outerwear) held back our accuracy A VGG16 pre-trained on ImageNet, then trained on our 2.5X larger dataset, yielded the best test-set accuracy of 99.5% Our model didn\u0026rsquo;t generalize to unseen classes in a few-shot learning scenario (test-set F1 score was 35.9%), so there was room for future improvement Source Code # jamesbraza/cs230-project Stanford CS230: Deep Learning Class Project Python ","date":"2022-03","externalUrl":null,"permalink":"/projects/laundry-classification/","section":"Projects","summary":"High-accuracy image classification of common clothing types","title":"Laundry Image Classification","type":"projects"},{"content":"I began my graduate AI coursework in Autumn 2021 with Stanford\u0026rsquo;s CS221: Artificial Intelligence: Principles and Techniques. This was my first AI class, and for the class project I chose to reproduce the findings of the Point Completion Network (PCN) paper from Carnegie Mellon University.\nFor my dataset, I used Stanford\u0026rsquo;s completion3D dataset, as this was cited by most major shape reconstruction papers at the time.\nBaselining Distance Metrics # A first experiment I ran was feeding in different percentages of the input partial point cloud (shown left below), and seeing how distant the reconstructed point cloud was from the ground truth point cloud (Chamfer Distance is sketched in code below).\nGiven only 7.5% of the input partial point cloud, we observe the reconstruction (middle) is poor. The distance metrics collected can serve as baseline measurements for low-quality reconstructions. Given all 100% of the input partial point cloud, the reconstruction (middle) is much better, and this improvement is reflected in the lowered distance metrics.
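Since these comparisons lean on point cloud distance metrics, here is a minimal numpy sketch of symmetric Chamfer Distance; the project's actual metrics came from the PCN codebase.

```python
# A minimal numpy sketch of symmetric Chamfer Distance between two point
# clouds; illustrative, not the PCN codebase's implementation.
import numpy as np


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """a has shape (N, 3), b has shape (M, 3)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)  # Pairwise (N, M)
    # Average nearest-neighbor distance, in both directions
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

Because each direction is a mean over nearest-neighbor distances, the metric is largely insensitive to point count so long as the distributions match, echoing the finding below.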
Other Findings # Reliable reconstruction required at least 10% of the original object to be present Point cloud distance metrics Chamfer Distance and Earth Mover\u0026rsquo;s Distance are not affected by the number of points, as long as the distribution is similar PCN has a fundamental limitation that leads to minute details of the ground truth cloud not showing in the reconstruction Source Code # jamesbraza/pcn Code for CS221 Course Project working with PCN and ShapeNet data Python ","date":"2021-09","externalUrl":null,"permalink":"/projects/point-completion-network/","section":"Projects","summary":"Reproduced the Point Completion Network paper using completion3D data","title":"3-D Shape Reconstruction","type":"projects"},{"content":"","date":"2017-09","externalUrl":null,"permalink":"/categories/diy/","section":"Categories","summary":"","title":"DIY","type":"categories"},{"content":"","date":"2017-09","externalUrl":null,"permalink":"/tags/rav4/","section":"Tags","summary":"","title":"RAV4","type":"tags"},{"content":"After moving to California, I decided to learn how to work on my 2008 Toyota RAV4. I have done a lot of work on this car since 2017 1, but swapping the drive belt to this day was the hardest job:\nFor example, check the Instructable Toyota RAV4 2008 Coolant Change and Thermostat Swap.\n","date":"2017-09","externalUrl":null,"permalink":"/projects/rav4-timing-belt/","section":"Projects","summary":"Swapping my 2008 Toyota RAV4’s drive belt in fall 2017","title":"RAV4 Timing Belt Swap","type":"projects"},{"content":"In my junior year at Pitt, I took MEMS1049: Mechatronics, a practical class on embedded systems and hardware. The final project was to use mechatronics concepts to play a snippet of the Star Spangled Banner.\nMy two partners and I chose to build a slide whistle from scratch, controlled with two ATmega328P microcontrollers. We breadboarded a graphics equalizer using an FFT chip, and incorporated a motor drive, stepper motor, and timing belt for slide control.\nThe automated slide whistle sitting on a lab bench. ","date":"2017-04","externalUrl":null,"permalink":"/projects/slide-whistle/","section":"Projects","summary":"Slide whistle that could automatically play part of the Star Spangled Banner","title":"Automated Slide Whistle","type":"projects"},{"content":"","date":"2017-04","externalUrl":null,"permalink":"/tags/automation/","section":"Tags","summary":"","title":"Automation","type":"tags"},{"content":"","date":"2017-04","externalUrl":null,"permalink":"/tags/pitt-engineering/","section":"Tags","summary":"","title":"Pitt Engineering","type":"tags"},{"content":"","date":"2017-04","externalUrl":null,"permalink":"/tags/robotics/","section":"Tags","summary":"","title":"Robotics","type":"tags"},{"content":"","date":"2016-12","externalUrl":null,"permalink":"/tags/solar/","section":"Tags","summary":"","title":"Solar","type":"tags"},{"content":"I was Workshop Chair for Pitt\u0026rsquo;s Engineers for a Sustainable World during my senior year. I designed and led workshops on DIY electronics for students of any major or experience level.
I taught groups of ten to twenty students using PowerPoint presentations at first, and later Instructables, as my guides.\nWorkshops:\nSolar USB Cell Phone Charger for under $10\nSpeaker from Spare Parts for under $4\nSolar Panel for under $30/panel ","date":"2016-12","externalUrl":null,"permalink":"/projects/workshop-chair/","section":"Projects","summary":"Designed and administered several workshops as Workshop Chair for Engineers for a Sustainable World","title":"Workshop Chair","type":"projects"},{"content":" Invention of a laundry folding robot, designed to fold shirts, pants, towels, and some outerwear, and demonstrated to be capable of folding towels My senior design project was the creation of a laundry folding robot we named Foldie. Ours was an interdepartmental (mechanical and electrical) team of four. We worked together to determine robotic folding algorithms, and then designed, fabricated, and programmed all electromechanical mechanisms.\nHistorically, faculty or industry partners mentor senior design teams; however, we mentored ourselves as an entrepreneurial effort. The senior design professor appreciated this so much that he repeated the project for six future senior design iterations, and wrote a paper about it: IEEE: A Method to Provide Student Peer Mentorship within the Capstone Experience. Afterward, we worked as senior design undergraduate TAs to mentor the following semester\u0026rsquo;s student group.\nFoldie won 2nd best mechanical design and 2nd best electrical design in the fall 2016 senior design exposition, shown below:\nOur team also won best senior design presentation, shown below:\nI was in charge of mechanizing tape measures as linear actuators, cobbling together custom conveyor belts, and providing a crease holder mechanism. I turned our conveyor belt methodology into an Instructable: Inexpensive Flat Conveyor Belts.\nFeel free to view my partner Derek\u0026rsquo;s project page on Foldie.\nFoldie is my favorite project of all time, and it inspired my laundry image classification project five years later:\nLaundry Image Classification 2022-03 Artificial Intelligence Coursework Image Classification Perception Stanford CS ","date":"2016-09","externalUrl":null,"permalink":"/projects/foldie/","section":"Projects","summary":"Design project making a prototype laundry folding robot","title":"Foldie, the Laundry Folding Robot","type":"projects"},{"content":"","date":"2015-09","externalUrl":null,"permalink":"/tags/animatronics/","section":"Tags","summary":"","title":"Animatronics","type":"tags"},{"content":" Building a kinematic panther statue for Pitt engineering. ArtBot was a team of ~15 students focused on making an animatronic panther statue to serve within an infotainment kiosk for Pitt\u0026rsquo;s engineering building. The project was funded by Dean Gerald Holder\u0026rsquo;s office. I was the project\u0026rsquo;s co-lead for two years, leading all electromechanical work and also general project administration.\nHere is a video recorded the same day we gave a demonstration to Rockwell Automation, a company that partly funded the club\u0026rsquo;s costs.\nThis second video is of the arm and spine assemblies actuating, on our path to realistic panther motion.
I personally designed and fabricated these arm and spine assemblies.\n","date":"2015-09","externalUrl":null,"permalink":"/projects/artbot/","section":"Projects","summary":"Robotics club project building a kinematic panther statue","title":"ArtBot: Kinematic Panther Statue","type":"projects"},{"content":"My Boy Scouts of America Eagle Scout project was building an information kiosk for George M. Bush Park in Doylestown, Pennsylvania. My mentor was a retired carpenter named Bruce Burkart, and the project was funded via donations from the Doylestown United Methodist Church.\nThe signpost was built in my parents\u0026rsquo; driveway under Bruce\u0026rsquo;s leadership, alongside my parents and fellow troop members. Buckingham Township handled the kiosk installation.\nThis information kiosk just after installation. The kiosk displays township news/announcements and a park map on cork board. It\u0026rsquo;s protected by lockable Plexiglas doors and has a roof for rain protection. The kiosk remains operational to this day. ","date":"2012-06","externalUrl":null,"permalink":"/projects/eagle-scout-project/","section":"Projects","summary":"Eagle Scout project was a signpost, made in 2012, for George M. Bush Park","title":"Eagle Scout Project","type":"projects"}]