discrete features:
- parse compilation starts [0/0] annotation and create a visualization of the stack and how they relate together
  - each compilation represents a stack trace
  - a compilation is an ancestor of another compilation if its stack trace is a prefix of the other stack trace
  - this is a tree structure, so can use dot to visualize
  - sort of like a flame graph, but there is no logical length (maybe could use compilation time!)
- correlating generated fx graph / dynamo trace / sizes
  - just get all the graphs, display them in an indexed fashion
  - use case: https://fb.workplace.com/groups/1075192433118967/posts/1377556259549248/ did the sizes change?
- why are our logs so big / who is generating all the logs
- rank comparison / differ (for rank desynchronization problems due to nondeterminism)
- rank tracing (compare how compilation is proceeding on different ranks, is it imbalanced, debug compile time performance problems)

infrastructure:
- parse out a single compilation [0/0]
- start/end time for compilation

some problems:
- downloading all logs from logarithm takes surprisingly long (using lg command)
- there may be a lot of logs with a given MAST job, as there may be multiple flows pasted together, sometimes not obvious which one to look at

sample logs:
- https://fb.workplace.com/groups/1075192433118967/posts/1377556259549248 why cuda graph cause regression problem
  - ~/local/eval-log.txt - the eval log
  - ~/local/rank0-cudagraph-train.txt - the train log, rank0 only (cudagraph-log.txt)
  - interestingly, sometimes the ranks are interleaved in a naughty way
- > ~/log2.txt ~torch only~> ~/f2.txt
  - from lg tw:tsp_zch/mast_hpc/f524854032-TrainingApplication.trainers.mqdbca/0 --start-time=1706158048 --end-time=1706165989 --stream=stderr
  - this is the jon chuang config pr caused shampoo dynamic compile disaster
- ~/log.txt (flavio-log.txt)
  - this is flavio truzzi recent aps log
  - xref https://fb.workplace.com/groups/6829516587176185/posts/6829560007171843/
  - nb: this doesn't have all debug info

what do i want to change about the logs
- split into separate log file per rank to prevent splicing
  - dedicated_log file is OK
  - need some sort of hook for this
  - need some way to test this
- stack frame stored in single line and parseable
  - this is actively harmful without preventing muxing (because larger write is less likely to be atomic)

uploading functionality
- motivation
  - if you run tlparse on a server, and it generates html, want to be able to conveniently view it / share it to someone else, without having to download
  - otherwise, can only do plain text report and share via pastebin
- alternate models
  - perfetto/chrome trace viewer: generate a trace json, separate viewer you upload the file to
    - but note that internally we built a built-in viewer that you can link to with data directly. Convenient!
  - generate an html file, pop open browser to view

what does the one-size-fits-all command do (drive structured logging)
- extract all IR representations into separate files (preferably machine readable, but that's other people's problem)
- rendering these in human readable way, potentially *downstream* tool problem

use cases for the log parser
- there is some problem, you are trying to diagnose the problem from logs
  - but the logs are too big
    - because all the ranks are muxed together
    - because the dynamo debug logs are too spammy
      - because I can't actually tell what I'm running over from the Dynamo logs
      - because I don't actually know what the model is doing (pdb style view?)
      - because there are so many values on the stack
      - because this is a cursed model with lots of tiny tensors and lots of bytecodes and therefore traversal is terrible !!!
    - because the graph outputs are too big
      - because no one asked for tabular output
      - because the graph sizes are too far away from where you need them
      - the graph is so long so you can't easily jump to def/use
    - because the guard output is too big
      - because you can't easily find the recompiles logs
      - because the recompiles log doesn't say what exactly changed the next time
    - because the tracebacks are too big
      - finding the graph break information is finding a needle in haystack
      - because the restart analysis logs are annoying
    - because the inductor logs are too long
      - because I can't easily correlate inductor with aten being processed (godbolt style, but godbolt not useful because too difficult to do the full information)
      - because I don't know how to jump to the end of a section
        - dynamo -> aot -> inductor -> guard
  - but you can't get runnable artifacts from the logs
- you want to display some information, if you print everything fully detailed it's too much, so you want fold/expand html UI (then the dump representation is full information)
- what's same/different between ranks
- what's same/different between recompiles
- two users: PyTorch developers, mass market general developers
- you are working on a new model and you want to know how far along you are
- trace recording and visualization (but maybe just defer to zoomer)
- logarithm actually sort of sucks?
- it's too hard to figure out how to modify source code to hit some s0 as dynamic, from the logs
  - because I can't tell what the source of a size guard is
  - because automatic dynamic is printed by default
- that's weird, why is the same frame having very different behavior each time?
  - are we allocating separate numbers for the separate object instances?
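
sketch for "parse out a single compilation [0/0]" (infrastructure) and "you can't easily find the recompiles logs" (use cases): a minimal Python sketch that buckets log lines by their [i/j] annotation. Assumptions, none of which come from these notes: the annotation appears literally as two integers in square brackets somewhere in the line, the first match on a line is the one that counts, and lines without an annotation are kept under a None key. The function name and the sample lines are made up for illustration, they are not real log output.

```python
import re
from collections import defaultdict

# Assumption: a compilation is tagged in the log line as "[i/j]", e.g. "[0/0]".
COMPILE_ID = re.compile(r"\[(\d+)/(\d+)\]")


def group_by_compile_id(lines):
    """Bucket lines by their first [i/j] annotation.

    Lines with no annotation are collected under the key None so that
    nothing is silently dropped.
    """
    groups = defaultdict(list)
    for line in lines:
        match = COMPILE_ID.search(line)
        key = (int(match.group(1)), int(match.group(2))) if match else None
        groups[key].append(line)
    return dict(groups)


# Synthetic placeholder lines (not real log output).
sample = [
    "[0/0] line a",
    "[0/0] line b",
    "[0/1] line c",
    "line with no annotation",
    "[1/0] line d",
]

groups = group_by_compile_id(sample)
for key in sorted(k for k in groups if k is not None):
    print(key, len(groups[key]))
```

Each bucket could then be written to its own file, which is one way to get at a single compilation without scrolling through everything else.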
value added
- download (all) the logs in the first place
- put the result somewhere shareable
- automatically process tlparse when someone posts a log for help

meta plugin architecture
- want the plugin to automatically run
- choice: fbpkg distribution vs oss plus internal plugin
- choice: pyo3/maturin python plugin vs shelling out to executables
- plugin goals:
  - telemetry
  - log downloading
    - lg or tw command line tool
  - uploading
    - manifold cli into https://www.internalfb.com/intern/wiki/Development_Environment/Persistent_Storage/#raw-manifold-path-for-a https://www.internalfb.com/intern/wiki/Manifold/Getting_Started/Manifold_CLI/

feb 23 ideas
- ddpoptimize split needs a context
- post_grad_graph and output_code occasionally has no context; how to orient in this situation :think:
- would like to know code hash, so can generate links to files
- recompile
  - dynamic shape dimension changed
  - just collect them all in one place

mar 19 ideas
- print the nn module structure, whenever nn module is compiled
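
sketch for the compilation tree idea under discrete features (a compilation is an ancestor of another if its stack trace is a prefix of the other's, and the result can be drawn with dot): a minimal Python sketch, assuming each compilation has already been reduced to a tuple of frame names, outermost first. The helper names, the compile ids, and the frame names are hypothetical. Each compilation is attached to its nearest ancestor (the longest strict prefix), and two compilations with identical stacks are treated as siblings here, which is a choice of this sketch and not something the notes decide.

```python
def build_parents(stacks):
    """Map each compilation to its nearest ancestor, or None for a root.

    stacks: dict of compilation id -> tuple of frames, outermost first.
    An ancestor is any compilation whose stack is a strict prefix of ours;
    the nearest one is the ancestor with the longest stack.
    """
    parents = {}
    for name, stack in stacks.items():
        best = None
        for other, other_stack in stacks.items():
            if other == name or len(other_stack) >= len(stack):
                continue
            if stack[: len(other_stack)] == other_stack:
                if best is None or len(other_stack) > len(stacks[best]):
                    best = other
        parents[name] = best
    return parents


def to_dot(parents):
    """Render the parent map as Graphviz dot source."""
    lines = ["digraph compilations {"]
    for child in sorted(parents):
        lines.append(f'  "{child}";')
        if parents[child] is not None:
            lines.append(f'  "{parents[child]}" -> "{child}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical compilations and frames, for illustration only.
stacks = {
    "[0/0]": ("main", "train"),
    "[1/0]": ("main", "train", "forward"),
    "[2/0]": ("main", "train", "forward", "attention"),
    "[3/0]": ("main", "eval"),
}

parents = build_parents(stacks)
dot_source = to_dot(parents)
print(dot_source)
```

The printed text can be saved to a file and rendered with the Graphviz dot tool. The pairwise comparison is quadratic in the number of compilations; a trie keyed on frames would avoid that if the count gets large. Compilation time, if available, could be attached as a node label to get closer to the flame graph idea.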