"Has anyone ever used grep, cut, awk, sort, uniq with a csv file?"
May 31, 2022 2:14 AM

VisiData is an interactive multitool for tabular data. It combines the clarity of a spreadsheet, the efficiency of the terminal, and the power of Python, into a lightweight utility which can handle millions of rows with ease.

Mindblowing demo at the 2018 PyCascades lightning talks.

A comprehensive tutorial showing off most of the features.
posted by kmt (64 comments total) 122 users marked this as a favorite
 
I glanced at the website and one of the front-page endorsements is from a friend whose opinion on such things I trust without question so yeah, I'm sold.
posted by Hogshead at 2:52 AM on May 31, 2022 [1 favorite]


Every once in a while Metafilter throws a dart right at the centre of my Venn diagram of interests.

Yes, and thank you.
posted by Alex404 at 3:16 AM on May 31, 2022 [8 favorites]


Oooooooooooooo.

I don't have to mess with CSVs while on maternity leave but still. Thanks for sharing!
posted by freethefeet at 3:35 AM on May 31, 2022


"Has anyone ever used grep, cut, awk, sort, uniq with a csv file?"

Yes, I have. On more than one occasion I've expressed my wish that someone would write a tool for tabular data that wasn't a spreadsheet. Never did I imagine that when it happened it would look so perfectly awesome.
posted by Ickster at 3:36 AM on May 31, 2022 [11 favorites]


I used this a bit on my last job and it was a neat tool. Happily I'm not dealing with huge data sets or CSV files in my current job.
posted by octothorpe at 4:16 AM on May 31, 2022


Gotdamn I like the looks of this.
posted by wenestvedt at 4:18 AM on May 31, 2022 [1 favorite]


As an OpenRefine fan, this is relevant to my interests. Thank you for posting.
posted by humbug at 4:30 AM on May 31, 2022 [1 favorite]


Oh hey it's Saul Pwanson's project I learned about because he was a Recurse Center participant in 2017 -- neat! (Subtext: RC is neat and you should consider participating.)

By the way, a while back Saul decided to "open up my calendar for regular office hours on Monday and Wednesday afternoons, for anyone who wants to chat for a bit about VisiData (or anything data). Sign up here!"
posted by brainwane at 4:37 AM on May 31, 2022 [4 favorites]


I'll never be able to find it, but I recall an article from a few years back which compared doing some fairly standard data analyses on massive text-files with python+pandas vs grep+awk+sed. The upshot being that the ancient tools were massively faster.
posted by pompomtom at 5:28 AM on May 31, 2022 [2 favorites]


(oh, and +1 thanks for this!)
posted by pompomtom at 5:33 AM on May 31, 2022


This looks neat. If I can use it to quickly open a few files and join them on geography IDs then it would very much speed up my workflow.
posted by entropone at 5:40 AM on May 31, 2022


pompomtom, you're looking for Command-line Tools can be 235x Faster than your Hadoop Cluster. (You couldn't find it because it was hadoop, not pandas.)
posted by dmd at 5:46 AM on May 31, 2022 [11 favorites]


pompomtom: I believe this was one of the first articles showing how to beat Hadoop with your laptop using old school unix tools: Command-line Tools can be 235x Faster than your Hadoop Cluster. Here the writer talks about processing 1.75GB file(s?) containing about 2 million chess games.

There's also this phenomenal explainer on how to use old school tools in the cloud instead of fancy new toys: Using AWK and R to parse 25tb. This is again about massive amounts of data you wouldn't want to open as a spreadsheet.

But VisiData is good for data manipulation tasks where you'd rather avoid
  • complicated command-line text manipulation just to get one or two values out of some command's output:
    $ mv $(ls | vd) /trash

  • copy-pasting stuff from small datasets presented in annoying formats:
    $ vd https://en.wikipedia.org/wiki/List_of_largest_cities


  • (Both examples are from 10 Ways to Use VisiData which I forgot to link in the original post.)
    posted by kmt at 5:48 AM on May 31, 2022 [10 favorites]


    How has this been in Debian since buster (oldstable) and I'm only just learning of it now?

    Thank you, kmt.
    posted by flabdablet at 6:00 AM on May 31, 2022 [3 favorites]


    Thank you both dmd and kmt.
    posted by pompomtom at 6:21 AM on May 31, 2022


    Done and done. Thanks for posting this!
    posted by jquinby at 6:23 AM on May 31, 2022


    As a non-computer person who does find themselves needing to work with hundreds of thousands of rows, I could not figure out the first fucking thing about how to even begin installing this on my macbook a few weeks ago. Something about having to install another program, that I have to do by opening terminal and running a script on it? And then run a script on *that * program? Gave up and went back to slow chunky CSV spreadsheets, sorry get back to me when I can just download this thanks
    posted by windbox at 6:43 AM on May 31, 2022 [4 favorites]


    Wow that TTY interface is an interesting choice, particularly that scatterplot screenshot. I like it but it seems kinda odd.

    Another tool in this vein is Datasette. It's much more heavyweight; it installs as a webapp and the UI is in a browser. Which makes it better for some things, worse for others. Also works with a lot more data types. Importing CSV is done with the standalone tool csvs-to-sqlite.

    Datasette uses sqlite-utils underneath; you can use it directly with your CSV data if you prefer. Also useful: tools like querycsv that let you do SQL queries on CSV files by automatically setting up a sqlite database for you. And csvkit, which is a simple CSV-oriented set of tools like the grep, cut, awk, sort, uniq mentioned in the post.
    posted by Nelson at 6:52 AM on May 31, 2022 [2 favorites]


    I have a particular fondness for many of the gnu coreutils, especially amongst the text utilities...
    posted by jim in austin at 7:08 AM on May 31, 2022 [2 favorites]


    As a non-computer person who does find themselves needing to work with hundreds of thousands of rows

    Rows of what? Are these rows, perhaps, on a computer?
    posted by pompomtom at 7:08 AM on May 31, 2022


    > As a non-computer person who does find themselves needing to work with hundreds of thousands of rows, I could not figure out the first fucking thing about how to even begin installing this on my macbook a few weeks ago. Something about having to install another program, that I have to do by opening terminal and running a script on it? And then run a script on *that * program? Gave up and went back to slow chunky CSV spreadsheets, sorry get back to me when I can just download this thanks

    It's a command-line tool that you run in the terminal (and that requires python in order to install), and the things it is being compared to are other command-line tools that generally would look even less user-friendly to you. Looks like it's actually got a pretty straightforward install as this sort of thing goes (I haven't gotten a chance to try it out myself yet), but if you don't have any experience with command-line stuff whatsoever, it sounds like you're not the target audience.

    However, if you're working with large datasets, there are some reasons it might not be a bad idea to someday try out tools that aren't Excel/etc. Don't get me wrong, I do use Excel for various quick and dirty analyses, but there are some known common pitfalls when going in and out of .csv formats in Excel. One infamous example is its exceedingly liberal interpretation of what it thinks a date is, which has caused ongoing issues when working with, say, datasets that have certain gene names. When you're working with a dataset that involves hundreds of thousands of rows of data, it's easy to miss this sort of corruption. Tools like VisiData - or R or pandas/python, or the combo of grep/cut/awk/sort/uniq and a text editor like emacs or vim - generally don't introduce these specific sorts of errors, and for some use cases, once you're familiar with them, they can be significantly faster.
    posted by ASF Tod und Schwerkraft at 7:14 AM on May 31, 2022 [9 favorites]


    Wow, that's super cool. That it's just a Python module makes it really easy to install, even on Windows (although it was not immediately apparent that I would also need to install windows-curses), and its handling of data tables in webpages really is aces.

    I don't think this will replace iterative-Jupyter-notebook-and-pandas for largescale data cleanup / pipelining for me, but it sure will make figuring out what needs cleaning up easier, when working with datasets too large to open in Excel.
    posted by uncleozzy at 7:17 AM on May 31, 2022


    Wow that TTY interface is an interesting choice

    I'm seeing an uptick in TUI (text-based user interface) tools and libraries, and it's kind of exciting. Almost every platform has a terminal with ncurses capability so these things can be cross-platform, lightweight, and fast.

    Being an embedded dev I still make sure I can use onboard apps like htop for performance monitoring and process management over a serial port and it's a lifesaver. And this is coming from someone that writes in Qt all day long. I'm kind of excited to try vd out on some SQLite telemetry I have sitting around.
    posted by JoeZydeco at 7:54 AM on May 31, 2022 [4 favorites]


    (And, obviously, I suck at this text-based interface. Sorry for the double post)
    posted by JoeZydeco at 7:56 AM on May 31, 2022 [2 favorites]


    JoeZydeco, I flagged your dupe, it might get culled by a mod. [Happy if they also delete this non-content comment, too.]
    posted by k3ninho at 9:18 AM on May 31, 2022 [2 favorites]


    If you like this sort of thing, q - Run SQL directly on CSV or TSV files will be the sort of thing you also like. It's much more limited than vd, but it gives you sqlite's query language for delimited files. If you're really determined, you can use it as a CSV-to-CSV filter.
    posted by scruss at 9:58 AM on May 31, 2022 [3 favorites]


    Why yes, I have used grep, awk, sort, and uniq. Mostly when you're a Perl programmer almost every text file manipulation looks like a Perl problem, but sometimes even Perl is a bigger hammer than you actually need. I assume the same is true for Python, but I came up on the other side of that language divide.

    A couple years ago I got to bill somebody to develop a one-liner that used awk on a (badly formed) CSV file just to pipe a URL found somewhere between fields 11 and 14 to curl, using the first field in the CSV to name the downloaded file. I could have done it a bunch of other ways, but awk and curl could be depended on to be present in the environment where the tool was going to run. Writing it was a fun challenge, but billing for it was even more fun.
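
    (For the curious, here's the same idea sketched in Python rather than awk -- not the actual one-liner; the file name, extension, and other details are invented for illustration.)
        # Rough Python sketch of the idea: name each download after the
        # first field and fetch whichever of fields 11-14 holds the URL.
        # (File name and extension are invented for illustration.)
        import csv
        import urllib.request

        with open("report.csv", newline="") as fh:
            for row in csv.reader(fh):
                name = row[0]
                url = next((f for f in row[10:14] if f.startswith("http")), None)
                if url:
                    urllib.request.urlretrieve(url, f"{name}.pdf")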
    posted by fedward at 10:21 AM on May 31, 2022 [3 favorites]


    And yeah, same here (grep/awk/sort/cut/uniq/perl), but there are definitely times with not particularly well formed csvs where it's easier to huck it at excel to clean up the disjointed columns that this might help with (especially now that I need to run the data through notepad to strip the formatting that short circuits excel's automatic column import), so I'll keep it in my back pocket.
    posted by Kyol at 10:52 AM on May 31, 2022 [1 favorite]


    Scruss, I thought Q cost megabucks from the people behind KDB+. Is it a different one from the continuum?
    posted by k3ninho at 11:23 AM on May 31, 2022


    Similar to q, Microsoft had logparser which let you do SQL-like queries against csv, tsv, IIS web server logs and Windows event logs (even on remote computers). It was pretty magical considering this was 15+ years ago. It was a little archaic but for troubleshooting in a server farm and/or compiling statistics it was great.

    Downsides were that it was Windows-only and 32-bit only, so it was possible for you to blow it up if you tried something too sophisticated with a lot of data. Nowadays a good portion of this is replaced by PowerShell cmdlets.
    posted by mmascolino at 11:50 AM on May 31, 2022


    I immediately favorited this post and added it to my activity because I feel like I've spent years of my life wrangling columns of text data with inadequate tools and "making do".
    posted by Ivan Fyodorovich at 12:00 PM on May 31, 2022 [2 favorites]


    Heh. So many of us have. I do, once again, recommend OpenRefine (formerly GoogleRefine) -- it does a whackload of things I used to have to break out Python for, and another whackload (e.g. clustering, which is the Greatest Inconsistency Fixer Ever) that I would have no idea how to even implement in Python.
    posted by humbug at 12:14 PM on May 31, 2022


    Scruss, I thought Q cost megabucks from the people behind KDB+. Is it a different one from the continuum?

    Yes, completely different. The author's choice of project name has made it extremely hard to find, unfortunately. The q I know is free. It's a python wrapper around sqlite. It quietly imports your delimited text into a temporary sqlite table, runs your query, and spits out the results. It seems to handle all sorts of weird CSV variants well, which awk/grep etc do not.
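
    The whole mechanism is small enough to sketch in a few lines of Python if you ever feel like rolling your own (file, table, and column names below are invented):
        # Minimal sketch of the approach: load a delimited file into a
        # throwaway sqlite table, run your SQL, print the results.
        # (File, table, and column names are invented.)
        import csv
        import sqlite3

        with open("cities.csv", newline="") as fh:
            rows = list(csv.reader(fh))
        header, data = rows[0], rows[1:]

        db = sqlite3.connect(":memory:")
        cols = ", ".join(f'"{c}"' for c in header)
        marks = ", ".join("?" for _ in header)
        db.execute(f"CREATE TABLE t ({cols})")
        db.executemany(f"INSERT INTO t VALUES ({marks})", data)

        for row in db.execute('SELECT "country", COUNT(*) FROM t GROUP BY "country"'):
            print(row)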

    I suppose youse all know about Excel/LibreOffice Calc's new delimited file import that's way more clever than loading a CSV? You can have periodic refreshes from file, actually reliable UTF-8 import, ...
    posted by scruss at 1:37 PM on May 31, 2022 [1 favorite]


    way cool, vd without args lists the current dir in an oddly useable format
    posted by sammyo at 1:40 PM on May 31, 2022


    Sheesh, Scruss, I'm not trying to kink-shame but how many exotic CSV variants are there (and do I really consent to know)?
    posted by k3ninho at 4:00 PM on May 31, 2022 [3 favorites]


    There are as many variants of CSV as there ways to say, “how hard can it be, I’ll just put tabs between the data fields, oh wait tabs are non-printing on Bob’s computer, i guess i’ll use commas or pipes or puppy emojis or something.” Add in some ambiguity about whether or how to quote your delimiter if it shows up in a data field, or how to quote your quote character because it turned up in your data after all, and whether to include zero or more header or footer rows, or comments, or whether the number of fields in a row can vary in your input.

    The read_csv function in pandas has ten or twenty optional parameters for these kinds of things, which suggests it handles at least 2^10 different dialects of CSV.
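
    A few of those knobs in one call, just for flavor (the file and the particular settings are hypothetical):
        # A handful of pandas.read_csv dialect knobs in one place
        # (the file and the particular settings are hypothetical).
        import pandas as pd

        df = pd.read_csv(
            "export.txt",
            sep="|",              # pipes instead of commas
            quotechar='"',        # fields may be quoted...
            escapechar="\\",      # ...or backslash-escaped
            header=0,             # first non-skipped row holds the column names
            skiprows=2,           # junk banner lines at the top
            skipfooter=1,         # a totals row at the bottom
            comment="#",          # plus comment lines sprinkled throughout
            encoding="latin-1",
            engine="python",      # skipfooter needs the python engine
        )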
    posted by fantabulous timewaster at 4:25 PM on May 31, 2022 [7 favorites]


    There is always at least one more CSV variant than your code expects. And it will be the one that the client suddenly switches to on the day of production go live, because despite their assurances, the project manager hasn't actually been sending you test data from their test system, just files they made in Excel.
    posted by mrgoldenbrown at 4:26 PM on May 31, 2022 [9 favorites]


    We're wise to that: we specify a schema and tell you* we'll reject data that doesn't match the schema ...then invest all the engineering effort in processing noncompliant data without objecting one bit.

    *: and even have it in the contracts
    posted by k3ninho at 4:30 PM on May 31, 2022


    client — Wait, so if I change the order of the columns, add three new columns, and mix seven different date formats, your system doesn't just adapt automatically?
    posted by signal at 7:13 PM on May 31, 2022 [6 favorites]


    pain!
    posted by wotsac at 8:22 PM on May 31, 2022


    client — Wait, so if I change the order of the columns, add three new columns, and mix seven different date formats, your system doesn't just adapt automatically?

    Pretty much it will, if your csv has headers and all the date formats are standard and parsable by the DateTime library... but don't expect me to actually use them: the program will still only use the columns it knows about, and it either keeps or ignores your ordering and extras. As long as the needed columns are there and identifiable, the extras don't matter because they're not used.
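
    (In Python terms, that's roughly what csv.DictReader gives you: fields keyed by header name, so column order and extra columns are irrelevant. The file and column names below are made up.)
        # Look up fields by header name so column order and extra columns
        # don't matter. (File and column names are made up.)
        import csv

        with open("claims.csv", newline="") as fh:
            for row in csv.DictReader(fh):
                # only the columns the program knows about get used;
                # anything else in the file is simply ignored
                print(row["claim_id"], row["service_date"])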
    posted by zengargoyle at 3:12 AM on June 1, 2022 [1 favorite]


    Add in some ambiguity about whether or how to quote your delimiter if it shows up in a data field, or how to quote your quote character because it turned up in your data after all, and whether to include zero or more header or footer rows, or comments, or whether the number of fields in a row can vary in your input

    I think you missed the one about whether or not newlines or whatever else is being used as row separators are allowable inside a quoted field. That one breaks a lot of how-hard-can-it-be ad-hoc parsers.
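
    A quick Python illustration: a real CSV parser keeps the quoted newline inside the field, while the naive split-on-newlines approach shreds the record.
        # One quoted field containing a newline: a single record to a CSV
        # parser, two broken "lines" to an ad-hoc splitter.
        import csv, io

        data = 'id,comment\n1,"first line\nsecond line"\n'

        print(list(csv.reader(io.StringIO(data))))
        # [['id', 'comment'], ['1', 'first line\nsecond line']]

        print(data.splitlines())
        # ['id,comment', '1,"first line', 'second line"']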
    posted by flabdablet at 7:14 AM on June 1, 2022 [2 favorites]


    TSV is a little better than CSV; tabs just work better as separator characters since they're almost never in the source data.

    I've never understood why no one uses the ASCII official delimiters. I'm talking about 0x1c to 0x1f, particularly 0x1f "unit separator", which serves the function of commas or tabs in CSV: a byte that's pretty much guaranteed not to be in any textual source data, purpose-built for exactly this job. Works in Unicode too of course. (Although not in XML.)
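
    The nice part is that you don't even need a quoting convention -- splitting on the separators is the whole parser. A minimal Python sketch:
        # Unit separator (0x1f) between fields, record separator (0x1e)
        # between records: no quoting or escaping needed, splitting is the
        # whole parser.
        US, RS = "\x1f", "\x1e"

        rows = [["name", "notes"], ["Ada", 'likes commas, "quotes", and\nnewlines']]

        encoded = RS.join(US.join(fields) for fields in rows)
        decoded = [record.split(US) for record in encoded.split(RS)]
        assert decoded == rows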

    There's some discussion here that boils down to "was never really fully specified" and "nope, no one really uses them".
    posted by Nelson at 8:37 AM on June 1, 2022 [4 favorites]


    There is always at least one more CSV variant than your code expects. And it will be the one that the client suddenly switches to on the day of production go live, because despite their assurances, the project manager hasn't actually been sending you test data from their test system, just files they made in Excel.

    This was much too close to real life to be funny, except now I remember that I called out a client on doing exactly that before the go-live date, because the samples they sent us just seemed too tidy. So maybe it is a little bit funny. Turned out they lied about even having a completed system set up that would generate the files. They wanted us to finish our part first, and then they'd actually build something to the agreed spec. We put the launch on hold until they developed their piece of it just so we could help them sweat out their inevitable bugs without everything being in crisis mode.

    Pretty much it will, if your csv has headers and all the date formats are standard and parsable by the DateTime library... but don't expect me to actually use them: the program will still only use the columns it knows about, and it either keeps or ignores your ordering and extras. As long as the needed columns are there and identifiable, the extras don't matter because they're not used.

    Oh man. The thing that actually turned me into a programmer was solving this problem for a big hosting company before the first dot com bust. Nominally there were two potential file formats we had to process, but each format had options that were supposed to be configured identically (and weren't), sometimes the time series data didn't line up (because of time zone issues in the server configurations), and sometimes a daily file would in fact change format in the middle of the day when an admin changed one of those options. So I wrote a bunch of logic to figure out which variant of a particular format was being processed, identify obvious time zone errors, reread header lines even when they appeared in the middle of the file, and so on.

    Every time I've had to do any sort of ETL since then I remember solving all those problems from first principles and do the "This looks like X sort of data. Is it?" sanity check before moving on.

    There's some discussion here that boils down to "was never really fully specified" and "nope, no one really uses them".

    Amused to see scruss in that discussion too.
    posted by fedward at 9:37 AM on June 1, 2022 [1 favorite]


    I've never understood why no one uses the ASCII official delimiters. I'm talking about 0x1c to 0x1f, particularly 0x1f "unit separator" which serves the function of commas or tabs in CSV.

    Niklaus Wirth's Oberon OS did use the ASCII Record Separator character as the standard line delimiter character for text files, if I recall correctly.

    And I might be just making shit up at this point but I also seem to recall that the Oberon programming language came with libraries providing some kind of tabular data file support, using Unit Separator to delimit fields within a record.
    posted by flabdablet at 10:24 AM on June 1, 2022 [1 favorite]




    This is definitely pertinent to my interests! I am self-taught at data-wrangling because I work in a field (healthcare and Medicare/Medicaid/insurance stuff) with loads of data in crazy amounts of variations and mostly awful csv forms (if you’re lucky). But I work in a subset of that field (see Medicaid/Medicare, above) that does not often budget for adequate data processing solutions or training, so my job is in part to be the person who does mysterious things to bad data.

    I never learned much about CLI tools and work mostly in Excel, R, and a little Python when I’m feeling brave, but I know that even intermediate Excel is beyond a reasonable expectation of business continuity for my successors or peers: if I build it, and I leave, it will break or at least will never be updated. So I am constantly, constantly torn between the need to wrangle and the very real need to make my work replicable in a low-tech-skills environment. I’ve grown really fond of doing my work in Jupyter notebooks because it shows my work and maybe, someday, I’ll have colleagues or successors who can see the notes and correct or update the scripts as needed. What I hate is telling someone “you sent me this thing and I did math to it and now it’s this thing; don’t worry about it” because it allows my errors to go undetected and it doesn’t help anyone beyond that transaction.

    So I guess my question, if anyone else deals with this kind of scenario, is: are there standard practices, and what are they, for using this kind of command line tool for replicable, documentable work in a non-tech space? I’m all about learning simpler, better approaches, but I don’t want to face even more knowledge management challenges by adopting them.
    posted by skookumsaurus rex at 11:09 AM on June 1, 2022 [2 favorites]


    > are there standard practices, and what are they, for using this kind of command line tool for replicable, documentable work in a non-tech space?
    I don't know for sure, would welcome war stories and better ideas than mine.

    Consider Jupyter Notebooks, right? They're code plus annotation, and saved example output from past runs. That's a lot of intent recorded in the artefact that the next person gets and has to rely on to do the work.

    Knuth talks about 'literate programming' in that his code exists to be compiled, interpreted, studied and extended -- with the hardest work being the human comprehension. For him, programming is trying to tell a story and allowing the person who works with your story to imagine what the computer will do with it.

    And look at unit tests within the BDD paradigm: when they're scaffolding that records the coder's intent they can make promises about functions and classes behaving in certain ways with certain inputs. You get to change things while having ways to be assured that the knock-on effects of your change aren't bad elsewhere.

    A full-ish suite of integrated, automatic tests documents a lot of intent, and protects against a lot of regressive changes -- so why are they so damn hard to read, study and understand?

    Is it because they encode a lot of the architectural info inside assumptions you might not recognise? Is it because they're aligned with the solutions that the developer used to get the functions and classes doing just enough of what they need to do -- in that they're tied to the developer's mental model of the system? Is it because we don't pride ourselves in easy-to-comprehend test suites, or is it because we only look to failing tests when they fail rather than for studying how the program does its work?

    I don't know how this "code carries tests which indicate intent and make promises for it working" applies to ad-hoc data manipulation. Maybe:
    * Check your sheets into source control, write good annotations, examples and tests.
    * Have a study process to train up people in what you do and why.
    * Make extra effort to train people for upcoming large changes in the tools -- making these systems better still changes how people work, so put in the effort to bring them along with you.
    posted by k3ninho at 12:12 PM on June 1, 2022 [2 favorites]


    On the awk side goawk has recently added direct csv and tsv support.
    posted by donio at 1:47 PM on June 1, 2022 [1 favorite]


    Knuth talks about 'literate programming' in that his code exists to be compiled, interpreted, studied and extended -- with the hardest work being the human comprehension. For him, programming is trying to tell a story and allowing the person who works with your story to imagine what the computer will do with it.

    k3ninho, this is exactly my wish: that the work I do can be read, understood, and replicated by others (and reviewed, critiqued, and corrected! I am almost always the best programmer around and I am not a good programmer, I work in a vacuum and have only Google to coach me).

    My greater frustration with this kind of conversation is that it's a conversation that happens in tech spaces, among tech people, about tech spaces and people, but it is very very rare for me to encounter thoughtful consideration of how data management impacts non-technical people in non-technical spaces. Frankly, the attitude I usually encounter when I talk about this is just "git gud lol". I want to be clear that I don't think anyone in this thread is saying this, least of all k3ninho. I honestly just want to point out that this conversation is very very important and I don't hear it repeated very often.

    I do healthcare administration stuff, mostly for poor people with chronic conditions. It's not a sexy, investment-heavy field of practice. It is a complicated one, and one that's very dependent on working with data. We need to work with large amounts of data for compliance, for claims and reimbursement, and for care coordination and risk management. Unless we do it right we don't get paid, we don't get to keep helping patients, and we don't catch a patient in need of a preventative screening until it's too late.

    But. Our tools just don't support these needs. The software we use for health records has extremely limited support for reporting; we can't just ask "who are our most fall-prone patients, and what conditions or medications do they have in common," or "how long on average does it take our patients on statins to reduce their cholesterol by 20%", or even "which patients have been on statins for a year or longer". Yes, I know that some EHR is better at dealing with reporting than others, but this EHR is what we need to keep for other reasons. So anyway, the only way we can answer these really important questions (and therefore protect patients and keep the lights on) is to download a csv (a security risk itself) and wrangle until it's what we need.

    The ultimate problem with "git gud" as an acceptable answer for tools that don't work for people who aren't programmers is that not everyone should be a programmer. It's not just a question of aptitude, though that's a problem too. Nurses, social workers, lawyers, water district managers, whatever: everyone else has a hard job too, and we shouldn't be expected to "just figure it out", any more than programmers should be expected to "just figure out" their own medical treatments.

    I just wish the tech sector, the people who know what they're doing with this stuff, would slow down a little, pay attention to the boring and unsexy stuff that makes the world go round, and finish their homework. Don't think that a product is good enough because it can kind-of-sorta spit out a csv or a pdf. Don't think that just because people can figure it out that they should. Cars come fully assembled, with comprehensive detailed user manuals and maintenance schedules and seatbelts and inspection standards because it saves lives. I guarantee you that lives are lost to obscure csv formats that could have been prevented if the tech industry could just respect and work with people who aren't comp sci majors.
    posted by skookumsaurus rex at 3:19 PM on June 1, 2022 [8 favorites]


    Skookumsaurus Rex -- we fail you, for which I'm embarrassed and sorry. Often "git gud lol" is a defensive reaction in place of being able to admit flaws: it's hard to coach and difficult to know where to start. The whole setup of tech reacts very badly to criticism and, again I'm embarrassed and sorry for this.

    The "Agile" thing will be followed by a thing called "Business Change" in that 'changing the system changes how people work' which means we also owe you good training to get the most out of the system we've changed.
    posted by k3ninho at 4:00 PM on June 1, 2022 [2 favorites]


    As long as we're talking CSV, I've run into an interesting conundrum. I've got a ruby app that reads in CSV files from my bank so I can automate managing my accounts. Once in a while the header lines that read something like:
    AccountNo,Symbol,Number,Price,...
    
    instead appear like:
    AccountNo, function Symbol() {
      [native code] 
    },Number,Price,...
    
    I guess what's going on is a typo in the code, where someone forgot to put quotes around "Symbol" and JavaScript happily evaluated the expression. Sometimes it evaluates to a more complex expression.

    In any case, while those 50-year-old tools are awesome, sometimes you need to call a cab.
    posted by morspin at 4:48 PM on June 1, 2022


    ... how many exotic CSV variants are there ...?

    To paraphrase mrgoldenbrown: There is always at least one more CSV variant than your sanity can handle.

    The data format that got me FaceTime-ing with The Void was the now-obsolete Industry Canada (IC) radio transmitter database, TAFL. It looked like CSV for 99% of the records, but the other 1% were fixed-width. What made this especially special was that these records were defined in the data itself: you first had to read a line as fixed width, parse out a certain column value, and depending on the value you got, the record was delimited or fixed-width.
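
    Roughly, the parser had to do something like this Python paraphrase (the real type codes and column positions are long gone from my memory, so these are invented):
        # Paraphrase of the two-personality parser: read each line as fixed
        # width first, peek at a type code, then decide how to split it.
        # (Type codes and column positions here are invented.)
        import csv

        FIXED_WIDTH_TYPES = {"80", "81"}
        FIXED_WIDTH_COLUMNS = [(0, 2), (2, 10), (10, 40), (40, 52)]

        def parse_line(line):
            record_type = line[0:2].strip()      # the type code lives in a fixed slot
            if record_type in FIXED_WIDTH_TYPES:
                return [line[a:b].strip() for a, b in FIXED_WIDTH_COLUMNS]
            return next(csv.reader([line]))      # everything else is delimited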

    While I wrote a parser (in spatialite; did I mention these were geodata?) that worked well enough for casual purposes, the only tool that worked 100% of the time was IC's MS-DOS exe that spat out a DBF containing the data. It wouldn't run under Windows, so this was the only time I got official sanction to bring a Linux laptop to work and convert the whole database in dosbox.

    (That entire database, briefly published as open data by IC, has gone away now. The official reason was that it contained identifiable personal data of transmitter licensees, but the real reason was that it allowed the general public to find and object to local transmitter towers.)
    posted by scruss at 4:55 AM on June 2, 2022 [3 favorites]


    Update: sold. Time to build some muscle memory for some more shortcuts!
    posted by ASF Tod und Schwerkraft at 11:06 PM on June 2, 2022


    Thanks so much for posting, kmt! I so look forward to working with this; that coordinates-to-map functionality is amazzzzing!
    posted by unearthed at 12:55 AM on June 3, 2022


    Via brainwane's csv,conf link above, a talk from the author of VisiData that isn't about VisiData: How a File Format Led to a Crossword Scandal - Saul Pwanson (YouTube, 19m53s)
    posted by flabdablet at 12:39 PM on June 3, 2022


    Nelson: If you or anyone else here ever used this, for at least a brief period, it used US, FS, GS and RS in its custom db format. It might still, who knows?

    Why? Because I said so.

    I was very junior QA in a meeting with more than one diva level developer and they were talking out what sort of separators they would use for this custom db they were writing.

    I heard a lot of nonsense about using a pipe-caret approach, and a couple other goofy combinations when I piped up “hey, I know you’ve been doing this a lot longer than me, but don’t you know that there are reserved characters in ASCII for exactly this purpose‽”

    I got a lot of pushback and even more ego thrown my way, but at one point our PM says “he’s right, why are we trying to reinvent the wheel?”

    QA is a lot more than finding bugs, and here’s to the original codifiers of ASCII — they gave us so much!
    posted by drfu at 11:43 PM on June 3, 2022 [3 favorites]


    Amazing drfu! Good for you. Did the choice end up causing any problems? I was always a little nervous using them because a lot of software gets weird about "control characters" (ie, XML not allowing them). I used them for real in a little webapp I wrote where it had to upload several documents via HTTP to my server and it worked fine, but it was a pretty limited application and not very many users.

    (For posterity, the product drfu is talking about is Portfolio by Extensis).
    posted by Nelson at 7:14 AM on June 4, 2022


    a lot of software gets weird about "control characters" (ie, XML not allowing them)

    The exact reason XML doesn't allow them is because they are for control, not for data. Control characters are there to do much the same job that XML uses tags for: delimiting data. Software that gets weird about US/RS/GS/FS turning up inside a data field is either horribly designed in and of itself, or is being asked to process structured data from elsewhere in some way that doesn't even ignore but actually breaks that structure. In no case does the fault lie in the way that the structured data in question was structured.

    Want to see XML get really weird? Try inserting an entire XML document from somewhere else without properly escaping the left angle bracket characters. JSON likewise, if you pay no attention to escaping double-quotes and backslashes inside data being represented as a JSON string.

    The need to escape data-structuring data so that it can be passed through transparently in-band is as old as digital data itself. ASCII defines the DLE ("data link escape") control character for exactly that purpose, though unfortunately there exist incompatible ways to use it: some protocols use DLE to mean that the very next character is to be handled as literal data regardless of any special meaning it usually has, while others use it as a prefix that enables the following character's data-delimiting meaning unless that character is itself DLE in which case it's to be taken literally.
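
    The first convention is easy enough to sketch -- stuff a DLE in front of anything that would otherwise read as structure. A toy Python version:
        # Toy version of the first DLE convention: prefix DLE to any byte
        # that would otherwise be read as structure, including DLE itself.
        DLE, US, RS = b"\x10", b"\x1f", b"\x1e"

        def stuff(data: bytes) -> bytes:
            out = bytearray()
            for b in data:
                if bytes([b]) in (DLE, US, RS):
                    out += DLE                   # escape the special byte...
                out.append(b)                    # ...then emit it literally
            return bytes(out)

        def unstuff(data: bytes) -> bytes:
            out, escaped = bytearray(), False
            for b in data:
                if not escaped and bytes([b]) == DLE:
                    escaped = True               # the next byte is literal data
                    continue
                out.append(b)
                escaped = False
            return bytes(out)

        assert unstuff(stuff(b"a\x1fb\x10c")) == b"a\x1fb\x10c"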

    If you think CSV is unacceptably arbitrary and ad-hoc, try writing a Windows batch file that will never fall over regardless of the arguments you pass it. Pretty much any CSV convention will offer some way to represent completely arbitrary data inside a quoted data field, but there are rips and tears in CMD's command-line parsing conventions that simply cannot be closed up.
    posted by flabdablet at 8:47 AM on June 4, 2022


    Oh but XML has a thing for inserting other XML documents, CDATA! Which can include any valid XML character without quoting. The drawback there is that if the included document contains the end delimiter ]]>, you have to do something awkward: close the CDATA section and open a new one.

    I take your point on control characters being explicitly not bytes important to the XML document format, an SGML derivative. The practical problem is XML started being used as a data format, most notoriously SOAP, and the awkwardness of working with binary data is one annoying problem with this use of XML. In practice folks seem to mostly base64 encode data and accept the 4/3 (or worse, after compression) explosion in data size.

    CSV quoting in the Excel style is actually remarkably well behaved and easy to understand. The usual problem is folks don't want to bother with the required stateful parser so they do `awk -F,` and then get mad when something is quoted. You really do need to use real code to parse CSV unless you're very sure of the source.

    Python's CSV module does a pretty good job importing various CSV dialects and even has a useful sniffer function to guess what the input is. I was surprised there are only three in my Python 3.8: unix, excel, and excel-tab. Python gets by with the same function to parse all of them with just a few options for changing some details. Honestly I was expecting more complexity.
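
    The sniffer is handy when you get handed a mystery file (file name below is hypothetical):
        # Guess the dialect of a mystery delimited file, then parse with it.
        import csv

        with open("mystery.txt", newline="") as fh:
            sample = fh.read(4096)
            dialect = csv.Sniffer().sniff(sample)
            has_header = csv.Sniffer().has_header(sample)
            fh.seek(0)
            reader = csv.reader(fh, dialect)
            if has_header:
                next(reader)                     # skip the header row
            for row in reader:
                print(row)

        print(csv.list_dialects())               # the three built-in dialects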
    posted by Nelson at 9:45 AM on June 4, 2022 [2 favorites]


    Hey skookumsaurus rex, I just want to say that your comments have stuck with me for a few days. I had a job interview recently with an org that took the ethics and impacts of data very seriously, but that is perhaps because they also weren't technical people. They came to their data problems as a side effect of doing a public good (being vague for reasons), but they recognized that they now needed technical expertise and didn't want to end up having their data solutions just create new, different problems.

    It was refreshing to be in that room and be able to talk about addressing real needs with data without just doing the standard "let's just ignore any restrictions that might make the job harder" typical of tech people. Just finding out such orgs exist and do, actually, hire people changed my mindset about talking to companies that default to running roughshod. There may be hope yet.
    posted by fedward at 10:49 AM on June 4, 2022 [5 favorites]


    The drawback there is you have to do something awkward if the included document has the end delimiter ]]>, you have to close the CDATA section.

    So, randomly problematic for Javascript source code among other things. Including, of course, XML documents that already have at least one CDATA section.

    I loathe badly-designed special-purpose edge-case afterthought crap like this, and SGML and its derivatives are chock-a-block with them. Give me a simple, workable delimiter and delimiter escape convention and I'm happy. I rate CSV, even though it is chronically underspecified, as a far saner archival format for tabular data than XML any day of the week.

    I also thoroughly approve of Saul Pwanson's decision to build his own domain-specific text-based format for crossword archiving instead of trying to press XML into service for no better reason than that it is a potentially applicable standard. XML is a terrible standard for pretty much every use case for which it is an available design option, and its widespread adoption has cost the IT industry an incalculably huge amount of productivity.
    posted by flabdablet at 12:29 PM on June 4, 2022 [1 favorite]


    They came to their data problems as a side effect of doing a public good (being vague for reasons), but they recognized that they now needed technical expertise and didn't want to end up having their data solutions just create new, different problems.

    Fedward, and k3ninho, I really appreciate your insights here! I’ve been thinking about it for a few days too, and I think that the car analogy is actually a pretty good fit.

    When automobiles were young, they were the domain of the skilled and the daring. You had to be technically skilled and knowledgeable to get an engine running and keep it from stalling or exploding. You had to be somewhat reckless to race an engine bolted to some wheels. Detroit in the 1900s was absolutely the Silicon Valley of the time.

    But as cars stopped being novelties and toys and started being infrastructure, the government and the industry (even if they were reluctant to admit it) had a huge incentive to make cars safe and reliable. The internet and the complete reliance that it’s brought are every bit as liberating and deadly as the car was, but we are barely scratching the surface of common sense protections.
    posted by skookumsaurus rex at 4:44 PM on June 4, 2022 [2 favorites]


    >common sense protections
    The internet's stripped me of any delusions about 'common sense'.

    In part, that's the disagreements over how-we-know-to-be-true-what-we-believe-to-be-true where bad-faith agents intentionally muddy the waters so people can't assess for themselves which choices lead to better outcomes (for the individual and/or the collective), and, in part, that's the celebration of ignorance that marks the Eternal September of the Internet.

    (The Eternal September started in September 1993 when AOL put out masses of floppy disks and, later, CD-ROMs with software to get people online and using the Internet. Previously, academic internet users had a bad time with new students ignoring the established norms of online interaction each September, and a month later the new users had either given up or learned to fit in with the pre-existing norms. The pre-existing norms couldn't survive the floods of new users that never relented under society-wide* adoption of the internet, nor the phenomenon of rude newbies who wouldn't learn what were previously common-sense approaches to online social interaction, or be told by moderators and respected senior users how to fit in and learn from being wrong.)

    But cars?!? Unsafe at Any Speed is a book that documents vehicle choices in the USA in the 1960s. The right balances for many new systems emerge later, with metaphorical safety nets and enforced regulation. Individual actions are choices; responsibility and culpability for their consequences already exist in law. So let's talk about that so we can include it in our common sense of internet culture and data handling, processing and storage.

    *: The internet has impacted culture greatly but this note is to say that access can still be out of reach for many, locally as well as worldwide.
    posted by k3ninho at 12:35 AM on June 7, 2022 [1 favorite]




    This thread has been archived and is closed to new comments