• golden_zealot@lemmy.ml
    11 hours ago

    I used to do data analysis on robotics firmware logs that generated several million log lines per hour, and that was my second job out of college.

    I don’t know how you fuck up 60k lines that bad. Is he nesting 150 for loops and loading a copy of the data set in each one while mining crypto??

    • ButtDrugs@lemm.ee
      10 hours ago

      Substring searches in unindexed large string columns or cartesian explosion caused by shitty joins would be my initial guess.

        • manicdave@feddit.uk
          6 hours ago

          If there’s something you want to search by in a database, you should index it.

          Indexing creates an ordered data structure that allows much faster queries. If you were looking for the username gazter in an unindexed column, the database would have to check literally every username entry: in a table of 1,000,000 rows, that's up to 1,000,000 checks.

          In an indexed column it might do something like ask to be pointed to every name beginning with “g”, then of those ask to be pointed to every name with the second letter “a” and so on. It would find out where in the database gazter is by checking only six times.
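          You can watch the planner make exactly this switch with SQLite. A minimal sketch using Python's built-in sqlite3 module (the table, column, and index names are invented for illustration):

```python
import sqlite3

# In-memory database with a hypothetical users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
conn.executemany(
    "INSERT INTO users (username) VALUES (?)",
    [(f"user{i}",) for i in range(100_000)] + [("gazter",)],
)

# No index yet: the planner has no choice but to read every row.
plan_no_index = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE username = 'gazter'"
).fetchone()
print(plan_no_index[3])  # the detail column reports a full-table SCAN

# With an index, the planner walks an ordered B-tree instead.
conn.execute("CREATE INDEX idx_users_username ON users (username)")
plan_with_index = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE username = 'gazter'"
).fetchone()
print(plan_with_index[3])  # now a SEARCH using the index
```

          The plan's detail column flips from a scan of all 100,001 rows to an index search once the index exists; under the hood the B-tree does the "narrow down letter by letter" walk described above.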

          Substring matching is much more computationally difficult as it has to pull out each potentially matching value and run it through a function that checks if gazter exists somewhere in that value. Basically if you find yourself doing it you need to come up with a better plan.
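          A sketch of why even an indexed column doesn't help here (again sqlite3, invented names): a search pattern with a leading wildcard could match anywhere in the string, so the ordered index can't narrow anything down.

```python
import sqlite3

# Hypothetical posts table with an ordinary index on the text column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT, author TEXT)"
)
conn.execute("CREATE INDEX idx_posts_body ON posts (body)")
conn.executemany(
    "INSERT INTO posts (body, author) VALUES (?, ?)",
    [(f"post number {i}", f"user{i}") for i in range(1_000)],
)

# The leading % defeats the index: a match could start at any character,
# so the planner falls back to reading every row.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT author FROM posts WHERE body LIKE '%gazter%'"
).fetchone()
print(plan[3])  # a full-table SCAN, despite the index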

          Cartesian explosion would be when your query ends up doing a shit load of redundant work. Like if the query to load this thread were to look up all the posters here, get all their posts, get the threads from those posts and filter on the thread id.

        • ButtDrugs@lemm.ee
          link
          fedilink
          arrow-up
          1
          ·
          edit-2
          6 hours ago

          Storing large volumes of a text in a database column without optimization, then searching for small strings within it. It causes the database to basically search character by character to find a match by reading everything from disk. If you use indexes the database can do a lot of really incredible optimization to make finding values mich faster, and honestly string searching is better suited to a non-relational DB engine (which is why search engines don’t use relational DBs).

          Cartesian explosion is where you join related data together in a way that causes your result set to be wayyyy bigger than you expect. For example if you try to search through blog posts, but then also decide to bring in comments to search, then bring in the authors of those comments and all their comments from other posts. Result sets start to grow exponentially in that way, so maybe if you only search a few thousand blog posts you might be searching through millions of records because you designed your queries poorly.