Module NKS::Acts::SodaSearch::InstanceMethods
In: lib/acts_as_soda_search.rb

Methods

autoindex   debug   erase_subsource   error   index   indexes_count   info   warn  

Public Instance methods

Abstract method. Implement this in your inheriting class to provide the index() method with whatever it needs to know to re-index the object.

[Source]

     # File lib/acts_as_soda_search.rb, line 252
252:         def autoindex
253:           raise NotImplementedError.new("* SodaSearch: You must implement autoindex() in your inheriting class.")
254:         end
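
A minimal sketch of what an implementation in an inheriting class might look like. SodaSearchSketch, Article, Unimplemented, title and body are hypothetical names used only for illustration, and the index() stub here merely records its arguments so the example runs without Rails or a database; only the method signatures mirror the source.

```ruby
# Hypothetical stand-in for the real InstanceMethods module (assumption:
# only the autoindex/index signatures are taken from the source).
module SodaSearchSketch
  attr_reader :index_calls

  def autoindex
    raise NotImplementedError, "* SodaSearch: You must implement autoindex() in your inheriting class."
  end

  # Stub: records the call instead of writing to the database.
  def index(what_to_index, subsource_url = nil, clear_old = false, do_stemming = true)
    @index_calls ||= []
    @index_calls << [what_to_index, subsource_url, clear_old, do_stemming]
    what_to_index.size
  end
end

# Hypothetical model with a title and a body.
class Article
  include SodaSearchSketch
  attr_reader :title, :body

  def initialize(title, body)
    @title = title
    @body = body
  end

  # Tell index() what to re-index: the title as a plain string, and the
  # body behind a proc so it is only materialized when index() asks for it.
  def autoindex
    index([title, lambda { |article| [article.body] }], nil, true)
  end
end

# A class that forgets to implement autoindex gets the NotImplementedError.
class Unimplemented
  include SodaSearchSketch
end
```

Whether autoindex should pass clear_old = true is a design choice for the inheriting class; it is used here only because the description talks about re-indexing the whole object.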

Removes all index data for the current object whose subsource_url equals the given subsource_url.

[Source]

     # File lib/acts_as_soda_search.rb, line 487
487:         def erase_subsource(subsource_url)
488:           self.class.soda_indices_class.delete_all(["indexee_id = ? AND subsource_url = ?", self.id, subsource_url])
489:         end

Indexes the content described by what_to_index and adds it to the database.

Use with clear_old = false to add new data to an existing index. There is no need to re-index an entire huge object (which would take a while) just because a paragraph of text is appended to the end of it; just call index with the new data in what_to_index.

Use with clear_old = true to purge all references to this object from the indices, and then add the contents of what_to_index to the database.

***** CAUTION: Use clear_old = true only when the site is running in single-user mode, to avoid race conditions.

      TODO: look into row locking

what_to_index is an array of strings and/or procs. A top-level proc is called with the object itself and should return an array of strings and/or zero-argument procs, each of which returns a string. The code as it stands can evaluate 2 levels of nested procs. Someday we’ll make it recursive and Lispy so you can have infinitely nested procs. That’d be Cool (tm).

We sometimes use procs instead of actual text because the actual text can be huge when aggregated. With procs, only the text of one segment has to be pulled into memory at any given time.
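
The two-level expansion described above can be sketched as follows. This is an illustration, not the library's code: each_segment, chapters and loaded are hypothetical names, and the owner argument stands in for the self that index() passes to top-level procs.

```ruby
# Expands what_to_index into individual strings, one at a time.
# owner is handed to top-level procs, mirroring item.call(self) in index().
def each_segment(what_to_index, owner)
  what_to_index.each do |item|
    # A string is its own one-element queue; a proc returns the queue.
    queue = item.is_a?(String) ? [item] : item.call(owner)
    queue.each do |x|
      # Second level: call the inner proc only when its text is needed.
      yield(x.is_a?(String) ? x : x.call)
    end
  end
end

loaded = []   # records the order in which lazy chapters are materialized
chapters = { 1 => "chapter one text", 2 => "chapter two text" }

what_to_index = [
  "a plain title string",
  lambda { |_owner|
    chapters.keys.map { |n|
      lambda { loaded << n; chapters[n] }
    }
  }
]

segments = []
each_segment(what_to_index, :some_object) { |text| segments << text }
```

Each inner proc runs only when its turn comes, so a caller that fetches chapter text from storage inside the proc holds one chapter at a time.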

Downcases and stems the terms if do_stemming = true (the default scenario).
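
A rough sketch of the word-scanning step, using the same character classes as the source. The constant values below are assumptions for illustration (the real MINIMUM_WORD_LENGTH and STOPWORDS are defined elsewhere in the library), tokenize is a hypothetical helper, and stemming is left out because the stem method comes from an extension not shown on this page.

```ruby
require 'strscan'

# Assumed values, for illustration only.
MIN_WORD_LENGTH_SKETCH = 3
STOPWORDS_SKETCH = %w[the and for]

# Splits text into index terms: runs of letters, digits and apostrophes,
# minus short words and stopwords, optionally downcased.
def tokenize(text, downcase = true)
  words = []
  scanner = StringScanner.new(text)
  until scanner.eos?
    # Find the next word, if the scanner is positioned on one.
    word = scanner.scan(/[0-9a-zA-Z']+/)
    unless word.nil? || word.size < MIN_WORD_LENGTH_SKETCH || STOPWORDS_SKETCH.include?(word)
      words << (downcase ? word.downcase : word)
    end
    # Skip separators up to the start of the next word.
    scanner.skip(/[^0-9a-zA-Z']+/)
  end
  words
end

tokens = tokenize("Indexing the cat's pajamas, and a 42nd DOG!")
# tokens is ["indexing", "cat's", "pajamas", "42nd", "dog"]
```

As in the source, the stopword check runs before downcasing, so with a lowercase stopword list a capitalized stopword would not be filtered out.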

  • See doc/Indexer-Readme.txt for an explanation of subsource_url.

[Source]

     # File lib/acts_as_soda_search.rb, line 288
288:         def index(what_to_index, subsource_url = nil, clear_old = false, do_stemming = true)
289:           info("#{clear_old ? 're' : ''}indexing #{self.id}...")
290:           warn("** clear_old is set to TRUE. Don't do this in multiuser mode, to avoid race conditions.") if clear_old
291: 
292:           start_time = Time.new.to_f
293: 
294:           position = 0
295: 
296:           ActiveRecord::Base.transaction {
297:             # first, delete any old references to this object in the index table, if we are told to
298:             self.class.soda_indices_class.delete_all(["indexee_id = ?", id]) if clear_old
299:           
300:             #
301:             # make arrays to hold the terms and ids we find if doing a fast (bulk) add.
302:             # (See comments above on SLOW_ADD. These are only actually used when SLOW_ADD = false.)
303:             # Hash is no good because order is important.
304:             #
305:             # term_ids entries are normally string UUIDs. They may
306:             # also be :nonexistent where there is no entry currently
307:             # in the database for that term, or they may be :unknown
308:             # to indicate that we haven't yet checked to see whether
309:             # that term exists.
310:             #
311:             # nonexistent_term_ids and unknown_term_ids are Arrays
312:             # of int indices into term_ids that point to
313:             # :nonexistent and :unknown entries. These are used for
314:             # performance reasons, to avoid having to re-scan the
315:             # entire (long) term_ids array to find them.
316:             #
317:             # These get processed once per what_to_index item right now, to not eat up huge
318:             # amounts of memory and still achieve a reasonable throughput by minimizing DB hits.
319:             #
320:             terms = Array.new
321:             term_ids = Array.new
322:             nonexistent_term_ids = Array.new
323: 
324:             what_to_index.each{ |item|
325:               #debug "item is #{item.inspect}, self is #{self.inspect}"
326: 
327: 
328:               #
329:               # queue is the list of stuff to process within the current what_to_index item.
330:               # Each what_to_index item may have any number of strings or procs.
331:               #
332:               # if it's a string, use that.
333:               # if it's a lambda, call it and use that.
334:               #
335:               queue = if item.is_a?(String)
336:                         [ item ]
337:                       else
338:                         item.call(self)
339:                       end
340:               
341:               #
342:               # Now we have an array of strings and procs. Scan and
343:               # add to index, calling proc if necessary to get a
344:               # string.
345:               #
346: 
347:               #debug "queue is #{queue.inspect}"
348: 
349:               queue.each { |x|
350:                 scanner = StringScanner.new(x.is_a?(String) ? x : x.call())
351:                 while !scanner.eos?
352:                   
353:                   # find next word
354:                   word = scanner.scan(/[0-9a-zA-Z']+/)
355:                   
356:                   unless word.nil? ||  word.size < MINIMUM_WORD_LENGTH || STOPWORDS.include?(word)
357:                     if do_stemming
358:                       word.downcase!
359:                       word = word.stem
360:                     end
361: 
362:                     ### add word to the index.
363:                     
364:                     if SLOW_ADD # (see comments above for SLOW_ADD)
365:                       #
366:                       # slow, inefficient way. Requires 2 DB hits for each item, 1 to see if it's already a term, and 1 to
367:                       # add the index.
368:                       # This does just under 200 terms per second on my MacBook with autocommit off, so it takes a while to index large datasets.
369:                       #
370: 
371:                       # see if the term is in the db already
372:                       term = (self.class.soda_terms_class.find_by_term(word) || self.class.soda_terms_class.create(:term => word))
373:                       term_id = term.id
374:                       
375:                       # add to index
376:                       self.class.soda_indices_class.create(:term_id => term_id,
377:                                                            :position => position,
378:                                                            :user_id => self.user_id,
379:                                                            :indexee_id => self.id,
380:                                                            :subsource_url => subsource_url)
381:                     else 
382:                       terms.push(word)
383:                       term_ids.push(:unknown)
384:                     end # if SLOW_ADD
385:                     
386:                     position += 1
387: 
388:                   end # unless
389:                   
390:                   # skip to next word
391:                   scanner.skip(/[^0-9a-zA-Z']+/)
392:                 end # while
393:                 
394:               } # queue.each
395:               
396:             } # what_to_index.each
397: 
398:             # process the accumulated bulk-add list in fast mode.
399:             if not SLOW_ADD
400:               unique_terms = terms.uniq
401:               debug("   Fast-adding  #{terms.size} terms (#{unique_terms.size} unique.)")
402:               #debug("items: #{terms.inspect}")
403: 
404:               # 1) Add any terms that do not already exist in the database. 
405:               #    We do this by COPYing them into a temporary table, then INSERTing
406:               #    into the real table where the record does not exist there.
407:               
408:               self.class.soda_indices_class.connection.execute "CREATE TEMPORARY TABLE soda_terms_loader (LIKE #{self.class.soda_terms_class.table_name} INCLUDING DEFAULTS) ON COMMIT DROP;"
409:               # execute() doesn't like COPYs... so we have to go raw.
410:               self.class.soda_indices_class.connection.raw_connection.exec("COPY soda_terms_loader (term) FROM STDIN;")
411:               unique_terms.each{|term|
412:                 self.class.soda_indices_class.connection.raw_connection.putline(term + "\n")
413:               }
414:               self.class.soda_indices_class.connection.raw_connection.putline("\\.\n")
415:               self.class.soda_indices_class.connection.raw_connection.endcopy
416: 
417:               #debug("  * found " + self.class.soda_terms_class.connection.execute('SELECT COUNT(*) FROM soda_terms_loader').result.inspect +
418:               #      " items in soda_terms_loader temporary table")
419:               
420:               self.class.soda_indices_class.connection.execute "INSERT INTO #{self.class.soda_terms_class.table_name} (SELECT * FROM soda_terms_loader AS newdata WHERE NOT EXISTS (SELECT * FROM #{self.class.soda_terms_class.table_name} AS real WHERE real.term = newdata.term)); DROP TABLE soda_terms_loader;"
421:               # All terms are now in the database.
422:               # We need to get their IDs.
423:               # Select the term and ID into a hash as key/val.
424:               #debug(" -- Getting term ids ...")
425:               term_object_cache = self.class.soda_terms_class.find(:all, :conditions => ["term IN (?)", unique_terms ])
426:               #debug("   Got #{term_object_cache.size} term objects back from database.")
427: 
428:               terms_hash = Hash.new
429:               term_object_cache.each {|item|
430:                 terms_hash[item.term] = item.id
431:               }
432: 
433:               
434:               # Now add the indices for each term with a COPY.
435:               #debug(" Starting index COPY of #{terms.size} terms..")
436:               self.class.soda_indices_class.connection.raw_connection.exec("COPY #{self.class.soda_indices_class.table_name} (term_id, position, user_id, indexee_id, subsource_url) FROM STDIN;")
437:               endStub = "#{self.user_id}\t#{self.id}\t#{subsource_url.nil? ? "\\N" : subsource_url}\n" # \N is COPY's text-format marker for NULL
438: 
439:               terms.each_with_index{|currentTerm, pos|
440:                 #debug(pos) if pos % 1000 == 0
441:                 self.class.soda_indices_class.connection.raw_connection.putline(terms_hash[currentTerm].to_s +
442:                                                                                 "\t#{pos.to_s}\t" + endStub)
443:               }
444:               self.class.soda_indices_class.connection.raw_connection.putline("\\.\n")
445:               self.class.soda_indices_class.connection.raw_connection.endcopy
446:               debug(" * Done fast-adding indices. Found " + self.indexes_count().inspect + " terms for #{self.id}")
447:             end # if not SLOW_ADD
448: 
449: 
450: 
451:           } # transaction
452: 
453:           ## Clean up the database and delete any unreferenced terms, if we have deleted any indices.
454:           self.class.purge_unused_terms if clear_old
455: 
456:           end_time = Time.new.to_f
457:           info(if end_time == start_time
458:                  "#{clear_old ? 're' : ''}indexed #{position} terms in 0 seconds - infinity terms per second."
459:                else
460:                  "#{clear_old ? 're' : ''}indexed #{position} terms in #{end_time - start_time} seconds - #{position / (end_time - start_time)} terms per second."
461:                end)
462: 
463:           
464:           # return how many words we added
465:           position
466:         end
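
The fast path above streams tab-separated rows into COPY ... FROM STDIN. A minimal sketch of how such rows could be built, assuming PostgreSQL's default text format, in which NULL is written as \N and backslash, tab, newline and carriage return must be escaped. copy_field and copy_rows are hypothetical helpers, not part of the library, and the ids are made up.

```ruby
# Formats one value for COPY's text format: nil becomes \N, and the
# characters that would break the row layout are backslash-escaped.
def copy_field(value)
  return "\\N" if value.nil?
  value.to_s.gsub(/[\\\t\n\r]/) { |c|
    case c
    when "\\" then "\\\\"
    when "\t" then "\\t"
    when "\n" then "\\n"
    when "\r" then "\\r"
    end
  }
end

# Builds one COPY line per term occurrence, in order of appearance.
# terms_hash maps term text to term id, as in the source.
def copy_rows(terms, terms_hash, user_id, indexee_id, subsource_url)
  tail = [user_id, indexee_id, subsource_url].map { |v| copy_field(v) }.join("\t")
  terms.each_with_index.map { |term, pos|
    "#{copy_field(terms_hash[term])}\t#{pos}\t#{tail}\n"
  }
end

terms = %w[soda search soda]
terms_hash = { "soda" => "id-1", "search" => "id-2" }
rows = copy_rows(terms, terms_hash, 7, "doc-9", nil)
```

Keeping the full terms list (not just the unique terms) is what preserves each word's position in the index.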

Returns the number of terms that have been added to the index for this object.

[Source]

     # File lib/acts_as_soda_search.rb, line 480
480:         def indexes_count
481:           self.class.soda_terms_class.connection.execute("select count(*) from #{self.class.soda_indices_class.table_name} " +
482:                                                          "WHERE indexee_id = '#{self.id}'").result.first.first.to_i
483:         end

Protected Instance methods

[Source]

     # File lib/acts_as_soda_search.rb, line 498
498:       def debug(msg)
499:         logger.debug("* acts_as_soda_search (for #{self.class.name}): #{msg}")
500:       end

[Source]

     # File lib/acts_as_soda_search.rb, line 501
501:       def error(msg)
502:         logger.error("* ERROR: acts_as_soda_search (for #{self.class.name}): #{msg}")
503:       end

[Source]

     # File lib/acts_as_soda_search.rb, line 492
492:       def info(msg)
493:         logger.info("* acts_as_soda_search (for #{self.class.name}): #{msg}")
494:       end

[Source]

     # File lib/acts_as_soda_search.rb, line 495
495:       def warn(msg)
496:         logger.warn("* WARNING: acts_as_soda_search (for #{self.class.name}): #{msg}")
497:       end
